What can machine learning teach us about habit formation? Evidence from exercise and hygiene

RESEARCH ARTICLE

PSYCHOLOGICAL AND COGNITIVE SCIENCES


Anastasia Buyalskaya (a,1), Hung Ho (b,1), Katherine L. Milkman, Xiaomin Li, Angela L. Duckworth (c,e), and Colin Camerer (d,f,2)

Edited by Elke Weber, Princeton University, Princeton, NJ; received September 24, 2022; accepted January 27, 2023

We apply a machine learning technique to characterize habit formation in two large

panel data sets with objective measures of 1) gym attendance (over 12 million

observations) and 2) hospital handwashing (over 40 million observations). Our

Predicting Context Sensitivity (PCS) approach identifies context variables that best

predict behavior for each individual. This approach also creates a time series of overall

predictability for each individual. These time series predictability values are used to trace

a habit formation curve for each individual, operationalizing the time of habit formation

as the asymptotic limit of when behavior becomes highly predictable. Contrary to the

popular belief in a “magic number” of days to develop a habit, we find that it typically

takes months to form the habit of going to the gym but weeks to develop the habit

of handwashing in the hospital. Furthermore, we find that gymgoers who are more

predictable are less responsive to an intervention designed to promote more gym

attendance, consistent with past experiments showing that habit formation generates

insensitivity to reward devaluation.

habit | machine learning | context sensitivity | predictability | nudge

Much of human behavior is habitual. Unlike choices that are consciously deliberated,

habits are the result of stimulus–response associations (1). Habits constitute a behavior in

which responses are cued by context features (e.g., sensory stimuli, locations, preceding

actions) that were reliably present when the habit was previously executed (2).

To date, the best evidence of automatic and context-sensitive behavior linked to habits

comes from laboratory experiments on animals and humans (SI Appendix, section 1)

and a handful of field experiments (3). But there is surprisingly little research on how

human habits naturally develop outside of the laboratory over the course of weeks or

months in everyday life (4). Three prior observational studies examined habit formation

over substantial periods of time using daily self-reports. In a seminal study (5), 96

undergraduates ate, drank, and exercised daily in the same context for 12 wk and self-reported habit strength every day. This study and two similar ones (6, 7) suggest that “habits typically develop asymptotically and idiosyncratically, potentially differing in rate across people, cues and behaviors” (8, pg. 220).

To advance our understanding of how habits develop in natural settings, we develop a

machine learning methodology that is especially well suited for analyzing panel data with

repeated observations of behavior. Our Predicting Context Sensitivity (PCS) approach

identifies the context variables that best predict behavior for each individual. Specifically,

PCS uses a least absolute shrinkage and selection operator (LASSO) regression, a

hypothesis-free form of statistical analysis which does not prespecify what variables are

likely to be predictive of an individual’s behavior. The LASSO technique generates a

person-specific measure of overall behavioral predictability based on the variables that are

predictive of that person’s behavior. Predictability ranges from .5 (completely random)

to 1 (completely predictable), acknowledging that habit formation is a matter of degree

and is not dichotomous. The degree of habit formation can vary across people and vary

within person across time. These continuous measures of predictability we generate are

then used to study individual differences in predictability, and speed of habit formation,

which is defined as the amount of time it takes for a given person’s behavior to reach its

steady state of predictability.

Because the field data we analyze do not include direct measures of automaticity

of behavior, we operationalize habit as context-sensitive predictability. It is, of course,

possible that other mechanisms that are not automatic account for the context-sensitive predictability we observe (9). However, we note that our approach is consistent with a large literature operationalizing habit as we do, e.g., ref. 10.

Significance

Psychology and neuroscience define habit as a type of decision process in which the same behavior is executed in the same context, regardless of the outcome. We introduce a machine learning method that discovers which of many context variables are associated with behavior and infers how quickly habits form. This can include but is not limited to past choice. The method can be used to target interventions based on the specific context variables that affect behavior, that individual’s predictability and how strongly their habit has been formed. This potential is illustrated by our finding that a well-powered, random-assignment intervention to increase gym attendance had a larger effect on gymgoers who were less predictably context-sensitive.

Author contributions: A.B., H.H., and C.C. designed research; A.B., H.H., K.L.M., A.L.D., and C.C. performed research; A.B., H.H., X.L., and C.C. analyzed data; and A.B., H.H., K.L.M., A.L.D., and C.C. wrote the paper.

The authors declare no competing interest.

This article is a PNAS Direct Submission.

© 2023 the Author(s). Published by PNAS. This article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).

A.B. and H.H. contributed equally to this work.

To whom correspondence may be addressed. Email: camerer@caltech.edu.

This article contains supporting information online at http://www.pnas.org/lookup/suppl/doi:10.1073/pnas.2216115120/-/DCSupplemental.

Published April 17, 2023.

PNAS 2023 Vol. 120 No. 17 e2216115120

https://doi.org/10.1073/pnas.2216115120

PCS works best for large samples in which there are repeated

observations of behavior for each person. For each of the 30,110

people in study 1, there are a median of 1,525 daily observations

(over 4 y of gym attendance). For each of the 3,124 people

in Study 2, there are a median of 3,000 observations (98

hospital shifts). In both samples, we analyze objective measures

of behavior, rather than self-reports of behavior gathered after the fact, thereby avoiding possible errors of memory and metacognition (11).

PCS yields three important discoveries in this investigation:

First, context variables are more predictive of behavior for some

individuals than others. Second, contrary to common wisdom,

there is no “magic number” for how long it takes to form a habit.

Instead, the speed of habit formation appears to vary significantly

between behavioral domains: Gym habits take months to form and

handwashing habits take weeks to form. Third, consistent with

prior research on nonhuman animals, more habitual gymgoers are

reward-insensitive, responding less to a well-designed behavioral

intervention (12).

1. Study 1: Gym Attendance

A. Data.

We partnered with 24 h Fitness, a large North American

gym chain, to obtain check-in and background data on a total

of 60,277 regular gym users (IRB approval is described in SI Appendix, section 6). These users were spread across 560 gyms and

consented to share their information with researchers when they

signed up to be in a fitness program. The data we analyze track

gym attendance for each person from the first day they attended

this gym chain, which is ideal for studying the development of

habits from inception. Our gym attendance dataset spans 14 y,

ranging from 2006 to 2019 and it includes over 12 million data

points, each corresponding to one gym check-in. Each data point

has a timestamp and location of the gym visited, as well as other

details about the gym (such as its amenities). We infer several

other attributes not in the raw data files, such as the day of the

week, and individual-level variables such as the time since gym

membership creation. The total set of unique candidate context

variables used to estimate predictable gym attendance includes

a gym visit’s month of the year, day of the week, time lag (the

number of days which have elapsed since the gym goer’s last visit),

attendance rate in the past week, the number of consecutive days

of attendance (streak), and the number of consecutive days of

attendance for the same day of the week (day-of-week streak). A

full list of the variables analyzed and a longer description of the

data can be found in SI Appendix, section 2.
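To make the candidate context variables concrete, the following minimal sketch builds four of them (time lag, past-week attendance rate, streak, and day-of-week streak) from a daily 0/1 attendance history. The function name `context_features`, the feature keys, and the exact feature definitions are assumptions made for illustration; the definitions actually used are given in SI Appendix, section 2.

```python
from typing import Dict, List, Optional


def context_features(attended: List[int]) -> List[Dict[str, Optional[float]]]:
    """Build one feature row per day from a 0/1 daily attendance history.

    Each row for day d uses only days before d, so the outcome being
    predicted never leaks into its own predictors.
    """
    rows: List[Dict[str, Optional[float]]] = []
    for d in range(len(attended)):
        past = attended[:d]

        # Time lag: days since the most recent visit (None before any visit).
        time_lag: Optional[float] = None
        for back in range(1, d + 1):
            if attended[d - back] == 1:
                time_lag = float(back)
                break

        # Attendance rate over the previous 7 days (fewer early in the series).
        window = past[-7:]
        rate_7d = sum(window) / len(window) if window else 0.0

        # Streak: consecutive attended days ending yesterday.
        streak = 0
        for x in reversed(past):
            if x != 1:
                break
            streak += 1

        # Day-of-week streak: consecutive attended days 7, 14, 21, ... days ago.
        dow_streak = 0
        k = d - 7
        while k >= 0 and attended[k] == 1:
            dow_streak += 1
            k -= 7

        rows.append({
            "time_lag": time_lag,
            "rate_7d": rate_7d,
            "streak": float(streak),
            "dow_streak": float(dow_streak),
        })
    return rows


# Hypothetical 10-day history (1 = checked in), not data from the study.
history = [1, 0, 1, 1, 0, 0, 0, 1, 0, 1]
features = context_features(history)
print(features[9])
```

Restricting every feature to information available before day d is a design choice of this sketch; it keeps the prediction task honest when the rows are later fed to a classifier.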

Our analytic sample is a subset of gym goers based on two

criteria. First, members without a valid gym contract (1,083) are

removed. Second, we exclude participants with less than a year

of data (removing 28,878 members) and too little attendance

for the LASSO model to classify well due to sample imbalance

(removing 206 members). This leaves N = 30,110 gym goers.

We analyze the behavior of these gym goers from the first day

that they joined the gym.
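The exclusion counts reported above reconcile with the analytic sample size. The short check below only restates numbers given in the text; the dictionary labels are paraphrases rather than the authors' variable names.

```python
# Starting sample and exclusions as reported in the text.
initial_members = 60_277
removed = {
    "no valid gym contract": 1_083,
    "less than a year of data": 28_878,
    "too little attendance for LASSO": 206,
}

# Members remaining after all exclusions.
analytic_n = initial_members - sum(removed.values())
print(analytic_n)
```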

Table 1 provides summary statistics for the final analytic

dataset. Gym members in our sample are 62% female and have a

median age of 34 y. The average individual in this dataset goes to

the gym every 4 to 6 d. The median number of days an individual

is observed (or “has an opportunity to go to the gym”) is 1,525 d,

or just over 4 y.

Table 1. Summary statistics from gym attendance analytic sample

                       Mean     SD       Q1      Median   Q3
Age                    36.76    12.35    27.00   34.00    45.00
Female                 0.62     −        −       −        −
Daily attendance       0.19     0.16     0.07    0.14     0.27
Days observed          2,020    1,453    658     1,525    3,655
Days between visits    15.77    29.74    3.69    6.89     15.22

Note: SD = standard deviation, Q1 and Q3 = first and third quartiles.

B. Analytic Approach.

For each individual in our dataset, we

first train a logit LASSO model to predict the likelihood of

gym attendance (a binary outcome variable) day-by-day. LASSO

adds a penalty to the minimized objective function (sum of

squared residuals) based on coefficient absolute magnitude. This

has the effect of “shrinking” many regression coefficients to zero

in order to guard against overfitting when so many variables

are included. To illustrate a typical difference between regular

regression and LASSO, suppose a researcher believed month-of-

the-year might be predictive of gym attendance but didn’t have a

strong hypothesis regarding which months were most predictive

for which individual. A standard OLS regression would derive

best-fitting nonzero coefficients for all 12 month variables. But it

is likely that most of those coefficients are not precisely estimated,

and the largest magnitudes would probably be exaggerated.

Because of the LASSO penalty, many of these coefficients that

are close to zero would shrink to exactly zero and the largest

magnitudes would be compressed toward zero (both “zeroing

out” variables and compressing them toward zero reduces the

LASSO penalty). The LASSO penalty is well known in many

applications to reduce false positives—which are expected when

there are many coefficients—so that false positives do not degrade

the fit from training to the hold-out test samples. This results in

a more compact model with only the most relevant predictor

variables having nonzero coefficients. For example, in our gym

data analysis, about half of the month variables have zero

LASSO coefficients (SI Appendix, Table S1).
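The "zeroing out" and "compressing toward zero" described above can be illustrated with soft-thresholding. In the special case of squared-error loss with orthonormal predictors, the LASSO estimate of each coefficient is the soft-thresholded OLS estimate. The logit LASSO used here has no such closed form, so this is only a sketch of the qualitative effect. The coefficient values and the penalty are hypothetical.

```python
from typing import Dict


def soft_threshold(beta_ols: float, lam: float) -> float:
    """LASSO solution for one coefficient under an orthonormal design.

    Minimizes 0.5 * (b - beta_ols)**2 + lam * abs(b) over b.
    """
    if beta_ols > lam:
        return beta_ols - lam
    if beta_ols < -lam:
        return beta_ols + lam
    return 0.0


# Hypothetical unpenalized month coefficients (not taken from the paper).
ols_coefs: Dict[str, float] = {"Jan": 0.90, "Feb": 0.08, "Mar": -0.05, "Apr": -0.60}
penalty = 0.10

# Small coefficients become exactly zero; large ones shrink by the penalty.
lasso_coefs = {month: soft_threshold(b, penalty) for month, b in ols_coefs.items()}
selected = [month for month, b in lasso_coefs.items() if b != 0.0]
print(lasso_coefs)
print(selected)
```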

It is crucial to note that LASSO shrinkage due to the magnitude

penalty implies that the coefficients are not the best estimates of

the true coefficient value (because they are deliberately shrunk

toward zero). This is a well-known property of LASSO. It is

the statistical price that is paid to guard against overfitting when

exploring large variable sets (in technical terms, it reflects a “bias-

variance” tradeoff). A byproduct of this property is that there is

no simple conventional procedure for computing standard errors

for LASSO coefficients (see ref. 13 for ideas). We therefore do not say

anything about the statistical significance of nonzero effects or

differences in coefficients between people.

There are two major consequences of the LASSO coefficient

bias toward zero. One is called “model selection consistency”—

are the variables selected by LASSO those with true nonzero

coefficients? The second is called “stability”: When two variables

are highly correlated, LASSO will often set one predictive coefficient to zero and let the other variable do the predictive work of

the two variables together. Exploratory analysis seems to indicate

that these two issues are not creating a problem for our analysis, particularly

since we are not trying to infer true coefficient values—we are just

trying to find the most predictive coefficients. More precisely, we

do not have standard errors around coefficients and therefore

cannot say for how many individuals any given predictor is

significant.

SI Appendix, section 3.2 has a thorough discussion


of these issues, and the procedures we carried out to examine this

problem. While we do not claim to fully resolve the problem of

model selection in our setting, we hope these procedures serve as

a reassurance about the directional effects of the most important

predictors.

Following machine learning standards, we use approximate six-

fold cross-validation, training the model on 85% (about 5/6ths)

of the full time-series data for each individual (SI Appendix, section 3.1). We use the remaining 15% of the data (1/6th)

as our “test” set. This shows how good the LASSO model is

at predicting an individual’s attendance on days the model did

not observe. This gives us an out-of-sample test-set predictability

measure, called the area under the curve (AUC), for each gym

member. This serves as an objective full-sample measure of

context-sensitive predictability for each gym member (i.e., how

habitual an individual’s behavior might be) and allows us to avoid

possible errors induced by self-report measures, which rely on an

individual’s memories of the context cues present when they last

executed a habitual behavior. This machine learning technique

allows us to treat overall individual-level habit as a number from

.5 to 1.
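As a sketch of how the predictability measure can be read, the AUC equals the probability that a randomly chosen attended day receives a higher predicted score than a randomly chosen non-attended day, with ties counting one half. The pairwise implementation below is a standard way to compute it, not the authors' code; the labels and scores are hypothetical held-out predictions.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparisons.

    Equals the probability that a randomly chosen positive day receives a
    higher score than a randomly chosen negative day; ties count one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical held-out days: 1 = attended, with model-predicted probabilities.
y_test = [1, 0, 1, 0, 0, 1]
p_test = [0.9, 0.2, 0.4, 0.5, 0.1, 0.7]
print(auc(y_test, p_test))
```

A model that scores every day identically yields 0.5, and a model that ranks all attended days above all non-attended days yields 1.0, which matches the range described above. The measure is undefined when a test set contains only one class, which is one reason very sparse attendance records are hard to evaluate.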

Next, we determine when, if ever, behavior becomes more

habitual over time. Following (5), for each individual $i$, we attempt to identify $\mathrm{AUC}_i(t)$, an exponential function of form $a - b e^{-ct}$ describing daily-level habit strength as a function of time. Likewise, following (5), we define the inferred time to habit formation as the time it takes for $\mathrm{AUC}_i(t)$ to reach 95% of its

asymptote. In other words, we define “time to habit formation” as

the amount of time it takes for AUC, behavioral predictability,

to reach close to its steady-state value (this is seen visually as

AUC increasing and eventually plateauing after a certain point in

time). This is consistent with previous studies of habit formation,

which find that initial behavioral repetitions cause increases in

responding, with each marginal repetition delivering lower gains

in until the behavior reaches its limit (14).
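The 95%-of-asymptote definition has a closed form once a curve has been fitted. The sketch below assumes a saturating exponential of form a - b * exp(-c * t), with asymptote a, as used in the habit-formation literature following (5). The functional form, the function names, and the parameter values are assumptions for illustration, not estimates from this study. Setting a - b * exp(-c * t) = 0.95 * a and solving gives t = ln(b / (0.05 * a)) / c.

```python
import math


def habit_curve(t: float, a: float, b: float, c: float) -> float:
    """Predictability at time t under the assumed form a - b * exp(-c * t)."""
    return a - b * math.exp(-c * t)


def time_to_fraction_of_asymptote(a: float, b: float, c: float,
                                  fraction: float = 0.95) -> float:
    """Solve a - b * exp(-c * t) = fraction * a for t.

    Requires a > 0, c > 0 and 0 < fraction < 1. Returns 0.0 when the curve
    already starts at or above the target level.
    """
    if a <= 0 or c <= 0 or not 0 < fraction < 1:
        raise ValueError("need a > 0, c > 0 and 0 < fraction < 1")
    gap = (1.0 - fraction) * a  # allowed distance below the asymptote
    if b <= gap:
        return 0.0
    return math.log(b / gap) / c


# Hypothetical fitted parameters, not estimates from the paper.
a, b, c = 0.80, 0.30, 0.02
t_star = time_to_fraction_of_asymptote(a, b, c)
print(round(t_star, 1), round(habit_curve(t_star, a, b, c), 3))
```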

To infer the increase in habit strength over time, we use a

different procedure than deriving AUC from full samples for each

individual. In this procedure, we estimate a series of time-specific

AUCs denoted $\mathrm{AUC}(t)$ at period $t$. This procedure did not work

reliably at the individual level, so we first split the gym goers into

ten deciles. Deciles were based on the number of observations

(sample length) for each person. The shortest-length samples are

grouped into decile 1 and longest-length samples are grouped into

decile 10. The purpose of this split is to ensure that individuals

in a decile group are comparable in terms of their number of data

points because LASSO results (as with all statistical techniques)

can be sensitive to sample features such as the amount of data per

individual.
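One simple way to form such groups is to rank people by sample length and cut the ranking into ten equal parts. The sketch below is an assumption about how the split could be implemented; the function name `assign_deciles`, the tie-breaking rule, and the member data are hypothetical.

```python
from typing import Dict


def assign_deciles(n_obs: Dict[str, int]) -> Dict[str, int]:
    """Map each person to a decile (1-10) by their number of observations.

    People are ranked from shortest to longest sample; ties are broken by
    person id so the assignment is deterministic.
    """
    ranked = sorted(n_obs, key=lambda pid: (n_obs[pid], pid))
    n = len(ranked)
    return {pid: (rank * 10) // n + 1 for rank, pid in enumerate(ranked)}


# Hypothetical sample lengths in days for 20 members.
lengths = {f"member_{i:02d}": 100 * (i + 1) for i in range(20)}
deciles = assign_deciles(lengths)
print(deciles["member_00"], deciles["member_19"])
```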

For each decile of gym goers, we compute the AUC values

obtained when training a LASSO model using data from periods $[t-2, t-1]$ and testing (validating) on data at period $t$, where each period $t$ consists of 2 wk of data ($t = 0$ denotes the first 2 wk). The procedure starts with $t = 2$, where we train a LASSO model using everyone’s first 4 wk of data (corresponding to $t = 0$ and $t = 1$), and then it computes the AUC when evaluating that model in ‘test’ holdout data from 5 to 6 wk. It then proceeds iteratively for increasing values of $t$ until it reaches the end of the

decile’s observed time period. This procedure creates a sequence

of sliding windows of equal size (4 wk of training data and two

weeks of test data).
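The windowing scheme can be sketched as follows, with each test period paired with the two periods immediately before it. The helper names `sliding_windows` and `period_days` are hypothetical, and day offsets are counted from zero at the first observed day.

```python
from typing import Iterator, List, Tuple

PERIOD_DAYS = 14  # each period is 2 wk of data


def sliding_windows(n_periods: int) -> Iterator[Tuple[List[int], int]]:
    """Yield (training periods, test period) pairs for t = 2, 3, ...

    The model for period t is trained on periods t-2 and t-1 and then
    evaluated on period t, so every window has the same size.
    """
    for t in range(2, n_periods):
        yield [t - 2, t - 1], t


def period_days(t: int) -> range:
    """Day offsets (0-based) covered by period t."""
    return range(t * PERIOD_DAYS, (t + 1) * PERIOD_DAYS)


windows = list(sliding_windows(5))
print(windows)
print(period_days(2)[0], period_days(2)[-1])
```

With zero-based offsets, the first test period covers offsets 28 through 41, that is, wk 5 to 6 as in the text.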

C. Results.

In the LASSO training data for N = 30,110 gym members, the mean individual-level AUC is 0.806 (median is 0.811, interquartile range 0.750 to 0.868), where 0.5 is random and 1.0 is perfectly predictable. This indicates that the LASSO models tend to do a good job fitting the gym goers’ attendance behavior. On the test datasets, these measures are slightly lower (as is expected), with a mean individual-level AUC of 0.768 (median is 0.778, and interquartile range 0.702 to 0.845).

Fig. 1. Attendance patterns during March 2018 of two individuals with comparable gym attendance rates but different AUC values. Red squares indicate attendance.

Fig. 1 shows two gym members’ monthly attendance calendars from March 2018. While the two members go to the gym about equally often in March, one is highly predictable (AUC = 0.946) and the other is not at all predictable (AUC = 0.546). This example illustrates that frequency and predictability are uncorrelated in our data (SI Appendix, section 3.3). Of course, including missing variables we do not have access to (e.g., someone’s work schedule) could increase the predictability of the less predictable person in Fig. 1, and it would also be likely

to reduce the differences in predictability across people.

As shown in Table 2, the most important predictor of gym

attendance across individuals is how much time has passed since

the previous gym visit (“time lag”). This predictor appears almost

universally important and goes in the same direction across most

individuals: for 76% of gym goers in our sample, the longer it has

been since they last visited the gym, the less likely they are to go on

a given day. Consistent with the examples in Fig. 1, day-of-week

streak is an important predictor, with 69% of the sample more

likely to go to the gym on the same days of the week that they had

previously attended. Among the days of the week, Monday and

Tuesday are the most important, positively predicting attendance

for 57% of both samples (SI Appendix, section 3.1), consistent

with other evidence of focal “fresh start” effects (15). Since we do

not have standard errors as mentioned previously, the summary

above only provides suggestive evidence about the directional

effects of these predictors.
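Directional summaries of this kind can be computed from the person-specific coefficients without any standard errors. The sketch below reports, for one predictor, the median coefficient and the shares of individuals with zero, positive, and negative coefficients. The function name, the output keys, and the coefficient values are hypothetical, and this is not necessarily how the "importance" column in Table 2 is defined.

```python
from statistics import median
from typing import Dict, List


def summarize_coefficients(coefs: List[float]) -> Dict[str, float]:
    """Summarize one predictor's LASSO coefficient across individuals."""
    n = len(coefs)
    return {
        "median": median(coefs),
        "share_zero": sum(1 for b in coefs if b == 0.0) / n,
        "share_positive": sum(1 for b in coefs if b > 0.0) / n,
        "share_negative": sum(1 for b in coefs if b < 0.0) / n,
    }


# Hypothetical per-person coefficients on "time lag" for five members.
time_lag_coefs = [-1.2, -0.4, 0.0, -0.1, 0.3]
summary = summarize_coefficients(time_lag_coefs)
print(summary)
```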

Fig. 2 provides two examples of results from fitting the

exponential curves to the lowest and highest decile of gym goers’

AUC sequences (i.e., those with the least and most data per

person). The median estimated time it takes to reach the 95%

asymptote across all gym goers well fit by the exponential model

is 122 to 226 d or about 4 to 7 mo (SI Appendix, Table S8).

Model fit is related neither to average frequency of gym attendance

nor to the age of a gym member.

D. Additional Analyses of Reward Sensitivity and Individual Differences.

Insensitivity to reward change is a gold standard

hallmark of strong habits in animal research. But such insensitivity has proven difficult to identify in humans, e.g., refs. 16

and 17. It has not been studied in nonexperimental large-scale

field data like ours. Because we are testing for insensitivity in

field data in different ways, we openly explore multiple types

of reward change, and how they might affect habit measured in

different ways. Specifically, we explore the effects of two methods of reward change. The first method is to hypothesize possible changes in reward, without having a direct measure (of the kind carefully controlled in lab experiments). For gym attendance, the hypothesized reward variables are unusually good or bad weather. The second method is to use randomized interventions designed to change subjective reward to promote more gym attendance (12).

Table 2. Variables which LASSO identified as being most predictive across gymgoers

Variable               Importance   Q1      Median   Q3      % zero   % positive   Predictive effect
Time lag               1.25         −1.40   −0.34    −0.02                         Negative
Monday                 0.36         0.00    0.11     0.50                          Positive
Tuesday                0.35         0.00    0.10     0.49                          Positive
Attendance last 7 d    0.34         0.09    0.29     0.47                          Positive
Day-of-week streak     0.23         0.00    0.11     0.30                          Positive

The effect of reward changes is evaluated with respect to low-

and high-habit behavior across two methods—one is within-

participant, and the second is between participants. The within-

participant approach uses the fact that our method for estimating

habit formation time divides each individual’s behavior into

“pre-” and “post-” habit formation periods. We test the joint

hypothesis that unusually good or bad weather changes the reward

value of going to the gym and has a weaker effect in the “post-habit period”. This test shows no reduction in sensitivity to unusual weather, the proxy for reward change, in post-habit periods (SI Appendix, section 4.1).

The between-participant test uses the AUC measure of overall

predictability to divide the sample into people who are low or high

in context-sensitive predictability, and it tests whether responses

to an experimental StepUp intervention done in partnership with

24 h Fitness are different for the two groups (SI Appendix, section 4.2). It is worth noting that a gymgoer who is highly predictable given our variables is

not necessarily a more predictable, or habitual, gymgoer. Instead,

it may be that two gym goers are equally predictable given a larger

set of variables than we have. Omitting variables means that one

gymgoer might be highly predictable and the other not very

predictable given our feature set. This test of reward devaluation

sensitivity allows us to test whether this data limitation is likely

to be driving our results. If predictability does correlate positively

with how habitual a gym goer is, we would expect to see more

predictable gym goers responding less to reward devaluation.

Our results show that this is indeed the case. Less predictable

gymgoers are significantly more likely ($P < 0.001$)

to increase

attendance in response to the StepUp intervention (SI Appendix,

section 4.2), which means that highly predictable gymgoers

respond less to the intervention. This finding provides evidence

against the view that our predictability measure is largely driven

by data limitations and is consistent with the experimental

literature associating insensitivity to reward changes with stronger

habit—measured here as predictability.
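The logic of the between-participant comparison can be sketched as a treatment-minus-control difference computed separately for less and more predictable members. This is a deliberately simplified illustration using a median split and raw mean differences; the actual analysis is described in SI Appendix, section 4.2. The `Member` record, its fields, and all values below are hypothetical.

```python
from statistics import mean, median
from typing import Dict, List, NamedTuple


class Member(NamedTuple):
    auc: float           # individual predictability
    treated: bool        # assigned to the intervention
    visit_change: float  # change in weekly visits during the intervention


def effect_by_predictability(members: List[Member]) -> Dict[str, float]:
    """Treatment-minus-control mean visit change within each AUC half.

    Assumes every half contains at least one treated and one control member.
    """
    cutoff = median([m.auc for m in members])
    groups = {
        "low": [m for m in members if m.auc <= cutoff],
        "high": [m for m in members if m.auc > cutoff],
    }
    effects: Dict[str, float] = {}
    for name, group in groups.items():
        treated = [m.visit_change for m in group if m.treated]
        control = [m.visit_change for m in group if not m.treated]
        effects[name] = mean(treated) - mean(control)
    return effects


# Hypothetical members, not data from the study.
sample = [
    Member(0.55, True, 0.6), Member(0.60, False, 0.1),
    Member(0.62, True, 0.4), Member(0.65, False, 0.1),
    Member(0.85, True, 0.2), Member(0.88, False, 0.1),
    Member(0.90, True, 0.2), Member(0.95, False, 0.1),
]
effects = effect_by_predictability(sample)
print(effects)
```

In this toy sample the estimated effect is larger in the low-predictability half, which is the pattern the text reports for the StepUp intervention.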

Finally, we take advantage of the size and diversity of the

gymgoer sample to explore whether demographic and SES

characteristics are correlated with predictability (an analysis that

was preregistered on AsPredicted.org, #59014). To do this, we link

our individual-level AUC predictability measures with Census

data using each individual’s home zip code and self-reported

age and gender. We remove 2,447 people for whom we did

not receive age or gender information from the gym, or whose

zip code did not have data available. We then regress the AUC

of the remaining sample (27,663 gym goers) on demographic

characteristics. Regression results, which can be found in SI Appendix, section 5.4, confirm that demographic attributes are

indeed predictors of AUC or “stronger habits”, although most of

the effects are small in magnitude. Specifically, older individuals

living in more rural (low population density) areas, where a large

fraction of married couples have children, have higher AUCs.

Younger individuals living in more urban (high population

density) areas have lower AUCs.

Fig. 2. Estimation of the speed of habit formation for gym attendance for the second lowest (Left) and second highest (Right) deciles by sample size. Each panel (2nd decile, 9th decile) plots AUC against days since first gym visit, with the fitted exponential curve and the time to habit formation marked. Note that the x-axes are different because, by construction, the deciles have different lengths of sampled time. The times to habit formation in these two deciles (shown by where the vertical blue line intersects the x-axis) are 121 (Left) and 187 (Right) days.


2. Study 2: Hand Washing among Hospital Workers

A. Data.

We obtain hand-hygiene data from Proventix, a company that

uses RFID technology to monitor whether individual healthcare

providers wash their hands at every opportunity to do so

throughout their hospital shifts, following (18) and

(19). The initial dataset tracks 5,246 healthcare workers across

30 hospitals. The dataset spans about a year, with over 40

million data points, each corresponding to whether an individual

caregiver did or did not wash their hands in the face of an

opportunity to do so. An opportunity is defined as a point in

time when a caregiver either entered or exited a patient’s room

with a Proventix sanitizer present; so each room visit presents

two sanitizing opportunities.

Each data point has a timestamp, as well as deidentified

hospital and room locations. We further infer several other

attributes about each opportunity to wash, such as the day of

the week when it arose and whether the healthcare worker in

question had complied with handwashing guidelines (i.e., washed

their hands) in this room previously. Our unique candidate

context cue variables include the time of day, time spent working,

previous room and shift compliance, and indicators for whether

the hospital worker is entering or exiting the room. A full list

of the variables used and a longer description of the data can be

found in SI Appendix, section 2.

Unlike the gymgoers, in this sample we do not observe

hospital workers from the start of their work; we only observe

their behavior after the RFID machines are installed. It is

therefore likely that some degree of habit formation had already

occurred, and we are not observing pure habit formation from

inception. Thus, we treat the introduction of Proventix’s RFID

surveillance technology as a shock that may have disrupted

behavior sufficiently to create somewhat new context-sensitive

habits (e.g., a habit of washing when monitored). This possibility

follows from the fact that hand sanitizing behavior increased by

over 50% after machines were installed (19). However, we do

acknowledge that handwashing may have habituated in many

caregivers before the time at which we first observe their behavior

in our sample.

We use two criteria to identify our final analytic sample.

First, we remove any hospital workers who had fewer than 30

shifts included in our data (removing 2,115 hospital workers).

Second, we remove seven workers without enough hand washing

compliance data for the LASSO model to fit variability. This

gives us an analytical sample of 3,124 hospital workers. Table 3

provides summary statistics about the workers in our analytic

sample. The mean compliance with handwashing is 0.45 per

opportunity. An average of 116 shifts are recorded per healthcare worker, and there are an average of 26 episodes (or visits to patient rooms, each with two opportunities to wash, one upon entry and one upon exit) per shift. We observe an average of 3,016 episodes per worker.

Table 3. Summary statistics from handwashing analytic sample

                       Mean     SD       Q1      Median   Q3
Compliance             0.45     0.23     0.26    0.43     0.63
Total shifts           116      153
Rooms visited
Episodes (mins)        5.66     2.61     3.94    5.13     6.78
Episodes/shift         25.72    16.49    13.95   24.2     34.54
Shift length (mins)    512      214      408     581      646
Mins btwn episodes     22.42    11.5     13.95   20.12    29.00
Hrs between shifts     91.95    57.91    60.06   72.61    102.91

Note: SD = standard deviation, Q1 and Q3 = first and third quartiles.

B. Analytic Approach.

We use the same machine learning approach as described in Study 1, training a LASSO model

to obtain person-specific sets of coefficients and predictability

measurements (AUCs) predicting when caregivers wash their

hands. As in study 1, we inferred time to habit formation using

the same approach of fitting an exponential curve to a sequence

of AUC values for each decile of caregivers. Each period consists

of two shifts rather than two weeks of gym attendance data.

C. Results.

The LASSO model does a satisfactory job fitting hospital caregivers’ hand-washing behavior. In the training dataset, the mean (median) individual-level AUC is 0.788 (0.783), and the interquartile range is 0.742 to 0.828. In the test dataset, these measures are only slightly lower, indicating that overfitting to the training data is not severe: the mean (median) individual-level AUC is 0.781 (0.776), and the interquartile range is 0.732 to 0.825.

While our LASSO models have slightly less predictive power

in this domain (compared to gym attendance), they still vastly

outperform random chance at predicting hospital caregivers’

hand-washing behavior.

As in study 1, the AUC measure—which can be used in

any behavioral domain—is produced for each individual, and

it once again serves as an objective measure of context-sensitive

predictability. Furthermore, PCS again narrows down the set of

context variables that are the most important predictors of hand

washing at the aggregate level (Table 4). The most important

and homogeneous predictors of washing were a hospital worker’s

handwashing compliance during their last shift (a positive

predictor for 100% of the hospital workers), room entry (which

is a negative predictor for 77% of hospital workers, indicating

most are more likely to wash their hands upon exiting, rather

than entering, a room), and the room compliance of others (a

positive predictor for 66% of hospital workers).

Time-of-day intervals were not selected by the LASSO model

as predictive of most people’s hand-washing behavior. However,

consistent with previous research (18), the amount of time since

the start of a caregiver’s shift is a negative predictor of hand

washing for 42% of caregivers. Again, since we do not have

standard errors around these coefficient estimates, the summary

above only provides suggestive evidence about the directional

effects of these predictors.

As in study 1, we fit an exponential model to each decile’s

AUC sequences from early to later intervals. This allows us to

analyze the development of predictability over time (Fig. 3). For

all deciles, the median time to habit formation was on the order

of a couple of weeks (habits typically formed after 9 to 10 shifts or

about 220 washing opportunities). This is much faster than the

habit formation measured in study 1 of gym attendance, where

the median time to habit formation was 4 to 7 mo.
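As a rough illustration of fitting an exponential curve to an AUC sequence, the Python sketch below assumes the functional form auc(t) = a - b * exp(-c * t) and defines the time to habit formation as the point where the curve has closed 95% of the gap to its asymptote. Both the functional form and the 95% threshold are assumptions made for this example; the text above does not specify them, and the names `fit_exponential` and `time_to_habit` are hypothetical.

```python
import math


def fit_exponential(times, aucs, rates=None):
    """Least-squares fit of auc(t) = a - b * exp(-c * t).

    For a fixed rate c the model is linear in (a, b), so a coarse grid
    search over c with closed-form simple regression suffices for a sketch.
    """
    if rates is None:
        rates = [0.005 * k for k in range(1, 401)]  # candidate c in (0, 2]
    n = len(times)
    mean_y = sum(aucs) / n
    best = None
    for c in rates:
        x = [math.exp(-c * t) for t in times]
        mean_x = sum(x) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        if sxx == 0:
            continue
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, aucs))
        slope = sxy / sxx
        a = mean_y - slope * mean_x  # asymptote
        b = -slope                   # initial gap below the asymptote
        sse = sum((a - b * xi - yi) ** 2 for xi, yi in zip(x, aucs))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def time_to_habit(c, fraction=0.95):
    """Time at which the curve has closed `fraction` of the gap to its asymptote."""
    return -math.log(1.0 - fraction) / c


# Synthetic, noise-free AUC sequence (assumed parameters, not the paper's data).
times = list(range(50))
aucs = [0.8 - 0.25 * math.exp(-0.3 * t) for t in times]
a, b, c = fit_exponential(times, aucs)
print(round(a, 3), round(b, 3), round(c, 3), round(time_to_habit(c), 1))
```

A different threshold (or a slope-based criterion) would shift the estimated time, so any comparison across domains should hold the criterion fixed.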

D. Additional Analysis of Reward Sensitivity.

As in study 1, we

explore whether sensitivity to reward changes is associated with

handwashing habits. Again, we use both a hypothesized

reward change and an intervention designed to change the reward

value of handwashing. The hypothesized reward change is the last

opportunity a caregiver has to wash their hands during a work

shift—this is a dummy variable encoding the final room visit for

their shift. The hypothesis is that they are less likely to wash their

PNAS

2023 Vol. 120 No. 17 e2216115120

https://doi.org/10.1073/pnas.2216115120


Table 4. Variables which LASSO identified as being most predictive across the handwashing sample

Column headings: Importance, Median, % zero, % positive, Predictive effect

Compliance last shift: 0.77, 0.66, 0.70, 0.92, 100, Positive

Entry indicator: 0.35, −0.33, −0.28, −0.04, Negative

Compliance within a room: 0.12, 0.00, 0.01, 0.14, Negative

Room compliance of others: 0.08, 0.04, 0.05, 0.12, Negative

Prev. room compliance: 0.07, 0.03, 0.04, 0.11, Negative

hands because it is less important to do so, for hospital hygiene,

when they are leaving, or more important because they are going

home. There is no statistically significant effect—this variable

does not affect handwashing differently pre- and post-habit in the

within-participant analysis (SI Appendix, section 4.1). This might

be explained, in part, because we do not observe individuals from

the true “start of habit formation” in our handwashing dataset

or because exiting the last patient room during a shift does not

truly represent a reliable reward change.
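The hypothesized reward change is described as a dummy variable marking the final room visit of each shift. A minimal Python sketch of that encoding follows; the record layout (dictionaries with `caregiver`, `shift`, and `time` keys) and the function name `flag_last_visit` are assumptions for illustration and are not taken from the Proventix data format.

```python
def flag_last_visit(visits):
    """Mark the final recorded room visit of each (caregiver, shift) pair.

    visits: list of dicts with 'caregiver', 'shift', and 'time' keys.
    Returns a list of 0/1 flags aligned with `visits`.
    """
    last_index = {}
    for i, v in enumerate(visits):
        key = (v["caregiver"], v["shift"])
        # Keep the index of the latest visit seen so far for this shift.
        if key not in last_index or v["time"] >= visits[last_index[key]]["time"]:
            last_index[key] = i
    flags = [0] * len(visits)
    for i in last_index.values():
        flags[i] = 1
    return flags


# Hypothetical visit log; records need not be sorted by time.
visits = [
    {"caregiver": "A", "shift": 1, "time": 1},
    {"caregiver": "A", "shift": 1, "time": 3},
    {"caregiver": "A", "shift": 1, "time": 2},
    {"caregiver": "B", "shift": 1, "time": 5},
    {"caregiver": "B", "shift": 1, "time": 1},
]
print(flag_last_visit(visits))  # [0, 1, 0, 1, 0]
```

The resulting flag can then be entered as a predictor and compared before and after the estimated time of habit formation, as in the within-participant analysis described above.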

We also test the sensitivity of low- and high-predictability

caregivers to various sporadic incentive interventions designed

to increase handwashing. The goal of this test is to see if

the stronger effect of the StepUp gym attendance intervention

on low-predictability participants is replicated. However, these

interventions did not have proper control groups; they were

also created by the dispenser company (Proventix) rather than

behavioral scientists. Unlike in the more careful StepUp design,

there are no differential effects of interventions based on

differences in participant predictability (SI Appendix, section 4.2).

3. Discussion

Despite the personal and policy relevance of understanding

habits, there has been a notable absence of field studies examining

the formation of individual habits using large-scale, observational

data. To address this gap, we introduce a machine learning

approach called PCS that identifies the context variables that

best predict behavior for each individual.

In models of habit commonly used in economics and applied

fields, the context predictor of greatest interest is recent past

behavior (20, 21; SI Appendix, section 1). Verplanken (4) notes that

this narrow focus on past choices is also evident “in applied social

psychology, as well as other areas such as health, social medicine,

or education, and may have stalled progress in habit theory for

quite some time” (p. 3).

Applying PCS to two longitudinal datasets, we identify

individual differences in predictability and context cues that

predict behavior for each individual. While past behavior is

confirmed as a reliable predictor of future behavior, PCS also

revealed other predictive variables and heterogeneity in the rate of

habit formation across domains. Contrary to the idea of a “magic

number” of days in which habits form, our findings suggest that

developing a handwashing habit takes weeks, while developing a

gym habit takes months. One possible explanation is that relative

to handwashing, gym attendance is a less frequent and more

complex behavior. Handwashing is more likely to involve chained

sensorimotor action sequences which are more automatic (22).

Animal learning experiments have shown that reduced sensitivity

to reward changes is a strong marker of habit formation

(23, 24). We tested the association between habit formation

and reward sensitivity in a well-powered, random-assignment

megastudy (12). We found that more habitual gym attendees are

less likely to respond to interventions (

−

, P

001).

However, we did not find evidence of reward insensitivity

in other scenarios, such as unusual weather or handwashing

policies. These results suggest that reward insensitivity may be

less common in human habit formation or too small to detect, for

the most part, in our field data. Alternatively, our datasets may

lack variables that sufficiently operationalize changes in reward.

Further tests of the reward-insensitivity hypothesis are a priority

for future research using field data.

In conclusion, we show that PCS is a machine learning

approach well suited to the study of habit formation in

natural settings. The potential of this technique for advancing

personalized behavior change is hinted at by our finding that less

habitual gym attendees are more sensitive to randomly assigned

Fig. 3. Estimation of the speed of habit formation for handwashing, for the second lowest (Left, 2nd decile) and second highest (Right, 9th decile) deciles of sample size. Each panel plots AUC (y-axis) against shifts since first data point (x-axis), with the fitted exponential curve and the time to habit formation marked. Note that the x-axes are different because, by construction, the deciles have different lengths of sampled time. The times to habit formation in these two deciles (shown by where the vertical blue line intersects the x-axis) are 10 (Left) and 9 shifts (Right; about 225 to 250 handwashing episodes).


interventions designed by behavioral scientists. Additionally,

PCS may help identify the sensory, mental, and physical context

variables that trigger a behavior like drug use. In conjunction with

the increasing availability of large-scale, longitudinal datasets, we

hope this innovative methodology inaugurates a new era in the

study and personalization of behavior change interventions (25).

Data, Materials, and Software Availability.

The data analyzed in this paper

were provided by 24 Hour Fitness and Proventix. We have their legal permission

to share the deidentified data. The data and code to replicate the analyses are

available at https://osf.io/m8gdp/ (26).

ACKNOWLEDGMENTS.

We thank Anthony Kukavica, Predrag Pandiloski, and

Mira Potter-Schwartz for excellent research assistance. We thank Hengchen

Dai for her help accessing and interpreting the Proventix data. We thank the

Sloan Foundation G2018 11259 (CFC), the Behavior Change for Good Initiative

(whose contributions were funded by the Robert Wood Johnson Foundation,

the AKO Foundation, J. Alexander, M. J. Leder, W. G. Lichtenstein, the Pershing

Square Fund for Research on the Foundations of Human Behavior from Harvard

University and by Roybal Center grants (P30AG034546 and 5P30AG034532)

from the National Institute on Aging), the Linde Institute (X.L.), and the Chen

Neuroscience Institute at Caltech (A.B., CFC) for the financial support. We also

thank audiences at BCFG, TA-DAH, Behavioral Ops (Shanghai), Stanford GSB,

HKU, AAAI, and our internal science team meetings, for helpful ideas. We thank

24 Hour Fitness for partnering with the Behavior Change for Good Initiative at the

University of Pennsylvania to make this research possible, as well as Proventix.

Portions of this paper were developed from the thesis of A.B.

Author affiliations: (a) Department of Marketing, HEC Paris, Jouy-en-Josas 78350, France; (b) Department of Marketing, The University of Chicago Booth School of Business, Chicago, IL 60637; (c) Operations, Information and Decisions Department, The Wharton School of the University of Pennsylvania, Philadelphia, PA 19104; (d) Division of Humanities and Social Sciences, California Institute of Technology, Pasadena, CA 91125; (e) Department of Psychology, University of Pennsylvania, Philadelphia, PA 19104; and (f) Computational and Neural Systems, California Institute of Technology, Pasadena, CA 91125

1. S. Fleetwood, A definition of habit for socio-economics. Rev. Soc. Econ., 131–165 (2021).
2. W. Wood, D. Neal, The habitual consumer. J. Consumer Psychol., 579–592 (2009).
3. D. T. Neal, W. Wood, M. Wu, D. Kurlander, The pull of the past: When do habits persist despite conflict with motives? Person. Soc. Psychol. Bull., 1428–1437 (2011).
4. B. Verplanken, “Introduction” in The Psychology of Habit, B. Verplanken, Ed. (Springer, 2018), pp. 1–10.
5. P. Lally, C. H. van Jaarsveld, H. W. Potts, J. Wardle, How are habits formed: Modelling habit formation in the real world. Euro. J. Soc. Psychol., 998–1009 (2010).
6. N. Kaushal, R. Rhodes, Exercise habit formation in new gym members: A longitudinal study. Behav. Med., 652–663 (2015).
7. M. Fournier et al., Effects of circadian cortisol on the development of a health habit. Health Psychol., 1059–1064 (2017).
8. B. Gardner, P. Lally, “Modelling habit formation and its determinants” in The Psychology of Habit, B. Verplanken, Ed. (Springer, 2018), pp. 207–229.
9. K. Volpp, G. Loewenstein, What is a habit? Diverse mechanisms that can produce sustained behavior change. Organ. Behav. Human Decision Process. 161, 36–38 (2020).
10. J. A. Ouellette, W. Wood, Habit and intention in everyday life: The multiple processes by which past behavior predicts future behavior. Psychol. Bull. 124, 54–74 (1998).
11. A. Mazar, W. Wood, Illusory feelings, elusive habits: People overlook habits in explanations of behavior. Psychol. Sci., 563–578 (2022).
12. K. L. Milkman et al., Megastudies improve the impact of applied behavioural science. Nature 600, 478–483 (2021).
13. J. D. Lee, D. L. Sun, Y. Sun, J. Taylor, Exact post-selection inference, with application to the lasso. Ann. Stat., 907–927 (2016).
14. P. Lally, B. Gardner, Promoting habit formation [Suppl 1]. Health Psychol. Rev., S137–S158 (2013).
15. H. Dai, K. Milkman, J. Riis, The fresh start effect: Temporal landmarks motivate aspirational behavior. Manag. Sci., 2563–2582 (2014).
16. S. DeWit et al., Shifting the balance between goals and habits: Five failures in experimental habit induction. J. Exp. Psychol. 147, 1043–1065 (2018).
17. E. R. Pool et al., Determining the effects of training duration on the behavioral expression of habitual control in humans: a multi-laboratory investigation. Learn. Mem., 16–28 (2022).
18. H. Dai, K. L. Milkman, D. A. Hofmann, B. R. Staats, The impact of time at work and time off from work on rule compliance: The case of hand hygiene in health care. J. Appl. Psychol. 100, 846–862 (2015).
19. B. R. Staats, H. Dai, D. Hofmann, K. L. Milkman, Motivating process compliance through individual electronic monitoring: An empirical examination of hand hygiene in healthcare. Manag. Sci., 1563–1585 (2017).
20. G. S. Becker, K. M. Murphy, A theory of rational addiction. J. Polit. Econ., 675–700 (1988).
21. J. P. Dubé, G. J. Hitsch, P. E. Rossi, State dependence and alternative explanations for consumer inertia. RAND J. Econ., 417–445 (2010).
22. B. Balleine, A. Dezfouli, Hierarchical action control: Adaptive collaboration between actions and habits. Front. Psychol., 2735 (2019).
23. C. D. Adams, A. Dickinson, Instrumental responding following reinforcer devaluation. Q. J. Exp. Psychol. 33B, 109–121 (1981).
24. E. Tricomi, B. W. Balleine, J. P. O’Doherty, A specific role for posterior dorsolateral striatum in human habit learning. Euro. J. Neurosci., 2225–2232 (2009).
25. C. J. Bryan, E. Tipton, D. S. Yeager, Behavioural science is unlikely to change the world without a heterogeneity revolution. Nat. Hum. Behav., 980–989 (2021).
26. H. Ho, Machine learning and habit formation. Open Science Framework. Deposited 26 February 2023.
