Supplementary Materials for:
Context-sensitivity of behavior in field data: Using machine
learning to study habit formation in natural settings
Anastasia Buyalskaya
Hung Ho
Xiaomin Li
Katherine Milkman
Angela Duckworth
Colin Camerer
March 21, 2023
Table of Contents
1 Literature Review ............................................... 3
 1.1 Overview ..................................................... 4
 1.2 Psychology ................................................... 6
 1.3 Computational Neuroscience .................................. 10
 1.4 Economics ................................................... 13
 1.5 Political Science ........................................... 16
2 Dataset Descriptions ........................................... 17
 2.1 Hand Washing Data ........................................... 17
 2.2 Gym Attendance Data ......................................... 18
 2.3 Description of Context Variables ............................ 18
  2.3.1 Hand washing data ........................................ 18
  2.3.2 Gym attendance data ...................................... 19
3 Analysis Details ............................................... 19
 3.1 Individual LASSO Regressions ................................ 20
 3.2 Model Selection Challenges in LASSO ......................... 23
 3.3 AUC vs Frequency ............................................ 29
 3.4 Speed of Habit Formation .................................... 30
4 Field Tests of Insensitivity to Reward Devaluation ............. 32
 4.1 Within-subject Field Tests of Insensitivity to Reward Devaluation Pre- and Post-habit ... 33
 4.2 Between-person Predictability Reactions Toward Incentivised Intervention ... 38
 4.3 Sensitivity Analyses ........................................ 44
5 Additional Analyses: Demographic Predictors of AUC ............. 45
 5.1 Motivation .................................................. 45
 5.2 Variable List ............................................... 46
 5.3 Correlation Matrix .......................................... 46
 5.4 Regression Results .......................................... 47
6 Human Subjects Protections ..................................... 48
 6.1 Gym Attendance .............................................. 48
 6.2 Hand Washing ................................................ 49
7 Review of Habit Formation Studies .............................. 49
 7.1 Summary of Previous Habit Formation Studies ................. 49
1 Literature Review
Since habit naturally crosses disciplinary boundaries, the most promising understanding of it is likely to come from
integrating evidence and methods across disciplines (1) (pg. 42). That is our approach. The purpose of the following
section is to highlight key papers from the major disciplines we take evidence and methods from. Specifically, this
section summarizes how habit is studied in psychology, computational neuroscience, economics, and political science.
We first present a summary table comparing how these literatures have addressed the different hallmarks of
habitual behavior in Table S1. We mark in bold what we view as being the “best practices” of measurement. Some
of the attributes – such as time to habit formation – do not have an ideal best practice yet. We follow the table with
more detailed reviews of each field’s major contributions.
1.1 Overview
Table S1: General practices and measures of habit features across different social and natural sciences. Bold
typeface indicates our subjective valuation of best practices.

Methods.
 Psychology: lab & field experiments, field data.
 Computational neuroscience: lab experiments.
 Economics, political science: econometrics, field experiments.
 This paper: machine learning, field data.

Data types.
 Psychology: self-report surveys (SRHI), behavior.
 Computational neuroscience: behavior, neural activity (small samples), lesions.
 Economics, political science: behavior, often large samples.
 This paper: behavior, large samples and time span (no attrition).

Length of habit formation.
 Psychology: thought to be around 2 months¹, estimated from increase in self-report.
 Computational neuroscience: simple motor habits, trained from <1 hour to several hours.
 Economics, political science: field experiments often assume 1 month².
 This paper: estimated from increased predictability.

Automaticity.
 Psychology: measured by self-report³ and newer implicit measures.
 Computational neuroscience: response times, changes in brain activity⁴.
 Economics, political science: typically not inferrable from choices alone.
 This paper: no data.

Reward devaluation insensitivity (RDI).
 Psychology: measured in “extinction” test⁵ after devaluation by feeding to satiety, etc.
 Computational neuroscience: optogenetics modulated habit⁶. Only weak evidence for short-run (1 hr) human
 tasks⁷. Evidence of rat-human homology⁸.
 Economics, political science: not tested.
 This paper: no evidence of RDI tested with hypothesized ecologically-relevant reward changes. StepUp gym
 attendance after reward increase is negatively correlated with predictability.

Context sensitivity.
 Psychology: “habit discontinuity hypothesis” (HDH)⁹; habits more easily changed when context changes.
 Computational neuroscience: not studied to our knowledge.
 Economics, political science: behavior reduced by context change (HDH).
 This paper: large set of context variables permits estimation of individual-specific context-sensitivity.

Individual differences.
 Psychology: most studies aggregate across individuals¹⁰.
 Computational neuroscience: learning parameters can be different.
 Economics, political science: most studies aggregate across individuals.
 This paper: AUC measures differences in predictability across people. Can compare context variables across
 people.
Footnotes for Table S1:

¹ This estimate comes from (2), who compiled panel observational data using self-report questionnaires and fit
individual-level habituation models. They estimated that it took anywhere from a few weeks to over half a year to
reach the 95% asymptote of behavior.

² One of the first papers in economics to empirically test habit formation used one month (3). While they do not
explicitly justify the choice of one month, they do state “Our results indicate that it may be possible to encourage
the formation of good habits by offering monetary compensation for a sufficient number of occurrences, as doing so
appears to move some people past the ‘threshold’ needed to engage in an activity.” Some later studies continued
to use one month (4). Acknowledging that this might be too short an intervention interval for habits to form,
(5) wrote “An alternative explanation is that some subjects would have experienced an increase in postintervention
attendance if the intervention period had been longer”.

³ The BFCS (“Behavior Frequency x Context Stability”) measure (6) is a self-report index covarying past behavior
frequency with measured contextual variables. It is designed to test whether behaviors which are repeated frequently
in familiar contexts are more likely to become habitual. The most commonly used self-report measure is the SRHI
(“Self-Report Habit Index”). A subscale has been designed specifically to capture automaticity, the SRBAI
(“Self-Report Behavioral Automaticity Index”), including questions like “I do this behavior without thinking.”

⁴ Classic work by (7), recording from rodent brains during maze-running, found concomitant changes in response
times, increases in performance, and reductions in neural signals of imminent reward. We think of these as hallmarks
of automaticity, but there may be other markers in humans (e.g. degraded memory for past activity, or misattributions
of cause from habit to internal states; see (8)).

⁵ The classic extinction test paradigm was developed by (9). In an important field experiment with humans akin
to the extinction test, (10) found that individuals continue to eat stale popcorn beyond satiation if they are in a
specific context which cues the habitual behavior (specifically, watching a film or eating with their dominant hand).

⁶ See (11).

⁷ See (12).

⁸ The only paper to find evidence of reward devaluation insensitivity with humans in fMRI (13) has not been
directly replicated. The same feeding-to-satiety paradigm and analogous versions with token-money reward devalua-
tion have not reliably shown devaluation after short-run training (14). Establishing robust devaluation insensitivity
for humans is an active area of research (15).

⁹ The HDH is the hypothesis that behavioral interventions aimed at changing habits often work best after a
lifestyle change (16). It seems to be an open question whether this is due to changes in prices and opportunities,
or to subtler forms of context-sensitivity in which missing context cues do not trigger associations (or cravings, for
drugs of abuse).

¹⁰ As noted in (17), “Theory has also been inadequately tested at the individual level. Most (80 studies; 98%
of all the study sample) studies have exclusively modelled between-person variation in habit, based on aggregates of
individuals’ habit scores. Yet, habitual action is inherently idiosyncratic, based on personally acquired behavioural
responses to personally meaningful cues. Within-person effects cannot be reliably interpreted from aggregations of
processes that differ between people.”
1.2 Psychology
Psychologists define habit as a behavior which is prompted automatically by contextual cues as a result of learned
context-action associations (18). This definition combines two key attributes of habitual behavior which guide much
of the psychology research: automaticity and predictable context-sensitivity.
Some habit researchers make further distinctions about what should be considered a habit. For example, (17)
argues that the initiation and performance of a behavior are distinct. He classifies behavior into one of three types:
habitually initiated but consciously performed (his example: riding a bike to work every morning), consciously initi-
ated but habitually performed (his example: exercising at the gym), or habitually initiated and habitually performed
(eating a snack in the afternoon). This is a sensible distinction but without measures of automaticity we cannot
apply it to our data. It is simply a reminder that repeated behaviors that we call habits need not be unconscious or
automatic.
Context-Sensitivity
The focus on context-sensitivity came from evidence that habits arise when context-stable behavioral repetition
creates a “transfer” from (internal) goals to (external) associations with environmental cues (19). In the language
of animal learning and instrumental conditioning, an S-R-O relation in which an association is developed between a
stimulus (S), the response (R) it elicits, and a reward outcome (O), becomes habitual as an S-R relation.
Habits are not “innate” to the behavioral repertoire in the way reflexes are (e.g. one is not born with, but
must develop, the habit of tooth-brushing, unlike the reflex of being startled by something unexpected, which is
present in newborns). Instead, most habits begin as goal-directed behaviors. Eating solid food using a
fork, for example, begins as a very deliberate goal-directed behavior in small children (one which requires a lot of
motor and cognitive control in the beginning). It may take months or even years for eating to become an automatic
motor sequence with little need for cognitive control, such that a habit can form. In adults, who have cheaper
cognitive control and can eat “mindlessly,” the behavior of eating is ripe for developing associations with context and
reward independent of nutritional goals. Specifically, the “trigger” to eat is often transferred to context elements of
the environment which reliably co-occur with the behavior. For example, people who snack frequently in a stable
context are no longer driven by an internal motivation to eat, but rather by an environmental cue (20).
The range of possible context cues is usually idiosyncratic, because they are likely to vary by the type of behavior
and by individual. The context cues most often studied in psychology tend to be physical time, space or social cues
which are easily measurable (such as the location in which a behavior occurs, the day of the week, the time of day,
whether other people were present when the behavior was executed, etc.) (21). However, one can imagine less easily
measurable contextual cues, such as a specific mood, sensory input or a memory, as triggering a habit. These may
be harder to measure objectively, for example relying on individuals’ recollection of a memory or ability to verbally
describe a feeling, but are still important. In clinical studies for example, stress and visual cues that induce craving
states are often measured given their importance for behavior (22–24). Psychology and applied psychology (e.g.,
health behavior research) are disciplines that are the most focused on, and seek to measure, context-sensitivity.
Automaticity
The other attribute that psychologists seek to measure to determine whether a behavior is truly a habit is
automaticity (25, 26). A behavior is considered automatic if it is “brought to mind by cognitive processes largely
outside of conscious awareness” (21) (pg. 14).
An early start on this definition came from (27), who presented four criteria of automatic behavior. The first is
awareness of the cognitive process which gives rise to the behavior. The second is intentionality – or control – over
the initiation of the cognitive process. The third is efficiency – automatic processing requires fewer mental resources.
And the fourth is control – the ability to stop or alter the cognitive process after it has begun. Even if there was an
easy way to measure all four of these, Bargh noted that not all of the criteria need to be met in order for a behavior
to be considered automatic. In fact, a behavior which only meets two or three may still be automatic, confusing
the definition even further. More recent theoretical models of automaticity have maintained the view that it
is a multidimensional construct, continuing to emphasize the unintentionality, uncontrollability, and unconscious
execution of behavior (28).
Animal learning studies also illustrate how simple theories of automaticity and habit are often hard to evaluate
conclusively. (29) trained rats on a two-lever-press paradigm for 20 days or 60 days, then tested for automatic-
ity and sensitivity to reward devaluation. The more extensively trained rats performed the rewarding lever presses
more often and more quickly (by these measures, their behavior became more automatic). But both groups exhibited
similar insensitivity to reward devaluation and a difference in apparent goal-directed control of the two different levers.
Measurement
Next, we’ll examine the most common measures used in psychology to assess context-sensitivity and automaticity
of behavior. Classic research on habits (e.g. in animal learning studies) inferred habit from observed behavior in
response to cues. However, psychological research with humans has typically used self-reported answers to questions
about a participant’s own behavior. In a meta-analysis looking at 136 empirical studies which applied ideas from
the habit literature to health behaviors over the years 1998-2013, (17) found that self-report scales are still the main
methods used to measure habitual behavior. Two scales dominate the literature.
The first scale – relied on by 88% of the studies in Gardner’s meta-analysis – is the SRHI, or “Self-Report Habit
Index” (30). Its popularity stems from the fact that the questionnaire is short (a 12-item scale), direct and has
become the standard in psychology. One of the questions asks the subject to rate their agreement with the following
statement on a Likert scale: “I do this behavior without thinking.” A subscale of SRHI was designed specifically to
7
capture automaticity and is called SRBAI (“Self-Report Behavioral Automaticity Index”).
Accurately self-reporting habits and automaticity relies on good memory and metacognition (our thinking about
our thinking). This raises a fundamental question about the limits to the accuracy of self-reports about habits and
their automaticity. This question has been debated extensively in the applied psychology literature (31–34).
However, as (35) argue, the SRHI infers habit from reflections on symptoms of habitual responding – such as
proceeding without effort, conscious awareness, conscious intention, etc. – rather than assuming people have deeper
insight into habitual regulation. There could be other limitations of self-report, such as misconstrual of items. It
could also be that asking people to reflect on the characteristics of behavioral performance may lead them to report
behavioral frequency rather than habit symptoms. However, (36) found from think-aloud protocols that only 10%
of responses indicated problems.
Another popular measure – used by 12% of the studies in Gardner’s review – was Ouellette and Wood’s (1998)
BFCS (“Behavior Frequency x Context Stability”) measure. This is a self-report index co-varying past behavior
frequency with context stability. This measure is based on the assumptions outlined earlier that behaviors which are
repeated frequently in familiar contexts are more likely to become habitual (18). The questions aim to assess both
directly, phrased as “how often do you do this behavior?” and “when you do this behavior, how often is this cue
present?”
The BFCS self-report clearly depends on accurate memory recall and metacognition. It is conceivable that more
habitual activities are less well remembered. Consider the example of checking a mobile phone. Using a smartphone
app which calculated true frequency of phone use, (37) were able to track the phone behavior of 27 participants
over the course of 14 days. They found that there was no correlation between true phone-checking behavior and a
self-report measure called the Mobile Phone Problem Use Scale (“MPPUS”). The MPPUS is a 27-item questionnaire
that includes items such as “I can never spend enough time on my mobile phone”. This anecdote points to another
fault with self-report measures: they are inherently retrospective, relying heavily on hindsight. But memory degrades
quickly – with the details of a morning becoming foggy as one enters their afternoon – meaning the timescale at
which these questionnaires are administered is crucial.¹
A more systematic review comes from (38), who ran a meta-analysis of 47 studies to measure the link between
logged and self-reported digital media use. To evaluate the association between self-reported and logged media use,
66 effect sizes from 44 studies were considered (n=52,007) and correlations were calculated with robust variance
estimation (RVE). Their analysis concluded that self-reported media use has a positive but medium-magnitude
relationship with logged (objective) measurements (r = 0.38, 95% CI = 0.33 to 0.42, p < 0.001). Furthermore,
problematic media use showed a slightly smaller association with usage logs (r = 0.25, 95% CI = 0.20 to 0.29,
p < 0.001). These
studies, along with another relevant critique from (39), point out the challenges of self-reporting some aspects of
habits from memory.
Besides these two most common scales, two other measures in Gardner’s meta-analysis were used in just one study
¹ A modern technique which the smartphone makes available is real-time experience sampling, where people are prompted to discuss
situational cues and whether they are executing a habit.
each (1% of the sample; proportions add to more than 100% because of rounding). The EHS (“Exercise Habit
Survey”), used in one study, is similar to BFCS. The other measure was an association test, designed to measure
cue-behavior associations underpinning habitual behaviors (an implicit association test).
So while psychologists have identified two important elements of habitual behavior – context-sensitivity and
automaticity – there have been some concerns about how good their current measurement tools are as proxies for true
habitual behavior (1).
What behaviors can become habitual?
Psychologists study habits across a range of behavioral domains. Popular domains of study include activities
which are done frequently: eating, exercising, and hygiene behavior. However, there is some debate around how
complex a behavior can be before it can no longer be considered a candidate for becoming habitual. This is in
part due to research which has demonstrated that simpler actions like drinking water tend to become habitual more
quickly than complex actions like exercise routines (2). The idea is also evident in animal learning, in which chained
motor sequences are slower to form habits (7).
Focusing on the two behaviors covered in this paper, hand-washing seems to be ripe for becoming habitual because
it involves a short motor sequence. (40) (pg. 248) suggest that hand-washing habits “minimiz[e] cognitive resources
required for a given behavior to ensure that it can be performed with a maximum of patients and/or for when such
resources are especially needed”.
Whether exercise can become habitual is more debatable (41). Physical activity, particularly travelling to a gym
for exercise, is different from other familiar habitual behaviors. Two differences worth noting are that it is a multi-
step behavior, not a simple motor action, and that it takes a long time to perform. However, the type of exercise
which is done inside a gym is often a relatively straightforward motor action as well. Running on a treadmill, rowing,
lifting weights – while requiring “control” and “awareness” and hence not meeting the definition of automaticity – are
simple enough that many gym-goers are able to multi-task while doing them – as is obvious from watching gym-goers
listening on their headphones, holding a conversation, reading or watching TV while they exercise. Secondly, the
other attribute of habitual behavior, context-sensitivity, is likely present for gym goers. Location, other people, time
of day, or biological states (for very regular exercisers) are likely candidates for cuing the decision to attend the gym.
Speed of Habit Formation
Behavior goes from being goal-directed to being habitual through frequent repetition in a context-stable state.
Many researchers have been interested in this habit formation process, and in particular, how long it takes for a
habit to form. However, answering this question using traditional tools from psychology is difficult because it requires
a significant amount of data collection (obtaining regular SRHI responses over many days, as an example). This
requires researcher time and persistent longitudinal engagement by subjects. Hence, only a handful of studies have
been done to answer this question (2, 42, 43).
A seminal study is (2). The researchers collected SRHI measures for 82 subjects daily over the course of 12 weeks
for an eating, drinking or physical activity behavior chosen by the subject. (2) then fit a curve to each individual’s
self-report scores through time in order to measure the time it took them to reach 95% of the asymptote (their
definition of when something became a habit). They were able to fit the model for 62 individuals and obtain a good
fit for 39 out of those 62, finding that “performing the behaviour more consistently was associated with better model
fit.” Their results showed that the median time to habit formation was 66 days, with a range of 18 to 254 days to
habit formation depending in part on the complexity of the behavior (e.g. the relatively simple act of drinking a
glass of water was quicker to form habits than a more complex physical activity).
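The curve-fitting exercise in (2) can be sketched in a few lines: fit an asymptotic curve to an individual's daily self-report scores and solve for the day on which the fitted curve reaches 95% of its asymptote. The exponential functional form, the synthetic data, and all parameter values below are illustrative assumptions for this sketch, not the authors' exact specification.

```python
import numpy as np
from scipy.optimize import curve_fit

def habit_curve(t, a, b, k):
    # a = asymptote, b = initial gap below the asymptote, k = learning rate
    return a - b * np.exp(-k * t)

# Synthetic daily SRHI-style scores for one participant (84 days ~ 12 weeks)
rng = np.random.default_rng(0)
days = np.arange(1, 85)
scores = habit_curve(days, a=6.0, b=4.0, k=0.05) + rng.normal(0, 0.3, days.size)

(a, b, k), _ = curve_fit(habit_curve, days, scores, p0=[5.0, 3.0, 0.1])

# Day at which the fitted curve reaches 95% of the asymptote:
# a - b*exp(-k*t) = 0.95*a  =>  t = ln(b / (0.05*a)) / k
t95 = np.log(b / (0.05 * a)) / k
print(round(t95, 1))
```

With these assumed parameters the 95% point lands in the several-week range, which is the quantity (2) report as varying from 18 to 254 days across participants.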
Another study looked at the development of exercise habits by asking new gym members to complete surveys over
the course of 12 weeks (42). They found that exercising at least four times per week for 6 weeks was the minimum
requirement to establish an exercise habit, based on the time at which behavior appeared to reach an asymptote
(i.e. not change significantly after that time period). The most recent observational study focused on the effect of
circadian cortisol (modulated by time of day) on the development of a simple physical habit. (43) tracked 42 French
students for 90 days as they did a stretching exercise behavior. Some students were assigned to do it in the morning
(when cortisol levels are high) and some in the evening (when cortisol is low). The SRBAI was collected daily, and
the speed of habit formation process was then modelled using learning curves by fitting a four-parameter logistic
curve to SRBAI responses. The curve-fitting process was successful, converging for each participant (in contrast
to the power function following (2), which the researchers also tried, finding that only 48% had a moderate fit as
defined by R² > 0.70). Their results showed that the morning group achieved automaticity at an earlier time point
(106 days) than the evening group (154 days), concluding that time of day influences the speed of habit formation.
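A minimal sketch of the four-parameter logistic approach used for the SRBAI data in (43) is below; the synthetic data, noise level, and starting values are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic4(t, lower, upper, xmid, scale):
    # Four-parameter logistic: lower/upper asymptotes, inflection day, slope
    return lower + (upper - lower) / (1.0 + np.exp(-(t - xmid) / scale))

rng = np.random.default_rng(1)
days = np.arange(1, 91)  # a 90-day tracking period
srbai = logistic4(days, 1.5, 6.0, 40.0, 8.0) + rng.normal(0, 0.25, days.size)

params, _ = curve_fit(logistic4, days, srbai, p0=[1.0, 5.0, 45.0, 10.0])
fitted = logistic4(days, *params)

# R^2 of the fit, mirroring the R^2 > 0.70 "moderate fit" criterion
r2 = 1 - np.sum((srbai - fitted) ** 2) / np.sum((srbai - srbai.mean()) ** 2)
print(params[2], r2)  # fitted inflection day and fit quality
```

The fitted inflection day marks when automaticity gains are fastest; the upper asymptote and the day the curve approaches it can then be read off as in the study.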
All three of these quantitative studies showed that “habit typically develops asymptotically and idiosyncratically,
potentially differing in rate across people, cues and behaviors” (44) (pg. 220).
1.3 Computational Neuroscience
What does habitual behavior look like in brain activity? This has been the driving question for much research
in computational neuroscience. This research tends to focus on the neural basis of the two types of cognitive
processing mentioned in the last section: “goal-directed” behavior, a more deliberate cognitive functioning, and
habitual behavior. The existence of these respective decision making systems is now well-accepted and commonly
modeled theoretically as model-free (MF) and model-based (MB) decision-making (45–47). MF learning transitions
to habit learning with extensive experience.
When a new habit is being learned, inputs to the midbrain dopamine system drive dopaminergic neural activity
which encodes reward prediction errors (RPEs). These RPEs serve as learning signals. Learning an accurate
prediction of a stable reward results in smaller and smaller reward prediction errors over time. These signals are
thought to modulate synaptic plasticity in the striatum, which in turn serves as the “gate-keeper for tentative motor
plan representations” (48). The striatum can be further segmented into two distinct areas: the dorsomedial striatum
(DMS) and the dorsolateral striatum (DLS).
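The shrinking-RPE account above can be illustrated with a minimal delta-rule simulation (a standard Rescorla-Wagner-style update, not a model taken from any of the cited papers): with a stable reward, prediction errors shrink toward zero as the value estimate converges.

```python
# Delta-rule illustration: RPEs shrink as a stable reward is learned.
alpha = 0.1      # learning rate (assumed value)
reward = 1.0     # stable reward
v = 0.0          # initial value prediction
rpes = []
for _ in range(50):
    rpe = reward - v      # reward prediction error
    v += alpha * rpe      # update prediction toward the reward
    rpes.append(abs(rpe))

print(rpes[0], rpes[-1])  # large early RPE, near-zero late RPE
```

The monotone decay of |RPE| is the learning signal whose disappearance accompanies well-predicted, habit-ready behavior.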
Instrumental behaviours which respond to reward values may start out as goal-directed actions largely controlled
by the associative striatum (DMS), which controls more goal-directed activity, when they are first being learned. But
under certain conditions and with enough repetition, these behaviors may become habitual and no longer contingent
on reward. Then cognitive control shifts to the sensorimotor striatum (DLS), which controls more stimulus-driven
behaviors (49, 50). Functional MRI studies which are used to localize brain activity during decision making have
confirmed that habitual processing tends to occur in the “sensorimotor loop,” which connects the basal ganglia with
the sensorimotor cortices and parts of the midbrain (13, 49). Brain scans have therefore been used to confirm that
the brain has two independent sources of action control which govern behavior, and to help determine whether a
behavior is habitual or goal-directed (12).
So what are the conditions necessary for a behavior to move from being goal-directed to being habitual? The
animal literature suggests that habit formation requires a behavior to be repeated many times – a process known as
“overtraining” (13). Another variable which seems to contribute to habit formation is learning under stress. Lab
studies have found that inducing stress (in animals, including humans) leads to quicker formation and reliance on
habitual behavior (51). Finally, optogenetic studies using rodents found that the disruption of rodent infralimbic
cortex (a region in the medial prefrontal cortex which has been shown to be necessary for the expression of habits)
temporarily blocked habit responding (11).
One important test used to determine whether a behavior is habitual or not is a test of sensitivity to reward
devaluation. The procedure originated in animal learning studies, with (9), who studied how lever pressing in rats
could become habitual. When they analyzed habit, they described it as a behavior which becomes so automatic
that even devaluation of the reward value of an outcome will not have a large effect on the execution of the habitual
behavior. Specifically, they found that mildly poisoning a food pellet after a rodent has developed a highly-trained
habit of lever-pressing for the pellets did not deter the rodent from continuing to press the lever. This phenomenon
has been termed insensitivity to reward devaluation, and is a behavioral hallmark of habitual processing.
There is some evidence of insensitivity to reward devaluation in humans. (13) trained participants to learn
that responses to two different fractal images were associated with two different snack rewards. After overtraining
(choosing their preferred fractal many times in short succession), they were given one of the snacks to eat to satiety,
which presumably devalued it. Subjects who had food devalued this way continued to choose the fractal associated
with the devalued foods, indicating habit. This is evidence of human insensitivity to reward change similar to the
animal experiments.
However, other researchers have not been able to replicate these findings (14). This raises the question of
whether an experimental paradigm using rodents can be easily transferred to human behavior. Another concern
which has been raised about the reward devaluation paradigm is that it implies that behavior which is not goal-
directed is necessarily habitual. For example, the goal-independent behavior may not be context-sensitive (21) (pg.
23). However, there remains an interest in replicating this effect with humans with different paradigms and training
protocols.
One of the best studies showing insensitivity to reward devaluation in humans is a psychology study. While it does
not have neuroscience data, it is included here because it is a clear illustration of this reward devaluation test. (10)
found that people were more likely to overeat stale (“devalued”) popcorn in a context which cued habitual behavior
of eating popcorn (e.g. watching a movie in a cinema) but not when they were in an unfamiliar popcorn-eating
context (e.g. watching a movie in a meeting room, or eating the popcorn with their non-dominant hand) which did
not cue the habitual behavior. The effect captures a two-way interaction (cinema vs. meeting room or dominant
vs. non-dominant hand and whether the popcorn received was stale or fresh) and is evident only among individuals
classified as “high habit” (vs. medium or low habit) per self-reports on a 7-point scale used to assess habit strength
for eating popcorn in movie theaters. The same study found that for low or medium habit individuals, or high habit
individuals in novel contexts, like eating popcorn in a meeting room, behavior remained sensitive to reward value
and decreased in frequency when the popcorn was stale (devalued).
There is some indirect evidence that the reliability of the reward history is associated with habit formation, as
measured by insensitivity to reward devaluation (which is a central concept in (52)’s theory of neural autopilot).
For example, (53) track the fraction of trials on which a model-free system or goal-directed system recommends the
optimal choice. These fractions are then compared by an “arbitrator” system which weights actions recommended by
the two systems according to their recommendation accuracy. Using fMRI they report evidence for neural circuitry
consistent with this arbitration computation.
There is substantial evidence that habit formation is stronger after training on a random interval schedule
compared to a random ratio schedule. (These “schedules” are terms from animal learning theory referring to how
often, and based on what behavior, reinforcing rewards are delivered.) In ratio schedules reward is based on the
animal’s behavior– e.g., each lever press has an independent 1/30 chance of being rewarded in a so-called RR30
schedule. In interval schedules, reward is based on passage of time– e.g. every 2 seconds, there is a 10% chance
of reward being delivered upon the next lever press, in a so-called RI20 schedule. In RI20, if the animal waits 20
seconds, then reward will be delivered with certainty upon the next lever press. In RR schedules there is no such
guarantee.
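To make the two schedules concrete, here is a small simulation of our own construction (the once-per-second press timing and the specific parameters are arbitrary illustrations, not the protocols of the cited experiments):

```python
import random

def simulate_schedules(seconds=10_000, seed=7):
    """Simulate an animal pressing a lever once per second under two schedules.

    RR30: each press independently rewarded with probability 1/30.
    RI20: every 2 seconds a reward is 'armed' with probability 0.1
          (so one arms every ~20 s on average); once a reward is armed,
          the next lever press (here, the press in that same one-second
          step) collects it.
    """
    rng = random.Random(seed)
    rr_rewards = sum(rng.random() < 1 / 30 for _ in range(seconds))

    ri_rewards, armed = 0, False
    for t in range(seconds):
        if t % 2 == 0 and rng.random() < 0.1:
            armed = True          # reward becomes available
        if armed:                 # this press collects the armed reward
            ri_rewards += 1
            armed = False
    return rr_rewards / seconds, ri_rewards / seconds

rr_rate, ri_rate = simulate_schedules()
```

Because reward on the interval schedule accrues with the passage of time rather than with each press, slowing down costs little under RI but is costly under RR; this difference in the press-reward contingency is the point of contrasting the two schedules in habit experiments.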
Reward reliability in (52) is defined by the absolute value of reward prediction errors. Consider the simple
case in which rewards, normalized to 1, are random at rate $p$. Then the expected reward reliability is
$p(1-p) + |(0-p)(1-p)| = 2p(1-p)$. This expression has a maximum at $p = .5$ and declines for lower and higher
reward rates. In most animal learning paradigms, the reward rate $p$ is well below .5, so that reward reliability is
increasing in the reward rate $p$. In the experiments of (54), the reward rate is higher in the interval schedule than
in the ratio schedule training, and habit formation is stronger after interval schedule training. This is a small piece of
evidence consistent with a role for reward reliability. In addition, (55) reports slightly stronger habit formation when
the reward rate is higher (RR15 compared to RR30 in experiments 2 and 1), also consistent with a role for reward
reliability.
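The arithmetic is easy to check numerically. The sketch below treats an RRn schedule as a per-press reward rate $p = 1/n$, which is our simplification:

```python
def expected_abs_rpe(p):
    """Expected absolute reward prediction error when a unit reward
    arrives with probability p and the learned prediction equals p:
    p*|1 - p| + (1 - p)*|0 - p| = 2*p*(1 - p)."""
    return 2 * p * (1 - p)

# For reward rates well below .5 the expression rises with p, so
# RR15 (p ~ 1/15, ~0.124) exceeds RR30 (p ~ 1/30, ~0.064); the
# expression peaks at p = .5.
rr15 = expected_abs_rpe(1 / 15)
rr30 = expected_abs_rpe(1 / 30)
```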
Habitual behavior that is automatic is accompanied by measurable psychological and biological features, including
faster response times, limited attention during choice (50), and degraded declarative memory (explaining the basis for
choice when asked, see (56)).^2 These attributes can be studied using a range of measurement tools, some of which
are more portable outside of a laboratory setting, including eye-tracking methods to measure attention.

^2 Studying two patients with large MTL lesions, (57) found neurotypical-level performance in an overtrained discrimination task with
1.4 Economics
Economic theories and empirical tests have generally used the term "habit" in one way: to describe history-dependent
"adjacent complementarity" of goods or services.^3 The theories are motivated by strong evidence of empirical
correlation between past and current consumption. These models therefore specify consumption utility as a function
of actual immediate consumption relative to a reference point or 'consumption habit' (60-62).
This approach was never empirically microfounded in psychology or neuroscience but it is mentioned prominently
in the earliest studies creating a foundation for intertemporal choice. (63) wrote: “One cannot claim a high degree
of realism for [consumption insensitivity], because there is no clear reason why complementarity of goods could not
extend over more than one time period”.
In conventional microeconomic consumer theory, “complements” are pairs of goods X and Y which increase each
other’s marginal utilities when consumed together—that is, the marginal utility of X is greater if you have more Y.
Familiar examples of complements include hot dogs and hot dog buns, hammers and nails, and computer hardware
and software. Koopmans's point is that complementarity could extend to the same good consumed in adjacent
periods (called "adjacent complementarity"). Rather than treating hot dogs and hot dog buns as complements,
yesterday's hot dog consumption and today's are considered as possible complements.
In one macro-finance specification (64), the crucial variables are current consumption $C_t$ and habit $X_t$. Utility
depends on past aggregate consumptions $C_{t-1}, C_{t-2}, \ldots$ through another equation. In that specification

$$U_t = \frac{(C_t - X_t)^{1-\gamma} - 1}{1-\gamma}$$

(and $X_t$ is related to previous consumption levels in a complicated way).
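A quick numerical illustration of this functional form (the values of consumption, habit, and $\gamma$ are arbitrary, chosen only for the example):

```python
def habit_utility(c, x, gamma):
    """Period utility ((C_t - X_t)**(1 - gamma) - 1) / (1 - gamma),
    defined for consumption above the habit level (c > x)."""
    assert c > x, "surplus consumption must be positive"
    return ((c - x) ** (1 - gamma) - 1) / (1 - gamma)

# With gamma = 2, identical consumption yields much lower utility
# when the habit stock sits closer to consumption:
low_habit = habit_utility(c=2.0, x=0.5, gamma=2.0)
high_habit = habit_utility(c=2.0, x=1.5, gamma=2.0)
```

Because utility is defined over the surplus of consumption above habit, a consumer with a high habit stock is much more sensitive to a drop in consumption, which is what generates the asset-pricing implications in this literature.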
Such preference assumptions were used in macroeconomics and finance to explain facts which are puzzling in
specifications in which utility depends only on consumption (65, 66). (64) motivate their specification with the
following hypothesis:^4 "repetition of a stimulus diminishes the perception of the stimulus and responses to it"
(pg. 208). This is indeed a property of sensory systems which are adaptive. However, these types of "repetition
suppression" are very short-run (e.g., seconds to minutes or days). Whether the same kind of history-dependent
adaptation works for, say, quarterly consumption by a household is an open question.
(68) derives a set of axioms relating the functional form of habitual history-sensitivity to underlying principles
that are mathematically equivalent. The functional representation of utility is:

$$U_h(c) = \sum_{t=0}^{\infty} \delta^t \, u\!\left(c_t - \sum_{k=1}^{\infty} \lambda_k h_k(t)\right)$$

where $h_k(t)$ is the habit consumption history $k$ periods in the past and $\lambda_k$ is a decay factor which weighs more
distant consumption history less.

^2 (cont.) no declarative memory or conscious awareness. Thus, lesion patients could perform the task automatically. However, performance was
completely degraded to random on a minor task variant. The two patients also learned the task about as quickly as four monkeys did.

^3 Another form of habit is the idea that the discount factor depends on consumption (58). They appeal to an intuitive concept of "habits
of thrift" or luxurious spending hypothesized by (59) (pg. 337-338) (with no evidence) which link more income to less patience. This
concept is theoretically interesting but appears to be empirically counterfactual, as much evidence suggests higher income is associated
with more patience, rather than less patience.

^4 The phenomenon they are describing is similar to reward prediction learning or, in perception, is called "repetition suppression" (67).
It would be useful to explore further even a highly speculative link between these psychological foundations and the hypothesized micro-foundation
for macroeconomics.
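This representation can be evaluated directly on a finite consumption path. The sketch below is our construction, not from (68): it uses log utility as a placeholder for $u$, geometric decay weights $\lambda_k = \lambda^k$, and truncates the sums at the length of the data.

```python
import math

def u_habit(consumption, delta=0.95, lam=0.3):
    """Discounted habit-adjusted utility
       sum_t delta**t * u(c_t - sum_k lam**k * c_{t-k}),
    with u = log (a placeholder) and geometric decay weights lam**k,
    so more distant consumption history gets less weight."""
    total = 0.0
    for t, c in enumerate(consumption):
        # habit stock: decay-weighted sum of the t previous consumptions
        habit = sum(lam ** k * consumption[t - k] for k in range(1, t + 1))
        total += delta ** t * math.log(c - habit)
    return total
```

On this construction a rising path such as `[1.0, 1.1, 1.2, 1.3]` yields higher utility than the same quantities consumed in falling order, because each period is evaluated relative to the habit stock built up by earlier consumption.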
A bolder extension of adjacent complementarity is called “rational addiction (RA)” (69). In this approach, current
utilities depend on consumption history, due to adjacent complementarity, much as in the (68) formalization. But it
is also coupled with self-awareness of the history-dependent structure and planning about the future. In this model,
“rationally addicted” people understand that if they consume more X today, they will value X tomorrow more highly.
The key prediction of the RA model is that current consumption will depend on current prices and will also
depend on expected future prices. For example, once they hear that a large cigarette tax increase will take place
soon, rationally-addicted smokers might quit a habit abruptly - before the increase occurs. They'll quit right away
because they prefer, today, to be an ex-smoker at time T when the tax goes up; otherwise, continuing to smoke at
T will be too expensive.
Both the macro-finance and RA specifications are natural in economics because the primitives in economic analyses
are stable preferences, Bayesian beliefs, and budget constraints. Habit can then enter the theory in one of those
three ways. The default approach is to define habit as current preference depending on past consumption.
Conventional economic theory with these ingredients does not include learning, RPE, or reward reliability. There
is also no implicit cost of mental effort. And there is no attempt to relate the history-dependent model to adaptive
functionality or to neural implementation.
Most economic empirical studies using the RA approach treat the fact that history-dependent consumption could
be present in a wide range of goods and activities as a provocative prediction. “People can be addicted not only
to harmful goods like cigarettes, alcohol, and illegal drugs, but also to activities that may seem to be physically
harmless, such as sports participation, shopping, listening to music, watching television, working, etc.” (70) The
RA approach does make the non-obvious prediction that current behavior depends on expectations of the future, in
sharp contrast to the neuroeconomic habit model which is not forward-looking.
There are many studies of RA. Two limitations run through this earlier empirical work: (1) most of the early
evidence uses very coarse time scales (e.g., quarterly tax receipts to measure state-by-state cigarette consumption);
and (2) estimates of the expected future price component are not very good. Expected future prices are usually
proxied by past prices, and these proxies may not be independent of current consumption. Even very sophisticated
tests on coarse quarterly data have very limited power to test whether there is actually forward-planned RA.
(71) demonstrate the kinds of biases that can lead to results consistent with RA even when the basic data-generating
process has no actual adjacent complementarity mechanism. The central test of the forward-looking
property of RA is whether current consumption is increasing in (expected) future consumption. Simulations show
that when the consumption time series is highly autocorrelated (as is typical), even if there is no history-sensitivity,
the RA prediction can spuriously appear to hold. However, other diagnostic features of these tests (such as inferred
discount factors reasonably close to 1) can also fail in both artificial and actual data sets.
An illustrative example of how history-sensitivity is used in empirical practice is (72). He derived a tractable
way to test whether optimal consumption with habit can be rationalized nonparametrically, in the sense that one
can find some set of inferred utilities, satisfying simple restrictions like GARP and extended to allow adjacent
complementarity, which fits a data set on consumption. The logic of this exercise is that if no set of inferred utilities
can “rationalize” the data, then the specification of stable utilities with adjacent complementarity is incorrect.
Crawford applied the method to data on quarterly smoking expenditures for 3,134 Spanish households. The
best-fitting habit lag is two quarters. Most households' data (91%) can be rationalized using two lags (compared
to only 24% with one lag), but the power of the two-lag test is not very high (only 20% of randomly-generated data
would fail the test for optimization).
History-sensitivity is seen again and again in many types of data: it is established in internet use (73) and
employment (74). In marketing it is attributed to inertia or brand loyalty (75-77).^5
The boldest predictions of the RA theory seem to be just flat wrong. In theory, rational addicts should take
advantage of volume discounts on addictive goods, because they will optimally self-ration the goods over time.
There is no direct evidence of this pattern (e.g., alcoholics buying in bulk and self-rationing), although it could be
that rational addicts are liquidity-constrained. Instead, (79) found in lab and field data that "vice" goods, such as
cigarettes, are often purchased in smaller quantities, have higher quantity discounts, and have lower price elasticities
than similar virtue goods, regardless of liquidity constraints. There is also substantial evidence that restricting hours
at which addictive goods are sold (typically alcohol) reduces consumption (80). This is inconsistent with rational
forward-looking optimization by addicts, who should plan their shopping around reduced hours.
For the purposes of this paper, we also note that the economic RA model does not connect with what is known
from psychology and cognitive neuroscience. The latter is loosely constrained by the philosophy that a good
understanding of a behavior should have an explanation for adaptive functionality, algorithmic specificity, and neural
implementation.
(81) introduced an economic model of a specialized idea of context-sensitivity, from clinical psychology and
neuroscience, to explain cue-sensitivity of addiction. In the model, the presence of a state-dependent cue actually
changes utility. If a cue value is $x_i$, and consumption activity is $a_i$ (= 0 or 1), then the period-specific utility is
assumed to be

$$u(a_i, x_i) = u(a_i - \lambda x_i) + (1 - a_i)\eta,$$

where $(1 - a_i)\eta$ is the expected utility of the next-best activity if the target activity is not done.
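The mechanics are easy to see in a toy computation (our sketch; the linear $u$ and all parameter values are arbitrary placeholders):

```python
def period_utility(a, x, lam=1.0, eta=0.5, u=lambda z: z):
    """u(a - lam*x) + (1 - a)*eta, where a in {0, 1} is the consumption
    decision and x is the cue value; u is linear here for simplicity."""
    return u(a - lam * x) + (1 - a) * eta

# Abstaining (a = 0) is painless without the cue (x = 0) but costly
# in its presence (x = 1): the 'craving' term -lam*x kicks in.
no_cue_abstain = period_utility(a=0, x=0.0)   # = eta = 0.5
cue_abstain = period_utility(a=0, x=1.0)      # = -1.0 + 0.5 = -0.5
```

The comparison makes the point in the text explicit: the mere presence of the cue lowers the utility of not consuming, which is how the specification encodes craving.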
This is a simple economic translation of the evidence about biological addiction from opponent processes to
maintain homeostasis, but it is not a biologically plausible general model for everyday habits. An implication of the
Laibson specification is that mere presence of the cue creates negative utility (through unpleasant craving) if the
good isn't consumed. In the PCS view, the presence of a cue is typically not pleasant or unpleasant; it just predicts
behavior through a neural autopilot mechanism driven by reward reliability, rather than via unpleasant craving which
addicts "self-medicate" to avoid.
^5 There is some evidence of what (78) calls "situations" (the same as our cues or states) influencing choices, but it has not been an
active area of research.
(82) create a more general model tailor-made to understand addictive habits. Preferences are influenced by a
numerical state which catalogs consumption history, by how frequently states trigger an involuntary "hot" craving
state, and by some other features. Their model is not so much a specific theory as it is a modelling language to describe
different kinds of addiction patterns and invite empirical estimation.
The Laibson homeostatic cues model and Bernheim-Rangel M-states model are two examples of state-sensitivity
of preferences which go beyond the history-sensitivity in so much empirical work. In their models, the relevant
state, on which preferences depend, is a cue or history variable. The idea is that what people subjectively value
could depend on an environmental or contextual state (83). Nothing is new or surprising about that: umbrella
preference goes up when it's raining. Historically, however, economists were reluctant to allow too broad a range of
state-sensitivity of preferences for fear, probably legitimate, that doing so would lead to an erosion of falsifiability.
Common examples in which state-sensitivity is central are examples like health, in which health quality (a physical
state) clearly influences subjective value of leisure or work.
1.5 Political Science
Political scientists have studied habit in the domain of voting. Voting is interesting for our purposes because it is very
infrequent— particularly compared to hand-washing or gym attendance, and to other activities studied in empirical
applied psychology. It is similarly far from the animal learning-based concept of motor habituation and insensitivity
to reward change from hundreds of rapid trials in short time spans, on the time scale of hours or days.
We do not know if the term “habit” should be associated with voting at all, because it seems so dissimilar in
many dimensions to animal learning, exercise or eating habits, and most others reviewed in this section. And it also
could be that what is called acquisition of a voting habit is better explained by a change in costs and benefits or
other causal explanations (84).
We include a discussion of these studies because at least one voting study (85) did link to habit scales. We also
think that interested readers who are unfamiliar with these studies can learn a little and judge for themselves whether
these political behaviors should be considered habits and how they can be better studied.
Most of the studies show that voting behavior does exhibit some context-sensitivity. Researchers have mostly
focused on how a disruption to total voting ("turnout") in one election affects subsequent turnout. The disruptions
that are diagnostic are exogenous "natural experiments," which suggest possible causality, as if an experimental
treatment changed voting for some people but not similar others. If skipping voting one time breaks one's "taste for
voting" - reducing the likelihood of voting in future elections - then voting is considered habitual, in the history-dependent
sense, in these articles. And as with many behaviors, past voting behavior predicts future behavior
(86).
These studies are of three types.
Observational studies seek to isolate the impact of an "as-if random" inducement to vote in one year on voting
turnout in subsequent election years (87, 88).
Experimental studies apply a truly random assignment to an inducement to vote and test whether it increases
future voting (89, 90).
"Quasi-experimental" causal identification studies use regression discontinuity designs which take advantage of
strict voting eligibility requirements - e.g. to test whether two similar people born days apart (91) vote differently
in the future, when one got a lucky chance to vote before while the other did not.
A challenge, as pointed out by (92), is that these designs often suffer from weak identification of short-run and
long-run effects. For example, if an inducement in the treated election encourages people to "do their civic duty,"
this social pressure may endure into the next election, independent of habit formation. Similarly, the early
inducement may lead to increased interest in politics, which then causes the later turnout.
More recent work has acknowledged that behavior alone is not enough to label an action as habitual, citing the
psychology literature on automaticity and context-sensitivity as inspiration for creating a self-report voting habit
index akin to the SRHI. (85) developed a 7-item scale for "voter turnout habit" and validated it using UK and US
voting data. He argues that the "cost" of voting (93) will be lower when voting becomes habitual.
Other papers have looked at the consistency of the environmental context of voting behavior by looking at voting
rates following a change in home address or voting location. This approach is a special case of our general focus
on PCS, except for a narrow range of context variables and a long time between behaviors (and, unfortunately, also
a change in cost).
For example, (94) found that the consolidation of voting precincts in Los Angeles County decreased overall turnout
substantially (which was partially, but not fully, offset by an increase in absentee votes). This change is consistent
with the hypothesis that removal of the environmental cue of the physical precinct deterred some individuals from
voting. (95) found that both self-reported previous voting and not moving (situational consistency) were associated
with voting. Research into other contextual cues, like time of day, which may be predictive of voting behavior has
been more limited (85).
2 Dataset Descriptions
The purpose of this section is to provide additional detail on the two main datasets used in this paper, along with a
full list of the context variables which were used to train the LASSO models.
2.1 Hand Washing Data
Hand-hygiene data came from Proventix, a company which uses RFID technology to monitor whether a healthcare
provider sanitized their hands during a hospital shift. The initial dataset tracks 5,246 hospital healthcare workers
across 30 different hospitals. The dataset spans about a year, with over 40 million data points, each corresponding
to whether an individual did or did not wash their hands. Each data point has a timestamp, room, and hospital
location.
We further infer several other attributes, such as time of day and individual-level variables such as whether the
healthcare worker complied (washed their hands) in this room previously. A full list of the variables that are used
follows in Section S2.3.1.
2.2 Gym Attendance Data
We obtain check-in data from a North American gym chain, containing information for 60,277 regular gym users
across 560 gyms. The data spans fourteen years, from 2006 to 2019. There were initially over 12 million data points,
each corresponding to one gym check-in. Each data point is accompanied by a timestamp, gym location, and other
information about the gym (such as the number of amenities and wi-fi availability, which we do not use in this
analysis).
We further infer several other attributes, such as the day of the week and individual-level variables such as the
time since gym membership creation. A full list of the variables that are used follows in Section S2.3.2.
2.3 Description of Context Variables
2.3.1 Hand washing data
Time at work: minutes elapsed since the start of a person's shift.
Rooms visited in shift: number of rooms the caregiver had visited previously during the shift.
Compliance last opportunity: an indicator variable of whether the caregiver washed her hands at the last opportunity.
Time since last opportunity (mins): minutes elapsed since the last opportunity.
Time since last compliance (mins): minutes elapsed since the last compliance.
Frequency of patient encounter: percentage of time in patient rooms as a fraction of time worked. At any moment in the shift, this is defined as (cumulative time spent in patient rooms) / (cumulative time elapsed in shift).
Entry indicator (0-1): an indicator of whether the opportunity to wash is an entry (1) into a room (as opposed to an exit (0) from a room).
Previous unit compliance: average compliance (%) across previous shifts in the current hospital unit.
Unit frequency: % of previous shifts in the current hospital unit.
Previous day-of-week compliance: average compliance (%) across previous shifts on the current day of week.
Day-of-week frequency: % of previous shifts on the current weekday (compared to other weekdays).
Previous room compliance: average compliance (%) across previous shifts in the current room.
Room frequency: % of time spent working in the current room (compared to other rooms in the same hospital).
Room compliance of others: average compliance rate (%) of other caregivers in the current room.
Compliance last shift: compliance rate in the last shift before the current one.
Days since start: number of days worked since the observed start date.
Time off: hours elapsed between the end of the last shift and the current shift.
Streak: number of consecutive shifts less than 36 hours apart.
Hour-slot fixed effects: time of day is divided into four categories: 12am-6am, 6am-12pm, 12pm-6pm, and 6pm-12am.
Compliance within a room: an indicator of whether the caregiver washed her hands in this room in the current opportunity (e.g. if she washed upon entry, this variable's value for the exit opportunity is equal to 1).
Month of the year.
2.3.2 Gym attendance data
Streak: number of consecutive days with gym visits prior to the current day.
Day-of-week streak: number of consecutive corresponding day-of-the-week gym visits prior to the current day.
Time lag: number of days since the last gym visit.
Attendance last 7 days: number of gym visits during the last 7 days.
Month of the year.
Day of the week.
3 Analysis Details
The purpose of this section is to provide additional detail on our analysis methodology. Specifically, we provide
a formal description of the LASSO models and include a discussion of the model output (predictability) vs. a
traditional measure of habit (frequency). We then provide a formal description of the exponential model used to fit
the behavioral data to identify the speed of habit formation, and discuss model fit.
3.1 Individual LASSO Regressions
We apply LASSO logistic regressions at the individual level. LASSO is an acronym for "Least Absolute Shrinkage
and Selection Operator". LASSO is good for our purposes because it can improve out-of-sample predictive accuracy
by reducing variance without significantly increasing bias. It is also useful for feature selection, shrinking many
insignificant variable coefficients toward 0. This property winnows down a large set of variables to smaller subsets.
For each individual, we select about 15% of their time series data as a holdout ("test") set on which we assess
the performance of the model. For the remaining ("training") data, we train the model based on the following logit
specification:
$$P(Y_t = 1) = \frac{\exp(\beta_0 + S_t \beta_1)}{1 + \exp(\beta_0 + S_t \beta_1)},$$

where $t$ indexes time, $Y_t$ is the binary outcome variable indicating whether a habit was executed at time $t$, and $S_t$ is
a vector of state variables. The list of state variables $S_t$ used in each dataset can be found above in Section 2.3. The
LASSO includes a penalty term weighting the sum of absolute values of coefficients, $\|\beta_1\|_1$, by the tuning parameter
$\lambda$. The coefficients $\hat{\beta}_1$ are chosen to minimize the following loss function:

$$L(\beta \mid \lambda) = -\log\!\left[\prod_{Y_t = 1} \frac{\exp(\beta_0 + S_t \beta_1)}{1 + \exp(\beta_0 + S_t \beta_1)} \prod_{Y_t = 0} \frac{1}{1 + \exp(\beta_0 + S_t \beta_1)}\right] + \lambda \|\beta_1\|_1.$$
As is standard with machine learning applications, we use stratified 5-fold cross-validation to pick the optimal
$\lambda$. In particular, the holdout set and the 5 folds used in cross-validation are selected such that the proportions of
observations with $Y_t = 1$ in each of them are the same.
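The estimation loop can be sketched end-to-end. The code below is our illustrative reimplementation on synthetic data, not the paper's code: an L1-penalized logistic regression fit by proximal gradient descent, showing how LASSO shrinks irrelevant context variables exactly to zero while keeping the truly predictive one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lasso_logit(X, y, lam=0.1, step=0.5, iters=3000):
    """Minimize (1/n) * negative log-likelihood + lam * ||b||_1 by
    proximal gradient descent; the intercept b0 is unpenalized."""
    n, d = X.shape
    b0, b = 0.0, np.zeros(d)
    for _ in range(iters):
        p = sigmoid(b0 + X @ b)
        g = X.T @ (p - y) / n          # gradient of the smooth part
        b0 -= step * np.mean(p - y)
        z = b - step * g
        # soft-thresholding: coefficients inside the threshold -> exactly 0
        b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return b0, b

# Synthetic check: only the first of five standardized "context"
# variables actually drives the binary outcome.
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 5))
y = (rng.random(400) < sigmoid(2.0 * X[:, 0])).astype(float)
b0, b = lasso_logit(X, y)
```

In this sketch the coefficient on the true predictor survives (shrunk toward zero), while the four noise variables are regularized to exactly zero, which is the winnowing property the text relies on.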
To ensure reasonable performance of the LASSO model, we omit from the analytical sample people with too few
observations (fewer than 365 for the gym data and fewer than 1,000 for the hand-hygiene data), and with unbalanced
habit execution rates (< 5% or > 90% for gym attendance, and < 20% or > 80% for hand-hygiene). Methods like
LASSO are known to not estimate and predict well with small samples or unbalanced samples of binary outcomes.
We report the summary statistics of the full sets of LASSO coefficients in Tables S2 and S3 below.
Table S2: Context Predictors of Gym Attendance
Summary statistics for the context cue variables for the individual LASSO models, sorted by variable importance
(sometimes called "feature importance" by machine learning scholars). Importance is measured by averaging the
absolute values of the standardized LASSO coefficients across individuals. The Q1, Median, and Q3 columns present
the first, second, and third quartile coefficient values for the sample. The columns % zero, % positive, and % negative
are the percentage of the individual LASSO models that had coefficients with zero, positive, and negative values,
respectively. For more detailed descriptions of the context predictors, see S2.3.2.

Variable                | Importance | Q1    | Median | Q3    | % zero | % positive | % negative | General predictive effect
Time lag                | 1.25       | -1.40 | -0.34  | -0.02 | 22     | 2          | 76         | 74
(Time lag)^2            | 0.92       | 0.00  | 0.00   | 0.86  | 57     | 39         | 3          | 36
Monday                  | 0.36       | 0.00  | 0.11   | 0.50  | 32     | 57         | 11         | 46
Tuesday                 | 0.35       | 0.00  | 0.10   | 0.49  | 33     | 56         | 11         | 45
Wednesday               | 0.34       | 0.00  | 0.06   | 0.46  | 35     | 54         | 12         | 42
Attendance last 7 days  | 0.34       | 0.09  | 0.29   | 0.47  | 9      | 82         | 8          | 74
Thursday                | 0.31       | 0.00  | 0.00   | 0.40  | 37     | 49         | 14         | 35
Friday                  | 0.28       | 0.00  | 0.00   | 0.27  | 36     | 39         | 24         | 15
Day-of-week streak      | 0.23       | 0.00  | 0.11   | 0.30  | 25     | 69         | 7          | 62
Streak                  | 0.22       | 0.00  | 0.00   | 0.14  | 36     | 40         | 24         | 16
Saturday                | 0.22       | -0.04 | 0.00   | 0.15  | 35     | 36         | 29         | 7
(Streak)^2              | 0.15       | -0.13 | 0.00   | 0.00  | 46     | 13         | 42         | 29
(Day-of-week streak)^2  | 0.13       | -0.16 | 0.00   | 0.00  | 48     | 9          | 43         | 34
December                | 0.11       | -0.05 | 0.00   | 0.00  | 47     | 16         | 38         | 22
January                 | 0.10       | 0.00  | 0.00   | 0.05  | 45     | 39         | 16         | 23
July                    | 0.09       | 0.00  | 0.00   | 0.01  | 48     | 27         | 26         | 1
August                  | 0.09       | 0.00  | 0.00   | 0.00  | 48     | 27         | 25         | 2
September               | 0.09       | -0.01 | 0.00   | 0.00  | 49     | 25         | 27         | 2
October                 | 0.09       | -0.01 | 0.00   | 0.00  | 49     | 22         | 29         | 7
November                | 0.09       | -0.02 | 0.00   | 0.00  | 49     | 21         | 31         | 10
February                | 0.08       | 0.00  | 0.00   | 0.02  | 48     | 31         | 21         | 10
March                   | 0.08       | 0.00  | 0.00   | 0.01  | 49     | 29         | 22         | 7
April                   | 0.08       | 0.00  | 0.00   | 0.01  | 49     | 29         | 22         | 7
May                     | 0.08       | 0.00  | 0.00   | 0.00  | 49     | 26         | 25         | 1
June                    | 0.08       | 0.00  | 0.00   | 0.01  | 48     | 29         | 23         | 6
Table S3: Context Predictors of Hospital Hand Washing
Summary statistics for the context cue variables (including interactions) for the individual LASSO models, sorted by
variable importance. Importance is measured by averaging the absolute values of the standardized LASSO coefficients
across individuals. Q1, Median, and Q3 columns are the coefficients at the first (lowest), second, and third quartiles of
the sample. % zero, % positive, and % negative are the percentage of individual LASSO models which had coefficients
with zero, positive, and negative values, respectively. For more detailed descriptions of the context predictors, see
Supplementary Materials S2.3.

Variable                                        | Importance | Q1    | Median | Q3    | % zero | % positive | % negative | General predictive effect
Compliance last shift                           | 0.77       | 0.66  | 0.70   | 0.92  | 0      | 100        | 0          | 100
Entry indicator                                 | 0.35       | -0.33 | -0.28  | -0.04 | 18     | 5          | 77         | 72
Compliance last opp. x Entry indicator          | 0.13       | 0.00  | 0.00   | 0.21  | 49     | 47         | 4          | 43
Compliance last opp. x Time since last opp.     | 0.12       | 0.00  | 0.00   | 0.00  | 54     | 1          | 45         | 44
Compliance within a room                        | 0.12       | 0.00  | 0.01   | 0.14  | 33     | 51         | 16         | 35
Time since last opp.                            | 0.09       | 0.00  | 0.00   | 0.00  | 61     | 24         | 15         | 9
(Time since last opp.)^2                        | 0.08       | 0.00  | 0.00   | 0.00  | 74     | 7          | 18         | 11
Room compliance of others                       | 0.08       | 0.04  | 0.05   | 0.12  | 32     | 66         | 2          | 64
Time at work                                    | 0.08       | 0.00  | 0.00   | 0.00  | 54     | 4          | 42         | 38
Compliance last opp. x (Time since last opp.)^2 | 0.07       | 0.00  | 0.00   | 0.00  | 74     | 20         | 5          | 15
Prev. room compliance                           | 0.07       | 0.03  | 0.04   | 0.11  | 32     | 65         | 2          | 63
Compliance last opp.                            | 0.05       | 0.00  | 0.00   | 0.07  | 47     | 45         | 7          | 38
Time at work x 6am-12pm                         | 0.05       | 0.00  | 0.00   | 0.00  | 78     | 10         | 12         | 2
Time since last compliance                      | 0.05       | 0.00  | 0.00   | 0.00  | 64     | 9          | 27         | 18
Time at work x 12pm-6pm                         | 0.04       | 0.00  | 0.00   | 0.00  | 73     | 10         | 17         | 7
(Time since last compliance)^2                  | 0.03       | 0.00  | 0.00   | 0.00  | 75     | 17         | 8          | 9
12am-6am                                        | 0.03       | 0.00  | 0.00   | 0.00  | 68     | 22         | 10         | 12
Frequency of patient encounter                  | 0.03       | 0.00  | 0.00   | 0.01  | 58     | 31         | 12         | 19
Time at work x Patient encounter                | 0.03       | 0.00  | 0.00   | 0.00  | 64     | 8          | 28         | 20
Days since start                                | 0.02       | 0.00  | 0.00   | 0.00  | 83     | 9          | 8          | 1
6am-12pm                                        | 0.02       | 0.00  | 0.00   | 0.00  | 80     | 7          | 13         | 6
12pm-6pm                                        | 0.02       | 0.00  | 0.00   | 0.00  | 77     | 12         | 11         | 1
Room frequency                                  | 0.02       | 0.00  | 0.00   | 0.00  | 63     | 19         | 19         | 0
Time at work x 6pm-12am                         | 0.02       | 0.00  | 0.00   | 0.00  | 82     | 10         | 7          | 3
(Time off)^2                                    | 0.01       | 0.00  | 0.00   | 0.00  | 84     | 8          | 8          | 0
October                                         | 0.01       | 0.00  | 0.00   | 0.00  | 81     | 10         | 9          | 1
November                                        | 0.01       | 0.00  | 0.00   | 0.00  | 82     | 10         | 8          | 2
December                                        | 0.01       | 0.00  | 0.00   | 0.00  | 81     | 10         | 9          | 1
March                                           | 0.01       | 0.00  | 0.00   | 0.00  | 82     | 9          | 10         | 1
April                                           | 0.01       | 0.00  | 0.00   | 0.00  | 80     | 10         | 11         | 1
May                                             | 0.01       | 0.00  | 0.00   | 0.00  | 80     | 9          | 10         | 1
June                                            | 0.01       | 0.00  | 0.00   | 0.00  | 80     | 10         | 10         | 0
July                                            | 0.01       | 0.00  | 0.00   | 0.00  | 79     | 11         | 10         | 1
August                                          | 0.01       | 0.00  | 0.00   | 0.00  | 78     | 11         | 11         | 0
September                                       | 0.01       | 0.00  | 0.00   | 0.00  | 82     | 9          | 9          | 0
Day-of-week frequency                           | 0.01       | 0.00  | 0.00   | 0.00  | 77     | 13         | 10         | 3
Rooms visited in shift                          | 0.01       | 0.00  | 0.00   | 0.00  | 83     | 8          | 9          | 1
6pm-12am                                        | 0.01       | 0.00  | 0.00   | 0.00  | 84     | 7          | 8          | 1
Prev. day-of-week compliance                    | 0.01       | 0.00  | 0.00   | 0.00  | 79     | 8          | 13         | 5
Prev. unit compliance                           | 0.01       | 0.00  | 0.00   | 0.00  | 78     | 10         | 13         | 3
Streak                                          | 0.01       | 0.00  | 0.00   | 0.00  | 78     | 8          | 15         | 7
Time off                                        | 0.01       | 0.00  | 0.00   | 0.00  | 85     | 8          | 7          | 1
Unit frequency                                  | 0.01       | 0.00  | 0.00   | 0.00  | 72     | 20         | 8          | 12
February                                        | 0.00       | 0.00  | 0.00   | 0.00  | 82     | 8          | 9          | 1
3.2 Model Selection Challenges in LASSO
LASSO is used to choose variables that are predictive (which we call "predictors"). It is well-known that the derived
coefficients of those predictors in LASSO will be different from the estimates of coefficients for the same variables
derived from OLS or logit regressions. This is because LASSO includes what is called a "regularization penalty":
the sum of squared residuals is added to a weight times the absolute magnitude of the coefficients.^6
This difference between the OLS or logit estimate for a variable and the LASSO coefficient for the same variable is an inevitable consequence of LASSO trying to balance the “bias-variance tradeoff”. That is, in LASSO the derived predictor coefficients are biased away from their true values: they are typically “shrunk” toward zero to reduce the regularization penalty for large values of β̂. However, this bias is helpful for prediction because it reduces the variance of the coefficient estimates, which helps to reduce in-sample overfitting. This property in turn guards against a big drop from good in-sample fits to much poorer out-of-sample predictive fits.
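The shrinkage described above is easy to see numerically. The following is a minimal illustrative sketch using scikit-learn on simulated data (not our analysis code; the sample size, number of variables, and penalty weight `alpha=0.1` are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
# only the first three variables are truly in the data-generating process
beta_true = np.array([2.0, -1.5, 1.0] + [0.0] * (p - 3))
y = X @ beta_true + rng.normal(size=n)

# standardize so the penalty treats all variables on a common scale
Xs = StandardScaler().fit_transform(X)

ols = LinearRegression().fit(Xs, y)
lasso = Lasso(alpha=0.1).fit(Xs, y)

# LASSO coefficients are shrunk toward zero relative to OLS, and some
# are regularized exactly to zero (dropped from the predictor set)
print("mean |OLS coef|:  ", np.abs(ols.coef_).mean().round(3))
print("mean |LASSO coef|:", np.abs(lasso.coef_).mean().round(3))
print("coefs set to zero:", int((lasso.coef_ == 0).sum()))
```

OLS leaves every coefficient nonzero (the null variables get small noisy estimates), while LASSO both shrinks the true coefficients and regularizes most of the null ones exactly to zero.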
The fact that LASSO estimates variable coefficients with bias creates two challenges that are special to LASSO and related regularization methods. These challenges are called “model selection consistency” and “stability”. Model selection consistency is the answer to this question: if a variable W is truly part of the data-generating process, will LASSO choose W as a nonzero variable? If the answer is yes, the model selection is consistent (i.e., similar across OLS or logit and LASSO). There are conditions under which LASSO is consistent in this sense (Zhao and Yu (96)). However, these conditions are asymptotic, so they do not tell us anything very informative about finite samples. Furthermore, there is no procedure for estimating standard errors of LASSO coefficients without very restrictive assumptions (although Bayesian LASSO can produce credible intervals; Park and Casella (97)).
Without standard errors, it is difficult to know how close the coefficients of the predictors selected by LASSO are to the true coefficients that are generating behavior. Some true variables may be mistakenly regularized to zero by LASSO. This consistency concern is why we strive to be careful in the text by talking about “predictors” rather than “variables”.
“Stability” refers to a different consequence of high collinearity between variables that is special to LASSO predictor estimation (see (98, 99) for discussion). High collinearity is always a problem, but it has a special consequence in LASSO. If two variables W and Z are correlated highly enough, LASSO will typically choose only one of W or Z to have a nonzero value. This is because the sum of squared residuals is not reduced much differently by including both variables (because they contribute shared variance in explaining the dependent variable), but the regularization penalty increases if both variables are included: the penalty would be a multiple of |β̂_W| + |β̂_Z| rather than, for example, a penalty of |β̂_W| if only W is included and β̂_Z = 0, so that the Z LASSO coefficient is zero. As a result, if both W and Z are included in the feature set of candidate variables, then which variable is chosen can vary arbitrarily, meaning that it changes depending on small changes in the sampled values of W and Z. (Keep in mind, however, that stability in this sense is similar to what happens in OLS: in OLS or logit both W and Z will have estimated coefficients, but their standard errors are inflated and there can be other biases in the correlated parameter estimates.)

[6] It is crucial in doing this that the variables be standardized in some way, typically by subtracting the variable sample mean and dividing by the variable sample standard deviation. Otherwise, variables that happen to be scaled in a way that creates large coefficient values are unfairly penalized.
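This instability under collinearity can be illustrated with a small simulation, again a sketch on made-up data rather than our analysis code (the correlation level, penalty weight, and sample size are arbitrary): two near-duplicate regressors compete for a single nonzero slot, and which one survives flips across resamples.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_once(seed):
    """Draw one sample with two highly correlated regressors W and Z,
    fit LASSO, and return the pair of estimated coefficients."""
    rng = np.random.default_rng(seed)
    n = 100
    w = rng.normal(size=n)
    z = w + 0.05 * rng.normal(size=n)   # corr(W, Z) is roughly 0.999
    y = w + z + rng.normal(size=n)      # both matter equally in truth
    X = np.column_stack([w, z])
    return Lasso(alpha=0.2).fit(X, y).coef_

coefs = [fit_once(s) for s in range(30)]
# index (0 for W, 1 for Z) of the larger-magnitude coefficient per resample
winners = [int(np.argmax(np.abs(c))) for c in coefs]
print("times W won:", winners.count(0), "| times Z won:", winners.count(1))
print("fits with exactly one nonzero coef:",
      sum(int(np.sum(np.abs(c) > 1e-8)) == 1 for c in coefs))
```

Even though W and Z contribute symmetrically to the data-generating process, small sampling differences decide which of the two is retained, exactly the arbitrariness described above.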
For readers who are unfamiliar with LASSO, the most important part of this discussion, by far, is this: because of model selection inconsistency and instability, the set of nonzero LASSO variables and their predictor coefficients is not guaranteed to overlap with the true coefficient values. This difference is especially crucial, as noted throughout the main text of our paper, when one is trying to make statistical judgments in comparing coefficient values. For example, suppose one person in our gym attendance sample has a positive effect of a Monday dummy variable on gym attendance, and another person has a negative effect of the same Monday dummy variable. There is no statistical procedure to test whether those effects are significantly different between the two people, because the LASSO coefficients do not have standard errors. Therefore, any statements about heterogeneity do not have the usual backing of significance testing. It is still a true statement, and one that may be of some exploratory or managerial value, that one person's predictor LASSO coefficient is positive and the other's is negative. But we cannot have any statistical confidence that the coefficient signs are different, or any measure of how large the coefficient difference is compared to their variances (standard errors).
Next we describe some procedures for evaluating how badly the stability problem (arising from collinearity) might be undermining the conclusions in our specific data. These procedures were created by us in response to reader concerns, as an attempt to explore numerically whether consistency and stability are big or small problems. These analytical procedures for evaluating the impact of stability may be less useful, or even misleading, for other readers who use LASSO-type methods, as we have, to learn about predictive coefficients. (Readers are therefore encouraged to learn more, to be cautious in using predictor coefficient estimates, or to explore Bayesian credible intervals.)
Since high correlation of pairs of variables W and Z is the biggest threat to stability, we first computed a two-variable correlation matrix for each person. Then we picked out pairs of variables that had the highest correlations for a large number of individuals. These are pairs of variables for which the median absolute value of the correlation across individuals is at least 0.4. By using the median, our method finds variable pairs that are substantially correlated for at least half of the individuals. Moreover, we only look at pairs of variables in which at least one variable is “important” (defined by a large percentage of nonzero coefficients and an imbalance toward either negative or positive signs, as reported in Tables 2 & 4 in the main text).
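This screening step can be sketched as follows on made-up per-person data (the column names, number of individuals, and simulated correlation structure are all illustrative, not our actual variables):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# hypothetical data: one DataFrame of candidate context variables per person
person_data = []
for _ in range(20):
    base = rng.normal(size=50)
    person_data.append(pd.DataFrame({
        "time_at_work": base,
        # engineered to be highly correlated with time_at_work
        "rooms_visited": 0.8 * base + 0.3 * rng.normal(size=50),
        "month_dummy": rng.normal(size=50),
    }))

# absolute pairwise correlation matrix for each person
abs_corrs = [df.corr().abs() for df in person_data]

# elementwise median across people, then flag pairs with median |corr| >= 0.4
median_corr = pd.concat(abs_corrs).groupby(level=0).median()
cols = list(person_data[0].columns)
flagged = [(a, b) for i, a in enumerate(cols) for b in cols[i + 1:]
           if median_corr.loc[a, b] >= 0.4]
print(flagged)
```

Only the engineered time_at_work / rooms_visited pair clears the median |corr| ≥ 0.4 threshold; the independent variable does not.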
For each pair of such correlated variables, we separate all individuals into two groups based on a median split
of the absolute value of correlation. The two groups are individuals with variable correlations above and below
the median.
We identified whether each of the two variables was included with a nonzero magnitude or was regularized to
zero. This procedure creates four possible selection outcomes of nonzero and zero magnitudes for each of the
two variables.
For each pair of variables, we looked at the percentage of each of the four possible selection outcomes between