nature human behaviour
https://doi.org/10.1038/s41562-024-02062-9
Article

Examining the replicability of online experiments selected by a decision market

Supplementary information
In the format provided by the authors and unedited
Supplementary Notes
Replication Sample Sizes
Conversion of Effect Sizes to Cohen’s d
Preregistered Hypothesis Tests and Exploratory Analyses
Protocol Deviations and Further Information on the 26 Replications
Supplementary References
Supplementary Figures
Supplementary Tables
Supplementary Notes
Below we provide further details on the replication sample sizes, the conversion of effect sizes to Cohen’s d, the preregistered hypothesis tests and exploratory analyses, and deviations from the 26 preregistered analysis plans for the 26 replications.
Replication Sample Sizes
The replications of the 26 studies, selected from the sample of 41 studies¹⁻⁴¹ published in the Proceedings of the National Academy of Sciences based on the final market prices in the decision market, were carried out with high statistical power. Replication sample sizes were based on having 90% power to detect two-thirds of the effect size reported in the original study (with the effect size converted to Cohen’s d to have a common standardized effect size measure across the original studies and the replication studies). The criteria for replication were an effect in the same direction as the original study and p < 0.05 (in a two-sided test). In cases where this power estimation led to a sample size smaller than the original one, we used the same sample size as in the original study. On average, the replication sample size was 3.5 times as large as the original study sample size (the average sample size in the 26 original studies selected for replication was 292, and the average sample size in the replications was 1,019).
We used the following formula to estimate the “replication sample size factor” (f), i.e., the factor by which the original sample size must be multiplied to have 90% power to detect two-thirds of the original effect size in a two-sided test at the 5% significance level:
\[
f = \frac{\left( \Phi^{-1}(0.975) + \Phi^{-1}(0.90) \right)^{2}}{\left( \frac{2}{3} \cdot z \right)^{2}} = \frac{3.2415...^{2}}{\left( \frac{2}{3} \cdot z \right)^{2}}
\]
where Φ⁻¹ denotes the inverse cumulative distribution function of a standard normal random variable, i.e., Φ⁻¹(0.975) and Φ⁻¹(0.90) refer to the critical values of a two-tailed 5% threshold and a one-tailed 10% threshold, respectively. The first critical value is the critical value at the 5% significance level; the second critical value is the addition needed to have 90% power (if the true effect size divided by the standard error is 3.2415, there is a 10% probability that the observed effect size will yield a z-value below 1.96 and a 90% probability that it will yield a z-value above 1.96). The value Φ⁻¹(0.975) + Φ⁻¹(0.90) = 3.2415 equals the factor one would multiply the standard error by to get the minimum effect size detectable with 90% power at the 5% significance level.
In the above formula, z is the z-value of the original study for the hypothesis test being replicated. This formula gives the factor by which to multiply the original sample size to get the replication sample size (e.g., if f = 3 and the original study included 100 observations, the replication sample size will be 300). For a t-test, we replace z with t in the formula above (and F(1, df) F-tests and χ²(1) tests are converted to t- and z-values as for the conversion to Cohen’s d; see below).
The replication sample size formula was based on the relationship between n and the standard error (se) in a standard z-test and t-test (this relationship is the same in an independent-samples test and a paired test).
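To make that relationship explicit, the following short derivation (our notation, not part of the original preregistration) recovers the factor formula from the fact that the standard error scales as 1/√n, writing β̂ for the original point estimate so that z = β̂/se(n):

\[
se(f \cdot n) = \frac{se(n)}{\sqrt{f}}, \qquad \frac{\frac{2}{3}\,\hat{\beta}}{se(f \cdot n)} = \frac{2}{3}\, z \sqrt{f} \overset{!}{=} \Phi^{-1}(0.975) + \Phi^{-1}(0.90),
\]
\[
\text{so that} \quad f = \left( \frac{\Phi^{-1}(0.975) + \Phi^{-1}(0.90)}{\frac{2}{3}\, z} \right)^{2}.
\]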
When the “replication sample size factor” (f) in the formula above was below 1, we set the replication sample size to the same as the original one (as having 90% power to detect two-thirds of the original effect size implied a smaller replication sample size); this ensured that no replication study had a smaller sample size than the original study. In these cases, the power to detect two-thirds of the original effect size in the replication study exceeded 90%, which happened for seven studies⁵,¹¹,¹⁷,²⁸,³⁰,³²,⁴⁰.
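As an illustration, here is a minimal Python sketch of this sample-size rule (the function name and example values are ours, assuming scipy is available; scipy.stats.norm.ppf supplies Φ⁻¹):

```python
import math
from scipy.stats import norm

def replication_sample_size(n_original: int, z_original: float) -> int:
    """Sample size giving 90% power to detect two-thirds of the original
    effect in a two-sided test at the 5% significance level."""
    target = norm.ppf(0.975) + norm.ppf(0.90)    # 1.9600 + 1.2816 = 3.2415...
    f = (target / ((2 / 3) * z_original)) ** 2   # replication sample size factor
    f = max(f, 1.0)                              # never below the original n
    return math.ceil(f * n_original)

# An original study with n = 100 and z = 2.10:
# f = (3.2415 / 1.40)^2 ≈ 5.36, so the replication would need 537 participants.
print(replication_sample_size(100, 2.10))
```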
Conversion of Effect Sizes to Cohen’s d
We converted the effect sizes of all the original studies and all the replication studies to Cohen’s d to have a standardized effect size measure. The effect size estimates for the original studies are always assigned a positive sign; the effect sizes in the replication studies are assigned a positive sign if a replication effect points in the same direction as in the original study and a negative sign if the effect points in the opposite direction. The estimations of Cohen’s d were based on the formulas used by Szucs & Ioannidis⁴² (Supplementary Materials, pp. 2–3) to convert test statistics obtained from independent-samples t-tests and paired t-tests to Cohen’s d units. They used the following two formulas:
\[
\text{Unpaired } t\text{-test:} \quad d = \frac{2t}{\sqrt{n}}
\]
\[
\text{Paired } t\text{-test:} \quad d = \frac{t}{\sqrt{n}}
\]
In the two above formulas, n is the sample size in the study (the number of individuals for studies that use tests based on the individual data and the number of groups for studies that use tests based on data aggregated at the group level). Note that the paired t-test formula assumes a 0.5 correlation in observations within pairs (and the formula for the independent-samples t-test assumes equal sample sizes in group 1 and group 2).
We used the unpaired t-test formula for studies using between-subjects tests and the paired t-test formula for studies using within-subject tests. F(1, df) test statistics were converted to t-values by taking the square root. Studies based on z-test statistics were converted to Cohen’s d using the same formulas as above but replacing t with z. χ²(1) statistics were converted to z-statistics by taking the square root.
For interactions between two between-subjects factors, the above unpaired t-test formula will underestimate effect sizes. Therefore, for those studies, we used the following formula to convert effect size estimates to Cohen’s d units:
\[
\text{Interaction of two between-subjects factors:} \quad d = \frac{4t}{\sqrt{n}}
\]
The basis of the above formula is that an interaction test of two between-subjects factors will increase the standard error by a factor of about two as compared to an estimate of the main effects. The following five studies estimated an interaction between two between-subjects factors: Clarkson et al.¹⁰, Côté et al.¹², Handley et al.¹⁸, Baldwin & Lammers³, and Hoffman et al.¹⁹.
There were also two studies that interacted one between-subjects variable with one within-subject variable. These two studies were converted to Cohen’s d using the unpaired t-test formula (and we coded these studies as “between-subjects tests” and “interaction tests” in the study). These two studies are Bear et al.⁴ and Morris et al.³². We also have one study that interacted two within-subject factors. This study was converted to Cohen’s d using the paired t-test formula. This study is Cooney et al.¹¹.
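Taken together, these conversion rules amount to a single multiplier on the (possibly square-rooted) test statistic. A minimal Python sketch (function names and design labels are ours, not from the original analysis code):

```python
import math

# Multipliers implied by the formulas above: d = m * statistic / sqrt(n)
DESIGN_MULTIPLIER = {
    "between": 2,             # unpaired t-test (or z-test)
    "within": 1,              # paired t-test, assuming 0.5 within-pair correlation
    "between_x_between": 4,   # interaction of two between-subjects factors
}

def to_t_or_z(stat, family):
    """F(1, df) -> t and chi2(1) -> z by taking the square root; t and z pass through."""
    return math.sqrt(stat) if family in ("F", "chi2") else stat

def cohens_d(stat, family, n, design):
    """Convert a focal test statistic to Cohen's d (the sign is set by whether
    the replication effect points in the original direction)."""
    return DESIGN_MULTIPLIER[design] * to_t_or_z(stat, family) / math.sqrt(n)

# Example: F(1, df) = 9.0 from a between-subjects test with n = 144
# gives t = 3.0 and d = 2 * 3.0 / 12 = 0.5.
print(cohens_d(9.0, "F", 144, "between"))
```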
Standard errors of Cohen’s d and confidence intervals. The t and z test statistics show the ratio between the effect size and the standard error. We derived the standard error of Cohen’s d by preserving this ratio (e.g., if a study with a t-value of 2 was converted to a Cohen’s d of 1, the standard error of Cohen’s d is 0.5). The 95% confidence intervals of Cohen’s d for studies using t-tests (or F-tests converted to a t-test statistic) were estimated as d ± se · t⁻¹(0.025), where t⁻¹(0.025) denotes the critical value of the inverse t-distribution (for the df of the t-test) at 2.5%, i.e., the 5% threshold in a two-sided test. The 95% confidence intervals of Cohen’s d for studies using z-test statistics (or χ² tests converted to a z-test statistic) were estimated as d ± se · Φ⁻¹(0.025), where Φ⁻¹(0.025) denotes the critical value of the inverse standard normal distribution at 2.5%, i.e., the 5% threshold in a two-sided test.
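A small Python sketch of this standard-error and confidence-interval computation (the function and variable names are ours; scipy supplies the inverse distribution functions):

```python
from scipy.stats import norm, t as t_dist

def d_se_and_ci(d, stat, df=None):
    """Standard error of Cohen's d preserves the effect/SE ratio of the test
    statistic; the 95% CI uses the t-distribution when df is given (t- or
    converted F-tests) and the standard normal otherwise (z or chi2 tests)."""
    se = d / stat                      # e.g., t = 2 and d = 1 give se = 0.5
    crit = t_dist.ppf(0.975, df) if df is not None else norm.ppf(0.975)
    return se, (d - crit * se, d + crit * se)

se, ci = d_se_and_ci(d=1.0, stat=2.0, df=98)
print(se, ci)   # 0.5 and roughly (0.01, 1.99)
```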
Preregistered Hypothesis Tests and Exploratory Analyses
The hypothesis tests were divided into primary and secondary hypothesis tests. All hypothesis tests were based on two-tailed p-values. We interpret a p-value below 0.5% as “statistically significant evidence” and a p-value below 5% as “suggestive evidence”, following the recommendation of Benjamin et al.⁴³.
Primary hypothesis 1: There is a positive correlation between the decision market prices and the replication outcomes for the 26 replicated studies.
This was tested using a point-biserial correlation between the final decision market prices and the replication outcomes based on the statistical significance criterion. We think of this as a “proof of concept” test of decision markets. For markets to be used as a tool to select studies to be replicated, they need to be able to predict replication outcomes to some extent, implying a positive correlation between market prices and replication outcomes. Such a positive correlation has been found in previous large-scale replication projects, but it is not obvious that those results carry over to decision markets. This test result is reported in the main text.
Primary hypothesis 2: The standardized effect size (measured in terms of Cohen’s d) is lower in the 26 replication studies than in the 26 original studies.
This was tested using a Wilcoxon signed-ranks test of the replication effect sizes versus the original effect sizes for the 26 replication studies. Previous large-scale replication studies have found that the replication effect sizes are, on average, about 50% of the original effect sizes, and we expected to observe similar replication effect sizes in this study. This test result is reported in the main text.
Secondary hypothesis 1: The replication rate is lower among the 12 studies with the lowest decision market prices than among the 12 studies with the highest decision market prices.
This was tested using Fisher’s exact test comparing the replication rate (using the statistical significance criterion) between the 12 studies with the lowest decision market prices and the 12 studies with the highest decision market prices. This test is related to primary hypothesis 1 and is an alternative “proof of concept” test of decision markets, but as it has somewhat lower power, we included it as a secondary hypothesis test. This test result is reported in the main text.
Secondary hypothesis 2: There is a positive correlation between the average survey belief of replication and the replication outcomes for the 26 replicated studies.
This was tested using a point-biserial correlation between the average survey belief about replication and the replication outcomes based on the statistical significance criterion. This tests whether the survey responses can predict replications and corresponds to the primary hypothesis 1 test, but using survey data instead of decision market prices. The average survey-predicted probability of replication for each of the 26 replication studies was estimated for those survey respondents who made at least one trade on the decision market. This test result is reported in the main text.
Secondary hypothesis 3: There is a positive correlation between the average survey belief of replication and the decision market prices for the 26 replicated studies.
This was tested using a Pearson correlation between the average survey belief about replication and the final market prices. This tests whether the predictions of the decision market and the survey are correlated. As above, the average survey-predicted probability of replication for each of the 26 replication studies was estimated for those survey respondents who made at least one trade on the decision market. This test result is reported in the main text.
Secondary hypothesis 4a: The average absolute prediction error is lower for the decision market than for the survey for the 26 replicated studies.
This was tested using a Wilcoxon signed-ranks test, defining the absolute prediction error as the absolute difference between the prediction and the replication outcome based on the statistical significance criterion. This tests whether the decision market outperforms the survey in predicting replication outcomes. As above, the average survey-predicted probability of replication for each of the 26 replication studies was estimated for those survey respondents who made at least one trade on the decision market. This test result is reported in the main text.
Secondary hypothesis 4b: The average squared prediction error (Brier score) is lower for the decision market than for the survey for the 26 replicated studies.
This was tested using a Wilcoxon signed-ranks test of the squared prediction error (the Brier score). This is a test of the same hypothesis as secondary hypothesis 4a, but using the squared prediction error (Brier score) instead of the absolute prediction error to measure prediction performance. As above, the average survey-predicted probability of replication for each of the 26 replication studies was estimated for those survey respondents who made at least one trade on the decision market. This test result is reported in the main text.
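For concreteness, the preregistered test battery can be sketched in a few lines of Python with scipy; the arrays below are random placeholders standing in for the 26 market prices, survey beliefs, replication outcomes, and effect sizes (all names are ours, not from the original analysis code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)                 # placeholder data, 26 studies
prices = rng.uniform(0.2, 0.9, 26)             # final decision market prices
beliefs = rng.uniform(0.2, 0.9, 26)            # average survey beliefs
replicated = rng.integers(0, 2, 26)            # binary replication outcomes
d_orig = rng.uniform(0.3, 0.8, 26)             # original effect sizes
d_rep = rng.uniform(0.0, 0.5, 26)              # replication effect sizes

# Primary 1 and secondary 2: point-biserial correlations with outcomes.
print(stats.pointbiserialr(replicated, prices))
print(stats.pointbiserialr(replicated, beliefs))

# Primary 2: Wilcoxon signed-ranks test, replication vs. original d.
print(stats.wilcoxon(d_rep, d_orig))

# Secondary 1: Fisher's exact test, 12 lowest- vs. 12 highest-priced studies.
order = np.argsort(prices)
low, high = replicated[order[:12]], replicated[order[-12:]]
print(stats.fisher_exact([[low.sum(), 12 - low.sum()],
                          [high.sum(), 12 - high.sum()]]))

# Secondary 3: Pearson correlation between survey beliefs and market prices.
print(stats.pearsonr(beliefs, prices))

# Secondary 4a/4b: Wilcoxon signed-ranks tests on absolute and squared
# (Brier) prediction errors, market vs. survey.
err_market = np.abs(prices - replicated)
err_survey = np.abs(beliefs - replicated)
print(stats.wilcoxon(err_market, err_survey))        # absolute error
print(stats.wilcoxon(err_market**2, err_survey**2))  # Brier score
```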
Preregistered exploratory analyses: In the exploratory analyses, we tested whether the average belief (measured on a scale from −3 to 3) about whether the pandemic has affected the probability of replication differed from zero in a one-sample t-test; this test was carried out separately for each of the 26 replication studies. We also tested whether the average across the 26 replication studies differed from zero in a one-sample t-test (i.e., we first constructed the average answer to the 26 questions for each survey respondent, so that we had one observation per respondent, and then tested whether the average of this variable differed from zero). These test results are reported in the main text and Supplementary Table 5.
We additionally tested whether the average belief about whether the pandemic has affected the probability of replication was significantly correlated with the replication outcomes based on the statistical significance criterion using a point-biserial correlation, and whether it was significantly correlated with the final decision market prices and the average survey belief of replication using a Pearson correlation. The number of observations for estimating these correlations was 26. In all these exploratory analyses, only data from those survey respondents who made at least one trade on the decision market were included. These test results are reported in the main text.
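A short Python sketch of the exploratory one-sample t-tests (the belief matrix is a random placeholder; in the actual analysis it would hold one row per eligible survey respondent and one column per replicated study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Placeholder beliefs about the pandemic's effect on replication (-3..3).
pandemic_beliefs = rng.integers(-3, 4, size=(120, 26)).astype(float)

# Per-study tests: does the average belief differ from zero? (26 tests)
per_study = [stats.ttest_1samp(pandemic_beliefs[:, j], 0.0) for j in range(26)]

# Overall test: average each respondent's 26 answers first, so that there
# is one observation per respondent, then test that mean against zero.
print(stats.ttest_1samp(pandemic_beliefs.mean(axis=1), 0.0))
```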
Protocol Deviations and Further Information on the 26 Replications
Prior to starting the survey data collection (which preceded the decision market and the replications), we preregistered an analysis plan (a replication report) for each of the 41 potential replications at OSF after obtaining feedback from the original authors. After the replications, the 26 replication reports of the implemented replications were updated with the results of the replications and also posted at OSF. Furthermore, we provided all original authors the opportunity to comment on the replications (without a particular due date), and we make the comments available as we receive them alongside the replication reports. The preregistered replication reports, the post-replication reports, and the original authors’ commentaries (if available) are available at https://osf.io/sejyp.
Below we mention any deviations from the preregistered designs and analyses and provide further information on the implementation of the replication experiments for the 26 individual replications (in cases where there were any issues with the implementation). Deviations from the protocol are also detailed in the individual replication reports for each replication posted at OSF:
Atir and Ferguson²: We preregistered to perform the same analysis as in the original article, i.e., a test of fixed effects in a mixed-effects model. While we followed the preregistered analysis exactly, the replication result is based on a z-test (rather than an F-test as reported in the original article) due to differences in the optimization routines implemented in different software applications (the replication result was estimated in Stata, the original result in SPSS).
Cheon and Hong⁹: The original authors discovered an error in the reference to the ANOVA in the “Hypothesis to replicate and bet on” section of the replication report when giving feedback on the replication results. The pre-replication version of the report erroneously reported the ANOVA result from Study 3 (instead of Study 2) of the original article. However, the focal test for the replication (a t-test), provided by the authors since it is not reported in their article, was correctly reported in the “Hypothesis to replicate and bet on” section. Thus, the reporting error affects neither the replication design nor the data analysis.
Côté et al.¹²: We erroneously preregistered that income would enter the regression analysis as a dichotomized variable. For the analysis of the replication data, we follow the original article and use participants’ continuous income reports (mean-centered) instead.
Gheorghiu et al.¹⁵: Fifteen participants were excluded from the analysis due to technical issues (e.g., problems with loading pages). We preregistered that the focal test is based on the t-test of the coefficient of interest in a mixed-effects regression. While we followed the preregistered analysis exactly, the replication result is based on a z-test due to differences in the optimization routines implemented in different software applications (the replication result was estimated in Stata, the original result in R).
Guilbeault et al.¹⁶: The replication test is carried out at the network level, where each network results in one observation. We planned to collect 56 networks (28 network observations per treatment). Furthermore, we planned to include 40 individual participants per network, so that the total number of participants would be 56 × 40 = 2,240. However, we did not manage to have 40 participants in all networks; eventually, we had only 2,001 participants across the 56 networks (i.e., we reached the planned sample size in terms of the number of networks, which is the unit of observation in the analysis, but the number of participants per network was lower than planned). We conducted 28 sessions with two networks each, but it was challenging to recruit and retain exactly 40 participants per network, so we decided to start the experiment whenever we had what we deemed a sufficient number of participants. The median network size was 37, with the smallest network having 30 participants. Not all