Nature Human Behaviour
https://doi.org/10.1038/s41562-024-02062-9
Article
Examining the replicability of online experiments selected by a decision market
Felix Holzmeister1, Magnus Johannesson2, Colin F. Camerer3, Yiling Chen4, Teck-Hua Ho5, Suzanne Hoogeveen6, Juergen Huber7, Noriko Imai8, Taisuke Imai8, Lawrence Jin9, Michael Kirchler7, Alexander Ly10,11, Benjamin Mandl12, Dylan Manfredi13, Gideon Nave13, Brian A. Nosek14,15, Thomas Pfeiffer16, Alexandra Sarafoglou10, Rene Schwaiger7, Eric-Jan Wagenmakers10, Viking Waldén17 & Anna Dreber1,2
Here we test the feasibility of using decision markets to select studies for replication and provide evidence about the replicability of online experiments. Social scientists (n = 162) traded on the outcome of close replications of 41 systematically selected MTurk social science experiments published in PNAS 2015–2018, knowing that the 12 studies with the lowest and the 12 with the highest final market prices would be selected for replication, along with 2 randomly selected studies. The replication rate, based on the statistical significance indicator, was 83% for the top-12 and 33% for the bottom-12 group. Overall, 54% of the studies were successfully replicated, with replication effect size estimates averaging 45% of the original effect size estimates. The replication rate varied between 54% and 62% for alternative replication indicators. The observed replicability of MTurk experiments is comparable to that of previous systematic replication projects involving laboratory experiments.
Can published research findings be trusted? Unfortunately, the answer to this question is not straightforward, and the credibility of scientific findings and methods has been questioned repeatedly (refs. 1–9). A vital tool for evaluating and enhancing the reliability of published findings is to carry out replications, which can be used to sort out likely true positive findings from likely false positives. A replication essentially updates the probability of the hypothesis being true after observing the replication outcome. A successful replication will move this probability towards 100%, while a failed replication will move it towards 0% (refs. 10,11).
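As a stylized illustration of this updating (an interpretation added for this summary, not notation from the article), suppose a hypothesis is true with prior probability π, and the replication tests it with significance level α and power 1 − β. Bayes' rule then gives:

```latex
% Stylized Bayesian updating after a replication outcome. The symbols
% \pi (prior), \alpha (significance level) and 1-\beta (power) are
% assumptions introduced for this illustration, not the article's notation.
P(H \mid \text{significant replication}) =
  \frac{(1-\beta)\,\pi}{(1-\beta)\,\pi + \alpha\,(1-\pi)}, \qquad
P(H \mid \text{non-significant replication}) =
  \frac{\beta\,\pi}{\beta\,\pi + (1-\alpha)\,(1-\pi)}.
```

For example, with a 50% prior, α = 0.05 and 90% power, a significant replication moves the probability that the hypothesis is true to roughly 95%, while a non-significant one moves it down to roughly 10%.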
In recent years, several systematic large-scale replication projects in the social sciences have been published (refs. 12–17), reporting replication rates of around 50% in terms of both the fraction of statistically significant replications and the relative effect sizes of the replications. Potential factors explaining these replication rates include low statistical power in the original studies (refs. 1,18,19), testing of original hypotheses with low prior probabilities (refs. 1,10,20) and questionable research practices (refs. 1,21,22). Systematic replication studies have led to discussions about improving research practices (refs. 23,24) and have substantially increased the interest in independent replications (ref. 25).
Received: 29 November 2023
Accepted: 11 October 2024
Published online: xx xx xxxx
1Department of Economics, University of Innsbruck, Innsbruck, Austria. 2Department of Economics, Stockholm School of Economics, Stockholm, Sweden. 3Division of the Humanities and Social Sciences, California Institute of Technology, Pasadena, CA, USA. 4John A. Paulson School of Engineering and Applied Sciences, Harvard University, Boston, MA, USA. 5Nanyang Technological University, Singapore, Singapore. 6Faculty of Social and Behavioural Sciences, Utrecht University, Utrecht, The Netherlands. 7Department of Banking and Finance, University of Innsbruck, Innsbruck, Austria. 8Institute of Social and Economic Research, Osaka University, Osaka, Japan. 9Lee Kuan Yew School of Public Policy, National University of Singapore, Singapore, Singapore. 10Faculty of Social and Behavioural Sciences, University of Amsterdam, Amsterdam, The Netherlands. 11Machine Learning, Centrum Wiskunde and Informatica, Amsterdam, The Netherlands. 12Independent Researcher, Vienna, Austria. 13Marketing Department, Wharton School, University of Pennsylvania, Philadelphia, PA, USA. 14Department of Psychology, University of Virginia, Charlottesville, VA, USA. 15Center for Open Science, Charlottesville, VA, USA. 16Institute for Advanced Study, Massey University, Auckland, New Zealand. 17Sveriges Riksbank, Stockholm, Sweden.
e-mail: anna.dreber@hhs.se
However, as it is time-consuming and costly to conduct replications, it has been argued that a principled mechanism for deciding which replications to prioritize is needed to facilitate the efficient and effective use of resources (refs. 25–37). Here we test the feasibility of one potential method to select which studies to replicate. Building on previous work using prediction markets to forecast replicability (refs. 38–40), we adapt the forecasting methodology to what is referred to as decision markets (refs. 41–44). The decisive distinction between prediction markets and decision markets is that prediction markets elicit aggregate-level replicability forecasts on a predetermined set of studies, whereas decision market forecasts determine which studies are going to be put to a replication test. While previous studies provide evidence that prediction market forecasts are predictive of replication outcomes (refs. 10,16,17,45), this predictive performance might not generalize to decision markets, which involve more complex procedures and incentives. The performance of decision markets as a tool for selecting which empirical claims to replicate has not been systematically examined. Note that a decision market in itself does not provide a mechanism for selecting studies for replication; it has to be combined with an objective function specifying which studies to replicate (an example of an objective function would be to replicate the studies with the lowest probability of replication). For decision markets to be potentially useful for selecting studies for replication, it first has to be established that the predictions of the decision markets are associated with the replication outcomes. To provide such a 'proof of concept' of using a decision market as a mechanism to determine which studies to replicate, we first identified all social science experiments published in the Proceedings of the National Academy of Sciences (PNAS) between 2015 and 2018 that fulfilled our inclusion criteria for (1) the journal and period; (2) the platform on which the experiment was performed (Amazon Mechanical Turk; MTurk); (3) the type of design (between-subjects or within-subject treatment design); (4) the equipment and materials needed to implement the experiment (the experiment had to be logistically feasible for us to implement); and (5) the results reported in the experiment (at least one main or interaction effect with P < 0.05 reported in the main text). On the basis of these inclusion criteria, we identified 44 articles, 3 of which were excluded owing to a lack of feasibility, leaving us with a final sample of 41 articles (refs. 46–86; see Methods for details on the inclusion criteria). For each of these articles, we identified one critical finding with P < 0.05 that we could potentially replicate (see Methods for details and Supplementary Table 1 for the hypotheses selected for each of the 41 studies).
We then invited social science researchers to participate as forecasters in both a prediction survey and an incentivized decision market on the 41 studies. In the survey, the forecasters independently estimated the probability of replication for the 41 studies. In the decision market, they could trade on whether the result of each of the 41 studies would replicate. Participants in the decision market received an endowment of 100 tokens corresponding to USD 50, and 162 participants made a total of 4,412 trades. Traders in the market were informed about the preregistered decision mechanism: the 12 studies with the highest and the 12 studies with the lowest market prices were to be selected for close replication; in addition, 2 randomly chosen studies (out of the remaining 17 studies) would be replicated to ensure incentive compatibility, with participant payoffs scaled up by the inverse of each study's selection probability under the decision rule (see Methods for details). For incentive compatibility, all 41 included studies need to have a strictly positive probability of being selected for replication, which is ensured by having at least one randomly selected study. Otherwise, traders would be incentivized to trade only on those studies most likely to be chosen according to the decision rule.
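As a rough sketch of this selection-and-scaling logic (written for this summary and not the authors' code; the data structures, function names and token handling are assumptions), the decision rule could look as follows:

```python
import random

def select_for_replication(final_prices, n_extreme=12, n_random=2, seed=None):
    """Illustrative decision rule: the studies with the n_extreme highest and
    n_extreme lowest final market prices are selected, plus n_random studies
    drawn at random from the remainder.

    final_prices: dict mapping study id -> final market price in [0, 1].
    Returns (selected_ids, selection_prob), where selection_prob gives each
    study's probability of selection conditional on the final prices.
    """
    rng = random.Random(seed)
    ranked = sorted(final_prices, key=final_prices.get)  # ascending by price
    bottom, top = ranked[:n_extreme], ranked[-n_extreme:]
    remainder = ranked[n_extreme:-n_extreme]
    randomly_chosen = rng.sample(remainder, n_random)

    selection_prob = {s: 1.0 for s in bottom + top}
    selection_prob.update({s: n_random / len(remainder) for s in remainder})
    return bottom + top + randomly_chosen, selection_prob

def scaled_payoff(token_payoff, selection_probability):
    """Scale a trader's payoff on a randomly selected study by the inverse of
    its selection probability, keeping expected payoffs comparable."""
    return token_payoff / selection_probability
```

With 41 studies, each of the 17 non-extreme studies is selected with probability 2/17, so payoffs on those contracts are scaled by a factor of 17/2 = 8.5.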
All replication experiments, just like all original studies, were conducted on Amazon Mechanical Turk (MTurk), and the same sample restrictions and exclusion criteria as in the original studies were applied, which guards against concerns about potential moderating effects of cultural differences in replications (refs. 14,87). Replication sample sizes were determined to have 90% power to detect 2/3 of the effect size reported in the original study at the 5% significance level in a two-sided test (with the effect size estimates having been converted to Cohen's d to provide a common standardized effect size measure across the original studies and the replication studies; see Methods for details). If the sample size calculations led to replication sample sizes smaller than in the original study, we targeted the same sample size as in the original study. The average sample size in the replications (n = 1,018) was 3.5 times as large as the average sample size in the original studies (n = 292).
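As an illustration of this power calculation (a minimal sketch under the simplifying assumption of a two-group between-subjects comparison on Cohen's d, using statsmodels rather than the authors' own code; the example numbers are hypothetical):

```python
import math
from statsmodels.stats.power import TTestIndPower

def replication_sample_size(original_d, original_n_per_group,
                            power=0.90, alpha=0.05):
    """Per-group sample size giving 90% power to detect 2/3 of the original
    Cohen's d in a two-sided two-sample t-test, never targeting fewer
    participants per group than the original study."""
    target_d = (2 / 3) * original_d
    n_needed = TTestIndPower().solve_power(effect_size=target_d, alpha=alpha,
                                           power=power, alternative='two-sided')
    return max(math.ceil(n_needed), original_n_per_group)

# Hypothetical example: an original study reporting d = 0.40 with 100
# participants per group calls for on the order of 300 participants per group.
print(replication_sample_size(original_d=0.40, original_n_per_group=100))
```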
The replication results for the 26 MTurk experiments selected by the decision market constitute the second contribution of this project. Systematic evidence on the replicability of online experiments in the social sciences is lacking, and concerns about the quality of online experiments in general—and MTurk studies in particular—have been raised (refs. 88–94). Needless to say, the replication results pertain only to the single focal result selected per paper, and the replication outcome does not necessarily generalize to other results reported in the original articles (refs. 95,96). For convenience, however, we refer to the replications as 'replication of [study reference]'. In addition, our assessment of the most central result may differ from that of the original authors.
Preregistering study protocols and analysis plans has been proposed as a means to reduce questionable research practices. While the empirical evidence is still limited, some recent studies suggest that these practices enhance the credibility of published findings (refs. 97–99), although potential issues with preregistration have also been raised (refs. 100–102). Before starting the survey data collection (which preceded the decision market and the replications), we preregistered (refs. 103,104) an analysis plan ('replication report') for each of the 41 potential replications at OSF after obtaining feedback from the original authors (https://osf.io/sejyp). After the replications had been conducted, the replication reports of the 26 studies selected for replication were updated with the results of the replications (and potential deviations from the protocol) and were posted to the same OSF repository. We also preregistered an overall analysis plan at OSF before starting the data collection, detailing the study's design and all planned analyses and tests (https://osf.io/xsp6g). Unless explicitly stated otherwise, all analyses and tests reported in the paper were preregistered and adhere exactly to our preregistered analysis plan. The Supplementary Notes detail any deviations from the planned design and analyses for the 26 replications.
We preregistered two primary replication indicators and two primary hypotheses. The two primary replication indicators are the relative effect size of the replications and the statistical significance indicator for replication (that is, whether or not the replication results in a statistically significant effect with P < 0.05 in the same direction as the original effect); the latter was the replication outcome predicted by forecasters in the survey and the decision market.
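For concreteness, the binary indicator can be sketched as a small function (an illustration with assumed variable names and a two-sided P value as input, not the authors' analysis code):

```python
def significant_same_direction(original_effect, replication_effect,
                               replication_p_value, alpha=0.05):
    """Statistical significance indicator: the replication counts as successful
    if its two-sided P value is below alpha and the replication effect has the
    same sign as the original effect."""
    same_direction = original_effect * replication_effect > 0
    return bool(same_direction and replication_p_value < alpha)

# Example: an original positive effect replicated with d = 0.12 and P = 0.03
# is classified as a successful replication under this indicator.
print(significant_same_direction(0.45, 0.12, 0.03))  # True
```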
The statistical significance indicator is a binary criterion of replication and is based on testing the hypothesis for which the original study found support using standard null hypothesis significance testing. The indicator crudely classifies replications as failed or successful depending on whether the replication study yields evidence in support of the original hypothesis at a particular significance threshold. (Critics of null hypothesis significance testing, or of privileging a P value of 0.05, will justifiably object to this crude classification; that is why it is only one of several indicators that we report.) A replication classified as failed based on this indicator, however, does not imply that the estimated replication effect size is significantly different from the original estimate (see more on this below). To keep the false negative risk at bay and to be informative, the statistical significance indicator calls for well-powered replications, as in this study (refs. 105,106). However, a limitation of this indicator for well-powered replication studies is that it may classify a replication as successful even if the observed