COMMENTARY
The apparent prevalence of outcome variation from hidden
“dark methods” is a challenge for social science
Colin F. Camerer (Humanities and Social Sciences and Computational and Neural Systems, California Institute of Technology, Pasadena, CA 91106; camerer@caltech.edu)

See companion article, “Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty,” 10.1073/pnas.2203150119.
Every working scientist knows that in the details are both
devils and angels. Lots of small design decisions have
to be made in collecting and analyzing data, and those
decisions affect conclusions. But beginning scientists, from
rookies in school science fairs to students in early years
of a rigorous Ph.D. program, are often surprised how
much small decisions matter. Despite this recognition that
details matter, when science is communicated, many small
decisions made privately by a science team are hidden
from view. It is difficult to disclose every detail (and usually
little disclosure is required). Such hidden decisions can be
thought of as “dark methods,” like dark matter, which cannot be seen directly because it does not reflect light but is evident from its other effects. The Herculean effort resulting in the new many-analyst study (1), which is the subject of my Commentary, should force a painful reckoning about the
extent of these dark method choices and their influence on
conclusions. Design decisions of each team that were coded
(107 of them) explained at most 10 to 20% of the outcome
variance. Assuming that the coding itself is not too noisy, it
seems that hidden decisions account for the lion’s share of
what different teams conclude.
In ref. 1, the authors recruited 73 teams to test the hypothesis
that “immigration reduces public support for government
provision of social policies.” Whether this hypothesis is true
is obviously an important question, especially now and
very likely in the world’s future as well. The hypothesis is
also sufficiently clear that social sciences should be able to
generate some progress toward an answer.
The teams were given data about 31 countries (mostly
rich and middle-income) from five waves of International Social Survey Programme (ISSP) data
spanning 1985 to 2016, asking six questions about the role
of government in policies about aging, work, and health.
Yearly data on immigrant stock and flow came from the
World Bank, UN, and OECD. These are the best available
data covering many countries and years in a standardized
way and are widely used.
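A hedged sketch of the shape of such a shared data set may help fix ideas: a country-year panel in which survey-based support measures are merged with immigration stocks. The values, variable names, and country codes below are invented for illustration; this is not the file distributed to the teams.

```python
# Hedged sketch of a country-year panel: survey-based support measures merged
# with immigration stocks. Values and country codes are invented; this is not
# the data set distributed to the 73 teams in ref. 1.
import pandas as pd

support = pd.DataFrame({
    "country": ["DE", "DE", "US", "US"],
    "year": [1996, 2006, 1996, 2006],
    "support_old_age": [0.74, 0.69, 0.55, 0.52],  # hypothetical mean support for government provision
})
immigration = pd.DataFrame({
    "country": ["DE", "DE", "US", "US"],
    "year": [1996, 2006, 1996, 2006],
    "foreign_born_share": [0.09, 0.12, 0.10, 0.13],  # hypothetical immigrant stock, share of population
})

# Merge on country and year to obtain one row per country-year observation.
panel = support.merge(immigration, on=["country", "year"], how="inner")
print(panel)
```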
What did they find? They found that average marginal
effects of immigrants on policy support were significantly
positive or negative in 17% and 25% of the tested models.
The remaining models (58%) had 95% confidence intervals that included zero. The range of subjective conclusions
was similar.
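To make the three-way classification concrete, the short sketch below sorts estimates by whether their 95% confidence intervals lie entirely above zero, entirely below zero, or straddle zero. The estimates and standard errors are simulated, not the teams' reported numbers.

```python
# Hedged sketch: classify estimates by whether the 95% confidence interval
# excludes zero. The effects and standard errors are simulated for illustration,
# not the average marginal effects reported by the teams in ref. 1.
import numpy as np

rng = np.random.default_rng(0)
effects = rng.normal(0.0, 0.02, size=1000)        # hypothetical average marginal effects
std_errors = rng.uniform(0.005, 0.03, size=1000)  # hypothetical standard errors

lower = effects - 1.96 * std_errors
upper = effects + 1.96 * std_errors

share_positive = np.mean(lower > 0)                # whole CI above zero
share_negative = np.mean(upper < 0)                # whole CI below zero
share_null = np.mean((lower <= 0) & (upper >= 0))  # CI includes zero

print(f"positive {share_positive:.0%}, negative {share_negative:.0%}, null {share_null:.0%}")
```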
Their next question was how well differences in esti-
mates could be explained by various sources of variance.
It might be the case that null results finding no effect come mostly from less experienced teams, or that different subjective prior beliefs influenced what teams found. But actually,
differences in measured expertise and prior beliefs made
little difference. They coded 107 separate design decisions
taken by three or more teams. These are decisions such
as choices of an estimator, measurement strategy, inde-
pendent variables, subsets of data, etc. The contribution of
these coded decisions explained only a little more than 10%
of variance in results between teams.
The authors conclude that even when trying to carefully
code these design decisions (in order specifically to shed
light on typically dark methods), the coded variables do
not explain much. Eighty percent of the variance in team-reported results is due to other variables that were not coded. Fig. 1A illustrates both the variability in team outcomes and the weak relation between high-level design features and those outcomes.
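To make the variance-accounting logic concrete, here is a hedged sketch with simulated data: regress model-level estimates on a matrix of coded, binary design decisions and ask how much of the variance the coded decisions explain. The simple least-squares decomposition and all of the numbers are stand-ins; this is not the authors' data or their more careful analysis.

```python
# Hedged sketch of a variance-accounting exercise: how much of the spread in
# submitted estimates do coded, binary design decisions explain? Everything
# here is simulated for illustration only.
import numpy as np

rng = np.random.default_rng(1)
n_models, n_decisions = 1000, 107  # hypothetical model count; 107 coded decisions as in ref. 1

X = rng.integers(0, 2, (n_models, n_decisions)).astype(float)  # coded design choices (0/1)
coded_effects = rng.normal(0.0, 0.0015, n_decisions)           # small effect of each coded choice
dark_methods = rng.normal(0.0, 0.02, n_models)                 # unobserved ("dark") analytical variation
estimates = X @ coded_effects + dark_methods                   # simulated reported effects

# Least-squares fit of estimates on the coded decisions, then R^2 and adjusted R^2.
Xc, yc = X - X.mean(axis=0), estimates - estimates.mean()
beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
residuals = yc - Xc @ beta
r2 = 1.0 - residuals @ residuals / (yc @ yc)
adj_r2 = 1.0 - (1.0 - r2) * (n_models - 1) / (n_models - n_decisions - 1)
print(f"R^2 = {r2:.2f}, adjusted R^2 = {adj_r2:.2f}")  # most of the variance remains unexplained
```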
The challenges posed by the surprising influence of dark
methods come after almost two decades of other questions
about how well current practices cumulate scientific regu-
larity (2). Social scientists—as well as those in other fields,
especially medicine—are now well aware of the feared
and actual impacts of p-hacking, selective inference, and
both scientist-driven and editorial publication bias. A small
wave of direct replications in psychology, economics, and
in general science journals, intended to reproduce previous
experimental protocols as closely as possible, typically
found that many or most results do not replicate strongly
(3–5). (My rule of thumb is that the long-run effect size of
a genuine discovery will be about 2/3 as large as the original effect.) But most social sciences have also turned toward
self-correction, albeit at the slow pace of turning a large
oil tanker rather than a sports car. Preregistration, journal
requirements for data archiving, and Registered Reports
preaccepted before data are collected are big steps forward.
Thanks to the efforts of hundreds of scientists, we
can now draw some general conclusions from both this
new study (1) and two other recent “many-analyst” studies
(6 and 7). In all three studies, a large number of analysis
teams were given both common data of unusually high
quality and sample size and clear hypotheses to test. As
in ref. 1, any differences in results that emerge result only
from differences in teams’ methods.
In refs. 6 and 7, 70 teams were given fMRI data from a large sample of N = 108 participants who chose whether or not to accept a series of gain–loss gambles.
Fig. 1. Illustrations of dark methods and of peer scientists mispredicting outcomes and variation. (A) Average marginal effect (AME) (y-axis) plotted from low to high (x-axis) for three subjective conclusion categories reported by teams. Subjective conclusions overlap; e.g., AMEs from .00 to .015 are included in all three subjective categories. High-level design characteristics (blue intensity coding, Bottom) are not evidently correlated with either AME or conclusions. (B) Prices in an fMRI-linked scientific prediction market to predict the percentage of teams finding support for hypothesis 8. The true (“fundamental”) value is .057, but market prices overestimate that number. Market prices associated with team scientists who did the analysis (green) overestimate less than those of nonteam market traders. (C) Box-whisker plots of 164 teams’ subjective beliefs about cross-team dispersion (y-axis, log scale). Actual dispersion is much higher for the outlier-ridden full sample (coral red) than for a winsorized sample (tangerine yellow). For four of the hypotheses, the actual dispersion is close to the 97.5% top whisker of research team beliefs. (Sources: ref. 1, SI Appendix, figure S9; ref. 6, Extended Data figure 5; ref. 7, figure 5.)
Teams tested
nine specific hypotheses, derived from previous findings,
about whether specific brain regions encoded decision
variables (e.g., was ventromedial prefrontal cortex activity
larger for larger potential gains?). For about half of the nine
hypotheses, most teams came to the same conclusions,
either that there was very little activation or (in one case) a lot
of activation supporting the hypothesis. For the other four
hypotheses, there was disagreement, with (thresholded)
activation reported by 20 to 80% of teams. In looking for design decisions that explained the variance in results, the authors examined five variables. The two most important differences were
which software package was used and the smoothness
of the neural spatial map. (Maps are routinely smoothed
because localized measures of neural activity are noisy, but
the extent of smoothing is a design choice which varied.)
But each of these variables contributed only .04 to R². This
result brings a small number of basic “dark” design choices
into the light but leaves a lot unexplained.
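To see why smoothing is a consequential design choice, the hedged sketch below applies Gaussian smoothing with different full-width-at-half-maximum (FWHM) values to a synthetic, noisy two-dimensional "activation map" and counts how many pixels survive a fixed threshold. The map, kernel sizes, and threshold are arbitrary illustrations, not the pipelines used by the teams in ref. 6.

```python
# Hedged sketch: spatial smoothing as a design choice. A synthetic, noisy 2D
# "activation map" is smoothed at several FWHM values, and the number of pixels
# exceeding a fixed threshold is counted. Synthetic data only; real fMRI
# pipelines involve many more steps.
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(2)
size = 96
noise = rng.normal(0.0, 1.0, (size, size))

# Plant a weak, blob-shaped "signal" at the center of the map.
yy, xx = np.mgrid[0:size, 0:size]
signal = 0.8 * np.exp(-((yy - size / 2) ** 2 + (xx - size / 2) ** 2) / (2 * 8.0 ** 2))
raw_map = signal + noise

threshold = 0.5
for fwhm in (2.0, 5.0, 8.0):                           # arbitrary smoothing choices, in pixels
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # convert FWHM to Gaussian sigma
    smoothed = gaussian_filter(raw_map, sigma=sigma)
    n_above = int((smoothed > threshold).sum())
    print(f"FWHM {fwhm:.0f} px -> {n_above} pixels above threshold {threshold}")
```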
In ref. 7, 164 teams were given data from 720 million
trades from 2002 to 2018 for the most actively traded
derivative, the EuroStoxx 50 index future contract. They
were asked to test six hypotheses about changes in
trading activity over this span. Changes are scientifically and practically interesting because the sample spans a global financial crash, which led to changes in regulation, as well as a rise in rapid algorithmic trading and other trends.
The changes the teams estimated vary a lot across the six
hypotheses because they are annualized changes in num-
bers like order flow or market efficiency; they are not effect
sizes (although t-statistics are heavily analyzed and easier
for readers of this Commentary to understand). However,
the variation across teams—which the authors cleverly call
“nonstandard errors”—is about 1.65 times as large as the
mean SE of estimates. This cross-team dispersion is lower
but only by a little for the highest-quality research teams.
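The nonstandard-error comparison itself is a short calculation: take the dispersion of point estimates across teams and set it against the ordinary standard errors the teams report. The sketch below does this with invented numbers; the 1.65 ratio in ref. 7 is a property of their data, not of this toy.

```python
# Hedged sketch of the "nonstandard error" comparison, with invented numbers:
# dispersion of point estimates across teams versus the average of the teams'
# own reported standard errors.
import numpy as np

rng = np.random.default_rng(3)
n_teams = 164
team_estimates = rng.normal(0.01, 0.03, size=n_teams)     # hypothetical annualized changes
team_std_errors = rng.uniform(0.01, 0.025, size=n_teams)  # hypothetical reported standard errors

nonstandard_error = team_estimates.std(ddof=1)  # dispersion created by differing analytical choices
mean_standard_error = team_std_errors.mean()    # ordinary statistical uncertainty

print(f"nonstandard error / mean SE = {nonstandard_error / mean_standard_error:.2f}")
```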
The fMRI and EuroStoxx studies also compared peer scientists’ numerical predictions of how large the cross-team dispersion would turn out to be with the actual dispersion.
In the fMRI study, “prediction markets” were created in
which both research team scientists and others who were
not on teams could trade artificial assets with a monetary
value equal to the percentage of teams that accepted a
hypothesis. The market predictions overestimated the prob-
ability of hypothesis acceptance by 64%, although the rank ordering of prediction prices and outcomes across the nine hypotheses was highly correlated, and team members were more accurate than nonteam traders (team members r = .96, P < .001; nonteam members r = .55, P = .12; Fig. 1B). This
cross-hypothesis accuracy is consistent with some degree
of accuracy of science-peer predictions across treatment
effects (e.g., ref. 8).
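For readers curious how such market forecasts are scored, the hedged sketch below correlates final market prices with the fraction of teams that reported support for each hypothesis and reports the average overestimate. The prices and outcomes are invented placeholders, not the data from ref. 6.

```python
# Hedged sketch: scoring prediction-market accuracy across hypotheses.
# Prices and outcomes are invented; in ref. 6 each asset paid out according to
# the fraction of teams reporting support for that hypothesis.
import numpy as np
from scipy.stats import pearsonr

market_prices = np.array([0.80, 0.65, 0.60, 0.55, 0.50, 0.45, 0.40, 0.35, 0.30])  # hypothetical final prices
team_support = np.array([0.70, 0.30, 0.10, 0.40, 0.75, 0.05, 0.15, 0.10, 0.20])   # hypothetical support fractions

r, p_value = pearsonr(market_prices, team_support)
mean_overestimate = (market_prices - team_support).mean()

print(f"price-outcome correlation r = {r:.2f} (P = {p_value:.2f})")
print(f"average overestimate of support = {mean_overestimate:+.2f}")
```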
In the EuroStoxx study, dispersion of beliefs was 71% lower than the actual dispersion of estimates (Fig. 1C).
These results show that the research team scientists doing these analyses badly mispredicted the likely results and underestimated the cross-team dispersion. In other words, a part
of the scientific study itself was to carefully document
whether the participants were surprised by the results or
not. They were surprised. Readers should be too (please
resist hindsight bias).
The organizers of these three many-analyst studies have thought of every angle. Their evidence strongly suggests that obvious
explanations for hidden variability are just not right. The
teams were chosen and evaluated on simple measures of
expertise: In ref. 1, 83% had experience teaching data analy-
sis courses, and all of them were first required to reproduce
findings of a widely cited study (9) to join the research teams.
It does not appear that weaker or stronger research teams
(judged by experience and publications or by outside peer
review for EuroStoxx) have less outcome dispersion.
A trickier question is whether some slippery concept of
vagueness of hypothesis tests creates the hidden multi-
verse, which would not arise for a sharper hypothesis test.
It is true that testing the hypothesis “immigration reduces
public support for government provision of social policies”
seems to allow a lot of freedom to measure almost every
scientific word in the hypothesis differently (immigrant,
public support, social policies). But the data sets are the
same, so there is very limited room for measurement
differences. Furthermore, the fMRI hypotheses (6) are not
vague at all: they were taken straight from previous papers
and are very clear about what brain regions and statistical
outcomes are hypothesized to be associated. Even in that
case, for about half of the nine hypotheses, there was
substantial cross-team variation in results.
While it is difficult to know, by definition, what the hidden
dark method differences are, why are these experienced researchers so surprised by the amount of dispersion between their own
work and their peers’ work? The fact that predictions about
both outcome variability and outcome levels are so wildly
off, in the finance and fMRI many-analyst studies which
measured predictions, suggests that individual scientists do
not appreciate how different their peers’ analytical choices
are and how much results will be affected. How can evidence
about dark analytical variability exist, as shown by these
studies, and yet stay so hidden and therefore be so surprising?
A possible answer is that scientists are not immune to the “false consensus” bias, well established in psychology, in which people overestimate how widely their own views are shared when judging what others think (10). But
such a mistake is particularly surprising because science
is so open about general analytical differences. Between
large conferences, small seminars, and peer review, there
are many opportunities to debate alternative design choices
and their likely impact.
There is a large difference across the two studies (immi-
gration and EuroStoxx) in how research teams responded
to feedback. While immigration teams in ref. 1 could change
their models and resubmit revised results after seeing
what others did, “no team voluntarily opted to do this”
(except after coding mistakes). The authors (1) suggest
that more “epistemic humility” is needed. However, in the
EuroStoxx many-analyst study, there was a lot of revision
and subsequent reduction in team variability across four
steps of the study (notably, after peer reviewers commented on early-stage results and again after the five papers judged by nonteam peers to be the best were publicized to all teams). Analytical variance fell by 53% for the main sample (which excluded outliers).
This stark difference in revision rates from feedback about
other teams’ results seems to reflect norms in different
fields about some combination of humility and confor-
mity.
What’s next? Will more and better data save us? It is
not at all clear that better data will bring conclusions closer
together. The EuroStoxx data are as good as an 18-y span
could be for testing simple hypotheses about how financial
markets have changed, and there is still dispersion that
surprised the experts working with that single oceanic set of
data. One can imagine excellent new sources of data about
immigration, political reactions, and popular support for
policies. But new data sources will more likely bring an even larger combinatorial explosion of different design decisions. There is little chance that more data will lead to more convergence rather than to a proliferation of different approaches.
Hopefully, one next trend will be improvements in quantifying and promoting transparency.* Meanwhile,
it is hard to see how the regular peer-review process
can continue to credibly operate in the face of this new
evidence about the hidden analytical multiverse. When
selective peer-reviewed journals reject one paper on a
topic and accept another, they are implicitly endorsing a
combination of the methods and results of the accepted
paper compared to the methods and results of the rejected
paper. An acceptance says “We think this study used a
superior method and got us closer to the truth.” But how
can such endorsements be made with confidence when so
many of the methods are hidden?
Put more vividly, imagine if all 73 teams’ immigration
manuscripts from ref. 1 were submitted to journals over
a period of time. In the light of these results, there would be
no evidentiary basis to claim that one paper’s methods were
better and more truth-producing than another’s methods.
(Remember that measures of scientific competence or
investigator prior belief did not matter much either, so
referees who accept that those measures are irrelevant cannot fall back on such simple judgments to say yes or no.) But editors
have to make decisions and usually have to say no a lot more
than they can say yes. If referees and editors are no better
at sniffing out dark methods creating different outcomes
than these fastidious researchers (1) were, how can and
do referees decide? The pressure to decide among many
equally outstanding papers creates plenty of room for edito-
rial bias, referee-author rivalry, faddish conformity, network
favoritism, and other influences to sneak in. Even worse,
editorial choices can have large multiplier effects by guiding
other researchers, especially those with the largest career
concerns, in the directions pointed out by published articles.
Based on these results, professional organizations—
particularly societies and their journal editors—should be in
a crisis-management mode. An obvious step—which could
start tomorrow—is to help organize, fund, and commit to
publish more many-analyst multiverse studies. The power
to move scientists in a better direction is held by journals,
funding agencies, and (to some extent) rich universities. The
great news from ref. 1 and the other two studies described
here (as well as important precursors and ongoing efforts) is
that a lot of talented scientists are willing to spend valuable
time to figure out how to do science that is clearer and
cumulates regularity better, in the face of the surprising
many-analyst variance of results seen across not just the
immigration study but two other recent studies as well.
*In analyses of structural economic models, it is usually hard for readers to tell how results would differ if an assumption were violated (a type of dark method). A computable model of transparency was derived in ref. 11. A similar idea could prove useful in other social sciences.
1. N. Breznau et al., Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. Proc. Natl. Acad. Sci. U.S.A. 119, e2203150119 (2022).
2. B. A. Nosek et al., Replicability, robustness, and reproducibility in psychological science. Annu. Rev. Psychol. 73, 719–748 (2022).
3. Open Science Collaboration, Estimating the reproducibility of psychological science. Science 349, aac4716 (2015).
4. C. F. Camerer et al., Evaluating replicability of laboratory experiments in economics. Science 351, 1433–1436 (2016).
5. C. F. Camerer et al., Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat. Hum. Behav. 2, 637–644 (2018).
6. R. Botvinik-Nezer et al., Variability in the analysis of a single neuroimaging dataset by many teams. Nature 582, 84–88 (2020).
7. A. J. Menkveld et al., Non-standard errors. SSRN. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3961574 (2022).
8. S. DellaVigna, D. Pope, Predicting experimental results: Who knows what? J. Polit. Econ. 126, 2410–2456 (2018).
9. D. Brady, R. Finnigan, Does immigration undermine public support for social policy? Am. Sociol. Rev. 79, 17–42 (2014).
10. G. Marks, N. Miller, Ten years of research on the false-consensus effect: An empirical and theoretical review. Psychol. Bull. 102, 72–90 (1987).
11. I. Andrews, M. Gentzkow, J. M. Shapiro, Transparency in structural research. J. Bus. Econ. Stat. 38, 711–722 (2020).