Appendix A: Qualitative Findings from the Arid Lands Project
Qualitative research has the advantage that it can provide the substantive details necessary to understand how complex systems work. It provides the context to identify conflicting incentives and design flaws exacerbating the risk of fraud. While forensic audits and digit analysis help us identify specific instances and levels of likely fraud, they do not provide all the information required to design better monitoring systems to control future fraud. This section draws from thousands of pages of interviews with people familiar with the World Bank Arid Lands Resource Management Project. To understand how the project operated on the ground, we draw upon the insider knowledge of project employees and beneficiaries, contractors, consultants, civil servants, World Bank employees, investigative reporters, politicians, and members of civil society (see Ensminger (2017) for more details).
The Arid Lands project functioned within the corrupt institutional environment of Kenya. In 2009, the Transparency International Corruption Perceptions Index ranked Kenya 146th out of 180 countries (Transparency International, 2009). Both then and now, Kenya qualifies as a systemically corrupt country: corruption and impunity are the norm, and the political system facilitates the theft of government resources, including those from international donors, such as this one. Independent interviews and the World Bank’s Integrity Vice Presidency’s forensic audit provide cross-corroborating details pointing to high-level government complicity in this alleged theft.
The specified flow of project funds was from the Kenyan Treasury to the project headquarters, then to districts, and from there to the villages. According to diverse sources, the reality was that there were kickbacks flowing up and out at every level. Demands for kickbacks began with senior government officials external to the project. It is alleged that headquarters staff met some of those demands with funds embezzled from their headquarters budget. However, the project specified that the bulk of the funds had to be wired to the districts, which posed a challenge for headquarters staff intent on recapturing some of the funds. Interviewees report that this occurred in the form of monthly cash “envelopes” sent from districts back to headquarters as kickbacks. Some districts were able to avoid many such requests from headquarters because their districts were home to powerful national political actors who provided protection. But in many cases, this may not have meant less embezzlement, just different recipients.
Even accounting for the corrupt environment in which this project operated, the fraud risk of the project was exacerbated by poor design. Two of these design flaws can be directly linked to resulting weaknesses in the monitoring systems: staff selection and staff discretion in the choice of which villages received projects.
It is often said that the “tone at the top” matters. This is arguably even more the case when the surrounding institutional environment is systemically corrupt. Many of the senior staff in the Arid Lands project were seconded from their permanent ministry jobs, to which they expected to return, thus creating conflicts of interest and dual loyalty. This arrangement produced pressure to engage in fraud in collaboration with their home ministries. The project was effectively plugged directly into existing corruption networks that siphoned funds from the project upward to senior politicians and government civil servants. These features differentiate the project from a more successful World Bank community-driven development project in Indonesia (the KDP). Specifically, because the designers of the Indonesian project understood that they were operating in a similarly corrupt institutional environment, they went out of their way to create recruitment mechanisms independent of corruption centers in the government (Guggenheim, 2006).
Given the pressure on the top layer of Arid Lands management to kick funds upward, in addition to their desire for personal accumulation, it was important that they have obedient subordinate staff beneath them, especially in the districts. According to numerous sources, this was achieved by hiring staff who were underqualified for their jobs. Many did not have the minimum educational qualifications required for their jobs and were not subjected to competitive selection. Their high opportunity costs meant that they were more likely to comply with corrupt demands from headquarters.
Another design flaw resulted from granting the district officers nearly complete discretion over the selection of villages receiving projects. Project guidelines specified that selected villages would choose their own committees to manage the finances and monitor the project. In reality, the district officers were often approached by savvy villagers who agreed to collaborate with the officers in exchange for negotiated kickbacks from the village project (see Ensminger (2017) for details). As co-conspirators with the district offices, the village oversight committees aligned with the district staff against the interests of their own villagers. Many alternative designs would have improved upon this one. For example, the Indonesian KDP project employed a competitive village selection model for projects (Chavis, 2010).
The design flaws in staffing and village selection contributed to many of the monitoring issues in the project. Village committees were tasked with monitoring their own projects, together with district project staff, but as we have noted, they were collaborating in the fraud. Villagers themselves faced information asymmetries and incentives that hindered whistleblowing. First, it was not in the interest of either the district officers or the village committee to share the project specifications with the community. Without knowing what they were supposed to be receiving, it was impossible for villagers to know if funds were being misused. Second, the villagers were easily intimidated. The intended beneficiaries of these micro projects were truly the world’s poorest citizens, living on less than $2 per day. They were grateful for any benefits from the project. It took years for individual villagers to begin to protest, but given the extent of complicity in the project, who were they going to complain to? Villagers who did complain were often bought off cheaply. If they persisted, the village was threatened that it would be cut off from all future projects. This was the result of vesting monopoly discretion for the allocation of projects with the district offices; their leverage over villagers was all but absolute.
Given all the alleged embezzlement in this project, it is worth exploring how the World Bank’s internal supervision processes failed to catch the ongoing fraud. Numerous Kenyan government and World Bank offices signed off on regular financial reviews. A task team leader (TTL) from the World Bank was assigned to overall supervision, and the TTL occasionally brought in missions of overseas experts. The Kenya National Audit Office conducted annual audits of all of the project’s offices, and the TTL occasionally commissioned special audits from the Nairobi branches of international audit firms for subsets of districts.
One explanation for poor World Bank supervision is misaligned incentives. World Bank financial management staff, task team leaders, and outside missions are resource and time constrained. World Bank project managers themselves perceive that the Bank does not create the right incentives for them to engage in monitoring and evaluation (Berkman, 2008; Mansuri & Rao, 2013, p. 302). To the extent that task team leaders are rewarded by the size of their project portfolios, finding evidence of large-scale fraud in one’s own projects is not likely to be career-advancing. Conflicts of interest also extended to the task team leader’s management of the outside experts brought in to provide periodic oversight. Many staff on this project commented that the same experts appeared time and again to oversee the project; they felt that fresh eyes that were less friendly with project management would have been more likely to report problems. Outside experts may also face conflicts of interest, including real or perceived pressure to give positive evaluations to continue their relationship with the task team leader and to stay in good graces with the World Bank. These conflicts of interest are analogous to those between firms and outside auditors.
Both standard internal and external auditing of this project failed to catch most of the kinds of abuses flagged by the World Bank’s forensic audit. Numerous interviewees described the friendly relations enjoyed between the project staff and the regular Kenyan auditors who visited headquarters and the districts annually. These visits were characterized as more socializing than examination of accounts, and the same auditors returned year after year. The project officers were less worried about professional or legal ramifications if the auditors found issues than they were that this would increase the leverage that auditors had over the office to extract a higher bribe to clean up the report. A particularly compelling report about the bribing of auditors came from a petrol station owner in a district capital: he explained that he always knew ahead of time when the project’s auditors were about to arrive in town. He did business with the project, and because his retail business required it, he almost always had large sums of cash on hand. Just before the auditors arrived, the project staff would visit him to collect 200,000 Kenyan shillings (about $3,000) to pay the auditors. These funds were repaid in over-invoiced petrol. The World Bank task team leader also ordered periodic audits of select districts from international firms in Nairobi. According to interviewees who were closely involved, those audits were just as compromised as the ones run by the Kenyan National Audit Office.
The qualitative investigation of this project points to many ways in which project design contributed to fraud risk and the reasons why standard World Bank supervision failed to catch it. What happened with the findings of the forensic audit speaks volumes about the enduring systemic nature of corruption in Kenyan institutions. Upon completion of their audit, the Integrity Vice Presidency of the World Bank filed their report (World Bank Integrity Vice Presidency, 2011), conducted a joint exercise with the Kenya National Audit Office to validate their results, and also made that report public (World Bank Integrity Vice Presidency and Internal Audit Department, Treasury, Government of Kenya, 2011). In a highly unusual action, the Kenyan Government was required to repay $3.8 million USD of the inappropriately accounted funds. The World Bank also terminated the project, which is highly unusual for a project that already had a board date set for its 5-year renewal. INT then submitted their supporting audit evidence to the Kenyan Anti-Corruption Agency (KACC) for follow-up investigation. To the best of our knowledge, no further investigation was undertaken by the Kenyan government and no one from the project was indicted or prosecuted. Most senior staff were still in their posts many years later or had been promoted. Several of the most senior staff were immediately promoted to high-level Presidential appointments upon the closing of the project.
Appendix B: Simulations on Benford’s Law

Appendix B.1: Simulations on Different Data Generating Processes
In this appendix, we show how Benford’s Law is the appropriate null distribution for expenditure data even when humanly tampered digits may exist in the underlying price data, and how existing statistical tests may miss important patterns by failing to disaggregate data.
B.1.1: Underlying Price Manipulation

First, we consider whether Benford’s Law is the appropriate distribution for financial data that arise when underlying prices are subject to digit preferences. That is, our goal is to exhibit that the evidence of misreporting in World Bank transactions is not just a relic of manipulated underlying prices that could be a broader Kenyan phenomenon. We show that underlying prices cannot explain the phenomena we find in our World Bank dataset.

Janvresse and De La Rue (2004) show that Benford’s law arises when data are drawn from uniform distributions whose maximum is a random number drawn from a log-uniform distribution. This “mixture” of uniform distributions of different magnitudes produces data conformant with Benford’s law.
This first simulation proceeds as follows: we generate n observations, where each observation is the sum of price times quantity among k line items. The number of line items k is different for each observation, and it is drawn from a uniform distribution between 1 and 100. For each group of k line items, we draw a maximum price from a log-uniform distribution between 1 and 10,000. Then, we draw k prices from a uniform distribution with that maximum. For each price, we independently allow for contaminating digit preferences: with a 20% probability, each price has 1 digit replaced with either a 2 or an 8, reflecting a preference for even numbers. Finally, we draw a maximum quantity from a log-uniform distribution between 1 and 100, and draw k quantities from a uniform distribution with that maximum. Quantities are chosen without digit preference contamination. The observation is then the sum of the price times quantity values.
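As a concrete illustration, the data-generating process above can be sketched in a few lines of Python. This is a minimal sketch under our own reading of the procedure; function names and the exact random-number mechanics are ours, not the original simulation code.

```python
import random

def contaminate(price: float) -> float:
    """With 20% probability, replace one digit of the integer part of the
    price with a 2 or an 8, mimicking a vendor's digit preference."""
    if random.random() < 0.20:
        digits = list(str(int(price)))
        pos = random.randrange(len(digits))
        digits[pos] = random.choice(["2", "8"])
        return float("".join(digits))
    return price

def simulate_observation() -> float:
    """One reported observation: the sum of k (price x quantity) line items."""
    k = random.randint(1, 100)                # number of line items
    max_price = 10 ** random.uniform(0, 4)    # log-uniform on [1, 10,000]
    prices = [contaminate(random.uniform(1, max_price)) for _ in range(k)]
    max_qty = 10 ** random.uniform(0, 2)      # log-uniform on [1, 100]
    quantities = [random.uniform(1, max_qty) for _ in range(k)]
    return sum(p * q for p, q in zip(prices, quantities))

observations = [simulate_observation() for _ in range(10_000)]
```

The log-uniform draws are implemented as 10 raised to a uniform exponent, which is the standard construction for a log-uniform variable.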
This setup reflects a realistic data-generating process where human preferences contaminate underlying price data, but both line items and quantities are drawn from untampered uniform distributions. Prices from this dataset will not reflect a uniform or Benford distribution, because 20% of the data will have 1 digit replaced with either the digit 2 or the digit 8. However, in theory, the final observations should not reflect the underlying price preferences because they contain multiple line items that have been multiplied by quantities and summed.
This simulation confirms our theoretical predictions, and the reported data are still Benford-conforming. Appendix Figure B.1 shows the first-digit Benford’s Law chi-square test (Panel A), and the all-digits-beyond-the-first Benford’s Law chi-square test (Panel B). Both are Benford distributed, with p = 0.2084 and p = 0.229, respectively. Indeed, in this simulation, the most common digit in Panel B is 3, not statistically significant, reflecting the fact that digit preferences for 2 or 8 in underlying price data, however legitimate, will not be reflected in the overall reported data. This supports our conclusion that the high prevalence of 2 or 8 in the reported World Bank data are evidence of the manipulation of the reported data themselves.
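The first-digit test reported in Panel A can be reproduced with a standard chi-square goodness-of-fit statistic against the Benford frequencies. Below is a stdlib-only sketch of our own; it reports the raw statistic (8 degrees of freedom) rather than a p-value, to avoid a SciPy dependency.

```python
import math
from collections import Counter

# Benford's Law first-digit probabilities: P(d) = log10(1 + 1/d).
BENFORD_FIRST = [math.log10(1 + 1 / d) for d in range(1, 10)]

def first_digit(x: float) -> int:
    """Leading nonzero digit of a positive number."""
    s = str(abs(x)).lstrip("0.")
    return int(s[0])

def benford_chi2(values) -> float:
    """Chi-square statistic of observed first digits against Benford's Law.

    With 8 degrees of freedom, the 5% critical value is about 15.51;
    statistics below it are consistent with Benford's Law."""
    counts = Counter(first_digit(v) for v in values if v != 0)
    n = sum(counts.values())
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in zip(range(1, 10), BENFORD_FIRST))
```

For example, applying `benford_chi2` to the simulated observations above should produce a small, non-significant statistic, matching the p > 0.2 results in the figure.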
Appendix Figure B.1: Line-Item Totals where Underlying Prices are Manipulated

Notes: This simulation demonstrates that even when we begin with price data that is randomly manipulated to contain extra 2s and 8s to reflect the digit preferences of vendors, once those receipts are multiplied by quantity and summed with others to arrive at a line-item entry for the project, they conform to Benford’s Law. These figures test conformance to Benford’s law in the first digit (Panel A), and all other digits (Panel B). Despite the preference for 2s and 8s in prices, overall data still conform to Benford’s law, with p > 0.2 for each case. This shows that the manipulation visible in our project is not the result of unusual price preferences by vendors.
B.1.2: Pooling Data from Different Sources

Second, we consider the importance of disaggregating data. We conduct a second simulation that asks whether manipulated data can be detected by a traditional Benford’s law analysis that pools data from different reporters with different biases. Our simulation shows the value of disaggregating data and of using multiple digit places beyond the first digit.
The second simulation proceeds as follows. We consider 10 “districts,” representing distinct reporters. Each district reports 1,000 observations, each corresponding to an item on a financial statement. We begin with Benford-conforming data for each district but allow each district’s reporter to have 2 preferred digits, 0 through 9, chosen randomly. With a 20% probability, each observation has 1 digit changed to a preferred digit from that district’s reporter.
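A sketch of this pooled-reporters setup is below. It is our own minimal implementation: the magnitudes (4 to 8 digits) are an assumption borrowed from the later simulation in Appendix B.2, and replacing a leading digit with a preferred 0 is allowed to shorten the number, a simplification.

```python
import random

def benford_value() -> int:
    """A Benford-conforming integer: 10**U with U uniform on [3, 8), so the
    significand is log-uniform and values run from 4 to 8 digits."""
    return int(10 ** random.uniform(3, 8))

def district_report(n: int = 1000) -> list:
    """One district's report: Benford-conforming data contaminated by the
    reporter's two preferred digits with 20% probability per observation."""
    preferred = random.sample("0123456789", 2)
    report = []
    for _ in range(n):
        digits = list(str(benford_value()))
        if random.random() < 0.20:
            digits[random.randrange(len(digits))] = random.choice(preferred)
        report.append(int("".join(digits)))
    return report

districts = [district_report() for _ in range(10)]
```

Each simulated district can then be tested individually or pooled, mirroring the aggregate versus disaggregated comparison in Appendix Figure B.2.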
Appendix Figure B.2 shows the result of this second simulation. Panel A shows the first digits from all 10 simulated districts, which conform to Benford’s Law in aggregate, p = 0.2256. Panel B shows the power of disaggregation, with the first digits of a single district, which are not Benford conforming, p = 0.0008985. Panel C presents the test of all-digits-beyond-the-first for conformance to Benford’s law from all districts, with p = 0.00487.

This second simulation shows that, even when Benford-conforming data are contaminated by digit preferences, an overall test of first digits may fail to detect manipulation if different reporters exhibit different digit preferences that “wash out.” Disaggregation, and the use of digits beyond the first place, can solve these issues and provide additional statistical power.
Appendix Figure B.2: The Power of Disaggregation and Multiple Digit Places

Notes: Simulations show the effect of pooling data from 10 reporters with different digit preferences. Panel A shows the total first digit from all reporters, which is not statistically significant, p > 0.2. This shows that, even when data are manipulated, the effects can wash out when data are pooled and only the first digit place is considered. Panel B shows that disaggregation is powerful for detection by showing the data of one of the 10 reporters, which fails to conform to Benford’s law, p < 0.005. This highlights the statistical power of disaggregation. Panel C shows that jointly considering digits beyond the first place is powerful: even when data are not disaggregated, the manipulation is statistically significant, p < 0.005.
B.2: Simulations on the Power of All-Digit-Place Testing

The first of our new tests in the main text explores patterns in all digit places simultaneously, rather than multiple tests of different digit places, which greatly improves statistical power.
An extensive literature on forensic auditing and the use of digit analysis has promoted the use of single-digit-place tests to find evidence of fraud, or to select samples of data for additional review or auditing. These tests focus on comparing a single digit, such as the first digit, second digit, or last digit, to Benford’s Law (see, e.g., Nigrini and Mittermaier (1997) and Beber and Scacco (2012)). Here, we use a simulation to exhibit the relative power of our test as opposed to single-digit-place testing.
Our simulation proceeds as follows. We generate Benford-conforming data between 4 and 8 digits long (i.e., between 1,000 and 99,999,999), with each of 6 simulated districts having 1,000 observations of data. We simulate 3 “corrupt” districts in the data, districts A, B, and C, which each prefer 2 digits chosen independently. For example, district A might prefer 3 and 7, while district B might prefer 2 and 5. For each corrupt district, each observation is originally generated as conformant to Benford’s Law, but there is a 20% chance that they manipulate the data by replacing a digit in that observation with their preferred digit. There are also 3 “clean” districts, D, E, and F, which produce Benford-conforming data with no digit preferences. An ideal test would be able to distinguish cleaner districts from potentially corrupt districts and successfully flag districts A, B, and C for further review, while not flagging districts D, E, and F. Our simulation is a test of the power of all-digit-place testing, with the caveat that detecting this behavior requires individuals who fabricate data to display a consistent preference for digits across different digit places.
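One way to pool every digit place into a single statistic, in the spirit of the all-digit-places test, is to compare the observed digit counts at each position against the Benford-expected counts for that position and sum the chi-square contributions. The sketch below is our own stdlib-only construction; the exact statistic used in the main text may be built differently.

```python
import math
from collections import Counter

def benford_place_probs(place: int) -> list:
    """Benford probability of each digit 0-9 at a given place (1 = leading)."""
    if place == 1:
        return [0.0] + [math.log10(1 + 1 / d) for d in range(1, 10)]
    if place >= 4:
        return [0.1] * 10  # beyond the third place the distribution is nearly uniform
    lo, hi = 10 ** (place - 2), 10 ** (place - 1)
    return [sum(math.log10(1 + 1 / (10 * m + d)) for m in range(lo, hi))
            for d in range(10)]

def all_places_chi2(values) -> float:
    """Pooled chi-square over all digit places for positive integers."""
    observed = Counter()   # (place, digit) -> count
    totals = Counter()     # place -> number of values at least that long
    for v in values:
        for place, ch in enumerate(str(v), start=1):
            observed[(place, int(ch))] += 1
            totals[place] += 1
    stat = 0.0
    for place, n in totals.items():
        for d, prob in enumerate(benford_place_probs(place)):
            expected = n * prob
            if expected > 0:
                stat += (observed[(place, d)] - expected) ** 2 / expected
    return stat
```

Because every digit of every observation contributes to one statistic, a reporter who sprinkles a preferred digit across different positions accumulates evidence in one place instead of diluting it across several separate tests.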
To test the current standard in the literature, we first consider tests of first digits, second digits, third digits, and last digits in each of these districts. Given a desired significance threshold of p < 0.05, we must correct for multiple testing by dividing by the number of tests (4), arriving at a significance threshold of 0.0125.
Appendix Table B.1, Panel A presents the results of these tests. The sample size for each test is 1,000. Panel A shows the issues with single-digit testing. In the first digit, no districts are statistically significant, and in the second digit, only district A stands out. No districts are statistically significant in the third digit. Pooling data from different districts similarly fails to detect aberrant patterns using these tests. In the last digit, District F is inappropriately flagged as suspicious. Raising the statistical significance threshold back to 5%, that is, ignoring the Bonferroni correction for multiple tests, does not fix this issue; indeed, it would flag district E as suspicious in the last digit as well. District B fails no tests despite being (statistically) equally as manipulated as districts A and C.
Appendix Table B.1: Comparison of Single Digit Tests to New All Digits Test with Simulated Data

Panel A: Single Digit Tests

|               | District A    | District B | District C    | District D | District E | District F   | All Districts |
|---------------|---------------|------------|---------------|------------|------------|--------------|---------------|
| First Digits  | 0.0247        | 0.3849     | 0.2607        | 0.9681     | 0.5627     | 0.7321       | 0.5259        |
| Second Digits | **0.0009345** | 0.3319     | 0.3817        | 0.2        | 0.2157     | 0.1086       | 0.1378        |
| Third Digits  | 0.02922       | 0.05461    | 0.4149        | 0.1289     | 0.06716    | 0.3711       | 0.1919        |
| Last Digits   | **0.002284**  | 0.08462    | **0.0002037** | 0.06975    | 0.0299     | **0.001027** | **1.778e-11** |
| n (per test)  | 1,000         | 1,000      | 1,000         | 1,000      | 1,000      | 1,000        | 6,000         |

Panel B: All Digit Places

|                  | District A   | District B   | District C  | District D | District E | District F | All Districts |
|------------------|--------------|--------------|-------------|------------|------------|------------|---------------|
| All Digit Places | **6.885e-5** | **0.003257** | **0.03098** | 0.1345     | 0.1993     | 0.411      | 0.2367        |
| n                | 5,503        | 5,410        | 5,507       | 5,458      | 5,488      | 5,511      | 32,877        |

Notes: This table shows the result of simulated data, where single digits are tested separately (Panel A) and simultaneously (Panel B). Bolded values are statistically significant, after correcting for multiple testing with a Bonferroni correction (0.05 divided by 4 tests in Panel A for a significance level of 0.0125). Only districts A, B, and C have manipulated data, but single-digit testing fails to detect this, while also inappropriately flagging District F in a last-digits test. Districts A, B, and C, which have manipulated data, are correctly identified by an all-digit-places test.
Importantly, last digits here are tested against the uniform distribution, as is promoted by the literature (see Beber and Scacco (2012)). District F, which has no manipulation, fails this test. The uniform distribution is generally appropriate for last digits, but last digits may have slight tendencies towards Benford’s law when they are also part of short numbers. As seen in Table 1, in a 3-digit number, the last digits are third digits, which are not uniformly distributed. Here, the smallest number is 1,000, so the last digit place is the 4th digit place for these numbers. Therefore, the last-digit test here produces a false positive, and indeed is marginally significant (p < 0.10) for every district.
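The non-uniformity of third digits noted above can be computed directly from Benford's Law: the probability that the third digit equals d sums log10(1 + 1/(10m + d)) over every two-digit leading prefix m. A short sketch:

```python
import math

# Benford probability that the third digit equals d: sum over all
# two-digit leading prefixes m = 10..99 of log10(1 + 1/(10m + d)).
third_place = [sum(math.log10(1 + 1 / (10 * m + d)) for m in range(10, 100))
               for d in range(10)]

for d, p in enumerate(third_place):
    print(f"P(third digit = {d}) = {p:.4f}")
```

The probabilities decline monotonically from roughly 0.102 for the digit 0 to roughly 0.098 for the digit 9, rather than a flat 0.1, which is why testing such digits against the uniform distribution invites false positives.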
Appendix Table B.1, Panel B presents an alternative testing regime, where we consider our new test of all digit places by district. This is a single test, and the appropriate statistical significance threshold is 5%. The sample sizes vary slightly because exact values are simulated, so some districts have more digits than others due to the random length of numbers. The three manipulated districts fail this test (A, B, and C), as they should, and none of the unmanipulated districts do (D, E, and F), as they should not.
Next, we extend the results of this analysis to different line-item sample sizes (n) and different rates of manipulation (p) using the same setup of 6 districts. These extended simulations show that the all-digit-places test outperforms many single-digit-place tests, with a higher true-positive rate and a lower false-positive rate among a range of sample sizes and manipulation rates.
We generate Benford-conforming data between 4 and 8 digits long (i.e., between 1,000 and 99,999,999), with each of 6 simulated districts having n line-item observations of data. We simulate 3 “corrupt” districts in the data, districts A, B, and C, which each prefer 2 digits chosen independently. For each corrupt district, each observation is originally generated as conformant to Benford’s Law, but there is a p chance that they manipulate the data by replacing a digit in that observation with their preferred digit. There are also 3 “clean” districts, D, E, and F, which produce Benford-conforming data with no digit preferences. An ideal test would be able to distinguish clean districts from corrupt districts and successfully flag districts A, B, and C for further review, while not flagging districts D, E, and F. The hyper-parameters for the results presented in Table B.1 above are n = 1,000, reflecting 1,000 observations per district, and p = 0.2, a 20% rate of manipulating data among the districts that fabricate.
We consider a battery of the same 4 tests as presented above: first digits, second digits, third digits, and last digits. We vary p within 0.1, 0.2, 0.3, 0.4, and 0.5, and we consider sample sizes 100, 500, 1,000, 5,000, and 10,000. Because we are conducting 4 tests x 5 sample sizes x 5 probabilities of manipulation = 100 tests per district, we divide the desired significance level (0.05) by 100 (0.0005) to accomplish a Bonferroni correction for multiple testing.
Appendix Table B.2 presents the results of these tests. We count the number of tests failed among the corrupt districts (A, B, and C) out of 12 as the true positive rate, and the number of tests failed among the clean districts (D, E, and F) out of 12 as the false positive rate. As the sample size increases, and as the probability of manipulation increases, these tests perform better, but not perfectly; indeed, among 1,000 data points per district with 20% manipulation probability, single-digit tests fail only 75% of the time. Appendix Figure B.3 plots the true positive rate and the false positive rate against the expected number of manipulated data points, n x p. There are false positives, largely driven by last-digit testing, which can suffer from issues due to the last digit having some Benford distribution influence if the number is too short (less than 4 digits long).
In contrast, Appendix Table B.3 presents the results of the same variation in n and p but using the new all-digit-places test. We conduct 5 sample sizes x 5 probabilities of manipulation = 25 tests per district, and so we divide the desired significance level (0.05) by 25 (0.002) to accomplish a Bonferroni correction for multiple testing. We find a very high rate of true positives, and a very low rate of false positives. Above an n x p of about 250 expected manipulated observations, the test successfully catches the manipulating districts; only very rarely are non-manipulating districts flagged (2 total times in 75 district tests). Appendix Figure B.4 plots these results, showing the excellent performance of this powerful test.
This simulation shows how the all-digit-places test substantially outperforms single-digit testing along many dimensions. Signals of fraud may be present in different digit places, but individual-digit-place tests fail to combine these signals in statistically powerful ways. When performing single-digit testing, each test must be compared to a significance threshold, but each test fails to incorporate corroborating information in different digit places. Our new multiple-digit-places test solves this issue, improving the sample size and power of each test, and picking up digit preferences that are observable when a reporter exhibits them over different digit places. At all levels of manipulation and sample sizes, all-digit-places testing is higher powered, having a better true positive rate and a lower false positive rate.
Appendix Table B.2: Results of Many Single-Digit Tests

Notes: This table presents the results of single-digit-place testing among simulated data, using 4 tests: first digit, second digit, third digit, and last digit. There are 6 districts. Districts A, B, and C manipulate data, and districts D, E, and F do not. Each district produces data with sample size n, and districts A, B, and C manipulate data with probability of manipulation p. We count the number of tests failed per district as n and p are varied. The true-positive rate is the number of tests failed among districts A, B, and C out of 12; the false-positive rate is the number of tests failed among districts D, E, and F out of 12. A Bonferroni correction is used to determine test failure, dividing the desired significance rate of 5% by the number of tests, which is 100 per district.

| n | p | n x p | A Failed | B Failed | C Failed | D Failed | E Failed | F Failed | True Positive Rate | False Positive Rate |
|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.1 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100 | 0.2 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100 | 0.3 | 30 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100 | 0.4 | 40 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100 | 0.5 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 500 | 0.1 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 500 | 0.2 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 500 | 0.3 | 150 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.08 |
| 500 | 0.4 | 200 | 1 | 1 | 1 | 0 | 0 | 0 | 0.25 | 0 |
| 500 | 0.5 | 250 | 1 | 1 | 1 | 0 | 0 | 0 | 0.25 | 0 |
| 1000 | 0.1 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1000 | 0.2 | 200 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1000 | 0.3 | 300 | 2 | 0 | 3 | 0 | 0 | 0 | 0.42 | 0 |
| 1000 | 0.4 | 400 | 2 | 3 | 2 | 0 | 0 | 0 | 0.58 | 0 |
| 1000 | 0.5 | 500 | 4 | 3 | 4 | 0 | 0 | 1 | 0.92 | 0.08 |
| 5000 | 0.1 | 500 | 2 | 1 | 1 | 1 | 1 | 1 | 0.33 | 0.25 |
| 5000 | 0.2 | 1000 | 3 | 2 | 4 | 1 | 1 | 1 | 0.75 | 0.25 |
| 5000 | 0.3 | 1500 | 4 | 4 | 4 | 1 | 1 | 1 | 1 | 0.25 |
| 5000 | 0.4 | 2000 | 4 | 4 | 4 | 1 | 1 | 1 | 1 | 0.25 |
| 5000 | 0.5 | 2500 | 4 | 4 | 4 | 1 | 1 | 1 | 1 | 0.25 |
| 10000 | 0.1 | 1000 | 2 | 2 | 3 | 1 | 1 | 1 | 0.58 | 0.25 |
| 10000 | 0.2 | 2000 | 4 | 4 | 3 | 1 | 1 | 1 | 0.92 | 0.25 |
| 10000 | 0.3 | 3000 | 4 | 4 | 4 | 1 | 1 | 1 | 1 | 0.25 |
| 10000 | 0.4 | 4000 | 4 | 4 | 4 | 1 | 1 | 1 | 1 | 0.25 |
| 10000 | 0.5 | 5000 | 4 | 4 | 4 | 1 | 1 | 1 | 1 | 0.25 |
19
Appendix
Table B.3
:
Results of Many All
-
Digit
-
Place
Tests
Notes:
This table presents the results of
all
-
digit
-
place testing among
simulated data. There are 6
districts
. D
istricts A, B,
and
C manipulate data, and districts D, E,
and
F do not. Each district
produces data with sample size
n
, and districts A, B,
and
C manipulate data with probability of
manipulation
p
.
We count which
district
s
either fail the all
-
digit
-
places test (1) or
do
not (0)
as
n
and
p
are varied
.
The true
-
positive rate is the number of tests failed among districts A, B,
and
C
out of
3
; the false
-
positive rate is the number of tests failed among districts D, E,
and
F out of
3
.
A Bonferroni correction is used to determine test failure, dividing the desired significance rate of
5% by the number of tests,
which is
25
tests per district.
| n | p | n × p | A Failed | B Failed | C Failed | D Failed | E Failed | F Failed | True Positive Rate | False Positive Rate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.1 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100 | 0.2 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100 | 0.3 | 30 | 1 | 0 | 0 | 0 | 0 | 0 | 0.33 | 0 |
| 100 | 0.4 | 40 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 100 | 0.5 | 50 | 0 | 1 | 0 | 0 | 0 | 0 | 0.33 | 0 |
| 500 | 0.1 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 500 | 0.2 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 500 | 0.3 | 150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 500 | 0.4 | 200 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0.33 |
| 500 | 0.5 | 250 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1000 | 0.1 | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1000 | 0.2 | 200 | 0 | 0 | 1 | 0 | 0 | 0 | 0.33 | 0 |
| 1000 | 0.3 | 300 | 1 | 0 | 1 | 0 | 0 | 0 | 0.67 | 0 |
| 1000 | 0.4 | 400 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1000 | 0.5 | 500 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 5000 | 0.1 | 500 | 1 | 0 | 1 | 0 | 0 | 0 | 0.67 | 0 |
| 5000 | 0.2 | 1000 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 5000 | 0.3 | 1500 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 5000 | 0.4 | 2000 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 5000 | 0.5 | 2500 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 10000 | 0.1 | 1000 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 10000 | 0.2 | 2000 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0.33 |
| 10000 | 0.3 | 3000 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 10000 | 0.4 | 4000 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 10000 | 0.5 | 5000 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
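The simulation design described in the table notes can be sketched in code. This is an illustrative stand-in, not the paper's implementation: it tests each district's first-digit distribution with a single chi-square test rather than the paper's single-digit-place and all-digit-place statistics, and the function names and parameter choices here are our own.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def benford_sample(n, rng):
    # 10**U with U uniform over several decades has Benford-distributed digits
    return 10 ** rng.uniform(2, 6, size=n)

def simulate_district(n, p, manipulates, rng):
    # Honest data follow Benford's Law; a manipulating district replaces
    # each data point with a fabricated (uniform) value with probability p
    x = benford_sample(n, rng)
    if manipulates:
        mask = rng.random(n) < p
        x[mask] = rng.uniform(100.0, 1e6, size=mask.sum())
    return x

def fails_digit_test(x, alpha):
    # Chi-square test of first digits against Benford's distribution
    first = (x / 10 ** np.floor(np.log10(x))).astype(int)
    observed = np.bincount(first, minlength=10)[1:10]
    expected = np.log10(1 + 1 / np.arange(1, 10)) * len(x)
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return stats.chi2.sf(chi2, df=8) < alpha

alpha = 0.05 / 25  # Bonferroni correction: 5% over 25 tests per district
n, p = 5000, 0.3
manipulators = {"A": True, "B": True, "C": True, "D": False, "E": False, "F": False}
failed = {d: fails_digit_test(simulate_district(n, p, m, rng), alpha)
          for d, m in manipulators.items()}
true_positive_rate = sum(failed[d] for d in "ABC") / 3   # failures among A-C
false_positive_rate = sum(failed[d] for d in "DEF") / 3  # failures among D-F
```

Sweeping n and p over the grids shown in the tables, and recording the two rates at each cell, reproduces the structure of the rows above.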
Appendix Figure B.3: Variation in True-Positive Rate and False-Positive Rate Among Many Single-Digit Tests

Notes: This figure plots the true-positive and false-positive rates of many single-digit tests against the expected number of manipulated data points. Single-digit-place testing converges slowly to a perfect true-positive rate.

[Figure: share of tests (0%–100%) plotted against the expected number of manipulated data points N × p (0–5,000), with separate series for the true-positive rate and the false-positive rate of the single-digit tests.]
Appendix Figure B.4: Variation in True-Positive Rate and False-Positive Rate Among Many All-Digit Tests

Notes: This figure plots the true-positive and false-positive rates of all-digit tests against the expected number of manipulated data points. All-digit-place testing outperforms single-digit-place testing and has a low number of false positives.

[Figure: share of tests (0.00–1.00) plotted against the expected number of manipulated data points N × p (0–5,000), with separate series for the true-positive rate and the false-positive rate of the all-digit tests.]
Appendix C: Externally Validating All Digit Places with the World Bank Enterprise Survey

To demonstrate the broad applicability of the new all-digit-places test, we conduct an analysis on a new dataset: the World Bank Enterprise Surveys (WBES). Originally designed to survey the global economy, the World Bank Enterprise Surveys collect data from firms worldwide (available here: https://www.enterprisesurveys.org/en/enterprisesurveys). As part of these surveys, firms are asked about qualitative and quantitative aspects of their business. Notably, thousands of firms are asked about their real reported current-year sales, past sales, and number of employees, each of which is expected to conform to Benford's Law.
We validate our method by comparing the results of our all-digit-places test on these data to indicators of corruption: both the WGI Control of Corruption measure and the Transparency International Corruption Perceptions Index. Notably, the appropriate test for these data is the all-digit-places test, and not our new padding test, because the individuals reporting these data have no incentive to inflate the values; they are not being paid for their reported numbers. Instead, the all-digit-places test measures the behavioral biases (and lack of accounting standards) of the individuals reporting these values.
We run an all-digit-places Beyond the First test on the 3 appropriate variables from the WBES:

• Permanent, full-time employees at the end of the last complete fiscal year (Survey Variable L1)
• Last complete fiscal year's total sales (Survey Variable D2)
• Total annual sales 3 years ago (Survey Variable N3)
The WBES data contain 179,063 observations for each of the 3 variables. For each variable, we apply our all-digit-places test the same way as in our original analysis, skipping the first digit and omitting digits 0 and 5. We conduct the analysis for each country separately, on each variable. This produces a p-value for each country, for each variable, which we compare to the significance threshold.
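The per-country procedure just described can be sketched as follows. This is a simplified stand-in for the paper's statistic, under a stated assumption: digits beyond the first are compared to a uniform distribution over the eight digits that remain after omitting 0 and 5 (which Benford's Law approaches in later digit places); the function name is ours.

```python
import numpy as np
from scipy import stats

def all_digit_places_pvalue(values):
    # Collect every digit after the first, omitting 0s and 5s
    digits = []
    for v in values:
        s = str(int(abs(v)))
        digits.extend(int(d) for d in s[1:] if d not in "05")
    counts = np.bincount(digits, minlength=10)
    observed = counts[[1, 2, 3, 4, 6, 7, 8, 9]]
    # Assumption: later digit places are ~uniform over the 8 remaining digits
    expected = np.full(8, observed.sum() / 8)
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(stats.chi2.sf(chi2, df=7))

# Padded reports (all trailing digits equal to 2) yield a tiny p-value,
# while exactly uniform trailing digits yield no evidence of manipulation
p_fabricated = all_digit_places_pvalue([22222] * 2000)
p_uniform = all_digit_places_pvalue(range(10000, 60000))
```

Running this per country and per variable, and comparing each p-value to the significance threshold, yields the country-level pass/fail indicators.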
The WBES contains data from 154 countries. We observe all of these countries in the World Governance Indicators Control of Corruption dataset. We set a Bonferroni-corrected significance threshold of 0.05/(number of countries) when assessing statistical significance for each country under Benford's Law.
To validate our results, we consider the correlation between failing each test and the governance indicators. We also construct an index counting how many of the 3 tests each country fails, and correlate that index with the governance indicators. Each correlation produces a p-value, for which the correct significance threshold is 5%/(number of tests), or about 1%.
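The correlation step can be sketched with `scipy.stats.pearsonr`. The arrays below are illustrative placeholders, not the paper's data: `fails` counts how many of the 3 tests each country fails, and `wgi` holds hypothetical WGI Control of Corruption scores.

```python
import numpy as np
from scipy import stats

# Illustrative placeholder data: one entry per country
fails = np.array([3, 2, 0, 1, 0, 3, 2, 0, 1, 0])  # tests failed out of 3
wgi = np.array([-1.6, -0.9, 1.8, 0.2, 2.1, -1.2, -0.4, 1.5, 0.0, 1.9])

r, p = stats.pearsonr(fails, wgi)
# Four correlations are run (the index plus one per variable), so the
# Bonferroni-corrected threshold is 5% / 4 tests = 1.25%, roughly 1%
significant = p < 0.05 / 4
```

Because higher WGI values mean better governance, a negative r here corresponds to worse governance going with more failed tests.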
Appendix Table C.1, below, shows the result of this analysis. Note that the WGI is scaled so that positive values mean better governance; the negative correlations therefore show that worse governance corresponds to more digit tests failed. We see a strong, statistically significant correlation between failing the all-digit-places test and the WGI Control of Corruption measure.
Appendix Table C.1: Correlation between All Digit Places and WGI Control of Corruption

| | Correlation with WGI Value | P-Value for No Correlation |
| --- | --- | --- |
| Index: How Many Failed of 3 | -0.311 | 0.0000861 |
| All Digit Places: # of Employees (L1) | -0.224 | 0.00533 |
| All Digit Places: Last Year Sales (D2) | -0.253 | 0.00156 |
| All Digit Places: Sales 3 Years Ago (N3) | -0.256 | 0.00135 |
Next, we show the robustness of these results. Appendix Table C.2, below, repeats this analysis, replacing the WGI Control of Corruption measure with the 2021 Corruption Perceptions Index ranking. Of our 154 countries, we can observe 147 in the 2021 Corruption Perceptions Index. The CPI ranking is scaled so that higher values mean worse governance (lower-ranked for corruption), so a positive correlation means that failing more tests corresponds to worse governance. Appendix Table C.2 shows a strong, statistically significant correlation between failing the all-digit-places test and the CPI ranking.
Appendix Table C.2: Correlation between All Digit Places and CPI Ranking

| | Correlation with CPI Ranking | P-Value for No Correlation |
| --- | --- | --- |
| Index: How Many Failed of 3 | 0.273 | 0.000806 |
| All Digit Places: # of Employees (L1) | 0.192 | 0.0201 |
| All Digit Places: Last Year Sales (D2) | 0.237 | 0.00380 |
| All Digit Places: Sales 3 Years Ago (N3) | 0.209 | 0.01099 |

Next, we show that these results are robust to the fact that the WBES data contain high levels of rounding. Our all-digit-places test ignores 0s and 5s; however, to ensure that rounding does not bias our results, we also re-run the same analysis, excluding the 20% of countries with the highest amount of rounding on the most-rounded variable (D2 Sales). Appendix Table C.3 shows the result of our correlation between the WGI and the all-digit-places test after eliminating countries with a higher than 80th-percentile level of rounding. The sample size is therefore 123 countries. We see that the sign, magnitude, and statistical significance of the correlation are not diminished by the elimination of countries with high rounding; therefore, we can conclude that rounding is not a source of bias in these estimates.
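The rounding filter can be sketched as follows. This is a sketch under stated assumptions: the column names are hypothetical (not the actual WBES schema), and the share of sales values ending in trailing zeros is used as one plausible proxy for "amount of rounding," which may differ from the paper's exact measure.

```python
import pandas as pd

# Hypothetical firm-level frame; "country" and "d2_sales" are illustrative
# column names used only for this sketch
df = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4,
    "d2_sales": [1000, 2000, 500, 1500, 1234, 5678, 910, 1111],
})

# Proxy for rounding: share of a country's sales values divisible by 100
share_rounded = (df["d2_sales"] % 100 == 0).groupby(df["country"]).mean()

# Drop the countries above the 80th percentile of rounding, keep the rest
keep = share_rounded[share_rounded <= share_rounded.quantile(0.8)].index
```

Re-running the correlations on the kept countries mirrors the robustness check reported in Appendix Table C.3.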
Appendix Table C.3: Correlation between All Digit Places and WGI after Correcting for Rounding

| | Correlation with WGI Value | P-Value for No Correlation |
| --- | --- | --- |
| Index: How Many Failed of 3 | -0.373 | 0.0000210 |
| All Digit Places: # of Employees (L1) | -0.313 | 0.000429 |
| All Digit Places: Last Year Sales (D2) | -0.286 | 0.00134 |
| All Digit Places: Sales 3 Years Ago (N3) | -0.297 | 0.000851 |
Providing further validity, these results are consistent with our findings from the paper. Appendix Figure C.2, below, shows some of the all-digit-place plots (beyond the first) from different countries.

Panel A shows the worst 3 countries by the WGI Control of Corruption score: South Sudan, Yemen, and Venezuela. Each country shows a strong preference for even numbers, with the share of the digit 2 exceeding 20% (as opposed to the appropriate roughly 14%).

In contrast, Panel B shows the 3 highest-governance countries in the data as measured by the WGI: Sweden, Finland, and Denmark. While not perfect, these data conform much more strongly to Benford's Law, and while there may be a slight preference for the digit 2, there is no general preference for the even digits. This likely reflects the fact that accounting and reporting standards are higher in these countries, and therefore the share of respondents who are estimating or fabricating their sales figures is much smaller (though perhaps not 0).

Finally, Panel C compares Kenya to its neighbors (Uganda and Tanzania) to demonstrate that Kenya is not an outlier. The same pattern persists as in low-governance countries; for