Evaluating the Quality of Changes in Voter
Registration Databases: Supplementary Materials
1 Acknowledgements
2 Data
2.1 Data Overview
In this paper, we investigate 252 unique daily snapshots of the Orange County Voter Registration
dataset, beginning April 26, 2018, and ending May 24, 2019. Altogether, they cover 89% of
business days (weekdays). Each snapshot consists of roughly 1.5 million voters. We continue to
receive daily snapshots of the OC dataset in the 2020 cycle.
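As a rough check on the coverage figure, the share of weekdays covered by the 252 snapshots can be recomputed from the two dates given above. This is a minimal sketch that assumes "business days" means Monday through Friday with no public holidays excluded; the helper name `count_weekdays` is ours and is not part of the replication code.

```python
from datetime import date, timedelta


def count_weekdays(start: date, end: date) -> int:
    """Count Monday-Friday dates in the inclusive range [start, end]."""
    n_days = (end - start).days + 1
    return sum(
        1
        for offset in range(n_days)
        if (start + timedelta(days=offset)).weekday() < 5  # 0-4 are Mon-Fri
    )


first_snapshot = date(2018, 4, 26)
last_snapshot = date(2019, 5, 24)
n_snapshots = 252  # unique daily snapshots reported in the text

# Assumption: holidays are NOT removed from the denominator.
n_weekdays = count_weekdays(first_snapshot, last_snapshot)
coverage = n_snapshots / n_weekdays
print(f"{n_weekdays} weekdays, coverage = {coverage:.1%}")
```

Under this assumption the period contains 282 weekdays, so 252 snapshots correspond to roughly 89% coverage, in line with the figure reported above.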
2.2 Why Orange County?
Orange County (California) is a large and diverse county in Southern California. Located south of
Los Angeles and north of San Diego, Orange County is home to a wide array of businesses,
colleges and universities, and of course, Disneyland. The county currently has a total population
of almost 3.2 million residents, and in the 2016 presidential election, Orange County had just
over 2 million voting-eligible citizens, with approximately 1.5 million registered voters (California
Secretary of State, 2016). In that same election, 1.2 million of those registered voters participated
(80.7% of registered voters). Orange County’s population is also diverse: the U.S. Census
Bureau’s most recent estimates show that 72% of the county’s population is White, 21% Asian, 2%
African-American, and 3.5% two or more races. The Census Bureau also estimates that
34% of Orange County’s population is Hispanic or Latino (United States Census Bureau, 2017).
Thus, one reason we focus on Orange County for this study is that it is one of the largest and most
diverse election jurisdictions in the United States.
Secondly, Orange County is widely viewed as an innovator in the administration of elections.
The County’s Registrar of Voters, Neal Kelley, participates widely in state and national professional
organizations, and has been recognized for his innovative administrative practices. Under his
administration, Orange County has developed many administrative processes and tools that are
viewed as best practices for election administration. These innovations include, for example,
building transparency by webcasting in real time virtually all aspects of the process of administering
an election, or more recently, pilot testing risk-limiting audits.
2.3 Data Availability
Upon publication, all of the code necessary to produce the analyses reported in our paper will
be available on the GitHub repository
along with an example dataset with synthetic
voter information. Due to the confidential nature of the voter registration data and our data access
agreement with the Orange County Registrar of Voters (OCROV), we cannot share or publicly post
the data used in this study. Researchers who want to use these data can request access from the OCROV.
2.4 Data Dictionary
The voter file “snapshots” that we have received from the OCROV contain the fields described
below. The numbers in parentheses give the number of unique values for each field,¹ based on
the snapshot of May 21, 2018, the registration deadline for the June 2018 primaries. The snapshot
consists of 1,478,541 observations.
Here we provide a data dictionary and the number of unique values in each of the sixty-two
data fields.² Many of the variables are created internally by the Orange County Registrar of Voters
for their own use; our interest is mostly limited to variables that contain direct inputs from the voters.
These variables of interest are listed in the Appendix in Table 2 with summary statistics.³ Although
the Registrar assigns each voter a unique ID (lVoterUniqueID) that is not duplicated in any
of the daily snapshots, not all records correspond to distinct individuals.
¹ The numbers are based on raw text, so that, for instance, “MISS” and “Miss” are counted as distinct values.
² Note that canonical text cleaning and standardization (stripping non-alphanumeric characters, trimming
white space, and normalizing case) precede the calculation of both the number of unique entries and the
count of the most frequent entries. Email addresses are the exception, since they may be case sensitive and
certain punctuation in them creates meaningful differences.
³ We exclude mailing addresses because they usually overlap with the physical, residential address. We also
exclude reported place of birth, as it seems to be frequently misreported and changes frequently in the data.
In Orange County, the voter registration forms ask the voter for both the California Driver’s
License number (or a California Identification card number) and the last four digits of the Social
Security Number (SSN) (Orange County Registrar of Voters, 2018b). However, these are not strictly
required. If neither of them can be provided, a voter may be assigned a unique ID number solely
for registration purposes (Orange County Registrar of Voters, 2018a). Despite these seemingly
unique identifiers, duplicates still can be found in the database. Note that deduplication based on
exact matching on these identifiers, the most basic of deduplication efforts, is already performed
by the OCROV.
“lVoterUniqueID” (1,478,541): Internally assigned voter identification number.
“sAffNumber” (1,478,540): An identifier of the voter registration affidavit.
“szStateVoterID” (1): The voter identification number assigned by the Secretary of State’s Office to the record.
“sVoterTitle” (10): Title (e.g., “Dr.”, “Mrs.”) provided by the voter.
“szNameLast” (188,734): Last name.
“szNameFirst” (89,985): First name.
“szNameMiddle” (52,085): Middle name.
“sNameSuffix” (23): Name suffix.
“sGender” (3): Gender.
“szSitusAddress” (787,043): Address.
“szSitusCity” (48): City.
“sSitusState” (1): State.
“sSitusZip” (94): Zip Code.
“sHouseNum” (30,269): House number.
“sUnitAbbr” (20): House unit abbreviation.
“sUnitNum” (14,780): House unit number.
“szStreetName” (17,437): Street name.
“sStreetSuffix” (95): Street suffix.
“sPreDir” (9): Direction prefix.
“sPostDir” (5): Direction suffix.
“szMailAddress1” (807,272): Mailing address (street address).
“szMailAddress2” (22,249): Mailing address (city, state, and zip code).
“szMailAddress3” (2,271): Mailing address (overseas voters’ street address).
“szMailAddress4” (195): Mailing address (overseas voters’ country of residence).
“szMailZip” (13,425): Mailing Zip Code.
“szPhone” (706,711): Telephone number.
“szEmailAddress” (452,610): Email address.
“dtBirthDate” (30,468): Date of birth.
“sBirthPlace” (30,468): Place of birth.
“dtRegDate” (15,762): Registration record date.
“dtOrigRegDate” (16,477): Original registration date.
“dtLastUpdate_dt” (6,984): Update of record.
“sStatusCode” (1): Status of record.
“szStatusReasonDesc” (110): Description of record status.
“sUserCode1” (7,370): (Unknown)
“sUserCode2” (13): (Unknown)
“iDuplicateIDFlag” (4): Potential duplicate ID flag.
“szLanguageName” (1): Language.
“szPartyName” (46): Party registration.
“szAVStatusAbbr” (12): Absentee status abbreviation.
“szAVStatusDesc” (12): Absentee status description.
“szPrecinctName” (53): Precinct name.
“sPrecinctID” (1,487): Precinct ID.
“sPrecinctPortion” (8): Precinct portion.
“sDistrictID_0” (1): Geographic district identifier (0: County).
“iSubDistrict_0” (1): Geographic district (0: County).
“szDistrictName_0” (1): Geographic district name (0: County).
“sDistrictID_1” (7): Geographic district identifier (1: Congressional district).
“iSubDistrict_1” (1): Geographic district (1: Congressional district).
“szDistrictName_1” (7): Geographic district name (1: Congressional district).
“sDistrictID_2” (5): Geographic district identifier (2: Senate district).
“iSubDistrict_2” (1): Geographic district (2: Senate district).
“szDistrictName_2” (5): Geographic district name (2: Senate district).
“sDistrictID_3” (7): Geographic district identifier (3: Assembly district).
“iSubDistrict_3” (1): Geographic district (3: Assembly district).
“szDistrictName_3” (7): Geographic district name (3: Assembly district).
“sDistrictID_4” (5): Geographic district identifier (4: Supervisorial district).
“iSubDistrict_4” (1): Geographic district (4: Supervisorial district).
“szDistrictName_4” (5): Geographic district name (4: Supervisorial district).
“sDistrictID_5” (35): Geographic district identifier (5: City council ward division).
“iSubDistrict_5” (9): Geographic district (5: City council ward division).
“szDistrictName_5” (68): Geographic district name (5: City council ward division).
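The counts in the data dictionary above are based on raw text, whereas the summary statistics in Table 2 are computed after text cleaning. The sketch below illustrates how such cleaning can change the number of unique values. It is only an illustration under our own assumptions: the function name `normalize` is hypothetical, case is normalized to upper case, internal white space is collapsed rather than removed, and email addresses are only trimmed.

```python
def normalize(value: str, is_email: bool = False) -> str:
    """Standardize a raw text field before counting unique values.

    Non-email fields: drop non-alphanumeric characters (keeping spaces),
    collapse runs of white space, and convert to upper case (an assumption).
    Email fields: only trim surrounding white space, since case and
    punctuation can be meaningful.
    """
    value = value.strip()
    if is_email:
        return value
    kept = "".join(ch for ch in value if ch.isalnum() or ch.isspace())
    return " ".join(kept.split()).upper()


# Synthetic title values, in the spirit of the "MISS" versus "Miss" example.
raw_titles = ["MISS", "Miss", "Miss.", " Dr.", "DR"]
n_raw = len(set(raw_titles))
n_clean = len({normalize(title) for title in raw_titles})
print(n_raw, n_clean)
```

Here five raw strings collapse to two cleaned values. This kind of cleaning is presumably why the raw count for “sVoterTitle” (10) exceeds the number of unique titles reported in Table 2 (5).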
2.5 Hypothetical Changes to the Database
Table 1 shows synthetic examples of changes in the voter file. They can also represent examples
of duplicates in the file.
Figure 1: Number of Records Per Day (x-axis: Date, May 2018 to May 2019; y-axis: Number of Registered Voters (Million), ticks at 1.50, 1.55, and 1.60)
Name                    | Address                 | Birth Date | Contact
First  | Middle | Last  | Street Address | City   |            | Phone        | Email
Steven | B      | Smith | 110 S East Ave | Brea   | 04/26/1980 | 714-765-3300 | N/A
Steven |        | Smith | 110 S East Ave | Brea   | 04/26/1980 | 714-765-3300 | smith@ex
Isidor |        | Agnes | 99 6th St #72  | Tustin | 07/13/1960 | N/A          | N/A
Jsidor |        | Agne  | 99 6th St #72  | Tustin | 07/13/1960 | 714-205-8583 | N/A
Anna   | Clara  | Zhang | 203 Coast Ln   | Tustin | 12/01/1950 | N/A          | acz@ex
Anna   | C      | Zhang | 101 Sunny Blvd | Brea   | 12/10/1950 | N/A          | acz@ex
Table 1: Synthetic Examples of Changes in Voter Files
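The differences between paired rows of Table 1 can be summarized field by field. The sketch below does so for the third and fourth rows. It is an illustration only: `compare_records` is a hypothetical helper, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever string comparator is used in the actual analysis.

```python
from difflib import SequenceMatcher

MISSING = "N/A"  # placeholder used for empty cells in Table 1


def compare_records(old: dict, new: dict) -> dict:
    """Describe each field that differs between two versions of a record.

    Returns a dict mapping a field name to "missing" when one side is empty,
    or otherwise to a similarity ratio in [0, 1] from SequenceMatcher.
    """
    changes = {}
    for field, old_value in old.items():
        new_value = new[field]
        if old_value == new_value:
            continue
        if MISSING in (old_value, new_value):
            changes[field] = "missing"
        else:
            changes[field] = SequenceMatcher(None, old_value, new_value).ratio()
    return changes


# Rows 3 and 4 of Table 1 (synthetic data).
row3 = {"first": "Isidor", "last": "Agnes", "street": "99 6th St #72",
        "city": "Tustin", "birth_date": "07/13/1960",
        "phone": "N/A", "email": "N/A"}
row4 = {"first": "Jsidor", "last": "Agne", "street": "99 6th St #72",
        "city": "Tustin", "birth_date": "07/13/1960",
        "phone": "714-205-8583", "email": "N/A"}

changes = compare_records(row3, row4)
for field, change in changes.items():
    print(field, change)
```

For this pair, only the first name, last name, and phone fields differ. The two name fields remain highly similar, which is the pattern one would expect from a typographical correction or from a duplicate record of the same person.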
2.6 Descriptive Statistics
Figure 1 shows the total number of observations in the voter registration database by date. As can
be seen, the daily snapshots were generated on business days (weekdays). A few snapshots are
missing: although the Orange County Registrar of Voters has generously provided us with daily
snapshots, we were unable to obtain some of them when the office was busy.
In addition, as mentioned above, Table 2 summarizes some important voter-entered
variables. For each field, it reports the amount of missing data, the number of unique entries,
and the count of the most frequent entry. For instance, the
name suffix has too much missing data and too few unique entries to be very informative. Political
party, although an important variable, is likewise not informative for matching.
Table 2: Data Summary by Field of May 21 Snapshot
Category        | Field               | No. of Unique Entries | Count of Most Freq. Entry | No. Missing | Example
Name            | First               | 89,984                | 21,481                    | 78          | Jane
                | Middle              | 51,609                | 83,035                    | 406,428     | E
                | Last                | 188,734               | 26,385                    | 0           | Doe
                | Title (Name Prefix) | 5                     | 466,043                   | 488,123     | Ms.
                | Name Suffix         | 18                    | 16,430                    | 1,452,055   | Jr.
Address         | Street Address      | 786,224               | 93                        | 0           | 1300 S Grand Ave Unit 101
                | City                | 48                    | 140,081                   | 0           | Santa Ana
                | Zip Code            | 94                    | 40,128                    | 0           | 92705
Date of Birth   |                     | 30,467                | 124                       | 23          | March 11, 1989
Place of Birth  |                     | 319                   | 678,187                   | 60,999      | CA
Gender          |                     | 3                     | 2,274                     | 1,474,151   | F
Political Party |                     | 46                    | 540,859                   | 0           | No Party Preference
Contact         | Phone               | 706,710               | 9,035                     | 663,105     | (714) 567-7600
                | Email               | 452,609               | 382                       | 1,018,894   | jane@roc.ocgov.com
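The three summary columns of Table 2 can be computed for any single field along the following lines. This is a minimal sketch with a hypothetical helper, `summarize_field`. It assumes that empty strings and `None` count as missing and that missing values are excluded from the unique count. That assumption seems consistent with several Table 2 counts being one lower than the raw counts in the data dictionary (e.g., 89,984 versus 89,985 first names), but it is our reading and is not stated in the text.

```python
from collections import Counter


def summarize_field(values):
    """Return (n_unique, count_of_most_frequent, n_missing) for one field."""
    # Assumption: None and the empty string both mean "missing".
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    most_frequent_count = counts.most_common(1)[0][1] if counts else 0
    n_missing = len(values) - len(present)
    return len(counts), most_frequent_count, n_missing


# Synthetic name-suffix column: mostly missing, with few distinct values.
suffixes = ["Jr", "", "", "Jr", "III", "", None]
summary = summarize_field(suffixes)
print(summary)
```

A field such as the name suffix, with mostly missing entries and only a handful of distinct values, yields a summary like the one in this toy column, which is why it carries little information for matching.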
3 Parameter and Variable Selection in Record Linkage
Figure 2 recaps the probabilistic record linkage framework, which forms the basis of our analysis.
The two density distributions show the match probability by the latent status of a true match.
If the record pair is a “true negative,” i.e., the two records do not belong to the same voter, the match probability is
likely lower than when the pair is a “true positive.” However, some fields, such as
names or addresses, may coincide by chance, resulting in an overlapping region. A researcher typically chooses
a lower and an upper cutoff on the match probability to classify record pairs into nonmatches,
matches, and pairs that must be clerically reviewed. Note that for the final composite match
probability, we have to calculate each field’s agreement level and weight it using its frequency