The Privacy Paradox and Optimal Bias-Variance Trade-offs in Data Acquisition

Guocheng Liao* (CUHK), Yu Su* (Caltech), Juba Ziani (University of Pennsylvania), Adam Wierman (Caltech), Jianwei Huang (CUHK, SZ)
ABSTRACT
While users claim to be concerned about privacy, often they
do little to protect their privacy in their online actions. One
prominent explanation for this “privacy paradox” is that
when an individual shares her data, it is not just her pri-
vacy that is compromised; the privacy of other individuals
with correlated data is also compromised. This information
leakage encourages oversharing of data and significantly im-
pacts the incentives of individuals in online platforms. In
this extended abstract, we discuss the design of mechanisms
for data acquisition in settings with information leakage and
verifiable data. We summarize work designing an incentive
compatible mechanism that optimizes the worst-case trade-off between bias and variance of the estimation subject to a
budget constraint, where the worst-case is over the unknown
correlation between costs and data. Additionally, we charac-
terize the structure of the optimal mechanism in closed form
and study monotonicity and non-monotonicity properties of
the marketplace.
1. INTRODUCTION
There is a fundamental discrepancy between privacy atti-
tudes and the behaviors of users: users claim to be concerned
about their privacy but do little to protect privacy in their
actions. This phenomenon is known as the privacy paradox.
Understanding the reasons behind this paradox and its con-
sequences for the design of online platforms is an important
goal for both computer scientists and economists.
We try to explain the privacy paradox from the perspec-
tive of data correlation. In particular, when an individual
shares her data, it is not just her privacy that is compro-
mised; the privacy of other individuals whose data is cor-
related with hers is also compromised. Thus, these other
individuals are more likely to share their own data given that some of it has already been leaked [1]. Information leakage
due to correlation has been shown to lead to oversharing
since each individual overlooks their own privacy concerns
as a result of the negative externalities created by others’
revelation decisions.
In this extended abstract, we study the impact of privacy concerns and information leakage on the design of data markets.
Specifically, we study the task of designing mechanisms for obtaining verifiable data from a population for a statistical estimation task, such as estimating the expected value of some function of the underlying data (e.g., salary and BMI).

*Equal contribution. Copyright is held by author/owner(s).
A growing line of work has focused on the design of such
optimal data acquisition mechanisms, e.g., [3, 2]. Studies in
[3, 2] focused on mechanism design for unbiased estimation
with minimal variance. However, this literature assumed
that all individuals will participate, thus unbiased estima-
tion is possible. Further, this line of work has not considered
the issues created by information leakage due to correlation
between the participants. Information leakage creates sig-
nificant incentives for increased data sharing [1] and thus
mechanisms that do not consider it directly will suffer from
undetected bias and increased variance in the obtained esti-
mates.
Contributions.
In this extended abstract, we provide the
first characterization of an optimal mechanism for data ac-
quisition in the setting where agents are concerned about
privacy due to data correlation. Data correlation causes in-
formation leakages to non-data reporters. Additionally, the
mechanism allows, for the first time, a trade-off between
the bias and variance of the estimator when privacy cost is
considered.
Specifically, we propose a novel model for data acquisi-
tion. The novelty of our model consists of three important
components. First, we introduce the privacy cost to model
impacts of data correlation. We divide the agents into dif-
ferent groups and assume that agents within the same group
share the same correlation strength. This gives us the power
to work with any granularity of choice with regard to data
correlation. Second, in reality, not every agent always de-
cides to join the platform. Thus, we introduce the notion
of participation rate as the ratio of the number of agents
who join the platform to the total number of agents. This
further allows us to study equilibria with respect to partic-
ipation rate, which is crucial since the mechanism impacts
the participation rate. Third, given that not every agent
joins the platform, it is not always realistic to aim for an
unbiased estimator. Instead, we minimize a linear combi-
nation of bias and variance of the estimator. Via a choice
of constant weights for bias and variance, we are able to
balance between these two metrics of the estimator.
Our main theoretical results provide a closed form solution
of payment and allocation rules under a choice of equilibrium
participation rate in order to achieve a truthful mechanism.
More specifically, we aim to minimize a linear combination
of bias and variance subject to budget and truthfulness con-
straints. Moreover, we provide conditions for the optimality
of an unbiased estimator in the case when it is possible to achieve a full participation rate.

Performance Evaluation Review, Vol. 49, No. 2, September 2021
Our results offer some interesting insights about mecha-
nisms for data acquisition. First, an unbiased estimator is
possible even if the budget is relatively small. Second, incor-
poration of privacy cost due to information leakage makes
it possible for the analyst to underpay the agents to acquire
the same data set and encourages more agents to join the
platform. This aligns with the privacy paradox frequently
observed in platforms today.
2. SYSTEM MODEL
We consider an online platform consisting of an analyst
and many agents. The analyst aims to design a mechanism
to purchase private data from agents to perform a statistical
estimation task, e.g., estimate the mean of the agents’ data.
There are $s$ agents that hold data of interest to the analyst.
Every agent owns a data point. By reporting her data to the
platform's survey, the agent incurs an overall cost $c$, which
is known to her but not to the analyst. The overall cost
consists of a combination of reporting cost and privacy cost,
where the reporting cost results from the act of reporting the
data while the privacy cost comes from both data sharing
and data correlation.
We consider a family of pricing mechanisms that presents a menu to the agents. The menu consists of pairs of payments $P$ and probabilities $A$ of having agents' data selected for use in the estimation task. Both $P$ and $A$ are functions of the agent's reported cost $\tilde c$. The mechanism works as follows. First, each agent is presented the menu by the analyst. Second, given the menu, an agent decides if she would like to join the platform. An agent who decides to join the platform is asked to report her cost, which determines the payment and selection probability. The payment $P(\tilde c)$ is given to the agent if she joins the platform and her data is selected (used) by the analyst. More specifically, her data is selected with probability $A(\tilde c)$, and she receives the payment only if her data is selected.
2.1 Agent Model
We introduce the agent's utility function. If the agent does not join the platform, she suffers a privacy cost $g(c)$ due to information leakage as a result of correlation between her data and the data of those agents who do join. If the agent joins the platform but is not selected for data contribution, she experiences the benefit of participation $w$ and a privacy cost. This privacy cost comes both from the agent herself sharing her data on the platform and from her friends' sharing of possibly correlated data; we use $h(c)$ to denote this combination. If an agent joins the platform and is selected for data contribution, she obtains the benefit of participation and the payment, and suffers the true cost $c$. We use $N$ to denote the set of agents who join the platform. In summary, an agent $k$'s utility is:

$$u_k(\tilde c \mid c) = \begin{cases} -g(c), & \text{if } k \notin N, \\ -h(c) + w(\cdot), & \text{w/ prob. } 1 - A(\tilde c), \ \text{if } k \in N, \\ P(\tilde c) - c + w(\cdot), & \text{w/ prob. } A(\tilde c), \ \text{if } k \in N. \end{cases}$$
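To make the three utility cases concrete, here is a minimal Python sketch that evaluates an agent's realized and expected utility. The linear cost functions g and h, the benefit W, and the menu rules A and P are illustrative placeholders, not the paper's calibration:

```python
# Hypothetical privacy-cost and benefit functions (illustrative only).
def g(c):      # privacy cost if the agent stays off the platform
    return 0.3 * c

def h(c):      # privacy cost if she joins but is not selected
    return 0.5 * c

W = 1.0        # participation benefit w(.) at some fixed participation rate

def utility(join, selected, c_report, c_true, A, P):
    """Realized utility for each of the three cases of u_k."""
    if not join:
        return -g(c_true)                   # leakage via correlated peers only
    if selected:
        return P(c_report) - c_true + W     # paid, but incurs true cost c
    return -h(c_true) + W                   # joined, not selected

def expected_utility(c_report, c_true, A, P):
    """Average over the selection lottery with probability A(c_report)."""
    a = A(c_report)
    return a * (P(c_report) - c_true + W) + (1 - a) * (-h(c_true) + W)
```

For instance, with a flat menu A(c) = 0.5 and P(c) = 2, an agent with true cost 1 reporting truthfully has expected utility 0.5·(2 − 1 + 1) + 0.5·(1 − 0.5) = 1.25.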
Next, we elaborate on each term of the utility in detail.
Participation Benefit. Agents receive participation benefits by using services provided by the platform. The more participants the platform attracts, the more valuable the platform's service becomes. Let $\bar\lambda$ be the average participation rate of the population, i.e., the ratio of the number of agents on the platform to the total population of agents. We use $w(\bar\lambda)$ to denote the participation benefit, which is a non-decreasing function of the average participation rate $\bar\lambda$.
Correlation Strength. Data correlation results in information leakage. A stronger correlation naturally leads to more leakage and induces a larger privacy cost. Moreover, some agents might share a stronger correlation with each other within a group of agents than with others outside the group. In order to capture the inter-group versus intra-group difference, we divide the $s$ agents into $I$ groups, and agents within the same group $i$ share a common correlation strength $\sigma_i$. The correlation strength vector $\sigma_i$ is further defined as $\sigma_i \triangleq (\sigma_i^{\mathrm{in}}, \sigma_i^{\mathrm{out}})$, where $\sigma_i^{\mathrm{in}}$ and $\sigma_i^{\mathrm{out}}$ are used to respectively denote the correlation strength induced by agents inside group $i$ and those outside group $i$.
Privacy Cost. For an agent who does not join the platform, her privacy cost $g(\cdot)$ comes entirely from information leakage through her peers' data sharing on the platform due to data correlation. In contrast, for an agent who joins the platform but does not report her data, not only her peers' actions but also her own interactions with the platform result in her privacy cost, which is denoted by $h(\cdot)$. As joining the platform means more privacy leakage, we assume $h(\cdot) > g(\cdot)$. Moreover, recall that we consider $I$ groups of agents parameterized by data correlation. We model both privacy cost functions of group $i$, i.e., $g_i(\cdot)$ and $h_i(\cdot)$, with index $i$, and they depend on an intra-group cost and an inter-group cost determined by the respective correlation strengths.
2.2 Problem Statement
In this section, we introduce the analyst optimization prob-
lem. The analyst needs to design the payment function and
selection probability for each group of agents, subject to
two desirable economic constraints: individuals’ truthful-
ness and expected budget feasibility.
Definition 1 (Truthfulness). A mechanism is truthful if every participant with cost $c$ maximizes her expected utility by truthfully reporting her cost, i.e.,
$$u_i(c \mid c) \ge u_i(\tilde c \mid c), \quad \forall\, \tilde c \ne c, \ \forall\, i. \qquad (1)$$
Definition 1 guarantees that rational agents on the plat-
form will truthfully report their costs.
Definition 2 (Expected Budget Constraint). A mechanism with payment function $P$ and selection probability $A$ satisfies the expected budget constraint $B$ if and only if
$$\sum_{k:\, i = i(k)} \mathbb{E}_{c \sim f_i}\big[ P_i(c)\, A_i(c) \cdot \mathbb{1}(k \in N_i) \big] \le B. \qquad (2)$$
Definition 2 limits the expected payments from the analyst to the agents for their data. In (2), the index $i$ (of $P_i(c)$ and $A_i(c)$) denotes the association with agents of group $i$, $1 \le i \le I$, and $i(k)$ denotes the group index of agent $k$.
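The expected budget constraint (2) can be sanity-checked by Monte Carlo simulation. The sketch below assumes a single group with Uniform[0, 1] costs and simple illustrative payment and selection rules (all hypothetical choices, not taken from the paper):

```python
import random

random.seed(0)
S = 100          # number of agents in the group
B = 80.0         # budget

def A(c):        # illustrative selection probability, decreasing in cost
    return max(0.0, 1.0 - c)

def P(c):        # illustrative payment rule
    return 1.0 + c

def expected_spend(num_samples=100_000):
    """Monte Carlo estimate of sum_k E_{c ~ f}[P(c) A(c)] over S agents."""
    total = 0.0
    for _ in range(num_samples):
        c = random.random()          # cost drawn from Uniform[0, 1]
        total += P(c) * A(c)
    return S * (total / num_samples)
```

Here E[P(c)A(c)] = E[(1 + c)(1 − c)] = 1 − E[c²] = 2/3, so the expected spend is roughly 66.7 and the constraint holds for B = 80.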
The objective of the analyst is to optimize a trade-off between the bias $B(\hat\mu; \mathcal{D}, A)$ and the variance $V(\hat\mu; \mathcal{D}, A)$ of the estimator. Specifically, the analyst wishes to learn an underlying parameter $\mu$ of the whole population, and he obtains an estimator $\hat\mu$ based on participants' data. The estimator $\hat\mu$ is viewed as a random variable, and the randomness comes from the allocation rule $A$ and the joint distribution $\mathcal{D}$ of agents' data and costs.
Since the analyst does not know the joint distribution $\mathcal{D}$, he cannot directly optimize the bias and variance of the estimator. Instead, we consider the goal of minimizing the worst-case linear combination (with parameter $\gamma$) of bias and variance over all instantiations of $\mathcal{D}$ that are consistent with the marginal cost distribution $f$:
Definition 3 (Mechanism Design Task). The analyst aims to minimize a worst-case linear combination of bias and variance by designing a payment function $P$ and an allocation rule $A$ subject to truthfulness and budget constraints:
$$\inf_{A, P}\ \sup_{\mathcal{D} \text{ consistent with } f}\ \gamma \cdot V(\hat\mu; \mathcal{D}, A) + (1 - \gamma)\, B(\hat\mu; \mathcal{D}, A)$$
$$\text{s.t.}\ \ \text{Truthfulness constraint in (1)}, \ \ \text{Expected budget constraint in (2)}.$$
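When the analyst can only enumerate finitely many candidate distributions and mechanisms, the inf-sup of Definition 3 reduces to a min over a max. The toy Python sketch below makes this explicit; the candidate (bias, variance) pairs are hypothetical inputs, not derived from any real estimator:

```python
def worst_case_objective(gamma, candidates):
    """sup over candidate distributions D of gamma*V + (1 - gamma)*|bias|,
    where each candidate supplies the (bias, variance) of the estimator."""
    return max(gamma * var + (1 - gamma) * abs(bias)
               for bias, var in candidates)

def best_mechanism(gamma, menu):
    """inf over a finite menu of mechanisms, each described by the list of
    (bias, variance) pairs it induces under the candidate distributions."""
    return min(range(len(menu)),
               key=lambda m: worst_case_objective(gamma, menu[m]))
```

For example, with gamma = 0.5 and two candidate mechanisms, the one whose worst-case score is smaller is selected.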
2.3 Equilibrium Participation Rate
We define the participation rate profile $\lambda$ as a vector consisting of the participation rates of all $I$ groups of agents, i.e., $\lambda \triangleq [\bar\lambda_i]_{1 \le i \le I}$. Notice that agents' utilities depend on the participation rate profile, which affects their privacy costs.
Next, we introduce an agent's decision on whether or not to participate in the platform, for a given participation rate profile. For an agent with cost $c$ in group $i$, we use a binary variable $d_i(c \mid \lambda)$ to denote her decision, where $d_i(c \mid \lambda) = 1$ means participation and $d_i(c \mid \lambda) = 0$ means non-participation. She decides to participate in the platform only if her utility from participation is not lower than that from non-participation:
$$d_i(c \mid \lambda) = \begin{cases} 1, & \text{if } \max_{\tilde c}\ \bar u_i(\tilde c \mid c) \ge -g_i(c; \lambda), \\ 0, & \text{otherwise}. \end{cases} \qquad (3)$$
We now define the equilibrium concept formally. This
notion of equilibrium guarantees that agents’ decisions are
consistent with the participation rate profile.
Definition 4 (Equilibrium). A participation rate profile $\lambda = [\bar\lambda_i]_{1 \le i \le I}$ is an equilibrium if, for each group $i$, $1 \le i \le I$, the fraction of participating agents is exactly $\bar\lambda_i$, i.e.,
$$\int_c \mathbb{1}\big[ d_i(c \mid \lambda) = 1 \big] \cdot f_i(c)\, dc = \bar\lambda_i, \quad \forall\, i. \qquad (4)$$
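The equilibrium of Definition 4 is a fixed point of the map from an assumed participation rate to the fraction of agents who prefer joining at that rate. A minimal single-group sketch in Python (the discretized costs and the linear utility functions are hypothetical, not the paper's model):

```python
def equilibrium_rate(u_join, u_out, costs, iters=200):
    """Iterate lam -> fraction of agents who prefer joining at rate lam (Eq. (4)).

    u_join(c, lam): best achievable utility from joining at rate lam.
    u_out(c, lam):  utility of staying out (i.e., -g) at rate lam.
    costs:          sample of agent costs standing in for the density f.
    """
    lam = 1.0
    for _ in range(iters):
        joined = sum(1 for c in costs if u_join(c, lam) >= u_out(c, lam))
        new_lam = joined / len(costs)
        if new_lam == lam:           # decisions consistent with the rate
            break
        lam = new_lam
    return lam

# Illustrative instance: integer costs 0..99 (a discretized cost distribution),
# hypothetical joining utility 50*lam - c and outside option -20*lam.
costs = list(range(100))
lam_star = equilibrium_rate(lambda c, lam: 50 * lam - c,
                            lambda c, lam: -20 * lam,
                            costs)
```

Starting from full participation, this toy instance unravels to a low-participation fixed point (lam_star = 0.03 here); equilibria need not be unique, and the mechanism's payments shape which fixed point is reached.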
3. RESULTS
Our main results characterize the payment rule and the
optimal allocation rule. In the full version of this paper,
we provide additional results such as characterizations of
properties of the mechanism and the participation ratio.
Payment Rule. To solve the analyst's mechanism design problem in Definition 3, we first analyze the structure of the payment function in the mechanism associated with the equilibrium $\lambda$. We assume that for each group $i \in [I]$, the privacy cost $h_i(c; \lambda)$ given participation rate profile $\lambda$ is linear in the cost $c$ with parameter $b_i(\lambda)$. Theorem 1 shows the payment function given a monotonic allocation rule $A$ that induces agents' truthful reporting of costs. The term $\kappa$ in (5) is a constant given a fixed $\lambda$. Due to space limitations, we do not present its detailed definition.
Theorem 1. The mechanism is truthful (strictly truthful, respectively) and induces participation profile $\lambda$ at equilibrium if and only if for every group $i \in [I]$, both of the following statements hold:
(i) the allocation rule $A_i(\tilde c)$ is a non-increasing (decreasing, respectively) function of the reported cost $\tilde c$;
(ii) the payment function is given as in Equation (5) (as a function of $A$), where $\lambda$ is the desired participation rate profile.
$$P_i(\tilde c) = \frac{1}{A_i(\tilde c)} \Big[ (1 - b_i(\lambda)) \int_{\tilde c}^{c_{\max}} A_i(z)\, dz + \kappa \Big] + \tilde c - h_i(\tilde c; \lambda). \qquad (5)$$
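Given a non-increasing allocation rule, the payment in (5) can be evaluated numerically. The sketch below is one plausible reading of (5), assuming a linear privacy cost h(c) = b·c and a constant kappa (both hypothetical values), with the integral computed by the trapezoidal rule:

```python
def payment(c_report, A, b=0.2, kappa=0.1, c_max=1.0, grid=10_000):
    """Payment rule: (1/A(c)) * [(1-b) * integral_c^{c_max} A(z) dz + kappa]
    + c - h(c), with a linear privacy cost h(c) = b * c (an assumption)."""
    # Trapezoidal integration of A over [c_report, c_max].
    step = (c_max - c_report) / grid
    integral = 0.0
    for j in range(grid):
        z0 = c_report + j * step
        z1 = z0 + step
        integral += 0.5 * (A(z0) + A(z1)) * step
    h = b * c_report
    return ((1 - b) * integral + kappa) / A(c_report) + c_report - h

A = lambda c: 1.0 - 0.5 * c     # illustrative non-increasing allocation rule
p0 = payment(0.0, A)            # payment for the lowest-cost report
```

For this rule, the integral from 0 to 1 of A is 0.75, so p0 = 0.8·0.75 + 0.1 = 0.7.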
Optimal Allocation Rule. After deriving the payment rule that induces agents' truthfulness, it remains to characterize the allocation rule that optimizes the worst-case bias-variance trade-off, subject to the budget constraint. The optimal allocation rule of each group is non-increasing in the virtual cost.
Definition 5 (Virtual Cost). In group $i$, the virtual cost of an agent with cost $c$ is
$$\varphi_i(c; \lambda) = c - h_i(c; \lambda) + (1 - b_i(\lambda))\, \frac{F_i(c)}{f_i(c)}.$$
$F_i$ and $f_i$ are the CDF and PDF of the cost in group $i$, respectively.
Theorem 2. Assume the virtual cost $\varphi_i(c; \lambda)$ in each group $i \in [I]$ is non-decreasing. Given a desired participation rate profile $\lambda$, the optimal allocation rule of group $i$ is
$$A_i(c) = \begin{cases} \chi, & \text{if } \varphi_i(c; \lambda) \le \hat\varphi, \\ \dfrac{p}{\sqrt{\varphi_i(c; \lambda)}}, & \text{if } \hat\varphi < \varphi_i(c; \lambda) \le \varphi_i(c_{\max}; \lambda). \end{cases}$$
The characterizations of the constants $p$, $\chi$, and $\hat\varphi$ are omitted due to space limits.
The optimal allocation rule can have two possible structures: Fixed then Decreasing (FtD) or Strictly Decreasing (SD), depending on system parameters such as the budget $B$. Under the FtD structure, the allocation rule is first fixed in the low-cost region and then strictly decreasing (inversely proportional to the square root of the virtual cost) in the high-cost region. Under the SD structure, the allocation rule is strictly decreasing (inversely proportional to the square root of the virtual cost) over the whole region.
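A small numerical illustration: for Uniform[0, 1] costs and a linear privacy cost h(c) = b·c, the virtual cost of Definition 5 has a closed form, and the FtD shape follows from the threshold rule of Theorem 2. The constants CHI, PHI_HAT, and P_CONST below are hypothetical placeholders (the paper omits their characterization); P_CONST is chosen so the two pieces meet continuously:

```python
import math

B_COEF = 0.2     # hypothetical slope b of the linear privacy cost h(c) = b*c

def virtual_cost(c, b=B_COEF):
    """Virtual cost for Uniform[0,1] costs (F(c) = c, f(c) = 1) and h(c) = b*c:
    phi(c) = c - b*c + (1 - b)*c = 2*(1 - b)*c."""
    return c - b * c + (1 - b) * c

CHI, PHI_HAT = 0.8, 0.25                 # hypothetical constants of Theorem 2
P_CONST = CHI * math.sqrt(PHI_HAT)       # makes the allocation rule continuous

def allocation(c):
    """FtD structure: fixed at CHI in the low-cost region, then ~ 1/sqrt(phi)."""
    phi = virtual_cost(c)
    if phi <= PHI_HAT:
        return CHI
    return P_CONST / math.sqrt(phi)
```

With these placeholder values the allocation stays flat at 0.8 for low costs and then decreases like one over the square root of the virtual cost, which is the FtD shape; shrinking the flat region to nothing recovers the SD shape.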
4. REFERENCES
[1] D. Acemoglu, A. Makhdoumi, A. Malekian, and A. Ozdaglar. Too much data: Prices and inefficiencies in data markets. Technical report, National Bureau of Economic Research, 2019.
[2] Y. Chen, N. Immorlica, B. Lucier, V. Syrgkanis, and J. Ziani. Optimal data acquisition for statistical estimation. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 27–44. ACM, 2018.
[3] A. Roth and G. Schoenebeck. Conducting truthful surveys, cheaply. In Proceedings of the 13th ACM Conference on Electronic Commerce, pages 826–843, 2012.