The Privacy Paradox and Optimal Bias-Variance Trade-offs in Data Acquisition

Guocheng Liao* (CUHK), Yu Su* (Caltech), Juba Ziani (University of Pennsylvania), Adam Wierman (Caltech), Jianwei Huang (CUHK, SZ)
ABSTRACT
While users claim to be concerned about privacy, often they
do little to protect their privacy in their online actions. One
prominent explanation for this “privacy paradox” is that
when an individual shares her data, it is not just her pri-
vacy that is compromised; the privacy of other individuals
with correlated data is also compromised. This information
leakage encourages oversharing of data and significantly im-
pacts the incentives of individuals in online platforms. In
this extended abstract, we discuss the design of mechanisms
for data acquisition in settings with information leakage and
verifiable data. We summarize work designing an incentive
compatible mechanism that optimizes the worst-case trade-off between bias and variance of the estimation subject to a
budget constraint, where the worst-case is over the unknown
correlation between costs and data. Additionally, we charac-
terize the structure of the optimal mechanism in closed form
and study monotonicity and non-monotonicity properties of
the marketplace.
1. INTRODUCTION
There is a fundamental discrepancy between privacy atti-
tudes and the behaviors of users: users claim to be concerned
about their privacy but do little to protect privacy in their
actions. This phenomenon is known as the privacy paradox.
Understanding the reasons behind this paradox and its con-
sequences for the design of online platforms is an important
goal for both computer scientists and economists.
We try to explain the privacy paradox from the perspec-
tive of data correlation. In particular, when an individual
shares her data, it is not just her privacy that is compro-
mised; the privacy of other individuals whose data is cor-
related with hers is also compromised. Thus, these other
individuals are more likely to share their own data given that some of it has already been leaked [1]. Information leakage
due to correlation has been shown to lead to oversharing
since each individual overlooks their own privacy concerns
as a result of the negative externalities created by others’
revelation decisions.
In this extended abstract, we study the impact of privacy concerns and information leakage on the design of data markets.
Specifically, we study the task of designing mechanisms for obtaining verifiable data from a population for a statistical estimation task, such as estimating the expected value of some function of the underlying data (e.g., salary and BMI).

*Equal contribution. Copyright is held by author/owner(s).
A growing line of work has focused on the design of such
optimal data acquisition mechanisms, e.g., [3, 2]. Studies in
[3, 2] focused on mechanism design for unbiased estimation
with minimal variance. However, this literature assumed
that all individuals will participate, thus unbiased estima-
tion is possible. Further, this line of work has not considered
the issues created by information leakage due to correlation
between the participants. Information leakage creates sig-
nificant incentives for increased data sharing [1] and thus
mechanisms that do not consider it directly will suffer from
undetected bias and increased variance in the obtained esti-
mates.
Contributions.
In this extended abstract, we provide the
first characterization of an optimal mechanism for data ac-
quisition in the setting where agents are concerned about
privacy due to data correlation. Data correlation causes in-
formation leakages to non-data reporters. Additionally, the
mechanism allows, for the first time, a trade-off between
the bias and variance of the estimator when privacy cost is
considered.
Specifically, we propose a novel model for data acquisi-
tion. The novelty of our model consists of three important
components. First, we introduce the privacy cost to model
impacts of data correlation. We divide the agents into dif-
ferent groups and assume that agents within the same group
share the same correlation strength. This gives us the power
to work with any granularity of choice with regard to data
correlation. Second, in reality, not every agent always de-
cides to join the platform. Thus, we introduce the notion
of participation rate as the ratio of the number of agents
who join the platform to the total number of agents. This
further allows us to study equilibria with respect to partic-
ipation rate, which is crucial since the mechanism impacts
the participation rate. Third, given that not every agent
joins the platform, it is not always realistic to aim for an
unbiased estimator. Instead, we minimize a linear combi-
nation of bias and variance of the estimator. Via a choice
of constant weights for bias and variance, we are able to
balance between these two metrics of the estimator.
Our main theoretical results provide a closed form solution
of payment and allocation rules under a choice of equilibrium
participation rate in order to achieve a truthful mechanism.
More specifically, we aim to minimize a linear combination
of bias and variance subject to budget and truthfulness con-
straints. Moreover, we provide conditions for the optimality
of an unbiased estimator in the case when it is possible to achieve a full participation rate.

Performance Evaluation Review, Vol. 49, No. 2, September 2021
Our results offer some interesting insights about mecha-
nisms for data acquisition. First, an unbiased estimator is
possible even if the budget is relatively small. Second, incor-
poration of privacy cost due to information leakage makes
it possible for the analyst to underpay the agents to acquire
the same data set and encourages more agents to join the
platform. This aligns with the privacy paradox frequently
observed in platforms today.
2. SYSTEM MODEL
We consider an online platform consisting of an analyst
and many agents. The analyst aims to design a mechanism
to purchase private data from agents to perform a statistical
estimation task, e.g., estimate the mean of the agents’ data.
There are $s$ agents that hold data of interest to the analyst.
Every agent owns a data point. By reporting her data to the
platform's survey, the agent incurs an overall cost $c$, which
is known to her but not to the analyst. The overall cost
consists of a combination of reporting cost and privacy cost,
where the reporting cost results from the act of reporting the
data while the privacy cost comes from both data sharing
and data correlation.
We consider a family of pricing mechanisms that presents a menu to the agents. The menu consists of pairs of payments $P$ and probabilities $A$ of having agents' data selected for use in the estimation task. Both $P$ and $A$ are functions of the agent's reported cost $\tilde c$. The mechanism works as follows. First, each agent is presented the menu by the analyst. Second, given the menu, an agent decides if she would like to join the platform. An agent who decides to join the platform is asked to report her cost, which determines the payment and selection probability. The payment $P(\tilde c)$ is given to the agent if she joins the platform and her data is selected (used) by the analyst. More specifically, her data is selected with probability $A(\tilde c)$, and she receives the payment only if her data is selected.
2.1 Agent Model
We introduce the agent's utility function. If the agent does not join the platform, she suffers a privacy cost $g(c)$ due to information leakage as a result of correlation between her data and the data of those agents who do join. If the agent joins the platform but is not selected for data contribution, she experiences the benefit of participation $w$ and a privacy cost. This privacy cost comes both from the agent herself sharing her data on the platform and from her friends' sharing of possibly correlated data; we use $h(c)$ to denote this combination. If an agent joins the platform and is selected for data contribution, she obtains the benefit of participation and the payment, and suffers the true cost $c$. We use $N$ to denote the set of agents who join the platform. In summary, an agent $k$'s utility is:

$$u_k(\tilde c \mid c) = \begin{cases} -g(c), & \text{if } k \notin N, \\ -h(c) + w(\cdot), & \text{w/ prob. } 1 - A(\tilde c), \ \text{if } k \in N, \\ P(\tilde c) - c + w(\cdot), & \text{w/ prob. } A(\tilde c), \ \text{if } k \in N. \end{cases}$$
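To make the three utility cases concrete, here is a minimal Python sketch that evaluates an agent's realized and expected utility. The linear cost functions g and h, the benefit W, and the menu rules A and P are illustrative placeholders, not the paper's calibration:

```python
# Hypothetical privacy-cost and benefit functions (illustrative only).
def g(c):      # privacy cost if the agent stays off the platform
    return 0.3 * c

def h(c):      # privacy cost if she joins but is not selected
    return 0.5 * c

W = 1.0        # participation benefit w(.) at some fixed participation rate

def utility(join, selected, c_report, c_true, A, P):
    """Realized utility for each of the three cases of u_k."""
    if not join:
        return -g(c_true)                   # leakage via correlated peers only
    if selected:
        return P(c_report) - c_true + W     # paid, but incurs true cost c
    return -h(c_true) + W                   # joined, not selected

def expected_utility(c_report, c_true, A, P):
    """Average over the selection lottery with probability A(c_report)."""
    a = A(c_report)
    return a * (P(c_report) - c_true + W) + (1 - a) * (-h(c_true) + W)
```

For instance, with a flat menu A(c) = 0.5 and P(c) = 2, an agent with true cost 1 reporting truthfully has expected utility 0.5·(2 − 1 + 1) + 0.5·(1 − 0.5) = 1.25.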
Next, we elaborate on each term of the utility in detail.
Participation Benefit. Agents receive participation benefits by using services provided by the platform. The more participants the platform attracts, the more valuable the platform's service becomes. Let $\bar\lambda$ be the average participation rate of the population, i.e., the ratio of the number of agents on the platform to the total population of agents. We use $w(\bar\lambda)$ to denote the participation benefit, which is a non-decreasing function of the average participation rate $\bar\lambda$.
Correlation Strength. Data correlation results in information leakage. A stronger correlation naturally leads to more leakage and induces a larger privacy cost. Moreover, some agents might share a stronger correlation with each other within a group of agents than with others outside the group. In order to capture the inter-group versus intra-group difference, we divide the $s$ agents into $I$ groups, and agents within the same group $i$ share a common correlation strength $\sigma_i$. The correlation strength vector $\sigma_i$ is further defined as $\sigma_i \triangleq (\sigma_i^{\mathrm{in}}, \sigma_i^{\mathrm{out}})$, where $\sigma_i^{\mathrm{in}}$ and $\sigma_i^{\mathrm{out}}$ are used to respectively denote the correlation strength induced by agents inside group $i$ and those outside group $i$.
Privacy Cost. For an agent who does not join the platform, her privacy cost $g(\cdot)$ comes entirely from information leakage through her peers' data sharing on the platform due to data correlation. In contrast, for an agent who joins the platform but does not report her data, not only her peers' actions but also her own interactions with the platform result in her privacy cost, which is denoted by $h(\cdot)$. As joining the platform means more privacy leakage, we assume $h(\cdot) > g(\cdot)$. Moreover, recall that we consider $I$ groups of agents parameterized by data correlation. We model both privacy cost functions of group $i$, i.e., $g_i(\cdot)$ and $h_i(\cdot)$, with index $i$, and they depend on an intra-group cost and an inter-group cost determined by the respective correlation strengths.
2.2 Problem Statement
In this section, we introduce the analyst optimization prob-
lem. The analyst needs to design the payment function and
selection probability for each group of agents, subject to
two desirable economic constraints: individuals’ truthful-
ness and expected budget feasibility.
Definition 1 (Truthfulness). A mechanism is truthful if every participant with cost $c$ maximizes her expected utility by truthfully reporting her cost, i.e.,
$$u_i(c \mid c) \ge u_i(\tilde c \mid c), \quad \forall\, \tilde c \ne c, \ \forall\, i. \qquad (1)$$
Definition 1 guarantees that rational agents on the plat-
form will truthfully report their costs.
Definition 2 (Expected Budget Constraint). A mechanism with payment function $P$ and selection probability $A$ satisfies the expected budget constraint $B$ if and only if
$$\sum_{k:\, i = i(k)} \mathbb{E}_{c \sim f_i}\big[ P_i(c)\, A_i(c) \cdot \mathbb{1}(k \in N_i) \big] \le B. \qquad (2)$$
Definition 2 limits the expected payments from the analyst to the agents for their data. In (2), the index $i$ (of $P_i(c)$ and $A_i(c)$) denotes the association with agents of group $i$, $1 \le i \le I$, and $i(k)$ denotes the group index of agent $k$.
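The expected budget constraint (2) can be sanity-checked by Monte Carlo simulation. The sketch below assumes a single group with Uniform[0, 1] costs and simple illustrative payment and selection rules (all hypothetical choices, not taken from the paper):

```python
import random

random.seed(0)
S = 100          # number of agents in the group
B = 80.0         # budget

def A(c):        # illustrative selection probability, decreasing in cost
    return max(0.0, 1.0 - c)

def P(c):        # illustrative payment rule
    return 1.0 + c

def expected_spend(num_samples=100_000):
    """Monte Carlo estimate of sum_k E_{c ~ f}[P(c) A(c)] over S agents."""
    total = 0.0
    for _ in range(num_samples):
        c = random.random()          # cost drawn from Uniform[0, 1]
        total += P(c) * A(c)
    return S * (total / num_samples)
```

Here E[P(c)A(c)] = E[(1 + c)(1 − c)] = 1 − E[c²] = 2/3, so the expected spend is roughly 66.7 and the constraint holds for B = 80.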
The objective of the analyst is to optimize a trade-off between the bias $B(\hat\mu; \mathcal{D}, A)$ and the variance $V(\hat\mu; \mathcal{D}, A)$ of the estimator. Specifically, the analyst wishes to learn an underlying parameter $\mu$ of the whole population, and he obtains an estimator $\hat\mu$ based on participants' data. The estimator $\hat\mu$ is viewed as a random variable, and the randomness comes from the allocation rule $A$ and the joint distribution $\mathcal{D}$ of agents' data and costs.
Since the analyst does not know the joint distribution $\mathcal{D}$, he cannot directly optimize the bias and variance of the estimator. Instead, we consider the goal of minimizing the worst-case linear combination (with parameter $\gamma$) of bias and variance over all instantiations of $\mathcal{D}$ that are consistent with the marginal cost distribution $f$:
Definition 3 (Mechanism Design Task). The analyst aims to minimize a worst-case linear combination of bias and variance by designing a payment function $P$ and an allocation rule $A$ subject to truthfulness and budget constraints:
$$\inf_{A, P}\ \sup_{\mathcal{D} \text{ consistent with } f}\ \gamma \cdot V(\hat\mu; \mathcal{D}, A) + (1 - \gamma)\, B(\hat\mu; \mathcal{D}, A)$$
$$\text{s.t.}\ \ \text{Truthfulness constraint in (1)}, \ \ \text{Expected budget constraint in (2)}.$$
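When the analyst can only enumerate finitely many candidate distributions and mechanisms, the inf-sup of Definition 3 reduces to a min over a max. The toy Python sketch below makes this explicit; the candidate (bias, variance) pairs are hypothetical inputs, not derived from any real estimator:

```python
def worst_case_objective(gamma, candidates):
    """sup over candidate distributions D of gamma*V + (1 - gamma)*|bias|,
    where each candidate supplies the (bias, variance) of the estimator."""
    return max(gamma * var + (1 - gamma) * abs(bias)
               for bias, var in candidates)

def best_mechanism(gamma, menu):
    """inf over a finite menu of mechanisms, each described by the list of
    (bias, variance) pairs it induces under the candidate distributions."""
    return min(range(len(menu)),
               key=lambda m: worst_case_objective(gamma, menu[m]))
```

For example, with gamma = 0.5 and two candidate mechanisms, the one whose worst-case score is smaller is selected.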
2.3 Equilibrium Participation Rate
We define the participation rate profile $\lambda$ as a vector consisting of the participation rates of all $I$ groups of agents, i.e., $\lambda \triangleq [\bar\lambda_i]_{1 \le i \le I}$. Notice that agents' utilities depend on the participation rate profile, which affects their privacy costs.
Next, we introduce an agent's decision on whether or not to participate in the platform, for a given participation rate profile. For an agent with cost $c$ in group $i$, we use a binary variable $d_i(c \mid \lambda)$ to denote her decision, where $d_i(c \mid \lambda) = 1$ means participation and $d_i(c \mid \lambda) = 0$ means non-participation. She decides to participate in the platform only if her utility from participation is not lower than that from non-participation:
$$d_i(c \mid \lambda) = \begin{cases} 1, & \text{if } \max_{\tilde c}\ \bar u_i(\tilde c \mid c) \ge -g_i(c; \lambda), \\ 0, & \text{otherwise}. \end{cases} \qquad (3)$$
We now define the equilibrium concept formally. This
notion of equilibrium guarantees that agents’ decisions are
consistent with the participation rate profile.
Definition 4 (Equilibrium). A participation rate profile $\lambda = [\bar\lambda_i]_{1 \le i \le I}$ is an equilibrium if, for each group $i$, $1 \le i \le I$, the fraction of participating agents is exactly $\bar\lambda_i$, i.e.,
$$\int_c \mathbb{1}\big[ d_i(c \mid \lambda) = 1 \big] \cdot f_i(c)\, dc = \bar\lambda_i, \quad \forall\, i. \qquad (4)$$
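The equilibrium of Definition 4 is a fixed point of the map from an assumed participation rate to the fraction of agents who prefer joining at that rate. A minimal single-group sketch in Python (the discretized costs and the linear utility functions are hypothetical, not the paper's model):

```python
def equilibrium_rate(u_join, u_out, costs, iters=200):
    """Iterate lam -> fraction of agents who prefer joining at rate lam (Eq. (4)).

    u_join(c, lam): best achievable utility from joining at rate lam.
    u_out(c, lam):  utility of staying out (i.e., -g) at rate lam.
    costs:          sample of agent costs standing in for the density f.
    """
    lam = 1.0
    for _ in range(iters):
        joined = sum(1 for c in costs if u_join(c, lam) >= u_out(c, lam))
        new_lam = joined / len(costs)
        if new_lam == lam:           # decisions consistent with the rate
            break
        lam = new_lam
    return lam

# Illustrative instance: integer costs 0..99 (a discretized cost distribution),
# hypothetical joining utility 50*lam - c and outside option -20*lam.
costs = list(range(100))
lam_star = equilibrium_rate(lambda c, lam: 50 * lam - c,
                            lambda c, lam: -20 * lam,
                            costs)
```

Starting from full participation, this toy instance unravels to a low-participation fixed point (lam_star = 0.03 here); equilibria need not be unique, and the mechanism's payments shape which fixed point is reached.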
3. RESULTS
Our main results characterize the payment rule and the
optimal allocation rule. In the full version of this paper,
we provide additional results such as characterizations of
properties of the mechanism and the participation ratio.
Payment Rule. To solve the analyst's mechanism design problem in Definition 3, we first analyze the structure of the payment function in the mechanism associated with the equilibrium $\lambda$. We assume that for each group $i \in [I]$, the privacy cost $h_i(c; \lambda)$ given participation rate profile $\lambda$ is linear in the cost $c$ with parameter $b_i(\lambda)$. Theorem 1 shows the payment function given a monotonic allocation rule $A$ that induces agents' truthful reporting of costs. The term $\kappa$ in (5) is a constant given a fixed $\lambda$. Due to space limitations, we do not present its detailed definition.
Theorem 1. The mechanism is truthful (strictly truthful, respectively) and induces participation profile $\lambda$ at equilibrium if and only if for every group $i \in [I]$, both of the following statements hold:
(i) the allocation rule $A_i(\tilde c)$ is a non-increasing (decreasing, respectively) function of the reported cost $\tilde c$;
(ii) the payment function is given as in Equation (5) (as a function of $A$), where $\lambda$ is the desired participation rate profile.
$$P_i(\tilde c) = \frac{1}{A_i(\tilde c)} \Big[ (1 - b_i(\lambda)) \int_{\tilde c}^{c_{\max}} A_i(z)\, dz + \kappa \Big] + \tilde c - h_i(\tilde c; \lambda). \qquad (5)$$
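Given a non-increasing allocation rule, the payment in (5) can be evaluated numerically. The sketch below is one plausible reading of (5), assuming a linear privacy cost h(c) = b·c and a constant kappa (both hypothetical values), with the integral computed by the trapezoidal rule:

```python
def payment(c_report, A, b=0.2, kappa=0.1, c_max=1.0, grid=10_000):
    """Payment rule: (1/A(c)) * [(1-b) * integral_c^{c_max} A(z) dz + kappa]
    + c - h(c), with a linear privacy cost h(c) = b * c (an assumption)."""
    # Trapezoidal integration of A over [c_report, c_max].
    step = (c_max - c_report) / grid
    integral = 0.0
    for j in range(grid):
        z0 = c_report + j * step
        z1 = z0 + step
        integral += 0.5 * (A(z0) + A(z1)) * step
    h = b * c_report
    return ((1 - b) * integral + kappa) / A(c_report) + c_report - h

A = lambda c: 1.0 - 0.5 * c     # illustrative non-increasing allocation rule
p0 = payment(0.0, A)            # payment for the lowest-cost report
```

For this rule, the integral from 0 to 1 of A is 0.75, so p0 = 0.8·0.75 + 0.1 = 0.7.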
Optimal Allocation Rule. After deriving the payment rule that induces agents' truthfulness, it remains to characterize the allocation rule that optimizes the worst-case bias-variance trade-off, subject to the budget constraint. The optimal allocation rule of each group is non-increasing in the virtual cost.
Definition 5 (Virtual Cost). In group $i$, the virtual cost of an agent with cost $c$ is
$$\varphi_i(c; \lambda) = c - h_i(c; \lambda) + (1 - b_i(\lambda))\, \frac{F_i(c)}{f_i(c)}.$$
$F_i$ and $f_i$ are the CDF and PDF of the cost in group $i$, respectively.
Theorem 2. Assume the virtual cost $\varphi_i(c; \lambda)$ in each group $i \in [I]$ is non-decreasing. Given a desired participation rate profile $\lambda$, the optimal allocation rule of group $i$ is
$$A_i(c) = \begin{cases} \chi, & \text{if } \varphi_i(c; \lambda) \le \hat\varphi, \\ \dfrac{p}{\sqrt{\varphi_i(c; \lambda)}}, & \text{if } \hat\varphi < \varphi_i(c; \lambda) \le \varphi_i(c_{\max}; \lambda). \end{cases}$$
The characterizations of the constants $p$, $\chi$, and $\hat\varphi$ are omitted due to space limits.
The optimal allocation rule can have two possible structures: Fixed then Decreasing (FtD) or Strictly Decreasing (SD), depending on system parameters such as the budget $B$. Under the FtD structure, the allocation rule is first fixed in the low-cost region and then strictly decreasing (inversely proportional to the square root of the virtual cost) in the high-cost region. Under the SD structure, the allocation rule is strictly decreasing (inversely proportional to the square root of the virtual cost) over the whole region.
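A small numerical illustration: for Uniform[0, 1] costs and a linear privacy cost h(c) = b·c, the virtual cost of Definition 5 has a closed form, and the FtD shape follows from the threshold rule of Theorem 2. The constants CHI, PHI_HAT, and P_CONST below are hypothetical placeholders (the paper omits their characterization); P_CONST is chosen so the two pieces meet continuously:

```python
import math

B_COEF = 0.2     # hypothetical slope b of the linear privacy cost h(c) = b*c

def virtual_cost(c, b=B_COEF):
    """Virtual cost for Uniform[0,1] costs (F(c) = c, f(c) = 1) and h(c) = b*c:
    phi(c) = c - b*c + (1 - b)*c = 2*(1 - b)*c."""
    return c - b * c + (1 - b) * c

CHI, PHI_HAT = 0.8, 0.25                 # hypothetical constants of Theorem 2
P_CONST = CHI * math.sqrt(PHI_HAT)       # makes the allocation rule continuous

def allocation(c):
    """FtD structure: fixed at CHI in the low-cost region, then ~ 1/sqrt(phi)."""
    phi = virtual_cost(c)
    if phi <= PHI_HAT:
        return CHI
    return P_CONST / math.sqrt(phi)
```

With these placeholder values the allocation stays flat at 0.8 for low costs and then decreases like one over the square root of the virtual cost, which is the FtD shape; shrinking the flat region to nothing recovers the SD shape.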
4. REFERENCES
[1] D. Acemoglu, A. Makhdoumi, A. Malekian, and A. Ozdaglar. Too much data: Prices and inefficiencies in data markets. Technical report, National Bureau of Economic Research, 2019.
[2] Y. Chen, N. Immorlica, B. Lucier, V. Syrgkanis, and J. Ziani. Optimal data acquisition for statistical estimation. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 27–44. ACM, 2018.
[3] A. Roth and G. Schoenebeck. Conducting truthful surveys, cheaply. In Proceedings of the 13th ACM Conference on Electronic Commerce, pages 826–843, 2012.