UNCERTAINTY QUANTIFICATION OF THE 4TH KIND; OPTIMAL POSTERIOR ACCURACY-UNCERTAINTY TRADEOFF WITH THE MINIMUM ENCLOSING BALL

HAMED HAMZE BAJGIRAN, PAU BATLLE FRANCH, HOUMAN OWHADI*, CLINT SCOVEL, MAHDY SHIRDEL, MICHAEL STANLEY, AND PEYMAN TAVALLALI

Date: August 25, 2021. Author list in alphabetical order.
*Corresponding author: owhadi@caltech.edu.
Abstract. There are essentially three kinds of approaches to Uncertainty Quantification (UQ): (A) robust optimization (min and max), (B) Bayesian (conditional average), (C) decision theory (minmax). Although (A) is robust, it is unfavorable with respect to accuracy and data assimilation. (B) requires a prior, is generally non-robust (brittle) with respect to the choice of that prior, and posterior estimation can be slow. Although (C) leads to the identification of an optimal prior, its approximation suffers from the curse of dimensionality, and the notion of loss/risk used to identify the prior is one that is averaged with respect to the distribution of the data. We introduce a 4th kind which is a hybrid between (A), (B), (C), and hypothesis testing. It can be summarized as, after observing a sample $x$, (1) defining a likelihood region through the relative likelihood and (2) playing a minmax game in that region to define optimal estimators and their risk. The resulting method has several desirable properties: (a) an optimal prior is identified after measuring the data, and the notion of loss/risk is a posterior one; (b) the determination of the optimal estimate and its risk can be reduced to computing the minimum enclosing ball of the image of the likelihood region under the quantity of interest map (such computations are fast and do not suffer from the curse of dimensionality). The method is characterized by a parameter in $[0,1]$ acting as an assumed lower bound on the rarity of the observed data (the relative likelihood). When that parameter is near 1, the method produces a posterior distribution concentrated around a maximum likelihood estimate (MLE) with tight but low-confidence UQ estimates. When that parameter is near 0, the method produces a maximal-risk posterior distribution with high-confidence UQ estimates. In addition to navigating the accuracy-uncertainty tradeoff, the proposed method addresses the brittleness of Bayesian inference by navigating the robustness-accuracy tradeoff associated with data assimilation.
1. Introduction
Let $\phi : \Theta \to V$ be a quantity of interest, where $V$ is a finite-dimensional vector space and $\Theta$ is a compact set. Let $\mathcal{X}$ be a measurable space and write $\mathcal{P}(\mathcal{X})$ for the set of probability distributions on $\mathcal{X}$. Consider a model $\mathbb{P} : \Theta \to \mathcal{P}(\mathcal{X})$ representing the dependence of the distribution of a data point $x \sim \mathbb{P}(\cdot\,|\,\theta)$ on the value of the parameter $\theta \in \Theta$. Throughout, we use $\|\cdot\|$ to denote the Euclidean norm. We are interested in solving the following problem.
Problem 1. Let $\theta^\dagger$ be an unknown element of $\Theta$. Given an observation $x \sim \mathbb{P}(\cdot\,|\,\theta^\dagger)$ of the data, estimate $\phi(\theta^\dagger)$ and quantify the uncertainty (accuracy/risk) of the estimate.
We will assume that $\mathbb{P}$ is a dominated model with positive densities; that is, for each $\theta \in \Theta$, $\mathbb{P}(\cdot\,|\,\theta)$ is defined by a (strictly) positive density $p(\cdot\,|\,\theta) : \mathcal{X} \to \mathbb{R}_{>0}$ with respect to a measure $\nu \in \mathcal{P}(\mathcal{X})$, such that, for each measurable subset $A$ of $\mathcal{X}$,
$$\mathbb{P}(A\,|\,\theta) = \int_A p(x'\,|\,\theta)\, d\nu(x'), \qquad \theta \in \Theta. \tag{1.1}$$
1.1. The three main approaches to UQ. Problem 1 is a fundamental Uncertainty Quantification (UQ) problem, and there are essentially three main approaches to solving it. We now describe them when $V$ is a Euclidean space with the $\ell_2$ loss function.
1.1.1. Worst-case. In a different setting, essentially one where the set $\Theta$ consists of probability measures, the OUQ framework [22] provides a worst-case analysis yielding rigorous uncertainty bounds. In the setting of this paper, in the absence of data (or ignoring the data $x$), the (vanilla) worst-case (or robust optimization) answer is to estimate $\phi(\theta^\dagger)$ with the minimizer $d^* \in V$ of the worst-case error
$$\mathcal{R}(d) := \max_{\theta \in \Theta} \|\phi(\theta) - d\|^2. \tag{1.2}$$
In that approach, $(d^*, \mathcal{R}(d^*))$ are therefore identified as the center and squared radius of the minimum enclosing ball of $\phi(\Theta)$.
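Since $(d^*, \mathcal{R}(d^*))$ are the center and squared radius of a minimum enclosing ball, they are easy to approximate numerically. The following sketch is ours, not from the paper: it discretizes a toy one-dimensional $\Theta$, maps it through an illustrative $\phi$ into $\mathbb{R}^2$, and minimizes the worst-case error (1.2) with SciPy's derivative-free Nelder-Mead solver.

```python
import numpy as np
from scipy.optimize import minimize

# A minimal sketch (ours, not from the paper) of the worst-case reduction:
# discretize a toy Theta = [0, 1], map it through an illustrative quantity of
# interest phi into R^2, and minimize the worst-case error (1.2) over d.
thetas = np.linspace(0.0, 1.0, 200)
phi = np.stack([np.cos(thetas), np.sin(2 * thetas)], axis=1)   # phi(Theta) in R^2

def worst_case_error(d):
    # R(d) = max_theta ||phi(theta) - d||^2, evaluated on the grid
    return np.max(np.sum((phi - d) ** 2, axis=1))

# The minimizer d* is the center of the minimum enclosing ball of phi(Theta);
# the optimal value R(d*) is its squared radius. Nelder-Mead copes with the
# nonsmooth max; a dedicated min-ball solver would be faster.
res = minimize(worst_case_error, x0=phi.mean(axis=0), method="Nelder-Mead")
print("d* =", res.x, " R(d*) =", res.fun)
```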
1.1.2. Bayesian. The (vanilla) Bayesian (decision theory) approach (see e.g. Berger [3, Sec. 4.4]) is to assume that $\theta$ is sampled from a prior distribution $\pi \in \mathcal{P}(\Theta)$ and to approximate $\phi(\theta^\dagger)$ with the minimizer $d_\pi(x) \in V$ of the Bayesian posterior risk
$$\mathcal{R}_\pi(d) := \mathbb{E}_{\theta \sim \pi_x} \|\phi(\theta) - d\|^2, \qquad d \in V, \tag{1.3}$$
associated with the decision $d \in V$, where
$$\pi_x := \frac{p(x\,|\,\cdot)\,\pi}{\int_\Theta p(x\,|\,\theta)\, d\pi(\theta)} \tag{1.4}$$
is the posterior measure determined by the likelihood $p(x\,|\,\cdot)$, the prior $\pi$, and the observation $x$. This minimizer is the posterior mean
$$d_\pi(x) := \mathbb{E}_{\theta \sim \pi_x}[\phi(\theta)] \tag{1.5}$$
and the uncertainty is quantified by the posterior variance
$$\mathcal{R}_\pi(d_\pi(x)) := \mathbb{E}_{\theta \sim \pi_x} \|\phi(\theta) - d_\pi(x)\|^2. \tag{1.6}$$
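For intuition, the posterior quantities (1.4)-(1.6) can be approximated directly on a grid when $\Theta$ is low-dimensional. A minimal sketch (ours; the Gaussian likelihood, uniform prior, and numerical values are illustrative assumptions, not from the paper):

```python
import numpy as np

# A grid-based sketch (ours) of (1.4)-(1.6); the Gaussian likelihood, uniform
# prior, and all numerical values are illustrative assumptions.
sigma, x = 1.0, 1.5
thetas = np.linspace(-3.0, 3.0, 2001)                 # grid over Theta
prior = np.full(thetas.shape, 1.0 / thetas.size)      # discretized prior pi
like = np.exp(-0.5 * (x - thetas) ** 2 / sigma ** 2)  # p(x|theta), up to a constant

post = like * prior
post /= post.sum()                                    # posterior pi_x, eq. (1.4)

phi = thetas                                          # quantity of interest: identity
d_pi = np.sum(post * phi)                             # posterior mean, eq. (1.5)
risk = np.sum(post * (phi - d_pi) ** 2)               # posterior variance, eq. (1.6)
print("d_pi(x) =", d_pi, " R_pi(d_pi(x)) =", risk)
```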
1.1.3. Game/decision theoretic. Wald's game/decision-theoretic approach is to consider a two-player zero-sum game where player I selects $\theta \in \Theta$ and player II selects a decision function $d : \mathcal{X} \to V$ which estimates the quantity of interest $\phi(\theta)$ (given the data $x \in \mathcal{X}$), resulting in the loss
$$\mathcal{L}(\theta, d) := \mathbb{E}_{x \sim \mathbb{P}(\cdot|\theta)} \|\phi(\theta) - d(x)\|^2, \qquad \theta \in \Theta,\ d : \mathcal{X} \to V, \tag{1.7}$$
for player II. Such a game will normally not have a saddle point, so, following von Neumann's approach [40], one randomizes both players' plays to identify a Nash equilibrium. To that end, first observe that, for the quadratic loss considered here (for ease of presentation), only the choice of player I needs to be randomized. So, let $\pi \in \mathcal{P}(\Theta)$ be a probability measure randomizing the play of player I, and consider the lift
$$\mathcal{L}(\pi, d) := \mathbb{E}_{\theta \sim \pi}\, \mathbb{E}_{x \sim \mathbb{P}(\cdot|\theta)} \|\phi(\theta) - d(x)\|^2, \qquad \pi \in \mathcal{P}(\Theta),\ d : \mathcal{X} \to V, \tag{1.8}$$
of the game (1.7). A minmax optimal estimate of $\phi(\theta^\dagger)$ is then obtained by identifying a Nash equilibrium (a saddle point) for (1.8), i.e., $\pi^* \in \mathcal{P}(\Theta)$ and $d^* : \mathcal{X} \to V$ satisfying
$$\mathcal{L}(\pi, d^*) \le \mathcal{L}(\pi^*, d^*) \le \mathcal{L}(\pi^*, d), \qquad \pi \in \mathcal{P}(\Theta),\ d : \mathcal{X} \to V. \tag{1.9}$$
Consequently, an optimal strategy for player II is the posterior mean $d_{\pi^*}(x)$ of the form (1.5) determined by a worst-case measure
$$\pi^* \in \arg\max_{\pi \in \mathcal{P}(\Theta)} \mathbb{E}_{\theta \sim \pi,\, x \sim \mathbb{P}(\cdot|\theta)} \|\phi(\theta) - d_\pi(x)\|^2, \tag{1.10}$$
which is an optimal randomized/mixed strategy for player I. To connect with the Bayesian framework, we observe (by changing the order of integration) that Wald's risk (1.8) can be written as the average
$$\mathcal{L}(\pi, d) = \mathbb{E}_{x \sim \mathbb{X}_\pi}\, \mathcal{R}_\pi(d(x)) \tag{1.11}$$
of the Bayesian decision risk $\mathcal{R}_\pi(d(x))$ ((1.3) for $d = d(x)$) determined by the prior $\pi$ and decision $d(x)$, with respect to the $\mathcal{X}$-marginal distribution
$$\mathbb{X}_\pi := \int_\Theta \mathbb{P}(\cdot\,|\,\theta)\, d\pi(\theta) \tag{1.12}$$
associated with the prior $\pi$ and the model $\mathbb{P}$. However, whereas the prior used in Bayesian decision theory is specified by the practitioner, in the Wald framework it is a worst-case prior (1.10).
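The lifted loss (1.8), equivalently the average (1.11), is straightforward to estimate by Monte Carlo for a fixed prior and decision rule. A small sketch (ours; the uniform prior, Gaussian model, and clipping decision rule are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo sketch (ours) of the lifted loss (1.8): sample theta ~ pi, then
# x ~ P(.|theta), and average ||phi(theta) - d(x)||^2. The uniform prior,
# Gaussian model, and decision rule below are illustrative assumptions.
sigma, tau, n = 1.0, 3.0, 100_000
thetas = rng.uniform(-tau, tau, size=n)        # theta ~ pi (uniform on Theta)
xs = thetas + sigma * rng.standard_normal(n)   # x ~ N(theta, sigma^2)

def d(x):
    # a candidate decision rule: clip the observation to Theta = [-tau, tau]
    return np.clip(x, -tau, tau)

L = np.mean((thetas - d(xs)) ** 2)             # estimate of L(pi, d), phi = identity
print("estimated Wald risk L(pi, d):", L)
```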
Figure 1. Curse of dimensionality in discretizing the prior. The data is of the form $x = m(\theta) + \epsilon\,\mathcal{N}(0,1)$, where $m$ is deterministic and $\epsilon\,\mathcal{N}(0,1)$ is small noise. (a) For the continuous prior, the posterior concentrates around $M := \{\theta \in \Theta \mid m(\theta) = x\}$. (b) For the discretized prior, the posterior concentrates on the Dirac delta closest to $M$.
1.2. Limitations of the three main approaches to UQ. All three approaches described in Section 1.1 have limitations in terms of accuracy, robustness, and computational complexity. Although the worst-case approach is robust, it appears unfavorable in terms of accuracy and data assimilation. The Bayesian approach, on the other hand, suffers from the computational complexity of estimating the posterior distribution and from brittleness [21] with respect to the choice of prior, along with Stark's admonition [36] that "your prior can bite you on the posterior." Although Kempthorne [13] develops a rigorous numerical procedure, with convergence guarantees, for solving the equations of Wald's statistical decision theory that appears amenable to computational complexity analysis, it appears to suffer from the curse of dimensionality (see Fig. 1). This can be understood from the fact that the risk associated with the worst-case measure in the Wald framework is an average over the observational variable $x \in \mathcal{X}$ of the conditional risk, conditioned on the observation $x$. Consequently, for a discrete approximation of a worst-case measure, after an observation is made, there may be insufficient mass near the places where the conditioning will provide a good estimate of the appropriate conditional measure. Indeed, in the proposal [19] to develop Wald's statistical decision theory along the lines of Machine Learning, with its dual focus on performance and computation, it was observed that

"Although Wald's theory of Optimal Statistical Decisions has resulted in many important statistical discoveries, looking through the three Lehmann symposia of Rojo and Pérez-Abreu [30] in 2004, and Rojo [28, 29] in 2006 and 2009, it is clear that the incorporation of the analysis of the computational algorithm, both in terms of its computational efficiency and its statistical optimality, has not begun."

Moreover, one might ask why, after seeing the data, one chooses a worst-case measure optimizing the average (1.11) of the Bayesian risk (1.6), instead of choosing it to optimize the value of the risk $\mathcal{R}_\pi(d_\pi(x))$ at the value of the observation $x$.
1.3. UQ of the 4th kind. In this paper, we introduce a framework that is a hybrid between Wald's statistical decision theory [42], Bayesian decision theory [3, Sec. 4.4], robust optimization, and hypothesis testing. Here, for simplicity, we describe its components when the loss function is the $\ell_2$ loss; in Section 6 we develop the framework for general loss functions.
1.3.1. Rarity assumption on the data. In [21, p. 576] it was demonstrated that one can alleviate the brittleness of Bayesian inference (see [20, 18]) by restricting to priors $\pi$ for which the observed data $x$ is not rare, that is,
$$p(x) := \int_\Theta p(x\,|\,\theta)\, d\pi(\theta) \ge \alpha \tag{1.13}$$
according to the density of the $\mathcal{X}$-marginal determined by $\pi$ and the model $\mathbb{P}$, for some $\alpha > 0$. In the proposed framework, we consider playing a game after observing the data $x$ whose loss function is defined by the Bayesian decision risk $\mathcal{R}_\pi(d)$ (1.3), where player I selects a prior $\pi$ subject to a rarity assumption ($\pi \in \mathcal{P}_x(\alpha)$) and player II selects a decision $d \in V$. The rarity assumption considered here is
$$\mathcal{P}_x(\alpha) := \Big\{ \pi \in \mathcal{P}(\Theta) : \mathrm{support}(\pi) \subset \big\{ \theta \in \Theta : p(x\,|\,\theta) \ge \alpha \big\} \Big\}. \tag{1.14}$$
Since $p(x\,|\,\theta) \ge \alpha$ for all $\theta$ in the support of any $\pi \in \mathcal{P}_x(\alpha)$, it follows that such a $\pi$ satisfies (1.13) and is therefore sufficient to prevent Bayesian brittleness.
1.3.2. The relative likelihood for the rarity assumption. Observe in (1.4) that the map from the prior $\pi$ to the posterior $\pi_x$ is scale-invariant in the likelihood $p(x\,|\,\cdot)$ and that the effect of scaling the likelihood in the rarity assumption can be undone by modifying $\alpha$. Consequently, we scale the likelihood function
$$\bar p(x\,|\,\theta) := \frac{p(x\,|\,\theta)}{\sup_{\theta \in \Theta} p(x\,|\,\theta)}, \qquad \theta \in \Theta, \tag{1.15}$$
to its relative likelihood function
$$\bar p(x\,|\,\cdot) : \Theta \to (0,1]. \tag{1.16}$$
According to Sprott [35, Sec. 2.4], the relative likelihood measures the plausibility of any parameter value $\theta$ relative to a maximally likely $\theta$ and summarizes the information about $\theta$ contained in the sample $x$. See Rossi [31, p. 267] for its large-sample connection with the $\chi^2_1$ distribution and several examples of the relationship between likelihood regions and confidence intervals.
For $x \in \mathcal{X}$ and $\alpha \in [0,1]$, let
$$\Theta_x(\alpha) := \big\{ \theta \in \Theta : \bar p(x\,|\,\theta) \ge \alpha \big\} \tag{1.17}$$
denote the corresponding likelihood region and, updating (1.14), redefine the rarity assumption by
$$\mathcal{P}_x(\alpha) := \mathcal{P}\big(\Theta_x(\alpha)\big). \tag{1.18}$$
That is, the rarity constraint $\mathcal{P}_x(\alpha)$ constrains priors to have support on the likelihood region $\Theta_x(\alpha)$.
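In low dimensions, the likelihood region (1.17) can be tabulated directly on a grid. A minimal sketch (ours), assuming the Gaussian location model used later in Section 2.1, with the supremum in (1.15) approximated over the grid:

```python
import numpy as np

# Grid-based sketch (ours) of the likelihood region (1.17). The Gaussian
# location model on Theta = [-tau, tau] and all numbers are illustrative
# assumptions (cf. Section 2.1).
sigma, tau, x, alpha = 1.0, 3.0, 1.5, 0.1
thetas = np.linspace(-tau, tau, 4001)
like = np.exp(-0.5 * (x - thetas) ** 2 / sigma ** 2)  # p(x|theta), up to a constant
rel_like = like / like.max()             # relative likelihood (1.15) on the grid
region = thetas[rel_like >= alpha]       # Theta_x(alpha), eq. (1.17)
print("Theta_x(alpha) ~ [%.3f, %.3f]" % (region.min(), region.max()))
```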
We will now define the confidence level of the family $\Theta_x(\alpha)$, $x \in \mathcal{X}$.
1.3.3. Significance/confidence level. For a given $\alpha$, let the significance $\beta_\alpha$ at the value $\alpha$ be the maximum (over $\theta \in \Theta$) of the probability that a data point $x' \sim \mathbb{P}(\cdot\,|\,\theta)$ does not satisfy the rarity assumption $\bar p(x'\,|\,\theta) \ge \alpha$, i.e.,
$$\beta_\alpha := \sup_{\theta \in \Theta} \mathbb{P}\Big( \big\{ x' \in \mathcal{X} : \theta \notin \Theta_{x'}(\alpha) \big\} \,\Big|\, \theta \Big) = \sup_{\theta \in \Theta} \int 1_{\{\bar p(\cdot|\theta) < \alpha\}}(x')\, p(x'\,|\,\theta)\, d\nu(x'), \tag{1.19}$$
where, for fixed $\theta$, $1_{\{\bar p(\cdot|\theta) < \alpha\}}$ is the indicator function of the set $\{x' \in \mathcal{X} : \bar p(x'\,|\,\theta) < \alpha\}$. Observe that, in the setting of hypothesis testing, (1) $\beta_\alpha$ can be interpreted as the p-value associated with the hypothesis that the rarity assumption is not satisfied (i.e., the hypothesis that $\theta$ does not belong to the set (1.17)), and (2) $1 - \beta_\alpha$ can be interpreted as the confidence level associated with the rarity assumption (i.e., the smallest probability that $\theta$ belongs to the set (1.17)). Therefore, to select $\alpha \in [0,1]$, we set a significance level $\beta^*$ (e.g., $\beta^* = 0.05$) and choose $\alpha$ to be the largest value such that the significance at $\alpha$ satisfies $\beta_\alpha \le \beta^*$.
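When (1.19) has no closed form, $\beta_\alpha$ can be estimated by Monte Carlo and $\alpha$ selected by a grid search. A sketch (ours), anticipating the Gaussian surrogate relative likelihood of Remark 1.1 below; the model and all numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo sketch (ours) of (1.19) for a Gaussian location model with the
# surrogate relative likelihood exp(-(x'-theta)^2 / (2 sigma^2)). For each
# theta on a grid, estimate the probability that a fresh sample violates the
# rarity assumption, take the sup over theta, then pick the largest alpha on
# a grid with beta_alpha <= beta_star.
sigma, tau, beta_star, n = 1.0, 3.0, 0.05, 20_000
thetas = np.linspace(-tau, tau, 61)

rels = []                                # one batch of samples per theta
for th in thetas:
    xp = th + sigma * rng.standard_normal(n)               # x' ~ P(.|theta)
    rels.append(np.exp(-0.5 * (xp - th) ** 2 / sigma ** 2))

def beta(alpha):
    # beta_alpha = sup_theta P( p_bar(x'|theta) < alpha | theta )
    return max(float(np.mean(r < alpha)) for r in rels)

alphas = np.linspace(0.01, 0.99, 99)
alpha_hat = max(a for a in alphas if beta(a) <= beta_star)
print("largest alpha with beta_alpha <= beta*:", alpha_hat)
```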
Remark 1.1. For models where the maximum of the likelihood function
$$M(x') := \sup_{\theta \in \Theta} p(x'\,|\,\theta), \qquad x' \in \mathcal{X},$$
is expensive to compute but an efficiently computable upper approximation $M'(x') \ge M(x')$, $x' \in \mathcal{X}$, is available, the surrogate
$$\bar p'(x'\,|\,\theta) := \frac{p(x'\,|\,\theta)}{M'(x')}, \qquad x' \in \mathcal{X}, \tag{1.20}$$
of the relative likelihood may be used in place of (1.15). If we let $\beta'_\alpha$ denote the value determined in (1.19) using the surrogate (1.20) and $\Theta'_x(\alpha)$ denote the corresponding likelihood region, then we have $\beta_\alpha \le \beta'_\alpha$ and $\Theta'_x(\alpha) \subset \Theta_x(\alpha)$, $\alpha \in [0,1]$. Consequently, obtaining $\beta'_\alpha \le \beta^*$ for significance level $\beta^*$ implies that $\beta_\alpha \le \beta^*$.
As an example, for an $N$-dimensional Gaussian model with $p(x'\,|\,\theta) = \frac{1}{(\sigma\sqrt{2\pi})^N} e^{-\frac{1}{2\sigma^2}\|x'-\theta\|^2}$ and $\Theta := [-\tau,\tau]^N$, using the elementary upper bound
$$M(x') := \sup_{\theta \in \Theta} p(x'\,|\,\theta) \le \frac{1}{(\sigma\sqrt{2\pi})^N},$$
the surrogate relative likelihood defined in (1.20) becomes
$$\bar p'(x'\,|\,\theta) := e^{-\frac{1}{2\sigma^2}\|x'-\theta\|^2}.$$
1.3.4. Posterior game and risk. After observing $x \in \mathcal{X}$, we now consider playing a game using the loss
$$\mathcal{L}(\pi, d) := \mathbb{E}_{\theta \sim \pi_x} \|\phi(\theta) - d\|^2, \qquad \pi \in \mathcal{P}_x(\alpha),\ d \in V. \tag{1.21}$$
Since the likelihood $p(x\,|\,\cdot)$ is positive, it follows from
$$\pi_x = \frac{\bar p(x\,|\,\cdot)\,\pi}{\int_\Theta \bar p(x\,|\,\theta)\, d\pi(\theta)} \tag{1.22}$$
that the map from priors to posteriors is bijective and $\mathrm{support}(\pi_x) = \mathrm{support}(\pi)$, so that we can completely remove the conditioning in the game defined by (1.21) and instead play a game using the loss
$$\mathcal{L}(\pi, d) := \mathbb{E}_{\theta \sim \pi} \|\phi(\theta) - d\|^2, \qquad \pi \in \mathcal{P}_x(\alpha),\ d \in V. \tag{1.23}$$
Recall that a pair $(\pi_\alpha, d_\alpha) \in \mathcal{P}_x(\alpha) \times V$ is a saddle point of the game (1.23) if
$$\mathcal{L}(\pi, d_\alpha) \le \mathcal{L}(\pi_\alpha, d_\alpha) \le \mathcal{L}(\pi_\alpha, d), \qquad \pi \in \mathcal{P}_x(\alpha),\ d \in V.$$
We then have the following theorem.
Theorem 1.2. Consider $x \in \mathcal{X}$ and $\alpha \in [0,1]$, and suppose that the relative likelihood $\bar p(x\,|\,\cdot)$ and the quantity of interest $\phi : \Theta \to V$ are continuous. The loss function $\mathcal{L}$ for the game (6.3) has saddle points, and a pair $(\pi_\alpha, d_\alpha) \in \mathcal{P}_x(\alpha) \times V$ is a saddle point for $\mathcal{L}$ if and only if
$$d_\alpha = \mathbb{E}_{\pi_\alpha}[\phi] \tag{1.24}$$
and
$$\pi_\alpha \in \arg\max_{\pi \in \mathcal{P}_x(\alpha)} \mathbb{E}_\pi \|\phi - \mathbb{E}_\pi[\phi]\|^2. \tag{1.25}$$
Furthermore, the associated risk (the value of the two-person game (6.3))
$$\mathcal{R}(d_\alpha) := \mathcal{L}(\pi_\alpha, d_\alpha) = \mathbb{E}_{\pi_\alpha} \|\phi - \mathbb{E}_{\pi_\alpha}[\phi]\|^2 \tag{1.26}$$
is the same for all saddle points of $\mathcal{L}$. Moreover, the second component $d_\alpha$ of the set of saddle points is unique, and the set $\mathcal{O}_x(\alpha) \subset \mathcal{P}_x(\alpha)$ of first components of saddle points is convex, providing a convex ridge $\mathcal{O}_x(\alpha) \times \{d_\alpha\}$ of saddle points.
1.4. Duality with the minimum enclosing ball. Although the Lagrangian duality between the maximum variance problem and the minimum enclosing ball problem on finite sets is known (see Yildirim [44]), we now analyze the infinite case. Utilizing Lim and McCann's recent generalization [15, Thm. 1] of Popoviciu's one-dimensional result [25] on the relationship between variance maximization and the minimum enclosing ball, the following theorem demonstrates that, essentially, the maximum variance problem (1.25) determining a worst-case measure is the Lagrangian dual of the minimum enclosing ball problem on the image $\phi(\Theta_x(\alpha))$. Let $\phi_* : \mathcal{P}(\Theta_x(\alpha)) \to \mathcal{P}(\phi(\Theta_x(\alpha)))$ denote the pushforward map (change of variables) defined by $(\phi_*\pi)(A) := \pi(\phi^{-1}(A))$ for every Borel set $A$, mapping probability measures on $\Theta_x(\alpha)$ to probability measures on $\phi(\Theta_x(\alpha))$.
Theorem 1.3. For $x \in \mathcal{X}$ and $\alpha \in [0,1]$, suppose that the relative likelihood $\bar p(x\,|\,\cdot)$ and the quantity of interest $\phi : \Theta \to V$ are continuous. Consider a saddle point $(\pi_\alpha, d_\alpha)$ of the game (6.3). The optimal decision $d_\alpha$ and its associated risk $\mathcal{R}(d_\alpha) =$ (6.10) are equal to the center and squared radius, respectively, of the minimum enclosing ball of $\phi(\Theta_x(\alpha))$, i.e., the minimizer $z^*$ and the value $R^2$ of the minimum enclosing ball optimization problem
$$\begin{cases} \text{Minimize} & r^2 \\ \text{Subject to} & r \in \mathbb{R},\ z \in V, \\ & \|v - z\|^2 \le r^2, \quad v \in \phi(\Theta_x(\alpha)). \end{cases} \tag{1.27}$$
Moreover, the variance maximization problem (1.25) on $\mathcal{P}_x(\alpha)$ pushes forward to the variance maximization problem on $\mathcal{P}(\phi(\Theta_x(\alpha)))$, the probability measures on the image of the likelihood region under $\phi$, giving the identity
$$\mathbb{E}_\pi \|\phi - \mathbb{E}_\pi[\phi]\|^2 = \mathbb{E}_{\phi_*\pi} \|v - \mathbb{E}_{\phi_*\pi}[v]\|^2, \qquad \pi \in \mathcal{P}_x(\alpha),$$
and the latter is the Lagrangian dual of the minimum enclosing ball problem (1.27) on the image $\phi(\Theta_x(\alpha))$. Finally, let $B$, with center $z^*$, denote the minimum enclosing ball of $\phi(\Theta_x(\alpha))$. Then a measure $\pi_\alpha \in \mathcal{P}_x(\alpha)$ is optimal for the variance maximization problem (1.25) if and only if
$$\phi_*\pi_\alpha\big(\phi(\Theta_x(\alpha)) \cap \partial B\big) = 1 \quad \text{and} \quad z^* = \int_V v\, d(\phi_*\pi_\alpha)(v),$$
that is, all the mass of $\phi_*\pi_\alpha$ lives on the intersection $\phi(\Theta_x(\alpha)) \cap \partial B$ of the image $\phi(\Theta_x(\alpha))$ of the likelihood region and the boundary $\partial B$ of its minimum enclosing ball, and the center of mass of the measure $\phi_*\pi_\alpha$ is the center $z^*$ of $B$.
Remark 1.4. Note that once $\alpha$, and therefore $\Theta_x(\alpha)$, is determined, the computation of the risk and of the minmax estimator is determined by the minimum enclosing ball about $\phi(\Theta_x(\alpha))$, which is also determined by the worst-case optimization problem (1.2) for $\Theta := \Theta_x(\alpha)$.
Theorem 1.3 introduces the possibility of primal-dual algorithms, and in particular the availability of rigorous stopping criteria for the maximum variance problem (1.25). To that end, for a feasible measure $\pi \in \mathcal{P}_x(\alpha)$, let
$$\mathrm{Var}(\pi) := \mathbb{E}_\pi \|\phi - \mathbb{E}_\pi[\phi]\|^2$$
denote its variance, and denote by $\mathrm{Var}^* := \sup_{\pi \in \mathcal{P}_x(\alpha)} \mathrm{Var}(\pi) =$ (1.26) the optimal variance. Let $(r, z)$ be feasible for the minimum enclosing ball problem (1.27). Then the inequality $\mathrm{Var}^* = R^2 \le r^2$ implies the rigorous bound
$$\mathrm{Var}^* - \mathrm{Var}(\pi) \le r^2 - \mathrm{Var}(\pi) \tag{1.28}$$
quantifying the suboptimality of the measure $\pi$ in terms of the known quantities $r$ and $\mathrm{Var}(\pi)$.
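The certificate (1.28) only needs a candidate measure and any feasible ball. A minimal numerical sketch (ours; the points, weights, and the centroid-centered ball are illustrative assumptions, with the point cloud standing in for a discretization of $\phi(\Theta_x(\alpha))$):

```python
import numpy as np

# Sketch (ours) of the suboptimality certificate (1.28). `pts` stands in for a
# discretization of the image phi(Theta_x(alpha)); the candidate measure pi is
# supported on these points with weights w. All numbers are illustrative.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.9], [0.5, 0.4]])
w = np.array([0.4, 0.4, 0.2, 0.0])               # candidate (pushed-forward) pi

mean = w @ pts
var_pi = w @ np.sum((pts - mean) ** 2, axis=1)   # Var(pi)

# Any feasible (r, z) for (1.27) certifies an upper bound: here a ball centered
# at the centroid of the discretized image, with radius just large enough.
z = pts.mean(axis=0)
r2 = np.max(np.sum((pts - z) ** 2, axis=1))      # feasible squared radius r^2

print("Var(pi) =", var_pi, " r^2 =", r2)
print("suboptimality gap Var* - Var(pi) <=", r2 - var_pi)   # bound (1.28)
```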
1.5. Finite-dimensional reduction. Let $\Delta_m(\Theta)$ denote the set of convex sums of $m$ Dirac measures located in $\Theta$, and let $\mathcal{P}^m_x(\alpha) \subset \mathcal{P}_x(\alpha)$, defined by
$$\mathcal{P}^m_x(\alpha) := \Delta_m(\Theta) \cap \mathcal{P}_x(\alpha), \tag{1.29}$$
denote the finite-dimensional subset of the rarity assumption set $\mathcal{P}_x(\alpha)$ consisting of the convex combinations of $m$ Dirac measures supported in $\Theta_x(\alpha)$.
Theorem 1.5. Let $\alpha \in [0,1]$ and $x \in \mathcal{X}$, and suppose that the likelihood function $p(x\,|\,\cdot)$ and the quantity of interest $\phi : \Theta \to V$ are continuous. Then for any $m \ge \dim(V) + 1$, the variance maximization problem (1.25) has the finite-dimensional reduction
$$\max_{\pi \in \mathcal{P}_x(\alpha)} \mathbb{E}_\pi \|\phi - \mathbb{E}_\pi[\phi]\|^2 = \max_{\pi \in \mathcal{P}^m_x(\alpha)} \mathbb{E}_\pi \|\phi - \mathbb{E}_\pi[\phi]\|^2. \tag{1.30}$$
Therefore one can compute a saddle point $(\pi_\alpha, d_\alpha)$ of the game (6.3) as
$$\pi_\alpha = \sum_{i=1}^m w_i \delta_{\theta_i} \quad \text{and} \quad d_\alpha = \sum_{i=1}^m w_i \phi(\theta_i), \tag{1.31}$$
where the $w_i \ge 0$ and $\theta_i \in \Theta$, $i = 1, \ldots, m$, solve
$$\begin{cases} \text{Maximize} & \sum_{i=1}^m w_i \|\phi(\theta_i)\|^2 - \big\|\sum_{i=1}^m w_i \phi(\theta_i)\big\|^2 \\ \text{Subject to} & w_i \ge 0,\ \theta_i \in \Theta,\ i = 1, \ldots, m, \quad \sum_{i=1}^m w_i = 1, \\ & \bar p(x\,|\,\theta_i) \ge \alpha, \quad i = 1, \ldots, m. \end{cases} \tag{1.32}$$
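The reduction (1.32) is a smooth, low-dimensional program that standard constrained optimizers can handle. A sketch (ours, not from the paper) for the Gaussian mean problem of Section 2.1, where $V = \mathbb{R}$ so $m = \dim(V) + 1 = 2$ Dirac masses suffice; the numerical values are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch (ours) of the finite-dimensional reduction (1.32) for the Gaussian
# mean problem of Section 2.1. Decision variables: (w, theta_1, theta_2).
sigma, tau, x, alpha = 1.0, 3.0, 1.5, 0.1
hw2 = 2 * sigma ** 2 * np.log(1 / alpha)   # squared half-width of Theta_x(alpha)

def neg_variance(v):
    w, t1, t2 = v
    mean = w * t1 + (1 - w) * t2
    return -(w * t1 ** 2 + (1 - w) * t2 ** 2 - mean ** 2)

# Rarity constraints p_bar(x|theta_i) >= alpha, i.e. (x - theta_i)^2 <= hw2
cons = [{"type": "ineq", "fun": lambda v, i=i: hw2 - (x - v[i]) ** 2} for i in (1, 2)]
bounds = [(0.0, 1.0), (-tau, tau), (-tau, tau)]

res = minimize(neg_variance, x0=[0.5, x - 0.5, x + 0.5],
               bounds=bounds, constraints=cons, method="SLSQP")
w, t1, t2 = res.x
print("pi_alpha = %.3f*delta(%.3f) + %.3f*delta(%.3f)" % (w, t1, 1 - w, t2))
print("d_alpha =", w * t1 + (1 - w) * t2, " risk =", -res.fun)
```

By Theorem 1.3, the optimizer should push $\theta_1, \theta_2$ to the endpoints of $\Theta_x(\alpha)$ and balance the weights so that the mean sits at the center of the enclosing ball.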
As a consequence of Theorems 1.3 and 1.5, a measure with finite support $\mu := \sum_i w_i \delta_{z_i}$ on $V$ is the pushforward under $\phi : \Theta \to V$ of an optimal measure $\pi_\alpha$ for the maximum variance problem (1.25) if and only if, as illustrated in Figure 2, it is supported on the intersection of $\phi(\Theta_x(\alpha))$ and the boundary $\partial B$ of the minimum enclosing ball of $\phi(\Theta_x(\alpha))$, and the center $z^*$ of $B$ is the center of mass $z^* = \sum_i w_i z_i$ of the measure $\mu$.
Figure 2. Three examples of the minimum enclosing ball $B$ about the image $\phi(\Theta_x(\alpha))$ (in green), with radius $R$ and center $d_\alpha = z^*$. The optimal discrete measures $\mu := \sum_i w_i \delta_{z_i}$ ($z_i = \phi(\theta_i)$) on the range of $\phi$ for the maximum variance problem are characterized by the fact that they are supported on the intersection of $\phi(\Theta_x(\alpha))$ and $\partial B$, and $d_\alpha = z^* = \sum_i w_i z_i$ is the center of mass of the measure $\mu$. The sizes of the solid red balls indicate the sizes of the corresponding weights $w_i$.
1.6. Relaxing MLE with an accuracy/robustness tradeoff. For fixed $x \in \mathcal{X}$, assume that the model $\mathbb{P}$ is such that the maximum likelihood estimate (MLE)
$$\theta^* := \arg\max_{\theta \in \Theta} p(x\,|\,\theta) \tag{1.33}$$
of $\theta^\dagger$ exists and is unique.

Observe that for $\alpha$ near one, (1) the support of $\pi_\alpha$ and $d_\alpha$ concentrate around the MLE $\theta^*$ and $\phi(\theta^*)$, (2) the risk $\mathcal{R}(d_\alpha) =$ (6.10) concentrates around zero, and (3) the confidence $1 - \beta_\alpha$ associated with the rarity assumption $\theta^\dagger \in \Theta_x(\alpha)$ is the smallest. In that limit, our estimator inherits the accuracy and the lack of robustness of the MLE approach to estimating the quantity of interest.

Conversely, for $\alpha$ near zero, since by (1.17) $\Theta_x(\alpha) \approx \Theta$, (1) the support of the pushforward of $\pi_\alpha$ by $\phi$ concentrates on the boundary of $\phi(\Theta)$ and $d_\alpha$ concentrates around the center of the minimum enclosing ball of $\phi(\Theta)$, (2) the risk $\mathcal{R}(d_\alpha) =$ (1.26) is the highest and concentrates around the worst-case risk (1.2), and (3) the confidence $1 - \beta_\alpha$ associated with the rarity assumption $\theta^\dagger \in \Theta_x(\alpha)$ is the highest. In that limit, our estimator inherits the robustness and the lack of accuracy of the worst-case approach to estimating the quantity of interest.

For $\alpha$ between 0 and 1, the proposed game-theoretic approach induces a minmax-optimal tradeoff between the accuracy of the MLE and the robustness of the worst case.
2. Two simple examples

2.1. Gaussian Mean Estimation. Consider the problem of estimating the mean $\theta^\dagger$ of a Gaussian distribution $\mathcal{N}(\theta^\dagger, \sigma^2)$ with known variance $\sigma^2 > 0$ from the observation of one sample $x$ from that distribution and from the information that $\theta^\dagger \in [-\tau, \tau]$ for some given $\tau > 0$. Note that this problem can be formulated in the setting of Problem 1 by letting (1) $\mathbb{P}(\cdot\,|\,\theta)$ be the Gaussian distribution on $\mathcal{X} := \mathbb{R}$ with mean $\theta$ and variance $\sigma^2$, (2) $\Theta := [-\tau, \tau]$ and $V := \mathbb{R}$, and (3) $\phi : \Theta \to V$ be the identity map $\phi(\theta) = \theta$.

Following Remark 1.1, we utilize the surrogate relative likelihood
$$\bar p'(x\,|\,\theta) = e^{-\frac{1}{2\sigma^2}|x - \theta|^2}, \qquad \theta \in \Theta, \tag{2.1}$$
obtained from the maximum likelihood upper bound $\sup_{\theta \in \Theta} p(x\,|\,\theta) \le \frac{1}{\sigma\sqrt{2\pi}}$, to define the likelihood region $\Theta_x(\alpha) := \{\theta \in \Theta : \bar p'(x\,|\,\theta) \ge \alpha\}$ in place of the true relative likelihood; for simplicity, we drop the prime and henceforth write $\bar p$. A simple calculation yields
$$\Theta_x(\alpha) = \Big[\max\big(-\tau,\, x - \sqrt{2\sigma^2\ln(1/\alpha)}\big),\ \min\big(\tau,\, x + \sqrt{2\sigma^2\ln(1/\alpha)}\big)\Big]. \tag{2.2}$$
Using Theorem 1.5 with $m = \dim(V) + 1 = 2$, for $\alpha \in [0,1]$ one can compute a saddle point $(\pi_\alpha, d_\alpha)$ of the game (1.23) as
$$\pi_\alpha = w\,\delta_{\theta_1} + (1 - w)\,\delta_{\theta_2} \quad \text{and} \quad d_\alpha = w\,\theta_1 + (1 - w)\,\theta_2, \tag{2.3}$$
where $w, \theta_1, \theta_2$ maximize the variance
$$\begin{cases} \text{Maximize} & w\,\theta_1^2 + (1 - w)\,\theta_2^2 - \big(w\,\theta_1 + (1 - w)\,\theta_2\big)^2 \\ \text{over} & 0 \le w \le 1,\ \theta_1, \theta_2 \in [-\tau, \tau] \\ \text{subject to} & \frac{(x - \theta_i)^2}{2\sigma^2} \le \ln\frac{1}{\alpha}, \quad i = 1, 2, \end{cases} \tag{2.4}$$
where the last two constraints are equivalent to the rarity assumption $\theta_i \in \Theta_x(\alpha)$.
Hence, for $\alpha$ near 0, $\Theta_x(\alpha) = \Theta = [-\tau, \tau]$ and, by Theorem 1.3, the variance is maximized by placing one Dirac on each boundary point of the region $\Theta$, each receiving half of the total probability mass, that is, by $\theta_1 = -\tau$, $\theta_2 = \tau$, and $w = 1/2$, in which case $\mathrm{Var}_{\pi_\alpha} = \tau^2$ and $d_\alpha = 0$. For $\alpha = 1$, the rarity constraint implies $\theta_1 = \theta_2 = x$ when $x \in [-\tau, \tau]$, leading to the MLE $d_\alpha = x$ with $\mathrm{Var}_{\pi_\alpha} = 0$. Note that from (1.19) we have
$$\beta_\alpha = \sup_{\theta \in [-\tau,\tau]} \mathbb{P}_{x' \sim \mathcal{N}(\theta,\sigma^2)}\big(\bar p(x'\,|\,\theta) < \alpha\big) = \sup_{\theta \in [-\tau,\tau]} \mathbb{P}_{x' \sim \mathcal{N}(\theta,\sigma^2)}\big((x' - \theta)^2/\sigma^2 > -2\ln\alpha\big) = 1 - F(-2\ln\alpha),$$
where $F$ is the cumulative distribution function of the chi-squared distribution with one degree of freedom.
Figure 3. $\alpha$-$\beta$ relation, likelihood level sets, risk value, and decision for different choices of $\alpha$ (and consequently $\beta$) for the normal mean estimation problem with $\tau = 3$ and observed value $x = 1.5$. Three different values on the $\alpha$-$\beta$ curve are highlighted across the plots.
We illustrate in Figure 3 the results of solving the optimization problem (2.4) in the case $\sigma^2 = 1$, $x = 1.5$, and $\tau = 3$: the $\alpha$-$\beta$ curve (top left); the likelihood of the model on $[-\tau, \tau]$ and the $\alpha$-level sets (top right); the evolution of the risk with $\beta$ (bottom left); and the evolution of the optimal decision with $\beta$ (bottom right). Since, by Theorem 1.3, the optimal decision is the midpoint of the interval with extremes in either the $\alpha$-level sets or $\pm\tau$, we observe that for low $\beta$ our optimal decision does not coincide with the MLE.
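For this example the whole pipeline is available in closed form, which makes for a compact numerical check. The sketch below (ours; $\beta^* = 0.05$ is an illustrative choice) inverts $\beta_\alpha = 1 - F(-2\ln\alpha)$ to get the largest admissible $\alpha$, builds the interval (2.2), and reads off $d_\alpha$ and the risk as the midpoint and squared half-width:

```python
import numpy as np
from scipy.stats import chi2

# Closed-form sketch (ours) for the Gaussian mean example: the largest alpha
# with beta_alpha = 1 - F(-2 ln alpha) <= beta* is alpha = exp(-F^{-1}(1-beta*)/2),
# with F the chi-squared(1) CDF. sigma^2 = 1, x = 1.5, tau = 3 follow the text;
# beta* = 0.05 is an illustrative choice.
sigma, tau, x, beta_star = 1.0, 3.0, 1.5, 0.05

alpha = np.exp(-chi2.ppf(1 - beta_star, df=1) / 2)
half_width = np.sqrt(2 * sigma ** 2 * np.log(1 / alpha))   # ~1.96 sigma here

lo = max(-tau, x - half_width)   # likelihood region (2.2)
hi = min(tau, x + half_width)
d_alpha = (lo + hi) / 2          # center of the minimum enclosing ball
risk = ((hi - lo) / 2) ** 2      # squared radius, eq. (1.26)

print("alpha =", alpha, " Theta_x(alpha) = [%.3f, %.3f]" % (lo, hi))
print("d_alpha =", d_alpha, " risk =", risk)
```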
2.2. Coin toss.

2.2.1. $n$ tosses of a single coin. In this example, we estimate the probability that a biased coin lands on heads from the observation of $n$ independent tosses of that coin. Specifically, we consider flipping a coin $Y$ which has an unknown probability $\theta^\dagger$ of coming up heads ($Y = 1$) and probability $1 - \theta^\dagger$ of coming up tails ($Y = 0$). Here $\Theta := [0,1]$, $\mathcal{X} = \{0,1\}$, and the model $\mathbb{P} : \Theta \to \mathcal{P}(\{0,1\})$ is $\mathbb{P}(Y = 1\,|\,\theta) = \theta$ and $\mathbb{P}(Y = 0\,|\,\theta) = 1 - \theta$.

We toss the coin $n$ times, generating a sequence of i.i.d. Bernoulli variables $(Y_1, \ldots, Y_n)$, all with the same unknown parameter $\theta^\dagger \in [0,1]$, and let $x := (x_1, \ldots, x_n) \in \{0,1\}^n$ denote the outcome of the experiment. Let $h = \sum_{i=1}^n x_i$ denote the number of heads observed and $t = n - h$ the number of tails. Then the model for the $n$-fold toss is
$$\mathbb{P}(x\,|\,\theta) = \prod_{i=1}^n \theta^{x_i}(1 - \theta)^{1 - x_i} = \theta^h (1 - \theta)^t \tag{2.5}$$
and, given an observation $x$, the MLE is $\theta = \frac{h}{n}$, so that the relative likelihood (1.15) is
$$\bar p(x\,|\,\theta) = \frac{\theta^h (1 - \theta)^t}{\big(\frac{h}{n}\big)^h \big(\frac{t}{n}\big)^t}. \tag{2.6}$$
Although the fact that $\bar p(x\,|\,0) = \bar p(x\,|\,1) = 0$ violates the positivity assumptions on the model in our framework, this technical restriction can be removed in this case, so we can still use this example as an illustration. We seek to estimate $\theta$, so let $V = \mathbb{R}$ and let the quantity of interest $\phi : \Theta \to V$ be the identity function $\phi(\theta) = \theta$. In this case, given $\alpha \in [0,1]$, the likelihood region
$$\Theta_x(\alpha) = \Big\{ \theta \in [0,1] : \frac{\theta^h (1 - \theta)^t}{\big(\frac{h}{n}\big)^h \big(\frac{t}{n}\big)^t} \ge \alpha \Big\} \tag{2.7}$$
constrains the support of priors to points with relative likelihood at least $\alpha$.
Using Theorem 1.5 with $m = \dim(V) + 1 = 2$, one can compute a saddle point $(\pi_\alpha, d_\alpha)$ of the game (1.23) as
$$\pi_\alpha = w\,\delta_{\theta_1} + (1 - w)\,\delta_{\theta_2} \quad \text{and} \quad d_\alpha = w\,\theta_1 + (1 - w)\,\theta_2, \tag{2.8}$$
where $w, \theta_1, \theta_2$ maximize the variance
$$\begin{cases} \text{Maximize} & w\,\theta_1^2 + (1 - w)\,\theta_2^2 - \big(w\,\theta_1 + (1 - w)\,\theta_2\big)^2 \\ \text{over} & 0 \le w \le 1,\ \theta_1, \theta_2 \in [0,1] \\ \text{subject to} & \frac{\theta_i^h (1 - \theta_i)^t}{\left(\frac{h}{n}\right)^h \left(\frac{t}{n}\right)^t} \ge \alpha, \quad i = 1, 2. \end{cases} \tag{2.9}$$
Equation (1.19) allows us to compute $\beta \in [0,1]$ as a function of $\alpha \in [0,1]$. The solution of the optimization problem can be found by finding the minimum enclosing ball of the set $\Theta_x(\alpha)$, which in this one-dimensional case is also a subinterval of the interval $[0,1]$. For $n = 5$ tosses resulting in $h = 4$ heads and $t = 1$ tail, Figure 4 plots (1) $\beta$, the relative likelihood, its level sets, and minimum enclosing balls as a function of $\alpha$, and (2) the risk $\mathcal{R}(d_\alpha) =$ (1.26) and the optimal decision $d_\alpha$ as a function of $\beta$. Three different points on the $\alpha$-$\beta$ curve are highlighted.
Figure 4. $\alpha$-$\beta$ relation, likelihood level sets, risk value, and decision for different choices of $\alpha$ (and consequently $\beta$) for the one-coin problem after observing 4 heads and 1 tail. Three different values on the $\alpha$-$\beta$ curve are highlighted across the plots.

2.2.2. $n_1$ and $n_2$ tosses of two coins. We now consider the same problem with two independent coins with unknown probabilities $\theta^\dagger_1, \theta^\dagger_2$. After tossing each coin $i$ a total of $n_i$ times, the observations $x$, consisting of $h_i$ heads and $t_i$ tails for each $i$, produce a 2D relative likelihood function on $\Theta = [0,1]^2$
given by
$$\bar p(x\,|\,\theta_1, \theta_2) = \frac{\theta_1^{h_1} (1 - \theta_1)^{t_1}}{\big(\frac{h_1}{n_1}\big)^{h_1} \big(\frac{t_1}{n_1}\big)^{t_1}} \cdot \frac{\theta_2^{h_2} (1 - \theta_2)^{t_2}}{\big(\frac{h_2}{n_2}\big)^{h_2} \big(\frac{t_2}{n_2}\big)^{t_2}}. \tag{2.10}$$
Figure 5 illustrates the level sets $\{\bar p(x\,|\,\theta_1, \theta_2) \ge \alpha\}$ and their corresponding bounding balls for $h_1 = 1$, $t_1 = 3$, $h_2 = 5$, $t_2 = 1$, and different values of $\alpha \in [0,1]$.