DECISION THEORETIC BOOTSTRAPPING

PEYMAN TAVALLALI$^{1,*}$, HAMED HAMZE BAJGIRAN$^{2}$, DANIAL J. ESAID$^{3}$, HOUMAN OWHADI$^{2,*}$
Abstract. The design and testing of supervised machine learning models combine two fundamental distributions: (1) the training data distribution and (2) the testing data distribution. Although these two distributions are identical and identifiable when the data set is infinite, they are imperfectly known (and possibly distinct) when the data is finite (and possibly corrupted), and this uncertainty must be taken into account for robust Uncertainty Quantification (UQ). We present a general decision-theoretic bootstrapping solution to this problem: (1) partition the available data into a training subset and a UQ subset, (2) take $m$ subsampled subsets of the training set and train $m$ models, (3) partition the UQ set into $n$ sorted subsets and take a random fraction of them to define $n$ corresponding empirical distributions $\mu_j$, (4) consider the adversarial game where Player I selects a model $i \in \{1, \ldots, m\}$, Player II selects the UQ distribution $\mu_j$, and Player I receives a loss defined by evaluating the model $i$ against data points sampled from $\mu_j$, and (5) identify optimal mixed strategies (probability distributions over models and UQ distributions) for both players. These randomized optimal mixed strategies provide optimal model mixtures and UQ estimates given the adversarial uncertainty of the training and testing distributions represented by the game. The proposed approach provides (1) some degree of robustness to distributional shift in both the distribution of the training data and that of the testing data, and (2) conditional probability distributions on the output space forming aleatory representations of the uncertainty on the output as a function of the input variable.
$^{1}$ Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109
$^{2}$ California Institute of Technology, Pasadena, CA 91125
$^{3}$ Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109; University of California, Riverside, CA 92521
$^{*}$ Corresponding Author. E-mails: peyman.tavallali@jpl.caltech.edu; owhadi@caltech.edu
© 2021. California Institute of Technology. Government sponsorship acknowledged.
1. Introduction
Although traditional cross validation (CV) provides model UQ/error estimates [64, 65,
2, 21, 67], these estimates are in general not robust to distribution shifts. We propose to
incorporate the impact of distributional uncertainty in the adversarial setting of decision
theory (DT) [47].
1.1. Main Result:
Our proposed decision theoretic bootstrapping (DTB) solution is
to:
Figure 1. Train-UQ Partitioning of the Data. The left plot shows the schematic partitioning of the original empirical data into two exclusive train and UQ sets. The middle plot shows how the train set is used to (randomly) train $m$ ML estimators. Here, the training sets are found by subsampling the train dataset $m$ times. The right plot shows the procedure of breaking up the UQ set into $n$ subsets, from which $n$ different empirical distributions can be extracted.
• randomly partition the available data $\{(X_i, Y_i)\}_{i \in I}$ into a training subset $\{(X_i, Y_i)\}_{i \in T}$ and a UQ subset $\{(X_i, Y_i)\}_{i \in U}$ (see Figure 1-a);
• randomly subsample the training set $T$ into $m$ subsets $\{T_i\}_{i=1}^{m}$, and train $m$ models $\{F_i\}_{i=1}^{m}$ (see Figure 1-b);
• partition the UQ set into $n$ sorted subsets;
• for $k \in \{1, \ldots, K\}$, repeat the following steps:
  – randomly subsample from all the $n$ sorted subsets of $U$ and define $n$ corresponding empirical distributions $\{\mu_j\}_{j=1}^{n}$ (see Figure 1-b);
  – consider the zero-sum adversarial game where Player I selects a model $F_i$, Player II selects the UQ distribution $\mu_j$, and Player I receives a random loss defined as the empirical risk
    $$L_{ij} = \mathbb{E}_{(X,Y) \sim \mu_j}\, \mathcal{E}(Y, F_i(X)), \tag{1.1}$$
    where $\mathcal{E}$ is an error function. The diagram of the proposed game is as follows:
    $$\text{(Player I)} \quad F_i \;\xrightarrow{\;\min\;}\; L_{ij} \;\xleftarrow{\;\max\;}\; \mu_j \quad \text{(Player II)}$$
  – identify and record the optimal mixed strategies (probability distributions) $p^{(k)} = \big(p^{(k)}_1, \ldots, p^{(k)}_m\big)$ over the models $\{F_i\}_{i=1}^{m}$ and $q^{(k)} = \big(q^{(k)}_1, \ldots, q^{(k)}_n\big)$ over the UQ distributions $\{\mu_j\}_{j=1}^{n}$, for both players, by solving the minimax problem
    $$\min_p \max_q \sum_{i,j} p_i L_{ij} q_j, \tag{1.2}$$
    and record the value $v^{(k)}$ of the game (1.2) (a linear-programming sketch of this step follows the list);
• use the distribution of the $p^{(k)}$, and possibly the $v^{(k)}$, for UQ analysis.
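Since (1.2) is a finite zero-sum game, its optimal mixed strategies can be computed with the classical linear-programming (LP) reformulation of the minimax problem. The paper does not prescribe a solver; the following is a minimal sketch of this step under our own assumptions (NumPy/SciPy available, loss matrix `L` precomputed), not the authors' released code.

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum_game(L):
    """Optimal strategy of the minimizing player of min_p max_q p^T L q.

    Since the inner maximum over mixed strategies q is attained at a pure
    strategy, the problem reduces to the LP
        minimize v  subject to  (L^T p)_j <= v for all j, sum(p) = 1, p >= 0.
    Returns (p, v): Player I's distribution over models and the game value.
    """
    m, n = L.shape
    c = np.concatenate([np.zeros(m), [1.0]])          # decision vector (p, v); minimize v
    A_ub = np.hstack([L.T, -np.ones((n, 1))])         # L^T p - v <= 0, one row per j
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]])[None]  # probabilities sum to one
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]         # p >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Example: m = 20 models against n = 100 UQ distributions, with a random loss matrix.
rng = np.random.default_rng(0)
p, v = solve_zero_sum_game(rng.random((20, 100)))
```

Player II's optimal strategy $q$, when needed, can be obtained by solving the transposed game for the maximizing player.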
1.2. Structure of the paper: The proposed methodology and algorithm are described in Section 2. In Section 3, we discuss the relevance of DTB to CV. We then present several detailed examples in Section 4. We cover related works in Section 5 and compare the proposed method with other UQ-for-ML methods in the literature.
2. DTB Methodology and Algorithm
In any modeling endeavor, there are three major questions: "How can the uncertainty be quantified?", "How robust is a given model if the distribution of the testing data is distinct from that of the training data?", and finally, "How can we pick amongst many trained models?". This manuscript presents a decision-theoretic (zero-sum game) framework answering all three questions. We will now motivate these questions and illustrate DTB with the following real-world problem.
Example 1. Given the California housing dataset (which includes ZIP codes, square footage, number of bedrooms and bathrooms, sales prices, etc.) [52], and 20 ML models (e.g., regression trees) trained on distinct subsets of this dataset, (1) predict sales prices for a new test set (that includes the previous features but not sales prices), (2) provide confidence intervals for each prediction, and (3) establish the robustness of the predictions to distributional shift between training and testing data. Although selecting the model with the lowest error on the training set is a natural solution to (1), the deterministic nature of this solution is in conflict with the requirements of (2) providing confidence intervals and (3) ensuring robustness to distributional shifts. Similarly, although using training errors to create a weighted ranking of the 20 models could be used to provide confidence intervals, it does not ensure robustness to distribution shifts. With the proposed approach, we estimate the error of one of these 20 models (selected at random by Player I) against an empirical distribution (selected at random by Player II). The randomization of the underlying game provides confidence intervals, and its adversarial nature ensures robustness to distribution shifts. Figure 2 illustrates DTB. For that figure, we trained $m = 20$ different regression tree models (each with a maximum depth of 15) on a train set with $|T| = 13759$ samples, then exposed those models to $n = 100$ unseen randomized realizations of a UQ set (with $|U| = 6777$ samples) to find each model's probability, and finally applied them to a left-out test set with 104 samples. We chose a small test set to accentuate uncertainties for ease of presentation. Each model is trained on a random half of the training set ($|F_i| \approx |T|/2$). The sorted unseen randomized realizations of the UQ set are divided into $n = 100$ subsets. The combined support of the $\mu_j$ in each randomization encompasses roughly 20% of the whole UQ set. The prediction distribution $p_i$ over the 20 models is calculated based on the material presented in Section 3.2. For each of the 104 samples in the left-out test set, Figure 2 shows the mean prediction (obtained by averaging the predictions of the models with respect to the distribution $p_i$) and the standard deviation of that prediction (with respect to the distribution $p_i$). As can be seen from this figure, almost all ground truth values over the test set fall between the confidence bounds of the predictions (note that we used one standard deviation to define confidence intervals).
2.1. Setup. The supervised ML setting is that of approximating an unknown function $F^{\dagger}$ given $F^{\dagger}(X_i) = Y_i$ for all $i$, where $\{(X_i, Y_i)\}_{i \in I}$ represents the available input/output
training data. One major issue with solving this problem as a regression/interpolation (curve-fitting) problem is that the interpolant does not carry an intrinsic measure of uncertainty (even if it has been obtained as a conditional expectation by conditioning a Gaussian prior [57], the conditional covariance of the posterior distribution depends on the covariance function of the Gaussian prior, whose selection is to some degree arbitrary). Furthermore, this approach is not stable to distribution shifts, especially in the case of deep neural networks [46]. A partial remedy is to use CV to ensure that the training and generalization errors are close [64, 65, 2, 21, 67]. However, CV does not address the distribution shift problem, and it does not provide pointwise (input-dependent) prediction confidence intervals. Although there are methods that provide stable linear regression, mitigating the effects of distribution shifts by a relaxed minimax approach [4], these methods do not provide pointwise prediction confidence intervals. In contrast, our proposed solution, based on decision theory [47], is to formulate the underlying regression problem (linear or non-linear) as a zero-sum adversarial game over the set of possible machine learning (ML) models and empirical data distributions, providing both robustness of estimation and prediction confidence intervals. To do so, the modeler must provide a set of fitting functions $\{F_i(X)\}_{i=1}^{m}$ and play an adversarial zero-sum game against some random empirical distributions of the data. In what follows, we present a detailed description of each step.
2.2. Partitioning the training data: Our first step is to randomly partition the available (training) data $\{(X_i, Y_i)\}_{i \in I}$ into two mutually exclusive sets: a training set $\{(X_i, Y_i)\}_{i \in T}$ and a UQ set $\{(X_i, Y_i)\}_{i \in U}$.
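For concreteness, a minimal sketch of this split (assuming scikit-learn and toy stand-in data; the $T$/$U$ proportions are a free choice, and Example 1 uses roughly a 2:1 split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the available data {(X_i, Y_i)}_{i in I}.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = (X * np.sin(X)).ravel()

# Mutually exclusive train set T and UQ set U (roughly 2:1 here, as in Example 1).
X_T, X_U, y_T, y_U = train_test_split(X, y, test_size=1/3, random_state=0)
```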
2.3. Training models: Next, we subsample $m$ subsets of the training set $T$. We write $\{T_i\}_{i=1}^{m}$ for these subsets and train $m$ corresponding models $\{F_i\}_{i=1}^{m}$ ($F_i$ is the model obtained by using $T_i$ as training data). The training pipeline can be arbitrary (e.g., the corresponding models could be deep neural networks [22], random forests [7, 28], or any other ML model). Therefore, there is no constraint on the model-producing pipeline: any function estimator relevant to the problem at hand, and any training methodology, could be used. Even the class of these functions could differ across $i \in \{1, \ldots, m\}$. For example, some could be linear regression models (for some labels $i$), some decision trees [8] (for other labels), and some support vector machines [9] (for the remaining labels), as sketched below.
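A minimal sketch of this training step (our illustration, assuming scikit-learn; any estimators with a fit/predict interface would do), mixing model families across $i$ as described:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

def train_subsampled_models(X_T, y_T, m=20, frac=0.5, seed=0):
    """Train m models {F_i}, each on an independent random subsample T_i of T.

    The model class may differ across i; here we alternate between linear
    regression and depth-limited regression trees purely for illustration.
    """
    rng = np.random.default_rng(seed)
    models = []
    for i in range(m):
        idx = rng.choice(len(X_T), size=int(frac * len(X_T)), replace=False)
        F_i = LinearRegression() if i % 2 == 0 else DecisionTreeRegressor(max_depth=10)
        models.append(F_i.fit(X_T[idx], y_T[idx]))
    return models

models = train_subsampled_models(X_T, y_T)  # reuses the split sketched in Section 2.2
```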
2.4. Bootstrapping a set of empirical distributions: Next, we sort $U$ with respect to the response variable $Y$ and then partition this sorted set into $n$ subsets $\{U_j\}_{j=1}^{n}$. Then, we randomly subsample the sorted set $\{(X_i, Y_i)\}_{i \in U}$ from the $n$ subsets $\{U_j\}_{j=1}^{n}$ and define $n$ corresponding empirical distributions
$$\mu_j := \frac{1}{|U_j|} \sum_{i \in U_j} \delta_{(X_i, Y_i)}. \tag{2.1}$$
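A minimal sketch of this step (again our reading, not reference code): sort $U$ by the response, split it into $n$ contiguous blocks $U_j$, and draw $s$ points without replacement from each block to support $\mu_j$ (the equal-size convention $|U_j| = s$ of Section 2.7).

```python
import numpy as np

def sample_empirical_distributions(X_U, y_U, n, s, rng):
    """Sort U by the response Y, partition it into n contiguous blocks, and
    subsample s points (without replacement) from each block; each draw,
    with uniform weights, supports one empirical distribution mu_j."""
    order = np.argsort(y_U)            # sort U with respect to Y
    blocks = np.array_split(order, n)  # the n sorted subsets U_1, ..., U_n
    mus = []
    for block in blocks:
        idx = rng.choice(block, size=min(s, len(block)), replace=False)
        mus.append((X_U[idx], y_U[idx]))
    return mus
```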
2.4.1. Deterministic Game: Consider the game in which Player I selects an element from $\{F_i\}_{i=1}^{m}$ and Player II selects an element from $\{\mu_j\}_{j=1}^{n}$ (defined as in (2.1)). Let
$$L_{ij} = \mathbb{E}_{(X,Y) \sim \mu_j}\big[\mathcal{E}(Y, F_i(X))\big] \tag{2.2}$$
Figure 2. Robust Prediction with Confidence Intervals. This plot shows that, on a random test set from the California Housing dataset [52], the predictions of 20 different regression trees are robust in the sense that their confidence intervals include the ground truth. There are only a few large confidence intervals, and the rest are relatively tight. This plot shows the value of double randomization, both over the empirical distributions and over the models. The latter is one of the main ideas of the adversarial game-theoretic setup of DTB.
be the loss function for this game. In this deterministic setting, the optimal mixed strategies of Players I and II tend to concentrate on single elements, i.e., the worst distribution in $\{\mu_j\}_{j=1}^{n}$ and its best response in $\{F_i\}_{i=1}^{m}$. This is due to the fact that, although this game incorporates some degree of uncertainty in the selection of the data-generating distribution, it does not incorporate the uncertainty associated with sampling from that distribution. To address this issue, we repeat the proposed game over many random realizations of the $\mu_j$ and use the resulting distribution over optimal mixed strategies for our UQ analysis.
2.5. Randomized Game: Consider the game described in Subsection 2.4.1 and observe that the randomized sampling process generating the $\mu_j$ induces a randomization of the loss matrix $L$ ($L_{ij}$ is a random function of $(i,j)$). Assigning probabilities $p = (p_1, \ldots, p_m)$ over the models $\{F_i\}_{i=1}^{m}$ and $q = (q_1, \ldots, q_n)$ over a specific realization of $\{\mu_j\}_{j=1}^{n}$, optimal mixed strategies $(p^{\dagger}, q^{\dagger})$ are identified as solutions of the minimax problem
$$\min_p \max_q \sum_{i,j} p_i L_{ij} q_j. \tag{2.3}$$
Algorithm 1 Decision Theoretic CV

(1) Randomly separate $\{(X_i, Y_i)\}_{i \in I}$ into mutually exclusive sets $\{(X_i, Y_i)\}_{i \in T}$ and $\{(X_i, Y_i)\}_{i \in U}$.
(2) Generate $m$ randomly subsampled subsets $\{T_i\}_{i=1}^{m}$ of the training set $T$.
    (a) Train the models $F_i$ on the subsets $T_i$.
(3) Sort $U$ with respect to the response variable $Y$ and then partition into $n$ subsets $\{U_j\}_{j=1}^{n}$.
(4) For $k \in \{1, \ldots, K\}$:
    (a) Generate $n$ randomly subsampled subsets $\{U_j^{(k)}\}_{j=1}^{n}$ from $\{U_j\}_{j=1}^{n}$; write $\{\mu_j^{(k)}\}_{j=1}^{n}$ for the corresponding empirical distributions.
    (b) Solve the minimax problem $\min_p \max_q \sum_{i,j} p_i L_{ij} q_j$, and record the optimal $p^{(k)}$ and the value $v^{(k)}$ of the game.
(5) Report $\bar{p} = \frac{1}{K} \sum_{k=1}^{K} p^{(k)}$ as the distribution over models and the histogram of the values $v^{(k)}$.
Note that the randomization of the loss $L$ implies that $(p^{\dagger}, q^{\dagger})$ are random distributions. We will now sample from these distributions by solving (2.3) many times (with different random realizations of $L$). This approach is directly related to Harsanyi's purification over repeated games [26].
2.6. Probability distributions over models: Given $K \geq 1$, let $(p^{(k)}, q^{(k)})$ be $K$ i.i.d. samples from $(p^{\dagger}, q^{\dagger})$, and write $v^{(k)}$ for the corresponding random values of (2.3). Note that the $(p^{(k)}, q^{(k)})$ are obtained by solving (2.3) over $K$ independent realizations of the loss $L$. We record each optimal $p^{(k)}$ and each value $v^{(k)}$, and use, for UQ, the histogram defined by the $v^{(k)}$ and the empirical average
$$\bar{p} = \frac{1}{K} \sum_{k=1}^{K} p^{(k)}. \tag{2.4}$$
From an ML perspective, this approach is robust to distribution shifts. This procedure is presented in Algorithm 1.
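Putting the pieces together, here is a compact sketch of steps (4)-(5) of Algorithm 1, reusing `sample_empirical_distributions` and `solve_zero_sum_game` from the sketches above and taking the error function $\mathcal{E}$ to be the squared error (our choice for illustration; the paper leaves $\mathcal{E}$ generic).

```python
import numpy as np

def dtb(models, X_U, y_U, n=100, s=None, K=100, seed=0):
    """Steps (4)-(5) of Algorithm 1: repeat the randomized game K times and
    return the averaged optimal model distribution p_bar (2.4) together
    with the K recorded game values v^(k)."""
    rng = np.random.default_rng(seed)
    if s is None:
        s = max(1, int(0.2 * len(y_U) / n))  # purification ratio ns/|U| ~ 0.2
    p_sum, values = np.zeros(len(models)), []
    for _ in range(K):
        mus = sample_empirical_distributions(X_U, y_U, n, s, rng)
        # Loss matrix L_ij = E_{(X,Y) ~ mu_j} [ (Y - F_i(X))^2 ].
        L = np.array([[np.mean((y - F.predict(X)) ** 2) for (X, y) in mus]
                      for F in models])
        p_k, v_k = solve_zero_sum_game(L)
        p_sum += p_k
        values.append(v_k)
    return p_sum / K, np.array(values)

p_bar, v = dtb(models, X_U, y_U)  # report p_bar and the histogram of the v^(k)
```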
2.7. How to Choose $K$, $n$, and the $U_j$: Choosing $K$ is relatively straightforward: since (2.4) is a Monte Carlo (MC) estimate, its accuracy improves with the number of sample points. The $U_j$ are obtained through random subsampling of $U$ and are chosen to be of equal size $s$ ($|U_j| = s$). The $U_j$ do not overlap. We selected $s$ and $n$ so that the size of the combined support of the $U_j$ is about 20% of the size of $U$ ($ns \approx 0.2\,|U|$). Note that given $ns \approx 0.2\,|U|$, the parameterization of our approach can be reduced to $s$, which can be interpreted as a trade-off parameter between robustness (small $s$) and accuracy (large $s$). Recall that robustness and accuracy are conflicting requirements [49].
3. UQ analysis
3.1. CV and Bayesian inference. While CV provides an estimate of the generalization error, this estimate is not pointwise, in the sense that it does not depend on the particular input of the model. Furthermore, CV does not produce any distribution over models. Finally, while Bayesian inference could be employed to obtain a distribution over models, doing so would not be robust with respect to the selection of the prior [50, 51, 48, 49].
3.2. DTB. On the other hand, by inducing a probability distribution $\bar{p}$ over models, DTB defines a robust pointwise distribution over the output space, where robustness is inherited from the adversarial nature of the game. This distribution depends on the particular choice of the loss function, and the histogram of the $\{v^{(k)}\}_{k=1}^{K}$ provides a robust quantification of the generalization error. Given an input $X$, write
$$\rho(X) = \sum_{i=1}^{m} F_i(X)\, \bar{p}_i, \tag{3.1}$$
for the average model output and
$$\sigma^2(X) = \sum_{i=1}^{m} \big(F_i(X) - \rho(X)\big)^2\, \bar{p}_i, \tag{3.2}$$
for the variance of the model outputs defined by $\bar{p}$. Note that $\bar{p}$ may be distinct from the uniform distribution employed in ensemble modeling.
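In code, (3.1) and (3.2) reduce to a weighted ensemble mean and variance; a minimal sketch, under the same assumptions as the sketches above:

```python
import numpy as np

def predict_with_uncertainty(models, p_bar, X):
    """Pointwise mean rho(X) and standard deviation sigma(X) of the model
    outputs under the DTB distribution p_bar, following (3.1)-(3.2)."""
    preds = np.stack([F.predict(X) for F in models])  # shape (m, n_points)
    rho = p_bar @ preds                               # sum_i F_i(X) p_bar_i
    var = p_bar @ (preds - rho) ** 2                  # sum_i (F_i(X) - rho(X))^2 p_bar_i
    return rho, np.sqrt(var)
```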
4. Examples
We will now illustrate the proposed approach (Algorithm 1) on synthetic and real-
world datasets.
Example 2. Consider the univariate function $y = x \sin x$ for $x \in [0, 10]$ (solid dark plots in Figure 3). The dataset is composed of 35 pointwise evaluations of this function. In the left plot of Figure 3, we regress each training subset $T_i$ with a polynomial of degree 4 to produce the corresponding model $F_i$ and use Algorithm 1 with 20 regression polynomials to obtain a probability distribution $\bar{p}$ over these models. As seen in this plot, the ground truth is close to the prediction mean, defined by (3.1), and it falls between the confidence bounds, defined by (3.2). Although low-order polynomials do not lead to highly accurate models, Algorithm 1 provides a robust quantification of generalization errors. Replacing the 20 regression polynomials by regression trees further improves this picture, as illustrated in the right plot of Figure 3. In this case, we fixed the depth of each tree to 5. Note that the confidence intervals are now smaller near the boundaries. Generally, this simple example illustrates the robustness of Algorithm 1 when the data is sparse and non-uniformly distributed, and the models are weak.
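A hypothetical end-to-end run in the spirit of Example 2, reusing the sketches above (the subsample sizes and the shortcut $T = U$ are our simplifications, not the paper's exact configuration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(35, 1))  # 35 pointwise evaluations of y = x sin x
y = (x * np.sin(x)).ravel()

# 20 degree-4 regression polynomials, each trained on a random subsample.
polys = []
for _ in range(20):
    idx = rng.choice(35, size=18, replace=False)
    polys.append(make_pipeline(PolynomialFeatures(4), LinearRegression()).fit(x[idx], y[idx]))

p_bar, v = dtb(polys, x, y, n=5, s=2, K=50)  # T = U here, purely for brevity
rho, sigma = predict_with_uncertainty(polys, p_bar, x)  # mean and confidence band
```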
Example 3. With this example, we illustrate the robustness of Algorithm 1 on the following datasets: California Housing [52], Electrical Grid Stability [3], Superconductor [25], and Bike Sharing [14]. For the sake of brevity, we call them Housing, Grid, SC, and Bike,
Figure 3. Robust Prediction with Confidence Intervals. The left plot shows the application of Algorithm 1 when the ML estimators are 20 regression polynomials, each of degree 4. The ground truth is shown in solid dark color. The sample points are the dots, and the prediction mean and confidence intervals are shown in red and filled blue, respectively. The right plot uses the setting of the left plot, except that the 20 regressors are regression trees. Irrespective of the type of the regressors, both plots show robust predictions and confidence intervals obtained from only a sparse set of non-uniform samples.
respectively. To normalize target ranges, the target variables in Grid, SC, and Bike were multiplied by 100.0, 0.1, and 0.01, respectively.
For contrast, we compare the model distribution obtained from Algorithm 1 (DT models) with the uniform distribution. In all the cases, we take 10 regression trees with a maximum depth of 10. The original data is divided into 80% train-UQ and 20% test sets. The train-UQ set is then fed into Algorithm 1 once with separate $T$ and $U$ sets, and once with $T = U$. The latter is done for the sake of a fair comparison with uniform ensembles. In the algorithm, we take $n = 100$ and $K = 100$. Also, for each $k \in \{1, \ldots, K\}$, we take $sn/|U| \approx 0.2$.
We introduce two modes of training: in the first mode, each tree is trained on a randomly selected 0.5% of the train set $T$; in the second mode, each tree is trained on a randomly selected 50% of the train set. The first and second modes are called "weak" and "strong", respectively.

Furthermore, we introduce two comparison metrics: one is the overall mean squared loss on the test set; the other is the maximum mean squared loss over folds, where folds are determined by binning the sorted target variable. In these experiments, we use 100 folds. This second metric is an indicator of the robustness of the final ensemble with respect to distribution shifts.
We repeat each setup in this experiment 20 times. The results are reported in Tables 1 and 2. The values in parentheses are the ones produced with $T = U$. As seen from Table 2, the ensembles of estimators generated by Algorithm 1 show lower maximum fold losses in nearly all weak cases and most of the strong cases, compared to their uniform-averaging counterparts. The difference is more prominent when the models are weakly trained. Having a lower maximum fold loss is an indication of the robustness of
Overall      Weak Trees                      Strong Trees
             Uniform          DT             Uniform          DT
Housing      0.54* (0.51*)    0.59 (0.66)    0.31* (0.31)     0.32 (0.31)
Grid         7.54 (7.23*)     7.27* (7.58)   1.95* (1.92*)    1.98 (1.95)
SC           3.41* (3.15*)    3.52 (3.81)    1.28* (1.23)     1.30 (1.22*)
Bike         1.47* (1.39*)    1.66 (1.87)    0.28* (0.27*)    0.28* (0.28)

Table 1. Overall Losses Comparison. This table shows the overall loss comparison of 10 regression trees when the ensemble mean is taken by the use of Algorithm 1 versus simple uniform averaging; these are called DT and Uniform, respectively, in the table. The values in parentheses correspond to $T = U$. Asterisks mark the better losses in each category; weak vs. strong.
Max Fold     Weak Trees                         Strong Trees
             Uniform          DT                Uniform         DT
Housing      2.91 (2.93)      2.40* (2.52*)     1.74 (1.72)     1.73* (1.70*)
Grid         32.25 (31.96)    29.97* (27.47*)   9.79 (8.50*)    9.74* (8.50*)
SC           21.96 (19.22)    15.61* (12.56*)   5.40 (5.23)     5.39* (5.12*)
Bike         19.43 (17.60)    11.98* (10.10*)   1.31* (1.38)    1.34 (1.36*)

Table 2. Maximum Fold Losses Comparison. This table shows the maximum fold loss comparison of 10 regression trees. DT uses the distribution over models obtained from Algorithm 1; Uniform refers to the uniform distribution over models. The values in parentheses correspond to $T = U$. Asterisks mark the better losses in each category; weak vs. strong.
Algorithm 1 with respect to distribution shifts, especially when the models are not strong. In other words, Algorithm 1 shows its best applicability when the epistemic uncertainties are notable. On the other hand, most of the time, the overall loss is lower for uniform averaging of the models; see Table 1.

It is noteworthy that the improvement in the maximum fold loss is larger than the degradation in the overall loss when Algorithm 1 is used. In other words, at the cost of a slight sacrifice in overall accuracy, Algorithm 1 makes the models more stable and consistent, with higher precision, in their predictions.
Example 4. In this example, we investigate the effect of $ns/|U|$. Since this ratio is a part of the purification over repeated games [26], we call it the "purification ratio" for brevity. Our purpose is to illustrate the maximum fold loss, and also the overall loss, as a function of the purification ratio, whose values are between 0 and 1. We work with the California Housing [52] dataset, and we take an ensemble of models produced by Algorithm 1 with $n = 100$. In all the cases, we take 10 regression trees with a maximum depth of 10. The original data is divided into 80% train-UQ and 20% test sets. Each
Figure 4. Purification Ratio. The top plot shows the maximum fold loss for different purification ratios. The bottom plot shows the overall loss for different purification ratios. Both plots correspond to the test dataset. A weighted local linear regression is used to provide a better picture in both plots. These plots show that the purification ratio should be around 0.15 for this particular dataset to achieve the best balance between the overall loss and the maximum fold loss.
tree is trained on a randomly selected 50% of the train set. This time, to estimate the robustness of the algorithm using the maximum fold loss, we use only 20 bins (folds) over the test set. For each purification ratio, we conduct the experiment 100 times, and we plot the mean of each experiment. To make sure enough data points are seen by the estimators, $K$ is taken to be 5 divided by the purification ratio.
The results are shown in Figure 4. As can be seen from the plots, there is a purification ratio close to 0.15 that produces the minimal maximum fold loss in all the tests. At the same time, the overall loss decreases uniformly with a decrease in the purification ratio. This experiment shows that, to capture the best balance between the overall loss and the maximum fold loss, the purification ratio $ns/|U|$ should be around 0.15. This also supports our choice of 0.20 in the previous example.
Example 5. In this example, we investigate the effect of data sparsity on Algorithm 1. As in the previous example, we work with the California Housing [52] dataset. In all the cases, we take 10 regression trees with a maximum depth of 10. The original data is divided into 80% train-UQ and 20% test sets. The train-UQ set is then fed into Algorithm 1 with $T = U$. The latter is done for the sake of a fair comparison with
Figure 5. Small vs Large Data Test. The top plot shows that Algorithm 1 is robust when the available data is sparse. The bottom plot shows that uniform averaging always provides a more accurate prediction. Both plots show that, for large datasets, both approaches converge in robustness and accuracy.
uniform ensembles. We take $n$ and $K$ to both be 100. Also, the purification ratio is taken to be 0.20. The maximum fold loss is calculated over 20 bins of the test set. In this experiment, we vary the data availability by allowing each estimator a fixed fraction of the train set; we call this variable the "data fraction". Figure 5 shows that for sparse data availability, Algorithm 1's error is consistently lower than that of the uniform averaging method. At the same time, the overall loss is always smaller for uniform averaging. This figure conveys the message that Algorithm 1 is more robust than uniform averaging, specifically when only a small amount of data is available.
5. Related Works
5.1. Sources of Uncertainty:
From a statistical perspective, uncertainty manifests
itself in two general forms.
5.1.1. Aleatoric Uncertainty (Irreducible Uncertainty): This type of uncertainty is intrinsic to the randomness of the data-generating process and is not reducible [11]. It is also known as statistical uncertainty. Aleatoric uncertainty can be categorized into homoskedastic and heteroskedastic categories [12]: homoskedastic randomness stays constant across data points, whereas heteroskedastic uncertainty varies with them.
5.1.2. Epistemic Uncertainty (Reducible Uncertainty): This type of uncertainty arises from incomplete knowledge of the data and can be reduced by increased data collection, incorporation of more features, and a better selection of the model [11].
5.2. What does UQ mean in ML?. Based on the sources of uncertainty enumerated above, we can define UQ as any method that can quantify the individual or combined effects of the aleatoric and epistemic uncertainties. Hence, UQ can have different interpretations in the context of ML and statistical learning theory [15]. In what follows, we survey some of these methods.
5.2.1. UQ as Expected Generalization (Prediction) Error: In this form, UQ quantifies the expected generalization error of a machine learning algorithm. There are two separate approaches here. One is analytical, used for linear models, and based on the training dataset (in-sample) [40]. The other is based on efficient sample re-use (extra-sample) and can be used for both linear and nonlinear models [15]. The first approach encompasses methods like Mallows' $C_p$ [36, 38, 35, 37] and the Akaike information criterion (AIC) [1]. The second approach uses techniques like cross-validation (CV) [64, 65, 2, 21, 67] and bootstrapping [13, 6, 7, 56]. For small to moderately large models, CV methods can be used readily. However, computational costs may impede such approaches for larger models like deep neural networks (DNN).
5.2.2. UQ as Posterior Distribution Variance: In this form, the machine learning model that is used must be probabilistic, in a Bayesian inference (BI) framework. The posterior distribution of the model parameters can then be exploited to express prediction confidence intervals [18, 41]. The major issue here is that the Bayesian approach needs priors that are not necessarily found from observations. For linear models, this type of UQ task is commonly done via the Bayesian information criterion (BIC) [62]. For nonlinear models, it is done by approaches like Monte Carlo (MC) and Markov Chain Monte Carlo (MCMC) [19, 17, 20, 18]. Unfortunately, MCMC methods become computationally expensive for larger models. Using variational Bayesian inference (VBI) [23] has been proposed as a possible solution. For example, Bayesian neural networks (BNN) [34, 43, 23, 33, 5, 61] mainly use VBI. However, the computational costs are still prohibitive [46], MCMC comparisons show the low quality of the posterior approximations [68], and these approaches are sensitive to initial starting values [60].
5.2.3. UQ as Regression Quantiles: There are methods in which data is used to provide quantile bounds as prediction intervals over regressors in a frequentist sense [30, 29]. An extension of this methodology is the method of quantile regression forests (QRF), which estimates the predictions of a regressor in terms of quantile values [39]. Recently, conformal quantile regression (CQR) has combined conformal prediction [54, 66, 53] with the quantile regression (QR) methodology [39, 30, 29] to provide confidence intervals, by using a "pinball loss" [29] over a train set and then applying conformal prediction over a calibration set [59]. Although this method addresses heteroskedastic data with outliers, it requires a specific loss function.
5.2.4. Deep Learning UQ (DLUQ) Survey: Deep learning (DL) has arisen as a particularly powerful technology due to its ability to ingest raw data with less preprocessing than classical ML [22, 32]. However, DL models are costly at the training phase; hence, the application of the UQ views expressed in this section is not straightforward. Because of this major issue, deep neural network (DNN) models become over-confident [44]. Ad hoc methods like confidence calibration [24], temperature scaling [27] (using Platt scaling [55, 45]), histogram binning [69, 70], isotonic regression [70], deep ensembles with Gaussian mixtures [31], and Bayesian binning into quantiles [42] have been introduced to address deep learning UQ. Unfortunately, all these methods are mainly post-processing steps applied to models. The only systematic DLUQ method is the dropout UQ approach [16, 63]. Although this method is closely related to BI of deep Gaussian processes [58, 10], it is a limit-case study that could deviate from a Gaussian process for a finite width-depth DNN.
6. Conclusion

In this article, we have furnished a systematic machine learning UQ approach based on decision theory. DTB provides rigorous UQ in a cross-validation setting and produces ML algorithms that are more stable with respect to distribution shifts. The algorithm presented in this paper shows its best performance when there is significant epistemic uncertainty. At the same time, DTB provides pointwise prediction confidence intervals for ML regressors. This framework is not confined to regression problems in a classical setting and can readily be extended to DNNs and classification tasks as well. DTB slightly sacrifices accuracy to achieve high stability for ML estimators.
Acknowledgment
This research was carried out at the Jet Propulsion Laboratory, California Institute
of Technology, under a contract with the National Aeronautics and Space Administration
and support from Beyond Limits (Learning Optimal Models) and AFOSR (Grant number
FA9550-18-1-0271, Games for Computation and Learning). The authors are thankful to Amy Braverman, Lukas Mandrake, and Kiri Wagstaff for their insights.
References
[1] Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–213. Springer, 1998.
[2] David M Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16(1):125–127, 1974.
[3] Vadim Arzamasov, Klemens Böhm, and Patrick Jochem. Towards concise models of grid stability. In 2018 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm), pages 1–6. IEEE, 2018.
[4] Dimitris Bertsimas and Ivan Paskov. Stable regression: On the power of optimization over
randomization in training regression problems.
Journal of Machine Learning Research
, 21:1–25,
2020.
[5] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in
neural networks.
arXiv preprint arXiv:1505.05424
, 2015.
[6] Leo Breiman. Bagging predictors.
Machine learning
, 24(2):123–140, 1996.
[7] Leo Breiman. Random forests.
Machine learning
, 45(1):5–32, 2001.
14
TAVALLALI, BAJGIRAN, ESAID, OWHADI
[8] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen.
Classification and
regression trees
. CRC press, 1984.
[9] Corinna Cortes and Vladimir Vapnik. Support-vector networks.
Machine learning
, 20(3):273–297,
1995.
[10] Andreas Damianou and Neil Lawrence. Deep Gaussian processes. In Artificial Intelligence and Statistics, pages 207–215, 2013.
[11] Armen Der Kiureghian and Ove Ditlevsen. Aleatory or epistemic? does it matter?
Structural Safety
,
31(2):105–112, 2009.
[12] Francis X Diebold.
Elements of forecasting
. Citeseer, 1998.
[13] Bradley Efron. Estimating the error rate of a prediction rule: improvement on cross-validation.
Journal of the American statistical association
, 78(382):316–331, 1983.
[14] Hadi Fanaee-T and Joao Gama. Event labeling combining ensemble detectors and background
knowledge.
Progress in Artificial Intelligence
, 2(2-3):113–127, 2014.
[15] Jerome Friedman, Trevor Hastie, and Robert Tibshirani.
The elements of statistical learning
,
volume 1. Springer series in statistics Springer, Berlin, 2001.
[16] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
[17] Alan E Gelfand and Adrian FM Smith. Sampling-based approaches to calculating marginal densities.
Journal of the American statistical association
, 85(410):398–409, 1990.
[18] Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin.
Bayesian data analysis
. Chapman and Hall/CRC, 2013.
[19] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.
[20] Walter R Gilks, Sylvia Richardson, and David Spiegelhalter.
Markov chain Monte Carlo in practice
.
Chapman and Hall/CRC, 1995.
[21] Gene H Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as a method for
choosing a good ridge parameter.
Technometrics
, 21(2):215–223, 1979.
[22] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio.
Deep learning
, volume 1.
MIT press Cambridge, 2016.
[23] Alex Graves. Practical variational inference for neural networks. In
Advances in neural information
processing systems
, pages 2348–2356, 2011.
[24] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural
networks. In
Proceedings of the 34th International Conference on Machine Learning-Volume 70
,
pages 1321–1330. JMLR. org, 2017.
[25] Kam Hamidieh. A data-driven statistical model for predicting the critical temperature of a
superconductor.
Computational Materials Science
, 154:346–354, 2018.
[26] John C Harsanyi. Games with randomly disturbed payoffs: A new rationale for mixed-strategy
equilibrium points.
International journal of game theory
, 2(1):1–23, 1973.
[27] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.
arXiv
preprint arXiv:1503.02531
, 2015.
[28] Tin Kam Ho. Random decision forests. In
Proceedings of 3rd international conference on document
analysis and recognition
, volume 1, pages 278–282. IEEE, 1995.
[29] Roger Koenker and Gilbert Bassett Jr. Regression quantiles.
Econometrica:
journal of the
Econometric Society
, pages 33–50, 1978.
[30] Roger Koenker and Kevin F Hallock. Quantile regression.
Journal of economic perspectives
,
15(4):143–156, 2001.
[31] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive
uncertainty estimation using deep ensembles. In
Advances in Neural Information Processing
Systems
, pages 6402–6413, 2017.
[32] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.
nature
, 521(7553):436, 2015.
DECISION THEORETIC BOOTSTRAPPING
15
[33] Christos Louizos and Max Welling. Multiplicative normalizing flows for variational Bayesian neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2218–2227. JMLR.org, 2017.
[34] David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
[35] CL Mallows. Choosing variables in a linear regression: A graphical aid. In
Central Regional Meeting
of the Institute of Mathematical Statistics, Manhattan, Kansas
, volume 5, 1964.
[36] Colin L Mallows. More comments on Cp. Technometrics, 37(4):362–372, 1995.
[37] Colin L Mallows. Choosing a subset regression. Technometrics, 9:190, 1967.
[38] Colin L Mallows. Some comments on Cp. Technometrics, 15(4):661–675, 1973.
[39] Nicolai Meinshausen. Quantile regression forests.
Journal of Machine Learning Research
,
7(Jun):983–999, 2006.
[40] Douglas C Montgomery, Elizabeth A Peck, and G Geoffrey Vining.
Introduction to linear regression
analysis
. John Wiley & Sons, 2015.
[41] Kevin P Murphy.
Machine learning: a probabilistic perspective
. MIT press, 2012.
[42] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[43] Radford M Neal.
Bayesian learning for neural networks
, volume 118. Springer Science & Business
Media, 2012.
[44] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence
predictions for unrecognizable images. In
Proceedings of the IEEE conference on computer vision
and pattern recognition
, pages 427–436, 2015.
[45] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised
learning. In
Proceedings of the 22nd international conference on Machine learning
, pages 625–632.
ACM, 2005.
[46] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D Sculley, Sebastian Nowozin, Joshua V Dillon,
Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating
predictive uncertainty under dataset shift.
arXiv preprint arXiv:1906.02530
, 2019.
[47] Houman Owhadi and Clint Scovel. Towards machine Wald. Springer Handbook of Uncertainty Quantification, 2015. arXiv preprint arXiv:1508.02449.
[48] Houman Owhadi and Clint Scovel. Brittleness of Bayesian inference and new Selberg formulas. Communications in Mathematical Sciences, 14(1):83–145, 2016.
[49] Houman Owhadi and Clint Scovel. Qualitative robustness in Bayesian inference. ESAIM: Probability and Statistics, 21:251–274, 2017.
[50] Houman Owhadi, Clint Scovel, and Tim Sullivan. On the brittleness of Bayesian inference. SIAM Review, 57(4):566–582, 2015.
[51] Houman Owhadi, Clint Scovel, Tim Sullivan, et al. Brittleness of Bayesian inference under finite information in a continuous world. Electronic Journal of Statistics, 9(1):1–79, 2015.
[52] R Kelley Pace and Ronald Barry. Sparse spatial autoregressions.
Statistics & Probability Letters
,
33(3):291–297, 1997.
[53] Harris Papadopoulos.
Inductive conformal prediction: Theory and application to neural networks
.
INTECH Open Access Publisher Rijeka, 2008.
[54] Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence
machines for regression. In
European Conference on Machine Learning
, pages 345–356. Springer,
2002.
[55] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized
likelihood methods.
Advances in large margin classifiers
, 10(3):61–74, 1999.
[56] J Ross Quinlan. C4.5: Programs for Machine Learning. Elsevier, 2014.
[57] Carl Edward Rasmussen. Gaussian processes in machine learning. In
Summer School on Machine
Learning
, pages 63–71. Springer, 2003.