Decision Theoretic Bootstrapping

The design and testing of supervised machine learning models combine two fundamental distributions: (1) the training data distribution, and (2) the testing data distribution. Although these two distributions are identical and identifiable when the data set is infinite, they are imperfectly known (and possibly distinct) when the data is finite (and possibly corrupted), and this uncertainty must be taken into account for robust Uncertainty Quantification (UQ). We present a general decision-theoretic bootstrapping solution to this problem: (1) partition the available data into a training subset and a UQ subset; (2) take $m$ subsampled subsets of the training set and train $m$ models; (3) partition the UQ set into $n$ sorted subsets and take a random fraction of them to define $n$ corresponding empirical distributions $\mu_{j}$; (4) consider the adversarial game where Player I selects a model $i\in\left\{ 1,\ldots,m\right\}$, Player II selects the UQ distribution $\mu_{j}$, and Player I receives a loss defined by evaluating the model $i$ against data points sampled from $\mu_{j}$; (5) identify optimal mixed strategies (probability distributions over models and UQ distributions) for both players. These randomized optimal mixed strategies provide optimal model mixtures and UQ estimates given the adversarial uncertainty of the training and testing distributions represented by the game. The proposed approach provides (1) some degree of robustness to distributional shift in both the distribution of the training data and that of the testing data, and (2) conditional probability distributions on the output space forming aleatory representations of the uncertainty on the output as a function of the input variable.


Introduction
Although traditional cross validation (CV) provides model UQ/error estimates [64,65,2,21,67], these estimates are in general not robust to distribution shifts. We propose to incorporate the impact of distributional uncertainty in the adversarial setting of decision theory (DT) [47].
1.1. Main Result: Our proposed decision theoretic bootstrapping (DTB) solution is to:
- randomly partition the available data $\{(X_i, Y_i)\}_{i \in I}$ into a training subset $\{(X_i, Y_i)\}_{i \in T}$ and a UQ subset $\{(X_i, Y_i)\}_{i \in U}$ (see Figure 1-a);
- randomly subsample the training set $T$ into $m$ subsets $\{T_i\}_{i=1}^{m}$, and train $m$ models $\{F_i\}_{i=1}^{m}$ (see Figure 1-b);
- for $k \in \{1, \ldots, K\}$ repeat the following steps:
  - partition the UQ set into $n$ sorted subsets;
  - randomly subsample from all the $n$ sorted subsets of $U$ and define $n$ corresponding empirical distributions $\{\mu_j\}_{j=1}^{n}$ (see Figure 1-b);
  - consider the zero-sum adversarial game where Player I selects a model $F_i$, Player II selects the UQ distribution $\mu_j$, and Player I receives a random loss defined as the empirical risk $\mathcal{L}_{i,j} = \mathbb{E}_{(X,Y) \sim \mu_j}\big[E\big(F_i(X), Y\big)\big]$, where $E$ is an error function;
  - identify and record the optimal mixed strategies (probability distributions) $p^{(k)} = \big(p^{(k)}_1, \ldots, p^{(k)}_m\big)$ over models $\{F_i\}_{i=1}^{m}$ and $q^{(k)} = \big(q^{(k)}_1, \ldots, q^{(k)}_n\big)$ over UQ distributions $\{\mu_j\}_{j=1}^{n}$, for both players, by solving the minimax problem (1.2) $\min_p \max_q \sum_{i,j} p_i \mathcal{L}_{i,j} q_j$, and record the value $v^{(k)}$ of the game (1.2);
- use the distribution of the $p^{(k)}$, and possibly the $v^{(k)}$, for UQ analysis.
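The minimax problem above is a finite zero-sum matrix game, solvable exactly by linear programming. As a hedged, stdlib-only sketch (the use of fictitious play rather than an LP solver, the iteration count, and all function names are our illustrative assumptions, not the paper's implementation), the optimal mixed strategies can be approximated as follows:

```python
def fictitious_play(L, iters=20000):
    """Approximate the optimal mixed strategies of the zero-sum game
    min_p max_q sum_{i,j} p_i * L[i][j] * q_j via fictitious play.
    Player I (rows) minimizes the loss; Player II (columns) maximizes it."""
    m, n = len(L), len(L[0])
    row_counts = [0] * m  # how often Player I has played each model
    col_counts = [0] * n  # how often Player II has played each distribution
    row, col = 0, 0       # arbitrary initial pure strategies
    for _ in range(iters):
        row_counts[row] += 1
        col_counts[col] += 1
        # Player I best-responds to Player II's empirical mixture ...
        row = min(range(m), key=lambda i: sum(L[i][j] * col_counts[j] for j in range(n)))
        # ... and Player II best-responds to Player I's empirical mixture.
        col = max(range(n), key=lambda j: sum(L[i][j] * row_counts[i] for i in range(m)))
    p = [c / iters for c in row_counts]
    q = [c / iters for c in col_counts]
    value = sum(p[i] * L[i][j] * q[j] for i in range(m) for j in range(n))
    return p, q, value
```

On the matching-pennies loss matrix `[[1, 0], [0, 1]]`, the empirical strategies drift toward the uniform mixture $(1/2, 1/2)$ with game value $1/2$, at the classical slow rate of fictitious play; in practice an exact linear-programming solver would be preferred.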

Structure of the paper:
The proposed methodology and algorithm are described in Section 2. In Section 3, we discuss the relevance of DTB to CV. We then present detailed examples in Section 4. We cover related works in Section 5 and compare the proposed method with other UQ for ML methods in the literature.

DTB Methodology and Algorithm
In any modeling endeavor, there are three major questions: "How can the uncertainty be quantified?", "How robust is a given model if the distribution of the testing data is distinct from that of the training data?", and finally, "How can we pick amongst many trained models?". This manuscript presents a decision-theoretic (zero-sum game) framework answering all three questions. We will now motivate these questions and illustrate DTB with the following real-world problem.
Example 1. Given the California housing dataset (which includes ZIP codes, square footage, number of bedrooms and bathrooms, sales prices, etc.) [52], and 20 ML models (e.g., regression trees) trained on distinct subsets of this dataset, (1) predict sales prices for a new test set (which includes the previous features but not sales prices), (2) provide confidence intervals for each prediction, and (3) establish the robustness of the predictions to distributional shift between training and testing data. Although selecting the model with the lowest error on the training set is a natural solution to (1), the deterministic nature of this solution is in conflict with the requirements of (2) providing confidence intervals and (3) ensuring robustness to distributional shifts. Similarly, although using training errors to create a weighted ranking of the 20 models could provide confidence intervals, it does not ensure robustness to distribution shifts. With the proposed approach, we estimate the error of one of these 20 models (selected at random by Player I) against an empirical distribution (selected at random by Player II). The randomization of the underlying game provides confidence intervals, and its adversarial nature ensures robustness to distribution shifts. Figure 2 illustrates DTB. For that figure, we trained $m = 20$ different regression tree models (each with a maximum depth of 15) on a train set with $|T| = 13759$ samples, then exposed those models to $n = 100$ unseen randomized realizations of a UQ set (with $|U| = 6777$ samples) to find each model's probability, and finally applied them to a left-out test set with 104 samples. We chose a small test set to accentuate uncertainties for ease of presentation. Each model is trained on a random half of the training set ($|T_i| \approx |T|/2$). The sorted unseen randomized realizations of the UQ set are divided into $n = 100$ subsets. The combined support of the $\mu_j$ in each randomization encompasses roughly 20% of the whole UQ set. The
prediction distribution $p$ over the 20 models is calculated based on material presented in Section 3.2. For each of the 104 samples in the left-out test set, Figure 2 shows the mean prediction (obtained by averaging the predictions of the models with respect to the distribution $p$) and the standard deviation of that prediction (with respect to the distribution $p$). As can be seen from this figure, almost all ground truth values over the test set fall between the confidence bounds of the predictions (note that we used one standard deviation to define confidence intervals).
2.1. Setup. The supervised ML setting is that of approximating an unknown function $F^{\dagger}$ given $F^{\dagger}(X_i) = Y_i$ for all $i$, where $\{(X_i, Y_i)\}_{i \in I}$ represents the available input/output training data. One major issue with solving this problem as a regression/interpolation (curve fitting) problem is that the interpolant does not carry an intrinsic measure of uncertainty (even if it has been obtained as a conditional expectation by conditioning a Gaussian prior [57], the conditional covariance of the posterior distribution depends on the covariance function of the Gaussian prior, whose selection is to some degree arbitrary). Furthermore, this approach is not stable to distribution shifts, especially in the case of deep neural networks [46]. A partial remedy is to use CV to ensure that the training and generalization errors are close [64,65,2,21,67]. However, CV does not address the distribution shift problem, and it does not provide pointwise (input-dependent) prediction confidence intervals. Although there are methods that provide a stable linear regression mitigating the effects of distribution shifts by a relaxed minimax approach [4], these methods do not provide pointwise prediction confidence intervals either. In contrast, our proposed solution, based on decision theory [47], is to formulate the underlying regression problem (linear or non-linear) as a zero-sum adversarial game over the set of possible machine learning (ML) models and empirical data distributions, providing both robustness of estimation and prediction confidence intervals. To do so, the modeler must provide a set of fitting functions $\{F_i(X)\}_{i=1}^{m}$ and play an adversarial zero-sum game against some random empirical distributions of the data. In what follows, we present a detailed description of each step.

Partitioning the training data:
Our first step is to randomly partition the available (training) data $\{(X_i, Y_i)\}_{i \in I}$ into two mutually exclusive sets: a training set $\{(X_i, Y_i)\}_{i \in T}$ and a UQ set $\{(X_i, Y_i)\}_{i \in U}$.

Training models:
Next, we subsample $m$ subsets of the training set $T$. We write $\{T_i\}_{i=1}^{m}$ for these subsets and train $m$ corresponding models $\{F_i\}_{i=1}^{m}$ ($F_i$ is the model obtained by using $T_i$ as training data). The training pipeline can be arbitrary (e.g., the corresponding models could be deep neural networks [22], random forests [7,28], or any other ML model). There is therefore no constraint on the model-producing pipeline: any function estimator relevant to the problem at hand, and any training methodology, can be used. Even the class of these functions could differ across $i \in \{1, \ldots, m\}$. For example, some could be linear regression models (for some labels $i$), some decision trees [8] (for other labels), and some support vector machines [9] (for the remaining labels).
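As an illustration of this step, consider the following hedged, stdlib-only sketch; the 1-nearest-neighbor regressor and all function names are stand-ins of our own choosing, since the paper leaves the model-producing pipeline arbitrary:

```python
import random

def nearest_neighbor_model(train):
    """Return a 1-NN regressor fitted on train = [(x, y), ...].
    This is a stand-in for any ML model; the pipeline is arbitrary."""
    def predict(x):
        return min(train, key=lambda pt: abs(pt[0] - x))[1]
    return predict

def train_models(T, m, frac=0.5, seed=0):
    """Subsample the training set T m times and fit one model per subset
    (|T_i| ~ frac * |T|, matching the random halves used in Example 1)."""
    rng = random.Random(seed)
    size = max(1, int(frac * len(T)))
    models = []
    for _ in range(m):
        T_i = rng.sample(T, size)  # random subset of the training set
        models.append(nearest_neighbor_model(T_i))
    return models
```

For instance, `train_models(data, m=20)` yields 20 predictors, each fitted on a random half of $T$.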

Bootstrapping a set of empirical distributions:
Next, we sort $U$ with respect to the response variable $Y$ and then partition this sorted set into $n$ subsets $\{U_j\}_{j=1}^{n}$. Then, we randomly subsample the sorted set $\{(X_i, Y_i)\}_{i \in U}$ from the $n$ subsets $\{U_j\}_{j=1}^{n}$ and define $n$ corresponding empirical distributions
(2.1) $\mu_j := \frac{1}{|S_j|} \sum_{(X_i, Y_i) \in S_j} \delta_{(X_i, Y_i)}$, where $S_j \subset U_j$ is the random subsample of $U_j$.

Figure 2. Robust Prediction with Confidence Intervals. This plot shows that on a random test set from the California Housing dataset [52], the predictions of 20 different regression trees are robust in the sense that their confidence intervals include the ground truth. There are only a few large confidence intervals, and the rest are relatively tight. This plot shows the value of double randomization, both over the empirical distributions and over the models. The latter is one of the main ideas of the adversarial game-theoretic setup of DTB.

2.4.1. Deterministic Game: Consider the game in which Player I selects an element from $\{F_i\}_{i=1}^{m}$ and Player II selects an element from $\{\mu_j\}_{j=1}^{n}$ (defined as in (2.1)). Let $\mathcal{L}_{i,j} := \mathbb{E}_{(X,Y) \sim \mu_j}\big[E\big(F_i(X), Y\big)\big]$ be the loss function for this game. In that deterministic setting, the optimal mixed strategies of Players I and II tend to concentrate on single elements, i.e., the worst distribution in $\{\mu_j\}_{j=1}^{n}$ and its best response in $\{F_i\}_{i=1}^{m}$. This is due to the fact that, although this game incorporates some degree of uncertainty in the selection of the data generating distribution, it does not incorporate the uncertainty associated with sampling from that distribution. To address this issue, we repeat the proposed game over many random realizations of the $\mu_j$ and use the resulting distribution over optimal mixed strategies for our UQ analysis.
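The sort-partition-subsample step can be sketched as follows (a hedged, stdlib-only illustration: representing each $\mu_j$ as a plain list of samples and the 20% subsample fraction follow the choices described in this paper, but the function names are ours):

```python
import random

def bootstrap_distributions(U, n, frac=0.2, seed=0):
    """Sort the UQ set by the response Y, partition it into n contiguous
    subsets U_j, and subsample each to form n empirical distributions mu_j
    (each represented here simply as a list of (x, y) samples)."""
    rng = random.Random(seed)
    U_sorted = sorted(U, key=lambda pt: pt[1])     # sort by response Y
    chunk = len(U_sorted) // n
    mus = []
    for j in range(n):
        U_j = U_sorted[j * chunk:(j + 1) * chunk]  # j-th sorted subset
        s = max(1, int(frac * len(U_j)))           # combined support ~ frac*|U|
        mus.append(rng.sample(U_j, s))
    return mus
```

Repeating this subsampling over $k \in \{1, \ldots, K\}$ produces the random realizations of the $\mu_j$ used in the randomized game.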
2.5. Randomized Game: Consider the game described in Subsection 2.4.1 and observe that the randomized sampling process generating the $\mu_j$ induces a randomization of the loss matrix $\mathcal{L}$ (with $\mathcal{L}_{i,j}$ a random function of $(i,j)$). Assigning probabilities $p = (p_1, \ldots, p_m)$ over models $\{F_i\}_{i=1}^{m}$ and $q = (q_1, \ldots, q_n)$ over a specific realization of $\{\mu_j\}_{j=1}^{n}$, optimal mixed strategies $(p^{\dagger}, q^{\dagger})$ are identified as solutions of the minimax problem
(2.3) $\min_p \max_q \sum_{i,j} p_i \mathcal{L}_{i,j} q_j$.
The remaining steps of Algorithm 1 are:
(3) Sort $U$ with respect to the response variable $Y$ and then partition it into $n$ subsets $\{U_j\}_{j=1}^{n}$.
(4) For $k \in \{1, \ldots, K\}$:
(a) Generate $n$ randomly subsampled subsets and the corresponding empirical distributions.
(b) Solve the minimax problem $\min_p \max_q \sum_{i,j} p_i \mathcal{L}_{i,j} q_j$, and record the optimal $p^{(k)}$ and the value $v^{(k)}$ of the game.
(5) Report (2.4) $\bar{p} = \frac{1}{K} \sum_{k=1}^{K} p^{(k)}$ as the distribution over models, together with the histogram of the values $v^{(k)}$.
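The loss matrix entering the minimax problem is simply the empirical risk of each model against each bootstrapped distribution; below is a minimal sketch, assuming squared error for the error function $E$ (an illustrative choice; the paper allows any error function):

```python
def loss_matrix(models, mus):
    """L[i][j] = empirical risk of model F_i against the samples of mu_j,
    using squared error as the error function E (an illustrative choice)."""
    L = []
    for f in models:
        row = []
        for mu in mus:
            risk = sum((f(x) - y) ** 2 for x, y in mu) / len(mu)
            row.append(risk)
        L.append(row)
    return L
```

This matrix is recomputed for each of the $K$ realizations of the $\mu_j$ before solving the game.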
Note that the randomization of the loss $\mathcal{L}$ implies that $(p^{\dagger}, q^{\dagger})$ are random distributions. We will now sample from these distributions by solving (2.3) many times (with different random realizations of $\mathcal{L}$). This approach is directly related to Harsanyi's purification over repeated games [26].
2.6. Probability distributions over models: Given $K \ge 1$, let $(p^{(k)}, q^{(k)})$ be $K$ i.i.d. samples from $(p^{\dagger}, q^{\dagger})$ and write $v^{(k)}$ for the corresponding random values of (2.3). Note that the $(p^{(k)}, q^{(k)})$ are obtained by solving (2.3) over $K$ independent realizations of the loss $\mathcal{L}$. We record each optimal $p^{(k)}$ and each value $v^{(k)}$, and use the histogram defined by the $v^{(k)}$ and the empirical average for UQ. From an ML perspective, this approach is robust to distribution shifts. This procedure is presented in Algorithm 1.
2.7. How to Choose $K$, $n$ and the $U_j$: Choosing $K$ is relatively straightforward. Since (2.4) is a Monte Carlo (MC) estimate, its accuracy improves with the number of sample points. The $U_j$ are obtained through random subsampling of $U$ and are chosen to be of equal size $s$ ($|U_j| = s$). The $U_j$ do not overlap. We selected $s$ and $n$ so that the size of the combined support of the $U_j$ is about 20% of the size of $U$ ($ns \approx 0.2|U|$). Note that given $ns \approx 0.2|U|$, the parameterization of our approach can be reduced to $s$, which can be interpreted as a trade-off parameter between robustness (small $s$) and accuracy (large $s$). Recall that robustness and accuracy are conflicting requirements [49].

UQ analysis
3.1. CV and Bayesian inference. While CV provides an estimate of the generalization error, this estimate is not pointwise in the sense that it does not depend on the particular input of the model. Furthermore, CV does not produce any distribution over models. Moreover, while Bayesian inference could be employed to obtain a distribution over models, doing so would not be robust with respect to the selection of the prior [50,51,48,49].

3.2. DTB.
On the other hand, by inducing a probability distribution $p$ over models, DTB defines a robust pointwise distribution over the output space, where the robustness is inherited from the adversarial nature of the game. This distribution depends on the particular choice of the loss function, and the histogram of the $\{v^{(k)}\}_{k=1}^{K}$ provides a robust quantification of the generalization error. Given an input $X$, write
(3.1) $\bar{\rho}(X) := \sum_{i=1}^{m} p_i F_i(X)$
for the average model output and
(3.2) $\sigma(X) := \big(\sum_{i=1}^{m} p_i (F_i(X) - \bar{\rho}(X))^2\big)^{1/2}$
for the standard deviation of the model outputs defined by $p$. Note that $p$ may be distinct from the uniform distribution employed in ensemble modeling.
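The point-wise mean (3.1) and standard deviation (3.2) are weighted moments of the model outputs under $p$; a minimal stdlib sketch (function and variable names are illustrative):

```python
import math

def ensemble_moments(models, p, x):
    """Point-wise prediction mean (3.1) and standard deviation (3.2)
    of the model outputs F_i(x) under the distribution p over models."""
    outputs = [f(x) for f in models]
    mean = sum(pi * o for pi, o in zip(p, outputs))
    var = sum(pi * (o - mean) ** 2 for pi, o in zip(p, outputs))
    return mean, math.sqrt(var)
```

The band `mean - sd` to `mean + sd` is the one-standard-deviation confidence interval used in the figures.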

Examples
We will now illustrate the proposed approach (Algorithm 1) on synthetic and real-world datasets.
Example 2. Consider the univariate function $y = x \sin x$ for $x \in [0, 10]$ (solid dark plots in Figure 3). The dataset is composed of 35 pointwise evaluations of this function. In the left plot of Figure 3, we regress each training subset $T_i$ with a polynomial of degree 4 to produce the corresponding model $F_i$ and use Algorithm 1 with 20 regression polynomials to obtain a probability distribution $p$ over these models. As seen in this plot, the ground truth is close to the prediction mean, defined by (3.1), and it falls between the confidence bounds, defined by (3.2). Although low order polynomials do not lead to highly accurate models, Algorithm 1 provides a robust quantification of generalization errors. Replacing the 20 regression polynomials by regression trees further improves this picture, as illustrated in the right plot of Figure 3. In this case, we fixed the depth of each tree to 5. Note that the confidence intervals are now smaller near the boundaries. Overall, this simple example illustrates the robustness of Algorithm 1 when the data is sparse and non-uniformly distributed, and the models are weak.
Example 3. With this example, we illustrate the robustness of Algorithm 1 on the following datasets: California Housing [52], Electrical Grid Stability [3], Superconductor [25], and Bike Sharing [14]. For the sake of brevity, we call them Housing, Grid, SC, and Bike, respectively. To normalize target ranges, the target variables in Grid, SC, and Bike were multiplied by 100.0, 0.1, and 0.01, respectively.
For contrast, we compare the model distribution obtained from Algorithm 1 (DT models) with the uniform distribution. In all the cases, we take 10 regression trees with a maximum depth of 10. The original data is divided into 80% train-UQ and 20% test sets. The train-UQ set is then fed into Algorithm 1 once with separate $T$ and $U$ sets, and once with $T = U$. The latter is done for the sake of a fair comparison with uniform ensembles. In the algorithm, we take $n = 100$ and $K = 100$. Also, for each $k \in \{1, \ldots, K\}$, we take $sn/|U| \approx 0.2$. We introduce two modes of training: in the first mode, each tree is trained on a randomly selected 0.5% of the train set $T$; in the second mode, each tree is trained on a randomly selected 50% of the train set. The first and second modes are called "weak" and "strong", respectively.
Furthermore, we introduce two comparison metrics: one is the overall mean squared loss on the test set; the other is the maximum mean squared loss over folds, where folds are determined by binning the sorted target variable. In these experiments, we use 100 folds. This second metric is an indicator of the robustness of the final ensemble with respect to distribution shifts.
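The second metric can be sketched as follows (a hedged illustration: we bin the test set by the sorted target and take the worst per-bin mean squared loss; the function name is ours):

```python
def max_fold_loss(preds_and_targets, n_folds=100):
    """Maximum mean squared loss over folds, where folds are contiguous
    bins of the test set sorted by the target variable."""
    data = sorted(preds_and_targets, key=lambda pt: pt[1])  # sort by target
    fold_size = max(1, len(data) // n_folds)
    worst = 0.0
    for start in range(0, len(data), fold_size):
        fold = data[start:start + fold_size]
        mse = sum((p - y) ** 2 for p, y in fold) / len(fold)
        worst = max(worst, mse)
    return worst
```

A large value indicates that the ensemble performs poorly on at least one region of the target range, i.e., that it is fragile under a shift of the testing distribution toward that region.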
We repeat each setup in this experiment 20 times. The results are reported in Tables 1 and 2. The values in parentheses are the ones produced with $T = U$. As seen from Table 2, the ensembles of estimators generated by Algorithm 1 show lower maximum fold losses in nearly all weak cases and most of the strong cases, compared to their uniform averaging counterparts. The difference is more prominent when the models are weakly trained. Having a lower maximum fold loss is an indication of the robustness of Algorithm 1 with respect to distribution shifts, especially when the models are not strong.
In other words, Algorithm 1 shows its best applicability when the epistemic uncertainties are notable. On the other hand, most of the time, the overall loss is lower for uniform averaging of the models; see Table 1.
It is noteworthy that the improvement in the maximum fold loss is larger than the degradation in the overall loss when Algorithm 1 is used. In other words, with a slight sacrifice of overall accuracy, Algorithm 1 makes the models more stable and consistent, with higher precision, in their predictions.
Example 4. In this example, we investigate the effect of $ns/|U|$. Since this ratio is a part of the purification over repeated games [26], we call it the "purification ratio" for brevity. Our purpose is to illustrate the maximum fold loss, and also the overall loss, as a function of the purification ratio, whose values are between 0 and 1. We work with the California Housing [52] dataset, and we take an ensemble of models produced by Algorithm 1 with $n = 100$. In all the cases, we take 10 regression trees with a maximum depth of 10. The original data is divided into 80% train-UQ and 20% test sets. Each tree is trained on a randomly selected 50% of the train set. This time, to estimate the robustness of the algorithm using the maximum fold loss, we use only 20 bins (folds) over the test set. For each purification ratio, we conduct the experiment 100 times, and we plot the mean of each experiment. To make sure enough data points are seen by the estimators, $K$ is taken to be 5 divided by the purification ratio.

Figure 4. Purification Ratio. The top plot shows the maximum fold loss for different purification ratios. The bottom plot shows the overall loss for different purification ratios. Both plots correspond to the test dataset. A weighted local linear regression is used to provide a better picture in both plots. These plots show that the purification ratio should be around 0.15 for this particular dataset to achieve the best balance between the overall loss and the maximum fold loss.
The results are shown in Figure 4. As can be seen from the plots, there is a purification ratio close to 0.15 that produces the minimal maximum fold loss in all the tests. At the same time, the overall loss decreases uniformly with a decrease in the purification ratio. This experiment shows that to capture the best balance between the overall loss and the maximum fold loss, the purification ratio $ns/|U|$ should be around 0.15. This also supports our choice of 0.20 in the previous example.
Example 5. In this example, we investigate the effect of data sparsity on Algorithm 1. As in the previous example, we work with the California Housing [52] dataset. In all the cases, we take 10 regression trees with a maximum depth of 10. The original data is divided into 80% train-UQ and 20% test sets. The train-UQ set is then fed into Algorithm 1 with $T = U$. The latter is done for the sake of a fair comparison with uniform ensembles. We take $n$ and $K$ to both be 100. Also, the purification ratio is taken to be 0.20. The maximum fold loss is calculated over 20 bins of the test set. In this experiment, we vary the data availability by allowing a fixed fraction of the train set to each estimator. We call this variable the "data fraction". Figure 5 shows that for sparse data availability, Algorithm 1's maximum fold loss is consistently lower than that of the uniform averaging method. At the same time, the overall loss is always smaller for uniform averaging. This figure conveys the message that Algorithm 1 is more robust than uniform averaging, specifically when only a small amount of data is available.

Figure 5. Small vs Large Data Test. The top plot shows that Algorithm 1 is robust when the available data is sparse. The bottom plot shows that uniform averaging always provides a more accurate prediction. Both plots show that for large datasets, both approaches converge in robustness and accuracy.

Related Works
5.1. Sources of Uncertainty: From a statistical perspective, uncertainty manifests itself in two general forms.

Aleatoric Uncertainty (Irreducible Uncertainty):
This type of uncertainty is intrinsic to the randomness of the data generating process and is not reducible [11]. It is also known as statistical uncertainty. Aleatoric uncertainty can be categorized into homoskedastic and heteroskedastic categories [12]. Homoskedastic randomness stays constant for all data points, whereas heteroskedastic uncertainty varies across them.

Epistemic Uncertainty (Reducible Uncertainty):
This type of uncertainty stems from incomplete knowledge about the data and can be reduced by collecting more data, incorporating more features, or selecting a better model [11].

5.2. What does UQ mean in ML? Based on the sources of uncertainty enumerated above, we can define UQ as any method that can quantify the individual or combined effects of the aleatoric and epistemic uncertainties. Hence, UQ can have different interpretations in the context of ML and statistical learning theory [15]. In what follows, we survey some of these methods.

UQ as Expected Generalization (Prediction) Error:
In this form, UQ quantifies the expected generalization error of a machine learning algorithm. There are two separate approaches here. One is analytical, used for linear models, and based on the training dataset (in-sample) [40]. The other is based on efficient sample re-use (extra-sample) and can be used for both linear and nonlinear models [15]. The first approach encompasses methods like Mallows' $C_p$ [36,38,35,37] and the Akaike information criterion (AIC) [1]. The second approach uses techniques like cross-validation (CV) [64,65,2,21,67] and bootstrapping [13,6,7,56]. For small to moderately large models, CV methods can be used readily. However, computational costs may impede such approaches for larger models like deep neural networks (DNN).

UQ as Posterior Distribution Variance:
In this form, the machine learning model must be probabilistic, within a Bayesian inference (BI) framework. The posterior distribution of the model parameters can then be exploited to express prediction confidence intervals [18,41]. The major issue here is that the Bayesian approach needs priors that are not necessarily found from observations. For linear models, this type of UQ task is commonly done via the Bayesian information criterion (BIC) [62]. For nonlinear models, it is done by approaches like Monte Carlo (MC) and Markov Chain Monte Carlo (MCMC) [19,17,20,18]. Unfortunately, MCMC methods become computationally expensive for larger models. Variational Bayesian inference (VBI) [23] has been proposed as a possible solution; for example, Bayesian neural networks (BNN) [34,43,23,33,5,61] mainly use VBI. However, the computational costs are still prohibitive [46], MCMC comparisons show the low quality of the posterior approximations [68], and these approaches are sensitive to initial starting values [60].

UQ as Regression Quantiles:
There are methods in which data is used to provide quantile bounds as prediction intervals over regressors in a frequentist sense [30,29]. An extension of this methodology is the method of quantile regression forests (QRF), which estimates the predictions of a regressor in terms of quantile values [39]. Recently, conformal quantile regression (CQR) has combined conformal prediction [54,66,53] with the quantile regression (QR) methodology [39,30,29] to provide confidence intervals by using a "pinball loss" [29] over a train set and then applying conformal prediction over a calibration set [59]. Although this method addresses heteroskedastic data with outliers, it requires a specific loss function.

Deep Learning UQ (DLUQ) Survey:
Deep learning (DL) has arisen as a particularly powerful technology due to its ability to ingest raw data with less preprocessing than classical ML [22,32]. However, DL models are costly at the training phase. Hence, the application of the UQ views expressed in this section is not straightforward. Because of this major issue, deep neural network (DNN) models become over-confident [44]. Ad hoc methods like confidence calibration [24], temperature scaling [27] (using Platt scaling [55,45]), histogram binning [69,70], isotonic regression [70], deep ensembles with Gaussian mixtures [31], and Bayesian binning into quantiles [42] have been introduced to address deep learning UQ. Unfortunately, all these methods are mainly post-processing steps applied to models. The only systematic DLUQ method is the dropout UQ approach [16,63]. Although this method is closely related to the BI of deep Gaussian processes [58,10], it is a limiting-case analysis that could deviate from a Gaussian process for a finite width-depth DNN.

Conclusion
In this article, we have presented a systematic machine learning UQ approach based on decision theory. DTB provides rigorous UQ in a cross-validation setting and produces ML algorithms that are more stable with respect to distribution shifts. The algorithm presented in this paper performs best when there is significant epistemic uncertainty. At the same time, DTB provides pointwise prediction confidence intervals for ML regressors. This framework is not confined to regression problems in a classical setting and can be readily extended to DNNs and classification tasks as well. DTB slightly sacrifices accuracy to achieve high stability for ML estimators.

Figure 1. Train-UQ Partitioning of the Data. The left plot shows the schematic partitioning of the original empirical data into two exclusive train and UQ sets. The middle plot shows how the train set is used to (randomly) train $m$ ML estimators. Here, the training sets are found by subsampling the train dataset $m$ times. The right plot shows the procedure of breaking up the UQ set into $n$ subsets, from which $n$ different empirical distributions can be extracted.

Figure 3. Robust Prediction with Confidence Intervals. The left plot shows the application of Algorithm 1 when the ML estimators are 20 regression polynomials, each of degree 4. The ground truth is shown in solid dark color. The sample points are the dots, and the prediction mean and confidence intervals are shown in red and filled blue, respectively. The right plot uses the setting of the left plot, except that the 20 regressors are regression trees. Irrespective of the type of the regressors, both plots show robust predictions and confidence intervals obtained from only a sparse set of non-uniform samples.

Algorithm 1 Decision Theoretic CV
(1) Randomly separate $\{(X_i, Y_i)\}_{i \in I}$ into mutually exclusive sets $\{(X_i, Y_i)\}_{i \in T}$ and $\{(X_i, Y_i)\}_{i \in U}$.
(2) Generate $m$ randomly subsampled subsets $\{T_i\}_{i=1}^{m}$ of the training set $T$.
(a) Train the models $F_i$ on the subsets $T_i$.

Table 1. Overall Losses Comparison. This table shows the overall loss comparison of 10 regression trees when the ensemble mean is taken by the use of Algorithm 1 versus simple uniform averaging. These are called DT and Uniform, respectively, in the table. The values in parentheses correspond to $T = U$. The bold fonts show better losses in each category: weak vs. strong.

Table 2. Maximum Fold Losses Comparison. This table shows the maximum fold loss comparison of 10 regression trees. DT uses the distribution over models obtained from Algorithm 1, while Uniform refers to the uniform distribution over models. The values in parentheses correspond to $T = U$.