Limits of Probabilistic Safety Guarantees when Considering Human Uncertainty

When autonomous robots interact with humans, such as during autonomous driving, explicit safety guarantees are crucial in order to avoid potentially life-threatening accidents. Many data-driven methods have explored learning probabilistic bounds over human agents' trajectories (i.e. confidence tubes that contain trajectories with probability $1-\delta$), which can then be used to guarantee safety with probability $1-\delta$. However, almost all existing works consider $\delta \geq 0.001$. The purpose of this paper is to argue that (1) in safety-critical applications, it is necessary to provide safety guarantees with $\delta<10^{-8}$, and (2) current learning-based methods are ill-equipped to compute accurate confidence bounds at such low $\delta$. Using human driving data (from the highD dataset), as well as synthetically generated data, we show that current uncertainty models use inaccurate distributional assumptions to describe human behavior and/or require infeasible amounts of data to accurately learn confidence bounds for $\delta \leq 10^{-8}$. These two issues result in unreliable confidence bounds, which can have dangerous implications if deployed on safety-critical systems.


I. INTRODUCTION
Autonomous robots will be increasingly deployed in unstructured human environments (e.g. roads and malls) where they must safely carry out tasks in the presence of other moving human agents. The cost of failure is high in these environments, as safety violations can be life-threatening. At present, safety is often enforced by learning an uncertainty distribution or confidence bounds over the future trajectory of other agents, and designing a controller that is robust to such uncertainty [1]. Based on these learned trajectory distributions, probabilistic safety guarantees can be provided at a specified safety threshold δ over a given planning horizon (e.g. by enforcing chance constraints such that P(collision) ≤ δ) [2]- [5]. However, for such guarantees to hold, it is critical that we accurately predict the uncertainty over other agents' future trajectories with high probability 1 − δ.
Current works that aim to provide probabilistic safety guarantees for autonomous navigation in uncertain, human environments consider safety thresholds in the range δ ≥ 0.001. While such guarantees are important, safety critical applications require δ that are orders of magnitude lower [6].
Suppose a robot/car is guaranteed safe with probability 1 − δ across every 10 s planning horizon. Given δ ≈ 0.001, we could expect a safety violation every 3 hrs. For reference, based on NHTSA data [7], human drivers have an effective safety threshold δ < 10⁻⁷. It is clear then that for safety-critical robotic applications, we must strive for extremely low safety thresholds, on the order of δ ≤ 10⁻⁸. However, this paper argues that current learning-based approaches that model human trajectory uncertainty (a) rely on highly inaccurate distributional assumptions, invalidating the resulting safety guarantees, and/or (b) cannot adequately extend to safety-critical situations. To illustrate this, we applied different uncertainty models (see Table 1) to data of human driving from the highD dataset [8]. We found that even under extremely generous assumptions, learned models are highly inaccurate in capturing human behavior at low δ, often mispredicting the probability of rare events by several orders of magnitude. Furthermore, we show that increasing dataset sizes will not sufficiently improve the accuracy of learned uncertainty models.
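The conversion from a per-horizon threshold to a violation rate is simple arithmetic (a sketch; the helper name is ours, and the 10 s horizon comes from the example above):

```python
def hours_between_violations(delta, horizon_s=10.0):
    # If each planning horizon independently fails with probability
    # delta, the mean number of horizons until a failure is 1/delta.
    return horizon_s / delta / 3600.0

print(hours_between_violations(1e-3))  # ~2.8 hours at delta = 0.001
print(hours_between_violations(1e-7))  # ~2.8e4 hours, roughly human-driver level
```

At δ = 0.001 this gives a violation roughly every 3 hours, matching the estimate above.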
Our results highlight potential danger in utilizing learned models of human uncertainty in safety-critical applications. Fundamental limitations prevent us from accurately learning the probability of rare trajectories with finite data, and using inaccurate confidence bounds can result in unexpected collisions. While this paper focuses on illustrating a crucial problem (rather than providing a solution), we conclude by discussing alternative approaches that can address these limitations by combining (a) learned patterns of behavior and (b) prior knowledge encoding human interaction rules.
Before proceeding, we emphasize three critical points regarding our results:
• We focus on in-distribution error, rather than out-of-distribution error. I.e., we highlight the fundamental inability of uncertainty models to accurately capture distributions at very low δ, regardless of generalization.
• We focus not on robust control algorithms, but rather on the learned uncertainties that such algorithms leverage.
• We distinguish motion predictors from uncertainty models. While recent performance of motion predictors has drastically improved [9], they all leverage an underlying uncertainty model (see Table 1) to capture the probability of uncommon events. E.g. most neural network motion predictors output a Gaussian uncertain prediction. This paper focuses on errors associated with uncertainty models (which propagate to the motion predictors).

II. RELATED WORK
Most recent approaches for guaranteed safe navigation in proximity to humans or their cars approximate uncertainty in human trajectories as a random process (i.e. deviations from a nominal trajectory are drawn i.i.d. from a learned distribution). These uncertainty models capture noise and the effects of latent variables (e.g. intention), and enable probabilistic safety guarantees in uncertain, dynamic environments.

TABLE 1: Minimum safety threshold considered by each class of uncertainty model.
  Gaussian Process [2], [10]-[12]: δ ≥ 0.001
  Gaussian Uncertainty w/ Dynamics [13]-[15]: δ ≥ 0.001
  Bayesian NN [5], [16]: δ ≥ 0.05
  Noisy Rational Model [3]: δ ≥ 0.01
  Hidden Markov Model [17], [18]: δ ≥ 0.01
  Quantile Regression [5]: δ ≥ 0.05
  Scenario Optimization [19]-[21]: δ ≥ 0.01
  Generative Models (e.g. GANs) [9], [22], [23]: N/A

Most models fall into one or more of the following categories:
• Gaussian Process (GP): These approaches model other agents' trajectories as Gaussian processes, which treat trajectory uncertainty as a multivariate Gaussian [2], [11], [12], [24]. There are several extensions, such as the IGP model [25] (which accounts for interaction between multiple agents), or others [26], [27]. However, they all treat uncertainty as a multivariate Gaussian.
• Gaussian Noise with Dynamics Model: These approaches use a dynamics model with additive Gaussian noise; noise can also be added in state observations. This induces a Gaussian distribution over other agents' future trajectory (or a situation where we can do moment-matching) [15], [28].
• Quantile Regression: This approach computes quantile bounds over the trajectories of other agents at a given confidence level, δ. This approach benefits from not assuming an uncertainty distribution over trajectories [5], [29].
• Scenario Optimization: This approach computes a predicted set over other agents' actions based on samples of previously observed scenarios [30]. It is distribution-free (i.e. does not assume a parametric uncertainty distribution) [19]-[21], [31].
[32], [33] do not use scenario optimization, but their work based on computing minimum support sets follows a similar flavor.
• Noisy (i.e. Boltzmann) Rational Model: This model treats the human as a rational actor who takes "noisily optimal" actions according to a distribution in the exponential family, shown in Eq. (7). The uncertainty in the action is captured by this distribution, which relies on an accurate model of the human's value function [1], [3], [34]-[36].
• Generative Models (CVAE, GAN): These models generally learn an implicit distribution over trajectories. Rather than give an explicit distribution, they generate random trajectories that attempt to model the true distribution [9], [22]. However, other works have also utilized the CVAE framework to produce explicit parameterized distributions using a discrete latent space [23].
• Hidden Markov Model (HMM) / Markov Chain: These models capture uncertainty over discrete sets of states/intentions (e.g. goal positions), as opposed to capturing uncertainty over trajectories. Thus, the objective is to infer the other agents' unobserved state/intention (from a discrete set) with very high certainty, 1 − δ [17], [18], [37]-[41].
• Uncertainty Quantifying (UQ) Neural Networks: These approaches do not constitute a separate class of uncertainty models, but refer to methods that train a neural network to capture the distribution over other agents' trajectories [16], [42]-[44]. We list them separately due to their popularity. Most often these networks output a Gaussian distribution or mixture of Gaussians (e.g. Bayesian neural networks [45], deep ensembles [46], Monte-Carlo dropout [47]). These models can also quantify uncertainty over discrete states (i.e. infer the hidden state in HMMs) [48], [49].
Once a predicted trajectory and its uncertainty are learned, many mechanisms exist to guarantee safety (e.g. incorporating uncertainty into chance constraints). In this work, we do not focus on these mechanisms (i.e. robust control algorithms) for guaranteeing safety; rather, we focus on the issue of learning/modeling trajectory uncertainty, which such mechanisms must leverage for their safety guarantees.

III. EXPERIMENT SETUP
The remainder of this paper aims to highlight the limitations of the aforementioned uncertainty models when considering human behavior. We show that the prevalent model classes of uncertainty (see Table 1) fail to capture human behavior at safety-critical thresholds (δ ≤ 10⁻⁸), and exhibit significant errors when evaluated against real-world data. In particular, we test these uncertainty models on real-world driving data from the highD dataset [8], which uses overhead drones to capture vehicle trajectories from human drivers on German highways.
In this section, we detail how we processed the highD dataset to extract important features in order to train/test the different uncertainty models. In the following section, we evaluate the accuracy of these models.

Fig. 1: (Left) In this example, the red car must take into account the blue car's trajectory, and its uncertainty, in its plan to progress safely through the intersection. The dashed yellow curves denote the boundary of a tube that defines the δ confidence bound over trajectories. The white circle depicts a distribution over trajectories. The blue lines are example trajectories. (Right) Simplified illustration of different stages of the control pipeline. While every stage (prediction, planning, tracking) is crucial to guaranteeing safety, this paper focuses exclusively on the yellow box, prediction.

A. Processing Dataset
From the highD dataset, we extract all trajectories of length 10 seconds, τ[0,10] (denoting the agent's position over a 10 second horizon), as well as the corresponding environmental context, E_τ, denoting the presence and position/velocity of surrounding cars. The trajectory and its context are denoted by the tuple (τ[0,10], E_τ). We then split the trajectories/context into a training set, D_train, and a test set, D_test. For a given test trajectory τ, we define the matched set M(τ) as the set of training trajectories with similar environmental context that are ε-close (ε = 2 ft) over their first 2 s.

Every trajectory in the test set, D_test, has equivalent scenarios in the pruned training set, M(D_test), such that we alleviate the issue of out-of-distribution error in learning. For clarity, let us define T_train = M(D_test).
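A minimal sketch of this matching step (the helper name and the 25 Hz sampling rate of highD are our assumptions here, and the environmental-context filter is omitted for brevity):

```python
def matched_set(test_traj, train_trajs, eps=2.0, match_steps=50):
    # Trajectories are lists of (x, y) positions; at 25 Hz, the first
    # 50 steps cover roughly the first 2 s. A training trajectory
    # matches if it stays within eps (ft) of the test trajectory over
    # those steps.
    def close(a, b):
        return all(((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 <= eps
                   for (ax, ay), (bx, by) in zip(a[:match_steps], b[:match_steps]))
    return [tr for tr in train_trajs if close(test_traj, tr)]
```

Applying this to every test trajectory yields the pruned training set T_train = M(D_test) described above.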

B. Training Learned Uncertainty Models
Given our test set, D_test, and pruned training set, T_train ⊆ D_train, we would like to train a given uncertainty model F̂ (e.g. Gaussian) on T_train, and observe how accurately it captures the distribution of trajectories within D_test.
Let us divide a given scenario (τ[0,10], E_τ) into the agent's state x = (τ[0:2], E_τ) and its action a = τ[2:10] (its future trajectory). Since the action is drawn from some unknown distribution over trajectories, a ∼ A(x), our goal is to train a model F̂(x) that accurately approximates A(x), minimizing the error

m( A(x) ‖ F̂(x) ),

where m defines some metric over probability distributions. Clearly we do not know the true distribution A(x), but we can obtain an empirical estimate based on any dataset D. We denote this empirical estimate Â(x, D). Using our pruned training dataset, T_train, we can train our uncertainty model F̂(x) (e.g. Gaussian, quantile, etc.) to minimize the training error

L_train = E_{x ∈ T_train} [ m( Â(x, T_train) ‖ F̂(x) ) ].

Then we can test the uncertainty model F̂(x) on the test dataset D_test, yielding the test error

L_test_seen = E_{x ∈ D_test} [ m( Â(x, D_test) ‖ F̂(x) ) ].

Note that the pruned training set T_train contains data from all states x represented in the test set D_test. This alleviates issues associated with out-of-distribution data, such that L_test_seen captures aleatoric uncertainty (vs. epistemic uncertainty). Because we do not have to consider generalization of our models to unseen (out-of-distribution) states, the following relationship generally holds:

L_test_seen ≤ L_test_unseen,

where L_test_unseen denotes the error on states not represented in the training set. In our analysis, we focus on L_test_seen when measuring the performance of our model F̂. As this ignores the generalization gap (how out-of-distribution examples affect model accuracy), it benchmarks the best potential performance of each model class.
Accounting for replanning: Most motion planning algorithms re-plan their trajectory at some fixed frequency (e.g. 1 Hz). To account for this, we examine prediction error (e.g. violation of the δ-uncertainty bound) only within a short re-planning horizon, which we set to 2 s. I.e. the prediction must be accurate only within this re-planning horizon.
Incorporating conservative assumptions: To further highlight the fundamental limitations of learning uncertainty models of human behavior, we assume that an oracle gives us the target lane of every trajectory (since many prediction algorithms leverage goal inference). Note that our aim is to illustrate the limitations of learned probabilistic models even under ideal conditions. Thus, this strong assumption (though unrealistic) helps us reason about the best-case scenario for each model class, providing an upper bound on performance.
Summarizing, we assume that (a) there is no generalization gap, and (b) we are given the target lane of every trajectory. If the models perform poorly under these extremely generous assumptions, we cannot expect reasonable performance in realistic settings.

IV. RESULTS -ERROR IN UNCERTAINTY MODELS
In this section, we analyze the accuracy of different uncertainty models in capturing the distribution of trajectories in D_test, after being trained on T_train.

A. Gaussian Uncertainty Models
We start by analyzing the popular Gaussian uncertainty model, used in most UQ neural networks [16], Gaussian process models [2], and robust regression [4], [27]. These approaches model the data and its uncertainty with a Gaussian distribution (see the top 3 rows in Table 1). Using the procedure outlined in Section III, we compute the best-fit Gaussian distribution, F̂, over the training trajectories T_train, and observe how well it captures the in-distribution test trajectories in D_test. Figure 2 (K = 1) shows the ratio of observed to expected violations in the test set at each safety threshold, δ. A violation occurs when the test trajectory lies outside the δ-uncertainty bound predicted by F̂ (within a 2 s re-planning horizon) for a specified δ. If the data followed a perfect Gaussian distribution, each curve in Fig. 2 would track the dotted black line (i.e. ratio near 1). If the curve falls below the dotted black line, then the model is overly conservative, and vice versa. We see that while the Gaussian model might be valid for δ ≥ 0.01, it is highly inaccurate outside this range, posing a problem for safety-critical applications.
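This observed-to-expected violation test can be reproduced in miniature on synthetic 1-D data (a stdlib sketch; the mixture distribution and all parameters below are illustrative, not fit to highD):

```python
import random
import statistics

def violation_ratio(train, test, delta):
    # Fit a 1-D Gaussian to training data, then compare the observed
    # fraction of test points outside the two-sided (1 - delta)
    # confidence interval to the expected fraction, delta.
    mu = statistics.fmean(train)
    sd = statistics.pstdev(train)
    z = statistics.NormalDist().inv_cdf(1.0 - delta / 2.0)
    lo, hi = mu - z * sd, mu + z * sd
    observed = sum(x < lo or x > hi for x in test) / len(test)
    return observed / delta

# Mostly-Gaussian data with rare large deviations (a crude stand-in
# for occasional swerves): the fitted Gaussian badly under-predicts
# tail events, so the ratio climbs far above 1 as delta shrinks.
rng = random.Random(0)
data = [rng.gauss(0, 1) + (rng.uniform(-8, 8) if rng.random() < 0.01 else 0.0)
        for _ in range(400_000)]
train, test = data[:200_000], data[200_000:]
print(violation_ratio(train, test, 0.1))   # near 1: model looks fine
print(violation_ratio(train, test, 1e-3))  # well above 1: tails mispredicted
```

The qualitative behavior mirrors Fig. 2: near-accurate at moderate δ, off by a growing factor at small δ.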
Gaussian mixture models (GMM): One might point out that problems with the Gaussian model could be alleviated using GMMs over a discrete set of goals (e.g. left versus right turn). For example, interacting Gaussian processes (IGP) leverage this tool to alleviate the freezing robot problem [25]. However, when we trained GMMs on the same data with different numbers of mixtures (K = 2, ..., 4), prediction performance on test data did not improve for low δ (see Fig. 2). These results illustrate limitations of any Gaussian-based uncertainty model (IGP, GMM, etc.), by highlighting that human behavioral variation is inherently non-Gaussian.

In addition to the issue of inaccurate distributional assumptions, the confidence bounds at level δ ≈ 10⁻⁸ become very large, making planning around these bounds difficult or potentially infeasible. Figure 3 shows the 5σ confidence tube projecting the position of a car forward in time, based on the trained model F̂. The 5σ tube (corresponding to δ ≈ 10⁻⁷) encroaches on each lane, making it difficult for other cars to drive alongside it. This is because, although the car will typically stay in its lane, in rare instances (see Figure 3) it will unexpectedly swerve into the other lane. This illustrates the difficulty of balancing the safety-efficiency tradeoff, as accounting for rare events may be necessary for safety-critical applications but introduces significant conservatism.

To further emphasize the fragility of the Gaussian model at low δ, we generated synthetic 2D data from different, known distributions, and examined how well the best-fit Gaussian predicted violations at a given δ. Even with perfectly i.i.d. training/test data, the error at low δ was significant. Details and results are in Appendix A (found at [50]).

B. Noisy Rational Model
The noisy rational model considers that humans behave approximately optimally with respect to some reward function. It has enabled significant progress in inverse reinforcement learning (IRL) by allowing researchers to learn reward functions from human data [35], and to compute explicit uncertainty intervals over human agents' actions [3]. However, the noisy rational model adopts an underlying model of uncertainty in the exponential family, which places a strong assumption on the shape of the uncertainty distribution and assumes that there is a single "optimal" trajectory:

P(a | x) ∝ exp( β Q_H(x, a) ).   (7)

In our driving scenario, the optimal model simplifies to the Gaussian distribution, since Q_H = −‖x_{t+1} − x̂_{t+1}‖_Σ for some Σ (i.e. we want to best fit the data). As a result, the issues illustrated in Figures 2 and 3 are exactly faced by the noisy rational model (i.e. the shape of the underlying distribution does not match the assumed distribution). Thus, even in the best case (known target lane, no generalization gap), these models are ill-equipped to provide safety guarantees for safety-critical systems (δ < 10⁻⁸).
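To make the reduction explicit, interpret ‖v‖_Σ as the squared Mahalanobis distance vᵀΣ⁻¹v and let β denote the rationality coefficient (both standard conventions, assumed here); then

```latex
P(a \mid x) \;\propto\; \exp\!\big(\beta\, Q_H(x,a)\big)
  \;=\; \exp\!\big(-\beta\,(x_{t+1}-\hat{x}_{t+1})^{\top}\Sigma^{-1}(x_{t+1}-\hat{x}_{t+1})\big),
```

which is an unnormalized Gaussian density in x_{t+1} with mean x̂_{t+1} and covariance Σ/(2β). Fitting the noisy rational model to driving data therefore inherits every failure mode of the Gaussian fit discussed above.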

C. Quantile Regression
Quantile regression is an appealing alternative as it does not require strong assumptions on the underlying uncertainty distribution [5]. It is only concerned with computing tubes such that a 1 − δ proportion of trajectories lies within the tube and a δ proportion lies outside. To demonstrate its performance, we again use the procedure outlined in Section III to train a quantile regression model F̂. The quantile bounds are approximated as the smallest convex tube containing a 1 − δ proportion of trajectories, which optimizes the expected mutual information between the state, x, and action, a [51].

Fig. 4: Prediction error vs. safety threshold, δ, using computed quantile bounds or the Gaussian uncertainty model on the highD dataset (assuming a 2 s re-planning horizon). The dashed black line represents a perfect prediction model.

Figure 4 shows the ratio of the observed to expected number of test trajectories outside each quantile at safety threshold δ. As seen in the plot, the quantile regression model performs much better than the Gaussian model for δ > 0.1. However, performance rapidly deteriorates as δ decreases, making the estimated confidence bounds meaningless, since they fail to predict violation probabilities.
This result makes sense: obtaining accurate quantile bounds at the δ-confidence level relies on splitting the data, with a δ fraction of points falling outside the quantile bound and the rest inside. However, little (if any) data is available outside the quantile bound for very low δ. Put differently, to observe a one-in-a-million event, we would need to see a million trajectories. To reliably predict such events, we would need many more trajectories.
Improving Accuracy with Increasing Data: Given the availability of increasingly large robotics datasets, we should ask whether we could reach good accuracy at desired safety thresholds, δ, by using more data. To answer this, we define the smallest accurate safety threshold, δ_min, as follows:

δ_min = min { δ : | log( expected(δ) / observed(δ) ) | ≤ ε }.   (8)

We set ε = 0.5, where ε represents the vertical distance between each curve in Fig. 4 and the dotted black line. Thus, δ_min represents the smallest δ such that our computed quantile bounds are ε-accurate. Note that δ_min is computed with respect to a given set of data. Therefore, by varying the size of our training set, we capture how δ_min varies with the amount of training data, shown in Fig. 5a. The trend shown in Fig. 5a is surprisingly linear (r² = 0.995), which held across different sections of the dataset (i.e. different highways). This scaling is consistent with the lower bound on sample complexity derived in [52], shown in Eq. (9) and discussed further below.
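Under the empirical trend δ_min ∝ 1/N, the extrapolation amounts to one line of arithmetic (the anchor point below is hypothetical, for illustration only):

```python
def extrapolate_dataset_size(n_obs, delta_obs, delta_target):
    # Under the observed linear trend delta_min ~ c / N, the dataset
    # size needed for a target threshold scales as
    #   N_target = N_obs * (delta_obs / delta_target).
    return n_obs * delta_obs / delta_target

# Hypothetical anchor point: suppose delta_min = 1e-3 is reached with
# 4e4 matched trajectories; then delta_min = 1e-8 would require:
print(f"{extrapolate_dataset_size(4e4, 1e-3, 1e-8):.1e}")  # 4.0e+09 trajectories
```

Each factor-of-10 reduction in δ_min costs a factor of 10 in data, which is what makes δ_min ≤ 10⁻⁸ so far out of reach.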
While initially promising, if we project this linear trend down to δ_min ≈ 10⁻⁸, we find that the amount of data required to reach safety-critical thresholds is far from feasible. Figure 5b shows that we would need trillions of kilometers of driving data to achieve accurate quantile bounds, even under extremely generous assumptions (e.g. perfect generalization). For reference, in 2018, approximately 5 trillion kilometers were driven in total across all cars/trucks in the U.S. [7].
We conducted the same analysis on synthetic 2D data, and found the same trends seen in Figures 4 and 5. Details and results are in Appendix B (found at [50]).
Quantile Regression as a Fundamental Limitation: One might be tempted to conclude from Figure 5a that we should look for alternative methods with better data efficiency. However, all methods providing confidence bounds over trajectories at some safety threshold δ can be fundamentally viewed as classification problems: we must classify a 1 − δ proportion of trajectories as falling within some learned bounds, with the rest outside those bounds. By viewing this as a classification problem, we can leverage results from VC-analysis that lower bound the data, N, required to guarantee a given prediction confidence [52]. To guarantee Pr(error) ≤ δ, we require

N ≥ c ( (1/δ) ln(1/δ) + VCdim(M)/δ ),   (9)

where c is a constant and VCdim(M) is the VC dimension of the utilized model M (see [52] for the proof). The linear trend in Fig. 5 (showing N(δ) ∝ 1/δ_min) fits very nicely with the lower bound (9), given that the second term dominates the first (i.e. we have a large VC dimension). Note that if the first term dominated the second, we would expect worse data scaling.
This suggests alternate methods cannot provide confidence bounds with better data scaling than shown in Figure 5.

D. Other Uncertainty Models
Due to space constraints, we discuss the remaining models in Appendix C (found at [50]), including: (1) generative models, (2) scenario optimization, and (3) hidden Markov models (HMM). However, below, we briefly describe fundamental problems each of these models faces:
• Scenario optimization, similar to quantile regression, requires far too much data to be feasible for small δ. Even with 40,000 trajectories in equivalent scenarios, we only reach δ ≈ 10⁻⁴.
• Generative models implicitly learn the distribution A(x) = p(a|x). While promising, it has been shown, empirically and theoretically, that they can fail to learn the true distribution, even when their training objective nears optimality [53]. Also, using state-of-the-art models [9], [22], it would require at least a day to generate enough trajectories to certify safety with sufficiently high confidence.
• HMMs are distinct as they learn probabilities over discrete states (e.g. goal positions). However, in the Appendix we show that even with a known observation function, P(obs|state), it is highly unlikely to obtain sufficient confidence (δ ≈ 10⁻⁸) of being in a given state.
Note on UQ Neural Networks: We have not discussed UQ neural networks, because neural networks do not constitute a distinct class of uncertainty models. Instead, they only provide a functional representation of the uncertainty in a given class (e.g. UQ neural networks typically output a Gaussian distribution). Our results highlight best-case performance bounds for each class of uncertainty model, given an optimal fit to the data. Thus, using neural networks to parameterize the model uncertainty will only yield worse performance.

V. CONCLUSION AND FUTURE WORK
Our main message is that even under extremely generous assumptions, current models of human uncertainty are unable to extend safety guarantees to the confidence levels, e.g. δ < 10⁻⁸, that are needed for widespread adoption of safety-critical autonomy in human environments.
Learned uncertainty distributions become highly inaccurate at low δ, undermining any claimed guarantees of safety.
There is a fundamental limitation to modeling human uncertainty purely as a random process. Data-driven methods (i.e. machine learning) are designed to capture prominent patterns in data, not to predict rare events. Intuitively, we need a million samples just to observe a one-in-a-million event. While it is possible that huge datasets could eventually enable accurate prediction of rare events, our analysis shows that such amounts of data are infeasible in the near future.
Human uncertainty vs. sensor-based uncertainties: Even if a system must be certified safe with δ = 10 −8 , it is uncommon to require any single module to have a failure probability less than 10 −8 . Instead, redundancy with multiple, independent modules can help certify system safety. The key is that the modules must be independent. While this may be a fair assumption for sensing uncertainty, it is not fair for human behavior prediction.
Future Work: A promising solution to guarantee safety at low δ is to utilize prior knowledge about human behavior; in particular, humans obey interaction rules (e.g. signaling intent) [54], which bound uncertainty in useful ways and can be encoded in assume-guarantee contracts [55]. A contract might encode that an agent cannot mislead others about its intention, assuming that others do not mislead it. For example, an agent cannot first pretend to yield to a merging vehicle, before speeding up to hit it. We thus propose trading one challenge for another: rather than learning uncertainty bounds that agents obey with probability 1 − δ, we should aim to specify interpretable contracts (i.e. behavioral constraints) with learned components that agents must surely obey. We believe such a framework is necessary to move away from treating uncertainty in human behavior purely as a random process. Instead, human uncertainty can be constrained by combining learned components that predict expected actions with prior knowledge restricting the danger of rare events in a rigorous, interpretable manner.

APPENDIX A: GAUSSIAN UNCERTAINTY MODEL (SYNTHETIC DATA)

We further tested the Gaussian uncertainty model on a synthetic 2D data set, using the same process detailed in Section 3A. Each 2D data point is analogous to a trajectory, a = τ[2:10] ∼ A, in the highD driving dataset. Therefore, the goal is still to learn the model F̂ that best matches the data distribution A, minimizing m(A ‖ F̂). However, using synthetic data allows us to test the accuracy of the uncertainty model with respect to a known underlying probability distribution, A.
We randomly generated 10,000 2D points for training data (further increasing the amount of training data did not improve performance) from 3 different distributions: (a) perfect Gaussian, (b) Gaussian with uniform noise (magnitude of noise was 30% of the data range), and (c) Gaussian with symmetric non-uniform noise (also 30% magnitude). For each of these training datasets, we computed the Gaussian uncertainty model that best fit the data. We then generated 10,000,000 2D points for our test data following the exact same distribution as the training data, and observed how well our computed Gaussian uncertainty model captured the test data.

Fig. 6: Prediction error vs. safety threshold, δ, using a Gaussian uncertainty model on synthetic 2D data generated from 3 different distributions. The dashed black line represents a perfect prediction model. Significant prediction error arises when the underlying data distribution is non-Gaussian.

Figure 6 shows that the learned uncertainty model performed very well when the underlying data distribution was Gaussian (blue curve). However, it performed poorly (off by an order of magnitude) at low δ when the underlying distribution was non-Gaussian. When the underlying distribution was Gaussian with added uniform noise (orange curve), the observed violations were much lower than the expected violations (i.e. the model was conservative). This is good for safety, but would clearly lead to overly conservative behavior, especially since the model is off by orders of magnitude.
However, more concerning is the case when the underlying distribution is Gaussian with non-uniform noise (green curve). In this case, the observed violations were much higher than the expected violations (greater by an order of magnitude), posing a clear risk for safety-critical applications. This reinforces our results in Section 3A by illustrating that significant prediction error inevitably arises, regardless of the amount of training data, when the underlying data distribution is non-Gaussian.
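The conservative case (b) has a simple 1-D stdlib analogue: adding bounded uniform noise to a Gaussian produces lighter tails than the best-fit Gaussian implies, so observed violations undershoot expectations at small δ (all parameters below are illustrative, not those of the 2D experiment):

```python
import random
import statistics

def observed_expected_ratio(train, test, delta):
    # Fit a 1-D Gaussian, then compare observed vs. expected tail mass
    # outside the two-sided (1 - delta) confidence interval.
    mu = statistics.fmean(train)
    sd = statistics.pstdev(train)
    z = statistics.NormalDist().inv_cdf(1.0 - delta / 2.0)
    observed = sum(abs(x - mu) > z * sd for x in test) / len(test)
    return observed / delta

# Gaussian plus always-on bounded uniform noise: the fitted Gaussian's
# variance absorbs the noise, so its tails are heavier than the data's.
rng = random.Random(1)
data = [rng.gauss(0, 1) + rng.uniform(-3, 3) for _ in range(400_000)]
train, test = data[:200_000], data[200_000:]
print(observed_expected_ratio(train, test, 1e-4))  # well below 1 (conservative)
```

Swapping the bounded noise for rare large deviations flips the ratio to far above 1, the dangerous direction seen with curve (c).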

APPENDIX B: QUANTILE REGRESSION (SYNTHETIC DATA)
We repeated the quantile regression experiments from Section 3C, using synthetic 2D data rather than real-world driving data. This allowed us to observe how well the uncertainty model performed under ideal conditions, when the training/testing data were perfectly i.i.d. We randomly generated 1,000,000 2D training data points (analogous to 1,000,000 trajectories) following a Gaussian distribution, and computed δ-quantile bounds following the same procedure described in Section 3C (i.e. computing the smallest convex set containing a 1 − δ proportion of points). We then generated 10,000,000 2D test data points following the exact same distribution as the training data, and observed how well our computed quantile bounds captured the test data. The quantile bounds were accurate at larger δ (see Fig. 7). However, performance rapidly deteriorated as δ decreased, meaning the model failed to accurately predict violation probabilities at those safety thresholds. This is consistent with our results in Section 3C.
Using the synthetic data, we computed the smallest accurate safety threshold, δ_min, as a function of the amount of training data, N. This threshold δ_min was defined as follows:

δ_min = min { δ : | log( expected(δ) / observed(δ) ) | ≤ ε },   (10)

where we set ε = 0.5, which represents the vertical distance between the blue curve in Fig. 7 and the dotted black line. Therefore, δ_min represents the smallest δ such that our computed quantile bounds are ε-accurate (as described in Section 3C). Figure 8(a) shows the same inverse linear trend (δ_min ∝ 1/N) on the synthetic data that was seen with the real driving data. Figure 8(b) shows the extrapolation of this trend towards lower δ_min. This result reinforces the point made in Section 3C that quantile regression can be very accurate for larger δ, but it may not be feasible to collect enough data to reach safety thresholds δ_min ≤ 10⁻⁸.
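Eq. (10) can be evaluated directly from a table of expected vs. observed violation rates; a sketch with hypothetical rates that are accurate down to δ ≈ 10⁻⁴ and then break down:

```python
import math

def delta_min(deltas, observed_rates, eps=0.5):
    # Smallest delta whose observed violation rate is within a factor
    # of 10**eps of the expected rate (Eq. (10), with eps = 0.5).
    ok = [d for d, obs in zip(deltas, observed_rates)
          if obs > 0 and abs(math.log10(d / obs)) <= eps]
    return min(ok) if ok else None

# Hypothetical measured rates: accurate down to ~1e-4, then the
# quantile bounds stop tracking the true violation probability.
deltas = [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
rates = [1.1e-2, 0.9e-3, 1.8e-4, 8e-5, 5e-5]
print(delta_min(deltas, rates))  # 0.0001
```

Re-running this with growing training sets is exactly how the δ_min-vs-N curves in Fig. 8 are built.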

A. Generative Models
Generative models have garnered significant interest in trajectory prediction for their ability to implicitly learn the distribution A(x) = p(a|x). However, there are two significant issues with these approaches, the first of which is the time required to utilize these models in safety-critical situations. For example, at best, a single prediction takes ≈ 0.05 s with Social-GAN [9]. In order to guarantee safety at threshold δ = 0.01, we would need to generate at least 100 trajectories, taking > 5 s. To guarantee safety at δ = 10⁻⁸, we would need to generate at least 10⁸ trajectories, taking > 5,000,000 s (> 1 month), which is not suitable for real-time operation. While computational cost will surely decrease over time, it is unclear whether this modeling approach will be feasible in the near future.
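The sampling-time arithmetic is straightforward (0.05 s per sample is the Social-GAN figure quoted above; the helper name is ours):

```python
def certification_time_s(delta, secs_per_sample=0.05):
    # To resolve events of probability delta by sampling, we need at
    # least 1/delta samples, so the wall-clock time scales as 1/delta.
    return secs_per_sample / delta

print(certification_time_s(1e-2))  # ~5 seconds
print(certification_time_s(1e-8))  # ~5e6 seconds (roughly two months)
```

Even a 1000x speedup in sampling would still leave δ = 10⁻⁸ far outside any real-time re-planning budget.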
More importantly, there are no guarantees that the uncertainty distribution implicitly captured by generative models provides any reasonable approximation to the true uncertainty distribution. It has been shown, both empirically and theoretically, that GANs can fail to learn the true distribution (suffering from "mode collapse"), even when their training objective nears optimality [53]. Furthermore, the theoretical data efficiency bound described by Eq. (9) suggests that the implicit distribution learned by such models will be inaccurate (at the safety thresholds we are considering) without currently infeasible amounts of data.

B. Scenario Optimization Model
Scenario Optimization is an appealing approach because (like quantile regression) it does not assume an underlying distribution over the data [21]. It relies only on the assumption that the data is drawn i.i.d. from some fixed (unknown) distribution. Therefore, we can obtain a high-confidence bound on the probability that a new trajectory is inside or outside a computed tube, without strong assumptions on the underlying distribution.
With this approach, the safety threshold δ is a direct function of the amount of observed data [30]; in other words, δ = δ(N), where N is the number of training trajectories or "samples". Therefore, we cannot set arbitrarily small safety thresholds (e.g. δ = 10^-8). While this prevents users from applying the approach inappropriately, it requires very large amounts of data to reach confidence levels low enough for safety-critical applications. For example, with 40,000 trajectories we were able to reach δ ≈ 10^-4 (beyond this point, computational feasibility became an issue). This suggests it is not feasible to reach the desired δ levels with realistic datasets.
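For intuition on why δ(N) shrinks only linearly in N, the classical scenario-optimization bound of Campi and Garatti relates δ to N through a binomial tail. The sketch below solves that bound for the smallest certifiable δ; the number of support constraints d and the confidence parameter β are illustrative choices, not the exact setup used in our experiments:

```python
import math

def scenario_delta(n: int, d: int, beta: float = 1e-6) -> float:
    """Smallest delta such that sum_{i<d} C(n,i) delta^i (1-delta)^(n-i) <= beta,
    i.e. the violation level certified by n i.i.d. scenarios with confidence
    1 - beta (classical scenario bound; d = number of support constraints)."""
    log_beta = math.log(beta)

    def log_tail(delta: float) -> float:
        # Log of the binomial tail, computed in log-space for numerical stability.
        logs = [
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(delta) + (n - i) * math.log1p(-delta)
            for i in range(d)
        ]
        m = max(logs)
        return m + math.log(sum(math.exp(l - m) for l in logs))

    # The tail is decreasing in delta, so bisect for the smallest certifiable delta.
    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if log_tail(mid) <= log_beta:
            hi = mid
        else:
            lo = mid
    return hi

for n in (4_000, 40_000, 400_000):
    print(f"N = {n:>7d}: delta(N) ≈ {scenario_delta(n, d=10):.1e}")
```

Under these assumed constants, reaching δ = 10^-8 would require on the order of 10^9 trajectories, since δ(N) decreases only as 1/N.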
Using the highD dataset and treating the trajectories in the training set as observed samples, we obtained high-confidence bounds (computed as the convex hull of the training trajectories) such that new trajectories should lie within those bounds with probability at least 1 − δ. For example, Figure 9 shows the predicted confidence bounds for two representative driving instances; in Fig. 9a the goal position is not given, while in Fig. 9b the goal position is given. The scenario optimization approach predicts that the future trajectory of each car should fall within the blue confidence bounds at 2, 4, and 6 seconds in the future with 98.5% (Fig. 9a) or 95.1% (Fig. 9b) probability.
To test the accuracy of the computed confidence bounds, we examined how often trajectories in the highD test set actually remained within those bounds. The ratio of observed to expected violations was smaller than one (i.e. the method was conservative), which is reassuring for safety. Specifically, the observed vs. expected percentage of violations was approximately 5% vs. 14%.

Fig. 9: Plot of confidence bounds over the car's future trajectory. The car's positional history is shown by the red circles, and training data is taken from equivalent scenarios in the highD dataset. (a) The goal position of the car is not known. We compute a 98.5% probability that a new trajectory falls within the blue confidence bounds at 2, 4, and 6 seconds in the future. (b) The target position of the car is known. We compute a 95.1% probability that a new trajectory falls within the blue confidence bounds at 2, 4, and 6 seconds in the future.
However, the safety threshold δ(N) was always large (δ ∈ [0.02, 0.41]) and could not be set arbitrarily, which makes the scenario optimization approach currently inapplicable to many safety-critical applications. This is consistent with our conclusion in Section 3C that much more data is necessary to obtain reliable probabilistic bounds.

C. Hidden Markov Models
Rather than reasoning about uncertainty only over trajectories, many methods in the POMDP literature reason about uncertainty over discrete intentions. Most often, these discrete intentions denote different goal positions for the agent, but they can also denote different operational modes (e.g. yield vs. no yield). Hidden Markov models enable us to compute an agent's most likely intention, which proves useful in many challenging problems. However, when guaranteeing safety at safety threshold δ, the intention must be correctly inferred with probability 1 − δ. Issues arise when the intention must be inferred with very high confidence, i.e. when δ ≤ 10^-8.
Fig. 10: Synthetic data is generated from two different modes (mode 1: blue; mode 2: red). (a) Data for each mode generated from a Gaussian distribution. (b) Data for each mode generated from a uniform distribution. The confidence intervals below the data denote where a point would have to lie in order to be classified, with confidence 1 − δ for δ = 10^-8, as coming from either mode 1 or mode 2. For example, if a new point falls in the interval covered by the blue bar, it can be classified as coming from mode 1 with confidence at least 1 − δ. If it falls anywhere in the gray interval, we cannot conclude its mode (assuming a uniform prior).
We demonstrate this on a 1D toy problem with synthetic data. We generated 1000 i.i.d. data points from two distinct distributions (mode 1 and mode 2), and computed the best-fit Gaussian for each of these distributions. Note that our results did not change when increasing the amount of data. We then computed the intervals in which a new point would have to lie in order for us to classify it into either mode 1 or mode 2 with confidence 1 − δ. This was done by applying Bayes' rule, assuming a uniform prior over the modes,

P(mode | x) = P(x | mode) P(mode) / P(x).    (11)

Figure 10 shows these intervals when the points were generated from either a Gaussian distribution or a uniform distribution. The interval covered by the gray line denotes the region in which we cannot classify a point's mode with confidence 1 − δ. We note that the gray line extends across a significant portion of the data range, but is reasonable when the underlying distribution of points in each mode is perfectly Gaussian. However, when the generated data is uniformly random, the uncertainty interval stretches across the entire range of the data. This suggests that inferring intentions or hidden "modes" under uncertainty will often be infeasible at very low safety thresholds δ, especially since we have shown that human behavioral variation is non-Gaussian. Furthermore, we cannot compensate for this non-Gaussian variation because we do not have accurate knowledge of the true distribution.
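The Gaussian case of this toy computation can be reproduced with a short script. The mode means and scales below are illustrative choices, not the exact values used to generate Fig. 10:

```python
import numpy as np

rng = np.random.default_rng(2)
delta = 1e-8

# Two hypothetical 1D modes; the means and scales are illustrative.
mode1 = rng.normal(-2.0, 1.0, 1000)
mode2 = rng.normal(+2.0, 1.0, 1000)

def gauss_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# Best-fit Gaussian for each mode, then the posterior under a uniform prior,
# as in Eq. (11): P(mode1 | x) = p(x | mode1) / (p(x | mode1) + p(x | mode2)).
fits = [(m.mean(), m.std()) for m in (mode1, mode2)]
xs = np.linspace(-8, 8, 4001)
log_odds = gauss_logpdf(xs, *fits[0]) - gauss_logpdf(xs, *fits[1])
post1 = 1.0 / (1.0 + np.exp(-log_odds))  # P(mode1 | x)

# Gray region: neither mode can be claimed with confidence 1 - delta.
ambiguous = (post1 < 1 - delta) & (post1 > delta)
print(f"ambiguous interval ≈ [{xs[ambiguous].min():.2f}, {xs[ambiguous].max():.2f}]")
```

Even with well-separated Gaussian modes, requiring 1 − δ confidence at δ = 10^-8 leaves an ambiguous interval several standard deviations wide, covering most of the data range.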