Stagewise Safe Bayesian Optimization with Gaussian Processes
Yanan Sui¹, Vincent Zhuang¹, Joel W. Burdick¹, Yisong Yue¹
Abstract

Enforcing safety is a key aspect of many problems pertaining to sequential decision making under uncertainty, which require the decisions made at every step to be both informative of the optimal decision and also safe. For example, we value both efficacy and comfort in medical therapy, and both efficiency and safety in robotic control. We consider this problem of optimizing an unknown utility function with absolute feedback or preference feedback subject to unknown safety constraints. We develop an efficient safe Bayesian optimization algorithm, STAGEOPT, that separates safe region expansion and utility function maximization into two distinct stages. Compared to existing approaches which interleave expansion and optimization, we show that STAGEOPT is more efficient and naturally applicable to a broader class of problems. We provide theoretical guarantees both for the satisfaction of safety constraints and for convergence to the optimal utility value. We evaluate STAGEOPT on a variety of synthetic experiments as well as in clinical practice. We demonstrate that STAGEOPT is more effective than existing safe optimization approaches, and is able to safely and effectively optimize spinal cord stimulation therapy in our clinical experiments.
1. Introduction
Bayesian optimization is a well-established approach for sequentially optimizing unknown utility functions. By leveraging regularity assumptions such as smoothness and continuity, such techniques offer efficient solutions for a wide range of high-dimensional problem settings such as experimental design and personalization in recommender systems.
¹California Institute of Technology, Pasadena, CA, USA. Correspondence to: Yanan Sui <ysui@caltech.edu>, Vincent Zhuang <vzhuang@caltech.edu>, Joel W. Burdick <jwb@robotics.caltech.edu>, Yisong Yue <yyue@caltech.edu>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).
Many of these applications are also subject to a variety of safety constraints, so that actions cannot be freely chosen from the entire input space. For instance, in safe Bayesian optimization, any chosen action during optimization must be known to be “safe”, regardless of the reward from the utility function. Typically, one is initially given a small region of the decision/action space that is known to be safe, and must iteratively expand the safe action region during optimization (Sui et al., 2015).
A motivating application of our work is a clinical setting, where physicians need to sequentially choose among a large set of therapies (Sui et al., 2017a). The effectiveness and safety of different therapies are initially unknown, and can only be determined through sequential tests starting from some initial set of well-studied therapies. A natural way to explore is to start from therapies similar to these initial ones, since their efficacy and safety would not differ too greatly. By iteratively repeating this process, one can gradually explore the utility and safety landscapes in a safe fashion.
Our contributions. We propose a novel safe Bayesian optimization algorithm, STAGEOPT, to address the challenge of efficiently identifying the total safe region and optimizing the utility function within the safe region. In contrast to previous safe Bayesian optimization work (Sui et al., 2015; Berkenkamp et al., 2016a), which interleaves safe region expansion and optimization, STAGEOPT is a stagewise algorithm which first expands the safe region and then optimizes the utility function. STAGEOPT is well suited to settings in which the safety and utility functions are very different (e.g., temperature vs. gripping force), i.e., lie on different scales or amplitudes. Furthermore, in settings in which the utility and safety functions are measured in different ways, it is natural to have a separate first stage dedicated to safe region expansion. For example, in clinical trials we may wish to spend the first stage querying the patient only about the comfort of the stimulus, as opposed to having to measure utility and comfort simultaneously.
Conceptually, STAGEOPT models the safety function(s) and utility function as sampled functions from different Gaussian processes (GPs), and uses confidence bounds to assess the safety of unexplored decisions. We provide theoretical results for STAGEOPT under the assumptions that (1) the safety and utility functions have bounded norms in the Reproducing Kernel Hilbert Spaces (RKHSs) associated with the GPs, and (2) the safety functions are Lipschitz-continuous, which is guaranteed by many common kernels. We guarantee (with high probability) the convergence of STAGEOPT to the safely reachable optimal decision. In addition to simulation experiments, we apply STAGEOPT to a clinical setting of optimizing spinal cord stimulation for patients with spinal cord injuries. Compared to expert physicians, we find that STAGEOPT explores a larger safe region and finds better stimulation strategies.
2. Related Work
Bayesian optimization methods often model the unknown underlying functions as Gaussian processes (GPs), which are smooth, flexible, nonparametric models (Rasmussen & Williams, 2006). GPs are widely used as a regularity assumption in many Bayesian optimization techniques, since they can easily encode prior knowledge and explicitly model variance.
The fundamental tradeoff between exploration and exploitation in sequential decision problems is commonly formalized as the multi-armed bandit (MAB) problem, introduced by Robbins (1952). In MAB, each decision is associated with a stochastic reward with initially unknown distribution. The goal of a bandit algorithm is to maximize the cumulative reward. In a variant called “best-arm identification” (Audibert et al., 2010), one seeks to identify the decision with highest reward in a minimal number of trials. The problem has been widely studied under a variety of different settings (cf. Bubeck & Cesa-Bianchi (2012) for an overview). Many efficient algorithms build on the methods of upper confidence bounds, proposed in Auer (2002), and Thompson sampling, proposed in Thompson (1933). Their key idea is to use posterior distributions of rewards to implicitly negotiate the explore–exploit tradeoff by optimistic sampling. This idea naturally extends to bandit problems with complex (or even infinite) decision sets under certain regularity conditions on the reward function (Dani et al., 2008; Kleinberg et al., 2008; Bubeck et al., 2008).
In the kernelized setting, several algorithms with theoretical guarantees have been proposed. Srinivas et al. (2010) propose the GP-UCB algorithm, which uses confidence bounds to address bandit problems with a reward function modeled using a Gaussian process. Gotovos et al. (2013) study active sampling for localizing level sets, i.e., finding where the objective crosses a specified threshold. Chowdhury & Gopalan (2017) extend the work of Srinivas et al. (2010) by proving tighter bounds as well as providing guarantees for a GP-based Thompson sampling algorithm. However, none of these algorithms are designed to work with safety constraints, and they often violate them in practice (Sui et al., 2015). There are also algorithms without theoretical guarantees. Gelbart et al. (2014) study a constrained Expected Improvement algorithm for Bayesian optimization with unknown constraints. Hernández-Lobato et al. (2016) consider a general framework for constrained Bayesian optimization using information-based search.
The problem of safe exploration has been considered in control and reinforcement learning (Hans et al., 2008; Gillula & Tomlin, 2011; Garcia & Fernandez, 2012; Turchetta et al., 2016). These methods typically consider the problem of safe exploration in MDPs. They ensure safety by restricting policies to be ergodic with high probability and able to recover from any state visited. The safe optimization problem has also been studied in the bandit/optimization setting, where decisions do not cause state transitions. This restriction leads to simpler algorithms (SAFEOPT) with stronger guarantees (Sui et al., 2015; Berkenkamp et al., 2016a), and fits well with safe sampling problems and applications. There are other safe algorithms (Schreiter et al., 2015; Wu et al., 2016) under different active learning settings. Our work builds upon the SAFEOPT approach, with stronger empirical performance and convergence rates on a broad class of safety functions.
3. Problem Statement

We consider a sequential decision problem in which we seek to optimize an unknown utility function $f : D \to \mathbb{R}$ from noisy evaluations at iteratively chosen sample points $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_t, \dots \in D$. However, we further require that each of these sample points is “safe”: that is, for each of $n$ unknown safety functions $g_i : D \to \mathbb{R}$, the value $g_i(\mathbf{x}_t)$ lies above some threshold $h_i \in \mathbb{R}$. We can formally write our optimization problem as follows:

$$\max_{\mathbf{x} \in D} f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) \ge h_i \ \text{ for } i = 1, \dots, n. \qquad (1)$$
Regularity assumptions. In order to model the utility function and the safety functions, we use Gaussian processes (GPs), which are smooth yet flexible nonparametric models. Equivalently, we assume that $f$ and all $g_i$ have bounded norm in the associated Reproducing Kernel Hilbert Space (RKHS). A GP is fully specified by its mean function $\mu(\mathbf{x})$ and covariance function $k(\mathbf{x}, \mathbf{x}')$; in this work, we assume WLOG that the GP priors have zero mean (i.e., $\mu(\mathbf{x}) = 0$). We further assume that each safety function $g_i$ is $L_i$-Lipschitz continuous with respect to some metric $d$ on $D$. This assumption is quite mild, and is automatically satisfied by many commonly-used kernels (Srinivas et al., 2010; Sui et al., 2015).
Feedback models. We primarily consider noise-perturbed feedback, in which our observations are perturbed by i.i.d. Gaussian noise: for samples at points $A_T = [\mathbf{x}_1 \dots \mathbf{x}_T]^T \subseteq D$, we have $y_t = f(\mathbf{x}_t) + n_t$ where $n_t \sim \mathcal{N}(0, \sigma^2)$. The posterior over $f$ is then also Gaussian, with mean $\mu_T(\mathbf{x})$, covariance $k_T(\mathbf{x}, \mathbf{x}')$, and variance $\sigma_T^2(\mathbf{x})$ that satisfy

$$\mu_T(\mathbf{x}) = \mathbf{k}_T(\mathbf{x})^T (\mathbf{K}_T + \sigma^2 \mathbf{I})^{-1} \mathbf{y}_T,$$
$$k_T(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}, \mathbf{x}') - \mathbf{k}_T(\mathbf{x})^T (\mathbf{K}_T + \sigma^2 \mathbf{I})^{-1} \mathbf{k}_T(\mathbf{x}'),$$
$$\sigma_T^2(\mathbf{x}) = k_T(\mathbf{x}, \mathbf{x}),$$

where $\mathbf{k}_T(\mathbf{x}) = [k(\mathbf{x}_1, \mathbf{x}) \dots k(\mathbf{x}_T, \mathbf{x})]^T$ and $\mathbf{K}_T$ is the positive definite kernel matrix $[k(\mathbf{x}, \mathbf{x}')]_{\mathbf{x}, \mathbf{x}' \in A_T}$.
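To make the posterior update concrete, the following is a minimal NumPy sketch of the three equations above. The RBF kernel and all numeric defaults are illustrative placeholders, not choices made in this paper.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.2):
    """Covariance k(x, x') between rows of A (n x d) and B (m x d)."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_posterior(X_train, y_train, X_test, noise_var=0.0025):
    """Posterior mean mu_T(x) and variance sigma_T^2(x) at the test points."""
    K_T = rbf_kernel(X_train, X_train)              # kernel matrix over A_T
    k_star = rbf_kernel(X_train, X_test)            # columns are k_T(x)
    A = K_T + noise_var * np.eye(len(X_train))      # K_T + sigma^2 I
    mu = k_star.T @ np.linalg.solve(A, y_train)     # k_T(x)^T A^{-1} y_T
    v = np.linalg.solve(A, k_star)
    var = rbf_kernel(X_test, X_test).diagonal() - (k_star * v).sum(axis=0)
    return mu, var
```

Later sketches in this paper's exposition reuse `gp_posterior` as the posterior-update subroutine.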
We also consider the case in which only preference feedback is available for the utility function. This setting is often used to characterize real-world applications that elicit subjective human feedback. One way to formalize the online optimization problem is the dueling bandits problem (Yue et al., 2012; Sui et al., 2017b). In the basic dueling bandits formulation, given two points $\mathbf{x}_1$ and $\mathbf{x}_2$, we stochastically receive binary 0/1 feedback according to a Bernoulli distribution with parameter $\phi(f(\mathbf{x}_1), f(\mathbf{x}_2))$, where $\phi$ is a link function mapping $\mathbb{R} \times \mathbb{R}$ to $[0, 1]$. For example, a common link function is the logit function $\phi(x, y) = (1 + \exp(y - x))^{-1}$.

To our knowledge, there are no existing algorithms for the safe Bayesian dueling bandit setting. Although our proposed algorithm is amenable to the full dueling bandits setting (as discussed later), to compare against existing algorithms, we consider the restricted dueling problem in which at timestep $t$ one receives preference feedback between $\mathbf{x}_t$ and $\mathbf{x}_{t-1}$. The pseudocode for our proposed algorithm under this type of dueling feedback can be found in the appendix.
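As a small illustration of this feedback model, the sketch below draws a binary preference between two points under the logit link; the utility `f` is any callable and all names are hypothetical.

```python
import numpy as np

def logit_link(fx1, fx2):
    """phi(x, y) = (1 + exp(y - x))^{-1}: probability that x1 beats x2."""
    return 1.0 / (1.0 + np.exp(fx2 - fx1))

def duel(f, x1, x2, rng):
    """Sample Bernoulli(phi(f(x1), f(x2))): 1 if x1 is preferred, else 0."""
    return int(rng.random() < logit_link(f(x1), f(x2)))

# e.g., duel(lambda x: -abs(x - 0.3), 0.25, 0.9, np.random.default_rng(0))
```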
Safe optimization. Using a uniform zero-mean prior (as is typical in many Bayesian optimization approaches) does not provide sufficient information to identify any point as safe with high probability. Therefore, we additionally assume that we are given an initial “seed” set of safe decision(s), which we denote as $S_0 \subset D$. Note that given an arbitrary seed set, it is not guaranteed that we will be able to discover the globally optimal decision $\mathbf{x}^* = \operatorname{argmax}_{\mathbf{x} \in D} f(\mathbf{x})$, e.g., if the safe region around $\mathbf{x}^*$ is topologically separate from that of $S_0$. Instead, we can formally define the optimization goal for a given seed via the one-step reachability operator:

$$R_\epsilon(S) := S \cup \bigcap_i \left\{ \mathbf{x} \in D \,\middle|\, \exists\, \mathbf{x}' \in S,\; g_i(\mathbf{x}') - \epsilon - L_i d(\mathbf{x}', \mathbf{x}) \ge h_i \right\},$$

which gives the set of all points that can be established as safe given $\epsilon$-accurate evaluations of the safety functions on $S$. Then, given some finite horizon $T$, we can define the subset of $D$ reachable after $T$ iterations from the initial safe seed set $S_0$ as the following:

$$R_\epsilon^T(S_0) := \underbrace{R_\epsilon(R_\epsilon \dots (R_\epsilon}_{T \text{ times}}(S_0)) \dots).$$

Thus, our optimization goal is $\operatorname{argmax}_{\mathbf{x} \in R_\epsilon^T(S_0)} f(\mathbf{x})$.
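On a finite, discretized domain, the reachability operator can be computed directly. The sketch below is one illustrative implementation; note that it uses the true safety values $g_i$, which the algorithm never observes, so it serves only to define the benchmark quantity $R_\epsilon^T(S_0)$.

```python
import numpy as np

def reach_once(safe_mask, X, g_vals, L, h, eps):
    """One application of R_eps(S) on a finite domain.

    X: (N, d) candidate points; safe_mask: (N,) boolean current set S;
    g_vals: (n, N) true safety values; L, h: per-constraint Lipschitz
    constants and thresholds; eps: tolerance.
    """
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    certified = np.ones(len(X), dtype=bool)
    for gi, Li, hi in zip(g_vals, L, h):
        # exists x' in S with  g_i(x') - eps - L_i d(x', x) >= h_i
        ok = (gi[safe_mask, None] - eps - Li * dist[safe_mask, :]) >= hi
        certified &= ok.any(axis=0)
    return safe_mask | certified   # R_eps(S) = S union (intersection over i)

def reach_T(safe_mask, X, g_vals, L, h, eps, T):
    """R_eps applied T times, i.e., R_eps^T(S_0)."""
    for _ in range(T):
        safe_mask = reach_once(safe_mask, X, g_vals, L, h, eps)
    return safe_mask
```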
4. Algorithm

We now introduce our proposed algorithm, STAGEOPT, for the safe exploration for optimization problem.
Overview. We start with a high-level description of STAGEOPT. STAGEOPT separates the safe optimization problem into two stages: an exploration phase in which the safe region is iteratively expanded, followed by an optimization phase in which Bayesian optimization is performed within the safe region. We assume that our algorithm runs for a fixed number of time steps $T$, that the safe expansion stage has horizon $T_0 < T$, and that the optimization phase is $T_1 = T - T_0$ time steps long.
STAGEOPT models the utility function and the safety functions via Gaussian processes, and leverages their uncertainty in order to safely explore and optimize. In particular, at each iteration $t$, STAGEOPT uses the confidence intervals

$$Q_t^i(\mathbf{x}) := \left[ \mu_{t-1}^i(\mathbf{x}) \pm \beta_t \sigma_{t-1}^i(\mathbf{x}) \right], \qquad (2)$$
where $\beta_t$ is a scalar whose choice will be discussed later. We use superscripts to denote the confidence intervals for the respective safety functions, and we use the superscript $f$ for the utility function. In order to guarantee both safety and progress in safe region expansion, instead of using $Q_t^i$ directly, STAGEOPT uses the confidence intervals $C_t^i$ defined as $C_t^i(\mathbf{x}) := C_{t-1}^i(\mathbf{x}) \cap Q_t^i(\mathbf{x})$, with $C_0^i(\mathbf{x}) = [h_i, \infty)$, so that the $C_t^i$ are sequentially contained in $C_{t-1}^i$ for all $t$. We define the upper and lower bounds of $C_t^i(\mathbf{x})$ to be $u_t^i(\mathbf{x})$ and $\ell_t^i(\mathbf{x})$ respectively, and the width as $w_t^i(\mathbf{x}) = u_t^i(\mathbf{x}) - \ell_t^i(\mathbf{x})$.
.
We defined the optimization goal with respect to a tolerance
parameter
, which can employed as a stopping condition
for the expansion stage. Namely, if the expansion stage
stops at
T
0
under the condition
max
x
∈
G
t
w
t
(
x
)
≤
,
then
the
-Reachable safe region
R
T
0
(
S
0
)
is guaranteed to be
expanded. Similarly, we have a tolerance parameter
ζ
(in
Algorithm 1) to control utility function optimization with
time horizon
T
1
.
Stage One: Safe region expansion. STAGEOPT expands the safe region in the same way as SAFEOPT (Sui et al., 2015; Berkenkamp et al., 2016a). An increasing sequence of safe subsets $S_t \subseteq D$ is computed based on the confidence intervals of the GP posterior:

$$S_t = \bigcap_i \bigcup_{\mathbf{x} \in S_{t-1}} \left\{ \mathbf{x}' \in D \,\middle|\, \ell_t^i(\mathbf{x}) - L_i d(\mathbf{x}, \mathbf{x}') \ge h_i \right\}.$$
At each iteration, STAGEOPT computes a set of expander points $G_t$ (that is, points within the current safe region that are likely to expand the safe region) and picks the expander with the highest predictive uncertainty. In order to define the set $G_t$, we first define the function

$$e_t(\mathbf{x}) := \left| \bigcap_i \left\{ \mathbf{x}' \in D \setminus S_t \,\middle|\, u_t^i(\mathbf{x}) - L_i d(\mathbf{x}, \mathbf{x}') \ge h_i \right\} \right|,$$

which (optimistically) quantifies the potential enlargement of the current safe set after we sample a new decision $\mathbf{x}$. Then, $G_t$ is simply given by $G_t = \{ \mathbf{x} \in S_t : e_t(\mathbf{x}) > 0 \}$. Finally, at each iteration STAGEOPT selects $\mathbf{x}_t$ to be

$$\mathbf{x}_t = \operatorname{argmax}_{\mathbf{x} \in G_t,\, i \in \{1,\dots,n\}} w_t^i(\mathbf{x}).$$
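The following sketch puts the stage-one quantities together on a finite domain, assuming per-constraint lower and upper confidence bound arrays `l` and `u` of shape (n, N) are already available from the GP posteriors; the function names and grid representation are illustrative.

```python
import numpy as np

def expansion_step(X, l, u, L, h, S_prev):
    """One stage-one iteration: returns (S_t, G_t, index of x_t or None)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # S_t: points certified via the lower confidence bound of some x in S_{t-1}
    S = np.ones(len(X), dtype=bool)
    for li, Li, hi in zip(l, L, h):
        S &= ((li[S_prev, None] - Li * dist[S_prev, :]) >= hi).any(axis=0)
    # e_t(x): count of unsafe points an optimistic measurement at x would certify
    e = np.zeros(len(X), dtype=int)
    for j in np.flatnonzero(S):
        cert = ~S
        for ui, Li, hi in zip(u, L, h):
            cert &= (ui[j] - Li * dist[j]) >= hi
        e[j] = int(cert.sum())
    G = S & (e > 0)                                   # expanders
    if not G.any():
        return S, G, None
    w = np.where(G[None, :], u - l, -np.inf)          # widths w_t^i on G_t only
    i_best, j_best = np.unravel_index(np.argmax(w), w.shape)
    return S, G, int(j_best)
```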
Stage Two: Utility optimization. Once the safe region is established, STAGEOPT can use any standard online optimization approach to optimize the utility function within the expanded safe region. For concreteness, we present here the GP-UCB algorithm (Srinivas et al., 2010). For completeness, we present a version of STAGEOPT based on preference-based utility optimization in the appendix. Our theoretical analysis is also predicated on using GP-UCB, since it offers finite-time regret bounds. Formally, at each iteration in this phase, we select the arm $\mathbf{x}_t$ as follows:

$$\mathbf{x}_t = \operatorname{argmax}_{\mathbf{x} \in S_t}\ \mu_{t-1}^f(\mathbf{x}) + \beta_t \sigma_{t-1}^f(\mathbf{x}). \qquad (3)$$
Note that it is possible (though typically unlikely) for the safe region to further expand during this phase.
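Given posterior arrays over the finite domain, the selection rule of Eq. (3) restricted to the safe set is a one-liner; this small helper, assumed by the later end-to-end sketch, is illustrative rather than prescribed by the paper.

```python
import numpy as np

def ucb_select(mu_f, sigma_f, beta_t, safe_mask):
    """argmax over the safe set of mu_{t-1}^f(x) + beta_t * sigma_{t-1}^f(x)."""
    score = np.where(safe_mask, mu_f + beta_t * sigma_f, -np.inf)
    return int(np.argmax(score))
```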
Comparison between SAFEOPT and STAGEOPT. Although STAGEOPT is similar to SAFEOPT in that it constructs confidence intervals and defines the safe region in the same way, there are distinct differences in how these algorithms work. We illustrate the behavior of SAFEOPT and STAGEOPT starting from a common safe seed in Figure 1. Initially, both algorithms select the same points since they use the same definition of safe expansion. However, STAGEOPT selects noticeably better optimization points than SAFEOPT due to its UCB criterion. We leave a more detailed discussion of this behavior for Section 6.

We also re-emphasize that since STAGEOPT separates the safe optimization problem into safe expansion and utility optimization phases, it is much more amenable to a variety of related settings than SAFEOPT. For example, as discussed in detail in the appendix, dueling feedback can easily be incorporated into STAGEOPT: in the dueling setting, one can simply replace GP-UCB in the utility optimization stage with any kernelized dueling-bandit algorithm, such as KERNELSELFSPARRING (Sui et al., 2017b).
Algorithm 1 STAGEOPT

1: Input: sample set $D$, $i \in \{1,\dots,n\}$,
     GP prior for utility function $f$,
     GP priors for safety functions $g_i$,
     Lipschitz constants $L_i$ for $g_i$,
     safe seed set $S_0$,
     safety thresholds $h_i$,
     accuracies $\epsilon$ (for expansion) and $\zeta$ (for optimization).
2: $C_0^i(\mathbf{x}) \leftarrow [h_i, \infty)$, for all $\mathbf{x} \in S_0$
3: $C_0^i(\mathbf{x}) \leftarrow \mathbb{R}$, for all $\mathbf{x} \in D \setminus S_0$
4: $Q_0^i(\mathbf{x}) \leftarrow \mathbb{R}$, for all $\mathbf{x} \in D$
5: $C_0^f(\mathbf{x}) \leftarrow \mathbb{R}$, for all $\mathbf{x} \in D$
6: $Q_0^f(\mathbf{x}) \leftarrow \mathbb{R}$, for all $\mathbf{x} \in D$
7: for $t = 1, \dots, T_0$ do
8:   $C_t^i(\mathbf{x}) \leftarrow C_{t-1}^i(\mathbf{x}) \cap Q_{t-1}^i(\mathbf{x})$
9:   $C_t^f(\mathbf{x}) \leftarrow C_{t-1}^f(\mathbf{x}) \cap Q_{t-1}^f(\mathbf{x})$
10:  $S_t \leftarrow \bigcap_i \bigcup_{\mathbf{x} \in S_{t-1}} \{\mathbf{x}' \in D \mid \ell_t^i(\mathbf{x}) - L_i d(\mathbf{x}, \mathbf{x}') \ge h_i\}$
11:  $G_t \leftarrow \{\mathbf{x} \in S_t \mid e_t(\mathbf{x}) > 0\}$
12:  if $\exists\, \mathbf{x} \in G_t,\ i \in \{1,\dots,n\}$ with $w_t^i(\mathbf{x}) \ge \epsilon$ then
13:    $\mathbf{x}_t \leftarrow \operatorname{argmax}_{\mathbf{x} \in G_t,\, i \in \{1,\dots,n\}} w_t^i(\mathbf{x})$
14:  else
15:    $\mathbf{x}_t \leftarrow \operatorname{argmax}_{\mathbf{x} \in S_t}\ \mu_{t-1}^f(\mathbf{x}) + \beta_t \sigma_{t-1}^f(\mathbf{x})$
16:  end if
17:  $y_{f,t} \leftarrow f(\mathbf{x}_t) + n_{f,t}$
18:  $y_{i,t} \leftarrow g_i(\mathbf{x}_t) + n_{i,t}$
19:  Compute $Q_t^f(\mathbf{x})$ and $Q_t^i(\mathbf{x})$, for all $\mathbf{x} \in S_t$
20: end for
21: for $t = T_0 + 1, \dots, T$ do
22:  $C_t^f(\mathbf{x}) \leftarrow C_{t-1}^f(\mathbf{x}) \cap Q_{t-1}^f(\mathbf{x})$
23:  $\mathbf{x}_t \leftarrow \operatorname{argmax}_{\mathbf{x} \in S_t}\ \mu_{t-1}^f(\mathbf{x}) + \beta_t \sigma_{t-1}^f(\mathbf{x})$
24:  $y_{f,t} \leftarrow f(\mathbf{x}_t) + n_{f,t}$
25:  $y_{i,t} \leftarrow g_i(\mathbf{x}_t) + n_{i,t}$
26:  Compute $Q_t^f(\mathbf{x})$ and $Q_t^i(\mathbf{x})$, for all $\mathbf{x} \in S_t$
27: end for
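For concreteness, a compact end-to-end sketch of the two-stage loop on a finite grid follows, wiring together the helpers sketched earlier (`gp_posterior`, `expansion_step`, `ucb_select`). It uses a fixed `beta` and the plain intervals $Q_t^i$ rather than the intersected intervals $C_t^i$, and evaluates the seed points once at the start; these simplifications, and all constants, are assumptions of this sketch rather than choices prescribed by Algorithm 1.

```python
import numpy as np

def stageopt(X, f_fn, g_fns, L, h, seed_idx, T0, T, beta=3.0, noise_sd=0.05, seed=0):
    """Two-stage loop: expand the safe set for T0 steps, then run GP-UCB."""
    rng = np.random.default_rng(seed)
    idx = list(seed_idx)                           # sample the safe seed(s) first
    y_f = [f_fn(X[j]) + noise_sd * rng.standard_normal() for j in idx]
    y_g = [[g(X[j]) + noise_sd * rng.standard_normal() for j in idx] for g in g_fns]
    S = np.zeros(len(X), dtype=bool)
    S[seed_idx] = True
    for t in range(T):
        # confidence bounds for each safety function from its GP posterior
        l, u = [], []
        for y_i in y_g:
            mu, var = gp_posterior(X[idx], np.array(y_i), X, noise_sd ** 2)
            sd = np.sqrt(np.maximum(var, 0.0))
            l.append(mu - beta * sd)
            u.append(mu + beta * sd)
        optimize = t >= T0
        if not optimize:                           # stage one: expand
            S, G, j = expansion_step(X, np.array(l), np.array(u), L, h, S)
            optimize = j is None                   # no expanders left: fall through
        if optimize:                               # stage two: GP-UCB selection
            mu_f, var_f = gp_posterior(X[idx], np.array(y_f), X, noise_sd ** 2)
            j = ucb_select(mu_f, np.sqrt(np.maximum(var_f, 0.0)), beta, S)
        idx.append(j)                              # observe f and every g_i at x_t
        y_f.append(f_fn(X[j]) + noise_sd * rng.standard_normal())
        for y_i, g in zip(y_g, g_fns):
            y_i.append(g(X[j]) + noise_sd * rng.standard_normal())
    return S, idx
```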
5. Theoretical Results

In this section, we show the effectiveness of STAGEOPT by theoretically bounding its sample complexity for expansion and optimization. The two stages of STAGEOPT are the expansion of the safe region in search of the total safe region, and the optimization within the safe region.

The correctness of STAGEOPT relies on the fact that the classification of the sets $S_t$ and $G_t$ is sound. While this requires that the confidence bounds $C_t$ be conservative, using bounds that are too conservative will slow down the algorithm considerably. The tightness of the confidence bounds is controlled by the parameter $\beta_t$ in Equation 2. This problem of properly tuning confidence bounds for Gaussian processes in the exploration–exploitation trade-off has been studied by Srinivas et al. (2010) and Chowdhury & Gopalan (2017). Those algorithms are designed for the stochastic multi-armed bandit problem on a kernelized input space without safety constraints. However, their choice of confidence bounds can be generalized to our setting for expansion and optimization.
Figure 1. Evolution of GPs in SAFEOPT and STAGEOPT for a fixed safe seed, shown after (a) 2 iterations, (b) 4 iterations, and (c) 10 iterations; dashed lines correspond to the mean and shaded areas to ±2 standard deviations. The first and third rows depict the utility function, and the second and fourth rows depict a single safety function. The utility and safety functions were randomly sampled from a zero-mean GP with a Matérn kernel, and are represented with solid blue lines. The safety threshold is shown as the green line, and safe regions are shown in red. The red markers correspond to safe expansions and the blue markers to maximizations and optimizations. We see that STAGEOPT identifies actions with higher utility than SAFEOPT.
In particular, for our theoretical results to hold it suffices to choose

$$\beta_t = B + \sigma \sqrt{2(\gamma_{t-1} + 1 + \log(1/\delta))}, \qquad (4)$$
where $B$ is a bound on the RKHS norm of $f$, $\delta$ is the allowed failure probability, the observation noise is $\sigma$-sub-Gaussian, and $\gamma_t$ quantifies the effective degrees of freedom associated with the kernel function. Concretely, $\gamma_t = \max_{|A| \le t} I(f; \mathbf{y}_A)$ is the maximal mutual information that can be obtained about the GP prior from $t$ samples.
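Maximizing $I(f; \mathbf{y}_A)$ over all subsets is intractable in general, but by submodularity the greedy choice gives a constant-factor approximation (Srinivas et al., 2010). Below is a hedged sketch that computes this greedy surrogate for $\gamma_t$ from a kernel matrix, together with the resulting $\beta_t$ of Eq. (4); the function names are illustrative.

```python
import numpy as np

def greedy_gamma(K, t, noise_var):
    """Greedy surrogate for gamma_t = max_{|A| <= t} I(f; y_A).

    The information gain of a set A is 0.5 * log det(I + K_A / noise_var);
    the greedy rule adds the point of largest current posterior variance.
    """
    chosen, gain = [], 0.0
    for _ in range(t):
        if chosen:
            Kaa = K[np.ix_(chosen, chosen)] + noise_var * np.eye(len(chosen))
            Kab = K[chosen, :]            # cross-covariances to all points
            var = K.diagonal() - (Kab * np.linalg.solve(Kaa, Kab)).sum(axis=0)
        else:
            var = K.diagonal()
        j = int(np.argmax(var))
        gain += 0.5 * np.log(1.0 + max(var[j], 0.0) / noise_var)
        chosen.append(j)
    return gain

def beta_t(B, sigma, gamma_prev, delta):
    """Eq. (4): beta_t = B + sigma * sqrt(2 (gamma_{t-1} + 1 + log(1/delta)))."""
    return B + sigma * np.sqrt(2.0 * (gamma_prev + 1.0 + np.log(1.0 / delta)))
```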
We present two main theorems for STAGEOPT. Theorem 1 ensures convergence to the reachable safe region in the safe expansion stage. Theorem 2 ensures convergence towards the optimal utility value within the safe region in the utility optimization stage. Both results are finite-time bounds.
Theorem 1. Suppose that each safety function $g_i$, $i \in \{1,\dots,n\}$, satisfies $\|g_i\|_k^2 \le B$ and is $L_i$-Lipschitz-continuous. Also, suppose $S_0 \neq \emptyset$ and $g_i(\mathbf{x}) \ge h_i$ for all $\mathbf{x} \in S_0$. Fix any $\epsilon > 0$ and $\delta \in (0, 1)$. Suppose we run the safe region expansion stage of STAGEOPT with seed set $S_0$, with noise $n_t$ that is $\sigma$-sub-Gaussian, and $\beta_t = B + \sigma\sqrt{2(\gamma_{t-1} + 1 + \log(1/\delta))}$ with the safety function hyperparameters. Let $t^*$ be the smallest positive integer satisfying

$$\frac{t^*}{\beta_{t^*}^2 \gamma_{n t^*}} \ge \frac{C_1 \left( |\bar{R}_0(S_0)| + 1 \right)}{\epsilon^2},$$

where $C_1 = 8 / \log(1 + \sigma^{-2})$. Then, the following jointly hold with probability at least $1 - \delta$:

• $\forall t \ge 1$ and $i \in \{1,\dots,n\}$, $g_i(\mathbf{x}_t) \ge h_i$;

• $\forall t \ge t^*$, the $\epsilon$-reachable safe region $R_\epsilon^{T_0}(S_0)$ is guaranteed to be expanded.
The detailed proof of Theorem 1 is presented in the appendix. In Theorem 1, we count $t$ from the beginning of the expansion stage, and we choose $T_0 = t^*$, with $T_0$ the expansion horizon defined in Section 4. We show that, with high probability, the expansion stage of STAGEOPT guarantees safety and expands the initial safe region $S_0$ to an $\epsilon$-reachable set after at most $t^*$ iterations. The size of $t^*$ depends on the size of the largest safe region $\bar{R}_0(S_0)$, the accuracy parameters $\epsilon$ and $\zeta$, the confidence parameter $\delta$, the complexity bound $B$ on the function, and the parameterization of the GP via $\gamma_t$.
The proof is based on the following idea. Within a stage wherein $S_t$ does not expand, the uncertainty $w_t(\mathbf{x}_t)$ monotonically decreases due to the construction of $G_t$. We prove that the condition $\max_{\mathbf{x} \in G_t} w_t(\mathbf{x}) < \epsilon$ implies one of two possibilities: $S_t$ will expand after the next evaluation, i.e., the reachable region will increase and hence the next stage shall commence; or we have already established all decisions within $\bar{R}_\epsilon(S_0)$ as safe, i.e., $S_t = \bar{R}_\epsilon(S_0)$. To establish the sample complexity, we use a bound on how quickly $w_t(\mathbf{x}_t)$ decreases.
Theorem 2. Suppose the utility function $f$ satisfies $\|f\|_k^2 \le B$, let $\delta \in (0, 1)$, and suppose the noise $n_t$ is $\sigma$-sub-Gaussian. Let $\beta_t = B + \sigma\sqrt{2(\gamma_{t-1} + 1 + \log(1/\delta))}$ with the utility function hyperparameters, and let $T_1$ be the time horizon of the optimization stage. Fix any $\zeta > 0$. Suppose we run the optimization stage of STAGEOPT within the expansion stage safe region $R_\epsilon^{T_0}(S_0)$. Let $Y$ be the smallest positive integer satisfying

$$\frac{4\sqrt{2}}{\sqrt{Y}} \left( B\sqrt{\gamma_Y} + \sigma\sqrt{2\gamma_Y(\gamma_Y + 1 + \log(1/\delta))} \right) \le \zeta.$$

Then with probability at least $1 - \delta$, STAGEOPT finds a $\zeta$-optimal utility value: $f(\hat{\mathbf{x}}^*) \ge f(\mathbf{x}^*) - \zeta$.
The proof of Theorem 2 is presented in the appendix. We count $t$ from the beginning of the optimization stage in Theorem 2, and we choose $T_1 = Y$, with $T_1$ the time horizon of the optimization stage. We prove the existence of an $\epsilon$-optimal decision $\hat{\mathbf{x}}^*$ within the expansion stage safe region.
Discussion. STAGEOPT separates safe region expansion and utility function maximization into two distinct stages.
Figure 2. Comparison between SAFEOPT, STAGEOPT, and constrained EI on synthetic data with one safety function. In this simple setting, both SAFEOPT and STAGEOPT perform similarly. In order to achieve the same level of safety guarantees, constrained EI must be much more careful during exploration, and consequently fails to identify the optimal point.
Theorem 1 guarantees $\epsilon$-optimal expansion in the first stage within time horizon $T_0$. Theorem 2 guarantees a $\zeta$-optimal utility value in the second stage within time horizon $T_1$. Compared to existing approaches which interleave expansion and optimization, STAGEOPT does not require any similarity or comparability between safety and utility. In Section 6 we show empirically that STAGEOPT is more efficient and far more natural for some applications.
6. Experimental Results
We evaluated our algorithm on synthetic data as well as on
a live clinical experiment on spinal cord therapy.
Modified STAGEOPT and SAFEOPT. In real applications, it may be difficult to compute an accurate estimate of the Lipschitz constants, which may have an adverse effect on the definition of the safe region and its expansion dynamics. In these scenarios, one can use a modified version of SAFEOPT that defines safe points using only the GPs (Berkenkamp et al., 2016b). This modification can be directly applied to STAGEOPT as well; for clarity, we state the details here. Under this alternative definition, a point is classified as safe if the lower confidence bound of each of its safety GPs lies above the respective threshold:

$$S_t = \bigcap_i \left\{ \mathbf{x} \in D \,\middle|\, \ell_t^i(\mathbf{x}) \ge h_i \right\}.$$
A safe point is then an expander if an optimistic noiseless measurement of its upper confidence bound results in a previously non-safe point having all of its lower confidence bounds above the respective thresholds:

$$e_t(\mathbf{x}) := \left| \bigcap_i \left\{ \mathbf{x}' \in D \setminus S_t \,\middle|\, \ell^i_{t, u_t^i(\mathbf{x})}(\mathbf{x}') \ge h_i \right\} \right|,$$

where $\ell^i_{t, u_t^i(\mathbf{x})}$ denotes the lower confidence bound recomputed after adding the hypothetical noiseless observation $u_t^i(\mathbf{x})$ at $\mathbf{x}$.
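A sketch of this Lipschitz-free variant follows, reusing `gp_posterior` from Section 3. The optimistic step refits each safety GP with a hallucinated observation $u_t^i(\mathbf{x})$ at $\mathbf{x}$; for simplicity the hallucinated point reuses the observation noise rather than being exactly noiseless, which is an assumption of this sketch.

```python
import numpy as np

def modified_safe_set(l, h):
    """S_t = intersection over i of { x : l_t^i(x) >= h_i }."""
    return np.all(l >= np.asarray(h)[:, None], axis=0)

def modified_expanders(X, idx, y_g, l, u, h, beta, noise_var):
    """Expanders under the GP-only definition (no Lipschitz constants)."""
    S = modified_safe_set(l, h)
    G = np.zeros(len(X), dtype=bool)
    for j in np.flatnonzero(S):
        newly_safe = ~S                   # unsafe points a sample at x_j might certify
        for i, y_i in enumerate(y_g):     # one GP per safety function
            X_aug = np.vstack([X[idx], X[j]])
            y_aug = np.append(y_i, u[i, j])   # hallucinated optimistic measurement
            mu, var = gp_posterior(X_aug, y_aug, X, noise_var)
            l_new = mu - beta * np.sqrt(np.maximum(var, 0.0))
            newly_safe &= l_new >= h[i]
        G[j] = bool(newly_safe.any())
    return S, G
```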
6.1. Synthetic Data
We evaluated on several synthetic settings with various types of safety constraints and feedback.
Figure 3. Results on three synthetic scenarios: (a) reward with one safety function; (b) reward with three safety functions; (c) reward with one safety function and dueling feedback; (d)–(f) the corresponding safe region sizes. The first row corresponds to the reward and the second row to the growth of the safe region sizes (higher is better for both). In both of these metrics, STAGEOPT performs at least as well as SAFEOPT. For clarity, we omit the first 40 iterations for each setting since the algorithms similarly expand the safe region during that phase.
In each setting, the utility function was sampled from a zero-mean GP with a Matérn kernel ($\nu = 1.2$) over the space $D = [0, 1]^2$ uniformly discretized into $25 \times 25$ points. We considered the following safety constraint settings: (i) one safety function $g_1$ sampled from a zero-mean GP with a Matérn kernel with $\nu = 1.2$; (ii) three safety functions $g_1, g_2, g_3$ sampled from zero-mean GPs with length scales 0.2, 0.4, and 0.8.
We set the amplitudes of the safety functions to be 0.1 that of the utility function, and the safety threshold for each safety function $g_i$ to be $\mu_i + \frac{1}{2}\sigma_i$. We define a point $\mathbf{x}$ to be a safe seed if it satisfies $g_i(\mathbf{x}) > \mu_i + \sigma_i$.
We also considered several cases of feedback. For both safety settings, we examined the standard Gaussian noise-perturbed case, with $\sigma^2 = 0.0025$. We also ran experiments with dueling feedback for the first safety setting.
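The setup above can be reproduced in a few lines. The sketch below samples the utility and a single safety function on the grid, reusing `rbf_kernel` from Section 3 as a stand-in for the Matérn ($\nu = 1.2$) kernel; that substitution, and interpreting $\mu_i, \sigma_i$ as sample statistics of the drawn function, are simplifications of this sketch rather than the paper's choices.

```python
import numpy as np

def make_synthetic(seed=0, n_grid=25, length_scale=0.2):
    rng = np.random.default_rng(seed)
    a, b = np.meshgrid(np.linspace(0, 1, n_grid), np.linspace(0, 1, n_grid))
    X = np.c_[a.ravel(), b.ravel()]                  # 25 x 25 grid over [0,1]^2
    K = rbf_kernel(X, X, length_scale) + 1e-8 * np.eye(len(X))
    chol = np.linalg.cholesky(K)
    f = chol @ rng.standard_normal(len(X))           # utility sample
    g = 0.1 * (chol @ rng.standard_normal(len(X)))   # safety, 0.1x the amplitude
    h = g.mean() + 0.5 * g.std()                     # threshold mu_i + sigma_i / 2
    seeds = np.flatnonzero(g > g.mean() + g.std())   # safe-seed candidates
    return X, f, g, h, seeds
```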
Algorithms. As discussed previously, SAFEOPT is the only other known algorithm that has similar guarantees in our setting, and serves as the main competitor to STAGEOPT. In addition, we also compared against the constrained Expected Improvement (CEI) algorithm of Gelbart et al. (2014). Since CEI only guarantees stepwise safety, as opposed to safety over the entire time horizon, we set its safety threshold to $\delta / T$ with $\delta = 0.1$ in order to match our setting. Naturally, with such a stringent threshold, CEI is not very competitive compared to STAGEOPT and SAFEOPT, as seen in Figure 2. In order to adequately distinguish between the latter two algorithms, we omit constrained EI results from all further figures.
Results. In each setting, we randomly sampled 30 combinations of utility and safety functions and ran STAGEOPT and SAFEOPT for $T = 100$ iterations starting from each of 10 randomly sampled safe seeds. For STAGEOPT, we used a dynamic stopping criterion for the safe expansion phase (i.e., for $T_0$): we stop when the safe region plateaus for 10 iterations, hard capped at 80 iterations. In these experiments, we primarily used GP-UCB in the utility optimization phase. We also tried two other common acquisition functions, Expected Improvement (EI) and Maximum Probability of Improvement (MPI). However, we observed similar behavior across all three acquisition functions, since our algorithm quickly identifies the reachable safe region in most scenarios.

In Figure 3, for each setting and algorithm, we plot both the growth of the size of the safe region as well as a notion of reward $r_t = \max_{1 \le i \le t} f(\mathbf{x}_i)$. Although there is some similarity between the performances of the algorithms, it is evident that STAGEOPT grows the safe region at least as fast as SAFEOPT, while also reaching an optimal sample point more quickly.
6.2. Clinical Experiments

We finally applied STAGEOPT to safely optimize clinical spinal cord stimulation in order to help tetraplegic patients regain physical mobility. The goal is to find effective stimulation therapies for patients with severe spinal cord injuries without introducing undesirable side effects. For example, bad stimulations could have negative effects on the rehabilitation and are often painful.
Figure 4. Expansion of the safe region for spinal cord injury therapy. The orange solid line represents the growth of the safe region over time, and the blue dashed line the total size of the input space.

Figure 5. Utilities within the safe region (larger is better). The green dashed line denotes the physician's best choice. The thin blue line shows the utilities of STAGEOPT at each iteration, and the orange solid line is a GP curve fit of these utilities.
This application is easily framed under our problem setting: the chosen configurations must stay above a safety threshold.

A total of 564 therapeutic trials were performed with a tetraplegic patient in gripping experiments over 10 weeks. In each trial, one stimulating pattern was generated by the 32-channel electrode and was fixed within the trial. For a fixed electrode configuration, the stimulation frequency and amplitude were modulated synergistically in order to find those best for effective gripping. A similar setup was studied in (Sui et al., 2017a). We optimized the electrode patterns with preference-based STAGEOPT (see the appendix) and performed exhaustive search over stimulation frequency and amplitude in a narrow range.
Results. Figure 4 shows the stimulating patterns reachable by the algorithm under safety constraints. The physicians are confident that the total safe region had been reached between 300 and 400 iterations. In our experiments, STAGEOPT did not sample any unsafe stimulating patterns. Figure 5 plots the utility measure of the stimulating pattern at each iteration. The orange solid line is a GP curve fit of the STAGEOPT utilities (in thin blue). It clearly exceeds the physician's best choice (dashed green line) after around 400 iterations of online experiments. These results demonstrate the practicality of STAGEOPT for safe optimization in challenging settings, such as those involving live human subjects.
7. Conclusion & Discussion

In this paper, we studied the problem of safe Bayesian optimization, which is well suited to any setting requiring safe online optimization, such as medical therapies, safe recommender systems, and safe robotic control. We proposed a general framework, STAGEOPT, which is able to handle non-comparable safety constraints and utility functions. We provide strong theoretical guarantees for STAGEOPT when the safety functions and utility function are sampled from Gaussian processes. Specifically, we bound the sample complexity needed to achieve an $\epsilon$-safe region and a $\zeta$-optimal utility value within the safe region. The whole sampling process is guaranteed to be safe with high probability.
We compared STAGEOPT with classical Bayesian optimization methods and state-of-the-art safe optimization algorithms. We evaluated multiple cases, including a single safety function, multiple safety functions, real-valued utility, and dueling-feedback utility. Our extensive experiments on synthetic data show that STAGEOPT achieves its theoretical guarantees on safety and optimality. Its performance on safe expansion is among the best, and its utility maximization outperforms the state of the art.
This result also provides an efficient tool for online optimization in safety-critical applications. For instance, we applied STAGEOPT with a dueling-feedback utility function to gripping rehabilitation therapy for tetraplegic patients. Our live clinical experiments demonstrated good performance in a real human experiment. The therapies proposed by STAGEOPT outperform the ones suggested by experienced physicians.
There are many interesting directions for future work. For instance, we assume a static environment that does not evolve in response to the actions taken. In our clinical application, this implies assuming that the patients' condition and response to stimuli do not improve over time. Moving forward, it would be interesting to incorporate dynamics into our setting, which would lead to the multi-criteria safe reinforcement learning setting (Moldovan & Abbeel, 2012; Turchetta et al., 2016; Wachi et al., 2018).
Another interesting direction is developing theoretically rigorous approaches outside of Gaussian processes (GPs). Although highly flexible, GPs require a well-specified prior and kernel in order to be effective. While one could use uninformed priors to model most settings, such priors tend to lead to very slow convergence. One alternative is to automatically learn a good kernel (Wilson et al., 2016). Another approach is to assume a low-dimensional manifold within the high-dimensional uninformed kernel (Djolonga et al., 2013), which could also speed up learning.
Acknowledgements
This research was also supported in part by NSF Awards
#1564330 & #1637598, JPL PDF IAMS100224, a
Bloomberg Data Science Research Grant, and a gift from
Northrop Grumman.
References
Audibert, J.-Y., Bubeck, S., and Munos, R. Best arm identification
in multi-armed bandits. In
Conference on Learning Theory
(COLT)
, 2010.
Auer, P. Using confidence bounds for exploitation-exploration
trade-offs.
Journal of Machine Learning Research (JMLR)
,
2002.
Berkenkamp, F., Krause, A., and Schoellig, A. P. Bayesian opti-
mization with safety constraints: Safe and automatic parameter
tuning in robotics. Technical report, arXiv, February 2016a.
Berkenkamp, F., Schoellig, A. P., and Krause, A. Safe controller
optimization for quadrotors with gaussian processes. In
Proc.
of the International Conference on Robotics and Automation
(ICRA)
, pp. 491–496, May 2016b.
Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and
nonstochastic multi-armed bandit problems.
Foundations and
Trends in Machine Learning
, 5:1–122, 2012.
Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. Online
optimization in X-armed bandits. In
Advances in Neural Infor-
mation Processing Systems (NIPS)
, pp. 201–208, 2008.
Chowdhury, S. R. and Gopalan, A. On kernelized multi-armed
bandits. In
International Conference on Machine Learning
(ICML)
, pp. 844–853, 2017.
Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear op-
timization under bandit feedback. In
Proceedings of the 21st
Annual Conference on Learning Theory (COLT)
, pp. 355–366,
2008.
Djolonga, J., Krause, A., and Cevher, V. High-dimensional gaus-
sian process bandits. In
Advances in Neural Information Pro-
cessing Systems (NIPS)
, pp. 1025–1033, 2013.
Garcia, J. and Fernandez, F. Safe exploration of state and action
spaces in reinforcement learning.
Journal of Machine Learning
Research
, 2012.
Gelbart, M. A., Snoek, J., and Adams, R. P. Bayesian optimization
with unknown constraints.
Uncertainty in Artificial Intelligence
(UAI)
, 2014.
Gillula, J. and Tomlin, C. Guaranteed safe online learning of a
bounded system. In
IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS)
, 2011.
Gotovos, A., Casati, N., Hitz, G., and Krause, A. Active learning
for level set estimation. In
International Joint Conference on
Artificial Intelligence (IJCAI)
, 2013.
Hans, A., Schneegaß, D., Schäfer, A., and Udluft, S. Safe explo-
ration for reinforcement learning. In
ration for reinforcement learning. In
ESANN
, 2008.
Hernández-Lobato, J. M., Gelbart, M. A., Adams, R. P., Hoffman,
M. W., and Ghahramani, Z. A general framework for con-
strained bayesian optimization using information-based search.
Journal of Machine Learning Research
, 2016.
Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in
metric spaces. In
ACM Symposium on Theory of Computing
(STOC)
, pp. 681–690. Association for Computing Machinery,
Inc., May 2008.
Moldovan, T. and Abbeel, P. Safe exploration in markov decision
processes. In
International Conference on Machine Learning
(ICML)
, 2012.
Rasmussen, C. E. and Williams, C. K. I.
Gaussian Processes for
Machine Learning
. MIT Press, 2006.
Robbins, H. Some aspects of the sequential design of experiments.
Bulletin of the American Mathematical Society
, 1952.
Schreiter, J., Nguyen-Tuong, D., Eberts, M., Bischoff, B., Markert,
H., and Toussaint, M. Safe exploration for active learning with
gaussian processes. In
Joint European Conference on Machine
Learning and Knowledge Discovery in Databases
, pp. 133–149.
Springer, 2015.
Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian
process optimization in the bandit setting: No regret and ex-
perimental design. In
International Conference on Machine
Learning (ICML)
, 2010.
Sui, Y., Gotovos, A., Burdick, J. W., and Krause, A. Safe explo-
ration for optimization with gaussian processes. In
International
Conference on Machine Learning (ICML)
, 2015.
Sui, Y., Yue, Y., and Burdick, J. W. Correlational dueling bandits
with application to clinical treatment in large decision spaces. In
International Joint Conference on Artificial Intelligence (IJCAI)
,
2017a.
Sui, Y., Zhuang, V., Burdick, J. W., and Yue, Y. Multi-dueling
bandits with dependent arms. In
Uncertainty in Artificial Intel-
ligence (UAI)
, 2017b.
Thompson, W. R. On the likelihood that one unknown probabil-
ity exceeds another in view of the evidence of two samples.
Biometrika
, 25(3/4):285–294, 1933.
Turchetta, M., Berkenkamp, F., and Krause, A. Safe exploration
in finite markov decision processes with gaussian processes. In
Neural Information Processing Systems (NIPS)
, pp. 4305–4313,
December 2016.
Wachi, A., Sui, Y., Yue, Y., and Ono, M. Safe exploration and
optimization of constrained mdps using gaussian processes. In
AAAI Conference on Artificial Intelligence (AAAI)
, 2018.
Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Deep ker-
nel learning. In
Artificial Intelligence and Statistics (AISTATS)
,
pp. 370–378, 2016.
Wu, Y., Shariff, R., Lattimore, T., and Szepesvári, C. Conservative
bandits. In
International Conference on Machine Learning
(ICML)
, pp. 1254–1262, 2016.
Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. The k-armed
dueling bandits problem.
Journal of Computer and System
Sciences
, 78(5):1538–1556, 2012.