Stagewise Safe Bayesian Optimization with Gaussian Processes
Yanan Sui¹, Vincent Zhuang¹, Joel W. Burdick¹, Yisong Yue¹
Abstract

Enforcing safety is a key aspect of many problems pertaining to sequential decision making under uncertainty, which require the decisions made at every step to be both informative of the optimal decision and also safe. For example, we value both efficacy and comfort in medical therapy, and both efficiency and safety in robotic control. We consider this problem of optimizing an unknown utility function with absolute feedback or preference feedback subject to unknown safety constraints. We develop an efficient safe Bayesian optimization algorithm, STAGEOPT, that separates safe region expansion and utility function maximization into two distinct stages. Compared to existing approaches which interleave expansion and optimization, we show that STAGEOPT is more efficient and naturally applicable to a broader class of problems. We provide theoretical guarantees both for the satisfaction of safety constraints and for convergence to the optimal utility value. We evaluate STAGEOPT on a variety of synthetic experiments as well as in clinical practice. We demonstrate that STAGEOPT is more effective than existing safe optimization approaches, and is able to safely and effectively optimize spinal cord stimulation therapy in our clinical experiments.
1. Introduction
Bayesian optimization is a well-established approach for sequentially optimizing unknown utility functions. By leveraging regularity assumptions such as smoothness and continuity, such techniques offer efficient solutions for a wide range of high-dimensional problem settings such as experimental design and personalization in recommender systems.
¹California Institute of Technology, Pasadena, CA, USA. Correspondence to: Yanan Sui <ysui@caltech.edu>, Vincent Zhuang <vzhuang@caltech.edu>, Joel W. Burdick <jwb@robotics.caltech.edu>, Yisong Yue <yyue@caltech.edu>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).
Many of these applications are also subject to a variety of safety constraints, so that actions cannot be freely chosen from the entire input space. For instance, in safe Bayesian optimization, any chosen action during optimization must be known to be “safe”, regardless of the reward from the utility function. Typically, one is initially given a small region of the decision/action space that is known to be safe, and must iteratively expand the safe action region during optimization (Sui et al., 2015).
A motivating application of our work is a clinical setting, where physicians need to sequentially choose among a large set of therapies (Sui et al., 2017a). The effectiveness and safety of different therapies are initially unknown, and can only be determined through sequential tests starting from some initial set of well-studied therapies. A natural way to explore is to start from therapies similar to these initial ones, since their efficacy and safety would not differ too greatly. By iteratively repeating this process, one can gradually explore the utility and safety landscapes in a safe fashion.
Our contributions. We propose a novel safe Bayesian optimization algorithm, STAGEOPT, to address the challenge of efficiently identifying the total safe region and optimizing the utility function within the safe region. In contrast to previous safe Bayesian optimization work (Sui et al., 2015; Berkenkamp et al., 2016a), which interleaves safe region expansion and optimization, STAGEOPT is a stagewise algorithm which first expands the safe region and then optimizes the utility function. STAGEOPT is well suited to settings in which the safety and utility functions are very different (e.g., temperature vs. gripping force), i.e., lie on different scales or amplitudes. Furthermore, in settings in which the utility and safety functions are measured in different ways, it is natural to have a separate first stage dedicated to safe region expansion. For example, in clinical trials we may wish to spend the first stage querying the patient only about the comfort of the stimulus, as opposed to having to measure utility and comfort simultaneously.
Conceptually, STAGEOPT models the safety function(s) and utility function as sampled functions from different Gaussian processes (GPs), and uses confidence bounds to assess the safety of unexplored decisions. We provide theoretical results for STAGEOPT under the assumptions that (1) the safety and utility functions have bounded norms in the Reproducing Kernel Hilbert Spaces (RKHSs) associated with the GPs, and (2) the safety functions are Lipschitz-continuous, which is guaranteed by many common kernels. We guarantee (with high probability) the convergence of STAGEOPT to the safely reachable optimal decision. In addition to simulation experiments, we apply STAGEOPT to a clinical setting of optimizing spinal cord stimulation for patients with spinal cord injuries. Compared to expert physicians, we find that STAGEOPT explores a larger safe region and finds better stimulation strategies.
2. Related Work
Bayesian optimization methods often model the unknown underlying functions as Gaussian processes (GPs), which are smooth, flexible, nonparametric models (Rasmussen & Williams, 2006). GPs are widely used as a regularity assumption in many Bayesian optimization techniques, since they can easily encode prior knowledge and explicitly model variance.
The fundamental tradeoff between exploration and exploitation in sequential decision problems is commonly formalized as the multi-armed bandit (MAB) problem, introduced by Robbins (1952). In MAB, each decision is associated with a stochastic reward with initially unknown distribution. The goal of a bandit algorithm is to maximize the cumulative reward. In a variant called “best-arm identification” (Audibert et al., 2010), one seeks to identify the decision with highest reward in a minimal number of trials. The problem has been widely studied under a variety of different settings (cf. Bubeck & Cesa-Bianchi (2012) for an overview). Many efficient algorithms build on the methods of upper confidence bounds, proposed in Auer (2002), and Thompson sampling, proposed in Thompson (1933). Their key idea is to use posterior distributions of rewards to implicitly negotiate the explore–exploit tradeoff by optimistic sampling. This idea naturally extends to bandit problems with complex (or even infinite) decision sets under certain regularity conditions on the reward function (Dani et al., 2008; Kleinberg et al., 2008; Bubeck et al., 2008).
In the kernelized setting, several algorithms with theoretical guarantees have been proposed. Srinivas et al. (2010) propose the GP-UCB algorithm, which uses confidence bounds to address bandit problems with a reward function modeled using a Gaussian process. Gotovos et al. (2013) study active sampling for localizing level sets, i.e., finding where the objective crosses a specified threshold. Chowdhury & Gopalan (2017) extend the work of Srinivas et al. (2010) by proving tighter bounds as well as providing guarantees for a GP-based Thompson sampling algorithm. However, none of these algorithms are designed to work with safety constraints, and they often violate them in practice (Sui et al., 2015). There are also algorithms without theoretical guarantees. Gelbart et al. (2014) study a constrained Expected Improvement algorithm for Bayesian optimization with unknown constraints. Hernández-Lobato et al. (2016) consider a general framework for constrained Bayesian optimization using information-based search.
The problem of safe exploration has been considered in control and reinforcement learning (Hans et al., 2008; Gillula & Tomlin, 2011; Garcia & Fernandez, 2012; Turchetta et al., 2016). These methods typically consider the problem of safe exploration in MDPs. They ensure safety by restricting policies to be ergodic with high probability and able to recover from any state visited. The safe optimization problem has also been studied in the bandit/optimization setting, where decisions do not cause state transitions. This restriction leads to simpler algorithms (SAFEOPT) with stronger guarantees (Sui et al., 2015; Berkenkamp et al., 2016a), and fits well with safe sampling problems and applications. There are other safe algorithms (Schreiter et al., 2015; Wu et al., 2016) under different active learning settings. Our work builds upon the SAFEOPT approach, with stronger empirical performance and convergence rates on a broad class of safety functions.
3. Problem Statement

We consider a sequential decision problem in which we seek to optimize an unknown utility function $f : D \to \mathbb{R}$ from noisy evaluations at iteratively chosen sample points $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_t, \dots \in D$. However, we further require that each of these sample points is “safe”: that is, for each of $n$ unknown safety functions $g_i : D \to \mathbb{R}$, the value $g_i(\mathbf{x}_t)$ lies above some threshold $h_i \in \mathbb{R}$. We can formally write our optimization problem as follows:

$$\max_{\mathbf{x} \in D} f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) \ge h_i \ \text{ for } i = 1, \dots, n. \qquad (1)$$
Regularity assumptions. In order to model the utility function and the safety functions, we use Gaussian processes (GPs), which are smooth yet flexible nonparametric models. Equivalently, we assume that $f$ and all $g_i$ have bounded norm in the associated Reproducing Kernel Hilbert Space (RKHS). A GP is fully specified by its mean function $\mu(\mathbf{x})$ and covariance function $k(\mathbf{x}, \mathbf{x}')$; in this work, we assume WLOG that the GP priors have zero mean (i.e., $\mu(\mathbf{x}) = 0$). We further assume that each safety function $g_i$ is $L_i$-Lipschitz continuous with respect to some metric $d$ on $D$. This assumption is quite mild, and is automatically satisfied by many commonly-used kernels (Srinivas et al., 2010; Sui et al., 2015).
Feedback models. We primarily consider noise-perturbed feedback, in which our observations are perturbed by i.i.d. Gaussian noise: for samples at points $A_T = [\mathbf{x}_1 \dots \mathbf{x}_T]^T \subseteq D$, we have $y_t = f(\mathbf{x}_t) + n_t$ where $n_t \sim \mathcal{N}(0, \sigma^2)$. The posterior over $f$ is then also Gaussian, with mean $\mu_T(\mathbf{x})$, covariance $k_T(\mathbf{x}, \mathbf{x}')$, and variance $\sigma_T^2(\mathbf{x})$ that satisfy

$$\mu_T(\mathbf{x}) = \mathbf{k}_T(\mathbf{x})^T (\mathbf{K}_T + \sigma^2 \mathbf{I})^{-1} \mathbf{y}_T,$$
$$k_T(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}, \mathbf{x}') - \mathbf{k}_T(\mathbf{x})^T (\mathbf{K}_T + \sigma^2 \mathbf{I})^{-1} \mathbf{k}_T(\mathbf{x}'),$$
$$\sigma_T^2(\mathbf{x}) = k_T(\mathbf{x}, \mathbf{x}),$$

where $\mathbf{k}_T(\mathbf{x}) = [k(\mathbf{x}_1, \mathbf{x}) \dots k(\mathbf{x}_T, \mathbf{x})]^T$ and $\mathbf{K}_T$ is the positive definite kernel matrix $[k(\mathbf{x}, \mathbf{x}')]_{\mathbf{x}, \mathbf{x}' \in A_T}$.
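To make the posterior update concrete, the following is a minimal NumPy sketch of the three equations above. The RBF kernel and all numeric defaults are illustrative placeholders, not choices made in this paper.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.2):
    """Covariance k(x, x') between rows of A (n x d) and B (m x d)."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_posterior(X_train, y_train, X_test, noise_var=0.0025):
    """Posterior mean mu_T(x) and variance sigma_T^2(x) at the test points."""
    K_T = rbf_kernel(X_train, X_train)              # kernel matrix over A_T
    k_star = rbf_kernel(X_train, X_test)            # columns are k_T(x)
    A = K_T + noise_var * np.eye(len(X_train))      # K_T + sigma^2 I
    mu = k_star.T @ np.linalg.solve(A, y_train)     # k_T(x)^T A^{-1} y_T
    v = np.linalg.solve(A, k_star)
    var = rbf_kernel(X_test, X_test).diagonal() - (k_star * v).sum(axis=0)
    return mu, var
```

Later sketches in this paper's exposition reuse `gp_posterior` as the posterior-update subroutine.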
We also consider the case in which only preference feedback is available for the utility function. This setting is often used to characterize real-world applications that elicit subjective human feedback. One way to formalize the online optimization problem is the dueling bandits problem (Yue et al., 2012; Sui et al., 2017b). In the basic dueling bandits formulation, given two points $\mathbf{x}_1$ and $\mathbf{x}_2$, we stochastically receive binary 0/1 feedback according to a Bernoulli distribution with parameter $\phi(f(\mathbf{x}_1), f(\mathbf{x}_2))$, where $\phi$ is a link function mapping $\mathbb{R} \times \mathbb{R}$ to $[0, 1]$. For example, a common link function is the logit function $\phi(x, y) = (1 + \exp(y - x))^{-1}$.

To our knowledge, there are no existing algorithms for the safe Bayesian dueling bandit setting. Although our proposed algorithm is amenable to the full dueling bandits setting (as discussed later), to compare against existing algorithms, we consider the restricted dueling problem in which at timestep $t$ one receives preference feedback between $\mathbf{x}_t$ and $\mathbf{x}_{t-1}$. The pseudocode for our proposed algorithm under this type of dueling feedback can be found in the appendix.
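As a small illustration of this feedback model, the sketch below draws a binary preference between two points under the logit link; the utility `f` is any callable and all names are hypothetical.

```python
import numpy as np

def logit_link(fx1, fx2):
    """phi(x, y) = (1 + exp(y - x))^{-1}: probability that x1 beats x2."""
    return 1.0 / (1.0 + np.exp(fx2 - fx1))

def duel(f, x1, x2, rng):
    """Sample Bernoulli(phi(f(x1), f(x2))): 1 if x1 is preferred, else 0."""
    return int(rng.random() < logit_link(f(x1), f(x2)))

# e.g., duel(lambda x: -abs(x - 0.3), 0.25, 0.9, np.random.default_rng(0))
```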
Safe optimization. Using a uniform zero-mean prior (as is typical in many Bayesian optimization approaches) does not provide sufficient information to identify any point as safe with high probability. Therefore, we additionally assume that we are given an initial “seed” set of safe decision(s), which we denote as $S_0 \subset D$. Note that given an arbitrary seed set, it is not guaranteed that we will be able to discover the globally optimal decision $\mathbf{x}^* = \operatorname{argmax}_{\mathbf{x} \in D} f(\mathbf{x})$, e.g., if the safe region around $\mathbf{x}^*$ is topologically separate from that of $S_0$. Instead, we can formally define the optimization goal for a given seed via the one-step reachability operator:

$$R_\epsilon(S) := S \cup \bigcap_i \left\{ \mathbf{x} \in D \,\middle|\, \exists\, \mathbf{x}' \in S,\; g_i(\mathbf{x}') - \epsilon - L_i d(\mathbf{x}', \mathbf{x}) \ge h_i \right\},$$

which gives the set of all points that can be established as safe given $\epsilon$-accurate evaluations of the safety functions on $S$. Then, given some finite horizon $T$, we can define the subset of $D$ reachable after $T$ iterations from the initial safe seed set $S_0$ as the following:

$$R_\epsilon^T(S_0) := \underbrace{R_\epsilon(R_\epsilon \dots (R_\epsilon}_{T \text{ times}}(S_0)) \dots).$$

Thus, our optimization goal is $\operatorname{argmax}_{\mathbf{x} \in R_\epsilon^T(S_0)} f(\mathbf{x})$.
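On a finite, discretized domain, the reachability operator can be computed directly. The sketch below is one illustrative implementation; note that it uses the true safety values $g_i$, which the algorithm never observes, so it serves only to define the benchmark quantity $R_\epsilon^T(S_0)$.

```python
import numpy as np

def reach_once(safe_mask, X, g_vals, L, h, eps):
    """One application of R_eps(S) on a finite domain.

    X: (N, d) candidate points; safe_mask: (N,) boolean current set S;
    g_vals: (n, N) true safety values; L, h: per-constraint Lipschitz
    constants and thresholds; eps: tolerance.
    """
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    certified = np.ones(len(X), dtype=bool)
    for gi, Li, hi in zip(g_vals, L, h):
        # exists x' in S with  g_i(x') - eps - L_i d(x', x) >= h_i
        ok = (gi[safe_mask, None] - eps - Li * dist[safe_mask, :]) >= hi
        certified &= ok.any(axis=0)
    return safe_mask | certified   # R_eps(S) = S union (intersection over i)

def reach_T(safe_mask, X, g_vals, L, h, eps, T):
    """R_eps applied T times, i.e., R_eps^T(S_0)."""
    for _ in range(T):
        safe_mask = reach_once(safe_mask, X, g_vals, L, h, eps)
    return safe_mask
```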
4. Algorithm

We now introduce our proposed algorithm, STAGEOPT, for the safe exploration for optimization problem.
Overview. We start with a high-level description of STAGEOPT. STAGEOPT separates the safe optimization problem into two stages: an exploration phase in which the safe region is iteratively expanded, followed by an optimization phase in which Bayesian optimization is performed within the safe region. We assume that our algorithm runs for a fixed number of time steps $T$, that the safe expansion stage has horizon $T_0 < T$, and that the optimization phase is $T_1 = T - T_0$ time steps long.
STAGEOPT models the utility function and the safety functions via Gaussian processes, and leverages their uncertainty in order to safely explore and optimize. In particular, at each iteration $t$, STAGEOPT uses the confidence intervals

$$Q_t^i(\mathbf{x}) := \left[ \mu_{t-1}^i(\mathbf{x}) \pm \beta_t \sigma_{t-1}^i(\mathbf{x}) \right], \qquad (2)$$
where $\beta_t$ is a scalar whose choice will be discussed later. We use superscripts to denote the confidence intervals for the respective safety functions, and we use the superscript $f$ for the utility function. In order to guarantee both safety and progress in safe region expansion, instead of using $Q_t^i$ directly, STAGEOPT uses the confidence intervals $C_t^i$ defined as $C_t^i(\mathbf{x}) := C_{t-1}^i(\mathbf{x}) \cap Q_t^i(\mathbf{x})$, with $C_0^i(\mathbf{x}) = [h_i, \infty)$, so that the $C_t^i$ are sequentially contained in $C_{t-1}^i$ for all $t$. We define the upper and lower bounds of $C_t^i(\mathbf{x})$ to be $u_t^i(\mathbf{x})$ and $\ell_t^i(\mathbf{x})$ respectively, and the width as $w_t^i(\mathbf{x}) = u_t^i(\mathbf{x}) - \ell_t^i(\mathbf{x})$.
.
We defined the optimization goal with respect to a tolerance
parameter
, which can employed as a stopping condition
for the expansion stage. Namely, if the expansion stage
stops at
T
0
under the condition
max
x
∈
G
t
w
t
(
x
)
≤
,
then
the
-Reachable safe region
R
T
0
(
S
0
)
is guaranteed to be
expanded. Similarly, we have a tolerance parameter
ζ
(in
Algorithm 1) to control utility function optimization with
time horizon
T
1
.
Stage One: Safe region expansion. STAGEOPT expands the safe region in the same way as SAFEOPT (Sui et al., 2015; Berkenkamp et al., 2016a). An increasing sequence of safe subsets $S_t \subseteq D$ is computed based on the confidence intervals of the GP posterior:

$$S_t = \bigcap_i \bigcup_{\mathbf{x} \in S_{t-1}} \left\{ \mathbf{x}' \in D \,\middle|\, \ell_t^i(\mathbf{x}) - L_i d(\mathbf{x}, \mathbf{x}') \ge h_i \right\}.$$
At each iteration, STAGEOPT computes a set of expander points $G_t$ (that is, points within the current safe region that are likely to expand the safe region) and picks the expander with the highest predictive uncertainty. In order to define the set $G_t$, we first define the function

$$e_t(\mathbf{x}) := \left| \bigcap_i \left\{ \mathbf{x}' \in D \setminus S_t \,\middle|\, u_t^i(\mathbf{x}) - L_i d(\mathbf{x}, \mathbf{x}') \ge h_i \right\} \right|,$$

which (optimistically) quantifies the potential enlargement of the current safe set after we sample a new decision $\mathbf{x}$. Then, $G_t$ is simply given by $G_t = \{ \mathbf{x} \in S_t : e_t(\mathbf{x}) > 0 \}$. Finally, at each iteration STAGEOPT selects $\mathbf{x}_t$ to be

$$\mathbf{x}_t = \operatorname{argmax}_{\mathbf{x} \in G_t,\, i \in \{1,\dots,n\}} w_t^i(\mathbf{x}).$$
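The following sketch puts the stage-one quantities together on a finite domain, assuming per-constraint lower and upper confidence bound arrays `l` and `u` of shape (n, N) are already available from the GP posteriors; the function names and grid representation are illustrative.

```python
import numpy as np

def expansion_step(X, l, u, L, h, S_prev):
    """One stage-one iteration: returns (S_t, G_t, index of x_t or None)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # S_t: points certified via the lower confidence bound of some x in S_{t-1}
    S = np.ones(len(X), dtype=bool)
    for li, Li, hi in zip(l, L, h):
        S &= ((li[S_prev, None] - Li * dist[S_prev, :]) >= hi).any(axis=0)
    # e_t(x): count of unsafe points an optimistic measurement at x would certify
    e = np.zeros(len(X), dtype=int)
    for j in np.flatnonzero(S):
        cert = ~S
        for ui, Li, hi in zip(u, L, h):
            cert &= (ui[j] - Li * dist[j]) >= hi
        e[j] = int(cert.sum())
    G = S & (e > 0)                                   # expanders
    if not G.any():
        return S, G, None
    w = np.where(G[None, :], u - l, -np.inf)          # widths w_t^i on G_t only
    i_best, j_best = np.unravel_index(np.argmax(w), w.shape)
    return S, G, int(j_best)
```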
Stage Two: Utility optimization. Once the safe region is established, STAGEOPT can use any standard online optimization approach to optimize the utility function within the expanded safe region. For concreteness, we present here the GP-UCB algorithm (Srinivas et al., 2010). For completeness, we present a version of STAGEOPT based on preference-based utility optimization in the appendix. Our theoretical analysis is also predicated on using GP-UCB, since it offers finite-time regret bounds. Formally, at each iteration in this phase, we select the arm $\mathbf{x}_t$ as follows:

$$\mathbf{x}_t = \operatorname{argmax}_{\mathbf{x} \in S_t}\ \mu_{t-1}^f(\mathbf{x}) + \beta_t \sigma_{t-1}^f(\mathbf{x}). \qquad (3)$$
Note that it is possible (though typically unlikely) for the safe region to further expand during this phase.
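Given posterior arrays over the finite domain, the selection rule of Eq. (3) restricted to the safe set is a one-liner; this small helper, assumed by the later end-to-end sketch, is illustrative rather than prescribed by the paper.

```python
import numpy as np

def ucb_select(mu_f, sigma_f, beta_t, safe_mask):
    """argmax over the safe set of mu_{t-1}^f(x) + beta_t * sigma_{t-1}^f(x)."""
    score = np.where(safe_mask, mu_f + beta_t * sigma_f, -np.inf)
    return int(np.argmax(score))
```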
Comparison between SAFEOPT and STAGEOPT. Although STAGEOPT is similar to SAFEOPT in that it constructs confidence intervals and defines the safe region in the same way, there are distinct differences in how these algorithms work. We illustrate the behavior of SAFEOPT and STAGEOPT starting from a common safe seed in Figure 1. Initially, both algorithms select the same points since they use the same definition of safe expansion. However, STAGEOPT selects noticeably better optimization points than SAFEOPT due to its UCB criterion. We leave a more detailed discussion of this behavior for Section 6.

We also re-emphasize that since STAGEOPT separates the safe optimization problem into safe expansion and utility optimization phases, it is much more amenable to a variety of related settings than SAFEOPT. For example, as discussed in detail in the appendix, dueling feedback can easily be incorporated into STAGEOPT: in the dueling setting, one can simply replace GP-UCB in the utility optimization stage with any kernelized dueling-bandit algorithm, such as KERNELSELFSPARRING (Sui et al., 2017b).
Algorithm 1 STAGEOPT

1: Input: sample set $D$, $i \in \{1,\dots,n\}$,
     GP prior for utility function $f$,
     GP priors for safety functions $g_i$,
     Lipschitz constants $L_i$ for $g_i$,
     safe seed set $S_0$,
     safety thresholds $h_i$,
     accuracies $\epsilon$ (for expansion) and $\zeta$ (for optimization).
2: $C_0^i(\mathbf{x}) \leftarrow [h_i, \infty)$, for all $\mathbf{x} \in S_0$
3: $C_0^i(\mathbf{x}) \leftarrow \mathbb{R}$, for all $\mathbf{x} \in D \setminus S_0$
4: $Q_0^i(\mathbf{x}) \leftarrow \mathbb{R}$, for all $\mathbf{x} \in D$
5: $C_0^f(\mathbf{x}) \leftarrow \mathbb{R}$, for all $\mathbf{x} \in D$
6: $Q_0^f(\mathbf{x}) \leftarrow \mathbb{R}$, for all $\mathbf{x} \in D$
7: for $t = 1, \dots, T_0$ do
8:   $C_t^i(\mathbf{x}) \leftarrow C_{t-1}^i(\mathbf{x}) \cap Q_{t-1}^i(\mathbf{x})$
9:   $C_t^f(\mathbf{x}) \leftarrow C_{t-1}^f(\mathbf{x}) \cap Q_{t-1}^f(\mathbf{x})$
10:  $S_t \leftarrow \bigcap_i \bigcup_{\mathbf{x} \in S_{t-1}} \{\mathbf{x}' \in D \mid \ell_t^i(\mathbf{x}) - L_i d(\mathbf{x}, \mathbf{x}') \ge h_i\}$
11:  $G_t \leftarrow \{\mathbf{x} \in S_t \mid e_t(\mathbf{x}) > 0\}$
12:  if $\exists\, \mathbf{x} \in G_t,\ i \in \{1,\dots,n\}$ with $w_t^i(\mathbf{x}) \ge \epsilon$ then
13:    $\mathbf{x}_t \leftarrow \operatorname{argmax}_{\mathbf{x} \in G_t,\, i \in \{1,\dots,n\}} w_t^i(\mathbf{x})$
14:  else
15:    $\mathbf{x}_t \leftarrow \operatorname{argmax}_{\mathbf{x} \in S_t}\ \mu_{t-1}^f(\mathbf{x}) + \beta_t \sigma_{t-1}^f(\mathbf{x})$
16:  end if
17:  $y_{f,t} \leftarrow f(\mathbf{x}_t) + n_{f,t}$
18:  $y_{i,t} \leftarrow g_i(\mathbf{x}_t) + n_{i,t}$
19:  Compute $Q_t^f(\mathbf{x})$ and $Q_t^i(\mathbf{x})$, for all $\mathbf{x} \in S_t$
20: end for
21: for $t = T_0 + 1, \dots, T$ do
22:  $C_t^f(\mathbf{x}) \leftarrow C_{t-1}^f(\mathbf{x}) \cap Q_{t-1}^f(\mathbf{x})$
23:  $\mathbf{x}_t \leftarrow \operatorname{argmax}_{\mathbf{x} \in S_t}\ \mu_{t-1}^f(\mathbf{x}) + \beta_t \sigma_{t-1}^f(\mathbf{x})$
24:  $y_{f,t} \leftarrow f(\mathbf{x}_t) + n_{f,t}$
25:  $y_{i,t} \leftarrow g_i(\mathbf{x}_t) + n_{i,t}$
26:  Compute $Q_t^f(\mathbf{x})$ and $Q_t^i(\mathbf{x})$, for all $\mathbf{x} \in S_t$
27: end for
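For concreteness, a compact end-to-end sketch of the two-stage loop on a finite grid follows, wiring together the helpers sketched earlier (`gp_posterior`, `expansion_step`, `ucb_select`). It uses a fixed `beta` and the plain intervals $Q_t^i$ rather than the intersected intervals $C_t^i$, and evaluates the seed points once at the start; these simplifications, and all constants, are assumptions of this sketch rather than choices prescribed by Algorithm 1.

```python
import numpy as np

def stageopt(X, f_fn, g_fns, L, h, seed_idx, T0, T, beta=3.0, noise_sd=0.05, seed=0):
    """Two-stage loop: expand the safe set for T0 steps, then run GP-UCB."""
    rng = np.random.default_rng(seed)
    idx = list(seed_idx)                           # sample the safe seed(s) first
    y_f = [f_fn(X[j]) + noise_sd * rng.standard_normal() for j in idx]
    y_g = [[g(X[j]) + noise_sd * rng.standard_normal() for j in idx] for g in g_fns]
    S = np.zeros(len(X), dtype=bool)
    S[seed_idx] = True
    for t in range(T):
        # confidence bounds for each safety function from its GP posterior
        l, u = [], []
        for y_i in y_g:
            mu, var = gp_posterior(X[idx], np.array(y_i), X, noise_sd ** 2)
            sd = np.sqrt(np.maximum(var, 0.0))
            l.append(mu - beta * sd)
            u.append(mu + beta * sd)
        optimize = t >= T0
        if not optimize:                           # stage one: expand
            S, G, j = expansion_step(X, np.array(l), np.array(u), L, h, S)
            optimize = j is None                   # no expanders left: fall through
        if optimize:                               # stage two: GP-UCB selection
            mu_f, var_f = gp_posterior(X[idx], np.array(y_f), X, noise_sd ** 2)
            j = ucb_select(mu_f, np.sqrt(np.maximum(var_f, 0.0)), beta, S)
        idx.append(j)                              # observe f and every g_i at x_t
        y_f.append(f_fn(X[j]) + noise_sd * rng.standard_normal())
        for y_i, g in zip(y_g, g_fns):
            y_i.append(g(X[j]) + noise_sd * rng.standard_normal())
    return S, idx
```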
5. Theoretical Results

In this section, we show the effectiveness of STAGEOPT by theoretically bounding its sample complexity for expansion and optimization. The two stages of STAGEOPT are the expansion of the safe region in search of the total safe region, and the optimization within the safe region.

The correctness of STAGEOPT relies on the fact that the classification of the sets $S_t$ and $G_t$ is sound. While this requires that the confidence bounds $C_t$ be conservative, using bounds that are too conservative will slow down the algorithm considerably. The tightness of the confidence bounds is controlled by the parameter $\beta_t$ in Equation 2. This problem of properly tuning confidence bounds for Gaussian processes in the exploration–exploitation trade-off has been studied by Srinivas et al. (2010) and Chowdhury & Gopalan (2017). Those algorithms are designed for the stochastic multi-armed bandit problem on a kernelized input space without safety constraints. However, their choice of confidence bounds can be generalized to our setting for expansion and optimization.
Figure 1. Evolution of GPs in SAFEOPT and STAGEOPT for a fixed safe seed, shown after (a) 2 iterations, (b) 4 iterations, and (c) 10 iterations; dashed lines correspond to the mean and shaded areas to ±2 standard deviations. The first and third rows depict the utility function, and the second and fourth rows depict a single safety function. The utility and safety functions were randomly sampled from a zero-mean GP with a Matérn kernel, and are represented with solid blue lines. The safety threshold is shown as the green line, and safe regions are shown in red. The red markers correspond to safe expansions and the blue markers to maximizations and optimizations. We see that STAGEOPT identifies actions with higher utility than SAFEOPT.
In particular, for our theoretical results to hold it suffices to choose

$$\beta_t = B + \sigma \sqrt{2(\gamma_{t-1} + 1 + \log(1/\delta))}, \qquad (4)$$
where $B$ is a bound on the RKHS norm of $f$, $\delta$ is the allowed failure probability, the observation noise is $\sigma$-sub-Gaussian, and $\gamma_t$ quantifies the effective degrees of freedom associated with the kernel function. Concretely, $\gamma_t = \max_{|A| \le t} I(f; \mathbf{y}_A)$ is the maximal mutual information that can be obtained about the GP prior from $t$ samples.
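Maximizing $I(f; \mathbf{y}_A)$ over all subsets is intractable in general, but by submodularity the greedy choice gives a constant-factor approximation (Srinivas et al., 2010). Below is a hedged sketch that computes this greedy surrogate for $\gamma_t$ from a kernel matrix, together with the resulting $\beta_t$ of Eq. (4); the function names are illustrative.

```python
import numpy as np

def greedy_gamma(K, t, noise_var):
    """Greedy surrogate for gamma_t = max_{|A| <= t} I(f; y_A).

    The information gain of a set A is 0.5 * log det(I + K_A / noise_var);
    the greedy rule adds the point of largest current posterior variance.
    """
    chosen, gain = [], 0.0
    for _ in range(t):
        if chosen:
            Kaa = K[np.ix_(chosen, chosen)] + noise_var * np.eye(len(chosen))
            Kab = K[chosen, :]            # cross-covariances to all points
            var = K.diagonal() - (Kab * np.linalg.solve(Kaa, Kab)).sum(axis=0)
        else:
            var = K.diagonal()
        j = int(np.argmax(var))
        gain += 0.5 * np.log(1.0 + max(var[j], 0.0) / noise_var)
        chosen.append(j)
    return gain

def beta_t(B, sigma, gamma_prev, delta):
    """Eq. (4): beta_t = B + sigma * sqrt(2 (gamma_{t-1} + 1 + log(1/delta)))."""
    return B + sigma * np.sqrt(2.0 * (gamma_prev + 1.0 + np.log(1.0 / delta)))
```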
We present two main theorems for STAGEOPT. Theorem 1 ensures convergence to the reachable safe region in the safe expansion stage. Theorem 2 ensures convergence towards the optimal utility value within the safe region in the utility optimization stage. Both results are finite-time bounds.
Theorem 1. Suppose that each safety function $g_i$, $i \in \{1,\dots,n\}$, satisfies $\|g_i\|_k^2 \le B$ and is $L_i$-Lipschitz-continuous. Also, suppose $S_0 \neq \emptyset$ and $g_i(\mathbf{x}) \ge h_i$ for all $\mathbf{x} \in S_0$. Fix any $\epsilon > 0$ and $\delta \in (0, 1)$. Suppose we run the safe region expansion stage of STAGEOPT with seed set $S_0$, with noise $n_t$ that is $\sigma$-sub-Gaussian, and $\beta_t = B + \sigma\sqrt{2(\gamma_{t-1} + 1 + \log(1/\delta))}$ with the safety function hyperparameters. Let $t^*$ be the smallest positive integer satisfying

$$\frac{t^*}{\beta_{t^*}^2 \gamma_{n t^*}} \ge \frac{C_1 \left( |\bar{R}_0(S_0)| + 1 \right)}{\epsilon^2},$$

where $C_1 = 8 / \log(1 + \sigma^{-2})$. Then, the following jointly hold with probability at least $1 - \delta$:

• $\forall t \ge 1$ and $i \in \{1,\dots,n\}$, $g_i(\mathbf{x}_t) \ge h_i$;

• $\forall t \ge t^*$, the $\epsilon$-reachable safe region $R_\epsilon^{T_0}(S_0)$ is guaranteed to be expanded.
The detailed proof of Theorem 1 is presented in the appendix. In Theorem 1, we count $t$ from the beginning of the expansion stage, and we choose $T_0 = t^*$, with $T_0$ the expansion horizon defined in Section 4. We show that, with high probability, the expansion stage of STAGEOPT guarantees safety and expands the initial safe region $S_0$ to an $\epsilon$-reachable set after at most $t^*$ iterations. The size of $t^*$ depends on the size of the largest safe region $\bar{R}_0(S_0)$, the accuracy parameters $\epsilon$ and $\zeta$, the confidence parameter $\delta$, the complexity bound $B$ on the function, and the parameterization of the GP via $\gamma_t$.
The proof is based on the following idea. Within a stage wherein $S_t$ does not expand, the uncertainty $w_t(\mathbf{x}_t)$ monotonically decreases due to the construction of $G_t$. We prove that the condition $\max_{\mathbf{x} \in G_t} w_t(\mathbf{x}) < \epsilon$ implies one of two possibilities: $S_t$ will expand after the next evaluation, i.e., the reachable region will increase and hence the next stage shall commence; or we have already established all decisions within $\bar{R}_\epsilon(S_0)$ as safe, i.e., $S_t = \bar{R}_\epsilon(S_0)$. To establish the sample complexity, we use a bound on how quickly $w_t(\mathbf{x}_t)$ decreases.
Theorem 2. Suppose the utility function $f$ satisfies $\|f\|_k^2 \le B$, let $\delta \in (0, 1)$, and suppose the noise $n_t$ is $\sigma$-sub-Gaussian. Let $\beta_t = B + \sigma\sqrt{2(\gamma_{t-1} + 1 + \log(1/\delta))}$ with the utility function hyperparameters, and let $T_1$ be the time horizon of the optimization stage. Fix any $\zeta > 0$. Suppose we run the optimization stage of STAGEOPT within the expansion stage safe region $R_\epsilon^{T_0}(S_0)$. Let $Y$ be the smallest positive integer satisfying

$$\frac{4\sqrt{2}}{\sqrt{Y}} \left( B\sqrt{\gamma_Y} + \sigma\sqrt{2\gamma_Y(\gamma_Y + 1 + \log(1/\delta))} \right) \le \zeta.$$

Then with probability at least $1 - \delta$, STAGEOPT finds a $\zeta$-optimal utility value: $f(\hat{\mathbf{x}}^*) \ge f(\mathbf{x}^*) - \zeta$.
The proof of Theorem 2 is presented in the appendix. We count $t$ from the beginning of the optimization stage in Theorem 2, and we choose $T_1 = Y$, with $T_1$ the time horizon of the optimization stage. We prove the existence of an $\epsilon$-optimal decision $\hat{\mathbf{x}}^*$ within the expansion stage safe region.
Discussion. STAGEOPT separates safe region expansion and utility function maximization into two distinct stages.
Figure 2. Comparison between SAFEOPT, STAGEOPT, and constrained EI on synthetic data with one safety function. In this simple setting, both SAFEOPT and STAGEOPT perform similarly. In order to achieve the same level of safety guarantees, constrained EI must be much more careful during exploration, and consequently fails to identify the optimal point.
Theorem 1 guarantees $\epsilon$-optimal expansion in the first stage within time horizon $T_0$. Theorem 2 guarantees a $\zeta$-optimal utility value in the second stage within time horizon $T_1$. Compared to existing approaches which interleave expansion and optimization, STAGEOPT does not require any similarity or comparability between safety and utility. In Section 6 we show empirically that STAGEOPT is more efficient and far more natural for some applications.
6. Experimental Results
We evaluated our algorithm on synthetic data as well as on
a live clinical experiment on spinal cord therapy.
Modified STAGEOPT and SAFEOPT. In real applications, it may be difficult to compute an accurate estimate of the Lipschitz constants, which may have an adverse effect on the definition of the safe region and its expansion dynamics. In these scenarios, one can use a modified version of SAFEOPT that defines safe points using only the GPs (Berkenkamp et al., 2016b). This modification can be directly applied to STAGEOPT as well; for clarity, we state the details here. Under this alternative definition, a point is classified as safe if the lower confidence bound of each of its safety GPs lies above the respective threshold:

$$S_t = \bigcap_i \left\{ \mathbf{x} \in D \,\middle|\, \ell_t^i(\mathbf{x}) \ge h_i \right\}.$$
A safe point is then an expander if an optimistic noiseless measurement of its upper confidence bound results in a previously non-safe point having all of its lower confidence bounds above the respective thresholds:

$$e_t(\mathbf{x}) := \left| \bigcap_i \left\{ \mathbf{x}' \in D \setminus S_t \,\middle|\, \ell^i_{t, u_t^i(\mathbf{x})}(\mathbf{x}') \ge h_i \right\} \right|,$$

where $\ell^i_{t, u_t^i(\mathbf{x})}$ denotes the lower confidence bound recomputed after adding the hypothetical noiseless observation $u_t^i(\mathbf{x})$ at $\mathbf{x}$.
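A sketch of this Lipschitz-free variant follows, reusing `gp_posterior` from Section 3. The optimistic step refits each safety GP with a hallucinated observation $u_t^i(\mathbf{x})$ at $\mathbf{x}$; for simplicity the hallucinated point reuses the observation noise rather than being exactly noiseless, which is an assumption of this sketch.

```python
import numpy as np

def modified_safe_set(l, h):
    """S_t = intersection over i of { x : l_t^i(x) >= h_i }."""
    return np.all(l >= np.asarray(h)[:, None], axis=0)

def modified_expanders(X, idx, y_g, l, u, h, beta, noise_var):
    """Expanders under the GP-only definition (no Lipschitz constants)."""
    S = modified_safe_set(l, h)
    G = np.zeros(len(X), dtype=bool)
    for j in np.flatnonzero(S):
        newly_safe = ~S                   # unsafe points a sample at x_j might certify
        for i, y_i in enumerate(y_g):     # one GP per safety function
            X_aug = np.vstack([X[idx], X[j]])
            y_aug = np.append(y_i, u[i, j])   # hallucinated optimistic measurement
            mu, var = gp_posterior(X_aug, y_aug, X, noise_var)
            l_new = mu - beta * np.sqrt(np.maximum(var, 0.0))
            newly_safe &= l_new >= h[i]
        G[j] = bool(newly_safe.any())
    return S, G
```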
6.1. Synthetic Data
We evaluated on several synthetic settings with various types of safety constraints and feedback.
Figure 3. Results on three synthetic scenarios: (a) reward with one safety function; (b) reward with three safety functions; (c) reward with one safety function and dueling feedback; (d)–(f) the corresponding safe region sizes. The first row corresponds to the reward and the second row to the growth of the safe region sizes (higher is better for both). In both of these metrics, STAGEOPT performs at least as well as SAFEOPT. For clarity, we omit the first 40 iterations for each setting since the algorithms similarly expand the safe region during that phase.
In each setting, the utility function was sampled from a zero-mean GP with a Matérn kernel ($\nu = 1.2$) over the space $D = [0, 1]^2$ uniformly discretized into $25 \times 25$ points. We considered the following safety constraint settings: (i) one safety function $g_1$ sampled from a zero-mean GP with a Matérn kernel with $\nu = 1.2$; (ii) three safety functions $g_1, g_2, g_3$ sampled from zero-mean GPs with length scales 0.2, 0.4, and 0.8.
We set the amplitudes of the safety functions to be 0.1 that of the utility function, and the safety threshold for each safety function $g_i$ to be $\mu_i + \frac{1}{2}\sigma_i$. We define a point $\mathbf{x}$ to be a safe seed if it satisfies $g_i(\mathbf{x}) > \mu_i + \sigma_i$.
We also considered several cases of feedback. For both safety settings, we examined the standard Gaussian noise-perturbed case, with $\sigma^2 = 0.0025$. We also ran experiments with dueling feedback for the first safety setting.
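The setup above can be reproduced in a few lines. The sketch below samples the utility and a single safety function on the grid, reusing `rbf_kernel` from Section 3 as a stand-in for the Matérn ($\nu = 1.2$) kernel; that substitution, and interpreting $\mu_i, \sigma_i$ as sample statistics of the drawn function, are simplifications of this sketch rather than the paper's choices.

```python
import numpy as np

def make_synthetic(seed=0, n_grid=25, length_scale=0.2):
    rng = np.random.default_rng(seed)
    a, b = np.meshgrid(np.linspace(0, 1, n_grid), np.linspace(0, 1, n_grid))
    X = np.c_[a.ravel(), b.ravel()]                  # 25 x 25 grid over [0,1]^2
    K = rbf_kernel(X, X, length_scale) + 1e-8 * np.eye(len(X))
    chol = np.linalg.cholesky(K)
    f = chol @ rng.standard_normal(len(X))           # utility sample
    g = 0.1 * (chol @ rng.standard_normal(len(X)))   # safety, 0.1x the amplitude
    h = g.mean() + 0.5 * g.std()                     # threshold mu_i + sigma_i / 2
    seeds = np.flatnonzero(g > g.mean() + g.std())   # safe-seed candidates
    return X, f, g, h, seeds
```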
Algorithms. As discussed previously, SAFEOPT is the only other known algorithm that has similar guarantees in our setting, and serves as the main competitor to STAGEOPT. In addition, we also compared against the constrained Expected Improvement (CEI) algorithm of Gelbart et al. (2014). Since CEI only guarantees stepwise safety, as opposed to safety over the entire time horizon, we set its safety threshold to $\delta / T$ with $\delta = 0.1$ in order to match our setting. Naturally, with such a stringent threshold, CEI is not very competitive compared to STAGEOPT and SAFEOPT, as seen in Figure 2. In order to adequately distinguish between the latter two algorithms, we omit constrained EI results from all further figures.
Results. In each setting, we randomly sampled 30 combinations of utility and safety functions and ran STAGEOPT and SAFEOPT for $T = 100$ iterations starting from each of 10 randomly sampled safe seeds. For STAGEOPT, we used a dynamic stopping criterion for the safe expansion phase (i.e., for $T_0$): we stop when the safe region plateaus for 10 iterations, hard capped at 80 iterations. In these experiments, we primarily used GP-UCB in the utility optimization phase. We also tried two other common acquisition functions, Expected Improvement (EI) and Maximum Probability of Improvement (MPI). However, we observed similar behavior across all three acquisition functions, since our algorithm quickly identifies the reachable safe region in most scenarios.

In Figure 3, for each setting and algorithm, we plot both the growth of the size of the safe region as well as a notion of reward $r_t = \max_{1 \le i \le t} f(\mathbf{x}_i)$. Although there is some similarity between the performances of the algorithms, it is evident that STAGEOPT grows the safe region at least as fast as SAFEOPT, while also reaching an optimal sample point more quickly.
6.2. Clinical Experiments

We finally applied STAGEOPT to safely optimize clinical spinal cord stimulation in order to help tetraplegic patients regain physical mobility. The goal is to find effective stimulation therapies for patients with severe spinal cord injuries without introducing undesirable side effects. For example, bad stimulations could have negative effects on the rehabilitation and are often painful.
Figure 4. Expansion of the safe region for spinal cord injury therapy. The orange solid line represents the growth of the safe region over time, and the blue dashed line the total size of the input space.

Figure 5. Utilities within the safe region (larger is better). The green dashed line denotes the physician's best choice. The thin blue line shows the utilities of STAGEOPT at each iteration, and the orange solid line is a GP curve fit of these utilities.
This application is easily framed under our problem setting: the chosen configurations must stay above a safety threshold.

A total of 564 therapeutic trials were performed with a tetraplegic patient in gripping experiments over 10 weeks. In each trial, one stimulating pattern was generated by the 32-channel electrode and was fixed within the trial. For a fixed electrode configuration, the stimulation frequency and amplitude were modulated synergistically in order to find those best for effective gripping. A similar setup was studied in (Sui et al., 2017a). We optimized the electrode patterns with preference-based STAGEOPT (see the appendix) and performed exhaustive search over stimulation frequency and amplitude in a narrow range.
Results. Figure 4 shows the stimulating patterns reachable by the algorithm under safety constraints. The physicians are confident that the total safe region had been reached between 300 and 400 iterations. In our experiments, STAGEOPT did not sample any unsafe stimulating patterns. Figure 5 plots the utility measure of the stimulating pattern at each iteration. The orange solid line is a GP curve fit of the STAGEOPT utilities (in thin blue). It clearly exceeds the physician's best choice (dashed green line) after around 400 iterations of online experiments. These results demonstrate the practicality of STAGEOPT for safe optimization in challenging settings, such as those involving live human subjects.
7. Conclusion & Discussion

In this paper, we studied the problem of safe Bayesian optimization, which is well suited to any setting requiring safe online optimization, such as medical therapies, safe recommender systems, and safe robotic control. We proposed a general framework, STAGEOPT, which is able to handle non-comparable safety constraints and utility functions. We provide strong theoretical guarantees for STAGEOPT when the safety functions and utility function are sampled from Gaussian processes. Specifically, we bound the sample complexity needed to achieve an $\epsilon$-safe region and a $\zeta$-optimal utility value within the safe region. The whole sampling process is guaranteed to be safe with high probability.
We compared STAGEOPT with classical Bayesian optimization methods and state-of-the-art safe optimization algorithms. We evaluated multiple cases, including a single safety function, multiple safety functions, real-valued utility, and dueling-feedback utility. Our extensive experiments on synthetic data show that STAGEOPT achieves its theoretical guarantees on safety and optimality. Its performance on safe expansion is among the best, and its utility maximization outperforms the state of the art.
This result also provides an efficient tool for online optimization in safety-critical applications. For instance, we applied STAGEOPT with a dueling-feedback utility function to gripping rehabilitation therapy for tetraplegic patients. Our live clinical experiments demonstrated good performance in a real human experiment. The therapies proposed by STAGEOPT outperform the ones suggested by experienced physicians.
There are many interesting directions for future work. For instance, we assume a static environment that does not evolve in response to the actions taken. In our clinical application, this implies assuming that the patients' condition and response to stimuli do not improve over time. Moving forward, it would be interesting to incorporate dynamics into our setting, which would lead to the multi-criteria safe reinforcement learning setting (Moldovan & Abbeel, 2012; Turchetta et al., 2016; Wachi et al., 2018).
Another interesting direction is developing theoretically rigorous approaches outside of Gaussian processes (GPs). Although highly flexible, GPs require a well-specified prior and kernel in order to be effective. While one could use uninformed priors to model most settings, such priors tend to lead to very slow convergence. One alternative is to automatically learn a good kernel (Wilson et al., 2016). Another approach is to assume a low-dimensional manifold within the high-dimensional uninformed kernel (Djolonga et al., 2013), which could also speed up learning.
Acknowledgements
This research was also supported in part by NSF Awards
#1564330 & #1637598, JPL PDF IAMS100224, a
Bloomberg Data Science Research Grant, and a gift from
Northrop Grumman.
References
Audibert, J.-Y., Bubeck, S., and Munos, R. Best arm identification
in multi-armed bandits. In
Conference on Learning Theory
(COLT)
, 2010.
Auer, P. Using confidence bounds for exploitation-exploration
trade-offs.
Journal of Machine Learning Research (JMLR)
,
2002.
Berkenkamp, F., Krause, A., and Schoellig, A. P. Bayesian opti-
mization with safety constraints: Safe and automatic parameter
tuning in robotics. Technical report, arXiv, February 2016a.
Berkenkamp, F., Schoellig, A. P., and Krause, A. Safe controller
optimization for quadrotors with gaussian processes. In
Proc.
of the International Conference on Robotics and Automation
(ICRA)
, pp. 491–496, May 2016b.
Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and
nonstochastic multi-armed bandit problems.
Foundations and
Trends in Machine Learning
, 5:1–122, 2012.
Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. Online
optimization in X-armed bandits. In
Advances in Neural Infor-
mation Processing Systems (NIPS)
, pp. 201–208, 2008.
Chowdhury, S. R. and Gopalan, A. On kernelized multi-armed
bandits. In
International Conference on Machine Learning
(ICML)
, pp. 844–853, 2017.
Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear op-
timization under bandit feedback. In
Proceedings of the 21st
Annual Conference on Learning Theory (COLT)
, pp. 355–366,
2008.
Djolonga, J., Krause, A., and Cevher, V. High-dimensional gaus-
sian process bandits. In
Advances in Neural Information Pro-
cessing Systems (NIPS)
, pp. 1025–1033, 2013.
Garcia, J. and Fernandez, F. Safe exploration of state and action
spaces in reinforcement learning.
Journal of Machine Learning
Research
, 2012.
Gelbart, M. A., Snoek, J., and Adams, R. P. Bayesian optimization
with unknown constraints.
Uncertainty in Artificial Intelligence
(UAI)
, 2014.
Gillula, J. and Tomlin, C. Guaranteed safe online learning of a
bounded system. In
IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS)
, 2011.
Gotovos, A., Casati, N., Hitz, G., and Krause, A. Active learning
for level set estimation. In
International Joint Conference on
Artificial Intelligence (IJCAI)
, 2013.
Hans, A., Schneegaß, D., Schäfer, A., and Udluft, S. Safe explo-
ration for reinforcement learning. In
ration for reinforcement learning. In
ESANN
, 2008.
Hernández-Lobato, J. M., Gelbart, M. A., Adams, R. P., Hoffman,
M. W., and Ghahramani, Z. A general framework for con-
strained bayesian optimization using information-based search.
Journal of Machine Learning Research
, 2016.
Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in
metric spaces. In
ACM Symposium on Theory of Computing
(STOC)
, pp. 681–690. Association for Computing Machinery,
Inc., May 2008.
Moldovan, T. and Abbeel, P. Safe exploration in markov decision
processes. In
International Conference on Machine Learning
(ICML)
, 2012.
Rasmussen, C. E. and Williams, C. K. I.
Gaussian Processes for
Machine Learning
. MIT Press, 2006.
Robbins, H. Some aspects of the sequential design of experiments.
Bulletin of the American Mathematical Society
, 1952.
Schreiter, J., Nguyen-Tuong, D., Eberts, M., Bischoff, B., Markert,
H., and Toussaint, M. Safe exploration for active learning with
gaussian processes. In
Joint European Conference on Machine
Learning and Knowledge Discovery in Databases
, pp. 133–149.
Springer, 2015.
Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian
process optimization in the bandit setting: No regret and ex-
perimental design. In
International Conference on Machine
Learning (ICML)
, 2010.
Sui, Y., Gotovos, A., Burdick, J. W., and Krause, A. Safe explo-
ration for optimization with gaussian processes. In
International
Conference on Machine Learning (ICML)
, 2015.
Sui, Y., Yue, Y., and Burdick, J. W. Correlational dueling bandits
with application to clinical treatment in large decision spaces. In
International Joint Conference on Artificial Intelligence (IJCAI)
,
2017a.
Sui, Y., Zhuang, V., Burdick, J. W., and Yue, Y. Multi-dueling
bandits with dependent arms. In
Uncertainty in Artificial Intel-
ligence (UAI)
, 2017b.
Thompson, W. R. On the likelihood that one unknown probabil-
ity exceeds another in view of the evidence of two samples.
Biometrika
, 25(3/4):285–294, 1933.
Turchetta, M., Berkenkamp, F., and Krause, A. Safe exploration
in finite markov decision processes with gaussian processes. In
Neural Information Processing Systems (NIPS)
, pp. 4305–4313,
December 2016.
Wachi, A., Sui, Y., Yue, Y., and Ono, M. Safe exploration and
optimization of constrained mdps using gaussian processes. In
AAAI Conference on Artificial Intelligence (AAAI)
, 2018.
Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Deep ker-
nel learning. In
Artificial Intelligence and Statistics (AISTATS)
,
pp. 370–378, 2016.
Wu, Y., Shariff, R., Lattimore, T., and Szepesvári, C. Conservative
bandits. In
International Conference on Machine Learning
(ICML)
, pp. 1254–1262, 2016.
Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. The k-armed
dueling bandits problem.
Journal of Computer and System
Sciences
, 78(5):1538–1556, 2012.