IEEE ROBOTICS AND AUTOMATION LETTERS. PREPRINT VERSION. ACCEPTED: OCTOBER, 2020
Chance-Constrained Trajectory Optimization for Safe
Exploration and Learning of Nonlinear Systems
Yashwanth Kumar Nakka, Anqi Liu, Guanya Shi, Anima Anandkumar, Yisong Yue, and Soon-Jo Chung
Abstract—Learning-based control algorithms require data collection with abundant supervision for training. Safe exploration algorithms ensure the safety of this data collection process even when only partial knowledge is available. We present a new approach for optimal motion planning with safe exploration that integrates chance-constrained stochastic optimal control with dynamics learning and feedback control. We derive an iterative convex optimization algorithm that solves an Information-cost Stochastic Nonlinear Optimal Control problem (Info-SNOC). The optimization objective encodes control cost for performance and exploration cost for learning, and safety is incorporated as distributionally robust chance constraints. The dynamics are predicted from a robust regression model that is learned from data. The Info-SNOC algorithm is used to compute a sub-optimal pool of safe motion plans that aid in exploration for learning unknown residual dynamics under safety constraints. A stable feedback controller is used to execute the motion plan and collect data for model learning. We prove the safety of rollouts from our exploration method and the reduction in uncertainty over epochs, thereby guaranteeing the consistency of our learning method. We validate the effectiveness of Info-SNOC by designing and implementing a pool of safe trajectories for a planar robot. We demonstrate that our approach has a higher success rate in ensuring safety when compared to a deterministic trajectory optimization approach.

Index Terms—Motion and Path Planning, Model Learning for Control, Machine Learning for Robot Control, Chance Constraints
I. INTRODUCTION

Robots deployed in the real world often operate in unknown or partially-known environments. Modeling the complex dynamic interactions with the environment requires high-fidelity techniques that are often computationally expensive. Machine-learning models can remedy this difficulty by approximating the dynamics from data [1]–[4]. The learned models typically require off-line training with labeled data that are often not available or hard to collect in many applications. Safe exploration is an efficient approach to collect ground-truth data by safely interacting with the environment.

We present an episodic learning and control algorithm for safe exploration, as shown in Fig. 1, that integrates learning, stochastic trajectory planning, and rollout for active and safe data collection.
Manuscript received: May 10, 2020; Revised: August 11, 2020; Accepted: October 1, 2020. This paper was recommended for publication by Editor Nancy Amato upon evaluation of the Associate Editor and Reviewers' comments. The authors are with the California Institute of Technology. Email: {ynakka, anqiliu, gshi, anima, yyue, sjchung}@caltech.edu. This work was funded in part by the Jet Propulsion Laboratory, Caltech and the Raytheon Company. A. Liu is supported by a PIMCO Postdoctoral Fellowship. We acknowledge the contribution of Irene S. Crowell in implementing Info-SNOC. Digital Object Identifier (DOI): see top of the page.
[Fig. 1 block diagram: an epoch loop in which initial safe set data feed a learned Gaussian model of the unknown dynamics, the Information-Cost Stochastic Nonlinear Optimal Control problem is solved via a deterministic surrogate with sequential convex programming, and the resulting probabilistic safe trajectory (from the safe set to the terminal set, around obstacles) is executed on the robot by a nonlinear controller with a safety filter and sensors to collect data for the next epoch.]
Fig. 1. An episodic framework for safe exploration using chance-constrained trajectory optimization. An initial estimate of the dynamics is computed using a known safe set [8] and control policy. A probabilistic safe trajectory and policy that satisfies safety chance constraints is computed using Info-SNOC for the estimated dynamics. This policy is used for rollout with a stable feedback controller to collect data.
Rollout is defined as executing the computed safe trajectory and policy using a stable feedback controller. The planning problem is formulated as an Information-cost Stochastic Nonlinear Optimal Control (Info-SNOC) problem that maximizes exploration and minimizes the control effort. Safety constraints are formulated as chance constraints.

The propagation of uncertainty in the dynamic model and chance constraints in Info-SNOC are addressed by projecting the problem to the generalized polynomial chaos (gPC) space and computing a distributionally robust [5], [6] convex approximation. By building on [6], we derive a sequential convex optimization solution to the Info-SNOC problem to plan a pool of sub-optimal safe and information-rich trajectories with the learned approximation of the dynamics. A sample of the trajectory pool is used as an input to the rollout stage to collect new data. To ensure real-time safety, the nonlinear feedback controller with a safety filter used in the rollout stage certifies bounded stochastic stability [7]. The new data is used to learn an improved dynamic model.
Related Work: Safe exploration for continuous dynamical systems has been studied using the following three frameworks: learning-based model-predictive control (MPC), dual control, and active dynamics learning. Learning-based MPC [8]–[11] has been studied extensively for controlling the learned system. These deterministic techniques are also applied for planning an information trajectory¹ to learn online.

¹An information trajectory is defined as a result of optimal motion planning that has more information about the unknown model compared to a fuel-optimal trajectory.
Such approaches offer limited analysis of the safety of the motion plans, relying on recursive feasibility arguments and on appending a known safe control policy. In contrast, stochastic trajectory planning [6], [12]–[14] naturally extends to incorporate a probabilistic learned dynamic model. The safety constraints formulated as joint chance constraints [12] facilitate a new approach to analyze the safety of the motion plans computed using Info-SNOC. The effect of uncertainty in the learned model on the propagation of dynamics is estimated using the generalized polynomial chaos (gPC) expansion [6], which has asymptotic convergence to the original distribution and provides guarantees on constraint satisfaction.
Estimating unknown parameters while simultaneously optimizing for performance has been studied as a dual control problem [15]. Dual control is an optimal control problem formulation to compute a control policy that is optimized for performance and guarantees parameter convergence. In some recent work [13], [16], the convergence of the estimate is achieved by using the condition of persistency of excitation in the optimal control problem. Our method uses Sequential Convex Programming (SCP) [17]–[19] to compute the persistently exciting trajectory. Recent work [20] uses nonlinear programming tools to solve optimal control problems with an upper-confidence bound [21] cost for exploration without safety constraints. We follow a similar approach but formulate the planning problem as an SNOC with distributionally robust linear and quadratic chance constraints for safety. The distributionally robust chance constraints are convexified via projection to the gPC space. The algorithm proposed in this paper can be used in the MPC framework with appropriate terminal conditions for feasibility, and to solve dual control problems with high efficiency using interior point methods.
The contributions of the paper are as follows: a) we propose a new safe exploration and motion planning method by directly incorporating safety as chance constraints and ensuring stochastic nonlinear stability in the closed-loop control along with a safety filter; b) we derive a new solution method to the Info-SNOC problem for safe and optimal motion planning, while ensuring the consistency and reduced uncertainty of our dynamics learning method; and c) we use a multivariate robust regression model [22] under a covariate shift assumption to compute the multi-dimensional uncertainty estimates of the unknown dynamics used in Info-SNOC.
Organization: We discuss the preliminaries of the robust regression learning method, the Info-SNOC problem formulation, and results on deterministic approximations of chance constraints in Sec. II. The Info-SNOC algorithm along with the implementation of the rollout policy is presented in Sec. III. In Sec. IV, we derive the end-to-end safety guarantees. In Sec. V, we discuss a result of applying the Info-SNOC algorithm to the nonlinear three degree-of-freedom spacecraft robot model [23]. We conclude the paper in Sec. VI.
II. PRELIMINARIES AND PROBLEM FORMULATION

In this section, we discuss the preliminaries of the learning method used for modelling the unknown dynamics, the formulation of the Info-SNOC problem, and the deterministic surrogate projection using gPC.
A. Robust Regression For Learning

An exploration step in active data collection for learning dynamics is regarded as a covariate shift problem. Covariate shift is a special case of distribution shift between training and testing data distributions. In particular, we aim to learn the unknown part of the dynamics g(x, ū) from the state x and control ū defined in Sec. II-B. The covariate shift assumption indicates that the conditional dynamics distribution given the states and controls remains the same, while the input distribution of training (Pr_s(x, ū)) is different from the target input distribution (Pr_t(x, ū)). Robust regression is derived from a min–max adversarial estimation framework, where the estimator minimizes a loss function and the adversary maximizes the loss under statistical constraints. The resulting Gaussian distributions provided by this learning framework are given below. For more technical details, we refer the readers to [22], [24].
The output Gaussian distribution takes the form N(μ_g, Σ_g) with mean μ_g and variance Σ_g:

$$\Sigma_g(x,\bar{u},\theta_2) = \left( 2\,\frac{\mathrm{Pr}_s(x,\bar{u})}{\mathrm{Pr}_t(x,\bar{u})}\,\theta_2 + \Sigma_0^{-1} \right)^{-1}, \qquad (1)$$

$$\mu_g(x,\bar{u},\theta_1) = \Sigma_g(x,\bar{u},\theta_2)\left( 2\,\frac{\mathrm{Pr}_s(x,\bar{u})}{\mathrm{Pr}_t(x,\bar{u})}\,\theta_1\,\phi(x,\bar{u}) + \Sigma_0^{-1}\mu_0 \right),$$
where N(μ_0, Σ_0) is a non-informative (i.e., large Σ_0) base distribution, and φ(x, ū) is the feature function that is learned using neural networks from data. The model parameters θ_1 and θ_2 are selected by maximizing the target log likelihood. The density ratio Pr_s(x, ū)/Pr_t(x, ū) is estimated from data beforehand. Robust regression can handle multivariate outputs with correlation efficiently by incorporating neural networks and predicting a multivariate Gaussian distribution, whereas traditional methods like Gaussian process regression suffer in high dimensions and require heavy tuning of kernels [24].
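To make the structure of (1) concrete, the following minimal Python sketch evaluates the predictive mean and covariance at a single query point. The helper names density_ratio and features, and the matrix-valued parameters theta1 and theta2, are illustrative placeholders rather than the authors' implementation.

```python
import numpy as np

# Minimal sketch of the predictive Gaussian in Eq. (1).
# `density_ratio`, `features`, `theta1`, `theta2`, `mu0`, `Sigma0` are
# illustrative placeholders, not names from the paper's implementation.

def predictive_gaussian(x, u, density_ratio, features, theta1, theta2, mu0, Sigma0):
    """Return (mu_g, Sigma_g) for a single (state, control) query."""
    r = density_ratio(x, u)                 # Pr_s(x, u) / Pr_t(x, u)
    phi = features(x, u)                    # learned feature vector, shape (k,)
    Sigma0_inv = np.linalg.inv(Sigma0)
    # Covariance: (2 r theta2 + Sigma0^{-1})^{-1}
    Sigma_g = np.linalg.inv(2.0 * r * theta2 + Sigma0_inv)
    # Mean: Sigma_g (2 r theta1 phi + Sigma0^{-1} mu0)
    mu_g = Sigma_g @ (2.0 * r * theta1 @ phi + Sigma0_inv @ mu0)
    return mu_g, Sigma_g
```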
B. Optimal and Safe Planning Problem

In this section, we present the finite-time chance-constrained stochastic optimal control problem formulation [6] used to design an information-rich trajectory. The optimization has control effort and terminal cost as performance objectives, and safety is modelled as a joint chance constraint on a set F defined by a polytope or a quadratic constraint. The full stochastic optimal control problem is as follows:
$$
\begin{aligned}
J^* = \min_{x(t),\,\bar{u}(t)} \;& \mathbb{E}\left[ \int_{t_0}^{t_f} J\big(x(t),\bar{u}(t)\big)\,dt + J_f\big(x(t_f),\bar{u}(t_f)\big) \right] && (2)\\
\text{s.t.} \;\;& \dot{x}(t) = f\big(x(t),\bar{u}(t)\big) + \hat{g}\big(x(t),\bar{u}(t)\big) && (3)\\
& \Pr\big(x(t) \in \mathcal{F}\big) \geq 1-\epsilon, \quad \forall\, t \in [t_0,t_f] && (4)\\
& \bar{u}(t) \in \mathcal{U} \quad \forall\, t \in [t_0,t_f] && (5)\\
& x(t_0) = x_0, \quad \mathbb{E}\big(x(t_f)\big) = \mu_{x_f}, && (6)
\end{aligned}
$$

where x(t) ∈ X ⊆ ℝⁿ denotes the state of the dynamics, x_0 and x_f are the initial and terminal states respectively, the control ū ∈ U ⊆ ℝᵐ is deterministic, ĝ is the learned probabilistic model, and E is the expectation operator. The modelling assumptions and the problem formulation are elaborated in the following sections.
1) Dynamical Model: The ĝ term in (3) is the estimated model of the unknown g term of the original dynamics:

$$\dot{\bar{p}} = f(\bar{p},\bar{u}) + \underbrace{g(\bar{p},\bar{u})}_{\text{unknown}}, \qquad (7)$$

where the state p̄ ∈ X is now considered deterministic, and the functions f : X × U → ℝⁿ and g : X × U → ℝⁿ are Lipschitz with respect to p̄ and ū. The control set U is convex and compact.
Remark 1. The maximum entropy distribution with the known mean μ_g and covariance matrix Σ_g of the random variable ĝ is the Gaussian distribution N(μ_g, Σ_g).
The learning algorithm computes the mean vector μ_g(μ_x, ū) and the covariance matrix Σ_g(μ_x, ū) estimates of g(x, ū), which are functions of the mean μ_x of the state x and the control ū.
Due to Remark 1, the unknown bias term is modeled as a multivariate Gaussian distribution ĝ(μ_x, ū) ∼ N(μ_g, Σ_g). The estimate ĝ in (3) can be expressed as

$$\hat{g} = B(\mu_x,\bar{u})\,\theta + \mu_g(\mu_x,\bar{u}), \qquad (8)$$
where θ ∼ N(0, I) is i.i.d. and B(μ_x, ū) B^⊤(μ_x, ū) = Σ_g(μ_x, ū). Using (8), the dynamics (7) can be written in the standard Itô stochastic differential equation (SDE) form given below, where θ dt = dW:

$$dx = f(x,\bar{u})\,dt + \mu_g(\mu_x,\bar{u})\,dt + B(\mu_x,\bar{u})\,dW. \qquad (9)$$

The existence and uniqueness of a solution to the SDE for a given initial distribution x_0 and control trajectory ū, such that Pr(|x(t_0) − x_0| = 0) = 1 with measure Pr, is guaranteed by the Lipschitz condition and the Restriction on Growth condition [25]. The approximate system (9) is assumed to be controllable in the given feasible space.
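As an aside, the following Python sketch shows one way the learned terms μ_g and B = chol(Σ_g) enter (9), using a simple Euler–Maruyama Monte Carlo rollout; the paper itself propagates uncertainty with the gPC projection of Sec. II-C, and f_nominal and learned_model are hypothetical callables.

```python
import numpy as np

# Euler-Maruyama sketch of the learned SDE (9).  The paper propagates
# uncertainty with gPC (Sec. II-C); this Monte Carlo rollout is only an
# illustration of how mu_g and B (with B B^T = Sigma_g) enter the dynamics.

def simulate_sde(x0, u_traj, dt, f_nominal, learned_model, rng=np.random.default_rng(0)):
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for u in u_traj:
        mu_g, Sigma_g = learned_model(x, u)      # N(mu_g, Sigma_g) estimate of g
        B = np.linalg.cholesky(Sigma_g)          # B B^T = Sigma_g
        dW = rng.standard_normal(len(x)) * np.sqrt(dt)
        x = x + (f_nominal(x, u) + mu_g) * dt + B @ dW
        traj.append(x.copy())
    return np.array(traj)
```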
2) Cost Functional: The integrand cost functional J = J_C + J_I includes two objectives: a) the exploration cost J_I for achieving the maximum value of information for learning the unknown dynamics g, and b) the performance cost J_C (e.g., minimizing the control effort). The integrand cost functional J_C for fuel optimality, which is convex in ū, is given as

$$J_C = \|\bar{u}\|_s, \quad \text{where } s \in \{1,2,\infty\}. \qquad (10)$$

One example of J_I is the following variance-based information cost using each i-th diagonal element σ_i² of Σ_g:

$$J_I = \sum_{i=1}^{n} \sigma_i(\mu_x,\bar{u}). \qquad (11)$$

The information cost J_I in (11) is a functional of the mean μ_x of the state x and the control ū at time t. By minimizing the cost J_I, we maximize the information [21] available in the trajectory x to learn the unknown model g. The terminal cost functional J_f is quadratic in the state x, J_f = x(t_f)^⊤ Q_f x(t_f), where Q_f is a positive semi-definite matrix.
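As a concrete illustration of how (10) and (11) are blended in the running cost of the planning problem, the sketch below evaluates (1 − ρ)J_C + ρJ_I along a discretized mean trajectory; learned_model is a placeholder for the robust-regression predictor returning (μ_g, Σ_g).

```python
import numpy as np

# Sketch of the blended running cost (1 - rho)*J_C + rho*J_I from (10)-(11),
# evaluated on a discretized trajectory.  `learned_model` is a hypothetical
# callable standing in for the robust regression predictor.

def running_cost(mu_x_traj, u_traj, dt, rho, learned_model, s=2):
    total = 0.0
    for mu_x, u in zip(mu_x_traj, u_traj):
        J_C = np.linalg.norm(u, ord=s)                  # control-effort cost (10)
        _, Sigma_g = learned_model(mu_x, u)
        J_I = np.sum(np.sqrt(np.diag(Sigma_g)))         # variance-based info cost (11)
        total += ((1.0 - rho) * J_C + rho * J_I) * dt
    return total
```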
3) State and Safety Constraints: Safety is defined as a constraint on the state space x, x(t) ∈ F at time t. The safe set F is relaxed by formulating a joint chance constraint with risk ε of constraint violation as

$$\Pr(x \in \mathcal{F}) \geq 1 - \epsilon. \qquad (12)$$

The constant ε is called the risk measure of a chance constraint in this paper. We consider the polytopic constraint set F_lin = {x ∈ X : ∧_{i=1}^{k} a_i^⊤ x + b_i ≤ 0} with k flat sides and a quadratic constraint set F_quad = {x ∈ X : x^⊤ A x ≤ c} for any realization x of the state. The joint chance constraint Pr(∧_{i=1}^{k} a_i^⊤ x + b_i ≤ 0) ≥ 1 − ε can be transformed into the following individual chance constraints:

$$\Pr(a_i^\top x + b_i \leq 0) \geq 1 - \epsilon_i, \qquad (13)$$

such that ∑_{i=1}^{k} ε_i = ε. Here we use ε_i = ε/k. The individual risk measure ε_i can be optimally allocated among the k constraints [26]. A quadratic chance constraint is given as

$$\Pr(x^\top A x \geq c) \leq \epsilon, \qquad (14)$$

where A is a positive definite matrix. We use a tractable conservative approximation of the sets in (13), (14) by formulating distributionally robust chance constraints [5] for the known mean μ_x and variance Σ_x of the random variable x at each time t.
Lemma 1. The linear distributionally robust chance constraint inf_{x∼(μ_x,Σ_x)} Pr(a^⊤x + b ≤ 0) ≥ 1 − ε_ℓ is equivalent to the deterministic constraint:

$$a^\top \mu_x + b + \sqrt{\frac{1-\epsilon_\ell}{\epsilon_\ell}}\,\sqrt{a^\top \Sigma_x\, a} \;\leq\; 0. \qquad (15)$$

Proof: See Theorem 3.1 in [5].
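For reference, a minimal sketch of checking the deterministic surrogate (15) at a single time instant, given the first two moments of the state:

```python
import numpy as np

# Sketch of the deterministic surrogate (15) for the distributionally robust
# linear chance constraint, checked at a single time instant.

def dr_linear_constraint_satisfied(a, b, mu_x, Sigma_x, eps_l):
    """True if  a^T mu_x + b + sqrt((1-eps)/eps) * sqrt(a^T Sigma_x a) <= 0."""
    margin = np.sqrt((1.0 - eps_l) / eps_l) * np.sqrt(a @ Sigma_x @ a)
    return a @ mu_x + b + margin <= 0.0
```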
Lemma 2. The semi-definite constraint on the variance Σ_x,

$$\operatorname{tr}(A\,\Sigma_x) \leq \epsilon_q\, c, \qquad (16)$$

is a conservative deterministic approximation of the quadratic chance constraint Pr((x − μ_x)^⊤ A (x − μ_x) ≥ c) ≤ ε_q.

Proof: See Proposition 1 in [6].
The risk measures ε_ℓ and ε_q are assumed to be given. For the rest of the paper, we consider the following individual chance-constrained problem:

$$
\begin{aligned}
(x^*,\bar{u}^*) = \operatorname*{argmin}_{x(t),\,\bar{u}(t)} \;& \mathbb{E}\left[ \int_{t_0}^{t_f} \big((1-\rho)J_C + \rho J_I\big)\,dt + J_f \right],\\
\text{s.t.} \;\;& \{(9),\, (15),\, (16),\, (5),\, (6)\},
\end{aligned}
\qquad (17)
$$

which is assumed to have a feasible solution with ρ ∈ [0, 1].
C. Generalized Polynomial Chaos (gPC)

The problem (17) is projected into a finite-dimensional space using the generalized polynomial chaos algorithm presented in [6]. We briefly review the ideas and equations, and refer the readers to [6] for details on computing the functions and matrices mentioned below. In the gPC expansion, any L_2-bounded random variable x(t) is expressed as the product of the basis-function matrix Φ(θ), defined on the random variable θ, and the deterministic time-varying coefficients X(t) = [x_{10} ··· x_{1ℓ} ··· x_{n0} ··· x_{nℓ}]^⊤. The functions φ_i, i ∈ {0, 1, ..., ℓ}, are chosen to be Gauss–Hermite polynomials in this paper. Let Φ(θ) = [φ_0(θ) ··· φ_ℓ(θ)]^⊤. Then

$$x \approx \bar{\Phi}\, X, \quad \text{where } \bar{\Phi} = I_{n\times n} \otimes \Phi(\theta)^\top. \qquad (18)$$
A Galerkin projection is used to transform the SDE (9) to the deterministic equation

$$\dot{X} = \bar{f}(X,\bar{u}) + \bar{\mu}_g(\mu_x,\bar{u}) + \bar{B}(\mu_x,\bar{u}). \qquad (19)$$

The exact form of the function f̄ is given in [6]. The moments of the random variable x, namely μ_x and Σ_x, can be expressed as polynomial functions of the elements of X. The polynomial functions defined in [6] are used to project the distributionally robust linear chance constraint in (15) to a second-order cone constraint in X as follows:

$$(a^\top S)\,X + b + \sqrt{\frac{1-\epsilon_\ell}{\epsilon_\ell}}\,\sqrt{X^\top U N N^\top U^\top X} \;\leq\; 0, \qquad (20)$$

where the matrices S, U, and N are functions of the expectation of the polynomials Φ, as given in [6]. The quadratic chance constraint in (16) is transformed to the following quadratic constraint in X:

$$\sum_{i=1}^{n}\sum_{k=1}^{\ell} a_i\, \langle \phi_k, \phi_k \rangle\, x_{ik}^2 \;\leq\; \epsilon_q\, c, \qquad (21)$$

where A is a diagonal matrix with a_i as its i-th diagonal element, and ⟨·,·⟩ denotes an inner product operation. Similarly, we project the initial and terminal state constraints to obtain X(t_0) and X(t_f). In Sec. III, we present the Info-SNOC algorithm using the projected optimal control problem.
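As a small illustration of the expansion (18) for a single scalar state, the sketch below builds probabilists' Hermite basis functions of a standard normal θ and recovers the mean and variance from the coefficients; the coefficient values are made up for the example, and the matrices S, U, N in (20) are left to [6].

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He

ell = 3                                     # polynomial order of the expansion
coeffs = np.array([1.2, 0.4, 0.05, 0.0])    # X = [x_0 ... x_ell] for one state (made up)

def evaluate_sample(theta):
    """x(theta) = sum_k x_k * phi_k(theta), with phi_k the probabilists' Hermite He_k."""
    basis = np.array([He.hermeval(theta, np.eye(ell + 1)[k]) for k in range(ell + 1)])
    return coeffs @ basis

# For Hermite chaos with theta ~ N(0, 1): E[phi_0] = 1, E[phi_k] = 0 for k >= 1,
# and <phi_k, phi_k> = k!, so the mean and variance follow from the coefficients.
mean_x = coeffs[0]
var_x = sum(coeffs[k] ** 2 * factorial(k) for k in range(1, ell + 1))
sample = evaluate_sample(np.random.default_rng(0).standard_normal())
```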
III. INFO-SNOC MAIN ALGORITHM

We present the main algorithm for the architecture shown in Fig. 1 that integrates the learning method and the Info-SNOC method with rollout. We discuss an iterative solution method to Info-SNOC by projecting (17) to the gPC space. We formulate a deterministic optimal control problem in the gPC space and solve it using the SCP [6], [18], [19] method. The gPC projection of (17) is given by the following equation:
$$
\begin{aligned}
(X^*,\bar{u}^*) = \operatorname*{argmin}_{X(t),\,\bar{u}(t)} \;& \int_{t_0}^{t_f} \big((1-\rho)J_C + \rho J_I\big)\,dt + \mathbb{E}\big(J_f^{\mathrm{gPC}}\big)\\
\text{s.t.} \;\;& \{(19),\, (20),\, (21),\, (5)\}\\
& X(t_0) = X_0, \quad X(t_f) = X_f.
\end{aligned}
\qquad (22)
$$
In SCP, the projected dynamic model (19) is linearized about a feasible nominal trajectory and discretized to formulate a linear equality constraint. Note that the constraints (20) and (21) are already convex in the states X. The terminal cost J_f is projected to the quadratic form J_f^gPC = X_f^⊤ Φ̄^⊤ Q_f Φ̄ X_f. The information cost functional J_I from (11) is expressed as a function of X by using the polynomial representation [6] of μ_x in terms of X. Let S = (X, ū)^⊤; the cost J_I is linearized around a feasible nominal trajectory S^o = (X^o, ū^o)^⊤ to derive the linear convex cost functional

$$J_I^d = \sum_{i=1}^{n}\left( \sigma_i(S^o) + \left.\frac{\partial \sigma_i}{\partial S}\right|_{S^o} (S - S^o) \right). \qquad (23)$$

We use the convex approximation J_I^d as the information cost in the SCP formulation of the optimal control problem in (22).
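A minimal sketch of the linearization (23), approximating the gradient of σ_i by forward differences around the nominal point; sigma is a hypothetical callable that returns the vector of standard deviations σ_i(S).

```python
import numpy as np

# Sketch of the linearized information cost (23): sigma_i is evaluated at a
# nominal point S_o = (X_o, u_o) and its gradient is approximated by forward
# differences.  `sigma` is a hypothetical callable returning the vector sigma_i(S).

def linearized_info_cost(sigma, S_o, S, h=1e-6):
    S_o = np.asarray(S_o, dtype=float)
    S = np.asarray(S, dtype=float)
    sig_o = sigma(S_o)                                   # sigma_i at the nominal point
    grad = np.zeros((len(sig_o), len(S_o)))
    for j in range(len(S_o)):                            # forward-difference Jacobian
        e = np.zeros_like(S_o)
        e[j] = h
        grad[:, j] = (sigma(S_o + e) - sig_o) / h
    return np.sum(sig_o + grad @ (S - S_o))              # J_I^d in (23)
```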
In the gPC space, we split the problem into two cases: a) ρ = 0, which computes a performance trajectory, and b) ρ ∈ (0, 1], which computes an information trajectory, in order to have stable iterations. The main algorithm is outlined below.
Algorithm: We use an initial estimate of the model (8), learned from data generated by a known safe control policy, and a nominal trajectory (x^o, ū^o), computed using deterministic SCP [18] with the nominal model, to initialize Algorithm 1. The stochastic model and the chance constraints are projected to the gPC state space in line 2 of Algorithm 1. The projected dynamics is linearized around the nominal trajectory and used as a constraint in the SCP. The projection step is only needed in the first epoch; the projected system can be used directly for epoch > 1. The current estimated model is used to solve (22) using SCP, in line 7 with ρ = 0, for a performance trajectory (x_p, ū_p). The output (x_p, ū_p) of this optimization is used as initialization to the Info-SNOC problem obtained by setting ρ ∈ (0, 1] to compute the information trajectory (x_i, ū_i). The trajectory (x_i, ū_i) is then sampled for a safe motion plan (p̄_d, ū_d) in line 9, which is used for rollout, in line 10, to collect more data for learning. The SCP step is performed in the gPC space X. After each SCP step, the gPC coordinates X are projected back to the random variable x space. The Info-SNOC problem outputs a trajectory of the random variable x with finite variance at each epoch.
Algorithm 1 Info-SNOC using SCP [19] and gPC [6]
1: Initial Safe Set Data, Feasible Nominal Trajectory (x^o, ū^o)
2: gPC Projection as discussed in Sec. II-C
3: Linearize the gPC cost and dynamics, see [6]
4: epoch = 1
5: while Learning Criteria Not Satisfied do
6:   Learn ĝ ∼ N(μ_g, Σ_g) using Robust Regression
7:   (x_p, ū_p) = SCP((x^o, ū^o), ρ = 0), using (22)
8:   (x_i, ū_i) = SCP((x_p, ū_p), ρ ∈ (0, 1]), using (22)
9:   Sample (x_i, ū_i) for (p̄_d, ū_d)
10:  Rollout using sample (p̄_d, ū_d) and u_c
11:  Data collection during rollout
12:  epoch ← epoch + 1
13: end while
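The control flow of Algorithm 1 can be summarized with the following Python sketch; every callable (robust_regression, scp_solve, sample_plan, rollout) is a placeholder for the corresponding component described in Secs. II and III, and the learning-criteria check is reduced to a fixed epoch budget.

```python
# Sketch of the episodic loop in Algorithm 1.  The callables passed in are
# placeholders for the components described in Secs. II-III, not a published API.

def info_snoc_episodic(safe_set_data, nominal_traj, robust_regression,
                       scp_solve, sample_plan, rollout, max_epochs=10):
    data = safe_set_data
    x_o, u_o = nominal_traj                            # line 1: feasible nominal trajectory
    for epoch in range(1, max_epochs + 1):             # line 5: learning criteria simplified
        model = robust_regression(data)                # line 6: g_hat ~ N(mu_g, Sigma_g)
        x_p, u_p = scp_solve(x_o, u_o, model, rho=0.0)     # line 7: performance trajectory
        x_i, u_i = scp_solve(x_p, u_p, model, rho=0.5)     # line 8: information trajectory, rho in (0, 1]
        p_d, u_d = sample_plan(x_i, u_i)               # line 9: sample a safe motion plan
        data = data + rollout(p_d, u_d)                # lines 10-11: rollout and collect data
    return model
```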
Convergence and Optimality: The information trajectory (x_i, ū_i) computed using SCP with the approximate linear information cost J_I^d in (23) is a sub-optimal solution of (22) with the optimal cost value J_I^d. Therefore, the optimal cost of (22), given by J_I, is bounded above by J_I^d, i.e., J_I ≤ J_I^d. For the Info-SNOC algorithm, we cannot guarantee the convergence of SCP iterations to a Karush-Kuhn-Tucker point using the methods in [17], [19] due to the non-convexity of J_I. Due to the non-convex cost function J_I, the linear approximation J_I^d of the cost J_I can potentially lead to numerical instability in SCP iterations. Finding an initial performance trajectory with ρ = 0, and then optimizing for information with ρ ∈ (0, 1], is observed to be numerically stable compared to directly optimizing for the original cost functional J = J_C + J_I.
Feasibility: The initial phases of learning might lead to a large covariance Σ_g due to insufficient data, resulting in an infeasible optimal control problem. To overcome this, we use two strategies: 1) explore the initial safe set till we find a feasible solution to the problem, and 2) use slack variables on the terminal condition to approximately reach the goal, accounting for a large variance.
A. Rollout Policy Implementation

The information trajectory (x_i, ū_i) computed using the Info-SNOC algorithm is sampled for a pool of motion plans (p̄_d, ū_d). The trajectory pool is computed by randomly sampling the multivariate Gaussian distribution θ and transforming it using the gPC expansion x(θ) ≈ Φ̄(θ)X. For any realization θ̄ of θ, we get a deterministic trajectory p̄_d = Φ̄(θ̄)X that is ε-safe with respect to the distributionally robust chance constraints. The trajectory (p̄_d, ū_d) is executed using the closed-loop control law u_c = u_c(p̄, p̄_d, ū_d) for rollout, where p̄ is the current state. To ensure real-time safety during the initial stages of exploration, a safe control policy u_s is computed using a control barrier function-based safety filter. The properties of the control law u_c and the safety during rollout are studied in the following section.
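A minimal sketch of this sampling step for a single scalar state: each realization of θ is pushed through the Hermite basis to turn the gPC coefficient trajectory X(t) into one deterministic motion plan p̄_d(t). The coefficient array shape and the number of samples are assumptions for the example.

```python
import numpy as np
from numpy.polynomial import hermite_e as He

# Sketch of the trajectory-pool sampling in Sec. III-A: draw realizations of
# theta ~ N(0, 1) and evaluate p_d(t) = Phi_bar(theta) X(t).  X_traj holds the
# gPC coefficients of a single scalar state over time (shape T x (ell+1)); the
# multi-state case stacks such blocks.

def sample_trajectory_pool(X_traj, n_samples=10, rng=np.random.default_rng(1)):
    T, n_coeff = X_traj.shape
    pool = []
    for _ in range(n_samples):
        theta = rng.standard_normal()                 # one realization of theta
        basis = np.array([He.hermeval(theta, np.eye(n_coeff)[k]) for k in range(n_coeff)])
        pool.append(X_traj @ basis)                   # p_d(t) for this realization
    return np.array(pool)                             # shape (n_samples, T)
```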
IV. ANALYSIS

In this section, we present the main theoretical results analyzing the following two questions: 1) at any epoch i, how do learning errors translate to safety violation bounds during rollout, and 2) under what assumptions is the multivariate robust regression a consistent learning method as epoch → ∞. The analysis proves that if the Info-SNOC algorithm computes a motion plan with finite variance, then learning is consistent and implies safety during rollout.

Assumption 1. The projected problem (22) computes a feasible trajectory to the original problem (17). The assumption is generally true if we choose a sufficient number of polynomials [6] for the projection operation.
Assumption 2. The probabilistic inequality Pr(‖g(p̄, ū) − μ_g(p̄, ū)‖₂² ≤ c₁) ≥ 1 − ε_{ℓb} holds, where ε_{ℓb} is small, for the same input (p̄, ū) to the original model g and to the mean μ_g of the learned model. Using this inequality, we can say that the following bounds hold with high probability:

$$\|g(\bar{p},\bar{u}) - \mu_g(\bar{p},\bar{u})\|_2^2 \leq c_1, \qquad \operatorname{tr}(\Sigma_g) \leq c_2, \qquad (24)$$

where c₂ = c₁ ε_{ℓb}. As shown in [24], the mean predictions made by the model learned using robust regression are bounded by c₁, which depends on the choice of the function class for learning. The variance prediction Σ_g is bounded by design.
With Assumptions 1 and 2, the analysis is decomposed into
the following three subsections.
A. State Error Bounds During Rollout

The following assumptions are made on the nominal system ṗ̄ = f(p̄, u) to derive the state tracking error bound during rollout.

Assumption 3. There exists a globally exponentially stable (i.e., finite-gain L_p stable) tracking control law u_c = u_c(p̄, p̄_d, ū_d) for the nominal dynamics ṗ̄ = f(p̄, u_c). The control law u_c satisfies the property u_c(p̄_d, p̄_d, ū_d) = ū_d for any sampled trajectory (p̄_d, ū_d) from the information trajectory (x_i, ū_i). At any time t, the state p̄ satisfies the following inequality when the closed-loop control u_c is applied to the nominal dynamics:

$$M(\bar{p},t)\,\frac{\partial f}{\partial \bar{p}} + \left(\frac{\partial f}{\partial \bar{p}}\right)^{\!\top}\! M(\bar{p},t) + \frac{d}{dt}M(\bar{p},t) \;\leq\; -2\,\alpha\, M(\bar{p},t),$$

where f = f(p̄, u_c(p̄, p̄_d, ū_d)), α > 0, and M(p̄, t) is a uniformly positive definite matrix with λ_min(M)‖p̄‖² ≤ p̄^⊤ M(p̄, t) p̄ ≤ λ_max(M)‖p̄‖², where λ_max and λ_min are the maximum and minimum eigenvalues.

Assumption 4. The unknown model g satisfies the bound ‖g(p̄, u_c) − g(p̄_d, ū_d)‖₂² ≤ c₃.

Assumption 5. The probability density ratio ρ_{x_i(t)} / ρ_{x_i(0)} ≤ r is bounded, where ρ_{x_i(0)} and ρ_{x_i(t)} are the probability density functions of x_i at time t = 0 and t, respectively.
Lemma 3. Given that the estimated model (8) satisfies Assumption 2, and that the systems (7) and (9) satisfy Assumptions 3, 4, 5, if the control u_c = u_c(p̄, x_i, ū_i) is applied to the system (7), then the following condition holds at time t:

$$\mathbb{E}_{x_i(t)}\!\left(\|\bar{p} - x_i\|_2^2\right) \;\leq\; \frac{\lambda_{\max}(M)}{2\,\lambda_{\min}(M)\,\alpha_m}\,(c_1 + c_2 + c_3)\,r \;+\; \frac{\lambda_{\max}(M)\,r}{\lambda_{\min}(M)}\,\mathbb{E}\!\left(\|\bar{p}(0) - x_i(0)\|^2\right) e^{-2\alpha_m t}, \qquad (25)$$

where (x_i, ū_i) is computed from (22) and α_m = (α − 1). The states p̄ ∈ X and x_i ∈ X are feasible trajectories of the deterministic dynamics (7) and the SDE (9) for the initial conditions p̄(0) ∈ X and x_i(0) ∈ X, respectively, at t ≥ t_0.

Proof: See Lemma 2 in [7].

Lemma 3 states that the expected mean squared error E(‖p̄ − x_i‖²) is bounded by λ_max(M)(c₁ + c₂ + c₃) r / (2 α_m λ_min(M)) as t → ∞ when the control law u_c is applied to the dynamics in (7). The bounded tracking performance leads to constraint violation, which is studied in the next section.
B. Safety Bounds

The safety of the original system (7) with respect to the linear and quadratic chance constraints during rollout with the controller u_c discussed in Sec. IV-A is analyzed in Theorems 1 and 2.
Theorem 1. Given a feasible solution (x, ū_x) of (17), with the quadratic chance constraint Pr((x − μ_x)^⊤ A (x − μ_x) ≥ c) ≤ E((x − μ_x)^⊤ A (x − μ_x))/c ≤ ε_q, the trajectory p̄ of the deterministic dynamics (7) satisfies the following inequality at any time t:

$$(\bar{p} - \mu_x)^\top A\,(\bar{p} - \mu_x) \;\leq\; \lambda_{\max}(A)\;\mathbb{E}_x\!\left(\|\bar{p} - x\|_2^2\right), \qquad (26)$$

where E(‖p̄ − x‖₂²) is bounded as defined in Lemma 3.

Proof: Consider the expectation of the ellipsoidal set (p̄ − μ_x)^⊤ A (p̄ − μ_x). Using p̄ − μ_x = p̄ − x + x − μ_x, the expectation of the set can be expressed as follows:

$$\mathbb{E}\big((\bar{p}-\mu_x)^\top A(\bar{p}-\mu_x)\big) = \mathbb{E}\big((\bar{p}-x)^\top A(\bar{p}-x)\big) + \mathbb{E}\big((x-\mu_x)^\top A(x-\mu_x)\big) + 2\,\mathbb{E}\big((\bar{p}-x)^\top A(x-\mu_x)\big). \qquad (27)$$

Using the equality E((p̄ − x)^⊤ A (x − μ_x)) = −E((x − μ_x)^⊤ A (x − μ_x)) and E((x − μ_x)^⊤ A (x − μ_x)) ≥ 0 in (27), we obtain the constraint bound in (26). Using the feedback tracking bound (25)
in (26), we can show that the constraint violation bound is a function of the learning bounds c₁, c₂, and c₃.

Note that if the learning method converges, i.e., c₁ → 0, c₂ → 0, and c₃ → 0, then p̄ → μ_x. The quadratic constraint violation in (26) depends on the tracking error and the size of the ellipsoidal set described by A.
Theorem 2. Given a feasible solution (x, ū_x) of (17) with inf_{x∼(μ_x,Σ_x)} Pr(a^⊤x + b ≤ 0) ≥ 1 − ε_ℓ, the trajectory p̄ of the deterministic dynamics (7), with control u_c, satisfies the following inequality at any time t:

$$\inf_{x \sim (\mu_x,\Sigma_x)} \Pr\!\big(a^\top \bar{p} + b \leq \delta_\ell(x)\big) \;\geq\; 1 - \epsilon_\ell, \qquad (28)$$

where δ_ℓ(x) = ‖a‖₂ E_x(‖p̄ − x‖₂) − ‖a‖₂ √((1 − ε_ℓ)/ε_ℓ) √c₄, the term E_x(‖p̄ − x‖₂) is bounded via (25), and tr(Σ_x) = c₄.

Proof: From Lemma 1, the feasible solution (x, ū_x) satisfies the equivalent condition P(μ_x, Σ_x) ≤ 0, where P(μ_x, Σ_x) = a^⊤μ_x + b + √((1 − ε_ℓ)/ε_ℓ) √(a^⊤Σ_x a) for the risk measure ε_ℓ, mean μ_x, and covariance Σ_x. Consider the similar condition for the actual trajectory p̄(t), P(μ_p̄, Σ_p̄):

$$
\begin{aligned}
P(\mu_{\bar{p}},\Sigma_{\bar{p}}) \;=\;& a^\top \mu_{\bar{p}} + b + \sqrt{\tfrac{1-\epsilon_\ell}{\epsilon_\ell}}\sqrt{a^\top \Sigma_{\bar{p}}\, a}\\
=\;& a^\top \mu_x + a^\top(\mu_{\bar{p}} - \mu_x) + b + \sqrt{\tfrac{1-\epsilon_\ell}{\epsilon_\ell}}\sqrt{a^\top \Sigma_{\bar{p}}\, a} + \sqrt{\tfrac{1-\epsilon_\ell}{\epsilon_\ell}}\sqrt{a^\top \Sigma_x\, a} - \sqrt{\tfrac{1-\epsilon_\ell}{\epsilon_\ell}}\sqrt{a^\top \Sigma_x\, a}. \qquad (29)
\end{aligned}
$$

Note that since the system (7) is deterministic, we have μ_p̄ = p̄ and Σ_p̄ = 0. Using P(μ_x, Σ_x) ≤ 0, the right-hand side of the above equality reduces to the following:

$$P(\mu_{\bar{p}},\Sigma_{\bar{p}}) \;\leq\; a^\top(\bar{p} - \mu_x) - \sqrt{\tfrac{1-\epsilon_\ell}{\epsilon_\ell}}\sqrt{a^\top \Sigma_x\, a}. \qquad (30)$$

Using the decomposition Σ_x = G̃^⊤G̃, the Cauchy-Schwarz inequality, and Jensen's inequality, we have a^⊤(μ_p̄ − μ_x) ≤ ‖a‖₂ ‖p̄ − μ_x‖₂ ≤ ‖a‖₂ E_x(‖p̄ − x‖₂). Using the sub-multiplicative property of the ℓ₂-norm in the inequality above, we have P(μ_p̄, Σ_p̄) ≤ ‖a‖₂ (E(‖p̄ − x‖₂) − √((1 − ε_ℓ)/ε_ℓ) ‖G̃‖_F). Assuming that tr(Σ_x) = c₄ in the above inequality, we have P(μ_p̄, Σ_p̄) ≤ ‖a‖₂ E(‖p̄ − x‖₂) − ‖a‖₂ √((1 − ε_ℓ)/ε_ℓ) √c₄. The above inequality is equivalent to the probabilistic linear constraint in (28). The bound δ_ℓ(x) = ‖a‖₂ E(‖p̄ − x‖₂) − ‖a‖₂ √((1 − ε_ℓ)/ε_ℓ) √c₄ is a function of the learning bounds in (24), obtained by substituting the feedback tracking bound (25).

The linear constraint is offset by δ_ℓ, leading to constraint violation of the original formulation (13). Note that if c₁ → 0, c₂ → 0, c₃ → 0, and c₄ → 0, then δ_ℓ → 0. In order to ensure real-time safety during trajectory tracking, we use a high-gain control for disturbance attenuation with safety filter augmentation for constraint satisfaction.
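For a feel of the magnitudes involved, a small numeric sketch of the offset δ_ℓ from Theorem 2, with made-up values for the constraint normal, the risk measure, the tracking bound, and tr(Σ_x) = c₄:

```python
import numpy as np

a = np.array([1.0, 0.0])      # constraint normal from F_lin (assumed)
eps_l = 0.05                  # linear risk measure (assumed)
c4 = 0.01                     # tr(Sigma_x) of the planned trajectory (assumed)
tracking_bound = 0.02         # assumed bound on E(||p_bar - x||_2) from (25)

delta_l = (np.linalg.norm(a) * tracking_bound
           - np.linalg.norm(a) * np.sqrt((1.0 - eps_l) / eps_l) * np.sqrt(c4))
# A positive delta_l relaxes the half-space a^T p + b <= 0 (possible violation
# by up to delta_l); a negative value, as here, keeps the rollout inside it.
```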
C. Consistency

Data is collected during the rollout of the nonlinear system to learn a new model for the next epoch. For epoch e, the predictor ĝ_e follows a multivariate Gaussian distribution N(μ_g^e, Σ_g^e), and g is the empirical true data. We assume that the set X_e ⊂ X generated by the optimization problem in (17) for the first e iterations is a discretization of X. Assuming that there exists a global optimal predictor ĝ* in the function class G that can achieve the best error ε* at each epoch e,

$$\max_{x \in \mathcal{X}_e} \|g - \hat{g}^*\|_2 = \epsilon^*, \qquad (31)$$

the consistency of the learning algorithm is given by the convergence of the regret r_e, defined as the Euclidean distance between the predictor ĝ_e and the optimal predictor ĝ*:

$$r_e = \|\hat{g}_e - \hat{g}^*\|_2 \to 0, \quad \text{as } e \to \infty. \qquad (32)$$

We prove the consistency of Algorithm 1 in Theorem 3.
Theorem 3. If Assumption 2 holds with the maximum prediction error max_{x∈X_e} ‖g − μ_g^e‖₂² ≤ c₁^e at epoch e, then the regret r_e = ‖ĝ_e − ĝ*‖₂ of Algorithm 1 achieves the bound

$$r_e^2 \;\leq\; c_1^e + d\,C^2 + \epsilon^{*2} \quad \text{with probability } 1 - |\mathcal{X}_e|\,\delta_e, \qquad (33)$$

where

$$\delta_e \;\triangleq\; \left(C\sum_{p=1}^{d}\frac{d}{\kappa_p}\right)^{-1}\frac{1}{(2\pi)^{d/2}\,|\Sigma_g^e|^{1/2}}\;e^{-\frac{1}{2}C^2\sum_{p=1}^{d}\kappa_p}, \qquad (34)$$

with the output dimension d, κ_p ≜ ∑_{q=1}^{d} m_{qp} > 0, p ∈ {1, 2, ..., d}, and M = (Σ_g^e)^{-1} with (q,p)-th element m_{qp}. Furthermore, the regret r_e → 0 as e → ∞.

Proof: Using the inequality Pr(ĝ_e − μ_g^e ≥ C·e) < δ_e, ∀x ∈ X_e, since ĝ_e ∼ N(μ_g^e, Σ_g^e), we can bound r_e as follows:

$$r_e^2 \;\leq\; \|\mu_g^e + C\,\mathbf{e} - \hat{g}^*\|_2^2, \qquad (35)$$

where δ_e is defined in (34) and e is the unit vector in d dimensions. Hence, C is a function of |X_e|δ_e. This is the tail probability inequality [27] of multivariate Gaussian distributions. The error r_e² is then bounded using the empirical prediction error ‖μ_g^e − g‖₂² and the best prediction error (31) as r_e² ≤ ‖μ_g^e + C e − g‖₂² + ‖g − ĝ*‖₂². By using the triangular inequality, we get the bound in (33) as follows:

$$r_e^2 \;\leq\; \|\mu_g^e - g\|_2^2 + d\,C^2 + \epsilon^{*2} \;\leq\; c_1^e + d\,C^2 + \epsilon^{*2}. \qquad (36)$$

At any epoch e, we have ∀x ∈ X_e, |‖μ_g^e‖₂² + tr(Σ_g^e) − ‖g‖₂²| ≤ ω_e [24], where ω_e is a hyper-parameter associated with model selection in robust regression. Using tr(Σ_g^e) ≤ c_2^e, (35) becomes r_e² ≤ ω_e + c_2^e + dC² + ε*².

Assuming that the ground-truth model is in the function class G, we have ε* = 0. With infinite data, the hyper-parameter ω_e = 0 and c_2^e → 0 at each data point. For achieving the same value of δ_e, C → 0 with shrinking Σ_g^e. This implies that for large data, i.e., as e → ∞, we have r_e → 0. The regret r_e is bounded as shown in (33). Therefore, Algorithm 1 is consistent with high probability. Note that the Info-SNOC approach outputs trajectories of finite variance at epoch e, given that tr(Σ_g^e) is bounded.
During the initial exploration phase, the finite dimensional
approximation in the gPC projection might add to the error