Meta-Adaptive Nonlinear Control:
Theory and Algorithms
Guanya Shi, Kamyar Azizzadenesheli, Michael O’Connell, Soon-Jo Chung, Yisong Yue
Caltech; Purdue University
{gshi,moc,sjchung,yyue}@caltech.edu, kamyar@purdue.edu
Abstract
We present an online multi-task learning approach for adaptive nonlinear control, which we call Online Meta-Adaptive Control (OMAC). The goal is to control a nonlinear system subject to adversarial disturbance and unknown environment-dependent nonlinear dynamics, under the assumption that the environment-dependent dynamics can be well captured with some shared representation. Our approach is motivated by robot control, where a robotic system encounters a sequence of new environmental conditions that it must quickly adapt to. A key emphasis is to integrate online representation learning with established methods from control theory, in order to arrive at a unified framework that yields both control-theoretic and learning-theoretic guarantees. We provide instantiations of our approach under varying conditions, leading to the first non-asymptotic end-to-end convergence guarantee for multi-task nonlinear control. OMAC can also be integrated with deep representation learning. Experiments show that OMAC significantly outperforms conventional adaptive control approaches which do not learn the shared representation, in inverted pendulum and 6-DoF drone control tasks under varying wind conditions.¹
1 Introduction
One important goal in autonomy and artificial intelligence is to enable autonomous robots to learn from prior experience to quickly adapt to new tasks and environments. Examples abound in robotics, such as a drone flying in different wind conditions [1], a manipulator throwing varying objects [2], or a quadruped walking over changing terrains [3]. Though those examples provide encouraging empirical evidence, when designing such adaptive systems, two important theoretical challenges arise, as discussed below.
First, from a learning perspective, the system should be able to learn an “efficient” representation from prior tasks, thereby permitting faster future adaptation, which falls into the categories of representation learning or meta-learning. Recently, a line of work has shown theoretically that learning representations (in the standard supervised setting) can significantly reduce sample complexity on new tasks [4–6]. Empirically, deep representation learning or meta-learning has achieved success in many applications [7], including control, in the context of meta-reinforcement learning [8–10]. However, theoretical benefits (in the end-to-end sense) of representation learning or meta-learning for adaptive control remain unclear.
Second, from a control perspective, the agent should be able to handle parametric model uncertainties with control-theoretic guarantees such as stability and tracking error convergence, which is a common adaptive control problem [11, 12]. For classic adaptive control algorithms, theoretical analysis often involves the use of Lyapunov stability and asymptotic convergence [11, 12]. Moreover, many recent studies aim to integrate ideas from learning, optimization, and control theory to design and analyze adaptive controllers using learning-theoretic metrics. Typical results guarantee non-asymptotic convergence in finite time horizons, such as regret [13–17] and dynamic regret [18–20]. However, these results focus on a single environment or task. A multi-task extension, especially whether and how prior experience could benefit the adaptation in new tasks, remains an open problem.

¹ Code and video: https://github.com/GuanyaShi/Online-Meta-Adaptive-Control
Main contributions. In this paper, we address both learning and control challenges in a unified framework and provide end-to-end guarantees. We derive a new method of Online Meta-Adaptive Control (OMAC) that controls uncertain nonlinear systems under a sequence of new environmental conditions. The underlying assumption is that the environment-dependent unknown dynamics can be well captured by a shared representation, which OMAC learns using a meta-adapter. OMAC then performs environment-specific updates using an inner-adapter.

We provide different instantiations of OMAC under varying assumptions and conditions. In the jointly and element-wise convex cases, we show sublinear cumulative control error bounds, which to our knowledge is the first non-asymptotic convergence result for multi-task nonlinear control. Compared to standard adaptive control approaches that do not have a meta-adapter, we show that OMAC achieves both stronger guarantees and better empirical performance. We finally show how to integrate OMAC with deep representation learning, which further improves empirical performance.
2 Problem statement
We consider the setting where a controller encounters a sequence of $N$ environments, with each environment lasting $T$ time steps. We use outer iteration to refer to the iteration over the $N$ environments, and inner iteration to refer to the $T$ time steps within an environment.
Notations: We use superscripts (e.g., $(i)$ in $x_t^{(i)}$) to denote the index of the outer iteration, where $1 \le i \le N$, and subscripts (e.g., $t$ in $x_t^{(i)}$) to denote the time index of the inner iteration, where $1 \le t \le T$. We use step $(i,t)$ to refer to the inner time step $t$ at the $i$-th outer iteration. $\|\cdot\|$ denotes the 2-norm of a vector or the spectral norm of a matrix, $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, and $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue of a real symmetric matrix. $\mathrm{vec}(\cdot) \in \mathbb{R}^{mn}$ denotes the vectorization of an $m \times n$ matrix, and $\otimes$ denotes the Kronecker product. Finally, we use $u_{1:t}$ to denote the sequence $\{u_1, u_2, \cdots, u_t\}$.
We consider a discrete-time nonlinear control-affine system [21, 22] with environment-dependent uncertainty $f(x, c)$. The dynamics model at the $i$-th outer iteration is:
\[
x_{t+1}^{(i)} = f_0(x_t^{(i)}) + B(x_t^{(i)})\,u_t^{(i)} - f(x_t^{(i)}, c^{(i)}) + w_t^{(i)}, \quad 1 \le t \le T, \tag{1}
\]
where the state $x_t^{(i)} \in \mathbb{R}^n$, the control $u_t^{(i)} \in \mathbb{R}^m$, $f_0: \mathbb{R}^n \to \mathbb{R}^n$ is a known nominal dynamics model, $B: \mathbb{R}^n \to \mathbb{R}^{n \times m}$ is a known state-dependent actuation matrix, $c^{(i)} \in \mathbb{R}^h$ is the unknown parameter that indicates an environmental condition, $f: \mathbb{R}^n \times \mathbb{R}^h \to \mathbb{R}^n$ is the unknown $c^{(i)}$-dependent dynamics model, and $w_t^{(i)}$ is disturbance, potentially adversarial. For simplicity we define $B_t^{(i)} = B(x_t^{(i)})$ and $f_t^{(i)} = f(x_t^{(i)}, c^{(i)})$.
Interaction protocol. We study the following adaptive nonlinear control problem under $N$ unknown environments. At the beginning of outer iteration $i$, the environment first selects $c^{(i)}$ (adaptively and adversarially), which is unknown to the controller, and then the controller makes decisions $u_{1:T}^{(i)}$ under unknown dynamics $f(x_t^{(i)}, c^{(i)})$ and potentially adversarial disturbances $w_t^{(i)}$. To summarize:
1. Outer iteration $i$. A policy encounters environment $i$ ($i \in \{1, \ldots, N\}$), associated with unobserved variable $c^{(i)}$ (e.g., the wind condition for a flying drone). Run inner loop (Step 2).
2. Inner loop. Policy interacts with environment $i$ for $T$ time steps, observing $x_t^{(i)}$ and taking action $u_t^{(i)}$, with state/action dynamics following (1).
3. Policy optionally observes $c^{(i)}$ at the end of the inner loop (used for some variants of the analysis).
4. Increment $i = i + 1$ and repeat from Step 1.
We use average control error (ACE) as our performance metric:
Definition 1 (Average control error). The average control error (ACE) of $N$ outer iterations (i.e., $N$ environments), with each lasting $T$ time steps, is defined as
\[
\mathrm{ACE} = \frac{1}{TN} \sum_{i=1}^{N} \sum_{t=1}^{T} \|x_t^{(i)}\|.
\]
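For concreteness, the metric can be computed directly from logged state trajectories. The sketch below is a minimal NumPy implementation; the array layout (an $N \times T \times n$ tensor of states) is an assumption made here for illustration.

```python
import numpy as np

def average_control_error(states):
    """Average control error (ACE) from Definition 1.

    states: array of shape (N, T, n) holding x_t^{(i)} for
            i = 1..N outer iterations and t = 1..T inner steps.
    Returns (1 / (T N)) * sum_i sum_t ||x_t^{(i)}||.
    """
    N, T, _ = states.shape
    return np.linalg.norm(states, axis=-1).sum() / (T * N)
```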
ACE can be viewed as a non-asymptotic generalization of the steady-state error in control [23]. We make the following assumptions on the actuation matrix $B$, the nominal dynamics $f_0$, and the disturbance $w_t^{(i)}$:
Assumption 1 (Full actuation, bounded disturbance, and e-ISS assumptions). We consider fully-actuated systems, i.e., for all $x$, $\mathrm{rank}(B(x)) = n$, and $\|w_t^{(i)}\| \le W, \forall t, i$. Moreover, the nominal dynamics $f_0$ is exponentially input-to-state stable (e-ISS): let constants $\beta, \gamma \ge 0$ and $0 \le \rho < 1$. For a sequence $v_{1:t-1} \in \mathbb{R}^n$, consider the dynamics $x_{k+1} = f_0(x_k) + v_k,\ 1 \le k \le t-1$. Then $x_t$ satisfies:
\[
\|x_t\| \le \beta \rho^{t-1} \|x_1\| + \gamma \sum_{k=1}^{t-1} \rho^{t-1-k} \|v_k\|. \tag{2}
\]
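To build intuition for the bound in (2), the following is a minimal numerical sketch, assuming a scalar nominal model $f_0(x) = 0.5x$ (for which $\beta = \gamma = 1$ and $\rho = 0.5$); it simulates the perturbed dynamics and checks that the e-ISS inequality holds along the trajectory. The perturbation sequence and horizon are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, gamma, rho = 1.0, 1.0, 0.5   # e-ISS constants for f_0(x) = 0.5 x
f0 = lambda x: 0.5 * x

T = 50
x = np.zeros(T)
x[0] = 2.0                                  # x_1
v = rng.uniform(-0.3, 0.3, size=T)          # bounded perturbation sequence v_k

for k in range(T - 1):
    x[k + 1] = f0(x[k]) + v[k]              # x_{k+1} = f_0(x_k) + v_k

# Check the e-ISS bound (2) at every time step t (1-indexed as in the paper).
for t in range(1, T + 1):
    bound = beta * rho ** (t - 1) * abs(x[0]) \
          + gamma * sum(rho ** (t - 1 - k) * abs(v[k - 1]) for k in range(1, t))
    assert abs(x[t - 1]) <= bound + 1e-12
print("e-ISS bound (2) holds along the simulated trajectory")
```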
With the e-ISS property in Assumption 1, we have the following bound that connects ACE with the average squared loss between $B_t^{(i)} u_t^{(i)} + w_t^{(i)}$ and $f_t^{(i)}$.
Lemma 1. Assume $x_1^{(i)} = 0, \forall i$. The average control error (ACE) is bounded as:
\[
\frac{1}{TN}\sum_{i=1}^{N}\sum_{t=1}^{T} \|x_t^{(i)}\| \;\le\; \frac{\gamma}{1-\rho}\sqrt{\frac{\sum_{i=1}^{N}\sum_{t=1}^{T} \|B_t^{(i)} u_t^{(i)} - f_t^{(i)} + w_t^{(i)}\|^2}{TN}}. \tag{3}
\]
The proof can be found in Appendix A.1. We assume $x_1^{(i)} = 0$ for simplicity: the influence of a non-zero and bounded $x_1^{(i)}$ is a constant term in each outer iteration, from the e-ISS property (2).
Remark on the e-ISS assumption and the ACE metric. Note that an exponentially stable linear system $f_0(x_t) = A x_t$ (i.e., the spectral radius of $A$ is $< 1$) satisfies the exponential ISS (e-ISS) assumption. However, in nonlinear systems e-ISS is a stronger assumption than exponential stability. For both linear and nonlinear systems, the e-ISS property of $f_0$ is usually achieved by applying some stabilizing feedback controller to the system², i.e., $f_0$ is the closed-loop dynamics [11, 24, 16]. The e-ISS assumption is standard in both online adaptive linear control [24, 16, 14] and nonlinear control [13], and practical in robotic control such as drones [25]. In ACE we consider a regulation task, but it can also capture trajectory tracking tasks with time-variant nominal dynamics $f_0$ under incremental stability assumptions [13]. We only consider the regulation task in this paper for simplicity.
Generality. We would like to emphasize the generality of our dynamics model (1). The nominal control-affine part can model general fully-actuated robotic systems via Euler-Lagrange equations [21, 22], and the unknown part $f(x, c)$ is nonlinear in $x$ and $c$. We only need to assume the disturbance $w_t^{(i)}$ is bounded, which is more general than the stochastic settings in the linear [14–16] and nonlinear [13] cases. For example, $w_t^{(i)}$ can model extra $(x, u, c)$-dependent uncertainties or adversarial disturbances. Moreover, the environment sequence $c^{(1:N)}$ could also be adversarial. In terms of the extension to under-actuated systems, all the results in this paper hold for the matched uncertainty setting [13], i.e., in the form $x_{t+1}^{(i)} = f_0(x_t^{(i)}) + B(x_t^{(i)})\big(u_t^{(i)} - f(x_t^{(i)}, c^{(i)})\big) + w_t^{(i)}$, where $B(x_t^{(i)})$ is not necessarily full rank (e.g., the drone and inverted pendulum experiments in Section 5). Generalizing to other under-actuated systems is interesting future work.
3 Online meta-adaptive control (OMAC) algorithm
The design of our online meta-adaptive control (OMAC) approach comprises four pieces: the policy class, the function class, the inner loop (within environment) adaptation algorithm $A_2$, and the outer loop (between environment) online learning algorithm $A_1$.
Policy class. We focus on the class of certainty-equivalence controllers [13, 26, 14], which is a general class of model-based controllers that also includes linear feedback controllers commonly studied in online control [26, 14]. After a model is learned from past data, a controller is designed by treating the learned model as the truth [12]. Formally, at step $(i,t)$, the controller first estimates $\hat f_t^{(i)}$ (an estimate of $f_t^{(i)}$) based on past data, and then executes $u_t^{(i)} = B_t^{(i)\dagger} \hat f_t^{(i)}$, where $(\cdot)^\dagger$ is the pseudo-inverse. Note that from Lemma 1, the average control error of the omniscient controller using $\hat f_t^{(i)} = f_t^{(i)}$ (i.e., the controller perfectly knows $f(x, c)$) is upper bounded as³: $\mathrm{ACE}(\text{omniscient}) \le \gamma W / (1 - \rho)$. $\mathrm{ACE}(\text{omniscient})$ can be viewed as a fundamental limit of the certainty-equivalence policy class.

² For example, consider $x_{t+1} = \frac{3}{2} x_t + 2\sin x_t + \bar u_t$. With a feedback controller $\bar u_t = u_t - x_t - 2\sin x_t$, the closed-loop dynamics is $x_{t+1} = \frac{1}{2} x_t + u_t$, where $f_0(x) = \frac{1}{2} x$ is e-ISS.
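As a deliberately minimal sketch of the certainty-equivalence step, the snippet below forms $u_t^{(i)} = B_t^{(i)\dagger} \hat f_t^{(i)}$ with NumPy's pseudo-inverse; the shapes and the specific $B$ and $\hat f$ values are illustrative only and not taken from the paper.

```python
import numpy as np

def certainty_equivalence_control(B_t, f_hat_t):
    """u = B(x_t)^dagger @ f_hat_t, the certainty-equivalence policy."""
    return np.linalg.pinv(B_t) @ f_hat_t

# Illustrative example with n = 2 states and m = 2 inputs (fully actuated).
B_t = np.array([[1.0, 0.2],
                [0.0, 1.5]])
f_hat_t = np.array([0.3, -0.1])   # current estimate of f(x_t, c)
u_t = certainty_equivalence_control(B_t, f_hat_t)
print(u_t)
```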
Function class $F$ for representation learning. In OMAC, we need to define a function class $F(\phi(x; \hat\Theta), \hat c)$ to compute $\hat f_t^{(i)}$ (i.e., $\hat f_t^{(i)} = F(\phi(x_t^{(i)}; \hat\Theta), \hat c)$), where $\phi$ (parameterized by $\hat\Theta$) is a representation shared by all environments, and $\hat c$ is an environment-specific latent state. From a theoretical perspective, the main consideration in the choice of $F(\phi(x), \hat c)$ is how it affects the resulting learning objective. For instance, a $\phi$ represented by a Deep Neural Network (DNN) would lead to a highly non-convex learning objective. In this paper, we focus on the setting $\hat\Theta \in \mathbb{R}^p$, $\hat c \in \mathbb{R}^h$, and $p \gg h$, i.e., it is much more expensive to learn the shared representation $\phi$ (e.g., a DNN) than to “fine-tune” via $\hat c$ in a specific environment, which is consistent with meta-learning [9] and representation learning [4, 7] practices.
Inner loop adaptive control. We take a modular approach in our algorithm design, in order to cleanly leverage state-of-the-art methods from online learning, representation learning, and adaptive control. When interacting with a single environment (for $T$ time steps), we keep the learned representation $\phi$ fixed, and use that representation for adaptive control by treating $\hat c$ as an unknown low-dimensional parameter. We can utilize any adaptive control method such as online gradient descent, velocity gradient, or composite adaptation [27, 13, 11, 1].
Outer loop online learning. We treat the outer loop (which iterates between environments) as an online learning problem, where the goal is to learn the shared representation $\phi$ that optimizes the inner loop adaptation performance. Theoretically, we can reason about the analysis by setting up a hierarchical or nested online learning procedure (adaptive control nested within online learning).
Design goal. Our goal is to design a meta-adaptive controller that has low ACE, ideally converging to $\mathrm{ACE}(\text{omniscient})$ as $T, N \to \infty$. In other words, OMAC should converge to performing as well as the omniscient controller with perfect knowledge of $f(x, c)$.
Algorithm 1 describes the OMAC algorithm. Since $\phi$ is environment-invariant and $p \gg h$, we only adapt $\hat\Theta$ at the end of each outer iteration. On the other hand, because $c^{(i)}$ varies in different environments, we adapt $\hat c$ at each inner step. As shown in Algorithm 1, at step $(i,t)$, after applying $u_t^{(i)}$, the controller observes the next state $x_{t+1}^{(i)}$ and computes $y_t^{(i)} \triangleq f_0(x_t^{(i)}) + B_t^{(i)} u_t^{(i)} - x_{t+1}^{(i)} = f_t^{(i)} - w_t^{(i)}$, which is a disturbed measurement of the ground truth $f_t^{(i)}$. We then define $\ell_t^{(i)}(\hat\Theta, \hat c) \triangleq \|F(\phi(x_t^{(i)}; \hat\Theta), \hat c) - y_t^{(i)}\|^2$ as the observed loss at step $(i,t)$, which is a squared loss between the disturbed measurement of $f_t^{(i)}$ and the model prediction $F(\phi(x_t^{(i)}; \hat\Theta), \hat c)$.
Instantiations. Depending on $\{F(\phi(x; \hat\Theta), \hat c), A_1, A_2, \mathrm{ObserveEnv}\}$, we consider three settings:
- Convex case (Section 4.1): The observed loss $\ell_t^{(i)}$ is convex with respect to $\hat\Theta$ and $\hat c$.
- Element-wise convex case (Section 4.2): Fixing $\hat\Theta$ or $\hat c$, $\ell_t^{(i)}$ is convex with respect to the other.
- Deep learning case (Section 5): In this case, we use a DNN with weights $\hat\Theta$ to represent $\phi$.
4 Main theoretical results
4.1 Convex case
In this subsection, we focus on a setting where the observed loss $\ell_t^{(i)}(\hat\Theta, \hat c) = \|F(\phi(x; \hat\Theta), \hat c) - y_t^{(i)}\|^2$ is convex with respect to $\hat\Theta$ and $\hat c$. We provide the following example to illustrate this case and highlight its difference from conventional adaptive control (e.g., [13, 12]).
³ This upper bound is tight. Consider a scalar system $x_{t+1} = a x_t + u_t - f(x_t) + w$ with $|a| < 1$ and $w$ a constant. In this case $\rho = |a|$, $\gamma = 1$, and the omniscient controller $u_t = f(x_t)$ yields $\mathrm{ACE} = \gamma |w| / (1 - \rho)$.
Algorithm 1: Online Meta-Adaptive Control (OMAC) algorithm

Input: Meta-adapter $A_1$; inner-adapter $A_2$; model $F(\phi(x; \hat\Theta), \hat c)$; Boolean ObserveEnv
for $i = 1, \cdots, N$ do
    The environment selects $c^{(i)}$
    for $t = 1, \cdots, T$ do
        Compute $\hat f_t^{(i)} = F(\phi(x_t^{(i)}; \hat\Theta^{(i)}), \hat c_t^{(i)})$
        Execute $u_t^{(i)} = B_t^{(i)\dagger} \hat f_t^{(i)}$   // certainty-equivalence policy
        Observe $x_{t+1}^{(i)}$, $y_t^{(i)} = f(x_t^{(i)}, c^{(i)}) - w_t^{(i)}$, and $\ell_t^{(i)}(\hat\Theta, \hat c) = \|F(\phi(x_t^{(i)}; \hat\Theta), \hat c) - y_t^{(i)}\|^2$   // $y_t^{(i)}$ is a noisy measurement of $f$ and $\ell_t^{(i)}$ is the observed loss
        Construct an inner cost function $g_t^{(i)}(\hat c)$ by $A_2$   // $g_t^{(i)}$ is a function of $\hat c$
        Inner-adaptation: $\hat c_{t+1}^{(i)} \leftarrow A_2(\hat c_t^{(i)}, g_{1:t}^{(i)})$
    if ObserveEnv then Observe $c^{(i)}$   // only used in some instantiations
    Construct an outer cost function $G^{(i)}(\hat\Theta)$ by $A_1$   // $G^{(i)}$ is a function of $\hat\Theta$
    Meta-adaptation: $\hat\Theta^{(i+1)} \leftarrow A_1(\hat\Theta^{(i)}, G^{(1:i)})$
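For readers who prefer code, the following is a minimal Python sketch of the OMAC loop above. The environment interface and helper names (`env.reset`, `env.step`, `env.actuation`, `env.f0`, `env.reveal_c`, `A1_update`, `A2_update`) are hypothetical stand-ins for illustration, not the interface of the paper's released implementation.

```python
import numpy as np

def omac(env, F, A1_update, A2_update, Theta_hat, c_hat_init, N, T, observe_env=False):
    """Skeleton of Algorithm 1 (OMAC). All arguments are illustrative stand-ins:
    - env.reset(i) starts outer iteration i (environment selects c^(i)) and returns x_1
    - env.step(u) applies the control and returns x_{t+1}
    - F(x, Theta_hat, c_hat) returns the prediction of f(x, c)
    - A2_update(c_hat, x, y, Theta_hat) is one inner-adapter step (e.g., OGD on the loss)
    - A1_update(Theta_hat, data, c_true) is the meta-adapter step at the end of an outer iteration
    """
    for i in range(N):                          # outer iterations (environments)
        x = env.reset(i)
        c_hat = c_hat_init.copy()
        data = []                               # measurements kept for the meta-adapter
        for t in range(T):                      # inner iterations (time steps)
            f_hat = F(x, Theta_hat, c_hat)      # model prediction
            B = env.actuation(x)                # known B(x)
            u = np.linalg.pinv(B) @ f_hat       # certainty-equivalence control
            x_next = env.step(u)
            y = env.f0(x) + B @ u - x_next      # y = f(x, c) - w, noisy measurement of f
            c_hat = A2_update(c_hat, x, y, Theta_hat)   # inner-adaptation
            data.append((x, y, c_hat))
            x = x_next
        c_true = env.reveal_c(i) if observe_env else None
        Theta_hat = A1_update(Theta_hat, data, c_true)  # meta-adaptation
    return Theta_hat
```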
Example 1. We consider a model $F(\phi(x; \hat\Theta), \hat c) = Y_1(x)\hat\Theta + Y_2(x)\hat c$ to estimate $f$:
\[
\hat f_t^{(i)} = Y_1(x_t^{(i)})\,\hat\Theta^{(i)} + Y_2(x_t^{(i)})\,\hat c_t^{(i)}, \tag{4}
\]
where $Y_1: \mathbb{R}^n \to \mathbb{R}^{n \times p}$ and $Y_2: \mathbb{R}^n \to \mathbb{R}^{n \times h}$ are two known bases. Note that conventional adaptive control approaches typically concatenate $\hat\Theta$ and $\hat c$ and adapt both at each time step, regardless of the environment changes (e.g., [13]). Since $p \gg h$, such concatenation is computationally much more expensive than OMAC, which only adapts $\hat\Theta$ in outer iterations.
Because $\ell_t^{(i)}(\hat\Theta, \hat c)$ is jointly convex with respect to $\hat\Theta$ and $\hat c$, the OMAC algorithm in this case falls into the category of Nested Online Convex Optimization (Nested OCO) [28]. The choices of $g_t^{(i)}$, $G^{(i)}$, $A_1$, $A_2$ and ObserveEnv are depicted in Table 1. Note that in the convex case OMAC does not need to know $c^{(i)}$ in the whole process (ObserveEnv = False).
Table 1: OMAC with convex loss
- $F(\phi(x; \hat\Theta), \hat c)$: Any $F$ model such that $\ell_t^{(i)}(\hat\Theta, \hat c)$ is convex (e.g., Example 1)
- $g_t^{(i)}(\hat c)$: $\nabla_{\hat c}\,\ell_t^{(i)}(\hat\Theta^{(i)}, \hat c_t^{(i)}) \cdot \hat c$
- $G^{(i)}(\hat\Theta)$: $\sum_{t=1}^{T} \nabla_{\hat\Theta}\,\ell_t^{(i)}(\hat\Theta^{(i)}, \hat c_t^{(i)}) \cdot \hat\Theta$
- $A_1$: With a convex set $\mathcal{K}_1$, $A_1$ initializes $\hat\Theta^{(1)} \in \mathcal{K}_1$ and returns $\hat\Theta^{(i+1)} \in \mathcal{K}_1, \forall i$. $A_1$ has sublinear regret, i.e., the total regret of $A_1$ is $T \cdot o(N)$ (e.g., online gradient descent)
- $A_2$: With a convex set $\mathcal{K}_2$, $\forall i$, $A_2$ initializes $\hat c_1^{(i)} \in \mathcal{K}_2$ and returns $\hat c_{t+1}^{(i)} \in \mathcal{K}_2, \forall t$. $A_2$ has sublinear regret, i.e., the total regret of $A_2$ is $N \cdot o(T)$ (e.g., online gradient descent)
- ObserveEnv: False
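As a concrete instantiation of $A_1$ and $A_2$ in Table 1, the sketch below implements projected online gradient descent on the linearized costs, with the convex sets $\mathcal{K}_1$ and $\mathcal{K}_2$ taken to be Euclidean norm balls (the choice also used later in Corollary 4). The learning-rate arguments are placeholders.

```python
import numpy as np

def project_ball(z, radius):
    """Euclidean projection onto the norm ball {z : ||z|| <= radius}."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else z * (radius / norm)

def ogd_inner_step(c_hat, grad_c, eta, K_c):
    """A2: one projected OGD step on the linearized inner cost g_t(c) = <grad_c, c>."""
    return project_ball(c_hat - eta * grad_c, K_c)

def ogd_meta_step(Theta_hat, grads_Theta, eta_bar, K_Theta):
    """A1: one projected OGD step on the outer cost G^(i)(Theta) = sum_t <grad_t, Theta>."""
    return project_ball(Theta_hat - eta_bar * sum(grads_Theta), K_Theta)
```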
As shown in Table 1, at the end of step $(i,t)$ we fix $\hat\Theta = \hat\Theta^{(i)}$ and update $\hat c_{t+1}^{(i)} \in \mathcal{K}_2$ using $A_2(\hat c_t^{(i)}, g_{1:t}^{(i)})$, which is an OCO problem with linear costs $g_{1:t}^{(i)}$. On the other hand, at the end of outer iteration $i$, we update $\hat\Theta^{(i+1)} \in \mathcal{K}_1$ using $A_1(\hat\Theta^{(i)}, G^{(1:i)})$, which is another OCO problem with linear costs $G^{(1:i)}$. From [28], we have the following regret relationship:
Lemma 2 (Nested OCO regret bound, [28]). OMAC (Algorithm 1) specified by Table 1 has regret:
\[
\sum_{i=1}^{N}\sum_{t=1}^{T} \ell_t^{(i)}(\hat\Theta^{(i)}, \hat c_t^{(i)}) - \min_{\Theta \in \mathcal{K}_1} \sum_{i=1}^{N} \min_{c^{(i)} \in \mathcal{K}_2} \sum_{t=1}^{T} \ell_t^{(i)}(\Theta, c^{(i)})
\le \underbrace{\sum_{i=1}^{N} G^{(i)}(\hat\Theta^{(i)}) - \min_{\Theta \in \mathcal{K}_1} \sum_{i=1}^{N} G^{(i)}(\Theta)}_{\text{the total regret of } A_1,\ T \cdot o(N)}
+ \underbrace{\sum_{i=1}^{N}\sum_{t=1}^{T} g_t^{(i)}(\hat c_t^{(i)}) - \sum_{i=1}^{N}\min_{c^{(i)} \in \mathcal{K}_2} \sum_{t=1}^{T} g_t^{(i)}(c^{(i)})}_{\text{the total regret of } A_2,\ N \cdot o(T)}. \tag{5}
\]
Note that the total regret of $A_1$ is $T \cdot o(N)$ because $G^{(i)}$ is scaled up by a factor of $T$. With Lemmas 1 and 2, we have the following guarantee for the average control error.
Theorem 3 (OMAC ACE bound with convex loss). Assume the unknown dynamics model is $f(x, c) = F(\phi(x; \Theta), c)$. Assume the true parameters $\Theta \in \mathcal{K}_1$ and $c^{(i)} \in \mathcal{K}_2, \forall i$. Then OMAC (Algorithm 1) specified by Table 1 ensures the following ACE guarantee:
\[
\mathrm{ACE} \le \frac{\gamma}{1-\rho}\sqrt{W^2 + \frac{o(T)}{T} + \frac{o(N)}{N}}.
\]
To further understand Theorem 3 and compare OMAC with conventional adaptive control approaches,
we provide the following corollary using the model in Example 1.
Corollary 4. Suppose the unknown dynamics model is $f(x, c) = Y_1(x)\Theta + Y_2(x)c$, where $Y_1: \mathbb{R}^n \to \mathbb{R}^{n \times p}$ and $Y_2: \mathbb{R}^n \to \mathbb{R}^{n \times h}$ are known bases. We assume $\|\Theta\| \le K_\Theta$ and $\|c^{(i)}\| \le K_c, \forall i$. Moreover, we assume $\|Y_1(x)\| \le K_1$ and $\|Y_2(x)\| \le K_2, \forall x$. In Table 1 we use $\hat f_t^{(i)} = Y_1(x_t^{(i)})\hat\Theta^{(i)} + Y_2(x_t^{(i)})\hat c_t^{(i)}$, and Online Gradient Descent (OGD) [27] for both $A_1$ and $A_2$, with learning rates $\bar\eta^{(i)}$ and $\eta_t^{(i)}$ respectively. We set $\mathcal{K}_1 = \{\hat\Theta : \|\hat\Theta\| \le K_\Theta\}$ and $\mathcal{K}_2 = \{\hat c : \|\hat c\| \le K_c\}$. If we schedule the learning rates as:
\[
\bar\eta^{(i)} = \frac{2 K_\Theta}{\underbrace{(4K_1^2 K_\Theta + 4K_1 K_2 K_c + 2K_1 W)}_{C_1}\, T\sqrt{i}}, \qquad
\eta_t^{(i)} = \frac{2 K_c}{\underbrace{(4K_2^2 K_c + 4K_1 K_2 K_\Theta + 2K_2 W)}_{C_2}\, \sqrt{t}},
\]
then the average control performance is bounded as:
\[
\mathrm{ACE}(\text{OMAC}) \le \frac{\gamma}{1-\rho}\sqrt{W^2 + 3\left(K_\Theta C_1 \frac{1}{\sqrt{N}} + K_c C_2 \frac{1}{\sqrt{T}}\right)}.
\]
Moreover, the baseline adaptive control approach which uses OGD to adapt $\hat\Theta$ and $\hat c$ at each time step satisfies:
\[
\mathrm{ACE}(\text{baseline adaptive control}) \le \frac{\gamma}{1-\rho}\sqrt{W^2 + 3\sqrt{K_\Theta^2 + K_c^2}\,\sqrt{C_1^2 + C_2^2}\,\frac{1}{\sqrt{T}}}.
\]
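To make the learning-rate schedule above concrete, here is a small helper, assuming the norm bounds $K_\Theta, K_c, K_1, K_2$ and the disturbance bound $W$ are known; the numerical values in the usage example are illustrative only.

```python
import numpy as np

def corollary4_learning_rates(K_Theta, K_c, K1, K2, W, T, i, t):
    """Learning-rate schedules from Corollary 4 (illustrative helper).

    Returns (eta_bar_i, eta_i_t): the meta-adapter rate for outer iteration i
    and the inner-adapter rate for inner step t.
    """
    C1 = 4 * K1**2 * K_Theta + 4 * K1 * K2 * K_c + 2 * K1 * W
    C2 = 4 * K2**2 * K_c + 4 * K1 * K2 * K_Theta + 2 * K2 * W
    eta_bar_i = 2 * K_Theta / (C1 * T * np.sqrt(i))
    eta_i_t = 2 * K_c / (C2 * np.sqrt(t))
    return eta_bar_i, eta_i_t

# Example: K_Theta=1, K_c=0.5, K1=2, K2=1, W=0.1, T=200, outer iteration 3, inner step 10
print(corollary4_learning_rates(1.0, 0.5, 2.0, 1.0, 0.1, T=200, i=3, t=10))
```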
Note that $\mathrm{ACE}(\text{baseline adaptive control})$ does not improve as $N$ increases (i.e., encountering more environments has no benefit). If $p \gg h$, we potentially have $K_1 \gg K_2$ and $K_\Theta \gg K_c$, so $C_1 \gg C_2$. Therefore, the ACE upper bound of OMAC is better than that of the baseline adaptation if $N > T$, which is consistent with recent representation learning results [4, 5]. It is also worth noting that the baseline adaptation is much more computationally expensive, because it needs to adapt both $\hat\Theta$ and $\hat c$ at each time step. Intuitively, OMAC improves convergence because the meta-adapter $A_1$ approximates the environment-invariant part $Y_1(x)\Theta$, which makes the inner-adapter $A_2$ much more efficient.
4.2 Element-wise convex case
In this subsection, we consider the setting where the loss function $\ell_t^{(i)}(\hat\Theta, \hat c)$ is element-wise convex with respect to $\hat\Theta$ and $\hat c$, i.e., when one of $\hat\Theta$ or $\hat c$ is fixed, $\ell_t^{(i)}$ is convex with respect to the other one. For instance, $F$ could be the bilinear function depicted in Example 2. Outside the context of control, such bilinear models are also studied in representation learning [4, 5] and factorization bandits [29, 30].
Example 2. We consider a model $F(\phi(x; \hat\Theta), \hat c) = Y(x)\hat\Theta\hat c$ to estimate $f$:
\[
\hat f_t^{(i)} = Y(x_t^{(i)})\,\hat\Theta^{(i)}\,\hat c_t^{(i)}, \tag{6}
\]
where $Y: \mathbb{R}^n \to \mathbb{R}^{n \times \bar p}$ is a known basis, $\hat\Theta^{(i)} \in \mathbb{R}^{\bar p \times h}$, and $\hat c_t^{(i)} \in \mathbb{R}^h$. Note that the dimensionality of $\hat\Theta$ is $p = \bar p h$. Conventional adaptive control typically views $\hat\Theta\hat c$ as a vector in $\mathbb{R}^{\bar p}$ and adapts it at each time step regardless of the environment changes [13].
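A minimal sketch of the bilinear model in Example 2 and its observed squared loss, assuming the basis $Y(x)$ has already been evaluated at the current state (the experiments in Section 5 build it from random Fourier features); the shapes below are illustrative.

```python
import numpy as np

def bilinear_predict(Y_x, Theta_hat, c_hat):
    """F(phi(x; Theta_hat), c_hat) = Y(x) @ Theta_hat @ c_hat  (Example 2).

    Y_x:       (n, p_bar) basis evaluated at the current state
    Theta_hat: (p_bar, h) shared representation parameters
    c_hat:     (h,)       environment-specific latent state
    """
    return Y_x @ Theta_hat @ c_hat

def observed_loss(Y_x, Theta_hat, c_hat, y):
    """Squared loss l_t(Theta_hat, c_hat) = ||F(...) - y||^2."""
    r = bilinear_predict(Y_x, Theta_hat, c_hat) - y
    return r @ r
```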
Compared to Section 4.1, the element-wise convex case poses new challenges: i) the coupling between $\hat\Theta$ and $\hat c$ brings inherent non-identifiability issues; and ii) statistical learning guarantees typically need i.i.d. assumptions on $c^{(i)}$ and $x_t^{(i)}$ [4, 5]. These challenges are further amplified by the underlying unknown nonlinear system (1). Therefore in this section we set ObserveEnv = True, i.e., the controller has access to the true environmental condition $c^{(i)}$ at the end of the $i$-th outer iteration, which is practical in many systems where $c^{(i)}$ has a concrete physical meaning, e.g., drones with wind disturbances [1, 31].
Table 2: OMAC with element-wise convex loss
- $F(\phi(x; \hat\Theta), \hat c)$: Any $F$ model such that $\ell_t^{(i)}(\hat\Theta, \hat c)$ is element-wise convex (e.g., Example 2)
- $g_t^{(i)}(\hat c)$: $\ell_t^{(i)}(\hat\Theta^{(i)}, \hat c)$
- $G^{(i)}(\hat\Theta)$: $\sum_{t=1}^{T} \ell_t^{(i)}(\hat\Theta, c^{(i)})$
- $A_1$ & $A_2$: Same as Table 1
- ObserveEnv: True
The inputs to OMAC for the element-wise convex case are specified in Table 2. Compared to the convex case in Table 1, the differences include i) the cost functions $g_t^{(i)}$ and $G^{(i)}$ are convex, not necessarily linear; and ii) since ObserveEnv = True, in $G^{(i)}$ we use the true environmental condition $c^{(i)}$. With inputs specified in Table 2, Algorithm 1 has the ACE guarantee given in the following theorem.
Theorem 5 (OMAC ACE bound with element-wise convex loss). Assume the unknown dynamics model is $f(x, c) = F(\phi(x; \Theta), c)$. Assume the true parameter $\Theta \in \mathcal{K}_1$ and $c^{(i)} \in \mathcal{K}_2, \forall i$. Then OMAC (Algorithm 1) specified by Table 2 ensures the following ACE guarantee:
\[
\mathrm{ACE} \le \frac{\gamma}{1-\rho}\sqrt{W^2 + \frac{o(T)}{T} + \frac{o(N)}{N}}.
\]
4.2.1 Faster convergence with sub-Gaussian and environment diversity assumptions
Since the cost functions $g_t^{(i)}$ and $G^{(i)}$ in Table 2 are not necessarily strongly convex, the ACE upper bound in Theorem 5 is typically $\frac{\gamma}{1-\rho}\sqrt{W^2 + O(1/\sqrt{T}) + O(1/\sqrt{N})}$ (e.g., using OGD for both $A_1$ and $A_2$). However, for the bilinear model in Example 2, it is possible to achieve faster convergence with extra sub-Gaussian and environment diversity assumptions.
Table 3: OMAC with bilinear model
- $F(\phi(x; \hat\Theta), \hat c)$: The bilinear model in Example 2
- $g_t^{(i)}(\hat c)$: $\ell_t^{(i)}(\hat\Theta^{(i)}, \hat c)$
- $G^{(i)}(\hat\Theta)$: $\lambda\|\hat\Theta\|_F^2 + \sum_{j=1}^{i}\sum_{t=1}^{T} \ell_t^{(j)}(\hat\Theta, c^{(j)})$ with some $\lambda > 0$
- $A_1$: $\hat\Theta^{(i+1)} = \arg\min_{\hat\Theta} G^{(i)}(\hat\Theta)$ (i.e., ridge regression)
- $A_2$: Same as Table 1
- ObserveEnv: True
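The ridge-regression meta-adapter $A_1$ in Table 3 has a closed form for the bilinear model: with $Z_t^{(j)} = c^{(j)\top} \otimes Y(x_t^{(j)})$, the prediction $Y(x)\hat\Theta c$ is linear in $\mathrm{vec}(\hat\Theta)$, so $\hat\Theta^{(i+1)}$ solves a regularized least-squares problem. Below is a minimal sketch, assuming all past measurements $y_t^{(j)}$, evaluated bases, and revealed conditions $c^{(j)}$ have been stored; the data layout is illustrative.

```python
import numpy as np

def ridge_meta_adapter(Y_list, c_list, y_list, lam, p_bar, h):
    """A1 in Table 3: ridge regression for vec(Theta_hat) under the bilinear model.

    Y_list: list of Y(x_t^(j)) matrices, each (n, p_bar), over all past (j, t)
    c_list: list of revealed conditions c^(j), each (h,), aligned with Y_list
    y_list: list of measurements y_t^(j), each (n,), aligned with Y_list
    Returns Theta_hat of shape (p_bar, h).
    """
    d = p_bar * h
    A = lam * np.eye(d)          # lam * I + sum_{j,t} Z^T Z
    b = np.zeros(d)              # sum_{j,t} Z^T y
    for Y_x, c, y in zip(Y_list, c_list, y_list):
        Z = np.kron(c[None, :], Y_x)      # Z = c^T kron Y(x), shape (n, p_bar*h)
        A += Z.T @ Z
        b += Z.T @ y
    theta_vec = np.linalg.solve(A, b)
    # vec() stacks columns, so undo it with Fortran (column-major) ordering.
    return theta_vec.reshape(p_bar, h, order="F")
```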
With the bilinear model, the inputs to the OMAC algorithm are shown in Table 3. With extra assumptions on $w_t^{(i)}$ and the environment, we have the following ACE guarantee.
Theorem 6 (OMAC ACE bound with bilinear model). Consider an unknown dynamics model $f(x, c) = Y(x)\Theta c$, where $Y: \mathbb{R}^n \to \mathbb{R}^{n \times \bar p}$ is a known basis and $\Theta \in \mathbb{R}^{\bar p \times h}$. Assume each component of the disturbance $w_t^{(i)}$ is $R$-sub-Gaussian, the true parameters satisfy $\|\Theta\|_F \le K_\Theta$ and $\|c^{(i)}\| \le K_c, \forall i$, and $\|Y(x)\|_F \le K_Y, \forall x$. Define $Z_t^{(j)} = c^{(j)\top} \otimes Y(x_t^{(j)}) \in \mathbb{R}^{n \times \bar p h}$ and assume environment diversity: $\lambda_{\min}\big(\sum_{j=1}^{i}\sum_{t=1}^{T} Z_t^{(j)\top} Z_t^{(j)}\big) = \Omega(i)$. Then OMAC (Algorithm 1) specified by Table 3 has the following ACE guarantee (with probability at least $1 - \delta$):
\[
\mathrm{ACE} \le \frac{\gamma}{1-\rho}\sqrt{W^2 + \frac{o(T)}{T} + O\!\left(\frac{\log(NT/\delta)\log(N)}{N}\right)}. \tag{7}
\]
If we relax the environment diversity condition to $\Omega(\sqrt{i})$, the last term becomes $O\!\left(\frac{\log(NT/\delta)}{\sqrt{N}}\right)$.
The sub-Gaussian assumption is widely used in statistical learning theory to obtain concentration bounds [32, 4]. The environment diversity assumption states that $c^{(i)}$ provides “new information” in every outer iteration, such that the minimum eigenvalue of $\sum_{j=1}^{i}\sum_{t=1}^{T} Z_t^{(j)\top} Z_t^{(j)}$ grows linearly as $i$ increases. Note that we do not need $\lambda_{\min}\big(\sum_{j=1}^{i}\sum_{t=1}^{T} Z_t^{(j)\top} Z_t^{(j)}\big)$ to increase as $T$ goes up. Moreover, if we relax the condition to $\Omega(\sqrt{i})$, the ACE bound becomes worse than in the general element-wise convex case (the last term is $O(1/\sqrt{N})$), which implies the importance of “linear” environment diversity $\Omega(i)$. Task diversity has been shown to be important for representation learning [4, 33]. We provide a proof sketch here and the full proof can be found in Appendix A.6.
Proof sketch. In the outer loop we use the martingale concentration bound [32] and the environment diversity assumption to bound $\|\hat\Theta^{(i+1)} - \Theta\|_F^2 \le O\!\left(\frac{\log(iT/\delta)}{i}\right), \forall i \ge 1$, with probability at least $1 - \delta$. Then, we use Lemma 1 to show how the outer loop concentration bound and the inner loop regret bound of $A_2$ translate to ACE.
5 Deep OMAC and experiments
We now introduce Deep OMAC, a deep representation learning based OMAC algorithm. Table 4 shows the instantiation. As shown in Table 4, Deep OMAC employs a DNN to represent $\phi$, and uses gradient descent for optimization. With the model⁴ $\phi(x; \hat\Theta) \cdot \hat c$, the meta-adapter $A_1$ updates the representation $\phi$ via deep learning, and the inner-adapter $A_2$ updates a linear layer $\hat c$.
Table 4: OMAC with deep learning
- $F(\phi(x; \hat\Theta), \hat c)$: $\phi(x; \hat\Theta) \cdot \hat c$, where $\phi(x; \hat\Theta): \mathbb{R}^n \to \mathbb{R}^{n \times h}$ is a neural network with weights $\hat\Theta$
- $g_t^{(i)}(\hat c)$: $\ell_t^{(i)}(\hat\Theta^{(i)}, \hat c)$
- $A_1$: Gradient descent: $\hat\Theta^{(i+1)} \leftarrow \hat\Theta^{(i)} - \eta \nabla_{\hat\Theta} \sum_{t=1}^{T} \ell_t^{(i)}(\hat\Theta, \hat c_t^{(i)})$
- $A_2$: Same as Table 1
- ObserveEnv: False
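A minimal PyTorch sketch of the Deep OMAC model in Table 4: $\phi$ is a small MLP producing an $n \times h$ feature matrix, $\hat c$ is the environment-specific latent vector updated by OGD at every inner step, and the MLP weights are updated by a gradient step at the end of each outer iteration. The network size, learning rates, and the omission of spectral normalization are simplifications for illustration, not the configuration used in the paper's experiments.

```python
import torch

class PhiNet(torch.nn.Module):
    """phi(x; Theta_hat): R^n -> R^{n x h}, the shared representation (Table 4)."""
    def __init__(self, n, h, width=64):
        super().__init__()
        self.n, self.h = n, h
        self.net = torch.nn.Sequential(
            torch.nn.Linear(n, width), torch.nn.ReLU(),
            torch.nn.Linear(width, n * h),
        )

    def forward(self, x):
        return self.net(x).view(self.n, self.h)

def predict(phi, x, c_hat):
    """F(phi(x; Theta_hat), c_hat) = phi(x) @ c_hat."""
    return phi(x) @ c_hat

def inner_adapt(phi, x, y, c_hat, eta_inner):
    """A2: one OGD step on the observed squared loss with the representation frozen."""
    c = c_hat.detach().clone().requires_grad_(True)
    loss = torch.sum((predict(phi, x, c) - y) ** 2)
    (grad_c,) = torch.autograd.grad(loss, c)
    return (c - eta_inner * grad_c).detach()

def meta_adapt(phi, optimizer, batch):
    """A1: one gradient step on the summed loss of the finished outer iteration.
    batch is a list of (x, y, c_hat) tuples collected during that environment."""
    optimizer.zero_grad()
    loss = sum(torch.sum((predict(phi, x, c) - y) ** 2) for x, y, c in batch)
    loss.backward()
    optimizer.step()
```

A typical setup would be `phi = PhiNet(n, h)` with `optimizer = torch.optim.Adam(phi.parameters(), lr=1e-3)`, calling `inner_adapt` every time step and `meta_adapt` once per outer iteration.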
To demonstrate the performance of OMAC, we consider two sets of experiments:

Inverted pendulum with external wind, gravity mismatch, and unknown damping. The continuous-time model is $m l^2 \ddot\theta - m l \hat g \sin\theta = u + f(\theta, \dot\theta, c)$, where $\theta$ is the pendulum angle, $\dot\theta$ / $\ddot\theta$ is the angular velocity/acceleration, $m$ is the pendulum mass and $l$ is the length, $\hat g$ is the gravity estimate, $c$ is the unknown parameter that indicates the external wind condition, and $f(\theta, \dot\theta, c)$ is the unknown dynamics which depends on $c$, but also includes $c$-invariant terms such as damping and gravity mismatch. This model generalizes [35] by considering damping and gravity mismatch.
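For reference, here is a minimal forward-Euler discretization of this pendulum model with a placeholder wind/damping term standing in for the unknown $f$; the parameter values and the functional form of the disturbance are illustrative assumptions, not the ones used in the paper's experiments.

```python
import numpy as np

def pendulum_step(theta, theta_dot, u, c, dt=0.02, m=1.0, l=1.0, g_hat=9.8):
    """One forward-Euler step of m l^2 theta_ddot - m l g_hat sin(theta) = u + f.

    f here is a placeholder: a wind-dependent drag term plus damping and
    a small gravity-mismatch term (both c-invariant).
    """
    f = c * np.cos(theta) * theta_dot**2 - 0.1 * theta_dot + 0.05 * np.sin(theta)
    theta_ddot = (m * l * g_hat * np.sin(theta) + u + f) / (m * l**2)
    return theta + dt * theta_dot, theta_dot + dt * theta_ddot
```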
6-DoF quadrotor with 3-D wind disturbances. We consider a 6-DoF quadrotor model with unknown wind-dependent aerodynamic force $f(x, c) \in \mathbb{R}^3$, where $x \in \mathbb{R}^{13}$ is the drone state (including position, velocity, attitude, and angular velocity) and $c$ is the unknown parameter indicating the wind condition. We incorporate a realistic high-fidelity aerodynamic model from [36].

We consider 6 different controllers in the experiments (see more details about the dynamics model and controllers in Appendix A.7):
- No-adapt simply uses $\hat f_t^{(i)} = 0$, and omniscient uses $\hat f_t^{(i)} = f_t^{(i)}$.
- OMAC (convex) is based on Example 1, where $\hat f_t^{(i)} = Y_1(x_t^{(i)})\hat\Theta^{(i)} + Y_2(x_t^{(i)})\hat c_t^{(i)}$. We use random Fourier features [37, 13] to generate both $Y_1$ and $Y_2$. We use OGD for both $A_1$ and $A_2$ in Table 1.
- OMAC (bi-convex) is based on Example 2, where $\hat f_t^{(i)} = Y(x_t^{(i)})\hat\Theta^{(i)}\hat c_t^{(i)}$. Similarly, we use random Fourier features to generate $Y$. Although the theoretical result in Section 4.2 requires ObserveEnv = True, we relax this in our experiments and use $G^{(i)}(\hat\Theta) = \sum_{t=1}^{T} \ell_t^{(i)}(\hat\Theta, \hat c_t^{(i)})$ in Table 2, instead of $\sum_{t=1}^{T} \ell_t^{(i)}(\hat\Theta, c^{(i)})$. We also deploy OGD for $A_1$ and $A_2$.
- Baseline has the same procedure except with $\hat\Theta^{(i)} \equiv \hat\Theta^{(1)}$, i.e., it calls the inner-adapter $A_2$ at every step and does not update $\hat\Theta$, which is standard in adaptive control [13, 12].
⁴ The intuition behind this structure is that any analytic function $\bar f(x, \bar c)$ can be approximated by $\phi(x)\,c(\bar c)$ with a universal approximator $\phi$ [34]. We provide a detailed and theoretical justification in Appendix A.7.
[Figure 1: Drone experiment results. (a) xz-position trajectories (top) and force predictions (bottom) in the quadrotor experiment from one random seed; the wind condition is switched randomly every 2 s (indicated by the dotted vertical lines). The per-panel ACE values are Baseline 0.199 m, OMAC (convex) 0.125 m, OMAC (bi-convex) 0.074 m, and OMAC (deep) 0.066 m. (b) The evolution of control error (left) and force prediction error (right) in each 2 s wind condition, comparing No-adapt, Baseline, OMAC (convex), OMAC (bi-convex), OMAC (deep), and Omniscient. The solid lines average 10 random seeds and the shaded areas show standard deviations. The performance of OMAC improves as the number of wind conditions increases (especially the bi-convex and deep variants), while the baseline does not.]
- OMAC (deep learning) is based on Table 4. We use a DNN with spectral normalization [38, 25, 39, 40] to represent $\phi$, and use the Adam optimizer [41] for $A_1$. As with the other methods, $A_2$ is also an OGD algorithm.
Table 5: ACE results in the pendulum (top) and drone (bottom) experiments from 10 random seeds.

Pendulum (unknown external wind):
- no-adapt: 0.663 ± 0.142
- baseline: 0.311 ± 0.112
- OMAC (convex): 0.147 ± 0.047
- OMAC (bi-convex): 0.129 ± 0.044
- OMAC (deep): 0.093 ± 0.027
- omniscient: 0.034 ± 0.017

Drone (unknown external wind):
- no-adapt: 0.374 ± 0.044
- baseline: 0.283 ± 0.043
- OMAC (convex): 0.251 ± 0.043
- OMAC (bi-convex): 0.150 ± 0.019
- OMAC (deep): 0.141 ± 0.024
- omniscient: 0.100 ± 0.018
For all methods, we randomly switch the environment (wind) $c$ every 2 s. To make a fair comparison, all methods except no-adapt and omniscient use the same learning rate for the inner-adapter $A_2$, and the dimension of $\hat c$ is also the same ($\dim(\hat c) = 20$ for the pendulum and $\dim(\hat c) = 30$ for the drone). ACE results from 10 random seeds are depicted in Table 5. Figure 1 visualizes the drone experiment results. We observe that: i) OMAC significantly outperforms the baseline; ii) OMAC adapts faster and faster as it encounters more environments, while the baseline cannot benefit from prior experience, especially for the bi-convex and deep versions (see Figure 1); and iii) Deep OMAC achieves the best ACE due to the representation power of DNNs.
We note that in the drone experiments the performance of OMAC (convex) is only marginally better than the baseline. This is because the aerodynamic disturbance force in the quadrotor simulation is a multiplicative combination of the relative wind speed, the drone attitude, and the motor speeds; thus, the superposition structure $\hat f_t^{(i)} = Y_1(x_t^{(i)})\hat\Theta^{(i)} + Y_2(x_t^{(i)})\hat c_t^{(i)}$ cannot easily model the unknown force, while the bi-convex and deep learning variants both learn good controllers. In particular, OMAC (bi-convex) achieves similar performance to the deep learning case with many fewer parameters. On the other hand, in the pendulum experiments, OMAC (convex) is relatively better because a large component of the $c$-invariant part of the unknown dynamics is in superposition with the $c$-dependent part. For more details and the pendulum experiment visualization we refer to Appendix A.7.
6 Related work
Meta-learning and representation learning. Empirically, representation learning and meta-learning have shown great success in various domains [7]. In terms of control, meta-reinforcement learning is able to solve challenging multi-task RL problems [8–10]. We remark that learning representations for control can also refer to learning state representations from rich observations [42–44], but we consider dynamics representations in this paper. On the theoretical side, [4, 5, 33] have shown that representation learning reduces sample complexity on new tasks, and that “task diversity” is critical. Consistently, we show that OMAC enjoys better convergence theoretically (Corollary 4) and empirically, and Theorem 6 also implies the importance of environment diversity. Another relevant line of theoretical work [45–47] uses tools from online convex optimization to show guarantees for gradient-based meta-learning, by assuming closeness of all tasks to a single fixed point in parameter space. We eliminate this assumption by considering a hierarchical or nested online optimization procedure.
Adaptive control and online control. There is a rich body of literature studying Lyapunov stability and asymptotic convergence in adaptive control theory [11, 12]. Recently, there has been increased interest in studying online adaptive control with non-asymptotic metrics (e.g., regret) from learning theory, largely for settings with linear systems such as online LQR or LQG with unknown dynamics [14–16, 24, 48]. The most relevant work [13] gives the first regret bound for adaptive nonlinear control with unknown nonlinear dynamics and stochastic noise. Another relevant work studies online robust control of nonlinear systems with a mistake guarantee on the number of robustness failures [49]. However, all these results focus on the single-task case. To our knowledge, we show the first non-asymptotic convergence result for multi-task adaptive control. On the empirical side, [1, 31] combine adaptive nonlinear control with meta-learning, yielding encouraging experimental results.
Online matrix factorization.
Our work bears affinity to online matrix factorization, particularly the
bandit collaborative filtering setting [
30
,
50
,
29
]. In this setting, one typically posits a linear low-rank
projection as the target representation (e.g., a low-rank factorization of the user-item matrix), which is
similar to our bilinear case. Setting aside the significant complexity introduced by nonlinear control,
a key similarity comes from viewing different users as “tasks” and recommended items as “actions”.
Prior work in this area has by and large not been able to establish strong regret bounds, in part due to
the non-identifiability issue inherent in matrix factorization. In contrast, in our setting, one set of
latent variables (e.g., the wind condition) has a concrete physical meaning that we are allowed to
measure (ObserveEnv in Algorithm 1), thus side-stepping this non-identifiability issue.
7 Concluding remarks
We have presented OMAC, a meta-algorithm for adaptive nonlinear control in a sequence of unknown environments. We provide different instantiations of OMAC under varying assumptions and conditions, leading to the first non-asymptotic convergence guarantee for multi-task adaptive nonlinear control, and integration with deep learning. We also validate OMAC empirically. We use the average control error (ACE) metric and focus on fully-actuated systems in this paper. Future work will seek to consider general cost functions and systems. It is also interesting to study end-to-end convergence guarantees of deep OMAC, with ideas from deep representation learning theory. Another interesting direction is to study how to incorporate other areas of control theory, such as robust control.
Broader impacts.
This work is primarily focused on establishing a theoretical understanding of
meta-adaptive control. Such fundamental work will not directly have broader societal impacts.
Acknowledgments and disclosure of funding
This project was supported in part by funding from Raytheon and DARPA PAI, with additional
support for Guanya Shi provided by the Simoudis Discovery Prize. There is no conflict of interest.
References
[1]
Michael O’Connell, Guanya Shi, Xichen Shi, and Soon-Jo Chung. Meta-learning-based robust
adaptive flight control under uncertain wind conditions.
arXiv preprint arXiv:2103.01932
, 2021.
[2]
Andy Zeng, Shuran Song, Johnny Lee, Alberto Rodriguez, and Thomas Funkhouser. Tossingbot:
Learning to throw arbitrary objects with residual physics.
IEEE Transactions on Robotics
, 36
(4):1307–1319, 2020.
[3]
Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and Marco Hutter. Learning
quadrupedal locomotion over challenging terrain.
Science robotics
, 5(47), 2020.
[4]
Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning
the representation, provably.
arXiv preprint arXiv:2002.09434
, 2020.
[5]
Nilesh Tripuraneni, Chi Jin, and Michael I Jordan. Provable meta-learning of linear representa-
tions.
arXiv preprint arXiv:2002.11684
, 2020.
[6]
Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask
representation learning.
Journal of Machine Learning Research
, 17(81):1–32, 2016.
[7]
Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and
new perspectives.
IEEE transactions on pattern analysis and machine intelligence
, 35(8):
1798–1828, 2013.
[8]
Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel, and Sergey Levine. Meta-
reinforcement learning of structured exploration strategies.
arXiv preprint arXiv:1802.07245
,
2018.
[9]
Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adapta-
tion of deep networks. In
International Conference on Machine Learning
, pages 1126–1135,
2017.
[10]
Anusha Nagabandi, Ignasi Clavera, Simin Liu, Ronald S Fearing, Pieter Abbeel, Sergey Levine,
and Chelsea Finn. Learning to adapt in dynamic, real-world environments through meta-
reinforcement learning.
arXiv preprint arXiv:1803.11347
, 2018.
[11]
Jean-Jacques E Slotine, Weiping Li, et al.
Applied nonlinear control
, volume 199. Prentice hall
Englewood Cliffs, NJ, 1991.
[12] Karl J Åström and Björn Wittenmark.
Adaptive control
. Courier Corporation, 2013.
[13]
Nicholas M Boffi, Stephen Tu, and Jean-Jacques E Slotine. Regret bounds for adaptive nonlinear
control. In
Learning for Dynamics and Control
, pages 471–483. PMLR, 2021.
[14]
Max Simchowitz and Dylan Foster. Naive exploration is optimal for online lqr. In
International
Conference on Machine Learning
, pages 8937–8948. PMLR, 2020.
[15]
Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample
complexity of the linear quadratic regulator.
Foundations of Computational Mathematics
, pages
1–47, 2019.
[16]
Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Logarithmic
regret bound in partially observable linear dynamical systems.
arXiv preprint arXiv:2003.11227
,
2020.
[17]
Sham Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, and Wen Sun. Infor-
mation theoretic regret bounds for online nonlinear control.
arXiv preprint arXiv:2006.12466
,
2020.