Meta-Adaptive Nonlinear Control: Theory and Algorithms

Guanya Shi¹, Kamyar Azizzadenesheli², Michael O'Connell¹, Soon-Jo Chung¹, Yisong Yue¹
¹Caltech, ²Purdue University
{gshi,moc,sjchung,yyue}@caltech.edu, kamyar@purdue.edu
Abstract

We present an online multi-task learning approach for adaptive nonlinear control, which we call Online Meta-Adaptive Control (OMAC). The goal is to control a nonlinear system subject to adversarial disturbance and unknown environment-dependent nonlinear dynamics, under the assumption that the environment-dependent dynamics can be well captured with some shared representation. Our approach is motivated by robot control, where a robotic system encounters a sequence of new environmental conditions that it must quickly adapt to. A key emphasis is to integrate online representation learning with established methods from control theory, in order to arrive at a unified framework that yields both control-theoretic and learning-theoretic guarantees. We provide instantiations of our approach under varying conditions, leading to the first non-asymptotic end-to-end convergence guarantee for multi-task nonlinear control. OMAC can also be integrated with deep representation learning. Experiments show that OMAC significantly outperforms conventional adaptive control approaches which do not learn the shared representation, in inverted pendulum and 6-DoF drone control tasks under varying wind conditions.¹
1 Introduction

One important goal in autonomy and artificial intelligence is to enable autonomous robots to learn from prior experience to quickly adapt to new tasks and environments. Examples abound in robotics, such as a drone flying in different wind conditions [1], a manipulator throwing varying objects [2], or a quadruped walking over changing terrains [3]. Though those examples provide encouraging empirical evidence, when designing such adaptive systems, two important theoretical challenges arise, as discussed below.

First, from a learning perspective, the system should be able to learn an "efficient" representation from prior tasks, thereby permitting faster future adaptation, which falls into the category of representation learning or meta-learning. Recently, a line of work has shown theoretically that learning representations (in the standard supervised setting) can significantly reduce sample complexity on new tasks [4-6]. Empirically, deep representation learning or meta-learning has achieved success in many applications [7], including control, in the context of meta-reinforcement learning [8-10]. However, the theoretical benefits (in the end-to-end sense) of representation learning or meta-learning for adaptive control remain unclear.
Second, from a control perspective, the agent should be able to handle parametric model uncertainties with control-theoretic guarantees such as stability and tracking error convergence, which is a common adaptive control problem [11, 12]. For classic adaptive control algorithms, theoretical analysis often involves the use of Lyapunov stability and asymptotic convergence [11, 12]. Moreover, many recent studies aim to integrate ideas from learning, optimization, and control theory to design and analyze adaptive controllers using learning-theoretic metrics. Typical results guarantee non-asymptotic convergence in finite time horizons, such as regret [13-17] and dynamic regret [18-20]. However, these results focus on a single environment or task. A multi-task extension, especially whether and how prior experience could benefit adaptation in new tasks, remains an open problem.

¹Code and video: https://github.com/GuanyaShi/Online-Meta-Adaptive-Control

35th Conference on Neural Information Processing Systems (NeurIPS 2021).
Main contributions. In this paper, we address both learning and control challenges in a unified framework and provide end-to-end guarantees. We derive a new method of Online Meta-Adaptive Control (OMAC) that controls uncertain nonlinear systems under a sequence of new environmental conditions. The underlying assumption is that the environment-dependent unknown dynamics can be well captured by a shared representation, which OMAC learns using a meta-adapter. OMAC then performs environment-specific updates using an inner-adapter.
We provide different instantiations of OMAC under varying assumptions and conditions. In the jointly convex and element-wise convex cases, we show sublinear cumulative control error bounds, which to our knowledge is the first non-asymptotic convergence result for multi-task nonlinear control. Compared to standard adaptive control approaches that do not have a meta-adapter, we show that OMAC enjoys both stronger guarantees and better empirical performance. We finally show how to integrate OMAC with deep representation learning, which further improves empirical performance.
2 Problem statement

We consider the setting where a controller encounters a sequence of $N$ environments, with each environment lasting $T$ time steps. We use outer iteration to refer to the iteration over the $N$ environments, and inner iteration to refer to the $T$ time steps within an environment.
Notations: We use superscripts (e.g., $(i)$ in $x_t^{(i)}$) to denote the index of the outer iteration, where $1 \le i \le N$, and subscripts (e.g., $t$ in $x_t^{(i)}$) to denote the time index of the inner iteration, where $1 \le t \le T$. We use step $(i, t)$ to refer to the inner time step $t$ at the $i$-th outer iteration. $\|\cdot\|$ denotes the 2-norm of a vector or the spectral norm of a matrix, $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, and $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue of a real symmetric matrix. $\mathrm{vec}(\cdot) \in \mathbb{R}^{mn}$ denotes the vectorization of an $m \times n$ matrix, and $\otimes$ denotes the Kronecker product. Finally, we use $u_{1:t}$ to denote a sequence $\{u_1, u_2, \cdots, u_t\}$.
We consider a discrete-time nonlinear control-affine system [21, 22] with environment-dependent uncertainty $f(x, c)$. The dynamics model at the $i$-th outer iteration is:
$$x_{t+1}^{(i)} = f_0(x_t^{(i)}) + B(x_t^{(i)})u_t^{(i)} - f(x_t^{(i)}, c^{(i)}) + w_t^{(i)}, \quad 1 \le t \le T, \qquad (1)$$
where the state $x_t^{(i)} \in \mathbb{R}^n$, the control $u_t^{(i)} \in \mathbb{R}^m$, $f_0: \mathbb{R}^n \to \mathbb{R}^n$ is a known nominal dynamics model, $B: \mathbb{R}^n \to \mathbb{R}^{n \times m}$ is a known state-dependent actuation matrix, $c^{(i)} \in \mathbb{R}^h$ is the unknown parameter that indicates an environmental condition, $f: \mathbb{R}^n \times \mathbb{R}^h \to \mathbb{R}^n$ is the unknown $c^{(i)}$-dependent dynamics model, and $w_t^{(i)}$ is a disturbance, potentially adversarial. For simplicity we define $B_t^{(i)} = B(x_t^{(i)})$ and $f_t^{(i)} = f(x_t^{(i)}, c^{(i)})$.
Interaction protocol. We study the following adaptive nonlinear control problem under $N$ unknown environments. At the beginning of outer iteration $i$, the environment first selects $c^{(i)}$ (adaptively and adversarially), which is unknown to the controller, and then the controller makes decisions $u_{1:T}^{(i)}$ under the unknown dynamics $f(x_t^{(i)}, c^{(i)})$ and potentially adversarial disturbances $w_t^{(i)}$. To summarize:
1. Outer iteration $i$. A policy encounters environment $i$ ($i \in \{1, \ldots, N\}$), associated with unobserved variable $c^{(i)}$ (e.g., the wind condition for a flying drone). Run inner loop (Step 2).
2. Inner loop. The policy interacts with environment $i$ for $T$ time steps, observing $x_t^{(i)}$ and taking action $u_t^{(i)}$, with state/action dynamics following (1).
3. The policy optionally observes $c^{(i)}$ at the end of the inner loop (used for some variants of the analysis).
4. Increment $i = i + 1$ and repeat from Step 1. A minimal simulation sketch of this protocol is given below.
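To make the protocol concrete, here is a minimal Python sketch of the interaction loop. The environment and controller interfaces (`env.f0`, `env.B`, `env.f`, `controller.act`, the adapter hooks) are hypothetical placeholders, not the paper's implementation; the disturbance here is random, though the analysis allows it to be adversarial:

```python
import numpy as np

def interaction_protocol(controller, envs, T, n, rng):
    """Simulate the protocol above: N outer iterations (environments), each an
    inner loop of T steps following the dynamics (1)."""
    errors = []
    for env in envs:                                # outer iteration i; env carries c^(i)
        x = np.zeros(n)                             # x_1^(i) = 0, as assumed in Lemma 1
        controller.reset_inner()                    # fresh environment-specific state
        for t in range(T):                          # inner iteration t
            u = controller.act(x)
            w = env.W / np.sqrt(n) * rng.uniform(-1, 1, size=n)  # bounded: ||w|| <= W
            x_next = env.f0(x) + env.B(x) @ u - env.f(x, env.c) + w
            controller.update(x, u, x_next)         # adapt from the observed transition
            errors.append(np.linalg.norm(x))
            x = x_next
        controller.update_meta()                    # between-environment (meta) update
    return np.mean(errors)                          # the ACE metric, defined next
```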
We use average control error (ACE) as our performance metric:

Definition 1 (Average control error). The average control error (ACE) of $N$ outer iterations (i.e., $N$ environments), each lasting $T$ time steps, is defined as $\mathrm{ACE} = \frac{1}{TN}\sum_{i=1}^{N}\sum_{t=1}^{T}\|x_t^{(i)}\|$.
ACE can be viewed as a non-asymptotic generalization of steady-state error in control [23]. We make the following assumptions on the actuation matrix $B$, the nominal dynamics $f_0$, and the disturbance $w_t^{(i)}$:

Assumption 1 (Full actuation, bounded disturbance, and e-ISS assumptions). We consider fully-actuated systems, i.e., $\mathrm{rank}(B(x)) = n$ for all $x$, and $\|w_t^{(i)}\| \le W, \forall t, i$. Moreover, the nominal dynamics $f_0$ is exponentially input-to-state stable (e-ISS): there exist constants $\beta \ge 0$ and $0 < \rho < 1$ such that, for any sequence $v_{1:t-1} \in \mathbb{R}^n$ and the dynamics $x_{k+1} = f_0(x_k) + v_k$, $1 \le k \le t-1$, the state $x_t$ satisfies:
$$\|x_t\| \le \beta\rho^{t-1}\|x_1\| + \beta\sum_{k=1}^{t-1}\rho^{t-1-k}\|v_k\|. \qquad (2)$$
With the e-ISS property in Assumption 1, we have the following bound that connects ACE with the average squared loss between $B_t^{(i)}u_t^{(i)} + w_t^{(i)}$ and $f_t^{(i)}$.

Lemma 1. Assume $x_1^{(i)} = 0, \forall i$. The average control error (ACE) is bounded as:
$$\frac{\sum_{i=1}^{N}\sum_{t=1}^{T}\|x_t^{(i)}\|}{TN} \le \frac{\beta}{1-\rho}\sqrt{\frac{\sum_{i=1}^{N}\sum_{t=1}^{T}\|B_t^{(i)}u_t^{(i)} - f_t^{(i)} + w_t^{(i)}\|^2}{TN}}. \qquad (3)$$
The proof can be found in Appendix A.1. We assume $x_1^{(i)} = 0$ for simplicity: the influence of a non-zero and bounded $x_1^{(i)}$ is a constant term in each outer iteration, by the e-ISS property (2).
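To see where the $\frac{\beta}{1-\rho}$ factor and the square root come from, here is a minimal sketch of the argument (our paraphrase, for a single environment): write $v_t^{(i)} = B_t^{(i)}u_t^{(i)} - f_t^{(i)} + w_t^{(i)}$, so that (1) becomes $x_{t+1}^{(i)} = f_0(x_t^{(i)}) + v_t^{(i)}$, then apply e-ISS and Cauchy-Schwarz:
$$\sum_{t=1}^{T}\|x_t^{(i)}\| \;\overset{(2),\,x_1^{(i)}=0}{\le}\; \beta\sum_{t=1}^{T}\sum_{k=1}^{t-1}\rho^{t-1-k}\|v_k^{(i)}\| \;\le\; \frac{\beta}{1-\rho}\sum_{t=1}^{T}\|v_t^{(i)}\| \;\overset{\text{Cauchy-Schwarz}}{\le}\; \frac{\beta}{1-\rho}\sqrt{T\sum_{t=1}^{T}\|v_t^{(i)}\|^2}.$$
Dividing by $T$, and repeating the same argument jointly over all $NT$ terms, yields (3).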
Remark on the e-ISS assumption and the ACE metric. Note that an exponentially stable linear system $f_0(x_t) = Ax_t$ (i.e., the spectral radius of $A$ is $< 1$) satisfies the exponential ISS (e-ISS) assumption. However, in nonlinear systems e-ISS is a stronger assumption than exponential stability. For both linear and nonlinear systems, the e-ISS property of $f_0$ is usually achieved by applying some stabilizing feedback controller to the system², i.e., $f_0$ is the closed-loop dynamics [11, 24, 16]. The e-ISS assumption is standard in both online adaptive linear control [24, 16, 14] and nonlinear control [13], and practical in robotic control such as drones [25]. In ACE we consider a regulation task, but it can also capture trajectory tracking tasks with time-variant nominal dynamics $f_0$ under incremental stability assumptions [13]. We only consider the regulation task in this paper for simplicity.
Generality. We would like to emphasize the generality of our dynamics model (1). The nominal control-affine part can model general fully-actuated robotic systems via Euler-Lagrange equations [21, 22], and the unknown part $f(x, c)$ is nonlinear in $x$ and $c$. We only need to assume the disturbance $w_t^{(i)}$ is bounded, which is more general than the stochastic settings in the linear [14-16] and nonlinear [13] cases. For example, $w_t^{(i)}$ can model extra $(x, u, c)$-dependent uncertainties or adversarial disturbances. Moreover, the environment sequence $c^{(1:N)}$ could also be adversarial. In terms of the extension to under-actuated systems, all the results in this paper hold for the matched uncertainty setting [13], i.e., in the form $x_{t+1}^{(i)} = f_0(x_t^{(i)}) + B(x_t^{(i)})(u_t^{(i)} - f(x_t^{(i)}, c^{(i)})) + w_t^{(i)}$, where $B(x_t^{(i)})$ is not necessarily full rank (e.g., the drone and inverted pendulum experiments in Section 5). Generalizing to other under-actuated systems is interesting future work.
3 Online meta-adaptive control (OMAC) algorithm

The design of our online meta-adaptive control (OMAC) approach comprises four pieces: the policy class, the function class, the inner-loop (within-environment) adaptation algorithm $\mathcal{A}_2$, and the outer-loop (between-environment) online learning algorithm $\mathcal{A}_1$.
Policy class. We focus on the class of certainty-equivalence controllers [13, 26, 14], which is a general class of model-based controllers that also includes the linear feedback controllers commonly studied in online control [26, 14]. After a model is learned from past data, a controller is designed by treating the learned model as the truth [12]. Formally, at step $(i, t)$, the controller first estimates $\hat f_t^{(i)}$ (an estimate of $f_t^{(i)}$) based on past data, and then executes $u_t^{(i)} = B_t^{(i)\dagger}\hat f_t^{(i)}$, where $(\cdot)^\dagger$ is the pseudo-inverse. Note that from Lemma 1, the average control error of the omniscient controller using $\hat f_t^{(i)} = f_t^{(i)}$ (i.e., the controller perfectly knows $f(x, c)$) is upper bounded as³: $\mathrm{ACE(omniscient)} \le \beta W/(1-\rho)$. $\mathrm{ACE(omniscient)}$ can be viewed as a fundamental limit of the certainty-equivalence policy class.

²For example, consider $x_{t+1} = \frac{3}{2}x_t + 2\sin x_t + \bar u_t$. With a feedback controller $\bar u_t = u_t - x_t - 2\sin x_t$, the closed-loop dynamics is $x_{t+1} = \frac{1}{2}x_t + u_t$, where $f_0(x) = \frac{1}{2}x$ is e-ISS.
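As a concrete illustration of the certainty-equivalence step, the action can be computed with a pseudo-inverse in a couple of lines; this is a generic sketch, not code from the paper's repository:

```python
import numpy as np

def certainty_equivalence_action(B_t: np.ndarray, f_hat_t: np.ndarray) -> np.ndarray:
    """u_t = pinv(B(x_t)) @ f_hat_t, so that B_t u_t - f_t = f_hat_t - f_t
    when B has full row rank (Assumption 1)."""
    return np.linalg.pinv(B_t) @ f_hat_t
```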
Function class $F$ for representation learning. In OMAC, we need to define a function class $F(\phi(x; \hat\theta), \hat c)$ to compute $\hat f_t^{(i)}$ (i.e., $\hat f_t^{(i)} = F(\phi(x_t^{(i)}; \hat\theta), \hat c)$), where $\phi$ (parameterized by $\hat\theta$) is a representation shared by all environments, and $\hat c$ is an environment-specific latent state. From a theoretical perspective, the main consideration in the choice of $F(\phi(x), \hat c)$ is how it affects the resulting learning objective. For instance, $\phi$ represented by a Deep Neural Network (DNN) would lead to a highly non-convex learning objective. In this paper, we focus on the setting $\hat\theta \in \mathbb{R}^p$, $\hat c \in \mathbb{R}^h$, and $p \gg h$, i.e., it is much more expensive to learn the shared representation $\phi$ (e.g., a DNN) than to "fine-tune" via $\hat c$ in a specific environment, which is consistent with meta-learning [9] and representation learning [4, 7] practices.
Inner loop adaptive control. We take a modular approach in our algorithm design, in order to cleanly leverage state-of-the-art methods from online learning, representation learning, and adaptive control. When interacting with a single environment (for $T$ time steps), we keep the learned representation fixed, and use that representation for adaptive control by treating $\hat c$ as an unknown low-dimensional parameter. We can utilize any adaptive control method such as online gradient descent, velocity gradient, or composite adaptation [27, 13, 11, 1].
Outer loop online learning. We treat the outer loop (which iterates between environments) as an online learning problem, where the goal is to learn the shared representation $\phi$ that optimizes the inner-loop adaptation performance. Theoretically, we can reason about the analysis by setting up a hierarchical or nested online learning procedure (adaptive control nested within online learning).
Design goal. Our goal is to design a meta-adaptive controller that has low ACE, ideally converging to $\mathrm{ACE(omniscient)}$ as $T, N \to \infty$. In other words, OMAC should converge to performing as well as the omniscient controller with perfect knowledge of $f(x, c)$.
Algorithm 1 describes the OMAC algorithm. Since $\phi$ is environment-invariant and $p \gg h$, we only adapt $\hat\theta$ at the end of each outer iteration. On the other hand, because $c^{(i)}$ varies between environments, we adapt $\hat c$ at each inner step. As shown in Algorithm 1, at step $(i, t)$, after applying $u_t^{(i)}$, the controller observes the next state $x_{t+1}^{(i)}$ and computes $y_t^{(i)} \triangleq f_0(x_t^{(i)}) + B_t^{(i)}u_t^{(i)} - x_{t+1}^{(i)} = f_t^{(i)} - w_t^{(i)}$, which is a disturbed measurement of the ground truth $f_t^{(i)}$. We then define $\ell_t^{(i)}(\hat\theta, \hat c) \triangleq \|F(\phi(x_t^{(i)}; \hat\theta), \hat c) - y_t^{(i)}\|^2$ as the observed loss at step $(i, t)$, which is a squared loss between the disturbed measurement of $f_t^{(i)}$ and the model prediction $F(\phi(x_t^{(i)}; \hat\theta), \hat c)$.
Instantiations. Depending on $\{F(\phi(x; \hat\theta), \hat c), \mathcal{A}_1, \mathcal{A}_2, \mathrm{ObserveEnv}\}$, we consider three settings:

- Convex case (Section 4.1): The observed loss $\ell_t^{(i)}$ is convex with respect to $\hat\theta$ and $\hat c$.
- Element-wise convex case (Section 4.2): Fixing $\hat\theta$ or $\hat c$, $\ell_t^{(i)}$ is convex with respect to the other.
- Deep learning case (Section 5): In this case, we use a DNN with weights $\hat\theta$ to represent $\phi$.
4 Main theoretical results

4.1 Convex case

In this subsection, we focus on a setting where the observed loss $\ell_t^{(i)}(\hat\theta, \hat c) = \|F(\phi(x; \hat\theta), \hat c) - y_t^{(i)}\|^2$ is convex with respect to $\hat\theta$ and $\hat c$. We provide the following example to illustrate this case and highlight its difference from conventional adaptive control (e.g., [13, 12]).
³This upper bound is tight. Consider a scalar system $x_{t+1} = ax_t + u_t - f(x_t) + w$ with $|a| < 1$ and $w$ a constant. In this case $\rho = |a|$, $\beta = 1$, and the omniscient controller $u_t = f(x_t)$ yields $\mathrm{ACE} = |w|/(1-\rho)$.
Algorithm 1: Online Meta-Adaptive Control (OMAC) algorithm

Input: Meta-adapter $\mathcal{A}_1$; inner-adapter $\mathcal{A}_2$; model $F(\phi(x; \hat\theta), \hat c)$; Boolean ObserveEnv
for $i = 1, \cdots, N$ do
    The environment selects $c^{(i)}$
    for $t = 1, \cdots, T$ do
        Compute $\hat f_t^{(i)} = F(\phi(x_t^{(i)}; \hat\theta^{(i)}), \hat c_t^{(i)})$
        Execute $u_t^{(i)} = B_t^{(i)\dagger}\hat f_t^{(i)}$  // certainty-equivalence policy
        Observe $x_{t+1}^{(i)}$, $y_t^{(i)} = f(x_t^{(i)}, c^{(i)}) - w_t^{(i)}$, and $\ell_t^{(i)}(\hat\theta, \hat c) = \|F(\phi(x_t^{(i)}; \hat\theta), \hat c) - y_t^{(i)}\|^2$  // $y_t^{(i)}$ is a noisy measurement of $f$ and $\ell_t^{(i)}$ is the observed loss
        Construct an inner cost function $g_t^{(i)}(\hat c)$ by $\mathcal{A}_2$  // $g_t^{(i)}$ is a function of $\hat c$
        Inner-adaptation: $\hat c_{t+1}^{(i)} \leftarrow \mathcal{A}_2(\hat c_t^{(i)}, g_{1:t}^{(i)})$
    if ObserveEnv then Observe $c^{(i)}$  // only used in some instantiations
    Construct an outer cost function $G^{(i)}(\hat\theta)$ by $\mathcal{A}_1$  // $G^{(i)}$ is a function of $\hat\theta$
    Meta-adaptation: $\hat\theta^{(i+1)} \leftarrow \mathcal{A}_1(\hat\theta^{(i)}, G^{(1:i)})$
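For readers who prefer code, the following is a minimal Python skeleton of Algorithm 1 (with ObserveEnv = False). The `model`, `env`, and adapter interfaces are hypothetical stand-ins for $F$, the system (1), and $\mathcal{A}_1$/$\mathcal{A}_2$, not the paper's implementation:

```python
import numpy as np

def omac(model, meta_adapter, inner_adapter, envs, T, n):
    """Skeleton of Algorithm 1. `model(x, theta, c)` returns f_hat; the adapters
    expose hypothetical online-learning interfaces."""
    theta_hat = meta_adapter.init()                      # shared representation parameters
    for env in envs:                                     # outer iteration i
        c_hat, batch = inner_adapter.init(), []
        x = np.zeros(n)
        for t in range(T):                               # inner iteration t
            f_hat = model(x, theta_hat, c_hat)
            u = np.linalg.pinv(env.B(x)) @ f_hat         # certainty-equivalence policy
            x_next = env.step(x, u)                      # environment applies dynamics (1)
            y = env.f0(x) + env.B(x) @ u - x_next        # y_t = f_t - w_t, the noisy target
            loss = lambda th, c, x=x, y=y: np.sum((model(x, th, c) - y) ** 2)
            c_hat = inner_adapter.step(c_hat, lambda c: loss(theta_hat, c))  # A_2 update
            batch.append(loss)
            x = x_next
        theta_hat = meta_adapter.step(theta_hat, batch)  # A_1 update, once per environment
    return theta_hat
```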
Example 1. We consider a model $F(\phi(x; \hat\theta), \hat c) = Y_1(x)\hat\theta + Y_2(x)\hat c$ to estimate $f$:
$$\hat f_t^{(i)} = Y_1(x_t^{(i)})\hat\theta^{(i)} + Y_2(x_t^{(i)})\hat c_t^{(i)}, \qquad (4)$$
where $Y_1: \mathbb{R}^n \to \mathbb{R}^{n \times p}$ and $Y_2: \mathbb{R}^n \to \mathbb{R}^{n \times h}$ are two known bases. Note that conventional adaptive control approaches typically concatenate $\hat\theta$ and $\hat c$ and adapt on both at each time step, regardless of the environment changes (e.g., [13]). Since $p \gg h$, such concatenation is computationally much more expensive than OMAC, which only adapts $\hat\theta$ in outer iterations.
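For instance, $Y_1$ and $Y_2$ could be random Fourier feature bases, as used later in the experiments (Section 5). The sketch below shows one common way to build such a basis; the dimensions, sampling scale, and block-diagonal layout are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

def random_fourier_basis(n, num_feat, scale=1.0, seed=0):
    """Return Y: R^n -> R^{n x (n*num_feat)} built from random Fourier features,
    with one copy of the feature vector per output dimension (block structure)."""
    rng = np.random.default_rng(seed)
    Omega = scale * rng.normal(size=(num_feat, n))    # random frequencies
    b = rng.uniform(0.0, 2 * np.pi, size=num_feat)    # random phases
    def Y(x):
        feats = np.sqrt(2.0 / num_feat) * np.cos(Omega @ x + b)  # (num_feat,)
        return np.kron(np.eye(n), feats)              # shape (n, n*num_feat)
    return Y
```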
Because $\ell_t^{(i)}(\hat\theta, \hat c)$ is jointly convex with respect to $\hat\theta$ and $\hat c$, the OMAC algorithm in this case falls into the category of Nested Online Convex Optimization (Nested OCO) [28]. The choices of $g_t^{(i)}, G^{(i)}, \mathcal{A}_1, \mathcal{A}_2$ and ObserveEnv are depicted in Table 1. Note that in the convex case OMAC does not need to know $c^{(i)}$ at any point in the process (ObserveEnv = False).
| $F(\phi(x; \hat\theta), \hat c)$ | Any model $F$ such that $\ell_t^{(i)}(\hat\theta, \hat c)$ is convex (e.g., Example 1) |
| $g_t^{(i)}(\hat c)$ | $\nabla_{\hat c}\ell_t^{(i)}(\hat\theta^{(i)}, \hat c_t^{(i)}) \cdot \hat c$ |
| $G^{(i)}(\hat\theta)$ | $\sum_{t=1}^{T}\nabla_{\hat\theta}\ell_t^{(i)}(\hat\theta^{(i)}, \hat c_t^{(i)}) \cdot \hat\theta$ |
| $\mathcal{A}_1$ | With a convex set $\mathcal{K}_1$, $\mathcal{A}_1$ initializes $\hat\theta^{(1)} \in \mathcal{K}_1$ and returns $\hat\theta^{(i+1)} \in \mathcal{K}_1, \forall i$. $\mathcal{A}_1$ has sublinear regret, i.e., the total regret of $\mathcal{A}_1$ is $T \cdot o(N)$ (e.g., online gradient descent) |
| $\mathcal{A}_2$ | With a convex set $\mathcal{K}_2$, $\forall i$, $\mathcal{A}_2$ initializes $\hat c_1^{(i)} \in \mathcal{K}_2$ and returns $\hat c_{t+1}^{(i)} \in \mathcal{K}_2, \forall t$. $\mathcal{A}_2$ has sublinear regret, i.e., the total regret of $\mathcal{A}_2$ is $N \cdot o(T)$ (e.g., online gradient descent) |
| ObserveEnv | False |

Table 1: OMAC with convex loss
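Table 1 only requires $\mathcal{A}_1$ and $\mathcal{A}_2$ to be sublinear-regret OCO algorithms. A minimal projected online gradient descent that fits both roles, using the norm-ball sets $\mathcal{K} = \{z : \|z\| \le K\}$ that appear in Corollary 4, could look like the following sketch (names and rates are illustrative):

```python
import numpy as np

def project_ball(z, radius):
    """Euclidean projection onto {z : ||z|| <= radius}."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else z * (radius / norm)

def ogd_step(z, grad, lr, radius):
    """One projected OGD step: z <- Proj_K(z - lr * grad). Used as A_2 on c_hat
    at every inner step, and as A_1 on theta_hat once per outer iteration
    (with the summed gradient defining G^(i))."""
    return project_ball(z - lr * grad, radius)

# Illustrative usage for Example 1, where l = ||Y1 theta + Y2 c - y||^2:
#   residual = Y1 @ theta + Y2 @ c - y
#   c     = ogd_step(c, 2 * Y2.T @ residual, lr=eta_t, radius=K_c)        # inner, A_2
#   theta = ogd_step(theta, summed_grad_over_t, lr=eta_bar, radius=K_th)  # outer, A_1
```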
As shown in Table 1, at the end of step $(i, t)$ we fix $\hat\theta = \hat\theta^{(i)}$ and update $\hat c_{t+1}^{(i)} \in \mathcal{K}_2$ using $\mathcal{A}_2(\hat c_t^{(i)}, g_{1:t}^{(i)})$, which is an OCO problem with linear costs $g_{1:t}^{(i)}$. On the other hand, at the end of outer iteration $i$, we update $\hat\theta^{(i+1)} \in \mathcal{K}_1$ using $\mathcal{A}_1(\hat\theta^{(i)}, G^{(1:i)})$, which is another OCO problem with linear costs $G^{(1:i)}$. From [28], we have the following regret relationship:
Lemma 2 (Nested OCO regret bound, [28]). OMAC (Algorithm 1) specified by Table 1 has regret:
$$\sum_{i=1}^{N}\sum_{t=1}^{T}\ell_t^{(i)}(\hat\theta^{(i)}, \hat c_t^{(i)}) - \min_{\theta \in \mathcal{K}_1}\sum_{i=1}^{N}\min_{c^{(i)} \in \mathcal{K}_2}\sum_{t=1}^{T}\ell_t^{(i)}(\theta, c^{(i)}) \le \underbrace{\sum_{i=1}^{N}G^{(i)}(\hat\theta^{(i)}) - \min_{\theta \in \mathcal{K}_1}\sum_{i=1}^{N}G^{(i)}(\theta)}_{\text{the total regret of } \mathcal{A}_1,\ T \cdot o(N)} + \underbrace{\sum_{i=1}^{N}\sum_{t=1}^{T}g_t^{(i)}(\hat c_t^{(i)}) - \sum_{i=1}^{N}\min_{c^{(i)} \in \mathcal{K}_2}\sum_{t=1}^{T}g_t^{(i)}(c^{(i)})}_{\text{the total regret of } \mathcal{A}_2,\ N \cdot o(T)}. \qquad (5)$$
Note that the total regret of $\mathcal{A}_1$ is $T \cdot o(N)$ because $G^{(i)}$ is scaled up by a factor of $T$. With Lemmas 1 and 2, we have the following guarantee on the average control error.
Theorem 3 (OMAC ACE bound with convex loss). Assume the unknown dynamics model is $f(x, c) = F(\phi(x; \theta), c)$ with true parameters $\theta \in \mathcal{K}_1$ and $c^{(i)} \in \mathcal{K}_2, \forall i$. Then OMAC (Algorithm 1) specified by Table 1 ensures the following ACE guarantee:
$$\mathrm{ACE} \le \frac{\beta}{1-\rho}\sqrt{W^2 + \frac{o(T)}{T} + \frac{o(N)}{N}}.$$
To further understand Theorem 3 and compare OMAC with conventional adaptive control approaches,
we provide the following corollary using the model in Example 1.
Corollary 4. Suppose the unknown dynamics model is $f(x, c) = Y_1(x)\theta + Y_2(x)c$, where $Y_1: \mathbb{R}^n \to \mathbb{R}^{n \times p}$ and $Y_2: \mathbb{R}^n \to \mathbb{R}^{n \times h}$ are known bases. We assume $\|\theta\| \le K_\theta$ and $\|c^{(i)}\| \le K_c, \forall i$. Moreover, we assume $\|Y_1(x)\| \le K_1$ and $\|Y_2(x)\| \le K_2, \forall x$. In Table 1 we use $\hat f_t^{(i)} = Y_1(x_t^{(i)})\hat\theta^{(i)} + Y_2(x_t^{(i)})\hat c_t^{(i)}$, and Online Gradient Descent (OGD) [27] for both $\mathcal{A}_1$ and $\mathcal{A}_2$, with learning rates $\bar\eta^{(i)}$ and $\eta_t^{(i)}$ respectively. We set $\mathcal{K}_1 = \{\hat\theta : \|\hat\theta\| \le K_\theta\}$ and $\mathcal{K}_2 = \{\hat c : \|\hat c\| \le K_c\}$. If we schedule the learning rates as:
$$\bar\eta^{(i)} = \frac{2K_\theta}{\underbrace{(4K_1^2K_\theta + 4K_1K_2K_c + 2K_1W)}_{C_1}\,T\sqrt{i}}, \qquad \eta_t^{(i)} = \frac{2K_c}{\underbrace{(4K_2^2K_c + 4K_1K_2K_\theta + 2K_2W)}_{C_2}\,\sqrt{t}},$$
then the average control performance is bounded as:
$$\mathrm{ACE(OMAC)} \le \frac{\beta}{1-\rho}\sqrt{W^2 + 3\left(K_\theta C_1\frac{1}{\sqrt{N}} + K_cC_2\frac{1}{\sqrt{T}}\right)}.$$
Moreover, the baseline adaptive control approach, which uses OGD to adapt both $\hat\theta$ and $\hat c$ at each time step, satisfies:
$$\mathrm{ACE(baseline\ adaptive\ control)} \le \frac{\beta}{1-\rho}\sqrt{W^2 + 3\sqrt{K_\theta^2 + K_c^2}\sqrt{C_1^2 + C_2^2}\,\frac{1}{\sqrt{T}}}.$$
Note that $\mathrm{ACE(baseline\ adaptive\ control)}$ does not improve as $N$ increases (i.e., encountering more environments has no benefit). If $p \gg h$, we potentially have $K_1 \gg K_2$ and $K_\theta \gg K_c$, so $C_1 \gg C_2$. Therefore, the ACE upper bound of OMAC is better than that of the baseline adaptation if $N > T$, which is consistent with recent representation learning results [4, 5]. It is also worth noting that the baseline adaptation is much more computationally expensive, because it needs to adapt both $\hat\theta$ and $\hat c$ at each time step. Intuitively, OMAC improves convergence because the meta-adapter $\mathcal{A}_1$ approximates the environment-invariant part $Y_1(x)\theta$, which makes the inner-adapter $\mathcal{A}_2$ much more efficient.
4.2 Element-wise convex case

In this subsection, we consider the setting where the loss function $\ell_t^{(i)}(\hat\theta, \hat c)$ is element-wise convex with respect to $\hat\theta$ and $\hat c$, i.e., when one of $\hat\theta$ or $\hat c$ is fixed, $\ell_t^{(i)}$ is convex with respect to the other one. For instance, $F$ could be the bilinear function depicted in Example 2. Outside the context of control, such bilinear models are also studied in representation learning [4, 5] and factorization bandits [29, 30].
Example 2. We consider a bilinear model $F(\phi(x; \hat\theta), \hat c) = Y(x)\hat\Theta\hat c$ to estimate $f$:
$$\hat f_t^{(i)} = Y(x_t^{(i)})\hat\Theta^{(i)}\hat c_t^{(i)}, \qquad (6)$$
where $Y: \mathbb{R}^n \to \mathbb{R}^{n \times \bar p}$ is a known basis, $\hat\Theta^{(i)} \in \mathbb{R}^{\bar p \times h}$, and $\hat c_t^{(i)} \in \mathbb{R}^h$. Note that the dimensionality of $\hat\theta = \mathrm{vec}(\hat\Theta)$ is $p = \bar ph$. Conventional adaptive control typically views $\hat\Theta\hat c$ as a vector in $\mathbb{R}^{\bar p}$ and adapts it at each time step regardless of the environment changes [13].
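The bilinear prediction can equivalently be written as a linear function of $\mathrm{vec}(\hat\Theta)$ via the identity $Y\Theta c = (c^\top \otimes Y)\,\mathrm{vec}(\Theta)$; the matrix $(c^\top \otimes Y)$ is exactly the feature matrix $Z$ used later in Theorem 6. A short numpy check of this identity (shapes are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_bar, h = 3, 5, 2
Y = rng.normal(size=(n, p_bar))
Theta = rng.normal(size=(p_bar, h))
c = rng.normal(size=h)

Z = np.kron(c, Y)  # (c^T ⊗ Y), shape (n, p_bar * h)
assert np.allclose(Y @ Theta @ c, Z @ Theta.flatten(order="F"))  # vec() is column-major
```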
Compared to Section 4.1, the element-wise convex case poses new challenges: i) the coupling between $\hat\theta$ and $\hat c$ brings inherent non-identifiability issues; and ii) statistical learning guarantees typically need i.i.d. assumptions on $c^{(i)}$ and $x_t^{(i)}$ [4, 5]. These challenges are further amplified by the underlying unknown nonlinear system (1). Therefore in this section we set ObserveEnv = True, i.e., the controller has access to the true environmental condition $c^{(i)}$ at the end of the $i$-th outer iteration, which is practical in many systems where $c^{(i)}$ has a concrete physical meaning, e.g., drones with wind disturbances [1, 31].
| $F(\phi(x; \hat\theta), \hat c)$ | Any model $F$ such that $\ell_t^{(i)}(\hat\theta, \hat c)$ is element-wise convex (e.g., Example 2) |
| $g_t^{(i)}(\hat c)$ | $\ell_t^{(i)}(\hat\theta^{(i)}, \hat c)$ |
| $G^{(i)}(\hat\theta)$ | $\sum_{t=1}^{T}\ell_t^{(i)}(\hat\theta, c^{(i)})$ |
| $\mathcal{A}_1$ & $\mathcal{A}_2$ | Same as Table 1 |
| ObserveEnv | True |

Table 2: OMAC with element-wise convex loss
The inputs to OMAC for the element-wise convex case are specified in Table 2. Compared to the convex case in Table 1, the differences include: i) the cost functions $g_t^{(i)}$ and $G^{(i)}$ are convex, not necessarily linear; and ii) since ObserveEnv = True, in $G^{(i)}$ we use the true environmental condition $c^{(i)}$. With the inputs specified in Table 2, Algorithm 1 has the ACE guarantee in the following theorem.
Theorem 5 (OMAC ACE bound with element-wise convex loss). Assume the unknown dynamics model is $f(x, c) = F(\phi(x; \theta), c)$ with true parameters $\theta \in \mathcal{K}_1$ and $c^{(i)} \in \mathcal{K}_2, \forall i$. Then OMAC (Algorithm 1) specified by Table 2 ensures the following ACE guarantee:
$$\mathrm{ACE} \le \frac{\beta}{1-\rho}\sqrt{W^2 + \frac{o(T)}{T} + \frac{o(N)}{N}}.$$
4.2.1 Faster convergence with sub-Gaussian and environment diversity assumptions

Since the cost functions $g_t^{(i)}$ and $G^{(i)}$ in Table 2 are not necessarily strongly convex, the ACE upper bound in Theorem 5 is typically $\frac{\beta}{1-\rho}\sqrt{W^2 + O(1/\sqrt{T}) + O(1/\sqrt{N})}$ (e.g., using OGD for both $\mathcal{A}_1$ and $\mathcal{A}_2$). However, for the bilinear model in Example 2, it is possible to achieve faster convergence with extra sub-Gaussian and environment diversity assumptions.
| $F(\phi(x; \hat\theta), \hat c)$ | The bilinear model in Example 2 |
| $g_t^{(i)}(\hat c)$ | $\ell_t^{(i)}(\hat\Theta^{(i)}, \hat c)$ |
| $G^{(i)}(\hat\Theta)$ | $\lambda\|\hat\Theta\|_F^2 + \sum_{j=1}^{i}\sum_{t=1}^{T}\ell_t^{(j)}(\hat\Theta, c^{(j)})$ with some $\lambda > 0$ |
| $\mathcal{A}_1$ | $\hat\Theta^{(i+1)} = \arg\min_{\hat\Theta}G^{(i)}(\hat\Theta)$ (i.e., ridge regression) |
| $\mathcal{A}_2$ | Same as Table 1 |
| ObserveEnv | True |

Table 3: OMAC with bilinear model
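Using the Kronecker identity checked above, the $\mathcal{A}_1$ update in Table 3 reduces to standard ridge regression on $\mathrm{vec}(\hat\Theta)$ with stacked features $Z_t^{(j)} = c^{(j)\top} \otimes Y(x_t^{(j)})$. A minimal closed-form sketch (variable names illustrative):

```python
import numpy as np

def ridge_meta_update(Z_list, y_list, lam):
    """A_1 of Table 3: vec(Theta) <- argmin lam*||Theta||_F^2 + sum_j,t ||Z vec(Theta) - y||^2,
    solved in closed form. Z_list holds Z_t^(j) = kron(c^(j), Y(x_t^(j))), y_list the targets y_t^(j)."""
    d = Z_list[0].shape[1]
    A = lam * np.eye(d)
    b = np.zeros(d)
    for Z, y in zip(Z_list, y_list):
        A += Z.T @ Z
        b += Z.T @ y
    return np.linalg.solve(A, b)  # vec(Theta_hat^{(i+1)}), column-major ordering
```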
With a bilinear model, the inputs to the OMAC algorithm are shown in Table 3. With extra assumptions on $w_t^{(i)}$ and the environment, we have the following ACE guarantee.
Theorem 6 (OMAC ACE bound with bilinear model). Consider an unknown dynamics model $f(x, c) = Y(x)\Theta c$, where $Y: \mathbb{R}^n \to \mathbb{R}^{n \times \bar p}$ is a known basis and $\Theta \in \mathbb{R}^{\bar p \times h}$. Assume each component of the disturbance $w_t^{(i)}$ is $R$-sub-Gaussian, and the true parameters satisfy $\|\Theta\|_F \le K_\theta$, $\|c^{(i)}\| \le K_c, \forall i$, and $\|Y(x)\|_F \le K_Y, \forall x$. Define $Z_t^{(j)} = c^{(j)\top} \otimes Y(x_t^{(j)}) \in \mathbb{R}^{n \times \bar ph}$ and assume environment diversity: $\lambda_{\min}(\sum_{j=1}^{i}\sum_{t=1}^{T}Z_t^{(j)\top}Z_t^{(j)}) = \Omega(i)$. Then OMAC (Algorithm 1) specified by Table 3 has the following ACE guarantee (with probability at least $1 - \delta$):
$$\mathrm{ACE} \le \frac{\beta}{1-\rho}\sqrt{W^2 + \frac{o(T)}{T} + O\left(\frac{\log(NT/\delta)\log N}{N}\right)}. \qquad (7)$$
If we relax the environment diversity condition to $\Omega(\sqrt{i})$, the last term becomes $O\left(\frac{\log(NT/\delta)}{\sqrt{N}}\right)$.
The sub-Gaussian assumption is widely used in statistical learning theory to obtain concentration bounds [32, 4]. The environment diversity assumption states that $c^{(i)}$ provides "new information" in every outer iteration, such that the minimum eigenvalue of $\sum_{j=1}^{i}\sum_{t=1}^{T}Z_t^{(j)\top}Z_t^{(j)}$ grows linearly as $i$ increases. Note that we do not need $\lambda_{\min}(\sum_{j=1}^{i}\sum_{t=1}^{T}Z_t^{(j)\top}Z_t^{(j)})$ to increase as $T$ goes up. Moreover, if we relax the condition to $\Omega(\sqrt{i})$, the ACE bound becomes worse than in the general element-wise convex case (the last term is $O(1/\sqrt{N})$), which implies the importance of "linear" environment diversity $\Omega(i)$. Task diversity has been shown to be important for representation learning [4, 33]. We provide a proof sketch here; the full proof can be found in Appendix A.6.
Proof sketch. In the outer loop we use the martingale concentration bound [32] and the environment diversity assumption to bound $\|\hat\Theta^{(i+1)} - \Theta\|_F^2 \le O\left(\frac{\log(iT/\delta)}{i}\right), \forall i \ge 1$, with probability at least $1 - \delta$. Then, we use Lemma 1 to show how the outer-loop concentration bound and the inner-loop regret bound of $\mathcal{A}_2$ translate to ACE.
5 Deep OMAC and experiments

We now introduce Deep OMAC, a deep representation learning based OMAC algorithm. Table 4 shows the instantiation. As shown in Table 4, Deep OMAC employs a DNN to represent $\phi$, and uses gradient descent for optimization. With the model⁴ $\phi(x; \hat\theta) \cdot \hat c$, the meta-adapter $\mathcal{A}_1$ updates the representation $\phi$ via deep learning, and the inner-adapter $\mathcal{A}_2$ updates a linear layer $\hat c$.
| $F(\phi(x; \hat\theta), \hat c)$ | $\phi(x; \hat\theta) \cdot \hat c$, where $\phi(x; \hat\theta): \mathbb{R}^n \to \mathbb{R}^{n \times h}$ is a neural network with weights $\hat\theta$ |
| $g_t^{(i)}(\hat c)$ | $\ell_t^{(i)}(\hat\theta^{(i)}, \hat c)$ |
| $\mathcal{A}_1$ | Gradient descent: $\hat\theta^{(i+1)} \leftarrow \hat\theta^{(i)} - \eta\nabla_{\hat\theta}\sum_{t=1}^{T}\ell_t^{(i)}(\hat\theta, \hat c_t^{(i)})$ |
| $\mathcal{A}_2$ | Same as Table 1 |
| ObserveEnv | False |

Table 4: OMAC with deep learning
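A minimal PyTorch sketch of the Table 4 instantiation follows; the architecture, sizes, and learning rate are illustrative assumptions, not the paper's exact configuration:

```python
import torch

n, h = 4, 3
phi = torch.nn.Sequential(                      # shared representation phi(x; theta_hat)
    torch.nn.Linear(n, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, n * h),
)
meta_opt = torch.optim.SGD(phi.parameters(), lr=1e-3)

def f_hat(x, c_hat):
    """F(phi(x; theta_hat), c_hat) = phi(x) @ c_hat, with phi(x) reshaped to (n, h)."""
    return phi(x).view(n, h) @ c_hat

def meta_update(transitions, c_hats):
    """A_1 of Table 4: one gradient step on the summed losses of the finished
    outer iteration, holding the per-step inner states c_hat fixed."""
    meta_opt.zero_grad()
    loss = sum(((f_hat(x, c) - y) ** 2).sum() for (x, y), c in zip(transitions, c_hats))
    loss.backward()
    meta_opt.step()
```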
To demonstrate the performance of OMAC, we consider two sets of experiments:

Inverted pendulum with external wind, gravity mismatch, and unknown damping. The continuous-time model is $ml^2\ddot{\theta} - ml\hat{g}\sin\theta = u + f(\theta, \dot{\theta}, c)$, where $\theta$ is the pendulum angle, $\dot{\theta}/\ddot{\theta}$ is the angular velocity/acceleration, $m$ is the pendulum mass, $l$ is its length, $\hat{g}$ is the gravity estimate, $c$ is the unknown parameter that indicates the external wind condition, and $f(\theta, \dot{\theta}, c)$ is the unknown dynamics, which depends on $c$ but also includes $c$-invariant terms such as damping and gravity mismatch. This model generalizes [35] by considering damping and gravity mismatch.
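To illustrate how this continuous-time model maps into the discrete-time form (1), an Euler discretization could look like the sketch below; the time step, parameters, and the unknown-dynamics placeholder `f_wind` are all hypothetical:

```python
import numpy as np

def f_wind(th, thd, c, damping=0.1):
    """Hypothetical unknown dynamics f(th, thd, c): a wind-dependent torque
    plus a c-invariant damping term."""
    return c * np.cos(th) - damping * thd

def pendulum_step(x, u, c, dt=0.01, m=1.0, l=1.0, g_hat=9.8):
    """One Euler step of m*l^2*thdd - m*l*g_hat*sin(th) = u + f(th, thd, c),
    with state x = [th, thd]; discretizing gives the form of (1)."""
    th, thd = x
    thdd = (g_hat / l) * np.sin(th) + (u + f_wind(th, thd, c)) / (m * l ** 2)
    return np.array([th + dt * thd, thd + dt * thdd])
```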
6-DoF quadrotor with 3-D wind disturbances. We consider a 6-DoF quadrotor model with unknown wind-dependent aerodynamic force $f(x, c) \in \mathbb{R}^3$, where $x \in \mathbb{R}^{13}$ is the drone state (including position, velocity, attitude, and angular velocity) and $c$ is the unknown parameter indicating the wind condition. We incorporate a realistic high-fidelity aerodynamic model from [36].
We consider 6 different controllers in the experiments (see more details about the dynamics models and controllers in Appendix A.7):

- No-adapt simply uses $\hat f_t^{(i)} = 0$, and omniscient uses $\hat f_t^{(i)} = f_t^{(i)}$.
- OMAC (convex) is based on Example 1, where $\hat f_t^{(i)} = Y_1(x_t^{(i)})\hat\theta^{(i)} + Y_2(x_t^{(i)})\hat c_t^{(i)}$. We use random Fourier features [37, 13] to generate both $Y_1$ and $Y_2$, and OGD for both $\mathcal{A}_1$ and $\mathcal{A}_2$ in Table 1.
- OMAC (bi-convex) is based on Example 2, where $\hat f_t^{(i)} = Y(x_t^{(i)})\hat\Theta^{(i)}\hat c_t^{(i)}$. Similarly, we use random Fourier features to generate $Y$. Although the theoretical result in Section 4.2 requires ObserveEnv = True, we relax this in our experiments and use $G^{(i)}(\hat\Theta) = \sum_{t=1}^{T}\ell_t^{(i)}(\hat\Theta, \hat c_t^{(i)})$ in Table 2, instead of $\sum_{t=1}^{T}\ell_t^{(i)}(\hat\Theta, c^{(i)})$. We again deploy OGD for $\mathcal{A}_1$ and $\mathcal{A}_2$.
- Baseline follows the same procedure except with $\hat\theta^{(i)} \equiv \hat\theta^{(1)}$, i.e., it calls the inner-adapter $\mathcal{A}_2$ at every step and never updates $\hat\theta$, which is standard in adaptive control [13, 12].
⁴The intuition behind this structure is that any analytic function $\bar f(x, \bar c)$ can be approximated by $\phi(x)c(\bar c)$ with a universal approximator $\phi$ [34]. We provide a detailed theoretical justification in Appendix A.7.