Adaptive Control and Regret Minimization in Linear Quadratic Gaussian (LQG) Setting
Sahin Lale¹, Kamyar Azizzadenesheli², Babak Hassibi¹, Anima Anandkumar²
¹Department of Electrical Engineering, ²Department of Computing and Mathematical Sciences
California Institute of Technology, Pasadena
{alale,kazizzad,hassibi,anima}@caltech.edu
Abstract
We study the problem of adaptive control in partially observable linear quadratic Gaussian control systems, where the model dynamics are unknown a priori. We propose LqgOpt, a novel reinforcement learning algorithm based on the principle of optimism in the face of uncertainty, to effectively minimize the overall control cost. We employ the predictor state evolution representation of the system dynamics and deploy a recently proposed method for closed-loop system identification, estimation, and confidence-bound construction. LqgOpt efficiently explores the system dynamics, estimates the model parameters up to their confidence intervals, and deploys the controller of the most optimistic model for further exploration and exploitation. We provide stability guarantees for LqgOpt and prove a regret upper bound of $\tilde{O}(\sqrt{T})$ for the adaptive control of linear quadratic Gaussian (LQG) systems, where $T$ is the time horizon of the problem.
1 Introduction
One of the core challenges in the fields of control theory and reinforcement learning is adaptive control: the problem of controlling dynamical systems when the dynamics of the systems are unknown to the decision-making agents. In adaptive control, agents interact with given systems in order to explore and control them, while the long-term objective is to minimize the overall average associated costs. The agent has to balance between exploration and exploitation, learn the dynamics, strategize for further exploration, and exploit the estimates to minimize the overall costs. The sequential nature of agent-system interaction results in challenges in system identification, estimation, and control under uncertainty, and these challenges are magnified when the systems are partially observable, i.e., they contain hidden underlying dynamics.
In linear systems, when the underlying dynamics are fully observable, the asymptotic optimality of estimation methods has been a topic of study over the last decades [Lai et al., 1982, Lai and Wei, 1987]. Recently, novel techniques and learning algorithms have been developed to study the finite-time behavior of adaptive control algorithms and shed light on the design of optimal methods [Peña et al., 2009, Fiechter, 1997, Abbasi-Yadkori and Szepesvári, 2011]. In particular, Abbasi-Yadkori and Szepesvári [2011] propose to use the principle of optimism in the face of uncertainty (OFU) to balance exploration and exploitation in LQR, where the state of the system is observable. The OFU principle suggests estimating the model parameters up to their confidence intervals, and then acting according to the policy advised by the model in the confidence set with the lowest optimal cost, known as the optimistic model.
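As a concrete illustration, the following minimal Python sketch implements the OFU selection step for the fully observable LQR case. The sampling-based search over a Frobenius-norm ball and all function names are our illustrative simplifications, not the ellipsoidal construction of Abbasi-Yadkori and Szepesvári [2011].

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_optimal_cost(A, B, Q, R, W):
    # Infinite-horizon average cost of the optimal LQR controller: Tr(P W),
    # where P solves the discrete algebraic Riccati equation for (A, B, Q, R).
    P = solve_discrete_are(A, B, Q, R)
    return np.trace(P @ W)

def ofu_select(A_hat, B_hat, radius, Q, R, W, n_candidates=200, seed=0):
    # Approximate the confidence set by random perturbations within a
    # Frobenius-norm ball around the point estimate (A_hat, B_hat), and
    # return the lowest-cost (most optimistic) candidate model.
    rng = np.random.default_rng(seed)
    best_cost, best_model = np.inf, (A_hat, B_hat)
    for _ in range(n_candidates):
        dA = rng.standard_normal(A_hat.shape)
        dB = rng.standard_normal(B_hat.shape)
        scale = radius / max(np.linalg.norm(dA), np.linalg.norm(dB))
        A, B = A_hat + scale * dA, B_hat + scale * dB
        try:
            cost = lqr_optimal_cost(A, B, Q, R, W)
        except (np.linalg.LinAlgError, ValueError):
            continue  # skip candidates where the Riccati equation has no solution
        if cost < best_cost:
            best_cost, best_model = cost, (A, B)
    return best_model, best_cost
```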
When the underlying dynamics of linear systems are partially observable, estimating the systems' dynamics requires considering and analyzing unobservable events, resulting in a series of significant challenges in learning and controlling such systems. A line of prior works is dedicated to the problem of open-loop model estimation [Oymak and Ozay, 2018, Sarkar et al., 2019, Tsiamis and Pappas, 2019], where the proposed methods rely heavily on random excitation with uncorrelated Gaussian noise and do not allow feedback control. Recently, Lale et al. [2020a] introduced a novel estimation method for the general cases of both closed- and open-loop identification of linear dynamical systems with unobserved hidden states. Their method provides the first finite-time estimation analysis and construction of confidence sets in partially observable linear dynamical systems when the data is collected using a feedback controller, i.e., closed-loop system identification.
In general, computing the optimal controller requires inferring the latent state of the system given the history of observations. When the model dynamics are not known precisely, the uncertainties in the system estimates result in inaccurate latent state estimation and an inaccurate linear controller. The possible accumulation of these errors makes adaptive control of partially observable linear systems challenging, and any algorithm that aims to perform desirably must account for them. In this work, we employ regret, a metric for quantifying the performance of learning algorithms that measures the difference between the cost incurred by an adaptive control agent and that of an optimal controller that knows the underlying system [Lai and Robbins, 1985].
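Concretely, for a horizon of $T$ interactions, regret in this setting is typically formalized as the cumulative excess cost over the optimal average cost $J_*(\Theta)$ attainable when the true system $\Theta$ is known; a standard form (the precise definition used later may differ in minor details) is

$$\mathrm{Regret}(T) = \sum_{t=0}^{T} \big( c_t - J_*(\Theta) \big),$$

where $c_t$ is the cost incurred by the agent at time step $t$.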
Contributions: In this work, we study the adaptive control of partially observable linear systems from both the model estimation/system identification and the controller synthesis perspectives. We propose LqgOpt, an adaptive control algorithm for learning and controlling unknown partially observable linear systems with quadratic cost and Gaussian disturbances, i.e., linear quadratic Gaussian (LQG) systems, for which the optimal control exists and has a closed form [Bertsekas, 1995]. LqgOpt interacts with the system, collects samples, estimates the model parameters, and adapts accordingly. LqgOpt deploys the OFU principle to balance the exploration vs. exploitation trade-off. Using the predictor form of the state-space equations of partially observable linear systems, we deploy the least-squares estimation problem introduced in Lale et al. [2020a] and obtain confidence sets on the system parameters. LqgOpt then uses these confidence sets to find the optimistic model and uses the optimal controller of the chosen model for further exploration-exploitation. To analyze the finite-time regret of LqgOpt, we first provide a stability analysis for the sequence of optimistic controllers. Finally, we prove that LqgOpt achieves a regret upper bound of $\tilde{O}(\sqrt{T})$ for adaptive control of partially observable linear dynamical systems with convex quadratic cost, an improvement over the $\tilde{O}(T^{2/3})$ regret upper bound of the prior work Lale et al. [2020b], where $T$ is the total number of interactions.
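For reference, the predictor (innovations) form alluded to above can be written, under standard assumptions and with $L$ denoting the steady-state Kalman gain and $e_t$ the innovation process, as

$$\hat{x}_{t+1} = (A - LC)\,\hat{x}_t + B u_t + L y_t, \qquad y_t = C \hat{x}_t + e_t.$$

Since $\rho(A - LC) < 1$ under these assumptions, each observation is approximately a linear function of finitely many past inputs and observations plus an i.i.d. innovation term, which is what makes least-squares estimation from closed-loop data tractable.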
Simchowitz et al. [2020] and Lale et al. [2020a] propose algorithms that achieve $\tilde{O}(\sqrt{T})$ and $O(\mathrm{polylog}(T))$ regret bounds in partially observable linear systems, respectively, under different problem setups. Simchowitz et al. [2020] employ the theory of online learning and propose to start with an initial phase of pure exploration, long enough for accurate model estimation. This phase is then followed by committing to the learned model and deploying online learning for the policy updates. They consider various settings (adversarial and stochastic noise) and attain regret bounds of similar order for strongly convex cost functions. Lale et al. [2020a] provide a new closed-loop system identification algorithm and similarly adopt online learning tools to achieve the first logarithmic regret in partially observable linear dynamical systems with strongly convex cost functions and stochastic disturbances. These two works heavily rely on strong convexity, whereas the results in this paper consider the more general setting of convex cost functions (Table 1).

Table 1: Comparison with prior works in partially observable linear dynamical systems.

    Work                      Regret                 Cost              Identification
    ------------------------  ---------------------  ----------------  --------------
    Mania et al. [2019]       $\sqrt{T}$             Strongly Convex   Open-Loop
    Simchowitz et al. [2020]  $\sqrt{T}$             Strongly Convex   Open-Loop
    Lale et al. [2020a]       $\mathrm{polylog}(T)$  Strongly Convex   Closed-Loop
    Lale et al. [2020b]       $T^{2/3}$              Convex            Open-Loop
    Simchowitz et al. [2020]  $T^{2/3}$              Convex            Open-Loop
    This Work                 $\sqrt{T}$             Convex            Closed-Loop
2 Preliminaries
We denote the Euclidean norm of a vector $x$ as $\|x\|_2$. We denote $\rho(A)$ as the spectral radius of a matrix $A$, $\|A\|_F$ as its Frobenius norm, and $\|A\|_2$ as its spectral norm. $\mathrm{Tr}(A)$ is its trace, $A^\top$ is its transpose, and $A^\dagger$ is its Moore-Penrose inverse. The $j$-th singular value of a rank-$n$ matrix $A$ is denoted by $\sigma_j(A)$, where $\sigma_{\max}(A) := \sigma_1(A) \geq \sigma_2(A) \geq \ldots \geq \sigma_{\min}(A) := \sigma_n(A) > 0$. $I$ represents the identity matrix of the appropriate dimensions.
Consider the following discrete-time linear time-invariant system $\Theta = (A, B, C)$ with dynamics

$$x_{t+1} = A x_t + B u_t + w_t, \qquad y_t = C x_t + z_t. \quad (1)$$

At each time step $t$, the system is at a (hidden) state $x_t \in \mathbb{R}^n$, and the agent receives the observation $y_t \in \mathbb{R}^m$ under measurement noise $z_t \sim \mathcal{N}(0, \sigma_z^2 I)$.
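To make the setting concrete, the following is a minimal Python sketch simulating a toy instance of the dynamics in (1) under random excitation; the system matrices and noise scales are illustrative choices, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy system Theta = (A, B, C): n = 2 hidden states, one input, m = 1 observation.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])          # hidden-state transition matrix
B = np.array([[0.0],
              [1.0]])               # input matrix
C = np.array([[1.0, 0.0]])          # observation matrix
sigma_w, sigma_z = 0.1, 0.1         # process / measurement noise std. deviations

T = 100
x = np.zeros(A.shape[0])            # hidden state, x_0 = 0
ys, us = [], []

for t in range(T):
    u = rng.standard_normal(B.shape[1])                    # exploratory input u_t
    y = C @ x + sigma_z * rng.standard_normal(C.shape[0])  # y_t = C x_t + z_t
    x = A @ x + B @ u + sigma_w * rng.standard_normal(A.shape[0])  # x_{t+1}
    ys.append(y)
    us.append(u)
```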