Adaptive Control and Regret Minimization in Linear Quadratic Gaussian (LQG) Setting

Sahin Lale¹, Kamyar Azizzadenesheli², Babak Hassibi¹, Anima Anandkumar²
¹Department of Electrical Engineering
²Department of Computing and Mathematical Sciences
California Institute of Technology, Pasadena
{alale,kazizzad,hassibi,anima}@caltech.edu
Abstract

We study the problem of adaptive control in partially observable linear quadratic Gaussian control systems, where the model dynamics are unknown a priori. We propose LqgOpt, a novel reinforcement learning algorithm based on the principle of optimism in the face of uncertainty, to effectively minimize the overall control cost. We employ the predictor state evolution representation of the system dynamics and deploy a recently proposed closed-loop system identification method for estimation and confidence bound construction. LqgOpt efficiently explores the system dynamics, estimates the model parameters up to their confidence interval, and deploys the controller of the most optimistic model for further exploration and exploitation. We provide stability guarantees for LqgOpt and prove a regret upper bound of $\tilde{\mathcal{O}}(\sqrt{T})$ for the adaptive control of linear quadratic Gaussian (LQG) systems, where $T$ is the time horizon of the problem.
1 Introduction

One of the core challenges in the fields of control theory and reinforcement learning is adaptive control: the problem of controlling dynamical systems when the dynamics of the systems are unknown to the decision-making agents. In adaptive control, agents interact with given systems in order to explore and control them, while the long-term objective is to minimize the overall average associated costs. The agent has to balance between exploration and exploitation, learn the dynamics, strategize for further exploration, and exploit the estimates to minimize the overall costs. The sequential nature of agent-system interaction results in challenges in system identification, estimation, and control under uncertainty, and these challenges are magnified when the systems are partially observable, i.e., contain hidden underlying dynamics.

In linear systems, when the underlying dynamics are fully observable, the asymptotic optimality of estimation methods has been the topic of study in the last decades [Lai et al., 1982, Lai and Wei, 1987]. Recently, novel techniques and learning algorithms have been developed to study the finite-time behavior of adaptive control algorithms and shed light on the design of optimal methods [Peña et al., 2009, Fiechter, 1997, Abbasi-Yadkori and Szepesvári, 2011]. In particular, Abbasi-Yadkori and Szepesvári [2011] propose to use the principle of optimism in the face of uncertainty (OFU) to balance exploration and exploitation in LQR, where the state of the system is observable. The OFU principle suggests estimating the model parameters up to their confidence interval, and then acting according to the policy advised by the model in the confidence set with the lowest optimal cost, known as the optimistic model.
When the underlying dynamics of linear systems are partially observable, estimating the systems' dynamics requires considering and analyzing unobservable events, resulting in a series of significant challenges in learning and controlling partially observable systems. A line of prior work is dedicated to the problem of open-loop model estimation [Oymak and Ozay, 2018, Sarkar et al., 2019, Tsiamis and Pappas, 2019], where the proposed methods rely heavily on random excitation and uncorrelated Gaussian noise, and do not allow feedback control. Recently, Lale et al. [2020a] introduced a novel estimation method for the general cases of both closed- and open-loop identification of linear dynamical systems with unobserved hidden states. Their method provides the first finite-time estimation analysis and construction of confidence sets in partially observable linear dynamical systems when the data is collected using a feedback controller, i.e., closed-loop system identification.

In general, computing the optimal controller requires inferring the latent state of the system given the history of observations. When the model dynamics are not known precisely, the uncertainties in the system estimates result in inaccurate latent state estimation and an inaccurate linear controller. The possible accumulation of these errors creates a challenging problem in the adaptive control of partially observable linear systems. Therefore, we need to account for these challenges in designing an algorithm that performs desirably. In this work, we employ regret, a metric for quantifying the performance of learning algorithms that measures the difference between the cost incurred by an adaptive control agent and that of an optimal controller that knows the underlying system [Lai and Robbins, 1985].
Contributions: In this work, we study the adaptive control of partially observable linear systems from both the model estimation/system identification and the controller synthesis perspectives. We propose LqgOpt, an adaptive control algorithm for learning and controlling unknown partially observable linear systems with quadratic cost and Gaussian disturbances, i.e., linear quadratic Gaussian (LQG) systems, for which an optimal control exists and has a closed form [Bertsekas, 1995]. LqgOpt interacts with the system, collects samples, estimates the model parameters, and adapts accordingly. LqgOpt deploys the OFU principle to balance the exploration vs. exploitation trade-off. Using the predictor form of the state-space equations of partially observable linear systems, we deploy the least-squares estimation problem introduced in Lale et al. [2020a] and obtain confidence sets on the system parameters. LqgOpt then uses these confidence sets to find the optimistic model and uses the optimal controller for the chosen model for further exploration-exploitation. To analyze the finite-time regret of LqgOpt, we first provide a stability analysis for the sequence of optimistic controllers. Finally, we prove that LqgOpt achieves a regret upper bound of $\tilde{\mathcal{O}}(\sqrt{T})$ for the adaptive control of partially observable linear dynamical systems with convex quadratic cost, an improvement over the $\tilde{\mathcal{O}}(T^{2/3})$ regret upper bound in the prior work Lale et al. [2020b], where $T$ is the number of total interactions.
Simchowitz et al. [2020] and Lale et al. [2020a] propose algorithms which achieve $\tilde{\mathcal{O}}(\sqrt{T})$ and $\mathcal{O}(\mathrm{polylog}(T))$ regret bounds in partially observable linear systems, respectively, under different problem setups. Simchowitz et al. [2020] employ the theory of online learning and propose to start with an initial phase of pure exploration, long enough for accurate model estimation. This phase is then followed by committing to the learned model and deploying online learning for the policy updates. They consider various settings (adversarial and stochastic noise) and attain a similar-order regret bound for strongly convex cost functions. Lale et al. [2020a] provide a new closed-loop system identification algorithm and similarly adopt online learning tools to achieve the first logarithmic regret in partially observable linear dynamical systems with strongly convex cost function and stochastic disturbances. These two works heavily rely on strong convexity, whereas the results in this paper consider the more general setting of convex cost functions (Table 1).
Table 1: Comparison with prior works in partially observable linear dynamical systems.

Work                      | Regret          | Cost            | Identification
--------------------------|-----------------|-----------------|---------------
Mania et al. [2019]       | $\sqrt{T}$      | Strongly Convex | Open-Loop
Simchowitz et al. [2020]  | $\sqrt{T}$      | Strongly Convex | Open-Loop
Lale et al. [2020a]       | $\mathrm{polylog}(T)$ | Strongly Convex | Closed-Loop
Lale et al. [2020b]       | $T^{2/3}$       | Convex          | Open-Loop
Simchowitz et al. [2020]  | $T^{2/3}$       | Convex          | Open-Loop
This Work                 | $\sqrt{T}$      | Convex          | Closed-Loop
2 Preliminaries

We denote the Euclidean norm of a vector $x$ as $\|x\|_2$. We denote $\rho(A)$ as the spectral radius of a matrix $A$, $\|A\|_F$ as its Frobenius norm, and $\|A\|_2$ as its spectral norm. $\mathrm{Tr}(A)$ is its trace, $A^\top$ is the transpose, and $A^\dagger$ is the Moore-Penrose inverse. The $j$-th singular value of a rank-$n$ matrix $A$ is denoted by $\sigma_j(A)$, where $\sigma_{\max}(A) := \sigma_1(A) \ge \sigma_2(A) \ge \ldots \ge \sigma_{\min}(A) := \sigma_n(A) > 0$. $I$ represents the identity matrix of appropriate dimension.
Consider the following discrete-time linear time-invariant system $\Theta = (A, B, C)$ with dynamics

$$x_{t+1} = A x_t + B u_t + w_t, \qquad y_t = C x_t + z_t. \tag{1}$$

At each time step $t$, the system is at (hidden) state $x_t \in \mathbb{R}^n$, and the agent receives an observation $y_t \in \mathbb{R}^m$ under measurement noise $z_t \sim \mathcal{N}(0, \sigma_z^2 I)$. The agent then applies a control input $u_t \in \mathbb{R}^p$ and incurs a cost $c_t = y_t^\top Q y_t + u_t^\top R u_t$, where $Q$ and $R$ are positive semidefinite and positive definite matrices, respectively. After taking $u_t$, the state of the system evolves to $x_{t+1}$ at time step $t+1$ under process noise $w_t \sim \mathcal{N}(0, \sigma_w^2 I)$. Here the noises are i.i.d. random vectors, and $\mathcal{N}(\mu, \Sigma)$ denotes a multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$.
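To make the setting concrete, the following is a minimal simulation sketch of the dynamics (1) and the quadratic cost; all numerical values (dimensions, noise scales, and the randomly drawn $A$, $B$, $C$) are illustrative placeholders, not from the paper.

```python
# Minimal sketch: simulating system (1) and accumulating the cost c_t.
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 3, 2, 2                                  # state, output, input dims
A = 0.5 * np.diag(rng.uniform(-1, 1, n))           # diagonal with entries in (-0.5, 0.5), so rho(A) < 1
B = rng.standard_normal((n, p))
C = rng.standard_normal((m, n))
Q, R = np.eye(m), np.eye(p)
sigma_w, sigma_z = 0.1, 0.1

x, total_cost = np.zeros(n), 0.0
for t in range(100):
    y = C @ x + sigma_z * rng.standard_normal(m)   # observation under noise z_t
    u = rng.standard_normal(p)                     # e.g., warm-up Gaussian excitation
    total_cost += y @ Q @ y + u @ R @ u            # c_t = y_t' Q y_t + u_t' R u_t
    x = A @ x + B @ u + sigma_w * rng.standard_normal(n)  # state evolution
```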
Definition 2.1. A linear system $\Theta = (A, B, C)$ is $(A, B)$ controllable if the controllability matrix

$$\mathcal{C}(A, B, n) = \left[\, B \;\; AB \;\; A^2 B \;\; \ldots \;\; A^{n-1} B \,\right]$$

has full row rank. For all $H \ge n$, $\mathcal{C}(A, B, H)$ defines the extended $(A, B)$ controllability matrix. Similarly, a linear system $\Theta = (A, B, C)$ is $(A, C)$ observable if the observability matrix

$$\mathcal{O}(A, C, n) = \left[\, C^\top \;\; (CA)^\top \;\; (CA^2)^\top \;\; \ldots \;\; (CA^{n-1})^\top \,\right]^\top$$

has full column rank. For all $H \ge n$, $\mathcal{O}(A, C, H)$ defines the extended $(A, C)$ observability matrix.
Suppose the underlying system is controllable and observable. Then the agent chooses control inputs as a function of past observations and aims to minimize the expected average cost

$$J(\Theta) = \lim_{T \to \infty} \min_{u = [u_1, \ldots, u_T]} \frac{1}{T}\, \mathbb{E}\!\left[\sum_{t=1}^{T} y_t^\top Q y_t + u_t^\top R u_t\right].$$
This problem is known as LQG control. The optimal solution to the LQG control problem is a linear feedback control policy given as $u_t = K \hat{x}_{t|t,\Theta}$. Here $K$ is the optimal feedback gain matrix,

$$K = -\left(R + B^\top P B\right)^{-1} B^\top P A,$$

where $P$ is the unique positive semidefinite solution to the following discrete-time algebraic Riccati equation (DARE):

$$P = A^\top P A + C^\top Q C - A^\top P B \left(R + B^\top P B\right)^{-1} B^\top P A, \tag{2}$$

and $\hat{x}_{t|t,\Theta}$ is the minimum mean square error (MMSE) estimate of the underlying state using the system parameters $\Theta$ and past observations, where $\hat{x}_{0|-1,\Theta} = 0$. At steady state, this estimate is efficiently obtained by using the Kalman filter:

$$\hat{x}_{t|t,\Theta} = (I - LC)\, \hat{x}_{t|t-1,\Theta} + L y_t, \tag{3}$$
$$\hat{x}_{t|t-1,\Theta} = A \hat{x}_{t-1|t-1,\Theta} + B u_{t-1}, \tag{4}$$
$$L = \Sigma C^\top \left(C \Sigma C^\top + \sigma_z^2 I\right)^{-1}, \tag{5}$$

where $\Sigma$ is the unique positive semidefinite solution to the following DARE:

$$\Sigma = A \Sigma A^\top - A \Sigma C^\top \left(C \Sigma C^\top + \sigma_z^2 I\right)^{-1} C \Sigma A^\top + \sigma_w^2 I.$$
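As an illustration of how these gains can be computed numerically, the sketch below solves the two DAREs with scipy.linalg.solve_discrete_are (using SciPy is an assumption of this sketch, not a dependency of the paper) to obtain $K$ and $L$.

```python
# Sketch: the feedback gain K from the control DARE (2) and the steady-state
# Kalman gain L from the filter DARE, via SciPy.
import numpy as np
from scipy.linalg import solve_discrete_are

def lqg_gains(A, B, C, Q, R, sigma_w2, sigma_z2):
    n, m = A.shape[0], C.shape[0]
    # Control DARE (2): P = A'PA + C'QC - A'PB (R + B'PB)^{-1} B'PA
    P = solve_discrete_are(A, B, C.T @ Q @ C, R)
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # u_t = K x_hat
    # Filter DARE: Sigma = A Sigma A' - A Sigma C'(C Sigma C' + sz2 I)^{-1} C Sigma A' + sw2 I
    Sigma = solve_discrete_are(A.T, C.T, sigma_w2 * np.eye(n), sigma_z2 * np.eye(m))
    L = Sigma @ C.T @ np.linalg.inv(C @ Sigma @ C.T + sigma_z2 * np.eye(m))
    return P, K, Sigma, L

# One Kalman filter step, equations (3)-(4):
#   x_pred = A @ x_hat + B @ u_prev            # time update (4)
#   x_hat  = (np.eye(n) - L @ C) @ x_pred + L @ y   # measurement update (3)
```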
In adaptive control, the underlying system parameters $\Theta$ are unknown, and the agent needs to learn them through interaction with the system, with the aim of minimizing the cumulative cost $\sum_{t=1}^{T} c_t$ after $T$ time steps. We measure the performance of the agent using regret, i.e., the difference between the agent's cost and the optimal expected cost:

$$\mathrm{REGRET}(T) = \sum_{t=0}^{T} \left(c_t - J(\Theta)\right).$$
The system characterization depicted in (1) is called the state-space form of the system $\Theta$. The same discrete-time linear time-invariant system can be represented in several other ways, which have been considered in various works in control theory and reinforcement learning [Kailath et al., 2000, Tsiamis et al., 2019, Lale et al., 2020a]. Note that these representations all have the same second-order statistics. One of the most common is the innovations form¹ of the system, characterized as

$$x_{t+1} = A x_t + B u_t + F e_t, \qquad y_t = C x_t + e_t, \tag{6}$$

¹For simplicity, all of the system representations are presented for the steady state of the system.
where $F = AL$ is the Kalman gain in the observer form and $e_t$ is the zero-mean white innovation process. In this equivalent representation of the system, the state $x_t$ can be seen as the estimate of the state in the state-space representation, i.e., the expression stated in (4). In the steady state, $e_t \sim \mathcal{N}(0,\, C \Sigma C^\top + \sigma_z^2 I)$. Using the relationship between $e_t$ and $y_t$, we obtain the following characterization of the system $\Theta$, known as the predictor form of the system:

$$x_{t+1} = \bar{A} x_t + B u_t + F y_t, \qquad y_t = C x_t + e_t, \tag{7}$$

where $\bar{A} = A - FC$ and $F = AL$. Notice that at steady state, the predictor form allows the current output $y_t$ to be described by the history of inputs and outputs with an i.i.d. Gaussian disturbance $e_t \sim \mathcal{N}(0,\, C \Sigma C^\top + \sigma_z^2 I)$. In this paper, we exploit these fundamental properties to estimate the underlying system, even with feedback control. We consider the set of stable systems.
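The following small check (continuing the toy example and the lqg_gains sketch above, so $A$, $B$, $C$, dimensions, and rng are as defined there) illustrates the equivalence being used: driven by the observed outputs, the predictor form (7) reproduces the innovations-form state exactly, since $\bar{A} x_t + F y_t = (A - FC) x_t + F(C x_t + e_t) = A x_t + F e_t$.

```python
# Sketch: the predictor form (7) tracks the innovations form (6) exactly.
_, _, _, L = lqg_gains(A, B, C, Q, R, sigma_w**2, sigma_z**2)
F = A @ L                                   # Kalman gain in observer form
Abar = A - F @ C                            # predictor-form transition matrix
x_inno, x_pred = np.zeros(n), np.zeros(n)
for t in range(50):
    e = rng.standard_normal(m)              # stand-in for the white innovation e_t
    y = C @ x_inno + e
    u = rng.standard_normal(p)
    x_inno = A @ x_inno + B @ u + F @ e     # innovations form (6)
    x_pred = Abar @ x_pred + B @ u + F @ y  # predictor form (7)
assert np.allclose(x_inno, x_pred)          # identical state trajectories
```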
Assumption 2.1. The system is of order $n$ and minimal, in the sense that the system cannot be described by a state-space model of order less than $n$. The system is stable, i.e., $\rho(A) < 1$, and $\Phi(A) := \sup_{\tau \ge 0} \|A^\tau\| / \rho(A)^\tau < \infty$.

Note that the assumption regarding $\Phi(A)$ is required for quantifying the finite-time evolution of the system, and it is a mild condition; e.g., if $A$ is diagonalizable, $\Phi(A)$ is finite. Additionally, for stable $A$, $\Phi(A)$ can be upper bounded by the $\mathcal{H}_\infty$-norm of the system $x_{t+1} = A x_t + w_t$ [Mania et al., 2019].
We assume that the underlying system lives in the following set.

Assumption 2.2. The unknown system $\Theta = (A, B, C)$ is a member of a set $\mathcal{S}$, such that

$$\mathcal{S} \subseteq \left\{ \Theta' = (A', B', C', F') \;:\; \rho(A') < 1,\; (A', B') \text{ is controllable},\; (A', C') \text{ is observable},\; (A', F') \text{ is controllable} \right\}.$$

The above assumptions are standard in system identification settings in order to ensure the possibility of accurate estimation of the system parameters [Knudsen, 2001, Oymak and Ozay, 2018, Tsiamis and Pappas, 2019, Sarkar et al., 2019, Tsiamis et al., 2019, Lale et al., 2020a,b].
Assumption 2.3. The set $\mathcal{S}$ consists of systems that are contractible, i.e.,

$$\rho := \sup_{\Theta' = (A', B', C') \in \mathcal{S}} \left\| A' - B' K(\Theta') \right\| < 1,$$

where $K(\Theta')$ is the optimal feedback gain matrix of $\Theta'$, and

$$\upsilon := \sup_{\Theta' = (A', B', C') \in \mathcal{S}} \left\| A' - A' L(\Theta') C' \right\| < 1,$$

where $L(\Theta')$ is the optimal Kalman gain matrix of $\Theta'$. There exist finite numbers $D, \Gamma, \zeta$ such that $D = \sup_{\Theta' \in \mathcal{S}} \|P(\Theta')\|$, $\Gamma = \sup_{\Theta' \in \mathcal{S}} \|K(\Theta')\|$, and $\zeta = \sup_{\Theta' \in \mathcal{S}} \|L(\Theta')\|$.

This assumption allows us to develop stability guarantees in the presence of sub-optimal closed-loop controllers.
3 Adaptive Control via LqgOpt

In this section, we present LqgOpt, an adaptive control algorithm for LQG control problems, and describe its components. The outline of LqgOpt is given in Algorithm 1. The early stage of deploying LqgOpt involves a fixed warm-up period dedicated to pure exploration using Gaussian excitation. LqgOpt requires this exploration period to estimate the model parameters reliably enough that the controller designed based on the parameter estimates and their confidence sets is stabilizing on the real system. The duration of this period depends on how stabilizable the true parameters are and how accurate the model estimates need to be. We formally quantify these statements and the length of the warm-up period.

After the warm-up period, LqgOpt utilizes the model parameter estimates and their confidence sets to design a controller corresponding to an optimistic model in the confidence sets, obtained by following the OFU principle. Due to the reliable estimation from the warm-up period, this controller and all subsequently designed controllers stabilize the underlying true unknown model. The agent deploys the prescribed controller on the real system for exploration and exploitation. The agent collects samples throughout its interaction with the environment and uses these samples for further model estimation, confidence interval construction, and design of the controller for a new optimistic model. The agent repeats this process.

Since the Kalman filter converges exponentially fast to the steady-state gain in observer form, without loss of generality we assume that $x_0 \sim \mathcal{N}(0, \Sigma)$, i.e., the system starts at the steady state. This consideration eases the presentation of the algorithm. We provide an overview of the analysis for arbitrary and almost surely finite initialization in Appendix G.
In the warm-up period, LqgOpt excites the system with $u_t \sim \mathcal{N}(0, \sigma_u^2 I)$ for $1 \le t \le T_w$. Considering the predictor form representation of the system given in (7), we can roll the state evolution back $H$ time steps as follows:

$$x_t = \sum_{k=0}^{H-1} \bar{A}^k \left( F y_{t-k-1} + B u_{t-k-1} \right) + \bar{A}^H x_{t-H}.$$
From Assumption 2.2, we have that $\bar{A}$ is stable; thus the state can in principle be estimated for large enough $H$. Using the generated input-output sequence $\mathcal{D} = \{y_t, u_t\}_{t=1}^{T_w}$, LqgOpt constructs $N$ subsequences of $H$ input-output pairs, $\phi_t$ for $H \le t \le T_w$, where $T_w = H + N - 1$:

$$\phi_t = \left[\, y_{t-1}^\top \;\ldots\; y_{t-H}^\top \;\; u_{t-1}^\top \;\ldots\; u_{t-H}^\top \,\right]^\top \in \mathbb{R}^{(m+p)H}.$$
Using this definition and following Lale et al. [2020a], we can write the following truncated autoregressive exogenous (ARX) model for the given system $\Theta$:

$$y_t = M \phi_t + e_t + C \bar{A}^H x_{t-H}, \tag{8}$$

where $M \in \mathbb{R}^{m \times (m+p)H}$ is defined as

$$M = \left[\, CF,\; C\bar{A}F,\; \ldots,\; C\bar{A}^{H-1}F,\; CB,\; C\bar{A}B,\; \ldots,\; C\bar{A}^{H-1}B \,\right]. \tag{9}$$

Thus, any input-output trajectory $\{y_t, u_t\}_{t=1}^{T}$ can be represented as

$$Y_T = \Phi_T M^\top + E_T + N_T, \tag{10}$$

where

$$Y_T = \left[\, y_H, \; y_{H+1}, \; \ldots, \; y_T \,\right]^\top \in \mathbb{R}^{N \times m}, \qquad \Phi_T = \left[\, \phi_H, \; \phi_{H+1}, \; \ldots, \; \phi_T \,\right]^\top \in \mathbb{R}^{N \times (m+p)H},$$
$$E_T = \left[\, e_H, \; e_{H+1}, \; \ldots, \; e_T \,\right]^\top \in \mathbb{R}^{N \times m}, \qquad N_T = \left[\, C\bar{A}^H x_0, \; C\bar{A}^H x_1, \; \ldots, \; C\bar{A}^H x_{T-H} \,\right]^\top \in \mathbb{R}^{N \times m},$$

for $N = T - H + 1$.
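As a concrete illustration of (10), the sketch below (a hypothetical helper, not the authors' code) stacks a recorded trajectory into the covariates $\phi_t$ and the targets $y_t$.

```python
# Sketch: building phi_t and the regression matrices Y, Phi of (10) from a
# trajectory ys (T x m) and us (T x p).
import numpy as np

def build_regression(ys, us, H):
    T = ys.shape[0]
    Phi, Y = [], []
    for t in range(H, T):
        past = [ys[t - k] for k in range(1, H + 1)] \
             + [us[t - k] for k in range(1, H + 1)]
        Phi.append(np.concatenate(past))    # phi_t in R^{(m+p)H}
        Y.append(ys[t])                     # regression target y_t
    return np.array(Y), np.array(Phi)       # Y: N x m, Phi: N x (m+p)H
```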
Algorithm 1 LqgOpt

1:  Input: $T_w$, $H$, $\sigma_o$, $\sigma_c$, $S > 0$, $\delta > 0$, $n$, $m$, $p$, $Q$, $R$
—— Warm-Up ————————————————
2:  for $t = 0, 1, \ldots, T_w$ do
3:    Deploy $u_t \sim \mathcal{N}(0, \sigma_u^2 I)$ and store $\mathcal{D}_0 = \{y_t, u_t\}_{t=1}^{T_w}$
4:  end for
—— Adaptive Control ———————————–
5:  for $i = 0, 1, \ldots$ do
6:    Calculate $\hat{M}_i$ using $\mathcal{D}_i = \{y_t, u_t\}_{t=1}^{2^i T_w}$
7:    Deploy SysId($H$, $\hat{M}_i$, $n$) for $\hat{A}_i$, $\hat{B}_i$, $\hat{C}_i$, $\hat{L}_i$
8:    Construct the confidence sets $\mathcal{C}_A(i)$, $\mathcal{C}_B(i)$, $\mathcal{C}_C(i)$, $\mathcal{C}_L(i)$ s.t. w.h.p. $(A, B, C, L) \in \mathcal{C}_i$, where $\mathcal{C}_i := (\mathcal{C}_A(i) \times \mathcal{C}_B(i) \times \mathcal{C}_C(i) \times \mathcal{C}_L(i))$
9:    Find $\tilde{\Theta}_i = (\tilde{A}_i, \tilde{B}_i, \tilde{C}_i, \tilde{L}_i) \in \mathcal{C}_i \cap \mathcal{S}$ s.t. $J(\tilde{\Theta}_i) \le \inf_{\Theta' \in \mathcal{C}_i \cap \mathcal{S}} J(\Theta') + 1/\sqrt{T}$
10:   for $t = 2^i T_w, \ldots, 2^{i+1} T_w - 1$ do
11:     Execute the optimal controller for $\tilde{\Theta}_i$
12:   end for
13: end for
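In Python-like pseudocode, the epoch structure of Algorithm 1 might look as follows; step, estimate_M, sys_id, confidence_sets, optimistic_choice, and controller are hypothetical placeholders for the components described in this section.

```python
# Skeleton of Algorithm 1's doubling-epoch structure (illustrative only;
# all helper functions are hypothetical placeholders).
data = []
for t in range(T_w):                          # warm-up: pure Gaussian excitation
    data.append(step(u=sigma_u * rng.standard_normal(p)))

i = 0
while 2**i * T_w < T:
    M_hat = estimate_M(data, H, lam)          # regularized least squares (11)
    est = sys_id(M_hat, n)                    # SysId estimates of A, B, C, L
    C_i = confidence_sets(est, data)          # high-probability sets (Theorem 3.4)
    theta_tilde = optimistic_choice(C_i)      # OFU step, cf. (16)
    for t in range(2**i * T_w, 2**(i + 1) * T_w):
        data.append(step(u=controller(theta_tilde, data)))
    i += 1
```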
Note that during the warm-up period the noise terms are zero mean, including the effect of the initial state, since we assume that $x_0 \sim \mathcal{N}(0, \Sigma)$. After the warm-up period, LqgOpt obtains the first estimate of the unknown truncated ARX model $M$ by solving the following regularized least squares problem, first introduced in Lale et al. [2020a]:

$$\hat{M}_0 = \arg\min_{X} \; \left\| Y_{T_w} - \Phi_{T_w} X^\top \right\|_F^2 + \lambda \left\| X \right\|_F^2, \tag{11}$$

whose solution is $\hat{M}_0^\top = \left(\Phi_{T_w}^\top \Phi_{T_w} + \lambda I\right)^{-1} \Phi_{T_w}^\top Y_{T_w}$.
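In code, the closed-form ridge solution of (11) is one line (continuing the build_regression sketch above; $\lambda = 1$ is an arbitrary choice for illustration):

```python
# Sketch: solving the regularized least squares (11) in closed form.
lam = 1.0                                   # regularization parameter lambda
Y, Phi = build_regression(ys, us, H)
M_hat = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ Y).T
# rows of Y are y_t', so M_hat (m x (m+p)H) estimates M in (9)
```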
Using this solution, LqgOpt deploys a system identification algorithm and obtains the estimates of the system parameters $\hat{A}_0, \hat{B}_0, \hat{C}_0, \hat{L}_0$, with corresponding confidence sets $\mathcal{C}_A(0), \mathcal{C}_B(0), \mathcal{C}_C(0), \mathcal{C}_L(0)$ in which the underlying system parameters live with high probability. With the initial confidence sets, LqgOpt starts the adaptive control period using the OFU principle. It selects the optimistic model, i.e., the model that has the minimum average expected cost among the plausible models, and executes the optimal controller for the chosen model. As the confidence sets shrink, i.e., the estimates of the system parameters are significantly refined, LqgOpt adapts and updates its policy by deploying the OFU principle on the new confidence sets.
For a linear system $\Theta = (A, B, C)$, we define truncated open-loop and closed-loop noise evolution parameters, $\mathbf{G}_{ol}$ and $\mathbf{G}_{cl}$ respectively. When the controller is set to be i.i.d. Gaussian excitation, $\mathbf{G}_{ol} \in \mathbb{R}^{H(m+p) \times 2H(n+m+p)}$ encodes the open-loop evolution of the disturbances in the system and represents the responses to these disturbances in the batch of observation and action history. Note that the historical data is correlated even in the open-loop setting with i.i.d. Gaussian excitation. The exact definition of $\mathbf{G}_{ol}$ is provided in equation (20) of Appendix A.1. In Appendix A.1, we also show that $\mathbf{G}_{ol}$ is full row rank, i.e., $\sigma_{\min}(\mathbf{G}_{ol}) > \sigma_o > 0$, where $\sigma_o$ is known to LqgOpt.

When the controller is set to be the optimal policy for the underlying system, i.e., the closed-loop system, $\mathbf{G}_{cl} \in \mathbb{R}^{H(m+p) \times 2H(n+m)}$ represents the translation of the truncated history of process and measurement noises to the inputs, the $\phi$'s. The exact construction of $\mathbf{G}_{cl}$ is provided in detail in equation (24) of Appendix A.2. Briefly, it is formed by shifting a block matrix $\bar{\mathbf{G}} \in \mathbb{R}^{(m+p) \times 2H(n+m)}$ by $m+n$ in each block row, where $\bar{\mathbf{G}}$ is constructed from $H$ matrices of size $(m+p) \times (n+m)$. We assume that the $H$ used in LqgOpt is large enough that $\bar{\mathbf{G}}$ is full row rank for the given system. In Appendix A.2, we show that if $\bar{\mathbf{G}}$ is full row rank, then $\mathbf{G}_{cl}$ is full row rank too. Thus, for the choice of $H$ in LqgOpt, $\sigma_{\min}(\mathbf{G}_{cl})$ is lower bounded by some positive value, i.e., $\sigma_{\min}(\mathbf{G}_{cl}) > \sigma_c > 0$, where LqgOpt only knows $\sigma_c$ and searches for an optimistic system whose closed-loop noise evolution parameter satisfies this lower bound.
The following theorem states the main result of the paper, an end-to-end regret upper bound for adaptive control in LQG systems.

Theorem 3.1 (Regret Upper Bound). Given an LQG system $\Theta = (A, B, C)$ and regulating parameters $Q$ and $R$, with high probability, the regret of LqgOpt with a warm-up duration of $T_w = \mathrm{poly}\left(H, \log(T), \sigma_o, \sigma_c, \upsilon, \zeta, \Gamma, m, n, p, \rho, \Phi(A)\right)$ is

$$\mathrm{REGRET}(T) = \tilde{\mathcal{O}}\left(\sqrt{T}\right). \tag{12}$$

The exact expressions that define $T_w$ are given in the Appendix with the detailed definitions.
3.1 Learning the Truncated ARX Model

The following results are adapted from Lale et al. [2020a]. First, consider the effect of the truncation bias term $N_t$. From Assumption 2.3, we have that $\|\bar{A}\| \le \upsilon < 1$. Thus, each term in $N_t$ is of order $\upsilon^H$. In order to get consistent estimation, for some problem-dependent constant $c_H$, LqgOpt sets

$$H \ge \frac{\log\left(c_H T^2 m / \lambda\right)}{\log(1/\upsilon)},$$

resulting in a negligible bias term of order $1/T^2$. The following Theorem 3 of Lale et al. [2020a] gives the self-normalized finite sample estimation error of (11).
Theorem 3.2 (Closed-Loop Identification, [Lale et al., 2020a]). Let $\hat{M}_t$ be the solution to (11) at time $t$. For the given choice of $H$, define

$$V_t = V + \sum_{i=H}^{t} \phi_i \phi_i^\top, \quad \text{where } V = \lambda I.$$

Let $\|M\|_F \le S$. For $\delta \in (0, 1)$, with probability at least $1 - \delta$, for all $t$, $M$ lies in the set $\mathcal{C}_M(t)$, where

$$\mathcal{C}_M(t) = \left\{ M' \;:\; \mathrm{Tr}\left( (\hat{M}_t - M')\, V_t\, (\hat{M}_t - M')^\top \right) \le \beta_t \right\},$$

for $\beta_t$ defined as follows:

$$\beta_t = \left( \sqrt{ m \left\| C \Sigma C^\top + \sigma_z^2 I \right\| \log\!\left( \frac{\det(V_t)^{1/2}}{\delta\, \det(V)^{1/2}} \right) } \;+\; S\sqrt{\lambda} \;+\; \frac{\sqrt{tH}}{T^2} \right)^2.$$
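Numerically, $\beta_t$ can be evaluated directly from its definition; the sketch below (a hypothetical helper, assuming the $\beta_t$ expression as reconstructed above) uses slogdet for the log-determinant ratio, with nc standing for $\|C\Sigma C^\top + \sigma_z^2 I\|$.

```python
# Sketch: evaluating the confidence radius beta_t of Theorem 3.2.
import numpy as np

def beta_t(V_t, V, nc, m, S, lam, t, H, T, delta):
    # log(det(V_t)^{1/2} / (delta * det(V)^{1/2})), via slogdet for stability
    log_ratio = 0.5 * (np.linalg.slogdet(V_t)[1] - np.linalg.slogdet(V)[1]) \
                - np.log(delta)
    r = np.sqrt(m * nc * log_ratio)          # self-normalized deviation term
    return (r + S * np.sqrt(lam) + np.sqrt(t * H) / T**2) ** 2
```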
For completeness, the proof is given in Appendix B. It uses self-normalized tail inequalities to get the first two terms in the definition of $\beta_t$, and with the given choice of $H$ we obtain the final term in the bound. This bound can be translated to a bound on $\|\hat{M}_t - M\|$ to be utilized in the confidence set construction for the system parameters. First, we need the following lemmas, which guarantee persistence of excitation during the warm-up and adaptive control periods.
Lemma 3.1 (Persistence of Excitation in Warm-Up Period). After sufficient time steps in the warm-up period of LqgOpt, with probability at least $1 - \delta$, we have

$$\sigma_{\min}\!\left( \sum_{i=1}^{t} \phi_i \phi_i^\top \right) \ge \frac{t\, \sigma_o^2 \min\{\sigma_w^2, \sigma_z^2, \sigma_u^2\}}{2}. \tag{13}$$

Lemma 3.2 (Persistence of Excitation in Adaptive Control Period). After sufficient time steps in the adaptive control period of LqgOpt, with probability $1 - 3\delta$, we have

$$\sigma_{\min}\!\left( \sum_{i=1}^{t} \phi_i \phi_i^\top \right) \ge \frac{t\, \sigma_c^2 \min\{\sigma_w^2, \sigma_z^2\}}{16}. \tag{14}$$
For two problem-dependent parameters $\Upsilon_w$ and $\Upsilon_c$ that uniformly bound the components of the $\phi$'s during the warm-up and adaptive control periods respectively, we have the following theorem, which combines Theorem 3.2 with Lemmas 3.1 and 3.2 to obtain a bound on $\|\hat{M}_t - M\|$.

Theorem 3.3. During the warm-up period, $\|\phi_t\| \le \Upsilon_w \sqrt{H}$ with high probability. After the warm-up period of $T_w$, the initial estimate of the truncated ARX model, $\hat{M}_0$, obeys

$$\|\hat{M}_0 - M\| \le \frac{\mathrm{poly}(m, H, p)}{\min\{\sigma_w, \sigma_z, \sigma_u\}\, \sigma_o \sqrt{T_w}}.$$

During adaptive control, with high probability $\|\phi_t\| \le \Upsilon_c \sqrt{H}$. For the adaptive control period, at any time $t \ge 2 T_w$, the least squares estimate of the truncated ARX model $\hat{M}_t$ obeys

$$\|\hat{M}_t - M\| \le \frac{\mathrm{poly}(m, H, p)}{\sqrt{t\, \min\left\{ \sigma_o^2 \sigma_w^2,\; \sigma_o^2 \sigma_z^2,\; \sigma_o^2 \sigma_u^2,\; \frac{\sigma_c^2 \sigma_w^2}{8},\; \frac{\sigma_c^2 \sigma_z^2}{8} \right\}}}.$$
Note that the choice of $H$ depends on the horizon, which needs to be known a priori. Since the dependence of $H$ on the horizon $T$ is only through $\log(T)$, one can deploy the standard doubling trick to relax this requirement.²

²The doubling trick suggests setting the horizon to a fixed time step and, in a repeated fashion, whenever that time step is reached, doubling it and continuing.
3.2 System Identification

After estimating $\hat{M}_t$, LqgOpt constructs confidence sets for the unknown system parameters and uses these confidence sets to come up with the optimistic controller to exploit the information gathered. LqgOpt uses a subspace identification algorithm, SysId, introduced by Lale et al. [2020a] and, for completeness, given in Algorithm 2 in Appendix C. SysId is similar to the Ho-Kalman method [Ho and Kálmán, 1966] and estimates the system parameters from $\hat{M}_t$. First of all, notice that $M = [\mathcal{F}, \mathcal{G}]$, where

$$\mathcal{F} = \left[\, CF,\; C\bar{A}F,\; \ldots,\; C\bar{A}^{H-1}F \,\right] \in \mathbb{R}^{m \times mH}, \qquad \mathcal{G} = \left[\, CB,\; C\bar{A}B,\; \ldots,\; C\bar{A}^{H-1}B \,\right] \in \mathbb{R}^{m \times pH}.$$
Given the estimate of the truncated ARX model $\hat{M}_t = [\hat{F}_{t,1}, \ldots, \hat{F}_{t,H}, \hat{G}_{t,1}, \ldots, \hat{G}_{t,H}]$, where $\hat{F}_{t,i}$ is the $i$-th $m \times m$ block of $\hat{\mathcal{F}}_t$ and $\hat{G}_{t,i}$ is the $i$-th $m \times p$ block of $\hat{\mathcal{G}}_t$ for all $1 \le i \le H$, SysId constructs two $d_1 \times (d_2+1)$ block Hankel matrices $\mathcal{H}_{\hat{\mathcal{F}}_t}$ and $\mathcal{H}_{\hat{\mathcal{G}}_t}$ whose $(i,j)$-th blocks are $\hat{F}_{t,i+j-1}$ and $\hat{G}_{t,i+j-1}$, respectively. Then it forms the matrix

$$\hat{\mathcal{H}}_t = \left[\, \mathcal{H}_{\hat{\mathcal{F}}_t},\; \mathcal{H}_{\hat{\mathcal{G}}_t} \,\right].$$
Recall that the dimension of the latent state, $n$, is the order of the system for an observable and controllable system. For some problem-dependent constant $c_H$, let

$$H \ge \max\left\{ 2n+1,\; \frac{\log\left(c_H T^2 m / \lambda\right)}{\log(1/\upsilon)} \right\};$$

then we can pick $d_1 \ge n$ and $d_2 \ge n$ such that $d_1 + d_2 + 1 = H$. This guarantees that the system identification problem is well conditioned. Using Definition 2.1, if the input to SysId were $M = [\mathcal{F}, \mathcal{G}]$, then the constructed Hankel matrix $\mathcal{H}$ would be rank $n$:

$$\mathcal{H} = \left[\, C^\top, (C\bar{A})^\top, \ldots, (C\bar{A}^{d_1-1})^\top \,\right]^\top \left[\, F, \bar{A}F, \ldots, \bar{A}^{d_2}F,\; B, \bar{A}B, \ldots, \bar{A}^{d_2}B \,\right]$$
$$= \mathcal{O}(\bar{A}, C, d_1)\left[\, \mathcal{C}(\bar{A}, F, d_2+1),\; \mathcal{C}(\bar{A}, B, d_2+1) \,\right]$$
$$= \mathcal{O}(\bar{A}, C, d_1)\left[\, F,\; \bar{A}\,\mathcal{C}(\bar{A}, F, d_2),\; B,\; \bar{A}\,\mathcal{C}(\bar{A}, B, d_2) \,\right].$$
Notice that $M$ and $\mathcal{H}$ are uniquely identifiable for a given system $\Theta$, whereas for any invertible $\mathbf{T} \in \mathbb{R}^{n \times n}$, the system resulting from $A' = \mathbf{T}^{-1}A\mathbf{T}$, $B' = \mathbf{T}^{-1}B$, $C' = C\mathbf{T}$, $L' = \mathbf{T}^{-1}L$ gives the same $M$ and $\mathcal{H}$. Similar to the Ho-Kalman algorithm, SysId computes the SVD of a Hankel matrix derived from $\hat{M}_t$ and estimates the extended observability and controllability matrices, and eventually the system parameters, up to this similarity transformation. To this end, SysId constructs $\hat{\mathcal{H}}_t^-$ by discarding the $(d_2+1)$-th and $(2d_2+2)$-th block columns of $\hat{\mathcal{H}}_t$; i.e., if the input were $M$, we would have

$$\mathcal{H}^- = \mathcal{O}(\bar{A}, C, d_1)\left[\, \mathcal{C}(\bar{A}, F, d_2),\; \mathcal{C}(\bar{A}, B, d_2) \,\right].$$
The algorithm then calculates $\hat{N}_t$, the best rank-$n$ approximation of $\hat{\mathcal{H}}_t^-$, obtained by setting all but its top $n$ singular values to zero. The estimates of $\mathcal{O}(\bar{A}, C, d_1)$, $\mathcal{C}(\bar{A}, F, d_2)$ and $\mathcal{C}(\bar{A}, B, d_2)$ are given via the SVD of $\hat{N}_t$ as

$$\hat{N}_t = U_t \Sigma_t^{1/2} \cdot \Sigma_t^{1/2} V_t^\top = \hat{\mathcal{O}}_t(\bar{A}, C, d_1)\left[\, \hat{\mathcal{C}}_t(\bar{A}, F, d_2),\; \hat{\mathcal{C}}_t(\bar{A}, B, d_2) \,\right].$$
From these estimates, SysId recovers $\hat{C}_t$ as the first $m \times n$ block of $\hat{\mathcal{O}}_t(\bar{A}, C, d_1)$, $\hat{B}_t$ as the first $n \times p$ block of $\hat{\mathcal{C}}_t(\bar{A}, B, d_2)$, and $\hat{F}_t$ as the first $n \times m$ block of $\hat{\mathcal{C}}_t(\bar{A}, F, d_2)$. Let $\hat{\mathcal{H}}_t^+$ be the matrix obtained by discarding the 1st and $(d_2+2)$-th block columns of $\hat{\mathcal{H}}_t$; i.e., if the input were $M$, we would have

$$\mathcal{H}^+ = \mathcal{O}(\bar{A}, C, d_1)\, \bar{A}\, \left[\, \mathcal{C}(\bar{A}, F, d_2),\; \mathcal{C}(\bar{A}, B, d_2) \,\right].$$

Therefore, SysId recovers

$$\hat{\bar{A}}_t = \hat{\mathcal{O}}_t^\dagger(\bar{A}, C, d_1)\; \hat{\mathcal{H}}_t^+\; \left[\, \hat{\mathcal{C}}_t(\bar{A}, F, d_2),\; \hat{\mathcal{C}}_t(\bar{A}, B, d_2) \,\right]^\dagger.$$

Using the definition $\bar{A} = A - FC$, the algorithm obtains $\hat{A}_t = \hat{\bar{A}}_t + \hat{F}_t \hat{C}_t$. Recall that $F = AL$. Using Assumption 2.2, SysId finally recovers $\hat{L}_t$ as the first $n \times m$ block of $\hat{A}_t^\dagger\, \hat{\mathcal{O}}_t^\dagger(\bar{A}, C, d_1)\, \hat{\mathcal{H}}_t$.
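The recovery chain above can be summarized in a short Ho-Kalman-style sketch. This is a minimal illustration of the SVD-based recovery under the indexing used here ($d_1 + d_2 + 1 = H$), not the authors' exact SysId implementation.

```python
# Sketch: Ho-Kalman-style recovery of (A, B, C, L) from M_hat (m x (m+p)H).
import numpy as np

def sys_id(M_hat, n, m, p, d1, d2):
    H = d1 + d2 + 1
    Fb = M_hat[:, :m * H]                   # [CF, C Abar F, ...]
    Gb = M_hat[:, m * H:]                   # [CB, C Abar B, ...]
    blockF = lambda i: Fb[:, m * i: m * (i + 1)]
    blockG = lambda i: Gb[:, p * i: p * (i + 1)]
    # d1 x (d2+1) block Hankel matrices, (i, j) block = block(i + j) (0-indexed)
    HF = np.block([[blockF(i + j) for j in range(d2 + 1)] for i in range(d1)])
    HG = np.block([[blockG(i + j) for j in range(d2 + 1)] for i in range(d1)])
    Hm = np.hstack([HF[:, :m * d2], HG[:, :p * d2]])  # H^-: drop last block col of each half
    Hp = np.hstack([HF[:, m:], HG[:, p:]])            # H^+: drop first block col of each half
    U, s, Vt = np.linalg.svd(Hm)
    O = U[:, :n] @ np.diag(np.sqrt(s[:n]))            # extended observability estimate
    Ctrb = np.diag(np.sqrt(s[:n])) @ Vt[:n, :]        # [C(Abar,F,d2), C(Abar,B,d2)] estimate
    C_hat = O[:m, :]                                  # first m x n block of O
    F_hat = Ctrb[:, :m]                               # first n x m block of C(Abar,F,d2)
    B_hat = Ctrb[:, m * d2: m * d2 + p]               # first n x p block of C(Abar,B,d2)
    Abar_hat = np.linalg.pinv(O) @ Hp @ np.linalg.pinv(Ctrb)
    A_hat = Abar_hat + F_hat @ C_hat                  # since Abar = A - F C
    L_hat = np.linalg.pinv(A_hat) @ F_hat             # since F = A L
    return A_hat, B_hat, C_hat, L_hat
```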
The following Theorem 4 of Lale et al. [2020a] essentially translates the bound in Theorem 3.2 into individual bounds on the system parameter estimates. It provides the high-probability confidence sets required for deploying the OFU principle in adaptive control. Note that Theorem 4 of Lale et al. [2020a] does not estimate $L$, but in the current result we need to recover $L$ for controller construction.
Theorem 3.4 (Confidence Set Construction, [Lale et al., 2020a]). Let $\mathcal{H}$ be the concatenation of the two Hankel matrices obtained from $M$. Let $\bar{A}, \bar{B}, \bar{C}, \bar{L}$ be the system parameters that SysId provides for $M$. At time step $t$, let $\hat{A}_t, \hat{B}_t, \hat{C}_t, \hat{L}_t$ denote the system parameters obtained by SysId using the least squares estimate of the truncated ARX model, $\hat{M}_t$. Suppose Assumptions 2.1 and 2.2 hold; thus $\mathcal{H}$ is rank $n$. After the warm-up period of $T_w$, for the given choice of $H$, there exists a unitary matrix $\mathbf{T} \in \mathbb{R}^{n \times n}$ such that, with high probability, $\bar{\Theta} = (\bar{A}, \bar{B}, \bar{C}, \bar{L}) \in (\mathcal{C}_A \times \mathcal{C}_B \times \mathcal{C}_C \times \mathcal{C}_L)$, where

$$\mathcal{C}_A(t) = \left\{ A' \in \mathbb{R}^{n \times n} : \left\| \hat{A}_t - \mathbf{T}^\top A' \mathbf{T} \right\| \le \beta_A(t) \right\},$$
$$\mathcal{C}_B(t) = \left\{ B' \in \mathbb{R}^{n \times p} : \left\| \hat{B}_t - \mathbf{T}^\top B' \right\| \le \beta_B(t) \right\},$$
$$\mathcal{C}_C(t) = \left\{ C' \in \mathbb{R}^{m \times n} : \left\| \hat{C}_t - C' \mathbf{T} \right\| \le \beta_C(t) \right\},$$
$$\mathcal{C}_L(t) = \left\{ L' \in \mathbb{R}^{n \times m} : \left\| \hat{L}_t - \mathbf{T}^\top L' \right\| \le \beta_L(t) \right\},$$

for

$$\beta_A(t) = c_1 \frac{\sqrt{nH}\left( \|\mathcal{H}\| + \sigma_n(\mathcal{H}) \right)}{\sigma_n^2(\mathcal{H})} \left\| \hat{M}_t - M \right\|, \qquad \beta_B(t) = \beta_C(t) = \sqrt{\frac{20\, nH}{\sigma_n(\mathcal{H})}}\, \left\| \hat{M}_t - M \right\|, \tag{15}$$
$$\beta_L(t) = c_2 \frac{\|\mathcal{H}\|}{\sqrt{\sigma_n(\mathcal{H})}}\, \beta_A(t) + c_3 \frac{\sqrt{nH}\left( \|\mathcal{H}\| + \sigma_n(\mathcal{H}) \right)}{\sigma_n^{3/2}(\mathcal{H})} \left\| \hat{M}_t - M \right\|,$$

for some problem-dependent constants $c_1, c_2$ and $c_3$.
The proof is given in Appendix C. It combines Lemma B.1 of Oymak and Ozay [2018] with a careful perturbation analysis of the system parameter estimates provided by SysId.
3.3 Adaptive Control

Using the confidence sets, LqgOpt implements the OFU principle. At time $t$, the algorithm chooses a system $\tilde{\Theta}_t = (\tilde{A}_t, \tilde{B}_t, \tilde{C}_t, \tilde{L}_t)$ from $\mathcal{C}_t \cap \mathcal{S}$, where $\mathcal{C}_t := (\mathcal{C}_A(t) \times \mathcal{C}_B(t) \times \mathcal{C}_C(t) \times \mathcal{C}_L(t))$, such that

$$J(\tilde{\Theta}_t) \le \inf_{\Theta' \in \mathcal{C}_t \cap \mathcal{S}} J(\Theta') + \frac{1}{\sqrt{T}}. \tag{16}$$
The algorithm designs the optimal feedback policy $(\tilde{P}_t, \tilde{K}_t, \tilde{L}_t)$ for the chosen system $\tilde{\Theta}_t$. It uses this optimistic controller to control the underlying system $\Theta$ for twice as long as the duration of the previous control policy. This technique, known as the "doubling trick" in reinforcement learning and online learning, prevents frequent policy updates and balances the policy changes so that the overall regret of the algorithm is affected only by a constant factor.
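One crude way to picture the optimistic step (16) (purely illustrative; the paper poses it as an optimization over $\mathcal{C}_t \cap \mathcal{S}$, and sample_from_confidence_sets below is a hypothetical sampler) is random search with a Monte Carlo estimate of $J$ obtained by rolling out each candidate under its own optimal controller.

```python
# Illustrative OFU step for (16): evaluate random candidates from the
# confidence sets and keep the one with the smallest estimated average cost.
def J_rollout(theta, Q, R, steps=5000):
    A_, B_, C_, L_ = theta
    _, K_, _, _ = lqg_gains(A_, B_, C_, Q, R, sigma_w**2, sigma_z**2)
    x, x_hat, cost = np.zeros(n), np.zeros(n), 0.0
    for _ in range(steps):
        y = C_ @ x + sigma_z * rng.standard_normal(m)
        x_hat = (np.eye(n) - L_ @ C_) @ x_hat + L_ @ y     # measurement update (3)
        u = K_ @ x_hat                                     # optimistic feedback
        cost += y @ Q @ y + u @ R @ u
        x = A_ @ x + B_ @ u + sigma_w * rng.standard_normal(n)
        x_hat = A_ @ x_hat + B_ @ u                        # time update (4)
    return cost / steps                                    # Monte Carlo estimate of J

candidates = [sample_from_confidence_sets() for _ in range(64)]   # hypothetical sampler
theta_tilde = min(candidates, key=lambda th: J_rollout(th, Q, R))
```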
4 Regret Analysis of LqgOpt

Now that the confidence set constructions and the adaptive control procedure of LqgOpt have been explained, it only remains to analyze the regret of LqgOpt. Lemma 4.1 of Lale et al. [2020b] shows that the random exploration in the warm-up period incurs linear regret, i.e., $\mathcal{O}(T_w)$.

In order to analyze the regret obtained during the adaptive control period, we first need to show that the system is well controlled during this period. The following lemma achieves that.
Lemma 4.1. Suppose Assumptions 2.1-2.3 hold. After the warm-up period of $T_w$, LqgOpt satisfies the following with high probability for all $T \ge t \ge T_w$:

1. $\Theta \in (\mathcal{C}_A(t) \times \mathcal{C}_B(t) \times \mathcal{C}_C(t) \times \mathcal{C}_L(t))$,
2. $\|\hat{x}_{t|t,\hat{\Theta}}\| \le \tilde{X}$,
3. $\|y_t\| \le \tilde{Y}$,

where $\tilde{X}, \tilde{Y} = \mathcal{O}(\sqrt{\log(T)})$. Here, $\mathcal{O}$ hides the problem-dependent constants.
The proof of the lemma, with the precise expressions, is given in Appendix D. This lemma is critical for the regret analysis due to the nature of the adaptive control problem in partially observable environments. The inaccuracies in the system parameter estimates affect both the optimal feedback gain synthesis and the estimation of the underlying state. If these inaccuracies are not tolerable in the adaptive control of the system, they accumulate quickly and cause explosion and unboundedness in the input and the output of the system. This would result in linear, and potentially super-linear, regret. The main technical challenge in the proof is to show that with a $T_w$-length warm-up period, the error between the optimistic controller's state estimate $\hat{x}_{t|t,\hat{\Theta}}$ and the true state estimate $\hat{x}_{t|t,\Theta}$ does not blow up. Lemma 4.1 shows that while the system parameter estimates are being refined, the input to the system and the system's output stay bounded during the adaptive control period.
Given the verification of stability in the adaptive control period, we bound the regret of adaptive control. The regret analysis is based on the Bellman optimality equation for the LQG control problem provided in Lemma 4.3 of Lale et al. [2020b]. The following theorem gives the regret upper bound for the adaptive control period of LqgOpt.

Theorem 4.1 (The regret of adaptive control). Suppose Assumptions 2.1-2.3 hold. After the warm-up period of $T_w$, with high probability, for any time $T$ in the adaptive control period, the regret of LqgOpt is bounded as

$$\mathrm{REGRET}(T) = \tilde{\mathcal{O}}\left(\sqrt{T}\right), \tag{17}$$

where $\tilde{\mathcal{O}}(\cdot)$ hides the logarithmic factors and problem-dependent constants.