Supplementary Materials for
Smooth Imitation Learning for Online Sequence Prediction
A. Detailed Theoretical Analysis and Proofs
A.1. Proof of lemma 5.1
Lemma Statement. (Lemma 5.1) For a fixed $x$, define $\ell(\pi([x,a])) \triangleq \varphi(a)$. If $\varphi$ is non-negative and $H$-smooth w.r.t. $a$, then:
$$\forall a, a': \quad \bigl(\varphi(a) - \varphi(a')\bigr)^2 \;\le\; 6H\bigl(\varphi(a) + \varphi(a')\bigr)\,\|a - a'\|^2.$$
The proof of Lemma 5.1 rests on two properties of $H$-smooth (differentiable) functions on $\mathbb{R}^1$, stated below.
Lemma A.1 (Self-bounding property of Lipschitz-smooth functions). Let $\varphi: \mathbb{R} \to \mathbb{R}$ be an $H$-smooth non-negative function. Then for all $a \in \mathbb{R}$:
$$|\nabla\varphi(a)| \le \sqrt{4H\varphi(a)}.$$
Proof. By the mean value theorem, for any $a, a'$ there exists $\eta \in (a, a')$ (or $(a', a)$) such that $\varphi(a') = \varphi(a) + \nabla\varphi(\eta)(a' - a)$. Since $\varphi$ is non-negative,
$$0 \le \varphi(a') = \varphi(a) + \nabla\varphi(a)(a' - a) + \bigl(\nabla\varphi(\eta) - \nabla\varphi(a)\bigr)(a' - a)$$
$$\le \varphi(a) + \nabla\varphi(a)(a' - a) + H|\eta - a|\,|a' - a| \le \varphi(a) + \nabla\varphi(a)(a' - a) + H|a' - a|^2.$$
Choosing $a' = a - \frac{\nabla\varphi(a)}{2H}$ yields $0 \le \varphi(a) - \frac{|\nabla\varphi(a)|^2}{4H}$, which proves the lemma.
Lemma A.2 (1-d case (Srebro et al., 2010)). Let $\varphi: \mathbb{R} \to \mathbb{R}$ be an $H$-smooth non-negative function. Then for all $a, a' \in \mathbb{R}$:
$$\bigl(\varphi(a) - \varphi(a')\bigr)^2 \le 6H\bigl(\varphi(a) + \varphi(a')\bigr)(a - a')^2.$$
Proof. As before, there exists $\eta \in (a, a')$ such that $\varphi(a') - \varphi(a) = \nabla\varphi(\eta)(a' - a)$. By the assumption on $\varphi$, we have $|\nabla\varphi(\eta) - \nabla\varphi(a)| \le H|\eta - a| \le H|a' - a|$. Thus we have:
$$|\nabla\varphi(\eta)| \le |\nabla\varphi(a)| + H|a - a'|. \qquad (9)$$
Consider two cases:

Case 1: If $|a - a'| \le \frac{|\nabla\varphi(a)|}{5H}$, then by equation (9) we have $|\nabla\varphi(\eta)| \le \frac{6}{5}|\nabla\varphi(a)|$. Thus
$$\bigl(\varphi(a) - \varphi(a')\bigr)^2 = (\nabla\varphi(\eta))^2(a - a')^2 \le \tfrac{36}{25}(\nabla\varphi(a))^2(a - a')^2 \le \tfrac{144}{25} H\varphi(a)(a - a')^2$$
by Lemma A.1. Therefore,
$$\bigl(\varphi(a) - \varphi(a')\bigr)^2 \le 6H\varphi(a)(a - a')^2 \le 6H\bigl(\varphi(a) + \varphi(a')\bigr)(a - a')^2.$$

Case 2: If $|a - a'| > \frac{|\nabla\varphi(a)|}{5H}$, then equation (9) gives $|\nabla\varphi(\eta)| \le 6H|a - a'|$. Once again
$$\bigl(\varphi(a) - \varphi(a')\bigr)^2 = \bigl(\varphi(a) - \varphi(a')\bigr)\,\nabla\varphi(\eta)\,(a - a') \le \bigl|\varphi(a) - \varphi(a')\bigr|\,|\nabla\varphi(\eta)|\,|a - a'|$$
$$\le \bigl|\varphi(a) - \varphi(a')\bigr|\,\bigl(6H(a - a')^2\bigr) \le 6H\bigl(\varphi(a) + \varphi(a')\bigr)(a - a')^2.$$
Proof of Lemma 5.1. The extension to the multi-dimensional case is straightforward. For any $a, a' \in \mathbb{R}^k$, consider the function $\tilde\varphi: \mathbb{R} \to \mathbb{R}$ defined by $\tilde\varphi(t) = \varphi((1-t)a + ta')$. Then $\tilde\varphi$ is a differentiable, non-negative function with $\nabla_t\tilde\varphi(t) = \langle \nabla\varphi(a + t(a' - a)),\, a' - a\rangle$. Thus:
$$|\tilde\varphi'(t_1) - \tilde\varphi'(t_2)| = \bigl|\langle \nabla\varphi(a + t_1(a' - a)) - \nabla\varphi(a + t_2(a' - a)),\, a' - a\rangle\bigr|$$
$$\le \bigl\|\nabla\varphi(a + t_1(a' - a)) - \nabla\varphi(a + t_2(a' - a))\bigr\|\,\|a' - a\| \le H|t_1 - t_2|\,\|a - a'\|^2.$$
Therefore $\tilde\varphi$ is an $H\|a - a'\|^2$-smooth function on $\mathbb{R}$. Applying Lemma A.2 to $\tilde\varphi$, we have:
$$\bigl(\tilde\varphi(1) - \tilde\varphi(0)\bigr)^2 \le 6H\|a - a'\|^2\bigl(\tilde\varphi(1) + \tilde\varphi(0)\bigr)(1 - 0)^2,$$
which is the same as
$$\bigl(\varphi(a) - \varphi(a')\bigr)^2 \le 6H\bigl(\varphi(a) + \varphi(a')\bigr)\|a - a'\|^2.$$
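As an illustrative numerical sanity check (not part of the original argument), the inequality of Lemma 5.1 can be spot-checked for a concrete $H$-smooth function, e.g. $\varphi(a) = \|a\|^2$, whose gradient is $2$-Lipschitz so that $H = 2$:

```python
import numpy as np

# Spot-check of Lemma 5.1 for the 2-smooth, non-negative function phi(a) = ||a||^2:
# verify (phi(a) - phi(a'))^2 <= 6*H*(phi(a) + phi(a')) * ||a - a'||^2 on random pairs.
rng = np.random.default_rng(0)
H = 2.0
phi = lambda a: float(np.dot(a, a))

for _ in range(10_000):
    a, a_prime = rng.normal(size=5), rng.normal(size=5)
    lhs = (phi(a) - phi(a_prime)) ** 2
    rhs = 6.0 * H * (phi(a) + phi(a_prime)) * float(np.sum((a - a_prime) ** 2))
    assert lhs <= rhs + 1e-9
print("Lemma 5.1 inequality held on all sampled pairs.")
```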
A.2. Proof of lemma 5.2
Lemma Statement. (Lemma 5.2) Given any starting state $s_0$, sequentially execute $\pi_{\det}$ and $\pi_{\mathrm{sto}}$ to obtain two separate trajectories $A = \{a_t\}_{t=1}^T$ and $\tilde A = \{\tilde a_t\}_{t=1}^T$ such that $a_t = \pi_{\det}(s_t)$ and $\tilde a_t = \pi_{\mathrm{sto}}(\tilde s_t)$, where $s_t = [x_t, a_{t-1}]$ and $\tilde s_t = [x_t, \tilde a_{t-1}]$. Assuming the policies are stable as per Condition 1, we have $\mathbb{E}_{\tilde A}[\tilde a_t] = a_t$ for all $t = 1, \dots, T$, where the expectation is taken over all random roll-outs of $\pi_{\mathrm{sto}}$.
Proof. Given a starting state $s_0$, we prove by induction that $\mathbb{E}_{\tilde A}[\tilde a_t] = a_t$. The claim is easily seen to be true for $t = 1$. Now assume that $\mathbb{E}_{\tilde A}[\tilde a_{t-1}] = a_{t-1}$. We have
$$\mathbb{E}_{\tilde A}[\tilde a_t] = \mathbb{E}_{\tilde A}\bigl[\mathbb{E}[\tilde a_t \mid \tilde s_t]\bigr] = \mathbb{E}_{\tilde A}\bigl[\beta\hat\pi(\tilde s_t) + (1-\beta)\pi(\tilde s_t)\bigr] = \beta\,\mathbb{E}_{\tilde A}[\hat\pi(\tilde s_t)] + (1-\beta)\,\mathbb{E}_{\tilde A}[\pi(\tilde s_t)].$$
Thus:
$$\bigl\|\mathbb{E}_{\tilde A}[\tilde a_t] - a_t\bigr\| = \bigl\|\mathbb{E}_{\tilde A}[\tilde a_t] - \beta\hat\pi(s_t) - (1-\beta)\pi(s_t)\bigr\|$$
$$= \bigl\|\beta\,\mathbb{E}_{\tilde A}[\hat\pi(\tilde s_t)] + (1-\beta)\,\mathbb{E}_{\tilde A}[\pi(\tilde s_t)] - \beta\hat\pi(s_t) - (1-\beta)\pi(s_t)\bigr\|$$
$$\le \beta\,\bigl\|\mathbb{E}_{\tilde A}[\hat\pi(\tilde s_t)] - \hat\pi(s_t)\bigr\| + (1-\beta)\,\bigl\|\mathbb{E}_{\tilde A}[\pi(\tilde s_t)] - \pi(s_t)\bigr\|$$
$$\le \beta\,\bigl\|\mathbb{E}_{\tilde A}[\tilde a_{t-1}] - a_{t-1}\bigr\| + (1-\beta)\,\bigl\|\mathbb{E}_{\tilde A}[\tilde a_{t-1}] - a_{t-1}\bigr\| = 0,$$
where the last inequality uses the stability of $\hat\pi$ and $\pi$ (Condition 1), since $\tilde s_t$ and $s_t$ differ only in the previous predicted action, and the final equality follows from the inductive hypothesis. Therefore we conclude that $\mathbb{E}_{\tilde A}[\tilde a_t] = a_t$ for all $t = 1, \dots, T$.
A.3. Proof of theorem 5.6 and corollary 5.7 - Main policy improvement results
In this section, we provide the proofs of Theorem 5.6 and Corollary 5.7.
Theorem Statement. (Theorem 5.6) Assume $\ell$ is convex and $L$-Lipschitz-continuous, and that Condition 2 holds. Let $\epsilon = \max_{s \sim d_\pi}\|\hat\pi(s) - \pi(s)\|$. Then for $\beta \in (0, 1)$:
$$\ell_{\pi'}(\pi') - \ell_\pi(\pi) \le \frac{\beta\gamma L\epsilon}{(1-\beta)(1-\gamma)} + \beta\bigl(\ell_\pi(\hat\pi) - \ell_\pi(\pi)\bigr).$$
Proof. First let us review the notation. Let $T$ be the trajectory horizon. For a policy $\pi$ in the deterministic policy class $\Pi$, given a starting state $s_0$, we roll out the full trajectory $s_0 \xrightarrow{\pi} s_1 \xrightarrow{\pi} \cdots \xrightarrow{\pi} s_T$, where $s_t = [x_t, \pi(s_{t-1})]$, with $x_t$ encoding the featurized input at the current time $t$ and $\pi(s_{t-1})$ encoding the dependency on previous predictions. Let $\ell(\pi(s))$ be the loss of taking action $\pi(s)$ at state $s$. We can define the trajectory loss of policy $\pi$ from starting state $s_0$ as
$$\ell(\pi \mid s_0) = \frac{1}{T}\sum_{t=1}^T \ell(\pi(s_t)).$$
For a starting state distribution $\mu$, we define the policy loss of $\pi$ as the expected loss along trajectories induced by $\pi$: $\ell_\pi(\pi) = \mathbb{E}_{s_0 \sim \mu}[\ell(\pi \mid s_0)]$. The policy loss $\ell_\pi(\pi)$ can be understood as
$$\ell_\pi(\pi) = \int_{s_0 \sim \mu} \mathbb{E}_{x_t \sim X}\left[\frac{1}{T}\sum_{t=1}^T \ell(\pi(s_t))\right] d\mu(s_0).$$
To prove policy improvement, we drop the iteration subscript of Algorithm 1 and consider the general policy update rule within each iteration:
$$\pi' = \pi_{\text{new}} = \beta\hat\pi + (1-\beta)\pi \qquad (10)$$
where $\pi = \pi_{\text{old}}$ is the current policy (combined up until the previous iteration) and $\hat\pi$ is the model trained by calling the base regression routine $\mathrm{Train}(S, \hat A \mid h)$. The learning rate (step size) $\beta$ may be chosen adaptively in each iteration. Recall that this update rule reflects a deterministic interpolation of two policies.
We are interested in quantifying the policy improvement when updating $\pi$ to $\pi'$. Specifically, we want to bound
$$\Gamma = \ell_{\pi'}(\pi') - \ell_\pi(\pi),$$
where $\ell_\pi(\pi)$ (respectively $\ell_{\pi'}(\pi')$) denotes the trajectory loss of $\pi$ (respectively $\pi'$) on the state distribution induced by $\pi$ (resp. $\pi'$).
We will bound the loss difference of the old and new policies conditioned on a common starting state $s_0$. Based on update rule (10), consider rolling out $\pi'$ and $\pi$ from the same starting state $s_0$ to obtain two separate sequences $\pi' \mapsto \{s_0 \to s_1' \to \dots \to s_T'\}$ and $\pi \mapsto \{s_0 \to s_1 \to \dots \to s_T\}$ corresponding to the same stream of inputs $x_1, \dots, x_T$. Then
$$\Gamma(s_0) = \frac{1}{T}\sum_{t=1}^T \bigl[\ell(\pi'(s_t')) - \ell(\pi(s_t))\bigr] = \frac{1}{T}\sum_{t=1}^T \bigl[\ell(\pi'(s_t')) - \ell(\pi'(s_t)) + \ell(\pi'(s_t)) - \ell(\pi(s_t))\bigr]. \qquad (11)$$
Assume convexity of $\ell$ (e.g., a sum of squared losses):
$$\ell(\pi'(s_t)) = \ell\bigl(\beta\hat\pi(s_t) + (1-\beta)\pi(s_t)\bigr) \le \beta\ell(\hat\pi(s_t)) + (1-\beta)\ell(\pi(s_t)).$$
Thus we can begin to bound the individual components of $\Gamma(s_0)$ as
$$\ell(\pi'(s_t')) - \ell(\pi(s_t)) \le \ell(\pi'(s_t')) - \ell(\pi'(s_t)) + \beta\bigl[\ell(\hat\pi(s_t)) - \ell(\pi(s_t))\bigr].$$
Since $\ell$ is $L$-Lipschitz continuous, we have
$$\ell(\pi'(s_t')) - \ell(\pi'(s_t)) \le L\|\pi'(s_t') - \pi'(s_t)\| \le L\gamma\|s_t' - s_t\|, \qquad (12)$$
where (12) is due to the smoothness requirement (Condition 2) of the policy class $\Pi$. Given a policy class $\Pi$ with $\gamma < 1$, the following claim can be proved by induction:

Claim: $\|s_t' - s_t\| \le \dfrac{\beta\epsilon}{(1-\beta)(1-\gamma)}$.
Proof. For the base case, given the same starting state $s_0$, we have $s_1' = [x_1, \pi'(s_0)]$ and $s_1 = [x_1, \pi(s_0)]$. Thus
$$\|s_1' - s_1\| = \|\pi'(s_0) - \pi(s_0)\| = \|\beta\hat\pi(s_0) + (1-\beta)\pi(s_0) - \pi(s_0)\| = \beta\|\hat\pi(s_0) - \pi(s_0)\| \le \beta\epsilon \le \frac{\beta\epsilon}{(1-\beta)(1-\gamma)}.$$
For the inductive case, assume $\|s_{t-1}' - s_{t-1}\| \le \frac{\beta\epsilon}{(1-\beta)(1-\gamma)}$. Then, similar to before, the definitions of $s_t'$ and $s_t$ lead to
$$\|s_t' - s_t\| = \bigl\|[x_t, \pi'(s_{t-1}')] - [x_t, \pi(s_{t-1})]\bigr\| = \|\pi'(s_{t-1}') - \pi(s_{t-1})\|$$
$$\le \|\pi'(s_{t-1}') - \pi'(s_{t-1})\| + \|\pi'(s_{t-1}) - \pi(s_{t-1})\| \le \gamma\|s_{t-1}' - s_{t-1}\| + \beta\|\hat\pi(s_{t-1}) - \pi(s_{t-1})\|$$
$$\le \frac{\gamma\beta\epsilon}{(1-\beta)(1-\gamma)} + \beta\epsilon \le \frac{\beta\epsilon}{(1-\beta)(1-\gamma)}.$$
Applying the claim to equation (12), we have
$$\ell(\pi'(s_t')) - \ell(\pi'(s_t)) \le \frac{\beta\gamma L\epsilon}{(1-\beta)(1-\gamma)},$$
which leads to
$$\ell(\pi'(s_t')) - \ell(\pi(s_t)) \le \frac{\beta\gamma L\epsilon}{(1-\beta)(1-\gamma)} + \beta\bigl(\ell(\hat\pi(s_t)) - \ell(\pi(s_t))\bigr). \qquad (13)$$
Integrating (13) over the starting state $s_0 \sim \mu$ and the input trajectories $\{x_t\}_{t=1}^T$, we arrive at the policy improvement bound:
$$\ell_{\pi'}(\pi') - \ell_\pi(\pi) \le \frac{\beta\gamma L\epsilon}{(1-\beta)(1-\gamma)} + \beta\bigl(\ell_\pi(\hat\pi) - \ell_\pi(\pi)\bigr),$$
where $\ell_\pi(\hat\pi)$ is the expected loss of the trained policy $\hat\pi$ on the state distribution induced by policy $\pi$ (the reduction term, analogous to the policy advantage in traditional MDP terminology (Kakade & Langford, 2002)).
This means that in the worst case, as we choose $\beta \to 0$, we have $\ell_{\pi'}(\pi') - \ell_\pi(\pi) \to 0$, i.e. the new policy does not degrade much for a small choice of $\beta$. However, if $\ell_\pi(\hat\pi) - \ell_\pi(\pi) \ll 0$, we can enforce monotonic improvement of the policy by adaptively choosing the $\beta$ that minimizes the right-hand side. In particular, let the reduction term be $\Delta = \ell_\pi(\pi) - \ell_\pi(\hat\pi) > 0$ and let $\delta = \frac{\gamma L\epsilon}{1-\gamma}$; then for $\beta = \frac{\Delta - \delta}{2\Delta}$ we have the following monotonic policy improvement:
$$\ell_{\pi'}(\pi') - \ell_\pi(\pi) \le -\frac{(\Delta - \delta)^2}{2(\Delta + \delta)}.$$
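As a quick arithmetic illustration of this adaptive step size (with made-up values of $\Delta$ and $\delta$, not quantities from the paper), the right-hand side of the bound, rewritten as $\frac{\beta\delta}{1-\beta} - \beta\Delta$, indeed evaluates to $-\frac{(\Delta-\delta)^2}{2(\Delta+\delta)}$ at $\beta = \frac{\Delta-\delta}{2\Delta}$:

```python
# Illustrative arithmetic only; Delta and delta are made-up example values.
Delta = 1.0   # reduction term l_pi(pi) - l_pi(pi_hat)
delta = 0.2   # gamma * L * eps / (1 - gamma)

beta = (Delta - delta) / (2 * Delta)
rhs = beta * delta / (1 - beta) - beta * Delta              # bound of Theorem 5.6 in terms of Delta, delta
closed_form = -(Delta - delta) ** 2 / (2 * (Delta + delta))

print(beta, rhs, closed_form)   # 0.4, -0.2666..., -0.2666...: negative, i.e. guaranteed improvement
```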
A.4. Proof of theorem 5.5 - $T$-dependent improvement
Theorem Statement. (Theorem 5.5) Assume $\ell$ is convex and $L$-Lipschitz, and that Condition 1 holds. Let $\epsilon = \max_{s \sim d_\pi}\|\hat\pi(s) - \pi(s)\|$. Then:
$$\ell_{\pi'}(\pi') - \ell_\pi(\pi) \le \beta L T\epsilon + \beta\bigl(\ell_\pi(\hat\pi) - \ell_\pi(\pi)\bigr).$$
In particular, choosing $\beta \in (0, 1/T)$ yields:
$$\ell_{\pi'}(\pi') - \ell_\pi(\pi) \le L\epsilon + \beta\bigl(\ell_\pi(\hat\pi) - \ell_\pi(\pi)\bigr).$$
Proof. The proof of Theorem 5.5 largely follows the structure of the proof of Theorem 5.6, except that we use the slightly weaker Condition 1, which leads to a weaker bound on the policy improvement that depends on the trajectory horizon $T$. For any state $s_0$ drawn from the starting state distribution $\mu$, sequentially roll out policies $\pi'$ and $\pi$ to obtain two separate trajectories $\pi': s_0 \to s_1' \to \dots \to s_T'$ and $\pi: s_0 \to s_1 \to \dots \to s_T$. Consider a pair of states $s_t' = [x_t, \pi'(s_{t-1}')]$ and $s_t = [x_t, \pi(s_{t-1})]$ corresponding to the same input feature $x_t$. As before, we can decompose
$$\ell(\pi'(s_t')) - \ell(\pi(s_t)) = \ell(\pi'(s_t')) - \ell(\pi'(s_t)) + \ell(\pi'(s_t)) - \ell(\pi(s_t)) \le L\|\pi'(s_t') - \pi'(s_t)\| + \beta\bigl(\ell(\hat\pi(s_t)) - \ell(\pi(s_t))\bigr)$$
due to the convexity and $L$-Lipschitz continuity of $\ell$.

Condition 1 further yields
$$\ell(\pi'(s_t')) - \ell(\pi(s_t)) \le L\|s_t' - s_t\| + \beta\bigl(\ell(\hat\pi(s_t)) - \ell(\pi(s_t))\bigr).$$
By the construction of the states, note that
$$\|s_t' - s_t\| = \|\pi'(s_{t-1}') - \pi(s_{t-1})\| \le \|\pi'(s_{t-1}') - \pi'(s_{t-1})\| + \|\pi'(s_{t-1}) - \pi(s_{t-1})\|$$
$$\le \|s_{t-1}' - s_{t-1}\| + \beta\|\hat\pi(s_{t-1}) - \pi(s_{t-1})\| \le \|s_{t-1}' - s_{t-1}\| + \beta\epsilon$$
(by Condition 1 and the definition of $\epsilon$). From here, one can use this recursive relation to easily show that $\|s_t' - s_t\| \le \beta\epsilon t$ for all $t \in [1, T]$.

Averaging over the $T$ time steps and integrating over the starting state distribution, we have:
$$\ell_{\pi'}(\pi') - \ell_\pi(\pi) \le \beta L\epsilon(T+1)/2 + \beta\bigl(\ell_\pi(\hat\pi) - \ell_\pi(\pi)\bigr) \le \beta L T\epsilon + \beta\bigl(\ell_\pi(\hat\pi) - \ell_\pi(\pi)\bigr).$$
In particular, $\beta \in (0, 1/T)$ yields $\ell_{\pi'}(\pi') - \ell_\pi(\pi) \le L\epsilon + \beta(\ell_\pi(\hat\pi) - \ell_\pi(\pi))$.
A.5. Proof of proposition 5.8 - smooth expert proposition
Proposition Statement. (Proposition 5.8) Let $\omega$ be the average supervised training error from $\mathcal{F}$, i.e. $\omega = \min_{f \in \mathcal{F}} \mathbb{E}_{x \sim X}\bigl[\|f([x, 0]) - a^*\|\bigr]$. Let the rolled-out trajectory of the current policy $\pi$ be $\{a_t\}$. If the average gap between $\pi$ and $\pi^*$ is such that $\mathbb{E}_{t \sim \mathrm{Uniform}[1:T]}\bigl[\|a_t^* - a_{t-1}\|\bigr] \ge 3\omega + \eta(1+\lambda)$, then using $\{a_t^*\}$ as feedback will cause the trained policy $\hat\pi$ to be non-smooth, i.e.:
$$\mathbb{E}_{t \sim \mathrm{Uniform}[1:T]}\bigl[\|\hat a_t - \hat a_{t-1}\|\bigr] \ge \eta,$$
for $\{\hat a_t\}$ the rolled-out trajectory of $\hat\pi$.
Proof. Recall that $\Pi_\lambda$ is formed by regularizing a class of supervised learners $\mathcal{F}$ with the singleton class of smooth functions $\mathcal{H} \triangleq \{h(a) = a\}$, via a hyper-parameter $\lambda$ that controls the trade-off between being close to the two classes.

Minimizing over $\Pi_\lambda$ can be seen as a regularized optimization problem:
$$\hat\pi(x, a) = \operatorname*{argmin}_{\pi \in \Pi} \ell(\pi([x,a])) = \operatorname*{argmin}_{f \in \mathcal{F},\, h \in \mathcal{H}} (f(x,a) - a^*)^2 + \lambda(f(x,a) - h(a))^2 = \operatorname*{argmin}_{f \in \mathcal{F}} (f(x,a) - a^*)^2 + \lambda(f(x,a) - a)^2, \qquad (14)$$
where the hyper-parameter $\lambda$ trades off the distance of $f(x,a)$ relative to $a$ (smoothness) and to $a^*$ (imitation accuracy), and $a \in \mathbb{R}^1$.
Such a policy $\pi$, at execution time, corresponds to the regularized minimizer of:
$$a_t = \pi([x_t, a_{t-1}]) = \operatorname*{argmin}_a \|a - f([x_t, a_{t-1}])\|^2 + \lambda\|a - a_{t-1}\|^2 = \frac{f([x_t, a_{t-1}]) + \lambda a_{t-1}}{1 + \lambda}, \qquad (15)$$
where $f \in \mathcal{F}$ is the minimizer of equation (14).
Thus we enforce smoothness of the learned policy from $\Pi_\lambda$ by encouraging a low first-order difference of consecutive actions along the executed trajectory $\{a_t\}$. In practice, we may constrain this first-order difference relative to the human trajectory: $\frac{1}{T}\sum_{t=1}^T \|a_t - a_{t-1}\| \le \eta$, where $\eta \propto \frac{1}{T}\sum_{t=1}^T \|a_t^* - a_{t-1}^*\|$.
Consider any given iteration with the following set-up: we execute the old policy $\pi = \pi_{\text{old}}$ to obtain the rolled-out trajectory $\{a_t\}_{t=1}^T$. Form the new data set $D = \{(s_t, a_t^*)\}_{t=1}^T$ with predictors $s_t = [x_t, a_{t-1}]$ and feedback labels simply the human actions $a_t^*$. Use this data set to train a policy $\hat\pi$ by learning a supervised $\hat f \in \mathcal{F}$ from $D$. Similar to $\pi$, the execution of $\hat\pi$ corresponds to $\hat a_t$ where
$$\hat a_t = \hat\pi([x_t, \hat a_{t-1}]) = \operatorname*{argmin}_a \bigl\|a - \hat f([x_t, \hat a_{t-1}])\bigr\|^2 + \lambda\|a - \hat a_{t-1}\|^2 = \frac{\hat f([x_t, \hat a_{t-1}]) + \lambda\hat a_{t-1}}{1 + \lambda}. \qquad (16)$$
Denote by $f_0$ the "naive" supervised learner from $\mathcal{F}$, i.e. $f_0 = \operatorname*{argmin}_{f \in \mathcal{F}} \sum_{t=1}^T \|f([x_t, 0]) - a_t^*\|^2$. Let $\omega$ be the average gap between the human trajectory and the rolled-out trajectory of $f_0$, i.e.
$$\omega = \frac{1}{T}\sum_{t=1}^T \|f_0([x_t, 0]) - a_t^*\|.$$
Note that it is reasonable to assume that the average errors of $f$ and $\hat f$ are no worse than that of $f_0$, since in the worst case we can simply discard the extra features $a_{t-1}$ (resp. $\hat a_{t-1}$) of $f$ (resp. $\hat f$) to recover the performance of the naive learner $f_0$:
$$\frac{1}{T}\sum_{t=1}^T \|f([x_t, a_{t-1}]) - a_t^*\| \le \omega, \qquad \frac{1}{T}\sum_{t=1}^T \bigl\|\hat f([x_t, \hat a_{t-1}]) - a_t^*\bigr\| \le \omega.$$
Assume that the old policy $\pi = \pi_{\text{old}}$ is "bad" in the sense that the rolled-out trajectory $\{a_t\}_{t=1}^T$ differs substantially from the human trajectory $\{a_t^*\}_{t=1}^T$. Specifically, denote the gap
$$\frac{1}{T}\sum_{t=1}^T \|a_t^* - a_{t-1}\| = \Omega \gg \omega.$$
This means the feedback correction $a_t^*$ to $s_t = [x_t, a_{t-1}]$ is not smooth. We will show that the policy $\hat\pi$ trained from $D$ will not be smooth.
From the definitions of $a_t$ and $\hat a_t$ in equations (15) and (16), we have for each $t$:
$$a_t - \hat a_t = \frac{\lambda}{1+\lambda}(a_{t-1} - \hat a_{t-1}) + \frac{f([x_t, a_{t-1}]) - \hat f([x_t, \hat a_{t-1}])}{1+\lambda}.$$
Applying the triangle inequality and summing over $t$, we have:
$$\frac{1}{T}\sum_{t=1}^T \|a_t - \hat a_t\| \le 2\omega.$$
From here we can provide a lower bound on the smoothness of the new trajectory $\{\hat a_t\}$, as measured by the first-order difference $\frac{1}{T}\sum_{t=1}^T \|\hat a_t - \hat a_{t-1}\|$. By the definition of $\hat a_t$:
$$\|\hat a_t - \hat a_{t-1}\| = \left\|\frac{\hat f([x_t, \hat a_{t-1}]) - \hat a_{t-1}}{1+\lambda}\right\| = \left\|\frac{\hat f([x_t, \hat a_{t-1}]) - a_t^* + a_t^* - a_{t-1} + a_{t-1} - \hat a_{t-1}}{1+\lambda}\right\|$$
$$\ge \frac{\|a_t^* - a_{t-1}\| - \bigl\|\hat f([x_t, \hat a_{t-1}]) - a_t^*\bigr\| - \|a_{t-1} - \hat a_{t-1}\|}{1+\lambda}.$$
Again summing over $t$ and taking the average, we obtain:
$$\frac{1}{T}\sum_{t=1}^T \|\hat a_t - \hat a_{t-1}\| \ge \frac{\Omega - 3\omega}{1+\lambda}.$$
Hence for $\Omega \gg \omega$, i.e. when the old trajectory is sufficiently far away from the ideal human trajectory, setting the learning target to be the ideal human actions will cause the learned trajectory to be non-smooth.
B. Imitation Learning for Online Sequence Prediction with Smooth Regression Forests
B.1. Variant of SIMILE Using a Smooth Regression Forest Policy Class
We provide a specific instantiation of Algorithm 1 that we used for our experiments, based on a policy class $\Pi$ that is a smoothly regularized version of the space of tree-based ensembles. In particular, $\mathcal{F}$ is the space of random forests and $\mathcal{H}$ is the space of linear autoregressors, $\mathcal{H} \triangleq \{h(a_{t-1:t-\tau}) = \sum_{i=1}^\tau c_i a_{t-i}\}$. In combination, $\mathcal{F}$ and $\mathcal{H}$ form a complex tree-based predictor that can predict smooth sequential actions.
Empirically, decision-tree ensembles are among the best performing supervised machine learning methods (Caruana & Niculescu-Mizil, 2006; Criminisi et al., 2012). Due to the piecewise-constant nature of decision-tree-based prediction, however, the resulting predictions are inevitably non-smooth. We propose a recurrent extension based on $\mathcal{H}$, in which the prediction at a leaf node is not necessarily a constant, but rather a smooth function of both the static leaf-node prediction and the previous predictions. By merging the powerful tree-based policy class with a linear autoregressor, we provide a novel approach to training complex models that accommodate smooth sequential prediction via a model-based smooth regularizer, while leveraging the expressiveness of a complex model-free function class (one can similarly apply the framework to the space of neural networks). Algorithm 2, which is based on SIMILE, describes in more detail the training procedure used for the automated camera planning experiment. We first describe the role of the linear autoregressor class $\mathcal{H}$, before discussing how to incorporate $\mathcal{H}$ into decision tree training to make smooth predictions (see the next section).
The autoregressor $h_\pi(a_{-1}, \dots, a_{-\tau})$ is typically selected from a class of autoregressors $\mathcal{H}$. In our experiments, we use regularized linear autoregressors as $\mathcal{H}$.
Consider a generic learning policy $\pi$ with a rolled-out trajectory $A = \{a_t\}_{t=1}^T$ corresponding to the input sequence $X = \{x_t\}_{t=1}^T$. We form the state sequence $S = \{s_t\}_{t=1}^T = \{[x_t, \dots, x_{t-\tau}, a_{t-1}, \dots, a_{t-\tau}]\}_{t=1}^T$, where $\tau$ indicates the length of the past horizon that is adequate to approximately capture the full state information. We approximate the smoothness of the trajectory $A$ by a linear autoregressor
$$h_\pi \equiv h_\pi(s_t) \equiv \sum_{i=1}^\tau c_i a_{t-i}$$
for a (learned) set of coefficients $\{c_i\}_{i=1}^\tau$ such that $a_t \approx h_\pi(s_t)$.
Algorithm 2 Imitation Learning for Online Sequence Prediction with Smooth Regression Forest
Input: Input features $X = \{x_t\}_{t=1}^T$, expert demonstration $A^* = \{a_t^*\}_{t=1}^T$, base routine Forest, past horizon $\tau$, sequence of $\sigma \in (0,1)$
1: Initialize $A_0 \leftarrow A^*$, $S_0 \leftarrow \{[x_{t:t-\tau}, a_{t-1:t-\tau}^*]\}$, $h_0 = \operatorname*{argmin}_{c_1,\dots,c_\tau} \sum_{t=1}^T \bigl(a_t^* - \sum_{i=1}^\tau c_i a_{t-i}^*\bigr)^2$
2: Initial policy $\pi_0 = \hat\pi_0 \leftarrow \mathrm{Forest}(S_0, A_0 \mid h_0)$
3: for $n = 1, \dots, N$ do
4:   $A_n = \{a_t^n\} \leftarrow \{\pi_{n-1}([x_{t:t-\tau}, a_{t-1:t-\tau}^{n-1}])\}$  // sequentially roll out old policy
5:   $S_n \leftarrow \{s_t^n = [x_{t:t-\tau}, a_{t-1:t-\tau}^n]\}$  // form states (1-d case)
6:   $\hat A_n = \{\hat a_t^n = \sigma a_t^n + (1-\sigma) a_t^*\}$ for all $s_t^n \in S_n$  // collect smooth 1-step feedback
7:   $h_n = \operatorname*{argmin}_{c_1,\dots,c_\tau} \sum_{t=1}^T \bigl(\hat a_t^n - \sum_{i=1}^\tau c_i \hat a_{t-i}^n\bigr)^2$  // update $c_i$ via regularized least squares
8:   $\hat\pi_n \leftarrow \mathrm{Forest}(S_n, \hat A_n \mid h_n)$  // train with smooth decision forests (see Section B.2)
9:   $\beta \leftarrow \mathrm{error}(\pi)\,/\,(\mathrm{error}(\hat\pi) + \mathrm{error}(\pi))$  // set $\beta$ from weighted empirical errors
10:  $\pi_n = \beta\hat\pi_n + (1-\beta)\pi_{n-1}$  // update policy
11: end for
Output: Last policy $\pi_N$
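For concreteness, the sketch below mirrors one iteration of Algorithm 2 (Lines 4-10) in Python. It is not the implementation used for the experiments: it stands in scikit-learn's RandomForestRegressor, combined with the execution-time blending of Eq. (21), for the custom smooth-leaf Forest routine of Section B.2, and the helper names (rollout, fit_autoregressor, simile_iteration) are illustrative. Inputs X and A_star are assumed to be NumPy arrays with 1-d actions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rollout(policy, X, a0, tau):
    """Sequentially roll out `policy`, feeding back its own 1-d predictions (Lines 4-5)."""
    hist = [a0] * tau
    actions, states = [], []
    for x_t in X:
        s_t = np.concatenate([np.atleast_1d(x_t), hist[-tau:][::-1]])  # [x_t, a_{t-1}, ..., a_{t-tau}]
        a_t = float(policy(s_t))
        states.append(s_t); actions.append(a_t); hist.append(a_t)
    return np.array(actions), np.array(states)

def fit_autoregressor(A_hat, tau, ridge=1e-3):
    """Regularized least-squares fit of c_1..c_tau with a_t ~ sum_i c_i a_{t-i} (Line 7 / Eq. 17)."""
    Y = A_hat[tau:]
    Z = np.column_stack([A_hat[tau - i:len(A_hat) - i] for i in range(1, tau + 1)])
    return np.linalg.solve(Z.T @ Z + ridge * np.eye(tau), Z.T @ Y)

def simile_iteration(pi_old, X, A_star, tau, sigma, lam):
    A_n, S_n = rollout(pi_old, X, A_star[0], tau)            # Lines 4-5 (A_star[0] pads the history)
    A_hat = sigma * A_n + (1.0 - sigma) * A_star             # Line 6: smooth 1-step feedback
    c = fit_autoregressor(A_hat, tau)                        # Line 7
    forest = RandomForestRegressor(n_estimators=50).fit(S_n, A_hat)  # Line 8 (stand-in for Forest)

    def pi_hat(s):  # blend forest output with the autoregressor, as in Eq. (21)
        return (forest.predict(s.reshape(1, -1))[0] + lam * np.dot(c, s[-tau:])) / (1.0 + lam)

    err_old = np.mean(np.abs(A_n - A_star))
    err_new = np.mean(np.abs(rollout(pi_hat, X, A_star[0], tau)[0] - A_star))
    beta = err_old / (err_new + err_old + 1e-12)             # Line 9: weighted empirical errors
    return lambda s: beta * pi_hat(s) + (1.0 - beta) * pi_old(s)   # Line 10: interpolated policy
```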
Given feedback targets $\hat A = \{\hat a_t\}$, the joint loss function becomes
$$\ell(a, \hat a_t) = \ell_d(a, \hat a_t) + \lambda\ell_R(a, s_t) = (a - \hat a_t)^2 + \lambda\Bigl(a - \sum_{i=1}^\tau c_i a_{t-i}\Bigr)^2.$$
Here $\lambda$ trades off between smoothness and absolute imitation accuracy. The autoregressor $h_\pi$ acts as a smooth linear regularizer, the parameters of which can be updated at each iteration based on the feedback targets $\hat A$ according to
$$h_\pi = \operatorname*{argmin}_{h \in \mathcal{H}} \bigl\|\hat A - h(\hat A)\bigr\|^2 = \operatorname*{argmin}_{c_1,\dots,c_\tau} \sum_{t=1}^T \Bigl(\hat a_t - \sum_{i=1}^\tau c_i \hat a_{t-i}\Bigr)^2. \qquad (17)$$
In practice we use a regularized version of equation (17) to learn a new set of coefficients $\{c_i\}_{i=1}^\tau$. The Forest procedure (Line 8 of Algorithm 2) then uses this updated $h_\pi$ to train a new policy that optimizes the trade-off between $a_t \approx \hat a_t$ (feedback) and smoothness, as dictated by $a_t \approx \sum_{i=1}^\tau c_i a_{t-i}$.
B.1.1. Smooth Regularization with Linear Autoregressors
Our application of Algorithm 1 to realtime camera planning proceeds as follows. At each iteration, we form a state sequence $S$ based on the rolled-out trajectory $A$ and the tracking input data $X$ such that $s_t = [x_t, \dots, x_{t-\tau}, a_{t-1}, \dots, a_{t-\tau}]$ for an appropriate $\tau$ that captures the history of the sequential decisions. We generate feedback targets $\hat A$ based on each $s_t \in S$ following $\hat a_t = \sigma a_t + (1-\sigma) a_t^*$, using a parameter $\sigma \in (0,1)$ that depends on the Euclidean distance between $A$ and $A^*$. Typically, $\sigma$ gradually decreases to $0$ as the rolled-out trajectory improves on the training set. After generating the targets, a new linear autoregressor $h_\pi$ (a new set of coefficients $\{c_i\}_{i=1}^\tau$) is learned based on $\hat A$ using regularized least squares (as described in the previous section). We then train a new model $\hat\pi$ based on $S$, $\hat A$, and the updated coefficients $\{c_i\}$, using Forest, our recurrent decision tree framework that is capable of generating smooth predictions by using the autoregressor $h_\pi$ as a smooth regularizer (see the following section for how to train smooth decision trees). Note that this typically creates a "chicken-and-egg" problem: as the newly learned policy $\hat\pi$ is greedily trained with respect to $\hat A$, the rolled-out trajectory of $\hat\pi$ may have a state distribution that differs from what the previously learned $h_\pi$ would predict. Our approach offers two remedies to this circular problem. First, by allowing the feedback signals to vary smoothly relative to the current rolled-out trajectory $A$, the new policy $\hat\pi$ should induce a new autoregressor that is similar to the previously learned $h_\pi$. Second, by interpolating distributions (Line 10 of Algorithm 2) and having $\hat A$ eventually converge to the original human trajectory $A^*$, we obtain a stable and converging state distribution, leading to a stable and converging $h_\pi$.

Throughout the iterations, the linear autoregressor $h_\pi$ and the regularization parameter $\lambda$ enforce smoothness of the rolled-out trajectory, while the recurrent decision tree framework Forest learns an increasingly accurate imitation policy. We generally achieve a satisfactory policy after 5-10 iterations on our sport broadcasting data sets. In the following section, we describe the mechanics of our recurrent decision tree training.
B.2. Smooth Regression Tree Training
Given states $s$ as input, a decision tree specifies a partitioning of the input state space. Let $D = \{(s_m, \hat a_m)\}_{m=1}^M$ denote a training set of state/target pairs. Conventional regression tree learning aims to learn a partitioning such that each leaf node, $\mathrm{node}$, makes a constant prediction via minimizing the squared loss function:
$$\bar a_{\mathrm{node}} = \operatorname*{argmin}_a \sum_{(s,\hat a) \in D_{\mathrm{node}}} \ell_d(a, \hat a) = \operatorname*{argmin}_a \sum_{(s,\hat a) \in D_{\mathrm{node}}} (\hat a - a)^2, \qquad (18)$$
where $D_{\mathrm{node}}$ denotes the training data from $D$ that is partitioned into the leaf $\mathrm{node}$. For the squared loss, we have:
$$\bar a_{\mathrm{node}} = \mathrm{mean}\{\hat a \mid (s, \hat a) \in D_{\mathrm{node}}\}. \qquad (19)$$
In the recurrent extension to Forest, we allow the decision tree to branch on the input state $s$, which includes the previous predictions $a_{-1}, \dots, a_{-\tau}$. To enforce more explicit smoothness requirements, let $h_\pi(a_{-1}, \dots, a_{-\tau})$ denote an autoregressor that captures the temporal dynamics of $\pi$ over the distribution of input sequences $d_x$, while ignoring the inputs $x$. At time step $t$, $h_\pi$ predicts the behavior $a_t = \pi(s_t)$ given only $a_{t-1}, \dots, a_{t-\tau}$.
Our policy class $\Pi$ of recurrent decision trees $\pi$ makes smoothed predictions by regularizing the predictions to be close to its autoregressor $h_\pi$. The new loss function incorporates both the squared distance loss $\ell_d$ and a smoothness regularization loss, such that:
$$L_D(a) = \sum_{(s,\hat a) \in D} \ell_d(a, \hat a) + \lambda\ell_R(a, s) = \sum_{(s,\hat a) \in D} (a - \hat a)^2 + \lambda(a - h_\pi(s))^2,$$
where $\lambda$ is a hyper-parameter that controls how much we care about smoothness versus absolute distance loss.
Making predictions: For any tree/policy $\pi$, each leaf node is associated with a terminal leaf node value $\bar a_{\mathrm{node}}$ such that the prediction $\tilde a$ given input state $s$ is:
$$\tilde a(s) \equiv \pi(s) = \operatorname*{argmin}_a (a - \bar a_{\mathrm{node}(s)})^2 + \lambda(a - h_\pi(s))^2 \qquad (20)$$
$$= \frac{\bar a_{\mathrm{node}(s)} + \lambda h_\pi(s)}{1 + \lambda}, \qquad (21)$$
where $\mathrm{node}(s)$ denotes the leaf node of the decision tree that $s$ branches to.
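To illustrate the effect of the prediction rule (20)-(21), the following minimal sketch (not the authors' code) rolls out a recurrent prediction in which `leaf_value` stands in for the tree traversal returning $\bar a_{\mathrm{node}(s)}$ and the autoregressor $h_\pi$ is given by coefficients `c`; a larger $\lambda$ yields a smoother trajectory out of an otherwise piecewise-constant predictor.

```python
import numpy as np

def smooth_rollout(leaf_value, c, X, a_init, lam):
    """Recurrent rollout of Eq. (21): a_t = (leaf_value(s_t) + lam * h_pi(s_t)) / (1 + lam),
    with h_pi(s_t) = sum_i c[i] * a_{t-i}.  `leaf_value` is a stand-in for the tree lookup."""
    prev = list(a_init)                        # [a_{t-1}, ..., a_{t-tau}], most recent first
    actions = []
    for x_t in X:
        s_t = np.concatenate([np.atleast_1d(x_t), prev])
        h = float(np.dot(c, prev))             # autoregressor prediction from previous actions
        a_t = (leaf_value(s_t) + lam * h) / (1.0 + lam)
        actions.append(a_t)
        prev = [a_t] + prev[:-1]
    return np.array(actions)

# Toy piecewise-constant "tree": its raw output jumps between -1 and 1, but the blended
# prediction with lam = 4 changes gradually from step to step.
toy_leaf = lambda s: 1.0 if s[0] > 0 else -1.0
X = np.sign(np.sin(np.linspace(0.0, 6.0, 60)))
print(smooth_rollout(toy_leaf, c=[0.7, 0.3], X=X, a_init=[0.0, 0.0], lam=4.0)[:6])
```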
Setting the terminal node value: Given a fixed $h_\pi$ and decision tree structure, navigating through consecutive binary queries eventually yields a terminal leaf node with associated training data $D_{\mathrm{node}} \subset D$.
One option is to set the terminal node value $\bar a_{\mathrm{node}}$ to satisfy:
$$\bar a_{\mathrm{node}} = \operatorname*{argmin}_a \sum_{(s,\hat a) \in D_{\mathrm{node}}} \ell_d(\tilde a(s \mid a), \hat a) = \operatorname*{argmin}_a \sum_{(s,\hat a) \in D_{\mathrm{node}}} (\tilde a(s \mid a) - \hat a)^2 \qquad (22)$$
$$= \operatorname*{argmin}_a \sum_{(s,\hat a) \in D_{\mathrm{node}}} \left(\frac{a + \lambda h_\pi(s)}{1+\lambda} - \hat a\right)^2$$
for $\tilde a(s \mid a)$ defined as in (21) with $a \equiv \bar a_{\mathrm{node}(s)}$. Similar to (19), we can write the closed-form solution of (22) as:
$$\bar a_{\mathrm{node}} = \mathrm{mean}\{(1+\lambda)\hat a - \lambda h_\pi(s) \mid (s,\hat a) \in D_{\mathrm{node}}\}. \qquad (23)$$
When $\lambda = 0$, (23) reduces to (19).
Note that (22) only accounts for the imitation loss $\ell_d$, but not the smoothness loss $\ell_R$. Alternatively, in the case of joint imitation and smoothness loss, the terminal leaf node is set to minimize the joint loss function:
$$\bar a_{\mathrm{node}} = \operatorname*{argmin}_a L_{D_{\mathrm{node}}}(\tilde a(s \mid a)) = \operatorname*{argmin}_a \sum_{(s,\hat a) \in D_{\mathrm{node}}} \ell_d(\tilde a(s \mid a), \hat a) + \lambda\ell_R(\tilde a(s \mid a), s)$$
$$= \operatorname*{argmin}_a \sum_{(s,\hat a) \in D_{\mathrm{node}}} (\tilde a(s \mid a) - \hat a)^2 + \lambda(\tilde a(s \mid a) - h_\pi(s))^2 \qquad (24)$$
$$= \operatorname*{argmin}_a \sum_{(s,\hat a) \in D_{\mathrm{node}}} \left(\frac{a + \lambda h_\pi(s)}{1+\lambda} - \hat a\right)^2 + \lambda\left(\frac{a + \lambda h_\pi(s)}{1+\lambda} - h_\pi(s)\right)^2$$
$$= \mathrm{mean}\{\hat a \mid (s,\hat a) \in D_{\mathrm{node}}\}. \qquad (25)$$
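As a sketch of how the two choices of terminal value differ in code (assuming the data reaching a leaf has been summarized as arrays of targets $\hat a$ and autoregressor values $h_\pi(s)$; names are illustrative), Eq. (23) and Eq. (25) reduce to:

```python
import numpy as np

def leaf_value_imitation_only(a_hat, h_pi, lam):
    """Eq. (23): minimizes only the imitation loss of the smoothed prediction (21)."""
    return np.mean((1.0 + lam) * a_hat - lam * h_pi)

def leaf_value_joint(a_hat, h_pi, lam):
    """Eq. (25): minimizes the joint imitation + smoothness loss; h_pi and lam cancel out,
    so the value is just the mean target (arguments kept only for a parallel signature)."""
    return np.mean(a_hat)

a_hat = np.array([1.0, 1.2, 0.9])   # feedback targets that reached this leaf
h_pi = np.array([0.8, 1.1, 1.0])    # autoregressor predictions for the same states
print(leaf_value_imitation_only(a_hat, h_pi, lam=2.0), leaf_value_joint(a_hat, h_pi, lam=2.0))
```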
Node splitting mechanism: For a node representing a subset $D_{\mathrm{node}}$ of the training data, the node impurity is defined as:
$$I_{\mathrm{node}} = L_{D_{\mathrm{node}}}(\bar a_{\mathrm{node}}) = \sum_{(s,\hat a) \in D_{\mathrm{node}}} \ell_d(\bar a_{\mathrm{node}}, \hat a) + \lambda\ell_R(\bar a_{\mathrm{node}}, s) = \sum_{(s,\hat a) \in D_{\mathrm{node}}} (\bar a_{\mathrm{node}} - \hat a)^2 + \lambda(\bar a_{\mathrm{node}} - h_\pi(s))^2,$$
where $\bar a_{\mathrm{node}}$ is set according to equation (23) or (25) over the $(s,\hat a)$'s in $D_{\mathrm{node}}$. At each possible splitting point, where $D_{\mathrm{node}}$ is partitioned into $D_{\mathrm{left}}$ and $D_{\mathrm{right}}$, the impurities of the left and right children of the node are defined similarly. As with standard decision trees, the best splitting point is chosen as the one that maximizes the impurity reduction:
$$I_{\mathrm{node}} - \frac{|D_{\mathrm{left}}|}{|D_{\mathrm{node}}|} I_{\mathrm{left}} - \frac{|D_{\mathrm{right}}|}{|D_{\mathrm{node}}|} I_{\mathrm{right}}.$$
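A minimal sketch of the corresponding split selection, again assuming the leaf data is summarized by target values $\hat a$ and autoregressor values $h_\pi(s)$, a single scalar feature to split on, and the Eq. (25) terminal value:

```python
import numpy as np

def impurity(a_hat, h_pi, lam):
    """Joint-loss impurity of a node whose terminal value is the Eq. (25) mean."""
    a_bar = np.mean(a_hat)
    return np.sum((a_bar - a_hat) ** 2 + lam * (a_bar - h_pi) ** 2)

def best_split(feature, a_hat, h_pi, lam):
    """Return (threshold, gain) maximizing the weighted impurity reduction."""
    order = np.argsort(feature)
    f, a, h = feature[order], a_hat[order], h_pi[order]
    parent, n = impurity(a, h, lam), len(f)
    best_threshold, best_gain = None, 0.0
    for i in range(1, n):
        gain = parent - (i / n) * impurity(a[:i], h[:i], lam) \
                      - ((n - i) / n) * impurity(a[i:], h[i:], lam)
        if gain > best_gain:
            best_threshold, best_gain = (f[i - 1] + f[i]) / 2.0, gain
    return best_threshold, best_gain

feature = np.array([0.10, 0.35, 0.40, 0.80, 0.90])
a_hat = np.array([0.00, 0.05, 0.10, 1.00, 1.10])
h_pi = np.array([0.00, 0.10, 0.10, 0.90, 1.00])
print(best_split(feature, a_hat, h_pi, lam=1.0))
```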