Smooth Imitation Learning for Online Sequence Prediction
Hoang M. Le    HMLE@CALTECH.EDU
Andrew Kang    AKANG@CALTECH.EDU
Yisong Yue    YYUE@CALTECH.EDU
California Institute of Technology, Pasadena, CA, USA

Peter Carr    PETER.CARR@DISNEYRESEARCH.COM
Disney Research, Pittsburgh, PA, USA
Abstract

We study the problem of smooth imitation learning for online sequence prediction, where the goal is to train a policy that can smoothly imitate demonstrated behavior in a dynamic and continuous environment in response to online, sequential context input. Since the mapping from context to behavior is often complex, we take a learning reduction approach to reduce smooth imitation learning to a regression problem using complex function classes that are regularized to ensure smoothness. We present a learning meta-algorithm that achieves fast and stable convergence to a good policy. Our approach enjoys several attractive properties, including being fully deterministic, employing an adaptive learning rate that can provably yield larger policy improvements compared to previous approaches, and the ability to ensure stable convergence. Our empirical results demonstrate significant performance gains over previous approaches.
1. Introduction
In many complex planning and control tasks, it can be very challenging to explicitly specify a good policy. For such tasks, the use of machine learning to automatically learn a good policy from observed expert behavior, also known as imitation learning or learning from demonstrations, has proven tremendously useful (Abbeel & Ng, 2004; Ratliff et al., 2009; Argall et al., 2009; Ross & Bagnell, 2010; Ross et al., 2011; Jain et al., 2013).

In this paper, we study the problem of imitation learning for smooth online sequence prediction in a continuous regime.
Online sequence prediction is the problem of making online decisions in response to exogenous input from the environment, and is a special case of reinforcement learning (see Section 2). We are further interested in policies that make smooth predictions in a continuous action space.

Our motivating example is the problem of learning smooth policies for automated camera planning (Chen et al., 2016): determining where a camera should look given environment information (e.g., noisy person detections) and corresponding demonstrations from a human expert.¹ It is widely accepted that a smoothly moving camera is essential for generating aesthetic video (Gaddam et al., 2015).

¹Access data at http://www.disneyresearch.com/publication/smooth-imitation-learning/ and code at http://github.com/hoangminhle/SIMILE.
From a problem formulation standpoint, one key difference between smooth imitation learning and conventional imitation learning is the use of a "smooth" policy class (which we formalize in Section 2), and the goal now is to mimic expert demonstrations by choosing the best smooth policy.

The conventional supervised learning approach to imitation learning is to train a classifier or regressor to predict the expert's behavior given training data comprising input/output pairs of contexts and actions taken by the expert. However, the learned policy's prediction affects (the distribution of) future states during the policy's actual execution, and so violates the crucial i.i.d. assumption made by most statistical learning approaches. To address this issue, numerous learning reduction approaches have been proposed (Daumé III et al., 2009; Ross & Bagnell, 2010; Ross et al., 2011), which iteratively modify the training distribution in various ways such that any supervised learning guarantees provably lift to the sequential imitation setting (potentially at the cost of statistical or computational efficiency).
We present a learning reduction approach to smooth imitation learning for online sequence prediction, which we call SIMILE (Smooth IMItation LEarning). Building
upon learning reductions that employ policy aggregation (Daumé III et al., 2009), we provably lift supervised learning guarantees to the smooth imitation setting and show much faster convergence behavior compared to previous work. Our contributions can be summarized as:
• We formalize the problem of smooth imitation learning for online sequence prediction, and introduce a family of smooth policy classes that is amenable to supervised learning reductions.
• We present a principled learning reduction approach, which we call SIMILE. Our approach enjoys several attractive practical properties, including learning a fully deterministic stationary policy (as opposed to SEARN (Daumé III et al., 2009)), and not requiring data aggregation (as opposed to DAgger (Ross et al., 2011)), which can lead to super-linear training time.
• We provide performance guarantees that lift the underlying supervised learning guarantees to the smooth imitation setting. Our guarantees hold in the agnostic setting, i.e., when the supervised learner might not achieve perfect prediction.
• We show how to exploit a stability property of our smooth policy class to enable adaptive learning rates that yield provably much faster convergence compared to SEARN (Daumé III et al., 2009).
• We empirically evaluate using the setting of smooth camera planning (Chen et al., 2016), and demonstrate the performance gains of our approach.
2. Problem Formulation
Let $\mathbf{X} = \{x_1, \ldots, x_T\} \subset \mathcal{X}^T$ denote a context sequence from the environment $\mathcal{X}$, and $\mathbf{A} = \{a_1, \ldots, a_T\} \subset \mathcal{A}^T$ denote an action sequence from some action space $\mathcal{A}$. The context sequence is exogenous, meaning $a_t$ does not influence future contexts $x_{t+k}$ for $k \geq 1$. Let $\Pi$ denote a policy class, where each $\pi \in \Pi$ generates an action sequence $\mathbf{A}$ in response to a context sequence $\mathbf{X}$. Assume $\mathcal{X} \subset \mathbb{R}^m$ and $\mathcal{A} \subset \mathbb{R}^k$ are continuous and infinite, with $\mathcal{A}$ non-negative and bounded such that $\vec{0} \preceq a \preceq R\vec{1}$ for all $a \in \mathcal{A}$.
Predicting actions $a_t$ may depend on recent contexts $x_t, \ldots, x_{t-p}$ and actions $a_{t-1}, \ldots, a_{t-q}$. Without loss of generality, we define a state space $\mathcal{S}$ as $\{s_t = [x_t, a_{t-1}]\}$.² Policies $\pi$ can thus be viewed as mapping states $\mathcal{S} = \mathcal{X} \times \mathcal{A}$ to actions $\mathcal{A}$. A roll-out of $\pi$ given context sequence $\mathbf{X} = \{x_1, \ldots, x_T\}$ is the action sequence $\mathbf{A} = \{a_1, \ldots, a_T\}$:
$$a_t = \pi(s_t) = \pi([x_t, a_{t-1}]), \qquad s_{t+1} = [x_{t+1}, a_t] \quad \forall t \in [1, \ldots, T].$$
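To make the roll-out concrete, here is a minimal sketch in Python; the `rollout` helper and its signature are our own illustration, not code from the paper.

```python
import numpy as np

def rollout(policy, contexts, a_init):
    """Sequentially roll out a policy on an exogenous context sequence.

    policy  : callable mapping a state [x_t, a_{t-1}] to an action a_t
    contexts: array of shape (T, m), the exogenous inputs x_1..x_T
    a_init  : initial previous action a_0, shape (k,)
    """
    actions = []
    a_prev = a_init
    for x_t in contexts:
        s_t = np.concatenate([x_t, a_prev])  # state s_t = [x_t, a_{t-1}]
        a_t = policy(s_t)                    # a_t = pi(s_t)
        actions.append(a_t)
        a_prev = a_t                         # a_t feeds into s_{t+1}; the x's do not depend on it
    return np.array(actions)
```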
Note that unlike the general reinforcement learning problem, we consider the setting where the state space splits into external and internal components (by definition, $a_t$ influences subsequent states $s_{t+k}$, but not $x_{t+k}$). The use of exogenous contexts $\{x_t\}$ models settings where a policy needs to take online, sequential actions based on external environmental inputs, e.g. smooth self-driving vehicles for obstacle avoidance, helicopter aerobatics in the presence of turbulence, or smart grid management for external energy demand. The technical motivation of this dichotomy is that we will enforce smoothness only on the internal state.

²We can always concatenate consecutive contexts and actions.
Consider the example of autonomous camera planning for broadcasting a sport event (Chen et al., 2016). $\mathbf{X}$ can correspond to game information such as the locations of the players, the ball, etc., and $\mathbf{A}$ can correspond to the pan-tilt-zoom configuration of the broadcast camera. Manually specifying a good camera policy can be very challenging due to the sheer complexity involved with mapping $\mathbf{X}$ to $\mathbf{A}$. It is much more natural to train $\pi \in \Pi$ to mimic observed expert demonstrations. For instance, $\Pi$ can be the space of neural networks or tree-based ensembles (or both).
Following the basic setup from (Ross et al., 2011), for any policy $\pi \in \Pi$, let $d_\pi^t$ denote the distribution of states at time $t$ if $\pi$ is executed for the first $t-1$ time steps. Furthermore, let $d_\pi = \frac{1}{T}\sum_{t=1}^{T} d_\pi^t$ be the average distribution of states if we follow $\pi$ for all $T$ steps. The goal of imitation learning is to find a policy $\hat{\pi} \in \Pi$ which minimizes the imitation loss under its own induced distribution of states:
$$\hat{\pi} = \operatorname*{argmin}_{\pi \in \Pi} \ell_\pi(\pi) = \operatorname*{argmin}_{\pi \in \Pi} \mathbb{E}_{s \sim d_\pi}\left[\ell(\pi(s))\right], \qquad (1)$$
where the (convex) imitation loss $\ell(\pi(s))$ captures how well $\pi$ imitates expert demonstrations for state $s$. One common $\ell$ is squared loss between the policy's decision and the expert demonstration: $\ell(\pi(s)) = \|\pi(s) - \pi^*(s)\|^2$ for some norm $\|\cdot\|$. Note that computing $\ell$ typically requires having access to a training set of expert demonstrations $\pi^*$ on some set of context sequences. We also assume an agnostic setting, where the minimizer of (1) does not necessarily achieve 0 loss (i.e. it cannot perfectly imitate the expert).
2.1. Smooth Imitation Learning & Smooth Policy Class

In addition to accuracy, a key requirement of many continuous control and planning problems is smoothness (e.g., smooth camera trajectories). Generally, "smoothness" may reflect domain knowledge about stability properties or approximate equilibria of a dynamical system. We thus formalize the problem of smooth imitation learning as minimizing (1) over a smooth policy class $\Pi$.

Most previous work on learning smooth policies focused on simple policy classes such as linear models (Abbeel & Ng, 2004), which can be overly restrictive. We instead define a much more general smooth policy class $\Pi$ as a regularized space of complex models.
Definition 2.1 (Smooth policy class $\Pi$). Given a complex model class $\mathcal{F}$ and a class of smooth regularizers $\mathcal{H}$, we define the smooth policy class $\Pi \subset \mathcal{F} \times \mathcal{H}$ as satisfying:
$$\Pi \triangleq \left\{ \pi = (f, h),\ f \in \mathcal{F},\ h \in \mathcal{H} \;\middle|\; \pi(s) \text{ is close to both } f(x,a) \text{ and } h(a)\ \forall \text{ induced state } s = [x,a] \in \mathcal{S} \right\},$$
where closeness is controlled by regularization.
For instance, $\mathcal{F}$ can be the space of neural networks or decision trees and $\mathcal{H}$ the space of smooth analytic functions. $\Pi$ can thus be viewed as policies that predict close to some $f \in \mathcal{F}$ but are regularized to be close to some $h \in \mathcal{H}$. For sufficiently expressive $\mathcal{F}$, we often have $\Pi \subset \mathcal{F}$. Thus optimizing over $\Pi$ can be viewed as constrained optimization over $\mathcal{F}$ (by $\mathcal{H}$), which can be challenging. Our SIMILE approach integrates alternating optimization (between $\mathcal{F}$ and $\mathcal{H}$) into the learning reduction. We provide two concrete examples of $\Pi$ below.
Example 2.1 ($\Pi_\lambda$). Let $\mathcal{F}$ be any complex supervised model class, and define the simplest possible $\mathcal{H} \triangleq \{h(a) = a\}$. Given $f \in \mathcal{F}$, the prediction of a policy $\pi$ can be viewed as regularized optimization over the action space to ensure closeness of $\pi$ to both $f$ and $h$:
$$\pi(x,a) = \operatorname*{argmin}_{a' \in \mathcal{A}} \left\| f(x,a) - a' \right\|^2 + \lambda \left\| h(a) - a' \right\|^2 = \frac{f(x,a) + \lambda h(a)}{1 + \lambda} = \frac{f(x,a) + \lambda a}{1 + \lambda}, \qquad (2)$$
where the regularization parameter $\lambda$ trades off closeness to $f$ and to the previous action. For large $\lambda$, $\pi(x,a)$ is encouraged to make predictions that stay close to the previous action $a$.
Example 2.2 (Linear auto-regressor smooth regularizers). Let $\mathcal{F}$ be any complex supervised model class, and define $\mathcal{H}$ using linear auto-regressors, $\mathcal{H} \triangleq \{h(a) = \theta^\top a\}$, which model actions as a linear dynamical system (Wold, 1939). We can define $\pi$ analogously to (2).
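As a concrete illustration of Examples 2.1 and 2.2, the following is a minimal sketch of how such a policy could form a single prediction by blending a learned regressor $f$ with the smooth regularizer $h$ as in (2). The `SmoothPolicy` class and its names are ours, not the released SIMILE code.

```python
import numpy as np

class SmoothPolicy:
    """Sketch of a policy pi = (f, h) in the regularized class of Examples 2.1/2.2."""

    def __init__(self, f, lam, theta=None):
        self.f = f          # complex regressor, e.g. a tree ensemble: f([x, a_prev]) -> action
        self.lam = lam      # regularization strength lambda
        self.theta = theta  # optional auto-regressor weights (Example 2.2); may be a matrix

    def h(self, a_prev):
        # Example 2.1: h(a) = a; Example 2.2: h(a) = theta^T a
        return a_prev if self.theta is None else self.theta @ a_prev

    def predict(self, x_t, a_prev):
        # Closed-form minimizer of ||f(x,a) - a'||^2 + lam * ||h(a) - a'||^2, Eq. (2)
        s_t = np.concatenate([x_t, a_prev])
        return (self.f(s_t) + self.lam * self.h(a_prev)) / (1.0 + self.lam)
```

Larger `lam` pulls the prediction toward the previous action (or its auto-regressive extrapolation), which is exactly the smoothness knob discussed below.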
In general, SIMILE requires that $\Pi$ satisfies a smoothness property stated below. This property, which is exploited in our theoretical analysis (see Section 5), is motivated by the observation that given a (near) constant stream of context sequence, a stable behavior policy should exhibit a corresponding action sequence with low curvature. The two examples above satisfy this property for sufficiently large $\lambda$.
Definition 2.2 ($H$-state-smooth imitation policy). For a small constant $0 < H \ll 1$, a policy $\pi([x,a])$ is $H$-state-smooth if it is $H$-smooth w.r.t. $a$, i.e. for fixed $x \in \mathcal{X}$, $\forall a, a' \in \mathcal{A}$, $\forall i$:
$$\left\| \nabla \pi_i([x,a]) - \nabla \pi_i([x,a']) \right\|_* \leq H \left\| a - a' \right\|,$$
where $\pi_i$ indicates the $i$-th component of the vector-valued function³ $\pi(s) = \left[ \pi_1(s), \ldots, \pi_k(s) \right] \in \mathbb{R}^k$, and $\|\cdot\|$ and $\|\cdot\|_*$ are some norm and dual norm respectively. For a twice differentiable policy $\pi$, this is equivalent to having the bound on the Hessian $\nabla^2 \pi_i([x,a]) \preceq H I_k\ \forall i$.

³This emphasizes the possibility that $\pi$ is a vector-valued function of $a$. The gradient and Hessian are viewed as arrays of $k$ gradient vectors and Hessian matrices of the 1-d case, since we simply treat the action in $\mathbb{R}^k$ as an array of $k$ standard functions.
3. Related Work
The most popular traditional approaches for learning from expert demonstrations focused on using approximate policy iteration techniques in the MDP setting (Kakade & Langford, 2002; Bagnell et al., 2003). Most prior approaches operate in discrete and finite action spaces (He et al., 2012; Ratliff et al., 2009; Abbeel & Ng, 2004; Argall et al., 2009). Some focus on continuous state spaces (Abbeel & Ng, 2005), but require a linear model for the system dynamics. In contrast, we focus on learning complex smooth functions within continuous action and state spaces.

One natural approach to tackle the more general setting is to reduce imitation learning to a standard supervised learning problem (Syed & Schapire, 2010; Langford & Zadrozny, 2005; Lagoudakis & Parr, 2003). However, standard supervised methods assume i.i.d. training and test examples, and thus ignore the distribution mismatch between training and rolled-out trajectories when applied directly to sequential learning problems (Kakade & Langford, 2002). A naive supervised learning approach therefore normally leads to unsatisfactory results (Ross & Bagnell, 2010).
Iterative Learning Reductions. State-of-the-art learning reductions for imitation learning typically take an iterative approach, where each training round uses standard supervised learning to learn a policy (Daumé III et al., 2009; Ross et al., 2011). In each round $n$, the following happens:
• Given an initial state $s_0$ drawn from the starting distribution of states, the learner executes the current policy $\pi_n$, resulting in a sequence of states $s^n_1, \ldots, s^n_T$.
• For each $s^n_t$, a label $\hat{a}^n_t$ (e.g., expert feedback) is collected indicating what the expert would do given $s^n_t$, resulting in a new dataset $D_n = \{(s_t, \hat{a}^n_t)\}$.
• The learner integrates $D_n$ to learn a policy $\hat{\pi}_n$. The learner updates the current policy to $\pi_{n+1}$ based on $\hat{\pi}_n$ and $\pi_n$.
The main challenge is controlling for the cascading errors caused by the changing dynamics of the system, i.e., the distribution of states in each $D_n \sim d_{\pi_n}$. A policy trained using $d_{\pi_n}$ induces a different distribution of states than $d_{\pi_n}$, and so is no longer being evaluated on the same distribution as during training. A principled reduction should (approximately) preserve the i.i.d. relationship between training and test examples. Furthermore, the state distribution $d_\pi$ should converge to a stationary distribution.
The arguably most notable learning reduction approaches for imitation learning are SEARN (Daumé III et al., 2009) and DAgger (Ross et al., 2011). At each round, SEARN learns a new policy $\hat{\pi}_n$ and returns a distribution (or mixture) over previously learned policies: $\pi_{n+1} = \beta \hat{\pi}_n + (1 - \beta)\pi_n$ for $\beta \in (0,1)$. For appropriately small choices of $\beta$, this stochastic mixing limits the "distribution drift" between $\pi_n$ and $\pi_{n+1}$ and can provably guarantee that the performance of $\pi_{n+1}$ does not degrade significantly relative to the expert demonstrations.⁴

⁴A similar approach was adopted in Conservative Policy Iteration for the MDP setting (Kakade & Langford, 2002).
DAgger, on the other hand, achieves stability by aggregating a new dataset at each round to learn a new policy from the combined dataset $D \leftarrow D \cup D_n$. This aggregation, however, significantly increases the computational complexity and thus is not practical for large problems that require many iterations of learning (since the training time grows super-linearly w.r.t. the number of iterations).
Both SEARN and DAgger showed that only a polynomial number of training rounds is required for convergence to a good policy, but with a dependence on the length of the horizon $T$. In particular, to non-trivially bound the total variation distance $\|d_{\pi_{new}} - d_{\pi_{old}}\|_1$ of the state distributions between old and new policies, a learning rate $\beta < \frac{1}{T}$ is required (Lemma 1 of Daumé III, Langford, and Marcu (2009) and Theorem 4.1 of Ross, Gordon, and Bagnell (2011)). As such, systems with very large time horizons might suffer from very slow convergence.
Our Contributions. Within the context of previous work, our SIMILE approach can be viewed as extending SEARN to smooth policy classes, with the following improvements:
• We provide a policy improvement bound that does not depend on the time horizon $T$, and can thus converge much faster. In addition, SIMILE has an adaptive learning rate, which can further improve convergence.
• For the smooth policy class described in Section 2, we show how to generate simulated or "virtual" expert feedback in order to guarantee stable learning. This alleviates the need to have continuous access to a dynamic oracle / expert that shows the learner what to do when it is off-track. In this regard, the way SIMILE integrates expert feedback subsumes the set-up from SEARN and DAgger.
• Unlike SEARN, SIMILE returns fully deterministic policies. Under the continuous setting, deterministic policies are strictly better than stochastic policies as (i) smoothness is critical and (ii) policy sampling requires holding more data during training, which may not be practical for infinite state and action spaces.
• Our theoretical analysis reveals a new sequential prediction setting that yields provably fast convergence, in particular for smooth policy classes on finite-horizon problems. Existing settings that enjoy such results are limited to Markovian dynamics with discounted future rewards or linear model classes.
4. Smooth Imitation Learning Algorithm

Our learning algorithm, called SIMILE (Smooth IMItation LEarning), is described in Algorithm 1. At a high level, the process can be described as:
Algorithm 1 SIMILE (Smooth IMItation LEarning)
Input: features $\mathbf{X} = \{x_t\}$, human trajectory $\mathbf{A}^* = \{a^*_t\}$, base routine Train, smooth regularizers $h \in \mathcal{H}$
1: Initialize $\mathbf{A}_0 \leftarrow \mathbf{A}^*$, $\mathbf{S}_0 \leftarrow \{[x_t, a^*_{t-1}]\}$, $h_0 = \operatorname*{argmin}_{h \in \mathcal{H}} \sum_{t=1}^{T} \left\| a^*_t - h(a^*_{t-1}) \right\|$
2: Initial policy $\pi_0 = \hat{\pi}_0 \leftarrow \text{Train}(\mathbf{S}_0, \mathbf{A}_0 \mid h_0)$
3: for $n = 1, \ldots, N$ do
4:   $\mathbf{A}_n = \{a^n_t\} \leftarrow \pi_{n-1}(\mathbf{S}_{n-1})$   // sequential roll-out
5:   $\mathbf{S}_n \leftarrow \{s^n_t = [x_t, a^n_{t-1}]\}$   // $s^n_t = [x_{t:t-p}, a_{t-1:t-q}]$
6:   $\hat{\mathbf{A}}_n = \{\hat{a}^n_t\}\ \forall s^n_t \in \mathbf{S}_n$   // collect smooth feedback
7:   $h_n = \operatorname*{argmin}_{h \in \mathcal{H}} \sum_{t=1}^{T} \left\| \hat{a}^n_t - h(\hat{a}^n_{t-1}) \right\|$   // new regularizer
8:   $\hat{\pi}_n \leftarrow \text{Train}(\mathbf{S}_n, \hat{\mathbf{A}}_n \mid h_n)$   // train policy
9:   $\beta \leftarrow \beta(\ell(\hat{\pi}_n), \ell(\pi_{n-1}))$   // adaptively set $\beta$
10:  $\pi_n = \beta \hat{\pi}_n + (1 - \beta)\pi_{n-1}$   // update policy
11: end for
Output: last policy $\pi_N$
1. Start with some initial policy $\hat{\pi}_0$ (Line 2).
2. At iteration $n$, use $\pi_{n-1}$ to build a new state distribution $\mathbf{S}_n$ and dataset $D_n = \{(s^n_t, \hat{a}^n_t)\}$ (Lines 4-6).
3. Train $\hat{\pi}_n = \operatorname*{argmin}_{\pi \in \Pi} \mathbb{E}_{s \sim \mathbf{S}_n}[\ell_n(\pi(s))]$, where $\ell_n$ is the imitation loss (Lines 7-8). Note that $\ell_n$ need not be the original $\ell$, but simply needs to converge to it.
4. Interpolate $\hat{\pi}_n$ and $\pi_{n-1}$ to generate a new deterministic policy $\pi_n$ (Lines 9-10). Repeat from Step 2 with $n \leftarrow n+1$ until some termination condition is met.
Supervised Learning Reduction. The actual reduction is in Lines 7-8, where we follow a two-step procedure of first updating the smooth regularizer $h_n$, and then training $\hat{\pi}_n$ via supervised learning. In other words, Train finds the best $f \in \mathcal{F}$ possible for a fixed $h_n$. We discuss how to set the training targets $\hat{a}^n_t$ below.
Policy Update. The new policy $\pi_n$ is a deterministic interpolation between the previous $\pi_{n-1}$ and the newly learned $\hat{\pi}_n$ (Line 10). In contrast, for SEARN, $\pi_n$ is a stochastic interpolation (Daumé III et al., 2009). Lemma 5.2 and Corollary 5.3 show that deterministic interpolation converges at least as fast as stochastic interpolation for smooth policy classes.

This interpolation step plays two key roles. First, it is a form of myopic or greedy online learning. Intuitively, rolling out $\pi_n$ leads to incidental exploration on the mistakes of $\pi_n$, and so each round of training is focused on refining $\pi_n$. Second, the interpolation in Line 10 ensures a slow drift in the distribution of states from round to round, which preserves an approximate i.i.d. property for the supervised regression subroutine and guarantees convergence.
However, this model interpolation creates an inherent tension between maintaining approximate i.i.d. data for valid supervised learning and more aggressive exploration (and thus faster convergence). For example, SEARN's guarantees only apply for small $\beta < 1/T$. SIMILE circumvents much of this tension via a policy improvement bound that allows $\beta$ to adaptively increase depending on the quality of $\hat{\pi}_n$ (see Theorem 5.6), which thus guarantees a valid learning reduction while substantially speeding up convergence.
Feedback Generation. We can generate training targets $\hat{a}^n_t$ using "virtual" feedback from simulating expert demonstrations, which has two benefits. First, we need not query the expert $\pi^*$ at every iteration (as done in DAgger (Ross et al., 2011)); continuously acquiring expert demonstrations at every round can be seen as a special case and a more expensive strategy. Second, virtual feedback ensures stable learning, i.e., every $\hat{\pi}_n$ is a feasible smooth policy.
Figure 1. Camera planning schematic (cf. Chen et al., 2016): the objective is to predict an appropriate pan angle for a broadcast camera based on noisy player detection data. Two planning algorithms (blue and red curves) both make the same mistake at time A but recover to a good framing by C; the ideal camera trajectory is shown in black. The blue solution corrects quickly by time B using a jerky motion, whereas the red curve conducts a gradual correction: although the red curve has a larger discrepancy with the ideal motion curve, its velocity characteristics are most similar to the ideal motion path.
Consider Figure 1, where our policy $\pi_n$ (blue/red) made a mistake at location A, and where we have only a single expert demonstration from $\pi^*$ (black). Depending on the smoothness requirements of the policy class, we can simulate virtual expert feedback via either the red line (smoother) or the blue line (less smooth), as a tradeoff between squared imitation loss and smoothness. When the roll-out of $\pi_{n-1}$ (i.e. $\mathbf{A}_n$) differs substantially from $\mathbf{A}^*$, especially during early iterations, using smoother feedback (red instead of blue) can result in more stable learning. We formalize this notion for $\Pi_\lambda$ in Proposition 5.8. Intuitively, whenever $\pi_{n-1}$ makes a mistake, resulting in a "bad" state $s^n_t$, the feedback should recommend a smooth correction $\hat{a}^n_t$ w.r.t. $\mathbf{A}_n$ to make training "easier" for the learner.⁵ The virtual feedback $\hat{a}^n_t$ should converge to the expert's action $a^*_t$. In practice, we use $\hat{a}^n_t = \sigma a^n_t + (1 - \sigma)a^*_t$ with $\sigma \to 0$ as $n$ increases (which satisfies Proposition 5.8).

⁵A similar idea was proposed by He et al. (2012) for a DAgger-type algorithm, albeit only for linear model classes.
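A minimal sketch of this virtual feedback rule (the particular decay schedule for $\sigma$ is an illustrative choice on our part, not one prescribed by the paper):

```python
def virtual_feedback(actions, expert_actions, n, sigma0=0.9, decay=0.5):
    """Blend the rolled-out trajectory with the expert's: a_hat = sigma*a + (1-sigma)*a*.

    sigma starts near 1 (smoother, easier targets) and decays toward 0 with the
    iteration number n, so the targets converge to the expert actions a*_t.
    """
    sigma = sigma0 * (decay ** (n - 1))
    return sigma * actions + (1.0 - sigma) * expert_actions
```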
5. Theoretical Results
All proofs are deferred to the supplementary material.
5.1. Stability Conditions
One natural smoothness condition is that $\pi([x,a])$ should be stable w.r.t. $a$ if $x$ is fixed. Consider the camera planning setting: the expert policy $\pi^*$ should have very small curvature, since constant inputs should correspond to constant actions. This motivates Definition 2.2, which requires that $\Pi$ has low curvature given a fixed context. We also show that smooth policies per Definition 2.2 lead to stable actions, in the sense that "nearby" states are mapped to "nearby" actions. The following helper lemma is useful:
Lemma 5.1. For a fixed $x$, define $\pi([x,a]) \triangleq \phi(a)$. If $\phi$ is non-negative and $H$-smooth w.r.t. $a$, then:
$$\forall a, a': \quad \left( \phi(a) - \phi(a') \right)^2 \leq 6H \left( \phi(a) + \phi(a') \right) \left\| a - a' \right\|^2.$$
Writing $\pi$ as $\pi([x,a]) \triangleq \left[ \pi_1([x,a]), \ldots, \pi_k([x,a]) \right]$ with each $\pi_i([x,a])$ $H$-smooth, Lemma 5.1 implies $\| \pi([x,a]) - \pi([x,a']) \| \leq \sqrt{12HR}\, \| a - a' \|$ for $R$ upper bounding $\mathcal{A}$. A bounded action space means that a sufficiently small $H$ leads to the following stability conditions:
Condition 1 (Stability Condition 1). $\Pi$ satisfies Stability Condition 1 if for a fixed input feature $x$, the actions of $\pi$ in states $s = [x,a]$ and $s' = [x,a']$ satisfy $\| \pi(s) - \pi(s') \| \leq \| a - a' \|$ for all $a, a' \in \mathcal{A}$.

Condition 2 (Stability Condition 2). $\Pi$ satisfies Stability Condition 2 if each $\pi$ is $\gamma$-Lipschitz continuous in the action component $a$ with $\gamma < 1$. That is, for a fixed $x$, the actions of $\pi$ in states $s = [x,a]$ and $s' = [x,a']$ satisfy $\| \pi(s) - \pi(s') \| \leq \gamma \| a - a' \|$ for all $a, a' \in \mathcal{A}$.

These two conditions directly follow from Lemma 5.1 under the assumption of sufficiently small $H$. Condition 2 is mildly stronger than Condition 1, and enables proving a much stronger policy improvement guarantee compared to previous work.
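To make the "sufficiently small $H$" requirement explicit, here is a short worked step based on the bound above (our own reading, not a statement from the paper):

```latex
% From \|\pi([x,a]) - \pi([x,a'])\| \le \sqrt{12HR}\,\|a - a'\|, the policy is
% Lipschitz in the action component with constant gamma = sqrt(12 H R). Hence
\gamma = \sqrt{12HR}, \qquad
\gamma \le 1 \;\Leftrightarrow\; H \le \tfrac{1}{12R} \;\;(\text{Condition 1}), \qquad
\gamma < 1 \;\Leftrightarrow\; H < \tfrac{1}{12R} \;\;(\text{Condition 2}).
```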
5.2. Deterministic versus Stochastic

Given two policies $\pi$ and $\hat{\pi}$, and an interpolation parameter $\beta \in (0,1)$, consider two ways to combine policies:
1. stochastic: $\pi_{sto}(s) = \hat{\pi}(s)$ with probability $\beta$, and $\pi_{sto}(s) = \pi(s)$ with probability $1 - \beta$
2. deterministic: $\pi_{det}(s) = \beta \hat{\pi}(s) + (1 - \beta)\pi(s)$
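As a quick illustration of the two combination operators (function names are ours), a sketch:

```python
import numpy as np

def make_deterministic_mix(pi_new, pi_old, beta):
    """pi_det(s) = beta * pi_new(s) + (1 - beta) * pi_old(s): a single averaged action."""
    return lambda s: beta * pi_new(s) + (1 - beta) * pi_old(s)

def make_stochastic_mix(pi_new, pi_old, beta, rng=np.random.default_rng()):
    """pi_sto(s): follow pi_new with probability beta, otherwise pi_old."""
    return lambda s: pi_new(s) if rng.random() < beta else pi_old(s)
```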
Previous learning reduction approaches only use stochastic interpolation (Daumé III et al., 2009; Ross et al., 2011), whereas SIMILE uses deterministic interpolation. The following result shows that deterministic and stochastic interpolation yield the same expected behavior for smooth policy classes.
Lemma 5.2. Given any starting state $s_0$, sequentially execute $\pi_{det}$ and $\pi_{sto}$ to obtain two separate trajectories $\mathbf{A} = \{a_t\}_{t=1}^{T}$ and $\tilde{\mathbf{A}} = \{\tilde{a}_t\}_{t=1}^{T}$ such that $a_t = \pi_{det}(s_t)$ and $\tilde{a}_t = \pi_{sto}(\tilde{s}_t)$, where $s_t = [x_t, a_{t-1}]$ and $\tilde{s}_t = [x_t, \tilde{a}_{t-1}]$. Assuming the policies are stable as per Condition 1, we have $\mathbb{E}_{\tilde{\mathbf{A}}}[\tilde{a}_t] = a_t\ \forall t = 1, \ldots, T$, where the expectation is taken over all random roll-outs of $\pi_{sto}$.
Lemma 5.2 shows that deterministic policy combination (SIMILE) yields unbiased trajectory roll-outs of the stochastic policy combination (as done in SEARN & CPI). This represents a major advantage of SIMILE, since the number of stochastic roll-outs of $\pi_{sto}$ needed to average to the deterministic trajectory of $\pi_{det}$ is polynomial in the time horizon $T$, leading to much higher computational complexity for the stochastic approach. Furthermore, for convex imitation loss $\ell_\pi(\pi)$, Lemma 5.2 and Jensen's inequality yield the following corollary, which states that under convex loss, the deterministic policy performs at least no worse than the stochastic policy in expectation:
Corollary 5.3 (Deterministic Policies Perform Better). For deterministic $\pi_{det}$ and stochastic $\pi_{sto}$ interpolations of two policies $\pi$ and $\hat{\pi}$, and convex loss $\ell$, we have:
$$\ell_{\pi_{det}}(\pi_{det}) = \ell_{\pi_{sto}}(\mathbb{E}[\pi_{sto}]) \leq \mathbb{E}\left[\ell_{\pi_{sto}}(\pi_{sto})\right],$$
where the expectation is over all roll-outs of $\pi_{sto}$.
Remark. We construct a simple example to show that Condition 1 may be necessary for iterative learning reductions to converge. Consider the case where contexts $\mathcal{X} \subset \mathbb{R}$ are either constant or vary negligibly. Expert demonstrations should be constant: $\pi^*([x_n, a^*]) = a^*$ for all $n$. Consider an unstable policy $\pi$ such that $\pi(s) = \pi([x,a]) = ka$ for fixed $k > 1$. The rolled-out trajectory of $\pi$ diverges from $\pi^*$ at an exponential rate. Assume optimistically that $\hat{\pi}$ learns the correct expert behavior, which is simply $\hat{\pi}(s) = \hat{\pi}([x,a]) = a$. For any $\beta \in (0,1)$, the updated policy $\pi' = \beta\hat{\pi} + (1-\beta)\pi$ becomes $\pi'([x,a]) = \beta a + (1-\beta)ka$. Thus the sequential roll-out of the new policy $\pi'$ will also yield an exponential gap from the correct policy. By induction, the same will be true in all future iterations.
5.3. Policy Improvement

Our policy improvement guarantee builds upon the analysis from SEARN (Daumé III et al., 2009), which we extend to using adaptive learning rates $\beta$. We first restate the main policy improvement result from Daumé III et al. (2009).
Lemma 5.4 (SEARN's policy nondegradation; Lemma 1 from Daumé III et al. (2009)). Let $\ell_{max} = \sup_{\pi, s} \ell(\pi(s))$, and let $\pi'$ be defined as $\pi_{sto}$ in Lemma 5.2. Then for $\beta \in (0, 1/T)$:
$$\ell_{\pi'}(\pi') - \ell_\pi(\pi) \leq \beta T\, \mathbb{E}_{s \sim d_\pi}\left[\ell(\hat{\pi}(s))\right] + \frac{1}{2}\beta^2 T^2 \ell_{max}.$$
SEARN guarantees that the new policy $\pi'$ does not degrade much from the expert $\pi^*$ only if $\beta < 1/T$. Analyses of SEARN and other previous iterative reduction methods (Ross et al., 2011; Kakade & Langford, 2002; Bagnell et al., 2003; Syed & Schapire, 2010) rely on bounding the variation distance between $d_\pi$ and $d_{\pi'}$. Three drawbacks of this approach are: (i) a non-trivial variation distance bound typically requires $\beta$ to be inversely proportional to the time horizon $T$, causing slow convergence; (ii) the approach is not easily applicable to the continuous regime; and (iii) except under the MDP framework with discounted infinite horizon, previous variation distance bounds do not guarantee monotonic policy improvement (i.e. $\ell_{\pi'}(\pi') < \ell_\pi(\pi)$).

We provide two levels of guarantees taking advantage of Stability Conditions 1 and 2 to circumvent these drawbacks. Assuming Condition 1 and convexity of $\ell$, our first result yields a guarantee comparable with SEARN.
Theorem 5.5 (T-dependent Improvement). Assume $\ell$ is convex and $L$-Lipschitz, and Condition 1 holds. Let $\epsilon = \max_{s \sim d_\pi} \| \hat{\pi}(s) - \pi(s) \|$. Then:
$$\ell_{\pi'}(\pi') - \ell_\pi(\pi) \leq \beta \epsilon L T + \beta \left( \ell_\pi(\hat{\pi}) - \ell_\pi(\pi) \right). \qquad (3)$$
In particular, choosing $\beta \in (0, 1/T)$ yields:
$$\ell_{\pi'}(\pi') - \ell_\pi(\pi) \leq \epsilon L + \beta \left( \ell_\pi(\hat{\pi}) - \ell_\pi(\pi) \right). \qquad (4)$$
Similar to SEARN, Theorem 5.5 also requires $\beta \in (0, 1/T)$ to ensure the RHS of (4) stays small. However, note that the reduction term $\beta(\ell_\pi(\hat{\pi}) - \ell_\pi(\pi))$ allows the bound to be strictly negative if the policy $\hat{\pi}$ trained on $d_\pi$ significantly improves on $\ell_\pi(\pi)$ (i.e., guaranteed policy improvement). We observe empirically that this often happens, especially in early iterations of training.

Under the mildly stronger Condition 2, we remove the dependency on the time horizon $T$, which represents a much stronger guarantee compared to previous work.
Theorem 5.6 (Policy Improvement). Assume $\ell$ is convex and $L$-Lipschitz-continuous, and Condition 2 holds. Let $\epsilon = \max_{s \sim d_\pi} \| \hat{\pi}(s) - \pi(s) \|$. Then for $\beta \in (0,1)$:
$$\ell_{\pi'}(\pi') - \ell_\pi(\pi) \leq \frac{\beta \gamma \epsilon L}{(1-\beta)(1-\gamma)} + \beta \left( \ell_\pi(\hat{\pi}) - \ell_\pi(\pi) \right). \qquad (5)$$
Corollary 5.7 (Monotonic Improvement). Following the notation from Theorem 5.6, let $\Delta = \ell_\pi(\pi) - \ell_\pi(\hat{\pi})$ and $\delta = \frac{\gamma \epsilon L}{1 - \gamma}$. Then choosing step size $\beta = \frac{\Delta - \delta}{2\Delta}$, we have:
$$\ell_{\pi'}(\pi') - \ell_\pi(\pi) \leq -\frac{(\Delta - \delta)^2}{2(\Delta + \delta)}. \qquad (6)$$
The terms $\epsilon$ and $\ell_\pi(\hat{\pi}) - \ell_\pi(\pi)$ on the RHS of (4) and (5) come from the learning reduction, as they measure the "distance" between $\hat{\pi}$ and $\pi$ on the state distribution induced by $\pi$ (which forms the dataset to train $\hat{\pi}$). In practice, both terms can be empirically estimated from the training round, thus allowing an estimate of $\beta$ to minimize the bound.
Theorem 5.6 justifies using an adaptive and more aggressive interpolation parameter $\beta$ to update policies. In the worst case, setting $\beta$ close to $0$ will ensure the bound from (5) is close to $0$, which is consistent with SEARN's result. More generally, monotonic policy improvement can be guaranteed for an appropriate choice of $\beta$, as seen from Corollary 5.7. This strict policy improvement was not possible under previous iterative learning reduction approaches such as SEARN and DAgger, and is enabled in our setting due to exploiting the smoothness conditions.
5.4. Smooth Feedback Analysis

Smooth Feedback Does Not Hurt: Recall from Section 4 that one way to simulate "virtual" feedback for training a new $\hat{\pi}$ is to set the target $\hat{a}_t = \sigma a_t + (1 - \sigma)a^*_t$ for $\sigma \in (0,1)$, where smooth feedback corresponds to $\sigma \to 1$. To see that simulating smooth "virtual" feedback targets does not hurt the training progress, we alternatively view SIMILE as performing gradient descent in a smooth function space (Mason et al., 1999). Define the cost functional $C: \Pi \to \mathbb{R}$ over policy space to be the average imitation loss over $\mathcal{S}$, as $C(\pi) = \int_{\mathcal{S}} \| \pi(s) - \pi^*(s) \|^2\, dP(s)$. The gradient (Gâteaux derivative) of $C(\pi)$ w.r.t. $\pi$ is:
$$\nabla C(\pi)(s) = \left. \frac{\partial C(\pi + \alpha \delta_s)}{\partial \alpha} \right|_{\alpha = 0} = 2\left( \pi(s) - \pi^*(s) \right),$$
where $\delta_s$ is the Dirac delta function centered at $s$. By first-order approximation, $C(\pi') = C(\beta\hat{\pi} + (1-\beta)\pi) = C(\pi + \beta(\hat{\pi} - \pi)) \approx C(\pi) + \beta \langle \nabla C(\pi), \hat{\pi} - \pi \rangle$. Like traditional gradient descent, we want to choose $\hat{\pi}$ such that the update moves the functional along the direction of negative gradient. In other words, we want to learn $\hat{\pi} \in \Pi$ such that $\langle \nabla C(\pi), \hat{\pi} - \pi \rangle \ll 0$. We can evaluate this inner product along the states induced by $\pi$. We thus have the estimate:
$$\langle \nabla C(\pi), \hat{\pi} - \pi \rangle \approx \frac{2}{T}\sum_{t=1}^{T} \left( \pi(s_t) - \pi^*(s_t) \right)\left( \hat{\pi}(s_t) - \pi(s_t) \right) = \frac{2}{T}\sum_{t=1}^{T} \left( a_t - a^*_t \right)\left( \hat{\pi}([x_t, a_{t-1}]) - a_t \right).$$
Since we want $\langle \nabla C(\pi), \hat{\pi} - \pi \rangle < 0$, this motivates the construction of a new dataset $D$ with states $\{[x_t, a_{t-1}]\}_{t=1}^{T}$ and labels $\{\hat{a}_t\}_{t=1}^{T}$ to train a new policy $\hat{\pi}$, where we want $(a_t - a^*_t)(\hat{a}_t - a_t) < 0$. A sufficient solution is to set the target $\hat{a}_t = \sigma a_t + (1 - \sigma)a^*_t$ (Section 4), as this will point the gradient in the negative direction, allowing the learner to make progress.
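As a one-line sanity check of the sign (our own verification, following directly from the blended target):

```latex
\hat{a}_t - a_t = (1-\sigma)\,(a^*_t - a_t)
\;\;\Longrightarrow\;\;
(a_t - a^*_t)^\top(\hat{a}_t - a_t) = -(1-\sigma)\,\| a_t - a^*_t \|^2 \le 0,
```

with strict inequality whenever $a_t \neq a^*_t$ and $\sigma < 1$, so the learner indeed moves along a descent direction.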
Smooth Feedback is Sometimes Necessary: When the current policy performs poorly, smooth virtual feedback may be required to ensure stable learning, i.e. producing a feasible smooth policy at each training round. We formalize this notion of feasibility by considering the smooth policy class $\Pi_\lambda$ from Example 2.1. Recall that the smooth regularization of $\Pi_\lambda$ via $\mathcal{H}$ encourages the next action to be close to the previous action. Thus a natural way to measure the smoothness of $\pi \in \Pi_\lambda$ is via the average first-order difference of consecutive actions, $\frac{1}{T}\sum_{t=1}^{T} \| a_t - a_{t-1} \|$. In particular, we want to explicitly constrain this difference relative to the expert trajectory, $\frac{1}{T}\sum_{t=1}^{T} \| a_t - a_{t-1} \| \leq \eta$ at each iteration, where $\eta \propto \frac{1}{T}\sum_{t=1}^{T} \| a^*_t - a^*_{t-1} \|$.
When $\pi$ performs poorly, i.e. the "average gap" between the current trajectory $\{a_t\}$ and $\{a^*_t\}$ is large, the training target for $\hat{\pi}$ should be lowered to ensure learning a smooth policy is feasible, as inferred from the following proposition. In practice, we typically employ smooth virtual feedback in early iterations, when policies tend to perform worse.
Proposition 5.8. Let $\omega$ be the average supervised training error from $\mathcal{F}$, i.e. $\omega = \min_{f \in \mathcal{F}} \mathbb{E}_{x \sim X}\left[ \| f([x, 0]) - a^* \| \right]$. Let the rolled-out trajectory of the current policy $\pi$ be $\{a_t\}$. If the average gap between $\pi$ and $\pi^*$ is such that $\mathbb{E}_{t \sim \text{Uniform}[1:T]}\left[ \| a^*_t - a_{t-1} \| \right] \geq 3\omega + \eta(1 + \lambda)$, then using $\{a^*_t\}$ as feedback will cause the trained policy $\hat{\pi}$ to be non-smooth, i.e.:
$$\mathbb{E}_{t \sim \text{Uniform}[1:T]}\left[ \| \hat{a}_t - \hat{a}_{t-1} \| \right] \geq \eta, \qquad (7)$$
for $\{\hat{a}_t\}$ the rolled-out trajectory of $\hat{\pi}$.
Figure 2. Expert (blue) and predicted (red) camera pan angles. Left: SIMILE with < 10 iterations. Right: non-smooth policy.

Figure 3. Adaptive versus fixed interpolation parameter $\beta$.
6. Experiments
Automated Camera Planning. We evaluate SIMILE in a case study of automated camera planning for sports broadcasting (Chen & Carr, 2015; Chen et al., 2016). Given noisy tracking of players as raw input data $\{x_t\}_{t=1}^{T}$, and demonstrated camera pan angles from a professional human operator as $\{a^*_t\}_{t=1}^{T}$, the goal is to learn a policy $\pi$ that produces a trajectory $\{a_t\}_{t=1}^{T}$ that is both smooth and accurate relative to $\{a^*_t\}_{t=1}^{T}$. Smoothness is critical in camera control: fluid movements which maintain adequate framing are preferable to jittery motions which constantly pursue perfect tracking (Gaddam et al., 2015). In this setting, the time horizon $T$ is the duration of the event multiplied by the sampling rate; thus $T$ tends to be very large.
Smooth Policy Class. We use a smooth policy class following Example 2.2: regression tree ensembles $\mathcal{F}$ regularized by a class of linear autoregressor functions $\mathcal{H}$ (Chen et al., 2016). See Appendix B for more details.
Summary of Results.
• Using our smooth policy class leads to dramatically smoother trajectories than not regularizing using $\mathcal{H}$.
• Using our adaptive learning rate leads to much faster convergence compared to conservative learning rates from SEARN (Daumé III et al., 2009).
• Using smooth feedback ensures stable learning of smooth policies at each iteration.
• Deterministic policy interpolation performs better than the stochastic interpolation used in SEARN.
Smooth versus Non-Smooth Policy Classes. Figure 2 shows a comparison of using a smooth policy class versus a non-smooth one (e.g., not using $\mathcal{H}$). We see that our approach can reliably learn to predict trajectories that are both smooth and accurate.

Figure 4. Comparing different values of $\sigma$.
Adaptive vs. Fixed $\beta$: One can, in principle, train using SEARN, which requires a very conservative $\beta$ in order to guarantee convergence. In contrast, SIMILE adaptively selects $\beta$ based on the relative empirical loss of $\pi$ and $\hat{\pi}$ (Line 9 of Algorithm 1). Let $\text{error}(\hat{\pi})$ and $\text{error}(\pi)$ denote the mean-squared errors of the rolled-out trajectories $\{\hat{a}_t\}$ and $\{a_t\}$, respectively, w.r.t. the ground truth $\{a^*_t\}$. We can set $\beta$ as:
$$\hat{\beta} = \frac{\text{error}(\pi)}{\text{error}(\hat{\pi}) + \text{error}(\pi)}, \qquad (8)$$
which encourages the learner to disregard bad policies when interpolating, thus allowing fast convergence to a good policy (see Theorem 5.6). Figure 3 compares the convergence rate of SIMILE using adaptive $\beta$ versus conservative fixed values of $\beta$ commonly used in SEARN (Daumé III et al., 2009). We see that adaptively choosing $\beta$ enjoys substantially faster convergence. Note that a very large fixed $\beta$ may overshoot and worsen the combined policy after a few initial improvements.
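A minimal sketch of this adaptive rule, matching Equation (8) (the function name and array handling are ours):

```python
import numpy as np

def adaptive_beta(rollout_new, rollout_old, expert):
    """beta_hat = error(pi) / (error(pi_hat) + error(pi)), Eq. (8).

    rollout_new: rolled-out trajectory {a_hat_t} of the newly trained policy pi_hat
    rollout_old: rolled-out trajectory {a_t} of the current policy pi
    expert     : ground-truth expert trajectory {a*_t}
    """
    err_new = np.mean((np.asarray(rollout_new) - np.asarray(expert)) ** 2)  # error(pi_hat)
    err_old = np.mean((np.asarray(rollout_old) - np.asarray(expert)) ** 2)  # error(pi)
    return err_old / (err_new + err_old)
```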
Smooth Feedback Generation: We set the target labels to $\hat{a}^n_t = \sigma a^n_t + (1 - \sigma)a^*_t$ for $0 < \sigma < 1$ (Line 6 of Algorithm 1). A larger $\sigma$ corresponds to a smoother ($\hat{a}^n_t$ is closer to $a^n_{t-1}$) but less accurate target (further from $a^*_t$), as seen in Figure 4. Figure 5 shows the trade-off between smoothness loss (blue line, measured by the first-order difference in Proposition 5.8) and imitation loss (red line, measured by mean squared distance) for varying $\sigma$. We navigate this trade-off by setting $\sigma$ closer to 1 in early iterations, and letting $\sigma \to 0$ as $n$ increases. This gradual schedule produces more stable policies, especially during early iterations where the learned policy tends to perform poorly (as formalized in Proposition 5.8). In Figure 4, when the initial policy (green trajectory) has poor performance, setting smooth targets (Figure 4b) allows learning a smooth policy in the subsequent round, in contrast to the more accurate but less stable performance of "difficult" targets with low $\sigma$ (Figure 4c-d). Figure 6 visualizes the behavior of the intermediate policies learned by SIMILE, where we can see that each intermediate policy is a smooth policy.

Figure 6. Performance after different numbers of iterations.
Deterministic vs. Stochastic Interpolation: Finally, we evaluate the benefits of using deterministic policy averaging instead of stochastically combining different policies, as done in SEARN. To control for other factors, we set $\beta$ to a fixed value of $0.5$, and keep the new training dataset $D_n$ the same for each iteration $n$. The average imitation loss of stochastic policy sampling is evaluated after 50 stochastic roll-outs at each iteration. This average stochastic policy error tends to be higher than the empirical error of the deterministic trajectory, as seen in Figure 7, which confirms our finding from Corollary 5.3.

Figure 7. Deterministic policy error vs. average stochastic policy error for $\beta = 0.5$ and 50 roll-outs of the stochastic policies.
7. Conclusion

We formalized the problem of smooth imitation learning for online sequence prediction, which is a variant of imitation learning that uses a notion of a smooth policy class. We proposed SIMILE (Smooth IMItation LEarning), which is an iterative learning reduction approach to learning smooth policies from expert demonstrations in a continuous and dynamic environment. SIMILE utilizes an adaptive learning rate that provably allows much faster convergence compared to previous learning reduction approaches, and also enjoys better sample complexity than previous work by being fully deterministic and allowing for virtual simulation of training labels. We validated the efficiency and practicality of our approach in a setting of online camera planning.
References
Abbeel, Pieter and Ng, Andrew Y. Apprenticeship learn-
ing via inverse reinforcement learning. In
International
Conference on Machine Learning (ICML)
, 2004.
Abbeel, Pieter and Ng, Andrew Y. Exploration and appren-
ticeship learning in reinforcement learning. In
Interna-
tional Conference on Machine Learning (ICML)
, 2005.
Argall, Brenna D, Chernova, Sonia, Veloso, Manuela, and
Browning, Brett.
A survey of robot learning from
demonstration.
Robotics and autonomous systems
, 57
(5):469–483, 2009.
Bagnell, J Andrew, Kakade, Sham M, Schneider, Jeff G,
and Ng, Andrew Y. Policy search by dynamic program-
ming. In
Neural Information Processing Systems (NIPS)
,
2003.
Caruana, Rich and Niculescu-Mizil, Alexandru. An em-
pirical comparison of supervised learning algorithms. In
International Conference on Machine Learning (ICML)
,
2006.
Chen, Jianhui and Carr, Peter. Mimicking human camera
operators. In
IEEE Winter Conference Applications of
Computer Vision (WACV)
, 2015.
Chen, Jianhui, Le, Hoang M., Carr, Peter, Yue, Yisong,
and Little, James J. Learning online smooth predictors
for real-time camera planning using recurrent decision
trees. In
IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR)
, 2016.
Criminisi, Antonio, Shotton, Jamie, and Konukoglu, En-
der. Decision forests: A unified framework for classifi-
cation, regression, density estimation, manifold learning
and semi-supervised learning.
Foundations and Trends
in Computer Graphics and Vision
, 7(2–3):81–227, 2012.
Daumé III, Hal, Langford, John, and Marcu, Daniel. Search-based structured prediction. Machine Learning, 75(3):297–325, 2009.
Gaddam, Vamsidhar Reddy, Eg, Ragnhild, Langseth, Ragnar, Griwodz, Carsten, and Halvorsen, Pål. The cameraman operating my virtual camera is artificial: Can the machine be as good as a human? ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 11(4):56, 2015.
He, He, Eisner, Jason, and Daume, Hal. Imitation learning
by coaching. In
Neural Information Processing Systems
(NIPS)
, 2012.
Jain, Ashesh, Wojcik, Brian, Joachims, Thorsten, and Sax-
ena, Ashutosh. Learning trajectory preferences for ma-
nipulators via iterative improvement. In
Neural Informa-
tion Processing Systems (NIPS)
, 2013.
Kakade, Sham and Langford, John. Approximately op-
timal approximate reinforcement learning. In
Interna-
tional Conference on Machine Learning (ICML)
, 2002.
Lagoudakis, Michail and Parr, Ronald.
Reinforcement
learning as classification: Leveraging modern classi-
fiers. In
International Conference on Machine Learning
(ICML)
, 2003.
Langford, John and Zadrozny, Bianca.
Relating rein-
forcement learning performance to classification perfor-
mance. In
International Conference on Machine Learn-
ing (ICML)
, 2005.
Mason, Llew, Baxter, Jonathan, Bartlett, Peter L, and
Frean, Marcus. Functional gradient techniques for com-
bining hypotheses. In
Neural Information Processing
Systems (NIPS)
, 1999.
Ratliff, Nathan, Silver, David, and Bagnell, J. Andrew.
Learning to search: Functional gradient techniques for
imitation learning.
Autonomous Robots
, 27(1):25–53,
2009.
Ross, Stéphane and Bagnell, Drew. Efficient reductions for imitation learning. In Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
Ross, Stephane, Gordon, Geoff, and Bagnell, J. Andrew. A
reduction of imitation learning and structured prediction
to no-regret online learning. In
Conference on Artificial
Intelligence and Statistics (AISTATS)
, 2011.
Srebro, Nathan, Sridharan, Karthik, and Tewari, Ambuj.
Smoothness, low noise and fast rates. In
Neural Infor-
mation Processing Systems (NIPS)
, 2010.
Syed, Umar and Schapire, Robert E. A reduction from ap-
prenticeship learning to classification. In
Neural Infor-
mation Processing Systems (NIPS)
, 2010.
Wold, Herman. A study in the analysis of stationary time
series, 1939.