Learning Online Smooth Predictors for Realtime Camera Planning using Recurrent Decision Trees

Jianhui Chen*    Hoang M. Le†    Peter Carr‡    Yisong Yue†    James J. Little*

*University of British Columbia    †California Institute of Technology    ‡Disney Research

{jhchen14, little}@cs.ubc.ca    {hmle, yyue}@caltech.edu    carr@disneyresearch.com
Abstract
We study the problem of online prediction for realtime
camera planning, where the goal is to predict smooth trajec-
tories that correctly track and frame objects of interest (e.g.,
players in a basketball game). The conventional approach
for training predictors does not directly consider temporal
consistency, and often produces undesirable jitter. Although
post-hoc smoothing (e.g., via a Kalman filter) can mitigate
this issue to some degree, it is not ideal due to overly strin-
gent modeling assumptions (e.g., Gaussian noise). We pro-
pose a recurrent decision tree framework that can directly
incorporate temporal consistency into a data-driven pre-
dictor, as well as a learning algorithm that can efficiently
learn such temporally smooth models. Our approach does
not require any post-processing, making online smooth pre-
dictions much easier to generate when the noise model is
unknown. We apply our approach to sports broadcasting:
given noisy player detections, we learn where the camera
should look based on human demonstrations. Our exper-
iments exhibit significant improvements over conventional
baselines and showcase the practicality of our approach.
1. Introduction
We investigate the problem of determining where a cam-
era should look when broadcasting a team sporting event,
such as basketball or soccer (see Fig.
1). Realtime camera
planning shares many similarities with online object track-
ing: in both cases, the algorithm must constantly revise an
estimated target position as new evidence is acquired. Noise
and other ambiguities can cause non-ideal jittery trajecto-
ries, which in camera planning lead to unaesthetic results
(see Fig.
2). In contrast to object tracking, smoothness is of
paramount importance in camera control: fluid movements
that maintain adequate framing are preferable to erratic mo-
tions that constantly pursue perfect composition.
Figure 1. Camera Planning. The goal is to predict the pan angle for a broadcast camera based on noisy player detections. Consider two planning algorithms (the blue and red curves) which both make the same mistake at time A but recover to a good framing by C (the ideal camera trajectory is shown in black). The blue solution quickly corrects by time B using a jerky motion, whereas the red curve conducts a gradual correction. Although the red curve has a larger discrepancy with the ideal motion curve, its velocity characteristics are more similar to those of the ideal motion path.

Non-parametric or model-free estimation methods, such as random forests [10], are very popular because they can learn (almost) arbitrary predictors directly from training data. When applied to smooth trajectory prediction, the esti-
mator is often learned within a time-independent paradigm,
with temporal regularization integrated afterwards as a post-
processing stage (e.g., via a Kalman filter) [28,
30]. One
major limitation of this two-stage approach for camera plan-
ning is that the smoothing is done in a context-independent
way, which can lead to uninformed tradeoffs between accu-
racy and smoothness (see Fig.
1).
In this paper, we propose a recurrent decision tree frame-
work that can make predictions
conditioned on its own pre-
vious predictions
, which allows it to learn temporal patterns
within the data (in addition to any direct feature-based re-
lationships). However, this recursive formulation (similar
to reinforcement learning [31]) makes the learning problem
much more challenging compared to the time-independent
approach. We develop a learning algorithm based on the
“search and learn” (SEARN) approach [12] to efficiently
converge to a stable recurrent model.
We applied our approach to autonomous camera control
in sports, where the goal is to generate smooth camera mo-
tion that imitates a human expert. We provide both quan-
titative and qualitative evidence showing our approach sig-
nificantly outperforms several strong baselines.
2. Related work
Sequential Supervised Learning
Sequential supervised
learning is broadly applied in many domains, including nat-
ural language processing tasks such as part-of-speech tag-
ging [9] and computational biology tasks such as protein
alignment [32]. Unlike standard supervised learning, sequential supervised learning operates on sequences: each training example $(\mathbf{x}_i, \mathbf{y}_i)$ consists of a sequence of features $\mathbf{x}_i = \langle x_{i,1}, x_{i,2}, \ldots, x_{i,T_i} \rangle$ and a corresponding sequence of labels $\mathbf{y}_i = \langle y_{i,1}, y_{i,2}, \ldots, y_{i,T_i} \rangle$. The learned predictor outputs a sequence of labels for an input sequence of features.
Our setting is distinguished from conventional sequential learning because we must learn an online predictor. In other words, conventional sequential learning typically assumes access to all of $\mathbf{x}$ before predicting $\mathbf{y}$ [1, 9, 13, 23, 24, 32]. Our setting is further distinguished from most previous sequential prediction work by aiming to learn model-free or non-parametric predictors. For instance, the bulk of previous work utilizes linear predictors and thus requires a well-specified feature representation [1, 9, 24, 32].
One approach for utilizing arbitrary predictors is via a
sliding window [13, 14, 23], in which any supervised learn-
ing method can be used to learn the sliding window predic-
tor. However, if the predictor is defined recurrently (i.e., it
depends on its previous predictions), then it is not obvious
what values to use for previous predictions when generat-
ing supervised training examples. One way to address this
issue is via “stacked” sequential learning [8, 28, 30], which
is essentially the two-stage approach of first training a non-
recurrent predictor, and then employing recurrent smooth-
ing (e.g., a hidden Markov model or Kalman filter).
Our approach instead directly trains a predictor to make
good predictions given previous predictions, which leads
to a “chicken and egg” problem of how to define a train-
ing set that depends on the predictions of the model to be
trained. We employ a learning reduction approach, based
on SEARN [12], that iteratively constructs a sequence of
supervised training sets such that the resulting sequence of
predictors efficiently converges to a stable recurrent pre-
dictor. Other learning reduction approaches for sequential
learning include DAgger [29], which can be more efficient
in the number of iterations needed to converge, but with
each iteration being more computationally expensive.
Camera Planning
Algorithms for determining where a
camera should look have been investigated for a variety
of scenarios from scripted cooking shows and college lec-
tures to team sports [5, 22]. It is widely accepted that a
smoothly moving camera is critical for generating aesthetic
video [16]. One approach is post-processing to regulate the
sequential predictions, which leads to a trade-off between
smoothness and responsiveness [6]. Alternatively, offline batch processes [19] can be used if there is no online requirement. Significant filtering constrains the motion of the camera, making it unable to track fast-moving objects [26]. Other approaches can offer more refined control but first require users to select the main features of interest [17].

Figure 2. Jitter and Overshooting Artifacts. (a) A plan generated by [6] includes a sudden direction change (green dashed box) and jitter. (b) Post-processing independent predictions with a Kalman filter reduces jitter but causes overshooting (black dashed box).
Camera Motion Models
Camera motion is well-studied
in the context of stabilization. Liu et al. [26] used a low-pass
filter to create constant angular velocity motions for video
stabilization. The rotation component was filtered by lin-
earizing the rotation space using derivatives with respect to
time to obtain angular velocities [25]. The method was ex-
tended to polynomial eigen-trajectories for subspace video
stabilization [27]. In contrast, we are interested in the prob-
lem of camera planning. In other words, rather than trying
to reconstruct a stable camera trajectory from existing tra-
jectories, our goal is to plan a brand new trajectory.
3. Problem Setup
Let $d_{\mathbf{x}}$ denote a distribution of input sequences: $\mathbf{x} = \langle x_1, \ldots, x_T \rangle \sim d_{\mathbf{x}}$. In camera planning, $\mathbf{x}$ can be a sequence of (noisy) detected player locations from stationary cameras. For clarity, we assume each sequence has the same length $T$, but in general $T$ can vary and/or be unknown.
Let $\Pi$ denote a class of policies that our learner is considering. Given a stream of inputs $\mathbf{x} = \langle x_1, \ldots, x_T \rangle$, each $\pi \in \Pi$ generates a stream of outputs $\mathbf{y} = \langle y_1, \ldots, y_T \rangle$. In camera planning, $\mathbf{y}$ can be a sequence of pan angles of the broadcast camera. Section 4 describes our recurrent decision tree framework that instantiates $\Pi$ in our experiments.
Operationally, each $\pi$ is a streaming predictor that takes a state $s_t = \{x_t, \ldots, x_{t-\tau}, y_{t-1}, \ldots, y_{t-\tau}\}$ composed of the recent inputs and predictions, and generates a new prediction $y_t = \pi(s_t)$. Note that the next state $s_{t+1}$ depends only on the new input $x_{t+1}$, the current state $s_t$, and the current prediction $y_t$. Hence, for any input sequence $\mathbf{x}$, $\pi$ equivalently generates a state sequence $\mathbf{s} = \langle s_1, \ldots, s_T \rangle$. We also abuse notation to say that $\mathbf{y} = \pi(\mathbf{x})$ denotes the sequence of predictions $\mathbf{y}$ generated by streaming $\mathbf{x}$ into $\pi$ (with the construction of the state sequence $\mathbf{s}$ being implicit).
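To make the streaming formulation concrete, the following sketch (with hypothetical helper names, not taken from the paper) rolls a policy over an input stream while maintaining the window of recent inputs and predictions that constitutes the state:

```python
from collections import deque

def roll_out(policy, xs, tau):
    """Stream inputs through a policy, maintaining the last tau+1 inputs
    and last tau predictions that make up the state s_t."""
    recent_x = deque(maxlen=tau + 1)   # x_t, ..., x_{t-tau}
    recent_y = deque(maxlen=tau)       # y_{t-1}, ..., y_{t-tau}
    ys = []
    for x_t in xs:
        recent_x.appendleft(x_t)
        s_t = {"x": list(recent_x), "y": list(recent_y)}  # state s_t
        y_t = policy(s_t)              # y_t = pi(s_t)
        recent_y.appendleft(y_t)       # s_{t+1} depends on s_t, x_{t+1}, y_t
        ys.append(y_t)
    return ys                          # y = pi(x)
```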
For any policy $\pi \in \Pi$, let $d^{\pi}_t$ denote the distribution of states at time $t$ if $\pi$ is executed for the first $t-1$ time steps ($d^{\pi}_t$ is defined exactly by $d_{\mathbf{x}}$ and $\pi$). Furthermore, let $d^{\pi} = \frac{1}{T} \sum_{t=1}^{T} d^{\pi}_t$ be the average distribution of states if we follow $\pi$ for all $T$ steps. The goal of the learner is to find a policy $\hat{\pi} \in \Pi$ that minimizes the imitation loss under its own induced distribution of states:
$$\hat{\pi} = \arg\min_{\pi \in \Pi} \; \mathbb{E}_{s \sim d^{\pi}}\left[ \ell(s, \pi, \pi^*) \right]. \qquad (1)$$
Since our goal is to learn a policy $\pi$ that both imitates the (human) expert $\pi^*$ and is also smooth, we decompose our loss function into precision and smoothness components:
$$\ell(s, \pi, \pi^*) = \ell_*(s, \pi, \pi^*) + \omega\, \ell_R(s, \pi), \qquad (2)$$
where $\ell_*$ measures how well $\pi(s)$ agrees with $\pi^*(s)$, and $\ell_R$ measures how smooth the prediction $\pi(s)$ is relative to the current state $s$, with $\omega \geq 0$ controlling the trade-off between the two. The precision error $\ell_*$ is typically the squared deviation: $\ell_*(s, \pi, \pi^*) = \|\pi^*(s) - \pi(s)\|^2$.
A standard way to instantiate the smoothness error $\ell_R$ is via the squared deviation of the velocity: $\ell_R(s, \pi) = \|v(s) - v(\pi(s))\|^2$, where $v(s)$ denotes the velocity in state $s$, which can be computed from the information encoded in $s$. One can thus interpret the smoothness error as encouraging the curvature of the predicted trajectories to be low.
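As an illustration, the per-step loss in (2) can be computed from the expert value and the two most recent predictions; the sketch below assumes one-step finite differences for the velocities, which is one simple choice:

```python
def step_loss(y_pred, y_expert, y_prev, y_prev2, omega):
    """Per-step loss (2): squared precision error plus omega times a
    squared velocity-change (smoothness) term, using one-step differences."""
    precision = (y_expert - y_pred) ** 2   # l_*: deviation from the expert
    v_state = y_prev - y_prev2             # v(s): velocity encoded in the state
    v_new = y_pred - y_prev                # v(pi(s)): velocity after the new prediction
    smoothness = (v_state - v_new) ** 2    # l_R: penalizes abrupt velocity changes
    return precision + omega * smoothness
```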
We assume the agnostic setting, where the minimizer of (1) does not necessarily achieve zero loss (i.e., we cannot perfectly imitate the human expert). In practice, we approximate the expectation in (1) with a finite sample (e.g., a training set of basketball gameplay sequences), and also solve (1) via alternating minimization: collect training data according to the current $\hat{\pi}$, and then train a new $\hat{\pi}$ using that training data (see Section 5 for more details).
Discussion and Interpretation. By utilizing a recurrent policy class $\Pi$, such as our recurrent decision tree framework (see Section 4), the learner is able to reason about inherent temporal relationships. For instance, suppose the learner incorrectly predicted a previous value (see Fig. 1). Then the learner could make a drastic correction (blue line) to minimize the discrepancy as quickly as possible, which might result in unaesthetic high-frequency camera motion. Instead, a more gradual correction (red line) may better trade off between instantaneous error and smooth motion.

Estimating the best $\hat{\pi}$ is challenging due to its dependence on its own previous predictions. Most notably, this recurrent relationship makes it nontrivial to extract a set of independent training examples to use with conventional supervised learning. This issue is formally highlighted in the learning objective (1), where the distribution that the loss is evaluated over is induced by the policy under consideration. In other words, the distribution $d^{\pi}$ that the loss is evaluated on in (1) depends on the $\pi$ being considered, which leads to a complicated learning problem since conventional supervised learning assumes that distribution is fixed.
One straightforward approach is to approximate $d^{\pi}$ in (1) using the distribution of states, $d^{\pi^*}$, that the human expert $\pi^*$ induces on $d_{\mathbf{x}}$, and then select the $\hat{\pi} \in \Pi$ to minimize:
$$\hat{\pi} = \arg\min_{\pi \in \Pi} \; \mathbb{E}_{s \sim d^{\pi^*}}\left[ \ell(s, \pi, \pi^*) \right]. \qquad (3)$$
Note that (3) is easy to evaluate since the expectation is over $d^{\pi^*}$ and does not depend on the policy $\pi$ under consideration.¹ However, as soon as the behavior of $\hat{\pi}$ deviates from $\pi^*$, the distribution of states experienced by $\hat{\pi}$ will differ from that of $\pi^*$, and thus optimizing (3) will not be aligned with optimizing (1).² We show in Section 5 how to address this distribution mismatch problem via an alternating minimization approach that efficiently converges to a stable solution.
4. Recurrent Decision Tree Framework
Decision trees are amongst the best performing learn-
ing approaches, and are popular whenever a non-parametric
predictor is desirable [10, 14, 23]. We propose a recurrent
extension, where the prediction at the leaf node is not nec-
essarily constant, but rather is a (smooth) function of both
the static leaf node prediction and previous predictions from the
tree. For simplicity, we present our framework using a sin-
gle decision tree, although our approach extends trivially to
ensemble methods such as random forests [3, 10].
A decision tree specifies a partitioning of the input space (i.e., the space of all possible states $s$). Let $D = \{(s_m, y^*_m)\}_{m=1}^{M}$ denote a training set of state/target pairs. Conventional regression tree learning aims to learn a partitioning such that each leaf node, denoted $\mathrm{node}$, makes a constant prediction that minimizes the squared loss:
$$\bar{y}_{\mathrm{node}} = \arg\min_{y} \sum_{(s, y^*) \in D_{\mathrm{node}}} (y^* - y)^2, \qquad (4)$$
where $D_{\mathrm{node}}$ denotes the training data from $D$ that has been partitioned into the leaf $\mathrm{node}$. The solution to (4) is:
$$\bar{y}_{\mathrm{node}} = \frac{1}{\mathrm{size}(D_{\mathrm{node}})} \sum_{(s, y^*) \in D_{\mathrm{node}}} y^*. \qquad (5)$$
One typically trains via greedy top-down induction [10], an iterative process that repeatedly chooses a leaf node to split based on a binary query of the input state $s$.
The above formulation (4) can be viewed as the simplest version of our recurrent decision tree framework. In particular, the decision tree is allowed to branch on the input state $s$, which, as discussed in Section 3, includes the previous predictions $\{y_{-1}, \ldots, y_{-\tau}\}$. Thus, decision trees that minimize (4) form a valid recurrent policy class, and can be used to instantiate $\Pi$. In the following, we describe a more general class of recurrent decision trees that enforce more explicit smoothness requirements.

¹ In practice, optimizing (3) reduces to a standard supervised learning scenario. One simply collects the decisions made by the human expert $\pi^*$ to use as a static training set, and then chooses the $\hat{\pi} \in \Pi$ that agrees most with $\pi^*$ on the states induced by $\pi^*$.

² One simplified interpretation is that training on (3) does not teach the predictor how to recover from its own mistakes.
Note that recurrent neural networks (RNNs) [2] impose a slightly different notion of "recurrent" than our approach. Whereas our approach is recurrently defined only with respect to its own previous predictions, RNNs are recurrently defined with respect to previous hidden unit activations and/or previous predictions.
4.1. Jointly Capturing Dynamics and Control
Let $f_{\pi}(y_{-1}, \ldots, y_{-\tau})$ denote an autoregressor of the temporal dynamics of $\pi$ over the distribution of input sequences $d_{\mathbf{x}}$, while ignoring the exogenous inputs $\mathbf{x}$. In other words, at time step $t$, $f_{\pi}$ predicts the behavior $y_t \equiv \pi(s_t)$ given only $y_{t-1}, \ldots, y_{t-\tau}$. Typically, $f_{\pi}$ is selected from a class of autoregressors $F$ (e.g., smooth autoregressors). For our experiments, we use regularized linear autoregressors as $F$, although one could also use more complex functions (e.g., based on Kalman filters). See the supplemental material for more details on linear autoregressors.
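As one possible instantiation (a sketch, not necessarily the authors' exact implementation), a regularized linear autoregressor can be fit by ridge regression on lagged windows of the training trajectories:

```python
import numpy as np

def fit_linear_autoregressor(trajectories, tau, ridge=1e-2):
    """Fit y_t ~ w . [y_{t-1}, ..., y_{t-tau}] + b by ridge regression."""
    rows, targets = [], []
    for y in trajectories:                        # each y: 1-D sequence of pan angles
        y = np.asarray(y, dtype=float)
        for t in range(tau, len(y)):
            rows.append(np.r_[y[t - tau:t][::-1], 1.0])   # lags (newest first) + bias
            targets.append(y[t])
    A, b = np.asarray(rows), np.asarray(targets)
    w = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ b)
    # prev = [y_{-1}, ..., y_{-tau}], i.e., the most recent prediction first
    return lambda prev: float(np.r_[np.asarray(prev, dtype=float), 1.0] @ w)
```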
We now present a policy class $\Pi$ of recurrent decision trees $\pi$ that make smoothed predictions by regularizing each prediction to be close to its autoregressor $f_{\pi}$. For any tree/policy $\pi$, each leaf node is associated with a "control" or "target" signal $\bar{y}_{\mathrm{node}}$ such that the prediction $\hat{y}$ given input state $s$ is:
$$\hat{y} \equiv \pi(s) \equiv \arg\min_{y} \; (y - \bar{y}_{\mathrm{node}(s)})^2 + \lambda (y - f_{\pi}(s))^2, \qquad (6)$$
where $\mathrm{node}(s)$ denotes the leaf node of the decision tree that $s$ branches to, and $\lambda \geq 0$ trades off between $\hat{y}$ matching the target signal $\bar{y}_{\mathrm{node}(s)}$ versus the smooth autoregressor $f_{\pi}(s)$. The closed-form solution to (6) is:
$$\hat{y}(s) = \frac{\bar{y}_{\mathrm{node}(s)} + \lambda f_{\pi}(s)}{1 + \lambda}. \qquad (7)$$
Compared to just predicting $\hat{y} \equiv \bar{y}_{\mathrm{node}(s)}$, the prediction in (6) varies much more smoothly with $s$, since $\hat{y}$ is now an interpolation between the target signal $\bar{y}_{\mathrm{node}(s)}$ and the smooth extrapolation $f_{\pi}(s)$ from previous predictions.
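In code, the prediction step is a one-line interpolation per (7); `leaf_target` and `autoregressor` below are hypothetical callables standing in for the trained tree and $f_{\pi}$, and the state layout follows the earlier roll-out sketch:

```python
def recurrent_tree_predict(state, leaf_target, autoregressor, lam):
    """Closed-form solution (7) of the leaf objective (6): interpolate the
    leaf's target signal with the autoregressive extrapolation."""
    y_bar = leaf_target(state)                 # bar{y}_{node(s)}: branch on s, read leaf value
    y_ar = autoregressor(state["y"])           # f_pi(s): extrapolate from y_{-1},...,y_{-tau}
    return (y_bar + lam * y_ar) / (1.0 + lam)  # equation (7)
```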
Training requires estimating both the autoregressor $f_{\pi}$ and the decision tree of target signals $\bar{y}_{\mathrm{node}}$. In practice, given training data $D$, we perform greedy training by first estimating $f_{\pi}$ on $D$ (which ignores the exogenous inputs $\mathbf{x}$),³ and then estimating the decision tree of target signals to "course correct" $f_{\pi}$. Given a fixed $f_{\pi}$ and decision tree structure, one can set $\bar{y}_{\mathrm{node}}$ as:
$$\bar{y}_{\mathrm{node}} = \arg\min_{y} \sum_{(s, y^*) \in D_{\mathrm{node}}} \left( y^* - \hat{y}(s \,|\, y) \right)^2, \qquad (8)$$
for $\hat{y}(s \,|\, y)$ defined as in (7) with $y \equiv \bar{y}_{\mathrm{node}(s)}$. Similar to (5), we can write the closed-form solution of (8) as:
$$\bar{y}_{\mathrm{node}} = \frac{1}{\mathrm{size}(D_{\mathrm{node}})} \sum_{(s, y^*) \in D_{\mathrm{node}}} \left( (1 + \lambda)\, y^* - \lambda f_{\pi}(s) \right). \qquad (9)$$

³ Intuitively, if we train $f_{\pi}$ on $D$, then the state distribution of $D$ should be similar to the state distribution that will be induced by the resulting trained $\pi$. We address this point further in Section 5.
Note that when $\lambda = 0$, the autoregressor has no influence, and (9) reduces to (5). Note also that (8) is a simplified setting that only considers the imitation loss $\ell_*$, but not the smoothness loss $\ell_R$. We refer to the supplemental material for more details regarding our full training procedure.
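A corresponding sketch of the leaf estimate (9): given the training pairs routed to a leaf and a fixed autoregressor, the target signal is an average of autoregressor-corrected labels (hypothetical names again):

```python
def fit_leaf_target(leaf_data, autoregressor, lam):
    """Closed-form leaf target (9); leaf_data is a list of (state, y_star)
    pairs routed to this leaf. Reduces to the plain mean (5) when lam == 0."""
    corrected = [(1.0 + lam) * y_star - lam * autoregressor(s["y"])
                 for s, y_star in leaf_data]
    return sum(corrected) / len(corrected)
```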
One can interpret our recurrent decision tree framework as holistically integrating model-based temporal dynamics $f$ with model-free control $\bar{y}_{\mathrm{node}}$. The target signal $\bar{y}_{\mathrm{node}}$ can thus focus on generating "course corrections" of the smooth temporal dynamics imposed by $f$. In practice, the target signal $\bar{y}_{\mathrm{node}}$ can also be set using an ensemble such as random forests [3, 10, 18], or regression trees that predict piecewise-linear variables in the leaf nodes [15]. In this way, our framework is completely complementary with previous work on learning smoother regression tree predictors.
Extensions. One could also define the predictions $y$ as velocities rather than absolute coordinates, which, coupled with the current state $s$, would still encode the absolute coordinates. Such an approach would be desirable if one wants to perform autoregressor regularization in the velocity domain.

Another extension is to replace the L2 autoregression penalty in (8) with an L1 penalty. With an L1 penalty, small target signals would not deviate the prediction from the autoregressor, thus making the temporal curvature potentially piecewise linear (which may be desirable). However, the non-smooth nature of L1 regularization would make the prediction more sensitive to the choice of $\lambda$.
5. Iterative Training Procedure
In general, it can be difficult to exactly solve (1) due to the circular dependency between the distribution of states and the policy under consideration. One meta-learning approach is to alternate between estimating $d^{\pi}$ for a fixed $\pi$, and optimizing $\pi$ for a fixed $d^{\pi}$. At a high level, this can be described as:
1. Start with some initial policy $\hat{\pi}_0$.
2. At the $i$-th iteration, use the previously estimated policies $\hat{\pi}_0, \ldots, \hat{\pi}_{i-1}$ to construct a new distribution $d_i$.
3. Estimate the best policy $\hat{\pi}_i$ via:
$$\hat{\pi}_i = \arg\min_{\pi \in \Pi} \; \mathbb{E}_{s \sim d_i}\left[ \ell_i(s, \pi, \pi^*) \right]. \qquad (10)$$
Note that the loss function $\ell_i$ need not be the original loss function, but simply needs to converge to it.
4. Repeat from Step 2 with $i \leftarrow i + 1$ until some termination condition is met.
One typically initializes using the human expert, $\hat{\pi}_0 = \pi^*$, for which we have demonstrations in the training set (i.e., $\pi^*$ is memorized on the training set). The estimation problem in (10) can be solved via standard supervised learning (see the supplemental material for how we solve (10) for our recurrent decision tree framework). It remains to decide how to construct a stable and converging sequence of distributions $d_i$. For instance, it is known that simply choosing $d_i = d^{\hat{\pi}_{i-1}}$ does not guarantee stable behavior [12, 29].
We build upon the SEARN approach [12] to iteratively construct a stable sequence of distributions $d_i$ for training each $\hat{\pi}_i$. Algorithm 1 describes our approach, which is a meta-learning approach that can be applied to other policy classes beyond our recurrent decision tree framework (i.e., by using a different implementation of Learn). Note also that we approximate each state distribution $d_i$ using a finite empirical sample $\tilde{S}_i$, i.e., $\tilde{S}_i$ is constructed from the given input sequences $X$ and the predictions at iteration $i$, $\tilde{Y}_i$.
Algorithm 1 is an instance of the alternating procedure described above. Given a state distribution $d_i$, the training problem is a straightforward supervised learning problem (Line 7). Note that the iteration-specific loss $\ell_i$ is simply the original loss $\ell$ using a (potentially) modified imitation target $Y^*_i$ that converges to the original $Y$ (Line 6).

The key innovation of SEARN [12] is that the new distribution $d_i$ should be induced by an exponentially decaying interpolation between every $\hat{\pi}_0, \ldots, \hat{\pi}_{i-1}$. In practice, we found it more convenient to interpolate the trajectories $\hat{Y}_0, \ldots, \hat{Y}_{i-1}$ (Line 10), which leads to a set of trajectories $\tilde{Y}_i$ that can be combined with the input $X$ to form an empirical sample of states $\tilde{S}_i$ that approximates $d_i$ (Line 5). Because of this interpolation, the sequence of distributions $d_i$ converges in a stable way (see [12] for a rigorous analysis of the convergence behavior). This stable behavior is especially important given our greedy training procedure for the recurrent decision tree framework: if the resulting $\hat{\pi}_i$ behaves very differently from the training distribution $d_i$, then $f_{\hat{\pi}_i}$ will not be a good autoregressor of $\hat{\pi}_i$.
5.1. Implementation Details
Choosing β. A straightforward choice of interpolation parameter is to set β to a small constant (close to 0) to ensure that the new distribution $d_i$ stays close to the previous distribution $d_{i-1}$, which is the original approach of SEARN [12]. One drawback of this conservative approach is a slower convergence rate, especially when the learner needs to quickly move away from a bad policy $\hat{\pi}_i$.
Algorithm 1 Iterative Training Procedure
Input: input sequences $X$
Input: expert demonstrations $Y$ for $X$
Input: autoregression function class $F$
Input: history time horizon $\tau$ for generating states   //see Section 3
Input: $\beta(\ldots)$   //distribution drift step size
1: Initialize $\tilde{S}_0 \leftarrow$ state sequences defined by $Y$ and $X$
2: $\hat{\pi}_0 \leftarrow \mathrm{Learn}(\tilde{S}_0, Y)$
3: Initialize $\tilde{Y}_1 \leftarrow \{\hat{\pi}_0(\mathbf{x}) \,|\, \mathbf{x} \in X\}$   //initialize exploration
4: for $i = 1, \ldots, N$ do
5:   $\tilde{S}_i \leftarrow$ state sequences defined by $\tilde{Y}_i$ and $X$
6:   $Y^*_i \leftarrow$ expert feedback on each $\tilde{s} \in \tilde{S}_i$   //see Section 5.1
7:   $\hat{\pi}_i \leftarrow \mathrm{Learn}(\tilde{S}_i, Y^*_i)$   //minimizing (10)
8:   $\hat{Y}_i \leftarrow \{\hat{\pi}_i(\mathbf{x}) \,|\, \mathbf{x} \in X\}$   //roll out $\hat{\pi}_i$
9:   $\hat{\beta}_i \leftarrow \beta(\mathrm{error}(\hat{Y}_i), \mathrm{error}(\tilde{Y}_i))$   //see Section 5.1
10:  $\tilde{Y}_{i+1} \leftarrow \hat{\beta}_i \hat{Y}_i + (1 - \hat{\beta}_i) \tilde{Y}_i$   //distribution interpolation
11: end for
12: return $\hat{\pi} \in \{\hat{\pi}_1, \ldots, \hat{\pi}_N\}$ with best validation performance

We can also adaptively select β based on the relative empirical loss of the learned policies (Line 9 of Algorithm 1). Let $\mathrm{error}(\hat{Y}_i)$ and $\mathrm{error}(\tilde{Y}_i)$ denote the loss (2) of the rolled-out trajectories $\hat{Y}_i$ and $\tilde{Y}_i$, respectively, with respect to the ground truth $Y$. We can then set β as:
$$\hat{\beta}_i = \frac{\mathrm{error}(\tilde{Y}_i)}{\mathrm{error}(\hat{Y}_i) + \mathrm{error}(\tilde{Y}_i)}. \qquad (11)$$
This adaptive selection of β encourages the learner to disregard bad policies when interpolating, thus allowing fast convergence to a good policy.
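A compact sketch of the loop in Algorithm 1 with the adaptive step size (11); `learn`, `roll_out`, `expert_feedback`, and `error` are hypothetical stand-ins for the components described above, not names from the paper:

```python
def iterative_training(X, Y, learn, roll_out, expert_feedback, error, n_iters):
    """SEARN-style training loop (Algorithm 1) with adaptive beta from (11).
    X: input sequences; Y: expert trajectories. learn(X, Y_hist, targets)
    trains a policy on states built from the inputs and previous predictions."""
    pi = learn(X, Y, Y)                        # Lines 1-2: states from expert demonstrations
    Y_tilde = [roll_out(pi, x) for x in X]     # Line 3: initialize exploration
    policies = [pi]
    for i in range(1, n_iters + 1):
        Y_star = expert_feedback(Y_tilde, Y)   # Line 6: (smoothed) expert targets
        pi = learn(X, Y_tilde, Y_star)         # Line 7: supervised step on d_i
        policies.append(pi)
        Y_hat = [roll_out(pi, x) for x in X]   # Line 8: roll out the new policy
        beta = error(Y_tilde, Y) / (error(Y_hat, Y) + error(Y_tilde, Y))  # Line 9, eq. (11)
        Y_tilde = [[beta * yh + (1 - beta) * yt for yh, yt in zip(h, t)]
                   for h, t in zip(Y_hat, Y_tilde)]   # Line 10: interpolate trajectories
    return policies                            # Line 12: pick the best on validation outside
```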
Choosing Y*. Intuitively, whenever the current policy $\hat{\pi}_i$ makes a mistake, the expert feedback $y^*$ should recommend smooth corrections (e.g., follow the red line in Figure 1). In many cases, it suffices to simply use the original expert trajectories, $Y^*_i \leftarrow Y$. However, when $\tilde{Y}_i$ differs substantially from $Y$, especially during early iterations, it can be beneficial to use "smoothed" expert feedback $Y^*_i$ in place of $Y$.

Specifically, each step along $Y^*_i$ is computed to be a gradual one-step correction of $\tilde{Y}_i$ towards $Y$, which gradually converges to $Y$ in later iterations. This is not strictly necessary, as our recurrent decision tree framework can offset non-smooth corrections by enforcing a larger regularization parameter $\lambda$ (trading higher smoothness for lower imitation accuracy, cf. (8)). Generally, however, given the same smoothness weight $\lambda$, smoothed expert feedback leads to more stable learning. One simple way to provide smoothed $Y^*_i$ is to reduce the distance ratio $(y^* - y)/(\tilde{y} - y)$ by a small decaying rate: $y^* = y + e^{-\eta}(\tilde{y} - y)$.
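For example, this smoothed feedback rule can be applied per time step as follows (a sketch; η is the user-chosen decay rate):

```python
import math

def smoothed_expert_feedback(y_expert, y_rollout, eta):
    """Move the feedback target from the expert value y toward the roll-out
    value y_tilde by the decayed ratio e^{-eta}: y* = y + e^{-eta} (y_tilde - y)."""
    return y_expert + math.exp(-eta) * (y_rollout - y_expert)
```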
6. Experiments
We evaluate our approach on automated broadcasting for basketball and soccer (see Fig. 3). A high-school basketball match was recorded using two cameras: one near the ceiling for player detection, and one at ground level for broadcasting (operated by a human expert). The videos were synchronized at 60 fps. 'Timeouts' were manually removed, resulting in 32 minutes of 'in-play' data divided into roughly 50 segments (each about 40 seconds long), with two held out for validation and testing.
A semi-professional soccer match was recorded using three cameras: two near the floodlights for player detection, and a robotic PTZ located at mid-field remotely operated by a human expert. The videos were synchronized at 60 fps. About 91 minutes were used for training, and two 2-minute sequences were held out for validation and testing.
Features. The ground locations of players were determined from 3D geometric primitives which best justified the background subtraction results [4]. Each ground position was projected to a spherical coordinate system centered and aligned with the broadcast camera. Because the number of detections varies due to clutter and occlusions, a fixed-length feature vector was constructed using spatial frequency counts. The surface of the sphere was quantized at three resolutions (1×2, 1×4, and 1×8), resulting in a 14-dimensional feature vector x_t [6].
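For illustration, the following sketch reflects our reading of this feature construction: histogram the detections' pan angles at the three resolutions and concatenate. The field-of-view range and the use of raw counts are assumptions, not taken from the paper:

```python
import numpy as np

def spherical_count_features(pan_angles, fov=(-90.0, 90.0)):
    """Build a 14-D spatial frequency feature: histograms of detection
    pan angles at 1x2, 1x4, and 1x8 resolutions, concatenated."""
    feats = []
    for bins in (2, 4, 8):
        hist, _ = np.histogram(pan_angles, bins=bins, range=fov)
        feats.append(hist)
    return np.concatenate(feats).astype(float)   # 2 + 4 + 8 = 14 dimensions
```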
Labels. Pan/tilt/zoom parameters are estimated for each frame of the broadcast video by matching detected SIFT key points to a collection of manually calibrated reference frames, in a similar fashion to [20]. The homography between the current frame and the best match in the database of reference images is estimated, from which the camera's pan-tilt-zoom settings are extracted. Because the tilt and zoom of the broadcast camera do not vary significantly over the dataset, our experiments only focus on building an estimator for online prediction of pan angles.
6.1. Baselines
Savitzky-Golay. [6] learns a predictor using a random forest trained on only the current player locations. A Savitzky-Golay (SG) filter smooths the predictions, but induces a delay. Our implementation of this method augments the current player locations with previous player locations. This modification makes the instantaneous predictions more reliable, as the predictor has more temporal information.
Kalman Filter.
We replace the Savitzky-Golay filter with
a Kalman filter employing a constant velocity process
model. Parameters were determined through validation (see
supplemental material).
Dual Kalman Filter.
A dual Kalman filter [21] simulta-
neously estimates the unknown state of the system, as well
as its process matrix. Similar to our formulation, we as-
sume the system adheres to an autoregressive model. This
method then applies two Kalman filters in parallel: one to
estimate the coefficients of the autoregressive function, and
a second to estimate the trajectory of the camera, based on
the current estimate of the autoregressive model. Again, pa-
rameters were tuned through validation.
Conditional Regression Forests. Conditional regression forests (CRFs) [11] split the training data into multiple subsets. We tested various splitting methods based on camera position and velocity, such as dividing the data into 4, 8, and 16 subsets of pan angle. We also tried both disjoint sets and joint sets with different overlap ratios. We report the best result, from 8 subsets with 50% overlap. The output of the CRF is further smoothed by an SG filter.
Filter Forests.
Filter forests (FF) [15] is an efficient dis-
criminative approach for predicting continuous variables
given a signal. FF can learn the optimal filtering kernels
to smooth temporal signals. Our implementation includes
some adaptations, such as limited candidate window sizes,
to improve the performance on our datasets.
6.2. Benchmark Experiments
Fig. 4 shows the benchmark performance evaluated for both basketball and soccer. We evaluate using the joint loss (2) with ω = 500. The precision and smoothness losses are plotted separately to illustrate their relative contributions to the joint loss. For both settings, we see that our approach achieves the best performance, with the performance gap being especially pronounced in basketball.
We note that the soccer setting is significantly more chal-
lenging than basketball, and no method performs particu-
larly well for soccer. One possible explanation is that soccer
camera planning using only player detections is unreliable
due to players not following the ball (unlike in basketball).
A visual inspection of the generated videos also suggests
that lack of ball tracking in the input signal x is a significant
limitation in the soccer setting.
We also observe that our approach achieves very low smoothness loss, despite not utilizing a post-processing smoothing step (see Table 1 for a summary description of the different approaches). For instance, in the basketball setting, our approach actually achieves the smoothest performance; in fact, it dominates all baselines in both imitation loss and smoothness loss. The KF and FF baselines tend to achieve competitive smoothness loss, but can suffer substantial imitation loss. For soccer, the SG and CRF baselines achieve lower imitation loss, suggesting that they might be better able to track the action. However, they suffer from substantial jitter.

Figure 4. Prediction loss. (a) Basketball; (b) Soccer. Loss measured by (2), with precision and smoothness components shown for SG, KF, Dual KF, CRF, FF, and Ours. Our method achieves the lowest overall loss.

Method        | Noise Model  | Post-Smooth  | Delay
SG & CRF      | high-freq    | required     | Yes
KF & Dual KF  | Gaussian     | required     | No
FF            | self-learned | not required | No
Ours          | self-learned | not required | No

Table 1. Qualitative comparison. Both our method and FF [15] learn the noise model from the data, do not require post-processing, and achieve real-time response. However, our model requires less data and is efficient for both training and testing.
6.3. Visual Inspection
Figs. 5 and 6 show a visualization of the predictions.
From this visual inspection, we can verify qualitatively
that our method achieves the best balance of precision and
smoothness. Our predicted trajectory remains close to the
human operator’s trajectory and has less jitter than the other
methods. Even with post-smoothing, SG and CRF exhibit
significant jitter. KF struggles between jitter and overshooting when the noise is not Gaussian. Surprisingly, dual KF
performance is worse than KF, which is again presumably
because the noise is not Gaussian, and errors in the pro-
cess estimation propagate to the state estimation (see sup-
plemental material). FF is very competitive in the basket-
ball dataset, but its performance suffers from large jitter in
the more challenging soccer dataset.
Table 1 summarizes qualitative properties of all the
methods. SG and CRF assume the noise only has high-
frequency components. As a result, they struggle with low-
frequency noise. KF and dual KF rely on a Gaussian noise
assumption, which is not reasonable on our datasets.
Figure 5. Comparison on basketball data. (a) Method of [6] (SG), (b) Kalman filter, (c) Dual Kalman filter, (d) Conditional regression forests [11], (e) Filter forests [15], (f) Our method. The blue line is from human operators and the red line is from predictions. Note our method does not need any post-processing.

Figure 6. Comparison on soccer data. (a) Method of [6] (SG), (b) Kalman filter, (c) Dual Kalman filter, (d) Conditional regression forests [11], (e) Filter forests [15], (f) Our method. The blue line is from human operators and the red line is from predictions. Note our method does not need any post-processing.

Both our method and FF [15] learn control and dynamics
aspects from the data. Neither requires post-processing, and both are able to generate a zero-delay response. In camera planning, our method has three advantages over FF. First, our method has fewer parameters, so it requires less data to train. Second, our model can be trained much faster than FF, because FF has to solve a large linear system in each split. Third, the experimental results show that our method is more stable on both datasets.
Figure 7. Comparing varying λ for basketball data. λ = 300 is a good trade-off between smoothness and accuracy. Very small λ allows more accurate but noisier predictions, and very large λ leads to smoother but less accurate predictions.

Fig. 7 provides a visual comparison of our framework using a random forest of 100 trees and varying influence of the autoregressor, λ. When λ = 0, the autoregressor has no influence, and so the smoothness of the prediction depends solely on the smoothness of the decision tree ensemble and the size of the history window, τ. Since decision trees are non-parametric, one could in principle learn a smooth predictor given sufficient training examples and enough trees, but the data and computational costs would be immense. As λ increases, the autoregressor causes the predictions to be increasingly smooth. Recall that the learner (Algorithm 1) always seeks to find the predictor within the policy class with the lowest loss, and so it can adaptively trade off between smoothness and accuracy depending on the input signal x (rather than rely on post-processing). As λ becomes very large, the policy class becomes restricted to extremely smooth predictors. In practice, λ can be set via a user preference study or validation performance.
6.4. User Preference Study
We also conducted a user preference study to comple-
ment our benchmark experiments. Fig.
8 shows our user
study interface. Videos were generated by warping the
video captured by the human operator. We evaluated our
approach against the five baseline algorithms for both bas-
ketball and soccer. In each trial, participants were instructed
to choose the video that was more appealing (our method
was randomly assigned the left or right view).
Table 2 shows the results. For basketball, our method is significantly preferred over the other methods based on a two-tailed binomial test (p < 0.001). For soccer, none of the methods performed particularly well (see Section 6.2), making it challenging for users to judge which method generated better videos. Note that there is still a sizable preference gap compared to the human expert.
7. Discussion
Although our approach achieved good results for basket-
ball, the results for soccer were much poorer. It is likely that
we require more expressive inputs in order to learn a good policy, such as tracking the ball in addition to the players.
Comparison  | Basketball (win / loss) | Soccer (win / loss)
vs SG       | 22 / 3                  | 14 / 11
vs KF       | 23 / 2                  | 12 / 13
vs dual KF  | 25 / 0                  | 24 / 1
vs CRF      | 24 / 1                  | 12 / 13
vs FF       | 23 / 2                  | 14 / 11
vs human    | 1 / 24                  | 4 / 21

Table 2. User study results. For basketball, our method is significantly preferred over all baselines. For soccer, all methods performed poorly, and users did not have a strong preference. There is still a sizable preference gap compared to the human expert.
The human demonstrations in the soccer dataset were al-
most entirely piece-wise linear. In other words, the human
expert is almost always directing the camera in a straight
line with very sharp course corrections. As such, it may be
that we require L1 regularization to an autoregressor that
prefers zero acceleration in order to better capture the tem-
poral dynamics of camera planning in soccer.
We chose decision trees due to their non-parametric or
model-free properties as well as ease of training. Using
other complex predictors, such as deep neural nets or Gaus-
sian processes, could potentially also work well.
Finally, our approach only addresses the planning prob-
lem, and cannot be directly applied to physical camera con-
trol without a control model. It would be interesting to
jointly learn both planning and physical control [7].
8. Summary
We have introduced the problem of smooth online im-
itation learning, where the goal is to learn a predictor to
smoothly imitate a human expert given a stream of in-
put signals.
We have also presented a recurrent non-
parametric model class for generating smooth predictions,
which jointly integrates a model-free control signal with a
model-based autoregressor. In order to guarantee stability
of training, we extended previous work on iterative learning
of dynamical models to our setting. We applied our ap-
proach to camera control in sports, where we demonstrated
significant improvements over several strong baselines.