Hierarchical Exploration for Accelerating Contextual Bandits
Yisong Yue
yisongyue@cmu.edu
iLab, H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Sue Ann Hong
sahong@cs.cmu.edu
Carlos Guestrin
guestrin@cs.cmu.edu
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Abstract

Contextual bandit learning is an increasingly popular approach to optimizing recommender systems via user feedback, but can be slow to converge in practice due to the need for exploring a large feature space. In this paper, we propose a coarse-to-fine hierarchical approach for encoding prior knowledge that drastically reduces the amount of exploration required. Intuitively, user preferences can be reasonably embedded in a coarse low-dimensional feature space that can be explored efficiently, requiring exploration in the high-dimensional space only as necessary. We introduce a bandit algorithm that explores within this coarse-to-fine spectrum, and prove performance guarantees that depend on how well the coarse space captures the user's preferences. We demonstrate substantial improvement over conventional bandit algorithms through extensive simulation as well as a live user study in the setting of personalized news recommendation.
1. Introduction

User feedback (e.g., ratings and clicks) has become a crucial source of training data for optimizing recommender systems. When making recommendations, one must balance the needs for exploration (gathering informative feedback) and exploitation (maximizing estimated user utility). A common formalization of such a problem is the linear stochastic bandit problem (Li et al., 2010), which models user utility as a linear function of user and content features.
Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).
Unfortunately, conventional bandit algorithms can converge slowly with even moderately large feature spaces. For instance, the well-studied LinUCB algorithm (Dani et al., 2008; Abbasi-Yadkori et al., 2011) achieves a regret bound that is linear in the dimensionality of the feature space, which cannot be improved without further assumptions.[1]
Intuitively, any bandit algorithm must make recommendations that cover the entire feature space in order to guarantee learning a reliable user model. Therefore, a common approach to dealing with slow convergence is dimensionality reduction based on prior knowledge, such as previously learned user profiles, by representing new users as linear combinations of "stereotypical users" (Li et al., 2010; Yue & Guestrin, 2011).
However, if a user deviates from stereotypical users, then a reduced space may not be expressive enough to adequately learn her preferences. The challenge lies in appropriately leveraging prior knowledge to reduce the cost of exploration for new users, while maintaining the representational power of the full feature space.
Our solution is a coarse-to-fine hierarchical approach for encoding prior knowledge. Intuitively, a coarse, low-rank subspace of the full feature space may be sufficient to accurately learn a stereotypical user's preferences. At the same time, this coarse-to-fine feature hierarchy allows exploration in the full space when a user is not perfectly modeled by the coarse space.
We propose an algorithm, CoFineUCB, that automatically balances exploration within the coarse-to-fine feature hierarchy. We prove regret bounds that depend on how well the user's preferences project onto the coarse subspace. We also present a simple and general method for constructing feature hierarchies using prior knowledge. We perform empirical validation through simulation as well as a live user study in personalized news recommendation, demonstrating that CoFineUCB can substantially outperform conventional methods utilizing only a single feature space.

[1] The regret bound is information-theoretically optimal up to log factors (Dani et al., 2008).
2. The Learning Problem

We study the linear stochastic bandit problem (Abbasi-Yadkori et al., 2011), which formalizes a recommendation system as a bandit algorithm that iteratively performs actions and learns from rewards received per action. At each iteration $t = 1, \ldots, T$, our algorithm interacts with the user as follows:

- The system recommends an item (i.e., performs an action) associated with feature vector $x_t \in X_t \subset \mathbb{R}^D$, which encodes content and user features.
- The user provides feedback (i.e., reward) $\hat{y}_t$.
Rewards $\hat{y}_t$ are modeled as a linear function of actions $x \in \mathbb{R}^D$ such that $\mathbb{E}[\hat{y}_t \mid x] = w^{*\top} x$, where the weight vector $w^*$ denotes the user's (unknown) preferences. We assume feedback to be independently sampled and bounded within $[0, 1]$,[2] and that $\|x\| \le 1$ holds for all $x$. We quantify performance using the notion of regret, which compares the expected rewards of the selected actions versus the optimal expected rewards:

$$R_T(w^*) = \sum_{t=1}^{T} w^{*\top} x^*_t - w^{*\top} x_t, \qquad (1)$$

where $x^*_t = \operatorname*{argmax}_{x \in X_t} w^{*\top} x$.[3]
We further suppose that user preferences are distributed according to some distribution $\mathcal{W}$. We can then define the expected regret over $\mathcal{W}$ as

$$R_T(\mathcal{W}) = \mathbb{E}_{w^* \sim \mathcal{W}}\left[ R_T(w^*) \right], \qquad (2)$$

and the goal now for the bandit algorithm is to perform well with respect to $\mathcal{W}$. We will present an approach for optimizing (2) given a collection of existing user profiles sampled i.i.d. from $\mathcal{W}$.
3. Feature Hierarchies

To learn a reliable user model (i.e., a reliable estimate of $w^*$) from user feedback, bandit algorithms must make recommendations that explore the entire $D$-dimensional feature space. Conventional bandit algorithms such as LinUCB place uniform a priori importance on each dimension, which can be inefficient
[2] Our results also hold when each $\hat{y}_t$ is independent with sub-Gaussian noise and mean $w^{*\top} x_t$ (see Appendix A).
[3] Since the rewards are sampled independently, any guarantee on (1) translates into a high-probability guarantee on the regret of the observed feedback $\sum_{t=1}^{T} w^{*\top} x^*_t - \hat{y}_t$.
Figure 1. A visualization of a feature hierarchy, where $w^*$ denotes the user profile, and $\tilde{w}^*$ the projected user profile.
in practice, especially if additional structure can be assumed. We now motivate and formalize one such structure: the feature hierarchy.
For example, suppose that two of the $D$ features correspond to interest in articles about baseball and cricket. Suppose also that our prior knowledge suggests that users are typically interested in one or the other, but rarely both. Then we can design a feature subspace where baseball and cricket topics project along opposite directions in a single dimension. A bandit algorithm leveraging this structure should, ideally, first explore at a coarse level to determine whether the user is more interested in articles about baseball or cricket.
We can formalize the different levels of exploration as a hierarchy that is composed of the full feature space and a subspace. We define a $K$-dimensional subspace using a matrix $U \in \mathbb{R}^{D \times K}$, and denote the projection of action $x \in \mathbb{R}^D$ into the subspace as $\tilde{x} \equiv U^\top x$. Likewise, we can write the user's preferences $w^*$ as

$$w^* = U \tilde{w}^* + w^*_\perp, \qquad (3)$$
where we call $w^*_\perp$ the residual, or orthogonal component, of $w^*$ w.r.t. $U$. Then,

$$w^{*\top} x = \tilde{w}^{*\top} \tilde{x} + w^{*\top}_\perp x.$$

Figure 1 illustrates a feature hierarchy with a two-dimensional subspace. Here, $w^*$ projects well to the subspace, so we expect $w^{*\top} x \approx \tilde{w}^{*\top} \tilde{x}$ (i.e., $\|w^*_\perp\|$ is small). In such cases, a bandit algorithm can focus exploration on the subspace to achieve faster convergence.
3.1. Extension to Deeper Hierarchies

For the $\ell$-th level, we define the projected $w^*_\ell$ via

$$w^*_{\ell-1} = U_\ell\, w^*_\ell + w^*_{\perp,\ell-1},$$

with $w^*_0 \equiv w^*$ and $w^*_{\perp,0} \equiv w^*_\perp$. Then,

$$w^* = U_1\left( U_2\left( \cdots \left( U_L\, w^*_L + w^*_{\perp,L-1} \right) \cdots \right) + w^*_{\perp,1} \right) + w^*_\perp.$$
For simplicity and practical relevance, we focus on two-level hierarchies.

Algorithm 1 CoFineUCB
1: input: $\lambda$, $\tilde{\lambda}$, $U$, $c_t(\cdot)$, $\tilde{c}_t(\cdot)$
2: for $t = 1, \ldots, T$ do
3:   Define $X_t \equiv [x_1, x_2, \ldots, x_{t-1}]$
4:   Define $\tilde{X}_t \equiv U^\top X_t$
5:   Define $Y_t \equiv [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_{t-1}]$
6:   $\tilde{M}_t \leftarrow \tilde{\lambda} I_K + \tilde{X}_t \tilde{X}_t^\top$
7:   $\tilde{w}_t \leftarrow \tilde{M}_t^{-1} \tilde{X}_t Y_t^\top$   // least squares on coarse level
8:   $M_t \leftarrow \lambda I_D + X_t X_t^\top$
9:   $w_t \leftarrow M_t^{-1} \left( X_t Y_t^\top + \lambda U \tilde{w}_t \right)$   // least squares on fine level
10:  Define $\mu_t(x) \equiv w_t^\top x$
11:  $x_t \leftarrow \operatorname*{argmax}_{x \in X_t} \mu_t(x) + c_t(x) + \tilde{c}_t(x)$   // play action with highest upper confidence bound
12:  Recommend $x_t$, observe reward $\hat{y}_t$
13: end for
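As a concrete illustration (our own sketch, not the authors' code), the estimation steps of Algorithm 1 reduce to two regularized least-squares solves in NumPy; the data layout (past actions as columns of X) follows the algorithm's notation:

```python
import numpy as np

def cofine_estimates(X, y, U, lam, lam_tilde):
    """Estimation steps of Algorithm 1 (Lines 3-9).

    X: (D, t-1) past actions as columns; y: (t-1,) observed rewards;
    U: (D, K) projection; lam, lam_tilde: regularization parameters.
    Returns the coarse estimate w_tilde (Line 7) and fine estimate w (Line 9).
    """
    D, K = U.shape
    X_tilde = U.T @ X                                      # Line 4: project actions
    M_tilde = lam_tilde * np.eye(K) + X_tilde @ X_tilde.T  # Line 6
    w_tilde = np.linalg.solve(M_tilde, X_tilde @ y)        # Line 7
    M = lam * np.eye(D) + X @ X.T                          # Line 8
    w = np.linalg.solve(M, X @ y + lam * (U @ w_tilde))    # Line 9
    return w_tilde, w
```

Action selection (Line 11) then maximizes $\mu_t(x) + c_t(x) + \tilde{c}_t(x)$ over the available actions, with confidence widths as defined in Section 4.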
4. Algorithm & Main Results

We now present a bandit algorithm that exploits feature hierarchies. Our algorithm, CoFineUCB, is an upper confidence bound algorithm that generalizes the well-studied LinUCB algorithm, and automatically trades off between exploring the coarse and full feature spaces. CoFineUCB is described in Algorithm 1. At each iteration $t$, CoFineUCB estimates the user's preferences in the subspace, $\tilde{w}_t$, as well as in the full feature space, $w_t$. Both estimates are solved via regularized least-squares regression. First, $\tilde{w}_t$ is estimated via

$$\tilde{w}_t = \operatorname*{argmin}_{\tilde{w}} \sum_{\tau=1}^{t-1} \left( \tilde{w}^\top \tilde{x}_\tau - \hat{y}_\tau \right)^2 + \tilde{\lambda} \|\tilde{w}\|^2, \qquad (4)$$

where $\tilde{x}_\tau \equiv U^\top x_\tau$ denotes the projected features of the action taken at time $\tau$. Then $w_t$ is estimated via

$$w_t = \operatorname*{argmin}_{w} \sum_{\tau=1}^{t-1} \left( w^\top x_\tau - \hat{y}_\tau \right)^2 + \lambda \|w - U \tilde{w}_t\|^2, \qquad (5)$$

which regularizes $w_t$ to the projection of $\tilde{w}_t$ back into the full space. Both optimization problems have closed-form solutions (Lines 7 & 9 in Algorithm 1).
CoFineUCB is an optimistic algorithm that chooses the action with the largest potential reward (given some target confidence). Selecting such an action requires computing confidence intervals around the mean estimate $w_t$. We maintain confidence intervals for both the full space and the subspace, denoted $c_t(\cdot)$ and $\tilde{c}_t(\cdot)$, respectively. Intuitively, a valid $1-\delta$ confidence interval should satisfy the property that

$$\left| x^\top (w_t - w^*) \right| \le c_t(x) + \tilde{c}_t(x) \qquad (6)$$

holds with probability at least $1-\delta$.
We will show that the following definitions of $c_t(\cdot)$ and $\tilde{c}_t(\cdot)$ yield a valid $1-\delta$ confidence interval:

$$\tilde{c}_t(x) = \tilde{\alpha}^{(v)}_t \left\| U^\top M_t^{-1} x \right\|_{\tilde{M}_t^{-1}} + \tilde{\alpha}^{(b)}_t \left\| \tilde{M}_t^{-1} U^\top M_t^{-1} x \right\| \qquad (7)$$

$$c_t(x) = \alpha^{(v)}_t\, \|x\|_{M_t^{-1}} + \alpha^{(b)}_t \left\| M_t^{-1} x \right\|, \qquad (8)$$

where $\tilde{\alpha}^{(v)}_t$, $\tilde{\alpha}^{(b)}_t$, $\alpha^{(v)}_t$, and $\alpha^{(b)}_t$ are coefficients that must be set properly (Lemma 1).
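In code, the two widths can be computed directly from (7) and (8); this is an illustrative sketch (ours) in which the four alpha coefficients are treated as given constants rather than set via Lemma 1:

```python
import numpy as np

def conf_widths(x, U, M, M_tilde, a_v, a_b, at_v, at_b):
    """Confidence widths c_t(x) and c_tilde_t(x) per Eqs. (7)-(8).

    The alpha coefficients (a_v, a_b for the full space, at_v, at_b for
    the coarse space) are assumed to be set externally, e.g. via Lemma 1.
    """
    Minv_x = np.linalg.solve(M, x)
    # Eq. (8): feedback-variance term ||x||_{M^{-1}} plus bias term ||M^{-1} x||
    c = a_v * np.sqrt(x @ Minv_x) + a_b * np.linalg.norm(Minv_x)
    z = U.T @ Minv_x  # coarse-space image of M^{-1} x
    # Eq. (7): ||U^T M^{-1} x||_{M_tilde^{-1}} plus the coarse bias term
    c_tilde = (at_v * np.sqrt(z @ np.linalg.solve(M_tilde, z))
               + at_b * np.linalg.norm(np.linalg.solve(M_tilde, z)))
    return c, c_tilde
```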
Broadly speaking, there are two types of uncertainty affecting an estimate, $w_t^\top x$, of the utility of $x$: variance and bias. In our setting, variance is due to the stochasticity of user feedback $\hat{y}_t$. Bias, on the other hand, is due to regularization when estimating $\tilde{w}_t$ and $w_t$. Intuitively, as our algorithm receives more feedback, it becomes less uncertain (w.r.t. both bias and variance) of its estimates, $\tilde{w}_t$ and $w_t$. This notion of uncertainty is captured via the inverse feature covariance matrices $\tilde{M}_t$ and $M_t$ (Lines 6 & 8 in Algorithm 1). Table 1 provides an interpretation of the four sources of uncertainty described in (7) and (8).
Lemma 1 below describes how to set the coefficients such that $c_t(x) + \tilde{c}_t(x)$ is a valid $1-\delta$ confidence bound.

Lemma 1. Define $\tilde{S} = \|\tilde{w}^*\|$ and $S = \|w^*_\perp\|$, and let

$$\alpha^{(v)}_t = \sqrt{ \log\left( \det(M_t)^{1/2} \det(\lambda I_D)^{-1/2} / \delta \right) }$$

$$\tilde{\alpha}^{(v)}_t = \lambda \sqrt{ \log\left( \det(\tilde{M}_t)^{1/2} \det(\tilde{\lambda} I_K)^{-1/2} / \delta \right) }$$

$$\alpha^{(b)}_t = \sqrt{2\lambda}\, S$$

$$\tilde{\alpha}^{(b)}_t = \lambda \sqrt{\tilde{\lambda}}\, \tilde{S}.$$

Then (6) is a valid $1-\delta$ confidence interval.
With the confidence intervals defined, we are now ready to present our main result on the regret bound.

Theorem 1. Define $\tilde{c}_t(\cdot)$ and $c_t(\cdot)$ as in (7), (8) and Lemma 1. For $\lambda \ge \max_x \|x\|^2$ and $\tilde{\lambda} \ge \max_x \|\tilde{x}\|^2$, with probability $1-\delta$, CoFineUCB achieves regret

$$R_T(w^*) \le \left( \beta_T \sqrt{D} + \tilde{\beta}_T \sqrt{K} \right) \sqrt{2 T \log(1+T)},$$

where

$$\beta_T = \sqrt{D \log\left( (1 + T/\lambda)/\delta \right)} + \sqrt{2\lambda}\, S \qquad (9)$$

$$\tilde{\beta}_T = \sqrt{K \log\left( (1 + T/\tilde{\lambda})/\delta \right)} + \sqrt{\tilde{\lambda}}\, \tilde{S}. \qquad (10)$$

Lemma 1 and Theorem 1 are proved in Appendix A.

Theorem 1 essentially bounds the regret as

$$R_T(w^*) = O\left( \left( \sqrt{\tilde{\lambda}}\, \|\tilde{w}^*\|\, K + \sqrt{2\lambda}\, \|w^*_\perp\|\, D \right) \sqrt{T} \right), \qquad (11)$$
Term | Interpretation
$\alpha^{(v)}_t \|x\|_{M_t^{-1}}$ | feedback variance in full space
$\tilde{\alpha}^{(v)}_t \|U^\top M_t^{-1} x\|_{\tilde{M}_t^{-1}}$ | feedback variance in coarse space
$\alpha^{(b)}_t \|M_t^{-1} x\|$ | regularization bias in full space
$\tilde{\alpha}^{(b)}_t \|\tilde{M}_t^{-1} U^\top M_t^{-1} x\|$ | regularization bias in coarse space

Table 1. Interpreting sources of uncertainty in (7), (8).
Figure 2. An example of confidence regions utilized by CoFineUCB and LinUCB. $B$ denotes the ellipsoid confidence region used by LinUCB. CoFineUCB maintains two ellipsoid confidence regions, $\tilde{B}$ and $B_\perp$, for subspace and full space, respectively. The joint confidence region of CoFineUCB is essentially the convolution of $\tilde{B}$ and $B_\perp$, $\tilde{B} \oplus B_\perp$, which can be much smaller than $B$.
ignoring log factors. In contrast, the conventional LinUCB algorithm only explores in the full feature space and achieves an analogous regret bound of

$$R_T(w^*) = O\left( \sqrt{\lambda}\, \|w^*\|\, D \sqrt{T} \right). \qquad (12)$$

Comparing (11) with (12) suggests that, when $K \ll D$ and $\|w^*_\perp\|$ is small, CoFineUCB suffers much less regret due to more efficient exploration. Depending on $U$, $\|\tilde{w}^*\|$ can also be much smaller than $\|w^*\|$. Section 5 describes an approach for computing such a $U$.

Intuitively, CoFineUCB enjoys a superior regret bound to LinUCB due to its use of tighter confidence regions. Figure 2 depicts a comparative example. LinUCB employs ellipsoid confidence regions. CoFineUCB utilizes confidence regions that are essentially the convolution of two smaller ellipsoids, which can be much smaller than the confidence regions of LinUCB.
5. Constructing Feature Hierarchies

We now show how to construct a subspace $U$ using pre-existing user profiles $W = \{w_i\}_{i=1}^{N}$, where each profile is sampled independently from a common distribution $w_i \sim \mathcal{W}$. In this setting, a reasonable objective is to find a $U$ that minimizes an empirical estimate of the bound on $R_T(\mathcal{W})$, which comprises $\|\tilde{w}\|$ and $\|w_\perp\|$.

Our approach is outlined in Algorithm 2. We assume that finding a $K$-dimensional subspace with low residual norms $\|w_\perp\|$ is straightforward. In our experiments, we simply use the top $K$ singular vectors of $W$.
Algorithm 2 LearnU: learning projection matrix
1: input: $W \in \mathbb{R}^{D \times N}$, $K \in \{1, \ldots, D\}$
2: $(A, \Sigma, B) \leftarrow \mathrm{SVD}(W)$
3: $U_0 \leftarrow A_{1:K}$   // top K singular vectors
4: Solve for $\Omega$ via (16) using $U_0$ and $W$
5: return: $U_0\, \Omega^{1/2}$
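A minimal NumPy sketch of LearnU (ours, not the authors' release), assuming the closed form for the matrix solved on Line 4 is the trace-normalized matrix square root of $\tilde{W}_0 \tilde{W}_0^\top$ as in (16):

```python
import numpy as np

def learn_u(W, K):
    """Sketch of Algorithm 2 (LearnU). W is D x N, one profile per column.

    Returns U = U0 @ Omega^{1/2}, with U0 the top-K left singular vectors
    of W and Omega the closed-form minimizer given by Eq. (16)."""
    A, _, _ = np.linalg.svd(W, full_matrices=False)
    U0 = A[:, :K]                        # Line 3: top-K singular vectors
    W0 = U0.T @ W                        # projected profiles
    # Symmetric PSD square root of W0 @ W0^T via eigendecomposition
    evals, evecs = np.linalg.eigh(W0 @ W0.T)
    root = evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.T
    Omega = K * root / np.trace(root)    # Eq. (16), so trace(Omega) = K
    # Omega^{1/2}, again via eigendecomposition
    w2, v2 = np.linalg.eigh(Omega)
    return U0 @ (v2 @ np.diag(np.sqrt(np.clip(w2, 0, None))) @ v2.T)
```

Because $U_0$ has orthonormal columns, the returned $U = U_0 \Omega^{1/2}$ satisfies $\|U\|^2_{\mathrm{Fro}} = \operatorname{trace}(\Omega) = K$, matching the constraint in (14).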
Given an orthonormal basis $U_0 \in \mathbb{R}^{D \times K}$, one can choose $U \in \operatorname{span}(U_0)$ to minimize its total contribution to the regret bound in (11) over the users in $W$:

$$\operatorname*{argmin}_{U \in \operatorname{span}(U_0)}\ \tilde{C} \sum_{w \in W} \|\tilde{w}\|, \qquad (13)$$

where $\tilde{w} \equiv (U^\top U)^{-1} U^\top w$, and $\tilde{C} = \max_x \|U^\top x\|$ constrains the magnitude of $U$.
It is difficult to optimize (13) directly, so we approximate it using a smooth formulation,[4]

$$\operatorname*{argmin}_{U \in \operatorname{span}(U_0):\, \|U\|^2_{\mathrm{Fro}} = K}\ \sum_{w \in W} \|\tilde{w}\|^2, \qquad (14)$$

where we now constrain $U$ via $\|U\|^2_{\mathrm{Fro}} = K$.
We further restrict $U$ to be $U \equiv U_0 \Omega^{1/2}$ for $\Omega \succeq 0$. Under this restriction, (14) is equivalent to

$$\operatorname*{argmin}_{\Omega:\, \operatorname{trace}(\Omega) = K}\ \sum_{w \in W} \tilde{w}_0^\top \Omega^{-1} \tilde{w}_0, \qquad (15)$$
where $\tilde{w}_0 \equiv (U_0^\top U_0)^{-1} U_0^\top w = U_0^\top w$. This formulation is akin to multi-task structure learning, where $\tilde{W}_0$ would denote the various tasks and $\Omega$ denotes feature relationships common across tasks (Argyriou et al., 2007; Zhang & Yeung, 2010). One can show that (15) is convex and is minimized by

$$\Omega = \frac{K}{\operatorname{trace}\left( (\tilde{W}_0 \tilde{W}_0^\top)^{1/2} \right)} \left( \tilde{W}_0 \tilde{W}_0^\top \right)^{1/2}, \qquad (16)$$

where $\tilde{W}_0 \equiv (U_0^\top U_0)^{-1} U_0^\top W = U_0^\top W$. See Appendix B for a more detailed derivation.
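As a numerical sanity check (our own, on synthetic projected profiles), an Ω taken as the trace-normalized matrix square root of $\tilde{W}_0 \tilde{W}_0^\top$, as in (16), attains a lower value of the objective (15) than random trace-$K$ positive semidefinite alternatives:

```python
import numpy as np

rng = np.random.default_rng(3)
K, N = 3, 12
W0 = rng.normal(size=(K, N))            # stands in for U0^T W

def objective(Omega):
    # sum over columns w of w^T Omega^{-1} w, i.e. the objective in Eq. (15)
    return np.trace(W0.T @ np.linalg.solve(Omega, W0))

# Closed form of Eq. (16): normalized PSD square root of W0 @ W0^T
evals, evecs = np.linalg.eigh(W0 @ W0.T)
root = evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.T
Omega_star = K * root / np.trace(root)

best = objective(Omega_star)
for _ in range(50):
    B = rng.normal(size=(K, K))
    C = B @ B.T + 1e-6 * np.eye(K)      # random PSD candidate
    C *= K / np.trace(C)                # rescale to trace K
    assert best <= objective(C) + 1e-9  # Eq. (16) is never beaten
```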
6. Experiments

We evaluate CoFineUCB via both simulations and a live user study in the personalized news recommendation domain. We first describe alternative methods, or baselines, for leveraging prior knowledge (pre-existing profiles $W \in \mathbb{R}^{D \times N}$) that do not use a feature hierarchy. These baselines can conceptually be phrased as special cases of CoFineUCB. The key idea is to alter

[4] One can also regularize by inserting an axis-aligned "ridge" into $W$ (i.e., $W \leftarrow [W, I_D]$).
the feature space such that $\|w^*\|$ in the new space is small. Thus, running LinUCB in the altered feature space yields an improved bound on the regret (12), which is linear in $\|w^*\|$.
6.1. Baseline Approaches

Mean-Regularized. One simple approach is to regularize to $\bar{w}$ (e.g., the mean of $W$) when estimating $w_t$ in LinUCB. The estimation problem can be written as

$$w_t = \operatorname*{argmin}_{w} \sum_{\tau=1}^{t-1} \left( w^\top x_\tau - \hat{y}_\tau \right)^2 + \lambda \|w - \bar{w}\|^2. \qquad (17)$$

Typically, $\|w^* - \bar{w}\| < \|w^*\|$, implying lower regret.
Reshape. Another approach is to use LinUCB with a feature space "reshaped" via a transform $U_D \in \mathbb{R}^{D \times D}$:

$$w_t = \operatorname*{argmin}_{w} \sum_{\tau=1}^{t-1} \left( w^\top U_D^\top x_\tau - \hat{y}_\tau \right)^2 + \lambda \|w\|^2. \qquad (18)$$

As in the mean-regularization approach above, here we would like the representation of $w^*$ in the reshaped space to have a small norm. In our experiments, we use $U_D = \mathrm{LearnU}(W, D)$ (Algorithm 2).
We can incorporate such reshaping into CoFineUCB. We first project $W$ into the space defined by $U_D$, denoted by $\hat{W}$,[5] then compute $U$ via $\mathrm{LearnU}(\hat{W}, K)$. During model estimation, we replace (5) with

$$w_t = \operatorname*{argmin}_{w} \sum_{\tau=1}^{t-1} \left( w^\top U_D^\top x_\tau - \hat{y}_\tau \right)^2 + \lambda \|w - U \tilde{w}_t\|^2.$$

Incorporating reshaping into CoFineUCB can lead to a decrease in $S = \|\hat{w}_\perp\|$. We found the modification to be quite effective in practice; all our experiments in the following sections employ this variant of CoFineUCB.
SubspaceUCB. Finally, we can simply ignore the full space and only apply LinUCB in the subspace. While this method seems to perform well given a good subspace (as seen in Li et al. (2010); Chapelle & Li (2011); Yue & Guestrin (2011), among others), it can yield linear regret if the residual of the user's preferences is strong, as we will see in the experiments.
6.2. Experimental Setting

We employ the submodular bandit extension of linear stochastic bandits (Yue & Guestrin, 2011) to model the news recommendation setting. Here, the algorithm must choose a set of $L$ actions and receives rewards based on both the quality as well as the diversity of the actions chosen ($L = 1$ is the conventional bandit setting). Using this structured action space leads to a more realistic setting for content recommendation, since recommender systems often must recommend multiple items at a time. It is straightforward to extend CoFineUCB to the submodular bandit setting (see Appendix C).

[5] $\hat{W} \equiv (U_D^\top U_D)^{-1} U_D^\top W$.
6.3. Simulations

We performed simulation evaluations using data collected from a previous user study in personalized news recommendation by Yue & Guestrin (2011). The data includes featurized articles ($D = 100$) and $N = 77$ user profiles. We employed leave-one-out validation: for each user, the transformations $U_D$ and $U$ ($K = 5$) were trained using the remaining users' profiles. For each user, we ran 25 simulations ($T = 10000$). All algorithms used the same $U$ and $U_D$ projections, where applicable. We also compared with a variant of CoFineUCB, CoFineUCB-focus, which scales down exploration in the full space $c_t$ by a factor of 0.25.
Figure 3(a) shows the cumulative regret of each algorithm averaged over all users when recommending one article per iteration ($L = 1$). All algorithms dramatically outperform Naive LinUCB, with the exception of Mean-Regularized, which performs almost identically. While Reshape shows good eventual convergence behavior, it incurs higher initial regret than the CoFineUCB algorithms and SubspaceUCB. The trends also hold when recommending multiple articles per iteration ($L = 5$), as seen in Figure 3(b).
The performance of the two variants of CoFineUCB and SubspaceUCB demonstrates the benefit of exploring in the subspace. However, Figure 3(c) reveals the critical shortfall of SubspaceUCB by comparing average cumulative regret for the ten users with the largest residual $\|w_\perp\|$. For these atypical users, the subspace is not sufficient to adequately learn their preferences, resulting in linear regret for SubspaceUCB.
Figure 3(d) shows the behavior of CoFineUCB as we vary $K$. Larger subspaces require more exploration, which in general leads to increased regret.

Figure 3(e) shows the behavior of CoFineUCB as we vary the scaling of exploration in the full space $c_t$ (CoFineUCB-focus is the special case where the scaling factor is 0.25). More conservative exploration in the full space tends to reduce regret. However, no exploration of the full space can lead to higher regret.
Synthetic Dataset. We used a 25-dimensional synthetic dataset to study the effect of mismatch between
Figure 3. (a)-(e) Cumulative regret results for news recommendation simulation: (a) all users ($L = 1$); (b) all users ($L = 5$); (c) atypical users ($L = 5$); (d) CoFineUCB over varying $K$; (e) CoFineUCB over varying $c_t$. (f) Comparison over preference vectors with varying projection residuals using synthetic simulation.
$w^*$ and $U$. This dataset allows for a more systematic analysis by forcing every $x$ and $w^*$ to have unit norm. For residual magnitude $\beta \in [0, 1]$, we sampled $w^*$ uniformly in a 5-dimensional subspace with magnitude $\sqrt{1 - \beta^2}$, and uniformly in the remaining dimensions with magnitude $\beta$. Figure 3(f) shows that the regret of both SubspaceUCB and CoFineUCB-focus increases with the residual, with SubspaceUCB exhibiting a more dramatic increase, beyond that of even Naive LinUCB.
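A sketch of this sampling scheme (the paper does not spell out the exact sampler, so the uniform-direction construction below is our assumption):

```python
import numpy as np

def sample_w(beta, D=25, k=5, rng=None):
    """Sample a unit-norm preference vector with residual magnitude beta:
    mass sqrt(1 - beta^2) in a uniform direction over the first k
    coordinates (the subspace) and mass beta over the remaining D - k
    coordinates (the residual). Illustrative, not the authors' code."""
    rng = rng or np.random.default_rng()
    head = rng.normal(size=k)
    tail = rng.normal(size=D - k)
    return np.concatenate([
        np.sqrt(1 - beta**2) * head / np.linalg.norm(head),
        beta * tail / np.linalg.norm(tail),
    ])

w = sample_w(0.3)
assert np.isclose(np.linalg.norm(w), 1.0)      # unit norm overall
assert np.isclose(np.linalg.norm(w[5:]), 0.3)  # residual magnitude beta
```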
6.4. User Study

Our user study design follows the study conducted in (Yue & Guestrin, 2011). We presented each user with ten articles per day over ten days from January 21, 2012 to February 8, 2012. Each day comprised approximately ten thousand articles. We represented articles using $D = 100$ features corresponding to topics learned via latent Dirichlet allocation (Blei et al., 2003). For each day, articles shown are selected using an interleaving of two bandit algorithms. The user is instructed to briefly skim each article and mark each article as "interested in reading in detail" or "not interested".

We conducted the user study in two phases. Prior to the first phase, we conducted a preliminary study to collect preferences for constructing $U$ ($K = 5$). In the
Comparison | #Users | Win/Tie/Lose | Gain/Day
CoFineUCB v. Naive | 27 | 24 / 1 / 3 | 0.69
CoFineUCB v. Reshape | 30 | 21 / 3 / 6 | 0.27

Table 2. User study comparing CoFineUCB with two baselines. All results satisfy 95% statistical confidence.
first phase, we compared CoFineUCB with Naive. Afterwards, we took all the user profiles learned so far to estimate a reshaping of the full space $U_D$, and compared against Reshape. Due to the short duration of each session ($T = 10$), we did not expect a meaningful comparison between CoFineUCB and SubspaceUCB, so we omitted it (we expect both methods to perform equally well in early iterations, as seen in the simulation experiments). For each user session, we counted the total number of liked articles recommended by each algorithm. An algorithm wins a session if the user liked more articles recommended by it.
Table 2 shows that over the two stages, about 80% of the users prefer CoFineUCB. We see a smaller gain against Reshape, a stronger baseline. On average, users liked over half an additional article per day from CoFineUCB over Naive, and about a quarter additional per day over Reshape. These results show that CoFineUCB is effective in reducing the amount of exploration required.

Figure 4. Each column of word clouds represents a dimension in the subspace. The bar lengths denote the magnitude in each dimension of preference vectors learned by CoFineUCB (blue) and Naive LinUCB (red). The rightmost column shows the norm of the residual $w_\perp$ of weight vectors learned by CoFineUCB and Naive LinUCB.
Figure 4 shows a representation of four dimensions of $U$ learned from user profiles. Each dimension is a combination of features, i.e., topics from LDA. In the top row, the $i$-th word cloud contains representative words from topics associated with high positive weights in the $i$-th column of $U$, and the bottom row those with high negative weights. Examining Figure 4 can reveal tendencies in the user preferences collected in our study; for example, the third column shows that users interested in Republican politics also tend to follow healthcare debates, but tend to be uninterested in videogaming. Figure 4 also shows a comparison of weights estimated by CoFineUCB and Naive LinUCB for one user. Since Naive LinUCB does not utilize the subspace, the weights it estimates tend to have much higher residual norm, whereas CoFineUCB puts higher weights on the subspace dimensions.
7. Related Work

Optimizing recommender systems via user feedback has become increasingly popular in recent years (El-Arini et al., 2009; Li et al., 2010; 2011; Yue & Guestrin, 2011; Ahmed et al., 2012). Most prior work does not address the issue of exploration and often trains with pre-collected feedback, which may lead to a biased model.

The exploration-exploitation tradeoff inherent in learning from user feedback is naturally modeled as a contextual bandit problem (Langford & Zhang, 2007; Li et al., 2010; Slivkins, 2011; Chapelle & Li, 2011; Krause & Ong, 2011). In contrast to most prior work, we focus on principled approaches for encoding prior knowledge for accelerated bandit learning.
Our work builds upon a long line of research on linear stochastic bandits (Dani et al., 2008; Rusmevichientong & Tsitsiklis, 2010; Abbasi-Yadkori et al., 2011). Although often practical, one limitation is the assumption of realizability. In other words, we assume that the true model of user behavior lies within our class.
The use of hierarchies in bandit learning is not new. For instance, the work of Pandey et al. (2007b; 2007a) encodes prior knowledge by hierarchically clustering articles into a taxonomy. However, their setting is feature-free, which can make it difficult to generalize to new articles and users. In contrast, our approach makes use of readily available feature-based prior knowledge, such as the learned preferences of existing users.
Another related line of work is that of sparse linear bandits (Abbasi-Yadkori et al., 2012; Carpentier & Munos, 2012). The assumption is that the true $w^*$ is sparse, and one can achieve regret bounds that depend on the sparsity of $w^*$. In contrast, we consider settings where user profiles are not necessarily sparse, but can be well-approximated by a low-rank subspace.
It may be possible to integrate our feature hierarchy approach with other bandit learning algorithms, such as Thompson Sampling (Chapelle & Li, 2011). Thompson Sampling is a probability matching algorithm that samples $w_t$ from the posterior distribution. Using feature hierarchies, one can define a hierarchical sampling approach that first samples $\tilde{w}_t$ in the subspace, and then samples $w_t$ around $\tilde{w}_t$ in the full space.
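One hypothetical instantiation of this idea (ours; the paper only outlines it) samples the coarse vector from a subspace posterior and then samples the full-space vector around its lift; the posterior covariances passed in are placeholders for whatever the modeler maintains:

```python
import numpy as np

def hierarchical_thompson_sample(w_tilde_hat, cov_tilde, U, sigma2, rng):
    """Hypothetical hierarchical Thompson sampling step: draw a coarse
    sample w_tilde from a Gaussian subspace posterior, then draw the
    full-space sample around its lift U @ w_tilde with isotropic
    covariance sigma2 * I. A sketch, not a construction from the paper."""
    D = U.shape[0]
    w_tilde = rng.multivariate_normal(w_tilde_hat, cov_tilde)
    return rng.multivariate_normal(U @ w_tilde, sigma2 * np.eye(D))
```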
Our approach can be applied to many structured classes of bandit problems (e.g., Streeter & Golovin (2008); Cesa-Bianchi & Lugosi (2009)), assuming that