Hierarchical Exploration for Accelerating Contextual Bandits
Yisong Yue
yisongyue@cmu.edu
iLab, H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Sue Ann Hong
sahong@cs.cmu.edu
Carlos Guestrin
guestrin@cs.cmu.edu
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Abstract

Contextual bandit learning is an increasingly popular approach to optimizing recommender systems via user feedback, but can be slow to converge in practice due to the need for exploring a large feature space. In this paper, we propose a coarse-to-fine hierarchical approach for encoding prior knowledge that drastically reduces the amount of exploration required. Intuitively, user preferences can be reasonably embedded in a coarse low-dimensional feature space that can be explored efficiently, requiring exploration in the high-dimensional space only as necessary. We introduce a bandit algorithm that explores within this coarse-to-fine spectrum, and prove performance guarantees that depend on how well the coarse space captures the user's preferences. We demonstrate substantial improvement over conventional bandit algorithms through extensive simulation as well as a live user study in the setting of personalized news recommendation.
1. Introduction

User feedback (e.g., ratings and clicks) has become a crucial source of training data for optimizing recommender systems. When making recommendations, one must balance the needs for exploration (gathering informative feedback) and exploitation (maximizing estimated user utility). A common formalization of such a problem is the linear stochastic bandit problem (Li et al., 2010), which models user utility as a linear function of user and content features.
Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).
Unfortunately, conventional bandit algorithms can converge slowly with even moderately large feature spaces. For instance, the well-studied LinUCB algorithm (Dani et al., 2008; Abbasi-Yadkori et al., 2011) achieves a regret bound that is linear in the dimensionality of the feature space, which cannot be improved without further assumptions.[1]
Intuitively, any bandit algorithm must make recommendations that cover the entire feature space in order to guarantee learning a reliable user model. Therefore, a common approach to dealing with slow convergence is dimensionality reduction based on prior knowledge, such as previously learned user profiles, by representing new users as linear combinations of "stereotypical users" (Li et al., 2010; Yue & Guestrin, 2011).
However, if a user deviates from stereotypical users, then a reduced space may not be expressive enough to adequately learn her preferences. The challenge lies in appropriately leveraging prior knowledge to reduce the cost of exploration for new users, while maintaining the representational power of the full feature space.
Our solution is a coarse-to-fine hierarchical approach for encoding prior knowledge. Intuitively, a coarse, low-rank subspace of the full feature space may be sufficient to accurately learn a stereotypical user's preferences. At the same time, this coarse-to-fine feature hierarchy allows exploration in the full space when a user is not perfectly modeled by the coarse space.
We propose an algorithm, CoFineUCB, that automatically balances exploration within the coarse-to-fine feature hierarchy. We prove regret bounds that depend on how well the user's preferences project onto the coarse subspace. We also present a simple and general method for constructing feature hierarchies using prior knowledge. We perform empirical validation through simulation as well as a live user study in personalized news recommendation, demonstrating that CoFineUCB can substantially outperform conventional methods utilizing only a single feature space.

[1] The regret bound is information-theoretically optimal up to log factors (Dani et al., 2008).
2. The Learning Problem

We study the linear stochastic bandit problem (Abbasi-Yadkori et al., 2011), which formalizes a recommendation system as a bandit algorithm that iteratively performs actions and learns from rewards received per action. At each iteration $t = 1, \ldots, T$, our algorithm interacts with the user as follows:

- The system recommends an item (i.e., performs an action) associated with feature vector $x_t \in X_t \subset \mathbb{R}^D$, which encodes content and user features.
- The user provides feedback (i.e., reward) $\hat{y}_t$.
Rewards $\hat{y}_t$ are modeled as a linear function of actions $x \in \mathbb{R}^D$ such that $\mathbb{E}[\hat{y}_t \mid x] = w^{*\top} x$, where the weight vector $w^*$ denotes the user's (unknown) preferences. We assume feedback to be independently sampled and bounded within $[0, 1]$,[2] and that $\|x\| \le 1$ holds for all $x$. We quantify performance using the notion of regret, which compares the expected rewards of the selected actions versus the optimal expected rewards:

$$R_T(w^*) = \sum_{t=1}^{T} w^{*\top} x^*_t - w^{*\top} x_t, \qquad (1)$$

where $x^*_t = \operatorname*{argmax}_{x \in X_t} w^{*\top} x$.[3]
We further suppose that user preferences are distributed according to some distribution $\mathcal{W}$. We can then define the expected regret over $\mathcal{W}$ as

$$R_T(\mathcal{W}) = \mathbb{E}_{w^* \sim \mathcal{W}}\left[ R_T(w^*) \right], \qquad (2)$$

and the goal now for the bandit algorithm is to perform well with respect to $\mathcal{W}$. We will present an approach for optimizing (2) given a collection of existing user profiles sampled i.i.d. from $\mathcal{W}$.
3. Feature Hierarchies

To learn a reliable user model (i.e., a reliable estimate of $w^*$) from user feedback, bandit algorithms must make recommendations that explore the entire $D$-dimensional feature space. Conventional bandit algorithms such as LinUCB place uniform a priori importance on each dimension, which can be inefficient
[2] Our results also hold when each $\hat{y}_t$ is independent with sub-Gaussian noise and mean $w^{*\top} x_t$ (see Appendix A).
[3] Since the rewards are sampled independently, any guarantee on (1) translates into a high-probability guarantee on the regret of the observed feedback $\sum_{t=1}^{T} w^{*\top} x^*_t - \hat{y}_t$.
Figure 1. A visualization of a feature hierarchy, where $w^*$ denotes the user profile, and $\tilde{w}^*$ the projected user profile.
in practice, especially if additional structure can be assumed. We now motivate and formalize one such structure: the feature hierarchy.
For example, suppose that two of the $D$ features correspond to interest in articles about baseball and cricket. Suppose also that our prior knowledge suggests that users are typically interested in one or the other, but rarely both. Then we can design a feature subspace where baseball and cricket topics project along opposite directions in a single dimension. A bandit algorithm leveraging this structure should, ideally, first explore at a coarse level to determine whether the user is more interested in articles about baseball or cricket.
We can formalize the different levels of exploration as a hierarchy that is composed of the full feature space and a subspace. We define a $K$-dimensional subspace using a matrix $U \in \mathbb{R}^{D \times K}$, and denote the projection of action $x \in \mathbb{R}^D$ into the subspace as $\tilde{x} \equiv U^\top x$. Likewise, we can write the user's preferences $w^*$ as

$$w^* = U \tilde{w}^* + w^*_\perp, \qquad (3)$$
where we call $w^*_\perp$ the residual, or orthogonal component, of $w^*$ w.r.t. $U$. Then,

$$w^{*\top} x = \tilde{w}^{*\top} \tilde{x} + w^{*\top}_\perp x.$$

Figure 1 illustrates a feature hierarchy with a two-dimensional subspace. Here, $w^*$ projects well to the subspace, so we expect $w^{*\top} x \approx \tilde{w}^{*\top} \tilde{x}$ (i.e., $\|w^*_\perp\|$ is small). In such cases, a bandit algorithm can focus exploration on the subspace to achieve faster convergence.
3.1. Extension to Deeper Hierarchies

For the $\ell$-th level, we define the projected $w^*_\ell$ via

$$w^*_{\ell-1} = U_\ell\, w^*_\ell + w^*_{\perp,\ell-1},$$

with $w^*_0 \equiv w^*$ and $w^*_{\perp,0} \equiv w^*_\perp$. Then,

$$w^* = U_1\left( U_2\left( \cdots \left( U_L\, w^*_L + w^*_{\perp,L-1} \right) \cdots \right) + w^*_{\perp,1} \right) + w^*_\perp.$$
For simplicity and practical relevance, we focus on two-level hierarchies.

Algorithm 1 CoFineUCB
1: input: $\lambda$, $\tilde{\lambda}$, $U$, $c_t(\cdot)$, $\tilde{c}_t(\cdot)$
2: for $t = 1, \ldots, T$ do
3:   Define $X_t \equiv [x_1, x_2, \ldots, x_{t-1}]$
4:   Define $\tilde{X}_t \equiv U^\top X_t$
5:   Define $Y_t \equiv [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_{t-1}]$
6:   $\tilde{M}_t \leftarrow \tilde{\lambda} I_K + \tilde{X}_t \tilde{X}_t^\top$
7:   $\tilde{w}_t \leftarrow \tilde{M}_t^{-1} \tilde{X}_t Y_t^\top$   // least squares on coarse level
8:   $M_t \leftarrow \lambda I_D + X_t X_t^\top$
9:   $w_t \leftarrow M_t^{-1} \left( X_t Y_t^\top + \lambda U \tilde{w}_t \right)$   // least squares on fine level
10:  Define $\mu_t(x) \equiv w_t^\top x$
11:  $x_t \leftarrow \operatorname*{argmax}_{x \in X_t} \mu_t(x) + c_t(x) + \tilde{c}_t(x)$   // play action with highest upper confidence bound
12:  Recommend $x_t$, observe reward $\hat{y}_t$
13: end for
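As a concrete illustration (our own sketch, not the authors' code), the estimation steps of Algorithm 1 reduce to two regularized least-squares solves in NumPy; the data layout (past actions as columns of X) follows the algorithm's notation:

```python
import numpy as np

def cofine_estimates(X, y, U, lam, lam_tilde):
    """Estimation steps of Algorithm 1 (Lines 3-9).

    X: (D, t-1) past actions as columns; y: (t-1,) observed rewards;
    U: (D, K) projection; lam, lam_tilde: regularization parameters.
    Returns the coarse estimate w_tilde (Line 7) and fine estimate w (Line 9).
    """
    D, K = U.shape
    X_tilde = U.T @ X                                      # Line 4: project actions
    M_tilde = lam_tilde * np.eye(K) + X_tilde @ X_tilde.T  # Line 6
    w_tilde = np.linalg.solve(M_tilde, X_tilde @ y)        # Line 7
    M = lam * np.eye(D) + X @ X.T                          # Line 8
    w = np.linalg.solve(M, X @ y + lam * (U @ w_tilde))    # Line 9
    return w_tilde, w
```

Action selection (Line 11) then maximizes $\mu_t(x) + c_t(x) + \tilde{c}_t(x)$ over the available actions, with confidence widths as defined in Section 4.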
4. Algorithm & Main Results

We now present a bandit algorithm that exploits feature hierarchies. Our algorithm, CoFineUCB, is an upper confidence bound algorithm that generalizes the well-studied LinUCB algorithm, and automatically trades off between exploring the coarse and full feature spaces. CoFineUCB is described in Algorithm 1. At each iteration $t$, CoFineUCB estimates the user's preferences in the subspace, $\tilde{w}_t$, as well as in the full feature space, $w_t$. Both estimates are solved via regularized least-squares regression. First, $\tilde{w}_t$ is estimated via

$$\tilde{w}_t = \operatorname*{argmin}_{\tilde{w}} \sum_{\tau=1}^{t-1} \left( \tilde{w}^\top \tilde{x}_\tau - \hat{y}_\tau \right)^2 + \tilde{\lambda} \|\tilde{w}\|^2, \qquad (4)$$

where $\tilde{x}_\tau \equiv U^\top x_\tau$ denotes the projected features of the action taken at time $\tau$. Then $w_t$ is estimated via

$$w_t = \operatorname*{argmin}_{w} \sum_{\tau=1}^{t-1} \left( w^\top x_\tau - \hat{y}_\tau \right)^2 + \lambda \|w - U \tilde{w}_t\|^2, \qquad (5)$$

which regularizes $w_t$ to the projection of $\tilde{w}_t$ back into the full space. Both optimization problems have closed-form solutions (Lines 7 & 9 in Algorithm 1).
CoFineUCB is an optimistic algorithm that chooses the action with the largest potential reward (given some target confidence). Selecting such an action requires computing confidence intervals around the mean estimate $w_t$. We maintain confidence intervals for both the full space and the subspace, denoted $c_t(\cdot)$ and $\tilde{c}_t(\cdot)$, respectively. Intuitively, a valid $1-\delta$ confidence interval should satisfy the property that

$$\left| x^\top (w_t - w^*) \right| \le c_t(x) + \tilde{c}_t(x) \qquad (6)$$

holds with probability at least $1-\delta$.
We will show that the following definitions of $c_t(\cdot)$ and $\tilde{c}_t(\cdot)$ yield a valid $1-\delta$ confidence interval:

$$\tilde{c}_t(x) = \tilde{\alpha}^{(v)}_t \left\| U^\top M_t^{-1} x \right\|_{\tilde{M}_t^{-1}} + \tilde{\alpha}^{(b)}_t \left\| \tilde{M}_t^{-1} U^\top M_t^{-1} x \right\| \qquad (7)$$

$$c_t(x) = \alpha^{(v)}_t\, \|x\|_{M_t^{-1}} + \alpha^{(b)}_t \left\| M_t^{-1} x \right\|, \qquad (8)$$

where $\tilde{\alpha}^{(v)}_t$, $\tilde{\alpha}^{(b)}_t$, $\alpha^{(v)}_t$, and $\alpha^{(b)}_t$ are coefficients that must be set properly (Lemma 1).
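In code, the two widths can be computed directly from (7) and (8); this is an illustrative sketch (ours) in which the four alpha coefficients are treated as given constants rather than set via Lemma 1:

```python
import numpy as np

def conf_widths(x, U, M, M_tilde, a_v, a_b, at_v, at_b):
    """Confidence widths c_t(x) and c_tilde_t(x) per Eqs. (7)-(8).

    The alpha coefficients (a_v, a_b for the full space, at_v, at_b for
    the coarse space) are assumed to be set externally, e.g. via Lemma 1.
    """
    Minv_x = np.linalg.solve(M, x)
    # Eq. (8): feedback-variance term ||x||_{M^{-1}} plus bias term ||M^{-1} x||
    c = a_v * np.sqrt(x @ Minv_x) + a_b * np.linalg.norm(Minv_x)
    z = U.T @ Minv_x  # coarse-space image of M^{-1} x
    # Eq. (7): ||U^T M^{-1} x||_{M_tilde^{-1}} plus the coarse bias term
    c_tilde = (at_v * np.sqrt(z @ np.linalg.solve(M_tilde, z))
               + at_b * np.linalg.norm(np.linalg.solve(M_tilde, z)))
    return c, c_tilde
```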
Broadly speaking, there are two types of uncertainty affecting an estimate, $w_t^\top x$, of the utility of $x$: variance and bias. In our setting, variance is due to the stochasticity of user feedback $\hat{y}_t$. Bias, on the other hand, is due to regularization when estimating $\tilde{w}_t$ and $w_t$. Intuitively, as our algorithm receives more feedback, it becomes less uncertain (w.r.t. both bias and variance) of its estimates, $\tilde{w}_t$ and $w_t$. This notion of uncertainty is captured via the inverse feature covariance matrices $\tilde{M}_t$ and $M_t$ (Lines 6 & 8 in Algorithm 1). Table 1 provides an interpretation of the four sources of uncertainty described in (7) and (8).
Lemma 1 below describes how to set the coefficients such that $c_t(x) + \tilde{c}_t(x)$ is a valid $1-\delta$ confidence bound.

Lemma 1. Define $\tilde{S} = \|\tilde{w}^*\|$ and $S = \|w^*_\perp\|$, and let

$$\alpha^{(v)}_t = \sqrt{ \log\left( \det(M_t)^{1/2} \det(\lambda I_D)^{-1/2} / \delta \right) }$$

$$\tilde{\alpha}^{(v)}_t = \lambda \sqrt{ \log\left( \det(\tilde{M}_t)^{1/2} \det(\tilde{\lambda} I_K)^{-1/2} / \delta \right) }$$

$$\alpha^{(b)}_t = \sqrt{2\lambda}\, S$$

$$\tilde{\alpha}^{(b)}_t = \lambda \sqrt{\tilde{\lambda}}\, \tilde{S}.$$

Then (6) is a valid $1-\delta$ confidence interval.
With the confidence intervals defined, we are now ready to present our main result on the regret bound.

Theorem 1. Define $\tilde{c}_t(\cdot)$ and $c_t(\cdot)$ as in (7), (8) and Lemma 1. For $\lambda \ge \max_x \|x\|^2$ and $\tilde{\lambda} \ge \max_x \|\tilde{x}\|^2$, with probability $1-\delta$, CoFineUCB achieves regret

$$R_T(w^*) \le \left( \beta_T \sqrt{D} + \tilde{\beta}_T \sqrt{K} \right) \sqrt{2 T \log(1+T)},$$

where

$$\beta_T = \sqrt{D \log\left( (1 + T/\lambda)/\delta \right)} + \sqrt{2\lambda}\, S \qquad (9)$$

$$\tilde{\beta}_T = \sqrt{K \log\left( (1 + T/\tilde{\lambda})/\delta \right)} + \sqrt{\tilde{\lambda}}\, \tilde{S}. \qquad (10)$$

Lemma 1 and Theorem 1 are proved in Appendix A.

Theorem 1 essentially bounds the regret as

$$R_T(w^*) = O\left( \left( \sqrt{\tilde{\lambda}}\, \|\tilde{w}^*\|\, K + \sqrt{2\lambda}\, \|w^*_\perp\|\, D \right) \sqrt{T} \right), \qquad (11)$$
Term | Interpretation
$\alpha^{(v)}_t \|x\|_{M_t^{-1}}$ | feedback variance in full space
$\tilde{\alpha}^{(v)}_t \|U^\top M_t^{-1} x\|_{\tilde{M}_t^{-1}}$ | feedback variance in coarse space
$\alpha^{(b)}_t \|M_t^{-1} x\|$ | regularization bias in full space
$\tilde{\alpha}^{(b)}_t \|\tilde{M}_t^{-1} U^\top M_t^{-1} x\|$ | regularization bias in coarse space

Table 1. Interpreting sources of uncertainty in (7), (8).
Figure 2. An example of confidence regions utilized by CoFineUCB and LinUCB. $B$ denotes the ellipsoid confidence region used by LinUCB. CoFineUCB maintains two ellipsoid confidence regions, $\tilde{B}$ and $B_\perp$, for subspace and full space, respectively. The joint confidence region of CoFineUCB is essentially the convolution of $\tilde{B}$ and $B_\perp$, $\tilde{B} \oplus B_\perp$, which can be much smaller than $B$.
ignoring log factors. In contrast, the conventional LinUCB algorithm only explores in the full feature space and achieves an analogous regret bound of

$$R_T(w^*) = O\left( \sqrt{\lambda}\, \|w^*\|\, D \sqrt{T} \right). \qquad (12)$$

Comparing (11) with (12) suggests that, when $K \ll D$ and $\|w^*_\perp\|$ is small, CoFineUCB suffers much less regret due to more efficient exploration. Depending on $U$, $\|\tilde{w}^*\|$ can also be much smaller than $\|w^*\|$. Section 5 describes an approach for computing such a $U$.

Intuitively, CoFineUCB enjoys a superior regret bound to LinUCB due to its use of tighter confidence regions. Figure 2 depicts a comparative example. LinUCB employs ellipsoid confidence regions. CoFineUCB utilizes confidence regions that are essentially the convolution of two smaller ellipsoids, which can be much smaller than the confidence regions of LinUCB.
5. Constructing Feature Hierarchies

We now show how to construct a subspace $U$ using pre-existing user profiles $W = \{w_i\}_{i=1}^{N}$, where each profile is sampled independently from a common distribution $w_i \sim \mathcal{W}$. In this setting, a reasonable objective is to find a $U$ that minimizes an empirical estimate of the bound on $R_T(\mathcal{W})$, which comprises $\|\tilde{w}\|$ and $\|w_\perp\|$.

Our approach is outlined in Algorithm 2. We assume that finding a $K$-dimensional subspace with low residual norms $\|w_\perp\|$ is straightforward. In our experiments, we simply use the top $K$ singular vectors of $W$.
Algorithm 2 LearnU: learning projection matrix
1: input: $W \in \mathbb{R}^{D \times N}$, $K \in \{1, \ldots, D\}$
2: $(A, \Sigma, B) \leftarrow \mathrm{SVD}(W)$
3: $U_0 \leftarrow A_{1:K}$   // top K singular vectors
4: Solve for $\Omega$ via (16) using $U_0$ and $W$
5: return: $U_0\, \Omega^{1/2}$
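A minimal NumPy sketch of LearnU (ours, not the authors' release), assuming the closed form for the matrix solved on Line 4 is the trace-normalized matrix square root of $\tilde{W}_0 \tilde{W}_0^\top$ as in (16):

```python
import numpy as np

def learn_u(W, K):
    """Sketch of Algorithm 2 (LearnU). W is D x N, one profile per column.

    Returns U = U0 @ Omega^{1/2}, with U0 the top-K left singular vectors
    of W and Omega the closed-form minimizer given by Eq. (16)."""
    A, _, _ = np.linalg.svd(W, full_matrices=False)
    U0 = A[:, :K]                        # Line 3: top-K singular vectors
    W0 = U0.T @ W                        # projected profiles
    # Symmetric PSD square root of W0 @ W0^T via eigendecomposition
    evals, evecs = np.linalg.eigh(W0 @ W0.T)
    root = evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.T
    Omega = K * root / np.trace(root)    # Eq. (16), so trace(Omega) = K
    # Omega^{1/2}, again via eigendecomposition
    w2, v2 = np.linalg.eigh(Omega)
    return U0 @ (v2 @ np.diag(np.sqrt(np.clip(w2, 0, None))) @ v2.T)
```

Because $U_0$ has orthonormal columns, the returned $U = U_0 \Omega^{1/2}$ satisfies $\|U\|^2_{\mathrm{Fro}} = \operatorname{trace}(\Omega) = K$, matching the constraint in (14).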
Given an orthonormal basis $U_0 \in \mathbb{R}^{D \times K}$, one can choose $U \in \operatorname{span}(U_0)$ to minimize its total contribution to the regret bound in (11) over the users in $W$:

$$\operatorname*{argmin}_{U \in \operatorname{span}(U_0)}\ \tilde{C} \sum_{w \in W} \|\tilde{w}\|, \qquad (13)$$

where $\tilde{w} \equiv (U^\top U)^{-1} U^\top w$, and $\tilde{C} = \max_x \|U^\top x\|$ constrains the magnitude of $U$.
It is difficult to optimize (13) directly, so we approximate it using a smooth formulation,[4]

$$\operatorname*{argmin}_{U \in \operatorname{span}(U_0):\, \|U\|^2_{\mathrm{Fro}} = K}\ \sum_{w \in W} \|\tilde{w}\|^2, \qquad (14)$$

where we now constrain $U$ via $\|U\|^2_{\mathrm{Fro}} = K$.
We further restrict $U$ to be $U \equiv U_0 \Omega^{1/2}$ for $\Omega \succeq 0$. Under this restriction, (14) is equivalent to

$$\operatorname*{argmin}_{\Omega:\, \operatorname{trace}(\Omega) = K}\ \sum_{w \in W} \tilde{w}_0^\top \Omega^{-1} \tilde{w}_0, \qquad (15)$$
where $\tilde{w}_0 \equiv (U_0^\top U_0)^{-1} U_0^\top w = U_0^\top w$. This formulation is akin to multi-task structure learning, where $\tilde{W}_0$ would denote the various tasks and $\Omega$ denotes feature relationships common across tasks (Argyriou et al., 2007; Zhang & Yeung, 2010). One can show that (15) is convex and is minimized by

$$\Omega = \frac{K}{\operatorname{trace}\left( (\tilde{W}_0 \tilde{W}_0^\top)^{1/2} \right)} \left( \tilde{W}_0 \tilde{W}_0^\top \right)^{1/2}, \qquad (16)$$

where $\tilde{W}_0 \equiv (U_0^\top U_0)^{-1} U_0^\top W = U_0^\top W$. See Appendix B for a more detailed derivation.
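As a numerical sanity check (our own, on synthetic projected profiles), an Ω taken as the trace-normalized matrix square root of $\tilde{W}_0 \tilde{W}_0^\top$, as in (16), attains a lower value of the objective (15) than random trace-$K$ positive semidefinite alternatives:

```python
import numpy as np

rng = np.random.default_rng(3)
K, N = 3, 12
W0 = rng.normal(size=(K, N))            # stands in for U0^T W

def objective(Omega):
    # sum over columns w of w^T Omega^{-1} w, i.e. the objective in Eq. (15)
    return np.trace(W0.T @ np.linalg.solve(Omega, W0))

# Closed form of Eq. (16): normalized PSD square root of W0 @ W0^T
evals, evecs = np.linalg.eigh(W0 @ W0.T)
root = evecs @ np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.T
Omega_star = K * root / np.trace(root)

best = objective(Omega_star)
for _ in range(50):
    B = rng.normal(size=(K, K))
    C = B @ B.T + 1e-6 * np.eye(K)      # random PSD candidate
    C *= K / np.trace(C)                # rescale to trace K
    assert best <= objective(C) + 1e-9  # Eq. (16) is never beaten
```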
6. Experiments

We evaluate CoFineUCB via both simulations and a live user study in the personalized news recommendation domain. We first describe alternative methods, or baselines, for leveraging prior knowledge (pre-existing profiles $W \in \mathbb{R}^{D \times N}$) that do not use a feature hierarchy. These baselines can conceptually be phrased as special cases of CoFineUCB. The key idea is to alter

[4] One can also regularize by inserting an axis-aligned "ridge" into $W$ (i.e., $W \leftarrow [W, I_D]$).
the feature space such that $\|w^*\|$ in the new space is small. Thus, running LinUCB in the altered feature space yields an improved bound on the regret (12), which is linear in $\|w^*\|$.
6.1. Baseline Approaches

Mean-Regularized. One simple approach is to regularize to $\bar{w}$ (e.g., the mean of $W$) when estimating $w_t$ in LinUCB. The estimation problem can be written as

$$w_t = \operatorname*{argmin}_{w} \sum_{\tau=1}^{t-1} \left( w^\top x_\tau - \hat{y}_\tau \right)^2 + \lambda \|w - \bar{w}\|^2. \qquad (17)$$

Typically, $\|w^* - \bar{w}\| < \|w^*\|$, implying lower regret.
Reshape. Another approach is to use LinUCB with a feature space "reshaped" via a transform $U_D \in \mathbb{R}^{D \times D}$:

$$w_t = \operatorname*{argmin}_{w} \sum_{\tau=1}^{t-1} \left( w^\top U_D^\top x_\tau - \hat{y}_\tau \right)^2 + \lambda \|w\|^2. \qquad (18)$$

As in the mean-regularization approach above, here we would like the representation of $w^*$ in the reshaped space to have a small norm. In our experiments, we use $U_D = \mathrm{LearnU}(W, D)$ (Algorithm 2).
We can incorporate such reshaping into CoFineUCB. We first project $W$ into the space defined by $U_D$, denoted by $\hat{W}$,[5] then compute $U$ via $\mathrm{LearnU}(\hat{W}, K)$. During model estimation, we replace (5) with

$$w_t = \operatorname*{argmin}_{w} \sum_{\tau=1}^{t-1} \left( w^\top U_D^\top x_\tau - \hat{y}_\tau \right)^2 + \lambda \|w - U \tilde{w}_t\|^2.$$

Incorporating reshaping into CoFineUCB can lead to a decrease in $S = \|\hat{w}_\perp\|$. We found the modification to be quite effective in practice; all our experiments in the following sections employ this variant of CoFineUCB.
SubspaceUCB. Finally, we can simply ignore the full space and only apply LinUCB in the subspace. While this method seems to perform well given a good subspace (as seen in Li et al. (2010); Chapelle & Li (2011); Yue & Guestrin (2011), among others), it can yield linear regret if the residual of the user's preferences is strong, as we will see in the experiments.
6.2. Experimental Setting

We employ the submodular bandit extension of linear stochastic bandits (Yue & Guestrin, 2011) to model the news recommendation setting. Here, the algorithm must choose a set of $L$ actions and receives rewards based on both the quality as well as the diversity of the actions chosen ($L = 1$ is the conventional bandit setting). Using this structured action space leads to a more realistic setting for content recommendation, since recommender systems often must recommend multiple items at a time. It is straightforward to extend CoFineUCB to the submodular bandit setting (see Appendix C).

[5] $\hat{W} \equiv (U_D^\top U_D)^{-1} U_D^\top W$.
6.3. Simulations

We performed simulation evaluations using data collected from a previous user study in personalized news recommendation by Yue & Guestrin (2011). The data includes featurized articles ($D = 100$) and $N = 77$ user profiles. We employed leave-one-out validation: for each user, the transformations $U_D$ and $U$ ($K = 5$) were trained using the remaining users' profiles. For each user, we ran 25 simulations ($T = 10000$). All algorithms used the same $U$ and $U_D$ projections, where applicable. We also compared with a variant of CoFineUCB, CoFineUCB-focus, which scales down exploration in the full space $c_t$ by a factor of 0.25.
Figure 3(a) shows the cumulative regret of each algorithm averaged over all users when recommending one article per iteration ($L = 1$). All algorithms dramatically outperform Naive LinUCB, with the exception of Mean-Regularized, which performs almost identically. While Reshape shows good eventual convergence behavior, it incurs higher initial regret than the CoFineUCB algorithms and SubspaceUCB. The trends also hold when recommending multiple articles per iteration ($L = 5$), as seen in Figure 3(b).
The performance of the two variants of CoFineUCB and SubspaceUCB demonstrates the benefit of exploring in the subspace. However, Figure 3(c) reveals the critical shortfall of SubspaceUCB by comparing average cumulative regret for the ten users with the largest residual $\|w_\perp\|$. For these atypical users, the subspace is not sufficient to adequately learn their preferences, resulting in linear regret for SubspaceUCB.
Figure 3(d) shows the behavior of CoFineUCB as we vary $K$. Larger subspaces require more exploration, which in general leads to increased regret.

Figure 3(e) shows the behavior of CoFineUCB as we vary the scaling of exploration in the full space $c_t$ (CoFineUCB-focus is the special case where the scaling factor is 0.25). More conservative exploration in the full space tends to reduce regret. However, no exploration of the full space can lead to higher regret.
Synthetic Dataset. We used a 25-dimensional synthetic dataset to study the effect of mismatch between
Figure 3. (a)-(e) Cumulative regret results for news recommendation simulation: (a) all users ($L = 1$); (b) all users ($L = 5$); (c) atypical users ($L = 5$); (d) CoFineUCB over varying $K$; (e) CoFineUCB over varying $c_t$. (f) Comparison over preference vectors with varying projection residuals using synthetic simulation.
$w^*$ and $U$. This dataset allows for a more systematic analysis by forcing every $x$ and $w^*$ to have unit norm. For residual magnitude $\beta \in [0, 1]$, we sampled $w^*$ uniformly in a 5-dimensional subspace with magnitude $\sqrt{1 - \beta^2}$, and uniformly in the remaining dimensions with magnitude $\beta$. Figure 3(f) shows that the regret of both SubspaceUCB and CoFineUCB-focus increases with the residual, with SubspaceUCB exhibiting a more dramatic increase, beyond that of even Naive LinUCB.
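A sketch of this sampling scheme (the paper does not spell out the exact sampler, so the uniform-direction construction below is our assumption):

```python
import numpy as np

def sample_w(beta, D=25, k=5, rng=None):
    """Sample a unit-norm preference vector with residual magnitude beta:
    mass sqrt(1 - beta^2) in a uniform direction over the first k
    coordinates (the subspace) and mass beta over the remaining D - k
    coordinates (the residual). Illustrative, not the authors' code."""
    rng = rng or np.random.default_rng()
    head = rng.normal(size=k)
    tail = rng.normal(size=D - k)
    return np.concatenate([
        np.sqrt(1 - beta**2) * head / np.linalg.norm(head),
        beta * tail / np.linalg.norm(tail),
    ])

w = sample_w(0.3)
assert np.isclose(np.linalg.norm(w), 1.0)      # unit norm overall
assert np.isclose(np.linalg.norm(w[5:]), 0.3)  # residual magnitude beta
```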
6.4. User Study

Our user study design follows the study conducted in (Yue & Guestrin, 2011). We presented each user with ten articles per day over ten days from January 21, 2012 to February 8, 2012. Each day comprised approximately ten thousand articles. We represented articles using $D = 100$ features corresponding to topics learned via latent Dirichlet allocation (Blei et al., 2003). For each day, articles shown are selected using an interleaving of two bandit algorithms. The user is instructed to briefly skim each article and mark each article as "interested in reading in detail" or "not interested".

We conducted the user study in two phases. Prior to the first phase, we conducted a preliminary study to collect preferences for constructing $U$ ($K = 5$). In the
Comparison | #Users | Win/Tie/Lose | Gain/Day
CoFineUCB v. Naive | 27 | 24 / 1 / 3 | 0.69
CoFineUCB v. Reshape | 30 | 21 / 3 / 6 | 0.27

Table 2. User study comparing CoFineUCB with two baselines. All results satisfy 95% statistical confidence.
first phase, we compared CoFineUCB with Naive. Afterwards, we took all the user profiles learned so far to estimate a reshaping of the full space $U_D$, and compared against Reshape. Due to the short duration of each session ($T = 10$), we did not expect a meaningful comparison between CoFineUCB and SubspaceUCB, so we omitted it (we expect both methods to perform equally well in early iterations, as seen in the simulation experiments). For each user session, we counted the total number of liked articles recommended by each algorithm. An algorithm wins a session if the user liked more articles recommended by it.
Table 2 shows that over the two stages, about 80% of the users prefer CoFineUCB. We see a smaller gain against Reshape, a stronger baseline. On average, users liked over half an additional article per day from CoFineUCB over Naive, and about a quarter additional per day over Reshape. These results show that CoFineUCB is effective in reducing the amount of exploration required.

Figure 4. Each column of word clouds represents a dimension in the subspace. The bar lengths denote the magnitude in each dimension of preference vectors learned by CoFineUCB (blue) and Naive LinUCB (red). The rightmost column shows the norm of the residual $w_\perp$ of weight vectors learned by CoFineUCB and Naive LinUCB.
Figure 4 shows a representation of four dimensions of $U$ learned from user profiles. Each dimension is a combination of features, i.e., topics from LDA. In the top row, the $i$-th word cloud contains representative words from topics associated with high positive weights in the $i$-th column of $U$, and the bottom row those with high negative weights. Examining Figure 4 can reveal tendencies in the user preferences collected in our study; for example, the third column shows that users interested in Republican politics also tend to follow healthcare debates, but tend to be uninterested in videogaming. Figure 4 also shows a comparison of weights estimated by CoFineUCB and Naive LinUCB for one user. Since Naive LinUCB does not utilize the subspace, the weights it estimates tend to have much higher residual norm, whereas CoFineUCB puts higher weights on the subspace dimensions.
7. Related Work

Optimizing recommender systems via user feedback has become increasingly popular in recent years (El-Arini et al., 2009; Li et al., 2010; 2011; Yue & Guestrin, 2011; Ahmed et al., 2012). Most prior work does not address the issue of exploration and often trains with pre-collected feedback, which may lead to a biased model.

The exploration-exploitation tradeoff inherent in learning from user feedback is naturally modeled as a contextual bandit problem (Langford & Zhang, 2007; Li et al., 2010; Slivkins, 2011; Chapelle & Li, 2011; Krause & Ong, 2011). In contrast to most prior work, we focus on principled approaches for encoding prior knowledge for accelerated bandit learning.
Our work builds upon a long line of research on linear stochastic bandits (Dani et al., 2008; Rusmevichientong & Tsitsiklis, 2010; Abbasi-Yadkori et al., 2011). Although often practical, one limitation is the assumption of realizability. In other words, we assume that the true model of user behavior lies within our class.
The use of hierarchies in bandit learning is not new. For instance, the work of Pandey et al. (2007b; 2007a) encodes prior knowledge by hierarchically clustering articles into a taxonomy. However, their setting is feature-free, which can make it difficult to generalize to new articles and users. In contrast, our approach makes use of readily available feature-based prior knowledge, such as the learned preferences of existing users.
Another related line of work is that of sparse linear bandits (Abbasi-Yadkori et al., 2012; Carpentier & Munos, 2012). The assumption is that the true $w^*$ is sparse, and one can achieve regret bounds that depend on the sparsity of $w^*$. In contrast, we consider settings where user profiles are not necessarily sparse, but can be well-approximated by a low-rank subspace.
It may be possible to integrate our feature hierarchy approach with other bandit learning algorithms, such as Thompson Sampling (Chapelle & Li, 2011). Thompson Sampling is a probability matching algorithm that samples $w_t$ from the posterior distribution. Using feature hierarchies, one can define a hierarchical sampling approach that first samples $\tilde{w}_t$ in the subspace, and then samples $w_t$ around $\tilde{w}_t$ in the full space.
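One hypothetical instantiation of this idea (ours; the paper only outlines it) samples the coarse vector from a subspace posterior and then samples the full-space vector around its lift; the posterior covariances passed in are placeholders for whatever the modeler maintains:

```python
import numpy as np

def hierarchical_thompson_sample(w_tilde_hat, cov_tilde, U, sigma2, rng):
    """Hypothetical hierarchical Thompson sampling step: draw a coarse
    sample w_tilde from a Gaussian subspace posterior, then draw the
    full-space sample around its lift U @ w_tilde with isotropic
    covariance sigma2 * I. A sketch, not a construction from the paper."""
    D = U.shape[0]
    w_tilde = rng.multivariate_normal(w_tilde_hat, cov_tilde)
    return rng.multivariate_normal(U @ w_tilde, sigma2 * np.eye(D))
```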
Our approach can be applied to many structured classes of bandit problems (e.g., Streeter & Golovin (2008); Cesa-Bianchi & Lugosi (2009)), assuming that