Hierarchical Imitation and Reinforcement Learning
Hoang M. Le¹, Nan Jiang², Alekh Agarwal², Miroslav Dudík², Yisong Yue¹, Hal Daumé III³ ²
Abstract
We study how to effectively leverage expert feedback to learn sequential decision-making policies. We focus on problems with sparse rewards and long time horizons, which typically pose significant challenges in reinforcement learning. We propose an algorithmic framework, called hierarchical guidance, that leverages the hierarchical structure of the underlying problem to integrate different modes of expert interaction. Our framework can incorporate different combinations of imitation learning (IL) and reinforcement learning (RL) at different levels, leading to dramatic reductions in both expert effort and cost of exploration. Using long-horizon benchmarks, including Montezuma's Revenge, we demonstrate that our approach can learn significantly faster than hierarchical RL, and be significantly more label-efficient than standard IL. We also theoretically analyze labeling cost for certain instantiations of our framework.
1. Introduction
Learning good agent behavior from reward signals alone—the goal of reinforcement learning (RL)—is particularly difficult when the planning horizon is long and rewards are sparse. One successful method for dealing with such long horizons is imitation learning (IL) (Abbeel & Ng, 2004; Daumé et al., 2009; Ross et al., 2011; Ho & Ermon, 2016), in which the agent learns by watching and possibly querying an expert. One limitation of existing imitation learning approaches is that they may require a large amount of demonstration data in long-horizon problems.

The central question we address in this paper is: when experts are available, how can we most effectively leverage their feedback?
¹California Institute of Technology, Pasadena, CA. ²Microsoft Research, New York, NY. ³University of Maryland, College Park, MD. Correspondence to: Hoang M. Le <hmle@caltech.edu>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).
A common strategy to improve sample efficiency in RL over long time horizons is to exploit hierarchical structure of the problem (Sutton et al., 1998; 1999; Kulkarni et al., 2016; Dayan & Hinton, 1993; Vezhnevets et al., 2017; Dietterich, 2000). Our approach leverages hierarchical structure in imitation learning. We study the case where the underlying problem is hierarchical and subtasks can be easily elicited from an expert. Our key design principle is an algorithmic framework called hierarchical guidance, in which feedback (labels) from the high-level expert is used to focus (guide) the low-level learner. The high-level expert ensures that low-level learning only occurs when necessary (when subtasks have not been mastered) and only over relevant parts of the state space. This differs from a naïve hierarchical approach which merely gives a subtask decomposition. Focusing on relevant parts of the state space speeds up learning (improves sample efficiency), while omitting feedback on the already mastered subtasks reduces expert effort (improves label efficiency).
We begin by formalizing the problem of hierarchical imitation learning (Section 3) and carefully separate out cost structures that naturally arise when the expert provides feedback at multiple levels of abstraction. We first apply hierarchical guidance to IL, derive hierarchically guided variants of behavior cloning and DAgger (Ross et al., 2011), and theoretically analyze the benefits (Section 4). We next apply hierarchical guidance to the hybrid setting with high-level IL and low-level RL (Section 5). This architecture is particularly suitable in settings where we have access to high-level semantic knowledge, the subtask horizon is sufficiently short, but the low-level expert is too costly or unavailable. We demonstrate the efficacy of our approaches on a simple but extremely challenging maze domain, and on Montezuma's Revenge (Section 6). Our experiments show that incorporating a modest amount of expert feedback can lead to dramatic improvements in performance compared to pure hierarchical RL.¹
2. Related Work
For brevity, we provide here a short overview of related work, and defer to Appendix C for additional discussion.

¹Code and experimental setups are available at https://sites.google.com/view/hierarchical-il-rl
Imitation Learning. One can broadly dichotomize IL into passive collection of demonstrations (behavioral cloning) versus active collection of demonstrations. The former setting (Abbeel & Ng, 2004; Ziebart et al., 2008; Syed & Schapire, 2008; Ho & Ermon, 2016) assumes that demonstrations are collected a priori and the goal of IL is to find a policy that mimics the demonstrations. The latter setting (Daumé et al., 2009; Ross et al., 2011; Ross & Bagnell, 2014; Chang et al., 2015; Sun et al., 2017) assumes an interactive expert that provides demonstrations in response to actions taken by the current policy. We explore extensions of both approaches to hierarchical settings.
Hierarchical Reinforcement Learning. Several RL approaches to learning hierarchical policies have been explored, foremost among them the options framework (Sutton et al., 1998; 1999; Fruit & Lazaric, 2017). It is often assumed that a useful set of options is fully defined a priori, and (semi-Markov) planning and learning only occurs at the higher level. In comparison, our agent does not have direct access to policies that accomplish such subgoals and has to learn them via expert or reinforcement feedback. The closest hierarchical RL work to ours is that of Kulkarni et al. (2016), which uses a similar hierarchical structure, but no high-level expert and hence no hierarchical guidance.
Combining Reinforcement and Imitation Learning. The idea of combining IL and RL is not new (Nair et al., 2017; Hester et al., 2018). However, previous work focuses on flat policy classes that use IL as a "pre-training" step (e.g., by pre-populating the replay buffer with demonstrations). In contrast, we consider feedback at multiple levels for a hierarchical policy class, with different levels potentially receiving different types of feedback (i.e., imitation at one level and reinforcement at the other). Somewhat related to our hierarchical expert supervision is the approach of Andreas et al. (2017), which assumes access to symbolic descriptions of subgoals, without knowing what those symbols mean or how to execute them. Previous literature has not focused much on comparisons of sample complexity between IL and RL, with the exception of the recent work of Sun et al. (2017).
3. Hierarchical Formalism
For simplicity, we consider environments with a natural two-level hierarchy; the HI level corresponds to choosing subtasks, and the LO level corresponds to executing those subtasks. For instance, an agent's overall goal may be to leave a building. At the HI level, the agent may first choose the subtask "go to the elevator," then "take the elevator down," and finally "walk out." Each of these subtasks needs to be executed at the LO level by actually navigating the environment, pressing buttons on the elevator, etc.²
Subtasks, which we also call subgoals, are denoted as g ∈ G, and the primitive actions are denoted as a ∈ A. An agent (also referred to as learner) acts by iteratively choosing a subgoal g, carrying it out by executing a sequence of actions a until completion, and then picking a new subgoal. The agent's choices can depend on an observed state s ∈ S.³ We assume that the horizon at the HI level is H_HI, i.e., a trajectory uses at most H_HI subgoals, and the horizon at the LO level is H_LO, i.e., after at most H_LO primitive actions, the agent either accomplishes the subgoal or needs to decide on a new subgoal. The total number of primitive actions in a trajectory is thus at most H_FULL := H_HI · H_LO.
The hierarchical learning problem is to simultaneously learn a HI-level policy μ : S → G, called the meta-controller, as well as the subgoal policies π_g : S → A for each g ∈ G, called subpolicies. The aim of the learner is to achieve a high reward when its meta-controller and subpolicies are run together. For each subgoal g, we also have a (possibly learned) termination function β_g : S → {True, False}, which terminates the execution of π_g. The hierarchical agent behaves as follows:
1: for h_HI = 1 ... H_HI do
2:   observe state s and choose subgoal g ← μ(s)
3:   for h_LO = 1 ... H_LO do
4:     observe state s
5:     if β_g(s) then break
6:     choose action a ← π_g(s)
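To make the control flow concrete, the following minimal Python sketch mirrors the pseudocode above; the environment interface (env.observe, env.step, env.episode_over) and the way policies and termination functions are passed in are illustrative assumptions, not the paper's implementation.

    def run_hierarchical_episode(env, meta_controller, subpolicies, terminations,
                                 H_HI, H_LO):
        """Roll out a two-level hierarchical policy for one episode.

        meta_controller: callable state -> subgoal g
        subpolicies:     dict g -> callable state -> primitive action (pi_g)
        terminations:    dict g -> callable state -> bool (beta_g)
        """
        sigma = []                            # hierarchical trajectory [(s_h, g_h, tau_h), ...]
        for _ in range(H_HI):
            s_hi = env.observe()
            g = meta_controller(s_hi)         # HI level: choose subgoal
            tau = []                          # LO-level trajectory for this subgoal
            for _ in range(H_LO):
                s = env.observe()
                if terminations[g](s):        # beta_g(s): subgoal execution ends
                    break
                a = subpolicies[g](s)         # LO level: choose primitive action
                env.step(a)
                tau.append((s, a))
            sigma.append((s_hi, g, tau))
            if env.episode_over():
                break
        return sigma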
The execution of each subpolicy π_g generates a LO-level trajectory τ = (s_1, a_1, ..., s_H, a_H, s_{H+1}) with H ≤ H_LO.⁴ The overall behavior results in a hierarchical trajectory σ = (s_1, g_1, τ_1, s_2, g_2, τ_2, ...), where the last state of each LO-level trajectory τ_h coincides with the next state s_{h+1} in σ and the first state of the next LO-level trajectory τ_{h+1}. The subsequence of σ which excludes the LO-level trajectories τ_h is called the HI-level trajectory, τ_HI := (s_1, g_1, s_2, g_2, ...). Finally, the full trajectory, τ_FULL, is the concatenation of all the LO-level trajectories.
²An important real-world application is goal-oriented dialogue systems. For instance, a chatbot assisting a user with reservation and booking for flights and hotels (Peng et al., 2017; El Asri et al., 2017) needs to navigate through multiple turns of conversation. The chatbot developer designs the hierarchy of subtasks, such as ask_user_goal, ask_dates, offer_flights, confirm, etc. Each subtask consists of several turns of conversation. Typically a global state tracker exists alongside the hierarchical dialogue policy to ensure that cross-subtask constraints are satisfied.
³While we use the term state for simplicity, we do not require the environment to be fully observable or Markovian.
⁴The trajectory might optionally include a reward signal after each primitive action, which might either come from the environment, or be a pseudo-reward as we will see in Section 5.
We assume access to an expert, endowed with a meta-controller μ*, subpolicies π*_g, and termination functions β*_g, who can provide one or several types of supervision:
• HierDemo(s): hierarchical demonstration. The expert executes its hierarchical policy starting from s and returns the resulting hierarchical trajectory σ* = (s*_1, g*_1, τ*_1, s*_2, g*_2, τ*_2, ...), where s*_1 = s.
• Label_HI(τ_HI): HI-level labeling. The expert provides a good next subgoal at each state of a given HI-level trajectory τ_HI = (s_1, g_1, s_2, g_2, ...), yielding a labeled data set {(s_1, g*_1), (s_2, g*_2), ...}.
• Label_LO(τ; g): LO-level labeling. The expert provides a good next primitive action towards a given subgoal g at each state of a given LO-level trajectory τ = (s_1, a_1, s_2, a_2, ...), yielding a labeled data set {(s_1, a*_1), (s_2, a*_2), ...}.
• Inspect_LO(τ; g): LO-level inspection. Instead of annotating every state of a trajectory with a good action, the expert only verifies whether a subgoal g was accomplished, returning either Pass or Fail.
• Label_FULL(τ_FULL): full labeling. The expert labels the agent's full trajectory τ_FULL = (s_1, a_1, s_2, a_2, ...) from start to finish, ignoring hierarchical structure, yielding a labeled data set {(s_1, a*_1), (s_2, a*_2), ...}.
• Inspect_FULL(τ_FULL): full inspection. The expert verifies whether the agent's overall goal was accomplished, returning either Pass or Fail.
When the agent learns not only the subpolicies π_g, but also the termination functions β_g, then Label_LO also returns good termination values ω* ∈ {True, False} for each state of τ = (s_1, a_1, ...), yielding a data set {(s_1, a*_1, ω*_1), ...}.
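These interaction modes can be summarized as a single expert interface. The sketch below is a hypothetical Python rendering (class and method names are ours) that only fixes the signatures and return conventions described above.

    class HierarchicalExpert:
        """Hypothetical interface for the expert interaction modes of Section 3."""

        def hier_demo(self, s):
            """HierDemo(s): run the expert's hierarchical policy from s and return
            sigma* = [(s*_h, g*_h, tau*_h), ...]."""
            raise NotImplementedError

        def label_hi(self, tau_hi):
            """Label_HI(tau_HI): the expert's subgoal for each state on the HI-level
            trajectory, as a list [(s_h, g*_h), ...]."""
            raise NotImplementedError

        def label_lo(self, tau, g):
            """Label_LO(tau; g): the expert's action towards subgoal g for each state
            on the LO-level trajectory, as [(s_l, a*_l), ...] (plus termination labels
            omega*_l when termination functions are learned)."""
            raise NotImplementedError

        def inspect_lo(self, tau, g):
            """Inspect_LO(tau; g): True (Pass) iff subgoal g was accomplished."""
            raise NotImplementedError

        def label_full(self, tau_full):
            """Label_FULL(tau_FULL): expert actions along the flat trajectory,
            as [(s_t, a*_t), ...]."""
            raise NotImplementedError

        def inspect_full(self, tau_full):
            """Inspect_FULL(tau_FULL): True (Pass) iff the overall goal was accomplished."""
            raise NotImplementedError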
Although HierDemo and Label can both be generated by the expert's hierarchical policy (μ*, {π*_g}), they differ in the mode of expert interaction. HierDemo returns a hierarchical trajectory executed by the expert, as required for passive IL, and enables a hierarchical version of behavioral cloning (Abbeel & Ng, 2004; Syed & Schapire, 2008). Label operations provide labels with respect to the learning agent's trajectories, as required for interactive IL. Label_FULL is the standard query used in prior work on learning flat policies (Daumé et al., 2009; Ross et al., 2011), and Label_HI and Label_LO are its hierarchical extensions.

Inspect operations are newly introduced in this paper, and form a cornerstone of our interactive hierarchical guidance protocol that enables substantial savings in label efficiency. They can be viewed as "lazy" versions of the corresponding Label operations, requiring less effort. Our underlying assumption is that if the given hierarchical trajectory σ = {(s_h, g_h, τ_h)} agrees with the expert on the HI level, i.e., g_h = μ*(s_h), and the LO-level trajectories pass the inspection, i.e., Inspect_LO(τ_h; g_h) = Pass, then the resulting full trajectory must also pass the full inspection, Inspect_FULL(τ_FULL) = Pass. This means that a hierarchical policy need not always agree with the expert's execution at the LO level to succeed in the overall task.
Algorithm 1 Hierarchical Behavioral Cloning (h-BC)
 1: Initialize data buffers D_HI ← ∅ and D_g ← ∅, g ∈ G
 2: for t = 1, ..., T do
 3:   Get a new environment instance with start state s
 4:   σ* ← HierDemo(s)
 5:   for all (s*_h, g*_h, τ*_h) ∈ σ* do
 6:     Append D_{g*_h} ← D_{g*_h} ∪ τ*_h
 7:     Append D_HI ← D_HI ∪ {(s*_h, g*_h)}
 8: Train subpolicies π_g ← Train(π_g, D_g) for all g
 9: Train meta-controller μ ← Train(μ, D_HI)
Besides algorithmic reasons, the motivation for separating the types of feedback is that different expert queries will typically require different amounts of effort, which we refer to as cost. We assume the costs of the Label operations are C^L_HI, C^L_LO and C^L_FULL, and the costs of the Inspect operations are C^I_LO and C^I_FULL. In many settings, LO-level inspection will require significantly less effort than LO-level labeling, i.e., C^I_LO ≪ C^L_LO. For instance, identifying whether a robot has successfully navigated to the elevator is presumably much easier than labeling an entire path to the elevator.
One reasonable cost model, natural for the environments in our experiments, is to assume that Inspect operations take time O(1) and work by checking the final state of the trajectory, whereas Label operations take time proportional to the trajectory length, which is O(H_HI), O(H_LO) and O(H_HI · H_LO) for our three Label operations.
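Under this cost model, expert effort can be tallied with two helper functions; the unit costs below simply encode the O(1) / O(length) convention and are not measured values.

    def inspect_cost(trajectory):
        """Inspection only checks the final state: O(1) effort."""
        return 1

    def label_cost(trajectory):
        """Labeling annotates every step: effort proportional to trajectory length."""
        return len(trajectory)

    # Example: with H_HI = 10 and H_LO = 5, a full label costs about 50 units,
    # a HI-level label about 10, a LO-level label about 5, and any inspection 1.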
4. Hierarchically Guided Imitation Learning
Hierarchical guidance is an algorithmic design principle in which the feedback from the high-level expert guides the low-level learner in two different ways: (i) the high-level expert ensures that the low-level expert is only queried when necessary (when the subtasks have not been mastered yet), and (ii) low-level learning is limited to the relevant parts of the state space. We instantiate this framework first within passive learning from demonstrations, obtaining hierarchical behavioral cloning (Algorithm 1), and then within interactive imitation learning, obtaining hierarchically guided DAgger (Algorithm 2), our best-performing algorithm.
4.1. Hierarchical Behavioral Cloning (h-BC)
We consider a natural extension of behavioral cloning to the hierarchical setting (Algorithm 1). The expert provides a set of hierarchical demonstrations σ*, each consisting of LO-level trajectories τ*_h = {(s*_ℓ, a*_ℓ)}_{ℓ=1,...,H_LO} as well as a HI-level trajectory τ*_HI = {(s*_h, g*_h)}_{h=1,...,H_HI}. We then run Train (lines 8–9) to find the subpolicies π_g that best predict a*_ℓ from s*_ℓ, and the meta-controller μ that best predicts g*_h from s*_h, respectively. Train can generally be any supervised learning subroutine, such as stochastic optimization for neural networks or some batch training procedure. When termination functions β_g need to be learned as part of the hierarchical policy, the labels ω*_g will be provided by the expert as part of τ*_h = {(s*_ℓ, a*_ℓ, ω*_ℓ)}.⁵ In this setting, hierarchical guidance is automatic, because subpolicy demonstrations only occur in relevant parts of the state space.
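As a rough illustration of how Algorithm 1 translates into code, here is a compact Python sketch; expert.hier_demo follows the hypothetical interface from Section 3, and train_classifier stands in for any supervised Train subroutine.

    from collections import defaultdict

    def hierarchical_behavioral_cloning(expert, sample_env, train_classifier, T):
        """Sketch of Algorithm 1 (h-BC): collect hierarchical demonstrations, then fit
        the meta-controller and each subpolicy with ordinary supervised learning."""
        D_hi = []                         # (state, subgoal) pairs for the meta-controller
        D_lo = defaultdict(list)          # subgoal g -> (state, action) pairs for pi_g

        for _ in range(T):
            s0 = sample_env()                        # new environment instance, start state
            sigma_star = expert.hier_demo(s0)        # [(s*_h, g*_h, tau*_h), ...]
            for s_h, g_h, tau_h in sigma_star:
                D_hi.append((s_h, g_h))
                D_lo[g_h].extend(tau_h)              # tau*_h = [(s, a), ...]

        meta_controller = train_classifier(D_hi)     # predicts g from s
        subpolicies = {g: train_classifier(data) for g, data in D_lo.items()}
        return meta_controller, subpolicies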
Algorithm 2 Hierarchically Guided DAgger (hg-DAgger)
 1: Initialize data buffers D_HI ← ∅ and D_g ← ∅, g ∈ G
 2: Run Hierarchical Behavioral Cloning (Algorithm 1) up to t = T_warm-start
 3: for t = T_warm-start + 1, ..., T do
 4:   Get a new environment instance with start state s
 5:   Initialize σ ← ∅
 6:   repeat
 7:     g ← μ(s)
 8:     Execute π_g, obtain LO-level trajectory τ
 9:     Append (s, g, τ) to σ
10:     s ← the last state in τ
11:   until end of episode
12:   Extract τ_FULL and τ_HI from σ
13:   if Inspect_FULL(τ_FULL) = Fail then
14:     D* ← Label_HI(τ_HI)
15:     Process (s_h, g_h, τ_h) ∈ σ in sequence as long as g_h agrees with the expert's choice g*_h in D*:
16:       if Inspect_LO(τ_h; g_h) = Fail then
17:         Append D_{g_h} ← D_{g_h} ∪ Label_LO(τ_h; g_h)
18:         break
19:     Append D_HI ← D_HI ∪ D*
20:   Update subpolicies π_g ← Train(π_g, D_g) for all g
21:   Update meta-controller μ ← Train(μ, D_HI)
4.2. Hierarchically Guided DAgger (hg-DAgger)
Passive IL, e.g., behavioral cloning, suffers from the distribution mismatch between the learning and execution distributions. This mismatch is addressed by interactive IL algorithms, such as SEARN (Daumé et al., 2009) and DAgger (Ross et al., 2011), where the expert provides correct actions along the learner's trajectories through the operation Label_FULL. A naïve hierarchical implementation would provide correct labels along the entire hierarchical trajectory via Label_HI and Label_LO. We next show how to use hierarchical guidance to decrease LO-level expert costs.
We leverage two HI-level query types: Inspect_LO and Label_HI. We use Inspect_LO to verify whether the subtasks are successfully completed and Label_HI to check whether we are staying in the relevant part of the state space. The details are presented in Algorithm 2, which uses DAgger as the learner on both levels, but the scheme can be adapted to other interactive imitation learners.

⁵In our hierarchical imitation learning experiments, the termination functions are all learned. Formally, the termination signal ω_g can be viewed as part of an augmented action at the LO level.
In each episode, the learner executes the hierarchical policy, including choosing a subgoal (line 7), executing the LO-level trajectories, i.e., rolling out the subpolicy π_g for the chosen subgoal, and terminating the execution according to β_g (line 8). The expert only provides feedback when the agent fails to execute the entire task, as verified by Inspect_FULL (line 13). When Inspect_FULL fails, the expert first labels the correct subgoals via Label_HI (line 14), and only performs LO-level labeling as long as the learner's meta-controller chooses the correct subgoal g_h (line 15), but its subpolicy fails (i.e., when Inspect_LO on line 16 fails). Since all the preceding subgoals were chosen and executed correctly, and the current subgoal is also correct, LO-level learning is in the "relevant" part of the state space. However, since the subpolicy execution failed, it has not been mastered yet. We next analyze the savings in expert cost that result from hierarchical guidance.
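The feedback branch of Algorithm 2 (lines 13–19) can be sketched in Python as follows; the expert object follows the hypothetical interface from Section 3, and the buffer layout is an illustrative choice.

    def hierarchical_guidance_step(sigma, expert, D_hi, D_lo):
        """Sketch of hg-DAgger's expert-feedback branch (Algorithm 2, lines 13-19).

        sigma: learner's hierarchical trajectory [(s_h, g_h, tau_h), ...]
        D_hi:  list of (state, subgoal) labels for the meta-controller
        D_lo:  dict subgoal -> list of (state, action) labels for its subpolicy
        """
        tau_full = [step for (_, _, tau_h) in sigma for step in tau_h]
        if expert.inspect_full(tau_full):          # overall task succeeded: no labels needed
            return

        tau_hi = [(s_h, g_h) for (s_h, g_h, _) in sigma]
        hi_labels = expert.label_hi(tau_hi)        # [(s_h, g*_h), ...]
        D_hi.extend(hi_labels)

        # Walk the episode's subgoals in order, staying in the "relevant" prefix.
        for (s_h, g_h, tau_h), (_, g_star) in zip(sigma, hi_labels):
            if g_h != g_star:
                break                              # wrong subgoal chosen: stop LO-level processing
            if not expert.inspect_lo(tau_h, g_h):  # right subgoal, but failed execution
                D_lo[g_h].extend(expert.label_lo(tau_h, g_h))
                break                              # later subgoals start from off-distribution states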
Theoretical Analysis. We analyze the cost of hg-DAgger in comparison with flat DAgger under somewhat stylized assumptions. We assume that the learner aims to learn the meta-controller μ from some policy class M, and subpolicies π_g from some class Π_LO. The classes M and Π_LO are finite (but possibly exponentially large) and the task is realizable, i.e., the expert's policies can be found in the corresponding classes: μ* ∈ M, and π*_g ∈ Π_LO, g ∈ G. This allows us to use the halving algorithm (Shalev-Shwartz et al., 2012) as the online learner on both levels. (The implementation of our algorithm does not require these assumptions.)
The halving algorithm maintains a version space over policies, acts by a majority decision, and when it makes a mistake, it removes all the erring policies from the version space. In the hierarchical setting, it therefore makes at most log|M| mistakes on the HI level, and at most log|Π_LO| mistakes when learning each π_g. The mistake bounds can be further used to upper bound the total expert cost in both hg-DAgger and flat DAgger. To enable an apples-to-apples comparison, we assume that flat DAgger learns over the policy class Π_FULL = {(μ, {π_g}_{g∈G}) : μ ∈ M, π_g ∈ Π_LO}, but is otherwise oblivious to the hierarchical task structure.
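For reference, a generic halving learner over a finite policy class might look like the sketch below; representing policies as callables returning hashable labels is an assumption made for illustration.

    from collections import Counter

    class HalvingLearner:
        """Halving algorithm over a finite policy class: predict by majority vote of
        the current version space; on a revealed mistake, discard every policy that
        erred. Under realizability it makes at most log2(|policy_class|) mistakes."""

        def __init__(self, policy_class):
            self.version_space = list(policy_class)    # surviving hypotheses

        def predict(self, s):
            votes = Counter(pi(s) for pi in self.version_space)
            return votes.most_common(1)[0][0]          # majority decision

        def update(self, s, correct_label):
            # Called when the expert reveals the correct label for state s.
            self.version_space = [pi for pi in self.version_space
                                  if pi(s) == correct_label]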
The bounds depend on the cost of performing different types of operations, as defined at the end of Section 3. We consider a modified version of flat DAgger that first calls Inspect_FULL, and only requests labels (Label_FULL) if the inspection fails. The proofs are deferred to Appendix A.
Theorem 1. Given finite classes M and Π_LO and realizable expert policies, the total cost incurred by the expert in hg-DAgger by round T is bounded by

  T·C^I_FULL + (log₂|M| + |G_opt| log₂|Π_LO|)(C^L_HI + H_HI·C^I_LO) + (|G_opt| log₂|Π_LO|)·C^L_LO,   (1)

where G_opt ⊆ G is the set of the subgoals actually used by the expert, G_opt := μ*(S).
Theorem 2. Given the full policy class Π_FULL = {(μ, {π_g}_{g∈G}) : μ ∈ M, π_g ∈ Π_LO} and a realizable expert policy, the total cost incurred by the expert in flat DAgger by round T is bounded by

  T·C^I_FULL + (log₂|M| + |G| log₂|Π_LO|)·C^L_FULL.   (2)
Both bounds have the same leading term, T·C^I_FULL, the cost of full inspection, which is incurred every round and can be viewed as the "cost of monitoring." In contrast, the remaining terms can be viewed as the "cost of learning" in the two settings, and include terms coming from their respective mistake bounds. The ratio of the cost of hierarchically guided learning to that of flat learning is then bounded as

  (Eq. (1) − T·C^I_FULL) / (Eq. (2) − T·C^I_FULL) ≤ (C^L_HI + H_HI·C^I_LO + C^L_LO) / C^L_FULL,   (3)
where we applied the upper bound |G_opt| ≤ |G|. The savings thanks to hierarchical guidance depend on the specific costs. Typically, we expect the inspection costs to be O(1), if it suffices to check the final state, whereas labeling costs scale linearly with the length of the trajectory. The cost ratio is then proportional to (H_HI + H_LO) / (H_HI · H_LO). Thus, we realize the most significant savings if the horizons on each individual level are substantially shorter than the overall horizon. In particular, if H_HI = H_LO = √H_FULL, the hierarchically guided approach reduces the overall labeling cost by a factor of √H_FULL. More generally, whenever H_FULL is large, we reduce the costs of learning by at least a constant factor—a significant gain if this is a saving in the effort of a domain expert.
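As a quick numerical illustration of the bound in Eq. (3) under the O(1)-inspection / O(length)-labeling cost model, the snippet below plugs in horizons of the same order as those in the maze domain of Section 6; this is an illustrative calculation, not a reported result.

    def hg_cost_ratio(H_hi, H_lo):
        """Right-hand side of Eq. (3) with C_L proportional to trajectory length
        and C_I = 1: (C_L_HI + H_HI * C_I_LO + C_L_LO) / C_L_FULL."""
        return (H_hi + H_hi * 1 + H_lo) / (H_hi * H_lo)

    print(hg_cost_ratio(H_hi=10, H_lo=10))   # 0.3, a Theta(1/sqrt(H_FULL)) fraction for H_FULL = 100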
5. Hierarchically Guided IL / RL
Hierarchical guidance also applies in the hybrid setting with interactive IL on the HI level and RL on the LO level. The HI-level expert provides the hierarchical decomposition, including the pseudo-reward function for each subgoal,⁶ and is also able to pick a correct subgoal at each step. Similar to hg-DAgger, the labels from the HI-level expert are used not only to train the meta-controller μ, but also to limit the LO-level learning to the relevant part of the state space. In Algorithm 3 we provide the details, with DAgger on the HI level and Q-learning on the LO level. The scheme can be adapted to other interactive IL and RL algorithms.
The learning agent proceeds by rolling in with its meta-controller (line 7). For each selected subgoal g, the subpolicy π_g selects and executes primitive actions via the ε_g-greedy rule (lines 9–10), until some termination condition is met. The agent receives some pseudo-reward, also known as intrinsic reward (Kulkarni et al., 2016) (line 10). Upon termination of the subgoal, the agent's meta-controller μ chooses another subgoal and the process continues until the end of the episode, where the involvement of the expert begins. As in hg-DAgger, the expert inspects the overall execution of the learner (line 17), and if it is not successful, the expert provides HI-level labels, which are accumulated for training the meta-controller.

⁶This is consistent with many hierarchical RL approaches, including options (Sutton et al., 1999), MAXQ (Dietterich, 2000), UVFA (Schaul et al., 2015a) and h-DQN (Kulkarni et al., 2016).
Algorithm 3 Hierarchically Guided DAgger / Q-learning (hg-DAgger/Q)
input Function pseudo(s; g) providing the pseudo-reward
input Predicate terminal(s; g) indicating the termination of g
input Annealed exploration probabilities ε_g > 0, g ∈ G
 1: Initialize data buffers D_HI ← ∅ and D_g ← ∅, g ∈ G
 2: Initialize subgoal Q-functions Q_g, g ∈ G
 3: for t = 1, ..., T do
 4:   Get a new environment instance with start state s
 5:   Initialize σ ← ∅
 6:   repeat
 7:     s_HI ← s, g ← μ(s) and initialize τ ← ∅
 8:     repeat
 9:       a ← ε_g-greedy(Q_g, s)
10:       Execute a, next state s̃, r̃ ← pseudo(s̃; g)
11:       Update Q_g: a (stochastic) gradient descent step on a minibatch from D_g
12:       Append (s, a, r̃, s̃) to τ and update s ← s̃
13:     until terminal(s; g)
14:     Append (s_HI, g, τ) to σ
15:   until end of episode
16:   Extract τ_FULL and τ_HI from σ
17:   if Inspect_FULL(τ_FULL) = Fail then
18:     D* ← Label_HI(τ_HI)
19:     Process (s_h, g_h, τ_h) ∈ σ in sequence as long as g_h agrees with the expert's choice g*_h in D*:
20:       Append D_{g_h} ← D_{g_h} ∪ τ_h
        Append D_HI ← D_HI ∪ D*
21:   else
22:     Append D_{g_h} ← D_{g_h} ∪ τ_h for all (s_h, g_h, τ_h) ∈ σ
23:   Update meta-controller μ ← Train(μ, D_HI)
Hierarchical guidance impacts how the LO-level learners accumulate experience. As long as the meta-controller's subgoal g agrees with the expert's, the agent's experience of executing subgoal g is added to the experience replay buffer D_g. If the meta-controller selects a "bad" subgoal, the accumulation of experience in the current episode is terminated. This ensures that experience buffers contain only the data from the relevant part of the state space.
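In code, this gating of LO-level experience by HI-level agreement could look like the following sketch (buffer names and the transition format are assumptions consistent with the earlier sketches).

    def accumulate_lo_experience(sigma, hi_labels, replay_buffers):
        """Add LO-level transitions to the per-subgoal replay buffers, but only for
        the prefix of the episode on which the meta-controller's subgoals agree
        with the expert's labels (the relevant part of the state space).

        sigma:          [(s_h, g_h, tau_h), ...], tau_h = [(s, a, r_pseudo, s_next), ...]
        hi_labels:      [(s_h, g*_h), ...] returned by Label_HI, aligned with sigma
        replay_buffers: dict subgoal -> list of transitions (D_g)
        """
        for (s_h, g_h, tau_h), (_, g_star) in zip(sigma, hi_labels):
            if g_h != g_star:
                break                         # a "bad" subgoal: stop accumulating this episode
            replay_buffers[g_h].extend(tau_h)

When the full inspection passes, Algorithm 3 instead keeps the experience from the entire episode (line 22).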
Algorithm 3 assumes access to a real-valued function pseudo(s; g), providing the pseudo-reward in state s when executing g, and a predicate terminal(s; g), indicating the termination (not necessarily successful) of subgoal g. This setup is similar to prior work on hierarchical RL (Kulkarni et al., 2016). One natural definition of pseudo-rewards, based on an additional predicate success(s; g) indicating a successful completion of subgoal g, is as follows:

  pseudo(s; g) =   1   if success(s; g),
                  −1   if ¬success(s; g) and terminal(s; g),
                  −κ   otherwise,
where κ > 0 is a small penalty to encourage short trajectories. The predicates success and terminal are provided by an expert or learned from supervised or reinforcement feedback. In our experiments, we explicitly provide these predicates to both hg-DAgger/Q and hierarchical RL, giving them an advantage over hg-DAgger, which needs to learn when to terminate subpolicies.
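A direct transcription of this pseudo-reward into Python, given the two predicates, might read as follows; the default value of κ is a placeholder, not a tuned constant.

    def pseudo_reward(s, g, success, terminal, kappa=0.01):
        """Pseudo-reward for executing subgoal g in state s.

        success(s, g):  True iff subgoal g has been accomplished in s
        terminal(s, g): True iff execution of g has ended (successfully or not)
        kappa:          small per-step penalty encouraging short trajectories
        """
        if success(s, g):
            return 1.0
        if terminal(s, g):       # terminated without success
            return -1.0
        return -kappa            # still executing the subgoal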
6. Experiments
We evaluate the performance of our algorithms on two separate domains: (i) a simple but challenging maze navigation domain and (ii) the Atari game Montezuma's Revenge.
6.1. Maze Navigation Domain
Task Overview. Figure 1 (left) displays a snapshot of the maze navigation domain. In each episode, the agent encounters a new instance of the maze from a large collection of different layouts. Each maze consists of 16 rooms arranged in a 4-by-4 grid, but the openings between the rooms vary from instance to instance, as does the initial position of the agent and the target. The agent (white dot) needs to navigate from one corner of the maze to the target marked in yellow. Red cells are obstacles (lava), which the agent needs to avoid for survival. The contextual information the agent receives is the pixel representation of a bird's-eye view of the environment, including the partial trail (marked in green) indicating the visited locations.

Due to a large number of random environment instances, this domain is not solvable with tabular algorithms. Note that rooms are not always connected, and the locations of the hallways are not always in the middle of the wall. Primitive actions include going one step up, down, left or right. In addition, each instance of the environment is designed to ensure that there is a path from the initial location to the target, and the shortest path takes at least 45 steps (H_FULL = 100). The agent is penalized with reward −1 if it runs into lava, which also terminates the episode. The agent only receives positive reward upon stepping on the yellow block.
A hierarchical decomposition of the environment corresponds to four possible subgoals of going to the room immediately to the north, south, west, or east, and a fifth possible subgoal, go to target (valid only in the room containing the target). In this setup, H_LO ≈ 5 steps and H_HI ≈ 10–12 steps. The episode terminates after 100 primitive steps if the agent is unsuccessful. The subpolicies and meta-controller use similar neural network architectures and only differ in the number of action outputs. (Details of the network architecture are provided in Appendix B.)
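To make the decomposition concrete, a hypothetical encoding of the maze subgoals and a room-level success check is sketched below; the room-index state representation is our simplification and differs from the pixel input the learner actually receives.

    SUBGOALS = ["north", "south", "west", "east", "go_to_target"]

    # Hypothetical room-level bookkeeping: rooms indexed by (row, col) on the 4-by-4 grid.
    ROOM_OFFSET = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1)}

    def subgoal_success(start_room, current_room, g, target_reached=False):
        """True iff subgoal g is accomplished, judged at the room level."""
        if g == "go_to_target":
            return target_reached
        dr, dc = ROOM_OFFSET[g]
        return current_room == (start_room[0] + dr, start_room[1] + dc)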
Hierarchically Guided IL. We first compare our hierarchical IL algorithms with their flat versions. Algorithm performance is measured by success rate, defined as the average rate of successful task completion over the previous 100 test episodes, on random environment instances not used for training. The cost of each Label operation equals the length of the labeled trajectory, and the cost of each Inspect operation equals 1.
Both h-BC and hg-DAgger outperform flat imitation learners (Figure 2, left). hg-DAgger, in particular, consistently achieves the highest success rate, approaching 100% in fewer than 1000 episodes. Figure 2 (left) displays the median as well as the range from minimum to maximum success rate over 5 random executions of the algorithms.

Expert cost varies significantly between the two hierarchical algorithms. Figure 2 (middle) displays the same success rate, but as a function of the expert cost. hg-DAgger achieves significant savings in expert cost compared to other imitation learning algorithms, thanks to a more efficient use of the LO-level expert through hierarchical guidance. Figure 1 (middle) shows that hg-DAgger requires most of its LO-level labels early in the training and requests primarily HI-level labels after the subgoals have been mastered. As a result, hg-DAgger requires only a fraction of the LO-level labels compared to flat DAgger (Figure 2, right).
Hierarchically Guided IL / RL. We evaluate hg-DAgger/Q with deep double Q-learning (DDQN, Van Hasselt et al., 2016) and prioritized experience replay (Schaul et al., 2015b) as the underlying RL procedure. Each subpolicy learner receives a pseudo-reward of 1 for each successful execution, corresponding to stepping through the correct door (e.g., the door to the north if the subgoal is north), and negative reward for stepping into lava or through other doors.
Figure 1 (right) shows the learning progression of hg-DAgger/Q and suggests two main observations. First, the number of HI-level labels rapidly increases initially and then flattens out after the learner becomes more successful, thanks to the availability of the Inspect_FULL operation. As the hybrid algorithm makes progress and the learning agent passes the Inspect_FULL operation increasingly often, the algorithm starts saving significantly on expert feedback. Second, the number of HI-level labels is higher than for both hg-DAgger and h-BC. Inspect_FULL returns Fail often, especially during the early parts of training. This is primarily due to the slower learning speed of Q-learning at the LO level, requiring more expert feedback at the HI