Hierarchical Imitation and Reinforcement Learning
Hoang M. Le¹, Nan Jiang², Alekh Agarwal², Miroslav Dudík², Yisong Yue¹, Hal Daumé III³ ²
Abstract
We study how to effectively leverage expert feedback to learn sequential decision-making policies. We focus on problems with sparse rewards and long time horizons, which typically pose significant challenges in reinforcement learning. We propose an algorithmic framework, called hierarchical guidance, that leverages the hierarchical structure of the underlying problem to integrate different modes of expert interaction. Our framework can incorporate different combinations of imitation learning (IL) and reinforcement learning (RL) at different levels, leading to dramatic reductions in both expert effort and cost of exploration. Using long-horizon benchmarks, including Montezuma's Revenge, we demonstrate that our approach can learn significantly faster than hierarchical RL, and be significantly more label-efficient than standard IL. We also theoretically analyze labeling cost for certain instantiations of our framework.
1. Introduction
Learning good agent behavior from reward signals alone—the goal of reinforcement learning (RL)—is particularly difficult when the planning horizon is long and rewards are sparse. One successful method for dealing with such long horizons is imitation learning (IL) (Abbeel & Ng, 2004; Daumé et al., 2009; Ross et al., 2011; Ho & Ermon, 2016), in which the agent learns by watching and possibly querying an expert. One limitation of existing imitation learning approaches is that they may require a large amount of demonstration data in long-horizon problems.

The central question we address in this paper is: when experts are available, how can we most effectively leverage their feedback?
¹California Institute of Technology, Pasadena, CA. ²Microsoft Research, New York, NY. ³University of Maryland, College Park, MD. Correspondence to: Hoang M. Le <hmle@caltech.edu>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).
A common strategy to improve sample efficiency in RL over long time horizons is to exploit hierarchical structure of the problem (Sutton et al., 1998; 1999; Kulkarni et al., 2016; Dayan & Hinton, 1993; Vezhnevets et al., 2017; Dietterich, 2000). Our approach leverages hierarchical structure in imitation learning. We study the case where the underlying problem is hierarchical and subtasks can be easily elicited from an expert. Our key design principle is an algorithmic framework called hierarchical guidance, in which feedback (labels) from the high-level expert is used to focus (guide) the low-level learner. The high-level expert ensures that low-level learning only occurs when necessary (when subtasks have not been mastered) and only over relevant parts of the state space. This differs from a naïve hierarchical approach which merely gives a subtask decomposition. Focusing on relevant parts of the state space speeds up learning (improves sample efficiency), while omitting feedback on the already mastered subtasks reduces expert effort (improves label efficiency).
We begin by formalizing the problem of hierarchical imitation learning (Section 3) and carefully separate out cost structures that naturally arise when the expert provides feedback at multiple levels of abstraction. We first apply hierarchical guidance to IL, derive hierarchically guided variants of behavior cloning and DAgger (Ross et al., 2011), and theoretically analyze the benefits (Section 4). We next apply hierarchical guidance to the hybrid setting with high-level IL and low-level RL (Section 5). This architecture is particularly suitable in settings where we have access to high-level semantic knowledge, the subtask horizon is sufficiently short, but the low-level expert is too costly or unavailable. We demonstrate the efficacy of our approaches on a simple but extremely challenging maze domain, and on Montezuma's Revenge (Section 6). Our experiments show that incorporating a modest amount of expert feedback can lead to dramatic improvements in performance compared to pure hierarchical RL.¹
2. Related Work
For brevity, we provide here a short overview of related work, and defer to Appendix C for additional discussion.

¹Code and experimental setups are available at https://sites.google.com/view/hierarchical-il-rl
Imitation Learning. One can broadly dichotomize IL into passive collection of demonstrations (behavioral cloning) versus active collection of demonstrations. The former setting (Abbeel & Ng, 2004; Ziebart et al., 2008; Syed & Schapire, 2008; Ho & Ermon, 2016) assumes that demonstrations are collected a priori and the goal of IL is to find a policy that mimics the demonstrations. The latter setting (Daumé et al., 2009; Ross et al., 2011; Ross & Bagnell, 2014; Chang et al., 2015; Sun et al., 2017) assumes an interactive expert that provides demonstrations in response to actions taken by the current policy. We explore extensions of both approaches to hierarchical settings.
Hierarchical Reinforcement Learning. Several RL approaches to learning hierarchical policies have been explored, foremost among them the options framework (Sutton et al., 1998; 1999; Fruit & Lazaric, 2017). It is often assumed that a useful set of options is fully defined a priori, and (semi-Markov) planning and learning only occurs at the higher level. In comparison, our agent does not have direct access to policies that accomplish such subgoals and has to learn them via expert or reinforcement feedback. The closest hierarchical RL work to ours is that of Kulkarni et al. (2016), which uses a similar hierarchical structure, but no high-level expert and hence no hierarchical guidance.
Combining Reinforcement and Imitation Learning. The idea of combining IL and RL is not new (Nair et al., 2017; Hester et al., 2018). However, previous work focuses on flat policy classes that use IL as a "pre-training" step (e.g., by pre-populating the replay buffer with demonstrations). In contrast, we consider feedback at multiple levels for a hierarchical policy class, with different levels potentially receiving different types of feedback (i.e., imitation at one level and reinforcement at the other). Somewhat related to our hierarchical expert supervision is the approach of Andreas et al. (2017), which assumes access to symbolic descriptions of subgoals, without knowing what those symbols mean or how to execute them. Previous literature has not focused much on comparisons of sample complexity between IL and RL, with the exception of the recent work of Sun et al. (2017).
3. Hierarchical Formalism
For simplicity, we consider environments with a natural two-level hierarchy; the HI level corresponds to choosing subtasks, and the LO level corresponds to executing those subtasks. For instance, an agent's overall goal may be to leave a building. At the HI level, the agent may first choose the subtask "go to the elevator," then "take the elevator down," and finally "walk out." Each of these subtasks needs to be executed at the LO level by actually navigating the environment, pressing buttons on the elevator, etc.²
Subtasks, which we also call subgoals, are denoted as g ∈ G, and the primitive actions are denoted as a ∈ A. An agent (also referred to as learner) acts by iteratively choosing a subgoal g, carrying it out by executing a sequence of actions a until completion, and then picking a new subgoal. The agent's choices can depend on an observed state s ∈ S.³ We assume that the horizon at the HI level is H_HI, i.e., a trajectory uses at most H_HI subgoals, and the horizon at the LO level is H_LO, i.e., after at most H_LO primitive actions, the agent either accomplishes the subgoal or needs to decide on a new subgoal. The total number of primitive actions in a trajectory is thus at most H_FULL := H_HI · H_LO.
The hierarchical learning problem is to simultaneously learn a HI-level policy μ : S → G, called the meta-controller, as well as the subgoal policies π_g : S → A for each g ∈ G, called subpolicies. The aim of the learner is to achieve a high reward when its meta-controller and subpolicies are run together. For each subgoal g, we also have a (possibly learned) termination function β_g : S → {True, False}, which terminates the execution of π_g. The hierarchical agent behaves as follows:
1: for h_HI = 1 ... H_HI do
2:   observe state s and choose subgoal g ← μ(s)
3:   for h_LO = 1 ... H_LO do
4:     observe state s
5:     if β_g(s) then break
6:     choose action a ← π_g(s)
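To make the control flow concrete, the following minimal Python sketch mirrors the pseudocode above; the environment interface (env.observe, env.step, env.episode_over) and the way policies and termination functions are passed in are illustrative assumptions, not the paper's implementation.

    def run_hierarchical_episode(env, meta_controller, subpolicies, terminations,
                                 H_HI, H_LO):
        """Roll out a two-level hierarchical policy for one episode.

        meta_controller: callable state -> subgoal g
        subpolicies:     dict g -> callable state -> primitive action (pi_g)
        terminations:    dict g -> callable state -> bool (beta_g)
        """
        sigma = []                            # hierarchical trajectory [(s_h, g_h, tau_h), ...]
        for _ in range(H_HI):
            s_hi = env.observe()
            g = meta_controller(s_hi)         # HI level: choose subgoal
            tau = []                          # LO-level trajectory for this subgoal
            for _ in range(H_LO):
                s = env.observe()
                if terminations[g](s):        # beta_g(s): subgoal execution ends
                    break
                a = subpolicies[g](s)         # LO level: choose primitive action
                env.step(a)
                tau.append((s, a))
            sigma.append((s_hi, g, tau))
            if env.episode_over():
                break
        return sigma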
The execution of each subpolicy π_g generates a LO-level trajectory τ = (s_1, a_1, ..., s_H, a_H, s_{H+1}) with H ≤ H_LO.⁴ The overall behavior results in a hierarchical trajectory σ = (s_1, g_1, τ_1, s_2, g_2, τ_2, ...), where the last state of each LO-level trajectory τ_h coincides with the next state s_{h+1} in σ and the first state of the next LO-level trajectory τ_{h+1}. The subsequence of σ which excludes the LO-level trajectories τ_h is called the HI-level trajectory, τ_HI := (s_1, g_1, s_2, g_2, ...). Finally, the full trajectory, τ_FULL, is the concatenation of all the LO-level trajectories.
²An important real-world application is goal-oriented dialogue systems. For instance, a chatbot assisting a user with reservation and booking for flights and hotels (Peng et al., 2017; El Asri et al., 2017) needs to navigate through multiple turns of conversation. The chatbot developer designs the hierarchy of subtasks, such as ask_user_goal, ask_dates, offer_flights, confirm, etc. Each subtask consists of several turns of conversation. Typically a global state tracker exists alongside the hierarchical dialogue policy to ensure that cross-subtask constraints are satisfied.
³While we use the term state for simplicity, we do not require the environment to be fully observable or Markovian.
⁴The trajectory might optionally include a reward signal after each primitive action, which might either come from the environment, or be a pseudo-reward as we will see in Section 5.
We assume access to an expert, endowed with a meta-controller μ*, subpolicies π*_g, and termination functions β*_g, who can provide one or several types of supervision:
• HierDemo(s): hierarchical demonstration. The expert executes its hierarchical policy starting from s and returns the resulting hierarchical trajectory σ* = (s*_1, g*_1, τ*_1, s*_2, g*_2, τ*_2, ...), where s*_1 = s.
• Label_HI(τ_HI): HI-level labeling. The expert provides a good next subgoal at each state of a given HI-level trajectory τ_HI = (s_1, g_1, s_2, g_2, ...), yielding a labeled data set {(s_1, g*_1), (s_2, g*_2), ...}.
• Label_LO(τ; g): LO-level labeling. The expert provides a good next primitive action towards a given subgoal g at each state of a given LO-level trajectory τ = (s_1, a_1, s_2, a_2, ...), yielding a labeled data set {(s_1, a*_1), (s_2, a*_2), ...}.
• Inspect_LO(τ; g): LO-level inspection. Instead of annotating every state of a trajectory with a good action, the expert only verifies whether a subgoal g was accomplished, returning either Pass or Fail.
• Label_FULL(τ_FULL): full labeling. The expert labels the agent's full trajectory τ_FULL = (s_1, a_1, s_2, a_2, ...) from start to finish, ignoring hierarchical structure, yielding a labeled data set {(s_1, a*_1), (s_2, a*_2), ...}.
• Inspect_FULL(τ_FULL): full inspection. The expert verifies whether the agent's overall goal was accomplished, returning either Pass or Fail.
When the agent learns not only the subpolicies π_g, but also the termination functions β_g, then Label_LO also returns good termination values ω* ∈ {True, False} for each state of τ = (s_1, a_1, ...), yielding a data set {(s_1, a*_1, ω*_1), ...}.
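These interaction modes can be summarized as a single expert interface. The sketch below is a hypothetical Python rendering (class and method names are ours) that only fixes the signatures and return conventions described above.

    class HierarchicalExpert:
        """Hypothetical interface for the expert interaction modes of Section 3."""

        def hier_demo(self, s):
            """HierDemo(s): run the expert's hierarchical policy from s and return
            sigma* = [(s*_h, g*_h, tau*_h), ...]."""
            raise NotImplementedError

        def label_hi(self, tau_hi):
            """Label_HI(tau_HI): the expert's subgoal for each state on the HI-level
            trajectory, as a list [(s_h, g*_h), ...]."""
            raise NotImplementedError

        def label_lo(self, tau, g):
            """Label_LO(tau; g): the expert's action towards subgoal g for each state
            on the LO-level trajectory, as [(s_l, a*_l), ...] (plus termination labels
            omega*_l when termination functions are learned)."""
            raise NotImplementedError

        def inspect_lo(self, tau, g):
            """Inspect_LO(tau; g): True (Pass) iff subgoal g was accomplished."""
            raise NotImplementedError

        def label_full(self, tau_full):
            """Label_FULL(tau_FULL): expert actions along the flat trajectory,
            as [(s_t, a*_t), ...]."""
            raise NotImplementedError

        def inspect_full(self, tau_full):
            """Inspect_FULL(tau_FULL): True (Pass) iff the overall goal was accomplished."""
            raise NotImplementedError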
Although HierDemo and Label can both be generated by the expert's hierarchical policy (μ*, {π*_g}), they differ in the mode of expert interaction. HierDemo returns a hierarchical trajectory executed by the expert, as required for passive IL, and enables a hierarchical version of behavioral cloning (Abbeel & Ng, 2004; Syed & Schapire, 2008). Label operations provide labels with respect to the learning agent's trajectories, as required for interactive IL. Label_FULL is the standard query used in prior work on learning flat policies (Daumé et al., 2009; Ross et al., 2011), and Label_HI and Label_LO are its hierarchical extensions.

Inspect operations are newly introduced in this paper, and form a cornerstone of our interactive hierarchical guidance protocol that enables substantial savings in label efficiency. They can be viewed as "lazy" versions of the corresponding Label operations, requiring less effort. Our underlying assumption is that if the given hierarchical trajectory σ = {(s_h, g_h, τ_h)} agrees with the expert on the HI level, i.e., g_h = μ*(s_h), and the LO-level trajectories pass the inspection, i.e., Inspect_LO(τ_h; g_h) = Pass, then the resulting full trajectory must also pass the full inspection, Inspect_FULL(τ_FULL) = Pass. This means that a hierarchical policy need not always agree with the expert's execution at the LO level to succeed in the overall task.
Algorithm 1 Hierarchical Behavioral Cloning (h-BC)
 1: Initialize data buffers D_HI ← ∅ and D_g ← ∅, g ∈ G
 2: for t = 1, ..., T do
 3:   Get a new environment instance with start state s
 4:   σ* ← HierDemo(s)
 5:   for all (s*_h, g*_h, τ*_h) ∈ σ* do
 6:     Append D_{g*_h} ← D_{g*_h} ∪ τ*_h
 7:     Append D_HI ← D_HI ∪ {(s*_h, g*_h)}
 8: Train subpolicies π_g ← Train(π_g, D_g) for all g
 9: Train meta-controller μ ← Train(μ, D_HI)
Besides algorithmic reasons, the motivation for separating the types of feedback is that different expert queries will typically require different amounts of effort, which we refer to as cost. We assume the costs of the Label operations are C^L_HI, C^L_LO and C^L_FULL, and the costs of the Inspect operations are C^I_LO and C^I_FULL. In many settings, LO-level inspection will require significantly less effort than LO-level labeling, i.e., C^I_LO ≪ C^L_LO. For instance, identifying whether a robot has successfully navigated to the elevator is presumably much easier than labeling an entire path to the elevator.
One reasonable cost model, natural for the environments in our experiments, is to assume that Inspect operations take time O(1) and work by checking the final state of the trajectory, whereas Label operations take time proportional to the trajectory length, which is O(H_HI), O(H_LO) and O(H_HI · H_LO) for our three Label operations.
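Under this cost model, expert effort can be tallied with two helper functions; the unit costs below simply encode the O(1) / O(length) convention and are not measured values.

    def inspect_cost(trajectory):
        """Inspection only checks the final state: O(1) effort."""
        return 1

    def label_cost(trajectory):
        """Labeling annotates every step: effort proportional to trajectory length."""
        return len(trajectory)

    # Example: with H_HI = 10 and H_LO = 5, a full label costs about 50 units,
    # a HI-level label about 10, a LO-level label about 5, and any inspection 1.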
4. Hierarchically Guided Imitation Learning
Hierarchical guidance is an algorithmic design principle in which the feedback from the high-level expert guides the low-level learner in two different ways: (i) the high-level expert ensures that the low-level expert is only queried when necessary (when the subtasks have not been mastered yet), and (ii) low-level learning is limited to the relevant parts of the state space. We instantiate this framework first within passive learning from demonstrations, obtaining hierarchical behavioral cloning (Algorithm 1), and then within interactive imitation learning, obtaining hierarchically guided DAgger (Algorithm 2), our best-performing algorithm.
4.1. Hierarchical Behavioral Cloning (h-BC)
We consider a natural extension of behavioral cloning to the hierarchical setting (Algorithm 1). The expert provides a set of hierarchical demonstrations σ*, each consisting of LO-level trajectories τ*_h = {(s*_ℓ, a*_ℓ)}_{ℓ=1,...,H_LO} as well as a HI-level trajectory τ*_HI = {(s*_h, g*_h)}_{h=1,...,H_HI}. We then run Train (lines 8–9) to find the subpolicies π_g that best predict a*_ℓ from s*_ℓ, and the meta-controller μ that best predicts g*_h from s*_h, respectively. Train can generally be any supervised learning subroutine, such as stochastic optimization for neural networks or some batch training procedure. When termination functions β_g need to be learned as part of the hierarchical policy, the labels ω*_g will be provided by the expert as part of τ*_h = {(s*_ℓ, a*_ℓ, ω*_ℓ)}.⁵ In this setting, hierarchical guidance is automatic, because subpolicy demonstrations only occur in relevant parts of the state space.
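As a rough illustration of how Algorithm 1 translates into code, here is a compact Python sketch; expert.hier_demo follows the hypothetical interface from Section 3, and train_classifier stands in for any supervised Train subroutine.

    from collections import defaultdict

    def hierarchical_behavioral_cloning(expert, sample_env, train_classifier, T):
        """Sketch of Algorithm 1 (h-BC): collect hierarchical demonstrations, then fit
        the meta-controller and each subpolicy with ordinary supervised learning."""
        D_hi = []                         # (state, subgoal) pairs for the meta-controller
        D_lo = defaultdict(list)          # subgoal g -> (state, action) pairs for pi_g

        for _ in range(T):
            s0 = sample_env()                        # new environment instance, start state
            sigma_star = expert.hier_demo(s0)        # [(s*_h, g*_h, tau*_h), ...]
            for s_h, g_h, tau_h in sigma_star:
                D_hi.append((s_h, g_h))
                D_lo[g_h].extend(tau_h)              # tau*_h = [(s, a), ...]

        meta_controller = train_classifier(D_hi)     # predicts g from s
        subpolicies = {g: train_classifier(data) for g, data in D_lo.items()}
        return meta_controller, subpolicies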
Algorithm 2 Hierarchically Guided DAgger (hg-DAgger)
 1: Initialize data buffers D_HI ← ∅ and D_g ← ∅, g ∈ G
 2: Run Hierarchical Behavioral Cloning (Algorithm 1) up to t = T_warm-start
 3: for t = T_warm-start + 1, ..., T do
 4:   Get a new environment instance with start state s
 5:   Initialize σ ← ∅
 6:   repeat
 7:     g ← μ(s)
 8:     Execute π_g, obtain LO-level trajectory τ
 9:     Append (s, g, τ) to σ
10:     s ← the last state in τ
11:   until end of episode
12:   Extract τ_FULL and τ_HI from σ
13:   if Inspect_FULL(τ_FULL) = Fail then
14:     D* ← Label_HI(τ_HI)
15:     Process (s_h, g_h, τ_h) ∈ σ in sequence as long as g_h agrees with the expert's choice g*_h in D*:
16:       if Inspect_LO(τ_h; g_h) = Fail then
17:         Append D_{g_h} ← D_{g_h} ∪ Label_LO(τ_h; g_h)
18:         break
19:     Append D_HI ← D_HI ∪ D*
20:   Update subpolicies π_g ← Train(π_g, D_g) for all g
21:   Update meta-controller μ ← Train(μ, D_HI)
4.2. Hierarchically Guided DAgger (hg-DAgger)
Passive IL, e.g., behavioral cloning, suffers from the distribution mismatch between the learning and execution distributions. This mismatch is addressed by interactive IL algorithms, such as SEARN (Daumé et al., 2009) and DAgger (Ross et al., 2011), where the expert provides correct actions along the learner's trajectories through the operation Label_FULL. A naïve hierarchical implementation would provide correct labels along the entire hierarchical trajectory via Label_HI and Label_LO. We next show how to use hierarchical guidance to decrease LO-level expert costs.
We leverage two HI-level query types: Inspect_LO and Label_HI. We use Inspect_LO to verify whether the subtasks are successfully completed and Label_HI to check whether we are staying in the relevant part of the state space. The details are presented in Algorithm 2, which uses DAgger as the learner on both levels, but the scheme can be adapted to other interactive imitation learners.

⁵In our hierarchical imitation learning experiments, the termination functions are all learned. Formally, the termination signal ω_g can be viewed as part of an augmented action at the LO level.
In each episode, the learner executes the hierarchical policy, including choosing a subgoal (line 7), executing the LO-level trajectories, i.e., rolling out the subpolicy π_g for the chosen subgoal, and terminating the execution according to β_g (line 8). The expert only provides feedback when the agent fails to execute the entire task, as verified by Inspect_FULL (line 13). When Inspect_FULL fails, the expert first labels the correct subgoals via Label_HI (line 14), and only performs LO-level labeling as long as the learner's meta-controller chooses the correct subgoal g_h (line 15), but its subpolicy fails (i.e., when Inspect_LO on line 16 fails). Since all the preceding subgoals were chosen and executed correctly, and the current subgoal is also correct, LO-level learning is in the "relevant" part of the state space. However, since the subpolicy execution failed, it has not been mastered yet. We next analyze the savings in expert cost that result from hierarchical guidance.
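The feedback branch of Algorithm 2 (lines 13–19) can be sketched in Python as follows; the expert object follows the hypothetical interface from Section 3, and the buffer layout is an illustrative choice.

    def hierarchical_guidance_step(sigma, expert, D_hi, D_lo):
        """Sketch of hg-DAgger's expert-feedback branch (Algorithm 2, lines 13-19).

        sigma: learner's hierarchical trajectory [(s_h, g_h, tau_h), ...]
        D_hi:  list of (state, subgoal) labels for the meta-controller
        D_lo:  dict subgoal -> list of (state, action) labels for its subpolicy
        """
        tau_full = [step for (_, _, tau_h) in sigma for step in tau_h]
        if expert.inspect_full(tau_full):          # overall task succeeded: no labels needed
            return

        tau_hi = [(s_h, g_h) for (s_h, g_h, _) in sigma]
        hi_labels = expert.label_hi(tau_hi)        # [(s_h, g*_h), ...]
        D_hi.extend(hi_labels)

        # Walk the episode's subgoals in order, staying in the "relevant" prefix.
        for (s_h, g_h, tau_h), (_, g_star) in zip(sigma, hi_labels):
            if g_h != g_star:
                break                              # wrong subgoal chosen: stop LO-level processing
            if not expert.inspect_lo(tau_h, g_h):  # right subgoal, but failed execution
                D_lo[g_h].extend(expert.label_lo(tau_h, g_h))
                break                              # later subgoals start from off-distribution states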
Theoretical Analysis. We analyze the cost of hg-DAgger in comparison with flat DAgger under somewhat stylized assumptions. We assume that the learner aims to learn the meta-controller μ from some policy class M, and subpolicies π_g from some class Π_LO. The classes M and Π_LO are finite (but possibly exponentially large) and the task is realizable, i.e., the expert's policies can be found in the corresponding classes: μ* ∈ M, and π*_g ∈ Π_LO, g ∈ G. This allows us to use the halving algorithm (Shalev-Shwartz et al., 2012) as the online learner on both levels. (The implementation of our algorithm does not require these assumptions.)
The halving algorithm maintains a version space over policies, acts by a majority decision, and when it makes a mistake, it removes all the erring policies from the version space. In the hierarchical setting, it therefore makes at most log|M| mistakes on the HI level, and at most log|Π_LO| mistakes when learning each π_g. The mistake bounds can be further used to upper bound the total expert cost in both hg-DAgger and flat DAgger. To enable an apples-to-apples comparison, we assume that flat DAgger learns over the policy class Π_FULL = {(μ, {π_g}_{g∈G}) : μ ∈ M, π_g ∈ Π_LO}, but is otherwise oblivious to the hierarchical task structure.
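For reference, a generic halving learner over a finite policy class might look like the sketch below; representing policies as callables returning hashable labels is an assumption made for illustration.

    from collections import Counter

    class HalvingLearner:
        """Halving algorithm over a finite policy class: predict by majority vote of
        the current version space; on a revealed mistake, discard every policy that
        erred. Under realizability it makes at most log2(|policy_class|) mistakes."""

        def __init__(self, policy_class):
            self.version_space = list(policy_class)    # surviving hypotheses

        def predict(self, s):
            votes = Counter(pi(s) for pi in self.version_space)
            return votes.most_common(1)[0][0]          # majority decision

        def update(self, s, correct_label):
            # Called when the expert reveals the correct label for state s.
            self.version_space = [pi for pi in self.version_space
                                  if pi(s) == correct_label]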
The bounds depend on the cost of performing different types of operations, as defined at the end of Section 3. We consider a modified version of flat DAgger that first calls Inspect_FULL, and only requests labels (Label_FULL) if the inspection fails. The proofs are deferred to Appendix A.
Theorem 1. Given finite classes M and Π_LO and realizable expert policies, the total cost incurred by the expert in hg-DAgger by round T is bounded by

  T·C^I_FULL + (log₂|M| + |G_opt| log₂|Π_LO|)(C^L_HI + H_HI·C^I_LO) + (|G_opt| log₂|Π_LO|)·C^L_LO,   (1)

where G_opt ⊆ G is the set of the subgoals actually used by the expert, G_opt := μ*(S).
Theorem 2. Given the full policy class Π_FULL = {(μ, {π_g}_{g∈G}) : μ ∈ M, π_g ∈ Π_LO} and a realizable expert policy, the total cost incurred by the expert in flat DAgger by round T is bounded by

  T·C^I_FULL + (log₂|M| + |G| log₂|Π_LO|)·C^L_FULL.   (2)
Both bounds have the same leading term, T·C^I_FULL, the cost of full inspection, which is incurred every round and can be viewed as the "cost of monitoring." In contrast, the remaining terms can be viewed as the "cost of learning" in the two settings, and include terms coming from their respective mistake bounds. The ratio of the cost of hierarchically guided learning to that of flat learning is then bounded as

  (Eq. (1) − T·C^I_FULL) / (Eq. (2) − T·C^I_FULL) ≤ (C^L_HI + H_HI·C^I_LO + C^L_LO) / C^L_FULL,   (3)
where we applied the upper bound |G_opt| ≤ |G|. The savings thanks to hierarchical guidance depend on the specific costs. Typically, we expect the inspection costs to be O(1), if it suffices to check the final state, whereas labeling costs scale linearly with the length of the trajectory. The cost ratio is then proportional to (H_HI + H_LO) / (H_HI · H_LO). Thus, we realize the most significant savings if the horizons on each individual level are substantially shorter than the overall horizon. In particular, if H_HI = H_LO = √H_FULL, the hierarchically guided approach reduces the overall labeling cost by a factor of √H_FULL. More generally, whenever H_FULL is large, we reduce the costs of learning by at least a constant factor—a significant gain if this is a saving in the effort of a domain expert.
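As a quick numerical illustration of the bound in Eq. (3) under the O(1)-inspection / O(length)-labeling cost model, the snippet below plugs in horizons of the same order as those in the maze domain of Section 6; this is an illustrative calculation, not a reported result.

    def hg_cost_ratio(H_hi, H_lo):
        """Right-hand side of Eq. (3) with C_L proportional to trajectory length
        and C_I = 1: (C_L_HI + H_HI * C_I_LO + C_L_LO) / C_L_FULL."""
        return (H_hi + H_hi * 1 + H_lo) / (H_hi * H_lo)

    print(hg_cost_ratio(H_hi=10, H_lo=10))   # 0.3, a Theta(1/sqrt(H_FULL)) fraction for H_FULL = 100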
5. Hierarchically Guided IL / RL
Hierarchical guidance also applies in the hybrid setting with interactive IL on the HI level and RL on the LO level. The HI-level expert provides the hierarchical decomposition, including the pseudo-reward function for each subgoal,⁶ and is also able to pick a correct subgoal at each step. Similar to hg-DAgger, the labels from the HI-level expert are used not only to train the meta-controller μ, but also to limit the LO-level learning to the relevant part of the state space. In Algorithm 3 we provide the details, with DAgger on the HI level and Q-learning on the LO level. The scheme can be adapted to other interactive IL and RL algorithms.
The learning agent proceeds by rolling in with its meta-controller (line 7). For each selected subgoal g, the subpolicy π_g selects and executes primitive actions via the ε_g-greedy rule (lines 9–10), until some termination condition is met. The agent receives some pseudo-reward, also known as intrinsic reward (Kulkarni et al., 2016) (line 10). Upon termination of the subgoal, the agent's meta-controller μ chooses another subgoal and the process continues until the end of the episode, where the involvement of the expert begins. As in hg-DAgger, the expert inspects the overall execution of the learner (line 17), and if it is not successful, the expert provides HI-level labels, which are accumulated for training the meta-controller.

⁶This is consistent with many hierarchical RL approaches, including options (Sutton et al., 1999), MAXQ (Dietterich, 2000), UVFA (Schaul et al., 2015a) and h-DQN (Kulkarni et al., 2016).
Algorithm 3 Hierarchically Guided DAgger / Q-learning (hg-DAgger/Q)
input Function pseudo(s; g) providing the pseudo-reward
input Predicate terminal(s; g) indicating the termination of g
input Annealed exploration probabilities ε_g > 0, g ∈ G
 1: Initialize data buffers D_HI ← ∅ and D_g ← ∅, g ∈ G
 2: Initialize subgoal Q-functions Q_g, g ∈ G
 3: for t = 1, ..., T do
 4:   Get a new environment instance with start state s
 5:   Initialize σ ← ∅
 6:   repeat
 7:     s_HI ← s, g ← μ(s) and initialize τ ← ∅
 8:     repeat
 9:       a ← ε_g-greedy(Q_g, s)
10:       Execute a, next state s̃, r̃ ← pseudo(s̃; g)
11:       Update Q_g: a (stochastic) gradient descent step on a minibatch from D_g
12:       Append (s, a, r̃, s̃) to τ and update s ← s̃
13:     until terminal(s; g)
14:     Append (s_HI, g, τ) to σ
15:   until end of episode
16:   Extract τ_FULL and τ_HI from σ
17:   if Inspect_FULL(τ_FULL) = Fail then
18:     D* ← Label_HI(τ_HI)
19:     Process (s_h, g_h, τ_h) ∈ σ in sequence as long as g_h agrees with the expert's choice g*_h in D*:
20:       Append D_{g_h} ← D_{g_h} ∪ τ_h
        Append D_HI ← D_HI ∪ D*
21:   else
22:     Append D_{g_h} ← D_{g_h} ∪ τ_h for all (s_h, g_h, τ_h) ∈ σ
23:   Update meta-controller μ ← Train(μ, D_HI)
Hierarchical guidance impacts how the LO-level learners accumulate experience. As long as the meta-controller's subgoal g agrees with the expert's, the agent's experience of executing subgoal g is added to the experience replay buffer D_g. If the meta-controller selects a "bad" subgoal, the accumulation of experience in the current episode is terminated. This ensures that experience buffers contain only the data from the relevant part of the state space.
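In code, this gating of LO-level experience by HI-level agreement could look like the following sketch (buffer names and the transition format are assumptions consistent with the earlier sketches).

    def accumulate_lo_experience(sigma, hi_labels, replay_buffers):
        """Add LO-level transitions to the per-subgoal replay buffers, but only for
        the prefix of the episode on which the meta-controller's subgoals agree
        with the expert's labels (the relevant part of the state space).

        sigma:          [(s_h, g_h, tau_h), ...], tau_h = [(s, a, r_pseudo, s_next), ...]
        hi_labels:      [(s_h, g*_h), ...] returned by Label_HI, aligned with sigma
        replay_buffers: dict subgoal -> list of transitions (D_g)
        """
        for (s_h, g_h, tau_h), (_, g_star) in zip(sigma, hi_labels):
            if g_h != g_star:
                break                         # a "bad" subgoal: stop accumulating this episode
            replay_buffers[g_h].extend(tau_h)

When the full inspection passes, Algorithm 3 instead keeps the experience from the entire episode (line 22).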
Algorithm 3 assumes access to a real-valued function pseudo(s; g), providing the pseudo-reward in state s when executing g, and a predicate terminal(s; g), indicating the termination (not necessarily successful) of subgoal g. This setup is similar to prior work on hierarchical RL (Kulkarni et al., 2016). One natural definition of pseudo-rewards, based on an additional predicate success(s; g) indicating a successful completion of subgoal g, is as follows:

  pseudo(s; g) =   1   if success(s; g),
                  −1   if ¬success(s; g) and terminal(s; g),
                  −κ   otherwise,
where κ > 0 is a small penalty to encourage short trajectories. The predicates success and terminal are provided by an expert or learned from supervised or reinforcement feedback. In our experiments, we explicitly provide these predicates to both hg-DAgger/Q and hierarchical RL, giving them an advantage over hg-DAgger, which needs to learn when to terminate subpolicies.
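A direct transcription of this pseudo-reward into Python, given the two predicates, might read as follows; the default value of κ is a placeholder, not a tuned constant.

    def pseudo_reward(s, g, success, terminal, kappa=0.01):
        """Pseudo-reward for executing subgoal g in state s.

        success(s, g):  True iff subgoal g has been accomplished in s
        terminal(s, g): True iff execution of g has ended (successfully or not)
        kappa:          small per-step penalty encouraging short trajectories
        """
        if success(s, g):
            return 1.0
        if terminal(s, g):       # terminated without success
            return -1.0
        return -kappa            # still executing the subgoal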
6. Experiments
We evaluate the performance of our algorithms on two separate domains: (i) a simple but challenging maze navigation domain and (ii) the Atari game Montezuma's Revenge.
6.1. Maze Navigation Domain
Task Overview. Figure 1 (left) displays a snapshot of the maze navigation domain. In each episode, the agent encounters a new instance of the maze from a large collection of different layouts. Each maze consists of 16 rooms arranged in a 4-by-4 grid, but the openings between the rooms vary from instance to instance, as does the initial position of the agent and the target. The agent (white dot) needs to navigate from one corner of the maze to the target marked in yellow. Red cells are obstacles (lava), which the agent needs to avoid for survival. The contextual information the agent receives is the pixel representation of a bird's-eye view of the environment, including the partial trail (marked in green) indicating the visited locations.

Due to a large number of random environment instances, this domain is not solvable with tabular algorithms. Note that rooms are not always connected, and the locations of the hallways are not always in the middle of the wall. Primitive actions include going one step up, down, left or right. In addition, each instance of the environment is designed to ensure that there is a path from the initial location to the target, and the shortest path takes at least 45 steps (H_FULL = 100). The agent is penalized with reward −1 if it runs into lava, which also terminates the episode. The agent only receives positive reward upon stepping on the yellow block.
A hierarchical decomposition of the environment corresponds to four possible subgoals of going to the room immediately to the north, south, west, or east, and a fifth possible subgoal, go to target (valid only in the room containing the target). In this setup, H_LO ≈ 5 steps and H_HI ≈ 10–12 steps. The episode terminates after 100 primitive steps if the agent is unsuccessful. The subpolicies and meta-controller use similar neural network architectures and only differ in the number of action outputs. (Details of the network architecture are provided in Appendix B.)
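To make the decomposition concrete, a hypothetical encoding of the maze subgoals and a room-level success check is sketched below; the room-index state representation is our simplification and differs from the pixel input the learner actually receives.

    SUBGOALS = ["north", "south", "west", "east", "go_to_target"]

    # Hypothetical room-level bookkeeping: rooms indexed by (row, col) on the 4-by-4 grid.
    ROOM_OFFSET = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1)}

    def subgoal_success(start_room, current_room, g, target_reached=False):
        """True iff subgoal g is accomplished, judged at the room level."""
        if g == "go_to_target":
            return target_reached
        dr, dc = ROOM_OFFSET[g]
        return current_room == (start_room[0] + dr, start_room[1] + dc)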
Hierarchically Guided IL. We first compare our hierarchical IL algorithms with their flat versions. Algorithm performance is measured by success rate, defined as the average rate of successful task completion over the previous 100 test episodes, on random environment instances not used for training. The cost of each Label operation equals the length of the labeled trajectory, and the cost of each Inspect operation equals 1.
Both h-BC and hg-DAgger outperform flat imitation learners (Figure 2, left). hg-DAgger, in particular, consistently achieves the highest success rate, approaching 100% in fewer than 1000 episodes. Figure 2 (left) displays the median as well as the range from minimum to maximum success rate over 5 random executions of the algorithms.

Expert cost varies significantly between the two hierarchical algorithms. Figure 2 (middle) displays the same success rate, but as a function of the expert cost. hg-DAgger achieves significant savings in expert cost compared to other imitation learning algorithms, thanks to a more efficient use of the LO-level expert through hierarchical guidance. Figure 1 (middle) shows that hg-DAgger requires most of its LO-level labels early in the training and requests primarily HI-level labels after the subgoals have been mastered. As a result, hg-DAgger requires only a fraction of the LO-level labels compared to flat DAgger (Figure 2, right).
Hierarchically Guided IL / RL. We evaluate hg-DAgger/Q with deep double Q-learning (DDQN, Van Hasselt et al., 2016) and prioritized experience replay (Schaul et al., 2015b) as the underlying RL procedure. Each subpolicy learner receives a pseudo-reward of 1 for each successful execution, corresponding to stepping through the correct door (e.g., the door to the north if the subgoal is north), and negative reward for stepping into lava or through other doors.
Figure 1 (right) shows the learning progression of hg-DAgger/Q and suggests two main observations. First, the number of HI-level labels rapidly increases initially and then flattens out after the learner becomes more successful, thanks to the availability of the Inspect_FULL operation. As the hybrid algorithm makes progress and the learning agent passes the Inspect_FULL operation increasingly often, the algorithm starts saving significantly on expert feedback. Second, the number of HI-level labels is higher than for both hg-DAgger and h-BC. Inspect_FULL returns Fail often, especially during the early parts of training. This is primarily due to the slower learning speed of Q-learning at the LO level, requiring more expert feedback at the HI