OCEAN: Online Task Inference for Compositional Tasks
with Context Adaptation
Hongyu Ren
Stanford University
Yuke Zhu
The University of Texas at Austin, Nvidia
Jure Leskovec
Stanford University
Anima Anandkumar
California Institute of Technology, Nvidia
Animesh Garg
University of Toronto, Vector Institute, Nvidia
Abstract
Real-world tasks often exhibit a compositional structure that contains a sequence of simpler sub-tasks. For instance, opening a door requires reaching, grasping, rotating, and pulling the door knob. Such compositional tasks require an agent to reason about the sub-task at hand while orchestrating global behavior accordingly. This can be cast as an online task inference problem, where the current task identity, represented by a context variable, is estimated from the agent's past experiences with probabilistic inference. Previous approaches have employed simple latent distributions, e.g., Gaussian, to model a single context for the entire task. However, this formulation lacks the expressiveness to capture the composition and transition of the sub-tasks. We propose a variational inference framework OCEAN to perform online task inference for compositional tasks. OCEAN models global and local context variables in a joint latent space, where the global variables represent a mixture of sub-tasks required for the task, while the local variables capture the transitions between the sub-tasks. Our framework supports flexible latent distributions based on prior knowledge of the task structure and can be trained in an unsupervised manner. Experimental results show that OCEAN provides more effective task inference with sequential context adaptation and thus leads to a performance boost on complex, multi-stage tasks.
1 INTRODUCTION
Meta-reinforcement learning (meta-RL) algorithms aim
to train a versatile agent that quickly adapts to unseen
new tasks after being trained on a domain of related tasks [1, 2, 3, 4, 5]. Meta-RL has shown promise in simple domains, where it requires much less data than training an RL agent from scratch for each task of interest. A key assumption that makes meta-RL desirable is that there exists rich latent structure and relationships among the tasks, implying that solving one task may reuse skills from others. For instance, an agent learning to walk backward may benefit from mastering how to run forward. However, leveraging this structure in task inference and adaptation [6] makes meta-RL much more challenging than RL, albeit with an opportunity to improve the sample efficiency of multi-task learning.
Recent context-based meta-RL models [5, 7] capture the current task with additional latent task variables. Specifically, the agent first stochastically explores the environment and keeps a record of (potentially unordered) transition tuples, which are referred to as contexts. A trained context encoder uses the contexts to infer the latent task variables, which serve as the agent's belief over the current task. Given the latent task variable, the agent makes sequential decisions for the current task. These context-based meta-RL models provide an elegant framework that disentangles probabilistic task inference from decision making, and can be integrated with multiple off-policy learning algorithms [8, 9, 10], demonstrating improvement in both sample efficiency and asymptotic performance.
However, prior art fails in the case of the complex task structures found in most real-world scenarios. Real tasks often require finishing a sequence of sub-tasks, and the current sub-task may only be revealed to the agent after the previous sub-task is finished. For example, door opening requires one to reach, grasp, rotate, and pull or push the door knob. Not only do these sub-tasks need to be carried out in an appropriate order, but each subsequent sub-task is also conditioned on the success of the previous one. This compositional structure is of significant value and needs to be accounted for during task inference.
Figure 1: Our framework OCEAN performs task inference based on the contexts. OCEAN consists of two modules: a global context encoder $q^G_\phi$ and a local context encoder $q^L_\phi$. $q^G_\phi$ leverages contexts from previous episodes to capture the global structure of the task, for example, goal locations and rewards. $q^L_\phi$ (with a recurrent architecture) uses the contexts from previous steps within an episode to perform online sub-task inference; it aims to capture the current sub-task, in this case, which goal is currently activated. The agent takes actions $\mathbf{a}_t$ based on the current state $\mathbf{s}_t$, the current local latent context variable $\mathbf{z}^{\text{Local}}_t$ and the global context variable $\mathbf{z}^{\text{Global}}$.
However, current context-based meta-RL models do not leverage this sophisticated multi-stage structure or the dependencies between sub-tasks, because they model the whole episode of a task with a fixed isotropic Gaussian context variable. As shown in Fig. 1, one task may consist of multiple stages, each with a different goal; a single static latent context variable fails to model this sequential structure. Although posterior estimation may offer slight improvement through recurrence as more contexts accumulate, the following issues remain: (1) latent context variables that are held constant across an episode are not suitable for modeling a sequence of sub-tasks executed in order; (2) isotropic Gaussian random variables are not flexible enough to model mixtures of tasks.
This paper proposes OCEAN, an Online ContExt AdaptatioN framework that addresses the aforementioned issues (the implementation can be found at https://github.com/pairlab/ocean). Our framework is composed of two parts that account for the local sub-task update and the global sub-task mixture, respectively. We introduce a local context encoder with a recurrent architecture, which performs sub-task inference on-the-fly based on the history of contexts within an episode, as shown in the top of Fig. 1. Vitally, the adaptive local context variables instruct the agent to make smooth transitions from one sub-task to another within a single episode. In order to model different sub-task structures, we also introduce a global context encoder that facilitates a flexible latent space with rich prior distributions beyond Gaussian, such as categorical, Dirichlet and logistic-normal distributions, based on domain knowledge of the task structure, as shown in the bottom of Fig. 1. With a joint latent space comprised of the local and global context variables, OCEAN gives rise to a principled task inference framework for context-based meta-reinforcement learning that is able to model complex, real-world tasks. We observe that previous context-based meta-RL models [5, 7] can be viewed as special cases of our general framework in which the global latent space is Gaussian and the local context variables are frozen.
We integrate OCEAN with soft actor-critic [8], an off-policy RL algorithm, for high sample efficiency. The context encoders and the RL agent can be trained jointly to maximize the expected return across all the tasks in meta-training. Note that our framework does not require any labels or supervised signals of either the sub-task mixture or the sub-task transitions when training the global and local encoders. In the meta-test phase, we sample the global context variables for a given new task and stochastically explore the task with both the global context variables and the local context variables, which OCEAN estimates and updates in an online fashion. Task inference becomes more and more accurate with more exploration, allowing the agent to adapt to the new task at meta-test time. We show through a 2D locomotion experiment that designing an appropriate global latent space matters for complex task structures. We further demonstrate that online context adaptation can greatly improve performance in several continuous control tasks with a sequential sub-task structure.
2 PRELIMINARIES
2.1 META-REINFORCEMENT LEARNING
We are interested in meta-reinforcement learning (meta-RL), where we are given a distribution of tasks $p(\mathcal{T})$. Each sample from $p(\mathcal{T})$ is a Markov Decision Process (MDP) $\langle \mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma \rangle$, representing the state space, action space, transition probability distribution, reward function, initial state distribution and discount factor, respectively. We assume that all tasks from $p(\mathcal{T})$ share the same known state and action space, but may differ in transition probability, reward function and initial state distribution, which are unknown but can be sampled from. We refer to the contexts $C$ of a given task $\mathcal{T}$ as a collection of transition tuples sampled from $\mathcal{T}$. Each transition tuple $(\mathbf{s}_i, \mathbf{a}_i, r_i, \mathbf{s}'_i)$ consists of the state $\mathbf{s}$, action $\mathbf{a}$, reward $r$ and next state $\mathbf{s}'$ at a certain timestep. Meta-RL models aim to train a flexible agent $\pi_\theta(\mathbf{a} \mid \mathbf{s})$ (parameterized by $\theta$) on a given set of training tasks sampled from $p(\mathcal{T})$, with the goal that the agent can be adapted quickly to a given set of unseen test tasks sampled from the same distribution.
2.2 CONTEXT-BASED TASK INFERENCE
In order to achieve fast adaptation to a given task, the key is to perform accurate task inference, which can be explicitly captured by latent task variables $p(\mathbf{z} \mid \mathcal{T})$. The RL agent $\pi_\theta(\mathbf{a} \mid \mathbf{s}, \mathbf{z})$ then takes actions based on the current observation (or state) $\mathbf{s}$ and the latent task variables $\mathbf{z} \sim p(\mathbf{z} \mid \mathcal{T})$ in order to make decisions in task $\mathcal{T}$. The latent task variables capture the uncertainty over the task and are crucial to achieving quick adaptation and high performance in meta-RL. However, the true posterior is unknown, since we have no knowledge of the transition probability, reward function and initial state distribution of the task $\mathcal{T}$. Context-based meta-RL models approximate $p(\mathbf{z} \mid \mathcal{T})$ by first collecting a set of contexts $C$ from the task $\mathcal{T}$ and calculating the posterior of the latent context variable $p(\mathbf{z} \mid C)$, where the contexts can essentially be seen as representative samples from the task $\mathcal{T}$. Although $p(\mathbf{z} \mid C)$ is still intractable, we can train an additional context encoder $q_\phi(\mathbf{z} \mid C)$, parameterized by $\phi$, to estimate the true posterior via amortized variational inference. The corresponding evidence lower bound can be derived as

$$\mathbb{E}_{\mathcal{T}}\Big[\mathbb{E}_{\mathbf{z} \sim q_\phi}\big[R(\mathcal{T}, \mathbf{z}) - \beta D_{KL}\big(q_\phi(\mathbf{z} \mid C) \,\|\, p(\mathbf{z})\big)\big]\Big], \qquad (1)$$

where $p(\mathbf{z})$ is the prior distribution that captures our prior knowledge of the task distribution, $R(\mathcal{T}, \mathbf{z})$ is implemented to recover the state-action value functions [5] or the transition functions [7], and $\beta$ is a trade-off hyperparameter that regularizes the capacity. One benefit of context-based meta-RL models is that they disentangle task inference from decision making, and hence a large amount of off-policy data can be used for policy updates, which greatly improves the sample efficiency [5].
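For concreteness, below is a minimal PyTorch sketch of the objective in Eq. 1 for a Gaussian context encoder. This is an illustration only, not the paper's implementation: the encoder architecture, the mean pooling over contexts, and the `task_recovery_loss` callable are assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class GaussianContextEncoder(nn.Module):
    """Maps a set of context tuples (s, a, r, s') to a Gaussian posterior q_phi(z | C)."""
    def __init__(self, context_dim, latent_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(context_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, 2 * latent_dim))

    def forward(self, contexts):                      # contexts: (n, context_dim)
        # Simple mean pooling over contexts for illustration; the paper instead
        # combines per-context posteriors with a product rule (Sec. 3.3).
        mu, log_std = self.net(contexts).mean(dim=0).chunk(2)
        return Normal(mu, log_std.exp())

def neg_elbo(encoder, contexts, task_recovery_loss, beta=1.0):
    """Negative of the objective in Eq. 1 for one task, to be minimized."""
    posterior = encoder(contexts)
    z = posterior.rsample()                           # reparameterized sample z ~ q_phi(z | C)
    prior = Normal(torch.zeros_like(z), torch.ones_like(z))
    kl = kl_divergence(posterior, prior).sum()
    return task_recovery_loss(z) + beta * kl          # R(T, z) is e.g. a critic loss [5]
```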
2.3 SOFT ACTOR-CRITIC
In this paper, we use the soft actor-critic algorithm (SAC) [8, 9] to perform policy updates in order to achieve better sample efficiency. Based on the maximum entropy RL framework [11], SAC is an off-policy actor-critic algorithm that aims to maximize both the expected reward and the causal entropy, explicitly regularizing the policy. Specifically, SAC optimizes the following objectives for the agent $\pi_\theta$, the Q-function approximator $Q_\theta$ and the value function approximator $V_\theta$, using samples stored in a replay buffer $\mathcal{B}$:

$$\mathcal{L}_{critic} = \mathbb{E}_{(\mathbf{s}, \mathbf{a}, \mathbf{s}', r) \sim \mathcal{B}}\big[Q_\theta(\mathbf{s}, \mathbf{a}) - \big(r + V_\theta(\mathbf{s}')\big)\big]^2, \qquad (2)$$

$$\mathcal{L}_{actor} = \mathbb{E}_{\mathbf{s} \sim \mathcal{B}}\left[D_{KL}\left(\pi_\theta(\cdot \mid \mathbf{s}) \,\Big\|\, \frac{\exp(Q_\theta(\mathbf{s}, \cdot))}{Z_\theta(\mathbf{s})}\right)\right]. \qquad (3)$$
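The following is a minimal sketch of these two losses with a latent context $\mathbf{z}$ concatenated to the state, as done in context-based meta-RL. The batch layout, the explicit discounting, and the hypothetical `rsample_with_log_prob` policy helper are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def critic_loss(q_net, value_net, batch, z, gamma=0.99):
    """Eq. 2 with the latent context z appended to the state (standard discounting assumed)."""
    s, a, r, s_next = batch                         # tensors sampled from the replay buffer B
    target = (r + gamma * value_net(torch.cat([s_next, z], dim=-1))).detach()
    return F.mse_loss(q_net(torch.cat([s, z], dim=-1), a), target)

def actor_loss(policy, q_net, s, z, alpha=1.0):
    """Eq. 3 up to a constant: KL(pi || exp(Q)/Z) = E[alpha * log pi(a|s,z) - Q(s,z,a)] + const."""
    sz = torch.cat([s, z], dim=-1)
    a, log_prob = policy.rsample_with_log_prob(sz)  # hypothetical helper: reparameterized action + log-prob
    return (alpha * log_prob - q_net(sz, a)).mean()
```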
3 OCEAN: TASK INFERENCE WITH LATENT CONTEXT VARIABLES

Latent context variables are crucial in context-based meta-RL models, since they represent the belief over the current task. These context variables should therefore be carefully tailored during task inference to take into account the complex compositional structure of real-world tasks. In Sec. 3.1, we introduce local latent context variables and perform online context adaptation and sub-task inference based on the contexts at past steps within an episode. We implement the local context encoder with a recurrent neural network for sequential probabilistic inference. In Sec. 3.2, we use global context variables to capture the global information of a task and propose OCEAN, an Online ContExt AdaptatioN framework for meta-RL with a joint latent space that consists of global and local context variables. In Sec. 3.3, we show that our framework allows flexible latent distributions to suit different prior knowledge of the task/sub-task structure. Finally, in Sec. 3.4, we integrate our task inference framework with SAC for efficient policy updates, and perform end-to-end training of the global and local context encoders as well as the agent.
3.1 ONLINE CONTEXT ADAPTATION
Since real-world tasks often require an agent to finish a sequence of sub-tasks, a single static latent context variable is not sufficient to inform the agent of the transitions in the sequence. To capture the ongoing sub-task and the transitions between sub-tasks, we use local latent context variables $\mathbf{z}^{\text{Local}}$ that are estimated online throughout the episode. To perform online probabilistic inference, we train an additional local context encoder $q^L_\phi$, parameterized by $\phi$, that takes as input the contexts at previous steps and updates the posterior of the local context variable for future steps. Since the local latent contexts at different steps are not independent from each other, we design a recurrent architecture for $q^L_\phi$. To better model the variability of dependencies between the local context variables across different timesteps, we adopt the variational recurrent neural network [12], where the hidden state is conditioned on stochastic samples from the previous posterior. Specifically, the local context encoder $q^L_\phi$ consists of three modules, $q^{enc}_\phi$, $q^{tran}_\phi$ and $q^{prior}_\phi$, which represent the inference function, the transition function and the conditional prior, respectively. Given the context $\mathbf{c}_t$ at timestep $t$, the latent context variable $\mathbf{z}^{\text{Local}}_{t+1}$ at timestep $t+1$ is sampled from the posterior calculated as follows:

$$\mathbf{z}^{\text{Local}}_{t+1} \sim q^{enc}_\phi(\mathbf{z} \mid \mathbf{c}_t, \mathbf{h}_t), \qquad (4)$$

where $\mathbf{h}_t$ denotes the hidden state at timestep $t$. The hidden state is updated according to the following recurrence:

$$\mathbf{h}_t = q^{tran}_\phi(\mathbf{c}_{t-1}, \mathbf{z}^{\text{Local}}_t, \mathbf{h}_{t-1}). \qquad (5)$$
Then the agent takes actions by sampling from $\pi_\theta(\mathbf{a} \mid \mathbf{s}_t, \mathbf{z}^{\text{Local}}_t)$, which is conditioned on the observation as well as the updated local context variable at timestep $t$. Note that we use zero-valued vectors to initialize $\mathbf{h}_0$, and we can directly sample $\mathbf{z}^{\text{Local}}_0$ from a predefined uninformative prior, such as an isotropic Gaussian or a uniform distribution.
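Below is a minimal PyTorch sketch of such a local context encoder under the simplifying assumptions of Gaussian latents and a GRU cell standing in for the full VRNN recurrence. The module names follow the text ($q^{enc}_\phi$, $q^{tran}_\phi$, $q^{prior}_\phi$), but all sizes and layers are illustrative.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class LocalContextEncoder(nn.Module):
    """Sketch of a VRNN-style local context encoder implementing Eqs. 4-5 (illustrative)."""
    def __init__(self, context_dim, latent_dim, hidden_dim=64):
        super().__init__()
        self.q_enc = nn.Linear(context_dim + hidden_dim, 2 * latent_dim)   # posterior head
        self.q_prior = nn.Linear(hidden_dim, 2 * latent_dim)               # conditional prior head
        self.q_tran = nn.GRUCell(context_dim + latent_dim, hidden_dim)     # recurrent transition
        self.hidden_dim = hidden_dim

    def init_hidden(self, batch_size):
        return torch.zeros(batch_size, self.hidden_dim)                    # h_0 = 0

    def _gaussian(self, params):
        mu, log_std = params.chunk(2, dim=-1)
        return Normal(mu, log_std.exp())

    def step(self, c_t, h_t):
        """Given context c_t and hidden state h_t, return the posterior over z_{t+1},
        the conditional prior q_prior(h_t), a sample of z_{t+1}, and the next hidden state."""
        posterior = self._gaussian(self.q_enc(torch.cat([c_t, h_t], dim=-1)))  # Eq. 4
        prior = self._gaussian(self.q_prior(h_t))
        z_next = posterior.rsample()
        h_next = self.q_tran(torch.cat([c_t, z_next], dim=-1), h_t)            # Eq. 5
        return posterior, prior, z_next, h_next
```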
Since we aim to capture the dependency between the current and past timesteps, the corresponding prior distribution of $\mathbf{z}^{\text{Local}}_t$ is conditioned on the previous hidden state, $p(\mathbf{z}^{\text{Local}}_t) = q^{prior}_\phi(\mathbf{h}_{t-1})$, rather than a fixed uninformative prior, which would neglect the temporal structure of the posteriors at different steps.
Finally, the loss of the local context encoder $q^L_\phi$ is defined by replacing the KL term in Eq. 1 as follows:

$$D^{\text{Local}}_{KL} = \sum_t D_{KL}\big(q^{enc}_\phi(\mathbf{z} \mid \mathbf{c}_t, \mathbf{h}_t) \,\|\, q^{prior}_\phi(\mathbf{h}_t)\big), \qquad (6)$$

which takes the sum of the KL losses at all timesteps and is optimized during meta-training. Note that OCEAN does not assume prior knowledge of either the specific transition steps between sub-tasks or the labels of the sub-tasks. Using the variational inference framework, our model is able to discover the sub-task structure in an unsupervised manner. At meta-test time, we fix the local context encoder $q^L_\phi$; the agent first samples from the prior distribution and then executes the policy, collects the context and infers the posterior of the local context variable in a recurrent manner. Given a previously unseen test task, the agent explores during the first few steps and then, as more steps are collected, quickly converges to the optimal policy as task inference becomes more and more accurate, thus efficiently adapting to the test task at hand.
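As a short illustration of Eq. 6, the per-step KL terms can be accumulated over a trajectory as sketched below, assuming the encoder interface from the earlier sketch; this is illustrative rather than the released implementation.

```python
import torch
from torch.distributions import kl_divergence

def local_kl_loss(encoder, contexts):
    """Eq. 6: sum over timesteps of KL(q_enc(z | c_t, h_t) || q_prior(h_t)) for one trajectory.
    `contexts` has shape (T, context_dim); `encoder` follows the step() interface sketched above."""
    h = encoder.init_hidden(batch_size=1)
    kl_total = torch.zeros(())
    for c_t in contexts:                                     # iterate over timesteps
        posterior, prior, _, h = encoder.step(c_t.unsqueeze(0), h)
        kl_total = kl_total + kl_divergence(posterior, prior).sum()
    return kl_total                                          # optimized during meta-training
```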
Computation Efficiency. Note that the local context encoder requires stochastic sampling at each step, which is a sequential process. Given a limited computation budget, it can be costly. Although real-world tasks often contain a sequence of sub-tasks, we do not need to update the posterior at every step. We assume that the sub-task at timestep $t$ is very similar to the sub-task at timestep $t+1$, so a local context variable that represents the current sub-task at timestep $t$ is also likely to be accurate enough for decision-making at timestep $t+1$. To reduce the computation overhead, one strategy is to perform posterior estimation on a subset of the timesteps $\{tr, 2tr, \ldots\}$ with a certain temporal resolution $tr$. Specifically, in Eq. 4, $\mathbf{z}^{\text{Local}}_{tr}$ is sampled from $q^{enc}_\phi(\mathbf{z} \mid \mathbf{c}_{0:tr-1}, \mathbf{h}_0)$, where $\mathbf{c}_{0:tr-1}$ denotes the concatenated contexts from timestep $0$ to $tr-1$. The cost of each posterior sampling can then be amortized over $tr$ steps of decision making.
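A rollout loop that amortizes the posterior update in this way could look as sketched below; `policy.act`, `encoder.sample_prior` and `encoder.concat_contexts` are hypothetical helpers, and the gym-style environment API is an assumption.

```python
def rollout_with_temporal_resolution(env, policy, encoder, z_global, tr=5, horizon=200):
    """Refresh the local posterior only every tr environment steps (illustrative sketch)."""
    s = env.reset()
    h = encoder.init_hidden(batch_size=1)
    z_local = encoder.sample_prior()                       # start from the uninformative prior
    context_buffer = []
    for t in range(horizon):
        a = policy.act(s, z_global, z_local)
        s_next, r, done, _ = env.step(a)
        context_buffer.append((s, a, r, s_next))
        if (t + 1) % tr == 0:                              # amortize one posterior update over tr steps
            c = encoder.concat_contexts(context_buffer[-tr:])   # contexts c_{t-tr+1 : t}
            _, _, z_local, h = encoder.step(c, h)
        s = s_next
        if done:
            break
```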
3.2 JOINT LATENT SPACE FOR TASK
INFERENCE
As introduced in Sec. 3.1, given a task, the agent first samples from an uninformative prior and explores the environment at the beginning of the episode in order to infer the local sub-task with $q^L_\phi$. However, these exploration steps may be sub-optimal, since the agent has no knowledge of the task and explores with context variables drawn from the uninformative prior. One limitation of the local context variables is that they fail to make use of past contexts from different trajectories.
Algorithm 1: Meta-Training in OCEAN
Input: $\mathcal{T}_1, \ldots, \mathcal{T}_T$: training tasks sampled from $p(\mathcal{T})$.
Initialize: $\mathcal{B}_i$: replay buffers for each task; $\phi, \theta_\pi, \theta_Q, \theta_V$: parameters in OCEAN; $\alpha_1, \alpha_2, \alpha_3$: learning rates; $K$: number of trajectories collected in each iteration; $B$: number of trajectories sampled in each iteration.
01. While not done do
02.   For $i = 1, \ldots, T$ do
03.     TrajCollect($\mathcal{T}_i$, $\mathcal{B}_i$, $K$)
04.     Sample contexts $C \sim S_c(\mathcal{B}_i)$ and trajectories $B \sim \mathcal{B}_i$
05.     $\mathbf{z}^{\text{Global}} \sim q^G_\phi(\mathbf{z} \mid C)$
06.     Initialize $\mathbf{z}^{\text{Local}} = \{\}$
07.     $\mathcal{L}^i_{KL} = \beta D_{KL}(q^G_\phi(\mathbf{z} \mid C) \,\|\, p(\mathbf{z}))$
08.     For $b = 1, \ldots, B$ do
09.       For $t = 1, 2, \ldots$ do
10.         $\mathbf{z}^{\text{Local}}_t \sim q^L_\phi(\mathbf{z} \mid \mathbf{c}_{t-1}, \mathbf{h}_{t-1})$ and add to $\mathbf{z}^{\text{Local}}$
11.         Update $\mathbf{c}_t = (\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}'_t, r_t)$
12.         $\mathcal{L}^i_{KL} \leftarrow \mathcal{L}^i_{KL} + \beta D_{KL}(q^L_\phi(\mathbf{z} \mid \mathbf{c}_{t-1}, \mathbf{h}_{t-1}) \,\|\, p(\mathbf{z}^{\text{Local}}_t))$
13.     $\mathcal{L}^i_{actor} = \mathcal{L}_{actor}(B, \mathbf{z}^{\text{Global}}, \mathbf{z}^{\text{Local}})$
14.     $\mathcal{L}^i_{critic} = \mathcal{L}_{critic}(B, \mathbf{z}^{\text{Global}}, \mathbf{z}^{\text{Local}})$
15.   $\phi \leftarrow \phi - \alpha_1 \nabla_\phi \sum_i (\mathcal{L}^i_{critic} + \mathcal{L}^i_{KL})$
16.   $\theta_\pi \leftarrow \theta_\pi - \alpha_2 \nabla_{\theta_\pi} \sum_i \mathcal{L}^i_{actor}$
17.   $\theta_Q \leftarrow \theta_Q - \alpha_3 \nabla_{\theta_Q} \sum_i \mathcal{L}^i_{critic}$
18.   $\theta_V \leftarrow \theta_V - \alpha_3 \nabla_{\theta_V} \sum_i \mathcal{L}^i_{critic}$
In this section, we introduce global context variables that leverage the contexts from past trajectories to give the agent a global overview of the task. If we have access to a pool of past contexts $C$ of size $n$ collected from the same task (but not necessarily from the same episode), we can leverage this task information to infer our belief over the global task. Following PEARL [5], we introduce a global context encoder $q^G_\phi(\mathbf{z} \mid \mathbf{c}_i)$ that infers a posterior from each single past context $\mathbf{c}_i \in C$. The posterior of the global context is calculated as a function of these independent posteriors, $q^G_\phi(\mathbf{z} \mid C) = f(q^G_\phi(\mathbf{z} \mid \mathbf{c}_1), \ldots, q^G_\phi(\mathbf{z} \mid \mathbf{c}_n))$ (detailed in Sec. 3.3). With a larger set of past contexts $C$, we can achieve a more accurate global task estimate.

Combining the global and local context variables, we introduce OCEAN with a joint latent space for online task inference in meta-RL. The joint latent space consists of local and global context variables, where the local context variables reason about the sub-task and are updated online, while the global context variables capture the big picture of the task. The global and local context encoders can be trained jointly with the agent, as detailed in Sec. 3.4. For simplicity, we assume the local and global context variables are independent, so the KL term in the objective in Eq. 1 can be decomposed into the sum of two separate terms:

$$D_{KL}\big(q^G_\phi(\mathbf{z} \mid C) \,\|\, p(\mathbf{z})\big) + D^{\text{Local}}_{KL}, \qquad (7)$$

where $p(\mathbf{z})$ is the prior distribution for the global context variables and $D^{\text{Local}}_{KL}$ is defined in Eq. 6.
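A minimal sketch of this decomposed KL objective is given below, assuming Gaussian global latents and the encoder interfaces from the earlier sketches; it is an illustration under those assumptions, not the paper's code.

```python
import torch
from torch.distributions import Normal, kl_divergence

def joint_kl_loss(global_encoder, local_encoder, context_set, trajectory_contexts, beta=1.0):
    """Eq. 7: KL of the global posterior against its prior plus the summed local KL of Eq. 6."""
    global_posterior = global_encoder(context_set)                  # q^G_phi(z | C)
    global_prior = Normal(torch.zeros_like(global_posterior.mean),
                          torch.ones_like(global_posterior.mean))
    kl_global = kl_divergence(global_posterior, global_prior).sum()
    kl_local = local_kl_loss(local_encoder, trajectory_contexts)    # Eq. 6, sketched earlier
    return beta * (kl_global + kl_local)                            # beta as in Eq. 1 / Algorithm 1
```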
Algorithm 2: TrajCollect
Input: $\mathcal{T}$: task; $\mathcal{B}$: replay buffer; $K$: number of trajectories.
Initialize: $C$: context set.
01. For $k = 1, \ldots, K$ do
02.   $\mathbf{z}^{\text{Global}} \sim q^G_\phi(\mathbf{z} \mid C)$
03.   For $t = 1, 2, \ldots$ do
04.     $\mathbf{z}^{\text{Local}}_t \sim q^L_\phi(\mathbf{z} \mid \mathbf{c}_{t-1}, \mathbf{h}_{t-1})$
05.     Roll out policy $\pi_\theta(\mathbf{a} \mid \mathbf{s}, \mathbf{z}^{\text{Global}}, \mathbf{z}^{\text{Local}}_t)$
06.     Update $\mathbf{c}_t = (\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}'_t, r_t)$
07.     Accumulate context $C = C \cup \{\mathbf{c}_t\}$
08. Add $C$ to $\mathcal{B}$
3.3 FLEXIBLE PARAMETERIZATION OF THE
LATENT SPACE
A suitable latent space is critical for accurate task inference. Based on prior knowledge of the task, OCEAN supports flexible parameterization of the latent space for both the global and the local context variables. Besides the Gaussian prior used in PEARL [5], we also design latent spaces with a categorical distribution to model tasks controlled by discrete factors, and with Dirichlet and logistic-normal (logit-normal) distributions to model multi-modal or proportional tasks. While it is straightforward for the local context encoder to infer the posterior for the next step, estimating the global posterior is non-trivial, since we need to consider all the past contexts from the replay buffer and design a suitable function $f$ as introduced in Sec. 3.2.

When estimating the posterior of the global latent context variables, we model each context as an independent factor in order to capture the minimal sufficient information [5]. Specifically, assuming we have $n$ contexts, we first estimate the posterior $q^G_\phi(\mathbf{z} \mid \mathbf{c}_i)$ for each single context $\mathbf{c}_i$ using the global context encoder $q^G_\phi$. Then we model the global latent context variable $q^G_\phi(\mathbf{z} \mid C)$ as the weighted product of the independent posteriors: $q^G_\phi(\mathbf{z} \mid C) \propto \prod_{i=1}^n q^G_\phi(\mathbf{z} \mid \mathbf{c}_i)^{\frac{1}{n}}$. Our framework allows the global latent space to use a Gaussian, categorical, Dirichlet or logit-normal distribution. For all four distributions, the weighted product of the probability density functions (PDFs) is closed under this operation, and thus we can easily calculate the global posterior as shown below.
Gaussian / Logit-normal Distribution. Assume the posterior of context $\mathbf{c}_i$ has parameters $\mu_i$ and $\sigma^2_i$. Then the global posterior has parameters $\mu = \frac{\sum_i \mu_i / \sigma^2_i}{\sum_i 1/\sigma^2_i}$ and $\sigma^2 = \frac{n}{\sum_i 1/\sigma^2_i}$.

Categorical Distribution. Assume the posterior of context $\mathbf{c}_i$ has parameters $(p_{i1}, \ldots, p_{iK})$. The global posterior has parameters $\left(\frac{\sqrt[n]{\prod_i p_{i1}}}{Z}, \ldots, \frac{\sqrt[n]{\prod_i p_{iK}}}{Z}\right)$, where $Z = \sum_j \sqrt[n]{\prod_i p_{ij}}$.

Dirichlet Distribution. Assume the posterior of context $\mathbf{c}_i$ has parameters $(\alpha_{i1}, \ldots, \alpha_{iK})$. The global posterior has parameters $\left(\frac{\sum_i \alpha_{i1}}{n}, \ldots, \frac{\sum_i \alpha_{iK}}{n}\right)$.
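These combination rules are simple enough to compute directly; the small NumPy sketch below implements the three closed-form updates (array shapes and parameter layouts are assumptions).

```python
import numpy as np

def combine_gaussian(mus, sigma2s):
    """mus, sigma2s: arrays of shape (n, K) with per-context means and variances."""
    n = mus.shape[0]
    precision = (1.0 / sigma2s).sum(axis=0)
    sigma2 = n / precision                        # sigma^2 = n / sum_i 1/sigma_i^2
    mu = (mus / sigma2s).sum(axis=0) / precision  # mu = (sum_i mu_i/sigma_i^2) / (sum_i 1/sigma_i^2)
    return mu, sigma2

def combine_categorical(ps):
    """ps: array of shape (n, K) with per-context class probabilities."""
    geo_mean = np.exp(np.log(ps).mean(axis=0))    # n-th root of prod_i p_ik
    return geo_mean / geo_mean.sum()              # renormalize by Z

def combine_dirichlet(alphas):
    """alphas: array of shape (n, K) with per-context concentration parameters."""
    return alphas.mean(axis=0)                    # alpha_k = (1/n) sum_i alpha_ik
```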
After we obtain the posteriors of the global and local contexts, we can use the reparameterization trick to sample from these distributions in order to optimize Eq. 7. Specifically, we use the reparameterization tricks for the Gaussian [13], categorical [14, 15] and Dirichlet [16] distributions. Note that the latent space can be composite and may consist of random variables from different distributions mentioned above, based on prior knowledge of the tasks.
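As an illustration, torch.distributions already exposes reparameterized samplers covering these cases; the sketch below shows them side by side (the softmax construction of the logit-normal sample is a common simplification and an assumption on our part).

```python
import torch
from torch.distributions import Normal, Dirichlet, RelaxedOneHotCategorical

mu, log_std = torch.zeros(5, requires_grad=True), torch.zeros(5, requires_grad=True)
z_gauss = Normal(mu, log_std.exp()).rsample()                    # pathwise gradient [13]

logits = torch.zeros(5, requires_grad=True)
z_cat = RelaxedOneHotCategorical(torch.tensor(1.0), logits=logits).rsample()  # Gumbel-softmax [14, 15]

alpha = torch.ones(5, requires_grad=True)
z_dir = Dirichlet(alpha).rsample()                               # implicit reparameterization [16]

# A logit-normal sample can be obtained as a softmax-transformed Gaussian sample.
z_logitnormal = torch.softmax(Normal(mu, log_std.exp()).rsample(), dim=-1)
```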
3.4 TRAINING WITH OFF-POLICY UPDATE
Sample efficiency is one of the most critical issues in both RL and meta-RL. Following PEARL [5], since our framework disentangles task inference (both global and local) from decision making, the agent in our framework can safely be trained using off-policy RL algorithms. As discussed in Sec. 2.3, we use the soft actor-critic algorithm [8] and extend the definitions of $\pi_\theta$, $Q_\theta$ and $V_\theta$ to additionally take the (global and local) latent context variables as input. Following PEARL, we implement the $R$ term in Eq. 1 to recover the value function, and we additionally design a sampler $S_c$ for context sampling from the replay buffer $\mathcal{B}$ so that the contexts do not diverge too much from those that would be collected under the current parameters. We list the meta-training algorithm of OCEAN in Algorithm 1. To integrate our joint latent space with SAC, at meta-training time we first collect trajectories for each task in each iteration, as defined in Algorithm 2. We then sample contexts and batches of trajectories from the replay buffer, use $q^G_\phi$ to calculate the global latent context variable, and run $q^L_\phi$ to obtain the local context variable for each step in the trajectory batch before we optimize the loss. At meta-test time, we directly roll out our policy while using $q^L_\phi$ to update the local context variable at each step, which is very similar to trajectory collection in Algorithm 2.
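The parameter update in lines 15-18 of Algorithm 1 could be organized as sketched below; the optimizer grouping and update order are assumptions rather than the released code.

```python
def meta_update(enc_opt, critic_opt, actor_opt, loss_critic, loss_actor, loss_kl):
    """One update mirroring lines 15-18 of Algorithm 1: the context encoders receive
    gradients from the critic and KL losses, the policy from the actor loss (sketch)."""
    enc_opt.zero_grad()        # parameters phi of both context encoders
    critic_opt.zero_grad()     # parameters theta_Q and theta_V
    (loss_critic + loss_kl).backward(retain_graph=True)
    enc_opt.step()
    critic_opt.step()

    actor_opt.zero_grad()      # parameters theta_pi (z is typically detached for the actor loss)
    loss_actor.backward()
    actor_opt.step()
```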
4 RELATED WORK
Our work is closely related to meta-learning or learning
to learn in reinforcement learning settings.
Meta-Learning. Meta-learning tries to address the problem of learning to learn [17]. The goal is to leverage existing knowledge and data to endow the model with inductive bias so that, when facing a set of new tasks, the model can quickly adapt to them [18, 19].
Gradient-based Meta-RL. In the context of reinforcement learning, one line of meta-RL models uses gradient-based updates for few-shot adaptation [1, 2, 3, 20]. The objective of meta-training is to find a set of parameters that serves as a good initialization for a wide range of tasks. At meta-test time, a few gradient updates can potentially result in high performance, achieving adaptation to unseen test tasks. Gradient-based meta-RL models do not explicitly infer tasks and hence cannot model the structure of sub-tasks. Another limitation is that this line of work is mostly optimized using on-policy data, making the learning process sample-inefficient.
Meta-RL with Memory. Another line of meta-RL models uses a recurrent network structure for the agent [4, 21]. The goal here is to model previous steps implicitly with the hidden states of the recurrent network. While this is closely related to our work, it does not reason over the uncertainty about the task structure, nor does it perform task inference explicitly. This line of work also demonstrates limited sample efficiency due to the on-policy RL update.
Context-Conditioned Meta-RL. The most related line of work is context-based meta-RL models [5, 7, 22], where the goal is to explicitly perform task inference using contexts. These models represent tasks with latent context variables, and their objective performs context inference as a separate module on top of Q-learning [10]. Task identification is formalized as a variational inference problem, and these models naturally disentangle task inference from decision making by using context-conditioned value functions or agents. Integrated with off-policy algorithms, they achieve significantly higher sample efficiency and asymptotic performance in meta-RL. Given a new task, they adapt quickly through exploration with posterior sampling [23]. However, in this line of work the latent context variables are fixed across an episode, which does not allow for compositional tasks with multi-stage sub-tasks or sub-goals.
Hierarchical RL. Our work is also related to hierarchical RL [24]. Hierarchical RL models generally learn policies with a hierarchical structure, which often results in a high-level and a low-level policy. The high-level policy reasons over the structure of the task and either selects a skill to execute [25] or assigns a goal for the low-level policy [26, 27]. Our work focuses on meta-RL and models the task distribution with global and local context variables rather than directly learning a hierarchical policy [28].
OCEAN is a unified framework for meta-learning in sequential decision making. Our model combines the advantages of both worlds: recurrent memory and latent contexts. We leverage the contexts to infer the task on a global scale while designing a local context encoder with a recurrent structure to perform online context adaptation and infer the local sub-task. In addition, we enable richer context modeling with a flexible parameterization of the latent space. These choices empower the model to learn in complex settings with structured, compositional tasks.

Figure 2: Our framework OCEAN vs. several state-of-the-art baselines on multi-stage tasks. OCEAN achieves significantly better sample efficiency and performance, since it performs accurate online task inference, which fits especially well in the multi-stage setting where each task requires finishing a sequence of sub-tasks.
5 EXPERIMENTS
In this section we aim to investigate the following questions: (1) Can OCEAN succeed in multi-stage meta-RL tasks with online context adaptation? (2) What is the impact of each module in our framework? (3) Do we need to update the posterior within an episode, or is it sufficient to update the context variables by directly resampling from a fixed posterior at each step? (4) Do different choices of global latent space matter given prior knowledge of the task distribution?
5.1 EXPERIMENTAL SETUP
We evaluate the performance of our framework OCEAN on a 2D point-robot navigation task and five simulated continuous-control environments in MuJoCo [29], four of which have a compositional sub-task structure. We provide the details of the tasks as well as the baselines below.

Point robot navigation. An agent in the 2D plane aims to navigate to different goals on the edge of a half circle.

Cheetah-Fwd-Back. A 2D cheetah agent aims to run forward or backward; this environment is not multi-stage and only has 2 tasks.

Cheetah-Multi-Vel. A 2D cheetah agent aims to run at a goal velocity. One task may contain multiple goal velocities, and the steps at which the goal velocity shifts are randomly sampled.

Cheetah-Multi-Direc / Humanoid-Multi-Direc. A 2D cheetah agent / 3D humanoid agent aims to run in a goal direction. One task may contain multiple goal directions, and the steps at which the goal direction shifts are randomly sampled.

Humanoid-Multi-Goal. A 3D humanoid agent aims to run to several goals. One task may contain multiple goals, and the steps at which the goal shifts are randomly sampled.

The MuJoCo environments are adapted from previous meta-RL works [1, 5]; the main difference is that we add multi-stage sub-tasks to each environment. For all environments, we adhere to the following protocol: we sample a fixed number of training and test tasks before training; after the models are trained on the fixed set of training tasks, we evaluate whether they can quickly adapt to the test tasks.
Baselines. We compare OCEAN with several state-of-the-art meta-RL methods, including gradient-based meta-RL models: E-MAML [2] and ProMP [3]; models with a recurrent policy: RL2 [4]; and a context-based meta-RL model: PEARL [5] (the implementations of the baselines can be found at https://github.com/jonasrothfuss/ProMP and https://github.com/katerakelly/oyster). In all experiments, we use the same number of latent variables as PEARL. We aim to show that our online task inference scheme makes better use of the latent variables than using all of them to model global contexts.
5.2 MAIN RESULTS
We first conduct a sanity check on Cheetah-Fwd-Back, where each task is single-stage, and directly compare our model with PEARL. As shown in Fig. 3, our model achieves results comparable to PEARL, even though the context variables do not need to be updated in this setting. Our framework essentially makes learning harder compared to PEARL, which in this setting aligns well with the inductive bias that no multi-stage sub-tasks exist. However, our method is much more flexible in the design of the latent space, and PEARL can be viewed as a special case of our model: if we already have prior knowledge of the task structure, we can reduce our framework to PEARL.

Figure 3: We compare OCEAN with PEARL in a task (Cheetah-Fwd-Back) without multi-stage sub-tasks. Since this task does not require modeling the sub-task structure, we observe that OCEAN is on par with the task-inference baseline.
We further evaluate all the baselines and OCEAN on several complex tasks that consist of sequences of sub-tasks: Cheetah-Multi-Vel, Cheetah-Multi-Direc, Humanoid-Multi-Direc and Humanoid-Multi-Goal. As shown in Fig. 2, our framework OCEAN significantly outperforms all the other baselines in both sample efficiency and asymptotic performance. The final converged performance of the baselines is shown with dashed lines. We observe that RL2 achieves better performance than the gradient-based meta-RL models. The reason is that a policy based on a recurrent model can naturally take the multi-stage structure into account; however, RL2 cannot reason over the uncertainty of the tasks and also does not have global context variables, both of which limit RL2 especially when the task structure is complex, for example in Humanoid-Multi-Direc and Humanoid-Multi-Goal. Although in the 2D cheetah environments RL2 is able to achieve results comparable to or even better than PEARL, in these 3D environments RL2 fails miserably due to its inability to perform probabilistic task inference.
5.3 ABLATION STUDY
Here we conduct several ablative experiments to evaluate
the importance of each component in our model.
Architecture. We first ablate the variational recurrent neural network architecture to investigate the benefit of the stochastic transition function in Eq. 5 and of the dependency between the prior distributions at neighbouring steps. We replace the VRNN module with a standard LSTM [30] as the architecture of the local context encoder $q^L_\phi$, resulting in OCEAN w/ RNN. With an LSTM architecture, the hidden vector $\mathbf{h}_t$ at timestep $t$ no longer depends on the stochastic latent code $\mathbf{z}^{\text{Local}}_t$, and the prior distribution of every $\mathbf{z}^{\text{Local}}_t$ is the same uninformative prior used for $\mathbf{z}^{\text{Local}}_0$. As shown in Fig. 4, on Humanoid-Multi-Direc, OCEAN w/ RNN can still achieve better performance than the PEARL baseline because it is able to adaptively adjust the local context variables, but it is prone to getting stuck in local minima and performs worse than OCEAN with the VRNN architecture, as it does not model the dependencies between steps.

Figure 4: We compare OCEAN with several variants (OCEAN w/ RNN, OCEAN w/o Global, Stochastic PEARL) on Humanoid-Multi-Direc.
Global Latent Space. Next, we show the importance of the global context variables by designing a variant of our framework without them, OCEAN w/o Global. During both meta-training and meta-test, the agent only leverages the local context variables to infer the sub-tasks and updates them online. The drawback is that this variant cannot leverage previously explored contexts (from other episodes) to obtain more information about the task; in other words, the agent has no memory buffer and must always start by exploring randomly in the environment and gradually adapting to the task. In contrast, our model OCEAN can leverage the previous contexts to infer the sub-task structure with a global view, and the global context variables also narrow down the agent's exploration area at the beginning of an episode. As shown in Fig. 4, we observe that OCEAN w/o Global has higher variance in training than OCEAN, since it generally requires more exploration steps before it gradually takes optimal actions.
Update Posterior Locally. Here we investigate the benefit of adaptively updating the posterior online by designing a variant of PEARL with a fixed posterior. PEARL infers the posterior of the context variables, samples the context variables from the posterior at the beginning of an episode and holds them constant across the episode. Thus, PEARL can be viewed as freezing the posterior as well as the sample of global context variables. In order to adapt PEARL to the multi-stage setting, one variant is to resample the context variables at each step from the same posterior, which we call Stochastic PEARL. In this sense, the context variables are still updated "locally". We list the differences in Table 1. The result on Humanoid-Multi-Direc is shown in Fig. 4: Stochastic PEARL behaves poorly in this multi-stage task and barely improves over PEARL. The results demonstrate that naïvely sampling from a fixed posterior does not improve the accuracy of task inference, since it only introduces additional noise instead of effectively reasoning about the compositional structure.

                        PEARL    Stochastic PEARL    OCEAN
    Posterior           Fixed    Fixed               Update
    Context Variables   Fixed    Update              Update

Table 1: Comparison among the three methods. PEARL has a fixed posterior and fixed context variables; Stochastic PEARL, a variant of PEARL, repeatedly samples the context variables from the fixed posterior at each step; OCEAN updates both the posterior and the context variables online.
5.4 INCORPORATE PRIOR KNOWLEDGE OF
TASK STRUCTURE
Here we investigate how to leverage prior knowledge of the task structure by designing an appropriate latent space. We conduct experiments in the 2D point-robot navigation environment, where the objective of the agent is to navigate to a goal distributed on the edge of a half circle. In order to create a multi-modal task distribution, we sample goal locations as interpolations between the two end points of the half circle, where the interpolation weights follow a Dirichlet distribution $\mathrm{Dir}(0.2, 0.2)$. Since we already know the goals (tasks) are Dirichlet-distributed, we can design a latent space with a Dirichlet prior. Note that since this task is not multi-stage, we do not use the local context variables and only investigate the effect of different global latent spaces on performance. For comparison, we also design two alternatives with a Gaussian or categorical distribution as the global latent space. As listed in Table 2, OCEAN with the Dirichlet prior displays the best performance, as it aligns well with the task structure; the categorical distribution, on the other hand, is not expressive enough for this setting. This result demonstrates that it is beneficial to incorporate prior knowledge of the task structure as an inductive bias into the framework for more accurate task inference.
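For illustration, one way to realize such a Dirichlet-weighted goal distribution is sketched below; whether the interpolation is taken in angle or in position, and the half-circle radius, are assumptions on our part.

```python
import numpy as np

rng = np.random.default_rng(0)
radius = 1.0                                  # illustrative half-circle radius

def sample_goal():
    w = rng.dirichlet([0.2, 0.2])             # Dir(0.2, 0.2): mass concentrates near the endpoints
    theta = w[1] * np.pi                      # interpolate the angle between the two endpoints (0 and pi)
    return radius * np.array([np.cos(theta), np.sin(theta)])

goals = np.array([sample_goal() for _ in range(1000)])   # a multi-modal set of goals on the arc
```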
    Return         Gaussian    Categorical    Dirichlet
    Point-Robot    10.0        8.5            11.7

Table 2: The result of OCEAN with different choices of global prior distribution in a point-robot navigation task whose goals follow a Dirichlet distribution.
6 CONCLUSION AND FUTURE WORK
In this paper we propose a generalized online task inference framework for meta-reinforcement learning based on online variational inference. The framework consists of global and local latent context variables estimated by two encoders. The global variables capture the general structure of the tasks and can be tailored to the task if the structure is known to the user. The local variables are updated and estimated online across an episode, which enables the agent to make smooth transitions from one sub-task to another. Extensive experiments on complex, multi-stage continuous control tasks show the superiority of our framework over state-of-the-art meta-RL methods. One exciting future direction is to train the local context encoder with an auxiliary task that predicts the timestep of each sub-task transition; this auxiliary inference may help the agent during the exploration steps and can also improve the accuracy of sub-task inference. Another direction is to utilize the contexts from previous episodes in the training of the local context encoder so that we do not always need to start by sampling local context variables from an uninformative prior.
Acknowledgements
A.G. is a CIFAR AI chair and also acknowledges Vector
Institute for computing support.
J. L. is a Chan Zucker-
berg Biohub investigator.
We gratefully acknowledge the
support of DARPA under Nos.
FA865018C7880 (ASED),
N660011924033 (MCS); ARO under Nos. W911NF-16-1-
0342 (MURI), W911NF-16-1-0171 (DURIP); NSF under Nos.
OAC-1835598 (CINES), OAC-1934578 (HDR), CCF-1918940
(Expeditions), IIS-2030477 (RAPID); Stanford Data Science
Initiative, Wu Tsai Neurosciences Institute, Chan Zuckerberg
Biohub, Amazon, Boeing, Chase, Docomo, Hitachi, Huawei,
JD.com, NVIDIA, Dell. The U.S. Government is authorized
to reproduce and distribute reprints for Governmental purposes
notwithstanding any copyright notation thereon. Any opinions,
findings, and conclusions or recommendations expressed in this
material are those of the authors and do not necessarily reflect
the views, policies, or endorsements, either expressed or im-
plied, of DARPA, NIH, ARO, or the U.S. Government.
References
[1] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-
learning for fast adaptation of deep networks,” in
Interna-
tional Conference on Machine Learning (ICML)
, 2017.
[2] B. C. Stadie, G. Yang, R. Houthooft, X. Chen, Y. Duan,
Y. Wu, P. Abbeel, and I. Sutskever, “Some considerations
on learning to explore via meta-reinforcement learning,”
in
Advances in Neural Information Processing Systems
(NeurIPS)
, 2018.
[3] J. Rothfuss, D. Lee, I. Clavera, T. Asfour, and P. Abbeel,
“Promp: Proximal meta-policy search,” in
International
Conference on Learning Representations (ICLR)
, 2019.
[4] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett,
I. Sutskever, and P. Abbeel, “Rl2: Fast reinforcement
learning via slow reinforcement learning,”
arXiv preprint
arXiv:1611.02779
, 2016.
[5] K. Rakelly, A. Zhou, D. Quillen, C. Finn, and S. Levine,
“Efficient off-policy meta-reinforcement learning via
probabilistic context variables,” in
International Confer-
ence on Machine Learning (ICML)
, 2019.
[6] J. Humplik, A. Galashov, L. Hasenclever, P. A. Ortega,
Y. W. Teh, and N. Heess, “Meta reinforcement learning as
task inference,”
arXiv preprint arXiv:1905.06424
, 2019.
[7] L. Zintgraf, M. Igl, K. Shiarlis, A. Mahajan, K. Hof-
mann, and S. Whiteson, “Variational task embeddings for
fast adapta-tion in deep reinforcement learning,” in
Inter-
national Conference on Learning Representations Work-
shop (ICLRW)
, 2019.
[8] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft
actor-critic: Off-policy maximum entropy deep reinforce-
ment learning with a stochastic actor,” in
International
Conference on Machine Learning (ICML)
, 2018.
[9] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha,
J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel,
et al.
,
“Soft actor-critic algorithms and applications,”
arXiv
preprint arXiv:1812.05905
, 2018.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Ve-
ness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K.
Fidjeland, G. Ostrovski,
et al.
, “Human-level control
through deep reinforcement learning,”
Nature
, 2015.
[11] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey,
“Maximum entropy inverse reinforcement learning.,” in
AAAI Conference on Artificial Intelligence (AAAI)
, 2008.
[12] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville,
and Y. Bengio, “A recurrent latent variable model for se-
quential data,” in
Advances in Neural Information Pro-
cessing Systems (NeurIPS)
, 2015.
[13] D. P. Kingma and M. Welling, “Auto-encoding variational
bayes,” in
International Conference on Learning Repre-
sentations (ICLR)
, 2014.
[14] E. Jang, S. Gu, and B. Poole, “Categorical reparameter-
ization with gumbel-softmax,” in
International Confer-
ence on Learning Representations (ICLR)
, 2017.
[15] C. J. Maddison, A. Mnih, and Y. W. Teh, “The concrete
distribution: A continuous relaxation of discrete random
variables,” in
International Conference on Learning Rep-
resentations (ICLR)
, 2017.
[16] M. Figurnov, S. Mohamed, and A. Mnih, “Implicit repa-
rameterization gradients,” in
Advances in Neural Infor-
mation Processing Systems (NeurIPS)
, 2018.
[17] J. Schmidhuber, Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.
[18] S. Thrun and L. Pratt, “Learning to learn: Introduction
and overview,” in
Learning to learn
, Springer, 1998.
[19] Y. Bengio, S. Bengio, and J. Cloutier,
Learning a synaptic
learning rule
. Citeseer, 1990.
[20] T. Xu, Q. Liu, L. Zhao, and J. Peng, “Learning to explore
via meta-policy gradient,” in
International Conference on
Machine Learning (ICML)
, 2018.
[21] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer,
J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and
M. Botvinick, “Learning to reinforcement learn,”
arXiv
preprint arXiv:1611.05763
, 2016.
[22] R. Fakoor, P. Chaudhari, S. Soatto, and A. J. Smola,
“Meta-Q-Learning,” in
International Conference on
Learning Representations (ICLR)
, 2020.
[23] M. Strens, “A Bayesian framework for reinforcement
learning,” in
International Conference on Machine
Learning (ICML)
, 2000.
[24] T. G. Dietterich, “Hierarchical reinforcement learning
with the MAXQ value function decomposition,”
Journal
of artificial intelligence research
, 2000.
[25] C. Florensa, Y. Duan, and P. Abbeel, “Stochastic neu-
ral networks for hierarchical reinforcement learning,” in
International Conference on Learning Representations
(ICLR)
, 2017.
[26] S. Li, R. Wang, M. Tang, and C. Zhang, “Hierarchical re-
inforcement learning with advantage-based auxiliary re-
wards,” in
Advances in Neural Information Processing
Systems (NeurIPS)
, 2019.
[27] O. Nachum, S. S. Gu, H. Lee, and S. Levine,
“Data-efficient hierarchical reinforcement learning,” in
Advances in Neural Information Processing Systems
(NeurIPS)
, 2018.
[28] T. Haarnoja, K. Hartikainen, P. Abbeel, and S. Levine,
“Latent space policies for hierarchical reinforcement
learning,” in
International Conference on Machine
Learning (ICML)
, 2018.
[29] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics
engine for model-based control,” in
2012 IEEE/RSJ In-
ternational Conference on Intelligent Robots and Systems
(IROS)
, 2012.
[30] S. Hochreiter and J. Schmidhuber, “Long short-term
memory,”
Neural computation
, 1997.