Iterative Amortized Policy Optimization
Joseph Marino (now at DeepMind, London, UK; correspondence to josephmarino@deepmind.com)
California Institute of Technology
Alexandre Piché
Mila, Université de Montréal
Alessandro Davide Ialongo
University of Cambridge
Yisong Yue
California Institute of Technology
Abstract
Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control, enabling the estimation and sampling of high-value actions. From the variational inference perspective on RL, policy networks, when used with entropy or KL regularization, are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly. However, direct amortized mappings can yield suboptimal policy estimates and restricted distributions, limiting performance and exploration. Given this perspective, we consider the more flexible class of iterative amortized optimizers. We demonstrate that the resulting technique, iterative amortized policy optimization, yields performance improvements over direct amortization on benchmark continuous control tasks. Accompanying code: github.com/joelouismarino/variational_rl.
1 Introduction
Reinforcement learning (RL) algorithms involve policy evaluation and policy optimization [73]. Given a policy, one can estimate the value for each state or state-action pair following that policy, and given a value estimate, one can improve the policy to maximize the value. This latter procedure, policy optimization, can be challenging in continuous control due to instability and poor asymptotic performance. In deep RL, where policies over continuous actions are often parameterized by deep networks, such issues are typically tackled using regularization from previous policies [67, 68] or by maximizing policy entropy [57, 23]. These techniques can be interpreted as variational inference [51], using optimization to infer a policy that yields high expected return while satisfying prior policy constraints. This smooths the optimization landscape, improving stability and performance [3].
However, one subtlety arises: when used with entropy or KL regularization, policy networks perform amortized optimization [26]. That is, rather than optimizing the action distribution, e.g., mean and variance, many deep RL algorithms, such as soft actor-critic (SAC) [31, 32], instead optimize a network to output these parameters, learning to optimize the policy. Typically, this is implemented as a direct mapping from states to action distribution parameters. While such direct amortization schemes have improved the efficiency of variational inference as "encoder" networks [44, 64, 56], they also suffer from several drawbacks: 1) they tend to provide suboptimal estimates [20, 43, 55], yielding a so-called "amortization gap" in performance [20], 2) they are restricted to a single estimate [27], thereby limiting exploration, and 3) they cannot generalize to new objectives, unlike, e.g., gradient-based [36] or gradient-free optimizers [66].
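To make the distinction concrete, the following is a minimal sketch of a direct amortized policy, in the style of SAC: a feedforward network maps the state to Gaussian action-distribution parameters in a single pass. The module names, layer sizes, and clamping range are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a *direct* amortized policy (illustrative, not the paper's code):
# a feedforward network maps the state directly to Gaussian action-distribution
# parameters, so "policy optimization" amounts to optimizing the network weights.
import torch
import torch.nn as nn

class DirectAmortizedPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden_dim, action_dim)      # Gaussian mean
        self.log_std = nn.Linear(hidden_dim, action_dim)   # Gaussian log std. dev.

    def forward(self, state):
        h = self.trunk(state)
        return self.mean(h), self.log_std(h).clamp(-10, 2)  # one estimate per state
```

Each state receives exactly one set of distribution parameters from a single forward pass; improving that estimate requires updating the network weights themselves.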
Inspired by techniques and improvements from variational inference, we investigate iterative amortized policy optimization. Iterative amortization [55] uses gradients or errors to iteratively update the parameters of a distribution. Unlike direct amortization, which receives gradients only after
outputting the distribution, iterative amortization uses these gradients online, thereby learning to iteratively optimize. In generative modeling settings, iterative amortization empirically outperforms direct amortization [55, 54] and can find multiple modes of the optimization landscape [27].
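As a rough sketch of this difference (hypothetical code under assumed interfaces, not the accompanying implementation), an iterative amortized optimizer conditions on the current distribution parameters and the gradients of the objective with respect to them, and outputs parameter refinements over several steps:

```python
# Minimal sketch of *iterative* amortization (illustrative, not the paper's code):
# an update network receives the current Gaussian parameters and the gradients of
# the objective with respect to them, and outputs refinements, over several steps.
import torch
import torch.nn as nn

class IterativeAmortizedOptimizer(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.action_dim = action_dim
        # input: state, current (mean, log_std), and their gradients
        self.update_net = nn.Sequential(
            nn.Linear(state_dim + 4 * action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2 * action_dim),
        )

    def refine(self, state, objective_fn, num_steps=5):
        batch = state.shape[0]
        mean = torch.zeros(batch, self.action_dim)
        log_std = torch.zeros(batch, self.action_dim)
        for _ in range(num_steps):
            # gradients of the objective (e.g., expected value minus KL) w.r.t. params
            m = mean.detach().requires_grad_(True)
            s = log_std.detach().requires_grad_(True)
            obj = objective_fn(state, m, s)
            g_mean, g_log_std = torch.autograd.grad(obj.sum(), [m, s])
            # the update network maps (state, params, gradients) to parameter updates
            inputs = torch.cat([state, mean, log_std, g_mean, g_log_std], dim=-1)
            delta = self.update_net(inputs)
            mean = mean + delta[:, :self.action_dim]
            log_std = log_std + delta[:, self.action_dim:]
        return mean, log_std
```

Because the optimizer conditions on gradients computed online, it can correct its own estimates, and different initializations can yield multiple policy estimates for the same state.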
The contributions of this paper are as follows:
• We propose iterative amortized policy optimization, exploiting a new, fruitful connection between amortized variational inference and policy optimization.
• Using the suite of MuJoCo environments [78, 12], we demonstrate performance improvements over direct amortized policies, as well as more complex flow-based policies.
• We demonstrate novel benefits of this amortization technique: improved accuracy, providing multiple policy estimates, and generalizing to new objectives.
2 Background
2.1 Preliminaries
We consider Markov decision processes (MDPs), where $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$ are the state and action at time $t$, resulting in reward $r_t = r(s_t, a_t)$. Environment state transitions are given by $s_{t+1} \sim p_{\text{env}}(s_{t+1} | s_t, a_t)$, and the agent is defined by a parametric distribution, $p_\theta(a_t | s_t)$, with parameters $\theta$.² The discounted sum of rewards is denoted as $R(\tau) = \sum_t \gamma^t r_t$, where $\gamma \in (0, 1]$ is the discount factor, and $\tau = (s_1, a_1, \ldots)$ is a trajectory. The distribution over trajectories is:

$$p(\tau) = \rho(s_1) \prod_{t=1}^{T-1} p_{\text{env}}(s_{t+1} | s_t, a_t)\, p_\theta(a_t | s_t), \tag{1}$$

where the initial state is drawn from the distribution $\rho(s_1)$. The standard RL objective consists of maximizing the expected discounted return, $\mathbb{E}_{p(\tau)}[R(\tau)]$. For convenience of presentation, we use the undiscounted setting ($\gamma = 1$), though the formulation can be applied with any valid $\gamma$.
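As a concrete reading of Eq. 1, sampling a trajectory and computing its return can be written as a short sketch; the `env` and `policy` objects and their methods (a Gym-style reset/step loop and a state-conditioned action sampler) are assumptions for illustration.

```python
# Sketch of sampling a trajectory tau ~ p(tau) as in Eq. 1 and computing
# R(tau) = sum_t gamma^t r_t. The `env` and `policy` interfaces are assumed
# (a Gym-style reset/step and a state-conditioned action sampler).
def rollout(env, policy, gamma=1.0, max_steps=1000):
    state = env.reset()                      # s_1 ~ rho(s_1)
    trajectory, ret, discount = [], 0.0, 1.0
    for _ in range(max_steps):
        action = policy.sample(state)        # a_t ~ p_theta(a_t | s_t)
        next_state, reward, done, _ = env.step(action)  # s_{t+1} ~ p_env(. | s_t, a_t)
        trajectory.append((state, action, reward))
        ret += discount * reward             # accumulate gamma^t * r_t
        discount *= gamma
        state = next_state
        if done:
            break
    return trajectory, ret
```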
2.2 KL-Regularized Reinforcement Learning
Various works have formulated RL, planning, and control problems in terms of probabilistic inference [21, 8, 79, 77, 11, 51]. These approaches consider the agent-environment interaction as a graphical model, then convert reward maximization into maximum marginal likelihood estimation, learning and inferring a policy that results in maximal reward. This conversion is accomplished by introducing one or more binary observed variables [19], denoted as $\mathcal{O}$ for "optimality" [51], with

$$p(\mathcal{O} = 1 | \tau) \propto \exp(R(\tau) / \alpha),$$

where $\alpha$ is a temperature hyper-parameter. We would like to infer latent variables, $\tau$, and learn parameters, $\theta$, that yield the maximum log-likelihood of optimality, i.e., $\log p(\mathcal{O} = 1)$. Evaluating this likelihood requires marginalizing the joint distribution, $p(\mathcal{O} = 1) = \int p(\tau, \mathcal{O} = 1)\, d\tau$. This involves averaging over all trajectories, which is intractable in high-dimensional spaces. Instead, we can use variational inference to lower bound this objective, introducing a structured approximate posterior distribution:

$$\pi(\tau | \mathcal{O}) = \rho(s_1) \prod_{t=1}^{T-1} p_{\text{env}}(s_{t+1} | s_t, a_t)\, \pi(a_t | s_t, \mathcal{O}). \tag{2}$$
This provides the following lower bound on the objective:
$$\log p(\mathcal{O} = 1) = \log \int p(\mathcal{O} = 1 | \tau)\, p(\tau)\, d\tau \tag{3}$$
$$\geq \mathbb{E}_{\pi(\tau | \mathcal{O})}\left[ \log \frac{p(\mathcal{O} = 1 | \tau)\, p(\tau)}{\pi(\tau | \mathcal{O})} \right] \tag{4}$$
$$= \mathbb{E}_{\pi}\left[ R(\tau) / \alpha \right] - D_{\text{KL}}(\pi(\tau | \mathcal{O}) \,\|\, p(\tau)). \tag{5}$$
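To make the bound concrete, below is a rough Monte Carlo sketch for a single trajectory sampled from $\pi$ (hypothetical code; the policy objects, their distribution outputs, and the temperature $\alpha$ are assumptions). Since the $\rho(s_1)$ and $p_{\text{env}}$ terms are shared between $\pi(\tau|\mathcal{O})$ and $p(\tau)$, the KL term reduces to a sum of per-timestep KL divergences between the variational policy and the prior policy.

```python
# Monte Carlo sketch of the lower bound in Eq. 5 for one trajectory sampled from pi:
# E_pi[R(tau)/alpha] - D_KL(pi(tau|O) || p(tau)). The shared rho(s_1) and p_env terms
# cancel, leaving a per-timestep KL between pi(a_t|s_t, O) and p_theta(a_t|s_t).
import torch

def lower_bound_estimate(states, rewards, variational_policy, prior_policy, alpha=1.0):
    # `variational_policy(s)` and `prior_policy(s)` are assumed to return
    # torch.distributions objects over actions (e.g., Independent Normals).
    ret_term = rewards.sum() / alpha             # R(tau)/alpha (undiscounted case)
    kl_term = 0.0
    for s in states:
        pi_dist = variational_policy(s)
        prior_dist = prior_policy(s)
        kl_term = kl_term + torch.distributions.kl_divergence(pi_dist, prior_dist)
    return ret_term - kl_term
```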
²In this paper, we consider the entropy-regularized case, where $p_\theta(a_t | s_t) = \mathcal{U}(-1, 1)$, i.e., uniform. However, we present the derivation for the KL-regularized case for full generality.