Batch Policy Learning under Constraints

Hoang M. Le¹   Cameron Voloshin¹   Yisong Yue¹

¹California Institute of Technology, Pasadena, CA. Correspondence to: Hoang M. Le <hmle@caltech.edu>.
Abstract
When learning policies for real-world domains,
two important questions arise: (i) how to effi-
ciently use pre-collected off-policy, non-optimal
behavior data; and (ii) how to mediate among dif-
ferent competing objectives and constraints. We
thus study the problem of batch policy learning un-
der multiple constraints, and offer a systematic so-
lution. We first propose a flexible meta-algorithm
that admits any batch reinforcement learning and
online learning procedure as subroutines. We then
present a specific algorithmic instantiation and
provide performance guarantees for the main ob-
jective and all constraints. As part of off-policy
learning, we propose a simple method for off-
policy policy evaluation (OPE) and derive PAC-
style bounds. Our algorithm achieves strong em-
pirical results in different domains, including in a
challenging problem of simulated car driving sub-
ject to multiple constraints such as lane keeping
and smooth driving. We also show experimentally
that our OPE method outperforms other popular
OPE techniques on a standalone basis, especially
in a high-dimensional setting.
1. Introduction
We study the problem of policy learning under multiple con-
straints. Contemporary approaches to learning sequential
decision making policies have largely focused on optimizing
some cost objective that is easily reducible to a scalar value
function. However, in many real-world domains, choosing
the right cost function to optimize is often not a straight-
forward task. Frequently, the agent designer faces multiple
competing objectives. For instance, consider the aspirational
task of designing autonomous vehicle controllers: one may
care about minimizing the travel time while making sure
the driving behavior is safe, consistent, or fuel efficient. Indeed, many such real-world applications require the primary
objective function be augmented with an appropriate set of
constraints (Altman, 1999).
Contemporary policy learning research has largely focused
on either online reinforcement learning (RL) with a focus on
exploration, or imitation learning (IL) with a focus on learn-
ing from expert demonstrations. However, many real-world
settings already contain large amounts of pre-collected data
generated by existing policies (e.g., existing driving behav-
ior, power grid control policies, etc.). We thus study the
complementary question: can we leverage this abundant source of (non-optimal) behavior data in order to learn sequential decision making policies with provable guarantees on both the primary objective and constraint satisfaction?
We thus propose and study the problem of batch policy
learning under multiple constraints. Historically, batch RL
is regarded as a subfield of approximate dynamic program-
ming (ADP) (Lange et al., 2012), where a set of transitions
sampled from the existing system is given and fixed. From
an interaction perspective, one can view many online RL
methods (e.g., DDPG (Lillicrap et al., 2016)) as running a
growing batch RL subroutine per round of online RL. In
that sense, batch policy learning is complementary to any
exploration scheme. To the best of our knowledge, the study
of constrained policy learning in the batch setting is novel.
We present an algorithmic framework for learning sequential
decision making policies from off-policy data. We employ
multiple learning reductions to online and supervised learn-
ing, and present an analysis that relates performance in the
reduced procedures to the overall performance with respect
to both the primary objective and constraint satisfaction.
Constrained optimization is a well studied problem in su-
pervised machine learning and optimization. In fact, our ap-
proach is inspired by the work of Agarwal et al. (2018) in the
context of fair classification. In contrast to supervised learn-
ing for classification, batch policy learning for sequential
decision making introduces multiple additional challenges.
First, setting aside the constraints, batch policy learning
itself presents a layer of difficulty, and the analysis is signif-
icantly more complicated. Second, verifying whether the
constraints are satisfied is no longer as straightforward as
passing the training data through the learned classifier. In
sequential decision making, certifying constraint satisfaction amounts to an off-policy policy evaluation problem,
which is a challenging problem and the subject of active
research. In this paper, we develop a systematic approach
to address these challenges, provide a careful error analysis,
and experimentally validate our proposed algorithms. In
summary, our contributions are:
• We formulate the problem of batch policy learning under multiple constraints, and present the first approach of its kind to solve this problem. The definition of constraints is general and can subsume many objectives. Our approach utilizes multi-level learning reductions, and we show how to instantiate it using various batch RL and online learning subroutines. We show that guarantees from the subroutines provably lift to provide end-to-end guarantees on the original constrained batch policy learning problem.
• While leveraging techniques from batch RL as a subroutine, we provide a refined theoretical analysis for general non-linear function approximation that improves upon the previously known sample complexity result (Munos & Szepesvári, 2008).
• To evaluate off-policy learning performance and constraint satisfaction, we propose a simple new technique for off-policy policy evaluation (OPE), which is used as a subroutine in our main algorithm. We show that it is competitive with other OPE methods.
• We validate our algorithm and analysis in two experimental settings. First, a simple navigation domain where we consider a safety constraint. Second, a high-dimensional racing car domain with smooth driving and lane centering constraints.
2. Problem Formulation
We first introduce notation. Let X ⊂ R^d be a bounded and closed d-dimensional state space. Let A be a finite action space. Let c: X × A → [0, C] be the primary objective cost function, bounded by C. Let there be m constraint cost functions, g_i: X × A → [0, G], each bounded by G. To simplify the notation, we view the set of constraints as a vector function g: X × A → [0, G]^m, where g(x,a) is the column vector of individual g_i(x,a). Let p(·|x,a) denote the (unknown) transition/dynamics model that maps state-action pairs to a distribution over the next state. Let γ ∈ (0, 1) denote the discount factor, and let χ be the initial state distribution.
We consider the discounted infinite horizon setting. An MDP is defined by the tuple (X, A, c, g, p, γ, χ). A policy π ∈ Π maps states to actions, i.e., π(x) ∈ A. The value function C^π: X → R corresponding to the primary cost function c is defined in the usual way: C^π(x) = E[ Σ_{t=0}^∞ γ^t c(x_t, a_t) | x_0 = x ], where the expectation is over the randomness of the policy π and the transition dynamics p. We similarly define the vector-valued value function for the constraint costs, G^π: X → R^m, as G^π(x) = E[ Σ_{t=0}^∞ γ^t g(x_t, a_t) | x_0 = x ]. Define C(π) and G(π) as the expectations of C^π(x) and G^π(x), respectively, over the distribution χ of initial states.
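As a concrete illustration of these definitions, the minimal Python sketch below computes Monte Carlo estimates of C(π) and G(π) from sampled trajectories; the helper names and the trajectory data layout are our assumptions, not the paper's. In the batch setting such on-policy rollouts are unavailable, which is what motivates the off-policy evaluation machinery introduced later.

import numpy as np

def discounted_return(costs, gamma):
    # costs: sequence of per-step costs c(x_t, a_t) along one trajectory
    return sum(gamma ** t * c for t, c in enumerate(costs))

def monte_carlo_values(trajectories, gamma):
    # trajectories: list of dicts with 'c' (list of scalars) and 'g' (list of length-m vectors)
    C_hat = np.mean([discounted_return(tr["c"], gamma) for tr in trajectories])
    G_hat = np.mean([sum(gamma ** t * np.asarray(g) for t, g in enumerate(tr["g"]))
                     for tr in trajectories], axis=0)
    return C_hat, G_hat   # Monte Carlo estimates of C(pi) and G(pi)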
2.1. Batch Policy Learning under Constraints
In batch policy learning, we have a pre-collected dataset D = {(x_i, a_i, x'_i, c(x_i, a_i), g_{1:m}(x_i, a_i))}_{i=1}^n, generated from (a set of) historical behavioral policies denoted jointly by π_D. The goal of batch policy learning under constraints is to learn a policy π ∈ Π from D that minimizes the primary objective cost while satisfying m different constraints:

min_{π∈Π} C(π)   s.t.   G(π) ≤ τ                  (OPT)

where G(·) = [g_1(·), ..., g_m(·)]^⊤ and τ ∈ R^m is a vector of known constants. We assume that (OPT) is feasible. However, the dataset D might be generated from multiple policies that violate the constraints.
2.2. Examples of Policy Learning with Constraints
Counterfactual & Safe Policy Learning. In conventional online RL, the agent needs to “re-learn” from scratch when the cost function is modified. Our framework enables counterfactual policy learning assuming the ability to compute the new cost objective from the same historical data. A simple example is safe policy learning (García & Fernández, 2015). Define the safety cost g(x, a) = φ(x, a, c) as a new function of the existing cost c and features associated with the current state-action pair. The goal here is to counterfactually avoid undesirable behaviors observed in the historical data. We experimentally study this safety problem in Section 5. Other examples from the literature that belong to this safety perspective include planning under chance constraints (Ono et al., 2015; Blackmore et al., 2011). The constraint here is G(π) = E[1(x ∈ X_error)] = P(x ∈ X_error) ≤ τ.
Multi-objective Batch Learning.
Traditional policy learn-
ing (RL or IL) presupposes that the agent optimizes a single
cost function. In reality, we may want to satisfy multiple
objectives that are not easily reducible to a scalar objective
function. One example is learning fast driving policies un-
der multiple behavioral constraints such as smooth driving
and lane keeping consistency (see Section 5).
2.3. Equivalence between Constraint Satisfaction and
Regularization
Our constrained policy learning framework accommodates several existing regularized policy learning settings. Regularization typically encodes prior knowledge, and has been used extensively in the RL and IL literature to improve learning performance. Many instances of regularized policy learning can be naturally cast into (OPT):
• Entropy regularized RL (Haarnoja et al., 2017; Ziebart, 2010) maps to a policy-dependent constraint cost g(x) = H(π(·|x)), where H measures conditional entropy.¹
• Conservative policy improvement (Levine & Abbeel, 2014; Schulman et al., 2015; Achiam et al., 2017) is equivalent to G(π) = D_KL(π, π_k), where π_k is some “well-behaving” policy.
• Smooth imitation learning (Le et al., 2016) is equivalent to G(π) = min_{h∈H} Δ(h, π), where H is a class of provably smooth policies and Δ is a divergence metric.
• Regularizing RL with expert demonstration (Hester et al., 2018) is equivalent to G(π) = E[ℓ(π(x), π*(x))], where π* is the expert policy.
We provide further equivalence derivations for the above examples in Appendix A. Of course, some problems are more naturally described from the regularization perspective, and others from the constraint satisfaction perspective.
More generally, one can establish the equivalence between
regularized and constrained policy learning via a simple
appeal to Lagrangian duality as shown in Proposition 2.1
below. This Lagrangian duality also has a game-theoretic
interpretation (Section 5.4 of Boyd & Vandenberghe (2004)),
which serves as an inspiration for developing our approach.
Proposition 2.1. Let Π be a convex set of policies. Let C: Π → R and G: Π → R^K be value functions. Consider the two policy optimization tasks:

Regularization:  min_{π∈Π} C(π) + λ^⊤ G(π)
Constraint:      min_{π∈Π} C(π)  s.t.  G(π) ≤ τ

Assume that Slater's condition is satisfied in the Constraint problem (i.e., ∃π s.t. G(π) < τ). Assume also that the constraint cannot be removed without changing the optimal solution, i.e., inf_{π∈Π} C(π) < inf_{π∈Π: G(π)≤τ} C(π). Then for every λ > 0 there exists τ, and vice versa, such that Regularization and Constraint share the same optimal solutions. (Proof in Appendix A.)
3. Proposed Approach
To make use of strong duality, we first convexify the policy class Π by allowing stochastic combinations of policies, which effectively expands Π into its convex hull Conv(Π). Formally, Conv(Π) contains randomized policies, which we denote π = Σ_{t=1}^T α_t π_t for π_t ∈ Π and Σ_{t=1}^T α_t = 1. Executing a mixed π consists of first sampling one policy π_t from π_{1:T} according to the distribution α_{1:T}, and then executing π_t.

¹ The constraint value function G(π) can be viewed as an expectation over the discounted state visitation distribution. The lack of an explicit discount rate does not interfere with our overall approach.
Algorithm 1 Meta-algo for Batch Constrained Learning
1: for each round t do
2:   π_t ← Best-response(λ_t)
3:   π̂_t ← (1/t) Σ_{t'=1}^t π_{t'},   λ̂_t ← (1/t) Σ_{t'=1}^t λ_{t'}
4:   L_max = max_λ L(π̂_t, λ)
5:   L_min = L(Best-response(λ̂_t), λ̂_t)
6:   if L_max − L_min ≤ ω then
7:     Return π̂_t
8:   end if
9:   λ_{t+1} ← Online-algorithm(π_1, ..., π_{t−1}, π_t)
10: end for
Note that we still have E[π] = Σ_{t=1}^T α_t E[π_t] for any first-moment statistic of interest (e.g., state distribution, expected cost). It is easy to see that the augmented version of (OPT) over Conv(Π) has a solution at least as good as that of the original (OPT). As such, to lighten the notation, we will equate Π with its convex hull for the rest of the paper.
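The following minimal Python sketch (the class and method names are ours, not the paper's) illustrates executing such a mixed policy: one member policy is sampled according to the mixture weights at the start of an episode and then executed thereafter.

import numpy as np

class MixedPolicy:
    """Weighted mixture over base policies (a sketch)."""
    def __init__(self, policies, weights=None):
        self.policies = policies
        n = len(policies)
        self.weights = np.full(n, 1.0 / n) if weights is None else np.asarray(weights)

    def sample_member(self, rng=np.random):
        # Sample one base policy according to alpha_{1:T}; execute it for the rest of the episode.
        idx = rng.choice(len(self.policies), p=self.weights)
        return self.policies[idx]

# usage: pi = mixed.sample_member(); then act with pi(state) until the episode ends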
3.1. Meta-Algorithm
The Lagrangian of (OPT) is L(π, λ) = C(π) + λ^⊤(G(π) − τ) for λ ∈ R^m_+. Clearly (OPT) is equivalent to the min-max problem: min_{π∈Π} max_{λ∈R^m_+} L(π, λ). We assume (OPT) is feasible and that Slater's condition holds (otherwise, we can simply increase the constraint threshold τ by a tiny amount). Slater's condition and policy class convexification ensure that strong duality holds (Boyd & Vandenberghe, 2004), and (OPT) is also equivalent to the max-min problem: max_{λ∈R^m_+} min_{π∈Π} L(π, λ).

Since L(π, λ) is linear in both λ and π (due to the stochastic mixture²), strong duality is also a consequence of von Neumann's celebrated convex-concave minimax theorem for zero-sum games (Von Neumann & Morgenstern, 2007). From a game-theoretic perspective, the problem becomes finding the equilibrium of a two-player game between the π-player and the λ-player (Freund & Schapire, 1999). In this repeated game, the π-player minimizes L(π, λ) given the current λ, and the λ-player maximizes it given the current (mixture over) π.
We first present a meta-algorithm (Algorithm 1) that uses any no-regret online learning algorithm (for λ) and any batch policy optimization procedure (for π). At each iteration, given λ_t, the π-player runs Best-response to get the best response:

Best-response(λ_t) = argmin_{π∈Π} L(π, λ_t) = argmin_{π∈Π} C(π) + λ_t^⊤(G(π) − τ).

This is equivalent to a standard batch reinforcement learning problem where we learn a policy that is optimal with respect to the cost c + λ_t^⊤ g. The corresponding mixed strategy is the uniform distribution over all previous π_t.

² This places no restriction on the individual policies: each individual policy can be non-linear and the cost function can be non-convex.
Algorithm 2 Constrained Batch Policy Learning
Input: Dataset D = {(x_i, a_i, x'_i, c_i, g_i)}_{i=1}^n ∼ π_D. Online algorithm parameters: ℓ1 norm bound B, learning rate η
1: Initialize λ_1 = (B/(m+1), ..., B/(m+1)) ∈ R^{m+1}
2: for each round t do
3:   Learn π_t ← FQI(c + λ_t^⊤ g)      // FQI with cost c + λ_t^⊤ g
4:   Evaluate Ĉ(π_t) ← FQE(π_t, c)      // Algo 3 with π_t, cost c
5:   Evaluate Ĝ(π_t) ← FQE(π_t, g)      // Algo 3 with π_t, cost g
6:   π̂_t ← (1/t) Σ_{t'=1}^t π_{t'}
7:   Ĉ(π̂_t) ← (1/t) Σ_{t'=1}^t Ĉ(π_{t'}),   Ĝ(π̂_t) ← (1/t) Σ_{t'=1}^t Ĝ(π_{t'})
8:   λ̂_t ← (1/t) Σ_{t'=1}^t λ_{t'}
9:   Learn π̃ ← FQI(c + λ̂_t^⊤ g)      // FQI with cost c + λ̂_t^⊤ g
10:  Evaluate Ĉ(π̃) ← FQE(π̃, c),  Ĝ(π̃) ← FQE(π̃, g)
11:  L̂_max = max_{λ: ‖λ‖_1 = B} ( Ĉ(π̂_t) + λ^⊤ [(Ĝ(π̂_t) − τ)^⊤, 0]^⊤ )
12:  L̂_min = Ĉ(π̃) + λ̂_t^⊤ [(Ĝ(π̃) − τ)^⊤, 0]^⊤
13:  if L̂_max − L̂_min ≤ ω then
14:    Return π̂_t
15:  end if
16:  Set z_t = [(Ĝ(π_t) − τ)^⊤, 0]^⊤ ∈ R^{m+1}
17:  λ_{t+1}[i] = B · λ_t[i] e^{−η z_t[i]} / Σ_j λ_t[j] e^{−η z_t[j]}  ∀i      // λ[i] is the i-th coordinate
18: end for
In response to the π-player, the λ-player employs Online-algorithm, which can be any no-regret algorithm satisfying

Σ_t L(π_t, λ_t) ≥ max_λ Σ_t L(π_t, λ) − o(T).

Finally, the algorithm terminates when the estimated primal-dual gap is below a threshold ω (Lines 7-8).
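A minimal Python sketch of this primal-dual loop is given below; it assumes user-supplied best-response, exact Lagrangian evaluation, and online-update routines (cf. assumptions (i) and (ii) discussed next). All names are placeholders of ours, not part of the paper.

import numpy as np

def meta_algorithm(best_response, L, max_L_over_lambda, online_update,
                   lambda_init, omega, T_max):
    """Algorithm 1 sketch: primal-dual loop with a no-regret dual player.
    best_response(lam) -> policy; L(pi, lam) -> float;
    max_L_over_lambda(pi) -> float; online_update(policies) -> new lambda."""
    policies, lambdas = [], [np.asarray(lambda_init, dtype=float)]
    for t in range(1, T_max + 1):
        policies.append(best_response(lambdas[-1]))    # pi-player best response
        pi_hat = ("uniform-mixture", list(policies))   # mixture over pi_1 .. pi_t
        lam_hat = np.mean(lambdas, axis=0)
        L_max = max_L_over_lambda(pi_hat)              # max_lambda L(pi_hat, lambda)
        L_min = L(best_response(lam_hat), lam_hat)
        if L_max - L_min <= omega:                     # empirical duality gap check
            return pi_hat
        lambdas.append(online_update(list(policies)))  # no-regret lambda update
    return ("uniform-mixture", list(policies))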
Leaving aside (for the moment) issues of generalization, Algorithm 1 is guaranteed to converge assuming: (i) Best-response gives the best single policy in the class, and (ii) L_max and L_min can be evaluated exactly.

Proposition 3.1. Assuming (i) and (ii) above, Algorithm 1 is guaranteed to stop, and the convergence rate depends on the regret of Online-algorithm. (Proof in Appendix B.)
3.2. Specific Instantiation of Meta-Algorithm
We now focus on a specific instantiation of Algorithm 1.
Algorithm 2 is our main algorithm in this paper.
Policy Learning. We instantiate Best-response with Fitted Q Iteration (FQI), a model-free off-policy learning approach (Ernst et al., 2005). FQI relies on a series of reductions to supervised learning. The key idea is to approximate the true action-value function Q* by a sequence {Q_k ∈ F}_{k=0}^K, where F is a chosen function class.

In Lines 3 & 9, FQI(c + λ^⊤ g) is defined as follows. With Q_0 initialized randomly, for each k = 1, ..., K, we form a new training dataset D̃_k = {((x_i, a_i), y_i)}_{i=1}^n where

∀i:  y_i = (c_i + λ^⊤ g_i) + γ min_a Q_{k−1}(x'_i, a),

and (x_i, a_i, x'_i, c_i, g_i) ∼ D (the original dataset).
Algorithm 3 Fitted Q Evaluation: FQE(π, c)
Input: Dataset D = {(x_i, a_i, x'_i, c_i)}_{i=1}^n ∼ π_D. Function class F. Policy π to be evaluated
1: Initialize Q_0 ∈ F randomly
2: for k = 1, 2, ..., K do
3:   Compute targets y_i = c_i + γ Q_{k−1}(x'_i, π(x'_i))  ∀i
4:   Build training set D̃_k = {((x_i, a_i), y_i)}_{i=1}^n
5:   Solve a supervised learning problem: Q_k = argmin_{f∈F} (1/n) Σ_{i=1}^n (f(x_i, a_i) − y_i)²
6: end for
Output: Ĉ^π(x) = Q_K(x, π(x))  ∀x
A supervised regression procedure is called to solve for

Q_k = argmin_{f∈F} (1/n) Σ_{i=1}^n ( f(x_i, a_i) − y_i )².

FQI returns the policy π_K = argmin_a Q_K(·, a). FQI has been shown to work well in several empirical domains: spoken dialogue systems (Pietquin et al., 2011), physical robotic soccer (Riedmiller et al., 2009), cart-pole swing-up (Riedmiller, 2005), and clinical treatment (Prasad et al., 2017). Another possible model-free subroutine is Least-Squares Policy Iteration (LSPI) (Lagoudakis & Parr, 2003). One can also consider model-based alternatives (Ormoneit & Sen, 2002).
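To make the reduction concrete, here is a compact Python sketch of FQI on the Lagrangian-weighted cost c + λ^⊤ g, using a generic scikit-learn-style regressor; the data layout, feature construction, and regressor choice are our assumptions, not prescribed by the paper.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor  # any regressor with fit/predict works

def fqi(D, lam, n_actions, gamma, K=50):
    # D: dict of arrays: 'x' [n, d], 'a' [n] (int), 'x_next' [n, d], 'c' [n], 'g' [n, m]
    x, a, x_next = D["x"], D["a"], D["x_next"]
    cost = D["c"] + D["g"] @ lam                      # scalarized cost c + lambda^T g
    phi = np.hstack([x, np.eye(n_actions)[a]])        # simple (state, one-hot action) features
    Q = None
    for _ in range(K):
        if Q is None:
            y = cost                                  # first regression target
        else:
            q_next = np.column_stack([
                Q.predict(np.hstack([x_next, np.tile(np.eye(n_actions)[j], (len(x), 1))]))
                for j in range(n_actions)])
            y = cost + gamma * q_next.min(axis=1)     # Bellman optimality backup
        Q = ExtraTreesRegressor(n_estimators=50).fit(phi, y)
    return Q  # greedy policy: pi_K(x) = argmin_a Q_K(x, a)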
Off-policy Policy Evaluation. A crucial difference between constrained policy learning and existing work on constrained supervised learning is the technical challenge of evaluating the objective and constraints. First, estimating L̂(π, λ) (Lines 11-12) requires estimating Ĉ(π) and Ĝ(π). Second, any gradient-based approach to Online-algorithm requires passing in Ĝ(π) − τ as part of the gradient estimate (Line 16). This is known as the off-policy policy evaluation (OPE) problem: we need to evaluate Ĉ(π) and Ĝ(π) with access only to data D ∼ π_D.

There are three main contemporary approaches to OPE: (i) importance weighting (IS) (Precup et al., 2000; 2001), which is unbiased but often has high variance; (ii) regression-based direct methods (DM), which are typically model-based (Thomas & Brunskill, 2016) and can be biased but have much lower variance than IS; and (iii) doubly robust techniques (Jiang & Li, 2016; Dudík et al., 2011), which combine IS and DM.
We propose a simple model-free technique using function approximation, called Fitted Q Evaluation (FQE). FQE is based on an iterative reduction scheme similar to FQI, but for the problem of off-policy evaluation. Algorithm 3 lays out the steps. The key difference from FQI is that the min operator is replaced by Q_{k−1}(x'_i, π(x'_i)) (Line 3 of Algorithm 3). Since each x'_i comes from the original D and we know π(x'_i), each D̃_k is well-defined. Note that FQE can also be plugged in as the direct method if one wishes to augment the policy evaluation with a doubly-robust technique.
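For completeness, here is a matching Python sketch of FQE under the same assumed data layout and regressor as the FQI sketch above; the only substantive change is the regression target, which bootstraps from Q_{k−1}(x', π(x')) rather than the min over actions.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fqe(D, policy, cost, n_actions, gamma, K=50):
    # cost: per-step cost array [n] (use c for C_hat, or one column of g for G_hat)
    x, a, x_next = D["x"], D["a"], D["x_next"]
    phi = np.hstack([x, np.eye(n_actions)[a]])
    a_pi = np.asarray([policy(s) for s in x_next])             # pi(x'_i) for every sample
    phi_next = np.hstack([x_next, np.eye(n_actions)[a_pi]])
    Q = None
    for _ in range(K):
        y = cost if Q is None else cost + gamma * Q.predict(phi_next)  # evaluation backup
        Q = ExtraTreesRegressor(n_estimators=50).fit(phi, y)
    return Q  # C_hat^pi(x) = Q_K(x, policy(x)); average over initial states gives C_hat(pi)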
Online Learning Subroutine. Since L(π_t, λ) is linear in λ, many online convex optimization approaches can be used for Online-algorithm. Perhaps the simplest choice is Online Gradient Descent (OGD) (Zinkevich, 2003); we include an instantiation using OGD in Appendix G.

For our main Algorithm 2, similar to Agarwal et al. (2018), we use Exponentiated Gradient (EG) (Kivinen & Warmuth, 1997), which has a regret bound of O(√(T log m)) instead of the O(√(mT)) of OGD. One can view EG as a variant of Online Mirror Descent (Nemirovsky & Yudin, 1983) with a softmax link function, or of Follow-the-Regularized-Leader with entropy regularization (Shalev-Shwartz et al., 2012).

Gradient-based algorithms generally require bounded λ. We thus enforce ‖λ‖_1 ≤ B using a hyperparameter B. Solving (OPT) exactly requires B = ∞; we analyze Algorithm 2 with respect to finite B. With some abuse of notation, we augment λ into an (m+1)-dimensional vector by appending B − ‖λ‖_1, and augment the constraint cost vector g by appending 0 (Lines 11, 12 & 16 of Algorithm 2).³

³ The (m+1)-th coordinate of g is thus always satisfied; this augmentation is only necessary when executing EG.
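Here is a small Python sketch of the EG update on the augmented dual variable (Line 17 of Algorithm 2); variable names are ours, and the sign convention simply mirrors the algorithm as printed.

import numpy as np

def eg_update(lam, G_hat, tau, B, eta):
    # lam: augmented dual variable in R^{m+1} with ||lam||_1 = B
    # G_hat, tau: length-m constraint value estimates and thresholds
    z = np.append(np.asarray(G_hat) - np.asarray(tau), 0.0)   # augmented z_t
    w = lam * np.exp(-eta * z)     # multiplicative-weights step (sign as in Alg. 2, Line 17)
    return B * w / w.sum()         # renormalize onto the l1-ball of radius B

# usage: lam = np.full(m + 1, B / (m + 1)); lam = eg_update(lam, G_hat, tau, B, eta)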
4. Theoretical Analysis
4.1. Convergence Guarantee
The convergence rate of Algorithm 2 depends on the radius B of the dual variables λ, the maximal constraint value G, and the number of constraints m. In particular, we can show O(B²/ω²) convergence for primal-dual gap ω.
Theorem 4.1 (Convergence of Algorithm 2). After T iterations, the empirical duality gap is bounded by

L̂_max − L̂_min ≤ 2 B log(m+1) / (η T) + 2 η B G².

Consequently, to achieve a primal-dual gap of ω, setting η = ω / (4 G² B) ensures that Algorithm 2 converges after 16 B² G² log(m+1) / ω² iterations. (Proof in Appendix B.)
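As a quick numerical illustration of how the bound scales (the example numbers below are arbitrary, not taken from the paper's experiments):

import math

def eg_schedule(B, G, m, omega):
    # learning rate and iteration count suggested by Theorem 4.1
    eta = omega / (4 * G ** 2 * B)
    T = 16 * B ** 2 * G ** 2 * math.log(m + 1) / omega ** 2
    return eta, math.ceil(T)

print(eg_schedule(B=30.0, G=1.0, m=2, omega=1.0))   # e.g. eta = 1/120, T on the order of 1.6e4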
Convergence analysis of our main Algorithm 2 is an exten-
sion of the proof to Proposition 3.1, leveraging the no-regret
property of the EG procedure (Shalev-Shwartz et al., 2012).
4.2. Generalization Guarantee of FQE and FQI
In this section, we provide sample complexity analyses for FQE and FQI as standalone procedures for off-policy evaluation and off-policy learning. We use the notion of pseudo-dimension as a capacity measure of the non-linear function class F (Friedman et al., 2001). The pseudo-dimension dim_F, which naturally extends VC dimension to the regression setting, is defined as the VC dimension of the function class induced by the sub-level sets of functions in F:

dim_F = VC-dim({(x, y) ↦ sign(f(x) − y) : f ∈ F}).

Pseudo-dimension is finite for a large class of function approximators. For example, Bartlett et al. (2017) bounded the pseudo-dimension of piecewise-linear deep neural networks (e.g., with ReLU activations) by O(W L log W), where W is the number of weights and L is the number of layers.
Both FQI and FQE rely on reductions to supervised learning to update the value functions. In both cases, the learned policy and the evaluation policy induce state-action distributions that differ from the data-generating distribution μ. We use the worst-case concentration coefficient, proposed by Munos (2003), to measure the degree of distribution shift. The following standard assumption from the analysis of related ADP algorithms limits the severity of distribution shift over future time steps:

Assumption 1 (Concentrability coefficient of future state-action distribution). (Munos, 2003; 2007; Munos & Szepesvári, 2008; Antos et al., 2008a;b; Lazaric et al., 2010; 2012; Farahmand et al., 2009; Maillard et al., 2010) Let P^π be the operator acting on f: X × A → R such that (P^π f)(x, a) = ∫_X f(x', π(x')) p(dx'|x, a). Given the data-generating distribution μ and the initial state distribution χ, for m ≥ 0 and an arbitrary sequence of stationary policies {π_m}_{m≥1}, define the concentration coefficient

β_μ(m) = sup_{π_1,...,π_m} ‖ d(χ P^{π_1} P^{π_2} ··· P^{π_m}) / dμ ‖_∞.

We assume β_μ = (1 − γ)² Σ_{m≥1} m γ^{m−1} β_μ(m) < ∞.
This assumption is valid for a fairly large class of MDPs (Munos, 2007). For instance, β_μ is finite for any finite MDP, or for any infinite state-space MDP with bounded transition density.⁴ Having a finite concentration coefficient is equivalent to the top-Lyapunov exponent satisfying Γ ≤ 0 (Bougerol & Picard, 1992), which means the underlying stochastic system is stable. We show below a simple sufficient condition for Assumption 1 (albeit one stronger than necessary).

Example 4.1. Consider an MDP such that for any non-stationary distribution ρ, the marginals over states satisfy ρ_x(x)/μ_x(x) ≤ L (i.e., the transition dynamics are sufficiently stochastic), and ∃M: ∀x, a: μ(a|x) > 1/M (i.e., the behavior policy is sufficiently exploratory). Then β_μ ≤ LM.

⁴ This assumption ensures sufficient data diversity, even when the executing policy is deterministic. An example of how learning can fail without this assumption is based on the “combination lock” MDP (Koenig & Simmons, 1996). In this deterministic MDP example, β_μ can grow arbitrarily large, and we need an exponential number of samples for both FQE and FQI. See Appendix D.
Recall that for a given policy π, the Bellman (evaluation) operator is defined as (T^π Q)(x, a) = r(x, a) + γ ∫_X Q(x', π(x')) p(dx'|x, a). In general, T^π f may not belong to F for f ∈ F. For FQE (and FQI), the main operation in the algorithm is to iteratively project T^π Q_{k−1} back onto F via Q_k = argmin_{f∈F} ‖f − T^π Q_{k−1}‖. The performance of both FQE and FQI thus depends on how well the function class F approximates the Bellman operator. We measure the ability of the function class F to approximate the Bellman evaluation operator via the worst-case Bellman error:
Definition 4.1 (inherent Bellman evaluation error). Given a function class F and a policy π, the inherent Bellman evaluation error of F is defined as d^π_F = sup_{g∈F} inf_{f∈F} ‖f − T^π g‖_π, where ‖·‖_π is the ℓ2 norm weighted by the state-action distribution induced by π.
We are now ready to state the generalization bound for FQE:

Theorem 4.2 (Generalization error of FQE). Under Assumption 1, for any ε > 0 and δ ∈ (0, 1), after K iterations of Fitted Q Evaluation (Algorithm 3), for n = O( (C⁴/ε²) ( log(K/δ) + dim_F log(C²/ε²) + log dim_F ) ), we have with probability 1 − δ:

|C(π) − Ĉ(π)| ≤ (γ^{1/2}/(1 − γ)^{3/2}) · ( √β_μ (2 d^π_F + ε) + 2 γ^{K/2} C/(1 − γ)^{1/2} ).

This result shows a dependency on ε of Õ(1/ε²), compared to the Õ(1/ε⁴) of other related ADP algorithms (Munos & Szepesvári, 2008; Antos et al., 2008b). The price we pay is a multiplicative constant of 2 in front of the inherent error d^π_F. The error from the second term on the right-hand side decays exponentially with the number of iterations K. The proof is in Appendix E.
We can show an analogous generalization bound for FQI. While FQI has been widely used, to the best of our knowledge, a complete analysis of FQI for non-linear function approximation has not been previously reported.⁵

⁵ FQI for continuous action spaces from Antos et al. (2008a) is a variant of fitted policy iteration and not the version of FQI under consideration here. The appendix of Lazaric & Restelli (2011) contains a proof for FQI specifically with a linear function class.

Definition 4.2 (inherent Bellman optimality error). (Munos & Szepesvári, 2008) Recall that the Bellman optimality operator is defined as (T Q)(x, a) = r(x, a) + γ ∫_X min_{a'∈A} Q(x', a') p(dx'|x, a). Given a function class F, the inherent Bellman error is defined as d_F = sup_{g∈F} inf_{f∈F} ‖f − T g‖_μ, where ‖·‖_μ is the ℓ2 norm weighted by μ, the state-action distribution induced by π_D.
Theorem 4.3 (Generalization error of FQI). Under Assumption 1, for any ε > 0 and δ ∈ (0, 1), after K iterations of Fitted Q Iteration, for n = O( (C⁴/ε²) ( log(K/δ) + dim_F log(C²/ε²) + log dim_F ) ), we have with probability 1 − δ:

|C* − C(π_K)| ≤ (2γ/(1 − γ)³) · ( √β_μ (2 d_F + ε) + 2 γ^{K/2} C ),

where π_K is the policy acting greedily with respect to the returned function Q_K. (Proof in Appendix F.)
4.3. End-to-End Generalization Guarantee
We are ultimately interested in the test-time performance and constraint satisfaction of the policy returned by Algorithm 2. We now connect the previous analyses from Theorems 4.1, 4.2 & 4.3 into an end-to-end error analysis. Since Algorithm 2 uses FQI and FQE as subroutines, the inherent Bellman error terms d_F and d^π_F enter our overall performance bound. Estimating the inherent Bellman error caused by function approximation is not possible in general (Chapter 11 of Sutton & Barto (2018)). We assume the existence of a sufficiently expressive F that can make d_F and d^π_F arbitrarily small. To simplify our end-to-end analysis, we consider d_F = 0 and d^π_F = 0, i.e., the function class F is closed under application of the Bellman operators.
Assumption 2 (Bellman operator realizability). We consider function classes F sufficiently rich so that for all f: T f ∈ F and T^π f ∈ F for the policies π returned by Algorithm 2.
With Assumptions 1 & 2, we have the following error bound:
Theorem 4.4 (Generalization guarantee of Algorithm 2). Let π* be the optimal policy to (OPT). Denote V = C + B G. Let K be the number of iterations of FQE and FQI, and let π̂ be the policy returned by Algorithm 2 with termination threshold ω. For any ε > 0 and δ ∈ (0, 1), when n = O( (V⁴/ε²) ( log(K(m+1)/δ) + dim_F log(V²/ε²) + log dim_F ) ), we have with probability at least 1 − δ:

C(π̂) ≤ C(π*) + ω + ((4 + B) γ/(1 − γ)³) · ( ε √β_μ + 2 γ^{K/2} V ),

and

G(π̂) ≤ τ + (2V + ω)/B + (γ^{1/2}/(1 − γ)^{3/2}) · ( ε √β_μ + 2 γ^{K/2} V/(1 − γ)^{1/2} ).
The proof is in Appendix C. This result guarantees that, upon termination of Algorithm 2, the true performance on the main objective can be arbitrarily close to that of the optimal policy. At the same time, each constraint will be approximately satisfied with high probability, assuming sufficiently large B and K, and sufficiently small ε.
5. Empirical Analysis
We perform experiments on two different domains: a grid-
world domain (from OpenAI’s FrozenLake) under a safety
constraint, and a challenging high-dimensional car racing
domain (from OpenAI’s CarRacing) under multiple behav-
ior constraints. We seek to answer the following questions
in our experiments: (i) whether the empirical convergence
behavior of Algorithm 2 is consistent with our theory; and
(ii) how the returned policy performs with respect to the
main objective and constraint satisfaction. Appendix H
includes a more detailed discussion of our experiments.
5.1. Frozen Lake.
Environment & Data Collection.
The environment is an
8x8 grid. The agent has 4 actions N,S,E,W at each state.
The main goal is to navigate from a starting position to
the goal. Each episode terminates when the agent reaches
the goal or falls into a hole. The main cost is defined as c = −1 if the goal is reached and c = 0 otherwise. We simulate a non-optimal data-gathering policy π_D by adding random sub-optimal actions to the shortest-path policy from any given state to the goal. We run π_D for 5000 trajectories to collect the behavior dataset D (with the constraint cost measurement specified below).
Counterfactual Safety Constraint. We augment the main objective c with a safety constraint cost defined as g = 1 if the agent steps into a hole, and g = 0 otherwise. We set the constraint threshold to τ = 0.1, roughly 75% of the accumulated constraint cost of the behavior policy π_D. The threshold can be interpreted as a counterfactually acceptable probability with which we allow the learned policy to fail.
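A minimal sketch of how such a batch dataset with constraint costs might be assembled, assuming an OpenAI Gym FrozenLake environment with the classic step API (obs, reward, done, info) and a user-supplied behavior policy; the environment id, the API version, and the hole-detection heuristic are our assumptions:

import gym

def collect_batch(behavior_policy, n_episodes=5000, env_id="FrozenLake8x8-v1"):
    env = gym.make(env_id)
    data = []
    for _ in range(n_episodes):
        x, done = env.reset(), False
        while not done:
            a = behavior_policy(x)
            x_next, reward, done, _ = env.step(a)
            c = -1.0 if (done and reward > 0) else 0.0   # main cost: -1 on reaching the goal
            g = 1.0 if (done and reward == 0) else 0.0   # safety cost: episode ended in a hole
            data.append((x, a, x_next, c, g))            # (ignores time-limit truncation)
            x = x_next
    return data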
Results. The empirical primal-dual gap L̂_max − L̂_min in Figure 1 (left) quickly decreases toward the optimal gap of zero. The convergence is fast and monotonic, supporting the behavior predicted by our theory. The test-time performance in Figure 1 (middle) shows that the safety constraint is always satisfied, while the main objective cost also smoothly converges to the optimal value achieved by an online RL baseline (DQN) trained without regard to the constraint. The returned policy significantly outperforms the data-gathering policy π_D on both the main and the safety cost.
5.2. Car Racing.
Environment & Data Collection. The car racing environment (OpenAI) is a high-dimensional domain where the state is a 96 × 96 × 3 tensor of raw pixels. The action space A = {steering × gas × brake} takes 12 discretized values. The goal in this episodic task is to traverse over 95% of the track, measured by a given number of “tiles” as a proxy for distance coverage. The agent receives a reward (negative cost) for each unique tile crossed and no reward if the agent is off-track. A small positive cost applies at every time step, with a maximum horizon of 1000 steps per episode. With these costs given by the environment, one can train an online RL agent using DDQN (Van Hasselt et al., 2016). We collect ≈ 5000 trajectories from DDQN's randomization, resulting in a dataset D with ≈ 94000 transition tuples.
Fast Driving under Behavioral Constraints. We study the problem of minimizing the environment cost subject to two behavioral constraints: smooth driving and lane centering. The first constraint G_0 approximates smooth driving via g_0(x, a) = 1 if a contains a braking action, and 0 otherwise. The second constraint cost g_1 measures the distance between the agent and the center of the lane at each time step. This is a highly challenging setup, since the three objectives and constraints are in direct conflict with one another; for example, fast driving encourages the agent to cut corners and to brake frequently to make turns. Outside of this work, we are not aware of prior work on policy learning with two or more constraints in high-dimensional settings.
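A sketch of how these two constraint costs could be computed per transition, assuming the discretized action is encoded as a (steer, gas, brake) tuple and the lane-center distance is provided by the environment; both interface details are our assumptions, not specified by the paper:

def smooth_driving_cost(action):
    # g_0: 1 if the discretized action applies any braking, else 0
    steer, gas, brake = action
    return 1.0 if brake > 0 else 0.0

def lane_center_cost(distance_to_center, half_lane_width=1.0):
    # g_1: distance between the agent and the lane center (optionally normalized)
    return abs(distance_to_center) / half_lane_width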
Baseline and Procedure. As a naïve baseline, DDQN achieves low cost but exhibits “non-smooth” driving behavior (see our supplementary videos). We set the threshold for each constraint to 75% of the DDQN benchmark. We also compare against regularized batch RL algorithms (Farahmand et al., 2009), specifically regularized LSPI. We instantiate our subroutines, FQE and FQI, with multi-layered CNNs, and we augment LSPI's linear policy with non-linear features derived from a well-performing FQI model.

Results. The mixture policy returned by our algorithm achieves a low main objective cost, comparable to an online RL policy trained without regard to the constraints. After several initial iterations that violate the braking constraint, the returned policy, corresponding to the appropriate λ trade-off, satisfies both constraints while improving the main objective. The improvement over the data-gathering policy is significant for both constraints and the main objective.
Regularized policy learning is an alternative approach to (OPT) (Section 2). We provide the regularized LSPI baseline with the same set of λ found by our algorithm for one-shot regularized learning (Figure 2, left & middle). While regularized LSPI obtains good performance on the main objective, it does not achieve acceptable constraint satisfaction. By default, regularized policy learning requires parameter-tuning heuristics. In principle, one can perform a grid search over a range of parameters to find the right combination; we include such an example for both regularized LSPI and FQI in Appendix H. However, since our objective and constraints are in conflict, the main objective value and constraint satisfaction of policies returned by one-shot regularized learning are sensitive to step changes in λ. In contrast, our approach is systematic, and avoids the curse of dimensionality of brute-force search that comes with multiple constraints.
In practice, one may wish to deterministically extract a single policy from the returned mixture for execution. A de-randomized policy can be obtained naturally from our algorithm by selecting the best policy according to the existing FQE estimates of the individual Best-response policies.
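One simple way such a selection could be implemented, assuming the per-policy FQE estimates C_hat (list of floats) and G_hat (list of length-m vectors) are retained from the main loop; the function and its fallback rule are our illustration, not the paper's prescription:

import numpy as np

def extract_single_policy(policies, C_hat, G_hat, tau, slack=0.0):
    # pick the feasible (or least-violating) policy with the lowest estimated main cost
    C_hat, G_hat = np.asarray(C_hat), np.asarray(G_hat)
    feasible = np.all(G_hat <= np.asarray(tau) + slack, axis=1)
    if feasible.any():
        candidates = np.where(feasible)[0]
        best = candidates[np.argmin(C_hat[candidates])]
    else:
        best = int(np.argmin(np.max(G_hat - np.asarray(tau), axis=1)))  # least violation
    return policies[best]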
5.3. Off-Policy Evaluation
The off-policy evaluation by FQE is critical for updating
policies in our algorithm, and is ultimately responsible for
certifying constraint satisfaction. While other OPE meth-
ods can also be used in place of FQE, we find that the
estimates from popular methods are not sufficiently accu-
rate in a high-dimensional setting. As a standalone com-
parison, we select an individual policy and compare FQE
against PDIS (Precup et al., 2000), DR (Jiang & Li, 2016)
and WDR (Thomas & Brunskill, 2016) with respect to the
constraint cost evaluation. To compare both accuracy and