Minimax Model Learning

Cameron Voloshin (Caltech)    Nan Jiang (UIUC)    Yisong Yue (Caltech)
Abstract
We present a novel off-policy loss function for learning a transition model in model-based reinforcement learning. Notably, our loss is derived from the off-policy policy evaluation objective with an emphasis on correcting distribution shift. Compared to previous model-based techniques, our approach allows for greater robustness under model misspecification or distribution shift induced by learning/evaluating policies that are distinct from the data-generating policy. We provide a theoretical analysis and show empirical improvements over existing model-based off-policy evaluation methods. We provide further analysis showing our loss can be used for off-policy optimization (OPO) and demonstrate its integration with more recent improvements in OPO.
1 Introduction
We study the problem of learning a transition model in a batch, off-policy reinforcement learning (RL) setting, i.e., of learning a function $P(s'|s,a)$ from a pre-collected dataset $D = \{(s_i, a_i, s'_i)\}_{i=1}^n$ without further access to the environment.
Contemporary approaches to model learning focus primarily on improving the performance of models learned through maximum likelihood estimation (MLE) (Sutton, 1990; Deisenroth & Rasmussen, 2011; Kurutach et al., 2018; Clavera et al., 2018; Chua et al., 2018; Luo et al., 2019). The goal of MLE is to pick the model within some model class $\mathcal{P}$ that is most consistent with the observed data or, equivalently, most likely to have generated the data. This is done by minimizing the negative log-loss (equivalently, minimizing the KL divergence), summarized as follows:
$$\hat{P}_{\mathrm{MLE}} = \arg\min_{P \in \mathcal{P}} \frac{1}{n} \sum_{(s_i,a_i,s'_i) \in D} -\log P(s'_i \mid s_i, a_i). \quad (1)$$
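To make Eq. (1) concrete, here is a minimal sketch (ours, not the paper's implementation) of fitting a conditional Gaussian dynamics model by minimizing the empirical negative log-likelihood; the network architecture, the diagonal-Gaussian parameterization, and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    """Models P(s' | s, a) as a diagonal Gaussian N(mu(s, a), sigma(s, a)^2)."""
    def __init__(self, s_dim, a_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(s_dim + a_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, s_dim)
        self.log_std = nn.Linear(hidden, s_dim)

    def dist(self, s, a):
        h = self.body(torch.cat([s, a], dim=-1))
        return torch.distributions.Normal(self.mu(h), self.log_std(h).exp())

def mle_loss(model, s, a, s_next):
    # Eq. (1): average negative log-likelihood over a batch from D.
    return -model.dist(s, a).log_prob(s_next).sum(dim=-1).mean()
```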
A key limitation of MLE is that it focuses on picking a
good model under the data distribution while ignoring
how the model is actually used.
In an RL context, a model can be used to either learn a policy (policy learning/optimization) or evaluate some given policy (policy evaluation), without having to collect more data from the true environment. We call this actual objective the "decision problem." Interacting with the environment to solve the decision problem can be difficult, expensive, and dangerous, whereas a model learned from batch data circumvents these issues. Since MLE (1) does not optimize over the distribution of states induced by the policy from the decision problem, it thus does not prioritize solving the decision problem. Notable previous works that incorporate the decision problem into the model learning objective are Value-Aware Model Learning (VAML) and its variants (Farahmand et al., 2017; Farahmand, 2018; Abachi et al., 2020). These methods, however, still define their losses w.r.t. the data distribution as in MLE, and ignore the distribution shift from the pre-collected data to the policy-induced distribution.
In contrast, we directly focus on requiring the model to perform well under unknown distributions instead of the data distribution. In other words, we are particularly interested in developing approaches that directly model the batch (offline) learning setting. As such, we ask:

"From only pre-collected data, is there a model learning approach that naturally controls the decision problem error?"

In this paper, we present a new loss function for model learning that: (1) only relies on batch or offline data; (2) takes into account distribution shift effects; and (3) directly relates to the performance metrics for off-policy evaluation and learning under certain realizability assumptions. The design of our loss is inspired by recent advances in model-free off-policy evaluation (e.g., Liu et al., 2018; Uehara et al., 2020), which we build upon to develop our approach.
2 Preliminaries
We adopt the infinite-horizon discounted MDP framework specified by a tuple $(\mathcal{S}, \mathcal{A}, P, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition function, $\mathcal{R} : \mathcal{S} \times \mathcal{A} \to \Delta([-R_{\max}, R_{\max}])$ is the reward function, and $\gamma \in [0,1)$ is the discount factor. Let $\mathcal{X} \equiv \mathcal{S} \times \mathcal{A}$. Given an MDP, a (stochastic) policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ and a starting state distribution $d_0 \in \Delta(\mathcal{S})$ together determine a distribution over trajectories of the form $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$, where $s_0 \sim d_0$, $a_t \sim \pi(s_t)$, $r_t \sim \mathcal{R}(s_t, a_t)$, and $s_{t+1} \sim P(s_t, a_t)$ for $t \geq 0$. The performance of policy $\pi$ is given by:
$$J(\pi, P) \equiv E_{s \sim d_0}[V^P_\pi(s)], \quad (2)$$
where, by the Bellman equation,
$$V^P_\pi(s) \equiv E_{a \sim \pi(\cdot|s)}\big[E_{r \sim \mathcal{R}(\cdot|s,a)}[r] + \gamma E_{\tilde{s} \sim P(\cdot|s,a)}[V^P_\pi(\tilde{s})]\big]. \quad (3)$$
A useful equivalent measure of performance is:
$$J(\pi, P) = E_{(s,a) \sim d^P_{\pi,\gamma}}\big[E_{r \sim \mathcal{R}(\cdot|s,a)}[r]\big], \quad (4)$$
where $d^P_{\pi,\gamma}(s,a) \equiv \sum_{t=0}^{\infty} \gamma^t d^P_{\pi,t}(s,a)$ is the (discounted) distribution of state-action pairs induced by running $\pi$ in $P$, and $d^P_{\pi,t} \in \Delta(\mathcal{X})$ is the distribution of $(s_t, a_t)$ induced by running $\pi$ under $P$. The first term in $d^P_{\pi,\gamma}$ is $d^P_{\pi,0} = d_0$. $d^P_{\pi,t}$ has a recursive definition that we use in Section 3:
$$d^P_{\pi,t}(s,a) = \int d^P_{\pi,t-1}(\tilde{s}, \tilde{a})\, P(s \mid \tilde{s}, \tilde{a})\, \pi(a \mid s)\, d\nu(\tilde{s}, \tilde{a}), \quad (5)$$
where $\nu$ is the Lebesgue measure.
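As a concrete illustration (ours, not from the paper), the discounted occupancy $d^P_{\pi,\gamma}$ and the recursion in Eq. (5) can be computed exactly for a finite MDP; the truncation horizon below is an assumption made purely for numerical purposes.

```python
import numpy as np

def discounted_occupancy(P, pi, d0, gamma, horizon=1000):
    """d^P_{pi,gamma}(s, a) = sum_t gamma^t d^P_{pi,t}(s, a) for a finite MDP.

    P:  transition tensor, shape (S, A, S), P[s, a, s'] = P(s' | s, a)
    pi: policy matrix,     shape (S, A),    pi[s, a]    = pi(a | s)
    d0: initial state distribution, shape (S,)
    """
    d_t = d0[:, None] * pi                        # d^P_{pi,0}(s, a) = d0(s) pi(a|s)
    occ = np.zeros_like(d_t)
    for t in range(horizon):
        occ += (gamma ** t) * d_t
        d_state = np.einsum("sa,sap->p", d_t, P)  # marginal over next states
        d_t = d_state[:, None] * pi               # Eq. (5): multiply by pi(a|s)
    return occ
```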
In the batch learning setting, we are given a dataset $D = \{(s_i, a_i, s'_i)\}_{i=1}^n$, where $s_i \sim d^{\pi_b}(s)$, $a_i \sim \pi_b$, and $s'_i \sim P(\cdot \mid s_i, a_i)$, where $\pi_b$ is some behavior policy that collects the data. For convenience, we write $(s,a,s') \sim D_{\pi_b} P$, where $D_{\pi_b}(s,a) = d^{\pi_b}(s)\, \pi_b(a \mid s)$.

Let $E[\cdot]$ denote exact expectation and $E_n[\cdot]$ the empirical approximation using the $n$ data points of $D$.

Finally, we also need three classes $\mathcal{W}, \mathcal{V}, \mathcal{P}$ of functions. $\mathcal{W} \subset (\mathcal{X} \to \mathbb{R})$ represents ratios between state-action occupancies, $\mathcal{V} \subset (\mathcal{S} \to \mathbb{R})$ represents value functions, and $\mathcal{P} \subset (\mathcal{X} \to \Delta(\mathcal{S}))$ represents the class of models (or simulators) of the true environment.
Note
. Any Lemmas or Theorems presented without
proof have full proofs in the Appendix.
3 Minimax Model Learning (MML)
for Off-Policy Evaluation (OPE)
3.1 Natural Derivation
We start with the off-policy evaluation (OPE) learning objective and derive the MML loss (Def. 3.1). In Section 4, we show the loss also bounds off-policy optimization (OPO) error through its connection with OPE.
OPE Decision Problem. The OPE objective is to estimate
$$J(\pi, P^*) \equiv E\Big[\sum_{i=0}^{\infty} \gamma^i r_i \,\Big|\, s_0 \sim d_0,\; a_i \sim \pi(\cdot|s_i),\; s_{i+1} \sim P^*(\cdot|s_i,a_i),\; r_i \sim \mathcal{R}(\cdot|s_i,a_i)\Big], \quad (6)$$
the performance of an evaluation policy $\pi$ in the true environment $P^*$, using only logging data $D$ with samples from $D_{\pi_b} P^*$. Solving this objective is difficult because the actions in our dataset were chosen with $\pi_b$ rather than $\pi$. Thus, any $\pi \neq \pi_b$ potentially induces a "shifted" state-action distribution $D_\pi \neq D_{\pi_b}$, and ignoring this shift can lead to poor estimation.
Model-Based OPE. Given a model class $\mathcal{P}$ and a desired evaluation policy $\pi$, we want to find a simulator $\hat{P} \in \mathcal{P}$ using only logging data $D$ such that:
$$\hat{P} = \arg\min_{P \in \mathcal{P}} |J(\pi, P) - J(\pi, P^*)|. \quad (7)$$
Interpreting Eq. (7), we run $\pi$ in $P$ to compute $J(\pi, P)$ as a proxy for $J(\pi, P^*)$. If we find some $P \in \mathcal{P}$ such that $|\delta^{P,P^*}_\pi| = |J(\pi, P) - J(\pi, P^*)|$ is small, then $P$ is a good simulator for $P^*$.
Derivation. Using (2) and (4), we have:
$$\delta^{P,P^*}_\pi = J(\pi, P) - J(\pi, P^*) = E_{s \sim d_0}[V^P_\pi(s)] - E_{(s,a) \sim d^{P^*}_{\pi,\gamma}(\cdot,\cdot)}\big[E_{r \sim \mathcal{R}(\cdot|s,a)}[r]\big].$$
Adding and subtracting $E_{(s,a) \sim d^{P^*}_{\pi,\gamma}}[V^P_\pi(s)]$, we have:
$$\delta^{P,P^*}_\pi = E_{s \sim d_0}[V^P_\pi(s)] - E_{(s,a) \sim d^{P^*}_{\pi,\gamma}}[V^P_\pi(s)] \quad (8)$$
$$\qquad + E_{(s,a) \sim d^{P^*}_{\pi,\gamma}}\big[V^P_\pi(s) - E_{r \sim \mathcal{R}(\cdot|s,a)}[r]\big]. \quad (9)$$
To simplify the above expression, we make the following observations. First, Eq. (9) can be simplified through the Bellman equation from Eq. (3). To see this, notice that $d^{P^*}_{\pi,\gamma}$ is equivalent to some $d(s)\pi(a|s)$ for an appropriate choice of $d(s)$. Thus,
$$E_{(s,a) \sim d^{P^*}_{\pi,\gamma}}\big[V^P_\pi(s) - E_{r \sim \mathcal{R}(\cdot|s,a)}[r]\big] = E_{s \sim d(\cdot)}\big[E_{a \sim \pi(\cdot|s)}\big[V^P_\pi(s) - E_{r \sim \mathcal{R}(\cdot|s,a)}[r]\big]\big]$$
$$= E_{s \sim d(\cdot)}\big[E_{a \sim \pi(\cdot|s)}\big[E_{s' \sim P(\cdot|s,a)}[\gamma V^P_\pi(s')]\big]\big] = \gamma E_{(s,a) \sim d^{P^*}_{\pi,\gamma}}\big[E_{s' \sim P(\cdot|s,a)}[V^P_\pi(s')]\big].$$
Second, we can manipulate Eq. (8) using the definition of $d^P_{\pi,\gamma}$ and the recursive property of $d^P_{\pi,t}$ from Eq. (5):
Figure 1: Visual of Eq. (10). The error at every point $(s,a)$ in $D_{\pi_b}$ is the difference between $V^P_\pi(\tilde{s})$ (induced by following $P$) and $V^P_\pi(s')$ (induced by following $P^*$). We re-weight the points $(s,a)$ in $D_{\pi_b}$ to mimic $d^{P^*}_{\pi,\gamma}$. Accumulating the errors exactly yields the OPE error of using $P$ as a simulator. MLE, instead, finds a $P$ "pointing" in the same direction as $P^*$ for all points in $D_{\pi_b}$, ignoring the discrepancy with $d^{P^*}_{\pi,\gamma}$.
$$E_{s \sim d_0}[V^P_\pi(s)] - E_{(s,a) \sim d^{P^*}_{\pi,\gamma}}[V^P_\pi(s)] = -\sum_{t=1}^{\infty} \gamma^t \int d^{P^*}_{\pi,t}(s,a)\, V^P_\pi(s)\, d\nu(s,a)$$
$$= -\gamma \sum_{t=0}^{\infty} \gamma^t \int d^{P^*}_{\pi,t+1}(s,a)\, V^P_\pi(s)\, d\nu(s,a)$$
$$= -\gamma \sum_{t=0}^{\infty} \gamma^t \int d^{P^*}_{\pi,t}(\tilde{s},\tilde{a})\, P^*(s \mid \tilde{s},\tilde{a})\, \pi(a \mid s)\, V^P_\pi(s)\, d\nu(\tilde{s},\tilde{a},s,a)$$
$$= -\gamma \sum_{t=0}^{\infty} \gamma^t \int d^{P^*}_{\pi,t}(s,a)\, P^*(s' \mid s,a)\, V^P_\pi(s')\, d\nu(s,a,s')$$
$$= -\gamma E_{(s,a) \sim d^{P^*}_{\pi,\gamma}}\big[E_{s' \sim P^*(\cdot|s,a)}[V^P_\pi(s')]\big].$$
Combining the above allows us to succinctly express:
$$\delta^{P,P^*}_\pi = \gamma E_{(s,a) \sim d^{P^*}_{\pi,\gamma}}\big[E_{s' \sim P(\cdot|s,a)}[V^P_\pi(s')]\big] - \gamma E_{(s,a) \sim d^{P^*}_{\pi,\gamma}}\big[E_{s' \sim P^*(\cdot|s,a)}[V^P_\pi(s')]\big].$$
Since $D$ contains samples from $D_{\pi_b}$ and not $d^{P^*}_{\pi,\gamma}$, we use importance sampling to simplify the right-hand side of $\delta^{P,P^*}_\pi$ to:
$$\gamma\, E_{(s,a,s') \sim D_{\pi_b} P^*}\Big[\frac{d^{P^*}_{\pi,\gamma}}{D_{\pi_b}}\Big(E_{\tilde{s} \sim P(\cdot|s,a)}[V^P_\pi(\tilde{s})] - V^P_\pi(s')\Big)\Big]. \quad (10)$$
Define $w^P_\pi(s,a) \equiv \frac{d^P_{\pi,\gamma}(s,a)}{D_{\pi_b}(s,a)}$. If we knew $w^{P^*}_\pi(s,a)$ and $V^P_\pi$ (for every $P \in \mathcal{P}$), then we could select a $P \in \mathcal{P}$ to directly control $\delta^{P,P^*}_\pi$. We encode this intuition as:
Definition 3.1 (MML Loss). For all $w \in \mathcal{W}$, $V \in \mathcal{V}$, $P \in \mathcal{P}$,
$$\mathcal{L}_{\mathrm{MML}}(w, V, P) = E_{(s,a,s') \sim D_{\pi_b}(\cdot,\cdot) P^*(\cdot|s,a)}\Big[w(s,a) \cdot \Big(E_{\tilde{s} \sim P(\cdot|s,a)}[V(\tilde{s})] - V(s')\Big)\Big].$$
When unambiguous, we will drop the MML subscript.
Here we have replaced $w^{P^*}_\pi(s,a)$ with $w$ coming from the function class $\mathcal{W}$, and $V^P_\pi$ with $V$ from the class $\mathcal{V}$. The function class $\mathcal{W}$ represents the possible distribution shifts, while $\mathcal{V}$ represents the possible value functions.
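To make Definition 3.1 concrete, the following is a minimal sketch (our illustration, not the paper's code) of the empirical MML loss on a batch drawn from $D_{\pi_b} P^*$, given candidate functions `w` and `V` and a model `sample_model` we can sample from; approximating the inner expectation with a fixed number of model samples is an implementation choice.

```python
import numpy as np

def empirical_mml_loss(w, V, sample_model, batch, n_model_samples=10):
    """Empirical version of L_MML(w, V, P) from Definition 3.1.

    w:            callable (s, a) -> scalar importance weight
    V:            callable s -> scalar value
    sample_model: callable (s, a) -> a sampled next state from P(.|s, a)
    batch:        iterable of (s, a, s_next) transitions from D_{pi_b} P*
    """
    terms = []
    for s, a, s_next in batch:
        # Monte Carlo estimate of E_{x ~ P(.|s,a)}[V(x)]
        ev_model = np.mean([V(sample_model(s, a)) for _ in range(n_model_samples)])
        terms.append(w(s, a) * (ev_model - V(s_next)))
    return float(np.mean(terms))
```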
With this intuition, we can formally guarantee that $J(\pi, P) \approx J(\pi, P^*)$ under the following realizability conditions:
Assumption 1 (Adequate Support). $D_{\pi_b}(s,a) > 0$ whenever $d^P_{\pi,\gamma}(s,a) > 0$. Define $w^P_\pi(s,a) \equiv \frac{d^P_{\pi,\gamma}(s,a)}{D_{\pi_b}(s,a)}$.
Assumption 2 (OPE Realizability). For a given $\pi$, $\mathcal{W} \times \mathcal{V}$ contains at least one of $(w^P_\pi, V^{P^*}_\pi)$ or $(w^{P^*}_\pi, V^P_\pi)$ for every $P \in \mathcal{P}$.
Theorem 3.1 (MML & OPE). Under Assumption 2,
$$|J(\pi, \hat{P}) - J(\pi, P^*)| \leq \gamma \min_{P \in \mathcal{P}} \max_{w \in \mathcal{W}, V \in \mathcal{V}} |\mathcal{L}(w, V, P)|, \quad (11)$$
where $\hat{P} = \arg\min_{P \in \mathcal{P}} \max_{w \in \mathcal{W}, V \in \mathcal{V}} |\mathcal{L}(w, V, P)|$.
Remark 3.2. We want to choose $\mathcal{V}$, $\mathcal{W}$, $\mathcal{P}$ carefully so that many $P \in \mathcal{P}$ satisfy $\mathcal{L}(w, V, P) = 0$ and Assumption 2. By inspection, $\mathcal{L}(w, V, P^*) = 0$ for any $V \in \mathcal{V}$, $w \in \mathcal{W}$.
Remark 3.3. While $V^P_\pi \in \mathcal{V}$ $\forall P \in \mathcal{P}$ appears strong, it can be verified for every $P \in \mathcal{P}$ before accessing the data, as the condition does not depend on $P^*$. In principle, we may redesign $\mathcal{V}$ to guarantee this condition.
Remark 3.4. When $\gamma = 0$, $J$ does not depend on a transition function, so $J(\pi, P) = J(\pi, P^*)$ $\forall P \in \mathcal{P}$.

$\mathcal{L}(w, V, P^*) = 0$ and Theorem 3.1 imply that the following learning procedure will be robust to any distribution shift in $\mathcal{W}$ and any value function in $\mathcal{V}$:
Definition 3.2 (Minimax Model Learning (MML)).
$$\hat{P} = \arg\min_{P \in \mathcal{P}} \max_{w \in \mathcal{W}, V \in \mathcal{V}} |\mathcal{L}_{\mathrm{MML}}(w, V, P)|. \quad (12)$$
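When $\mathcal{P}$, $\mathcal{W}$, and $\mathcal{V}$ are small finite classes (as in the LQR experiments of Section 6), Eq. (12) can be solved by brute force, as in the sketch below (our illustration); `loss_fn` stands for any empirical estimate of $\mathcal{L}_{\mathrm{MML}}$, such as the `empirical_mml_loss` sketch shown after Definition 3.1. Richer classes require the gradient-based and RKHS machinery discussed in Sections 5.4 and 6.

```python
def minimax_model_selection(models, ws, Vs, batch, loss_fn):
    """Def. 3.2 over finite candidate classes: argmin_P max_{w,V} |L_MML(w, V, P)|."""
    def adversarial_loss(sample_model):
        # Inner maximization over the (finite) W x V class.
        return max(abs(loss_fn(w, V, sample_model, batch)) for w in ws for V in Vs)
    # Outer minimization over the (finite) model class P.
    return min(models, key=adversarial_loss)
```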
3.2 Interpretation and Verifiability
Figure 1 gives a visual illustration of Eq. (10), which leads to the MML loss (Def. 3.1). $\pi_b$ has induced an "imbalanced" training dataset $D_{\pi_b}$, and the importance sampling term acts to rebalance our data because our test distribution will be $d^{P^*}_{\pi,\gamma}$, induced by $\pi$. Because the objective is OPE, we do not mind that $\hat{P}$ is different from $P^*$ so long as $E_{\hat{P}}[V^{\hat{P}}_\pi] \approx E_{P^*}[V^{\hat{P}}_\pi]$. In other words, the size of $V^{\hat{P}}_\pi$ tells us which state transitions are important to model correctly. We want to appropriately utilize the capacity of our model class $\mathcal{P}$ so that $\hat{P}$ models $P^*$ when $V^{\hat{P}}_\pi$ is large. When it is small, it may be better to ignore the error in favor of other states.
Theorem 3.1 quantifies the error incurred by evaluating $\pi$ in $\hat{P}$ instead of $P^*$, assuming Assumption 2
holds. For OPE, $\hat{P}$ is a reasonable proxy for $P^*$. In this sense, MML is a principled approach for model-based OPE. See Appendix B.1 for a complete proof of Thm. 3.1 and Appendix B.2 for the sample complexity analysis.
If the exploratory state distribution $d^{\pi_b}$ and $\pi_b$ are known, then $D_{\pi_b}$ is known. In this case, we can also verify that $w^P_\pi \in \mathcal{W}$ for every $P \in \mathcal{P}$ a priori. Together with Remark 3.3, we may assume that both $w^P_\pi \in \mathcal{W}$ and $V^P_\pi \in \mathcal{V}$ for all $P \in \mathcal{P}$. Consequently, only one of $V^{P^*}_\pi \in \mathcal{V}$ or $w^{P^*}_\pi \in \mathcal{W}$ has to be realizable for Theorem 3.1 to hold.
Instead of checking for realizability a priori, we can perform post-verification that $w^{\hat{P}}_\pi \in \mathcal{W}$ and $V^{\hat{P}}_\pi \in \mathcal{V}$. Together with the terms depending on $P^*$, realizability of these is also sufficient for Theorem 3.1 to hold. This relaxes the strong "for all $P \in \mathcal{P}$" condition.
3.3 Comparison to Model-Free OPE
Recent model-free OPE literature (e.g., Liu et al., 2018; Uehara et al., 2020) has realizability assumptions similar to Assumption 2.
As an example, the method MWL (Uehara et al., 2020) takes the form
$$J(\pi, P^*) \approx E_{(s,a,r) \sim D_{\pi_b}}[\hat{w}(s,a)\, r], \quad \text{where } \hat{w} = \arg\min_{w \in \mathcal{W}} \max_{Q \in \mathcal{Q}} |\mathcal{L}_{\mathrm{MWL}}(w, Q)|,$$
requiring $Q^{P^*}_\pi$ to be realized to be a valid upper bound. Here $\mathcal{Q}$ is analogous to our function class $\mathcal{V}$, where $E_{a \sim \pi(a|s)}[Q^{P^*}_\pi(s,a)] = V^{P^*}_\pi(s)$. The loss $\mathcal{L}_{\mathrm{MWL}}$ has no dependence on $P$ and is therefore model-free. MQL (Uehara et al., 2020) has analogous realizability conditions to MWL.
Our loss, $\mathcal{L}_{\mathrm{MML}}$, has the same realizability assumptions in addition to one related to $\mathcal{P}$ (and not $P^*$). As discussed in Remark 3.3, these $\mathcal{P}$-related assumptions can be verified a priori and, in principle, satisfied by redesigning the function classes. Therefore, they do not pose a substantial theoretical challenge. See Section 6 for a practical discussion.
An advantage of model-free approaches is that when both $w^{P^*}_\pi, Q^{P^*}_\pi$ are realized, they return an exact OPE point estimate. In contrast, MML additionally requires some $P \in \mathcal{P}$ that makes the loss zero for any $w \in \mathcal{W}, V \in \mathcal{V}$. The advantage of MML is the increased flexibility of a model, enabling OPO (Section 4) and visualization of results through simulation (leading to more transparency).
While recent model-free OPE methods and our method both take a minimax approach, the classes $\mathcal{W}$, $\mathcal{V}$, $\mathcal{P}$ play different roles. In the model-free case, minimization is w.r.t. either $\mathcal{W}$ or $\mathcal{V}$ and maximization is w.r.t. the other. In our case, $\mathcal{W}$ and $\mathcal{V}$ are on the same (maximization) team, while minimization is over $\mathcal{P}$. This allows us to treat $\mathcal{W} \times \mathcal{V}$ as a single unit representing distribution-shifted value functions. A member of this class, $E_{\mathrm{data}}[wV]$ $\big(= E_{(s,a) \sim D_{\pi_b}}\big[\frac{d^{P^*}_{\pi,\gamma}}{D_{\pi_b}} V^P_\pi(s)\big]\big)$, ties together the OPE estimate.
3.4 Misspecification of $\mathcal{P}$, $\mathcal{V}$, $\mathcal{W}$

Suppose Assumption 2 does not hold and $P^* \notin \mathcal{P}$. Define a new function $h(s,a,s') \in \mathcal{H} = \{w(s,a)V(s') \mid (w,V) \in \mathcal{W} \times \mathcal{V}\}$; then we redefine $\mathcal{L}$:
$$\mathcal{L}(h, P) = E_{(s,a,s') \sim D_{\pi_b}(\cdot,\cdot) P^*(\cdot|s,a)}\big[E_{x \sim P(\cdot|s,a)}[h(s,a,x)] - h(s,a,s')\big].$$
Proposition 3.5 (Misspecification discrepancy for OPE). Let $\mathcal{H} \subset (\mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R})$ be a set of functions on $(s,a,s')$. Denote $(WV)^* = w^{P^*}_\pi(s,a) V^P_\pi(s')$ (or, equivalently, $(WV)^* = w^P_\pi(s,a) V^{P^*}_\pi(s')$). Then
$$|J(\pi, \hat{P}) - J(\pi, P^*)| \leq \gamma \min_{P} \max_{h \in \mathcal{H}} |\mathcal{L}(h, P)| + \gamma \epsilon_{\mathcal{H}}, \quad (13)$$
where $\epsilon_{\mathcal{H}} = \max_{P \in \mathcal{P}} \min_{h \in \mathcal{H}} |\mathcal{L}((WV)^* - h, P)|$.
$\mathcal{L}((WV)^* - h, P)$ measures the difference between $h$ and $(WV)^*$. Another interpretation of Prop. 3.5 is that if $\arg\max_{\mathcal{H} \cup \{(WV)^*\}} \mathcal{L}(h, P) = (WV)^*$ for some $P \in \mathcal{P}$, then MML returns a value $\gamma \epsilon_{\mathcal{H}}$ below the true upper bound; otherwise, the output of MML remains the upper bound. This result illustrates that realizability is sufficient but not necessary for MML to be an upper bound on the loss.
3.5 Application to the Online Setting
While the main focus of MML is batch OPE and OPO, we make a few remarks relating to the online setting. In particular, if we assume we can engage in online data collection, then $\mathcal{W} = \{1\}$ (the constant function), representing no distribution shift since $\pi_b = \pi$. When VAML and MML share the same function class $\mathcal{V}$, we can show that $\min_{P} \max_{\mathcal{W}, \mathcal{V}} \mathcal{L}_{\mathrm{MML}}(w, V, P)^2 \leq \min_{P} \mathcal{L}_{\mathrm{VAML}}(\mathcal{V}, P)$ for any $\mathcal{V}$, $\mathcal{P}$. In other words, MML is a tighter decision-aware loss even with online data collection. In addition, MML enables greater flexibility in the choice of $\mathcal{V}$. See Appendix B.4 for further details.
4 Off-Policy Optimization (OPO)
4.1 Natural Derivation
In this section we examine how our MML approach can be integrated into the policy learning/optimization objective. In this setting, the goal is to find a good policy with respect to the true environment $P^*$ without interacting with $P^*$.
OPO Decision Problem. Given a policy class $\Pi$ and access to only a logging dataset $D$ with samples from $D_{\pi_b} P^*$, find a policy $\pi \in \Pi$ that is competitive with the unknown optimal policy $\pi^*_{P^*}$:
$$\hat{\pi}^* = \arg\min_{\pi \in \Pi} |J(\pi, P^*) - J(\pi^*_{P^*}, P^*)|. \quad (14)$$
Note: No additional exploration is allowed.
Model-Based OPO. Given a model class $\mathcal{P}$, we want to find a simulator $\hat{P} \in \mathcal{P}$ using only logging data $D$ and subsequently learn $\pi^*_{\hat{P}} \in \Pi$ in $\hat{P}$ through any policy optimization algorithm, which we call Planner($\cdot$).
Algorithm 1 Standard Model-Based OPO
Input: $D = D_{\pi_b} P^*$, Modeler, Planner
1: Learn $\hat{P} \leftarrow$ Modeler($D$)
2: Learn $\pi^*_{\hat{P}} \leftarrow$ Planner($\hat{P}$)
3: return $\pi^*_{\hat{P}}$
In Algorithm 1, Modeler($\cdot$) refers to any (batch) model-learning procedure. The hope for model-based OPO is that the ideal in-simulator policy $\pi^*_{\hat{P}}$ and the actual best (true environment) policy $\pi^*_{P^*}$ perform competitively: $J(\pi^*_{\hat{P}}, P^*) \approx J(\pi^*_{P^*}, P^*)$. Hence, instead of minimizing Eq. (14) over all $\pi \in \Pi$, we can focus on $\Pi = \{\pi^*_P\}_{P \in \mathcal{P}}$.
Derivation. Beginning with the objective, we add zero twice:
$$J(\pi^*_{P^*}, P^*) - J(\pi^*_P, P^*) = \underbrace{J(\pi^*_{P^*}, P^*) - J(\pi^*_{P^*}, P)}_{(a)} + \underbrace{J(\pi^*_{P^*}, P) - J(\pi^*_P, P)}_{(b)} + \underbrace{J(\pi^*_P, P) - J(\pi^*_P, P^*)}_{(c)}.$$
Term (b) is non-positive since $\pi^*_P$ is optimal in $P$ ($\pi^*_{P^*}$ is suboptimal), so we can drop it in an upper bound. Term (a) is the OPE estimate of $\pi^*_{P^*}$ and term (c) the OPE estimate of $\pi^*_P$, implying that we should use Theorem 3.1. With this intuition, we have:
Theorem 4.1 (MML & OPO). If $w^{P^*}_{\pi^*_{P^*}}, w^{P^*}_{\pi^*_P} \in \mathcal{W}$ and $V^P_{\pi^*_{P^*}}, V^P_{\pi^*_P} \in \mathcal{V}$ for every $P \in \mathcal{P}$, then:
$$|J(\pi^*_{P^*}, P^*) - J(\pi^*_{\hat{P}}, P^*)| \leq 2\gamma \min_{P} \max_{w,V} |\mathcal{L}(w, V, P)|.$$
The statement also holds if, instead, $w^P_{\pi^*_{P^*}}, w^P_{\pi^*_P} \in \mathcal{W}$ and $V^{P^*}_{\pi^*_{P^*}}, V^{P^*}_{\pi^*_P} \in \mathcal{V}$ for every $P \in \mathcal{P}$.
4.2 Interpretation and Verifiability
Theorem 4.1 compares two different policies in the same (true) environment, since $\pi^*_{\hat{P}}$ will be run in $P^*$ rather than $\hat{P}$. In contrast, Theorem 3.1 compared the same policy in two different environments. The derivation of Theorem 4.1 (see Appendix C.1) shows that having a good bound on the OPE objective is sufficient for OPO. MML shows how to learn a model that exploits this relationship.
Furthermore, the realizability assumptions of Theorem 4.1 relax the requirements of an OPE oracle. Rather than requiring the OPE estimate for every $\pi$, it is sufficient to have the OPE estimates of $\pi^*_{P^*}$ and $\pi^*_P$ (for every $P \in \mathcal{P}$) when there is a $P \in \mathcal{P}$ such that $\mathcal{L}(w, V, P)$ is small for any $w \in \mathcal{W}, V \in \mathcal{V}$.
We could have instead examined the quantity $\min_\pi |J(\pi^*_{P^*}, P^*) - J(\pi, P^*)|$ directly from Eq. (14). What we would find is that the upper bound is $2 \min_P \max_{w,V} |E_{d_0}[V] - \mathcal{L}(w, V, P)|$, and the realizability requirements would be that $V^P_\pi \in \mathcal{V}, w^{P^*}_\pi \in \mathcal{W}$ for every $\pi$ in some policy class. This is a much stronger requirement than in Theorem 4.1.
For OPO, a priori verification of realizability is possible by enumerating over $P \in \mathcal{P}$. Whereas the target policy $\pi$ was fixed in OPE, now $\pi^*_P$ varies for each $P \in \mathcal{P}$. It may be more practical to, as in OPE, perform post-verification that $w^P_{\pi^*_{\hat{P}}} \in \mathcal{W}$ and $V^P_{\pi^*_{\hat{P}}} \in \mathcal{V}$. If these do not hold, then we can modify the function classes until they do. This relaxes the "for every $P \in \mathcal{P}$" condition and leaves only a few unverifiable quantities relating to $P^*$.
Sample complexity and function class misspecification
results for OPO can be found in Appendix C.2, C.3.
4.3 Comparison to Model-Free OPO
For minimax model-free OPO, Chen & Jiang (2019) have analyzed a minimax variant of Fitted Q Iteration (FQI) (Ernst et al., 2005), inspired by Antos et al. (2008). FQI is a commonly used model-free OPO method. In addition to realizability assumptions, these methods also maintain a completeness assumption: the function class of interest is closed under the Bellman update. Increasing the function class size can only help realizability but may break completeness. It is unknown whether the completeness assumption of FQI is removable (Chen & Jiang, 2019). MML only has realizability requirements.
5 Scenarios & Considerations
In this section we investigate a few scenarios where we can calculate the classes $\mathcal{V}$ and $\mathcal{W}$ or modify the loss based on structural knowledge of $\mathcal{P}$, $\mathcal{W}$, and $\mathcal{V}$.

In examining these scenarios, we aim to verify that MML gives sensible results. For example, in scenarios where we know MLE to be optimal, MML should ideally coincide. Indeed, we show this to be the case for the tabular setting and Linear-Quadratic Regulators. Other scenarios include showing that MML is compatible with incorporating prior knowledge using either a nominal dynamics model or a kernel.

The proofs for any lemmas in this section can be found in Appendix E.
5.1 Linear & Tabular Function Classes
When $\mathcal{W}$, $\mathcal{V}$, $\mathcal{P}$ are linear function classes, the entire minimax optimization has a closed-form solution. In particular, $P$ takes the form $P = \phi(s,a,s')^T \alpha$, where $\phi \in \mathbb{R}^{|\mathcal{S} \times \mathcal{A} \times \mathcal{S}|}$ is some basis of features with $\alpha \in \mathbb{R}^{|\mathcal{S} \times \mathcal{A} \times \mathcal{S}|}$ its parameters, and $(w(s,a), V(s')) \in \mathcal{WV} = \{\psi(s,a,s')^T \beta : \|\beta\|_\infty < +\infty\}$, where $\psi \in \mathbb{R}^{|\mathcal{S} \times \mathcal{A} \times \mathcal{S}|}$.
Proposition 5.1 (Linear function classes). Let $P = \phi(s,a,s')^T \alpha$, where $\phi \in \mathbb{R}^{|\mathcal{S} \times \mathcal{A} \times \mathcal{S}|}$ is some basis of features with $\alpha$ its parameters. Let $(w(s,a), V(s')) \in \mathcal{WV} = \{\psi(s,a,s')^T \beta : \|\beta\|_\infty < +\infty\}$. Then,
$$\hat{\alpha} = E_n\Big[\int \phi(s,a,s')\, \psi(s,a,s')^T d\nu(s')\Big]^{-T} E_n\big[\psi(s,a,s')\big], \quad (15)$$
if $E_n\big[\int \phi(s,a,s')\, \psi(s,a,s')^T d\nu(s')\big]$ has full rank.
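Numerically, Eq. (15) is a single linear solve. The sketch below is our illustration; it assumes the per-sample integrals $\int \phi(s_i,a_i,s')\,\psi(s_i,a_i,s')^T d\nu(s')$ have already been computed (analytically or by quadrature) and stored in `phi_psi_int`.

```python
import numpy as np

def linear_mml_alpha(phi_psi_int, psi_data):
    """Closed-form alpha-hat from Eq. (15).

    phi_psi_int: array (n, d, d); entry i is the integral of phi psi^T over s' at (s_i, a_i)
    psi_data:    array (n, d);    entry i is psi(s_i, a_i, s'_i)
    """
    A = phi_psi_int.mean(axis=0)      # E_n[int phi psi^T dnu(s')]
    b = psi_data.mean(axis=0)         # E_n[psi(s, a, s')]
    return np.linalg.solve(A.T, b)    # A^{-T} b; requires A to be full rank
```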
The tabular setting, when the state-action space is finite, is a common special case. We can choose
$$\psi(s,a,s') = \phi(s,a,s') = e_i \quad (16)$$
as the $i$-th standard basis vector, where $i = s|\mathcal{A}||\mathcal{S}| + a|\mathcal{S}| + s'$. There is no model misspecification in the tabular setting (i.e., $P^* \in \mathcal{P}$); therefore $\hat{P} = P^*$ in the case of infinite data.
Proposition 5.2 (Tabular representation). Let $P = \phi(s,a,s')^T \alpha$ with $\phi \in \mathbb{R}^{|\mathcal{S} \times \mathcal{A} \times \mathcal{S}|}$ as in Eq. (16) and $\alpha$ its parameters. Let $(w(s,a), V(s')) \in \mathcal{WV} = \{\phi(s,a,s')^T \beta : \|\beta\|_\infty < +\infty\}$. Assume we have at least one data point from every $(s,a)$ pair. Then:
$$\hat{P}_n(s'|s,a) = \frac{\#\{(s,a,s') \in D\}}{\#\{(s,a,\cdot) \in D\}}. \quad (17)$$
Prop. 5.2 shows that MML and MLE coincide, even in the finite-data regime. Both models are simply the observed propensity of entering state $s'$ from the tuple $(s,a)$.
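For completeness, a tiny sketch (ours) of the count-based estimator in Eq. (17):

```python
from collections import Counter

def tabular_transition_estimate(batch):
    """P_hat(s'|s,a) = #{(s,a,s')} / #{(s,a,.)}, as in Eq. (17)."""
    triple_counts = Counter((s, a, sn) for s, a, sn in batch)
    pair_counts = Counter((s, a) for s, a, _ in batch)
    return {(s, a, sn): c / pair_counts[(s, a)]
            for (s, a, sn), c in triple_counts.items()}
```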
5.2 Linear Quadratic Regulator (LQR)
The Linear Quadratic Regulator (LQR) is defined by linear transition dynamics $P^*(s'|s,a) = A^* s + B^* a + w^*$, where $w^*$ is random noise, and a quadratic reward function $R(s,a) = s^T Q s + a^T R a$ for $Q, R \geq 0$ symmetric positive semi-definite. For ease of exposition we assume that $w^* \sim N(0, \sigma^{*2} I)$. We assume that $(A^*, B^*)$ is controllable. Exploiting the structure of this problem, we can check that every $V \in \mathcal{V}$ takes the form $V(s) = s^T U s + q$ for some symmetric positive semi-definite $U$ and constant $q$ (Appendix Lemma E.1).
Furthermore, we know controllers of the form $\pi(a|s) = -Ks$, where $K \in \mathbb{R}^{k \times n}$, are optimal in LQR (Bertsekas et al., 2005). We consider deterministic, and therefore misspecified, models of the form $P(s'|s,a) = As + Ba$. $\mathcal{W}$ is a Gaussian mixture, and we can write $\mathcal{L}_{\mathrm{MML}}$ as a function of $U, K$ and $(A, B)$ (Appendix Lemma E.2).
Proposition 5.3 (MML and MLE Coincide for LQR). Let $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times k}$, $K \in \mathbb{R}^{k \times n}$. Let $U \in \mathbb{S}^n$ be positive semi-definite. Set $k = 1$, a single-input system. Then,
$$\arg\min_{(A,B)} \max_{K,U} |\mathcal{L}_{\mathrm{MML}}(K, U, (A,B))| = (A^*, B^*) = \arg\min_{(A,B)} \mathcal{L}_{\mathrm{MLE}}(A, B).$$
Despite model misspecification, both MLE and MML give the correct parameters $(\hat{A}, \hat{B}) = (A^*, B^*)$. We leave showing that MML and MLE coincide in multi-input ($k > 1$) LQR systems for future work.
5.3 Residual Dynamics & Environment Shift
Suppose we already had some baseline model $P_0$ of $P^*$. Alternatively, we may view this as the real world starting with (approximately) known dynamics $P_0$ and drifting to $P^*$. We can modify MML to incorporate knowledge of $P_0$ to find the residual dynamics:
Definition 5.1 (Residual MML Loss). For $w \in \mathcal{W}$, $V \in \mathcal{V}$, $P \in \mathcal{P}$,
$$\mathcal{L}(w, V, P) = E_{(s,a,s') \sim D_{\pi_b}(\cdot,\cdot) P^*(\cdot|s,a)}\Big[w(s,a) \cdot \Big(E_{x \sim P_0(\cdot|s,a)}\Big[\frac{P_0(x|s,a) - P(x|s,a)}{P_0(x|s,a)}\, V(x)\Big] - V(s')\Big)\Big].$$
This solution form matches the intuition that having prior knowledge in the form of $P_0$ focuses the learning objective on the difference between $P^*$ and $P_0$.
5.4 Incorporating Kernels
Figure 2: LQR. (Left, OPE Error) MML finds the $P \in \mathcal{P}$ with the lowest OPE error as $\mathcal{P}$ gets richer. Since calculations are done in expectation, no error bars are included. (Right, Verifiability) The OPE error (smoothed) increases with misspecification in $\mathcal{V}$, parametrized by $\epsilon$, the expected MSE between the true $V^{P^*}_\pi \notin \mathcal{V}$ and the approximated $\hat{V}^{P^*}_\pi \in \mathcal{V}$. Nevertheless, directionally they all follow the same trajectory as $\mathcal{P}$ gets richer.

Our approach is also compatible with incorporating kernels (which is a way of encoding domain knowledge such as smoothness) to learn in a Reproducing Kernel Hilbert Space (RKHS). For example, we may derive a closed form for $\max_{(w,V) \in \mathcal{WV}} \mathcal{L}(w, V, P)^2$ when $\mathcal{W} \times \mathcal{V}$ corresponds to an RKHS and use standard gradient descent to find $\hat{P} \in \mathcal{P}$, making the minimax problem much more tractable. See Appendix E.3 for a detailed discussion of RKHS, computational issues relating to sampling from $P$, and alternative approaches to solving the minimax problem.
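To illustrate how such an RKHS closed form is typically used (our sketch, not the derivation in Appendix E.3): if $\mathcal{W} \times \mathcal{V}$ is taken to be the unit ball of an RKHS over $(s,a,s')$ triples with kernel $k$, then the supremum of the empirical loss squared over that ball reduces to a double sum of kernel evaluations. Below, each model next state $x_i \sim P(\cdot|s_i,a_i)$ approximates the inner expectation with a single sample; the unit-ball normalization and single-sample approximation are assumptions of this sketch.

```python
import numpy as np

def rkhs_mml_objective(kernel, sa, x_model, s_next):
    """sup over the RKHS unit ball of the (empirical) L(w, V, P)^2.

    kernel:  callable k(u, v) on concatenated (s, a, s') vectors
    sa:      array (n, ds + da) of observed (s_i, a_i)
    x_model: array (n, ds) of sampled next states x_i ~ P(.|s_i, a_i)
    s_next:  array (n, ds) of observed next states s'_i
    """
    plus = np.hstack([sa, x_model])    # "model" triples (s, a, x)
    minus = np.hstack([sa, s_next])    # "data"  triples (s, a, s')
    n = len(sa)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += (kernel(plus[i], plus[j]) - kernel(plus[i], minus[j])
                      - kernel(minus[i], plus[j]) + kernel(minus[i], minus[j]))
    return total / (n * n)
```

One would then minimize this quantity over $P$ (e.g., by gradient descent, reparametrizing the model samples), in line with the gradient-descent strategy mentioned above.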
6 Experiments
In our experiments, we seek to answer the following questions: (1) Does MML prefer models that minimize the OPE objective? (2) What can we expect when we have misspecification in $\mathcal{V}$? (3) How does MML perform against MLE and VAML in OPE? (4) Does our approach complement modern offline RL approaches? For this last question, we consider integrating MML with the recently proposed MOREL (Kidambi et al., 2020) approach for offline RL. See Appendix F.3 for details on MOREL.
6.1 Brief Environment Description/Setup
We perform our experiments in three different domains.
Linear-Quadratic Regulator (LQR). The LQR domain is a 1D environment with stochastic dynamics $P^*(s'|s,a)$. We use a finite class $\mathcal{P}$ consisting of deterministic models. We ensure $V^P_\pi \in \mathcal{V}$ for all $P \in \mathcal{P}$ by solving the equations in Appendix Lemma E.1. We ensure $w^{P^*}_\pi \in \mathcal{W}$ using Appendix Equation (25).
Cartpole (Brockman et al., 2016). The reward function is modified to be a function of angle and location rather than 0/1 to make the OPE problem more challenging. Each $P \in \mathcal{P}$ is a parametrized NN that outputs a mean and log-variance representing a normal distribution over the next state. We model the class $\mathcal{WV}$ as an RKHS as in Prop. E.3 with an RBF kernel.
Inverted Pendulum (IP) (Dorobantu & Taylor, 2020). This IP environment has a Runge-Kutta(4) integrator rather than Forward Euler (Runge-Kutta(1)) as in OpenAI Gym (Brockman et al., 2016), producing significantly more realistic data. Each $P \in \mathcal{P}$ is a deterministic model parametrized with a neural network. We model the class $\mathcal{WV}$ as an RKHS as in Prop. E.3 with an RBF kernel.
Further Detail. A thorough description of the environments, experimental details, setup, and hyperparameters can be found in Appendix F.
6.2 Results
Does MML prefer models that minimize the OPE objective? We vary the size of the model class in Figure 2 (left), testing to see if MML will pick up on the models which have better OPE performance. When the sizes of $|\mathcal{P}|$ are small, each method selects $(A^*, B^*)$ (i.e., $P(s'|s,a) = A^* s + B^* a$), the deterministic version of the optimal model. However, as we increase the richness of $\mathcal{P}$, MML begins to pick up on models that are able to better evaluate $\pi$.
Two remarks are in order. First, in LQR, policy optimization in $(A^*, B^*)$ coincides with policy optimization in $P^*$; therefore, if we tried to do policy optimization in our selected model, then our policy would be suboptimal in $P^*$. Secondly, MML deliberately selects a model other than $(A^*, B^*)$ because a good OPE estimate relies on approximating the contribution from the stochastic part of $P^*$.
There is a trade-off between the OPE objective and the OPO objective. MML's preference is dependent on the capacities of $\mathcal{P}$, $\mathcal{W}$, $\mathcal{V}$. Figure 2 (left) illustrates that OPE is preferred for $\mathcal{W}$ fixed. Appendix Figure 5 explores the OPO objective and shows that if we increase $\mathcal{W}$ then OPO becomes favored. In some sense we are asking MML to be robust to many more OPE problems as $|\mathcal{W}|$ grows, and so the performance on any single one decreases, favoring OPO.

Figure 3: (Cartpole, OPE Error) Comparison of model-based approaches for OPE with function approximation. Lower is better. MML outperforms the others. Not pictured: traditional model-free methods such as IS/PDIS have errors of order 3–8.
What can we expect when we have misspecification in $\mathcal{V}$? To check verifiability in practice, we would run $\pi$ in a few $P \in \mathcal{P}$ and calculate $V^P_\pi$. We would check whether $V^P_\pi \in \mathcal{V}$ by fitting $\hat{V}^P_\pi$ and measuring the empirical gap $E[(\hat{V}^P_\pi - V^P_\pi)^2] = \epsilon^2$.
Figure 2 (right) shows how MML performs when $V^P_\pi \notin \mathcal{V}$ but we do have $\hat{V}^P_\pi(s) = V^P_\pi(s) + N(0, \epsilon^2) \in \mathcal{V}$. Since $E[(\hat{V}^P_\pi - V^P_\pi)^2] = \epsilon^2$, $\epsilon$ is the root-mean-squared error between the two functions. Directionally, all of the errors go down as $|\mathcal{P}| \uparrow$; however, it is clear that $\epsilon$ has a noticeable effect. We speculate that if this error is not distributed around zero and is instead dependent on the state, then the effects can be worse.
How does MML perform against MLE and VAML in OPE? In addition to Figure 2 (left), Figure 3 also illustrates that our method outperforms the other model-learning approaches in OPE. The environment and reward function are challenging, requiring function approximation. Despite the added complexity of solving a minimax problem, doing so gives nearly an order of magnitude improvement over MLE and many orders over VAML. This validates that MML is a good choice for model learning for OPE.
Algorithm 2 OPO Algorithm (based on MOREL (Kidambi et al., 2020))
Input: $D$, $\mathcal{L}$ among {MML, MLE, VAML}
1: Learn an ensemble of dynamics models $P_1, \ldots, P_4 \in \mathcal{P}$ using $P_i = \arg\min_{P \in \mathcal{P}} \mathcal{L}(D)$
2: Construct a pessimistic MDP $\mathcal{M}$ (see Appendix F.3) with $P(s,a) = \frac{1}{4}\sum_{i=1}^{4} P_i(s,a)$.
3: $\hat{\pi} \leftarrow$ PPO($\mathcal{M}$) (best of 3) (Schulman et al., 2017)
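A rough sketch of the pessimism in line 2 is below. It is our illustration of a MOREL-style construction (referenced in Appendix F.3), not the paper's code: a state-action pair is treated as unknown when ensemble members disagree beyond a threshold, and the agent is halted there with a large negative reward; the `.predict` interface, threshold, and penalty values are illustrative assumptions.

```python
import numpy as np

def pessimistic_step(ensemble, reward_fn, s, a, threshold=0.1, penalty=-100.0):
    """One transition of a pessimistic MDP built from a dynamics ensemble.

    ensemble:  list of models, each exposing .predict(s, a) -> next-state estimate
    reward_fn: callable (s, a) -> reward of the nominal MDP
    """
    preds = np.stack([m.predict(s, a) for m in ensemble])
    # Disagreement: largest pairwise gap between ensemble predictions.
    disagreement = max(np.linalg.norm(p - q) for p in preds for q in preds)
    if disagreement > threshold:
        # Unknown region: halt with a large penalty (pessimism).
        return s, penalty, True
    s_next = preds.mean(axis=0)        # line 2: average of the ensemble
    return s_next, reward_fn(s, a), False
```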
Figure 4: (Inverted Pendulum, OPO Performance) Comparison of model-based approaches for OPO with function approximation using Algorithm 2. Higher is better. MML performs competitively even in low-data regimes.
Does our approach complement modern offline RL approaches? We integrate MML, VAML, and MLE with MOREL as in Algorithm 2. Figure 4 shows that MML performs competitively with the other methods, achieving near-optimal performance as the number of trajectories increases. MML has good performance even in the low-data regime, whereas the other methods perform worse than $\pi_b$. Performance in the low-data regime is of particular interest since sample efficiency is highly desirable.
Algorithm 2 forms a pessimistic MDP in which a policy is penalized if it enters a state where there is disagreement between $P_1, \ldots, P_4$. Given that MML performs well in the low-data regime, we can reason that MML produces models whose support stays within the dataset $D$ or that generalize well slightly outside this set. The other models' poor performance is suggestive of incorrect over-confidence outside of $D$, and PPO produces a policy that takes advantage of this.
7 Other Related Work
Minimax and Model-Based RL. Rajeswaran et al. (2020) introduce an iterative minimax approach to simultaneously find the optimal policy and a model of the environment. Despite its distribution-shift correction, online data collection is required, so it is not comparable to MML, where we focus on the batch setting.
Batch (Offline) Model-Based RL. Recent improvements in batch model-based RL focus primarily on the issue of policies taking advantage of errors in the model (Kidambi et al., 2020; Deisenroth & Rasmussen, 2011; Chua et al., 2018; Janner et al., 2019). These improvements typically involve uncertainty quantification to keep the agent in highly certain states to avoid model exploitation. They are independent of the loss function involved.