Cameron Voloshin, Nan Jiang, Yisong Yue
Contents

1 Introduction
2 Preliminaries
3 Minimax Model Learning (MML) for Off-Policy Evaluation (OPE)
  3.1 Natural Derivation
  3.2 Interpretation and Verifiability
  3.3 Comparison to Model-Free OPE
  3.4 Misspecification of $\mathcal{P}$, $\mathcal{V}$, $\mathcal{W}$
  3.5 Application to the Online Setting
4 Off-Policy Optimization (OPO)
  4.1 Natural Derivation
  4.2 Interpretation and Verifiability
  4.3 Comparison to Model-Free OPO
5 Scenarios & Considerations
  5.1 Linear & Tabular Function Classes
  5.2 Linear Quadratic Regulator (LQR)
  5.3 Residual Dynamics & Environment Shift
  5.4 Incorporating Kernels
6 Experiments
  6.1 Brief Environment Description/Setup
  6.2 Results
7 Other Related Work
8 Discussion and Future Work
A Glossary of Terms
B OPE
  B.1 Main Result
  B.2 Sample Complexity for OPE
  B.3 Misspecification for OPE
  B.4 Application to the Online Setting and Brief VAML Comparison
C OPO
  C.1 Main Result
  C.2 Sample Complexity for OPO
  C.3 Misspecification
D Additional theory
  D.1 Necessary and sufficient conditions for uniqueness of $|L(w,V,P)| = 0$
E Scenarios & Considerations
  E.1 Linear Function Classes
  E.2 LQR
  E.3 RKHS & Practical Implementation
F Experiments
  F.1 Environment Descriptions
    F.1.1 LQR
    F.1.2 Cartpole
    F.1.3 Inverted Pendulum (IP)
  F.2 Experiment Descriptions
    F.2.1 LQR OPE/OPO
    F.2.2 Cartpole OPE
    F.2.3 Inverted Pendulum OPO
  F.3 MOREL
  F.4 Additional Experiments
A Glossary of Terms

Table 1: Glossary of terms

Acronym or Symbol: Term
OPE: Off Policy (Policy) Evaluation
OPO: Off Policy (Policy) Optimization. Also goes by batch off-policy reinforcement learning.
$\mathcal{S}$: State Space
$\mathcal{A}$: Action Space
$P$: Transition Function
$P^*$: True Transition Function
$R$: Reward Function
$\mathcal{X}$: State-Action Space $\mathcal{S}\times\mathcal{A}$
$\gamma$: Discount Factor
$\pi$: Policy
$J(\pi,P)$: Performance of $\pi$ in $P$
$V_P^\pi$: Value Function of $\pi$ with respect to $P$
$d_0$: Initial State Distribution
$d_P^{\pi,\gamma}$: (Discounted) Distribution of State-Action Pairs Induced by Running $\pi$ in $P$
$w_P^\pi$: Distribution Shift ($w_P^\pi(s,a) = d_P^{\pi,\gamma}(s,a)/D_{\pi_b}(s,a)$)
$\nu$: Lebesgue Measure
$d_{\pi_b}$: Behavior State Distribution
$\pi_b$: Behavior Policy
$D_{\pi_b}$: Behavior Data Distribution ($d_{\pi_b}\pi_b$)
$D$: Dataset Containing Samples from $D_{\pi_b}P^*$
$E_n[\cdot]$: Empirical Approximation Using $D$
$E[\cdot]$: Exact Expectation
$\mathcal{W}$: Distribution Shift Function Class (e.g. $d_P^{\pi,\gamma}(s,a)/D_{\pi_b}(s,a) \in \mathcal{W}$)
$\mathcal{V}$: Value Function Class (e.g. $V_P^\pi \in \mathcal{V}$)
$\mathcal{P}$: Model Function Class (e.g. $P \in \mathcal{P}$)
$L$: Model Learning Loss Function
$\hat{P}$: Best Model w.r.t. $L$
$H$: Misspecification Error
$\pi_P^*$: Optimal Policy in $P$
RKHS: Reproducing Kernel Hilbert Space
LQR: Linear Quadratic Regulator
IP: Inverted Pendulum
MML: Minimax Model Learning (Ours)
MLE: Maximum Likelihood Estimation
VAML: Value-Aware Model Learning
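For ease of reference in the proofs below, the minimax model-learning loss $L$ referred to throughout, written in the form it takes in Sections B and C (this restatement is assembled from the derivations below, not a new definition):

$$L(w,V,P) = E_{(s,a,s')\sim D_{\pi_b}P^*(\cdot|s,a)}\Big[w(s,a)\big(E_{x\sim P(\cdot|s,a)}[V(x)] - V(s')\big)\Big], \qquad \hat{P} = \arg\min_{P\in\mathcal{P}}\ \max_{w\in\mathcal{W},\,V\in\mathcal{V}}|L(w,V,P)|.$$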
B OPE
In this section we explore the OPE results in the order in which they were presented in the main paper.
B.1 Main Result
Proof for Theorem 3.1.
Assume $(w_{P^*}^\pi, V_P^\pi)\in\mathcal{W}\times\mathcal{V}$. Fix some $P\in\mathcal{P}$. We use both definitions of $J$ as follows:
$$
\begin{aligned}
J(\pi,P) - J(\pi,P^*) &= E_{d_0}[V_P^\pi] - E_{(s,a)\sim d_{P^*}^{\pi,\gamma},\, r\sim R(\cdot|s,a)}[r] \\
&= E_{(s,a)\sim d_{P^*}^{\pi,\gamma}}\big[V_P^\pi(s) - E_{r\sim R(\cdot|s,a)}[r]\big] + E_{d_0}[V_P^\pi] - E_{(s,a)\sim d_{P^*}^{\pi,\gamma}}[V_P^\pi(s)] \\
&= E_{(s,a)\sim d_{P^*}^{\pi,\gamma}}\big[V_P^\pi(s) - E_{r\sim R(\cdot|s,a)}[r]\big] - \sum_{t=1}^{\infty}\gamma^t\int d_{P^*}^{\pi,t}(s,a)\,V_P^\pi(s)\,d\nu(s,a) \\
&= E_{(s,a)\sim d_{P^*}^{\pi,\gamma}}\big[\gamma E_{s'\sim P(\cdot|s,a)}[V_P^\pi(s')]\big] - \gamma\sum_{t=0}^{\infty}\gamma^t\int d_{P^*}^{\pi,t+1}(s,a)\,V_P^\pi(s)\,d\nu(s,a) \\
&= \gamma E_{(s,a)\sim d_{P^*}^{\pi,\gamma}}\big[E_{s'\sim P(\cdot|s,a)}[V_P^\pi(s')]\big] - \gamma\sum_{t=0}^{\infty}\gamma^t\int d_{P^*}^{\pi,t}(\tilde{s},\tilde{a})\,P^*(s|\tilde{s},\tilde{a})\,\pi(a|s)\,V_P^\pi(s)\,d\nu(\tilde{s},\tilde{a},s,a) \\
&= \gamma E_{(s,a)\sim d_{P^*}^{\pi,\gamma}}\big[E_{s'\sim P(\cdot|s,a)}[V_P^\pi(s')]\big] - \gamma\sum_{t=0}^{\infty}\gamma^t\int d_{P^*}^{\pi,t}(s,a)\,P^*(s'|s,a)\,V_P^\pi(s')\,d\nu(s,a,s') \\
&= \gamma E_{(s,a)\sim d_{P^*}^{\pi,\gamma}}\big[E_{s'\sim P(\cdot|s,a)}[V_P^\pi(s')]\big] - \gamma E_{(s,a)\sim d_{P^*}^{\pi,\gamma}}\big[E_{s'\sim P^*(\cdot|s,a)}[V_P^\pi(s')]\big] \\
&= \gamma E_{(s,a)\sim d_{P^*}^{\pi,\gamma}}\big[E_{s'\sim P(\cdot|s,a)}[V_P^\pi(s')] - E_{s'\sim P^*(\cdot|s,a)}[V_P^\pi(s')]\big] \\
&= \gamma E_{(s,a,s')\sim D_{\pi_b}P^*(\cdot|s,a)}\Big[\frac{d_{P^*}^{\pi,\gamma}(s,a)}{D_{\pi_b}(s,a)}\Big(E_{x\sim P(\cdot|s,a)}[V_P^\pi(x)] - V_P^\pi(s')\Big)\Big] \\
&= \gamma E_{(s,a,s')\sim D_{\pi_b}P^*(\cdot|s,a)}\Big[w_{P^*}^\pi(s,a)\Big(E_{x\sim P(\cdot|s,a)}[V_P^\pi(x)] - V_P^\pi(s')\Big)\Big] \\
&= \gamma L(w_{P^*}^\pi, V_P^\pi, P),
\end{aligned}
$$
where the first equality is by definition, the second is addition of 0, the third is simplification, the fourth is a change of bounds, the fifth is by definition, the sixth is a relabeling of the integration variables, the seventh and eighth are simplification, the ninth is importance sampling, and the tenth and last are by definition. Since $(w_{P^*}^\pi, V_P^\pi)\in\mathcal{W}\times\mathcal{V}$, then
$$|J(\pi,P) - J(\pi,P^*)| = \gamma|L(w_{P^*}^\pi, V_P^\pi, P)| \le \gamma\max_{w\in\mathcal{W},V\in\mathcal{V}}|L(w,V,P)|.$$
Because $P$ was selected in $\mathcal{P}$ arbitrarily, this holds in particular for $\hat{P}\in\arg\min_{P\in\mathcal{P}}\max_{w\in\mathcal{W},V\in\mathcal{V}}|L(w,V,P)|$, so
$$|J(\pi,\hat{P}) - J(\pi,P^*)| \le \gamma\min_{P\in\mathcal{P}}\max_{w\in\mathcal{W},V\in\mathcal{V}}|L(w,V,P)|.$$
Now, instead, assume $(w_P^\pi, V_{P^*}^\pi)\in\mathcal{W}\times\mathcal{V}$. Fix some $P\in\mathcal{P}$. Then, similarly,
$$
\begin{aligned}
J(\pi,P) - J(\pi,P^*) &= E_{(s,a)\sim d_P^{\pi,\gamma},\, r\sim R(\cdot|s,a)}[r] - E_{d_0}[V_{P^*}^\pi] \\
&= E_{(s,a)\sim d_P^{\pi,\gamma}}[V_{P^*}^\pi(s)] - E_{d_0}[V_{P^*}^\pi] - E_{(s,a)\sim d_P^{\pi,\gamma}}\big[V_{P^*}^\pi(s) - E_{r\sim R(\cdot|s,a)}[r]\big] \\
&= \sum_{t=1}^{\infty}\gamma^t\int d_P^{\pi,t}(s,a)\,V_{P^*}^\pi(s)\,d\nu(s,a) - E_{(s,a)\sim d_P^{\pi,\gamma}}\big[V_{P^*}^\pi(s) - E_{r\sim R(\cdot|s,a)}[r]\big] \\
&= \gamma\sum_{t=0}^{\infty}\gamma^t\int d_P^{\pi,t+1}(s,a)\,V_{P^*}^\pi(s)\,d\nu(s,a) - \gamma E_{(s,a)\sim d_P^{\pi,\gamma}}\big[E_{s'\sim P^*(\cdot|s,a)}[V_{P^*}^\pi(s')]\big] \\
&= \gamma\sum_{t=0}^{\infty}\gamma^t\int d_P^{\pi,t}(\tilde{s},\tilde{a})\,P(s|\tilde{s},\tilde{a})\,\pi(a|s)\,V_{P^*}^\pi(s)\,d\nu(\tilde{s},\tilde{a},s,a) - \gamma E_{(s,a)\sim d_P^{\pi,\gamma}}\big[E_{s'\sim P^*(\cdot|s,a)}[V_{P^*}^\pi(s')]\big] \\
&= \gamma\sum_{t=0}^{\infty}\gamma^t\int d_P^{\pi,t}(s,a)\,P(s'|s,a)\,V_{P^*}^\pi(s')\,d\nu(s,a,s') - \gamma E_{(s,a)\sim d_P^{\pi,\gamma}}\big[E_{s'\sim P^*(\cdot|s,a)}[V_{P^*}^\pi(s')]\big] \\
&= \gamma E_{(s,a)\sim d_P^{\pi,\gamma}}\big[E_{s'\sim P(\cdot|s,a)}[V_{P^*}^\pi(s')]\big] - \gamma E_{(s,a)\sim d_P^{\pi,\gamma}}\big[E_{s'\sim P^*(\cdot|s,a)}[V_{P^*}^\pi(s')]\big] \\
&= \gamma E_{(s,a)\sim d_P^{\pi,\gamma}}\big[E_{s'\sim P(\cdot|s,a)}[V_{P^*}^\pi(s')] - E_{s'\sim P^*(\cdot|s,a)}[V_{P^*}^\pi(s')]\big] \\
&= \gamma E_{(s,a,s')\sim D_{\pi_b}P^*(\cdot|s,a)}\Big[\frac{d_P^{\pi,\gamma}(s,a)}{D_{\pi_b}(s,a)}\Big(E_{x\sim P(\cdot|s,a)}[V_{P^*}^\pi(x)] - V_{P^*}^\pi(s')\Big)\Big] \\
&= \gamma E_{(s,a,s')\sim D_{\pi_b}P^*(\cdot|s,a)}\Big[w_P^\pi(s,a)\Big(E_{x\sim P(\cdot|s,a)}[V_{P^*}^\pi(x)] - V_{P^*}^\pi(s')\Big)\Big] \\
&= \gamma L(w_P^\pi, V_{P^*}^\pi, P),
\end{aligned}
$$
where we follow the same steps as in the previous derivation. Since $(w_P^\pi, V_{P^*}^\pi)\in\mathcal{W}\times\mathcal{V}$, then
$$|J(\pi,P) - J(\pi,P^*)| = \gamma|L(w_P^\pi, V_{P^*}^\pi, P)| \le \gamma\max_{w\in\mathcal{W},V\in\mathcal{V}}|L(w,V,P)|,$$
and, since $P$ was selected in $\mathcal{P}$ arbitrarily, the bound again holds for $\hat{P}$:
$$|J(\pi,\hat{P}) - J(\pi,P^*)| \le \gamma\min_{P\in\mathcal{P}}\max_{w\in\mathcal{W},V\in\mathcal{V}}|L(w,V,P)|.$$
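For ease of reference in the sample-complexity and OPO arguments below, the conclusion that the two cases above establish (the formal statement of Theorem 3.1 is in the main text; this display only collects what was just proved):

$$|J(\pi,\hat{P}) - J(\pi,P^*)| \le \gamma\min_{P\in\mathcal{P}}\ \max_{w\in\mathcal{W},\,V\in\mathcal{V}}|L(w,V,P)|,$$

whenever, for every $P\in\mathcal{P}$, either $(w_{P^*}^\pi, V_P^\pi)\in\mathcal{W}\times\mathcal{V}$ or $(w_P^\pi, V_{P^*}^\pi)\in\mathcal{W}\times\mathcal{V}$.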
B.2 Sample Complexity for OPE
We do not have access to exact expectations, so we must work with $\hat{P}_n = \arg\min_P\max_{w,V} E_n[\ldots]$ instead of $\hat{P} = \arg\min_P\max_{w,V} E[\ldots]$. Furthermore, $J(\pi,\hat{P})$ requires an exact expectation of an infinite sum, $E_{d_0}[\sum_{t=0}^{\infty}\gamma^t r_t]$, where we collect $r_t$ by running $\pi$ in the simulator $\hat{P}$. Instead, we can only estimate an empirical average over a finite sum in $\hat{P}_n$:
$$J_{T,m}(\pi,\hat{P}_n) = \frac{1}{m}\sum_{j=1}^{m}\sum_{t=0}^{T}\gamma^t r_t^j,$$
where each $j$ indexes a rollout starting from $s_0\sim d_0$ and the simulation is over $\hat{P}_n$
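As a concrete illustration, a minimal Python sketch of the truncated Monte-Carlo estimate $J_{T,m}(\pi,\hat{P}_n)$ described above. The callables `model_step`, `policy`, `reward_fn`, and `sample_d0` are hypothetical stand-ins for the learned model $\hat{P}_n$, the policy $\pi$, the reward, and the initial-state distribution; they are not part of the paper.

```python
import numpy as np

def J_Tm(model_step, policy, reward_fn, sample_d0, gamma, T, m, rng=None):
    """Truncated Monte-Carlo OPE estimate J_{T,m}(pi, P_hat_n).

    model_step(s, a, rng) -> s'  : one transition sampled from the learned model
    policy(s, rng)        -> a   : action sampled from pi(.|s)
    reward_fn(s, a)       -> r   : reward
    sample_d0(rng)        -> s0  : initial state drawn from d_0
    """
    rng = rng or np.random.default_rng()
    total = 0.0
    for _ in range(m):                 # m independent rollouts
        s = sample_d0(rng)
        ret, discount = 0.0, 1.0
        for _ in range(T + 1):         # finite horizon t = 0..T
            a = policy(s, rng)
            ret += discount * reward_fn(s, a)
            s = model_step(s, a, rng)  # simulate inside the learned model
            discount *= gamma
        total += ret
    return total / m
```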
Our OPE estimate is therefore bounded as follows:
Theorem B.1 (OPE Error). Let the functions in $\mathcal{V}$ and $\mathcal{W}$ be uniformly bounded by $C_{\mathcal{V}}$ and $C_{\mathcal{W}}$ respectively. Assume the conditions of Theorem 3.1 hold and $|R|\le R_{\max}$, $\gamma\in[0,1)$. Then, with probability $1-\delta$,
$$|J_{T,m}(\pi,\hat{P}_n) - J(\pi,P^*)| \le \gamma\min_{P}\max_{w,V}|L(w,V,P)| + 4\gamma\mathcal{R}_n(\mathcal{W},\mathcal{V},\mathcal{P}) + \frac{2R_{\max}}{1-\gamma}\gamma^{T+1} + \frac{2R_{\max}}{1-\gamma}\sqrt{\log(2/\delta)/(2m)} + 4\gamma C_{\mathcal{W}}C_{\mathcal{V}}\sqrt{\log(2/\delta)/n},$$
where $\mathcal{R}_n(\mathcal{W},\mathcal{V},\mathcal{P})$ is the Rademacher complexity of the function class
$$\{(s,a,s')\mapsto w(s,a)(E_{x\sim P}[V(x)] - V(s')) : w\in\mathcal{W}, V\in\mathcal{V}, P\in\mathcal{P}\}.$$
Proof for Theorem B.1. By definition and the triangle inequality,
$$|J_{T,m}(\pi,\hat{P}_n) - J(\pi,P^*)| = |J_{T,m}(\pi,\hat{P}_n) - J(\pi,\hat{P}_n) + J(\pi,\hat{P}_n) - J(\pi,P^*)| \le \underbrace{|J_{T,m}(\pi,\hat{P}_n) - J(\pi,\hat{P}_n)|}_{(a)} + \underbrace{|J(\pi,\hat{P}_n) - J(\pi,P^*)|}_{(b)}. \qquad (18)$$
Define $\hat{V}_P^{\pi,T}(s_0^i) \equiv \sum_{t=0}^{T}\gamma^t r_t^i$ for a trajectory indexed by $i\in\mathbb{N}$, where $r_t^i$ is the reward obtained by running $\pi$ in $P$ at time $t\le T$ starting at $s_0^i$. For $(a)$,
$$
\begin{aligned}
|J_{T,m}(\pi,\hat{P}_n) - J(\pi,\hat{P}_n)| &= \bigg|\frac{1}{m}\sum_{i=1}^{m}\hat{V}_{\hat{P}_n}^{\pi,T}(s_0^i) - \frac{1}{m}\sum_{i=1}^{m}\hat{V}_{\hat{P}_n}^{\pi,\infty}(s_0^i) + \frac{1}{m}\sum_{i=1}^{m}\hat{V}_{\hat{P}_n}^{\pi,\infty}(s_0^i) - E_{d_0}\big[V_{\hat{P}_n}^{\pi}\big]\bigg| \\
&\le \bigg|\frac{1}{m}\sum_{i=1}^{m}\hat{V}_{\hat{P}_n}^{\pi,T}(s_0^i) - \frac{1}{m}\sum_{i=1}^{m}\hat{V}_{\hat{P}_n}^{\pi,\infty}(s_0^i)\bigg| + \bigg|\frac{1}{m}\sum_{i=1}^{m}\hat{V}_{\hat{P}_n}^{\pi,\infty}(s_0^i) - E_{d_0}\big[V_{\hat{P}_n}^{\pi}\big]\bigg| \\
&\le \frac{2R_{\max}}{1-\gamma}\gamma^{T+1} + \frac{2R_{\max}}{1-\gamma}\sqrt{\log(2/\delta)/(2m)}, \qquad (19)
\end{aligned}
$$
with probability $1-\delta$, where the last inequality follows from the definition of $\hat{V}^{\pi,T}$ and Hoeffding's inequality.
For $(b)$, by Theorem 3.1,
$$
\begin{aligned}
|J(\pi,\hat{P}_n) - J(\pi,P^*)| &= \gamma|L(w_{P^*}^\pi, V_{\hat{P}_n}^\pi, \hat{P}_n)| \le \gamma\max_{w,V}|L(w,V,\hat{P}_n)| \\
&= \gamma\Big(\max_{w,V}|L(w,V,\hat{P}_n)| - \max_{w,V}|L_n(w,V,\hat{P}_n)| + \max_{w,V}|L_n(w,V,\hat{P}_n)| - \max_{w,V}|L(w,V,\hat{P})| + \max_{w,V}|L(w,V,\hat{P})|\Big) \\
&\le \gamma\Big(2\max_{w,V,P}\big||L(w,V,P)| - |L_n(w,V,P)|\big| + \min_{P}\max_{w,V}|L(w,V,P)|\Big) \\
&\le \gamma\Big(2\mathcal{R}'_n(\mathcal{W},\mathcal{V},\mathcal{P}) + 2K\sqrt{\log(2/\delta)/n} + \min_{P}\max_{w,V}|L(w,V,P)|\Big) \\
&\le \gamma\Big(4\mathcal{R}_n(\mathcal{W},\mathcal{V},\mathcal{P}) + 2K\sqrt{\log(2/\delta)/n} + \min_{P}\max_{w,V}|L(w,V,P)|\Big), \qquad (20)
\end{aligned}
$$
where the third line uses that $\hat{P}_n$ minimizes the empirical objective $\max_{w,V}|L_n(w,V,\cdot)|$ (so $\max_{w,V}|L_n(w,V,\hat{P}_n)|\le\max_{w,V}|L_n(w,V,\hat{P})|$) together with $\max_{w,V}|L(w,V,\hat{P})| = \min_{P}\max_{w,V}|L(w,V,P)|$, and the fourth line holds with probability $1-\delta$,
where $\mathcal{R}'_n(\mathcal{W},\mathcal{V},\mathcal{P})$ is the Rademacher complexity of the function class
$$\{(s,a,s')\mapsto |w(s,a)(E_{x\sim P}[V(x)] - V(s'))| : w\in\mathcal{W}, V\in\mathcal{V}, P\in\mathcal{P}\},$$
noting that $K = 2C_{\mathcal{W}}C_{\mathcal{V}}$ uniformly bounds $|w(s,a)(E_{x\sim P(\cdot|s,a)}[V(x)] - V(s'))|$ (Theorem 8 of Bartlett & Mendelson (2001)). Furthermore, since the absolute value is 1-Lipschitz (by the reverse triangle inequality), $\mathcal{R}'_n \le 2\mathcal{R}_n$ (Theorem 12 of Bartlett & Mendelson (2001)), where $\mathcal{R}_n(\mathcal{W},\mathcal{V},\mathcal{P})$ is the Rademacher complexity of the function class
$$\{(s,a,s')\mapsto w(s,a)(E_{x\sim P(\cdot|s,a)}[V(x)] - V(s')) : w\in\mathcal{W}, V\in\mathcal{V}, P\in\mathcal{P}\}.$$
Altogether, combining Eqs. (18), (19), and (20) we get our result.
The first term can be thought of as the estimate under infinite data, the second term as the penalty for using function classes that are too rich, and the remaining terms as the price we pay for finite data and finite computation.
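To make these trade-offs concrete, a small sketch that evaluates the finite-sample terms of Theorem B.1; the constants below are purely illustrative assumptions, and the Rademacher term is left as a symbolic input since it depends on the chosen function classes.

```python
import numpy as np

def ope_error_terms(R_max, gamma, T, m, n, delta, C_W, C_V, rademacher_n=0.0):
    """Finite-sample terms of the Theorem B.1 bound (illustrative only)."""
    truncation = 2 * R_max / (1 - gamma) * gamma ** (T + 1)                      # finite horizon T
    rollout    = 2 * R_max / (1 - gamma) * np.sqrt(np.log(2 / delta) / (2 * m))  # m rollouts
    estimation = 4 * gamma * C_W * C_V * np.sqrt(np.log(2 / delta) / n)          # n batch samples
    complexity = 4 * gamma * rademacher_n                                        # class richness
    return truncation, rollout, estimation, complexity

# Longer rollouts shrink the truncation term geometrically, while the Monte-Carlo
# and batch terms shrink at the usual 1/sqrt(m), 1/sqrt(n) rates.
print(ope_error_terms(R_max=1.0, gamma=0.95, T=200, m=500, n=10_000,
                      delta=0.05, C_W=2.0, C_V=20.0))
```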
B.3 Misspecification for OPE
When the assumptions behind MML do not hold, the minimax loss can underestimate the true error; the misspecification term $H$ below quantifies the gap. The following is the proof of this proposition.
Proof for Prop. 3.5. We have shown already that $J(\pi,\hat{P}) - J(\pi,P^*) = \gamma L(w_{P^*}^\pi, V_{\hat{P}}^\pi, \hat{P})$, which we abbreviate as $\gamma L((WV)^*, \hat{P})$ with $(WV)^*(s,a,s') = w_{P^*}^\pi(s,a)\,V_{\hat{P}}^\pi(s')$. By linearity of $L$ in its $w\cdot V$ argument, for every $h\in\mathcal{H}$ and $P\in\mathcal{P}$,
$$|L((WV)^*, P)| = |L(h,P) + L((WV)^* - h, P)| \le |L(h,P)| + |L((WV)^* - h, P)|.$$
Taking $P = \hat{P}$ and minimizing over $h\in\mathcal{H}$,
$$|L((WV)^*, \hat{P})| \le \max_{h\in\mathcal{H}}|L(h,\hat{P})| + \min_{h\in\mathcal{H}}|L((WV)^* - h, \hat{P})| \le \min_{P\in\mathcal{P}}\max_{h\in\mathcal{H}}|L(h,P)| + \max_{P\in\mathcal{P}}\min_{h\in\mathcal{H}}|L((WV)^* - h, P)|,$$
where the second inequality uses $\max_{h\in\mathcal{H}}|L(h,\hat{P})| = \min_{P\in\mathcal{P}}\max_{h\in\mathcal{H}}|L(h,P)|$ by the definition of $\hat{P}$. With $H = \max_{P\in\mathcal{P}}\min_{h\in\mathcal{H}}|L((WV)^* - h, P)|$, we therefore have
$$|J(\pi,\hat{P}) - J(\pi,P^*)| \le \gamma\Big(\min_{P\in\mathcal{P}}\max_{h\in\mathcal{H}}|L(h,P)| + H\Big),$$
as desired.
B.4 Application to the Online Setting and Brief VAML Comparison
Algorithm 3 is the prototypical online model-based RL algorithm. In contrast to the batch setting, we allow online data collection. We require a function PLANNER, which takes a model $P_k$ and finds the optimal policy $\pi_k$ in $P_k$.
Algorithm 3 Online Model-Based RL
Input: $\pi_0 = \pi_b$, PLANNER($\cdot$)
1: for $k = 0, 1, \ldots, K$ do
2:     Collect data $D_k$ by interacting with the true environment using $\pi_k$.
3:     Fit $P_k \leftarrow \arg\min_{P\in\mathcal{P}}\max_{w\in\mathcal{W},V\in\mathcal{V}} L_{MML}(w,V,P)$ where $D_{\pi_b} = D_k$.
4:     Fit $\pi_k \leftarrow$ PLANNER($P_k$).
5: return $(P_K, \pi_K)$
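A minimal Python skeleton of Algorithm 3, included only to make the loop structure explicit; `collect_data`, `fit_mml`, and `planner` are hypothetical helpers standing in for environment interaction, the minimax fit of line 3, and PLANNER respectively.

```python
def online_model_based_rl(pi_b, K, collect_data, fit_mml, planner):
    """Skeleton of Algorithm 3: alternate data collection, MML model fitting, and planning."""
    pi = pi_b                      # pi_0 = pi_b
    P = None
    for k in range(K + 1):         # k = 0, 1, ..., K
        D_k = collect_data(pi)     # interact with the true environment using the current policy
        P = fit_mml(D_k)           # P_k <- argmin_P max_{w,V} L_MML(w, V, P) with D_{pi_b} = D_k
        pi = planner(P)            # pi_k <- PLANNER(P_k)
    return P, pi                   # (P_K, pi_K)
```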
Here we show that MML lower bounds the VAML error in online model-based RL, the setting for which VAML is designed.
Proposition B.2. Let $\mathcal{W} = \{1\}$. Then
$$\min_{P\in\mathcal{P}}\max_{w\in\mathcal{W},V\in\mathcal{V}} L_{MML}(w,V,P)^2 \le \min_{P\in\mathcal{P}} L_{VAML}(\mathcal{V},P),$$
for every $\mathcal{V}$ and $\mathcal{P}$.
Proof. Fix $P\in\mathcal{P}$. Then, by definition,
$$L_{MML}(w,V,P) = E_{(s,a,s')\sim D_{\pi_b}P^*}\big[w(s,a)\big(E_{x\sim P(\cdot|s,a)}[V(x)] - V(s')\big)\big].$$
Since $\mathcal{W} = \{1\}$, we can eliminate this dependence and get
$$L_{MML}(1,V,P) = E_{(s,a,s')\sim D_{\pi_b}P^*}\big[E_{x\sim P(\cdot|s,a)}[V(x)] - V(s')\big].$$
Explicitly,
$$
\begin{aligned}
L_{MML}(1,V,P)^2 &= \Big(\int\Big(\int P(x|s,a)V(x)\,d\nu(x) - \int P^*(s'|s,a)V(s')\,d\nu(s')\Big)\,d\nu(s,a)\Big)^2 \\
&= \Big(\int\Big(\int\big(P(x|s,a) - P^*(x|s,a)\big)V(x)\,d\nu(x)\Big)\,d\nu(s,a)\Big)^2 \\
&\le \int\Big(\int\big(P(x|s,a) - P^*(x|s,a)\big)V(x)\,d\nu(x)\Big)^2\,d\nu(s,a), \qquad \text{(Cauchy-Schwarz)}
\end{aligned}
$$
Taking the max over $V\in\mathcal{V}$ on both sides and noting that $\max_V\int f(V) \le \int\max_V f(V)$ for any $f$ and $V$,
$$\max_{V\in\mathcal{V}} L_{MML}(1,V,P)^2 \le \int\max_{V\in\mathcal{V}}\Big(\int\big(P(x|s,a) - P^*(x|s,a)\big)V(x)\,d\nu(x)\Big)^2\,d\nu(s,a) \qquad (21)$$
$$= L_{VAML}(\mathcal{V},P). \qquad (22)$$
Since we chose $P$ arbitrarily, Eq. (21) holds for any $P\in\mathcal{P}$. In particular, if $\hat{P}_{VAML} = \arg\min_{P\in\mathcal{P}} L_{VAML}(\mathcal{V},P)$ then
$$\min_{P\in\mathcal{P}}\max_{V\in\mathcal{V}} L_{MML}(1,V,P)^2 \le \max_{V\in\mathcal{V}} L_{MML}(1,V,\hat{P}_{VAML})^2 \le \min_{P\in\mathcal{P}} L_{VAML}(\mathcal{V},P).$$
Prop. B.2 shows that the MML loss is a tighter loss than VAML in the online model-based RL case. In this sense, MML should be the preferred decision-aware loss function even in online model-based RL. An argument in favor of VAML is that it is more computationally tractable under the assumption that $\mathcal{V}$ is a set of linear function approximators. However, if we desire more powerful function approximation, VAML suffers the same computational issues as MML. In general, the pointwise supremum within VAML presents a substantial computational challenge, while the uniform supremum in MML is much milder: it can be formulated as a two-player game and solved via higher-order gradient descent (see Section E.3).
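As one possible instantiation of this two-player formulation, a minimal PyTorch sketch that alternates gradient ascent on parametric $(w, V)$ networks with gradient descent on a parametric model $P$. This uses simple alternating first-order updates rather than the higher-order scheme referenced above; the network signatures, the Gaussian model head, and the sampled approximation of $E_{x\sim P}[V(x)]$ are assumptions for illustration, not the implementation from the paper (see Section E.3 for the kernelized version). In practice $\mathcal{W}$ and $\mathcal{V}$ must also be kept bounded (e.g. via weight clipping or a norm penalty), which is omitted here.

```python
import torch

def mml_loss(batch, model, w_net, v_net, n_model_samples=8):
    """Empirical MML loss L_n(w, V, P) on a batch of (s, a, s') from the behavior data."""
    s, a, s_next = batch                                  # tensors: (B, ds), (B, da), (B, ds)
    mu, log_std = model(s, a)                             # assumed Gaussian model P(.|s,a)
    eps = torch.randn(n_model_samples, *mu.shape)
    x = mu + log_std.exp() * eps                          # reparameterized samples from P(.|s,a)
    ev_model = v_net(x.reshape(-1, mu.shape[-1])).reshape(n_model_samples, -1).mean(0)
    td = ev_model - v_net(s_next).squeeze(-1)             # E_{x~P}[V(x)] - V(s')
    return (w_net(s, a).squeeze(-1) * td).mean()

def train_minimax(data_loader, model, w_net, v_net, epochs=10, lr=1e-3, adv_steps=5):
    """min_P max_{w,V} |L_n(w, V, P)| via alternating gradient updates."""
    opt_model = torch.optim.Adam(model.parameters(), lr=lr)
    opt_adv = torch.optim.Adam(list(w_net.parameters()) + list(v_net.parameters()), lr=lr)
    for _ in range(epochs):
        for batch in data_loader:
            for _ in range(adv_steps):                    # inner max over (w, V): ascend |L_n|
                adv_obj = -mml_loss(batch, model, w_net, v_net).abs()
                opt_adv.zero_grad()
                adv_obj.backward()
                opt_adv.step()
            obj = mml_loss(batch, model, w_net, v_net).abs()  # outer min over P: descend |L_n|
            opt_model.zero_grad()
            obj.backward()
            opt_model.step()
```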
Lastly, VAML defines the pointwise loss with respect to the $L_2$ norm of the difference between $P$ and $P^*$. The choice is justified as being computationally friendlier, but it is noted that $L_1$ may also be reasonable (Farahmand et al., 2017). We show in the following example that, in fact, VAML may not work with a pointwise $L_1$ error.
Example B.1. Let $\mathcal{S} = A\cup B$, a disjoint partition of the state space. For simplicity, assume no dependence on actions. Suppose our models $\mathcal{P} = \{P_\alpha\}_{\alpha\in[0,1]}$ take the form
$$P_\alpha(s'|s) = \begin{cases}\alpha & s'\in A\\ 1-\alpha & s'\in B\end{cases}$$
Suppose also that $P^* = P_{\alpha^*}\in\mathcal{P}$ for some $\alpha^*\in[0,1]$. Let $\mathcal{V} = \{x\,\mathbf{1}_{s\in A}(s) + y\,\mathbf{1}_{s\in B}(s) \mid x,y\in[0,M]\}$, $M\in\mathbb{R}^+$, be all bounded piecewise-constant value functions with $\|V\|_\infty\le M$. Then the empirical VAML loss with $L_1$ pointwise distance does not select $P^*$ when $\alpha^*\in(0,\tfrac{1}{2})\cup(\tfrac{1}{2},1)$ and cannot differentiate between $P^*$ and any other $P\in\mathcal{P}$ when $\alpha^* = \tfrac{1}{2}$. MML does not have this issue.
Proof. To show this, first fix $P_\alpha\in\mathcal{P}$. Then the empirical VAML loss (in expectation) is given by
$$
\begin{aligned}
E_{s\sim P^*}\Big[\max_V\big|E_{x\sim P_\alpha}[V(x)] - V(s)\big|\Big] &= \alpha^*\max_V\big|E_{x\sim P_\alpha}[V(x)] - V(A)\big| + (1-\alpha^*)\max_V\big|E_{x\sim P_\alpha}[V(x)] - V(B)\big| \\
&= \alpha^*\max_{x,y\in[0,M]}|\alpha x + (1-\alpha)y - x| + (1-\alpha^*)\max_{x,y\in[0,M]}|\alpha x + (1-\alpha)y - y| \\
&= \alpha^*\max_{x,y\in[0,M]}|(\alpha-1)(x-y)| + (1-\alpha^*)\max_{x,y\in[0,M]}|\alpha(x-y)| \\
&= \big(\alpha^*|\alpha-1| + (1-\alpha^*)|\alpha|\big)M.
\end{aligned}
$$
If $\alpha^* < .5$ then the minimizer of the above quantity is $\alpha = 0$; if $\alpha^* > .5$ then the minimizer is $\alpha = 1$. Therefore, if $\alpha^*\in(0,.5)\cup(.5,1)$ then VAML picks the wrong model $\alpha\ne\alpha^*$. Additionally, in the case that $\alpha^* = .5$, the loss is $M/2$ for every $P\in\mathcal{P}$. In this case, VAML with $L_1$ cannot differentiate between any of the models; all models are perfectly identical.

On the other hand, we repeat this process with MML:
$$
\begin{aligned}
\big|E_{s\sim P^*}\big[E_{x\sim P_\alpha}[V(x)] - V(s)\big]\big| &= \big|\alpha^*\big(E_{x\sim P_\alpha}[V(x)] - V(A)\big) + (1-\alpha^*)\big(E_{x\sim P_\alpha}[V(x)] - V(B)\big)\big| \\
&= |\alpha^*(\alpha x + (1-\alpha)y - x) + (1-\alpha^*)(\alpha x + (1-\alpha)y - y)| \\
&= |\alpha^*(\alpha-1)(x-y) + (1-\alpha^*)\alpha(x-y)| \\
&= |\alpha - \alpha^*|\,|x-y|.
\end{aligned}
$$
Clearly $\min_{\alpha\in[0,1]}\max_{x,y\in[0,M]}|\alpha-\alpha^*|\,|x-y| = 0$, attained at $\alpha = \alpha^*$.
We do not have to worry about the choice of norm for MML because we know that the OPE error is precisely $L_{MML}$. On the other hand, as shown in the example, this is not the case for VAML.
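A small numeric check of the two closed forms in Example B.1 (a sketch; the grid resolution and the values of $\alpha^*$ and $M$ are arbitrary illustrative choices):

```python
import numpy as np

alpha_star, M = 0.3, 1.0
alphas = np.linspace(0, 1, 101)
grid = np.linspace(0, M, 51)
X, Y = np.meshgrid(grid, grid)         # V(A) = x, V(B) = y over the class V

# VAML with pointwise L1: E_{s~P*}[ max_V |E_{x~P_alpha}[V(x)] - V(s)| ]
def vaml_l1(alpha):
    ev = alpha * X + (1 - alpha) * Y   # E_{x~P_alpha}[V]
    return alpha_star * np.max(np.abs(ev - X)) + (1 - alpha_star) * np.max(np.abs(ev - Y))

# MML with w = 1: max_V |E_{s~P*}[ E_{x~P_alpha}[V(x)] - V(s) ]|
def mml(alpha):
    ev = alpha * X + (1 - alpha) * Y
    return np.max(np.abs(alpha_star * (ev - X) + (1 - alpha_star) * (ev - Y)))

vaml_vals = np.array([vaml_l1(a) for a in alphas])
mml_vals = np.array([mml(a) for a in alphas])
print("VAML-L1 minimizer:", alphas[vaml_vals.argmin()])  # 0.0, not alpha* = 0.3
print("MML minimizer:    ", alphas[mml_vals.argmin()])   # 0.3 = alpha*
```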
C OPO
In this section we explore the OPO results in the order in which they were presented in the main paper.
C.1 Main Result
Proof for Theorem 4.1.
Fix some $P\in\mathcal{P}$. Through addition of 0, we get
$$J(\pi_{P^*}^*,P^*) - J(\pi_P^*,P^*) = J(\pi_{P^*}^*,P^*) - J(\pi_{P^*}^*,P) + J(\pi_{P^*}^*,P) - J(\pi_P^*,P) + J(\pi_P^*,P) - J(\pi_P^*,P^*).$$
Since $\pi_P^*$ is optimal in $P$, we have $J(\pi_{P^*}^*,P) - J(\pi_P^*,P) \le 0$, which implies
$$J(\pi_{P^*}^*,P^*) - J(\pi_P^*,P^*) \le J(\pi_{P^*}^*,P^*) - J(\pi_{P^*}^*,P) + J(\pi_P^*,P) - J(\pi_P^*,P^*).$$
Taking the absolute value of both sides, applying the triangle inequality, and invoking Lemma 3.1 yields
$$|J(\pi_{P^*}^*,P^*) - J(\pi_{\hat{P}}^*,P^*)| \le 2\gamma\max_{w,V}|L(w,V,\hat{P})| = 2\gamma\min_{P}\max_{w,V}|L(w,V,P)|,$$
when $w_{P^*}^{\pi_{P^*}^*}, w_{P^*}^{\pi_P^*}\in\mathcal{W}$ and $V_P^{\pi_{P^*}^*}, V_P^{\pi_P^*}\in\mathcal{V}$ for every $P\in\mathcal{P}$, or alternatively $w_P^{\pi_{P^*}^*}, w_P^{\pi_P^*}\in\mathcal{W}$ and $V_{P^*}^{\pi_{P^*}^*}, V_{P^*}^{\pi_P^*}\in\mathcal{V}$ for every $P\in\mathcal{P}$.
C.2 Sample Complexity for OPO
Since we will only have access to the empirical version $\hat{P}_n$ rather than $\hat{P}$, we provide the following bound.
Theorem C.1 (Learning Error). Let the functions in $\mathcal{V}$ and $\mathcal{W}$ be uniformly bounded by $C_{\mathcal{V}}$ and $C_{\mathcal{W}}$ respectively. Assume the conditions of Theorem 4.1 hold and $|R|\le R_{\max}$, $\gamma\in[0,1)$. Then, with probability $1-\delta$,
$$|J(\pi_{\hat{P}_n}^*,P^*) - J(\pi_{P^*}^*,P^*)| \le 2\gamma\min_{P}\max_{w,V}|L(w,V,P)| + 8\gamma\mathcal{R}_n(\mathcal{W},\mathcal{V},\mathcal{P}) + 8\gamma C_{\mathcal{W}}C_{\mathcal{V}}\sqrt{\log(2/\delta)/n},$$
where $\mathcal{R}_n(\mathcal{W},\mathcal{V},\mathcal{P})$ is the Rademacher complexity of the function class
$$\{(s,a,s')\mapsto w(s,a)(E_{x\sim P}[V(x)] - V(s')) : w\in\mathcal{W}, P\in\mathcal{P}, V\in\mathcal{V}\}.$$
Proof for Theorem C.1. By Theorem 4.1,
$$|J(\pi_{\hat{P}_n}^*,P^*) - J(\pi_{P^*}^*,P^*)| \le 2\gamma\max_{w,V}|L(w,V,\hat{P}_n)|.$$
We have shown in the proof of Theorem B.1 (Eq. (20)) that
$$\max_{w,V}|L(w,V,\hat{P}_n)| \le \min_{P}\max_{w,V}|L(w,V,P)| + 4\mathcal{R}_n(\mathcal{W},\mathcal{V},\mathcal{P}) + 4C_{\mathcal{W}}C_{\mathcal{V}}\sqrt{\log(2/\delta)/n}.$$
Combining the two completes the proof.

This bound has the same interpretation as in the OPE case; see Section B.2.
C.3 Misspecification
As in Section B.3, we show the misspecification gap, now for OPO, in the following result.
Lemma C.2 (OPO Misspecification). Let $\mathcal{H}\subset(\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to\mathbb{R})$ be functions on $(s,a,s')$. Denote $(WV)^*_{P^*} = w_{P^*}^{\pi_{P^*}^*}(s,a)\,V_P^{\pi_{P^*}^*}(s')$ and $(WV)^*_P = w_{P^*}^{\pi_P^*}(s,a)\,V_P^{\pi_P^*}(s')$. Then
$$|J(\pi_{\hat{P}}^*,P^*) - J(\pi_{P^*}^*,P^*)| \le 2\gamma\Big(\min_{P\in\mathcal{P}}\max_{h\in\mathcal{H}}|L(h,P)| + H\Big), \qquad (23)$$
where $H = \max\Big(\max_{P\in\mathcal{P}}\min_{h\in\mathcal{H}}|L((WV)^*_{P^*}-h,P)|,\ \max_{P\in\mathcal{P}}\min_{g\in\mathcal{H}}|L((WV)^*_P-g,P)|\Big)$.
Proof for Lemma C.2. From the proof of Theorem 4.1,
$$J(\pi_{P^*}^*,P^*) - J(\pi_P^*,P^*) \le J(\pi_{P^*}^*,P^*) - J(\pi_{P^*}^*,P) + J(\pi_P^*,P) - J(\pi_P^*,P^*) = \gamma L(w_{P^*}^{\pi_{P^*}^*}, V_P^{\pi_{P^*}^*}, P) + \gamma L(w_{P^*}^{\pi_P^*}, V_P^{\pi_P^*}, P).$$
Using the result from the proof of Prop. 3.5, for every $h,g\in\mathcal{H}$,
$$
\begin{aligned}
\big|L(w_{P^*}^{\pi_{P^*}^*}, V_P^{\pi_{P^*}^*}, P) + L(w_{P^*}^{\pi_P^*}, V_P^{\pi_P^*}, P)\big| &\le \big|L(h,P) + L((WV)^*_{P^*}-h,P)\big| + \big|L(g,P) + L((WV)^*_P-g,P)\big| \\
&\le 2\min_{P}\max_{h\in\mathcal{H}}|L(h,P)| + \max_{P}\min_{h\in\mathcal{H}}|L((WV)^*_{P^*}-h,P)| + \max_{P}\min_{g\in\mathcal{H}}|L((WV)^*_P-g,P)| \\
&\le 2\Big(\min_{P}\max_{h\in\mathcal{H}}|L(h,P)| + H\Big),
\end{aligned}
$$
where $H = \max\big(\max_{P}\min_{h}|L((WV)^*_{P^*}-h,P)|,\ \max_{P}\min_{g}|L((WV)^*_P-g,P)|\big)$. Therefore
$$|J(\pi_{\hat{P}}^*,P^*) - J(\pi_{P^*}^*,P^*)| \le 2\gamma\Big(\min_{P}\max_{h}|L(h,P)| + H\Big),$$
as desired.
D Additional theory
In this section, we provide additional results that were not covered in the paper. Specifically, we show that if we make $\mathcal{W}$, $\mathcal{V}$ too rich, then the only model with zero loss is $P^*$ itself, which may not be in $\mathcal{P}$.
D.1 Necessary and sufficient conditions for uniqueness of $|L(w,V,P)| = 0$
When $\mathcal{W}$ and $\mathcal{V}$ are $L^2$ spaces, the solution to $|L| = 0$ is uniquely determined:
Lemma D.1 (Necessary and Sufficient). $L(w,V,P) = 0$ for all $w\in L^2(\mathcal{X},\nu) = \{g : \int g^2(x,a)\,d\nu(x,a) < \infty\}$ and $V\in L^2(\mathcal{S},\nu) = \{f : \int f^2(x)\,d\nu(x) < \infty\}$ if and only if $P = P^*$ wherever $D_{\pi_b}(s,a)\ne 0$.
Corollary D.2. The same result holds if $w\cdot V\in L^2(\mathcal{X}\times\mathcal{S},\nu) = \{h : \int h^2(x,a,x')\,d\nu(x,a,x') < \infty\}$.
Proof for Lemma D.1 and Corollary D.2. We begin with the definition of $L$ and expand the expectation:
$$
\begin{aligned}
L(w,V,P) &= E_{(s,a,s')\sim D_{\pi_b}(\cdot,\cdot)P^*(\cdot|s,a)}\Big[w(s,a)\big(E_{x\sim P(\cdot|s,a)}[V(x)] - V(s')\big)\Big] \\
&= E_{(s,a)\sim D_{\pi_b}(\cdot,\cdot)}\Big[w(s,a)\big(E_{s'\sim P(\cdot|s,a)}[V(s')] - E_{s'\sim P^*(\cdot|s,a)}[V(s')]\big)\Big] \\
&= \int D_{\pi_b}(s,a)\,w(s,a)\,V(s')\big(P(s'|s,a) - P^*(s'|s,a)\big)\,d\nu(s,a,s').
\end{aligned}
$$
Clearly if $P = P^*$ then $L(w,V,P) = 0$. For the other direction, suppose $L(w,V,P) = 0$ for all such $w$ and $V$. By assumption, $w(s,a)$ can take on any function in $L^2(\mathcal{X},\nu)$, and therefore $L(w,V,P) = 0$ implies
$$\int V(s')\big(P(s'|s,a) - P^*(s'|s,a)\big)\,d\nu(s') = 0 \qquad (24)$$
wherever $D_{\pi_b}(s,a)\ne 0$. Similarly, $V(s')$ can take on any function in $L^2(\mathcal{S},\nu)$, and therefore if equation (24) holds then $P = P^*$. For the corollary, let $(w,V)\in\mathcal{W}\mathcal{V}$ take on any function in $L^2(\mathcal{X}\times\mathcal{S},\nu)$. If $L(w,V,P) = 0$ then $P(s'|s,a) - P^*(s'|s,a) = 0$, as desired.
In an RKHS, when the kernel is integrally strictly positive definite (ISPD), $P = P^*$ remains the unique minimizer of the MML loss:
Lemma D.3 (Realizability means zero loss even in RKHS). $L(w,V,P) = 0$ if and only if $P = P^*$, for all $(w,V)\in\{(w(s,a),V(s')) : \langle wV, wV\rangle_{\mathcal{H}_k}\le 1,\ w:\mathcal{S}\times\mathcal{A}\to\mathbb{R},\ V:\mathcal{S}\to\mathbb{R}\}$ in an RKHS with an integrally strictly positive definite (ISPD) kernel.
Proof for Lemma D.3. Uehara et al. (2020) prove an analogous result; the proof here is included for the reader's convenience. By Mercer's theorem (Mohri et al., 2012), there exists an orthonormal basis $(\varphi_j)_{j=1}^{\infty}$ of $L^2(\mathcal{X}\times\mathcal{S},\nu)$ such that the RKHS is represented as
$$\mathcal{W}\mathcal{V} = \Big\{w\cdot V = \sum_{j=1}^{\infty} b_j\varphi_j \ \Big|\ (b_j)_{j=1}^{\infty}\in\ell^2(\mathbb{N})\ \text{with}\ \sum_{j=1}^{\infty}\frac{b_j^2}{\mu_j} < \infty\Big\},$$
where each $\mu_j$ is a positive value since the kernel is ISPD. Suppose there exists some $P\in\mathcal{P}$ such that $L(w,V,P) = 0$ for all $(w,V)\in\mathcal{W}\mathcal{V}$ and $P\ne P^*$. Then, by taking $b_j = 1$ when $j = j'$ and $b_j = 0$ when $j\ne j'$, for any $j'\in\mathbb{N}$, we have $L(\varphi_{j'},P) = 0$, where we treat $w\cdot V$ as a single input to $L$. This implies $L(w,V,P) = 0$ for all $w\cdot V\in L^2(\mathcal{X}\times\mathcal{S},\nu)$. This contradicts Corollary D.2, concluding the proof.