Compactly Restrictable Metric Policy Optimization Problems
Victor D. Dorobantu, Kamyar Azizzadenesheli, and Yisong Yue
Abstract— We study policy optimization problems for deterministic Markov decision processes (MDPs) with metric state and action spaces, which we refer to as Metric Policy Optimization Problems (MPOPs). Our goal is to establish theoretical results on the well-posedness of MPOPs that can characterize practically relevant continuous control systems. To do so, we define a special class of MPOPs called Compactly Restrictable MPOPs (CR-MPOPs), which are flexible enough to capture the complex behavior of robotic systems but specific enough to admit solutions using dynamic programming methods such as value iteration. We show how to arrive at CR-MPOPs using forward-invariance. We further show that our theoretical results on CR-MPOPs can be used to characterize feedback linearizable control affine systems.
Index Terms— Continuous Markov Decision Processes, Reinforcement Learning, Optimal Control, Value Iteration, Selection Theorems, Sampled-Data, Physical Systems
I. INTRODUCTION
Policy optimization is a cornerstone in planning, control and
reinforcement learning. Classic approaches include value itera-
tion and policy iteration [1]–[5]. From a theoretical standpoint, a key
step is establishing when policy optimization is well-posed, i.e., when
optimal policies exist, and when they can be found algorithmically.
While such results for value and policy iteration are well established
for discrete systems with finitely many states and actions (also known
as the tabular setting), relatively few foundational results have been
established for continuous control.
In this paper, we study Metric Policy Optimization Problems
(MPOPs), which come endowed with metric state and action spaces.
Compared to tabular MDPs, several new challenges arise in the
continuous setting. Even for deterministic problems, rewards may be
unbounded, maxima of functions (over actions) may not exist, and
the domains of value functions may be different for different policies.
Without addressing these challenges, optimal policies need not exist
and value iteration or policy iteration may be impossible.
In order to establish well-posedness of dynamic programming ap-
proaches, we define the class of Compactly Restrictable MPOPs (CR-
MPOPs). We show that CR-MPOPs arise naturally when imposing
forward-invariance constraints on the policy class one optimizes over.
As such, CR-MPOPs are well suited for characterizing many systems
which rely on forward-invariance for controller design.
Sampled-data design [6] allows us to synthesize policies for
continuous-time systems when inputs are passed through a zero-order
hold (held constant over fixed frequency time intervals), as is realistic
for many physical systems. We leverage recent results [7], [8] to
certify the existence of a compact subset of the state space that can be
rendered forward-invariant through control. These results are readily
applicable to feedback linearizable control affine systems, allowing
us to design CR-MPOPs for a wide array of complex systems.
Submitted May 15th, 2021. Resubmitted July 6th, 2022. This work
was supported in part by DARPA and Beyond Limits. Victor D. Dorobantu
was also supported in part by a Kortschak Fellowship.
Victor D. Dorobantu and Yisong Yue are with the Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125 USA (e-mails: {vdoroban, yyue}@caltech.edu).
Yisong Yue is affiliated with Argo AI, Pittsburgh, PA. Kamyar Azizzadenesheli is with the Department of Computer Science, Purdue University, West Lafayette, IN 47907 USA (e-mail: kamyar@purdue.edu).
Many frameworks extend the theory of discrete MDPs to general
state and action spaces. Most relevant for our work, [1] develops
semicontinuous MDPs with Borel measurable policy classes, [9]
and [10] develop results for analytically and universally measurable
policy classes, respectively, and [11] refines conditions for the well-
posedness of value iteration. For a more complete summary, see [12].
Our work focuses on developing classes of policy optimization prob-
lems that can naturally connect theoretical results from reinforcement
learning with nonlinear continuous-time control, and can be directly
translated to checking certain properties in control systems.
Methods that maintain the forward-invariance of subsets of the
state space of interest have been well-studied [13]–[15]. These and
related methods have recently found applications in safe reinforce-
ment learning [16]–[19]. Physical systems and robotic platforms
have been popular applications for reinforcement learning methods
recently, with the majority of methods employing function approx-
imation, discretizations (such as fixed or adaptive gridding or state
aggregation), or direct policy search [20]–[22]. While policy search
methods generally have no guarantees of global optimality, they
have received much attention due to the scalability problems of
dynamic programming methods (especially after discretization) and
convergence problems of approximate dynamic programming (for a
more complete discussion of the relative merits of these methods,
see [20, §2.3]). In contrast, our goal is to identify settings that are
compatible with value iteration and to use control theoretic tools to
guide solution methods so we can expect good performance.
A. Contributions
We develop our results in the following phases: 1) We describe a
generic property (Assumption 1) that admits the well-posedness of
value iteration for MPOPs (Theorem 1). 2) We leverage the forward-
invariance of compact sets to design MPOPs and policy classes that
satisfy Assumption 1, allowing us to prove well-posedness of a large
class of policy optimization problems (Theorem 2). Such problem
settings are called CR-MPOPs. 3) We apply our results on CR-
MPOPs to analyze control affine systems with time-sampled control
inputs. 4) We further apply our results to analyze robotic systems with
time-sampled control inputs, which comprise a large class of control
affine systems. These results translate the generic requirements for
well-posedness to concrete requirements on the robotic system. 5) We
finally show that full-state feedback linearizable control affine sys-
tems can render a compact subset of the state space forward invariant
with continuous controllers (thus satisfying conditions of Theorem 2),
demonstrating practically relevant instances of CR-MPOPs.
B. Notation, Conventions, and Definitions
We consider nonempty metric spaces throughout this paper and endow each such space with the $\sigma$-algebra generated by its topology, called the Borel $\sigma$-algebra. In this case, open sets and closed sets are measurable sets and continuous functions are measurable functions. A metric space is separable if it contains a countable dense subset. Finite or countable sets can be regarded as separable metric spaces when equipped with the discrete metric [23, §3.A]. For nonempty metric spaces $X$ and $Y$, the set of measurable functions from $X$ to $Y$ is denoted $L^0(X; Y)$. A bounded measurable function $f : X \to \mathbb{R}$ is
upper semicontinuous if for any $c \in \mathbb{R}$, the preimage $f^{-1}([c, \infty)) = \{x \in X : f(x) \geq c\}$ is closed.
The set of all subsets of $Y$ is the powerset of $Y$, denoted $\mathcal{P}(Y)$. A set-valued function $C : X \to \mathcal{P}(Y)$ is called a correspondence. The graph of $C$ is defined as:
$$\mathrm{Graph}(C) = \{(x, y) \in X \times Y : y \in C(x)\}. \quad (1)$$
If $C(x) \neq \emptyset$ for all $x \in X$, then a selector of $C$ is a function $f : X \to Y$ satisfying $f(x) \in C(x)$ for all $x \in X$. For any set $B \subseteq Y$, the lower preimage of $B$ under $C$ is defined as:
$$C^{\ell}(B) = \{x \in X : C(x) \cap B \neq \emptyset\}. \quad (2)$$
The correspondence $C$ is upper hemicontinuous if the lower preimage of each closed set is a closed set [24, Lemma 17.4]. (Some authors call such correspondences upper semicontinuous; we use hemicontinuous to distinguish correspondences from real-valued functions.)
A nonempty subset $X_0 \subseteq X$ is a metric space when equipped with the restriction of the metric on $X$ to $X_0 \times X_0$. For any open set $U \subseteq X_0$, there is an open set $V \subseteq X$ such that $U = X_0 \cap V$. For any continuous function $f : X \to Y$, the restriction $f|_{X_0} : X_0 \to Y$ is continuous. Similarly, for any measurable set $A \subseteq X_0$, there is a measurable set $B \subseteq X$ such that $A = X_0 \cap B$, and for any measurable function $f : X \to Y$, the restriction $f|_{X_0}$ is measurable.
Consider a nonempty metric space $Z$ and a function $f : X \times Y \to Z$. For $x \in X$ and $y \in Y$, the $x$-section $f_x : Y \to Z$ and the $y$-section $f_y : X \to Z$ are defined as:
$$f_x(y) = f(x, y), \qquad f_y(x) = f(x, y), \quad (3)$$
for all $x \in X$ and $y \in Y$. If $f$ is continuous, then $f$ has continuous sections, and if $f$ is measurable, then $f$ has measurable sections. We also define sections for functions on graphs. Consider a correspondence $C : X \to \mathcal{P}(Y)$ with $C(x) \neq \emptyset$ for all $x \in X$, and a function $f : \mathrm{Graph}(C) \to Z$. For $x \in X$, we define the $x$-section $f_x : C(x) \to Z$ as above.
$\mathbb{N}$ denotes the natural numbers, $\mathbb{Z}_+$ denotes the nonnegative integers, $\mathbb{R}_+$ and $\mathbb{R}_{++}$ denote the nonnegative and positive reals, and $\mathbb{S}^d_+$ and $\mathbb{S}^d_{++}$ denote the $d \times d$ positive semidefinite and positive definite matrices, respectively.
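To ground the set-valued notation, the following minimal Python sketch (our own illustration, with hypothetical finite sets) represents a correspondence as a dictionary and computes its graph, the lower preimage of a set, and one selector.

```python
# Minimal illustration of a correspondence C : X -> P(Y) on finite sets,
# its graph, the lower preimage of a set B, and a selector.
# The sets and values below are hypothetical examples.

X = {0, 1, 2}
C = {0: {"a", "b"}, 1: {"b"}, 2: {"a", "c"}}  # C(x) is nonempty for each x

# Graph(C) = {(x, y) : y in C(x)}
graph = {(x, y) for x in X for y in C[x]}

# Lower preimage C^l(B) = {x : C(x) intersects B}
def lower_preimage(C, B):
    return {x for x, values in C.items() if values & set(B)}

# A selector picks one admissible y for each x, e.g. the lexicographic minimum.
selector = {x: min(C[x]) for x in X}

print(sorted(graph))              # [(0, 'a'), (0, 'b'), (1, 'b'), (2, 'a'), (2, 'c')]
print(lower_preimage(C, {"c"}))   # {2}
print(sorted(selector.items()))   # [(0, 'a'), (1, 'b'), (2, 'a')]
```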
II. METRIC POLICY OPTIMIZATION PROBLEMS
In this paper, a deterministic Markov decision process (MDP) will be characterized by a tuple $(\mathcal{S}, \mathcal{A}, C, f, r, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, action-admissibility correspondence $C : \mathcal{S} \to \mathcal{P}(\mathcal{A})$ satisfying $C(s) \neq \emptyset$ for each state $s \in \mathcal{S}$, transition map $f : \mathrm{Graph}(C) \to \mathcal{S}$, reward function $r : \mathrm{Graph}(C) \to \mathbb{R}$, and discount factor $\gamma \in [0, 1)$. Note that while $r$ may be unbounded, $r$ cannot assume the values $\pm\infty$. We refer to an MDP as a metric MDP if $\mathcal{S}$ and $\mathcal{A}$ are nonempty separable metric spaces and $f$ and $r$ are measurable functions. In this paper, we focus solely on metric MDPs, and often refer to them simply as MDPs for brevity.
We will limit our consideration to deterministic, Markovian, and stationary (time-invariant) policies. In this case, a policy $\pi : \mathcal{S} \to \mathcal{A}$ is a selector of $C$; that is, $\pi(s) \in C(s)$ for all states $s \in \mathcal{S}$. The corresponding closed-loop transition map $f_\pi : \mathcal{S} \to \mathcal{S}$ and single-step reward function $r_\pi : \mathcal{S} \to \mathbb{R}$ are defined as:
$$f_\pi(s) = f(s, \pi(s)), \qquad r_\pi(s) = r(s, \pi(s)), \quad (4)$$
for all states $s \in \mathcal{S}$. For any $t \in \mathbb{Z}_+$, let $f_\pi^t : \mathcal{S} \to \mathcal{S}$ denote the $t$-iterated composition of $f_\pi$. Define the subset $\mathcal{S}_\pi \subseteq \mathcal{S}$ as:
$$\mathcal{S}_\pi = \Big\{ s \in \mathcal{S} : \sum_{t=0}^{\infty} \gamma^t r_\pi(f_\pi^t(s)) \text{ converges absolutely} \Big\}. \quad (5)$$
If $r_\pi$ is bounded, then $\mathcal{S}_\pi = \mathcal{S}$. If $\gamma > 0$, then $f_\pi(\mathcal{S}_\pi) \subseteq \mathcal{S}_\pi$, as a convergent series still converges after the first term is removed. If $\mathcal{S}_\pi \neq \emptyset$, then the corresponding value function $V_\pi : \mathcal{S}_\pi \to \mathbb{R}$ is defined explicitly and implicitly, for all states $s \in \mathcal{S}_\pi$, as:
$$V_\pi(s) = \sum_{t=0}^{\infty} \gamma^t r_\pi(f_\pi^t(s)) = r_\pi(s) + \gamma V_\pi(f_\pi(s)). \quad (6)$$
A set of policies $\Pi$ admits a partial order if $\mathcal{S}_\pi = \mathcal{S}$ for all policies $\pi \in \Pi$. In this case, for policies $\pi, \pi' \in \Pi$, if $V_\pi(s) \geq V_{\pi'}(s)$ for all states $s \in \mathcal{S}$, then $\pi \succeq \pi'$.
Definition 1 (Metric Policy Optimization Problem). We refer to the pair of an MDP $(\mathcal{S}, \mathcal{A}, C, f, r, \gamma)$ and a policy class $\Pi$ admitting a partial order as a metric policy optimization problem (MPOP). We call an MPOP well-posed if there is an optimal policy $\pi^\star \in \Pi$ with $\pi^\star \succeq \pi$ for all policies $\pi \in \Pi$.
Goals of this section.
Our goal is to analyze well-posedness of
policy optimization for MDPs. We first describe sufficient conditions
for well-posedness of value iteration (Theorem 1). We then establish
conditions under which ill-posed MPOPs can be restricted to well-
posed problems by only considering policies that render the same
subset of the state space forward-invariant (Theorem 2), resulting in
CR-MPOPs. We will show how to apply our results on value iteration
to control affine systems in Section III.
Throughout this section, we refer to and modify the following
example to ground our presentation:
Example 1. Consider the state space $\mathcal{S} = \mathbb{R}$, action space $\mathcal{A} = \mathbb{R}$, and the constant action-admissibility correspondence $C \equiv \mathcal{A}$. Additionally, consider the transition function $f : \mathrm{Graph}(C) \to \mathcal{S}$ and reward function $r : \mathrm{Graph}(C) \to \mathbb{R}$ defined as:
$$f(s, a) = s + \tanh(a), \qquad r(s, a) = -s^2 - (\tanh(a))^2, \quad (7)$$
for all state-action pairs $(s, a) \in \mathrm{Graph}(C)$, and a discount factor $\gamma \in [0, 1)$. Note that $|f(s, a)| \leq |s| + |\tanh(a)| \leq |s| + 1$ for all state-action pairs $(s, a) \in \mathrm{Graph}(C)$. Consider any policy $\pi : \mathcal{S} \to \mathcal{A}$. By induction, we have $|f_\pi^t(s)| \leq |s| + t$ for all states $s \in \mathcal{S}$ and $t \in \mathbb{Z}_+$. For any state $s \in \mathcal{S}$ and $T \in \mathbb{Z}_+$, we have:
$$\sum_{t=0}^{T} \gamma^t r_\pi(f_\pi^t(s)) \geq -\sum_{t=0}^{\infty} \gamma^t \big( (f_\pi^t(s))^2 + (\tanh(\pi(f_\pi^t(s))))^2 \big) \geq -\sum_{t=0}^{\infty} \gamma^t \big( s^2 + 2|s|t + t^2 + 1 \big) = -\frac{s^2 + 1}{1 - \gamma} - \frac{2\gamma|s|}{(1 - \gamma)^2} - \frac{\gamma(\gamma + 1)}{(1 - \gamma)^3}. \quad (8)$$
The sequence of partial sums as $T \to \infty$ is monotone and bounded below, so $V_\pi(s)$ is well-defined. As $s$ was arbitrary, we have $\mathcal{S}_\pi = \mathcal{S}$.
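As a sanity check that is not part of the paper, the short Python sketch below simulates Example 1 under an arbitrary (hypothetical) policy and verifies that truncated discounted returns respect the lower bound (8).

```python
import numpy as np

# Example 1 dynamics and reward: f(s,a) = s + tanh(a), r(s,a) = -s^2 - tanh(a)^2.
def f(s, a):
    return s + np.tanh(a)

def r(s, a):
    return -s**2 - np.tanh(a)**2

def truncated_return(policy, s, gamma, T):
    """Partial sum sum_{t=0}^{T} gamma^t r(s_t, policy(s_t)) along the closed-loop trajectory."""
    total = 0.0
    for t in range(T + 1):
        a = policy(s)
        total += gamma**t * r(s, a)
        s = f(s, a)
    return total

def lower_bound(s, gamma):
    """Right-hand side of the bound (8)."""
    return -(s**2 + 1) / (1 - gamma) - 2 * gamma * abs(s) / (1 - gamma)**2 \
           - gamma * (gamma + 1) / (1 - gamma)**3

gamma = 0.9
policy = lambda s: 0.5 * np.sin(s)   # an arbitrary, hypothetical policy
for s0 in [-2.0, 0.0, 3.0]:
    assert truncated_return(policy, s0, gamma, T=200) >= lower_bound(s0, gamma)
```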
A. Value Iteration
Consider a policy class $\Pi$ with $\mathcal{S}_\pi = \mathcal{S}$ for each policy $\pi \in \Pi$ and $\sup_{\pi \in \Pi} V_\pi(s) < \infty$ for each state $s \in \mathcal{S}$. Accordingly, we define the optimal (with respect to $\Pi$) value function $V^\star : \mathcal{S} \to \mathbb{R}$ as:
$$V^\star(s) = \sup_{\pi \in \Pi} V_\pi(s), \quad (9)$$
for all states $s \in \mathcal{S}$. If a policy $\pi^\star \in \Pi$ satisfies $V_{\pi^\star} = V^\star$, then $\pi^\star$ is optimal and the policy optimization problem is well-posed.
Value iteration generates a sequence of approximations of $V^\star$. Given an initial guess $V_0 : \mathcal{S} \to \mathbb{R}$, we seek a sequence of real-valued functions $\{V_n : n \in \mathbb{N}\}$ satisfying:
$$V_{n+1}(s) = \sup_{a \in C(s)} \big\{ r(s, a) + \gamma V_n(f(s, a)) \big\}, \quad (10)$$
for all states $s \in \mathcal{S}$ and $n \in \mathbb{Z}_+$.
We now describe conditions under which MPOPs are well-posed and can be solved with value iteration (Theorem 1). When assumptions are strengthened by requiring $\mathcal{S}$ and $\mathcal{A}$ to be Polish spaces, we will make use of [11, Theorem 4.1] (the setting of this theorem is called a semicontinuous model in [1, §6.7]). We provide a more direct proof of well-posedness without these additional assumptions in the appendix.
Assumption 1. The action-admissibility correspondence $C$ has compact values and is upper hemicontinuous, the transition function $f$ is continuous, the reward function $r$ is upper semicontinuous and bounded, and the policy class $\Pi$ is the set of all measurable selectors of $C$. Moreover, for each state-action pair $(s, a) \in \mathrm{Graph}(C)$, there is a corresponding policy $\pi_{s,a} \in \Pi$ satisfying $\pi_{s,a}(s) = a$.
Remark 1. If the projection of $\mathrm{Graph}(C)$ onto the action space is compact, then the assumption that for any state-action pair $(s, a) \in \mathrm{Graph}(C)$, there exists a policy $\pi_{s,a} \in \Pi$ with $\pi_{s,a}(s) = a$ can be removed. See Lemma 1.
Remark 2 (Tabular MDP). If $\mathcal{S}$ and $\mathcal{A}$ are finite sets equipped with discrete metrics and $\Pi$ is the set of all selectors of $C$, then these assumptions are immediately met. This follows since the discrete metric renders all sets open, closed, and compact, and all functions defined on such sets are continuous (and bounded if real-valued).
Example 2. Consider the MDP $(\mathcal{S}, \mathcal{A}, C, f, r, \gamma)$ from Example 1. While $C$ is upper hemicontinuous, it is not compact-valued. While $f$ and $r$ are continuous, $r$ is unbounded. Consider the constant correspondence $\tilde{C} \equiv [-1, 1]$, and the reward $\tilde{r} : \mathrm{Graph}(\tilde{C}) \to \mathbb{R}$:
$$\tilde{r}(s, a) = e^{-s^2} - (\tanh(a))^2, \quad (11)$$
for all state-action pairs $(s, a) \in \mathrm{Graph}(\tilde{C})$. Note that $\mathrm{Graph}(\tilde{C}) = \mathcal{S} \times [-1, 1]$, so $\tilde{r}$ is bounded. Since $f$ is continuous, its restriction to $\mathrm{Graph}(\tilde{C})$ is as well. With $\Pi$ the set of all measurable selectors of $\tilde{C}$, note that for every pair $(s, a) \in \mathrm{Graph}(\tilde{C})$, the constant policy $\pi_{s,a} \equiv a$ is a measurable selector of $\tilde{C}$, so $\pi_{s,a} \in \Pi$. Thus, the MPOP defined by the MDP $(\mathcal{S}, \mathcal{A}, \tilde{C}, f|_{\mathrm{Graph}(\tilde{C})}, \tilde{r}, \gamma)$ and policy class $\Pi$ satisfies Assumption 1.
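For intuition only, the following Python sketch runs the value iteration recursion (10) on a crude discretization of the restricted MDP from Example 2; the state grid on $[-5, 5]$, the finite action sample, the nearest-neighbor projection of next states onto the grid, and the stopping tolerance are our own illustrative choices rather than part of the construction.

```python
import numpy as np

gamma = 0.9                                # hypothetical discount factor
states = np.linspace(-5.0, 5.0, 201)       # truncated, gridded state space (an approximation)
actions = np.linspace(-1.0, 1.0, 41)       # finite sample of the restricted actions [-1, 1]

def f(s, a):
    return s + np.tanh(a)

def r_tilde(s, a):
    return np.exp(-s**2) - np.tanh(a)**2

def nearest_index(x):
    # Project a (possibly off-grid) next state onto the closest grid point.
    return np.abs(states[None, :] - x[..., None]).argmin(axis=-1)

V = np.zeros(len(states))                  # bounded initial guess V_0 = 0
for _ in range(200):
    # Q[i, j] = r(s_i, a_j) + gamma * V(f(s_i, a_j)), with f(s_i, a_j) snapped to the grid.
    S, A = np.meshgrid(states, actions, indexing="ij")
    Q = r_tilde(S, A) + gamma * V[nearest_index(f(S, A))]
    V_new = Q.max(axis=1)                  # maximize over the sampled actions, as in (10)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

greedy_action = actions[Q.argmax(axis=1)]  # an approximately optimal policy on the grid
```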
We now demonstrate well-posedness under Assumption 1.
Theorem 1. Consider an MPOP characterized by an MDP $(\mathcal{S}, \mathcal{A}, C, f, r, \gamma)$ and a policy class $\Pi$ that satisfies Assumption 1. There is an optimal policy $\pi^\star \in \Pi$ satisfying $V_{\pi^\star} = V^\star$, and $V^\star$ is the limit of value iteration when the initial guess is bounded and upper semicontinuous.
Proof. If $\mathcal{S}$ and $\mathcal{A}$ are Polish spaces, we can show that the MDP in question satisfies the three conditions of Assumptions (W*) of [11], in which case there is an optimal policy and $V^\star$ is the limit of value iteration when the initial condition is the zero function (a conclusion of [11, Theorem 4.1]). The first condition follows since $r$ is upper semicontinuous and bounded above (as it is bounded both above and below), which corresponds to a lower semicontinuous cost function that is bounded below. The second condition is satisfied by the upper hemicontinuity of $C$ (using the sequential definition in [24, Theorem 17.20]) since $r$ is also bounded below. The third condition is satisfied since the MDP is deterministic (for which transition probabilities are represented by Dirac measures). In this case, we require:
$$\lim_{n \to \infty} \phi(f(s_n, a_n)) = \phi(f(s, a)), \quad (12)$$
for any bounded continuous function $\phi : \mathcal{S} \to \mathbb{R}$ and any sequence $\{(s_n, a_n) \in \mathrm{Graph}(C) : n \in \mathbb{N}\}$ converging to $(s, a) \in \mathrm{Graph}(C)$; equality follows since $\phi \circ f$ is continuous (as the composition of continuous functions).
If $\mathcal{S}$ and $\mathcal{A}$ are not both Polish spaces, we prove well-posedness in the appendix, in which case $V^\star$ is the limit of value iteration when the initial guess is any bounded and upper semicontinuous function.
We conclude with a lemma mentioned in Remark 1, allowing
Assumption 1 to be relaxed slightly.
Lemma 1. Suppose an MDP $(\mathcal{S}, \mathcal{A}, C, f, r, \gamma)$ and policy class $\Pi$ satisfy Assumption 1, suppose the projection of $\mathrm{Graph}(C)$ onto $\mathcal{A}$ is compact, and let $\rho_{\mathcal{A}} : \mathcal{A} \times \mathcal{A} \to \mathbb{R}_+$ denote the metric on $\mathcal{A}$. For any state-action pair $(s, a) \in \mathrm{Graph}(C)$, the function $\phi_{s,a} : \mathrm{Graph}(C) \to \mathbb{R}$ defined as:
$$\phi_{s,a}(s', a') = -\rho_{\mathcal{A}}(a', a), \quad (13)$$
for all state-action pairs $(s', a') \in \mathrm{Graph}(C)$, admits a maximizing policy $\pi_{s,a} \in \Pi$ for which $\pi_{s,a}(s) = a$.
Proof. Since $\rho_{\mathcal{A}}$ is continuous, so is $\phi_{s,a}$. Since the projection of $\mathrm{Graph}(C)$ onto $\mathcal{A}$ is compact, $\phi_{s,a}$ is bounded. If $\mathcal{S}$ and $\mathcal{A}$ are Polish spaces, then the proof of [11, Lemma 3.4] uses the Arsenin-Kunugui theorem to show the existence of a policy $\pi_{s,a} \in \Pi$ satisfying:
$$\phi_{s,a}(s', \pi_{s,a}(s')) = \max_{a' \in C(s')} \phi_{s,a}(s', a') = -\min_{a' \in C(s')} \rho_{\mathcal{A}}(a', a), \quad (14)$$
for all states $s' \in \mathcal{S}$. In particular, $\pi_{s,a}(s) = a$, as $\phi_{s,a}(s, \pi_{s,a}(s)) = -\min_{a' \in C(s)} \rho_{\mathcal{A}}(a', a) = 0$. This requires Assumptions (W*) of [11] to be met, which is verified in the proof of Theorem 1.
We arrive at the same conclusion when $\mathcal{S}$ and $\mathcal{A}$ are not both Polish spaces in the appendix.
B. Compactly Restrictable MPOPs
While Assumption 1 allows us to determine when MPOPs are
well-posed and can be solved using value iteration, it is not met for
many systems of interest. We now outline an alternative assumption
that not only mitigates theoretical shortcomings, but also better fits
the problem settings for physical systems. Our main result here is
Theorem 2, showing how an MPOP satisfying these new assumptions
can generate an MPOP satisfying Assumption 1 through a systematic
transformation. We call the original MPOP a CR-MPOP.
1) Motivation: Consider an MPOP defined by an MDP $(\mathcal{S}, \mathcal{A}, C, f, r, \gamma)$ and a policy class $\Pi$ that satisfies Assumption 1. The proof of Theorem 1 requires a bounded reward function $r$ and a compact-valued action-admissibility correspondence $C$. For many examples, these requirements are too strict, as shown in Example 2. If $C$ can be modified such that its graph is compact and if $r$ is continuous, then $r$ will be bounded over the graph of the modified correspondence. This guides our systematic construction of a well-posed policy optimization problem from an ill-posed one.
If the graph of the modified correspondence is compact, the projection of the graph onto the state space must also be compact. Therefore, if the state space is not already compact, then the modified correspondence must be defined on a compact strict subset $\mathcal{S}_0 \subset \mathcal{S}$; that is, the modified correspondence must have the form $C_0 : \mathcal{S}_0 \to \mathcal{P}(\mathcal{A})$, where $C_0(s) \subseteq C(s)$ is nonempty and compact for all states $s \in \mathcal{S}_0$. Moreover, while restricting admissible actions to compact sets is reasonable for physical systems (as forces, voltages, currents, etc. are bounded by physical constraints), these restrictions must be chosen carefully to enforce the nonemptiness condition. For the restriction of $f$ to $\mathrm{Graph}(C_0)$ to be a well-defined transition function,
we require $f(s, a) \in \mathcal{S}_0$ for all pairs $(s, a) \in \mathrm{Graph}(C_0)$. Finally, we must ensure that $C_0$ admits measurable selectors.
Restricting our attention to a compact subset $\mathcal{S}_0$ is also reasonable for many physical systems, as dynamics models are often well-understood only in a bounded set around an operating condition.
2) Compact Restriction: For this construction, we use policies that render compact subsets of the state space forward-invariant. A policy $\pi \in \Pi$ renders a subset $\mathcal{S}_0 \subseteq \mathcal{S}$ forward-invariant if $f_\pi(\mathcal{S}_0) \subseteq \mathcal{S}_0$. By induction, we have $f_\pi^t(\mathcal{S}_0) \subseteq \mathcal{S}_0$ for any $t \in \mathbb{Z}_+$.
Definition 2 (Compactly Restrictable Metric Policy Optimization Problem). An MPOP characterized by an MDP $(\mathcal{S}, \mathcal{A}, C, f, r, \gamma)$ and a policy class $\Pi$ is a compactly restrictable metric policy optimization problem (CR-MPOP) if there is a nonempty compact subset $\mathcal{S}_0 \subseteq \mathcal{S}$ and a continuous policy $\pi_0 \in \Pi$ rendering $\mathcal{S}_0$ forward-invariant.
Assumption 2. The action-admissibility correspondence $C$ has closed values and $\mathrm{Graph}(C)$ is closed, the transition function $f$ and reward function $r$ are continuous, and the policy class $\Pi$ is the set of all measurable selectors of $C$.
To show how a CR-MPOP satisfying Assumption 2 can lead to an MPOP satisfying Assumption 1, we need the following lemma.
Lemma 2. Consider an MDP $(\mathcal{S}, \mathcal{A}, C, f, r, \gamma)$ and a policy class $\Pi$ satisfying Assumption 2. For any state $s \in \mathcal{S}$, the $s$-section $f_s : C(s) \to \mathcal{S}$ is continuous. Additionally, for any state $s \in \mathcal{S}$ and any closed set $G \subseteq \mathcal{S}$, the preimage $(f_s)^{-1}(G)$ is a closed subset of $\mathcal{A}$.
Proof. Let $\mathrm{id} : \mathcal{S} \times \mathcal{A} \to \mathcal{S} \times \mathcal{A}$ denote the identity function on $\mathcal{S} \times \mathcal{A}$. For any state $s \in \mathcal{S}$, the $s$-section $\mathrm{id}_s : \mathcal{A} \to \mathcal{S} \times \mathcal{A}$ is continuous, meaning the restriction $\mathrm{id}_s|_{C(s)} : C(s) \to \mathcal{S} \times \mathcal{A}$ is continuous. We can write $f_s = f \circ (\mathrm{id}_s|_{C(s)})$, implying $f_s$ is continuous. For any closed set $G \subseteq \mathcal{S}$, there is a corresponding closed set $F_{s,G} \subseteq \mathcal{A}$ satisfying $(f_s)^{-1}(G) = C(s) \cap F_{s,G}$. Since $C(s)$ is also a closed subset of $\mathcal{A}$, we conclude that $(f_s)^{-1}(G)$ is a closed subset of $\mathcal{A}$.
We now present our guarantee of the existence of a well-posed value iteration setting under Assumption 2.
Theorem 2. Consider a CR-MPOP characterized by an MDP $(\mathcal{S}, \mathcal{A}, C, f, r, \gamma)$ and a policy class $\Pi$ that satisfies Assumption 2. There exists an MPOP characterized by an MDP $(\mathcal{S}_0, \mathcal{A}_0, C_0, f_0, r_0, \gamma)$ and a policy class $\Pi_0$ satisfying Assumption 1, with $\mathcal{S}_0 \subseteq \mathcal{S}$ rendered forward-invariant by a continuous policy in $\Pi$, $\mathcal{A}_0 \subseteq \mathcal{A}$, $C_0 : \mathcal{S}_0 \to \mathcal{P}(\mathcal{A}_0)$ satisfying $C_0(s) \subseteq C(s)$ for all states $s \in \mathcal{S}_0$, $f_0 = f|_{\mathrm{Graph}(C_0)}$, $r_0 = r|_{\mathrm{Graph}(C_0)}$, and:
$$\Pi_0 = \{\pi|_{\mathcal{S}_0} : \pi \in \Pi,\ \pi(\mathcal{S}_0) \subseteq \mathcal{A}_0,\ \text{and } f_\pi(\mathcal{S}_0) \subseteq \mathcal{S}_0\}. \quad (15)$$
Proof. For some $n \in \mathbb{N}$, consider continuous policies $\pi_1, \ldots, \pi_n \in \Pi$ satisfying $f_{\pi_i}(\mathcal{S}_0) \subseteq \mathcal{S}_0$ for all $i \in \{1, \ldots, n\}$. Such policies are guaranteed to exist by the existence of the continuous policy $\pi_0 \in \Pi$. Consider the union of the images of $\mathcal{S}_0$ under each of these policies, defining the compact set $\mathcal{A}_0 = \bigcup_{i=1}^{n} \pi_i(\mathcal{S}_0)$. Since $\mathcal{S}$ and $\mathcal{A}$ are separable metric spaces, so are $\mathcal{S}_0$ and $\mathcal{A}_0$ [24, Corollary 3.5].
Define the correspondence $C_0 : \mathcal{S}_0 \to \mathcal{P}(\mathcal{A}_0)$ as:
$$C_0(s) = \{a \in C(s) \cap \mathcal{A}_0 : f(s, a) \in \mathcal{S}_0\} = (f_s)^{-1}(\mathcal{S}_0) \cap \mathcal{A}_0, \quad (16)$$
for all states $s \in \mathcal{S}_0$. Since $\pi_1(s), \ldots, \pi_n(s) \in C_0(s)$, $C_0(s) \neq \emptyset$ for all $s \in \mathcal{S}_0$. By Lemma 2 and since $\mathcal{S}_0$ is closed (it is compact in a metric space), the preimage $(f_s)^{-1}(\mathcal{S}_0)$ is a closed subset of $\mathcal{A}$. $C_0(s)$ is thus compact, as it is a closed subset of $\mathcal{A}_0$. For any closed set $G \subseteq \mathcal{A}_0$, the lower preimage of $G$ under $C_0$ satisfies:
$$(C_0)^{\ell}(G) = \{s \in \mathcal{S}_0 : f(s, a) \in \mathcal{S}_0 \text{ for some } a \in C(s) \cap G\} = p_{\mathcal{S}}\big(f^{-1}(\mathcal{S}_0) \cap (\mathcal{S}_0 \times G)\big), \quad (17)$$
where $p_{\mathcal{S}} : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ denotes the canonical projection onto $\mathcal{S}$. Since $\mathcal{S}_0$ is closed and $f$ is continuous, there is a closed set $F \subseteq \mathcal{S} \times \mathcal{A}$ satisfying $f^{-1}(\mathcal{S}_0) = \mathrm{Graph}(C) \cap F$. Since $\mathrm{Graph}(C)$ is also closed, $f^{-1}(\mathcal{S}_0)$ is a closed subset of $\mathcal{S} \times \mathcal{A}$. Since $G$ is a closed subset of the compact set $\mathcal{A}_0$, $G$ is compact, and since $\mathcal{S}_0$ is compact, the product $\mathcal{S}_0 \times G$ is compact. As a closed subset of a compact set, $f^{-1}(\mathcal{S}_0) \cap (\mathcal{S}_0 \times G)$ is compact. Since the projection operator $p_{\mathcal{S}}$ is continuous, $(C_0)^{\ell}(G)$ is compact. Therefore, $(C_0)^{\ell}(G)$ is a closed subset of $\mathcal{S}_0$, and since $G$ was arbitrary, $C_0$ is upper hemicontinuous.
As restrictions of continuous functions, $f_0$ and $r_0$ are continuous, implying $r_0$ is upper semicontinuous. Note that:
$$\mathrm{Graph}(C_0) = \{(s, a) \in \mathcal{S}_0 \times \mathcal{A}_0 : f(s, a) \in \mathcal{S}_0\} = f^{-1}(\mathcal{S}_0) \cap (\mathcal{S}_0 \times \mathcal{A}_0). \quad (18)$$
Since $\mathcal{A}_0$ is trivially a closed subset of $\mathcal{A}_0$, we have already shown that $f^{-1}(\mathcal{S}_0) \cap (\mathcal{S}_0 \times \mathcal{A}_0)$ is compact, meaning $\mathrm{Graph}(C_0)$ is compact. Therefore, since $r_0$ is continuous, it is bounded.
The policy class $\Pi_0$ satisfies:
$$\Pi_0 = \{\pi|_{\mathcal{S}_0} : \pi \in \Pi,\ \pi(s) \in C_0(s) \text{ for all states } s \in \mathcal{S}_0\} = \{\pi \in L^0(\mathcal{S}_0; \mathcal{A}_0) : \pi(s) \in C_0(s) \text{ for all states } s \in \mathcal{S}_0\}. \quad (19)$$
To verify the last equality, first consider a policy $\pi \in \Pi$ satisfying $\pi(s) \in C_0(s)$ for all states $s \in \mathcal{S}_0$. The restriction $\pi|_{\mathcal{S}_0}$ is a measurable function from $\mathcal{S}_0$ to $\mathcal{A}_0$ selecting from $C_0$. Conversely, consider a measurable function $\pi : \mathcal{S}_0 \to \mathcal{A}_0$ selecting from $C_0$. Pick any policy $\pi_e \in \Pi$ and define the extension $\bar{\pi} : \mathcal{S} \to \mathcal{A}$ as $\pi$ on $\mathcal{S}_0$ and $\pi_e$ on $\mathcal{S} \setminus \mathcal{S}_0$. For any measurable set $B \subseteq \mathcal{A}$, we have:
$$\bar{\pi}^{-1}(B) = \{s \in \mathcal{S}_0 : \pi(s) \in \mathcal{A}_0 \cap B\} \cup \{s \in \mathcal{S} \setminus \mathcal{S}_0 : \pi_e(s) \in B\} = \pi^{-1}(\mathcal{A}_0 \cap B) \cup \big(\pi_e^{-1}(B) \setminus \mathcal{S}_0\big), \quad (20)$$
which is a measurable set; therefore, $\bar{\pi}$ is a measurable function. Note that $\bar{\pi}(s) = \pi(s) \in C_0(s) \subseteq C(s)$ for all states $s \in \mathcal{S}_0$ and $\bar{\pi}(s) = \pi_e(s) \in C(s)$ for all states $s \notin \mathcal{S}_0$. Since $\bar{\pi}$ is a measurable selector of $C$, we have $\bar{\pi} \in \Pi$, and inclusion follows since $\bar{\pi}|_{\mathcal{S}_0} = \pi$.
Finally, since $\mathrm{Graph}(C_0)$ is compact, its projection onto $\mathcal{A}_0$ is compact. Therefore, by Lemma 1, for every state-action pair $(s, a) \in \mathrm{Graph}(C_0)$, there is a policy $\pi_{s,a} \in \Pi_0$ satisfying $\pi_{s,a}(s) = a$.
Remark 3. An MPOP characterized by an MDP $(\mathcal{S}, \mathcal{A}, C, f, r, \gamma)$ and a policy class $\Pi$ that satisfies Assumption 1 is trivially a CR-MPOP satisfying Assumption 2 if the following sufficient conditions are met: 1) The state space $\mathcal{S}$ is compact, 2) The reward function $r$ is continuous, 3) The policy class $\Pi$ contains a continuous policy.
Example 3. Consider the MDP $(\mathcal{S}, \mathcal{A}, C, f, r, \gamma)$ defined in Example 1 with a policy class $\Pi$ comprised of measurable selectors of $C$, and let $\mathcal{S}_0 = [-1, 1]$. Note that $|f(0, a)| = |\tanh(a)| < 1$ for all actions $a \in \mathbb{R}$. Consider any state $s \in [-1, 0)$. For any nonnegative action $a \in \mathbb{R}_+$, we have $-1 \leq s \leq f(s, a) = s + \tanh(a) < s + 1 < 1$, and $f(s, -a) = s + \tanh(-a) \in [-1, s)$ if and only if $-a \in [\tanh^{-1}(-1 - s), 0)$, since $\tanh^{-1}$ is monotonically increasing. By a similar argument for states in $(0, 1]$, we determine that, for all $s \in \mathcal{S}_0$ and $a \in \mathcal{A}$, we have $f(s, a) \in \mathcal{S}_0$ if and only if:
$$a \in \begin{cases} [\tanh^{-1}(-1 - s), \infty) & -1 \leq s < 0, \\ \mathcal{A} & s = 0, \\ (-\infty, \tanh^{-1}(1 - s)] & 0 < s \leq 1. \end{cases} \quad (21)$$
A correspondence coinciding with the requirements on actions in (21) would not have compact values; this is illustrated by the preimage $f^{-1}(\mathcal{S}_0)$ in Figure 1. Therefore, consider the continuous policy $\pi_0 \in \Pi$ defined as $\pi_0(s) = -s$ for all states $s \in \mathcal{S}$. For any state $s \in [-1, 0)$, we have $\pi_0(s) = -s > 0 \geq \tanh^{-1}(-1 - s)$.
Fig. 1. (Left) The preimage $f^{-1}(\mathcal{S}_0)$ is shown in blue, which contains all state-action pairs mapped into $\mathcal{S}_0 = [-1, 1]$ by $f$. For any state in $\mathcal{S}_0$, the set of actions mapping that state into $\mathcal{S}_0$ is unbounded, therefore not compact. (Right) The graph of $\pi_0$ is shown in orange and the graph of $C_0$ is shown in blue. The graph of $C_0$ is compact, and for every state $s \in \mathcal{S}_0$, the set $C_0(s)$ is compact.
Similarly, for any state $s \in (0, 1]$, we have $\pi_0(s) < \tanh^{-1}(1 - s)$. Therefore, $f_{\pi_0}(\mathcal{S}_0) \subseteq \mathcal{S}_0$. The image of $\mathcal{S}_0$ under $\pi_0$ is $[-1, 1]$. Therefore, we can define $\mathcal{A}_0 = [-1, 1]$ and $C_0 : \mathcal{S}_0 \to \mathcal{P}(\mathcal{A}_0)$ as, for all states $s \in \mathcal{S}_0$:
$$C_0(s) = \begin{cases} [\tanh^{-1}(-1 - s), 1] & s \leq -1 + \tanh(1), \\ [-1, \tanh^{-1}(1 - s)] & s \geq 1 - \tanh(1), \\ [-1, 1] & \text{otherwise}. \end{cases} \quad (22)$$
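The following Python sketch (our own numerical check, on an arbitrary grid) verifies that $\pi_0(s) = -s$ renders $\mathcal{S}_0 = [-1, 1]$ forward-invariant and that every action in the intervals (22) maps its state back into $\mathcal{S}_0$.

```python
import numpy as np

def f(s, a):
    return s + np.tanh(a)

def pi0(s):
    return -s

# Forward invariance of S_0 = [-1, 1] under pi_0: f(s, pi_0(s)) stays in S_0.
s_grid = np.linspace(-1.0, 1.0, 1001)
next_states = f(s_grid, pi0(s_grid))
assert np.all((next_states >= -1.0) & (next_states <= 1.0))

def C0(s):
    """Interval endpoints of C_0(s) from (22), with A_0 = [-1, 1]."""
    if s <= -1.0 + np.tanh(1.0):
        return (np.arctanh(-1.0 - s), 1.0)
    if s >= 1.0 - np.tanh(1.0):
        return (-1.0, np.arctanh(1.0 - s))
    return (-1.0, 1.0)

# Spot-check: every action in C_0(s) maps s back into S_0.
for s in s_grid:
    lo, hi = C0(s)
    for a in np.linspace(lo, hi, 11):
        assert -1.0 - 1e-9 <= f(s, a) <= 1.0 + 1e-9
```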
C. Policy Iteration
We briefly mention the difficulties of policy iteration for MPOPs, as noted in [12]. For a well-posed MPOP, policy iteration generates a monotonically nondecreasing sequence of policies. Given an initial policy $\pi_0 \in \Pi$, we seek a sequence of policies $\{\pi_n \in \Pi : n \in \mathbb{N}\}$ satisfying, for all states $s \in \mathcal{S}$ and $n \in \mathbb{Z}_+$:
$$r(s, \pi_{n+1}(s)) + \gamma V_{\pi_n}(f(s, \pi_{n+1}(s))) = \sup_{a \in C(s)} \big\{ r(s, a) + \gamma V_{\pi_n}(f(s, a)) \big\}. \quad (23)$$
If $f$, $r$, and $\pi_0$ are continuous, then so are $f_{\pi_0}$ and $r_{\pi_0}$, which we can show renders $V_{\pi_0}$ continuous. This implies $r + \gamma V_{\pi_0} \circ f$ is continuous. If $r$ is bounded, then so are $V_{\pi_0}$ and $r + \gamma V_{\pi_0} \circ f$. Since $r + \gamma V_{\pi_0} \circ f$ is upper semicontinuous and bounded, we can show (using [11] or the appendix) that for any state $s \in \mathcal{S}$, the optimization problem:
$$\sup_{a \in C(s)} \big\{ r(s, a) + \gamma V_{\pi_0}(f(s, a)) \big\}, \quad (24)$$
is solved by the next policy $\pi_1$, a measurable selector of $C$. However, we cannot conclude that $r + \gamma V_{\pi_1} \circ f$ is upper semicontinuous and bounded, so the same argument cannot be applied iteratively.
Demonstrating that policy iteration can be applied requires special
knowledge that the policy improvement step in (23) continues to
produce policies with value functions that permit maximization and
appropriate selection (for the chosen policy class). Such knowledge
is available in linear-quadratic regulator (LQR) problems with linear
policy classes (see [25] for the discounted case).
D. Computation
There are three central difficulties in computationally implementing value iteration for a well-posed MPOP: 1) representing iterates during value iteration, 2) determining admissible actions, and 3) approximating the update step (10). Typically, iterates are represented
using function approximators such as neural networks [5] or Gaussian
process regression models [26]. Training such approximators involves
sampling sufficiently many states from the state space region of
interest. For any given state, the maximization in the update step (10)
is generally a nonconvex optimization, with possible approximations
including (projected) gradient ascent from several initial action seeds
or maximization over a finite but sufficiently dense sampling of
admissible actions. If the function approximation and optimization
approximation errors can be controlled, approximate convergence of
value iteration can be guaranteed [27, Proposition 2.3.2].
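As one purely illustrative instance of the gradient-based approximation mentioned above, the sketch below approximates the maximization in (10) at a single state by projected gradient ascent over a box of admissible actions, using central finite-difference gradients and several random seeds; the box constraint, step size, seed count, and iteration budget are hypothetical choices.

```python
import numpy as np

def approx_bellman_update(s, r, f, V, gamma, a_low, a_high,
                          seeds=8, steps=100, lr=0.1, eps=1e-5, rng=None):
    """Approximate sup_{a in [a_low, a_high]^m} r(s, a) + gamma * V(f(s, a))
    by projected gradient ascent with finite-difference gradients."""
    rng = np.random.default_rng(rng)
    a_low, a_high = np.asarray(a_low, float), np.asarray(a_high, float)

    def objective(a):
        return r(s, a) + gamma * V(f(s, a))

    best_val, best_a = -np.inf, None
    for _ in range(seeds):
        a = rng.uniform(a_low, a_high)                     # random initial action seed
        for _ in range(steps):
            grad = np.array([(objective(a + eps * e) - objective(a - eps * e)) / (2 * eps)
                             for e in np.eye(len(a))])     # central finite differences
            a = np.clip(a + lr * grad, a_low, a_high)      # gradient step + projection onto the box
        if objective(a) > best_val:
            best_val, best_a = objective(a), a
    return best_val, best_a
```

For a scalar-action problem, one might call approx_bellman_update with a_low=[-1.0], a_high=[1.0], and callables r, f, and V that accept NumPy arrays.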
Any approximate optimization approach requires checking ad-
missibility of actions, which, in the case of a CR-MPOP, requires
checking whether or not a state-action pair is mapped to the correct
compact set under the transition map. This necessitates efficient set
membership checking, and for nonconvex compact sets, proximity-
based membership approximations may be needed.
Finally, in Section III-C, we will use the closure of a reachable set (under a specific policy) to restrict problems with feedback linearizable control affine systems to well-posed MPOPs. To sample a state from the reachable set, we can sample a state from the appropriate set of initial conditions and follow the policy for a number of steps sampled from a geometric distribution (with success probability $1 - \gamma$). The resulting distribution is the discounted state distribution under the policy [28], which is supported on the entirety of the reachable set (thus, on a dense subset of its closure).
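The geometric-horizon sampler described above takes only a few lines; in the sketch below, initial_state_sampler, policy, and the transition map f are placeholders to be supplied by the user, and the geometric draw is shifted so that state $s_t$ is reached with probability proportional to $\gamma^t$.

```python
import numpy as np

def sample_discounted_state(initial_state_sampler, policy, f, gamma, rng=None):
    """Draw one state from the discounted state distribution of a policy:
    sample s_0, then roll the closed-loop dynamics forward for T steps,
    where T is a shifted geometric random variable with success probability 1 - gamma."""
    rng = np.random.default_rng(rng)
    s = initial_state_sampler(rng)
    # numpy's geometric(p) returns the number of trials (>= 1) until the first success,
    # so subtract 1 to obtain a nonnegative number of transition steps.
    T = rng.geometric(1.0 - gamma) - 1
    for _ in range(T):
        s = f(s, policy(s))
    return s
```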
There are many possible combinations of computational ap-
proaches that may be suitable for well-posed MPOPs; we deem the
precise best choice outside the scope of this theoretical framing.
III. CONTROL AFFINE SYSTEMS
We now show how to apply the results from Section II to a general class of control systems called control affine systems [29, §13.2.3]. For state and action space dimensions $d, m \in \mathbb{N}$, respectively, consider a set $\mathcal{D} \subseteq \mathbb{R}^d$ and vector fields $f_0, g_1, \ldots, g_m : \mathcal{D} \to \mathbb{R}^d$. Define the matrix-valued function $G : \mathcal{D} \to \mathbb{R}^{d \times m}$ with columns $g_1, \ldots, g_m$. Define $F : \mathcal{D} \times \mathbb{R}^m \to \mathbb{R}^d$ as:
$$F(x, u) = f_0(x) + G(x)u, \quad (25)$$
for all $x \in \mathcal{D}$ and $u = (u_1, \ldots, u_m) \in \mathbb{R}^m$. An initial value problem with constant control input $u \in \mathbb{R}^m$ is characterized by an initial condition $x \in \mathcal{D}$ and an open time interval $I \subseteq \mathbb{R}$ with $0 \in I$; a corresponding solution is a differentiable function $\phi : I \to \mathcal{D}$ satisfying $\phi(0) = x$ and $\dot{\phi}(t) = \frac{\mathrm{d}}{\mathrm{d}t}\phi(t) = F(\phi(t), u) = f_0(\phi(t)) + G(\phi(t))u$ for all times $t \in I$.
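To connect (25) with the sampled-data setting of the next subsection, the sketch below integrates a control affine system under a zero-order hold for one sample period with a fixed-step RK4 scheme; the pendulum-like vector fields and all numerical parameters are hypothetical stand-ins rather than the paper's model.

```python
import numpy as np

def F(x, u, f0, G):
    """Control affine dynamics (25): x_dot = f0(x) + G(x) @ u."""
    return f0(x) + G(x) @ u

def zoh_step(x, u, f0, G, h, substeps=20):
    """Integrate x_dot = F(x, u) over a sample period h with u held constant (RK4)."""
    dt = h / substeps
    for _ in range(substeps):
        k1 = F(x, u, f0, G)
        k2 = F(x + 0.5 * dt * k1, u, f0, G)
        k3 = F(x + 0.5 * dt * k2, u, f0, G)
        k4 = F(x + dt * k3, u, f0, G)
        x = x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

# Hypothetical pendulum-like example with d = 2, m = 1: x = (angle, angular rate).
f0 = lambda x: np.array([x[1], -np.sin(x[0])])
G = lambda x: np.array([[0.0], [1.0]])

x_next = zoh_step(np.array([0.1, 0.0]), np.array([0.5]), f0, G, h=0.05)
```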
A. Time Sampling
In contrast to continuous-time control design, we consider
sampled-data control design (see [6]–[8]), in which initial value
problems with constant control characterize the evolution of a system
over fixed sample intervals, resulting in control input trajectories that
are piecewise constant in time. Such assumptions are realistic for
physical systems interacting with digital controllers, which measure
states and compute control inputs at nearly fixed frequencies. This
setting requires the time intervals over which solutions are defined to
be sufficiently long, uniformly for all initial conditions and control
inputs under consideration.
Specifically, fix a sample period $h \in \mathbb{R}_{++}$, and define the subset $\mathcal{S}_h \subseteq \mathcal{D}$ and correspondence $C_h : \mathcal{S}_h \to \mathcal{P}(\mathbb{R}^m)$ such that the following properties are satisfied:
- For every initial condition $x \in \mathcal{S}_h$, the corresponding set of control inputs $C_h(x) \subseteq \mathbb{R}^m$ is nonempty,
- For every $(x, u) \in \mathrm{Graph}(C_h)$, there is a unique solution to the initial value problem characterized by initial condition $x$, control input $u$, and open time interval $I \subseteq \mathbb{R}$ with $[0, h] \subset I$, with the solution denoted $\phi_{x,u} : I \to \mathcal{D}$,
- $\mathcal{S}_h$ and $C_h$ are maximal (cannot be contained in a superset and containing correspondence)