Control Theory Meets POMDPs:
A Hybrid Systems Approach
Mohamadreza Ahmadi, Nils Jansen, Bo Wu, and Ufuk Topcu
Abstract—Partially observable Markov decision processes (POMDPs) provide a modeling framework for a variety of sequential decision-making under uncertainty scenarios in artificial intelligence (AI). Since the states are not directly observable in a POMDP, decision making has to be performed based on the output of a Bayesian filter (continuous beliefs), which makes POMDPs intractable to solve and analyze. To overcome the complexity challenge of POMDPs, we apply techniques from control theory. Our contributions are fourfold: (i) We begin by casting the problem of analyzing a POMDP into analyzing the behavior of a discrete-time switched system. Then, (ii) in order to estimate the reachable belief space of a POMDP, i.e., the set of all possible evolutions given an initial belief distribution over the states and a set of actions and observations, we find over-approximations in terms of sub-level sets of Lyapunov-like functions. Furthermore, (iii) in order to verify safety and performance requirements of a given POMDP, we formulate a barrier certificate theorem, wherein we show that if there exists a barrier certificate satisfying a set of inequalities along with the belief update equation of the POMDP, the safety and performance properties are guaranteed to hold. In both cases (ii) and (iii), the calculations can be decomposed and solved in parallel. (iv) We show that the conditions we formulate can be computationally implemented as a set of sum-of-squares programs. We illustrate the applicability of our method by addressing two problems in active ad scheduling and machine teaching.
I. INTRODUCTION
A formal model for planning subject to stochastic behavior is a Markov decision process (MDP) [1], where an agent chooses to perform an action under full knowledge of the environment it is operating in. The outcome of an action is a probability distribution over the system states. Many applications, however, allow only partial observability of the current system state, say via noisy sensor outputs [2], [3], [4]. Partially observable Markov decision processes (POMDPs) extend MDPs to account for such partial information [5]. Upon certain observations, the agent infers the likelihood of the system being in a certain state, called the belief state. The belief state together with an update function form a (typically uncountably infinite) MDP, referred to as the belief MDP [6], [7], [8].

This work was supported by AFOSR FA9550-19-1-0005, AFRL FA9550-19-1-0169, DARPA D19AP00004, NSF 1646522, NWO OCENW.KLEIN.187, and NSF 1652113 grants. M. Ahmadi is with the Center for Autonomous Systems and Technologies at the California Institute of Technology, 1200 E. Calif. Blvd., MC 104-44, Pasadena, CA 91125 (e-mail: mrahmadi@caltech.edu). B. Wu and U. Topcu are with the Oden Institute for Computational Engineering and Sciences, University of Texas at Austin, 201 E 24th St, Austin, TX 78712, USA (e-mail: {bwu3, utopcu}@utexas.edu). N. Jansen is with the Department of Software Science (SWS), Institute for Computing and Information Sciences, Radboud University, PO Box 9010, 6500 GL Nijmegen, Netherlands (e-mail: n.jansen@science.ru.nl).

However, solving a POMDP exactly, i.e., synthesizing
a policy ensuring optimality or a certain performance, is carried out by assessing the entire belief MDP, rendering the problem undecidable [7], [9]. Several promising approximate point-based methods via finite abstraction of the belief space have been proposed in the literature [10], [11], [12], [13]. At the heart of these methods is the observation that, given an initial belief $b_0$ (the initial probability of the system being in some state), the set of possible evolutions emanating from $b_0$ under arbitrary action and observation sequences, called the reachable belief space, is much smaller than the whole belief space [14]. Hence, only a finite number of samples or points from the reachable belief space can be considered for decision making.
Nonetheless, computing the reachable belief space is non-trivial and fundamental in analyzing the convergence of approximate POMDP solutions. In fact, Pineau et al. [15] showed that point-based value iteration can approximate the solution of the POMDP sufficiently closely, whenever the total reward function is a discounted sum of the rewards over time, the samples are sufficiently close to each other, and the sampling is carried out over all of the reachable belief space. In another vein, Lee et al. [16] demonstrated that an approximate POMDP solution can be computed in time polynomial in the covering number (the number of balls covering a set) of the reachable belief space. Yet, the problem lies in the fact that the reachable belief space is not known a priori, and therefore it is hard to make judgements on the convergence rate of a given approximate method. Several techniques have been proposed for estimating the reachable belief space, mainly based on heuristics [17], [18], "smarter" sampling methods, such as importance sampling [19], and graph search [20], [21]. However, these methods are difficult to adapt from one problem setting to another and are often ad hoc.
Even with a reachable belief space at hand, the approximate techniques do not provide a guarantee for safety or performance. That is, it is not clear whether the probability of satisfying a pre-specified safety/performance requirement is an upper bound or a lower bound for a given POMDP. Establishing guaranteed safety and performance is of fundamental importance in safety-critical applications, e.g., aircraft collision avoidance [22] and Mars rovers [23].
In this paper, we use notions from control theory to analyze a POMDP without the need to solve it explicitly. In this sense, our method follows in the footsteps of Lyapunov's second method for verifying the stability of all solutions of an ODE without the need to solve it, or the application of barrier certificate theorems for safety verification without the need to solve an ODE for all initial conditions in a given set. Our main contributions are fourfold:
(i) We propose a switched system representation for the
belief evolutions of POMDPs;
(ii) We present a method based on Lyapunov-like functions
to over-approximate the reachable belief space, thus al-
lowing us to verify reachability objectives. These over-
approximations are in terms of sub-level sets of the latter
functions. We demonstrate how these calculations can be
decomposed in terms of the POMDP actions;
(iii) We propose a method based on barrier certificates to
verify safety and performance of a given POMDP. We
show how the calculations of the barrier certificates can
be decomposed in terms of actions;
(iv) We formulate a set of sum-of-squares programs for finding the Lyapunov functions and barrier certificates that over-approximate the reachable belief space and verify safety and performance, respectively.
We illustrate the efficacy of our methodology with two ex-
amples from cyber-physical systems and artificial intelligence.
Preliminary results related to the formulation of the barrier
certificates for POMDPs applied to privacy verification and
robust decision making were disseminated in [24], [25].
The rest of the paper is organized as follows. In the next section, we briefly describe the POMDP formalism. In Section III, we present the problems studied in this paper. In Section IV, we formulate a hybrid system representation for the belief MDP. In Section V, we show how we can use Lyapunov-like functions to approximate the reachable belief space. In Section VI, we propose a barrier certificate theorem for safety/performance verification of POMDPs. In Section VII, we propose a set of sum-of-squares programs for computational implementation of the proposed results. In Section VIII, we elucidate the proposed methodology with examples. Finally, in Section IX, we conclude the paper and give directions for future research.
Notation: $\mathbb{R}_{\ge 0}$ denotes the set $[0, \infty)$. $\mathbb{Z}$ denotes the set of integers and $\mathbb{Z}_{\ge c}$ for $c \in \mathbb{Z}$ denotes the set $\{c, c+1, c+2, \ldots\}$. $\mathbb{R}[x]$ denotes the set of polynomial functions with real coefficients in $x \in \mathbb{R}^n$, $p : \mathbb{R}^n \to \mathbb{R}$, and $\Sigma[x] \subset \mathbb{R}[x]$ is the subset of polynomials with a sum-of-squares decomposition; i.e., $p \in \Sigma[x]$ if and only if there are $p_i \in \mathbb{R}[x]$, $i \in \{1, \ldots, k\}$, such that $p = p_1^2 + \cdots + p_k^2$. For a finite set $A$, $|A|$ denotes the number of elements in $A$ and $co\{A\}$ denotes the convex hull of $A$. In particular, for a family of indexed functions $f_a : A \to Z$, $a \in A$, $co\{f_a\}_{a \in A}$ denotes the convex combination of the $f_a$, $a \in A$, i.e., $co\{f_a\}_{a \in A} = \sum_{a \in A} \beta_a f_a$, where $0 \le \beta_a \le 1$ and $\sum_{a \in A} \beta_a = 1$.
II. PARTIALLY OBSERVABLE MARKOV DECISION PROCESSES
A POMDP is a sequential decision-making modeling frame-
work, which considers not only the stochastic outcomes of
actions, but also the imperfect state observations [26].
Definition 1. A POMDP $\mathcal{P}$ is a tuple $(Q, p_0, A, T, Z, O)$, where
• $Q$ is a finite set of states with indices $\{1, 2, \ldots, n\}$;
• $p_0 : Q \to [0, 1]$ defines the distribution of the initial states, i.e., $p_0(q)$ denotes the probability of starting at $q \in Q$;
• $A$ is a finite set of actions;
• $T : Q \times A \times Q \to [0, 1]$ is the transition probability, where
$$T(q, a, q') := P\left(q_t = q' \mid q_{t-1} = q,\ a_{t-1} = a\right), \quad \forall t \in \mathbb{Z}_{\ge 1},\ q, q' \in Q,\ a \in A;$$
• $Z$ is the set of all possible observations. Often, $z \in Z$ is an incomplete projection of the world state $q$, contaminated by noise;
• $O : Q \times A \times Z \to [0, 1]$ is the observation probability (sensor model), where
$$O(q, a, z) := P\left(z_t = z \mid q_t = q,\ a_{t-1} = a\right), \quad \forall t \in \mathbb{Z}_{\ge 1},\ q \in Q,\ a \in A,\ z \in Z.$$
Since the states are not directly accessible in a POMDP, decision making requires the history of observations. Therefore, we need to define the notion of a belief $b$ (or the posterior) as a sufficient statistic for the history [27]. The belief in $q$ at time $t$, $b_t(q)$, is the probability of being in state $q$ at time $t$ given that action $a_{t-1}$ was taken at time $t-1$ and observation $z_t$ is received [2]. Given a POMDP, the belief at $t = 0$ is defined as $b_0(q) = p_0(q)$, and $b_t(q)$ can be obtained by a Bayesian filter as
$$\begin{aligned}
b_t(q') &= P\left(q' \mid z_t, a_{t-1}, b_{t-1}\right) \\
&= \frac{P\left(z_t \mid q', a_{t-1}, b_{t-1}\right) P\left(q' \mid a_{t-1}, b_{t-1}\right)}{P\left(z_t \mid a_{t-1}, b_{t-1}\right)} \\
&= \frac{P\left(z_t \mid q', a_{t-1}, b_{t-1}\right)}{P\left(z_t \mid a_{t-1}, b_{t-1}\right)} \times \sum_{q \in Q} P\left(q' \mid a_{t-1}, b_{t-1}, q\right) P\left(q \mid a_{t-1}, b_{t-1}\right) \\
&= \frac{O(q', a_{t-1}, z_t) \sum_{q \in Q} T(q, a_{t-1}, q')\, b_{t-1}(q)}{\sum_{q' \in Q} O(q', a_{t-1}, z_t) \sum_{q \in Q} T(q, a_{t-1}, q')\, b_{t-1}(q)},
\end{aligned} \qquad (1)$$
where the beliefs belong to the belief unit simplex
$$\mathcal{B} = \left\{ b \in [0, 1]^{|Q|} \ \middle|\ \sum_{q} b_t(q) = 1,\ \forall t \right\}.$$
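For concreteness, the filter (1) amounts to a few lines of code. Below is a minimal NumPy sketch; the encoding of the POMDP data as per-action matrices T[a] and O[a] is our own illustrative convention, not notation from the paper.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """One step of the Bayesian filter (1).

    b is the belief vector of length |Q|; T[a][q, qp] = T(q, a, qp)
    and O[a][qp, z] = O(qp, a, z) encode the POMDP matrices.
    """
    pred = b @ T[a]               # sum_q T(q, a, q') b(q): prediction
    unnorm = O[a][:, z] * pred    # numerator of (1)
    return unnorm / unnorm.sum()  # denominator of (1) normalizes
```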
We illustrate the notion of belief with the following exam-
ple, which is more familiar to the controls community.
Example 1. Consider the following linear (continuous-state) system
$$\begin{cases} x_t = A_t x_{t-1} + B_t u_{t-1} + w_t, \\ z_t = C_t x_t + v_t, \end{cases} \qquad (2)$$
where $x \in \mathbb{R}^n$ are the system states, $u \in \mathbb{R}^m$ are the controls (actions), $z \in \mathbb{R}^l$ are the observations, $A_t \in \mathbb{R}^{n \times n}$ is the state-transition model at time $t$, $B_t \in \mathbb{R}^{n \times m}$ is the control-transition model at time $t$, $C_t \in \mathbb{R}^{l \times n}$ is the observation model at time $t$, and $w, v$ are mutually independent noise terms drawn from zero-mean multivariate normal distributions with covariances $Q_t$ and $R_t$, respectively.

The Kalman filter [28] is a Bayesian estimator with states $\mu_{t|t}$, the a posteriori state estimate at time $t$ given observations up to and including time $t$, and $\Sigma_{t|t}$, the a posteriori estimate covariance matrix. The Kalman filter is conceptualized as two distinct phases: "Predict" and "Update". The Kalman filter can be compactly described as
$$\left(\mu_{t|t}, \Sigma_{t|t}\right) = f\left(\left(\mu_{t-1|t-1}, \Sigma_{t-1|t-1}\right), u_{t-1}, z_t\right),$$
where $f$ is a nonlinear function. Indeed, for system (2), the belief is simply the state of the Kalman filter, i.e., $b_t = \left(\mu_{t|t}, \Sigma_{t|t}\right)$, and the function $f$ is the belief update equation.
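As a sanity check on this correspondence, a standard predict/update implementation of $f$ is sketched below; for readability we assume time-invariant $A$, $B$, $C$ and noise covariances, named Qw and Rv here to avoid clashing with the POMDP's $Q$.

```python
import numpy as np

def kalman_belief_update(mu, Sigma, u, z, A, B, C, Qw, Rv):
    """The belief update f of Example 1: (mu, Sigma) -> next (mu, Sigma)."""
    # Predict phase.
    mu_p = A @ mu + B @ u
    Sigma_p = A @ Sigma @ A.T + Qw
    # Update phase.
    S = C @ Sigma_p @ C.T + Rv               # innovation covariance
    K = Sigma_p @ C.T @ np.linalg.inv(S)     # Kalman gain
    mu_new = mu_p + K @ (z - C @ mu_p)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_p
    return mu_new, Sigma_new
```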
A policy in a POMDP setting is then a mapping $\pi : \mathcal{B} \to A$, i.e., a mapping from the continuous belief space into the discrete and finite action space. Essentially, a policy defined over the belief space has infinite memory when mapped to the potentially infinite executions of the POMDP.
III. PROBLEM FORMULATION
We consider reachability, safety, and performance properties of finite POMDPs. In the following, we formally describe these problems.

Given an initial belief $b_0$ and a POMDP $\mathcal{P}$, we are concerned with estimating the set of all belief evolutions emanating from $b_0$, which is known as the reachable belief space.
Problem 1 (Reachable Belief Space). Given a POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1 with belief update (1), compute $\hat{\mathcal{R}} \subset \mathcal{B}$ such that $\mathcal{R}(b_0) \subseteq \hat{\mathcal{R}}$, where
$$\mathcal{R}(b_0) = \left\{ b \in \mathcal{B} \mid b = b_t \text{ as given by (1), for all } t > 0,\ a \in A, \text{ and } z \in Z \right\} \qquad (3)$$
is the reachable belief space.
We are also interested in studying the reachable belief space induced by a policy $\pi$.

Problem 2 (Reachable Belief Space Induced by Policy $\pi$). Given a POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1 with belief update (1) and a policy $\pi : \mathcal{B} \to A$, compute $\hat{\mathcal{R}} \subset \mathcal{B}$ such that $\mathcal{R}_\pi(b_0) \subseteq \hat{\mathcal{R}}$, where
$$\mathcal{R}_\pi(b_0) = \left\{ b \in \mathcal{B} \mid b = b_t \text{ as given by (1) with } a = \pi(b_{t-1}), \text{ for all } t > 0 \text{ and } z \in Z \right\}. \qquad (4)$$
We define safety as the probability of reaching a set of unsafe states $Q_u \subset Q$ being less than a given constant. To this end, we use the belief states. We are interested in solving the following problem.

Problem 3 (Safety at $\tau$). Given a POMDP as in Definition 1, a point in time $\tau \in \mathbb{Z}_{\ge 0}$, a set of unsafe states $Q_u \subseteq Q$, and a safety requirement constant $\lambda$, check whether
$$\sum_{q \in Q_u} b_\tau(q) \le \lambda. \qquad (5)$$
We also consider safety for all time.

Problem 4 (Safety). Given a POMDP as in Definition 1, a set of unsafe states $Q_u \subseteq Q$, and a safety requirement constant $\lambda$, check whether
$$\sum_{q \in Q_u} b_t(q) \le \lambda, \quad \forall t \in \mathbb{Z}_{\ge 0}. \qquad (6)$$
In addition to safety, we are interested in checking whether a performance criterion is satisfied.

Problem 5 (Performance). Given a POMDP as in Definition 1, the reward function $R : Q \times A \to \mathbb{R}$, in which $R(q, a)$ denotes the reward of taking action $a$ while being at state $q$, a point in time $\tau \in \mathbb{Z}_{\ge 0}$, and a performance requirement $\gamma$, check whether
$$\sum_{s=0}^{\tau} r(b_s, a_s) \le \gamma, \qquad (7)$$
where $r(b_s, a_s) = \sum_{q \in Q} b_s(q) R(q, a_s)$.
Note that, if instead of having a given $\gamma$ in Problem 5 we minimize $\gamma$ subject to a given POMDP with a reward function $R$, we can also study optimality.
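For illustration, the quantities in (7) are straightforward to evaluate along a belief trajectory; the sketch below assumes the reward is stored as a $|Q| \times |A|$ array R, which is our own encoding.

```python
import numpy as np

def expected_reward(b, a, R):
    # r(b, a) = sum_q b(q) R(q, a), the belief-based reward in (7).
    return float(b @ R[:, a])

def performance_holds(beliefs, actions, R, gamma):
    # Check inequality (7) along a trajectory b_0, ..., b_tau.
    total = sum(expected_reward(b, a, R) for b, a in zip(beliefs, actions))
    return total <= gamma
```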
IV. TREATING POMDPS AS HYBRID SYSTEMS
Given a POMDP $\mathcal{P}$ as in Definition 1, checking whether (5) and (7) hold by solving the POMDP directly is an undecidable problem [9]. In this section, we demonstrate that POMDPs can be represented as a special hybrid system [29], i.e., a discrete-time switched system [30], [31], [32]. We take advantage of this representation in the sequel to formulate indirect methods for analyzing a POMDP.
According to the belief update equation (1), the current beliefs evolve according to the belief values at the previous step, the actions at the previous step, and the current observations. Furthermore, the belief evolution jumps into different modes based on the actions taken. Therefore, we can represent the belief evolutions as a discrete-time switched system, where the actions $a \in A$ define the switching modes. Formally, the belief dynamics (1) can be described as
$$b_t = f_a(b_{t-1}, z_t), \qquad (8)$$
where $b_t$ denotes the belief vector at time $t$ belonging to the belief unit simplex $\mathcal{B}$, and $b_0 \in \mathcal{B}_0 \subset \mathcal{B}$ (in particular, $\mathcal{B}_0$ can be a singleton) represents the set of initial beliefs (prior). In (8), $a \in A$ denotes the actions, which can be interpreted as the switching modes, $z \in Z$ are the observations representing inputs, and $t \in \mathbb{Z}_{\ge 1}$ denotes the discrete time instances. The (rational) vector fields $\{f_a\}_{a \in A}$ with $f_a : [0, 1]^{|Q|} \to [0, 1]^{|Q|}$ are described as the vectors with rows
$$f_a^{q'}(b, z) = \frac{O(q', a, z) \sum_{q \in Q} T(q, a, q')\, b_{t-1}(q)}{\sum_{q' \in Q} O(q', a, z) \sum_{q \in Q} T(q, a, q')\, b_{t-1}(q)},$$
where $f_a^{q'}$ denotes the $q'$th row of $f_a$.
For infinite-horizon discounted-cost POMDPs, a policy then induces regions (Markov partitions) in the belief space, where an action is applied [26]. We denote the index of these mutually exclusive, connected partitions of the belief space as $\alpha \in \Gamma$, where $\Gamma$ is a finite set. Then, a policy for a POMDP can be characterized as
$$\pi(b) = \begin{cases} a_1, & b \in \mathcal{B}_{\alpha_1}, \\ \quad \vdots & \\ a_n, & b \in \mathcal{B}_{\alpha_m}, \end{cases} \qquad (9)$$
inducing the dynamics
$$b_t = \begin{cases} f_{a_1}(b, z), & b \in \mathcal{B}_{\alpha_1}, \\ \quad \vdots & \\ f_{a_n}(b, z), & b \in \mathcal{B}_{\alpha_m}, \end{cases} \qquad (10)$$
where $n$ and $m$ are integers. Note that the number of partitions $|\Gamma| = m$ is not necessarily less than or greater than $|A|$. Indeed, $|\Gamma| = m$ depends on the method used for synthesizing the policy. For example, in the value iteration method for POMDPs, $|\Gamma| = m$ is a function of the so-called $\alpha$-vectors [33].
Given the above switched system representation, we can consider two classes of problems in POMDPs:
1. Arbitrary-Policy Verification: This case corresponds to analyzing (8) under arbitrary switching with switching modes given by $a \in A$.
2. Fixed-Policy Verification: This corresponds to analyzing (8) under state-dependent switching. Indeed, the policy $\pi : \mathcal{B} \to A$ determines regions in the belief space where each mode (action) is active.
Both cases of switched systems with arbitrary switching and state-dependent switching are well-known in the systems and controls literature (see [34], [35] and references therein). Note that the second problem is well-known in the probabilistic planning [36], [37] and model checking [38] communities. For a fixed policy and a finite state space, the problem reduces to solving a linear equation system to compute the set of reachable states in the underlying Markov chain. The next example illustrates the proposed switched system representation for POMDPs with a fixed policy.
Example 2. Consider a POMDP with two states $\{q_1, q_2\}$, two actions $\{a_1, a_2\}$, and observations $Z = \{z_1, z_2\}$. Action $a_1$ corresponds to remaining in the same state, while action $a_2$ corresponds to making a transition to the other state. Observation $z_i$ corresponds to the current state being $q_i$. The policy
$$\pi = \begin{cases} a_1, & b \in \mathcal{B}_1, \\ a_2, & b \in \mathcal{B}_2, \end{cases} \qquad (11)$$
leads to different switching modes based on whether the states belong to the region $\mathcal{B}_1 = \{ b \mid 0 \le b(q_1) \le \varepsilon \}$ or $\mathcal{B}_2 = \{ b \mid \varepsilon < b(q_1) \le 1 \}$ (see Figure 1). That is, the belief update equation (8) is given by
$$b_t = \begin{cases} f_{a_1}(b_{t-1}, z_t), & b \in \mathcal{B}_1, \\ f_{a_2}(b_{t-1}, z_t), & b \in \mathcal{B}_2. \end{cases} \qquad (12)$$
Note that the belief space is given by $\mathcal{B} = \mathcal{B}_1 \cup \mathcal{B}_2 = \{ b \mid b(q_1) + b(q_2) = 1 \}$.
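A short simulation makes the switched structure of (12) explicit. The transition and observation numbers below are our own illustrative choices (a sensor that reports the true state with probability 0.8), not values from the paper.

```python
import numpy as np

# Modes of Example 2: action 1 stays, action 2 switches states.
T = {1: np.eye(2), 2: np.array([[0., 1.], [1., 0.]])}
O = {1: np.array([[0.8, 0.2], [0.2, 0.8]]),
     2: np.array([[0.8, 0.2], [0.2, 0.8]])}
eps = 0.5

def step(b, a, z):
    # One application of f_a in (12), i.e., the Bayesian filter (1).
    unnorm = O[a][:, z] * (b @ T[a])
    return unnorm / unnorm.sum()

b = np.array([0.3, 0.7])
rng = np.random.default_rng(0)
for t in range(10):
    a = 1 if b[0] <= eps else 2   # state-dependent switching, policy (11)
    z = rng.integers(2)           # observations enter as inputs
    b = step(b, a, z)
```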
V. POMDP REACHABILITY ANALYSIS USING LYAPUNOV-LIKE FUNCTIONS
Based on the switched system representation discussed in
the previous section, we can use tools from control theory
to analyze the evolution of the solutions of the belief update
equation including the reachable belief space without the need
to solve the POMDP explicitly.
Fig. 1: An example of a POMDP with two states and the state-dependent switching modes induced by the policy (11): the belief simplex $b_t(q_1) + b_t(q_2) = 1$ is split at $b(q_1) = \varepsilon$ into the regions $\mathcal{B}_1$ and $\mathcal{B}_2$.
Fig. 2: Over-approximation of the reachable belief space using
sub-level sets of a Lyapunov-like function.
From a control theory standpoint, estimating reachable sets, invariant sets, or regions of attraction (ROAs) is a well-studied problem in the literature [39]. Recent results in this area include learning ROAs using Gaussian processes and active learning [40], computation of controlled invariant sets for switching affine systems [41], ROA estimation using formal language [42], and finding invariant sets based on barrier functions [43]. In the following, we propose a basic technique for over-approximating the reachable belief spaces of POMDPs using Lyapunov-like functions.

The next theorem uses sub-level sets of Lyapunov-like functions to over-approximate the reachable belief space, as illustrated in Figure 2.
Theorem 1. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (8). If there exists a function $V : \mathcal{B} \to \mathbb{R}$ such that
$$V\left(f_a(b_{t-1}, z)\right) - V(b_{t-1}) < 0, \quad \text{for all } a \in A,\ z \in Z,\ b_{t-1} \in \mathcal{V}, \qquad (13)$$
and
$$b_0 \in \mathcal{V}, \qquad (14)$$
where $\mathcal{V} = \{ b \in \mathcal{B} \mid V(b) \le 1 \}$, then $b_t \in \mathcal{V}$ for all $t \in \mathbb{Z}_{\ge 0}$ and, in particular, $\mathcal{R}(b_0) \subseteq \mathcal{V}$.
Proof. The proof relies on the fact that the belief trajectories point inward at the boundary of $\mathcal{V}$; hence, they remain inside $\mathcal{V}$. We prove by contradiction. Assume, at time $T \neq 0$, we have $b_T \notin \mathcal{V}$ for some $a \in A$ or $z \in Z$. Thus, $V(b_T) > 1$. Inequality (13) implies
$$V(b_t) - V(b_{t-1}) < 0, \quad \text{for all } a \in A,\ z \in Z,\ b \in \mathcal{V}.$$
Summing up the above expression from time $1$ to $T$ yields
$$\sum_{t=1}^{T} \left( V(b_t) - V(b_{t-1}) \right) < 0, \quad \text{for all } a \in A,\ z \in Z,$$
which simplifies to
$$V(b_T) - V(b_0) < 0, \quad \text{for all } a \in A,\ z \in Z,\ b \in \mathcal{V}.$$
That is, $V(b_T) < V(b_0)$. In addition, from (14), we have $V(b_0) \le 1$. Then, it follows that $V(b_T) \le 1$, which contradicts the hypothesis $b_T \notin \mathcal{V}$.

At this point, we prove by induction that $\mathcal{R}(b_0) \subseteq \mathcal{V}$. Let $\mathcal{R}_t(b_0)$ denote the reachable belief space at time $t$. At $t = 0$, from (14), we have $b_0 \in \mathcal{V}$. Since $\mathcal{R}_0(b_0) = \{b_0\}$, then $\mathcal{R}_0(b_0) \subseteq \mathcal{V}$. At time $t = 1$, from (13), we have
$$V(b_1) - V(b_0) < 0, \quad \forall b_0 \in \mathcal{V},$$
which implies that $V(b_1) < V(b_0)$ for all $b_0 \in \mathcal{V} = \{ b \in \mathcal{B} \mid V(b) \le 1 \}$. Hence, $V(b_1) < V(b_0) \le 1$ and $b_1 \in \mathcal{V}$. In addition, $\mathcal{R}_1(b_0) = \{b_0, b_1\}$ and, since $b_0, b_1 \in \mathcal{V}$, then $\mathcal{R}_1(b_0) \subseteq \mathcal{V}$. Then, by induction, we have for any $t = \tau$, $\tau > 0$, that $b_\tau \in \mathcal{V}$ and $\mathcal{R}_\tau(b_0) \subseteq \mathcal{V}$.

Fig. 3: Over-approximation of the reachable belief space using the sub-level sets of a set of local Lyapunov functions.
We remark that the function $V$ is not a Lyapunov function in the conventional sense, since we do not impose that $V(0) = 0$ and $V(b) > 0$ for all $b \in \mathcal{B} \setminus \{0\}$. The reason for calling $V$ a Lyapunov-like function is to make an analogy to Lyapunov-based techniques for approximating reachable sets and regions of attraction, which is familiar to the controls audience [44], [45]. Henceforth, we abuse the notation and refer to the $V$'s as Lyapunov functions.
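Before attempting to certify conditions (13)-(14) exactly (Section VII), it is often useful to falsify a candidate $V$ cheaply. The following sampling-based sketch is our own addition, not part of the paper's method; by construction it can only refute, never prove, the inclusion $\mathcal{R}(b_0) \subseteq \mathcal{V}$.

```python
import numpy as np

def falsify_lyapunov(V, step, actions, obs, n, num_samples=10000, seed=0):
    """Search for b in {V <= 1} with V(f_a(b, z)) - V(b) >= 0,
    violating (13); step(b, a, z) implements the filter (1)."""
    rng = np.random.default_rng(seed)
    for b in rng.dirichlet(np.ones(n), size=num_samples):  # simplex samples
        if V(b) > 1.0:
            continue                       # only the sub-level set matters
        for a in actions:
            for z in obs:
                if V(step(b, a, z)) - V(b) >= 0.0:
                    return b, a, z         # counterexample to (13)
    return None                            # no violation among the samples
```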
In practice, we may have a large number of actions for a POMDP. Therefore, it may be difficult to find a single Lyapunov function $V$ that satisfies the conditions of Theorem 1. Next, we show that the computation of the Lyapunov function can be decomposed into finding a set of local Lyapunov functions for each action. This method is illustrated in Figure 3.
Theorem 2. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (8). If there exists a set of functions $V_a : \mathcal{B} \to \mathbb{R}$, $a \in A$, such that
$$V_a\left(f_{a'}(b_{t-1}, z)\right) - V_a(b_{t-1}) < 0, \quad \forall z \in Z,\ b_{t-1} \in \mathcal{V}_a, \qquad (15)$$
for all $(a, a') \in A^2$, and
$$b_0 \in \mathcal{V}_a, \qquad (16)$$
for all $a \in A$, where $\mathcal{V}_a = \{ b \in \mathcal{B} \mid V_a(b) \le 1 \}$, then $b_t \in \mathcal{V}_{\max} = \{ b \in \mathcal{B} \mid \max_{a \in A} V_a(b) \le 1 \}$ for all $t \in \mathbb{Z}_{\ge 0}$ and, in particular, $\mathcal{R}(b_0) \subseteq \mathcal{V}_{\max}$.

Proof. We show that if (15)-(16) are satisfied, then the Lyapunov function $V = \max_{a \in A} V_a$ satisfies the conditions of Theorem 1. If (16) holds for all $a \in A$, we have $V_a(b_0) \le 1$, which in turn implies that $\max_{a \in A} V_a(b_0) \le 1$ as well. Thus, $V = \max_{a \in A} V_a$ satisfies (14). Furthermore, if (15) is satisfied for all $(a, a') \in A^2$, we have $V_a\left(f_{a'}(b_{t-1}, z)\right) < V_a(b_{t-1})$ for $b_{t-1} \in \mathcal{V}_a$, which in turn implies that $\max_a V_a\left(f_{a'}(b_{t-1}, z)\right) < \max_a V_a(b_{t-1})$ for $b_{t-1}$ satisfying $\max_a V_a(b_{t-1}) \le 1$ and all $a' \in A$. That is, (13) is satisfied with $V = \max_{a \in A} V_a$. Then, from Theorem 1, we have $b_t \in \mathcal{V}_{\max}$ for all $t \in \mathbb{Z}_{\ge 0}$ and $\mathcal{R}(b_0) \subseteq \mathcal{V}_{\max}$.
If a policy $\pi$ is given, then we can compute a family of local Lyapunov functions. The union of the 1-sublevel sets of these functions over-approximates the reachable belief space.
Theorem 3. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$, the policy $\pi : \mathcal{B} \to A$, and an induced partitioning $\{\mathcal{B}_\alpha\}$ such that $\mathcal{B} = \cup_\alpha \mathcal{B}_\alpha$ as given in (9), consider the belief update dynamics (8). If there exists a family of functions $V_\alpha : \mathcal{B}_\alpha \to \mathbb{R}$, $\alpha \in \Gamma$, such that
$$V_\alpha\left(f_a(b_{t-1}, z)\right) - V_\alpha(b_{t-1}) < 0, \quad \text{for all } z \in Z,\ b_{t-1} \in \mathcal{V}_\alpha, \qquad (17)$$
and
$$b_0 \in \mathcal{V}_\alpha, \qquad (18)$$
where $\mathcal{V}_\alpha = \{ b \in \mathcal{B}_\alpha \mid V_\alpha(b) \le 1 \}$, then $b_t \in \cup_\alpha \mathcal{V}_\alpha$ for all $t \in \mathbb{Z}_{\ge 0}$ and, in particular, $\mathcal{R}(b_0) \subseteq \cup_{\alpha \in \Gamma} \mathcal{V}_\alpha$.

Proof. From Theorem 1, we can infer that, for each fixed $a \in A$, we have $\mathcal{R}_a(b_0) \subseteq \mathcal{V}_\alpha$, where $\mathcal{R}_a(b_0)$ is the reachable belief space under action $a$ starting at $b_0$. Since the support of each $V_{\alpha_i}$ is $\mathcal{B}_{\alpha_i}$, $\mathcal{V}_{\alpha_i} \subseteq \mathcal{B}_{\alpha_i}$, and $\cup_{\alpha \in \Gamma} \mathcal{B}_\alpha = \mathcal{B}$, then $\cup_{\alpha \in \Gamma} \mathcal{V}_\alpha = \mathcal{V} \subseteq \mathcal{B}$. Furthermore, since $\mathcal{R}_a(b_0) \subseteq \mathcal{V}_\alpha$, then $\cup_{a \in A} \mathcal{R}_a(b_0) = \mathcal{R}(b_0) \subseteq \cup_{\alpha \in \Gamma} \mathcal{V}_\alpha$.
We remark at the end of this section that the existence of a policy $\pi$ for a POMDP removes all non-determinism and partial observability, resulting in an induced Markov chain. In this case, a method based on Foster-Lyapunov functions, used in [46] for analyzing Markov chains, may also be adapted to verify POMDPs with a fixed policy.
VI. POMDP SAFETY VERIFICATION USING BARRIER CERTIFICATES
In the following, we show how we can use barrier certificates to verify properties of the switched systems induced by a POMDP. We begin by verifying safety for a given POMDP. Let us define the unsafe set
$$\mathcal{B}_u^s = \left\{ b \in \mathcal{B} \ \middle|\ \sum_{q \in Q_u} b_\tau(q) > \lambda \right\}, \qquad (19)$$
which is the complement of (5).
Theorem 4. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (8). Given a set of initial beliefs $\mathcal{B}_0 \subset [0, 1]^{|Q|}$, an unsafe set $\mathcal{B}_u^s$ as given in (19) (with $\mathcal{B}_0 \cap \mathcal{B}_u^s = \emptyset$), and a constant $\tau \in \mathbb{Z}_{\ge 0}$, if there exists a function $B : \mathbb{Z} \times \mathcal{B} \to \mathbb{R}$, called the barrier certificate, such that
$$B(\tau, b_\tau) > 0, \quad \forall b_\tau \in \mathcal{B}_u^s, \qquad (20)$$
$$B(0, b_0) < 0, \quad \forall b_0 \in \mathcal{B}_0, \qquad (21)$$
and
$$B\left(t, f_a(b_{t-1}, z)\right) - B(t-1, b_{t-1}) \le 0, \quad \forall t \in \{1, 2, \ldots, \tau\},\ \forall a \in A,\ \forall z \in Z,\ \forall b \in \mathcal{B}, \qquad (22)$$
then there exists no solution of the belief update equation (8) such that $b_0 \in \mathcal{B}_0$ and $b_\tau \in \mathcal{B}_u^s$, for any $a \in A$.

Proof. Please refer to the proof of Theorem 1 in [47], noting that $\Theta = \emptyset$.
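As with the Lyapunov conditions, a candidate $B$ can be screened numerically before an exact certificate is attempted. The sketch below is our own addition, with the dynamics, unsafe-set indicator, and sample sets supplied by the caller; a pass on finitely many points is necessary but not sufficient for safety.

```python
def screen_barrier(Bc, step, actions, obs, unsafe, b0_samples,
                   belief_samples, tau):
    """Check conditions (20)-(22) on samples for a candidate Bc(t, b)."""
    unsafe_ok = all(Bc(tau, b) > 0
                    for b in belief_samples if unsafe(b))        # (20)
    init_ok = all(Bc(0, b0) < 0 for b0 in b0_samples)            # (21)
    decrease_ok = all(Bc(t, step(b, a, z)) - Bc(t - 1, b) <= 0   # (22)
                      for t in range(1, tau + 1)
                      for a in actions for z in obs
                      for b in belief_samples)
    return unsafe_ok and init_ok and decrease_ok
```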
The above theorem provides conditions under which the POMDP is guaranteed to be safe. The next result brings forward a set of conditions which verify whether the performance criterion (7) is satisfied. This is simply carried out by making the unsafe set time-dependent (a tube).
Corollary 1. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (8) and the performance criterion $\gamma$ as given by (7). Let $\tilde{\gamma} : \mathbb{Z}_{\ge 0} \to \mathbb{R}$ satisfy
$$\sum_{s=0}^{\tau} \tilde{\gamma}(s) \le \gamma. \qquad (23)$$
Given a set of initial beliefs $\mathcal{B}_0 \subset \mathcal{B}$, an unsafe set
$$\mathcal{B}_u^o = \left\{ (t, b) \mid r(b_t, a_t) > \tilde{\gamma}(t) \right\}, \qquad (24)$$
and a constant $\tau \in \mathbb{Z}_{\ge 0}$, if there exists a function $B : \mathbb{Z} \times \mathcal{B} \to \mathbb{R}$ such that (20)-(22) are satisfied with $\mathcal{B}_u^o$ instead of $\mathcal{B}_u^s$, then for all $b_0 \in \mathcal{B}_0$ the performance criterion (7) holds.

Proof. Please refer to the proof of Corollary 1 in [47], noting that $\Theta = \emptyset$.
The technique used in Corollary 1 is analogous to the one used in [48], [49] for bounding (time-averaged) functional outputs of systems described by partial differential equations. The method proposed here, however, can be used for a large class of discrete-time systems, of which the belief update equation is a special case that is of interest to us.

In practice, we may have a large number of actions. Then, finding a barrier certificate that satisfies the conditions of Theorem 4 becomes computationally prohibitive. In the next result, we show how the calculation of the barrier certificate can be decomposed into finding a set of barrier certificates for each action and then taking the convex hull of them.
Theorem 5. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (8). Given a safety constraint as described in (5) with a safety requirement $\lambda$ and a point in time $\tau \in \mathbb{Z}_{\ge 0}$, if there exists a set of functions $B_a : \mathbb{Z} \times \mathcal{B} \to \mathbb{R}$, $a \in A$, such that
$$B_a(\tau, b_\tau) > 0, \quad \forall b_\tau \in \mathcal{B}_u^s,\ \forall a \in A, \qquad (25)$$
with $\mathcal{B}_u^s$ as described in (19),
$$B_a(0, b_0) < 0, \quad \text{for } b_0 = p_0,\ \forall a \in A, \qquad (26)$$
and
$$B_a\left(t, f_{a'}(b_{t-1}, z)\right) - B_a(t-1, b_{t-1}) \le 0, \quad \forall t \in \{1, 2, \ldots, \tau\},\ \forall (a, a') \in A^2,\ \forall z \in Z,\ \forall b \in \mathcal{B}, \qquad (27)$$
then the safety requirement $\lambda$ is satisfied, i.e., inequality (5) holds. Furthermore, the overall barrier certificate is given by $B = co\{B_a\}_{a \in A}$.

Proof. We show that if (25)-(27) are satisfied, then the barrier certificate $B = co\{B_a\}_{a \in A}$ satisfies the conditions of Theorem 4. For each $a \in A$, we multiply both sides of (25) by a constant $\alpha_a \in [0, 1]$ such that $\sum_{a \in A} \alpha_a = 1$. We obtain
$$\sum_{a \in A} \alpha_a B_a(\tau, b_\tau) > 0, \quad \forall b_\tau \in \mathcal{B}_u^s,$$
which implies that
$$B(\tau, b_\tau) = co\{B_a(\tau, b_\tau)\}_{a \in A} > 0, \quad \forall b_\tau \in \mathcal{B}_u^s.$$
Therefore, (20) is satisfied with $B = co\{B_a\}_{a \in A}$. Similarly, we can show that if (26) is satisfied, $B = co\{B_a\}_{a \in A}$ satisfies (21). Multiplying both sides of (27) by constants $\alpha_a$ such that $\sum_{a \in A} \alpha_a = 1$ and summing over them yields
$$\sum_{a \in A} \alpha_a \left( B_a\left(t, f_{a'}(b_{t-1}, z)\right) - B_a(t-1, b_{t-1}) \right) = \sum_{a \in A} \alpha_a B_a\left(t, f_{a'}(b_{t-1}, z)\right) - \sum_{a \in A} \alpha_a B_a(t-1, b_{t-1}) \le 0,$$
for all $t \in \{1, 2, \ldots, \tau\}$, $z \in Z$, $a' \in A$, and $b \in \mathcal{B}$, which implies that (22) is satisfied for $B = co\{B_a\}_{a \in A}$. Therefore, from Theorem 4, we conclude that the safety requirement $\lambda$ is satisfied.
The efficacy of the above result is that we can search for each action-based barrier certificate $B_a$, $a \in A$, independently or in parallel and then verify whether the overall POMDP satisfies a pre-specified safety requirement (see Fig. 4 for an illustration), as sketched below.
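A minimal sketch of this decomposition, assuming a user-supplied routine find_barrier(a) (e.g., one of the SOS programs of Section VII) that returns a certificate $B_a$ or None:

```python
from concurrent.futures import ProcessPoolExecutor

def find_action_barriers(actions, find_barrier):
    """Search for the per-action certificates B_a of Theorem 5 in
    parallel; find_barrier must be picklable (a module-level function)."""
    with ProcessPoolExecutor() as pool:
        certs = dict(zip(actions, pool.map(find_barrier, actions)))
    if any(Ba is None for Ba in certs.values()):
        return None   # some B_a was not found; Theorem 5 does not apply
    return certs      # overall certificate: B = co{B_a}, a in A
```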
Next, we demonstrate that, if a policy is given, the search for the barrier certificate can be decomposed into the search for a set of local barrier certificates. We denote by $a_i$ the action active in the partition $\mathcal{B}_i$.
Theorem 6. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (8). Given a safety constraint as described in (5) with a safety requirement $\lambda$, a point in time $\tau \in \mathbb{Z}_{\ge 0}$, and a policy $\pi : \mathcal{B} \to A$ as described in (9), if there exists a set of functions $B_i : \mathbb{Z} \times \mathcal{B}_i \to \mathbb{R}$, $i \in \{1, 2, \ldots, m\}$, such that
$$B_i(\tau, b_\tau) > 0, \quad \forall b_\tau \in \mathcal{B}_u^s \cap \mathcal{B}_i,\ i \in \{1, 2, \ldots, m\}, \qquad (28)$$
with $\mathcal{B}_u^s$ as described in (19),
$$B_i(0, b_0) < 0, \quad \text{for } b_0 = p_0,\ i \in \{1, 2, \ldots, m\}, \qquad (29)$$
and
$$B_i\left(t, f_{a_i}(b_{t-1}, z)\right) - B_i(t-1, b_{t-1}) \le 0, \quad \forall t \in \{1, 2, \ldots, \tau\},\ \forall z \in Z,\ \forall b \in \mathcal{B}_i,\ i \in \{1, 2, \ldots, m\}, \qquad (30)$$
then the safety requirement $\lambda$ is satisfied, i.e., inequality (5) holds. Furthermore, the overall barrier certificate is given by $B = co\{B_i\}_{i=1}^{m}$.

Proof. We demonstrate that if (28)-(30) are satisfied, then the barrier certificate $B = co\{B_i\}_{i=1}^{m}$ satisfies the conditions of Theorem 4. For each $i \in \{1, 2, \ldots, m\}$, we multiply both sides of (28) by a constant $\alpha_i$ such that $\sum_{i=1}^{m} \alpha_i = 1$. We obtain
$$\sum_{i=1}^{m} \alpha_i B_i(\tau, b_\tau) > 0, \quad \forall b_\tau \in \cup_{i=1}^{m} \left( \mathcal{B}_u^s \cap \mathcal{B}_i \right).$$
Since the support of each $B_i$ is $\mathcal{B}_i$, $\cup_{i=1}^{m} \mathcal{B}_i = \mathcal{B}$, and $\mathcal{B}_u^s \subset \mathcal{B}$, we have $\cup_{i=1}^{m} \left( \mathcal{B}_u^s \cap \mathcal{B}_i \right) = \mathcal{B}_u^s \cap \mathcal{B} = \mathcal{B}_u^s$. Hence,
$$B(\tau, b_\tau) = co\{B_i(\tau, b_\tau)\}_{i=1}^{m} > 0, \quad \forall b_\tau \in \mathcal{B}_u^s.$$
Therefore, (20) is satisfied with $B = co\{B_i\}_{i=1}^{m}$. Similarly, we can show that if (29) is satisfied, $B = co\{B_i\}_{i=1}^{m}$ satisfies (21). Multiplying both sides of (30) by constants $\alpha_i$ such that $\sum_{i=1}^{m} \alpha_i = 1$ and summing over them gives
$$\sum_{i=1}^{m} \alpha_i \left( B_i\left(t, f_{a_i}(b_{t-1}, z)\right) - B_i(t-1, b_{t-1}) \right) = \sum_{i=1}^{m} \alpha_i B_i\left(t, f_{a_i}(b_{t-1}, z)\right) - \sum_{i=1}^{m} \alpha_i B_i(t-1, b_{t-1}) \le 0,$$
for all $t \in \{1, 2, \ldots, \tau\}$, $z \in Z$, and $b \in \mathcal{B}$, which implies that (22) is satisfied for $B = co\{B_i\}_{i=1}^{m}$. Therefore, from Theorem 4, we conclude that the safety requirement $\lambda$ is satisfied.
We proposed two techniques for decomposing the con-
struction of the barrier certificates and checking given safety
requirements. Our method relied on barrier certificates that
take the form of the convex hull of a set of local barrier cer-
tificates (see similar results in [50], [51]). Though the convex
hull barrier certificate may introduce a level of conservatism,
it is computationally easier to find (as will be discussed in
more detail in the next section).
Fig. 4: Decomposing the barrier certificate computation for a POMDP with two actions $a_1$ and $a_2$: the zero-level sets of $B_{a_1}$ and $B_{a_2}$ at trial $\tau$ separate the evolutions of the hypothesis beliefs starting at $b_0$ from $\mathcal{B}_u^o$. The green line illustrates the zero-level set of the barrier certificate formed by taking the convex hull of $B_{a_1}$ and $B_{a_2}$.
VII. COMPUTATIONAL METHOD BASED ON SUM-OF-SQUARES PROGRAMMING
In this section, we present a set of SOS programs that
can be used to find the Lyapunov functions used for reach-
ability analysis and the barrier certificates used for safety
and performance verification. For a brief introduction to SOS
programming and some related results, see Appendix A.
In order to formulate SOS programs, we need the problem data to be polynomial or rational. Fortunately, the belief update equation (1) is a rational function in the belief states $b_t(q)$, $q \in Q$:
$$b_t(q') = \frac{M_a\left(b_{t-1}(q'), z\right)}{N_a\left(b_{t-1}(q'), z\right)} = \frac{O(q', a_{t-1}, z_t) \sum_{q \in Q} T(q, a_{t-1}, q')\, b_{t-1}(q)}{\sum_{q' \in Q} O(q', a_{t-1}, z_t) \sum_{q \in Q} T(q, a_{t-1}, q')\, b_{t-1}(q)}, \qquad (31)$$
where $M_a$ and $N_a$ are linear and positive functions of $b$.

Moreover, the safety and performance objectives, described by (5) and (7), respectively, and the belief simplex are all semi-algebraic sets, since they can be described by polynomial inequalities/equalities.
Throughout this section, we also assume that the Lyapunov functions and the barrier certificates are parametrized as polynomial functions of $b$. Hence, the level sets of those functions are semi-algebraic sets as well.
At this stage, we are ready to present conditions based on
SOS programs to find a Lyapunov function for POMDPs and
thus over-approximate the reachable belief space.
Corollary 2. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (31). If there exist a function $V \in \mathbb{R}[b]$ of degree $d$, a function $p \in \Sigma[b]$, and a constant $c > 0$ such that
$$- N_a^d(b_{t-1}, z) \left( V\!\left( \frac{M_a(b_{t-1}, z)}{N_a(b_{t-1}, z)} \right) - V(b_{t-1}) \right) - p(b_{t-1}) \left( 1 - V(b_{t-1}) \right) - c \ \in\ \Sigma[b_{t-1}], \qquad (32)$$
for all $a \in A$ and $z \in Z$, and
$$1 - V(b_0) \ge 0, \qquad (33)$$
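To make condition (32) concrete, the following sympy sketch (our own illustration) assembles, for a fixed action-observation pair, the polynomial whose sum-of-squares membership is to be certified; multiplying by $N_a^d$ clears the denominators introduced by the rational update (31). The resulting expression would then be handed to an SOS tool (e.g., SOSTOOLS, or a Python SOS front end over an SDP solver); the matrix encoding of $T$ and $O$ is our own convention.

```python
import sympy as sp

def sos_condition_32(Ta, Oa_z, V, p, c, d, b):
    """Build the polynomial of (32) for one (a, z) pair.

    Ta[q][qp] = T(q, a, qp); Oa_z[qp] = O(qp, a, z); V and p are
    polynomials in the belief symbols b; V has degree d.
    """
    n = len(b)
    M = [Oa_z[qp] * sum(Ta[q][qp] * b[q] for q in range(n))
         for qp in range(n)]          # numerators M_a of (31)
    N = sum(M)                        # common denominator N_a of (31)
    # Simultaneous substitution b(q') -> M_a/N_a gives V evaluated at
    # the updated belief; N**d then clears every denominator.
    V_next = V.subs(dict(zip(b, [Mi / N for Mi in M])), simultaneous=True)
    expr = -N**d * (V_next - V) - p * (1 - V) - c
    return sp.expand(sp.cancel(sp.together(expr)))   # a polynomial in b
```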