Control Theory Meets POMDPs:
A Hybrid Systems Approach
Mohamadreza Ahmadi, Nils Jansen, Bo Wu, and Ufuk Topcu
Abstract—Partially observable Markov decision processes
(POMDPs) provide a modeling framework for a variety of se-
quential decision making under uncertainty scenarios in artificial
intelligence (AI). Since the states are not directly observable
in a POMDP, decision making has to be performed based
on the output of a Bayesian filter (continuous beliefs). Hence,
POMDPs are often computationally intractable to solve exactly
and researchers resort to approximate methods often using
discretizations of the continuous belief space. These approximate
solutions are, however, prone to discretization errors, which have made POMDPs ineffective in applications wherein guarantees for safety, optimality, or performance are required. To overcome
the complexity challenge of POMDPs, we apply notions from
control theory. The goal is to determine the reachable belief space
of a POMDP, that is, the set of all possible evolutions given an
initial belief distribution over the states and a set of actions and
observations. We begin by casting the problem of analyzing a
POMDP into analyzing the behavior of a discrete-time switched
system. For estimating the reachable belief space, we find over-
approximations in terms of sub-level sets of Lyapunov functions.
Furthermore, in order to verify safety and optimality require-
ments of a given POMDP, we formulate a barrier certificate
theorem, wherein we show that if there exists a barrier certificate
satisfying a set of inequalities along with the belief update
equation of the POMDP, the safety and optimality properties are
guaranteed to hold. In both cases, we show how the calculations
can be decomposed into smaller problems that can be solved
in parallel. The conditions we formulate can be computationally
implemented as a set of sum-of-squares programs. We illustrate
the applicability of our method by addressing two problems in
active ad scheduling and machine teaching.
I. INTRODUCTION
A formal model for planning subject to stochastic behavior
is a Markov decision process (MDP) [1], where an agent
chooses to perform an action under full knowledge of the
environment it is operating in. The outcome of an action
is a probability distribution over the system states. Many
applications, however, allow only
partial observability
of the
current system state, say via noisy sensor outputs [2], [3], [4].
Partially observable Markov decision processes (POMDPs)
extend MDPs to account for such partial information [5]. Upon certain observations, the agent infers the likelihood of the system being in a certain state, called the belief state. The belief state together with an update function form a (typically uncountably infinite) MDP, referred to as the belief MDP [6], [7], [8].

This work was supported by AFOSR FA9550-19-1-0005, DARPA D19AP00004, NSF 1646522, and NSF 1652113. M. Ahmadi is with the Center for Autonomous Systems and Technologies at the California Institute of Technology, 1200 E. Calif. Blvd., MC 104-44, Pasadena, CA 91125; e-mail: mrahmadi@caltech.edu. B. Wu and U. Topcu are with the Oden Institute for Computational Engineering and Sciences, University of Texas at Austin, 201 E 24th St, Austin, TX 78712, USA; e-mail: {bwu3, utopcu}@utexas.edu. N. Jansen is with the Department of Software Science (SWS), Institute for Computing and Information Sciences, Radboud University, PO Box 9010, 6500 GL Nijmegen, Netherlands; e-mail: n.jansen@science.ru.nl.

However, solving a POMDP exactly, i.e., synthesizing
a policy ensuring optimality or a certain performance, is
carried out by assessing the entire belief MDP, rendering the
problem undecidable [7], [9]. Several promising approximate point-based methods via finite abstractions of the belief space have been proposed in the literature [10], [11], [12], [13]. At the heart of these methods is the observation that, given an initial belief $b_0$ (the initial probability of the system being in some state), the set of possible evolutions emanating from $b_0$ under arbitrary action and observation sequences, called the reachable belief space, is much smaller than the whole belief space [14]. Hence, only a finite number of samples or points from the reachable belief space need be considered for decision making.
Nonetheless, computing the reachable belief space is non-
trivial and fundamental in analyzing the convergence of ap-
proximate POMDP solutions. In fact, Pineau et al. [15] showed that point-based value iteration can approximate the solution of the POMDP sufficiently closely, whenever the total reward function is a discounted sum of the rewards over time, the samples are sufficiently close to each other, and the sampling is carried out over all of the reachable belief space. In
another vein, Lee et al. [16] demonstrated that an approximate POMDP solution can be computed in time polynomial in the covering number (the number of balls covering a set) of the reachable belief space. Yet, the problem lies in the fact that the reachable belief space is not known a priori, and it is therefore hard to make judgments about the convergence rate of a given approximate method. Several techniques have been proposed for estimating the reachable belief space, mainly based on heuristics [17], [18], "smarter" sampling methods, such as importance sampling [19], and graph search [20], [21]. However, these methods are difficult to adapt from one problem setting to another and are often ad hoc.
Even with a reachable belief space at hand, the approximate
techniques do not provide a guarantee for safety or optimality.
That is, it is not clear whether the probability of satisfying a pre-specified safety/optimality requirement is an upper bound or a lower bound for a given POMDP. Establishing guaranteed safety and optimality is of fundamental importance in safety-critical applications, e.g., aircraft collision avoidance [22] and Mars rovers [23].
In this paper, we use notions from control theory to analyze
a POMDP without the need to solve it explicitly. In this
sense, our method follows in the footsteps of Lyapunov's second method for verifying the stability of all solutions of an ODE without solving it, and of barrier certificate theorems for safety verification without solving an ODE for all initial conditions in a given set. Our
main contributions are fourfold:
• We propose a switched system representation for the belief evolutions of POMDPs;
• We present a method based on Lyapunov functions to over-approximate the reachable belief space, thus allowing us to verify reachability objectives. These over-approximations are in terms of sub-level sets of Lyapunov functions. We demonstrate how these calculations can be decomposed in terms of the POMDP actions;
• We propose a method based on barrier certificates to verify optimality and safety of a given POMDP. We show how the calculations of the barrier certificates can be decomposed in terms of actions;
• We formulate a set of sum-of-squares programs for finding the Lyapunov functions and the barrier certificates, which over-approximate the reachable belief space and verify safety and optimality, respectively.
We illustrate the efficacy of our methodology with two exam-
ples from cyber-physical systems and artificial intelligence.
Preliminary results related to the formulation of the barrier
certificates for POMDPs applied to privacy verification and
robust decision making were disseminated in [24], [25].
The rest of the paper is organized as follows. In the
next section, we briefly describe the POMDP formalism. In
Section III, we present the problems studied in this paper.
In Section IV, we formulate a hybrid system representation for the belief MDP. In Section V, we show how we can use Lyapunov functions to approximate the reachable belief space. In Section VI, we propose a barrier certificate theorem for safety/optimality verification of POMDPs. In Section VII, we
propose a set of sum-of-squares programs for computational
implementation of the proposed results. In Section VIII, we
elucidate the proposed methodology with examples. Finally,
in Section IX, we conclude the paper and give directions for
future research.
Notation: The notation employed in this paper is relatively straightforward. $\mathbb{R}_{\ge 0}$ denotes the set $[0, \infty)$. $\mathbb{Z}$ denotes the set of integers and $\mathbb{Z}_{\ge c}$ for $c \in \mathbb{Z}$ denotes the set $\{c, c+1, c+2, \ldots\}$. $\mathcal{R}[x]$ denotes the set of polynomial functions with real coefficients in $x \in \mathbb{R}^n$, $p : \mathbb{R}^n \to \mathbb{R}$, and $\Sigma[x] \subset \mathcal{R}[x]$ is the subset of polynomials with a sum-of-squares decomposition; i.e., $p \in \Sigma[x]$ if and only if there are $p_i \in \mathcal{R}[x]$, $i \in \{1, \ldots, k\}$, such that $p = p_1^2 + \cdots + p_k^2$. For a finite set $A$, $|A|$ denotes the number of elements in $A$ and $co\{A\}$ denotes the convex hull of $A$. In particular, for a family of indexed functions $f_a : \mathcal{X} \to \mathcal{Y}$, $a \in A$, $co\{f_a\}_{a \in A}$ denotes the convex combination of the $f_a$, $a \in A$, i.e., $co\{f_a\}_{a \in A} = \sum_{a \in A} \beta_a f_a$, where $0 \le \beta_a \le 1$ and $\sum_{a \in A} \beta_a = 1$.
II. PARTIALLY OBSERVABLE MARKOV DECISION PROCESSES
A POMDP is a sequential decision-making modeling frame-
work, which considers not only the stochastic outcomes of
actions, but also the imperfect state observations [26].
Definition 1. A POMDP $\mathcal{P}$ is a tuple $(Q, p_0, A, T, Z, O)$, where
• $Q$ is a finite set of states with indices $\{1, 2, \ldots, n\}$;
• $p_0 : Q \to [0, 1]$ defines the distribution of the initial states, i.e., $p_0(q)$ denotes the probability of starting at $q \in Q$;
• $A$ is a finite set of actions;
• $T : Q \times A \times Q \to [0, 1]$ is the transition probability, where
$$T(q, a, q') := P(q_t = q' \mid q_{t-1} = q,\ a_{t-1} = a), \quad \forall t \in \mathbb{Z}_{\ge 1},\ q, q' \in Q,\ a \in A;$$
• $Z$ is the set of all possible observations. Often, $z \in Z$ is an incomplete projection of the world state $q$, contaminated by noise;
• $O : Q \times A \times Z \to [0, 1]$ is the observation probability (sensor model), where
$$O(q', a, z) := P(z_t = z \mid q_t = q',\ a_{t-1} = a), \quad \forall t \in \mathbb{Z}_{\ge 1},\ q' \in Q,\ a \in A,\ z \in Z.$$
Since the states are not directly accessible in a POMDP, de-
cision making requires the history of observations. Therefore,
we need to define the notion of a
belief
(or the posterior) as
sufficient statistics for the history [27]. Given a POMDP, the belief at $t = 0$ is defined as $b_0(q) = p_0(q)$, and $b_t(q)$ denotes the probability of the system being in state $q$ at time $t$. At time $t$, when action $a_{t-1} \in A$ has been taken and $z_t \in Z$ is observed, the belief update can be obtained by a Bayesian filter as
$$
\begin{aligned}
b_t(q') &= P(q' \mid z_t, a_{t-1}, b_{t-1}) \\
&= \frac{P(z_t \mid q', a_{t-1}, b_{t-1})\, P(q' \mid a_{t-1}, b_{t-1})}{P(z_t \mid a_{t-1}, b_{t-1})} \\
&= \frac{P(z_t \mid q', a_{t-1}, b_{t-1}) \sum_{q \in Q} P(q' \mid a_{t-1}, b_{t-1}, q)\, P(q \mid a_{t-1}, b_{t-1})}{P(z_t \mid a_{t-1}, b_{t-1})} \\
&= \frac{O(q', a_{t-1}, z_t) \sum_{q \in Q} T(q, a_{t-1}, q')\, b_{t-1}(q)}{\sum_{q' \in Q} O(q', a_{t-1}, z_t) \sum_{q \in Q} T(q, a_{t-1}, q')\, b_{t-1}(q)},
\end{aligned} \quad (1)
$$
where the beliefs belong to the belief unit simplex
$$\mathcal{B} = \Big\{ b \in [0, 1]^{|Q|} \;\Big|\; \textstyle\sum_{q} b_t(q) = 1,\ \forall t \Big\}.$$
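For concreteness, the filter (1) amounts to a prediction step with $T$, a multiplicative correction with $O$, and a normalization back onto the simplex. The following is a minimal NumPy sketch; the array layout (`T[q, a, q']`, `O[q', a, z]`) and the random test model are our own illustrative assumptions, not part of the formalism above.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """One step of the Bayesian filter (1).

    b : belief vector over states, shape (n,)
    a : action index taken at time t-1
    z : observation index received at time t
    T : transition tensor, T[q, a, q'] = P(q' | q, a)
    O : observation tensor, O[q', a, z] = P(z | q', a)
    """
    # Predicted belief: sum_q T(q, a, q') b(q)
    pred = T[:, a, :].T @ b
    # Correct with the observation likelihood O(q', a, z)
    unnorm = O[:, a, z] * pred
    # Normalize so the result stays on the belief simplex B
    return unnorm / unnorm.sum()

# Tiny sanity check on a random 3-state, 2-action, 2-observation model
rng = np.random.default_rng(0)
n, nA, nZ = 3, 2, 2
T = rng.random((n, nA, n)); T /= T.sum(axis=2, keepdims=True)
O = rng.random((n, nA, nZ)); O /= O.sum(axis=2, keepdims=True)
b0 = np.array([1.0, 0.0, 0.0])
b1 = belief_update(b0, a=0, z=1, T=T, O=O)
assert abs(b1.sum() - 1.0) < 1e-12  # stays in the unit simplex
```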
A policy in a POMDP setting is then a mapping
$\pi : \mathcal{B} \to A$, i.e., a mapping from the continuous belief space into the
discrete and finite action space. Essentially, a policy defined
over the belief space has infinite memory when mapped to the
potentially infinite executions of the POMDP.
III. PROBLEM FORMULATION
We consider reachability, safety, and optimality properties of finite POMDPs. In the following, we formally describe these problems.
Given an initial belief $b_0$ and a POMDP $\mathcal{P}$, we are concerned with estimating the set of all belief evolutions emanating from $b_0$, which is known as the reachable belief space.
Problem 1 (Reachable Belief Space). Given a POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, estimate
$$\mathcal{R}(b_0) = \{ b \in \mathcal{B} \mid b = b_t,\ \text{for all } t > 0,\ a \in A,\ \text{and } z \in Z \}. \quad (2)$$
We are also interested in studying the reachable belief space induced by a policy $\pi$.

Problem 2 (Reachable Belief Space Induced by Policy $\pi$). Given a POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1 and a policy $\pi : \mathcal{B} \to A$, estimate
$$\mathcal{R}^\pi(b_0) = \{ b \in \mathcal{B} \mid b = b_t,\ \text{for all } t > 0\ \text{and } z \in Z \}. \quad (3)$$
We define safety as the probability of reaching a set of unsafe states $Q_u \subseteq Q$ being less than a given constant. To this end, we use the belief states. We are interested in solving the following problem.

Problem 3 (Safety at $\tau$). Given a POMDP as in Definition 1, a point in time $\tau \in \mathbb{Z}_{\ge 0}$, a set of unsafe states $Q_u \subseteq Q$, and a safety requirement constant $\lambda$, check whether
$$\sum_{q \in Q_u} b_\tau(q) \le \lambda. \quad (4)$$
We also consider safety for all time.

Problem 4 (Safety). Given a POMDP as in Definition 1, a set of unsafe states $Q_u \subseteq Q$, and a safety requirement constant $\lambda$, check whether
$$\sum_{q \in Q_u} b_t(q) \le \lambda, \quad \forall t \in \mathbb{Z}_{\ge 0}. \quad (5)$$
In addition to safety, we are interested in checking whether an optimality criterion is satisfied.

Problem 5 (Optimality). Given a POMDP as in Definition 1, the reward function $R : Q \times A \to \mathbb{R}$, in which $R(q, a)$ denotes the reward of taking action $a$ while being at state $q$, a point in time $\tau \in \mathbb{Z}_{\ge 0}$, and an optimality requirement $\gamma$, check whether
$$\sum_{s=0}^{\tau} r(b_s, a_s) \le \gamma, \quad (6)$$
where $r(b_s, a_s) = \sum_{q \in Q} b_s(q) R(q, a_s)$.
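Problems 3-5 are straightforward to evaluate along any single simulated belief/action trajectory (the hard part, addressed below, is certifying them for all trajectories). A small sketch of these pointwise checks, assuming beliefs and rewards are stored as NumPy arrays and writing requirement (6) in the budget form $\le \gamma$ used above:

```python
import numpy as np

def safe_at(b_traj, Q_u, lam, tau):
    """Problem 3: check sum_{q in Q_u} b_tau(q) <= lambda at time tau."""
    return b_traj[tau][Q_u].sum() <= lam

def safe_always(b_traj, Q_u, lam):
    """Problem 4: the same safety condition for every recorded time step."""
    return all(b[Q_u].sum() <= lam for b in b_traj)

def optimal(b_traj, a_traj, R, gamma, tau):
    """Problem 5: accumulated belief reward sum_s r(b_s, a_s) vs. gamma,
    with r(b_s, a_s) = sum_q b_s(q) R(q, a_s)."""
    total = sum(b_traj[s] @ R[:, a_traj[s]] for s in range(tau + 1))
    return total <= gamma

# Example: two states, unsafe set {q_2}, a constant belief trajectory
b_traj = [np.array([0.9, 0.1])] * 4
a_traj = [0, 0, 0, 0]
R = np.array([[1.0, 0.0], [0.0, 0.0]])
print(safe_at(b_traj, [1], 0.2, 3), safe_always(b_traj, [1], 0.2))
print(optimal(b_traj, a_traj, R, gamma=4.0, tau=3))
```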
IV. TREATING POMDPS AS HYBRID SYSTEMS
Given a POMDP $\mathcal{P}$ as in Definition 1, checking whether (4) and (6) hold by solving the POMDP directly is an undecidable problem [9]. In this section, we demonstrate that POMDPs can be represented as a special hybrid system [28], i.e., a discrete-time switched system [29], [30], [31]. We take advantage of this representation in the sequel to formulate indirect methods for analyzing a POMDP.
According to the belief update equation (1), the current belief evolves according to the belief at the previous step, the action at the previous step, and the current observation. Furthermore, the belief evolution jumps into different modes based on the actions taken. Therefore, we can represent the belief evolutions as a discrete-time switched system, where the actions $a \in A$ define the switching modes. Formally, the belief dynamics (1) can be described as
$$b_t = f_a(b_{t-1}, z_t), \quad (7)$$
where $b_t$ denotes the belief vector at time $t$ belonging to the belief unit simplex $\mathcal{B}$, and $b_0 \in \mathcal{B}_0 \subset \mathcal{B}$ (in particular, $\mathcal{B}_0$ can be a singleton) represents the set of initial beliefs (prior). In (7), $a \in A$ denotes the actions that can be interpreted as the switching modes, $z \in Z$ are the observations representing inputs, and $t \in \mathbb{Z}_{\ge 1}$ denotes the discrete time instances. The
(rational) vector fields $\{f_a\}_{a \in A}$, with $f_a : [0, 1]^{|Q|} \to [0, 1]^{|Q|}$, are described as the vectors with rows
$$f^{q'}_a(b_{t-1}, z) = \frac{O(q', a, z) \sum_{q \in Q} T(q, a, q')\, b_{t-1}(q)}{\sum_{q' \in Q} O(q', a, z) \sum_{q \in Q} T(q, a, q')\, b_{t-1}(q)},$$
where $f^{q'}_a$ denotes the $q'$th row of $f_a$.
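The switched-system view suggests a brute-force way to sample the reachable belief space: apply every mode $f_a$ and every observation $z$ to each frontier belief. The sketch below does exactly that; its exponential growth in the horizon is precisely why a priori over-approximations of $\mathcal{R}(b_0)$ are valuable. Array conventions follow the earlier sketch, and the random model is illustrative.

```python
import itertools
import numpy as np

def f(b, a, z, T, O):
    """Mode-a vector field f_a(b, z) from (7): one Bayesian filter step."""
    unnorm = O[:, a, z] * (T[:, a, :].T @ b)
    return unnorm / unnorm.sum()

def sample_reachable(b0, T, O, horizon):
    """Enumerate beliefs reachable from b0 under all action/observation
    sequences up to `horizon` -- a brute-force stand-in for R(b0)."""
    nA, nZ = T.shape[1], O.shape[2]
    frontier, reached = [b0], [b0]
    for _ in range(horizon):
        frontier = [f(b, a, z, T, O)
                    for b in frontier
                    for a, z in itertools.product(range(nA), range(nZ))]
        reached.extend(frontier)
    return np.array(reached)

rng = np.random.default_rng(1)
T = rng.dirichlet(np.ones(3), size=(3, 2))   # T[q, a, q'], rows sum to 1
O = rng.dirichlet(np.ones(2), size=(3, 2))   # O[q', a, z]
R_hat = sample_reachable(np.array([1.0, 0.0, 0.0]), T, O, horizon=4)
print(len(R_hat))  # (|A||Z|)^t growth: why a-priori estimates matter
```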
A policy then induces regions (partitions) in the belief space where an action is applied. We denote the index of these partitions of the belief space as $\alpha \in \Gamma$, where $\Gamma$ is a finite set. Then, a policy for a POMDP can be characterized as
$$\pi(b) = \begin{cases} a_1, & b \in \mathcal{B}_{\alpha_1}, \\ \ \vdots & \\ a_n, & b \in \mathcal{B}_{\alpha_m}, \end{cases} \quad (8)$$
inducing the dynamics
$$b_t = \begin{cases} f_{a_1}(b, z), & b \in \mathcal{B}_{\alpha_1}, \\ \ \vdots & \\ f_{a_n}(b, z), & b \in \mathcal{B}_{\alpha_m}, \end{cases} \quad (9)$$
where $n$ and $m$ are integers. Note that the number of partitions $|\Gamma| = m$ is not necessarily less than or greater than $|A|$. Indeed, the number $|\Gamma| = m$ depends on the method used for synthesizing the policy. For example, in the value iteration method for POMDPs, $|\Gamma| = m$ is a function of the so-called $\alpha$-vectors [32].
Given the above switched system representation, we can consider two classes of problems in POMDPs:
1. Arbitrary-Policy Verification: This case corresponds to analyzing (7) under arbitrary switching, with switching modes given by $a \in A$.
2. Fixed-Policy Verification: This corresponds to analyzing (7) under state-dependent switching. Indeed, the policy $\pi : \mathcal{B} \to A$ determines regions in the belief space where each mode (action) is active.
Both cases of switched systems, with arbitrary switching and with state-dependent switching, are well known in the systems and controls literature (see [33], [34] and references therein). Note that the second problem is well known in the probabilistic planning [35], [36] and model checking [37] communities. For a fixed policy and a finite state space, the problem reduces to solving a linear equation system to compute the set of reachable states in the underlying Markov chain. The next example illustrates the proposed switched system representation for POMDPs with a fixed policy.
Fig. 1: An example of a POMDP with two states and the state-dependent switching modes induced by the policy (10).
Example 1. Consider a POMDP with two states $\{q_1, q_2\}$, two actions $\{a_1, a_2\}$, and observations $Z = \{z_1, z_2\}$. Action $a_1$ corresponds to remaining in the same state, while action $a_2$ corresponds to making a transition to the other state. Observation $z_i$ corresponds to the current state being $q_i$. The policy
$$\pi = \begin{cases} a_1, & b \in \mathcal{B}_1, \\ a_2, & b \in \mathcal{B}_2, \end{cases} \quad (10)$$
leads to different switching modes based on whether the beliefs belong to the region $\mathcal{B}_1 = \{ b \mid 0 \le b(q_1) \le \epsilon \}$ or $\mathcal{B}_2 = \{ b \mid \epsilon < b(q_1) \le 1 \}$ (see Figure 1). That is, the belief update equation (7) is given by
$$b_t = \begin{cases} f_{a_1}(b_{t-1}, z_t), & b \in \mathcal{B}_1, \\ f_{a_2}(b_{t-1}, z_t), & b \in \mathcal{B}_2. \end{cases} \quad (11)$$
Note that the belief space is given by $\mathcal{B} = \mathcal{B}_1 \cup \mathcal{B}_2 = \{ b \mid b(q_1) + b(q_2) = 1 \}$.
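A small simulation of Example 1 is given below. The example fixes only the meaning of the actions and observations, so the specific (slightly noisy) transition and observation probabilities, and the threshold $\epsilon = 0.5$, are our own illustrative choices.

```python
import numpy as np

# Example 1 dynamics: a1 = stay, a2 = switch. We assume slightly noisy
# versions of these transitions and of the observations z_i <-> "current
# state is q_i"; action indices 0 and 1 stand for a1 and a2.
T = np.zeros((2, 2, 2))                  # T[q, a, q']
T[:, 0, :] = [[0.9, 0.1], [0.1, 0.9]]    # a1: mostly remain
T[:, 1, :] = [[0.1, 0.9], [0.9, 0.1]]    # a2: mostly switch
O = np.zeros((2, 2, 2))                  # O[q', a, z]
O[:, :, 0] = [[0.8, 0.8], [0.2, 0.2]]    # z1 likely in q1
O[:, :, 1] = [[0.2, 0.2], [0.8, 0.8]]    # z2 likely in q2

eps = 0.5
policy = lambda b: 0 if b[0] <= eps else 1   # (10): a1 on B1, a2 on B2

def step(b, z):
    a = policy(b)                        # state-dependent switching, as in (11)
    unnorm = O[:, a, z] * (T[:, a, :].T @ b)
    return unnorm / unnorm.sum()

b = np.array([0.5, 0.5])
for z in [0, 0, 1, 0]:                   # an arbitrary observation sequence
    b = step(b, z)
    print(policy(b), b)
```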
V. POMDP REACHABILITY ANALYSIS USING LYAPUNOV FUNCTIONS
Based on the switched system representation discussed in
the previous section, we can use tools from control theory
to analyze the evolution of the solutions of the belief update
equation including the reachable belief space without the need
to solve the POMDP explicitly.
From a control theory standpoint, estimating reachable sets, invariant sets, or regions-of-attraction (ROAs) is a well-studied problem in the literature [38]. Recent results in
this area include learning ROAs using Gaussian processes
and active learning [39], computation of controlled invariant
sets for switching affine systems [40], ROA estimation using
formal language [41], and finding invariant sets based on
barrier functions [42]. In the following, we propose a basic
technique for over-approximating the reachable belief spaces
of POMDPs using Lyapunov functions.
The next theorem uses sub-level sets of Lyapunov functions
to over-approximate the reachable belief space as illustrated in
Figure 2.
Fig. 2: Over-approximation of the reachable belief space using
sub-level sets of a Lyapunov function.
Theorem 1. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (7). If there exists a function $V : \mathcal{B} \to \mathbb{R}$ such that
$$V(f_a(b_{t-1}, z)) - V(b_{t-1}) < 0, \quad \text{for all } a \in A,\ z \in Z,\ b_{t-1} \in \mathcal{V}, \quad (12)$$
and
$$b_0 \in \mathcal{V}, \quad (13)$$
where $\mathcal{V} = \{ b \in \mathcal{B} \mid V(b) \le 1 \}$, then $b_t \in \mathcal{V}$ for all $t \in \mathbb{Z}_{\ge 0}$ and, in particular, $\mathcal{R}(b_0) \subseteq \mathcal{V}$.
Proof. The proof relies on the fact that the belief trajectories point inward at the boundary of $\mathcal{V}$; hence, they remain inside $\mathcal{V}$. We prove by contradiction. Assume, at a time $T \ne 0$, we have $b_T \notin \mathcal{V}$ for some $a \in A$ or $z \in Z$. Thus, $V(b_T) > 1$. Inequality (12) implies
$$V(b_t) - V(b_{t-1}) < 0, \quad \text{for all } a \in A,\ z \in Z,\ b \in \mathcal{V}.$$
Summing up the above expression from time $1$ to $T$ yields
$$\sum_{t=1}^{T} \left( V(b_t) - V(b_{t-1}) \right) < 0, \quad \text{for all } a \in A,\ z \in Z,$$
which simplifies to
$$V(b_T) - V(b_0) < 0, \quad \text{for all } a \in A,\ z \in Z,\ b \in \mathcal{V}.$$
That is, $V(b_T) < V(b_0)$. In addition, from (13), we have $V(b_0) \le 1$. It then follows that $V(b_T) \le 1$, which contradicts the hypothesis $b_T \notin \mathcal{V}$.

At this point, we prove by induction that $\mathcal{R}(b_0) \subseteq \mathcal{V}$. Let $\mathcal{R}_t(b_0)$ denote the reachable belief space at time $t$. At $t = 0$, from (13), we have $b_0 \in \mathcal{V}$. Since $\mathcal{R}_0(b_0) = \{ b_0 \}$, then $\mathcal{R}_0(b_0) \subseteq \mathcal{V}$. At time $t = 1$, from (12), we have
$$V(b_1) - V(b_0) < 0, \quad b_0 \in \mathcal{V},$$
which implies that $V(b_1) < V(b_0)$ for all $b_0 \in \mathcal{V} = \{ b \in \mathcal{B} \mid V(b_0) \le 1 \}$. Hence, $V(b_1) < V(b_0) \le 1$ and $b_1 \in \mathcal{V}$. In addition, $\mathcal{R}_1(b_0) = \{ b_0, b_1 \}$ and, since $b_0, b_1 \in \mathcal{V}$, then $\mathcal{R}_1(b_0) \subseteq \mathcal{V}$. Then, by induction, we have for any $t = \tau$, $\tau > 0$, $b_\tau \in \mathcal{V}$ and $\mathcal{R}_\tau(b_0) \subseteq \mathcal{V}$.
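Conditions (12)-(13) can be probed numerically before attempting a formal certificate: sample beliefs from the sublevel set $\mathcal{V}$ and test the decrease condition for every mode and observation. Such sampling can only falsify a candidate $V$, never prove it; the SOS programs of Section VII provide the actual certificates. A sketch:

```python
import numpy as np

def sample_simplex(n, rng):
    """Uniform sample from the belief simplex B (Dirichlet(1,...,1))."""
    return rng.dirichlet(np.ones(n))

def check_lyapunov(V, f_modes, nZ, b0, n, n_samples=10000, seed=0):
    """Falsification-style check of (12)-(13): V(f_a(b, z)) - V(b) < 0
    for sampled b in the sublevel set {V <= 1}, all modes a and obs z.
    `f_modes` is a list of callables fa(b, z), one per action."""
    if V(b0) > 1.0:                       # condition (13)
        return False
    rng = np.random.default_rng(seed)
    for _ in range(n_samples):
        b = sample_simplex(n, rng)
        if V(b) > 1.0:                    # only test inside V
            continue
        for fa in f_modes:
            for z in range(nZ):
                if V(fa(b, z)) - V(b) >= 0.0:   # violates (12)
                    return False
    return True
```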
Fig. 3: Over-approximation of the reachable belief space using
the sub-level sets of a set of local Lyapunov functions.
In practice, we may have a large number of actions for a
POMDP. Therefore, it may be difficult to find a single Lya-
punov function
V
that satisfies the conditions of Theorem 1.
Next, we show that the computation of the Lyapunov function
can be decomposed into finding a set of local Lyapunov func-
tions for each action. This method is illustrated in Figure 3.
Theorem 2. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (7). If there exist a set of functions $V_a : \mathcal{B} \to \mathbb{R}$, $a \in A$, such that
$$V_a(f_{a'}(b_{t-1}, z)) - V_a(b_{t-1}) < 0, \quad z \in Z,\ b_{t-1} \in \mathcal{V}_a, \quad (14)$$
for all $(a, a') \in A^2$, and
$$b_0 \in \mathcal{V}_a, \quad (15)$$
for all $a \in A$, where $\mathcal{V}_a = \{ b \in \mathcal{B} \mid V_a(b) \le 1 \}$, then $b_t \in \mathcal{V}_{\max} = \{ b \in \mathcal{B} \mid \max_{a \in A} V_a(b) \le 1 \}$ for all $t \in \mathbb{Z}_{\ge 0}$ and, in particular, $\mathcal{R}(b_0) \subseteq \mathcal{V}_{\max}$.
Proof. We show that if (14)-(15) are satisfied, then the Lyapunov function $V = \max_{a \in A} V_a$ satisfies the conditions of Theorem 1. If (15) holds for all $a \in A$, we have $V_a(b_0) \le 1$ for every $a$, which in turn implies that $\max_{a \in A} V_a(b_0) \le 1$ as well. Thus, $V = \max_{a \in A} V_a$ satisfies (13). Furthermore, if (14) is satisfied for all $(a, a') \in A^2$, we have $V_a(f_{a'}(b_{t-1}, z)) < V_a(b_{t-1})$ for $b_{t-1} \in \mathcal{V}_a$, which in turn implies that $\max_{a \in A} V_a(f_{a'}(b_{t-1}, z)) < \max_{a \in A} V_a(b_{t-1})$ for $b_{t-1}$ satisfying $\max_{a \in A} V_a(b_{t-1}) \le 1$ and all $a' \in A$. That is, (12) is satisfied with $V = \max_{a \in A} V_a$. Then, from Theorem 1, we have $b_t \in \mathcal{V}_{\max}$ for all $t \in \mathbb{Z}_{\ge 0}$ and $\mathcal{R}(b_0) \subseteq \mathcal{V}_{\max}$.
If a policy
π
is given, then we can compute a family of
local Lyapunov functions. The union of the 1-sublevel sets of
these functions over-approximates the reachable belief space.
Theorem 3. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$, the policy $\pi : \mathcal{B} \to A$, and an induced partitioning $\{\mathcal{B}_\alpha\}$ such that $\mathcal{B} = \cup_\alpha \mathcal{B}_\alpha$ as given in (8), consider the belief update dynamics (7). If there exists a family of functions $V_\alpha : \mathcal{B}_\alpha \to \mathbb{R}$, $\alpha \in \Gamma$, such that
$$V_\alpha(f_a(b_{t-1}, z)) - V_\alpha(b_{t-1}) < 0, \quad \text{for all } z \in Z,\ b_{t-1} \in \mathcal{V}_\alpha, \quad (16)$$
and
$$b_0 \in \mathcal{V}_\alpha, \quad (17)$$
where $\mathcal{V}_\alpha = \{ b \in \mathcal{B}_\alpha \mid V_\alpha(b) \le 1 \}$, then $b_t \in \cup_\alpha \mathcal{V}_\alpha$ for all $t \in \mathbb{Z}_{\ge 0}$ and, in particular, $\mathcal{R}(b_0) \subseteq \cup_{\alpha \in \Gamma} \mathcal{V}_\alpha$.
Proof. From Theorem 1, we can infer that, for each fixed $a \in A$, we have $\mathcal{R}_a(b_0) \subseteq \mathcal{V}_\alpha$, where $\mathcal{R}_a(b_0)$ is the reachable belief space under action $a$ starting at $b_0$. Since the support of each $V_{\alpha_i}$ is $\mathcal{B}_{\alpha_i}$, $\mathcal{V}_{\alpha_i} \subseteq \mathcal{B}_{\alpha_i}$, and $\cup_{\alpha \in \Gamma} \mathcal{B}_\alpha = \mathcal{B}$, then $\cup_{\alpha \in \Gamma} \mathcal{V}_\alpha = \mathcal{V} \subseteq \mathcal{B}$. Furthermore, since $\mathcal{R}_a(b_0) \subseteq \mathcal{V}_\alpha$, then $\cup_{a \in A} \mathcal{R}_a(b_0) = \mathcal{R}(b_0) \subseteq \cup_{\alpha \in \Gamma} \mathcal{V}_\alpha$.
We remark at the end of this section that the existence of
a policy
π
for a POMDP removes all non-determinism and
partial observability, resulting in an induced Markov chain.
In this case, a method based on Foster-Lyapunov functions was used in [43] for analyzing Markov chains, which may also be adapted to verify POMDPs with a fixed policy.
VI. POMDP SAFETY VERIFICATION USING BARRIER CERTIFICATES
In the following, we show how we can use barrier certifi-
cates to verify properties of the switched systems induced by
a POMDP. We begin by verifying safety for a given POMDP.
Let us define the following unsafe set
$$\mathcal{B}^s_u = \Big\{ b \in \mathcal{B} \;\Big|\; \textstyle\sum_{q \in Q_u} b_\tau(q) > \lambda \Big\}, \quad (18)$$
which is the complement of (4).
Theorem 4. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (7). Given a set of initial beliefs $\mathcal{B}_0 \subset [0, 1]^{|Q|}$, an unsafe set $\mathcal{B}^s_u$ as given in (18) ($\mathcal{B}_0 \cap \mathcal{B}^s_u = \emptyset$), and a constant $\tau \in \mathbb{Z}_{\ge 0}$, if there exists a function $B : \mathbb{Z} \times \mathcal{B} \to \mathbb{R}$, called the barrier certificate, such that
$$B(\tau, b_\tau) > 0, \quad \forall b_\tau \in \mathcal{B}^s_u, \quad (19)$$
$$B(0, b_0) < 0, \quad \forall b_0 \in \mathcal{B}_0, \quad (20)$$
and
$$B(t, f_a(b_{t-1}, z)) - B(t-1, b_{t-1}) \le 0, \quad \forall t \in \{1, 2, \ldots, \tau\},\ \forall a \in A,\ \forall z \in Z,\ \forall b_{t-1} \in \mathcal{B}, \quad (21)$$
then there exists no solution of the belief update equation (7) such that $b_0 \in \mathcal{B}_0$ and $b_\tau \in \mathcal{B}^s_u$, for all $a \in A$.
Proof. The proof is carried out by contradiction. Assume at time instance $\tau$ there exists a solution to (7) such that $b_0 \in \mathcal{B}_0$ and $b_\tau \in \mathcal{B}^s_u$. From inequality (21), we have $B(t, b_t) \le B(t-1, b_{t-1})$ for all $t \in \{1, 2, \ldots, \tau\}$ and all actions $a \in A$. Hence, $B(t, b_t) \le B(0, b_0)$ for all $t \in \{1, 2, \ldots, \tau\}$. Furthermore, inequality (20) implies that $B(0, b_0) < 0$ for all $b_0 \in \mathcal{B}_0$. Since the choice of $\tau$ can be arbitrary, this leads to a contradiction: on the one hand, $B(\tau, b_\tau) \le B(0, b_0) < 0$; on the other hand, (19) requires $B(\tau, b_\tau) > 0$ for $b_\tau \in \mathcal{B}^s_u$. Therefore, there exists no solution of (7) such that $b_0 \in \mathcal{B}_0$ and $b_\tau \in \mathcal{B}^s_u$ for any sequence of actions $a \in A$, and the safety requirement is satisfied.
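As with the Lyapunov conditions, (19)-(21) can be probed by sampling before running an SOS program; again this can only falsify a candidate $B$, never certify it. A sketch, where `in_unsafe` encodes membership in $\mathcal{B}^s_u$ and `belief_samples` is any finite stand-in for $\mathcal{B}$:

```python
def check_barrier(B, f_modes, nZ, b0_set, in_unsafe, tau, belief_samples):
    """Falsification-style check of (19)-(21) for a candidate certificate
    B(t, b).  `f_modes` is a list of callables fa(b, z), one per action."""
    # (19): B(tau, .) > 0 on the unsafe set
    if any(B(tau, b) <= 0 for b in belief_samples if in_unsafe(b)):
        return False
    # (20): B(0, .) < 0 on the initial beliefs
    if any(B(0, b0) >= 0 for b0 in b0_set):
        return False
    # (21): B non-increasing along every mode and observation
    for t in range(1, tau + 1):
        for b in belief_samples:
            for fa in f_modes:
                for z in range(nZ):
                    if B(t, fa(b, z)) - B(t - 1, b) > 0:
                        return False
    return True
```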
The above theorem provides conditions under which the POMDP is guaranteed to be safe. The next result brings forward a set of conditions verifying whether the optimality criterion (6) is satisfied. This is simply carried out by making the unsafe set time-dependent (a tube).
Corollary 1. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (7) and the optimality criterion $\gamma$ as given by (6). Let $\tilde{\gamma} : \mathbb{Z}_{\ge 0} \to \mathbb{R}$ satisfy
$$\sum_{s=0}^{\tau} \tilde{\gamma}(s) \le \gamma. \quad (22)$$
Given a set of initial beliefs $\mathcal{B}_0 \subset \mathcal{B}$, an unsafe set
$$\mathcal{B}^o_u = \{ (t, b) \mid r(b_t, a_t) > \tilde{\gamma}(t) \}, \quad (23)$$
and a constant $\tau \in \mathbb{Z}_{\ge 0}$, if there exists a function $B : \mathbb{Z} \times \mathcal{B} \to \mathbb{R}$ such that (19)-(21) are satisfied with $\mathcal{B}^o_u$ instead of $\mathcal{B}^s_u$, then for all $b_0 \in \mathcal{B}_0$ the optimality criterion (6) holds.
Proof. The proof is straightforward and an application of Theorem 4. If conditions (19)-(21) are satisfied with $\mathcal{B}^o_u$ instead of $\mathcal{B}^s_u$, then, based on Theorem 4, we conclude that there exists no solution of the belief update equation (7) such that $b_0 \in \mathcal{B}_0$ and $b_\tau \in \mathcal{B}^o_u$, for all $a \in A$. Therefore, we have
$$r(b_t, a_t) \le \tilde{\gamma}(t), \quad \forall t \in \{0, 1, \ldots, \tau\}.$$
Summing up both sides of the above inequality from $t = 0$ to $t = \tau$ yields
$$\sum_{s=0}^{\tau} r(b_s, a_s) \le \sum_{s=0}^{\tau} \tilde{\gamma}(s).$$
Then, from (22), we conclude that $\sum_{s=0}^{\tau} r(b_s, a_s) \le \gamma$.
The technique used in Corollary 1 is analogous to the one
used in [44], [45] for bounding (time-averaged) functional
outputs of systems described by partial differential equations.
The method proposed here, however, can be used for a large class of discrete-time systems, and the belief update equation is the special case of interest to us.
In practice, we may have a large number of actions. Then,
finding a barrier certificate that satisfies the conditions of
Theorem 4 becomes computationally prohibitive. In the next
result, we show how the calculation of the barrier certificate
can be decomposed into finding a set of barrier certificates for
each action and then taking the convex hull of them.
Theorem 5. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (7). Given a safety constraint as described in (4) with a safety requirement $\lambda$ and a point in time $\tau \in \mathbb{Z}_{\ge 0}$, if there exists a set of functions $B_a : \mathbb{Z} \times \mathcal{B} \to \mathbb{R}$, $a \in A$, such that
$$B_a(\tau, b_\tau) > 0, \quad \forall b_\tau \in \mathcal{B}^s_u,\ \forall a \in A, \quad (24)$$
with $\mathcal{B}^s_u$ as described in (18),
$$B_a(0, b_0) < 0, \quad \text{for } b_0 = p_0,\ \forall a \in A, \quad (25)$$
and
$$B_a(t, f_{a'}(b_{t-1}, z)) - B_a(t-1, b_{t-1}) \le 0, \quad \forall t \in \{1, 2, \ldots, \tau\},\ \forall (a, a') \in A^2,\ \forall z \in Z,\ \forall b_{t-1} \in \mathcal{B}, \quad (26)$$
then the safety requirement $\lambda$ is satisfied, i.e., inequality (4) holds. Furthermore, the overall barrier certificate is given by $B = co\{B_a\}_{a \in A}$.
Proof. We show that if (24)-(26) are satisfied, then the barrier certificate $B = co\{B_a\}_{a \in A}$ satisfies the conditions of Theorem 4. For each $a \in A$, we multiply both sides of (24) with a constant $\alpha_a \ge 0$ such that $\sum_{a \in A} \alpha_a = 1$. We obtain
$$\sum_{a \in A} \alpha_a B_a(\tau, b_\tau) > 0, \quad \forall b_\tau \in \mathcal{B}^s_u,$$
which implies that
$$B(\tau, b_\tau) = co\{B_a(\tau, b_\tau)\}_{a \in A} > 0, \quad \forall b_\tau \in \mathcal{B}^s_u.$$
Therefore, (19) is satisfied with $B = co\{B_a\}_{a \in A}$. Similarly, we can show that if (25) is satisfied, $B = co\{B_a\}_{a \in A}$ satisfies (20). Multiplying both sides of (26) with constants $\alpha_a$ such that $\sum_{a \in A} \alpha_a = 1$ and summing over them yields
$$\sum_{a \in A} \alpha_a \left( B_a(t, f_{a'}(b_{t-1}, z)) - B_a(t-1, b_{t-1}) \right) = \sum_{a \in A} \alpha_a B_a(t, f_{a'}(b_{t-1}, z)) - \sum_{a \in A} \alpha_a B_a(t-1, b_{t-1}) \le 0,$$
for all $t \in \{1, 2, \ldots, \tau\}$, $z \in Z$, $a' \in A$, and $b \in \mathcal{B}$, which implies that (21) is satisfied for $B = co\{B_a\}_{a \in A}$. Therefore, from Theorem 4, we conclude that the safety requirement $\lambda$ is satisfied.
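The convex-hull construction in the proof is easy to operationalize: once per-action certificates $B_a$ are available (e.g., from the SOS programs of Section VII), any weights $\alpha_a$ on the unit simplex yield a valid overall certificate. A sketch, with dummy $B_{a_1}$, $B_{a_2}$ used purely for illustration:

```python
import numpy as np

def convex_hull_certificate(B_locals, weights):
    """Form B = sum_a alpha_a B_a with alpha on the unit simplex, i.e. the
    convex-combination certificate used in the proof of Theorem 5."""
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and abs(weights.sum() - 1.0) < 1e-12
    def B(t, b):
        return sum(w * Ba(t, b) for w, Ba in zip(weights, B_locals))
    return B

# Two stand-in per-action certificates (not derived from any real POMDP)
B_a1 = lambda t, b: b[0] - 0.5
B_a2 = lambda t, b: 2 * b[0] - 1.0
B = convex_hull_certificate([B_a1, B_a2], [0.3, 0.7])
print(B(0, np.array([0.25, 0.75])))   # 0.3*(-0.25) + 0.7*(-0.5) = -0.425
```

The weights can then be tuned, say, for numerical conditioning, without affecting validity.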
The efficacy of the above result is that we can search for each action-based barrier certificate $B_a$, $a \in A$, independently or in parallel, and then verify whether the overall POMDP satisfies a pre-specified safety requirement (see Fig. 4 for an illustration).
Next, we demonstrate that, if a policy is given, the search for the barrier certificate can be decomposed into the search for a set of local barrier certificates. We denote by $a_i$ the action active in the partition $\mathcal{B}_i$.
Theorem 6. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (7). Given a safety constraint as described in (4) with a safety requirement $\lambda$, a point in time $\tau \in \mathbb{Z}_{\ge 0}$, and a policy $\pi : \mathcal{B} \to A$ as described in (8), if there exists a set of functions $B_i : \mathbb{Z} \times \mathcal{B}_i \to \mathbb{R}$, $i \in \{1, 2, \ldots, m\}$, such that
$$B_i(\tau, b_\tau) > 0, \quad \forall b_\tau \in \mathcal{B}^s_u \cap \mathcal{B}_i,\ i \in \{1, 2, \ldots, m\}, \quad (27)$$
with $\mathcal{B}^s_u$ as described in (18),
$$B_i(0, b_0) < 0, \quad \text{for } b_0 = p_0,\ i \in \{1, 2, \ldots, m\}, \quad (28)$$
and
$$B_i(t, f_{a_i}(b_{t-1}, z)) - B_i(t-1, b_{t-1}) \le 0, \quad \forall t \in \{1, 2, \ldots, \tau\},\ \forall z \in Z,\ \forall b_{t-1} \in \mathcal{B}_i,\ i \in \{1, 2, \ldots, m\}, \quad (29)$$
then the safety requirement $\lambda$ is satisfied, i.e., inequality (4) holds. Furthermore, the overall barrier certificate is given by $B = co\{B_i\}_{i=1}^{m}$.
Proof. We demonstrate that if (27)-(29) are satisfied, then the barrier certificate $B = co\{B_i\}_{i=1}^{m}$ satisfies the conditions of Theorem 4. For each $i \in \{1, 2, \ldots, m\}$, we multiply both sides of (27) with a constant $\alpha_i \ge 0$ such that $\sum_{i=1}^{m} \alpha_i = 1$. We obtain
$$\sum_{i=1}^{m} \alpha_i B_i(\tau, b_\tau) > 0, \quad \forall b_\tau \in \cup_{i=1}^{m} \left( \mathcal{B}^s_u \cap \mathcal{B}_i \right).$$
Since the support of each $B_i$ is $\mathcal{B}_i$, $\cup_{i=1}^{m} \mathcal{B}_i = \mathcal{B}$, and $\mathcal{B}^s_u \subset \mathcal{B}$, we have $\cup_{i=1}^{m} (\mathcal{B}^s_u \cap \mathcal{B}_i) = \mathcal{B}^s_u \cap \mathcal{B} = \mathcal{B}^s_u$. Hence,
$$B(\tau, b_\tau) = co\{B_i(\tau, b_\tau)\}_{i=1}^{m} > 0, \quad \forall b_\tau \in \mathcal{B}^s_u.$$
Therefore, (19) is satisfied with $B = co\{B_i\}_{i=1}^{m}$. Similarly, we can show that if (28) is satisfied, $B = co\{B_i\}_{i=1}^{m}$ satisfies (20). Multiplying both sides of (29) with constants $\alpha_i$ such that $\sum_{i=1}^{m} \alpha_i = 1$ and summing over them gives
$$\sum_{i=1}^{m} \alpha_i \left( B_i(t, f_{a_i}(b_{t-1}, z)) - B_i(t-1, b_{t-1}) \right) = \sum_{i=1}^{m} \alpha_i B_i(t, f_{a_i}(b_{t-1}, z)) - \sum_{i=1}^{m} \alpha_i B_i(t-1, b_{t-1}) \le 0,$$
for all $t \in \{1, 2, \ldots, \tau\}$, $z \in Z$, and $b \in \mathcal{B}$, which implies that (21) is satisfied for $B = co\{B_i\}_{i=1}^{m}$. Therefore, from Theorem 4, we conclude that the safety requirement $\lambda$ is satisfied.
We proposed two techniques for decomposing the con-
struction of the barrier certificates and checking given safety
requirements. Our method relied on barrier certificates that
take the form of the convex hull of a set of local barrier cer-
tificates (see similar results in [46], [47]). Though the convex
hull barrier certificate may introduce a level of conservatism,
it is computationally easier to find (as will be discussed in
more detail in the next section).
VII. COMPUTATIONAL METHOD BASED ON SOS PROGRAMMING
In this section, we present a set of SOS programs that can be
used to find the Lyapunov functions used for reachability anal-
ysis and the barrier certificates used for safety and optimality
verification. For a brief introduction to SOS programming and
some related results, see Appendix A.
In order to formulate SOS programs, we need the problem data to be polynomial or rational.

Fig. 4: Decomposing the barrier certificate computation for a POMDP with two actions $a_1$ and $a_2$: the zero-level sets of $B_{a_1}$ and $B_{a_2}$ at trial $\tau$ separate the evolutions of the hypothesis beliefs starting at $b_0$ from $\mathcal{B}^o_u$. The green line illustrates the zero-level set of the barrier certificate formed by taking the convex hull of $B_{a_1}$ and $B_{a_2}$.

Fortunately, the belief update equation (1) is a rational function in the belief states $b_t(q)$, $q \in Q$:
$$b_t(q') = \frac{M_a(b_{t-1}(q), z)}{N_a(b_{t-1}(q), z)} = \frac{O(q', a_{t-1}, z_t) \sum_{q \in Q} T(q, a_{t-1}, q')\, b_{t-1}(q)}{\sum_{q' \in Q} O(q', a_{t-1}, z_t) \sum_{q \in Q} T(q, a_{t-1}, q')\, b_{t-1}(q)}, \quad (30)$$
where $M_a$ and $N_a$ are linear and positive functions of $b$.
Moreover, the safety and optimality objectives described by (4) and (6), respectively, and the belief simplex are all semi-algebraic sets, since they can be described by polynomial inequalities/equalities.
Throughout this section, we also assume that the Lyapunov
functions and the barrier certificates are parametrized as poly-
nomial functions of
b
. Hence, the level sets of those functions
are semi-algebraic sets, as well.
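To make the SOS machinery concrete: a polynomial is a sum of squares if it equals $m(b)^\top Q\, m(b)$ for some positive semidefinite Gram matrix $Q$ over a monomial basis $m(b)$. For a quadratic form the Gram matrix is unique, so a single eigenvalue check decides membership in $\Sigma[b]$; for higher degrees, $Q$ is not unique and one searches over it with a semidefinite program, which is what SOS solvers automate. A minimal numerical sketch for a fixed quadratic:

```python
import numpy as np

# p(b1, b2) = 2 b1^2 - 2 b1 b2 + 2 b2^2 over the monomial basis m = (b1, b2):
# p = m^T Q m with the Gram matrix Q below; p is SOS iff such a Q is PSD.
Q = np.array([[2.0, -1.0],
              [-1.0, 2.0]])
eigs = np.linalg.eigvalsh(Q)
print("SOS certificate found:", np.all(eigs >= 0), "eigenvalues:", eigs)
# Here eigs = [1, 3] >= 0, so p = (b1 - b2)^2 + b1^2 + b2^2 is SOS.
```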
At this stage, we are ready to present conditions based on
SOS programs to find a Lyapunov function for POMDPs and
thus over-approximate the reachable belief space.
Corollary 2. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (30). If there exist a function $V \in \mathcal{P}[b]$ of degree $d$, a multiplier $p \in \Sigma[b]$, and a constant $c > 0$ such that
$$-N^d_a(b_{t-1}, z)\left( V\!\left( \frac{M_a(b_{t-1}, z)}{N_a(b_{t-1}, z)} \right) - V(b_{t-1}) \right) - p(b_{t-1})\left( 1 - V(b_{t-1}) \right) - c \in \Sigma[b_{t-1}], \quad (31)$$
for all $a \in A$ and $z \in Z$, and
$$1 - V(b_0) \ge 0, \quad (32)$$
then $b_t \in \mathcal{V} = \{ b \in \mathcal{B} \mid V(b) \le 1 \}$ for all $t \in \mathbb{Z}_{\ge 0}$ and, in particular, $\mathcal{R}(b_0) \subseteq \mathcal{V}$.
Proof. We show that if (31) and (32) hold, then (12) and (13) are, respectively, satisfied. Condition (32) implies that (13) holds. Furthermore, condition (12) for system (30) can be rewritten as
$$-\left( V\!\left( \frac{M_a(b_{t-1}, z)}{N_a(b_{t-1}, z)} \right) - V(b_{t-1}) \right) > 0, \quad \forall a \in A,\ \forall z \in Z,\ \forall b_{t-1} \in \mathcal{V}.$$
Since $\mathcal{V} = \{ b \mid 1 - V(b) \ge 0 \}$ is a semi-algebraic set, we use Propositions 1 and 2 in Appendix A to obtain
$$-\left( V\!\left( \frac{M_a(b_{t-1}, z)}{N_a(b_{t-1}, z)} \right) - V(b_{t-1}) \right) - p(b_{t-1})\left( 1 - V(b_{t-1}) \right) - c \ge 0, \quad \forall a \in A,\ \forall z \in Z,$$
for $p \in \Sigma[b]$ and $c > 0$. Given that $N_a(b_{t-1}, z)$ is a positive polynomial of degree one, we can relax the above inequality into a sum-of-squares condition given by
$$-N^d_a(b_{t-1}, z)\left( V\!\left( \frac{M_a(b_{t-1}, z)}{N_a(b_{t-1}, z)} \right) - V(b_{t-1}) \right) - p(b_{t-1})\left( 1 - V(b_{t-1}) \right) - c \in \Sigma[b_{t-1}], \quad \forall a \in A,\ \forall z \in Z.$$
Hence, if (31) holds, then (12) is satisfied, as well. From Theorem 1, we infer that $b_t \in \mathcal{V} = \{ b \in \mathcal{B} \mid V(b) \le 1 \}$ for all $t \in \mathbb{Z}_{\ge 0}$, and $\mathcal{R}(b_0) \subseteq \mathcal{V}$.
The set $\mathcal{V}$ provides an over-approximation of the reachable belief space $\mathcal{R}(b_0)$. Indeed, we can tighten the over-approximation by solving the following optimization problem:
$$\begin{aligned} \min_{V \in \mathcal{P}[b],\ \gamma > 0,\ p \in \Sigma[b],\ c > 0} \quad & \gamma \\ \text{subject to} \quad & V(b_0) - \gamma \le 0, \\ & -V(f_a(b, z)) + V(b) - p(b)\left( \gamma - V(b) \right) - c \in \Sigma[b], \end{aligned} \quad (33)$$
for all $a \in A$ and $z \in Z$. The above optimization problem is bilinear in the variables $V$, $\gamma$, and $p$. However, if $p$ is fixed, the problem becomes convex. Similarly, when $V$ and $\gamma$ are fixed, the optimization problem is convex in $p$. Optimization problems similar to (33) have been proposed in the literature for estimating the region-of-attraction of polynomial (continuous) systems [48].
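Before (or alongside) solving the bilinear problem (33), a cheap sampling surrogate can be useful: simulate belief evolutions and record the largest value of $V$ encountered. This only lower-bounds the $\gamma$ achievable by (33) and certifies nothing, but it quickly exposes a hopeless candidate $V$. A sketch:

```python
def gamma_lower_bound(V, f_modes, nZ, b0, horizon):
    """Sampling surrogate for (33): any feasible gamma must dominate V along
    every belief evolution from b0, so the running maximum of V over
    enumerated trajectories lower-bounds the optimal value of (33).
    `f_modes` is a list of callables fa(b, z), one per action."""
    frontier, gamma = [b0], V(b0)
    for _ in range(horizon):
        frontier = [fa(b, z) for b in frontier
                    for fa in f_modes for z in range(nZ)]
        gamma = max(gamma, max(V(b) for b in frontier))
    return gamma
```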
We can computationally implement Theorem 2 by solving the $|A|$ optimization problems
$$\begin{aligned} \min_{V_a \in \mathcal{P}[b],\ \gamma_a > 0,\ p_a \in \Sigma[b],\ c_a > 0} \quad & \gamma_a \\ \text{subject to} \quad & V_a(b_0) - \gamma_a \le 0, \\ & -V_a(f_{a'}(b, z)) + V_a(b) - p_a(b)\left( \gamma_a - V_a(b) \right) - c_a \in \Sigma[b], \end{aligned} \quad (34)$$
for all $z \in Z$ and $a' \in A$, and then computing the level set $\{ b \mid \max_{a \in A} \left( V_a(b) - \gamma_a \right) \le 0 \}$.
Furthermore, we can implement Theorem 3 by solving the $|\Gamma|$ optimization problems
$$\begin{aligned} \min_{V_\alpha \in \mathcal{P}[b],\ \gamma_\alpha > 0,\ R_\alpha \in \Sigma[b]} \quad & \gamma_\alpha \\ \text{subject to} \quad & V_\alpha(b_0) - \gamma_\alpha \le 0, \\ & -V_\alpha(f_a(b, z)) + V_\alpha(b) - R_\alpha(b)\left( \gamma_\alpha - V_\alpha(b) \right) \in \Sigma[b], \end{aligned} \quad (35)$$
for all $z \in Z$, in parallel, and then computing $\cup_{\alpha \in \Gamma} \{ b \mid V_\alpha(b) \le \gamma_\alpha \}$.
Next, we present SOS programs for finding the barrier certificates and checking safety and optimality. We begin with an SOS formulation for Theorem 4.
Corollary 3. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (30). Given a safety constraint as described in (4) with a safety requirement $\lambda$ and a point in time $\tau \in \mathbb{Z}_{\ge 0}$, if there exist a polynomial function $B \in \mathcal{P}[t, b]$ of degree $d$, a multiplier $p_f \in \Sigma[b]$, and constants $s_1, s_2 > 0$ such that
$$B(\tau, b_\tau) - p_f(b_\tau)\left( \sum_{q \in Q_u} b_\tau(q) - \lambda \right) - s_1 \in \Sigma[b_\tau], \quad (36)$$
$$-B(0, b_0) - s_2 > 0, \quad (37)$$
and
$$-N_a(b_{t-1}, z)^d \left( B\!\left( t, \frac{M_a(b_{t-1}, z)}{N_a(b_{t-1}, z)} \right) - B(t-1, b_{t-1}) \right) \in \Sigma[t, b_{t-1}], \quad \forall t \in \{1, 2, \ldots, \tau\},\ \forall z \in Z,\ \forall a \in A, \quad (38)$$
then there exists no solution of (7) such that $b_\tau \in \mathcal{B}^s_u$ and, hence, the safety requirement is satisfied.
Proof. SOS conditions (36) and (37) are a direct consequence of applying Propositions 1 and 2 in Appendix A to verify conditions (19) and (20), respectively. Furthermore, condition (21) for system (30) can be rewritten as
$$-\left( B\!\left( t, \frac{M_a(b_{t-1}, z)}{N_a(b_{t-1}, z)} \right) - B(t-1, b_{t-1}) \right) \ge 0, \quad \forall a \in A,\ \forall z \in Z.$$
Given that $N_a(b_{t-1}, z)$ is a positive polynomial of degree one, we can relax the above inequality into an SOS condition given by
$$-N_a(b_{t-1}, z)^d \left( B\!\left( t, \frac{M_a(b_{t-1}, z)}{N_a(b_{t-1}, z)} \right) - B(t-1, b_{t-1}) \right) \in \Sigma[t, b_{t-1}].$$
Hence, if (38) holds, then (21) is satisfied as well. From Theorem 4, we infer that there is no belief evolution satisfying $\sum_{q \in Q_u} b_\tau(q) > \lambda$ at time $\tau$. Equivalently, the safety requirement is satisfied at time $\tau$; that is, $\sum_{q \in Q_u} b_\tau(q) \le \lambda$.
Similarly, we can formulate SOS feasibility conditions for
checking the inequalities in Theorem 5.
Corollary 4. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (30). Given a safety constraint as described in (4) with a safety requirement $\lambda$ and a point in time $\tau \in \mathbb{Z}_{\ge 0}$, if there exist polynomial functions $B_a \in \mathcal{P}[t, b]$, $a \in A$, of degree $d$, multipliers $p^a_f \in \Sigma[b]$, $a \in A$, and constants $s^a_1, s^a_2 > 0$, $a \in A$, such that
$$B_a(\tau, b_\tau) - p^a_f(b_\tau)\left( \sum_{q \in Q_u} b_\tau(q) - \lambda \right) - s^a_1 \in \Sigma[b_\tau], \quad \forall a \in A, \quad (39)$$
$$-B_a(0, b_0) - s^a_2 > 0, \quad \forall a \in A, \quad (40)$$
and
$$-N_{a'}(b_{t-1}, z)^d \left( B_a\!\left( t, \frac{M_{a'}(b_{t-1}, z)}{N_{a'}(b_{t-1}, z)} \right) - B_a(t-1, b_{t-1}) \right) \in \Sigma[t, b_{t-1}], \quad \forall t \in \{1, 2, \ldots, \tau\},\ \forall z \in Z,\ \forall (a, a') \in A^2, \quad (41)$$
then there exists no solution of (30) such that $b_\tau \in \mathcal{B}^s_u$ and, hence, the safety requirement is satisfied.
We assume that a policy in the form of (8) assigns actions to semi-algebraic partitions of the belief space $\mathcal{B}$, described as
$$\mathcal{B}_i = \{ b \in \mathcal{B} \mid g_i(b) \ge 0 \}, \quad i \in \{1, 2, \ldots, m\}. \quad (42)$$
We then have the following SOS formulation for Theorem 6
using Positivstellensatz (Proposition 1 in Appendix A).
Corollary 5. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (30). Given a safety constraint as described in (4) with a safety requirement $\lambda$, a point in time $\tau \in \mathbb{Z}_{\ge 0}$, and a policy $\pi : \mathcal{B} \to A$ as described in (8), if there exist polynomial functions $B_i \in \mathcal{P}[t, b]$, $i \in \{1, 2, \ldots, m\}$, of degree $d$, multipliers $p^i_{l_1}, p^i_{l_2}, p^i_{l_3}, p^i_f \in \Sigma[b]$, $i \in \{1, 2, \ldots, m\}$, and constants $s^i_1, s^i_2 > 0$, $i \in \{1, 2, \ldots, m\}$, such that
$$B_i(\tau, b_\tau) - p^i_f(b_\tau)\left( \sum_{q \in Q_u} b_\tau(q) - \lambda \right) - p^i_{l_1}(b_\tau)\, g_i(b_\tau) - s^i_1 \in \Sigma[b_\tau], \quad i \in \{1, 2, \ldots, m\}, \quad (43)$$
$$-B_i(0, b_0) - p^i_{l_2}(p_0)\, g_i(p_0) - s^i_2 > 0, \quad i \in \{1, 2, \ldots, m\}, \quad (44)$$
and
$$-N_{a_i}(b_{t-1}, z)^d \left( B_i\!\left( t, \frac{M_{a_i}(b_{t-1}, z)}{N_{a_i}(b_{t-1}, z)} \right) - B_i(t-1, b_{t-1}) \right) - p^i_{l_3}(b_{t-1})\, g_i(b_{t-1}) \in \Sigma[t, b_{t-1}], \quad \forall t \in \{1, 2, \ldots, \tau\},\ \forall z \in Z,\ i \in \{1, 2, \ldots, m\}, \quad (45)$$
then there exists no solution of (30) such that $b_\tau \in \mathcal{B}^s_u$ and, hence, the safety requirement is satisfied.
Checking whether optimality holds can also be cast into sum-of-squares programs. To this end, we assume the reward function is a polynomial (or can be approximated by a polynomial¹) in the beliefs, i.e., $R \in \mathcal{P}[b]$.
The following corollary can be derived using arguments similar to those in the proof of Corollary 3.

Corollary 6. Given the POMDP $\mathcal{P} = (Q, b_0, A, T, Z, O)$ as in Definition 1, consider the belief update dynamics (30). Given a constant $\tau \in \mathbb{Z}_{\ge 0}$ and the optimality requirement (6), if there exist polynomial functions $\tilde{\gamma} \in \mathcal{P}[t]$ characterizing the unsafe set (23), $B \in \mathcal{P}[t, b]$ of degree $d$, $p^q_u \in \Sigma[b]$, $q \in Q_u$, $p^i_0 \in \Sigma[b]$, $i = 1, 2, \ldots, n_0$, and constants $s_1, s_2 > 0$ such that (22) and (36)-(38) are satisfied, then the optimality criterion (6) holds.

¹This assumption is realistic, since the beliefs belong to a bounded set (a unit simplex) and, by the Stone-Weierstrass theorem, any continuous function defined on a bounded domain can be uniformly approximated arbitrarily closely by a polynomial [49].
VIII. CASE STUDIES
In this section, we illustrate the proposed method using two examples. In the first example, we study a toy ad scheduling scenario in social media using a POMDP model. We illustrate how the over-approximation method based on local Lyapunov functions introduced in Section V can be used to predict the outcome of an ad scheduling policy. In the second example, we compare the teaching performance of two recently proposed machine teaching algorithms (myopic and Ada-L) [50] and show that the safety verification method presented in Section VI can be used to show that myopic teaching has poor performance while Ada-L has guaranteed convergence.
A. Interactive Advertisement Scheduling
In personalized live social media, a website’s revenue de-
pends on user engagement, which is closely related to the
user’s interest to the streamed online contents [51]. User
interest evolves probabilistically with the online contents and
the ads inserted [52]. Furthermore, user interest to online
content is not directly observable but can only be inferred from
the number of views or the number of likes during a given
time interval [53]. The broadcaster’s objective is to estimate
user interest and schedule relevant ads from time to time to
maintain high user interest and engagement.
Such interactive ad scheduling can be represented as a sequential decision-making problem under partial observability [51], where a POMDP $\mathcal{P} = (Q, p_0, A, T, Z, O)$ can describe how the user's interest evolves. Each state $q \in Q$ represents a different level of interest, and $p_0$ denotes the initial interest distribution. The actions available to the broadcaster are to insert different ads or continue with the live stream. The transition $T(q, a, q')$ describes how a user's interest may evolve depending on the ads inserted, or on the content of the live stream when no ad is scheduled. The observation set $Z$ consists of the number of likes in a given time interval, which can be abstracted into a finite number of categories. The observation function $O(q, z)$ denotes the probability of observing $z$ likes when the user's interest level is at $q$. The transition and observation functions can be obtained from data [51].
We consider a concrete POMDP model $\mathcal{P}$ where the topology of the underlying MDP is as shown in Figure 5. The user interest has three levels: low, medium, and high. The initial probability distribution of the user interest is $p_0(q_1) = 1$, i.e., the user initially has low interest. The actions available are $A = \{a_0, a_1\}$, where $a_0$ denotes no ads and $a_1$ denotes scheduling ads. The transition probabilities $T$ with respect to each action are as shown in (46).
$$T_{a_0} = \begin{bmatrix} 0.8 & 0.2 & 0.1 \\ 0.1 & 0.7 & 0.2 \\ 0.1 & 0.1 & 0.7 \end{bmatrix}, \qquad T_{a_1} = \begin{bmatrix} 0.5 & 0.3 & 0.2 \\ 0.3 & 0.6 & 0.2 \\ 0.2 & 0.1 & 0.6 \end{bmatrix}. \quad (46)$$
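As a quick sanity check, the NumPy sketch below encodes (46) and runs a few belief updates. Note that, as printed in (46), each column of $T_{a_0}$ and $T_{a_1}$ sums to one, so we treat the matrices as column-stochastic. The two-bin observation model is purely illustrative; the Poisson-based model of [51] discussed next would replace this stand-in.

```python
import numpy as np

# Transition matrices from (46); column q holds the distribution of the
# next interest level given current level q (columns sum to one).
T_a0 = np.array([[0.8, 0.2, 0.1],
                 [0.1, 0.7, 0.2],
                 [0.1, 0.1, 0.7]])
T_a1 = np.array([[0.5, 0.3, 0.2],
                 [0.3, 0.6, 0.2],
                 [0.2, 0.1, 0.6]])
assert np.allclose(T_a0.sum(axis=0), 1) and np.allclose(T_a1.sum(axis=0), 1)

b0 = np.array([1.0, 0.0, 0.0])        # p0(q1) = 1: low interest initially

# Illustrative two-bin observation model ("few" vs. "many" likes)
O = np.array([[0.8, 0.2],             # O[q, z]
              [0.5, 0.5],
              [0.2, 0.8]])

def update(b, T, z):
    unnorm = O[:, z] * (T @ b)        # predict with column-stochastic T
    return unnorm / unnorm.sum()

b = b0
for (T, z) in [(T_a1, 1), (T_a0, 1), (T_a0, 0)]:  # ad, no-ad, no-ad
    b = update(b, T, z)
print(b, b.sum())
```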
The observations are the number of likes in a given time interval and can be represented by a Poisson process [51]. To have