Entropy-Regularized Stochastic Games
Yagiz Savas, Mohamadreza Ahmadi, Takashi Tanaka, Ufuk Topcu
Abstract— In two-player zero-sum stochastic games, where two competing players make decisions under uncertainty, a pair of optimal strategies is traditionally described by Nash equilibrium and computed under the assumption that the players have perfect information about the stochastic transition model of the environment. However, implementing such strategies may make the players vulnerable to unforeseen changes in the environment. In this paper, we introduce entropy-regularized stochastic games where each player aims to maximize the causal entropy of its strategy in addition to its expected payoff. The regularization term balances each player's rationality with its belief about the level of misinformation about the transition model. We consider both entropy-regularized $N$-stage and entropy-regularized discounted stochastic games, and establish the existence of a value in both games. Moreover, we prove the sufficiency of Markovian and stationary mixed strategies to attain the value, respectively, in $N$-stage and discounted games. Finally, we present algorithms, which are based on convex optimization problems, to compute the optimal strategies. In a numerical example, we demonstrate the proposed method on a motion planning scenario and illustrate the effect of the regularization term on the expected payoff.
I. INTRODUCTION
A two-player zero-sum stochastic game (SG) [1] models
sequential decision-making of two players with opposing
objectives in a stochastic environment. An SG is played in
stages. At each stage, the game is in a state, and the players
choose one of their available actions simultaneously and
receive payoffs. The game then transitions to a new random
state according to a probability distribution which represents
the stochasticity in the environment.
In an SG, each player aims to synthesize a strategy that
maximizes the player’s expected payoff at the end of the
game. Traditionally, a pair of optimal strategies is described
by Nash equilibrium [2] according to which both players play
their best-response strategies against the opponent’s strategy.
The value of the game then corresponds to the expected
payoff that each player receives at the end of the game, if
they both play their respective equilibrium strategies.
The concept of Nash equilibrium is based on the as-
sumptions that the players have perfect information about
the environment and act rationally [3]. However, in certain
scenarios, the information that a player has about the environ-
ment may not match the reality. For example, in a planning
scenario, if the player obtains its information about the
environment through surveillance missions performed in the
past, it may face a significantly different environment during the execution of the play. In such scenarios, playing an equilibrium strategy may dramatically decrease the player's actual expected payoff as the strategy is computed under the assumption of perfect information.

Y. Savas, T. Tanaka and U. Topcu are with the Department of Aerospace Engineering, University of Texas at Austin, TX, USA. E-mail: {yagiz.savas, ttanaka, utopcu}@utexas.edu
M. Ahmadi is with the Center for Autonomous Systems and Technologies, California Institute of Technology, CA, USA. E-mail: mrahmadi@caltech.edu
The principle of maximum entropy prescribes a probability
distribution that is “maximally noncommittal with regard to
missing information” [4]. The principle of maximum causal
entropy extends the maximum entropy principle to settings
where there is dynamically revealed side information that
causally affects the evolution of a stochastic process [5],
[6]. A distribution that maximizes the causal entropy of a
stochastic process (in the absence of additional constraints)
is the one that makes all admissible realizations equally prob-
able regardless of the revealed information [7]. Therefore, the
causal entropy of a player's strategy provides a convenient way to quantify the dependence of its strategy on its level of information about the environment, as well as on the other player's strategy.
In this paper, we propose a method to synthesize a pair
of strategies that balances each player’s rationality with its
belief about the level of missing information. Specifically, we
regularize each player’s objective with the causal entropy of
its strategy which is causally dependent on the history of
play. Therefore, the proposed method allows the players to
adjust their strategies according to different levels of misin-
formation by tuning a parameter that controls the importance
of the regularization term. For example, at the two extremes, it allows the player to be perfectly rational or to purely randomize its strategy.
We study both entropy-regularized $N$-stage and entropy-regularized discounted games, and show the existence of a value in both games. We first prove the sufficiency of Markovian and stationary strategies for both players, respectively, in $N$-stage and discounted games in order to maximize
their entropy-regularized expected payoff. Then, we provide
algorithms based on a sequence of convex optimization
problems to compute a pair of equilibrium strategies. Finally,
we demonstrate the proposed methods on a motion planning
scenario, and illustrate that the introduced regularization term
yields strategies that perform well in different environments.
Related work.
In the stochastic games literature, the idea of balancing the expected payoffs with an additional regularization
term appeared recently in [8] and [9]. The work [8] proposes
to bound the rationality of the players to obtain tunable
behavior in video games. They study $\gamma$-discounted games and restrict their attention to stationary strategies to balance the expected payoffs with the Kullback-Leibler distance of the players' strategies from reference strategies. In [9], the authors
study $N$-stage games and consider only Markovian strategies to balance the expected payoffs with the players' sensing costs, which are expressed as directed information from states to actions. Unlike these works, we introduce the causal entropy of strategies as the regularization term. Additionally, we allow the players to follow history-dependent strategies and prove the sufficiency of Markovian strategies to attain the value in $N$-stage games.
Regularization terms are also used in matrix and extensive-form games, generally to learn equilibrium strategies [10], [11], [12], [13]. When each player uses the same parameter to regularize its expected payoffs with the entropy of its strategy, an equilibrium strategy profile is called a quantal response equilibrium (QRE) [3], and an equilibrium strategy of a player is referred to as a quantal best response [11] or logit choice strategy [10]. From a theoretical perspective, the main
difference between our approach and the well-studied QRE
concept [14] is that we establish the existence of equilibrium
strategies even if the players use different regularization
parameters. Additionally, we provide an efficient algorithm
based on a convex optimization problem to compute the
equilibrium strategies.
Robust stochastic games [15], [16] concern the synthesis
of equilibrium strategies when the uncertainty in transition
probabilities and payoff functions can be represented by
structured sets. Unlike robust SG models, the proposed
method in this paper can still be used when it is not possible
to form a structured uncertainty set.
In the reinforcement learning literature, the use of regularization terms is extensively studied to obtain robust behaviors
[17], improve the convergence rates [18], and compute
optimal strategies efficiently [19]. As stochastic games model multi-player interactions, our approach extends the ideas discussed in the aforementioned work to environments where an adversary aims to prevent a player from achieving its objective.
II. BACKGROUND
We first review some concepts from game theory and
information theory that will be used in the subsequent
sections.
Notation: For a sequence $x$, we write $x^t$ to denote $(x_1, x_2, \ldots, x_t)$. Upper case symbols such as $X$ denote random variables, and lower case symbols such as $x$ denote a specific realization. The cardinality of a set $\mathcal{X}$ is denoted by $|\mathcal{X}|$, and the probability simplex defined over the set $\mathcal{X}$ is denoted by $\Delta(\mathcal{X})$. For $V_1, V_2 \in \mathbb{R}^n$, we write $V_1 \preceq V_2$ to denote the coordinate-wise inequalities. We use the index set $\mathbb{Z}_+ = \{1, 2, \ldots\}$ and the natural logarithm $\log(\cdot) = \log_e(\cdot)$.
A. Two-Player Stochastic Games

A two-player stochastic game $\Gamma$ [1] is played in stages. At each stage $t$, the game is in one of the finitely many states in $\mathcal{X}$, and each player observes the current state $x_t$. At each state $x_t$, the players choose one of their finitely many actions, and the game transitions to a successor state $x_{t+1}$ according to a probability distribution $P : \mathcal{X}\times\mathcal{U}\times\mathcal{W}\to\Delta(\mathcal{X})$, where $\mathcal{U}$ and $\mathcal{W}$ are finite action spaces for player 1 and player 2, respectively. The pair of actions, $u_t\in\mathcal{U}$ and $w_t\in\mathcal{W}$, together with the current state $x_t\in\mathcal{X}$, determine the payoff $R(x_t,u_t,w_t)\in\mathbb{R}$ to be made by player 2 to player 1 at stage $t$.
A player's strategy is a specification of a probability distribution over available actions at each stage, conditional on the history of the game up to that stage. Formally, let $\mathcal{H}_t = (\mathcal{X}\times\mathcal{U}\times\mathcal{W})^{t-1}\times\mathcal{X}$ be the set of all possible histories of play up to stage $t$. Then, the strategies of player 1 and player 2 are denoted by $\sigma = (\sigma_1,\sigma_2,\ldots)$ and $\tau = (\tau_1,\tau_2,\ldots)$, respectively, where $\sigma_t : \mathcal{H}_t\to\Delta(\mathcal{U})$ and $\tau_t : \mathcal{H}_t\to\Delta(\mathcal{W})$ for all $t$. If a player's strategy depends only on the current state for all stages, e.g., $\sigma_t : \mathcal{X}\to\Delta(\mathcal{U})$ for all $t$, the strategy is said to be Markovian. A stationary strategy depends only on the current state and is independent of the stage number, e.g., $\sigma = (\sigma,\sigma,\ldots)$, where $\sigma : \mathcal{X}\to\Delta(\mathcal{U})$. We denote the set of all strategies, all Markovian strategies, and all stationary strategies for player $i\in\{1,2\}$ by $\Gamma_i$, $\Gamma^M_i$, and $\Gamma^S_i$, respectively.
Let $\mu_{t+1}(x^{t+1},u^t,w^t)$ be the joint probability distribution over the history $\mathcal{H}_{t+1}$ of play, which is uniquely determined by the initial state distribution $\mu_1(h_1)$ through the recursive formula
$$\mu_{t+1}(x^{t+1},u^t,w^t) = P(x_{t+1}\mid x_t,u_t,w_t)\,\sigma_t(u_t\mid h_t)\,\tau_t(w_t\mid h_t)\,\mu_t(h_t) \quad (1)$$
where $h_t\in\mathcal{H}_t$ is the history of play up to stage $t$.
A stochastic game with the initial distribution $\mu_1(x_1)$ is called an $N$-stage game if the game ends after $N$ stages. The evaluation function for an $N$-stage game is
$$J(X^N,U^N,W^N) := \sum_{t=1}^{N}\mathbb{E}_{\overline{\mu}_t} R(X_t,U_t,W_t) + \mathbb{E}_{\mu_{N+1}} R(X_{N+1}) \quad (2)$$
where $\overline{\mu}_t(\cdot) := \mu_t(\cdot)\sigma_t(\cdot)\tau_t(\cdot)$. Similarly, if the number of stages in the game is infinite, and the future payoffs are discounted by a factor $0<\gamma<1$, the game is called a $\gamma$-discounted game. The evaluation function for a $\gamma$-discounted game is
$$\sum_{t=1}^{\infty}\gamma^{t-1}\,\mathbb{E}_{\overline{\mu}_t} R(X_t,U_t,W_t). \quad (3)$$
(3)
The player 1’s objective is to maximize the evaluation
function, i.e., its expected payoff, whereas the player 2 aims
to minimize it. A stochastic game is said to have the
value
V
?
, if for an evaluation function
f
(
σ
,
τ
)
, we have
V
?
= max
σ
Γ
u
min
τ
Γ
w
f
(
σ
,
τ
) = min
τ
Γ
w
max
σ
Γ
u
f
(
σ
,
τ
)
.
A pair of strategies
(
σ
?
,
τ
?
)
is said to be equilibrium
strategies if it attains the value of the game.
It is well-known that both
N
-stage and
γ
-discounted
games have a value for finite state and action sets [20].
Moreover, Markovian and stationary strategies are sufficient
for players to attain the value in
N
-stage and
γ
-discounted
games, respectively [1], [21].
B. Causal Entropy

For a sequential decision-making problem where decisions depend causally on the past information, such as the history of play, the causal entropy of a strategy is a measure that quantifies the randomness of the strategy. Let $X^N$, $Y^N$ and $Z^N$ be sequences of random variables of length $N$. The entropy of the sequence $X^N$ causally conditioned on the sequences $Y^N$ and $Z^N$ is defined as [22]
$$H(X^N \,\|\, Y^N,Z^N) := \sum_{t=1}^{N} H(X_t\mid X^{t-1},Y^t,Z^t), \quad (4)$$
where
$$H(X_t\mid X^{t-1},Y^t,Z^t) := -\sum_{x^t,y^t,z^t}\Pr(x^t,y^t,z^t)\log\Pr(x_t\mid x^{t-1},y^t,z^t). \quad (5)$$
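For concreteness, the following minimal sketch (ours, not from the paper) evaluates the causal entropy of a Markovian strategy $\sigma_t(u\mid x)$: in that case, each summand in (4) reduces to the conditional entropy $H(U_t\mid X_t)$, i.e., the entropy of $\sigma_t(\cdot\mid x)$ averaged over the state marginal $\Pr(X_t = x)$, since the action distribution depends on the history only through the current state. The function name and array layout are illustrative assumptions.

```python
# Minimal sketch (illustrative, not from the paper): causal entropy of a
# Markovian strategy, where each term of (4) reduces to H(U_t | X_t).
import numpy as np

def causal_entropy_markovian(state_marginals, policies):
    """state_marginals[t][x] = Pr(X_t = x); policies[t][x, u] = sigma_t(u | x)."""
    total = 0.0
    for mu_t, sigma_t in zip(state_marginals, policies):
        # -sum_x Pr(X_t = x) sum_u sigma_t(u|x) log sigma_t(u|x), with 0 log 0 := 0
        with np.errstate(divide="ignore", invalid="ignore"):
            plogp = np.where(sigma_t > 0.0, sigma_t * np.log(sigma_t), 0.0)
        total += -np.sum(mu_t[:, None] * plogp)
    return total
```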
The concept of causal entropy has recently been used to
infer correlated-equilibrium strategies in Markov games [23]
and to recover cost functions in inverse optimal control prob-
lems [6]. In this study, we employ causal entropy to compute
an equilibrium strategy profile that balances the players’
expected payoff with the randomness of their strategies in
stochastic games.
In the absence of additional constraints, a strategy $\sigma\in\Gamma_1$ that maximizes the causal entropy $H(U^N \,\|\, X^N,W^{N-1})$ of player 1, which is conditioned on the revealed history of play, is the stationary strategy $\sigma = (\sigma,\sigma,\ldots)$ where $\sigma(x)(u) = 1/|\mathcal{U}|$. Therefore, a player that maximizes the entropy of its strategy acts purely randomly regardless of the history of play. On the other hand, a player that regularizes its expected payoff with the entropy of its strategy can be thought of as a player that balances its rationality with its belief about the correctness of the underlying transition model of the environment.
III. PROBLEM STATEMENT
We first consider entropy-regularized $N$-stage games, for which we define the evaluation function as
$$\Phi_N(\sigma,\tau) := J(X^N,U^N,W^N) + \frac{1}{\beta_1} H(U^N \,\|\, X^N,W^{N-1}) - \frac{1}{\beta_2} H(W^N \,\|\, X^N,U^{N-1}), \quad (6)$$
where $\beta_1,\beta_2 > 0$ are regularization parameters that adjust the importance each player places on the randomness of its strategy. Note that, when $\beta_1 = \beta_2 = \infty$, both players act perfectly rationally, and we recover the evaluation function (2). Additionally, since the play is simultaneous, a player's strategy at a given stage is not revealed to the other player. Hence, at each stage, the players are allowed to condition their strategies only on the observed history of play.
Problem 1: Provide an algorithm to synthesize, if they exist, equilibrium strategies in entropy-regularized $N$-stage games.
We next consider stochastic games that are played over infinitely many stages, and introduce entropy-regularized $\gamma$-discounted games, for which we define the evaluation function as
$$\Phi_\gamma(\sigma,\tau) := \sum_{t=1}^{\infty}\gamma^{t-1}\Big[\mathbb{E}_{\overline{\mu}_t} R(X_t,U_t,W_t) + \frac{1}{\beta_1} H(U_t\mid H_t) - \frac{1}{\beta_2} H(W_t\mid H_t)\Big], \quad (7)$$
where $H_t = (X^t,U^{t-1},W^{t-1})$, i.e., the admissible history of play at stage $t$. Note that, in the evaluation function (7), we discount the players' future entropy gains as well as the expected payoff in order to ensure the finiteness of the evaluation function.
Problem 2: Provide an algorithm to synthesize, if they exist, equilibrium strategies in entropy-regularized $\gamma$-discounted games.
IV. EXISTENCE OF VALUES AND THE COMPUTATION OF OPTIMAL STRATEGIES
In this section, we analyze entropy-regularized $N$-stage and $\gamma$-discounted games, and show that both games have values. Then, we provide algorithms to synthesize equilibrium
strategies that attain the corresponding game values.
A. Entropy-Regularized $N$-Stage Games
Searching for optimal strategies that solve a stochastic game with the evaluation function $\Phi_N(\sigma,\tau)$ in the space of all strategies can be intractable for large $N$. We begin by
establishing the existence of optimal strategies for both
players in the space of Markovian strategies.
Proposition 1: Markovian strategies are sufficient for both players to attain, if it exists, the value in entropy-regularized $N$-stage games, i.e.,
$$\max_{\sigma\in\Gamma^M_1}\min_{\tau\in\Gamma^M_2}\Phi_N(\sigma,\tau) = \max_{\sigma\in\Gamma_1}\min_{\tau\in\Gamma_2}\Phi_N(\sigma,\tau),$$
$$\min_{\tau\in\Gamma^M_2}\max_{\sigma\in\Gamma^M_1}\Phi_N(\sigma,\tau) = \min_{\tau\in\Gamma_2}\max_{\sigma\in\Gamma_1}\Phi_N(\sigma,\tau).$$
Proof: See Appendix A. □
Next, we show that entropy-regularized $N$-stage games have a value. Let $\rho_t : \mathcal{X}\times\mathcal{U}\times\mathcal{W}\to\mathbb{R}$ be a function and $x_t$ be a fixed state. Additionally, let
$$V^{\sigma_t,\tau_t}_t(x_t) := \mathbb{E}_{\sigma_t,\tau_t}\Big[\rho_t(x_t,u_t,w_t) - \frac{1}{\beta_1}\log\sigma_t(u_t\mid x_t) + \frac{1}{\beta_2}\log\tau_t(w_t\mid x_t)\Big] \quad (8)$$
be the evaluation function for a "one-shot" game in which the game starts from the state $x_t$ and ends after both players play their one-step strategy.
Proposition 2: A stochastic game with the evaluation function (8) has a value, i.e.,
$$\max_{\sigma_t\in\Delta(\mathcal{U})}\min_{\tau_t\in\Delta(\mathcal{W})} V^{\sigma_t,\tau_t}_t(x_t) = \min_{\tau_t\in\Delta(\mathcal{W})}\max_{\sigma_t\in\Delta(\mathcal{U})} V^{\sigma_t,\tau_t}_t(x_t).$$
Proof: It is clear that $V^{\sigma_t,\tau_t}_t(x_t)$ is a continuous function that is concave in $\sigma_t$ and convex in $\tau_t$. Additionally, $\Delta(\mathcal{U})$ and $\Delta(\mathcal{W})$ are compact convex sets. The result follows from von Neumann's minimax theorem [24]. □
The following proposition states that one can compute the
value of the one-shot game (8) and synthesize equilibrium
strategies by solving a convex optimization problem.
Proposition 3: For a given one-shot game with the evaluation function (8), optimal strategies $(\sigma^\star_t,\tau^\star_t)$ satisfy
$$\sigma^\star_t(u_t\mid x_t) \in \arg\max_{\sigma_t\in\Delta(\mathcal{U})}\Big[-\frac{1}{\beta_1}\sum_{u_t\in\mathcal{U}}\sigma_t(u_t\mid x_t)\log\sigma_t(u_t\mid x_t) - \frac{1}{\beta_2}\log\sum_{w_t\in\mathcal{W}}\exp\Big(-\beta_2\sum_{u_t\in\mathcal{U}}\sigma_t(u_t\mid x_t)\,\rho_t(x_t,u_t,w_t)\Big)\Big], \quad (9)$$
$$\tau^\star_t(w_t\mid x_t) = \frac{\exp\big(-\beta_2\sum_{u_t\in\mathcal{U}}\sigma^\star_t(u_t\mid x_t)\,\rho_t(x_t,u_t,w_t)\big)}{\sum_{w_t\in\mathcal{W}}\exp\big(-\beta_2\sum_{u_t\in\mathcal{U}}\sigma^\star_t(u_t\mid x_t)\,\rho_t(x_t,u_t,w_t)\big)}. \quad (10)$$
Furthermore, the unique value $V^\star_t(x_t)$ of the game is given by
$$V^\star_t(x_t) = -\frac{1}{\beta_1}\sum_{u_t\in\mathcal{U}}\sigma^\star_t(u_t\mid x_t)\log\sigma^\star_t(u_t\mid x_t) - \frac{1}{\beta_2}\log\sum_{w_t\in\mathcal{W}}\exp\Big(-\beta_2\sum_{u_t\in\mathcal{U}}\sigma^\star_t(u_t\mid x_t)\,\rho_t(x_t,u_t,w_t)\Big). \quad (11)$$
Proof: See Appendix A. □
It is worth noting that the objective function of the optimization problem given in (9) is strictly concave, and therefore the optimal strategies $\sigma^\star_t(u_t\mid x_t)$ and $\tau^\star_t(w_t\mid x_t)$ are unique. Additionally, an optimal strategy of the form (10) is known in the economics literature as a quantal best response [14], and for $\beta_1 = \beta_2 < \infty$, the optimal strategies form the well-studied quantal response equilibrium strategies [3].
We remark that the optimization problem in (9) has a
closed-form solution which is a function of the optimal strategy $\tau^\star_t(w_t\mid x_t)$. However, since the closed-form expressions
for equilibrium strategies constitute a system of coupled
nonlinear equations, the convex optimization formulation
provides a more convenient way to compute equilibrium
strategies.
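For illustration, the one-shot problem (9) can be posed directly as a convex program in CVXPY and solved with ECOS, the toolchain used in Section V. The following minimal sketch is ours, not the authors' implementation; the helper name solve_one_shot and the layout of rho as a $|\mathcal{U}|\times|\mathcal{W}|$ array are assumptions.

```python
# Minimal sketch (assumptions ours): solving the one-shot game (9)-(10) for a
# fixed state x_t. rho[u, w] = rho_t(x_t, u, w); beta1, beta2 > 0.
import cvxpy as cp
import numpy as np

def solve_one_shot(rho, beta1, beta2):
    nU, nW = rho.shape
    sigma = cp.Variable(nU, nonneg=True)
    expected_rho = rho.T @ sigma                        # vector over w of sum_u sigma(u) rho(u, w)
    # Objective of (9): player 1's entropy bonus plus a soft-min over player 2's actions.
    objective = (1.0 / beta1) * cp.sum(cp.entr(sigma)) \
                - (1.0 / beta2) * cp.log_sum_exp(-beta2 * expected_rho)
    problem = cp.Problem(cp.Maximize(objective), [cp.sum(sigma) == 1])
    value = problem.solve(solver=cp.ECOS)               # V_t^*(x_t), eq. (11)
    sigma_star = np.clip(sigma.value, 0.0, None)
    sigma_star /= sigma_star.sum()
    # Player 2's quantal best response, eq. (10): a softmin of the expected payoffs.
    logits = -beta2 * (rho.T @ sigma_star)
    tau_star = np.exp(logits - logits.max())
    tau_star /= tau_star.sum()
    return value, sigma_star, tau_star
```

Solving this problem at each state, together with the closed form (10), yields the pair $(\sigma^\star_t,\tau^\star_t)$ used in Algorithm 1.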
Utilizing the results of the above propositions, we now reformulate an entropy-regularized $N$-stage game as a series of "one-shot" games through the use of Bellman recursions. Let
$$\rho_t(x_t,u_t,w_t) = R(x_t,u_t,w_t) + \sum_{x_{t+1}\in\mathcal{X}} P(x_{t+1}\mid x_t,u_t,w_t)\,V^{\sigma_{t+1},\tau_{t+1}}_{t+1}(x_{t+1}), \quad (12)$$
for $t = 1,\ldots,N$, where $V^{\sigma_{N+1},\tau_{N+1}}_{N+1}(x_{N+1}) := R(x_{N+1})$. Then, it can be easily verified that
$$\Phi_N(\sigma,\tau) = \sum_{x_1\in\mathcal{X}}\mu_1(x_1)\,V^{\sigma_1,\tau_1}_1(x_1) \quad (13)$$
for a given initial distribution $\mu_1(x_1)$. Consequently, we obtain the following result.
Theorem 1: Entropy-regularized $N$-stage games have a value.
Proof: Due to Proposition 1, we can focus on Markovian strategies to find an equilibrium point in $N$-stage games. We start from the stage $k=N$ and compute the value of the one-shot game (8), which exists due to Proposition 2. Using (8) together with (12) for $k = N-1, N-2, \ldots, 1$, we compute the values of the $(N-k+1)$-stage games. As a result, the claim follows from the equivalence given in (13). □
Algorithm 1 summarizes the computation of the pair $(\sigma^\star,\tau^\star)$ of optimal strategies for entropy-regularized $N$-stage games.
Algorithm 1 Strategy computation for $N$-stage games
1: Initialize: $V_{N+1}(x_{N+1}) = R(x_{N+1})$ for all $x_{N+1}\in\mathcal{X}$.
2: for $t = N, N-1, \ldots, 1$ do
3: Compute $\rho_t(x_t,u_t,w_t)$ for all $x_t\in\mathcal{X}$, $u_t\in\mathcal{U}$, and $w_t\in\mathcal{W}$ as in (12).
4: For all $x_t\in\mathcal{X}$,
$$V^\star_t(x_t) = \max_{\sigma_t\in\Delta(\mathcal{U})}\; -\frac{1}{\beta_1}\sum_{u_t\in\mathcal{U}}\sigma_t(u_t\mid x_t)\log\sigma_t(u_t\mid x_t) - \frac{1}{\beta_2}\log\sum_{w_t\in\mathcal{W}}\exp\Big(-\beta_2\sum_{u_t\in\mathcal{U}}\sigma_t(u_t\mid x_t)\,\rho_t(x_t,u_t,w_t)\Big)$$
5: For all $x_t\in\mathcal{X}$, compute $\sigma^\star_t$ and $\tau^\star_t$ as in (9) and (10), respectively.
6: return $\sigma^\star = (\sigma^\star_1,\ldots,\sigma^\star_N)$ and $\tau^\star = (\tau^\star_1,\ldots,\tau^\star_N)$.
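A minimal sketch of the backward recursion in Algorithm 1, reusing the hypothetical solve_one_shot helper above, is given below; the tensor layouts of P and R are assumptions made for illustration.

```python
# Minimal sketch (assumptions ours) of Algorithm 1: backward recursion over stages,
# calling the hypothetical solve_one_shot() helper at every state.
import numpy as np

def n_stage_strategies(P, R, R_terminal, N, beta1, beta2):
    """P[x, u, w, y] = P(y | x, u, w); R[x, u, w] = stage payoff; R_terminal[x] = R(x_{N+1})."""
    nX, nU, nW, _ = P.shape
    V = R_terminal.copy()                                  # V_{N+1}(x) = R(x)
    sigmas, taus = [], []
    for t in range(N, 0, -1):                              # t = N, N-1, ..., 1
        V_next = np.empty(nX)
        sigma_t, tau_t = np.zeros((nX, nU)), np.zeros((nX, nW))
        for x in range(nX):
            rho = R[x] + np.einsum("uwy,y->uw", P[x], V)   # eq. (12)
            V_next[x], sigma_t[x], tau_t[x] = solve_one_shot(rho, beta1, beta2)
        V = V_next
        sigmas.insert(0, sigma_t)
        taus.insert(0, tau_t)
    return V, sigmas, taus                                 # V holds V_1^*(x)
```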
Remark: In certain scenarios, one of the players may prefer to play perfectly rationally against a boundedly rational opponent, e.g., $\beta_2 = \infty$. In that case, it is still possible to compute equilibrium strategies by solving a convex optimization problem at each stage. The value of the one-shot game (8) still exists due to the arguments provided in the proof of Proposition 2. However, the form of the optimal strategies slightly changes to
$$\tau^\star_t(w_t\mid x_t) \in \arg\min_{\tau_t\in\Delta(\mathcal{W})}\;\frac{1}{\beta_1}\log\sum_{u_t\in\mathcal{U}}\exp\Big(\beta_1\sum_{w_t\in\mathcal{W}}\tau_t(w_t\mid x_t)\,\rho_t(x_t,u_t,w_t)\Big), \quad (14)$$
$$\sigma^\star_t(u_t\mid x_t) = \frac{\exp\big(\beta_1\sum_{w_t\in\mathcal{W}}\tau^\star_t(w_t\mid x_t)\,\rho_t(x_t,u_t,w_t)\big)}{\sum_{u_t\in\mathcal{U}}\exp\big(\beta_1\sum_{w_t\in\mathcal{W}}\tau^\star_t(w_t\mid x_t)\,\rho_t(x_t,u_t,w_t)\big)}. \quad (15)$$
It is important to note that, if $\beta_i = \infty$ for some $i\in\{1,2\}$, the equilibrium strategies may not be unique since the log-sum-exp function $\log\sum\exp(\cdot)$ is not strictly convex over its domain [25].
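For completeness, the problem in (14) is also a standard convex program, since the log-sum-exp of an affine expression is convex in $\tau_t$; a minimal CVXPY sketch under the same assumptions as before is given below.

```python
# Minimal sketch (assumptions ours): player 2's problem (14) when beta_2 = infinity.
import cvxpy as cp

def rational_opponent_strategy(rho, beta1):
    """rho[u, w] = rho_t(x_t, u, w) for a fixed state x_t."""
    nU, nW = rho.shape
    tau = cp.Variable(nW, nonneg=True)
    # log-sum-exp of an affine expression in tau is convex, so this is a convex program
    objective = (1.0 / beta1) * cp.log_sum_exp(beta1 * (rho @ tau))
    cp.Problem(cp.Minimize(objective), [cp.sum(tau) == 1]).solve(solver=cp.ECOS)
    return tau.value
```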
B. Entropy-Regularized $\gamma$-Discounted Games

In this section, we focus on Markovian strategies, whose optimality for $N$-stage games is shown in Proposition 1. Let $\mathcal{V}\in\mathbb{R}^{|\mathcal{X}|}$ be a real-valued function and, for a given $x\in\mathcal{X}$, let $L(\mathcal{V})(x,\cdot,\cdot) : \mathcal{U}\times\mathcal{W}\to\mathbb{R}$ be a function such that
$$L(\mathcal{V})(x,\sigma,\tau) := \mathbb{E}_{\sigma,\tau}\Big[R(x,u,w) - \frac{1}{\beta_1}\log\sigma(u\mid x) + \frac{1}{\beta_2}\log\tau(w\mid x) + \gamma\sum_{x'\in\mathcal{X}} P(x'\mid x,u,w)\,\mathcal{V}(x')\Big]. \quad (16)$$
(16)
As discussed in Proposition 2, a one-shot game with the
evaluation function
L
(
V
)(
x,σ,τ
)
has a value. Therefore, we
can introduce the Shapley operator
Ψ:
V→
Ψ(
V
)
from
R
|X|
to itself specified, for all
x
∈X
, as
Ψ(
V
)[
x
] := max
σ
∆(
U
)
min
τ
∆(
W
)
L
(
V
)(
x,σ,τ
)
.
It is clear that the operator $\Psi$ satisfies two key properties: monotonicity, i.e., $\mathcal{V}\preceq\overline{\mathcal{V}}$ implies $\Psi(\mathcal{V})\preceq\Psi(\overline{\mathcal{V}})$, and reduction of constants, i.e., for any $k\geq 0$, $\Psi(\mathcal{V}+k\mathbf{1})[x] = \Psi(\mathcal{V})[x] + \gamma k$ for all $x\in\mathcal{X}$. Consequently, it is straightforward to show that the operator $\Psi$ is a contraction mapping [26]. Specifically, we have
$$\|\Psi(\mathcal{V}) - \Psi(\overline{\mathcal{V}})\| \leq \gamma\,\|\mathcal{V}-\overline{\mathcal{V}}\|,$$
where $\|\mathcal{V}\| = \max_{x\in\mathcal{X}}|\mathcal{V}(x)|$. We omit the details since similar results can easily be found in the literature [27]. Then, by Banach's fixed-point theorem [28], we conclude that the operator $\Psi$ has a unique fixed point $\mathcal{V}^\star$, which satisfies $\Psi(\mathcal{V}^\star) = \mathcal{V}^\star$.
Next, we need to show that the fixed point $\mathcal{V}^\star$ is indeed the value of the entropy-regularized $\gamma$-discounted game. Let $\sigma^\star = (\sigma^\star,\sigma^\star,\ldots)$ be a stationary strategy such that $\sigma^\star$ is a one-step strategy for player 1 satisfying the fixed-point equation, and let $\tau$ be an arbitrary Markovian strategy for player 2. Denoting by $h_t$ the history of play of length $t$, one has, by the definition of $\Psi$ and $\sigma^\star$,
$$\mathbb{E}_{\mu_t}\Big[R(x_t,u_t,w_t) - \frac{1}{\beta_1}\log\sigma^\star(u_t\mid x_t) + \frac{1}{\beta_2}\log\tau(w_t\mid x_t) + \gamma\sum_{x'\in\mathcal{X}} P(x'\mid x_t,u_t,w_t)\,\mathcal{V}^\star(x')\;\Big|\; h_t\Big] \geq \mathbb{E}_{\mu_t}\big[\mathcal{V}^\star(x_t)\;\big|\; h_t\big].$$
This expression can further be written as
$$\mathbb{E}_{\mu_t,\mu_{t+1}}\Big[R(x_t,u_t,w_t) - \frac{1}{\beta_1}\log\sigma^\star(u_t\mid x_t) + \frac{1}{\beta_2}\log\tau(w_t\mid x_t) + \gamma\,\mathcal{V}^\star(x_{t+1})\;\Big|\; h_t\Big] \geq \mathbb{E}_{\mu_t}\big[\mathcal{V}^\star(x_t)\;\big|\; h_t\big].$$
Multiplying by $\gamma^{t-1}$, taking expectations and summing over $1\leq t<k$, one obtains
$$\sum_{t=1}^{k-1}\gamma^{t-1}\,\mathbb{E}_{\mu_t}\Big[R(x_t,u_t,w_t) - \frac{1}{\beta_1}\log\sigma^\star(u_t\mid x_t) + \frac{1}{\beta_2}\log\tau(w_t\mid x_t)\;\Big|\; x_1\Big] \geq \mathcal{V}^\star(x_1) - \gamma^{k}\,\mathbb{E}_{\mu_{k+1}}\big[\mathcal{V}^\star(x_{k+1})\;\big|\; x_1\big].$$
Taking the limit as $k\to\infty$ and using Proposition 1, we obtain
$$\Phi_\gamma(\sigma^\star,\tau) \geq \mathcal{V}^\star(x_1). \quad (17)$$
Similarly, when player 2 plays the optimal stationary strategy $\tau^\star = (\tau^\star,\tau^\star,\ldots)$ against an arbitrary Markovian strategy $\sigma$ of player 1, we have
$$\Phi_\gamma(\sigma,\tau^\star) \leq \mathcal{V}^\star(x_1). \quad (18)$$
Then, the combination of (17) and (18) implies the following result.
Theorem 2: Entropy-regularized $\gamma$-discounted games have a value $\mathcal{V}^\star$, which satisfies $\Psi(\mathcal{V}^\star) = \mathcal{V}^\star$. Furthermore, stationary strategies are sufficient for both players to attain the game value, i.e.,
$$\max_{\sigma\in\Gamma_1}\min_{\tau\in\Gamma_2}\Phi_\gamma(\sigma,\tau) = \max_{\sigma\in\Gamma^S_1}\min_{\tau\in\Gamma^S_2}\Phi_\gamma(\sigma,\tau).$$
The computation of optimal strategies is a straightforward extension of Algorithm 1. Note that for $\gamma$-discounted games, we use the same one-shot game introduced in (8); therefore, the optimal decision rules at each stage have the form (9) and (10). Consequently, to compute the optimal strategies, we initialize Algorithm 1 with an arbitrary value vector $\mathcal{V}\in\mathbb{R}^{|\mathcal{X}|}$ and iterate until convergence, which is guaranteed by the existence of a unique fixed point.
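A minimal sketch of this fixed-point iteration, again assuming the hypothetical solve_one_shot helper and tensor layout used earlier, is given below; the stopping tolerance and iteration limit are implementation choices rather than quantities prescribed by the analysis.

```python
# Minimal sketch (assumptions ours): value iteration with the Shapley operator Psi.
import numpy as np

def discounted_equilibrium(P, R, gamma, beta1, beta2, tol=1e-6, max_iter=1000):
    """P[x, u, w, y] = P(y | x, u, w); R[x, u, w] = stage payoff."""
    nX, nU, nW, _ = P.shape
    V = np.zeros(nX)                                       # arbitrary initial value vector
    sigma, tau = np.zeros((nX, nU)), np.zeros((nX, nW))
    for _ in range(max_iter):
        V_new = np.empty(nX)
        for x in range(nX):
            # rho(x, u, w) = R(x, u, w) + gamma * sum_y P(y | x, u, w) V(y)
            rho = R[x] + gamma * np.einsum("uwy,y->uw", P[x], V)
            V_new[x], sigma[x], tau[x] = solve_one_shot(rho, beta1, beta2)
        if np.max(np.abs(V_new - V)) < tol:                # contraction gives geometric convergence
            V = V_new
            break
        V = V_new
    return V, sigma, tau                                   # V approximates V*; (sigma, tau) stationary
```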
V. A NUMERICAL EXAMPLE
In this section, we demonstrate the proposed strategy synthesis method on a motion planning scenario that we model as an entropy-regularized $\gamma$-discounted game. To solve the convex optimization problems required for the computation of equilibrium strategies, we use the ECOS solver [29] through the interface of CVXPY [30]. All computations are performed by setting $\gamma = 0.8$.
As the environment model, we consider the $5\times 5$ grid world given in Figure 1 (top left). The brown cell denotes the initial position of the player 1, which aims to reach the goal (green) state. The red cell is the initial position of the player 2, whose aim is to catch the player 1 before it reaches the goal state. Finally, black cells represent walls.
Let $x = (s_1,s_2)$ be the current state of the game such that $x[1] = s_1$ and $x[2] = s_2$ are the positions of the player 1 and the player 2, respectively. At each state, the action space for both players is given as $\mathcal{U} = \mathcal{W} = \{\text{right},\text{left},\text{up},\text{down},\text{stay}\}$. For simplicity, we assume deterministic transitions, i.e., $P(x'\mid x,u,w)\in\{0,1\}$ for all $x,x'\in\mathcal{X}$, $u\in\mathcal{U}$ and $w\in\mathcal{W}$. If a player takes an action for which the successor state is a wall, the player stays in the same state with probability 1.
For a given $(x,u,w)\in\mathcal{X}\times\mathcal{U}\times\mathcal{W}$, we encode the payoff function $R(x,u,w)$ as the sum of two functions, $R(x,u,w) = R_1(x,u,w) + R_2(x,u,w)$, where
$$R_1(x,u,w) = \sum_{x' : x'[1]=G} P(x'\mid x,u,w),$$
$$R_2(x,u,w) = \begin{cases} -P(x'\mid x,u,w) & \text{if } x'[1] = x'[2] \neq G \\ 0 & \text{otherwise,} \end{cases}$$
where $x'$ denotes the (unique) successor state of $(x,u,w)$ under the deterministic transitions and $G$ denotes the goal position.
Note that the payoff function defines a zero-sum game which is won by the player 1 if it reaches the goal state before getting caught, and by the player 2 if it catches the player 1 before the goal state is reached.
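A minimal sketch of this construction is given below; the specific wall, goal, and start cells are placeholders rather than the exact layout of Figure 1, and the helper names are ours.

```python
# Minimal sketch (layout is a placeholder, not the paper's exact grid): joint state
# space, deterministic moves, and the payoff R = R1 + R2 for the pursuit game.
from itertools import product

SIZE, GOAL = 5, (0, 4)
WALLS = {(2, 1), (2, 2)}                                   # hypothetical wall cells
MOVES = {"right": (0, 1), "left": (0, -1), "up": (-1, 0), "down": (1, 0), "stay": (0, 0)}

cells = [c for c in product(range(SIZE), repeat=2) if c not in WALLS]
states = list(product(cells, cells))                       # x = (s1, s2)
actions = list(MOVES)                                      # U = W

def step(cell, action):
    nxt = (cell[0] + MOVES[action][0], cell[1] + MOVES[action][1])
    # moves into walls or off the grid leave the player in place
    return nxt if nxt in cells else cell

def payoff(x, u, w):
    s1, s2 = step(x[0], u), step(x[1], w)                  # deterministic successor state
    r1 = 1.0 if s1 == GOAL else 0.0                        # R1: player 1 reaches the goal
    r2 = -1.0 if (s1 == s2 and s1 != GOAL) else 0.0        # R2: player 2 catches player 1
    return r1 + r2
```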
We first compute Nash equilibrium strategies in the absence of the causal entropy terms, i.e., $\beta_1 = \beta_2 = \infty$, by employing the standard linear programming formulation [31] for zero-sum games. Starting from the initial state, an equilibrium strategy for the player 1 is to move towards the goal state by deterministically taking the action right, and an equilibrium strategy for the player 2 is to chase the player 1 by taking the action up in the first two stages and then taking the action right until the player 1 reaches the goal state. Therefore, a perfectly rational player 1 wins the game with probability 1 no matter what strategy is followed by the player 2.
Fig. 1: (Top left) The nominal environment the players use for computing their strategies. (Top right) The probability that the player 1 wins the game when it plays the strategy computed by using $\beta_1 = \beta_2 = \beta$ against the perfectly rational player 2. (Middle and bottom left) The actual environments where the game is played. (Middle and bottom right) The probability of winning for the player 1 when it employs strategies computed by using different $\beta$ values against the perfectly rational player 2.

To illustrate the drawback of playing with perfect rationality, we assume that there is another wall in the environment
about which the players have no information while they
compute the equilibrium strategies, i.e., the players use the
nominal environment (top left) to compute the equilibrium
strategies. First, we consider the case that the wall is between
the goal state and the player 1, as shown in Figure 1 (middle
left). In this case, if the player 1 follows the Nash equilibrium
strategy, the probability that it reaches the goal state becomes
zero. Therefore, following the Nash equilibrium strategy
makes the player 1 significantly vulnerable to such changes
in the environment.
To investigate the tradeoff between rationality and randomness, we compute 9 different strategies for the player 1 by using $\beta_1 = \beta_2 = 2, 3, \ldots, 10$, and let it play against the perfectly rational player 2, which follows its Nash equilibrium strategy computed on the nominal environment (top left). The winning probabilities of the player 1 under different strategies are shown in Figure 1 (right) for the corresponding environments given in Figure 1 (left). This specific scenario demonstrates that, by choosing $\beta = 6$, the player 1 can obtain robust behavior against unforeseen changes in the environment, i.e., the winning probability is around 20%, without sacrificing too much of its optimal performance, i.e., around 15%, if the structure of the environment remains the same. It is worth noting that the asymptotic performance of the player 1 as $\beta\to\infty$ approaches its performance under the Nash equilibrium strategy, as discussed in [14]. Additionally, the importance of randomness for the player 1 increases as $\beta\to 0$, and using smaller $\beta$ values negatively affects the performance after a critical point, i.e., $\beta = 4$.
Finally, one can argue that the tradeoff occurs in this specific scenario only if the unpredicted wall is between the player 1 and the goal state. To justify the choice of the $\beta$ value, we also consider the scenario in which the unexpected wall occupies another state, as shown in Figure 1 (bottom left). In this case, as shown in Figure 1 (bottom right), the use of $\beta = 6$ results in a strategy that guarantees around an 80% winning probability. Therefore, the entropy-regularized strategy of the player 1 still provides an advantage against unpredicted changes without sacrificing too much of the optimal performance.
VI. CONCLUSIONS AND FUTURE WORK
We consider the problem of two-player zero-sum stochastic games with entropy regularization, wherein the players aim to maximize the causal entropy of their strategies in addition to the conventional expected payoff. We show that equilibrium strategies exist for both entropy-regularized $N$-stage and entropy-regularized $\gamma$-discounted games, and that they can be computed by solving convex optimization problems. In a numerical example, we apply the proposed approach to a motion planning scenario and observe that, by tuning the regularization parameter, a player can synthesize robust strategies that perform well in different environments against a perfectly rational opponent.

Extending this work to multi-agent reinforcement learning settings, as discussed in [8], is an interesting future direction. Future work can also investigate the effect of the entropy regularization term on the convergence rate of learning algorithms, as discussed in [18].
REFERENCES
[1] L. S. Shapley, "Stochastic games," Proceedings of the National Academy of Sciences, vol. 39, no. 10, pp. 1095–1100, 1953.
[2] J. Nash, "Non-cooperative games," Annals of Mathematics, vol. 54, no. 2, pp. 286–295, 1951.
[3] J. K. Goeree, C. A. Holt, and T. R. Palfrey, Quantal Response Equilibrium: A Stochastic Theory of Games. Princeton University Press, 2016.
[4] E. T. Jaynes, "Information theory and statistical mechanics," Physical Review, vol. 106, no. 4, p. 620, 1957.
[5] B. D. Ziebart, J. A. Bagnell, and A. K. Dey, "The principle of maximum causal entropy for estimating interacting processes," IEEE Transactions on Information Theory, vol. 59, no. 4, pp. 1966–1980, 2013.
[6] ——, "Modeling interaction via the principle of maximum causal entropy," in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 1255–1262.
[7] Y. Savas, M. Ornik, M. Cubuktepe, and U. Topcu, "Entropy maximization for Markov decision processes under temporal logic constraints," arXiv:1807.03223v2 [math.OC], 2018.
[8] J. Grau-Moya, F. Leibfried, and H. Bou-Ammar, "Balancing two-player stochastic games with soft Q-learning," in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018.
[9] M. Ahmadi, S. Bharadwaj, T. Tanaka, and U. Topcu, "Stochastic games with sensing costs," in 56th Annual Allerton Conference on Communication, Control, and Computing, 2018.
[10] P. Mertikopoulos and W. H. Sandholm, "Learning in games via reinforcement and regularization," Mathematics of Operations Research, vol. 41, no. 4, pp. 1297–1324, 2016.
[11] D. S. Leslie and E. J. Collins, "Individual Q-learning in normal form games," SIAM Journal on Control and Optimization, vol. 44, no. 2, pp. 495–514, 2005.
[12] C. K. Ling, J. Z. Kolter, and F. Fang, "What game are we playing? Differentiably learning games from incomplete observations," in Proceedings of the Conference on Neural Information Processing Systems, 2017.
[13] D. Fudenberg and D. Levine, The Theory of Learning in Games. The MIT Press, 1998.
[14] R. D. McKelvey and T. R. Palfrey, "Quantal response equilibria for normal form games," Games and Economic Behavior, vol. 10, no. 1, pp. 6–38, 1995.
[15] E. Kardeş, F. Ordóñez, and R. W. Hall, "Discounted robust stochastic games and an application to queueing control," Operations Research, vol. 59, no. 2, pp. 365–382, 2011.
[16] M. Aghassi and D. Bertsimas, "Robust game theory," Mathematical Programming, vol. 107, no. 1-2, pp. 231–273, 2006.
[17] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, "Reinforcement learning with deep energy-based policies," in Proceedings of the Thirty-Fourth International Conference on Machine Learning, 2017.
[18] R. Fox, A. Pakman, and N. Tishby, "Taming the noise in reinforcement learning via soft updates," in Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, 2016, pp. 202–211.
[19] E. Todorov, "Efficient computation of optimal actions," Proceedings of the National Academy of Sciences, vol. 106, no. 28, pp. 11478–11483, 2009.
[20] T. Bewley and E. Kohlberg, "The asymptotic theory of stochastic games," Mathematics of Operations Research, vol. 1, no. 3, pp. 197–208, 1976.
[21] S. Sorin, "Discounted stochastic games: The finite case," in Stochastic Games and Applications. Springer, 2003, pp. 51–55.
[22] G. Kramer, "Directed information for channels with feedback," Ph.D. dissertation, ETH Zurich, 1998.
[23] B. D. Ziebart, J. A. Bagnell, and A. K. Dey, "Maximum causal entropy correlated equilibria for Markov games," in International Conference on Autonomous Agents and Multiagent Systems, 2011, pp. 207–214.
[24] J. v. Neumann, "Zur Theorie der Gesellschaftsspiele," Mathematische Annalen, vol. 100, no. 1, pp. 295–320, 1928.
[25] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis. Springer Science & Business Media, 2009, vol. 317.
[26] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, 1996.
[27] A. Neyman and S. Sorin, Stochastic Games and Applications. Springer Science & Business Media, 2003, vol. 570.
[28] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[29] A. Domahidi, E. Chu, and S. Boyd, "ECOS: An SOCP solver for embedded systems," in European Control Conference (ECC), 2013, pp. 3071–3076.
[30] S. Diamond and S. Boyd, "CVXPY: A Python-embedded modeling language for convex optimization," Journal of Machine Learning Research, vol. 17, no. 83, pp. 1–5, 2016.
[31] P. B. Miltersen and T. B. Sørensen, "Computing proper equilibria of zero-sum games," in International Conference on Computers and Games, 2006, pp. 200–211.
[32] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2006.
APPENDIX A
Proof of Proposition 1: We show the sufficiency of Markovian strategies only for the maximin problem. The proof for the minimax formulation follows the same lines as the arguments provided below.

The proof is based on backward induction on the stage number $1\leq k\leq N$. Let
$$V_k := \max_{\sigma\in\Gamma_1}\min_{\tau\in\Gamma_2}\sum_{l=k}^{N}\mathbb{E}_{\overline{\mu}_l}\Big[R(X_l,U_l,W_l) - \frac{\log\sigma_l(U_l\mid H_l)}{\beta_1} + \frac{\log\tau_l(W_l\mid H_l)}{\beta_2}\Big] + \mathbb{E}_{\mu_{N+1}} R(X_{N+1})$$
be the value of the $(N-k+1)$-stage problem. Then, we can write the value of the $(N-k+1)$-stage problem recursively as
$$V_k = \max_{\sigma_k}\min_{\tau_k}\mathbb{E}_{\overline{\mu}_k}\Big[R(X_k,U_k,W_k) - \frac{1}{\beta_1}\log\sigma_k(U_k\mid H_k) + \frac{1}{\beta_2}\log\tau_k(W_k\mid H_k) + \mathbb{E}_P[V_{k+1}]\Big]. \quad (19)$$
Base step ($k=N$): Let $\sigma_N$ be an arbitrary strategy for player 1 and $\tau^\star_N$ be the optimal strategy for player 2 at stage $N$. Let
$$\lambda_N(h_N,u_N,w_N) := \mu_N(x^N,u^{N-1},w^{N-1})\,\sigma_N(u_N\mid h_N)\,\tau^\star_N(w_N\mid h_N)$$
be the joint distribution induced by $\sigma_N(u_N\mid h_N)$ and $\tau^\star_N(w_N\mid h_N)$. Additionally, let $\lambda_N(x_N,w_N)$ and $\lambda_N(x_N)$ be the corresponding marginal distributions of $\lambda_N(h_N,u_N,w_N)$. We construct a new (Markovian) strategy for player 2 as $\widehat{\tau}_N(w_N\mid x_N) := \lambda_N(x_N,w_N)/\lambda_N(x_N)$, and let
$$\widehat{\lambda}_N(h_N,u_N,w_N) := \mu_N(x^N,u^{N-1},w^{N-1})\,\sigma_N(u_N\mid h_N)\,\widehat{\tau}_N(w_N\mid x_N)$$
be the joint distribution induced by $\widehat{\tau}_N(w_N\mid x_N)$. Then, by construction, we have $\lambda_N(x_N,u_N,w_N) = \widehat{\lambda}_N(x_N,u_N,w_N)$, which can be easily verified by calculating the corresponding marginal distributions. (For a similar strategy construction, see Theorem 5.5.1 in [28].)
The inner optimization problem in (19) for $k=N$ reads
$$V_N = \min_{\tau_N}\; J^c_N(\lambda_N) - J^H_N(\lambda_N), \quad (20)$$
where
$$J^c_N(\lambda_N) = \mathbb{E}_{\lambda_N,P}\Big[R(X_N,U_N,W_N) + R(X_{N+1}) - \frac{1}{\beta_1}\log\sigma_N(U_N\mid H_N)\Big],$$
$$J^H_N(\lambda_N) = \frac{1}{\beta_2} H_{\lambda_N}(W_N\mid X^N,U^{N-1},W^{N-1}).$$
Since the strategy $\sigma_N$ is arbitrarily chosen, it is sufficient to show that $J^c_N(\widehat{\lambda}_N) = J^c_N(\lambda_N)$ and $J^H_N(\lambda_N)\leq J^H_N(\widehat{\lambda}_N)$ in order to establish the sufficiency of Markovian strategies for player 2. The first equality holds by construction. (Note that the $\log(\cdot)$ term is indifferent to changes in the strategy of player 2.) The second inequality can be derived as
$$H_{\lambda_N}(W_N\mid X^N,U^{N-1},W^{N-1}) \leq H_{\lambda_N}(W_N\mid X_N) \quad (21)$$
$$= H_{\widehat{\lambda}_N}(W_N\mid X_N) \quad (22)$$
$$= H_{\widehat{\lambda}_N}(W_N\mid X^N,U^{N-1},W^{N-1}) \quad (23)$$
where (21) holds since conditioning reduces entropy [32], (22) holds because $\lambda_N(x_N,w_N) = \widehat{\lambda}_N(x_N,w_N)$ by construction, and (23) is due to the fact that $\widehat{\tau}_N(w_N\mid x_N)$ is a Markovian strategy. Consequently, for any strategy chosen by player 1 at stage $N$, player 2 has a best-response strategy in the space of Markovian strategies.
Next, we can assume that player 2 uses a Markovian strategy and show, through a strategy construction similar to the one explained above, that player 1 has an optimal strategy in the space of Markovian strategies. As a result, the value $V_N$ depends on the joint distribution $\mu_N(x^N,u^{N-1},w^{N-1})$ only through its marginal $\mu_N(x_N)$, and hence becomes a function of the marginal $\mu_N(x_N)$ only.
Inductive step ($k=t$): Assume that Markovian strategies suffice for both players for $k = t+1, t+2, \ldots, N$. Then, by the induction hypothesis, $V_{t+1}$ is a function of $\mu_{t+1}(x_{t+1})$ only. Therefore, using a construction similar to the case $k=N$, we can construct Markovian strategies $\sigma_t(u_t\mid x_t)$ and $\tau_t(w_t\mid x_t)$ such that the objective function on the right-hand side of (19) attained by $\sigma_t$ and $\tau_t$ is equal to the value of the $(N-t+1)$-stage problem. As a result, we conclude that Markovian strategies are sufficient for both players to solve the maximin problem (6). □
Proof of Proposition 3: Since the one-shot game has a value, without loss of generality, we focus on the problem $\max_{\sigma_t\in\Delta(\mathcal{U})}\min_{\tau_t\in\Delta(\mathcal{W})} V^{\sigma_t,\tau_t}_t(x_t)$. For notational convenience, we rewrite the problem as
$$\max_{Q_{ij}}\min_{Q_{ik}}\;\sum_{j,k} Q_{ij}Q_{ik}\rho_{ijk} - \frac{1}{\beta_1}\sum_{j,k} Q_{ik}Q_{ij}\log Q_{ij} + \frac{1}{\beta_2}\sum_{j,k} Q_{ij}Q_{ik}\log Q_{ik}$$
$$\text{subject to}\quad \sum_j Q_{ij} = 1,\quad \sum_k Q_{ik} = 1,\quad Q_{ij}\geq 0,\quad Q_{ik}\geq 0,$$
where $Q_{ij} = \sigma_t(u_t\mid x_t)$, $Q_{ik} = \tau_t(w_t\mid x_t)$, and $\rho_{ijk} = \rho_t(x_t,u_t,w_t)$. Note that, due to the constraints $\sum_j Q_{ij} = 1$ and $\sum_k Q_{ik} = 1$, we can replace $\frac{1}{\beta_1}\sum_{j,k} Q_{ik}Q_{ij}\log Q_{ij}$ and $\frac{1}{\beta_2}\sum_{j,k} Q_{ij}Q_{ik}\log Q_{ik}$ by $\frac{1}{\beta_1}\sum_{j} Q_{ij}\log Q_{ij}$ and $\frac{1}{\beta_2}\sum_{k} Q_{ik}\log Q_{ik}$, respectively. For now, we neglect the non-negativity constraints and write the Lagrangian for the above optimization problem as
$$\mathcal{L} = \sum_{j,k} Q_{ij}Q_{ik}\rho_{ijk} - \frac{1}{\beta_1}\sum_j Q_{ij}\log Q_{ij} + \frac{1}{\beta_2}\sum_k Q_{ik}\log Q_{ik} + \lambda_j\Big(\sum_j Q_{ij} - 1\Big) + \lambda_k\Big(\sum_k Q_{ik} - 1\Big),$$
where $\lambda_j$ and $\lambda_k$ are Lagrange multipliers. Then, taking the derivative with respect to $Q_{ik}$ and equating it to zero, we obtain
$$\frac{\partial\mathcal{L}}{\partial Q_{ik}} = \sum_j Q_{ij}\rho_{ijk} + \frac{1}{\beta_2}\log Q^\star_{ik} + \frac{1}{\beta_2} + \lambda_k = 0.$$
Rearranging terms and using the constraint $\sum_k Q_{ik} = 1$, we obtain
$$Q^\star_{ik} = \frac{\exp\big(-\beta_2\sum_j Q_{ij}\rho_{ijk}\big)}{\sum_k\exp\big(-\beta_2\sum_j Q_{ij}\rho_{ijk}\big)},$$
which is the same as (10). Note that the resulting strategy also satisfies the non-negativity constraints. Plugging $Q^\star_{ik}$ into the Lagrangian $\mathcal{L}$, we obtain the optimization problem given in (9), for which the optimal variables correspond to the optimal strategy of player 1. Similarly, the optimal value (11) of the resulting optimization problem is the value of the game. Uniqueness of the value follows from the fact that the value of the game is the optimal value of the convex optimization problem given in (9). □
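As a quick numerical sanity check of this derivation (ours, not part of the paper), one can verify that the closed form (10) attains the minimum of the inner problem over the simplex for a fixed $\sigma_t$:

```python
# Minimal sketch (assumptions ours): verify that the softmin form (10) minimizes
# tau |-> sum_w tau(w) a(w) + (1/beta2) sum_w tau(w) log tau(w) for a fixed sigma.
import numpy as np

rng = np.random.default_rng(0)
nU, nW, beta2 = 3, 4, 2.0
rho = rng.normal(size=(nU, nW))
sigma = rng.dirichlet(np.ones(nU))

a = sigma @ rho                                   # a(w) = sum_u sigma(u) rho(u, w)
tau_star = np.exp(-beta2 * a)
tau_star /= tau_star.sum()                        # closed form (10)

def inner(tau):
    return tau @ a + (1.0 / beta2) * np.sum(tau * np.log(tau))

samples = rng.dirichlet(np.ones(nW), size=10000)  # random strategies on the simplex
assert inner(tau_star) <= min(inner(t) for t in samples) + 1e-9
```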