Convergence Analysis of Gradient-Based Learning in Continuous Games

Benjamin Chasnov¹, Lillian Ratliff¹, Eric Mazumdar², Samuel Burden¹
¹University of Washington, Seattle, WA
²University of California, Berkeley, Berkeley, CA
Abstract

Considering a class of gradient-based multi-agent learning algorithms in non-cooperative settings, we provide convergence guarantees to a neighborhood of a stable Nash equilibrium. In particular, we consider continuous games where agents learn in 1) deterministic settings with oracle access to their individual gradient and 2) stochastic settings with an unbiased estimator of their individual gradient. We also study the effects of non-uniform learning rates, which cause a distortion of the vector field that can alter the equilibrium to which the agents converge and the learning path. We support the analysis with numerical examples that provide insight into how games may be synthesized to achieve desirable equilibria.
1 INTRODUCTION
The characterization and computation of equilibria such as Nash equilibria and their refinements constitutes a significant focus in non-cooperative game theory. A natural question that arises is 'how do players find or learn such equilibria, and how should the grappling process that occurs during learning be interpreted?' With this question in mind, a variety of fields have focused attention on the problem of learning in games, which has led to a plethora of learning algorithms including gradient play, fictitious play, best response, and multi-agent reinforcement learning, among others (Fudenberg and Levine, 1998). While convergence has been studied for many of these algorithms, the results largely tend to be asymptotic in nature; questions of error bounds and convergence rates are often less explored, particularly in the context of non-uniform learning rates, a key feature of systems of autonomous learning agents.
From an applications point of view, another recent trend is the adoption of game theoretic models of algorithm interaction in machine learning applications. For instance, game theoretic tools are being used to improve the robustness and generalizability of machine learning algorithms; e.g., generative adversarial networks have become a popular topic of study demanding the use of game theoretic ideas to provide performance guarantees (Daskalakis et al., 2017). In other work from the learning community, game theoretic concepts are being leveraged to analyze the interaction of learning agents—see, e.g., (Balduzzi et al., 2018; Heinrich and Silver, 2016; Mazumdar and Ratliff, 2018; Mertikopoulos and Zhou, 2019; Tuyls et al., 2018). Even more recently, the study of convergence to Nash equilibria has been called into question (Papadimitriou and Piliouras, 2018); in its place is a proposal to consider game dynamics as the meaning of the game. This is an interesting perspective since it is well known that, in general, learning dynamics do not obtain a Nash equilibrium even asymptotically—see, e.g., (Hart and Mas-Colell, 2003)—and, perhaps more interestingly, many learning dynamics exhibit very interesting limiting behaviors including periodic orbits and chaos—see, e.g., (Benaïm and Hirsch, 1999; Benaïm et al., 2012; Hofbauer, 1996; Hommes and Ochea, 2012).
Despite this activity, we still lack a complete understanding of the dynamics and limiting behaviors of coupled, competing learning algorithms. One may imagine that the myriad results on convergence of gradient descent in optimization readily extend to the game setting. Yet they do not, since gradient-based learning schemes in games do not correspond to gradient flows, a class of flows that are guaranteed to converge to local minimizers almost surely. In particular, the gradient-based learning dynamics for competitive, multi-agent settings have a non-symmetric Jacobian and, as a consequence, their dynamics may admit complex eigenvalues and non-equilibrium limiting behavior such as periodic orbits. In short, this fact makes it difficult to extend many of the approaches to convergence in single-agent optimization settings to multi-agent settings, primarily because a step in the direction of a player's individual gradient does not guarantee that the player's cost decreases. In fact, in games, as our examples highlight, a player's cost can increase when they follow the gradient of their own cost. This behavior is due to the coupling between the agents.
Some of the questions that remain unaddressed, and to which we provide at least partial answers, include the derivation of error bounds and convergence rates, which are important for ensuring certain performance guarantees on the collective behavior and can be used to provide guarantees on subsequent control or incentive policy synthesis. We also investigate how naturally arising features of the learning process for autonomous agents, such as their learning rates, impact the learning path and limiting behavior. This further exposes interesting questions about the overall quality of the limiting behavior and the cost accumulated along the learning path—e.g., is it better to be a slow or fast learner, both in terms of the cost of learning and the learned behavior?
Contributions. We use state-of-the-art tools and techniques from dynamical systems theory and numerical methods to make new contributions to the field of multi-agent learning, the theory of continuous games, and learning in games. In particular, we study convergence of a class of gradient-based multi-agent learning algorithms in non-cooperative settings where agents have non-uniform learning rates by leveraging the framework of $n$-player continuous games. That is, we consider a class of learning algorithms $x_i^+ = x_i - \gamma_i g_i(x_i, x_{-i})$ in which $g_i$ is derived from the gradient of a function that abstractly represents the cost function of player $i$. This class encompasses a wide variety of approaches to learning in games including multi-agent policy gradient and multi-agent gradient-based online optimization. We consider two settings: (i) agents have oracle access to $g_i$ and (ii) agents have an unbiased estimator for $g_i$.
To our knowledge, finite-time guarantees for either the stochastic or deterministic setting with non-uniform learning rates have not been provided; we provide both. Towards this end, we characterize the local structure of the game around its equilibria and exploit this local structure, in combination with results from dynamical systems theory, to obtain finite-time convergence guarantees for settings not currently covered by the state of the art. The analysis combines insights about games and the structure of the learning dynamics near equilibria with results from numerical methods to obtain finite-time bounds in the deterministic setting, and with very recent advancements in concentration bounds for stochastic approximation in the stochastic setting. The setting of non-uniform learning rates complicates the analysis and is well motivated, particularly for applications in which the agents are autonomous and learn their strategies through repeated interaction, as opposed to a setting in which an external entity has the goal of computing the Nash equilibria of a game.
2 PRELIMINARIES
Consider a setting in which, at iteration $k$, each agent $i \in \mathcal{I} = \{1, \ldots, n\}$ updates their choice variable $x_i \in X_i = \mathbb{R}^{d_i}$ by the process
\[
x_{i,k+1} = x_{i,k} - \gamma_{i,k}\, g_i(x_{i,k}, x_{-i,k}) \tag{1}
\]
where $\gamma_{i,k}$ is the learning rate, $x_{-i,k} = (x_{j,k})_{j \in \mathcal{I}\setminus\{i\}} \in \prod_{j \in \mathcal{I}\setminus\{i\}} X_j$ denotes the choices of all agents excluding the $i$-th agent, and $(x_{i,k}, x_{-i,k}) \in X = \prod_{i \in \mathcal{I}} X_i$. For each $i \in \mathcal{I}$, there exists a sufficiently smooth function $f_i \in C^q(X, \mathbb{R})$, $q \geq 2$, such that $g_i$ is either $D_i f_i$, where $D_i(\cdot)$ denotes the derivative of $f_i$ with respect to $x_i$, or an unbiased estimator of $D_i f_i$—i.e., $g_i \equiv \widehat{D_i f_i}$ where $\mathbb{E}[\widehat{D_i f_i}] = D_i f_i$.
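To make the update rule concrete, the following is a minimal sketch of (1) in the deterministic oracle setting for a hypothetical two-player game with scalar actions and quadratic costs; the costs, learning rates, and initialization are illustrative assumptions, not quantities taken from the analysis.

```python
import numpy as np

# Hypothetical two-player game (illustrative only):
#   f1(x1, x2) = 0.5*x1**2 + a*x1*x2,   f2(x1, x2) = 0.5*x2**2 - a*x1*x2
a = 0.9

def D1f1(x):  # player 1's individual gradient D_1 f_1
    return x[0] + a * x[1]

def D2f2(x):  # player 2's individual gradient D_2 f_2
    return x[1] - a * x[0]

gamma = np.array([0.1, 0.05])   # non-uniform, constant learning rates
x = np.array([1.0, -1.0])       # initial joint action

for k in range(500):
    # simultaneous update: each player steps along its own gradient only
    x = x - gamma * np.array([D1f1(x), D2f2(x)])

print(x)  # approaches the unique critical point (0, 0), a differential Nash equilibrium
```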
The collection of costs $(f_1, \ldots, f_n)$ on $X$, where $f_i : X \to \mathbb{R}$ is agent $i$'s cost function and $X_i$ is their action space, defines a continuous game. In this continuous game abstraction, each player $i \in \mathcal{I}$ aims to select an action $x_i \in X_i$ that minimizes their cost $f_i(x_i, x_{-i})$ given the actions of all other agents, $x_{-i} \in X_{-i}$. From this perspective, the learning algorithm in (1) can be interpreted as follows: players myopically update their actions by following the gradient of their cost with respect to their own choice variable.
Assumption 1. For each $i \in \mathcal{I}$, $f_i \in C^q(X, \mathbb{R})$ for $q \geq 2$ and $\omega(x) = (D_1 f_1(x), \ldots, D_n f_n(x))$ is $L$-Lipschitz.
Let $D_i^2 f_i$ denote the second partial derivative of $f_i$ with respect to $x_i$ and $D_{ji} f_i$ denote the partial derivative of $D_i f_i$ with respect to $x_j$. The game Jacobian—i.e., the Jacobian of $\omega$—is given by
\[
J(x) = \begin{bmatrix} D_1^2 f_1(x) & \cdots & D_{1n} f_1(x) \\ \vdots & \ddots & \vdots \\ D_{n1} f_n(x) & \cdots & D_n^2 f_n(x) \end{bmatrix}.
\]
The entries of the above matrix depend on $x$; however, we drop this dependence where obvious. Note that each $D_i^2 f_i$ is symmetric under Assumption 1, yet $J$ is not. This is an important point and causes the subsequent analysis to deviate from the typical analysis of (stochastic) gradient descent.

The most common characterization of limiting behavior in games is that of a Nash equilibrium. The following definitions are useful for our analysis.
Definition 1. A strategy $x \in X$ is a local Nash equilibrium for the game $(f_1, \ldots, f_n)$ if for each $i \in \mathcal{I}$ there exists an open set $W_i \subset X_i$ such that $x_i \in W_i$ and $f_i(x_i, x_{-i}) \leq f_i(x_i', x_{-i})$ for all $x_i' \in W_i$. If the above inequalities are strict, $x$ is a strict local Nash equilibrium.
Definition 2. A point $x \in X$ is said to be a critical point for the game if $\omega(x) = 0$.

We denote the set of critical points of a game $\mathcal{G} = (f_1, \ldots, f_n)$ as $\mathcal{C}(\mathcal{G}) = \{x \in X \mid \omega(x) = 0\}$. Analogous to single-player optimization, viewing all other players' actions as fixed, there are necessary and sufficient conditions which characterize local optimality for each player.
Proposition 1 (Ratliff et al. (2016)). If $x$ is a local Nash equilibrium of the game $(f_1, \ldots, f_n)$, then $\omega(x) = 0$ and $D_i^2 f_i(x) \geq 0$. On the other hand, if $\omega(x) = 0$ and $D_i^2 f_i(x) > 0$, then $x \in X$ is a local Nash equilibrium.
The sufficient conditions in the above result give rise to the following definition of a differential Nash equilibrium.

Definition 3 (Ratliff et al. (2013); Ratliff et al. (2016)). A strategy $x \in X$ is a differential Nash equilibrium if $\omega(x) = 0$ and $D_i^2 f_i(x) > 0$ for each $i \in \mathcal{I}$.
Differential Nash equilibria need not be isolated. However, if $J(x)$ is non-degenerate—meaning that $\det J(x) \neq 0$—for a differential Nash equilibrium $x$, then $x$ is an isolated strict local Nash equilibrium. Non-degenerate differential Nash equilibria are generic amongst local Nash equilibria and they are structurally stable (Ratliff et al., 2014), which ensures they persist under small perturbations. This also implies an asymptotic convergence result: if the spectrum of $J$ is strictly in the right-half plane (i.e., $\operatorname{spec}(J(x)) \subset \mathbb{C}^{\circ}_{+}$), then a differential Nash equilibrium $x$ is (exponentially) attracting under the flow of $-\omega$ (Ratliff et al., 2016, Prop. 2). We say such equilibria are stable.
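As an illustration of these conditions, the sketch below checks, for a hypothetical two-player quadratic game, whether a critical point is a differential Nash equilibrium and whether it is stable in the above sense; the particular costs and the resulting Jacobian are assumptions made for the example.

```python
import numpy as np

# Illustrative quadratic game with scalar actions x1, x2:
#   f1(x1, x2) = 0.5*x1**2 + 2*x1*x2,   f2(x1, x2) = 0.5*x2**2 - 2*x1*x2,
# so omega(x) = (x1 + 2*x2, x2 - 2*x1) and the only critical point is x* = (0, 0).
J = np.array([[1.0,  2.0],
              [-2.0, 1.0]])            # game Jacobian (constant for quadratic costs)

# Differential Nash conditions: omega(x*) = 0 and D_i^2 f_i(x*) > 0 for each i.
is_diff_nash = J[0, 0] > 0 and J[1, 1] > 0

# Stability: spectrum of J strictly in the open right-half plane.
eigs = np.linalg.eigvals(J)
is_stable = bool(np.all(eigs.real > 0))

print("eigenvalues of J:", eigs)                 # 1 +/- 2j for this example
print("differential Nash equilibrium:", is_diff_nash)
print("stable:", is_stable)
print("J symmetric (gradient flow)?", np.allclose(J, J.T))  # False: not a potential game
```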
3 DETERMINISTIC SETTING
Let us first consider the setting in which each agent $i$ has oracle access to $g_i$ and their learning rates are non-uniform but constant—i.e., $\gamma_{i,k} \equiv \gamma_i$. The learning dynamics are given by
\[
x_{k+1} = x_k - \Gamma\, \omega(x_k) \tag{2}
\]
where $\Gamma = \operatorname{blockdiag}(\gamma_1 I_{d_1}, \ldots, \gamma_n I_{d_n})$ with $I_{d_i}$ denoting the $d_i \times d_i$ identity matrix.
3.1 ASYMPTOTIC ANALYSIS
For a stable differential Nash equilibrium $x^*$, let $\mathcal{R}(x^*)$ denote the region of attraction for $x^*$. Denote by $\rho(A)$ the spectral radius of the matrix $A$.
Proposition 2. Consider an $n$-player game $\mathcal{G} = (f_1, \ldots, f_n)$ satisfying Assumption 1. Let $x^* \in X$ be a stable differential Nash equilibrium. Suppose agents use the gradient-based learning rule $x_{k+1} = x_k - \Gamma \omega(x_k)$ with learning rates $\gamma_i$ such that $\rho(I - \Gamma J(x)) < 1$ for all $x \in \mathcal{R}(x^*)$. Then, for $x_0 \in \mathcal{R}(x^*)$, $x_k \to x^*$ exponentially.
The proof is a direct application of Ostrowski's theorem (Ostrowski, 1966).
Remark 1. If all the agents have the same learning rate—i.e., for each $i \in \mathcal{I}$, $\gamma_i = \gamma$—then the condition that $\rho(I - \Gamma J(x)) < 1$ on $\mathcal{R}(x^*)$ can be written as $0 < \gamma < \tilde{\gamma}$, where $\tilde{\gamma}$ is the smallest positive $h$ such that $\max_j |1 - h\lambda_j(J(x^*))| = 1$. If the game is a potential game—i.e., there exists a function $\phi$ such that $D_i f_i = D_i \phi$ for each $i$, which occurs if and only if $D_{ij} f_i = D_{ji} f_j$—then the convergence analysis coincides with that of gradient descent, so that any $\gamma < 1/L$, where $L$ is the Lipschitz constant of $\omega$, results in local asymptotic convergence.
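As a sanity check on the conditions in Proposition 2 and Remark 1, the short sketch below computes $\rho(I - \Gamma J(x^*))$ for a candidate choice of non-uniform learning rates and, for comparison, the uniform-rate threshold $\tilde{\gamma}$; the Jacobian and rates are illustrative assumptions.

```python
import numpy as np

def spectral_radius(A):
    return np.max(np.abs(np.linalg.eigvals(A)))

# Illustrative game Jacobian at a stable differential Nash equilibrium.
J = np.array([[1.0,  2.0],
              [-2.0, 1.0]])
Gamma = np.diag([0.3, 0.1])          # candidate non-uniform learning rates

rho = spectral_radius(np.eye(2) - Gamma @ J)
print("rho(I - Gamma J) =", rho)     # < 1 here, so the update contracts locally

# Uniform-rate threshold from Remark 1: largest gamma with max_j |1 - gamma*lambda_j| < 1,
# which for eigenvalues lambda_j equals min_j 2*Re(lambda_j)/|lambda_j|^2.
lams = np.linalg.eigvals(J)
gamma_tilde = np.min(2.0 * lams.real / np.abs(lams) ** 2)
print("uniform-rate threshold:", gamma_tilde)   # 0.4 for this J
```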
Mazumdar and Ratliff (2018) show that (2) will almost surely avoid strict saddle points of the dynamics, some of which are Nash equilibria in non-zero-sum games. Moreover, except on a set of measure zero, (2) will converge to a stable attractor of $\dot{x} = -\omega(x)$, which includes stable local non-Nash critical points. Since $\omega$ is not a gradient flow, the set of attractors may also include limit cycles.
3.2 FINITE SAMPLE ANALYSIS
Throughout this subsection we need the following notation. For a symmetric matrix $A \in \mathbb{R}^{d \times d}$, let $\lambda_d(A) \leq \cdots \leq \lambda_1(A)$ be its eigenvalues. Let $S(x) = \frac{1}{2}(J(x) + J(x)^T)$ be the symmetric part of $J(x)$. Define $\alpha = \min_{x \in B_r(x^*)} \lambda_d(S(x)^T S(x))$ and $\beta = \max_{x \in B_r(x^*)} \lambda_1(J(x)^T J(x))$, where $B_r(x^*)$ is an $r$-radius ball around $x^*$. Let $B_{r_0}(x^*)$ with $0 < r_0 < \infty$ be the largest ball contained in the region of attraction of $x^*$—i.e., $B_{r_0}(x^*) \subset \mathcal{R}(x^*)$.
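When $J(x)$ is available in closed form, $\alpha$ and $\beta$ can be estimated by sampling the ball $B_r(x^*)$; the sketch below does this crudely for an assumed two-player game with a nonlinear $\omega$, purely as an illustration of the definitions.

```python
import numpy as np

rng = np.random.default_rng(1)

def J(x):
    # Jacobian of an assumed (illustrative) omega(x) = (x1 + 2*x2 + 0.1*x1**3, x2 - 2*x1)
    return np.array([[1.0 + 0.3 * x[0] ** 2, 2.0],
                     [-2.0,                  1.0]])

x_star, r = np.array([0.0, 0.0]), 0.5
alpha, beta = np.inf, 0.0
for _ in range(2000):
    # sample a point uniformly from the ball B_r(x*)
    v = rng.normal(size=2)
    x = x_star + r * np.sqrt(rng.uniform()) * v / np.linalg.norm(v)
    Jx = J(x)
    S = 0.5 * (Jx + Jx.T)
    alpha = min(alpha, np.min(np.linalg.eigvalsh(S.T @ S)))
    beta = max(beta, np.max(np.linalg.eigvalsh(Jx.T @ Jx)))

print("alpha ~", alpha, " beta ~", beta)   # these quantities enter the finite-time bounds below
```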
Letting $g(x) = x - \Gamma\omega(x)$, since $\omega \in C^q$ for some $q \geq 1$, $g \in C^q$, and the expansion
\[
g(x) = g(x^*) + (I - \Gamma J(x^*))(x - x^*) + R(x - x^*)
\]
holds, where $R$ satisfies $\lim_{x \to x^*} \|R(x - x^*)\| / \|x - x^*\| = 0$, so that given $c > 0$, there exists an $r > 0$ such that $\|R(x - x^*)\| \leq c\|x - x^*\|$ for all $x \in B_r(x^*)$.
Proposition 3. Suppose that $\|I - \Gamma J(x)\| < 1$ for all $x \in B_{r_0}(x^*) \subset \mathcal{R}(x^*)$, so that there exist $r', r''$ such that $\|I - \Gamma J(x)\| \leq r' < r'' < 1$ for all $x \in B_{r_0}(x^*)$. For $1 - r'' > 0$, let $0 < r < \infty$ be the largest $r$ such that $\|R(x - x^*)\| \leq (1 - r'')\|x - x^*\|$ for all $x \in B_r(x^*)$. Furthermore, let $x_0 \in B_{r^*}(x^*)$, where $r^* = \min\{r, r_0\}$, be arbitrary. Then, given $\varepsilon > 0$, gradient-based learning with learning rates $\Gamma$ obtains an $\varepsilon$-differential Nash equilibrium in finite time—i.e., $x_t \in B_\varepsilon(x^*)$ for all $t \geq T = \lceil \tfrac{1}{\delta} \log(r^*/\varepsilon) \rceil$, where $\delta = r'' - r'$.
With some modification, the proof follows the proof of Theorem 1 in (Argyros, 1999); we provide it in Appendix A.1 for completeness.
Remark 2. We note that the proposition can be stated more generally under the assumption that $\rho(I - \Gamma J(x)) < 1$, in which case there exists some $\delta$ defined in terms of bounds on powers of $I - \Gamma J$. We provide the proof of this in Appendix A.1. We also note that these results hold even if $\Gamma$ is not a diagonal matrix, as we have assumed, as long as $\rho(I - \Gamma J(x)) < 1$.
A perhaps more interpretable finite bound, stated in terms of the game structure, can also be obtained. Consider the case in which players adopt learning rates $\gamma_i = \sqrt{\alpha}/(\beta k_i)$ with $k_i \geq 1$. Given a stable differential Nash equilibrium $x^*$, let $B_r(x^*)$ be the largest ball of radius $r$ contained in the region of attraction on which $\tilde{S} \equiv \frac{1}{2}(\tilde{J}^T + \tilde{J})$ is positive definite, where $\tilde{\omega} = (D_i f_i / k_i)_{i \in \mathcal{I}}$ so that $\tilde{J} \equiv D\tilde{\omega}$, and define $\tilde{\alpha} = \min_{x \in B_r(x^*)} \lambda_d(\tilde{S}(x)^T \tilde{S}(x))$ and $\tilde{\beta} = \max_{x \in B_r(x^*)} \lambda_1(\tilde{J}(x)^T \tilde{J}(x))$.
Theorem 1. Suppose that Assumption 1 holds and that $x^* \in X$ is a stable differential Nash equilibrium. Let $x_0 \in B_r(x^*)$, $\alpha < k_{\min}\beta$, $\sqrt{\alpha}/k_{\min} \leq \sqrt{\tilde{\alpha}}$, and for each $i$, $\gamma_i = \sqrt{\alpha}/(\beta k_i)$ with $k_i \geq 1$. Then, given $\varepsilon > 0$, the gradient-based learning dynamics with learning rates $\gamma_i$ obtain an $\varepsilon$-differential Nash equilibrium such that $x_t \in B_\varepsilon(x^*)$ for all $t \geq \lceil \tfrac{2\beta k_{\min}}{\alpha} \log(\tfrac{r}{\varepsilon}) \rceil$.
Proof. First, note that $\|x_{k+1} - x^*\| = \|g(x_k) - g(x^*)\|$ where $g(x) = x - \Gamma\omega(x)$. Now, $x_0 \in B_r(x^*)$ so that by the mean value theorem,
\[
\|g(x_0) - g(x^*)\| = \Big\| \int_0^1 Dg(\tau x_0 + (1-\tau)x^*)(x_0 - x^*)\, d\tau \Big\| \leq \sup_{x \in B_r(x^*)} \|Dg(x)\| \, \|x_0 - x^*\|.
\]
Hence, it suffices to show that for the choice of $\Gamma$, the eigenvalues of $I - \Gamma J(x)$ live in the unit circle, and then use an inductive argument. Let $\Lambda = \operatorname{diag}(1/k_1, \ldots, 1/k_n)$ so that we need to show that $I - \gamma\Lambda D\omega$ has eigenvalues in the unit circle. Since $\omega(x^*) = 0$, we have that
\[
\|x_{k+1} - x^*\|_2 = \|x_k - x^* - \gamma\Lambda(\omega(x_k) - \omega(x^*))\|_2 \leq \sup_{x \in B_r(x^*)} \|I - \gamma\Lambda J(x)\|_2 \, \|x_k - x^*\|_2.
\]
If $\sup_{x \in B_r(x^*)} \|I - \gamma\Lambda J(x)\|_2$ is less than one, where the norm is the operator $2$-norm, then the dynamics are contracting. Indeed, observe that the singular values of $\Lambda J^T J \Lambda$ are the same as those of $J^T \Lambda^2 J$ since the latter is positive-definite symmetric. By noting that $\|A\|_2 = \sigma_{\max}(A)$ and employing Cauchy-Schwarz, we get that $\|\Lambda\|_2^2 \|J^T J\|_2 \geq \|\Lambda J^T J \Lambda\|_2$. Thus,
\[
(I - \gamma\Lambda J)^T (I - \gamma\Lambda J) \leq \Big(1 - 2\gamma\lambda_d(\tilde{S}) + \frac{\gamma^2 \lambda_1(J^T J)}{k_{\min}^2}\Big) I \leq \Big(1 - \frac{2\gamma\sqrt{\alpha}}{k_{\min}} + \frac{\alpha}{\beta k_{\min}}\Big) I = \Big(1 - \frac{\alpha}{\beta k_{\min}}\Big) I.
\]
Using the above to bound $\sup_{x \in B_r(x^*)} \|I - \gamma\Lambda J(x)\|_2$, we have
\[
\|x_{k+1} - x^*\|_2 \leq \Big(1 - \frac{\alpha}{\beta k_{\min}}\Big)^{1/2} \|x_k - x^*\|_2.
\]
Since $\alpha < k_{\min}\beta$, $(1 - \alpha/(\beta k_{\min})) < e^{-\alpha/(\beta k_{\min})}$, so that $\|x_T - x^*\|_2 \leq e^{-T\alpha/(2k_{\min}\beta)} \|x_0 - x^*\|_2$. This, in turn, implies that for all $t \geq \lceil \tfrac{2\beta k_{\min}}{\alpha} \log(r/\varepsilon) \rceil$, $x_t \in B_\varepsilon(x^*)$.
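The iteration bound in Theorem 1 is straightforward to evaluate once $\alpha$, $\beta$, $k_{\min}$, and $r$ are known (or estimated numerically over $B_r(x^*)$); the sketch below simply evaluates $T$ for assumed, illustrative values of these game-dependent quantities.

```python
import math

# Assumed (illustrative) values of the quantities appearing in Theorem 1.
alpha = 0.5    # min over B_r(x*) of lambda_d(S(x)^T S(x))
beta = 4.0     # max over B_r(x*) of lambda_1(J(x)^T J(x))
k_min = 1.0    # smallest learning-rate scaling k_i
r = 1.0        # radius of the ball containing x_0
eps = 1e-3     # target accuracy

assert alpha < k_min * beta            # hypothesis of Theorem 1

T = math.ceil(2 * beta * k_min / alpha * math.log(r / eps))
print("iterations until x_t remains in B_eps(x*):", T)   # 111 for these values
```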
Multiple learning rates lead to a scaling of the rows of $J$, which can have a significant effect on the eigenstructure of the matrix, thereby making it difficult to reason about the relationship between $\Gamma J$ and $J$. Nonetheless, there are numerous approaches to solving nonlinear systems of equations that employ preconditioning (i.e., coordinate scaling). The purpose of using a preconditioning matrix is to rescale the problem and achieve more stable or faster convergence. In games, however, the interpretation is slightly different since each of the coordinates of the dynamics corresponds to minimizing a different cost function along the respective coordinate axis. The resulting effect is a distortion of the vector field that tends to lead the joint action to a point with a lower cost for the slower player, relative to the flow of the dynamics under a uniform learning rate and the same initialization. In this sense, it appears most beneficial for an agent to have the slower learning rate, which is suggestive of desirable qualities for synthesized algorithms. In the case of autonomous learning agents, perhaps this reveals an interesting direction of future research in terms of synthesizing games or learning rules via incentivization (Ratliff and Fiez, 2018) or reward shaping (Ng et al., 1999) for either coordinating agents or improving the learning process.
4 STOCHASTIC SETTING
Consider the setting where agents no longer have oracle access to their individual gradients, but rather have an unbiased estimator $g_i \equiv \widehat{D_i f_i}$ and a time-varying learning rate $\gamma_{i,k}$. For the sake of brevity, we show the convergence result in detail for the two-agent case—that is, where $\mathcal{I} = \{1, 2\}$. We note that the extension to $n$ agents is straightforward.
The gradient-based learning rules are given by
\[
x_{i,k+1} = x_{i,k} - \gamma_{i,k}\big(D_i f_i(x_k) + w_{i,k+1}\big) \tag{3}
\]
so that with $\gamma_{2,k} = o(\gamma_{1,k})$, in the limit $\tau \to 0$, the above system can be thought of as approximating the singularly perturbed system defined as follows:
\[
\dot{x}_1(t) = -D_1 f_1(x_1(t), x_2(t)) \tag{4}
\]
\[
\dot{x}_2(t) = -\tau D_2 f_2(x_1(t), x_2(t)) \tag{5}
\]
Indeed, since $\lim_{k \to \infty} \gamma_{2,k}/\gamma_{1,k} \to 0$—i.e., $\gamma_{2,k} \to 0$ at a faster rate than $\gamma_{1,k}$—updates to $x_1$ appear to be equilibrated for the current quasi-static $x_2$.
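The following is a rough sketch of the two-timescale stochastic update (3) for a hypothetical two-player quadratic game, with unbiased gradient estimates formed by adding zero-mean noise to the true individual gradients and learning rates chosen so that $\gamma_{2,k} = o(\gamma_{1,k})$; the costs, noise level, and rate schedules are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 0.9

def D1f1(x):  # player 1 (fast timescale): f1 = 0.5*x1**2 + a*x1*x2
    return x[0] + a * x[1]

def D2f2(x):  # player 2 (slow timescale): f2 = 0.5*x2**2 - a*x1*x2
    return x[1] - a * x[0]

x = np.array([1.0, -1.0])
for k in range(1, 20001):
    g1 = k ** -0.6                        # fast stepsize
    g2 = k ** -0.9                        # slow stepsize: g2 = o(g1), both square-summable
    noise = rng.normal(0.0, 0.1, size=2)  # zero-mean noise gives an unbiased estimator
    grads = np.array([D1f1(x), D2f2(x)]) + noise
    x = x - np.array([g1, g2]) * grads

print(x)  # concentrates near the stable differential Nash equilibrium (0, 0)
```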
In this section, we require some modified assumptions on the structure of the learning process.
Assumption 2. For the gradient-based learning rule (3), suppose that the following hold:

A2a. Given the filtration $\mathcal{F}_k = \sigma(x_s, w_{1,s}, w_{2,s},\, s \leq k)$, $\{w_{i,k+1}\}_{i \in \mathcal{I}}$ are conditionally independent, $\mathbb{E}[w_{i,k+1} \mid \mathcal{F}_k] = 0$ almost surely (a.s.), and $\mathbb{E}[\|w_{i,k+1}\| \mid \mathcal{F}_k] \leq c_i(1 + \|x_k\|)$ a.s. for some constants $c_i \geq 0$, $i \in \mathcal{I}$.

A2b. The stepsize sequences $\{\gamma_{i,k}\}_k$, $i \in \mathcal{I}$, are positive scalars satisfying: (i) $\sum_i \sum_k \gamma_{i,k}^2 < \infty$; (ii) $\sum_k \gamma_{i,k} = \infty$, $i \in \mathcal{I}$; (iii) $\gamma_{2,k} = o(\gamma_{1,k})$.

A2c. Each $f_i \in C^q(\mathbb{R}^d, \mathbb{R})$ for some $q \geq 3$, and each $f_i$ and $\omega$ are $L_i$- and $L_\omega$-Lipschitz, respectively.
Assumption 3. For fixed $x_2$, $\dot{x}_1(t) = -D_1 f_1(x_1(t), x_2)$ has a globally asymptotically stable equilibrium $\lambda(x_2)$.
4.1 ASYMPTOTIC GUARANTEES
The following lemma follows from classical analysis (see, e.g., Borkar (2008, Chap. 6) or Bhatnagar and Prasad (2013, Chap. 3)). Define the event $\mathcal{E} = \{\sup_k \sum_i \|x_{i,k}\|^2 < \infty\}$.

Lemma 1. Under Assumptions 2 and 3, conditioned on the event $\mathcal{E}$, $(x_{1,k}, x_{2,k}) \to \{(\lambda(x_2), x_2) \mid x_2 \in \mathbb{R}^{d_2}\}$ a.s.
Let $t_k = \sum_{l=0}^{k-1} \gamma_{2,l}$ be the continuous time accumulated after $k$ samples of $x_2$. Define $x_2(t, s, x_s)$ for $t \geq s$ to be the trajectory of $\dot{x}_2 = -D_2 f_2(\lambda(x_2), x_2)$.
Theorem 2. Suppose Assumptions 2 and 3 hold. Then, for any $K > 0$,
\[
\lim_{k \to \infty} \sup_{0 \leq h \leq K} \|x_{2,k+h} - x_2(t_{k+h}, t_k, x_k)\|_2 = 0
\]
conditioned on $\mathcal{E}$.
Proof. The proof invokes Lemma 1 above and (Benaïm, 1999, Prop. 4.1 and 4.2). Indeed, by Lemma 1, $(\lambda(x_{2,k}) - x_{2,k}) \to 0$ a.s. Hence, we can study the sample path generated by $x_{2,k+1} = x_{2,k} - \gamma_{2,k}(D_2 f_2(\lambda(x_{2,k}), x_{2,k}) + w_{2,k+1})$. Since $D_2 f_2 \in C^{q-1}$ for some $q \geq 3$, it is locally Lipschitz and, on the event $\mathcal{E}$, it is bounded. It thus induces a continuous globally integrable vector field, and therefore satisfies the assumptions of Prop. 4.1 of (Benaïm, 1999). Moreover, under Assumption 2, the assumptions of Prop. 4.2 of (Benaïm, 1999) are satisfied. Invoking said propositions gives the desired result.
This result essentially says that the slow player's sample path asymptotically tracks the flow of $\dot{x}_2 = -D_2 f_2(\lambda(x_2), x_2)$. If we additionally assume that the slow component also has a global attractor, then the above theorem gives rise to a stronger convergence result.
Assumption 4. Given $\lambda(\cdot)$ as in Assumption 3, $\dot{x}_2(t) = -\tau D_2 f_2(\lambda(x_2(t)), x_2(t))$ has a globally asymptotically stable equilibrium $x_2^*$.
Corollary 1. Under the assumptions of Theorem 2 and Assumption 4, conditioned on $\mathcal{E}$, gradient-based learning converges a.s. to a stable attractor $(x_1^*, x_2^*)$ where $x_1^* = \lambda(x_2^*)$, the set of which contains the stable differential Nash equilibria.
More generally, the process $(x_{1,k}, x_{2,k})$ will converge almost surely to the internally chain transitive set of the limiting dynamics (5), and this set contains the stable Nash equilibria. If the only internally chain transitive sets for (5) are isolated equilibria (this occurs, e.g., if the game is a potential game), then $x_k$ converges almost surely to a stationary point of the dynamics, a subset of which are stable local Nash equilibria. It is also worth commenting on what types of games will satisfy these assumptions. To satisfy Assumption 3, it is sufficient that the fastest player has a convex cost in their choice variable.
Proposition 4. Suppose Assumptions 2 and 4 hold and that $f_1(\cdot, x_2)$ is convex. Conditioned on $\mathcal{E}$, the sample points of gradient-based learning satisfy $(x_{1,k}, x_{2,k}) \to \{(\lambda(x_2), x_2) \mid x_2 \in \mathbb{R}^{d_2}\}$ a.s. Moreover, $(x_{1,k}, x_{2,k}) \to (x_1^*, x_2^*)$ a.s., where $x_1^* = \lambda(x_2^*)$.
Note that $(x_1^*, x_2^*)$ could still be a spurious stable non-Nash point, since the above only implies $D(D_2 f_2(\lambda(\cdot), \cdot))\big|_{x_2^*} > 0$, which does not necessarily imply that $D_2^2 f_2(\lambda(x_2^*), x_2^*) > 0$.
Remark 3 (Relaxing Assumptions: Local Asymptotic Stability). Under relaxed assumptions on global asymptotic stability (i.e., if Assumptions 3 and 4 are relaxed to local asymptotic stability), we can obtain high-probability results on convergence to locally asymptotically stable attractors. However, this requires conditioning on an unverifiable event—i.e., the high-probability bound in this case is conditioned on the event that $\{x_{1,k}\}$ belongs to a compact set $B$, which depends on the sample point, of $\cap_{x_2} \mathcal{R}(\lambda(x_2))$, where $\mathcal{R}(\lambda(x_2))$ is the region of attraction of $\lambda(x_2)$. Nonetheless, it is possible to leverage results from stochastic approximation (Karmakar and Bhatnagar, 2018), (Borkar, 2008, Chap. 2) to prove local versions of the results for non-uniform learning rates. Further investigation is required to provide concentration bounds not only for games but for stochastic approximation in general.
4.2 CONCENTRATION BOUNDS
In the stochastic setting, the learning dynamics are stochastic approximation updates, and non-uniform learning rates lead to a multi-timescale setting. The concentration bounds we derive leverage very recent results—e.g., (Borkar and Pattathil, 2018)—from stochastic approximation; our objective here is to show that they apply to games and to provide commentary on the interpretation of the results in this context.
For a stable differential Nash equilibrium $x^* = (\lambda(x_2^*), x_2^*)$, using the bounds in Lemma 1 and Lemma 2 in Appendix A.2, we can provide a high-probability guarantee that $(x_{1,k}, x_{2,k})$ gets locked in to a ball around $(\lambda(x_2^*), x_2^*)$.
Let $\bar{x}_i(\cdot)$ denote the linear interpolates between sample points $x_{i,k}$ and, as in the preceding subsection, let $x_i(\cdot, t_{i,k}, x_k)$ denote the continuous time flow of $\dot{x}_i$ with initial data $(t_{i,k}, x_k)$, where $t_{i,k} = \sum_{l=0}^{k-1} \gamma_{i,l}$. Define also $\tau_k = \gamma_{2,k}/\gamma_{1,k}$. Alekseev's formula is a nonlinear variation of constants formula that provides solutions to perturbations of differential equations using a local linear approximation. We can apply this to the asymptotic pseudo-trajectories $\bar{x}_i(\cdot)$ in each timescale. For these local approximations, linear systems theory lets us find growth rate bounds for the perturbations, which can, in turn, be used to bound the normed difference between the continuous time flow and the asymptotic pseudo-trajectories. More detail is provided in Appendix A.2.
Towards this end, fix $\varepsilon \in [0, 1)$ and let $N$ be such that $\gamma_{1,n} \leq \varepsilon/(8K)$ and $\tau_n \leq \varepsilon/(8K)$ for all $n \geq N$. Define $t_{1,k} = \tilde{t}_k$ and $t_{2,k} = \hat{t}_k$. Let $n_0 \geq N$ and, with $K$ as in Lemma 1 (Appendix A.2), let $T$ be such that $e^{-\kappa_1(\tilde{t}_n - \tilde{t}_{n_0})} H_{n_0} \leq \varepsilon/(8K)$ for all $n \geq n_0 + T$, where $\kappa_1 > 0$ is a constant derived from Alekseev's formula applied to $\bar{x}_1(\cdot)$. Moreover, with $\bar{K}$ as in Lemma 2 (Appendix A.2), let $e^{-\kappa_2(\hat{t}_n - \hat{t}_{n_0})} \|\bar{x}_2(\hat{t}_{n_0}) - x_2(\hat{t}_{n_0})\| \leq \varepsilon/(8\bar{K})$ for all $n \geq n_0 + T$, where $\kappa_2 > 0$ is a constant derived from Alekseev's formula applied to $\bar{x}_2(\cdot)$.
Theorem 3. Suppose that Assumptions 2–4 hold and let $\gamma_{2,k} = o(\gamma_{1,k})$. Given a stable differential Nash equilibrium $x^* = (\lambda(x_2^*), x_2^*)$, player 1's sample path generated by (3) for $i = 1$ will asymptotically track $z_k = \lambda(x_{2,k})$, and given $\varepsilon \in [0, 1)$, $x_k$ will get 'locked in' to an $\varepsilon$-neighborhood with high probability conditioned on reaching $B_{r_0}(x^*)$ by iteration $n_0$. That is, letting $\bar{n} = n_0 + T + 1$, for some $C_1, C_2 > 0$,
\[
\mathbb{P}\big(\|x_{1,n} - z_n\| \leq \varepsilon,\ \forall n \geq \bar{n} \,\big|\, x_{1,n_0}, z_{n_0} \in B_{r_0}\big) \geq 1 - \sum_{n=n_0}^{\infty} C_1 \exp\Big(\frac{-C_2\sqrt{\varepsilon}}{\sqrt{\gamma_{1,n}}}\Big) - \sum_{n=n_0}^{\infty} C_1 \exp\Big(\frac{-C_2\sqrt{\varepsilon}}{\sqrt{\tau_n}}\Big) - \sum_{n=n_0}^{\infty} C_1 \exp\Big(\frac{-C_2\varepsilon^2}{\beta_n}\Big) \tag{6}
\]
with $\beta_n = \max_{n_0 \leq k \leq n-1} e^{-\kappa_1\left(\sum_{i=k+1}^{n-1} \gamma_{1,i}\right)} \gamma_{1,k}$. Moreover, for some constants $\tilde{C}_1, \tilde{C}_2 > 0$,
\[
\mathbb{P}\big(\|x_{2,n} - x_2(\hat{t}_n)\| \leq \varepsilon,\ \forall n \geq \bar{n} \,\big|\, x_{n_0}, z_{n_0} \in B_{r_0}\big) \geq 1 - \sum_{n=n_0}^{\infty} \tilde{C}_1 \exp\Big(\frac{-\tilde{C}_2\sqrt{\varepsilon}}{\sqrt{\gamma_{1,n}}}\Big) - \sum_{n=n_0}^{\infty} \tilde{C}_1 \exp\Big(\frac{-\tilde{C}_2\sqrt{\varepsilon}}{\sqrt{\tau_n}}\Big) - \sum_{n=n_0}^{\infty} \tilde{C}_1 \exp\Big(\frac{-\tilde{C}_2\varepsilon^2}{\beta_n}\Big) - \sum_{n=n_0}^{\infty} \tilde{C}_1 \exp\Big(\frac{-\tilde{C}_2\varepsilon^2}{\eta_n}\Big) \tag{7}
\]
with $\eta_n = \max_{n_0 \leq k \leq n-1} \big(e^{-\kappa_2\left(\sum_{i=k+1}^{n-1} \gamma_{2,i}\right)} \gamma_{2,k}\big)$.
Corollary 2. Fix $\varepsilon \in [0, 1)$ and suppose that $\gamma_{1,n} \leq \varepsilon/(8K)$ for all $n \geq 0$. With $K$ as in Lemma 1 (Appendix A.2), let $T$ be such that $e^{-\kappa_1(\tilde{t}_n - \tilde{t}_0)} H_0 \leq \varepsilon/(8K)$ for all $n \geq T$. Furthermore, with $\bar{K}$ as in Lemma 2 (Appendix A.2), let $e^{-\kappa_2(\hat{t}_n - \hat{t}_0)} \|\bar{x}_2(\hat{t}_0) - x_2(\hat{t}_0)\| \leq \varepsilon/(8\bar{K})$ for all $n \geq T$. Under the assumptions of Theorem 3, $x_k$ will get 'locked in' to an $\varepsilon$-neighborhood with high probability conditioned on $x_0 \in B_{r_0}(x^*)$, where the high-probability bound in (6) holds with $n_0 = 0$.
The key technique in proving the above theorem (carried out in detail in Borkar and Pattathil (2018), which in turn leverages results from Thoppe and Borkar (2018)) is first to compute the errors between the sample points of the stochastic learning rules and the continuous time flow generated by initializing the continuous time limiting dynamics at each sample point and flowing it forward for time $t_{n+1} - t_n$, doing this for each $x_{1,k}$ and $x_{2,k}$ separately and in their own timescales, and then to take a union bound over all the continuous time intervals defined for $n \geq n_0$.

In Appendix A.3, we specialize to the case of uniform learning rates, for which we can obtain tighter bounds by leveraging the results of (Thoppe and Borkar, 2018).
5 NUMERICAL EXAMPLES
We consider several examples that illustrate the effect that non-uniform learning rates have on the stability of the learning dynamics, the vector field, and the resulting equilibria of continuous games. These examples highlight the importance of studying the convergence properties of game dynamics in non-cooperative continuous games where agents may learn at different rates. Additional examples are provided in Appendix B.
5.1 EFFECTS OF PRECONDITIONING
Although the fixed points of the game dynamics are invariant under a change of learning rates—i.e., the solutions to $\omega = 0$ and $\Gamma\omega = 0$ are the same for diagonal $\Gamma$—the stability properties near such fixed points may change. The following example illustrates a counter-intuitive but non-degenerate situation in which an agent that decides to learn slower causes a stable differential Nash equilibrium to become unstable.

Consider a three-player continuous game where the Jacobian at a fixed point has positive-definite block diagonals and eigenvalues with strictly positive real parts. This implies the fixed point is a Nash equilibrium and that the dynamics $\dot{x} = -\omega(x)$ are stable in a neighborhood around the fixed point, say $x^*$. Now suppose an agent decides to slow down its learning by five times, from $\gamma$ to $\gamma/5$. We show via this simple example that this change can cause the learning dynamics to become unstable.
Suppose the Jacobian of the continuous time learning dynamics at the fixed point is
\[
J(x^*) = \begin{bmatrix} 2 & 9 & 0 \\ 0 & 2 & 6 \\ 9 & 0 & 12 \end{bmatrix},
\]
whose spectrum lies in the right-half complex plane, with eigenvalues $\{14.9,\ 0.5 \pm 6.0i\}$. However, preconditioning the Jacobian with $\Gamma = \operatorname{diag}(1, 1, 1/5)$, the eigenvalues of $\Gamma J$ are $\{6.7,\ -0.2 \pm 4.0i\}$, which indicate a saddle point.
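This can be verified directly; the short check below computes the eigenvalues of $J(x^*)$ and of $\Gamma J(x^*)$ for $\Gamma = \operatorname{diag}(1, 1, 1/5)$.

```python
import numpy as np

J = np.array([[2.0, 9.0, 0.0],
              [0.0, 2.0, 6.0],
              [9.0, 0.0, 12.0]])
Gamma = np.diag([1.0, 1.0, 0.2])   # the third player learns five times slower

print(np.linalg.eigvals(J))          # approx {14.9, 0.5 +/- 6.0j}: all in the right-half plane
print(np.linalg.eigvals(Gamma @ J))  # approx {6.7, -0.2 +/- 4.0j}: a pair crosses into the left-half plane
```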
5.2 TORUS GAME
The second example is a two-player game with the agents' joint strategy space on a torus. This example serves as a useful benchmark because it has multiple equilibria and they are completely characterizable. We visualize the warping of the regions of attraction of these equilibria under different learning rates, and the affinity of the "faster" player to its own zero line.

Each player's strategy space is the unit circle $S^1$, and their costs $f_i : S^1 \times S^1 \to \mathbb{R}$ are given by
\[
\begin{bmatrix} f_1(\theta_1, \theta_2) \\ f_2(\theta_1, \theta_2) \end{bmatrix} = \begin{bmatrix} -\alpha_1 \cos(\theta_1 - \phi_1) + \cos(\theta_1 - \theta_2) \\ -\alpha_2 \cos(\theta_2 - \phi_2) + \cos(\theta_2 - \theta_1) \end{bmatrix}
\]
where $\alpha_i$ and $\phi_i$ are constants and $\theta_i$ is player $i$'s choice variable. An interpretation of these costs is that each player wishes to be near $\phi_i$ but far from the other. This game has many applications, including those which abstract nicely to coupled oscillators. The vector of individual gradients is given by
\[
\omega(\theta_1, \theta_2) = \begin{bmatrix} \alpha_1 \sin(\theta_1 - \phi_1) - \sin(\theta_1 - \theta_2) \\ \alpha_2 \sin(\theta_2 - \phi_2) - \sin(\theta_2 - \theta_1) \end{bmatrix}. \tag{8}
\]
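A minimal simulation of the corresponding discrete-time gradient play on the torus is sketched below; the constants $\alpha = (1.0, 1.5)$ and $\phi = (0, \pi/8)$ match those used later in this section, while the learning rates and initialization are illustrative.

```python
import numpy as np

alpha = np.array([1.0, 1.5])
phi = np.array([0.0, np.pi / 8])

def omega(theta):
    # vector of individual gradients from (8)
    return np.array([
        alpha[0] * np.sin(theta[0] - phi[0]) - np.sin(theta[0] - theta[1]),
        alpha[1] * np.sin(theta[1] - phi[1]) - np.sin(theta[1] - theta[0]),
    ])

gamma = np.array([0.1, 0.02])      # non-uniform learning rates (illustrative)
theta = np.array([2.0, -2.0])      # initial joint strategy

for _ in range(5000):
    theta = theta - gamma * omega(theta)
    theta = np.mod(theta + np.pi, 2 * np.pi) - np.pi   # wrap angles to (-pi, pi]

print(theta)   # settles at one of the two stable differential Nash equilibria reported below
```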
Figure 1: The effects of non-uniform learning rates on the path of convergence to the equilibria. The zero lines for each player ($D_1 f_1 = 0$ or $D_2 f_2 = 0$) are plotted as the diagonal and curved lines, and the two stable Nash equilibria as circles (where $D_1^2 f_1 > 0$ and $D_2^2 f_2 > 0$). (a) In the deterministic setting, the regions of attraction for each equilibrium can be computed numerically. Four scenarios are shown, with combinations of fast and slow agents. The regions of attraction for each Nash equilibrium are warped under different learning rates. (b) In the stochastic setting, the samples (in black) approximate the differential equation $\dot{\theta} = -\Lambda\omega(\theta)$ (in red), where $\Lambda = \operatorname{diag}(1, \tau)$ for $\gamma_{2,k} = o(\gamma_{1,k})$ and $\Lambda = \operatorname{diag}(\tau, 1)$ for $\gamma_{1,k} = o(\gamma_{2,k})$. Two initializations and learning rate configurations are plotted.
Its game Jacobian has diagonal terms $\alpha_i \cos(\theta_i - \phi_i) - \cos(\theta_i - \theta_{-i})$, $i = 1, 2$, and off-diagonal terms $\cos(\theta_i - \theta_{-i})$, $i = 1, 2$. The Nash equilibria of this game occur where $\omega(\theta_1, \theta_2) = 0$ and where the diagonals of the game Jacobian are positive. Using constants $\phi = (0, \pi/8)$ and $\alpha = (1.0, 1.5)$, there are two Nash equilibria, situated at $(-1.063, 1.014)$ and $(1.408, -0.325)$, respectively. These equilibria happen to also be stable differential Nash equilibria, thus we expect the gradient dynamics to converge to them. Which equilibrium they converge to, however, depends on the