H-TD²: Hybrid Temporal Difference Learning for Adaptive Urban Taxi Dispatch
Benjamin Rivière and Soon-Jo Chung
Abstract—We present H-TD²: Hybrid Temporal Difference Learning for Taxi Dispatch, a model-free, adaptive decision-making algorithm to coordinate a large fleet of automated taxis in a dynamic urban environment to minimize expected customer waiting times. Our scalable algorithm exploits the natural transportation network company topology by switching between two behaviors: distributed temporal-difference learning computed locally at each taxi and infrequent centralized Bellman updates computed at the dispatch center. We derive a regret bound and design the trigger condition between the two behaviors to explicitly control the trade-off between computational complexity and the individual taxi policy's bounded sub-optimality; this advances the state of the art by enabling distributed operation with bounded sub-optimality. Additionally, unlike recent reinforcement-learning dispatch methods, this policy estimation is adaptive and robust to out-of-training-domain events. This result is enabled by a two-step modelling approach: the policy is learned on an agent-agnostic, cell-based Markov Decision Process, and individual taxis are coordinated using the learned policy in a distributed game-theoretic task assignment. We validate our algorithm against a receding horizon control baseline in a Gridworld environment with a simulated customer dataset, where the proposed solution decreases average customer waiting time by 50% over a wide range of parameters. We also validate in a Chicago city environment with real customer requests from the Chicago taxi public dataset, where the proposed solution decreases average customer waiting time by 26% over irregular customer distributions during a 2016 Major League Baseball World Series game.
Index Terms—Real-Time Taxi Dispatch, Adaptive Systems, Multi-Agent Systems, Distributed Decision-Making, Autonomous Vehicles
I. INTRODUCTION
Coordinating a large fleet of automated taxis in complex and
dynamic urban environments is an anticipated challenge for
transportation network companies such as Uber, Lyft, Waymo,
and Tesla. A typical urban mobility problem for these compa-
nies is taxi dispatch, where a fleet of taxis service customers
and the remaining, idle taxis are coordinated with a dispatch
algorithm to minimize the customer waiting time of future
requests. In practice, a transportation network company might
be composed of a dispatch center equipped with complete
information and a large computational budget and a fleet of
taxis, each operating with local information and a limited
amount of processing power and communication bandwidth
(see Fig. 1). In this manner, the transportation network com-
pany network can be decomposed into an underlying “star-
topology” network between taxis and the dispatch center, and
an arbitrary peer-to-peer network between taxis. The proposed
algorithm, H-TD², exploits this topology explicitly with a hybrid scheme that has two distinct behaviors: the central node computes exact, large-batch policies infrequently, and each taxi computes approximate, online updates with local information.

Figure 1. Concept graphic of an intelligent transportation network. Autonomous taxis, which can include both ground and air vehicles, estimate the customer demand in real-time and coordinate locally to behave with bounded sub-optimality.
The overview of H-TD² is shown in Fig. 2. At a given
timestep, the closest taxis service the new customer requests,
and the rest of the free taxis are dispatched to reduce expected
waiting time of future requests. The free taxis coordinate
with a distributed game theoretic scheme to optimize their
policy estimate, where the policy is estimated as follows:
the servicing taxis communicate the customer data to their
neighboring taxis, where the expected reward (e.g. customer
waiting time) is estimated with a distributed estimation algo-
rithm. Then, all taxis use the estimated reward to update their
policy estimate with temporal difference learning. If any of the
taxis determine that their policy estimate error is larger than
the user specified threshold, the taxi signals to the dispatch
center for a centralized policy update.
The contributions of the paper are stated as follows:
We derive a novel regret bound by leveraging distributed
estimation methods in local online policy estimation and
introduce a trigger condition to the batch update, permitting
the user to explicitly specify the policy’s computational and
communication expense vs. bounded sub-optimality trade-
off. This advances the state of the art by enabling distributed
operation with bounded sub-optimality.
We propose a taxi-dispatch solution that is adaptive, model-free, and coordinated. Unlike state-of-the-art reinforcement-learning dispatch methods, our method directly adapts the policy based on real-time data, thereby providing robustness to irregular urban mobility events such as traffic, weather, and major public events. This advancement is enabled by a two-step approach: first we propose a hybrid policy estimation in a finite-dimensional, agent-agnostic cell abstraction, and then we interface the resulting policy estimation for agent-based coordination with a local prescriptive game-theoretic task assignment.

Figure 2. Overview of H-TD², where blue represents the taxi network, yellow represents the customers, and green represents the dispatch center. The $i$-th taxi estimates the dispatch policy with local operations: distributed estimation of the reward, $R^i_t$, computed in (9), and temporal difference learning to update the policy, $Q^i_t$, in (13). If any of the taxis determines that its policy estimate error, $\delta_e$, is larger than the user-specified threshold, $\delta_d$, the taxi signals to the dispatch center to receive a centralized policy update, $Q^b_t$, from (6). Finally, each free taxi uses the policy in a game-theoretic formulation, $\Phi$ in (20), to find its dispatch position vector, $\mathbf{u}^i_t$.
We demonstrate the performance and computational properties
of our method with numerical experiments: our algorithm
reduces customer waiting time compared to a receding horizon
control baseline and the simulation runtime is linear with the
number of agents. We also validate our claim that adaptive
algorithms are robust to general irregular events with a case
study of the Chicago City taxis during the 2016 Major League
Baseball World Series.
The remainder of the paper is organized as follows: in
Sec. II, we review the related literature and compare our
method with the state of the art. In Sec. III, we present the
taxi dispatch problem description and a motivating example.
In Sec. IV, we present the cell-based Markov Decision Process
(MDP). In Sec. V, we discuss the exact and approximate
solutions to the MDP and the integration of the learned policy
into a game theoretic method. In Sec. VI, we present numerical
experiments demonstrating the advantages of our algorithm
compared to a receding horizon control baseline in simulated
and real customer datasets. The details of the fleet simulator
implementation are presented in Appendix A.
II. RELATED WORK
Recent urban mobility research has developed dynamic
and scalable methods. A well-studied example is the vehicle
routing and dial-a-ride problems [1], [2] where taxis find a
minimum cost path through a routing graph, and its dynamic
extension, where part or all of the customer information is
unknown and revealed dynamically. Recent dynamic routing
research proposes scalable solutions to dynamic routing prob-
lems with bio-inspired methods [3], data-driven methods [4]–
[6], and model-based methods [7], [8]. In this paper, we study
a variant of the dynamic routing problem, taxi dispatch, where
we propose a novel two-stage approach: distributed estimation
with temporal difference learning, and game-theoretic coordi-
nation. This advances the state of the art by permitting adaptive
distributed operation with bounded sub-optimality with respect
to the optimal centralized policy.
Taxi dispatch is an emerging urban mobility problem
where free taxis are dispatched to locations in the map to
minimize customer waiting time of future requests. Recent
approaches have adopted model-based [9], [10] and model-
free [11] methods. An online model-based method like [9]
uses real-time data to fit a system prediction model (for
example, customer demand and taxi supply) and then com-
pute a receding horizon control solution in response to that
model. Pre-specified system models can be over-restrictive,
and recent reinforcement-learning model-free methods [11]–
[13] have been used successfully to overcome this limitation.
In a model-free method, events are not explicitly modelled,
they are captured by the arbitrary dynamics of the underlying
reward. In regular operation, this reward is periodic, and can be
accurately predicted (either explicitly or implicitly) and used
for fleet control. However, it is possible that an irregular event
occurs out of training domain and causes the reward dynamics
to be unpredictable. In this case, we argue that it is better
to adapt in real-time than predict with irrelevant data. Our
model-free approach adapts the policy directly in response
to real-time data, achieving performance that is robust to
unpredictable, irregular events such as weather, accidents, and
major public events.
Our method leverages results from reinforcement learning
in convergence of temporal difference iteration [14]–[16] in
a dynamic environment, i.e. when the reward or transition
probabilities are changing over time. An alternative online
model-free approach is online actor-critic [17], where our work
differs from this result in two ways: we consider a general
non-quadratic reward function and we consider a multi-agent
setting by analyzing a hierarchical system of a temporal
difference iteration with distributed estimation of the reward
model. To the best of the authors' knowledge, the only other
work to propose a distributed temporal difference algorithm is
recent work [18] that addresses the convergence properties of
consensus on model parameters in the case of linear function
approximations.
In general, multi-agent reinforcement learning research is
challenging because the MDP’s state and action space dimen-
sionality is coupled to the number of agents, which is typically
handled by using either (i) function approximation methods
such as deep neural networks or (ii) decoupled, decentralized
solutions. A survey paper on multi-agent reinforcement learn-
ing discusses additional methods [19]. In contrast to an agent-
based (or Lagrangian) approach, our method uses a naturally
scalable cell-based (or Eulerian) model that decouples the
problem dimensionality from the number of agents, inspired
by a method used in probabilistic swarm guidance [20].
Because of the cell-based abstraction, our algorithm requires
an additional task assignment component to coordinate taxis.
Task assignment is a canonical operations research problem
and there exist many centralized [21]–[24] and decentralized [25]–[27] methods available. Among these options, we use a dis-
tributed prescriptive game theory [28] approach that leverages
existing asymptotic game theoretic optimality and convergence
results. In contrast to conventional descriptive game theory, prescriptive game theory designs multi-agent local interactions
to achieve desirable global behavior. Using one such method,
binary log-linear learning [28], the taxis achieve global coop-
erative behavior with only local information.
III. PROBLEM DESCRIPTION
Notation: We denote vectors with a bold symbol, matrices with plain uppercase, scalar parameters with plain lowercase, functions with italics, and we use calligraphic symbols for operators and sets. We denote a taxi index with an $i$ or $j$ superscript, a customer index with a $k$ superscript, and the time index with a subscript $t$. Also, $I_n$ denotes the $n$-dimensional identity matrix.
Problem Statement: We consider the urban taxi dispatch problem, where we control a fleet of taxis to minimize customer waiting time. At each timestep, each customer request is serviced by the nearest taxi. These servicing taxis use customer information to update their reward model and exchange information with neighboring taxis. The remaining free taxis are dispatched to locations in the map according to the proposed dispatch algorithm. The overall fleet control is summarized in Algorithm 1 and implementation details are given in Appendix A.
Algorithm 1: Fleet Control Problem
1  initialize taxi fleet;
2  for $t \in [t_0 : t_f]$ do
3      broadcast local customer requests to taxis;
4      assign closest free taxis to service customers;
5      dispatch free taxis to locations in the map;
6  end
The system, as shown in Fig. 3, is composed of customers and taxis. The $k$-th customer state, $\mathbf{c}^k$, is composed of the time of request, trip duration, pickup location, and dropoff location, i.e. $\mathbf{c}^k = [t^k_r, t^k_d, \mathbf{p}^{k,p}, \mathbf{p}^{k,d}]$, and its pickup location is shown in green in the top subplot. The $i$-th taxi is defined by a position vector, $\mathbf{p}^i_t$, and an operation mode: free (shown in blue) or servicing (shown in orange). The dispatch solution is a desired position vector for each of the free taxis, $\mathbf{u}^i_t$, that results in minimizing customer waiting time over a time horizon. Our method estimates the optimal policy, visualized with the value function over the state space in the bottom subplot, that maximizes the expected reward over time.
IV. CELL-BASED MARKOV DECISION PROCESS
Figure 3. State-space representation of a Gridworld simulation with 1000 taxis, with the corresponding value function estimate. In the top subplot, the blue dots are free taxi positions, the orange dots are positions of taxis currently servicing customers, and the green dots are the new customer request pickup positions. In the bottom subplot, the approximate value function distribution is shown over the state space.

The previous section described the dispatch problem from an agent-based perspective, i.e. in terms of the positions and actions of individual taxis and customers. Next, we introduce the cell-based Markov Decision Process (MDP), where cell-based refers to an Eulerian perspective in which we analyze values like location and reward with respect to cells of a discretized map, as shown in Fig. 3. The cell-based formulation decouples the decision-making problem from the number of agents, permitting a finite-dimensional policy representation. The decision-making problem is formalized with an MDP, $\mathcal{M}$, defined as a tuple of state space, action space, transition model, reward model, and discount factor [29]:

$\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, R_t, \gamma \rangle$.   (1)
The state space, $\mathcal{S}$, is defined as the set of map cells shown in Fig. 3, where the state of the $i$-th taxi, $s^i_t$, is the cell index that contains that taxi's position. The number of cells in the environment is denoted by the cardinality of the set, $|\mathcal{S}|$.
The action space, $\mathcal{A}$, is defined as a movement between map cells for taxi $i$. The taxi on dispatch has 5 actions: $a^i_t \in \mathcal{A} = \{\text{stay}, \text{right}, \text{up}, \text{left}, \text{down}\}$, defined with respect to its current map cell in $\mathcal{S}$.
The transition function, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$, is defined as $P(s^i_t, a^i_t, s^i_{t+1}) = \Pr(s^i_{t+1} \mid s^i_t, a^i_t)$. We define $P$ with a deterministic, cell-based dynamical model, $f$:

$s^i_{t+1} = f(s^i_t, a^i_t)$ where $P(s^i_t, a^i_t, s^i_{t+1}) = 1$.   (2)

If the next state is valid, the cell-based dynamical model moves the taxi from the initial state, $s^i_t$, to the neighboring state $s^i_{t+1}$ according to its action. If the next state is not valid, the dynamical model returns the initial state.
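For illustration, the following is a minimal Python sketch of the deterministic cell-based dynamics $f$ in (2); the rectangular grid, row-major cell indexing, and the convention that "up" decreases the row index are assumptions made for this sketch, not the fleet simulator's implementation.

```python
# Minimal sketch of the cell-based dynamics f in (2), assuming a
# rectangular grid with row-major cell indexing (illustrative only).
ACTIONS = ["stay", "right", "up", "left", "down"]

def step_cell(s, a, n_rows, n_cols):
    """Return the next cell index f(s, a); invalid moves return s."""
    row, col = divmod(s, n_cols)
    moves = {"stay": (0, 0), "right": (0, 1), "up": (-1, 0),
             "left": (0, -1), "down": (1, 0)}
    dr, dc = moves[a]
    new_row, new_col = row + dr, col + dc
    # If the next state is not valid, return the initial state.
    if not (0 <= new_row < n_rows and 0 <= new_col < n_cols):
        return s
    return new_row * n_cols + new_col
```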
The reward function $R_t : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is defined to be the negative of the expected customer waiting time and is estimated from reward samples; the reward is the negative of the time it takes the taxi to go from its current position to the dispatch position defined by the cell-action, and then from that position to the customer request. The reward has a subscript $t$ because we assume the reward changes over time due to the changing customer distribution. We equivalently write the $R_t$ function as a vector of state-action pairs, $R_t(s,a) = R_t[s|\mathcal{A}| + i(a)]$ and $R_t \in \mathbb{R}^{n_q}$, where $n_q = |\mathcal{S}||\mathcal{A}|$ and $i(a)$ denotes the action's index. Given a customer request, $\mathbf{c}^k$, we compute a sample of the reward function, $\mathbf{r}^i_t$, that can be used to estimate the underlying reward, $R_t$, according to an observation model:
$r^i_t(s^i_t, a^i_t) = -\left(\eta(\mathbf{p}^i_t, \mathbf{u}^i_t) + \eta(\mathbf{u}^i_t, \mathbf{p}^{k,p})\right)$   (3)

$\mathbf{r}^i_t = H^i_t R_t + \mathbf{v}^i_t$   (4)
Recall that $\mathbf{p}^i_t$ is the position of the $i$-th taxi, $\mathbf{u}^i_t$ is the dispatch desired position, and $\mathbf{p}^{k,p}$ is the $k$-th customer's pickup position. Also, $\eta$ is the estimated time-of-arrival function that accepts position vectors, returns a scalar time value, and is specified in Appendix A. It is parameterized by the average taxi velocity, $v_{\text{taxi}}$. The measurement noise is sampled from a normal distribution with variance $\varsigma$, i.e. $\mathbf{v}^i_t \sim \mathcal{N}(0, \varsigma I)$. Finally, the cell-based observation model, $H^i_t \in \mathbb{R}^{n_q \times n_q}$, is a binary diagonal matrix with unity elements at the state-action pairs where the customer request $\mathbf{c}^k$ contains information about the corresponding state-action pair, and 0 otherwise.
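As a concrete illustration, here is a minimal sketch of forming one reward sample (3) and the corresponding diagonal observation matrix $H^i_t$ of (4); the straight-line travel-time model `eta` is a stand-in assumption for the time-of-arrival function of Appendix A, and the measurement noise of (4) is omitted.

```python
import numpy as np

def eta(p_from, p_to, v_taxi=0.125):
    # Stand-in estimated time-of-arrival: straight-line distance / speed.
    return np.linalg.norm(np.asarray(p_to) - np.asarray(p_from)) / v_taxi

def reward_sample(s, a_idx, p_taxi, u_dispatch, p_pickup, n_actions, n_q):
    """Return (r, H): a sample of R_t at the observed state-action pair
    (the noise v of (4) is omitted here) and the binary diagonal
    observation matrix selecting that entry."""
    q_idx = s * n_actions + a_idx            # state-action vector index
    r = np.zeros(n_q)
    # Reward sample (3): negative travel time taxi -> dispatch cell -> pickup.
    r[q_idx] = -(eta(p_taxi, u_dispatch) + eta(u_dispatch, p_pickup))
    H = np.zeros((n_q, n_q))
    H[q_idx, q_idx] = 1.0                    # observed state-action pair
    return r, H
```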
Note that $\gamma$ is the discount rate of the system. This parameter determines the trade-off between greedy and long-term optimal behavior.
V. ALGORITHM DESCRIPTION AND ANALYSIS: H-TD²
We describe the details of H-TD² in this section, defining the exact and approximate policy estimation, the hybrid switching behavior, and the game-theoretic task assignment. The overview of the method is given in Algorithm 2.
Algorithm 2: H-TD² at timestep $t$
1   input: set of total, free, and servicing taxis: $\mathcal{I}$, $\mathcal{I}_f$, $\mathcal{I}_s$
2   output: action profile for free taxis, $\mathbf{u}_t$
    /* Hybrid Temporal Difference */
3   for $i \in \mathcal{I}$ do
4       if $\delta_e > \delta_d$ (15) then
5           slow update $Q^i_t$ with aggregated global information (6) at the central node;
6       else
7           fast update $Q^i_t$ with local information (13) at each taxi;
8       end
9   end
    /* Game Theoretic Task Assignment */
10  randomly initialize cell-based action profile $\mathcal{A}_t$;
11  while $\mathcal{A}_t$ not converged do
12      randomly pick $i \in \mathcal{I}_f$ that has not converged;
13      consider current action, $a^i_t$;
14      propose random action $a^{i\prime}_t$;
15      compute marginal utility, $J$, with $Q^i_t$ (20);
16      stochastically assign action with $J$ (21);
17      check $i$-th taxi action convergence;
18  end
19  convert cell-based actions $a^i_t$ to position vectors $\mathbf{u}^i_t$;
A. Centralized Q-value Computation
We present the idealized Bellman solution to the cell-based decision-making problem specified in (1). The solution is a policy function that maps states to an action that maximizes the discounted reward over time and can be represented as a value function, as shown in Fig. 3, or as an action-value function known as $Q^b_t$-values. We use the latter and use the superscript $b$ notation to denote the policy that is synthesized with a Bellman iteration method. We adopt the conventional optimal $Q^b_t$-value function as follows:

$Q^b_t(s_t, a_t) = \mathbb{E}_{s'_t \sim P(s'_t \mid s_t, a_t)}\left[ R_t + \gamma\, \mathbb{E}_{a'_t \sim \pi^b}\, Q^b(s'_t, a'_t) \right]$   (5)
We specify this general formulation with some assumptions. First, we use a finite-dimensional tabular $Q^b_t$ and write the $Q^b_t$ function as a vector of state-action pairs, $Q^b_t \in \mathbb{R}^{n_q}$, as done with the reward in Sec. IV. Next, we apply the deterministic transition function, $P(s^i_t, a^i_t, s^i_{t+1})$, as specified in (2) to remove the outer expectation. We remove the inner expectation by specifying the policy $\pi^b$ to be a transition kernel matrix $F^b$ such that $F^b Q^b_t(s^i_t) = \max_{a^i_t} Q^b_t(s^i_t, a^i_t)$. Combining these, we rewrite a simplified expression for $Q^b_t$ as the fixed point of the Bellman operator $\mathcal{T}$:

$Q^b_t = R_t + \gamma F^b Q^b_t = \mathcal{T} Q^b_t$   (6)
The Bellman iteration can be solved using conventional value
iteration or policy iteration methods from batch customer data.
For the purpose of this paper, we use a Modified Policy
Iteration (MPI) method [30] to solve line 5 of Algorithm 2.
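For intuition, here is a minimal sketch of computing the fixed point (6) over the tabular $Q^b$ vector; the paper uses Modified Policy Iteration [30], so this simple fixed-point (value-iteration-style) sweep is only an illustrative substitute, and `transition(s, a)` can be any deterministic dynamics such as the `step_cell` sketch above.

```python
import numpy as np

def bellman_fixed_point(R, transition, n_states, n_actions,
                        gamma=0.9, tol=1e-8, max_iter=10_000):
    """Iterate Q <- R + gamma * F^b Q until convergence (tabular, deterministic f).
    R is the n_q = n_states * n_actions reward vector; transition(s, a) -> s'."""
    Q = np.zeros(n_states * n_actions)
    for _ in range(max_iter):
        V = Q.reshape(n_states, n_actions).max(axis=1)   # F^b Q: max over actions
        Q_new = np.array([R[s * n_actions + a] + gamma * V[transition(s, a)]
                          for s in range(n_states) for a in range(n_actions)])
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q
```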
However, there are complications with implementing a pure
Bellman approach, which we address in the next subsection.
B. Distributed Reward Estimation and Q-value Iteration
We present a policy approximation that overcomes the Bellman solution's practical limitations. The first issue with a pure Bellman solution is that, in an online setting, the reward information is hidden a priori and received incrementally in samples, $\mathbf{r}^i_t$, so the expectation of the reward is not immediately available. Furthermore, each taxi only has access to local information. To overcome these problems, we assume the hidden reward evolves as a random walk process and synthesize linear estimators for the hidden reward $R_t$:
$R_{t+1} = R_t + \mathbf{w}_t$   (7)

$R^c_{t+1} = R^c_t + \sum_{j \in \mathcal{I}} K^j_t \left( \mathbf{r}^j_t - H^j_t R^c_t \right)$   (8)

$R^i_{t+1} = R^i_t + \sum_{j \in \mathcal{I}} A^{ij}_t \left( \mathbf{r}^j_t - R^i_t \right)$   (9)
Recall $\mathbf{r}^i_t$ is the reward sample defined in (4). Also, $R^c_t, R^i_t \in \mathbb{R}^{n_q}$ are the centralized and distributed reward estimators that will be used in the upcoming temporal difference learning, where the superscript $c$ denotes a centralized quantity. The $H^i_t$ matrix is the $i$-th taxi's measurement model defined in (4) and the $K^i_t$ matrix is the corresponding estimator gain. The process noise, $\mathbf{w}_t \sim \mathcal{N}(0, \varepsilon I)$, is sampled from a normal distribution where the parameter $\varepsilon$ is computed offline from training data.
The row-stochastic adjacency block matrix, $A_t$, specifies the local information available to each agent and is defined as:

$A^{ij}_t = B^{ij}_t \Big/ \sum_{j=1}^{n_i} B^{ij}_t, \quad \text{where} \quad B^{ij}_t = \begin{cases} K^j_t H^j_t & j \in \mathcal{I}^i_t \\ 0_{n_q \times n_q} & \text{else} \end{cases}$   (10)

$\mathcal{I}^i_t = \{\, j \in \mathcal{I} \mid \|\mathbf{p}^i_t - \mathbf{p}^j_t\| < R_{\mathrm{comm}} \,\}.$   (11)
with diagonal block matrices $A^{ij}_t, B^{ij}_t \in \mathbb{R}^{n_q \times n_q}$ and full matrices $A_t, B_t \in \mathbb{R}^{n_i n_q \times n_i n_q}$. The $i$-th agent constructs the $B^{ij}_t$ matrices from the gain and measurement matrices shared by its neighbors $j \in \mathcal{I}^i_t$. The local observation is parameterized by the radius of communication, $R_{\mathrm{comm}}$.
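The following sketch assembles the local consensus weights (10) and applies the distributed reward update (9) for one taxi. The diagonal structure of the gain and observation matrices, and the small ridge term that keeps the normalization invertible, are simplifying assumptions; the paper chooses the gains with a Distributed Kalman Information Filter (see Remark 3).

```python
import numpy as np

def distributed_reward_update(R_i, r_samples, H_mats, K_mats, neighbors, ridge=1e-9):
    """One local step of (9)-(10) for taxi i.
    R_i: current estimate (n_q,). For each neighbor j in `neighbors`,
    r_samples[j] is its reward sample (4), H_mats[j] its observation matrix,
    and K_mats[j] its estimator gain (assumed diagonal in this sketch)."""
    n_q = R_i.shape[0]
    # B^{ij} = K^j H^j for neighbors, zero blocks otherwise (10).
    B = {j: K_mats[j] @ H_mats[j] for j in neighbors}
    B_sum = sum(B.values()) + ridge * np.eye(n_q)      # ridge avoids singularity
    R_next = R_i.copy()
    for j in neighbors:
        A_ij = np.linalg.solve(B_sum, B[j])            # normalized weight A^{ij} (10)
        R_next = R_next + A_ij @ (r_samples[j] - R_i)  # consensus innovation (9)
    return R_next
```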
The second issue with the Bellman approach is an intrinsic drawback: at higher state/action dimensions the Bellman-iteration calculation becomes computationally expensive and cannot be quickly evaluated online. Instead, temporal difference learning [31] can be used as an approximate method to estimate $Q$-values online using $R^c_t$:

$Q^c_{t+1} = Q^c_t + \alpha \left( R^c_t + \gamma F^c Q^c_t - Q^c_t \right)$   (12)

where $\alpha \in \mathbb{R}$ is the system learning rate and $F^c$ is the transition kernel for this policy.
This formulation requires a central node to collect the data, compute a policy, and broadcast the new information at every timestep, scaling the computational complexity, bandwidth, and network delay with the number of taxis. To address this limitation, we introduce a distributed algorithm using communication between the taxis, and propose a policy update computed at each taxi using only local information:

$Q^i_{t+1} = Q^i_t + \alpha \left( R^i_t + \gamma F^i Q^i_t - Q^i_t \right)$   (13)

where this temporal difference (13), with the distributed reward estimation (9), defines line 7 of Algorithm 2.
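A minimal sketch of the local update (13): each taxi advances its tabular $Q^i_t$ with its own reward estimate, and the greedy kernel $F^i$ is evaluated as the max over actions at the deterministic successor cell, as in (6). The callable `transition` (e.g. the `step_cell` sketch above) is an assumption of this sketch.

```python
import numpy as np

def local_td_update(Q_i, R_i, transition, n_states, n_actions,
                    alpha=0.75, gamma=0.9):
    """One application of (13): Q^i <- Q^i + alpha * (R^i + gamma * F^i Q^i - Q^i)."""
    V = Q_i.reshape(n_states, n_actions).max(axis=1)     # greedy value per cell
    FQ = np.array([V[transition(s, a)]
                   for s in range(n_states) for a in range(n_actions)])
    return Q_i + alpha * (R_i + gamma * FQ - Q_i)
```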
Using the approximate temporal difference method and
estimating with only local information hurts the quality of the
final policy used by each taxi. We derive the upper bound of
the negative effect of these approximations through a regret
bound analysis, comparing the policy synthesized with the
proposed algorithm (13) and the Bellman-optimal solution (6).
Theorem 1. The expected distance between an arbitrary taxi's $Q$-value estimate, $Q^i_t$, and the Bellman-optimal solution $Q^b_t$ is upper bounded by:

$\mathbb{E}\,\|Q^i_t - Q^b_t\|_2 \;\leq\; \dfrac{\sqrt{2 n_q (\varepsilon + \varsigma)}}{(1-\gamma)\left(1 - \sqrt{1 - \lambda_{\min}\left(\sum_{j \in \mathcal{I}} A^{ij}_t\right)}\right)}$   (14)

where $\lambda_{\min}(\cdot)$ denotes the smallest eigenvalue of a matrix.
Proof:
First, we write the distributed iteration as an
application of the Bellman operator on the previous timestep
with a disturbance. Then we solve for the disturbance to derive
the final bound.
Step 1: Consider (13) and add and subtract $\alpha R_t$:
$Q^i_{t+1} = Q^i_t + \alpha(R^i_t + \gamma F^i Q^i_t - Q^i_t) + \alpha R_t - \alpha R_t$
$\qquad\;\; = Q^i_t + \alpha(R_t + \gamma F^i Q^i_t - Q^i_t) + \alpha(R^i_t - R_t)$
$\qquad\;\; = Q^i_t + \alpha(\mathcal{T} Q^i_t - Q^i_t) + \alpha \mathbf{e}^i_t$
where $\mathbf{e}^i_t = R^i_t - R_t$. The system is contracting at rate $1 - \alpha(1-\gamma)$, implying the system geometrically converges to an equilibrium about $\mathcal{T} Q^i_t = Q^i_t$. In addition, from Banach's fixed point theorem [32], $\mathcal{T}$ contracts to a unique fixed point, $Q^b_t$. Applying the discrete Gronwall lemma [32]:

$\|Q^i_t - Q^b_t\| \leq \dfrac{\|\mathbf{e}^i_t\|}{1-\gamma}$

Note that the decaying initial condition term does not appear because we assume that the policy estimate is initialized with the Bellman solution, i.e. $Q^i_0 = Q^b_0$.
Step 2: Here we need to bound the value $\|\mathbf{e}^i_t\|$, i.e. the error between the estimated reward and the true reward. We write the dynamics of the error vector $\mathbf{e}^i_t$ by subtracting (7) from (9):
$\mathbf{e}^i_{t+1} = \left(I - \sum_{j \in \mathcal{I}} A^{ij}_t\right) \mathbf{e}^i_t + \mathbf{d}_t$
where $\mathbf{d}_t = \mathbf{w}_t + \sum_{j \in \mathcal{I}} A^{ij}_t \mathbf{v}^j_t$. By applying Weyl's interlacing eigenvalue theorem [33], we prove that the $i$-th system is contracting at a rate lower bounded by $\lambda^i_t = 1 - \lambda_{\min}\left(\sum_{j \in \mathcal{I}} A^{ij}_t\right)$. We rewrite the disturbance as the product of an input matrix, $M_t$, and the stacked noise, $\mathbf{z}_t \sim \mathcal{N}(0, W)$:
$\mathbf{d}_t = M_t \mathbf{z}_t$, where $\mathbf{z}_t = [\mathbf{w}_t;\, \mathbf{v}^1_t;\, \ldots;\, \mathbf{v}^{n_i}_t]$, $M_t = [I_{n_q},\, A^{i,1}_t,\, \ldots,\, A^{i,n_i}_t]$, and $W = \mathrm{blkdiag}(\varepsilon I_{n_q},\, \varsigma I_{n_q},\, \ldots,\, \varsigma I_{n_q})$.
By application of the convergence theorem of discrete
stochastic contracting systems [34], [35], the expected error
of a single agent is upper bounded by:
$\mathbb{E}\,\|\mathbf{e}^i_t\| \;\leq\; \dfrac{\sqrt{2C}}{1 - \sqrt{1 - \lambda_{\min}\left(\sum_{j \in \mathcal{I}} A^{ij}_t\right)}}, \qquad C = \mathrm{trace}(M_t^T M_t W)$
It remains to calculate the value $C$:

$C = \varepsilon\,\mathrm{trace}(I_{n_q}) + \varsigma \sum_{j \in \mathcal{I}} \mathrm{trace}\left((A^{ij}_t)^T A^{ij}_t\right) \;\leq\; n_q(\varepsilon + \varsigma)$

where we use the linearity of the trace operation to move it outside of the sum, then we use the non-negativity and row-stochasticity of $A_t$ to bound $\sum_{j \in \mathcal{I}} (A^{ij}_t)^T A^{ij}_t \preceq I_{n_q}$. The final result is found by plugging the result from Step 2 into the result from Step 1.
Remark 1. The regret bound is driven by the contraction rate of the system, $\lambda^i_t$, a combined graph and observability quantity. Intuitively, this corresponds to a non-zero value when taxi $i$ and its neighboring taxis $j$ can measure the entire state-action vector. We can also consider a batch measurement over a time interval, $n_T$, and an average contraction rate, $\lambda^i_t$. This time-interval approach exists in the multi-agent adaptive control literature, where $\lambda^i_t > 0$ is analogous to an excitation level in the Collective Persistency of Excitation condition [36].
Remark 2. The proposed online method is used to estimate the optimal policy in a dynamic environment, i.e. the reward model, $R_t$, is time-dependent. In this case, the optimal policy is non-stationary, i.e. $Q^b_{t+1} \neq Q^b_t$. In order to guarantee the convergence of the TD-algorithm, we require that there exists a timescale separation between the convergence of the TD-algorithm and the dynamics of $Q^b$:

$\|Q^b_{t+1} - Q^b_t\| \ll (1-\gamma)\left(1 - \sqrt{1 - \lambda_{\min}(A^{ij}_t)}\right)$
Remark 3. The adjacency matrix $A_t$ dictates that each agent takes a convex combination of the neighboring measurements. Further, the estimation gain matrices, $K^i_t$, are chosen with a Distributed Kalman Information Filter, whose proof of optimality with respect to mean-squared error can be found in [37], [38]. Thus, the agents weigh the neighboring measurements appropriately.
C. Hybrid Temporal Difference Algorithm
We define the switching condition in line 4 of Algorithm 2 with two parameters: $\delta_e$, the estimated error in the system, and $\delta_d$, the user-specified desired error in the system.
Proposition 1. If we define:

$\delta_e = \dfrac{\sqrt{2 n_q (\varepsilon + \varsigma)}}{(1-\gamma)\left(1 - \sqrt{1 - \lambda_{\min}\left(\sum_{j \in \mathcal{I}} A^{ij}_t\right)}\right)}$   (15)

then the expected policy sub-optimality will be bounded by $\delta_d$.
Nominally, the system evolves with the distributed temporal difference method (13), computing $\delta_e$ at each timestep. Each agent is able to compute this value because $\varepsilon$, $\varsigma$, and $\gamma$ are known system parameters and each agent keeps track of its own $\lambda_{\min}\left(\sum_{j \in \mathcal{I}} A^{ij}_t\right)$ value. Applying the result from Theorem 1, the expected policy sub-optimality is identically $\delta_e$; so, if $\delta_e$ exceeds the desired error, $\delta_d$, the desired sub-optimality is violated. However, if this condition occurs at time $t$, the system resets all taxis with a central policy update (6), i.e. $Q^i_t = Q^b_t,\ \forall i \in \mathcal{I}$. Therefore, the H-TD² algorithm maintains the distance between the estimated policy and the Bellman-optimal policy to within the user specification. In effect, $\delta_d$ controls the trade-off of computational expense to policy sub-optimality, where $\delta_d \to 0$ produces a solution with no regret but maximum computational effort and $\delta_d \to \infty$ produces a solution with potentially infinite regret but little computational effort.
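A minimal sketch of the trigger test in line 4 of Algorithm 2, evaluating the error estimate (15) as reconstructed above; `lam_min` is the taxi's tracked $\lambda_{\min}(\sum_j A^{ij}_t)$ and the clipping guard is an assumption added for numerical safety.

```python
import numpy as np

def policy_error_estimate(n_q, eps, sigma, gamma, lam_min):
    """Evaluate delta_e from (15) for one taxi."""
    lam_min = np.clip(lam_min, 1e-12, 1.0)               # numerical guard
    return np.sqrt(2.0 * n_q * (eps + sigma)) / (
        (1.0 - gamma) * (1.0 - np.sqrt(1.0 - lam_min)))

def needs_central_update(n_q, eps, sigma, gamma, lam_min, delta_d):
    """Trigger condition: request a Bellman update when delta_e > delta_d."""
    return policy_error_estimate(n_q, eps, sigma, gamma, lam_min) > delta_d
```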
D. Game Theoretic Task Assignment
In this section, we propose a game-theoretic task assignment to coordinate the taxis according to the $Q^i_t$-values estimated in Sec. V-B. The $Q^i_t$ policy does not account for the actions of the other taxis and, without additional coordination, the taxis would behave greedily by all going to the highest-value cell, increasing the overall customer waiting time. To avoid this behavior, we design a potential game and a local action-profile iteration to maximize each agent's marginal utility. This is implemented in Algorithm 2, Lines 10-18.
First, we introduce a global action profile, $\mathcal{A}_t$, and a local action profile for the $i$-th taxi, $\mathcal{A}^i_t$:

$\mathcal{A}_t = \{\, a^j_t \mid \forall j \in \mathcal{I} \,\}, \quad \text{and} \quad \mathcal{A}^i_t = \{\, a^j_t \mid \forall j \in \mathcal{I}^i_t \,\}$   (16)

where the $j$-th neighboring taxi communicates its action, $a^j_t$, to the $i$-th taxi, and $a^j_t$ is defined in (1).
Next, we introduce the current global fleet distribution, $\Omega_t$, i.e. the number of taxis in each cell, as a function of the action profile:

$\Omega_t(s, \mathcal{A}_t) = \dfrac{1}{n_i} \sum_{j \in \mathcal{I}} \mathbb{I}\left(s = f(s^j_t, a^j_t)\right)$   (17)

where $\mathbb{I}$ denotes the indicator function and $f$ is the dynamics model specified in (1). The local fleet distribution, $\Omega^i_t$, is found with the same calculation but summing only over the neighboring agents, $j \in \mathcal{I}^i_t$. By defining the radius of communication as $R_{\mathrm{comm}} = 3\,ds$, where $ds$ is the length of a cell in the environment, we guarantee that the $i$-th taxi can always calculate the fleet distribution in the neighboring cells, $\mathcal{S}^i$, within one action of the current cell of the $i$-th taxi.
Next, we describe the desired fleet distribution using the $Q$-values as computed in (9). Consider the following Boltzmann exploration strategy [31] to synthesize a desired distribution, $\Omega^*$, from the $Q$-value estimates:

$\Omega^*(s, Q^i_t) = \dfrac{\exp\left(\beta \max_{a \in \mathcal{A}} Q^i_t(s,a)\right)}{\sum_{a \in \mathcal{A}} \exp\left(\beta\, Q^i_t(s,a)\right)}$   (18)

Recall that $\mathcal{A}$ is the set of local actions available to each taxi defined in (1) and $\beta \in \mathbb{R}$ is an exploration/exploitation design constant. For simplicity of notation, we have written the action-values in functional form, $Q^i_t(s,a)$.
The goal of the game-theoretic task assignment is to find an action profile, $\mathcal{A}_t$, through local iteration methods that minimizes the distribution distance between the current distribution, $\Omega_t$, and the desired distribution, $\Omega^*$. We describe a potential, noncooperative game, meaning that the taxis will try to converge to a Nash equilibrium with a high potential function value. The global and marginal potential functions, $\Phi$ and $J$, are defined as follows:

$\Phi(\mathcal{A}_t) = -\sum_{s \in \mathcal{S}} \left(\Omega^*(s, Q^i_t) - \Omega_t(s, \mathcal{A}_t)\right)^2$   (19)

$J(\mathcal{A}^i_t) = -\sum_{s \in \mathcal{S}^i} \left(\Omega^*(s, Q^i_t) - \Omega^i_t(s, \mathcal{A}^i_t)\right)^2$   (20)

For the calculation of $J$, we only require the indices that correspond to the neighboring cells of the current cell of the $i$-th taxi, thereby permitting a local calculation.
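To make the quantities in (17)-(20) concrete, here is a minimal sketch of the Boltzmann desired distribution (18), the local fleet distribution, and the marginal utility $J$ evaluated over a taxi's neighboring cells. The helper names, the normalization by the global fleet size `n_i`, and the max-shift used for numerical stability are assumptions of this sketch; the signs follow the reconstruction of (19)-(20) above, where the potential is the negative squared mismatch.

```python
import numpy as np

def desired_distribution(Q_i, s, n_actions, beta=1.0):
    """Omega*(s, Q^i) from (18); shifting by the max leaves the ratio unchanged
    but avoids overflow: exp(b*max Q)/sum exp(b*Q) = 1/sum exp(b*(Q - max Q))."""
    q_s = beta * Q_i.reshape(-1, n_actions)[s]
    return 1.0 / np.sum(np.exp(q_s - q_s.max()))

def local_fleet_distribution(s, neighbor_states, neighbor_actions, transition, n_i):
    """Fraction of neighboring taxis whose action lands in cell s, as in (17)."""
    lands = [transition(sj, aj) == s
             for sj, aj in zip(neighbor_states, neighbor_actions)]
    return sum(lands) / n_i

def marginal_utility(Q_i, cells_i, neighbor_states, neighbor_actions,
                     transition, n_i, n_actions, beta=1.0):
    """J(A^i): negative squared distance between desired and local distributions (20)."""
    return -sum((desired_distribution(Q_i, s, n_actions, beta)
                 - local_fleet_distribution(s, neighbor_states, neighbor_actions,
                                            transition, n_i)) ** 2
                for s in cells_i)
```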
Remark 4. The game's utility function, $\Phi(\mathcal{A}_t)$, has an analogy to sample-based planners if each taxi in the fleet is considered as a sampled action of a stochastic policy, $\Omega^*(s, Q^i_t)$. This choice of utility function is interesting because it could be the utility function chosen by a centralized algorithm, but we can maximize it with local calculations through $J$.
Remark 5. Note that $J$ is indeed the marginal contribution to the global potential function:

$\Phi(\mathcal{A}^\prime_t) - \Phi(\mathcal{A}_t) = J(\mathcal{A}^{i\prime}_t) - J(\mathcal{A}^i_t)$

where $\mathcal{A}^\prime_t = \{\, a^j_t \mid \forall j \in \mathcal{I} / \{i\} \,\} \cup \{ a^{i\prime}_t \}$ is the global alternative action set.
We use a game-theoretic reinforcement learning technique, binary log-linear learning [28], to iterate to an action set $\mathcal{A}_t$, shown in lines 10-19 of Algorithm 2. At each timestep, $t$, the action set, $\mathcal{A}_t$, is randomly initialized. While all other taxis' actions are held fixed, a randomly selected $i$-th taxi chooses between the previously held action, $a^i_t$, and an alternate action, $a^{i\prime}_t$, with probability $p^i_t(\mathcal{A}^i_t, \mathcal{A}^{i\prime}_t)$:

$p^i_t(\mathcal{A}^i_t, \mathcal{A}^{i\prime}_t) = \dfrac{\exp\left(J(\mathcal{A}^{i\prime}_t)/\tau\right)}{\exp\left(J(\mathcal{A}^i_t)/\tau\right) + \exp\left(J(\mathcal{A}^{i\prime}_t)/\tau\right)}$   (21)

where $\mathcal{A}^{i\prime}_t = \{\, a^j_t \mid \forall j \in \mathcal{I}^i_t / \{i\} \,\} \cup \{ a^{i\prime}_t \}$ is the local alternative action set. The coefficient $\tau \in \mathbb{R}_{>0}$ is a design parameter specifying how likely taxi $i$ is to choose a sub-optimal action, i.e. it specifies the trade-off between exploration and exploitation. The action set is chosen once the iteration has converged, completing the game-theoretic task assignment. Then, the cell-based action $a^i_t$ is converted to a dispatch vector $\mathbf{u}^i_t$, where $\mathbf{u}^i_t$ is a randomly sampled position vector in the cell reached after the dispatch action is taken.
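A minimal sketch of the binary log-linear learning loop (lines 10-18 of Algorithm 2 with the acceptance rule (21)). The fixed iteration budget in place of a convergence check, the inline action list, and the `utility` callback signature are assumptions of this sketch; `utility` could be the `marginal_utility` helper sketched above.

```python
import numpy as np

def blll_assign(free_taxis, actions, utility, rng, tau=1e-4, n_iters=2000):
    """Binary log-linear learning. `actions` maps taxi -> current cell action;
    utility(taxi, actions) evaluates J for that taxi under an action profile."""
    local_actions = ["stay", "right", "up", "left", "down"]
    for _ in range(n_iters):
        i = rng.choice(free_taxis)                    # pick one free taxi at random
        a_old = actions[i]
        a_new = rng.choice(local_actions)             # propose an alternate action
        J_old = utility(i, actions)
        J_new = utility(i, {**actions, i: a_new})
        # Acceptance probability (21), shifted by the max exponent for stability.
        z = max(J_old, J_new)
        p_new = np.exp((J_new - z) / tau) / (
            np.exp((J_old - z) / tau) + np.exp((J_new - z) / tau))
        if rng.random() < p_new:
            actions[i] = a_new
        else:
            actions[i] = a_old
    return actions
```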
VI. NUMERICAL EXPERIMENTS
A. Baseline and Variants
We compare our H-TD² solution with a receding horizon control (RHC) baseline dispatch algorithm that is similar to the baseline in [11]. For this section, we use notation independent from the rest of the paper, matching the notation in [11]. The RHC dispatch algorithm is formulated as the following linear program:
$\mathbf{u}^* = \operatorname{argmax} \sum_{t=t_0}^{t_0+t_{\mathrm{rhc}}} \gamma^{t-t_0} \sum_{i=1}^{M} \min(w_{t,i} - x_{t,i},\, 0)$

$\text{s.t.} \quad \sum_{j=1}^{M} u_{ji,t} = x_{t+1,i}, \qquad \sum_{j=1}^{M} u_{ij,t} = x_{t,i},$

$\qquad u_{ij,t} = 0 \;\; \forall j \notin \mathcal{S}^i, \qquad x_{t_0,i} = X_{0,i}$   (22)
where the state variable, $x_t \in \mathbb{Z}^{|\mathcal{S}|}$, is the number of free taxis in each cell, the control variable $u_{ij,t}$ is the number of taxis moving from cell $i$ into cell $j$ at time $t$, and $t_{\mathrm{rhc}}$ is the RHC planning horizon. The first two constraints are conservation constraints: (i) the number of taxis in cell $i$ is the number of taxis moving into cell $i$, and (ii) the number of taxis moving from cell $i$ is equal to the number of taxis previously in cell $i$. The third constraint is that each taxi can only move to a neighboring cell. The fourth constraint is the initial condition. For large fleets, a proper assumption from [11] is to relax $u_{ij,t}$ from integers to real numbers, resulting in a linear program in $u_{ij,t}$. The reward is the difference of customer demand and taxi supply, where $w_{t,i}$ represents the expected customer demand and is computed from training data, similar to that proposed in [9], [11]. The baseline is chosen to demonstrate the advantage of fast adaptation over learned prediction: static prediction methods (either explicit in model-based or implicit in model-free approaches) are vulnerable to events occurring outside of the training domain.
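As an illustration of how the relaxed program (22) can be solved, the following is a minimal single-step sketch using `scipy.optimize.linprog`, with auxiliary variables $z_j \le \min(w_j - x_{t+1,j}, 0)$ to linearize the objective. The one-step horizon and the adjacency encoding are simplifying assumptions, not the baseline's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def rhc_one_step(x0, w, adjacent):
    """Single-step relaxation of (22). x0[i]: free taxis in cell i; w[j]:
    expected customer demand in cell j; adjacent[i][j]: True if cell j is
    reachable from cell i in one move (including i itself)."""
    M = len(x0)
    n_u = M * M                                        # u[i, j], flattened row-major
    c = np.concatenate([np.zeros(n_u), -np.ones(M)])   # maximize sum_j z_j
    # Outflow conservation: sum_j u[i, j] = x0[i] for every cell i.
    A_eq = np.zeros((M, n_u + M))
    for i in range(M):
        A_eq[i, i * M:(i + 1) * M] = 1.0
    b_eq = np.asarray(x0, dtype=float)
    # Linearized objective: z_j + sum_i u[i, j] <= w_j, with z_j <= 0, so that
    # z_j = min(w_j - x_{t+1, j}, 0) at the optimum.
    A_ub = np.zeros((M, n_u + M))
    for j in range(M):
        for i in range(M):
            A_ub[j, i * M + j] = 1.0
        A_ub[j, n_u + j] = 1.0
    b_ub = np.asarray(w, dtype=float)
    # u[i, j] >= 0 only where j is adjacent to i (third constraint); z_j <= 0.
    bounds = [(0.0, None) if adjacent[i][j] else (0.0, 0.0)
              for i in range(M) for j in range(M)] + [(None, 0.0)] * M
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n_u].reshape(M, M)                   # relaxed dispatch flows
```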
We study the effect of each component of the policy estimation algorithm by comparing variants of the policy: Centralized Temporal Difference (C-TD), Distributed Temporal Difference (D-TD), and Bellman-optimal. All of these variants control the fleet with binary log-linear learning (21), but they differ in how $Q$ is synthesized: C-TD uses (8) and (12), D-TD uses (9) and (13), and the Bellman-optimal policy is synthesized with (6). Equivalently, the D-TD and Bellman-optimal algorithms can be interpreted as the limiting behavior of H-TD² corresponding to the respective cases where $\delta_d = \infty$ and $\delta_d = 0$.
B. Customer Demand Datasets
We consider two datasets of customer requests: a synthetic dataset for a Gridworld environment and the real customer taxi dataset from the city of Chicago [39]. Recall each customer request is defined as follows: $\mathbf{c}^k = [t^k_r, t^k_d, \mathbf{p}^{k,p}, \mathbf{p}^{k,d}]$.
The synthetic dataset is generated as follows. At each timestep, $t \in [t_0, t_0 + \Delta t]$, the customer request model is sampled $n_c$ times, where $n_c$ is the number of customers per timestep of the simulation. The time of the request, $t^k_r$, is uniformly randomly sampled within the timestep. The pickup location, $\mathbf{p}^{k,p}$, is found by sampling a 2-dimensional Gaussian Mixture Model (GMM), where translating Gaussian distributions capture the underlying dynamic customer demand. The dropoff location, $\mathbf{p}^{k,d}$, is a randomly sampled position in the map, and the duration of the trip, $t^k_d$, is given by $\eta(\mathbf{p}^{k,p}, \mathbf{p}^{k,d})$. The GMM model is parameterized by the number of Gaussian distributions, $n_G$, the speed of the distributions, $v_G$, the variance, $\sigma_G$, and the initial position and unit velocity vector of the centroid of each distribution.
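A minimal sketch of sampling one synthetic customer request from the translating-GMM model described above; the unit-square map, the clipping of pickup positions, and the reuse of the `eta` travel-time stand-in from the earlier reward sketch are assumptions of this sketch.

```python
import numpy as np

def sample_customer(t0, dt, centroids, sigma_G, rng):
    """Draw one request c^k = [t_r, t_d, p_pickup, p_dropoff] on a unit-square map.
    centroids: current (n_G, 2) positions of the translating Gaussian modes."""
    t_r = rng.uniform(t0, t0 + dt)                     # request time within the timestep
    mode = rng.integers(len(centroids))                # pick one Gaussian mode
    p_pickup = np.clip(rng.normal(centroids[mode], sigma_G), 0.0, 1.0)
    p_dropoff = rng.uniform(0.0, 1.0, size=2)          # uniform dropoff in the map
    t_d = eta(p_pickup, p_dropoff)                     # trip duration
    return t_r, t_d, p_pickup, p_dropoff

# Between timesteps the centroids translate along their unit velocity vectors,
# e.g. centroids += v_G * velocity_unit_vectors.
```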
The real customer dataset is taken from the city taxi dataset of Chicago [39], filtered by start and end timestamp. The Socrata API permits importing raw data in the $\mathbf{c}^k$ format, where the location data is specified in longitude-latitude coordinates. For this experiment, we load a city map of Chicago as a shapefile and perform minor geometric processing with the Shapely Python toolbox.
C. Results
Gridworld Simulations: We present variants of our H-TD² algorithm against an RHC baseline in a Gridworld environment, as shown in Fig. 3. This flexible, synthetic environment permits us to test on a range of system parameters.
First, we introduce the algorithms in a small-scale simulation. We synthesize a dataset with parameters: $n_c = 5$, $n_G = 2$, $v_G = 0.02625$, $\sigma_G = 0.014$, and randomly initialize the mean and direction of the distributions. Our simulation parameters are: $n_i = 100$, $|\mathcal{S}| = 85$, $\gamma = 0.9$, $\alpha = 0.75$, $n_T = 10$, $\varsigma = 0.014$, $\varepsilon = 0.0187$, $\delta_d = 0.025\,\|Q^b_0\|$, $\beta = 150$, $v_{\text{taxi}} = 0.125$, $\tau = 0.0001$, and $t_{\mathrm{RHC}} = 10$. We run this experiment for each of the algorithms for 5 trials. This experiment is shown in Fig. 3, modified with $n_i = 1000$.
Next, we collect statistics on the cumulative reward of each algorithm and plot the results in Fig. 4. The algorithms behave as expected: in descending order of performance, Bellman, centralized temporal difference, H-TD², distributed temporal difference, followed by the receding horizon control baseline. At the cost of computational effort, the user can tune the performance of H-TD² between the distributed temporal difference and Bellman-optimal solutions by changing $\delta_d$. The simulation is run 5 times, and the mean with standard deviations is visualized in the plot.
Figure 4. Cumulative customer waiting time for different algorithms in the small-scale Gridworld environment. The algorithms behave as expected: in descending order of performance, Bellman, centralized temporal difference, H-TD², distributed temporal difference, followed by the receding horizon control baseline. At the cost of computational effort, the user can tune the performance of H-TD² between the distributed temporal difference and Bellman-optimal solution by changing $\delta_d$. The simulation is run 5 times, and the mean with standard deviations is visualized in the plot.
Figure 5. $Q$-value error trace for different algorithms with respect to the Bellman-optimal. Initially, the H-TD² and distributed temporal difference algorithms behave identically, until the trigger condition is satisfied and the H-TD² requests a global Bellman-optimal update, bringing the error to zero. The $\delta_d$ parameter is set to 2.5% of the norm of the Bellman solution and is shown with a dashed black horizontal line.
To explain the performance difference between the variants of our method, we show an error trace of the policy in Fig. 5. For a given policy $Q_t$, we calculate the error with respect to the Bellman solution: $e^Q_t = \|Q^b_t - Q_t\| / \|Q^b_t\|$. For this experiment, the $\delta_d$ parameter is set to 2.5% of the norm of the Bellman solution, indicated by the dashed horizontal line. Initially, the H-TD² and distributed temporal difference (D-TD) algorithms behave identically, until the trigger condition is satisfied and the H-TD² requests a global Bellman-optimal update, thereby bringing its error to zero. As expected, the centralized temporal difference method, C-TD, generally tends to estimate the $Q$-values better than its distributed counterpart, D-TD.
Figure 6. Performance and scalability analysis of H-TD² and RHC against the number of taxis. In the top subplot, the average reward is shown across a variety of taxi-density regimes, and the proposed algorithm outperforms a receding horizon control baseline by at least 50% in all taxi-density regimes. In the bottom subplot, the computational time is approximately linear with the number of taxis (and taxi density) across 3 orders of magnitude.
Next, we test the proposed algorithm's and baseline's scalability and performance across a wide range of taxi densities and plot the results in Fig. 6. We fix the parameters from the small-scale simulation and only change the number of taxis and the number of customers, where we maintain the ratio $n_i/n_c = 10$. In the top figure, the average reward is shown across a variety of taxi-density regimes, where the H-TD² algorithm outperforms the RHC baseline by almost a factor of 2 in all taxi-density regimes. In the bottom figure, we show that the computational time is approximately linear with the number of taxis (and taxi density) across 3 orders of magnitude. The computational complexity of both RHC and H-TD² scales with the spatial resolution of the simulation, and in practice we limit the maximum number of cells to 200.
Chicago City Simulations: We present H-TD² against an RHC baseline using real customer data from the city of Chicago public dataset [39], in a Chicago map environment. We show that our algorithm outperforms the baseline on practical datasets and demonstrate that online algorithms are robust to irregular urban mobility events.
In Fig. 7, we present the Chicago city taxi customer demand
across an irregular event: Game 5 of the 2016 Baseball
World Series. The map cells show the number of customer
pickup requests, and the green star is Wrigley Field’s (baseball
stadium) location. Below, we plot the customer demand over
time for the cell containing Wrigley Field. We show that a
reward model trained using data from the day before would
not accurately predict the behavior of the next day.
We evaluate the algorithms and plot the results in Fig. 8. Our simulation parameters are: $n_i = 2000$, $|\mathcal{S}| = 156$, $\gamma = 0.8$, $\alpha = 0.1$, $n_T = 10$, $\varsigma = 0.0001$, $\varepsilon = 0.0001$, $\delta_d = 0.025\,\|Q^b_0\|$, $\beta = 1$, $v_{\text{taxi}} = 22$ miles per hour, $\tau = 0.0001$, and $t_{\mathrm{RHC}} = 10$. We train a reward model using the data from October 29th, 2016, with a total of 91,165 customer requests. Then, we collect 54,115 customer requests from October 30th, 2016, which we reveal in real-time to the H-TD² and RHC dispatch algorithms. In total, the H-TD² algorithm has a total customer waiting time of 501 hours, an improvement of 26% over