GLAS: Global-to-Local Safe Autonomy Synthesis
for Multi-Robot Motion Planning with End-to-End
Learning
Benjamin Rivière, Wolfgang Hönig, Yisong Yue, and Soon-Jo Chung
Abstract—We present GLAS: Global-to-Local Autonomy Synthesis, a provably-safe, automated distributed policy generation method for multi-robot motion planning. Our approach combines the advantage of centralized planning, avoiding local minima, with the advantages of decentralized controllers, scalability and distributed computation. In particular, our synthesized policies only require relative state information of nearby neighbors and obstacles, and compute a provably-safe action. Our approach has three major components: i) we generate demonstration trajectories using a global planner and extract local observations from them, ii) we use deep imitation learning to learn a decentralized policy that can run efficiently online, and iii) we introduce a novel differentiable safety module to ensure collision-free operation, thereby allowing for end-to-end policy training. Our numerical experiments demonstrate that our policies have a 20% higher success rate than optimal reciprocal collision avoidance (ORCA) across a wide range of robot and obstacle densities. We demonstrate our method on an aerial swarm, executing the policy on low-end microcontrollers in real-time.
Index Terms—Distributed Robot Systems, Path Planning for Multiple Mobile Robots or Agents, Imitation Learning
I. INTRODUCTION
Teams of robots that are capable of navigating in dynamic and occluded environments are important for applications in next generation factories, urban search and rescue, and formation flying in cluttered environments or in space.
Current centralized approaches can plan such motions with
completeness guarantees, but require full state information
not available to robots on-board, and are too computationally
expensive to run in real-time. Distributed approaches instead
use local decoupled optimization, but often cause robots to
get trapped in local minima in cluttered environments. Our
approach, GLAS, bridges this gap by using a global planner
offline to learn a decentralized policy that can run efficiently
online. We can thus automatically synthesize an efficient
policy that avoids getting trapped in many cases. Unlike
other learning-based methods for motion planning, GLAS
Manuscript received: February 24, 2020; Accepted April 20, 2020.
The authors are with the California Institute of Technology, USA. {briviere, whoenig, yyue, sjchung}@caltech.edu.
This paper was recommended for publication by Editor Nak Young Chong upon evaluation of the Associate Editor and Reviewers' comments. This work was supported by the Raytheon Company and the Caltech/NASA Jet Propulsion Laboratory. Video: https://youtu.be/z9LjSfLfG6c. Code: https://github.com/bpriviere/glas.
Fig. 1. We learn distributed policies using trajectories from a global planner.
We mask non-local information with an observation model, and perform
imitation learning on an observation-action dataset. In the policy, the grey
boxes are static obstacles, the grey circles are other robots, and the blue
square is the robot’s goal position.
operates in continuous state space with a time-varying number of neighbors and generates provably safe, dynamically-coupled policies. We demonstrate in simulation that our policy achieves significantly higher success rates compared to ORCA, a state-of-the-art decentralized approach for single integrator dynamics. We also extend our approach to double integrator dynamics, and demonstrate that our synthesized policies work well on a team of quadrotors with low-end microcontrollers.
The overview of GLAS is shown in Fig. 1. First, we generate
trajectories for random multi-robot motion planning instances
using a global planner. Second, we apply a local observation
model to generate a dataset of observation-action pairs. Third,
we perform deep imitation learning to generate a local policy
that imitates the expert (the global policy) to avoid local
minima. The vector field of Fig. 1 shows that the policy
successfully avoids local minima traps between obstacles and
gridlock between robots.
We guarantee the safety of our policy through a convex combination of the learned desired action and our safety module. Instead of using existing optimization-based methods, we derive an analytic form for our safety module. The advantage of this safety module is that it is fully differentiable, which enables us to train our policy in an end-to-end fashion, resulting in policies that use less control effort.
Our main contributions are: i) to our knowledge, this is the first approach that automatically synthesizes a local policy from global demonstrations in a continuous state/action domain while guaranteeing safety, and ii) the derivation of a novel differentiable safety module compatible with end-to-end learning for dynamically-coupled motion planning.
We show that our policies outperform state-of-the-art distributed multi-robot motion planners for single integrator dynamics in terms of success rate in a wide range of robot/obstacle density cases. We also implement GLAS for more physically-realistic double integrator dynamics in both simulation and experimentation with an aerial swarm, demonstrating real-time computation on low-end microcontrollers.
A. Related Work

Multi-robot motion planning is an active area of research because it is a non-convex optimization problem with high state and action dimensionality. We compare the present work with state-of-the-art methods: (a) collision avoidance controllers, (b) optimal motion planners, and (c) deep-learning methods.

Collision Avoidance: Traditional controller-level approaches include Optimal Reciprocal Collision Avoidance (ORCA) [1], Buffered Voronoi Cells [2], Artificial Potential Functions [3–5], and Control Barrier Functions [6]. These methods are susceptible to trapping robots in local minima. We address this problem explicitly by imitating a complete global planner with local information. For optimal performance, we propose to learn a controller end-to-end, including the safety module. Existing methods based on optimization [1, 2, 6] are challenging to backpropagate through for end-to-end training. Existing analytic methods [3] do not explicitly consider gridlocks, where robots' respective barriers cancel each other and disturbances can cause the system to violate safety. Thus, we derive a novel differentiable safety module. This system design of fully differentiable modules for end-to-end backpropagation is also explored in reinforcement learning [7] and estimation [8].
Motion Planners: Motion planners are a higher-level approach that explicitly solves the optimal control problem over a time horizon. Solving the optimal control problem is non-convex, so most recent works with local guarantees use approximate methods like Sequential Convex Programming to quickly reach a solution [9, 10]. Motion planners are distinguished as either global and centralized [10] or local and decentralized [9, 11], depending on whether solutions are computed in the joint space or by each robot individually.

GLAS is inherently scalable because it is computed at each robot. Although it has no completeness guarantee, we empirically show it avoids local minima more often than conventional local methods. The local minima issue shared by all local methods is a natural trade-off of decentralized algorithms. Our method explores this trade-off explicitly by imitating a complete, global planner with only local information.
Deep Learning Methods: Recently, there have been new learning-based approaches for multi-robot path planning [12–16]. These works use deep Convolutional Neural Networks (CNN) in a discrete state/action domain. Such discretization prevents coupling to higher-order robot dynamics, whereas our solution permits tight coupling to the system dynamics by operating in a continuous state/action domain using a novel network architecture based on Deep Sets [17]. Deep Sets are a relatively compact representation that leverages the permutation-invariant nature of the underlying interactions, resulting in a less computationally-expensive solution than CNNs. In contrast to our work, the Neural-Swarm approach [18] uses Deep Sets to augment a tracking controller for close-proximity flight.

Imitation Learning (IL) can imitate an expensive planner [12–14, 19], thereby replacing the optimal control or planning solver with a function that approximates the solution. GLAS uses IL and additionally changes the input domain from full state information to a local observation, thereby enabling us to synthesize a decentralized policy from a global planner.
II. PROBLEM FORMULATION
Notation: We denote vectors with boldface lowercase, functions with an italic lowercase letter, scalar parameters with plain lowercase, subspaces/sets with calligraphic uppercase, and matrices with plain uppercase. We denote a robot index with a superscript $i$ or $j$, and a state or action without the robot index denotes a joint-space variable of stacked robot states. A double superscript index denotes a relative vector, for example $s^{ij} = s^j - s^i$. We use a continuous time domain, and we suppress the explicit time dependency of states, actions, and dependent functions for notational simplicity.
Problem Statement: Let $\mathcal{V}$ denote the set of $n_i$ robots, $\mathcal{G} = \{g^1, \ldots, g^{n_i}\}$ denote their respective goal states, $\mathcal{S}_0 = \{s_0^1, \ldots, s_0^{n_i}\}$ denote their respective start states, and $\Omega$ denote the set of $m$ static obstacles. At time $t$, each robot $i$ makes a local observation, $o^i$, uses it to formulate an action, $u^i$, and updates its state, $s^i$, according to the dynamical model. Our goal is to find a controller, $u : \mathcal{O} \rightarrow \mathcal{U}$, that synthesizes actions from local observations through:
$$o^i = y(i, s), \quad u^i = u(o^i), \quad \forall i, t \tag{1}$$
to approximate the solution to the optimal control problem:
$$\begin{aligned}
u = \underset{\{u^i \mid \forall i,t\}}{\operatorname{argmin}} \; & c(s, u) \\
\text{s.t.} \quad & \dot{s}^i = f(s^i, u^i) \quad \forall i, t \\
& s^i(0) = s_0^i, \quad s^i(t_f) = g^i, \quad s \in \mathcal{X} \quad \forall i \\
& \|u^i\|_2 \le u_{\max} \quad \forall i, t
\end{aligned} \tag{2}$$
where $\mathcal{O}$, $\mathcal{U}$, and $\mathcal{S}$ are the observation, action, and state space, $y$ is the local observation model, $c$ is some cost function, $f$ is the dynamical model, $\mathcal{X} \subset \mathcal{S}$ is the safe set capturing safety of all robots, $t_f$ is the time of the simulation, and $u_{\max}$ is the maximum control bound.
Dynamical Model: We consider both single and double integrator systems. The single integrator state and action are position and velocity vectors in $\mathbb{R}^{n_q}$, respectively. The double integrator state is a stacked position and velocity vector in $\mathbb{R}^{2n_q}$, and the action is a vector of accelerations in $\mathbb{R}^{n_q}$. Here, $n_q$ is the dimension of the workspace. The dynamics of the $i^{th}$ robot for single and double integrator systems are:
$$\dot{s}^i = \dot{p}^i = u^i \quad \text{and} \quad \dot{s}^i = \begin{bmatrix}\dot{p}^i \\ \dot{v}^i\end{bmatrix} = \begin{bmatrix}v^i \\ u^i\end{bmatrix}, \tag{3}$$
respectively, where $p^i$ and $v^i$ denote position and velocity.
Observation Model: We are primarily focused on studying the transition from global to local, which is defined via an observation model, $y : \mathcal{V} \times \mathcal{S} \rightarrow \mathcal{O}$. An observation is:
$$o^i = \left[e^{ii}, \{s^{ij}\}_{j \in \mathcal{N}_\mathcal{V}^i}, \{s^{ij}\}_{j \in \mathcal{N}_\Omega^i}\right], \tag{4}$$
where $e^{ii} = g^i - s^i$, and $\mathcal{N}_\mathcal{V}^i$, $\mathcal{N}_\Omega^i$ denote the neighboring sets of robots and obstacles, respectively. These sets are defined by the observation radius, $r_{sense}$, e.g.,
$$\mathcal{N}_\mathcal{V}^i = \{j \in \mathcal{V} \mid \|p^{ij}\|_2 \le r_{sense}\}. \tag{5}$$
We encode two different neighbor sets because we input robots and obstacles through respective sub-networks of our neural network architecture in order to generate heterogeneous behavior in reaction to different neighbor types. We denote the union of the neighboring sets as $\mathcal{N}^i$.
Performance Metrics: To evaluate performance, we have two criteria as specified by the optimal control problem. We define our metrics over the set of successful robots, $\mathcal{I}$, that reach their goal and have no collisions:
$$\mathcal{I} = \{i \in \mathcal{V} \mid s^i(t_f) = g^i \text{ and } \|p^{ij}(t)\| > r_{safe}, \ \forall j, t\}. \tag{6}$$
Our first metric of success, $r_s$, is the number of successful robots, and our second metric, $r_p$, is the cost of deploying a successful robot trajectory. For example, if the cost function $c(s, u)$ is the total control effort, the performance metrics are:
$$r_s = |\mathcal{I}|, \quad \text{and} \quad r_p = \sum_{i \in \mathcal{I}} \int_0^{t_f} \|u^i\|^2 \, dt. \tag{7}$$
III. ALGORITHM DESCRIPTION AND ANALYSIS: GLAS
In this section, we derive the method of GLAS, Global-to-Local Autonomy Synthesis, to find the policy $u$:
$$u(o^i) = \alpha_\pi(o^i)\,\pi(o^i) + (1 - \alpha_\pi(o^i))\,b(o^i), \tag{8}$$
where $\pi : \mathcal{O} \rightarrow \mathcal{U}$ is a learned function, $b : \mathcal{O} \rightarrow \mathcal{U}$ is a safety control module to ensure safety, and $\alpha_\pi : \mathcal{O} \rightarrow [0, 1]$ is an adaptive gain function. We discuss each of the components of the controller in this section. The overview of our controller architecture is shown in Fig. 2.
A. Neural Policy Synthesis via Deep Imitation Learning
We describe how to synthesize the neural policy
π
that
imitates the behavior of an expert, where the expert refers
to a global optimal planner. Explicitly, we take batches of
observation-action pairs from an expert demonstration dataset
and we train a neural network by minimizing the loss on
output action given an observation. To use this method,
we need to generate an observation-action pair dataset from
expert demonstration and design a deep learning architecture
compatible with dynamic sensing network topologies.
Fig. 2. Neural network architecture consisting of 5 feed-forward components. Each relative location of an obstacle is evaluated in $\phi_\Omega$; the sum of all $\phi_\Omega$ outputs is the input of $\rho_\Omega$ (deep set for obstacles). Another deep set ($\phi_\mathcal{V}$ and $\rho_\mathcal{V}$) is used for neighboring robot positions. The desired control $\pi$ is computed by $\Psi$, which takes the output of $\rho_\Omega$, $\rho_\mathcal{V}$, and the relative goal location as input. The actual number of hidden layers is larger than shown. The formula for the single integrator $b$ function is shown.
1) Generating Demonstration Data: Our dataset is generated using expert demonstrations from an existing centralized planner [10]. This planner is resolution-complete and avoids local minima; it is computationally efficient, so we can generate large expert demonstration datasets; and it uses an optimization framework that can minimize control effort, so the policy imitates a solution with high performance according to the previously defined metrics. Specifically, we create our dataset by generating maps with (i) fixed-size static obstacles with random uniformly sampled grid positions and (ii) start/goal positions for a variable number of robots, and then by computing trajectories using the centralized planner. For each timestep and robot, we retrieve the local observation, $o^i$, by masking the non-local information with the observation model $y$, and retrieve the action, $u^i$, through the appropriate derivative of the robot $i$ trajectory, see Fig. 1. We repeat this process $n_{case}$ times for each robot/obstacle case. Our dataset, $\mathcal{D}$, is:
$$\mathcal{D} = \{(o^i, u^i)_k \mid \forall i \in \mathcal{V}, \ k \in \{1 \ldots n_{case}\}, \ \forall t\}. \tag{9}$$
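To make this data-generation step concrete, the following is a minimal Python sketch of extracting observation-action pairs from global demonstrations. Here `global_plan` stands in for the centralized planner of [10]; all names and signatures are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def observe(i, positions, goals, obstacles, r_sense):
    """Local observation o^i per Eq. (4): relative goal plus relative
    positions of robots and obstacles within the sensing radius."""
    e_ii = goals[i] - positions[i]
    nbr = [positions[j] - positions[i] for j in range(len(positions))
           if j != i and np.linalg.norm(positions[j] - positions[i]) <= r_sense]
    obs = [o - positions[i] for o in obstacles
           if np.linalg.norm(o - positions[i]) <= r_sense]
    return e_ii, nbr, obs

def build_dataset(cases, r_sense, dt=0.5):
    """Mask each global trajectory into local observation-action pairs (Eq. 9)."""
    data = []
    for starts, goals, obstacles in cases:
        trajs = global_plan(starts, goals, obstacles)  # hypothetical planner call
        for t in range(trajs.shape[0] - 1):            # trajs: (T, n_robots, dim)
            for i in range(trajs.shape[1]):
                o_i = observe(i, trajs[t], goals, obstacles, r_sense)
                u_i = (trajs[t + 1, i] - trajs[t, i]) / dt  # finite-difference action
                data.append((o_i, u_i))
    return data
```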
2) Model Architecture with Deep Sets: The number of visible neighboring robots and obstacles can vary dramatically during each operation, which causes the dimensionality of the observation vector to be time-varying. Leveraging the permutation invariance of the observation, we model a variable number of robots and obstacles with the Deep Set architecture [17, 18]. Theorem 7 from [17] establishes this property:

Theorem 1: A function $f : [0, 1]^l \rightarrow \mathbb{R}$ is a permutation-invariant continuous function iff it has the representation:
$$f(x_1, \ldots, x_l) = \rho\left(\sum_{m=1}^{l} \phi(x_m)\right), \tag{10}$$
for some continuous outer and inner functions $\rho : \mathbb{R}^{l+1} \rightarrow \mathbb{R}$ and $\phi : \mathbb{R} \rightarrow \mathbb{R}^{l+1}$, respectively. The inner function $\phi$ is independent of the function $f$.

Intuitively, the $\phi$ function acts as a contribution from each element in the set, and the $\rho$ function acts to combine the contributions of each element. In effect, the policy can learn
the contribution of the neighboring set of robots and obstacles
with the following network structure:
$$\begin{aligned}
\pi(o^i)_n &= \Psi\left(\left[\rho_\Omega\Big(\textstyle\sum_{j \in \mathcal{N}_\Omega^i} \phi_\Omega(s^{ij})\Big);\ \rho_\mathcal{V}\Big(\textstyle\sum_{j \in \mathcal{N}_\mathcal{V}^i} \phi_\mathcal{V}(s^{ij})\Big)\right]\right), \\
\pi(o^i) &= \pi(o^i)_n \min\left\{\pi_{\max} / \|\pi(o^i)_n\|_2,\ 1\right\},
\end{aligned} \tag{11}$$
where the semicolon denotes a stacked vector and $\Psi$, $\rho_\Omega$, $\phi_\Omega$, $\rho_\mathcal{V}$, $\phi_\mathcal{V}$ are feed-forward networks of the form:
$$FF(x) = W_l\, \sigma(\ldots W_1\, \sigma(x)), \tag{12}$$
where $FF$ is a feed-forward network on input $x$, $W_l$ is the weight matrix of the $l^{th}$ layer, and $\sigma$ is the activation function. We define the parameters for each of the 5 networks in Sec. IV. We also scale the output of the $\pi$ network to always be less than $\pi_{\max}$, to maintain consistency with our baselines.
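A minimal PyTorch sketch of this Deep Set policy follows, using the layer sizes reported in Sec. IV; the class and variable names are ours, and the $\pi_{\max}$ value is an assumption for illustration.

```python
import torch
import torch.nn as nn

def ff(sizes):
    """Feed-forward network per Eq. (12) with ReLU activations."""
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # no activation on the output layer

class GlasPolicy(nn.Module):
    def __init__(self, pi_max=0.5):  # pi_max value assumed, not from the paper
        super().__init__()
        self.phi_omega = ff([2, 64, 16])   # per-obstacle contribution
        self.rho_omega = ff([16, 64, 16])  # combines obstacle contributions
        self.phi_v = ff([2, 64, 16])       # per-neighbor-robot contribution
        self.rho_v = ff([16, 64, 16])
        self.psi = ff([34, 64, 2])         # 16 + 16 + 2 (relative goal) inputs
        self.pi_max = pi_max

    def forward(self, e_goal, nbr_robots, nbr_obstacles):
        # Sum-pool over each set, making the policy permutation invariant (Thm. 1).
        h_v = self.rho_v(self.phi_v(nbr_robots).sum(dim=1))
        h_o = self.rho_omega(self.phi_omega(nbr_obstacles).sum(dim=1))
        pi_n = self.psi(torch.cat([h_o, h_v, e_goal], dim=-1))
        # Scale the output to stay below pi_max, per Eq. (11).
        norm = pi_n.norm(dim=-1, keepdim=True)
        return pi_n * torch.clamp(self.pi_max / (norm + 1e-9), max=1.0)
```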
3) End-to-End Training: We train the neural policy $\pi$ with knowledge of the safety module $b$, to synthesize a controller $u$ with symbiotic components. We train through the output of $u$, not $\pi$, even though $b$ has no tunable parameters. In effect, the parameters of $\pi$ are updated such that the policy $\pi$ smoothly interacts with $b$ while imitating the global planner. Compared to a solution that trains through the output of $\pi$, end-to-end learning generates solutions with lower control effort, measured through the $r_p$ metric (7), see Fig. 3.
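A minimal sketch of this end-to-end training step is below, assuming `policy` is the Deep Set network above and `safety_module` is a differentiable (torch-based) implementation of Eqs. (19)-(20); both names, and the data `loader`, are our illustrative assumptions. The key point is that gradients flow through the convex combination of Eq. (8), not through $\pi$ alone.

```python
import torch

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
mse = torch.nn.MSELoss()

for e_goal, nbrs, obst, u_expert in loader:       # batched (o^i, u^i) pairs
    pi = policy(e_goal, nbrs, obst)               # learned heuristic pi(o^i)
    alpha, b = safety_module(nbrs, obst)          # analytic, differentiable
    u = alpha * pi + (1.0 - alpha) * b            # Eq. (8)
    loss = mse(u, u_expert)                       # loss on u, not on pi
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```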
4) Additional Methods in Training: We apply additional preprocessing methods to the observation to improve our training process performance and to regularize the data. We denote preprocessed data with an apostrophe; e.g., the input to the neural network is an observation vector denoted by $o'^i$.

We scale the relative goal vector observation as follows:
$$e'^{ii} = \alpha_g e^{ii}, \quad \text{where } \alpha_g = \min\left\{r_{sense}/\|e^{ii}\|,\ 1\right\}. \tag{13}$$
This regularizes cases when the goal is beyond the sensing radius. In such cases, the robot needs to avoid any robots/obstacles and continue toward the goal, but the magnitude of $e^{ii}$ outside the sensing region is not important.

We cap the maximum cardinality of the neighbor and obstacle sets with $N_\mathcal{V}$ and $N_\Omega$, e.g.,
$$\mathcal{N}_\mathcal{V}'^i = \{j \in \mathcal{N}_\mathcal{V}^i \mid N_\mathcal{V}\text{-closest robots w.r.t. } \|p^{ij}\|\}. \tag{14}$$
This enables batching of observation vectors into fixed-dimension tensors for fast training, and upper bounds the evaluation time of $\pi$ to guarantee real-time performance on hardware in large swarm experiments. We include an example observation encoding in Fig. 1.
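A short sketch of this preprocessing, under the assumption that observations are stored as NumPy vectors (names and constants are ours):

```python
import numpy as np

def preprocess(e_ii, nbr_robots, nbr_obstacles, r_sense, max_nbr=6, max_obs=6):
    # Scale the relative goal so its magnitude never exceeds r_sense (Eq. 13).
    alpha_g = min(r_sense / (np.linalg.norm(e_ii) + 1e-9), 1.0)
    e_scaled = alpha_g * e_ii
    # Keep only the closest max_nbr robots and max_obs obstacles (Eq. 14).
    nbr = sorted(nbr_robots, key=np.linalg.norm)[:max_nbr]
    obs = sorted(nbr_obstacles, key=np.linalg.norm)[:max_obs]
    return e_scaled, nbr, obs
```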
B. System Safety

We adopt the formulation of safe sets used in control barrier functions and define the global safe set $\mathcal{X}$ as the super-level set of a global safety function $g : \mathcal{S} \rightarrow \mathbb{R}$. We define this global safety as the minimum of local safety functions, $h : \mathcal{O} \rightarrow \mathbb{R}$, that specify pairwise collision avoidance between all objects in the environment:
$$\mathcal{X} = \{s \in \mathcal{S} \mid g(s) > 0\}, \tag{15}$$
$$g(s) = \min_{i,j} h(p^{ij}), \quad h(p^{ij}) = \frac{\|p^{ij}\| - r_{safe}}{r_{sense} - r_{safe}}, \tag{16}$$
where $p^{ij}$ denotes the vector from the center of object $i$ to the closest point on object $j$. This allows us to consistently define $r_{safe}$ as the radius of the robot, where $r_{safe} < r_{sense}$.
Intuitively, if a collision occurs between robots $i, j$, then $h(p^{ij}) < 0$ and $g(s) < 0$, implying that the system is not safe. In order to synthesize local controls with guaranteed global safety, we need to show non-local safety functions cannot violate global safety. Consider a pair of robots outside of the neighborhood, $\|p^{ij}\| > r_{sense}$. Clearly, $h(p^{ij}) > 1$, implying this interaction is always safe.
C. Controller Synthesis

We use these safety functions to construct a global potential function, $\psi : \mathcal{S} \rightarrow \mathbb{R}$, that becomes unbounded when any safety is violated, which resembles the logarithmic barrier functions used in the interior point method in optimization [20]. Similarly, we can construct a local function, $\psi^i : \mathcal{O} \rightarrow \mathbb{R}$:
$$\psi(s) = -\log \prod_i \prod_{j \in \mathcal{N}^i} h(p^{ij}), \tag{17}$$
$$\psi^i(o^i) = -\log \prod_{j \in \mathcal{N}^i} h(p^{ij}). \tag{18}$$
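For concreteness, a minimal numerical sketch of these functions and of the gradient that appears later in Eq. (23) is given below; `p_rel` is the list of relative vectors $p^{ij}$ for $j \in \mathcal{N}^i$, and the function names are ours.

```python
import numpy as np

def h(p_ij, r_safe, r_sense):
    # Eq. (16): normalized distance, in (0, 1] inside the sensing region.
    return (np.linalg.norm(p_ij) - r_safe) / (r_sense - r_safe)

def psi_i(p_rel, r_safe, r_sense):
    # Eq. (18): -log of the product of pairwise safety functions; grows
    # without bound as any neighbor approaches r_safe.
    return -sum(np.log(h(p, r_safe, r_sense)) for p in p_rel)

def grad_psi_i(p_rel, r_safe):
    # Gradient of psi^i w.r.t. p^i (Eq. 23); each term points toward
    # neighbor j and blows up near contact.
    return sum(p / (np.linalg.norm(p) * (np.linalg.norm(p) - r_safe))
               for p in p_rel)
```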
We use the local potential $\psi^i$ to synthesize the safety control module $b$. We first state some assumptions.
Assumption 1: Initially, the distance between all objects is at least $r_{safe} + \Delta_r$, where $\Delta_r$ is a user-specified parameter.

Assumption 2: We assume robot $i$'s geometry not to exceed a ball of radius $r_{safe}$ centered at $p^i$.
Theorem 2: For the single integrator dynamics (3), the safety defined by (15) is guaranteed under the control law (8) with the following $b(o^i)$ and $\alpha_\pi(o^i)$ for scalar gains $k_p > 0$ and $k_c > 0$:
$$b(o^i) = -k_p \nabla_{p^i} \psi^i(o^i) \tag{19}$$
$$\alpha_\pi(o^i) = \begin{cases} \dfrac{(k_p - k_c)\|\nabla_{p^i}\psi^i(o^i)\|^2}{k_p\|\nabla_{p^i}\psi^i(o^i)\|^2 + |\langle\nabla_{p^i}\psi^i(o^i),\ \pi(o^i)\rangle|} & h(o^i) < 0 \\ 1 & \text{else} \end{cases} \tag{20}$$
with $h(o^i) = \min_{j \in \mathcal{N}^i} h(p^{ij}) - \Delta_r$, and $\nabla_{p^i}\psi^i$ as in (23).
Proof: We prove global safety by showing boundedness of $\psi^i(o^i)$, because a bounded $\psi(s)$ implies no safety violation. Since $\psi(s)$ is a sum of functions $\psi^i(o^i)$, it suffices to show that all $\psi^i(o^i)$ are bounded. To prove the boundedness of $\psi^i(o^i)$, we use a Lyapunov method. First, we note that $\psi^i(o^i)$ is an appropriate positive Lyapunov function candidate as $\psi^i(o^i) > 0 \ \forall o^i$, using the fact that $-\log(x) \in (0, \infty), \ \forall x \in (0, 1)$.
Second, we show $s \in \mathcal{X}^i \Rightarrow \dot{\psi}^i(o^i) < 0$, where the boundary-layer domain $\mathcal{X}^i \subset \mathcal{S}$ is defined as:
$$\mathcal{X}^i = \{s \mid 0 < \min_{j \in \mathcal{N}^i} h(p^{ij}) < \Delta_r\}. \tag{21}$$
This result implies that upon entering the boundary layer of the safe set, $\mathcal{X}^i$, the controller will push the system back into the interior of the safe set, $\mathcal{S} \setminus \mathcal{X}^i$.
We take the time derivative of $\psi^i$ along the system dynamics, and plug the single integrator dynamics (3) and controller (8) into $\dot{\psi}^i$:
$$\dot{\psi}^i = \langle\nabla_s\psi^i, \dot{s}\rangle = \sum_{i=1}^{n_i}\langle\nabla_{s^i}\psi^i, \dot{s}^i\rangle = \sum_{i=1}^{n_i}\langle\nabla_{p^i}\psi^i, u^i\rangle = \sum_{i=1}^{n_i}\langle\nabla_{p^i}\psi^i,\ \alpha_\pi\pi + (1 - \alpha_\pi)b\rangle, \tag{22}$$
where
$$\nabla_{p^i}\psi^i = \sum_{j \in \mathcal{N}^i} \frac{p^{ij}}{\|p^{ij}\|(\|p^{ij}\| - r_{safe})}. \tag{23}$$
From here, establishing that any element of the sum is negative when $s \in \mathcal{X}^i$ implies the desired result. Expanding an arbitrary element at the $i^{th}$ index with the definition of $b$:
$$\begin{aligned}
\langle\nabla_{p^i}\psi^i,\ \alpha_\pi\pi + (1-\alpha_\pi)b\rangle &= -k_p\|\nabla_{p^i}\psi^i\|^2 + \alpha_\pi\left(\langle\nabla_{p^i}\psi^i, \pi\rangle + k_p\|\nabla_{p^i}\psi^i\|^2\right) \\
&\le -k_p\|\nabla_{p^i}\psi^i\|^2 + \alpha_\pi\left(|\langle\nabla_{p^i}\psi^i, \pi\rangle| + k_p\|\nabla_{p^i}\psi^i\|^2\right)
\end{aligned} \tag{24}$$
By plugging $\alpha_\pi$ from (20) into (24), we arrive at the following:
$$\langle\nabla_{p^i}\psi^i,\ \alpha_\pi\pi + (1-\alpha_\pi)b\rangle \le -k_c\|\nabla_{p^i}\psi^i\|^2. \tag{25}$$
Thus, every element is strictly negative, unless $\nabla_{p^i}\psi^i$ equals zero; this is caused either by reaching the safety equilibrium of the system at $h(p^{ij}) \equiv 1, \ \forall j \in \mathcal{N}^i$, or in the case of deadlock between robots.
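As an illustration of how Theorem 2 translates into a discrete-time controller, the sketch below blends the learned action with the safety module per Eqs. (8), (19), (20), including the $\varepsilon$ modification discussed in Remark 1 further below. It reuses `h` and `grad_psi_i` from the earlier sketch; the gains are illustrative, not the values used in the paper.

```python
import numpy as np

def safe_action(pi, p_rel, r_safe, r_sense, delta_r,
                k_p=1.0, k_c=0.0, eps=0.01):
    grad = grad_psi_i(p_rel, r_safe)                  # Eq. (23)
    b = -k_p * grad                                   # Eq. (19)
    h_min = min(h(p, r_safe, r_sense) for p in p_rel)
    if h_min - delta_r < 0:                           # boundary layer active
        g2 = float(np.dot(grad, grad))
        alpha = (k_p - k_c) * g2 / (k_p * g2          # Eq. (20)
                                    + abs(float(np.dot(grad, pi))) + 1e-9)
    else:
        alpha = 1.0 - eps                             # discrete-time case, Remark 1
    return alpha * pi + (1.0 - alpha) * b             # Eq. (8)
```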
Theorem 3: For the double integrator dynamics given in (3), the safety defined by (15) is guaranteed under control law (8) for the barrier-controller and the gain defined as:
$$b = -k_v(v^i + k_p\nabla_{p^i}\psi^i) - k_p\frac{d}{dt}\nabla_{p^i}\psi^i - k_p\nabla_{p^i}\psi^i,$$
$$\alpha_\pi = \begin{cases}\dfrac{a_1 - k_c\left(k_p\psi^i + \frac{1}{2}\|v - k\|^2\right)}{a_1 + |a_2|} & h(o^i) < 0\\ 1 & \text{else,}\end{cases} \tag{26}$$
$$a_1 = k_v\|v^i + k_p\nabla_{p^i}\psi^i\|^2 + k_p^2\|\nabla_{p^i}\psi^i\|^2,$$
$$a_2 = \langle v^i,\ k_p\nabla_{p^i}\psi^i\rangle + \langle v^i + k_p\nabla_{p^i}\psi^i,\ \pi + k_p\tfrac{d}{dt}\nabla_{p^i}\psi^i\rangle,$$
where $k_p > 0$, $k_c > 0$, and $k_v > 0$ are scalar gains, $h$ is defined as in Theorem 2, $\frac{d}{dt}\nabla_{p^i}\psi^i$ is defined in (32), and the dependency on the observation is suppressed for legibility.
Proof: We take the same proof approach as in Theorem 2 to show boundedness of $\psi^i$. We define a Lyapunov function, $V$, augmented with a backstepping term:
$$V = k_p\psi^i + \frac{1}{2}\|v - k\|^2, \tag{27}$$
where $v$ is the stacked velocity vector, $v = [v^1; \ldots; v^{n_i}]$, and $k$ is the stacked nominally stabilizing control, $k = -k_p[\nabla_{p^1}\psi^1; \ldots; \nabla_{p^{n_i}}\psi^{n_i}]$. For the same reasoning as the previous result, $V$ is a positive function, and thus an appropriate Lyapunov candidate. Taking the derivative along the system dynamics (3):
$$\dot{V} = \sum_{i=1}^{n_i} k_p\langle\nabla_{s^i}\psi^i, \dot{s}^i\rangle + \langle v^i - k^i,\ u^i - \dot{k}^i\rangle \tag{28}$$
$$= \sum_{i=1}^{n_i} k_p\langle\nabla_{p^i}\psi^i, v^i\rangle + \langle v^i - k^i,\ u^i - \dot{k}^i\rangle. \tag{29}$$
From here, establishing that an arbitrary element of the sum is negative when $s \in \mathcal{X}^i$ implies the desired result. We rewrite the first inner product in this expression as:
$$\langle\nabla_{p^i}\psi^i, v^i\rangle = \langle\nabla_{p^i}\psi^i, k^i\rangle + \langle\nabla_{p^i}\psi^i, (v^i - k^i)\rangle = -k_p\|\nabla_{p^i}\psi^i\|^2 + \langle\nabla_{p^i}\psi^i, (v^i - k^i)\rangle. \tag{30}$$
Next, we expand the second inner product of (29), and plug $k^i$, $u^i$, and $b$ into (29):
$$\begin{aligned}
\langle v^i + k_p\nabla_{p^i}\psi^i,\ u^i + k_p\tfrac{d}{dt}\nabla_{p^i}\psi^i\rangle = -k_v\|v^i + k_p\nabla_{p^i}\psi^i\|^2 \\
\quad - k_p\langle\nabla_{p^i}\psi^i,\ v^i + k_p\nabla_{p^i}\psi^i\rangle + \alpha_\pi\langle v^i + k_p\nabla_{p^i}\psi^i,\ \pi - b\rangle,
\end{aligned} \tag{31}$$
where
$$\frac{d}{dt}\nabla_{p^i}\psi^i = \sum_{j \in \mathcal{N}^i}\left[\frac{-v^i}{\|p^{ij}\|(\|p^{ij}\| - r_{safe})} + \frac{\langle p^{ij}, v^i\rangle\,(2\|p^{ij}\| - r_{safe})\,p^{ij}}{\|p^{ij}\|^3(\|p^{ij}\| - r_{safe})^2}\right]. \tag{32}$$
Combining both terms back into the expression:
$$\begin{aligned}
k_p\langle\nabla_{p^i}\psi^i, v^i\rangle + \langle v^i - k^i,\ u^i - \dot{k}^i\rangle = -k_p^2\|\nabla_{p^i}\psi^i\|^2 \\
\quad - k_v\|v^i + k_p\nabla_{p^i}\psi^i\|^2 + \alpha_\pi\langle v^i + k_p\nabla_{p^i}\psi^i,\ \pi - b\rangle.
\end{aligned} \tag{33}$$
Expanding the last term with the definition of $b$, and upper bounding it:
$$\begin{aligned}
\langle v^i + k_p\nabla_{p^i}\psi^i,\ \pi - b\rangle &= \left(k_v\|v^i + k_p\nabla_{p^i}\psi^i\|^2 + k_p^2\|\nabla_{p^i}\psi^i\|^2\right) \\
&\quad + \left(\langle v^i,\ k_p\nabla_{p^i}\psi^i\rangle + \langle v^i + k_p\nabla_{p^i}\psi^i,\ \pi + k_p\tfrac{d}{dt}\nabla_{p^i}\psi^i\rangle\right) \\
&= a_1 + a_2 \le a_1 + |a_2|,
\end{aligned} \tag{34}$$
where the terms in parentheses are grouped into $a_1, a_2$ to improve legibility. By plugging $\alpha_\pi$ from (26) into (33), we arrive at $\dot{V} \le -k_c V$, which results in exponential stability that guarantees the system will remain safe by pushing it towards a safety equilibrium where $h(p^{ij}) \equiv 1, \ \forall j \in \mathcal{N}^i$. It also makes the system robust to disturbances [21].
We make some remarks on the results of the proofs.

Remark 1: The setup for this proof is to give $\pi$ maximal authority without violating safety. This trade-off is characterized through the design parameter $\Delta_r$ and the gains $k_p$ and $k_v$ that control the conservativeness of the algorithm. For the discrete implementation of this algorithm, we introduce a parameter, $\varepsilon \ll 1$, to artificially decrease $\alpha_\pi$ such that $\alpha_\pi = 1 - \varepsilon$ when $h > 0$. The results (in continuous time) of the proof still hold, as $\alpha_\pi$ is a scalar gain on a destabilizing term, so a lower $\alpha_\pi$ further stabilizes the system. The gain $k_c$ can be arbitrarily small; in simulation we set it to 0.
Remark 2: Intuitively, $h(o^i)$ defines an unsafe domain for robot $i$. In safe settings, i.e., $h > 0$, $\alpha_\pi = 1 - \varepsilon$, and so the barrier has little effect on the behavior. In most unsafe cases, the barrier will be activated, driving a large-magnitude safety response. However, in dense multi-robot settings, it is possible for safety responses to cancel each other out in a gridlock, resulting in dangerous scenarios where small disturbances can cause the system to violate safety. In this case, we use the above result to put an adaptive gain, $\alpha_\pi$ (20), (26), on the neural policy and to drive $\alpha_\pi$ to 0, cancelling the effect of $\pi$. Thus, we use a convex combination of the neural optimal policy and the safety control module to guarantee safety in all cases.
Remark 3: In Theorem 2, we synthesize a local nominal control that guarantees the global safety of the system. In Theorem 3, we use Lyapunov backstepping to provide the same nominal control through a layer of dynamics. This method is valid for a large class of nonlinear dynamical systems known as full-state feedback linearizable systems [21]. By following this method, GLAS can be extended to other nonlinear dynamical systems in a straightforward manner.
IV. EXPERIMENTS
We now present results of simulation comparing GLAS
and its variants with state-of-the-art baselines as well as
experimental results on physical quadrotors. Our supplemental
video includes additional simulations and experiments.
A. Learning Implementation and Hyperparameters

For data generation, we use an existing implementation of a centralized global trajectory planner [10] and generate $2 \times 10^5$ (200 k) demonstrations in random 8 m × 8 m environments with 10% or 20% obstacles randomly placed in a grid pattern and 4, 8, or 16 robots (e.g., see Fig. 3 for 10% obstacles and 8 robots). We sample trajectories every 0.5 s and generate $|\mathcal{D}| = 40 \times 10^6$ (40 M) data points in total, evenly distributed over the 6 different environment kinds. We use different datasets for single and double integrator dynamics, with different desired smoothness in the global planner.
We implement our learning framework in Python using PyTorch [22]. The $\phi_\Omega$ and $\phi_\mathcal{V}$ networks have an input layer with 2 neurons, one hidden layer with 64 neurons, and an output layer with 16 neurons. The $\rho_\Omega$ and $\rho_\mathcal{V}$ networks have 16 neurons in their input and output layers and one hidden layer with 64 neurons. The $\Psi$ network has an input layer with 34 neurons, one hidden layer with 64 neurons, and outputs $\pi$ using two neurons. All networks use a fully connected feedforward structure with ReLU activation functions. We use an initial learning rate of 0.001 with the PyTorch ReduceLROnPlateau scheduler, a batch size of 32 k, and train for 200 epochs. During manual, iterative hyperparameter tuning, we found that the hidden layers should use at least 32 neurons. For efficient training of the Deep Set architecture, we create batches in which the number of neighbors $|\mathcal{N}_\mathcal{V}|$ and the number of obstacles $|\mathcal{N}_\Omega|$ are the same, and limit the observation to a maximum of 6 neighbors and 6 obstacles, as sketched below.
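A minimal sketch of this cardinality-grouped batching, assuming the dataset stores variable-length neighbor lists (names are ours):

```python
from collections import defaultdict

def group_by_cardinality(dataset):
    """Group observations by (|N_V|, |N_Omega|) so each group stacks into
    fixed-shape tensors without padding."""
    groups = defaultdict(list)
    for (e_ii, nbr, obs), u in dataset:
        groups[(len(nbr), len(obs))].append(((e_ii, nbr, obs), u))
    return groups  # one dense batch source per neighbor/obstacle count
```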
B. GLAS Variants

We study the effect of each component of the system architecture by comparing variants of our controller: end-to-end, two-stage, and barrier. End-to-end and two-stage are synthesized through (8), but differ in how $\pi$ is trained. For end-to-end we calculate the loss on $u(o^i)$, while for two-stage we calculate the loss on $\pi(o^i)$. Comparing these two methods isolates the effect of the end-to-end training. The barrier variant is a linear feedback-to-goal controller with our safety module. Essentially, barrier is synthesized with (8), where the $\pi$ heuristic is replaced with a linear goal term, $K e^{ii}$, where, for single integrator systems, $K = k_p I$, and for double integrator systems, $K = [k_p I, k_v I]$, with scalar gains $k_p$ and $k_v$. Studying the performance of the barrier variant isolates the effect of the global-to-local heuristic training.
C. Single Integrator Dynamics

We compare our method with ORCA, a state-of-the-art decentralized approach for single integrator dynamics. Unlike GLAS, ORCA requires relative velocities with respect to neighbors in addition to relative positions. All methods compute a velocity action with guaranteed safety.

We show example trajectories for the global planner, ORCA, and GLAS variants in Fig. 3. In Fig. 3(b)/3(c), the purple and brown robots get stuck in local minima caused by obstacle traps. In Fig. 3(d)/3(e), our learned policies avoid those local minima. The end-to-end approach produces smoother trajectories that use less control effort, e.g., the red and brown robot trajectories in Fig. 3(e).
1) Evaluation of Metrics: We deploy the baseline and variants over 100 validation cases with 2, 4, 8, 16, and 32 robots and 10% and 20% obstacle density (10 validation instances for each case) and empirically evaluate the metrics defined in (7). Our training data only contains examples with up to 16 robots. We train 5 instances of end-to-end and two-stage models to quantify the effect of random weight initialization in the neural networks on performance, see Fig. 4.
In the top row, we consider the success metric $r_s$. In a wide range of robot/obstacle cases (2–16 robots per 64 m²), our global-to-local methods outperform ORCA by 20%, solving almost all instances. Our barrier variant has a similar success rate as ORCA, demonstrating that the neural heuristic $\pi$ is crucial for our high success rates. The two-stage approach generalizes better to higher-density cases beyond those in the training data. We observe the inverse trend in the double integrator case; analyzing this effect is an interesting future direction.
In the bottom row, we measure control effort $r_p$. Our end-to-end approach uses less control effort than the two-stage approach. ORCA has the lowest control effort, because the analytical solution to single integrator optimal control is a bang-bang controller, similar in nature to ORCA's implementation.
2) Effect of Complexity of Data on Loss Function: At some robot/obstacle density, local observations and their actions will become inconsistent, i.e., the same observation will match to different actions generated by the global planner. We quantify this data complexity limit by recording the value of the validation loss when training using datasets of varying complexity, see Fig. 5.

Here, we train the end-to-end model with $|\mathcal{D}| = 5$ M using isolated datasets of 10% and 20% obstacles and with 4 and 16 robots, as well as a mixed dataset. We see that the easiest case, 4 robots and 10% obstacles, results in the smallest loss, roughly 1% of the maximum action magnitude of the expert. The learning task is more difficult with high robot density compared to high obstacle density. A mixed dataset, as used in all other experiments, is a good trade-off between imitating the expert very well and being exposed to complex situations.
(a) Global; (b) ORCA; (c) GLAS Barrier; (d) GLAS Two-stage; (e) GLAS End-to-end.
Fig. 3. Example trajectories for baselines (a–c) and our proposed method (d, e), where the goal is to move robots from their starting position (circles) to the goal position (squares). Our methods achieve the highest success rate. The GLAS end-to-end policy generates trajectories that use less control effort.
Fig. 4. Success rate and control effort with varying numbers of robots in an 8 m × 8 m space for single integrator systems. The shaded area around the lines denotes the standard deviation over 5 repetitions. The shaded gray box highlights validation outside the training domain.

Fig. 5. Testing loss when training using 10% and 20% obstacles and 4 or 16 robots. Synthesizing a distributed policy that is consistent with the global data is harder for high robot densities than for high obstacle densities. We use GLAS end-to-end with $|\mathcal{D}| = 5$ M and repeat 5 times.

Fig. 6. Effect of sensing radius and amount of training data on robot success rate. The validation has 4, 8, and 16 robot cases with 10 instances each. Training and validation were repeated 5 times; the shaded area denotes the standard deviation.

Fig. 7. Success rate and control effort with varying numbers of robots in an 8 m × 8 m workspace for double integrator systems. The shaded area around the lines denotes the standard deviation over 5 repetitions. The shaded gray box highlights validation outside the training domain.
3) Effect of Radius of Sensing on Performance: We quantify the transition from local to global by evaluating the performance of models trained with various sensing radii and dataset sizes. We evaluate performance on a validation set of 4, 8, and 16 robot cases with 10% and 20% obstacle densities. First, we found that there exists an optimal sensing radius for a given amount of data, which increases with larger datasets. For example, in the 20% obstacle case, the optimal sensing radius for $|\mathcal{D}| = 300$ k is around 2 m, and the optimal radius for $|\mathcal{D}| = 30$ M is 8 m. Second, we found that between models of various dataset sizes, the performance gap at small sensing radii is smaller than the performance gap at large sensing radii. This result suggests that little data is needed to use local information well, while large amounts of data are needed to learn from global data.
D. Double Integrator Dynamics

We extend our results to double integrator dynamical systems to demonstrate that GLAS extends naturally to a dynamically-coupled motion planner. Similar to the single integrator evaluation, we show a statistical evaluation of the double integrator with respect to the performance metrics (7) for varying robot and obstacle density cases. We use the same setup as in the single integrator case (see Sec. IV-C), but with a different dataset ($|\mathcal{D}| = 20$ M). Results are shown in Fig. 7.
In the top row, we consider the success metric $r_s$. In a wide range of robot density cases (2–16 robots per 64 m²), our global-to-local end-to-end method again outperforms the barrier baseline by 15% to 20%. In the 32 robots per 64 m² case, the barrier baseline outperforms the global-to-local methods. We suspect this is a combination of our method suffering from the significantly higher complexity of the problem and the naive barrier method performing well because the disturbances from the robot interaction push the robot out of local minima obstacle traps. In contrast to the single integrator case, the end-to-end solution generalizes better than the two-stage in higher robot densities. We conjecture that having a much larger training set can significantly improve performance.
In the bottom row, we consider the performance metric $r_p$. For double integrator systems, the cost function $c(s, u)$ corresponds to the energy consumption of each successfully deployed robot. The end-to-end variant uses 6.25% less effort than the two-stage method on average.
E. Experimental Validation with Aerial Swarms

We implement the policy evaluation ($\pi, \alpha_\pi, b$) in C to enable real-time execution on-board Crazyflie 2.0 quadrotors using double integrator dynamics (see Fig. 1). The quadrotors use a small STM32 microcontroller with 192 kB SRAM running at 168 MHz. Our policy evaluation takes 3.4 ms for 1 neighbor and 5.0 ms for 3 neighbors, making it computationally efficient enough to execute our policy in real-time at 40 Hz. On-board, we evaluate the policy, forward-propagate double integrator dynamics, and track the resulting position and velocity setpoint using a nonlinear controller. The experimental validation demonstrates that our policy generalizes to novel environments where the obstacles are arranged in continuous space, as opposed to on a grid.

We use a double integrator GLAS end-to-end policy in three different scenarios with up to 3 obstacles and 12 quadrotors. We fly in a motion capture space, where each robot is equipped with a single marker, using the Crazyswarm [23] for tracking and scripting. Our demonstration shows that our policy works well on robots and that it can also handle cases that are considered difficult in decentralized multi-robot motion planning, such as swapping positions in a narrow corridor.
V. CONCLUSION

In this work, we present GLAS, a novel approach for multi-robot motion planning that combines the advantages of existing centralized and distributed approaches. Unlike traditional distributed methods, GLAS avoids local minima in many cases. Unlike existing centralized methods, GLAS only requires local relative state observations, which can be measured on board or communicated locally. We propose an end-to-end training approach using a novel differentiable safety method compatible with general dynamical models, resulting in a dynamically-coupled motion planner with guaranteed collision-free operation and empirically-validated low control effort solutions. In future work, we will explore adaptive dataset aggregation methods [24] for more efficient expert querying. We will also compare the computational effort and performance of the Deep Set representation with deep CNN methods as well as Graph Neural Network methods [25].
REFERENCES

[1] J. van den Berg, S. J. Guy, M. C. Lin, and D. Manocha, "Reciprocal n-body collision avoidance", in Int. Symp. on Robot. Res., ser. Springer Tracts in Advanced Robotics, vol. 70, Springer, 2009, pp. 3–19.
[2] S. Bandyopadhyay, S.-J. Chung, and F. Y. Hadaegh, "Probabilistic and distributed control of a large-scale swarm of autonomous agents", IEEE Trans. Robot., vol. 33, no. 5, pp. 1103–1123, 2017.
[3] O. Khatib, "Real-time obstacle avoidance for manipulators and mobile robots", in Autonomous Robot Vehicles, Springer, 1990, pp. 396–404.
[4] E. Rimon and D. E. Koditschek, "Exact robot navigation using artificial potential functions", IEEE J. Robot. Autom., vol. 8, no. 5, pp. 501–518, 1992.
[5] H. G. Tanner and A. Kumar, "Formation stabilization of multiple agents using decentralized navigation functions", in Robotics: Science & Systems, 2005, pp. 49–56.
[6] L. Wang, A. D. Ames, and M. Egerstedt, "Safety barrier certificates for collisions-free multirobot systems", IEEE Trans. Robot., vol. 33, no. 3, pp. 661–674, 2017.
[7] R. Cheng, A. Verma, G. Orosz, et al., "Control regularization for reduced variance reinforcement learning", in Proc. Int. Conf. Machine Learning, vol. 97, 2019, pp. 1141–1150.
[8] R. Jonschkowski, D. Rastogi, and O. Brock, "Differentiable particle filters: End-to-end learning with algorithmic priors", in Robotics: Science & Systems, 2018.
[9] D. Morgan, G. P. Subramanian, S.-J. Chung, and F. Y. Hadaegh, "Swarm assignment and trajectory optimization using variable-swarm, distributed auction assignment and sequential convex programming", Int. J. Robotics Res., vol. 35, no. 10, pp. 1261–1285, 2016.
[10] W. Hönig, J. A. Preiss, T. K. S. Kumar, G. S. Sukhatme, and N. Ayanian, "Trajectory planning for quadrotor swarms", IEEE Trans. Robot., vol. 34, no. 4, pp. 856–869, 2018.
[11] C. E. Luis and A. P. Schoellig, "Trajectory generation for multiagent point-to-point transitions via distributed model predictive control", IEEE Robot. Autom. Lett., vol. 4, no. 2, pp. 375–382, 2019.
[12] G. Sartoretti, J. Kerr, Y. Shi, et al., "PRIMAL: Pathfinding via reinforcement and imitation multi-agent learning", IEEE Robot. Autom. Lett., vol. 4, no. 3, pp. 2378–2385, 2019.
[13] Q. Li, F. Gama, A. Ribeiro, and A. Prorok, "Graph neural networks for decentralized multi-robot path planning", CoRR, vol. abs/1912.06095, 2019.
[14] A. Khan, V. Kumar, and A. Ribeiro, "Graph policy gradients for large scale unlabeled motion planning with constraints", CoRR, vol. abs/1909.10704, 2019.
[15] A. Khan, C. Zhang, S. Li, et al., "Learning safe unlabeled multi-robot planning with motion constraints", in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2019, pp. 7558–7565.
[16] D. Raju, S. Bharadwaj, and U. Topcu, "Decentralized runtime synthesis of shields for multi-agent systems", CoRR, vol. abs/1910.10380, 2019.
[17] M. Zaheer, S. Kottur, S. Ravanbakhsh, et al., "Deep sets", in Advances in Neural Information Processing Systems 30, 2017, pp. 3391–3401.
[18] G. Shi, W. Hönig, Y. Yue, and S.-J. Chung, "Neural-Swarm: Decentralized close-proximity multirotor control using learned interactions", in Proc. IEEE Int. Conf. Robot. Autom., 2020.
[19] Y. Pan, C. Cheng, K. Saigol, et al., "Agile autonomous driving using end-to-end deep imitation learning", in Robotics: Science & Systems, 2018.
[20] S. P. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2014, ISBN: 978-0-521-83378-3.
[21] H. K. Khalil, Nonlinear Systems. Upper Saddle River, NJ: Prentice-Hall, 2002, ISBN: 978-0130673893.
[22] A. Paszke, S. Gross, F. Massa, et al., "PyTorch: An imperative style, high-performance deep learning library", in Advances in Neural Information Processing Systems 32, 2019, pp. 8024–8035.
[23] J. A. Preiss*, W. Hönig*, G. S. Sukhatme, and N. Ayanian, "Crazyswarm: A large nano-quadcopter swarm", in Proc. IEEE Int. Conf. Robot. Autom., 2017, pp. 3299–3304.
[24] S. Ross, G. J. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning", in Proc. Int. Conf. Artificial Intelligence and Statistics, vol. 15, 2011, pp. 627–635.
[25] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model", IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 61–80, 2008.