PhaseLink: A Deep Learning Approach to Seismic
Phase Association
Zachary E. Ross¹, Yisong Yue², Men-Andrin Meier¹, Egill Hauksson¹, and Thomas H. Heaton¹
¹Seismological Laboratory, California Institute of Technology, Pasadena, CA, USA, ²Department of Computing and
Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA
Abstract
Seismic phase association is a fundamental task in seismology that pertains to linking
together phase detections on different sensors that originate from a common earthquake. It is widely
employed to detect earthquakes on permanent and temporary seismic networks and underlies most
seismicity catalogs produced around the world. This task can be challenging because the number of
sources is unknown and events frequently overlap in time or occur simultaneously in different parts of
a network. We present PhaseLink, a framework based on recent advances in deep learning for grid-free
earthquake phase association. Our approach learns to link phases together that share a common origin
and is trained entirely on millions of synthetic sequences of P and S wave arrival times generated using a
1-D velocity model. Our approach is simple to implement for any tectonic regime, suitable for real-time
processing, and can naturally incorporate errors in arrival time picks. Rather than tuning a set of ad hoc
hyperparameters to improve performance, PhaseLink can be improved by simply adding examples of
problematic cases to the training data set. We demonstrate the state-of-the-art performance of PhaseLink
on a challenging sequence from southern California and synthesized sequences from Japan designed to test
the point at which the method fails. For the examined data sets, PhaseLink can precisely associate phases
to events that occur only ∼12 s apart in origin time. This approach is expected to improve the resolution of
seismicity catalogs, add stability to real-time seismic monitoring, and streamline automated processing of
large seismic data sets.
1. Introduction
When an earthquake is detected on different stations of a seismic network, it is often desirable to link the
observed seismic phases to the earthquake that caused them. Historically, this task was performed by expert
seismic analysts who would visually examine the data from different stations, identify seismic phases, and
group them together (cf. Figure 1). As the modern digital era began, seismic networks started to accumulate
data in real time, and it became necessary to develop computer algorithms to automatically process the data.
With the development of the landmark short-term average / long-term average (STA/LTA) algorithm in
seismology (R. V. Allen, 1978; R. Allen, 1982), it became possible to detect earthquakes automatically
for the first time. This simple method uses the ratio of two moving averages to identify impulsive transient
signals and has become the de facto standard for earthquake detection around the world. One major short-
coming of the method is that it will not only identify earthquakes when present but also any other types
of impulsive transient signals that seismometers record. This led to the development of phase association
algorithms, which examine combinations of triggers on different stations to see whether any set have arrival
time patterns consistent with those of earthquakes (Draelos et al., 2015; Johnson et al., 1995; LeBras et al.,
1994; Myers et al., 2007; Reynen & Audet, 2017; Stewart, 1977). The association process therefore evolved
from one of simply grouping seismic phases together to being ultimately responsible for deciding whether
an earthquake occurred.
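The STA/LTA idea described above can be illustrated with a minimal sketch. It assumes that the squared amplitude is used as the characteristic function, that both averages are trailing windows ending at the current sample, and that a trigger is declared when the ratio first crosses a threshold; the function names, window lengths, and threshold are hypothetical choices for illustration rather than values from the cited works.

```python
def sta_lta(samples, n_sta, n_lta):
    """Ratio of short-term to long-term average of squared amplitude.

    Both windows are trailing windows ending at the current sample.
    Entries before the first full long-term window are left at 0.0.
    """
    # Prefix sums of the energy allow each window mean to be read off in O(1).
    csum = [0.0]
    for s in samples:
        csum.append(csum[-1] + s * s)
    ratio = [0.0] * len(samples)
    for i in range(n_lta - 1, len(samples)):
        sta = (csum[i + 1] - csum[i + 1 - n_sta]) / n_sta
        lta = (csum[i + 1] - csum[i + 1 - n_lta]) / n_lta
        ratio[i] = sta / lta if lta > 0 else 0.0
    return ratio


def trigger_onsets(ratio, threshold):
    """Indices at which the ratio crosses the threshold from below."""
    return [i for i in range(1, len(ratio))
            if ratio[i] >= threshold and ratio[i - 1] < threshold]


# Hypothetical trace: low-amplitude noise with an impulsive burst at sample 200.
samples = ([0.1 * (-1) ** i for i in range(200)]
           + [5.0 * (-1) ** i for i in range(20)]
           + [0.1 * (-1) ** i for i in range(100)])
onsets = trigger_onsets(sta_lta(samples, n_sta=5, n_lta=50), threshold=4.0)
```

Any impulsive transient raises the ratio in the same way, regardless of its origin, which is why a separate association step is needed to decide whether a set of triggers is consistent with an earthquake.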
To date, algorithms for phase association all operate using the same fundamental principle. The region of
interest is gridded, and for each node therein, tentative phase detections within the network are examined to
see whether some subset back-projects to a coherent origin. This means that a grid search must be conducted
continuously for all new picks that are made. Typically, grid associators require extensive tuning of a large
number of sensitive hyperparameters and have numerous ad hoc rules to stabilize potential problems that
can arise. Over the years, they have become increasingly sophisticated, with modern variants incorporating
Bayesian estimates of pick uncertainties (Myers et al., 2007), machine learning (Reynen & Audet, 2017), or
multiscale detection capabilities.
RESEARCH ARTICLE
10.1029/2018JB016674
Key Points:
• We present a novel grid-free method for associating seismic phases to earthquakes
• The method is trained entirely on millions of synthetic sequences of phase picks and can be applied to any
tectonic regime
• PhaseLink can reliably detect events occurring 12 s apart on a rigorous test data set
Correspondence to:
Z. E. Ross,
zross@gps.caltech.edu
Citation:
Ross, Z. E., Yue, Y., Meier, M.-A., Hauksson, E., & Heaton, T. H. (2019). PhaseLink: A deep learning
approach to seismic phase association. Journal of Geophysical Research: Solid Earth, 124, 856–869.
https://doi.org/10.1029/2018JB016674
Received 8 SEP 2018
Accepted 12 JAN 2019
Accepted article online 17 JAN 2019
Published online 25 JAN 2019
©2019. American Geophysical Union.
All Rights Reserved.
ROSS ET AL.
856
Journal of Geophysical Research: Solid Earth
10.1029/2018JB016674
Figure 1. Cartoon example of a phase association scenario. Left panel shows the discrete set of picks for the entire
network. The number of events is unknown. Right panel shows the output after association and location. Picks colored
black are not linked to an event, while colored picks share a common origin.
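The grid-based principle described above can be sketched in a few lines. This is a generic illustration, not a reproduction of any of the cited associators: it assumes a flat 2-D geometry in kilometers, a single constant wave speed, and a simple rule that picks are coherent when their back-projected origin times fall within a tolerance. The function name and the `velocity`, `tol`, and `min_picks` values are hypothetical.

```python
import math


def associate_on_grid(picks, stations, grid, velocity=6.0, tol=0.2, min_picks=4):
    """Return (node, origin_time, pick_indices) for the most coherent node.

    picks: list of (station_name, arrival_time_s)
    stations: dict mapping station_name -> (x_km, y_km)
    grid: iterable of candidate source nodes (x_km, y_km)
    """
    best_count, best_node, best_members = 0, None, []
    for node in grid:
        # Back-project every pick to the origin time it implies for this node.
        origins = sorted(
            (t - math.dist(node, stations[sta]) / velocity, i)
            for i, (sta, t) in enumerate(picks)
        )
        # Slide over the sorted origin times to find the largest group
        # whose implied origin times agree to within tol seconds.
        lo = 0
        for hi in range(len(origins)):
            while origins[hi][0] - origins[lo][0] > tol:
                lo += 1
            if hi - lo + 1 > best_count:
                best_count = hi - lo + 1
                best_node = node
                best_members = origins[lo:hi + 1]
    if best_count < min_picks:
        return None
    origin_time = sum(t0 for t0, _ in best_members) / best_count
    return best_node, origin_time, sorted(i for _, i in best_members)


# Hypothetical example: four stations, one source at (20, 20) km with origin
# time 5 s, and one unrelated trigger on station A.
stations = {"A": (0, 0), "B": (40, 0), "C": (0, 40), "D": (40, 40)}
travel_time = math.dist((20, 20), (0, 0)) / 6.0
picks = [(name, 5.0 + travel_time) for name in "ABCD"] + [("A", 30.0)]
grid = [(x, y) for x in range(0, 41, 10) for y in range(0, 41, 10)]
result = associate_on_grid(picks, stations, grid)
```

Even in this toy form, the grid spacing, the tolerance, and the minimum pick count all interact, which hints at why grid associators need careful tuning.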
Today, seismologists strive to identify increasingly smaller events that are often at or below the noise level.
Resolving this level of detail requires not only increasing phase detection sensitivity but also dealing with
the dramatically larger volume of information to be processed in a reliable and rational manner. In par-
ticular, since smaller events occur even more frequently and therefore are more closely spaced in time,
moving forward requires technology that can easily handle the most complicated scenarios encountered at
the present.
In recent years, there has been truly astonishing progress within the field of artificial intelligence, most
notably in the area of deep learning. Deep learning is a subdiscipline of machine learning that is based on
training neural networks to learn generalized representations of extremely large data sets and has become
state of the art in numerous domains of artificial intelligence (LeCun et al., 2015), including natural lan-
guage processing (Sutskever et al., 2014), computer vision (Krizhevsky et al., 2012), and speech recognition
(Amodei et al., 2016). It has been recently introduced to seismology and has already shown considerable
promise in performing various tasks including similarity-based earthquake detection and localization (Perol
et al., 2018), generalized seismic phase detection (Ross, Meier, Hauksson, & Heaton, 2018), phase picking
(Zhu & Beroza, 2018), first-motion polarity determination (Ross, Meier, & Hauksson, 2018), detection of
events in laboratory experiments (Wu et al., 2018), seismic image sharpening (Lu et al., 2018), wavefield
simulation (Moseley et al., 2018), and predicting aftershock spatial patterns (DeVries et al., 2018).
In this paper, we present PhaseLink, which is a deep learning approach for grid-free earthquake phase
association. Our approach is built upon Recurrent Neural Networks (RNNs), which are designed to learn
temporal and contextual relationships in sequential data. We show how to design a training objective that
enables the trained RNN to accurately associate phase
detections coming from multiple temporally overlap-
ping earthquakes. Another attractive feature of our approach is that it is trained entirely from synthesized
data using simple 1-D velocity models (this paradigm is generically known as “sim-to-real” in the machine
learning community; Dosovitskiy et al., 2017; Shafaei et al., 2016; Shotton et al., 2011; Tobin et al., 2017;
Wang & Eisner, 2017). Thus, our approach is easily applicable to any tectonic regime by simply training on
the synthesized data from the appropriate model and can also naturally incorporate errors in arrival time
picks. The full source code will be publicly available via the Southern California Earthquake Data Center.
2. Background on RNNs
Artificial neural networks are systems that can discover complex nonlinear relationships between variables.
Fundamentally, they successively transform a set of input values through matrix multiplication and nonlin-
ear activation functions into one or more output variables of interest (Goodfellow et al., 2016). The outputs
can be either continuous (regression) or discrete (classification). In supervised learning, the parameters
which characterize the nonlinear mapping are learned by minimizing the prediction error of the model
against the ground truth. The standard type of neural network is today referred to as a fully connected neural
network because each neuron is fully connected to each previous input. Fully connected networks are excel-
lent at many classification and regression tasks but have trouble discovering structure in sequential data
sets because they lack feedback mechanisms that can enable information to propagate between successive
elements of a sequence.
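As a minimal sketch of the transformation just described, the following code passes an input vector through two fully connected layers, each consisting of a matrix multiplication, a bias, and a nonlinear activation. The weights are arbitrary hand-picked numbers chosen for illustration; in practice they would be learned by minimizing the prediction error.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [
        activation(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, layers):
    """Apply the layers in order; each layer sees only the previous output."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x


# Hypothetical two-layer network: a tanh hidden layer followed by a single
# sigmoid output, as used for binary classification.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], math.tanh),
    ([[0.5, -0.5]], [0.0], sigmoid),
]
probability = forward([0.3, 0.3], layers)[0]
```

Each call to `forward` is independent of every other call, which is the lack of feedback between sequence elements noted above.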
These shortcomings were addressed by the development of the RNN (Hopfield, 1982). RNNs allow for infor-
mation to be passed between successive elements through the
use of an internal memory state. This state is
dynamically modulated by gates that are themselves composed of neural networks and control what infor-
mation is retained along the way. The parameters governing the gates are therefore learned through the
training process from the data directly. The outputs of RNNs, which are called hidden states, are very flex-
ible and could be a single-valued output given an input sequence or a sequence of outputs. To date, RNNs
have been applied to a variety of settings, including language translation (Sutskever et al., 2014), speech
synthesis (Van Den Oord et al., 2016), speech recognition (Amodei et al., 2016), image captioning (You et al.,
2016), and many others.
The most commonly employed variant of the RNN is the long short-term memory (Hochreiter &
Schmidhuber, 1997) network. These networks have three gates that control the flow of information and are
useful because they are less susceptible to training issues related to the diminishing propagation of
information over long sequences. In recent years, another variant called the gated recurrent unit (Cho et al., 2014)
has become popular because it has only two gates instead of
three, resulting in fewer parameters and faster
training. These types of RNNs are considered state of the art for many problems including speech recognition
and language translation.
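For reference, the two gates of the gated recurrent unit can be written out explicitly. The equations below follow one common convention and are given as a sketch: the symbols (input x_t, hidden state h_t, weight matrices W and U, biases b, logistic sigmoid σ, and elementwise product ⊙) are notation introduced here, and some references and libraries swap the roles of z_t and 1 − z_t in the last line.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) && \text{(candidate state)} \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
```

The gates use the sigmoid so that their values lie between 0 and 1 and act as soft switches, while the candidate state uses tanh.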
Over the years, numerous improvements have been made to these basic types of RNN layers, and one such
important development was the bidirectional RNN layer (Schuster & Paliwal, 1997). This layer uses two
RNNs running in opposite directions so that information from both directions of the sequence is available to
make predictions. A common example where this is useful is word prediction, where if a word in the middle
of a sentence is missing, it is generally desirable to use the contextual information from the entire sentence
to make a prediction, rather than just the words leading up to the missing one.
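The bidirectional idea can be illustrated with a toy example. The sketch below uses a scalar tanh recurrence as a stand-in for a real RNN layer (an assumption made to keep the example short) and hypothetical fixed weights; the point is only that each output position receives one state summarizing the past and one summarizing the future.

```python
import math


def run_rnn(xs, w_in, w_rec):
    """Toy scalar recurrence: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional(xs, fwd_params, bwd_params):
    """Pair a forward-in-time state with a backward-in-time state per element."""
    fwd = run_rnn(xs, *fwd_params)
    # Run the second RNN over the reversed sequence, then restore the order.
    bwd = run_rnn(xs[::-1], *bwd_params)[::-1]
    return list(zip(fwd, bwd))


# A single nonzero input in the middle of the sequence.
states = bidirectional([0.0, 0.0, 1.0, 0.0, 0.0], (1.0, 0.5), (1.0, 0.5))
```

In this example the forward state at the first position is zero because nothing has been seen yet, whereas the backward state at the same position is nonzero because it already carries information about the later input.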
The outstanding capabilities of RNNs for learning structure in sequential data sets make them a natural
choice for the phase association problem, since a set of picks can be viewed as a time-ordered sequence of
arrivals. Furthermore, as RNNs process one element of a sequence at a time, they are well-suited for phase
association in a real-time seismic network, where phases arrive one at a time.
3. PhaseLink Framework
The PhaseLink approach is designed to solve the phase association problem: given a sequence of N picks,
determine how many earthquakes (if any) occurred, and which of the N picks belong to each respective
earthquake. Fundamentally, one can think of phase association as a (supervised) clustering problem of
assigning picks to earthquakes that generated them. In contrast to conventional clustering, there is a specific
temporal structure to our prediction task, and also, the number of clusters is unknown a priori. For instance,
having multiple overlapping earthquakes implies detecting picks coming from different “clusters.”
Figure 2 depicts the PhaseLink approach, which can be conceptually described in the following steps:
• We are given an input set of picks. Each pick has as attributes the location (latitude and longitude) of the
station that detected the pick, the time stamp, and phase type (Figure 2, step 1).
• The input pick stream is processed into a sequence of overlapping fixed-length sequential prediction tasks
(Figure 2, step 2). In particular, the prediction task is whether each pick in the input sequence belongs
to the same earthquake that generated the first (root) pick in the sequence, that is, a sequential binary
classification problem. We solve this fixed-length prediction task using RNNs (section 3.1), and we train
the RNNs using synthetic data (section 3.3).
• The overlapping predictions are then aggregated into a single set of pick clusters, where each cluster defines
one earthquake (section 3.2; Figure 2, step 3).
Figure 2. Overview of PhaseLink algorithm. A sliding window of picks is iteratively presented to a RNN, which
outputs a binary sequence of equal length for each window. These output sequences indicate which picks (if any)
are from the same event as the first pick in the window. Each pick in the sequence has five features: latitude,
longitude, arrival time, phase type, and a binary padding indicator. The results from all windows are then
aggregated to determine distinct clusters of picks (earthquakes detected). RNN = Recurrent Neural Network.
By decomposing the problem in this way, PhaseLink can, in principle, handle any number of overlapping
clusters. Conceptually, the reduced prediction task is based around a reference point and classifies
a temporal neighborhood of points as belonging to the same cluster as the reference point. A somewhat similar
idea was proposed in supervised clustering approaches that utilize must-link and cannot-link constraints
(Bilenko et al., 2004; Wagstaff et al., 2001), although those approaches are more geared toward learning a
metric space rather than directly solving the clustering problem. Furthermore, our PhaseLink approach can
exploit a natural temporal locality structure to further constrain the prediction task. Another benefit of
directly considering the coclustering prediction problem is that we can tolerate false picks (those that do
not belong to any cluster/earthquake). A final benefit of this decomposition is that PhaseLink can utilize
off-the-shelf RNN implementations, which leads to significantly reduced system engineering overhead.
Table 1
Model Architecture
Layer type           Units   Activation function
Bidirectional GRU    200     Sigmoid/tanh
Bidirectional GRU    200     Sigmoid/tanh
Dense                1       Sigmoid
Note. GRU = gated recurrent unit.
3.1. RNN Architecture
We designed a deep RNN consisting of stacked bidirectional gated recurrent unit layers (Table 1). The net-
work takes as input fixed-length sequences of picks and outputs a sequence of identical length (Figure 2,
step 2). The output sequence is binary valued, with a value of 1 indicating that a given pick belongs to the
same event as the root pick (the first pick of the sequence, Y_0) and a value of 0 indicating that the two picks
are unrelated. A sigmoid activation function is applied to the final output at each time step to squash the
value into the range [0, 1].
We apply the network to a sliding window of picks by incrementing over the entire sequence, shifting the
window by one pick at a time. For the remainder of this paper, we refer to a fixed length sliding window of
n_p picks as a subsequence. Here we use n_p = 500. After predictions have been made for a subsequence, we
drop the root pick and take the next pick as the root for the new subsequence. For each root pick, we obtain
a set of binary predictions about which of the following picks in the subsequence are related to the root.
Each of the picks in a subsequence is characterized by five input features, resulting in an input feature set
with dimensions (n_p, 5). The first two features are the latitude and longitude coordinates of the station that
the pick was made on, which are both normalized to be in the range [0, 1] such that the range spans the
full dimensions of the seismic network. The third feature is the time of the pick, which is defined relative
to the root pick within the subsequence. Here we normalize the time values by a predefined maximum
allowed value for picks to be included in a subsequence, which is chosen to be 120 s. This value is somewhat
arbitrarily chosen but ends up being not too important. The normalization ensures that this feature does not
bias the training process. We discard any picks within the subsequence whose times relative to the root pick exceed 120 s and pad the
remainder of the feature window with zeros. The value of 500 picks is chosen loosely to correspond to the
maximum number of picks that we expect to have within any 120-s window, which could vary depending on
the problem. The penultimate feature is a binary value indicating the phase type, where a value of 0 means
a P wave and a value of 1 means an S wave. Lastly, we have another binary indicator variable for whether a
given pick is a zero-padded placeholder.
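The construction of one subsequence can be sketched as follows. This is an illustrative reading of the description above rather than the released implementation: the `Pick` class, the `build_features` function, and the `bounds` argument are hypothetical names, the picks are assumed to be sorted by time, and the padding indicator is assumed to be 1 for placeholder rows and 0 for real picks, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class Pick:
    lat: float    # station latitude (degrees)
    lon: float    # station longitude (degrees)
    time: float   # arrival time (s)
    phase: str    # "P" or "S"


def build_features(picks, root, bounds, n_p=500, t_max=120.0):
    """Return an n_p x 5 feature list for the subsequence starting at `root`.

    bounds = (lat_min, lat_max, lon_min, lon_max) of the seismic network.
    Columns: latitude, longitude, relative time, phase type, padding flag.
    """
    lat_min, lat_max, lon_min, lon_max = bounds
    t_root = picks[root].time
    rows = []
    for p in picks[root:root + n_p]:
        dt = p.time - t_root
        if dt > t_max:
            break  # picks are time ordered, so all later picks are too late
        rows.append([
            (p.lat - lat_min) / (lat_max - lat_min),  # normalized to [0, 1]
            (p.lon - lon_min) / (lon_max - lon_min),  # normalized to [0, 1]
            dt / t_max,                               # time relative to root
            0.0 if p.phase == "P" else 1.0,           # 0 = P wave, 1 = S wave
            0.0,                                      # assumed: 0 = real pick
        ])
    # Pad the remainder of the window with zeros; assumed: flag 1 = padding.
    while len(rows) < n_p:
        rows.append([0.0, 0.0, 0.0, 0.0, 1.0])
    return rows


# Hypothetical picks; the third arrives more than 120 s after the first.
picks = [Pick(33.0, -117.0, 0.0, "P"),
         Pick(34.0, -116.0, 10.0, "S"),
         Pick(35.0, -115.0, 200.0, "P")]
bounds = (33.0, 35.0, -117.0, -115.0)
window = build_features(picks, root=0, bounds=bounds, n_p=4)
```

Sliding the window then amounts to calling `build_features` with `root` set to 0, 1, 2, and so on, so that every pick serves as the root of exactly one subsequence.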
3.2. Aggregating Predictions
We now describe the final stage of PhaseLink, where the link predictions from each subsequence are aggre-
gated to formally detect earthquakes. The output of the RNN is a prediction matrix that describes the link
between each pick of a subsequence and its root pick (Figure 2, step 3). In order to assign picks to individual
events, rather than to subsequence root picks, we cluster linked picks by incrementing backward over the
prediction matrix. This is performed as follows:
• For each subsequence, a cluster nucleates if at least n_nuc picks have predicted labels of 1. Only those picks
labeled 1 are retained in the cluster, and all others are discarded.
• Once a cluster has nucleated, the set intersection is separately determined between it and every existing
cluster of picks that arose from other subsequences.
• The existing cluster with the most picks in common is identified, and if this number is greater than n_merge,
the two clusters are merged.
• After performing these steps for all subsequences, each remaining cluster is retained if the cluster size is
at least n_min picks.
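The aggregation steps above can be sketched with sets of pick indices. The sketch makes several assumptions that should be read as illustrative only: the prediction matrix is represented as a list `links` in which `links[r]` holds the indices of picks labeled 1 for the subsequence rooted at pick `r` (including the root itself), a nucleated cluster that is not merged is kept as a new cluster, and the default values of `n_merge` and `n_min` are placeholders, since only the value of n_nuc is stated here.

```python
def aggregate(links, n_nuc=8, n_merge=1, n_min=8):
    """Merge per-subsequence link predictions into event clusters.

    links[r]: set of pick indices predicted to share an event with root r.
    Returns a list of sorted pick-index lists, one per detected event.
    """
    clusters = []
    # Increment backward over the subsequences (last root first).
    for root in range(len(links) - 1, -1, -1):
        candidate = set(links[root])
        if len(candidate) < n_nuc:
            continue  # too few linked picks for a cluster to nucleate
        # Find the existing cluster sharing the most picks with the candidate.
        best, best_overlap = None, 0
        for cluster in clusters:
            overlap = len(candidate & cluster)
            if overlap > best_overlap:
                best, best_overlap = cluster, overlap
        if best is not None and best_overlap > n_merge:
            best |= candidate            # merge into the existing cluster
        else:
            clusters.append(candidate)   # assumed: keep as a new cluster
    return [sorted(c) for c in clusters if len(c) >= n_min]


# Hypothetical predictions for six picks from two interleaved events:
# picks 0, 2, 4 belong to one event and picks 1, 3, 5 to another.
links = [{0, 2, 4}, {1, 3, 5}, {2, 4}, {3, 5}, {4}, {5}]
events = aggregate(links, n_nuc=2, n_merge=1, n_min=3)
```

With these small thresholds the two interleaved events are recovered as separate clusters, even though their picks alternate in time.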
In this paper, we use n_nuc = 8, which was chosen to maximize the detection performance for the data sets
used herein; varying this hyperparameter in the range 4–8 leads to relatively little change in performance