Geophys. J. Int.
(2023)
234,
427–438
https://doi.org/10.1093/gji/ggad074
Advance Access publication 2023 February 16
GJI Seismology
A deep Gaussian process model for seismicity background rates
Jack B. Muir
1
and Zachary E. Ross
2
1
Department of Earth Sciences, University of Oxford, Oxford, OX
13
AN, UK. E-mail:
jack.muir@earth.ox.ac.uk
2
Seismological Laboratory, California Institute of Technology, Pasadena, California
91125
,USA
Accepted 2023 February 14. Received 2023 February 9; in original form 2022 September 23
SUMMARY
The spatio-temporal properties of seismicity give us incisive insight into the stress state evolu-
tion and fault structures of the crust. Empirical models based on self-exciting point processes
continue to provide an important tool for analysing seismicity, given the epistemic uncertainty
associated with physical models. In particular, the epidemic-type aftershock sequence (ETAS)
model acts as a reference model for studying seismicity catalogues. The traditional ETAS
model uses simple parametric definitions for the background rate of triggering-independent
seismicity. This reduces the effectiveness of the basic ETAS model in modelling the temporally
complex seismicity patterns seen in seismic swarms that are dominated by aseismic tectonic
processes such as fluid injection rather than aftershock triggering. In order to robustly capture
time-varying seismicity rates, we introduce a deep Gaussian process (GP) formulation for the
background rate as an extension to ETAS. GPs are a robust non-parametric model for function
spaces with covariance structure. By conditioning the length-scale structure of a GP with
another GP, we have a deep-GP: a probabilistic, hierarchical model that automatically tunes
its structure to match data constraints. We show how the deep-GP-ETAS model can be effi-
ciently sampled by making use of a Metropolis-within-Gibbs scheme, taking advantage of the
branching process formulation of ETAS and a stochastic partial differential equation (SPDE)
approximation for Mat
́
ern GPs. We illustrate our method using synthetic examples, and show
that the deep-GP-ETAS model successfully captures multiscale temporal behaviour in the
background forcing rate of seismicity. We then apply the results to two real-data catalogues:
the Ridgecrest, CA 2019 July 5
M
w
7.1 event catalogue, showing that deep-GP-ETAS can
successfully characterize a classical aftershock sequence; and the 2016–2019 Cahuilla, CA
earthquake swarm, which shows two distinct phases of aseismic forcing concordant with a
fluid injection-driven initial sequence, arrest of the fluid along a physical barrier and release
following the largest
M
w
4.4 event of the sequence.
Key words:
Inverse theory; Statistical methods; Statistical seismology.
1 INTRODUCTION
The study of seismicity patterns is a core concern of observational seismology. By documenting seismicity in the form of catalogues, we aim
to better understand the mechanics of the crust as it evolves through time. Increasing progress in observational seismological instrumentation
and methodology have led to increasingly complete and accurate catalogues, which have revealed fascinating fine-scale structure in seismicity
(Wei
et al.
2011
; Shelly
et al.
2016
;Ross
et al.
2019
,
2020
, e.g.). However, as a comprehensive physical theory of seismicity remains
elusive, we still rely heavily on empirical statistical models to understand catalogues of seismic events. The de facto standard statistical
model for the analysis of seismicity at short timescales is the epidemic-type aftershock sequence (ETAS) model, introduced by Ogata (
1988
)
for temporal catalogues, and extended to spatio-temporal catalogues by Ogata (
1998
). The ETAS model explicitly combines a background
rate of earthquake activity, independent of past events, with self-excitation of aftershocks following the combined Omori–Utsu (Utsu
1961
)
aftershock decay law and an earthquake productivity law developed by Ogata (
1998
) in reference to the earthquake catalogue classification
schema of Utsu (
1971
). The ETAS model allows us to decouple the background forcing of seismicity from self-excitation (i.e. it declusters
earthquake catalogues), and allows for statistical forecasting of earthquake rates into the future once calibrated.
The spatio-temporal variations in the ETAS parameters have given useful insight into the evolution of seismicity. Whilst classical
aftershock sequences are normally well modelled by the self-triggering component of ETAS, swarm-type sequences are insufficiently
C
The Author(s) 2023. Published by Oxford University Press on behalf of The Royal Astronomical Society. This is an Open Access
article distributed under the terms of the Creative Commons Attribution License (
https://creativecommons.org/licenses/by/4.0/
), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
427
Downloaded from https://academic.oup.com/gji/article/234/1/427/7043457 by California Institute of Technology user on 28 April 2023
428
J. B. Muir and Z. E. Ross
clustered to conform to ETAS with a stationary forcing rate, and instead reflect aseismic changes in forcing (Kumazawa & Ogata
2014
, e.g.).
In the zero-clustering limit, the behaviour of swarms may be modelled as an inhomogenous Poisson process; however, in reality some degree
of clustering must be accounted for, which poses a difficult challenge as it is impossible to directly observe both the self-excitation rate and
aseismic forcing rate independently. Many methodologies have been proposed to overcome this challenge. For instance, Marsan
et al.
(
2013
)
proposed deriving a time-dependent aseismic forcing rate by iteratively obtaining the maximum-likelihood estimate (MLE) parameters of an
ETAS model with per-earthquake forcing rate, updating the forcing given the MLE parameters, and then temporally smoothing the resulting
rate to avoid overfitting—this method was applied to the analysis of the foreshock sequence of the
M
w
9.0 2011 Tohoku earthquake in
Marsan & Enescu (
2012
). Hainzl
et al.
(
2013
) then used the method proposed in Marsan
et al.
(
2013
) to investigate the impact of transients
in the aseismic forcing rate on the prediction of aftershock productivity, an important consideration for accurate short-term forecasting of
seismic activity. Kumazawa
et al.
(
2016
) investigated the prolific Izu Peninsula swarms, and found a close fit between gradient-penalized
piecewise-linear functions and a simple model based on exponential decay of strains recorded at Higashi-Izu which allowed them to infer
synchronization between earthquake swarm activity and volumetric strain changes induced by magma intrusions. These successes in analysis
motivate us to investigate improvements in methodology that can robustly quantify the uncertainty in estimates of aseismic forcing rates.
In general, as we have no knowledge of the functional form of background variability, it is preferable to use a non-parametric formulation
for aseismic forcing; however, solving for a non-parametric background rate model has traditionally been difficult and does not permit a
Hessian based estimate of uncertainty at the MLE. Recent studies (Ross
2021
; Molkenthin
et al.
2022
) have advocated for a fully Bayesian
approach, in which the background rate and ETAS parameter priors are specified, combined with the ETAS likelihood to produce a posterior
probability distribution, and sampled using Markov Chain Monte Carlo methods (MCMC). While MCMC is in general an expensive technique,
it is an unfortunate necessity for the ETAS problem due to the lack of a general closed-form conjugate model that would allow direct draws
from the posterior distribution. Molkenthin
et al.
(
2022
), in particular, show how the background rate may be successfully modelled in non-
parametric fashion by making use of Gaussian processes (GP) to specify the background prior and using a branching-structure formulation of
the ETAS model to decouple the background events from aftershocks, which are handled by self-excitation. Beyond the confines of the ETAS
model, GPs have also proven useful in the spatial analysis of seismic sequences. Bayliss
et al.
(
2020
) investigated numerous geophysical
covariates for seismicity in a log-Gaussian Cox process model of spacial seismicity (e.g. distance to mapped faults, strain rates, etc.), and
used a GP to handle additional unspecified seismic intensity rate variability. Ross
et al.
(
2022
) investigated the hyperparameters of fully GP
specified log-Gaussian Cox process models for seismic sequences as a quantitative measure of geometrical properties such as the anisotropy of
seismicity. GPs are a flexible and robust framework for non-parametric modelling, however in the context of geophysical inversion problems
they suffer from two major deficiencies. The first is that, without approximations, they scale poorly to large problem sizes (e.g. large seismic
catalogues), as specifying the covariance structure requires inverting a matrix at
O
(
n
3
) expense where
n
is the catalogue size. The second
is that basic GP analysis assumes homogeneity of the underlying statistical distribution of the prior—this assumption poorly models abrupt
spatio-temporal changes in behaviour that are found in the real Earth.
In this study, our contributions are as follows. We show how the aforementioned issues may be resolved by making use of a deep GP
model (DGP) approximated by the solution to a stochastic partial differential equation (SPDE, Lindgren
et al.
(
2011
)). We also improve
the rate of MCMC convergence by making use of appropriate modern samplers that exploit the structure of the SPDE approximation and
the gradient of the combined semi-parametric model posterior. Finally, we illustrate the DGP-ETAS model using a combination of synthetic
catalogues, the Ridgecrest, CA earthquake sequence, and the Cahuilla, CA earthquake swarm.
2 SAMPLING THE ETAS MODEL
The temporal ETAS model is a self-exciting marked point process defined on the space
T
×
M
,where
T
=
(0
,
T
) is the time domain of
interest and
M
is the mark space of the process, which for ETAS is the event magnitude. Realizations of the process are earthquake catalogues
C
where the
i
th event is defined by the tuple (
t
i
,
m
i
).
C
is endowed with a cut-off magnitude
m
0
, which is the lower limit of observability.
The temporal ETAS model may be defined by its conditional intensity function, which gives the instantaneous rate of earthquakes given the
background forcing and past events. Assuming that earthquake triggering is time-independent, the conditional intensity is the sum of the
background rate and the triggering:
λ
(
t
|
C
,μ,θ
)
=
μ
(
t
)
+
i
∈
C
|
t
i
<
t
φ
(
t
−
t
i
|
m
i
,θ
)
.
(1)
The most common parametrization of
φ
, and the one we use in this work, is
φ
(
t
−
t
i
|
m
i
,θ
)
=
κ
(
m
i
|
θ
)
g
(
t
−
t
i
|
θ
)
(2)
κ
(
m
i
|
θ
)
=
K
exp(
α
(
m
i
−
m
0
))
(3)
g
(
t
|
θ
)
=
(
t
+
c
)
−
p
.
(4)
With this definition, the ETAS model triggering parameters are
θ
=
(
K
,
α
,
c
,
p
). The magnitude term
κ
is proportional to the Utsu aftershock-
productivity law, while the temporal term
g
is proportional to the modified Omori–Utsu aftershock decay law. The log-likelihood of observing
Downloaded from https://academic.oup.com/gji/article/234/1/427/7043457 by California Institute of Technology user on 28 April 2023
Deep GPs for seismicity background rates
429
catalogue
C
is given by
log
p
(
C
|
μ, θ
)
=
i
∈
C
log(
λ
(
t
i
|
C
,μ,θ
))
−
T
0
λ
(
t
|
C
,μ,θ
)
.
(5)
Given that the conditional-likelihood itself contains a sum over the catalogue, the log-likelihood of the standard ETAS model thus has a
double loop, and the effort required to compute it scales with the number of events squared. When considering heterogeneous background
rates, the problem is even worse as the background parameters influence the ETAS likelihood, dramatically increasing the complexity and
dimensionality of the likelihood surface. Following Ross (
2021
), a useful reformulation that improves computational tractability is to make
use of the splitting property of Poisson processes (Veen & Schoenberg
2008
). This formulation of ETAS introduces the auxiliary branching
variables
{
B
i
}
N
i
=
1
,where
B
i
=
j
tells us that the
i
th event in
C
had ‘parent’
j
and
B
i
=
0 tells us that the
i
th earthquake was a ‘background’
event. This factorization splits the estimation of the background rate and the ETAS parameters into two subproblems, where the background
catalogue is described by an inhomogeneous Poisson process and the aftershocks are handled by ETAS without background forcing—this
greatly simplifies inference of all parameters, reduces the expense of computing the ETAS likelihood and serendipitously declusters the
catalogue as a byproduct of sampling. We introduce subcatalogues
C
j
={
i
|
B
i
=
j
}
so that
|
C
j
|
is the number of earthquakes triggered by
event
j
(or the background for
C
0
). The conditional intensity for the
i
th event then becomes
λ
(
t
i
|
C
,μ,θ,
B
i
)
=
μ
(
t
i
)
B
i
=
0
φ
(
t
i
−
t
B
i
|
m
B
i
,θ
)1
≤
B
i
<
i
,
(6)
and the log-likelihood function becomes
log
p
(
C
|
μ, θ,
B
)
=
i
∈
C
0
log(
μ
(
t
i
))
−
T
0
μ
(
t
)
dt
+
j
∈
C
⎡
⎣
|
C
j
|
log
κ
(
m
j
|
K
,α
)
−
κ
(
m
j
|
K
,α
)
G
(
T
−
t
j
|
c
,
p
)
+
i
∈
C
j
log
g
(
t
i
−
t
j
|
c
,
p
)
⎤
⎦
,
(7)
where
G
(
T
−
t
j
|
c
,
p
)
=
T
−
t
j
0
g
(
t
|
c
,
p
)
dt
.
(8)
Note that the double sum in this form does not suffer quadratic scaling because each earthquake has a unique parent and so is only ever
visited once in the interior sum. Marginalizing over the branching variables gives us the probability distribution of
μ
and
θ
conditioned on all
possible combinations of parent/child earthquake pairs (i.e. with the likelihood defined by eq.
5
). The conditional probability for the branching
variables is analytic, and is given by
p
(
B
i
=
j
|
μ, θ
)
=
μ
(
t
i
)
Z
i
j
=
0
κ
(
m
j
|
K
,α
)
g
(
t
i
−
t
j
|
c
,
p
)
Z
i
1
≤
j
<
i
,
(9)
where the normalization factor is
Z
i
=
μ
(
t
i
)
+
j
<
i
κ
(
m
j
|
K
,α
)
g
(
t
i
−
t
j
|
c
,
p
)
.
(10)
3 DEEP GAUSSIAN PROCESSES
With the general form of the latent branching-variable ETAS model specified, it comes time to specify the model for the time-variable
background rate
μ
(
t
). Given that the ETAS model is a general phenomenological description of seismicity, it is appropriate to model the
background rate with a general function. We use the framework of GPs as a robust non-parametric description of the background seismicity
rate (Rasmussen & Williams
2006
; Molkenthin
et al.
2022
). GPs act as priors over function space (in our case the space of continuous
functions on the interval [0,
T
]), and are defined by the property that the distribution
{
f
(
x
i
)
}
of a GP
f
evaluated at any finite collection of
points
{
x
i
}
is given by a multivariate normal with mean function
m
(
x
) and covariance function
C
(
x
,
x
), that is,
⎡
⎢
⎢
⎢
⎢
⎣
f
(
x
1
)
f
(
x
2
)
.
.
.
f
(
x
n
)
⎤
⎥
⎥
⎥
⎥
⎦
∼
N
⎛
⎜
⎜
⎜
⎜
⎝
⎡
⎢
⎢
⎢
⎢
⎣
m
(
x
1
)
m
(
x
2
)
.
.
.
m
(
x
n
)
⎤
⎥
⎥
⎥
⎥
⎦
,
⎡
⎢
⎢
⎢
⎢
⎣
C
(
x
1
,
x
1
)
C
(
x
1
,
x
2
)
...
C
(
x
1
,
x
n
)
C
(
x
2
,
x
1
)
C
(
x
2
,
x
2
)
...
C
(
x
2
,
x
n
)
.
.
.
.
.
.
.
.
.
.
.
.
C
(
x
n
,
x
1
)
C
(
x
n
,
x
2
)
...
C
(
x
n
,
x
n
)
⎤
⎥
⎥
⎥
⎥
⎦
⎞
⎟
⎟
⎟
⎟
⎠
.
(11)
The mean and covariance functions completely specify the GP. The covariance function, which specifies the regularity and characteristic
length-scales of the GP, is the central object in GP modelling. A prototypical example would be the squared exponential covariance
C
(
x
,
x
)
=
σ
exp
−
||
x
−
x
||
2
2
l
2
,
(12)
Downloaded from https://academic.oup.com/gji/article/234/1/427/7043457 by California Institute of Technology user on 28 April 2023
430
J. B. Muir and Z. E. Ross
Figure 1.
Schematic of a two-layer DGP based on the covariance operator definition. Further layers may be similarly concatenated to produce deeper networks,
however the greater expressive capacity is unlikely to be required in an inverse problem context. Parameters without a representative trace are scal
ars, those
with traces are functions defined on the interval of interest. The parameter values are those used for the synthetic square catalogue described in the e
xamples
section. The schematic is arranged sequentially downward, so that each line feeds into the one below.
which generates functions
f
that are infinitely differentiable and have characteristic length-scale
l
and amplitude
σ
. Another commonly used
function is the Mat
́
ern covariance, which is ubiquitous in geospatial statistics. The Mat
́
ern has the form
C
(
x
,
x
)
=
σ
2
2
1
−
ν
(
ν
)
√
2
ν
||
x
−
x
||
l
ν
K
ν
√
2
ν
||
x
−
x
||
l
,
(13)
where the
is the generalized factorial or Gamma function,
K
ν
is the modified Bessel function of the second kind and
ν
is the regularity
parameter. Functions that have Mat
́
ern covariance are differentiable [
ν
]
−
1/2 times. In particular, the squared exponential covariance is
the special case of the limit of the Mat
́
ern covariance as
ν
goes to infinity. The added roughness of lower
ν
Mat
́
ern covariance typically
better represents realistic phenomena. For data with Gaussian noise, GP regression models have a closed form solution once the mean and
covariance are fully specified, and careful selection of the form of the covariance matrix can lead to rich and fully interpretable models with
a high degree of predictive power (Rasmussen & Williams
2006
). However, while GP regression models have analytic solutions, inverse
problems do not in general, as is the case with the estimation of the aseismic forcing rate
μ
(
t
) considered here. Furthermore, the construction
of suitable covariance kernels in a classical-GP framework requires a substantial degree of domain expertise, and consequently a large fraction
of the GP modelling literature assumes covariance functions that are homogeneous and isotropic for simplicity (see e.g. Gelfand & Schliep
2016
for a brief review). The loss of model flexibility inherent in this assumption can result in substantial misfit and prediction error, either
due to underfitting or overfitting. Because of these limitations, learning methods for automatic discovery of appropriate covariance functions
have become a major focus of GP research in order to improve computational efficiency and reduce domain expertise requirements, whilst
maintaining the attractive features of GP robustness and interpretability.
In this work, we investigate DGPs as a means of learning appropriate covariance functions. In the DGP framework, multiple GPs are
chained together, either feeding the outputs of one GP to the inputs of another (Damianou & Lawrence
2013
), or alternatively by letting the
outputs of one GP change the covariance function of the next (Roininen
et al.
2016
,seeFig.
1
). The latter approach is particularly appealing,
because the covariance function can be easily evaluated for each layer of the DGP, resulting in a fully interpretable model. Additionally, by
making use of an SPDE approximation to GPs, DGP models can be very efficiently simulated, as we will describe below.
3.1 Approximating DGPs with stochastic partial differential equations
GP models are expressive, robust, easily interpretable and easily implementable—however, the use of pure GP models in large-scale inverse
problem settings has been limited. This is due to their inherently poor scaling. The evaluation of a GP model requires inversion of an
n
×
Downloaded from https://academic.oup.com/gji/article/234/1/427/7043457 by California Institute of Technology user on 28 April 2023