A deep Gaussian process model for seismicity background rates

Geophys. J. Int.

(2023)

234,

427–438

https://doi.org/10.1093/gji/ggad074

Advance Access publication 2023 February 16

GJI Seismology

A deep Gaussian process model for seismicity background rates

Jack B. Muir

and Zachary E. Ross

Department of Earth Sciences, University of Oxford, Oxford, OX

AN, UK. E-mail:

jack.muir@earth.ox.ac.uk

Seismological Laboratory, California Institute of Technology, Pasadena, California

91125

,USA

Accepted 2023 February 14. Received 2023 February 9; in original form 2022 September 23

SUMMARY

The spatio-temporal properties of seismicity give us incisive insight into the stress state evolu-

tion and fault structures of the crust. Empirical models based on self-exciting point processes

continue to provide an important tool for analysing seismicity, given the epistemic uncertainty

associated with physical models. In particular, the epidemic-type aftershock sequence (ETAS)

model acts as a reference model for studying seismicity catalogues. The traditional ETAS

model uses simple parametric definitions for the background rate of triggering-independent

seismicity. This reduces the effectiveness of the basic ETAS model in modelling the temporally

complex seismicity patterns seen in seismic swarms that are dominated by aseismic tectonic

processes such as fluid injection rather than aftershock triggering. In order to robustly capture

time-varying seismicity rates, we introduce a deep Gaussian process (GP) formulation for the

background rate as an extension to ETAS. GPs are a robust non-parametric model for function

spaces with covariance structure. By conditioning the length-scale structure of a GP with

another GP, we have a deep-GP: a probabilistic, hierarchical model that automatically tunes

its structure to match data constraints. We show how the deep-GP-ETAS model can be effi-

ciently sampled by making use of a Metropolis-within-Gibbs scheme, taking advantage of the

branching process formulation of ETAS and a stochastic partial differential equation (SPDE)

approximation for Mat

ern GPs. We illustrate our method using synthetic examples, and show

that the deep-GP-ETAS model successfully captures multiscale temporal behaviour in the

background forcing rate of seismicity. We then apply the results to two real-data catalogues:

the Ridgecrest, CA 2019 July 5

7.1 event catalogue, showing that deep-GP-ETAS can

successfully characterize a classical aftershock sequence; and the 2016–2019 Cahuilla, CA

earthquake swarm, which shows two distinct phases of aseismic forcing concordant with a

fluid injection-driven initial sequence, arrest of the fluid along a physical barrier and release

following the largest

4.4 event of the sequence.

Key words:

Inverse theory; Statistical methods; Statistical seismology.

1 INTRODUCTION

The study of seismicity patterns is a core concern of observational seismology. By documenting seismicity in the form of catalogues, we aim

to better understand the mechanics of the crust as it evolves through time. Increasing progress in observational seismological instrumentation

and methodology have led to increasingly complete and accurate catalogues, which have revealed fascinating fine-scale structure in seismicity

(Wei

et al.

2011

; Shelly

et al.

2016

;Ross

et al.

2019

2020

, e.g.). However, as a comprehensive physical theory of seismicity remains

elusive, we still rely heavily on empirical statistical models to understand catalogues of seismic events. The de facto standard statistical

model for the analysis of seismicity at short timescales is the epidemic-type aftershock sequence (ETAS) model, introduced by Ogata (

1988

)

for temporal catalogues, and extended to spatio-temporal catalogues by Ogata (

1998

). The ETAS model explicitly combines a background

rate of earthquake activity, independent of past events, with self-excitation of aftershocks following the combined Omori–Utsu (Utsu

1961

)

aftershock decay law and an earthquake productivity law developed by Ogata (

1998

) in reference to the earthquake catalogue classification

schema of Utsu (

1971

). The ETAS model allows us to decouple the background forcing of seismicity from self-excitation (i.e. it declusters

earthquake catalogues), and allows for statistical forecasting of earthquake rates into the future once calibrated.

The spatio-temporal variations in the ETAS parameters have given useful insight into the evolution of seismicity. Whilst classical

aftershock sequences are normally well modelled by the self-triggering component of ETAS, swarm-type sequences are insufficiently

The Author(s) 2023. Published by Oxford University Press on behalf of The Royal Astronomical Society. This is an Open Access

article distributed under the terms of the Creative Commons Attribution License (

https://creativecommons.org/licenses/by/4.0/

), which

permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

427

Downloaded from https://academic.oup.com/gji/article/234/1/427/7043457 by California Institute of Technology user on 28 April 2023

428

J. B. Muir and Z. E. Ross

clustered to conform to ETAS with a stationary forcing rate, and instead reflect aseismic changes in forcing (Kumazawa & Ogata

2014

, e.g.).

In the zero-clustering limit, the behaviour of swarms may be modelled as an inhomogenous Poisson process; however, in reality some degree

of clustering must be accounted for, which poses a difficult challenge as it is impossible to directly observe both the self-excitation rate and

aseismic forcing rate independently. Many methodologies have been proposed to overcome this challenge. For instance, Marsan

et al.

(

2013

)

proposed deriving a time-dependent aseismic forcing rate by iteratively obtaining the maximum-likelihood estimate (MLE) parameters of an

ETAS model with per-earthquake forcing rate, updating the forcing given the MLE parameters, and then temporally smoothing the resulting

rate to avoid overfitting—this method was applied to the analysis of the foreshock sequence of the

9.0 2011 Tohoku earthquake in

Marsan & Enescu (

2012

). Hainzl

et al.

(

2013

) then used the method proposed in Marsan

et al.

(

2013

) to investigate the impact of transients

in the aseismic forcing rate on the prediction of aftershock productivity, an important consideration for accurate short-term forecasting of

seismic activity. Kumazawa

et al.

(

2016

) investigated the prolific Izu Peninsula swarms, and found a close fit between gradient-penalized

piecewise-linear functions and a simple model based on exponential decay of strains recorded at Higashi-Izu which allowed them to infer

synchronization between earthquake swarm activity and volumetric strain changes induced by magma intrusions. These successes in analysis

motivate us to investigate improvements in methodology that can robustly quantify the uncertainty in estimates of aseismic forcing rates.

In general, as we have no knowledge of the functional form of background variability, it is preferable to use a non-parametric formulation

for aseismic forcing; however, solving for a non-parametric background rate model has traditionally been difficult and does not permit a

Hessian based estimate of uncertainty at the MLE. Recent studies (Ross

2021

; Molkenthin

et al.

2022

) have advocated for a fully Bayesian

approach, in which the background rate and ETAS parameter priors are specified, combined with the ETAS likelihood to produce a posterior

probability distribution, and sampled using Markov Chain Monte Carlo methods (MCMC). While MCMC is in general an expensive technique,

it is an unfortunate necessity for the ETAS problem due to the lack of a general closed-form conjugate model that would allow direct draws

from the posterior distribution. Molkenthin

et al.

(

2022

), in particular, show how the background rate may be successfully modelled in non-

parametric fashion by making use of Gaussian processes (GP) to specify the background prior and using a branching-structure formulation of

the ETAS model to decouple the background events from aftershocks, which are handled by self-excitation. Beyond the confines of the ETAS

model, GPs have also proven useful in the spatial analysis of seismic sequences. Bayliss

et al.

(

2020

) investigated numerous geophysical

covariates for seismicity in a log-Gaussian Cox process model of spacial seismicity (e.g. distance to mapped faults, strain rates, etc.), and

used a GP to handle additional unspecified seismic intensity rate variability. Ross

et al.

(

2022

) investigated the hyperparameters of fully GP

specified log-Gaussian Cox process models for seismic sequences as a quantitative measure of geometrical properties such as the anisotropy of

seismicity. GPs are a flexible and robust framework for non-parametric modelling, however in the context of geophysical inversion problems

they suffer from two major deficiencies. The first is that, without approximations, they scale poorly to large problem sizes (e.g. large seismic

catalogues), as specifying the covariance structure requires inverting a matrix at

(

) expense where

is the catalogue size. The second

is that basic GP analysis assumes homogeneity of the underlying statistical distribution of the prior—this assumption poorly models abrupt

spatio-temporal changes in behaviour that are found in the real Earth.

In this study, our contributions are as follows. We show how the aforementioned issues may be resolved by making use of a deep GP

model (DGP) approximated by the solution to a stochastic partial differential equation (SPDE, Lindgren

et al.

(

2011

)). We also improve

the rate of MCMC convergence by making use of appropriate modern samplers that exploit the structure of the SPDE approximation and

the gradient of the combined semi-parametric model posterior. Finally, we illustrate the DGP-ETAS model using a combination of synthetic

catalogues, the Ridgecrest, CA earthquake sequence, and the Cahuilla, CA earthquake swarm.

2 SAMPLING THE ETAS MODEL

The temporal ETAS model is a self-exciting marked point process defined on the space

,where

) is the time domain of

interest and

is the mark space of the process, which for ETAS is the event magnitude. Realizations of the process are earthquake catalogues

where the

th event is defined by the tuple (

is endowed with a cut-off magnitude

, which is the lower limit of observability.

The temporal ETAS model may be defined by its conditional intensity function, which gives the instantaneous rate of earthquakes given the

background forcing and past events. Assuming that earthquake triggering is time-independent, the conditional intensity is the sum of the

background rate and the triggering:

(

,μ,θ

)

(

)

∈

(

−

,θ

)

(1)

The most common parametrization of

, and the one we use in this work, is

(

−

,θ

)

(

)

(

−

)

(2)

(

)

exp(

(

−

))

(3)

(

)

(

)

−

(4)

With this definition, the ETAS model triggering parameters are

(

). The magnitude term

is proportional to the Utsu aftershock-

productivity law, while the temporal term

is proportional to the modified Omori–Utsu aftershock decay law. The log-likelihood of observing

Downloaded from https://academic.oup.com/gji/article/234/1/427/7043457 by California Institute of Technology user on 28 April 2023

Deep GPs for seismicity background rates

429

catalogue

is given by

log

(

μ, θ

)

∈

log(

(

,μ,θ

))

−

(

,μ,θ

)

(5)

Given that the conditional-likelihood itself contains a sum over the catalogue, the log-likelihood of the standard ETAS model thus has a

double loop, and the effort required to compute it scales with the number of events squared. When considering heterogeneous background

rates, the problem is even worse as the background parameters influence the ETAS likelihood, dramatically increasing the complexity and

dimensionality of the likelihood surface. Following Ross (

2021

), a useful reformulation that improves computational tractability is to make

use of the splitting property of Poisson processes (Veen & Schoenberg

2008

). This formulation of ETAS introduces the auxiliary branching

variables

{

}

,where

tells us that the

th event in

had ‘parent’

and

0 tells us that the

th earthquake was a ‘background’

event. This factorization splits the estimation of the background rate and the ETAS parameters into two subproblems, where the background

catalogue is described by an inhomogeneous Poisson process and the aftershocks are handled by ETAS without background forcing—this

greatly simplifies inference of all parameters, reduces the expense of computing the ETAS likelihood and serendipitously declusters the

catalogue as a byproduct of sampling. We introduce subcatalogues

}

so that

is the number of earthquakes triggered by

event

(or the background for

). The conditional intensity for the

th event then becomes

(

,μ,θ,

)

(

)

(

−

,θ

≤

(6)

and the log-likelihood function becomes

log

(

μ, θ,

)

∈

log(

(

))

−

(

)

∈

⎡

⎣

log

(

,α

)

−

(

,α

)

(

−

)

∈

log

(

−

)

⎤

⎦

(7)

where

(

−

)

−

(

)

(8)

Note that the double sum in this form does not suffer quadratic scaling because each earthquake has a unique parent and so is only ever

visited once in the interior sum. Marginalizing over the branching variables gives us the probability distribution of

and

conditioned on all

possible combinations of parent/child earthquake pairs (i.e. with the likelihood defined by eq.

). The conditional probability for the branching

variables is analytic, and is given by

(

μ, θ

)

(

)

(

,α

)

(

−

)

≤

(9)

where the normalization factor is

(

)

(

,α

)

(

−

)

(10)

3 DEEP GAUSSIAN PROCESSES

With the general form of the latent branching-variable ETAS model specified, it comes time to specify the model for the time-variable

background rate

(

). Given that the ETAS model is a general phenomenological description of seismicity, it is appropriate to model the

background rate with a general function. We use the framework of GPs as a robust non-parametric description of the background seismicity

rate (Rasmussen & Williams

2006

; Molkenthin

et al.

2022

). GPs act as priors over function space (in our case the space of continuous

functions on the interval [0,

]), and are defined by the property that the distribution

{

(

)

}

of a GP

evaluated at any finite collection of

points

{

}

is given by a multivariate normal with mean function

(

) and covariance function

(

), that is,

⎡

⎢

⎣

(

)

(

)

(

)

⎤

⎥

⎦

∼

⎛

⎜

⎝

⎡

⎢

⎣

(

)

(

)

(

)

⎤

⎥

⎦

⎡

⎢

⎣

(

)

(

)

...

(

)

(

)

(

)

...

(

)

(

)

(

)

...

(

)

⎤

⎥

⎦

⎞

⎟

⎠

(11)

The mean and covariance functions completely specify the GP. The covariance function, which specifies the regularity and characteristic

length-scales of the GP, is the central object in GP modelling. A prototypical example would be the squared exponential covariance

(

)

exp

−

(12)

Downloaded from https://academic.oup.com/gji/article/234/1/427/7043457 by California Institute of Technology user on 28 April 2023

430

J. B. Muir and Z. E. Ross

Figure 1.

Schematic of a two-layer DGP based on the covariance operator definition. Further layers may be similarly concatenated to produce deeper networks,

however the greater expressive capacity is unlikely to be required in an inverse problem context. Parameters without a representative trace are scal

ars, those

with traces are functions defined on the interval of interest. The parameter values are those used for the synthetic square catalogue described in the e

xamples

section. The schematic is arranged sequentially downward, so that each line feeds into the one below.

which generates functions

that are infinitely differentiable and have characteristic length-scale

and amplitude

. Another commonly used

function is the Mat

ern covariance, which is ubiquitous in geospatial statistics. The Mat

ern has the form

(

)

−

(

)

√

−

√

−

(13)

where the

is the generalized factorial or Gamma function,

is the modified Bessel function of the second kind and

is the regularity

parameter. Functions that have Mat

ern covariance are differentiable [

]

−

1/2 times. In particular, the squared exponential covariance is

the special case of the limit of the Mat

ern covariance as

goes to infinity. The added roughness of lower

Mat

ern covariance typically

better represents realistic phenomena. For data with Gaussian noise, GP regression models have a closed form solution once the mean and

covariance are fully specified, and careful selection of the form of the covariance matrix can lead to rich and fully interpretable models with

a high degree of predictive power (Rasmussen & Williams

2006

). However, while GP regression models have analytic solutions, inverse

problems do not in general, as is the case with the estimation of the aseismic forcing rate

(

) considered here. Furthermore, the construction

of suitable covariance kernels in a classical-GP framework requires a substantial degree of domain expertise, and consequently a large fraction

of the GP modelling literature assumes covariance functions that are homogeneous and isotropic for simplicity (see e.g. Gelfand & Schliep

2016

for a brief review). The loss of model flexibility inherent in this assumption can result in substantial misfit and prediction error, either

due to underfitting or overfitting. Because of these limitations, learning methods for automatic discovery of appropriate covariance functions

have become a major focus of GP research in order to improve computational efficiency and reduce domain expertise requirements, whilst

maintaining the attractive features of GP robustness and interpretability.

In this work, we investigate DGPs as a means of learning appropriate covariance functions. In the DGP framework, multiple GPs are

chained together, either feeding the outputs of one GP to the inputs of another (Damianou & Lawrence

2013

), or alternatively by letting the

outputs of one GP change the covariance function of the next (Roininen

et al.

2016

,seeFig.

). The latter approach is particularly appealing,

because the covariance function can be easily evaluated for each layer of the DGP, resulting in a fully interpretable model. Additionally, by

making use of an SPDE approximation to GPs, DGP models can be very efficiently simulated, as we will describe below.

3.1 Approximating DGPs with stochastic partial differential equations

GP models are expressive, robust, easily interpretable and easily implementable—however, the use of pure GP models in large-scale inverse

problem settings has been limited. This is due to their inherently poor scaling. The evaluation of a GP model requires inversion of an

Downloaded from https://academic.oup.com/gji/article/234/1/427/7043457 by California Institute of Technology user on 28 April 2023