Research

Cite this article: Baddoo PJ, Herrmann B, McKeon BJ, Brunton SL. 2022 Kernel learning for robust dynamic mode decomposition: linear and nonlinear disambiguation optimization. Proc. R. Soc. A 478: 20210830. https://doi.org/10.1098/rspa.2021.0830

Received: 28 October 2021
Accepted: 28 February 2022

Subject Areas: applied mathematics, fluid mechanics, mechanical engineering

Keywords: machine learning, kernel methods, system identification, modal decomposition

Author for correspondence: Peter J. Baddoo (e-mail: baddoo@mit.edu)

Electronic supplementary material is available online at https://doi.org/10.6084/m9.figshare.c.5901249.
Kernel learning for robust dynamic mode decomposition: linear and nonlinear disambiguation optimization

Peter J. Baddoo^1, Benjamin Herrmann^2, Beverley J. McKeon^3 and Steven L. Brunton^4

^1 Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
^2 Department of Mechanical Engineering, University of Chile, Beauchef 851, Santiago, Chile
^3 Graduate Aerospace Laboratories, California Institute of Technology, Pasadena, CA 91125, USA
^4 Department of Mechanical Engineering, University of Washington, Seattle, WA 98195, USA

ORCID: PJB, 0000-0002-8671-6952; BJM, 0000-0003-4220-1583; SLB, 0000-0002-6565-5118
Research in modern data-driven dynamical systems is typically focused on the three key challenges of high dimensionality, unknown dynamics and nonlinearity. The dynamic mode decomposition (DMD) has emerged as a cornerstone for modelling high-dimensional systems from data. However, the quality of the linear DMD model is known to be fragile with respect to strong nonlinearity, which contaminates the model estimate. By contrast, sparse identification of nonlinear dynamics learns fully nonlinear models, disambiguating the linear and nonlinear effects, but is restricted to low-dimensional systems. In this work, we present a kernel method that learns interpretable data-driven models for high-dimensional, nonlinear systems. Our method performs kernel regression on a sparse dictionary of samples that appreciably contribute to the dynamics. We show that this kernel method efficiently handles high-dimensional data and is flexible enough to incorporate partial knowledge of system physics. It is possible to recover the linear model contribution with this approach, thus separating the effects of the implicitly defined nonlinear terms. We demonstrate our approach on data from a range of nonlinear ordinary and partial differential equations. This framework can be used for many practical engineering tasks such as model order reduction, diagnostics, prediction, control and discovery of governing laws.

2022 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.
1. Introduction
Discovering interpretable patterns and models from high-dimensional data is one of the principal challenges of scientific machine learning, with the potential to transform our ability to predict and control complex physical systems [1]. The current surge in the quality and quantity of data, along with rapidly improving computational hardware, has motivated a wealth of machine learning techniques that uncover such patterns for dynamical systems. Successful recent methods include the dynamic mode decomposition (DMD) [2–5] and extended DMD (eDMD) [6,7], sparse identification of nonlinear dynamics (SINDy) for ordinary and partial differential equations [8,9], genetic programming for model discovery [10], physics-informed neural networks (PINNs) [11], Lagrangian neural networks [12], time-lagged autoencoders [13], operator theoretic methods [14] and operator inference [15]. Techniques based on generalized linear regression, such as DMD and SINDy, are widely used because they are computationally efficient, require less data than neural networks, are highly extensible and provide interpretable models. However, these approaches are either challenged by nonlinearity (e.g. DMD) or do not scale to high-dimensional systems (e.g. SINDy). In this work, we present a machine learning algorithm that leverages sparse kernel regression to address both challenges, efficiently learning high-dimensional nonlinear models that admit interpretable spatio-temporal coherent structures and robust locally linear models.
A central goal of modern data-driven dynamical systems is to identify a model

    d/dt x = F(x) = Lx + N(x),    (1.1)

that describes the evolution of the state of the system, x. Here, we explicitly indicate that the dynamics F have a linear L and nonlinear N contribution, although many techniques do not model these separately or explicitly. However, several approaches obtain interpretable and explicit models of this form. For example, DMD seeks a best-fit linear model of the dynamics, while SINDy directly identifies sparse nonlinear models of the form in (1.1).
Our approach synthesizes favourable aspects of several approaches mentioned above; however, it most directly complements and addresses the challenges of DMD for strongly nonlinear systems. The DMD was originally introduced by Schmid [2] in the fluid dynamics community as a method for extracting spatio-temporal coherent structures from high-dimensional data, resulting in a low-rank representation of the best-fit linear operator that maps the data forward in time [4,5]. The resulting linear DMD models have been used to characterize many systems in fluid mechanics, where complex flows admit dominant modal decompositions [16]. DMD has also been adopted in a wide range of fields beyond fluid mechanics, and much of its success stems from the formulation of DMD as a linear regression problem [4], based entirely on measurement data, resulting in several powerful extensions [5]. However, because DMD uses least-squares regression to find a best-fit linear model dx/dt ≈ Ax to the data, the presence of measurement noise [17], control inputs [18] and nonlinearity biases the regression. Mathematically, the noise, control inputs and nonlinearity may all be lumped into a forcing b:

    d/dt x = Lx + b ≈ Ax.    (1.2)

The forcing b contaminates the linear model estimate, so A from DMD does not approximate the true linear contribution from L. It was recognized early on that the DMD algorithm was highly sensitive to noise [17], resulting in noise-robust variants, including forward-backward and total least-squares DMD [17], optimized DMD [19] and DMD based on robust principal component analysis (PCA) [20].
Figure 1. Learning regression models in linear (a) and nonlinear (b) feature spaces. Our approach disambiguates linear and nonlinear model contributions to accurately extract local linear models. (Online version in colour.)

Similarly, DMD with control [18] was introduced to disambiguate the effect of the linear dynamics from actuation. For statistically stationary systems with stochastic inputs, the spectral proper orthogonal decomposition [21] produces an optimal basis of modes to describe the variability in an ensemble of DMD modes [22]. The bias due to nonlinearity, shown in figure 1a, has been less thoroughly explored and is the topic of the present work.
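To make this bias concrete, the following short Python sketch fits a best-fit linear operator A = YX† to derivative data generated by a toy two-state system with a known linear part L and a quadratic forcing. The system, its parameters and the sampling choices are hypothetical illustrations rather than an example from this study; the point is only that the eigenvalues of A differ from those of L because the forcing b in (1.2) contaminates the regression.

```python
# Illustrative sketch (not from the paper): a best-fit linear model A = Y X^+
# fit to data from a nonlinear system does not recover the true linear part L,
# because the nonlinear forcing in (1.2) biases the regression. The system,
# parameters and sampling below are hypothetical choices.
import numpy as np

L_true = np.array([[-0.1, 2.0], [-2.0, -0.1]])         # true linear part L
N = lambda x: np.array([0.0, -2.0 * x[0] ** 2])        # true nonlinear part N(x)
F = lambda x: L_true @ x + N(x)                        # dx/dt = F(x) = Lx + N(x)

def rk4_step(x, dt):
    """One fourth-order Runge-Kutta step of dx/dt = F(x)."""
    k1 = F(x); k2 = F(x + 0.5 * dt * k1)
    k3 = F(x + 0.5 * dt * k2); k4 = F(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

dt, m = 0.01, 2000
X = np.zeros((2, m)); X[:, 0] = [1.5, 0.5]
for j in range(1, m):
    X[:, j] = rk4_step(X[:, j - 1], dt)
Y = np.column_stack([F(X[:, j]) for j in range(m)])    # y_j = dx/dt evaluated at x_j

A = Y @ np.linalg.pinv(X)                              # DMD-style best-fit linear model
print("eigenvalues of true L:", np.linalg.eigvals(L_true))
print("eigenvalues of fitted A:", np.linalg.eigvals(A))  # biased by the nonlinearity
```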
Despite these challenges, DMD is frequently applied to strongly nonlinear systems, with theoretical motivation from Koopman operator theory [3,5,14,23]. Williams et al. [6] developed the eDMD, which augments the original state with nonlinear functions of the state to better approximate the nonlinear eigenfunctions of the Koopman operator for nonlinear systems. However, because this approach still fundamentally results in a linear model (in the augmented state), it suffers from the same inability to handle multiple fixed points or attracting structures, and it typically suffers from closure issues related to the irrepresentability of Koopman eigenfunctions. Delay embedding methods, such as Hankel DMD [24] and higher-order DMD [25], are effective for computing Koopman eigenfunctions but do not separate the linear and nonlinear mechanisms of the system. The SINDy algorithm [8] is a related regression approach to model discovery, which identifies a fully nonlinear model as a sparse linear combination of candidate terms in a library. While SINDy is able to effectively disambiguate the linear and nonlinear dynamics in (1.1), resulting in the ability to obtain de-biased locally linear models as in figure 1b, it only applies to relatively low-dimensional systems because of poor scaling of the library with state dimension.
(a) Contributions of this work
In this work, we develop a custom kernel regression algorithm to learn accurate, efficient and interpretable data-driven models for strongly nonlinear, high-dimensional dynamical systems. This approach scales to very high dimensions, unlike SINDy, yet still accurately disambiguates the linear part of the model from the implicitly defined nonlinear dynamics. Thus, it is possible to obtain linear DMD models, local to a given base state, that are robust to strongly nonlinear dynamics. Our approach, referred to as the linear and nonlinear disambiguation optimization (LANDO) algorithm, may be viewed as a generalization of DMD that enables a robust disambiguation of the underlying linear operator from nonlinear forcings. The learning framework is illustrated in figure 2, and open-source code is available at www.github.com/baddoo/LANDO.
To achieve this robust learning, we improve upon several leading kernel and system identification algorithms. Recent works have successfully applied kernel methods [26,27] to study data-driven dynamical systems [7,28–31]. A key inspiration for the present work is kernel DMD (kDMD, [7]), which seeks to approximate the infinite-dimensional Koopman operator as a large square matrix evolving nonlinear functions of the original state. An essential difference between kDMD and the present work is that our goal is to implicitly model the (non-square) nonlinear dynamics in (1.1) in terms of the original state x, enabling the robust extraction of the linear component L, as opposed to analysing the Koopman operator over measurement functions.
Figure 2. The linear and nonlinear disambiguation optimization (LANDO) framework. Training data in the form of snapshot pairs (x, y), with y(t) = x(t + Δt) or y(t) = ẋ(t), are collected from either simulation or experiment in (1). The data are organized into matrices X and Y in (2). In (3), an appropriate kernel is defined (e.g. linear k(u, v) = u^T v, polynomial k(u, v) = (c + u^T v)^d or Gaussian k(u, u') = exp(−||u − u'||^2 / 2σ^2)), which can be informed by expert knowledge of the underlying physics of the system or through cross-validation. In (4), a sparse dictionary of basis elements is constructed from the training samples, spanning the largest subspace in the feature space defined by k, and in (5), the regression problem argmin ||Y − W̃ k(X̃, X)||_F is solved. Finally, in (6), an interpretable model f(x) = Lx + N(x) is extracted, yielding the linear operator, nonlinear forcing, modes and predictions; optionally, the dictionary and model can be learned online from streaming data. (Online version in colour.)
Further differences between LANDO and other DMD methods are elucidated in the electronic supplementary material, SI §C. In our work, we present a modified kernel recursive least-squares (KRLS) algorithm [32] to learn a nonlinear model that best characterizes the observed data. To reduce the training cost, which typically scales with the cube of the number of training samples for kernel methods, we use dictionary learning to iteratively identify samples that appreciably contribute to the dynamics. This dictionary learning approach may be seen as a sparsity-promoting regularizer, significantly reducing the high condition number that is common with kernel methods, thus improving robustness to noise. We introduce an iterative Cholesky update to construct the dictionary in a numerically stable manner, significantly reducing the training cost while also mitigating overfitting. Similarly to KRLS, our model has the option of operating online by parsing data in a streaming fashion and exploiting rank-one matrix updates to revise the model. Therefore, our approach is also suitable for model order reduction in practical applications where data become available 'on-the-fly'. Furthermore, we show how to incorporate partially known physics, or uncover unknown physics, by designing or testing tailored kernels, much as with the SINDy framework [8,33].
We demonstrate our proposed kernel learning approach on a range of complex dynamical systems that arise in the physical sciences. As an illustrative example, we first explore the chaotic Lorenz system. We also consider partial differential equations, using the LANDO framework to uncover the linear and nonlinear components of the nonlinear Burgers' and Kuramoto–Sivashinsky (KS) equations using only nonlinear measurement data. The algorithm accurately recovers the spectrum of the linear operator for these systems, enabling linearized analyses, such as linear stability, transient growth and resolvent analyses [34,35], in a purely data-driven manner [36], even for strongly nonlinear systems. We also demonstrate the approach on a high-dimensional system of coupled nonlinear Kuramoto oscillators.
Whether the linear and nonlinear dynamics can be separated depends on both the underlying system and the available data. Our results indicate that the effectiveness of LANDO depends on the sampling rate, whether the data are near an attractor, and the amplitude of measurement noise. These issues are discussed throughout, but resolving them fully is the subject of ongoing work.
The remainder of the work is organized as follows. Section 2 provides an overview of the mathematical background of DMD and kernel methods. Section 3 introduces our kernel learning
procedure for dynamical systems, including the sparse dictionary learning with Cholesky
updates. We demonstrate how to extract interpretable structures from these kernel models, such
as robust linear DMD models, in §4. Results on a variety of nonlinear dynamical systems are
presented in §5. Finally, §6 concludes with a discussion of limitations and suggested extensions of
the method. The appendices (in the electronic supplementary material) explicate the connection
between LANDO and DMD, demonstrate how to incorporate the effects of control, present the
equations for online updating and investigate the noise sensitivity of the algorithm.
2. Problem statement and mathematical background
In this section, we will define our machine learning problem and review some relevant
mathematical ideas related to DMD (§2a) and kernel methods (§2b).
We consider dynamical systems describing the evolution of an n-dimensional vector x ∈ R^n that characterizes the state of the system. We will consider both continuous-time and discrete-time dynamics in this work. The dynamics may be expressed either in continuous time as

    d/dt x(t) = F(x(t))

or in discrete time as

    x_{j+1} = F(x_j).
For a given physical system, the continuous-time and discrete-time representations of the dynamics will correspond to different functions F, although we use the same function above for notational simplicity. In general, the dynamics may also vary in time and depend on control inputs u and parameters β; however, for simplicity, we begin with the autonomous dynamics above.
Our goal is to learn a tractable representation of the dynamical system F: R^n → R^n that is both accurate and interpretable, informing tasks such as physical understanding, diagnostics, prediction and control. We suppose that we have access to a training set of data pairs {(x_j, y_j) ∈ R^n × R^n | j = 1, ..., m}, which are connected through the dynamics by

    y_j = F(x_j).    (2.1)
(2.1)
If the dynamics are expressed in continuous time then
y
j
=
̇
x
j
where the dot denotes
differentiation in time, and if the dynamics are expressed in discrete time then
y
j
=
x
j
+
1
.The
discrete-time formulation is more common, as data from simulations and experiments are often
sampled or generated at a fixed sampling interval

t
,so
x
j
=
x
(
j

t
). However, this work applies to
both discrete and continuous systems, and the only practical difference arises in the eigenvalues
of linearized models.
The training data correspond to m snapshots in time of a simulation or experiment. For ease of notation, it is typical to arrange the samples into snapshot data matrices of the form

    X = [x_1  x_2  ···  x_m]   and   Y = [y_1  y_2  ···  y_m].    (2.2)

In many applications, the state dimension is much larger than the number of snapshots, so m ≪ n. For example, the state may correspond to a fluid velocity field sampled on a discretized grid.
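As a minimal illustration of this bookkeeping, the sketch below arranges a discrete-time trajectory into the snapshot matrices X and Y of (2.2) with y_j = x_{j+1}; the helper name and the placeholder data are assumptions for the example only.

```python
# Minimal sketch: arranging a sampled trajectory into the snapshot matrices of
# (2.2), assuming discrete-time data with y_j = x_{j+1}. The placeholder
# trajectory is random and purely illustrative.
import numpy as np

def snapshot_matrices(trajectory):
    """trajectory: array of shape (m + 1, n), one state per row."""
    X = trajectory[:-1].T    # columns x_1, ..., x_m        (n x m)
    Y = trajectory[1:].T     # columns x_2, ..., x_{m+1}, i.e. y_j = x_{j+1}
    return X, Y

traj = np.cumsum(np.random.default_rng(1).standard_normal((101, 3)), axis=0)
X, Y = snapshot_matrices(traj)
print(X.shape, Y.shape)      # (3, 100) (3, 100): here m = 100 snapshots, n = 3
```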
Our machine learning problem consists of finding a function f that suitably maps the training data given certain generalizability, interpretability and regularity qualifications. In our notation, F is the true function that generated the data whereas f is our model for F; it is hoped that F and f share some meaningful properties. The function f is typically restricted to a given class of models (e.g. linear, polynomial, etc.), so that it may be written as the expansion

    f(x) = ∑_{j=1}^{N} ξ_j φ_j(x)   ⟹   f(x) = Ξ φ(x).    (2.3)
Here, φ describes the feature library of N candidate terms that may describe the dynamics, and Ξ contains the coefficients that determine which model terms are active and in what proportions. Mathematically, the optimization problem to be solved is

    argmin_Ξ ||Y − Ξ φ(X)||_F + λ R(Ξ),    (2.4)

where ||·||_F is the Frobenius norm. The first term in (2.4) corresponds to the error between training samples and our model prediction, whereas the second term λR(Ξ) is a regularizer. For example, in SINDy, the feature library φ will typically include linear and nonlinear terms, and the regularizer will involve the number of non-zero elements ||Ξ||_0, which may be relaxed to the 1-norm ||Ξ||_1. In DMD, the features φ will simply contain the state x, Ξ will be the DMD matrix A, and instead of a regularizer R(Ξ), the minimization is constrained so that the rank of A = Ξ is less than or equal to r. Similarly, in eDMD, the features φ will include nonlinear functions of the state, and the minimization is modified to argmin_Ξ ||φ(Y) − Ξ φ(X)||_F + λ R(Ξ), resulting in a Ξ that is a large square matrix evolving the nonlinear feature space forward in time.

For even moderate state dimensions n and feature complexities, such as monomials of order d, the feature library φ becomes prohibitively large and the optimization in (2.4) is intractable. This scaling issue is the primary challenge in applying SINDy to high-dimensional systems.
Instead, it is possible to rewrite the expansion (2.3) in terms of an appropriate kernel function k as

    f(x) = ∑_{j=1}^{m} w_j k(x_j, x)   ⟹   f(x) = W k(X, x).    (2.5)

In this case, the sum is over the number of snapshots m instead of the number of library elements N, dramatically improving the scaling. The optimization in (2.4) now becomes

    argmin_W ||Y − W k(X, X)||_F + λ R(W).    (2.6)

We will show that it is possible to improve the scaling further by using a kernel defined on a sparse dictionary X̃. Figure 3 shows our dictionary-based kernel modelling procedure, where the explicit model on the left is a SINDy model, and the compact model on the right is our kernel model. Thus, our kernel learning approach may be viewed as a kernelized SINDy without sparsity promotion.
Based on the implicit LANDO model, it is possible to efficiently extract the linear component L of the dynamics, along with a matrix for the nonlinear forcing:

    Y = LX + N,    (2.7)

where N = [N(x_1)  N(x_2)  ···  N(x_m)] is a nonlinear snapshot matrix whose columns give the nonlinear component of the dynamics at each instant in time. Although this is not an explicit expression for the nonlinear dynamics, as in SINDy, knowing the linear model and nonlinear forcing will enable data-driven resolvent analysis [36], even for strongly nonlinear systems. Technically, (2.7) may be centred at any base point x̄, resulting in

    y′_j = L′ x′_j + N′(x′_j),    (2.8)

where x′ = x − x̄. We will also show that the linear model L may be represented efficiently without being explicitly constructed, as in DMD.
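For reference, once an estimate of the linear operator is available (for example, extracted from the LANDO model as described in §4), the nonlinear snapshot matrix in (2.7) follows by a single subtraction. The sketch below assumes such an estimate, here called L_hat, is already in hand; the name is a placeholder for this illustration.

```python
# Minimal sketch of (2.7): given the snapshot matrices X, Y and an estimate
# L_hat of the linear operator (assumed to have been extracted already, e.g.
# by the procedure of section 4), the nonlinear forcing snapshots follow by
# subtraction. L_hat is a placeholder name for this illustration.
import numpy as np

def nonlinear_forcing(L_hat, X, Y):
    """Columns of the returned matrix are N(x_j) = y_j - L_hat x_j."""
    return Y - L_hat @ X
```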
Figure 3. Schematic relationships between different models for N ≫ n and m̃ ≪ m: an explicit model f(x) = Ξ φ(x) with an n × N weight matrix, an implicit kernel model f(x) = W k(X, x) with an n × m weight matrix, and a dictionary-based kernel model f(x) = W̃ k(X̃, x) with an n × m̃ weight matrix. An explicit model (e.g. SINDy) produces explicit weights that connect N features to n outputs. A kernel model uses fewer weights, but the relationships between variables are stored implicitly. The dictionary-based kernel model selects the most active samples and therefore uses fewer weights still. (Online version in colour.)
Electronic supplementary material SI §C includes further comparison of LANDO to related data-driven architectures. The eDMD algorithm has already been kernelized [6,7], enabling efficient approximations to the Koopman operator with very large feature spaces. Although it is related to the present work, the goal of eDMD/kDMD is to obtain a square representation of the dynamics of measurement functions in a Hilbert space or feature space φ(x), rather than a closed representation of the dynamics in the original state x. In this way, our approach more closely resembles the SINDy procedure, but kernelized to scale to arbitrarily large problems. We will also show that even though the representation of the dynamics is implicit, it is possible to extract explicit model structures, such as the linear component and other relevant quantities, from the kernel representation.

In the following subsections, we will outline the DMD algorithm and provide an introduction to the kernel methods that will be used throughout this work.
(a) Dynamic mode decomposition
The original DMD algorithm of [2] was developed as a data-driven method for decomposing high-dimensional snapshot data into a set of coherent spatial modes, along with a low-dimensional model for how these mode amplitudes evolve linearly in time. As such, DMD may be viewed as a hybrid algorithm combining PCA in space and the discrete-time Fourier transform in time [37]. DMD has been adopted in a wide range of fields beyond fluid mechanics, including epidemiology [38], neuroscience [39], video processing [40], robotics [41] and plasma physics [42]. Much of this success stems from the formulation of DMD as a linear regression problem [4], based entirely on measurement data, resulting in several powerful extensions [5], including DMD with control [18], sparsity-promoting DMD [43], DMD for non-sequential time series [4,19] and DMD for data that are under-resolved in space [44] or time [45].
The original algorithm was refined by [4], who phrased DMD in terms of the Moore–Penrose pseudoinverse, thereby allowing snapshots that are not equally spaced in time; this variant is called exact DMD and will be the main form of DMD used in this paper. In the electronic supplementary material, appendix C(a), we will show that exact DMD may be viewed as a special case of our new method.
As mentioned in the previous section, it is assumed that each x_j and y_j are connected by a dynamical system of the form y_j = F(x_j). The aim of DMD is to learn information about F by approximating it as a linear operator and then performing diagnostics on that approximation. In particular, DMD seeks the linear operator A that best maps the sets {x_j} and {y_j} into one another:

    y_j ≈ A x_j   for j = 1, ..., m.    (2.9)
Expressed in terms of the snapshot matrices in (2.2), (2.9) becomes

    Y ≈ AX,    (2.10)

and the minimum-norm solution is

    A = argmin_A ||Y − AX||_F = YX†,    (2.11)

where † indicates the Moore–Penrose pseudoinverse [46]. If X has the singular value decomposition (SVD) X = UΣV∗ then A = YVΣ†U∗. Note that A is an n × n matrix so may be extremely large in practice, where n ≫ 1. Thus, it is common to use a rank-r approximation for A, denoted by Â, where r ≪ n. To construct Â, we build a rank-r approximation for X using the truncated SVD: X ≈ U_r Σ_r V_r∗ = X_r. This approximation is optimal according to the Eckart–Young theorem [47]. The matrix A is then projected onto the column space of X_r as

    Â = U_r∗ A U_r = U_r∗ Y V_r Σ_r^{−1}.    (2.12)

Since Â is an r × r matrix, it is now feasible to compute its eigendecomposition as

    Â Ψ̂ = Ψ̂ Λ.    (2.13)

It was proved by [4] that the eigenvectors of the full matrix A can be approximated from the reduced eigenvectors Ψ̂ by

    Ψ = Y V Σ^{−1} Ψ̂.    (2.14)

This eigendecomposition has many favourable properties. Firstly, it is an approximation to the spectrum of the underlying Koopman operator of the system [3]. Secondly, if the snapshots are equally spaced in time and y_j = x_{j+1} then the data can be reconstructed in terms of the eigenvectors and eigenvalues as

    x_j = Ψ Λ^{j−1} a,    (2.15)

where the vector a contains the mode amplitudes, often computed as a = Ψ† x_1. The above provides a clear physical interpretation of the modes: the eigenvectors Ψ are the spatial modes, whereas the eigenvalues Λ correspond to the temporal evolution.
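The steps in (2.10)–(2.15) translate directly into a few lines of linear algebra. The following Python sketch is an illustrative implementation of exact DMD using the truncated SVD factors; it is not the authors' code (which is available at www.github.com/baddoo/LANDO), and the function names are chosen for this example.

```python
# Illustrative numpy implementation of exact DMD, following (2.10)-(2.15).
# This is a sketch for exposition, not the authors' code; function names are
# chosen for this example only.
import numpy as np

def exact_dmd(X, Y, r):
    """Rank-r exact DMD of snapshot matrices X, Y (each n x m)."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    Ur, Sr, Vr = U[:, :r], s[:r], Vh[:r].conj().T       # truncated SVD: X ~ Ur diag(Sr) Vr*
    Atilde = Ur.conj().T @ Y @ Vr / Sr                  # (2.12): r x r projected operator
    Lam, Psi_hat = np.linalg.eig(Atilde)                # (2.13): reduced eigendecomposition
    Psi = Y @ Vr / Sr @ Psi_hat                         # (2.14): full-dimensional DMD modes
    return Lam, Psi

def dmd_reconstruct(Lam, Psi, x1, m):
    """Reconstruct m snapshots via x_j = Psi Lam^(j-1) a, with a = Psi^+ x_1, as in (2.15)."""
    a = np.linalg.pinv(Psi) @ x1
    return np.column_stack([(Psi * Lam ** j) @ a for j in range(m)]).real
```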
(b) Kernel methods
Kernel methods are a class of statistical machine learning algorithms that perform efficient computations with high-dimensional nonlinear features [26]. Kernel methods have found applications in adaptive filtering [48], nonlinear principal component analysis [49], nonlinear regression [32], classification [50] and support vector machines [51]. The broad success of kernel machines stems from their ability to efficiently compute inner products in a high-dimensional, or even infinite-dimensional, nonlinear feature space. Thus, if a conventional linear algorithm can be phrased exclusively in terms of inner products then it can be 'kernelized' and adapted for nonlinear problems. This 'kernel trick' has been used to great effect in the above applications.
Kernels are continuous functions k: R^n × R^n → R, and a kernel is a Mercer kernel if it is positive definite; i.e. for any collection of vectors x′_j ∈ R^n, the matrix K defined by [K]_{i,j} = k(x′_i, x′_j) is positive definite. By Mercer's theorem [52], it follows that there exists a Hilbert space H_k and a mapping φ: R^n → H_k such that k(x, x′) = ⟨φ(x), φ(x′)⟩. In other words, every Mercer kernel can be interpreted as an inner product in the Hilbert space H_k, which may be of an otherwise inaccessible dimension. Every element g ∈ H_k can be expressed as a linear combination

    g(x) = ∑_{j=1}^{M} α_j k(x′_j, x),    (2.16)

for some M ∈ N, α_i ∈ R^n and x′_j ∈ R^n. We drop the word 'Mercer' in the remainder of the article, and assume that all kernels are Mercer kernels.
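A quick numerical illustration of the Mercer property is to form the Gram matrix [K]_{i,j} = k(x′_i, x′_j) for a particular kernel and check that its eigenvalues are non-negative. The Gaussian kernel, bandwidth and random samples below are arbitrary choices for the demonstration, not quantities from this work.

```python
# Numerical illustration (not from the paper) of the Mercer property: the Gram
# matrix of a Gaussian kernel on a set of samples is symmetric positive
# semi-definite, so its eigenvalues are non-negative up to round-off.
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """X holds one sample per column; returns the m x m matrix [K]_ij = k(x_i, x_j)."""
    sq = np.sum(X ** 2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X      # pairwise squared distances
    return np.exp(-D2 / (2.0 * sigma ** 2))

samples = np.random.default_rng(2).standard_normal((3, 50))   # 50 samples in R^3
K = gaussian_gram(samples)
print(np.min(np.linalg.eigvalsh(K)))                    # >= 0 (to machine precision)
```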
An important result in the theory of kernel learning is the representation theorem. First proved by [53] and then generalized by [54], the representation theorem provides very general conditions under which kernel methods can be used to solve machine learning problems. For the purposes of the present work, the representation theorem may be stated thus: for a set of m training sample pairs (x_1, y_1), ..., (x_m, y_m), the solution to the minimization problem

    argmin_{f ∈ H_k} ||Y − f(X)||_F + λ R(f),    (2.17)

may be expressed as

    f(x) = ∑_{j=1}^{m} w_j k(x_j, x),    (2.18)

for vectors w_j ∈ R^n. One important consequence of the representation theorem is that the solution to the optimization problem (2.17) can be expressed as a linear combination of kernel functions whose first arguments are the training data. Contrast this with the general representation of members of H_k in (2.16), where the parameters x′_j are not known. The representation theorem allows us to avoid an exhaustive search for the optimal parameters, thereby reducing the problem to a (linear) search for the weights w_j. In the above, λ > 0 is a regularization parameter and the regularizer on f is to be interpreted as the norm associated with k [27].
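To make the consequence of the representation theorem concrete, the sketch below fits the weights w_j of (2.18) under the additional assumption that R(f) is the squared RKHS norm, which yields the standard kernel ridge regression solution W = Y(k(X, X) + λI)^{−1}. This choice of regularizer, the degree-2 polynomial kernel and the function names are illustrative assumptions and are not necessarily the choices made later in this work.

```python
# A minimal sketch (not the authors' implementation) of the representation
# theorem in practice: kernel ridge regression for f(x) = W k(X, x), assuming
# the regularizer R(f) is the squared RKHS norm so that W = Y (K + lam I)^{-1}.
import numpy as np

def quadratic_kernel(A, B):
    """k(u, v) = (u^T v)^2, evaluated for all column pairs of A and B."""
    return (A.T @ B) ** 2

def fit_kernel_weights(X, Y, kernel, lam=1e-8):
    """Weights W (n x m) such that f(x) = W k(X, x) approximately maps x_j to y_j."""
    K = kernel(X, X)                                      # m x m Gram matrix
    m = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(m), Y.T).T    # W = Y (K + lam I)^{-1}

def predict(W, X, kernel, x):
    return W @ kernel(X, x[:, None])[:, 0]                # f(x) = W k(X, x)
```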
(i) An illustrative example
The discussion of kernels has thus far been rather abstract; we now make the theory concrete by illustrating a simple application of kernel methods. This example is often used in kernel tutorials [7,26].
Consider a three-dimensional state, x ∈ R^3, upon which we want to perform some machine learning task such as regression or classification. Suppose that we know, from either physical intuition, empirical data or experience, that the system is governed by pairwise quadratic interactions between the state variables. Thus, our machine learning model should operate in the nonlinear feature space defined by

    φ(x) = [x_1 x_2,  x_1 x_3,  x_2 x_3,  x_1^2,  x_2^2,  x_3^2]^T ∈ R^6.    (2.19)

Almost every machine learning algorithm uses inner products to measure correlations between samples. Computing inner products in a feature space of dimension N costs 2N − 1 operations: N products and N − 1 summations. Thus, in this example, computing inner products in the nonlinear feature space would usually require 11 operations. However, we still need to form the two feature vectors φ(x) and φ(x′), which cost a further six operations each, raising the total count to 23 operations.

Equivalently, we could build our model in the slightly rescaled feature space

    φ(x) = [√2 x_1 x_2,  √2 x_1 x_3,  √2 x_2 x_3,  x_1^2,  x_2^2,  x_3^2]^T.    (2.20)

Now note that inner products in this feature space may be expressed as

    ⟨φ(x), φ(x′)⟩ = 2 x_1 x′_1 x_2 x′_2 + 2 x_1 x′_1 x_3 x′_3 + 2 x_2 x′_2 x_3 x′_3 + x_1^2 (x′_1)^2 + x_2^2 (x′_2)^2 + x_3^2 (x′_3)^2
                  = (x_1 x′_1 + x_2 x′_2 + x_3 x′_3)^2
                  = ⟨x, x′⟩^2.    (2.21)

Thus, we can compute the inner product ⟨φ(x), φ(x′)⟩ in merely six operations by computing the inner product ⟨x, x′⟩ and then squaring the result. In other words, computing the inner product amounts to evaluating the kernel k(u, v) = (u^T v)^2. Moreover, while computing the inner product with the kernel, we never explicitly form the feature space, and therefore do not need to store φ(x) in memory. In summary, if we use expression (2.21) then the cost of computing inner products falls from 23 operations to six operations.
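The identity (2.21) is easily verified numerically; the snippet below compares the explicit six-dimensional inner product with the kernel evaluation for random vectors (the random data are purely illustrative).

```python
# Numerical check (illustrative) of the identity (2.21):
# <phi(x), phi(x')> = <x, x'>^2 for the rescaled quadratic features of (2.20).
import numpy as np

def phi(x):
    r2 = np.sqrt(2.0)
    return np.array([r2 * x[0] * x[1], r2 * x[0] * x[2], r2 * x[1] * x[2],
                     x[0] ** 2, x[1] ** 2, x[2] ** 2])

rng = np.random.default_rng(3)
x, xp = rng.standard_normal(3), rng.standard_normal(3)
print(phi(x) @ phi(xp))      # explicit inner product in the 6-dimensional feature space
print((x @ xp) ** 2)         # kernel evaluation k(x, x') = (x^T x')^2 -- same value
```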
This may seem a modest saving, but the cost of computations in feature space explodes as the state dimension or degree of nonlinearity increases. For a state of dimension n, the number of degree-d monomial features is N = (n+d−1 choose d) = (n + d − 1)!/(d! (n − 1)!), which can be thought of as the number of ways of distributing d unlabelled balls into n labelled urns. Thus, explicitly forming vectors in this feature space is extremely expensive, as is computing inner products. For example, for an n-dimensional state, the number of possible quadratic interactions between states is n(n − 1)/2. This scaling of the feature vector is the prime limitation of SINDy.
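The combinatorial growth of N is easy to tabulate; the values of n and d below are illustrative and are not tied to any example in this work.

```python
# Tabulating N = C(n + d - 1, d), the number of degree-d monomials in n
# variables; the values of n and d are illustrative only.
from math import comb

for n in (10, 100, 1000):
    for d in (2, 3, 4):
        print(f"n = {n:4d}, d = {d}: N = {comb(n + d - 1, d):,}")
```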
Instead of explicitly forming this vast feature space, we work with suitably chosen kernels. The feature space of degree-d monomials can be represented using the polynomial kernel

    k(u, v) = (u^T v)^d.    (2.22)

Thus, using the kernel (2.22) to compute inner products reduces the operation count from 2N − 1 to 2n, which is significant when the state space is large and the nonlinearity is quadratic or higher.
3. Learning kernel models with sparse dictionaries
We now develop the main machine learning method presented in this paper. The procedure is based on the KRLS algorithm of [32] but is more stable and allows further interpretation and analysis of the learned model. We specifically tailor this approach to learn dynamical systems in a robust and interpretable framework. Recall that we are solving the optimization problem defined in (2.4) for a nonlinear function f that approximates the dynamics. By the representation theorem, we may express the dynamical system approximation f from (2.3) in the kernelized form (2.18) as

    f(x) = ∑_{j=1}^{N} ξ_j φ_j(x) = ∑_{j=1}^{m} w_j ⟨φ(x_j), φ(x)⟩ = ∑_{j=1}^{m} w_j k(x_j, x).    (3.1)
Arranging the column vectors w_j into a matrix W allows us to write f(x) = W k(X, x), so the optimization problem is

    argmin_W ||Y − W k(X, X)||_F + λ R(f).    (3.2)

Theoretically, a solution to (3.2), in the absence of regularization, is provided by the Moore–Penrose pseudoinverse:

    W = Y k(X, X)†.    (3.3)
As noted by [32], there are three practical problems with the above solution:

— Numerical conditioning: even though the kernel matrix may formally have full rank, it will usually have a very large condition number since the samples can be almost linearly dependent in the feature space. When the condition number is large, the condition number of the pseudoinverse will also be large and W will amplify noise by a corresponding amount.

— Overfitting: the weight matrix W has mn entries, which is equal to the number of constraining equations in (3.2). Thus, there is a risk of overfitting, which can limit the generalizability of the model and make it sensitive to noise.

— Computational complexity: in nonlinear system identification, we usually need a large number of samples to adequately learn the system. When there are m ≫ 1 samples, constructing the pseudoinverse k(X, X)† requires O(m^3) operations and O(m^2) space in memory, which can become prohibitively expensive. Additionally, evaluating the model f for prediction or reconstruction requires multiplying the n × m weight matrix by the m-vector of kernel evaluations, which also becomes expensive for large sample sets.
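A brief sketch of the naive solution (3.3) illustrates the first of these concerns: for a generic Gaussian kernel the Gram matrix k(X, X) is formally full rank yet severely ill-conditioned. The kernel, bandwidth and placeholder data below are arbitrary choices for the illustration and do not correspond to any example in this work.

```python
# Illustration of the unregularized solution (3.3), W = Y k(X, X)^+, and of the
# first concern above: the Gram matrix of a generic kernel is typically very
# ill-conditioned. The Gaussian kernel, bandwidth and data are placeholder
# choices for this sketch.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = np.sum(A ** 2, 0)[:, None] + np.sum(B ** 2, 0)[None, :] - 2.0 * A.T @ B
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 200))              # m = 200 samples of a 3-dimensional state
Y = np.tanh(X)                                 # placeholder "dynamics" outputs
K = gaussian_kernel(X, X)
W = Y @ np.linalg.pinv(K)                      # naive weights from (3.3)
print(f"cond(k(X, X)) = {np.linalg.cond(K):.2e}")   # usually enormous: near-dependent samples
```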