Kernel Learning for Robust Dynamic Mode Decomposition:
Linear and Nonlinear Disambiguation Optimization (LANDO)
Peter J. Baddoo¹, Benjamin Herrmann²,³, Beverley J. McKeon⁴, and Steven L. Brunton²

¹Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
²Department of Mechanical Engineering, University of Washington, Seattle, WA 98195, USA
³Institute of Fluid Mechanics, Technische Universität Braunschweig, 38108 Braunschweig, Germany
⁴Graduate Aerospace Laboratories, California Institute of Technology, Pasadena, CA 91125, USA
Abstract
Research in modern data-driven dynamical systems is typically focused on the three key challenges of high dimensionality, unknown dynamics, and nonlinearity. The dynamic mode decomposition (DMD) has emerged as a cornerstone for modeling high-dimensional systems from data. However, the quality of the linear DMD model is known to be fragile with respect to strong nonlinearity, which contaminates the model estimate. In contrast, sparse identification of nonlinear dynamics (SINDy) learns fully nonlinear models, disambiguating the linear and nonlinear effects, but is restricted to low-dimensional systems. In this work, we present a kernel method that learns interpretable data-driven models for high-dimensional, nonlinear systems. Our method performs kernel regression on a sparse dictionary of samples that appreciably contribute to the underlying dynamics. We show that this kernel method efficiently handles high-dimensional data and is flexible enough to incorporate partial knowledge of system physics. It is possible to accurately recover the linear model contribution with this approach, disambiguating the effects of the implicitly defined nonlinear terms, resulting in a DMD-like model that is robust to strongly nonlinear dynamics. We demonstrate our approach on data from a wide range of nonlinear ordinary and partial differential equations that arise in the physical sciences. This framework can be used for many practical engineering tasks such as model order reduction, diagnostics, prediction, control, and discovery of governing laws.
1 Introduction
Discovering interpretable patterns and models from high-dimensional data is one of the principal challenges of scientific machine learning, with the potential to transform our ability to predict and control complex physical systems [1]. The current surge in the quality and quantity of data, along with rapidly improving computational hardware, has motivated a wealth of machine learning techniques that uncover such patterns for dynamical systems. Successful recent methods include the dynamic mode decomposition (DMD) [2–5] and extended DMD [6, 7], sparse identification of nonlinear dynamics (SINDy) for ordinary and partial differential equations [8, 9], genetic programming for model discovery [10, 11], physics-informed neural networks (PINNs) [12, 13], Lagrangian neural networks [14], time-lagged autoencoders [15], operator-theoretic methods [16–19], and operator inference [20, 21]. Techniques based on generalized linear regression, such as DMD and SINDy, are widely used because they are computationally efficient, require less data than neural networks, are highly extensible, and provide interpretable models. However, these approaches are either challenged by nonlinearity (e.g., DMD) or do not scale to high-dimensional systems (e.g., SINDy). In this work, we present a machine learning algorithm that leverages sparse kernel regression to address both challenges, efficiently learning high-dimensional nonlinear models that admit interpretable spatio-temporal coherent structures and robust locally linear models.
Corresponding author (baddoo@mit.edu)
Figure 1: Learning regression models in linear (a) and nonlinear (b) feature spaces; panel (a) shows the best linear fit, and panel (b) shows the best nonlinear fit together with extracted local linear models. Our approach disambiguates linear and nonlinear model contributions to accurately extract local linear models.
A central goal of modern data-driven dynamical systems [1] is to identify a model

$$\frac{d}{dt}\mathbf{x} = \mathbf{F}(\mathbf{x}) = \mathbf{L}\mathbf{x} + \mathbf{N}(\mathbf{x}) \qquad (1)$$

that describes the evolution of the state of the system, $\mathbf{x}$. Here we explicitly indicate that the dynamics $\mathbf{F}$ have a linear contribution $\mathbf{L}$ and a nonlinear contribution $\mathbf{N}$, although many techniques do not model these separately or explicitly. However, several approaches obtain interpretable and explicit models of this form. For example, DMD seeks a best-fit linear model of the dynamics, while SINDy directly identifies sparse nonlinear models of the form in (1).
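For concreteness, the chaotic Lorenz system explored in Section 5 already has the form of (1). The following minimal sketch (Python with NumPy; an illustration of the notation, not part of the paper's codebase) makes the split between $\mathbf{L}\mathbf{x}$ and $\mathbf{N}(\mathbf{x})$ explicit:

```python
# Minimal sketch of the decomposition in (1) for the Lorenz system:
# dx/dt = F(x) = Lx + N(x), with L collecting the terms that are linear
# in the state and N(x) holding the quadratic interactions.
import numpy as np

sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0

L = np.array([[-sigma, sigma,  0.0],
              [   rho,  -1.0,  0.0],
              [   0.0,   0.0, -beta]])   # linear contribution

def N(x):
    """Nonlinear contribution: the quadratic terms of the Lorenz system."""
    return np.array([0.0, -x[0] * x[2], x[0] * x[1]])

def F(x):
    """Full dynamics: linear plus nonlinear parts, as in (1)."""
    return L @ x + N(x)

x = np.array([1.0, 2.0, 3.0])
print(L @ x, N(x), F(x))  # [10. 26. -8.] [ 0. -3.  2.] [10. 23. -6.]
```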
Our approach synthesizes favorable aspects of several approaches mentioned above; however, it most directly complements and addresses the challenges of DMD for strongly nonlinear systems. The dynamic mode decomposition was originally introduced by Schmid [2] in the fluid dynamics community as a method for extracting spatio-temporal coherent structures from high-dimensional data, resulting in a low-rank representation of the best-fit linear operator that maps the data forward in time [4, 5]. The resulting linear DMD models have been used to characterize many systems in fluid mechanics, where complex flows admit dominant modal decompositions [22–25] and linear control is commonly used [26–29]. DMD has also been adopted in a wide range of fields beyond fluid mechanics, and much of its success stems from the formulation of DMD as a linear regression problem [4], based entirely on measurement data, resulting in several powerful extensions [5]. However, because DMD uses least-squares regression to find a best-fit linear model $d\mathbf{x}/dt \approx \mathbf{A}\mathbf{x}$ to the data, the presence of measurement noise [30–33], control inputs [34], and nonlinearity biases the regression. Mathematically, the noise, control inputs, and nonlinearity may all be lumped into a forcing $\mathbf{b}$:

$$\frac{d}{dt}\mathbf{x} = \mathbf{L}\mathbf{x} + \mathbf{b} \approx \mathbf{A}\mathbf{x}. \qquad (2)$$

The forcing $\mathbf{b}$ contaminates the linear model estimate, so $\mathbf{A}$ from DMD does not approximate the true linear contribution from $\mathbf{L}$. It was recognized early that the DMD algorithm was highly sensitive to noise [30–32], resulting in noise-robust variants, including forward-backward DMD [31], total least-squares DMD [32], optimized DMD [33], consistent DMD [35], and DMD based on robust PCA [36]. Similarly, DMD with control (DMDc) [34] was introduced to disambiguate the effect of the linear dynamics from actuation. For statistically stationary systems with stochastic inputs, the spectral proper orthogonal decomposition (SPOD) [22] produces an optimal basis of modes to describe the variability in an ensemble of DMD modes [24]. The bias due to nonlinearity, shown in Fig. 1(a), has been less thoroughly explored and is the topic of the present work.
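The bias in (2) is easy to reproduce numerically. The sketch below (our own illustrative trajectory and time step, not one of the paper's experiments) fits a least-squares linear model $\mathbf{A} = \mathbf{Y}\mathbf{X}^{\dagger}$ to Lorenz data and shows that $\mathbf{A}$ differs substantially from the true linear contribution $\mathbf{L}$:

```python
# Illustration of the nonlinearity bias in (2): a least-squares linear fit
# A = Y X^+ to data from the (nonlinear) Lorenz system does not recover
# the true linear contribution L.
import numpy as np

sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0
L = np.array([[-sigma, sigma, 0.0], [rho, -1.0, 0.0], [0.0, 0.0, -beta]])
F = lambda x: L @ x + np.array([0.0, -x[0] * x[2], x[0] * x[1]])

def rk4_step(x, dt):
    """One fourth-order Runge-Kutta step of the Lorenz dynamics."""
    k1 = F(x); k2 = F(x + 0.5 * dt * k1)
    k3 = F(x + 0.5 * dt * k2); k4 = F(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

dt, m = 0.01, 5000
X = np.zeros((3, m)); X[:, 0] = [1.0, 1.0, 1.0]
for j in range(1, m):
    X[:, j] = rk4_step(X[:, j - 1], dt)
Y = np.stack([F(X[:, j]) for j in range(m)], axis=1)  # y_j = dx/dt at x_j

A = Y @ np.linalg.pinv(X)  # best-fit linear model, dx/dt ~ A x
print(np.linalg.norm(A - L) / np.linalg.norm(L))  # O(1): A is biased away from L
```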
Despite these challenges, DMD is frequently applied to strongly nonlinear systems, with theoretical motivation from Koopman operator theory [3, 5, 16, 19]. However, DMD typically only yields accurate models for periodic or quasiperiodic systems, and is fundamentally unable to capture nonlinear transients, multiple fixed points or periodic orbits, or other more complicated attractors [37]. Williams et al. [6] developed the extended DMD (eDMD), which augments the original state with nonlinear functions of the state to better approximate the nonlinear eigenfunctions of the Koopman operator for nonlinear systems. Further, a kernel version of eDMD was introduced for high-dimensional systems [7]. However, because this approach still fundamentally results in a linear model (in the augmented state), it also suffers from the same issues of not being able to handle multiple fixed points or attracting structures, and it also typically suffers from closure issues related to the irrepresentability of Koopman eigenfunctions. The sparse identification of nonlinear dynamics (SINDy) [8] algorithm is a related regression approach to model discovery, which identifies a fully nonlinear model as a sparse linear combination of candidate terms in a library. While SINDy is able to effectively disambiguate the linear and nonlinear dynamics in (1), resulting in the ability to obtain de-biased locally linear models as in Fig. 1(b), it only applies to relatively low-dimensional systems because of the poor scaling of the library with state dimension.
1.1 Contributions of this work
In this work, we develop a custom kernel regression algorithm to learn accurate, efficient, and interpretable data-driven models for strongly nonlinear, high-dimensional dynamical systems. This approach scales to very high dimensions, unlike SINDy, yet still accurately disambiguates the linear part of the model from the implicitly defined nonlinear dynamics. Thus, it is possible to obtain linear DMD models, local to a given base state, that are robust to strongly nonlinear dynamics. Our approach, referred to as the linear and nonlinear disambiguation optimization (LANDO) algorithm, may be viewed as a generalisation of DMD that enables a robust disambiguation of the underlying linear operator from nonlinear forcings. The learning framework is illustrated in Fig. 2, and open-source code is available at www.github.com/baddoo/LANDO.
To achieve this robust learning, we improve upon several leading kernel and system identification algorithms. Recent works have successfully applied kernel methods [38, 39] to study data-driven dynamical systems [7, 40–48]. A key contribution, and an inspiration for the present work, is kernel DMD (kDMD, [7]), which seeks to approximate the infinite-dimensional Koopman operator as a large square matrix evolving nonlinear functions of the original state. An essential difference between kDMD and the present work is that our goal is to implicitly model the (non-square) nonlinear dynamics in (1) in terms of the original state $\mathbf{x}$, enabling the robust extraction of the linear component $\mathbf{L}$, as opposed to analysing the Koopman operator over measurement functions. In our work, we present a modified kernel recursive least-squares algorithm (KRLS, [49]) to learn a nonlinear model that best characterizes the observed data. To reduce the training cost, which typically scales with the cube of the number of training samples for kernel methods, we use dictionary learning to iteratively identify samples that contribute to the dynamics. This dictionary learning approach may be seen as a sparsity-promoting regulariser, significantly reducing the high condition number that is common with kernel methods and improving robustness to noise. We introduce an iterative Cholesky update to construct the dictionary in a numerically stable manner, significantly reducing the training cost while also mitigating overfitting. Similarly to KRLS, our model has the option of operating online by parsing data in a streaming fashion and exploiting rank-one matrix updates to revise the model. Therefore, our approach is also suitable for model order reduction in practical applications where data become available "on-the-fly". Further, we show how to incorporate partially known physics, or uncover unknown physics, by designing or testing tailored kernels, much as with the SINDy framework [8, 50].
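To convey the flavor of the dictionary construction, the sketch below implements the almost-linearly-dependent (ALD) test that underlies KRLS [49]: a sample enters the dictionary only if its projection error, in the feature space induced by the kernel, onto the span of the current dictionary exceeds a tolerance ν. For brevity it solves against the Gram matrix directly; the Cholesky-update formulation developed in Section 3 is the numerically stable and efficient version of this idea:

```python
import numpy as np

def ald_dictionary(X, kernel, nu=1e-6):
    """Sparse dictionary selection via the ALD test (after Engel et al., KRLS).

    A sample x_t is admitted only if its feature-space projection error onto
    the span of the current dictionary exceeds the tolerance nu.
    """
    dictionary = [X[:, 0]]
    for t in range(1, X.shape[1]):
        xt = X[:, t]
        D = np.stack(dictionary, axis=1)
        K = kernel(D, D)                    # Gram matrix of the dictionary
        k_t = kernel(D, xt[:, None])[:, 0]  # kernel of dictionary vs. x_t
        pi = np.linalg.solve(K, k_t)        # projection coefficients
        delta = kernel(xt[:, None], xt[:, None])[0, 0] - k_t @ pi
        if delta > nu:                      # not (almost) linearly dependent:
            dictionary.append(xt)           # admit the sample
    return np.stack(dictionary, axis=1)

# With a quadratic polynomial kernel the dictionary saturates at the
# dimension of the induced feature space (6 for n = 2), not at m = 500.
poly = lambda A, B: (1.0 + A.T @ B) ** 2
X = np.random.default_rng(0).standard_normal((2, 500))
print(ald_dictionary(X, poly).shape)  # (2, 6)
```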
We demonstrate our proposed kernel learning approach on a range of complex dynamical systems that arise in the physical sciences. As an illustrative example, we first explore the chaotic Lorenz system.
Figure 2: The Linear and Nonlinear Disambiguation Optimization (LANDO) framework. Training data in the form of snapshot pairs are collected from either simulation or experiment in 1). The data are organised into data matrices in 2). In 3) an appropriate kernel is defined (e.g. linear, Gaussian, or polynomial), which can be informed by expert knowledge of the underlying physics of the system or through a cross-validation procedure. In 4) a sparse dictionary of basis elements is constructed from the training samples, spanning the largest subspace in the feature space defined by $k$, and in 5) the regression problem is solved. Finally, in 6) an interpretable model is extracted, including the linear operator and its spectrum, predictions, and the nonlinear forcing. Optionally, the dictionary and model can be learned online from streaming data.
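The kernels named in step 3) take the standard forms below; the scale $\sigma$ and degree $d$ are illustrative hyperparameters, to be chosen by physical insight or cross-validation:

```python
import numpy as np

# Standard kernel choices for step 3) of the LANDO framework (Fig. 2).
# Columns of X and Xp are samples; each function returns the matrix of
# pairwise evaluations k(x_i, x_j').

def linear_kernel(X, Xp):
    return X.T @ Xp

def polynomial_kernel(X, Xp, c=1.0, d=3):
    return (c + X.T @ Xp) ** d

def gaussian_kernel(X, Xp, sigma=1.0):
    sq = (np.sum(X**2, 0)[:, None] + np.sum(Xp**2, 0)[None, :]
          - 2.0 * X.T @ Xp)  # pairwise squared distances
    return np.exp(-sq / (2.0 * sigma**2))

X = np.random.default_rng(0).standard_normal((3, 5))
print(linear_kernel(X, X).shape, gaussian_kernel(X, X).shape)  # (5, 5) (5, 5)
```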
We also consider partial differential equations, using the LANDO framework to uncover the linear and nonlinear components of the nonlinear Burgers', Korteweg–de Vries, and Kuramoto–Sivashinsky equations using only nonlinear measurement data. The algorithm accurately recovers the spectrum of the linear operator for these systems, enabling linearised analyses, such as linear stability, transient growth, and resolvent analyses [51–54], in a purely data-driven manner [55], even for strongly nonlinear systems. We also demonstrate the approach on a high-dimensional system of coupled nonlinear Kuramoto oscillators.
The remainder of the work is organized as follows. Section 2 provides a mathematical background overview of DMD and kernel methods. Section 3 introduces our kernel learning procedure for dynamical systems, including the sparse dictionary learning with Cholesky updates. We demonstrate how to extract interpretable structures from these kernel models, such as robust linear DMD models, in Section 4. Results on a variety of nonlinear dynamical systems are presented in Section 5. Finally, Section 6 concludes with a discussion of limitations and suggested extensions of the method, providing a comparison with the related DMD, eDMD/kDMD, and SINDy approaches, also summarised in Fig. 4. In the appendices we explicate the connection between our method and DMD, demonstrate how one can incorporate the effects of control, present the equations for online updating, and investigate the sensitivity of the algorithm to noise.
2 Problem statement and mathematical background
In this section we will define our machine learning problem and review some relevant mathematical ideas related to DMD (Sec. 2.1) and kernel methods (Sec. 2.2).
We consider dynamical systems describing the evolution of an $n$-dimensional vector $\mathbf{x} \in \mathbb{R}^n$ that characterizes the state of the system. We will consider both continuous-time and discrete-time dynamics in this work. The dynamics may be expressed either in continuous time as

$$\frac{d}{dt}\mathbf{x}(t) = \mathbf{F}(\mathbf{x}(t))$$

or in discrete time as

$$\mathbf{x}_{j+1} = \mathbf{F}(\mathbf{x}_j).$$

For a given physical system, the continuous-time and discrete-time representations of the dynamics will correspond to different functions $\mathbf{F}$, although we use the same function above for notational simplicity. In general, the dynamics may also vary in time and depend on control inputs $\mathbf{u}$ and parameters $\boldsymbol{\beta}$; however, for simplicity, we begin with the autonomous dynamics above.
Our goal is to learn a tractable representation of the dynamical system $\mathbf{F}: \mathbb{R}^n \to \mathbb{R}^n$ that is both accurate and interpretable, informing tasks such as physical understanding, diagnostics, prediction, and control. We suppose that we have access to a training set of data pairs $\{(\mathbf{x}_j, \mathbf{y}_j) \in \mathbb{R}^n \times \mathbb{R}^n \mid j = 1, \ldots, m\}$, which are connected through the dynamics by

$$\mathbf{y}_j = \mathbf{F}(\mathbf{x}_j). \qquad (3)$$
If the dynamics are continuous in time, $\mathbf{y}_j = \dot{\mathbf{x}}_j$, where the dot denotes differentiation in time, and if the dynamics are expressed in discrete time, $\mathbf{y}_j = \mathbf{x}_{j+1}$. The discrete-time formulation is more common, as data from simulations and experiments are often sampled or generated at a fixed sampling interval $\Delta t$, so $\mathbf{x}_j = \mathbf{x}(j\Delta t)$. However, this work applies to both discrete and continuous systems, and the only practical difference arises in the eigenvalues of linearized models.
The training data correspond to $m$ snapshots in time of a simulation or experiment. For ease of notation, it is typical to arrange the samples into snapshot data matrices of the form

$$\mathbf{X} = \begin{bmatrix} | & | & & | \\ \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_m \\ | & | & & | \end{bmatrix}, \qquad \mathbf{Y} = \begin{bmatrix} | & | & & | \\ \mathbf{y}_1 & \mathbf{y}_2 & \cdots & \mathbf{y}_m \\ | & | & & | \end{bmatrix}. \qquad (4)$$

In many applications the state dimension is much larger than the number of snapshots, so $m \ll n$.
For example, the state may correspond to a fluid velocity field sampled on a discretized grid.
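As a concrete sketch (with a random placeholder trajectory standing in for simulation or experimental data), the matrices in (4) can be assembled for either the discrete-time pairing $\mathbf{y}_j = \mathbf{x}_{j+1}$ or a finite-difference approximation of $\mathbf{y}_j = \dot{\mathbf{x}}_j$:

```python
import numpy as np

# traj[:, j] = x(j * dt): a trajectory sampled at a fixed interval dt.
# (Random data here as a placeholder for simulation/experimental snapshots.)
n, m_total, dt = 64, 200, 0.01
traj = np.random.default_rng(0).standard_normal((n, m_total))

# Discrete-time pairing: y_j = x_{j+1}
X, Y = traj[:, :-1], traj[:, 1:]

# Continuous-time pairing: y_j ~ dx/dt via second-order central differences
Xc = traj[:, 1:-1]
Yc = (traj[:, 2:] - traj[:, :-2]) / (2.0 * dt)

print(X.shape, Y.shape, Xc.shape, Yc.shape)  # (64, 199) (64, 199) (64, 198) (64, 198)
```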
Our machine learning problem consists of finding a function $\mathbf{f}$ that suitably maps the training data given certain generalisability, interpretability, and regularity qualifications. In our notation, $\mathbf{F}$ is the true function that generated the data whereas $\mathbf{f}$ is our model for $\mathbf{F}$; it is hoped that $\mathbf{F}$ and $\mathbf{f}$ share some meaningful properties. The function $\mathbf{f}$ is typically restricted to a given class of models (e.g., linear, polynomial, etc.), so that it may be written as the expansion

$$\mathbf{f}(\mathbf{x}) \approx \sum_{j=1}^{N} \boldsymbol{\xi}_j \phi_j(\mathbf{x}) \quad \Longleftrightarrow \quad \mathbf{f}(\mathbf{x}) \approx \boldsymbol{\Xi}\,\boldsymbol{\phi}(\mathbf{x}). \qquad (5)$$

Here, $\boldsymbol{\phi}$ describes the feature library of $N$ candidate terms that may describe the dynamics, and $\boldsymbol{\Xi}$ contains the coefficients that determine which model terms are active and in what proportions.
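The size of such a library grows combinatorially: a polynomial library in $n$ variables up to total degree $d$ contains $\binom{n+d}{d}$ monomials. A quick count (a standard combinatorial fact, not a result from the paper) shows why the explicit expansion (5) becomes untenable for high-dimensional states:

```python
from math import comb

# Number of monomials in n variables up to total degree d: C(n + d, d).
for n in (3, 100, 10_000, 100_000):
    print(f"n = {n:>7,}: cubic library has {comb(n + 3, 3):,} terms")
# n = 100,000 gives ~1.7e14 cubic terms, far beyond what the regression (6)
# can store, while the number of snapshots m typically remains modest.
```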
Figure 3: Schematic relationships between different models (explicit model, implicit kernel model, and dictionary-based kernel model) for $N \gg n$, $m \gg \tilde{m}$. An explicit model (e.g. SINDy) produces explicit weights that connect $N$ features to $n$ outputs. A kernel model uses fewer weights but the relationships between variables are stored implicitly. The dictionary-based kernel model selects the most active samples and therefore uses fewer weights still.
Mathematically, the optimisation problem to be solved is

$$\underset{\boldsymbol{\Xi}}{\operatorname{argmin}} \left\| \mathbf{Y} - \boldsymbol{\Xi}\,\boldsymbol{\phi}(\mathbf{X}) \right\|_F + \lambda R(\boldsymbol{\Xi}), \qquad (6)$$
where $\|\cdot\|_F$ is the Frobenius (Hilbert–Schmidt) norm. The first term in (6) corresponds to the error between training samples and our model prediction, whereas the second term $\lambda R(\boldsymbol{\Xi})$ is a regularizer. For example, in SINDy, the feature library $\boldsymbol{\phi}$ will typically include linear and nonlinear terms, and the regularizer will involve the number of nonzero elements $\|\boldsymbol{\Xi}\|_0$, which may be relaxed to the $\ell_1$-norm $\|\boldsymbol{\Xi}\|_1$. In DMD, the features $\boldsymbol{\phi}$ will simply contain the state $\mathbf{x}$, $\boldsymbol{\Xi}$ will be the DMD matrix $\mathbf{A}$, and instead of a regularizer $R(\boldsymbol{\Xi})$, the minimization is constrained so that the rank of $\mathbf{A} = \boldsymbol{\Xi}$ is equal to $r$. Similarly, in extended DMD, the features $\boldsymbol{\phi}$ will include nonlinear functions of the state, and the minimization is modified to

$$\underset{\boldsymbol{\Xi}}{\operatorname{argmin}} \left\| \boldsymbol{\phi}(\mathbf{Y}) - \boldsymbol{\Xi}\,\boldsymbol{\phi}(\mathbf{X}) \right\|_F + \lambda R(\boldsymbol{\Xi}),$$

resulting in a $\boldsymbol{\Xi}$ that is a large square matrix evolving the nonlinear feature space forward in time.
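For reference, the rank-constrained DMD regression described above admits the standard exact-DMD solution via a truncated SVD of $\mathbf{X}$ [4]. A minimal sketch (with illustrative matrix sizes, independent of the LANDO algorithm) is:

```python
import numpy as np

def dmd(X, Y, r):
    """Best-fit rank-r linear model Y ~ A X (exact DMD, following [4])."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :r], s[:r], Vh[:r, :]           # rank-r truncation
    A_tilde = U.conj().T @ Y @ Vh.conj().T / s      # r x r projected operator
    lam, Wv = np.linalg.eig(A_tilde)                # DMD eigenvalues
    Phi = Y @ Vh.conj().T @ np.diag(1.0 / s) @ Wv   # exact DMD modes
    return A_tilde, lam, Phi

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((50, 30)), rng.standard_normal((50, 30))
A_tilde, lam, Phi = dmd(X, Y, r=10)
print(A_tilde.shape, lam.shape, Phi.shape)  # (10, 10) (10,) (50, 10)
```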
For even moderate state dimensions $n$ and feature complexity, such as the order of monomials $d$, the feature library $\boldsymbol{\phi}$ becomes prohibitively large and the optimization in (6) is intractable. This scaling issue is the primary challenge in applying SINDy to high-dimensional systems. Instead, it is possible to rewrite the expansion (5) in terms of an appropriate kernel function $k$ as

$$f_i(\mathbf{x}) \approx \sum_{j=1}^{m} w_{ij}\, k(\mathbf{x}_j, \mathbf{x}) \quad \Longleftrightarrow \quad \mathbf{f}(\mathbf{x}) \approx \mathbf{W} k(\mathbf{X}, \mathbf{x}). \qquad (7)$$
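The scaling advantage rests on the kernel trick: the kernel evaluates an inner product of feature maps without ever forming them. A small numerical check (for $n = 2$ and a quadratic polynomial kernel, where the explicit feature map is easy to write down) confirms the equivalence:

```python
import numpy as np

# Kernel trick check: (1 + x.x')^2 equals the inner product of explicit
# (weighted) monomial feature maps, so the library is never formed.
def phi(x):
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x[0], r2 * x[1],
                     x[0]**2, r2 * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(2), rng.standard_normal(2)
lhs = (1.0 + x @ xp) ** 2    # implicit: one kernel evaluation
rhs = phi(x) @ phi(xp)       # explicit: inner product in 6-dim feature space
print(np.isclose(lhs, rhs))  # True
```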
In this case, the sum is over the number of snapshots $m$ instead of the number of library elements $N$, dramatically improving the scaling. The optimization in (6) now becomes

$$\underset{\mathbf{W}}{\operatorname{argmin}} \left\| \mathbf{Y} - \mathbf{W} k(\mathbf{X}, \mathbf{X}) \right\|_F + \lambda R(\mathbf{W}). \qquad (8)$$
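A minimal sketch of solving (8) follows, using the common kernel ridge closed form $\mathbf{W} = \mathbf{Y}(\mathbf{K} + \lambda \mathbf{I})^{-1}$ with the symmetric Gram matrix $\mathbf{K} = k(\mathbf{X}, \mathbf{X})$; the recursive, dictionary-based solver developed in Section 3 differs in its details:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 0)[:, None] + np.sum(B**2, 0)[None, :] - 2.0 * A.T @ B
    return np.exp(-sq / (2.0 * sigma**2))

def fit_W(X, Y, kernel, lam=1e-8):
    """Kernel ridge solution of (8): W = Y (K + lam I)^{-1}, K symmetric."""
    K = kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), Y.T).T

def f(W, X_train, kernel, x):
    """Evaluate the learned model f(x) = W k(X, x) from (7)."""
    return W @ kernel(X_train, x[:, None])[:, 0]

# Toy usage: learn dx/dt = F(x) = x - x^3 for a scalar state.
X = np.linspace(-1.5, 1.5, 50)[None, :]
Y = X - X**3
W = fit_W(X, Y, gaussian_kernel)
print(f(W, X, gaussian_kernel, np.array([0.5])))  # ~ [0.375]
```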
We will show that it is possible to improve the scaling further by using a kernel defined on a sparse dictionary $\tilde{\mathbf{X}}$. Figure 3 shows our dictionary-based kernel modeling procedure, where the explicit model on the left is a SINDy model, and the compact model on the right is our kernel model. Thus, our kernel learning approach may be viewed as a kernelized SINDy without sparsity promotion.
Based on the implicit LANDO model, it is possible to efficiently extract the linear component $\mathbf{L}$ of the dynamics, along with a matrix for the nonlinear forcing:

$$\mathbf{Y} = \mathbf{L}\mathbf{X} + \mathbf{N}. \qquad (9)$$
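One way to realize this split, sketched below under a simplification of our own (a finite-difference Jacobian of the learned model at a base state $\bar{\mathbf{x}}$, rather than the exact extraction procedure given in Section 4), is to take $\mathbf{L}$ as the local linear operator and collect the remainder into the nonlinear forcing $\mathbf{N} = \mathbf{Y} - \mathbf{L}\mathbf{X}$:

```python
import numpy as np

def jacobian_fd(f, xbar, eps=1e-6):
    """Finite-difference Jacobian of a model f at base state xbar; this
    approximates the linear contribution L local to xbar."""
    n = xbar.size
    J = np.zeros((n, n))
    f0 = f(xbar)
    for i in range(n):
        e = np.zeros(n); e[i] = eps
        J[:, i] = (f(xbar + e) - f0) / eps
    return J

# Usage sketch with a known toy model standing in for the learned kernel model.
F = lambda x: np.array([-x[0] + 2.0 * x[1], -x[1] + x[0] * x[1]])
xbar = np.zeros(2)
L = jacobian_fd(F, xbar)      # ~ [[-1, 2], [0, -1]] at the origin
X = np.random.default_rng(0).standard_normal((2, 10))
Y = np.stack([F(X[:, j]) for j in range(10)], axis=1)
N = Y - L @ X                 # nonlinear forcing matrix in (9)
print(np.round(L, 4))
```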