Neural Operator: Learning Maps Between Function Spaces
Nikola Kovachki∗ NKOVACHKI@CALTECH.EDU
Caltech
Zongyi Li∗ ZONGYILI@CALTECH.EDU
Caltech
Burigede Liu BGL@CALTECH.EDU
Caltech
Kamyar Azizzadenesheli KAMYAR@PURDUE.EDU
Purdue University
Kaushik Bhattacharya BHATTA@CALTECH.EDU
Caltech
Andrew Stuart ASTUART@CALTECH.EDU
Caltech
Anima Anandkumar ANIMA@CALTECH.EDU
Caltech
Abstract
The classical development of neural networks has primarily focused on learning mappings be-
tween finite dimensional Euclidean spaces or finite sets. We propose a generalization of neural
networks tailored to learn operators mapping between infinite dimensional function spaces. We
formulate the approximation of operators by composition of a class of linear integral operators and
nonlinear activation functions, so that the composed operator can approximate complex nonlinear
operators.
Furthermore, we introduce four classes of operator parameterizations: graph-based
operators, low-rank operators, multipole graph-based operators, and Fourier operators, and describe
efficient algorithms for computing with each one. The proposed neural operators are resolution-
invariant: they share the same network parameters between different discretizations of the under-
lying function spaces and can be used for zero-shot super-resolution. Numerically, the proposed
models show superior performance compared to existing machine learning based methodologies
on Burgers’ equation, Darcy flow, and the Navier-Stokes equation, while being several orders of
magnitude faster than conventional PDE solvers.
Keywords: Deep Learning, Operator Inference, Partial Differential Equations, Navier-Stokes Equation.
1. Introduction
Learning mappings between infinite-dimensional function spaces is a challenging problem with
numerous applications across various disciplines. Examples arise in numerous differential equation
models in science and engineering, in robotics and in computer vision. Of particular interest are maps where
either the input or the output space, or both, are infinite-dimensional, that is, where the
inputs and/or outputs are themselves functions. The possibility of learning such mappings opens
up a new class of problems in the design of neural networks with widespread applicability. New
ideas are required to build upon traditional neural networks which are mappings between finite-
dimensional Euclidean spaces and/or sets of finite cardinality.
A naive approach to this problem is simply to discretize the (input or output) function spaces and
apply standard ideas from neural networks. Instead we formulate a new class of deep neural network
based models, called neural operators, which map between spaces of functions on bounded domains $D \subset \mathbb{R}^d$, $D' \subset \mathbb{R}^{d'}$. Such models, once trained, have the property of being discretization invariant:
sharing the same network parameters between different discretizations of the underlying functional
data. In contrast, the naive approach leads to neural network architectures which depend heavily on
this discretization: new architectures with new parameters are needed to achieve the same error for
differently discretized data. We demonstrate, numerically, that the same neural operator can achieve
a constant error for any discretization of the data while standard feed-forward and convolutional
neural networks cannot.
In this paper we experiment with the proposed model within the context of partial differential
equations (PDEs). We study various solution operators or flow maps arising from the PDE model;
in particular, we investigate mappings between functions spaces where the input data can be, for
example, the initial condition, boundary condition, or coefficient function, and the output data is
the respective solution. We perform numerical experiments with operators arising from the one-
dimensional Burgers’ Equation (Evans, 2010), the two-dimensional steady state of Darcy Flow (Bear
and Corapcioglu, 2012) and the two-dimensional Navier-Stokes Equation (Lemarié-Rieusset, 2018).
1.1 Background and Context
PDEs.
“Differential equations [...] represent the most powerful tool humanity has ever created
for making sense of the material world.” Strogatz (2009). Over the past few decades, significant
progress has been made on formulating (Gurtin, 1982) and solving (Johnson, 2012) the governing
PDEs in many scientific fields from micro-scale problems (e.g., quantum and molecular dynam-
ics) to macro-scale applications (e.g., civil and marine engineering). Despite the success in the
application of PDEs to solve real-world problems, two significant challenges remain:
• identifying the governing model for complex systems;
• efficiently solving large-scale non-linear systems of equations.
Identifying/formulating the underlying PDEs appropriate for modeling a specific problem usually
requires extensive prior knowledge in the corresponding field which is then combined with universal
conservation laws to design a predictive model. For example, modeling the deformation and fracture
of solid structures requires detailed knowledge of the relationship between stress and strain in the
constituent material. For complicated systems such as living cells, acquiring such knowledge is
often elusive and formulating the governing PDE for these systems remains prohibitive, or the
models proposed are too simplistic to be informative. The possibility of acquiring such knowledge
from data can revolutionize these fields. Second, solving complicated non-linear PDE systems
(such as those arising in turbulence and plasticity) is computationally demanding and can often
make realistic simulations intractable. Again the possibility of using instances of data from such
computations to design fast approximate solvers holds great potential for accelerating scientific
discovery and engineering practice.
Learning PDE solution operators.
Neural networks have the potential to address these chal-
lenges when designed in a way that allows for the emulation of mappings between function spaces
(Lu et al., 2019; Bhattacharya et al., 2020; Nelsen and Stuart, 2020; Li et al., 2020a,b,c; Patel et al.,
2021; Opschoor et al., 2020; Schwab and Zech, 2019; O’Leary-Roseberry et al., 2020). In PDE
applications, the governing equations are by definition local, whilst the solution operator exhibits
Figure 1: Zero-shot super-resolution: Vorticity field of the solution to the two-dimensional Navier-Stokes equation with viscosity $10^{-4}$ (Re $= O(200)$); ground truth on top and prediction on bottom. The model is trained on data that is discretized on a uniform $64 \times 64$ spatial grid and on a 20-point uniform temporal grid. The model is evaluated with a different initial condition that is discretized on a uniform $256 \times 256$ spatial grid and an 80-point uniform temporal grid (see Section 6.3.1).
non-local properties. Such non-local effects can be described by integral operators explicitly in the
spatial domain, or by means of spectral domain multiplication; convolution is an archetypal ex-
ample. For integral equations, the graph approximations of Nyström type (Belongie et al., 2002)
provide a consistent way of connecting different grid or data structures arising in computational
methods and understanding their continuum limits (Von Luxburg et al., 2008; Trillos and Slepčev,
2018; Trillos et al., 2020). For spectral domain calculations, there are well-developed tools that
exist for approximating the continuum (Boyd, 2001; Trefethen, 2000). For these reasons, neural
networks that build in non-locality via integral operators or spectral domain calculations are natu-
ral. This is the governing framework for our work aimed at designing mesh invariant neural network
approximations for the solution operators of PDEs.
1.2 Our Contribution
Neural Operators.
We introduce the concept of neural operators by generalizing standard feed-
forward neural networks to learn mappings between infinite-dimensional spaces of functions defined
on bounded domains of $\mathbb{R}^d$. The non-local component of the architecture is instantiated through ei-
ther a parameterized integral operator or through multiplication in the spectral domain. The method-
ology leads to the following contributions.
(i) We propose neural operators, a concept which generalizes neural networks that map between finite-dimensional Euclidean spaces to neural networks that map between infinite-dimensional function spaces.
(ii) By construction, our architectures share the same parameters irrespective of the discretization used on the input and output spaces for the purposes of computation. Consequently, neural operators are capable of zero-shot super-resolution, as demonstrated in Figure 1.
(iii) We propose four methods for practically implementing the neural operator framework: graph-based operators, low-rank operators, multipole graph-based operators, and Fourier operators. Specifically, we develop a Nyström extension to connect the integral operator formulation of the neural operator to families of graph neural networks (GNNs) on arbitrary grids. Furthermore, we study the spectral domain formulation of the neural operator which leads to efficient algorithms in settings where fast transform methods are applicable. We include an exhaustive numerical study of the four formulations.
(iv) Numerically, we show that the proposed methodology consistently outperforms all existing deep learning methods even on the resolutions for which the standard neural networks were designed. For the two-dimensional Navier-Stokes equation, when learning the entire flow map, the method achieves $<1\%$ error with viscosity 1e-3 and $8\%$ error with Reynolds number 200.
(v) The Fourier neural operator has an inference time that is three orders of magnitude faster than the pseudo-spectral method used to generate the data for the Navier-Stokes problem (Chandler and Kerswell, 2013): 0.005s compared to 2.2s on a $256 \times 256$ uniform spatial grid. Despite its tremendous speed advantage, the method does not suffer from accuracy degradation when used in downstream applications such as solving Bayesian inverse problems.
In this work, we propose the neural operator models to learn mesh-free, infinite-dimensional op-
erators with neural networks. Compared to previous methods that we will discuss in the related work
section 1.3, the neural operator remedies the mesh-dependent nature of standard finite-dimensional
approximation methods such as convolutional neural networks by producing a single set of network
parameters that may be used with different discretizations. It also has the ability to transfer solutions
between meshes. Furthermore, the neural operator needs to be trained only once, and obtaining a
solution for a new instance of the parameter requires only a forward pass of the network, alleviating
the major computational issues incurred in physics-informed neural network methods (Raissi et al.,
2019). Lastly, the neural operator requires no knowledge of the underlying PDE, only data.
1.3 Related Works
We outline the major neural network-based approaches for the solution of PDEs. To make the
discussion concrete, we will consider the family of PDEs of the form
$$(\mathcal{L}_a u)(x) = f(x), \quad x \in D, \qquad\qquad u(x) = 0, \quad x \in \partial D, \tag{1}$$
for some $a \in \mathcal{A}$, $f \in \mathcal{U}^*$ and $D \subset \mathbb{R}^d$ a bounded domain. We assume that the solution $u : D \to \mathbb{R}$ lives in the Banach space $\mathcal{U}$ and $\mathcal{L} : \mathcal{A} \to L(\mathcal{U}; \mathcal{U}^*)$ is a mapping from the parameter Banach space $\mathcal{A}$ to the space of (possibly unbounded) linear operators mapping $\mathcal{U}$ to its dual $\mathcal{U}^*$. A natural operator which arises from this PDE is $\mathcal{G}^\dagger : \mathcal{A} \to \mathcal{U}$ defined to map the parameter to the solution, $a \mapsto u$. A simple example that we study further in Section 5.2 is when $\mathcal{L}_a$ is the weak form of the second-order elliptic operator $-\nabla \cdot (a \nabla)$ subject to homogeneous Dirichlet boundary conditions.
In this setting, $\mathcal{A} = L^\infty(D; \mathbb{R}_+)$, $\mathcal{U} = H^1_0(D; \mathbb{R})$, and $\mathcal{U}^* = H^{-1}(D; \mathbb{R})$. When needed, we will
assume that the domain $D$ is discretized into $K \in \mathbb{N}$ points and that we observe $N \in \mathbb{N}$ pairs of coefficient functions and (approximate) solution functions $\{a_j, u_j\}_{j=1}^N$ that are used to train the model (see Section 2.1).
Finite-dimensional operators.
An immediate approach to approximate $\mathcal{G}^\dagger$ is to parameterize it as a deep convolutional neural network (CNN) between the finite-dimensional Euclidean spaces on which the data is discretized, i.e. $\mathcal{G} : \mathbb{R}^K \times \Theta \to \mathbb{R}^K$ (Guo et al., 2016; Zhu and Zabaras, 2018; Adler and Oktem, 2017; Bhatnagar et al., 2019). Khoo et al. (2017) concerns a similar setting, but with output space $\mathbb{R}$. Such approaches are, by definition, not mesh independent and need modifications to the architecture for different resolutions and discretizations of $D$ in order to achieve consistent error (if
at all possible). We demonstrate this issue numerically in Section 6. Furthermore, these approaches
are limited to the discretization size and geometry of the training data and hence it is not possible
to query solutions at new points in the domain. In contrast, for our method we show in Section 6
both invariance of the error to grid resolution and the ability to transfer the solution between
meshes. The work Ummenhofer et al. (2020) proposed a continuous convolution network for fluid
problems, where off-grid points are sampled and linearly interpolated. However, the continuous
convolution method is still constrained by the underlying grid, which prevents generalization to
higher resolutions. Similarly, to obtain finer-resolution solutions, Jiang et al. (2020) proposed learning
super-resolution with a U-Net structure for fluid mechanics problems. However, fine-resolution data
is needed for training, while neural operators are capable of zero-shot super-resolution with no new
data.
DeepONet.
Recently, a novel operator regression architecture, named DeepONet, was proposed
by Lu et al. (2019, 2021) that designs a generic neural network based on the approximation theorem
from Chen and Chen (1995). The architecture consists of two neural networks: a branch net applied
on the input functions and a trunk net applied on the querying locations. Lanthaler et al. (2021)
developed an error estimate on the DeepONet. The standard DeepONet structure is a linear approx-
imation of the target operator, where the trunk net and branch net learn the coefficients and basis.
On the other hand, the neural operator is a non-linear approximation, which makes it constructively
more expressive. We include a detailed discussion of DeepONet in Section 3.2 as well as a
numerical comparison to DeepONet in Section 6.2.
Physics Informed Neural Networks (PINNs).
A different approach is to directly parameterize the solution $u$ as a neural network $u : \bar{D} \times \Theta \to \mathbb{R}$ (E and Yu, 2018; Raissi et al., 2019; Bar and Sochen, 2019; Smith et al., 2020; Pan and Duraisamy, 2020). This approach is designed to model one specific instance of the PDE, not the solution operator. It is mesh-independent, but for any given new parameter coefficient function $a \in \mathcal{A}$, one would need to train a new neural network $u_a$
which is computationally costly and time consuming. Such an approach closely resembles classical
methods such as finite elements, replacing the linear span of a finite set of local basis functions with
the space of neural networks.
ML-based Hybrid Solvers. Similarly, another line of work proposes to enhance existing numerical solvers with neural networks by building hybrid models (Pathak et al., 2020; Um et al., 2020a; Greenfeld et al., 2019). These approaches suffer from the same computational issue as classical
methods: one needs to solve an optimization problem for every new parameter similarly to the
PINNs setting. Furthermore, the approaches are limited to a setting in which the underlying PDE is
known. Purely data-driven learning of a map between spaces of functions is not possible.
Reduced basis methods.
Our methodology most closely resembles the classical reduced basis
method (RBM) (DeVore, 2014) or the method of Cohen and DeVore (2015). The method intro-
duced here, along with the contemporaneous work introduced in the papers (Bhattacharya et al.,
2020; Nelsen and Stuart, 2020; Opschoor et al., 2020; Schwab and Zech, 2019; O’Leary-Roseberry
et al., 2020; Lu et al., 2019), is, to the best of our knowledge, amongst the first practical super-
vised learning methods designed to learn maps between infinite-dimensional spaces. It remedies
the mesh-dependent nature of the approach in the papers (Guo et al., 2016; Zhu and Zabaras, 2018;
Adler and Oktem, 2017; Bhatnagar et al., 2019) by producing a single set of network parameters
that can be used with different discretizations. Furthermore, it has the ability to transfer solutions
between meshes. Moreover, it need only be trained once on the equation set $\{a_j, u_j\}_{j=1}^N$. Then, obtaining a solution for a new $a \sim \mu$ only requires a forward pass of the network, alleviating the
major computational issues incurred in (E and Yu, 2018; Raissi et al., 2019; Herrmann et al., 2020;
Bar and Sochen, 2019) where a different network would need to be trained for each input parame-
ter. Lastly, our method requires no knowledge of the underlying PDE: it is purely data-driven and
therefore non-intrusive. Indeed the true map can be treated as a black-box, perhaps to be learned
from experimental data or from the output of a costly computer simulation, not necessarily from a
PDE.
Continuous neural networks.
Using continuity as a tool to design and interpret neural networks
is gaining currency in the machine learning community, and the formulation of ResNet as a continu-
ous time process over the depth parameter is a powerful example of this (Haber and Ruthotto, 2017;
E, 2017). The concept of defining neural networks in infinite-dimensional spaces is a central prob-
lem that long been studied (Williams, 1996; Neal, 1996; Roux and Bengio, 2007; Globerson and
Livni, 2016; Guss, 2016). The general idea is to take the infinite-width limit which yields a non-
parametric method and has connections to Gaussian Process Regression (Neal, 1996; Matthews
et al., 2018; Garriga-Alonso et al., 2018), leading to the introduction of deep Gaussian processes
(Damianou and Lawrence, 2013; Dunlop et al., 2018). Thus far, such methods have not yielded
efficient numerical algorithms that can parallel the success of convolutional or recurrent neural net-
works for the problem of approximating mappings between finite dimensional spaces. Despite the
superficial similarity with our proposed work, this body of work differs substantially from what
we are proposing: in our work we are motivated by the continuous dependence of the data, or the
functions it samples, in a spatial variable; in the work outlined in this paragraph continuity is used
to approximate the network architecture when the depth or width approaches infinity.
Nyström approximation, GNNs, and graph neural operators (GNOs). The graph neural operator (Section 4.1) has an underlying Nyström approximation formulation (Nyström, 1930) which
links different grids to a single set of network parameters. This perspective relates our continuum
approach to Graph Neural Networks (GNNs). GNNs are a recently developed class of neural net-
works that apply to graph-structured data; they have been used in a variety of applications. Graph
networks incorporate an array of techniques from neural network design such as graph convolu-
tion, edge convolution, attention, and graph pooling (Kipf and Welling, 2016; Hamilton et al., 2017;
Gilmer et al., 2017; Veličković et al., 2017; Murphy et al., 2018). GNNs have also been applied to
the modeling of physical phenomena such as molecules (Chen et al., 2019) and rigid body systems
(Battaglia et al., 2018) since these problems exhibit a natural graph interpretation: the particles are
the nodes and the interactions are the edges. The work (Alet et al., 2019) performs an initial study
that employs graph networks on the problem of learning solutions to Poisson’s equation, among
other physical applications. They propose an encoder-decoder setting, constructing graphs in the
latent space, and utilizing message passing between the encoder and decoder. However, their model
uses a nearest neighbor structure that is unable to capture non-local dependencies as the mesh size
is increased. In contrast, we directly construct a graph in which the nodes are located on the spatial
domain of the output function. Through message passing, we are then able to directly learn the ker-
nel of the network which approximates the PDE solution. When querying a new location, we simply
add a new node to our spatial graph and connect it to the existing nodes, avoiding interpolation error
by leveraging the power of the Nyström extension for integral operators.
Low-rank kernel decomposition and low-rank neural operators (LNOs).
Low-rank decom-
position is a popular method used in kernel methods and Gaussian processes (Kulis et al., 2006; Bach,
2013; Lan et al., 2017; Gardner et al., 2018). We present the low-rank neural operator in Section 4.2
where we structure the kernel network as a product of two factor networks inspired by Fredholm
theory. The low-rank method, while simple, is very efficient and easy to train especially when the
target operator is close to linear. Khoo and Ying (2019) similarly propose to use neural networks
with low-rank structure to approximate the inverse of differential operators. The framework of two
factor networks is also similar to the trunk and branch network used in DeepONet (Lu et al., 2019).
But in our work, the factor networks are defined on the physical domain and non-local informa-
tion is accumulated through integration, making our low-rank operator resolution invariant. On the
other hand, the number of parameters of the networks in DeepONet(s) depends on the resolution of
the data; see Section 3.2 for further discussion.
Multipole, multi-resolution methods, and multipole graph neural operators (MGNOs).
To
efficiently capture long-range interaction, multi-scale methods such as the classical fast multipole
methods (FMM) have been developed (Greengard and Rokhlin, 1997). Based on the assumption
that long-range interactions decay quickly, FMM decomposes the kernel matrix into different ranges
and hierarchically imposes low-rank structures to the long-range components (hierarchical matrices)
(Börm et al., 2003). This decomposition can be viewed as a specific form of the multi-resolution
matrix factorization of the kernel (Kondor et al., 2014; Börm et al., 2003). For example, the works of
Fan et al. (2019c,b); He and Xu (2019) propose a similar multipole expansion for solving parametric
PDEs on structured grids. However, the classical FMM requires nested grids as well as the explicit
form of the PDEs. In Section 4.3, we propose the multipole graph neural operator (MGNO) by
generalizing this idea to arbitrary graphs in the data-driven setting, so that the corresponding graph
neural networks can learn discretization-invariant solution operators which are fast and can work on
complex geometries.
Fourier transform, spectral methods, and Fourier neural operators (FNOs).
The Fourier
transform is frequently used in spectral methods for solving differential equations since differen-
tiation is equivalent to multiplication in the Fourier domain. Fourier transforms have also played an
important role in the development of deep learning. In theory, they appear in the proof of the uni-
versal approximation theorem (Hornik et al., 1989) and, empirically, they have been used to speed
up convolutional neural networks (Mathieu et al., 2013). Neural network architectures involving the
Fourier transform or the use of sinusoidal activation functions have also been proposed and studied
(Bengio et al., 2007; Mingo et al., 2004; Sitzmann et al., 2020). Recently, some spectral methods for
PDEs have been extended to neural networks (Fan et al., 2019a,c; Kashinath et al., 2020). In Section
4.4, we build on these works by proposing the Fourier neural operator architecture defined directly
in Fourier space with quasi-linear time complexity and state-of-the-art approximation capabilities.
1.4 Paper Outline
In Section 2, we define the general operator learning problem, which is not limited to PDEs. In
Section 3, we define the general framework in terms of kernel integral operators and relate our
proposed approach to existing methods in the literature. In Section 4, we define four different ways
of efficiently computing with neural operators: graph-based operators (GNO), low-rank operators
(LNO), multipole graph-based operators (MGNO), and Fourier operators (FNO). In Section 5 we
define four partial differential equations which serve as a testbed of various problems which we
study numerically. In Section 6, we show the numerical results for our four approximation methods
on the four test problems, and we discuss and compare the properties of each method. In Section 7
we conclude the work, discuss potential limitations and outline directions for future work.
2. Learning Operators
2.1 Problem Setting
Our goal is to learn a mapping between two infinite dimensional spaces by using a finite collection
of observations of input-output pairs from this mapping. We make this problem concrete in the
following setting. Let $\mathcal{A}$ and $\mathcal{U}$ be Banach spaces of functions defined on bounded domains $D \subset \mathbb{R}^d$, $D' \subset \mathbb{R}^{d'}$ respectively and $\mathcal{G}^\dagger : \mathcal{A} \to \mathcal{U}$ be a (typically) non-linear map. Suppose we have observations $\{a_j, u_j\}_{j=1}^N$ where $a_j \sim \mu$ are i.i.d. samples drawn from some probability measure $\mu$ supported on $\mathcal{A}$ and $u_j = \mathcal{G}^\dagger(a_j)$ is possibly corrupted with noise. We aim to build an approximation of $\mathcal{G}^\dagger$ by constructing a parametric map
$$\mathcal{G}_\theta : \mathcal{A} \to \mathcal{U}, \qquad \theta \in \mathbb{R}^p \tag{2}$$
with parameters from the finite-dimensional space $\mathbb{R}^p$ and then choosing $\theta^\dagger \in \mathbb{R}^p$ so that $\mathcal{G}_{\theta^\dagger} \approx \mathcal{G}^\dagger$.
We will be interested in controlling the error of the approximation on average with respect to $\mu$. In particular, assuming $\mathcal{G}^\dagger$ is $\mu$-measurable, we will aim to control the $L^2_\mu(\mathcal{A};\mathcal{U})$ Bochner norm of the approximation
$$\|\mathcal{G}^\dagger - \mathcal{G}_\theta\|^2_{L^2_\mu(\mathcal{A};\mathcal{U})} = \mathbb{E}_{a \sim \mu}\|\mathcal{G}^\dagger(a) - \mathcal{G}_\theta(a)\|^2_{\mathcal{U}} = \int_{\mathcal{A}} \|\mathcal{G}^\dagger(a) - \mathcal{G}_\theta(a)\|^2_{\mathcal{U}} \, d\mu(a). \tag{3}$$
This is a natural framework for learning in infinite-dimensions as one could seek to solve the associated empirical-risk minimization problem
$$\min_{\theta \in \mathbb{R}^p} \mathbb{E}_{a \sim \mu}\|\mathcal{G}^\dagger(a) - \mathcal{G}_\theta(a)\|^2_{\mathcal{U}} \approx \min_{\theta \in \mathbb{R}^p} \frac{1}{N}\sum_{j=1}^N \|u_j - \mathcal{G}_\theta(a_j)\|^2_{\mathcal{U}} \tag{4}$$
which directly parallels the classical finite-dimensional setting (Vapnik, 1998).
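For illustration, a minimal sketch (with illustrative names; `model` stands for any parametric map $\mathcal{G}_\theta$ acting on discretized inputs) of how the empirical risk in (4) might be estimated in practice, approximating the squared norm on $\mathcal{U}$ by an average over grid points:

```python
import torch

def empirical_risk(model, a_batch, u_batch):
    """Sketch of the empirical risk in (4).

    a_batch: (N, n, d_a) and u_batch: (N, n, d_u) hold point-wise evaluations of the
    input/output functions on an n-point grid; the U-norm is approximated by a grid average.
    """
    u_pred = model(a_batch)                              # G_theta(a_j) evaluated on the same grid
    sq_err = ((u_pred - u_batch) ** 2).mean(dim=(1, 2))  # per-sample squared error
    return sq_err.mean()                                 # average over the N training samples

# One could then minimize this with any gradient-based optimizer, e.g.
#   opt = torch.optim.Adam(model.parameters(), lr=1e-3)
#   loss = empirical_risk(model, a_batch, u_batch); loss.backward(); opt.step()
```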
2.2 Discretization
Since our data $a_j$ and $u_j$ are, in general, functions, to work with them numerically we assume access only to their point-wise evaluations. To illustrate this, we will continue with the example of the preceding paragraph. For simplicity, assume $D = D'$ and suppose that the input and output functions are real-valued. Let $D_j = \{x_j^{(1)}, \ldots, x_j^{(n_j)}\} \subset D$ be an $n_j$-point discretization of the domain $D$ and assume we have observations $a_j|_{D_j}, u_j|_{D_j} \in \mathbb{R}^{n_j}$, for a finite collection of input-output pairs indexed by $j$. In the next section, we propose a kernel inspired graph neural network architecture which, while trained on the discretized data, can produce the solution $u(x)$ for any $x \in D$ given an input $a \sim \mu$. That is to say that our approach is independent of the discretization $D_j$. We refer to this as being a function space architecture, a mesh-invariant architecture or a discretization-invariant architecture; this claim is verified numerically by showing invariance of the error as $n_j \to \infty$. Such a property is highly desirable as it allows a transfer of solutions between different grid geometries and discretization sizes with a single architecture which has a fixed number of parameters.
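For concreteness, a small sketch (with placeholder data) of how such observations might be stored when every sample carries its own discretization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each sample j has its own n_j-point discretization D_j of, say, D = [0, 1],
# together with point-wise observations a_j|_{D_j} and u_j|_{D_j}.
dataset = []
for n_j in (64, 128, 256):                        # different resolutions across samples
    x_j = np.sort(rng.uniform(0.0, 1.0, n_j))     # discretization points
    a_j = np.sin(2 * np.pi * x_j)                 # placeholder input observations
    u_j = np.cos(2 * np.pi * x_j)                 # placeholder output observations
    dataset.append({"points": x_j, "a": a_j, "u": u_j})

# A discretization-invariant architecture consumes all of these with one fixed set of
# parameters, and can be queried at new points x not contained in any D_j.
```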
We note that, while the application of our methodology is based on having point-wise evalua-
tions of the function, it is not limited by it. One may, for example, represent a function numerically
as a finite set of truncated basis coefficients. Invariance of the representation would then be with
respect to the size of this set. Our methodology can, in principle, be modified to accommodate
this scenario through a suitably chosen architecture. We do not pursue this direction in the current
work.
3. Proposed Architecture
3.1 Neural Operators
In this section, we outline the neural operator framework. We assume that the input functions $a \in \mathcal{A}$ are $\mathbb{R}^{d_a}$-valued and defined on the bounded domain $D \subset \mathbb{R}^d$ while the output functions $u \in \mathcal{U}$ are $\mathbb{R}^{d_u}$-valued and defined on the bounded domain $D' \subset \mathbb{R}^{d'}$. The proposed architecture $\mathcal{G}_\theta : \mathcal{A} \to \mathcal{U}$ has the following overall structure:
1. Lifting: Using a pointwise function $\mathbb{R}^{d_a} \to \mathbb{R}^{d_{v_0}}$, map the input $\{a : D \to \mathbb{R}^{d_a}\} \mapsto \{v_0 : D \to \mathbb{R}^{d_{v_0}}\}$ to its first hidden representation. Usually, we choose $d_{v_0} > d_a$ and hence this is a lifting operation performed by a fully local operator.
2. Iterative kernel integration: For $t = 0, \ldots, T-1$, map each hidden representation to the next $\{v_t : D_t \to \mathbb{R}^{d_{v_t}}\} \mapsto \{v_{t+1} : D_{t+1} \to \mathbb{R}^{d_{v_{t+1}}}\}$ via the action of the sum of a local linear operator, a non-local integral kernel operator, and a bias function, composing the sum with a fixed, pointwise nonlinearity. Here we set $D_0 = D$ and $D_T = D'$ and impose that $D_t \subset \mathbb{R}^{d_t}$ is a bounded domain.
3. Projection: Using a pointwise function $\mathbb{R}^{d_{v_T}} \to \mathbb{R}^{d_u}$, map the last hidden representation $\{v_T : D' \to \mathbb{R}^{d_{v_T}}\} \mapsto \{u : D' \to \mathbb{R}^{d_u}\}$ to the output function. Analogously to the first step, we usually pick $d_{v_T} > d_u$ and hence this is a projection step performed by a fully local operator.
The outlined structure mimics that of a finite dimensional neural network where hidden repre-
sentations are successively mapped to produce the final output. In particular, we have
$$\mathcal{G}_\theta := \mathcal{Q} \circ \sigma_T(W_{T-1} + \mathcal{K}_{T-1} + b_{T-1}) \circ \cdots \circ \sigma_1(W_0 + \mathcal{K}_0 + b_0) \circ \mathcal{P} \tag{5}$$
where $\mathcal{P} : \mathbb{R}^{d_a} \to \mathbb{R}^{d_{v_0}}$, $\mathcal{Q} : \mathbb{R}^{d_{v_T}} \to \mathbb{R}^{d_u}$ are the local lifting and projection mappings respectively, $W_t \in \mathbb{R}^{d_{v_{t+1}} \times d_{v_t}}$ are local linear operators (matrices), $\mathcal{K}_t : \{v_t : D_t \to \mathbb{R}^{d_{v_t}}\} \to \{v_{t+1} : D_{t+1} \to \mathbb{R}^{d_{v_{t+1}}}\}$ are integral kernel operators, $b_t : D_{t+1} \to \mathbb{R}^{d_{v_{t+1}}}$ are bias functions, and $\sigma_t$ are fixed activation functions acting locally as maps $\mathbb{R}^{d_{v_{t+1}}} \to \mathbb{R}^{d_{v_{t+1}}}$ in each layer. The output dimensions $d_{v_0}, \ldots, d_{v_T}$ as well as the input dimensions $d_1, \ldots, d_{T-1}$ and domains of definition $D_1, \ldots, D_{T-1}$
are hyperparameters of the architecture. By local maps, we mean that the action is pointwise, in
particular, for the lifting and projection maps we have $(\mathcal{P}(a))(x) = \mathcal{P}(a(x))$ for any $x \in D$ and $(\mathcal{Q}(v_T))(x) = \mathcal{Q}(v_T(x))$ for any $x \in D'$, and similarly, for the activation, $(\sigma(v_{t+1}))(x) = \sigma(v_{t+1}(x))$ for any $x \in D_{t+1}$. The maps $\mathcal{P}$, $\mathcal{Q}$, and $\sigma_t$ can thus be thought of as defining Nemitskiy
operators (Dudley and Norvaisa, 2011, Chapters 6,7) when each of their components are assumed to
be Borel measurable. This interpretation allows us to define the general neural operator architecture
when pointwise evaluation is not well-defined in the spaces $\mathcal{A}$ or $\mathcal{U}$, e.g. when they are Lebesgue, Sobolev, or Besov spaces.
The crucial difference between the proposed architecture (5) and a standard feed-forward neural
network is that all operations are directly defined in function space and therefore do not depend on
any discretization of the data. Intuitively, the lifting step locally maps the data to a space where
the non-local part of $\mathcal{G}^\dagger$ is easier to capture. This is then learned by successively approximating
using integral kernel operators composed with a local nonlinearity. Each integral kernel operator is
the function space analog of the weight matrix in a standard feed-forward network since they are
infinite-dimensional linear operators mapping one function space to another. We turn the biases,
which are normally vectors, to functions and, using intuition from the ResNet architecture (He et al., 2016),
we further add a local linear operator acting on the output of the previous layer before applying
the nonlinearity. The final projection step simply gets us back to the space of our output function.
We concatenate in $\theta \in \mathbb{R}^p$ the parameters of $\mathcal{P}$, $\mathcal{Q}$, $\{b_t\}$ which are usually themselves shallow neural networks, the parameters of the kernels representing $\{\mathcal{K}_t\}$ which are again usually shallow neural networks, and the matrices $\{W_t\}$. We note, however, that our framework is general and other
parameterizations such as polynomials may also be employed.
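To illustrate how the composition (5) might be assembled in code, the following PyTorch-style skeleton is a minimal sketch, not the implementation used in our experiments; `KernelIntegral` is an assumed module computing $v \mapsto \int \kappa(x,y) v(y)\, d\nu(y)$ on the discretization (concrete choices are discussed in Section 4), and the bias functions are simplified to constants:

```python
import torch
import torch.nn as nn

class NeuralOperator(nn.Module):
    """Sketch of (5): Q ∘ σ(W_{T-1} + K_{T-1} + b_{T-1}) ∘ ... ∘ σ(W_0 + K_0 + b_0) ∘ P."""
    def __init__(self, d_a, d_u, d_v, T, KernelIntegral):
        super().__init__()
        self.P = nn.Linear(d_a, d_v)   # pointwise lifting
        self.Q = nn.Linear(d_v, d_u)   # pointwise projection
        self.W = nn.ModuleList([nn.Linear(d_v, d_v) for _ in range(T)])       # local linear operators
        self.K = nn.ModuleList([KernelIntegral(d_v, d_v) for _ in range(T)])  # integral kernel operators
        # constant bias "functions" b_t (a simplification; b_t may itself be a network of x)
        self.b = nn.ParameterList([nn.Parameter(torch.zeros(d_v)) for _ in range(T)])

    def forward(self, a, x):
        # a: (n_points, d_a) point-wise input values; x: (n_points, d) grid coordinates
        v = self.P(a)
        for W_t, K_t, b_t in zip(self.W, self.K, self.b):
            v = torch.nn.functional.gelu(W_t(v) + K_t(v, x) + b_t)  # any fixed pointwise nonlinearity
        return self.Q(v)
```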
Integral Kernel Operators
We define three versions of the integral kernel operator $\mathcal{K}_t$ used in (5). For the first, let $\kappa^{(t)} \in C(D_{t+1} \times D_t; \mathbb{R}^{d_{v_{t+1}} \times d_{v_t}})$ and let $\nu_t$ be a Borel measure on $D_t$. Then we define $\mathcal{K}_t$ by
$$(\mathcal{K}_t(v_t))(x) = \int_{D_t} \kappa^{(t)}(x,y)\, v_t(y)\, d\nu_t(y) \qquad \forall x \in D_{t+1}. \tag{6}$$
Normally, we take $\nu_t$ to simply be the Lebesgue measure on $\mathbb{R}^{d_t}$ but, as discussed in Section 4, other choices can be used to speed up computation or aid the learning process by building in a priori information.
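A minimal sketch of how (6) might be approximated on an $n$-point discretization of $D_t$, assuming $\nu_t$ is normalized so that $\nu_t(D_t) = 1$ and the kernel $\kappa^{(t)}$ is a small neural network acting on $(x,y)$ pairs (names and shapes are illustrative; batching is omitted for clarity):

```python
import torch
import torch.nn as nn

class MonteCarloKernelIntegral(nn.Module):
    """Equal-weight quadrature approximation of (K_t v)(x) = ∫ κ(x,y) v(y) dν(y)."""
    def __init__(self, d_in, d_out, d=2, width=64):
        super().__init__()
        self.d_in, self.d_out = d_in, d_out
        # κ: (x, y) ∈ R^d × R^d → R^{d_out × d_in}, realized here by a small MLP
        self.kappa = nn.Sequential(
            nn.Linear(2 * d, width), nn.GELU(), nn.Linear(width, d_out * d_in))

    def forward(self, v, x):
        # v: (n, d_in) values of v at the grid points; x: (n, d) grid coordinates
        n = x.shape[0]
        pairs = torch.cat([x.repeat_interleave(n, 0), x.repeat(n, 1)], dim=-1)  # all (x_j, y_l) pairs
        K = self.kappa(pairs).view(n, n, self.d_out, self.d_in)                 # kernel blocks κ(x_j, y_l)
        return torch.einsum("jlpq,lq->jp", K, v) / n                            # (1/n) Σ_l κ(x_j, y_l) v(y_l)
```

Modulo the missing batch dimension, such a module could stand in for the `KernelIntegral` placeholder of the earlier composition sketch; the cost of this dense evaluation is the subject of Section 4.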
For the second, let $\kappa^{(t)} \in C(D_{t+1} \times D_t \times \mathbb{R}^{d_a} \times \mathbb{R}^{d_a}; \mathbb{R}^{d_{v_{t+1}} \times d_{v_t}})$. Then we define $\mathcal{K}_t$ by
$$(\mathcal{K}_t(v_t))(x) = \int_{D_t} \kappa^{(t)}\big(x, y, a(\Pi_{D_{t+1}}(x)), a(\Pi_{D_t}(y))\big)\, v_t(y)\, d\nu_t(y) \qquad \forall x \in D_{t+1}, \tag{7}$$
where $\Pi_{D_t} : D_t \to D$ are fixed mappings. We have found numerically that, for certain PDE problems, the form (7) outperforms (6) due to the strong dependence of the solution $u$ on the parameters $a$. Indeed, if we think of (5) as a discrete time dynamical system, then the input $a \in \mathcal{A}$ only enters through the initial condition, hence its influence diminishes with more layers. By directly building in $a$-dependence into the kernel, we ensure that it influences the entire architecture.
Lastly, let $\kappa^{(t)} \in C(D_{t+1} \times D_t \times \mathbb{R}^{d_{v_t}} \times \mathbb{R}^{d_{v_t}}; \mathbb{R}^{d_{v_{t+1}} \times d_{v_t}})$. Then we define $\mathcal{K}_t$ by
$$(\mathcal{K}_t(v_t))(x) = \int_{D_t} \kappa^{(t)}\big(x, y, v_t(\Pi_t(x)), v_t(y)\big)\, v_t(y)\, d\nu_t(y) \qquad \forall x \in D_{t+1}, \tag{8}$$
where $\Pi_t : D_{t+1} \to D_t$ are fixed mappings. Note that, in contrast to (6) and (7), the integral operator (8) is nonlinear since the kernel can depend on the input function $v_t$. With this definition and a particular choice of kernel $\kappa_t$ and measure $\nu_t$, we show in Section 3.3 that neural operators are a continuous input/output space generalization of the popular transformer architecture (Vaswani et al., 2017).
Single Hidden Layer Construction
Having defined possible choices for the integral kernel oper-
ator, we are now in a position to explicitly write down a full layer of the architecture defined by (5).
For simplicity, we choose the integral kernel operator given by (6), but note that the other definitions
(7), (8) work analogously. We then have that a single hidden layer update is given by
$$v_{t+1}(x) = \sigma_{t+1}\left(W_t v_t(\Pi_t(x)) + \int_{D_t} \kappa^{(t)}(x,y)\, v_t(y)\, d\nu_t(y) + b_t(x)\right) \qquad \forall x \in D_{t+1}, \tag{9}$$
where $\Pi_t : D_{t+1} \to D_t$ are fixed mappings. We remark that, since we often consider functions on the same domain, we usually take $\Pi_t$ to be the identity.
We will now give an example of a full single hidden layer architecture, i.e. when $T = 2$. We choose $D_1 = D$, take $\sigma_2$ as the identity, and denote $\sigma_1$ by $\sigma$, assuming it is any activation function. Furthermore, for simplicity, we set $W_1 = 0$, $b_1 = 0$, and assume that $\nu_0 = \nu_1$ is the Lebesgue measure on $\mathbb{R}^d$. We then have that (5) becomes
$$(\mathcal{G}_\theta(a))(x) = \mathcal{Q}\left(\int_D \kappa^{(1)}(x,y)\, \sigma\left(W_0 \mathcal{P}(a(y)) + \int_D \kappa^{(0)}(y,z)\, \mathcal{P}(a(z))\, dz + b_0(y)\right) dy\right) \tag{10}$$
for any $x \in D'$. In this example, $\mathcal{P} \in C(\mathbb{R}^{d_a}; \mathbb{R}^{d_{v_0}})$, $\kappa^{(0)} \in C(D \times D; \mathbb{R}^{d_{v_1} \times d_{v_0}})$, $b_0 \in C(D; \mathbb{R}^{d_{v_1}})$, $W_0 \in \mathbb{R}^{d_{v_1} \times d_{v_0}}$, $\kappa^{(1)} \in C(D' \times D; \mathbb{R}^{d_{v_2} \times d_{v_1}})$, and $\mathcal{Q} \in C(\mathbb{R}^{d_{v_2}}; \mathbb{R}^{d_u})$. One can then parametrize the continuous functions $\mathcal{P}$, $\mathcal{Q}$, $\kappa^{(0)}$, $\kappa^{(1)}$, $b_0$ by standard feed-forward neural networks (or by any other means) and the matrix $W_0$ simply by its entries. The parameter vector $\theta \in \mathbb{R}^p$ then becomes the concatenation of the parameters of $\mathcal{P}$, $\mathcal{Q}$, $\kappa^{(0)}$, $\kappa^{(1)}$, $b_0$ along with the entries of $W_0$. One can then optimize these parameters by minimizing with respect to $\theta$ using standard gradient based min-
imization techniques. To implement this minimization, the functions entering the loss need to be
discretized; but the learned parameters may then be used with other discretizations. In Section 4,
we discuss various choices for parametrizing the kernels and picking the integration measure and
how those choices affect the computational complexity of the architecture.
Preprocessing
It is often beneficial to manually include features into the input functions $a$ to help facilitate the learning process. For example, instead of considering the $\mathbb{R}^{d_a}$-valued vector field $a$ as input, we use the $\mathbb{R}^{d+d_a}$-valued vector field $(x, a(x))$. By including the identity element, information about the geometry of the spatial domain $D$ is directly incorporated into the architecture. This allows the neural networks direct access to information that is already known in the problem and therefore eases learning. We use this idea in all of our numerical experiments in Section 6. Similarly, when dealing with smoothing operators, it may be beneficial to include a smoothed version of the inputs $a$ using, for example, Gaussian convolution. Derivative information may also be of interest and thus, as input, one may consider, for example, the $\mathbb{R}^{d+2d_a+dd_a}$-valued vector field $(x, a(x), a_\epsilon(x), \nabla_x a_\epsilon(x))$, where $a_\epsilon$ denotes the smoothed version of $a$. Many other possibilities may be considered on a problem-specific basis.
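A small sketch of this preprocessing step (illustrative shapes only): the grid coordinates are appended channel-wise to the input values before the lifting $\mathcal{P}$ is applied.

```python
import torch

def append_coordinates(a, x):
    """Form the (x, a(x)) features described above.

    a: (batch, n_points, d_a) point-wise values of the input function
    x: (batch, n_points, d)   coordinates of the discretization points
    returns: (batch, n_points, d + d_a)
    """
    return torch.cat([x, a], dim=-1)

# Smoothed or differentiated versions of a, computed on the grid, can be concatenated
# in exactly the same way to form richer input features.
```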
3.2 DeepONets are Neural Operators
We will now draw a parallel between the recently proposed DeepONet architecture in Lu et al.
(2019) and our neural operator framework. In fact, we will show that a particular variant of functions
from the DeepONets class is a special case of a single hidden layer neural operator construction. To
that end, we work with (10) where we choose $W_0 = 0$ and denote $b_0$ by $b$. For simplicity, we will consider only real-valued functions, i.e. $d_a = d_u = 1$, and set $d_{v_0} = d_{v_1} = d_{v_2} = n \in \mathbb{N}$. Define $\mathcal{P} : \mathbb{R} \to \mathbb{R}^n$ by $\mathcal{P}(x) = (x, \ldots, x)$ and $\mathcal{Q} : \mathbb{R}^n \to \mathbb{R}$ by $\mathcal{Q}(x) = x_1 + \cdots + x_n$. Furthermore, let $\kappa^{(1)} : D' \times D \to \mathbb{R}^{n \times n}$ be given as $\kappa^{(1)}(x,y) = \mathrm{diag}\big(\kappa^{(1)}_1(x,y), \ldots, \kappa^{(1)}_n(x,y)\big)$ for some $\kappa^{(1)}_1, \ldots, \kappa^{(1)}_n : D' \times D \to \mathbb{R}$. Similarly, let $\kappa^{(0)} : D \times D \to \mathbb{R}^{n \times n}$ be given as $\kappa^{(0)}(x,y) = \mathrm{diag}\big(\kappa^{(0)}_1(x,y), \ldots, \kappa^{(0)}_n(x,y)\big)$ for some $\kappa^{(0)}_1, \ldots, \kappa^{(0)}_n : D \times D \to \mathbb{R}$. Then (10) becomes
$$(\mathcal{G}_\theta(a))(x) = \sum_{j=1}^n \int_D \kappa^{(1)}_j(x,y)\, \sigma\left(\int_D \kappa^{(0)}_j(y,z)\, a(z)\, dz + b_j(y)\right) dy,$$
where $b(y) = (b_1(y), \ldots, b_n(y))$ for some $b_1, \ldots, b_n : D \to \mathbb{R}$. Choose each $\kappa^{(1)}_j(x,y) = w_j(y)\varphi_j(x)$ for some $w_1, \ldots, w_n : D \to \mathbb{R}$ and $\varphi_1, \ldots, \varphi_n : D' \to \mathbb{R}$; then we obtain
$$(\mathcal{G}_\theta(a))(x) = \sum_{j=1}^n G_j(a)\, \varphi_j(x) \tag{11}$$
where $G_1, \ldots, G_n : \mathcal{A} \to \mathbb{R}$ are functionals defined as
$$G_j(a) = \int_D w_j(y)\, \sigma\left(\int_D \kappa^{(0)}_j(y,z)\, a(z)\, dz + b_j(y)\right) dy. \tag{12}$$
The set of maps $\varphi_1, \ldots, \varphi_n$ form the trunk net while the functionals $G_1, \ldots, G_n$ form the branch net of a DeepONet. The only difference between DeepONet(s) and (11) is the parametrization used for the functionals $G_1, \ldots, G_n$. Following Chen and Chen (1995), DeepONet(s) define the functionals $G_j$ as maps between finite dimensional spaces. Indeed, they are viewed as $G_j : \mathbb{R}^q \to \mathbb{R}$ and defined to map pointwise evaluations $(a(x_1), \ldots, a(x_q))$ of $a$ for some set of points $x_1, \ldots, x_q \in D$. We note that, in practice, this set of evaluation points is not known a priori and could potentially be very large. Indeed, we show that (12) can approximate the functionals defined by DeepONet(s) arbitrarily well, therefore making DeepONet(s) a special case of neural operators. Furthermore, (12) is consistent in function space as the number of parameters used to define each $G_j$ is independent of any discretization that may be used for $a$, while this is not true in the DeepONet case as the number of parameters grows as we refine the discretization of $a$. We demonstrate numerically in Section 6 that the error incurred by DeepONet(s) grows with the discretization of $a$ while it remains constant for neural operators.
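For concreteness, a sketch of the linear parametrization (11) with generic branch and trunk networks (a schematic of the DeepONet form with assumed layer sizes, not the reference implementation of Lu et al. (2019)):

```python
import torch
import torch.nn as nn

class DeepONetStyle(nn.Module):
    """Linear approximation (11): G_theta(a)(x) = sum_j G_j(a) phi_j(x)."""
    def __init__(self, q, n_basis, width=128):
        super().__init__()
        # branch net: functionals G_1,...,G_n acting on q point-wise evaluations of a
        self.branch = nn.Sequential(nn.Linear(q, width), nn.ReLU(), nn.Linear(width, n_basis))
        # trunk net: basis functions phi_1,...,phi_n evaluated at the query locations x
        self.trunk = nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, n_basis))

    def forward(self, a_vals, x_query):
        # a_vals: (batch, q) evaluations (a(x_1),...,a(x_q)); x_query: (batch, n_x, 1)
        coeffs = self.branch(a_vals)       # (batch, n_basis)
        basis = self.trunk(x_query)        # (batch, n_x, n_basis)
        return torch.einsum("bn,bxn->bx", coeffs, basis)

# Note that the branch input dimension q is tied to the discretization of a; the neural
# operator functionals (12) instead accumulate information through integration, so their
# parameter count is independent of q.
```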
We point out that parametrizations of the form (11) fall within the class of linear approximation methods since the nonlinear space $\mathcal{G}^\dagger(\mathcal{A})$ is approximated by the linear space $\mathrm{span}\{\varphi_1, \ldots, \varphi_n\}$ (DeVore, 1998). The quality of the best possible linear approximation to a nonlinear space is given by the Kolmogorov $n$-width, where $n$ is the dimension of the linear space used in the approximation (Pinkus, 1985). The rate of decay of the $n$-width as a function of $n$ quantifies how well the linear space approximates the nonlinear one. It is well known that for some problems, such as the flow maps of advection-dominated PDEs, the $n$-widths decay very slowly; hence a very large $n$ is needed for a
good approximation for such problems (Cohen and DeVore, 2015). This can be limiting in practice
as more parameters are needed in order to describe more basis functions $\varphi_j$ and therefore more data
is needed to fit these parameters.
On the other hand, we point out that parametrizations of the form (5), and the particular case
(10), constitute (in general) a form of nonlinear approximation. The benefits of nonlinear approx-
imation are well understood in the setting of function approximation, see e.g. (DeVore, 1998),
however the theory for the operator setting is still in its infancy (Bonito et al., 2020; Cohen et al.,
2020). We observe numerically in Section 6 that nonlinear parametrizations such as (10) outperform
linear ones such as DeepONets or the low-rank method introduced in Section 4.2.
3.3 Transformers are Neural Operators
We will now show that our neural operator framework can be viewed as a continuum generalization
to the popular transformer architecture (Vaswani et al., 2017) which has been extremely successful
in natural language processing tasks (Devlin et al., 2018; Brown et al., 2020) and, more recently,
is becoming a popular choice in computer vision tasks (Dosovitskiy et al., 2020). The parallel
stems from the fact that we can view sequences of arbitrary length as arbitrary discretizations of
functions. Indeed, in the context of natural language processing, we may think of a sentence as
a “word”-valued function on, for example, the domain $[0,1]$. Assuming our function is linked to a sentence with a fixed semantic meaning, adding or removing words from the sentence simply corresponds to refining or coarsening the discretization of $[0,1]$. We will now make this intuition precise.
We will show that by making a particular choice of the nonlinear integral kernel operator (8)
and discretizing the integral by a Monte-Carlo approximation, a neural operator layer reduces to
a pre-normalized, single-headed attention, transformer block as originally proposed in (Vaswani
et al., 2017). For simplicity, we assume $d_{v_t} = n \in \mathbb{N}$ and that $D_t = D$ for any $t = 0, \ldots, T$, the bias term is zero, and $W = I$ is the identity. Further, to simplify notation, we will drop the layer index $t$ from (9) and, employing (8), obtain
$$u(x) = \sigma\left(v(x) + \int_D \kappa_v(x, y, v(x), v(y))\, v(y)\, dy\right) \qquad \forall x \in D, \tag{13}$$
a single layer of the neural operator, where $v : D \to \mathbb{R}^n$ is the input function to the layer and we denote by $u : D \to \mathbb{R}^n$ the output function. We use the notation $\kappa_v$ to indicate that the kernel depends on the entirety of the function $v$ and not simply its pointwise values. While this is not explicitly done in (8), it is a straightforward generalization. We now pick a specific form for the kernel,
in particular, we assume $\kappa_v : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^{n \times n}$ does not explicitly depend on the spatial variables $(x,y)$ but only on the input pair $(v(x), v(y))$. Furthermore, we let
$$\kappa_v(v(x), v(y)) = g_v(v(x), v(y))\, R,$$
where $R \in \mathbb{R}^{n \times n}$ is a matrix of free parameters, i.e. its entries are concatenated in $\theta$ so they are learned, and $g_v : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is defined as
$$g_v(v(x), v(y)) = \left(\int_D \exp\left(\frac{\langle A v(s),\, B v(y)\rangle}{\sqrt{m}}\right) ds\right)^{-1} \exp\left(\frac{\langle A v(x),\, B v(y)\rangle}{\sqrt{m}}\right).$$
Here $A, B \in \mathbb{R}^{m \times n}$ are again matrices of free parameters, $m \in \mathbb{N}$ is a hyperparameter, and $\langle \cdot, \cdot \rangle$ is the Euclidean inner-product on $\mathbb{R}^m$. Putting this together, we find that (13) becomes
$$u(x) = \sigma\left(v(x) + \int_D \frac{\exp\left(\frac{\langle A v(x),\, B v(y)\rangle}{\sqrt{m}}\right)}{\int_D \exp\left(\frac{\langle A v(s),\, B v(y)\rangle}{\sqrt{m}}\right) ds}\, R\, v(y)\, dy\right) \qquad \forall x \in D. \tag{14}$$
Equation (14) can be thought of as the continuum limit of a transformer block. To see this, we will
discretize to obtain the usual transformer block.
To that end, let $\{x_1, \ldots, x_k\} \subset D$ be a uniformly-sampled, $k$-point discretization of $D$ and denote $v_j = v(x_j) \in \mathbb{R}^n$ and $u_j = u(x_j) \in \mathbb{R}^n$ for $j = 1, \ldots, k$. Approximating the inner integral in (14) by Monte-Carlo, we have
$$\int_D \exp\left(\frac{\langle A v(s),\, B v(y)\rangle}{\sqrt{m}}\right) ds \approx \frac{|D|}{k} \sum_{l=1}^k \exp\left(\frac{\langle A v_l,\, B v(y)\rangle}{\sqrt{m}}\right).$$
Plugging this into (14) and using the same approximation for the outer integral yields
$$u_j = \sigma\left(v_j + \sum_{q=1}^k \frac{\exp\left(\frac{\langle A v_j,\, B v_q\rangle}{\sqrt{m}}\right)}{\sum_{l=1}^k \exp\left(\frac{\langle A v_l,\, B v_q\rangle}{\sqrt{m}}\right)}\, R v_q\right), \qquad j = 1, \ldots, k. \tag{15}$$
Equation (15) can be viewed as a Nyström approximation of (14). Define the vectors $z_q \in \mathbb{R}^k$ by
$$z_q = \frac{1}{\sqrt{m}}\big(\langle A v_1, B v_q\rangle, \ldots, \langle A v_k, B v_q\rangle\big), \qquad q = 1, \ldots, k.$$
Define $S : \mathbb{R}^k \to \Delta_k$, where $\Delta_k$ denotes the $k$-dimensional probability simplex, as the softmax function
$$S(w) = \left(\frac{\exp(w_1)}{\sum_{j=1}^k \exp(w_j)}, \ldots, \frac{\exp(w_k)}{\sum_{j=1}^k \exp(w_j)}\right) \qquad \forall w \in \mathbb{R}^k.$$
Then we may re-write (15) as
$$u_j = \sigma\left(v_j + \sum_{q=1}^k S_j(z_q)\, R v_q\right), \qquad j = 1, \ldots, k.$$
Furthermore, if we re-parametrize $R = R_{\mathrm{out}} R_{\mathrm{val}}$ where $R_{\mathrm{out}} \in \mathbb{R}^{n \times m}$ and $R_{\mathrm{val}} \in \mathbb{R}^{m \times n}$ are matrices of free parameters, we obtain
$$u_j = \sigma\left(v_j + R_{\mathrm{out}} \sum_{q=1}^k S_j(z_q)\, R_{\mathrm{val}} v_q\right), \qquad j = 1, \ldots, k,$$
which is precisely the single-headed attention, transformer block with no layer normalization applied inside the activation function. In the language of transformers, the matrices $A$, $B$, and $R_{\mathrm{val}}$ correspond to the queries, keys, and values functions respectively. We note that tricks such as layer
normalization (Ba et al., 2016) can be adapted in a straightforward manner to the continuum setting and incorporated into (14). Furthermore, multi-headed self-attention can be realized by simply allowing $\kappa_v$ to be a sum over multiple functions of the form $g_v R$, all of which have separate trainable parameters. Including such generalizations yields the continuum limit of the transformer as implemented in practice. We do not pursue this here as our goal is simply to draw a parallel between the two methods.
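A sketch of the discretized update (15) for a single layer (single-headed, no layer normalization; the activation and variable names are illustrative):

```python
import torch

def attention_neural_operator_layer(v, A, B, R, sigma=torch.relu):
    """Monte Carlo discretization (15) of the nonlinear kernel layer (14).

    v: (k, n) values v_1,...,v_k at the discretization points
    A, B: (m, n) query/key matrices;  R: (n, n) value/output matrix
    """
    m = A.shape[0]
    scores = (v @ A.T) @ (v @ B.T).T / m**0.5   # scores[j, q] = <A v_j, B v_q> / sqrt(m)
    S = torch.softmax(scores, dim=0)            # S[j, q] = S_j(z_q): normalized over the l-index
    return sigma(v + S @ (v @ R.T))             # u_j = sigma(v_j + sum_q S_j(z_q) R v_q)
```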
While we have not rigorously experimented with using transformer architectures for the prob-
lems outlined in Section 5, we have found, in initial tests, that they perform worse, are slower, and
are more memory expensive than neural operators using (6)-(8). Their high computational complex-
ity is evident from (14) as we must evaluate a nested integral of $v$ for each $x \in D$. Recently more
efficient transformer architectures have been proposed e.g. (Choromanski et al., 2020) and some
have been applied to computer vision tasks. We leave as interesting future work experimenting and
comparing these architectures to the neural operator both on problems in scientific computing and
more traditional machine learning tasks.
4. Parameterization and Computation
In this section, we discuss various ways of parameterizing the infinite dimensional architecture
(5). The goal is to find an intrinsic infinite dimensional parameterization that achieves small error (say $\epsilon$), and then rely on numerical approximation to ensure that this parameterization delivers an error of the same magnitude (say $2\epsilon$), for all data discretizations fine enough. In this way the number of parameters used to achieve $O(\epsilon)$ error is independent of the data discretization. In many
applications we have in mind the data discretization is something we can control, for example
when generating input/output pairs from solution of partial differential equations via numerical
simulation. The proposed approach allows us to train a neural operator approximation using data
from different discretizations, and to predict with discretizations different from those used in the
data, all by relating everything to the underlying infinite dimensional problem.
We also discuss the computational complexity of the proposed parameterizations and suggest al-
gorithms which yield efficient numerical methods for approximation. Subsections 4.1-4.4 delineate
each of the proposed methods.
To simplify notation, we will only consider a single layer of (5), i.e. (9), and choose the input and output domains to be the same. Furthermore, we will drop the layer index $t$ and write the single layer update as
$$u(x) = \sigma\left(W v(x) + \int_D \kappa(x,y)\, v(y)\, d\nu(y) + b(x)\right) \qquad \forall x \in D, \tag{16}$$
where $D \subset \mathbb{R}^d$ is a bounded domain, $v : D \to \mathbb{R}^n$ is the input function and $u : D \to \mathbb{R}^m$ is the output function. When the domains $D$ of $v$ and $u$ are different, we will usually extend them to be on a larger domain. We will consider $\sigma$ to be fixed and, for the time being, take $d\nu(y) = dy$ to be the Lebesgue measure on $\mathbb{R}^d$. Equation (16) then leaves three objects which can be parameterized: $W$, $\kappa$, and $b$. Since $W$ is linear and acts only locally on $v$, we will always parametrize it by the values of its associated $m \times n$ matrix; hence $W \in \mathbb{R}^{m \times n}$, yielding $mn$ parameters.
The rest of this section will be dedicated to choosing the kernel function $\kappa : D \times D \to \mathbb{R}^{m \times n}$ and the computation of the associated integral kernel operator. For clarity of exposition, we consider only the simplest proposed version of this operator (6) but note that similar ideas may also be applied to (7) and (8). Furthermore, we will drop $\sigma$, $W$, and $b$ from (16) and simply consider the linear update
$$u(x) = \int_D \kappa(x,y)\, v(y)\, d\nu(y) \qquad \forall x \in D. \tag{17}$$
To demonstrate the computational challenges associated with (17), let $\{x_1, \ldots, x_J\} \subset D$ be a uniformly-sampled $J$-point discretization of $D$. Recall that we assumed $d\nu(y) = dy$ and, for simplicity, suppose that $\nu(D) = 1$; then the Monte Carlo approximation of (17) is
$$u(x_j) = \frac{1}{J} \sum_{l=1}^J \kappa(x_j, x_l)\, v(x_l), \qquad j = 1, \ldots, J.$$
Therefore to compute $u$ on the entire grid requires $O(J^2)$ matrix-vector multiplications. Each of these matrix-vector multiplications requires $O(mn)$ operations; for the rest of the discussion, we treat $mn = O(1)$ as constant and consider only the cost with respect to $J$, the discretization parameter, since $m$ and $n$ are fixed by the architecture choice whereas $J$ varies depending on the required discretization accuracy and hence may be arbitrarily large. This cost is not specific to the Monte Carlo approximation but is generic for quadrature rules which use the entirety of the data. Therefore, when $J$ is large, computing (17) becomes intractable and new ideas are needed in order to alleviate this. Subsections 4.1-4.4 propose different approaches to the solution of this problem, inspired by classical methods in numerical analysis. We finally remark that, in contrast, computations with $W$, $b$, and $\sigma$ only require $O(J)$ operations, which justifies our focus on computation with the kernel integral operator.
Kernel Matrix. It will often be useful to consider the kernel matrix associated to $\kappa$ for the discrete points $\{x_1, \ldots, x_J\} \subset D$. We define the kernel matrix $K \in \mathbb{R}^{mJ \times nJ}$ to be the $J \times J$ block matrix with each block given by the value of the kernel, i.e.
$$K_{jl} = \kappa(x_j, x_l) \in \mathbb{R}^{m \times n}, \qquad j, l = 1, \ldots, J,$$
where we use $(j,l)$ to index an individual block rather than a matrix element. Various numerical algorithms for the efficient computation of (17) can be derived based on assumptions made about the structure of this matrix, for example, bounds on its rank or sparsity.
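To make the object being discussed concrete, a sketch (with a hypothetical kernel network `kappa`) that assembles the block kernel matrix; storing it, or applying it densely, costs $O(J^2)$, which is precisely what the structural assumptions above aim to avoid:

```python
import torch

def kernel_matrix(kappa, x, m, n):
    """Assemble the block kernel matrix K in R^{mJ x nJ} with K_{jl} = kappa(x_j, x_l).

    kappa: callable mapping stacked pairs (x_j, x_l) of shape (J*J, 2d) to blocks (J*J, m, n)
    x: (J, d) discretization points
    """
    J = x.shape[0]
    pairs = torch.cat([x.repeat_interleave(J, 0), x.repeat(J, 1)], dim=-1)
    blocks = kappa(pairs).view(J, J, m, n)
    # Rearrange the J x J grid of m x n blocks into a dense (mJ, nJ) matrix.
    return blocks.permute(0, 2, 1, 3).reshape(J * m, J * n)
```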
4.1 Graph Neural Operator (GNO)
We first outline the Graph Neural Operator (GNO) which approximates (17) by combining a Nyström approximation with domain truncation and is implemented with the standard framework of graph neural networks. This construction was originally proposed in Li et al. (2020c).
Nyström approximation. A simple yet effective method to alleviate the cost of computing (17) is employing a Nyström approximation. This amounts to sampling uniformly at random the points over which we compute the output function $u$. In particular, let $x_{k_1}, \ldots, x_{k_{J'}} \subset \{x_1, \ldots, x_J\}$ be $J' \ll J$ randomly selected points and, assuming $\nu(D) = 1$, approximate (17) by
$$u(x_{k_j}) \approx \frac{1}{J'} \sum_{l=1}^{J'} \kappa(x_{k_j}, x_{k_l})\, v(x_{k_l}), \qquad j = 1, \ldots, J'.$$
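A sketch of this Nyström-type subsampling (illustrative names only; `kappa` is again a hypothetical kernel network): the output is evaluated at $J'$ randomly selected nodes, and the quadrature sum runs only over those nodes.

```python
import torch

def nystrom_kernel_operator(kappa, x, v, J_prime, generator=None):
    """Nystrom approximation of (17): subsample J' << J points and average over them.

    kappa: callable mapping (J'*J', 2d) stacked pairs to (J'*J', m, n) kernel blocks
    x: (J, d) full grid;  v: (J, n) input values
    returns the selected indices and the approximate output values at those points
    """
    J = x.shape[0]
    idx = torch.randperm(J, generator=generator)[:J_prime]   # random subsample of nodes
    xs, vs = x[idx], v[idx]
    pairs = torch.cat([xs.repeat_interleave(J_prime, 0), xs.repeat(J_prime, 1)], dim=-1)
    K = kappa(pairs).view(J_prime, J_prime, -1, v.shape[-1])
    u_s = torch.einsum("jlmn,ln->jm", K, vs) / J_prime       # cost O(J'^2) instead of O(J^2)
    return idx, u_s
```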