Special Issue dedicated to Robert Schaback on the occasion of his 75th birthday · Volume 15 · 2022 · Pages 125–145
Mean-field limits of trained weights in deep learning: A dynamical systems perspective

Alexandre Smirnov^a · Boumediene Hamzi^b · Houman Owhadi^c

^a Department of Mathematics, Imperial College London, United Kingdom. Email: alexandre.smirnov20@imperial.ac.uk
^b Department of Computing and Mathematical Sciences, Caltech, CA, USA. Email: boumediene.hamzi@gmail.com
^c Department of Computing and Mathematical Sciences, Caltech, CA, USA. Email: owhadi@caltech.edu

Communicated by Gabriele Santin
Abstract

Training a residual neural network with $L^2$ regularization on weights and biases is equivalent to minimizing a discrete least action principle and to controlling a discrete Hamiltonian system representing the propagation of input data across layers. The kernel/feature map analysis of this Hamiltonian system suggests a mean-field limit for trained weights and biases as the number of data points goes to infinity. The purpose of this paper is to investigate this mean-field limit and to illustrate its existence through numerical experiments and analysis (for simple kernels).
1 Introduction
Supervised learning is a class of learning problems that seek to find a relationship between a set of predictor variables (the inputs) and a set of target variables (the outputs). The target, also called the dependent variable, can correspond to a quantitative measurement (e.g., height, stock prices) or to a qualitative measurement (e.g., sex, class). The problem of predicting quantitative variables is called regression, while classification problems aim at predicting qualitative variables. Over many years of research, various models have been developed to solve both types of problems. In particular, artificial neural network (ANN) models have become very popular since their complex architecture allows the model to capture underlying patterns in the data. This architecture has proved successful in different areas of artificial intelligence such as image processing, speech recognition, and text mining. ANNs transform the inputs using a composition of consecutive mappings, generally represented by a directed acyclic graph. With this representation, the inputs are propagated through multiple layers, defining a trajectory that is optimized by training the parameters of the network.
In this paper we investigate the propagation of the inputs from a dynamical systems perspective, an approach recently explored by Houman Owhadi in [10]. Owhadi shows that the minimizers of $L^2$-regularized ResNets, a particular class of ANNs, satisfy a discrete least action principle implying the near preservation of the norm of weights and biases across layers. The parameters of trained ResNets can be identified as solutions of a Hamiltonian system defined by the activation function and the architecture of the ANN. The form of the Hamiltonian suggests that it is amenable to mean-field limit analysis as the number of data points increases. Furthermore, when the mean-field limit holds, the trajectory of inputs across layers is given by a mean-field Hamiltonian system in which position and momentum variables are nearly decoupled through mean-field averaging.
The purpose of this paper is to analyse and numerically illustrate this mean-field limit by investigating the convergence of the trained weights and biases of the model (appearing in the Hamiltonian in our kernel/feature map setting) as the number of data points goes to infinity.
In Section A we define the mathematical setting of supervised learning and introduce kernel learning, a class of problems that encompasses ANNs and which is used throughout this paper. In Section B we explain how to identify the optimization of neural networks as a dynamical system problem and characterize the optimal trajectory of the inputs. In Section 2 we reformulate the dynamics using the feature space representation of kernels. With this representation we show that the trajectory is determined by a parameter amenable to mean-field limit analysis. In Section 3, we simulate this parameter on regression and classification datasets. All the results in the Appendices are a review of results in Owhadi's paper [10].
2 Mean-field analysis of Hamiltonian dynamics
In Section B, we derived the Hamiltonian representation of minimizers of mechanical regression and idea registration. We stated that the propagation of the inputs is completely determined by the initial momentum $p(0)$. In this section, we derive
another system which does not involve the momentum. The system is obtained for an operator-valued kernel $\Gamma$ with feature map $\psi : X \longrightarrow L(X, F)$ such that (42) holds,
\[
\Gamma(x, x') = \psi^T(x)\,\psi(x').
\]
Then, the ODE for the trajectory is of the form
\[
\begin{cases}
\dot q_i = \psi^T(q_i)\,\alpha(t)\\
q_i(0) = x_i,
\end{cases}
\tag{1}
\]
for some mapping $\alpha(t)$. We characterize $\alpha$ using the feature map identification (44) of the RKHS $V$ and show that this mapping is amenable to mean-field limit analysis. We also discuss a data-based method to simulate this limit.
2.1 Feature space representation of kernels
We reformulate mechanical regression and idea registration using the feature space representation of kernels (cf. Sec. A.4) as presented in [10, Sec. 6].
2.1.1 Mechanical regression
The following theorem characterizes the minimizers of mechanical regression in the feature space of $\Gamma$.
Theorem 2.1 (Mechanical regression in feature space). Let $\Gamma$ be a kernel of RKHS $V$ with associated feature space $F$ and feature map $\psi : X \longrightarrow L(X, F)$. Then, $\alpha_1, \ldots, \alpha_L$ satisfy
\[
\text{Minimize } \; \frac{\nu L}{2}\sum_{s=1}^{L} \|\alpha_s\|_F^2 + \ell\big(\phi^L(X), Y\big), \quad \text{over } \alpha_1, \ldots, \alpha_L \in F,
\tag{2}
\]
if and only if the $v_s(\cdot) = \psi^T(\cdot)\,\alpha_s \in V$ minimize (59) and
\[
\phi^L(\cdot) = \big(I + \psi^T(\cdot)\,\alpha_L\big) \circ \cdots \circ \big(I + \psi^T(\cdot)\,\alpha_1\big).
\tag{3}
\]
This theorem identifies the minimizers $v_s$ with parameters $\alpha_s$ belonging to the feature space $F$ (possibly infinite-dimensional). In practice, we choose $\Gamma$ to be a scalar operator-valued kernel (see Def. A.4) obtained with a finite-dimensional feature space $F$ and map $\phi$,
\[
\Gamma(x, x') = \phi^T(x)\,\phi(x')\, I_X.
\]
Using Lemma A.4, $\{\alpha_s\}_{s=1}^{L}$ are matrices in $\mathbb{R}^{\dim(X) \times \dim(F)}$ and $\|\alpha_s\|_F^2$ corresponds to a matrix Frobenius norm. These parameters define the optimal discrete trajectory $q_1, \ldots, q_{L+1}$,
\[
\begin{cases}
q_{s+1} = q_s + \alpha_s\,\phi(q_s), & \text{for } s = 1, \ldots, L\\
q_1 = X,
\end{cases}
\tag{4}
\]
where we used $\psi^T(\cdot)\,\alpha_s = \alpha_s\,\phi(\cdot)$. By introducing $\Delta t = 1/L$ and $\alpha_s = \Delta t\,\tilde{\alpha}_s$, the system is equivalent to
\[
\begin{cases}
q_{s+1} = q_s + \Delta t\,\tilde{\alpha}_s\,\phi(q_s), & \text{for } s = 1, \ldots, L\\
q_1 = X,
\end{cases}
\tag{5}
\]
where $\{\tilde{\alpha}_s\}_{s=1}^{L}$ minimize
\[
\frac{\nu}{2}\sum_{s=1}^{L} \|\tilde{\alpha}_s\|_F^2\,\Delta t + \ell\big(q_{L+1}(X), Y\big), \quad \text{where } q_{L+1} = \phi^L(X).
\tag{6}
\]
Equation (5) is a discrete dynamical system which approximates (1) (for each input $x_i$). The time-dependent parameter $\alpha(t)$ is approximated by $\{\alpha_s\}$ with $\alpha_s = \alpha(s/L)$ at time $t_s = s/L$.
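To make the discrete dynamics (4)–(5) concrete, the following minimal sketch propagates a batch of inputs through $L$ layers. The feature map and the layer parameters used here are placeholders (a ReLU-based map and random matrices), chosen only for illustration and not taken from our experiments.

```python
import numpy as np

def phi(q):
    """Example feature map phi(x) = (a(x), 1) with a = ReLU (cf. Sec. 3.1.1)."""
    return np.concatenate([np.maximum(q, 0.0), np.ones((q.shape[0], 1))], axis=1)

def propagate(X, alphas):
    """Discrete dynamics (5): q_{s+1} = q_s + dt * alpha~_s phi(q_s), with q_1 = X.

    X      : (N, p) array of inputs (one row per data point).
    alphas : list of L matrices alpha~_s of shape (p, p + 1).
    Returns q_{L+1} = phi^L(X) as an (N, p) array.
    """
    dt = 1.0 / len(alphas)
    q = X.copy()
    for alpha_s in alphas:
        # Row-wise version of q_i <- q_i + dt * alpha~_s phi(q_i).
        q = q + dt * phi(q) @ alpha_s.T
    return q

# Illustration with arbitrary (untrained) parameters.
N, p, L = 50, 3, 4
X = np.random.randn(N, p)
alphas = [0.1 * np.random.randn(p, p + 1) for _ in range(L)]
print(propagate(X, alphas).shape)  # (N, p)
```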
2.1.2 Idea registration
The next theorem characterizes $\alpha$ in the feature map representation of idea registration.
Theorem 2.2 (Idea registration in feature space). The $\alpha(t)$ satisfy
\[
\text{Minimize } \; \frac{\nu}{2}\int_0^1 \|\alpha(t)\|_F^2\, dt + \ell\big(\phi^v(X, 1), Y\big) \quad \text{over } \alpha \in C([0, 1], F)
\tag{7}
\]
if and only if $v(\cdot, t) = \psi^T(\cdot)\,\alpha(t)$ and $\phi^v(x, t)$ minimize (69). Furthermore, at the minimum, $\|\alpha(t)\|_F^2$ is constant over $t \in [0, 1]$.
Hence, $\alpha$ is identified as the minimizer of (7), and (71) implies $\dot q_i = \psi^T(q_i)\,\alpha(t)$. If, in addition, $\Gamma$ is scalar operator-valued, the ODE for the trajectory becomes
\[
\begin{cases}
\dot q_i = \alpha(t)\,\phi(q_i)\\
q_i(0) = x_i.
\end{cases}
\tag{8}
\]
2.1.3 Equivalence with ResNet block minimizers
In Sec. 2.1.1 we reformulated mechanical regression using the feature map representation of $\Gamma$. In addition, if $\Gamma$ is a SOV kernel, the minimizers are identified with matrices $\{\alpha_s\}_{s=1}^{L}$. In the special case that $\Gamma$ and $K$ have the form in (54), mechanical regression is equivalent to minimizing one ResNet block (see Ex. A.3) with $L^2$ regularization on weights and biases. This is summarized in the following theorem.
Theorem 2.3. If $\Gamma(x, x') = \phi(x)^T \phi(x')\, I_X$ and $K(x, x') = \phi(x)^T \phi(x')\, I_Y$, where $\phi(x)$ is given by (53), then minimizers of (59) are of the form
\[
f(x) = \tilde{\alpha}\,\phi(x) \quad \text{and} \quad v_s(x) = \alpha_s\,\phi(x),
\tag{9}
\]
where $\tilde{\alpha} = (\tilde{W}, \tilde{b}) \in L(X \times \mathbb{R}, Y)$ and $\alpha_s = (W_s, b_s) \in L(X \times \mathbb{R}, X)$ are minimizers of
\[
\min_{\tilde{\alpha}, \alpha_1, \ldots, \alpha_L}\; \frac{\nu L}{2}\sum_{s=1}^{L} \|\alpha_s\|_{L(X \times \mathbb{R}, X)}^2 + \lambda\,\|\tilde{\alpha}\|_{L(X \times \mathbb{R}, Y)}^2 + \frac{\big\|f(\phi^L(X)) - Y\big\|_{Y^N}^2}{N},
\tag{10}
\]
and $\|\alpha\|_{L(F, Z)}$ corresponds to the Frobenius norm of the linear map $\alpha : F \longrightarrow Z$.
The approximation $f \circ \phi^L$ is of the form
\[
f \circ \phi^L(x) = (\tilde{\alpha}\,\phi) \circ (I + \alpha_L\,\phi) \circ \cdots \circ (I + \alpha_1\,\phi)(x).
\tag{11}
\]
2.2 Mean-field analysis
2.2.1 The mean-field approximation
Mean-field theory aims at developing approximation strategies for systems composed of interacting particles. It was originally developed in statistical physics to study systems such as the Ising model [2]. The central idea behind mean-field theory is to approximate the interacting terms in the system by a simpler non-interacting function. Our aim is to apply this strategy to study the Hamiltonian (79),
\[
H(q, p) = \frac{1}{2}\sum_{i,j=1}^{N} p_i^T\,\Gamma(q_i, q_j)\, p_j.
\tag{12}
\]
It can be viewed as a system of interacting particles. The feature map representation of $\Gamma$ allows us to split the interacting term $\Gamma(q_i, q_j)$ into the two terms $\psi^T(q_i)$ and $\psi(q_j)$. If we rescale the momentum $\bar{p}_j := N p_j$ and define
\[
\alpha(t) := \frac{1}{N}\sum_{j=1}^{N} \psi\big(q_j(t)\big)\,\bar{p}_j(t),
\tag{13}
\]
we can remove the interaction in the Hamiltonian,
\[
H(q, p) = \frac{1}{2}\sum_{i=1}^{N} p_i^T\,\psi^T(q_i)\,\alpha(t).
\]
We can rewrite the dynamics of $(q, p)$ as
\[
\begin{cases}
\dot q_i = \psi^T(q_i)\,\alpha\\[2pt]
\dot{\bar{p}}_i = -\partial_x\big(\bar{p}_i^T\,\psi^T(x)\,\alpha\big)\big|_{x = q_i}.
\end{cases}
\tag{14}
\]
Note that this coincides with the dynamics of $q$ obtained in Theorem 2.2. Eq. (13) suggests that $\alpha(t)$ is amenable to mean-field limit analysis as $N \to \infty$. If the limit exists, the system (12) behaves like a decoupled system of particles, in the sense that the trajectory of each data input is not influenced by the other inputs. Secondly, $\alpha(t)$ defines the ODE (70) for the flow map $\phi^v$,
\[
\dot{\phi}^v(x, t) = \psi^T\big(\phi^v(x, t)\big)\,\alpha(t),
\]
which determines the solution of idea registration, $f \circ \phi^v(\cdot, 1)$.
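To make (13) concrete in the matrix setting of Lemma A.4, where the action $\psi(q)\,\bar{p}$ can be identified with the outer product $\bar{p}\,\phi(q)^T$ (this identification is our reading of the SOV case and is stated here as an assumption), the empirical mean-field parameter at a fixed time can be assembled as follows.

```python
import numpy as np

def relu_feature(q):
    """Feature map phi(x) = (a(x), 1), a = ReLU, mapping (N, p) -> (N, p + 1)."""
    return np.concatenate([np.maximum(q, 0.0), np.ones((q.shape[0], 1))], axis=1)

def mean_field_alpha(q, p_bar, phi=relu_feature):
    """Empirical alpha(t) of Eq. (13) in matrix form: (1/N) sum_j p_bar_j phi(q_j)^T.

    q     : (N, p) positions q_j(t).
    p_bar : (N, p) rescaled momenta p_bar_j(t) = N p_j(t).
    Returns a (p, D) matrix, with D = p + 1 for this feature map.
    """
    N = q.shape[0]
    return p_bar.T @ phi(q) / N   # sums the outer products p_bar_j phi(q_j)^T

# Example with synthetic positions and momenta.
N, p = 100, 3
q = np.random.randn(N, p)
p_bar = np.random.randn(N, p)
print(mean_field_alpha(q, p_bar).shape)  # (3, 4)
```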
2.2.2 Type of convergence
In theory, $\alpha(t)$ is a possibly infinite-dimensional object belonging to the feature space $F$ of the operator-valued kernel $\Gamma$. In practice, we consider $\Gamma$ to be scalar operator-valued, defined with a finite-dimensional feature space $F = \mathbb{R}^D$. Lemma A.4 states that $\alpha(t) \in \mathbb{R}^{p \times D}$ for all $t \in [0, 1]$ (taking $Y = X$ in the lemma). To simulate the limit, we need to plot a functional of $\{\alpha(t)\}_{0 < t < 1}$ against different values of $N$. A natural choice corresponds to the energy integral appearing in the loss (7),
\[
E(\alpha) = \int_0^1 \|\alpha(t)\|_F^2\, dt.
\tag{15}
\]
Numerically, we approximate $\alpha(t)$ with $\{\alpha_s\}_{s=1}^{L}$ using mechanical regression. The energy (15) is then approximated by
\[
E(\alpha_1, \ldots, \alpha_L) = \sum_{s=1}^{L} \|\alpha_s\|_F^2\,\Delta t = \frac{1}{L}\sum_{s=1}^{L} \|\alpha_s\|_F^2,
\tag{16}
\]
which corresponds to the average squared Frobenius norm of the $\alpha_s$ across the layers of the network and appears in the formulation (6) of mechanical regression.
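A direct implementation of the discretized energy (16), given the list of trained layer parameters (same matrix layout as in the propagation sketch above; shapes are illustrative):

```python
import numpy as np

def energy(alphas):
    """Discretized energy (16): average squared Frobenius norm of the alpha_s."""
    return sum(np.sum(a ** 2) for a in alphas) / len(alphas)

# Example: L = 4 layers with arbitrary (p x D) parameter matrices.
alphas = [0.1 * np.random.randn(3, 4) for _ in range(4)]
print(energy(alphas))
```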
2.3 Numerical simulations
Provided that $\Gamma$ is a Scalar Operator-Valued (SOV) kernel, the feature space representation of mechanical regression given by Theorem 2.1 reduces the search for the optimal trajectory to the search for matrices $\{\alpha_s\}_{s=1}^{L}$ that minimize (2). These parameters correspond to a discretization of the time-dependent parameter $\alpha(t)$ in idea registration (Thm. 2.2). We present the optimization algorithm within the feature space framework.
Let $X = \mathbb{R}^p$ and $Y = \mathbb{R}^d$ be the input and output spaces. Let $\Gamma$ and $K$ be scalar operator-valued kernels obtained using feature maps $\phi$, $\phi_2$ with corresponding feature spaces $F$ and $F_2$,
\[
\Gamma(x, x') = \phi^T(x)\,\phi(x')\, I_X, \quad \text{where } \phi(x) \in F = \mathbb{R}^D \text{ for some } D \geq 1,
\]
\[
K(x, x') = \phi_2^T(x)\,\phi_2(x')\, I_Y, \quad \text{where } \phi_2(x) \in F_2 = \mathbb{R}^{D_2} \text{ for some } D_2 \geq 1.
\]
Obtain $\phi^L(X) = q_{L+1}$ from the discrete dynamics (4). The function $f$ is of the form $f(\cdot) = w_2\,\phi_2(\cdot)$, where
\[
w_2 = \Big(\phi_2^T\big(\phi^L(X)\big)\,\phi_2\big(\phi^L(X)\big) + (N\lambda)\, I_{D_2}\Big)^{-1}\phi_2^T\big(\phi^L(X)\big)\, Y \;\in\; \mathbb{R}^{d \times D_2}
\tag{17}
\]
minimizes the ridge loss
\[
\ell_{ridge}\big(\phi^L(X), Y\big) = \frac{1}{N}\,\big\|w_2\,\phi_2\big(\phi^L(X)\big) - Y\big\|^2 + \lambda\,\|w_2\|_{F_2}^2,
\tag{18}
\]
where the squared error term has been normalized. By renormalizing the squared error in the ridge loss term, we aim to illustrate the convergence of the trained model parameters $\{\alpha_s\}_{s=1}^{L}$ and $w_2$ (obtained as minimizers of (2)) as $N \to \infty$. The normalization of the squared error controls the ridge loss as $N$ increases. If the trained model parameters $\{\alpha_s\}_{s=1}^{L}$ and $w_2$ converge as $N \to \infty$, then the energy integral $E(\alpha)$ must converge too.
For any deformed input $\phi^L(X)$, (17) uniquely defines $w_2$. Thus, the loss (2) is completely determined by $\alpha_1, \ldots, \alpha_L$,
\[
\mathcal{L}(\alpha_1, \ldots, \alpha_L) := \frac{\nu L}{2}\sum_{s=1}^{L} \|\alpha_s\|_F^2 + \ell_{ridge}\big(\phi^L(X), Y\big).
\tag{19}
\]
We can use a gradient-based method to minimize the loss (19) with respect to $\alpha_1, \ldots, \alpha_L$. In this article we use the Adam optimizer, a gradient-based method that computes adaptive learning rates for each parameter [12]. The feature space perspective has the advantage that it does not involve simulating a mechanical system, unlike the shooting method (Sec. B.4).
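The optimization just described can be sketched with PyTorch as follows. This is a minimal sketch, not the exact implementation behind our experiments: the feature maps, dimensions and hyperparameters below are illustrative placeholders. The rescaled deformation parameters $\tilde{\alpha}_s$ are trained with Adam, while the readout $w_2$ is recomputed in closed form via (17) inside the loss.

```python
import torch

def phi(q):                       # feature map of Gamma: phi(x) = (a(x), 1), a = ReLU
    return torch.cat([torch.relu(q), torch.ones(q.shape[0], 1)], dim=1)

def phi2(q, W2, b2):              # feature map of K: phi_2(x) = a(W_2 x + b_2)
    return torch.relu(q @ W2.T + b2)

def regularized_loss(alphas, X, Y, W2, b2, nu=0.1, lam=0.1):
    L = len(alphas)
    q = X
    for a_s in alphas:            # discrete dynamics (5): q <- q + dt * alpha~_s phi(q)
        q = q + (1.0 / L) * phi(q) @ a_s.T
    F2 = phi2(q, W2, b2)          # N x D2 ridge features of the deformed inputs
    N, D2 = F2.shape
    # Closed-form readout (17): solve (F2^T F2 + N*lam*I) w2^T = F2^T Y.
    w2T = torch.linalg.solve(F2.T @ F2 + N * lam * torch.eye(D2), F2.T @ Y)
    ridge = ((F2 @ w2T - Y) ** 2).sum() / N + lam * (w2T ** 2).sum()   # ridge loss (18)
    # Deformation penalty of (6), equal to the penalty in (19) under alpha_s = dt*alpha~_s.
    penalty = nu / (2 * L) * sum((a_s ** 2).sum() for a_s in alphas)
    return penalty + ridge

# Toy setup (illustrative sizes and hyperparameters).
N, p, d, L, D2 = 200, 3, 1, 4, 10
X, Y = torch.randn(N, p), torch.randn(N, d)
W2, b2 = torch.randn(D2, p), torch.randn(D2)                    # random ridge feature map
alphas = [torch.zeros(p, p + 1, requires_grad=True) for _ in range(L)]  # alpha_s = 0: no deformation
opt = torch.optim.Adam(alphas, lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = regularized_loss(alphas, X, Y, W2, b2)
    loss.backward()
    opt.step()
print(sum((a ** 2).sum().item() for a in alphas) / L)           # discretized energy (16)
```

The quantity printed at the end is the discretized energy (16) of the trained parameters, which is what we track as $N$ grows in the experiments below.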
2.4 Analytical example: Linear kernel
In order to gain intuition about the optimization of $\alpha(t)$, we derive an analytical solution to (1) when $X = \mathbb{R}$ and $\Gamma$ is the scalar operator-valued linear kernel
\[
\Gamma(x, x') = \phi(x)\,\phi(x')\, I_X, \qquad \phi(x) = x.
\]
Applying Ex. A.5 with $p = 1$, the feature space of $\Gamma$ is $F = \mathbb{R}$. The trajectory (8) is
\[
\dot q_i = \alpha(t)\, q_i \;\Longrightarrow\; \frac{\dot q_i}{q_i} = \alpha(t) \;\Longrightarrow\; q_i(t) = x_i\,\exp\Big(\int_0^t \alpha(s)\, ds\Big).
\]
Energy preservation of $\|\alpha(t)\|_F^2 = \alpha^2(t)$ over $t$ implies that $\alpha(t)$ is constant in $t$, i.e. $\alpha(t) \equiv \alpha \in \mathbb{R}$. Hence, the inputs are deformed according to the trajectory
\[
q_i(t) = x_i\, e^{\alpha t} \;\Longrightarrow\; q_i(1) = x_i\, e^{\alpha}.
\]
The deformation depends on the optimal $\alpha$ that minimizes the loss (7), which can be explicitly written as
\[
\mathcal{L}(\alpha) = \frac{\nu}{2}\,\alpha^2 + \lambda\, Y^T\big[K(X e^{\alpha}, X e^{\alpha}) + \lambda I_N\big]^{-1} Y
= \frac{\nu}{2}\,\alpha^2 + \lambda\, e^{-2\alpha}\, Y^T\big[K(X, X) + \lambda e^{-2\alpha} I_N\big]^{-1} Y.
\]
The minimizer $\alpha$ of $\mathcal{L}$ changes the regularization $\lambda$ on the ridge loss into a parameter $\tilde{\lambda} = \lambda e^{-2\alpha}$. Therefore, deforming the input space is equivalent to tuning the regularization of the ridge loss. Since $\tilde{\lambda} \to 0$ as $\alpha \to \infty$, the regularization $\nu > 0$ penalizes overfitting of the training data.
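The scalar loss above is easy to examine numerically. The sketch below evaluates $\mathcal{L}(\alpha)$ on a grid for a toy one-dimensional dataset, taking the ridge kernel to be the linear kernel $K(x, x') = x x'$ (our choice here, made purely for illustration), and reports the minimizing $\alpha$.

```python
import numpy as np

# Toy 1-D data and hyperparameters (illustrative values).
rng = np.random.default_rng(0)
X = np.linspace(-5, 5, 50)
Y = X**2 + X + rng.normal(0, 1, X.size)
nu, lam = 0.1, 0.1

def K(a, b):
    """Linear ridge kernel K(x, x') = x x' (an example choice)."""
    return np.outer(a, b)

def loss(alpha):
    """L(alpha) = nu/2 * alpha^2 + lam * Y^T [K(X e^a, X e^a) + lam I]^{-1} Y."""
    Xa = X * np.exp(alpha)
    return nu / 2 * alpha**2 + lam * Y @ np.linalg.solve(K(Xa, Xa) + lam * np.eye(X.size), Y)

alphas = np.linspace(-3, 3, 601)
values = [loss(a) for a in alphas]
print("optimal alpha ~", alphas[int(np.argmin(values))])
```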
3 Simulations in feature space representation
In this section we simulate the parameter $\alpha(t)$ described in Section 2 for both regression and classification datasets. For regression, one dataset is synthetic and the other is the Boston housing dataset [4]. For classification, we choose MNIST and Fashion MNIST, two benchmark datasets.
3.1 Model optimization
The input and output spaces are $X = \mathbb{R}^p$ and $Y = \mathbb{R}^d$.
3.1.1 Choice of feature maps
We suppose that the kernels $\Gamma$, $K$ are scalar operator-valued kernels obtained using feature maps $\phi$, $\phi_2$ with corresponding feature spaces $F$ and $F_2$, defined as follows (a minimal construction is sketched after this list):
• $\Gamma(x, x') = \phi(x)^T \phi(x')\, I_X$, where $\phi(x) = (a(x), 1) \in F = \mathbb{R}^{p+1}$ is defined through an activation as in Sec. A.4.3.
• $K(x, x') = \phi_2(x)^T \phi_2(x')\, I_Y$, where $\phi_2(x) = a(W_2 x + b_2) \in F_2$, with $W_2 \in \mathbb{R}^{\dim F_2 \times p}$ and $b_2 \in \mathbb{R}^{\dim F_2}$. We take $F_2 = \mathbb{R}^{D_2}$. The entries $W_{2,ij}$ and $b_{2,i}$ are taken randomly as discussed in Sec. A.4.4.
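A minimal construction of these two feature maps, assuming i.i.d. standard Gaussian entries for $W_2$ and $b_2$ (one plausible choice; the precise sampling of Sec. A.4.4 may differ):

```python
import numpy as np

def make_feature_maps(p, D2, activation=np.tanh, rng=np.random.default_rng(0)):
    """Return phi (for Gamma) and phi_2 (for K), acting row-wise on (N, p) arrays."""
    W2 = rng.standard_normal((D2, p))   # assumed Gaussian sampling (illustrative)
    b2 = rng.standard_normal(D2)

    def phi(x):                          # phi(x) = (a(x), 1) in R^{p+1}
        return np.concatenate([activation(x), np.ones((x.shape[0], 1))], axis=1)

    def phi2(x):                         # phi_2(x) = a(W_2 x + b_2) in R^{D2}
        return activation(x @ W2.T + b2)

    return phi, phi2

phi, phi2 = make_feature_maps(p=3, D2=10)
x = np.random.randn(5, 3)
print(phi(x).shape, phi2(x).shape)       # (5, 4) (5, 10)
```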
3.1.2 Choice of other parameters
We train the model for three different activation functions $a(x)$: ReLU, tanh and sigmoid. The kernel $\Gamma$ is used to deform the input space and its feature map up-samples the inputs by one dimension. For the ridge kernel $K$ we can choose $F_2$ such that it up-samples or down-samples the inputs. For our experiments we fix $F_2 = \mathbb{R}^{10}$.
3.1.3 Model training and comparison
We train the model as described in Sec. 2.3 using the Adam optimizer to minimize the $L^2$-regularized loss (19). We initialize the parameters at $\alpha_s = 0$ for $s = 1, \ldots, L$, which corresponds to no deformation of the inputs, i.e. $q_{L+1} = X$. We also apply ridge regression without space deformation, and then plot the training and test errors. For regression data, the MSE is used as the error metric; for classification we use the proportion of misclassified samples. We show test errors for the sake of completeness, since the purpose of this paper is not to improve test error but to study the convergence of the trained model parameters as we increase $N$. To analyze convergence, we plot the energy $E(\alpha)$ given by (16) as a function of $N$, for different activation functions and different values of $L$.
3.2 Regression datasets
3.2.1 Synthetic data
We take a synthetic dataset $(X_i, Y_i)_{i=1}^{N}$ where $X_i \in X = \mathbb{R}$ and
\[
Y_i = X_i^2 + X_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \text{ for } \sigma^2 = 1.
\]
We choose $N$ training inputs evenly spaced in the interval $[-5, 5]$ with $N = 50 \cdot 2^k$ and $k = 0, 1, \ldots, 5$. We fix the test dataset to 200 points: 100 in each of the intervals $[-7.5, -5]$ and $[5, 7.5]$.
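For concreteness, the synthetic training and test sets can be generated as follows (a direct transcription of the description above; adding noise to the test outputs is our assumption):

```python
import numpy as np

def make_synthetic(k, sigma=1.0, rng=np.random.default_rng(0)):
    """Training inputs evenly spaced in [-5, 5], N = 50 * 2**k, Y = X^2 + X + noise."""
    N = 50 * 2**k
    X_train = np.linspace(-5, 5, N)
    Y_train = X_train**2 + X_train + rng.normal(0, sigma, N)
    # 200 test points: 100 in [-7.5, -5] and 100 in [5, 7.5].
    X_test = np.concatenate([np.linspace(-7.5, -5, 100), np.linspace(5, 7.5, 100)])
    Y_test = X_test**2 + X_test + rng.normal(0, sigma, X_test.size)
    return (X_train, Y_train), (X_test, Y_test)

train, test = make_synthetic(k=0)
print(train[0].shape, test[0].shape)  # (50,) (200,)
```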
3.2.2 Boston housing dataset
The Boston housing data [4] was collected in 1978; each of the $N = 506$ entries represents aggregated data about 14 features for homes from various suburbs of Boston, Massachusetts. The target variable is the median value of owner-occupied homes and the remaining features are the predictors. Thus, we have $X = \mathbb{R}^{13}$ and $Y = \mathbb{R}$. For reasons of numerical instability, we only train the model with the ReLU and tanh activations.
We split the dataset into a training and a test set, where we choose 10% of the total set size for the test data (around 50 points); the remaining data is used for training. The test set size is kept deliberately low since we are interested in using as much training data as possible to get the most accurate estimate of mean-field convergence.
3.2.3 Result interpretation
Model performance
On the synthetic data (Fig. 4), mechanical regression outperforms ridge regression for any choice of $L$ and activation, except for tanh with $L = 2$. For ReLU and tanh, using $L = 2$ layers produces a higher training/test MSE and $L^2$-regularized loss compared to deeper models; for $L \geq 4$, the models have very similar MSE and objective loss (Fig. 4A, Fig. 4B). For the sigmoid activation, these values are identical for all values of $L$ (Fig. 4C). The key element is the deformation of the inputs, which improves both the fit and the $L^2$-regularized loss: compare the ridge curves with the mechanical regression curves. For all activations, the deformation yields a considerable reduction in $L^2$ loss.
On the Boston housing dataset (Fig. 9), mechanical regression outperforms ridge regression on the training set for both activation functions. The largest decrease in training MSE occurs for the ReLU activation (Fig. 9A).
Convergence
On the synthetic dataset, Fig. 5 shows that $E(\alpha)$ appears to converge as $N$ increases. This seems to hold for all values of $L$. The exceptions are mechanical regression with $L = 2$ and the ReLU activation (Fig. 5A), which is upward trending, as well as with the sigmoid activation.
Fig. 6 shows that on the Boston housing dataset, $E(\alpha)$ fluctuates less as $L$ increases. Hence, using a deeper network controls the energy as we increase $N$. When $L = 2$, the fluctuations are very large compared to other values of $L$.
Finally, an increase in $L$ appears to reduce the average norm $E(\alpha)$ on both datasets. This means that the energy in the loss (19) becomes smaller as we decrease the time step used to discretize idea registration.
3.3 Classification datasets
3.3.1 MNIST
MNIST is a classification dataset containing images of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples, and is widely used for training and testing models in machine learning. We train mechanical regression with $N$ data points taken from the training set, where $N \in \{500, 1000, \ldots, 5000\}$ with a step of 500. We choose $F_2 = \mathbb{R}^{784}$, which corresponds to the dimension of a flattened $28 \times 28$ input image.
3.3.2 Fashion MNIST
Fashion MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a $28 \times 28$ image associated with a label from 10 classes. It shares the same image size and structure of training and testing splits as MNIST. This dataset is used for benchmarking machine learning algorithms and is regarded as more challenging than MNIST (in terms of model performance).
3.3.3 Result interpretation
Model performance
On MNIST (Fig. 13), mechanical regression with ReLU and tanh converges towards a 5–6% test error (2a, 2b) as we increase the number of training data points, while sigmoid converges towards a 14% test error (2c). The same pattern holds for Fashion MNIST (Fig. 18), with ReLU and tanh converging towards 15–16% and sigmoid towards 21%. Interestingly, as $N$ increases, the training accuracy decreases but the test accuracy increases. A possible explanation is that increasing $N$ increases the penalty in the ridge loss ($N\lambda$ in Eq. (17)), which prevents the model from overfitting. This can lead to better generalization, which is reflected in the better test accuracy and worse training accuracy.
Convergence
Fig. 14 (MNIST) and Fig. 19 (Fashion MNIST) have very similar shapes. Convergence clearly holds for $L = 2, 3$. For $L = 1$, convergence seems to be slower since the values are still slightly decreasing in $N$. The curves are flatter for MNIST, which indicates faster convergence than for Fashion MNIST. This could be explained by the fact that Fashion MNIST is a more challenging classification problem than MNIST, requiring more training samples.
3.4 Discussion
3.4.1 Observations
Our model presented in Sec. 3.1 uses the feature map representation of mechanical regression to approximate idea registration. We trained the model on several classification and regression datasets and observed the behaviour of the energy $E(\alpha)$ after minimizing the loss for increasing values of $N$. On the synthetic, MNIST and Fashion MNIST datasets we observe clear convergence in $N$. On MNIST and Fashion MNIST, there is also convergence of the test loss. On the Boston housing dataset, the fluctuation of the energy decreases as we use more hidden layers, but we cannot observe a convergence similar to MNIST.
3.4.2 Limitations and suggestions
The rationale behind initializing $\alpha$ to 0 is the following: since the regularization $\nu$ induces a penalty on the deformation of the inputs, we allow deformation to occur only if the reduction in ridge loss is greater than this penalty. The limitation of this reasoning is that the optimizer may get stuck at a local minimum, hence not exploring the full parameter space. One approach to overcome this could be to optimize $\alpha$ with different random initializations and pick the one that minimizes the loss (19).
Secondly, the high dimensionality of $\alpha$ complicates the study of element-wise convergence. We use the energy since its presence in the loss function makes it a natural candidate. Other transformations could be explored.
Thirdly, the value of $E(\alpha)$ is certainly affected by the penalty parameters $\lambda$ and $\nu$, and we have not explored other values, which could change it. For example, an interesting case would be to run the simulations with $\nu = 0$, thus not penalizing the deformation.
Finally, other network architectures will lead to different values of $E(\alpha)$ and possibly faster convergence to a minimum. Hence, further experiments with other model architectures should be performed.
4 Conclusion
In this article we presented a dynamical systems perspective on the training of neural networks by characterizing the optimal propagation of the inputs across their layers. We reviewed the theory of kernel-based learning and introduced operator-valued kernels as a tool to construct spaces of functions. Using this setting, we introduced a model architecture, called mechanical regression, whose minimizers are identified as solutions of a discrete Hamiltonian system. We obtained the continuous dynamics as an infinite-depth limit of the discrete case.
We then used the feature space representation of kernels to reformulate the dynamics of the minimizers. The form of the trajectory in this representation is amenable to mean-field limit analysis. We presented a data-based approach to simulate a functional of this limit, involving the training of a neural network. Finally, we trained our model on various standard datasets and investigated its performance and the limiting behaviour of the functional. We observed convergence in energy on MNIST and Fashion MNIST, two benchmark classification datasets. However, this convergence should be verified with other datasets and model architectures, which motivates further investigation.
5 Acknowledgment
Parts of this work were done when B. H. was a Marie Curie fellow at Imperial College London. B. H. thanks the European
Commission for funding through the Marie Curie fellowship STALDYS-792919 (Statistical Learning for Dynamical Systems). H. O.
gratefully acknowledges support by the Air Force Office of Scientific Research under award number FA9550-18-1-0271 (Games
for Computation and Learning).
6 Code
The Python code for all numerical experiments in this paper is available at https://github.com/alexsm98/Mean-field-limit-code-.git
A Supervised learning with Kernels
A.1 Introduction
Supervised learning aims at finding a relationship between $x \in X$ and $y \in Y$. We are given $N$ pairs of training data points $\{(x_i, y_i)\}_{i=1}^{N}$, which we use to model the relationship. We set $X = (x_1, \ldots, x_N) \in X^N$ and $Y = (y_1, \ldots, y_N) \in Y^N$. In regression, $X = \mathbb{R}^p$, where the $p$ coordinates are called features, and $Y = \mathbb{R}^d$ for some $p, d \geq 1$. In classification, $Y$ corresponds to a discrete subset $D = \{c_1, \ldots, c_l\} \subset \mathbb{R}^d$. We formulate the supervised learning problem as follows.
Problem A.1 (Supervised learning). Let $f^{\dagger}$ be an unknown function mapping $X$ to $Y$. Given the information $f^{\dagger}(X) = Y$ with the training data $(X, Y) \in X^N \times Y^N$, approximate $f^{\dagger}$ with a mapping $f$ that minimizes a loss function $\ell : Y^N \times Y^N \longrightarrow \mathbb{R}_{\geq 0}$.
In parametric models, we look for the optimal $f$ in a certain Hilbert space of functions $H$, and its performance is evaluated using a given metric. It is important to note the difference between loss and metric: the former is the objective function that we minimize to obtain $f$, while the latter measures the model's performance on the data. In regression, the mean squared error (MSE)
\[
\ell_Y\big(f(X), Y\big) := \frac{1}{N}\,\|f(X) - Y\|_{Y^N}^2, \qquad \text{where } \|Y\|_{Y^N}^2 := \sum_{i=1}^{N} \|Y_i\|_Y^2,
\]
is usually chosen as the metric, while for classification data we use the proportion of misclassified samples. The MSE can also be used as a loss function for both regression and classification. Throughout this paper we will use the ridge loss
\[
\ell_{ridge}\big(f(X), Y\big) := \|f(X) - Y\|_{Y^N}^2 + \lambda\,\|f\|_H^2,
\tag{20}
\]
where $\lambda > 0$ is a positive regularization parameter and $\|f\|_H^2$ is the squared norm of the function $f$ in the Hilbert space $H$ to which it belongs. This loss introduces an additional penalty term, weighted by $\lambda > 0$, on $f$ to prevent overfitting of the training data, an issue that commonly arises when using the MSE as the loss. Overfitting means that the model approximates the training output $Y$ well but fails to generalize to new data, resulting in a large approximation error.
The ridge loss can be used for both regression and classification. In regression, the output $Y$ is directly predicted with $\hat{Y} = f(X)$. In classification, $f(X)$ is a vector whose dimension is equal to the number of possible classes, and the class is predicted with $\hat{Y} = \arg\max f(X)$.
Example A.1 (Linear ridge regression). We aim to determine a parametric function of the form $f_\theta$,
\[
f_\theta : \mathbb{R}^p \longrightarrow \mathbb{R}, \qquad x \longmapsto f_\theta(x) = x^T \theta.
\]
The Hilbert space $H$ can be identified with the space of parameters $\theta \in \mathbb{R}^p$ and the penalty term regularizes the Euclidean norm $\|f_\theta\|_H^2 = \|\theta\|^2$. The parameter that minimizes the ridge loss is
\[
\theta_{ridge} = \big(X^T X + \lambda I_p\big)^{-1} X^T Y.
\tag{21}
\]
A new input $x$ is predicted as
\[
y = \theta_{ridge}^T\, x = \sum_{k=1}^{p} \theta_k\, x_k.
\]
The solution also admits a dual form, obtained using the relationship (cf. [13])
\[
\big(X^T X + \lambda I_p\big)^{-1} X^T Y = X^T\big(X X^T + \lambda I_N\big)^{-1} Y.
\tag{22}
\]
Thus, we can write $\theta_{ridge} = X^T \alpha$ and the prediction as
\[
y = \sum_{i=1}^{N} \alpha_i\, x_i^T x.
\tag{23}
\]
When formulating the problem in dual form, we optimize over a parameter $\alpha \in \mathbb{R}^N$, whose dimension corresponds to the number of data points. Additionally, observe the presence of the dot product terms $x_i^T x$, which can be viewed as a similarity measure between two input samples. Kernel methods generalize this idea by replacing the dot product with a function $k(x_i, x)$ which corresponds to an inner product of the inputs in a transformed space, called the feature space.
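A quick numerical check of the primal solution (21), the dual identity (22) and the dual prediction (23) on synthetic data (sizes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 100, 5, 0.5
X = rng.standard_normal((N, p))          # rows are the inputs x_i^T
Y = rng.standard_normal(N)

# Primal solution (21): theta = (X^T X + lam I_p)^{-1} X^T Y.
theta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Dual solution (22)-(23): alpha = (X X^T + lam I_N)^{-1} Y and theta = X^T alpha.
alpha = np.linalg.solve(X @ X.T + lam * np.eye(N), Y)
print(np.allclose(theta, X.T @ alpha))   # True: both forms agree

# Prediction of a new input x with (23): y = sum_i alpha_i x_i^T x.
x = rng.standard_normal(p)
print(np.isclose(theta @ x, alpha @ (X @ x)))  # True
```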
A.2 Neural networks
In this section, we introduce neural networks and refer the reader to [3] for a more detailed overview of this topic. A neural network can be used as a supervised learning model whose architecture contains multiple layers of input data transformation. The general idea is to transform the input $x$ into $x'$ by a composition of consecutive mappings, and then map the modified input to the output $y$. The function $f$ can be expressed as a composition
\[
f = f_D \circ \cdots \circ f_1,
\tag{24}
\]
where $f_k : X_k \longrightarrow X_{k+1}$ and $f_k$ belongs to a Hilbert space $H_k$ for $k = 1, \ldots, D$. We have $X_0 = X$ and $X_{D+1} = Y$. Note that each function $f_k$ maps $X_k$ into a new space $X_{k+1}$. The space $X_k$ is commonly referred to as the $k$-th layer of the network, and all the layers $X_k \neq X, Y$ are called the hidden layers.
Example A.2 (Artificial Neural Network). Each function $f_k$ is of the form
\[
f_k(x) = a\big(W_k\, x + b_k\big),
\tag{25}
\]
where $a$ is a fixed activation function, $W_k \in L(X_k, X_{k+1})$, with $L$ denoting the space of linear operators between $X_k$ and $X_{k+1}$, and $b_k \in X_{k+1}$.
Popular choices for $a$ are the ReLU, tanh or sigmoid activations. The operator $W_k$ is a bounded linear operator called the weight of the $k$-th layer; the term $b_k$ is called the bias. With this architecture, the solution to Pb. A.1 is the minimizer of the loss
\[
\min_{W_k, b_k}\; \ell\big(f(X), Y\big).
\]
In practice, we have $X_k = \mathbb{R}^{d_k}$ for all $k$, so that $W_k$ is a matrix in $\mathbb{R}^{d_{k+1} \times d_k}$.
Example A.3 (Residual Neural Network). Write $f = F_D \circ \cdots \circ F_1$, where
\[
F_k = f_k \circ \big(I + v_{k, L_k}\big) \circ \cdots \circ \big(I + v_{k, 1}\big),
\tag{26}
\]
where $f_k$ is defined as in Example A.2 and $v_{k,s}(x) = a\big(W_{k,s}\, x + b_{k,s}\big)$. We can also take $v_{k,s}(x) = W_{k,s}\, a(x) + b_{k,s}$. Residual neural networks are commonly called ResNets.
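As an illustration of (26), a single ResNet block built from identity-plus-residual maps can be written as follows (a sketch with arbitrary shapes and Gaussian weights, not the trained parameters discussed in this paper):

```python
import numpy as np

def resnet_block(x, Ws, bs, W_out, b_out, a=np.tanh):
    """One block F_k of (26): f_k o (I + v_{k,L_k}) o ... o (I + v_{k,1}).

    x            : (d,) input vector.
    Ws, bs       : lists of residual weights (d x d) and biases (d,), v_{k,s}(x) = a(W x + b).
    W_out, b_out : parameters of the outer layer f_k(x) = a(W_out x + b_out).
    """
    for W, b in zip(Ws, bs):
        x = x + a(W @ x + b)          # (I + v_{k,s})(x)
    return a(W_out @ x + b_out)       # f_k

rng = np.random.default_rng(0)
d, d_next, L_k = 4, 3, 2
Ws = [0.1 * rng.standard_normal((d, d)) for _ in range(L_k)]
bs = [0.1 * rng.standard_normal(d) for _ in range(L_k)]
W_out, b_out = rng.standard_normal((d_next, d)), rng.standard_normal(d_next)
print(resnet_block(rng.standard_normal(d), Ws, bs, W_out, b_out))
```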
More generally, one can choose any sequence of Hilbert spaces $H_1, \ldots, H_D$ to which the mappings $f_1, \ldots, f_D$ belong. In particular, we will choose Reproducing Kernel Hilbert Spaces (RKHS), which are defined by a kernel function. We present the theory of kernel learning and RKHS in the next section.
A.3 Scalar-valued Kernels
In this section, we provide a brief summary of supervised learning with scalar real-valued kernels. For a more detailed overview we suggest [14] and [15].
A.3.1 Motivation and the kernel trick
The idea behind kernel methods is to map the input space $X$ into a Hilbert space $(F, \langle\cdot,\cdot\rangle_F)$ with a mapping $\Phi : X \longrightarrow F$ and to look for linear models in this space,
\[
f_\theta(x) = \big\langle \theta, \Phi(x)\big\rangle_F.
\]
In view of (23) and assuming that $F = \mathbb{R}^P$, an input $x$ is predicted as
\[
y = \sum_{i=1}^{N} \alpha_i\, \Phi^T(x_i)\,\Phi(x),
\tag{27}
\]
where
\[
\alpha = \big(\Phi(X)\,\Phi^T(X) + \lambda I_N\big)^{-1} Y
\tag{28}
\]
and the matrix $\Phi(X)$ has rows $\Phi^T(x_i)$. Eqs. (27), (28) use the feature space only to calculate inner products, since $\big(\Phi(X)\,\Phi^T(X)\big)_{ij} = \Phi^T(x_i)\,\Phi(x_j)$. Hence, if we define a function $k : X \times X \longrightarrow \mathbb{R}$, called a positive definite kernel (or simply kernel), such that
\[
k(x_i, x_j) = \big\langle \Phi(x_i), \Phi(x_j)\big\rangle_F,
\tag{29}
\]
the model in (27) becomes
\[
y = \sum_{i=1}^{N} \alpha_i\, k(x_i, x).
\]
The approximation $f$ is
\[
f(\cdot) = \sum_{i=1}^{N} \alpha_i\, k(x_i, \cdot).
\tag{30}
\]
Eq. (29) corresponds to the definition of a real-valued kernel $k : X \times X \longrightarrow \mathbb{R}$ (see [5]). Thus, a kernel corresponds to an inner product in a Hilbert space, called the feature space. This approach is called the kernel trick, and the theoretical result that justifies (30) is the Representer theorem presented in Sec. A.3.3.
Recall that supervised learning aims at finding the best approximation $f$ within a Hilbert space of functions. Eq. (30) suggests that this space is defined by the kernel $k$. That is indeed the case: the space is the Reproducing Kernel Hilbert Space (RKHS) of $k$.
A.3.2 Reproducing Kernel Hilbert Space
We start with the following definitions [15, p. 119].
Definition A.1. Let $H$ be a Hilbert space of functions $f : X \longrightarrow \mathbb{R}$ defined on a non-empty set $X$.

(i) A function $k : X \times X \longrightarrow \mathbb{R}$ is called a reproducing kernel of $H$ if we have $k(\cdot, x) \in H$ for all $x \in X$ and the reproducing property
\[
f(x) = \big\langle f, k(x, \cdot)\big\rangle
\tag{31}
\]
holds for all $f \in H$ and all $x \in X$. In particular,
\[
\big\langle k(x, \cdot), k(x', \cdot)\big\rangle = k(x, x').
\tag{32}
\]

(ii) The space $H$ is called a reproducing kernel Hilbert space over $X$ if for all $x \in X$ the Dirac functional $\delta_x : H \longrightarrow \mathbb{R}$ defined by $\delta_x(f) := f(x)$, $f \in H$, is continuous.
From (32) it follows that $k$ is symmetric in its arguments and satisfies the conditions for positive definiteness; hence it is a kernel. In fact, every RKHS has a unique reproducing kernel $k$, which spans $H$ as follows,
\[
H = \operatorname{span}\{k(x, \cdot) \;|\; x \in X\}.
\]
On the other hand, given a kernel $k$ we want to define an RKHS such that $k$ is its (unique) reproducing kernel. Thm. 4.21 in [15, p. 121] states that every kernel admits a unique RKHS $H$, which can be defined as the completion of the set
\[
H_{pre} := \Big\{\sum_{i=1}^{n} \alpha_i\, k(x_i, \cdot) \;:\; n \in \mathbb{N},\; \alpha_1, \ldots, \alpha_n \in \mathbb{R},\; x_1, \ldots, x_n \in X\Big\}.
\]
For every $f = \sum_{i=1}^{n} \alpha_i\, k(x_i, \cdot) \in H_{pre}$ we have
\[
\|f\|_H^2 = \sum_{i,j=1}^{n} \alpha_i\,\alpha_j\, k(x_i, x_j).
\tag{33}
\]
Consequently, we have a bijection between kernels and RKHS of functions.
A.3.3 The Representer Theorem
Given a training dataset $(X, Y) \in X^N \times Y^N$, we consider the supervised learning problem where the space of functions is the RKHS $H$ of a kernel $k$. The following theorem provides the explicit form of the solution to this problem [14, p. 90] (with slight modifications in the notation).
Theorem A.1 (Representer theorem). Define

1. $\Omega : [0, \infty) \longrightarrow \mathbb{R}$ a strictly increasing monotonic function.
2. $\ell_Y : Y^N \times Y^N \longrightarrow \mathbb{R} \cup \{\infty\}$ an arbitrary error function.

This defines a loss function
\[
\ell(f) = \ell_Y\big(f(X), Y\big) + \Omega\big(\|f\|_H\big).
\tag{34}
\]
Then a minimizer of $\ell$ in $H$ admits a representation of the form
\[
f(\cdot) = \sum_{i=1}^{N} \alpha_i\, k(x_i, \cdot).
\tag{35}
\]
Example A.4 (Kernel ridge regression). Given $\lambda > 0$ and a kernel $k$ defining an RKHS $(H, \langle\cdot,\cdot\rangle_H)$ of functions mapping $X$ to $Y = \mathbb{R}$, the ridge regression solution approximates $f^{\dagger}$ with the minimizer of (20), where $\Omega(\|f\|_H) = \lambda\,\|f\|_H^2$ and $\ell_Y\big(f(X), Y\big) = \|f(X) - Y\|_{Y^N}^2$. The optimal parameter $\alpha$ is given by
\[
\alpha = \big(k(X, X) + \lambda I_N\big)^{-1} Y,
\tag{36}
\]
where $k(X, X)_{ij} = k(x_i, x_j)$. Given a new input $x$, use (35) and (36) to write the solution as
\[
f(x) = k(x, X)\,\big(k(X, X) + \lambda I_N\big)^{-1} Y,
\tag{37}
\]
where $k(x, X) = \big(k(x, x_1), \ldots, k(x, x_N)\big)$. Observe that for $\lambda = 0$, the function interpolates the training data exactly, meaning that $f(x_i) = y_i$. The ridge loss is given by the formula
\[
\ell_{ridge}\big(f(X), Y\big) = \lambda\, Y^T\big[k(X, X) + \lambda I_N\big]^{-1} Y.
\tag{38}
\]
Since the function $f$ is entirely determined by the kernel $k$ and the penalty $\lambda$, we will write $\ell(X, Y)$ for $\ell_{ridge}\big(f(X), Y\big)$.
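As a small illustration of (36)–(38), kernel ridge regression with a Gaussian kernel (the kernel choice and data are ours, for illustration only):

```python
import numpy as np

def gauss_kernel(A, B, gamma=0.5):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-gamma * |a_i - b_j|^2), an example choice."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (40, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)
lam = 0.1

K = gauss_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), Y)          # Eq. (36)
x_new = np.array([[0.5]])
print(gauss_kernel(x_new, X) @ alpha)                         # prediction, Eq. (37)
print(lam * Y @ np.linalg.solve(K + lam * np.eye(len(X)), Y)) # ridge loss, Eq. (38)
```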
A.4 Operator-valued kernels
A.4.1 Introduction and main results
Operator-valued kernels are described in detail in [6]. The need to introduce them arises when a supervised learning problem is considered in a functional setting: each attribute of the data is a function, and the label of each data point is also a function. Similarly to scalar-valued kernels, operator-valued kernels are in one-to-one correspondence with RKHSs of functions $f : X \longrightarrow Y$. With operator-valued kernels, the training of weights and biases of neural networks is equivalent to identifying the minimal function in a suitable RKHS defined by such a kernel. We provide a quick overview of the theory of operator-valued kernels as presented in [10, Sec. 2].
Definition A.2 (Operator-valued kernel). We call $K : X \times X \longrightarrow L(Y)$ an operator-valued kernel if

(1) $K$ is Hermitian, i.e.
\[
K(x, x') = K(x', x)^T \quad \text{for } x, x' \in X,
\tag{39}
\]
writing $A^T$ for the adjoint of the operator $A$ with respect to $\langle\cdot,\cdot\rangle_Y$.
(2) non-negative, i.e.
\[
\sum_{i,j=1}^{m} \big\langle y_i, K(x_i, x_j)\, y_j\big\rangle_Y \geq 0, \quad \text{for } (x_i, y_i) \in X \times Y,\; m \in \mathbb{N}.
\tag{40}
\]
We call $K$ non-degenerate if $\sum_{i,j=1}^{m} \langle y_i, K(x_i, x_j)\, y_j\rangle_Y = 0$ implies $y_i = 0$ for all $i$ whenever $x_i \neq x_j$ for $i \neq j$.
For operator-valued kernels, a feature map and a feature space are defined as follows.

Definition A.3 (Feature space/map). $F$ and $\psi : X \longrightarrow L(Y, F)$ are a feature space and feature map for the kernel $K$ if, for all $(x, x', y, y') \in X^2 \times Y^2$,
\[
y^T K(x, x')\, y' = \big\langle \psi(x)\, y, \psi(x')\, y'\big\rangle_F.
\tag{41}
\]
Write $\psi^T(x) \in L(F, Y)$ for the adjoint of $\psi(x)$ and $\alpha^T \alpha' := \langle \alpha, \alpha'\rangle_F$ for the inner product in $F$, so that we can write $K(x, x')$ as
\[
K(x, x') = \psi^T(x)\,\psi(x').
\tag{42}
\]
For $\alpha \in F$, write $\psi^T \alpha$ for the function $X \longrightarrow Y$ mapping $x$ to the element $y$ such that
\[
\langle y', y\rangle_Y = \big\langle y', \psi^T(x)\,\alpha\big\rangle_Y = \big\langle \psi(x)\, y', \alpha\big\rangle_F.
\tag{43}
\]
The following theorem characterizes the RKHS of an operator-valued kernel.