Special Issue dedicated to Robert Schaback on the occasion of his 75th birthday, Volume 15 · 2022 · Pages 125–145

Mean-field limits of trained weights in deep learning: A dynamical systems perspective

Alexandre Smirnov^a · Boumediene Hamzi^b · Houman Owhadi^c

^a Department of Mathematics, Imperial College London, United Kingdom. Email: alexandre.smirnov20@imperial.ac.uk
^b Department of Computing and Mathematical Sciences, Caltech, CA, USA. Email: boumediene.hamzi@gmail.com
^c Department of Computing and Mathematical Sciences, Caltech, CA, USA. Email: owhadi@caltech.edu

Communicated by Gabriele Santin
Abstract
Training a residual neural network with $L^2$ regularization on weights and biases is equivalent to minimizing a discrete least action principle and to controlling a discrete Hamiltonian system representing the propagation of input data across layers. The kernel/feature map analysis of this Hamiltonian system suggests a mean-field limit for trained weights and biases as the number of data points goes to infinity. The purpose of this paper is to investigate this mean-field limit and illustrate its existence through numerical experiments and analysis (for simple kernels).
1 Introduction
Supervised learning is a class of learning problems which seek to find a relationship between a set of predictor variables (the inputs) and a set of target variables (the outputs). The target, also called the dependent variable, can correspond to a quantitative measurement (e.g., height, stock prices) or to a qualitative measurement (e.g., sex, class). The problem of predicting quantitative variables is called regression, whereas classification problems aim at predicting qualitative variables. Over many years of research, various models have been developed to solve both types of problems. In particular, artificial neural network (ANN) models have become very popular because their complex architecture allows the model to capture underlying patterns in the data. This architecture has proved successful in different areas of artificial intelligence such as image processing, speech recognition, and text mining. ANNs transform the inputs using a composition of consecutive mappings, generally represented by a directed acyclic graph. With this representation, the inputs are propagated through multiple layers, defining a trajectory that is optimized by training the parameters of the network.
In this paper we investigate the propagation of the inputs from a dynamical systems perspective, an approach recently explored by Houman Owhadi in [10]. Owhadi shows that the minimizers of $L^2$-regularized ResNets, a particular class of ANNs, satisfy a discrete least action principle implying the near preservation of the norm of weights and biases across layers. The parameters of trained ResNets can be identified as solutions of a Hamiltonian system defined by the activation function and the architecture of the ANN. The form of the Hamiltonian suggests that it is amenable to mean-field limit analysis as the number of data points increases. Furthermore, when the mean-field limit holds, the trajectory of inputs across layers is given by a mean-field Hamiltonian system where position and momentum variables are nearly decoupled through mean-field averaging. The purpose of this paper is to analyse and numerically illustrate this mean-field limit by investigating the convergence of the trained weights and biases of the model (appearing in the Hamiltonian in our kernel/feature map setting) as the number of data points goes to infinity.
In Section A we define the mathematical setting of supervised learning and introduce kernel learning, a class of problems that
encompasses ANNs and which is used throughout this paper. In Section B we explain how to identify the optimization of neural
networks as a dynamical system problem and characterize the optimal trajectory of the inputs. In Section 2 we reformulate the
dynamics using the feature space representation of kernels. With this representation we show that the trajectory is determined by
a parameter amenable to mean-field limit analysis. In Section 3, we simulate this parameter on regression and classification
datasets.
All the results in the Appendices are a review of some results in Owhadi's paper [10].
2 Mean-field analysis of Hamiltonian dynamics
In Section B, we derived the Hamiltonian representation of minimizers of mechanical regression and idea registration. We stated that the propagation of the inputs is completely determined by the initial momentum $p(0)$. In this section, we derive
another system which does not involve the momentum. The system is obtained for an operator-valued kernel $\Gamma$ with feature map $\psi : \mathcal{X} \longrightarrow \mathcal{L}(\mathcal{X}, \mathcal{F})$ such that (42) holds,
$$\Gamma(x, x') = \psi^T(x)\,\psi(x').$$
Then, the ODE for the trajectory is of the form
$$\dot{q}_i = \psi^T(q_i)\,\alpha(t), \qquad q_i(0) = x_i, \tag{1}$$
for some mapping $\alpha(t)$. We characterize $\alpha$ using the feature map identification (44) of the RKHS $\mathcal{V}$ and show that this mapping is amenable to mean-field limit analysis. We also discuss a data-based method to simulate this limit.
2.1 Feature space representation of kernels
We reformulate mechanical regression and idea registration using the feature space representation of kernels (cf. Sec. A.4) as presented in [10, Sec. 6].
2.1.1 Mechanical regression
The following theorem characterizes the minimizers of mechanical regression in the feature space of $\Gamma$.
Theorem 2.1 (Mechanical regression in feature space). Let $\Gamma$ be a kernel of RKHS $\mathcal{V}$ with associated feature space $\mathcal{F}$ and feature map $\psi : \mathcal{X} \longrightarrow \mathcal{L}(\mathcal{X}, \mathcal{F})$. Then, $\alpha_1, \ldots, \alpha_L$ satisfy
$$\text{Minimize} \quad \frac{\nu L}{2} \sum_{s=1}^{L} \|\alpha_s\|_{\mathcal{F}}^2 + \ell(\varphi_L(X), Y), \quad \text{over } \alpha_1, \ldots, \alpha_L \in \mathcal{F}, \tag{2}$$
if and only if the $v_s(\cdot) = \psi^T(\cdot)\,\alpha_s \in \mathcal{V}$ minimize (59) and
$$\varphi_L(\cdot) = \big(I + \psi^T(\cdot)\,\alpha_L\big) \circ \cdots \circ \big(I + \psi^T(\cdot)\,\alpha_1\big). \tag{3}$$
This theorem identifies the minimizers $v_s$ with parameters $\alpha_s$ belonging to the feature space $\mathcal{F}$ (possibly infinite-dimensional). In practice, we choose $\Gamma$ to be a scalar operator-valued kernel (see Def. A.4) obtained with a finite-dimensional feature space $\mathcal{F}$ and map $\varphi$,
$$\Gamma(x, x') = \varphi^T(x)\,\varphi(x')\, I_{\mathcal{X}}.$$
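As a concrete illustration, the following minimal numpy sketch builds such a scalar operator-valued kernel from a finite-dimensional feature map. The random-weight tanh feature map is an arbitrary placeholder (not the map $\varphi$ of (53)); it is used only to show the factorization $\Gamma(x, x') = \varphi^T(x)\,\varphi(x')\, I_{\mathcal{X}}$.

```python
import numpy as np

# Minimal sketch (not the paper's code): a scalar operator-valued kernel
# Gamma(x, x') = (phi(x) . phi(x')) * I_X built from a finite-dimensional feature map.
# The random-weight tanh feature map below is an illustrative placeholder.

rng = np.random.default_rng(0)
dim_X, dim_F = 2, 16                     # input dimension, feature-space dimension
W = rng.standard_normal((dim_F, dim_X))  # fixed weights of the placeholder feature map
b = rng.standard_normal(dim_F)

def phi(x):
    """Placeholder feature map phi: X -> F (random tanh features)."""
    return np.tanh(W @ x + b)

def Gamma(x, x_prime):
    """Scalar operator-valued kernel Gamma(x, x') = (phi(x) . phi(x')) * I_X."""
    return (phi(x) @ phi(x_prime)) * np.eye(dim_X)

x, x_prime = rng.standard_normal(dim_X), rng.standard_normal(dim_X)
print(Gamma(x, x_prime))                 # a dim_X x dim_X multiple of the identity
```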
Using Lemma A.4, $\{\alpha_s\}_{s=1}^{L}$ are matrices in $\mathbb{R}^{\dim(\mathcal{X}) \times \dim(\mathcal{F})}$ and $\|\alpha_s\|_{\mathcal{F}}^2$ corresponds to a matrix Frobenius norm. These parameters define the optimal discrete trajectory $q_1, \ldots, q_{L+1}$,
$$q_{s+1} = q_s + \alpha_s\, \varphi(q_s), \quad \text{for } s = 1, \ldots, L, \qquad q_1 = X, \tag{4}$$
where we used $\psi^T(\cdot)\,\alpha_s = \alpha_s\, \varphi(\cdot)$. By introducing $\Delta t = 1/L$ and $\alpha_s = \Delta t\, \tilde{\alpha}_s$, the system is equivalent to
$$q_{s+1} = q_s + \Delta t\, \tilde{\alpha}_s\, \varphi(q_s), \quad \text{for } s = 1, \ldots, L, \qquad q_1 = X, \tag{5}$$
where $\{\tilde{\alpha}_s\}_{s=1}^{L}$ minimize
$$\frac{\nu}{2} \sum_{s=1}^{L} \|\tilde{\alpha}_s\|_{\mathcal{F}}^2\, \Delta t + \ell\big(q_{L+1}(X), Y\big), \quad \text{where } q_{L+1} = \varphi_L(X). \tag{6}$$
Equation (5) is a discrete dynamical system which approximates (1) (for each input $x_i$). The time-dependent parameter $\alpha(t)$ is approximated by $\{\tilde{\alpha}_s\}$ with $\tilde{\alpha}_s \approx \alpha(t_s)$ at time $t_s = s/L$.
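To make the recursion (5) concrete, the sketch below propagates a batch of inputs through $L$ residual layers. The feature map and the parameters $\tilde{\alpha}_s$ are random placeholders standing in for trained values that would come from minimizing (6).

```python
import numpy as np

# Sketch of the discrete trajectory (5): q_{s+1} = q_s + dt * alpha_tilde_s * phi(q_s).
# The feature map and the parameters alpha_tilde[s] are random placeholders; in the
# paper they would be obtained by minimizing the regularized objective (6).

rng = np.random.default_rng(1)
dim_X, dim_F, L, N = 2, 16, 10, 100      # input dim, feature dim, layers, data points
dt = 1.0 / L
W = rng.standard_normal((dim_F, dim_X))
b = rng.standard_normal(dim_F)

def phi(q):
    """Placeholder feature map applied row-wise to a batch q of shape (N, dim_X)."""
    return np.tanh(q @ W.T + b)

alpha_tilde = 0.1 * rng.standard_normal((L, dim_X, dim_F))  # stand-ins for trained parameters

q = rng.standard_normal((N, dim_X))      # q_1 = X, the N input points
for s in range(L):
    q = q + dt * phi(q) @ alpha_tilde[s].T   # one residual layer, i.e. one Euler step
print(q.shape)                           # q_{L+1}(X), the propagated inputs
```

With $\Delta t = 1/L$, each loop iteration is exactly one layer of the ResNet-style recursion, which is why (5) can also be read as a forward-Euler discretization of (1).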
2.1.2 Idea registration
The next theorem characterizes $\alpha$ in the feature map representation of idea registration.
Theorem 2.2 (Idea registration in feature space). The $\alpha(t)$ satisfy
$$\text{Minimize} \quad \frac{\nu}{2} \int_0^1 \|\alpha(t)\|_{\mathcal{F}}^2\, dt + \ell\big(\varphi^v(X, 1), Y\big) \quad \text{over } \alpha \in C([0, 1], \mathcal{F}) \tag{7}$$
if and only if $v(\cdot, t) = \psi^T(\cdot)\,\alpha(t)$ and $\varphi^v(x, t)$ minimize (69). Furthermore, at the minimum, $\|\alpha(t)\|_{\mathcal{F}}^2$ is constant over $t \in [0, 1]$.
Hence, $\alpha$ is identified as the minimizer of (7), and (71) implies $\dot{q}_i = \psi^T(q_i)\,\alpha(t)$. If, in addition, $\Gamma$ is scalar operator-valued, the ODE for the trajectory becomes
$$\dot{q}_i = \alpha(t)\,\varphi(q_i), \qquad q_i(0) = x_i. \tag{8}$$
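Once $\alpha(t)$ is available, (8) can be integrated with any standard ODE solver. The following minimal sketch does this for a single input point; the feature map $\varphi$ and the time profile chosen for $\alpha(t)$ are placeholders, since the actual $\alpha(t)$ is the minimizer of (7).

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sketch: integrating the trajectory ODE (8), dq/dt = alpha(t) phi(q), q(0) = x_i,
# for a single input point with an off-the-shelf solver. The feature map phi and the
# time profile of alpha(t) are placeholders; in the paper alpha(t) minimizes (7).

rng = np.random.default_rng(2)
dim_X, dim_F = 2, 16
W = rng.standard_normal((dim_F, dim_X))
b = rng.standard_normal(dim_F)
A = 0.1 * rng.standard_normal((dim_X, dim_F))

phi = lambda q: np.tanh(W @ q + b)            # placeholder feature map
alpha = lambda t: np.cos(np.pi * t) * A       # placeholder alpha(t) in R^{dim_X x dim_F}

def rhs(t, q):
    """Right-hand side of (8)."""
    return alpha(t) @ phi(q)

x_i = np.array([0.5, -1.0])                   # q_i(0) = x_i
sol = solve_ivp(rhs, (0.0, 1.0), x_i)
print(sol.y[:, -1])                           # q_i(1), the input propagated to time t = 1
```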
2.1.3 Equivalence with ResNet block minimizers
In Sec. 2.1.1 we reformulated mechanical regression using the feature map representation of $\Gamma$. In addition, if $\Gamma$ is a SOV kernel, the minimizers are identified with matrices $\{\alpha_s\}_{s=1}^{L}$. In the special case that $\Gamma$ and $K$ have the form in (54), mechanical regression is equivalent to minimizing one ResNet block (see Ex. A.3) with $L^2$ regularization on weights and biases. This is summarized in the following theorem.
Theorem 2.3. If $\Gamma(x, x') = \varphi(x)^T \varphi(x')\, I_{\mathcal{X}}$ and $K(x, x') = \varphi(x)^T \varphi(x')\, I_{\mathcal{Y}}$, where $\varphi(x)$ is given by (53), then minimizers of (59) are of the form
$$f(x) = \tilde{\alpha}\, \varphi(x) \quad \text{and} \quad v_s(x) = \alpha_s\, \varphi(x), \tag{9}$$
where $\tilde{\alpha} = (\tilde{W}, \tilde{b}) \in \mathcal{L}(\mathcal{X} \oplus \mathbb{R}, \mathcal{Y})$ and $\alpha_s = (W_s, b_s) \in \mathcal{L}(\mathcal{X} \oplus \mathbb{R}, \mathcal{X})$ are minimizers of
$$\min_{\tilde{\alpha}, \alpha_1, \ldots, \alpha_L} \; \frac{\nu L}{2} \sum_{s=1}^{L} \|\alpha_s\|_{\mathcal{L}(\mathcal{X} \oplus \mathbb{R}, \mathcal{X})}^2 + \lambda\, \|\tilde{\alpha}\|_{\mathcal{L}(\mathcal{X} \oplus \mathbb{R}, \mathcal{Y})}^2 + \|f(\varphi_L(X)) - Y\|_{\mathcal{Y}^N}^2, \tag{10}$$
and $\|\alpha\|_{\mathcal{L}(\mathcal{F}, \mathcal{Z})}$ corresponds to the Frobenius norm of the linear map $\alpha : \mathcal{F} \longrightarrow \mathcal{Z}$.
The approximation $f^\ddagger = f \circ \varphi_L$ is of the form
$$f \circ \varphi_L(x) = (\tilde{\alpha}\, \varphi) \circ (I + \alpha_L\, \varphi) \circ \cdots \circ (I + \alpha_1\, \varphi)(x). \tag{11}$$
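For illustration, the sketch below evaluates the composed map (11) and the objective (10) at randomly chosen parameters, assuming for the purpose of the sketch that the feature map of (53) has the form $\varphi(x) = (a(x), 1)$ with an elementwise activation $a$ (tanh here), so that $\alpha_s\, \varphi(x) = W_s\, a(x) + b_s$. The regularization constants and all parameter values are arbitrary placeholders, not trained weights.

```python
import numpy as np

# Sketch of the composed map (11) and the objective (10). We assume, for this sketch
# only, that the feature map of (53) is phi(x) = (a(x), 1) with elementwise activation
# a (tanh here), so that alpha_s phi(x) = W_s a(x) + b_s. All parameters are random
# placeholders and the regularization constants nu, lam are arbitrary.

rng = np.random.default_rng(4)
dim_X, dim_Y, L, N = 2, 1, 10, 100
nu, lam = 1e-2, 1e-2
a = np.tanh

X = rng.standard_normal((N, dim_X))
Y = rng.standard_normal((N, dim_Y))
Ws = 0.1 * rng.standard_normal((L, dim_X, dim_X))   # W_s
bs = 0.1 * rng.standard_normal((L, dim_X))          # b_s
W_out = rng.standard_normal((dim_Y, dim_X))         # W_tilde
b_out = rng.standard_normal(dim_Y)                  # b_tilde

def phi_L(X):
    """The L residual layers of (11): x -> x + W_s a(x) + b_s."""
    q = X
    for s in range(L):
        q = q + a(q) @ Ws[s].T + bs[s]
    return q

def objective():
    """The regularized objective (10), evaluated at the placeholder parameters."""
    pred = a(phi_L(X)) @ W_out.T + b_out             # f(phi_L(X)) = W_tilde a(.) + b_tilde
    reg_inner = (nu * L / 2) * sum(np.sum(Ws[s]**2) + np.sum(bs[s]**2) for s in range(L))
    reg_outer = lam * (np.sum(W_out**2) + np.sum(b_out**2))
    fit = np.sum((pred - Y)**2)                      # squared data-fidelity term
    return reg_inner + reg_outer + fit

print(objective())
```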
2.2 Mean-field analysis
2.2.1 The mean-field approximation
Mean-field theory aims at developing approximation strategies for systems composed of interacting particles. It was originally developed in statistical physics to study systems such as the Ising model [2]. The central idea behind mean-field theory is to approximate the interacting terms in the system by a simpler non-interacting function. Our aim is to apply this strategy to study the Hamiltonian (79),
$$H(q, p) = \frac{1}{2} \sum_{i,j=1}^{N} p_i^T\, \Gamma(q_i, q_j)\, p_j. \tag{12}$$
It can be viewed as a system of interacting particles. The feature map representation of $\Gamma$ allows us to split the interacting term $\Gamma(q_i, q_j)$ into the two factors $\psi^T(q_i)$ and $\psi(q_j)$. If we rescale the momentum $\bar{p}_j := N p_j$ and define
$$\alpha(t) := \frac{1}{N} \sum_{j=1}^{N} \psi(q_j(t))\, \bar{p}_j(t), \tag{13}$$
we can remove the interaction in the Hamiltonian:
$$H(q, p) = \frac{1}{2} \sum_{i=1}^{N} p_i^T\, \psi^T(q_i)\, \alpha(t).$$
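The decoupling can be checked numerically. The sketch below evaluates the Hamiltonian both in its interacting form (12) and through $\alpha(t)$ as in (13) for the scalar operator-valued kernel $\Gamma(x, x') = \varphi^T(x)\,\varphi(x')\, I_{\mathcal{X}}$; the data $(q, p)$ and the tanh feature map are synthetic placeholders, and the two values agree up to round-off.

```python
import numpy as np

# Sketch: the interacting Hamiltonian (12) versus its mean-field rewriting via (13)
# for the scalar operator-valued kernel Gamma(x, x') = (phi(x) . phi(x')) * I_X.
# q, p and the feature map phi are illustrative placeholders.

rng = np.random.default_rng(5)
dim_X, dim_F, N = 2, 16, 200
W = rng.standard_normal((dim_F, dim_X))
b = rng.standard_normal(dim_F)
phi = lambda q: np.tanh(q @ W.T + b)       # feature map applied row-wise, shape (N, dim_F)

q = rng.standard_normal((N, dim_X))
p = rng.standard_normal((N, dim_X)) / N    # momenta; p_bar = N * p is the rescaled momentum

# Interacting form (12): H = 1/2 sum_{i,j} p_i^T Gamma(q_i, q_j) p_j
G = phi(q) @ phi(q).T                      # Gram matrix of scalar kernel values
H_interacting = 0.5 * np.sum(G * (p @ p.T))

# Mean-field form: alpha = (1/N) sum_j psi(q_j) p_bar_j, then H = 1/2 sum_i p_i^T psi^T(q_i) alpha
p_bar = N * p
alpha = (p_bar.T @ phi(q)) / N             # dim_X x dim_F matrix, cf. (13)
H_meanfield = 0.5 * np.sum((phi(q) @ alpha.T) * p)

print(H_interacting, H_meanfield)          # the two values coincide (up to round-off)
```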
We can rewrite the dynamics of $(q, p)$ as
$$\begin{cases} \dot{q}_i = \psi^T(q_i)\,\alpha \\ \dot{\bar{p}}_i = -\partial_x \end{cases}$$