Learning by Turning: Neural Architecture Aware Optimisation
Yang Liu*1  Jeremy Bernstein*2  Markus Meister2  Yisong Yue2
Abstract

Descent methods for deep networks are notoriously capricious: they require careful tuning of step size, momentum and weight decay, and which method will work best on a new benchmark is a priori unclear. To address this problem, this paper conducts a combined study of neural architecture and optimisation, leading to a new optimiser called Nero: the neuronal rotator. Nero trains reliably without momentum or weight decay, works in situations where Adam and SGD fail, and requires little to no learning rate tuning. Also, Nero's memory footprint is square root that of Adam or LAMB. Nero combines two ideas: (1) projected gradient descent over the space of balanced networks; (2) neuron-specific updates, where the step size sets the angle through which each neuron's hyperplane turns. The paper concludes by discussing how this geometric connection between architecture and optimisation may impact theories of generalisation in deep learning.
1. Introduction
Deep learning has brought on a new paradigm in computer
science, enabling artificial systems to interact with the world
at an unprecedented level of complexity. That said, the core
technology relies on various heuristic numerical techniques
that are sometimes brittle and often require extensive tuning.
A major goal of modern research in machine learning is to
uncover the principles underlying learning in neural systems,
and thus to derive more reliable learning algorithms.
Part of the challenge of this endeavour is that learning in
deep networks is an inherently coupled problem. Suppose
that training performance is sensitive to a particular detail
of the neural architecture—then it is unclear whether that
detail affects the expressivity of the architecture, or just the
ability of the descent method to train the architecture.
*Equal contribution. 1Abacus.AI. 2Caltech. Correspondence to: YL <yang@abacus.ai> and JB <bernstein@caltech.edu>. Code available at github.com/jxbz/nero.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the authors.
This observation motivates the
combined study
of archi-
tecture and optimisation, and this paper explores several
questions at that intersection. First of all:
⟨?⟩ What is the right domain of optimisation for a neural network's weights? Is it $\mathbb{R}^d$, or something more exotic—such as a Cartesian product of hyperspheres?
Typically, optimisation is conducted over $\mathbb{R}^d$, while a careful weight initialisation and a tuned weight decay hyperparameter impose a soft constraint on the optimisation domain.
Since normalisation schemes such as batch norm (Ioffe &
Szegedy, 2015) render the network invariant to the scale
of the weights, weight decay also plays a somewhat subtle
second role in modifying the effective learning rate. Hy-
perparameters with this kind of subtle coupling add to the
compounding cost of hyperparameter search.
Furthermore, descent methods such as Adam (Kingma &
Ba, 2015) and LAMB (You et al., 2020) use either synapse-
specific or layer-specific gradient normalisation. This moti-
vates a second question:
⟨?⟩ At what level of granularity should an optimiser work? Should normalisation occur per-synapse or per-layer—or perhaps, per-neuron?
This paper contends that in deep learning, hyperparameters
proliferate because of hidden couplings between optimiser
and architecture. By studying the above questions, and dis-
tilling simple rules that govern optimisation and architecture,
this paper aims to make deep learning less brittle—and less
sensitive to opaque hyperparameters.
Summary of contributions:
1. A new optimiser—Nero: the neuronal rotator. Nero performs per-neuron projected gradient descent, and uses square root the memory of Adam or LAMB.

2. Experiments across image classification, image generation, natural language processing and reinforcement learning, in which Nero's out-of-the-box configuration tends to outperform tuned baseline optimisers.

3. Discussion of how the connection between optimisation and architecture relates to generalisation theories, such as PAC-Bayes and norm-based complexity.
2. Related work
This section reviews relevant work pertaining to both neural
architecture design and optimisation in machine learning,
and concludes with a bridge to the neuroscience literature.
2.1. Neural Architecture Design
The importance of wiring constraints for the stable function
of engineered neural systems is not a new discovery. One
important concept is that of
balanced excitation and inhibi-
tion
. For instance, Rosenblatt (1958) found that balancing
the proportion of excitatory and inhibitory synaptic connec-
tions made his perceptron more robust to varying input sizes.
Another concept relates to the
total magnitude of synapse
strengths
. For example, Rochester et al. (1956) constrained
the sum of magnitudes of synapses impinging on a neuron
so as to stabilise the process of learning. Similar ideas were
explored by von der Malsburg (1973) and Miller & MacKay
(1994). These works are early predecessors to this paper’s
definition of
balanced networks
given in Section 3.1.
Given the resurgence of neural networks over the last decade,
the machine learning community has taken up the mantle
of research on neural architecture design. Special weight
scalings—such as
Xavier init
(Glorot & Bengio, 2010) and
Kaiming init
(He et al., 2015)—have been proposed to sta-
bilise signal transmission through deep networks. These
scalings are only imposed at initialisation and are free to
wander during training—an issue which may be addressed
by tuning a weight decay hyperparameter. More recent
approaches—such as batch norm (Ioffe & Szegedy, 2015)—
explicitly control activation statistics throughout training by
adding extra normalisation layers to the network.
Other recent normalisation techniques lie closer to the
work of Rosenblatt (1958) and Rochester et al. (1956).
Techniques that involve constraining a neuron’s weights
to the unit hypersphere include: weight norm (Salimans &
Kingma, 2016), decoupled networks (Liu et al., 2017; 2018)
and orthogonal parameterised training (Liu et al., 2021).
Techniques that also balance excitation and inhibition in-
clude centred weight norm (Huang et al., 2017) and weight
standardisation (Qiao et al., 2019).
2.2. Descent Methods in Deep Learning
Much classic work in optimisation theory focuses on de-
riving convergence results for descent methods under as-
sumptions such as
convexity
(Boyd & Vandenberghe, 2004)
and
Lipschitz continuity of the gradient
(Nesterov, 2004).
These simplifying assumptions are often used in the ma-
chine learning literature. For instance, Bottou et al. (2018)
provide convergence guarantees for stochastic gradient de-
scent (SGD) under each of these assumptions. However,
these assumptions do not hold in deep learning (Sun, 2019).
On a related note, SGD is not the algorithm of choice in
many deep learning applications, and heuristic methods
such as RMSprop (Tieleman & Hinton, 2012) and Adam
(Kingma & Ba, 2015) often work better. For instance, Adam
often works much better than SGD for training generative
adversarial networks (Bernstein et al., 2020a). Yet the theory
behind Adam is poorly understood (Reddi et al., 2018).
A more recent line of work has explored optimisation meth-
ods that make
relative updates
to neural network param-
eters. Optimisers like LARS (You et al., 2017), LAMB
(You et al., 2020) and Fromage (Bernstein et al., 2020a)
make per-layer relative updates, while Madam (Bernstein
et al., 2020b) makes per-synapse relative updates. You et al.
(2017) found that these methods stabilise large batch train-
ing, while Bernstein et al. (2020a) found that they require
little to no learning rate tuning across tasks.
Though these recent methods partially account for the neural
architecture—by paying attention to its layered operator
structure—they do not rigorously address the optimisation
domain. As such, LARS and LAMB require a tunable
weight decay hyperparameter, while Fromage and Madam
restrict the optimisation to a bounded set of tunable size
(i.e. weight clipping). Without this additional tuning, these
methods can be unstable—see for instance (Bernstein et al.,
2020a, Figure 2) and (Bernstein et al., 2020b, Figure 3).
The discussion in the previous paragraph typifies the ma-
chine learning state of the art: optimisation techniques that
work well, albeit only after hyperparameter tuning. For
instance, LAMB is arguably the state-of-the-art relative
optimiser, but it contains in total
five
tunable hyperparame-
ters. Since—at least naïvely—the cost of hyperparameter
search is exponential in the number of hyperparameters, the
prospect of fully tuning LAMB is computationally daunting.
2.3. Homeostatic Control in Neuroscience
Since the brain is a system that must learn stably without
hyperparameter do-overs, it is worth looking to neuroscience
for inspiration on designing better learning algorithms.
A major swathe of neuroscience research studies mecha-
nisms by which the brain performs homeostatic control.
For instance, neuroscientists report a form of homeosta-
sis termed
synaptic scaling
, where a neuron modulates the
strengths of all its synapses to stabilise its firing rate (Tur-
rigiano, 2008). More generally,
heterosynaptic plasticity
refers to homeostatic mechanisms that modulate the strength
of unstimulated synapses (Chistiakova et al., 2015). Shen
et al. (2020) review connections to normalisation methods
used in machine learning.
These observations inspired this paper to consider imple-
menting homeostatic control via projected gradient descent—
leading to the Nero optimiser.
3. Background Theory
In general, an $L$-layer neural network $f(\cdot)$ is a composition of $L$ simpler functions $f_1(\cdot), \ldots, f_L(\cdot)$:

$$f(x) = f_L \circ f_{L-1} \circ \ldots \circ f_1(x). \qquad \text{(forward pass)}$$
Due to this compositionality, any slight ill-conditioning in the simple functions $f_i(\cdot)$ has the potential to compound over layers, making the overall network $f(\cdot)$ very ill-conditioned. Architecture design should aim to prevent this from happening, as will be covered in Section 3.1.
The Jacobian $\partial f / \partial f_l$, which plays a key role in evaluating gradients, also takes the form of a deep product:

$$\frac{\partial f}{\partial f_l} = \frac{\partial f_L}{\partial f_{L-1}} \cdot \frac{\partial f_{L-1}}{\partial f_{L-2}} \cdot \ldots \cdot \frac{\partial f_{l+1}}{\partial f_l}. \qquad \text{(backward pass)}$$
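To make the product structure concrete, the following minimal NumPy sketch (not from the paper; the toy three-layer tanh network and all names are illustrative assumptions) chains per-layer Jacobians to form $\partial f / \partial f_1$ and checks the result against finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 4, 3
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(L)]

def layer(h, W):
    return np.tanh(W @ h)                        # f_l(h) = tanh(W_l h)

def layer_jacobian(h, W):
    return np.diag(1 - np.tanh(W @ h) ** 2) @ W  # d f_l / d h at input h

# forward pass: store x, f_1(x), f_2(f_1(x)), ...
x = rng.standard_normal(d)
hs = [x]
for W in Ws:
    hs.append(layer(hs[-1], W))

# backward pass for l = 1: chain the Jacobians of the layers above layer 1
J = np.eye(d)
for l in range(1, L):
    J = layer_jacobian(hs[l], Ws[l]) @ J

# finite-difference check of d f / d f_1
def f_from_h1(h1):
    for W in Ws[1:]:
        h1 = layer(h1, W)
    return h1

eps = 1e-6
J_fd = np.stack([(f_from_h1(hs[1] + eps * e) - f_from_h1(hs[1] - eps * e)) / (2 * eps)
                 for e in np.eye(d)], axis=1)
print(np.max(np.abs(J - J_fd)))  # small: the chained product matches the true Jacobian
```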
Therefore, it is also important from the perspective of
gradient-based optimisation that compositionality is ade-
quately addressed, as will be covered in Section 3.2.
3.1. Balanced Network Architectures
A common strategy to mitigate the issue of compounding
ill-conditioning is to explicitly re-normalise the activations
at every network layer. Batch norm (Ioffe & Szegedy, 2015)
exemplifies this strategy, and was found to improve the
trainability of deep residual networks. Batch norm works by
standardising the activations across a batch of inputs at each
network layer—that is, it shifts and scales the activations to
have mean zero and variance one across a batch.
Although batch norm works well, it adds computational
overhead to both the forward and backward pass. To explore
how far one can get without explicit re-normalisation, the
following definitions are useful:
Definition 1. A neuron is balanced if its weight vector $w \in \mathbb{R}^d$ satisfies the following constraints:

$$\sum_{i=1}^d w_i = 0; \qquad \text{(balanced excitation \& inhibition)}$$
$$\sum_{i=1}^d w_i^2 = 1. \qquad (\ell_2 \text{ constant sum rule})$$
Definition 2.
A network is
balanced
if all its constituent
neurons are balanced.
As noted by Huang et al. (2017), balanced neurons attain
some of the properties of batch norm for free. To see this,
consider a linear neuron $y = \sum_i w_i x_i$ with inputs $x_i$ that are uncorrelated with mean $\mu$ and variance $1$. Then the output $y$ is standardised:

$$\mathbb{E}[y] = \sum_i w_i\, \mathbb{E}[x_i] = \mu \sum_i w_i = 0;$$
$$\mathrm{Var}[y] = \sum_i w_i^2\, \mathrm{Var}[x_i] = \sum_i w_i^2 = 1.$$
While the assumptions on the inputs $x_i$ are unlikely to hold exactly, under more general conditions the constraints may at least encourage the standardisation of activation statistics through the layers of the network (Brock et al., 2021).
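The claim can be checked numerically. The sketch below (an illustration, not code from the paper) draws a balanced weight vector and uncorrelated inputs with mean $\mu$ and unit variance, and confirms that the output is approximately standardised.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples, mu = 1000, 100_000, 3.0

# a balanced weight vector: zero mean, unit l2 norm (Definition 1)
w = rng.standard_normal(d)
w -= w.mean()
w /= np.linalg.norm(w)

# uncorrelated inputs with mean mu and variance 1 (the stated assumptions)
x = mu + rng.standard_normal((n_samples, d))
y = x @ w

print(y.mean(), y.var())  # approximately 0 and 1, as the derivation predicts
```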
3.2. Stable Descent Steps
Since a network is trained via perturbations to its parameters,
it is important to know what size perturbations are appro-
priate. Consider an $L$-layer network with weight matrices $W = (W_1, W_2, \ldots, W_L)$ and loss function $\mathcal{L}(W)$. For a perturbation $\Delta W = (\Delta W_1, \Delta W_2, \ldots, \Delta W_L)$, the following definition establishes a notion of stable step size:
Definition 3. Let $\theta_l$ denote the angle between $\Delta W_l$ and $-\nabla_{W_l} \mathcal{L}(W)$. A descent step is stable if for all $l = 1, \ldots, L$:

$$\frac{\|\nabla_{W_l} \mathcal{L}(W + \Delta W) - \nabla_{W_l} \mathcal{L}(W)\|_F}{\|\nabla_{W_l} \mathcal{L}(W)\|_F} \leq \cos \theta_l. \qquad (1)$$
Or in words: for each layer, the relative change in gradient
induced by the perturbation should not exceed the cosine of
the angle between the perturbation and the negative gradient.
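As a toy illustration (not an example from the paper), the quadratic loss $\mathcal{L}(W) = \tfrac{1}{2}\|W\|_F^2$ has gradient $W$, so a plain gradient step $\Delta W = -\eta\, \nabla \mathcal{L}(W)$ changes the gradient by a relative amount $\eta$ while $\cos\theta = 1$; Inequality 1 then certifies stability exactly when $\eta \leq 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
grad = lambda W: W                 # gradient of the toy loss L(W) = ||W||_F^2 / 2

for eta in (0.5, 2.0):
    dW = -eta * grad(W)            # step along the negative gradient, so cos(theta) = 1
    rel_change = np.linalg.norm(grad(W + dW) - grad(W)) / np.linalg.norm(grad(W))
    print(eta, rel_change, rel_change <= 1.0)   # certified stable only for eta <= 1
```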
This definition is useful because a stable descent step is guaranteed to decrease a continuously differentiable loss function $\mathcal{L}(W)$ (Bernstein et al., 2020a). Still, extracting a stable step $\Delta W$ directly from Inequality 1 would require first computing extra gradients $\nabla_{W_l} \mathcal{L}(W + \Delta W)$. Bernstein et al. (2020a) proposed the following model to avoid this:
Definition 4. The loss function obeys deep relative trust if for all perturbations $\Delta W = (\Delta W_1, \Delta W_2, \ldots, \Delta W_L)$:

$$\frac{\|\nabla_{W_l} \mathcal{L}(W + \Delta W) - \nabla_{W_l} \mathcal{L}(W)\|_F}{\|\nabla_{W_l} \mathcal{L}(W)\|_F} \leq \prod_{k=1}^{L} \left(1 + \frac{\|\Delta W_k\|_F}{\|W_k\|_F}\right) - 1.$$
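A quick calculation (illustrative, not from the paper) shows how the bound compounds with depth: a 1% relative perturbation to every layer of a 50-layer network is allowed to change each layer's gradient by roughly 64%.

```python
import numpy as np

L, rel = 50, 0.01          # 50 layers, 1% relative perturbation per layer
bound = np.prod(np.full(L, 1.0 + rel)) - 1.0
print(bound)               # ≈ 0.64: small per-layer changes compound multiplicatively
```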
While deep relative trust is based on a perturbation analysis
of
L
-layer perceptrons (Bernstein et al., 2020a, Theorem 1),
the key idea is that its product structure explicitly models
the product structure of the network’s backward pass.
The deep relative trust model suggests that a stable descent
step should involve small relative perturbations
per layer
.
This motivates the layer-wise family of descent methods
(You et al., 2017; 2020). Still, it is unclear whether layers
are the right base object to consider. Perhaps a more refined
analysis would replace the layers appearing in Definition 4
with individual
neurons
or even
synapses
.
Small relative perturbations per-synapse were explored by
Bernstein et al. (2020b) and found to slightly degrade train-
ing performance compared to Adam and SGD. But this
paper will explore the per-neuron middle ground:
Definition 5. A step of size $\eta > 0$ is said to be per-neuron relative if for any neuron with weights $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$, the perturbations $\Delta w \in \mathbb{R}^d$ and $\Delta b \in \mathbb{R}$ satisfy:

$$\|\Delta w\|_2 / \|w\|_2 \leq \eta \quad \text{and} \quad |\Delta b| / |b| \leq \eta.$$
A per-neuron relative update is automatically per-layer relative. To see this, consider a weight matrix $W$ whose $N$ rows correspond to $N$ neurons $w^{(1)}, \ldots, w^{(N)}$. Then:

$$\frac{\|\Delta W\|_F}{\|W\|_F} = \sqrt{\frac{\sum_{i=1}^N \|\Delta w^{(i)}\|_2^2}{\sum_{i=1}^N \|w^{(i)}\|_2^2}} \leq \sqrt{\frac{\sum_{i=1}^N \eta^2 \|w^{(i)}\|_2^2}{\sum_{i=1}^N \|w^{(i)}\|_2^2}} = \eta. \qquad (2)$$
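The inequality can be sanity-checked numerically. In the sketch below (an illustration with assumed shapes, not code from the paper), each row of the perturbation is scaled to exactly $\eta$ times the norm of the corresponding neuron, and the layer-level Frobenius ratio indeed does not exceed $\eta$.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.01
W = rng.standard_normal((64, 128))       # 64 neurons, each with fan-in 128

# per-neuron relative perturbation: row i has norm eta * ||w^(i)||_2
dW = rng.standard_normal(W.shape)
dW *= eta * (np.linalg.norm(W, axis=1, keepdims=True)
             / np.linalg.norm(dW, axis=1, keepdims=True))

ratio = np.linalg.norm(dW) / np.linalg.norm(W)   # Frobenius norms
print(ratio, ratio <= eta + 1e-12)               # equals eta here, matching Equation 2
```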
4. Nero: the Neuronal Rotator
Following the discussion in Section 3, this paper will con-
sider an optimisation algorithm that makes
per-neuron rel-
ative updates
(Definition 5) constrained to the space of
balanced networks
(Definition 2).
Since a balanced neuron is constrained to the unit hypersphere, a per-neuron relative update with step size $\eta$ corresponds to a pure rotation of the neuron's weight vector by angle $\approx \eta$. To see this, take $\eta$ small in the following picture:

[Figure: two unit-length weight vectors, $w$ with $\|w\|_2 = 1$ and $w + \Delta w$ with $\|w + \Delta w\|_2 \approx 1$, connected by a perturbation of length $\|\Delta w\|_2 = \eta$; for small $\eta$ the perturbation acts as a rotation through angle $\approx \eta$.]
Hence, this paper proposes Nero: the neuronal rotator.
Nero’s goal is to reduce the burden of hyperparameter tun-
ing by baking architectural information into the optimiser.
More concretely, the anticipated advantages are as follows:
1.
Since per-neuron relative updates are automatically
per-layer relative by Equation 2, they should inherit the
properties of per-layer updates—in particular, stability
across batch sizes (You et al., 2017) while needing little
to no learning rate tuning (Bernstein et al., 2020a).
2.
Since balanced networks place hard constraints on the
norm of a neuron’s weights, the need for initialisation
tuning and weight decay on these weights is removed.
3.
Gradients are often normalised by running averages,
in order to retain relative scale information between
successive minibatch gradients (Tieleman & Hinton,
2012). Along with momentum, this is the main mem-
ory overhead of Adam and LAMB compared to vanilla
SGD. Per-neuron running averages consume square root the memory of per-synapse running averages (see the counting sketch after this list).
4.
Since normalisation is local to a neuron, no commu-
nication is needed between neurons in a layer (unlike
for per-layer updates). This makes the optimiser more
distributable—for example, a single layer can be split
across multiple compute devices without fuss. For the
same reason, the Nero update seems more biologically
plausible than per-layer optimisers such as LAMB.
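To illustrate the memory claim in point 3 above with a concrete count (an illustrative calculation, not taken from the paper): for a square fully connected layer, per-synapse second-moment buffers grow with the number of weights, while per-neuron buffers grow only with the number of rows.

```python
# optimiser state (floats) for one 1024 x 1024 fully connected layer
n_neurons, fan_in = 1024, 1024

per_synapse_buffer = n_neurons * fan_in   # one running average per weight (Adam/LAMB style)
per_neuron_buffer = n_neurons             # one running average per neuron (Nero style)

print(per_synapse_buffer, per_neuron_buffer)  # 1048576 vs 1024 = sqrt(1048576)
```

For such square layers the per-neuron buffer is exactly the square root of the per-synapse one; for rectangular layers the saving is a factor of the fan-in.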
There is a significant difference between the implementa-
tion of balanced networks in Nero versus prior work. In
centred weight norm (Huang et al., 2017) and weight stan-
dardisation (Qiao et al., 2019), a neuron’s underlying weight
representation is an unnormalised vector $\tilde{w} \in \mathbb{R}^d$—which is normalised by including the following reparameterisation in the neural architecture:

$$\mathrm{normalise}(\tilde{w}) := \frac{\tilde{w} - \mathbf{1}^\top \tilde{w} \cdot \mathbf{1}/d}{\|\tilde{w} - \mathbf{1}^\top \tilde{w} \cdot \mathbf{1}/d\|_2}, \qquad (3)$$

where $\mathbf{1}$ denotes the vector of 1s.
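For reference, here is a minimal NumPy sketch of the reparameterisation in Equation 3 (the function name and shapes are illustrative; in centred weight norm and weight standardisation this map sits inside the forward pass, so gradients flow through it):

```python
import numpy as np

def normalise(w_tilde):
    """Equation 3: centre the raw weights, then project to the unit sphere."""
    centred = w_tilde - w_tilde.mean()          # subtract (1^T w_tilde / d) * 1
    return centred / np.linalg.norm(centred)    # divide by the l2 norm

w_tilde = np.random.default_rng(0).standard_normal(256)
w = normalise(w_tilde)
print(w.sum(), np.linalg.norm(w))               # approximately 0 and 1: a balanced neuron
```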
Algorithm 1 Nero optimiser. "Out-of-the-box" hyperparameter defaults are $\eta = 0.01$ and $\beta = 0.999$. The constant $\sigma_b \in \mathbb{R}_+$ refers to the initialisation scale of the biases.

Input: step size $\eta \in (0, 1]$, averaging constant $\beta \in [0, 1)$
repeat
    for each neuron do
        ◦ get weight & bias gradients $g_w \in \mathbb{R}^n$ & $g_b \in \mathbb{R}$
        ◦ update running averages:
            $\bar{g}_w^2 \leftarrow \beta \cdot \bar{g}_w^2 + (1 - \beta) \cdot \|g_w\|_2^2$
            $\bar{g}_b^2 \leftarrow \beta \cdot \bar{g}_b^2 + (1 - \beta) \cdot g_b^2$
        ◦ update weights $w \in \mathbb{R}^n$ and bias $b \in \mathbb{R}$:
            $w \leftarrow w - \eta \cdot \|w\|_2 / \bar{g}_w \cdot g_w$
            $b \leftarrow b - \eta \cdot \sigma_b / \bar{g}_b \cdot g_b$
        ◦ project weights back to constraint set:
            $w \leftarrow w - \tfrac{1}{n} \sum_{i=1}^n w_i$
            $w \leftarrow w / \|w\|_2$
    end for
until converged
Since the target of automatic differentiation is still the unnormalised vector $\tilde{w}$, overhead is incurred in both the forward and backward pass. Moreover, there is a subtle coupling between the step size in additive optimisers like Adam and the scale of the unnormalised weights $\tilde{w}$—see Section 5.3.
In contrast, Nero opts to implement balanced networks via
projected gradient descent. This is lighter-weight than Equa-
tion 3, since duplicate copies of the weights are not needed
and the network’s backward pass does not involve extra
operations. Furthermore, Nero can be used as a drop-in
replacement for optimisers like Adam, SGD or LAMB,
without the user needing to manually modify the network
architecture via the reparameterisation in Equation 3. Note
that projected gradient descent arises frequently in machine
learning (Chen et al., 2019; Bai et al., 2019).
Pseudocode for Nero is provided in Algorithm 1. Since Nero normalises gradients via running averages, a Nero update is only approximately per-neuron relative. For brevity, the Adam-style bias correction of the running averages is omitted from the pseudocode. But in the Pytorch implementation used in this paper's experiments, the running averages $\bar{g}_w$ and $\bar{g}_b$ are divided by a factor of $\sqrt{1 - \beta^t}$ before the $t$-th update. This corrects for the warmup bias stemming from $\bar{g}_w$ and $\bar{g}_b$ being initialised to zero (Kingma & Ba, 2015).
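Putting Algorithm 1 and this bias correction together, the following is a minimal NumPy sketch of one Nero update for a single neuron (an illustration of the pseudocode, not the official implementation at github.com/jxbz/nero; the small epsilon guarding against division by zero is an added assumption):

```python
import numpy as np

def nero_step(w, b, g_w, g_b, state, lr=0.01, beta=0.999, sigma_b=0.01, eps=1e-12):
    """One Nero update for a single neuron with weights w and bias b."""
    state["t"] += 1
    # running averages of the squared gradient norms
    state["gw2"] = beta * state["gw2"] + (1 - beta) * float(g_w @ g_w)
    state["gb2"] = beta * state["gb2"] + (1 - beta) * g_b ** 2
    # Adam-style bias correction for the zero-initialised averages
    correction = 1 - beta ** state["t"]
    gw_rms = np.sqrt(state["gw2"] / correction) + eps
    gb_rms = np.sqrt(state["gb2"] / correction) + eps
    # per-neuron relative update
    w = w - lr * np.linalg.norm(w) / gw_rms * g_w
    b = b - lr * sigma_b / gb_rms * g_b
    # project back to the balanced constraint set (zero mean, unit norm)
    w = w - w.mean()
    w = w / np.linalg.norm(w)
    return w, b

# usage on a toy neuron
rng = np.random.default_rng(0)
w = rng.standard_normal(128); w -= w.mean(); w /= np.linalg.norm(w)
b, state = 0.0, {"gw2": 0.0, "gb2": 0.0, "t": 0}
w, b = nero_step(w, b, rng.standard_normal(128), rng.standard_normal(), state)
```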
While the pseudocode in Algorithm 1 is presented for neurons and biases, in the Pytorch implementation the bias update is applied to any parameters lacking a notion of fan-in—including batch norm gains and biases. Typical initialisation scales are $\sigma_b = 1$ for gains and $\sigma_b = 0.01$ for biases. The Pytorch implementation of Nero defaults to $\sigma_b = 0.01$ for any bias parameter initialised to zero.
5. Experiments
This section presents experiments intended to demonstrate
Nero’s key properties. In all figures, the mean and range are
plotted over three repeats. For Nero, out-of-the-box refers to setting $\eta = 0.01$ and $\beta = 0.999$. The code for these experiments is available at github.com/jxbz/nero, and more experimental details are given in Appendix A.
5.1. Constraints Help Nero
To verify that projecting to the space of balanced networks
improves the performance of Nero, an ablation experiment
was conducted. As can be seen in Figure 1, when training
a VGG-11 image classifier on the CIFAR-10 dataset, Nero
performed best with both constraints switched on.
5.2. Per-Neuron Updates are a Good Middle Ground
Since Bernstein et al. (2020b) found that per-synapse rel-
ative updates led to slightly degraded performance, while
per-layer relative updates typically perform well (You et al.,
2017; 2020; Bernstein et al., 2020a), this section compares
per-synapse, per-neuron and per-layer relative updates. In
particular, Nero is compared to Madam (per-synapse rela-
tive) and LAMB (per-layer relative).
A VGG-11 model was trained on the CIFAR-10 dataset. Without constraints, the three optimisers performed similarly, achieving roughly 12% top-1 validation error (Figure 2, top). Constraining to the space of balanced networks (Definition 2) improved both Nero and LAMB, but did not have a significant effect on Madam (Figure 2, bottom). In both configurations, Nero outperformed Madam and LAMB, demonstrating the viability of per-neuron relative updates.
5.3. The Pitfalls of Reparameterisation
Existing implementations of balanced networks (Definition 2) work via the re-parameterisation given in Equation 3 (Huang et al., 2017; Qiao et al., 2019). This leads to an undesired coupling between the learning rate in optimisers like Adam and the scale of the unnormalised $\tilde{w}$ parameters.

To verify this, a network with weights normalised by Equation 3 was trained to classify the MNIST dataset. The initial weights $\tilde{w}$ were drawn from $\mathcal{N}(0, \sigma^2)$, and the experiment was repeated for $\sigma = 1$ and $\sigma = 100$. The Adam optimiser was used for training with a fixed learning rate of 0.01. As can be seen in Figure 3 (left), the training performance was sensitive to the weight scale $\sigma$, despite the fact that a weight normalisation scheme was being used.
The unnecessary scale freedom of reparameterisation can
lead to other undesired consequences such as numerical
overflow. Nero completely eliminates this issue by imple-
menting balanced networks via projected gradient descent.
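The coupling can be illustrated with a back-of-the-envelope sketch (this is an idealisation, not the paper's experiment: it models an Adam-like optimiser as taking a fixed-norm additive step on the raw weights). The same additive step rotates the normalised weights far less when the raw scale $\sigma$ is large:

```python
import numpy as np

rng = np.random.default_rng(0)
d, lr = 256, 0.01

def normalise(v):
    c = v - v.mean()
    return c / np.linalg.norm(c)

for sigma in (1.0, 100.0):
    w_tilde = sigma * rng.standard_normal(d)     # raw weights at scale sigma
    step = rng.standard_normal(d)
    step *= lr / np.linalg.norm(step)            # idealised Adam-like step of norm lr
    before, after = normalise(w_tilde), normalise(w_tilde + step)
    angle = np.arccos(np.clip(before @ after, -1.0, 1.0))
    print(sigma, angle)                          # the induced rotation shrinks roughly as 1/sigma
```

Nero sidesteps this by keeping each neuron directly on the constraint set, so the step size controls the rotation angle regardless of any raw weight scale.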
[Figure 1: training and validation top-1 error versus epoch; legend entries: Both, Mean, Norm, None.]

Figure 1. Ablating the balanced network constraints. A VGG-11 network was trained on CIFAR-10. The legend denotes which of Nero's constraints were active. Mean refers to balanced excitation & inhibition, while norm refers to the $\ell_2$ constant sum rule.
[Figure 2: training and validation top-1 error versus epoch; top panel compares Nero w/o constraints, Madam and LAMB; bottom panel compares Nero, Madam+constraints and LAMB+constraints.]

Figure 2. Comparing per-synapse (Madam), per-neuron (Nero) and per-layer (LAMB) relative updates. A VGG-11 network was trained to classify CIFAR-10. Top: all optimisers without balanced network constraints. Bottom: all optimisers with constraints.
[Figure 3: left panel, training accuracy versus epoch under reparameterisation with initialisation scales σ = 1 and σ = 100; right panel, training accuracy of a 100 layer MLP for Nero, SGD, Adam and LAMB.]

Figure 3. Left: Training a 5 layer perceptron normalised via reparameterisation (Equation 3) on MNIST. For a fixed Adam learning rate, training is sensitive to the scale $\sigma$ of the raw weights $\tilde{w}$. This motivates the different approach taken by Nero. Right: Using Nero to train a 100 layer perceptron—without batch norm or skip connections—to classify MNIST.
5.4. Nero Trains Deeper Networks
Very deep networks are typically difficult to train without
architectural modifications such as residual connections (He
et al., 2016) or batch norm (Ioffe & Szegedy, 2015). To test
whether Nero enables training very deep models without
such modifications, Figure 3 (right) shows the results of
training a very deep multilayer perceptron (MLP) on the
MNIST dataset. Unlike SGD, Adam and LAMB, Nero
could reliably train a 100-layer MLP.
5.5. Nero Works Well Out-of-the-Box
This section probes the versatility and robustness of Nero by
comparing its optimisation and generalisation performance
with three popular alternatives—SGD, Adam, and LAMB—
across six learning problems. The tasks span the domains
of computer vision, natural language processing, and rein-
forcement learning. A wide spectrum of neural architectures
were tested—from convolutional networks to transformers.
To make a fair comparison between optimisers, a fair hyper-
parameter tuning strategy is needed. In this section:
1. Learning rates were tuned over $\{10^{-4}, 10^{-3}, \ldots, 10^{0}\}$.
2.
For Adam, LAMB and SGD, the momentum hyperpa-
rameter was tuned to achieve good performance on the
most complicated benchmark—cGAN training—and
then fixed across the rest of the benchmarks. In each
case, the best momentum value for cGAN was 0.
3. $\beta$ in Nero and $\beta_2$ in Adam and LAMB were fixed to 0.999 across all experiments, as recommended by Kingma & Ba (2015) and You et al. (2020).
4.
Weight decay was not used in any of the experiments.
The results are collated in Table 1. Nero achieved the best validation performance in every experiment—while the runner-up varied across tasks. What's more, the same learning rate of 0.01 was optimal for Nero in five out of six experiments. This means that Nero has strong out-of-the-box performance, since Nero's only other hyperparameter $\beta$ was fixed to 0.999 across all experiments.
The remainder of this section discusses each experiment in
turn. Implementation details are given in Appendix A.
Image synthesis with cGAN
Generative Adversarial Net-
work (Goodfellow et al., 2014, GAN) training is perhaps
the most challenging optimisation problem tackled in this
paper. Good performance has traditionally relied on exten-
sive tuning: different learning rates are often used in the
generator and discriminator (Heusel et al., 2017) and train-
ing is highly sensitive to momentum (Brock et al., 2019,
p. 35). The class-conditional GAN model in this paper is
based on the BigGAN architecture (Brock et al., 2019). This
is a heterogeneous network involving a variety of building
[Figure 4: training and test FID versus epoch for Nero, SGD, Adam and LAMB.]

Figure 4. Class-conditional GAN training on CIFAR-10. Equal learning rates were used in the generator and discriminator. The Fréchet Inception Distance (Heusel et al., 2017, FID) measures the distance between the sample statistics of real and fake data as represented at a deep layer of a pre-trained image classifier.
[Figure 5: training and validation top-1 error versus epoch for Nero, SGD, Adam and LAMB; top panel VGG-11, bottom panel ResNet-18.]

Figure 5. CIFAR-10 classification. Top: performance of a vanilla, convolutional VGG-11 network. Bottom: performance of a batch-normalised, residual ResNet-18 network.
[Figure 6: training and validation perplexity versus epoch for Nero, SGD, Adam and LAMB.]

Figure 6. Training a language model on the Wikitext-2 dataset. A small transformer network was used, composed of 19 tensors. Nero achieved the best anytime performance.
Task | Dataset | Model | Metric | Nero | SGD | Adam | LAMB | Best η (Nero) | Best η (SGD) | Best η (Adam) | Best η (LAMB)
cGAN | CIFAR-10 | BigGAN-like | FID (↓) | 15.43 ± 0.37 | 33.06 ± 0.42 | 23.42 ± 0.85 | 16.32 ± 0.23 | 0.01 | 0.01 | 0.0001 | 0.01
Classification | CIFAR-10 | VGG11 | Top-1 Error (↓) | 11.16% ± 0.17 | 12.61% ± 0.21 | 12.86% ± 0.34 | 13.66% ± 0.05 | 0.01 | 0.1 | 0.001 | 0.01
Classification | CIFAR-10 | ResNet-18 | Top-1 Error (↓) | 5.75% ± 0.07 | 7.75% ± 0.17 | 5.93% ± 0.19 | 6.46% ± 0.12 | 0.01 | 0.1 | 0.01 | 0.1
Language Model | Wikitext-2 | Transformer | Perplexity (↓) | 172.99 ± 0.51 | 181.76 ± 0.49 | 178.05 ± 0.96 | 200.54 ± 0.53 | 0.01 | 1.0 | 0.0001 | 0.01
Translation | WMT16 En–De | Transformer | Perplexity (↓) | 11.35 ± 1.20 | 92.40 ± 89.48 | 12.63 ± 0.34 | 16.36 ± 0.29 | 0.001 | 0.0001 | 0.0001 | 0.01
PPO | Atari Pong | vanilla CNN | Reward (↑) | 20.62 ± 0.05 | 11.99 ± 8.65 | 15.92 ± 3.40 | −19.46 ± 0.10 | 0.01 | 0.1 | 0.0001 | 0.001

Table 1. Validation results for the best learning rate $\eta$. The best result is shown in bold, while the runner-up is underlined.
blocks: convolutions, embeddings, fully connected layers,
attention layers, conditional batch norm and spectral norm
(Miyato et al., 2018). The results are presented in Figure 4.
Image classification
Experiments were run across all
baselines on the CIFAR-10 dataset. The networks used
were the vanilla, convolutional VGG-11 network (Simonyan
& Zisserman, 2015) and the batch-normalised, residual
ResNet-18 network (He et al., 2016). The results are pre-
sented in Figure 5. ImageNet results using ResNet-50 are
presented in Section 5.6. Due to limited computational
resources, the LAMB and Adam baselines were omitted.
Natural language processing
Much recent progress in
natural language processing is based on the transformer
architecture (Vaswani et al., 2017). Transformers process
information via layered, all-to-all comparisons—without
recourse to recurrence or convolution. This paper experi-
mented with a smaller transformer (19 tensors) trained on
the Wikitext-2 dataset, and a larger transformer (121 ten-
sors) trained on WMT2016 English–German translation.
The results are presented in Figures 6 and 7.
Reinforcement learning
Many reinforcement learning al-
gorithms use neural networks to perform function approx-
imation. Proximal Policy Optimization (Schulman et al.,
2017, PPO) is one example, and PPO has gained increasing
popularity for its simplicity, scalability, and robust perfor-
mance. This paper experimented with PPO on the Atari
Pong video game. The results are presented in Figure 8.
While LAMB failed to train on this task, further investiga-
tion revealed that setting LAMB’s momentum hyperparam-
eter to 0.9 enabled LAMB to learn. This demonstrates that
LAMB is sensitive to the momentum hyperparameter.
5.6. Nero Can Be Regularised
This section compares using Nero versus SGD to train a
ResNet-50 classifier on the ImageNet dataset. The results
are shown in Figure 9. While out-of-the-box Nero attained
the best training error and better validation error than SGD,
it performed worse than SGD with tuned weight decay on
the validation set. But after fine-tuning the learning rate
and adding regularisation, Nero roughly matched SGD with
weight decay. In particular, the tuned version of Nero used
a learning rate of 0.02 (tuned), a bias scale parameter $\sigma_b = 1.0$ (not tuned) and the batch norm gains were regularised towards one using a quadratic penalty.
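As a sketch of the kind of regulariser described here (the penalty coefficient and the PyTorch helper below are illustrative assumptions, not values or code from the paper):

```python
import torch

def batch_norm_gain_penalty(model, coeff=1e-4):
    """Quadratic penalty pulling batch norm gains (affine scales) towards one."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    bn_types = (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, torch.nn.BatchNorm3d)
    for module in model.modules():
        if isinstance(module, bn_types) and module.weight is not None:
            penalty = penalty + ((module.weight - 1.0) ** 2).sum()
    return coeff * penalty

# usage: total_loss = cross_entropy_loss + batch_norm_gain_penalty(model)
```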
[Figure 7: training and validation perplexity versus epoch for Nero, SGD, Adam and LAMB.]

Figure 7. Training an English–German translation model on WMT16. A larger transformer network was used, composed of 121 tensors. The optimisers with gradient normalisation—Nero, Adam, and LAMB—performed best in training this model. Training with SGD was unstable and led to significantly worse perplexity.
[Figure 8: reward versus millions of environment steps for Nero, SGD, Adam and LAMB.]

Figure 8. Training a policy network to play Pong. Proximal Policy Optimisation (PPO) was used. Pong's reward is bounded between $\pm 21$. While investigating LAMB's failure to train the policy network, it was discovered that adjusting the $\beta_1$ momentum hyperparameter from 0 to 0.9 improved LAMB's performance.
[Figure 9: training and validation top-1 error versus epoch for Nero OOTB, Nero tuned, SGD and SGD+wd.]

Figure 9. Training a ResNet-50 network to classify the ImageNet dataset. Nero OOTB (out-of-the-box) achieved the best training performance but overfit compared to SGD with weight decay. Nero tuned—which most importantly regularised batch norm gains towards one—recovered most of the lost performance.