Learning by Turning: Neural Architecture Aware Optimisation

Yang Liu *1   Jeremy Bernstein *2   Markus Meister 2   Yisong Yue 2
Abstract

Descent methods for deep networks are notoriously capricious: they require careful tuning of step size, momentum and weight decay, and which method will work best on a new benchmark is a priori unclear. To address this problem, this paper conducts a combined study of neural architecture and optimisation, leading to a new optimiser called Nero: the neuronal rotator. Nero trains reliably without momentum or weight decay, works in situations where Adam and SGD fail, and requires little to no learning rate tuning. Also, Nero's memory footprint is roughly the square root of that of Adam or LAMB. Nero combines two ideas: (1) projected gradient descent over the space of balanced networks; (2) neuron-specific updates, where the step size sets the angle through which each neuron's hyperplane turns. The paper concludes by discussing how this geometric connection between architecture and optimisation may impact theories of generalisation in deep learning.
1. Introduction
Deep learning has brought on a new paradigm in computer
science, enabling artificial systems to interact with the world
at an unprecedented level of complexity. That said, the core
technology relies on various heuristic numerical techniques
that are sometimes brittle and often require extensive tuning.
A major goal of modern research in machine learning is to
uncover the principles underlying learning in neural systems,
and thus to derive more reliable learning algorithms.
Part of the challenge of this endeavour is that learning in
deep networks is an inherently coupled problem. Suppose
that training performance is sensitive to a particular detail
of the neural architecture—then it is unclear whether that
detail affects the expressivity of the architecture, or just the
ability of the descent method to train the architecture.
* Equal contribution. 1 Abacus.AI. 2 Caltech. Correspondence to: YL <yang@abacus.ai> and JB <bernstein@caltech.edu>. Code available at github.com/jxbz/nero.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the authors.
This observation motivates the combined study of architecture and optimisation, and this paper explores several questions at that intersection. First of all:

⟨?⟩ What is the right domain of optimisation for a neural network's weights? Is it $\mathbb{R}^d$, or something more exotic—such as a Cartesian product of hyperspheres?
Typically, optimisation is conducted over $\mathbb{R}^d$, while a careful weight initialisation and a tuned weight decay hyperparameter impose a soft constraint on the optimisation domain. Since normalisation schemes such as batch norm (Ioffe & Szegedy, 2015) render the network invariant to the scale of the weights, weight decay also plays a somewhat subtle second role in modifying the effective learning rate. Hyperparameters with this kind of subtle coupling add to the compounding cost of hyperparameter search.
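To make this scale-invariance point concrete, the following PyTorch sketch (an illustration, not code from the paper) uses a normalised linear map to stand in for a layer whose output is batch-normalised: rescaling the weights leaves the loss unchanged but rescales the gradient inversely, so weight decay indirectly modulates the effective learning rate.

```python
import torch

torch.manual_seed(0)
x = torch.randn(16, 100)       # a batch of 16 inputs
w = torch.randn(100)           # one neuron's weight vector

def loss_fn(w):
    # normalising w makes the output invariant to the scale of w,
    # mimicking a layer followed by batch norm
    y = x @ (w / w.norm())
    return (y ** 2).mean()

for c in [0.1, 1.0, 10.0]:
    w_c = (c * w).clone().requires_grad_()
    loss = loss_fn(w_c)
    grad, = torch.autograd.grad(loss, w_c)
    print(f"scale {c:4.1f}  loss {loss.item():.4f}  grad norm {grad.norm().item():.4f}")

# the loss is identical at every scale, while the gradient norm scales like 1/c:
# shrinking the weights (as weight decay does) increases the effective step size
```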
Furthermore, descent methods such as Adam (Kingma & Ba, 2015) and LAMB (You et al., 2020) use either synapse-specific or layer-specific gradient normalisation. This motivates a second question:

⟨?⟩ At what level of granularity should an optimiser work? Should normalisation occur per-synapse or per-layer—or perhaps, per-neuron?
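As a rough illustration of these granularities (a schematic caricature, not the exact update rules of Adam or LAMB), consider normalising the gradient of a single weight matrix at three different levels:

```python
import torch

torch.manual_seed(0)
G = torch.randn(64, 128)   # gradient for one layer: 64 neurons, each with 128 inputs
eps = 1e-8

# per-synapse (Adam-like): each entry is rescaled by its own magnitude
per_synapse = G / (G.abs() + eps)

# per-neuron: each row (one neuron's incoming weights) is rescaled by its own norm
per_neuron = G / (G.norm(dim=1, keepdim=True) + eps)

# per-layer (LARS/LAMB-like): the entire matrix is rescaled by its Frobenius norm
per_layer = G / (G.norm() + eps)
```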
This paper contends that in deep learning, hyperparameters proliferate because of hidden couplings between optimiser and architecture. By studying the above questions, and distilling simple rules that govern optimisation and architecture, this paper aims to make deep learning less brittle—and less sensitive to opaque hyperparameters.
Summary of contributions:

1. A new optimiser—Nero: the neuronal rotator. Nero performs per-neuron projected gradient descent, and uses roughly the square root of the memory of Adam or LAMB.

2. Experiments across image classification, image generation, natural language processing and reinforcement learning, in which Nero's out-of-the-box configuration tends to outperform tuned baseline optimisers.

3. Discussion of how the connection between optimisation and architecture relates to generalisation theories, such as PAC-Bayes and norm-based complexity.
2. Related work
This section reviews relevant work pertaining to both neural
architecture design and optimisation in machine learning,
and concludes with a bridge to the neuroscience literature.
2.1. Neural Architecture Design
The importance of wiring constraints for the stable function of engineered neural systems is not a new discovery. One important concept is that of balanced excitation and inhibition. For instance, Rosenblatt (1958) found that balancing the proportion of excitatory and inhibitory synaptic connections made his perceptron more robust to varying input sizes. Another concept relates to the total magnitude of synapse strengths. For example, Rochester et al. (1956) constrained the sum of magnitudes of synapses impinging on a neuron so as to stabilise the process of learning. Similar ideas were explored by von der Malsburg (1973) and Miller & MacKay (1994). These works are early predecessors to this paper's definition of balanced networks given in Section 3.1.
Given the resurgence of neural networks over the last decade, the machine learning community has taken up the mantle of research on neural architecture design. Special weight scalings—such as Xavier init (Glorot & Bengio, 2010) and Kaiming init (He et al., 2015)—have been proposed to stabilise signal transmission through deep networks. These scalings are only imposed at initialisation and are free to wander during training—an issue which may be addressed by tuning a weight decay hyperparameter. More recent approaches—such as batch norm (Ioffe & Szegedy, 2015)—explicitly control activation statistics throughout training by adding extra normalisation layers to the network.

Other recent normalisation techniques lie closer to the work of Rosenblatt (1958) and Rochester et al. (1956). Techniques that involve constraining a neuron's weights to the unit hypersphere include: weight norm (Salimans & Kingma, 2016), decoupled networks (Liu et al., 2017; 2018) and orthogonal parameterised training (Liu et al., 2021). Techniques that also balance excitation and inhibition include centred weight norm (Huang et al., 2017) and weight standardisation (Qiao et al., 2019).
2.2. Descent Methods in Deep Learning
Much classic work in optimisation theory focuses on deriving convergence results for descent methods under assumptions such as convexity (Boyd & Vandenberghe, 2004) and Lipschitz continuity of the gradient (Nesterov, 2004). These simplifying assumptions are often used in the machine learning literature. For instance, Bottou et al. (2018) provide convergence guarantees for stochastic gradient descent (SGD) under each of these assumptions. However, these assumptions do not hold in deep learning (Sun, 2019).

On a related note, SGD is not the algorithm of choice in many deep learning applications, and heuristic methods such as RMSprop (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015) often work better. For instance, Adam often works much better than SGD for training generative adversarial networks (Bernstein et al., 2020a). Yet the theory behind Adam is poorly understood (Reddi et al., 2018).
A more recent line of work has explored optimisation methods that make relative updates to neural network parameters. Optimisers like LARS (You et al., 2017), LAMB (You et al., 2020) and Fromage (Bernstein et al., 2020a) make per-layer relative updates, while Madam (Bernstein et al., 2020b) makes per-synapse relative updates. You et al. (2017) found that these methods stabilise large batch training, while Bernstein et al. (2020a) found that they require little to no learning rate tuning across tasks.

Though these recent methods partially account for the neural architecture—by paying attention to its layered operator structure—they do not rigorously address the optimisation domain. As such, LARS and LAMB require a tunable weight decay hyperparameter, while Fromage and Madam restrict the optimisation to a bounded set of tunable size (i.e. weight clipping). Without this additional tuning, these methods can be unstable—see for instance (Bernstein et al., 2020a, Figure 2) and (Bernstein et al., 2020b, Figure 3).
The discussion in the previous paragraph typifies the machine learning state of the art: optimisation techniques that work well, albeit only after hyperparameter tuning. For instance, LAMB is arguably the state-of-the-art relative optimiser, but it contains in total five tunable hyperparameters. Since—at least naïvely—the cost of hyperparameter search is exponential in the number of hyperparameters, the prospect of fully tuning LAMB is computationally daunting.
2.3. Homeostatic Control in Neuroscience
Since the brain is a system that must learn stably without
hyperparameter do-overs, it is worth looking to neuroscience
for inspiration on designing better learning algorithms.
A major swathe of neuroscience research studies mechanisms by which the brain performs homeostatic control. For instance, neuroscientists report a form of homeostasis termed synaptic scaling, where a neuron modulates the strengths of all its synapses to stabilise its firing rate (Turrigiano, 2008). More generally, heterosynaptic plasticity refers to homeostatic mechanisms that modulate the strength of unstimulated synapses (Chistiakova et al., 2015). Shen et al. (2020) review connections to normalisation methods used in machine learning.

These observations inspired this paper to consider implementing homeostatic control via projected gradient descent—leading to the Nero optimiser.
3. Background Theory
In general, an $L$-layer neural network $f(\cdot)$ is a composition of $L$ simpler functions $f_1(\cdot), \ldots, f_L(\cdot)$:

$$f(x) = f_L \circ f_{L-1} \circ \cdots \circ f_1(x). \qquad \text{(forward pass)}$$

Due to this compositionality, any slight ill-conditioning in the simple functions $f_i(\cdot)$ has the potential to compound over layers, making the overall network $f(\cdot)$ very ill-conditioned. Architecture design should aim to prevent this from happening, as will be covered in Section 3.1.
The Jacobian $\partial f / \partial f_l$, which plays a key role in evaluating gradients, also takes the form of a deep product:

$$\frac{\partial f}{\partial f_l} = \frac{\partial f_L}{\partial f_{L-1}} \cdot \frac{\partial f_{L-1}}{\partial f_{L-2}} \cdot \ldots \cdot \frac{\partial f_{l+1}}{\partial f_l}. \qquad \text{(backward pass)}$$

Therefore, it is also important from the perspective of gradient-based optimisation that compositionality is adequately addressed, as will be covered in Section 3.2.
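A small numerical sketch (an illustration, not from the paper) of why compounding matters: if each layer scales its input by a factor only slightly different from one, the effect on the forward signal, and likewise on the backward-pass Jacobian product, grows or shrinks exponentially with depth.

```python
import torch

torch.manual_seed(0)
depth, width = 50, 256
x = torch.randn(width)

for gain in [0.9, 1.0, 1.1]:
    h = x.clone()
    for _ in range(depth):
        # a random linear layer whose output scale is roughly `gain` times its input scale
        W = gain * torch.randn(width, width) / width ** 0.5
        h = W @ h
    print(f"gain {gain}: ||output|| / ||input|| = {(h.norm() / x.norm()).item():.2e}")

# a per-layer gain of 0.9 or 1.1 compounds to roughly 0.9^50 or 1.1^50 after
# 50 layers, so the overall network becomes severely ill-conditioned
```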
3.1. Balanced Network Architectures
A common strategy to mitigate the issue of compounding
ill-conditioning is to explicitly re-normalise the activations
at every network layer. Batch norm (Ioffe & Szegedy, 2015)
exemplifies this strategy, and was found to improve the
trainability of deep residual networks. Batch norm works by
standardising the activations across a batch of inputs at each
network layer—that is, it shifts and scales the activations to
have mean zero and variance one across a batch.
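In code, the standardisation step amounts to the following (a minimal sketch that omits batch norm's learnable scale and shift parameters):

```python
import torch

def batch_standardise(h, eps=1e-5):
    # shift and scale each activation to mean zero and variance one across the batch
    mean = h.mean(dim=0, keepdim=True)
    var = h.var(dim=0, unbiased=False, keepdim=True)
    return (h - mean) / torch.sqrt(var + eps)

h = 3.0 * torch.randn(32, 100) + 2.0               # a batch of 32 activations, mean 2, std 3
h_bn = batch_standardise(h)
print(h_bn.mean(dim=0)[:3], h_bn.var(dim=0)[:3])   # approximately 0 and 1 per unit
```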
Although batch norm works well, it adds computational
overhead to both the forward and backward pass. To explore
how far one can get without explicit re-normalisation, the
following definitions are useful:
Definition 1. A neuron is balanced if its weight vector $w \in \mathbb{R}^d$ satisfies the following constraints:

$$\textstyle\sum_{i=1}^{d} w_i = 0; \qquad \text{(balanced excitation \& inhibition)}$$
$$\textstyle\sum_{i=1}^{d} w_i^2 = 1. \qquad \text{($\ell_2$ constant sum rule)}$$

Definition 2. A network is balanced if all its constituent neurons are balanced.
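The constraints of Definition 1 can be enforced by a simple per-neuron projection. The following is a minimal sketch of that projection step, under the assumption that each neuron's weight vector is handled independently; it illustrates the idea of projected gradient descent over balanced networks, not the full Nero update.

```python
import torch

def project_balanced(w, eps=1e-12):
    # project one neuron's weight vector onto the balanced constraint set:
    # zero mean (balanced excitation & inhibition) and unit l2 norm
    w = w - w.mean()
    return w / (w.norm() + eps)

w = project_balanced(torch.randn(100))
print(w.sum().item(), (w ** 2).sum().item())    # approximately 0 and 1
```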
As noted by Huang et al. (2017), balanced neurons attain some of the properties of batch norm for free. To see this, consider a linear neuron $y = \sum_i w_i x_i$ with inputs $x_i$ that are uncorrelated with mean $\mu$ and variance $1$. Then the output $y$ is standardised:

$$\mathbb{E}[y] = \textstyle\sum_i w_i \, \mathbb{E}[x_i] = \mu \sum_i w_i = 0;$$
$$\mathrm{Var}[y] = \textstyle\sum_i w_i^2 \, \mathrm{Var}[x_i] = \sum_i w_i^2 = 1.$$
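A quick numerical check of this property under the stated idealised assumptions (uncorrelated unit-variance inputs, here with a non-zero mean):

```python
import torch

torch.manual_seed(0)
d, n = 100, 100_000

w = torch.randn(d)
w = w - w.mean()
w = w / w.norm()                        # a balanced neuron, as in Definition 1

x = torch.randn(n, d) + 5.0             # uncorrelated inputs with mean 5 and variance 1
y = x @ w
print(y.mean().item(), y.var().item())  # approximately 0 and 1, despite the input mean of 5
```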
While the assumptions on the inputs $x_i$ are unlikely to hold exactly, under more general conditions the constraints may at least encourage the standardisation of activation statistics through the layers of the network (Brock et al., 2021).
3.2. Stable Descent Steps
Since a network is trained via perturbations to its parameters, it is important to know what size perturbations are appropriate. Consider an $L$-layer network with weight matrices $W = (W_1, W_2, \ldots, W_L)$ and loss function $\mathcal{L}(W)$. For a perturbation