Learning by Turning: Neural Architecture Aware Optimisation
Yang Liu*1  Jeremy Bernstein*2  Markus Meister2  Yisong Yue2
Abstract

Descent methods for deep networks are notoriously capricious: they require careful tuning of step size, momentum and weight decay, and which method will work best on a new benchmark is a priori unclear. To address this problem, this paper conducts a combined study of neural architecture and optimisation, leading to a new optimiser called Nero: the neuronal rotator. Nero trains reliably without momentum or weight decay, works in situations where Adam and SGD fail, and requires little to no learning rate tuning. Also, Nero's memory footprint is square root that of Adam or LAMB. Nero combines two ideas: (1) projected gradient descent over the space of balanced networks; (2) neuron-specific updates, where the step size sets the angle through which each neuron's hyperplane turns. The paper concludes by discussing how this geometric connection between architecture and optimisation may impact theories of generalisation in deep learning.
1. Introduction
Deep learning has brought on a new paradigm in computer
science, enabling artificial systems to interact with the world
at an unprecedented level of complexity. That said, the core
technology relies on various heuristic numerical techniques
that are sometimes brittle and often require extensive tuning.
A major goal of modern research in machine learning is to
uncover the principles underlying learning in neural systems,
and thus to derive more reliable learning algorithms.
Part of the challenge of this endeavour is that learning in
deep networks is an inherently coupled problem. Suppose
that training performance is sensitive to a particular detail
of the neural architecture—then it is unclear whether that
detail affects the expressivity of the architecture, or just the
ability of the descent method to train the architecture.
*Equal contribution. 1Abacus.AI. 2Caltech. Correspondence to: YL <yang@abacus.ai> and JB <bernstein@caltech.edu>. Code available at github.com/jxbz/nero.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the authors.
This observation motivates the
combined study
of archi-
tecture and optimisation, and this paper explores several
questions at that intersection. First of all:
⟨?⟩ What is the right domain of optimisation for a neural network's weights? Is it $\mathbb{R}^d$, or something more exotic—such as a Cartesian product of hyperspheres?
Typically, optimisation is conducted over $\mathbb{R}^d$, while a careful weight initialisation and a tuned weight decay hyperparameter impose a soft constraint on the optimisation domain.
Since normalisation schemes such as batch norm (Ioffe &
Szegedy, 2015) render the network invariant to the scale
of the weights, weight decay also plays a somewhat subtle
second role in modifying the effective learning rate. Hy-
perparameters with this kind of subtle coupling add to the
compounding cost of hyperparameter search.
Furthermore, descent methods such as Adam (Kingma &
Ba, 2015) and LAMB (You et al., 2020) use either synapse-
specific or layer-specific gradient normalisation. This moti-
vates a second question:
⟨?⟩ At what level of granularity should an optimiser work? Should normalisation occur per-synapse or per-layer—or perhaps, per-neuron?
This paper contends that in deep learning, hyperparameters
proliferate because of hidden couplings between optimiser
and architecture. By studying the above questions, and dis-
tilling simple rules that govern optimisation and architecture,
this paper aims to make deep learning less brittle—and less
sensitive to opaque hyperparameters.
Summary of contributions:
1. A new optimiser—Nero: the neuronal rotator. Nero performs per-neuron projected gradient descent, and uses square root the memory of Adam or LAMB.

2. Experiments across image classification, image generation, natural language processing and reinforcement learning, in which Nero's out-of-the-box configuration tends to outperform tuned baseline optimisers.

3. Discussion of how the connection between optimisation and architecture relates to generalisation theories, such as PAC-Bayes and norm-based complexity.
2. Related work
This section reviews relevant work pertaining to both neural
architecture design and optimisation in machine learning,
and concludes with a bridge to the neuroscience literature.
2.1. Neural Architecture Design
The importance of wiring constraints for the stable function
of engineered neural systems is not a new discovery. One
important concept is that of
balanced excitation and inhibi-
tion
. For instance, Rosenblatt (1958) found that balancing
the proportion of excitatory and inhibitory synaptic connec-
tions made his perceptron more robust to varying input sizes.
Another concept relates to the
total magnitude of synapse
strengths
. For example, Rochester et al. (1956) constrained
the sum of magnitudes of synapses impinging on a neuron
so as to stabilise the process of learning. Similar ideas were
explored by von der Malsburg (1973) and Miller & MacKay
(1994). These works are early predecessors to this paper’s
definition of
balanced networks
given in Section 3.1.
Given the resurgence of neural networks over the last decade,
the machine learning community has taken up the mantle
of research on neural architecture design. Special weight
scalings—such as
Xavier init
(Glorot & Bengio, 2010) and
Kaiming init
(He et al., 2015)—have been proposed to sta-
bilise signal transmission through deep networks. These
scalings are only imposed at initialisation and are free to
wander during training—an issue which may be addressed
by tuning a weight decay hyperparameter. More recent
approaches—such as batch norm (Ioffe & Szegedy, 2015)—
explicitly control activation statistics throughout training by
adding extra normalisation layers to the network.
Other recent normalisation techniques lie closer to the
work of Rosenblatt (1958) and Rochester et al. (1956).
Techniques that involve constraining a neuron’s weights
to the unit hypersphere include: weight norm (Salimans &
Kingma, 2016), decoupled networks (Liu et al., 2017; 2018)
and orthogonal parameterised training (Liu et al., 2021).
Techniques that also balance excitation and inhibition in-
clude centred weight norm (Huang et al., 2017) and weight
standardisation (Qiao et al., 2019).
2.2. Descent Methods in Deep Learning
Much classic work in optimisation theory focuses on de-
riving convergence results for descent methods under as-
sumptions such as
convexity
(Boyd & Vandenberghe, 2004)
and
Lipschitz continuity of the gradient
(Nesterov, 2004).
These simplifying assumptions are often used in the ma-
chine learning literature. For instance, Bottou et al. (2018)
provide convergence guarantees for stochastic gradient de-
scent (SGD) under each of these assumptions. However,
these assumptions do not hold in deep learning (Sun, 2019).
On a related note, SGD is not the algorithm of choice in
many deep learning applications, and heuristic methods
such as RMSprop (Tieleman & Hinton, 2012) and Adam
(Kingma & Ba, 2015) often work better. For instance, Adam
often works much better than SGD for training generative
adversarial networks (Bernstein et al., 2020a). Yet the theory
behind Adam is poorly understood (Reddi et al., 2018).
A more recent line of work has explored optimisation meth-
ods that make
relative updates
to neural network param-
eters. Optimisers like LARS (You et al., 2017), LAMB
(You et al., 2020) and Fromage (Bernstein et al., 2020a)
make per-layer relative updates, while Madam (Bernstein
et al., 2020b) makes per-synapse relative updates. You et al.
(2017) found that these methods stabilise large batch train-
ing, while Bernstein et al. (2020a) found that they require
little to no learning rate tuning across tasks.
Though these recent methods partially account for the neural
architecture—by paying attention to its layered operator
structure—they do not rigorously address the optimisation
domain. As such, LARS and LAMB require a tunable
weight decay hyperparameter, while Fromage and Madam
restrict the optimisation to a bounded set of tunable size
(i.e. weight clipping). Without this additional tuning, these
methods can be unstable—see for instance (Bernstein et al.,
2020a, Figure 2) and (Bernstein et al., 2020b, Figure 3).
The discussion in the previous paragraph typifies the ma-
chine learning state of the art: optimisation techniques that
work well, albeit only after hyperparameter tuning. For
instance, LAMB is arguably the state-of-the-art relative
optimiser, but it contains in total
five
tunable hyperparame-
ters. Since—at least naïvely—the cost of hyperparameter
search is exponential in the number of hyperparameters, the
prospect of fully tuning LAMB is computationally daunting.
2.3. Homeostatic Control in Neuroscience
Since the brain is a system that must learn stably without
hyperparameter do-overs, it is worth looking to neuroscience
for inspiration on designing better learning algorithms.
A major swathe of neuroscience research studies mecha-
nisms by which the brain performs homeostatic control.
For instance, neuroscientists report a form of homeosta-
sis termed
synaptic scaling
, where a neuron modulates the
strengths of all its synapses to stabilise its firing rate (Tur-
rigiano, 2008). More generally,
heterosynaptic plasticity
refers to homeostatic mechanisms that modulate the strength
of unstimulated synapses (Chistiakova et al., 2015). Shen
et al. (2020) review connections to normalisation methods
used in machine learning.
These observations inspired this paper to consider imple-
menting homeostatic control via projected gradient descent—
leading to the Nero optimiser.
3. Background Theory
In general, an $L$-layer neural network $f(\cdot)$ is a composition of $L$ simpler functions $f_1(\cdot), \ldots, f_L(\cdot)$:

$$f(x) = f_L \circ f_{L-1} \circ \ldots \circ f_1(x). \qquad \text{(forward pass)}$$
Due to this compositionality, any slight ill-conditioning in the simple functions $f_i(\cdot)$ has the potential to compound over layers, making the overall network $f(\cdot)$ very ill-conditioned. Architecture design should aim to prevent this from happening, as will be covered in Section 3.1.
The Jacobian $\partial f / \partial f_l$, which plays a key role in evaluating gradients, also takes the form of a deep product:

$$\frac{\partial f}{\partial f_l} = \frac{\partial f_L}{\partial f_{L-1}} \cdot \frac{\partial f_{L-1}}{\partial f_{L-2}} \cdot \ldots \cdot \frac{\partial f_{l+1}}{\partial f_l}. \qquad \text{(backward pass)}$$
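To make the product structure concrete, the following minimal NumPy sketch (not from the paper; the toy three-layer tanh network and all names are illustrative assumptions) chains per-layer Jacobians to form $\partial f / \partial f_1$ and checks the result against finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 4, 3
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(L)]

def layer(h, W):
    return np.tanh(W @ h)                        # f_l(h) = tanh(W_l h)

def layer_jacobian(h, W):
    return np.diag(1 - np.tanh(W @ h) ** 2) @ W  # d f_l / d h at input h

# forward pass: store x, f_1(x), f_2(f_1(x)), ...
x = rng.standard_normal(d)
hs = [x]
for W in Ws:
    hs.append(layer(hs[-1], W))

# backward pass for l = 1: chain the Jacobians of the layers above layer 1
J = np.eye(d)
for l in range(1, L):
    J = layer_jacobian(hs[l], Ws[l]) @ J

# finite-difference check of d f / d f_1
def f_from_h1(h1):
    for W in Ws[1:]:
        h1 = layer(h1, W)
    return h1

eps = 1e-6
J_fd = np.stack([(f_from_h1(hs[1] + eps * e) - f_from_h1(hs[1] - eps * e)) / (2 * eps)
                 for e in np.eye(d)], axis=1)
print(np.max(np.abs(J - J_fd)))  # small: the chained product matches the true Jacobian
```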
Therefore, it is also important from the perspective of
gradient-based optimisation that compositionality is ade-
quately addressed, as will be covered in Section 3.2.
3.1. Balanced Network Architectures
A common strategy to mitigate the issue of compounding
ill-conditioning is to explicitly re-normalise the activations
at every network layer. Batch norm (Ioffe & Szegedy, 2015)
exemplifies this strategy, and was found to improve the
trainability of deep residual networks. Batch norm works by
standardising the activations across a batch of inputs at each
network layer—that is, it shifts and scales the activations to
have mean zero and variance one across a batch.
Although batch norm works well, it adds computational
overhead to both the forward and backward pass. To explore
how far one can get without explicit re-normalisation, the
following definitions are useful:
Definition 1. A neuron is balanced if its weight vector $w \in \mathbb{R}^d$ satisfies the following constraints:

$$\sum_{i=1}^d w_i = 0; \qquad \text{(balanced excitation \& inhibition)}$$
$$\sum_{i=1}^d w_i^2 = 1. \qquad (\ell_2 \text{ constant sum rule})$$
Definition 2.
A network is
balanced
if all its constituent
neurons are balanced.
As noted by Huang et al. (2017), balanced neurons attain
some of the properties of batch norm for free. To see this,
consider a linear neuron $y = \sum_i w_i x_i$ with inputs $x_i$ that are uncorrelated with mean $\mu$ and variance $1$. Then the output $y$ is standardised:

$$\mathbb{E}[y] = \sum_i w_i\, \mathbb{E}[x_i] = \mu \sum_i w_i = 0;$$
$$\mathrm{Var}[y] = \sum_i w_i^2\, \mathrm{Var}[x_i] = \sum_i w_i^2 = 1.$$
While the assumptions on the inputs $x_i$ are unlikely to hold exactly, under more general conditions the constraints may at least encourage the standardisation of activation statistics through the layers of the network (Brock et al., 2021).
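The claim can be checked numerically. The sketch below (an illustration, not code from the paper) draws a balanced weight vector and uncorrelated inputs with mean $\mu$ and unit variance, and confirms that the output is approximately standardised.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples, mu = 1000, 100_000, 3.0

# a balanced weight vector: zero mean, unit l2 norm (Definition 1)
w = rng.standard_normal(d)
w -= w.mean()
w /= np.linalg.norm(w)

# uncorrelated inputs with mean mu and variance 1 (the stated assumptions)
x = mu + rng.standard_normal((n_samples, d))
y = x @ w

print(y.mean(), y.var())  # approximately 0 and 1, as the derivation predicts
```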
3.2. Stable Descent Steps
Since a network is trained via perturbations to its parameters,
it is important to know what size perturbations are appro-
priate. Consider an $L$-layer network with weight matrices $W = (W_1, W_2, \ldots, W_L)$ and loss function $\mathcal{L}(W)$. For a perturbation $\Delta W = (\Delta W_1, \Delta W_2, \ldots, \Delta W_L)$, the following definition establishes a notion of stable step size:
Definition 3. Let $\theta_l$ denote the angle between $\Delta W_l$ and $-\nabla_{W_l} \mathcal{L}(W)$. A descent step is stable if for all $l = 1, \ldots, L$:

$$\frac{\|\nabla_{W_l} \mathcal{L}(W + \Delta W) - \nabla_{W_l} \mathcal{L}(W)\|_F}{\|\nabla_{W_l} \mathcal{L}(W)\|_F} \leq \cos \theta_l. \qquad (1)$$
Or in words: for each layer, the relative change in gradient
induced by the perturbation should not exceed the cosine of
the angle between the perturbation and the negative gradient.
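As a toy illustration (not an example from the paper), the quadratic loss $\mathcal{L}(W) = \tfrac{1}{2}\|W\|_F^2$ has gradient $W$, so a plain gradient step $\Delta W = -\eta\, \nabla \mathcal{L}(W)$ changes the gradient by a relative amount $\eta$ while $\cos\theta = 1$; Inequality 1 then certifies stability exactly when $\eta \leq 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
grad = lambda W: W                 # gradient of the toy loss L(W) = ||W||_F^2 / 2

for eta in (0.5, 2.0):
    dW = -eta * grad(W)            # step along the negative gradient, so cos(theta) = 1
    rel_change = np.linalg.norm(grad(W + dW) - grad(W)) / np.linalg.norm(grad(W))
    print(eta, rel_change, rel_change <= 1.0)   # certified stable only for eta <= 1
```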
This definition is useful because a stable descent step is guaranteed to decrease a continuously differentiable loss function $\mathcal{L}(W)$ (Bernstein et al., 2020a). Still, extracting a stable step $\Delta W$ directly from Inequality 1 would require first computing extra gradients $\nabla_{W_l} \mathcal{L}(W + \Delta W)$. Bernstein et al. (2020a) proposed the following model to avoid this:
Definition 4. The loss function obeys deep relative trust if for all perturbations $\Delta W = (\Delta W_1, \Delta W_2, \ldots, \Delta W_L)$:

$$\frac{\|\nabla_{W_l} \mathcal{L}(W + \Delta W) - \nabla_{W_l} \mathcal{L}(W)\|_F}{\|\nabla_{W_l} \mathcal{L}(W)\|_F} \leq \prod_{k=1}^{L} \left(1 + \frac{\|\Delta W_k\|_F}{\|W_k\|_F}\right) - 1.$$
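A quick calculation (illustrative, not from the paper) shows how the bound compounds with depth: a 1% relative perturbation to every layer of a 50-layer network is allowed to change each layer's gradient by roughly 64%.

```python
import numpy as np

L, rel = 50, 0.01          # 50 layers, 1% relative perturbation per layer
bound = np.prod(np.full(L, 1.0 + rel)) - 1.0
print(bound)               # ≈ 0.64: small per-layer changes compound multiplicatively
```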
While deep relative trust is based on a perturbation analysis
of
L
-layer perceptrons (Bernstein et al., 2020a, Theorem 1),
the key idea is that its product structure explicitly models
the product structure of the network’s backward pass.
The deep relative trust model suggests that a stable descent
step should involve small relative perturbations
per layer
.
This motivates the layer-wise family of descent methods
(You et al., 2017; 2020). Still, it is unclear whether layers
are the right base object to consider. Perhaps a more refined
analysis would replace the layers appearing in Definition 4
with individual
neurons
or even
synapses
.
Small relative perturbations per-synapse were explored by
Bernstein et al. (2020b) and found to slightly degrade train-
ing performance compared to Adam and SGD. But this
paper will explore the per-neuron middle ground:
Definition 5. A step of size $\eta > 0$ is said to be per-neuron relative if for any neuron with weights $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$, the perturbations $\Delta w \in \mathbb{R}^d$ and $\Delta b \in \mathbb{R}$ satisfy:

$$\|\Delta w\|_2 / \|w\|_2 \leq \eta \quad \text{and} \quad |\Delta b| / |b| \leq \eta.$$
A per-neuron relative update is automatically per-layer relative. To see this, consider a weight matrix $W$ whose $N$ rows correspond to $N$ neurons $w^{(1)}, \ldots, w^{(N)}$. Then:

$$\frac{\|\Delta W\|_F}{\|W\|_F} = \sqrt{\frac{\sum_{i=1}^N \|\Delta w^{(i)}\|_2^2}{\sum_{i=1}^N \|w^{(i)}\|_2^2}} \leq \sqrt{\frac{\sum_{i=1}^N \eta^2 \|w^{(i)}\|_2^2}{\sum_{i=1}^N \|w^{(i)}\|_2^2}} = \eta. \qquad (2)$$
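The inequality can be sanity-checked numerically. In the sketch below (an illustration with assumed shapes, not code from the paper), each row of the perturbation is scaled to exactly $\eta$ times the norm of the corresponding neuron, and the layer-level Frobenius ratio indeed does not exceed $\eta$.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.01
W = rng.standard_normal((64, 128))       # 64 neurons, each with fan-in 128

# per-neuron relative perturbation: row i has norm eta * ||w^(i)||_2
dW = rng.standard_normal(W.shape)
dW *= eta * (np.linalg.norm(W, axis=1, keepdims=True)
             / np.linalg.norm(dW, axis=1, keepdims=True))

ratio = np.linalg.norm(dW) / np.linalg.norm(W)   # Frobenius norms
print(ratio, ratio <= eta + 1e-12)               # equals eta here, matching Equation 2
```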
4. Nero: the Neuronal Rotator
Following the discussion in Section 3, this paper will con-
sider an optimisation algorithm that makes
per-neuron rel-
ative updates
(Definition 5) constrained to the space of
balanced networks
(Definition 2).
Since a balanced neuron is constrained to the unit hypersphere, a per-neuron relative update with step size $\eta$ corresponds to a pure rotation of the neuron's weight vector by angle $\approx \eta$. To see this, take $\eta$ small in the following picture:

[Figure: two unit-length weight vectors, $w$ with $\|w\|_2 = 1$ and $w + \Delta w$ with $\|w + \Delta w\|_2 \approx 1$, connected by a perturbation of length $\|\Delta w\|_2 = \eta$; for small $\eta$ the perturbation acts as a rotation through angle $\approx \eta$.]
Hence, this paper proposes Nero: the neuronal rotator.
Nero’s goal is to reduce the burden of hyperparameter tun-
ing by baking architectural information into the optimiser.
More concretely, the anticipated advantages are as follows:
1.
Since per-neuron relative updates are automatically
per-layer relative by Equation 2, they should inherit the
properties of per-layer updates—in particular, stability
across batch sizes (You et al., 2017) while needing little
to no learning rate tuning (Bernstein et al., 2020a).
2.
Since balanced networks place hard constraints on the
norm of a neuron’s weights, the need for initialisation
tuning and weight decay on these weights is removed.
3.
Gradients are often normalised by running averages,
in order to retain relative scale information between
successive minibatch gradients (Tieleman & Hinton,
2012). Along with momentum, this is the main mem-
ory overhead of Adam and LAMB compared to vanilla
SGD. Per-neuron running averages consume square root the memory of per-synapse running averages (see the counting sketch after this list).
4.
Since normalisation is local to a neuron, no commu-
nication is needed between neurons in a layer (unlike
for per-layer updates). This makes the optimiser more
distributable—for example, a single layer can be split
across multiple compute devices without fuss. For the
same reason, the Nero update seems more biologically
plausible than per-layer optimisers such as LAMB.
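To illustrate the memory claim in point 3 above with a concrete count (an illustrative calculation, not taken from the paper): for a square fully connected layer, per-synapse second-moment buffers grow with the number of weights, while per-neuron buffers grow only with the number of rows.

```python
# optimiser state (floats) for one 1024 x 1024 fully connected layer
n_neurons, fan_in = 1024, 1024

per_synapse_buffer = n_neurons * fan_in   # one running average per weight (Adam/LAMB style)
per_neuron_buffer = n_neurons             # one running average per neuron (Nero style)

print(per_synapse_buffer, per_neuron_buffer)  # 1048576 vs 1024 = sqrt(1048576)
```

For such square layers the per-neuron buffer is exactly the square root of the per-synapse one; for rectangular layers the saving is a factor of the fan-in.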
There is a significant difference between the implementa-
tion of balanced networks in Nero versus prior work. In
centred weight norm (Huang et al., 2017) and weight stan-
dardisation (Qiao et al., 2019), a neuron’s underlying weight
representation is an unnormalised vector $\tilde{w} \in \mathbb{R}^d$—which is normalised by including the following reparameterisation in the neural architecture:

$$\mathrm{normalise}(\tilde{w}) := \frac{\tilde{w} - \mathbf{1}^\top \tilde{w} \cdot \mathbf{1}/d}{\|\tilde{w} - \mathbf{1}^\top \tilde{w} \cdot \mathbf{1}/d\|_2}, \qquad (3)$$

where $\mathbf{1}$ denotes the vector of 1s.
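For reference, here is a minimal NumPy sketch of the reparameterisation in Equation 3 (the function name and shapes are illustrative; in centred weight norm and weight standardisation this map sits inside the forward pass, so gradients flow through it):

```python
import numpy as np

def normalise(w_tilde):
    """Equation 3: centre the raw weights, then project to the unit sphere."""
    centred = w_tilde - w_tilde.mean()          # subtract (1^T w_tilde / d) * 1
    return centred / np.linalg.norm(centred)    # divide by the l2 norm

w_tilde = np.random.default_rng(0).standard_normal(256)
w = normalise(w_tilde)
print(w.sum(), np.linalg.norm(w))               # approximately 0 and 1: a balanced neuron
```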
Algorithm 1 Nero optimiser. "Out-of-the-box" hyperparameter defaults are $\eta = 0.01$ and $\beta = 0.999$. The constant $\sigma_b \in \mathbb{R}_+$ refers to the initialisation scale of the biases.

Input: step size $\eta \in (0, 1]$, averaging constant $\beta \in [0, 1)$
repeat
    for each neuron do
        ◦ get weight & bias gradients $g_w \in \mathbb{R}^n$ & $g_b \in \mathbb{R}$
        ◦ update running averages:
            $\bar{g}_w^2 \leftarrow \beta \cdot \bar{g}_w^2 + (1 - \beta) \cdot \|g_w\|_2^2$
            $\bar{g}_b^2 \leftarrow \beta \cdot \bar{g}_b^2 + (1 - \beta) \cdot g_b^2$
        ◦ update weights $w \in \mathbb{R}^n$ and bias $b \in \mathbb{R}$:
            $w \leftarrow w - \eta \cdot \|w\|_2 / \bar{g}_w \cdot g_w$
            $b \leftarrow b - \eta \cdot \sigma_b / \bar{g}_b \cdot g_b$
        ◦ project weights back to constraint set:
            $w \leftarrow w - \tfrac{1}{n} \sum_{i=1}^n w_i$
            $w \leftarrow w / \|w\|_2$
    end for
until converged
Since the target of automatic differentiation is still the unnormalised vector $\tilde{w}$, overhead is incurred in both the forward and backward pass. Moreover, there is a subtle coupling between the step size in additive optimisers like Adam and the scale of the unnormalised weights $\tilde{w}$—see Section 5.3.
In contrast, Nero opts to implement balanced networks via
projected gradient descent. This is lighter-weight than Equa-
tion 3, since duplicate copies of the weights are not needed
and the network’s backward pass does not involve extra
operations. Furthermore, Nero can be used as a drop-in
replacement for optimisers like Adam, SGD or LAMB,
without the user needing to manually modify the network
architecture via the reparameterisation in Equation 3. Note
that projected gradient descent arises frequently in machine
learning (Chen et al., 2019; Bai et al., 2019).
Pseudocode for Nero is provided in Algorithm 1. Since Nero normalises gradients via running averages, a Nero update is only approximately per-neuron relative. For brevity, the Adam-style bias correction of the running averages is omitted from the pseudocode. But in the Pytorch implementation used in this paper's experiments, the running averages $\bar{g}_w$ and $\bar{g}_b$ are divided by a factor of $\sqrt{1 - \beta^t}$ before the $t$-th update. This corrects for the warmup bias stemming from $\bar{g}_w$ and $\bar{g}_b$ being initialised to zero (Kingma & Ba, 2015).
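Putting Algorithm 1 and this bias correction together, the following is a minimal NumPy sketch of one Nero update for a single neuron (an illustration of the pseudocode, not the official implementation at github.com/jxbz/nero; the small epsilon guarding against division by zero is an added assumption):

```python
import numpy as np

def nero_step(w, b, g_w, g_b, state, lr=0.01, beta=0.999, sigma_b=0.01, eps=1e-12):
    """One Nero update for a single neuron with weights w and bias b."""
    state["t"] += 1
    # running averages of the squared gradient norms
    state["gw2"] = beta * state["gw2"] + (1 - beta) * float(g_w @ g_w)
    state["gb2"] = beta * state["gb2"] + (1 - beta) * g_b ** 2
    # Adam-style bias correction for the zero-initialised averages
    correction = 1 - beta ** state["t"]
    gw_rms = np.sqrt(state["gw2"] / correction) + eps
    gb_rms = np.sqrt(state["gb2"] / correction) + eps
    # per-neuron relative update
    w = w - lr * np.linalg.norm(w) / gw_rms * g_w
    b = b - lr * sigma_b / gb_rms * g_b
    # project back to the balanced constraint set (zero mean, unit norm)
    w = w - w.mean()
    w = w / np.linalg.norm(w)
    return w, b

# usage on a toy neuron
rng = np.random.default_rng(0)
w = rng.standard_normal(128); w -= w.mean(); w /= np.linalg.norm(w)
b, state = 0.0, {"gw2": 0.0, "gb2": 0.0, "t": 0}
w, b = nero_step(w, b, rng.standard_normal(128), rng.standard_normal(), state)
```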
While the pseudocode in Algorithm 1 is presented for neurons and biases, in the Pytorch implementation the bias update is applied to any parameters lacking a notion of fan-in—including batch norm gains and biases. Typical initialisation scales are $\sigma_b = 1$ for gains and $\sigma_b = 0.01$ for biases. The Pytorch implementation of Nero defaults to $\sigma_b = 0.01$ for any bias parameter initialised to zero.
5. Experiments
This section presents experiments intended to demonstrate
Nero’s key properties. In all figures, the mean and range are
plotted over three repeats. For Nero, out-of-the-box refers to setting $\eta = 0.01$ and $\beta = 0.999$. The code for these experiments is available at github.com/jxbz/nero, and more experimental details are given in Appendix A.
5.1. Constraints Help Nero
To verify that projecting to the space of balanced networks
improves the performance of Nero, an ablation experiment
was conducted. As can be seen in Figure 1, when training
a VGG-11 image classifier on the CIFAR-10 dataset, Nero
performed best with both constraints switched on.
5.2. Per-Neuron Updates are a Good Middle Ground
Since Bernstein et al. (2020b) found that per-synapse rel-
ative updates led to slightly degraded performance, while
per-layer relative updates typically perform well (You et al.,
2017; 2020; Bernstein et al., 2020a), this section compares
per-synapse, per-neuron and per-layer relative updates. In
particular, Nero is compared to Madam (per-synapse rela-
tive) and LAMB (per-layer relative).
A VGG-11 model was trained on the CIFAR-10 dataset. Without constraints, the three optimisers performed similarly, achieving roughly 12% top-1 validation error (Figure 2, top). Constraining to the space of balanced networks (Definition 2) improved both Nero and LAMB, but did not have a significant effect on Madam (Figure 2, bottom). In both configurations, Nero outperformed Madam and LAMB, demonstrating the viability of per-neuron relative updates.
5.3. The Pitfalls of Reparameterisation
Existing implementations of balanced networks (Definition 2) work via the re-parameterisation given in Equation 3 (Huang et al., 2017; Qiao et al., 2019). This leads to an undesired coupling between the learning rate in optimisers like Adam and the scale of the unnormalised $\tilde{w}$ parameters.

To verify this, a network with weights normalised by Equation 3 was trained to classify the MNIST dataset. The initial weights $\tilde{w}$ were drawn from $\mathcal{N}(0, \sigma^2)$, and the experiment was repeated for $\sigma = 1$ and $\sigma = 100$. The Adam optimiser was used for training with a fixed learning rate of 0.01. As can be seen in Figure 3 (left), the training performance was sensitive to the weight scale $\sigma$, despite the fact that a weight normalisation scheme was being used.
The unnecessary scale freedom of reparameterisation can
lead to other undesired consequences such as numerical
overflow. Nero completely eliminates this issue by imple-
menting balanced networks via projected gradient descent.
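The coupling can be illustrated with a back-of-the-envelope sketch (this is an idealisation, not the paper's experiment: it models an Adam-like optimiser as taking a fixed-norm additive step on the raw weights). The same additive step rotates the normalised weights far less when the raw scale $\sigma$ is large:

```python
import numpy as np

rng = np.random.default_rng(0)
d, lr = 256, 0.01

def normalise(v):
    c = v - v.mean()
    return c / np.linalg.norm(c)

for sigma in (1.0, 100.0):
    w_tilde = sigma * rng.standard_normal(d)     # raw weights at scale sigma
    step = rng.standard_normal(d)
    step *= lr / np.linalg.norm(step)            # idealised Adam-like step of norm lr
    before, after = normalise(w_tilde), normalise(w_tilde + step)
    angle = np.arccos(np.clip(before @ after, -1.0, 1.0))
    print(sigma, angle)                          # the induced rotation shrinks roughly as 1/sigma
```

Nero sidesteps this by keeping each neuron directly on the constraint set, so the step size controls the rotation angle regardless of any raw weight scale.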
[Figure 1: training and validation top-1 error versus epoch; legend entries: Both, Mean, Norm, None.]

Figure 1. Ablating the balanced network constraints. A VGG-11 network was trained on CIFAR-10. The legend denotes which of Nero's constraints were active. Mean refers to balanced excitation & inhibition, while norm refers to the $\ell_2$ constant sum rule.
[Figure 2: training and validation top-1 error versus epoch; top panel compares Nero w/o constraints, Madam and LAMB; bottom panel compares Nero, Madam+constraints and LAMB+constraints.]

Figure 2. Comparing per-synapse (Madam), per-neuron (Nero) and per-layer (LAMB) relative updates. A VGG-11 network was trained to classify CIFAR-10. Top: all optimisers without balanced network constraints. Bottom: all optimisers with constraints.
[Figure 3: left panel, training accuracy versus epoch under reparameterisation with initialisation scales σ = 1 and σ = 100; right panel, training accuracy of a 100 layer MLP for Nero, SGD, Adam and LAMB.]

Figure 3. Left: Training a 5 layer perceptron normalised via reparameterisation (Equation 3) on MNIST. For a fixed Adam learning rate, training is sensitive to the scale $\sigma$ of the raw weights $\tilde{w}$. This motivates the different approach taken by Nero. Right: Using Nero to train a 100 layer perceptron—without batch norm or skip connections—to classify MNIST.
5.4. Nero Trains Deeper Networks
Very deep networks are typically difficult to train without
architectural modifications such as residual connections (He
et al., 2016) or batch norm (Ioffe & Szegedy, 2015). To test
whether Nero enables training very deep models without
such modifications, Figure 3 (right) shows the results of
training a very deep multilayer perceptron (MLP) on the
MNIST dataset. Unlike SGD, Adam and LAMB, Nero
could reliably train a 100-layer MLP.
5.5. Nero Works Well Out-of-the-Box
This section probes the versatility and robustness of Nero by
comparing its optimisation and generalisation performance
with three popular alternatives—SGD, Adam, and LAMB—
across six learning problems. The tasks span the domains
of computer vision, natural language processing, and rein-
forcement learning. A wide spectrum of neural architectures
were tested—from convolutional networks to transformers.
To make a fair comparison between optimisers, a fair hyper-
parameter tuning strategy is needed. In this section:
1. Learning rates were tuned over $\{10^{-4}, 10^{-3}, \ldots, 10^{0}\}$.
2.
For Adam, LAMB and SGD, the momentum hyperpa-
rameter was tuned to achieve good performance on the
most complicated benchmark—cGAN training—and
then fixed across the rest of the benchmarks. In each
case, the best momentum value for cGAN was 0.
3. $\beta$ in Nero and $\beta_2$ in Adam and LAMB were fixed to 0.999 across all experiments, as recommended by Kingma & Ba (2015) and You et al. (2020).
4.
Weight decay was not used in any of the experiments.
The results are collated in Table 1. Nero achieved the best validation performance in every experiment—while the runner-up varied across tasks. What's more, the same learning rate of 0.01 was optimal for Nero in five out of six experiments. This means that Nero has strong out-of-the-box performance, since Nero's only other hyperparameter $\beta$ was fixed to 0.999 across all experiments.
The remainder of this section discusses each experiment in
turn. Implementation details are given in Appendix A.
Image synthesis with cGAN
Generative Adversarial Net-
work (Goodfellow et al., 2014, GAN) training is perhaps
the most challenging optimisation problem tackled in this
paper. Good performance has traditionally relied on exten-
sive tuning: different learning rates are often used in the
generator and discriminator (Heusel et al., 2017) and train-
ing is highly sensitive to momentum (Brock et al., 2019,
p. 35). The class-conditional GAN model in this paper is
based on the BigGAN architecture (Brock et al., 2019). This
is a heterogeneous network involving a variety of building
[Figure 4: training and test FID versus epoch for Nero, SGD, Adam and LAMB.]

Figure 4. Class-conditional GAN training on CIFAR-10. Equal learning rates were used in the generator and discriminator. The Fréchet Inception Distance (Heusel et al., 2017, FID) measures the distance between the sample statistics of real and fake data as represented at a deep layer of a pre-trained image classifier.
[Figure 5: training and validation top-1 error versus epoch for Nero, SGD, Adam and LAMB; top panel VGG-11, bottom panel ResNet-18.]

Figure 5. CIFAR-10 classification. Top: performance of a vanilla, convolutional VGG-11 network. Bottom: performance of a batch-normalised, residual ResNet-18 network.
[Figure 6: training and validation perplexity versus epoch for Nero, SGD, Adam and LAMB.]

Figure 6. Training a language model on the Wikitext-2 dataset. A small transformer network was used, composed of 19 tensors. Nero achieved the best anytime performance.
Task | Dataset | Model | Metric | Nero | SGD | Adam | LAMB | Best η (Nero) | Best η (SGD) | Best η (Adam) | Best η (LAMB)
cGAN | CIFAR-10 | BigGAN-like | FID (↓) | 15.43 ± 0.37 | 33.06 ± 0.42 | 23.42 ± 0.85 | 16.32 ± 0.23 | 0.01 | 0.01 | 0.0001 | 0.01
Classification | CIFAR-10 | VGG11 | Top-1 Error (↓) | 11.16% ± 0.17 | 12.61% ± 0.21 | 12.86% ± 0.34 | 13.66% ± 0.05 | 0.01 | 0.1 | 0.001 | 0.01
Classification | CIFAR-10 | ResNet-18 | Top-1 Error (↓) | 5.75% ± 0.07 | 7.75% ± 0.17 | 5.93% ± 0.19 | 6.46% ± 0.12 | 0.01 | 0.1 | 0.01 | 0.1
Language Model | Wikitext-2 | Transformer | Perplexity (↓) | 172.99 ± 0.51 | 181.76 ± 0.49 | 178.05 ± 0.96 | 200.54 ± 0.53 | 0.01 | 1.0 | 0.0001 | 0.01
Translation | WMT16 En–De | Transformer | Perplexity (↓) | 11.35 ± 1.20 | 92.40 ± 89.48 | 12.63 ± 0.34 | 16.36 ± 0.29 | 0.001 | 0.0001 | 0.0001 | 0.01
PPO | Atari Pong | vanilla CNN | Reward (↑) | 20.62 ± 0.05 | 11.99 ± 8.65 | 15.92 ± 3.40 | −19.46 ± 0.10 | 0.01 | 0.1 | 0.0001 | 0.001

Table 1. Validation results for the best learning rate $\eta$. The best result is shown in bold, while the runner-up is underlined.
blocks: convolutions, embeddings, fully connected layers,
attention layers, conditional batch norm and spectral norm
(Miyato et al., 2018). The results are presented in Figure 4.
Image classification
Experiments were run across all
baselines on the CIFAR-10 dataset. The networks used
were the vanilla, convolutional VGG-11 network (Simonyan
& Zisserman, 2015) and the batch-normalised, residual
ResNet-18 network (He et al., 2016). The results are pre-
sented in Figure 5. ImageNet results using ResNet-50 are
presented in Section 5.6. Due to limited computational
resources, the LAMB and Adam baselines were omitted.
Natural language processing
Much recent progress in
natural language processing is based on the transformer
architecture (Vaswani et al., 2017). Transformers process
information via layered, all-to-all comparisons—without
recourse to recurrence or convolution. This paper experi-
mented with a smaller transformer (19 tensors) trained on
the Wikitext-2 dataset, and a larger transformer (121 ten-
sors) trained on WMT2016 English–German translation.
The results are presented in Figures 6 and 7.
Reinforcement learning
Many reinforcement learning al-
gorithms use neural networks to perform function approx-
imation. Proximal Policy Optimization (Schulman et al.,
2017, PPO) is one example, and PPO has gained increasing
popularity for its simplicity, scalability, and robust perfor-
mance. This paper experimented with PPO on the Atari
Pong video game. The results are presented in Figure 8.
While LAMB failed to train on this task, further investiga-
tion revealed that setting LAMB’s momentum hyperparam-
eter to 0.9 enabled LAMB to learn. This demonstrates that
LAMB is sensitive to the momentum hyperparameter.
5.6. Nero Can Be Regularised
This section compares using Nero versus SGD to train a
ResNet-50 classifier on the ImageNet dataset. The results
are shown in Figure 9. While out-of-the-box Nero attained
the best training error and better validation error than SGD,
it performed worse than SGD with tuned weight decay on
the validation set. But after fine-tuning the learning rate
and adding regularisation, Nero roughly matched SGD with
weight decay. In particular, the tuned version of Nero used
a learning rate of 0.02 (tuned), a bias scale parameter $\sigma_b = 1.0$ (not tuned) and the batch norm gains were regularised towards one using a quadratic penalty.
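As a sketch of the kind of regulariser described here (the penalty coefficient and the PyTorch helper below are illustrative assumptions, not values or code from the paper):

```python
import torch

def batch_norm_gain_penalty(model, coeff=1e-4):
    """Quadratic penalty pulling batch norm gains (affine scales) towards one."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    bn_types = (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, torch.nn.BatchNorm3d)
    for module in model.modules():
        if isinstance(module, bn_types) and module.weight is not None:
            penalty = penalty + ((module.weight - 1.0) ** 2).sum()
    return coeff * penalty

# usage: total_loss = cross_entropy_loss + batch_norm_gain_penalty(model)
```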
[Figure 7: training and validation perplexity versus epoch for Nero, SGD, Adam and LAMB.]

Figure 7. Training an English–German translation model on WMT16. A larger transformer network was used, composed of 121 tensors. The optimisers with gradient normalisation—Nero, Adam, and LAMB—performed best in training this model. Training with SGD was unstable and led to significantly worse perplexity.
[Figure 8: reward versus millions of environment steps for Nero, SGD, Adam and LAMB.]

Figure 8. Training a policy network to play Pong. Proximal Policy Optimisation (PPO) was used. Pong's reward is bounded between $\pm 21$. While investigating LAMB's failure to train the policy network, it was discovered that adjusting the $\beta_1$ momentum hyperparameter from 0 to 0.9 improved LAMB's performance.
[Figure 9: training and validation top-1 error versus epoch for Nero OOTB, Nero tuned, SGD and SGD+wd.]

Figure 9. Training a ResNet-50 network to classify the ImageNet dataset. Nero OOTB (out-of-the-box) achieved the best training performance but overfit compared to SGD with weight decay. Nero tuned—which most importantly regularised batch norm gains towards one—recovered most of the lost performance.