REVIEW
Communicated by Tobi Delbruck
Neuromorphic Engineering: In Memory of Misha Mahowald
Carver Mead
carver@caltech.edu
California Institute of Technology, Pasadena, CA 91125, U.S.A.
We review the coevolution of hardware and software dedicated to neu-
romorphic systems. From modest beginnings, these disciplines have be-
come central to the larger field of computation. In the process, their
biological foundations become more relevant, and their realizations
increasingly overlap. We identify opportunities for significant steps for-
ward in both the near and more distant future.
1 Introduction
Since before the dawn of computation devices, deep thinkers have mused
on the capabilities of brains and whether there could be machines that could
emulate some of their capabilities. With the advent of each new technol-
ogy, it became the model of “the kind of thing that must be going on in the
brain.” The brain must be like a clockwork, then a telegraph system, then
a telephone system, then a digital computer, and now like the Internet. So
the quest to make something that “works like the brain” has come to mean
very different things to different people.
The first issue of
Neural Computation
in 1989 contained papers based
on two seemingly totally unrelated views, leading to divergent threads of
endeavor:
The use of backpropagation in training multilayer neural networks,
based on the ability of computer programs running on general-
purpose computers to receive inputs and learn from them
The creation of special-purpose, low-power silicon integrated circuits
to analyze real-time sensory signals, based on the ability of living
creatures to respond to their environment in “intelligent” ways
Fast-forward to 2022, when, ironically, Moore’s law has propelled the first
thread onto the center stage of industrial computation, while the second
has only recently seen commercial development. We will find that this di-
vergence is illusory and that at a deeper level, the two threads are actually
converging. We explore how this came about and what it might be telling
us about the future.
Neural Computation 35, 343–383 (2023)
© 2022 Massachusetts Institute of Technology
https://doi.org/10.1162/neco_a_01553
Figure 1: Real cost of computation versus year. (Graph: Steve Jurvetson, CC BY
2.0.)
2 General-Purpose Computing
In 1989, neural networks were assumed to be computer programs that ran
on general-purpose computers. The evolution of general-purpose comput-
ing machines is well portrayed in Figure 1. Notice that computation is the
number of operations per second: we can always get more operations by
waiting longer.
How do we understand this remarkable evolution? Historically, over the
long run, the cost of computation has been directly related to the energy
used in that computation. Today’s electronic wristwatch does far more com-
putation than the Whirlwind did when it was built (Redmond & Smith,
1980). It is not just the computation itself that costs; it is the energy con-
sumed and the system overhead required to supply that energy and get rid
of the heat: the boxes, the connectors, the circuit boards, the power supply,
the fans—all of the superstructure that makes the system work. As the tech-
nology has evolved, it has always moved, in fits and spurts, in the direction
of lower energy per unit computation. That trend took us from mechani-
cal gadgets to relays to vacuum tubes to transistors to integrated circuits. It
was the force behind the transition from NMOS to CMOS technology that
happened around 1980.¹
Today, it is still, by Moore’s law, pushing us down
to nanometer sizes in semiconductor technology.
¹ For that reason, this review concentrates on energy per computation as the central
long-term theme. There are often other commercial considerations that dominate tech-
nology choices, but we will not dwell on them here.
Figure 2: (A) Energy per operation at the server level (Saunders, 2012). (B) En-
ergy to switch a single transistor (lesswrong.com). Both the square points in the
left plot and the round blue points in the right plot represent Intel technology.
The energy dissipated in a digital switching event is ½CV², where C is the capacitance and V is the signal voltage. As the individual elements of the integrated circuit (transistors and wires) have been made smaller over the years, the capacitance C of each electrical node has decreased and the power-supply voltage V has decreased accordingly. Thus, the energy required to charge an electrical node from a logic-0 to a logic-1 has decreased.
As the transistors are made smaller, the length of the path an electron trav-
els to move from one side of a transistor to the other gets shorter. The time
required for a transistor logic circuit to switch is directly proportional to this
“transit time.” As the dimensions shrink, the distance from one transistor
to another decreases, the wire connecting them will thus be shorter, so it
will take signals less time to get from one end to the other. The net result
is that smaller feature sizes create computing circuits that run faster and
use much less energy. That trend is shown in Figure 2. When we compare
the two graphs for 2010, a server operation cost an energy of 10⁻³ J, and switching a single transistor cost 10⁻¹⁶ J. There is a factor of 10¹³ between
the energy to make a transistor work and that required to do an operation
the way we do it in a digital computer.
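For readers who want to check the arithmetic, the short sketch below reproduces the order of magnitude of this comparison; the node capacitance and supply voltage are illustrative values chosen by us, not numbers taken from the text.

```python
# Order-of-magnitude check of the switching-energy comparison above.
# The node capacitance and voltage are illustrative, not taken from the text.
def switching_energy(c_farads, v_volts):
    """Energy dissipated in one digital switching event, E = (1/2) C V^2."""
    return 0.5 * c_farads * v_volts ** 2

e_transistor = switching_energy(c_farads=2e-16, v_volts=1.0)   # ~0.2 fF node at 1 V
e_server_op = 1e-3                                             # J per server-level operation, 2010
print(f"single switching event: {e_transistor:.1e} J")         # ~1e-16 J
print(f"server op / switching event: {e_server_op / e_transistor:.0e}")  # ~1e13
```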
There are two primary causes of energy dissipation in the general-
purpose digital systems we built in 2010, and a similar one (but perhaps
a bit better) in the systems we build today:
1. We lose a factor of more than 1000 because of the way we build digital
hardware. Even on the same chip, very few signals are local—most of
the energy is expended moving data around rather than using those
data where they originate. The capacitance of the gate is only a very
small fraction of the capacitance of the wire that takes the signal from where it originates to where it is used, so we spend most of our energy charging up the capacitance of the wires and not just the
transistor gates. Signals that go off the chip onto a circuit board cost
at least another two orders of magnitude in time and three orders of
magnitude in energy;
2. We use far more than one transistor to do an operation. “Today’s
many-core processors have parallel single instruction multiple data
(SIMD) instruction sets for floating-point, and dedicate around 1
million transistors per core to handling floating-point operations”
(INRC, 2021).
The factor-of-1000 opportunity requires us to make algorithms more lo-
cal, so that we do not have to ship the data all over the place. That is a
big win: we have built digital chips that way and have achieved factors
of 10 up to 1000 reduction in power dissipation for the same operation.
Factors like that have gradually been making their way into the systems
shown in Figure 1. Whenever there are operations that we want to do a
lot of, they start bogging down the main computer—things like interfacing
with the printer or the network, sound output, or the display screen—and
some clever people figure out a way to make a special-purpose digital cir-
cuit that just does that thing. These application-specific integrated circuits can
be tightly crafted to minimize wiring overhead and data movement, since
they do not have to be “general purpose.” We will see a lot of this later.
By the year 1989 that we are commemorating, workstations and high-
end PCs included built-in support for sound and quality color graphics.
It was the games market that drove the development of quality computer
graphics for many years, since interaction time is key for games. The de-
mand for higher-quality graphics drove the development of increasingly
sophisticated graphics processing units (GPUs), devices that end up play-
ing a pivotal role in our story.
Also in 1989, the first commercial Internet service providers (ISPs)
emerged in the United States and Australia, and what had been the
ARPANET, along with a motley assortment of small and large private networks, was merged to become the Internet. To provide high-bandwidth
interconnection over long distances required the development of an en-
tirely new technology, and a relatively new field of endeavor, quantum
optics, took center stage. Optical fibers with unimaginably low loss were
developed, along with new kinds of semiconductor lasers, modulators,
wavelength-division multiplexors—and the list goes on. Residual attenu-
ation in fibers is compensated by ingenious in-fiber amplification, enabling
a single fiber to transmit many optical channels, each channel carrying hun-
dreds of gigabits per second, over spans of thousands of kilometers.
Also in 1989, Sequoia Data Corp. introduced Compumarket, the first
Internet-based system for e-commerce. Sellers could post items for sale, and
buyers could search the database and make purchases with a credit card.
From this modest beginning, the Internet has emerged as host for an entire
digital economy that has become a substantial fraction of the entire world
economy. In addition to sellers of traditional goods and services, the Inter-
net makes possible entirely new kinds of enterprises that offer entirely new
information-based products and services. Successful enterprises in the dig-
ital economy have grown to become some of the most highly valued com-
panies in the world. Because the value created by many of these firms is
based on information rather than on physical goods, they have developed
huge data centers to house large clusters of computation and data-storage
servers, the term
server
indicating “accessible from the Internet.” Thus “the
cloud” came to be populated with enormous computing resources and even
more enormous troves of organized data.
The wide availability of computers equipped with GPUs led those with
applications that required heavy specialized processing to adapt their algo-
rithms to run on GPUs. For anything that was a bit different from standard
graphics processing, those adaptations were always a challenge.
3 Neural Networks
The neural network community arose from several disparate sources:
Computer scientists, disappointed by the many high-profile failures
in traditional artificial intelligence (AI) and believing that neurobiol-
ogy provided clues that might enable a way forward
Neurobiologists, who believed that computer simulation might pro-
vide a path to better understand their neural observations
Engineers and physicists, who believed that the brains of animals
worked on principles vastly different from those in known comput-
ers and that discovering these principles was the key to an entire new
computing paradigm
Theorists who believed that the key to understanding was through
mathematical analysis
The field became popular immediately in academic circles. With the introduction of the backpropagation algorithm for minimization of error (Rumelhart, Hinton, & Williams, 1986; Durbin & Rumelhart, 1989) as an effective way to train multilayer networks, slow but steady progress was being made on problems of interest in the real world. So until 2012, this largely academic discipline was chugging along, with occasional spinoffs into the commercial sector. Then, in 2012, all that changed.
3.1 The Tipping Point. Alex Krizhevsky (2021), a young Ukrainian immigrant living in Canada, was good at coding—so maybe he should just get a job doing that—it paid good money. Then he stumbled on a story about machine learning and found a group at the University of Toronto that specialized in just that. Alex was admitted to the university's PhD program, with Geoffrey Hinton as his adviser. Hinton had invented a class of networks called restricted Boltzmann machines and was pursuing their
application to image recognition, so Alex joined the effort. By 2009, he
had finished his MSc thesis, “Learning Multiple Layers of Features from
Tiny Images” (Krizhevsky & Hinton, 2009). In the process, he and his fel-
low graduate student Vinod Nair had developed two important data sets:
CIFAR-10 and CIFAR-100 (Canadian Institute for Advanced Research, 10
and 100 classes)—both labeled subsets of the 1.6 million “tiny images”
data set.
Both Alex and Nair continued working on image-recognition problems.
Nair was still working with restricted Boltzmann machines in 2010 when
he discovered that using a rectified linear unit (ReLU) nonlinearity in place
of the almost universal tanh improved their learning performance (Nair &
Hinton, 2010). It has turned out that the ReLU, which resembles activa-
tion curves found in many biological systems much more closely than a
tanh does, has a number of advantages in deep neural networks and now
is nearly universally adopted.
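The text attributes the ReLU's advantage to its closer resemblance to biological activation curves and to its empirical performance; one commonly cited practical reason, illustrated in the hypothetical sketch below, is that the ReLU's slope does not saturate for positive inputs the way tanh's does.

```python
import numpy as np

# ReLU versus tanh at a few operating points: for positive inputs the ReLU's
# slope stays at 1, while tanh saturates and its derivative collapses toward
# zero, starving backpropagation of gradient in deep networks.
q = np.array([-2.0, -0.5, 0.0, 0.5, 2.0, 5.0])
relu, d_relu = np.maximum(q, 0.0), (q > 0).astype(float)
tanh, d_tanh = np.tanh(q), 1.0 - np.tanh(q) ** 2
for qi, r, dr, t, dt in zip(q, relu, d_relu, tanh, d_tanh):
    print(f"q={qi:5.1f}  relu={r:4.1f} (slope {dr:3.1f})   tanh={t:6.3f} (slope {dt:5.3f})")
```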
Alex experimented with modified versions of the ReLU in a number
of convolutional deep belief networks (which contained restricted Boltz-
mann machines). He trained them on the tiny-images data set and tested
the trained networks on the CIFAR-10 test set. His best network achieved a
78.9% accuracy, roundly beating the previous record of 74.5% (Krizhevsky
& Hinton, 2010). In his paper, he comments, “The most computationally-
intensive networks that we describe here take 45 hours to pre-train and 36
hours to fine-tune on an Nvidia GTX 280 GPU. By far, most of the time is
spent squeezing the last few fractions of a percent from the nets.”
Alex and his adviser took different lessons from this result. Hinton con-
cluded that “current methods for recognizing objects in images perform
poorly and use methods that are intellectually unsatisfying . . . artificial
neural networks should use local ‘capsules’ that perform some quite com-
plicated internal computations on their inputs and then encapsulate the
results of these computations into a small vector of highly-informative out-
puts” (Hinton, Krizhevsky, & Wang, 2011). So he put Alex on a project to
learn how to train input layers of networks to learn such “capsules,” think-
ing that there might be a respectable PhD thesis in it for Alex.
But Alex had blood in his teeth: he had seriously beaten the tiny-images
record, and his paper on that accomplishment had not even been published!
And that accomplishment had not required anything “intellectually satis-
fying.” What he needed was more powerful GPUs and to squeeze every
possible operation out of them.
From his work with various-sized networks and data sets of various-
sized images, he had developed a “gut feeling” that training a network with
many more layers and therefore many more weights was possible by very
clever use of the detailed capabilities of the GPUs. He set out to create a
computing environment based on a graphics system with two of the new,
state-of-the-art GTX 580 3GB GPUs—with many long nights in the bow-
els of the GPUs, painfully finding every possible connection and its exact
timing. Then the fog cleared: he found a mapping of a multilevel network
onto the GPUs that used all their capability maximally effectively!
In 2011, Alex’s fellow graduate student Ilya Sutskever found out about the
ImageNet LSVRC contest to classify 1.2 million high-resolution images into
1000 different classes. This is the event where, each year, the heavyweights
in neural network technology compete to create and train their network on
a really hard problem. The prize goes to the network with the lowest error
rate in solving that problem. Ilya thought that Alex’s experience in crafting
neural network architectures and his mastery of the new GPUs might be up
to the challenge, so the two decided to go for it.
The role of “intellectually satisfying” reasoning started on page 1 of their
paper describing the results:
To learn about thousands of objects from millions of images, we need a
model with a large learning capacity. However, the immense complexity
of the object recognition task means that this problem cannot be specified
even by a dataset as large as ImageNet, so our model should also have lots
of prior knowledge to compensate for all the data we don’t have. Con-
volutional neural networks (CNNs) constitute one such class of models,
. . . Despite the attractive qualities of CNNs, and despite the relative ef-
ficiency of their local architecture, they have still been prohibitively ex-
pensive to apply in large scale to high-resolution images. Luckily, current
GPUs, paired with a highly-optimized implementation of 2D convolu-
tion, are powerful enough to facilitate the training of interestingly-large
CNNs, and recent datasets such as ImageNet contain enough labeled
examples to train such models without severe overfitting (Krizhevsky,
Sutskever, & Hinton, 2017).
Overfitting
is the term neural network people use for the tendency
of networks to “memorize” all the examples in the training set but not
“generalize”—recognize examples that were not in the training set. Their
paper has an entire section—“Reducing Overfitting”—from which we
learn that
the easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations. . . . We employ two distinct forms of data augmentation, both of which allow transformed images to be produced from the original images with very little computation. . . . The first form of data augmentation consists of generating image translations and horizontal reflections. . . . This increases the size of our training set by a factor of 2048. . . . The second form of data augmentation consists of altering the intensities of the RGB channels in training images. . . . This scheme approximately captures an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination.
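As a concrete (and hypothetical) illustration of such label-preserving transformations, the sketch below applies a randomly shifted crop and a random horizontal reflection to an image array; the image size and shift range are arbitrary, not those of the original pipeline.

```python
import numpy as np

def augment(image, rng, max_shift=4):
    """Label-preserving augmentation: a randomly shifted crop (edge-padded)
    plus a random horizontal reflection. Sizes and shift range are arbitrary."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    padded = np.pad(image, ((max_shift, max_shift), (max_shift, max_shift), (0, 0)), mode="edge")
    h, w, _ = image.shape
    out = padded[max_shift + dy: max_shift + dy + h, max_shift + dx: max_shift + dx + w]
    if rng.random() < 0.5:
        out = out[:, ::-1]            # horizontal reflection: the label is unchanged
    return out

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))         # stand-in for a training image
print(augment(img, rng).shape)        # (32, 32, 3): same shape, same label
```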
These data set augmentation techniques are known as “hints” and were
formally introduced by Abu-Mostafa in the cover paper of the July 1995
issue of
Neural Computation
(Abu-Mostafa, 1995). Even with these “hint” en-
largements, Alex’s network still exhibited substantial overfitting. The sec-
tion continues:
The recently-introduced technique, called “dropout” (Hinton, Srivastava, Krizhevsky, Sutskever, & Salakhutdinov, 2012) . . . consists of setting to zero the output of each hidden neuron with probability 0.5. The neurons which are “dropped out” in this way do not contribute to the forward pass and do not participate in backpropagation. So every time an input is presented, the neural network samples a different architecture, but all these architectures share weights. . . . We use dropout in the first two fully-connected layers. . . . Without dropout, our network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.
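A minimal sketch of dropout as described in the quote is given below; the test-time scaling by (1 − p) follows the original formulation and is a detail not spelled out in the quote itself.

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True, rng=None):
    """During training, zero each hidden unit's output with probability p_drop,
    so every presentation samples a different architecture with shared weights.
    At test time, keep all units and scale by (1 - p_drop), as in the original
    formulation, so the expected input to the next layer is unchanged."""
    if training:
        rng = rng or np.random.default_rng()
        mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)
        return h * mask
    return h * (1.0 - p_drop)

h = np.random.default_rng(1).random(8)
print(dropout(h))                   # roughly half the activations zeroed
print(dropout(h, training=False))   # deterministic, scaled copy for inference
```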
So what was required to get good classification performance and still
keep the network from overfitting was a combination of “intellectually sat-
isfying” techniques and a “big hack attack”—trying a lot of variations of
network architecture, guided by Alex’s increasingly keen intuition for what actually worked. It was a really hard year! It took six days to train a net-
work on all those images and would have taken weeks or months without
Alex’s “highly-optimized GPU implementation of 2D convolution and all
the other operations inherent in training convolutional neural networks.”
It took six months for these experiments to get to the same performance as
published results for the 2010 competition. Hinton had been skeptical but
could see that this work was going to become valuable, so he organized
DNNresearch (Lardinois, 2013), with Alex, Ilya, and himself as owners.
The networks were getting better. By June 2012, when the training
and validation data for the ILSVRC 2012 were released, Alex’s single net-
work was regularly producing less than 20% error rate on the 2010 data
set—soundly beating the best published value of 25.7%. Nonetheless, oth-
ers were making progress, and the submission was not due until mid-
September. The whole trio was now fully engaged in the effort—this is
a
huge opportunity!
They found that the network could be improved by
“pretraining” with the entire 2011 ImageNet Fall release—tune, tweak . . .
September—deadline extended to September 30—tune, tweak—they sub-
mitted two entries:
1. The average of five networks: Trained only on the training data sup-
plied for this 2012 contest
2. The above average: Further averaged with two networks trained on
the entire fall 2011 release
Get some sleep! Hold your breath! Think about something else . . .
On October 8, preliminary error rates were released to participants—WOW! Alex: 15.3%; next best: 26.2%. They had blown it away! The official release
of the full results was not for another five days, but the word traveled like
wildfire!
Alex capsuled his own deep belief: “Our results show that a large, deep
convolutional neural network is capable of achieving record-breaking re-
sults on a highly-challenging dataset using purely supervised learning. . . . All of our experiments suggest that our results can be improved simply by
waiting for faster GPUs and bigger datasets to become available.”
Yann LeCun, creator of convolutional neural networks, is quoted as say-
ing, “This is proof—AlexNet is an unequivocal turning point in the history
of computer vision!”
Thus, with “AlexNet,” the “New AI” was born: the key to the future was
scale—more data, more computing cycles.
Google had started the Google Brain project in 2011 to use its huge GPU
infrastructure to implement a truly enormous distributed deep-learning
system, originally called DistBelief. Google had been supporting DNNre-
search, and, a few months after the ILSVRC 2012 results were released,
Google acquired DNNresearch. Clusters of GPUs became widely available
as part of cloud computing, with more and more sophisticated methods
developed for mapping the operations involved in deep learning onto the
things that GPUs did the best. The size of networks that could be trained
grew by orders of magnitude.
Success emboldened researchers to try approaches that appeared to
be “brute force,” “wasteful,” “impractical,” or even “stupid,” such as us-
ing individual pixel values as direct inputs to the first layer of a deep
network. Success in one application area encouraged bolder approaches
in other areas—“If more is better, go for it!” The result is evident in
Figure 3.
The Google team created many highly successful deep-neural-network-
based products, among them the speech-recognition system that is used as a
front end to Search, and TensorFlow—an open-source software library that
allows anyone to use machine learning by providing the tools to train their
own neural network. To supplement free access to TensorFlow, Google has
made a lot of free time on its GPU clusters available to researchers devel-
oping and training their own networks. A huge number of innovations in
network concept, application, and architecture ensued, contributed by rank
amateurs and old-timers alike. Hinton’s capsules, for example, came back in
full force in ResNets, and on it goes.
As can be seen in Figure 3, as a result of success on the part of many de-
velopers, even the latest GPUs, when fielded in clusters in large data cen-
ters, threatened to require resources beyond those available.
But for every deep neural network that gets trained, many copies of the
net with fixed weights are parceled out for users to simply use by applying
their own inputs and evaluating the corresponding outputs. This phase of
a network’s life, called
inference, is vastly simpler than training the net, but
many more people are doing it.
Figure 3: AlexNet starts a tidal wave! (OpenAI, https://arkinv.st/2ZOH2Rr.)
To address what looked like a looming disaster from a large number of
potential users, in 2013 Google launched a project to design a custom sili-
con chip, highly optimized for the operations most used in the application
of neural nets. The project was under wraps until 2017 when an excellent
paper appeared describing the chip, its history, and its performance (Jouppi
et al., 2017).
By 2015, Norm Jouppi and his team had developed and fielded in its
data centers a custom accelerator chip, the tensor processing unit (TPU).
For neural network inference applications—evaluation of inputs on an
already-trained network—they claimed 30 to 80 times more operations per
watt-second than Intel CPUs and Nvidia GPUs of the same technology gen-
eration. Norm Jouppi, lead author on the Google announcement, explained
how the project had come about: “The need for TPUs really emerged about
six years ago, when we started using computationally-expensive deep
learning models in more and more places throughout our products. The
computational expense of using these models had us worried. If we consid-
ered a scenario where people use Google voice search for just three minutes
a day and we ran deep neural nets for our speech recognition system on the
processing units we were using, we would have had to double the number
of Google data centers!” (Jouppi et al., 2017).
In 2017, when that statement was written, Google’s data centers used
10¹³ watt-hours per year, thus running an average power of 10⁹ watts. Google had 10⁹ active users, so a single user accounted for an average of
just about 1 watt continuous. Three minutes’ worth is 1/500 of a day, so to double the power usage per user, Google voice search on deep neural nets would be using 500 times the average power per user, or about 500 watts, just doing inference, while it was being used.
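The arithmetic behind this estimate is simple enough to restate. The sketch below uses the figures quoted above (the text rounds 3 minutes per day to 1/500) and lands near the same 500 watt figure.

```python
# The arithmetic behind the "500 watts while in use" estimate, using the figures above.
annual_energy_wh = 1e13                      # data-center usage, watt-hours per year
avg_power_w = annual_energy_wh / (365 * 24)  # ~1e9 W average
users = 1e9
per_user_w = avg_power_w / users             # ~1 W continuous per active user
duty_cycle = 3 / (24 * 60)                   # 3 minutes per day (the text rounds this to 1/500)
print(f"average per-user power: {per_user_w:.2f} W")
print(f"power while in use to double it: about {per_user_w / duty_cycle:.0f} W")
```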
The TPU chip used the group’s knowledge of the flow of data in
the execution of the neural inference algorithm to arrange the multi-
ply/accumulate units physically next to each other so the data only need to
move to the nearest neighboring units in the order that the operations are
executed, as Jouppi describes:
The matrix unit uses systolic execution² to save energy by reducing reads and writes of the Unified Buffer. Data flows in from the left, and the weights are loaded from the top. A given 256-element multiply-accumulate operation moves through the matrix as a diagonal wavefront. The weights are preloaded, and take effect with the advancing wave alongside the first data of a new block. Control and data are pipelined to give the illusion that the 256 inputs are read at once, and that they instantly update one location of each of 256 accumulators (Jouppi et al., 2017).

² The term systolic was introduced by Kung (1979, 1982) and Kung and Leiserson (1979).
The TPU is a beautiful example of minimizing data movement and in-
terspersing memory with processing.
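To make the dataflow concrete, here is a small behavioral simulation of a weight-stationary systolic matrix-vector multiply. It captures only the diagonal-wavefront idea described in the quote; the 256 × 256 size, the Unified Buffer, and the pipelining of the real TPU are not modeled, and the function name is ours.

```python
import numpy as np

def systolic_matvec(W, x):
    """Behavioral sketch of a weight-stationary systolic array (not the TPU itself).

    W : (I, J) weights, one per processing element, preloaded and held stationary.
    x : (I,)   input vector; element i is injected at the left edge at cycle i,
               so the computation sweeps the array as a diagonal wavefront.
    Activations move one column right per cycle, partial sums move one row down
    per cycle, and output j drains from the bottom of column j."""
    I, J = W.shape
    x_reg = np.zeros((I, J))          # activation registers (flow left -> right)
    p_reg = np.zeros((I, J))          # partial-sum registers (flow top -> bottom)
    out = np.zeros(J)
    for t in range(I + J):
        new_x = np.zeros((I, J))
        new_p = np.zeros((I, J))
        for i in range(I):
            for j in range(J):
                a = x_reg[i, j - 1] if j > 0 else (x[i] if t == i else 0.0)
                p_in = p_reg[i - 1, j] if i > 0 else 0.0
                new_x[i, j] = a
                new_p[i, j] = p_in + W[i, j] * a
        x_reg, p_reg = new_x, new_p
        for j in range(J):
            if t == (I - 1) + j:      # the wavefront reaches the bottom of column j
                out[j] = p_reg[I - 1, j]
    return out

rng = np.random.default_rng(0)
W, x = rng.standard_normal((4, 3)), rng.standard_normal(4)
print(np.allclose(systolic_matvec(W, x), x @ W))   # True
```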
So even with gigawatts of power used by data centers, the execution of
deep neural networks has required two steps of specialized silicon circuit
design—the first for GPUs and the second for TPUs—in order to bring the
power usage within reason. What started as a pure software endeavor is
ending as an ever-increasing specialization of silicon circuits for performing
neural operations.
More recently, what used to be GPUs aimed at speeding up graphics
have increasingly morphed into dedicated deep-learning “accelerators.”
Nvidia just announced the H100 (Andersch et al., 2022, named after Grace
Hopper)—a huge chip (more than 8 cm²), with nearly half of its dense-
processing area devoted to tensor operations. When fully running, the chip
dissipates 700 watts. Cerebras (2022) has taken an even more audacious ap-
proach and made a wafer-scale system. Both companies realized that most
of the time and power consumption goes into moving data around. And sig-
nals that need to go off the chip are by far the worst. The hardest part of go-
ing to a very large chip (like Nvidia) or wafer scale (like Cerebras) is dealing
with the “bad spots” on the wafer; normal semiconductor practice is to just
throw away the chips that have bad spots. But it is a universal experience
in multilayer networks that weights and data are inherently sparse, so if a value is zero, you don’t need to send it. That idea generalizes to the silicon itself: if a region of the array is defective, don’t route data through it. Both companies use this technique to obtain good silicon use in spite
of unavoidable defects. To take full advantage of sparsity, Intel’s Loihi sys-
tem (INRC, 2021), SpiNNaker (Hoppner et al., 2021), Cerebras’s wafer-scale
system (2022), and others use a message-based, event-driven interconnect
protocol (more on this important topic later). For example, a multiply is not
triggered until both operands are received. The Cerebras wafer-scale “chip”
is 215 mm on a side. When fully running, it dissipates 20 kW, so it is water-
cooled, but fits in a compact cabinet. It is claimed to do the work of an entire
data center cluster, so having a personal version like this reminds us of the
days when a computer required an entire air-conditioned room in a com-
puting center and was accessed by time-sharing. And then minicomputers
came along.
4 Inference with Continuous Variables
As described above, inference in an n-deep network is basically n layer operations. Let’s look at a layer that has i inputs and must compute j outputs, which become inputs for the following layer. That layer operation computes each output Q_j by multiplying each input V_i by a weight W_ij, adding the result to the total for that output, and applying a nonlinear operation f(Q) to the sum, which becomes one input V_j to the next level:

$$Q_j = \sum_i W_{ij}\,V_i, \qquad V_j\ (\text{level } n+1) = f(Q_j)\ (\text{level } n). \tag{4.1}$$
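A minimal NumPy rendering of equation 4.1 is shown below, with ReLU standing in for f(Q) and random weights standing in for trained ones.

```python
import numpy as np

def relu(q):
    return np.maximum(q, 0.0)

def infer(v, weights, f=relu):
    """Equation 4.1, applied layer by layer: Q_j = sum_i W_ij V_i, then V_j = f(Q_j)."""
    for W in weights:          # one (i x j) weight matrix per layer
        v = f(v @ W)
    return v

rng = np.random.default_rng(1)
weights = [0.3 * rng.standard_normal((8, 8)) for _ in range(4)]   # 4-deep, 8-wide toy net
print(infer(rng.standard_normal(8), weights))
```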
The values of the variables in equation 4.1 are continuous, and the weights
are arrived at by a learning procedure (typically backpropagation) that in-
crementally adjusts the W_ij to minimize the error after each presentation
or learning cycle. It is imagined that there is some error surface in weight
space, and we wish to slide down to a minimum in that surface. Now a
digital solution inherently quantizes the inputs, weights, and outputs to
discrete values, so the error surface is no longer smooth but has steps all
over it. So when we change the weights a bit, the output might not change,
might change a little, or might change a lot, depending upon whether it is
close to the inevitable discontinuities or flat spots in the error surface due to
discretization of the variables involved. For that reason, digital implemen-
tations of deep neural nets use a floating-point representation when they
train the network by tweaking the weights. It is not that the absolute accu-
racy of the weights is necessarily high, since the weights are often truncated
to short integers for the inference process (Dally et al., 2018), but it is an at-
tempt to emulate a system that has a smooth derivative. So the first question
we should ask is: Instead of faking quasi-continuous variables with binary
values, why not use continuous physical variables to start with? This exper-
iment has only been done at small scale, so it remains for future innovators
to determine if it is feasible for truly large deep neural nets.
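A toy illustration of why the discretized error surface matters: the sketch below evaluates the loss of a one-weight model with the weight kept continuous and with the weight rounded to a hypothetical 4-bit grid, showing the flat plateaus and jumps that quantization introduces.

```python
import numpy as np

# A one-weight model fitting y = 0.37 * x: the loss as a function of the weight
# is smooth if the weight stays continuous, but develops flat plateaus and jumps
# if the weight is first rounded to a (hypothetical) 4-bit grid.
def loss(w, w_true=0.37, x=1.0):
    return (w * x - w_true * x) ** 2

def quantize(w, bits=4, w_max=1.0):
    step = 2 * w_max / (2 ** bits - 1)        # uniform grid on [-w_max, +w_max]
    return np.round(w / step) * step

for w in np.linspace(0.30, 0.45, 16):
    print(f"w={w:5.3f}   smooth loss={loss(w):9.6f}   4-bit loss={loss(quantize(w)):9.6f}")
# The smooth column responds to every small change in w; the quantized column is
# piecewise constant, which is why training uses high-resolution weights even
# when inference later truncates them to short integers.
```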
4.1 Inference in Analog Silicon.
Inference is the easy part; training is
much harder. So let’s see if there is a good way to do inference in a way that
is more natural than Jouppi’s TPU chip.
Electrically minded people realize immediately that if each weight is
stored as a charge on the floating gate of an MOS transistor, the input is
a voltage on the source of that transistor, and the drain current of that tran-
sistor (with the drain voltage held constant) will be some approximation to
the product of the stored charge W_ij and the input voltage V_i. When that current is injected into a wire, shared with all i floating-gate transistors for the same output j, the resulting total current Q_j in the wire is the sum of the currents from all transistors connected to that wire (Kirchhoff would be happy):

$$Q_j \propto \int \sum_i W_{ij}\,V_i \, dt, \qquad V_j = f(Q_j). \tag{4.2}$$
There are many ingenious but simple transistor circuits that produce smooth f(Q_j) nonlinearities (including smooth versions of the ReLU function discussed earlier) as a natural part of the current-to-voltage conversion that produces the input V_j to the next layer and clamps the wire voltage, so the array itself works in current-steering mode, thus dissipating very little energy. The interlayer circuits also implement a renormalization of the signal levels; they are the only part of the network that dissipates any significant power, which is thus order(j), not order(j²).
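The sketch below is a purely behavioral model of one such crossbar layer, in the spirit of equation 4.2: stored weights with a small assumed device mismatch, current summation on shared wires, and a softplus standing in for a smooth ReLU-like interlayer function. It is an illustration, not a circuit simulation.

```python
import numpy as np

def smooth_relu(q, k=4.0):
    """Softplus: a smooth ReLU-like stand-in for the interlayer function f(Q_j)."""
    return np.log1p(np.exp(k * q)) / k

def crossbar_layer(v_in, W, rng, mismatch=0.02):
    """One floating-gate crossbar layer in the spirit of equation 4.2.

    Each cell contributes a current proportional to W_ij * V_i; device-to-device
    variation is modeled as a small multiplicative error on the stored weight.
    The shared output wire sums the cell currents (Kirchhoff), and the interlayer
    circuit converts the summed current to the next layer's input through a
    smooth nonlinearity."""
    W_eff = W * (1.0 + mismatch * rng.standard_normal(W.shape))
    q = v_in @ W_eff                  # current summation on the output wires
    return smooth_relu(q)

rng = np.random.default_rng(2)
W = 0.3 * rng.standard_normal((16, 16))
v = rng.random(16)
ideal, analog = smooth_relu(v @ W), crossbar_layer(v, W, rng)
print("max deviation from ideal with 2% device mismatch:", np.max(np.abs(analog - ideal)))
```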
In all the digital training experience over the years, it has been found that
the form of the function f(Q) can make a substantial difference in the re-
sults obtained (Nair & Hinton, 2010). In analog circuits, quite sophisticated
functions can be introduced that cost nothing in either energy or time. Such
functions have a long history in the analog world; the ones that work reli-
ably in practice are sufficiently rare that each is referred to by the inventor’s
name (e.g., the Gilbert multiplier).
In 1995, the inference task described above received an efficient small-scale proof of concept in just such a floating-gate MOS technology, as shown on the left in Figure 4.
By 2011, a programmable analog array (Schlottmann & Hasler, 2011) us-
ing similar floating-gate synapse technology to store the weights could be
field-programmed to do inference tasks with 1000 times less energy and
100 times less silicon area than the best digital solutions of the day. The
current-mode differential synapse arrangement used is shown in Figure 4A.
That arrangement can be programmed to produce two individual output
currents or one differential output.
The i × j array has these important properties:
Each cell stores the weight W_ij in a nonvolatile manner.
The cell multiplies the input by the stored weight to produce its contribution to the output current.
The output wire, an integral part of the cell, sums the cell’s output current with that of the other cells.
The actual synapse circuit in the cell can (in a special process similar to EEPROM or Flash) be as small as a single transistor.

Figure 4: Floating-gate analog vector-matrix multiply arrays. (A: Diorio, Hasler, Minch, & Mead, 1996; B: Schlottmann & Hasler, 2011.)
The widely acclaimed GPT-3 network has 175 billion weights connecting
8.3 million units arranged 384 layers deep (96 × 3) and 49,152 units wide. To compare a digital hardware implementation with a partially analog one, we consider a 50,000-wide, 400-layer network, with each layer fully connected to the next—for round numbers—10¹² weights.
In the 16 nm digital implementation described by Dally et al. (2018), each 4-bit multiply costs 10⁻¹⁴ J and the shift/add costs about the same, so just the multiply/shift/add for the entire network would cost 2 × 10⁻² J per inference step. But just feeding the digital weights to the processing unit on the same chip takes 10⁻¹² J per weight, which is 100× the energy cost of the local multiply/add operations. If the architecture allows the weights to be reused without moving them, the data-movement energy cost is reduced by the reuse factor, which Dally et al. (2018) give as 32. The weight movement plus multiply/add energy then comes to 0.05 J per presentation for a fully on-chip implementation. Used at a presentation rate of 30/sec, this digital chip would dissipate 1.5 watts, and 10 to 100 times more power for weight storage on separate memory chips.
Let’s estimate what that would look like in dedicated analog silicon. In a modern flash memory process, each synapse could be 30 nm = 3 × 10⁻⁸ m on a side (Goda, 2020). So each layer array with 50,000 inputs and 50,000 outputs could be 1.5 mm on a side. Since the outputs of one array go to the inputs of the next array, the entire network can be tiled into a
20 × 20 array. When laid out in this way, the entire inference engine—400 layers each having 50,000 inputs and 50,000 outputs with all weights stored locally—would fit on a 30 mm × 30 mm chip—only slightly larger than the latest digital GPU/TPU chips, which require separate memory chips for weight storage—and pay the attendant energy and time penalty. The analog chip would execute 10¹² multiply/add operations per presentation and require 2 × 10⁷ interlayer functions. As in the digital implementation, the big energy drain will be the input interlayer circuits driving the input voltage V_i across the array. The total current addition on the array will require less energy, as the voltage across the weight transistors is less than the full signal voltage, and the total current is then supplied by the interlayer circuits. If we make the conservative assumption that each interlayer function costs the same energy as one of Dally’s digital multiply/add plus one wire driven across the 1.5 mm width of the array, that would be 10⁻¹³ J per wire per presentation. For all 400 × 50,000 interlayer circuits, the energy would come to a few × 10⁻⁶ J per presentation. At 30 presentations/sec the chip would dissipate about 10⁻⁴ watt. Pipelining between arrays, which can speed up throughput by substantial factors, can be accomplished in analog as it can in digital technology. By using this fully unrolled design, we have merged the memory and information processing, an approach that goes under the rubric of “in-memory computation” or “in-memory processing” and has recently received considerable attention (Sebastian, Le Gallo, Khaddam-Aljameh, & Eleftheriou, 2020; Liu, 2022; Mythic-ai.com, 2021; ISSCC, 2022).
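The sketch below restates the two estimates using only the numbers quoted in this section; its output is on the order of the roughly 10,000× figure in the bottom line that follows.

```python
# Back-of-envelope restatement of the digital and analog estimates above,
# using only the figures quoted in this section.
weights = 1e12                  # 400 layers x 50,000 x 50,000, rounded
rate = 30                       # presentations per second

# Digital (Dally et al., 2018): 4-bit multiply plus shift/add, plus weight movement
# amortized over a reuse factor of 32, all on one chip.
e_mac = 2e-14                   # J per multiply/shift/add pair
e_move = 1e-12 / 32             # J per weight, after reuse
e_digital = weights * (e_mac + e_move)
print(f"digital: {e_digital:.3f} J/presentation -> {e_digital * rate:.2f} W")

# Analog: weights resident in the array; the dominant cost is the interlayer
# circuits, taken conservatively at 1e-13 J per wire per presentation.
e_analog = 400 * 50_000 * 1e-13
print(f"analog:  {e_analog:.1e} J/presentation -> {e_analog * rate:.1e} W")
print(f"ratio:   ~{e_digital / e_analog:.0f}x")
```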
There are several risks and costs associated with this analog approach:
1. The huge advantage of digital technology is that each signal is re-
stored at each step of processing or storage. In contrast, traditional
analog signals are not restored; signals propagating in any cascade of
analog sections accumulate error and noise as they propagate, so any
workable system must restore its signals after at most a few stages. In
the above system, this restoration would logically take place in the
current-sense circuits of at least every few network layers. Thus, a
practical analog implementation will actually be a hybrid system—
analog computation followed by “digital” restoration. All network
designs do some version of restoration in the f(Q) interlayer func-
tion, but it will only become clear what effect the analog-necessitated
restriction has on the overall capability of a many-level network by
actually comparing one with and without these restrictions. Seems
like a good PhD thesis project.
2. Moore’s law has, over many process generations, optimized silicon
fabrication processes for the densest digital circuits. The result has
been that analog properties are, if not totally neglected, at least rel-
egated to minor status. Reoptimizing a process for analog perfor-
mance is a nontrivial undertaking.
3. As processes have scaled down in dimensions and supply voltage,
the number of electrons representing a full-scale analog signal (or
digital “one”) has decreased exponentially, yet it is still far from the
true digital limit, where the presence or absence of a single elec-
tronic charge reliably represents a single digital bit. In this interme-
diate range of physical and electronic scales, inherent variations of
properties from one transistor to another become a larger and larger
proportion of a full-scale signal. Thus, the above approach increases
the need for expert circuit design at the hardware level and clever
resource allocation at the system level to work around individual-
transistor variation (and chip defects, as discussed earlier).
4. Programming the chip for a specific network is an iterative process:
each weight must be read, changed in the direction required, and
then read again, and the cycle repeated until the desired weight is
attained. This process requires a variable time for each weight on an
expensive, dedicated programming facility;
5. Reprogramming after deployment requires the equivalent of the pro-
gramming facility, which would require a level of power that may not
be compatible with the overall low-power environment.
The Bottom Line
1. An analog inference chip with resident weight storage can poten-
tially achieve 10,000 times lower power and about the same silicon
area as a single-chip self-contained digital implementation.
2. The analog chip would have different, and potentially more, re-
strictions than an efficient digital chip, with currently unknown
consequences.
3. This programmable inference chip could be realized with a variant
of today’s EEPROM or flash memory technology.
4. However, realizing the level of signal retention required for an analog
representation could take a lot of development.
5. The effort might be worthwhile for applications where power con-
sumption is paramount.
4.2 Analog Backpropagation.
We saw that analog technology could be
instrumental in reducing power in the evaluation (inference) phase, where
the neural network is applied to a real problem by a real user, not the de-
veloper. It seems like the continuous nature of analog variables would be
of even more value during the training phase of a deep neural network.
The following discussion is quite speculative. It would take a lot of
effort—and some major inventions—to bring it about. We will encounter
a number of alternatives to standard backpropagation in later sections, so
this may not be the best way forward (see the caveats at the end of this
section).
Let’s look at how the array shown on the right in Figure 4 might be
trained. For this discussion, we use the configuration where the output lines
labeled I_out+ and I_out− represent two separate outputs. One step in train-
ing each layer of the network with backpropagation consists of two passes
through the network:
Forward Pass
Present input signals to the input i wires of the network.
Inference: Propagate signals through weights in the forward direction as described for the inference chip.
Compare the network j outputs with the desired outputs, thereby generating a vector of j error signals.
Backward Pass
Present the error signals to the j “output wires” of the network, now acting as inputs.
Propagate these error signals backward through the same weights, thereby generating an error value on each of the i “input” wires to be fed back to the “previous” layer.
Each “synapse” circuit uses its j error value to proportionally correct its own local weight.
Signal propagation on a wire and through a transistor can be quite sym-
metric. Since the same W_ij transistor is used on the forward and backward
passes and its weight is adjusted by the learning process, transistor-to-
transistor threshold variation is thereby reduced as a source of error. This
arrangement also makes all the information necessary for backpropagation
available locally to the weights to be updated.
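A software sketch of this arrangement for a single layer is given below: the same weight matrix is used for the forward pass and, transposed, for the backward pass, and each weight update uses only the locally available input and error values. The learning rate and sizes are arbitrary, and none of the analog circuit details are modeled.

```python
import numpy as np

def relu(q):  return np.maximum(q, 0.0)
def drelu(q): return (q > 0).astype(float)

def train_step(v_in, target, W, lr=0.1):
    """One forward/backward pass through a single crossbar-like layer.

    The same matrix W is used forward (v_in @ W) and, transposed, backward
    (delta @ W.T) -- the software analog of driving the error back through the
    same W_ij transistors. The update to W_ij uses only the locally available
    input V_i and error delta_j."""
    q = v_in @ W                      # forward pass: inference
    v_out = relu(q)
    delta = (v_out - target) * drelu(q)   # error vector on the j output wires
    err_back = delta @ W.T                # error handed back to the i input wires
    W -= lr * np.outer(v_in, delta)       # local weight correction
    return v_out, err_back

rng = np.random.default_rng(3)
W = 0.5 * rng.random((6, 4))          # positive init keeps the toy outputs active
v_in, target = rng.random(6), rng.random(4)
for _ in range(200):
    v_out, _ = train_step(v_in, target, W)
print("residual error after training:", np.max(np.abs(v_out - target)))
```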
In order to propagate signals in the reverse direction, the circuit imple-
menting the nonlinear function f(V_i) at the (forward) output of each layer needs to be augmented with drivers for driving signals in the reverse direction and a mode control for selecting which pass the network is executing.
Such circuits need to be done carefully. Transistor offset voltages in these
interlayer circuits can potentially be reduced using the same floating-gate
technology as is used in the synapse circuits. Pipelining between arrays,
which can speed up the forward inference step by substantial factors, is not
easily included in a fully unrolled backpropagation design.
The synapse circuit itself is even more challenging. The floating-gate
learning rule described in Diorio et al. (1996) needs to use the local error
signal on its j wire together with a global update command, presumably
on a separate dedicated wire, to either increment or decrement the charge
on the floating weight storage gate. An efficient solution to this design chal-
lenge would be a major contribution to the field. Optimal solutions require
a fully integrated approach encompassing device physics, fabrication pro-
cess, circuit design, system-level function, and layout.