25 YEARS AGO: THE FIRST ASYNCHRONOUS
MICROPROCESSOR
Alain J. Martin
Department of Computer Science
California Institute of Technology
Pasadena, CA 91125, USA
January 27, 2014
Twenty-five years ago, in December 1988, my
research group at Caltech submitted the world’s
first asynchronous (“clockless”) microprocessor
design for fabrication to MOSIS. We received
the chips in early 1989; testing started in Febru-
ary 1989. The chips were found fully func-
tional on first silicon. The results were pre-
sented at the Decennial Caltech VLSI Confer-
ence in March of the same year. The first entirely
asynchronous microprocessor had been designed
and successfully fabricated. As the technology fi-
nally reaches industry, and with the benefit of a
quarter century of hindsight, here is a recollection
of this landmark project.
1 No, You Can’t
A year earlier, I had approached our DARPA
program manager asking him to support my plan
to design and fabricate an asynchronous proces-
sor. His immediate answer was to reject the re-
quest as ludicrous. “This has been tried before:
it doesn’t work,” he informed me. After I in-
sisted, pointing to several modest experiments
that showed promise, he finally said (I quote
from memory): “OK, Alain. This is a bad idea,
but you are a good guy. I will let you do it. You
will try and fail. And you will be cured of this
nonsense!”
DARPA program managers always do the
right thing in the end, sometimes for the wrong
reason. In his defense, I have to admit that
his skepticism was unanimously shared by all ex-
perts of the time.
2 Can We?
As a computer scientist, my interest in VLSI
was triggered by Mead & Conway’s Introduction to VLSI Systems.
Their revolutionary text es-
tablished the link between VLSI design and dis-
tributed computing. A new field of computing
was born, in which computations were imple-
mented directly in a physical medium without
the interface of a computer. Another revelation
of the text was that shrinking of the devices’
physical dimensions would continue unabated for
a long time, allowing larger and larger systems
to be built, with the corollary that mastering the
systems’ complexity would be the main issue.
Figure 1: Layout of the Caltech Asynchronous Microprocessor
Yet another revelation of the book was Chuck
Seitz’s chapter on “System Timing,” where asyn-
chronous logic, called “self-timed” in the text,
was described as a potentially superior approach
owing to its independence of timing in the face of the increasing timing variations that were expected.
Those were compelling arguments for a com-
puter scientist, and I decided to apply my expe-
rience in distributed computing to the synthesis
of asynchronous VLSI systems. In blissful igno-
rance of the difficulty ahead, I chose as a first
exercise the asynchronous implementation of a
distributed mutual-exclusion algorithm I had in-
vented. At the end of a painful and solitary ef-
fort, I found myself not only with a good cir-
cuit but also with the prototype of a synthesis
method. The result was presented at the 1985
MIT VLSI conference.
3 A New Synthesis Method
Unlike anything else being tried at the time, the
approach relied on three levels of digital repre-
sentation before the transistor level: (1) a high-level description in terms of a novel message-passing programming language, CHP (originally inspired by C.A.R. Hoare’s CSP, although the two languages are vastly different); (2) an intermediate representation, the handshaking expansion (HSE), derived from CHP by replacing variables with their boolean expansions, arithmetic and boolean operations on those variables with boolean operations, and communications with handshake protocols; (3) a low-level notation that contains only one operation, the production rule, and one composition mechanism, the production-rule set (PRS). All production rules of a set are executed in parallel, the required sequencing being enforced by conditional executions. Conveniently, a production rule has an almost direct implementation as a “pull-up” or “pull-down” circuit in CMOS.
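To make the notation concrete, here is a minimal sketch of my own (an illustration, not an excerpt from the CAM design): the production rules of a Muller C-element, a staple of asynchronous design, written for an inverted state-holding node _c so that the up-guard contains only negated inputs and the down-guard only positive ones, which is precisely the form that maps directly onto a CMOS pull-up and pull-down network, with an inverter restoring the output c.

    # Minimal sketch (mine, not CAM code): the production-rule set of a
    # Muller C-element. The guards map onto CMOS networks driving node _c:
    #
    #   a & b   -> _c-   (pull-down: NMOS, positive literals only)
    #   ~a & ~b -> _c+   (pull-up: PMOS, negated literals only)
    #   _c -> c-,  ~_c -> c+   (output inverter)

    def step(n):
        """Fire every enabled production rule once; unguarded nodes hold state."""
        new = dict(n)
        if n['a'] and n['b']:          # a & b -> _c-
            new['_c'] = 0
        if not n['a'] and not n['b']:  # ~a & ~b -> _c+
            new['_c'] = 1
        new['c'] = 1 - new['_c']       # inverter
        return new

    nodes = {'a': 0, 'b': 0, '_c': 1, 'c': 0}
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        nodes.update(a=a, b=b)
        nodes = step(nodes)
        print(a, b, '->', nodes['c'])  # c rises only after both inputs have
                                       # risen, and falls only after both fall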
A priori, it looked preposterous to ask the de-
signer to work with three languages when many
of them at the time had trouble with a single
hardware description language, but the rewards
were worth the effort. Each level of representa-
tion permitted a class of transformations unique
to that level. At the CHP level, the original spec-
ification of the system could be decomposed into
a collection of components (called “processes”
in CHP) communicating by messages. The goal
was both to simplify each component and to in-
troduce concurrency. Because fine-grain communication required a channel to be more than a simple connecting pipe, CHP was the first language in which channels and ports were “first-class” components, with new constructs such as the probe and the value probe allowing some computations to be performed directly on ports.
At the HSE level, the main issue was the over-
head of handshake protocols. I wanted to manip-
ulate the handshake sequences so as to reduce
this overhead. The HSE representation lent it-
self to another class of transformations, called
reshufflings. A legitimate reshuffling would reorder the sequence of boolean transitions and waits without compromising the logic or introducing deadlock. It is through the analysis of all legitimate reshufflings that we discovered the most efficient solutions for control circuits.
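To give a flavor of what a reshuffling looks like, here is a schematic example of my own (not one of the actual CAM derivations) for a control process with a passive input handshake (li, lo) and an active output handshake (ro, ri). Written as Python data, "x+"/"x-" denote up and down transitions and "[x]"/"[~x]" denote waits; the assertion checks the defining property that a reshuffling only reorders transitions.

    # Schematic illustration of a reshuffling (my own example, not from
    # the CAM): two handshaking expansions of the same control process.

    sequential = ["[li]", "ro+", "[ri]", "ro-", "[~ri]", "lo+", "[~li]", "lo-"]

    # A legitimate reshuffling overlaps the return-to-zero phases of the
    # two handshakes: the same transitions in a different order, chosen so
    # that no deadlock is introduced.
    reshuffled = ["[li]", "ro+", "[ri]", "lo+", "[~li]", "ro-", "[~ri]", "lo-"]

    def transitions(hse):
        return sorted(t for t in hse if not t.startswith("["))

    assert transitions(sequential) == transitions(reshuffled)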
At the PRS level, not only, as we already men-
tioned, could a production-rule set be directly
implemented as a CMOS circuit, but the no-
tation also lent itself to a simple and efficient event-driven simulator, prsim.
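A toy reconstruction in Python of what such a simulator does (a sketch of the idea only; the real prsim was considerably more capable) might look as follows: rules whose guards hold are queued, firing a rule may enable others, and simulation proceeds until the network is quiescent.

    from collections import deque

    # Toy event-driven production-rule simulator in the spirit of prsim
    # (illustrative only). A rule is (guard, node, value): when the guard
    # holds and the node does not yet have the value, the rule can fire.

    def simulate(nodes, rules, max_steps=1000):
        def enabled(r):
            guard, node, value = r
            return guard(nodes) and nodes[node] != value
        queue = deque(r for r in rules if enabled(r))
        steps = 0
        while queue and steps < max_steps:
            rule = queue.popleft()
            if enabled(rule):                  # the guard may have changed
                nodes[rule[1]] = rule[2]       # fire the transition
                steps += 1
                queue.extend(r for r in rules if enabled(r))
        return nodes

    # Example: a two-inverter chain a -> b -> c.
    rules = [
        (lambda n: n['a'] == 1, 'b', 0), (lambda n: n['a'] == 0, 'b', 1),
        (lambda n: n['b'] == 1, 'c', 0), (lambda n: n['b'] == 0, 'c', 1),
    ]
    print(simulate({'a': 1, 'b': 1, 'c': 1}, rules))  # settles at b = 0, c = 1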
A systematic application of the synthesis
method quickly led to the discovery of new build-
ing blocks for asynchronous systems, discoveries
which I think would have been impossible by a
“seat-of-the-pants” approach. With several stu-
dents joining my lab, learning the method, and
applying it to various examples, a complete set of
basic components was developed including con-
trol circuits, adders, and registers.
4 QDI Logic
A new class of asynchronous circuits emerged
from our experiments, which we called quasi delay-insensitive, or QDI. At first, we (other researchers
and I) were intent on designing digital circuits
not only without a clock but even without any re-
liance on delays. Such circuits were called delay-insensitive, or DI. But all circuits we designed as
DI contained some delay assumptions that were
justified as “optimizations”. After a while, it
dawned on me that we were fooling ourselves:
we simply could not produce solutions that were
really DI. Instead, I proved that the class of DI
circuits was very limited and that most circuits
of interest fell outside this class. The issue re-
lated to forks in the circuit, where a fork is a circuit node that is an input of more than one gate. For anything but very simple functions, at least one timing assumption was required: certain forks relied on some assumption about the delays in the branches of the fork. I called them isochronic forks. Isolating the timing issues in
asynchronous logic to the isochronic fork was a
great step forward in understanding the limits of
asynchrony, and it took a while for the commu-
nity to absorb the new concept. (We recently weakened the isochronicity requirement somewhat: the original condition bounds the difference between the delays on the two branches of the fork; it is sufficient but not necessary. A weaker, necessary and sufficient condition states that the delay along a specific path of transitions, called “the adversary path,” is longer than the delay in a branch of the fork.)
That became the QDI approach we settled on:
the circuits could tolerate any amount of delays
in operators and wires except in isochronic forks.
The delay assumption needed for the correct be-
havior of an isochronic fork is a one-sided inequality between delays that can always be satisfied by making one of the paths longer.
At that time, it was customary for the few
practitioners of asynchronous-logic design to
make liberal use of delay elements. The QDI dis-
cipline essentially forbade the practice: the com-
pletion of a function execution (when it needed
to be known) had to be explicitly calculated:
each communication was implemented with a four-phase, return-to-zero handshake protocol, and each data word was coded with a dual-rail or one-of-n (one-hot) code. Altogether, the overhead compared to a few delay elements here and there seemed exorbitant, and I was soon branded as a dangerous fanatic... It is for me a vindication to observe that the whole community today is quietly converging toward some form of QDI as people realize that it is the approach most likely to deliver functioning, robust, non-trivial, clockless circuits. (Objection to the use of a four-phase, return-to-zero protocol continues to this day: why insist on a four-transition protocol when a non-return-to-zero, two-transition protocol exists? The answer is that it is very costly to do arithmetic and logical operations when one has to alternate the sense of the signals according to the parity of the communication.)
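In concrete terms, here is a small sketch of my own (illustrative only, not CAM code) of what dual-rail, four-phase signaling involves: a bit travels on two wires, the receiver detects validity by itself, and every communication passes through a return-to-zero phase.

    # Illustrative sketch (mine) of dual-rail, four-phase signaling.
    # A bit is encoded on two wires (t, f); the all-zero "neutral" state
    # separates successive data tokens.

    def encode(b):                 # dual-rail: exactly one wire raised
        return (1, 0) if b else (0, 1)

    NEUTRAL = (0, 0)               # return-to-zero spacer

    def valid(w):                  # validity detected without a clock
        return w != NEUTRAL

    def decode(w):
        t, f = w
        assert valid(w)
        return bool(t)

    wires, ack = NEUTRAL, 0
    for bit in [1, 0, 1]:
        wires = encode(bit)        # phase 1: sender raises one rail
        ack = 1                    # phase 2: receiver acknowledges
        print("received", decode(wires))
        wires = NEUTRAL            # phase 3: sender returns to zero
        ack = 0                    # phase 4: receiver lowers acknowledge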
By the mid 1980s we had developed a com-
plete and robust approach for QDI circuit syn-
thesis. A component of the method was the sys-
tematic separation of the control part and the
datapath of a design directly from its high-level
CHP specification. The datapath contained QDI
implementations of registers and standard arith-
metic and boolean functions—adders, compara-
tors, incrementors, etc. The control part con-
tained the control signals that enforced the se-
quencing of actions between the various compo-
nents of a datapath. Basic circuit building blocks
had been invented, in particular the D-element for sequencing two bare handshake protocols, the write-acknowledge for detecting the completion of a write to a boolean register, the completion tree, dual-rail ripple-carry adders, channel arbiters, etc.
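As an example of how completion is computed explicitly, here is a sketch of my own of the completion-tree idea over a dual-rail word (an illustration, not the CAM circuit): each bit's validity is the OR of its two rails, and state-holding C-elements combine the per-bit signals so that the output rises only when every bit is valid and falls only when every bit has returned to neutral.

    # Sketch (mine) of completion detection over a dual-rail word.

    def c_element(a, b, prev):
        return a if a == b else prev   # hold state when inputs disagree

    def bit_valid(rails):
        t, f = rails
        return t | f                   # dual-rail validity: t OR f

    def completion(word, state):
        """Combine per-bit validity through a chain of C-elements;
        `state` keeps the C-elements' values between calls."""
        out = bit_valid(word[0])
        for i, rails in enumerate(word[1:]):
            state[i] = c_element(out, bit_valid(rails), state[i])
            out = state[i]
        return out

    state = [0, 0, 0]
    token = [(0, 1), (1, 0), (0, 1), (1, 0)]   # a valid 4-bit value
    print(completion(token, state))            # -> 1 (all bits valid)
    print(completion([(0, 0)] * 4, state))     # -> 0 (all bits neutral)

A production circuit would arrange the C-elements as a balanced tree rather than a chain, for logarithmic rather than linear depth.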
By 1987, the approach had been successfully
tested on a number of non-trivial examples that
had actually been implemented through MOSIS,
and we were ready to tackle the design of a small
microprocessor.
5 Let’s Try!
According to my lab book, the project started
in earnest in July 1988. Besides myself, the team comprised four graduate students: Steve Burns, who would take a leading role in the project, Drazen Borkovic, Tony Lee, and Pieter Hazewindus. Pieter joined the project later, in the fall. The students had convinced me that we should design our own instruction set, which they had done early on. It was a small ISA, typical
of a microcontroller of today, without specific IO
instructions.
The processor had a 16-bit, RISC-type in-
struction set, 16 registers, four buses, an ALU,
and two adders—one for program-counter calcu-
lations and one for memory addresses. Instruc-
tion and data memories were separate. The chip
size turned out to be approximately 20,000 tran-
sistors.
The originally targeted technology was HP 2.0-μm SCMOS, a two-metal-layer technology with N-well, 5V nominal Vdd, and 1V threshold voltage. We ended up fabricating two versions, the CAM20 in 2.0-μm SCMOS and the CAM16 in 1.6-μm SCMOS. The 1.6-μm HP CMOS 40 technology was also a two-metal-layer, N-well, scalable CMOS technology with 5V nominal Vdd and 0.75V threshold voltages.
A few days after the CAM20 had been sub-
mitted for fabrication, as the team was resting
a little, Tony Lee was looking contentedly at his
layout when he discovered with horror that he
had forgotten to connect some wells to the power
rail. Convinced that the chip would fail, we
checked the MOSIS run schedule and saw that a
run in 1.6-μm was coming up in a few weeks. As
the technology was truly scalable in those palmy
days, we could fix the error, make a few changes
and submit the CAM16 for fabrication in time.
Hence the two versions.
But let’s go back to the beginning of the
summer...
The design proceeded on two fronts.
One was the top-down synthesis of the pro-
cessor, starting from a simple CHP program
executing instructions sequentially. (This pro-
gram was then used as the specification from
which the final concurrent CHP version was de-
rived.) The other was the design of the dat-
apath. Thanks to the aforementioned separa-
tion of control and datapath in the synthesis,
we could design the datapath early before choos-
ing the pipeline structure and before the control
part was fixed. The only important decision we
had to make early was the number of buses that
would access the register file. In order to maximize concurrent access to the registers, and thus concurrency between instruction executions, we introduced four buses; in retrospect, that number was probably overkill.
By the end of the summer, the datapath was
laid out and tested. (Layout of the datapath was done manually using Magic. From the Magic website: “Magic is a venerable VLSI layout tool, written in the 1980s at Berkeley by John Ousterhout... Due largely to its liberal Berkeley open-source license, Magic has remained popular with universities and small companies.”) However, we would keep modifying the control until a few weeks before tape out.

Figure 2: The CHP processes and communication channels of the CAM

The CHP processes and their communication channels are shown in Figure 2. (What is called the EXEC in the figure
is in fact the Decode process.) The datapath
consisted of a register file (16 16-bit registers), a
PCADD unit to update the program counter, a
memory unit to transfer data between the reg-
isters and the data memory, and an ALU. The
ALU was used for arithmetic and logical operations: by manipulating the KPG (kill, propagate, generate) codes, any logical instruction was turned into an addition.
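In the form I believe captures the spirit of the trick (my own sketch; the exact encoding in the chip may differ), the sum bit of an adder stage is the propagate signal XORed with the incoming carry, so choosing the kill/propagate/generate signals per opcode and killing every carry makes the adder's output the desired boolean function of the operands:

    # Sketch (mine, not the CAM circuit) of turning logical operations
    # into an addition by manipulating the KPG codes: each stage computes
    # sum = p ^ cin; for logical ops, p is redefined and carries killed.

    def alu(op, x, y, width=16):
        result, carry = 0, 0
        for i in range(width):
            a, b = (x >> i) & 1, (y >> i) & 1
            if op == 'ADD':
                p, g, kill = a ^ b, a & b, False
            else:                       # logical ops: every carry is killed
                p = {'AND': a & b, 'OR': a | b, 'XOR': a ^ b}[op]
                g, kill = 0, True
            result |= (p ^ carry) << i  # the stage's sum bit
            carry = 0 if kill else (g | (p & carry))
        return result

    assert alu('ADD', 5, 3) == 8
    assert alu('AND', 5, 3) == 1
    assert alu('OR', 5, 3) == 7
    assert alu('XOR', 5, 3) == 6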
5.1 A New Pipeline Architecture
Good ideas in design often have an a posteriori simplicity and obviousness that obscure the fact that arriving there was anything but obvious. That was the case for the general pipeline structure that we converged on after multiple iterations, as my lab notes testify.
What emerged from the exercise was a sim-
ple and general architecture that would become, mutatis mutandis, the archetype of future asynchronous processor pipelines. The pipeline had two main connected parts: a fetch loop and an execution loop. The fetch loop was a ring of pro-
cesses consisting of the Fetch process computing
the next program counter (pc) and sending it to
the instruction memory IMEM; the instruction
memory receiving a pc and sending the corre-
sponding instruction to the Decode; the Decode
decoding the instruction and sending informa-
tion about the next instruction to the Fetch.
The Decode was also part of the execution
loop. As such, it would concurrently send pieces
of a decoded instruction to the register file (the
address(es) of the register(s) involved) and the
instruction opcode to the proper execution unit
(ALU or memory unit). The execution unit
would typically receive the parameters of the in-
struction from the register file and send the re-
sult back to the register file. Execution units and
the register file would exchange parameters and
results via the buses. (In the CAM, a bus was a many-to-many communication channel. We abandoned this approach in later systems in favor of an implementation as a network of split and merge components with only point-to-point channels.)

Unlike in a synchronous architecture, the execution units were not aligned on the pipeline, and therefore out-of-order execution was automatic, provided the use of the registers allowed it.
The architecture also exposed the main per-
formance bottlenecks in an asynchronous pro-
cessor pipeline. The first was the fetch loop:
one data “token” was circulating around the loop
and therefore the latency of the token around the
fetch loop was a critical factor in the throughput
of the whole system. Two critical cycles were identified as limiting the throughput: the fetch loop, and the cycle comprising the sending of parameters to an execution unit and the return of the result to the register file, with the fetch loop usually the most critical. Another difficulty was the reservation of the registers for an instruction in view
of the concurrent execution of several instruc-
tions. Preventing read/write hazards on register
access was the only restriction to the concurrent
use of registers by execution units. However,
in the absence of a global time reference, it be-
came a tricky synchronization problem. This issue would remain critical in all subsequent designs, with several different solutions being proposed.
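The throughput consequence is easy to state numerically: in the steady state, an asynchronous pipeline runs at the rate of its slowest critical cycle. The stage latencies below are invented for illustration (they are not CAM measurements), but the calculation is the one that mattered:

    # Throughput is set by the slowest critical cycle. Latencies are
    # invented for illustration; they are not CAM measurements.

    fetch_loop = {'Fetch': 4.0, 'IMEM': 10.0, 'Decode': 6.0}   # ns per stage
    exec_loop  = {'Decode': 6.0, 'RegFile': 5.0, 'ALU': 8.0}   # ns per stage

    # With a single token circulating, a loop's cycle time is the sum of
    # the latencies around it.
    cycles = {'fetch loop': sum(fetch_loop.values()),
              'execution loop': sum(exec_loop.values())}

    bottleneck = max(cycles, key=cycles.get)
    print(cycles)   # {'fetch loop': 20.0, 'execution loop': 19.0}
    print(f"bottleneck: {bottleneck}, ~{1e3 / cycles[bottleneck]:.0f} MIPS")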
Later asynchronous microprocessors with
more complex instruction sets, like the Caltech
MiniMIPS, a QDI clone of the MIPS R3000 microprocessor, would have considerably more involved pipelines. But the basic structure introduced in the CAM would obtain.
5.2 CAD Tools
Our use of CAD tools was very modest. We did
use Magic for hand layout and SPICE for elec-
trical simulation, but no other commercial tool.
Steve Burns, with the help of Pieter Hazewindus
and Andrew Fyfe, wrote an excellent place-and-
route tool, VGladys, that would take production
rules as input, match them with standard oper-
ators for which we had built a standard cell li-
brary, and place and route the cells. It was very
useful for the control part which went through
many last-minute modifications. Another use-
ful tool was the production-rule simulator prsim,
which allowed excellent low-level digital simula-
tion, and gave an estimate of the critical cycles.
What about high-level synthesis? Based on
my early high-level synthesis experiments, Steve
Burns had written a rather sophisticated syntax-
directed compiler to generate gate-level imple-
mentation of CHP programs; and our original
intention was to use it for the CAM. But the
first attempts indicated that the performance of
the generated circuit would not be sufficient to make a convincing demonstration of QDI logic.
We abandoned syntax-directed translation and
never used it much afterward.
6 Yes, We Can!
Testing started on February 23rd, 1989. Only two
of the 12 CAM20 chips passed all tests, proba-
bly due to the floating wells, but they turned out
to be the most robust. Out of the 50 CAM16
chips, 34 were found to be entirely functional.
The chips were tested in various settings: over
the widest range of operating voltages, at differ-
ent temperatures, in particular in liquid nitro-
gen, and with a variety of test programs.
One interesting experiment was using a hair
dryer to heat up the chip. As the temperature
rose, one could see the oscillation period stretch
on the oscilloscope, and then shrink again as the
dryer was removed.
6.1 Testing for Stuck-at Faults
It has been claimed that “self-timed is self-
testing.” We debunked that claim. In an asyn-
chronous system, in the absence of a global time ref-
erence, the knowledge that a transition has com-
pleted can only be derived from its causing an-
other transition to take place. We say that the
second transition acknowledges the first one. If all transitions are acknowledged, then a transition being “stuck” will cause the system to deadlock, and the stuck-at fault will be detected simply by running normal computations. This is the idea
behind self-testing. But only in an entirely DI
system are all transitions acknowledged. In all
other cases, some transitions are not. And there-
fore not all stuck-at faults lead to deadlock. In a
QDI system, it is the isochronic fork that causes
certain transitions to be unacknowledged.
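The argument can be replayed with a few lines of Python (an illustration of mine in the spirit of the toy simulator of Section 3): in an inverter chain every transition is acknowledged by the next stage, so sticking a node leaves the network quiescent in the wrong state, and any process waiting on the output deadlocks.

    # Illustration (mine): a stuck acknowledged transition shows up as
    # deadlock. In the chain a -> b -> c, each stage inverts its input.

    def run(stuck_b=None):
        nodes = {'a': 1, 'b': 1, 'c': 1}
        while True:
            b = stuck_b if stuck_b is not None else 1 - nodes['a']
            c = 1 - b
            if (b, c) == (nodes['b'], nodes['c']):
                return nodes           # quiescent: no rule left to fire
            nodes['b'], nodes['c'] = b, c

    print(run())            # healthy: settles at {'a': 1, 'b': 0, 'c': 1}
    print(run(stuck_b=1))   # b stuck high: quiesces with c = 0, so any
                            # process waiting for c to rise deadlocks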
Detecting all stuck-at faults requires adding
extra circuitry, which we decided not to do in this
experimental design. Instead, we used a program
written by Pieter Hazewindus that could test the
entire chip for detectable stuck-at faults in less
than 700 instructions.
6.2 Performance
The results reported here are for the chips run-
ning a sequence of ADD instructions. Since the
performance does not suffer from the branch de-
lays, we consider it the peak performance.
Performance can be summarized as follows: at
2V, the CAM16 runs at 5MIPS drawing 5.2mA
of current; at nominal 5V, it runs at 18MIPS
drawing 45mA, and at 10V it runs at 26MIPS
drawing 105mA. The CAM20 runs at 11MIPS
at 5V, and reaches 15MIPS at 7V. The self-
regulating power supply of asynchronous circuits
allowed us to operate the processor from just
about any power source that is capable of pro-
viding approximately 50 μW of power at 0.8V.
The CAM16 performance over the whole volt-
age range is shown in Figure 3.
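These figures let one work the power and energy per instruction out directly; the short computation below is just arithmetic on the numbers quoted above.

    # Power and energy per instruction from the CAM16 figures quoted above.
    points = [(2.0, 5.2, 5), (5.0, 45.0, 18), (10.0, 105.0, 26)]  # V, mA, MIPS

    for v, ma, mips in points:
        power_w = v * ma / 1000                # P = V * I
        nj = power_w / (mips * 1e6) * 1e9      # energy per instruction, in nJ
        print(f"{v:>4}V: {power_w:.3f} W, {nj:.1f} nJ/instruction")

    # -> 2V: 0.010 W and 2.1 nJ; 5V: 0.225 W and 12.5 nJ; 10V: 1.050 W and
    #    40.4 nJ. Energy per instruction drops steeply with voltage, which
    #    is consistent with the minimum-energy point near 1V reported
    #    below (Figure 6).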
Once again in total ignorance of the expected
difficulties, we undertook to test our chips across the widest possible range of operating voltage, starting from deep subthreshold (around
0.3V). The chips would stop functioning around
0.3V, but starting from 0.4V for the CAM20 and
0.5V for the CAM16, most chips would work
flawlessly.
Figure 4 is a photograph of a page from our
lab book showing the first performance measure-
ments of a CAM20 chip in subthreshold. The
frequency goes from 3Hz at 0.4V to 680KHz at
1.1V.
Figure 3: Frequency as a function of voltage for the CAM16

The chips were also tested at liquid nitrogen temperature (77K) by pouring liquid nitrogen into a styrofoam cup fixed on top of the die. The
experiment was a little tricky, but we were able
to measure the performance. The CAM20 ver-
sion reached 20MIPS at 5V, and 30MIPS at
12V. The CAM16 version reached 30MIPS at
5V. Figure 5 shows all measurements of fre-
quency against voltage. The solid line in the
figure represents the CAM20 performance in liq-
uid nitrogen. Observe the significant jump in
performance compared to the bottom line repre-
senting the room temperature frequencies. Also
worth noticing is the upward shift of the thresh-
old voltage at low temperature.
Figure 6 shows the energy per instruction for
a CAM16 chip across the whole voltage range
from 0.5V up to 12V with a minimum energy
point around 1V, i.e., above the threshold voltage.
The robustness of the chip to voltage vari-
ations, and its ability to operate at very low
voltage invited Mika Nyström to try what he called the first Potato-Chip Experiment: he used a potato as a power supply! The Potato Processor runs at around 300KHz at 0.75V. (See Figure 7.)
Figure 4: First measurements of frequency in deep subthreshold voltage for a CAM20 chip, taken by Steve Burns on February 24th, 1989
To give the reader an idea of where the state
of the art (in clocked VLSI) stood at the time, I
consulted the Berkeley Hardware Prototypes site,
where Berkeley had the excellent idea of archiv-
ing their chip designs. In 1988, the SPUR chip set, containing a RISC microprocessor, was fabricated in a similar (or perhaps identical) technology, 1.6-μm CMOS. It operated at 10MHz.
In 1990, the VLSI-BAM, a RISC microprocessor fabricated in 1.3-μm CMOS, ran at 20MHz and consumed 1W at 5V. Our, admittedly more