INVITED PAPER

Asynchronous Techniques for System-on-Chip Design

Digital circuit designs that are not sensitive to delay promise to allow operation without clocks for future systems-on-a-chip.

By Alain J. Martin, Member IEEE, and Mika Nyström, Member IEEE
ABSTRACT | SoC design will require asynchronous techniques, as the large parameter variations across the chip will make it impossible to control delays in clock networks and other global signals efficiently. Initially, SoCs will be globally asynchronous and locally synchronous (GALS). But the complexity of the numerous asynchronous/synchronous interfaces required in a GALS will eventually lead to entirely asynchronous solutions. This paper introduces the main design principles, methods, and building blocks for asynchronous VLSI systems, with an emphasis on communication and synchronization. Asynchronous circuits with the only delay assumption of isochronic forks are called quasi-delay-insensitive (QDI). QDI is used in the paper as the basis for asynchronous logic. The paper discusses asynchronous handshake protocols for communication, the notion of validity/neutrality tests, and the completion tree. Basic building blocks for sequencing, storage, function evaluation, and buses are described, and two alternative methods for the implementation of an arbitrary computation are explained. Issues of arbitration and synchronization play an important role in complex distributed systems and especially in GALS. The two main asynchronous/synchronous interfaces needed in GALS, one based on a synchronizer, the other on a stoppable clock, are described and analyzed.

KEYWORDS | Arbiter; asynchronous; asynchronous bus; asynchronous/synchronous interface; C-element; completion tree; dual-rail; globally asynchronous and locally synchronous (GALS); half-buffer; handshake protocol; isochronic fork; metastability; passive–active buffer; precharge half-buffer (PCHB); quasi-delay-insensitive (QDI); stoppable clock; synchronizer
I. INTRODUCTION
It is now generally agreed that the sizable very large scale integration (VLSI) systems [systems-on-chip (SoCs)] of the nanoscale era will not operate under the control of a single clock and will require asynchronous techniques. The large parameter variations across a chip will make it prohibitively expensive to control delays in clocks and other global signals. Also, issues of modularity and energy consumption plead in favor of asynchronous solutions at the system level. Whether those future systems will be entirely asynchronous, as we predict, or globally asynchronous and locally synchronous (GALS), as more conservative practitioners would have it, we anticipate that the use of asynchronous methods will be extensive and limited only by traditional designers' relative lack of familiarity with the approach. Fortunately, the past two decades have witnessed spectacular progress in developing methods and prototypes for asynchronous (clockless) VLSI. Today, a complete catalogue of mature techniques and standard components, as well as some computer-aided design (CAD) tools, are available for the design of complex asynchronous digital systems.

This paper introduces the main design principles, methods, and building blocks for asynchronous VLSI systems, with an emphasis on communication and synchronization. Such systems will be organized as distributed systems on a chip consisting of a large collection of components communicating by message exchange. Therefore, the paper places a strong emphasis on issues related to network and communication: issues for which asynchronous techniques are particularly well suited. Our hope is that after reading this paper, the designer of an SoC should be familiar enough with those techniques that he or she would no longer hesitate to use them. Even those adepts of GALS who are adamant not to let asynchrony penetrate further than the network part of their SoC must realize that network architectures for SoCs are rapidly becoming so complex as to require the mobilization of the complete armory of asynchronous techniques.
Manuscript received September 14, 2005; revised March 3, 2006. This work was supported in part by the Defense Advanced Research Projects Agency (DARPA). The authors are with the Department of Computer Science, California Institute of Technology, Pasadena, CA 91125 USA (e-mail: alain@async.caltech.edu). Digital Object Identifier: 10.1109/JPROC.2006.875789
The paper is organized as follows. The second section contains a brief history and the main definitions of the different asynchronous logics according to their timing assumptions. Section III introduces the computational models and languages used in this paper to describe and construct asynchronous circuits. Section IV introduces the most common asynchronous communication protocols and the notion of validity/neutrality tests. Basic building blocks for sequencing, storage, and function evaluation are introduced in Section V. Section VI presents two alternative methods for the implementation of an arbitrary computation: syntax-directed decomposition and data-driven decomposition. The two approaches differ in how a specification is decomposed into pipeline stages. Section VII describes several implementations of buses. Section VIII deals with issues of arbitration and synchronization. Section IX presents the asynchronous/synchronous interfaces needed in a GALS system.
II. A BRIEF HISTORY AND A FEW DEFINITIONS
The field of asynchronous design is both old and new. The 1952 ILLIAC and the 1962 ILLIAC II at the University of Illinois are said to have contained both synchronous and asynchronous parts [2]. The PDP-6 from Digital Equipment (DEC), built in the 1960s, was also asynchronous [52]. The "macromodule" experiment in the 1960s proposed asynchronous building blocks that, in spirit, are very close to a modern system-level approach [10] and to GALS. Important theoretical contributions of this period include the works of Huffman [20], Muller [33], and Unger [53]. An excellent presentation of Muller's work is in [32]. The pioneering work of Molnar and his colleagues at Washington University was instrumental in explaining metastability [6].
Even though asynchronous logic never disappeared completely, when clocked techniques offered an easy way of dealing with timing and hiding hazards, clockless logic was all but forgotten until the arrival of VLSI in the late 1970s. The first Caltech Conference on VLSI in 1979 contained a complete session on "self-timed" logic, as asynchronous logic was called at the time, with in particular an important paper by Stucki and Cox on "synchronization strategies" presenting the first pausable clock (as we shall see, an important device in GALS) and discussing metastability [47]. Seitz's chapter on system timing in Mead and Conway's epoch-making 1980 Introduction to VLSI Systems revived the research community's interest in the topic, if not the industry's [31], [45].
The first "modern" synthesis methods appeared around the mid-1980s with the Caltech program-transformation approach [24] and T.-A. Chu's signal-transition-graph (STG) approach [8]. Soon after, Burns and Martin, Brunvand, and van Berkel proposed similar methods for the syntax-directed compilation of high-level descriptions into asynchronous circuits [3], [5]. Petrify is a more recent tool for the synthesis of asynchronous controllers described as Petri nets [12].
The first single-chip asynchronous microprocessor was designed at Caltech in 1988 [27]. It was followed by the first "Amulet" (a family of asynchronous clones of the ARM processor) from the University of Manchester in 1993 [14], the TITAC, an 8-bit microprocessor from the Tokyo Institute of Technology, in 1994 [36], and the Amulet2e and the TITAC-2, a 32-bit microprocessor, in 1997 [15], [37]. Also in 1997, the Caltech group designed the MiniMIPS, an asynchronous version of the 32-bit MIPS R3000 microprocessor [28]. With a performance close to four times that of a clocked version in the same technology for the first prototype, the MiniMIPS is, at the moment of writing, still the fastest complete asynchronous processor ever fabricated [30]. A group at Grenoble ported the Caltech MiniMIPS building blocks to standard cells for use in a 16-bit RISC [42]. Other asynchronous chip experiments include the design of a fast divider at Stanford in 1991 [49] and an instruction-length decoder for the Pentium by a research group at Intel in 1999 [44]. Low-power asynchronous microcontrollers have been designed at Philips [17], Caltech [30], and Cornell [21]. The concept of GALS was first proposed by Chapiro [7] in 1984. It has recently gained in popularity, in particular with the work of a group at Zurich [35].
Several books on asynchronous logic have been published. Among them, our favorites are [11], [34], and [46]. A special issue of the Proceedings of the IEEE also gives a good overview of the state of the art [38]. The book by Dally and Poulton, although not specifically about asynchronous systems, contains an excellent chapter on synchronization [13]. The on-line Asynchronous Bibliography maintains an updated bibliography of the field to date [1].
A digital circuit is asynchronous when no clock is used to implement sequencing. Such circuits are also called clockless. The various asynchronous approaches differ in their use of delay assumptions to implement sequencing. A circuit is delay-insensitive (DI) when its correct operation is independent of any assumption on delays in operators and wires except that the delays are finite and positive. The term delay-insensitive appears informally in [45]. It was proved in 1990 that in a model in which all delays are exposed (the building blocks are elementary gates with a single
Boolean output) the class of entirely delay-insensitive circuits is very limited [26]. Most circuits of interest to the digital designer fall outside the class. But it can also be proved that a single delay assumption on certain forks connecting the output of a gate to the inputs of several other gates is enough to implement a Turing machine, and therefore the whole class of Turing-computable functions [23]. Those forks are called isochronic. Asynchronous circuits with the only delay assumption of isochronic forks are called quasi-delay-insensitive (QDI).
We use QDI as the basis for asynchronous logic. All other forms of the technology can be viewed as a transformation from a QDI approach by adding some delay assumption. An asynchronous circuit in which all forks are assumed isochronic corresponds to what has been called a speed-independent circuit, which is a circuit in which the delays in the interconnects (wires and forks) are negligible compared to the delays in the gates. The concept of the speed-independent circuit was introduced by Muller [33]. Similarly, self-timed circuits are asynchronous circuits in which all forks that fit inside a chosen physical area called an equipotential region are isochronic [45].

Several styles of asynchronous circuits currently in use fall into some hybrid category. They rely on some specific timing assumption besides the implicit QDI assumption. For instance, the "bundled data" technique uses a timing assumption to implement the communication protocol between components. Another approach, timed asynchronous logic, starts from a QDI circuit, and then derives timing relations between events in the QDI computation that are used to simplify the solution. (See [34].) Yet another approach uses a timing assumption to control the reset phase of the handshake protocol (to be explained later). Two logic families based on this approach are asynchronous pulse logic [40] and GasP [48].
III. SOCs AS DISTRIBUTED SYSTEMS
SoCs are complex distributed systems in which a large number of parallel components communicate with one another and synchronize their activities by message exchange. At the heart of digital logic synthesis lie fundamental concurrency issues like concurrent read and write of variables and synchronization between send and receive commands. Synchronous (clocked) logic brings a simple solution to the problem by partially ordering transitions with respect to a succession of global events (clock signals) so as to order conflicting read/write actions. In the absence of a global time reference, asynchronous logic has to deal with concurrency in all its generality, and asynchronous-logic synthesis relies on the methods and notations of concurrent computing. There exist many languages for distributed computing. The high-level language used in this paper is called Communicating Hardware Processes (CHP). It is based on CSP [19], and is used widely in one form or other in the design of asynchronous systems. We introduce only those constructs of the language needed for describing the method and the examples, and that are common to most computational models based on communication. The systematic design of an SoC is a process of successive refinements taking the design from a high-level description to a transistor netlist. The three levels of representation used in this paper (CHP, HSE, and PRS) mirror the three main stages of the refinement.
A. Modeling Systems: Communicating Processes

A system is composed of concurrent modules called processes. Processes do not share variables but communicate only by send and receive actions on ports.

1) Communication, Ports, and Channels: A send port of a process (say, port R of process p1) is connected to a receive port of another process (say, port L of process p2) to form a channel. A receive command on port L is denoted L?y. It assigns to local variable y the value received on L. A send command on port R, denoted R!x, assigns to port R the value of local variable x. The data item transferred during a communication is called a message. The net effect of the combined send R!x and receive L?y is the assignment y := x together with the synchronization of the send and receive actions.

The slack of a channel is the maximal difference between the number of completed send actions and the number of completed receive actions on the two ports of the channel. In other words, the slack is the capacity of the channel to store messages. Since we implement channels with wires only, we choose to have slack-zero channels: the completion of a send at one end of the channel coincides with the completion of a receive at the other end of the channel. Both send and receive actions are said to be "blocking": a send or receive action on a port may have to be delayed (pending) until the matching action on the other port of the channel is ready.
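As an illustration of the slack-zero (rendezvous) semantics, the following Python sketch models a channel on which a send can complete only together with the matching receive. It is a minimal model of our own, not an implementation from the paper; the names Channel, send, and recv are ours.

    import threading

    class Channel:
        """A slack-zero channel: a send and its matching receive complete together."""
        def __init__(self):
            self._lock = threading.Condition()
            self._value = None
            self._sender_waiting = False
            self._done = False

        def send(self, x):                       # R!x
            with self._lock:
                self._value = x
                self._sender_waiting = True
                self._done = False
                self._lock.notify_all()          # offer the message
                while not self._done:            # block until the receive completes
                    self._lock.wait()

        def recv(self):                          # L?y
            with self._lock:
                while not self._sender_waiting:  # block until a send is pending
                    self._lock.wait()
                y = self._value
                self._sender_waiting = False
                self._done = True
                self._lock.notify_all()          # release the sender
                return y

    c = Channel()
    threading.Thread(target=lambda: print("received", c.recv())).start()
    c.send(42)                                   # completes only when 42 is taken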
2) Assignment: The value of a variable is changed by an explicit assignment to the variable, as in x := expr. For b Boolean, b↑ and b↓ stand for b := true and b := false, respectively.
3) Sequential and Parallel Compositions: CHP and HSE provide two composition operators: the sequential operator S1; S2 and the parallel operator. Unrestricted use of parallel composition would cause read/write conflicts on shared variables. CHP restricts the use of concurrency in two ways. The parallel bar ∥, as in S1 ∥ S2, denotes the parallel composition of processes.¹ CHP also allows a limited form of concurrency inside a process, denoted by the comma, as in S1, S2. The comma is restricted to program parts that are noninterfering: if S1 writes x, then S2 neither reads x nor writes x.

¹In CHP, processes do not share variables. In HSE, the only shared variables are those introduced for the implementation of communication. They cannot cause read/write conflicts.
4) Selection, Wait, and Repetition: The selection command [B1 → S1 [] B2 → S2 [] ...] is a generalization of the if-statement. It has an arbitrary number (at least one) of clauses, called "guarded commands," Bi → Si, where Bi is a Boolean condition and Si is a program part. The execution of the selection consists of: 1) evaluating all guards and 2) executing the command Si with the true guard Bi. In this version of the selection, at most one guard can be true at any time. There is also an arbitrated version where several guards can be true. In that case, an arbitrary true guard is selected. The arbitrated selection is identified by a thin bar, as in [B1 → S1 | B2 → S2]. In both versions, when no guard is true, the execution is suspended: the execution of the selection reduces to a wait for a guard to become true. Hence, waiting for a condition B to be true can be implemented with the selection [B → skip], where skip is the command that does nothing but terminate. A shorthand notation for this selection is [B]. In this paper, we use only the nonterminating repetition *[S], which repeats S forever.
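The semantics of the selection can be mimicked in software. The following Python sketch (our own illustration; the name select is hypothetical) busy-waits until some guard is true and then executes the corresponding command; the deterministic version asserts that at most one guard holds, while the arbitrated version picks any true guard.

    import random, time

    def select(guarded_commands, arbitrated=False, poll=0.001):
        """guarded_commands: list of (guard, command) pairs, where guard is a
        zero-argument predicate and command is a zero-argument action.
        Waits until at least one guard is true, then executes one command."""
        while True:
            enabled = [cmd for guard, cmd in guarded_commands if guard()]
            if enabled:
                if not arbitrated:
                    # nonarbitrated selection: the program must guarantee
                    # that the guards are mutually exclusive
                    assert len(enabled) == 1, "guards not mutually exclusive"
                    return enabled[0]()
                return random.choice(enabled)()  # any true guard may be chosen
            time.sleep(poll)                     # wait for a guard to become true

    # The wait [B] is the special case select([(B, skip)]):
    x = 5
    select([(lambda: x > 0, lambda: print("positive")),
            (lambda: x < 0, lambda: print("negative"))])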
5) Pipeline Slack and Slack Matching: Slack matching is an optimization by which simple buffers are added to a system of distributed processes to increase the throughput. A pipeline is a connected subgraph of the process graph with one input port and one output port. The static slack of a pipeline is the maximal number of messages the pipeline can hold. A pipeline consisting of a chain of n simple buffers has a static slack of n, since each simple buffer can hold at most one message, and the channels have slack zero (unless, as we shall see, the buffer implementations are subjected to a transformation called reshuffling, which can reduce their slack).

The dynamic slack of a pipeline denotes the number of messages or, more generally, the range of numbers of messages that the pipeline must hold to run at optimal throughput. For the same pipeline of n simple buffers with a symmetric implementation, the dynamic slack is centered around n/2. (See [22], [41].) However, the most efficient buffer templates are not symmetrical: they favor forward latency over backward latency. For such buffers, the dynamic-slack range is reduced, typically centering around n/8 for the MiniMIPS. (The definitions of static and dynamic slack are easy to extend to rings of processes.)
B. Modeling System Components: HSE

Each CHP process is refined into a partial order of signal transitions, i.e., transitions on Boolean variables. The HSE notation is not different from CHP except that it allows only Boolean variables, and send and receive communications have been replaced with their handshaking expansion in terms of the Boolean variables modeling the communication wires. The modeling of wires introduces a restricted form of shared variables between processes (the variables implementing channels). A typical example of an HSE program is

*[[li]; ro↑; [ri]; ro↓; [¬ri]; lo↑; [¬li]; lo↓].

The input variables li and ri can only be read. The output variables lo and ro can be read and written. The above example can be read as follows. "Repeat forever: wait for li to be true; set ro to true; wait for ri to be true; set ro to false; etc...."
C. Modeling Circuits: Production Rules

A circuit, for instance a CMOS circuit, is a network of operators (logic gates). Each gate has an arbitrary number of inputs (in practice this number is limited by the resistance of transistor chains) and one output. (The only exceptions are the arbiter and the synchronizer, which each have two inputs and two outputs.) A logic gate with Boolean output z sets z to true when a Boolean condition Bu of its inputs holds, and sets z to false when a Boolean condition Bd of its inputs holds. Those behaviors are formalized by the two production rules

Bu → z↑
Bd → z↓.

A production rule (PR) is a construct of the form B → t, where t is a simple assignment (a transition) and B is a Boolean expression called the guard of the PR. A production-rule set (PRS) is the parallel composition of all production rules in the set. Each pair of complementary PRs (the production rules that set and reset the same variable) is identified and implemented with (standard or nonstandard) CMOS operators. In order to carry over the robustness and delay insensitivity of the design to the circuit level, all operators are restoring (they have gain) and static (the output node is never "floating").

Fig. 1. The pull-up and pull-down networks implementing a CMOS logic gate. (a) Combinational gate. (b) State-holding gate with a standard "staticizer" ("keeper"). The circled transistors are weak.
Complementary PRs must be noninterfering, i.e., guards Bu and Bd cannot be true at the same time. If Bu = ¬Bd, then either Bu or Bd holds at any time, and output z is always connected to either Vdd or ground: z is always "driven." In this case, the operator is combinational, with the simple CMOS implementation of Fig. 1(a). If there are states in the computation where neither Bu nor Bd holds, then output z is "floating" in those states. The operator has to maintain the current value of z and is therefore state-holding. In a QDI circuit, a state-holding operator is usually implemented with the "keeper" or "staticizer" shown in Fig. 1(b). (Adding a staticizer introduces interferences between the original PRs and the rules added by the feedback inverter. The interferences are resolved by making the transistors of the feedback inverter "weak," i.e., with low drain-to-source current.) Alternatively, the state-holding operator can be transformed into a combinational one, like the nor-gate implementation of the set–reset shown in Fig. 2(b).
1) Combinational Gates: With the exception of the "write-acknowledge," we use standard combinational gates in this paper: inverter, nand, nor. nand- and nor-gates may have multiple inputs. Because of the restriction on the length of transistor chains, a multiple-input gate may have to be decomposed into a tree of smaller gates.²
2) State-Holding Elements: The three state-holding gates used in this paper are the Muller C-element, the set–reset gate, and the precharge function. The Muller C-element (also called C-element) with inputs x and y and output z, denoted z = x C y, implements the PRs x ∧ y → z↑ and ¬x ∧ ¬y → z↓. The C-element is a state-holding element, since the current value of z must be maintained when x ≠ y. The three-input C-element (denoted 3C) is used in a few examples in the paper.
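The next-state behavior of the C-element can be stated functionally: the output copies the inputs when they agree and otherwise holds its value. A one-line Python model (our own illustration) is

    def c_element(x, y, z_prev):
        """Muller C-element: z goes high when both inputs are high, low when
        both are low, and otherwise holds its previous value."""
        return x if x == y else z_prev

    # z = x C y, stepped through a few input patterns
    z = False
    for x, y in [(True, False), (True, True), (False, True), (False, False)]:
        z = c_element(x, y, z)
        print(x, y, "->", z)                     # False, True, True, False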
The set–reset gate with inputs s and r and output z has the PRs s → z↑ and r → z↓. Since s and r must be mutually exclusive, it is always possible to implement the set–reset gate as the C-element z = s C ¬r. This implementation requires inverting either s or r. Rather, we can code z with the two variables zt and zf, such that zt = ¬zf, and implement the set–reset gate either with two cross-coupled nor-gates or two cross-coupled inverters, as shown in Fig. 2(b) and (c).
We also use a nonstandard operator, the precharge function (PCF). An example of a precharge function with inputs en, x.0, x.1, y.0, y.1 and output z is described by the following two PR pairs:

en ∧ (x.0 ∧ y.0 ∨ x.1 ∧ y.1) → z.1↓
¬en → z.1↑

en ∧ (x.0 ∧ y.1 ∨ x.1 ∧ y.0) → z.0↓
¬en → z.0↑.

This gate computes the function X = Y, where X, Y, and Z are dual-rail encoded; en is a control signal. This gate is also a state-holding element. The PCF can be used for any function (Boolean or 1-of-N) whose CMOS pulldown networks do not exceed the limit on transistor-chain length.
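As a functional illustration of this behavior (ours, not the paper's; the name pcf_equal is hypothetical), the following Python sketch computes the next state of the two output rails, which are precharged high by ¬en and pulled low to signal a result, as in the PRs above.

    def pcf_equal(en, x0, x1, y0, y1, z1_prev, z0_prev):
        """Precharge-function model of X = Y for dual-rail X and Y.
        Returns the next values of the output rails (z.1, z.0)."""
        if not en:                               # precharge phase
            return True, True
        z1, z0 = z1_prev, z0_prev                # state-holding by default
        if (x0 and y0) or (x1 and y1):           # X = Y: pull z.1 low
            z1 = False
        if (x0 and y1) or (x1 and y0):           # X != Y: pull z.0 low
            z0 = False
        return z1, z0

    # X = 1, Y = 1: the "equal" rail z.1 fires (goes low)
    print(pcf_equal(True, False, True, False, True, True, True))  # (False, True)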
D. Stability and Noninterference

Stability and noninterference are the two properties of PRS that guarantee that the circuits are operating correctly, i.e., without logic hazards. A hazard is the possibility of an incomplete transition (a "glitch").³
²Large-gate decomposition is an annoying problem in asynchronous design, since it may violate stability. The decomposition of a multi-input or-gate we use in the paper is stable because a transition on the output is caused by exactly one input transition.
Fig. 2. State-holding elements. (a) The C-element. (b) nor-gate implementation of set–reset. (c) Inverter implementation of set–reset. (d) Example of a precharge function. The cross-coupled inverters in (a) and (b) are standard "staticizers." The letter w indicates that the transistors of the inverter are weak.
³Of course, electrical effects, in particular charge sharing and crosstalk, can also produce voltage glitches that are not eliminated by stability and noninterference.
How do we guarantee the proper execution of production rule G → t? In other words, what can go wrong and how do we avoid it? Two types of malfunction may take place: 1) G may cease to hold before transition t has completed, as the result of a concurrent transition invalidating G, and 2) the complementary transition t' of t is executed while the execution of t is in progress, leading to an undefined state. We introduce two requirements, stability and noninterference, that eliminate the two sources of malfunction.

The "result" R(t) of a transition t is defined as R(x↑) = x and R(x↓) = ¬x, for any x.
Definition 1: A production rule G → t is said to be stable in a computation if and only if G can change from true to false only in those states of the computation in which R(t) holds. A production-rule set is said to be stable if and only if all production rules in the set are stable.
Definition 2: Two production rules Bu → x↑ and Bd → x↓ are said to be noninterfering in a computation if and only if ¬Bu ∨ ¬Bd is an invariant of the computation. A production-rule set is noninterfering if every pair of complementary production rules in the set is noninterfering.
Any concurrent execution of a stable and noninterfering PRS is equivalent to the sequential execution model in which, at each step of the computation, a PR with a true guard is selected and executed. The selection of the PR should be weakly fair, i.e., any enabled PR is eventually selected for execution.

The existence of a sequential execution model for QDI computations greatly simplifies reasoning about, and simulating, those computations. Properties similar to stability are used in other theories of asynchronous computations, in particular semimodularity [33] and persistency [11]. Consider rule B ∧ x → x↓ as part of a gate G with output x. At the logical level, the execution of transition x↓ when the guard holds invalidates the guard. (Such production rules are therefore called self-invalidating.) We exclude self-invalidating production rules, since, in most implementations, they would violate the stability condition.
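The sequential execution model lends itself directly to simulation. The following Python sketch (our own; the name run_prs and the rule encoding are hypothetical) repeatedly fires an arbitrary enabled production rule, checking noninterference along the way, and runs the two PRs of a C-element as an example.

    import random

    def run_prs(rules, state, steps=10):
        """Sequential execution model of a production-rule set.
        rules: list of (guard, var, value) triples, read as guard -> var := value.
        state: dict mapping variable names to Booleans (modified in place)."""
        for _ in range(steps):
            # noninterference: complementary rules must not both be enabled
            for g1, v1, val1 in rules:
                for g2, v2, val2 in rules:
                    if v1 == v2 and val1 != val2 and g1(state) and g2(state):
                        raise RuntimeError("interference on " + v1)
            # fire one enabled, effective rule (weak fairness is not modeled)
            enabled = [(g, v, val) for g, v, val in rules
                       if g(state) and state[v] != val]
            if not enabled:
                return state                     # quiescent: no rule can fire
            _, v, val = random.choice(enabled)
            state[v] = val
        return state

    # C-element: x ^ y -> z up, ~x ^ ~y -> z down
    c_rules = [(lambda s: s["x"] and s["y"], "z", True),
               (lambda s: not s["x"] and not s["y"], "z", False)]
    print(run_prs(c_rules, {"x": True, "y": True, "z": False}))  # z becomes True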
E. Isochronic Forks

A computation implements a partial order of transitions. In the absence of timing assumptions, this partial order is based on a causality relation. For example, transition x↑ causes transition y↓ in state S if and only if x↑ makes guard By of y↓ true in S. Transition y↓ is said to acknowledge transition x↑. We do not have to be more specific about the precise ordering in time of transitions x↑ and y↓. The acknowledgment relation is enough to introduce the desired partial order among transitions, and to conclude that x↑ precedes y↓. In an implementation of the circuit, gate Gx with output x is directly connected to gate Gy with output y, i.e., x is an input of Gy. Hence, a necessary condition for an asynchronous circuit to be delay-insensitive is that all transitions are acknowledged.
Unfortunately, the class of computations in which all transitions are acknowledged is very limited. Consider the example of Fig. 3. Signal x is forked to x1, an input of gate Gy with output y, and to x2, an input of gate Gz with output z. A transition x↑ when c holds is followed by a transition y↑, but not by a transition z↑; i.e., transition x1↑ is acknowledged but transition x2↑ is not, and vice versa when ¬c holds. Hence, in either case, a transition on one output of the fork is not acknowledged. In order to guarantee that the unacknowledged transition completes without violating the specified order, a timing assumption called the isochronicity assumption has to be introduced, and the forks that require that assumption are called isochronic forks [26]. (Not all forks in a QDI circuit are isochronic.) Most circuits presented in this paper contain isochronic forks. A typical instance is the fork with input qi in Fig. 13 describing a single-bit register.

The timing assumption on isochronic forks is a one-sided inequality that can always be satisfied: it requires that the delay of a single transition be shorter than the sum of the delays on a multitransition path. It can be proved by constructing a Turing machine as a QDI circuit (see [23]) that the class of QDI circuits is Turing-complete, i.e., all Turing-computable functions have a QDI implementation.
Fig. 3. The fork (x, x1, x2) is isochronic: a transition on x1 causes a transition on y only when c is true, and a transition on x2 causes a transition on z only when c is false. Hence, certain transitions on x1 and on x2 are not acknowledged, and therefore a timing assumption must be used to guarantee the proper completion of those unacknowledged transitions.

IV. ASYNCHRONOUS COMMUNICATION PROTOCOLS

The implementation of send/receive communication is central to the methods of asynchronous logic, since this form of communication is used at all levels of system design, from communication between, say, a processor and a cache down to the interaction between the control part and the datapath of an ALU. Communication across a channel connecting two asynchronous components p1 and p2 is implemented as a handshake protocol. In a later section, we will describe how to implement communication between a synchronous (clocked) component and an asynchronous one. Such interfaces are needed in a GALS SoC.
A. Bare Handshake Protocol

Let us first implement a "bare" communication between processes p1 and p2: no data is transmitted. (Bare communications are used as a synchronization point between two processes.) In that case, channel (R, L) can be implemented with two wires: wire (ro, li) and wire (lo, ri). (The wires that implement a channel are also called rails.) See Fig. 4.

Wire (ro, li) is written by p1 and read by p2. Wire (lo, ri) is written by p2 and read by p1. An assignment ro↑ or ro↓ in p1 is eventually followed by the corresponding assignment li↑ or li↓ in p2 due to the behavior of wire (ro, li). And symmetrically for variables lo and ri, and wire (lo, ri). By convention, and unless specified otherwise, all variables are initialized to false.
1) Two-Phase Handshake: The simplest handshake protocol implementing the slack-zero communication between R and L is the so-called two-phase handshake protocol, also called nonreturn-to-zero (NRZ). The protocol is defined by the following handshake sequence, Ru for R and Lu for L:

Ru: ro↑; [ri]
Lu: [li]; lo↑.

Given the behavior of the two wires (ro, li) and (lo, ri), the only possible interleaving of the elementary transitions of Ru and Lu is

ro↑; li↑; lo↑; ri↑.
This interleaving is a valid implementation of a slack-zero execution of R and L, since there is no state in the system where one handshake has terminated and the other has not started. But now all handshake variables are true, and therefore the next handshake protocol for R and L has to be

Rd: ro↓; [¬ri]
Ld: [¬li]; lo↓.

The use of the two different protocols is possible if it can be statically determined (i.e., by inspection of the CHP code) which are the even (up-going) and odd (down-going) phases of the communication sequence on each channel. But if, for instance, the CHP program contains a selection command, it may be impossible to determine whether a given communication is an even or odd one. In that case, a general protocol has to be used that is valid for both phases, as follows:

R: ro := ¬ro; [ro = ri]
L: [lo ≠ li]; lo := ¬lo.

This protocol has a complicated circuit implementation, requiring exclusive-or gates and the storage of the current values of lo and ro. Two-phase handshake also requires that arithmetic and logical operations performed on the data transmitted be implemented in both up-going and down-going logics, which is quite inefficient. Therefore, in spite of its simplicity, the two-phase handshake protocol is rarely used besides some obvious cases.
2) Four-Phase Handshake: A straightforward solution is to always reset all variables to their initial value (zero). Such a protocol is called four-phase or return-to-zero (RZ). R is implemented as Ru; Rd and L as Lu; Ld, as follows:

R: ro↑; [ri]; ro↓; [¬ri]
L: [li]; lo↑; [¬li]; lo↓.

In this case, the only possible interleaving of transitions for a concurrent execution of R and L is

ro↑; li↑; lo↑; ri↑; ro↓; li↓; lo↓; ri↓.

Again, it can be shown that this interleaving implements a slack-zero communication between R and L. It can even be argued that this implementation is in fact the sequencing of two slack-zero communications: the first one between Ru and Lu, the second one between Rd and Ld. This observation will be used later to optimize the protocols by a transformation called reshuffling.
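To make the interleaving concrete, here is a small Python simulation of our own (not a circuit): two threads execute R and L over shared zero-delay wires, so li coincides with ro and ri with lo, and the printed trace is the four-phase sequence above.

    import threading, time

    wires = {"ro": False, "lo": False}    # zero-delay wires: li is ro, ri is lo
    trace = []

    def wait(name, value):                # the wait [w] or [not w]
        while wires[name] != value:
            time.sleep(0.0001)

    def assign(name, value, label):       # a transition w up or w down
        wires[name] = value
        trace.append(label)

    def R():                              # active side: ro+; [ri]; ro-; [~ri]
        assign("ro", True, "ro+"); wait("lo", True)
        assign("ro", False, "ro-"); wait("lo", False)

    def L():                              # passive side: [li]; lo+; [~li]; lo-
        wait("ro", True); assign("lo", True, "lo+")
        wait("ro", False); assign("lo", False, "lo-")

    tR, tL = threading.Thread(target=R), threading.Thread(target=L)
    tR.start(); tL.start(); tR.join(); tL.join()
    print("; ".join(trace))               # ro+; lo+; ro-; lo-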
Fig. 4. Implementation of a "bare" channel (L, R) with two handshake wires: (lo, ri) and (ro, li).

B. Handshake Protocols With Data: Bundled Data

Let us now deal with the case when the communication also entails transmitting data, for instance, by sending on R (R!x) and receiving on L (L?y). A solution immediately comes to mind: let us add a collection of data wires next to the handshake wires. The data wire (rd, ld) is indicated by a double arrow in Fig. 5. The protocols are as follows:

R!x: rd := x; ro↑; [ri]; ro↓; [¬ri]
L?y: [li]; y := ld; lo↑; [¬li]; lo↓.

This protocol relies on the timing assumption that the order between rd := x and ro↑ in the sender is maintained in the receiver: when the receiver has observed li to be true, it can assume that ld has been set to the right value, which amounts to assuming that the delay on wire (ro, li) is always "safely" longer than the delay on wire (rd, ld). Such a protocol is called bundled data. The efficiency of bundled data versus DI codes is a hotly debated issue. We will discuss it later.
C. DI Data Codes

In the absence of timing assumptions, the protocol cannot rely on a single wire to indicate when the data wires have been assigned a valid value by the sender. The validity of the data has to be encoded with the data itself. A DI data code is one in which the validity and neutrality of the data are encoded within the data. Furthermore, the code is chosen such that when the data changes from neutral to valid, no intermediate value is valid; when the data changes from valid to neutral, no intermediate value is neutral. Such codes are also called separable. There are many DI codes, but two are almost exclusively used on chip: the dual-rail and 1-of-N codes.
D. Dual-Rail Code

In a dual-rail code, two wires, bit.0 and bit.1, are used for each bit of the binary representation of the data. [See Fig. 6(a).] The neutral and valid values are encoded as follows:

value:   neutral  0  1
bit.0:      0     1  0
bit.1:      0     0  1

For a two-bit data word (x0, x1), its dual-rail encoding is

value:   neutral  0  1  2  3
x0.0:       0     1  0  1  0
x0.1:       0     0  1  0  1
x1.0:       0     1  1  0  0
x1.1:       0     0  0  1  1
E. 1-of-N Codes

In a 1-of-N code, one wire is used for each value of the data. Hence, the same two-bit data word is now encoded as follows:

value:   neutral  0  1  2  3
d.0:        0     1  0  0  0
d.1:        0     0  1  0  0
d.2:        0     0  0  1  0
d.3:        0     0  0  0  1

For a Boolean data word, dual-rail and 1-of-N are obviously identical. For a 2-bit data word, both dual-rail and 1-of-4 codes require four wires. For an N-bit data word, dual-rail requires 2N wires. If the bits of the original word are paired and each pair is 1-of-4 encoded, this coding also requires 2N wires. An assignment of a valid value to a dual-rail-coded word requires 2N transitions, but requires only N transitions in the case of a 1-of-4 code. [See Fig. 6(b).]
F. k-out-of-N Codes

The 1-of-N code, also called one-hot, is a special case of a larger class of codes called k-out-of-N. Instead of using just one true bit out of N code bits, as is done in the 1-of-N, we may use k, 0 < k < N, true bits to represent a valid code value. The number of valid values of a k-out-of-N code is the binomial coefficient (N choose k). Hence, the maximal number of valid values for a given N is obtained by choosing k as N/2. Sperner has proved that this code is not only the optimal k-out-of-N code, but also the optimal DI code in terms of the size of the code set for a given N.
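The counting claim is quickly checked in Python (our own illustration):

    import math

    N = 8
    sizes = {k: math.comb(N, k) for k in range(1, N)}  # k-out-of-N code sizes
    print(max(sizes, key=sizes.get))      # k = 4 = N/2 is optimal
    print(sizes[4])                       # 70 valid values on 8 wires,
                                          # versus 16 for dual-rail (4 bits)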
G. Which DI Code?

The choice of a DI code in the design of a system on a chip is dictated by a number of practical requirements. First, the tests for validity and neutrality must be simple. The neutrality test is simple: as in all codes, the unique neutral value is the set of all zeroes or the set of all ones. But the validity test may vary greatly with the code. Second, the coding and decoding of a data word must be simple. Third, the overhead in terms of the number of bits used for a code word compared to the number of bits used for a data word should be kept reasonably small. Finally, the code should be easy to "split": a coded word is often split into portions that are distributed among a number of processes; for example, a processor instruction may be decomposed into an opcode and several register fields. It is very convenient if the portions of a code word are themselves a valid code word. This is the case for the dual-rail code for all partitions and for the 1-of-4 code for partitions down to a quarter-byte. For all those practical reasons, dual-rail and 1-of-4 are used almost exclusively in asynchronous VLSI design.
Fig. 5. A bundled-data communication protocol. The cigar shape on the control wire (ro, li) indicates that the delay on the wire has been adjusted to be longer than the delays on the data wires (rd, ld).
H. Validity and Neutrality Tests

The combination of a four-phase handshake protocol and a DI code for the data gives the following general implementation for communication on a channel. In this generic description, we use global names for both the sender and receiver variables. A collection of data wires called data encodes the message being sent. A single acknowledge wire ack is used by the receiver to notify the sender that the message has been received. This wire is called the enable wire when it is initialized high (true).

The data wires are initially set to the neutral value of the code. The concurrent assignment denoted data⇑ takes the data wires from the neutral value to a valid value. The concurrent assignment data⇓ takes the data wires from a valid value back to the neutral value. The protocols for sender and receiver can be described as

Send: data⇑; [v(ack)]; data⇓; [n(ack)]
Receive: [v(data)]; ack↑; [n(data)]; ack↓.

The predicate v(X), called the validity test, is used to determine that X is a valid value for the chosen DI code. The predicate n(X), called the neutrality test, is used to determine that X has the neutral value in the chosen DI code. The implementations of validity and neutrality tests play an important role in the efficiency of QDI systems.
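For the dual-rail code, these tests have a simple functional description. The sketch below (our own; the names encode, valid, and neutral are hypothetical) encodes an N-bit word on 2N rails and tests validity and neutrality:

    def encode(word, n):
        """Dual-rail encode an n-bit integer: one (bit.0, bit.1) pair per bit."""
        return [(1 - (word >> i & 1), word >> i & 1) for i in range(n)]

    def valid(rails):                     # v(data): one rail high in every pair
        return all(r0 + r1 == 1 for r0, r1 in rails)

    def neutral(rails):                   # n(data): all rails low
        return all(r0 == 0 and r1 == 0 for r0, r1 in rails)

    data = encode(0b10, 2)                # x0 = 0, x1 = 1
    print(data)                           # [(1, 0), (0, 1)]
    print(valid(data), neutral([(0, 0), (0, 0)]))  # True True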
1) Active and Passive Protocols: There is an asymmetry in the (two-phase and four-phase) handshake protocols described in the previous section: one side, here the sender, starts by setting some output variables (wires) to a valid value. Such a protocol is called active. The other side, here the receiver, starts by waiting for some input variables (wires) to have a valid value. Such a protocol is called passive. Symmetrical protocols are possible but are more complicated and therefore rarely used.

Of course, an active protocol on one side of a channel has to be matched to a passive protocol on the other side of the same channel. It seems "natural" to choose the sender side to be active and the receiver side to be passive, but in fact, the sender can be passive and the receiver active. We will see cases when this is a better choice. The protocol is then as follows:

Send: [v(ack)]; data⇑; [n(ack)]; data⇓
Receive: ack↑; [v(data)]; ack↓; [n(data)].
2) Example: One-Bit Channel: A one-bit channel between a sender and a receiver is implemented as in Fig. 7. Two data wires (r.1, l.1) and (r.0, l.0) are used to code the values true and false of the bit. Wire (lo, ri) is often called the acknowledge wire. Next, we implement the send action R!x and the receive action L?y, where x is a Boolean variable local to the sender and y is a Boolean variable local to the receiver. The validity and neutrality tests for the receive are l.1 ∨ l.0 and ¬l.1 ∧ ¬l.0. For an active send and a passive receive, we get for R!x

[x → r.1↑ [] ¬x → r.0↑]; [ri]; r.1↓; r.0↓; [¬ri]

and for L?y

[l.1 ∨ l.0]; [l.1 → y↑ [] l.0 → y↓]; lo↑; [¬l.1 ∧ ¬l.0]; lo↓.
In the send, the selection [x → r.1↑ [] ¬x → r.0↑] assigns the value of x to the data wires of port R. In the receive, the selection [l.1 → y↑ [] l.0 → y↓] assigns the received value to the local variable y.
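The following Python sketch (again our own illustration, with zero-delay wires so that l.1 and l.0 coincide with r.1 and r.0) runs this one-bit protocol between two threads and prints the value received:

    import threading, time

    w = {"r1": False, "r0": False, "ri": False}   # two data rails and the
                                                  # acknowledge wire (lo, ri)
    def wait(cond):
        while not cond():
            time.sleep(0.0001)

    def send(x):   # R!x: [x -> r.1+ [] ~x -> r.0+]; [ri]; r.1-; r.0-; [~ri]
        w["r1" if x else "r0"] = True
        wait(lambda: w["ri"])
        w["r1"] = w["r0"] = False
        wait(lambda: not w["ri"])

    def recv():    # L?y: [l.1 v l.0]; y := l.1; lo+; [~l.1 ^ ~l.0]; lo-
        wait(lambda: w["r1"] or w["r0"])          # validity test
        y = w["r1"]
        w["ri"] = True
        wait(lambda: not w["r1"] and not w["r0"]) # neutrality test
        w["ri"] = False
        return y

    out = []
    t = threading.Thread(target=lambda: out.append(recv()))
    t.start(); send(True); t.join()
    print(out)                                    # [True]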
Fig. 6. (a) A dual-rail coding of a Boolean data channel. (b) A 1-of-4 coding of a four-valued integer data channel. Observe that no delay element is needed.

Fig. 7. Handshake wires for a one-bit DI data channel.