of 4
Asynchronous Logic for High Variability Nano-CMOS
Alain J. Martin
California Institute of Technology
Pasadena, CA 91125, USA
Abstract
—At the nanoscale level, parameter variations in fabricated de-
vices cause extreme variability in delay. Delay variations are also the main
issue in subthreshold operation. Consequently, asynchronous logic seems
an ideal, and probably unavoidable choice, for the design of digital circuits
in nano CMOS or other emerging technologies. This paper examines the
robustness of one particular asynchronous logic: quasi-delay insensitive or
QDI. We identify the three components of this logic that can be affected
by extreme variability: staticizer, isochronic fork, and rings. We show that
staticizers can be eliminated, and isochronic forks and rings can be made
arbitrarily robust to timing variations.
I
NTRODUCTION
All future technologies, nano-CMOS as well as potential emerging
technologies like carbon nanotubes or printed electronics, will face
great fabrication challenges that will translate into important param-
eter variations and decreased reliability [1], [2]. All parameter vari-
ations that do not cause the circuit to fail affect the timing. Another
challenge to the CMOS designer is subthreshold operation. Operat-
ing a circuit at subthreshold voltage is an attractive option to further
reduce energy consumption. However it is usually accompanied with
large delay fluctuations because of the exponential dependency of the
Ids
current on
Vdd
and
Vth
, combined with
Vth
variations [3].
Therefore, a design approach that is (mostly) independent of de-
lays for the logical correctness of the circuits produces designs that
are much more robust to parameter variations thatn traditional ap-
proaches. Because of its weaker dependence on delay, asynchronous
logic seems an ideal and perhaps unavoidable choice for digital cir-
cuits in the nanoscale era, in particular the
quasi-delay insensitive
(QDI) approach that relies on the weakest possible timing assump-
tion (isochronic fork). In fact, all QDI circuits designed at Caltech
have shown remarkable robustness to
V dd
,
V th
and temperature vari-
ations. All (including several microprocessors) could operate over a
wide range of voltages,
including subthreshold voltage
without partic-
ular precautions. SPICE simulations also show that threshold voltage
variations between twice the nominal voltage value and zero volt, are
tolerated. The question addressed in this paper is the following. In
the presence of extreme PVT (process, voltage, temperature) parame-
ter variations, how will a QDI circuit eventually fail, and what can be
done to avoid the failure or at least significantly improve the circuit’s
robustness?
In order to answer this question, we analyse the different compo-
nents of a correct QDI system to assess its robustness to parameter
variations. We first examine the individual gates (operators) that are
the building blocks of a QDI circuit. We then look at the way in which
the operators are composed to form a system.
I. I
MPLEMENTING
QDI O
PERATORS
We distinguish two types of operators: combinational and state-
holding. A complete logic family for high performance circuits—like,
e.g., the circuits of the Caltech MiniMIPS (see [6])—requires only
nand, nor and inverter in terms of combinational gates, plus the non-
standard write-acknowledge circuit (
wack
).
The implementation of combinational gates presents no problem be-
sides the restriction on the number of transistors in series. The decom-
position of a large gate into a network of smaller gates that satisfy the
restriction is part of the QDI synthesis procedure.
The number of state-holding elements used in QDI logic is surpris-
ingly small. We can design any digital circuit (complete micropro-
cessor) with just the 2-input C-element, the set-reset latch for mem-
ory and registers, and the precharge function. However, the standard
CMOS implementation of state-holding operators in this technology
does present a serious problem: the weak inverter of the staticizer.
Consider the operator defined by the two
production rules
:
A
x
B
x
(
A
and
B
, the “guards” of the production rules, are boolean expres-
sions in terms of the input variables of the operator: When
A
holds,
x
is set to true, when
B
holds,
x
is set to false.) Without loss of general-
ity, let’s assume that it is implemented as
A
x
B
x
x
x
¬
x
x
By definition of a state-holding element, there exist states in the com-
putation where
¬
A
∧¬
B
holds. In those “floating” states, the path to
Vdd through the pullup implementing
B
, and the path to GND through
the pulldown
A
are both cut. The voltage value of the physical node
implementing
x
has to be maintained in other ways. The general ap-
proach is to find another path to Vdd and another path to GND to
maintain the current value of
x
. The simplest and crudest solution
is to add a “staticizer” (or “keeper”). The staticizer implementation
consists of
always
maintaining the current value of
x
by adding the
pullup
¬
x
x
and the pulldown
x
x
, giving the gate:
A
x
x
B
∨¬
x
x
The added circuitry is a feedback inverter with input
x
and output
x
.
The difficulty with this solution is that when the value of
x
has to be
changed, for instance from true to false, there is a conflict (a “fight”)
between
A
x
and
¬
x
x
. Symmetrically, when the value
of
x
has to be changed from false to true, there is a “fight” between
B
x
and
x
x
. There is no logical resolution to those con-
flicts; they can only be resolved by physical means by making sure that
the current through the pullup implementing
B
and through the pull-
down implementing
A
is stronger than the current through the tran-
sistors of the feedback inverter. This is usually achieved by imple-
menting the two transistors of the feedback inverter as highly resistive
(“weak”), making the currents through the pullup and the pulldown
of the feedback inverter much smaller than the currents through the
pullup
B
and pulldown
A
. But the weak currents cannot be too weak
since they have to maintain the voltage on node
x
in the presence of
possibly important leakage.
Hence, the correct behavior of the staticizer relies on a two-sided
inequality requirement on the feedback inverter’s current. In a tech-
nology with high parameter variations, the requirement for the imple-
mentation to tolerate a variation of for instance
3
σ
makes it impossible
978-1-4244-5091-6/09/$25.00 ©2009 IEEE
69
x
a
b
b
a
x_
x
a
a
b
b
xx
x
Fig. 1. A two-input C-element implemented as a majority gate.
to satisfy the two-sided inequality.
We conclude that it will not be pos-
sible to use staticizers in nano-CMOS.
We have to use another solution to maintain the value of the output
nodes in the floating states. The general method consists in identifying
the floating states in which the value of the output has to be maintained
and using the feedback inverter
only in those states.
Without additional
information, the floating states are characterized by
¬
A
∧¬
B
, leading
to the general solution:
A
∨¬
B
x
x
B
∨¬
A
∧¬
x
x
A. C-element
The Muller C-element is an essential building block of asyn-
chronous logic. The 2-input C-element with inputs
a
and
b
and output
x
is defined as:
a
b
x
¬
a
∧¬
b
x
The above transformation applied to this pair of PRs leads to the well-
known majority-gate implementation (see Figure 1):
a
b
a
x
b
x
x
¬
a
∧¬
b
∨¬
a
∧¬
x
∨¬
b
∧¬
x
x
x
x
¬
x
x
For the three-input C-element, the transformation gives:
a
b
c
a
x
b
x
c
x
x
¬
a
∧¬
b
∧¬
c
∨¬
a
∧¬
x
∨¬
b
∧¬
x
∨¬
c
∧¬
x
x
B. Read/Write Register
The read/write register is used in a standard pipeline stage of the
Caltech Asynchronous Microprocessor (CAM). It is also used in a
slightly different form to implement general-purpose asynchronous
registers.
As shown in Figure 2, in its simplest form, the register consists of
three part: the set/reset latch maintaining the current value of the regis-
ter bit
(
x
,
x
)
, the two nand-gates that constitute the read part, and the
write-acknowledge (
wack
) cell that generates the acknowledge signal
for the write handshake. The implementation of the read part presents
no difficulty. The set/reset latch can be implemented as cross-coupled
nor-gates which do not need relative sizing unlike the implementation
with cross-coupled inverters (see [4]). Only the
wack
needs some at-
tention. Its specification is
wt
x
wf
x
wack
¬
wt
∧¬
wf
wack
The floating states are
¬
wt
wf
∧¬
x
and
wt
∧¬
wf
∧¬
x
. In both
states, the write interface is in the process of changing the values of
x_
x
ri
xf_
xt_
wt
wf
wack_
wack
SR
+
reg
x_
x
Fig. 2. Dual-rail single-bit register with read and write interfaces
x
and
x
and therefore the output
wack
should be kept high in the
floating states. The transformation then gives
¬
wt
∧¬
wf
∨¬
wt
∧¬
x
∨¬
wf
∧¬
x
wack
Now the two guards are complementary.
The write-acknowledge is a
combinational gate.
Altogether the above design leads to a robust lay-
out for the register. No feedback inverter is needed, the only feedback
being that of the cross-coupled nor-gates, and there are at most two
transistors in series on all paths. (However, it is difficult to use the
cross-coupled nor-gates in an SRAM implementation where compact-
ness is paramount.)
C. Precharge Function
The precharge function is a state-holding cell with two sets of in-
puts: the input
en
is a control signal used to precharge the node
z
high in the neutral state (
¬
en
). The other set
X
of inputs is used
to compute the boolean function
F
when
en
is high. Function
F
is
necessarily small enough to fit in a single pulldown, and therefore the
number of inputs of
X
is also limited—rarely more than four. The
PRS for the precharge function cell is:
¬
en
z
en
F
z
z
z
¬
z
z
The floating state is
en
∧¬
F
. Applying the transformation and after
simplification, we get:
¬
en
∨¬
F
∧¬
z
z
en
F
en
z
z
For example, if
F
is the dual-rail equality of
a
and
b
, i.e.
f
=
a
0
b
0
a
1
b
1
, the complement of
F
is
(
¬
a
0
∨¬
b
0)
(
¬
a
1
∨¬
b
1)
. This
leads to 5 different pullup paths with 3 transistors in series (including
¬
z
).
The difficulty of this general solution is that the structure of function
F
may create a complementary function
¬
F
with too many conjuncts
in a term of the DNF representation of the function. But it should be
observed that the added pullup
¬
F
∧¬
z
is used only to maintain the
value true of
z
in the floating state and not to switch the output. There-
fore, the possible complexity of
¬
F
, which translates into a weak drive
(
I
ds
current), might be acceptable. However, as already argued, the
I
ds
current cannot be too weak, and therefore we need a method to decom-
pose the precharge function implementation when
¬
F
is too complex.
Such a decomposition exists.
D. General Decomposition of the Precharge Function
Without loss of generality, consider a function
F
with 3 terms in
its “complete” DNF (meaning that each minterm contains all literals):
F
=
f
1
f
2
f
3
. Its complement has 3 conjuncts in each term of
its DNF, which is too many since the pullup
¬
F
∧¬
z
has a chain of 4
p-transistors. The specification of the PCF(
F
) is:
*[[
en
(
f
1
f
2
f
3)
]
;
z
;
[
¬
en
]
;
z
]
70
Because at most one term
f
i
of
F
is true at any time, the following
decomposition is equivalent to the above specification:
(
*[[
en
f
1
]
;
z
1
;
[
¬
en
]
;
z
1
]
*[[
en
f
2
]
;
z
2
;
[
¬
en
]
;
z
2
]
*[[
en
f
3
]
;
z
3
;
[
¬
en
]
;
z
3
]
*[[
z
1
z
2
z
3
]
;
z
;
[
¬
z
1
∧¬
z
2
∧¬
z
3
]
;
z
]
)
where
z
1
,
z
2
,
z
3
are intermediate variables. Because at most one term
f
i
of
F
is true at any time, only one input
z
i
of the nand-gate is true
at any time. Therefore, the nand-gate can be decomposed arbitrarily
to satisfy the limitation on transistor-chain length without violation the
QDI requirement of stability.
However, the following simplification significantly improves the cir-
cuit. It applies whenever the dual-rail version of the function has to be
computed, i.e.:
¬
en
zt
en
Ft
zt
¬
en
zf
en
Ff
zf
If the only floating states are
en
Ft
∧¬
Ff
for the
Ff
block, and
en
Ff
∧¬
Ft
for the
Ft
block, then we can staticize the PRS as:
¬
en
∨¬
zf
zt
¬
en
∨¬
zt
zf
with the pulldowns unchanged. This simplification requires that
¬
en
Ft
Ff
be an invariant of the system. In other words, the condition
for applying the transformation is that
the function inputs are not reset
when
en
is set
. Certain QDI templates already satisfy this condition.
It is the case for the control/data decomposition scheme of the pipeline
stage used in the CAM. (It relies on an easily satisfiable isochronicity
assumption.) Other QDI templates can easily be transformed to satisfy
the invariant. It is the case for the PCHB (“precharge half-buffer”) of
the MiniMIPS design style. Since the validity (usually called
v
(
L
)
)
of the data input is always computed, it suffices to replace the enable
signal
en
with
en
defined as
v
(
L
)
C
en
(i.e.
en
is the output of the
C-element with
en
and
v
(
L
)
as inputs).
II. I
SOCHRONIC
F
ORK
Once all QDI building blocks have been defined, the next step is to
compose them into a working system. Can we satisfy the timing re-
quirements of isochronic forks? In the past, a simple-to-explain suffi-
cient timing condition has often been used to implement the isochronic
fork; but, in the presence of important timing variations, this sufficient
timing condition is difficult to implement as it relies on a two-sided
timing inequality. The designer should switch to a more complex but
easier to satisfy necessary and sufficient condition. This condition re-
lies on the notion of
adversary path
. It is a one-sided timing inequality
stating that a single transition should be shorter in time than a sequence
of multiple transitions (usually five in CMOS). (For a formal proof of
this result, see [7].)
Let us first review how the need for isochronic forks arises. A com-
putation implements a partial order of transitions. In the absence of
timing assumptions, this partial order is based on a causality relation.
For example, transition
x
causes transition
y
in state
S
if and only
if
x
makes guard
By
of
y
true in
S
. Transition
y
is said to
ac-
knowledge
transition
x
. We do not have to be more specific about the
precise ordering in time of transitions
x
and
y
. The acknowledg-
ment relation is enough to introduce the desired partial order among
transitions, and to conclude that
x
precedes
y
. In an implementa-
tion of the circuit, gate
Gx
with output
x
is directly connected to gate
Gy
with output
y
, i.e.,
x
is an input of
Gy
.
Hence, a necessary condition for an asynchronous circuit to be
delay-insensitive is that all transitions are acknowledged.
x2
Gy
y
Gz
z
c
c
x1
x
Fig. 3. The fork (x,x1,x2) is isochronic: a transition on x1 causes a transition
on y only when c is true, and a transition on x2 causes a transition on z only
when c is false. Hence, certain transitions on x1 and on x2 are not acknowl-
edged, and therefore a timing assumption must be used to guarantee the proper
completion of those unacknowledged transitions.
Unfortunately, the class of computations in which all transitions are
acknowledged is very limited, as we proved in [5]. Consider the fol-
lowing example in which the specification of the computation requires
an ordering of transitions on variables
x
,
y
, and
z
defined by the se-
quence:
...
x
;
y
;
...
;
x
;
z
...
Implementing this sequence of transitions requires introducing at least
one control variable
c
to distinguish the states in which
x
causes
y
from the states in which
x
causes
z
, leading to PRs of the form:
c
x
y
¬
c
x
z
As shown in Figure 3,
x
as the output of gate
Gx
is forked to
x
1
, an
input of gate
Gy
with output
y
, and to
x
2
, an input of gate
Gz
with
output
z
. A transition
x
when
c
holds is followed by a transition
y
,
but not by a transition
z
, i.e. transition
x
1
is acknowledged but tran-
sition
x
2
is not, and vice versa when
¬
c
holds. Hence, in either case,
a transition on one output of the fork is not acknowledged. In order to
guarantee that the unacknowledged transition completes without vio-
lating the specified order, a timing assumption called the
isochronicity
assumption
has to be introduced, and the forks that require that as-
sumption are called
isochronic forks
[5]. The branch of the fork with
the non-acknowledged transition is called an
isochronic branch.
(Not
all forks in a QDI circuit are isochronic.)
A. The Weakest Isochronicity Assumption
First, it is important to realize that the weakest timing assumption
associated with an isochronic fork is not a relation between the delays
on the different branches of the fork—which would be a very tight as-
sumption indeed. For example, in the case of a fork with two branches,
if
δ
(
t
1)
and
δ
(
t
2)
are the delays of transition
t
1
on one branch and
transition
t
2
on the other branch, the timing assumption is NOT that
|
δ
(
t
1)
δ
(
t
2)
|≤

. As already mentioned, such a timing requirement
has been used because of its simplicity. It is sufficient but not neces-
sary, and is very difficult to fulfill in the presence of large parameter
variations, in particular threshold voltages.
In order to characterize the weakest (necessary and sufficient) tim-
ing condition for an isochronic fork to behave properly, let us examine
how a circuit will fail when a transition on an isochronic branch fails
to complete. In our previous example, assume that
x
1
causes
y
, but
x
2
does not complete. Since
x
2
does not change the value of
z
, ini-
tially the failure of
x
2
to complete does not cause a failure of gate
Gz
.
But
y
causes a sequence of transitions that will eventually change an
other input of
Gz
, say transition
t
since we do not know the sense of
this transition. At this point, transition
x
2
not having completed will
cause a misfiring of
Gz
, since in the partial-order specification of the
circuit,
x
2
should have completed.
71
C
x1_
ro (0−>?)
x2_ (1−>0)
lo_
ri_
(0−>1)
li
li
2
(0−>1)
li
1
(0−>1)
ri_ (1)
Fig. 4. The fork (li,li1,li2) is isochronic, with li2 as isochronic branch for the
transition 0 to 1 on li
Hence, the weakest isochronicity assumption is that transition
x
2
completes before the sequence of transitions starting at
x
1
and ending
at
t
. This sequence of transitions defines a path in the circuit called the
adversary path
of isochronic branch
x
2
. The isochronicity assump-
tion states that the delay
δ
(
x
2
)
for unacknowledged transition
x
2
to
complete be always smaller than the sum of the delays of the transi-
tions on the adversary path (including acknowledged transition
x
1
).
This one-sided inequality can always be satisfied by making the ad-
versary path longer. SPICE Monte Carlo simulations of a typical QDI
circuit in IBM 65nm CMOS 10 LPe show that an adversary path of
length 5 will fail once in 10,000 at Vdd = 400mV. But an adversary
path of length 7 will fail only once in 2 million trials at Vdd = 300mV.
B. A Worst-case Example
A worst-case example, where the adversary path contains just one
gate, is the Caltech “Q-element” (see [4]) shown in Figure 4. Its spec-
ification in terms of boolean transitions on its external variables is:
[
lo
;
li
;
lo
;
li
;
ro
;
ri
;
ro
;
ri
]
(where
*[
...
]
means “repeat
forever.”)
We analyze the isochronic fork
(
li,li
1
,li
2)
. Initially,
x
2
is true
and
ro
is false. In order to satisfy the specification, transition
li
should not change the value of
ro
. But since the inverted output of the
C-element is the input
x
2
of the nor-gate the adversary path (
li
1
, C-
element,
x
2
) causes transition
x
2
, which could cause transition
ro
,
unless the isochronic-branch transition
li
2
terminates before
x
2
.
This is one of the tightest isochronicity requirements since the adver-
sary path contains only one gate, the C-element. In an implementation
of the C-element as a majority gate, the inverted output has to be taken
after two inverters, making the adversary path safer (3 transitions).
III. O
SCILLATING
R
INGS
A QDI system is nothing but an interlocking of oscillating rings.
Therefore, a correct implementation of a QDI system requires to guar-
antee that the rings of restoring gates keep oscillating, in the same
way that an odd number of inverters connected in a ring will keep
oscillating if properly designed. Two conditions, called
stability
and
non-interference
are necessary and sufficient to guarantee that a QDI
circuit is free of hazards (incomplete or erroneous transitions). Rule
Au
x
is stable in a computation if
Au
can be invalidated only in
the states where
x
is true. (In other words, once
Au
becomes true it
remains true until
x
fires.) Non-interference states that
Au
x
and
Ad
x
cannot fire in the same state, i.e.
Au
and
Ad
are never true
in the same state.
Since hardware computations are non-terminating, each transition
z
is followed, after a number of other transitions, by transition
z
,
and vice versa. Since the guards
Bu
and
Bd
of those transitions are
mutually exclusive, the chain of transitions between
z
and
z
must
contain a transition that invalidates
Bu
. Hence, transition
z
invali-
dates its guard
Bu
through a sequence of intermediate transitions. Can
we still say that
Bu
is stable? Is it possible that the effect of
z
prop-
agates through the cycle of gates fast enough to invalidate itself? At
the electrical level, could the voltage
V
(
z
)
stabilize at an intermediate
value close to
V dd/
2
?
Arguing that such a ring of operators is not self-invalidating is
equivalent to arguing that the ring oscillates. This is an electrical prop-
erty of the circuit that relates the slew rates of transitions, the gain of
the operators, and the number of operators on the ring. We must re-
quire that any cycle of operators be implemented with a number of
stages at least equal to a chosen minimum to guarantee that the cy-
cle is not self-invalidating. In a CMOS circuit operating at nominal
Vdd
, three restoring operators with good gain are usually sufficient,
although we usually require five to be safe. In other nano-technologies,
the minimal number of operators on a ring will have to be determined
based on the electrical properties of the technology.
IV. C
ONCLUSION
The goal of this paper was to establish the necessary and sufficient
conditions to make QDI circuits practically immune to the large pa-
rameter variations that will affect the design of nano-CMOS systems
and subthreshold operation. First, we have shown that there exist ef-
ficient static implementations of state-holding operators without weak
feedback and with no more than two transistors in series in the pullups.
Those implementations do not rely on specific current ratios or tran-
sistor sizing.
Second, the only necessary timing assumption, the isochronic fork
assumption, has been refined, and we showed that it is a one-sided
timing assumption that can always be satisfied. Third, the issue of
oscillating rings of operators has been analyzed. It was established
that in order to keep a ring oscillating, it has to contain a minimal
number of restoring operators. This number is technology dependent.
The same results can be applied to making SRAM robust to parame-
ter variations at the cost of an area overhead. If the designer insists on a
6T SRAM cell design, then the usual issues of read robustness and rel-
ative transistor sizing have to be dealt with, with increasing difficulty
as the technology advances.
In conclusion, the reader should observe that the conditions we have
formulated for the correct operation of a QDI system are different from
what is required in a clocked system. In particular, the timing require-
ments in a clocked system are two-sided inequalities that are difficult
to satisfy over a wide range of parameter values. This is where the
robustness advantage of QDI lies, with the absence of clock.
A
CKNOWLEDGMENTS
The research described in this paper was supported by a grant from
the National Science Foundation.
R
EFERENCES
[1] Shekhar Borkar
et al.
Parameter Variations and Impact on Circuits and
Microarchitecture.
DAC 2003
, pp338-342.
[2] K.Bowman
et al.
Impact of Die-to-Die and Within-Die Parameter Fluctu-
ations on Maximum Clock Frequency Distribution for Gigascale Integra-
tion.
IEEE JSSC,
vol.37, no.2, 2002.
[3] S. Hanson
et al.
Ultralow-voltage, minimum-energy CMOS.
IBM J. Res
& Dev.
vol 50, No 4/5. 2006.
[4] A.J. Martin, M. Nystr
̈
om. Asynchronous Techniques for System-on-Chip
Design.
Proc. of the IEEE
, Special Issue on Systems-on-Chip, 94, 6,
1089-1120. 2006.
[5] Alain J. Martin. The Limitations to Delay-Insensitivity in Asynchronous
Circuits.
Sixth MIT Conf. on ARVLSI,
MIT Press, 1990.
[6] Alain J. Martin
et al.
The Design of an Asynchronous MIPS R3000 Mi-
croprocessor.
Proc. 17th Conf. on ARVLSI
. IEEE Computer Society Press,
pp. 164–181, 1997.
[7] Sean Keller, Michael Katelman, Alain J. Martin. A Necessary and Suf-
ficient Timing Assumption for Speed-Independent Circuits.
Proc. 15th
IEEE Int. Symp. on Asynchronous Circuits & Systems
, 2009.
72