NUPACK: Analysis and Design of Nucleic Acid
Structures, Devices, and Systems
Mark E. Fornace,
1
,
2
,
#
, Jining Huang
1
,
#
, Cody T. Newman
1
,
#
, Nicholas J. Porubsky
2
, Marshall B. Pierce,
and Niles A. Pierce
1
,
3
,
∗
1
Division of Biology & Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA.
2
Division of Chemistry & Chemical Engineering, California
Institute of Technology, Pasadena, CA 91125, USA.
3
Division of Engineering & Applied Science, California Institute of Technology, Pasadena, CA 91125, USA.
ABSTRACT:
CUGGACAUCACCUCCCACAACGAGGACUAGUUGUG
GGAGGUGAUGUCGGGUGUUCACUACCAGCAGAACA
CCCGACAUCACCUACCACUACCAGCAGAACAAGGU
AGAUGUCGGGUGUUCUGCUGGUAGUGGUCUGGACA
UCACCUCCCACAACGAGGACUAGUUGUGGGAGGUG
AUGUCGGGUGUUCACUACCAGCAGAACACCCGACA
UCACCUACCACUACCAGCAGAACAAGGUAGAUGUC
GGGUGUUCUGCUGGUAGUGGUCUGGACAUCACCUC
CCACAACGAGGACUAGUUGUGGGAGGUGAUGUCGG
GUGUUCACUACCAGCAGAACACCCGACAUCACCUA
CCACUACCAGCAGAACAAGGUAGAUGUCGGGUGUU
CUGCUGGUAGUGGUCUGGACAUCACCUCCCACAAC
GAGGACUAGUUGUGGGAGGUGAUGUCGGGUGUUCA
CUACCAGCAGAACACCCGACAUCACCUACCACUAC
CAGCAGAACAAGGUAGAUGUCGGGUGUUCUGCUGG
UAGUGGU
Crosstalk
Products
10 nM
10 nM
10 nM
Intermediates
Reactants
10 nM
10 nM
10 nM
All 10 nM
Design
Analysis
NUPACK is a growing software suite for the analysis and de-
sign of nucleic acid structures, devices, and systems serving the
needs of researchers in the fields of nucleic acid nanotechnology,
molecular programming, synthetic biology, and across the life
sciences. NUPACK algorithms are unique in treating complex
and test tube ensembles containing arbitrary numbers of inter-
acting strand species, providing crucial tools for capturing con-
centration effects essential to analyzing and designing the inter-
molecular interactions that are a hallmark of these fields. The
all-new NUPACK web app (nupack.org) has been re-architected
for the cloud, leveraging a cluster that scales dynamically in response to user demand to enable rapid job submission and result in-
spection even at times of peak user demand. The web app exploits the all-new NUPACK 4 scientific code base as its backend,
offering enhanced physical models (coaxial and dangle stacking subensembles), dramatic speedups (20–120
×
for test tube analysis),
and increased scalability for large complexes. NUPACK 4 algorithms can also be run locally using the all-new NUPACK Python
module.
KEYWORDS:
DNA, RNA, secondary structure, base-pairing, hybridization, complex ensemble, test tube ensemble, multi-tube
ensemble, equilibrium, partition function, concentration, ensemble defect, analysis, design, reaction pathway engineering
INTRODUCTION
We are engaged in a multi-decade effort to develop NUPACK
(Nucleic Acid Package), a growing software suite for the anal-
ysis and design of nucleic acid structures, devices, and sys-
tems.
1
NUPACK algorithms
2–9
are formulated in terms of
nucleic acid secondary structure (i.e., the base-pairs of a set
of DNA or RNA strands) and employ empirical free energy
parameters.
10–22
Problem categories.
NUPACK algorithms address two fun-
damental classes of problems:
•
Sequence analysis:
given a set of DNA or RNA strands,
analyze the equilibrium base-pairing properties over a
specified ensemble.
•
Sequence design:
given a set of desired equilibrium base-
pairing properties, design the sequences of a set of DNA
or RNA strands over a specified ensemble. Sequence de-
sign is performed subject to diverse sequence constraints.
Ensembles.
NUPACK algorithms operate over two funda-
mental ensembles:
•
Complex ensemble:
the ensemble of all (unpseudoknot-
ted connected) secondary structures for an arbitrary
number of interacting RNA or DNA strands.
•
Test tube ensemble:
the ensemble of a dilute solu-
tion containing an arbitrary number of RNA or DNA
strand species (introduced at user-specified concentra-
tions) interacting to form an arbitrary number of com-
plex species.
Furthermore, to enable reaction pathway engineering of dy-
namic hybridization cascades (e.g., shape and sequence trans-
duction using small conditional RNAs
23, 24
) and large-scale
structural engineering including pseudoknots (e.g., RNA
origamis
25
), NUPACK generalizes sequence analysis and de-
sign to
multi-tube ensembles
comprising one or more test
tubes.
8
Note that a complex ensemble is subsidiary to a test tube
ensemble, so complex analysis is inherent in test tube anal-
ysis (but not vice versa), and complex design is inherent in
test tube design (but not vice versa). As it is typically infea-
sible to experimentally study a single complex in isolation,
we recommend analyzing and designing nucleic acid strands
in a test tube ensemble that contains the complex of inter-
est as well as other competing complexes that might form
in solution. For example, if one is experimentally studying
strands A and B that are intended to predominantly form
a secondary structure within the ensemble of complex A
·
B,
one should not presuppose that the strands do indeed form
A
·
B and simply analyze or design the base-pairing properties
of that complex. Instead, it is more physically relevant to
analyze or design a test tube ensemble containing strands A
and B interacting to form multiple complex species (e.g., A,
B, A
·
A, A
·
B, B
·
B) so as to capture both concentration infor-
mation (how much A
·
B forms?) and structural information
(what are the base-pairing properties of A
·
B when it does
form?).
All-new NUPACK 4 scientific code base.
NUPACK 4 anal-
ysis algorithms employ a new unified dynamic programming
framework
9
that provides enhanced secondary structure mod-
els including coaxial and dangle stacking subensembles, dra-
matic speedups (e.g., 20–120
×
for analysis of test tube en-
sembles), and enhanced scalability for large complexes (e.g.,
30,000 nt). NUPACK 4 design algorithms leverage these per-
formance benefits and support both hard constraints (that
must be obeyed by the designed sequences; e.g., diversity
and biological sequence constraints) and soft constraints (that
penalize but do not prohibit suboptimal sequences; e.g., se-
quence symmetry and energy match constraints) for multi-
tube ensembles.
All-new cloud-based NUPACK web app.
Since its launch
in 2007, usage of the NUPACK web app (nupack.org)
1
has
increased to the point where the underlying static compute
cluster is frequently overwhelmed by user demand. To pro-
1
Table 1.
Secondary structure notation: dot-parens plus, run-length encoded (RLE) dot-parens-plus, and DU+.
((((((((((((..........))))))))))))
((((((((((((+))))))))))))..........
((((((((((((+..........))))))))))))
(12.10)12
(12+)12.10
(12+.10)12
D12 U10
D12 + U10
D12 (+ U10)
((((((((((((..........+))))))))))))
..........((((((((((((+))))))))))))
(12.10+)12
.10(12+)12
D12 (U10 +)
U10 D12 +
Secondary structure*
Dot-parens-plus
RLE dot-parens-plus
DU+
∗
For each secondary structure, the first nucleotide is depicted in red.
vide a scalable resource for the global research community,
the NUPACK web app was re-architected from the ground
up to exploit a scalable compute cluster that resizes dynam-
ically in the cloud in response to user demand. The all-new
NUPACK web app integrates diverse components to create an
intuitive and powerful analysis and design environment:
•
Algorithms:
mathematically rigorous, physically sound,
computationally efficient scientific algorithms.
2–9
•
Hardware:
a hybrid cloud compute cluster combining
local hardware and scalable cloud hardware.
•
Interface:
an intuitive web interface for rapid job sub-
mission and result inspection.
•
Graphics:
publication-quality client-side graphics to en-
able straightforward interpretation of results, interactive
data, and efficient preparation of talks and papers.
Researchers can run jobs and inspect results within the NU-
PACK web app, which leverages NUPACK 4 scientific code
base as its backend.
All-new NUPACK Python module.
Alternatively, re-
searchers can script and run jobs locally using the all-new
NUPACK 4 Python module, providing the flexibility to in-
teract with the broader Python ecosystem.
Outline.
To provide a foundation for describing the NU-
PACK web app, we begin by defining the physical model,
relevant physical quantities, and the design formulation. For
the convenience of the reader, definitions and descriptions are
drawn from the original algorithms papers,
3, 5–9
but are here
organized into a single unified presentation for easy perusal.
We then summarize the features of the Analysis, Design, and
Utilities pages of the NUPACK web app:
•
Analysis page:
Analyze the equilibrium base-pairing
properties of one or more test tube ensembles (and sub-
sidiary complex ensembles). These are the all-purpose
sequence analysis tools.
•
Design page:
Design the sequences for one or more
test tube ensembles (and subsidiary complex ensembles).
These are the all-purpose sequence design tools.
•
Utilities page:
Analyze, design, or prepare figures for a
single complex ensemble. These are quick tools applica-
ble when your ensemble is a single complex.
PHYSICAL MODEL
Sequence.
The
sequence
,
φ
, of one or more interacting RNA
strands is specified as a list of bases
φ
a
∈{
A,C,G,U
}
for
a
=
1
,...,
|
φ
|
. For DNA,
φ
a
∈{
A,C,G,T
}
. Nucleic acid sequences
are listed 5
′
to 3
′
. For RNA calculations, T is automatically
converted to U, and vice versa for DNA calculations. For
sequence design, sequence constraints can be specified using
IUPAC degenerate nucleotide codes (see Table 2).
Table 2.
IUPAC degenerate nucleotide codes for RNA.
Code
Nucleotides
∗
M
A or C
R
A or G
W
A or U
S
C or G
Y
C or U
K
G or U
V
A, C, or G
H
A, C, or U
D
A, G, or U
B
C, G, or U
N
A, C, G, or U
∗
For DNA, T replaces U.
Secondary structure.
A
secondary structure
,
s
, of one or
more interacting RNA strands is defined by a set of base
pairs, each a Watson–Crick pair (A
·
U or C
·
G) or a wobble
pair (G
·
U) (e.g., see Figure 1a). For DNA, the corresponding
Watson–Crick pairs are A
·
T or C
·
G and there are no wobble
pairs (G
·
T is classified as a mismatch
18
). A
polymer graph
representation of a secondary structure is constructed by or-
dering the strands around a circle, drawing the backbones in
succession from 5
′
to 3
′
around the circumference with a
nick
between each strand, and drawing straight lines connecting
paired bases. A secondary structure is
unpseudoknotted
if
there exists a strand ordering for which the polymer graph
has no crossing lines, or
pseudoknotted
if all strand orderings
contain crossing lines. A secondary structure is
connected
if
no subset of the strands is free of the others.
5
Secondary structures may be specified in one of three ways
for NUPACK calculations (see Table 1 for examples):
•
dot-parens-plus notation
: each unpaired base is repre-
sented by a dot, each base pair by matching parentheses,
and each nick between strands by a plus.
1
•
run-length encoded (RLE) dot-parens-plus notation
: as a
shorthand for dot-parens-plus, any sequence of consec-
utive characters in dot-parens-plus may be replaced by
the character followed by a number.
9
•
DU+ notation
: Using DU+ notation, a duplex is de-
noted by
D
followed by the number of base pairs and an
unpaired region is denoted by
U
followed by the number
of unpaired nucleotides.
26
Each duplex is followed imme-
diately by the substructure (specified in DU+ notation)
that is “enclosed” by the duplex. If this substructure
includes more than one element, parentheses are used to
denote scope. A nick between strands is specified by a
“+”.
In mathematical expressions, it is convenient to represent
secondary structure
s
using a
structure matrix
S
(
s
) with
entries
S
a,b
(
s
) = 1 if structure
s
contains base pair
a
·
b
2
Complex
ABC
A
B
C
Test tube
A
B
C
Strands
BBC
BC
AAA
ACB
BBB
CCC
CA
CCB
ABC
AAC
AB
A
BBA
BB
AAB
B
CC
CCA
AA
C
Complexes
a
b
Figure 1.
Complex and test tube ensembles. (a) A (connected
unpseudoknotted) secondary structure for a complex of 3 strands
with strand ordering
π
= ABC. An arrowhead denotes the 3
′
end
of each strand. (b) A test tube ensemble containing strand species
Ψ
0
=
{
A,B,C
}
interacting to form all complex species Ψ of up to
L
max
= 3 strands. Adapted with permission from Fornace
et al.
,
ACS Synth Biol
, 9, 2665-2678, 2020. Copyright 2020 American
Chemical Society.
and
S
a,b
(
s
) = 0 otherwise. Abusing notation, the entry
S
a,a
(
s
) = 1 if base
a
is unpaired in structure
s
and 0 oth-
erwise. Hence,
S
(
s
) is a symmetric matrix with row and
column sums of 1.
Complex ensemble.
Consider a complex of
L
distinct
strands (each with a unique identifier in
{
1
,
···
,L
}
) corre-
sponding to strand ordering
π
. The
complex ensemble
Γ(
φ
)
contains all connected polymer graphs with no crossing lines
(i.e., all unpseudoknotted secondary structures).
5
As a mat-
ter of algorithmic necessity, all of the dynamic programs
in NUPACK operate on complex ensemble
Γ(
φ
) treating all
strands as distinct. However, in the laboratory, strands with
the same sequence are typically indistinguishable with respect
to experimental observables. For comparison to experimental
data, physical quantities calculated over ensemble
Γ(
φ
) are
post-processed to obtain the corresponding quantities calcu-
lated over
complex ensemble
Γ(
φ
) in which strands with the
same sequence are treated as indistinguishable.
9
The ensem-
ble Γ(
φ
)
⊆
Γ(
φ
) is a maximal subset of distinct secondary
structures for strand ordering
π
. Two secondary structures
are indistinguishable if their polymer graphs can be rotated
so that all strands are mapped onto indistinguishable strands,
all base pairs are mapped onto base pairs, and all unpaired
bases are mapped onto unpaired bases; otherwise the struc-
tures are distinct.
5
Test tube ensemble.
A
test tube ensemble
is a dilute so-
lution containing a set of strand species, Ψ
0
, introduced at
user-specified concentrations, that interact to form a set of
complex species, Ψ, each corresponding to a different strand
ordering treating strands with the same sequence as indis-
tinguishable.
5, 9
For
L
strands, there are (
L
−
1)! strand or-
derings if all strands are different species (e.g., complexes
π
= ABC and
π
= ACB for
L
= 3 and strands A, B, C), but
fewer than (
L
−
1)! strand orderings if some strands are of the
same species (e.g., complex
π
= AAA for
L
= 3 with three
A strands). By the Representation Theorem,
5
a secondary
structure in the complex ensemble for one strand ordering
does not appear in the complex ensemble for any other strand
ordering, averting redundancy. It is often convenient to define
Ψ to contain all complex species of up to
L
max
strands, al-
though Ψ can be defined to contain arbitrary complex species
formed from the strand species in Ψ
0
.
Multi-tube ensemble.
Consider the multi-tube ensemble
Ω where tube
h
∈
Ω contains the set of strand species Ψ
0
h
interacting to form the set of complex species Ψ
h
. The set of
all complexes in multi-tube ensemble Ω is then Ψ
≡∪
h
∈
Ω
Ψ
h
.
hairpin loop
stacked pairs
bulge loop
interior loop
multiloop
exterior loop
C
B
A
Figure 2.
Loop-based free energy model for a complex. Canonical
loop types for a complex with strand ordering
π
= ABC. Adapted
with permission from Fornace
et al.
,
ACS Synth Biol
, 9, 2665-2678,
2020. Copyright 2020 American Chemical Society.
Note that the multi-tube ensemble encompasses the complex
and test tube ensembles as subsidiary special cases.
8
Loop-based free energy model.
For each (unpseudoknot-
ted connected) secondary structure
s
∈
Γ(
φ
), the free energy,
∆
G
(
φ,s
), is estimated as the sum of the empirically deter-
mined free energies of the constituent loops
12–14, 17, 21, 22
plus
a strand association penalty,
27
∆
G
assoc
, applied
L
−
1 times
for a complex of
L
strands:
∆
G
(
φ,s
) = (
L
−
1) ∆
G
assoc
+
∑
loop
∈
s
∆
G
(loop)
.
(1)
The secondary structure of Figure 2 illustrates the different
loop types, with loop free energy, ∆
G
(loop), modeled as fol-
lows:
12–14, 17, 21, 22
•
A
hairpin loop
is closed by a single base-pair
a
·
b
. The
loop free energy, ∆
G
hairpin
a,b
, depends on sequence and
loop size.
•
An
interior loop
is closed by two base pairs (
a
·
b
and
d
·
e
with
a < d < e < b
). The loop free energy, ∆
G
interior
a,d,e,b
de-
pends on sequence, loop size, and loop asymmetry.
Bulge
loops
(where either
d
=
a
+ 1 or
e
=
b
−
1) and
stacked
pairs
(where both
d
=
a
+ 1 and
e
=
b
−
1) are treated
as special cases of interior loops.
•
A
multiloop
is closed by three or more base pairs. The
loop free energy is modeled as the sum of three sequence-
independent penalties: ∆
G
multi
init
for formation of a multi-
loop, ∆
G
multi
bp
for each closing base pair, ∆
G
multi
nt
for each
unpaired nucleotide inside the multiloop, one sequence-
dependent penalty: ∆
G
terminalbp
a,b
for each closing pair
a
·
b
, and optional coaxial and dangle stacking contribu-
tions (see below).
•
An
exterior loop
contains a nick between strands and any
number of closing base pairs. The exterior loop free en-
ergy is the sum of ∆
G
terminalbp
a,b
for each closing base pair
a
·
b
, plus optional coaxial and dangle stacking contribu-
tions (see below). Hence, an unpaired strand has a free
energy of zero, corresponding to the reference state.
5
See Section S1.7 of Reference 9 for details on the functional
form of loop-based free energy models.
Coaxial and dangle stacking.
Within a multiloop or an
exterior loop, there is a subensemble of coaxial stacking
states between adjacent closing base pairs and dangle stack-
ing states between closing base pairs and adjacent unpaired
bases. Within a multiloop or exterior loop, a base pair can
form one
coaxial stack
with an adjacent base pair, or can form
3
a
0 coaxial stacks
0 dangle stacks
0 coaxial stacks, 1 dangle stack
0 coaxial stacks, 2 dangle stacks
1 coaxial stack
0 dangle stacks
1 coaxial stack, 1 dangle stack
1 coaxial stack, 2 dangle stacks
b
1 dangle stack
2 dangle stacks
0 coaxial stacks
1 coaxial stack
0 dangle stacks
c
multiloop coaxial stack
multiloop dangle stack
multiloop
exterior loop coaxial stack
exterior loop dangle stack
exterior loop
Multiloop stacking states
Exterior loop stacking states
Exterior loop stacking states
Figure 3.
Coaxial and dangle stacking states for multiloops and exterior loops. (a) Stacking subensemble for the multiloop of Figure
2. (b,c) Stacking subensembes for two exterior loops from Figure 2. Adapted with permission from Fornace
et al.
,
ACS Synth Biol
, 9,
2665-2678, 2020. Copyright 2020 American Chemical Society.
a
dangle stack
with at most two adjacent unpaired bases; un-
paired bases can either form no stack, or can form a dangle
stack with at most one adjacent base pair.
For a given multiloop or exterior loop, the energetic contri-
butions of all possible coaxial and dangle stacking states are
enumerated so as to calculate the free energy:
9
∆
G
stacking
=
−
kT
log
∑
ω
∈
loop
∏
x
∈
ω
e
−
∆
G
x
/kT
(2)
where
ω
indexes the possible stacking states within the loop
and
x
indexes the individual stacks (coaxial or dangle) within
a stacking state. The free energy of a multiloop or exterior
loop is augmented by the corresponding ∆
G
stacking
bonus.
Hence, a secondary structure
s
continues to be defined as a
set of base pairs, and the stacking states within a given multi-
loop or exterior loop are treated as a structural subensemble
that contributes in a Boltzmann-weighted fashion to the free
energy model for the loop. Let
s
q
∈
s
denote a stacking state
of the paired and unpaired bases in
s
. We may equivalently
define the free energy of secondary structure
s
in terms of the
stacking state free energies
∆
G
(
φ,s
q
)
(3)
for all stacking states
s
q
∈
s
:
∆
G
(
φ,s
) =
−
kT
log
∑
s
q
∈
s
e
−
∆
G
(
φ,s
q
)
/kT
(4)
Let
Γ
q
(
φ
) denote the ensemble of stacking states correspond-
ing to the complex ensemble of secondary structures
Γ(
φ
).
NUPACK supports the following coaxial and dangle stack-
ing formulations:
•
All stacking
(default): complex ensemble with coaxial
and dangle stacking.
•
Coaxial stacking:
complex ensemble with coaxial stack-
ing.
•
Dangle stacking:
complex ensemble with dangle stacking.
•
No stacking:
complex ensemble without coaxial or dangle
stacking.
For backwards compatibility with NUPACK 3, NUPACK 4
also supports historical complex ensembles without coaxial
stacking and with approximate dangle stacking.
Symmetry correction.
For a secondary structure
s
∈
Γ(
φ
)
with an
R
-fold rotational symmetry, there is an
R
-fold re-
duction in distinguishable conformational space, so the free
energy
∆
G
(
φ,s
) must be adjusted
5
by a symmetry correction:
∆
G
(
φ,s
) =
∆
G
(
φ,s
) + ∆
G
sym
(
φ,s
)
.
(5)
where
∆
G
sym
(
φ,s
) =
kT
log
R
(
φ,s
)
.
(6)
Because the symmetry factor
R
(
φ,s
) is a global property of
each secondary structure
s
∈
Γ(
φ
), it is not suitable for use
with dynamic programs that treat multiple subproblems si-
multaneously without access to global structural information.
As a result, dynamic programs operate on ensemble
Γ(
φ
) us-
ing physical model
∆
G
(
φ,s
) and then the Distinguishabil-
ity Correction Theorem
5
enables exact conversion of physical
quantities to ensemble Γ(
φ
) using physical model ∆
G
(
φ,s
).
Interestingly, ensembles
Γ(
φ
) and Γ(
φ
) both have utility when
examining the physical properties of a complex as they pro-
vide related but different perspectives, akin to complemen-
tary thought experiments.
9
Free energy parameters.
NUPACK supports the follow-
ing temperature-dependent parameter sets for RNA:
•
rna06
based on References 14, 19 and 21 with additional
parameters
13, 17
including coaxial stacking
14, 22
and dan-
gle stacking
11, 17, 22
in 1M Na
+
.
•
rna95
based on Reference 11 with additional parame-
ters
17
including coaxial stacking
14, 22
and dangle stack-
ing
11, 17, 22
in 1M Na
+
.
and for DNA:
•
dna04
based on References 12 and 18 with additional pa-
rameters
17
including coaxial stacking
16
and dangle stack-
ing
15, 17
in user-specified concentrations of Na
+
, K
+
,
NH
+
4
, and Mg
++
.
12, 16, 18, 20
DNA/RNA hybrids are not allowed. For backwards compat-
ibility with NUPACK 3, NUPACK 4 also supports historical
DNA and RNA parameter sets.
Salts.
The default salt conditions for RNA and DNA pa-
rameter sets are [Na
+
] = 1M; these are the only salt condi-
tions for RNA. Salt corrections are available for DNA param-
eters to permit calculations in user-specified sodium, potas-
sium, ammonium, and magnesium ion concentrations.
4
•
Sodium:
based on refs 12 and 18; the sum of the concen-
trations of (monovalent) sodium, potassium, and ammo-
nium ions, [Na
+
]+[K
+
]+[NH
+
4
], is specified in units of
molar (default: 1.0, range: [0.05,1.1]).
•
Magnesium:
based on refs 16 and 20; the concentra-
tion of (divalent) magnesium ions, [Mg
++
], is specified
in units of molar (default: 0.0, range: [0.0,0.2]).
PHYSICAL QUANTITIES
Consider a multi-tube ensemble Ω where tube
h
∈
Ω contains
a set of strand species Ψ
0
h
interacting to form a set of com-
plex species Ψ
h
. Let
j
∈
Ψ
h
denote a complex with sequence
φ
j
and complex ensembles
Γ(
φ
j
) (treating all strands as dis-
tinct) and Γ(
φ
j
) (treating strands with the same sequence as
indistinguishable). NUPACK calculates a number of physical
quantities over these ensembles.
5, 9
Partition function.
For complex
j
, the partition function
evaluated over ensemble Γ(
φ
j
) treating strands with the same
sequence as indistinguishable is denoted
Q
(
φ
j
) =
∑
s
∈
Γ(
φ
j
)
e
−
∆
G
(
φ
j
,s
)
/kT
.
(7)
Complex free energy.
For complex
j
, the corresponding
complex free energy is
∆
G
(
φ
j
)
≡−
kT
log(
Q
(
φ
j
))
.
(8)
Structure free energy.
For complex
j
, the secondary struc-
ture free energy treating strands with the same sequence as
indistinguishable is denoted
∆
G
(
φ
j
,s
)
.
(9)
If the physical model includes coaxial and dangle stacking,
the structure free energy will include stacking contributions
∆
G
stacking
. If the secondary structure
s
has a rotational sym-
metry, the structure free energy will include the symmetry
correction ∆
G
sym
(
φ
j
,s
).
Equilibrium structure probability.
For complex
j
, the equi-
librium structure probability of any secondary structure
s
∈
Γ(
φ
j
) treating strands with the same sequence as indistin-
guishable is denoted
p
(
φ
j
,s
) =
e
−
∆
G
(
φ
j
,s
)
/kT
/Q
(
φ
j
)
.
(10)
Boltzmann-sampled structures.
For complex
j
, a set of
J
secondary structures Boltzmann-sampled from ensemble
Γ(
φ
j
) treating strands with the same sequence as indistin-
guishable is denoted
Γ
sample
(
φ
j
,J
)
∈
Γ(
φ
j
)
.
(11)
Boltzmann-sampled structures are available only via the NU-
PACK Python module.
Equilibrium base-pairing probabilities.
For complex
j
, the
base-pairing probability matrix
P
(
φ
j
) has entries
P
a,b
(
φ
j
)
∈
[0
,
1] corresponding to the probability
P
a,b
(
φ
j
) =
∑
s
∈
Γ(
φ
j
)
p
(
φ
j
,s
)
S
a,b
(
s
)
(12)
that base pair
a
·
b
forms at equilibrium within ensemble
Γ(
φ
j
),
treating all strands as distinct. Here,
S
(
s
) is the structure
matrix and
p
(
s
) is the equilibrium probability of structure
s
∈
Γ(
φ
j
) treating all strands as distinct. Abusing notation,
the entry
P
i,i
(
φ
j
)
∈
[0
,
1] denotes the equilibrium probability
that base
i
is unpaired over ensemble
Γ(
φ
j
). Hence,
P
(
φ
j
)
is a symmetric matrix with row and column sums of 1. In
NUPACK graphics, the diagonal entries denoting unpaired
bases are depicted as an extra column to the right of the
matrix.
MFE proxy structure.
For complex
j
, the minimum free en-
ergy (MFE) stacking state
s
q
MFE
(
φ
j
)
∈
Γ
q
(
φ
j
) treating all
strands as distinct is
s
q
MFE
(
φ
j
) = arg
min
s
q
∈
Γ
q
(
φ
j
)
∆
G
(
φ
j
,s
q
)
(13)
The corresponding MFE proxy structure is
s
MFE
′
(
φ
j
)
≡{
s
∈
Γ(
φ
j
)
|
s
q
MFE
(
φ
j
)
∈
s
}
,
(14)
defined as the secondary structure containing the MFE stack-
ing state within its subensemble. The free energy of the MFE
proxy structure is
∆
G
(
φ
j
,s
MFE
′
(
φ
j
))
.
(15)
There may be more than one MFE stacking state, each cor-
responding to the same or different MFE proxy structures. If
there are multiple MFE proxy structures, the NUPACK web
app presents only one of them; to see multiple MFE proxy
structures, use the NUPACK Python module.
Suboptimal proxy structures.
For complex
j
, the set of
suboptimal proxy secondary structures with stacking states
within a specified ∆
G
gap
≥
0 of the MFE stacking state is
denoted
Γ
subopt
(
φ
j
,
∆
G
gap
)
=
{
s
∈
Γ(
φ
j
)
|
s
q
∈
s,
∆
G
(
φ
j
,s
q
)
≤
∆
G
(
φ
j
,s
q
MFE
(
φ
j
)) + ∆
G
gap
}
.
(16)
Suboptimal proxy structures are available only via the NU-
PACK Python module.
Complex ensemble defect.
For complex
j
with target struc-
ture
s
j
, the dimensional complex ensemble defect
n
(
φ
j
,s
j
) =
|
φ
j
|−
∑
1
≤
a
≤|
φ
j
|
,
1
≤
b
≤|
φ
j
|
P
a,b
(
φ
j
)
S
a,b
(
s
j
)
,
(17)
represents the equilibrium number of incorrectly paired nu-
cleotides over the ensemble
Γ(
φ
j
) relative to target structure
s
j
.
3, 6
Here,
P
(
φ
j
) is the equilibrium base-pairing probability
matrix and
S
(
s
j
) is the target structure matrix for
s
j
. The
normalized complex ensemble defect
is then
N
j
≡
n
(
φ
j
,s
j
)
/
|
φ
j
| ∈
[0
,
1]
,
(18)
representing the equilibrium fraction of incorrectly paired nu-
cleotides evaluated over the ensemble of complex
j
relative to
target structure
s
j
.
Complex ensemble size.
For complex
j
, the number of
secondary structures in the complex ensemble, treating all
strands as distinct, is denoted
|
Γ(
φ
j
)
|
.
(19)
The corresponding number of stacking states is denoted
|
Γ
q
(
φ
j
)
|
.
(20)
Equilibrium complex concentrations.
For the set of com-
plexes Ψ
h
in test tube
h
, the set of equilibrium complex con-
centrations is denoted
x
Ψ
h
≡
x
j
∀
j
∈
Ψ
h
.
(21)
These concentrations are the unique solution to the strictly
convex optimization problem
5
min
x
Ψ
h
∑
j
∈
Ψ
h
x
j
(log
x
j
−
log
Q
j
−
1)
(22)
subject to
∑
j
∈
Ψ
h
A
i,j
x
j
=
x
0
i
∀
i
∈
Ψ
0
h
,
(23)
5