nupack-analysis-and-design-of-nucleic-acid-structures-devices-and-systems.pdf

NUPACK: Analysis and Design of Nucleic Acid

Structures, Devices, and Systems

Mark E. Fornace,

, Jining Huang

, Cody T. Newman

, Nicholas J. Porubsky

, Marshall B. Pierce,

and Niles A. Pierce

∗

Division of Biology & Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA.

Division of Chemistry & Chemical Engineering, California

Institute of Technology, Pasadena, CA 91125, USA.

Division of Engineering & Applied Science, California Institute of Technology, Pasadena, CA 91125, USA.

ABSTRACT:

CUGGACAUCACCUCCCACAACGAGGACUAGUUGUG

GGAGGUGAUGUCGGGUGUUCACUACCAGCAGAACA

CCCGACAUCACCUACCACUACCAGCAGAACAAGGU

AGAUGUCGGGUGUUCUGCUGGUAGUGGUCUGGACA

UCACCUCCCACAACGAGGACUAGUUGUGGGAGGUG

AUGUCGGGUGUUCACUACCAGCAGAACACCCGACA

UCACCUACCACUACCAGCAGAACAAGGUAGAUGUC

GGGUGUUCUGCUGGUAGUGGUCUGGACAUCACCUC

CCACAACGAGGACUAGUUGUGGGAGGUGAUGUCGG

GUGUUCACUACCAGCAGAACACCCGACAUCACCUA

CCACUACCAGCAGAACAAGGUAGAUGUCGGGUGUU

CUGCUGGUAGUGGUCUGGACAUCACCUCCCACAAC

GAGGACUAGUUGUGGGAGGUGAUGUCGGGUGUUCA

CUACCAGCAGAACACCCGACAUCACCUACCACUAC

CAGCAGAACAAGGUAGAUGUCGGGUGUUCUGCUGG

UAGUGGU

Crosstalk

Products

10 nM

Intermediates

Reactants

10 nM

All 10 nM

Design

Analysis

NUPACK is a growing software suite for the analysis and de-

sign of nucleic acid structures, devices, and systems serving the

needs of researchers in the fields of nucleic acid nanotechnology,

molecular programming, synthetic biology, and across the life

sciences. NUPACK algorithms are unique in treating complex

and test tube ensembles containing arbitrary numbers of inter-

acting strand species, providing crucial tools for capturing con-

centration effects essential to analyzing and designing the inter-

molecular interactions that are a hallmark of these fields. The

all-new NUPACK web app (nupack.org) has been re-architected

for the cloud, leveraging a cluster that scales dynamically in response to user demand to enable rapid job submission and result in-

spection even at times of peak user demand. The web app exploits the all-new NUPACK 4 scientific code base as its backend,

offering enhanced physical models (coaxial and dangle stacking subensembles), dramatic speedups (20–120

for test tube analysis),

and increased scalability for large complexes. NUPACK 4 algorithms can also be run locally using the all-new NUPACK Python

module.

KEYWORDS:

DNA, RNA, secondary structure, base-pairing, hybridization, complex ensemble, test tube ensemble, multi-tube

ensemble, equilibrium, partition function, concentration, ensemble defect, analysis, design, reaction pathway engineering

INTRODUCTION

We are engaged in a multi-decade effort to develop NUPACK

(Nucleic Acid Package), a growing software suite for the anal-

ysis and design of nucleic acid structures, devices, and sys-

tems.

NUPACK algorithms

2–9

are formulated in terms of

nucleic acid secondary structure (i.e., the base-pairs of a set

of DNA or RNA strands) and employ empirical free energy

parameters.

10–22

Problem categories.

NUPACK algorithms address two fun-

damental classes of problems:

•

Sequence analysis:

given a set of DNA or RNA strands,

analyze the equilibrium base-pairing properties over a

specified ensemble.

•

Sequence design:

given a set of desired equilibrium base-

pairing properties, design the sequences of a set of DNA

or RNA strands over a specified ensemble. Sequence de-

sign is performed subject to diverse sequence constraints.

Ensembles.

NUPACK algorithms operate over two funda-

mental ensembles:

•

Complex ensemble:

the ensemble of all (unpseudoknot-

ted connected) secondary structures for an arbitrary

number of interacting RNA or DNA strands.

•

Test tube ensemble:

the ensemble of a dilute solu-

tion containing an arbitrary number of RNA or DNA

strand species (introduced at user-specified concentra-

tions) interacting to form an arbitrary number of com-

plex species.

Furthermore, to enable reaction pathway engineering of dy-

namic hybridization cascades (e.g., shape and sequence trans-

duction using small conditional RNAs

23, 24

) and large-scale

structural engineering including pseudoknots (e.g., RNA

origamis

), NUPACK generalizes sequence analysis and de-

sign to

multi-tube ensembles

comprising one or more test

tubes.

Note that a complex ensemble is subsidiary to a test tube

ensemble, so complex analysis is inherent in test tube anal-

ysis (but not vice versa), and complex design is inherent in

test tube design (but not vice versa). As it is typically infea-

sible to experimentally study a single complex in isolation,

we recommend analyzing and designing nucleic acid strands

in a test tube ensemble that contains the complex of inter-

est as well as other competing complexes that might form

in solution. For example, if one is experimentally studying

strands A and B that are intended to predominantly form

a secondary structure within the ensemble of complex A

one should not presuppose that the strands do indeed form

B and simply analyze or design the base-pairing properties

of that complex. Instead, it is more physically relevant to

analyze or design a test tube ensemble containing strands A

and B interacting to form multiple complex species (e.g., A,

B, A

A, A

B, B

B) so as to capture both concentration infor-

mation (how much A

B forms?) and structural information

(what are the base-pairing properties of A

B when it does

form?).

All-new NUPACK 4 scientific code base.

NUPACK 4 anal-

ysis algorithms employ a new unified dynamic programming

framework

that provides enhanced secondary structure mod-

els including coaxial and dangle stacking subensembles, dra-

matic speedups (e.g., 20–120

for analysis of test tube en-

sembles), and enhanced scalability for large complexes (e.g.,

30,000 nt). NUPACK 4 design algorithms leverage these per-

formance benefits and support both hard constraints (that

must be obeyed by the designed sequences; e.g., diversity

and biological sequence constraints) and soft constraints (that

penalize but do not prohibit suboptimal sequences; e.g., se-

quence symmetry and energy match constraints) for multi-

tube ensembles.

All-new cloud-based NUPACK web app.

Since its launch

in 2007, usage of the NUPACK web app (nupack.org)

has

increased to the point where the underlying static compute

cluster is frequently overwhelmed by user demand. To pro-

Table 1.

Secondary structure notation: dot-parens plus, run-length encoded (RLE) dot-parens-plus, and DU+.

((((((((((((..........))))))))))))

((((((((((((+))))))))))))..........

((((((((((((+..........))))))))))))

(12.10)12

(12+)12.10

(12+.10)12

D12 U10

D12 + U10

D12 (+ U10)

((((((((((((..........+))))))))))))

..........((((((((((((+))))))))))))

(12.10+)12

.10(12+)12

D12 (U10 +)

U10 D12 +

Secondary structure*

Dot-parens-plus

RLE dot-parens-plus

DU+

∗

For each secondary structure, the first nucleotide is depicted in red.

vide a scalable resource for the global research community,

the NUPACK web app was re-architected from the ground

up to exploit a scalable compute cluster that resizes dynam-

ically in the cloud in response to user demand. The all-new

NUPACK web app integrates diverse components to create an

intuitive and powerful analysis and design environment:

•

Algorithms:

mathematically rigorous, physically sound,

computationally efficient scientific algorithms.

2–9

•

Hardware:

a hybrid cloud compute cluster combining

local hardware and scalable cloud hardware.

•

Interface:

an intuitive web interface for rapid job sub-

mission and result inspection.

•

Graphics:

publication-quality client-side graphics to en-

able straightforward interpretation of results, interactive

data, and efficient preparation of talks and papers.

Researchers can run jobs and inspect results within the NU-

PACK web app, which leverages NUPACK 4 scientific code

base as its backend.

All-new NUPACK Python module.

Alternatively, re-

searchers can script and run jobs locally using the all-new

NUPACK 4 Python module, providing the flexibility to in-

teract with the broader Python ecosystem.

Outline.

To provide a foundation for describing the NU-

PACK web app, we begin by defining the physical model,

relevant physical quantities, and the design formulation. For

the convenience of the reader, definitions and descriptions are

drawn from the original algorithms papers,

3, 5–9

but are here

organized into a single unified presentation for easy perusal.

We then summarize the features of the Analysis, Design, and

Utilities pages of the NUPACK web app:

•

Analysis page:

Analyze the equilibrium base-pairing

properties of one or more test tube ensembles (and sub-

sidiary complex ensembles). These are the all-purpose

sequence analysis tools.

•

Design page:

Design the sequences for one or more

test tube ensembles (and subsidiary complex ensembles).

These are the all-purpose sequence design tools.

•

Utilities page:

Analyze, design, or prepare figures for a

single complex ensemble. These are quick tools applica-

ble when your ensemble is a single complex.

PHYSICAL MODEL

Sequence.

The

sequence

, of one or more interacting RNA

strands is specified as a list of bases

∈{

A,C,G,U

}

for

,...,

. For DNA,

∈{

A,C,G,T

}

. Nucleic acid sequences

are listed 5

′

to 3

′

. For RNA calculations, T is automatically

converted to U, and vice versa for DNA calculations. For

sequence design, sequence constraints can be specified using

IUPAC degenerate nucleotide codes (see Table 2).

Table 2.

IUPAC degenerate nucleotide codes for RNA.

Code

Nucleotides

∗

A or C

A or G

A or U

C or G

C or U

G or U

A, C, or G

A, C, or U

A, G, or U

C, G, or U

A, C, G, or U

∗

For DNA, T replaces U.

Secondary structure.

secondary structure

, of one or

more interacting RNA strands is defined by a set of base

pairs, each a Watson–Crick pair (A

U or C

G) or a wobble

pair (G

U) (e.g., see Figure 1a). For DNA, the corresponding

Watson–Crick pairs are A

T or C

G and there are no wobble

pairs (G

T is classified as a mismatch

). A

polymer graph

representation of a secondary structure is constructed by or-

dering the strands around a circle, drawing the backbones in

succession from 5

′

to 3

′

around the circumference with a

nick

between each strand, and drawing straight lines connecting

paired bases. A secondary structure is

unpseudoknotted

there exists a strand ordering for which the polymer graph

has no crossing lines, or

pseudoknotted

if all strand orderings

contain crossing lines. A secondary structure is

connected

no subset of the strands is free of the others.

Secondary structures may be specified in one of three ways

for NUPACK calculations (see Table 1 for examples):

•

dot-parens-plus notation

: each unpaired base is repre-

sented by a dot, each base pair by matching parentheses,

and each nick between strands by a plus.

•

run-length encoded (RLE) dot-parens-plus notation

: as a

shorthand for dot-parens-plus, any sequence of consec-

utive characters in dot-parens-plus may be replaced by

the character followed by a number.

•

DU+ notation

: Using DU+ notation, a duplex is de-

noted by

D

followed by the number of base pairs and an

unpaired region is denoted by

U

followed by the number

of unpaired nucleotides.

Each duplex is followed imme-

diately by the substructure (specified in DU+ notation)

that is “enclosed” by the duplex. If this substructure

includes more than one element, parentheses are used to

denote scope. A nick between strands is specified by a

“+”.

In mathematical expressions, it is convenient to represent

secondary structure

using a

structure matrix

(

) with

entries

a,b

(

) = 1 if structure

contains base pair

Complex

ABC

Test tube

Strands

BBC

AAA

ACB

BBB

CCC

CCB

ABC

AAC

BBA

AAB

CCA

Complexes

Figure 1.

Complex and test tube ensembles. (a) A (connected

unpseudoknotted) secondary structure for a complex of 3 strands

with strand ordering

= ABC. An arrowhead denotes the 3

′

end

of each strand. (b) A test tube ensemble containing strand species

{

A,B,C

}

interacting to form all complex species Ψ of up to

max

= 3 strands. Adapted with permission from Fornace

et al.

ACS Synth Biol

Chemical Society.

and

a,b

(

) = 0 otherwise. Abusing notation, the entry

a,a

(

) = 1 if base

is unpaired in structure

and 0 oth-

erwise. Hence,

(

) is a symmetric matrix with row and

column sums of 1.

Complex ensemble.

Consider a complex of

distinct

strands (each with a unique identifier in

{

···

}

) corre-

sponding to strand ordering

. The

complex ensemble

Γ(

)

contains all connected polymer graphs with no crossing lines

(i.e., all unpseudoknotted secondary structures).

As a mat-

ter of algorithmic necessity, all of the dynamic programs

in NUPACK operate on complex ensemble

Γ(

) treating all

strands as distinct. However, in the laboratory, strands with

the same sequence are typically indistinguishable with respect

to experimental observables. For comparison to experimental

data, physical quantities calculated over ensemble

Γ(

) are

post-processed to obtain the corresponding quantities calcu-

lated over

complex ensemble

Γ(

) in which strands with the

same sequence are treated as indistinguishable.

The ensem-

ble Γ(

)

⊆

Γ(

) is a maximal subset of distinct secondary

structures for strand ordering

. Two secondary structures

are indistinguishable if their polymer graphs can be rotated

so that all strands are mapped onto indistinguishable strands,

all base pairs are mapped onto base pairs, and all unpaired

bases are mapped onto unpaired bases; otherwise the struc-

tures are distinct.

Test tube ensemble.

test tube ensemble

is a dilute so-

lution containing a set of strand species, Ψ

, introduced at

user-specified concentrations, that interact to form a set of

complex species, Ψ, each corresponding to a different strand

ordering treating strands with the same sequence as indis-

tinguishable.

5, 9

For

strands, there are (

−

1)! strand or-

derings if all strands are different species (e.g., complexes

= ABC and

= ACB for

= 3 and strands A, B, C), but

fewer than (

−

1)! strand orderings if some strands are of the

same species (e.g., complex

= AAA for

= 3 with three

A strands). By the Representation Theorem,

a secondary

structure in the complex ensemble for one strand ordering

does not appear in the complex ensemble for any other strand

ordering, averting redundancy. It is often convenient to define

Ψ to contain all complex species of up to

max

strands, al-

though Ψ can be defined to contain arbitrary complex species

formed from the strand species in Ψ

Multi-tube ensemble.

Consider the multi-tube ensemble

Ω where tube

∈

Ω contains the set of strand species Ψ

interacting to form the set of complex species Ψ

. The set of

all complexes in multi-tube ensemble Ω is then Ψ

≡∪

∈

Ω

hairpin loop

stacked pairs

bulge loop

interior loop

multiloop

exterior loop

Figure 2.

Loop-based free energy model for a complex. Canonical

loop types for a complex with strand ordering

= ABC. Adapted

with permission from Fornace

et al.

ACS Synth Biol

, 9, 2665-2678,

Note that the multi-tube ensemble encompasses the complex

and test tube ensembles as subsidiary special cases.

Loop-based free energy model.

For each (unpseudoknot-

ted connected) secondary structure

∈

Γ(

), the free energy,

∆

(

φ,s

), is estimated as the sum of the empirically deter-

mined free energies of the constituent loops

12–14, 17, 21, 22

plus

a strand association penalty,

∆

assoc

, applied

−

1 times

for a complex of

strands:

∆

(

φ,s

) = (

−

1) ∆

assoc

∑

loop

∈

∆

(loop)

(1)

The secondary structure of Figure 2 illustrates the different

loop types, with loop free energy, ∆

(loop), modeled as fol-

lows:

12–14, 17, 21, 22

•

hairpin loop

is closed by a single base-pair

. The

loop free energy, ∆

hairpin

a,b

, depends on sequence and

loop size.

•

interior loop

is closed by two base pairs (

and

with

a < d < e < b

). The loop free energy, ∆

interior

a,d,e,b

de-

pends on sequence, loop size, and loop asymmetry.

Bulge

loops

(where either

+ 1 or

−

1) and

stacked

pairs

(where both

+ 1 and

−

1) are treated

as special cases of interior loops.

•

multiloop

is closed by three or more base pairs. The

loop free energy is modeled as the sum of three sequence-

independent penalties: ∆

multi

init

for formation of a multi-

loop, ∆

multi

for each closing base pair, ∆

multi

for each

unpaired nucleotide inside the multiloop, one sequence-

dependent penalty: ∆

terminalbp

a,b

for each closing pair

, and optional coaxial and dangle stacking contribu-

tions (see below).

•

exterior loop

contains a nick between strands and any

number of closing base pairs. The exterior loop free en-

ergy is the sum of ∆

terminalbp

a,b

for each closing base pair

, plus optional coaxial and dangle stacking contribu-

tions (see below). Hence, an unpaired strand has a free

energy of zero, corresponding to the reference state.

See Section S1.7 of Reference 9 for details on the functional

form of loop-based free energy models.

Coaxial and dangle stacking.

Within a multiloop or an

exterior loop, there is a subensemble of coaxial stacking

states between adjacent closing base pairs and dangle stack-

ing states between closing base pairs and adjacent unpaired

bases. Within a multiloop or exterior loop, a base pair can

form one

coaxial stack

with an adjacent base pair, or can form

0 coaxial stacks

0 dangle stacks

0 coaxial stacks, 1 dangle stack

0 coaxial stacks, 2 dangle stacks

1 coaxial stack

0 dangle stacks

1 coaxial stack, 1 dangle stack

1 coaxial stack, 2 dangle stacks

1 dangle stack

2 dangle stacks

0 coaxial stacks

1 coaxial stack

0 dangle stacks

multiloop coaxial stack

multiloop dangle stack

multiloop

exterior loop coaxial stack

exterior loop dangle stack

exterior loop

Multiloop stacking states

Exterior loop stacking states

Figure 3.

Coaxial and dangle stacking states for multiloops and exterior loops. (a) Stacking subensemble for the multiloop of Figure

2. (b,c) Stacking subensembes for two exterior loops from Figure 2. Adapted with permission from Fornace

et al.

ACS Synth Biol

, 9,

dangle stack

with at most two adjacent unpaired bases; un-

paired bases can either form no stack, or can form a dangle

stack with at most one adjacent base pair.

For a given multiloop or exterior loop, the energetic contri-

butions of all possible coaxial and dangle stacking states are

enumerated so as to calculate the free energy:

∆

stacking

−

log

∑

∈

loop

∏

∈

−

∆

/kT

(2)

where

indexes the possible stacking states within the loop

and

indexes the individual stacks (coaxial or dangle) within

a stacking state. The free energy of a multiloop or exterior

loop is augmented by the corresponding ∆

stacking

bonus.

Hence, a secondary structure

continues to be defined as a

set of base pairs, and the stacking states within a given multi-

loop or exterior loop are treated as a structural subensemble

that contributes in a Boltzmann-weighted fashion to the free

energy model for the loop. Let

q

∈

denote a stacking state

of the paired and unpaired bases in

. We may equivalently

define the free energy of secondary structure

in terms of the

stacking state free energies

∆

(

φ,s

q

)

(3)

for all stacking states

q

∈

∆

(

φ,s

) =

−

log

∑

q

∈

−

∆

(

φ,s

q

)

/kT

(4)

Let

q

(

) denote the ensemble of stacking states correspond-

ing to the complex ensemble of secondary structures

Γ(

NUPACK supports the following coaxial and dangle stack-

ing formulations:

•

All stacking

(default): complex ensemble with coaxial

and dangle stacking.

•

Coaxial stacking:

complex ensemble with coaxial stack-

ing.

•

Dangle stacking:

complex ensemble with dangle stacking.

•

No stacking:

complex ensemble without coaxial or dangle

stacking.

For backwards compatibility with NUPACK 3, NUPACK 4

also supports historical complex ensembles without coaxial

stacking and with approximate dangle stacking.

Symmetry correction.

For a secondary structure

∈

Γ(

)

with an

-fold rotational symmetry, there is an

-fold re-

duction in distinguishable conformational space, so the free

energy

∆

(

φ,s

) must be adjusted

by a symmetry correction:

∆

(

φ,s

) =

∆

(

φ,s

) + ∆

sym

(

φ,s

)

(5)

where

∆

sym

(

φ,s

) =

log

(

φ,s

)

(6)

Because the symmetry factor

(

φ,s

) is a global property of

each secondary structure

∈

Γ(

), it is not suitable for use

with dynamic programs that treat multiple subproblems si-

multaneously without access to global structural information.

As a result, dynamic programs operate on ensemble

Γ(

) us-

ing physical model

∆

(

φ,s

) and then the Distinguishabil-

ity Correction Theorem

enables exact conversion of physical

quantities to ensemble Γ(

) using physical model ∆

(

φ,s

Interestingly, ensembles

Γ(

) and Γ(

) both have utility when

examining the physical properties of a complex as they pro-

vide related but different perspectives, akin to complemen-

tary thought experiments.

Free energy parameters.

NUPACK supports the follow-

ing temperature-dependent parameter sets for RNA:

•

rna06

based on References 14, 19 and 21 with additional

parameters

13, 17

including coaxial stacking

14, 22

and dan-

gle stacking

11, 17, 22

in 1M Na

•

rna95

based on Reference 11 with additional parame-

ters

including coaxial stacking

14, 22

and dangle stack-

ing

11, 17, 22

in 1M Na

and for DNA:

•

dna04

based on References 12 and 18 with additional pa-

rameters

including coaxial stacking

and dangle stack-

ing

15, 17

in user-specified concentrations of Na

, K

, and Mg

12, 16, 18, 20

DNA/RNA hybrids are not allowed. For backwards compat-

ibility with NUPACK 3, NUPACK 4 also supports historical

DNA and RNA parameter sets.

Salts.

The default salt conditions for RNA and DNA pa-

rameter sets are [Na

] = 1M; these are the only salt condi-

tions for RNA. Salt corrections are available for DNA param-

eters to permit calculations in user-specified sodium, potas-

sium, ammonium, and magnesium ion concentrations.

•

Sodium:

based on refs 12 and 18; the sum of the concen-

trations of (monovalent) sodium, potassium, and ammo-

nium ions, [Na

]+[K

]+[NH

], is specified in units of

molar (default: 1.0, range: [0.05,1.1]).

•

Magnesium:

based on refs 16 and 20; the concentra-

tion of (divalent) magnesium ions, [Mg

], is specified

in units of molar (default: 0.0, range: [0.0,0.2]).

PHYSICAL QUANTITIES

Consider a multi-tube ensemble Ω where tube

∈

Ω contains

a set of strand species Ψ

interacting to form a set of com-

plex species Ψ

. Let

∈

denote a complex with sequence

and complex ensembles

Γ(

) (treating all strands as dis-

tinct) and Γ(

) (treating strands with the same sequence as

indistinguishable). NUPACK calculates a number of physical

quantities over these ensembles.

5, 9

Partition function.

For complex

, the partition function

evaluated over ensemble Γ(

) treating strands with the same

sequence as indistinguishable is denoted

(

) =

∑

∈

Γ(

)

−

∆

(

)

/kT

(7)

Complex free energy.

For complex

, the corresponding

complex free energy is

∆

(

)

≡−

log(

(

))

(8)

Structure free energy.

For complex

, the secondary struc-

ture free energy treating strands with the same sequence as

indistinguishable is denoted

∆

(

)

(9)

If the physical model includes coaxial and dangle stacking,

the structure free energy will include stacking contributions

∆

stacking

. If the secondary structure

has a rotational sym-

metry, the structure free energy will include the symmetry

correction ∆

sym

(

Equilibrium structure probability.

For complex

, the equi-

librium structure probability of any secondary structure

∈

Γ(

) treating strands with the same sequence as indistin-

guishable is denoted

(

) =

−

∆

(

)

/kT

(

)

(10)

Boltzmann-sampled structures.

For complex

, a set of

secondary structures Boltzmann-sampled from ensemble

Γ(

) treating strands with the same sequence as indistin-

guishable is denoted

sample

(

)

∈

Γ(

)

(11)

Boltzmann-sampled structures are available only via the NU-

PACK Python module.

Equilibrium base-pairing probabilities.

For complex

, the

base-pairing probability matrix

(

) has entries

a,b

(

)

∈

1] corresponding to the probability

a,b

(

) =

∑

∈

Γ(

)

(

)

a,b

(

)

(12)

that base pair

forms at equilibrium within ensemble

Γ(

treating all strands as distinct. Here,

(

) is the structure

matrix and

(

) is the equilibrium probability of structure

∈

Γ(

) treating all strands as distinct. Abusing notation,

the entry

i,i

(

)

∈

1] denotes the equilibrium probability

that base

is unpaired over ensemble

Γ(

). Hence,

(

)

is a symmetric matrix with row and column sums of 1. In

NUPACK graphics, the diagonal entries denoting unpaired

bases are depicted as an extra column to the right of the

matrix.

MFE proxy structure.

For complex

, the minimum free en-

ergy (MFE) stacking state

q

MFE

(

)

∈

q

(

) treating all

strands as distinct is

q

MFE

(

) = arg

min

q

∈

q

(

)

∆

(

q

)

(13)

The corresponding MFE proxy structure is

MFE

′

(

)

≡{

∈

Γ(

)

q

MFE

(

)

∈

}

(14)

defined as the secondary structure containing the MFE stack-

ing state within its subensemble. The free energy of the MFE

proxy structure is

∆

(

MFE

′

(

))

(15)

There may be more than one MFE stacking state, each cor-

responding to the same or different MFE proxy structures. If

there are multiple MFE proxy structures, the NUPACK web

app presents only one of them; to see multiple MFE proxy

structures, use the NUPACK Python module.

Suboptimal proxy structures.

For complex

, the set of

suboptimal proxy secondary structures with stacking states

within a specified ∆

gap

≥

0 of the MFE stacking state is

denoted

subopt

(

∆

gap

)

{

∈

Γ(

)

q

∈

∆

(

q

)

≤

∆

(

q

MFE

(

)) + ∆

gap

}

(16)

Suboptimal proxy structures are available only via the NU-

PACK Python module.

Complex ensemble defect.

For complex

with target struc-

ture

, the dimensional complex ensemble defect

(

) =

|−

∑

≤

≤|

≤

≤|

a,b

(

)

a,b

(

)

(17)

represents the equilibrium number of incorrectly paired nu-

cleotides over the ensemble

Γ(

) relative to target structure

3, 6

Here,

(

) is the equilibrium base-pairing probability

matrix and

(

) is the target structure matrix for

. The

normalized complex ensemble defect

is then

≡

(

)

| ∈

(18)

representing the equilibrium fraction of incorrectly paired nu-

cleotides evaluated over the ensemble of complex

relative to

target structure

Complex ensemble size.

For complex

, the number of

secondary structures in the complex ensemble, treating all

strands as distinct, is denoted

Γ(

)

(19)

The corresponding number of stacking states is denoted

q

(

)

(20)

Equilibrium complex concentrations.

For the set of com-

plexes Ψ

in test tube

, the set of equilibrium complex con-

centrations is denoted

≡

∀

∈

(21)

These concentrations are the unique solution to the strictly

convex optimization problem

min

∑

∈

(log

−

log

−

(22)

subject to

∑

∈

i,j

∀

∈

(23)