PHYSICAL REVIEW E 102, 022404 (2020)
Editors’ Suggestion
First-principles prediction of the information processing capacity of a simple genetic circuit
Manuel Razo-Mejia,1 Sarah Marzen,2 Griffin Chure,1 Rachel Taubman,2 Muir Morrison,3 and Rob Phillips1,3,*

1Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, USA
2Department of Physics, W. M. Keck Science Department, Claremont McKenna College, Claremont, California 91711, USA
3Department of Physics, California Institute of Technology, Pasadena, California 91125, USA
(Received 6 May 2020; revised 29 June 2020; accepted 2 July 2020; published 13 August 2020)
Given the stochastic nature of gene expression, genetically identical cells exposed to the same environmental
inputs will produce different outputs. This heterogeneity has been hypothesized to have consequences for how
cells are able to survive in changing environments. Recent work has explored the use of information theory as
a framework to understand the accuracy with which cells can ascertain the state of their surroundings. Yet the
predictive power of these approaches is limited and has not been rigorously tested using precision measurements.
To that end, we generate a minimal model for a simple genetic circuit in which all parameter values for the model
come from independently published data sets. We then predict the information processing capacity of the genetic
circuit for a suite of biophysical parameters such as protein copy number and protein-DNA affinity. We compare
these parameter-free predictions with an experimental determination of protein expression distributions and the
resulting information processing capacity of E. coli cells. We find that our minimal model captures the scaling
of the cell-to-cell variability in the data and the inferred information processing capacity of our simple genetic
circuit up to a systematic deviation.
DOI: 10.1103/PhysRevE.102.022404
As living organisms thrive in a given environment, they
are faced with constant changes in their surroundings. From
abiotic conditions such as temperature fluctuations or changes
in osmotic pressure, to biological interactions such as cell-
to-cell communication in a tissue or in a bacterial biofilm,
living organisms of all types sense and respond to external
signals. Figure 1(a) shows a schematic of this process for
a bacterial cell sensing a concentration of an extracellular
chemical. At the molecular level where signal transduction
unfolds mechanistically, there are physical constraints on the
accuracy and precision of these responses given by intrinsic
stochastic fluctuations [1]. This means that two genetically identical cells exposed to the same stimulus will not have identical responses [2].
One implication of this noise in biological systems is that
cells do not have an infinite resolution to distinguish sig-
nals, and, as a consequence, there is a one-to-many mapping
between inputs and outputs. Furthermore, given the limited
number of possible outputs, there are overlapping responses
between different inputs. This scenario can be mapped to
a Bayesian inference problem where cells try to infer the
state of the environment from their phenotypic response, as
schematized in Fig. 1(b). The question then becomes this: how
can one analyze this probabilistic, rather than deterministic,
relationship between inputs and outputs? The abstract answer
to this question was worked out in 1948 by Claude Shannon,
who, in his seminal work, founded the field of information
theory [3]. Shannon developed a general framework for how
*Author to whom all correspondence should be addressed: phillips@pboc.caltech.edu
to analyze information transmission through noisy commu-
nication channels. In his work, Shannon showed that the
only quantity that satisfies three reasonable axioms for a
measure of uncertainty was of the same functional form as the
thermodynamic entropy—thereby christening his metric the
information entropy [4]. He also gave a definition, based on
this information entropy, for the relationship between inputs
and outputs known as the mutual information. The mutual information I between input c and output p, given by

I = \sum_c P(c) \sum_p P(p|c) \log_2 \left[ \frac{P(p|c)}{P(p)} \right], \qquad (1)
quantifies how much we learn about the state of the input c given that we get to observe the output p. In other words, the mutual information can be thought of as a generalized correlation coefficient that quantifies the degree to which the uncertainty about a random event decreases given the knowledge of the average outcome of another random event [5].
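For discrete inputs and outputs, Eq. (1) can be evaluated directly. A minimal numpy sketch (the function name and array conventions are our own, not from the paper):

```python
import numpy as np

def mutual_information(p_c, p_p_given_c):
    """Mutual information I(c; p) in bits, as in Eq. (1).

    p_c         : shape (n_c,), input distribution P(c).
    p_p_given_c : shape (n_c, n_p), row c holds the input-output function P(p|c).
    """
    joint = p_c[:, None] * p_p_given_c      # joint distribution P(c, p)
    p_p = joint.sum(axis=0)                 # output marginal P(p)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log2(p_p_given_c / p_p)
    return float(np.nansum(terms))          # 0 log 0 terms contribute nothing

# A noiseless binary channel with uniform inputs carries exactly 1 bit,
# while a channel whose outputs ignore the input carries none.
i_perfect = mutual_information(np.array([0.5, 0.5]), np.eye(2))          # 1 bit
i_useless = mutual_information(np.array([0.5, 0.5]), np.full((2, 2), 0.5))  # 0 bits
```

The two limiting channels illustrate the "generalized correlation coefficient" reading: observing p removes all, or none, of the uncertainty about c.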
It is natural to conceive of scenarios in which living organ-
isms that can better resolve signals might have an evolutionary
benefit, making it more likely that their offspring will have a
fitness advantage [6]. In recent years there has been a growing
interest in understanding the theoretical limits on cellular
information processing [7,8], and in quantifying how close evolution has pushed cellular signaling pathways to these theoretical limits [9-11]. While these studies have treated the signaling pathway as a "black box," explicitly ignoring all the molecular interactions taking place in them, other studies have explored the role that molecular players and regulatory architectures have on these information processing tasks [12-18]. Despite the great advances in our understanding
2470-0045/2020/102(2)/022404(14) 022404-1 ©2020 American Physical Society
[Fig. 1 graphic: (a) input (inducer concentration) → signal processing (cellular response) → output (protein expression); (b) cellular response as Bayesian inference (posterior, likelihood, prior); (c) input-output functions with noisy vs precise responses, giving poorly vs well resolved inputs for environment state inference.]
FIG. 1. Cellular signaling systems sense the environment with different degrees of precision. (a) Schematic representation of a cell as
a noisy communication channel. From an environmental input (inducer molecule concentration) to a phenotypic output (protein expression
level), cellular signaling systems can be modeled as noisy communication channels. (b) We treat cellular response to an external stimulus as a
Bayesian inference of the state of the environment. As the phenotype (protein level) serves as the internal representation of the environmental
state (inducer concentration), the probability of a cell being in a specific environment given this internal representation P(c|p) is a function of the probability of the response given that environmental state P(p|c). (c) The precision of the inference of the environmental state depends
on how well cells can resolve different inputs. For three different levels of input (left panel) the green strain responds more precisely than the
purple strain since the output distributions overlap less (middle panel). This allows the green strain to make a more precise inference of the
environmental state given a phenotypic response (right panel).
of the information processing capabilities of molecular mech-
anisms, the field still lacks a rigorous experimental test of
these detailed models with precision measurements on a sim-
ple system in which physical parameters can be perturbed. In
this work, we approach this task with a system that is both
theoretically and experimentally tractable in which molecular
parameters can be varied in a controlled manner.
Over the past decade, the dialogue between theory and
experiments in gene regulation has led to predictive power
of models not only over the mean level of gene expression,
but the noise as a function of relevant parameters such as
regulatory protein copy numbers, affinity of these proteins to
the DNA promoter, as well as the extracellular concentrations
of inducer molecules [19-22]. These models based on equilibrium and nonequilibrium statistical physics have reached a predictive accuracy level such that, for simple cases, it is now possible to design input-output functions [23,24]. This
provides the opportunity to exploit these predictive models to
tackle the question of how much information genetic circuits
can process. This question lies at the heart of understanding
the precision of the cellular response to environmental signals.
Figure 1(c) schematizes a scenario in which two bacterial
strains respond with different levels of precision to three pos-
sible environmental states, i.e., inducer concentrations. The
overlap between the three different responses is what precisely
determines the resolution with which cells can distinguish
different inputs. This is analogous to how the point spread
function limits the ability to resolve two light-emitting point
sources.
In this work, we follow the same philosophy of theory-
experiment dialogue used to determine model parameters to
predict from first principles the effect that biophysical param-
eters such as transcription factor copy number and protein-
DNA affinity have on the information processing capacity of
a simple genetic circuit. Specifically, to predict the mutual
information between an extracellular chemical signal (input c, isopropyl β-D-1-thiogalactopyranoside, or IPTG in our experimental system) and the corresponding cellular response in the form of protein expression (output p), we must compute the input-output function P(p|c). To do so, we use a master-equation-based model to construct the protein copy number
distribution as a function of an extracellular inducer con-
centration for different combinations of transcription factor
copy numbers and binding sites. Having these input-output
distributions allows us to compute the mutual information I between inputs and outputs for any arbitrary input distribution P(c). We opt to compute the channel capacity, i.e., the maximum information that can be processed by this gene regulatory architecture, defined as Eq. (1) maximized over all possible input distributions P(c). By doing so, we examine the
physical limits of what cells can do in terms of information
processing by harboring these genetic circuits. Nevertheless,
given the generality of the input-output function P(p|c) that we derive, the model presented here can be used to compute the mutual information for any arbitrary input distribution P(c). All parameters used for our model were inferred from a series of studies that span several experimental techniques [20,25-27], allowing us to make parameter-free predictions of this information processing capacity [28].
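The maximization of Eq. (1) over P(c) has no closed form in general, but for a discretized channel it can be performed with the standard Blahut-Arimoto algorithm. A minimal sketch (this is an illustration of the technique, not the paper's actual implementation):

```python
import numpy as np

def channel_capacity(Q, tol=1e-12, max_iter=100_000):
    """Blahut-Arimoto estimate of channel capacity in bits.

    Q : shape (n_c, n_p); row c is the input-output function P(p|c).
    Returns (capacity, capacity-achieving input distribution P*(c)).
    """
    n_c = Q.shape[0]
    p_c = np.full(n_c, 1.0 / n_c)                    # uniform starting guess
    for _ in range(max_iter):
        p_p = p_c @ Q                                # output marginal P(p)
        with np.errstate(divide="ignore", invalid="ignore"):
            kl = np.where(Q > 0, Q * np.log(Q / p_p), 0.0).sum(axis=1)
        p_new = p_c * np.exp(kl)                     # reweight inputs by their KL term
        p_new /= p_new.sum()
        if np.max(np.abs(p_new - p_c)) < tol:
            p_c = p_new
            break
        p_c = p_new
    p_p = p_c @ Q
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(Q > 0, Q * np.log2(Q / p_p), 0.0)
    return float(p_c @ terms.sum(axis=1)), p_c

# Binary symmetric channel: capacity is 1 - H2(eps).
cap_noiseless, _ = channel_capacity(np.eye(2))                         # 1 bit
cap_bsc, _ = channel_capacity(np.array([[0.9, 0.1], [0.1, 0.9]]))      # ≈ 0.531 bits
```

Each iteration tilts P(c) toward the inputs whose output distributions are most distinguishable from the current output marginal, which is exactly the resolvability picture of Fig. 1(c).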
These predictions are then contrasted with experimental
data, where the channel capacity is inferred from single-cell
fluorescence distributions taken at different concentrations of
inducer for cells with previously characterized biophysical parameters [20,27]. We find that our parameter-free predictions
quantitatively track the experimental data up to a systematic
deviation. The lack of numerical agreement between our
model and the experimental data poses new challenges toward
having a foundational, first-principles understanding of the
physics of cellular decision-making.
The remainder of the paper is organized as follows. In
Sec. I A we define the minimal theoretical model and parameter inference for a simple repression genetic circuit. Section I B discusses how all parameters for the minimal model are determined from published datasets that explore different aspects of the simple repression motif. Section I C computes the moments of the mRNA and protein distributions from this minimal model. In Sec. I D we explore the consequences of variability in gene copy number during the cell cycle. In addition, we compare experimental and theoretical quantities related to the moments of the distribution, specifically the predictions for the fold-change in gene expression (mean expression relative to an unregulated promoter) and the gene expression noise (standard deviation over mean). Section I E follows with reconstruction of the full mRNA and protein distribution from the moments using the maximum entropy principle. Finally, Sec. I F uses the distributions from Sec. I E to compute the maximum amount of information that the genetic circuit can process. Here again we contrast our zero-parameter fit predictions with experimental inferences of the channel capacity.
I. RESULTS
A. Minimal model of transcriptional regulation
As a tractable circuit for which we have control over the
parameters both theoretically and experimentally, we chose
the so-called simple repression motif, a common regulatory
scheme among prokaryotes [29]. This circuit consists of a single promoter with an RNA-polymerase (RNAP) binding site and a single binding site for a transcriptional repressor [20]. The regulation due to the repressor occurs via exclusion
of the RNAP from its binding site when the repressor is
bound, decreasing the likelihood of having a transcription
event. As with many important macromolecules, we consider
the repressor to be allosteric, meaning that it can exist in
two conformations, one in which the repressor is able to
bind to the specific binding site (active state) and one in
which it cannot bind the specific binding site (inactive state).
The environmental signaling occurs via passive import of
an extracellular inducer that binds the repressor, shifting the
equilibrium between the two conformations of the repressor
[27]. In previous work, we have extensively characterized the mean response of this circuit under different conditions using equilibrium-based models [28]. Here we build
upon these models to characterize the full distribution of
gene expression with parameters such as the repressor copy
number and its affinity for the DNA being systematically
varied.
As the copy number of molecular species is a discrete
quantity, chemical master equations have emerged as a useful
tool to model their inherent probability distribution [30]. In Fig. 2(a) we show the minimal model and the necessary
set of parameters needed to compute the full distribution of
mRNA and its protein gene product. Specifically, we assume
a three-state model where the promoter can be found in (i) a transcriptionally active state (A state), (ii) a transcriptionally inactive state without the repressor bound (I state), and (iii) a transcriptionally inactive state with the repressor bound (R state). We do not assume that the transition between the active state A and the inactive state I occurs due to RNAP binding to the promoter, as the transcription initiation kinetics involve several more steps than simple binding [31]. We coarse-grain all these steps into effective "on" and "off" states for the promoter, consistent with experiments demonstrating the bursty nature of gene expression in E. coli [19]. These three states generate a system of coupled differential equations for each of the three state distributions P_A(m, p; t), P_I(m, p; t), and P_R(m, p; t), where m and p are the mRNA and protein count per cell, respectively, and t is time. Given the rates depicted in Fig. 2(a), we define the system of ODEs for a specific m and p. For the transcriptionally active state, we have
\begin{aligned}
\frac{dP_A(m,p)}{dt} = {} & -\overbrace{k_{\mathrm{off}}^{(p)} P_A(m,p)}^{A \rightarrow I} + \overbrace{k_{\mathrm{on}}^{(p)} P_I(m,p)}^{I \rightarrow A} \\
& + \overbrace{r_m P_A(m-1,p)}^{m-1 \rightarrow m} - \overbrace{r_m P_A(m,p)}^{m \rightarrow m+1} + \overbrace{\gamma_m (m+1) P_A(m+1,p)}^{m+1 \rightarrow m} - \overbrace{\gamma_m m P_A(m,p)}^{m \rightarrow m-1} \\
& + \overbrace{r_p m P_A(m,p-1)}^{p-1 \rightarrow p} - \overbrace{r_p m P_A(m,p)}^{p \rightarrow p+1} + \overbrace{\gamma_p (p+1) P_A(m,p+1)}^{p+1 \rightarrow p} - \overbrace{\gamma_p p P_A(m,p)}^{p \rightarrow p-1},
\end{aligned} \qquad (2)
[Fig. 2 graphic: (a) kinetic model and minimal parameter set, showing the A, I, and R promoter states with the promoter on and off rates, transcription rate, and repressor binding and unbinding rates; (b) parameter determination from measurements of single-site repression (via colorimetric assay), transcription factor competition (via video microscopy), allosteric regulation (via flow cytometry), and mRNA expression noise (via mRNA FISH).]
FIG. 2. Minimal kinetic model of transcriptional regulation for a simple repression architecture. (a) Three-state promoter stochastic model of transcriptional regulation by a repressor. The regulation by the repressor occurs via exclusion of the transcription initiation machinery, not allowing the promoter to transition to the transcriptionally active state. All parameters highlighted with colored boxes were determined from published datasets based on the same genetic circuit. Parameters in dashed boxes were taken directly from values reported in the literature or adjusted to satisfy known biological restrictions. (b) Datasets used to infer the parameter values. From left to right, Garcia and Phillips [20] is used to determine k_off^(r) and k_on^(r); Brewster et al. [26] is used to determine Δε_AI and k_on^(r); Razo-Mejia et al. [27] is used to determine K_A, K_I, and k_on^(r); and Jones et al. [25] is used to determine r_m, k_on^(p), and k_off^(p).
where the state transitions for each term are labeled by overbraces. For the transcriptionally inactive state I, we have

\begin{aligned}
\frac{dP_I(m,p)}{dt} = {} & \overbrace{k_{\mathrm{off}}^{(p)} P_A(m,p)}^{A \rightarrow I} - \overbrace{k_{\mathrm{on}}^{(p)} P_I(m,p)}^{I \rightarrow A} + \overbrace{k_{\mathrm{off}}^{(r)} P_R(m,p)}^{R \rightarrow I} - \overbrace{k_{\mathrm{on}}^{(r)} P_I(m,p)}^{I \rightarrow R} \\
& + \overbrace{\gamma_m (m+1) P_I(m+1,p)}^{m+1 \rightarrow m} - \overbrace{\gamma_m m P_I(m,p)}^{m \rightarrow m-1} \\
& + \overbrace{r_p m P_I(m,p-1)}^{p-1 \rightarrow p} - \overbrace{r_p m P_I(m,p)}^{p \rightarrow p+1} + \overbrace{\gamma_p (p+1) P_I(m,p+1)}^{p+1 \rightarrow p} - \overbrace{\gamma_p p P_I(m,p)}^{p \rightarrow p-1}.
\end{aligned} \qquad (3)
And finally, for the repressor bound state R,

\begin{aligned}
\frac{dP_R(m,p)}{dt} = {} & -\overbrace{k_{\mathrm{off}}^{(r)} P_R(m,p)}^{R \rightarrow I} + \overbrace{k_{\mathrm{on}}^{(r)} P_I(m,p)}^{I \rightarrow R} \\
& + \overbrace{\gamma_m (m+1) P_R(m+1,p)}^{m+1 \rightarrow m} - \overbrace{\gamma_m m P_R(m,p)}^{m \rightarrow m-1} \\
& + \overbrace{r_p m P_R(m,p-1)}^{p-1 \rightarrow p} - \overbrace{r_p m P_R(m,p)}^{p \rightarrow p+1} + \overbrace{\gamma_p (p+1) P_R(m,p+1)}^{p+1 \rightarrow p} - \overbrace{\gamma_p p P_R(m,p)}^{p \rightarrow p-1}.
\end{aligned} \qquad (4)
As we will discuss later in Sec. I D, the protein degradation term γ_p is set to zero, since active protein degradation is slow compared to the cell cycle of exponentially growing bacteria; instead, we explicitly implement binomial partitioning of the proteins into daughter cells upon division [32].
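In a stochastic simulation, this partitioning step amounts to drawing each daughter's protein count from a binomial distribution at division. A sketch (function name and seed are our own):

```python
import numpy as np

rng = np.random.default_rng(42)

def partition_at_division(p):
    """Assign each of p proteins independently to one daughter with probability 1/2.

    The tracked daughter therefore inherits a Binomial(p, 1/2) number of
    proteins: half of them on average, with variance p/4.
    """
    return rng.binomial(p, 0.5)

# Averaged over many divisions, a mother with 1000 proteins passes on about 500.
daughters = [partition_at_division(1000) for _ in range(2000)]
```

This division noise, rather than a first-order degradation term, is what dilutes the protein pool in the model.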
It is convenient to rewrite these equations in a compact matrix notation [30]. For this, we define the vector P(m, p) as

\mathbf{P}(m, p) = (P_A(m, p), P_I(m, p), P_R(m, p))^T, \qquad (5)
where a superscript T denotes the transpose. By defining the matrices K to contain the promoter state transitions, R_m and Γ_m to contain the mRNA production and degradation terms, respectively, and R_p and Γ_p to contain the protein production and degradation terms, respectively, the system of ODEs can then be written as (see [33], Sec. S1, for a full definition of these matrices)
\frac{d\mathbf{P}(m,p)}{dt} = (\mathbf{K} - \mathbf{R}_m - m\mathbf{\Gamma}_m - m\mathbf{R}_p - p\mathbf{\Gamma}_p)\mathbf{P}(m,p) + \mathbf{R}_m \mathbf{P}(m-1,p) + (m+1)\mathbf{\Gamma}_m \mathbf{P}(m+1,p) + m\mathbf{R}_p \mathbf{P}(m,p-1) + (p+1)\mathbf{\Gamma}_p \mathbf{P}(m,p+1). \qquad (6)
Having defined the gene expression dynamics, we now proceed to determine all rate parameters in Eq. (6).
B. Inferring parameters from published datasets
A decade of research in our group has characterized the simple repression motif with an ever expanding array of predictions and corresponding experiments to uncover the physics of this genetic circuit [28]. In doing so, we have come to understand the mean response of a single promoter in the presence of varying levels of repressor copy numbers and repressor-DNA affinities [20], the effect that competing binding sites and multiple promoter copies impose [26], and, in recent work assisted by the Monod-Wyman-Changeux (MWC) model, we expanded the scope to the allosteric nature of the repressor [27]. All of these studies have exploited the simplicity and predictive power of equilibrium approximations to these nonequilibrium systems [34]. We have also used a similar kinetic model to that depicted in Fig. 2(a) to study the noise in mRNA copy number [25]. Although these studies focus on the same experimental system, each is described by a different theoretical framework, and no previous attempt had been made to unite parametric knowledge across the equilibrium and nonequilibrium models. As a test case of the depth of our theoretical understanding of this simple transcriptional regulation system, we combine all of the studies mentioned above to inform the parameter values of the model presented in Fig. 2(a). Figure 2(b) schematizes the datasets and experimental techniques used to measure gene expression along with the parameters that can be inferred from them.
Section S2 in [33] expands on the details of how the inference was performed for each of the parameters. Briefly, the promoter activation and inactivation rates k_on^(p) and k_off^(p), as well as the transcription rate r_m, were obtained in units of the mRNA degradation rate γ_m by fitting a two-state promoter model [no state R from Fig. 2(a)] [35] to mRNA FISH data of an unregulated promoter (no repressor present in the cell) [25]. The repressor on rate is assumed to be of the form k_on^(r) = k_o[R], where k_o is the diffusion-limited on rate and [R] is the concentration of active repressor in the cell [25]. This concentration of active repressor is in turn determined by the repressor copy number in the cell and the fraction of these repressors that are in the active state, i.e., able to bind DNA. Existing estimates of the transition rates between conformations of allosteric molecules set them at the microsecond scale [36]. Taking this as representative for our repressor of interest, the separation of timescales between the rapid conformational changes of the repressor and slower downstream processes such as open-complex formation allows us to model the probability of the repressor being in the active state as an equilibrium MWC process. The parameters of the MWC model, K_A, K_I, and Δε_AI, were previously characterized from video-microscopy and flow-cytometry data [27]. For the repressor off rate, k_off^(r), we take advantage of the fact that the mean mRNA copy number derived from the model in Fig. 2(a), cast in the language of rates, is of the same functional form as the equilibrium model cast in the language of binding energies [37]. Therefore, the value of the repressor-DNA binding energy Δε_r constrains the value of the repressor off rate k_off^(r). These constraints on the rates allow us to make self-consistent predictions under both the equilibrium and the kinetic framework. Having all parameters in hand, we can now proceed to solve the gene expression dynamics.
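As an illustration of this equilibrium MWC step, the active-state probability and the resulting repressor on rate can be sketched as follows. The two-binding-site functional form and the numerical values of K_A, K_I, and Δε_AI are taken from Ref. [27] and should be treated as assumptions of this sketch; k_o is left as an explicit argument rather than given a value:

```python
import numpy as np

def p_active(c, K_A=139e-6, K_I=0.53e-6, delta_eps_AI=4.5):
    """MWC probability that the repressor is in its DNA-binding (active) state.

    c            : inducer (IPTG) concentration in M.
    K_A, K_I     : inducer dissociation constants of the active and inactive
                   conformations (values as reported in Ref. [27]).
    delta_eps_AI : active-inactive energy difference, in units of kT.
    Two inducer binding sites per repressor dimer are assumed, as in Ref. [27].
    """
    active = (1 + c / K_A) ** 2
    inactive = np.exp(-delta_eps_AI) * (1 + c / K_I) ** 2
    return active / (active + inactive)

def k_on_repressor(c, repressors, k_o):
    """Repressor on rate k_on^(r) = k_o [R], with [R] the active-repressor level."""
    return k_o * repressors * p_active(c)
```

Without inducer, nearly all repressors are active; saturating IPTG shifts the conformational equilibrium and inactivates them, lowering k_on^(r).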
C. Computing the moments of the mRNA and protein
distributions
Finding analytical solutions to chemical master equations is often fraught with difficulty. An alternative approach is to approximate the distribution. One such approximation scheme, the maximum entropy principle, makes use of the moments of the distribution to approximate the full distribution. In this section, we will demonstrate an iterative algorithm to compute the mRNA and protein distribution moments.
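To make the maximum entropy idea concrete before deriving the moment equations: given a prescribed moment, the least-biased distribution consistent with it has an exponential-family form, and the Lagrange multiplier can be found by solving a convex dual problem. A toy numpy sketch for a single constrained moment, the mean (the support size and Newton solver are our own choices):

```python
import numpy as np

def maxent_given_mean(target_mean, support):
    """Maximum-entropy distribution on `support` with a prescribed mean.

    The result has the form P(m) proportional to exp(lambda * m); lambda is
    found by Newton's method on the gradient of the convex dual objective.
    """
    lam = 0.0
    for _ in range(100):
        logits = lam * support
        w = np.exp(logits - logits.max())      # guard against overflow
        P = w / w.sum()
        mean = P @ support
        var = P @ support**2 - mean**2
        step = (mean - target_mean) / var      # Newton step: gradient / curvature
        lam -= step
        if abs(step) < 1e-12:
            break
    return P

# The least-biased distribution on {0, ..., 50} with mean 5 decays geometrically.
P = maxent_given_mean(5.0, np.arange(51))
```

Matching more moments (e.g., the variance as well) adds further multipliers but keeps the same convex structure; the paper's reconstruction in Sec. I E uses several mRNA and protein moments.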
The kinetic model for the simple repression motif depicted in Fig. 2(a) consists of an infinite system of ODEs for each possible pair of mRNA and protein copy number, (m, p). To compute any moment of the distribution, we define a vector

\langle m^x p^y \rangle \equiv (\langle m^x p^y \rangle_A, \langle m^x p^y \rangle_I, \langle m^x p^y \rangle_R)^T, \qquad (7)

where \langle m^x p^y \rangle_S is the expected value of m^x p^y in state S \in \{A, I, R\} for x, y \in \mathbb{N}. In other words, just as we defined the vector P(m, p), here we define a vector to collect the expected value of each of the promoter states. By definition, any of these moments \langle m^x p^y \rangle_S can be computed as

\langle m^x p^y \rangle_S \equiv \sum_{m=0}^{\infty} \sum_{p=0}^{\infty} m^x p^y P_S(m, p). \qquad (8)

Summing over all possible values for m and p in Eq. (6) results in an ODE for any moment of the distribution of the form (see [33], Sec. S3, for a full derivation)
\frac{d\langle m^x p^y \rangle}{dt} = \mathbf{K}\langle m^x p^y \rangle + \mathbf{R}_m \langle p^y[(m+1)^x - m^x] \rangle + \mathbf{\Gamma}_m \langle m p^y[(m-1)^x - m^x] \rangle + \mathbf{R}_p \langle m^{x+1}[(p+1)^y - p^y] \rangle + \mathbf{\Gamma}_p \langle m^x p[(p-1)^y - p^y] \rangle. \qquad (9)
Given that all transitions in our stochastic model are first-order reactions, Eq. (9) has no moment-closure problem [14]. This means that the dynamical equation for a given moment only depends on lower moments (see [33], Sec. S3, for a full proof). This feature of our model implies, for example, that the second moment of the protein distribution ⟨p²⟩ depends only on the first two moments of the mRNA distribution ⟨m⟩ and ⟨m²⟩, the first protein moment ⟨p⟩, and the cross-correlation term ⟨mp⟩. We can therefore define μ^(x,y) to be a vector containing all moments up to ⟨m^x p^y⟩ for all promoter states,

\boldsymbol{\mu}^{(x,y)} = [\langle m^0 p^0 \rangle, \langle m^1 p^0 \rangle, \ldots, \langle m^x p^y \rangle]^T. \qquad (10)
Explicitly for the three-state promoter model depicted in Fig. 2(a), this vector takes the form

\boldsymbol{\mu}^{(x,y)} = [\langle m^0 p^0 \rangle_A, \langle m^0 p^0 \rangle_I, \langle m^0 p^0 \rangle_R, \ldots, \langle m^x p^y \rangle_A, \langle m^x p^y \rangle_I, \langle m^x p^y \rangle_R]^T. \qquad (11)
Given this definition, we can compute the general moment dynamics as

\frac{d\boldsymbol{\mu}^{(x,y)}}{dt} = \mathbf{A} \boldsymbol{\mu}^{(x,y)}, \qquad (12)

where A is a square matrix that contains all the numerical coefficients that relate each of the moments. We can then use Eq. (9) to build matrix A by iteratively substituting values for the exponents x and y up to a specified value. In the next section, we will use Eq. (12) to numerically integrate the dynamical equations for our moments of interest as cells progress through the cell cycle. We will then use the value of the moments of the distribution to approximate the full gene expression distribution. This method is computationally more efficient than trying to numerically integrate the infinite set of equations describing the full probability distribution P(m, p), or using a stochastic algorithm to sample from the distribution.
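As a concrete instance of Eq. (12), the zeroth and first mRNA moments of the three-state promoter obey a closed linear system that can be integrated numerically. The block structure below follows from Eq. (9) at (x, y) = (0, 0) and (1, 0); the rate values are placeholders for illustration, not the inferred parameters:

```python
import numpy as np

# Placeholder rates (units of 1/min): promoter switching, transcription, mRNA decay.
kp_on, kp_off = 1.0, 0.5        # I -> A and A -> I transitions
kr_on, kr_off = 2.0, 0.1        # I -> R and R -> I transitions
r_m, gamma_m = 15.0, 1.0 / 3.0  # transcription and mRNA degradation

# Promoter transition matrix K acting on the state probabilities (pi_A, pi_I, pi_R).
K = np.array([
    [-kp_off,            kp_on,     0.0],
    [ kp_off, -(kp_on + kr_on),  kr_off],
    [    0.0,            kr_on, -kr_off],
])
Rm = np.diag([r_m, 0.0, 0.0])   # transcription occurs only from the A state
Gm = gamma_m * np.eye(3)

# Moment vector mu = (pi_A, pi_I, pi_R, <m>_A, <m>_I, <m>_R); Eq. (12): dmu/dt = A mu.
A = np.block([
    [K,  np.zeros((3, 3))],
    [Rm, K - Gm],
])

mu = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])   # start in state A with no mRNA
dt = 1e-3
for _ in range(int(100 / dt)):                   # forward-Euler integration
    mu = mu + dt * (A @ mu)

mean_m = mu[3:].sum()   # steady-state <m>, summed over promoter states
```

At steady state the result matches the analytical value ⟨m⟩ = r_m π_A / γ_m, where π_A is the steady-state probability of the active state, which is a useful sanity check on the matrix construction.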
D. Accounting for cell-cycle-dependent variability
in gene dosage
As cells progress through the cell cycle, the genome has
to be replicated to guarantee that each daughter cell receives
a copy of the genetic material. As replication of the genome
can take longer than the total cell cycle, this implies that cells
spend part of the cell cycle with multiple copies of each gene
depending on the cellular growth rate and the relative position
of the gene with respect to the replication origin [38]. Genes closer to the replication origin spend a larger fraction of the cell cycle with multiple copies compared to genes closer to the replication termination site [38]. Figure 3(a) depicts a schematic of this process where the replication origin (oriC) and the relevant locus for our experimental measurements (galK) are highlighted.
Since this change in gene copy number has been shown
to have an effect on cell-to-cell variability in gene expression
[25,39], we now extend our minimal model to account for these changes in gene copy number during the cell cycle. We reason that the only difference between the single-copy state and the two-copy state of the promoter is a doubling of the mRNA production rate r_m. In particular, the promoter activation and inactivation rates k_on^(p) and k_off^(p) and the mRNA production rate r_m inferred in Sec. I A assume that cells spend a fraction f of the cell cycle with one copy of the promoter (mRNA production rate r_m) and a fraction (1 − f) of the cell cycle with two copies of the promoter (mRNA production rate 2r_m). This inference was performed considering that at each
cell state the mRNA level immediately reaches the steady-
state value for the corresponding mRNA production rate. This
assumption is justified since the timescale to reach this steady
state depends only on the degradation rate γ_m, which for the mRNA is much shorter (∼3 min) than the length of the cell cycle (∼60 min for our experimental conditions) [40]. Section S2 in [33] shows that a model accounting for this
gene copy number variability is able to capture data from
single-molecule mRNA counts of an unregulated (constitu-
tively expressed) promoter.
Given that the protein degradation rate γ_p in our model is set by the cell division time, we do not expect that the protein count will reach the corresponding steady-state value for each stage in the cell cycle. In other words, cells do not spend long enough with two copies of the promoter for the protein level to reach the steady-state value corresponding to a transcription rate of 2r_m. We therefore use the dynamical equations developed in Sec. I C to numerically integrate the
time trajectory of the moments of the distribution with the
corresponding parameters for each phase of the cell cycle.
Figure 3(b) shows an example corresponding to the mean mRNA level (upper panel) and the mean protein level (lower panel) for the case of the unregulated promoter. Given that we inferred the promoter rate parameters considering that mRNA reaches steady state in each stage, we see that the numerical integration of the equations is consistent with the assumption of having the mRNA reach a stable value in each stage [see Fig. 3(b), upper panel]. On the other hand, the mean protein level does not reach a steady state at either of the cellular stages. Nevertheless, it is notable that after several cell cycles, the trajectory from cycle to cycle follows a repetitive pattern [see Fig. 3(b), lower panel]. Previously we have experimentally observed this repetitive pattern by tracking the expression level over time with video microscopy, as observed in Fig. 18 of [28].
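This repetitive pattern can be reproduced with a toy version of the calculation: integrate the mean mRNA and protein dynamics with rate r_m for a fraction f of the cycle and 2r_m for the rest, then halve both means at division. All numerical values here are placeholders, and the real calculation tracks the full moment vector of Eq. (12) rather than just the means:

```python
import numpy as np

# Placeholder parameters: rates per minute, a 60-min cycle, gene dosage doubling
# after a fraction f of the cycle.
r_m, gamma_m, r_p = 5.0, 1.0 / 3.0, 0.5
T, f, dt = 60.0, 0.6, 1e-2

m, p = 0.0, 0.0
history = []                                    # end-of-cycle mean protein levels
for cycle in range(10):
    for step in range(int(T / dt)):
        t = step * dt
        rate = r_m if t < f * T else 2 * r_m    # gene copy number doubles mid-cycle
        m += dt * (rate - gamma_m * m)          # mean mRNA relaxes quickly (~3 min)
        p += dt * (r_p * m)                     # no active protein degradation
    history.append(p)
    m, p = m / 2, p / 2                         # division: halve the means
```

After a few cycles the end-of-cycle protein level settles into a limit cycle: the halving at division exactly balances the production accumulated over one cycle, mirroring the lower panel of Fig. 3(b).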
To test the effects of including this gene copy number
variability in our model, we now compare the predictions of
the model with experimental data. As detailed in Sec. III, we obtained single-cell fluorescence values of different E. coli strains carrying a YFP gene under the control of the LacI repressor. Each strain was exposed to 12 different input inducer (IPTG) concentrations for ∼8 generations for cells to adapt to the media. The strains imaged spanned three orders of magnitude in repressor copy number and three distinct repressor-DNA affinities. Since growth was asynchronous, we reason that cells were randomly sampled at all stages of the cell cycle. Therefore, when computing statistics from the data such as the mean fluorescence value, in reality we are averaging over the cell cycle. In other words, as depicted in Fig. 3(b), quantities such as the mean protein copy number change over time, i.e., ⟨p⟩ ≡ ⟨p(t)⟩. This means that computing the mean of a population of unsynchronized cells is equivalent to averaging this time-dependent mean protein