of 64
Supplemental Information for: First-principles
prediction of the information processing capacity of a
simple genetic circuit
Manuel Razo-Mejia
1
,SarahMarzen
2
,Gri
nChure
1
,RachelTaubman
2
,MuirMorrison
3
,and
Rob Phillips
1, 3, *
1
Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
2
Department of Physics, W. M. Keck Science Department
3
Department of Physics, California Institute of Technology, Pasadena, CA 91125, USA
*
Correspondence: phillips@pboc.caltech.edu
Contents
S1 Three-state promoter model for simple repression
24
S2 Parameter inference
26
S2.1 Unregulated promoter rates
.................................
26
S2.2 Accounting for variability in the number of promoters
..................
29
S2.3 Repressor rates from three-state regulated promoter.
...................
34
S3 Computing moments from the master equation
37
S3.1 Computing moments of a distribution
...........................
37
S3.2 Moment closure of the simple-repression distribution
...................
41
S3.3 Computing single promoter steady-state moments
.....................
43
S4 Accounting for the variability in gene copy number during the cell cycle
43
S4.1 Numerical integration of moment equations
........................
44
S4.2 Exponentially distributed ages
...............................
48
S4.3 Reproducing the equilibrium picture
............................
50
S4.4 Comparison between single- and multi-promoter kinetic model
.............
51
S4.5 Comparison with experimental data
............................
52
S5 Maximum entropy approximation of distributions
55
S5.1 The MaxEnt principle
....................................
56
S5.2 The Bretthorst rescaling algorithm
.............................
59
S5.3 Predicting distributions for simple repression constructs
.................
61
S5.4 Comparison with experimental data
............................
62
23
S6 Gillespie simulation of master equation
63
S6.1 mRNA distribution with Gillespie simulations
.......................
64
S6.2 Protein distribution with Gillespie simulations
.......................
64
S7 Computational determination of the channel capacity
65
S7.1 Blahut’s algorithm
......................................
65
S7.2 Channel capacity from arbitrary units of fluorescence
...................
66
S7.3 Assumptions involved in the computation of the channel capacity
............
68
S8 Empirical fits to noise predictions
69
S8.1 Multiplicative factor for the noise
..............................
69
S8.2 Additive factor for the noise
.................................
70
S8.3 Correction factor for channel capacity with multiplicative factor
.............
70
S9 Derivation of the cell age distribution
72
S1
Three-state promoter model for simple repression
In order to tackle the question of how much information the simple repression motif can process we
require the joint probability distribution of mRNA and protein
P
(
m, p
;
t
). To obtain this distribution
we use the chemical master equation formalism as described in Section
1.1
. Specifically, we assume a
three-state model where the promoter can be found 1) in a transcriptionally active state (
A
state), 2)
in a transcriptionally inactive state without the repressor bound (
I
state) and 3) with the repressor
bound (
R
state). (See Fig.
2
(A)). These three states generate a system of coupled di
erential equations
for each of the three state distributions
P
A
(
m, p
),
P
I
(
m, p
) and
P
R
(
m, p
). Given the rates shown in
Fig.
2
(A) let us define the system of ODEs. For the transcriptionally active state we have
dP
A
(
m, p
)
dt
=
A
!
I
z
}|
{
k
(
p
)
o
P
A
(
m, p
)+
I
!
A
z
}|
{
k
(
p
)
on
P
I
(
m, p
)
+
m
1
!
m
z
}|
{
r
m
P
A
(
m
1
,p
)
m
!
m
+1
z
}|
{
r
m
P
A
(
m, p
)+
m
+1
!
m
z
}|
{
m
(
m
+ 1)
P
A
(
m
+1
,p
)
m
!
m
1
z
}|
{
m
mP
A
(
m, p
)
+
p
1
!
p
z
}|
{
r
p
mP
A
(
m, p
1)
p
!
p
+1
z
}|
{
r
p
mP
A
(
m, p
)+
p
+1
!
p
z
}|
{
p
(
p
+ 1)
P
A
(
m, p
+ 1)
p
!
p
1
z
}|
{
p
pP
A
(
m, p
)
.
(S1)
24
For the inactive promoter state
I
we have
dP
I
(
m, p
)
dt
=
A
!
I
z
}|
{
k
(
p
)
o
P
A
(
m, p
)
I
!
A
z
}|
{
k
(
p
)
on
P
I
(
m, p
)+
R
!
I
z
}|
{
k
(
r
)
o
P
R
(
m, p
)
I
!
R
z
}|
{
k
(
r
)
on
P
I
(
m, p
)
+
m
+1
!
m
z
}|
{
m
(
m
+ 1)
P
I
(
m
+1
,p
)
m
!
m
1
z
}|
{
m
mP
I
(
m, p
)
+
p
1
!
p
z
}|
{
r
p
mP
I
(
m, p
1)
p
!
p
+1
z
}|
{
r
p
mP
I
(
m, p
)+
p
+1
!
p
z
}|
{
p
(
p
+ 1)
P
I
(
m, p
+ 1)
p
!
p
1
z
}|
{
p
pP
I
(
m, p
)
.
(S2)
And finally for the repressor bound state
R
we have
dP
R
(
m, p
)
dt
=
R
!
I
z
}|
{
k
(
r
)
o
P
R
(
m, p
)+
I
!
R
z
}|
{
k
(
r
)
on
P
I
(
m, p
)
+
m
+1
!
m
z
}|
{
m
(
m
+ 1)
P
R
(
m
+1
,p
)
m
!
m
1
z
}|
{
m
mP
R
(
m, p
)
+
p
1
!
p
z
}|
{
r
p
mP
R
(
m, p
1)
p
!
p
+1
z
}|
{
r
p
mP
R
(
m, p
)+
p
+1
!
p
z
}|
{
p
(
p
+ 1)
P
R
(
m, p
+ 1)
p
!
p
1
z
}|
{
p
pP
R
(
m, p
)
.
(S3)
For an unregulated promoter, i.e. a promoter in a cell that has no repressors present, and therefore
constitutively expresses the gene, we use a two-state model in which the state
R
is not allowed. All
the terms in the system of ODEs containing
k
(
r
)
on
or
k
(
r
)
o
are then set to zero.
As detailed in Section
1.1
it is convenient to express this system using matrix notation [
30
]. For
this we define
P
(
m, p
)=(
P
A
(
m, p
)
,P
I
(
m, p
)
,P
R
(
m, p
))
T
. Then the system of ODEs can be expressed
as
d
P
(
m, p
)
dt
=
KP
(
m, p
)
R
m
P
(
m, p
)+
R
m
P
(
m
1
,p
)
m
m
P
(
m, p
)+(
m
+ 1)
m
P
(
m
+1
,p
)
m
R
p
P
(
m, p
)+
m
R
p
P
(
m, p
1)
p
p
P
(
m, p
)+(
p
+ 1)
p
P
(
m, p
+ 1)
,
(S4)
where we defined matrices representing the promoter state transition
K
,
K
2
6
4
k
(
p
)
o
k
(
p
)
on
0
k
(
p
)
o
k
(
p
)
on
k
(
r
)
on
k
(
r
)
o
0
k
(
r
)
on
k
(
r
)
o
3
7
5
,
(S5)
mRNA production,
R
m
, and degradation,
m
, as
R
m
2
4
r
m
00
000
000
3
5
,
(S6)
and
m
2
4
m
00
0
m
0
00
m
3
5
.
(S7)
25
For the protein we also define production
R
p
and degradation
p
matrices as
R
p
2
4
r
p
00
0
r
p
0
00
r
p
3
5
(S8)
and
p
2
4
p
00
0
p
0
00
p
3
5
.
(S9)
The corresponding equation for the unregulated two-state promoter takes the exact same form
with the definition of the matrices following the same scheme without including the third row and
third column, and setting
k
(
r
)
on
and
k
(
r
)
o
to zero.
A closed-form solution for this master equation might not even exist. The approximate solution
of chemical master equations of this kind is an active area of research. As we will see in Appendix
S2
the two-state promoter master equation has been analytically solved for the mRNA [
34
] and protein
distributions [
53
]. For our purposes, in Appendix
S5
we will detail how to use the Maximum Entropy
principle to approximate the full distribution for the two- and three-state promoter.
S2
Parameter inference
(Note: The Python code used for the calculations presented in this section can be found in the
following link
as an annotated Jupyter notebook)
With the objective of generating falsifiable predictions with meaningful parameters, we infer the
kinetic rates for this three-state promoter model using di
erent data sets generated in our lab over the
last decade concerning di
erent aspects of the regulation of the simple repression motif. For example,
for the unregulated promoter transition rates
k
(
p
)
on
and
k
(
p
)
o
and the mRNA production rate
r
m
,we
use single-molecule mRNA FISH counts from an unregulated promoter [
25
]. Once these parameters
are fixed, we use the values to constrain the repressor rates
k
(
r
)
on
and
k
(
r
)
o
. These repressor rates are
obtained using information from mean gene expression measurements from bulk LacZ colorimetric
assays [
20
]. We also expand our model to include the allosteric nature of the repressor protein, taking
advantage of video microscopy measurements done in the context of multiple promoter copies [
26
] and
flow-cytometry measurements of the mean response of the system to di
erent levels of induction [
27
].
In what follows of this section we detail the steps taken to infer the parameter values. At each step
the values of the parameters inferred in previous steps constrain the values of the parameters that are
not yet determined, building in this way a self-consistent model informed by work that spans several
experimental techniques.
S2.1
Unregulated promoter rates
We begin our parameter inference problem with the promoter on and o
rates
k
(
p
)
on
and
k
(
p
)
o
, as well
as the mRNA production rate
r
m
. In this case there are only two states available to the promoter – the
inactive state
I
and the transcriptionally active state
A
. That means that the third ODE for
P
R
(
m, p
)is
removed from the system. The mRNA steady state distribution for this particular two-state promoter
26
model was solved analytically by Peccoud and Ycart [
34
]. This distribution
P
(
m
)
P
I
(
m
)+
P
A
(
m
)
is of the form
P
(
m
|
k
(
p
)
on
,k
(
p
)
o
,r
m
,
m
)=
k
(
p
)
on
m
+
m
(
m
+ 1)
k
(
p
)
o
+
k
(
p
)
on
m
+
m
k
(
p
)
o
+
k
(
p
)
on
m
k
(
p
)
on
m
r
m
m
m
F
1
1
k
(
p
)
on
m
+
m,
k
(
p
)
o
+
k
(
p
)
on
m
+
m,
r
m
m
!
,
(S10)
where
(
·
) is the gamma function, and
F
1
1
is the confluent hypergeometric function of the first kind.
This rather complicated expression will aid us to find parameter values for the rates. The inferred
rates
k
(
p
)
on
,
k
(
p
)
o
and
r
m
will be expressed in units of the mRNA degradation rate
m
. This is because
the model in Eq.
S10
is homogeneous in time, meaning that if we divide all rates by a constant it would
be equivalent to multiplying the characteristic time scale by the same constant. As we will discuss in
the next section, Eq.
S10
has degeneracy in the parameter values. What this means is that a change
in one of the parameters, specifically
r
m
, can be compensated by a change in another parameter,
specifically
k
(
p
)
o
, to obtain the exact same distribution. To work around this intrinsic limitation of the
model we will include in our inference prior information from what we know from equilibrium-based
models.
Bayesian parameter inference of RNAP rates
In order to make progress at inferring the unregulated promoter state transition rates, we make
use of the single-molecule mRNA FISH data from Jones et al. [
25
]. Fig.
S1
shows the distribution of
mRNA per cell for the
lacUV5
promoter used for our inference. This promoter, being very strong,
has a mean copy number of
h
m
i⇡
18 mRNA/cell.
Figure S1.
lacUV5
mRNA per cell distribution.
Data from [
25
] of the unregulated
lacUV5
promoter
as inferred from single molecule mRNA FISH.
Having this data in hand we now turn to Bayesian parameter inference. Writing Bayes theorem
we have
P
(
k
(
p
)
on
,k
(
p
)
o
,r
m
|
D
)=
P
(
D
|
k
(
p
)
on
,k
(
p
)
o
,r
m
)
P
(
k
(
p
)
on
,k
(
p
)
o
,r
m
)
P
(
D
)
,
(S11)
where
D
represents the data. For this case the data consists of single-cell mRNA counts
D
=
{
m
1
,m
2
,...,m
N
}
,where
N
is the number of cells. We assume that each cell’s measurement is
27
independent of the others such that we can rewrite Eq.
S11
as
P
(
k
(
p
)
on
,k
(
p
)
o
,r
m
|{
m
i
}
)
/
"
N
Y
i
=1
P
(
m
i
|
k
(
p
)
on
,k
(
p
)
o
,r
m
)
#
P
(
k
(
p
)
on
,k
(
p
)
o
,r
m
)
,
(S12)
where we ignore the normalization constant
P
(
D
). The likelihood term
P
(
m
i
|
k
(
p
)
on
,k
(
p
)
o
,r
m
) is exactly
given by Eq.
S10
with
m
= 1. Given that we have this functional form for the distribution, we can
use Markov Chain Monte Carlo (MCMC) sampling to explore the 3D parameter space in order to fit
Eq.
S10
to the mRNA-FISH data.
Constraining the rates given prior thermodynamic knowledge.
One of the strengths of the Bayesian approach is that we can include all the prior knowledge on the
parameters when performing an inference [
4
]. Basic features such as the fact that the rates have to be
strictly positive constrain the values that these parameters can take. For the specific rates analyzed
in this section we know more than the simple constraint of non-negative values. The expression of an
unregulated promoter has been studied from a thermodynamic perspective [
23
]. Given the underlying
assumptions of these equilibrium models, in which the probability of finding the RNAP bound to the
promoter is proportional to the transcription rate [
54
], they can only make statements about the mean
expression level. Nevertheless if both the thermodynamic and the kinetic model describe the same
process, the predictions for the mean gene expression level must agree. That means that we can use
what we know about the mean gene expression, and how this is related to parameters such as molecule
copy numbers and binding a
nities, to constrain the values that the rates in question can take.
In the case of this two-state promoter it can be shown that the mean number of mRNA is given
by [
30
] (See Appendix
S3
for moment computation)
h
m
i
=
r
m
m
k
(
p
)
on
k
(
p
)
on
+
k
(
p
)
o
.
(S13)
Another way of expressing this is as
r
m
m
p
(
p
)
active
,where
p
(
p
)
active
is the probability of the promoter being
in the transcriptionally active state. The thermodynamic picture has an equivalent result where the
mean number of mRNA is given by [
23
], [
54
]
h
m
i
=
r
m
m
P
N
NS
e
"
p
1+
P
N
NS
e
"
p
,
(S14)
where
P
is the number of RNAP per cell,
N
NS
is the number of non-specific binding sites,
"
p
is the
RNAP binding energy in
k
B
T
units and
(
k
B
T
)
1
. Using Eq.
S13
and Eq.
S14
we can easily see
that if these frameworks are to be equivalent, then it must be true that
k
(
p
)
on
k
(
p
)
o
=
P
N
NS
e
"
p
,
(S15)
or equivalently
ln
k
(
p
)
on
k
(
p
)
o
!
=
"
p
+ln
P
ln
N
NS
.
(S16)
28
To put numerical values into these variables we can use information from the literature. The RNAP
copy number is order
P
1000
3000 RNAP/cell for a 1 hour doubling time [
55
]. As for the number of
non-specific binding sites and the binding energy, we have that
N
NS
=4
.
6
10
6
[
54
] and
"
p
5
7
[
23
]. Given these values we define a Gaussian prior for the log ratio of these two quantities of the form
P
ln
k
(
p
)
on
k
(
p
)
o
!!
/
exp
8
>
>
>
<
>
>
>
:
ln
k
(
p
)
on
k
(
p
)
o
(
"
p
+ln
P
ln
N
NS
)
2
2
2
9
>
>
>
=
>
>
>
;
,
(S17)
where
is the variance that accounts for the uncertainty in these parameters. We include this prior as
part of the prior term
P
(
k
(
p
)
on
,k
(
p
)
o
,r
m
) of Eq.
S12
. We then use MCMC to sample out of the posterior
distribution given by Eq.
S12
. Fig.
S2
shows the MCMC samples of the posterior distribution. For the
case of the
k
(
p
)
on
parameter there is a single symmetric peak.
k
(
p
)
o
and
r
m
have a rather long tail towards
large values. In fact, the 2D projection of
k
(
p
)
o
vs
r
m
shows that the model is sloppy, meaning that the
two parameters are highly correlated. This feature is a common problem for many non-linear systems
used in biophysics and systems biology [
56
]. What this implies is that we can change the value of
k
(
p
)
o
,
and then compensate by a change in
r
m
in order to maintain the shape of the mRNA distribution.
Therefore it is impossible from the data and the model themselves to narrow down a single value for
the parameters. Nevertheless since we included the prior information on the rates as given by the
analogous form between the equilibrium and non-equilibrium expressions for the mean mRNA level,
we obtained a more constrained parameter value for the RNAP rates and the transcription rate that
we will take as the peak of this long-tailed distribution.
The inferred values
k
(
p
)
on
=4
.
3
+1
0
.
3
,
k
(
p
)
o
= 18
.
8
+120
10
and
r
m
= 103
.
8
+423
37
are given in units of the
mRNA degradation rate
m
. Given the asymmetry of the parameter distributions we report the upper
and lower bound of the 95
th
percentile of the posterior distributions. Assuming a mean life-time for
mRNA of
3 min (from this
link
) we have an mRNA degradation rate of
m
2
.
84
10
3
s
1
. Using
this value gives the following values for the inferred rates:
k
(
p
)
on
=0
.
024
+0
.
005
0
.
002
s
1
,
k
(
p
)
o
=0
.
11
+0
.
66
0
.
05
s
1
,
and
r
m
=0
.
3
+2
.
3
0
.
2
s
1
.
Fig.
S3
compares the experimental data from Fig.
S1
with the resulting distribution obtained by
substituting the most likely parameter values into Eq.
S10
. As we can see this two-state model fits
the data adequately.
S2.2
Accounting for variability in the number of promoters
As discussed in ref. [
25
] and further expanded in [
38
] an important source of cell-to-cell variability
in gene expression in bacteria is the fact that, depending on the growth rate and the position relative
to the chromosome replication origin, cells can have multiple copies of any given gene. Genes closer
to the replication origin have on average higher gene copy number compared to genes at the opposite
end. For the locus in which our reporter construct is located (
galK
) and the doubling time of the
mRNA FISH experiments we expect to have
1.66 copies of the gene [
25
], [
37
]. This implies that the
cells spend 2/3 of the cell cycle with two copies of the promoter and the rest with a single copy.
To account for this variability in gene copy we extend the model assuming that when cells have
two copies of the promoter the mRNA production rate is 2
r
m
compared to the rate
r
m
for a single
29
Figure S2. MCMC posterior distribution.
Sampling out of Eq.
S12
the plot shows 2D and 1D
projections of the 3D parameter space. The parameter values are (in units of the mRNA degradation rate
m
)
k
(
p
)
on
=4
.
3
+1
0
.
3
,
k
(
p
)
o
= 18
.
8
+120
10
and
r
m
= 103
.
8
+423
37
which are the modes of their respective distributions,
where the superscripts and subscripts represent the upper and lower bounds of the 95
th
percentile of the
parameter value distributions
Figure S3. Experimental vs. theoretical distribution of mRNA per cell using parameters from
Bayesian inference.
Dotted line shows the result of using Eq.
S10
along with the parameters inferred for the
rates. Blue bars are the same data as Fig.
S1
obtained from [
25
].
promoter copy. The probability of observing a certain mRNA copy
m
is therefore given by
P
(
m
)=
P
(
m
|
one promoter)
·
P
(one promoter) +
P
(
m
|
two promoters)
·
P
(two promoters)
.
(S18)
Both terms
P
(
m
|
promoter copy) are given by Eq.
S10
with the only di
erence being the rate
r
m
.It
is important to acknowledge that Eq.
S18
assumes that once the gene is replicated the time scale in
30
which the mRNA count relaxes to the new steady state is much shorter than the time that the cells
spend in this two promoter copies state. This approximation should be valid for a short lived mRNA
molecule, but the assumption is not applicable for proteins whose degradation rate is comparable to
the cell cycle length as explored in Section
1.4
.
In order to repeat the Bayesian inference including this variability in gene copy number we must
split the mRNA count data into two sets – cells with a single copy of the promoter and cells with
two copies of the promoter. For the single molecule mRNA FISH data there is no labeling of the
locus, making it impossible to determine the number of copies of the promoter for any given cell. We
therefore follow Jones et al. [
25
] in using the cell area as a proxy for stage in the cell cycle. In their
approach they sorted cells by area, considering cells below the 33
th
percentile having a single promoter
copy and the rest as having two copies. This approach ignores that cells are not uniformly distributed
along the cell cycle. As first derived in [
40
] populations of cells in a log-phase are exponentially
distributed along the cell cycle. This distribution is of the form
P
(
a
) = (ln 2)
·
2
1
a
,
(S19)
where
a
2
[0
,
1] is the stage of the cell cycle, with
a
= 0 being the start of the cycle and
a
=1being
the cell division (See Appendix
S9
for a derivation of Eq.
S19
). Fig.
S4
shows the separation of the
two groups based on area where Eq.
S19
was used to weight the distribution along the cell cycle.
Figure S4. Separation of cells based on cell size.
Using the area as a proxy for position in the cell cycle,
cells can be sorted into two groups – small cells (with one promoter copy) and large cells (with two promoter
copies). The vertical black line delimits the threshold that divides both groups as weighted by Eq.
S19
.
A subtle, but important consequence of Eq.
S19
is that computing any quantity for a single cell is
not equivalent to computing the same quantity for a population of cells. For example, let us assume
that we want to compute the mean mRNA copy number
h
m
i
. For a single cell this would be of the
form
h
m
i
cell
=
h
m
i
1
·
f
+
h
m
i
2
·
(1
f
)
,
(S20)
where
h
m
i
i
is the mean mRNA copy number with
i
promoter copies in the cell, and
f
is the fraction
of the cell cycle that cells spend with a single copy of the promoter. For a single cell the probability of
having a single promoter copy is equivalent to this fraction
f
.ButEq.
S19
tells us that if we sample
unsynchronized cells we are not sampling uniformly across the cell cycle. Therefore for a population
31
of cells the mean mRNA is given by
h
m
i
population
=
h
m
i
1
·
+
h
m
i
2
·
(1
)
(S21)
where the probability of sampling a cell with one promoter
is given by
=
Z
f
0
P
(
a
)
da,
(S22)
where
P
(
a
) is given by Eq.
S106
. What this equation computes is the probability of sampling a cell
during a stage of the cell cycle
<f
where the reporter gene hasn’t been replicated yet. Fig.
S5
shows
the distribution of both groups. As expected larger cells have a higher mRNA copy number on average.
Figure S5. mRNA distribution for small and large cells.
(A) histogram and (B) cumulative
distribution function of the small and large cells as determined in Fig.
S4
. The triangles above histograms in
(A) indicate the mean mRNA copy number for each group.
We modify Eq.
S12
to account for the two separate groups of cells. Let
N
s
be the number of
cells in the small size group and
N
l
the number of cells in the large size group. Then the posterior
distribution for the parameters is of the form
P
(
k
(
p
)
on
,k
(
p
)
o
,r
m
|{
m
i
}
)
/
"
N
s
Y
i
=1
P
(
m
i
|
k
(
p
)
on
,k
(
p
)
o
,r
m
)
#
2
4
N
l
Y
j
=1
P
(
m
j
|
k
(
p
)
on
,k
(
p
)
o
,
2
r
m
)
3
5
P
(
k
(
p
)
on
,k
(
p
)
o
,r
m
)
,
(S23)
where we split the product of small and large cells.
For the two-promoter model the prior shown in Eq.
S17
requires a small modification. Eq.
S21
gives the mean mRNA copy number of a population of asynchronous cells growing at steady state.
Given that we assume that the only di
erence between having one vs. two promoter copies is the
32