
Supplemental Information for: First-principles prediction of the information processing capacity of a simple genetic circuit

Manuel Razo-Mejia, Sarah Marzen, Griffin Chure, Rachel Taubman, Muir Morrison, and Rob Phillips (1, 3, *)

Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA

Department of Physics, W. M. Keck Science Department

Department of Physics, California Institute of Technology, Pasadena, CA 91125, USA

Correspondence: phillips@pboc.caltech.edu

Contents

S1 Three-state promoter model for simple repression
S2 Parameter inference
    S2.1 Unregulated promoter rates
    S2.2 Accounting for variability in the number of promoters
    S2.3 Repressor rates from three-state regulated promoter
S3 Computing moments from the master equation
    S3.1 Computing moments of a distribution
    S3.2 Moment closure of the simple-repression distribution
    S3.3 Computing single promoter steady-state moments
S4 Accounting for the variability in gene copy number during the cell cycle
    S4.1 Numerical integration of moment equations
    S4.2 Exponentially distributed ages
    S4.3 Reproducing the equilibrium picture
    S4.4 Comparison between single- and multi-promoter kinetic model
    S4.5 Comparison with experimental data
S5 Maximum entropy approximation of distributions
    S5.1 The MaxEnt principle
    S5.2 The Bretthorst rescaling algorithm
    S5.3 Predicting distributions for simple repression constructs
    S5.4 Comparison with experimental data
S6 Gillespie simulation of master equation
    S6.1 mRNA distribution with Gillespie simulations
    S6.2 Protein distribution with Gillespie simulations
S7 Computational determination of the channel capacity
    S7.1 Blahut's algorithm
    S7.2 Channel capacity from arbitrary units of fluorescence
    S7.3 Assumptions involved in the computation of the channel capacity
S8 Empirical fits to noise predictions
    S8.1 Multiplicative factor for the noise
    S8.2 Additive factor for the noise
    S8.3 Correction factor for channel capacity with multiplicative factor
S9 Derivation of the cell age distribution

S1 Three-state promoter model for simple repression

In order to tackle the question of how much information the simple repression motif can process we require the joint probability distribution of mRNA and protein $P(m, p; t)$. To obtain this distribution we use the chemical master equation formalism as described in Section 1.1. Specifically, we assume a three-state model where the promoter can be found 1) in a transcriptionally active state ($A$ state), 2) in a transcriptionally inactive state without the repressor bound ($I$ state), and 3) with the repressor bound ($R$ state) (see Fig. (A)). These three states generate a system of coupled differential equations for each of the three state distributions $P_A(m, p)$, $P_I(m, p)$, and $P_R(m, p)$. Given the rates shown in Fig. (A), let us define the system of ODEs. For the transcriptionally active state we have

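$$
\begin{aligned}
\frac{dP_A(m, p)}{dt} = &- k^{(p)}_{\text{off}} P_A(m, p) + k^{(p)}_{\text{on}} P_I(m, p) \\
&+ r_m P_A(m - 1, p) - r_m P_A(m, p) + \gamma_m (m + 1) P_A(m + 1, p) - \gamma_m m P_A(m, p) \\
&+ r_p m P_A(m, p - 1) - r_p m P_A(m, p) + \gamma_p (p + 1) P_A(m, p + 1) - \gamma_p p P_A(m, p).
\end{aligned}
\tag{S1}
$$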

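For the inactive promoter state $I$ we have

$$
\begin{aligned}
\frac{dP_I(m, p)}{dt} = &\ k^{(p)}_{\text{off}} P_A(m, p) - k^{(p)}_{\text{on}} P_I(m, p) + k^{(r)}_{\text{off}} P_R(m, p) - k^{(r)}_{\text{on}} P_I(m, p) \\
&+ \gamma_m (m + 1) P_I(m + 1, p) - \gamma_m m P_I(m, p) \\
&+ r_p m P_I(m, p - 1) - r_p m P_I(m, p) + \gamma_p (p + 1) P_I(m, p + 1) - \gamma_p p P_I(m, p).
\end{aligned}
\tag{S2}
$$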

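And finally for the repressor bound state $R$ we have

$$
\begin{aligned}
\frac{dP_R(m, p)}{dt} = &- k^{(r)}_{\text{off}} P_R(m, p) + k^{(r)}_{\text{on}} P_I(m, p) \\
&+ \gamma_m (m + 1) P_R(m + 1, p) - \gamma_m m P_R(m, p) \\
&+ r_p m P_R(m, p - 1) - r_p m P_R(m, p) + \gamma_p (p + 1) P_R(m, p + 1) - \gamma_p p P_R(m, p).
\end{aligned}
\tag{S3}
$$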

For an unregulated promoter, i.e., a promoter in a cell that has no repressors present and that therefore constitutively expresses the gene, we use a two-state model in which the state $R$ is not allowed. All the terms in the system of ODEs containing $k^{(r)}_{\text{on}}$ or $k^{(r)}_{\text{off}}$ are then set to zero.

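As detailed in Section 1.1 it is convenient to express this system using matrix notation [ ]. For this we define $\mathbf{P}(m, p) = (P_A(m, p), P_I(m, p), P_R(m, p))^\top$. Then the system of ODEs can be expressed as

$$
\begin{aligned}
\frac{d\mathbf{P}(m, p)}{dt} = &\ \mathbf{K} \mathbf{P}(m, p) - \mathbf{R}_m \mathbf{P}(m, p) + \mathbf{R}_m \mathbf{P}(m - 1, p) - m \mathbf{\Gamma}_m \mathbf{P}(m, p) + (m + 1) \mathbf{\Gamma}_m \mathbf{P}(m + 1, p) \\
&- m \mathbf{R}_p \mathbf{P}(m, p) + m \mathbf{R}_p \mathbf{P}(m, p - 1) - p \mathbf{\Gamma}_p \mathbf{P}(m, p) + (p + 1) \mathbf{\Gamma}_p \mathbf{P}(m, p + 1),
\end{aligned}
\tag{S4}
$$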

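where we defined matrices representing the promoter state transition $\mathbf{K}$,

$$
\mathbf{K} \equiv
\begin{bmatrix}
-k^{(p)}_{\text{off}} & k^{(p)}_{\text{on}} & 0 \\
k^{(p)}_{\text{off}} & -k^{(p)}_{\text{on}} - k^{(r)}_{\text{on}} & k^{(r)}_{\text{off}} \\
0 & k^{(r)}_{\text{on}} & -k^{(r)}_{\text{off}}
\end{bmatrix},
\tag{S5}
$$

mRNA production, $\mathbf{R}_m$, and degradation, $\mathbf{\Gamma}_m$, as

$$
\mathbf{R}_m \equiv
\begin{bmatrix}
r_m & 0 & 0 \\
0 & 0 & 0 \\
0 & 0 & 0
\end{bmatrix},
\tag{S6}
$$

and

$$
\mathbf{\Gamma}_m \equiv
\begin{bmatrix}
\gamma_m & 0 & 0 \\
0 & \gamma_m & 0 \\
0 & 0 & \gamma_m
\end{bmatrix}.
\tag{S7}
$$

For the protein we also define production $\mathbf{R}_p$ and degradation $\mathbf{\Gamma}_p$ matrices as

$$
\mathbf{R}_p \equiv
\begin{bmatrix}
r_p & 0 & 0 \\
0 & r_p & 0 \\
0 & 0 & r_p
\end{bmatrix}
\tag{S8}
$$

and

$$
\mathbf{\Gamma}_p \equiv
\begin{bmatrix}
\gamma_p & 0 & 0 \\
0 & \gamma_p & 0 \\
0 & 0 & \gamma_p
\end{bmatrix}.
\tag{S9}
$$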

The corresponding equation for the unregulated two-state promoter takes the exact same form, with the matrices defined following the same scheme but without the third row and third column, and with $k^{(r)}_{\text{on}}$ and $k^{(r)}_{\text{off}}$ set to zero.
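The matrix bookkeeping described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names, the state ordering $(A, I, R)$, and all numerical rate values are assumptions chosen for the example. It builds the five matrices, checks that every column of the transition matrix sums to zero (promoter-state transitions conserve probability), and shows the two-state reduction obtained by zeroing the repressor rates and dropping the third row and column.

```python
def three_state_matrices(kp_on, kp_off, kr_on, kr_off, r_m, gamma_m, r_p, gamma_p):
    """Return (K, R_m, Gamma_m, R_p, Gamma_p) as nested lists.

    State ordering is assumed to be (A, I, R): active, inactive, repressor bound.
    """
    # Promoter state transition matrix: column j holds the rates out of state j.
    K = [
        [-kp_off, kp_on, 0.0],
        [kp_off, -kp_on - kr_on, kr_off],
        [0.0, kr_on, -kr_off],
    ]

    def diag(a, i, r):
        return [[a, 0.0, 0.0], [0.0, i, 0.0], [0.0, 0.0, r]]

    R_m = diag(r_m, 0.0, 0.0)                  # only the active state transcribes
    Gamma_m = diag(gamma_m, gamma_m, gamma_m)  # mRNA decays in every state
    R_p = diag(r_p, r_p, r_p)                  # translation is state independent
    Gamma_p = diag(gamma_p, gamma_p, gamma_p)  # protein decays in every state
    return K, R_m, Gamma_m, R_p, Gamma_p


def two_state_block(matrix):
    """Drop the third row and third column (unregulated promoter)."""
    return [row[:2] for row in matrix[:2]]


def column_sums(matrix):
    return [sum(row[j] for row in matrix) for j in range(len(matrix[0]))]


# Hypothetical rate values, for illustration only.
K, R_m, Gamma_m, R_p, Gamma_p = three_state_matrices(
    kp_on=4.0, kp_off=18.0, kr_on=2.0, kr_off=0.5,
    r_m=100.0, gamma_m=1.0, r_p=10.0, gamma_p=0.1,
)
print(column_sums(K))

# Unregulated promoter: zero the repressor rates, then keep the 2x2 block.
K_unreg = two_state_block(
    three_state_matrices(4.0, 18.0, 0.0, 0.0, 100.0, 1.0, 10.0, 0.1)[0]
)
print(K_unreg)
```

Keeping the matrices as plain nested lists avoids any dependency; in practice an array library would be the natural choice for the matrix products in Eq. S4.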

A closed-form solution for this master equation might not even exist. The approximate solution of chemical master equations of this kind is an active area of research. As we will see in Appendix , the two-state promoter master equation has been analytically solved for the mRNA [ ] and protein distributions [ ]. For our purposes, in Appendix S5 we will detail how to use the Maximum Entropy principle to approximate the full distribution for the two- and three-state promoter.

S2 Parameter inference

(Note: The Python code used for the calculations presented in this section can be found in the following link as an annotated Jupyter notebook)

With the objective of generating falsifiable predictions with meaningful parameters, we infer the kinetic rates for this three-state promoter model using different data sets generated in our lab over the last decade concerning different aspects of the regulation of the simple repression motif. For example, for the unregulated promoter transition rates $k^{(p)}_{\text{on}}$ and $k^{(p)}_{\text{off}}$ and the mRNA production rate $r_m$, we use single-molecule mRNA FISH counts from an unregulated promoter [ ]. Once these parameters are fixed, we use the values to constrain the repressor rates $k^{(r)}_{\text{on}}$ and $k^{(r)}_{\text{off}}$. These repressor rates are obtained using information from mean gene expression measurements from bulk LacZ colorimetric assays [ ]. We also expand our model to include the allosteric nature of the repressor protein, taking advantage of video microscopy measurements done in the context of multiple promoter copies [ ] and flow-cytometry measurements of the mean response of the system to different levels of induction [ ].

In the rest of this section we detail the steps taken to infer the parameter values. At each step the values of the parameters inferred in previous steps constrain the values of the parameters that are not yet determined, building in this way a self-consistent model informed by work that spans several experimental techniques.

S2.1

Unregulated promoter rates

We begin our parameter inference problem with the promoter on and off rates $k^{(p)}_{\text{on}}$ and $k^{(p)}_{\text{off}}$, as well as the mRNA production rate $r_m$. In this case there are only two states available to the promoter: the inactive state $I$ and the transcriptionally active state $A$. That means that the third ODE, for $P_R(m, p)$, is removed from the system. The mRNA steady-state distribution for this particular two-state promoter model was solved analytically by Peccoud and Ycart [ ]. This distribution $P(m) \equiv P_A(m) + P_I(m)$ is of the form

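$$
P\left(m \mid k^{(p)}_{\text{on}}, k^{(p)}_{\text{off}}, r_m, \gamma_m\right) =
\frac{\Gamma\left(\frac{k^{(p)}_{\text{on}}}{\gamma_m} + m\right)}
{\Gamma(m + 1)\, \Gamma\left(\frac{k^{(p)}_{\text{off}} + k^{(p)}_{\text{on}}}{\gamma_m} + m\right)}
\frac{\Gamma\left(\frac{k^{(p)}_{\text{off}} + k^{(p)}_{\text{on}}}{\gamma_m}\right)}
{\Gamma\left(\frac{k^{(p)}_{\text{on}}}{\gamma_m}\right)}
\left(\frac{r_m}{\gamma_m}\right)^m
{}_1F_1\left(\frac{k^{(p)}_{\text{on}}}{\gamma_m} + m,\ \frac{k^{(p)}_{\text{off}} + k^{(p)}_{\text{on}}}{\gamma_m} + m,\ -\frac{r_m}{\gamma_m}\right),
\tag{S10}
$$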

where $\Gamma(\cdot)$ is the gamma function, and ${}_1F_1$ is the confluent hypergeometric function of the first kind. This rather complicated expression will help us find parameter values for the rates. The inferred rates $k^{(p)}_{\text{on}}$, $k^{(p)}_{\text{off}}$, and $r_m$ will be expressed in units of the mRNA degradation rate $\gamma_m$. This is because the model in Eq. S10 is homogeneous in time, meaning that dividing all rates by a constant is equivalent to multiplying the characteristic time scale by the same constant. As we will discuss in the next section, Eq. S10 has degeneracy in the parameter values. What this means is that a change in one of the parameters, specifically $r_m$, can be compensated by a change in another parameter, specifically $k^{(p)}_{\text{off}}$, to obtain the exact same distribution. To work around this intrinsic limitation of the model we will include in our inference prior information from what we know from equilibrium-based models.
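To make the two-state steady-state mRNA distribution concrete, the following is a minimal, dependency-free Python sketch of its evaluation. It is not the authors' implementation. The function names and the parameter values are illustrative assumptions, and the numerical strategy is a choice made here: the hypergeometric function is evaluated after applying Kummer's transformation, ${}_1F_1(a; b; -z) = e^{-z}\,{}_1F_1(b - a; b; z)$, so that the power series has only positive terms and avoids cancellation when $r_m / \gamma_m$ is large.

```python
import math


def hyp1f1_positive(a, b, z, tol=1e-16, max_terms=10_000):
    """Power series of 1F1(a; b; z), intended for a, b, z > 0 (all terms positive)."""
    term = 1.0
    total = 1.0
    for n in range(max_terms):
        term *= (a + n) / (b + n) * z / (n + 1)
        total += term
        if term < tol * total:
            break
    return total


def log_p_mrna(m, k_on, k_off, r_m, gamma_m=1.0):
    """Log probability of m mRNA for the two-state promoter at steady state."""
    a = k_on / gamma_m
    b = k_off / gamma_m
    rho = r_m / gamma_m
    log_prefactor = (
        math.lgamma(a + m) - math.lgamma(m + 1) - math.lgamma(a + b + m)
        + math.lgamma(a + b) - math.lgamma(a)
        + m * math.log(rho)
    )
    # Kummer: 1F1(a + m; a + b + m; -rho) = exp(-rho) * 1F1(b; a + b + m; rho)
    log_hyp = -rho + math.log(hyp1f1_positive(b, a + b + m, rho))
    return log_prefactor + log_hyp


# Illustrative (not inferred) rates, in units of the mRNA degradation rate.
k_on, k_off, r_m = 4.0, 18.0, 103.0
probs = [math.exp(log_p_mrna(m, k_on, k_off, r_m)) for m in range(400)]
mean = sum(m * p for m, p in enumerate(probs))
print(sum(probs), mean, r_m * k_on / (k_on + k_off))
```

Working in log space with `math.lgamma` keeps the gamma-function ratios finite for large $m$. The final line compares the numerical mean with the closed-form mean $\frac{r_m}{\gamma_m} \frac{k^{(p)}_{\text{on}}}{k^{(p)}_{\text{on}} + k^{(p)}_{\text{off}}}$ of the two-state model as a sanity check.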

Bayesian parameter inference of RNAP rates

In order to make progress at inferring the unregulated promoter state transition rates, we make use of the single-molecule mRNA FISH data from Jones et al. [ ]. Fig. S1 shows the distribution of mRNA per cell for the lacUV5 promoter used for our inference. This promoter, being very strong, has a mean copy number of $\langle m \rangle \approx 18$ mRNA/cell.

Figure S1. lacUV5 mRNA per cell distribution. Data from [ ] of the unregulated lacUV5 promoter as inferred from single molecule mRNA FISH.

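Having this data in hand we now turn to Bayesian parameter inference. Writing Bayes' theorem we have

$$
P\left(k^{(p)}_{\text{on}}, k^{(p)}_{\text{off}}, r_m \mid D\right) =
\frac{P\left(D \mid k^{(p)}_{\text{on}}, k^{(p)}_{\text{off}}, r_m\right) P\left(k^{(p)}_{\text{on}}, k^{(p)}_{\text{off}}, r_m\right)}{P(D)},
\tag{S11}
$$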

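where $D$ represents the data. For this case the data consists of single-cell mRNA counts $D = \{m_1, m_2, \ldots, m_N\}$, where $N$ is the number of cells. We assume that each cell's measurement is independent of the others such that we can rewrite Eq. S11 as

$$
P\left(k^{(p)}_{\text{on}}, k^{(p)}_{\text{off}}, r_m \mid \{m_i\}\right) \propto
\left[\prod_{i=1}^{N} P\left(m_i \mid k^{(p)}_{\text{on}}, k^{(p)}_{\text{off}}, r_m\right)\right]
P\left(k^{(p)}_{\text{on}}, k^{(p)}_{\text{off}}, r_m\right),
\tag{S12}
$$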

where we ignore the normalization constant $P(D)$. The likelihood term $P(m_i \mid k^{(p)}_{\text{on}}, k^{(p)}_{\text{off}}, r_m)$ is exactly given by Eq. S10 with $\gamma_m = 1$. Given that we have this functional form for the distribution, we can use Markov Chain Monte Carlo (MCMC) sampling to explore the 3D parameter space in order to fit Eq. S10 to the mRNA-FISH data.

Constraining the rates given prior thermodynamic knowledge.

One of the strengths of the Bayesian approach is that we can include all the prior knowledge on the

parameters when performing an inference [

]. Basic features such as the fact that the rates have to be

strictly positive constrain the values that these parameters can take. For the specific rates analyzed

in this section we know more than the simple constraint of non-negative values. The expression of an

unregulated promoter has been studied from a thermodynamic perspective [

]. Given the underlying

assumptions of these equilibrium models, in which the probability of finding the RNAP bound to the

promoter is proportional to the transcription rate [

], they can only make statements about the mean

expression level. Nevertheless if both the thermodynamic and the kinetic model describe the same

process, the predictions for the mean gene expression level must agree. That means that we can use

what we know about the mean gene expression, and how this is related to parameters such as molecule copy numbers and binding affinities, to constrain the values that the rates in question can take.

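In the case of this two-state promoter it can be shown that the mean number of mRNA is given by [ ] (see Appendix for moment computation)

$$
\langle m \rangle = \frac{r_m}{\gamma_m} \frac{k^{(p)}_{\text{on}}}{k^{(p)}_{\text{on}} + k^{(p)}_{\text{off}}}.
\tag{S13}
$$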

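Another way of expressing this is as $\langle m \rangle = \frac{r_m}{\gamma_m} \times p^{(p)}_{\text{active}}$, where $p^{(p)}_{\text{active}}$ is the probability of the promoter being in the transcriptionally active state. The thermodynamic picture has an equivalent result where the mean number of mRNA is given by [ ], [ ]

$$
\langle m \rangle = \frac{r_m}{\gamma_m}
\frac{\frac{P}{N_{NS}} e^{-\beta \Delta\varepsilon_p}}{1 + \frac{P}{N_{NS}} e^{-\beta \Delta\varepsilon_p}},
\tag{S14}
$$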

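where $P$ is the number of RNAP per cell, $N_{NS}$ is the number of non-specific binding sites, $\Delta\varepsilon_p$ is the RNAP binding energy in $k_B T$ units and $\beta \equiv (k_B T)^{-1}$. Using Eq. S13 and Eq. S14 we can easily see that if these frameworks are to be equivalent, then it must be true that

$$
\frac{P}{N_{NS}} e^{-\beta \Delta\varepsilon_p} = \frac{k^{(p)}_{\text{on}}}{k^{(p)}_{\text{off}}},
\tag{S15}
$$

or equivalently

$$
\ln\left(\frac{k^{(p)}_{\text{on}}}{k^{(p)}_{\text{off}}}\right) = -\beta \Delta\varepsilon_p + \ln P - \ln N_{NS}.
\tag{S16}
$$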

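To put numerical values into these variables we can use information from the literature. The RNAP copy number is of order $P \approx 1000$ to $3000$ RNAP/cell for a 1 hour doubling time [ ]. The number of non-specific binding sites $N_{NS}$ and the binding energy $\Delta\varepsilon_p$ are taken from [ ] and [ ], respectively. Given these values we define a Gaussian prior for the log ratio of these two quantities of the form

$$
P\left(k^{(p)}_{\text{on}}, k^{(p)}_{\text{off}}\right) \propto
\exp\left\{-\frac{\left[\ln\left(\frac{k^{(p)}_{\text{on}}}{k^{(p)}_{\text{off}}}\right) - \left(-\beta \Delta\varepsilon_p + \ln P - \ln N_{NS}\right)\right]^2}{2\sigma^2}\right\},
\tag{S17}
$$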

where $\sigma^2$ is the variance that accounts for the uncertainty in these parameters. We include this prior as part of the prior term $P(k^{(p)}_{\text{on}}, k^{(p)}_{\text{off}}, r_m)$ of Eq. S12. We then use MCMC to sample out of the posterior distribution given by Eq. S12. Fig. S2 shows the MCMC samples of the posterior distribution. For the case of the $k^{(p)}_{\text{on}}$ parameter there is a single symmetric peak. $k^{(p)}_{\text{off}}$ and $r_m$ have a rather long tail towards large values. In fact, the 2D projection of $k^{(p)}_{\text{off}}$ and $r_m$ shows that the model is sloppy, meaning that the two parameters are highly correlated. This feature is a common problem for many non-linear systems used in biophysics and systems biology [ ]. What this implies is that we can change the value of $k^{(p)}_{\text{off}}$ and then compensate by a change in $r_m$ in order to maintain the shape of the mRNA distribution. Therefore it is impossible from the data and the model themselves to narrow down a single value for the parameters. Nevertheless, since we included the prior information on the rates as given by the analogous form between the equilibrium and non-equilibrium expressions for the mean mRNA level, we obtained a more constrained parameter value for the RNAP rates and the transcription rate, which we take as the peak of this long-tailed distribution.
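As an illustration of how the thermodynamic constraint enters the inference, the sketch below writes the Gaussian prior on $\ln(k^{(p)}_{\text{on}} / k^{(p)}_{\text{off}})$ as an unnormalized log-prior in Python. It is a sketch under stated assumptions rather than the authors' code: the function names are hypothetical, and the numerical values used for the RNAP copy number, the number of non-specific sites, the binding energy, and $\sigma$ are placeholders chosen only to exercise the functions.

```python
import math


def thermo_log_ratio(n_rnap, n_ns, beta_delta_eps_p):
    """ln(k_on / k_off) implied by the equilibrium model (Eq. S16).

    beta_delta_eps_p is the RNAP binding energy in k_B T units.
    """
    return -beta_delta_eps_p + math.log(n_rnap) - math.log(n_ns)


def log_prior(k_on, k_off, r_m, center, sigma):
    """Unnormalized log-prior: positivity constraint plus the Gaussian of Eq. S17."""
    if k_on <= 0.0 or k_off <= 0.0 or r_m <= 0.0:
        return -math.inf  # rates must be strictly positive
    z = (math.log(k_on / k_off) - center) / sigma
    return -0.5 * z * z


# Placeholder values, assumed for illustration only.
center = thermo_log_ratio(n_rnap=2000.0, n_ns=4.6e6, beta_delta_eps_p=-6.0)
print(center)
print(log_prior(k_on=4.0, k_off=18.0, r_m=103.0, center=center, sigma=0.5))
```

In an MCMC run this term would be added to the summed log-likelihood of Eq. S12. Because it depends only on the ratio $k^{(p)}_{\text{on}} / k^{(p)}_{\text{off}}$, it does not by itself constrain $r_m$.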

The inferred values $k^{(p)}_{\text{on}}$, $k^{(p)}_{\text{off}} = 18^{+120}$, and $r_m = 103^{+423}$ are given in units of the mRNA degradation rate $\gamma_m$. Given the asymmetry of the parameter distributions we report the upper and lower bound of the 95th percentile of the posterior distributions. Assuming a mean lifetime for mRNA of $\approx 3$ min (from this link) fixes the mRNA degradation rate $\gamma_m$ in absolute units. Using this value gives the following values for the inferred rates: $k^{(p)}_{\text{on}} = 0.024^{+0.005}_{-0.002}$, $k^{(p)}_{\text{off}}$, and $r_m$.

Fig. S3 compares the experimental data from Fig. S1 with the resulting distribution obtained by substituting the most likely parameter values into Eq. S10. As we can see, this two-state model fits the data adequately.

S2.2

Accounting for variability in the number of promoters

As discussed in ref. [ ] and further expanded in [ ], an important source of cell-to-cell variability in gene expression in bacteria is the fact that, depending on the growth rate and the position relative to the chromosome replication origin, cells can have multiple copies of any given gene. Genes closer to the replication origin have on average higher gene copy number compared to genes at the opposite end. For the locus in which our reporter construct is located (galK) and the doubling time of the mRNA FISH experiments we expect to have $\approx 1.66$ copies of the gene [ ], [ ]. This implies that the cells spend 2/3 of the cell cycle with two copies of the promoter and the rest with a single copy.

Figure S2. MCMC posterior distribution. Sampling out of Eq. S12 the plot shows 2D and 1D projections of the 3D parameter space. The parameter values are (in units of the mRNA degradation rate $\gamma_m$) $k^{(p)}_{\text{on}}$, $k^{(p)}_{\text{off}} = 18^{+120}$, and $r_m = 103^{+423}$, which are the modes of their respective distributions, where the superscripts and subscripts represent the upper and lower bounds of the 95th percentile of the parameter value distributions.

Figure S3. Experimental vs. theoretical distribution of mRNA per cell using parameters from Bayesian inference. Dotted line shows the result of using Eq. S10 along with the parameters inferred for the rates. Blue bars are the same data as Fig. S1 obtained from [ ].

To account for this variability in gene copy we extend the model assuming that when cells have two copies of the promoter the mRNA production rate is $2 r_m$ compared to the rate $r_m$ for a single promoter copy. The probability of observing a certain mRNA copy $m$ is therefore given by

$$
P(m) = P(m \mid \text{one promoter}) \cdot P(\text{one promoter}) + P(m \mid \text{two promoters}) \cdot P(\text{two promoters}).
\tag{S18}
$$

Both terms $P(m \mid \text{promoter copy})$ are given by Eq. S10 with the only difference being the rate $r_m$. It is important to acknowledge that Eq. S18 assumes that once the gene is replicated the time scale in which the mRNA count relaxes to the new steady state is much shorter than the time that the cells spend in this two-promoter-copies state. This approximation should be valid for a short-lived mRNA molecule, but the assumption is not applicable for proteins whose degradation rate is comparable to the cell cycle length, as explored in Section 1.4.

In order to repeat the Bayesian inference including this variability in gene copy number we must split the mRNA count data into two sets: cells with a single copy of the promoter and cells with two copies of the promoter. For the single molecule mRNA FISH data there is no labeling of the locus, making it impossible to determine the number of copies of the promoter for any given cell. We therefore follow Jones et al. [ ] in using the cell area as a proxy for stage in the cell cycle. In their approach they sorted cells by area, considering cells below the 33rd percentile as having a single promoter copy and the rest as having two copies. This approach ignores that cells are not uniformly distributed along the cell cycle. As first derived in [ ], populations of cells in log-phase are exponentially distributed along the cell cycle. This distribution is of the form

$$
P(a) = (\ln 2) \cdot 2^{1 - a},
\tag{S19}
$$

where $a \in [0, 1]$ is the stage of the cell cycle, with $a = 0$ being the start of the cycle and $a = 1$ being the cell division (see Appendix S9 for a derivation of Eq. S19). Fig. S4 shows the separation of the two groups based on area, where Eq. S19 was used to weight the distribution along the cell cycle.

Figure S4. Separation of cells based on cell size. Using the area as a proxy for position in the cell cycle, cells can be sorted into two groups: small cells (with one promoter copy) and large cells (with two promoter copies). The vertical black line delimits the threshold that divides both groups as weighted by Eq. S19.

A subtle but important consequence of Eq. S19 is that computing any quantity for a single cell is not equivalent to computing the same quantity for a population of cells. For example, let us assume that we want to compute the mean mRNA copy number $\langle m \rangle$. For a single cell this would be of the form

$$
\langle m \rangle_{\text{cell}} = \langle m \rangle_1 f + \langle m \rangle_2 (1 - f),
\tag{S20}
$$

where $\langle m \rangle_i$ is the mean mRNA copy number with $i$ promoter copies in the cell, and $f$ is the fraction of the cell cycle that cells spend with a single copy of the promoter. For a single cell the probability of having a single promoter copy is equivalent to this fraction $f$. But Eq. S19 tells us that if we sample unsynchronized cells we are not sampling uniformly across the cell cycle. Therefore for a population of cells the mean mRNA is given by

$$
\langle m \rangle_{\text{population}} = \langle m \rangle_1 \phi + \langle m \rangle_2 (1 - \phi),
\tag{S21}
$$

where the probability of sampling a cell with one promoter $\phi$ is given by

$$
\phi = \int_0^f P(a)\, da,
\tag{S22}
$$

where $P(a)$ is given by Eq. S106. What this equation computes is the probability of sampling a cell during a stage of the cell cycle $a < f$ where the reporter gene hasn't been replicated yet. Fig. S5 shows the distribution of both groups. As expected, larger cells have a higher mRNA copy number on average.
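The difference between the single-cell and population averages can be checked numerically. The Python sketch below assumes the age distribution $P(a) = (\ln 2)\, 2^{1 - a}$ on $a \in [0, 1]$ and evaluates the probability of sampling a one-promoter cell both in closed form, $\int_0^f (\ln 2)\, 2^{1-a}\, da = 2\,(1 - 2^{-f})$, and by a midpoint-rule integration. The function names and the two mean mRNA values are hypothetical and serve only as an illustration.

```python
import math


def age_pdf(a):
    """Cell-age density for an exponentially growing population, a in [0, 1]."""
    return math.log(2.0) * 2.0 ** (1.0 - a)


def prob_one_promoter(f):
    """Closed form of the integral of age_pdf from 0 to f."""
    return 2.0 * (1.0 - 2.0 ** (-f))


def prob_one_promoter_numeric(f, n_steps=100_000):
    """Midpoint-rule check of the same integral."""
    h = f / n_steps
    return h * sum(age_pdf((i + 0.5) * h) for i in range(n_steps))


def mean_single_cell(mean_1, mean_2, f):
    """Time average over one cell cycle (uniform weighting in age)."""
    return mean_1 * f + mean_2 * (1.0 - f)


def mean_population(mean_1, mean_2, f):
    """Average over an unsynchronized population (age-weighted sampling)."""
    phi = prob_one_promoter(f)
    return mean_1 * phi + mean_2 * (1.0 - phi)


f = 1.0 / 3.0                # fraction of the cycle spent with one promoter copy
mean_1, mean_2 = 10.0, 20.0  # hypothetical mean mRNA counts per copy-number state
print(prob_one_promoter(f), prob_one_promoter_numeric(f))
print(mean_single_cell(mean_1, mean_2, f), mean_population(mean_1, mean_2, f))
```

For $f = 1/3$ the probability of sampling a one-promoter cell is larger than $f$, because young cells are over-represented in an exponentially growing population, so the population mean falls below the single-cell time average.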

Figure S5. mRNA distribution for small and large cells. (A) Histogram and (B) cumulative distribution function of the small and large cells as determined in Fig. S4. The triangles above the histograms in (A) indicate the mean mRNA copy number for each group.

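We modify Eq. S12 to account for the two separate groups of cells. Let $N_s$ be the number of cells in the small size group and $N_l$ the number of cells in the large size group. Then the posterior distribution for the parameters is of the form

$$
P\left(k^{(p)}_{\text{on}}, k^{(p)}_{\text{off}}, r_m \mid \{m_i\}\right) \propto
\left[\prod_{i=1}^{N_s} P\left(m_i \mid k^{(p)}_{\text{on}}, k^{(p)}_{\text{off}}, r_m\right)\right]
\left[\prod_{j=1}^{N_l} P\left(m_j \mid k^{(p)}_{\text{on}}, k^{(p)}_{\text{off}}, 2 r_m\right)\right]
P\left(k^{(p)}_{\text{on}}, k^{(p)}_{\text{off}}, r_m\right),
\tag{S23}
$$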

where we split the product of small and large cells.

For the two-promoter model the prior shown in Eq. S17 requires a small modification. Eq. S21 gives the mean mRNA copy number of a population of asynchronous cells growing at steady state. Given that we assume that the only difference between having one vs. two promoter copies is the