nature20777-s1.pdf

Supplementary Methods

Sections:

Barcode distance matrix computation

Analysis of sources of error in MEMOIR operation, and their impact on reconstruction

accuracy

Simulation of MEMOIR operation for missing cells and sparse trees

Inference of the switching rates between the two

Esrrb

gene expression states

1. Barcode distance matrix computation

In this section we describe how we computed the matrix of pairwise barcode distances between cells

(Figure 3e). These distances are based

on the complete barcoded scratchpad profile of each cell,

including both the total number of transcripts, denoted as

푁

, and the fraction of those transcripts that

colocalized with scratchpad, denoted as

푝

First, to account for statistical fluctuations

and errors in the detected fraction of colocalized transcripts,

we converted the measured value of

푝

to a distribution of possible values for

the colocalization fraction.

To account for measurement errors, we imposed a lower bound of 0.04 and an upper bo

und of 0.96 on

the measured value of

푝

To determine the distribution of possible values of

푝

, we assumed that

fluctuations follow a binomial distribution with mean

푁푝

and variance

푁푝

(

1

−

푝

)

. This assumption

follows from the physical reasoning that

the probability of colocalization of one transcript is

independent from that of other transcripts in the cell.

The distribution of possible values of the colocalization fraction,

푥

, takes the form,

퐹

(

푥푁

)

푁

푥

푝

−

푝

To keep numerical compu

tations tractable, we approximated the factorials in the binomial coefficient

using gamma functions, i.e.

푁

≈

훤

(

푁

)

To compute the cell

cell distance for a single barcode, we quantified how the colocalization fraction

distribution differs for a given barcode between two cells. We denote the distribution of colocalization

WWW.NATURE.COM/NATURE | 1

PPLEMENTAR

Y INFORMATION

doi

:10.10

38/na

ture

fraction of barcode

푖

in cell

푎

퐹

(

푥

)

and the comparable distribution in cell

푏

퐹

(

푦

)

. We define the

barcode

푑

(

푎

푏

)

distance between the two cells as,

푑

푎

푏

퐹

푥

퐹

푦

푥

−

푦

푑푥푑푦

Finally, the overall cell

cell distance based on all barcode information wa

s calculated as the average

over the distances derived for each barcode weighted for each barcode by the smaller number of

barcode transcripts in either of the two cells. This ensures that barcodes that are highly expressed in

both cells are weighted more

heavily than barcodes that are weakly expressed in either or both cells.

Denoting the number of transcripts of barcode

푖

in cell

푎

푁

, and similarly for cell

푏

, the cell

cell

distance matrix, shown in Fig. 3e, then takes the form,

푑

푎

푏

min

푁

푑

(

푎

푏

)

2. Analysis of sources of error in MEMOIR operation, and their impact on reconstruction accuracy

2a. Incorporating noise into MEMOIR simulations

To understand what factors impact the overall reconstruction errors in MEMOIR, we

simulated the

recording, readout, and reconstruction processes, incorporating relevant sources of stochastic

fluctuations (noise). Sources of error can be divided into three categories: (1) noise inherent to the basic

mechanism of stochastic scratchpad co

llapse, (2) recording noise, and (3) readout noise (see Extended

Data Fig. 6a).

Noise inherent to the system

The first noise category captures fluctuations due to the general

approach of recording lineage information in discrete stochastic mutagenic (c

ollapse) events, but does

not

include noise arising from the specific experimental implementation of the system (those noise

sources are described below). Reconstruction errors in this category result from collapse events that by

chance occur too frequentl

y or too i

nfrequently

to provide lineage information, as well as identical

collapse events occurring in different cells (coincidences). For example, if the exact same scratchpad

collapse event(s) occur independently in two sister cel

ls,

the resulting clades will be indistinguishable

The simulations shown in Figure 3g and Extended Data Figure 8 assume irreversible collapse of an

idealized set of binary scratchpads, occurring at a constant rate and therefore incorporate only this

WWW.NATURE.COM/NATURE | 2

SUPPLEMENTARY INFORMATION

RESEARCH

doi:10.10

8/na ture

20777

source of error. Analysis of these simulations thus sho

ws how the stochasticity of collapse events

impacts reconstruction accuracy in the absence of additional sources of noise.

Recording noise.

Recording noise accounts for fluctuations in the collapse rate due to stochastic

fluctuations in the expression

vels

of gRNA and Cas9. To accurately represent the magnitude of these

effects, we first measured these fluctuations experimentally, and then incorporated empirical values for

these fluctuations into

the

simulations.

First, we quantified gRNA expression v

ariability both within colonies (intra

colony) and between colonies

(inter

colony) using the co

expressed fluorescent reporter in individual cells in all of the 108 MEM

colonies. We denote the gRNA expression in cell

푖

in colony

푚

푔

. (Extended

Data Fig. 6b,c). The

intra

colony noise was computed by first determining the variance of the single

cell measurements

amongst cells in each individual colony and then averaging that variance across all colonies. Namely,

휎

푁

푛

푔

−

푛

푔

Here,

푁

is the total number of colonies, and

푛

is the number of cells in colony

푚

Similarly, the inter

colony noise was computed by taking the mean of

푔

amongst the cells in a given

colony and then computing the variance of the colony means across all colonies.

휏

푁

푛

푔

−

푔

where

푔

is the mean gRNA expression across all cells in all colonies,

푔

푁

푛

푔

WWW.NATURE.COM/NATURE | 3

SUPPLEMENTARY INFORMATION

RESEARCH

doi:10.10

8/na ture

20777

Second, we quantified variability in Cas9 expression using the same analysis, performed on the number

of Cas9 transcripts,

푐

, read out using smFISH in the final round of hybridization. We denote the intra

colony and inter

colony variation in Cas9 expressi

on as

휎

and

휏

, respectively, and the mean Cas9

expression level across all cell in all colonies as

푐

We next incorporated the empirically determined noise measurements for Cas9 and gRNA into the

simulations. We set the effective collapse rate in each cell,

휆

, proportional to the product of

the

number

of Cas9 transcripts,

푐

, and

the

gRNA activity,

푔

휆

푔푐

. To incorporate recording noise, we assumed

each colony had its own values for

푐

and

푔

, drawn from the observed intercolony distributions with

means

푐

and

푔

and standard deviations

휏

and

휏

respectively. Next, we incorporated the intra

colon

noise, by adding fluctuations to values of

푐

and

푔

in each cell in a given colony, with the magnitude of

fluctuations set by the empirical intra

colony variations,

휎

and

휎

. The probability of collapse for each

scratchpad per generation,

푝

, was c

alculated from

휆

assuming a Poisson process, namely,

푝

−

푒

The normalization constant,

휂

, was selected to ensure an average collapse probability of 0.1 per

scratchpad per generation, consistent with the empirically estimated average collapse rate in the MEM

01 colonies (see Fig. 2c and Methods). We verified that the resulting fluctuations in

the collapse rate

(from roughly 0.05 to 0.2 per scratchpad per generation) were consistent with the experimentally

observed fluctuations in the loss of colocalization fraction between individual cells across the 108 MEM

01 colonies used in the analysis sho

wn in Figure 3. Overall, as shown in

Fig. 3i and

Extended Data Fig. 6d,

this recording noise produced only a relatively small decrease in the reconstruction accuracy of the

simulated colonies.

Readout noise.

This category includes (a) stochastic express

ion of individual barcoded scratchpads, (b

)

the effects of

multiple integrations of the same barcoded s

cratchpad type, and (c) errors

in smFISH

imaging readout. To simulate the impact of readout noise on reconstruction accuracy, we first obtained

the empir

ical distribution of single

cell transcript counts of each of the 13 scratchpads in the cells in all

108 analyzed MEM

01 colonies (see Figure below). The transcript count distributions for the barcoded

scratchpads were well

approximated as gamma distributi

ons (with an average mean of 5.4). The gamma

distribution is expected when there is a constant probability per unit time of initiating a transcriptional

burst, and the number of mRNAs produced per burst follows an exponential distribution, and has been

obs

erved empirically to describe diverse stochastic gene expression

processes

43,44

WWW.NATURE.COM/NATURE | 4

SUPPLEMENTARY INFORMATION

RESEARCH

doi:10.10

8/na ture

20777

Distribution of the number of

transcripts of each barcode in individual cells in the 108 MEM

01 colonies analyzed in Figure 3.

All 13 detected barcodes are shown. Black lines are fits to Gamma distributions.

We quantified the intra

colony and inter

colony variability in the expressio

n levels of the barcoded

scratchpads using the same procedure described above for gRNA and Cas9. In this case, variance was

calculated for each barcode and then summed over 13 barcodes per cell to estimate each cell’s overall

expression variance. These emp

irical parameterizations for the transcript count distributions were

incorporated in

the simulations

For readout, we first associated with each barcode a random number of transcripts drawn from the

gamma distribution associat

ed with that barcode. Second, we incorporated additional noise from

smFISH imaging, based on experimental quantification of the accuracy of smFISH colocalizat

ion

detection rates (Fig. 2c). I

f a scratchpad was intact, then the probability of colocalization,

푝

, was set to

0.87; if a scratchpad was collapsed, the probability of erroneous colocalization,

푝

, was set to 0.05.

These es

timates were obtained from

our measurements of colocalization fractions before activation of

the MEM

01 cells (Fig. 2c), and

are consistent with previous reports on smFISH

fidelity

. The number

of scratchpads that colocalized for each barcode was drawn from binomial distributions based on these

colocalization detection probabilities. Namely, the number of colocalized transcripts

푞

from a total of

푇

transcripts, for an intact integ

ration of a barcode, was drawn from the distribution,

100

200

Cell counts

BC: 2

100

200

BC: 9

100

200

300

BC: 10

100

200

BC: 12

100

200

300

BC: 13

100

200

300

Cell counts

BC: 14

200

400

BC: 16

100

200

300

BC: 20

Transcripts

200

400

600

BC: 21

Transcripts

100

200

300

BC: 22

Transcripts

100

200

300

Cell counts

BC: 23

Transcripts

100

150

BC: 26

Transcripts

200

400

BC: 27

WWW.NATURE.COM/NATURE | 5

SUPPLEMENTARY INFORMATION

RESEARCH

doi:10.10

8/na ture

20777

푝

6HV5MV

푞

;

푇

푞

푝

−

푝

Similarly, the number of colocalized transcripts for a collapsed integration of a barcode was drawn from

the distribution,

푝

MZ[[5\]S^

푞

;

푡

푇

푞

푝

−

푝

Finally, we accounted for multiple integrations of the same barcode as follows: If a barcode had more

than one integration,

the total number of transcripts and the number of colocalized transcripts for that

barcode were determined by summing

the relevant transcripts from each individual integration.

Importantly, the distribution of the expression level of each individual integration was selected such that

the distribution of the total number of transcripts (obtained by taking the convolution

of the single

integration distributions) matched the experimentally observed distribution for that barcode. For

example, if a barcode had two integrations and an expression distribution characterized by a mean of 10

transcripts and shape parameter 2 (from

the gamma fit), then each integration was assumed to have an

expression distribution characterized by a mean of 5 transcripts and shape parameter 1.

This proce

dure produced

an ensemble of simulated colonies, where every cell was assigned a total

number of

transcripts and a number of colocalized transcripts for each barcode (same as the

experimental data). The resulting simulated colonies were then reconstructed with the same neighbor

joining algorithm used on the experimental data (see Methods).

Impact o

f readout noise on overall reconstruction accuracy.

Figure 3i and

Extended Data Figure 6e,f

show

that noise in scratchpad expression and multiple incorporations of the same barcoded scratchpad

account for most of the reconstruction error. By contrast, other components such as recording noise,

noise intrinsic to the MEMOIR design, and noise from smFIS

H imaging fidelity contributed minimally to

the overall reconstruction error.

WWW.NATURE.COM/NATURE | 6

SUPPLEMENTARY INFORMATION

RESEARCH

doi:10.10

8/na ture

20777

Finally, to determine if the addition of fluctuations to the simulations matched the experimentally

observed decrease in reconstruction accuracy, we simulated trees with all th

ree components of noise

for various numbers of integrations per barcode and compared the distribution of reconstruction

accuracy to that observed for the 108 experimental colonies. As shown in

Extended Data Fig

ure

6f, the

distribution of reconstruction acc

uracy obtained from simulations is generally consistent with the

experimentally observed distributi

on, with an effective value of

2 integration sites for each barcode.

Note that no fitting parameters were used besides the empirically measured fluctuations

in the

recording and readout components. Together,

these results indicate that

this implementation of

EMOIR behaved as expected of a

system wit

h 13 scratchpads and realistic

sources of noise. The main

sources of error, the stochastic expression of scratc

hpad transcripts and variable expression from

multiple integrations of the same barcoded scratchpad types, can be improved in future

implementations.

2b. Potential solutions to address sources of noise and reduce errors

Here we describe ways to mitigate

the key sources of noise currently affecting MEMOIR reconstruction

accuracy:

●

Noise inherent to the system.

Because collapse events are stochastic, similar collapse patterns

can arise by chance independently, creating ambiguities. The likelihood of such er

rors decreases

rapidly as the number and diversity of barcoded scratchpads is increased. Further, tunable

control of Cas9 and gRNA expression or scratchpad affinity (e.g., by varying gRNA target

complementarity) allows average collapse rate to be optimized

for different regimes, including

accessing more generations (Extended Data Fig. 8b) or low collapse rates for studying sparse

trees (Extended Data Fig. 9).

●

Readout noise: Multiple incorporations of barcoded scratchpad types.

This is currently a

significan

t source of noise, as shown in Extended Data Fig. 6, and it is readily addressed as

above, e.g., by starting from a larger and more diverse pool of barcodes.

●

Readout noise: Stochastic scratchpad expression.

Individual scratchpads are read out at the

RNA le

vel, and therefore subject to fluctuations in transcription. This source of noise that can be

addressed by using 1) stronger promoters, 2) site

specific integration in well

expressed loci,

and/or 3) stabilized RNA transcripts (in order to build up signal o

ver longer periods).

Alternatively, barcode expression noise can be circumvented by employing other in situ labeling

methods, like single

molecule DNA FISH.

WWW.NATURE.COM/NATURE | 7

SUPPLEMENTARY INFORMATION

RESEARCH

doi:10.10

8/na ture

20777

3. Simulation of MEMOIR operation for missing cells and sparse trees

3a. Missing cells

To simulate

reconstruction for trees with missing

cells (

Extended Data Fig. 8c

the

forward algorithm

described

in Methods

was used to generate a full binary tree of three generations, which was then

pruned by removing a given number of randomly chosen endpoint cells. To reconstruct the lineage

relationships of the remaining cells, we assigned each of the possible 315 reconst

ructions a score given

퐻

−

푇

, where

퐻

is the Hamming distance between the barcode

d scratchpad collapse

patterns

of cells

푖

and

푗

푇

is the lineage distance between cells

푖

and

푗

from that particular

reconstruction (

for si

sters,

for first

cousins, and

for second

cousins), and the summation runs over

all pairs of cells. Matrices

퐻

and

푇

were normalized so that their elements summed to one.

This

reconstruction score format was useful not only in scoring the topology of

possible lineage

arrangements, but also in estimating branch lengths (e.g., whether two cells were likely closely related

sisters or more distantly related cousins).

The reconstruction with the lowest score was selected as the

reconstruct

ed tree

and used t

o compute the fraction of

relationships that were

correct

ly identified (i.e.,

fraction of correct relationships)

shown in the heat

maps.

If multiple reconstructions had the same

lowest score, the fraction of correct relationships was averaged over all of t

hem.

3b. Simulation of MEMOIR operation for sparse trees

To simulate sparse trees, we assumed collapse events in a given lineage occur

specified constant

collapse rate

(per cell per generation). The sparse regime occurs when λ <

1. Because generating

hundreds of collapse events requires large trees of many generations, we did not simulate the

underlying proliferating binary tree. Instead, we directly generated a tree of collapse events with

variable branch lengths corresponding t

o the stochastic time intervals between collapse events.

Assuming a constant rate of collapse (a Poisson process), we randomly drew branch lengths from an

exponential distribution with the time scale set by the collapse rate, i.e. mean branch length

∼

1/λ

. At

any given level of the tree, if the tree had

푁

branches, the time interval to the next branching event was

selected from an exponential distribution with rate

푁휆

, and the branching event was randomly assigned

to one of the

푁

branches. The branch

ing process was continued until the desired number of leaves was

generated. Extended Data Figure 9c,d shows examples of sparse trees with various number of leaves

and tree depths (defined as the cumulative number of collapse events since the common ancesto

experienced by each leaf averaged over all the leaves of the tree). For clarity, the approximate

WWW.NATURE.COM/NATURE | 8

SUPPLEMENTARY INFORMATION

RESEARCH

doi:10.10

8/na ture

20777

correspondence between these sparse trees (characterized by number of leaves and tree depth) and

underlying full binary trees (characterized by collapse rate

and number of generations) was determined

by simulating full binary trees in the sparse regime and then finding parameters that generated effective

sparse trees with the desired number of leaves and depth (see the cartoon in Extended Data Fig. 9a).

The co

llapse rate and number of generations of the full binary trees corresponding to the example

sparse trees are reported in Extended Data Figure 9c,d. To reconstruct sparse trees, we used the

generated collapse pattern at the leaves and a Sankoff maximum pars

imony

algorithm

. The fraction of

correct partitions identified by the reconstructed trees was calculated using the Robinson

Foulds

distance

metric

. Sim

ulations were implemented in Matlab, and Python using the Biopython

package

and ETE3

toolkit

4. Inference of the switching rates between the two

Esrrb

gene expression states

In this section, we describe how we inferred the transition rates between the low and high expression

states of

Esrrb

. In 4

a, we compute the occurrence frequencies of pairs of cells in the

same gene

expression state. In 4

b, we present a straightforward inf

erence of the transition rates from the decay in

the observed frequency of pairs of cells in the same

expression state. Finally, in 4

c, we present a more

involved but more accurate inference algorithm.

a. Occurrence frequency of related pairs of cells in the same

Esrrb

gene expression state on MEMOIR

trees

In this subsection, we describe how the co

occurrence frequencies of pairs of related cells in the same

Esrrb

expression state can be extracted from t

he MEMOIR reconstructed lineage trees and endpoint

gene expression measurements.

To compute the frequencies, we first assign each cell a probability of being in one of the two

Esrrb

gene

expression states based on its

Esrrb

mRNA count. To do so, we fit th

e bimodal distribution of single

cell

Esrrb

transcript counts (Fig

ure 4b

) with a weighted sum of two negative binomial distributions, each

corresponding to one gene expression state. We denote the distribution corresponding to the

Esrrb

low

expression stat

e as

푃

(

푚

)

, and the distribution corresponding to the

Esrrb

high expression state as

푃

(

푚

)

, where

푚

is the number of transcripts in an individual cell. The bimodal distribution is fit to

푓

푃

푚

−

푓

푃

(

푚

)

, where

푓

is an additional fitting parameter

that corresponds to the population

fraction of cells in the low expression state.

WWW.NATURE.COM/NATURE | 9

SUPPLEMENTARY INFORMATION

RESEARCH

doi:10.10

8/na ture

20777

With the fit parameters determined, the probability that an individual cell with

푛

Esrrb

mRNA molecules

is in the low expression state (E

) is

푝

(

푛

)

푓푃

푛

푓푃

푛

(

−

푓

)

푃

푛

Similarly, the probability that the cell is in the high gene expression state (E+) is simply

푝

푛

−

푝

(

푛

)

Next, we generalize this approach to jointly observing the states of a pair of cells. If cell 1 is in the E

state with probab

ility

푝

and cell 2 is in the E

state with probability

푝

then the probability that both

cells are simultaneously in E

states is

푝

. Similarly, the probability of observing cell 1 in the E+

state and cell 2 in the E

state is

푝

. We can denote the probability of observing the two cells in any

of the four possible pairs of states by a

matrix,

푝

The entries in this matrix must sum to

. Finally, to compute the occurrence fre

quency of a pair of

sisters cells in each of the four possible pairs of states, we average the above matrix over all pairs of

sisters cells.

푪

푁

]6]VSm]

푝

where the summation runs over

all

푁

]6]VSm]

pairs of sisters cells in the 30 colonies used in the analysis.

We note that since the ordering of the sister pair is arbitrary, we turn the resulting matrix symmetric by

averaging the off

diagonal terms.

푪

is a 2x2 matrix for siste

rs. The argument ‘1’ denotes the fact that sisters are 1 generation apart.

Similarly, we define

푪

as the frequency of observing a pair of first

cousin cells in a given pair of

WWW.NATURE.COM/NATURE | 10

SUPPLEMENTARY INFORMATION

RESEARCH

doi:10.10

8/na ture

20777

states

푪

(

)

, is determined by taking the average over all the observed pairs

of first

cousin cells. The

occurrence frequency of sisters, first cousins, and second cousins (

푪

(

)

) in the same

Esrrb

expression

state (either both E

, or both E+) is plotted in Figure

Finally, we estimated the statistical error on the observed

frequencies in two ways

(error bars in Figure

). First, from the total number of observed pairs of cells and the computed frequencies, we estimated

the counting error assuming a multinomial distribution. That is, for an observed frequency

푓

computed

usi

푁

pairs of cells, the statistical error was estimated as

푓

−

푓

푁

. To understand how much of

the similarity observed between states of closely related cells was just due to chance, we randomly

permuted the lineage relationships within each colony and r

ecomputed the frequencies. The statistical

error was estimated as the variance of the frequencies computed over 1000 iterations of such random

permutations. The first error estimate only accounts for the finite size of the total number of

observations. The

second estimate accounts for correlations between the states of closely related cells

due to chance with the number of different gene expression states in each colony held fixed. The two

methods produced similar estimates reflecting the level intra

colony

heterogeneity. Because the two

approaches do not capture independent statistical fluctuations, we combined them by taking the

maximum of the two estimates.

b. Inference of switching dynamics from reconstructed lineage trees and endpoint gene expression

measurements

Here, we describe the inference of transition rates from the frequencies derived in the previous section.

To do so, we first derive the relationship between transitions rates and the expected probability of

observing a pair of related cells in

the same gene expression state.

We define a minimal model of stochastic and memoryless transitions between the two expression

states. We assume that the probability of transitioning from the E

state to the E+ state per generation is

constant and denote

it as

푇

(

퐸

)

. The probability that a cell in the E

state will retain its state from

one generation to the next is give by

푇

퐸

−

푇

(

퐸

)

. Similarly, the probability of

transitioning out of the E+ state per generation is denoted as

푇

(

퐸

)

. The transition probabilities can

be combined together into a transition rate matrix,

WWW.NATURE.COM/NATURE | 11

SUPPLEMENTARY INFORMATION

RESEARCH

doi:10.10

8/na ture

20777

푻

푇

(

퐸

)

푇

(

퐸

)

푇

(

퐸

)

푇

(

퐸

)

First, we present a simple approach for inferring the transition rate matrix from the decay in the

observed frequencie

s of pairs of cells in the same gene expression state going from first

cousins to

second

cousins. Next, we present a more involved derivation that also takes into account the value of

the observed frequencies in addition to how they decay from closely rela

ted cells to more distant

relatives.

We can relate the frequency of observing second

cousins in the same state (either both in E

or both E+)

to the frequency of observing the first

cousins in a particular pair of states by taking into account all

possibl

e state transitions going from one generation to the next. It follows,

푪

−

푇

-F

−

푇

-F

푪

푇

F-

−

푇

-F

푪

-F

푇

F-

푇

F-

푪

(

)

푪

−

푇

F-

−

푇

F-

푪

푇

-F

−

푇

F-

푪

F-

푇

-F

푇

-F

푪

(

)

We numerically solved the above two non

linear equations for

푇

-F

and

푇

F-

for the observed values of

푪

(

)

and

푪

(

)

(the observed frequencies were first corrected for misclassifications of the two states

caused by measurement uncertainties, see below). To estimate the statistical error on the inferred

transitions rates, we produced simulated frequency matrices based on

the estimated statistical error of

the frequencies (see previous section). We solved for the transition rates for 10000 iterations of the

simulated data and computed the standard deviation over the rates. We found

푇

-F

and

푇

F-

c. A

more accurate inference of the transition rates

Next, to get a more accurate set of inferred transition rates, we incorporate

the values of the

frequencies in addition to their decay over the generations in our calculations. To do so, we first

calculate

the joint

probability of finding a pair of sister cells in states

푖

and

푗

using the transition rate

matrix.

푪

푻

푖

푠

푻

푗

푠

푝

(

푠

)

]

WWW.NATURE.COM/NATURE | 12

SUPPLEMENTARY INFORMATION

RESEARCH

doi:10.10

8/na ture

20777

The summation in above equation runs over the two possible states of the parent cell. Assuming that

e population fractions of the E

and E+ cells do not change significantly during the three generations

spanned by the experiment,

푝

(

퐸

)

and

푝

퐸

can be set approximately equal to the population

fractions derived from fitting the bimodal

Esrrb

transcript

count distribution (see previous section).

Namely,

푝

퐸

푓

and

푝

퐸

−

푓

We experimentally validated the assumption that the

Esrrb

transcript

count distribution does not change significantly in

our experimental conditions ove

a period

48 hours or approximately four generations (

see

Extended Data Fig. 10

To keep the population fractions constant, the number of cells transitioning from the E

state to the E+

state must be equal to the number transitioning in the reverse d

irection. Hence,

푇

퐸

푝

퐸

푇

퐸

푝

(

퐸

)

. With this simplification,

푪

푻

푖

푠

푻

푠

푗

푝

푗

푝

푗

푻

(

푖

푗

)

]

where

푻

(

푖

푗

)

denotes the

푖

푗

th element of the square of the transition matrix. Similarly, the joint

probability of observing a pair of cousins, and second

cousins in states

푖

and

푗

is respectively equal to

푪

푝

푗

푻

(

푖

푗

)

and

푪

푝

(

푗

)

푻

(

푖

푗

)

To infer the state

transition rates, we use

the measured frequencies and population fractions and

solve

the above equations for

푻

However, we first need to introduce a correction to the measured frequencies to account for

misclassification of expression states when the

y are probabilistically assigned. Namely, it is more likely

to misclassify a cell in the E

state as E+ than vice versa because of the asymmetrical shape of the

distributions

푃

(

푚

)

(see the fits in Figure 4b

). To correct for misclassifications, we first

compute

the

probability of assigning a cell in state

푗

mistakenly to state

푖

by simply integrating the

probability that a

a cell with

푛

Esrrb

transcripts

is assigned

to state

푖

(computed in the previous section) over the

robability distribution of observing

푛

transcripts in a cell in state

푗

푸

푓푃

푛

푓푃

푛

−

푓

푃

푛

푃

(

푛

)

푑푛

WWW.NATURE.COM/NATURE | 13

SUPPLEMENTARY INFORMATION

RESEARCH

doi:10.10

8/na ture

20777

For our analysis, E

cells are mistakenly assigned to E+ cells with probability 0.12, whereas E+ cells are

mistakenly assigned to the

state with probability 0.04.

Matrix

푸

effectively introduces apparent transitions between the two states at the measurement stage.

These additional transitions can be explicitly included in the calculation of the joint

probability of finding

a pair o

f sister cells in states

푖

and

푗

푪

푸

풊풌

푻

푘

푠

푸

풋풍

푻

푙

푠

푝

푠

풍

풌

]

푸

풊풌

푻

푘

푠

푻

푠

푙

푸

풍풋

푻

푝

푙

풍

풌

]

푸

풊풌

푻

ퟐ

푘

푙

푝

(

푙

)

푸

풍풋

푻

풍

풌

However, the apparent transitions represented by

푸

do not correspond to actual transiti

ons between

the

Esrrb

expression states. Rather, they represent misclassifications due to uncertainties in the state

assignment. To correct the estimate,

remov

the contributions of these apparent transitions to the

measured frequencies,

apply

ing

the inverse of matrix

푸

to the frequency matrices,

푪

푸

푪

푸

Finally, to infer the transition rates shown in Figure

, we solved for the transition rate matrix

푻

using

the corrected frequency matrix for second

cousins

푪

(

)

푪

(

)

exh

ibits a lower statistical error than

푪

(

)

and

푪

(

)

because the number of pairs of second cousins is larger than the number of pairs of first

cousins and sisters. To validate the inferred rates, we checked that inferring

푻

using the corrected

frequency m

atrices for first

cousins

푪

(

)

and sisters

푪

(

)

produced consistent transition rates. Finally,

to estimate the statistical error of the inferred rates we used the estimated statistical error in the

measured frequencies (see previous section) and produc

ed simulated frequency matrices. The error on

the inferred rates was estimated as the standard deviation of the rates inferred over 10000 iterations of

simulated data.

Results are shown in Figure 4d

WWW.NATURE.COM/NATURE | 14

SUPPLEMENTARY INFORMATION

RESEARCH

doi:10.10

8/na ture

20777

WWW.NATURE.COM/NATURE | 15

SUPPLEMENTARY INFORMATION

RESEARCH

doi:10.10

8/na ture

20777