of 15
1
Supplementary Methods
Sections:
1.
Barcode distance matrix computation
2.
Analysis of sources of error in MEMOIR operation, and their impact on reconstruction
accuracy
3.
Simulation of MEMOIR operation for missing cells and sparse trees
4.
Inference of the switching rates between the two
Esrrb
gene expression states
1. Barcode distance matrix computation
In this section we describe how we computed the matrix of pairwise barcode distances between cells
(Figure 3e). These distances are based
on the complete barcoded scratchpad profile of each cell,
including both the total number of transcripts, denoted as
, and the fraction of those transcripts that
colocalized with scratchpad, denoted as
.
First, to account for statistical fluctuations
and errors in the detected fraction of colocalized transcripts,
we converted the measured value of
to a distribution of possible values for
the colocalization fraction.
To account for measurement errors, we imposed a lower bound of 0.04 and an upper bo
und of 0.96 on
the measured value of
.
To determine the distribution of possible values of
, we assumed that
fluctuations follow a binomial distribution with mean
푁푝
and variance
푁푝
(
1
)
. This assumption
follows from the physical reasoning that
the probability of colocalization of one transcript is
independent from that of other transcripts in the cell.
The distribution of possible values of the colocalization fraction,
, takes the form,
(
푥푁
)
=
*+
1
-
.
*
+
To keep numerical compu
tations tractable, we approximated the factorials in the binomial coefficient
using gamma functions, i.e.
!
(
+
1
)
.
To compute the cell
-
to
-
cell distance for a single barcode, we quantified how the colocalization fraction
distribution differs for a given barcode between two cells. We denote the distribution of colocalization
WWW.NATURE.COM/NATURE | 1
SU
PPLEMENTAR
Y INFORMATION
doi
:10.10
38/na
ture
2
07
77
fraction of barcode
in cell
as
5
6
(
)
and the comparable distribution in cell
as
8
6
(
)
. We define the
barcode
6
(
,
)
distance between the two cells as,
6
,
=
5
6
8
6
|
|
푑푥푑푦
-
=
-
=
Finally, the overall cell
-
to
-
cell distance based on all barcode information wa
s calculated as the average
over the distances derived for each barcode weighted for each barcode by the smaller number of
barcode transcripts in either of the two cells. This ensures that barcodes that are highly expressed in
both cells are weighted more
heavily than barcodes that are weakly expressed in either or both cells.
Denoting the number of transcripts of barcode
in cell
as
5
6
, and similarly for cell
, the cell
-
to
-
cell
distance matrix, shown in Fig. 3e, then takes the form,
,
=
min
5
6
,
8
6
6
(
,
)
6
2. Analysis of sources of error in MEMOIR operation, and their impact on reconstruction accuracy
2a. Incorporating noise into MEMOIR simulations
To understand what factors impact the overall reconstruction errors in MEMOIR, we
simulated the
recording, readout, and reconstruction processes, incorporating relevant sources of stochastic
fluctuations (noise). Sources of error can be divided into three categories: (1) noise inherent to the basic
mechanism of stochastic scratchpad co
llapse, (2) recording noise, and (3) readout noise (see Extended
Data Fig. 6a).
Noise inherent to the system
.
The first noise category captures fluctuations due to the general
approach of recording lineage information in discrete stochastic mutagenic (c
ollapse) events, but does
not
include noise arising from the specific experimental implementation of the system (those noise
sources are described below). Reconstruction errors in this category result from collapse events that by
chance occur too frequentl
y or too i
nfrequently
to provide lineage information, as well as identical
collapse events occurring in different cells (coincidences). For example, if the exact same scratchpad
collapse event(s) occur independently in two sister cel
ls,
the resulting clades will be indistinguishable
.
The simulations shown in Figure 3g and Extended Data Figure 8 assume irreversible collapse of an
idealized set of binary scratchpads, occurring at a constant rate and therefore incorporate only this
WWW.NATURE.COM/NATURE | 2
SUPPLEMENTARY INFORMATION
RESEARCH
doi:10.10
3
8/na ture
20777
source of error. Analysis of these simulations thus sho
ws how the stochasticity of collapse events
impacts reconstruction accuracy in the absence of additional sources of noise.
Recording noise.
Recording noise accounts for fluctuations in the collapse rate due to stochastic
fluctuations in the expression
le
vels
of gRNA and Cas9. To accurately represent the magnitude of these
effects, we first measured these fluctuations experimentally, and then incorporated empirical values for
these fluctuations into
the
simulations.
First, we quantified gRNA expression v
ariability both within colonies (intra
-
colony) and between colonies
(inter
-
colony) using the co
-
expressed fluorescent reporter in individual cells in all of the 108 MEM
-
01
colonies. We denote the gRNA expression in cell
in colony
as
6
C
. (Extended
Data Fig. 6b,c). The
intra
-
colony noise was computed by first determining the variance of the single
-
cell measurements
amongst cells in each individual colony and then averaging that variance across all colonies. Namely,
E
F
=
1
1
C
6
C
1
C
6
C
H
I
6
J
-
F
H
I
6
J
-
+
C
J
-
Here,
is the total number of colonies, and
C
is the number of cells in colony
.
Similarly, the inter
-
colony noise was computed by taking the mean of
6
C
amongst the cells in a given
colony and then computing the variance of the colony means across all colonies.
E
F
=
1
1
C
6
C
H
I
6
J
-
F
+
C
J
-
where
is the mean gRNA expression across all cells in all colonies,
=
1
1
C
6
C
H
I
6
J
-
+
C
J
-
WWW.NATURE.COM/NATURE | 3
SUPPLEMENTARY INFORMATION
RESEARCH
doi:10.10
3
8/na ture
20777
Second, we quantified variability in Cas9 expression using the same analysis, performed on the number
of Cas9 transcripts,
, read out using smFISH in the final round of hybridization. We denote the intra
-
colony and inter
-
colony variation in Cas9 expressi
on as
M
and
M
, respectively, and the mean Cas9
expression level across all cell in all colonies as
.
We next incorporated the empirically determined noise measurements for Cas9 and gRNA into the
simulations. We set the effective collapse rate in each cell,
, proportional to the product of
the
number
of Cas9 transcripts,
, and
the
gRNA activity,
:
=
푔푐
. To incorporate recording noise, we assumed
each colony had its own values for
and
, drawn from the observed intercolony distributions with
means
and
and standard deviations
M
and
E
respectively. Next, we incorporated the intra
-
colon
y
noise, by adding fluctuations to values of
and
in each cell in a given colony, with the magnitude of
fluctuations set by the empirical intra
-
colony variations,
M
and
E
. The probability of collapse for each
scratchpad per generation,
, was c
alculated from
assuming a Poisson process, namely,
=
1
.
PQ
.
The normalization constant,
, was selected to ensure an average collapse probability of 0.1 per
scratchpad per generation, consistent with the empirically estimated average collapse rate in the MEM
-
01 colonies (see Fig. 2c and Methods). We verified that the resulting fluctuations in
the collapse rate
(from roughly 0.05 to 0.2 per scratchpad per generation) were consistent with the experimentally
observed fluctuations in the loss of colocalization fraction between individual cells across the 108 MEM
-
01 colonies used in the analysis sho
wn in Figure 3. Overall, as shown in
Fig. 3i and
Extended Data Fig. 6d,
this recording noise produced only a relatively small decrease in the reconstruction accuracy of the
simulated colonies.
Readout noise.
This category includes (a) stochastic express
ion of individual barcoded scratchpads, (b
)
the effects of
multiple integrations of the same barcoded s
cratchpad type, and (c) errors
in smFISH
imaging readout. To simulate the impact of readout noise on reconstruction accuracy, we first obtained
the empir
ical distribution of single
-
cell transcript counts of each of the 13 scratchpads in the cells in all
108 analyzed MEM
-
01 colonies (see Figure below). The transcript count distributions for the barcoded
scratchpads were well
-
approximated as gamma distributi
ons (with an average mean of 5.4). The gamma
distribution is expected when there is a constant probability per unit time of initiating a transcriptional
burst, and the number of mRNAs produced per burst follows an exponential distribution, and has been
obs
erved empirically to describe diverse stochastic gene expression
processes
43,44
.
WWW.NATURE.COM/NATURE | 4
SUPPLEMENTARY INFORMATION
RESEARCH
doi:10.10
3
8/na ture
20777
Distribution of the number of
transcripts of each barcode in individual cells in the 108 MEM
-
01 colonies analyzed in Figure 3.
All 13 detected barcodes are shown. Black lines are fits to Gamma distributions.
We quantified the intra
-
colony and inter
-
colony variability in the expressio
n levels of the barcoded
scratchpads using the same procedure described above for gRNA and Cas9. In this case, variance was
calculated for each barcode and then summed over 13 barcodes per cell to estimate each cell’s overall
expression variance. These emp
irical parameterizations for the transcript count distributions were
incorporated in
the simulations
.
For readout, we first associated with each barcode a random number of transcripts drawn from the
gamma distribution associat
ed with that barcode. Second, we incorporated additional noise from
smFISH imaging, based on experimental quantification of the accuracy of smFISH colocalizat
ion
detection rates (Fig. 2c). I
f a scratchpad was intact, then the probability of colocalization,
M
, was set to
0.87; if a scratchpad was collapsed, the probability of erroneous colocalization,
S
, was set to 0.05.
These es
timates were obtained from
our measurements of colocalization fractions before activation of
the MEM
-
01 cells (Fig. 2c), and
are consistent with previous reports on smFISH
fidelity
16
,
17
. The number
of scratchpads that colocalized for each barcode was drawn from binomial distributions based on these
colocalization detection probabilities. Namely, the number of colocalized transcripts
from a total of
transcripts, for an intact integ
ration of a barcode, was drawn from the distribution,
0
50
0
100
200
Cell counts
BC: 2
0
50
100
0
100
200
BC: 9
0
10
20
30
0
100
200
300
BC: 10
0
50
0
100
200
BC: 12
0
50
0
100
200
300
BC: 13
0
10
20
30
0
100
200
300
Cell counts
BC: 14
0
10
20
0
200
400
BC: 16
0
10
20
30
0
100
200
300
BC: 20
0
5
10
15
Transcripts
0
200
400
600
BC: 21
0
20
40
Transcripts
0
100
200
300
BC: 22
0
10
20
30
Transcripts
0
100
200
300
Cell counts
BC: 23
0
50
Transcripts
0
50
100
150
BC: 26
0
10
20
Transcripts
0
200
400
BC: 27
WWW.NATURE.COM/NATURE | 5
SUPPLEMENTARY INFORMATION
RESEARCH
doi:10.10
3
8/na ture
20777
6HV5MV
;
=
M
X
1
M
Y
.
X
Similarly, the number of colocalized transcripts for a collapsed integration of a barcode was drawn from
the distribution,
MZ[[5\]S^
;
=
S
X
1
S
Y
.
X
Finally, we accounted for multiple integrations of the same barcode as follows: If a barcode had more
than one integration,
the total number of transcripts and the number of colocalized transcripts for that
barcode were determined by summing
the relevant transcripts from each individual integration.
Importantly, the distribution of the expression level of each individual integration was selected such that
the distribution of the total number of transcripts (obtained by taking the convolution
of the single
-
integration distributions) matched the experimentally observed distribution for that barcode. For
example, if a barcode had two integrations and an expression distribution characterized by a mean of 10
transcripts and shape parameter 2 (from
the gamma fit), then each integration was assumed to have an
expression distribution characterized by a mean of 5 transcripts and shape parameter 1.
This proce
dure produced
an ensemble of simulated colonies, where every cell was assigned a total
number of
transcripts and a number of colocalized transcripts for each barcode (same as the
experimental data). The resulting simulated colonies were then reconstructed with the same neighbor
-
joining algorithm used on the experimental data (see Methods).
Impact o
f readout noise on overall reconstruction accuracy.
Figure 3i and
Extended Data Figure 6e,f
show
that noise in scratchpad expression and multiple incorporations of the same barcoded scratchpad
account for most of the reconstruction error. By contrast, other components such as recording noise,
noise intrinsic to the MEMOIR design, and noise from smFIS
H imaging fidelity contributed minimally to
the overall reconstruction error.
WWW.NATURE.COM/NATURE | 6
SUPPLEMENTARY INFORMATION
RESEARCH
doi:10.10
3
8/na ture
20777
Finally, to determine if the addition of fluctuations to the simulations matched the experimentally
observed decrease in reconstruction accuracy, we simulated trees with all th
ree components of noise
for various numbers of integrations per barcode and compared the distribution of reconstruction
accuracy to that observed for the 108 experimental colonies. As shown in
Extended Data Fig
ure
6f, the
distribution of reconstruction acc
uracy obtained from simulations is generally consistent with the
experimentally observed distributi
on, with an effective value of
2 integration sites for each barcode.
Note that no fitting parameters were used besides the empirically measured fluctuations
in the
recording and readout components. Together,
these results indicate that
this implementation of
M
EMOIR behaved as expected of a
system wit
h 13 scratchpads and realistic
sources of noise. The main
sources of error, the stochastic expression of scratc
hpad transcripts and variable expression from
multiple integrations of the same barcoded scratchpad types, can be improved in future
implementations.
2b. Potential solutions to address sources of noise and reduce errors
Here we describe ways to mitigate
the key sources of noise currently affecting MEMOIR reconstruction
accuracy:
Noise inherent to the system.
Because collapse events are stochastic, similar collapse patterns
can arise by chance independently, creating ambiguities. The likelihood of such er
rors decreases
rapidly as the number and diversity of barcoded scratchpads is increased. Further, tunable
control of Cas9 and gRNA expression or scratchpad affinity (e.g., by varying gRNA target
complementarity) allows average collapse rate to be optimized
for different regimes, including
accessing more generations (Extended Data Fig. 8b) or low collapse rates for studying sparse
trees (Extended Data Fig. 9).
Readout noise: Multiple incorporations of barcoded scratchpad types.
This is currently a
significan
t source of noise, as shown in Extended Data Fig. 6, and it is readily addressed as
above, e.g., by starting from a larger and more diverse pool of barcodes.
Readout noise: Stochastic scratchpad expression.
Individual scratchpads are read out at the
RNA le
vel, and therefore subject to fluctuations in transcription. This source of noise that can be
addressed by using 1) stronger promoters, 2) site
-
specific integration in well
-
expressed loci,
and/or 3) stabilized RNA transcripts (in order to build up signal o
ver longer periods).
Alternatively, barcode expression noise can be circumvented by employing other in situ labeling
methods, like single
-
molecule DNA FISH.
WWW.NATURE.COM/NATURE | 7
SUPPLEMENTARY INFORMATION
RESEARCH
doi:10.10
3
8/na ture
20777
3. Simulation of MEMOIR operation for missing cells and sparse trees
3a. Missing cells
To simulate
reconstruction for trees with missing
cells (
Extended Data Fig. 8c
),
the
forward algorithm
described
in Methods
was used to generate a full binary tree of three generations, which was then
pruned by removing a given number of randomly chosen endpoint cells. To reconstruct the lineage
relationships of the remaining cells, we assigned each of the possible 315 reconst
ructions a score given
by
6a
6a
b
6
,
a
c
F
, where
6a
is the Hamming distance between the barcode
d scratchpad collapse
patterns
of cells
and
,
6a
is the lineage distance between cells
and
from that particular
reconstruction (
1
for si
sters,
2
for first
cousins, and
3
for second
cousins), and the summation runs over
all pairs of cells. Matrices
and
were normalized so that their elements summed to one.
This
reconstruction score format was useful not only in scoring the topology of
possible lineage
arrangements, but also in estimating branch lengths (e.g., whether two cells were likely closely related
sisters or more distantly related cousins).
The reconstruction with the lowest score was selected as the
reconstruct
ed tree
and used t
o compute the fraction of
relationships that were
correct
ly identified (i.e.,
fraction of correct relationships)
shown in the heat
-
maps.
If multiple reconstructions had the same
lowest score, the fraction of correct relationships was averaged over all of t
hem.
3b. Simulation of MEMOIR operation for sparse trees
To simulate sparse trees, we assumed collapse events in a given lineage occur
at
a
specified constant
collapse rate
λ
(per cell per generation). The sparse regime occurs when λ <
1. Because generating
hundreds of collapse events requires large trees of many generations, we did not simulate the
underlying proliferating binary tree. Instead, we directly generated a tree of collapse events with
variable branch lengths corresponding t
o the stochastic time intervals between collapse events.
Assuming a constant rate of collapse (a Poisson process), we randomly drew branch lengths from an
exponential distribution with the time scale set by the collapse rate, i.e. mean branch length
1/λ
. At
any given level of the tree, if the tree had
branches, the time interval to the next branching event was
selected from an exponential distribution with rate
푁휆
, and the branching event was randomly assigned
to one of the
branches. The branch
ing process was continued until the desired number of leaves was
generated. Extended Data Figure 9c,d shows examples of sparse trees with various number of leaves
and tree depths (defined as the cumulative number of collapse events since the common ancesto
r
experienced by each leaf averaged over all the leaves of the tree). For clarity, the approximate
WWW.NATURE.COM/NATURE | 8
SUPPLEMENTARY INFORMATION
RESEARCH
doi:10.10
3
8/na ture
20777
correspondence between these sparse trees (characterized by number of leaves and tree depth) and
underlying full binary trees (characterized by collapse rate
and number of generations) was determined
by simulating full binary trees in the sparse regime and then finding parameters that generated effective
sparse trees with the desired number of leaves and depth (see the cartoon in Extended Data Fig. 9a).
The co
llapse rate and number of generations of the full binary trees corresponding to the example
sparse trees are reported in Extended Data Figure 9c,d. To reconstruct sparse trees, we used the
generated collapse pattern at the leaves and a Sankoff maximum pars
imony
algorithm
45
. The fraction of
correct partitions identified by the reconstructed trees was calculated using the Robinson
-
Foulds
distance
metric
40
. Sim
ulations were implemented in Matlab, and Python using the Biopython
package
46
and ETE3
toolkit
47
.
4. Inference of the switching rates between the two
Esrrb
gene expression states
In this section, we describe how we inferred the transition rates between the low and high expression
states of
Esrrb
. In 4
a, we compute the occurrence frequencies of pairs of cells in the
same gene
expression state. In 4
b, we present a straightforward inf
erence of the transition rates from the decay in
the observed frequency of pairs of cells in the same
expression state. Finally, in 4
c, we present a more
involved but more accurate inference algorithm.
4
a. Occurrence frequency of related pairs of cells in the same
Esrrb
gene expression state on MEMOIR
trees
In this subsection, we describe how the co
-
occurrence frequencies of pairs of related cells in the same
Esrrb
expression state can be extracted from t
he MEMOIR reconstructed lineage trees and endpoint
gene expression measurements.
To compute the frequencies, we first assign each cell a probability of being in one of the two
Esrrb
gene
expression states based on its
Esrrb
mRNA count. To do so, we fit th
e bimodal distribution of single
-
cell
Esrrb
transcript counts (Fig
ure 4b
) with a weighted sum of two negative binomial distributions, each
corresponding to one gene expression state. We denote the distribution corresponding to the
Esrrb
low
expression stat
e as
.
(
)
, and the distribution corresponding to the
Esrrb
high expression state as
h
(
)
, where
is the number of transcripts in an individual cell. The bimodal distribution is fit to
.
+
1
h
(
)
, where
is an additional fitting parameter
that corresponds to the population
fraction of cells in the low expression state.
WWW.NATURE.COM/NATURE | 9
SUPPLEMENTARY INFORMATION
RESEARCH
doi:10.10
3
8/na ture
20777
With the fit parameters determined, the probability that an individual cell with
Esrrb
mRNA molecules
is in the low expression state (E
-
) is
j
.
(
)
=
푓푃
.
푓푃
.
+
(
1
)
h
Similarly, the probability that the cell is in the high gene expression state (E+) is simply
j
h
=
1
j
.
(
)
.
Next, we generalize this approach to jointly observing the states of a pair of cells. If cell 1 is in the E
-
state with probab
ility
j
.
-
and cell 2 is in the E
-
state with probability
j
.
F
then the probability that both
cells are simultaneously in E
-
states is
j
.
-
j
.
F
. Similarly, the probability of observing cell 1 in the E+
state and cell 2 in the E
-
state is
j
h
-
j
.
F
. We can denote the probability of observing the two cells in any
of the four possible pairs of states by a
2
×
2
matrix,
j
.
-
j
.
F
j
h
-
j
.
F
j
.
-
j
h
F
j
h
-
j
h
F
The entries in this matrix must sum to
1
. Finally, to compute the occurrence fre
quency of a pair of
sisters cells in each of the four possible pairs of states, we average the above matrix over all pairs of
sisters cells.
1
=
1
]6]VSm]
j
.
6
j
.
a
j
h
6
j
.
a
j
.
6
j
h
a
j
h
6
j
h
a
6
,
a
where the summation runs over
all
]6]VSm]
pairs of sisters cells in the 30 colonies used in the analysis.
We note that since the ordering of the sister pair is arbitrary, we turn the resulting matrix symmetric by
averaging the off
-
diagonal terms.
1
is a 2x2 matrix for siste
rs. The argument ‘1’ denotes the fact that sisters are 1 generation apart.
Similarly, we define
2
as the frequency of observing a pair of first
-
cousin cells in a given pair of
WWW.NATURE.COM/NATURE | 10
SUPPLEMENTARY INFORMATION
RESEARCH
doi:10.10
3
8/na ture
20777
states
.
(
2
)
, is determined by taking the average over all the observed pairs
of first
-
cousin cells. The
occurrence frequency of sisters, first cousins, and second cousins (
(
3
)
) in the same
Esrrb
expression
state (either both E
-
, or both E+) is plotted in Figure
4d
.
Finally, we estimated the statistical error on the observed
frequencies in two ways
(error bars in Figure
4d
). First, from the total number of observed pairs of cells and the computed frequencies, we estimated
the counting error assuming a multinomial distribution. That is, for an observed frequency
computed
usi
ng
pairs of cells, the statistical error was estimated as
1
. To understand how much of
the similarity observed between states of closely related cells was just due to chance, we randomly
permuted the lineage relationships within each colony and r
ecomputed the frequencies. The statistical
error was estimated as the variance of the frequencies computed over 1000 iterations of such random
permutations. The first error estimate only accounts for the finite size of the total number of
observations. The
second estimate accounts for correlations between the states of closely related cells
due to chance with the number of different gene expression states in each colony held fixed. The two
methods produced similar estimates reflecting the level intra
-
colony
heterogeneity. Because the two
approaches do not capture independent statistical fluctuations, we combined them by taking the
maximum of the two estimates.
4
b. Inference of switching dynamics from reconstructed lineage trees and endpoint gene expression
measurements
Here, we describe the inference of transition rates from the frequencies derived in the previous section.
To do so, we first derive the relationship between transitions rates and the expected probability of
observing a pair of related cells in
the same gene expression state.
We define a minimal model of stochastic and memoryless transitions between the two expression
states. We assume that the probability of transitioning from the E
-
state to the E+ state per generation is
constant and denote
it as
(
h
|
.
)
. The probability that a cell in the E
-
state will retain its state from
one generation to the next is give by
.
|
.
=
1
(
h
|
.
)
. Similarly, the probability of
transitioning out of the E+ state per generation is denoted as
(
.
|
h
)
. The transition probabilities can
be combined together into a transition rate matrix,
WWW.NATURE.COM/NATURE | 11
SUPPLEMENTARY INFORMATION
RESEARCH
doi:10.10
3
8/na ture
20777
=
(
.
|
.
)
(
.
|
h
)
(
h
|
.
)
(
h
|
h
)
First, we present a simple approach for inferring the transition rate matrix from the decay in the
observed frequencie
s of pairs of cells in the same gene expression state going from first
-
cousins to
second
-
cousins. Next, we present a more involved derivation that also takes into account the value of
the observed frequencies in addition to how they decay from closely rela
ted cells to more distant
relatives.
We can relate the frequency of observing second
-
cousins in the same state (either both in E
-
or both E+)
to the frequency of observing the first
-
cousins in a particular pair of states by taking into account all
possibl
e state transitions going from one generation to the next. It follows,
--
3
=
1
-F
1
-F
--
2
+
2
F-
1
-F
-F
2
+
F-
F-
FF
(
2
)
FF
3
=
1
F-
1
F-
FF
2
+
2
-F
1
F-
F-
2
+
-F
-F
--
(
2
)
We numerically solved the above two non
-
linear equations for
-F
and
F-
for the observed values of
(
2
)
and
(
3
)
(the observed frequencies were first corrected for misclassifications of the two states
caused by measurement uncertainties, see below). To estimate the statistical error on the inferred
transitions rates, we produced simulated frequency matrices based on
the estimated statistical error of
the frequencies (see previous section). We solved for the transition rates for 10000 iterations of the
simulated data and computed the standard deviation over the rates. We found
-F
=
0
.
10
±
0
.
07
and
F-
=
0
.
04
±
0
.
03
.
4
c. A
more accurate inference of the transition rates
Next, to get a more accurate set of inferred transition rates, we incorporate
d
the values of the
frequencies in addition to their decay over the generations in our calculations. To do so, we first
calculate
d
the joint
-
probability of finding a pair of sister cells in states
and
using the transition rate
matrix.
6a
1
=
(
)
]
J
j
v
,
j
w
WWW.NATURE.COM/NATURE | 12
SUPPLEMENTARY INFORMATION
RESEARCH
doi:10.10
3
8/na ture
20777
The summation in above equation runs over the two possible states of the parent cell. Assuming that
th
e population fractions of the E
-
and E+ cells do not change significantly during the three generations
spanned by the experiment,
(
.
)
and
h
can be set approximately equal to the population
fractions derived from fitting the bimodal
Esrrb
transcript
-
count distribution (see previous section).
Namely,
.
=
and
h
=
1
.
We experimentally validated the assumption that the
Esrrb
transcript
-
count distribution does not change significantly in
our experimental conditions ove
r
a period
of
48 hours or approximately four generations (
see
Extended Data Fig. 10
).
To keep the population fractions constant, the number of cells transitioning from the E
-
state to the E+
state must be equal to the number transitioning in the reverse d
irection. Hence,
h
.
.
=
.
h
(
h
)
. With this simplification,
6a
1
=
=
F
(
|
)
]
J
j
v
,
j
w
where
F
(
|
)
denotes the
,
th element of the square of the transition matrix. Similarly, the joint
-
probability of observing a pair of cousins, and second
-
cousins in states
and
is respectively equal to
6a
2
=
x
(
|
)
and
6a
3
=
(
)
y
(
|
)
.
To infer the state
transition rates, we use
d
the measured frequencies and population fractions and
solve
d
the above equations for
.
However, we first need to introduce a correction to the measured frequencies to account for
misclassification of expression states when the
y are probabilistically assigned. Namely, it is more likely
to misclassify a cell in the E
-
state as E+ than vice versa because of the asymmetrical shape of the
distributions
±
(
)
(see the fits in Figure 4b
). To correct for misclassifications, we first
compute
d
the
probability of assigning a cell in state
mistakenly to state
by simply integrating the
probability that a
a cell with
Esrrb
transcripts
is assigned
to state
(computed in the previous section) over the
p
robability distribution of observing
transcripts in a cell in state
.
6a
=
푓푃
6
푓푃
.
+
1
h
a
(
)
푑푛
WWW.NATURE.COM/NATURE | 13
SUPPLEMENTARY INFORMATION
RESEARCH
doi:10.10
3
8/na ture
20777
For our analysis, E
-
cells are mistakenly assigned to E+ cells with probability 0.12, whereas E+ cells are
mistakenly assigned to the
E
-
state with probability 0.04.
Matrix
effectively introduces apparent transitions between the two states at the measurement stage.
These additional transitions can be explicitly included in the calculation of the joint
-
probability of finding
a pair o
f sister cells in states
and
,
6a
1
=
풊풌
풋풍
=
]
J
j
v
,
j
w
풊풌
풍풋
]
J
j
v
,
j
w
=
풊풌
(
)
풍풋
However, the apparent transitions represented by
do not correspond to actual transiti
ons between
the
Esrrb
expression states. Rather, they represent misclassifications due to uncertainties in the state
assignment. To correct the estimate,
we
remov
ed
the contributions of these apparent transitions to the
measured frequencies,
by
apply
ing
the inverse of matrix
to the frequency matrices,
=
.
-
Y
.
-
.
Finally, to infer the transition rates shown in Figure
4d
, we solved for the transition rate matrix
using
the corrected frequency matrix for second
-
cousins
,
(
3
)
.
(
3
)
exh
ibits a lower statistical error than
(
2
)
and
(
1
)
because the number of pairs of second cousins is larger than the number of pairs of first
-
cousins and sisters. To validate the inferred rates, we checked that inferring
using the corrected
frequency m
atrices for first
-
cousins
(
2
)
and sisters
(
1
)
produced consistent transition rates. Finally,
to estimate the statistical error of the inferred rates we used the estimated statistical error in the
measured frequencies (see previous section) and produc
ed simulated frequency matrices. The error on
the inferred rates was estimated as the standard deviation of the rates inferred over 10000 iterations of
simulated data.
Results are shown in Figure 4d
.
WWW.NATURE.COM/NATURE | 14
SUPPLEMENTARY INFORMATION
RESEARCH
doi:10.10
3
8/na ture
20777
WWW.NATURE.COM/NATURE | 15
SUPPLEMENTARY INFORMATION
RESEARCH
doi:10.10
3
8/na ture
20777