Distinguishing biophysical stochasticity from technical noise in single-cell RNA sequencing using Monod

We present the Python package Monod for the analysis of single-cell RNA sequencing count data through biophysical modeling. Monod naturally “integrates” unspliced and spliced count matrices, and provides a route to identifying and studying differential expression patterns that do not cause changes in average gene expression. The Monod framework is open-source and modular, and may be extended to more sophisticated models of variation and further experimental observables. The Monod package can be installed from the command line using pip install monod. The source code is available and maintained at https://github.com/pachterlab/monod. A separate repository, which contains sample data and Python notebooks for analysis with Monod, is accessible at https://github.com/pachterlab/monod_examples/. Structured documentation and tutorials are hosted at https://monod-examples.readthedocs.io/.


Introduction
The interpretation of single-cell transcriptomics data depends on the ability to distinguish between variation in gene expression due to technical noise arising from experimental artifact, and variability reflecting underlying biology. Thus, analysis of single-cell RNA sequencing (scRNA-seq) data begins with "depth normalization," a procedure whose purpose is to account for technical variation in the number of reads sequenced per cell due to the stochastic sampling of reads from cDNA libraries. Additionally, variance-stabilization transformations are applied to account for associations between variance and magnitude of gene expression. These transformations are also premised on technical artifact stemming from stochastic sampling of reads from cDNA libraries. Essentially all scRNAseq analyses begin with these two steps. Other subsequent transformations and procedures to remove technical noise are also commonplace, e.g., dimensionality reduction by principal component analysis [1] and batch correction [2] being two examples.
Despite the omnipresence of normalization as a first step in single-cell RNA sequencing analysis, and extensive studies of its effectiveness in achieving variance-stabilization and uniformity in read depth per cell [3,4], the question of whether normalization can inadvertently remove biological signal has not been thoroughly explored. This is a cause for concern, because investigation of technical noise removal in the context of batch correction [5] has shown that biological signal can be inadvertently removed in an attempt to account for technical artifact. Figure 1 shows that normalization can be similarly problematic. Using differences between cell types to bound biological variation, we find that not only normalization, but all the commonly applied transformations to single-cell RNA sequencing data remove biological signal, especially from highly expressed genes. The UMAP embedding step is particularly egregious; it adds large amounts of non-biological noise at the end of a process intended to remove it.
The removal of biological signal by transformations currently applied to single-cell RNA sequencing data is perhaps not surprising; current workflows are the culmination of experiments with heuristics, and the methods are not grounded in biophysical models. However, there is no reason that scRNA-seq analysis cannot be grounded in rigorous systems biology [6][7][8]. We propose the use of models of transcription with which technical and biological variation can be distinguished on the basis of mechanism. Our approach, via modeling with a chemical master equation (CME) [9], conforms with mechanistic approaches to quantitative biology that originated in the latter half of the twentieth century [10][11][12][13], and that substantially expanded over the past two decades [14][15][16], providing biophysical rationale for the biological component of variation observed in gene expression measurements.
In addition to relying on methods that provides an avenue to rationally thinking about biological stochasticity versus technical variation, our approach also addresses another vexing problem in single-cell RNA sequencing data analysis. Recently, interest in "RNA velocity," the inference of trajectory directions using the relative abundances of unspliced and spliced mRNA [17], has led to recognition that in addition to the standard count matrices produced in single-cell RNA sequencing pre-processing [18], reads aligned to non-coding sequences may also be informative [19]. Several software packages have been developed to quantify both "unspliced" and "spliced" modalities [17,20,21], but despite their widespread use for RNA velocity [22], a natural question they raise has not been addressed: how should spliced and unspliced count matrices be "integrated," or analyzed simultaneously, to obtain insights into gene expression beyond the context of trajectory inference? We show that mechanistic modeling also provides an answer to this question. : Normalization and dimensionality reduction distort and underestimate biological variation, especially in high-expression genes. a. A proposed baseline for the analysis of residual variation after data transformation: the fraction of biological variability can be bounded by a theoretical baseline, which is computed from the variation in average subpopulation expression. If this baseline is violated, the data transformation has discarded some biophysically meaningful variation. b. High-expression genes have high variance (gray points: genes below the 95th percentile by mature RNA expression; red points: genes above the 95th percentile by mean mature RNA expression, red line: percentile threshold). c. Proportional fitting size normalization (PF), log-transformation (log), and principal component analysis (PCA) globally deflate the squared coefficient of variation (CV 2 ), whereas Uniform Manifold Approximation and Projection (UMAP) globally inflates it (gray and red points: as in b). d.-g. All four of the steps substantially deflate high-expression genes' CV 2 relative to raw data, implicitly attributing their variability to nuisance technical effects (gray and red points: as in b). h.-k. The deflation of variability results in the violation of the theoretical lower bound computed from cell subpopulation differences, particularly for high-expression genes (gray and red points: as in b; curved teal line: identity baseline, below which biological variability is removed; horizontal teal line: threshold, above which variability is inflated relative to raw data). 3

Model definition
We focus on a class of bursty transcriptional models of mRNA expression: This reaction schema encodes a continuous-time Markov chain on a bivariate discrete space of molecule counts for each gene, where X N is a nascent mRNA species and X M is a mature mRNA species. We identify the former with unspliced and the latter with spliced transcripts (Section 2.6 of [23]).The rate k is the characteristic burst frequency, such that the number of transcription events in any time interval of length τ is P oisson(kτ ). B is the number of unspliced mRNA produced per transcriptional event, which is a generally a random variable. β is the splicing rate, such that a given molecule of X N will become a molecule of X M after an exponentially-distributed delay with rate β. Analogously, γ is the degradation rate, such that a given molecule of X M will be eliminated after an exponentially-distributed delay with rate γ. At steady state, we fit the rate parameters in units of k, which is equivalent to imposing k = 1.
Typical fluorescence transcriptomics analyses use a model of random promoter switching [11] to define B as geometrically-distributed with mean b [15], inducing negative binomial-like [24] distributions of X N and X M [15,[25][26][27][28][29][30][31]. This model describes the dynamics at a promoter that randomly switches between a transcriptionally inactive, long-lived state and a highly active, shortlived state (Section S2.1.1); B is the number of molecules generated in the active state.
The chemistry of the process [32] suggests that a given mRNA molecule may be captured again after being reverse transcribed into cDNA once. Therefore, we assume that cDNA molecules are generated according to a Poisson birth process over unity time, such that: where λ N and λ M are the Poisson process rates for each species, X ′ N and X ′ M are unobserved in vivo molecules with dynamics following Equation 1, and X N and X M are UMIs observed after capture, amplification, sequencing, and alignment. We further assume λ N := C N L for a constant C N and variable gene length L: the dependence on L coarsely models the possibility of multiple priming at internal poly(A) stretches [33].
We use the bursty/Poisson model of biology and sequencing throughout our analysis. However, to facilitate comparisons with alternative descriptions of biological variation, we characterize and implement various other models of biology, with optional technical noise components, outlined in Section S2.

Probabilistic investigation of normalization and dimensionality reduction
Due to the scale of scRNA-seq data, standard analyses heavily use data transformation and dimensionality reduction to produce a version of the data more amenable to statistics [18]. For example, a typical analysis of cell type heterogeneity may apply size normalization (e.g, proportional fitting or PF, which treats RNA counts as compositional quantities [34]), log-transformation, principal component analysis (PCA), and Uniform Manifold Approximation and Projection (UMAP). Each of these steps has a specific purpose; for the four steps above, the purposes are, in turn, to remove variability due to technical heterogeneity, to obtain easily tractable normal-like log-abundance distributions, to select the latent data dimensions that contain the most variability, and to visualize the cell type structure [18]. These transformations rely on implicit assumptions about the structure of the data; these assumptions may be mutually contradictory, and their violation may produce results that range from suboptimal to catastrophically incorrect.
These limitations and failure modes have previously been investigated. Size normalization privileges relative, rather than absolute RNA species abundance; occasionally, this approach produces inconsistent results across the genome [35] and retains apparently technical variation [34,36]. Log-transformation is optimal for homogeneous, high-expression, approximately negative binomial data [4,18,34], and relies on an arbitrary genome-wide "pseudocount" hyperparameter that can distort the distributions [4,34,35,37]. PCA is optimal for multivariate normal data, and can be misled by the large zero fractions observed in single-cell data [37]. Finally, UMAP appears to be optimal for data with uniform, low-noise coverage of a latent manifold, with risk of distortions due to violated assumptions and stochastic initialization [22,38,39]. A comprehensive treatment of the distortions induced or ameliorated by each step appears, however, to be out of reach, as the transformations' results are heavily data-dependent and elude theoretical analysis.
In Figure 1a, we propose a procedure for the quantitative benchmarking of data transformations relative to an internal baseline. Each step transforms the data distribution, purportedly retaining relevant biological variability -such as cell type differences -and removing incidental or technical variability, quantified by the squared coefficient of variation (CV 2 ). Therefore, by removing some fraction of variability, a data transformation implies this component is immaterial to analysis, whereas the residual fraction of variation -the CV 2 ratio for the distribution after and prior to transformation, denoted byη 2 /η 2 -is attributed to biology. However, this residual fraction should not vary arbitrarily; under mild assumptions, we can bound the biological fraction of CV 2 from below by the variability in cell subpopulation averages (Section 5.1.1).
To compare the results of the transformation procedures to this baseline, we analyzed a mouse glutamatergic neuron dataset [40], using pre-annotated subtypes to produce a lower bound. The details of the analysis are given in Section 5.1. We considered a set of 2,951 genes, emphasizing the top 5% by dataset-wide average; these high-variability genes are typically of most interest in single-cell analyses (Figure 1b). The iterative application of transformations up to PCA typically deflated the gene-specific CV 2 values, particularly for the high-expression genes and in the logtransformation step. However, the application of UMAP inflated CV 2 throughout. We found that the high-expression genes' variability was typically deflated relative to the raw data, suggesting that the data transformations attribute overdispersion to nuisance technical effects (Figure 1d-g).
Log-transformation, PCA, and UMAP violated the baseline computed from inter-subtype variation, particularly for the high-expression genes. In addition, a considerable fraction of genes demonstrated variability exceeding that of the original data after PF and UMAP. As shown in Figure S8, after computing the UMAP, more than a third of the genes in the dataset had, at some point in the analysis, gone below the lower bound. This result suggests that ubiquitous transformations efface meaningful biological signal. Despite the claim in [4] that the log-transform is "best," our analysis shows that it may be even better not to apply a transform which is agnostic to technical versus biological stochasticity. UMAP attempts to recover variance by inflating cell type differences; however, since this inflation is genome-wide, it does not restore the quantitative information lost in previous steps, and may generate false findings.
We propose that a mechanistic approach provides a more reliable avenue for the analysis of sequencing data. In this worldview, all assumptions about the noise behaviors are explicit rather than implicit; count data are not to be denoised, but fit to a first-principles model that includes biological and technical noise terms. Once a satisfactory parametric fit is available, the fractions of biological and technical variability follow immediately (Section 5.1.3). This approach is outlined schematically in Figure 2a: given annotations, we can separately fit cell subtypes, obtain their biophysical parameters, and aggregate them to obtain the fraction of biological variability.
The fit, implemented in Monod, attributes overdispersion in high-expression genes to biological variability (Figure 2b), in striking contrast to the non-parametric transformations ( Figure S9). As a consequence, the inferred fraction of biological variability coheres with the baseline (Figure 2c). Interestingly, this agreement is not merely a consequence of independently fitting cell subtypes and aggregating the variance. By applying Monod to the entire glutamatergic dataset, introducing some error due to the neglect of subtype heterogeneity, we obtain similar results, with a single violation of the bound ( Figure S10). This control suggests that the mechanistic procedure largely explains biological variability by transcriptional bursting, rather than subtype differences.  Figure 1a may be compared to point estimates of the biological variability fractions, which follow immediately from a fit to a parametric model of transcription and sequencing. b. The Monod fits explicitly attribute the variability in high-expression genes to biological phenomena (gray and red points: as in Figure 1b). c. The Monod results lie entirely within the admissible region (gray and red points: as in b; curved teal line: identity baseline, below which inferred biological variability is lower than inter-cell population variability; horizontal teal line: threshold, above which inferred biological variability exceeds that of raw data).

Principled integration of single-cell and single-nucleus data
In parallel with single-cell RNA sequencing [41], recent years have seen the rapid adoption of singlenucleus RNA sequencing (snRNA-seq) technologies [42,43]. As illustrated in Figure 3a, snRNA-seq releases and isolates the nuclei, whereas scRNA-seq requires tissue dissociation into individual cells. As discussed in [44], in spite of the loss of cytoplasmic signal, which limits the ability to detect certain relevant genes [45], single-nucleus sequencing provides technical advantages. Single-nucleus protocols require considerably simpler tissue handling; for example, they can be applied to frozen cell samples [46][47][48].
The interest in single-nucleus sequencing, as well as the recognition of systematic differences in the findings from the two technologies [45,[49][50][51], has motivated the analysis of these differences [52] and the development of more or less ad hoc data integration methods [50,53]. Several recent reports [52][53][54] have found that the nuclear datasets exhibit a strong length bias, with longer genes being overrepresented in nuclei. This discrepancies, in turn, appears to stem from a fundamental methodological difference: single-cell analyses typically only use exonic reads, whereas single-nucleus combine intronic and exonic reads [42,51]. Although the exonic molecule counts do not appear to exhibit a length bias, intronic ones do [33], likely due to internal priming in poly(A)rich intronic regions [17]. Furthermore, even if all reads are included in the analysis of a single-cell dataset, the length bias may be attenuated due to the abundance of fully processed molecules. However, the appropriate way to correct for this effect is obscure. Previous reports have suggested [53,54] or eschewed [55] normalizing by gene length; this scaling, if applied, prevents the application of discrete models.
We propose that scRNA-seq and snRNA-seq may be more analyzed in a more principled way through a mechanistic lens. This strategy treats nascent (intron-containing) and mature (exonic) molecules as distinct, in the spirit of [52,54], and takes the distinction to its logical conclusion by defining a model with nuclear export (Section 5.2.1). Under a particular set of assumptions, this model reduces to the form in Section 2.1, with nuclear export taking the role of cytoplasmic degradation as the mechanism of mature RNA efflux. However, the nascent RNA dynamics -i.e., transcription and splicing -should be identical for the two technologies, as they are confined to the nucleus.
This axiom provides a foundation for the joint analysis of the technologies. We fully outline the approach in Section 5.2. For example, Figure 3b-c compares the average counts for 2,000 genes in scRNA-seq and snRNA-seq datasets generated from a single mouse brain tissue sample by 10x Genomics [56,57]. Surprisingly, in spite of the depletion of cytoplasmic RNA, the mature count averages were visually similar, whereas the nuclear count averages were approximately half an order of magnitude higher in the single-nucleus dataset. Quantitatively, 83% of the mature and over 99% of the nascent averages were higher in the snRNA-seq sample. To explain this difference, we adopt the usual "marker gene" paradigm, i.e., that closely related cell types typically differ in the expression of a small number of genes [18], whereas the other genes have similar distributions. This assumption implies that incidental enrichment of certain cell subpopulations cannot explain the striking, widespread discrepancy, and immediately leads us to conclude that the difference is purely technical; due to the details of the nuclear sequencing protocol, the procedure retains considerably more RNA of both types. This assumption appears to be supported by Figure 3d-e: both species exhibited an overall decrease in the noise levels (66% of the mature and 98% of the nascent CV 2 values), which is consistent with decreased molecule loss. The difference in mature RNA amounts should, then, be explained by the combination of two competing effects: the depletion of cytoplasmic RNA, as well as more effective capture of remaining molecules, in the single-nucleus protocol.
To quantify the efflux rates, we fit the datasets using Monod and inferred the technical noise parameters for the single-cell dataset. Next, we identified the set of single-nucleus technical noise parameters ( Figure S6) that provided the best match to the burst size and splicing rate parameters (Figure 3f-g); the discovered set of technical noise parameters had higher (more effective) sampling rates. The inferred efflux rates at this set were considerably higher for the single-nucleus dataset, both visually ( Figure 3h) and statistically: the t-test {t, p} values were {−2.7, 7.3 × 10 −3 } for the burst size, {1.6, 0.11} for the splicing rate, and {−11, 2.1 × 10 −27 } for the efflux rate.
The procedure we have outlined has significant limitations: for example, we have neglected nuclear efflux in the single-cell data and cell type heterogeneity, both of which are physiologically important [40,58] likely contributors to deviations in Figure 3f-g. In addition, single-nucleus sequencing may harbor as of yet poorly-understood technical noise phenomena particular to the technology. Nevertheless, the model formulation provides a foundation for the incorporation of more sophisticated nuclear retention delays [44,59,60] jointly with technical noise. In addition, the strategy provides a principled solution to the dilemma of incorporating intronic reads: all the available data should be used, with species differences encoded in a multivariate mechanistic model. If its assumptions are explicitly formulated, the model can be fit, or extended to account for violations, based on experimental data. Figure 3: Single-cell and single-nucleus data can be conceptually "integrated" by defining distinct stochastic models that account for differences in mature RNA processing and technical noise, while maintaining the nascent RNA. a. Single-cell RNA sequencing protocols dissociate and isolate of individual cells, whereas single-nucleus RNA sequencing isolate nuclei and discard cytoplasmic molecules; however, the downstream sequencing steps may be identical. b. Counterintuitively, representative paired mouse brain single-cell and single-nucleus datasets exhibit similar mature RNA levels (gray points: genes; dashed black line: line of identity; green line: the approximate average offset observed for single-nucleus data). c. The single-nucleus dataset consistently has considerably higher nascent RNA counts, which suggests the presence of a technical effect between the two technologies (conventions as in b). d. The single-nucleus dataset demonstrates slightly lower noise levels for mature count data (gray points: genes; dashed black line: line of identity). e. The single-nucleus dataset demonstrates considerably lower noise levels for nascent count data (conventions as in d).
f.-g. By fitting mechanistic models to both datasets, we can identify technical noise parameters that produce consistent burst and splicing parameters between the technologies (points: maximum likelihood estimates for burst sizes and splicing rates; error bars: conditional 99% confidence intervals for inferred parameters; dashed black line: line of identity). h. At the discovered technical noise parameters, the mature RNA efflux or turnover is considerably higher for the single-nucleus dataset, consistent with this parameter's interpretation as the rapid export from the nucleus (conventions as in f -g).

Mechanistic basis for differential expression analysis
In typical transcriptomics workflows, the determination of differences between cell types or conditions often reduces to the determination of differentially expressed (DE) genes, which exhibit statistically significant differences in their average copy numbers. However, the identification of DE genes requires careful accounting for technical covariates [18]. In addition, the data may exhibit compensating mechanistic effects that change the distribution while keeping the averages constant, which would not be identifiable by standard statistical methods [33,[61][62][63].
We propose that differential expression testing should be generalized to the identification of modulated parameters. We use the notation "DE-θ" to denote criteria using θ -which may be a data moment or an inferred biophysical parameter -as a test statistic. In particular, we stress the potential of multivariate data, which provides more statistical power [7] and enables the identification of parameter modulation patterns which would not otherwise be identifiable (Section 5.3.1).
For illustration, we revisit and extend an analysis performed in [33]: we used Monod to fit counts from pre-clustered glutamatergic and GABAergic cell types in four mouse brain datasets [40], then selected genes that appeared to be DE-θ for the burst size, splicing rate, or degradation rate. We further filtered for genes which were not DE-µ M , i.e., had a low average difference in mature RNA expression between the cell types. The full analysis procedure is given in Section 5.3.2.
Based on the pervasive co-variation of splicing and degradation rate differences (Figure S39 of [33]), as well as physical considerations (Section 5.3.1), we suggest that this co-variation should properly be ascribed to burst frequency modulation, even though this parameter was not explicitly fit. Therefore, we summarized the findings further ( Figure 4a) in terms of burst size and frequency differences, in the spirit of [26,64]. The discovered genes are indicated according to the cell type differences' effect on noise: genes highlighted in red exhibit more overdispersion in the glutamatergic population, whereas genes highlighted in light teal exhibit more overdispersion in the GABAergic population. These genes exhibit only minor differences in average expression, and fall fairly close to the line of expression identity (solid diagonal line in Figure 4a), where an increase in burst size is precisely compensated by a decrease in the burst frequency. The differences in parameters are reflected in the data distributions and the model fits. In addition to identifying distributional differences in a small number of markers, the mechanistic approach also enables the summary of more far-reaching perturbations that move beyond the usual marker gene paradigm. For example, recent studies found that the introduction of a modified nucleotide (IdU) to a culture medium enhances transcriptional noise, but keeps average expression constant, hinting at a genome-wide mechanism for compensation [63,65].
Although the model required to fully recapitulate the dynamics of DNA damage repair involved in this process is sophisticated, we found that we could characterize the effects of IdU using a simple bursty model. The analysis procedure is fully described in Section 5.3.3. In brief, we fit the nascent and mature data from control and IdU datasets using Monod. As in [33], the technical noise parameters were not readily identifiable from the 10x v2 sequencing data. We assumed the parameters were in a region we previously discovered for this technology (Figure 3e of [33]), and analyzed biophysical parameters under that assumption ( Figure S7).
We found that the IdU-perturbed cell culture exhibited striking noise amplification, with very limited differences in mean expression (Figure 4c). This result strongly contrasts, e.g., Figure 4a and Section S7.10.4 of [33], which show fairly symmetric noise amplification and reduction between cell types. The asymmetry in the findings are consistent with the authors' conclusions and orthogonal validation, which likewise found that burst size increases and burst frequency decreases in the IdU condition [65].
We selected a set of well-fit genes that exhibited particularly high modulation and had average expression greater than 1 in at least one of the conditions for further analysis, identifying Stx7, Washc5, Apod, Eif2ak2, Ubr2, Cnnm2, Dram2, Zfp110, Cul4a, Ddx19b, and Yap1 (red points in Figure 4c). Interestingly, two of these genes are directly related to the DNA damage activity of IdU: Dram2 is involved in the autophagic response to DNA damage repair, whereas Cul4a is involved in the turnover of DNA repair proteins. Several other genes more generally mediate the cellular stress response: Zfp110, Eif2ak2, and Yap1 regulate apoptosis, whereas Ddx19b may be active in stress granules. The role of the remaining genes is obscure: Stx7 and Washc5 are related to vesicular function, Apod is involved in lipid metabolism, Ubr2 controls ubiquitination, and Cnnm2 appears to be involved in ion transport [66].
We were able to partially compare our results for Sox2, Nanog, and Mtpap, whose transcriptional parameters were computed from fluorescence data in [63,65]. We did not observe Sox2 expression in either dataset. Nanog was rejected by our goodness-of-fit procedure. This is, in principle, consistent with the results in Table S2 of [63], which report gene on fractions near 30-55%; this regime violates the assumptions of the bursty model (gene on fraction tending to zero). The inferred signs for Mtpap parameter modulation agreed with Figure S4 of [65], although we obtained rather different magnitudes (log 2 fold changes of ≈ −0.3 by smFISH vs. ≈ −1.5 by Monod for burst frequency; ≈ 2 by smFISH vs. ≈ 1.3 by Monod for burst size). Therefore, although the genomewide trends broadly recapitulate the mechanistic explanations provided by the authors, and some of the high-noise genes appear to be implicated in DNA repair and stress, the quantitative comparison of fluorescence and sequencing data requires further analytical work. Figure 4: The inference of mechanistic parameters with Monod allows us to generalize differential expression testing to the identification of genes with distributional differences, without requiring substantial changes in average expression. a. The differences between mouse glutamatergic and GABAergic cell types, computed from four independent replicates, include genes with substantial noise enhancement but little to no change in average expression, which may reflect biophysically important compensation mechanisms (light red points: genes with significantly higher noise in glutamatergic cells; light teal points: genes with significantly higher noise in GABAergic cells; gray points: all other genes; solid diagonal line: parameter combinations where burst size and frequency differences compensate to maintain a constant average expression; dashed diagonal lines: ±1 log 2 expression fold change region about the constant-average expression line; vertical and horizontal lines: parameter combinations where burst size and frequency, respectively, do not change; fits originally performed in [33]). b. Differences in inferred noise behaviors reflect differences in distribution shapes (light red: glutamatergic cell type; light teal: GABAergic cell type; histograms: raw counts; lines: Monod fits; top row: mature RNA marginal; bottom row: nascent RNA marginal). c. Perturbation by IdU, which triggers DNA damage and repair, rarely changes expression levels, but induces genome-wide noise enhancement [65] detectable by Monod (lines and gray points: as in a; red points and labels: well-fit, moderate-expression genes identified as highly noise-enhanced). 13

Discussion
Our Python software Monod facilitates mechanistic inference from multimodal scRNA-seq data. At this time, it is restricted to a narrow set of transcriptional models (tractable by quadrature), technical noise models (catalysis or drop-out), modalities (nascent and mature RNA), correlation structure (no inter-gene relationships), and heterogeneity structures (a single homogeneous cell type). While some of these assumptions may be simplistic, the current approach to single-cell RNA sequencing analysis corresponds to an even more unrealistic model, which makes contradictory implicit assumptions and violates fundamental constraints. With even a basic mechanistic model for integrating nascent and mature RNA counts, we have demonstrated the possibility for interesting discovery. Technical noise may be described in a self-consistent fashion; single-nucleus and singlecell sequencing data can be described in a common framework; subtle distributional differences between pre-annotated cell types can be identified and ascribed to biophysical phenomena. We anticipate that Monod can be extended to utilize multimodal data to parametrize more complex mechanistic models.

Normalization and dimensionality reduction
We would like to characterize the performance of typical data processing and dimensionality reduction techniques. To do so, we need a meaningful baseline for "good" performance. For the purposes of illustration, we focus on the methods' effects on cell type differences, whose characterization is a commonplace application of single-cell analyses [18]. In this section, we seek to outline and apply a framework for investigating these methods' implicit assumptions and distortive effects.
We essentially have three options for constructing a baseline for studying heterogeneity, which have different trade-offs. First, we can define a stochastic model under a particular set of hypotheses, simulate from it, and compare the algorithm performance to the underlying ground truth. However, this approach may be overly simplistic, as the simulation may not accurately represent all features of the underlying data-generating process. Second, we can obtain datasets collected from distinct tissues, concatenate them, and treat them as a single dataset. However, this approach is somewhat artificial and divorced from typical use cases, which treat a single tissue. In addition, there may be hard-to-characterize technical batch effects between datasets. Third, we can obtain a preannotated dataset from a single tissue, and perform the analysis conditional on the assumption that the annotations are sufficiently accurate for our purposes. Although this approach necessarily represents a compromise, we use it for simplicity.

Variance decomposition baseline
Given a generic set of cell populations, indexed by κ, we can construct an estimate for the amount of biological variation. Under mild assumptions about technical noise, the overall biological variation is bounded from below by the inter-cell population variability. Quantitatively, this property holds for the squared coefficient of variation (CV 2 ), which we denote by η 2 in our derivations.
First, we construct a categorical distribution π that contains the fractional cell type abundances π κ . From the laws of total expectation and variance [70], we obtain the sample-wide meanμ and varianceσ 2 of a particular gene's counts, prior to corruption by technical noise: whereμ κ is the mean andσ 2 κ is the variance of each discrete cell population. The observed statistics contain contributions from technical noise: To relate the observations to the biological processes, we make two assumptions about the form of the technical noise. First, we assume that the sequencing process uniformly samples cells from the underlying population. In other words, we suppose that the observed cell type proportions match the biological proportions. This allows us to write down an analogous decomposition: Second, we assume that ξ κ = ξ for all κ. In other words, we suppose that, for a particular gene and on average, all cell types are chemically and statistically identical with respect to the sequencing process. We find that the lower moments of the observed distributions can be rewritten in terms of the lower moments of the biological distributions: Under this set of assumptions, we find that the ratio of variances with and without technical noise takes the following form: which cannot be easily manipulated or constrained in the absence of ground truth statistics. However, if we instead use the ratio of coefficients of variation, we find: i.e., the fraction of biological variability (as quantified by the CV 2 ) is at least as high as the fraction of variability attributable to the inter-population mean differences.

Non-mechanistic model definition
Intuitively, we should hope that transformations commonly applied to "denoise" scRNA-seq data retain the biological variability of interest. For example, if we would like to preserve the quantitative relationships between cell types, we seek to keep the per-gene noise post-transformation above the lower bound derived in Section 5.1.1; if this lower bound is violated, some cell type differences have been degraded. Transformations are iteratively applied to an entire data matrix Z, with entries Z ij indexing over cells i = 1, . . . , c and genes j = 1, . . . , g. We conceptualize a transformation as some function Φ ℓ (Z), such that Φ ℓ = ϕ ℓ • · · · • ϕ 1 . Thus, for example, "proportional fitting" count normalization followed by log-transformation (log1pPF [34]) would be represented by a composition of ℓ = 2 transformations, with  being the size factor, This formulation assumes that each ϕ is a function that maps a c × g matrix to another c × g matrix. However, some transformations, such as principal component analysis (PCA), accomplish dimensionality reduction, and map a c×g matrix to a c×g ′ matrix, with g ′ ≤ g. Such a projection ψ places the data onto a lower-dimensional manifold within g. We can characterize how much variance is retained by such a projection by applying an inverse transformation, such that a dimensionalityreducing step's ϕ = ψ −1 • ψ. This inverse transformation is typically not unique, and may not be deterministic. However, if the cell populations largely lie along the low-dimensional manifold, we should expect the "denoising" steps to have a minimal effect on the variance thus removed.

Mechanistic model definition
The approach outlined in Section 5.1.1 makes fairly mild assumptions about the distributions to obtain a limit on the fraction of biological variability. By making stronger assumptions, we can obtain point, rather than region estimates, at the cost of potential model misspecification or inaccuracy in estimated parameters.
For example, if we are interested in the CV 2 values for mature RNA with and without technical noise, we can immediately exploit the analytical statistics reported in Section S2.4: To lighten the notation, we elide the population-specific subscripts κ on the parameters b, β, γ, and λ. Therefore, the fraction of biological variability under this model can be computed by separately fitting the cell types, then substituting in the maximum likelihood estimates to obtaiñ η 2 /η 2 .

Data processing
To investigate the effect of common data transformations and compare them to the bound from Section 5.1.1, we used the glutamatergic cell subtypes reported for the mouse sample B08, originally generated by the Allen Institute for Brain Science [40].
Preprocessing. We used kallisto | bustools 0.26.0 to pre-process data. We downloaded a pre-built M. musculus genome from https://support.10xgenomics.com/single-cell-gene-expression/ software/downloads/latest (mm10, 2020-A version). To build intronic and exonic references, we used the kb ref function with the --lamanno option. We obtained the raw FASTQ files for the dataset B08 [40,71], which was generated using the 10x v3 single-cell chemistry. Next, we used the kb count function with the --lamanno option, as well as the -x 10xv3 whitelist option to quantify the datasets, producing unspliced (intron-containing) and spliced (non-intron-containing) RNA count matrices [72,73].
Filtering. We filtered the dataset to remove "low-quality" cells or empty droplets. First, we removed all barcodes that did not pass the default kallisto | bustools filter. Next, we removed all cell barcodes that were associated with fewer than 10 4 molecular barcodes, computed over all genes, corresponding to standard knee plot filtering procedures ( Figure S2).
Next, we split the dataset by cell type and subtype annotations [40]. We extracted seven classes: six glutamatergic subtypes (L2/3 IT, L5 IT, L6 IT, L5/6 NP, L6 CT, and L6b) and their union ("glutamatergic"). We omitted low-abundance cell types (L6 IT Car3 and L5 ET, with fewer than ten barcodes) from analysis and inclusion in the "glutamatergic" category.
Next, we used Monod 0.2.6.0 to extract genes with moderate to high expression. We removed two sets of genes: those with very low observed average and maximum expression (X N ≤ 0.01, X M ≤ 0.01, max X N ≤ 3, max X M ≤ 3), and those with excessively high observed maximum expression, which are too computationally intensive to fit (max X N ≥ 400, max X M ≥ 400). We use the notation X z to denote the observed distribution of species X z (nascent or mature) for a particular gene; X z is the observed mean and max X z is the observed maximum. This procedure produced a set of 2,951 genes that met the thresholds in all of the cell populations.
As high-expression, high-variability genes are typically of most interest in single-cell analyses, we further selected the top 5% of genes by expression, and colored them orange in all visualizations. These genes tended to have the highest variance in the dataset. This selection procedure is shown in Figure 1b.
Baseline computation. To calculate the baseline introduced in Section 5.1.1 for each gene, we used a plug-in estimate for the lower bound in Equation 8, using only the mature RNA data. Specifically, the bound affords the consistent estimator where S 2 is the sample variance over all glutamatergic cells, (X M ) κ is the average expression in cell subtype κ, and c κ is the number of cells in that subtype. We used the existing annotations to extract the κ-indexed variables.
To represent the admissible region, wherein the retained fraction CV 2 after transformation is no lower than the fraction of CV 2 attributable to cell type differences, we plotted it as a teal line. In addition, we plotted the location at which the CV 2 after transformation exceeds that of the original dataset. Violations of this upper bound are not necessarily a cause for concern. For example, a transformation may, in principle, effectively inflate differences between cell types to make them more distinguishable.
Computation of variability retained by transformation. We considered the effects of four transformations of the glutamatergic mature RNA count matrix: proportional fitting normalization, log-transformation, principal component analysis (PCA), and Uniform Manifold Approximation and Projection (UMAP) [74]. These transformations take place in sequence: log-transformation is applied to normalized data; PCA is applied to log-transformed data; UMAP is applied to PCAtransformed data. This series of transformations is associated with four ϕ and Φ functions, as well as five c × g Z matrices: Z, the raw mature count matrix, , the normalized mature count matrix, , the log1pPF transformed mature count matrix, , its reconstructed PCA projection, and The first two transformations, ϕ 1 and ϕ 2 , corresponding to normalization and log-transformation, are reported in Equation 9.
The reconstructed PCA projection is given by ψ −1 3 • ψ 3 , where ψ 3 is implemented through the Scikit-learn 1.0.1 function transform and ψ −1 3 is implemented through the function inverse_transform, both associated with a sklearn.decomposition.PCA object [75]. We used 50 components to compute the PCA projection.
The reconstructed UMAP projection is given by ψ −1 where ψ 4 is implemented through the umap-learn 0.5.1 function transform and ψ −1 4 is implemented through the function inverse_transform, both associated with a umap.UMAP object [74]. We used the 50-dimensional PCA projection and default parameters to compute the UMAP projection.
To compute the amount of retained CV 2 for each gene j and each step indexed by l, we computed the ratio of coefficients of variation prior to and after transformation: To characterize the overall effect of the cumulative application of transformations, we plotted the distributions of (η l ) 2 j -i.e., the transformed data CV 2 -after each transformation. This analysis is shown in Figure 1c.
To characterize the relationship between the average expression and the fraction of variation attributed to biological variability of interest, we plotted f j,l,ret against the mean Z j for each gene. The normalization and dimensionality reduction procedures attempt to eliminate noise while maintaining biological "signal," and this visualization reveals whether an increase in variability is implicitly attributed to the former or the latter. This analysis is shown in Figure 1d-g.
To understand the transformations' effect on distributions, we plotted the f j,l,ret value for each step l against the baseline value f j,baseline , computed using Equation 11 separately for each gene j. To easily compare the results to the threshold, we plotted the admissible region. The lower bound represents the reduction of inter-cell type variability, whereas the upper bounds represents the inflation of overall variability above its original value. This analysis is shown in Figure 1h-k. In addition, we computed the fraction of genes violating the bound at each step: 0%, 8.2%, 35.2%, and 7.6% after proportional fitting, log-transformation, PCA, and UMAP respectively.
As the transformations are applied cumulatively, the distribution at step l may fall within the admissible region, but still be quantitatively degraded because it fell outside it at some step l ′ < l. To characterize the loss of quantitative information about cell type relationships, we plotted the same data points as above, and colored them according to the history of the analysis. If a gene has ever violated the lower bound, we plotted it in a violet color. This analysis is shown in Figure S8. In addition, we computed the fraction of genes identified after each step. We found that 0%, 8.2%, 35.2%, and 35.4% of the genes had at some point violated the bound after proportional fitting, log-transformation, PCA, and UMAP respectively.  Table S7. These bounds were chosen according to the results previously obtained for mouse brain datasets, as reported in Figure S24 of [33].
At each grid point, we iterated over the 2,951 genes, using gradient descent to identify the conditional maximum likelihood estimate of {log 10 b, log 10 β, log 10 γ}, where the rates β and γ are defined in units of burst frequency k (Section S4.3.2 of [33]). We used the conditional method of moments estimate (Table S6, "Bursty") as the starting point and performed 15 steps of gradient descent. The procedure was parallelized over up to sixty processors (Intel Xeon Gold 6152, 2.10GHz). Runtimes varied between 33 minutes for the smallest dataset and 2.7 hours for the largest.
To identify the optimal sampling parameters, we identified the grid point with the lowest total Kullback-Leibler divergence, computed over all genes. To ensure we obtained the true optima under the bursty model, we performed four rounds of fixed-point iteration. First, we rejected a subset of genes if they were detected by the chi-squared test with p = 0.001 with a Bonferroni correction, and their Hellinger distance from the data distribution exceeded 0.05. Next, we recalculated the optimum based on the remaining data (Section S4.3.5 of [33]), and repeated the procedure. This procedure did not change the optimum for any of the datasets. Further, we investigated the stability of the optima under gene subsampling, and found them to be stable and consistent (Section S4.3.5 of [33]).
Although the optima discovered for the cell subtypes were fairly close, they were not identical, with smaller datasets showing striking deviations ( Figure S5). From physical considerations, we assumed the subtypes, which originate from a single technical sample, have the same set of sampling parameters. For simplicity, we assumed the parameter set inferred from the entire glutamatergic dataset provided a sufficiently accurate estimate for all of its constituent subtypes, and analyzed the data under that set of {log 10 C N , log 10 λ M }.
To compute the fraction of biological variability, we used the identities in Equation 10 using the parameters inferred for each gene j: To characterize the relationship between the average expression and the fraction of variation attributed to biological variability of interest, we plotted f j,bio against the mean Z j for each gene. This visualization reveals whether an increase in variability is attributed to biological or technical effects. This analysis is shown in Figure 2b.
To understand the fits' sensibility, we plotted the f j,bio value against the baseline value f j,baseline , computed using Equation 11 separately for each gene j. To easily compare the results to the threshold, we plotted the admissible region. This analysis is shown in Figure 2c.
If the fit is sufficiently good, Equation 10 naturally enforces the bound in Equation 11. To understand whether a mechanistic analysis provides actionable information, or merely exploits external information about cell types, we used the fit to the entire dataset to repeat the analysis. This approach introduces some error, as we neglect intra-cell type differences altogether, and do not use goodness-of-fit testing to omit genes that show subtype heterogeneity. If the trends look substantially similar, the results suggest that the Monod procedure attributes the vast majority of CV 2 to intrinsic and bursting noise, with only a minor fraction ascribed to inter-subtype differences. In other words, if we do not use π at all, but still obtain similar results, the agreement with the bound is not an incidental consequence of the expectation over π in the last two lines of Equation 10. To computeη 2 /η 2 in this scenario, we used the mean and variance identities in the first four lines of Equation 10. These results are shown in Figure S10a-b.
Comparisons between transformation and the mechanistic model. To compare the attributions implicit in the transformation procedures and the mechanistic fit, we plotted them against each other. If the two methods agree, they should lie on the line of identity. The results are shown in Figure S9 for the Monod subtype analysis and Figure S10c-f for the analysis of the entire dataset. 22

Nuclear data integration
We would like to coherently integrate single-cell and single-nucleus RNA sequencing data. To do so, we need to specify the relationship between the two modalities. We can establish such a relationship from first principles by making assumptions regarding the underlying biophysical processes. For example, by proposing that nascent RNA are restricted to the nucleus, we can reasonably assume that the nascent RNA dynamics should be identical between the two modalities. On the other hand, the mature RNA distributions and dynamics may have substantial differences, as nuclei are depleted in this species relative to the entire cell. In this section, we propose a possible foundation for the integration of these modalities.

Mechanistic model definition
To describe the stochastic dynamics and sampling in a single-cell dataset, we use the formulation given in Section 2.1 and outlined in more detail in [33]. To connect this model to nuclear data, we note that formally, it can arise from the following model: where X M,nuc and X M,cyt are nuclear and cytoplasmic mature RNA species, respectively. The rate γ e describes the efflux of nuclear RNA, whereas the rate γ c describes the degradation of cytoplasmic RNA.
In the limit γ c ≪ γ e or γ c ≫ γ e , the model in Section 2.1 approximately holds for cytoplasmic data: if one of these stages is considerably longer-lived, the two-stage processing of mature RNA can be effectively described by a one-stage model. In this case, γ can be interpreted as the lower rate. We typically assume that the first limit is most relevant, although orthogonal data suggest that the details are highly gene-and tissue-dependent [58]. We note that it is, in principle, straightforward [59] implement a model that explicitly incorporates both parameters; however, for computational facility, we use the simpler reduced model and discard genes that fail to fit it. On the other hand, for nuclear data, the model holds for γ = γ e .

Data processing
To compare the distributions of single-cell and single-nucleus datasets and explain them using a mechanistic argument, we used mouse neuron datasets generated by 10x Genomics.
Preprocessing. We used kallisto | bustools 0.26.0 to pre-process data. We downloaded a pre-built M. musculus genome from https://support.10xgenomics.com/single-cell-gene-expression/ software/downloads/latest (mm10, 2020-A version). To build intronic and exonic references, we used the kb ref function with the --lamanno option. We obtained the raw FASTQ files for the "Brain 4" and "Brain Nuclei 4" datasets from two multiplexing experiments, both generated using the 10x v3 chemistry. We selected these datasets because they had the highest average molecule counts per cell in both technologies. Next, we used the kb count function with the --lamanno option, as well as the -x 10xv3 whitelist option to quantify the datasets, producing unspliced (intron-containing) and spliced (non-intron-containing) RNA count matrices [72,73].
Filtering. We filtered the dataset to remove "low-quality" cells or empty droplets. First, we removed all barcodes that did not pass the default kallisto | bustools filter. Next, we removed all cell barcodes that were associated with fewer than T molecular barcodes (T = 3 × 10 3 for sc and 6 × 10 3 for sn), computed over all genes, corresponding to standard knee plot filtering procedures ( Figure S3). In addition, we removed cells with more than 10 5 barcodes, as they may reflect obscure technical noise sources unique to single-nucleus data.
Next, we used Monod 0.2.6.0 to extract genes with moderate to high expression. We removed two sets of genes: those with very low observed average and maximum expression (X N ≤ 0.01, X M ≤ 0.01, max X N ≤ 3, max X M ≤ 3), and those with excessively high observed maximum expression, which are too computationally intensive to fit (max X N ≥ 400, max X M ≥ 400). This procedure produced a set of 5,690 genes that met the thresholds in all of the cell populations. We randomly selected 2,000 genes for further analysis.
Inference and analysis of biophysical parameters. To fit a mechanistic model, we used Monod 0.2.6.0. We set up a 20 × 21 grid over the {log 10 C N , log 10 λ M } domain listed in Table S7.
At each grid point, we iterated over the 2,000 genes, using gradient descent to identify the conditional maximum likelihood estimate of {log 10 b, log 10 β, log 10 γ}, where the rates β and γ are defined in units of burst frequency k. We used the conditional method of moments estimate as the starting point and performed 15 steps of gradient descent. The procedure was parallelized over up to fifteen processors (Intel Xeon Gold 6152, 2.10GHz). Runtimes varied between 2.2 hours for the whole-cell dataset and 3.8 hours for the nuclear dataset.
To identify the optimal sampling parameters, we identified the grid point with the lowest total Kullback-Leibler divergence, computed over all genes. To ensure we obtained the true optima under the bursty model, we performed four rounds of fixed-point iteration. First, we rejected a subset of genes if they were detected by the chi-squared test with p = 0.01 with a Bonferroni correction, and their Hellinger distance from the data distribution exceeded 0.05. Next, we recalculated the optimum based on the remaining data, and repeated the procedure. This procedure did not change the optimum for any of the datasets. Further, we investigated the stability of the optima under gene subsampling, and found them to be stable and consistent.
The optimum discovered for the single-nucleus dataset demonstrated noticeably higher molecule observation probabilities (orange points, Figure S6). This observation was supported by basic observations of the dataset statistics: despite the depletion of cytoplasmic RNA, the single-nucleus dataset had as much mature RNA as the single-cell dataset, and approximately half an order of magnitude more nascent RNA (Figure 2b-c). To illustrate these trends, we computed the offset from the ratio of the dataset-wide means. In addition, the single-nucleus dataset appeared to exhibit lower noise levels (Figure 2d-e). To obtain a quantitative understanding of the average and noise behaviors, we computed the fraction of genes that lay above the line of identity.
From physical considerations, the two independent experiments, performed using different technologies, should not necessarily have the same sampling parameters. However, as the samples were taken from the same tissue, they should have the same physics of transcription and splicing. Therefore, we somewhat arbitrarily assumed that the single-cell optimum was sufficiently accurate, and chose a set of single-nucleus sampling parameters that provided the lowest squared errors for the log 10 b and log 10 β parameters. As shown by the blue points in Figure S6, the optimum so discovered lay approximately half an order of magnitude above the optimum for the single-cell data, and within the top 5th percentile for the sampling parameter likelihood landscape (hatched region).
We analyzed the single-nucleus data under that set of parameters, recomputing the goodness-of-fit statistics accordingly.
To illustrate the differences between the datasets, we plotted the inferred parameters and the identity line. To quantify uncertainty in the parameters, we exploited the Fisher information matrix as described in Section S4.3.4 of [33]; we visualized the error bars, which represent the 99% confidence intervals for the biological parameters, conditional on the sampling parameter values. Finally, we applied a t-test, implemented through scipy.stats.ttest_ind [68], to the pairs of single-cell and single-nucleus parameter estimates. We omitted genes rejected by goodness-of-fit procedures from these computations and visualizations.

Mechanistic differential expression
We seek to move beyond averages and explain the differences between single-cell samples and cell types in terms of biophysical distribution parameters, in the spirit of [76]. In this section, we propose and apply an approach for the identification of cell type differences which would be poorly detectable using standard average-based procedures, and demonstrate its performance using an experiment studying transcriptional noise amplification.

Signatures of frequency modulation
We fit the rate parameters log 10 β and log 10 γ, setting the burst frequency k to unity. This is formally equivalent to fitting log 10 β k and log 10 γ k : at steady state, the system is characterized by three independent parameters, which cannot be distinguished based on a single dataset.
The models we present are not natively adapted to detect changes in k: to unambiguously distinguish between modulation of upstream and downstream processes, time-resolved data are mandatory. However, the high correlation between the magnitudes of changes in log 10 β k and log 10 γ k (e.g., as shown in Section S7.10.3 of [33]) highly suggestive of the hypothesized frequency modulation.
We propose that the modulation of k can be motivated by biological argument. β and γ, the rates of splicing and degradation, use a one-step, first-order, memoryless reaction as a highly simplified representation of a series of chemical transformations effected in tandem with a spliceosome or a ribonuclease (RNase) complex respectively. However, spliceosomes and RNases are promiscuous, whereas transcription is highly regulated. Therefore, we hypothesize that targeted modulation of the burst frequency upstream at the gene locus is more mechanistically plausible than the synchronized and targeted modulation of the downstream processes.
Therefore, if the approximate equality ∆ log 10 β k ≈ ∆ log 10 γ k holds, we can propose that ∆ log 10 k has a similar magnitude, but the opposite sign. We average the two to estimate the burst frequency modulation: ∆ log 10 k ≈ − 1 2 ∆ log 10 β k + ∆ log 10 γ k (17)

Data processing: mouse neurons
To illustrate the approach, we compared the parameters for glutamatergic and GABAergic cell types from four mouse datasets (B08, C01, F08, and H12) generated by the Allen Institute for Brain Science [40,71]. We previously performed the fits and identified genes that suggested substantial parameter modulation [33]. Here, we revisit the fits and summarize the key findings. First, we used the fits to identify the differentially expressed genes for each parameter (Section S4.6.2 of [33]). We use the notation DE-θ to indicate that the log-parameter θ exhibited a Bonferroni-corrected p-value lower than 0.1 and mean log 2 fold change higher than 1. The log 2 fold changes were defined as the difference between the parameter values in the GABAergic and glutamatergic cell types. We omitted data points that were discarded by goodness-of-fit testing. With this procedure, we identified a set of DE-b, DE-β, and DE-γ genes, separated according to the sign of the log 2 fold change.
We computed the mean log 2 fold difference in mature RNA averages between the cell types, and selected the set of identified DE-θ genes with a magnitude lower than 1. In other words, these genes have detectably large differences in biophysical parameters, but do not, on average, exhibit large differences in mature RNA averages µ N .
Next, we averaged the mean log 2 fold changes in β and γ to obtain an estimate of the log 2 fold change in the burst frequency, as in Equation 17. We plotted the resulting aggregated fold changes in burst size and burst frequency against each other, highlighting genes that were DE-θ for some biological θ, but not DE-µ N .
We colored these genes by the effect on noise. It is elementary to show that, if the mean remains constant, a decrease in b compensated by an increase in k -equivalently, decrease in β/k and γ/k -leads the joint distribution of nascent and mature RNA to become bivariate Poisson. For example, if a gene was found to exhibit significantly higher b, β, or γ in GABAergic cells, we assigned it to the GABAergic set, as it suggests relative noise amplification in this cell type. This analysis is shown in Figure 4a.
The structure of this plot bears further discussion, as it provides a convenient summary of useful statistical properties. The solid diagonal line denotes the set of b and k combinations that yield a constant mean (all other parameters held equal). The dashed diagonal lines are offset by unity, and show the range of parameters that give averages with a lower than twofold change in the mean. The dashed vertical and horizontal lines correspond to no change in b and k, respectively. Qualitatively, moving toward the top right corresponds to increasing the mean; moving toward the bottom left corresponds to decreasing the mean; moving toward the top left corresponds to decreasing the noise to the Poisson limit; moving to the bottom right corresponds to increasing the noise.
To demonstrate the qualitative impact of noise modulation, we visualized the distributions, as well as the fits, in both cell types, based on data from the B08 dataset. We selected the genes Nin and Bach2, which are associated with neuronal development, as discussed in [33]. In addition, we computed these genes' mature count averages in each cell type. This demonstration is given in Figure 4b.

Data processing: mouse embryonic stem cells
To demonstrate the potential of this approach for detecting broad trends in transcriptional modulation without replicates, we considered the transcriptomes of mouse embryonic stem cells with and without 5'-iodo-2'-deoxyuridine (IdU) perturbations. This dataset was generated by Desai et al. [63,65] to investigate the effect of IdU incorporation on transcriptional bursting properties; the authors found that the perturbation appeared to increase the noise genome-wide, but did not affect averages.
Preprocessing. We used kallisto | bustools 0.26.0 to pre-process data. We downloaded a pre-built M. musculus genome from https://support.10xgenomics.com/single-cell-gene-expression/ software/downloads/latest (mm10, 2020-A version). To build intronic and exonic references, we used the kb ref function with the --lamanno option. We obtained the raw FASTQ files for the DMSO (control) and IdU datasets, both of which were generated using the 10x v2 chemistry. Next, we used the kb count function with the --lamanno option, as well as the -x 10xv2 whitelist option to quantify the datasets, producing unspliced (intron-containing) and spliced (non-introncontaining) RNA count matrices [72,73].
Filtering. We filtered the dataset to remove "low-quality" cells or empty droplets. First, we removed all barcodes that did not pass the default kallisto | bustools filter. Next, we removed all cell barcodes that were associated with fewer than 4 × 10 3 molecular barcodes, computed over all genes, corresponding to standard knee plot filtering procedures ( Figure S4).
Next, we used Monod 0.2.6.0 to extract genes with moderate to high expression. We removed two sets of genes: those with very low observed average and maximum expression (X N ≤ 0.01, X M ≤ 0.01, max X N ≤ 3, max X M ≤ 3), and those with excessively high observed maximum expression, which are too computationally intensive to fit (max X N ≥ 400, max X M ≥ 400). This procedure produced a set of 4,373 genes that met the thresholds in both cell populations. We randomly selected 2,000 genes for further analysis, ensuring that the genes analyzed in the previous publications (Nanog, Sox2, Pou5f1, Klf4, Wdr83, Stx7, Hif1an, Mtpap, Farsa, Wipi2, and Snd1 ) were included.
Inference and analysis of biophysical parameters. To fit a mechanistic model, we used Monod 0.2.6.0. We set up a 20 × 21 grid over the {log 10 C N , log 10 λ M } domain listed in Table S7.
At each grid point, we iterated over the 2,000 genes, using gradient descent to identify the conditional maximum likelihood estimate of {log 10 b, log 10 β, log 10 γ}, where the rates β and γ are defined in units of burst frequency k. We used the conditional method of moments estimate as the starting point and performed 15 steps of gradient descent. The procedure was parallelized over up to eighty processors (Intel Xeon Gold 6152, 2.10GHz). Runtimes varied between sixteen and seventeen minutes.
To identify the optimal sampling parameters, we identified the grid point with the lowest total Kullback-Leibler divergence, computed over all genes. To ensure we obtained the true optima under the bursty model, we performed four rounds of fixed-point iteration. First, we rejected a subset of genes if they were detected by the chi-squared test with p = 0.01 with a Bonferroni correction, and their Hellinger distance from the data distribution exceeded 0.05. Next, we recalculated the optimum based on the remaining data, and repeated the procedure. This procedure did not change the optimum for any of the datasets. Further, we investigated the stability of the optima under gene subsampling, and found them to be stable and consistent.
The discovered optima were not consistent between datasets (orange points, Figure S6), and the likelihood landscapes were rugged and inconclusive (hatched region, Figure S6). This observation accords with our previous analyses of 10x v2 datasets (e.g., panels a. of figures in Section S7.6 of [33]): the older v2 technology does not appear to provide enough information to identify the technical noise parameters. Therefore, we somewhat arbitrarily used the grid point closest to log 10 C N = −6.5, log 10 λ M = −1.2, near the optimum discovered for a mouse neuron dataset in Figure 3e of [33]. We analyzed the datasets under that set of parameters, recomputing the goodness-of-fit statistics accordingly.
We computed the mean log 2 fold change in burst size and burst frequency (Equation 17), and plotted them against each other, using the conventions in Figure 4. We omitted data points that were discarded by goodness-of-fit testing. Finally, we identified all genes with log 2 change higher than 1.5 in b as well as k, which demonstrated significant noise amplification. To focus on genes with biologically interesting effects, we selected only those which had a mature RNA mean greater than unity in at least one of the conditions, and reported them.