log(x+1)* and log(1+x)*

A. Sina Booeshaghi​1​ and Lior Pachter​2  1 ​Department of Mechanical Engineering, California Institute of Technology 2 ​Division of Biology and Biological Engineering and Department of Computing and Mathematical Sciences, California Institute of Technology  * These formulae contributed equally to the title Address correspondence to lpachter@caltech.edu  Abstract Single-cell RNA-seq technologies have been successfully employed over the past decade to generate many high resolution cell atlases. These have proved invaluable in recent efforts aimed at understanding the cell type specificity of host genes involved in SARS-CoV-2 infections. While single-cell atlases are based on well-sampled highly-expressed genes, many of the genes of interest for understanding SARS-CoV-2 can be expressed at very low levels. Common assumptions underlying standard single-cell analyses don’t hold when examining low-expressed genes, with the result that standard workflows can produce misleading results.  Results The ​ACE2​ receptor, which facilitates entry of SARS-Cov-2 into cells ​(Zhang ​ et al.​, 2020)​, has become one of the most studied genes in the history of genomics over the past two months. There are already hundreds of preprints about the gene (Google Scholar). Several studies have reported on the expression of ​ACE2​ at single-cell resolution, and papers have been rife with speculation about implications of differential ​ACE2​ mRNA abundance for severity of disease. As is common in single-cell RNA-seq, the expression estimates of ​ACE2​ are derived from counts that are filtered and normalized. Figure 1a shows an analysis of ​ACE2​ mRNA in mice lungs (data from ​(Angelidis et al.​, 2019)​). The expression is computed from cells containing at least one copy of the gene. To understand the implication of this restriction, suppose the counts are modeled by a Poisson random variable with parameter . Application of the filter amounts to computing


Results
The ACE2 receptor, which facilitates entry of SARS-Cov-2 into cells (Zhang et al. , 2020) , has become one of the most studied genes in the history of genomics over the past two months. There are already hundreds of preprints about the gene (Google Scholar). Several studies have reported on the expression of ACE2 at single-cell resolution, and papers have been rife with speculation about implications of differential ACE2 mRNA abundance for severity of disease. As is common in single-cell RNA-seq, the expression estimates of ACE2 are derived from counts that are filtered and normalized. Figure 1a shows an analysis of ACE2 mRNA in mice lungs (data from (Angelidis et al. , 2019) ). The expression is computed from cells containing at least one copy of the gene. To understand the implication of this restriction, suppose the counts are modeled by a Poisson random variable with parameter . Application of the filter amounts to computing While this is approximately when is large, it is close to 1 when is small (de L'Hospital, 1768) . Figure 1b shows the fraction of cells containing at least one copy of ACE2 (Booeshaghi and Pachter, 2020) . Evidently, Figure 1a creates a misleading impression. In fact, ACE2 has significantly lower mRNA expression in the lungs of aged mice than young mice. positive cells. The p -value was computed using a t -test. c) A comparison of the naïve estimate of the expectation of log1p (red) to the Taylor approximation of the expectation of log1p (blue). The code to reproduce the panels in the figure is available here .
The fraction of cells with nonzero expression of a gene has a useful statistical interpretation. We leave it as an exercise for the reader to show that the maximum likelihood Poisson parameter estimator is given by .
Since is approximately equal to this expression when is small, this provides an interpretation of the fraction of cells with at least one copy of a low-abundance gene as a Poisson parameter.
Another mistake that we've found to be common in reporting ACE2 expression has to do with the log transformation, frequently used as part of a normalization of counts. Counts are log transformed for two reasons: the first is to stabilize the variance, as the log transform has the property that it stabilizes the variance for random variables whose variance is quadratic in the mean (Bartlett, 1947) . The rationale of this step for single-cell RNA-seq is manifold: first when performing PCA on the gene expression matrix to find a reduced-dimensional representation that captures the variance, it is desirable that all genes contribute equally. The second rationale for the log transform is that it converts multiplicative relative changes to additive differences. In the context of PCA, this allows for interpreting the projection axes in terms of relative, rather than absolute, abundances of genes.
A seemingly minor technical issue in log transforming counts is that zero counts cannot be "logged", as log(0) is undefined. To circumvent this problem, it is customary to add a "pseudocount", e.g. +1, to each gene count prior to log transforming the data. We denote this by log1p (see units of Figure 1a), in accordance with the function name in NumPy (Oliphant, 2006) . For a gene with an average of counts where is large, it is intuitive that the average of the log1p transformed counts is approximately log( ). However, this is not true for small . An understanding of the result of applying the log1p transform begins with the observation that for a This shows that for low-expressed genes, the average log1p expression differs considerably from log( ) (see Figure 2). Thus, while a 2-fold change for large translates to a log(2) difference after log1p, that is not the case for small .
In summary, while single-cell RNA-seq atlases offer detailed information about the transcriptomic profiles of distinct cell types, their use to examine specific genes, as has been done recently in the study of SARS-CoV-2 infection related genes, requires care. Methods should not be used unless their limitations are understood. For example, while it doesn't matter whether one uses log( x+ 1) or log(1 +x ), the filtering and normalization applied to counts can affect comparative estimates in non-intuitive ways. Moreover, there are subtle problems that arise when working with small counts that transcend the elementary issues we have discussed (Warton, 2018) . These matters are not theoretical; we leave the identification of published preprints and papers that have ignored the issues we've raised, and hence reported misleading results, as an exercise for the reader.