of 38
The impact of package selection and versioning on
single-cell RNA-seq analysis
Joseph M Rich
1,2
, Lambda Moses
1
, P ́etur Helgi Einarsson
3
, Kayla
Jackson
1,2
, Laura Luebbert
1
, A. Sina Booeshaghi
4
, Sindri Antonsson
3
,
Delaney K. Sullivan
1,5
, Nicolas Bray
6
, P ́all Melsted
3
, and Lior Pachter
*1,7,8
1
Biology and Biological Engineering, California Institute of Technology,
Pasadena, CA, 91125, USA
2
USC-Caltech MD/PhD Program, Keck School of Medicine, Los Angeles,
CA, 90033, USA
3
Faculty of Industrial Engineering, Mechanical Engineering and Computer
Science, Reykjav ́ık, Iceland
4
Department of Bioengineering, University of California Berkeley, Berkeley,
CA, USA
5
UCLA-Caltech Medical Scientist Training Program, David Geffen School of
Medicine, University of California, Los Angeles, Los Angeles, CA, 90095,
USA
6
Boston, MA
7
Computing and Mathematical Sciences, California Institute of Technology,
Pasadena, CA, 91125, USA
8
Lead Contact
April 11, 2024
*
Correspondence: lpachter@caltech.edu.
1
.
CC-BY 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted April 11, 2024.
;
https://doi.org/10.1101/2024.04.04.588111
doi:
bioRxiv preprint
Summary
Standard single-cell RNA-sequencing analysis (scRNA-seq) workflows consist of converting
raw read data into cell-gene count matrices through sequence alignment, followed by anal-
yses including filtering, highly variable gene selection, dimensionality reduction, clustering,
and differential expression analysis. Seurat and Scanpy are the most widely-used packages
implementing such workflows, and are generally thought to implement individual steps sim-
ilarly. We investigate in detail the algorithms and methods underlying Seurat and Scanpy
and find that there are, in fact, considerable differences in the outputs of Seurat and Scanpy.
The extent of differences between the programs is approximately equivalent to the variabil-
ity that would be introduced in benchmarking scRNA-seq datasets by sequencing less than
5% of the reads or analyzing less than 20% of the cell population. Additionally, distinct
versions of Seurat and Scanpy can produce very different results, especially during parts of
differential expression analysis. Our analysis highlights the need for users of scRNA-seq to
carefully assess the tools on which they rely, and the importance of developers of scientific
software to prioritize transparency, consistency, and reproducibility for their tools.
Keywords:
single-cell RNA-seq, Scanpy, Seurat, open source software
2
.
CC-BY 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted April 11, 2024.
;
https://doi.org/10.1101/2024.04.04.588111
doi:
bioRxiv preprint
1 Introduction
Single-cell RNA-sequencing (scRNA-seq) is a powerful experimental method that provides
cellular resolution to gene expression analysis. In tandem with the widespread adoption
of scRNA-seq technologies, there has been a proliferation of methods for the analysis of
scRNA-seq data
1
. However, despite the large number of tools that have been developed,
the majority of analysis of scRNA-seq takes place with one of two analysis platforms: Seu-
rat
2
or Scanpy
3
. These programs are ostensibly thought to implement the same, or very
similar, workflows for analysis
4
. The first step in the computational analysis of scRNA-seq
results is converting the raw read data into a cell-gene count matrix
X
, where entry
X
ig
is the number of RNA transcripts of gene
g
expressed by cell
i
. Typically, cells and genes
are filtered to remove poor-quality cells and minimally expressed genes. Then, the data are
normalized to control for non-meaningful sources of variability, such as sequencing depth,
technical noise, library size, and batch effects. Highly variable genes (HVGs) are then se-
lected from the normalized data to identify potential genes of interest and to reduce the
dimensionality of the data. Subsequently, gene expression values are scaled to a mean of
zero and variance of one across cells. This scaling is done primarily to be able to apply prin-
cipal component analysis (PCA) to further reduce dimensionality, and to provide meaningful
embeddings that describe sources of variability between cells. The PCA embeddings of the
cells are then passed through a k-nearest neighbors (KNN) algorithm in order to describe
the relationships of cells to each other based on their gene expression. The KNN graph is
used to produce an undirected shared nearest neighbor (SNN) graph for further analysis,
and the nearest neighbor graph(s) are passed into a clustering algorithms to group similar
cells together. The graph(s) are also used for further non-linear dimensionality reduction
with t-distributed stochastic neighbour embedding (t-SNE) or Uniform Approximation and
Projection method (UMAP) to graphically depict the structure of these neighborhoods in
two dimensions. Finally, cluster-specific marker genes are identified through differential ex-
pression (DE) analysis, in which each gene’s expression is compared between each cluster
and all other clusters and quantified with a fold-change and p-value.
Seurat, written in the programming language R in 2015, is particularly favored in the bioin-
formatics community as one of the first platforms for comprehensive scRNA-seq analysis
2
.
Scanpy is a Python-based tool that was developed after Seurat in 2017 and now offers a
similar set of features and capabilities
3
. Both tools have a wide range of options for analysis
and active communities. The choice between Seurat and Scanpy often boils down to the
user’s programming preference.
The input to Seurat and Scanpy is a cell-gene count matrix, with two popular packages for
count matrix generation being Cell Ranger and kallisto-bustools (kb). Cell Ranger, devel-
oped by 10x Genomics, is specifically optimized for processing data from the Chromium
platform, providing a solution that includes barcode processing, read alignment (using the
STAR aligner
5
), and gene expression analysis
6
. It is popular for its user-friendliness and
seamless integration with 10x Genomics data. However, Cell Ranger’s robustness comes
with the trade-off of high computational demands, particularly for larger datasets
7
. On
the other hand, kb
7;8
is an open-source alternative to Cell Ranger known for its efficiency
3
.
CC-BY 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted April 11, 2024.
;
https://doi.org/10.1101/2024.04.04.588111
doi:
bioRxiv preprint
and speed
7
. kb-python is a wrapper around kallisto
9
and bustools
10
, which pseudoaligns
reads to produce a barcode, unique molecular identifier (UMI), set (BUS) file, which is then
processed into a cell-by-gene count matrix. Utilizing the kallisto pseudoalignment algorithm
and the bustools toolkit, kb provides a fast, lightweight solution for quantifying transcript
abundances and handling BUS files. This efficiency makes it particularly suitable for en-
vironments with constrained computational resources. Additionally, kb is accurate
11
and
stands out for its flexibility, allowing researchers to tailor the analysis pipeline to a broader
range of experimental designs and research needs. New versions of each of these packages
are periodically released, contributing improvements in algorithm design and efficiency, new
capabilities, and integration with new sequencing technologies.
A “standard” scRNA-seq experiment costs thousands of dollars, with exact pricing influenced
largely by data size. While it is difficult to provide an exact cost as a result of variability
between methods, it is estimated that a typical sequencing kit costs approximated in the
range of hundreds to thousands of dollars, and sequencing costs add up to an additional
$
5
per million reads
12
. The necessary number of reads per cell for high-quality data depends
on the context of the experiment, but as an example, Cell Ranger typically recommends
20,000 read pairs per cell for its v3 technologies, and 50,000 read pairs per cell for its
v2 technologies
13
. Sample preparation also has substantial costs, often requiring precious
patient samples, or maintenance of cell or animal lines for months to years in preparation
for experimental analysis. A standard 10x Genomics scRNA-seq experiment sequences tens
of millions to billions of reads, with a recommended cell count ranging from 500-10,000+
depending on the context. These estimates do not factor in additional costs including labor,
experimental setup, and follow-up analysis. Therefore, it is desirable to try to achieve a
middle ground between dataset richness and experimental costs, which requires evaluating
the additional information provided by marginal increases in data size.
A typical implicit assumption in bioinformatics data analysis is that the choice among pack-
ages and versioning should have little to no impact on the interpretation of results. However,
sizeable variability has been observed between packages or versions, even when performing
otherwise similar or seemingly identical analyses
14
. The goal of this study is to quantify the
variability in the standard scRNA-seq pipeline between packages (i.e., Seurat vs. Scanpy)
and between multiple versions of the same package (i.e., Seurat v5 vs. v4, Scanpy v1.9 vs.
v1.4, Cell Ranger v7 vs. v6). Additionally, we quantify the variability introduced through a
range of read or cell downsampling and compare this to the variability between Seurat and
Scanpy.
2 Results
2.1 Seurat and Scanpy Show Considerable Differences in ScRNA-
seq Workflow with Defaults
Figure 1 shows the results of comparing Seurat v5.0.2 and Scanpy v1.9.5 with default settings
using the PBMC 10k dataset, demonstrating the typical variability to be expected between
4
.
CC-BY 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted April 11, 2024.
;
https://doi.org/10.1101/2024.04.04.588111
doi:
bioRxiv preprint
the two implementations of the “standard” single-cell RNA-seq workflow. Additional pipeline
settings run were those with aligned function argument values (Supp Fig 2), identical input
data preceding each step (Supp Fig 3), and both aligned function argument values and
identical input data preceding each step in the manner of Seurat (Supp Fig 4) and Scanpy
(Supp Fig 5).
There was no difference in cell or gene filtering between the packages after filtering UMIs,
minimum genes per cell, minimum cells per gene, and maximum mitochondrial gene content
(Fig 1a, i-ii). Furthermore, given the same matrices as input, Seurat and Scanpy handled log
normalization identically as well, producing equivalent output (data not shown). However,
the programs deviated from their default algorithm for HVG selection, with a Jaccard index
(intersection over union between two sets) of 0.22 (Fig 1a, iii). This difference could be re-
solved either by selecting the “seurat
v3” flavor for Scanpy or the “mean.var.plot” algorithm
for Seurat (Supp Fig 2a, Supp Fig 5a).
Further differences were observed with PCA analysis, which also yielded different results
when run with default parameters. The PCA plots showed noticeable differences in the
plotted positions of each cell on the PC1-2 space, although the same general shape of the
plot is preserved (Fig 1b, i). The Scree plots also displayed differences, most notably with the
proportion of variance explained by the first PC differing by 0.1 (Fig 1b, ii). The eigenvectors
demonstrated differences, with the angle between the first PC vectors having a sine of 0.1,
the angle between the second PCs having a sine of 0.5, i.e., 30 degrees apart, and PCs 3+
being nearly orthogonal (Fig 1b, ii). All of these changes could be resolved with HVG-
set standardization and with the clipping and regression settings prior to PCA adjusted
accordingly (Supp Fig 2b, Supp Fig 5b). Seurat, by default, clips values to a maximum of
10 during scaling and does not perform any regression, whereas Scanpy, by default, does not
implement clipping and regresses by total counts and percentage of mitochondrial content.
Next, the packages differed substantially in their production of an SNN graph. Both the
content and size of each neighborhood per cell differed greatly (Fig 1c). The median Jaccard
index between the neighborhood of each cell from Seurat and Scanpy was 0.11, and the
median degree ratio (Seurat/Scanpy) magnitude was 2.05. The degree ratio for each was
nearly always greater than 1, indicating that Seurat, by default, yields more highly connected
SNN graphs than Scanpy. Given that the points in Fig 1c are distributed relatively evenly
between 0 and the maximum potential Jaccard index across all degree ratios, it appears
that it is not simply the degree difference driving the low median Jaccard index. When
aligning the function arguments when generating the SNN graph, there was no qualitative
improvement in median degree ratio magnitude, but there was a slight improvement in
median Jaccard index (Supp Fig 4c, Supp Fig 5c).
Clustering with default settings also resulted in differences in output, as seen by the discor-
dance in the alluvial plot and the Adjusted Rand Index (ARI) of 0.53 (Fig 1d). Alluvial
plots were aligned to maximize cluster alignment and coloring between groups, as to allow vi-
sual discordance to correlate with dissimilarity (see Methods). Even when aligning function
arguments and input SNN graphs, Seurat and Scanpy demonstrated differences in Louvain
5
.
CC-BY 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted April 11, 2024.
;
https://doi.org/10.1101/2024.04.04.588111
doi:
bioRxiv preprint
clustering (Supp Fig 4d), but were identical in their implementation of the Leiden algorithm
(Supp Fig 5d).
UMAP plots visually showed some differences in the shapes of local and neighboring clusters,
even when controlling for global shifts or rotations (Fig 1d). For instance, the Seurat UMAP
shows that clusters 7 (pink) and 8 (grey) are very distinct from all other clusters on the
plot, whereas in the Scanpy UMAP plot, the analogous clusters are located much closer
to other clusters, such as 19 (black) and 20 (orange). Comparing neighborhood similarity
of KNN graphs constructed from these UMAP data revealed poor neighborhood overlap
that modestly improves as the similarity between function arguments and preceding input
are aligned (Supp Fig 6). Performing Leiden clustering and subsequent UMAP plotting
of these UMAP-derived KNN graphs revealed that the general characteristics of the UMAP
plots between packages were maintained, but there were still some considerable irreconcilable
differences (Supp Fig 6).
Upon DE analysis, Seurat and Scanpy overlapped with a Jaccard index of 0.62 for their
significant marker genes (i.e., the total set of genes with adjusted p-value
<
0
.
05 across all
clusters), but Seurat had approximately 50% more significant marker genes than Scanpy.
(Fig 1e). The difference in significant marker genes is a result of a few differences in default
settings between packages. First, each package implements the Wilcoxon function separately,
with Seurat requiring tie correction and Scanpy by default omitting tie correction. Addition-
ally, each package adjusts p-values differently by default - Seurat with Bonferroni multiple
testing correction, and Scanpy with Benjamini-Hochberg multiple testing correction. Finally,
Seurat, by default, filters markers by p-value, percentage of cells per group possessing the
gene, and log-fold change (logFC) prior to performing the Wilcoxon rank-sum test; Scanpy
does not perform this type of filtering without invoking additional functions. Setting the
filtering arguments and clusters of Scanpy to be the same as Seurat (filtering, tie-correction,
Bonferroni correction) for DE analysis improved the Jaccard index of significant marker gene
overlap to 0.73 (Supp Fig 2e), and providing the same cluster assignments further improved
the Jaccard index to 0.99 (Supp Fig 4e, i). The remaining 1% of genes differ as a result
of differences in logFC calculation discussed later. Setting the methods to be like Scanpy
(no filtering, Benjamini-Hochberg) worsened the Jaccard index to 0.38, as a result of the
inability to turn off tie correction in Seurat (Supp Fig 5e, i).
When aligning the cluster assignments between groups, further DE analysis can be performed
that compares differences in expression levels per gene per cluster. In addition to comparing
the sets of significant marker genes across all clusters, the similarity in markers (i.e., genes
per cluster after DE analysis and any potential filtering for differential expression) can be
compared. As mentioned previously, Scanpy’s lack of filtering means that Scanpy includes
all genes in all clusters, even when that gene is minimally- or non-differentially expressed in
that cluster; whereas Seurat, by default, includes only a small percentage of genes per cluster
based on logFC, p-value, and number of cells expressing the gene in the reference groups
(Supp Fig 3e, ii). Applying analogous thresholding to Scanpy vastly reduces the problem,
increasing the Jaccard index from 0.22 to 0.92, but not fulling resolving the discrepancy
(Supp 4e, ii). Removing all filtering from Seurat fully removes all differences in marker sets
6
.
CC-BY 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted April 11, 2024.
;
https://doi.org/10.1101/2024.04.04.588111
doi:
bioRxiv preprint
between packages (Supp Fig 5e, ii).
Seurat and Scanpy compute logFC differently as well. Comparing each analogous gene per
cluster across packages resulted in a concordance correlation coefficient (CCC) of 0.98 and
a PCA fit line with a slope of 1, indicating strong correlation across packages. Briefly, CCC
measures the agreement between two variables both in terms of correlation and variance.
However, observing the scatterplot of logFC values revealed noticeable differences in a large
number of values (Supp Fig 3, iii). Specifically, there were a handful of cases (4,109 out of
135,185 markers) where Scanpy predicted a logFC near
±
30 for a gene in a cluster while
Seurat predicted a logFC near 0. The reasons for this are elaborated in the Discussion
Regarding adjusted p-value, there were also differences between Seurat and Scanpy (Supp
Fig 3e, iv). With default function arguments, Seurat predicted p-values either less than or
similar to Scanpy, but never substantially greater. Most p-values were near the maximum
of 1, but there was a wide degree of variability. A considerable number of p-values were
far from the y=x line, including those below 1e-50 for Seurat but near 1 for Scanpy. 20%
of markers had their p-values flip across the p=0.05 threshold between packages, with it
being fairly even flipping in either direction (i.e., significant only in Seurat, or significant
only in Scanpy). When function arguments were aligned to be like Seurat, virtually all
differences in adjusted p-value disappeared (Supp Fig 4e, iv). However, the differences could
not be reconciled with Scanpy-like function arguments, due to the lack of ability to toggle
tie correction in Seurat’s/presto’s Wilcoxon rank sum calculation.
7
.
CC-BY 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted April 11, 2024.
;
https://doi.org/10.1101/2024.04.04.588111
doi:
bioRxiv preprint
Figure 1:
Seurat and Scanpy show considerable differences in scRNA-seq workflow results
with default function arguments. (A) Filtering and HVG selection analysis UpSet plots
consisting of overlap of sets of cells, genes, and HVGs. (B) PCA analysis through projection
onto first 2 PCs, Scree plot comparison of proportion of variance explained, and sine of
eigenvectors. Black lines = point mapping between conditions. (C) KNN/SNN analysis
through SNN neighborhood Jaccard index and degree ratio (Seurat/Scanpy) per cell. (D)
Clustering and UMAP analysis through UMAP plots of each condition, with alluvial plot
showing cluster assignment mapping and degree of agreement. Numbers at the bottom of
each alluvial plot show the total number of clusters in each group. (E) Differential expression
analysis UpSet plot through overlap of all significant (p
<
0
.
05) marker genes across all
clusters.
8
.
CC-BY 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted April 11, 2024.
;
https://doi.org/10.1101/2024.04.04.588111
doi:
bioRxiv preprint
The comparison of Seurat and Scanpy reveals that, in some but not all cases, the results
between programs can be reconciled. There are three classes of possible alignment between
functions: aligned by default, aligned when function arguments are matched, and incompat-
ible for alignment. The classification of each function into these classes is shown in Table 1.
Supplemental Tables 1 (Seurat) and 2 (Scanpy) go into detail for each step of analysis on the
function name, the default arguments, the arguments needed to match the other package as
closely as possible, and the parameters unique to that package.
Table 1:
Seurat and Scanpy function agreement for scRNA-seq pipeline. Green = equivalent
by default; yellow = equivalent with matched arguments; red = incompatible. HVG = Highly
Variable Genes; PCA = Principal Component Analysis; SNN = Shared Nearest Neighbors;
UMAP = Uniform Manifold Approximation and Projection; DE = Differential Expression
2.2 Read and Cell Downsampling Retain Most Information Com-
pared to Seurat vs. Scanpy Down to Small Fractions of Dataset
Size
Given the variability introduced between packages, a natural question that arises is how to
benchmark the magnitude of these differences. To this end, we simulated the downsampling
of reads and cells before generating the filtered count matrix and compared the differences
introduced along a gradient of downsampled fractions to the full-size data. We performed
each step of the analysis with default function arguments for the respective package and
without aligning input data preceding each step, except for those steps of DE analysis which
required input aligning before each step in order to compare marker gene statistics in iden-
tical clusters (marker selection, logFC, and adjusted p-value). For each step of analysis, in
addition to generating all plots as in Fig 1, we selected a single numeric metric that would
capture the degree of variability between groups as follows:
Cell filtering: Jaccard index of cell sets
Gene filtering: Jaccard index of gene sets
9
.
CC-BY 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted April 11, 2024.
;
https://doi.org/10.1101/2024.04.04.588111
doi:
bioRxiv preprint
HVG selection: Jaccard index of HVG sets
PCA: Mean of difference in corresponding PC loadings PCs 1-3
KNN/SNN: Median magnitude of log of SNN degree ratio
Clustering: ARI
UMAP: Median Jaccard index of UMAP-derived KNN neighborhoods across all cells
Marker gene selection: Jaccard index of significant marker gene sets
Marker selection: Jaccard index of marker sets
logFC: CCC
Adjusted p-value: Fraction of adjusted p-values that flipped across the p=0.05 thresh-
old between conditions
The summary for the fraction of downsampling sufficient to achieve results at least as good as
the variability between default Seurat vs. Scanpy within a 5% margin is shown in Figure 2.
These methods are especially robust to read downsampling, with most steps achieving similar
results with less than 5% of the original reads present (Fig 2a). The methods are also robust
to cell downsampling, although to a lesser extent, with most methods performing similarly
to baseline at less than 25% of the original number of cells (Fig 2b). The step least robust
to downsampling for both reads and cells was gene selection; however, given the similarity
of HVGs and marker genes to the full-size dataset across downsampled fractions, it appears
that the difference in gene sets lies largely in less significant genes. Plots similar to Figure
1 for downsampled fractions of reads and cells, for both Seurat and Scanpy, that generally
achieve similar performance to the variability introduced between default Seurat vs. Scanpy
can be found in Supplemental Figures 7-10. Metric calculation across downsampled fractions
can be found in Supplemental Figures 11-12.
Figure 2:
(A) Read and (B) cell downsampling retain most information compared to Seurat
vs. Scanpy down to small fractions of dataset size. A minimum downsampled fraction of
0.01 was used as a lower bound. Orange = Seurat; blue = Scanpy.
10
.
CC-BY 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted April 11, 2024.
;
https://doi.org/10.1101/2024.04.04.588111
doi:
bioRxiv preprint