The impact of package selection and versioning on single-cell RNA-seq analysis
Abstract
Standard single-cell RNA-sequencing (scRNA-seq) analysis workflows consist of converting raw read data into cell-gene count matrices through sequence alignment, followed by analyses including filtering, highly variable gene selection, dimensionality reduction, clustering, and differential expression analysis. Seurat and Scanpy are the most widely used packages implementing such workflows, and are generally thought to implement individual steps similarly. We investigate in detail the algorithms and methods underlying Seurat and Scanpy and find that their outputs in fact differ considerably. The extent of the differences between the programs is approximately equivalent to the variability that would be introduced in benchmarking scRNA-seq datasets by sequencing less than 5% of the reads or analyzing less than 20% of the cell population. Additionally, distinct versions of Seurat and Scanpy can produce very different results, especially during parts of differential expression analysis. Our analysis highlights the need for users of scRNA-seq to carefully assess the tools on which they rely, and the importance of developers of scientific software prioritizing transparency, consistency, and reproducibility for their tools.
Copyright and License
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Acknowledgement
This work was supported in part by NIH 5UM1HG012077-02. D.K.S. was funded by the UCLA-Caltech Medical Scientist Training Program (NIH NIGMS training grant T32 GM008042). We thank Bernadett Gaál for feedback on dedicating a study to package comparisons in the scRNA-seq workflow. We thank Tara Chari for providing feedback and assisting with data management. The authors acknowledge the Howard Hughes Medical Institute for funding A.S.B. through the Hanna H. Gray Fellows program. We thank the Caltech Bioinformatics Resource Center for providing computing resources during the development of the project.
Contributions
Work on this paper was led by J.M.R., who implemented the comparisons, produced the results, and drafted an initial version of this manuscript. The project emerged from discussions among various combinations of the authors: J.M.R., L.M., P.H.E., K.J., L.L., A.S.B., S.A., D.K.S., N.B., P.M., L.P. The methods for benchmarking and comparisons of results were developed by J.M.R., L.M., L.P.; Software was written primarily by J.M.R. with the help of L.M.; Formal analysis and investigation were conducted by J.M.R., L.M., L.P.; Writing – Original Draft, J.M.R.; Writing – Review & Editing, J.M.R., L.M., P.H.E., K.J., L.L., A.S.B., S.A., D.K.S., N.B., P.M., L.P.; Visualization, J.M.R., L.M., L.P.; Funding Acquisition, L.P.; Resources, L.P.; Supervision, L.P.
Data Availability
All original code has been deposited at GitHub and is publicly available as of the date of publication. The Docker image that provides the virtual environment in which all analysis was performed has been deposited at DockerHub and is publicly available as of the date of publication. Count matrix generation from FASTQ files was performed in a conda environment, which is also deposited in the GitHub repository. The original FASTQ files of the PBMC 10k dataset, shown in figures, and the PBMC 5k dataset, used for validation (figures not shown), can be found on the 10x Genomics website. The count matrices generated with kb-python and Cell Ranger from the PBMC 10k dataset FASTQ files have been deposited at Box and are publicly available as of the date of publication. Links to all materials can be found in the Key Resources table.
Conflict of Interest
The authors declare no competing interests.
Files
Checksum | Size |
---|---|
md5:689c9fc16a69162f5e5c6c28f3f16afc | 3.0 MB |
Additional details
- PMCID: PMC11014608
- National Institutes of Health: 5UM1HG012077-02
- National Institutes of Health: UCLA-Caltech Medical Scientist Training Program T32 GM008042
- Howard Hughes Medical Institute
- Caltech groups: Tianqiao and Chrissy Chen Institute for Neuroscience, Division of Biology and Biological Engineering