Published April 11, 2024 | Submitted
Discussion Paper Open

The impact of package selection and versioning on single-cell RNA-seq analysis

Abstract

Standard single-cell RNA-sequencing (scRNA-seq) analysis workflows consist of converting raw read data into cell-gene count matrices through sequence alignment, followed by analyses including filtering, highly variable gene selection, dimensionality reduction, clustering, and differential expression analysis. Seurat and Scanpy are the most widely used packages implementing such workflows, and are generally thought to implement individual steps similarly. We investigate in detail the algorithms and methods underlying Seurat and Scanpy and find that there are, in fact, considerable differences in their outputs. The extent of differences between the programs is approximately equivalent to the variability that would be introduced in benchmarking scRNA-seq datasets by sequencing less than 5% of the reads or analyzing less than 20% of the cell population. Additionally, distinct versions of Seurat and Scanpy can produce very different results, especially during parts of differential expression analysis. Our analysis highlights the need for users of scRNA-seq to carefully assess the tools on which they rely, and the importance of developers of scientific software prioritizing transparency, consistency, and reproducibility in their tools.
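
As an illustration of how differences between two packages' outputs might be quantified, the sketch below computes the Jaccard index between two sets of highly variable genes. This is only a minimal example under stated assumptions: the function name `jaccard_index` and both gene lists are hypothetical, made up for illustration, and the metric is not necessarily the one used in the paper.

```python
def jaccard_index(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections of gene names."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention assumed here: two empty sets count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical gene lists, invented for illustration only; they are not
# results from the paper or from either package.
hvgs_tool_a = ["CD3E", "MS4A1", "LYZ", "NKG7", "GNLY"]
hvgs_tool_b = ["CD3E", "MS4A1", "LYZ", "PPBP", "FCGR3A"]

overlap = jaccard_index(hvgs_tool_a, hvgs_tool_b)
print(f"Jaccard index: {overlap:.2f}")
```

A value of 1 would mean the two tools selected identical gene sets, and a value of 0 would mean no genes in common.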

Copyright and License

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Acknowledgement

This work was supported in part by NIH 5UM1HG012077-02. D.K.S. was funded by the UCLA-Caltech Medical Scientist Training Program (NIH NIGMS training grant T32 GM008042). We thank Bernadett Gaál for feedback on dedicating a study to package comparisons in the scRNA-seq workflow. We thank Tara Chari for providing feedback and assisting with data management. The authors acknowledge the Howard Hughes Medical Institute for funding A.S.B. through the Hanna H. Gray Fellows program. We thank the Caltech Bioinformatics Resource Center for providing computing resources during the development of the project.

Contributions

Work on this paper was led by J.M.R., who implemented the comparisons, produced the results, and drafted an initial version of this manuscript. The project emerged from discussions among various combinations of the authors: J.M.R., L.M., P.H.E., K.J., L.L., A.S.B., S.A., D.K.S., N.B., P.M., L.P. The methods for benchmarking and comparisons of results were developed by J.M.R., L.M., L.P.; software was written primarily by J.M.R. with the help of L.M.; formal analysis and investigation were conducted by J.M.R., L.M., L.P.; Writing – Original Draft, J.M.R.; Writing – Review & Editing, J.M.R., L.M., P.H.E., K.J., L.L., A.S.B., S.A., D.K.S., N.B., P.M., L.P.; Visualization, J.M.R., L.M., L.P.; Funding Acquisition, L.P.; Resources, L.P.; Supervision, L.P.

Data Availability

All original code has been deposited at GitHub and is publicly available as of the date of publication. The Docker image that provides the virtual environment in which all analysis was performed has been deposited at DockerHub and is publicly available as of the date of publication. Count matrix generation from FASTQ files was performed in a conda environment, which is also deposited in the GitHub repository. The original FASTQ files of the PBMC 10k dataset, shown in figures, and the PBMC 5k dataset, used for validation (figures not shown), can be found on the 10x Genomics website. The count matrices generated with kb-python and Cell Ranger from the PBMC 10k dataset FASTQ files have been deposited at Box and are publicly available as of the date of publication. Links to all materials can be found in the Key Resources table.

Conflict of Interest

The authors declare no competing interests.

Files

nihpp-2024.04.04.588111v2.pdf (3.0 MB)
md5:689c9fc16a69162f5e5c6c28f3f16afc

Additional details

Created: May 6, 2024
Modified: May 6, 2024