Modular, efficient and constant-memory single-cell RNA-seq preprocessing

Melsted, Páll and Booeshaghi, A. Sina and Liu, Lauren and Gao, Fan and Lu, Lambda and Min, Kyung Hoi and da Veiga Beltrame, Eduardo and Hjorleifsson, Kristján Eldjárn and Gehring, Jase and Pachter, Lior (2021) Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nature Biotechnology. ISSN 1087-0156. doi:10.1038/s41587-021-00870-2. (In Press)

Supplemental Material:
PDF (Supplementary Figs. 1–15, Note and Table 2)
PDF (Reporting Summary)
MS Excel (Supplementary Table 1)
MS Excel (Supplementary Table 3)
See Usage Policy.


We describe a workflow for preprocessing of single-cell RNA-sequencing data that balances efficiency and accuracy. Our workflow is based on the kallisto and bustools programs, and is near optimal in speed with a constant memory requirement providing scalability for arbitrarily large datasets. The workflow is modular, and we demonstrate its flexibility by showing how it can be used for RNA velocity analyses.

Item Type: Article
Related URLs:
ReadCube access; Data (14 items, including genome annotations and reference transcriptomes); Code (3 items); Documentation and tutorials
ORCID:
Author | ORCID
Melsted, Páll | 0000-0002-8418-6724
Booeshaghi, A. Sina | 0000-0002-6442-4502
Lu, Lambda | 0000-0002-7092-9427
Min, Kyung Hoi | 0000-0003-0894-4017
da Veiga Beltrame, Eduardo | 0000-0002-1529-9207
Hjorleifsson, Kristján Eldjárn | 0000-0002-7851-1818
Gehring, Jase | 0000-0002-3894-9495
Pachter, Lior | 0000-0002-9164-6231
Additional Information: © 2021 Nature Publishing Group. Received 07 August 2019; Accepted 09 February 2021; Published 01 April 2021.
Acknowledgments: We thank V. Ntranos and V. Svensson for helpful suggestions and comments. We thank J. Farrell for the D. rerio gene annotation used to process SRR6956073, J. Schiefelbein for the A. thaliana gene annotation used to process SRR8257100, J. Fear for the D. melanogaster gene annotation used to process SRR8513910, and J. Kim and Q. Zhu for the C. elegans gene annotation used to process SRR8611943. The benchmarking work was made possible, in part, thanks to support from the Beckman Institute Caltech Bioinformatics Resource Center. A.S.B. and L.P. were funded in part by NIH U19MH114830.
Data availability: A diverse set of 20 datasets was compiled for the purpose of benchmarking preprocessing workflows. Datasets produced and distributed by 10x Genomics were downloaded from the 10x Genomics data downloads page. Six v3 chemistry datasets and two v2 chemistry datasets were downloaded and processed (Supplementary Table 3). Another 12 datasets were obtained from either the SRA or the European Nucleotide Archive; all were produced with 10x Genomics v2 chemistry. For six of the datasets (SRR6956073, SRR6998058, SRR7299563, SRR8206317, SRR8327928 and SRR8524760), the BAM files were downloaded and the Cell Ranger utility bamtofastq was run to produce FASTQ files for preprocessing from Cell Ranger–structured BAM files. FASTQ files were downloaded directly for the datasets E-MTAB-7320, SRR8257100, SRR8513910, SRR8599150, SRR8611943 and SRR8639063.
Code availability: The software versions used for the results in the paper were: Alevin v0.13.1, bustools v0.39.1, Cell Ranger v3.0.0, DropletUtils v1.6.1, kallisto v0.46.0, Python 3.7, R v3.5.2, Scanpy v1.4.1, scvelo 0.1.17, Seurat v3.0, snakemake v5.3.0, STARsolo v2.7.0e, velocyto v0.17.17, wc v8.22 (GNU coreutils) and zcat v1.5 (gzip). All programs were run with default options unless otherwise specified. The code to reproduce the findings of this paper, kallisto and bustools are each available online, as are documentation and tutorials for using the kallisto bustools scRNA-seq workflow. Details of all datasets and their accession numbers can be found in Supplementary Table 3. All genome annotations and reference transcriptomes are also available online.
These authors contributed equally: Páll Melsted, A. Sina Booeshaghi.
Author Contributions: P.M., A.S.B., L. Liu and L.P. developed the algorithms for bustools, and P.M., A.S.B. and L. Liu wrote the software. A.S.B. conceived of and performed the UMI and barcode calculations motivating the algorithms. F.G. implemented and performed the benchmarking procedure, and curated indices for the datasets. A.S.B. and E.d.V.B. designed and produced the comparisons between Cell Ranger and kallisto bustools. L. Lu investigated in detail the performance of different workflows on the “10k mouse neuron” data and produced the analysis of that dataset. A.S.B. designed the RNA velocity workflow and performed the RNA velocity analyses. K.M.H. contributed to the development of the reproducible workflow. K.E.H. developed and investigated the effect of reference transcriptome sequences for pseudoalignment. J.G. interpreted results and helped to supervise the research. A.S.B. planned, organized and prepared figures. A.S.B., E.d.V.B., P.M. and L.P. planned the manuscript. A.S.B. and L.P. wrote the manuscript.
Competing interests: The authors declare no competing interests.
Peer review information: Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.
Funders:
Funding Agency | Grant Number
Caltech Beckman Institute | UNSPECIFIED
Subject Keywords: Genome informatics; Software; Transcriptomics
Record Number: CaltechAUTHORS:20210405-142728694
Persistent URL:
Official Citation: Melsted, P., Booeshaghi, A.S., Liu, L. et al. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nat Biotechnol (2021).
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 108622
Deposited By: Tony Diaz
Deposited On: 07 Apr 2021 23:48
Last Modified: 07 Apr 2021 23:48
