Modular, efficient and constant-memory single-cell RNA-seq preprocessing

We describe a workflow for preprocessing of single-cell RNA-sequencing data that balances efficiency and accuracy. Our workflow is based on the kallisto and bustools programs, and is near optimal in speed with a constant memory requirement providing scalability for arbitrarily large datasets. The workflow is modular, and we demonstrate its flexibility by showing how it can be used for RNA velocity analyses. A preprocessing workflow for single-cell RNA-seq data achieves near-optimal speed.

that is untenable given the pace of improvement in technology and the corresponding increase in data volume.
In recent work, we introduced a format for scRNA-seq data that makes possible the development of efficient workflows by virtue of decoupling the computationally demanding step of associating reads to transcripts and genes (alignment) from the other steps required for scRNA-seq preprocessing 10 . This format, called BUS (barcode, UMI, set), can be produced by pseudoalignment, and rapidly manipulated by a suite of tools called bustools (Supplementary Table 2). To illustrate the utility, efficiency and flexibility of this approach for scRNA-seq preprocessing, we describe a Chromium preprocessing workflow based on reasoned choices for the key preprocessing steps. While we focus on Chromium, our workflow is general and can be used with other technologies. We show that our preprocessing workflow is faster and has lower memory requirements than existing methods, and we demonstrate the power of modular processing with the BUS format by developing a fast RNA velocity analysis workflow 11 . We also validate the design decisions underlying the Cell Ranger workflow. Our benchmarking and testing is comprehensive, comprising analysis of almost two dozen datasets and surpassing the scale of testing that has been performed for current workflows. Documentation and tutorials for the kallisto bustools workflow are available at http://pachterlab.github.io/ kallistobustools.
In designing an scRNA-seq preprocessing workflow, we began by investigating each required step: correction of barcodes, collapsing of UMIs and assignment of reads to genes. To achieve single-cell resolution, the Chromium technology produces barcode sequences that are used to associate cDNA reads to individual cells. We began by considering the efficiency-accuracy trade-offs involved in grouping reads with the same, or similar, barcodes to define the contents of individual cells. The Chromium barcodes arise from a 'whitelist' , a set of pre-defined sequences that are included with the Cell Ranger software. Grouping reads by barcode is therefore straightforward, except for the fact that barcodes may contain sequencing errors. The Cell Ranger workflow corrects all barcodes that are one base-pair change away (Hamming distance 1) from barcodes in the whitelist. An examination of a benchmark panel of 20 datasets revealed that this error correction approach can be expected to rescue, on average, 0.8% of the reads in an experiment (Fig. 1a,b), a calculation based on an inferred error rate per base for each dataset (Methods and Supplementary Table 3). Thus, correction Letters NATuRe BiOTechNOlOgy of barcodes Hamming distance 2 away from whitelist barcodes would rescue, on average, a negligible number (0.0038%) of reads (Methods). We therefore implemented a Hamming distance 1 correction method in our workflow via the bustools correct command.
Next, in considering how to count UMIs to generate count matrices, we first investigated the extent to which 'collisions' occur (that is, cases where the same UMI occurs in reads originating from two different molecules 12 ; Fig. 1c). While inter-gene collisions can be directly measured, intra-gene collisions cannot be distinguished from molecules that are PCR duplicates. To estimate the intra-gene collision rate, we first calculated, for each cell in the benchmark panel, the effective number of UMIs in each of the associated droplets (Supplementary Fig. 1 and Supplementary Note). This estimate, along with the number of inter-gene collisions and distinct UMIs observed, allowed us to estimate the extent of intra-gene collision, and therefore the counts lost due to naive collapsing of UMIs by gene (Methods and Supplementary Note). We found that the average number of intra-gene collisions resulting in lost counts due to naive collapsing for the ten genes expressed at the highest levels across the 10xv2 datasets is 0.4%, and 0.17% for the 10xv3 datasets. The average percentage of lost counts per gene per cell due to naive collapsing was less than 0.003% for v2 chemistry and 0.000048% for v3 chemistry (Fig. 1d). Thus, we decided to apply naive collapsing as it is computationally efficient and effective based on empirical evidence. This was implemented in the bustools count command. Notably, the recently published Alevin collapsing algorithm 5 will overestimate gene counts because reads with the same UMI pseudoaligned to the same gene are very likely to be from the same molecule    (20 datasets) showing barcodes matching the whitelist ("Retained", white), barcodes that are Hamming distance 1 away from the whitelist that were corrected (black) and uncorrected barcodes (gray). c, UMI collapsing within genes. d, Fraction of UMIs lost per gene across cells in the benchmark panel due to over-collapsing. The average number of intra-gene collisions resulting in lost counts due to naive collapsing for the ten genes expressed at the highest levels across 10xv2 datasets is 0.4% and for 10xv3 datasets is 0.17%. The average loss for genes expressed at average levels is about 0.003%. e, Running time of kallisto bustools (orange), Cell Ranger (blue), Alevin (black) and STARsolo (green) for preprocessing the benchmark panel. f, Memory usage of kallisto bustools (orange), Cell Ranger (blue), Alevin (black) and STARsolo (green) for preprocessing the benchmark panel.
even if they pseudoalign to distinct transcripts. Such situations likely result from missing or incorrect annotation 13 rather than from collisions of two distinct molecules labeled with the same UMI. One implication of the UMI collapsing analysis is that UMI error correction is possible because UMIs with only one base-pair change away from an abundant UMI are likely to have resulted from sequencing error. To examine the benefit of such a correction, we computed the expected number of UMIs that would be corrected with Hamming distance 1 correction, and found that for 10-base-pair and 12-base-pair UMIs only 0.5% and 0.6% of reads would be recovered, respectively, at the error rates observed in published datasets (Methods and Supplementary Fig. 2). Moreover, such error correction would require identification of abundant UMIs in lieu of a whitelist, adding time and complexity to the workflow. While we believe such error correction may be warranted in the case of longer UMIs ( Supplementary Fig. 2), we did not include it in our workflow.
In most scRNA-seq preprocessing workflows, assignment of cDNAs to genes utilizes genome alignment 4,14,15 . As detailed base-pair alignment is not necessary to generate a count matrix, pseudoalignment to a reference transcriptome 8 suffices. Moreover, pseudoalignment has been shown to be highly concordant with alignment for the purposes of quantification in bulk RNA-seq 16 . To test this hypothesis, we compared counts obtained by pseudoalignment using the kallisto 8 and bustools programs with counts produced via Cell Ranger, which is based on the STAR aligner 17 . Analysis of the "10xv3 Mus musculus neuron 10k" scRNA-seq dataset ( Fig. 2) confirms that there is a high correlation between pseudoalignmentand alignment-based counting (see Supplementary Fig. 3 for other datasets); however, in one dataset ("pbmc10k_v3", Supplementary   Fig. 3.19) we found that pseudoalignment produced more counts than alignment. Specifically, in the FGF23 (ENSG00000118972) gene, Cell Ranger had many fewer counts than kallisto bustools. We hypothesized that the reason for this discrepancy was the presence of reads from unspliced transcripts crossing splice junction boundaries, and therefore being erroneously pseudoaligned to the transcriptome. To test this, we created a modified index that included a 90-base-pair overlap into the exon and the intron (one base pair less than the length of the reads) to capture such reads and confirmed that it resolved the discrepancy ( Supplementary Fig. 4). We observed this problem to be rare and therefore did not deem it essential to account for it in a standard processing workflow. It may be that consideration of such reads will be crucial for nuclear scRNA-seq analyses 18 , when the abundance of such intronic junction reads will be problematic for naive pseudoalignment.
Thus, our workflow consists of pseudoalignment of reads to a reference transcriptome to generate a BUS file, and subsequent processing to correct barcode errors and produce a count matrix showing, for a given UMI count (x axis), the number of cells that contain at least that many UMI counts (y axis). The dashed lines correspond to the Cell Ranger filtered cells. b, Correspondence in the number of distinct UMIs per cell between the workflows. c, Genes detected by kallisto bustools and Cell Ranger as a function of distinct UMI counts per cell. d, Pearson correlation between gene counts as a function of the distinct UMI counts per cell. e, The l 1 distance between gene abundances for each kallisto bustools cell and its nearest neighbor plotted against each kallisto bustools cell and its corresponding Cell Ranger cell (orange), and the l 1 distance between the gene abundances for each Cell Ranger cell and its nearest neighbor plotted against each Cell Ranger cell and its corresponding kallisto bustools cell (blue). Marginal distributions show that each kallisto bustools cell is closest to its corresponding Cell Ranger cell and that each Cell Ranger cell is closest to its corresponding kallisto bustools cell. f, kallisto bustools (top panel) and Cell Ranger (bottom panel) t-SNE from the first ten principal components. g, A quantile-quantile plot comparing the observed distribution of P values of GSEA, after Bonferroni correction for multiple testing across ontologies and datasets, with the expected distribution of a uniform distribution between 0 and 1. If the observed distribution does not deviate from the expected distribution, then the points should lie close to the diagonal line, y = x. The gray ribbon around the line is the 95% confidence interval. Here most Gene Ontology (GO) terms have adjusted P = 1, meaning that most GO terms are very depleted of genes differentially expressed between the kallisto bustools and Cell Ranger matrices. GO terms above y = x are labeled. Generally, GO terms significantly enriched among differentially expressed genes are related to ribosomal proteins; specifically, the GO terms 1, 2 and 3 correspond to structural constituent of ribosome, cytosolic large ribosomal subunit and cytosolic small ribosomal subunit. The points are colored by ontology: biological processes (BP), cellular components (CC) and molecular functions (MF).

NATuRe BiOTechNOlOgy
( Supplementary Fig. 5). To ensure that memory usage is constant in the number of reads, the BUS files are sorted by barcode before counting using the bustools sort command (Supplementary Note).
While this workflow is very similar to that of Cell Ranger, it is not identical. As Cell Ranger is widely used, we investigated the extent to which the Cell Ranger results are concordant with our workflow.

NATuRe BiOTechNOlOgy
We processed 20 datasets (Supplementary  [24][25][26][27][28] and Rattus norvegicus 28 ). We found a high degree of concordance with respect to quality control metrics (Fig. 2a-g and Supplementary Fig. 3). Crucially, in all datasets, in a joint analysis of kallisto bustools and Cell Ranger counts, the closest cell to a kallisto bustools cell was its associated Cell Ranger cell (that is, the Cell Ranger cell with the same barcode sequence). Furthermore, gene count correlations between individual cells passing Cell Ranger filtering criteria were almost always above 0.90, and frequently as high as 0.99. To assess the extent to which differences between Cell Ranger and kallisto bustools affect biological inferences, we also compared Cell Ranger to kallisto bustools results in a downstream analysis of the "10x Genomics E18 mouse 10k brain cells" dataset. We found that pseudotime trajectory inference for neuronal precursor cells produced highly concordant results between Cell Ranger and kallisto bustools, with the same trajectory topology and similar pseudotime values along the trajectory (Supplementary Fig. 6). This result is consistent with other analysis comparisons. Projections of Cell Ranger and kallisto bustools cells to the first two principal components and to two dimensions of t-distributed stochastic neighbor embedding (t-SNE) are very similar (Supplementary Fig. 7.1). The results of Leiden clustering 29 are also similar between the two preprocessing workflows (Supplementary Fig. 7.2). We performed differential expression analysis to identify marker genes of the clusters, and then performed gene set enrichment analysis (GSEA) on the marker genes for cell type annotation ( Supplementary  Fig. 7.3). The marker genes and their corresponding gene sets were highly correlated between the workflows. In both Cell Ranger and kallisto bustools results, most clusters are neuronal, and the clusters for erythrocytes (cluster 16 in both), endothelial cells (cluster 21 in kallisto; cluster 19 in Cell Ranger) and immune cells (clusters 20 and 22 in kallisto; cluster 17 in Cell Ranger) can be clearly identified based on marker genes ( Supplementary Fig. 7.3). Correlation between the same barcodes in kallisto bustools and Cell Ranger with the top cluster marker genes is very high, with both the Pearson and Spearman correlation coefficients above 0.9 for the vast majority of cells ( Supplementary Fig. 7.4). In a separate mixed-species dataset, the number and proportion of UMIs from human and mouse cells are similar between Cell Ranger and kallisto bustools (Supplementary Fig. 8). Overall, these results suggest that the Cell Ranger workflow produces results consistent with our method, not only at the level of dataset summary statistics, but also in downstream analyses.
The modularity of our approach makes possible the rapid implementation of alternative workflows. To illustrate this, we developed an RNA velocity workflow. By including intron sequences in the index for pseudoalignment, we were able to identify reads originating from unspliced transcripts and, using the bustools capture command, created the spliced and unspliced matrices needed for RNA velocity. Our RNA velocity workflow, which is 13 times faster than velocyto 11 analysis of the same dataset, is suitable for large datasets that were previously challenging to preprocess. To illustrate this, we computed RNA velocity vectors for recently published data from the developing mouse retina 30 consisting of 113,917 cells (Fig. 3). We found that six pseudotime marker genes highlighted in Clark et al. 2019 30 (Crx, Nrl, Otx2, Pax6, Rbpms and Rlbp1) displayed patterns consistent with the RNA velocity vectors, and with the pseudotime analysis of Clark et al. 30 (Supplementary Fig. 9). The velocity analysis reveals new information; namely, it identifies developmental states when RNA velocity is changing (Supplementary Fig. 9, middle column). We verified the fidelity of our workflow by computing RNA velocity vectors on a dataset from La Manno et al. 2018 11 and comparing our results to those of the paper (Methods and Supplementary Fig. 10). Furthermore, the spliced count matrix agreed with the count matrix obtained in the Cell Ranger/Velocyto gene expression workflow ( Supplementary Fig. 11). Despite identifying many more unspliced counts, our resultant velocity figure was concordant with that of La Manno et al. 11 .
Our scRNA-seq workflow is up to 51 times faster than Cell Ranger and up to 4.75 times faster than Alevin. It is also up to 3.5 times faster than STARsolo, a recent version of the STAR aligner adapted for scRNA-seq ( Fig. 1e and Supplementary Table 1). In benchmarks on the panel described in this paper, kallisto bustools running time was comparable to that of the word count (wc) command applied to the FASTQ files, suggesting that the kallisto bustools workflow is near optimal in efficiency ( Supplementary Fig. 12). Unlike Alevin, our workflow requires a small fixed amount of constant memory that is independent of the number of reads being preprocessed (Fig. 1f). The limiting memory constraint is therefore the size of the reference index, which is under 4 GB of RAM for the human transcriptome, and thus our method is suitable for low-cost and environmentally conscious cloud computing. In cases where it is necessary to work with very large indices on small memory instances, memory needs can be reduced using an index splitting strategy (Supplementary Note), albeit with fewer resulting read counts (Supplementary Figs. 13 and 14). Our speed and constant-memory requirements make RNA velocity tractable for datasets of any size.
Our UMI collapsing analysis suggests that UMI sequences can be short; even just six base pairs of sequence can suffice for identifying most molecules owing to the cell barcode and gene identification for each read serving as auxiliary barcodes ( Supplementary  Fig. 15). Furthermore, the fact that identical UMIs associated with distinct reads from the same gene are almost certainly reads from the same molecule (Fig. 1d) makes it possible, in principle, to design assignment algorithms for multi-mapping reads. Reads could be assigned with an expectation-maximization algorithm that is based on estimating the copy number of each molecule in the library using a model as described in the Supplementary Note, and this is a promising direction for future work. An initial attempt at such assignment 5 appears to improve concordance between scRNA-seq gene abundance estimates and those from bulk RNA-seq. The current implementation of our approach can produce transcript compatibility counts that have information about read ambiguity before assignment of multi-mapping reads, and can therefore be used to identify isoform-specific changes across cells and cell clusters 31 .
While we have focused on a workflow for 10x Chromium data, the bustoools commands we implemented are generic and will work with any BUS file, generated with data from any scRNA-seq technology. Distinct technologies encode barcode and UMI information differently, but the kallisto bus command can accept custom formatting rules. While the preprocessing steps for error correction and counting may need to be optimized for the distinguishing characteristics of different technologies, the modularity of the bustools-based workflow makes such customization possible and easy.

online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/ s41587-021-00870-2.

Methods
Hardware. All of the benchmarks were carried out on a Supermicro server computer (2 × Xeon Gold 6152 22-Core 2.1, 3.7 GHz Turbo, 12 × 64 GB Quad-Rank DDR4 2,666 MHz memory, 16 × 12 TB Ultrastar He12 HUH721212ALE600, 7,200 r.p.m., SATA 6 GB s −1 HDD) with a CentOS 7 operating system installed. The running time of all programs was evaluated using eight threads.
Transcriptome indices. Reference transcriptomes were constructed by processing datasets with Cell Ranger, downloading the constructed Cell Ranger GTF file, and then producing a transcriptome from it and the relevant genome using GFFread (http://cole-trapnell-lab.github.io/cufflinks/file_formats/#the-gffread-utility).
Inference of per-base sequencing error rate and correctable barcodes. For each dataset, the per-base error rate p was estimated by the formulâ where u and e are realizations of random variables U and E, u is the number of barcodes matching the whitelist, e is the number of barcodes Hamming distance 1 away from a whitelist barcode and 16 is the length of 10xv2 and 10xv3 barcodes.
Letting U, E and T be random variables representing the number of barcodes matching the whitelist, the number of barcodes Hamming distance 1 from the whitelist and the effective total number of barcodes respectively, the previous equation was derived by solving for p using To estimate the proportion of barcodes Hamming distance 2 or greater away from a whitelist barcode, we computed for each dataset, using the estimated per-base error rate p.

UMI collision estimates.
A UMI associated with a read from a cell is said to have 'collided' if it appears in two or more reads originating from different molecules. To estimate UMI collision rates, two types of information were used. First, reads in the same cell that originate from different genes must have originated from different molecules, and therefore the sharing of a UMI between two such reads was used as an indicator of a collision. Second, for each gene, the number of distinct UMIs associated within it was measured from the data. Based on the assumption that UMIs were sampled uniformly at random from beads, these data were used to estimate the number of intra-gene collisions (see Supplementary Note). The assumption was verified by examining the distribution of UMI counts across cells; the empirical distribution was near uniform with the exception of a handful of UMIs (Supplementary Note Fig. 2). Table 3) were processed uniformly as follows.

Comparative analysis of the benchmark panel datasets. The benchmark panel datasets (Supplementary
For each dataset, a knee plot 32 was constructed for both Cell Ranger and kallisto bustools by plotting, for each cell, the number of distinct UMIs in the cell versus the number of barcodes with at least that number of UMIs. Then, the numbers of distinct UMIs for kallisto bustools and Cell Ranger were plotted against each other. Subsequently, for each cell, the number of distinct UMIs was plotted against the number of genes detected. Finally, the Pearson correlation was computed between the gene counts of kallisto bustools and Cell Ranger for each cell. To investigate the similarity of Cell Ranger to kallisto bustools, the l 1 distance between each corresponding kallisto bustools and Cell Ranger cell was computed. The distance to the nearest kallisto bustools cell was also measured. To visualize the Cell Ranger and kallisto count matrices, t-SNE was performed on the data projected to the ten principal components computed for each dataset using the opentSNE package (https://github.com/pavlin-policar/openTSNE/) with perplexity = 30, metric = Euclidean, random_state = 42 and n_iter = 750.
To check for systematic differences in the quantification of certain genes between Cell Ranger and kallisto bustools, a differential expression analysis was performed on the matrices produced by the two workflows. First, the matrices were concatenated using the genes determined to be expressed in both methods. Then the counts were normalized using Seurat. Differential expression analysis was performed with logistic regression. Next GSEA was carried out with the R package topGO on all marker genes with an adjusted P value less than 0.05, to identify classes of genes more likely to be affected by the different workflows. Parameters of topGO are statistic = fisher, algorithm = weight01, and the gene universe is all genes detected in both the kallisto bustools and the Cell Ranger matrices.
For mixed-species datasets (mouse and human), the gene universe used to test mouse GO terms was all mouse genes observed in both kallisto bustools and Cell Ranger matrices, and the gene universe for human GO terms was all human genes observed in both matrices. For each dataset, the ontologies biological processes, cellular components and molecular functions were tested separately. According to the vignette of topGO, the test for each ontology with correction of network topology should be considered to be unaffected by multiple testing, as different GO terms are not independent. However, as the tests for different ontologies and datasets were independent, we applied Bonferroni correction to adjust the P values for this. As each single-species dataset (17 in total) was tested for 3 ontologies, and each mixed-species dataset (3 in total) was tested for 6 ontologies, the P values were multiplied by 69 for 69 independent tests on ontologies performed.
Comparative analysis of the 10x Genomics E18 mouse dataset. Analysis of the Cell Ranger and kallisto bustools preprocessed datasets was performed in R. The DropletUtils package 33,34 was used to remove empty droplets from the kallisto bustools gene count matrix. For Cell Ranger, the filtered matrix was used. After filtering, genes not detected in any remaining Cell Ranger or kallisto bustools barcode were removed. Seurat was used for basic analysis. First, data were normalized by dividing the UMI count of each gene in each cell by the total UMI counts of that cell, and the result was multiplied by 10,000. Then a pseudocount of 1 was added, and the natural log transform was applied. Subsequently, the normalized data were scaled so the distribution of the expression of each gene would have a mean of 0 and a standard deviation of 1. Subsequently, 3,000 genes with highly variable expression levels were selected with the vst method in Seurat. Then principal component analysis was performed on these genes in the scaled data with the R package irlba called by Seurat. The first 40 principal components were used for t-SNE, which was performed with the R package Rtsne called by Seurat. Clustering was performed with the Leiden algorithm 29 on the kallisto bustools and Cell Ranger matrices. The clustering parameters were 20 nearest neighbors and resolution 1. Differential expression analysis was performed with the logistic regression method described in Ntranos et al. 31 as implemented in Seurat and applied to the normalized (unscaled) data. Spearman and Pearson correlations were computed for the top 15 cluster marker genes. GSEA was performed on cluster marker genes with adjusted P <0.05 using the R package topGO 35 with the GO 36,37 annotations provided by Bioconductor 3.10. SingleR 38 was used to annotate cell types based on correlation profiles with bulk RNA-seq from ref. 39 . Then, the neuronal cell types were used for pseudotime analysis. Pseudotime analysis was performed with slingshot 40 via the Docker container from dyno 41 .
Species mixing. The "10x Genomics 10k 1:1 Mixture of Fresh Frozen Human (HEK293T) and Mouse (NIH3T3) Cells" dataset was analyzed with kallisto bustools and Cell Ranger for the purpose of comparing the resultant barnyard plots 42 . Human and mouse genes were identified with their ENSEMBL identifiers. The total number of UMIs mapped to the human and mouse genes in each barcode was calculated with the unfiltered matrices. In Supplementary Fig. 8b,c, only barcodes present in both the kallisto bustools and Cell Ranger unfiltered matrices were used.

RNA velocity.
A human reference transcriptome FASTA file of exonic transcripts and a reference genome FASTA were obtained from the University of California Santa Cruz Genome Browser, build December 2013 GRCh38. A BED file of intronic transcripts, with an L − 1 (where L is the read length) flanking sequence added to each end, was also obtained from the University of California Santa Cruz Genome Browser. A unique number was appended to the end of each intronic transcript in the BED file. The genome FASTA and the intronic BED file were used with bedtools getfasta to construct an intronic FASTA file. The intronic and exonic FASTA files were combined and an index was built with kallisto index. The reads were aligned to the index using kallisto bus. The barcodes in the resultant BUS file were error corrected with bustools correct and then sorted with bustools sort. To isolate the intronic counts and exonic counts for each barcode, bustools capture was run twice: once using the list of intronic transcripts and once using the list of exonic transcripts. The spliced count matrices were made by using bustools count on the intron-captured split.bus file, and the unspliced count matrices were made by using bustools count on the exon-captured split.bus file. Both matrices were loaded into an annotated data frame in a Jupyter notebook for downstream analysis.
To perform the comparison to the La Manno et al. 2018 dataset ( Supplementary Figs. 10 and 11), the data were first downloaded from the Sequence Read Archive (SRA; SRP129388). The cell barcodes were filtered by those in La Manno et al. 2018 11 . The data were preprocessed with the kallisto bustools workflow. The matrices were loaded into Python. The cluster labels were then transferred from La Manno et al. 2018. The velocyto notebook provided with the paper was used to reproduce the results based on the Cell Ranger/velocyto matrices and the kallisto bustools matrices.
For the Clark et al. 2019 RNA velocity analysis ( Fig. 3 and Supplementary  Fig. 9), the data were downloaded from the SRA (GSE118614). First, the cell barcodes were filtered by those in Clark et al. 2019. The cluster labels were then transferred from Clark et al. 2019 30 and the standard velocity pipeline from scvelo was run using the kallisto bustools spliced and unspliced matrices.
Reproducibility. Tutorials for performing multiple types of single-cell analysis are available at http://pachterlab.github.io/kallistobustools in the form of Google Colab notebooks. Users can run many different types of scRNA-seq workflows from standard count matrix generation to RNA velocity. Google Colab serves a Jupyter notebook that runs on free cloud compute infrastructure allowing anyone with a Google account to run the notebooks. The efficiency and low memory requirements of the kallisto bustools pipeline make it possible to run it on this infrastructure, and this is not possible for other scRNA-seq preprocessing programs due to their higher computational requirements.
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
A diverse set of 20 datasets was compiled for the purpose of benchmarking preprocessing workflows. Datasets produced and distributed by 10x Genomics were downloaded from the 10x Genomics data downloads page: https:// support.10xgenomics.com/single-cell-gene-expression/datasets. Six v3 chemistry datasets and two v2 chemistry datasets were downloaded and processed (Supplementary Table 3). Another 12 datasets were obtained from either the SRA or the European Nucleotide Archive; all were produced with 10x Genomics v2 chemistry. For six of the datasets (SRR6956073, SRR6998058, SRR7299563, SRR8206317, SRR8327928 and SRR8524760), the BAM files were downloaded and the Cell Ranger utility bamtofastq was run to produce FASTQ files for preprocessing from Cell Ranger-structured BAM files. FASTQ files were downloaded directly for the datasets E-MTAB-7320, SRR8257100, SRR8513910, SRR8599150 (available at https://github.com/bustools/getting_started/releases/ download/getting_started/SRR8599150_S1_L001_R1_001.fastq.gz and https:// github.com/bustools/getting_started/releases/download/getting_started/ SRR8599150_S1_L001_R2_001.fastq.gz), SRR8611943 and SRR8639063. Details of all datasets and their accession numbers can be found in Supplementary