Welcome to the new version of CaltechAUTHORS. Login is currently restricted to library staff. If you notice any issues, please email coda@library.caltech.edu
Published September 11, 2023 | In Press
Journal Article Open

Recovery of missing single-cell RNA-sequencing data with optimized transcriptomic references

  • 1. ROR icon California Institute of Technology


Single-cell RNA-sequencing (scRNA-seq) is an indispensable tool for characterizing cellular diversity and generating hypotheses throughout biology. Droplet-based scRNA-seq datasets often lack expression data for genes that can be detected with other methods. Here we show that the observed sensitivity deficits stem from three sources: (1) poor annotation of 3′ gene ends; (2) issues with intronic read incorporation; and (3) gene overlap-derived read loss. We show that missing gene expression data can be recovered by optimizing the reference transcriptome for scRNA-seq through recovering false intergenic reads, implementing a hybrid pre-mRNA mapping strategy and resolving gene overlaps. We demonstrate, with a diverse collection of mouse and human tissue data, that reference optimization can substantially improve cellular profiling resolution and reveal missing cell types and marker genes. Our findings argue that transcriptomic references need to be optimized for scRNA-seq analysis and warrant a reanalysis of previously published datasets and cell atlases.

Copyright and License

© 2023. The Author(s), under exclusive licence to Springer Nature.


We thank L. S. Pachter and members of the M.T. lab for helpful discussion and comments. We thank the Single-Cell Profiling Center (SPEC) in the Beckman Institute at Caltech for technical assistance with scRNA-seq. A.H.P. is supported by Eugene McDermott Scholar funds and by Startup funds from Peter O'Donnell Jr. Brain Institute at UT Southwestern. Y.O. is supported by Startup funds from the President and Provost of the California Institute of Technology and the Biology and Biological Engineering Division of California Institute of Technology, Searle Scholars Program, the Mallinckrodt Foundation, the McKnight Foundation, the Klingenstein–Simons Foundation, the New York Stem Cell Foundation and the NIH (grant nos. R56MH113030 and R01NS109997).


A.-H.P. conceived and designed the project. A.-H.P. and H.P. devised and performed data analysis. A.-H.P. and S.C. generated the MnPO scRNA-seq dataset. S.C. and M.T. generated the human PBMC scRNA-seq dataset. S.C., M.T. and Y.O. provided conceptual advice on data analysis. All authors contributed to the manuscript as drafted by A.-H.P. and H.P. A.-H.P. and Y.O. supervised the overall project.

Data Availability

Raw and fully processed scRNA-seq data generated for this project (mouse MnPO and human PBMC) are available at the NCBI Gene Expression Omnibus (GEO, GSE198528). Additionally, previously published mouse and human datasets were analyzed including mouse 10x Genomics scRNA-seq datasets generated by the Tabula Muris consortium (bone marrow, SRR6835854; kidney, SRR6835849; lung, SRR6835860; tongue, SRR6835844), which can be accessed from the GEO repository GSE132042. Human brain scRNA-seq data generated from the prefrontal cortex (CS22_PFC) were acquired from the NEMO archive at https://assets.nemoarchive.org/dat-0rsydy7, which requires a custom data use agreement. Finally, human 10x Genomics scRNA-seq data (liver, TSP14_Liver_NA; lung, TSP14_Lung_Proximal; tongue, TSP14_Tongue_Anterior) generated by the Tabula Sapiens consortium can be accessed through the Tabula Sapiens AWS storage web service accessible from https://tabula-sapiens-portal.ds.czbiohub.org/ and requires a custom data use agreement. Baseline single-cell transcriptomic references for human (GRCh38) and mouse (mm10) datasets were downloaded from 10X Genomics (latest available version 2020-A): https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest?. Latest optimized versions of the mouse and human reference transcriptomes and respective genome annotations are available for download at www.thepoollab.org/resources.

Code Availability

Custom scripts for analyzing data and generating figures are available at https://github.com/PoolLab/Generecovery. ReferenceEnhancer R package for optimizing genome annotations for scRNA-seq analysis is available at https://github.com/PoolLab/ReferenceEnhancer.

Conflict of Interest

The authors declare no competing interests.


Files (819.2 kB)
Name Size Download all
45.8 kB Preview Download
41.0 kB Download
318.7 kB Preview Download
180.9 kB Preview Download
179.1 kB Preview Download
53.6 kB Download

Additional details

September 12, 2023
January 9, 2024