Recovery of missing single-cell RNA-sequencing data with optimized transcriptomic references

Pool, Allan-Hermann; Poldsam, Helen; Chen, Sisi; Thomson, Matt; Oka, Yuki

doi:10.1038/s41592-023-02003-w

Published September 11, 2023 | Version In Press

Journal Article Open

Recovery of missing single-cell RNA-sequencing data with optimized transcriptomic references

1. The University of Texas Southwestern Medical Center
2. Tallinn University of Technology
3. California Institute of Technology

Single-cell RNA-sequencing (scRNA-seq) is an indispensable tool for characterizing cellular diversity and generating hypotheses throughout biology. Droplet-based scRNA-seq datasets often lack expression data for genes that can be detected with other methods. Here we show that the observed sensitivity deficits stem from three sources: (1) poor annotation of 3′ gene ends; (2) issues with intronic read incorporation; and (3) gene overlap-derived read loss. We show that missing gene expression data can be recovered by optimizing the reference transcriptome for scRNA-seq through recovering false intergenic reads, implementing a hybrid pre-mRNA mapping strategy and resolving gene overlaps. We demonstrate, with a diverse collection of mouse and human tissue data, that reference optimization can substantially improve cellular profiling resolution and reveal missing cell types and marker genes. Our findings argue that transcriptomic references need to be optimized for scRNA-seq analysis and warrant a reanalysis of previously published datasets and cell atlases.

Copyright and License

Acknowledgement

We thank L. S. Pachter and members of the M.T. lab for helpful discussion and comments. We thank the Single-Cell Profiling Center (SPEC) in the Beckman Institute at Caltech for technical assistance with scRNA-seq. A.H.P. is supported by Eugene McDermott Scholar funds and by Startup funds from Peter O'Donnell Jr. Brain Institute at UT Southwestern. Y.O. is supported by Startup funds from the President and Provost of the California Institute of Technology and the Biology and Biological Engineering Division of California Institute of Technology, Searle Scholars Program, the Mallinckrodt Foundation, the McKnight Foundation, the Klingenstein–Simons Foundation, the New York Stem Cell Foundation and the NIH (grant nos. R56MH113030 and R01NS109997).

Contributions

A.-H.P. conceived and designed the project. A.-H.P. and H.P. devised and performed data analysis. A.-H.P. and S.C. generated the MnPO scRNA-seq dataset. S.C. and M.T. generated the human PBMC scRNA-seq dataset. S.C., M.T. and Y.O. provided conceptual advice on data analysis. All authors contributed to the manuscript as drafted by A.-H.P. and H.P. A.-H.P. and Y.O. supervised the overall project.

Data Availability

Raw and fully processed scRNA-seq data generated for this project (mouse MnPO and human PBMC) are available at the NCBI Gene Expression Omnibus (GEO, GSE198528). Additionally, previously published mouse and human datasets were analyzed including mouse 10x Genomics scRNA-seq datasets generated by the Tabula Muris consortium (bone marrow, SRR6835854; kidney, SRR6835849; lung, SRR6835860; tongue, SRR6835844), which can be accessed from the GEO repository GSE132042. Human brain scRNA-seq data generated from the prefrontal cortex (CS22_PFC) were acquired from the NEMO archive at https://assets.nemoarchive.org/dat-0rsydy7, which requires a custom data use agreement. Finally, human 10x Genomics scRNA-seq data (liver, TSP14_Liver_NA; lung, TSP14_Lung_Proximal; tongue, TSP14_Tongue_Anterior) generated by the Tabula Sapiens consortium can be accessed through the Tabula Sapiens AWS storage web service accessible from https://tabula-sapiens-portal.ds.czbiohub.org/ and requires a custom data use agreement. Baseline single-cell transcriptomic references for human (GRCh38) and mouse (mm10) datasets were downloaded from 10X Genomics (latest available version 2020-A): https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest?. Latest optimized versions of the mouse and human reference transcriptomes and respective genome annotations are available for download at www.thepoollab.org/resources.

Code Availability

Custom scripts for analyzing data and generating figures are available at https://github.com/PoolLab/Generecovery. ReferenceEnhancer R package for optimizing genome annotations for scRNA-seq analysis is available at https://github.com/PoolLab/ReferenceEnhancer.

Conflict of Interest

The authors declare no competing interests.

Files

41592_2023_2003_Fig6_ESM.jpg

Files (819.2 kB)

Name	Size	Download all
41592_2023_2003_Fig6_ESM.jpg md5:8794ab139491a885e0ed6d122851d1d2	45.8 kB	Preview Download
41592_2023_2003_Fig7_ESM.jpg md5:d05bcdc978006cbd572660b2b0c97e36	180.9 kB	Preview Download
41592_2023_2003_Fig8_ESM.jpg md5:790cf4e2643d011ee1189e8a3cb0b7c9	179.1 kB	Preview Download
41592_2023_2003_Fig9_ESM.jpg md5:ac6a777085916550e034a96d092a1b34	318.7 kB	Preview Download
41592_2023_2003_MOESM3_ESM.xlsx md5:c6239c8bed82ab5be8108c3b17958292	41.0 kB	Download
41592_2023_2003_MOESM4_ESM.xlsx md5:328e45401891950d963fd2759674c804	53.6 kB	Download

Additional details

ISSN: 1548-7105
URL: https://rdcu.be/dlYbF

Featured in: https://www.caltech.edu/about/news/invisible-cell-types-and-gene-expression-revealed-with-sequencing-data-analysis-improvement (URL)
Is supplemented by: Dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE198528 (URL); Dataset: http://www.thepoollab.org/resources (URL); Software: https://github.com/PoolLab/Generecovery (URL); Software: https://github.com/PoolLab/ReferenceEnhancer (URL)

The University of Texas Southwestern Medical Center
California Institute of Technology
Edward Mallinckrodt Jr. Foundation
McKnight Foundation
National Institutes of Health
R56MH113030
National Institutes of Health
R01NS109997

Accepted: 2023-09-11

published online

Caltech groups: Division of Biology and Biological Engineering (BBE)

	All versions	This version
Views	28	28
Downloads	25	25
Data volume	3.3 MB	3.3 MB

Recovery of missing single-cell RNA-sequencing data with optimized transcriptomic references

Copyright and License

Acknowledgement

Contributions

Data Availability

Code Availability

Conflict of Interest

Files

41592_2023_2003_Fig6_ESM.jpg

Files (819.2 kB)

Additional details

Identifiers

Related works

Funding

Dates

Caltech Custom Metadata

Recovery of missing single-cell RNA-sequencing data with optimized transcriptomic references

Creators

Abstract

Copyright and License

Acknowledgement

Contributions

Data Availability

Code Availability

Conflict of Interest

Files

41592_2023_2003_Fig6_ESM.jpg

Files (819.2 kB)

Additional details

Identifiers

Related works

Funding

Dates

Caltech Custom Metadata