Detection of viral sequences at single-cell resolution identifies novel viruses associated with host gene expression changes
Creators
Abstract
The increasing use of high-throughput sequencing methods in research, agriculture and healthcare provides an opportunity for the cost-effective surveillance of viral diversity and investigation of virus–disease correlation. However, existing methods for identifying viruses in sequencing data rely on and are limited to reference genomes or cannot retain single-cell resolution through cell barcode tracking. We introduce a method that accurately and rapidly detects viral sequences in bulk and single-cell transcriptomics data based on the highly conserved RdRP protein, enabling the detection of over 100,000 RNA virus species. The analysis of viral presence and host gene expression in parallel at single-cell resolution allows for the characterization of host viromes and the identification of viral tropism and host responses. We apply our method to peripheral blood mononuclear cell data from rhesus macaques with Ebola virus disease and describe previously unknown putative viruses. Moreover, we are able to accurately predict viral presence in individual cells based on macaque gene expression.
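The idea of a translated search that keeps cell barcodes can be illustrated with a toy sketch. This is not the kallisto implementation and makes no claim about how it works internally: the function names, the k-mer length, the reference sequence `toy_rdrp` and the reads are all hypothetical and chosen only for illustration. Each read is translated in all six reading frames, and its amino acid k-mers are looked up in an index built from protein reference sequences. Hits are counted per (barcode, reference) pair.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translations(read):
    """Translate a read in three forward and three reverse frames."""
    frames = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            # Codons with ambiguous bases are translated to "X".
            frames.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return frames


def protein_kmers(protein, k):
    """Return the set of amino acid k-mers in a protein string."""
    return {protein[i:i + k] for i in range(len(protein) - k + 1)}


def count_hits_per_cell(reads, reference, k=5):
    """Count reads per (barcode, reference name) sharing at least one k-mer.

    reads: iterable of (barcode, nucleotide sequence) pairs (hypothetical input).
    reference: dict mapping a reference name to an amino acid sequence.
    """
    index = {}
    for name, protein in reference.items():
        for kmer in protein_kmers(protein, k):
            index.setdefault(kmer, set()).add(name)

    counts = {}
    for barcode, seq in reads:
        hits = set()
        for frame in six_frame_translations(seq.upper()):
            for kmer in protein_kmers(frame, k):
                hits |= index.get(kmer, set())
        for name in hits:  # each read counts at most once per reference
            counts[(barcode, name)] = counts.get((barcode, name), 0) + 1
    return counts


# Invented example data: one toy protein and three barcoded reads.
reference = {"toy_rdrp": "MKTAYIAKQR"}
coding_read = "ATGAAAACTGCTTATATTGCTAAACAACGT"
reads = [
    ("CELL1", coding_read),
    ("CELL2", reverse_complement(coding_read)),
    ("CELL3", "ACGTACGTACGTACGTACGTACGTACGTAC"),
]
counts = count_hits_per_cell(reads, reference)
```

Because the lookup happens in amino acid space, a read can match a conserved protein even when its nucleotide sequence has diverged, and carrying the barcode alongside each read is what preserves single-cell resolution in the resulting counts.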
Copyright and License
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Acknowledgement
We thank A. Bryon for her advice and helpful discussions on the applicability of kallisto translated search for the detection of viral RNA. We also thank A. Lin, L. Hensley, D. Kotliar and P. Sabeti for helping us understand and work with their data. We thank A. Babaian for answering our questions about the PalmDB database extensively. We thank C. Middle, M. Roos and S. Kuersten from Illumina for generating the bulk RNA-seq data included in Fig. 2a (second panel). L.L. was supported by funding from the Biology and Bioengineering Division at the California Institute of Technology. D.K.S. was supported by funding from the UCLA–Caltech Medical Scientist Training Program (National Institutes of Health (NIH) National Institute of General Medical Sciences (NIGMS) training grant T32 GM008042). M.C. was supported by the National Science Foundation Graduate Research Fellowships Program (NSF GRFP) under grant no. 2139433. A.V.W. was supported by NIH F30AI167524. A.V.W. was supported in part by the Bill & Melinda Gates Foundation (INV-023124). Under the grant conditions of the Foundation, a Creative Commons Attribution 4.0 Generic License has already been assigned to the Author Accepted Manuscript version that might arise from this submission.
Funding
L.L. was supported by funding from the Biology and Bioengineering Division at the California Institute of Technology. D.K.S. was supported by funding from the UCLA–Caltech Medical Scientist Training Program (National Institutes of Health (NIH) National Institute of General Medical Sciences (NIGMS) training grant T32 GM008042). M.C. was supported by the National Science Foundation Graduate Research Fellowships Program (NSF GRFP) under grant no. 2139433. A.V.W. was supported by NIH F30AI167524. A.V.W. was supported in part by the Bill & Melinda Gates Foundation (INV-023124).
Conflict of Interest
L.L., D.K.S. and L.P. are listed as inventors of a patent application including the manuscript in its entirety (patent application number, 18/972,306; status, pending). The patent application was submitted through the Technology Transfer Office of the California Institute of Technology (Caltech), with Caltech being the patent applicant. The remaining authors declare no competing interests.
Data Availability
With one exception, the sequencing data analysed in this paper are publicly available under GEO accessions GSE150316 (ref. 36), GSM4548303 (ref. 40), GSE158390 (ref. 39) and GSM5974202 (refs. 84,85) (detailed descriptions are provided in Supplementary Table 2). The raw sequencing data for one of the validation datasets, shown in Fig. 2a second panel from the left, is not publicly available per participant privacy practices (refs. 37,38). However, the count matrices generated for this dataset are publicly available on Caltech Data as described below. The following genomes and transcriptomes were used in our analyses: human GRCh38 genome and transcriptome from Ensembl (v.109), mouse GRCm39 genome and transcriptome from Ensembl (v.109), rhesus macaque Mmul_10 genome and transcriptome from Ensembl (v.109), dog ROS_Cfam_1.0 genome and transcriptome from Ensembl (v.109) and EBOV reference genomes NC_002549.1 and GCA_000848505.1. To further increase the reproducibility of our results, we provide intermediary files generated as part of our analyses on Caltech Data at https://doi.org/10.22002/krqmp-5hy81 (ref. 86) and https://doi.org/10.22002/k7xqw-88d74 (ref. 87) (detailed descriptions for each file are provided in Supplementary Table 3). The PalmDB reference files optimized for use with kallisto translated search for the identification of viral sequences in bulk and single-cell RNA-seq data are available at GitHub (https://github.com/pachterlab/LSCHWCP_2023/tree/main/PalmDB) (ref. 88).
Code Availability
The code used to generate all of the results and figures reported in this paper, starting from the raw sequencing reads, can be found at https://github.com/pachterlab/LSCHWCP_2023 (ref. 91). The code is organized by figure panel and provided in immediately executable Google Colab notebooks to maximize the reproducibility of the results and methods described in this paper.
Contributions
L.L. and L.P. conceptualized the project. L.L., D.K.S. and K.E.H. developed and implemented the kallisto translated search algorithm, including extensive testing. L.L. and L.P. designed the validation, benchmarking and experimental data analysis with input from all authors, and L.L. performed the analyses and visualized the results. M.C. designed the logistic regression models, with adjustments from L.L. and T.C. T.C. provided critical feedback on statistical methods. A.V.W. provided extensive validation data and critical feedback on the validation, manuscript and revisions. L.L. wrote the original draft of the manuscript and revisions. L.P. supervised the project. All authors contributed to revisions and reviewed, edited and approved the final manuscript.
Supplemental Material
Supplementary Information - Supplementary Note, Supplementary Discussion, Supplementary Tables 1–3, Supplementary Figs. 1–3
Files
| Name | Size | Checksum |
|---|---|---|
| 41587_2025_2614_MOESM1_ESM.pdf | 5.0 MB | md5:8b6d22f158c835c60a80c6ab40501036 |
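A downloaded copy of the supplementary file can be checked against the MD5 checksum listed above. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_record` are hypothetical, and the local path passed to them is whatever location the file was saved to.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected="md5:8b6d22f158c835c60a80c6ab40501036"):
    """Compare a file's MD5 digest with a checksum written as 'md5:<hex>'."""
    expected_hex = expected.split(":")[-1].lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_record("41587_2025_2614_MOESM1_ESM.pdf")` returns `True` only if the local file's digest equals the value in the record. A mismatch usually indicates an incomplete or corrupted download.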
Additional details
Additional titles
- Alternative title
- Efficient and accurate detection of viral sequences at single-cell resolution reveals novel viruses perturbing host gene expression
- Alternative title
- Efficient and accurate detection of viral sequences at single-cell resolution reveals putative novel viruses perturbing host gene expression
Identifiers
- PMID: 40263451
Funding
- California Institute of Technology: Biology and Bioengineering Division
- National Institutes of Health: T32 GM008042
- National Science Foundation: DGE-2139433
- National Institutes of Health: F30AI167524
- Bill & Melinda Gates Foundation: INV-023124
Dates
- Submitted: 2024-05-13
- Accepted: 2025-02-24