Detection of viral sequences at single-cell resolution identifies novel viruses associated with host gene expression changes

Luebbert, Laura; Sullivan, Delaney K.; Carilli, Maria; Eldjárn Hjörleifsson, Kristján; Viloria Winnett, Alexander; Chari, Tara; Pachter, Lior

doi:10.1038/s41587-025-02614-y

Published April 22, 2025 | Version In Press

Journal Article Open

1. California Institute of Technology
2. Broad Institute
3. Harvard University
4. University of California, Los Angeles

The increasing use of high-throughput sequencing methods in research, agriculture and healthcare provides an opportunity for the cost-effective surveillance of viral diversity and investigation of virus–disease correlation. However, existing methods for identifying viruses in sequencing data rely on and are limited to reference genomes or cannot retain single-cell resolution through cell barcode tracking. We introduce a method that accurately and rapidly detects viral sequences in bulk and single-cell transcriptomics data based on the highly conserved RdRP protein, enabling the detection of over 100,000 RNA virus species. The analysis of viral presence and host gene expression in parallel at single-cell resolution allows for the characterization of host viromes and the identification of viral tropism and host responses. We apply our method to peripheral blood mononuclear cell data from rhesus macaques with Ebola virus disease and describe previously unknown putative viruses. Moreover, we are able to accurately predict viral presence in individual cells based on macaque gene expression.

Copyright and License

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Acknowledgement

We thank A. Bryon for her advice and helpful discussions on the applicability of kallisto translated search for the detection of viral RNA. We also thank A. Lin, L. Hensley, D. Kotliar and P. Sabeti for helping us understand and work with their data. We thank A. Babaian for answering our questions about the PalmDB database extensively. We thank C. Middle, M. Roos and S. Kuersten from Illumina for generating the bulk RNA-seq data included in Fig. 2a (second panel). L.L. was supported by funding from the Biology and Bioengineering Division at the California Institute of Technology. D.K.S. was supported by funding from the UCLA–Caltech Medical Scientist Training Program (National Institutes of Health (NIH) National Institute of General Medical Sciences (NIGMS) training grant T32 GM008042). M.C. was supported by the National Science Foundation Graduate Research Fellowships Program (NSF GRFP) under grant no. 2139433. A.V.W. was supported by NIH F30AI167524. A.V.W. was supported in part by the Bill & Melinda Gates Foundation (INV-023124). Under the grant conditions of the Foundation, a Creative Commons Attribution 4.0 Generic License has already been assigned to the Author Accepted Manuscript version that might arise from this submission.

Funding

L.L. was supported by funding from the Biology and Bioengineering Division at the California Institute of Technology. D.K.S. was supported by funding from the UCLA–Caltech Medical Scientist Training Program (National Institutes of Health (NIH) National Institute of General Medical Sciences (NIGMS) training grant T32 GM008042). M.C. was supported by the National Science Foundation Graduate Research Fellowships Program (NSF GRFP) under grant no. 2139433. A.V.W. was supported by NIH F30AI167524. A.V.W. was supported in part by the Bill & Melinda Gates Foundation (INV-023124).

Conflict of Interest

L.L., D.K.S. and L.P. are listed as inventors of a patent application including the manuscript in its entirety (patent application number, 18/972,306; status, pending). The patent application was submitted through the Technology Transfer Office of the California Institute of Technology (Caltech), with Caltech being the patent applicant. The remaining authors declare no competing interests.

Data Availability

With one exception, the sequencing data analysed in this paper are publicly available under GEO accessions GSE150316 (ref. ³⁶), GSM4548303 (ref. ⁴⁰), GSE158390 (ref. ³⁹) and GSM5974202 (refs. ⁸⁴,⁸⁵) (detailed descriptions are provided in Supplementary Table 2). The raw sequencing data for one of the validation datasets, shown in Fig. 2a (second panel from the left), are not publicly available per participant privacy practices³⁷,³⁸. However, the count matrices generated for this dataset are publicly available on Caltech Data as described below. The following genomes and transcriptomes were used in our analyses: human GRCh38 genome and transcriptome from Ensembl (v.109), mouse GRCm39 genome and transcriptome from Ensembl (v.109), rhesus macaque Mmul_10 genome and transcriptome from Ensembl (v.109), dog ROS_Cfam_1.0 genome and transcriptome from Ensembl (v.109) and EBOV reference genomes NC_002549.1 and GCA_000848505.1. To further increase the reproducibility of our results, we provide intermediary files generated as part of our analyses on Caltech Data at https://doi.org/10.22002/krqmp-5hy81 (ref. ⁸⁶) and https://doi.org/10.22002/k7xqw-88d74 (ref. ⁸⁷) (detailed descriptions for each file are provided in Supplementary Table 3). The PalmDB reference files optimized for use with kallisto translated search for the identification of viral sequences in bulk and single-cell RNA-seq data are available at GitHub (https://github.com/pachterlab/LSCHWCP_2023/tree/main/PalmDB) (ref. ⁸⁸).

Code Availability

The code used to generate all of the results and figures reported in this paper, starting from the raw sequencing reads, can be found at https://github.com/pachterlab/LSCHWCP_2023 (ref. ⁹¹). The code is organized by figure panel and provided in immediately executable Google Colab notebooks to maximize the reproducibility of the results and methods described in this paper.

Contributions

L.L. and L.P. conceptualized the project. L.L., D.K.S. and K.E.H. developed and implemented the kallisto translated search algorithm, including extensive testing. L.L. and L.P. designed the validation, benchmarking and experimental data analysis with input from all authors, and L.L. performed the analyses and visualized the results. M.C. designed the logistic regression models, with adjustments from L.L. and T.C. T.C. provided critical feedback on statistical methods. A.V.W. provided extensive validation data and critical feedback on the validation, manuscript and revisions. L.L. wrote the original draft of the manuscript and revisions. L.P. supervised the project. All authors contributed to revisions and reviewed, edited and approved the final manuscript.

Supplemental Material

Supplementary Information - Supplementary Note, Supplementary Discussion, Supplementary Tables 1–3, Supplementary Figs. 1–3

Reporting Summary

Extended Data Fig. 1 kallisto translated search can identify viral sequences beyond those explicitly included in the PalmDB while retaining accurate lower-rank taxonomy assignments

Extended Data Fig. 2 Comma-free code recalls viral sequences equally well compared to maximizing the Hamming distance between amino acids

Extended Data Fig. 3 Within-taxonomy similarity between viral sequences is preserved in the comma-free space, and the precision of taxonomic assignment remains stable even at increasing mutation rates

Extended Data Fig. 4 RdRP sequences are identified correctly and comprehensively by kallisto translated search with PalmDB

Extended Data Fig. 5 Variability in results across different host masking workflows, with no masking leading to host sequences being misidentified as viral, while overly conservative masking results in the loss of viral sequences

Extended Data Fig. 6 Different host masking workflows result in varying numbers of positive cells for different virus IDs

Extended Data Fig. 7 Quality control and cell type assignment of the macaque PBMC dataset

Extended Data Fig. 8 Blank sequencing reagents contain viral sequences, some of which are also present in the macaque PBMC dataset, while other viral sequences in the dataset exhibit varying distributions across animals, time points, and cell types

Extended Data Fig. 9 Logistic regression models trained to predict viral presence based on host gene expression

Extended Data Fig. 10 Highly weighted genes in the logistic regression models are enriched in immune response pathways

Files

41587_2025_2614_MOESM1_ESM.pdf (5.0 MB), md5:8b6d22f158c835c60a80c6ab40501036

Additional details

Alternative title: Efficient and accurate detection of viral sequences at single-cell resolution reveals novel viruses perturbing host gene expression
Alternative title: Efficient and accurate detection of viral sequences at single-cell resolution reveals putative novel viruses perturbing host gene expression

PMID: 40263451

California Institute of Technology: Biology and Bioengineering Division
National Institutes of Health: T32 GM008042
National Science Foundation: DGE-2139433
National Institutes of Health: F30AI167524
Bill & Melinda Gates Foundation: INV-023124

Submitted: 2024-05-13
Accepted: 2025-02-24

Caltech groups: Division of Biology and Biological Engineering (BBE)
Publication Status: In Press

