Minimal gene set discovery in single-cell mRNA-seq datasets with ActiveSVM

Chen, Xiaoqiao and Chen, Sisi and Thomson, Matt (2022) Minimal gene set discovery in single-cell mRNA-seq datasets with ActiveSVM. Nature Computational Science, 2 (6). pp. 387-398. ISSN 2662-8457. doi:10.1038/s43588-022-00263-8. https://resolver.caltech.edu/CaltechAUTHORS:20220707-978247000

PDF - Published Version. Creative Commons Attribution. 4MB
Image (JPEG) (Extended Data Fig. 1: Application of ActiveSVM to identify region specific marker genes in the mouse brain with spatial transcriptomic data) - Supplemental Material. Creative Commons Attribution. 711kB
PDF (Supplementary Figs. 1 & 2; Supplementary Tables 1-4) - Supplemental Material. Creative Commons Attribution. 6MB
Archive (ZIP) (Source Data Fig. 2) - Supplemental Material. Creative Commons Attribution. 13MB
Archive (ZIP) (Source Data Fig. 3) - Supplemental Material. Creative Commons Attribution. 109MB
Archive (ZIP) (Source Data Fig. 4) - Supplemental Material. Creative Commons Attribution. 9MB
Archive (ZIP) (Source Data Fig. 5) - Supplemental Material. Creative Commons Attribution. 8MB
Archive (ZIP) (Source Data Fig. 6) - Supplemental Material. Creative Commons Attribution. 1MB
Archive (ZIP) (Source Data Extended Data Fig. 1) - Supplemental Material. Creative Commons Attribution. 71kB

Use this Persistent URL to link to this item: https://resolver.caltech.edu/CaltechAUTHORS:20220707-978247000

Abstract

Sequencing costs currently prohibit the application of single-cell mRNA-seq to many biological and clinical analyses. Targeted single-cell mRNA-sequencing reduces sequencing costs by profiling reduced gene sets that capture biological information with a minimal number of genes. Here we introduce an active learning method that identifies minimal but highly informative gene sets that enable the identification of cell types, physiological states and genetic perturbations in single-cell data using a small number of genes. Our active feature selection procedure generates minimal gene sets from single-cell data by employing an active support vector machine (ActiveSVM) classifier. We demonstrate that ActiveSVM feature selection identifies gene sets that enable ~90% cell-type classification accuracy across, for example, cell atlas and disease-characterization datasets. The discovery of small but highly informative gene sets should enable reductions in the number of measurements necessary for application of single-cell mRNA-seq to clinical tests, therapeutic discovery and genetic screens.
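
The abstract describes the selection procedure only at a high level, so the following is a toy sketch of the general idea rather than the authors' algorithm or the activeSVC package API. It assumes a binary label, a small subgradient-descent linear SVM written from scratch, and a simple scoring rule: rank each unused gene by the magnitude of the hinge-loss gradient over cells that the current classifier misclassifies or places inside the margin. The names train_linear_svm and select_genes, the scoring rule and the synthetic data are all illustrative assumptions.

```python
import random

def train_linear_svm(rows, labels, epochs=50, lr=0.05, reg=0.01, seed=0):
    """Fit (w, b) by stochastic subgradient descent on the L2-regularised hinge loss."""
    rng = random.Random(seed)
    w, b = [0.0] * len(rows[0]), 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = labels[i] * (sum(wj * xj for wj, xj in zip(w, rows[i])) + b)
            # Shrink the weights on every step; add the hinge term only on a margin violation.
            w = [wj * (1.0 - lr * reg) for wj in w]
            if margin < 1.0:
                w = [wj + lr * labels[i] * xj for wj, xj in zip(w, rows[i])]
                b += lr * labels[i]
    return w, b

def select_genes(X, y, n_genes):
    """Greedy forward selection guided by misclassified or margin-violating cells (assumed rule)."""
    n_cells, n_total = len(X), len(X[0])
    selected = []
    active = list(range(n_cells))  # with no genes chosen, every cell has margin 0
    while len(selected) < min(n_genes, n_total) and active:
        # |d(hinge loss)/d(w_j)| at w_j = 0, summed over the active cells only.
        scores = {
            j: abs(sum(y[i] * X[i][j] for i in active))
            for j in range(n_total) if j not in selected
        }
        selected.append(max(scores, key=scores.get))
        # Refit on the genes chosen so far and recompute which cells still violate the margin.
        rows = [[X[i][j] for j in selected] for i in range(n_cells)]
        w, b = train_linear_svm(rows, y)
        active = [
            i for i in range(n_cells)
            if y[i] * (sum(wj * xj for wj, xj in zip(w, rows[i])) + b) < 1.0
        ]
    return selected

# Synthetic stand-in for an expression matrix: 300 cells x 20 genes.
# Only genes 3 and 7 influence the label (an assumption made for the demo).
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(300)]
y = [1 if row[3] + 0.5 * row[7] > 0 else -1 for row in X]
selected = select_genes(X, y, n_genes=2)
print(selected)
```

Restricting the scoring step to poorly classified cells is what makes the loop "active" in this sketch: each new gene is chosen for what it adds where the current gene set fails, not for its overall correlation with the label. For real datasets, the activeSVC package listed under Related URLs is the reference implementation.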


Item Type: Article
Related URLs:
URL | URL Type | Description
https://doi.org/10.1038/s43588-022-00263-8 | DOI | Article
https://rdcu.be/cRcBC | Publisher | Free ReadCube access
https://www.ncbi.nlm.nih.gov/sra/?term=SRP073767 | Related Item | PBMC Single-cell RNA-seq data
http://support.10xgenomics.com/single-cell/datasets | Related Item | Datasets
https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues_from_Mus_musculus_at_single_cell_resolution/27733 | Related Item | Tabula Muris dataset
https://figshare.com/articles/dataset/PopAlign_Data/11837097/3 | Related Item | original multiple myeloma PBMC data
http://support.10xgenomics.com/single-cell/datasets | Related Item | 10x Genomics Megacell dataset
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2396856 | Related Item | perturb-seq dataset
https://github.com/CaiGroup/seqFISH-PLUS | Related Item | spatial transcriptomics data
https://pypi.org/project/activeSVC | Related Item | Python package - ActiveSVC
https://github.com/xqchen/activeSVC | Related Item | source codes of activeSVC
https://doi.org/10.5281/zenodo.6481687 | DOI | xqchen/activeSVC: ActiveSVM
https://colab.research.google.com/drive/16h8hsnJ3ukTWAPnCB581dwj-nN5oopyM?usp=sharing | Related Item | Google colaboratory: PBMC demo
https://colab.research.google.com/drive/1SLehIKIQqpjK6BzEKc9m0y3uJ_LBqRzA?usp=sharing | Related Item | Google colaboratory: Tabula Muris demo
https://colab.research.google.com/drive/1fhQ8GD3NyzB3w0vof9WimXK6BLqDNuDC?usp=sharing | Related Item | Google colaboratory: PBMC cross-validation demo
ORCID:
Author | ORCID
Chen, Xiaoqiao | 0000-0003-4685-3466
Chen, Sisi | 0000-0001-9448-9713
Thomson, Matt | 0000-0003-1021-1234
Additional Information: © The Author(s) 2022, corrected publication 2022. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
Received 23 July 2021. Accepted 17 May 2022. Published 27 June 2022.
Acknowledgements. We would like to thank I.-M. Strazhnik for expert assistance with preparation of illustrations and G. Riddihough of Life Science Editors for editorial assistance. We thank J. Jiang, Y. Yue, L. Cai, D. Sivak, D. Angeles and K. Zinn for discussion. The work was supported by the Heritage Medical Research Institute, the Beckman Institute Single-cell Profiling and Engineering Center (SPEC), NIH (R01HD100039), and the Margaret E. Early Medical Research Trust.
Contributions. X.C. conceived the ActiveSVM algorithm. X.C. and M.T. refined the algorithm and developed the application to single-cell genomics. X.C., S.C. and M.T. performed numerical experiments, biological interpretation, and data analysis. S.C. analyzed the Tabula Muris and multiple myeloma datasets and established biological interpretation of ActiveSVM results. X.C., S.C. and M.T. wrote the paper.
Data availability. All of the data used in the paper have been previously published. The PBMC Single-cell RNA-seq data have been deposited in the Short Read Archive under accession no. SRP073767 by the authors of ref. 13. Data are also available at http://support.10xgenomics.com/single-cell/datasets. The original Tabula Muris dataset is available at https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues_from_Mus_musculus_at_single_cell_resolution/27733. The original multiple myeloma PBMC data, which contain two healthy donors and four multiple myeloma donors, are available at https://figshare.com/articles/dataset/PopAlign_Data/11837097/3. The 10x Genomics Megacell dataset is available at http://support.10xgenomics.com/single-cell/datasets. The perturb-seq dataset (ref. 17) is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2396856. The spatial transcriptomics data (ref. 18) are available at https://github.com/CaiGroup/seqFISH-PLUS. Source Data are provided with this paper.
Code availability. Our method is integrated as an installable Python package called ActiveSVC. The installation instructions and user guidance are shown at https://pypi.org/project/activeSVC. The source code of activeSVC and some demo examples are publicly available on GitHub at https://github.com/xqchen/activeSVC and Zenodo (ref. 56). We also created a Google Colaboratory project demonstrating three examples: the PBMC demo is at https://colab.research.google.com/drive/16h8hsnJ3ukTWAPnCB581dwj-nN5oopyM?usp=sharing, the Tabula Muris demo is at https://colab.research.google.com/drive/1SLehIKIQqpjK6BzEKc9m0y3uJ_LBqRzA?usp=sharing, and the PBMC cross-validation (ref. 57) demo is at https://colab.research.google.com/drive/1fhQ8GD3NyzB3w0vof9WimXK6BLqDNuDC?usp=sharing.
Competing interests. The authors declare no competing interests.
Peer review. Nature Computational Science thanks the anonymous reviewers for their contribution to the peer review of this work. Handling editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.
Errata: 04 July 2022. In the version of this article initially published, edited Source Data captions for Fig. 6 and Extended Data Fig. 1 mistakenly referred to “Extended Data” figures rather than folders with “ED” prefixes in the data. The captions have been corrected in the HTML version of the article.
Group: Hydrodynamics Laboratory
Funders:
Funding Agency | Grant Number
Heritage Medical Research Institute | UNSPECIFIED
Beckman Institute Single-cell Profiling and Engineering Center (SPEC) | UNSPECIFIED
NIH | R01HD100039
Margaret E. Early Medical Research Trust | UNSPECIFIED
Issue or Number: 6
DOI: 10.1038/s43588-022-00263-8
Record Number: CaltechAUTHORS:20220707-978247000
Persistent URL: https://resolver.caltech.edu/CaltechAUTHORS:20220707-978247000
Official Citation: Chen, X., Chen, S. & Thomson, M. Minimal gene set discovery in single-cell mRNA-seq datasets with ActiveSVM. Nat Comput Sci 2, 387–398 (2022). https://doi.org/10.1038/s43588-022-00263-8
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 115426
Collection: CaltechAUTHORS
Deposited By: George Porter
Deposited On: 08 Jul 2022 23:08
Last Modified: 25 Jul 2022 23:13
