CaltechAUTHORS
  A Caltech Library Service

Active feature selection discovers minimal gene sets for classifying cell types and disease states with single-cell mRNA-seq data

Chen, Xiaoqiao and Chen, Sisi and Thomson, Matt (2021) Active feature selection discovers minimal gene sets for classifying cell types and disease states with single-cell mRNA-seq data. (Unpublished) https://resolver.caltech.edu/CaltechAUTHORS:20210622-154854635

PDF (February 12, 2022) - Submitted Version
Creative Commons Attribution Non-commercial No Derivatives.

9MB

Use this Persistent URL to link to this item: https://resolver.caltech.edu/CaltechAUTHORS:20210622-154854635

Abstract

Sequencing costs currently prohibit the application of single-cell mRNA-seq to many biological and clinical analyses. Targeted single-cell mRNA-sequencing reduces sequencing costs by profiling reduced gene sets that capture biological information with a minimal number of genes. Here, we introduce an active learning method (ActiveSVM) that identifies minimal but highly informative gene sets that enable the identification of cell types, physiological states, and genetic perturbations in single-cell data using a small number of genes. Our active feature selection procedure generates minimal gene sets from single-cell data through an iterative cell-type classification task in which misclassified cells are examined at each round of analysis to identify maximally informative genes through an 'active' support vector machine (ActiveSVM) classifier. By focusing computational resources on misclassified cells, ActiveSVM scales to analyze data sets with over a million single cells. We demonstrate that ActiveSVM feature selection identifies gene sets that enable ~90% cell-type classification accuracy across a variety of data sets, including cell atlas and disease characterization data sets. The method generalizes to reveal genes that respond to genetic perturbations and to identify region-specific gene expression patterns in spatial transcriptomics data. The discovery of small but highly informative gene sets should enable substantial reductions in the number of measurements necessary for application of single-cell mRNA-seq to clinical tests, therapeutic discovery, and genetic screens.
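The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' ActiveSVM implementation: it handles only two classes labeled +1 and -1, trains a small linear SVM by batch subgradient descent as a stand-in classifier, and ranks candidate genes by the absolute label-weighted expression summed over the misclassified cells. That scoring rule, the function names (train_linear_svm, predict, active_feature_selection), and the toy expression matrix are all assumptions made for this example; the abstract states only that misclassified cells are examined at each round to find maximally informative genes.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, reg=0.01):
    """Fit a linear SVM (hinge loss + L2 penalty) by batch subgradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [reg * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:  # cell lies inside the margin or is misclassified
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return the predicted label (+1 or -1) for one cell."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def active_feature_selection(X, y, n_genes):
    """Greedily pick up to n_genes columns, looking only at misclassified cells."""
    n_cells, n_all = len(X), len(X[0])
    selected = []
    while len(selected) < n_genes:
        if selected:
            # Train on the genes chosen so far and collect the cells it gets wrong.
            Xs = [[row[j] for j in selected] for row in X]
            w, b = train_linear_svm(Xs, y)
            wrong = [i for i in range(n_cells) if predict(w, b, Xs[i]) != y[i]]
        else:
            # No classifier exists yet, so every cell counts as misclassified.
            wrong = list(range(n_cells))
        if not wrong:
            break  # the current gene set already classifies every cell
        best_gene, best_score = None, -1.0
        for j in range(n_all):
            if j in selected:
                continue
            # ASSUMED score: |sum of label * expression| over misclassified cells.
            score = abs(sum(y[i] * X[i][j] for i in wrong))
            if score > best_score:
                best_gene, best_score = j, score
        if best_gene is None:
            break  # no unselected genes remain
        selected.append(best_gene)
    return selected


# Toy data (hypothetical): 20 cells x 4 genes, 10 cells per class.
y = [1] * 10 + [-1] * 10
X = []
for i, label in enumerate(y):
    housekeeping = 1.0                                  # same in every cell
    noise = float(i % 2)                                # unrelated to the label
    marker_a = 5.0 if label == 1 and i < 8 else 0.0     # marks most +1 cells
    marker_b = 4.0 if label == 1 and i >= 8 else 0.0    # marks the remaining +1 cells
    X.append([housekeeping, noise, marker_a, marker_b])

selected = active_feature_selection(X, y, n_genes=2)
print(selected)
```

In this toy setup the second gene is chosen from only the two cells the first classifier still gets wrong, which is the cost-saving idea the abstract attributes to focusing on misclassified cells. A multi-class or million-cell setting would need a different classifier and scoring rule than this sketch uses.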


Item Type:Report or Paper (Discussion Paper)
Related URLs:
URL | URL Type | Description
https://doi.org/10.1101/2021.06.15.448478 | DOI | Discussion Paper
http://support.10xgenomics.com/single-cell/datasets | Related Item | Data
https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues_from_Mus_musculus_at_single_cell_resolution/27733 | Related Item | Data
https://figshare.com/articles/dataset/PopAlign_Data/11837097/3 | Related Item | Data
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2396856 | Related Item | Data
https://github.com/CaiGroup/seqFISH-PLUS | Related Item | Code
ORCID:
Author | ORCID
Chen, Sisi | 0000-0001-9448-9713
Alternate Title:Active feature selection discovers minimal gene-sets for classifying cell-types and disease states in single-cell mRNA-seq data
Additional Information:The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. Version 1 - June 16, 2021; Version 2 - February 12, 2022. Data Availability: All data used in the paper have been previously published. The PBMC single-cell RNA-seq data have been deposited in the Short Read Archive under accession number SRP073767 by the authors of [17]. Data are also available at http://support.10xgenomics.com/single-cell/datasets. The original Tabula Muris dataset is available at https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues_from_Mus_musculus_at_single_cell_resolution/27733. The original multiple myeloma PBMC data, containing 2 healthy donors and 4 multiple myeloma donors, are available at https://figshare.com/articles/dataset/PopAlign_Data/11837097/3. The 10x Genomics Megacell data set is available at http://support.10xgenomics.com/single-cell/datasets. The Perturb-seq data set [21] is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2396856. The spatial transcriptomics data [22] are available at https://github.com/CaiGroup/seqFISH-PLUS. The authors have declared no competing interest.
DOI:10.1101/2021.06.15.448478
Record Number:CaltechAUTHORS:20210622-154854635
Persistent URL:https://resolver.caltech.edu/CaltechAUTHORS:20210622-154854635
Official Citation:Active feature selection discovers minimal gene sets for classifying cell types and disease states with single-cell mRNA-seq data. Xiaoqiao Chen, Sisi Chen, Matt Thomson. bioRxiv 2021.06.15.448478; doi: https://doi.org/10.1101/2021.06.15.448478
Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:109524
Collection:CaltechAUTHORS
Deposited By: Tony Diaz
Deposited On:23 Jun 2021 19:39
Last Modified:11 Apr 2022 21:28
