CaltechAUTHORS
  A Caltech Library Service

Predicting gene essentiality in Caenorhabditis elegans by feature engineering and machine-learning

Campos, Tulio L. and Korhonen, Pasi K. and Sternberg, Paul W. and Gasser, Robin B. and Young, Neil D. (2020) Predicting gene essentiality in Caenorhabditis elegans by feature engineering and machine-learning. Computational and Structural Biotechnology Journal, 18, pp. 1093-1102. ISSN 2001-0370. https://resolver.caltech.edu/CaltechAUTHORS:20200518-091100663

PDF - Published Version (1962 KB)
Creative Commons Attribution Non-commercial No Derivatives.

Archive (ZIP) (Supplementary data) - Supplemental Material (7 MB)
Creative Commons Attribution Non-commercial No Derivatives.


Abstract

Defining genes that are essential for life has major implications for understanding critical biological processes and mechanisms. Although essential genes have been identified and characterised experimentally using functional genomic tools, it is challenging to predict such genes with confidence from molecular and phenomic data sets using computational methods. Using extensive data sets available for the model organism Caenorhabditis elegans, we constructed here a machine-learning (ML)-based workflow for the prediction of essential genes on a genome-wide scale. We identified strong predictors for such genes and showed that trained ML models consistently achieve highly accurate classifications. Complementary analyses revealed an association between essential genes and chromosomal location. Our findings reveal that essential genes in C. elegans tend to be located in or near the centre of autosomal chromosomes; are positively correlated with low single nucleotide polymorphism (SNP) densities and epigenetic markers in promoter regions; are involved in protein and nucleotide processing; are transcribed in most cells; are enriched in reproductive tissues or are targets for small RNAs bound to the Argonaute CSR-1. Based on these results, we hypothesise an interplay between epigenetic markers and small RNA pathways in the germline, with transcription-based memory; this hypothesis warrants testing. From a technical perspective, further work is needed to evaluate whether the present ML-based approach will be applicable to other metazoans (including Drosophila melanogaster) for which comprehensive data sets (i.e. genomic, transcriptomic, proteomic, variomic, epigenetic and phenomic) are available.


Item Type: Article
Related URLs:
URL | URL Type | Description
https://doi.org/10.1016/j.csbj.2020.05.008 | DOI | Article
https://bitbucket.org/tuliocampos/essential_elegans | Related Item | Data/Code
https://doi.org/10.6084/m9.figshare.11533101 | Related Item | Data/Code
ORCID:
Author | ORCID
Sternberg, Paul W. | 0000-0002-7699-0173
Additional Information: © 2020 The Authors. Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Received 23 March 2020, Revised 1 May 2020, Accepted 6 May 2020, Available online 15 May 2020.

This research was funded by grants from the National Health and Medical Research Council (NHMRC) of Australia and the Australian Research Council (ARC) to RBG, PKK and/or NDY. Other support to RBG was from the Melbourne Water. NDY was supported by a Career Development Fellowship, and PKK by an Early Career Research Fellowship from NHMRC. TLC was a recipient of a Research Training Program Scholarship from the Australian Government and is also supported by the Oswaldo Cruz Foundation (Fiocruz/Brazil). PWS was supported by U.S. National Institutes of Health grant U24-HG002223.

CRediT authorship contribution statement: Tulio L. Campos: Conceptualization, Methodology, Software, Validation, Data curation, Writing - original draft, Visualization, Investigation, Writing - review & editing. Pasi K. Korhonen: Conceptualization, Supervision, Software, Validation, Visualization, Investigation, Writing - review & editing. Paul W. Sternberg: Visualization, Investigation, Writing - review & editing. Robin B. Gasser: Conceptualization, Supervision, Visualization, Investigation, Writing - review & editing. Neil D. Young: Conceptualization, Supervision, Visualization, Investigation, Writing - review & editing.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data and code availability: The data used herein, the code developed to perform the systematic ML approaches as well as information regarding software versions and attached libraries are available at: https://bitbucket.org/tuliocampos/essential_elegans. A static version linked to this publication is available at: https://doi.org/10.6084/m9.figshare.11533101.
Funders:
Funding Agency | Grant Number
National Health and Medical Research Council (NHMRC) | UNSPECIFIED
Australian Research Council | UNSPECIFIED
Melbourne Water | UNSPECIFIED
Fundação Oswaldo Cruz | UNSPECIFIED
NIH | U24-HG002223
Subject Keywords: Caenorhabditis elegans; Machine-learning; Essential genes; Essentiality predictions
Record Number: CaltechAUTHORS:20200518-091100663
Persistent URL: https://resolver.caltech.edu/CaltechAUTHORS:20200518-091100663
Official Citation: Tulio L. Campos, Pasi K. Korhonen, Paul W. Sternberg, Robin B. Gasser, Neil D. Young, Predicting gene essentiality in Caenorhabditis elegans by feature engineering and machine-learning, Computational and Structural Biotechnology Journal, Volume 18, 2020, Pages 1093-1102, ISSN 2001-0370, https://doi.org/10.1016/j.csbj.2020.05.008. (http://www.sciencedirect.com/science/article/pii/S2001037020302713)
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 103265
Collection: CaltechAUTHORS
Deposited By: Tony Diaz
Deposited On: 18 May 2020 16:17
Last Modified: 27 May 2020 16:45
