A Caltech Library Service

A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential

Hill, Steven T. and Kuintzle, Rachael and Teegarden, Amy and Merrill, Erich, III and Danaee, Padideh and Hendrix, David A. (2018) A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential. Nucleic Acids Research, 46 (16). pp. 8105-8113. ISSN 0305-1048. PMCID PMC6144860.

PDF - Published Version
Creative Commons Attribution Non-commercial.

PDF - Submitted Version
Creative Commons Attribution.

MS Word (Supplementary Data) - Supplemental Material
Creative Commons Attribution Non-commercial.


The current deluge of newly identified RNA transcripts presents a singular opportunity for improved assessment of coding potential, a cornerstone of genome annotation, and for machine-driven discovery of biological knowledge. While traditional, feature-based methods for RNA classification are limited by current scientific knowledge, deep learning methods can independently discover complex biological rules in the data de novo. We trained a gated recurrent neural network (RNN) on human messenger RNA (mRNA) and long noncoding RNA (lncRNA) sequences. Our model, mRNA RNN (mRNN), surpasses state-of-the-art methods at predicting protein-coding potential despite being trained with less data and with no prior concept of what features define mRNAs. To understand what mRNN learned, we probed the network and uncovered several context-sensitive codons highly predictive of coding potential. Our results suggest that gated RNNs can learn complex and long-range patterns in full-length human transcripts, making them ideal for performing a wide range of difficult classification tasks and, most importantly, for harvesting new biological insights from the rising flood of sequencing data.
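The abstract's central idea, a gated recurrent network that reads a transcript one nucleotide at a time and outputs a coding-potential score, can be illustrated with a minimal sketch. This is not the authors' mRNN implementation: the class name `TinyGRU`, the hidden size, the one-hot encoding and the random, untrained weights are all assumptions made for illustration, and the gate equations follow one common GRU convention.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def one_hot(base):
    """Encode one nucleotide as a 4-dim vector; U is treated as T, unknowns as zeros."""
    base = base.upper().replace("U", "T")
    return [1.0 if base == n else 0.0 for n in NUCLEOTIDES]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def affine(W, U, b, x, h):
    """Compute W x + U h + b for a single gate."""
    return [
        sum(w * xi for w, xi in zip(W[i], x))
        + sum(u * hi for u, hi in zip(U[i], h))
        + b[i]
        for i in range(len(b))
    ]


class TinyGRU:
    """Hypothetical single-layer GRU classifier with random (untrained) weights."""

    def __init__(self, hidden_size=8, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        # One (input weights, recurrent weights, bias) triple per gate.
        self.params = {
            gate: (
                rand_matrix(hidden_size, 4, rng),
                rand_matrix(hidden_size, hidden_size, rng),
                [0.0] * hidden_size,
            )
            for gate in ("update", "reset", "candidate")
        }
        self.out_w = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
        self.out_b = 0.0

    def step(self, x, h):
        """Advance the hidden state by one nucleotide."""
        z = [sigmoid(v) for v in affine(*self.params["update"], x, h)]
        r = [sigmoid(v) for v in affine(*self.params["reset"], x, h)]
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        c = [math.tanh(v) for v in affine(*self.params["candidate"], x, gated_h)]
        # The update gate interpolates between the old state and the candidate.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]

    def coding_score(self, sequence):
        """Read the full sequence, then map the final state to a score in (0, 1)."""
        h = [0.0] * self.hidden_size
        for base in sequence:
            h = self.step(one_hot(base), h)
        logit = sum(w * hi for w, hi in zip(self.out_w, h)) + self.out_b
        return sigmoid(logit)


if __name__ == "__main__":
    model = TinyGRU()
    print(model.coding_score("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"))
```

Because the weights are random, the printed score carries no biological meaning; a real classifier would fit these parameters to labeled mRNA and lncRNA sequences. The sketch only shows how the hidden state carries context across the whole transcript, which is what lets a gated RNN respond to a codon differently depending on what precedes it.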

Item Type: Article
Related URLs:
URLURL TypeDescription CentralArticle Paper
Additional Information: © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. Received April 11, 2018; Revised May 20, 2018; Editorial Decision June 07, 2018; Accepted June 15, 2018; Published: 09 July 2018. The authors would like to thank Prof. Stephen Ramsey, Prof. Christopher K. Mathews, Prof. Liang Huang, Prof. Colin Johnson, Prof. P. Andy Karplus and Prof. Michael Freitag for feedback on the manuscript and helpful discussions. The authors thank Mike Tyka for the suggestion to use data augmentation. Authors’ contribution: S.H., R.K., E.M., A.T. and D.H. wrote the software. S.H., R.K., A.T., P.D. and D.H. did the bioinformatics analysis. R.K., D.H. and S.H. wrote the manuscript. Funding: NIH [R56 AG053460, R21 AG052950]; Oregon State University (start-up grant). Funding for open access charge: NIH [R56 AG053460]. Conflict of interest statement: None declared.
Funding Agency: Grant Number
NIH: R56 AG053460
NIH: R21 AG052950
Oregon State University: UNSPECIFIED
Issue or Number: 16
PubMed Central ID: PMC6144860
Record Number: CaltechAUTHORS:20181026-154742624
Persistent URL:
Official Citation: Steven T Hill, Rachael Kuintzle, Amy Teegarden, Erich Merrill, Padideh Danaee, David A Hendrix; A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential, Nucleic Acids Research, Volume 46, Issue 16, 19 September 2018, Pages 8105–8113,
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 90443
Deposited By: Tony Diaz
Deposited On: 26 Oct 2018 23:18
Last Modified: 03 Oct 2019 20:25
