A Caltech Library Service

Learned Protein Embeddings for Machine Learning

Yang, Kevin K. and Wu, Zachary and Bedbrook, Claire N. and Arnold, Frances H. (2018) Learned Protein Embeddings for Machine Learning. Bioinformatics, 34 (15). pp. 2642-2648. ISSN 1367-4803. PMCID PMC6061698; PMC6247922.

Supplemental Material: Archive (ZIP), Supplementary Data. See Usage Policy.



Motivation: Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model’s ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling.

Results: The predictive power of Gaussian process models trained using embeddings is comparable to that of models trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions. Moreover, embeddings are simpler to obtain because they do not require alignments, structural data, or selection of informative amino-acid properties. Visualizing the embedding vectors shows that they capture meaningful relationships between the embedded proteins.
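
As an illustration of the modeling step described in the abstract, the following is a minimal sketch of Gaussian process regression on low-dimensional embedding vectors. It is not the authors' code: the embeddings are random placeholder vectors, the measured property is synthetic, and the squared-exponential kernel, noise level, and length scale are assumptions chosen for the example. It uses only the Python standard library so that it runs without dependencies; in practice a numerical library would normally handle the linear algebra.

```python
import math
import random


def rbf_kernel(x, z, length_scale=1.0):
    """Squared-exponential kernel between two embedding vectors (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def solve_lower_transposed(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i]
    return x


def gp_predict(X_train, y_train, X_test, noise=0.01, length_scale=3.0):
    """Return the GP posterior mean and variance at each test embedding."""
    y_mean = sum(y_train) / len(y_train)
    n = len(X_train)
    # Kernel matrix over training embeddings, with observation noise on the diagonal.
    K = [[rbf_kernel(X_train[i], X_train[j], length_scale) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    centered = [t - y_mean for t in y_train]
    alpha = solve_lower_transposed(L, solve_lower(L, centered))
    means, variances = [], []
    for x in X_test:
        k_star = [rbf_kernel(x, xt, length_scale) for xt in X_train]
        means.append(y_mean + sum(k * a for k, a in zip(k_star, alpha)))
        v = solve_lower(L, k_star)
        var = rbf_kernel(x, x, length_scale) - sum(t * t for t in v)
        variances.append(max(var, 0.0))  # clip tiny negative round-off
    return means, variances


# Placeholder data: random 8-dimensional "embeddings" and a synthetic property.
random.seed(0)
dim, n_train, n_test = 8, 40, 10
X = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_train + n_test)]
w = [random.gauss(0, 1) for _ in range(dim)]
y = [math.tanh(sum(wi * xi for wi, xi in zip(w, x))) + random.gauss(0, 0.05) for x in X]

means, variances = gp_predict(X[:n_train], y[:n_train], X[n_train:])
```

Here `means` holds the predicted property for each held-out embedding and `variances` the corresponding posterior uncertainty. Because the cost of each kernel evaluation scales with the vector length, a low-dimensional representation keeps this step cheap.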

Item Type: Article
Related URLs:
URL | URL Type | Description
[missing] | Central | Article
[missing] | Central | Erratum
Yang, Kevin K. (ORCID: 0000-0001-9045-6826)
Bedbrook, Claire N. (ORCID: 0000-0003-3973-598X)
Arnold, Frances H. (ORCID: 0000-0002-4027-364X)
Additional Information: © 2018 The Author. Published by Oxford University Press. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model. Availability and Implementation: The embedding vectors and code to reproduce the results are available at [URL missing]. The authors wish to thank members of the Arnold lab, Justin Bois, and Yisong Yue for general advice and discussions on this project. This work is supported by the U.S. Army Research Office Institute for Collaborative Biotechnologies [W911F-09-0001 to F.H.A., K.K.Y.], the Donna and Benjamin M. Rosen Bioengineering Center [to K.K.Y.], the National Institutes of Health [F31MH102913 to C.N.B.], and the National Science Foundation [GRF2017227007 to Z.W.]. Conflict of Interest: none declared.
Errata: The authors of the above paper wish to inform readers that the following article was incorrectly included as a reference: McIsaac, R.S. et al. (2014) Directed evolution of a far-red fluorescent rhodopsin. Proc. Natl. Acad. Sci. USA, 111, 13034–13039. The article which should have appeared in its place is: Engqvist, M.K.M. et al. (2015) Directed evolution of Gloeobacter violaceus rhodopsin spectral properties. Journal of Molecular Biology, 427, 205–220.
Group: Rosen Bioengineering Center
Funding Agency | Grant Number
Army Research Office (ARO) | W911F-09-0001
Donna and Benjamin M. Rosen Bioengineering Center | UNSPECIFIED
NIH Predoctoral Fellowship | F31MH102913
Issue or Number: 15
PubMed Central ID: PMC6061698; PMC6247922
Record Number: CaltechAUTHORS:20180330-110704718
Persistent URL:
Official Citation: Kevin K Yang, Zachary Wu, Claire N Bedbrook, Frances H Arnold; Learned protein embeddings for machine learning, Bioinformatics, Volume 34, Issue 15, 1 August 2018, Pages 2642–2648.
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 85532
Deposited By: Tony Diaz
Deposited On: 30 Mar 2018 19:27
Last Modified: 05 Mar 2020 18:18
