
Informed training set design enables efficient machine learning-assisted directed protein evolution

Wittmann, Bruce J. and Yue, Yisong and Arnold, Frances H. (2021) Informed training set design enables efficient machine learning-assisted directed protein evolution. Cell Systems, 12 (11). pp. 1026-1045. ISSN 2405-4712. doi:10.1016/j.cels.2021.07.008. https://resolver.caltech.edu/CaltechAUTHORS:20201207-131007947

Files:
PDF - Submitted Version (2MB). Creative Commons Attribution Non-commercial No Derivatives.
PDF (Figures S1–S10 and Tables S1–S10) - Supplemental Material (2MB). See Usage Policy.
MS Excel (Data S1) - Supplemental Material (12kB). See Usage Policy.
MS Excel (Data S2) - Supplemental Material (103kB). See Usage Policy.
MS Excel (Data S3) - Supplemental Material (29kB). See Usage Policy.
PDF (Transparent peer review records for Wittmann et al.) - Supplemental Material (765kB). See Usage Policy.


Abstract

Directed evolution of proteins often involves a greedy optimization in which the mutation in the highest-fitness variant identified in each round of single-site mutagenesis is fixed. The efficiency of such a single-step greedy walk depends on the order in which beneficial mutations are identified—the process is path dependent. Here, we investigate and optimize a path-independent machine learning-assisted directed evolution (MLDE) protocol that allows in silico screening of full combinatorial libraries. In particular, we evaluate the importance of different protein encoding strategies, training procedures, models, and training set design strategies on MLDE outcome, finding the most important consideration to be the implementation of strategies that reduce inclusion of minimally informative “holes” (protein variants with zero or extremely low fitness) in training data. When applied to an epistatic, hole-filled, four-site combinatorial fitness landscape, our optimized protocol achieved the global fitness maximum up to 81-fold more frequently than single-step greedy optimization. A record of this paper’s transparent peer review process is included in the supplemental information.
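The path dependence of a single-step greedy walk can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and none of it comes from the paper's data or code: the three-letter alphabet, the three-site `TOY_FITNESS` table (unlisted variants are treated as zero-fitness "holes"), and the helper names `fitness`, `greedy_walk`, and `global_maximum`. The walk shown is a simplified reading of the procedure described in the abstract: in each round, screen all single-site mutants at positions not yet fixed, fix the best mutation if it improves fitness, and stop otherwise.

```python
from itertools import product

# Hypothetical 3-letter alphabet for illustration (real proteins use 20 amino acids).
ALPHABET = "ACD"

# Hypothetical toy landscape, not data from the paper.
# Any variant not listed here is a "hole" with zero fitness.
TOY_FITNESS = {
    "AAA": 0.1, "CAA": 0.3, "CCA": 0.5, "CCD": 0.6,
    "ADA": 0.2, "DDA": 0.3, "ADD": 0.4, "DDD": 1.0,
}


def fitness(variant):
    """Look up fitness in the toy landscape; holes score 0.0."""
    return TOY_FITNESS.get(variant, 0.0)


def greedy_walk(start):
    """Simplified single-step greedy walk starting from `start`."""
    current = start
    free_positions = set(range(len(start)))
    while free_positions:
        # One round: screen every single-site mutant at positions not yet fixed.
        candidates = [
            (current[:pos] + aa + current[pos + 1:], pos)
            for pos in sorted(free_positions)
            for aa in ALPHABET
            if aa != current[pos]
        ]
        best, best_pos = max(candidates, key=lambda c: fitness(c[0]))
        if fitness(best) <= fitness(current):
            break  # no single mutation improves fitness, so the walk ends
        current = best
        free_positions.remove(best_pos)  # this position is now fixed
    return current


def global_maximum(length=3):
    """Exhaustively score the full combinatorial library (feasible only for toys)."""
    variants = ("".join(v) for v in product(ALPHABET, repeat=length))
    return max(variants, key=fitness)


if __name__ == "__main__":
    for start in ("AAA", "ADA"):
        end = greedy_walk(start)
        print(start, "->", end, fitness(end))
    print("global maximum:", global_maximum())
```

In this toy landscape the walk that starts at `AAA` ends at the local optimum `CCD`, while the walk that starts at `ADA` reaches the global maximum `DDD`, so the outcome depends on where the walk begins and which mutations are fixed first. A path-independent approach such as MLDE instead ranks the whole combinatorial library at once, which the exhaustive `global_maximum` helper stands in for here.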


Item Type: Article
Related URLs:
URL | URL Type | Description
https://doi.org/10.1016/j.cels.2021.07.008 | DOI | Article
https://doi.org/10.1101/2020.12.04.408955 | DOI | Discussion Paper
ORCID:
Author | ORCID
Wittmann, Bruce J. | 0000-0001-8144-9157
Yue, Yisong | 0000-0001-9127-1989
Arnold, Frances H. | 0000-0002-4027-364X
Alternate Title: Machine Learning-Assisted Directed Evolution Navigates a Combinatorial Epistatic Fitness Landscape with Minimal Screening Burden
Additional Information: © 2021 Elsevier Inc. Received 15 December 2020, Revised 6 May 2021, Accepted 26 July 2021, Available online 19 August 2021.
The authors thank Sabine Brinkmann-Chen, Patrick Almhjell, and Lucas Schaus for helpful discussion and critical reading of the manuscript, Zachary Wu, Kadina Johnston, and Amir Motmaen for helpful discussion, Suresh Guptha for assistance with computational infrastructure development and maintenance, and Paul Chang for assistance with Triad calculations. Additionally, the authors thank NVIDIA Corporation for donation of two Titan V GPUs used in this work and Amazon.com for donation of Amazon web services (AWS) computing credits. This work was supported by the NSF Division of Chemical, Bioengineering, Environmental and Transport Systems (CBET 1937902) and by an Amgen Chem-Bio-Engineering Award (CBEA).
Author contributions: Conceptualization, B.J.W., Y.Y., and F.H.A.; methodology, B.J.W. and Y.Y.; software, B.J.W.; validation, B.J.W.; formal analysis, B.J.W.; investigation, B.J.W.; writing – original draft, B.J.W., Y.Y., and F.H.A.; writing – review & editing, B.J.W., Y.Y., and F.H.A.; visualization, B.J.W.
The authors declare no competing interests.
Data and code availability: Data needed to replicate simulations have been deposited at Caltech Data and are publicly available as of the date of publication. DOIs are listed in the key resources table. The raw simulation data reported in this study cannot be deposited in a public repository because it is multiple terabytes in size. To request access, contact Bruce Wittmann at bwittman@caltech.edu. In addition, summary statistics describing these raw data have been deposited at Caltech Data and are publicly available as of the date of publication. DOIs are listed in the key resources table. This paper analyzes existing, publicly available data. The accession numbers for the datasets are listed in the key resources table. All original code has been deposited at Caltech Data and is publicly available as of the date of publication. DOIs are listed in the key resources table. Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
Funders:
Funding Agency | Grant Number
NVIDIA Corporation | UNSPECIFIED
Amazon Web Services | UNSPECIFIED
NSF | CBET-1937902
Amgen | UNSPECIFIED
Subject Keywords: machine learning; directed evolution; epistasis; zero-shot prediction; fitness landscape; combinatorial mutagenesis; protein engineering
Issue or Number: 11
DOI: 10.1016/j.cels.2021.07.008
Record Number: CaltechAUTHORS:20201207-131007947
Persistent URL: https://resolver.caltech.edu/CaltechAUTHORS:20201207-131007947
Official Citation: Bruce J. Wittmann, Yisong Yue, Frances H. Arnold, Informed training set design enables efficient machine learning-assisted directed protein evolution, Cell Systems, Volume 12, Issue 11, 2021, Pages 1026-1045.e7, ISSN 2405-4712, https://doi.org/10.1016/j.cels.2021.07.008. (https://www.sciencedirect.com/science/article/pii/S2405471221002866)
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 106948
Collection: CaltechAUTHORS
Deposited By: Tony Diaz
Deposited On: 07 Dec 2020 21:15
Last Modified: 18 Nov 2021 22:56
