
Machine Learning-Assisted Directed Evolution Navigates a Combinatorial Epistatic Fitness Landscape with Minimal Screening Burden

Wittmann, Bruce J. and Yue, Yisong and Arnold, Frances H. (2020) Machine Learning-Assisted Directed Evolution Navigates a Combinatorial Epistatic Fitness Landscape with Minimal Screening Burden. (Unpublished)

PDF - Submitted Version: Creative Commons Attribution Non-commercial No Derivatives.
PDF - Supplemental Material: Creative Commons Attribution Non-commercial No Derivatives.


Due to screening limitations, in directed evolution (DE) of proteins it is rarely feasible to fully evaluate combinatorial mutant libraries made by mutagenesis at multiple sites. Instead, DE often involves a single-step greedy optimization in which the mutation in the highest-fitness variant identified in each round of single-site mutagenesis is fixed. However, because the effects of a mutation can depend on the presence or absence of other mutations, the efficiency and effectiveness of a single-step greedy walk is influenced by both the starting variant and the order in which beneficial mutations are identified—the process is path-dependent. We recently demonstrated a path-independent machine learning-assisted approach to directed evolution (MLDE) that allows in silico screening of full combinatorial libraries made by simultaneous saturation mutagenesis, thus explicitly capturing the effects of cooperative mutations and bypassing the path-dependence that can limit greedy optimization. Here, we thoroughly investigate and optimize an MLDE workflow by testing a number of design considerations of the MLDE pipeline. Specifically, we (1) test the effects of different encoding strategies on MLDE efficiency, (2) integrate new models and a training procedure more amenable to protein engineering tasks, and (3) incorporate training set design strategies to avoid information-poor low-fitness protein variants (“holes”) in the training data. When applied to an epistatic, hole-filled, four-site combinatorial fitness landscape of protein G domain B1 (GB1), the resulting focused training MLDE (ftMLDE) protocol achieved the global fitness maximum up to 92% of the time at a total screening burden of 470 variants. 
In contrast, single-step greedy optimization at its minimal screening burden (80 total variants) reached the global maximum of the GB1 fitness landscape just 1.2% of the time; ftMLDE at the same screening burden reached the global optimum up to 9.6% of the time and achieved a 49% higher expected maximum fitness. To facilitate further development of MLDE, we present the MLDE software package, which is designed for use by protein engineers without computational or machine learning expertise.
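
The path-dependence of a single-step greedy walk described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three-letter alphabet, the two-site landscape, the fitness values, and the function name `greedy_walk` are hypothetical and are not taken from the GB1 data or from the MLDE software package.

```python
# Toy illustration of a single-step greedy walk on an epistatic landscape.
# The alphabet, landscape, and fitness values are hypothetical.

ALPHABET = "ABC"  # stand-in for the 20 amino acids

# Two-site landscape with epistasis: "C" is poor at either site alone
# but yields the global maximum when present at both sites.
FITNESS = {
    "AA": 1.0, "AB": 2.0, "AC": 0.5,
    "BA": 1.5, "BB": 0.2, "BC": 0.1,
    "CA": 0.3, "CB": 0.4, "CC": 5.0,
}


def greedy_walk(start, site_order):
    """Visit sites in the given order; at each site, screen every residue
    in the current background and fix the fittest one.

    Returns the final variant and the number of variants screened.
    """
    current = start
    screened = 0
    for site in site_order:
        candidates = [
            current[:site] + residue + current[site + 1:]
            for residue in ALPHABET
        ]
        screened += len(candidates)
        current = max(candidates, key=FITNESS.get)
    return current, screened


if __name__ == "__main__":
    global_best = max(FITNESS, key=FITNESS.get)
    print("global maximum:", global_best, FITNESS[global_best])
    for start, order in [("AA", (0, 1)), ("AA", (1, 0)), ("CA", (1, 0))]:
        end, screened = greedy_walk(start, order)
        print(start, order, "->", end, FITNESS[end], "screened:", screened)
```

In this toy landscape the walk started from `AA` ends at a different local optimum depending on the site order and never reaches `CC`, while the walk started from `CA` with site order `(1, 0)` does reach it. Exhaustive evaluation finds `CC` regardless of start or order, but at the cost of screening the whole library, which for four sites and 20 amino acids is 20^4 = 160,000 variants. That gap is what motivates predicting the full library in silico from a small screened training set.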

Item Type: Report or Paper (Discussion Paper)
Wittmann, Bruce J. (ORCID: 0000-0001-8144-9157)
Yue, Yisong (ORCID: 0000-0001-9127-1989)
Arnold, Frances H. (ORCID: 0000-0002-4027-364X)
Additional Information: The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. This version posted December 4, 2020. The authors thank Patrick Almhjell, Lucas Schaus, and Sabine Brinkmann-Chen for helpful discussion and critical reading of the manuscript, Zachary Wu, Kadina Johnston, and Amir Motmaen for helpful discussion, and Paul Chang for assistance with Triad calculations. Additionally, the authors thank NVIDIA Corporation for donation of two Titan V GPUs used in this work as well as Inc. for donation of AWS computing credits. This work was supported by the NSF Division of Chemical, Bioengineering, Environmental and Transport Systems (CBET 1937902), and by an Amgen Chem-Bio-Engineering Award (CBEA). The authors declare no competing interests.
Funding Agency: Amazon Web Services (Grant Number: UNSPECIFIED)
Record Number: CaltechAUTHORS:20201207-131007947
Official Citation: Machine Learning-Assisted Directed Evolution Navigates a Combinatorial Epistatic Fitness Landscape with Minimal Screening Burden. Bruce J. Wittmann, Yisong Yue, Frances H. Arnold. bioRxiv 2020.12.04.408955
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 106948
Deposited By: Tony Diaz
Deposited On: 07 Dec 2020 21:15
Last Modified: 07 Dec 2020 21:15
