A Caltech Library Service

Signal Peptides Generated by Attention-Based Neural Networks

Wu, Zachary and Yang, Kevin K. and Liszka, Michael J. and Lee, Alycia and Batzilla, Alina and Wernick, David and Weiner, David P. and Arnold, Frances H. (2020) Signal Peptides Generated by Attention-Based Neural Networks. ACS Synthetic Biology, 9 (8). pp. 2154-2161. ISSN 2161-5063. doi:10.1021/acssynbio.0c00219.

PDF (ACS AuthorChoice) - Published Version
See Usage Policy.

PDF (Supplementary tables detailing (1) primers used to generate linear DNA fragments, (2) reaction conditions, (3) strains used, (4) control sequences generated, (5) distribution of protein and SP lengths as obtained from UniProt; Supplementary Sections...) - Supplemental Material
See Usage Policy.

MS Excel (Supplementary File 1: Amino acid sequences of proteins and signal peptides) - Supplemental Material
See Usage Policy.


Short (15–30 residue) chains of amino acids at the amino termini of expressed proteins known as signal peptides (SPs) specify secretion in living cells. We trained an attention-based neural network, the Transformer model, on data from all available organisms in Swiss-Prot to generate SP sequences. Experimental testing demonstrates that the model-generated SPs are functional: when appended to enzymes expressed in an industrial Bacillus subtilis strain, the SPs lead to secreted activity that is competitive with industrially used SPs. Additionally, the model-generated SPs are diverse in sequence, sharing as little as 58% sequence identity to the closest known native signal peptide and 73% ± 9% on average.
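The sequence-identity figures in the abstract can be illustrated with a minimal sketch. It assumes identity is measured as identical columns divided by the length of a global (Needleman-Wunsch) alignment; this record does not state the definition or scoring scheme the authors used, so the scoring values, the function names (`global_align`, `percent_identity`, `closest_native`), and the toy sequences in the usage note are all hypothetical.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with linear gap penalty.

    Scoring values are illustrative assumptions, not taken from the paper.
    Returns the two aligned strings, with '-' marking gaps.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1]); out_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Identical aligned residues as a percentage of alignment length."""
    al_a, al_b = global_align(a, b)
    if not al_a:
        return 0.0
    matches = sum(x == y and x != "-" for x, y in zip(al_a, al_b))
    return 100.0 * matches / len(al_a)


def closest_native(generated, natives):
    """Return (identity, sequence) for the most similar native sequence."""
    return max((percent_identity(generated, s), s) for s in natives)
```

For example, with made-up four-residue toy sequences, `closest_native("MKKL", ["MKRL", "AAAA"])` picks `"MKRL"` as the nearest neighbor. Applied to a generated SP and a set of native SPs, the same call would give the kind of "identity to the closest known native signal peptide" figure the abstract reports, under the assumptions stated above.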

Item Type: Article
Related URLs:
URL Type: Item
Description: trained Transformer model for generating signal peptides and the data used to train the model
Author: ORCID
Lee, Alycia: 0000-0001-5972-807X
Arnold, Frances H.: 0000-0002-4027-364X
Additional Information: © 2020 American Chemical Society. This is an open access article published under an ACS AuthorChoice License, which permits copying and redistribution of the article or any adaptations for non-commercial purposes.
Received: April 21, 2020; Published: July 10, 2020.
The authors would like to thank Yisong Yue, Taehwan Kim, and other instructors of the Spring 2017 CS159 course at Caltech for initial guidance, and Zheyuan (Steve) Guo and Lucas Schaus for helpful discussions. Additionally, the authors would like to thank the team members of BASF Enzymes for being gracious hosts over the course of this project and Twist Biosciences for providing DNA at educational rates.
Author Contributions: Z.W., K.K.Y., and M.J.L. contributed equally. Z.W., F.H.A., and K.K.Y. conceived and directed this study. K.K.Y., A.L., and Z.W. obtained training data and trained the models. Z.W., M.J.L., and D. Wernick planned the in vivo experimental validation. M.J.L. and A.B. performed the experimental validation. Z.W. analyzed the experimental results. D. Weiner advised the study. Z.W., F.H.A., K.K.Y., and M.J.L. wrote the paper. All authors edited and approved the manuscript.
This work was supported by BASF through the California Research Alliance (CARA), the National Science Foundation Division of Chemical, Bioengineering, Environmental and Transport Systems (CBET-1937902), a National Science Foundation Graduate Fellowship GRF2017227007 (to Z.W.), and through generous research credits provided by Amazon Web Services.
The authors declare the following competing financial interest(s): Provisional patent applications have been filed based on the results presented here.
Notes: The trained Transformer model for generating signal peptides and the data used to train the model will be available at
Funding Agency: Grant Number
California Research Alliance: UNSPECIFIED
NSF Graduate Research Fellowship: GRF2017227007
Amazon Web Services: UNSPECIFIED
Subject Keywords: machine learning, signal peptides, protein design, Bacillus subtilis, secretion
Issue or Number: 8
Record Number: CaltechAUTHORS:20200713-075534149
Official Citation: Signal Peptides Generated by Attention-Based Neural Networks. Zachary Wu, Kevin K. Yang, Michael J. Liszka, Alycia Lee, Alina Batzilla, David Wernick, David P. Weiner, and Frances H. Arnold. ACS Synthetic Biology 2020 9 (8), 2154-2161; DOI: 10.1021/acssynbio.0c00219
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 104346
Deposited By: Tony Diaz
Deposited On: 13 Jul 2020 15:52
Last Modified: 16 Nov 2021 18:30
