Published December 23, 2023 | Submitted
Discussion Paper Open

Unexplored regions of the protein sequence-structure map revealed at scale by a library of foldtuned language models

  • 1. California Institute of Technology

Abstract

Nature has likely sampled only a fraction of all protein sequences and structures allowed by the laws of biophysics. However, the combinatorial scale of amino-acid sequence-space has traditionally precluded substantive study of the full protein sequence-structure map. In particular, it remains unknown how much of the vast uncharted landscape of far-from-natural sequences consists of alternate ways to encode the familiar ensemble of natural folds; proteins in this category also represent an opportunity to diversify candidates for downstream applications. Here, we characterize sequence-structure mapping in far-from-natural regions of sequence-space guided by the capacity of protein language models (pLMs) to explore sequences outside their natural training data through generation. We demonstrate that pretrained generative pLMs sample a limited structural snapshot of the natural protein universe, including >350 common (sub)domain elements. Incorporating pLM, structure prediction, and structure-based search techniques, we surpass this limitation by developing a novel “foldtuning” strategy that pushes a pretrained pLM into a generative regime that maintains structural similarity to a target protein fold (e.g., TIM barrel or thioredoxin) while maximizing dissimilarity to natural amino-acid sequences. We apply “foldtuning” to build a library of pLMs for >700 naturally abundant folds in the SCOP database, accessing swaths of proteins that take familiar structures yet lie far from known sequences, spanning targets that include enzymes, immune ligands, and signaling proteins. By revealing protein sequence-structure information at scale outside the context of evolution, we anticipate that this work will enable future systematic searches for wholly novel folds and facilitate more immediate protein design goals in catalysis and medicine.
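The selection idea behind "foldtuning" (keep generated sequences whose predicted structure still matches the target fold, and among those prefer the ones farthest from natural sequences) can be illustrated with a minimal sketch. The abstract does not specify scoring functions, thresholds, or selection sizes, so everything below is an assumption made for illustration: the `Candidate` fields, the `select_for_next_round` function, the 0.5 cutoff, and the toy values are all hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    sequence: str
    # Hypothetical score in [0, 1]: how well the predicted structure
    # matches the target fold (e.g. from a structure-based search).
    structure_score: float
    # Hypothetical value in [0, 1]: highest fractional sequence identity
    # to any natural protein found in a reference database.
    max_natural_identity: float


def select_for_next_round(
    candidates: List[Candidate],
    min_structure_score: float = 0.5,  # assumed cutoff, not from the paper
    keep: int = 2,                     # assumed selection size
) -> List[Candidate]:
    """Keep fold-preserving candidates, then favor the least natural-like."""
    hits = [c for c in candidates if c.structure_score >= min_structure_score]
    # Lower identity to natural sequences means farther from nature.
    hits.sort(key=lambda c: c.max_natural_identity)
    return hits[:keep]


# Toy pool with made-up scores, purely for illustration.
pool = [
    Candidate("AAAA", 0.9, 0.8),
    Candidate("CCCC", 0.7, 0.2),
    Candidate("DDDD", 0.3, 0.1),  # far from nature but fails the fold check
    Candidate("EEEE", 0.6, 0.4),
]
selected = select_for_next_round(pool)
```

In a loop of this kind, the selected sequences would presumably serve as fine-tuning data for the next round of generation; that step is omitted here because the abstract gives no details on it.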

Acknowledgement

We thank Steve Mayo, Carl Pabo, Zach Martinez, Alec Lourenco, Lucas Schaus, Blade Olson, Joe Boktor, and all members of the Thomson Lab for helpful discussions.

Funding

This work was supported by the National Institutes of Health under award number R01GM150125, the Moore Foundation, the Packard Foundation, and the Heritage Medical Research Institute.

Conflict of Interest

The authors have no competing interests to disclose.

Files

2023.12.22.573145v1.full.pdf (5.7 MB)
md5:50d3e623596f0fd3e8aee06802a00d72

Additional details

Created: June 13, 2024
Modified: June 13, 2024