Published November 10, 2023 | Submitted
Discussion Paper (Open Access)

TRILL: Orchestrating Modular Deep-Learning Workflows for Democratized, Scalable Protein Analysis and Engineering

  • California Institute of Technology

Abstract

Deep-learning models have been rapidly adopted by many fields, partly due to the deluge of data humanity has amassed. In particular, the petabases of biological sequencing data enable the unsupervised training of protein language models that learn the “language of life.” However, due to their prohibitive size and complexity, contemporary deep-learning models are often unwieldy, especially for scientists with limited machine-learning backgrounds. TRILL (TRaining and Inference using the Language of Life) is a platform for creative protein design and discovery. Leveraging several state-of-the-art models such as ESM-2, DiffDock, and RFDiffusion, TRILL allows researchers to generate novel proteins, predict 3-D structures, extract high-dimensional representations of proteins, functionally classify proteins, and more. What sets TRILL apart is its ability to enable complex pipelines by chaining models together, merging the capabilities of different models to achieve a whole greater than the sum of its parts. Whether using Google Colab with one GPU or a supercomputer with hundreds, TRILL allows scientists to effectively utilize models with millions to billions of parameters through optimized training strategies such as ZeRO-Offload and distributed data parallel. TRILL therefore not only bridges the gap between complex deep-learning models and their practical application in biology, but also simplifies the orchestration of these models into comprehensive workflows, democratizing access to powerful methods. Documentation: https://trill.readthedocs.io/en/latest/home.html.

Copyright and License

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Acknowledgements

Special thanks to Inna Strazhnik for creating the informative figures that showcase TRILL. Further thanks to Martin Holmes for bravely testing alpha TRILL, and to Arjuna Subramanian and Alec Lourenço, early adopters and avid users of TRILL who provided valuable insights and found many bugs. Thanks also to Lucas Schaus for the idea of comparing classified CPPs from ProtGPT2 to “junk” random proteins, and to Changhua Yu for helping me dive into PLMs with very useful Jupyter notebooks using ESM1 that helped me get started in this fast-paced field. The authors would also like to thank Shilpa Yadahalli for cleaning and sharing Dataset E through personal communication. ZAM would also like to thank G Anthony Reina for contributing to the TRILL documentation, and Blossom Market Hall in San Gabriel for providing a comfortable setting and free Wi-Fi to work on TRILL. This work was supported by the Heritage Medical Research Institute, the Gordon and Betty Moore Foundation, NIH R01-GM150125, and the Packard Foundation.

Funding

This work was supported by the Heritage Medical Research Institute, Gordon and Betty Moore Foundation, NIH R01-GM150125, and the Packard Foundation.

Data Availability

https://doi.org/10.22002/mn4w0-cqj07

Code Availability

https://doi.org/10.22002/mn4w0-cqj07

Conflict of Interest

The authors have declared no competing interest.

Files

2023.10.24.563881v2.full.pdf — 5.4 MB
md5:07910ec9edd9e7c19663a3ecf29cd21c

Additional details

Created: April 18, 2024
Modified: April 18, 2024