Benchmarking protein language models for protein crystallization

Mall, Raghvendra; Kaushik, Rahul; Martinez, Zachary A.; Thomson, Matt W.; Castiglione, Filippo

doi:10.1038/s41598-025-86519-5

Published January 18, 2025 | Version Published

Journal Article Open

1. Technology Innovation Institute
2. California Institute of Technology
3. National Research Council

Protein structure determination is most commonly performed by X-ray crystallography. Several in silico deep learning methods have been developed to predict the crystallization propensities of proteins from their sequences, with the aim of overcoming the high attrition rate, experimental cost and extensive trial-and-error of crystallization experiments. In this work, we benchmark the power of open protein language models (PLMs) for predicting the crystallization propensities of proteins through the TRILL platform, a bespoke framework democratizing the usage of PLMs. We build LightGBM / XGBoost classifiers on the average embedding representations of proteins learned by different PLMs, such as ESM2, Ankh, ProtT5-XL, ProstT5, xTrimoPGLM and SaProt, and compare them with state-of-the-art sequence-based methods such as DeepCrystal, ATTCrys and CLPred to identify the most effective methods for predicting crystallization outcomes. The LightGBM classifiers using embeddings from ESM2 models with 30 and 36 transformer layers (150 and 3000 million parameters, respectively) achieve performance gains of 3-5% over all compared models on several evaluation metrics, including AUPR (Area Under the Precision-Recall Curve), AUC (Area Under the Receiver Operating Characteristic Curve) and F1, on independent test sets. Furthermore, we fine-tune the ProtGPT2 model available via TRILL to generate crystallizable proteins. Starting with 3000 generated proteins and applying a series of filtration steps, including consensus of all open PLM-based classifiers, sequence identity through CD-HIT, secondary structure compatibility, aggregation screening, homology search and foldability evaluation, we identified a set of 5 novel proteins as potentially crystallizable.
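
The benchmarking recipe summarized in the abstract (average the per-residue PLM embeddings into one vector per protein, score each protein with a classifier, then compare AUC, AUPR and F1) can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' code: the function names are hypothetical, the toy scores stand in for the output of a LightGBM or XGBoost classifier, AUPR is approximated by average precision, and the 0.5 decision threshold is an assumption.

```python
from typing import List, Sequence


def mean_pool(residue_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors (length L, dimension D) into one D-dim vector."""
    n = len(residue_embeddings)
    if n == 0:
        raise ValueError("protein has no residues")
    dim = len(residue_embeddings[0])
    return [sum(row[j] for row in residue_embeddings) / n for j in range(dim)]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC: probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUPR estimate: mean precision at the rank of each positive (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits


def f1_at_threshold(labels: Sequence[int], scores: Sequence[float],
                    threshold: float = 0.5) -> float:
    """F1 after binarizing scores at an assumed decision threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy per-residue embeddings for a 2-residue protein with 2 features.
    print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
    # Toy labels (1 = crystallizable) and made-up classifier scores.
    y_true = [1, 0, 1, 0]
    y_score = [0.9, 0.8, 0.7, 0.1]
    print(roc_auc(y_true, y_score))
    print(average_precision(y_true, y_score))
    print(f1_at_threshold(y_true, y_score))
```

In a real benchmark the pooled vectors would be the input features of the gradient-boosted classifier, and the three metrics would be computed on held-out test sets rather than on four toy examples.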

Copyright and License

© 2025, The Author(s). This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Acknowledgement

The authors would like to acknowledge Dr. Thomas Launey for his valuable feedback which helped to better position the paper and the reviewers whose suggestions helped to enhance the comprehensiveness of the manuscript.

Data Availability

All the code used for the analysis in this study is available at https://github.com/raghvendra5688/crystallization_benchmark/

Supplemental Material

Supplementary Information: https://static-content.springer.com/esm/art%3A10.1038%2Fs41598-025-86519-5/MediaObjects/41598_2025_86519_MOESM1_ESM.pdf

Contributions

R.M., M.T. and F.C. conceived the study. R.M. and R.K. performed the data curation. R.M., Z.M. and R.K. designed the methodology. R.M. and R.K. performed the experiments and visualizations. All authors contributed to writing, reviewing and editing the manuscript.

Files

Files (8.6 MB)

Name	MD5	Size
41598_2025_86519_MOESM1_ESM.pdf	f0d2332624475a1396d012f41c757456	306.8 kB
s41598-025-86519-5.pdf	7e03d2c1d854c37d0e7d6ca49622fd0a	8.3 MB

Additional details

Describes: Journal Article: https://rdcu.be/egiVO (URL)

Accepted: 2025-01-13

Available (published online): 2025-01-18

Caltech groups: Division of Biology and Biological Engineering (BBE)
Publication Status: Published

