
GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics

Zvyagin, Maxim and Brace, Alexander and Hippe, Kyle and Deng, Yuntian and Zhang, Bin and Orozco Bohorquez, Cindy and Clyde, Austin and Kale, Bharat and Perez-Rivera, Danilo and Ma, Heng and Mann, Carla M. and Irvin, Michael and Pauloski, J. Gregory and Ward, Logan and Hayot-Sasson, Valerie and Emani, Murali and Foreman, Sam and Xie, Zhen and Lin, Diangen and Shukla, Maulik and Nie, Weili and Romero, Josh and Dallago, Christian and Vahdat, Arash and Xiao, Chaowei and Gibbs, Thomas and Foster, Ian and Davis, James J. and Papka, Michael E. and Brettin, Thomas and Stevens, Rick and Anandkumar, Anima and Vishwanath, Venkatram and Ramanathan, Arvind (2022) GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics. (Unpublished) https://resolver.caltech.edu/CaltechAUTHORS:20230322-101633000.20

PDF - Submitted Version (5MB)
Creative Commons Attribution Non-commercial No Derivatives.

Use this Persistent URL to link to this item: https://resolver.caltech.edu/CaltechAUTHORS:20230322-101633000.20

Abstract

We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pretraining on over 110 million prokaryotic gene sequences and finetuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represent one of the first whole-genome-scale foundation models which can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators, utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and a peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.


Item Type: Report or Paper (Discussion Paper)
Related URLs:
URL                                                      URL Type          Description
https://doi.org/10.1101/2022.10.10.511571                DOI               Discussion Paper
http://www.ncbi.nlm.nih.gov/pmc/articles/pmc9709791/     PubMed Central    Discussion Paper
ORCID:
Author                   ORCID
Brace, Alexander         0000-0001-9873-9177
Clyde, Austin            0000-0002-3697-7070
Ma, Heng                 0000-0002-7667-922X
Pauloski, J. Gregory     0000-0002-6547-6902
Ward, Logan              0000-0002-1323-5939
Hayot-Sasson, Valerie    0000-0002-4830-4535
Dallago, Christian       0000-0003-4650-6181
Xiao, Chaowei            0000-0002-7043-4926
Foster, Ian              0000-0003-2129-5269
Davis, James J.          0000-0003-0104-5852
Papka, Michael E.        0000-0002-6418-5767
Brettin, Thomas          0000-0001-9301-9760
Anandkumar, Anima        0000-0002-6974-6797
Ramanathan, Arvind       0000-0002-1622-5488
Additional Information: The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. We thank the Argonne Leadership Computing Facility (ALCF) supported by the DOE under DE-AC02-06CH11357 and the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory supported by the DOE under Contract No. DE-AC02-05CH11231. We thank Bill Allcock, Silvio Rizzi and ALCF, Wahid Bhimji and NERSC for their timely help in enabling us to run these jobs at scale. We also thank Defne Gorgun, Lorenzo Casalino and Rommie Amaro for stimulating discussions. This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the US DOE Office of Science and the National Nuclear Security Administration, the National Institute of Allergy and Infectious Diseases, National Institutes of Health Award Number P01AI165077 (AR), the National Science Foundation Award Number 2117896 and supported by the DOE through the National Virtual Biotechnology Laboratory, a consortium of DOE national laboratories focused on response to COVID-19, with funding from the Coronavirus CARES Act. The authors have declared no competing interest.
Group: COVID-19
Funders:
Funding Agency                Grant Number
Department of Energy (DOE)    DE-AC02-06CH11357
Department of Energy (DOE)    DE-AC02-05CH11231
Department of Energy (DOE)    17-SC-20-SC
NIH                           P01AI165077
NSF                           DMR-2117896
Coronavirus CARES Act         UNSPECIFIED
PubMed Central ID: PMC9709791
DOI: 10.1101/2022.10.10.511571
Record Number: CaltechAUTHORS:20230322-101633000.20
Persistent URL: https://resolver.caltech.edu/CaltechAUTHORS:20230322-101633000.20
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 120313
Collection: CaltechAUTHORS
Deposited By: George Porter
Deposited On: 22 Mar 2023 18:10
Last Modified: 23 May 2023 21:02
