A Caltech Library Service

Generator based approach to analyze mutations in genomic datasets

Jain, Siddharth and Xiao, Xiongye and Bogdan, Paul and Bruck, Jehoshua (2021) Generator based approach to analyze mutations in genomic datasets. Scientific Reports, 11. Art. No. 21084. ISSN 2045-2322. PMCID PMC8548350. doi:10.1038/s41598-021-00609-8.

PDF - Published Version: Creative Commons Attribution.
PDF - Submitted Version: Creative Commons Attribution Non-commercial No Derivatives.
PDF - Supplemental Material: Creative Commons Attribution.


In contrast to the conventional approach of directly comparing genomic sequences using sequence alignment tools, we propose a computational approach that performs comparisons between sequence generators. These sequence generators are learned via a data-driven approach that empirically computes the state machine generating the genomic sequence of interest. As the state-machine-based generator of the sequence is independent of the sequence length, it provides us with an efficient method to compute the statistical distance between large sets of genomic sequences. Moreover, our technique provides a fast and efficient method to cluster large datasets of genomic sequences, characterize their temporal and spatial evolution in a continuous manner, and gain insight into locality-sensitive information about the sequences without any need for alignment. Furthermore, we show that the technique can be used to detect local regions with mutation activity, which can then be used to aid alignment techniques for the fast discovery of mutations. To demonstrate the efficacy of our technique on real genomic data, we cluster different strains of SARS-CoV-2 viral sequences, characterize their evolution and identify regions of the viral sequence with mutations.
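As a rough illustration of the idea in the abstract, the sketch below learns a small probabilistic generator from each sequence and compares the generators instead of aligning the sequences. It is not the authors' implementation: the fixed-order Markov model, the Jensen-Shannon divergence, and the names `learn_generator`, `js_divergence`, and `generator_distance` are all assumptions chosen for illustration. The point it tries to show is that the learned model has at most 4^k contexts, so its size does not grow with sequence length.

```python
from collections import Counter, defaultdict
from math import log2

ALPHABET = "ACGT"

def learn_generator(seq, k=2):
    """Estimate a k-th order Markov generator: P(next base | previous k bases).

    Assumption: a fixed-order Markov chain stands in for the paper's state machine.
    """
    counts = defaultdict(Counter)
    for i in range(len(seq) - k):
        context, nxt = seq[i:i + k], seq[i + k]
        if set(context + nxt) <= set(ALPHABET):  # skip ambiguous bases such as N
            counts[context][nxt] += 1
    generator = {}
    for context, ctr in counts.items():
        total = sum(ctr.values())
        generator[context] = {b: ctr[b] / total for b in ALPHABET}
    return generator

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions over ALPHABET."""
    def kl(a, m):
        return sum(a[b] * log2(a[b] / m[b]) for b in ALPHABET if a[b] > 0)
    m = {b: 0.5 * (p[b] + q[b]) for b in ALPHABET}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def generator_distance(g1, g2):
    """Average JS divergence over contexts observed in both generators.

    Assumption: an illustrative distance, not the statistical distance used in the paper.
    """
    shared = set(g1) & set(g2)
    if not shared:
        return 1.0  # no common contexts: treat the generators as maximally different
    return sum(js_divergence(g1[c], g2[c]) for c in shared) / len(shared)
```

Calling `generator_distance(learn_generator(a), learn_generator(b))` on two sequences `a` and `b` yields a value between 0 and 1. A matrix of such pairwise values could then be passed to a standard clustering routine, and applying the same comparison to sliding windows of a sequence would be one hypothetical way to look for local regions where the generators diverge.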

Item Type: Article
ORCID:
Jain, Siddharth: 0000-0002-9164-6119
Bogdan, Paul: 0000-0003-2118-0816
Bruck, Jehoshua: 0000-0001-8474-0812
Alternate Title: Predicting the Emergence of SARS-CoV-2 Clades
Additional Information: © The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

Received 11 May 2021; Accepted 13 October 2021; Published 26 October 2021.

S.J. was supported by the Center for Evolutionary Science at Caltech. P.B. and X.X. gratefully acknowledge the support by the National Science Foundation Career award under Grant number CPS/CNS-1453860, the NSF awards under Grant numbers CCF-1837131, MCB-1936775, CNS-1932620, CMMI-1936624, the U.S. Army Research Office (ARO) under Grant No. W911NF-17-1-0076, the Okawa Foundation research award, the Defense Advanced Research Projects Agency (DARPA) Young Faculty Award and DARPA Director Award under Grant No. N66001-17-1-4044, a 2021 USC Stevens Center Technology Advancement Grant (TAG) award, an Intel faculty award and a Northrop Grumman grant. The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency, the Department of Defense or the National Science Foundation.

Data availability: The SARS-CoV-2 datasets analyzed in the current study are provided in GISAID ( The implementation of the method, acknowledgement files for the SARS-CoV-2 samples used and the necessary details to generate figures in the paper are provided at The acknowledgement files for the SARS-CoV-2 samples are also provided in the supplementary material.

These authors contributed equally: Siddharth Jain and Xiongye Xiao.

Author Contributions: S.J. and X.X. wrote the manuscript and code for the implementation of the proposed method and data analysis. All authors discussed the results and commented on the manuscript. P.B. and J.B. originated and directed the study.

The authors declare no competing interests.
Funding Agency: Grant Number
Army Research Office (ARO): W911NF-17-1-0076
Okawa Foundation: UNSPECIFIED
Defense Advanced Research Projects Agency (DARPA): N66001-17-1-4044
University of Southern California: UNSPECIFIED
Northrop Grumman Corporation: UNSPECIFIED
Subject Keywords: Computational biology and bioinformatics; Mathematics and computing
PubMed Central ID: PMC8548350
Record Number: CaltechAUTHORS:20200728-093329251
Official Citation: Jain, S., Xiao, X., Bogdan, P. et al. Generator based approach to analyze mutations in genomic datasets. Sci Rep 11, 21084 (2021).
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 104601
Deposited By: Tony Diaz
Deposited On: 28 Jul 2020 17:25
Last Modified: 29 Oct 2021 16:00
