Predicting the Emergence of SARS-CoV-2 Clades

Jain, Siddharth and Xiao, Xiongye and Bogdan, Paul and Bruck, Jehoshua (2020) Predicting the Emergence of SARS-CoV-2 Clades. (Unpublished)

PDF - Submitted Version
Creative Commons Attribution Non-commercial No Derivatives.

PDF - Supplemental Material
Creative Commons Attribution Non-commercial No Derivatives.



Evolution is a process of change in which mutations in the viral RNA are selected based on their fitness for replication and survival. Current phylogenetic analysis of SARS-CoV-2 identifies new viral clades after they exhibit evolutionary selection, which raises the question of whether we can identify the viral selection and predict the emergence of new viral clades. Inspired by the concept of Kolmogorov complexity, we propose a generative complexity (algorithmic) framework capable of analyzing viral RNA sequences by mapping the multiscale nucleotide dependencies onto a state machine, where states represent subsequences of nucleotides and state-transition probabilities encode the higher-order interactions between these states. We apply computational learning and classification techniques to identify the active state transitions and use them as features in clade classifiers to decipher the transient mutations (still evolving within a clade) and stable mutations (typical of a clade). As opposed to current analysis tools, which rely on the edit distance between sequences and require sequence alignment, our method is computationally local, does not require sequence alignment, and is robust to random errors (substitutions, insertions and deletions). Relying on the GISAID viral sequence database, we demonstrate that our method can predict clade emergence, potentially aiding the design of medications and vaccines.
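
The state-machine idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that states are overlapping k-mers, and that transition probabilities are estimated by simple counting along one sequence. The function names `transition_probabilities` and `feature_vector` and the parameter `k` are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def transition_probabilities(seq, k=2):
    """Estimate P(next k-mer | current k-mer) by counting along seq.

    Assumption for illustration: states are overlapping k-mers and the
    machine moves one nucleotide at a time. This is a plain Markov-style
    estimate, not the generative complexity framework of the paper.
    """
    counts = defaultdict(Counter)
    for i in range(len(seq) - k):
        state = seq[i:i + k]          # current k-mer
        nxt = seq[i + 1:i + k + 1]    # k-mer after shifting by one nucleotide
        counts[state][nxt] += 1

    probs = {}
    for state, successors in counts.items():
        total = sum(successors.values())
        probs[state] = {nxt: n / total for nxt, n in successors.items()}
    return probs


def feature_vector(probs, transitions):
    """Turn selected (state, next_state) pairs into a numeric feature list.

    Transitions never observed in the sequence get probability 0.0.
    """
    return [probs.get(s, {}).get(t, 0.0) for s, t in transitions]


if __name__ == "__main__":
    p = transition_probabilities("ACGACGACT", k=2)
    print(p)
    print(feature_vector(p, [("AC", "CG"), ("AC", "CT"), ("TT", "TA")]))
```

Because each probability depends only on a short window of neighboring nucleotides, the features can be computed without aligning sequences, which is the property the abstract describes as computationally local. Feature vectors of this kind could then be passed to any standard classifier.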

Item Type:Report or Paper (Discussion Paper)
Related URLs:
URL  URL Type  Description  Paper  Item  Data/Code
ORCID:Bruck, Jehoshua (0000-0001-8474-0812)
Additional Information:The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. Posted July 27, 2020. Data availability: The samples are obtained from GISAID and the acknowledgement files with the GISAID sample identifiers are provided on Code availability: The code for reproducing the results is provided on Author contributions statement: S.J. wrote the manuscript and code for the implementation of the mutation detection, clade classification and clade emergence techniques. X.X. helped with the code development. All authors discussed the results and commented on the manuscript. P.B. and J.B. originated and directed the study. The authors declare no competing interests.
Record Number:CaltechAUTHORS:20200728-093329251
Persistent URL:
Official Citation:Predicting the Emergence of SARS-CoV-2 Clades. Siddharth Jain, Xiongye Xiao, Paul Bogdan, Jehoshua Bruck. bioRxiv 2020.07.26.222117; doi:
Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:104601
Deposited By: Tony Diaz
Deposited On:28 Jul 2020 17:25
Last Modified:28 Jul 2020 17:25
