
Estimation of duplication history under a stochastic model for tandem repeats

Farnoud, Farzad and Schwartz, Moshe and Bruck, Jehoshua (2019) Estimation of duplication history under a stochastic model for tandem repeats. BMC Bioinformatics, 20. Art. No. 64. ISSN 1471-2105.

Background: Tandem repeat sequences are common in the genomes of many organisms and are known to cause important phenomena such as gene silencing and rapid morphological changes. Due to the presence of multiple copies of the same pattern in tandem repeats and their high variability, they contain a wealth of information about the mutations that have led to their formation. The ability to extract this information can enhance our understanding of evolutionary mechanisms.

Results: We present a stochastic model for the formation of tandem repeats via tandem duplication and substitution mutations. Based on the analysis of this model, we develop a method for estimating the relative mutation rates of duplications and substitutions, as well as the total number of mutations, in the history of a tandem repeat sequence. We validate our estimation method via Monte Carlo simulation and show that it outperforms the state-of-the-art algorithm for discovering the duplication history. We also apply our method to tandem repeat sequences in the human genome, where it demonstrates the different behaviors of micro- and mini-satellites and can be used to compare mutation rates across chromosomes. It is observed that chromosomes that exhibit the highest mutation activity in tandem repeat regions are the same as those thought to have the highest overall mutation rates. However, unlike previous works that rely on comparing human and chimpanzee genomes to measure mutation rates, the proposed method allows us to find chromosomes with the highest mutation activity based on a single genome, in essence by comparing (approximate) copies of the pattern in tandem repeats.

Conclusion: The prevalence of tandem repeats in most organisms and the efficiency of the proposed method enable studying various aspects of the formation of tandem repeats and the surrounding sequences in a wide range of settings.
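As a rough illustration of the kind of process the abstract describes (a sequence evolving through tandem duplications and substitutions, explored by Monte Carlo simulation), the following is a minimal toy sketch. It is not the authors' model or estimator: the function name `simulate_tandem_repeat`, the parameters `p_dup` and `max_dup_len`, and the uniform choice of positions and lengths are all assumptions made only for this example.

```python
import random


def simulate_tandem_repeat(seed_seq, n_mutations, p_dup,
                           max_dup_len=3, alphabet="ACGT", rng=None):
    """Toy simulation (hypothetical, not the paper's model).

    Applies n_mutations random mutations to seed_seq. Each mutation is a
    tandem duplication with probability p_dup, otherwise a substitution.
    Returns the final sequence and the number of duplications applied.
    """
    rng = rng or random.Random()
    seq = list(seed_seq)
    n_dup = 0
    for _ in range(n_mutations):
        if rng.random() < p_dup:
            # Tandem duplication: copy a substring and insert the copy
            # immediately after the original occurrence.
            length = rng.randint(1, min(max_dup_len, len(seq)))
            start = rng.randrange(len(seq) - length + 1)
            segment = seq[start:start + length]
            seq[start + length:start + length] = segment
            n_dup += 1
        else:
            # Substitution: replace one symbol with a different symbol.
            pos = rng.randrange(len(seq))
            choices = [c for c in alphabet if c != seq[pos]]
            seq[pos] = rng.choice(choices)
    return "".join(seq), n_dup


if __name__ == "__main__":
    # Example run with a fixed seed for reproducibility.
    final_seq, dups = simulate_tandem_repeat(
        "ACG", n_mutations=20, p_dup=0.5, rng=random.Random(0))
    print(final_seq, dups)
```

Repeating such runs for different values of `p_dup` gives synthetic sequences with a known history, which is the general idea behind checking an estimator by simulation; the paper's actual model and validation setup should be taken from the published article.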

Item Type: Article
ORCID:
Farnoud, Farzad: 0000-0002-8684-4487
Schwartz, Moshe: 0000-0002-1449-0026
Bruck, Jehoshua: 0000-0001-8474-0812
Additional Information: © 2019 The Author(s). This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated. Received: 27 July 2018; Accepted: 3 January 2019; Published: 6 February 2019. The authors would like to thank Han Mao Kiah for helpful discussions related to Lemma 1. Furthermore, the authors would like to thank anonymous reviewers for their insightful comments and valuable suggestions, which helped us improve the paper. This research was supported by National Science Foundation grants CCF-1317694 and CCF-1755773, and by a United States – Israel Binational Science Foundation (BSF) grant no. 2017652. Availability of data and materials: The datasets analyzed in the current study are available from the Tandem Repeat Database [28] under organism: Homo sapiens HG38. Authors’ contributions: All authors contributed to the theoretical analysis and the development of the estimation method. FF performed the data analysis and FF and MS prepared the manuscript. All authors read and approved the final manuscript. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. The authors declare that they have no competing interests.
Group: Parallel and Distributed Systems Group
Funding Agency: Binational Science Foundation (USA-Israel)
Grant Number: 2017652
Record Number: CaltechAUTHORS:20190211-084750757
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 92820
Deposited By: Tony Diaz
Deposited On: 12 Feb 2019 22:25
Last Modified: 22 Nov 2019 09:58
