CaltechAUTHORS
  A Caltech Library Service

Capacity and Expressiveness of Genomic Tandem Duplication

Jain, Siddharth and Farnoud, Farzad and Bruck, Jehoshua (2015) Capacity and Expressiveness of Genomic Tandem Duplication. California Institute of Technology , Pasadena, CA. (Unpublished) http://resolver.caltech.edu/CaltechAUTHORS:20150209-155348874

[img] PDF - Submitted Version
See Usage Policy.

324Kb

Use this Persistent URL to link to this item: http://resolver.caltech.edu/CaltechAUTHORS:20150209-155348874

Abstract

The majority of the human genome consists of repeated sequences. An important type of repeats common in the human genome are tandem repeats, where identical copies appear next to each other. For example, in the sequence AGTCTGTGC, TGTG is a tandem repeat, namely, it was generated from AGTCTGC by tandem duplication of length 2. In this work, we investigate the possibility of generating a large number of sequences from a small initial string (called the seed) by tandem duplication of length bounded by a constant. Our results include exact capacity values for certain tandem duplication string systems with alphabet sizes 2; 3; and 4. In addition, motivated by the role of DNA sequences in expressing proteins via RNA and the genetic code, we define the notion of the expressiveness of a tandem duplication system, as the feasibility of expressing arbitrary substrings. We then completely characterize the expressiveness of tandem duplication systems for general alphabet sizes and duplication lengths. Noticing that a system with capacity = 1 is expressive, we prove that for an alphabet size ≥ 4, the capacity is strictly smaller than 1, independent of the seed and the duplication lengths. The proof of this limit on the capacity (note that the genomic alphabet size is 4), is related to an interesting result by Axel Thue from 1906 which states that there exist arbitrary length sequences with no tandem repeats (square-free) for alphabet size ≥ 3. Finally, our results illustrate that duplication lengths play a more significant role than the seed in generating a large number of sequences for these systems.


Item Type:Report or Paper (Technical Report)
Related URLs:
URLURL TypeDescription
http://www.paradise.caltech.edu/papers/etr127.pdfAuthorReport
http://arxiv.org/abs/1509.06029arXivDiscussion Paper
Group:Parallel and Distributed Systems Group
Other Numbering System:
Other Numbering System NameOther Numbering System ID
ParadiseETR127
Record Number:CaltechAUTHORS:20150209-155348874
Persistent URL:http://resolver.caltech.edu/CaltechAUTHORS:20150209-155348874
Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:54593
Collection:CaltechPARADISE
Deposited By: Kristin Buxton
Deposited On:10 Feb 2015 05:14
Last Modified:19 Jan 2016 23:39

Repository Staff Only: item control page