CaltechAUTHORS
  A Caltech Library Service

Coverage statistics for sequence census methods

Evans, Steven N. and Hower, Valerie and Pachter, Lior (2010) Coverage statistics for sequence census methods. BMC Bioinformatics, 11. Art. No. 430. ISSN 1471-2105. PMCID PMC2940910. https://resolver.caltech.edu/CaltechAUTHORS:20170306-122114222

PDF - Published Version. Creative Commons Attribution. 1600Kb
PDF - Submitted Version. See Usage Policy. 685Kb
PDF (Authors’ original file for figure 1) - Supplemental Material. Creative Commons Attribution. 157Kb
Image (PNG) (Authors’ original file for figure 2) - Supplemental Material. Creative Commons Attribution. 30Kb
Image (PNG) (Authors’ original file for figure 3) - Supplemental Material. Creative Commons Attribution. 31Kb
PDF (Authors’ original file for figure 4) - Supplemental Material. Creative Commons Attribution. 158Kb
PDF (Authors’ original file for figure 5) - Supplemental Material. Creative Commons Attribution. 104Kb
PDF (Authors’ original file for figure 6) - Supplemental Material. Creative Commons Attribution. 384Kb

Abstract

Background: We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments. We also introduce a coding of the shape of the coverage depth function as a tree and explain how this can be used to detect regions with anomalous coverage. This modeling perspective is especially germane to current high-throughput sequencing experiments, where both sample preparation protocols and sequencing technology particulars can affect fragment length distributions.

Results: Under the mild assumptions that fragment start sites are Poisson distributed and successive fragment lengths are independent and identically distributed, we observe that, regardless of fragment length distribution, the fragments produced in a sequencing experiment can be viewed as resulting from a two-dimensional spatial Poisson process. We then study the successive jumps of the coverage function, and show that they can be encoded as a random tree that is approximately a Galton-Watson tree with generation-dependent geometric offspring distributions whose parameters can be computed.

Conclusions: We extend standard analyses of shotgun sequencing that focus on coverage statistics at individual sites, and provide a null model for detecting deviations from random coverage in high-throughput sequence census based experiments. Our approach leads to explicit determinations of the null distributions of certain test statistics, while for others it greatly simplifies the approximation of their null distributions by simulation. Our focus on fragments also leads to a new approach to visualizing sequencing data that is of independent interest.
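
The null model described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it draws fragment start sites from a homogeneous Poisson process, draws independent and identically distributed fragment lengths, and records the successive +1/-1 jumps of the coverage depth function. The function names, the parameters `genome_length`, `rate`, and `mean_length`, and the choice of a discretized exponential length distribution are all assumptions made for illustration; the abstract states that the result holds regardless of the fragment length distribution.

```python
import random


def simulate_fragments(genome_length, rate, mean_length, seed=0):
    """Draw (start, length) pairs under the assumed null model.

    Start sites follow a homogeneous Poisson process with intensity
    `rate` per base; lengths are i.i.d. The length distribution used
    here (1 + floor of an exponential) is an illustrative assumption.
    """
    rng = random.Random(seed)
    fragments = []
    # Gaps between consecutive Poisson points are exponential.
    position = rng.expovariate(rate)
    while position < genome_length:
        length = 1 + int(rng.expovariate(1.0 / mean_length))
        fragments.append((position, length))
        position += rng.expovariate(rate)
    return fragments


def coverage_jumps(fragments):
    """Return the sorted jump events of the coverage function.

    Each fragment contributes a +1 jump at its start and a -1 jump
    at its end. At equal positions, -1 events sort before +1 events.
    """
    events = []
    for start, length in fragments:
        events.append((start, +1))
        events.append((start + length, -1))
    events.sort()
    return events


def coverage_path(events):
    """Accumulate jumps into (position, depth) pairs."""
    depth = 0
    path = []
    for position, jump in events:
        depth += jump
        path.append((position, depth))
    return path


if __name__ == "__main__":
    frags = simulate_fragments(genome_length=10_000, rate=0.05, mean_length=100)
    path = coverage_path(coverage_jumps(frags))
    print(len(frags), "fragments; maximum depth", max(d for _, d in path))
```

Each fragment corresponds to one point (start, length) in the plane, which is the two-dimensional view mentioned in the Results. The sequence of jumps returned by `coverage_jumps` forms a lattice path that starts and ends at depth zero; it is the shape of this path that the paper encodes as a tree.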


Item Type: Article
Related URLs:
URL | URL Type | Description
http://dx.doi.org/10.1186/1471-2105-11-430 | DOI | Article
http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-430 | Publisher | Article
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2940910/ | PubMed Central | Article
https://arxiv.org/abs/1004.5587 | arXiv | Discussion Paper
ORCID:
Author | ORCID
Pachter, Lior | 0000-0002-9164-6231
Additional Information: © 2010 Evans et al; licensee BioMed Central Ltd. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Received: 23 April 2010. Accepted: 18 August 2010. Published: 18 August 2010. SNE is supported in part by NSF grant DMS-0907630 and VH is funded by NSF fellowship DMS-0902723. We thank Adam Roberts for his help in making Figure 6. Authors' contributions: LP proposed the problem of understanding the random behaviour of coverage functions in the context of sequence census methods. VH investigated the coverage function and lattice path excursions based on ideas from topological data analysis. SE developed the probability theory and identified the relevance of Theorem 1. SNE, VH and LP worked together on all aspects of the paper and wrote the manuscript. All authors read and approved the final manuscript.
Funders:
Funding Agency | Grant Number
NSF | DMS-0907630
NSF Graduate Research Fellowship | DMS-0902723
PubMed Central ID: PMC2940910
Record Number: CaltechAUTHORS:20170306-122114222
Persistent URL: https://resolver.caltech.edu/CaltechAUTHORS:20170306-122114222
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 74789
Collection: CaltechAUTHORS
Deposited By: George Porter
Deposited On: 06 Mar 2017 21:07
Last Modified: 24 Feb 2020 10:30