A Caltech Library Service

Short Tandem Repeats Information in TCGA is Statistically Biased by Amplification

Jain, Siddharth and Mazaheri, Bijan and Raviv, Netanel and Bruck, Jehoshua (2019) Short Tandem Repeats Information in TCGA is Statistically Biased by Amplification. (Unpublished)

PDF - Submitted Version
Creative Commons Attribution Non-commercial No Derivatives.


The current paradigm in data science rests on the belief that, given sufficient data, classifiers are likely to uncover the distinction between true and false hypotheses. In particular, the abundance of genomic data creates opportunities to discover disease risk associations and to aid screening and treatment. However, working with large amounts of data is statistically beneficial only if the data is statistically unbiased. Here we demonstrate that the amplification methods applied to DNA samples in TCGA have a substantial effect on short tandem repeat (STR) information. In particular, we design a classifier that uses STR information to distinguish between samples with analyte code D and samples with analyte code W. This artificial bias might be detrimental to data-driven approaches and might undermine conclusions drawn from past and future genome-wide studies.
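
As a rough illustration of the two ingredients the abstract mentions, the sketch below reads the analyte code from a TCGA aliquot barcode and measures the longest run of an STR motif in a sequence. It is not the authors' classifier or feature set. The barcode layout follows the standard TCGA convention, in which the fifth field is a two-digit portion number followed by the analyte letter (D for DNA, W for whole-genome-amplified DNA); the function names, the example barcodes, and the example sequence are hypothetical.

```python
import re


def analyte_code(barcode: str) -> str:
    """Return the analyte letter from a TCGA aliquot barcode.

    Assumes the standard layout
    TCGA-TSS-Participant-SampleVial-PortionAnalyte-Plate-Center,
    where the fifth field is two portion digits plus one analyte letter.
    """
    fields = barcode.split("-")
    if len(fields) < 5 or len(fields[4]) != 3:
        raise ValueError(f"not an aliquot-level barcode: {barcode!r}")
    return fields[4][2]


def longest_str_run(sequence: str, motif: str) -> int:
    """Length, in repeat units, of the longest uninterrupted run of `motif`."""
    if not motif:
        raise ValueError("motif must be non-empty")
    # Each match is a maximal block of back-to-back copies of the motif.
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


# Hypothetical inputs, for illustration only.
label = analyte_code("TCGA-02-0001-01C-01D-0182-01")
feature = longest_str_run("TTCACACACAGGCACA", "CA")
```

In a setting like the one described, values such as `feature` computed at many STR loci could serve as the inputs to a classifier and `label` as its target; how the authors actually derive and combine STR features is specified in the paper, not here.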

Item Type:Report or Paper (Discussion Paper)
ORCID:
Jain, Siddharth: 0000-0002-9164-6119
Mazaheri, Bijan: 0000-0001-9690-8686
Raviv, Netanel: 0000-0002-1686-1994
Bruck, Jehoshua: 0000-0001-8474-0812
Additional Information:The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. bioRxiv preprint first posted online Jan. 11, 2019. This work was supported in part by The Caltech Mead New Adventure Fund and a Caltech CI2 Fund. The authors would like to thank Eytan Ruppin for his valuable advice and feedback. Ethics approval for use of the TCGA data was granted by the Caltech Institutional Review Board.
Funding Agency: Grant Number
Caltech Mead New Adventure Fund: UNSPECIFIED
Caltech Innovation Initiative (CI2): UNSPECIFIED
Record Number:CaltechAUTHORS:20190114-091231818
Official Citation:Short Tandem Repeats Information in TCGA is Statistically Biased by Amplification. Siddharth Jain, Bijan Mazaheri, Netanel Raviv, Jehoshua Bruck. bioRxiv 518878; doi:
Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:92245
Deposited By: Tony Diaz
Deposited On:14 Jan 2019 20:22
Last Modified:18 Aug 2021 01:05
