Published January 14, 2019 | Submitted
Report Open

Short Tandem Repeats Information in TCGA is Statistically Biased by Amplification

Abstract

The current paradigm in data science rests on the belief that, given sufficient data, classifiers are likely to uncover the distinction between true and false hypotheses. In particular, the abundance of genomic data creates opportunities to discover disease risk associations and to aid screening and treatment. However, working with large amounts of data is statistically beneficial only if the data are statistically unbiased. Here we demonstrate that the methods used to amplify DNA samples in TCGA have a substantial effect on short tandem repeat (STR) information. In particular, we design a classifier that uses STR information to distinguish between samples with analyte code D and samples with analyte code W. This artificial bias might be detrimental to data-driven approaches and might undermine conclusions drawn from past and future genome-wide studies.
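
As a minimal sketch of how the two classes named in the abstract could be labeled, the snippet below reads the analyte code from a TCGA aliquot barcode. It assumes the standard barcode layout, in which the fifth hyphen-separated field is a two-digit portion number followed by a one-letter analyte code. The function names and the 0/1 label mapping are illustrative assumptions, not the authors' pipeline.

```python
from typing import Optional


def analyte_code(barcode: str) -> Optional[str]:
    """Return the one-letter analyte code of a TCGA aliquot barcode, or None.

    Assumes the standard layout, e.g. TCGA-02-0001-01C-01D-0182-01, where the
    fifth field ("01D") is a two-digit portion number plus the analyte letter.
    """
    parts = barcode.strip().upper().split("-")
    if len(parts) < 5 or parts[0] != "TCGA":
        return None  # not an aliquot-level TCGA barcode
    field = parts[4]
    if len(field) != 3 or not field[:2].isdigit() or not field[2].isalpha():
        return None  # fifth field does not look like portion + analyte
    return field[2]


def amplification_label(barcode: str) -> Optional[int]:
    """Hypothetical labeling: 0 for analyte D, 1 for analyte W, None otherwise."""
    code = analyte_code(barcode)
    if code == "D":
        return 0
    if code == "W":
        return 1
    return None  # other analytes are outside the D-versus-W comparison


if __name__ == "__main__":
    for bc in ("TCGA-02-0001-01C-01D-0182-01", "TCGA-02-0001-10A-01W-0188-10"):
        print(bc, analyte_code(bc), amplification_label(bc))
```

Labels produced this way would serve only as the target variable; the STR-derived features and the classifier itself are described in the full text, not here.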

Additional Information

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. bioRxiv preprint first posted online Jan. 11, 2019. This work was supported in part by the Caltech Mead New Adventure Fund and a Caltech CI2 Fund. The authors would like to thank Eytan Ruppin for his valuable advice and feedback. Ethics approval for use of the TCGA data was granted by the Caltech Institutional Review Board.

Attached Files

Submitted - 518878.full.pdf

Additional details

Created: August 19, 2023
Modified: October 20, 2023