Short Tandem Repeats Information in TCGA is Statistically Biased by Amplification
Abstract
The current paradigm in data science rests on the belief that, given sufficient data, classifiers are likely to uncover the distinction between true and false hypotheses. In particular, the abundance of genomic data creates opportunities for discovering disease risk associations and for improving screening and treatment. However, working with large amounts of data is statistically beneficial only if the data is unbiased. Here we demonstrate that the methods used to amplify DNA samples in TCGA have a substantial effect on short tandem repeat (STR) information. In particular, we design a classifier that uses STR information to distinguish between samples with analyte code D and samples with analyte code W. This artificial bias might be detrimental to data-driven approaches and might undermine conclusions drawn from past and future genome-wide studies.
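The kind of check described in the abstract can be illustrated with a minimal sketch: if a classifier trained only on STR-derived features predicts the analyte code well above chance, the features carry a processing-related signal. The sketch below is not the authors' classifier or feature set. The data is synthetic, the per-locus repeat-count features and the size of the shift are made up for illustration, and the helper names `analyte_code`, `centroid`, `predict`, and `holdout_accuracy` are hypothetical. It assumes the standard TCGA aliquot barcode layout, in which the fifth dash-separated field is a two-digit portion number followed by the analyte letter.

```python
import random


def analyte_code(barcode):
    """Return the analyte letter of a TCGA aliquot barcode.

    Assumes the fifth dash-separated field is portion + analyte, e.g. '01D'.
    """
    fields = barcode.split("-")
    if len(fields) < 5 or len(fields[4]) != 3:
        raise ValueError(f"not an aliquot-level barcode: {barcode!r}")
    return fields[4][2]


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def predict(x, centroids):
    """Nearest-centroid rule: pick the label whose centroid is closest to x."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(sorted(centroids), key=lambda lab: sq_dist(x, centroids[lab]))


def holdout_accuracy(features, labels, rng, test_fraction=0.3):
    """Fit centroids on a random training split and score the held-out rest."""
    order = list(range(len(features)))
    rng.shuffle(order)
    cut = int(len(order) * test_fraction)
    test, train = order[:cut], order[cut:]
    centroids = {
        lab: centroid([features[i] for i in train if labels[i] == lab])
        for lab in sorted({labels[i] for i in train})
    }
    hits = sum(predict(features[i], centroids) == labels[i] for i in test)
    return hits / len(test)


# Synthetic stand-in for STR features: one made-up repeat count per locus.
# Samples labelled "W" get a small systematic shift (an assumption, chosen
# only so that the sketch has a bias to detect).
rng = random.Random(0)
N_PER_GROUP, N_LOCI, SHIFT = 100, 20, 1.0
labels, features = [], []
for lab, offset in (("D", 0.0), ("W", SHIFT)):
    for _ in range(N_PER_GROUP):
        labels.append(lab)
        features.append([rng.gauss(10.0 + offset, 1.0) for _ in range(N_LOCI)])

real_acc = holdout_accuracy(features, labels, rng)

# Baseline: the same procedure with shuffled labels should sit near chance.
shuffled = labels[:]
rng.shuffle(shuffled)
null_acc = holdout_accuracy(features, shuffled, rng)

print(f"accuracy with true labels:     {real_acc:.2f}")
print(f"accuracy with shuffled labels: {null_acc:.2f}")
```

Comparing against the shuffled-label baseline is a design choice of this sketch: held-out accuracy that clearly exceeds the shuffled baseline suggests that the features separate the two analyte groups for reasons tied to the label rather than to chance.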
Additional Information
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. bioRxiv preprint first posted online Jan. 11, 2019. This work was supported in part by The Caltech Mead New Adventure Fund and a Caltech CI2 Fund. The authors would like to thank Eytan Ruppin for his valuable advice and feedback. Ethics approval for the use of TCGA data was granted by the Caltech Institutional Review Board.
Attached Files
Submitted - 518878.full.pdf
Files
Name | Size |
---|---|
518878.full.pdf (md5:cc5f756c463bea2fe8559e47043008ca) | 285.3 kB |
Additional details
- Eprint ID: 92245
- Resolver ID: CaltechAUTHORS:20190114-091231818
- Funders: Caltech Mead New Adventure Fund; Caltech Innovation Initiative (CI2)
- Created: 2019-01-14 (from EPrint's datestamp field)
- Updated: 2021-11-16 (from EPrint's last_modified field)