
Overestimated Polygenic Prediction due to Overlapping Subjects in Genetic Datasets

Park, David Keetae and Chen, Mingshen and Kim, Seungsoo and Joo, Yoonjung Yoonie and Loving, Rebekah K. and Kim, Hyoung Seop and Cha, Jiook and Yoo, Shinjae and Kim, Jong Hun (2022) Overestimated Polygenic Prediction due to Overlapping Subjects in Genetic Datasets. (Unpublished)

Recently, the polygenic risk score (PRS) has gained significant attention in studies of complex genetic diseases and traits. PRS is often derived from summary statistics, for which the independence between discovery and replication sets cannot be verified. Prior studies in which independence is strictly observed report a relatively low gain from PRS in predictive models of binary traits. We hypothesize that the independence assumption may be compromised when summary statistics are used, and we suspect an overestimation bias in predictive accuracy. To demonstrate the overestimation bias in the replication dataset, we compare the prediction performance of PRS models with overlapping subjects present and with them removed. We consider the task of Alzheimer's disease (AD) prediction across genetic datasets, including the International Genomics of Alzheimer's Project (IGAP), the AD Sequencing Project (ADSP), and the Accelerating Medicines Partnership - Alzheimer's Disease (AMP-AD). PRS is computed either from sequencing studies for ADSP and AMP-AD (denoted rPRS) or from the summary statistics of IGAP (sPRS). Two highly heritable variables in UK Biobank, hypertension and height, are used to derive an exemplary scale effect of PRS. Based on this scale effect, the expected performance of sPRS is computed for AD prediction. Using ADSP as the discovery set for rPRS on AMP-AD, ΔAUC and ΔR2 (the performance gains in AUC and R2 from PRS) are 0.069 and 0.11, respectively. They drop to 0.0017 and 0.0041 once overlapping subjects are removed from AMP-AD. sPRS derived from IGAP yields ΔAUC and ΔR2 of 0.051±0.013 and 0.063±0.015 for ADSP and 0.060 and 0.086 for AMP-AD, respectively. On UK Biobank, rPRS performance for hypertension, assuming discovery and replication sets of similar size, is 0.0036±0.0027 (ΔAUC) and 0.0032±0.0028 (ΔR2). For height, ΔR2 is 0.029±0.0037.
Considering the high heritability of hypertension and height in UK Biobank, we conclude that the sPRS results from the AD databases are inflated. Higher performance relative to the size of the discovery set has also been observed in PRS studies of several diseases. PRS performance for binary traits, such as AD and hypertension, turned out to be unexpectedly low. This may, along with differences in linkage disequilibrium, explain the high variability of PRS performance in cross-nation or cross-ethnicity applications, i.e., when there are no overlapping subjects. Hence, for sPRS, potential duplication of subjects should be carefully considered within the same ethnic group.
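
The comparison described in the abstract (the gain in AUC from adding PRS, measured with and without subjects shared between the discovery and replication sets) can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which the abstract does not specify. The function names, the record layout, and the toy data are all hypothetical, and the model scores are assumed to be precomputed rather than fitted here.

```python
from typing import Dict, Sequence, Set, Tuple

# Hypothetical record layout for one replication subject:
# (case label 0/1, baseline model score, baseline + PRS model score).
Record = Tuple[int, float, float]


def polygenic_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of allele dosages (0, 1, or 2) by per-variant effect sizes."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random case outscores a random control (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def delta_auc(records: Sequence[Record]) -> float:
    """AUC of the model with PRS minus AUC of the baseline model."""
    labels = [r[0] for r in records]
    baseline = [r[1] for r in records]
    with_prs = [r[2] for r in records]
    return auc(with_prs, labels) - auc(baseline, labels)


def drop_overlap(replication: Dict[str, Record],
                 discovery_ids: Set[str]) -> Dict[str, Record]:
    """Keep only replication subjects whose ID is absent from the discovery set."""
    return {sid: rec for sid, rec in replication.items()
            if sid not in discovery_ids}


if __name__ == "__main__":
    # Toy data only; values are invented for illustration.
    replication = {
        "s1": (1, 0.5, 0.9), "s2": (0, 0.5, 0.1),   # also in discovery set
        "s3": (1, 0.6, 0.6), "s4": (0, 0.4, 0.5),
        "s5": (1, 0.3, 0.4), "s6": (0, 0.5, 0.45),
    }
    discovery_ids = {"s1", "s2"}
    print("delta AUC, overlap present:", delta_auc(list(replication.values())))
    cleaned = drop_overlap(replication, discovery_ids)
    print("delta AUC, overlap removed:", delta_auc(list(cleaned.values())))
```

Matching subjects by a shared ID is itself an assumption. With summary statistics such as those from IGAP, individual-level identifiers are typically unavailable, which is the situation the abstract flags as making independence impossible to monitor.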

Item Type:Report or Paper (Discussion Paper)
Additional Information:The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. This version posted January 22, 2022. Author Contributions: D.K.P., M.C., S.Y., and J.H.K. conceived and designed the study. D.K.P. and J.H.K. performed statistical analysis. D.K.P., M.C., and J.H.K. analyzed the genetic data. All authors discussed the results and implications and commented on the manuscript at all stages. M.C., S.K., Y.Y.J., R.K.L., H.S.K., J.C., and S.Y. gave technical support and conceptual advice. D.K.P., S.Y., and J.H.K. wrote the paper. All co-authors contributed to the final manuscript. The authors declare that they have no competing interests. Data Availability: The dataset(s) supporting the conclusions of this article are available on the webpages of UK Biobank, ADSP (accession phs000572.v1.p1), the Mayo RNAseq study (accession syn5550404), the Mount Sinai Brain Bank (MSBB) study (accession syn3159438), and the Religious Orders Study and Memory and Aging Project (ROSMAP) Study (accession syn3159438).
Record Number:CaltechAUTHORS:20220124-461300100
Persistent URL:
Official Citation:Overestimated Polygenic Prediction due to Overlapping Subjects in Genetic Datasets. David Keetae Park, Mingshen Chen, Seungsoo Kim, Yoonjung Yoonie Joo, Rebekah Loving, Hyoung-Seop Kim, Jiook Cha, Shinjae Yoo, Jong Hun Kim. bioRxiv 2022.01.19.476997; doi:
Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:113070
Deposited By: Tony Diaz
Deposited On:24 Jan 2022 17:46
Last Modified:24 Jan 2022 17:46
