
Expression reflects population structure

Brown, Brielin C. and Bray, Nicolas L. and Pachter, Lior (2018) Expression reflects population structure. PLoS Genetics, 14 (12). Art. No. e1007841. ISSN 1553-7390. PMCID PMC6317812. http://resolver.caltech.edu/CaltechAUTHORS:20181008-162020262

PDF - Published Version. Creative Commons Attribution. 1561Kb
PDF - Submitted Version. Creative Commons Attribution Non-commercial No Derivatives. 418Kb
PDF (S1 Table. Different choices of numbers of components give correlated scores) - Supplemental Material. Creative Commons Attribution. 161Kb
Image (PNG) (S1 Fig. LDA is related to CCA and can be used for correction) - Supplemental Material. Creative Commons Attribution. 981Kb
Image (PNG) (S2 Fig. Using standard CCA with all genes and genotypes results in no population structure and high over-fitting) - Supplemental Material. Creative Commons Attribution. 448Kb
Image (PNG) (S3 Fig. Running PCCA without including batch as a covariate gives nearly identical results) - Supplemental Material. Creative Commons Attribution. 490Kb
Image (PNG) (S4 Fig. Subsampling SNPs and genes shows that similar structure can be obtained with a small fraction of genes) - Supplemental Material. Creative Commons Attribution. 1032Kb
Image (PNG) (S5 Fig. Results from a GO enrichment analysis of the genes with the most variance in the projection onto the first two principal components) - Supplemental Material. Creative Commons Attribution. 321Kb
Image (PNG) (S6 Fig. Using regression rather than CCA to relate the principal components of the two data matrices also yields a projection that reveals population structure within the expression data) - Supplemental Material. Creative Commons Attribution. 196Kb
Image (PNG) (S7 Fig. Percentage of variance explained as a function of the number of PCs used in expression and SNP data) - Supplemental Material. Creative Commons Attribution. 188Kb
Image (PNG) (S8 Fig. Choosing different numbers of PCA components provides similar visualizations) - Supplemental Material. Creative Commons Attribution. 1516Kb

Use this Persistent URL to link to this item: http://resolver.caltech.edu/CaltechAUTHORS:20181008-162020262

Abstract

Population structure in genotype data has been extensively studied, and is revealed by looking at the principal components of the genotype matrix. However, no similar analysis of population structure in gene expression data has been conducted, in part because a naïve principal components analysis of the gene expression matrix does not cluster by population. We identify a linear projection that reveals population structure in gene expression data. Our approach relies on the coupling of the principal components of genotype to the principal components of gene expression via canonical correlation analysis. Our method is able to determine the significance of the variance in the canonical correlation projection explained by each gene. We identify 3,571 significant genes, only 837 of which had been previously reported to have an associated eQTL in the GEUVADIS results. We show that our projections are not primarily driven by differences in allele frequency at known cis-eQTLs and that similar projections can be recovered using only several hundred randomly selected genes and SNPs. Finally, we present preliminary work on the consequences for eQTL analysis. We observe that using our projection co-ordinates as covariates results in the discovery of slightly fewer genes with eQTLs, but that these genes replicate in GTEx matched tissue at a slightly higher rate.
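The coupling described in the abstract, principal components of genotype related to principal components of expression through canonical correlation analysis, can be illustrated with a minimal NumPy sketch. This is not the authors' PCCA software (that is linked in the record below); the function names `pc_scores` and `pcca_sketch`, the synthetic two-population data, and the choice of five components per matrix are all assumptions made for illustration only.

```python
import numpy as np

def pc_scores(M, k):
    """Top-k principal component scores of M (samples x features), unit-norm columns."""
    Mc = M - M.mean(axis=0)                      # center each feature
    U, s, Vt = np.linalg.svd(Mc, full_matrices=False)
    return U[:, :k]                              # orthonormal, mean-zero columns

def pcca_sketch(G, E, k_g, k_e):
    """CCA between the leading PCs of genotype (G) and expression (E).

    Because the PC score columns are orthonormal, the within-set covariance
    matrices are identities, so CCA reduces to an SVD of the cross-product.
    """
    Ug = pc_scores(G, k_g)
    Ue = pc_scores(E, k_e)
    A, rho, Bt = np.linalg.svd(Ug.T @ Ue, full_matrices=False)
    # Canonical variates for each data set, and the canonical correlations.
    return Ug @ A, Ue @ Bt.T, rho

# Synthetic data (assumption): two populations shift both matrices.
rng = np.random.default_rng(0)
n = 60
pop = np.repeat([0.0, 1.0], n // 2)              # hypothetical population label
G = rng.normal(size=(n, 200)) + pop[:, None] * rng.normal(size=200)
E = rng.normal(size=(n, 100)) + 0.5 * pop[:, None] * rng.normal(size=100)

cg, ce, rho = pcca_sketch(G, E, k_g=5, k_e=5)
print("canonical correlations:", np.round(rho, 3))
```

Plotting the first two columns of `ce` would give the expression-side projection in this toy setting. Reducing each matrix to a few PCs before CCA keeps the number of fitted directions small relative to the number of samples, which is the motivation the record gives in S2 Fig for not running standard CCA on all genes and genotypes.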


Item Type: Article
Related URLs:
URL | URL Type | Description
https://doi.org/10.1371/journal.pgen.1007841 | DOI | Article
https://doi.org/10.1101/364448 | DOI | Discussion Paper
https://doi.org/10.1371/journal.pgen.1007841.s001 | DOI | S1 Table
https://doi.org/10.1371/journal.pgen.1007841.s002 | DOI | S1 Fig
https://doi.org/10.1371/journal.pgen.1007841.s003 | DOI | S2 Fig
https://doi.org/10.1371/journal.pgen.1007841.s004 | DOI | S3 Fig
https://doi.org/10.1371/journal.pgen.1007841.s005 | DOI | S4 Fig
https://doi.org/10.1371/journal.pgen.1007841.s006 | DOI | S5 Fig
https://doi.org/10.1371/journal.pgen.1007841.s007 | DOI | S6 Fig
https://doi.org/10.1371/journal.pgen.1007841.s008 | DOI | S7 Fig
https://doi.org/10.1371/journal.pgen.1007841.s009 | DOI | S8 Fig
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6317812/ | PubMed Central | Article
ORCID:
Author | ORCID
Brown, Brielin C. | 0000-0001-5569-5223
Pachter, Lior | 0000-0002-9164-6231
Additional Information: © 2018 Brown et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Received: July 30, 2018; Accepted: November 20, 2018; Published: December 19, 2018.
The authors would like to thank Shannon McCurdy for invaluable feedback on this manuscript.
LP and NB were funded by National Institutes of Health grant R01HG008164. LP was also funded by National Institutes of Health grant DK094699. BB was funded by the National Science Foundation Graduate Research Fellowship Program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Data Availability: GEUVADIS project RNA-seq reads are available at the European Nucleotide Archive (accession number ENA: ERP001942). 1000 Genomes genotypes are available from cog-genomics (https://www.cog-genomics.org/plink/1.9/resources#1kg). Analysis software is available on GitHub (https://github.com/pachterlab/PCCA/). Gencode v27 transcripts are available at ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/gencode.v27.pc_transcripts.fa.gz. Gencode v27 GTF is available at ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/gencode.v27.annotation.gtf.gz.
The authors have declared that no competing interests exist.
Funders:
Funding Agency | Grant Number
NIH | R01 HG008164
NIH | DK094699
NSF Graduate Research Fellowship | UNSPECIFIED
PubMed Central ID: PMC6317812
Record Number: CaltechAUTHORS:20181008-162020262
Persistent URL: http://resolver.caltech.edu/CaltechAUTHORS:20181008-162020262
Official Citation: Brown BC, Bray NL, Pachter L (2018) Expression reflects population structure. PLoS Genet 14(12): e1007841. https://doi.org/10.1371/journal.pgen.1007841
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 90174
Collection: CaltechAUTHORS
Deposited By: George Porter
Deposited On: 09 Oct 2018 14:44
Last Modified: 02 May 2019 15:56
