
Mining gene expression data by interpreting principal components

Roden, Joseph C. and King, Brandon W. and Trout, Diane and Mortazavi, Ali and Wold, Barbara J. and Hart, Christopher E. (2006) Mining gene expression data by interpreting principal components. BMC Bioinformatics, 7. Art. No. 194. ISSN 1471-2105. PMCID PMC1501050. https://resolver.caltech.edu/CaltechAUTHORS:RODbmcbinform06



Abstract

Background: There are many methods for analyzing microarray data that group together genes having similar patterns of expression over all conditions tested. However, in many instances the biologically important goal is to identify relatively small sets of genes that share coherent expression across only some conditions, rather than all or most conditions as required in traditional clustering; e.g. genes that are highly up-regulated and/or down-regulated similarly across only a subset of conditions. Equally important is the need to learn which conditions are the decisive ones in forming such gene sets of interest, and how they relate to diverse conditional covariates, such as disease diagnosis or prognosis.

Results: We present a method for automatically identifying such candidate sets of biologically relevant genes using a combination of principal components analysis and information theoretic metrics. To enable easy use of our methods, we have developed a data analysis package that facilitates visualization and subsequent data mining of the independent sources of significant variation present in gene microarray expression datasets (or in any other similarly structured high-dimensional dataset). We applied these tools to two public datasets, and highlight sets of genes most affected by specific subsets of conditions (e.g. tissues, treatments, samples, etc.). Statistically significant associations for highlighted gene sets were shown via global analysis for Gene Ontology term enrichment. Together with covariate associations, the tool provides a basis for building testable hypotheses about the biological or experimental causes of observed variation.

Conclusion: We provide an unsupervised data mining technique for diverse microarray expression datasets that is distinct from major methods now in routine use. In test uses, this method, based on publicly available gene annotations, appears to identify numerous sets of biologically relevant genes. It has proven especially valuable in instances where there are many diverse conditions (tens to hundreds of different tissues or cell types), a situation in which many clustering and ordering algorithms become problematic. This approach also shows promise in other topic domains such as multi-spectral imaging datasets.
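The core idea in the abstract (project genes onto principal components, then highlight the genes at the extremes of each component and the conditions that drive it) can be illustrated with a short sketch. This is a minimal example in Python with NumPy under stated assumptions, not the authors' analysis package: the function name `extreme_genes_per_component`, its parameters, and the synthetic data are hypothetical, and ranking conditions by absolute loading is a simple stand-in for the information theoretic metrics the abstract mentions.

```python
import numpy as np


def extreme_genes_per_component(expr, n_components=1, n_extreme=3, n_conditions=2):
    """Summarize the leading principal components of a genes x conditions matrix.

    Hypothetical helper for illustration only. For each component it reports
    the genes with the lowest and highest scores and the conditions with the
    largest absolute loadings.
    """
    # Center each condition (column) so PCA describes variation around the mean.
    centered = expr - expr.mean(axis=0, keepdims=True)

    # SVD of the centered matrix: rows of vt are principal axes in condition
    # space, and u * s gives every gene's score along each axis.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u * s
    explained = s**2 / np.sum(s**2)

    results = []
    for k in range(n_components):
        order = np.argsort(scores[:, k])  # ascending gene scores on PC k
        results.append({
            "explained_variance": float(explained[k]),
            "low_genes": order[:n_extreme].tolist(),
            "high_genes": order[::-1][:n_extreme].tolist(),
            # Assumption: conditions are ranked by |loading|, not by the
            # paper's information theoretic criteria.
            "top_conditions": np.argsort(-np.abs(vt[k]))[:n_conditions].tolist(),
        })
    return results


# Synthetic data (made up for this sketch): 20 genes x 6 conditions of noise,
# with genes 0-2 up-regulated and genes 3-5 down-regulated in conditions 0-1.
rng = np.random.default_rng(0)
expr = rng.normal(scale=0.1, size=(20, 6))
expr[0:3, 0:2] += 5.0
expr[3:6, 0:2] -= 5.0

pc1 = extreme_genes_per_component(expr)[0]
print(pc1["high_genes"], pc1["low_genes"], pc1["top_conditions"])
```

Because the sign of a principal component is arbitrary, the up-regulated genes may appear at either end of the component; what matters is that the two extremes separate the planted gene sets and that the highest-loading conditions are the ones in which those genes differ.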


Item Type: Article
Related URLs:
  URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1501050/
  URL Type: PubMed Central
  Description: Article
ORCID:
  Mortazavi, Ali: 0000-0002-4259-6362
  Wold, Barbara J.: 0000-0003-3235-8130
Additional Information: © 2006 Roden et al., licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Submission date 3 July 2005; Acceptance date 7 April 2006; Publication date 7 April 2006.

Authors' contributions: JR and CH conceived the methodology of exhaustively analyzing and interpreting principal components, in particular how to identify extreme genes and significant conditions, and how to automate correlating these conditions with covariates to aid interpretation. JR, BK, DT and CH carried out the software development. JR carried out the initial PCA interpretation studies and drafted the manuscript. BK and DT performed additional PCA analyses and results interpretation. AM performed the gene set GO term enrichment analysis. BW conceived of the GNF dataset interpretation study, participated in its design and results interpretation, and helped to draft the manuscript. All authors read and approved the final manuscript.

Acknowledgements: This work was supported in part by grants to BJW from the Department of Energy and the National Cancer Institute’s Director’s Challenge program. Additional support was provided by the NASA Office of Biological and Physical Research (OBPR) program. We also acknowledge Eric Mjolsness for discussions at the earliest phases of this research, and Ken McCue for additional discussions. We acknowledge that the GNF gene microarray expression data presented herein was obtained from Genomics Institute of the Novartis Research Foundation, and is © 2003-2005 GNF. We acknowledge that the diabetes expression data presented herein was obtained from the Broad Institute’s Cancer Program dataset repository.
Funders (funding agency: grant number):
  Department of Energy (DOE): UNSPECIFIED
  National Cancer Institute: UNSPECIFIED
  NIH: UNSPECIFIED
  NASA: UNSPECIFIED
PubMed Central ID: PMC1501050
Record Number: CaltechAUTHORS:RODbmcbinform06
Persistent URL: https://resolver.caltech.edu/CaltechAUTHORS:RODbmcbinform06
Alternative URL: http://dx.doi.org/10.1186/1471-2105-7-194
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 2544
Collection: CaltechAUTHORS
Deposited By: Archive Administrator
Deposited On: 09 Apr 2006
Last Modified: 29 Oct 2019 22:52
