Term Matrix: A novel Gene Ontology annotation quality control system based on ontology term co-annotation patterns
Abstract
Biological processes are accomplished by the coordinated action of gene products. Gene products often participate in multiple processes, and can therefore be annotated to multiple Gene Ontology (GO) terms. Nevertheless, processes that are functionally, temporally, and/or spatially distant may have few gene products in common, and co-annotation to unrelated processes likely reflects errors in literature curation, ontology structure, or automated annotation pipelines. We have developed an annotation quality control workflow that uses rules based on mutually exclusive processes to detect annotation errors, based on and validated by case studies including the three we present here: fission yeast protein-coding gene annotations over time; annotations for cohesin complex subunits in human and model species; and annotations using a selected set of GO biological process terms in human and five model species. For each case study, we reviewed available GO annotations, identified pairs of biological processes which are unlikely to be correctly co-annotated to the same gene products (e.g., amino acid metabolism and cytokinesis), and traced erroneous annotations to their sources. To date we have generated 107 quality control rules, and corrected 289 manual annotations in eukaryotes and over 2.5 million automatically propagated annotations across all taxa.
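The rule-based check described in the abstract can be illustrated with a minimal sketch. It assumes annotations are available as simple (gene product, process) pairs and that each rule is an unordered pair of processes that should not share gene products. The names `annotations`, `no_overlap_rules` and `find_violations`, the toy gene identifiers, and the use of plain term labels rather than GO identifiers are all hypothetical. This is not the Term Matrix implementation, and it ignores the ontology structure (for example, annotations to descendant terms) that a real check would need to take into account.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: (gene product, GO biological process label) pairs.
annotations = [
    ("geneA", "amino acid metabolism"),
    ("geneA", "cytokinesis"),
    ("geneB", "cytokinesis"),
    ("geneC", "amino acid metabolism"),
]

# Hypothetical rule set: each rule is an unordered pair of processes
# that are not expected to be co-annotated to the same gene product.
no_overlap_rules = {frozenset({"amino acid metabolism", "cytokinesis"})}


def find_violations(annotations, rules):
    """Return (gene, term_a, term_b) for every co-annotation that breaks a rule."""
    # Group the annotated terms by gene product.
    terms_by_gene = defaultdict(set)
    for gene, term in annotations:
        terms_by_gene[gene].add(term)

    violations = []
    for gene in sorted(terms_by_gene):
        # Check every pair of terms annotated to this gene against the rules.
        for term_a, term_b in combinations(sorted(terms_by_gene[gene]), 2):
            if frozenset({term_a, term_b}) in rules:
                violations.append((gene, term_a, term_b))
    return violations


violations = find_violations(annotations, no_overlap_rules)
for gene, term_a, term_b in violations:
    print(f"{gene}: '{term_a}' and '{term_b}' should not be co-annotated")
```

In this sketch each flagged gene is only a candidate for review: as the abstract notes, a violation may trace back to a curation error, an ontology problem, or an automated pipeline, so it has to be traced to its source before anything is corrected.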
Additional Information
© 2020 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.
Manuscript received 22/06/2020; Manuscript accepted 06/08/2020; Published online 02/09/2020.
We thank Peter D'Eustachio for Reactome updates and the InterPro group for InterPro2GO mapping updates. We thank Nomi Harris for constructive comments on the manuscript. We also thank the many biocurators, editors and other members of the GO Consortium who have contributed to GO annotations and to the development of the Gene Ontology, and PomBase principal investigator Stephen G. Oliver for ongoing guidance and support of all PomBase activities.
Data accessibility: The GO ontology and annotation datasets are freely available from the Gene Ontology website (see the main downloads page [41]). All other data supporting this article have been uploaded as part of the electronic supplementary material.
Authors' contributions: V.W. conceived the project, generated annotation rules and wrote the initial draft; S.C. and C.J.M. developed Term Matrix; K.M.R. provided bioinformatic support for the fission yeast case study; V.W., A.L., S.R.E., D.P.H., K.V.A., H.A. and R.C.L. corrected annotation errors identified in the study; M.A.H. made extensive text revisions, and prepared the manuscript for submission; D.P.H., K.V.A. and P.G. corrected ontology errors; S.P. and M.F. provided SPKW mapping updates; M.F. and P.G. provided PAINT propagation updates. All authors contributed to the discussion of ideas and manuscript revisions, and read and approved the final manuscript.
The authors declare no competing interests.
V.W., A.L., M.A.H. and K.M.R. are supported by the Wellcome Trust via the PomBase project (grant no. 104967/Z/14/Z). S.C., S.R.E., D.P.H., K.V.A., P.G. and C.J.M. are funded via the GO resource, which is supported by the National Human Genome Research Institute (NHGRI) (grant no. U41 HG002273). S.R.E. is also funded by the NHGRI via the Saccharomyces Genome Database (grant no. U41 HG001315) and the Alliance of Genome Resources (grant no. U24 HG010859). K.V.A. is also funded via WormBase, which is supported by the NHGRI (grant no. U24 HG002223), the UK Medical Research Council (grant no. MR/S000453/1) and the UK Biotechnology and Biological Sciences Research Council (grant no. BB/P024602/1). H.A. is funded by the UK Medical Research Council (grant no. MR/N030117/1). R.C.L. is supported by Alzheimer's Research UK (grant no. ARUK-NAS2017A-1) and by the National Institute for Health Research UCL Hospitals Biomedical Research Centre. The GO Consortium, FlyBase (HA), Mouse Genome Informatics (DPH), the Saccharomyces Genome Database (SRE), and WormBase (KVA) are members of the Alliance of Genome Resources.
Attached Files
Published - rsob.200149.pdf
Submitted - 2020.04.21.045195v1.full.pdf
Supplemental Material - RSOB200149_si_001.xlsx
Supplemental Material - RSOB200149_si_002.xlsx
Supplemental Material - RSOB200149_si_003.xlsx
Supplemental Material - RSOB200149_si_004.xlsx
Supplemental Material - RSOB200149_si_005.xlsx
Supplemental Material - RSOB200149_si_006.xlsx
Files
MD5 checksum | Size |
---|---|
md5:b509d1319beef1dae3ecc88afd53a801 | 12.2 kB |
md5:e262afbcc49de790f9f678d89430d2a8 | 18.6 kB |
md5:8001e8da6831f10f0262987d35439bfc | 17.5 kB |
md5:fcbcd0664f2b681a8ebb5a669b6f5eb1 | 91.1 kB |
md5:ac9df41c71d153f2942d4979ae435482 | 30.1 kB |
md5:efa8330f5dddd2cfacf3088cbe3fb62a | 959.2 kB |
md5:7cd9ccf5028e257bd98bd9494b9dfaf1 | 4.9 kB |
md5:f501e7b565775d6b47ca0852cb162a1a | 1.8 MB |
Additional details
- PMCID: PMC7536087
- Eprint ID: 102759
- Resolver ID: CaltechAUTHORS:20200423-151030286
- Funder: Wellcome Trust (grant no. 104967/Z/14/Z)
- Funder: NIH (grant no. U41 HG002273)
- Funder: NIH (grant no. U41 HG001315)
- Funder: NIH (grant no. U24 HG010859)
- Funder: NIH (grant no. U24 HG002223)
- Funder: Medical Research Council (UK) (grant no. MR/S000453/1)
- Funder: Biotechnology and Biological Sciences Research Council (BBSRC) (grant no. BB/P024602/1)
- Funder: Medical Research Council (UK) (grant no. MR/N030117/1)
- Funder: Alzheimer's Research UK (grant no. ARUK-NAS2017A-1)
- Funder: National Institute for Health Research
- Created: 2020-04-23 (created from EPrint's datestamp field)
- Updated: 2023-06-01 (created from EPrint's last_modified field)