Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation

Van Auken, Kimberly and Jaffery, Joshua and Chan, Juancarlos and Müller, Hans-Michael and Sternberg, Paul W. (2009) Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation. BMC Bioinformatics, 10 (228). ISSN 1471-2105. PMCID PMC2719631. https://resolver.caltech.edu/CaltechAUTHORS:20090911-153603785

PDF - Published Version (316Kb). Creative Commons Attribution.
Plain Text (Training set corpus and true positive sentences for Cellular Component category development) - Supplemental Material (309Kb). Creative Commons Attribution.
Plain Text (Category terms) - Supplemental Material (3510b). Creative Commons Attribution.
Plain Text (Annotation test corpus) - Supplemental Material (894b). Creative Commons Attribution.
Plain Text (Curation efficiency test corpus) - Supplemental Material (899b). Creative Commons Attribution.
MS PowerPoint (WormBase Cellular Component curation form) - Supplemental Material (53Kb). Creative Commons Attribution.

Abstract

Background: Manual curation of experimental data from the biomedical literature is an expensive and time-consuming endeavor. Nevertheless, most biological knowledge bases still rely heavily on manual curation for data extraction and entry. Text mining software that can semi- or fully automate information retrieval from the literature would thus provide a significant boost to manual curation efforts.

Results: We employ the Textpresso category-based information retrieval and extraction system (http://www.textpresso.org), developed by WormBase, to explore how Textpresso might improve the efficiency with which we manually curate C. elegans proteins to the Gene Ontology's Cellular Component Ontology. Using a training set of sentences that describe results of localization experiments in the published literature, we generated three new curation task-specific categories (Cellular Components, Assay Terms, and Verbs) containing words and phrases associated with reports of experimentally determined subcellular localization. We compared the results of manual curation to those of Textpresso queries that searched the full text of articles for sentences containing terms from each of the three new categories plus the name of a previously uncurated C. elegans protein, and found that Textpresso searches identified curatable papers with recall and precision rates of 79.1% and 61.8%, respectively (F-score of 69.5%), when compared to manual curation. Within those documents, Textpresso identified relevant sentences with recall and precision rates of 30.3% and 80.1% (F-score of 44.0%). From returned sentences, curators were able to make 66.2% of all possible experimentally supported GO Cellular Component annotations with 97.3% precision (F-score of 78.8%). Measuring the relative efficiencies of Textpresso-based versus manual curation, we find that Textpresso has the potential to increase curation efficiency by at least 8-fold, and perhaps as much as 15-fold, given differences in individual curatorial speed.

Conclusion: Textpresso is an effective tool for improving the efficiency of manual, experimentally based curation. Incorporating a Textpresso-based Cellular Component curation pipeline at WormBase has allowed us to transition from strictly manual curation of this data type to a more efficient pipeline of computer-assisted validation. Continued development of curation task-specific Textpresso categories will provide an invaluable resource for genomics databases that rely heavily on manual curation.
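
The F-scores quoted in the abstract can be related to the reported recall and precision values with a short calculation. The sketch below assumes the balanced F-score (the harmonic mean of precision and recall); the abstract does not state which variant was used, and the function name `f_score` and the dictionary labels are illustrative only. Because the inputs here are the rounded percentages from the abstract rather than the underlying counts, a recomputed value may differ from the reported one in the last digit.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall.

    Assumption: the abstract's F-scores are balanced (F1) scores.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision) pairs as reported in the abstract, written as fractions.
# The labels are illustrative names, not terms taken from the paper.
reported = {
    "curatable papers": (0.791, 0.618),
    "relevant sentences": (0.303, 0.801),
    "annotations from returned sentences": (0.662, 0.973),
}

for label, (recall, precision) in reported.items():
    print(f"{label}: F = {100 * f_score(precision, recall):.1f}%")
```

The harmonic mean is pulled toward the smaller of the two inputs, which is why sentence-level retrieval, with high precision but low recall, ends up with the lowest F-score of the three.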


Item Type: Article
Related URLs:
URL | URL Type | Description
http://www.biomedcentral.com/1471-2105/10/228 | Publisher | Article
http://dx.doi.org/10.1186/1471-2105-10-228 | Publisher | Article
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2719631/ | PubMed Central | Article
ORCID:
Author | ORCID
Van Auken, Kimberly | 0000-0002-1706-4196
Sternberg, Paul W. | 0000-0002-7699-0173
Additional Information: © 2009 Van Auken et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Received: 29 January 2009. Accepted: 21 July 2009. Published: 21 July 2009. We gratefully acknowledge helpful comments on the manuscript from Tanya Berardini, Kara Dolinski, Ranjana Kishore, Mike Livstone, and Tracy Teal, and Seth Carbon for assistance with GO database queries. We also thank David Botstein, Kara Dolinski, and Mike Livstone at the Lewis-Sigler Institute for Integrative Genomics at Princeton University for generously providing space to KV while this work was being completed. This work was supported by grants #P41-HG02223, #R01-HG004090, and #P41-HG02273 from the National Human Genome Research Institute (NHGRI) at the United States National Institutes of Health. PWS is an Investigator with the Howard Hughes Medical Institute. KV and JJ identified training set sentences and developed the categories. KV performed and analyzed the recall and precision tests. KV, JJ, and PWS performed the curation efficiency tests. JC developed all curation tools for the project. HMM developed and maintained the Textpresso for Cellular Component Curation site. KV wrote the paper with valuable discussions and critical contributions at all stages of the project from HMM and PWS.
Funders:
Funding Agency | Grant Number
NIH | P41-HG02223
NIH | R01-HG004090
NIH | P41-HG02273
Howard Hughes Medical Institute (HHMI) | UNSPECIFIED
National Human Genome Research Institute | UNSPECIFIED
Issue or Number: 228
PubMed Central ID: PMC2719631
Record Number: CaltechAUTHORS:20090911-153603785
Persistent URL: https://resolver.caltech.edu/CaltechAUTHORS:20090911-153603785
Official Citation: Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation. Kimberly Van Auken, Joshua Jaffery, Juancarlos Chan, Hans-Michael Müller and Paul W Sternberg. BMC Bioinformatics 2009, 10:228. doi:10.1186/1471-2105-10-228
Usage Policy: This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ID Code: 15819
Collection: CaltechAUTHORS
Deposited By: George Porter
Deposited On: 29 Sep 2009 21:18
Last Modified: 03 Oct 2019 01:03
