
Text mining in the biocuration workflow: applications for literature curation at WormBase, dictyBase and TAIR

Van Auken, Kimberly and Fey, Petra and Berardini, Tanya Z. and Dodson, Robert and Cooper, Laurel and Li, Donghui and Chan, Juancarlos and Li, Yuling and Basu, Siddhartha and Muller, Hans-Michael and Chisholm, Rex and Huala, Eva and Sternberg, Paul W. (2012) Text mining in the biocuration workflow: applications for literature curation at WormBase, dictyBase and TAIR. Database: The Journal of Biological Databases and Curation, 2012. Art. No. bas040. ISSN 1758-0463. PMCID PMC3500519.

PDF - Published Version
Creative Commons Attribution.




WormBase, dictyBase and The Arabidopsis Information Resource (TAIR) are model organism databases containing information about Caenorhabditis elegans and other nematodes, the social amoeba Dictyostelium discoideum and related Dictyostelids and the flowering plant Arabidopsis thaliana, respectively. Each database curates multiple data types from the primary research literature. In this article, we describe the curation workflow at WormBase, with particular emphasis on our use of text-mining tools (BioCreative 2012, Workshop Track II). We then describe the application of a specific component of that workflow, Textpresso for Cellular Component Curation (CCC), to Gene Ontology (GO) curation at dictyBase and TAIR (BioCreative 2012, Workshop Track III). We find that, with organism-specific modifications, Textpresso can be used by dictyBase and TAIR to annotate gene products to GO's Cellular Component (CC) ontology.

Item Type: Article
ORCID:
Van Auken, Kimberly: 0000-0002-1706-4196
Sternberg, Paul W.: 0000-0002-7699-0173
Additional Information: © 2012 The Author(s). Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact
Submitted 18 June 2012; Revised 30 September 2012; Accepted 2 October 2012.
We would like to thank the BioCreative Workshop 2012 Steering Committee for the opportunity to participate in the workshop and, in particular, C. Arighi for advice and support regarding the Task III evaluation. We also thank C Grove, K Howe, R Kishore, D Raciti, MA Tuli, X Wang, G Williams and K Yook for their helpful comments on the manuscript and gratefully acknowledge S. Wimpfheimer for assistance with the figures.
The members of the WormBase Consortium are M. Berriman and R. Durbin (Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK); T. Bieri, P. Ozersky and J. Spieth (The Genome Institute, Washington University School of Medicine, St. Louis, MO 63108, USA); A. Cabunoc, A. Duong, T.W. Harris and L. Stein (Ontario Institute for Cancer Research, 101 College Street, Suite 800, Toronto, ON, Canada M5G0A); J. Chan, W.J. Chen, J. Done, C. Grove, R. Kishore, R. Lee, Y. Li, H.M. Muller, C. Nakamura, D. Raciti, G. Schindelman, K. Van Auken, D. Wang, X. Wang, K. Yook and P.W. Sternberg (Division of Biology, California Institute of Technology, 1200 E. California Boulevard, Pasadena, CA 91125, USA); J. Hodgkin (Genetics Unit, Department of Biochemistry, University of Oxford, South Parks Road, Oxford OX1 3QU, United Kingdom); P. Davis, K. Howe, M. Paulini, M.A. Tuli, G. Williams and P. Kersey (EMBL-European Bioinformatics Institute Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK).
Funding: The US National Human Genome Research Institute (HG02223 to WormBase) and the British Medical Research Council (G070119 to WormBase); The US National Human Genome Research Institute (HG004090 to Textpresso); The US National Institute of Health (GM64426 to dictyBase); The National Science Foundation (DBI-0850219 to TAIR), with additional support from TAIR sponsors (http://www.; The National Science Foundation (0822201 to The Plant Ontology); US National Human Genome Research Institute (HG002273 to The Gene Ontology Consortium). PWS is an investigator with the Howard Hughes Medical Institute. Funding for open access charge: US National Human Genome Research Institute (Grant no. HG002273).
Funders (Funding Agency: Grant Number):
British Medical Research Council: G070119
National Human Genome Research Institute: UNSPECIFIED
PubMed Central ID: PMC3500519
Record Number: CaltechAUTHORS:20130118-102637703
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 36475
Deposited By: Jason Perez
Deposited On: 18 Jan 2013 21:15
Last Modified: 03 Oct 2019 04:38
