BC4GO: a full-text corpus for the BioCreative IV GO task

Van Auken, Kimberly and Schaeffer, Mary L. and McQuilton, Peter and Laulederkind, Stanley J. F. and Li, Donghui and Wang, Shur-Jen and Hayman, G. Thomas and Tweedie, Susan and Arighi, Cecilia N. and Done, James and Müller, Hans-Michael and Sternberg, Paul W. and Mao, Yuqing and Wei, Chih-Hsuan and Lu, Zhiyong (2014) BC4GO: a full-text corpus for the BioCreative IV GO task. Database: The Journal of Biological Databases and Curation, 2014. Art. No. 74. ISSN 1758-0463. PMCID PMC4112614. doi:10.1093/database/bau074.

Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database groups. Owing to its manual nature, this task is considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and supporting information from full text. However, few systems have delivered accuracy comparable with that of human curators. One recognized challenge in developing such systems is the lack of marked sentence-level evidence text that provides the basis for making GO annotations. We aim to create a corpus that includes the GO evidence text along with the three core elements of GO annotations: (i) a gene or gene product, (ii) a GO term and (iii) a GO evidence code. To ensure our results are consistent with real-life GO data, we recruited eight professional GO curators and asked them to follow their routine GO annotation protocols. Our annotators marked up more than 5000 text passages in 200 articles for 1356 distinct GO terms. For evidence sentence selection, the inter-annotator agreement (IAA) results are 9.3% (strict) and 42.7% (relaxed) in F1-measures. For GO term selection, the IAAs are 47% (strict) and 62.9% (hierarchical). Our corpus analysis further shows that abstracts contain ∼10% of relevant evidence sentences and 30% of distinct GO terms, while the Results/Experiment section has nearly 60% of relevant sentences and >70% of GO terms. Further, of those evidence sentences found in abstracts, less than one-third contain enough experimental detail to fulfill the three core criteria of a GO annotation. This result demonstrates the need to use full-text articles when text mining GO annotations. Through its use at the BioCreative IV GO (BC4GO) task, we expect our corpus to become a valuable resource for the BioNLP research community.
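
As an illustration of how an F1-based agreement score between two annotators can be computed, the following is a minimal sketch and not the authors' evaluation code. It assumes each evidence passage is represented as a hypothetical (document ID, start offset, end offset) tuple, that "strict" means an exact match, and that "relaxed" means any character overlap within the same document; the abstract does not define these criteria, so both definitions are assumptions, as are all function names and the sample data.

```python
def f1_agreement(spans_a, spans_b, match):
    """F1 between two annotators' span lists, using spans_a as the reference."""
    if not spans_a or not spans_b:
        return 0.0
    # Precision: fraction of B's spans that match at least one span of A.
    precision = sum(any(match(b, a) for a in spans_a) for b in spans_b) / len(spans_b)
    # Recall: fraction of A's spans that match at least one span of B.
    recall = sum(any(match(a, b) for b in spans_b) for a in spans_a) / len(spans_a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def strict(x, y):
    """Assumed strict criterion: identical document and offsets."""
    return x == y


def relaxed(x, y):
    """Assumed relaxed criterion: same document and overlapping character ranges."""
    return x[0] == y[0] and x[1] < y[2] and y[1] < x[2]


# Hypothetical annotations: (document ID, start offset, end offset).
annotator_1 = [("doc1", 0, 50), ("doc1", 100, 180)]
annotator_2 = [("doc1", 0, 50), ("doc1", 150, 220), ("doc1", 300, 350)]

strict_f1 = f1_agreement(annotator_1, annotator_2, strict)
relaxed_f1 = f1_agreement(annotator_1, annotator_2, relaxed)
print(f"strict F1 = {strict_f1:.2f}, relaxed F1 = {relaxed_f1:.2f}")
```

On this toy data the relaxed score is higher than the strict one because a partially overlapping passage counts as a match only under the relaxed criterion.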

Item Type: Article
ORCID:
Van Auken, Kimberly: 0000-0002-1706-4196
Sternberg, Paul W.: 0000-0002-7699-0173
Additional Information: © 2014 Oxford University Press. This work is written by US Government employees and is in the public domain in the US. Received 1 February 2014; Revised 1 July 2014; Accepted 3 July 2014. We would like to thank Don Comeau, Rezarta Dogan and John Wilbur for general discussion and technical assistance in using BioC, and in particular Don Comeau for providing us source PMC articles in the BioC XML format. We also thank Lynette Hirschman, Cathy Wu, Kevin Cohen, Martin Krallinger and Thomas Wiegers from the BioCreative IV organizing committee for their support, and Judith Blake, Andrew Chatr-aryamontri, Sherri Matis, Fiona McCarthy, Sandra Orchard and Phoebe Roberts from the BioCreative IV User Advisory Group for their helpful discussions. Funding: Intramural Research Program of the NIH, National Library of Medicine (to C.W., Y.M. and Z.L.), the USDA ARS (to M.L.S.), the National Human Genome Research Institute at the US National Institutes of Health (# HG004090, # HG002223 and # HG002273) and National Science Foundation (ABI-1062520, ABI-1147029 and DBI-0850319). Conflict of interest: None declared.
Funding Agency: Department of Agriculture (Grant Number: UNSPECIFIED)
PubMed Central ID: PMC4112614
Record Number: CaltechAUTHORS:20140829-131503355
Official Citation: Kimberly Van Auken, Mary L. Schaeffer, Peter McQuilton, Stanley J. F. Laulederkind, Donghui Li, Shur-Jen Wang, G. Thomas Hayman, Susan Tweedie, Cecilia N. Arighi, James Done, Hans-Michael Müller, Paul W. Sternberg, Yuqing Mao, Chih-Hsuan Wei, and Zhiyong Lu. BC4GO: a full-text corpus for the BioCreative IV GO task. Database 2014: bau074. doi:10.1093/database/bau074. Published online July 28, 2014.
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 49073
Deposited By: Ruth Sustaita
Deposited On: 29 Aug 2014 21:55
Last Modified: 10 Nov 2021 18:39
