CaltechAUTHORS
  A Caltech Library Service

Automatic categorization of diverse experimental information in the bioscience literature

Fang, Ruihua and Schindelman, Gary and Van Auken, Kimberly and Fernandes, Jolene and Chen, Wen and Wang, Xiaodong and Davis, Paul and Tuli, Mary Ann and Marygold, Steven J. and Millburn, Gillian and Matthews, Beverley and Zhang, Haiyan and Brown, Nick and Gelbart, William M. and Sternberg, Paul W. (2012) Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics, 13. Art. No. 16. ISSN 1471-2105. PMCID PMC3305665. http://resolver.caltech.edu/CaltechAUTHORS:20120320-105305942

PDF - Published Version. Creative Commons Attribution. 279Kb
PDF - Supplemental Material. Creative Commons Attribution. 55Kb
PDF - Supplemental Material. Creative Commons Attribution. 324Kb
MS Excel - Supplemental Material. Creative Commons Attribution. 18Kb
MS Excel - Supplemental Material. Creative Commons Attribution. 16Kb
MS Excel - Supplemental Material. Creative Commons Attribution. 17Kb
MS Excel - Supplemental Material. Creative Commons Attribution. 17Kb
MS Excel - Supplemental Material. Creative Commons Attribution. 16Kb
MS Excel (Table S7) - Supplemental Material. Creative Commons Attribution. 11Kb
MS Excel (Table S8) - Supplemental Material. Creative Commons Attribution. 12Kb
MS Excel (Table S9) - Supplemental Material. Creative Commons Attribution. 11Kb
MS Excel (Table S10) - Supplemental Material. Creative Commons Attribution. 11Kb
MS Excel (Table S11) - Supplemental Material. Creative Commons Attribution. 11Kb
MS Excel (Table S12) - Supplemental Material. Creative Commons Attribution. 10Kb
Archive (ZIP) (easySVM.tar.gz) - Supplemental Material. Creative Commons Attribution. 16Mb

Use this Persistent URL to link to this item: http://resolver.caltech.edu/CaltechAUTHORS:20120320-105305942

Abstract

Background: Curation of information from the bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify, from all published literature, the papers that contain results for a specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest, and is thus usually time-consuming. We developed an automatic method, based on the machine learning method Support Vector Machine (SVM), for identifying papers containing these curation data types among a large pool of published scientific papers. This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in production use for automatic categorization of 10 different experimental data types in the biocuration process at WormBase for the past two years, and it is in the process of being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community, thereby greatly reducing the time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature, such as C. elegans and D. melanogaster, to identify papers with any of these data types for a single database. This approach has great significance because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.

Results: We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genome Informatics (MGI). It is being used in the curation workflow at WormBase for automatic association of newly published papers with ten data types: RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction.

Conclusions: Our methods are applicable to a variety of data types with training sets containing several hundred to a few thousand documents. They are completely automatic and can thus be readily incorporated into different workflows at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort.
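
The idea in the abstract, a binary SVM that flags papers for one data type and is trained on papers pooled from more than one literature corpus, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the bag-of-words features, the Pegasos-style subgradient training of a linear SVM, the toy paper texts, and the function names are hypothetical. This record does not describe the authors' features, kernel, or training procedure, and the sketch is not the easySVM code in the supplemental archive.

```python
from collections import Counter


def featurize(text):
    """Bag-of-words term counts, L2-normalised (an assumed feature scheme)."""
    counts = Counter(text.lower().split())
    norm = sum(c * c for c in counts.values()) ** 0.5
    return {w: c / norm for w, c in counts.items()} if norm else {}


def dot(weights, features):
    """Sparse dot product; words never seen in training contribute 0."""
    return sum(weights.get(w, 0.0) * v for w, v in features.items())


def train_linear_svm(examples, lam=0.01, epochs=20):
    """Pegasos-style subgradient descent on the L2-regularised hinge loss.

    `examples` is a list of (text, label) pairs with label +1 or -1.
    Examples are visited in a fixed order so the result is deterministic.
    """
    data = [(featurize(text), label) for text, label in examples]
    weights = {}
    t = 0
    for _ in range(epochs):
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            margin = y * dot(weights, x)   # computed before the update
            shrink = 1.0 - eta * lam       # effect of the L2 penalty
            for w in weights:
                weights[w] *= shrink
            if margin < 1.0:               # hinge loss is active
                for w, v in x.items():
                    weights[w] = weights.get(w, 0.0) + eta * y * v
    return weights


def classify(weights, text):
    """Return +1 if the paper is predicted to contain the data type, else -1."""
    return 1 if dot(weights, featurize(text)) > 0 else -1


# Toy stand-ins for curated training papers from two corpora (hypothetical).
# Label +1 means "contains RNAi data"; -1 means it does not.
worm_papers = [
    ("rnai knockdown by dsrna feeding in elegans", 1),
    ("antibody staining of elegans embryos", -1),
]
fly_papers = [
    ("dsrna injection rnai screen in drosophila", 1),
    ("antibody staining of drosophila larvae", -1),
]

# Pooling the corpora gives the classifier more training papers per data type.
rnai_weights = train_linear_svm(worm_papers + fly_papers)

for paper in ["rnai knockdown", "antibody staining"]:
    print(paper, "->", classify(rnai_weights, paper))
```

In a setup like this, one such binary classifier would be trained per data type, and each new paper would be scored by all of them.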


Item Type: Article
Related URLs:
URL | URL Type | Description
http://dx.doi.org/10.1186/1471-2105-13-16 | DOI | Article
http://www.biomedcentral.com/1471-2105/13/16 | Publisher | Article
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3305665/ | PubMed Central | Article
ORCID:
Author | ORCID
Van Auken, Kimberly | 0000-0002-1706-4196
Sternberg, Paul W. | 0000-0002-7699-0173
Additional Information: © 2012 Fang et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Received: 5 October 2011. Accepted: 26 January 2012. Published: 26 January 2012.

This work was supported by grants P41 HG002223, P41 HG002223-10S1, P41 HG000739 and R01 HG004090 from the National Human Genome Research Institute (NHGRI) at the United States National Institutes of Health. We thank the past and present members of WormBase and FlyBase for curating the papers used in this study. We gratefully acknowledge Karen Yook for making the WormBase data type definition available on the WormBase Wiki page; Juancarlos Chan for help with getting WormBase training data from the WormBase curation status tracking database; and Hans-Michael Müller for the full text WormBase papers. PWS is an Investigator with the Howard Hughes Medical Institute.

Authors’ contributions: RF developed the algorithm, wrote the program and analyzed all the datasets. GS contributed to the comprehensive SVM scheme and validated RNAi results. KVA validated gene product (GO) and mutant allele sequence results. JF validated phenotype analysis results. WC validated gene expression and antibody results. XW validated gene regulation results. PD validated the training set used for the gene structure correction data type. MAT validated mutant allele sequence results. SM and GM provided the FlyBase Cambridge datasets. BM and HZ provided the FlyBase Harvard datasets. RF wrote the paper with valuable discussion and critical contributions at all stages of the project from PWS. PWS, GS, KVA, GM, BM, and SM edited the manuscript. All authors read and approved the final manuscript.
Funders:
Funding Agency | Grant Number
NIH | P41 HG002223
NIH | P41 HG002223-10S1
NIH | P41 HG000739
NIH | R01 HG004090
PubMed Central ID: PMC3305665
Record Number: CaltechAUTHORS:20120320-105305942
Persistent URL: http://resolver.caltech.edu/CaltechAUTHORS:20120320-105305942
Official Citation: Fang et al.: Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics 2012 13:16.
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 29784
Collection: CaltechAUTHORS
Deposited By: Ruth Sustaita
Deposited On: 21 Mar 2012 15:31
Last Modified: 21 Jun 2017 23:36