A Caltech Library Service

Automatic document classification of biological literature

Chen, David and Muller, Hans-Michael and Sternberg, Paul W. (2006) Automatic document classification of biological literature. BMC Bioinformatics, 7. Art. No. 370. ISSN 1471-2105. PMCID PMC1559726.



Background: Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature.

Results: We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first applies a classifier trained with a support vector machine, then a novel, phrase-based clustering algorithm. The clustering step autonomously creates cluster labels that are descriptive and understandable by humans. On a standard test set (Reuters 21578), this clustering engine performed better than previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.

Conclusions: We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.

Item Type: Article
ORCID: Sternberg, Paul W. (0000-0002-7699-0173)
Additional Information: © 2006 Chen et al., licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Submission date 28 March 2006; Acceptance date 7 August 2006; Publication date 7 August 2006. Authors’ contributions: DC devised and developed all aspects of the algorithm, performed and analyzed the computational experiments, implemented the web interface and its underlying data processing, and wrote the manuscript. HMM initiated the project, provided the Textpresso database and infrastructure, suggested ideas and edited the manuscript. PWS edited the manuscript, provided materials and supervised the project. All authors read and approved the final manuscript. We thank the Caltech WormBase group for discussions. This project was funded by the generosity of Dr. Anthony Skjellum via the SURF program at the California Institute of Technology, and by the National Human Genome Research Institute at the US National Institutes of Health (grants P41 HG02223 and HG004090).
Funding Agency: Grant Number
Caltech Summer Undergraduate Research Fellowship (SURF): UNSPECIFIED
NIH: P41 HG02223
National Human Genome Research Institute: UNSPECIFIED
PubMed Central ID: PMC1559726
Record Number: CaltechAUTHORS:CHEbmcbioinf06
Persistent URL:
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 4376
Deposited By: Archive Administrator
Deposited On: 21 Aug 2006
Last Modified: 02 Oct 2019 23:12
