A Caltech Library Service

Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature

Müller, Hans-Michael and Kenny, Eimear E. and Sternberg, Paul W. (2004) Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature. PLoS Biology, 2 (11). pp. 1984-1998. ISSN 1544-9173. PMCID PMC517822.

PDF - Published Version
Creative Commons Attribution.




We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso’s two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. 
Textpresso is a useful curation tool, as well as a search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed directly or via WormBase.
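
The abstract describes a pipeline: split text into sentences, mark up terms by ontology category, then search for sentences that contain given categories and keywords. The following is a minimal Python sketch of that idea, not Textpresso's actual implementation. The lexicon entries, the sample text, and the function names (`split_sentences`, `tag_sentence`, `search`) are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical toy lexicon (category -> terms). The entries are invented for
# illustration; the real lexicon is described as having 14,500 entries.
LEXICON = {
    "gene": ["lin-12", "glp-1", "let-23"],
    "regulation": ["regulates", "represses", "activates"],
    "association": ["binds", "interacts with"],
}

# Hypothetical sample text, not taken from the corpus.
TEXT = (
    "LIN-12 interacts with GLP-1 in the gonad. "
    "The let-23 gene was cloned. "
    "lin-12 is expressed early."
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_sentence(sentence):
    """Return {category: [matched terms]} for one sentence, ignoring case."""
    tags = {}
    for category, terms in LEXICON.items():
        for term in terms:
            # Lookarounds stop "lin-12" from matching inside "lin-123".
            pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                tags.setdefault(category, []).append(term)
    return tags


def search(text, category_counts, keywords=()):
    """Return sentences holding at least N distinct terms per category
    and every keyword (case-insensitive substring match)."""
    hits = []
    for sentence in split_sentences(text):
        tags = tag_sentence(sentence)
        has_categories = all(
            len(tags.get(category, [])) >= count
            for category, count in category_counts.items()
        )
        has_keywords = all(k.lower() in sentence.lower() for k in keywords)
        if has_categories and has_keywords:
            hits.append(sentence)
    return hits


if __name__ == "__main__":
    # Two distinct genes plus one association term in the same sentence.
    print(search(TEXT, {"gene": 2, "association": 1}))
    # One gene plus a plain keyword.
    print(search(TEXT, {"gene": 1}, keywords=["cloned"]))
```

The first query mirrors the abstract's example of searching for two uniquely named genes and an interaction term within one sentence. A real system would need a far larger lexicon and a more robust sentence splitter than the regular expression assumed here.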

Item Type:Article
Related URLs:
URL Type: PubMed Central; Description: Article
ORCID: Sternberg, Paul W. (0000-0002-7699-0173)
Additional Information:Copyright: 2004 Müller et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Received November 17, 2003; Accepted July 19, 2004; Published September 21, 2004. This work was supported in part by a grant (# P41 HG02223) from the National Human Genome Research Institute at the United States National Institutes of Health. HMM was a participant in the Initiative in Computational Molecular Biology, which was funded by the Burroughs Wellcome Fund Interfaces program, and was a Howard Hughes Medical Institute Associate, with which Paul W. Sternberg is an Investigator. We thank Juancarlos Chan for programming help, Andrei Petcherski for his help with evaluating the Textpresso system, Robert Li for developing the PDF-to-text conversion software package, and Daniel Wang for the continued acquisition of papers. We thank Igor Antoshechkin, Kimberly Van Auken, Carol Bastiani, Ranjana Kishore, Raymond Lee, Alok Saldanha, Erich Schwarz, Weiwei Zhong, and the anonymous referees for helpful comments on the manuscript. Conflicts of interest: The authors have declared that no conflicts of interest exist. Author contributions: HMM, EEK, and PWS conceived and designed the experiments. HMM, EEK, and PWS performed the experiments. HMM, EEK, and PWS analyzed the data. HMM, EEK, and PWS contributed reagents/materials/analysis tools. HMM, EEK, and PWS wrote the paper.
Funding Agency: Grant Number
NIH: P41 HG02223
National Human Genome Research Institute: UNSPECIFIED
Burroughs Wellcome Fund: UNSPECIFIED
Howard Hughes Medical Institute (HHMI): UNSPECIFIED
Subject Keywords:CGC, Caenorhabditis Genetics Center; GMOD, Generic Model Organism Database; GO, Gene Ontology; PERL, Practical Extraction and Report Language; PMID, PubMed unique identifier; SNOMED, Systematized Nomenclature of Medicine; UMLS, Unified Medical Language System; XML, eXtensible Markup Language; XPDF, a PDF viewer for X
Issue or Number:11
PubMed Central ID:PMC517822
Record Number:CaltechAUTHORS:MULpb04
Persistent URL:
Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:295
Deposited By: Archive Administrator
Deposited On:18 May 2005
Last Modified:02 Oct 2019 22:32
