A Caltech Library Service

Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature

Müller, H.-M. and Van Auken, Kimberly M. and Li, Y. and Sternberg, P. W. (2018) Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature. BMC Bioinformatics, 19. Art. No. 94. ISSN 1471-2105. PMCID PMC5845379.



Background: The biomedical literature continues to grow at a rapid pace, making the challenge of knowledge retrieval and extraction ever greater. Tools that provide a means to search and mine the full text of literature thus represent an important way by which the efficiency of these processes can be improved.

Results: We describe the next generation of the Textpresso information retrieval system, Textpresso Central (TPC). TPC builds on the strengths of the original system by expanding the full text corpus to include the PubMed Central Open Access Subset (PMC OA), as well as the WormBase C. elegans bibliography. In addition, TPC allows users to create a customized corpus by uploading and processing documents of their choosing. TPC is UIMA compliant, to facilitate compatibility with external processing modules, and takes advantage of Lucene indexing and search technology for efficient handling of millions of full text documents. Like Textpresso, TPC searches can be performed using keywords and/or categories (semantically related groups of terms), but to provide better context for interpreting and validating queries, search results may now be viewed as highlighted passages in the context of full text. To facilitate biocuration efforts, TPC also allows users to select text spans from the full text and annotate them, create customized curation forms for any data type, and send resulting annotations to external curation databases. As an example of such a curation form, we describe integration of TPC with the Noctua curation tool developed by the Gene Ontology (GO) Consortium.

Conclusion: Textpresso Central is an online literature search and curation platform that enables biocurators and biomedical researchers to search and mine the full text of literature by integrating keyword and category searches with viewing search results in the context of the full text. It also allows users to create customized curation interfaces, use those interfaces to make annotations linked to supporting evidence statements, and then send those annotations to any database in the world.

Item Type:Article
ORCID:
Van Auken, Kimberly M.: 0000-0002-1706-4196
Sternberg, P. W.: 0000-0002-7699-0173
Additional Information:© 2018 The Author(s). Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated. Received: 13 July 2017. Accepted: 1 March 2018. Published: 9 March 2018. We would like to thank Daniela Raciti, Christian Grove, and Valerio Arnaboldi for testing the system and helpful discussion, and Seth Carbon and Chris Mungall for assistance with the Noctua-Textpresso Central communication protocol. This work was supported by USPHS grant U41-HG002223 (WormBase) and U41-HG002273 (Gene Ontology Consortium). Paul W. Sternberg is an investigator of the Howard Hughes Medical Institute. Authors’ contributions: HMM designed the system, developed the software, implemented the system and wrote the paper. KVA designed and tested the system, and wrote the paper. YL developed the software and implemented the system. PWS designed and tested the system, and supervised the project. All authors read and approved the final manuscript. The authors declare that they have no competing interests.
Funding Agency:Howard Hughes Medical Institute (HHMI)
Grant Number:UNSPECIFIED
Subject Keywords:Literature curation - Text mining - Information retrieval - Information extraction - Literature search engine - Ontology - Model organism databases
PubMed Central ID:PMC5845379
Record Number:CaltechAUTHORS:20180309-075052784
Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:85218
Deposited By: Ruth Sustaita
Deposited On:09 Mar 2018 17:11
Last Modified:03 Oct 2019 19:28
