Rangarajan, Arun and Schedl, Tim and Yook, Karen and Chan, Juancarlos and Haenel, Stephen and Otis, Lolly and Faelten, Sharon and DePellegrin-Connelly, Tracey and Isaacson, Ruth and Skrzypek, Marek S. and Marygold, Steven J. and Stefancsik, Raymund and Cherry, J. Michael and Sternberg, Paul W. and Müller, Hans-Michael (2011) Toward an interactive article: integrating journals and biological databases. BMC Bioinformatics, 12. Art. No. 175. ISSN 1471-2105 http://resolver.caltech.edu/CaltechAUTHORS:20110726-095513577
Creative Commons Attribution.
Background: Journal articles and databases are two major modes of communication in the biological sciences, so integrating these critical resources is urgently needed to increase the pace of discovery. Projects focused on bridging the gap between journals and databases have been on the rise over the last five years and have resulted in automated tools that can recognize entities within a document and link those entities to a relevant database. Unfortunately, automated tools cannot resolve the ambiguities that arise when one term signifies entities that are quite distinct from one another; resolving these ambiguities requires some manual oversight. Finding the right balance between the speed and portability of automation and the accuracy and flexibility of manual effort is crucial to making text markup a successful venture.
Results: We have established a journal article markup pipeline that links GENETICS journal articles and the model organism database (MOD) WormBase. As a first step, the pipeline uses a lexicon built from entities in the database. The entity markup pipeline produces links for over nine classes of objects, including genes, proteins, alleles, phenotypes and anatomical terms. New entities and ambiguities are discovered and resolved by a database curator through a manual quality control (QC) step, with help from authors via a web form that the journal provides to them. New entities discovered through this pipeline are immediately sent to an appropriate curator at the database. Ambiguous entities that do not automatically resolve to one link are resolved by hand, ensuring an accurate link. This pipeline has been extended to other databases, namely the Saccharomyces Genome Database (SGD) and FlyBase, and has been used to mark up a paper with links to multiple databases.
Conclusions: Our semi-automated pipeline hyperlinks articles published in GENETICS to model organism databases such as WormBase. The pipeline produces interactive articles that are data rich and accurately linked. The manual quality control step sets this pipeline apart from other hyperlinking tools and benefits authors, journals, readers and databases.
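The lexicon-based first step described in the Results can be illustrated with a minimal sketch. This is not the authors' software: the lexicon entries, the placeholder identifiers, the `BASE_URL` value and the `link_entities` function are all assumptions invented for illustration. The sketch links a term automatically only when the lexicon offers exactly one candidate, and otherwise leaves the term unlinked and reports it for manual QC.

```python
import re

# Hypothetical lexicon: term -> candidate (entity class, placeholder ID) pairs.
# None of these identifiers are real database accessions.
LEXICON = {
    "lin-12": [("gene", "GENE:0001")],
    "unc-4": [("gene", "GENE:0002")],
    "e120": [("allele", "ALLELE:0001")],
    "pharynx": [("anatomy", "ANAT:0001")],
    # One term, two distinct candidate entities: not resolvable automatically.
    "LIN-12": [("protein", "PROT:0001"), ("antibody", "AB:0001")],
}

BASE_URL = "https://example.org/db"  # placeholder, not a real database URL


def link_entities(text, lexicon=LEXICON, base_url=BASE_URL):
    """Return (marked-up text, list of terms that need manual QC)."""
    # Try longer terms first so a longer term wins over a term it contains.
    terms = sorted(lexicon, key=len, reverse=True)
    # The lookarounds stop "unc-4" from matching inside "unc-42".
    pattern = re.compile(
        r"(?<![\w-])(" + "|".join(re.escape(t) for t in terms) + r")(?![\w-])"
    )
    needs_qc = []

    def replace(match):
        term = match.group(1)
        candidates = lexicon[term]
        if len(candidates) == 1:
            entity_class, entity_id = candidates[0]
            return f'<a href="{base_url}/{entity_class}/{entity_id}">{term}</a>'
        # Ambiguous term: leave the text untouched and flag it for a curator.
        needs_qc.append(term)
        return term

    return pattern.sub(replace, text), needs_qc


if __name__ == "__main__":
    marked, qc = link_entities("The lin-12 allele e120 affects the pharynx.")
    print(marked)
    print("Needs manual QC:", qc)
```

Matching is case-sensitive by design in this sketch, so a lowercase term and its uppercase counterpart can carry different lexicon entries. Terms absent from the lexicon are simply not linked here; finding such new entities is the job the abstract assigns to curators and the author web form.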
|Additional Information:||© 2011 Rangarajan et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Received: 13 August 2010; Accepted: 19 May 2011; Published: 19 May 2011. We thank members of WormBase, SGD, the GSA editorial board, Dartmouth Journal Services and FlyBase for helpful discussions and collaborations. We thank the anonymous reviewers for helpful critiques of the manuscript. This work was funded in part by grants #R01-HG004090, #P41-HG002223, #P41-HG000739 and #P41-HG001315 from the National Human Genome Research Institute (NHGRI) at the United States National Institutes of Health. Work by TS was funded by GM63310 and GM085150. PWS is an investigator with the Howard Hughes Medical Institute. Authors’ contributions: AR developed, maintains and runs the linking software and the software for obtaining the entities from the biological databases. TS initiated the project, provided the workflow, did initial manual quality control and oversees the project. KY does the manual quality control for WormBase, provides feedback to correct the linking software and the pipeline, and designed the author form for declaring new entities. JC set up the author form, integrated author data into forms used by WormBase curators and provides access to WormBase data. SH runs the source XML composition software. LO and SF send article source files to databases, receive the returned, linked article, and communicate with the authors at the proof stage. TDC helped to initiate and oversees the project. RI obtains a WormBase paper ID for each new article published by GENETICS and sends authors the new author form. MSS does manual QC for SGD. JMC initiated the project at SGD, provided access to SGD data and oversees the project for SGD. RS does manual QC for FlyBase. SJM initiated the project at FlyBase, provided access to FlyBase data and oversees the project for FlyBase. PWS oversees the project at WormBase and helps with QC. HMM provided the Textpresso software used as part of the linking software, and also helps run the software. All the authors read and approved the final manuscript.|
|Official Citation:||Rangarajan et al.: Toward an interactive article: integrating journals and biological databases. BMC Bioinformatics 2011 12:175.|
|Usage Policy:||No commercial reproduction, distribution, display or performance rights in this work are provided.|
|Deposited By:||Jason Perez|
|Deposited On:||27 Jul 2011 18:44|
|Last Modified:||26 Dec 2012 13:25|