A Caltech Library Service

Accelerated variant curation from scientific literature using biomedical text mining

Mallick, Rishab and Arnaboldi, Valerio and Davis, Paul and Diamantakis, Stavros and Zarowiecki, Magdalena and Howe, Kevin (2022) Accelerated variant curation from scientific literature using biomedical text mining. microPublication Biology, 2022. Art. No. 000578. ISSN 2578-9430. PMCID PMC9160977.

PDF - Published Version
Creative Commons Attribution.


Biological databases collect and standardize data through biocuration. Even though major model organism databases have adopted some automation of curation methods, a large portion of biocuration is still performed manually. To speed up the extraction of the genomic positions of variants, we have developed a hybrid approach that combines regular expressions, Named Entity Recognition based on BERT (Bidirectional Encoder Representations from Transformers) and bag-of-words to extract variant genomic locations from C. elegans papers for WormBase. Our model has a precision of 82.59% for the gene-mutation matches tested on extracted text from 100 papers, and even recovers some data not discovered during manual curation. Code at:
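
To make the regular-expression step and the gene-mutation matching mentioned in the abstract more concrete, here is a minimal Python sketch. It is not the authors' released code. The patterns GENE_RE and MUTATION_RE, the function gene_mutation_pairs, the rule of pairing mentions that share a sentence, and the example sentence are all assumptions made for illustration. The BERT-based Named Entity Recognition and bag-of-words components are omitted.

```python
import re

# Hypothetical pattern for C. elegans-style gene names such as "abc-1".
GENE_RE = re.compile(r"\b[a-z]{3,4}-\d+\b")

# Hypothetical pattern for single-letter protein substitutions such as "G56R".
AMINO = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_RE = re.compile(rf"\b[{AMINO}]\d+[{AMINO}]\b")


def gene_mutation_pairs(text):
    """Pair every gene mention with every mutation mention in the same sentence."""
    pairs = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        genes = GENE_RE.findall(sentence)
        mutations = MUTATION_RE.findall(sentence)
        for gene in genes:
            for mutation in mutations:
                pairs.append((gene, mutation))
    return pairs


# Made-up example text, not taken from any paper.
example = (
    "A G56R substitution was found in abc-1. "
    "The strain xyz-2 was used as a control."
)
print(gene_mutation_pairs(example))  # [('abc-1', 'G56R')]
```

Sentence-level co-occurrence is a deliberately crude matching rule. It produces false pairs whenever one sentence mentions several genes, which is one reason a real pipeline would add further evidence such as entity recognition before reporting a match.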

Item Type: Article
Related URLs:
URL Type | Description
Central | Article
ORCID:
Arnaboldi, Valerio | 0000-0002-2563-5374
Davis, Paul | 0000-0001-5545-0824
Diamantakis, Stavros | 0000-0002-0273-3406
Zarowiecki, Magdalena | 0000-0001-6102-7731
Howe, Kevin | 0000-0002-1751-9226
Additional Information: © 2022 by the authors. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Received: 2/8/2022. Revision Received: 5/19/2022. Accepted: 6/1/2022. Published: 6/1/2022.
Funding for WormBase is from US National Human Genome Research Institute [U24 HG002223]; UK Medical Research Council [MR/S000453/1]; UK Biotechnology and Biological Sciences Research Council [BB/P024610/1, BB/P024602/1]. Rishab Mallick was a participant in the Google Summer of Code 2021 program.
Author Contributions:
Rishab Mallick: Writing - original draft, Methodology, Investigation, Visualization.
Valerio Arnaboldi: Conceptualization, Supervision, Software, Writing - review & editing.
Paul Davis: Data curation, Validation.
Stavros Diamantakis: Data curation, Validation.
Magdalena Zarowiecki: Conceptualization, Data curation, Funding acquisition, Project administration, Supervision, Writing - review & editing.
Kevin Howe: Funding acquisition, Supervision, Writing - review & editing.
Funding Agency | Grant Number
NIH | U24 HG002223
Medical Research Council (UK) | MR/S000453/1
Biotechnology and Biological Sciences Research Council (BBSRC) | BB/P024610/1
Biotechnology and Biological Sciences Research Council (BBSRC) | BB/P024602/1
Google Summer of Code 2021 | UNSPECIFIED
PubMed Central ID: PMC9160977
Record Number: CaltechAUTHORS:20220606-736182000
Official Citation: Mallick, R; Arnaboldi, V; Davis, P; Diamantakis, S; Zarowiecki, M; Howe, K (2022). Accelerated variant curation from scientific literature using biomedical text mining. microPublication Biology. 10.17912/micropub.biology.000578.
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 115036
Deposited By: George Porter
Deposited On: 07 Jun 2022 15:16
Last Modified: 07 Jun 2022 15:16
