Mallick, Rishab and Arnaboldi, Valerio and Davis, Paul and Diamantakis, Stavros and Zarowiecki, Magdalena and Howe, Kevin (2022) Accelerated variant curation from scientific literature using biomedical text mining. microPublication Biology, 2022. Art. No. 000578. ISSN 2578-9430. PMCID PMC9160977. https://resolver.caltech.edu/CaltechAUTHORS:20220606-736182000
Abstract
Biological databases collect and standardize data through biocuration. Even though major model organism databases have adopted some automation of curation methods, a large portion of biocuration is still performed manually. To speed up the extraction of the genomic positions of variants, we have developed a hybrid approach that combines regular expressions, Named Entity Recognition based on BERT (Bidirectional Encoder Representations from Transformers) and bag-of-words to extract variant genomic locations from C. elegans papers for WormBase. Our model has a precision of 82.59% for the gene-mutation matches tested on extracted text from 100 papers, and even recovers some data not discovered during manual curation. Code at: https://github.com/WormBase/genomic-info-from-papers
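As a rough illustration of the regular-expression stage mentioned in the abstract, the sketch below pairs gene names with protein-level mutation mentions that occur in the same sentence, and computes precision over the resulting pairs. The patterns, the function names `gene_mutation_pairs` and `precision`, and the example sentence are all assumptions made for illustration; they are not taken from the WormBase repository, and the BERT-based Named Entity Recognition and bag-of-words stages are not shown.

```python
import re

# Hypothetical patterns, written for this sketch only.
# C. elegans-style gene names: 3-4 lowercase letters, a hyphen, digits.
GENE_RE = re.compile(r"\b[a-z]{3,4}-\d+\b")
# Protein-level substitutions in one-letter amino acid code, e.g. A10T.
AA = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_RE = re.compile(rf"\b[{AA}]\d+[{AA}]\b")


def gene_mutation_pairs(text):
    """Return (gene, mutation) pairs that co-occur in the same sentence."""
    pairs = []
    # Naive sentence split on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        genes = GENE_RE.findall(sentence)
        mutations = MUTATION_RE.findall(sentence)
        pairs.extend((g, m) for g in genes for m in mutations)
    return pairs


def precision(predicted, correct):
    """Fraction of predicted pairs that also appear in the curated set."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted & set(correct)) / len(predicted)


if __name__ == "__main__":
    # Invented sentence, not quoted from any paper.
    sample = "The abc-1 allele carries an A10T substitution. Worms were grown on plates."
    print(gene_mutation_pairs(sample))
```

Same-sentence co-occurrence is only a simple matching heuristic chosen for this sketch; it over-generates pairs when a sentence names several genes, which is one reason a precision figure is worth tracking.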
Item Type: | Article
---|---
Additional Information: | © 2022 by the authors. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Received: 2/8/2022. Revision Received: 5/19/2022. Accepted: 6/1/2022. Published: 6/1/2022. Funding for WormBase is from US National Human Genome Research Institute [U24 HG002223]; UK Medical Research Council [MR/S000453/1]; UK Biotechnology and Biological Sciences Research Council [BB/P024610/1, BB/P024602/1]. Rishab Mallick was a participant in the Google Summer of Code 2021 program. Author Contributions. Rishab Mallick: Writing - original draft, Methodology, Investigation, Visualization; Valerio Arnaboldi: Conceptualization, Supervision, Software, Writing - review & editing; Paul Davis: Data curation, Validation; Stavros Diamantakis: Data curation, Validation; Magdalena Zarowiecki: Conceptualization, Data curation, Funding acquisition, Project administration, Supervision, Writing - review & editing; Kevin Howe: Funding acquisition, Supervision, Writing - review & editing.
Group: | WormBase
PubMed Central ID: | PMC9160977
Record Number: | CaltechAUTHORS:20220606-736182000
Persistent URL: | https://resolver.caltech.edu/CaltechAUTHORS:20220606-736182000
Official Citation: | Mallick, R; Arnaboldi, V; Davis, P; Diamantakis, S; Zarowiecki, M; Howe, K (2022). Accelerated variant curation from scientific literature using biomedical text mining. microPublication Biology. 10.17912/micropub.biology.000578.
Usage Policy: | No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: | 115036
Collection: | CaltechAUTHORS
Deposited By: | George Porter
Deposited On: | 07 Jun 2022 15:16
Last Modified: | 07 Jun 2022 15:16