Characterization and automated classification of sentences in the biomedical literature: a case study for biocuration of gene expression and protein kinase activity
Abstract
Biological knowledgebases are essential resources for biomedical researchers, providing ready access to gene function and genomic data. Professional, manual curation of knowledgebases, however, is labor-intensive and thus high-performing machine learning methods that improve biocuration efficiency are needed. Here we report on sentence-level classification to identify biocuration-relevant sentences in the full text of published references for two gene function data types: gene expression and protein kinase activity. We performed a detailed characterization of sentences from references in the WormBase bibliography and used this characterization to define three tasks for classifying sentences as either 1) fully curatable, 2) fully and partially curatable, or 3) all language-related. We evaluated various machine learning (ML) models applied to these tasks and found that GPT and BioBERT achieve the highest average performance, resulting in F1 performance scores ranging from 0.89 to 0.99 depending upon the task. Our findings demonstrate the feasibility of extracting biocuration-relevant sentences from full text. Integrating these models into professional biocuration workflows, such as those used by the Alliance of Genome Resources and the ACKnowledge community curation platform, might well facilitate efficient and accurate annotation of the biomedical literature.
Copyright and License
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Supplemental Material
- Supplementary File S1 [supplements/631539_file03.pdf]
- Supplementary Table 1 [supplements/631539_file04.xlsx]
- Supplementary Table 1 Caption [supplements/631539_file05.pdf]
- Supplementary Table 2 [supplements/631539_file06.csv]
- Supplementary Table 2 Caption [supplements/631539_file07.pdf]
Data Availability
Acknowledgement
We thank Chris Grove, Ranjana Kishore, and Pengyuan Li for critical feedback on this work. Their insightful comments and constructive suggestions greatly improved the quality and clarity of this publication.
Funding
The ACKnowledge project is funded by RO1 OD023041 from the National Library of Medicine. The Alliance of Genome Resources is funded by U24HG010859 from the National Human Genome Research Institute and the National Heart, Lung and Blood Institute. WormBase and FlyBase are funded by the National Human Genome Research Institute through U24HG002223 and U41HG000739, respectively. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Contributions
- Daniela Raciti: Conceptualization, Data curation, Funding acquisition, Investigation, Validation, Writing – original draft, Writing – review & editing
- Kimberly Van Auken: Conceptualization, Data curation, Funding acquisition, Investigation, Validation, Writing – original draft, Writing – review & editing
- Valerio Arnaboldi: Conceptualization, Formal Analysis, Funding acquisition, Investigation, Methodology, Software, Writing – original draft, Writing – review & editing
Files
Name | Size |
---|---|
md5:11de32799ea54ec4fc65e97909c4ed0d | 69.1 kB |
md5:cb89ed2a0051e87827b1545210fdba98 | 755.5 kB |
md5:e38d7d579ba2d42b00ce43882ab1664d | 50.2 kB |
md5:0ac8d1f1c6d8c3aa2fe6d9a7bd749ce6 | 53.9 kB |
md5:4d0c47cf5979c5523592dc7ccf1b26d7 | 184.9 kB |
md5:5ecc29d78bfcae96cfaa41187e7a7476 | 343.9 kB |
Additional details
- United States National Library of Medicine
- ACKnowledge project RO1 OD023041
- National Human Genome Research Institute
- Alliance of Genome Resources U24HG010859
- National Heart Lung and Blood Institute
- National Human Genome Research Institute
- WormBase U24HG002223
- National Human Genome Research Institute
- FlyBase U41HG000739
- Caltech groups: Division of Biology and Biological Engineering (BBE), WormBase
- Publication Status: Submitted