Published January 8, 2025 | Submitted v1
Discussion Paper Open

Characterization and automated classification of sentences in the biomedical literature: a case study for biocuration of gene expression and protein kinase activity

Abstract

Biological knowledgebases are essential resources for biomedical researchers, providing ready access to gene function and genomic data. Professional, manual curation of knowledgebases, however, is labor-intensive and thus high-performing machine learning methods that improve biocuration efficiency are needed. Here we report on sentence-level classification to identify biocuration-relevant sentences in the full text of published references for two gene function data types: gene expression and protein kinase activity. We performed a detailed characterization of sentences from references in the WormBase bibliography and used this characterization to define three tasks for classifying sentences as either 1) fully curatable, 2) fully and partially curatable, or 3) all language-related. We evaluated various machine learning (ML) models applied to these tasks and found that GPT and BioBERT achieve the highest average performance, resulting in F1 performance scores ranging from 0.89 to 0.99 depending upon the task. Our findings demonstrate the feasibility of extracting biocuration-relevant sentences from full text. Integrating these models into professional biocuration workflows, such as those used by the Alliance of Genome Resources and the ACKnowledge community curation platform, might well facilitate efficient and accurate annotation of the biomedical literature.
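
For readers unfamiliar with the F1 metric cited above, the following is a minimal sketch of how precision, recall, and F1 could be computed for a binary sentence classifier. It is not the authors' evaluation code. The function name `f1_score` and the label lists are hypothetical and for illustration only; the labels are not data from the paper.

```python
def f1_score(gold, predicted):
    """Return (precision, recall, F1) for binary labels.

    Assumption: label 1 marks a curatable sentence, 0 marks any other sentence.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels for eight sentences (illustrative, not from the paper).
gold = [1, 1, 1, 0, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 1, 1, 0, 0]

precision, recall, f1 = f1_score(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so a high score requires that the classifier both finds most relevant sentences and rarely flags irrelevant ones.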

Copyright and License

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Supplemental Material

Data Availability

Acknowledgement

We thank Chris Grove, Ranjana Kishore, and Pengyuan Li for critical feedback on this work. Their insightful comments and constructive suggestions greatly improved the quality and clarity of this publication.

Funding

The ACKnowledge project is funded by R01 OD023041 from the National Library of Medicine. The Alliance of Genome Resources is funded by U24HG010859 from the National Human Genome Research Institute and the National Heart, Lung and Blood Institute. WormBase and FlyBase are funded by the National Human Genome Research Institute by U24HG002223 and U41HG000739, respectively. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Contributions

Daniela Raciti - Conceptualization, Data curation, Funding acquisition, Investigation, Validation, Writing – original draft, Writing – review & editing
Kimberly Van Auken - Conceptualization, Data curation, Funding acquisition, Investigation, Validation, Writing – original draft, Writing – review & editing
Valerio Arnaboldi - Conceptualization, Formal Analysis, Funding acquisition, Investigation, Methodology, Software, Writing – original draft, Writing – review & editing

Files

2025.01.06.631539v1.full.pdf
Files (1.5 MB)
md5:11de32799ea54ec4fc65e97909c4ed0d  69.1 kB
md5:cb89ed2a0051e87827b1545210fdba98  755.5 kB
md5:e38d7d579ba2d42b00ce43882ab1664d  50.2 kB
md5:0ac8d1f1c6d8c3aa2fe6d9a7bd749ce6  53.9 kB
md5:4d0c47cf5979c5523592dc7ccf1b26d7  184.9 kB
md5:5ecc29d78bfcae96cfaa41187e7a7476  343.9 kB

Additional details

Created: March 12, 2025
Modified: March 12, 2025