Published April 25, 2024 | Version Published
Journal Article Open

Perspectives on tracking data reuse across biodata resources

Creators

  • 1. California Institute of Technology

Abstract

Motivation

Data reuse is a common and vital practice in molecular biology and enables the knowledge gathered over recent decades to drive discovery and innovation in the life sciences. Much of this knowledge has been collated into molecular biology databases, such as UniProtKB, and these resources derive enormous value from sharing data among themselves. However, quantifying and documenting this kind of data reuse remains a challenge.

Results

The article reports on a one-day virtual workshop hosted by the UniProt Consortium in March 2023, attended by representatives from biodata resources, experts in data management, and NIH program managers. Workshop discussions focused on strategies for tracking data reuse, best practices for reusing data, and the challenges associated with data reuse and tracking. Surveys and discussions showed that data reuse is widespread, but critical information for reproducibility is sometimes lacking. Challenges include costs of tracking data reuse, tensions between tracking data and open sharing, restrictive licenses, and difficulties in tracking commercial data use. Recommendations that emerged from the discussion include: development of standardized formats for documenting data reuse, education about the obstacles posed by restrictive licenses, and continued recognition by funding agencies that data management is a critical activity that requires dedicated resources.

Availability and implementation

Summaries of survey results are available at: https://docs.google.com/forms/d/1j-VU2ifEKb9C-sW6l3ATB79dgHdRk5v_lESv2hawnso/viewanalytics (survey of data providers) and https://docs.google.com/forms/d/18WbJFutUd7qiZoEzbOytFYXSfWFT61hVce0vjvIwIjk/viewanalytics (survey of users).

Copyright and License

Acknowledgement

The UniProt Consortium: Alex Bateman, Maria-Jesus Martin, Sandra Orchard, Michele Magrane, Shadab Ahmad, Emily H. Bowler-Barnett, Hema Bye-A-Jee, Paul Denny, Tunca Dogan, ThankGod Ebenezer, Jun Fan, Leonardo Jose da Costa Gonzales, Abdulrahman Hussein, Alexandr Ignatchenko, Giuseppe Insana, Rizwan Ishtiaq, Vishal Joshi, Dushyanth Jyothi, Swaathi Kandasaamy, Antonia Lock, Aurelien Luciani, Jie Luo, Yvonne Lussi, Pedro Raposo, Daniel L. Rice, Rabie Saidi, Rafael Santos, Elena Speretta, James Stephenson, Prabhat Totoo, Nidhi Tyagi, Preethi Vasudev, Kate Warner, Rossana Zaru, Supun Wijerathne, Khawaja Talal Ibrahim, Minjoon Kim, Juan Marin at the EMBL—European Bioinformatics Institute; Alan J. Bridge, Lucila Aimo, Ghislaine Argoud-Puy, Andrea H. Auchincloss, Kristian B. Axelsen, Parit Bansal, Delphine Baratin, Teresa M. Batista Neto, Jerven T. Bolleman, Emmanuel Boutet, Lionel Breuza, Blanca Cabrera Gil, Cristina Casals-Casas, Elisabeth Coudert, Beatrice Cuche, Edouard de Castro, Anne Estreicher, Maria L. Famiglietti, Marc Feuermann, Elisabeth Gasteiger, Sebastien Gehant, Arnaud Gos, Nadine Gruaz, Chantal Hulo, Nevila Hyka-Nouspikel, Florence Jungo, Arnaud Kerhornou, Philippe Le Mercier, Damien Lieberherr, Patrick Masson, Anne Morgat, Ivo Pedruzzi, Sandrine Pilbout, Lucille Pourcel, Sylvain Poux, Monica Pozzato, Manuela Pruess, Nicole Redaschi, Catherine Rivoire, Christian J. A. Sigrist, Shyamala Sundaram and Anastasia Sveshnikova at the SIB Swiss Institute of Bioinformatics; Cathy H Wu, Cecilia N Arighi, Chuming Chen, Yongxing Chen, Hongzhan Huang, Kati Laiho, Minna Lehvaslaiho, Peter McGarvey, Darren A Natale, Karen Ross, C.R. Vinayaka, Yuqi Wang and Jian Zhang at the Protein Information Resource. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Data Availability

The data underlying this article are available in the article and in its online supplementary material. Summaries of the survey results are available at: https://docs.google.com/forms/d/1j-VU2ifEKb9C-sW6l3ATB79dgHdRk5v_lESv2hawnso/viewanalytics (survey of data providers) and https://docs.google.com/forms/d/18WbJFutUd7qiZoEz-bOytFYXSfWFT61hVce0vjvIwIjk/viewanalytics (survey of users).

Supplementary data are available at Bioinformatics Advances online.

Contributions

Karen E. Ross (Conceptualization [lead], Investigation [equal], Writing—original draft [lead], Writing—review and editing [lead]), Fredric B. Bastian (Investigation [equal], Writing—review and editing [supporting]), Matt Buys (Investigation [equal], Writing—review and editing [supporting]), Charles E. Cook (Conceptualization [supporting], Investigation [equal], Writing—review and editing [supporting]), Peter D’Eustachio (Investigation [equal], Writing—review and editing [supporting]), Melissa Harrison (Investigation [equal], Writing—review and editing [supporting]), Henning Hermjakob (Conceptualization [supporting], Investigation [equal], Writing—review and editing [supporting]), Donghui Li (Investigation [equal], Writing—review and editing [supporting]), Phillip Lord (Investigation [equal], Writing—review and editing [supporting]), Darren A. Natale (Investigation [equal], Writing—review and editing [supporting]), Bjoern Peters (Investigation [equal], Writing—review and editing [supporting]), Paul W. Sternberg (Investigation [equal], Writing—review and editing [supporting]), Andrew I. Su (Conceptualization [supporting], Investigation [equal], Writing—review and editing [supporting]), Matthew Thakur (Investigation [equal], Writing—review and editing [supporting]), Paul D. Thomas (Investigation [equal], Writing—review and editing [supporting]), and Alex Bateman (Conceptualization [lead], Investigation [equal], Project administration [lead], Writing—review and editing [lead]).

Conflict of Interest

A.B. is Editor-in-Chief of Bioinformatics Advances, but was not involved in the editorial process of this manuscript.

Funding

This work has been supported by the National Institutes of Health [U24HG007822; U24HG007822-09S1].

Files

vbae057.pdf

Files (992.5 kB)

Name Size
md5:dde4c275bb5d020f1b33453984f12d15 976.4 kB
md5:574afda99e8fd8c97c7201297b8fec09 16.2 kB

Additional details

Identifiers

Funding

National Institutes of Health
U24HG007822
National Institutes of Health
U24HG007822-09S1

Caltech Custom Metadata

Caltech groups
Division of Biology and Biological Engineering (BBE)