Published June 5, 2024
Journal Article Open

Measuring data rot: An analysis of the continued availability of shared data from a Single University

  • 1. California Institute of Technology

Abstract

To determine where data is shared and what data is no longer available, this study analyzed data shared by researchers at a single university. A total of 2166 supplemental data links were harvested from the university’s institutional repository and web scraped using R. All links that failed to scrape or could not be tested algorithmically were tested for availability by hand. Data availability was then examined for trends by link type, age of publication, and data source. Results show that researchers shared data in hundreds of places. About two-thirds of links to shared data were in the form of URLs and one-third were DOIs, with several FTP links and links directly to files. A surprising 13.4% of shared URL links pointed to a website homepage rather than a specific record on a website. After testing, 5.4% of the 2166 supplemental data links were found to be no longer available. DOIs were the link type least likely to disappear, with a 1.7% loss, compared with a URL loss of 5.9% averaged over time. Links from older publications were more likely to be unavailable, with a data disappearance rate estimated at 2.6% per year, as were links to data hosted on journal websites. The results support best practice guidance to share data in a data repository using a permanent identifier.
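The link categories named in the abstract (URL, DOI, FTP, direct file link, and URLs that point only to a homepage) can be illustrated with a short sketch. This is not the author's code, which was written in R and is available at the DOI listed under Code Availability. The function names, the list of file extensions, and the rule that a homepage is an HTTP(S) link with an empty path are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Assumed (hypothetical) set of extensions that mark a link as pointing
# directly to a file rather than to a landing page.
FILE_EXTENSIONS = (".zip", ".csv", ".pdf", ".txt", ".xlsx", ".tar.gz")


def classify_link(link: str) -> str:
    """Return 'ftp', 'doi', 'file', or 'url' for a shared-data link."""
    cleaned = link.strip()
    parsed = urlparse(cleaned)
    host = (parsed.hostname or "").lower()
    if parsed.scheme == "ftp":
        return "ftp"
    # Treat doi.org resolver links and bare "doi:" identifiers as DOIs.
    if host in ("doi.org", "dx.doi.org") or cleaned.lower().startswith("doi:"):
        return "doi"
    if parsed.path.lower().endswith(FILE_EXTENSIONS):
        return "file"
    return "url"


def is_homepage(link: str) -> bool:
    """True when an HTTP(S) link has no path, query, or fragment."""
    parsed = urlparse(link.strip())
    return (
        parsed.scheme in ("http", "https")
        and parsed.path in ("", "/")
        and not parsed.query
        and not parsed.fragment
    )


if __name__ == "__main__":
    examples = [
        "https://doi.org/10.22002/h5e81-spf62",
        "ftp://ftp.example.org/pub/data.zip",
        "https://example.org/files/data.csv",
        "https://www.example.org/",
    ]
    for link in examples:
        print(link, classify_link(link), is_homepage(link))
```

A classification like this says nothing about whether a link still resolves. In the study, availability was tested by scraping, with manual checks for links that could not be tested algorithmically.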

Copyright and License

© 2024 Kristin A. Briney. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Acknowledgement

The author thanks Tom Morrell for downloading metadata from the CaltechAUTHORS repository, recommending CyberDuck for testing FTP links, and providing feedback on a copy of the manuscript draft. Thanks to George Porter for answering questions about the specifications of the CaltechAUTHORS metadata. Thank you to George, Tony Diaz, and others in the Caltech Library who have dedicated a huge amount of time and effort to populating the CaltechAUTHORS repository and ensuring the quality of its metadata.

Funding

The author received no specific funding for this work.

Data Availability

The data for this article is available in CaltechDATA, https://doi.org/10.22002/h5e81-spf62, under a CC0 license.

Code Availability

Code used in this article is available in CaltechDATA, https://doi.org/10.22002/d2h9g-5q152, under a GNU GPL license.

Conflict of Interest

The author has declared that no competing interests exist.

Files

journal.pone.0304781.pdf (915.2 kB)
md5:fa5168010fa7de21145ded05880bbdde

Additional details

Created: September 5, 2025
Modified: September 5, 2025