Research Data Management
An Interactive Workshop
Tom Morrell
Using Kristin Briney’s
Research Data Management Workbook
Bi 252
December 5, 2024
https://doi.org/10.7907/6g4v6-3r537
Can you document your research?
• Write a Project-Level README.txt
• Document a research project you’re working on/have worked on
• Section 2.2 in the Research Data Management Workbook (Page 12)
• Can fill in directly in workbook
• Put up a red sticky note if you have questions
• Put up a green sticky note when you’re done
Current Research Data Practices
Akers, K. G. & Doty, J. Disciplinary differences in faculty research data management practices and perspectives. Int. J. Digit. Curation 8, 5–26 (2013). doi:10.2218/ijdc.v8i2.263 (Emory)
Shen, Y. Strategic Planning for a Data-Driven, Shared-Access Research Enterprise: Virginia Tech Research Data Assessment and Landscape Study. Coll. Res. Libr. 77, 500–519 (2016). doi:10.5860/crl.77.4.500
Most researchers store data on local computer hard drives
Researchers report that finding data is their biggest challenge
How Reusable is Research Data Today?
• Morphological characteristics of plants and animals
• 516 publications using a specific analysis technique between 1991 and 2011
• 25% of emails didn’t work
• 38% didn’t respond to email
• 13% didn’t have data
• 4% didn’t want to share
• Received 19% of data
• Availability decreased with time
Vines, T. H. et al. The availability of research data declines rapidly with article age. Curr. Biol. 24, 94–97 (2014). doi:10.1016/j.cub.2013.11.014
[Figure: Percentage of Publications Where Data Exists vs. Age of Paper (Years)]
Data Quality
Andrew, R. L. et al. Assessing the reproducibility of discriminant function analyses. PeerJ (2015). doi:10.7717/peerj.1137
On Average, 13% of Papers Had Usable Data
Why is it better to have data available?
Berman, H. M., Kleywegt, G. J., Nakamura, H. & Markley, J. L. The Protein Data Bank archive as an open data resource. J. Comput. Aided Mol. Des. 1009–1014 (2014). doi:10.1007/s10822-014-9770-y
Abramson, J. et al. Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3. Nature 630 (8016), 493–500 (2024). doi:10.1038/s41586-024-07487-w
www.rcsb.org
Why is it better to have data available?
Journals require data availability:
Commitment Statement in the Earth, Space, and Environmental Sciences for depositing and sharing data
https://copdess.org/enabling-fair-data-project/commitment-statement-in-the-earth-space-and-environmental-sciences/
Stall, S. et al. Make scientific data FAIR. Nature 570, 27–29 (2019). doi:10.1038/d41586-019-01720-7
Why is it better to have data available?
“Scientific data underlying peer-reviewed scholarly publications resulting from federally funded research should be made freely available and publicly accessible by default at the time of publication.”
2022 OSTP Memo
https://www.whitehouse.gov/wp-content/uploads/2022/08/08-2022-OSTP-Public-Access-Memo.pdf
Data Management Plans
• Expected Data
• Data Formats and Metadata
• Access to Data
• Data Archiving
One Example:
Applies to proposals submitted on or after Jan 25, 2023
“NIH expects that in drafting Plans, researchers will maximize the appropriate sharing of scientific data”
“NIH encourages the use of established repositories”
“Shared scientific data should be made accessible as soon as possible” (at publication or by the end of grant)
https://sharing.nih.gov/data-management-and-sharing-policy/about-data-management-and-sharing-policies/research-covered-under-the-data-management-sharing-policy#after
https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html
Simple Solutions
• Choose a file naming/organization scheme
• Save reasonable files
• Use reliable storage
• Plan for sharing
Naming
• Trying to recreate your work months/years later is hard
• Choosing a consistent naming system makes things easier
20241003biological_data_presentation1
Sortable Date, Readable Description, “Version”
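A naming pattern like the one on this slide can be generated by a small helper so that every file follows it. The following is a minimal Python sketch, assuming the three parts are a YYYYMMDD date, a lowercase underscore-separated description, and a trailing integer version. The function name `make_filename` and the clean-up rules are illustrative assumptions, not part of the workbook.

```python
import re
from datetime import date


def make_filename(day: date, description: str, version: int, ext: str = "") -> str:
    """Build a name such as 20241003biological_data_presentation1.

    Assumed layout (illustrative): sortable date + readable description + version.
    """
    # YYYYMMDD sorts chronologically when names are sorted alphabetically.
    stamp = day.strftime("%Y%m%d")
    # Lowercase the description and collapse every run of characters that is
    # not a letter or digit into a single underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", description.lower()).strip("_")
    return f"{stamp}{slug}{version}{ext}"


print(make_filename(date(2024, 10, 3), "Biological data presentation", 1))
```

Passing `ext=".csv"` appends a file extension. A group that prefers a separator between the date and the description could add one inside the f-string.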
Make your own file naming convention
• Think about one group of files you work with
• Section 3.2 in the Research Data Management Workbook (Page 19)
• Can fill in directly in workbook
• Put up a red sticky note if you have questions
• Put up a green sticky note when you’re done
Data Architectures
[Folder diagram] Simple: Name. Date Based: 20241103, Name. Complex: Name, Y2024, M11, D03.
Full worksheet at: https://doi.org/10.7907/894q-zr22
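The date-based and complex layouts can both be derived from a single date. Below is a minimal Python sketch, assuming the project folder ("Name") sits at the top and the date folders nest beneath it. The slide's diagram does not survive in text form, so that nesting order and both helper names are assumptions.

```python
from datetime import date
from pathlib import PurePosixPath


def date_based_dir(project: str, day: date) -> PurePosixPath:
    """One folder per day, e.g. Name/20241103 (nesting order is assumed)."""
    return PurePosixPath(project) / day.strftime("%Y%m%d")


def complex_dir(project: str, day: date) -> PurePosixPath:
    """Nested year/month/day folders, e.g. Name/Y2024/M11/D03."""
    return PurePosixPath(project) / f"Y{day:%Y}" / f"M{day:%m}" / f"D{day:%d}"


day = date(2024, 11, 3)
print(date_based_dir("Name", day))
print(complex_dir("Name", day))
```

`PurePosixPath` only builds the path text. Swapping in `pathlib.Path` and calling `.mkdir(parents=True, exist_ok=True)` would create the folders on disk.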
Metadata
[Folder diagram] Name, 20161123; Name, Y2016, M11, D23
• What other information will be useful for reanalysis?
• Use standard terms (https://fairsharing.org/)
• Store with data
• README template
• JSON or XML document
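One way to store metadata with the data is a JSON "sidecar" file saved next to each data file. The sketch below uses only the Python standard library. The helper name `write_sidecar` and every field name and value in the example are hypothetical; a real project would take its terms from a community standard such as those listed on FAIRsharing.

```python
import json
import tempfile
from pathlib import Path


def write_sidecar(data_file: Path, metadata: dict) -> Path:
    """Write metadata to <data file name>.json in the same folder as the data."""
    sidecar = data_file.with_name(data_file.name + ".json")
    sidecar.write_text(json.dumps(metadata, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return sidecar


# Demonstration in a throwaway folder; all fields below are made-up examples.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "20161123_example.csv"
    data_file.write_text("sample,value\nA,1.0\n", encoding="utf-8")
    sidecar = write_sidecar(
        data_file,
        {
            "title": "Example measurements",
            "creator": "Researcher Name",
            "date_collected": "2016-11-23",
            "units": {"value": "arbitrary units"},
        },
    )
    print(sidecar.name)
    print(sidecar.read_text(encoding="utf-8"))
```

Keeping the sidecar in the same folder means the description travels with the data when the folder is copied or deposited.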
Save Reasonable Files
• Human-readable text files are best (.txt, .csv)
• Non-proprietary files are better than proprietary
• Do analysis with scripts if possible
• Save both input and output files as space allows
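Doing analysis with a script leaves the input file untouched and makes the output reproducible from it. Below is a minimal sketch using Python's standard `csv` and `statistics` modules. The column name `value`, the function name `summarize`, and the choice of summary statistics are hypothetical.

```python
import csv
import statistics
from pathlib import Path


def summarize(input_csv: Path, output_csv: Path, column: str = "value") -> dict:
    """Read one numeric column from input_csv and write summary stats to output_csv."""
    # The input file is opened read-only and is never modified.
    with input_csv.open(newline="", encoding="utf-8") as f:
        values = [float(row[column]) for row in csv.DictReader(f)]
    summary = {
        "n": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }
    # The output is a plain-text CSV, so it stays human-readable.
    with output_csv.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(summary))
        writer.writeheader()
        writer.writerow(summary)
    return summary
```

Saving the script alongside the input and output files records exactly how the output was produced.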
Active Data Storage
• Small amounts of data (GB) are easy
• TB-scale data require planning
• Need a system that will be reliable
• Caltech-Managed Storage
• Network-Attached Storage (Local RAID array)
• Cloud Storage
Caltech-Managed Storage
• Box and OneDrive
• Great for sharing with collaborators
• Excellent for sensitive data (HIPAA, FERPA)
• 1 TB free Box for labs; 200 GB free on OneDrive
• Caltech HPC Storage
• Up to 20 TB free per lab
• Redundant but not backed up
https://www.imss.caltech.edu/services/collaboration-storage-backups/storage-comparison
Local vs Cloud Storage
Local Storage
• RAID array can protect against data loss
• Reasonably low cost (4 TB: $400; 112 TB: $3000)
• Need to plan space requirements
• Need to manage
Cloud Storage
• Defined or flexible storage
• Vendor Managed
• Continuous cost
• Limited by bandwidth
• Dependent on vendor
Disaster Recovery
• What Happens in a Disaster?
• Use 2 mirrored NAS units in 2 locations
• Mirror local storage to cloud storage (AWS Glacier Deep Archive)
Earthquake damage at Cal State Northridge, 1994, by Stickpen, CC0 License
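A mirror only helps in a disaster if it actually matches the primary copy. The sketch below shows one way to check that by comparing SHA-256 checksums of two folder trees, using only the Python standard library. It does not perform the mirroring itself, and the function names and report keys are illustrative assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_mirrors(primary: Path, mirror: Path) -> dict:
    """Report files that are missing from, extra in, or different in the mirror."""
    a, b = manifest(primary), manifest(mirror)
    return {
        "missing_in_mirror": sorted(set(a) - set(b)),
        "extra_in_mirror": sorted(set(b) - set(a)),
        "changed": sorted(k for k in set(a) & set(b) if a[k] != b[k]),
    }
```

An empty list under every key means the two trees hold the same files with the same contents. Saving the output of `manifest` as a text file also gives a record to check a restored copy against later.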
Assess your file storage
• How are your research files being stored today?
• Section 4.1 in the Research Data Management Workbook (Page 23)
• Can fill in directly in workbook
• Put up a red sticky note if you have questions
• Put up a green sticky note when you’re done
Data Sharing
• FAIR (Findability, Accessibility, Interoperability, Reusability)
• Subject Repositories
• General Repositories
• Institutional Repositories
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016). doi:10.1038/sdata.2016.18
Subject Repositories
• Protein Data Bank
• GenBank
• Wormbase
• Pangaea
• Long Term Ecological Research Data Portal
• Good listing: journals.plos.org/plosone/s/data-availability
• Thousands more: www.re3data.org
General Repositories
• Zenodo (CERN - Free)
• Dryad (Nonprofit - $120 per submission + Space)
• Figshare (20GB Max)
• Mendeley Data (Elsevier - Free)
• Dataverse (Harvard - Free)
CaltechDATA
• Available at data.caltech.edu
• Easy to describe and upload files
• All records get a DOI (permanent, registered link)
• 500 GB free uploads (soon to be 1 TB)
• Integration with GitHub
• API for accessing data
• Library takes care of preserving and maintaining access to files
Discoverability
• CaltechDATA site search
• DOIs appear in DataCite search
• Broad discoverability
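Because the DOIs are registered with DataCite, their metadata can also be looked up programmatically. The sketch below builds a lookup URL for the public DataCite REST API (`https://api.datacite.org/dois/<doi>`) and reads the first title from a response. The function names are illustrative, and the response handling assumes the `data.attributes.titles` layout of DataCite's JSON output.

```python
import json
import urllib.parse
import urllib.request

DATACITE_API = "https://api.datacite.org/dois/"


def datacite_url(doi: str) -> str:
    """Return the DataCite REST API URL for a bare DOI or a doi.org link."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Keep the slash between DOI prefix and suffix; escape anything unusual.
    return DATACITE_API + urllib.parse.quote(doi, safe="/")


def first_title(record: dict) -> str:
    """Pull the first title out of a DataCite response body, or '' if absent."""
    titles = record.get("data", {}).get("attributes", {}).get("titles", [])
    return titles[0]["title"] if titles else ""


def fetch_record(doi: str) -> dict:
    """Download the metadata record for a DOI (requires network access)."""
    with urllib.request.urlopen(datacite_url(doi), timeout=30) as response:
        return json.load(response)
```

With network access, `first_title(fetch_record("https://doi.org/10.22002/D1.224"))` would look up one of the dataset DOIs listed later in these slides.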
Make a dataset deposit
• Pick a dataset that’s ready for sharing, or a dataset associated with a published paper from your lab (check https://authors.library.caltech.edu/)
• You can use the test version of CaltechDATA https://data.caltechlibrary.dev/ if the dataset isn’t ready yet
• Can follow workbook section 6.2 (page 38)
• Put up a red sticky note if you have questions
• Put up a green sticky note when you’re done
Data Files
https://doi.org/10.22002/D1.224
https://doi.org/10.22002/D1.227
https://doi.org/10.22002/D1.228
https://doi.org/10.22002/D1.229
https://rpgroup-pboc.github.io/mwc_induction
https://doi.org/10.22002/D1.299
Paper Website on GitHub
https://doi.org/10.1101/111013
Use Cases
Use Case - TCCON
Total Carbon Column Observing Network (TCCON)
29 Data Collection Sites Around the World
Data files
Data Curation and Processing
Processed Data
https://tccon-wiki.caltech.edu/Sites/Park_Falls
http://tccondata.org/