Research Data Management
An Interactive Workshop
Tom Morrell
Using Kristin Briney’s Research Data Management Workbook
Bi 252
December 5, 2024
https://doi.org/10.7907/6g4v6-3r537
Can you document your research?
Write a Project-Level README.txt
Document a research project you’re working on/have worked on
Section 2.2 in the Research Data Management Workbook (Page 12)
Can fill in directly in workbook
Put up a red sticky note if you have questions
Put up a green sticky note when you’re done
Current Research Data Practices
Akers, K. G. & Doty, J. Disciplinary differences in faculty research data management practices and perspectives. Int. J. Digit. Curation 8, 5-26 (2013). doi:10.2218/ijdc.v8i2.263 (Emory)
Shen, Y. Strategic Planning for a Data-Driven, Shared-Access Research Enterprise: Virginia Tech Research Data Assessment and Landscape Study. Coll. Res. Libr. 77, 500-519 (2016). doi:10.5860/crl.77.4.500
Most researchers store data on local computer hard drives
Researchers report that finding data is their biggest challenge
How Reusable is Research Data Today?
Morphological characteristics of plants and animals
516 publications using a specific analysis technique between 1991 and 2011
25% of emails didn’t work
38% didn’t respond to email
13% didn’t have data
4% didn’t want to share
Received 19% of data
Availability decreased with time
Vines, T. H. et al. The availability of research data declines rapidly with article age. Curr. Biol. 24, 94-97 (2014). doi:10.1016/j.cub.2013.11.014
[Figure: Percentage of Publications Where Data Exists, plotted against Age of Paper (Years)]
Data Quality
Andrew, R. L. et al. Assessing the reproducibility of discriminant function analyses. PeerJ (2015). doi:10.7717/peerj.1137
On Average, 13% of Papers Had Usable Data
Why is it better to have data available?
Berman, H. M., Kleywegt, G. J., Nakamura, H. & Markley, J. L. The Protein Data Bank archive as an open data resource. J. Comput. Aided Mol. Des. 1009-1014 (2014). doi:10.1007/s10822-014-9770-y
Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493-500 (2024). doi:10.1038/s41586-024-07487-w
www.rcsb.org
Why is it better to have data available?
Journals require data availability:
Commitment Statement in the Earth, Space, and Environmental Sciences for depositing and sharing data
https://copdess.org/enabling-fair-data-project/commitment-statement-in-the-earth-space-and-environmental-sciences/
Stall, S. et al. Make scientific data FAIR. Nature 570, 27-29 (2019). doi:10.1038/d41586-019-01720-7
Why is it better to have data available?
“Scientific data underlying peer-reviewed scholarly publications resulting from federally funded research should be made freely available and publicly accessible by default at the time of publication.”
2022 OSTP Memo
https://www.whitehouse.gov/wp-content/uploads/2022/08/08-2022-OSTP-Public-Access-Memo.pdf
Expected Data
Data Formats and Metadata
Access to Data
Data Archiving
Data Management Plans
One Example:
Started with proposals submitted on or after Jan 25, 2023
“NIH expects that in drafting Plans, researchers will maximize the appropriate sharing of scientific data”
“NIH encourages the use of established repositories”
“Shared scientific data should be made accessible as soon as possible” (at publication or by the end of grant)
https://sharing.nih.gov/data-management-and-sharing-policy/about-data-management-and-sharing-policies/research-covered-under-the-data-management-sharing-policy#after
https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html
Simple Solutions
Choose a file naming/organization scheme
Save reasonable files
Use reliable storage
Plan for sharing
Naming
Trying to recreate your work months/years later is hard
Choosing a consistent naming system makes things easier
20241003biological_data_presentation1
Sortable Date
Readable Description
“Version”
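The three parts of the example name (sortable date, readable description, version) can be assembled by a small helper. This is a minimal sketch, not a prescribed convention: the function name `build_filename`, the lower-case-and-underscore rule for the description, and the optional extension argument are all assumptions made for illustration.

```python
import re
from datetime import date


def build_filename(when: date, description: str, version: int, ext: str = "") -> str:
    """Join a sortable date, a readable description, and a version number."""
    # YYYYMMDD sorts chronologically when file names are sorted as text.
    stamp = when.strftime("%Y%m%d")
    # Assumed rule: lower-case the description and turn any run of
    # non-alphanumeric characters into a single underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", description.lower()).strip("_")
    return f"{stamp}{slug}{version}{ext}"


print(build_filename(date(2024, 10, 3), "Biological data presentation", 1))
# prints 20241003biological_data_presentation1
```

Generating names from one function keeps every file in a group consistent, which is the point of choosing a convention in the first place.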
Make your own file naming convention
Think about one group of files you work with
Section 3.2 in the Research Data Management Workbook (Page 19)
Can fill in directly in workbook
Put up a red sticky note if you have questions
Put up a green sticky note when you’re done
Data Architectures
Simple
Name
Date Based
20241103
Name
Complex
Name
Y2024
M11
D03
Full worksheet at: https://doi.org/10.7907/894q-zr22
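One possible reading of the simple, date-based, and complex layouts is sketched below. It assumes that the named file sits inside the dated folders (for example `Y2024/M11/D03/Name`); the slide does not state the nesting order, and the function names are hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


def simple_path(name: str) -> PurePosixPath:
    """Simple layout: the file lives directly in the project folder."""
    return PurePosixPath(name)


def date_based_path(when: date, name: str) -> PurePosixPath:
    """Date-based layout: one folder per day, named YYYYMMDD."""
    return PurePosixPath(when.strftime("%Y%m%d")) / name


def complex_path(when: date, name: str) -> PurePosixPath:
    """Complex layout (assumed nesting): year, month, and day folders."""
    return PurePosixPath(f"Y{when:%Y}", f"M{when:%m}", f"D{when:%d}") / name


day = date(2024, 11, 3)
print(simple_path("Name"))            # Name
print(date_based_path(day, "Name"))   # 20241103/Name
print(complex_path(day, "Name"))      # Y2024/M11/D03/Name
```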
Metadata
Name
20161123
Name
Y2016
M11
D23
What other information will be useful for reanalysis?
Use standard terms (https://fairsharing.org/)
Store with data
README template
JSON or XML document
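Storing metadata with the data can be as simple as writing a JSON file next to each data file. The sketch below is illustrative only: the `.metadata.json` suffix, the function name `write_metadata`, and every field name and value in the demo are hypothetical placeholders, not a standard schema.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(data_file: Path, metadata: dict) -> Path:
    """Write metadata as JSON beside the data file and return the new path."""
    sidecar = data_file.with_name(data_file.name + ".metadata.json")
    sidecar.write_text(json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8")
    return sidecar


# Demo in a throwaway directory; all names and values are made up.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "20241103_growth_curves.csv"
    data_file.write_text("time_h,od600\n0,0.05\n", encoding="utf-8")
    sidecar = write_metadata(
        data_file,
        {
            "description": "Example growth curve measurements",
            "collected": "2024-11-03",
            "units": {"time_h": "hours", "od600": "absorbance"},
        },
    )
    print(sidecar.name)
```

Keeping the sidecar in the same folder means the description travels with the data when the folder is copied or deposited.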
Save Reasonable Files
Human-readable text files are best (.txt, .csv)
Non-proprietary files are better than proprietary
Do analysis with scripts if possible
Save both input and output files as space allows
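A scripted analysis that reads a plain-text input and writes a plain-text output might look like the following sketch. The column names `sample` and `value`, the function name `summarize`, and the choice of a per-sample mean are assumptions for illustration; the input file is left untouched so that both input and output can be kept.

```python
import csv
from collections import defaultdict
from pathlib import Path
from statistics import mean


def summarize(input_csv: Path, output_csv: Path) -> None:
    """Read (sample, value) rows and write one summary row per sample."""
    groups = defaultdict(list)
    with input_csv.open(newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            groups[row["sample"]].append(float(row["value"]))
    # The output goes to a separate file; the input is never modified.
    with output_csv.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["sample", "n", "mean_value"])
        for sample in sorted(groups):
            writer.writerow([sample, len(groups[sample]), mean(groups[sample])])
```

Calling `summarize(Path("input.csv"), Path("summary.csv"))` (hypothetical file names) reruns the whole analysis from the saved input, which is what makes the result reproducible later.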
Active Data Storage
Small amounts of data (GB) are easy
TB-scale data require planning
Need a system that will be reliable
Caltech-Managed Storage
Network-Attached Storage (Local RAID array)
Cloud Storage
Caltech-Managed Storage
Box and OneDrive
Great for sharing with collaborators
Excellent for sensitive data (HIPAA, FERPA)
1 TB free Box for labs; 200 GB free on OneDrive
Caltech HPC Storage
Up to 20 TB free per lab
Redundant but not backed up
https://www.imss.caltech.edu/services/collaboration-storage-backups/storage-comparison
Local vs Cloud Storage
Local Storage
RAID array can protect against data loss
Reasonably low cost (4 TB - $400; 112 TB - $3000)
Need to plan space requirements
Need to manage
Cloud Storage
Defined or flexible storage
Vendor Managed
Continuous cost
Limited by bandwidth
Dependent on vendor
Disaster Recovery
What Happens in a Disaster?
Use 2 mirrored NAS units in 2 locations
Mirror local storage to cloud storage (AWS Glacier Deep Archive)
Earthquake damage at Cal State Northridge, 1994 by Stickpen, CC0 License
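A mirror only helps in a disaster if the copies actually match. One way to check, sketched here as an assumption rather than a described workflow, is to compare SHA-256 checksums of every file in the two locations. The function names are hypothetical, and both locations are assumed to be reachable as ordinary folders.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root: Path) -> dict:
    """Map each file's path relative to root onto its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def mirror_differences(primary: Path, mirror: Path) -> list:
    """List relative paths that are missing from one side or differ in content."""
    a, b = manifest(primary), manifest(mirror)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty list from `mirror_differences` means every file exists in both folders with identical content.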
Assess your file storage
How are your research files being stored today?
Section 4.1 in the Research Data Management Workbook (Page 23)
Can fill in directly in workbook
Put up a red sticky note if you have questions
Put up a green sticky note when you’re done
Data Sharing
FAIR (Findability, Accessibility, Interoperability, Reusability)
Subject Repositories
General Repositories
Institutional Repositories
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016). doi:10.1038/sdata.2016.18
Subject Repositories
Protein Data Bank
GenBank
Wormbase
Pangaea
Long Term Ecological Research Data Portal
Good listing: journals.plos.org/plosone/s/data-availability
Thousands more: www.re3data.org
General Repositories
Zenodo (CERN - Free)
Dryad (Nonprofit - $120 per submission + Space)
Figshare (20GB Max)
Mendeley Data (Elsevier - Free)
Dataverse (Harvard - Free)
CaltechDATA
Available at data.caltech.edu
Easy to describe and upload files
All records get a DOI (permanent, registered link)
500 GB free uploads (soon to be 1 TB)
Integration with GitHub
API for accessing data
Library takes care of preserving and maintaining access to files
Discoverability
CaltechDATA site search
DOIs appear in DataCite search
Broad discoverability
Make a dataset deposit
Pick a dataset that’s ready for sharing, or a dataset associated with a published paper from your lab (check https://authors.library.caltech.edu/)
You can use the test version of CaltechDATA https://data.caltechlibrary.dev/ if the dataset isn’t ready yet
Can follow workbook section 6.2 (page 38)
Put up a red sticky note if you have questions
Put up a green sticky note when you’re done
Data Files
https://doi.org/10.22002/D1.224
https://doi.org/10.22002/D1.227
https://doi.org/10.22002/D1.228
https://doi.org/10.22002/D1.229
https://rpgroup-pboc.github.io/mwc_induction
https://doi.org/10.22002/D1.299
Paper Website on GitHub
https://doi.org/10.1101/111013
Use Cases
Use Case - TCCON
Total Carbon Column Observing Network (TCCON)
29 Data Collection Sites Around the World
Data files
Data Curation and Processing
Processed Data
https://tccon-wiki.caltech.edu/Sites/Park_Falls
http://tccondata.org/