Received: December 22, 2023. Revised: February 16, 2024. Accepted: February 22, 2024
Published by Oxford University Press 2024. This work is written by (a) US Government employee(s) and is in the public domain in the US.
Briefings in Bioinformatics, 2024, 25(3), bbae122
https://doi.org/10.1093/bib/bbae122
Problem Solving Protocol
FAIR Header Reference genome: a TRUSTworthy standard
Adam Wright, Mark D. Wilkinson, Christopher Mungall, Scott Cain, Stephen Richards, Paul Sternberg, Ellen Provin, Jonathan L. Jacobs, Scott Geib, Daniela Raciti, Karen Yook, Lincoln Stein and David C. Molik
Corresponding author. Adam Wright, Adaptive Oncology Program, Ontario Institute for Cancer Research, 661 University Avenue Suite 500, Toronto, ON M5G 0A3, Canada. E-mail: adam.wright@oicr.on.ca
AUTHORS NOTES. The US Department of Agriculture is an equal opportunity lender, provider and employer. Mention of trade names or commercial products in this report is solely to provide specific information and does not imply recommendation or endorsement by the US Department of Agriculture.
Abstract
The lack of interoperable data standards among reference genome data-sharing platforms inhibits cross-platform analysis while increasing the risk of data provenance loss. Here, we describe the FAIR bioHeaders Reference genome (FHR), a metadata standard guided by the principles of Findability, Accessibility, Interoperability and Reuse (FAIR) in addition to the principles of Transparency, Responsibility, User focus, Sustainability and Technology (TRUST). The objective of FHR is to provide an extensive set of data serialisation methods and minimum data field requirements while still maintaining extensibility, flexibility and expressivity in an increasingly decentralised genomic data ecosystem. The effort needed to implement FHR is low; FHR’s design philosophy ensures easy implementation while retaining the benefits gained from recording both machine and human-readable provenance.
Keywords: Reference Genome; provenance; data management; network effect; FASTA
INTRODUCTION
Importance of reference genomes
The large and ever-increasing number of well-characterised reference genomes has become a prerequisite for many essential analyses, including cross-species comparisons and population genomics studies of model and non-model systems [1]. Unfortunately, there are no file header standards for the metadata describing these essential resources, and even basic fields, such as ‘species’ and ‘strain’, are missing from the common reference genome data standards. Instead, such metadata must be kept separately from the files containing the reference genome data, raising the risk of gaps in data provenance and human copying errors and imposing a burden on computational biologists and developers of analytic software alike [2, 3]. The FAIR-bioHeaders Reference genome (FHR) specification aims to provide a standard to maintain the provenance of reference genomes that is translatable across storage and analysis platforms.
History of FASTA
The FASTA file format is widely used in genomic analysis as the repository of reference genome sequence information. FASTA was developed in 1985 [4] and is still widely used for reference genomes, but since then the genomics ecosystem has changed dramatically; the first journal mention of a centralised DNA repository was in Elke Jordan and Christine Carrico’s Science Letter in 1982 [5]. The FASTA format specification (originally the ‘Pearson format’) was created by William Pearson and David Lipman in 1985 [4], but has since been maintained by the National Center for Biotechnology Information (NCBI) at the US National Institutes of Health [4].
Whereas reference genome files once resided in a small number of trusted repositories, such as GenBank, and were downloaded from there to the bioinformatician’s local computer system, it is increasingly the case that these files are modified, reannotated, and redistributed in a decentralised manner [6–8]. This was not the consequence of any deliberative process. Instead, it occurred organically as a consequence of the shared ownership of the reference data and the collaborative nature of science [9]. Decentralisation also occurs when multiple websites host genomic data instead of a single authority, and it is unlikely that this process will reverse [6].
Decentralisation carries risks, not least of which is the loss of provenance metadata that may occur when the files are transferred among resources and when users download the genomes for local processing. Furthermore, the loss of provenance reduces the ability of users to ensure that data are what they claim to be, potentially causing confusion, propagating errors in subsequent analysis, and increasing overall time and effort to reuse data [10].
Deficiencies of FASTA for reference genomes
It is highly problematic that reference genome FASTA files contain no intrinsic information that describes the nature and provenance of their contents; all provenance information must come from external sources and be linked to the file name or checksum. However, both of these methods are prone to loss of information. Files can be easily renamed or overwritten, and when the name
Downloaded from https://academic.oup.com/bib/article/25/3/bbae122/7636762 by California Institute of Technology user on 05 June 2024
has changed, the link to provenance information can be difficult to recover. Checksums, algorithmically derived fingerprints of a file that can be compared to verify its integrity, are also brittle. Commonly performed file manipulations, such as introducing carriage returns when a file generated on a Linux system is opened in a Windows text editor, introduce alterations that do not affect the semantics of the file, but completely change the checksum. By relying on external information for the provenance of the file, bioinformaticians risk associating incorrect metadata with the genome file or even being unable to locate the metadata at all [11, 12].
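The brittleness of whole-file checksums described above can be illustrated with a minimal Python sketch. The toy FASTA content and the `normalised_md5_hex` helper are illustrative assumptions; normalising line endings before hashing is one possible mitigation and is not something the FHR specification is stated to do.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 hex digest of the raw bytes."""
    return hashlib.md5(data).hexdigest()


def normalised_md5_hex(data: bytes) -> str:
    """Hash after converting CRLF and lone CR line endings to LF.

    Hypothetical mitigation for illustration only.
    """
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return hashlib.md5(unified).hexdigest()


# The same toy FASTA content with Unix and with Windows line endings.
unix_fasta = b">contig1\nACGTACGT\nTTGA\n"
windows_fasta = unix_fasta.replace(b"\n", b"\r\n")

# The sequences are semantically identical, but the raw checksums differ.
same_raw = md5_hex(unix_fasta) == md5_hex(windows_fasta)
# After normalising line endings the two digests agree.
same_normalised = normalised_md5_hex(unix_fasta) == normalised_md5_hex(windows_fasta)

print("raw checksums match:", same_raw)
print("normalised checksums match:", same_normalised)
```

A metadata link that depends on the raw checksum would be lost after such an edit, even though no base in the sequence has changed.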
Differences can arise when a reference genome is replicated across platforms or devices (e.g. renaming of files or contigs, removal of contigs that fail to meet some criteria such as minimum length, the removal and addition of metadata, etc.), leading to a gradual divergence of reference genome files and their metadata (i.e. the genome data and metadata divergence problem; divergence problems are described by Haslhofer 2010 [13]). Furthermore, discrepancies can arise when a genome assembly is updated with additional data, due to inexact version matching from multiple genome assembly versions and user updates across platforms.
To address the discrepancies that arise from replication, what could be called a reference genome authority is typically implemented (e.g. NCBI Assembly, GenBank, DDBJ), a central site that provides the authoritative version of a reference genome and its origin. While reference genome authorities are the ideal solution, in the current biological data environment, a reference genome authority is not always a practical solution. Two recent examples of reference genomes being published in multiple locations illustrate not only the need for genome hosting to be available, but also the necessity of decentralised assembly hosting and how file-level discrepancies can be introduced.
One example is the American Type Culture Collection (ATCC), a major biorepository and living culture collection that provides researchers with the physical strains and cell lines needed for their research. Historically, materials obtained from ATCC have been subjected to whole genome sequencing by researchers using those materials in their own research. The resulting genome assemblies produced by researchers are often submitted to the NCBI Assembly reference genome database in order to disseminate and share the data [14]. The NCBI Assembly reference database, in this case, may be thought of as the reference genome authority. However, gaps in genomics data quality, data provenance and the traceability of materials used by researchers have contributed significantly to the scientific reproducibility crisis (reviewed in Hirsch 2019 [15]). In response to issues of provenance and authenticity, ATCC launched the ATCC Genome Portal to establish its own quality control and provenance standards associated with genome references that represent authentic ATCC materials [16]. ATCC has thus far produced over 4,000 high-quality or closed reference genomes for microbes within the ATCC collection, all under an ISO 9000 controlled quality assurance framework. This presents some dilemmas, however, as the NCBI Assembly database includes (for example) genome references for bacterial strains that have serious gaps in metadata or include substantial errors in their genome assembly when compared to the ATCC Genome Portal reference [17]. Reducing discrepancies between genome references for the ‘same’ organism can be aided by improving our ability to include crucial metadata about the origins of and means by which each genome reference is created in-line with the sequence data itself.
Another example of discrepancies that can arise from gaps in provenance can be found in molecular data portals and genome browsers. Several organism-focused genome data portals, such as AgBase [18], FlyBase [19], SoyBase [20], wFleaBase [21], WormBase [22], VectorBase [23], Ensembl [24] and others [25], publish annotations that are not found in the NCBI Assembly database. In some cases, these annotations and associated genomes cannot be submitted due to data ownership conflicts. These genome browsers and data repositories are often associated with a larger consortium that is working to answer questions of interest to the relevant scientific communities. Examples of such consortiums are the i5k [26, 27] Workspace [28], a collaborative effort to annotate arthropod genomes, and the Alliance of Genome Resources (The Alliance) [29], a centralised Model Organism resource. However, these communities have specific data requirements, which can lead to the data portals and genome browsers becoming the primary source of their scientific communities’ reference genomes. Regardless of the resources that currently host the genome, there is no link to the source of the file, including its metadata and associated publications. Since decentralisation is currently ongoing, it is unreasonable to imagine a world in which reference genome assemblies are not shared across platforms.
In this paper, we present the FHR FASTA header specification, which has been developed to address the genome data and metadata divergence problem. A key benefit of FHR is that it minimises the technical impact of adding provenance metadata to FASTA reference genome files by utilising legacy features of the file format instead of adding completely new ones. FHR is designed to enable FAIR and TRUST principles [30, 31], and to reduce the risk of data loss by ensuring that the provenance metadata is tied to the reference genome. FHR is not the only project that attempts to solve the problem of aligning the storage of assembly metadata and the exchange of said data between resources. For example, the Minimum Information about a Genome Sequence (MIGS) specification, published in 2008, provides a set of fields for various types of assemblies with the intent of generating reports that are used to exchange information between resources [32]. Compared to FHR, the MIGS specification provides the minimal information that should be tracked for a reference genome, whereas FHR only provides the fields that should be in a header. Future versions of FHR may provide an extension mechanism that allows for the addition of fields from the full MIGS checklist.
FairGenomes is another project that tries to solve problems with metadata in genomes [33]. FairGenomes is designed for personal human genomes used in medical studies and takes advantage of a more extensive schema designed around the use of stored metadata for personal human genomes in downstream analysis. FHR and FairGenomes have different design goals and use cases. Lastly, databases that store reference genome assemblies must address the storage of metadata about those genome assemblies. FHR has a divergent design scope from related efforts. Efforts to store metadata information about genome assemblies in larger databases are designed either to report information specific to the use case of the database (e.g. a plant breeding database) or to collect large amounts of metadata (e.g. the NCBI Assembly database). MIGS and other efforts could be thought to be connected to the former effort, where maximum metadata information should be recorded. FHR was designed with a much broader scope: to retain identifying information of the assembly and to reduce data loss when an assembly is copied from one system to the next.
It follows then that FHR must address adoption issues in this space and serve a broader user base, weighing the cost-benefit of its adoption differently. Adoption is FHR’s major challenge, and several steps are taken to facilitate adoption: the revival of legacy features for better comparability, the lessening of writing and analysis burden with software tools, and the minimisation of schema and required data. FHR’s methodology follows from this design philosophy.

Table 1: FHR required fields
Field | Example | Description
schema | https://raw.githubusercontent.com/FAIR-bioHeaders/FHR-Specification/main/fhr.json | URI to the FHR schema
schemaVersion | 1 | Version of FHR
genome | Example species | Name of the genome
taxon | | Taxonomic identification of the genome; taxon itself is purely an organisational component to hold taxon name and taxon URI
taxon name | Example species | Property of taxon; name or common name of taxon
taxon uri | https://identifiers.org/taxonomy:0000 | Property of taxon; URL of the taxon information, to a registry if possible
version | 2.3 | Version number of genome
metadataAuthor | | Author of the FHR instance (person or organisation), not the genome; multiple metadataAuthors are allowed; metadataAuthor itself is purely an organisational component
metadataAuthor name | John Doe | Property of author; the name of the author
metadataAuthor uri | https://orcid.org/0000-0002-1983-4588 | Property of author; the URL of the author
assemblyAuthor | | Assembler of the genome (person or organisation); multiple assemblyAuthors are allowed; assemblyAuthor itself is purely an organisational component
assemblyAuthor name | Jane Doe | Property of assembler; the name of the assembler
assemblyAuthor uri | https://orcid.org/0000-0002-9511-5139 | Property of the assembler; the URL of the assembler
dateCreated | 2022-03-21 | The date the genome assembly was created
masking | soft-masked | Any masking that was applied to the reference genome; one of not-masked, hard-masked, soft-masked and repeat-masked
checksum | md5:a3d5d9146c3992b7ed6724409ba28aa9 | Algorithm and hash for the checksum of reference genomes
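One legacy feature of the original FASTA (Pearson) format is the comment line, which begins with a semicolon. The sketch below shows, under stated assumptions, how key-value metadata could travel in such lines ahead of the sequence records and be read back. The bare `;` prefix, the `key: value` layout and the function names are illustrative assumptions; the exact line syntax is defined by the FHR specification and may differ.

```python
from typing import Dict, List, Tuple

# Assumption: metadata rides in ';' comment lines as "key: value" pairs.
# The real FHR line syntax is defined by the specification.
COMMENT_PREFIX = ";"


def write_fasta(metadata: Dict[str, str], records: List[Tuple[str, str]]) -> str:
    """Serialise metadata as leading comment lines, then the FASTA records."""
    lines = [f"{COMMENT_PREFIX}{key}: {value}" for key, value in metadata.items()]
    for name, sequence in records:
        lines.append(f">{name}")
        lines.append(sequence)
    return "\n".join(lines) + "\n"


def read_header(fasta_text: str) -> Dict[str, str]:
    """Collect key-value pairs from the leading comment lines only."""
    metadata: Dict[str, str] = {}
    for line in fasta_text.splitlines():
        if not line.startswith(COMMENT_PREFIX):
            break  # the header ends at the first non-comment line
        key, _, value = line[len(COMMENT_PREFIX):].partition(": ")
        metadata[key] = value
    return metadata


# Example values taken from Table 1.
header = {
    "genome": "Example species",
    "version": "2.3",
    "dateCreated": "2022-03-21",
    "masking": "soft-masked",
}
fasta_text = write_fasta(header, [("contig1", "ACGTACGT")])
recovered = read_header(fasta_text)
print(fasta_text)
```

Because the metadata sits inside the file, it survives renaming and copying. Whether a given downstream tool skips comment lines is a separate question that would need checking per tool.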
METHODS
FHR version 0.1.1 is publicly available on GitHub within the organisation FAIR-bioHeaders. There are two relevant repositories: the specification and FHR-related tools (i.e. the FHR conversion and validation toolkit). The specification is codified within JavaScript Object Notation (JSON) following the JSON Schema specification. The related tools are written in Python and use this JSON schema to validate FHR-specified files.
To ensure that the software is as maintainable as possible, we chose to use a minimal set of well-established dependencies. The Python libraries on which the FHR converter tools depend are YAML Ain’t Markup Language (YAML), JSON, Microdata v0.8.0, re (REGEX), hashlib and JSONSchema v4.17.3. The tool requires Python 3.6 or higher and can be installed using the Python package setuptools version 42 or higher. These dependencies are all that are required to validate and convert FHR files. The goal of using so few dependencies is to reduce the potential for issues that arise when installing the tools and reduce the effort required to maintain the tools.
Additionally, version 0.1.1 of the FHR File Converter is now available on PyPI for easy installation via pip and on Docker Hub as a Docker image, enabling seamless integration into various development and deployment workflows.
RESULTS
The core constituent FHR fields were determined by considering the hypothetical, but unlikely, scenario of catastrophic loss of all copies of a reference genome. In this hypothetical scenario, all digital copies of the genome assembly have been lost, along with raw data. To reconstruct the genome, the following fields would be required: the location of the biological materials used to create the genome, the sequencing instruments used, and the assembly tools used to assemble and quality-check the genome. Therefore, FHR records the location of the biological materials, the sequencing instruments, and the assembly software tools. Furthermore, FHR records other information that would be useful in recovering such a genome: the metadata author of the FHR document, the assembler used and other documentation either in the FHR instance or found in any scholarly articles associated with the genome. FHR also records funding and licensing information, related links and the name and version of the genome.
The intent of FHR is to strike a balance between requiring users to provide only the minimal information about a reference genome assembly and remaining flexible enough to include other information. Therefore, the required fields are the absolute minimum to provide provenance of the data, and the optional fields focus on providing other useful information and flexibility. It is recognized that more information on sample preparation and data processing could be added to the header, but to keep the specification reasonably concise, not all possible fields were added to the specification.
Required fields
The FHR specification has ten required fields: schema, schemaVersion, genome, taxon, version, assemblyAuthor, metadataAuthor, dateCreated, masking and checksum. Please refer to Table 1 and Figure 1.
The schema field indicates which JSON schema specification the metadata adheres to. In combination with the schemaVersion field, it allows users and software to know the exact format of the metadata to which the header conforms.
The genome field is a string that is used to refer to the common name of the genome. The field contents are chosen by those generating the reference genome and can be a human-readable name, an alphanumeric ID or a URI. The latter options are designed to simplify automated analysis.
Figure 1. Minimal FHR FASTA header example with sequence.
The taxon field allows the user to specify the species as both a human-readable string and a link to identifiers.org that provides more information on the genome.
The assemblyAuthor field is a list of authors who participated in generating the genome; FHR supports multiple assembly authors. Typically, these authors would be those who contributed to the original paper(s) describing the sequencing and annotation of the genome, but this choice is left to those who generated the reference genome. In contrast, metadataAuthor identifies the person who authored the header metadata; FHR supports multiple metadata authors, as multiple individuals or organisations can contribute to the recording of its provenance. By providing both authorship fields, FHR supports situations where the creators of the FHR metadata are not the same as the creators of the assembly itself; this is useful when FHR metadata have been added after the fact by another group.
The dateCreated field specifies the creation date for the complete reference genome file. It works hand in hand with the checksum field, which provides a way of confirming that neither the data nor the metadata have changed since the file was created. We consider the checksum field to be one of the FHR format’s most useful features. In addition to being used to ensure that the file has not been corrupted, the checksum can also be recorded by the pipeline to keep track of the exact assembly used within an analysis run. Pipelines typically record the name of the assembly (e.g. HG19), but it is not uncommon for various ad hoc variants of the assembly to be circulated within the community, causing ambiguity. The checksum uniquely identifies the assembly and its metadata, thereby providing an identifier that can be used by analytic pipelines to unambiguously declare which assembly their results are based on.
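Table 1 gives the checksum as an `algorithm:hash` string (for example `md5:a3d5d9146c3992b7ed6724409ba28aa9`). The following Python sketch shows how a pipeline might produce and verify a value in that shape using only `hashlib`. The function names are hypothetical, and exactly which bytes FHR hashes (sequence only, or sequence plus header) is not stated here, so hashing the toy payload below is an assumption made for illustration.

```python
import hashlib


def format_checksum(data: bytes, algorithm: str = "md5") -> str:
    """Return a checksum string in the 'algorithm:hexdigest' shape."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return f"{algorithm}:{digest}"


def verify_checksum(data: bytes, recorded: str) -> bool:
    """Recompute the digest named in `recorded` and compare it."""
    algorithm, _, expected = recorded.partition(":")
    if not expected or algorithm not in hashlib.algorithms_available:
        raise ValueError(f"unusable checksum string: {recorded!r}")
    return hashlib.new(algorithm, data).hexdigest() == expected.lower()


# Assumption: the hashed payload is the sequence portion of the file.
payload = b">contig1\nACGTACGT\n"
recorded = format_checksum(payload)

unchanged_ok = verify_checksum(payload, recorded)
# A single-base edit is detected because the digest no longer matches.
edited_ok = verify_checksum(b">contig1\nACGTACGA\n", recorded)

print(recorded, unchanged_ok, edited_ok)
```

A pipeline could log the `recorded` string alongside its results so that the exact assembly variant used in a run can be identified later.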
The masking field was included as a required field to ensure that any masking that was applied to the reference genome is recorded for reuse. Masking can only be one of ‘not-masked’, ‘soft-masked’, ‘hard-masked’ or ‘repeat-masked’.
Together, these 10 fields allow researchers to identify the format of the metadata, identify the contents of the file and keep track of who made the file and when, thereby enabling provenance-tracking.
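As a rough illustration of the ten required fields, the Python sketch below builds a record from the example values in Table 1 and checks it for missing fields and for an allowed masking value. The nested layout (a dictionary for taxon, lists of dictionaries for the author fields) and the `check_required` helper are assumptions made for this example; the authoritative structure is the FHR JSON schema, and the FHR toolkit validates against that schema rather than with hand-written checks like these.

```python
from typing import Any, Dict, List

# Field names as listed in the text; allowed masking values as listed above.
REQUIRED_FIELDS = (
    "schema", "schemaVersion", "genome", "taxon", "version",
    "assemblyAuthor", "metadataAuthor", "dateCreated", "masking", "checksum",
)
MASKING_VALUES = {"not-masked", "soft-masked", "hard-masked", "repeat-masked"}


def check_required(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means none found."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in record
    ]
    masking = record.get("masking")
    if masking is not None and masking not in MASKING_VALUES:
        problems.append(f"unsupported masking value: {masking}")
    return problems


# Example values from Table 1; the nesting is an illustrative assumption.
example = {
    "schema": "https://raw.githubusercontent.com/FAIR-bioHeaders/FHR-Specification/main/fhr.json",
    "schemaVersion": 1,
    "genome": "Example species",
    "taxon": {"name": "Example species", "uri": "https://identifiers.org/taxonomy:0000"},
    "version": "2.3",
    "metadataAuthor": [{"name": "John Doe", "uri": "https://orcid.org/0000-0002-1983-4588"}],
    "assemblyAuthor": [{"name": "Jane Doe", "uri": "https://orcid.org/0000-0002-9511-5139"}],
    "dateCreated": "2022-03-21",
    "masking": "soft-masked",
    "checksum": "md5:a3d5d9146c3992b7ed6724409ba28aa9",
}

print(check_required(example))
```

Returning a list of problems, rather than raising on the first one, lets a caller report every gap in a header in a single pass.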
Optional fields
In addition to the ten required fields, FHR encourages the addition of up to 11 optional fields. These fields provide additional information on how the genome was assembled and distributed. Refer to Table 2 and Figure 2.
The identifier, relatedLink, and scholarlyArticle fields provide the user with links to external information about the genome that is not present in the FASTA file itself. The identifier field is used to associate compact URIs (‘CURIEs’) with the reference genome. Identifier is the main field from which a genome assembly can be mapped to a schema or database of an organisation. Both relatedLink and scholarlyArticle are conventional URLs. relatedLink is intended to associate the reference genome with its host site and mirrors, while scholarlyArticle is intended to link the genome to its marker paper. All these fields can take multiple values.
FHR requires both URLs and identifiers to be publicly accessible and persistent. In particular, taxonomic information (i.e. taxon) such as species and isolate/cultivar must be identified using taxonomy URIs registered with identifiers.org in the taxon uri, as well as the name of the taxon in the taxon name. When using CURIEs, FHR recommends that they be registered with identifiers.org so that the user can conveniently find the resource associated with the CURIE. As an alternative to identifiers.org, FHR allows Bioregistry [34] CURIEs to be used. If there is no suitable URI for use in an identifier field, the FHR specification calls for the use of the fields relatedLink and documentation described later. vitalStats was also added to record common statistics of reference genomes.
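To make the CURIE convention concrete, the sketch below expands a compact `prefix:id` identifier into a resolvable URL. The identifiers.org form matches the taxon uri example in Table 1; the Bioregistry base URL, the `RESOLVERS` table and the `expand_curie` function are assumptions for illustration, and whether a given prefix is actually registered with either resolver is not checked here.

```python
# Assumed resolver base URLs; registration of a prefix is not verified.
RESOLVERS = {
    "identifiers.org": "https://identifiers.org/",
    "bioregistry": "https://bioregistry.io/",
}


def expand_curie(curie: str, resolver: str = "identifiers.org") -> str:
    """Turn a 'prefix:local_id' CURIE into a URL at the chosen resolver."""
    prefix, separator, local_id = curie.partition(":")
    if not separator or not prefix or not local_id:
        raise ValueError(f"not a CURIE of the form prefix:id: {curie!r}")
    if resolver not in RESOLVERS:
        raise ValueError(f"unknown resolver: {resolver!r}")
    return f"{RESOLVERS[resolver]}{prefix}:{local_id}"


# The placeholder taxonomy identifier used in Table 1.
taxon_url = expand_curie("taxonomy:0000")
print(taxon_url)
```

Storing the compact form in the header and expanding it on demand keeps the header short while still pointing at a public, persistent resource.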
The assemblySoftware, instrument, accessionID and voucherSpecimen fields provide additional information on how the genome was generated. Instrument is a multivalue field that refers to sequencing machines and DNA prep instrumentation used to generate the reference genome. assemblySoftware is used to store the name and version of the assembly software used in the creation of the genome assembly; accessionID refers to the ID of the genome assembly; and voucherSpecimen is used to describe the location of the sequenced material.
Another optional field, genomeSynonym, can be used to add one or more common names to the reference genome to