WormBase 2024: status and transitioning to Alliance
infrastructure
Paul W. Sternberg
,
1,
*
,†
Kimberly Van Auken,
1
Qinghua Wang,
1
Adam Wright
,
2
Karen Yook,
1
Magdalena Zarowiecki,
3
Valerio Arnaboldi
,
1
Andrés Becerra
,
3
Stephanie Brown
,
4
Scott Cain,
2
Juancarlos Chan,
1
Wen J. Chen,
1
Jaehyoung Cho,
1
Paul Davis,
3
Stavros Diamantakis,
3
Sarah Dyer,
3
Dionysis Grigoriadis,
4
Christian A. Grove,
1
Todd Harris
,
2
Kevin Howe,
3
Ranjana Kishore
,
1
Raymond Lee,
1
Ian Longden,
2
Manuel Luypaert,
3
Hans-Michael Müller,
1
Paulo Nuin,
2
Mark Quinton-Tulloch
,
3
Daniela Raciti,
1
Tim Schedl,
5
Gary Schindelman,
1
Lincoln Stein
2
1
Division of Biology and Biological Engineering 140-18, California Institute of Technology, Pasadena, CA 91125, USA
2
Informatics and Bio-computing Platform, Ontario Institute for Cancer Research, Toronto, ON M5G0A3, Canada
3
European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK
4
School of Infection and Immunity, University of Glasgow, Glasgow G12 8TA, UK
5
Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA
*Corresponding author: Division of Biology and Biological Engineering, mail code 140-18, California Institute of Technology, 1200 East California Blvd., Pasadena, CA 91125,
USA. Email: pws@caltech.edu
†
Surnames are in alphabetical order starting with Sternberg.
WormBase has been the major repository and knowledgebase of information about the genome and genetics of
Caenorhabditis elegans
and other nematodes of experimental interest for over 2 decades. We have 3 goals: to keep current with the fast-paced
C. elegans
re
-
search, to provide better integration with other resources, and to be sustainable. Here, we discuss the current state of WormBase as well
as progress and plans for moving core WormBase infrastructure to the Alliance of Genome Resources (the Alliance). As an Alliance mem
-
ber, WormBase will continue to interact with the
C. elegans
community, develop new features as needed, and curate key information
from the literature and large-scale projects.
Keywords:
genome annotation; knowledgebase; software; disease models; bioinformatics tools; model organism; Global Core Biodata
Resources;
Caenorhabditis elegans
Received on 21 December 2023; accepted on 20 March 2024
© The Author(s) 2024. Published by Oxford University Press on behalf of The Genetics Society of America.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Introduction
The early organization of knowledge about
Caenorhabditis elegans
,
in the form of the
Worm Breeder’s Gazette
, the Cold Spring Harbor
Laboratory Press book
C. elegans
(Wood 1988), Worm Community
System (Shoman
et al.
1995), and Leon Avery’s World Wide Web
Server, coalesced around the genome into ACeDB (A
C. elegans
Data Base;
Durbin and Thierry-Mieg 1994
). WormBase was re
-
branded from ACeDB in 2000 when Lincoln Stein and Paul
Sternberg joined Richard Durbin, John Spieth, and Jean
Thierry-Mieg as leads of the project (
Stein
et al
.
2001). Since
then, WormBase has expanded in breadth and depth (Harris
et al
. 2003;
Harris, Clark
et al
. 2004;
Harris, Chen
et al
. 2004;
Harris
et al.
2010,
2014,
2020;
Chen
et al.
2005;
Schwarz
et al.
2006;
Bieri
et al.
2007;
Rogers
et al
. 2007;
Howe
et al.
2012,
2016;
Yook
et al.
2012;
Lee
et al.
2018;
Davis
et al.
2022) to become an
essential resource for researchers in
C. elegans
, many other nema
-
todes, and the broader biological and bioinformatic communities
(
Baxevanis 2011;
Tang
et al
.
2015
). Over time, Thierry-Mieg and
Spieth left the project; Durbin was replaced by Paul Kersey,
Kevin Howe, and recently, Sarah Dyer. Professional staff has re
-
mained remarkably constant except at European Molecular
Biology Laboratory (EMBL)-European Bioinformatics Institute
(EBI), where the institute’s underlying mission is to promote
molecular biology research in Europe, train scientists, and ensure
that their acquired knowledge and skills are returned to the scien
-
tific community. The Global Biodata Coalition (
Anderson
et al
.
2017
) has designated WormBase as a Core Global Biodata
Resource, recognizing that WormBase is “...of fundamental im
-
portance to the wider biological and life sciences community
and the long-term preservation of biological data.”
Over 20 years ago, several model organism databases (MODs)
formed the Gene Ontology Consortium (GOC) to derive a common
language to describe gene function, and WormBase joined the
consortium shortly thereafter in 2003 (Harris, Clark
et al
.
2004;
Harris, Chen
et al
. 2004
). This common ground set the stage for
the formation of the Alliance of Genome Resources (the
Alliance) in 2016 when several MODs joined forces to plan and im
-
plement shared infrastructure (
Alliance of Genome Resources
Consortium 2022
; Alliance of Genome Resources Consortium
2024
). There are 3 compelling reasons for WormBase to transition
its infrastructure to the Alliance (Alliance Central). First, while
WormBase is an exemplary member of the MOD community,
there is considerable room for improvement. Second, WormBase
funding has been reduced to
<
50%
of 2010 levels not counting in
-
flation, and thus, the current level of service is not sustainable on
its own. Third, the MODs currently do not provide standardized,
facile services to those interested in comparing genes across
GENETICS
, 2024, 227(1), iyae050
https://doi.org/10.1093/genetics/iyae050
Advance Access Publication Date: 4 April 2024
Knowledgebase & Database Resources
major
biomedical
models
(e.g.
worm–human–fly–yeast–fish)
an
increasingly
important,
if not essential,
feature
as comparative
in
-
formation
expands.
Given
this critical
juncture
in the history
of
WormBase,
we choose
to focus
on our transition
to the Alliance
in detail
here,
as well
as to discuss
the
current
state
of
WormBase
for context.
Infrastructure
moving
to the Alliance
For every
MOD,
there
is a need
to balance
rapid
data
updates
with
stable,
reliable
infrastructure.
WormBase
started
with
a database
(ACeDB)
using
the acedb
data
management
system,
which
was
built
for handling
the
C. elegans
genome
project,
the first
animal
genome
to be completed
(
C. elegans
Sequencing
Consortium
1998
;
Hillier
et al.
2005
). In addition,
acedb
allowed
seamless
inte
-
gration
of complex
data
stored
in different
instances,
for example,
2 separate
genome
databases
corresponding
to the then
Sanger
Center
and Washington
University
Genome
Sequencing
Center.
During
2017–2020,
we transitioned
our data
storage
to Datomic,
another
nonrelational
data
management
system
(
Davis
et al.
2022
). The flexibility
of our system
and processes
allowed
flexible
and rapid
data
modeling
but eventually
needed
cached
data
to al
-
low the website
to perform
fast enough
and a lengthy
build
pro
-
cess
that
slowed
the process
of propagating
corrections
to the
website.
On the positive
side,
we built
the website
in a highly
modular
fashion
with
individual
widgets
that
could
obtain
data
from
ACeDB,
Datomic,
or other
data
sources
such
as flat files
on
servers.
This
architecture
allowed
the migration
from
acedb
to
Datomic
to be opaque
to users,
and we expect
it will help
smooth
the less opaque
migration
to the Alliance.
The Alliance
started
with
a graph
database
system
(Neo4J)
that
allowed
data
loading
from
individual
MODs
and running
the new
website;
it was a flexible
solution
to rapidly
get a common
portal
in
place.
This
then
gave
time
to plan
a long-term
solution
for harmo
-
nized
data
at the Alliance,
in the form
of a PostgreSQL
relational
database.
This
architecture
means
new information
can be propa
-
gated
to the website
rapidly,
although
whether
this is continual
or
in monthly
releases
will depend
on the type
of information.
This
more
rapid
propagation
of information
from
curated
knowledge
-
base
to website
will be better
for users.
One
potential
downside
to shared
MOD
infrastructure
is decreased
responsiveness
to
new
features
due to more
requirements
for harmonization,
but
the modular
design
of the Alliance
system
and the increased
num
-
ber of biocurators
and software
engineers
in the now
larger
overall
project
will facilitate
timely
deployment
of improvements.
The data
currently
loaded
into the Alliance
Neo4J
database
re
-
present
the initially
harmonized
data
types
from
WormBase:
a
subset
of all data
types
and,
within
a data
type,
a subset
of data
type
properties.
These
data
include
genes
(including
basic
gene
in
-
formation
like name,
symbol,
synonyms,
and descriptions),
alleles
and variants
(including
transgenic
alleles
and their
constructs),
models
(including
strains
and genotypes),
gene
expression
anno
-
tations
(low-
and
high-throughput),
disease
model
annotations,
phenotype
annotations,
and interactions
(molecular
and genetic).
Data
type
harmonization
for the remaining
data
types
and prop
-
erties
is ongoing
among
Alliance
members,
and the resulting
har
-
monized
data
are being
deposited
into the persistent
PostgreSQL
database
to act as the primary
curation
database
and
eventual
source
of truth
for Alliance-curated
data.
Over
the next
several
months,
Alliance
software
developers
will
be transitioning
the
source
of data
on the public
Alliance
website
from
Neo4J
to the
persistent
PostgreSQL
store
and will begin
providing
newly
har
-
monized
representations
of data
not already
present
on the site.
WormBase
staff
will
continue
to provide
the aforementioned
data
to the Neo4J
database
(until
its retirement)
and in LinkML
format
(see
below)
to the PostgreSQL
persistent
store,
including
all more
recently
harmonized
data
types
and
properties.
Additional
fully
harmonized
data
types
to be imported
into
the
PostgreSQL
database
and,
eventually,
the Alliance
website
in
-
clude
reagents
(e.g.
clones
and antibodies),
genome
features
(e.g.
DNA
binding
sites,
enhancers),
RNAi
phenotype
annotations,
regulatory
interactions
(e.g.
geneA
upregulates
expression
of
geneB),
transcripts
and
proteins,
chromosomes
and
genome
assemblies,
transcription
factors,
expression
clusters
(sets
of cor
-
egulated
genes),
gene
site-of-action
data,
and person
and labora
-
tory
data
(representing
the research
community).
Over
the next
year,
the Neo4J-to-PostgreSQL
database
migration
should
be com
-
pleted
and the Alliance
should
begin
offering
newly
harmonized
data
in the Alliance
public
website
sourced
from
the persistent
curation
database.
Completion
of all remaining
WormBase
data
harmonization,
import,
and representation
on the Alliance
web
-
site will likely
span
the next
2 years.
At this juncture,
it is useful
to discuss
each
major
data
type
in WormBase
and
its status
with
respect
to inclusion
in the
Alliance.
As the Alliance
platform
continues
to be developed,
additional
WormBase
data
types
will systematically
be integrated.
Currently,
a subset
of data
is regularly
imported
from
WormBase
to the Alliance.
Moving
forward,
the WormBase
site will continue
to be supported
until
the transition
to the Alliance
is complete.
Genome
The genome
is first and foremost.
Of the Alliance
members,
only
WormBase,
SGD,
and FlyBase
are responsible
for the ownership
of their
genomes;
however,
maintaining
the genome
is a priority
for WormBase
and remains
crucial
for emerging
genome
commu
-
nities
as well
as other
MODs
that
may
join the Alliance
in the fu
-
ture.
We will switch
genome
sequence
and feature
curation
from
ACeDB
to Apollo
(
Dunn
et al.
2019
), which
will be more
accessible
to researchers
wishing
to improve
sequence
feature
annotations
on their
favorite
nematode
(e.g.
Moya
et al.
2023
).
Genes
structure
models
The set of 19,984
protein-coding
genes
and 28,032
ncRNA
genes
(in
release
WS291)
are fundamental
to most
analyses
with
C. elegans
.
There
is only
a slow
change
in gene
number
and precise
structure
of introns
and UTRs
(see
Yoshimura
et al
. 2019
). Gene
structure
curation
will continue
to be handled
by WormBase
per se.
Genetic
loci
Of the 19,984
protein-coding
genes
in
C. elegans
, 10,838
have
a
name/gene
symbol
(e.g.
lin-12
). Gene
naming
is largely
investigator-initiated,
with
a request
to WormBase
through
genenames@wormbase.org
; this will continue
with
C. elegans
con
-
tent
in the Alliance.
The gene
symbol,
which
is formatted
in low
-
ercase
and
italics,
communicates
information
about
the gene
based
on mutant
phenotype,
functional
criteria,
orthology,
or
homology.
To make
genes
easily
recognizable
to non-
elegans
re
-
searchers,
the current
preference
is to name
genes
with
human
orthologs
after
the human
gene
if characterized.
Gene
name
sta
-
bility
and formatting
are important
to avoid
confusion
in the lit
-
erature
and to facilitate
searches
of other
databases
that
use the
C. elegans
gene
symbols
(e.g.
UniProt)
and text mining.
However,
occasionally
names
have
been
changed
because
of incorrect
orthology
or not supported
by functional
studies
from
the com
-
munity.
Names
have
also
been
changed
based
on requests
from
2 |
P. W. Sternberg
et al.
multiple community members (e.g.
daf-21
renamed as
hsp-90
). To
avoid name changes and to have the published name used in da
-
tabases, researchers should contact WormBase prior to manu
-
script submission; the most common issue is that the requested
name is already in use in
C. elegans
or another model organism
for a nonhomologous gene product. Over the past 10 years, the
community has averaged 180 new gene names per year, often
with a corresponding publication, increasing knowledge about
our favorite organism. More information about
C. elegans
–specific
nomenclature can be found at
https://wormbase.org/about/
userguide/nomenclature.
Orthologs and paralogs
WormBase.org
displays orthology from the Ensembl Compara
pipeline.
https://useast.ensembl.org/info/about/publications.
html. The Alliance has orthology inferences from the Quest
for Orthologs (Nevers
et al.
2022; discussed in
Alliance of
Genome Resources Consortium 2024
); the Alliance ortholog
set allows comparison among the model organisms in a recipro
-
cal manner. If there are any specific issues with ortholog assign
-
ments, the Alliance can manually modify assertions as they do
for ZFIN with curator-chosen orthology assertions. The Alliance
recently added paralogs, a useful feature of the WormBase
homology section. A next step will be to add comparisons of ex
-
pression, disease models, and function among the paralogs
using the Ribbon approach (a feature currently available for
orthologs); this will be a new feature for WormBase users.
With these changes, switching to Alliance orthology inferences
will be beneficial to users.
Genome features
In addition to gene structure models, features such as binding
sites and enhancers are annotated in Gene Feature Format
(GFF3) files and displayed on JBrowse 2 (Diesh
et al
. 2023) and
on genome feature pages (Fig. 1). Discussions on harmonizing
descriptions for genome features at the Alliance are ongoing,
but the existence of standard formats and the Sequence
Ontology (e.g.
Sant
et al.
2021) suggest this approach will work
well.
Variants and alleles
Alleles with phenotypes, alleles with no associated phenotypes, nat
-
ural variants, and other polymorphisms are curated at WormBase
and can be located under the “Genetics” widget of a gene page. In
the Alliance, WormBase “alleles” are classified into 3 variant classes:
“phenotypic variants,” where an allele has a genomic location over
-
lapping a single gene and has been associated with a phenotype; “al
-
leles,” where there is no associated phenotype and the genomic
location may or may not be known; or “high-throughput variants,”
which have genomic locations but no associated phenotypes and in
-
clude natural variants. All WormBase alleles that are considered
“phenotypic variants” have been migrated to the Alliance.
Both WormBase and the Alliance run the Ensembl Variant
Effect Predictor tool (McLaren
et al.
2016) to generate predicted
variant consequences and Human Genome Variation Society
(HGVS) nomenclature (den Dunnen 2017), with missense variants
being further annotated with predicted pathogenicity scores using
PolyPhen-2 (Adzhubei
et al.
2013) and SIFT (Kumar
et al.
2009).
Transgenes and constructs
Transgenes and constructs include integrated and extrachromo
-
somal objects created by the worm community to study, for ex
-
ample, gene function and expression. The construct class records
the individual engineered genomic snippets that are used to create
the transgene in the worm. In the Alliance data modles, transgenes
are considered to be a type of variant because transgenes are treated
as engineered alleles by other members of the Alliance. This change
in classification should not have any impact when looking up these
objects in WormBase or in the Alliance; however, when searching
the Alliance for these objects in batch, users may need to start their
search from the variant class. For now, transgenes and constructs
will continue to be annotated as needed for the curation of other
data types, such as phenotype and gene expression.
Strains
Strain information in WormBase is added and updated, from peri
-
odic data transfers from the Caenorhabditis Genetics Center
(CGC) (
Tuli
et al
. 2018) as well as by curators based on published
papers. In addition, the adoption of disease model curation, where
Fig. 1.
JBrowse with sequence features. A partial view of the
lin
-3 sequence in the JBrowse viewer. “Classical alleles” (amusingly including CRISPR gene
edits) shown in purple provide evidence for the
lin-3
vulval enhancer (
WBsf919657
) shown in green. The top shows the displayed coordinates and allows
navigation with left and right arrows as well as the ability to zoom in or out. Users can choose the tracks that are displayed, with many choices hidden in
the menu at the upper left. Here, curated exons are shown (magenta) (
lin-3
has alternative 5
′
exons, one of which is shown here) in the intron of majority
transcription unit. Protein motifs are shown next (purple), followed by mutant alleles and finally an enhancer (green).
WormBase and the Alliance
| 3
strains
are curated
as models
of disease,
helped
to formalize
strain
curation
in WormBase,
e.g.
strains
are
assigned
a
WBStrain
ID for unambiguous
identification
like other
data
ob
-
jects
such
as transgenes
and
variations.
Strains
are curated
as
part
of phenotype,
gene
expression,
and disease
curation.
RNAi
WormBase
has routinely
remapped
RNAi
reagents
for each
re
-
lease.
Going
forward,
this will be done
annually.
Phenotypes
asso
-
ciated
with
RNAi
experiments
are not yet in the Alliance
but will
be added
in about
a year,
after
the treatment
of sequence-based
reagents
(e.g. RNAI,
morpholinos)
is harmonized.
Antibodies
WormBase
maintains
a curated
list of antibodies
produced
from
individual
laboratories.
We collect
antibody-related
information
from
the literature
and from
direct
researcher
submissions.
This
information
includes
details
such
as the antibody
name,
target
antigen,
species
specificity,
experimental
methods,
and
asso
-
ciated
references.
The Alliance
data
model
for the antibody
class
is harmonized,
and the entire
collection
of WormBase
antibodies
will be incorporated
into the Alliance
within
a year.
Chemicals
and small
molecules
This
data
class
includes
drugs,
chemicals,
naturally
occurring
bio
-
logical
compounds,
and metabolites.
These
entities
are captured
during
disease,
phenotype,
large-scale
gene
expression,
and inter
-
action
curation.
Where
possible,
the published
name
used
by the
author
is matched
by name
or the Chemical
Abstracts
Service
Registry
Number
(CasRN;
http:/
/www.cas.org
) with
an entry
in
the Chemical
Entities
of Biological
Interest
ontology
(ChEBI;
http:/
/www.ebi.ac.uk/chebi/
;
Degtyarenko
et al
. 2007
). However,
it is not uncommon
for the chemical
to lack
a ChEBI
ID in
which
case
it is matched
with
an ID from
the Comparative
Toxicogenomic
Database
(CTD;
http:/
/ctdbase.org/
;
Davis
et al.
2021
) using
name,
medical
subject
heading
(MeSH)
vocabulary
IDs,
or CAS
RN. Synonyms
of the chemical
name
are retrieved
from
both
ChEBI
and
the CTD
for easier
retrieval
by curators
and WB users.
We will continue
to capture
chemicals
and small
molecules
as part
of curation
workflows.
These
entities
are due
to be imported
into
the Alliance
and harmonized
with
chemical
and small
molecule
data
sets curated
by other
Alliance
members
during
the next
year.
Phenotypes
Phenotype
data
comprise
all annotations
that
include
terms
from
the
C. elegans
Phenotype
Ontology
(
https:/
/github.com/
obophenotype/c-elegans-phenotype-ontology
). Phenotypes
are
annotated
to genes
based
on results
reported
from
perturbations
(RNAi
knockdown,
mutant
analysis,
or gene
overexpression)
as
well
as gene
interaction
experiments.
Phenotypes
are also
anno
-
tated
to rearrangements
and
strains
that
might
have
multiple
genetic
differences
from
the reference
strain;
when
these
data
are migrated
to the Alliance,
phenotype
annotations
will be in
-
cluded.
Phenotypes
of genes
studied
as models
of human
disease
are already
included
in the Alliance.
Gene expression
WormBase
expression
data
encompasses
single-cell
and
bulk
RNA-seq
data
derived
from
tissues
or entire
organisms,
in add
-
ition
to manually
curated
experiments
examining
individual
gene
expression
profiles
and
cell type–specific
expression
pat
-
terns.
Data
are accessed
via the “Expression”
widget
on the gene
page.
The widget’s
tables
and displays
bring
together
comprehen
-
sive evidence
regarding
the expression
of the gene
within
a specif
-
ic anatomical
entity,
life stage,
or cellular
component.
This
widget
offers
a quick
overview
of the number
of independent
experi
-
ments
that
corroborate
the gene’s
expression
at various
times
and locations.
Low-throughput
expression
data
displayed
on the WormBase
widget
is also
presented
on the Alliance
gene
report
pages,
with
few notable
differences.
The
initial
set of metadata
that
is in
-
cluded
on the Alliance
website
includes
a minimal
set of informa
-
tion to represent
the spatiotemporal
location
of the gene
product,
such
as the gene
under
study,
the anatomical
structure,
the life
stage,
and the cellular
component
where
the gene
was found
to
be expressed.
Additional
information
about
experimental
details,
such
as the reagents
used
(antibodies,
knock-in
alleles,
and re
-
porter
gene
constructs)
and
audiovisual
material
(movies
and
images),
is currently
available
via a link
to WormBase
and
will
be integrated
into the Alliance
website
incrementally
in the near
future.
Importantly,
the harmonization
process
that
is necessary
to make
the related
information
in 6 different
organisms
compat
-
ible is almost
completed
for these
ancillary
metadata.
It is import
-
ant
to highlight
that
accessing
expression
data
through
the
Alliance
website
provides
the valuable
capability
of obtaining
a
unified
perspective
on gene
expression
for human,
mouse,
rat,
frogs,
zebrafish,
fruit
fly, and yeast.
This
can be achieved
by select
-
ing the “Compare
ortholog
genes”
checkbox.
For high-throughput
expression
studies,
users
can browse
and
download
data
from
SPELL
(
https:/
/spell.wormbase.org/
), the
Expression
Dataset
Locator
(
https:/
/wormbase.org/tools/rnaseq/
expression_dataset_locator.cgi
), and
the RNA-seq
FPKM
Search
(
https:/
/wormbase.org/tools/rnaseq/fpkmmine.cgi
) tools
(
Harris
et al.
2020
). We annotate
differentially
expressed
genes
of each
study
as expression
clusters.
The expression
clusters
include
in
-
formation
about
tissue
enrichment,
drug
or treatment
response,
and gene
regulation
on each
gene
page.
Tissue
enrichment
data
of expression
clusters,
including
those
derived
from
single-cell
RNA-seq
studies,
are also
displayed
in the WormBase
Ontology
Browser.
These
displays
are based
on data
that
will take
longer
to harmonize
and will thus
be available
on the “Expression”
sec
-
tion
of the Alliance
in about
2 years,
after
other
expression
data
have
been
migrated.
We also
expanded
the set of single-cell
RNA-seq
data
sets dis
-
played
on the “Expression”
widget
of gene
report
pages,
adding
to the one generated
by the CeNGEN
consortium
(
T
aylor
et al
.
2021
). We now
display
a single-cell
transcriptional
profile
from
C. elegans
embryos
(
Packer
et al
. 2019
;
Kishore
et al
. 2020
) and are
planning
to add additional
studies.
These
single-cell
graphs
will
also be displayed
on the Alliance
gene
expression
detail
pages
cur
-
rently
under
development.
Interactions
The ability
to study
interactions
between
genes
and gene
products
is a great
strength
of
C. elegans
as an experimental
system
because
of the ease
of combining
mutations
due the male-hermaphrodite
mode
of reproduction,
the existence
of 6 linkage
groups,
and en
-
ough
progeny
to obtain
recombinants.
Biocurators
at WormBase
extract
assertions
of gene–gene
interaction
from
the literature
and supplemental
tables
of papers.
These
are currently
sent
to
the Alliance
for display.
The
concepts
of gene
interaction
(e.g.
Huang
and
Sternberg
2006
) have
mostly
been
harmonized
and
will be added
to Alliance
data
store
in the future.
An Alliance
interaction
viewer,
such
as the Venn
diagram
tool,
Vennter
(
Cho
et al.
2020
) will be implemented
within
2 years.
4 |
P. W. Sternberg
et al.
Gene Ontology
annotations
The Gene
Ontology
(GO)
is designed
to be a multispecies
ontol
-
ogy (
Gene
Ontology
Consortium
2023
), and
thus,
GO annota
-
tions
were
naturally
included
in the
Alliance
from
the
beginning.
GO annotations
displayed
at the Alliance
are re
-
trieved
from
the GOC
infrastructure;
up-to-date
ontology
infor
-
mation
is retrieved
similarly.
GO annotations
at the Alliance
are
initially
displayed
in GO ribbons
that
summarize
annotations
using
an Alliance
subset
(“slim”)
and
allow
a comparative
view.
GO annotations
may
also
be viewed
in a table
format
that
includes
supporting
evidence.
The
main
limitation
of GO
ribbons
is that
the high
level
subset
used
is not tuned
to nema
-
todes.
Eventually
we will
create
nematode-centric
choices
to
complement
the generic
GO slim.
In addition
to a GO ribbon
and
table
display,
WormBase
also
has a tool
for visualization
annotations
to GO terms
that
simplifies
the ontology
on a
case-by-case
basis,
for example,
displaying
enriched
terms
(SObA
graphs,
Lee
et al.
2018
). A general
Alliance
enrichment
tool
is in the planning
stages;
inclusion
of the SObA
graphs
to
visualize
enriched
terms
will come
at a later
date.
People
and labs
WormBase
has basic
information
about
nematode
researchers
(people)
that
not only
facilitates
communication
with
users
but
also
allows
us to connect
individuals
to gene
classes,
references,
allele
classes,
and their
intellectual
lineage.
In addition,
the “com
-
munity
curation”
contributed
by each
person
is acknowledged
on
their
person
pages
in WormBase.
Person
curation
with
its asso
-
ciated
information
will be added
to the Alliance
incrementally
over
the next
year.
For the
C. elegans
communities,
Laboratories
are connected
to people,
strains,
gene
classes,
alleles,
and labora
-
tory
websites.
One
of the current
priorities
for the Alliance
is to
produce
pages
describing
each
paper
and
the entities
to which
they
are connected.
This
latter
work
should
be completed
by the
end of 2024.
Genetic
pathways
WormBase
has previously
not devoted
substantial
attention
to
curating
genetic
pathways,
in part
because
a suitable
framework
had
not been
developed.
The
Gene
Ontology
Causal
Activity
Models
(GO-CAM)
models
now
provide
an appropriate
framework
(
Thomas
et al.
2019
). In these
models,
the annotation
for individual
gene
products
is linked
so that
a protein
has a molecular
function
in a particular
cellular
component
with
specific
inputs
and out
-
puts
with
defined
relations
in the context
of a specific
biological
process;
this information
allows
pathways
to be constructed
and
displayed
in a consistent
manner.
While
the number
of curated
GO-CAM
pathways
is still rela
-
tively
small,
it is growing,
and pathway
views
can be generated
from
the underlying
data
curated
by WormBase
and
displayed
at the Alliance
using
software
(
Fig. 2
). The “Pathways”
widget
on
the Alliance
gene
pages
also shows
projected
reactome
pathways
based
on curated
human
models
(
Milacic
et al.
2024
). The curated
GO-CAM
pathways
are expected
to improve
gene
set enrichment
analysis
(e.g.
Markarian
et al.
2024
).
WormBase
UI
Homepage
and static
pages
Many
users
start
their
interaction
with
WormBase.org
via the
homepage
to search
or open
tools
such
as WormMine
or multi-
ontology
enrichment
analysis.
The
Alliance
has
a generic
homepage,
from
which
one can click
on the WormBase
hexagon
(
Fig.
3a
) to go to the WormBase
homepage
at the
Alliance
(
Fig.
3b
) and,
if desired,
to the current
WormBase.org
(“Visit
WormBase”).
In addition,
from
any gene
page,
one can click
on
the equivalent
WormBase
gene
page.
The
many
static
pages
that
have
useful
but relatively
unchanging
information
(such
as
a guide
to Nomenclature)
will be linked
from
the
WormBase.org
homepage.
These
links
will be added
to the Alliance
WormBase
page
in the coming
year.
Gene page
Gene
pages
are complex,
with
access
to a host
of information.
In
general,
the information
will be present
in both
WormBase
and
Alliance
views
but organized
a bit differently
(
Fig. 4
). WormBase
gene
pages
allow
customized
user
ordering
of different
widgets,
and this is not yet available
at the Alliance.
On the other
hand,
at the Alliance,
the information
on different
organisms
is orga
-
nized
similarly
to facilitate
cross-organism
comparisons.
WormBase
tools
A summary
of the major
WormBase
tools
and displays
as well
as
their
migration
status
is given
in
Table
1
.
JBrowse
Over
the last year,
JBrowse,
one of the most
widely
used
tools
pro
-
vided
by WormBase,
has undergone
an overhaul
into
JBrowse
2
(
Diesh
et al
. 2023
) and has been
implemented
at WormBase
and
moved
to Alliance
infrastructure.
This
new version
of JBrowse
pro
-
vides
the functionality
of JBrowse
1 with
the addition
of several
new
features.
In addition
to the linear
genome
view
(i.e. the trad
-
itional
genome
browser
view
of a region
of a single
stretch
of gen
-
ome
with
data
tracks),
JBrowse
2 provides
other
data
views,
including
linear
genome
views
that
span
multiple
chromosomes
(up to an entire
genome),
whole
genome
comparisons
in the
form
of dotplots,
pairwise
synteny
views,
a “spreadsheet”
view,
and views
that
allow
exploration
of structural
variants.
JBrowse
2 also
provides
the ability
to create
scalable
vector
graphics
out
-
put of all views,
which
was
by far the most
requested
JBrowse
feature.
JBrowse
2 supports
comparative
genomics
data
from
a variety
of established
comparison
tools
like
minimap2
(
Li 2018
) and
MCScanX
(
Wang
et al.
2012
). WormBase
has provided
several
pre
-
computed
pairwise
genome
comparisons,
primarily
between
sev
-
eral of the Caenorhabditis
assemblies,
using
minimap2
to perform
the analysis
(
Fig. 5
). More
comparison
data
sets can be added
on
request
from
users,
and as with
all supported
views
in JBrowse
2, “session”
tracks
that
are accessible
just for the user
can be cre
-
ated
with
their
own
data
and
analysis
results.
These
data
are
available
both
as views
created
from
the “New
Session/Launch
View”
interface
as well
as from
a linear
genome
view
by adding
any of the tracks
from
the “Synteny”
category
of tracks.
Added
functionality
with
JBrowse
2 includes
several
tools
for
personalizing
a user’s
interaction
with
it. In addition
to the ability
to add session
tracks,
users
have
the ability
to use the spreadsheet
view
to load
a list of regions
of interest
that
can be quickly
navi
-
gated
to in succession.
Similar
functionality
is the ability
to create
and edit bookmarks
of regions
that have
been
visited
and the user
wants
to save.
These
bookmarks
can also be exported
as a tab se
-
parated
file that
can be uploaded
later
and
shared
with
other
users.
Since
JBrowse
2 can create
very
complicated
displays
with
multiple
views,
it also
provides
a way
to create
URL-shortened
share
links
that
will save
the exact
view
layout
that
can be easily
WormBase
and the Alliance
| 5
shared with collaborators. Finally, there are several help docu
-
ments for WormBase users new to JBrowse 2 that are available un
-
der the “Help” menu.
Gene summaries
WormBase continues to generate and update gene summaries
that are text descriptions of gene function. These summaries
are generated for 10 nematode species based on structured,
curated data in the WormBase database using an algorithm
developed at WormBase and further refined for the Alliance
(Kishore
et al
. 2020
). Gene summaries cover 5 Caenorhabditis spe
-
cies (
C. elegans
,
Caenorhabditis brenneri
,
Caenorhabditis briggsae
,
Caenorhabditis japonica
,
Caenorhabditis remanei
) and 5 parasitic spe
-
cies (
Onchocerca volvulus
,
Pristionchus pacificus
,
Strongyloides ratti
,
Trichuris muris
)
and can be viewed in the “Overview” section at
the top of gene pages. For well-studied genes, manually curated
data are the main source of information, while high-throughput
data are largely used for less studied genes. Summaries for
genes for the Caenorhabditis species such as
C. japonica
and
C.
remanei
are generated by relating them via orthology to
C
.
elegans
genes. As part of the ongoing WormBase migration pro
-
cess to the Alliance, we will (1) integrate the Alliance and
WormBase gene summary pipelines for
C. elegans
and (2) ex
-
plore the use of AI for writing gene summaries for those genes
that are less studied and/or have fewer structured, curated
data annotations in WormBase. As part of this process, we
will customize and finetune large language models (LLMs) to
generate high-quality and scientifically rigorous gene summar
-
ies which will help scale gene summary generation to cover the
fast-growing literature.
C. elegans
models of human disease
C. elegans
is increasingly being used to model human disease.
Modeling takes the form of expressing human proteins (or parts
thereof) in
C. elegans
, studying phenotypes that are the phenologs
(McGary
et al.
2010;
Golden 2017) of human disease phenotypes,
studying orthologs of human disease-relevant genes, or studying
disease-associated variants either in the orthologous protein
(e.g.
Wong
et al.
2019) or in a humanized worm (e.g.
McDiarmid
et al.
2020). WormBase continues to curate such models of human
disease from papers by associating genes, alleles, genotypes, and
strains with a term from the Disease Ontology (DO) (Schriml
et al.
2022
) according to annotation standards defined by the Alliance.
Other experimental data such as drugs, herbals, small molecules,
and other compounds that act as inducers (induce disease-like
phenotypes) and/or modifiers (ameliorate or exacerbate the dis
-
ease condition) are also included in an annotation. Note that as
part of the migration of WormBase data and displays to the
Alliance, disease-related data displays are actively being devel
-
oped and are almost complete at the Alliance. Users are encour
-
aged to view and download these data on gene and disease
Fig. 2.
Example of a
C. elegans
genetic pathway at the Alliance. This pathway diagram for PMK-1 signaling is generated automatically from a curated
GO-CAM pathway. In this example, only direct positive and negative regulation occur, but the underlying model can handle indirect inputs as well as
biochemical pathways using chemicals as reaction inputs and outputs. The rectangles on the right side show the standard GO annotations associated
with the GO-CAM, including molecular function, cell component, and biological processes.
6 |
P. W. Sternberg
et al.
Fig. 3.
A home
for
C. elegans
at the Alliance.
a) MOD
icons
at the bottom
of the Alliance
homepage.
b) WormBase
homepage
at the Alliance.
WormBase
and the Alliance
| 7
pages at the Alliance where harmonized worm disease data can be
viewed in the context of other species.
WormBase literature acquisition and
processing
The goal of the WormBase literature acquisition and processing
pipeline is to identify in PubMed all references about
C. elegans
and include them in the WormBase bibliography. We perform a
daily PubMed query using the keyword “elegans,” and a curator
manually reviews the results to accept appropriate references
for inclusion in the WormBase corpus. For all approved refer
-
ences, we attempt to retrieve the full text and associate entities,
e.g. species, genes, alleles, transgenes, and 15 different data types,
e.g. RNAi phenotypes, genetic interactions, and anatomical ex
-
pression patterns, with the reference via pattern matching and
machine learning algorithms, such as neural networks, respect
-
ively. Curators use the results of the machine learning algorithms
to prioritize references for curation. In addition, we use pattern
matching followed by manual validation to associate a reference’s
authors to a WormBase “person,” e.g. Iva S. Greenwald is con
-
nected to WBPerson220, to populate the publications section of
each Person page in WormBase.
WormBase is systematically transitioning our literature acquisi
-
tion and triage pipelines to Alliance literature system infrastruc
-
ture, known as the Alliance Bibliography Central (ABC). The daily
Fig. 4.
Comparison of WormBase and Alliance gene pages. Shown is
pezo-1
with the WormBase view on the left and Alliance view on the right.
Table 1.
Tools in WormBase and status. For each tool, the current location if not
WormBase.org
is listed as is the destination and
estimated time of migration.
Tool
Current location(s)
Migration
Eventual destination
FTP site
downloads.wormbase.org, Hinxton
1 yr
Alliance
GBrowse
No
Alliance
JBrowse
Done
Alliance
Forums
forums.wormbase.org
Done
Alliance-hosted WB portal
Wiki
wiki.wormbase.org
2 yr
Alliance-hosted WB portal
Blog
blog.wormbase.org
1 yr
Alliance-hosted WB portal
AQL/WQL gateways
Never
Alliance-hosted WB portal
API
2 yr
Replaced by Alliance
SPELL
1 yr
Alliance-hosted WB portal
SimpleMine
Partly done
Alliance
WormMine
intermine.wormbase.org
1 yr
Alliance
BLAST/BLAT
Done
Alliance
ePCR
2 yr
Alliance
Name Service
1 yr
Alliance
Apollo
2 yr
Alliance
Mailing lists
1 yr
Alliance
Email accounts
1 yr
Alliance
Nucleotide Aligner
Replaced 1 yr
Alliance
Protein Aligner
Replace 1 yr
Alliance
Ontology Graphs
2 yr
Alliance or Caltech server
Ontology Enrichment
1 yr
Alliance
8 |
P. W. Sternberg
et al.
PubMed
search,
sort (i.e. acceptance
into the WormBase
literature
corpus),
species
assignments,
and full-text
acquisition
and storage
are implemented
in the ABC.
Current
work
is focused
on the final
phase
of testing
the Alliance
reference
data
model
for data
type
and entity
tagging,
specifically
with
respect
to sufficiently
captur
-
ing the provenance
of paper
tags (i.e. manual
assertions
vs machine
learning
methods)
and,
wherever
possible,
automatically
using
manual
assertions
to validate
machine
learning
methods.
Fig. 5.
Linear
genome
and synteny
views
in JBrowse
2. JBrowse
2 showing
multiple
views:
the top panel
is a linear
genome
view
showing
the region
around
the
C. elegans
lin-2
gene
with
multiple
synteny
tracks
showing
the sequence
similarity
as determined
by minimap2
analysis
between
the N2 assembly
and
the assemblies
of multiple
related
genomes.
The middle
panel
is the result
of right
clicking
one of the regions
showing
synteny
between
C. elegans
and
Caenorhabditis
inopinata
and zooming
out slightly
to show
the entire
lin-2
in both
assemblies.
The lower
panel
is the same
but with
the similar
region
in
C. remanei
.
WormBase
and the Alliance
| 9