Updates to the Alliance of Genome Resources central infrastructure
The Alliance of Genome Resources Consortium
A full list of members is provided at the end of this article.
The Alliance of Genome Resources (Alliance) is an extensible coalition of knowledgebases focused on the genetics and genomics of intensively studied model organisms. The Alliance is organized as individual knowledge centers with strong connections to their research communities and a centralized software infrastructure, discussed here. Model organisms currently represented in the Alliance are budding yeast, Caenorhabditis elegans, Drosophila, zebrafish, frog, laboratory mouse, laboratory rat, and the Gene Ontology Consortium. The project is in a rapid development phase to harmonize knowledge, store it, analyze it, and present it to the community through a web portal, direct downloads, and application programming interfaces (APIs). Here, we focus on developments over the last 2 years. Specifically, we added and enhanced tools for browsing the genome (JBrowse), downloading sequences, mining complex data (AllianceMine), visualizing pathways, full-text searching of the literature (Textpresso), and sequence similarity searching (SequenceServer). We enhanced existing interactive data tables and added an interactive table of paralogs to complement our representation of orthology. To support individual model organism communities, we implemented species-specific “landing pages” and will add disease-specific portals soon; in addition, we support a common community forum implemented in Discourse software. We describe our progress toward a central persistent database to support curation, the data modeling that underpins harmonization, and progress toward a state-of-the-art literature curation system with integrated artificial intelligence and machine learning (AI/ML).
Keywords: database; knowledgebase; software; text mining; data integration; Drosophila; yeast; Caenorhabditis elegans; zebrafish; mouse
Received on 20 November 2023; accepted on 29 February 2024
© The Author(s) 2024. Published by Oxford University Press on behalf of The Genetics Society of America.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Introduction
As has been discussed at length elsewhere (e.g. Oliver et al. 2016; Wood et al. 2022), model organism knowledgebases [aka model organism databases (MODs)] provide daily utility to researchers for the design and interpretation of experiments, to computational biologists for curated data sets, and to genomic researchers for annotated genomes. Some of the major uses of the MODs have been 1-stop shopping for all information about a particular gene or obtaining cleansed data sets with standard metadata for computational analyses.
The Alliance of Genome Resources (referred to herein as the Alliance) is a consortium of MODs and the Gene Ontology Consortium (GOC). The mission of the Alliance is to support comparative genomics to investigate the genetic and genomic basis of human biology, health, and disease. To promote sustainability of the core community data resources that make up the Alliance, we implemented an extensible “knowledge commons” platform for comparative genomics built with modular, reusable infrastructure components that can support informatics resource needs across a wide range of species (Howe et al. 2018; Alliance of Genome Resources 2022; Bult and Sternberg 2023). In 2022, the Alliance was recognized as a Global Core Biodata Resource by the Global Biodata Coalition (Anderson et al. 2017).
Specifically, the Alliance of Genome Resources is organized as 2 interdependent units: Alliance Central and the Alliance Knowledge Centers. Alliance Central is responsible for developing and maintaining the software for data access and for the coordination of data harmonization and data modeling activities across our members. A primary goal of Alliance Central is to reduce redundancy in systems administration and software development for model organism knowledgebases and to deploy a unified “look and feel” for access to, and display of, common data types and annotations across diverse model organisms and human, following findability, accessibility, interoperability, and reuse (FAIR) guiding principles. Model organism-specific knowledgebases serve as Alliance Knowledge Centers. Knowledge Centers are responsible for expert curation and submission of data to Alliance Central using Alliance Central infrastructure. Knowledge Centers also are responsible for organism-specific user support activities and for providing access to data types not yet supported by Alliance Central. The founding Alliance Knowledge Centers are Saccharomyces Genome Database (SGD; Engel et al. 2022), WormBase (Davis et al. 2022; Sternberg et al. 2024), FlyBase (Gramates et al. 2022), Mouse Genome Database (Ringwald et al. 2022), the Zebrafish Information Network (Bradford et al. 2023), Rat Genome Database (Vedi et al. 2023), and the GOC (Gene Ontology Consortium 2023). The newest member, Xenbase (Fisher et al. 2023), joined the Alliance consortium in 2022.
Here, we describe our progress toward harmonizing information provided by our member resources, our development of a software infrastructure for ingest, curation, storage, analysis, and output of such information, and development of an efficient literature curation system. We start by describing new features in our web portal at AllianceGenome.org.
GENETICS, 2024, 227(1), iyae049
https://doi.org/10.1093/genetics/iyae049
Advance Access Publication Date: 29 March 2024
Knowledgebase & Database Resources
The web portal
Community homepages
The Alliance website features landing pages for each model organism in the Alliance consortium. These pages are accessed from the “Members” drop-down menu in the header on every Alliance page. These pages feature MOD-specific content such as meetings, news, and other MOD-specific resource links. A common template allows users to find the same types of information in each landing page (Fig. 1). As MODs transition their data and web services to the Alliance, their member pages will evolve into portals hosting additional MOD-specific data, tools, and links to organism-specific resources.
Xenopus in the Alliance
Xenbase, the Xenopus knowledgebase (Fisher et al. 2023), is the first knowledgebase to join the Alliance since the founding members initiated the consortium. Xenopus is an amphibian frog species used extensively in biomedical research and in particular for experimental embryology, cell biology, and disease modeling with genome editing (Carotenuto et al. 2023; Kostiuk and Khokha 2021). As a nonmammalian air-breathing tetrapod, Xenopus represents a valuable evolutionary transition between rodents and zebrafish for comparative genomic studies. Xenbase is built on the same underlying data schema (structure) as FlyBase (Chado).
Two different Xenopus species are used interchangeably as a model system: Xenopus tropicalis is a diploid that is the preferred system for genome editing and genetics, whereas Xenopus laevis is an allotetraploid preferred for use in cell biology studies, microinjection, and microsurgery-style experimentation. Xenopus tropicalis has 1:1 relationships between most genes and human orthologs (excluding paralogs; Mitros et al. 2019), whereas X. laevis has 2 copies of most human orthologs. The allotetraploid formed via hybridization of 2 different frog species (Session et al. 2016), and the complexities of genome evolution that subsequently occurred increase the difficulty of identifying orthology of the 2 X. laevis genes to their diploid relatives, including humans. Mapping of the diploid X. tropicalis genes to their human orthologs was performed as with the other organisms in the Alliance (see below). Because this method does not yet work in the context of an allotetraploid, the Alliance imports the X. tropicalis to X. laevis orthology mappings from Xenbase, where they have been established through a combination of synteny analysis and manual curation; this was one major challenge in adding Xenopus to the Alliance.
Xenbase created software to upload content on a regular schedule formatted for the current Alliance data ingest schema. Currently, these data include orthology, the Xenopus anatomical ontology, standard gene information, gene expression data, publications, GO term associations, disease associations, anatomical phenotypes, and genome details. Xenopus genes can be found using the Alliance landing page search tool with Xenopus genes flagged by Xtr and Xla notations. The 2 copies of the genes in X. laevis, the allotetraploid, are further tagged as “(symbol).L” and “(symbol).S” to denote the genes on the long (L) and short (S) chromosome pairs of this species (e.g. pax6.L and pax6.S). Alliance release 6.0.0 has Xenbase data for 54,000 genes, 19,000 disease associations, over 45,000 gene expression records, and more than 7,000 anatomical phenotypes. Expression and phenotype data will be available in about a year.
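The “(symbol).L” and “(symbol).S” naming convention described above lends itself to simple programmatic handling. The following is a minimal sketch, not Alliance or Xenbase code: the function and class names (`parse_xenopus_symbol`, `group_copies`, `XenopusSymbol`) are hypothetical, and the only assumption taken from the text is that the 2 X. laevis copies carry a trailing `.L` or `.S`.

```python
from typing import Dict, List, NamedTuple, Optional


class XenopusSymbol(NamedTuple):
    base: str                  # symbol without the chromosome-pair suffix, e.g. "pax6"
    subgenome: Optional[str]   # "L", "S", or None when no suffix is present


def parse_xenopus_symbol(symbol: str) -> XenopusSymbol:
    """Split a symbol such as 'pax6.L' into its base symbol and L/S tag."""
    base, dot, suffix = symbol.rpartition(".")
    if dot and base and suffix in ("L", "S"):
        return XenopusSymbol(base, suffix)
    # No recognizable suffix: treat the whole string as the base symbol.
    return XenopusSymbol(symbol, None)


def group_copies(symbols: List[str]) -> Dict[str, List[str]]:
    """Group symbols by base name so that .L and .S copies sit together."""
    groups: Dict[str, List[str]] = {}
    for symbol in symbols:
        groups.setdefault(parse_xenopus_symbol(symbol).base, []).append(symbol)
    return groups
```

For example, `group_copies(["pax6.L", "pax6.S"])` places both copies under the key `"pax6"`, which is one way a downstream script might pair the long and short chromosome copies of a gene.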
In addition to the rich data made available to the Alliance from Xenopus research, this effort also served as a valuable test case for understanding the level of effort and complexities engendered in the addition of new knowledgebases to the Alliance and the functionality and adaptability of ingest system components.
New gene page section: paralogy
Gene pages now include a paralogy section populated with data from the Drosophila Research & Screening Center (DRSC) Integrative Ortholog Prediction Tool (DIOPT) version 9.1 developed by the DRSC (Hu et al. 2011, 2021). The assembly of protein sets and algorithmic inferences of their orthology from various sources was first centralized by the DRSC and then exported to Alliance Central. We include the same data sources used for orthology, when these resources also provide paralogy information. Specifically, these resources have performed well on the standardized benchmarking from the Quest for Orthologs (QfO) Consortium (Nevers et al. 2022). Orthologous Matrix (OMA; Altenhoff et al. 2021) and PANTHER (Thomas et al. 2022) data sets were retrieved through the QfO benchmark portal (https://orthology.benchmarkservice.org), and Compara data were acquired directly from the EBI Compara FTP site. In addition, the DRSC conducted local analyses using InParanoid (Persson and Sonnhammer 2022), OrthoFinder (Emms and Kelly 2019), OrthoInspector (Nevers et al. 2019), and SonicParanoid (Cosentino and Iwasaki 2019) using the UniProt 2020 reference proteome set (UniProt Consortium 2023), the same set used in the downloaded data sets, to ensure consistency. Direct data submissions from PhylomeDB (Fuentes et al. 2022) and the SGD (Engel et al. 2022) were also integrated into the data set.
Fig. 1. MOD landing pages at the Alliance portal. A common look and feel that allows community-specific content.
The new paralogy section comprises a table (Fig. 2), similar to the orthology table, that contains the gene symbol of related paralogs, a calculated rank, alignment length as the number of aligned amino acids, percentage of similarity and identity, and a count of the algorithms or methods that call the paralogous match. The ranking score was developed to sort the paralogs by overall similarity and was reviewed by curators to display optimally an acceptable rank order for well-studied sets of paralogs. The ranking score considers several factors, including alignment length, percent identity, and the number of paralogy methods that identify the paralog. Additional information for rank determination and alignment length is available to the users via a clickable help icon located next to those column headers.
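To make the idea of a multi-factor ranking score concrete, the sketch below combines the 3 factors named above (alignment length, percent identity, and method count) into a single weighted score. This is an illustration only: the actual Alliance formula and weights are not given in the text, so the weights (`W_LENGTH`, `W_IDENTITY`, `W_METHODS`), the normalization, and all names here are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Paralog:
    symbol: str
    alignment_length: int  # number of aligned amino acids
    identity: float        # percent identity, 0-100
    method_count: int      # number of methods that call the paralog


# Hypothetical weights chosen only for illustration; not the Alliance's values.
W_LENGTH, W_IDENTITY, W_METHODS = 0.3, 0.5, 0.2


def score(p: Paralog, max_length: int, max_methods: int) -> float:
    """Weighted sum of factors, each scaled to the range 0-1."""
    length_term = p.alignment_length / max_length if max_length else 0.0
    method_term = p.method_count / max_methods if max_methods else 0.0
    return (W_LENGTH * length_term
            + W_IDENTITY * (p.identity / 100.0)
            + W_METHODS * method_term)


def rank_paralogs(paralogs: List[Paralog]) -> List[Tuple[int, str]]:
    """Return (rank, symbol) pairs, best-scoring paralog first."""
    if not paralogs:
        return []
    max_length = max(p.alignment_length for p in paralogs)
    max_methods = max(p.method_count for p in paralogs)
    ordered = sorted(paralogs,
                     key=lambda p: score(p, max_length, max_methods),
                     reverse=True)
    return [(rank, p.symbol) for rank, p in enumerate(ordered, start=1)]
```

Under these assumed weights, a paralog with higher identity can outrank one supported by more methods, which is the kind of outcome a curator-reviewed score would need to be checked against.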
The paralog section was released with Alliance version 6.0.0. Forthcoming updates will include the ability to sort and filter the table by column values and the availability of these data via our bulk downloads page. The existing tables on the gene pages for Function, Disease, and Expression all contain checkboxes for “Compare Ortholog Genes” that allow users to search across species for these features. We will add the additional checkbox “Compare Paralog Genes” to provide similar functionality for paralogous genes in a future Alliance release.
JBrowse sequence detail widget
The recent Alliance 6.0.0 release includes a new “Sequence Detail” section on all gene pages that uses JBrowse and JavaScript libraries to display an interactive widget that allows users to download DNA and amino acid sequences of genes in several possible configurations: for example, genomic sequence highlighted with UTRs, coding and intronic regions, CDS regions, and translated protein (Fig. 3). In the next few releases, we will extend the functionality of the widget to variant detail pages, where both the wild-type and variant sequences will be provided. When the variant occurs in the context of a protein coding gene, changes to the coding sequence and resulting translated protein will also be displayed and available for download.
Model organism BLAST
For more than 2 decades, some of the MOD members of the Alliance have hosted their own custom BLAST interfaces (Altschul et al. 1990; e.g. FlyBase Consortium 1999) that have allowed users to search custom databases related to those model organisms, e.g. subsets of related species or molecular clones, and display BLAST hits in Genome Browsers aligned with current gene models. We are now developing an updated and integrated Alliance BLAST, powered by SequenceServer (Priyam et al. 2019), that optimizes sequence analysis across model organisms. We have begun to update BLAST for the individual MODs. The new WormBase BLAST is now available online and can currently be accessed via the tools menu on wormbase.org. The results are linked to Genome Browsers and Alliance gene pages (Fig. 4). This tight connection allows users to navigate seamlessly between their BLAST results and the wealth of information available within the Alliance, enhancing the efficiency and depth of genetic research. For example, users can retrieve BLAST results for a sequence of interest and then easily navigate across Genome Browsers for different organisms, with a comparison to different tracks revealing how that sequence aligns with gene models, variants, and experimental tools (Fig. 5). From a project perspective, developing Alliance BLAST with a common cloud-optimized infrastructure will increase efficiency by reducing the cost of compute overhead and eliminating the need to manage separate MOD systems, which will then allow more focus on developing new functionality to support researchers. Our focus in the upcoming year is directed toward enhancing the user interface (UI), reflecting our commitment to providing an intuitive platform for researchers in model organism genetics. We plan to produce more analysis tools as part of the evolving Alliance portal, thereby broadening the range of resources available for genetic research within the community.
Fig. 2. Paralog table for C. elegans hlh-25. The table presents a ranking of paralogs for the hlh-25 gene, based on a weighted scoring algorithm that incorporates sequence conservation metrics. It lists the gene symbols, provides the alignment length in amino acids, and quantifies the similarity and identity percentages of genes paralogous to hlh-25. The methodology count, indicating the number of algorithms supporting the paralogous relationship, is also included. In this ranking, hlh-27 is identified as the primary paralog due to its high similarity and identity scores, despite being recognized by fewer methods than hlh-28.
AllianceMine
AllianceMine, a sophisticated, multifaceted search and retrieval tool that utilizes the InterMine software (Smith et al. 2012), offers a unified view of harmonized data, enabling advanced queries across multiple species. For instance, gene lists can be processed as input and simultaneously query different annotations, such as “Show me genes associated with a (specific disease term)” (Fig. 6). The results from queries can be combined for further analysis and saved or downloaded in customizable file formats. Queries themselves can be customized by modifying predefined templates or by creating new templates to access a combination of specific data types. Thus, this powerful tool can be used in multiple ways, namely, for search, discovery, curation, and analysis.
Fig. 3. Sequence detail widget. Chosen views of a specific gene are readily available for copying as plain text or with highlights. 5′ region of the human PLAA gene.
Fig. 4. Screenshot of results from the Alliance SequenceServer BLAST tool. The results have been enhanced relative to the default SequenceServer results page by the addition of links to Alliance JBrowse and to the corresponding gene page (in this case C. elegans abi-1) at the Alliance website for each BLAST hit.
AllianceMine currently showcases harmonized data encompassing genes, diseases, GO, orthology, expression, alleles, variants, and FASTA formatted genome sequences. The tool also offers predefined queries or “templates” for cross-species searching. Continual optimization will ensure timely data synchronization with the main Alliance site, as well as integration of newly harmonized data types. Another aspect of improvement will be the addition of more templates, widgets, and precompiled lists, which can serve as logical input for templated queries.
SimpleMine
We designed SimpleMine for biologists to get essential information for a list of genes without any command-line or programming skill, or the patience to learn the awesome power of AllianceMine discussed above. Users can submit a list of gene names or IDs to access more than 20 types of essential data with which they are associated. The results are 1 line per gene with detailed information separated by 4 types of separators: tab, comma, bar, and semicolon. Users can choose to display the output as HTML or to download a tab-delimited file. Alliance SimpleMine contains 10 species curated by the Alliance MODs. It provides easy gene name/ID conversion among MOD ID, public name (the commonly used name for public consumption), NCBI, PANTHER, Ensembl, and UniProtKB. Users can find summarized anatomic and temporal expression patterns, variants, and genetic and physical interactions. Other essential gene information includes disease association and orthologs among all 10 species. The infrastructure of SimpleMine allows users to perform species-specific searches for lists of genes that take about 2 s to return results, or mixed-species searches that take about 10 s to complete.
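A downloaded tab-delimited, 1-line-per-gene file of this kind can be read with a few lines of standard-library code. The sketch below is an assumption-laden illustration rather than a description of the actual SimpleMine output: the column names in the usage example are invented, and treating the bar (`|`) as the separator between multiple values within a cell is only 1 plausible reading of the separator hierarchy.

```python
import csv
import io
from typing import Dict, List, Union

Cell = Union[str, List[str]]


def parse_gene_table(text: str, inner_sep: str = "|") -> List[Dict[str, Cell]]:
    """Parse tab-delimited text with a header row into one dict per gene.

    Cells containing `inner_sep` are split into lists of values; all other
    cells are kept as plain strings. The choice of `inner_sep` is an
    assumption made for this sketch.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows: List[Dict[str, Cell]] = []
    for row in reader:
        parsed: Dict[str, Cell] = {}
        for column, value in row.items():
            if column is None:
                continue  # ignore cells beyond the header width
            value = (value or "").strip()
            if inner_sep in value:
                parsed[column] = [part.strip() for part in value.split(inner_sep)]
            else:
                parsed[column] = value
        rows.append(parsed)
    return rows
```

With a hypothetical input such as `"Gene ID\tPublic Name\tOrthologs\nID:0001\tgeneA\torthoX | orthoY\n"`, the `Orthologs` cell of the first row becomes the list `["orthoX", "orthoY"]`.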
Pathway displays with metabolites (GO Causal Activity Models)
We implemented a pathway display on Alliance gene pages that presents both GO Causal Activity Model (GO-CAM; Thomas et al. 2019) and Reactome pathway (Milacic et al. 2024) models. The display queries both the Reactome and GO application programming interfaces (APIs) and shows the number of pathways from each resource that contain the gene of interest. If a gene appears in multiple pathways, users can select which pathway to display. For the GO-CAM models, the viewer has been improved relative to previous releases of the Alliance website (Fig. 7). First, the layout has been improved to show clearly the overall causal flow through a pathway, from top to bottom and branching as necessary. Second, the viewer displays not only the activities of genes/proteins in a pathway but also metabolites, which is particularly useful for visualizing metabolic pathways. These metabolites may be either intermediates in a pathway or regulators of a protein activity. For signaling pathways, we distinguish between direct and indirect regulations and between positive, negative, or unknown effects.
Fig. 5. Output of a BLAST search. After a user clicks on the JBrowse link for a BLAST hit, they are directed to the web service where they will see a track for the BLAST hit and how the hit aligns with other tracks.
Fig. 6. AllianceMine example. Using a simple template, a disease ontology (DO) term, in this case “autism,” is chosen, and all genes associated with this DO term are returned in a downloadable table.
Harmonized data models
The transition of data from individual MODs to the Alliance infrastructure requires data harmonization so that existing analogous MOD data classes (types/tables) can be loaded into Alliance databases using a consistent schema and language. The first step is for biocurators from each Alliance knowledge center to agree on which data classes are analogous and can be treated as a single, consolidated data class. The biocurators then align the properties (table columns) of the consolidated data class, including identifiers, types of values, and whether entity–property–value associations/triples require their own respective metadata and/or evidence records. To enable this process, we use the Linked Data Modeling Language (LinkML). We now have a standard workflow and common data modeling patterns that have streamlined the process, which we expect to complete in the next year. The LinkML specifications, authored in human-readable files, are used to programmatically generate JavaScript Object Notation (JSON) schema specifications, which allow data quartermasters (DQMs) to move data to the persistent store. These specifications also inform curation software developers how to generate initial back-end (Java models and APIs) and front-end infrastructure (curation UI data tables and detail pages). Once DQMs have submitted their data files for a particular data class, the data are loaded into the persistent store and validated (see Persistent store architecture description below) and thus automatically populated into data tables and the curation interface. The data, having been harmonized, ingested, validated, and displayed to curators in the curation software, can now flow through to the public site according to the data pipeline described (see Persistent store architecture description below).
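The role a generated schema plays for a DQM can be illustrated with a small validation sketch. Everything specific here is hypothetical: `DISEASE_ANNOTATION_SPEC` is a hand-written stand-in for a schema generated from LinkML, and its property names are invented for illustration and are not taken from the Alliance curation schema. The sketch only shows the general pattern of checking required properties and value types before submission.

```python
from typing import Any, Dict, List

# Hypothetical, simplified stand-in for a schema generated from LinkML.
# Property names are illustrative only.
DISEASE_ANNOTATION_SPEC: Dict[str, Any] = {
    "required": ["subject", "disease_term", "reference"],
    "types": {
        "subject": str,
        "disease_term": str,
        "reference": str,
        "negated": bool,
    },
}


def validate_record(record: Dict[str, Any], spec: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in `record`; an empty list means valid."""
    problems: List[str] = []
    # Every required property must be present.
    for name in spec["required"]:
        if name not in record:
            problems.append(f"missing required property: {name}")
    # Every supplied property must be known and have the expected type.
    for name, value in record.items():
        expected = spec["types"].get(name)
        if expected is None:
            problems.append(f"unknown property: {name}")
        elif not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems
```

In practice a standard JSON Schema validator would likely replace this hand-rolled check; the point of the sketch is that a single shared specification lets every MOD test its files against the same rules.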
Many Alliance data classes have completely (or nearly completely) harmonized data models in LinkML (see https://github.com/alliance-genome/agr_curation_schema) including disease annotations, alleles, variants, expression annotations, and references. Although many other data classes have partially harmonized models, ongoing and future harmonization efforts will focus on completing harmonized models for the remaining curated data classes: genes, transcripts, proteins, nontranscribed genome features, affected genomic models (AGMs; strains, genotypes, and fish), phenotype annotations, molecular and genetic interactions, gene regulation annotations, high-throughput expression data set metadata (including for RNA-seq, single-cell RNA-seq, and proteomics data sets), species, reagents such as DNA clones and antibodies, images, persons, laboratories, companies, and various entity set classes like gene sets, which can be used for storing assay results and performing downstream analyses like ontology term enrichment, alignments, and other entity set processing calculations.
Fig. 7. Alliance pathway viewer. The pathway widget displays gene products (rectangles with gene names) and chemicals (rectangles with chemical abbreviations) and the flow of information and material between them (relations). These relations, shown in legend, indicate direct or indirect regulation that can be positive, negative, or of unknown effect direction. For metabolites that mediate the information flow between gene products, distinct shading distinguishes metabolites that are the inputs or outputs of a reaction.
Persistent store architecture
We have designed a powerful database system that can handle most of the demands of our project including curation, analysis, and display of the data (Fig. 8). Specifically, we created a database using Postgres for long-term and persistent storage of Alliance curated data contributed by Alliance member MODs. In parallel to the existing (drop-and-reload) data pipeline (Alliance 2022), DQMs from each MOD now submit data according to our new LinkML schema in JSON format directly to the persistent store for ingestion, validation, and curation via create–read–update–delete (CRUD) operations enabled by a curation API library and Prime React UI. A data pipeline has been established to provide data from the persistent store Postgres database to our Alliance public website APIs and front-end web UIs and to other tools and services.
LinkML-based JSON files are ingested into Postgres with validation to ensure (1) recognition of submitted entities such as genes, alleles, AGMs (e.g. strains, genotypes), publications, experimental conditions, and ontology terms; (2) recognition of references to such entities in annotations and associations; (3) no entry of duplicate entities; and (4) proper handling of obsolete entities. Every file load is accompanied by a report (in Postgres and the curation UI) indicating (1) the recognized MD5 sum and size of the (uncompressed) file submitted; (2) the success or failure of the load; (3) the number of entities recognized in the submitted file; (4) the number of distinct entities loaded into Postgres; (5) the number and identity of entities (if any) that failed to load and the reason for the failure; (6) a link to download the submitted file; (7) the corresponding compatible LinkML model/schema version; and (8) the MOD data release version corresponding to the data in the file submitted. This information can be used by DQMs, curators, and developers to keep track of the fidelity of the data transfer and troubleshoot any issues that arise.
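The validation checks and load report described above can be sketched in a few lines. This is not the Alliance loader: the function name `load_file`, the JSON keys `id` and `references`, the in-memory `store` dictionary standing in for Postgres, and the policy of rejecting obsolete entities outright are all assumptions made for illustration. Only a subset of the 8 report fields is modeled.

```python
import hashlib
import json
from typing import Any, Dict, Set


def load_file(raw_bytes: bytes,
              store: Dict[str, Any],
              known_reference_ids: Set[str],
              obsolete_ids: Set[str]) -> Dict[str, Any]:
    """Load entities from a JSON array into `store` and return a load report."""
    report: Dict[str, Any] = {
        "md5": hashlib.md5(raw_bytes).hexdigest(),  # checksum of submitted file
        "size_bytes": len(raw_bytes),
        "recognized": 0,   # entities found in the file
        "loaded": 0,       # distinct entities written to the store
        "failed": [],      # identity and reason for each failure
        "success": False,
    }
    try:
        entities = json.loads(raw_bytes.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        report["failed"].append({"id": None, "reason": f"unparseable file: {exc}"})
        return report
    if not isinstance(entities, list):
        report["failed"].append({"id": None, "reason": "expected a JSON array"})
        return report

    report["recognized"] = len(entities)
    for entity in entities:
        entity_id = entity.get("id") if isinstance(entity, dict) else None
        if not entity_id:
            reason = "missing identifier"
        elif entity_id in obsolete_ids:
            reason = "entity is obsolete"   # one possible policy, assumed here
        elif entity_id in store:
            reason = "duplicate entity"
        else:
            unknown = [r for r in entity.get("references", [])
                       if r not in known_reference_ids]
            reason = f"unrecognized references: {unknown}" if unknown else None
        if reason:
            report["failed"].append({"id": entity_id, "reason": reason})
        else:
            store[entity_id] = entity
            report["loaded"] += 1
    report["success"] = not report["failed"]
    return report
```

A report of this shape gives DQMs a direct comparison between the number of entities submitted and the number loaded, along with a per-entity reason for each failure.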
Ontology (and other external resource) loads are updated nightly to ensure that such data are current. The source of truth for MOD data will be transitioned over to the Alliance infrastructure in phases, beginning with a few data types from a few MODs and expanding over time to eventually include all (relevant) data types from all participating MODs; as part of this process, legacy issues with data are cleaned up.
To enable CRUD operations on persistent store data, curation APIs and a curation UI accessible with Okta authentication have been implemented (Fig. 9). Curators can now access data tables for the following data types: genes, alleles, variants, AGMs (e.g. strains, genotypes), publications [accessed via Alliance Bibliographic Central (ABC) APIs], experimental conditions, constructs, disease annotations, molecules [not already managed by Chemical Entities of Biological Interest (ChEBI)], ontology terms, and controlled vocabularies and their terms. CRUD operations have been fully enabled for disease annotations, experimental conditions, and controlled vocabularies; read–update operations have been enabled for alleles and variants; and read operations are enabled for the remaining data types. Work is underway to fully enable CRUD operations on all remaining data classes and their attributes including new data tables for transcripts, proteins, other (nongene) genome features, expression annotations, phenotype annotations, molecular interactions, genetic interactions, gene regulation annotations, antibodies, images, and more.
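The staged enabling of operations described above amounts to a per-data-type permission table. The sketch below encodes the levels stated in the text; the dictionary keys and the function name `is_allowed` are hypothetical identifiers chosen for this illustration and do not correspond to names in the curation API.

```python
# Operation levels as stated in the text; keys are hypothetical identifiers.
ENABLED_OPERATIONS = {
    "disease_annotation": "CRUD",
    "experimental_condition": "CRUD",
    "controlled_vocabulary": "CRUD",
    "allele": "RU",
    "variant": "RU",
}
DEFAULT_OPERATIONS = "R"  # remaining data types are read-only for now


def is_allowed(data_type: str, operation: str) -> bool:
    """Return True if the single-letter CRUD operation is enabled for the type."""
    op = operation.upper()
    if op not in ("C", "R", "U", "D"):
        raise ValueError(f"unknown operation: {operation!r}")
    return op in ENABLED_OPERATIONS.get(data_type, DEFAULT_OPERATIONS)
```

Keeping such a table in one place would let the permitted operations widen over time without changes to the code that checks them.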
In addition to data tables presenting all entries of a particular data class, the curation tool also has individual entity detail pages (for example, see an allele detail page at https://curation.alliancegenome.org/#/allele/MGI:6446761) for data entry and editing on a dedicated web page for 1 particular entity. The curation tool also enables user-specific and MOD-specific custom user settings and preferences to provide a UI most compatible with individual curators’ workflows.
In the next year, the curation tool will include batch creation of data entities (e.g. annotations, reagents), batch editing, data history inspection and auditing, undo and review of latest changes, publication constraints (constrain data view and entry to publication currently being curated), customizations and MOD default settings for new entity creation and detail pages, incorporation of data entity and topic tagging information from the ABC literature store (see below), and incorporation of artificial intelligence (AI)/machine learning (ML) into the curation workflow.
For releases of persistent store data to the Alliance public website, Postgres database snapshots are taken and sent to a separate Postgres instance that feeds the data via the curation APIs (instantiated as a library) into the public site indexer where various data filtering and transformations occur before making those processed data available to our public website APIs via our Elasticsearch index. The Alliance public website UI, using existing UI infrastructure, is then modified or created to accommodate the data now flowing from the persistent store database.
Fig. 8. Evolution of data flow. Graphical summary showing the design of short-term infrastructure initially deployed to support rapid delivery of unified data to the community and the planned production system. Red, data quartermasters at MODs; yellow, data; brown, database; green, transformations; blue, user interface.
Security, stability, and backups
All services and data provided by the Alliance to its community are hosted on Amazon Web Services (AWS). This provides us with industry-leading availability of up to 99.99% on services like EC2, which we use to host our virtual servers. We use additional AWS-managed services such as Elastic Beanstalk for application deployment, AWS Relational Database Service for hosting our relational (Postgres) databases, and Amazon OpenSearch Service for hosting our search indexes, which all provide automatic updates and maintenance for increased reliability. All files hosted at the Alliance of Genome Resources are stored in S3 buckets, which ensures industry-leading durability and availability. Furthermore, we make daily backups of our relational databases and have processes in place that enable easy restore of those backups in case of failure or data corruption. All search indexes are derived from the persistent relational database and can be regenerated at any moment when required.
We use separate subnets for public-facing and private systems, and only services requiring public access are given public IP addresses. Public-facing services such as our curation interface can therefore be accessed by our curators worldwide (through Okta Authentication), while supporting back-end services such as the databases are kept private. Access to all services is furthermore restricted to the required ports and services through the use of AWS Security Groups to control the allowed network traffic. AWS IAM users, groups, and roles are used to control the allowed AWS operations and access among Alliance developers. In all cases, the principle of least privilege is applied, so that the potential attack surface is reduced to a minimum (for example, by not granting blanket AWS admin permissions to developers who do not have an AWS admin function). Access keys to any system can be revoked when misuse of those access keys is detected. We also configured our GitHub repositories to be scanned automatically for accidental leakage of secret credentials through the use of GitGuardian software.
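As an illustration of the least-privilege principle described above, the following minimal sketch builds an IAM policy document that grants read-only access to a single S3 bucket and checks that it contains no blanket wildcard grants. The bucket name and the helper functions are hypothetical and are not taken from the Alliance configuration; only the general IAM policy JSON layout (Version, Statement, Effect, Action, Resource) is standard.

```python
import json


def read_only_bucket_policy(bucket: str) -> dict:
    """Build an IAM policy document allowing read-only access to one S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading applies to the objects inside the bucket.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


def as_list(value):
    """IAM allows a single string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def grants_wildcard(policy: dict) -> bool:
    """Return True if any Allow statement uses a blanket '*' action or resource."""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        if "*" in as_list(statement["Action"]) or "*" in as_list(statement["Resource"]):
            return True
    return False


# Hypothetical bucket name, used only for illustration.
policy = read_only_bucket_policy("example-alliance-downloads")
print(json.dumps(policy, indent=2))
print("blanket grant present:", grants_wildcard(policy))
```

A check of this kind could run in continuous integration so that a policy granting `*` on `*` is flagged before it is attached to a user, group, or role.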
Literature acquisition
We designed and are implementing a literature system, ABC, that will support curation and, in the future, end users. The ABC supports the tasks of reference acquisition, triage, and curation workflow management. Specifically, the ABC is an ecosystem of online tools and supporting Alliance databases that manage all references and related metadata that are “in corpus” for the member MODs.
Literature acquisition at the Alliance begins with automated, organism-specific PubMed queries to retrieve candidate references for each MOD’s corpus. References matching the search criteria are then added to the ABC by assigning an Alliance reference identifier and importing associated bibliographic information to the database. Subsequently, curators manually sort references as either “in” or “out of corpus” based on the curation policies of the MOD and eliminate any false positive results from the initial search. While many thousands of papers are published each year, only some have information that is currently curated. For example, in 2022, the curatable literature size after triage was 3,181 for ZFIN, 3,221 for SGD, 2,130 for FlyBase, 1,419 for WormBase, and 437 for Xenbase.
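The first step above, an automated organism-specific PubMed query, can be sketched with the public NCBI E-utilities ESearch service. This is a minimal sketch under stated assumptions: the search terms, the restriction to title and abstract, and the function name are illustrative and do not reproduce the queries actually used by any MOD. The code only builds the request URL and performs no network access.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Public NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(terms, year, retmax=200):
    """Return an ESearch URL for papers mentioning any term in a given year."""
    # OR the organism terms together, each limited to title/abstract.
    term = " OR ".join(f'"{t}"[Title/Abstract]' for t in terms)
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",    # filter on publication date
        "mindate": str(year),
        "maxdate": str(year),
        "retmode": "json",
        "retmax": retmax,      # maximum number of PMIDs returned
    }
    return f"{ESEARCH}?{urlencode(params)}"


# Illustrative terms only; a production query would be tuned by curators.
url = build_pubmed_query(["zebrafish", "Danio rerio"], 2022)
print(url)
print(parse_qs(urlparse(url).query)["term"][0])
```

The PMIDs returned by such a query would be candidates only; the manual “in” or “out of corpus” sorting described above still decides what is kept.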
Once references are sorted, they enter MOD-specific curation workflows supported by task-specific ABC curator interfaces to, for example, add reference files, manually tag references with specific entities (e.g. genes, alleles, and data types) and topics (e.g. phenotypes, anatomic expression) using the Alliance Tags for Papers (ATP) ontology, and merge duplicate references. In addition to adding reference files manually, the full text of “in corpus” references included in the PubMed Central (PMC) open access set is also automatically downloaded. Curators may also use the ABC to add non-PubMed references. An additional key feature of the ABC is a search interface that allows curators to retrieve references based on various criteria, including their in/out of corpus status, bibliographic data, and publication date range, if desired. Reference acquisition functionality can easily be extended to integrate additional MODs into the Alliance infrastructure.
Fig. 9. Alliance curation tool. Screenshot of the Alliance curation tool interface showing an example of curated annotations of AGMs managed in the persistent store.
To facilitate reference data exchange between the Alliance and MOD databases, the MODs provide a mapping file that associates MOD reference Compact Uniform Resource Identifiers (CURIEs) with PMIDs, e.g. ZFIN:ZDB-PUB-181026-2 - PMID:30352852. The MODs also provide reference CURIEs and data for references not included in PubMed but used by the MOD, such as internal curation references and those published in a journal not yet indexed at PubMed.
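A mapping file of this kind could be read as sketched below. The line format is an assumption modeled on the single example given in the text (MOD CURIE, a hyphen surrounded by spaces, then a PMID CURIE); the actual file layout exchanged by the MODs is not specified here. Lines that carry only a MOD CURIE, as would be expected for references not in PubMed, are collected separately. The second identifier in the usage example is a made-up placeholder.

```python
import re

# Assumed line format, modeled on the example in the text:
#   ZFIN:ZDB-PUB-181026-2 - PMID:30352852
MAPPING_LINE = re.compile(r"^(?P<mod>[A-Za-z]+:\S+)\s+-\s+(?P<pmid>PMID:\d+)$")


def parse_mapping(lines):
    """Split lines into a {MOD CURIE: PMID CURIE} dict and a list of unmatched lines."""
    mapping = {}
    unmatched = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = MAPPING_LINE.match(line)
        if match:
            mapping[match.group("mod")] = match.group("pmid")
        else:
            # For example, a MOD reference with no PubMed record.
            unmatched.append(line)
    return mapping, unmatched


sample = [
    "ZFIN:ZDB-PUB-181026-2 - PMID:30352852",
    "",
    "ZFIN:ZDB-PUB-000000-0",  # hypothetical non-PubMed reference
]
mapping, unmatched = parse_mapping(sample)
print(mapping)
print(unmatched)
```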
Over the past 25–30 years, Alliance member databases have independently developed methods to acquire, triage, and curate their respective literatures. Having implemented a common literature curation interface, database, and full-text acquisition system, the ABC is now poised to expand its functionality by extending ML methods developed by, and in production for, a subset of Alliance members to all groups. For example, automated pipelines that recognize entities (e.g. genes, alleles, and strains) as well as data types (e.g. phenotype, genetic interactions) can be developed for new groups with results stored centrally in the Alliance literature database. Incorporating more automated methods will allow faster association of the published literature with relevant biological concepts, information that can be displayed on future Alliance reference pages while the papers await detailed full curation. Centralized literature infrastructure will also support other curation pipelines, such as community curation by authors, which can then be more readily implemented for additional Alliance member communities, thus providing another avenue by which curated data can be swiftly included in the Alliance. Lastly, the common literature tool will allow Alliance biocurators to coordinate curation of multispecies references, which will give users a facile way to find and view cross-species research exploiting the strengths of each Alliance model organism, a primary goal of the Alliance.
Textpresso
Textpresso is a full-text literature search engine that derives its power from its single-sentence search scope, its focus on a specific model organism (or topic), and its categories of semantically or biologically related terms (Fig. 10; Müller et al. 2004, 2018). It has been used extensively by WormBase and SGD curators, as well as C. elegans and Saccharomyces cerevisiae researchers, in addition to other MODs (Van Auken et al. 2012; Bowes et al. 2013).
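The idea of a single-sentence scope combined with term categories can be illustrated with a toy example. This is not Textpresso code: the categories, their terms, and the naive sentence splitter are all hypothetical simplifications. A sentence is reported only if it contains at least one term from every requested category, whereas a document-scope search would also match when the terms occur in different sentences.

```python
import re

# Toy categories with a handful of hypothetical terms (not a Textpresso lexicon).
CATEGORIES = {
    "gene": {"daf-16", "daf-2", "lin-12"},
    "regulation": {"regulates", "represses", "activates"},
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_hits(text, required_categories):
    """Return sentences containing at least one term from every required category."""
    hits = []
    for sentence in split_sentences(text):
        # Lowercased word tokens; hyphens are kept so gene names stay intact.
        tokens = set(re.findall(r"[\w-]+", sentence.lower()))
        if all(tokens & CATEGORIES[name] for name in required_categories):
            hits.append(sentence)
    return hits


text = (
    "DAF-2 regulates DAF-16 activity. "
    "The lin-12 locus was mapped. "
    "Signaling represses growth."
)
print(sentence_hits(text, ["gene", "regulation"]))
```

Only the first sentence is returned, because it is the only one in which a gene term and a regulation term co-occur.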
The Alliance is committed to creating Textpresso instances tailored to the unique needs of each member database, all of which will be managed within the Alliance software ecosystem and connected to the ABC as a single reference data source. This will reduce the overhead of managing Textpresso at individual MODs while also simplifying development and deployment of new features. Users will benefit from simplified access to Textpresso from the Alliance website. We also plan to integrate Textpresso searches further into specific Alliance web pages such as gene or allele pages. Users will be able to obtain additional references to biological entities through Textpresso searches, adding information from potentially noncurated literature to the list of curated references currently linked on those pages. Textpresso will be available to Alliance biocurators and to the general public through MOD-customized websites and via APIs for programmatic access.
AI
The Alliance member MODs have a track record of implementing ML tools to enhance literature triage and curation efficiency. Notable examples include RGD’s early adoption of standard software architectures such as Unstructured Information Management Architecture (UIMA, an Apache.org project) and the development of the OntoMate system (Liu et al. 2015), an ontology-driven literature search engine, as well as WormBase’s creation of Textpresso (Müller et al. 2004) and document classifiers for paper triage.
The rise of large language models (LLMs), such as BERT (Bidirectional Encoder Representations from Transformers) and ChatGPT, has transformed the natural language processing (NLP) landscape, but questions about their accuracy and “hallucinations” remain. The Alliance is developing LLMs for tasks such as document classification, named entity recognition (NER), sentence classification, and computationally assisted triage and curation, and to build a natural language query system to simplify access to its underlying structured data.
Alliance members have developed AI/ML classifiers for determining with high accuracy whether papers returned from automated PubMed queries should be kept in their corpus or discarded (Jiang et al. 2020) and classifiers that can determine whether specific data types relevant for curation are present in a document (Fang et al. 2012). The Alliance is developing a central solution by providing these types of classifiers to all members.
Efforts are also underway to improve existing species-specific entity extraction and classification models, with a focus on incorporating human feedback in the loop and continuously training models based on data validated by professional biocurators and community curators. A centralized interface for “topic and entity tag” addition and validation during literature triage and curation is under development as part of the ABC. The interface allows curators to associate tags with publications and at the same time validate (or invalidate) results extracted from AI/ML methods. This interface will streamline the collection of valuable training and testing sets and will allow a more systematic approach to the creation and comparison of different AI/ML models. Furthermore, the Alliance is adopting Evidence and Conclusion Ontology (ECO) terms to record systematically the type of evidence, e.g. neural network method evidence, and assertion method, e.g. automatic assertion, used for reference flagging and triage. This is especially relevant for topic and entity tags. Using ECO terms aligns with FAIR data principles and offers transparency in curation workflows.
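To make the validation loop concrete, the sketch below shows one way curator-validated tags might be recorded and then used to compare AI/ML models. It is a hedged illustration only: the record fields, model names, and reference identifiers are hypothetical and do not describe the ABC schema, and evidence is stored as a plain ECO label rather than a term identifier. Precision is computed over reviewed tags only, so tags still awaiting a curator do not affect the comparison.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TagRecord:
    reference: str                     # reference identifier (hypothetical values below)
    topic: str                         # topic or entity label attached to the paper
    source: str                        # model that proposed the tag (hypothetical names)
    evidence: str                      # ECO evidence label, e.g. from the text above
    validated: Optional[bool] = None   # True/False set by a curator; None = not reviewed


def precision_by_source(records):
    """Fraction of curator-reviewed tags confirmed as correct, per model."""
    confirmed = defaultdict(int)
    reviewed = defaultdict(int)
    for record in records:
        if record.validated is None:
            continue  # unreviewed tags carry no information yet
        reviewed[record.source] += 1
        if record.validated:
            confirmed[record.source] += 1
    return {source: confirmed[source] / reviewed[source] for source in reviewed}


EVIDENCE = "neural network method evidence"
records = [
    TagRecord("REF:0001", "phenotype", "model_a", EVIDENCE, True),
    TagRecord("REF:0002", "phenotype", "model_a", EVIDENCE, True),
    TagRecord("REF:0003", "phenotype", "model_a", EVIDENCE, False),
    TagRecord("REF:0004", "phenotype", "model_a", EVIDENCE, None),
    TagRecord("REF:0001", "phenotype", "model_b", EVIDENCE, True),
]
print(precision_by_source(records))
```

The same validated records could also serve as training and testing sets, which is the streamlining the interface is intended to provide.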
Fig. 10. Textpresso for SGD literature at the Alliance (http://sgd-textpresso.alliancegenome.org/tpc/search).
APIs
APIs are a key component of Alliance Central’s data service infrastructure for rapid, modular software development. We currently support a dozen APIs with hundreds of endpoints (Fig. 11). New APIs will be added as data harmonization and modeling of additional data entities are completed. We will expand public site APIs to generate all data needed for SimpleMine, AllianceMine, etc. from single endpoints. Current APIs include public site APIs (agr_java_software in the GitHub repo) and APIs available from a public Swagger UI page. Because the public APIs support only GET endpoints, they do not require authentication. All APIs that support both GET and PUT/POST/DELETE endpoints do require authentication. Key API endpoints available at https://www.alliancegenome.org/swagger-ui/ include gene-summary, gene-disease, gene-interactions, homologs-species, allele-phenotypes, and expression ribbon-summary.
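Because the public endpoints accept unauthenticated GET requests, a client needs only the Python standard library. The sketch below is illustrative: the base path, the resource path layout, and the gene identifier are assumptions and should be checked against the Swagger UI page before use. The request is constructed but not sent; the `fetch_json` helper shows how it could be sent.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed base path; consult the Swagger UI page for the actual routes.
BASE_URL = "https://www.alliancegenome.org/api"


def build_get_request(resource: str, identifier: str) -> Request:
    """Build an unauthenticated GET request for one record of a resource."""
    # Keep ':' unescaped so CURIE-style identifiers stay readable in the path.
    url = f"{BASE_URL}/{resource}/{quote(identifier, safe=':')}"
    return Request(url, headers={"Accept": "application/json"}, method="GET")


def fetch_json(request: Request, timeout: float = 30.0):
    """Send the request and decode the JSON body (performs network access)."""
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Illustrative identifier, not checked against the live service.
request = build_get_request("gene", "MGI:97490")
print(request.get_method(), request.full_url)
```

Endpoints that also accept PUT, POST, or DELETE would additionally need credentials, which are deliberately omitted here.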
Data preservation in external repositories
The Alliance of Genome Resources is committed to the long-term preservation of digital objects (annotations) and resources (e.g. ontologies and software) that are central to the management and integration of functional knowledge about the genomes of diverse model organisms. As part of this commitment, the annotations and resources generated by Alliance members are integrated into many long-standing external public bioinformatic resources (e.g. Ensembl, UniProt, and NCBI). Distribution of Alliance annotations from multiple sources provides a degree of redundancy that contributes to data stability and preservation. Alliance-maintained ontologies and annotations are also deposited into third-party repositories that fulfill Open Science principles (see below). Leveraging community repositories ensures the data products and resources remain accessible to the research community even if the Alliance and/or its members cease operations.
Ontologies that Alliance members maintain are also available from long-term repositories including the OBO Foundry (https://obofoundry.org/) and Zenodo (zenodo.org).
Annotations related to gene expression, function, phenotype, disease associations, etc. that are generated by Alliance members and are available on the Alliance Data Downloads page are archived in Zenodo. Software developed as part of the Alliance of Genome Resources knowledge commons platform is available from GitHub (https://github.com/alliance-genome).
The external repositories used by the Alliance of Genome Resources include the OBO Foundry, which was established in the early 2000s as a community-based initiative for development and maintenance of biological and biomedical ontologies using standardized practices. The Foundry is the ontology repository of choice for the Alliance because it is widely recognized as an authoritative source of well-maintained ontologies for biology and biomedical research.
Zenodo is a general purpose repository maintained by CERN (European Organization for Nuclear Research) for storing and sharing documents, data, and other digital research materials across many disciplines. Zenodo is a repository of choice for the Alliance, in part, because of the commitment by the European Commission to support Zenodo as long as CERN exists.
Outreach and interactions
The Alliance help desk
We established a common help desk email address (help@alliancegenome.org) that is featured prominently on the Alliance website header and footer under “Contact Us.” All inquiries submitted using this email are logged as tickets in the Alliance Jira software system. Members of the User Support Working Group respond to user questions and inquiries in a timely manner, typically within 48 h. Time to resolve user inquiries depends on the nature of the question or request. The Jira system tracks open tickets, forwards tickets, records their active/resolved status, and classifies them by subject. We use the information, in part, to evaluate the design and utility of our UIs. For example, if particular questions or subjects arise frequently, we reevaluate the design and wording of the search form and/or results display that caused user confusion.
Online documentation
We provide extensive user documentation about using the Alliance data resources under the Help menu on the homepage (https://www.alliancegenome.org/help). The online documentation provides guidance on such topics as how to use the search functions, defines acceptable field parameters, and provides explanations of the displayed results. The User Support Working Group also works closely with the User Interface Working Group and the Developers to craft text for tooltips displayed on UIs.
Frequently Asked Question pages
The Frequently Asked Question (FAQ)/Known Issues page provides answers to commonly asked questions about the Alliance and also describes any known issues associated with a particular software release. The link to the FAQ page is featured prominently on the Alliance home page under the Help menu.
Illustrated tutorials and videos
We maintain several types of tutorial options that are accessible
from the Help menu (https://www.alliancegenome.org/tutorials).
The most requested types of tutorials are illustrated guides with
screenshots on how to use various features of the Alliance web
Fig. 11. Swagger interface for the Alliance APIs.