Updates to the Alliance of Genome Resources central infrastructure


The Alliance of Genome Resources Consortium

A full list of members is provided at the end of this article.

The Alliance of Genome Resources (Alliance) is an extensible coalition of knowledgebases focused on the genetics and genomics of intensively studied model organisms. The Alliance is organized as individual knowledge centers with strong connections to their research communities and a centralized software infrastructure, discussed here. Model organisms currently represented in the Alliance are budding yeast, Caenorhabditis elegans, Drosophila, zebrafish, frog, laboratory mouse, laboratory rat, and the Gene Ontology Consortium. The project is in a rapid development phase to harmonize knowledge, store it, analyze it, and present it to the community through a web portal, direct downloads, and application programming interfaces (APIs). Here, we focus on developments over the last 2 years. Specifically, we added and enhanced tools for browsing the genome (JBrowse), downloading sequences, mining complex data (AllianceMine), visualizing pathways, full-text searching of the literature (Textpresso), and sequence similarity searching (SequenceServer). We enhanced existing interactive data tables and added an interactive table of paralogs to complement our representation of orthology. To support individual model organism communities, we implemented species-specific “landing pages” and will add disease-specific portals soon; in addition, we support a common community forum implemented in Discourse software. We describe our progress toward a central persistent database to support curation, the data modeling that underpins harmonization, and progress toward a state-of-the-art literature curation system with integrated artificial intelligence and machine learning (AI/ML).

Keywords: database; knowledgebase; software; text mining; data integration; Drosophila; yeast; Caenorhabditis elegans; zebrafish; mouse

Received on 20 November 2023; accepted on 29 February 2024

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

As has been discussed at length elsewhere (e.g. Oliver et al. 2016; Wood et al. 2022), model organism knowledgebases [aka model organism databases (MODs)] provide daily utility to researchers for the design and interpretation of experiments, to computational biologists for curated data sets, and to genomic researchers for annotated genomes. Some of the major uses of the MODs have been 1-stop shopping for all information about a particular gene or obtaining cleansed data sets with standard metadata for computational analyses.

The Alliance of Genome Resources (referred to herein as the Alliance) is a consortium of MODs and the Gene Ontology Consortium (GOC). The mission of the Alliance is to support comparative genomics to investigate the genetic and genomic basis of human biology, health, and disease. To promote sustainability of the core community data resources that make up the Alliance, we implemented an extensible “knowledge commons” platform for comparative genomics built with modular, reusable infrastructure components that can support informatics resource needs across a wide range of species (Howe et al. 2018; Alliance of Genome Resources 2022; Bult and Sternberg 2023). In 2022, the Alliance was recognized as a Core Global Biodata Resource by the Global Biodata Coalition (Anderson et al. 2017).

Specifically, the Alliance of Genome Resources is organized as 2 interdependent units: Alliance Central and the Alliance Knowledge Centers. Alliance Central is responsible for developing and maintaining the software for data access and for the coordination of data harmonization and data modeling activities across our members. A primary goal of Alliance Central is to reduce redundancy in systems administration and software development for model organism knowledgebases and to deploy a unified “look and feel” for access to, and display of, common data types and annotations across diverse model organisms and human, following findability, accessibility, interoperability, and reuse (FAIR) guiding principles. Model organism-specific knowledgebases serve as Alliance Knowledge Centers. Knowledge Centers are responsible for expert curation and submission of data to Alliance Central using Alliance Central infrastructure. Knowledge Centers also are responsible for organism-specific user support activities and for providing access to data types not yet supported by Alliance Central. The founding Alliance Knowledge Centers are Saccharomyces Genome Database (SGD; Engel et al. 2022), WormBase (Davis et al. 2022; Sternberg et al. 2024), FlyBase (Gramates et al. 2022), Mouse Genome Database (Ringwald et al. 2022), the Zebrafish Information Network (Bradford et al. 2023), Rat Genome Database (Vedi et al. 2023), and the GOC (Gene Ontology Consortium 2023). The newest member, Xenbase (Fisher et al. 2023), joined the Alliance consortium in 2022.

Here, we describe our progress toward harmonizing information provided by our member resources, our development of a software infrastructure for ingest, curation, storage, analysis, and output of such information, and development of an efficient literature curation system. We start by describing new features in our web portal at AllianceGenome.org.

GENETICS, 2024, 227(1), iyae049

https://doi.org/10.1093/genetics/iyae049

Advance Access Publication Date: 29 March 2024

Knowledgebase & Database Resources

The web portal

Community homepages

The Alliance website features landing pages for each model organism in the Alliance consortium. These pages are accessed from the “Members” drop-down menu in the header on every Alliance page. These pages feature MOD-specific content such as meetings, news, and other MOD-specific resource links. A common template allows users to find the same types of information in each landing page (Fig. 1). As MODs transition their data and web services to the Alliance, their member pages will evolve into portals hosting additional MOD-specific data, tools, and links to organism-specific resources.

Xenopus in the Alliance

Xenbase, the Xenopus knowledgebase (Fisher et al. 2023), is the first knowledgebase to join the Alliance since the founding members initiated the consortium. Xenopus is a frog genus used extensively in biomedical research and in particular for experimental embryology, cell biology, and disease modeling with genome editing (Carotenuto et al. 2023; Kostiuk and Khokha 2021). As a nonmammalian air-breathing tetrapod, Xenopus represents a valuable evolutionary transition between rodents and zebrafish for comparative genomic studies. Xenbase is built on the same underlying data schema (structure) as FlyBase (Chado). Two different Xenopus species are used interchangeably as a model system: Xenopus tropicalis is a diploid that is the preferred system for genome editing and genetics, whereas Xenopus laevis is an allotetraploid preferred for use in cell biology studies, microinjection, and microsurgery-style experimentation. Xenopus tropicalis has 1:1 relationships between most genes and human orthologs (excluding paralogs; Mitros et al. 2019), whereas X. laevis has 2 copies of most human orthologs. The allotetraploid formed via hybridization of 2 different frog species (Session et al. 2016), and the complexities of genome evolution that subsequently occurred increase the difficulty of identifying orthology of the 2 X. laevis genes to their diploid relatives, including humans. Mapping of the diploid X. tropicalis genes to their human orthologs was performed as with the other organisms in the Alliance (see below). Because this method does not yet work in the context of an allotetraploid, the Alliance imports the X. tropicalis–X. laevis paralogy mappings from Xenbase, where they have been established through a combination of synteny analysis and manual curation; this was one major challenge in adding Xenopus to the Alliance.

Xenbase created software to upload content on a regular schedule formatted for the current Alliance data ingest schema. Currently, these data include orthology, the Xenopus anatomical ontology, standard gene information, gene expression data, publications, GO term associations, disease associations, anatomical phenotypes, and genome details. Xenopus genes can be found using the Alliance landing page search tool with Xenopus genes flagged by Xtr and Xla notations. The 2 copies of the genes in X. laevis, the allotetraploid, are further tagged as “(symbol).L” and “(symbol).S” to denote the genes on the long (L) and short (S) chromosome pairs of this species (e.g. pax6.L and pax6.S). Alliance release 6.0.0 has Xenbase data for 54,000 genes, 19,000 disease associations, over 45,000 gene expression records, and more than 7,000 anatomical phenotypes. Expression and phenotype data will be available in about a year.
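The “(symbol).L” and “(symbol).S” convention described above lends itself to simple programmatic handling. The following is a minimal sketch, not Alliance or Xenbase code: the names `XenopusSymbol`, `parse_xenopus_symbol`, and `group_copies` are hypothetical, and the sketch assumes that the suffix is always the final dot-separated token of the symbol.

```python
from typing import NamedTuple, Optional


class XenopusSymbol(NamedTuple):
    base: str                  # gene symbol without the chromosome-pair suffix
    subgenome: Optional[str]   # "L", "S", or None when no suffix is present


def parse_xenopus_symbol(symbol: str) -> XenopusSymbol:
    """Split a symbol such as 'pax6.L' into its base symbol and L/S tag."""
    base, dot, suffix = symbol.rpartition(".")
    if dot and suffix in ("L", "S"):
        return XenopusSymbol(base, suffix)
    # No recognized suffix (e.g. an X. tropicalis symbol): keep it whole.
    return XenopusSymbol(symbol, None)


def group_copies(symbols):
    """Group symbols by base name so the L and S copies sit together."""
    groups = {}
    for symbol in symbols:
        groups.setdefault(parse_xenopus_symbol(symbol).base, []).append(symbol)
    return groups
```

For example, `group_copies(["pax6.L", "pax6.S"])` places both copies under the key `pax6`, which is one way the 2 X. laevis copies could be presented side by side.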

In addition to the rich data made available to the Alliance from Xenopus research, this effort also served as a valuable test case for understanding the level of effort and complexities engendered in the addition of new knowledgebases to the Alliance and the functionality and adaptability of ingest system components.

Fig. 1. MOD landing pages at the Alliance portal. A common look and feel that allows community-specific content.

New gene page section: paralogy

Gene pages now include a paralogy section populated with data from the Drosophila Research & Screening Center (DRSC) Integrative Ortholog Prediction Tool (DIOPT) version 9.1 developed by the DRSC (Hu et al. 2011, 2021). The assembly of protein sets and algorithmic inferences of their orthology from various sources was first centralized by the DRSC and then exported to Alliance Central. We include the same data sources used for orthology, when these resources also provide paralogy information. Specifically, these resources have performed well on the standardized benchmarking from the Quest for Orthologs (QfO) Consortium (Nevers et al. 2022). Orthologous Matrix (OMA; Altenhoff et al. 2021) and PANTHER (Thomas et al. 2022) data sets were retrieved through the QfO benchmark portal (https://orthology.benchmarkservice.org), and Compara data were acquired directly from the EBI Compara FTP site. In addition, the DRSC conducted local analyses using InParanoid (Persson and Sonnhammer 2022), OrthoFinder (Emms and Kelly 2019), OrthoInspector (Nevers et al. 2019), and SonicParanoid (Cosentino and Iwasaki 2019) using the UniProt 2020 reference proteome set (UniProt Consortium 2023), the same set used in the downloaded data sets, to ensure consistency. Direct data submissions from PhylomeDB (Fuentes et al. 2022) and the SGD (Engel et al. 2022) were also integrated into the data set.

The new paralogy section comprises a table (Fig. 2), similar to the orthology table, that contains the gene symbol of related paralogs, a calculated rank, alignment length as the number of aligned amino acids, percentage of similarity and identity, and a count of the algorithms or methods that call the paralogous match. The ranking score was developed to sort the paralogs by overall similarity and was reviewed by curators to display optimally an acceptable rank order for well-studied sets of paralogs. The ranking score considers several factors, including alignment length, percent identity, and the number of paralogy methods that identify the paralog. Additional information for rank determination and alignment length is available to the users via a clickable help icon located next to those column headers.
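To make the idea of a multi-factor ranking concrete, the sketch below sorts paralogs by a weighted sum of the 3 factors named above. It is an illustration only: the actual Alliance scoring formula and weights are not given in this article, so the equal default weights, the normalization, and the names `ParalogCall` and `rank_paralogs` are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ParalogCall:
    symbol: str
    alignment_length: int    # number of aligned amino acids
    percent_identity: float  # 0 to 100
    method_count: int        # number of algorithms calling this paralog


def rank_paralogs(calls, total_methods, w_length=1.0, w_identity=1.0, w_methods=1.0):
    """Return paralogs sorted best-first by a weighted sum of normalized factors.

    The weights are hypothetical placeholders, not the Alliance's values.
    """
    if total_methods <= 0:
        raise ValueError("total_methods must be positive")
    if not calls:
        return []
    # Normalize alignment length against the longest alignment in the set.
    max_length = max(1, max(c.alignment_length for c in calls))

    def score(call):
        return (
            w_length * call.alignment_length / max_length
            + w_identity * call.percent_identity / 100.0
            + w_methods * call.method_count / total_methods
        )

    return sorted(calls, key=score, reverse=True)
```

Under such a scheme, a paralog with high identity can outrank one that is supported by more methods, which is the kind of outcome described in the Fig. 2 legend.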

The paralog section was released with Alliance version 6.0.0. Forthcoming updates will include the ability to sort and filter the table by column values and the availability of these data via our bulk downloads page. The existing tables on the gene pages for Function, Disease, and Expression all contain checkboxes for “Compare Ortholog Genes” that allow users to search across species for these features. We will add the additional checkbox “Compare Paralog Genes” to provide similar functionality for paralogous genes in a future Alliance release.

JBrowse sequence detail widget

The recent Alliance 6.0.0 release includes a new “Sequence Detail” section of all gene pages that uses JBrowse and JavaScript libraries to display an interactive widget that allows users to download DNA and amino acid sequences of genes in several possible configurations: genomic sequence highlighted with UTRs, coding and intronic regions, CDS regions, and translated protein, for example (Fig. 3). In the next few releases, we will extend the functionality of the widget to variant detail pages, where both the wild-type and variant sequences will be provided. When the variant occurs in the context of a protein coding gene, changes to the coding sequence and resulting translated protein will also be displayed and available for download.

Model organism BLAST

For more than 2 decades, some of the MOD members of the Alliance have hosted their own custom BLAST interfaces (Altschul et al. 1990; e.g. FlyBase Consortium 1999) that have allowed users to search custom databases related to those model organisms, e.g. subsets of related species or molecular clones, and display BLAST hits in Genome Browsers aligned with current gene models. We are now developing an updated and integrated Alliance BLAST, built on SequenceServer (Priyam et al. 2019), that optimizes sequence analysis across model organisms. We have begun to update BLAST for the individual MODs. The new WormBase BLAST is now available online and can currently be accessed via the tools menu on wormbase.org. The results are linked to Genome Browsers and Alliance gene pages (Fig. 4). This tight connection allows users to navigate seamlessly between their BLAST results and the wealth of information available within the Alliance, enhancing the efficiency and depth of genetic research. For example, users can retrieve BLAST results for a sequence of interest and then easily navigate across Genome Browsers for different organisms, with a comparison to different tracks revealing how that sequence aligns with gene models, variants, and experimental tools (Fig. 5). From a project perspective, developing Alliance BLAST with a common cloud-optimized infrastructure will increase efficiency by reducing the cost of compute overhead and eliminating the need to manage separate MOD systems, which will then allow more focus on developing new functionality to support researchers. Our focus in the upcoming year is directed toward enhancing the user interface (UI), reflecting our commitment to providing an intuitive platform for researchers in model organism genetics. We plan to produce analysis tools as part of the evolving Alliance portal, thereby broadening the range of resources available for genetic research within the community.

Fig. 2. Paralog table for C. elegans hlh-25. The table presents a ranking of paralogs for the hlh-25 gene, based on a weighted scoring algorithm that incorporates sequence conservation metrics. It lists the gene symbols, provides the alignment length in amino acids, and quantifies the similarity and identity percentages of genes paralogous to hlh-25. The methodology count, indicating the number of algorithms supporting the paralogous relationship, is also included. In this ranking, hlh-27 is identified as the primary paralog due to its high similarity and identity scores, despite being recognized by fewer methods than hlh-28.

AllianceMine

AllianceMine, a sophisticated, multifaceted search and retrieval tool that utilizes the InterMine software (Smith et al. 2012), offers a unified view of harmonized data, enabling advanced queries across multiple species. For instance, gene lists can be supplied as input to query different annotations simultaneously, such as “Show me genes associated with a (specific disease term)” (Fig. 6). The results from queries can be combined for further analysis and saved or downloaded in customizable file formats. Queries themselves can be customized by modifying predefined templates or by creating new templates to access a combination of specific data types. Thus, this powerful tool can be used in multiple ways, namely, for search, discovery, curation, and analysis.

Fig. 3. Sequence detail widget. Chosen views of a specific gene are readily available for copying as plain text or with highlights. ′ region of the human PLAA gene.

Fig. 4. Screenshot of results from the Alliance SequenceServer BLAST tool. The results have been enhanced relative to the default SequenceServer results page by the addition of links to Alliance JBrowse and to the corresponding gene page (in this case C. elegans abi-1) at the Alliance website for each BLAST hit.

AllianceMine currently showcases harmonized data encompassing genes, diseases, GO, orthology, expression, alleles, variants, and FASTA formatted genome sequences. The tool also offers predefined queries or “templates” for cross-species searching. Continual optimization will ensure timely data synchronization with the main Alliance site, as well as integration of newly harmonized data types. Another aspect of improvement will be the addition of more templates, widgets, and precompiled lists, which can serve as logical input for templated queries.

SimpleMine

We designed SimpleMine for biologists to get essential information for a list of genes without any command-line or programming skill, or the patience to learn the awesome power of AllianceMine discussed above. Users can submit a list of gene names or IDs to access more than 20 types of essential data with which they are associated. The results are 1 line per gene with detailed information separated by 4 types of separators: tab, comma, bar, and semicolon. Users can choose to display the output as HTML or to download a tab-delimited file. Alliance SimpleMine contains 10 species curated by the Alliance MODs. It provides easy gene name/ID conversion among MOD ID, public name (the commonly used name for public consumption), NCBI, PANTHER, Ensembl, and UniProtKB. Users can find summarized anatomic and temporal expression patterns, variants, and genetic and physical interactions. Other essential gene information includes disease association and orthologs among all 10 species. The infrastructure of SimpleMine allows users to perform species-specific searches for lists of genes that take about 2 s to return results, or mixed-species searches that take about 10 s to complete.
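A downloaded tab-delimited SimpleMine file can be read with a few lines of standard-library code. The sketch below is a hedged illustration: it assumes that tabs separate columns, that the first row is a header, and that a secondary separator (here the bar) divides multiple values within one field. The article does not specify how the 4 separators are nested, and the column names used with this code would depend on the actual download; `parse_simplemine` and `split_values` are hypothetical helper names.

```python
import csv
import io


def parse_simplemine(text):
    """Read tab-delimited text with a header row into one dict per gene."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


def split_values(field, separator="|"):
    """Split a multi-valued field on an assumed separator; drop empty items."""
    return [value.strip() for value in field.split(separator) if value.strip()]
```

Because each gene occupies exactly 1 line, the resulting list of dictionaries can be passed directly to downstream filtering or joined to other gene lists by identifier.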

Pathway displays with metabolites (GO Causal Activity Models)

We implemented a pathway display on Alliance gene pages that presents both GO Causal Activity Model (GO-CAM; Thomas et al. 2019) and Reactome pathway (Milacic et al. 2024) models. The display queries both the Reactome and GO application programming interfaces (APIs) and shows the number of pathways from each resource that contain the gene of interest. If a gene appears in multiple pathways, users can select which pathway to display. For the GO-CAM models, the viewer has been improved relative to previous releases of the Alliance website (Fig. 7). First, the layout has been improved to show clearly the overall causal flow through a pathway, from top to bottom and branching as necessary. Second, the viewer displays not only the activities of genes/proteins in a pathway but also metabolites, which is particularly useful for visualizing metabolic pathways. These metabolites may be either intermediates in a pathway or regulators of a protein activity. For signaling pathways, we distinguish between direct and indirect regulations and between positive, negative, or unknown effects.

Fig. 5. Output of a BLAST search. After a user clicks on the JBrowse link for a BLAST hit, they are directed to the web service where they will see a track for the BLAST hit and how the hit aligns with other tracks.

Fig. 6. AllianceMine example. Using a simple template, a disease ontology (DO) term, in this case “autism,” is chosen, and all genes associated with this DO term are returned in a downloadable table.

Harmonized data models

The transition of data from individual MODs to the Alliance infrastructure requires data harmonization so that existing analogous MOD data classes (types/tables) can be loaded into Alliance databases using a consistent schema and language. The first step is for biocurators from each Alliance knowledge center to agree on which data classes are analogous and can be treated as a single, consolidated data class. The biocurators then align the properties (table columns) of the consolidated data class, including identifiers, types of values, and whether entity–property–value associations/triples require their own respective metadata and/or evidence records. To enable this process, we use the Linked Data Modeling Language (LinkML). We now have a standard workflow and common data modeling patterns that have streamlined the process, which we expect to complete in the next year. The LinkML specifications, authored in human-readable files, are used to programmatically generate JavaScript Object Notation (JSON) schema specifications, which allow data quartermasters (DQMs) to move data to the persistent store. These specifications also inform curation software developers how to generate initial back-end (Java models and APIs) and front-end infrastructure (curation UI data tables and detail pages). Once DQMs have submitted their data files for a particular data class, the data are loaded into the persistent store and validated (see Persistent store architecture description below) and thus automatically populated into data tables and the curation interface. The data, having been harmonized, ingested, validated, and displayed to curators in the curation software, can now flow through to the public site according to the data pipeline described (see Persistent store architecture description below).

Many Alliance data classes have completely (or nearly completely) harmonized data models in LinkML (see https://github.com/alliance-genome/agr_curation_schema) including disease annotations, alleles, variants, expression annotations, and references. Many other data classes have partially harmonized models; ongoing and future harmonization efforts will focus on completing harmonized models for the remaining curated data classes: genes, transcripts, proteins, nontranscribed genome features, affected genomic models (AGMs; strains, genotypes, and fish), phenotype annotations, molecular and genetic interactions, gene regulation annotations, high-throughput expression data set metadata (including for RNA-seq, single-cell RNA-seq, and proteomics data sets), species, reagents such as DNA clones and antibodies, images, persons, laboratories, companies, and various entity set classes like gene sets, which can be used for storing assay results and performing downstream analyses like ontology term enrichment, alignments, and other entity set processing calculations.

Fig. 7. Alliance pathway viewer. The pathway widget displays gene products (rectangles with gene names) and chemicals (rectangles with chemical abbreviations) and the flow of information and material between them (relations). These relations, shown in legend, indicate direct or indirect regulation that can be positive, negative, or of unknown effect direction. For metabolites that mediate the information flow between gene products, distinct shading distinguishes metabolites that are the inputs or outputs of a reaction.

Persistent store architecture

We have designed a powerful database system that can handle most of the demands of our project including curation, analysis, and display of the data (Fig. 8). Specifically, we created a database using Postgres for long-term and persistent storage of Alliance curated data contributed by Alliance member MODs. In parallel to the existing (drop-and-reload) data pipeline (Alliance 2022), DQMs from each MOD now submit data according to our new LinkML schema in JSON format directly to the persistent store for ingestion, validation, and curation via create–read–update–delete (CRUD) operations enabled by a curation API library and PrimeReact UI. A data pipeline has been established to provide data from the persistent store Postgres database to our Alliance public website APIs and front-end web UIs and to other tools and services.

LinkML-based JSON files are ingested into Postgres with validation to ensure (1) recognition of submitted entities such as genes, alleles, AGMs (e.g. strains, genotypes), publications, experimental conditions, and ontology terms; (2) recognition of references to such entities in annotations and associations; (3) no entry of duplicate entities; and (4) proper handling of obsolete entities. Every file load is accompanied by a report (in Postgres and the curation UI) indicating (1) the recognized MD5 sum and size of the (uncompressed) file submitted; (2) the success or failure of the load; (3) the number of entities recognized in the submitted file; (4) the number of distinct entities loaded into Postgres; (5) the number and identity of entities (if any) that failed to load and the reason for the failure; (6) a link to download the submitted file; (7) the corresponding compatible LinkML model/schema version; and (8) the MOD data release version corresponding to the data in the file submitted. This information can be used by DQMs, curators, and developers to keep track of the fidelity of the data transfer and troubleshoot any issues that arise. Ontology (and other external resource) loads are updated nightly to ensure that such data are current. The source of truth for MOD data will be transitioned over to the Alliance infrastructure in phases, beginning with a few data types from a few MODs and expanding over time to eventually include all (relevant) data types from all participating MODs; as part of this process, legacy issues with data are cleaned up.
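The validation checks and load report described above can be illustrated with a small, self-contained sketch. This is not the Alliance loader: it assumes a submitted file is a JSON array of objects, that each object carries a unique `id`, and that cross-references to other entities are listed under a hypothetical `refers_to` key. Only a subset of the 8 report items is modeled.

```python
import hashlib
import json


def load_report(raw_bytes, known_ids, obsolete_ids=frozenset()):
    """Validate submitted records and summarize the load in a report dict.

    `known_ids` holds identifiers already in the store; the "id" and
    "refers_to" field names are assumptions made for this sketch.
    """
    records = json.loads(raw_bytes)
    loaded, failures = {}, []
    for record in records:
        record_id = record.get("id")
        refs = record.get("refers_to", [])
        if record_id is None:
            failures.append((None, "missing id"))
        elif record_id in loaded:
            failures.append((record_id, "duplicate entity"))
        elif record_id in obsolete_ids:
            failures.append((record_id, "obsolete entity"))
        elif any(r not in known_ids and r not in loaded for r in refs):
            failures.append((record_id, "unrecognized reference"))
        else:
            loaded[record_id] = record
    return {
        "md5": hashlib.md5(raw_bytes).hexdigest(),  # checksum of the file
        "size_bytes": len(raw_bytes),
        "success": not failures,
        "entities_in_file": len(records),
        "entities_loaded": len(loaded),
        "failures": failures,  # (identifier, reason) pairs
    }
```

Keeping the failure reasons alongside the identifiers mirrors item (5) of the report and gives DQMs something actionable to correct before resubmitting.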

To enable CRUD operations on persistent store data, curation APIs and a curation UI accessible with Okta authentication have been implemented (Fig. 9). Curators can now access data tables for the following data types: genes, alleles, variants, AGMs (e.g. strains, genotypes), publications [accessed via Alliance Bibliographic Central (ABC) APIs], experimental conditions, constructs, disease annotations, molecules [not already managed by Chemical Entities of Biological Interest (ChEBI)], ontology terms, and controlled vocabularies and their terms. CRUD operations have been fully enabled for disease annotations, experimental conditions, and controlled vocabularies, read–update operations have been enabled for alleles and variants, and read operations are enabled for the remaining data types. Work is underway to fully enable CRUD operations on all remaining data classes and their attributes including new data tables for transcripts, proteins, other (nongene) genome features, expression annotations, phenotype annotations, molecular interactions, genetic interactions, gene regulation annotations, antibodies, images, and more.
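The current state of enabled operations per data class, as listed above, can be captured in a small lookup table. The sketch below is illustrative only; the data class keys and the `is_enabled` helper are hypothetical names, and the read-only default for unlisted classes reflects the statement that read operations are enabled for the remaining data types.

```python
from enum import Flag, auto


class Op(Flag):
    CREATE = auto()
    READ = auto()
    UPDATE = auto()
    DELETE = auto()


FULL_CRUD = Op.CREATE | Op.READ | Op.UPDATE | Op.DELETE

# Operations enabled per data class, following the text above.
ENABLED_OPS = {
    "disease_annotation": FULL_CRUD,
    "experimental_condition": FULL_CRUD,
    "controlled_vocabulary": FULL_CRUD,
    "allele": Op.READ | Op.UPDATE,
    "variant": Op.READ | Op.UPDATE,
}


def is_enabled(data_class, op):
    """Return True if `op` is enabled; unlisted classes are read-only."""
    return bool(ENABLED_OPS.get(data_class, Op.READ) & op)
```

As CRUD support expands to further data classes, only the table would need to change, not the check itself.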

In addition to data tables presenting all entries of a particular data class, the curation tool also has individual entity detail pages (for example, see an allele detail page https://curation.alliancegenome.org/#/allele/MGI:6446761) for data entry and editing on a dedicated web page for 1 particular entity. The curation tool also enables user-specific and MOD-specific custom user settings and preferences to provide a UI most compatible with individual curators’ workflows. In the next year, the curation tool will include batch creation of data entities (e.g. annotations, reagents), batch editing, data history inspection and auditing, undo and review of latest changes, publication constraints (constrain data view and entry to publication currently being curated), customizations and MOD default settings for new entity creation and detail pages, incorporation of data entity and topic tagging information from the ABC literature store (see below), and incorporation of artificial intelligence (AI)/machine learning (ML) into the curation workflow.

For releases of persistent store data to the Alliance public website, Postgres database snapshots are taken and sent to a separate Postgres instance that feeds the data via the curation APIs (instantiated as a library) into the public site indexer where various data filtering and transformations occur before making those processed data available to our public website APIs via our Elasticsearch index. The Alliance public website UI, using existing UI infrastructure, is then modified or created to accommodate the data now flowing from the persistent store database.

Fig. 8. Evolution of data flow. Graphical summary showing the design of short-term infrastructure initially deployed to support rapid delivery of unified data to the community and the planned production system. Red, data quartermasters at MODs; yellow, data; brown, database; green, transformations; blue, user interface.

Security, stability, and backups

All services and data provided by the Alliance to its community are hosted on Amazon Web Services (AWS). This provides us with industry-leading availability of up to 99.99% on services such as EC2, which we use to host our virtual servers. We use additional AWS-managed services such as Elastic Beanstalk for application deployment, AWS Relational Database Service for hosting our relational (Postgres) databases, and Amazon OpenSearch Service for hosting our search indexes, which all provide automatic updates and maintenance for increased reliability. All files hosted at the Alliance of Genome Resources are stored in S3 buckets, which ensures industry-leading durability and availability. Furthermore, we make daily backups of our relational databases and have processes in place that enable easy restore of those backups in case of failure or data corruption. All search indexes are derived from the persistent relational database and can be regenerated at any moment when required.

We make use of separated subnets between public-facing and private systems, and only services requiring public access are given public IP addresses, ensuring that public-facing services such as our curation interface can be accessed by our curators worldwide (through Okta Authentication), while the supporting back-end services such as the supporting databases can be kept private. Access to all services is furthermore restricted to allow access only to the required ports and services through the use of AWS Security Groups to control the allowed network traffic. AWS IAM users, groups, and roles are used to control the allowed AWS operations and access among Alliance developers. In all cases, the principle of least privilege is applied, so that the potential attack surface is reduced to a minimum (for example by not granting blanket AWS admin permissions to developers who do not have an AWS admin function). Access keys to any system can be revoked when misuse of those access keys is detected. We also configured our GitHub repositories to be scanned automatically for accidental secret credential leakages through the use of GitGuardian software.

Literature acquisition

We designed and are implementing a literature system, ABC, that will support curation and, in the future, end users. The ABC supports the tasks of reference acquisition, triage, and curation workflow management. Specifically, the ABC is an ecosystem of online tools and supporting Alliance databases that manage all references and related metadata that are “in corpus” for the member MODs.

Literature acquisition at the Alliance begins with automated, organism-specific PubMed queries to retrieve candidate references for each MOD’s corpus. References matching the search criteria are then added to the ABC by assigning an Alliance reference identifier and importing associated bibliographic information into the database. Subsequently, curators manually sort references as either “in” or “out of corpus” based on the curation policies of the MOD and eliminate any false positive results from the initial search. While many thousands of papers are published each year, only some have information that is currently curated. For example, in 2022, the curatable literature size after triage was 3,181 for ZFIN, 3,221 for SGD, 2,130 for FlyBase, 1,419 for WormBase, and 437 for Xenbase.
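One way an automated, organism-specific PubMed query of the kind mentioned above could be expressed is through the NCBI E-utilities ESearch endpoint. The sketch below only constructs the request URL and does not contact NCBI; the search term and date window are hypothetical examples, not the queries used by any MOD, and `build_pubmed_query_url` is an illustrative helper name.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query_url(term, mindate, maxdate, retmax=200):
    """Build an ESearch URL for PubMed limited to a publication-date window.

    Dates use the YYYY/MM/DD format expected by E-utilities.
    """
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",  # filter on publication date
        "mindate": mindate,
        "maxdate": maxdate,
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

The identifiers returned by such a query would be candidates only; as described above, curators still decide which references are “in corpus.”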

Once references are sorted, they enter MOD-specific curation workflows supported by task-specific ABC curator interfaces to, for example, add reference files, manually tag references with specific entities (e.g. genes, alleles, and data types) and topics (e.g. phenotypes, anatomic expression) using the Alliance