RESEARCH ARTICLE
Enrichment on steps, not genes, improves inference of differentially expressed pathways
Nicholas Markarian 1,2, Kimberly M. Van Auken 1, Dustin Ebert 3, Paul W. Sternberg 1*
1 Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, United States of America, 2 Keck School of Medicine, University of Southern California, Los Angeles, California, United States of America, 3 Division of Bioinformatics, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, California, United States of America
* pws@caltech.edu
Abstract
Enrichment analysis is frequently used in combination with differential expression data to investigate potential commonalities amongst lists of genes and generate hypotheses for further experiments. However, current enrichment analysis approaches on pathways ignore the functional relationships between genes in a pathway, particularly OR logic that occurs when a set of proteins can each individually perform the same step in a pathway. As a result, these approaches miss pathways with large or multiple sets because of an inflation of pathway size (when measured as the total gene count) relative to the number of steps. We address this problem by enriching on step-enabling entities in pathways. We treat sets of protein-coding genes as single entities, and we also weight sets to account for the number of genes in them using the multivariate Fisher’s noncentral hypergeometric distribution. We then show three examples of pathways that are recovered with this method and find that the results have significant proportions of pathways not found in gene list enrichment analysis.
Author summary
Genome-scale experiments typically identify sets of genes which are primarily analyzed by enrichment analysis to identify relevant pathways that may be perturbed. Curated pathway models have rich structure that we believe can be exploited to get better results. Some pathway steps are enabled by sets of interchangeable genes which inflate the gene count of their respective pathways relative to the number of steps. We improve sensitivity towards these pathways in enrichment analysis by performing enrichment on steps. We then use this approach to identify pathways that would otherwise be missed in medically relevant datasets to gain new insights.
PLOS COMPUTATIONAL BIOLOGY
PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1011968 March 25, 2024

OPEN ACCESS

Citation: Markarian N, Van Auken KM, Ebert D, Sternberg PW (2024) Enrichment on steps, not genes, improves inference of differentially expressed pathways. PLoS Comput Biol 20(3): e1011968. https://doi.org/10.1371/journal.pcbi.1011968

Editor: Marc Robinson-Rechavi, Universite de Lausanne Faculte de biologie et medecine, SWITZERLAND

Received: September 13, 2023
Accepted: March 5, 2024
Published: March 25, 2024

Copyright: © 2024 Markarian et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: All data and code used is available on the GitHub repository https://github.com/nmarkari/gocam_enrichment. We have also used Zenodo to assign a DOI to the repository: 10.5281/zenodo.8310236 (https://zenodo.org/records/8310236). Primary sources for datasets used in testing are listed in Table 5, and their corresponding csv files after filtering are in our GitHub at https://github.com/nmarkari/gocam_enrichment/tree/main/test_data/processed.

Introduction
High-throughput experiments regularly output large lists of genes that vary in expression across conditions and cell types or are perturbed in disease states (e.g., [1,2]). While these experiments can establish transcriptional signatures for cells or diseases, the interpretation of these lists of genes in the context of physiology or phenotype can prove difficult, even for domain experts, as the collective body of knowledge in the literature grows at an ever-increasing rate [3]. Enrichment tools aim to aid that analysis by comparing annotations, such as disease, process, and pathway associations, associated with the outputted list of differentially expressed genes to those of sets of genes in databases that previous studies have linked to specific diseases, biological processes, and pathways [4–7].
While the enrichment analysis field has made many advances, it has treated pathway enrichment the same as enrichment on categorical terms without consideration for pathways’ inherent structure. In its simplest form, enrichment analysis searches for overrepresented annotations within lists of genes. It performs pairwise evaluations of the overlap between the list from a particular experiment and reference lists in knowledgebases to determine if any of those overlaps are greater than would be expected by chance. Early work searched for overrepresentation in categories such as diseases or particular Gene Ontology (GO) terms for cellular compartments, molecular functions, or biological processes [6,8,9]. As pathway databases such as Reactome and KEGG were developed and expanded [10,11], pathways were incorporated into enrichment analyses by applying the same algorithms and treating pathway membership as an annotation to form a list, although this eliminated causal relationships and pathway structure. Later work introduced more sophisticated statistical procedures to address open problems in enrichment such as utilizing fold changes in expression [12], leveraging relationships between annotation terms [7], or incorporating protein-protein interaction networks [13], but these methods still treated pathways the same as other reference gene sets, considering pathway membership as an annotation. Unlike the previously mentioned classifications for genes, such as the category of genes with products active in a specific organelle, biological pathways have structure (i.e., have causal, directional relationships between participating genes), and thus are not simply categories or lists, but this issue has not been addressed.
We utilized an aspect of pathway structure, sets, in our enrichment on pathways modeled in Gene Ontology Causal Activity Models (GO-CAMs) [14], and this enabled us to recover pertinent biological pathways that could otherwise be missed. GO-CAMs are a type of pathway model centered around GO molecular functions and use other ontology terms to provide relevant biological context. GO-CAMs are typically curated manually by members of the GO consortium [15], but an additional source of human GO-CAMs is computationally generated from conversion of pathways in Reactome [16], a popular pathway database also used to define gene lists for pathways in widely used enrichment analysis tools such as PANTHER and DAVID [4,7,10]. In examining causal flow in GO-CAMs, we realized that another relationship between genes annotated to pathways has been neglected in the conversion to lists but is becoming recognized in other analyses [17]: OR logic via interchangeability of gene products at certain steps in pathways. This interchangeability is represented explicitly in Reactome (and by extension, GO-CAMs) as “sets,” defined as groups of proteins or protein complexes that are individually sufficient to perform the same step in a given pathway, and implicitly in KEGG, where they can be inferred by annotation of multiple Enzyme Commission numbers to a reaction, making our work broadly applicable.
For example, the Reactome set “Glucokinase and Hexokinases” is comprised of glucokinase and hexokinases 1, 2, and 3. Any one of these proteins is sufficient to phosphorylate glucose, the first step in the glycolysis pathway. Furthermore, glucokinase is only expressed in the liver and pancreas and is not available to be up- or downregulated by other cell types. Thus, sets can either be a consequence of annotation decisions, such as using one pathway diagram that may differ from cell type to cell type, or they can be a direct representation of biology, where multiple gene products may substitute for one another, albeit with potentially distinct reaction kinetics.

Funding: This work was supported by a National Human Genome Research Institute grant (U24HG012212) to PWS and supported the salaries of PWS, NM, KVA and DH. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

We can contrast sets with complexes in terms of logic. For example, microtubules are formed from tubulin αβ dimers. Microtubule formation requires both tubulin α AND tubulin β. In contrast, phosphorylating glucose requires either glucokinase OR hexokinase 1 OR hexokinase 2 OR hexokinase 3. (In fact, there are 8 genes that encode α tubulins and 9 genes that encode β tubulins, many of which have cell- or tissue-specific expression, so sets can be found in the context of complexes as well [18].) Sets enable curators to avoid creating multiple, otherwise redundant instances of pathways when different gene products may perform the same step in different cells or within the same cell; a single instance of a pathway model is created, and the set indicates the variability at that step.
Ideally, enrichment analysis would acknowledge this variability and have some degree of robustness to the decision to annotate additional genes that can enable a pathway. However, widely used enrichment tools such as those at PANTHER and Reactome do not account for these sets, nor does any other tool of which we are aware. This can be problematic, because sets inflate the count of all genes annotated to a pathway when they are expanded to create a gene list for enrichment, but they do not increase the number of steps. For example, the BMP signaling pathway has 7 receptors, each individually sufficient to facilitate signaling, and they are expressed in many tissues at varying levels [19]. There are many other steps in this pathway, but for argument’s sake, suppose there were only two other steps, one enabled by a complex of two gene products and the other enabled by one gene product, for a total of 10 genes annotated to the pathway. If a cell upregulated expression of one member of the set of receptors and the single gene product for the last step, this scenario would be treated as 2 of 10 genes in the pathway, even though 2 of 3 steps are affected. Furthermore, complexes are treated the same as sets even though the logical relationship between their members differs. Increased expression of just one member of the proteasome complex likely does not mean increased proteasome activity, but increased expression of a member of a set of enzyme activators, receptors, or enzymes may be impactful. Due to the inflation of n, the gene count of the pathway, the pathway may not be captured by the enrichment analysis. Researchers using enrichment tools usually seek to uncover which pathways are more active in different conditions, a question that is more directly dependent on the proportion of steps in a pathway that are up- and downregulated than on the proportion of genes annotated to a pathway, given that some of those genes can act in each other’s stead.
This study implements enrichment on Gene Ontology Causal Activity Models (GO-CAMs) [14] and explores the impact of “sets,” a feature in pathway models neglected in current enrichment tools, seeking to integrate it into analysis. We discover that some very large gene sets greatly inflate the gene count of the pathways in which they are members if sets are treated the same as complexes, impeding the pathways from being captured by enrichment analysis tools. We propose accounting for this by performing enrichment analysis on the pathway steps rather than directly on the genes themselves. Using a one-tailed hypergeometric test while treating sets as single entities, we showcase three examples of enriched pathways and then evaluate results on datasets from six studies [20–25]. We show that while the enrichment results largely overlap with those yielded by enriching directly on the list of genes, a significant proportion of results are unique. Lastly, we consider how the assumptions of the null hypothesis change when treating sets as single entities and introduce enrichment analysis on pathway steps via the multivariate Fisher’s noncentral hypergeometric distribution to weight sets according to the number of genes, in line with the traditional assumption that each gene is chosen with uniform probability.
Results
Formulating a step-focused null hypothesis for enrichment by including sets
Traditionally, enrichment analysis with the hypergeometric test asks the question “What is the probability that k or more out of n genes associated with the pathway are in a list of length N, where those N genes are sampled from a background of size M?” We want to propose a question focused on the steps that form a pathway instead. Defining entities as 1) the proteins, 2) protein subunits of complexes, and 3) sets of proteins that perform steps in a pathway, we ask “Given the steps in the pathway and the entities that enable them, what is the probability that k or more out of the n entities required to enable those steps are in a list of length N, where those N entities are sampled from a background of size M?” Both null hypotheses assume that each gene or entity is sampled independently and with equal probability. We do not formally state the question as “What is the probability that k or more out of n steps in a pathway are sampled,” because complexes are split into their protein subunits, but that is the underlying idea. In a pathway with no complexes, the questions are equivalent, and we want to represent a scenario where cells regulate pathways by selecting steps to regulate without replacement. We illustrate the comparison in the gene lists used by the two methods in Fig 1 and detail the procedures for enrichment in the next section.
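
Both questions above correspond to the upper tail of a hypergeometric distribution. As a reference sketch using the symbols k, n, N, and M from the text (the random variable X and the summation index j are notation introduced here for illustration), the one-tailed p-value can be written as:

```latex
\[
p \;=\; P(X \ge k) \;=\; \sum_{j=k}^{\min(n,\,N)}
\frac{\binom{n}{j}\,\binom{M-n}{N-j}}{\binom{M}{N}}
\]
```

In the gene-list formulation, n, N, and M count genes; in the step-focused formulation, the same expression is evaluated with n, N, and M counting entities.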
Fig 1. Enriching directly on genes inflates the number of pathway elements relative to the number of steps. The pathway model (in blue) consists of 3 steps, enabled by a complex, a set, and a protein respectively. Traditional enrichment extracts the list of all genes associated with a pathway, treating complexes and sets equivalently, even though the logical interpretation of a complex is a joining of its members through an AND relation while members of a set are linked by OR. Enrichment on steps accounts for sets while creating the lists, where Set 1 acts as a placeholder for “Protein 1 OR Protein 2”, and Complex 1 is treated as “Protein 3A AND Protein 3B”. The pathway is enabled by Set 1 AND Complex 1 AND Protein 4, which is equivalent to (P1 OR P2) AND (P3A AND P3B) AND P4. Hence, the list we enrich on is “Set 1, P3A, P3B, P4,” where Set 1 is (P1 OR P2). Importantly, the size of the list used in enrichment is equal to the minimum number of genes required to enable the pathway in our step-centric enrichment but not in traditional enrichment on gene lists, which uses the total gene count.
https://doi.org/10.1371/journal.pcbi.1011968.g001
Pathway enrichment procedure
Enriching on steps, as shown above, requires mapping the input list of genes from an experiment to the list of step-enabling entities that those genes belong to. This list consists of 1) any sets that have at least one member in the input and 2) any genes in the input that are the sole genes to enable a step in a pathway (not part of a set). Enrichment can be performed using the one-tailed hypergeometric test with this modified input list and the step-enabling list of entities for pathway i, L_i. This is also known as a one-tailed Fisher’s exact test [26]. The key result is that n_i, the length of list L_i, is the minimum number of genes required to enable all steps of the pathway. Traditionally, n_i is the total number of genes associated with a pathway, which could greatly exceed the number of steps in the pathway due to large sets. A comparison of the algorithms is shown in Fig 2.
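
The mapping step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `map_genes_to_entities`, the dictionary-of-frozensets layout, and the toy identifiers (taken from the Fig 1 example) are all assumptions made here.

```python
def map_genes_to_entities(input_genes, sets, sole_genes):
    """Map an experimental gene list to step-enabling entities.

    input_genes: set of gene identifiers from the experiment.
    sets:        dict of set name -> frozenset of member genes (OR logic).
    sole_genes:  genes that appear in pathways outside of any set.
    """
    # 1) any set with at least one member in the input counts once
    hit_sets = {name for name, members in sets.items() if members & input_genes}
    # 2) input genes that are themselves entities (not only set members)
    hit_genes = input_genes & sole_genes
    return hit_sets | hit_genes


# Toy pathway from Fig 1 (identifiers are illustrative only).
sets = {"Set 1": frozenset({"P1", "P2"})}
sole_genes = {"P3A", "P3B", "P4"}
pathway_genes = {"P1", "P2", "P3A", "P3B", "P4"}      # gene-list view, n = 5
pathway_entities = {"Set 1", "P3A", "P3B", "P4"}      # step-centric view, n = 4

input_genes = {"P1", "P2", "P4"}                      # hypothetical experiment
input_entities = map_genes_to_entities(input_genes, sets, sole_genes)

k_genes = len(input_genes & pathway_genes)            # overlap on genes
k_steps = len(input_entities & pathway_entities)      # overlap on entities
print(k_genes, len(pathway_genes), k_steps, len(pathway_entities))
```

Here both members of Set 1 are in the input, yet the set contributes only once to the step-centric overlap, and the pathway size drops from five genes to four entities.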
Complexes pose a design challenge with enriching on steps, because it is unclear what it would mean to alter expression of one of the members of a protein complex but not the others. This depends on whether a particular complex is assembled upon translation or later through protein interactions, as well as knowledge of which proteins are the limiting factor due to stoichiometry and/or assembly kinetics. In addition, some annotated complexes in pathway databases are transiently formed during signal transduction, such as IL7R-JAK-STAT, a complex in Reactome [10]. We decided to treat complexes the same way as they have been previously: each complex is mapped to its protein members, and those proteins are considered part of the list for the pathway, just as each member of a complex is traditionally added to the list of genes for the pathway (e.g., [4]). If a protein complex is necessary to perform a step in a pathway, we consider its protein members to be necessary as well, acknowledging the limitation that this allows for partial contribution to enrichment when, in some cases, it should biologically be all-or-nothing.
Sets of complexes usually have one or more common subunits and differ only in one subunit, so we create a new complex out of the common subunits (the subunit intersection across all the complexes), and then a new set out of the remainder (Fig 2). That new complex is then mapped to its subunit members, and each is treated as an entity. Except in the cases where the set of complexes is a heterogeneous group or has multiple specific subunits, this faithfully represents sets of complexes in a manner consistent with our representation of complexes and of sets of proteins. For example, Prolyl 4-hydroxylase is a complex with 2 P4HB1 beta subunits and 2 identical alpha subunits from P4HA1, P4HA2, or P4HA3. This is represented as a set of complexes (2 P4HA1: 2 P4HB1 OR 2 P4HA2: 2 P4HB1 OR 2 P4HA3: 2 P4HB1), but we represent it as P4HB1 AND (P4HA1 OR P4HA2 OR P4HA3). We recursively apply the above logic to reduce these to proteins and sets.
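
One level of this decomposition can be sketched as follows. The function name `split_set_of_complexes` and the frozenset representation are assumptions made for illustration; subunit stoichiometry is ignored, and the recursion mentioned above is not shown.

```python
def split_set_of_complexes(complexes):
    """Split a set of complexes into (common subunits, variable set).

    complexes: non-empty list of frozensets, one per complex, holding
    subunit identifiers. The common subunits form the new complex (AND);
    everything left over forms the new set (OR).
    """
    if not complexes:
        raise ValueError("expected at least one complex")
    common = frozenset.intersection(*complexes)
    variable = frozenset().union(*(c - common for c in complexes))
    return common, variable


# Prolyl 4-hydroxylase example from the text (copy numbers dropped).
p4h = [
    frozenset({"P4HA1", "P4HB1"}),
    frozenset({"P4HA2", "P4HB1"}),
    frozenset({"P4HA3", "P4HB1"}),
]
common, variable = split_set_of_complexes(p4h)
print(sorted(common), sorted(variable))
```

The result reads as P4HB1 AND (P4HA1 OR P4HA2 OR P4HA3). As the text notes, this simple split does not faithfully describe heterogeneous groups or complexes with several specific subunits, because pooling all leftovers into one set loses which subunits belong together.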
Comparing parameter changes in gene list and step-centric enrichment
These changes affect the hypergeometric test primarily by reducing n, the size of the pathway against which the overlap, k, is compared. M, the size of the background list, is also reduced, because ~2300 genes only appear in pathways as part of sets and thus are not unique entities. All else equal, the reduction of n lowers the p-value, while the reduction of M increases it.
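
The opposing effects of n and M can be checked numerically. The sketch below computes the one-tailed hypergeometric p-value with the standard library only; the helper name `hypergeom_tail` and the input-list size of 200 are assumptions chosen for illustration. The pathway sizes (10 genes versus 3 entities, with an overlap of 2) come from the simplified BMP example in the Introduction, and the background sizes 5386 and 3983 are the gene and entity counts reported in this section.

```python
from math import comb


def hypergeom_tail(k: int, M: int, n: int, N: int) -> float:
    """One-tailed hypergeometric p-value P(X >= k).

    M: background size, n: pathway size, N: input-list size,
    k: observed overlap between the input list and the pathway.
    """
    total = comb(M, N)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible terms drop out without special-casing.
    favorable = sum(
        comb(n, j) * comb(M - n, N - j) for j in range(k, min(n, N) + 1)
    )
    return favorable / total


N_INPUT = 200  # hypothetical input-list size, held fixed for comparison

# Gene-list view: 2 of 10 genes hit, background of 5386 genes.
p_genes = hypergeom_tail(k=2, M=5386, n=10, N=N_INPUT)
# Reduce n only (3 entities instead of 10 genes): the p-value drops.
p_small_n = hypergeom_tail(k=2, M=5386, n=3, N=N_INPUT)
# Also reduce M to 3983 entities: the p-value rises again, but only partly.
p_steps = hypergeom_tail(k=2, M=3983, n=3, N=N_INPUT)

print(f"gene list:           {p_genes:.4g}")
print(f"smaller n only:      {p_small_n:.4g}")
print(f"smaller n and M:     {p_steps:.4g}")
```

In this toy setting the reduction in n dominates, so the step-centric p-value ends up below the gene-list p-value even after the smaller background is taken into account. In a real analysis N would also change after mapping genes to entities, which this sketch deliberately holds fixed.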
We constructed lists of genes for each pathway via the standard gene list method and compared these to the entity lists produced by our method for each pathway. While the change in n is pathway-specific, the median reduction per pathway from the gene-list method to ours is 6%, with a 75th percentile reduction of 40%, indicating that most models are unaffected by the change, but a minority are significantly impacted (Table 1). The change in M is a reduction from 5386 genes annotated across all pathways to 3983 entities (genes and sets of genes) annotated across all pathways. (Of the 5386 genes, only 3086 are the sole entities that enable at least one step, meaning they appear outside of sets, while 1400 are only annotated as part of sets. There are a total of 915 sets.) N can change as well, but the magnitude and direction of change are dependent on the input list. Lastly, k can be reduced because we only allow for each set to count once towards the overlap, even if more than one gene in the set is on the input list.
This
Fig 2. Step by step comparison of enrichment algorithms. Multiple testing correction is not shown here but is done with the Benjamini-Hochberg procedure.
https://doi.org/10.1371/journal.pcbi.1011968.g002