Enrichment on steps, not genes, improves inference of differentially expressed pathways - journal.pcbi.1011968.pdf

RESEARCH ARTICLE

Enrichment on steps, not genes, improves inference of differentially expressed pathways

Nicholas Markarian (1,2), Kimberly M. Van Auken (1), Dustin Ebert (3), Paul W. Sternberg (1,*)

1 Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, United States of America
2 Keck School of Medicine, University of Southern California, Los Angeles, California, United States of America
3 Division of Bioinformatics, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, California, United States of America

* pws@caltech.edu

Abstract

Enrichment analysis is frequently used in combination with differential expression data to investigate potential commonalities amongst lists of genes and generate hypotheses for further experiments. However, current enrichment analysis approaches on pathways ignore the functional relationships between genes in a pathway, particularly OR logic that occurs when a set of proteins can each individually perform the same step in a pathway. As a result, these approaches miss pathways with large or multiple sets because of an inflation of pathway size (when measured as the total gene count) relative to the number of steps. We address this problem by enriching on step-enabling entities in pathways. We treat sets of protein-coding genes as single entities, and we also weight sets to account for the number of genes in them using the multivariate Fisher’s noncentral hypergeometric distribution. We then show three examples of pathways that are recovered with this method and find that the results have significant proportions of pathways not found in gene list enrichment analysis.

Author summary

Genome-scale experiments typically identify sets of genes which are primarily analyzed by enrichment analysis to identify relevant pathways that may be perturbed. Curated pathway models have rich structure that we believe can be exploited to get better results. Some pathway steps are enabled by sets of interchangeable genes which inflate the gene count of their respective pathways relative to the number of steps. We improve sensitivity towards these pathways in enrichment analysis by performing enrichment on steps. We then use this approach to identify pathways that would otherwise be missed in medically relevant datasets to gain new insights.

PLOS Computational Biology | https://doi.org/10.1371/journal.pcbi.1011968 | March 25, 2024

OPEN ACCESS

Citation: Markarian N, Van Auken KM, Ebert D, Sternberg PW (2024) Enrichment on steps, not genes, improves inference of differentially expressed pathways. PLoS Comput Biol 20(3): e1011968. https://doi.org/10.1371/journal.pcbi.1011968

Editor: Marc Robinson-Rechavi, Universite de Lausanne Faculte de biologie et medecine, SWITZERLAND

Received: September 13, 2023
Accepted: March 5, 2024
Published: March 25, 2024

Copyright: © 2024 Markarian et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: All data and code used are available on the GitHub repository https://github.com/nmarkari/gocam_enrichment. We have also used Zenodo to assign a DOI to the repository: 10.5281/zenodo.8310236 (https://zenodo.org/records/8310236). Primary sources for datasets used in testing are listed in Table 5, and their corresponding csv files after filtering are in our GitHub at https://github.com/nmarkari/gocam_enrichment/tree/main/test_data/processed.

Introduction

High-throughput experiments regularly output large lists of genes that vary in expression across conditions and cell types or are perturbed in disease states (e.g., [1,2]). While these experiments can establish transcriptional signatures for cells or diseases, the interpretation of these lists of genes in the context of physiology or phenotype can prove difficult, even for domain experts, as the collective body of knowledge in the literature grows at an ever increasing rate [3]. Enrichment tools aim to aid that analysis by comparing annotations, such as disease, process, and pathway associations, associated with the outputted list of differentially expressed genes to those of sets of genes in databases that previous studies have linked to specific diseases, biological processes, and pathways [4–7].

While the enrichment analysis field has made many advances, it has treated pathway enrichment the same as enrichment on categorical terms without consideration for pathways’ inherent structure. In its simplest form, enrichment analysis searches for overrepresented annotations within lists of genes. It performs pairwise evaluations of the overlap between the list from a particular experiment and reference lists in knowledgebases to determine if any of those overlaps are greater than would be expected by chance.

Early work searched for overrepresentation in categories such as diseases or particular Gene Ontology (GO) terms for cellular compartments, molecular functions, or biological processes [6,8,9]. As pathway databases such as Reactome and KEGG were developed and expanded [10,11], pathways were incorporated into enrichment analyses by applying the same algorithms and treating pathway membership as an annotation to form a list, although this eliminated causal relationships and pathway structure. Later work introduced more sophisticated statistical procedures to address open problems in enrichment such as utilizing fold changes in expression [12], leveraging relationships between annotation terms [7], or incorporating protein-protein interaction networks [13], but these methods still treated pathways the same as other reference gene sets, considering pathway membership as an annotation.

Unlike the previously mentioned classifications for genes, such as the category of genes with products active in a specific organelle, biological pathways have structure (i.e., have causal, directional relationships between participating genes), and thus are not simply categories or lists, but this issue has not been addressed.

We utilized an aspect of pathway structure, sets, in our enrichment on pathways modeled in Gene Ontology Causal Activity Models (GO-CAMs) [14], and this enabled us to recover pertinent biological pathways that could otherwise be missed. GO-CAMs are a type of pathway model centered around GO molecular functions and use other ontology terms to provide relevant biological context. GO-CAMs are typically curated manually by members of the GO consortium [15], but an additional source of human GO-CAMs is computationally generated from conversion of pathways in Reactome [16], a popular pathway database also used to define gene lists for pathways in widely used enrichment analysis tools such as PANTHER and DAVID [4,7,10].

In examining causal flow in GO-CAMs, we realized that another relationship between genes annotated to pathways has been neglected in the conversion to lists but is becoming recognized in other analyses [17]: OR logic via interchangeability of gene products at certain steps in pathways. This interchangeability is represented explicitly in Reactome (and by extension, GO-CAMs) as “sets,” defined as groups of proteins or protein complexes that are individually sufficient to perform the same step in a given pathway, and implicitly in KEGG, where they can be inferred by annotation of multiple Enzyme Commission numbers to a reaction, making our work broadly applicable. For example, the Reactome set “Glucokinase and Hexokinases” is comprised of glucokinase and hexokinases 1, 2, and 3. Any one of these proteins is sufficient to phosphorylate glucose, the first step in the glycolysis pathway. Furthermore, glucokinase is only expressed in the liver and pancreas and is not available to be up or downregulated by other cell types. Thus, sets can either be a consequence of annotation decisions, such as using one pathway diagram that may differ from cell type to cell type, or they can be a direct representation of biology, where multiple gene products may substitute for one another, albeit with potentially distinct reaction kinetics.

Funding: This work was supported by a National Human Genome Research Institute grant (U24HG012212) to PWS and supported the salaries of PWS, NM, KVA and DE. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

We can contrast sets with complexes in terms of logic. For example, microtubules are formed from tubulin α/β dimers. Microtubule formation requires both tubulin α AND tubulin β. In contrast, phosphorylating glucose requires either glucokinase OR hexokinase 1 OR hexokinase 2 OR hexokinase 3. (In fact, there are actually 8 genes that encode α tubulins and 9 genes that encode β tubulins, many of which have cell or tissue specific expression, so sets can be found in the context of complexes as well [18].) Sets enable curators to avoid creating multiple, otherwise redundant instances of pathways when different gene products may perform the same step in different cells or within the same cell; a single instance of a pathway model is created, and the set indicates the variability at that step.

Ideally, enrichment analysis would acknowledge this variability and have some degree of robustness to the decision to annotate additional genes that can enable a pathway. However, widely used enrichment tools such as those at PANTHER and Reactome do not account for these sets, nor does any other tool of which we are aware. This can be problematic, because sets inflate the count of all genes annotated to a pathway when they are expanded to create a gene list for enrichment, but they do not increase the number of steps.

For example, the BMP signaling pathway has 7 receptors, each individually sufficient to facilitate signaling, and they are expressed in many tissues at varying levels [19]. There are many other steps in this pathway, but for argument’s sake, suppose there were only two other steps, one enabled by a complex of two gene products and the other enabled by one gene product, for a total of 10 genes annotated to the pathway. If a cell upregulated expression of one member of the set of receptors and the single gene product for the last step, this scenario would be treated as 2 of 10 genes in the pathway, even though 2 of 3 steps are affected. Furthermore, complexes are treated the same as sets even though the logical relationship between their members differs. Increased expression of just one member of the proteasome complex likely does not mean increased proteasome activity, but increased expression of a member of a set of enzyme activators, receptors, or enzymes may be impactful. Due to the inflation of n, the gene count of the pathway, the pathway may not be captured by the enrichment analysis.
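The simplified BMP scenario can be put into numbers with a minimal sketch of the one-tailed hypergeometric test. Only the 2-of-10 versus 2-of-3 contrast comes from the text. The background sizes are borrowed from the gene and entity counts reported later in Results, while the input list length `N` and the function name `hypergeom_tail` are assumptions made for illustration, and `N` is held fixed across both methods for simplicity.

```python
from math import comb

def hypergeom_tail(k, n, N, M):
    """P(X >= k): N draws without replacement from M items, n of which are in the pathway."""
    return sum(comb(n, j) * comb(M - n, N - j)
               for j in range(k, min(n, N) + 1)) / comb(M, N)

N = 100  # hypothetical input list length (assumption, not from the paper)

# Gene-list view: 2 hits among the 10 genes annotated to the simplified pathway.
p_genes = hypergeom_tail(k=2, n=10, N=N, M=5386)

# Step-centric view: the 7 receptors collapse into one set, so 2 hits among 3 entities.
p_steps = hypergeom_tail(k=2, n=3, N=N, M=3983)

print(f"gene list: {p_genes:.4f}   steps: {p_steps:.4f}")
```

Under these assumed values the step-centric p-value is the smaller of the two, which is the effect the paragraph describes.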

Researchers using enrichment tools usually seek to uncover which pathways are more active in different conditions, a question that is more directly dependent on the proportion of steps in a pathway that are up and down regulated than on the proportion of genes annotated to a pathway, given that some of those genes can act in each other’s stead.

This study implements enrichment on Gene Ontology Causal Activity Models (GO-CAMs) [14] and explores the impact of “sets,” a feature in pathway models neglected in current enrichment tools, seeking to integrate it into analysis. We discover that some very large gene sets greatly inflate the gene count of the pathways in which they are members if sets are treated the same as complexes, impeding the pathways from being captured by enrichment analysis tools. We propose accounting for this by performing enrichment analysis on the pathway steps rather than directly on the genes themselves. Using a one-tailed hypergeometric test while treating sets as single entities, we showcase three examples of enriched pathways and then evaluate results on datasets from six studies [20–25]. We show that while the enrichment results largely overlap with those yielded by enriching directly on the list of genes, a significant proportion of results are unique. Lastly, we consider how the assumptions of the null hypothesis change when treating sets as single entities and introduce enrichment analysis on pathway steps via the multivariate Fisher’s noncentral hypergeometric distribution to weight sets according to the number of genes, in line with the traditional assumption that each gene is chosen with uniform probability.


Results

Formulating a step-focused null hypothesis for enrichment by including sets

Traditionally, enrichment analysis with the hypergeometric test asks the question “What is the probability that k or more out of n genes associated with the pathway are in a list of length N, where those N genes are sampled from a background of size M?” We want to propose a question focused on the steps that form a pathway instead. Defining entities as 1) the proteins, 2) protein subunits of complexes, and 3) sets of proteins that perform steps in a pathway, we ask “Given the steps in the pathway and the entities that enable them, what is the probability that k or more out of n entities required to enable those steps are in a list of length N, where those N entities are sampled from a background of size M?” Both null hypotheses assume that each gene or entity is sampled independently and with equal probability. We don’t formally state the question as “What is the probability that k or more out of n steps in a pathway are sampled,” because complexes are split into their protein subunits, but that is the underlying idea. In a pathway with no complexes, the questions are equivalent, and we want to represent a scenario where cells regulate pathways by selecting steps to regulate without replacement. We illustrate the comparison in the gene lists used by the two methods in Fig 1 and detail the procedures for enrichment in the next section.
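Both questions correspond to the same upper-tail hypergeometric probability; only the meaning of the symbols changes. As a sketch in the notation of the text (the random variable X for the overlap and the summation index j are introduced here for illustration):

```latex
P(X \ge k) \;=\; \sum_{j=k}^{\min(n,\,N)} \frac{\binom{n}{j}\,\binom{M-n}{N-j}}{\binom{M}{N}}
```

In gene-list enrichment, n, N, and M count genes; in step-centric enrichment, they count entities.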

Fig 1. Enriching directly on genes inflates the number of pathway elements relative to the number of steps. The pathway model (in blue) consists of 3 steps, enabled by a complex, a set, and a protein respectively. Traditional enrichment extracts the list of all genes associated with a pathway, treating complexes and sets equivalently, even though the logical interpretation of a complex is a joining of its members through an AND relation while members of a set are linked by OR. Enrichment on steps accounts for sets while creating the lists, where Set 1 acts as a placeholder for “Protein 1 OR Protein 2”, and Complex 1 is treated as “Protein 3A AND Protein 3B”. The pathway is enabled by “Set 1 AND Complex 1 AND Protein 4,” which is equivalent to “(P1 OR P2) AND (P3A AND P3B) AND P4.” Hence, the list we enrich on is “Set 1, P3A, P3B, P4,” where Set 1 is (P1 OR P2). Importantly, the size of the list used in enrichment is equal to the minimum number of genes required to enable the pathway in our step-centric enrichment but not in traditional enrichment on gene lists, which uses the total gene count.

https://doi.org/10.1371/journal.pcbi.1011968.g001
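The two lists contrasted in Fig 1 can be sketched in a few lines. The class names `ProteinSet` and `Complex`, the function names, and the data layout are assumptions made for illustration, not the representation used in the authors' repository.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProteinSet:
    """Members joined by OR: any one member can perform the step."""
    name: str
    members: frozenset

@dataclass(frozen=True)
class Complex:
    """Members joined by AND: every subunit is needed for the step."""
    name: str
    members: frozenset

# One enabler per step, mirroring the toy pathway of Fig 1.
pathway = [
    ProteinSet("Set 1", frozenset({"P1", "P2"})),
    Complex("Complex 1", frozenset({"P3A", "P3B"})),
    "P4",
]

def gene_list(steps):
    """Traditional list: expand sets and complexes alike into their genes."""
    genes = set()
    for enabler in steps:
        if isinstance(enabler, (ProteinSet, Complex)):
            genes |= enabler.members
        else:
            genes.add(enabler)
    return genes

def entity_list(steps):
    """Step-centric list: keep each set as one entity, split complexes into subunits."""
    entities = set()
    for enabler in steps:
        if isinstance(enabler, ProteinSet):
            entities.add(enabler.name)
        elif isinstance(enabler, Complex):
            entities |= enabler.members
        else:
            entities.add(enabler)
    return entities

print(sorted(gene_list(pathway)))    # ['P1', 'P2', 'P3A', 'P3B', 'P4']
print(sorted(entity_list(pathway)))  # ['P3A', 'P3B', 'P4', 'Set 1']
```

The gene list has five members while the entity list has four, matching the minimum number of genes needed to enable the toy pathway.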


Pathway enrichment procedure

Enriching on steps, as shown above, requires mapping the input list of genes from an experiment to the list of step-enabling entities that those genes belong to. This list consists of 1) any sets that have at least one member in the input and 2) any genes in the input that are the sole genes to enable a step in a pathway (not part of a set). Enrichment can be performed using the one-tailed hypergeometric test with this modified input list and the step-enabling list of entities for pathway i, L_i. This is also known as a one-tailed Fisher’s exact test [26]. The key result is that n_i, the length of list L_i, is the minimum number of genes required to enable all steps of the pathway. Traditionally, n_i is the total number of genes associated with a pathway, which could greatly exceed the number of steps in the pathway due to large sets. A comparison of the algorithms is shown in Fig 2.
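The mapping and test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary format for sets, the toy gene names, and the way the background is assembled (all sets plus all sole genes) are hypothetical, and multiple testing correction is omitted.

```python
from math import comb

def map_to_entities(input_genes, set_members, sole_genes):
    """Map an experimental gene list to step-enabling entities.

    set_members: dict of set name -> member genes (hypothetical format).
    sole_genes: genes that enable some step on their own, outside any set.
    """
    genes = set(input_genes)
    # 1) any set with at least one member in the input counts once
    hit_sets = {name for name, members in set_members.items() if genes & members}
    # 2) input genes that are sole enablers of a step count as themselves
    return hit_sets | (genes & sole_genes)

def one_tailed_p(k, n, N, M):
    """P(X >= k) under the hypergeometric null."""
    return sum(comb(n, j) * comb(M - n, N - j)
               for j in range(k, min(n, N) + 1)) / comb(M, N)

# Toy data: one pathway from Fig 1 plus 16 unrelated background genes.
set_members = {"Set 1": {"P1", "P2"}}
sole_genes = {"P3A", "P3B", "P4"} | {f"G{i}" for i in range(5, 21)}
L_i = {"Set 1", "P3A", "P3B", "P4"}  # step-enabling entity list for pathway i

mapped = map_to_entities(["P1", "P2", "P4", "G7"], set_members, sole_genes)
M = len(set_members) + len(sole_genes)  # background of 20 entities
k = len(mapped & L_i)                   # overlap with the pathway
p = one_tailed_p(k, len(L_i), len(mapped), M)
print(sorted(mapped), k, round(p, 4))   # ['G7', 'P4', 'Set 1'] 2 0.0877
```

Although both P1 and P2 are in the input, Set 1 contributes only once to the overlap, so k is 2 rather than 3.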

Complexes pose a design challenge with enriching on steps, because it is unclear what it would mean to alter expression of one of the members of a protein complex but not the others. This depends on whether a particular complex is assembled upon translation or later through protein interactions, as well as knowledge of which proteins are the limiting factor due to stoichiometry and/or assembly kinetics. In addition, some annotated complexes in pathway databases are transiently formed during signal transduction, such as IL7R-JAK-STAT, a complex in Reactome [10]. We decided to treat complexes the same way as they have been previously: each complex is mapped to its protein members, and those proteins are considered part of the list for the pathway, just as each member of a complex is traditionally added to the list of genes for the pathway (e.g., [4]). If a protein complex is necessary to perform a step in a pathway, we consider its protein members to be necessary as well, acknowledging the limitation that this allows for partial contribution to enrichment when in some cases, it should biologically be all-or-nothing.

Sets of complexes usually have one or more common subunits and differ only in one subunit, so we create a new complex out of the common subunits (the subunit intersection across all the complexes), and then a new set out of the remainder (Fig 2). That new complex is then mapped to its subunit members, and each is treated as an entity. Except in the cases where the set of complexes is a heterogeneous group or has multiple specific subunits, this faithfully represents sets of complexes in a manner consistent with our representation of complexes and of sets of proteins. For example, Prolyl 4-hydroxylase is a complex with 2 P4HB1 beta subunits and 2 identical alpha subunits from P4HA1, P4HA2, or P4HA3. This is represented as a set of complexes (2 P4HA1: 2 P4HB1 OR 2 P4HA2: 2 P4HB1 OR 2 P4HA3: 2 P4HB1), but we represent it with P4HB1 AND (P4HA1 OR P4HA2 OR P4HA3). We recursively apply the above logic to reduce these to proteins and sets.
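The factoring of a set of complexes into shared subunits plus a set can be sketched as below. The function name and input format are assumptions made for illustration; the sketch handles only the common case described in the text (each complex differs from the others in a single subunit), ignores stoichiometry, and does not implement the recursive reduction.

```python
def factor_set_of_complexes(complexes):
    """Split a set of complexes into common subunits (AND) and a set (OR).

    complexes: list of sets of subunit names, one per alternative complex.
    Returns (common, variable): subunits shared by every complex, and the
    interchangeable subunits that form the new set.
    """
    common = set.intersection(*complexes)
    # Heterogeneous groups or multiple specific subunits are out of scope here.
    if any(len(c - common) != 1 for c in complexes):
        raise ValueError("each complex must differ in exactly one subunit")
    variable = set().union(*complexes) - common
    return common, variable

# Prolyl 4-hydroxylase example from the text.
p4h = [{"P4HA1", "P4HB1"}, {"P4HA2", "P4HB1"}, {"P4HA3", "P4HB1"}]
common, variable = factor_set_of_complexes(p4h)
print(sorted(common))    # ['P4HB1']
print(sorted(variable))  # ['P4HA1', 'P4HA2', 'P4HA3']
```

The result reads as P4HB1 AND (P4HA1 OR P4HA2 OR P4HA3), so the pathway's entity list gains P4HB1 and one set entity.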

Comparing parameter changes in gene list and step-centric enrichment

These changes affect the hypergeometric test primarily by reducing n, the size of the pathway against which the overlap, k, is compared. M, the size of the background list, is also reduced, because ~2300 genes only appear in pathways as part of sets and thus are not unique entities. All else equal, the reduction of n lowers the p-value, while the reduction of M increases it.
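The opposing effects of n and M can be checked numerically with a small sketch. All parameter values below are hypothetical and chosen only to show the direction of each change.

```python
from math import comb

def one_tailed_p(k, n, N, M):
    """P(X >= k) under the hypergeometric null."""
    return sum(comb(n, j) * comb(M - n, N - j)
               for j in range(k, min(n, N) + 1)) / comb(M, N)

# Hypothetical baseline: overlap of 3 with a 20-member pathway.
base = one_tailed_p(k=3, n=20, N=100, M=5000)

# Shrinking the pathway (n) makes the same overlap more surprising.
smaller_n = one_tailed_p(k=3, n=10, N=100, M=5000)

# Shrinking the background (M) makes the same overlap less surprising.
smaller_M = one_tailed_p(k=3, n=20, N=100, M=4000)

print(smaller_n < base < smaller_M)  # True
```

Which effect dominates for a given pathway depends on how much its own n shrinks relative to the global change in M.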

Fig 2. Step-by-step comparison of enrichment algorithms. Multiple testing correction is not shown here but is done with the Benjamini-Hochberg procedure.

https://doi.org/10.1371/journal.pcbi.1011968.g002

We constructed lists of genes for each pathway via the standard gene list method and compared these to the entity lists produced by our method for each pathway. While the change in n is pathway-specific, the median reduction per pathway from the gene-list method to ours is 6%, with a 75th percentile reduction of 40%, indicating that most models are unaffected by the change, but a minority are significantly impacted (Table 1). The change in M is a reduction from 5386 genes annotated across all pathways to 3983 entities (genes and sets of genes) annotated across all pathways. (Of the 5386 genes, only 3086 are the sole entities that enable at least one step, meaning they appear outside of sets, while 2300 are only annotated as part of sets. There are a total of 915 sets.) N can change as well, but the magnitude and direction of change are dependent on the input list. Lastly, k can be reduced because we only allow for each set to count once towards the overlap, even if more than one gene in the set is on the input list. This