A Novel Approach to Comparative RNA-Seq Does
Not Support a Conserved Set of Orthologs Underlying
Animal Regeneration
Noémie C. Sierra
1
, Noah Olsman
2
,4
, Lynn Yi
2
, Lior Pachter
2
,3
, Lea Goentoro
2
,
David A. Gold
1
,2
,
*
1
Department of Earth and Planetary Sciences, University of California, Davis, Davis, CA 95616, USA
2
Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
3
Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA
4
Present address: Department of Systems Biology, Harvard Medical School, 200 Longwood Avenue, Boston, MA 02215, USA.
*Corresponding author: E-mail:
dgold@ucdavis.edu.
Accepted:
June 05, 2024
Abstract
Molecular studies of animal regeneration typically focus on conserved genes and signaling pathways that underlie morpho
-
genesis. To date, a holistic analysis of gene expression across animals has not been attempted, as it presents a suite of pro
-
blems related to differences in experimental design and gene homology. By combining orthology analyses with a novel
statistical method for testing gene enrichment across large data sets, we are able to test whether tissue regeneration across
animals shares transcriptional regulation. We applied this method to a meta-analysis of six publicly available RNA-Seq data
sets from diverse examples of animal regeneration. We recovered 160 conserved orthologous gene clusters, which are en
-
riched in structural genes as opposed to those regulating morphogenesis. A breakdown of gene presence/absence provides
limited support for the conservation of pathways typically implicated in regeneration, such as Wnt signaling and cell pluripo
-
tency pathways. Such pathways are only conserved if we permit large amounts of paralog switching through evolution.
Overall, our analysis does not support the hypothesis that a shared set of ancestral genes underlie regeneration mechanisms
in animals. After applying the same method to heat shock studies and getting similar results, we raise broader questions
about the ability of comparative RNA-Seq to reveal conserved gene pathways across deep evolutionary relationships.
Key words:
regeneration, evolution, RNA-Seq.
Significance
RNA-Seq could be a useful tool for identifying shared genes involved in animal tissue regeneration. We therefore devel
-
oped a novel approach to compare RNA-Seq experiments with different designs and distantly related species. We ultim
-
ately find limited evidence for conserved genes, suggesting that rampant paralog switching has occurred over the course
of evolution or that animal regeneration is not a conserved trait, at least at a transcriptional level.
© The Author(s) 2024. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/
), which permits unrestricted reuse,
distribution, and reproduction in any medium, provided the original work is properly cited.
Introduction
Why regeneration occurs in some animals and not others
remains an enigma in biology. It is well known that certain
groups can readily regenerate lost tissues and body parts
(e.g. planarian worms, salamanders, and cnidarians), while
regeneration in others is restricted to specific organs or de
-
velopmental stages (e.g. nematode worms, insects, and
mammals). Animals with strong regenerative capabilities
are distributed across the evolutionary tree without a clear
pattern (
Alvarado 2000), and even closely related species
GBE
Genome Biol. Evol.
16(6) https://doi.org/10.1093/gbe/evae120 Advance Access publication 24 June 2024
1
can demonstrate
dramatically
different
capacities
(
Bely and
Sikes
2010
;
Zattara
et al. 2019
). These
observations
lead to
two competing
evolutionary
scenarios:
body
regeneration
is either
an ancient,
conserved
animal
trait that has been
lost to varying
degrees
across
multiple
lineages,
or it is a de
-
rived
trait that multiple
lineages
have converged
upon
inde
-
pendently.
Resolving
these
competing
hypotheses
has
profound
consequences
for the goals
of comparative
re
-
generative
biology:
are we searching
for unifying
principles
or trying
to determine
how
various
animals
deal with
the
universal
problem
of bodily
damage?
While
many
studies
focus
on putative
candidate
genes
underlying
animal
regeneration,
a growing
body
of litera
-
ture challenges
any simplistic
interpretation.
Some
genes
and pathways
commonly
reoccur
in studies.
Wnt signaling,
for example,
has been
shown
to play a critical
role in plan
-
arian
worms
(
Sikes
and Newmark
2013
;
Umesono
et al.
2013
), fish (
Stoick-Cooper
et al. 2007
), amphibians
(
Lin
and Slack
2008
), and mammals
(
Bielefeld
et al. 2013
;
Sanges
et al. 2013
;
Takeo
et al. 2013
). Other
studies
sug
-
gest that key components
of regeneration
might
be dissimi
-
lar across
major
groups.
For example,
a MARCKS-like
protein
that initiates
limb regeneration
in axolotl
salaman
-
ders
appears
to be a vertebrate
novelty
(
Sugiura
et al.
2016
). Regeneration
in newts,
a different
set of amphi
-
bians,
involves
genes
not found
in the axolotl
(
Looso
et al.
2013
). Finally,
genes
such
as the Oct4/POU5F1
regulator
of stem
cell pluripotency
appear
absent
in invertebrates
(
Gold
et al. 2014
). It is unclear
whether
these
observations
represent
anomalies
obfuscating
a conserved
set of shared
genes
or if they hint at the true evolutionary
convergence
driving
animal
regeneration.
Whether
the molecular
mechanisms
of regeneration
are
conserved
across
animals
rests,
in part, on what
counts
as a
“conserved”
(i.e. homologous)
gene.
Homologous
genes
can be subdivided
into orthologs
(genes
related
by vertical
descent
from
a common
ancestor)
and paralogs
(genes
that
arise by duplication
events).
Orthologs
or paralogs
may per
-
form
similar
functions,
but in evolutionary
biology,
com
-
mon
ancestry
is what
defines
conservation.
Paralogs
cannot
necessarily
be traced
back
to a single
gene
in a
last common
ancestor;
this means
the utilization
of para
-
logs by different
species
during
regeneration
does
not ne
-
cessarily
support
the hypothesis
of a conserved
ancestral
function,
as it may
reflect
evolutionary
convergence
achieved
after
gene
duplication.
Further
complicating
this
matter,
the ortholog/paralog
distinction
is contingent
on
the organisms
being
studied.
As more
distantly
related
species
are analyzed,
families
of paralogous
genes
are
often
collapsed
into
a single
orthologous
clade
(see
supplementary
fig. S1, Supplementary
Material
online,
for
an illustration
of this phenomenon).
Tests
of molecular
con
-
servation
therefore
require
careful
consideration
of the evo
-
lutionary
history
of genes.
The problem
is compounded
when
using
RNA-Seq
tech
-
nology
to identify
“conserved”
genes
between
distantly
re
-
lated
taxa undergoing
similar
biological
processes.
The first
issue
is a biological
one:
genes
rarely
share
one-to-one
homology
across
distantly
related
species.
An ancestral
gene
might,
over
the course
of evolution,
undergo
mul
-
tiple,
lineage-specific
rounds
of duplication.
The second
is
-
sue is technical:
RNA-Seq
studies
have
varying
temporal
resolutions,
timescales,
and depths
of sequencing.
When
looking
for significant
differences
in gene
expression,
these
two issues
result
in a heterogeneous
list of statistical
tests
that are problematic
to compare
between
studies.
As an ex
-
ample,
imagine
a conserved
orthologous
gene
group,
where
a sea sponge
has one gene
sampled
at three
time
points,
while
an axolotl
has five genes
sampled
at seven
time points.
If all time points
are compared
with each other,
this would
result
in three
statistical
tests for the sea sponge
compared
with 140 tests for the axolotl.
To address
this discrepancy,
we used
a Lancaster
P
-value
aggregation
method,
which
provides
a systematic
way of
collapsing
multiple
statistical
tests for significant
differential
expression
from
multiple
homologous
genes
into one value
(
Yi et al. 2018
). This allows
us to cluster
genes
into putative
conserved
ortholog
groups
(COGs)
and then
see which
COGs
are statistically
enriched
during
the regenerative
pro
-
cess for each species.
The method
takes
the
P
-values
gener
-
ated
by a differential
expression
analysis
for a group
of
genes
and essentially
treats
each
as an independent
signifi
-
cance
test of the hypothesis
that the broader
COG
is differ
-
entially
expressed.
Intuitively,
it may be the case
that no
single
P
-value
from
a set of independent
tests
registers
as
significant;
however,
many
borderline
significant
values
can be aggregated
to determine
significance.
These
aggre
-
gation
methods
take advantage
of the fact that many
inde
-
pendent
P
-values
generated
by the null hypothesis
should
follow
a uniform
distribution
on the interval
(0, 1).
Consequently,
we can test the
uniformity
of the set of
P
-values
to determine
their
likelihood
of being
generated
from
the null hypothesis.
In other
words,
the tests of a non
-
significant
COG
should
create
a random
distribution
of
P
-values,
while
a COG
with one or more
significant
compo
-
nents
will
statistically
deviate
from
this
distribution.
Mathematically,
the appropriate
test statistic
for uniformity
can be computed
from
the sum of inverse
cumulative
distri
-
bution
function
with
P
-values
and raw read
counts
as in
-
puts.
The result
of this process
is a table
with
entries
corresponding
to taxon–COG
pairs
and an associated
ag
-
gregated
P
-value
for each
COG.
supplementary
fig. S2,
Supplementary
Material
online,
illustrates
our approach
to applying
the Lancaster
method
to aggregate
P
-values
across
orthologous
genes
within
each RNA-Seq
experiment
(
Lancaster
1961
;
Yi et al. 2018
). This approach
allows
us to
make
statistically
honest
comparisons
of differential
gene
expression
between
diverse
studies
and elucidates
what
Sierra
et al.
GBE
2
Genome
Biol. Evol.
16(6)
https://doi.org/10.1093/gbe/evae120
Advance
Access
publication
24 June
2024
conserved genes are shared across animals during
regeneration.
In this study, we compared publicly available RNA-Seq
data sets spanning wildly different organisms and structures
undergoing regeneration (Fig. 1) to determine if a shared set
of differentially expressed genes could be elucidated. The
data sets analyzed include tissue regeneration in sea sponges
(Kenny et al. 2017
), oral/aboral body regeneration in sea an
-
emones (
Schaffer et al. 2016
), head/tail regeneration in plan
-
arian worms (
Kao et al. 2013), regeneration of “Cuvierian
tubules” in the respiratory system of sea cucumbers (Sun
et al. 2013), hair cell regeneration in zebrafish (Jiang et al.
2014), and limb regeneration in axolotl salamanders (Wu
et al. 2013: 201
). These data sets are highly divergent in their
sampling regimes but cover the relevant early window
between wound healing and blastema formation/cell prolif
-
eration (
Fig. 1
). Despite the limitations inherent in compara
-
tive RNA-Seq (considered in detail in the Discussion), this
study provides a first-order analysis to clarify what is con
-
served in animal regeneration at a transcriptional level.
Results
The first step was to organize all proteins from our six
data sets into clusters of putative orthologs. We used
OrthoFinder (Emms and Kelly 2015
) to assign orthology,
as this program combines amino acid sequence similarity
and phylogenetic relationships to reconstruct the evolu
-
tionary history of gene families. OrthoFinder assigned
266,324 proteins across our six data sets into 16,116
COGs, 2,287 of which were present in all six data
sets (see
supplementary additional file S1, part 1,
Supplementary Material online
). These COGs were typically
large, with a mean of 16.5 genes per COG. This reflected
the large number of gene models in certain data sets (par
-
ticularly the axolotl and zebrafish) as well as the wide evo
-
lutionary vantage taken in this study. Because we assigned
orthology at the pan-animal scale, many paralogs in verte
-
brates or eumetazoans collapsed into a single COG in this
study. As discussed later, this phenomenon is particularly
important when interpreting our results. After genes
were assigned to COGs, we used the Lancaster method
to aggregate all
P
-values per data set per COG into one
P
-value (
Yi et al. 2018
). If that
P
-value met a false-discovery
adjusted threshold of 0.05, we considered the COG differ
-
entially expressed for that particular data set.
To test how robust the assignment of differentially
expressed COGs (deCOGs) was to variation between data
sets, we examined how adding and removing data sets
impacted the final number of deCOGs. Using our
x0hrs
bisect
H0hrs
0hrs
removal
0H days
Zuvierian tubules
@hrs
lateral line
óound healing ;lastemaM cell proliferation
axolotl
(
Ambystoma
mexicanum
)
zebra
fi
sh
(
Danio rerio
)
sea cucumber
(
Apostichopus
japonicus
)
planarian worm
(
Schmidtea
mediterranea
)
sea anemone
(
Nematostella
vectensis
)
sea sponge
(
Halisarca
caerulea
)
Eumetazoa
Metazoa (animals)
Bilateria
Deuterostomia
Ve
rtebrata
bisect
x0hrs
limb
@ days
re
re
mo
mo
va
va
6
Zuvi ia
japo
6
Xranscriptome
5H per species6
a
b
Predict proteins and cluster
with other species into
Conserved Ortholog
Groups
(
COGs
)
RNA-Seq mapping & differential expression analysis
COG 1
Species 1
p
t1 vs t2
t2 vs t3
t3 vs t4 ...
Homolog 1
Homolog 2
Homolog n
...
t1 vs t2
t2 vs t3
t3 vs t4 ...
t1 vs t2
t2 vs t3
t3 vs t4 ...
Comparisons between time points (t)
Lancaster aggregated
p-value for entire COG
(1 per species)
Sp
Sp
Sp
Sp
Sp
Sp
Sp
Sp
Sp
Sp
Sp
Sp
Sp
Fig. 1.
Cases of animal regeneration included in this study. a) The six animals analyzed in this paper, organized by their evolutionary relationships. The
region of each organism undergoing regeneration is highlighted in red and is described underneath the image of each animal. The RNA-Seq sampling regime
from each study is visualized with a bar; each time point that was sampled is represented by a notch in that bar. Despite the different absolute time ranges, the
studies are comparable in that the time points span the early key stages of regeneration: starting with wound healing (red) and transitioning into blastema
formation/cell proliferation (blue). b) A simplified overview of the methodology used to define deCOGS. A more detailed version is provided in
supplementary
fig. S2, Supplementary Material
online.
A Novel Approach to Comparative RNA-Seq
GBE
Genome Biol. Evol.
16(6) https://doi.org/10.1093/gbe/evae120 Advance Access publication 24 June 2024
3
methodology, we recovered 160 deCOGs present in all six
data sets. Removing any particular data set from the study
increased the number of deCOGs shared across the remain
-
ing five data sets by an additional 31 to 202 (
Fig. 2
). We did
not find any correlation between the quality of the
RNA-Seq study and the number of additional deCOGs
recovered when a data set was removed. For example, re
-
moving the sea anemone from the analysis provided the
greatest increase in deCOGs, even though this data set in
-
cluded four RNA-Seq time points with biological replicates,
as well as a well-annotated genome to work off.
Conversely, the sea sponge had the poorest sampling re
-
gime, yet its removal resulted in one of the smallest gains
(50 deCOGs). Instead of data set quality, the number of
data set-specific deCOGs appears to be most important,
as removing data sets with a small number of deCOGs
(e.g. the sea cucumber and/or sea anemone) appeared to
have the largest impact on overall deCOGs recovered.
Ultimately, while some deCOGs could be lost due to incom
-
plete sampling of gene expression during regeneration, our
analyses do not suggest an obvious bias caused by the qual
-
ity of the data sets under consideration.
A related concern to data set quality was the absence of
biological replicates in some of the studies analyzed. Three
of the six data sets lack biological replicates; we accepted
this limitation in order to get phylogenetic diversity, though
it complicated our ability to assign differentially expressed
genes in those data sets (see Materials and Methods for de
-
tails). To study the impact of combining data sets with and
without biological replication, we looked at how many
deCOGs were retained in every combination of three
taxa. If we restrict our analysis to the three data sets with
biological replicates (the zebrafish, anemone, and planar
-
ian), we recover 569 deCOGs. This is at the lower end com
-
pared with all combinations of three data sets (ranging
from 379 to 1344 deCOGs, average
=
760 deCOGs; see
supplementary additional file S1, part 2.5, Supplementary
Material online).
The three combinations with the highest
number of deCOGs all include two data sets without
replicates. This suggests that our forgiving approach
to dealing with data sets lacking biological replication,
if anything, overestimates the true number of shared
deCOGs.
Following this check on the data, we proceeded with a
holistic assay of COGs and discovered that the six data
sets exhibit dramatically distinct gene repertoires. We
used presence/absence data to construct a Jaccard distance
matrix that illustrates the total number of COGs shared
across data sets (Fig. 3a) and a second matrix restricted to
deCOGs (Fig. 3b
). The first matrix organizes the taxa on
evolutionary relationships, while the second only retains
the vertebrate (axolotl + zebrafish) clade. If genes expressed
during regeneration represented an evolutionarily con
-
served network, we would anticipate the deCOG Jaccard
distance matrix in
Fig. 3b
to show greater similarity than
the full COG matrix in
Fig. 3a. Instead, there appears
to be even less similarity between data sets in
Fig. 3b
com
-
pared with
Fig. 3a, although a Mantel test (Mantel
1967
) suggests the two matrices are not significantly cor
-
related. (
P
=
0.11; see
supplementary additional file S1,
part 2.3, Supplementary Material online). This suggests
that genes expressed during regeneration are no more
similar across data sets than the gene repertoires as a
whole.
One of the patterns seen in
Fig. 3
is that the vertebrates
(the axolotl and zebrafish) appear more similar to each
íebrash
:
@:
H::
H@:
0::
0@:
w::
@Ww0
0@wx
wWJx
@bb0
-w@:
0wxJ
wH@
0:0
HJ:
Hw@
H:b
H:b
H:J
W-
bH
xb
xw
x:
J-
@w
@0
@:
@:
--
wW
w-
wH
Hx
H-
’ponge
%lanarian
Zucumber
©xolotl
©nemone
Fig. 2.
An UpSet plot demonstrating the number of overlapping
deCOGs shared across all six data sets. This plot focuses on overlaps of
four or more of the six data sets. The number of deCOGs common across
all six cases (160) is highlighted in orange. Additional deCOGs that are re-
covered when individual case studies are removed are highlighted in blue.
The data used to generate this figure are provided in
supplementary table
S2, Supplementary Material
online.
’ea anemone
’ea cucumber
%lanarian
’ea sponge
íebrash
©xolotl
Overlap: A
l
l COGs
Key
a
b
Overlap: deCOGs
%lanarian
’ea anemone
’ea cucumber
’ea sponge
íebrash
©xolotl
uH
:
H
Fig. 3.
Jaccard distance matrices based on the presence/absence of COGs across taxa. a) Matrix derived from all COGs as assigned by OrthoFinder. b) The
same analysis, but restricted to deCOGs The data used to generate this figure are provided in
supplementary table S3, Supplementary Material
online.
Sierra et al.
GBE
4
Genome Biol. Evol.
16(6) https://doi.org/10.1093/gbe/evae120 Advance Access publication 24 June 2024
other than any other combination of taxa. This raises the
possibility that regeneration in vertebrates is driven by
vertebrate-specific genes. To test this hypothesis, we as
-
signed each deCOG a phyletic origin, illustrated in
Fig. 4
.
At all nodes of the phylogeny, the majority of deCOGs
can be found across diverse eukaryotes. In other words, re
-
generation in most animal groups does not appear to re
-
quire much input from novel, animal-specific genes.
While this observation holds true in the vertebrates,
∼
9%
of all deCOGs unique to this clade do appear to be
vertebrate-specific novelties. This suggests that while the
genetic control of animal regeneration is largely driven by
the co-option of ancient genes, regeneration in vertebrates
also requires input from genes unique to the group.
After examining how the data are impacted by manipu
-
lating data sets, we next focused on the 160 deCOGs pre
-
sent in all six data sets. To test whether 160 deCOGs is
higher than expected by chance, we performed a resam
-
pling study where we randomized the deCOGs in each
data set (see Materials and Methods). The 160 deCOGs ob
-
served in our data are far greater than what is observed in
our 10,000 simulation runs, where the number of shared
deCOGs ranged from 8 to 44. While 160 deCOGs might
therefore appear noteworthy, we note that our approach
purposefully takes a generous view of what counts as
“conserved.” We have, for example, ignored differences
in expression direction or timing, meaning a COG is consid
-
ered “conserved” if the same gene is upregulated during
wound healing in one data set and downregulated in blas
-
tema formation in another. It is unlikely that such a gene ac
-
tually has a conserved biological function. Moreover, the
inclusion of distantly related animals in this analysis means
that many large gene families have been reduced to a single
COG. A good example of this latter issue comes from the
Wnt family of genes, which are recovered as a single
deCOG in our analysis. The gene tree produced by
OrthoFinder is reprinted in
Fig. 5
. Our analysis suggests
that sponge Wnt genes cannot be assigned to known sub
-
families, resulting in all Wnts collapsing into one COG (see
Borisenko et al. 2016
, for similar results). Ignoring the
sponge, only one of the Wnt subfamilies (Wnt8/9) is pre
-
sent in all organisms in our analysis, and no Wnt subfamily
demonstrates differential expression across all taxa. So
while Wnt genes are differentially expressed in every ex
-
ample of regeneration, each organism uses a different
combination of paralogs. This result may not be entirely sur
-
prising, as the parts of the body undergoing regeneration in
each animal are distinct, and each area of the body is pat
-
terned by different Wnt subfamilies during normal develop
-
ment (
Krauss et al. 1992;
Kusserow et al. 2005;
Almuedo-Castillo et al. 2012;
Borisenko et al. 2016;
Auger et al. 2023
). This result could therefore be inter
-
preted as evidence that diverse Wnt paralogs can be re
-
moved and integrated into a conserved regeneration
gene network (e.g.
Somorjai et al. 2018
) or, alternatively,
that different organisms have independently integrated
Wnt signaling into regeneration. Either way, this case study
illustrates that a deCOG is not synonymous with a con
-
served gene and offers no support that one specific Wnt
paralog has a conserved function in regeneration across
animal evolution.
To explore the possible function of the 160 deCOGs re
-
covered across all data sets, we used two highly cited web
resources, STRING (
Szklarczyk et al. 2014) and DAVID
7ukaryote novelty
fletazoan novelty
;ilaterian novelty
_ertebrate novelty
Origin of deCOGs
axolotl
zebra
fi
sh
sea cucumber
planarian worm
sea anemone
sea sponge
160 deCOGs shared across all six organisms
+50 deCOGs
+292 deCOGs
+302 deCOGs
+1962 deCOGs
Vertebrata
Deuterostoma
Bilateria
Eumetaozoa
Metazoa
Fig. 4.
Evolutionary (phyletic) origin of deCOGs. The total number of deCOGs recovered at each node of the evolutionary tree is indicated by a bar chart
to the right. Novel deCOGs at each node are broken down by their phyletic origin; for example, deCOGs that are a “bilaterian novelty” contain genes that have
no significant sequence similarity to genes outside of the Bilateria. The data used to generate this figure are provided in
supplementary table S4,
Supplementary Material
online.
A Novel Approach to Comparative RNA-Seq
GBE
Genome Biol. Evol.
16(6) https://doi.org/10.1093/gbe/evae120 Advance Access publication 24 June 2024
5
(Dennis et al. 2003
), to perform functional enrichment ana
-
lyses. We focused on the zebrafish for these analyses, as it
represents the best-studied model organism in our data.
The 160 deCOGs include 2,182 zebrafish transcripts,
554 of which could be considered differentially expressed
(using the generous cutoff of an unadjusted
P
<
0.01).
We compared this list of genes against the zebrafish
genome to look for enriched biological pathways using
the comprehensive and highly cited Kyoto Encyclopedia
of Genes and Genomes (KEGG) database (see
supplementary additional file S1, part 4.1, Supplementary
Material online,
for full results). According to STRING and
DAVID analyses, the 554 differentially expressed zebrafish
genes are enriched in basic cell processes, including mela
-
nogenesis, regulation of the actin cytoskeleton, phago
-
somes, and focal adhesion. Regarding KEGG pathways,
Notch and mTOR signaling are recovered in both analyses,
while Wnt and FoxO pathways are enriched in the STRING
analysis. However, all of these enriched pathways are sus
-
pect, as they are primarily driven by multiple homologs
from the same COG. For example, Wnt and Frizzled homo
-
logs represent 9 out of 11 genes driving “Wnt enrichment”
in STRING. In “mTOR enrichment,” Wnt and Frizzled homo
-
logs make up 9 of the 15 genes in STRING and 11 out of 17
in DAVID. Similarly, “Notch enrichment” is driven by the
presence of eight differentially expressed genes, seven of
which are Delta/Jagged homologs. If these pathways
were truly enriched in our data set, we would anticipate
more genes from these pathways being differentially ex
-
pressed. Rerunning the analysis on larger lists of deCOGs
following the removal of individual data sets did not
have a major impact on the pathways recovered (see
supplementary additional file S1, parts 4.2 to 4.7,
Supplementary Material online
). When restricting ourselves
to the three data sets with biological replicates, we do get
modest gains in the number of genes involved in Wnt sig
-
naling, although 11 of the 36 genes driving enrichment
are Wnt and Frizzled homologs (see
supplementary
additional file S1, part 4.9, Supplementary Material
online). When we restricted our analysis to deCOGs shared
between the vertebrates, we found a dramatic increase in
the number of Wnt pathway genes represented (71 genes).
Furthermore, mTOR (61 genes), FoxO (65 genes), and p53
signaling (34 genes) were also recovered as significantly
enriched pathways. All of these pathways have been
implicated in vertebrate regeneration (Di Giovanni et al.
2006;
Tothova and Gilliland 2007;
Yun et al. 2013;
Martins et al. 2016
). This is not simply a function of
vertebrate-restricted genes being recovered, as
>
99% of
the transcripts come from COGs present in at least one in
-
vertebrate and
∼
43% of the COGs are present in all six spe
-
cies (see
supplementary additional file S1, parts 4.10 to 4.
13, Supplementary Material online
). These results further
support the hypothesis that a conserved regeneration net
-
work might exist across vertebrates but offer little evidence
for conservation across the animals as a whole.
Given the long-standing interest in stem cell dynamics as
a critical regulator in animal regeneration, we decided to
conclude our study by exploring the representation of these
pathways in our data.
Figure 6
presents a simplified version
of the KEGG stem cell pluripotency network (KEGG 04550),
colored to indicate the number of data sets with one or
more differentially expressed genes in the relevant COG.
Few molecular signaling components were differentially
expressed across all six data sets, and most downstream sig
-
naling targets were expressed in fewer than four data sets.
Additionally, the ultimate target of these pathways—the
core transcriptional network driving mammalian stem cell
pluripotency (
Li and Belmonte 2017)—was largely absent,
with two of the genes missing from all data sets
(Oct4/POU5F1 and Nanog). At first glance, some interesting
signaling and receptor proteins appeared to be conserved
óntbMWulike
óntxulike
óntHJulike
ónt0ulike
óntHulike
óntwulike
óntHHulike
ónt-ulike
ónt@ulike
óntJulike
%lanarian ónt©I ónt%0Mw
óntH:ulike
’pongeuspecic ónt
jene5s6 presentI one or more genes show differential expression
jene5s6 presentI no differential expression
ïnidentiedI paraphyletic ónt
Fig. 5.
The presence of Wnt genes in the six RNA-Seq data sets
analyzed (produced by OrthoFinder). Wnt genes were recovered as a
single deCOG in our analysis, which can be subdivided into a min-
imum of 13 previously described subfamilies. The presence/absence
of these subfamilies in each taxon is demonstrated by silhouettes.
Gray silhouettes show the subfamily is present in the organism’s tran-
scriptome; black silhouettes show that the subfamily is present and
differentially expressed in the relevant RNA-Seq study. Note that no
subfamily is present and differentially expressed across all taxa. The
data used to generate this figure are provided in
supplementary
table S5, Supplementary Material
online.
Sierra et al.
GBE
6
Genome Biol. Evol.
16(6) https://doi.org/10.1093/gbe/evae120 Advance Access publication 24 June 2024
across all six data sets. However, detailed analysis of the
relevant COGs revealed that every example involved
well-described paralogs being collapsed into a single
pan-metazoan COG, as described previously for Wnt.
Examples include “Activin” and “BMP4” being part of a
single deCOG that also contains BMP2/4/5/6/8/15/16, as
well as the “SOX2” deCOG, which also contains SOX1/3/
9/14 (see
supplementary fig. S3 and additional file S1,
part 7, Supplementary Material
online for details). We
therefore find limited support for conserved genes in the
cell pluripotency network employed in the six regeneration
data sets.
An Additional Analysis on Heat Stress Suggests the
Problem of Identifying Conserved Orthologs from
Comparative RNA-Seq Is Not Restricted to Regeneration
It has been several years since this study was originally
posted on a preprint server. One of the reasons for the delay
was an early reviewer’s recommendation that we look for a
“positive control,” demonstrating how the Lancaster meth
-
od described here can recover conserved gene sets from
pan-animal RNA-Seq data sets. After testing many data
sets, we were unable to find a compelling control. An in
-
structive example is our study of heat stress, which we
anticipated would reveal COGs enriched in heat shock re
-
sponses. In this project, we analyzed six data sets covering
the relevant window of acute stress response to short-term
heat shock in a diverse set of organisms. Our data included
expression profiles of liver response of the Atlantic salmon
(
Shi et al. 2019), hemocyte transcriptive response in Pacific
oysters (Yang et al. 2017), whole organism response in the
Saharan silver ant (Willot et al. 2018), whole adult somatic
tissues of a demosponge (Guzman and Conaco 2016), and
a comparison of the liver transcriptome response of three
breeds of commercial chickens (Lan et al. 2016). Similar
to our original analysis, we were unable to recover a core
set of genes governing the heat shock response. We could
not find any COGs shared across all data sets and therefore
focused our analysis on the comparison that produced the
most results—the oyster and sponge (supplementary
additional file, section 8, Supplementary Material online
).
Enrichment analysis of the 105 COGs found little evi
-
dence for functional conservation. DAVID Functional
Annotation recovered evidence for an enrichment of the
“stress response” biological process. This was driven by
the differential expression of six heat shock transcripts,
four of which are annotated as “heat shock protein 68.”
Otherwise, all enrichment terms were related to basic
cell processes, muscle/actin activity, and melanogenesis
number of taxa with deZ+j
KEY
specific to "primed"
stem cell pathway
6
5
4
3
2
1
POU5F1
SOX2
NANOG
ID
Tbx3
Esrrb
Klf4
c-Myc
c-Myc
DUSP9
Mek
MAPK signaling
Raf
Erk1/2
Akt
ACVR1B
GSK3B
Dvl
SMAD2/3
Akt
BMPR1A
PIK3
Frizzled
SMAD1
Ras
PIK3
PI3K-Akt signaling
Jak-STAT signaling
JAK
Grb2
STAT3
Mek
LIFR
Erk1/2
LIF
MAPK signaling
IGF
Activin
FGF2
Wnt
Nodal
BMP4
TCF3
Axin
TGF-beta signaling
APC
SMAD4
Erk1/2
p38
CTNNB1
HNF1A
IGF-1R
FGFR
Wnt signaling
Core transcriptional network
cell proliferation
Signaling pathways regulating pluripotency of stem cells
Fig. 6.
The presence of deCOGs within the stem cell pluripotency network. The network has been reproduced and simplified from KEGG pathway
04550. The color of each box indicates the number of data sets with one or more differentially expressed genes within the relevant COG. Red arrows indicate
pathways that are specific to “primed” stem cells (e.g. human embryonic stem cells, human-induced pluripotent stem cells, and mouse epiblast–derived stem
cells); gray arrows indicate pathways also found in “naïve” stem cells (e.g. mouse embryonic stem cells and mouse-induced pluripotent stem cells). The data
used to generate this figure are provided in
supplementary table S6, Supplementary Material
online.
A Novel Approach to Comparative RNA-Seq
GBE
Genome Biol. Evol.
16(6) https://doi.org/10.1093/gbe/evae120 Advance Access publication 24 June 2024
7