Enhanced recovery of single-cell RNA-sequencing reads for missing gene expression data
Allan-Hermann Pool 1,2,3,#, Helen Poldsam 1,4,5, Sisi Chen 6, Matt Thomson 6, Yuki Oka 6,#
1. Department of Neuroscience, University of Texas Southwestern Medical Center, Dallas, TX, USA
2. Peter O’Donnell Brain Institute, University of Texas Southwestern Medical Center, Dallas, TX, USA
3. Department of Anesthesiology and Pain Management, University of Texas Southwestern Medical Center, Dallas, TX, USA
4. Department of Chemistry and Biotechnology, Tallinn University of Technology, Estonia
5. Protobios LLC, Mäealuse 4, Tallinn 12618, Estonia
6. Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
# Correspondence should be addressed to allan-hermann.pool@utsouthwestern.edu or yoka@caltech.edu
Abstract
Droplet-based 3’ single-cell RNA-sequencing (scRNA-seq) methods have proved transformational in characterizing cellular diversity and generating valuable hypotheses throughout biology [1,2]. Here we outline a common problem with 3’ scRNA-seq datasets where genes that have been documented to be expressed with other methods are either completely missing or dramatically under-represented, thereby compromising the discovery of cell types, states, and genetic mechanisms. We show that this problem stems from three main sources of sequencing read loss: (1) reads mapping immediately 3’ to known gene boundaries due to poor 3’ UTR annotation; (2) intronic reads stemming from unannotated exons or pre-mRNA; (3) discarded reads due to gene overlaps [3]. Each of these issues impacts the detection of thousands of genes even in well-characterized mouse and human genomes, rendering downstream analysis either partially or fully blind to their expression. We outline a simple three-step solution to recover the missing gene expression data that entails compiling a hybrid pre-mRNA reference to retrieve intronic reads [4], resolving gene collision derived read loss through removal of readthrough and premature start transcripts, and redefining 3’ gene boundaries to capture false intergenic reads. We demonstrate with mouse brain and human peripheral blood datasets that this approach dramatically increases the amount of sequencing data included in downstream analysis, revealing 20-50% more genes per cell and incorporating 15-20% more sequencing reads than standard solutions [5]. These improvements reveal previously missing biologically relevant cell types, states, and marker genes in the mouse brain and human blood profiling data. Finally, we provide scRNA-seq optimized transcriptomic references for human and mouse data as well as a simple algorithmic implementation of these solutions that can be deployed to both thoroughly and poorly annotated genomes. Our results demonstrate that optimizing the sequencing read mapping step can significantly improve the analysis resolution as well as biological insight from scRNA-seq. Moreover, this approach warrants a fresh look at preceding analyses of this popular and scalable cellular profiling technology.
bioRxiv preprint doi: https://doi.org/10.1101/2022.04.26.489449; this version posted April 27, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Main
Droplet-based single-cell RNA-sequencing methods such as Dropseq and 10x Genomics platforms have dramatically lowered the cost and improved the throughput of single-cell gene expression profiling. These advances have thereby widely democratized the discovery of new cell types and states [6–8], delineation of developmental mechanisms [9] and cellular basis of disease [10] as well as mapping of behavioral and physiological functions to distinct cell types [11,12]. The scalability of such methods however comes with a few important limitations. First, the droplet-based methods rely on 3’ gene tagging where detection of genes depends on registering sequencing reads predominantly at the 3’ end of genes, which makes detection of splicing isoforms problematic. Second, 3’ scRNA-seq datasets, which are usually much more shallowly sequenced, are in general considered to have lower sensitivity than deep full-length isoform sequencing solutions such as provided by the SMART-Seq chemistry [13]. Indeed, several studies have observed that genes shown to be expressed with other methods have critically been missing in analyses relying on droplet-based scRNA-seq [8,11]. This shortcoming compromises the potential of 3’ scRNA-seq high-throughput technologies to uncover the genetic and cellular mechanisms giving rise to development and tissue function.
The scRNA-sequencing workflow consists of several steps including sample preparation, sequencing library generation, sequencing, read mapping/quantification, and analysis of the gene-cell matrix based data. While many of these steps are considered standard, some such as sample preparation are widely recognized as critical for the final outcome and can vary significantly between protocols and labs. One often overlooked step in this workflow is read mapping/quantification, which determines which sequencing reads are incorporated in the final cellular gene expression data. During this process, sequencing reads are mapped to the reference transcriptome (i), assigned to genes (ii), assigned to cells (iii), and duplicates are removed (iv) [14,15]. As a result of this step, often the majority of sequencing reads get excluded from further analysis for one of several reasons including failure to map confidently to the transcriptome, being a duplicate read, mapping to multiple sites in the genome (multimapping reads), mapping to more than one gene (multigene reads), or mapping intronically or to an intergenic region. Some of the discarded read data however reflect endogenous gene expression, and discarding them can render expressed genes missing [16,17]. Several groups have manually amended the transcriptome for individual genes to restore their visibility [8,11]; however, a systematic effort to evaluate the scale of this problem and to provide a whole-transcriptome solution for this issue has been missing.
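The registration steps and discard categories described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the logic of any particular pipeline: the `Read` record, its field names, and the order in which the checks are applied are all hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical summary of one aligned read (field names are assumptions)."""
    barcode: str      # cell barcode
    umi: str          # unique molecular identifier
    region: str       # "exonic", "intronic" or "intergenic"
    genes: tuple      # genes the alignment is compatible with
    n_loci: int = 1   # number of genomic loci the read aligns to


def discard_reason(read, seen):
    """Return why a read is dropped, or None if it is counted."""
    if read.n_loci > 1:
        return "multimapping"
    if read.region != "exonic":
        return read.region            # "intronic" or "intergenic"
    if len(read.genes) > 1:
        return "multigene"
    key = (read.barcode, read.umi, read.genes[0])
    if key in seen:
        return "duplicate"
    seen.add(key)
    return None


def count_reads(reads):
    """Build a (barcode, gene) -> count table and tally the discarded reads."""
    seen = set()
    counts = Counter()
    dropped = Counter()
    for read in reads:
        reason = discard_reason(read, seen)
        if reason is None:
            counts[(read.barcode, read.genes[0])] += 1
        else:
            dropped[reason] += 1
    return counts, dropped
```

With an exonic reference, every read that falls into `dropped` under "intronic", "intergenic" or "multigene" is invisible to downstream analysis, even when it originates from a genuinely expressed gene.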
Here, we show that analysis pipelines relying on standard exonic transcriptomic references are blind to many genes that are easily detected with independent methods such as in situ hybridization. We demonstrate that this lack of gene detection does not stem from low sensitivity but rather from inefficiencies of the currently used transcriptomic references, and that this is the case even with very well annotated genomes including those of mouse and human. Furthermore, we show that the read loss stems from three sources: poor annotation of 3’ untranslated regions, gene overlaps stemming from the annotation of rare read-through or prematurely starting transcripts, and exclusion of intronic reads. We outline a three-step strategy to overcome these limitations through the inclusion of intronic reads, resolving gene overlaps by excluding rare transcript isoforms, and identifying and incorporating unannotated gene 3’ UTRs. This strategy recovers obscured gene expression data for thousands of genes and reveals previously undetected genetic markers, mechanisms and cell types. Consequently, we provide full genome optimized transcriptomic references for the mouse and human genomes. In sum, our data argue that transcriptomic references need to be optimized for scRNA-seq analysis and that this step can dramatically improve the profiling resolution. These findings also warrant a reanalysis of previously published datasets.
Results
In order to characterize gene detection fidelity of 3’ gene counting methods we performed scRNA-sequencing of the median pre-optic nucleus (MnPO), a mouse brain center implicated in a range of physiological functions including thirst, sleep, heat and cold sensation [18]. Predictably, following sequencing read mapping to an exonic transcriptomic reference we identified about a dozen distinct neuron types in this structure reflecting the functional diversity of this brain center (Fig. 1a). We next compared gene detection fidelity with scRNA-sequencing to in situ hybridization, an independent method provided by the Allen in situ brain atlas [19]. While we found many genes that were reliably detected with both methods (e.g. Nxph4, Fig. 1b), we observed a number of genes that were completely missing in scRNA-seq data while robustly detected with in situ hybridization (e.g. B4galnt2 and Gpr165, Fig. 1c-d). Follow-up analysis at these loci revealed three distinct patterns of sequencing read mapping that determined whether the gene is detected or missing in scRNA-sequencing analysis. The first type comprised genes detected by both methods. In this case, sequencing reads mapped near perfectly to the exons of the underlying gene and were thus included in downstream transcriptomic analysis (Fig. 1b). A second group of genes was detected by in situ hybridization but was missing in scRNA-seq data, as most sequencing reads mapped to an intron of that gene, resulting in exclusion from transcriptomic analysis (Fig. 1c). Finally, a third group of genes was detected by in situ hybridization but not with scRNA-seq and had no sequencing reads mapping to known exons and introns (Fig. 1d). Importantly, the last type of genes displayed excessive read mapping proximal to the known 3’ end of the gene, suggesting that scRNA-seq fails to detect these genes due to poor annotation of 3’ untranslated regions of genes. These data demonstrate that droplet-based single-cell sequencing datasets can fail to detect genes due to suboptimal read mapping to the reference transcriptome.
In order to evaluate the magnitude of the missing gene problem, we quantified several metrics of sequencing read mapping in two vertebrate species with the most thoroughly annotated genomes, mice and humans. For mice we evaluated the MnPO dataset and for humans we profiled peripheral blood mononuclear cells (PBMCs). In mouse brain data we found that out of the uniquely mapped sequencing reads 71.8% are exonic, 19.5% intronic and 8.7% intergenic out of 272 million total reads, suggesting that significant gains could be achieved by incorporating sequencing data from intronic and intergenic areas to gene expression estimates (Fig. 1e). We found similar metrics in human data with 69.9% exonic, 23.5% intronic and 6.7% intergenic reads (272 million total), respectively. Indeed, upon evaluating the number of genes detected as a result of including intronic reads, intergenic reads within 10 kb of known 3’ gene ends or both, we observed dramatic gains in the number of detected genes in scRNA-seq datasets with 13.6%, 25.8% and 33.6% more genes detected than with a conventional exonic transcriptome reference in mouse (Fig. 1f). Again, comparable gains were observed with 19.9%, 23.2% and 39.2% more genes detected, respectively, for the human transcriptome. Moreover, we also evaluated the dominant source of read information for genes in the mouse and human datasets. Predictably we found that the majority of mouse genes (79.6%) were dominated by exonic reads with more than 50% of expression data stemming from exonic reads (Fig. 1g). Somewhat surprisingly, less than half of human genes derive their expression data from exonic reads with the rest stemming from intronic or 3’ intergenic reads. While not all intronic and proximal intergenic sequencing reads stem from the respective protein-coding gene transcripts, these data indicate that profound gains in gene detection sensitivity are feasible by incorporating relevant intronic and intergenic read data in downstream scRNA-seq analysis.
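The read categories used in this comparison (exonic, intronic, and intergenic within 10 kb of a known 3’ gene end) can be illustrated with a small strand-aware classifier. The sketch below is an assumption-laden illustration rather than the procedure used to produce the numbers above: it considers a single gene model, uses 1-based inclusive coordinates, and the function names are hypothetical.

```python
from collections import Counter


def classify_position(pos, exons, strand, window=10_000):
    """Classify a read position relative to one gene model.

    exons  : list of (start, end) tuples, 1-based inclusive coordinates
    strand : "+" or "-"; decides on which side the 3' end of the gene lies
    window : how far past the 3' end a read still counts as 3' intergenic
    """
    gene_start = min(start for start, _ in exons)
    gene_end = max(end for _, end in exons)
    if any(start <= pos <= end for start, end in exons):
        return "exonic"
    if gene_start <= pos <= gene_end:
        return "intronic"
    if strand == "+" and gene_end < pos <= gene_end + window:
        return "3' intergenic"
    if strand == "-" and gene_start - window <= pos < gene_start:
        return "3' intergenic"
    return "other"


def fractions(labels):
    """Convert a list of category labels into per-category fractions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

Applying `fractions` to the labels of all uniquely mapped reads gives a breakdown of the kind reported in Fig. 1e.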
Figure 1: Missing genes and sequencing read registration in single-cell RNA-seq experiments. a. ScRNA-seq based profiling of the mouse physiology regulating brain center, the Median Preoptic Nucleus (MnPO). 10x Genomics 3’ transcriptomic analysis of MnPO neurons (n=906) mapped to an exonic transcriptomic reference reveals 13 neuron types. Data shown in a tSNE embedding. b. Sample scRNA-seq detected gene (Nxph4) with sequencing read mapping at its genomic locus. The majority of sequencing reads map to known exons of the Nxph4 gene and are therefore registered (blue) and included in downstream analysis. Discarded reads (red) map to non-exonic regions or are antisense to the gene and are therefore excluded. Inset violin plot: scRNA-seq analysis detects Nxph4 expression in several MnPO neuron types (cell-type specific log-transformed expression of Nxph4 in MnPO neuron types with cell-type identity color-coded as in Fig. 1a). Micrograph inset: in situ hybridization of Nxph4 expression in the MnPO (scale bar: 150 μm, posterior MnPO outlined with white dashed line, data from Allen Brain Atlas Mouse ISH dataset). c. Sample gene (B4galnt2) not detected by scRNA-seq due to intronic read mapping. Inset violin plot: gene expression is not detected in any of the MnPO neuron types. Inset micrograph: in situ hybridization of B4galnt2 expression in the MnPO. d. Sample gene (Gpr165) not detected by scRNA-seq due to intergenic read mapping 3’ of the known end of the gene. Inset violin plot: gene expression is not detected in any of the MnPO neuron types with scRNA-seq. Inset micrograph: in situ hybridization of Gpr165 expression in the MnPO. e. Proportion of uniquely mapped sequencing reads according to mapping site (exonic, intronic or intergenic) for mouse brain (MnPO, left) and human peripheral blood mononuclear cells (right) datasets. f. Intronic and intergenic reads constitute a promising source to recover missing gene expression data in scRNA-seq analysis. Number of detected genes in mouse brain (MnPO, left) and human PBMC (right) datasets, if reads mapping to exons, exons and introns, exons and intergenic reads within 10kb of known 3’ ends of genes, or all three sources are included in downstream analysis. g. Human and mouse genes according to the dominant source of sequencing read data. Genes are classified as ‘exonic dominant’, ‘intronic dominant’ or ‘3’ intergenic dominant’ if more than 50% of sequencing reads map to their exons, introns or within 10kb of their 3’ end, respectively. Mixed genes have less than 50% of reads stemming from any of the three regions.
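The classification rule in panel g can be written down directly. The sketch below is illustrative only: the function name is hypothetical, a gene with no reads is labelled "undetected", and a gene with exactly 50% of reads in one region is treated as mixed, which is an assumption because the caption does not cover that boundary case.

```python
def dominant_source(exonic, intronic, intergenic_3p):
    """Label a gene by the region contributing more than half of its reads.

    Arguments are read counts for the gene's exons, its introns, and the
    10 kb window past its 3' end. Integer arithmetic avoids rounding issues.
    """
    counts = {
        "exonic": exonic,
        "intronic": intronic,
        "3' intergenic": intergenic_3p,
    }
    total = sum(counts.values())
    if total == 0:
        return "undetected"
    for region, n in counts.items():
        if 2 * n > total:          # strictly more than 50% of the reads
            return f"{region} dominant"
    return "mixed"
```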
We further evaluated the extent to which intergenic reads 3’ from gene ends could contribute to true gene expression estimates. If unannotated 3’ UTRs constitute a significant source of read loss in 3’ scRNA-seq datasets we would expect to see elevated levels of sequencing reads mapping proximal to the 3’ end of genes. Indeed, we observe several-fold higher mapping of intergenic reads immediately proximal to the 3’ gene ends than at distal sites in both mouse and human datasets (Fig. 2a, b). In fact close to 25% of intergenic reads in both mouse and human datasets are within 10kb of 3’ gene ends, which represents approximately two-fold enrichment as compared to the rest of the non-coding genome [20,21]. These results suggest that improved annotation of 3’ gene ends is a promising strategy to increase gene detection in 3’ single-cell RNA-sequencing analysis (Fig. 2c).
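Both summaries used here, the fraction of intergenic reads within 10 kb of a 3’ gene end and the distribution of read distances inside that window, are simple to compute once each intergenic read has been assigned a distance to the nearest upstream same-strand gene end. The sketch below assumes such distances (in base pairs, positive downstream of the gene end) are already available; the function names and the 1 kb bin size are hypothetical choices.

```python
def fraction_within(distances, window=10_000):
    """Fraction of intergenic reads lying within `window` bp of a 3' gene end."""
    if not distances:
        return 0.0
    near = sum(1 for d in distances if 0 < d <= window)
    return near / len(distances)


def bin_counts(distances, bin_size=1_000, window=10_000):
    """Histogram of read distances from the 3' gene end inside the window.

    Bin i covers distances from i * bin_size + 1 to (i + 1) * bin_size.
    """
    bins = [0] * (window // bin_size)
    for d in distances:
        if 0 < d <= window:
            bins[(d - 1) // bin_size] += 1
    return bins
```

A histogram that falls off steeply with distance, as in Fig. 2a and b, is the pattern expected if many of these reads come from unannotated 3’ UTRs.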
Figure 2: Increased intergenic read mapping proximal to 3’ end of genes. a. Distribution of sequencing reads mapping within 10kb of known gene ends in the mouse genome shows increased mapping proximal to gene ends. b. Distribution of sequencing reads mapping within 10kb of known gene ends in the human genome shows increased mapping proximal to gene ends. c. Fraction of intergenic reads mapping within 10kb of known gene ends from all intergenic reads in the mouse brain (MnPO) and human PBMC datasets.
Another common source of read loss in scRNA-seq analysis stems from same-strand gene overlaps. Reads mapping to genomic regions annotated to more than one gene are classified as multigene reads and are routinely removed from downstream analysis [14,15]. We evaluated the magnitude of gene overlaps using the Ensembl mouse (v.98) and human (v.98) genome annotations, which are most commonly used to generate reference transcriptomes for scRNA-seq analysis. We found that gene overlaps are a pervasive feature of currently available genome annotations with 2035 (6.3% of all mouse genes) and 5195 (14.2% of all human genes) genes showing partial or complete overlap with other same-strand genes in the mouse and human genomes, respectively (Fig. 3a). The majority of these overlaps in both mouse and human genomes originate from single pairs of genes (Fig. 3b, c).
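Same-strand overlaps of this kind can be found with a sort-and-sweep pass over gene coordinates. The sketch below is one straightforward way to do it, not necessarily the procedure used for the counts above; it assumes genes are given as hypothetical `(name, chromosome, strand, start, end)` tuples with inclusive coordinates, as they would be after parsing the gene records of an annotation file.

```python
from collections import Counter, defaultdict


def same_strand_overlaps(genes):
    """Return pairs of gene names that overlap on the same chromosome and strand."""
    by_key = defaultdict(list)
    for name, chrom, strand, start, end in genes:
        by_key[(chrom, strand)].append((start, end, name))
    pairs = []
    for intervals in by_key.values():
        intervals.sort()
        active = []   # (end, name) of earlier genes that may still overlap
        for start, end, name in intervals:
            # Keep only earlier genes that extend to or past this gene's start.
            active = [(e, n) for e, n in active if e >= start]
            pairs.extend((n, name) for _, n in active)
            active.append((end, name))
    return pairs


def overlaps_per_gene(pairs):
    """Count how many overlap partners each gene has."""
    counts = Counter()
    for first, second in pairs:
        counts[first] += 1
        counts[second] += 1
    return counts
```

The length of `overlaps_per_gene(pairs)` gives the number of overlapping genes (as in Fig. 3a), and the distribution of its values gives the number of overlaps per gene (as in Fig. 3b, c).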
A closer inspection of overlapping genes revealed a few stereotypic patterns of overlaps that result in partial or complete blinding of one or more overlapping genes from downstream analysis. The first problematic pattern stems from readthrough transcripts where one or several of the upstream gene’s transcripts incorporate some or all exons of a downstream gene, which effectively eliminates all sequencing reads mapping to the latter (Fig. 3d). Another problematic feature of overlapping genes are so-called “premature start transcripts” where a single or several transcripts from a downstream gene are annotated to start upstream of the upstream gene’s terminal exon (Fig. 3e). The latter type of overlap is particularly problematic as the majority of sequencing reads in 3’ scRNA-seq map to terminal exons and thus premature start transcripts effectively eliminate the entire detection of their upstream gene. A version of this issue impacts dozens of genes that share their terminal exon and are thus completely invisible to analysis (Extended Fig. 1). Finally, multigene overlapping genes pose a particular problem for pre-mRNA references where a single large gene can completely eliminate dozens of nested genes, rendering downstream analysis blind to their expression (Fig. 3f). An important caveat to the latter is that there are currently several strategies for compiling a pre-mRNA transcriptomic reference with substantial differences in gene detection and read mapping fidelity (Extended Fig. 2). In summary, gene overlaps in genome annotations constitute a unique challenge to discovering valuable candidate genetic mechanisms and marker genes in 3’ single-cell RNA-seq analysis. Moreover, these problems impact thousands of genes, particularly in well annotated genomes.
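The two transcript patterns described above can be expressed as simple coordinate tests. The sketch below is a literal reading of those definitions for a pair of plus-strand genes and should be taken as an illustration under assumptions, not as the authors' curation procedure: the function names are hypothetical, exons are `(start, end)` tuples with inclusive coordinates, and the caller is assumed to supply the downstream gene's reference exons and the upstream gene's terminal exon.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def is_readthrough(upstream_tx_exons, downstream_gene_exons):
    """A transcript of the upstream gene that incorporates exons of the downstream gene."""
    return any(
        overlaps(tx_exon, gene_exon)
        for tx_exon in upstream_tx_exons
        for gene_exon in downstream_gene_exons
    )


def is_premature_start(downstream_tx_exons, upstream_terminal_exon):
    """A transcript of the downstream gene that starts upstream of the
    upstream gene's terminal exon (plus strand only)."""
    tx_start = min(start for start, _ in downstream_tx_exons)
    return tx_start < upstream_terminal_exon[0]
```

On the minus strand the comparisons are mirrored, with transcript starts taken from the largest coordinate.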
The systemic issues with read loss stemming from discarding intronic, intergenic and multigene mapping reads outlined above (Fig. 4a) suggest a straightforward strategy to optimize transcriptomic references. Here, we implement a three-step process to overcome these limitations that is applicable to any genomic annotation. In the first step we convert an exonic reference to a pre-mRNA reference to incorporate intronic reads into gene expression estimates using a hybrid intronic mapping strategy (Fig. 4b). Secondly, we resolve gene overlaps by automated identification and curation of premature and readthrough transcripts, eliminating overlapping transcripts, gene models and long non-coding RNA genes that obscure or preclude detection of protein-coding genes (Fig. 4c). Finally, we incorporate unannotated 3’ UTRs into our gene models by rank ordering genes with high sequencing read mapping within 10kb of their known gene end and supervised 3’ gene extension based on one of several criteria: a) read splicing to known exons, b) extended gene boundary in another genome annotation (e.g. Refseq), c) external ground truth evidence (Allen in situ atlas, Protein Atlas etc). As a result we generated optimized genome annotations for both mouse and human transcriptomes (Fig. 4e, Suppl. Tables 1, 2). This constitutes a general and scalable strategy for optimizing genome annotations for high-efficiency 3’ scRNA-seq analysis.
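Two of the coordinate operations involved in these steps are easy to sketch. The code below is a simplified illustration and makes several assumptions: `premrna_feature` shows the plain full-span variant of a pre-mRNA reference (the whole gene span treated as one feature), not the hybrid strategy referred to above; `limit` is a hypothetical cap, for example the boundary of the next same-strand gene; and the supervised criteria a) to c) are left to the curator and are not modelled.

```python
def premrna_feature(exons):
    """Collapse a gene's exons into one feature spanning the whole gene."""
    return min(start for start, _ in exons), max(end for _, end in exons)


def extend_three_prime(start, end, strand, extension, limit=None):
    """Extend a gene's 3' boundary by `extension` bp, respecting strand.

    `limit` optionally caps the new boundary so the extension does not run
    into a neighbouring gene and create a new overlap.
    """
    if strand == "+":
        new_end = end + extension
        if limit is not None:
            new_end = min(new_end, limit)
        return start, new_end
    new_start = max(1, start - extension)
    if limit is not None:
        new_start = max(new_start, limit)
    return new_start, end


def rank_extension_candidates(downstream_reads, top=None):
    """Rank genes by the number of reads within 10 kb past their 3' end.

    downstream_reads: dict mapping gene name -> read count in that window.
    """
    ranked = sorted(downstream_reads.items(), key=lambda kv: kv[1], reverse=True)
    return [gene for gene, _ in ranked[:top]]
```

The ranked list only prioritizes candidates; whether a given gene is actually extended still depends on the supporting evidence listed in the text.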
Figure 3: Gene overlap and resulting compromised scRNA-seq gene detection in the mouse and human genomes. a. Number of same-strand overlapping genes in the mouse and human genomes (mouse annotation: Ensembl v98 for GRCm38 build; human annotation: Ensembl v98 for GRCh38 build). b. Number of gene overlaps among mouse overlapping genes. c. Number of gene overlaps among human overlapping genes. d. Readthrough transcripts prevent the incorporation of sequencing reads to gene expression estimates in downstream genes. Gene regions where sequencing read data are discarded from gene expression estimates due to multigene classification are highlighted in red. e. Premature-start transcripts prevent the incorporation of sequencing reads to the upstream gene’s expression estimates. Gene regions where sequencing read data are discarded due to multigene classification are highlighted in red. As most sequencing reads map at the 3’ end of genes, premature-start transcripts can render upstream genes undetectable by scRNA-seq analysis. f. Large genes spanning multiple genes can eliminate scRNA-seq detection of dozens of nested same-strand overlapping genes depending on read mapping strategy. With pre-mRNA references, where full gene spans are defined as exons, all nested genes will have no sequencing reads incorporated into expression estimates due to the resulting multi-gene mapping classification.
In order to evaluate the performance of the optimized reference transcriptomes, we evaluated the gene and read detection efficiencies in both mouse brain and human PBMC datasets, and contrasted the analyses to the same scRNA-seq dataset mapped to the traditional exonic reference. We observed dramatic gains in both gene detection and read registration with the optimized mouse transcriptome with more than 3000 newly detected genes and 14.8% more sequencing reads included in downstream analysis. Moreover, the optimized reference yields a profound increase in cellular profiling resolution with close to 600 additional genes/cell on a median basis for MnPO neurons, which constitutes a more than 20% increase in the number of genes detected per neuron (Fig. 5a). Furthermore, this increase in cellular profiling resolution translated into 1–3 additional neuron types detected under identical analysis parameters to exonic transcriptome based analysis. Predictably, the optimized transcriptome revealed genes that were invisible to the traditional exonic reference based scRNA-seq analysis due to sequencing read mapping to intronic and unannotated 3’ UTR regions (Fig. 5b).
We found consistently superior performance of the optimized human genome annotation based analysis as compared to the implementation of an exonic transcriptomic reference. We detected over 4500 additional genes and more than 21% additional sequencing reads in the human PBMC dataset (Fig. 5c). Similarly to the optimized mouse transcriptome, we observed dramatic gains in profiling resolution of cells with more than 400 additional genes/cell detected on a median basis. These gains in gene and read detection in the human dataset translated to up to 6 additional cell types detected under identical analysis parameters as compared to the analysis based on the exonic transcriptomic reference. Therefore, optimizing genome annotations for scRNA-seq analysis can lead to robust gains in sequencing read, gene as well as cell-type detection.
Figure 4: Strategy for compiling an optimized transcriptomic reference. a. Schematic of read registration with regular exonic reference. Registered sequencing reads that are incorporated to gene expression estimates are highlighted in purple with discarded sequencing reads shown in grey. ScRNA-seq analysis with an exonic reference discards several types of sequencing reads that map to a specific gene including intronically mapped reads, reads mapping to exons that overlap with readthrough transcripts from upstream genes (N-1) as well as sequencing reads mapping to unannotated 3’ untranslated regions (UTRs). b. Step 1 of optimizing a transcriptomic reference is incorporating intronic reads thereby generating a “pre-mRNA reference”. c. Step 2 of optimizing a transcriptomic reference is resolving gene overlaps by removing rare readthrough and premature transcripts as well as poorly supported gene models and pseudogenes that result in eliminating sequencing data from well-established protein-coding genes. This step incorporates sequencing reads mapping to exons and introns that previously overlapped with readthrough/premature transcripts. d. Step 3 of optimizing a transcriptomic reference entails extending 3’ boundaries of genes to incorporate unannotated 3’ UTRs with sequencing reads spliced to reads mapping to known exons. e. Genome annotation modifications for optimized mouse and human reference transcriptomes.