Enhanced recovery of single-cell RNA-sequencing reads for missing gene expression data


Allan-Hermann Pool 1, 2, 3, #, Helen Poldsam 4, 5, Sisi Chen, Matt Thomson, Yuki Oka #

Department of Neuroscience, University of Texas Southwestern Medical Center, Dallas, TX, USA

Peter O’Donnell Brain Institute, University of Texas Southwestern Medical Center, Dallas, TX, USA

Department of Anesthesiology and Pain Management, University of Texas Southwestern Medical Center, Dallas, TX, USA

Department of Chemistry and Biotechnology, Tallinn University of Technology, Estonia

Protobios LLC, Mäealuse 4, Tallinn 12618, Estonia

Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA

Correspondence should be addressed to: allan-hermann.pool@utsouthwestern.edu, yoka@caltech.edu

Abstract

Droplet-based 3’ single-cell RNA-sequencing (scRNA-seq) methods have proved transformational in characterizing cellular diversity and generating valuable hypotheses throughout biology 1,2. Here we outline a common problem with 3’ scRNA-seq datasets where genes that have been documented to be expressed with other methods are either completely missing or dramatically under-represented, thereby compromising the discovery of cell types, states and genetic mechanisms. We show that this problem stems from three main sources of sequencing read loss: (1) reads mapping immediately 3’ to known gene boundaries due to poor 3’ UTR annotation; (2) intronic reads stemming from unannotated exons or pre-mRNA; (3) discarded reads due to gene overlaps. Each of these issues impacts the detection of thousands of genes even in the well-characterized mouse and human genomes, rendering downstream analysis either partially or fully blind to their expression. We outline a simple three-step solution to recover the missing gene expression data that entails compiling a pre-mRNA reference to retrieve intronic reads, resolving gene collision-derived read loss through removal of readthrough and premature start transcripts, and redefining 3’ gene boundaries to capture false intergenic reads. We demonstrate with mouse brain and human peripheral blood datasets that this approach dramatically increases the amount of sequencing data included in downstream analysis, revealing […]% more genes per cell and incorporating […] more sequencing reads than with standard solutions. These improvements reveal previously missing biologically relevant cell types, states and marker genes in the mouse brain and human blood profiling data. Finally, we provide scRNA-seq optimized transcriptomic references for human and mouse data as well as a simple algorithmic implementation of these solutions that can be deployed to both thoroughly and poorly annotated genomes. Our results demonstrate that optimizing the sequencing read mapping step can significantly improve the analysis resolution as well as biological insight from scRNA-seq. Moreover, this approach warrants a fresh look at preceding analyses of this popular and scalable cellular profiling technology.

Main

Droplet-based single-cell RNA-sequencing methods such as Dropseq and 10x Genomics platforms have dramatically lowered the cost and improved the throughput of single-cell gene expression profiling. These advances have thereby widely democratized the discovery of new cell types and states, the delineation of developmental mechanisms and the cellular basis of disease, as well as the mapping of behavioral and physiological functions to distinct cell types 11,12. The scalability of such methods, however, comes with a few important limitations. First, the droplet-based methods rely on 3’ gene tagging where detection of genes depends on registering sequencing reads predominantly at the 3’ end of genes, which makes detection of splicing isoforms problematic. Second, 3’ scRNA-seq datasets, which are usually much more shallowly sequenced, are in general considered to have lower sensitivity than deep full-length isoform sequencing solutions such as those provided by the SMART-Seq chemistry. Indeed, several studies have observed that genes shown to be expressed with other methods have critically been missing in analyses relying on droplet-based scRNA-seq 8,11. This shortcoming compromises the potential of 3’ scRNA-seq high-throughput technologies to uncover the genetic and cellular mechanisms giving rise to development and tissue function.

bioRxiv preprint doi: https://doi.org/10.1101/2022.04.26.489449; this version posted April 27, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

The scRNA-sequencing workflow consists of several steps including sample preparation, sequencing library generation, sequencing, read mapping/quantification and analysis of the gene-cell matrix based data. While many of these steps are considered standard, some, such as sample preparation, are widely recognized as critical for the final outcome and can vary significantly between protocols and labs. One often overlooked step in this workflow is read mapping/quantification, which determines which sequencing reads are incorporated in the final cellular gene expression data. During this process, sequencing reads are mapped to the reference transcriptome (i), assigned to genes (ii), assigned to cells (iii), and duplicates are removed (iv) 14,15. As a result of this step, often the majority of sequencing reads get excluded from further analysis for one of several reasons, including failure to map confidently to the transcriptome, being a duplicate read, mapping to multiple sites in the genome (multimapping reads), mapping to more than one gene (multigene reads), or mapping intronically or to an intergenic region. Some of the discarded read data, however, reflect endogenous gene expression, and their loss can render expressed genes missing 16,17. Several groups have manually amended the transcriptome for individual genes to restore their visibility 8,11; however, a systematic effort to evaluate the scale of this problem and to provide a whole-transcriptome solution for this issue has been missing.
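The exclusion rules listed above can be made concrete with a small sketch. The following Python example is illustrative only and is not the logic of any particular mapping pipeline: it reduces each uniquely mapped read to a single position and strand, and the `Gene` class, the `classify_read` and `is_counted` helpers and the two toy gene models are hypothetical names and values introduced here. Real aligners work on full alignments and apply additional criteria such as mapping quality and duplicate removal.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Gene:
    name: str
    strand: str                   # "+" or "-"
    start: int                    # gene span, 1-based inclusive
    end: int
    exons: List[Tuple[int, int]]  # exon intervals, 1-based inclusive


def classify_read(pos: int, strand: str, genes: List[Gene]) -> str:
    """Label a uniquely mapped read, reduced here to one position and a strand."""
    covering = [g for g in genes if g.start <= pos <= g.end]
    sense = [g for g in covering if g.strand == strand]
    exonic = [g for g in sense if any(s <= pos <= e for s, e in g.exons)]
    if len(exonic) > 1:
        return "multigene"   # exons of several same-strand genes: discarded
    if len(exonic) == 1:
        return "exonic"      # the only class counted with an exonic reference
    if sense:
        return "intronic"    # inside a gene span but outside its exons
    if covering:
        return "antisense"   # only opposite-strand genes cover the position
    return "intergenic"


def is_counted(label: str) -> bool:
    """With a standard exonic reference only exonic reads reach the count matrix."""
    return label == "exonic"


# Toy annotation (hypothetical): geneB starts inside the last exon of geneA.
genes = [
    Gene("geneA", "+", 100, 1000, [(100, 200), (800, 1000)]),
    Gene("geneB", "+", 900, 1500, [(900, 1100), (1400, 1500)]),
]
labels = [classify_read(p, "+", genes) for p in (150, 500, 950, 2000)]
```

In this toy case the read at position 950 falls in exons of both genes and is labelled multigene, which illustrates how an overlap can remove reads that map to a well-annotated terminal exon.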

Here, we show that analysis pipelines relying on standard exonic transcriptomic references are blind to many genes that are easily detected with independent methods such as in situ hybridization. We demonstrate that this lack of gene detection does not stem from low sensitivity but rather from inefficiencies of the currently used transcriptomic references, and that this is the case even with very well annotated genomes including those of mouse and human. Furthermore, we show that the read loss stems from three sources: poor annotation of 3’ untranslated regions, gene overlaps stemming from the annotation of rare readthrough or prematurely starting transcripts, and exclusion of intronic reads. We outline a three-step strategy to overcome these limitations through the inclusion of intronic reads, resolving gene overlaps by excluding rare transcript isoforms, and identifying and incorporating unannotated gene 3’ UTRs. This strategy recovers obscured gene expression data for thousands of genes and reveals previously undetected genetic markers, mechanisms and cell types. Consequently, we provide full-genome optimized transcriptomic references for the mouse and human genomes. In sum, our data argue that transcriptomic references need to be optimized for scRNA-seq analysis and that this step can dramatically improve the profiling resolution. These findings also warrant a reanalysis of previously published datasets.

Results

In order to characterize the gene detection fidelity of 3’ gene counting methods we performed scRNA-sequencing of the median preoptic nucleus (MnPO), a mouse brain center implicated in a range of physiological functions including thirst, sleep, heat and cold sensation. Predictably, following sequencing read mapping to an exonic transcriptomic reference we identified about a dozen distinct neuron types in this structure, reflecting the functional diversity of this brain center (Fig. 1a). We next compared gene detection fidelity with scRNA-sequencing to in situ hybridization, an independent method provided by the Allen in situ brain atlas. While we found many genes that were reliably detected with both methods (e.g. Nxph4, Fig. 1b), we observed a number of genes that were completely missing in scRNA-seq data while robustly detected with in situ hybridization (e.g. B4galnt2 and Gpr165, Fig. 1c-d). Follow-up analysis at these loci revealed three distinct patterns of sequencing read mapping that determined whether the gene is detected or missing in scRNA-sequencing analysis. The first type comprised genes detected by both methods. In this case, sequencing reads mapped near perfectly to the exons of the underlying gene and were thus included in downstream transcriptomic analysis (Fig. 1b). A second group of genes were detected by in situ hybridization but were missing in scRNA-seq data as most sequencing reads mapped to an intron of that gene, resulting in exclusion from transcriptomic analysis (Fig. 1c). Finally, a third group of genes were detected by in situ hybridization but not with scRNA-seq and had no sequencing reads mapping to known exons and introns (Fig. 1d). Importantly, the last type of genes displayed excessive read mapping proximal to the known 3’ end of the gene, suggesting that scRNA-seq fails to detect these genes due to poor annotation of 3’ untranslated regions of genes. These data demonstrate that droplet-based single-cell sequencing datasets can fail to detect genes due to suboptimal read mapping to the reference transcriptome.

In order to evaluate the magnitude of the missing gene problem, we quantified several metrics of sequencing read mapping in the two vertebrate species with the most thoroughly annotated genomes: mice and humans. For mice we evaluated the MnPO dataset and for humans we profiled peripheral blood mononuclear cells (PBMCs). In mouse brain data we found that out of the uniquely mapped sequencing reads 71.8% are exonic, 19.5% intronic and 8.7% intergenic out of 272 million total reads, suggesting that significant gains could be achieved by incorporating sequencing data from intronic and intergenic areas into gene expression estimates (Fig. 1e). We found similar metrics in human data with 69.9% exonic, 23.5% intronic and 6.7% intergenic reads (272 million total), respectively. Indeed, upon evaluating the number of genes detected as a result of including intronic reads, intergenic reads within 10 kb of known 3’ gene ends, or both, we observed dramatic gains in the number of detected genes in scRNA-seq datasets with 13.6%, 25.8% and 33.6% more genes detected than with a conventional exonic transcriptome reference in mouse (Fig. 1f). Again, comparable gains were observed with 19.9%, 23.2% and 39.2% more genes detected, respectively, for the human transcriptome. Moreover, we also evaluated the dominant source of read information for genes in the mouse and human datasets. Predictably we found that the majority of mouse genes (79.6%) were dominated by exonic reads with more than 50% of expression data stemming from exonic reads (Fig. 1g). Somewhat surprisingly, less than half of human genes derive the majority of their expression data from exonic reads, with the rest stemming from intronic or 3’ intergenic reads. While not all intronic and proximal intergenic sequencing reads stem from the respective protein-coding gene transcripts, these data indicate that profound gains in gene detection sensitivity are feasible by incorporating relevant intronic and intergenic read data in downstream scRNA-seq analysis.
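The dominant-source classification used for Fig. 1g follows a simple majority rule, which can be sketched as below. This is a minimal illustration under stated assumptions: the function name `dominant_source`, the "undetected" label for genes without reads and the example counts are hypothetical, and a share of exactly 50% is treated as mixed because the source does not specify that boundary case.

```python
def dominant_source(exonic: int, intronic: int, intergenic_3p: int) -> str:
    """Classify a gene by the region contributing more than 50% of its reads.

    The three arguments are per-gene read counts for exons, introns and the
    intergenic window within 10 kb of the annotated 3' end.
    """
    total = exonic + intronic + intergenic_3p
    if total == 0:
        return "undetected"  # assumption: genes without reads get their own label
    shares = {
        "exonic dominant": exonic / total,
        "intronic dominant": intronic / total,
        "3' intergenic dominant": intergenic_3p / total,
    }
    for label, share in shares.items():
        if share > 0.5:      # strict majority; exactly 50% falls through to mixed
            return label
    return "mixed"


# Hypothetical per-gene read counts (exonic, intronic, 3' intergenic).
examples = {
    "gene1": (80, 15, 5),
    "gene2": (10, 70, 20),
    "gene3": (5, 10, 85),
    "gene4": (40, 35, 25),
}
classes = {name: dominant_source(*counts) for name, counts in examples.items()}
```

Only genes in the first class are well served by an exonic reference; for the others, most of the available signal lies in reads that such a reference discards.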


Figure 1: Missing genes and sequencing read registration in single-cell RNA-seq experiments. a. RNA-seq based profiling of the mouse physiology regulating brain center Median Preoptic Nucleus (MnPO). 10x Genomics 3’ transcriptomic analysis of MnPO neurons (n=906) mapped to an exonic transcriptomic reference reveals 13 neuron types. Data shown in a tSNE embedding. b. Sample scRNA-seq detected gene (Nxph4) with sequencing read mapping at its genomic locus. The majority of sequencing reads map to known exons of the Nxph4 gene and are therefore registered (blue) and included in downstream analysis. Discarded reads (red) map to non-exonic regions or are antisense to the gene and are therefore excluded. Inset violin plot: scRNA-seq analysis detects Nxph4 expression in several MnPO neuron types (cell-type specific log-transformed expression of Nxph4 in MnPO neuron types with cell-type identity color-coded as in Fig. 1a). Micrograph inset: in situ hybridization of Nxph4 expression in the MnPO (scale bar: 150 μm, posterior MnPO outlined with white dashed line, data from Allen Brain Atlas Mouse ISH dataset). c. Sample gene (B4galnt2) not detected by scRNA-seq due to intronic read mapping. Inset violin plot: gene expression is not detected in any of the MnPO neuron types. Inset micrograph: in situ hybridization of B4galnt2 expression in the MnPO. d. Sample gene (Gpr165) not detected by scRNA-seq due to intergenic read mapping 3’ of the known end of the gene. Inset violin plot: gene expression is not detected in any of the MnPO neuron types with scRNA-seq. Inset micrograph: in situ hybridization of Gpr165 expression in the MnPO. e. Proportion of uniquely mapped sequencing reads according to mapping site (exonic, intronic or intergenic) for mouse brain (MnPO, left) and human peripheral blood mononuclear cells (right) datasets. Intronic and intergenic reads constitute a promising source to recover missing gene expression data in scRNA-seq analysis. f. Number of detected genes in mouse brain (MnPO, left) and human PBMC (right) datasets, if reads mapping to exons, exons and introns, exons and intergenic reads within 10 kb of known 3’ ends of genes, or all three sources are included in downstream analysis. g. Human and mouse genes according to the dominant source of sequencing read data. Genes are classified as ‘exonic dominant’, ‘intronic dominant’ or ‘3’ intergenic dominant’ if more than 50% of sequencing reads map to their exons, introns or within 10 kb of their 3’ end, respectively. Mixed genes have less than 50% of reads stemming from any of the three regions.

We further evaluated the extent to which intergenic reads 3’ from gene ends could contribute to true gene expression estimates. If unannotated 3’ UTRs constitute a significant source of read loss in 3’ scRNA-seq datasets we would expect to see elevated levels of sequencing reads mapping proximal to the 3’ end of genes. Indeed, we observe several-fold higher mapping of intergenic reads immediately proximal to the 3’ gene ends than at distal sites in both mouse and human datasets (Fig. 2a, b). In fact, close to 5% of intergenic reads in both mouse and human datasets are within 10 kb of 3’ gene ends, which represents approximately […]-fold enrichment as compared to the rest of the non-coding genome 20,21. These results suggest that improved annotation of 3’ gene ends is a promising strategy to increase gene detection in 3’ single-cell RNA-sequencing analysis (Fig. 2c).
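The distance profile described here depends on measuring how far a read lies downstream of an annotated 3’ end, which must take the strand into account. The sketch below shows one way this could be computed for a single gene. It is an illustration, not the analysis code behind Fig. 2: the helper names `distance_past_3p_end` and `proximal_profile`, the 1 kb bin size and the toy coordinates are assumptions introduced here, while the 10 kb window follows the text.

```python
from collections import Counter
from typing import Iterable, List, Optional


def distance_past_3p_end(read_pos: int, gene_start: int, gene_end: int,
                         strand: str) -> Optional[int]:
    """Distance in bp of a read downstream of the annotated 3' gene end.

    On the "+" strand the 3' end is gene_end, on the "-" strand it is
    gene_start. Returns None when the read is not downstream of the gene.
    """
    d = read_pos - gene_end if strand == "+" else gene_start - read_pos
    return d if d > 0 else None


def proximal_profile(read_positions: Iterable[int], gene_start: int,
                     gene_end: int, strand: str, window: int = 10_000,
                     bin_size: int = 1_000) -> List[int]:
    """Count reads per distance bin within `window` bp past the 3' end."""
    counts: Counter = Counter()
    for pos in read_positions:
        d = distance_past_3p_end(pos, gene_start, gene_end, strand)
        if d is not None and d <= window:
            counts[(d - 1) // bin_size] += 1  # bin 0 covers 1..bin_size bp
    return [counts[i] for i in range(window // bin_size)]


# Toy "+" strand gene spanning 1000-5000 with a handful of read positions.
profile = proximal_profile([4000, 5001, 5500, 6500, 15000, 15001],
                           gene_start=1000, gene_end=5000, strand="+")
```

A profile that decays with distance from the gene end, as in this toy example, is the pattern that would be expected if the reads stem from an unannotated 3’ UTR rather than from unrelated intergenic transcription.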

Figure 2: Increased intergenic read mapping proximal to 3’ end of genes. a. Distribution of sequencing reads mapping within 10 kb of known gene ends in the mouse genome shows increased mapping proximal to gene ends. b. Distribution of sequencing reads mapping within 10 kb of known gene ends in the human genome shows increased mapping proximal to gene ends. c. Fraction of intergenic reads mapping within 10 kb of known gene ends from all intergenic reads in the mouse brain (MnPO) and human PBMC datasets.

Another common source of read loss in scRNA-seq analysis stems from same-strand gene overlaps. Reads mapping to genomic regions annotated to more than one gene are classified as multigene reads and are routinely removed from downstream analysis 14,15. We evaluated the magnitude of gene overlaps using the Ensembl mouse (v.98) and human (v.98) genome annotations, which are most commonly used to generate reference transcriptomes for scRNA-seq analysis. We found that gene overlaps are a pervasive feature of currently available genome annotations, with 2035 genes (6.3% of all mouse genes) and 5195 genes (14.2% of all human genes) showing partial or complete overlap with other same-strand genes in the mouse and human genomes, respectively (Fig. 3a). The majority of these overlaps in both mouse and human genomes originate from single pairs of genes (Fig. 3b, c).
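Counting same-strand overlaps of this kind amounts to an interval sweep over gene spans grouped by chromosome and strand. The following sketch shows the idea on a toy annotation. The function name `same_strand_overlaps`, the tuple layout and the gene coordinates are hypothetical, and the sketch compares whole gene spans only, which is an assumption: it does not reproduce the exact overlap criteria used for Fig. 3.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

GeneRecord = Tuple[str, str, str, int, int]  # (name, chrom, strand, start, end)


def same_strand_overlaps(genes: List[GeneRecord]) -> Dict[str, Set[str]]:
    """Map each gene to the set of same-strand genes whose spans overlap it."""
    groups = defaultdict(list)
    for name, chrom, strand, start, end in genes:
        groups[(chrom, strand)].append((start, end, name))

    overlaps: Dict[str, Set[str]] = defaultdict(set)
    for group in groups.values():
        group.sort()                      # sweep from left to right by start
        active: List[Tuple[int, int, str]] = []
        for start, end, name in group:
            # keep only earlier genes that still reach the current start
            active = [g for g in active if g[1] >= start]
            for _, _, other in active:
                overlaps[name].add(other)
                overlaps[other].add(name)
            active.append((start, end, name))
    return dict(overlaps)


# Toy annotation: A/B overlap as a pair, D spans E and F, C is antisense to A/B.
toy = [
    ("A", "chr1", "+", 100, 500), ("B", "chr1", "+", 400, 900),
    ("C", "chr1", "-", 450, 600), ("D", "chr1", "+", 1000, 2000),
    ("E", "chr1", "+", 1200, 1300), ("F", "chr1", "+", 1250, 1900),
    ("G", "chr2", "+", 100, 500),
]
result = same_strand_overlaps(toy)
n_overlapping_genes = len(result)                                # cf. Fig. 3a
n_single_pair_genes = sum(len(v) == 1 for v in result.values())  # cf. Fig. 3b, c
```

The size of the returned dictionary corresponds to the number of overlapping genes, and the size of each set to the number of overlaps per gene, which are the two quantities summarized in the text.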

A closer inspection of overlapping genes revealed a few stereotypic patterns of overlaps that result in partial or complete blinding of one or more overlapping genes from downstream analysis. The first problematic pattern stems from readthrough transcripts, where one or several of the upstream gene’s transcripts incorporate some or all exons of a downstream gene, which effectively eliminates all sequencing reads mapping to the latter (Fig. 3d). Another problematic feature of overlapping genes is so-called “premature start transcripts”, where a single or several transcripts from a downstream gene are annotated to start upstream of the upstream gene’s terminal exon (Fig. 3e). The latter type of overlap is particularly problematic as the majority of sequencing reads in 3’ scRNA-seq map to terminal exons, and thus premature start transcripts effectively eliminate the entire detection of their upstream gene. A version of this issue impacts dozens of genes that share their terminal exon and are thus completely invisible to analysis (Extended Fig. 1). Finally, multigene overlapping genes pose a particular problem for pre-mRNA references, where a single large gene can completely eliminate dozens of nested genes, rendering downstream analysis blind to their expression (Fig. 3f). An important caveat to the latter is that there are currently several strategies for compiling a pre-mRNA transcriptomic reference with substantial differences in gene detection and read mapping fidelity (Extended Fig. 2). In summary, gene overlaps in genome annotations constitute a unique challenge to discovering valuable candidate genetic mechanisms and marker genes in 3’ single-cell RNA-seq analysis. Moreover, these problems impact thousands of genes, particularly in well annotated genomes.

The systemic issues with read loss stemming from discarding intronic, intergenic and multigene mapping reads outlined above (Fig. 4a) suggest a straightforward strategy to optimize transcriptomic references. Here, we implement a three-step process to overcome these limitations that is applicable to any genomic annotation. In the first step we convert an exonic reference to a pre-mRNA reference to incorporate intronic reads into gene expression estimates using a hybrid intronic mapping strategy (Fig. 4b). Secondly, we resolve gene overlaps by automated identification and curation of premature and readthrough transcripts, eliminating overlapping transcripts, gene models and long non-coding RNA genes that obscure or preclude detection of protein-coding genes (Fig. 4c). Finally, we incorporate unannotated 3’ UTRs into our gene models by rank ordering genes with high sequencing read mapping within 10 kb of their known gene end and supervised 3’ gene extension based on one of several criteria: a) read splicing to known exons, b) extended gene boundary in another genome annotation (e.g. RefSeq), c) external ground truth evidence (Allen in situ atlas, Protein Atlas etc.). As a result we generated optimized genome annotations for both mouse and human transcriptomes (Fig. 4e, Suppl. Table 1, 2). This constitutes a general and scalable strategy for optimizing genome annotations for high-efficiency 3’ scRNA-seq analysis.
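As an illustration of the first step, the sketch below shows the simplest conventional way of turning an exonic GTF annotation into a pre-mRNA one, in which each transcript record is relabelled as a single exon so that the full transcript span counts as exonic. This is explicitly not the hybrid intronic mapping strategy referred to above, whose details are not given in this section; the function name `gtf_to_premrna` and the example records are hypothetical, and only the standard nine-column, tab-separated GTF layout is assumed.

```python
from typing import Iterable, Iterator


def gtf_to_premrna(gtf_lines: Iterable[str]) -> Iterator[str]:
    """Yield a pre-mRNA style GTF: transcript records relabelled as exons.

    Header/comment lines are passed through. Original gene, exon and other
    feature records are dropped, so every transcript span becomes one exon.
    """
    for line in gtf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # GTF columns: seqname, source, feature, start, end, score,
        # strand, frame, attributes
        if len(fields) == 9 and fields[2] == "transcript":
            fields[2] = "exon"
            yield "\t".join(fields) + "\n"


# Hypothetical three-record annotation for one gene with one transcript.
example = [
    "#!genome-build example\n",
    '1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "G1";\n',
    '1\tsrc\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    '1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]
converted = list(gtf_to_premrna(example))
```

Because every position within a transcript span is then treated as exonic, any same-strand gene nested inside a larger one becomes a multigene region, which is why resolving gene overlaps in the second step matters more for pre-mRNA references than for exonic ones.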


Figure 3: Gene overlap and resulting compromised scRNA-seq gene detection in the mouse and human genomes. a. Number of same-strand overlapping genes in the mouse and human genomes (mouse annotation: Ensembl v98 for GRCm38 build; human annotation: Ensembl v98 for GRCh38 build). b. Number of gene overlaps among mouse overlapping genes. c. Number of gene overlaps among human overlapping genes. d. Readthrough transcripts prevent the incorporation of sequencing reads into gene expression estimates in downstream genes. Gene regions where sequencing read data are discarded from gene expression estimates due to multigene classification are highlighted in red. e. Premature start transcripts prevent the incorporation of sequencing reads into the upstream gene’s expression estimates. Gene regions where sequencing read data are discarded due to multigene classification are highlighted in red. As most sequencing reads map at the 3’ end of genes, premature start transcripts can render upstream genes undetectable by scRNA-seq analysis. f. Large genes spanning multiple genes can eliminate scRNA-seq detection of dozens of nested same-strand overlapping genes depending on read mapping strategy. With pre-mRNA references, where full gene spans are defined as exons, all nested genes will have no sequencing reads incorporated into expression estimates due to the resulting multigene mapping classification.

To assess the performance of the optimized reference transcriptomes, we evaluated the gene and read detection efficiencies in both mouse brain and human PBMC datasets, and contrasted the analyses to the same scRNA-seq dataset mapped to the traditional exonic reference. We observed dramatic gains in both gene detection and read registration with the optimized mouse transcriptome, with more than 3000 newly detected genes and 14.8% more sequencing reads included in downstream analysis. Moreover, the optimized reference yields a profound increase in cellular profiling resolution with close to 600 additional genes/cell on a median basis for MnPO neurons, which constitutes a more than 20% increase in the number of genes detected per neuron (Fig. 5a). Furthermore, this increase in cellular profiling resolution translated into 1 to 3 additional neuron types detected under identical analysis parameters to the exonic transcriptome based analysis. Predictably, the optimized transcriptome revealed genes that were invisible to the traditional exonic reference based scRNA-seq analysis due to sequencing read mapping to intronic and unannotated 3’ UTR regions (Fig. 5b).

We found consistently superior performance of the optimized human genome annotation based analysis as compared to the implementation of an exonic transcriptomic reference. We detected over 4500 additional genes and more than 21% additional sequencing reads in the human PBMC dataset (Fig. 5c). Similarly to the optimized mouse transcriptome, we observed dramatic gains in the profiling resolution of cells with more than 400 additional genes/cell detected on a median basis. These gains in gene and read detection in the human dataset translated to up to […] additional cell types detected under identical analysis parameters as compared to the analysis based on the exonic transcriptomic reference. Therefore, optimizing genome annotations for scRNA-seq analysis can lead to robust gains in sequencing read, gene and cell type detection.


Figure 4: Strategy for compiling an optimized transcriptomic reference. a. Schematic of read registration with a regular exonic reference. Registered sequencing reads that are incorporated into gene expression estimates are highlighted in purple with discarded sequencing reads shown in grey. scRNA-seq analysis with an exonic reference discards several types of sequencing reads that map to a specific gene including intronically mapped reads, reads mapping to exons that overlap with readthrough transcripts from upstream genes (N-1) as well as sequencing reads mapping to unannotated 3’ untranslated regions (UTRs). b. Step 1 of optimizing a transcriptomic reference is incorporating intronic reads, thereby generating a “pre-mRNA reference”. c. Step 2 of optimizing a transcriptomic reference is resolving gene overlaps by removing rare readthrough and premature transcripts as well as poorly supported gene models and pseudogenes that result in eliminating sequencing data from well-established protein-coding genes. This step incorporates sequencing reads mapping to exons and introns that previously overlapped with readthrough/premature transcripts. d. Step 3 of optimizing a transcriptomic reference entails extending 3’ boundaries of genes to incorporate unannotated 3’ UTRs with sequencing reads spliced to reads mapping to known exons. e. Genome annotation modifications for optimized mouse and human reference transcriptomes.
