Isoform cell-type specificity in the mouse primary motor cortex

Nature | Vol 598 | 7 October 2021 |

195

Article

Isoform cell-type specificity in the mouse

primary motor cortex

A. Sina Booeshaghi

, Zizhen

Yao

, Cindy van Velthoven

, Kimberly

Smith

, Bosiljka

Tasic

Hongkui Zeng

& Lior

Pachter

3,4

✉

Full-length SMART-seq

single-cell RNA sequencing can be used to measure gene

expression at isoform resolution, making possible the identification of specific

isoform markers for different cell types. Used in conjunction with spatial RNA capture

and gene-tagging methods, this enables the inference of spatially resolved isoform

expression for different cell types. Here, in a comprehensive analysis of 6,160 mouse

primary motor cortex cells assayed with SMART-seq, 280,327 cells assayed with

MERFISH

and 94,162 cells assayed with 10x Genomics sequencing

, we find examples

of isoform specificity in cell types—including isoform shifts between cell types that

are masked in gene-level analysis—as well as examples of transcriptional regulation.

Additionally, we show that isoform specificity helps to refine cell types, and that a

multi-platform analysis of single-cell transcriptomic data leveraging multiple

measurements provides a comprehensive atlas of transcription in the mouse primary

motor cortex that improves on the possibilities offered by any single technology.

Transcriptional and post-transcriptional control of individual iso

forms of genes is crucial for neuronal differentiation

–

, and isoforms

of genes have been shown to be specific to cell types in mouse and

human brains

–

. It is therefore not surprising that dysregulation of

splicing has been shown to be associated with several neurodevel

opmental and neuropsychiatric diseases

. Thus, it is of interest to

study gene expression in the brain at single-cell and isoform resolution.

Nevertheless, current single-cell studies aiming to characterize

cell types in the brain using single-cell RNA sequencing (scRNA-seq)

have relied mostly on gene-level analysis. This is, in part, owing to the

nature of the data produced by the highest-throughput single-cell

methods. Popular high-throughput assays such as Drop-seq

, 10x

Genomics Chromium

and inDrops

produce 3′-end reads that are,

in initial pre-processing, collated by gene to produce per-cell gene

counts. SMART-seq

is an scRNA-seq method that produces full-length

reads, enabling the quantification of individual isoforms of genes with

the expectation-maximization algorithm

. However, such increased

resolution comes at the cost of throughput. SMART-seq requires cells

to be deposited in wells, thus limiting the throughput of the assay. In

addition, SMART-seq requires more sequencing per cell

The trade-offs are evident in analysis of scRNA-seq data from the pri

mary motor cortex (MOp) produced by the BRAIN Initiative Cell Census

Network (BICCN)

. We examined 6,160 (filtered) SMART-seq v4 cells

and 94,162 (filtered) 10x Genomics Chromium (10xv3) cells (Extended

Data Fig. 1, Fig.

2a, b

) and found that while 10xv3 and SMART-seq are

equivalent in defining broad classes of cell types, 3′-end technol

ogy that can assay more cells can identify some rare cell types that

are missed at lower cell coverage (Extended Data Fig. 2a). Overall, 56

clusters with gene markers could be identified in the 10xv3 data but

not in the SMART-seq data, whereas only 39 clusters with gene markers

could be identified in the SMART-seq data and not the 10xv3 data—this

differential is consistent with previously reported comparisons of

10x Genomics Chromium and SMART-seq clusters

. However, while

SMART-seq has lower throughput than some other technologies, it has

a notable advantage: because it probes transcripts across their entire

length, SMART-seq makes possible isoform quantification and the

detection of isoform markers for cell types that cannot be detected

with 3′-end technologies (Extended Data Figs. 2b, c). Moreover, the

uniformity of read coverage of SMART-seq data

and its quantifica

tion with state-of-the-art tools

yields higher sensitivity than other

methods, which can make possible refined cell-type classification.

To take advantage of the complementary strengths of these differ-

ent platforms, we introduce an approach to scRNA-seq that links the

SMART-seq resolved isoforms to the 10x Chromium defined cell types,

and merges this information with spatial transcriptomic measurements

obtained by MERFISH

(Fig.

). In addition to revealing extensive iso

form diversity and cell-type specificity in the MOp, we find evidence

for previously missed transcriptionally distinct cell subtypes in the

MOp. Our results extend the notion of a single-cell database beyond a

list of gene markers, and we produce a gene-isoform-space single-cell

atlas of the MOp using the combined 10xv3, SMART-seq and MERFISH

data. Our methods are open source, reproducible, easy to use and con

stitute an effective workflow for leveraging full-length scRNA-seq data

in combination with data from other technologies.

Isoform markers for cell types

To identify isoform markers of cell types, we first sought to visualize

our SMART-seq data using gene-derived cluster labels from the BICCN

analysis (Methods). Rather than layering cluster labels on cells mapped

https://doi.org/10.1038/s41586-021-03969-3

Received: 9 March 2020

Accepted: 27 August 2021

Published online: 6 October 2021

Open access

Check for updates

Department of Mechanical Engineering, California Institute of Technology, Pasadena, CA, USA.

Allen Institute for Brain Science, Seattle, WA, USA.

Division of Biology and Biological

Engineering, California Institute of Technology, Pasadena, CA, USA.

Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA.

✉

e-mail:

lpachter@caltech.edu

196

| Nature | Vol 598 | 7 October 2021

Article

to 2D with an unsupervised dimensionality-reduction technique such

distributed stochastic neighbour embedding

(

-SNE) or uniform

manifold approximation and projection

(UMAP), we used a super

vised learning approach to project cells so that they are best separated

according to BICCN consortium

annotations using neighbourhood

component analysis (NCA). This method produces meaningful rep

resentation of the global structure of the data (Fig.

), without over-

fitting (Supplementary Fig. 1a). Analysis of the projections revealed

a batch effect in the 10xv3 data—which we addressed by restricting

analysis to a single batch—and minimal evidence of a batch effect in

the MERFISH data (Methods, Supplementary Fig. 2a, b).

Next, motivated by the discovery of genes exhibiting differential

exon usage between glutamatergic and GABAergic (γ-aminobutyric

acid-producing) neurons in the primary visual cortex

, we performed a

differential analysis between these two classes of neurons. We searched

for significant shifts in isoform abundances in genes whose expression

was stable across cell types (Methods). We discovered 398 such isoform

markers belonging to 310 genes (Supplementary Table 1). Figure

shows an example of such an isoform from the oxidative resistance 1

(

Oxr1

) gene, which is known to be essential for protection against oxi-

dative stress-induced neurodegeneration

. While we see no change

in gene expression of

Oxr1

between these two neuron types, we find

that among the 16 isoforms of the gene, one of them,

Oxr1-204

, is more

highly expressed in glutamatergic neurons. The

Oxr1

gene undergoes

an isoform shift in GABAergic neurons, where the expression of the

Oxr1-204

isoform is significantly lower, suggesting distinct subcellular

isoform localization in the two neuron types

. A gene-level analysis is

blind to this isoform shift (Fig.

, top right).

We hypothesized that there are genes exhibiting cell-type isoform

specificity at all levels of the MOp cell ontology. However, detection of

such genes and their associated isoforms requires meaningful cell-type

assignments and accurate isoform quantifications. To assess the reli-

ability of the SMART-seq clusters produced by the BICCN

, we exam-

ined the correlation in gene expression by cluster with an orthogonal

scRNA-seq technology, the 10xv3 3′-end assay. We clustered 94,162

10xv3 cells, also derived from the MOp, using the same method as the

SMART-seq cells (Methods). The clustering method generates three

hierarchies of cells: classes, subclasses and clusters. The SMART-seq

data have 2 major classes (glutamatergic and GABAergic), 18 subclasses

that subdivide the classes, and 62 clusters that subdivide the subclasses.

The 10x data similarly contain three hierarchies of cells: two major

classes (glutamatergic and GABAergic), 21 subclasses and 85 clusters.

We found high correlation of gene expression between the two assays

at the subclass and cluster levels (Extended Data Fig. 3).

Next, we assessed the accuracy of the SMART-seq isoform quantifica

tion and its concordance with 10xv3 quantifications of isoforms. Since

not all isoforms can be quantified from 10xv3 3′-end data, we examined

only isoforms containing some unique 3′ UTR sequence. This enabled

us to validate the isoform quantifications using a different method

(Methods). To extract isoform quantifications from 10xv3 data in cases

where there was a unique 3′ sequence, we relied on transcript compat

ibility counts

produced by pseudoalignment with kallisto

. We were

able to validate the SMART-seq isoform shift predictions at both the

subclass and cluster levels (Extended Data Fig. 4). The isoform abun

dance correlations are slightly lower than those for gene abundance

estimates (Extended Data Fig. 3), but sufficiently accurate to identify

significant isoform shifts, consistent with benchmarks showing that

isoforms can be quantified accurately from full-length bulk RNA-seq

Having validated the cluster assignments and isoform abundance

estimates, we tested for isoform switches for 16 cell subclasses exclud

ing low quality cells (example in Fig.

), and then for 48 distinct clusters

for subclasses that have more than one cluster (example in Fig.

) and

more than 5 cells per cluster (Methods). At the higher level of 16 cell

subclasses, we found a total of 654 isoforms from 550 genes within

the glutamatergic class and 381 isoforms from 332 genes within the

GABAergic class exhibiting isoform shifts among the 16 cell subclasses

despite constant gene abundance (Supplementary Table 2a, b). There

are several notable examples of isoform shifts at this level. For exam-

ple, we find a shift in the

Snap25-202

isoform, whose expression has

been specifically shown to be correlated with age and to differentially

′

Gene tagging

Drop-seq, inDr

ops or 10x Chr

omium

Smart-seq, Oxfor

d Nanopor

e or

Pacic Biosciences

Detector

′

′

Isoform sequencing

Spatial captur

seqFISH, MERFISH or 10x Visium

Genes

Isoforms

Gene

space

Isoform

space

Physical

space

Cell type

identication

Isoform

renement

Spatial

tethering

Fig. 1 | Measuring RNA with multiple platforms.

RNA is measured using

gene-tagging techniques such as the 10x Chromium scRNA-seq protocol,

isoform sequencing techniques such as SMART-seq, and spatial RNA-capture

techniques such as MERFISH. High-cell-throughput gene tagging enables

cell-type identification with marker genes and deep full-length isoform

sequencing enables cell-type marker refinement at the isoform level. Spatial

RNA capture coupled with gene tagging and isoform sequencing enables

spatial resolution of cell-type-specific isoform markers. The multi-method

procedure for sampling RNA enables inference of spatially resolved cell-type-

specific isoforms that no single technique could achieve independently

Nature

| Vol 598

| 7 October 2021

197

regulate synaptic transmission and synaptic plasticity at central syn

apses

. This isoform marks the L6b subclass (Fig.

). At the cluster

level, we found 923 isoforms from 823 genes exhibiting isoform shifts

among the 48 clusters passing filter despite constant gene abundance

(Supplementary Table 3). Another isoform of note that marks the L6b

Ror1

_ 1 cluster, a subset of cells in the L6b subclass, is the

Stxbp2-207

isoform whose gene

Stxbp2

has previously been detected in the sub

thalamic nucleus and the posterior hypothalamus

Assaying both male and female mice enabled us to examine

sex-specific effects in all subclasses except for the L5 IT, which was

excluded owing to batch effect (Supplementary Fig. 4). In total, these

subclasses exhibited 418 sex-specific isoforms, averaging 40 iso

forms per subclass (Supplementary Table 7). Unlike a recent study

that reported a sex-specific cell type in the ventromedial nucleus of

the hypothalamus

, we do not find any sex-specific subclasses. How

ever, we observed several autosomal isoforms that were differentially

expressed between male and female mice. Among these, the

Shank1-203

isoform is differential in

Vip

neurons, a finding that refines previous

data showing that

Shank1

, which has been shown to localize in Purkinje

cells in the cortex

, is a sex-specific gene whose expression is regulated

by sex hormones

We also investigated instances where clusters could be refined

according to isoform expression. After reclustering each 10xv3-derived

cluster using SMART-seq isoform quantifications (see Methods), we

found that 12 clusters can be split by isoforms. Examining the L6 CT

Grp_1 cluster, we find that the average effect size for differential iso

forms that split the cluster into two sub-clusters is higher than that for

genes (Extended Data Fig. 5). One isoform in particular, which splits

the L6 CT Grp_1 cluster, is a protein-coding isoform of the amyloid

precursor protein gene (

App

). Dysregulation of splicing for isoforms

App

have been associated with disease pathogenesis in Alzheimer

disease models

. Our findings show that isoform-level expression can

help refine cell types in the mouse MOp beyond what is possible using

gene-level expression estimates.

Along with isoforms detectable as differential between cell types

without change in gene abundance, we identified isoform markers

for the classes, subclasses, and clusters in the MOp ontology that are

differential regardless of gene expression. We found 5,658 isoforms

belonging to 3,132 genes that are specific to the glutamatergic and

GABAergic classes (Fig.

, Supplementary Table 4), 7,588 isoforms

belonging to 4,171 genes within the glutamatergic class and 4,359 iso

forms belonging to 2,614 genes within the GABAergic class exhibiting

isoform shifts specific to subclasses (Supplementary Table 5a, b), and

for the 48 clusters passing filter, 3,171 isoforms belonging to 2,461 genes

exhibiting isoform shifts in clusters (Supplementary Table 6). Together,

these form an isoform catalogue of the MOp (Supplementary Fig. 5a, b).

Spatial isoform specificity

Spatial scRNA-seq methods are not currently well suited to directly

probing isoforms of genes owing to the number and lengths of probes

required—however, spatial analysis at the gene level can be refined

to yield isoform-level results by extrapolating SMART-seq isoform

quantifications (Fig.

, Supplementary Fig. 5c).

Figure

4a, b

shows an example of a gene,

Pvalb

, for which the

SMART-seq quantification reveals that of the two isoforms of the gene,

only one,

Pvalb-201

, is expressed. Moreover, this effect is specific to

the

Pvalb

cell subclass (Fig.

). In an examination of MERFISH spa

tial single-cell RNA-seq derived from 64 slices from the MOp region

(Extended Data Fig. 6a), the

Pvalb

subclass, for which

Pvalb

is a marker,

can be seen to be dispersed throughout the motor cortex spanning all

layers (Extended Data Fig. 6b). While MERFISH probes only measure

abundance of

Pvalb

at the gene level (Fig.

), extrapolation from the

SMART-seq quantifications can be used to refine the MERFISH result to

reveal the spatial expression pattern of the

Pvalb-201

isoform.

This extrapolation can be done systematically. To build a spatial

isoform atlas of the MOp, we identified differentially expressed genes

from the MERFISH data (Supplementary Table 8a, b) and for each of

them checked whether there were SMART-seq isoform markers (from

Supplementary Table 5a, b). An example of the result is shown in Fig.

which displays one gene for each cluster, together with the isoform

label specific to that cluster and the spatial location of the specific

cluster within a slice of the mouse MOp.

SMAR

T- Se

0xv3

MERFISH

Reads

15,229,289,828

22,696,236,495

Cells

6,295

94,162

280,327

Genes

10,333

5,891

Isoforms

20,319

Classes

Subclasses

Clusters

1479

Cluster

Class

Subclass

erage detected per cell

Differential isoform

log(TPM + 1)

Lamp5

: 360

Sncg

: 77

Vip

: 607

Sst

: 406

Pvalb

: 521

L5 IT

: 1,561

L5/6 NP: 207

L5 PT

: 12

L2/3 IT

: 482

L6b: 570

VLMC: 6

L6 IT

: 395

L6 CT

: 904

L6 IT

Car3

: 5

Endo: 7

Astr

o: 9

SMC: 19

Low quality : 12

Glutamater

gic

GABAer

gic

Oxr1-204

Oxr1

Snap25-202

Snap25

L2/3 IT

L5 PT

L5/6 NP

L6 CT

L6 IT

Car3

L6b

Stxbp2-207

Stxbp2

L6b

Kcnip1

L6b

Kcnip1

L6b

Ror1

L6b

Ror1

L6b

Rprm

L6b

Shisa6

GABAer

gic

Glutamater

gic

Fig. 2 | Isoform specif icity in the absence of gene specif icity.

, Overview of

the data analysed. The clustering method used by the BICCN consortium

generates three hierarchies of cells: classes, subclasses and clusters. NA, not

applicable.

, A

-SNE of 10 neighbourhood components of 6,160 SMART-seq

cells coloured according to subclass. Astro, astrocytes; CT, corticothalamic;

endo, endothelial; IT, intratelencephalic; NN, non-neuronal; NP,

near-projecting; PT, pyramidal tract; SMC, smooth muscle cells; VLMC, vascular

lepotomeningeal cells.

, Example of a gene with an isoform specific to the

glutamatergic class. The

Ox r1-2 0 4

isoform distribution in log1p(transcripts per

million (TPM)) across cells (left) superimposed on the

-SNE of the cells. The

cells belonging to the glutamatergic class are circled. The violin plots of

the gene and isoform distributions show that the gene is not differentially

expressed but the isoform is (right).

, Example of a gene with an isoform

specific to the L6b subclass. The

Snap25-202

isoform distribution across cells

(left) superimposed on the

-SNE of the cells. The cells belonging to the L6b

subclass are circled. The violin plots of the gene and isoform distributions show

that the gene is not differentially expressed but the isoform is (right).

, Example of a gene with an isoform specific to the L6b

Ror1

_1 cluster. The

St xb p2-2 07

isoform distribution in log1p(TPM) across cells (left) superimposed

on the

-SNE of the cells. The cells belonging to the L6b

Ror1

_1 cluster are

circled. The violin plots of the gene and isoform distributions show that the

gene is not differentially expressed but the isoform is (right). *

< 0.01 between

the group and its complement. In violin plots, white circles represent the mean

and white bars represent the s.d.

198

| Nature | Vol 598 | 7 October 2021

Article

We hypothesized that the mouse MOp exhibits changes in isoform

expression associated with the physical location of cells

. To deter-

mine whether there are isoforms that increase or decrease in expression

along the depth of the motor cortex, we first estimated the position

of the various layers in the glutamatergic subclasses (Extended Data

Fig. 7a, Methods), performed weighted least-squares regression on

the centroids of the subclasses and inferred isoform expression from

the SMART-seq data. While we find many isoforms that exhibit a sig

nificant change in expression across the depth (Extended Data Fig. 7b,

c, Supplementary Table 8c), none of the isoforms that pass our filter

exhibit a monotonic change with respect to the mean. This suggests

that non-linear models may be better suited to study isoform variability

across the depth of the mouse MOp.

While direct measurement of isoform abundance may be possible

with spatial RNA-seq technologies such as SEQFISH

or MERFISH

such resolution would require dozens of probes to be assayed per

gene (Supplementary Fig. 6), each of which is typically tens of base

pairs in length. Thus, while isoforms can theoretically be detected

in cases where they contain large stretches of unique sequence, the

technology is currently prohibitive for assaying most isoforms, mak

ing the extrapolation procedure described here of practical relevance

(Supplementary Table 9).

Splicing markers

Isoform quantification of RNA-seq can be used to distinguish shifts in

expression between transcripts that share transcriptional start sites

and shifts due to the use of distinct transcription start sites (TSSs).

Investigating such differences can, in principle, shed light on tran

scriptional versus post-transcriptional regulation of detected isoform

shifts

. For example, in the glutamatergic class,

Ptk2b

(Extended

Data Fig. 8a) exhibits differential expression of transcripts between

start sites (Extended Data Fig. 8b). This gene is known to be associ

ated with Alzheimer’s disease and its transcript usage is mediated

by genetic variation

. We find that isoforms sharing the preferential

start site exhibit no discernible difference in expression (Extended

Data Fig. 8c), suggesting that the observed differences result from

cell-type-specific transcriptional, rather than post-transcriptional

regulation. We identified 1,971 isoforms from 128 groups of TSSs where

the TSSs are preferentially expressed in either GABAergic, glutamater

gic or non-neuronal classes, even when the expression of isoforms

contained within the TSS is constant (Supplementary Table 10a, c, d).

Such cases are likely to be instances where the TSS shifts between cell

types are a result of differential promoter usage—that is, the result of

a transcriptional program.

We also examined post-transcriptional programs (Supplementary

Table 10b), instances where the TSSs are not differential between

classes, but where there are isoform shifts within TSSs between classes.

We find 31 isoforms from 28 TSS groups that are differential between

classes when the TSS group is not. One such example is expression

of isoforms

Rtn1-201

and

Rtn1-203

, which share the same TSS in the

Rtn1

gene. The glutamatergic class exhibits preferential expression

Rtn1-201

, which was previously shown to be expressed in grey mat-

ter

, whereas the GABAergic class does not (Extended Data Fig. 9).

These cases are likely to be instances where isoform shifts between

cell types are a result of differential splicing—that is, the result of a

post-transcriptional program.

Spon1

Bcl11b

C1ql3

Olfm3

Syndig1

Sema5a

Nr4a2

Lamp5

Cxcl14

Vip

Sst

Pvalb

Cldn5

MERFISH TPM

100,000

20,000

50,000

100,000

250,000

20,000

250,000

150,000

100,000

25,000

500,000

L5/6 NP (3,482)

L5 PT (6,759)

L2/3 IT (33,757)

L6b (3,487)

L6 IT (12,475)

L6 CT (23,878)

L6 IT

Car3

(1,563)

Lamp5

(3,095)

Sncg

(477)

Vip

(2,789)

Sst

(4,853)

Pvalb

(7,885)

Endo

(18,183)

201

202

201

202

201

Fig. 3 | Isoform atlas.

Spatial isoform atlas of the MOp. The scatter plots (left)

show the locations of cells (black dots) in the indicated subclass within a single

representative slice of the mouse MOp as assayed by MERFISH. Right, each

column corresponds to a marker gene in the MERFISH dataset and each row

corresponds to a subclass (number of cells in parentheses) in which one

isoform (labelled on the diagonal) was differential in the SMART-seq dataset for

that subclass. This spatial isoform inference links isoform expression from the

SMART-seq data with the physical locations of cells expressing that isoform

from the MERFISH data. The normalized gene expression values are plotted for

each subclass–gene pair in TPM. The white circles in the violin plots represent

the mean and the white bars represent the s.d.

Pvalb-201

Pvalb-202

(

μ

(

μ

Fig. 4 | Spatial extrapolation of isoform expression.

, Expression of the

P val b -2 01

(

) and

P val b -2 02

(

) isoforms in log1p(TPM) units for each cell

superimposed on the NCA–

-SNE plot, as assayed by SMART-seq.

, Spatial

expression of the

P val b -2 01

isoform across 64 slices from the MOp, as

extrapolated from probes for the

Pvalb

gene assayed by MERFISH. Each cell

represented by MERFISH data is coloured by its expression of

Pvalb

normalized counts.

Nature | Vol 598 | 7 October 2021 |

199

Discussion

Our spatially resolved isoform atlas of the mouse MOp expands on

previously identified gene markers, extending the catalogue to isoform

markers for cell types characterized by the BICCN

. Our approach

leverages distinct strengths of different technologies, using the iso

form resolution of SMART-seq in conjunction with the complementary

cell depth obtainable with 10x Genomics technology and the spatial

resolution produced with MERFISH to spatially place cell-type isoform

markers. This validated approach, in which we leverage technologies

that are broadly consistent (Extended Data Fig. 10) yet complemen

tary in their strengths, is important because isoform specificity could

help in explaining the molecular basis of morphological differences.

For example,

Pvalb

cells observed in hippocampal

Pvalb

interneu

rons cannot be distinguished morphologically solely on the basis of

gene-level analysis

Spatial isoform markers also enable more targeted assays for ‘auto

matic expression histology’ and can facilitate investigation of the

functional significance of cell-type isoform specificity. Recently devel

oped experimental methods for this purpose—for example, isoform

screens

—are a promising direction and will be key to understanding

the significance of the vast isoform diversity in the brain

Online content

Any methods, additional references, Nature Research reporting sum-

maries, source data, extended data, supplementary information,

acknowledgements, peer review information; details of author contri

butions and competing interests; and statements of data and code avail

ability are available at

https://doi.org/10.1038/s41586-021-03969-3

Picelli, S. et al. Full-length RNA-seq from single cells using Smart-seq2.

Nat. Protoc.

171–181 (2014).

Chen, K. H., Boettiger, A. N., Moffitt, J. R., Wang, S. & Zhuang, X. Spatially resolved, highly

multiplexed RNA profiling in single cells.

Science

348

, aaa6090 (2015).

Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells.

Nat. Commun.

, 14049 (2017).

Weyn-Vanhentenryck, S. M. et al. Precise temporal regulation of alternative splicing

during neural development.

Nat. Commun.

, 2189 (2018).

Walker, R. L. et al. Genetic control of gene expression and splicing in the developing

human brain. Preprint at

https://doi.org/10.1101/471193

(2018).

Porter, R. S., Jaamour, F. & Iwase, S. Neuron-specific alternative splicing of transcriptional

machineries: implications for neurodevelopmental disorders.

Mol. Cell. Neurosci.

35–45 (2018).

Lukacsovich, D. et al. Single-cell RNA-seq reveals developmental origins and ontogenetic

stability of neurexin alternative splicing profiles.

Cell Rep

, 3752-3759.e4 (2019).

Song, Y. et al. Single-cell alternative splicing analysis with expedition reveals splicing

dynamics during neuron differentiation.

Mol. Cell

, 148-161.e5 (2017).

Que, L., Winterer, J. & Földy, C. Deep survey of GABAergic interneurons: emerging

insights from gene-isoform transcriptomics.

Front. Mol. Neurosci.

, 115 (2019).

10.

Huang, C.-C., Lin, Y.-S., Lee, C.-C. & Hsu, K.-S. Cell type-specific expression of eps8 in the

mouse hippocampus.

BMC Neurosci

, 26 (2014).

11.

Zhang, Y. et al. An RNA-sequencing transcriptome and splicing database of glia, neurons,

and vascular cells of the cerebral cortex.

J. Neurosci.

, 11929–11947 (2014).

12.

Sugino, K. et al. Mapping the transcriptional diversity of genetically and anatomically

defined cell populations in the mouse brain.

eLife

, e38619 (2019).

13.

Bittar, P. G., Charnay, Y., Pellerin, L., Bouras, C. & Magistretti, P. J. Selective distribution of

lactate dehydrogenase isoenzymes in neurons and astrocytes of human brain.

J. Cereb.

Blood Flow Metab.

, 1079–1089 (1996).

14.

Tasic, B. et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics.

Nat. Neurosci.

, 335–346 (2016).

15.

Gandal, M. J. et al. Transcriptome-wide isoform-level dysregulation in ASD, schizophrenia,

and bipolar disorder.

Science

362

, eaat8127 (2018).

16.

Petri, S. et al. The mRNA expression of AMPA type glutamate receptors in the primary

motor cortex of patients with amyotrophic lateral sclerosis: an in situ hybridization study.

Neurosci. Lett.

360

, 170–174 (2004).

17.

Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells

using nanoliter droplets.

Cell

161

, 1202–1214 (2015).

18.

Klein, A. M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic

stem cells.

Cell

161

, 1187–1201 (2015).

19.

Ramsköld, D. et al. Full-length mRNA-seq from single-cell levels of RNA and individual

circulating tumor cells.

Nat. Biotechnol.

, 777–782 (2012).

20.

Arzalluz-Luque, Á. & Conesa, A. Single-cell RNAseq for the study of isoforms-how is that

possible?

Genome Biol

, 110 (2018).

21.

Seirup, M. et al. Reproducibility across single-cell RNA-seq protocols for spatial ordering

analysis.

PLoS ONE

, e0239711 (2020).

22.

Yao, Z. et al. A transcriptomic and epigenomic cell atlas of the mouse primary motor

cortex.

Nature

https://doi.org/10.1038/s41586-021-03500-8

(2021).

23.

Wang, X., He, Y., Zhang, Q., Ren, X. & Zhang, Z. Direct comparative analyses of 10x

Genomics Chromium and Smart-seq2.

Genomics Proteomics Bioinfomatics

https://doi.org/

10.1016/j.gpb.2020.02.005

(2021).

24.

Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq

quantification.

Nat. Biotechnol.

, 525–527 (2016).

25.

Zhang, M. et al. Spatially resolved cell atlas of the mouse primary motor cortex by

MERFISH.

Nature

https://doi.org/10.1038/s41586-021-03705-x

(2021).

26.

Maaten, L. V. D. & Hinton, G. Visualizing data using

-SNE.

J. Mach. Learn. Res.

2579−2605 (2008).

27.

Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP.

Nat. Biotechnol.

, 38–44 (2018).

28.

Oliver, P. L. et al. Oxr1 is essential for protection against oxidative stress-induced

neurodegeneration.

PLoS Genet

, e1002338 (2011).

29.

Volkert, M. R. Preventing neurodegeneration by controlling oxidative stress: the role of

OXR1.

Front. Neurosci.

, 611904 (2020).

30.

Wu, Y., Davies, K. E. & Oliver, P. L. The antioxidant protein Oxr1 influences aspects of

mitochondrial morphology.

Free Radic. Biol. Med.

, 255–267 (2016).

31.

Tasic, B. et al. Shared and distinct transcriptomic cell types across neocortical areas.

Nature

563

, 72–78 (2018).

32.

Ntranos, V., Yi, L., Melsted, P. & Pachter, L. A discriminative learning approach to

differential expression analysis for single-cell RNA-seq.

Nat. Methods

, 163–166

(2019).

33.

Sarantopoulou, D. et al. Comparative evaluation of full-length isoform quantification from

RNA-seq.

BMC Bioinformatics

, 266 (2021).

34.

Bark, I. C., Hahn, K. M., Ryabinin, A. E. & Wilson, M. C. Differential expression of SNAP-25

protein isoforms during divergent vesicle fusion events of neural development.

Proc. Natl

Acad. Sci. USA

, 1510–1514 (1995).

35.

Irfan, M. et al. SNAP-25 isoforms differentially regulate synaptic transmission and

long-term synaptic plasticity at central synapses.

Sci. Rep.

, 6403 (2019).

36.

Wallén-Mackenzie, Å. et al. Spatio-molecular domains identified in the mouse

subthalamic nucleus and neighboring glutamatergic and GABAergic brain structures.

Commun. Biol

, 338 (2020).

37.

Kim, D.-W. et al. Multimodal analysis of cell types in a hypothalamic node controlling

social behavior.

Cell

179

, 713-728.e17 (2019).

38.

Böckers, T. M. et al. Differential expression and dendritic transcript localization of Shank

family members: identification of a dendritic targeting element in the 3′ untranslated

region of Shank1 mRNA.

Mol. Cell. Neurosci.

, 182–190 (2004).

39.

Berkel, S. et al. Sex hormones regulate SHANK expression.

Front. Mol. Neurosci.

, 337

(2018).

40.

Zhang, Y., Thompson, R., Zhang, H. & Xu, H. APP processing in Alzheimer’s disease.

Mol. Brain

, 3 (2011).

41.

Rash, B. G. & Grove, E. A. Area and layer patterning in the developing cerebral cortex.

Curr. Opin. Neurobiol.

, 25–34 (2006).

42.

Sansom, S. N. & Livesey, F. J. Gradients in the brain: the control of the development of form

and function in the cerebral cortex.

Cold Spring Harb. Perspect. Biol.

, a002519 (2009).

43.

Trapnell, C. et al. Transcript assembly and quantification by RNA-seq reveals unannotated

transcripts and isoform switching during cell differentiation.

Nat. Biotechnol.

, 511–515

(2010).

44.

Yi, L. et al. Gene-level differential analysis at transcript-level resolution.

Genome Biol.

53 (2018).

45.

Raj, T. et al. Integrative transcriptome analyses of the aging brain implicate altered

splicing in Alzheimer’s disease susceptibility.

Nat. Genet.

, 1584–1592 (2018).

46.

Mills, J. D. et al. Unique transcriptome patterns of the white and grey matter corroborate

structural and functional heterogeneity in the human frontal lobe.

PLoS ONE

, e78480

(2013).

47.

Klausberger, T. & Somogyi, P. Neuronal diversity and temporal dynamics: the unity of

hippocampal circuit operations.

Science

321

, 53–57 (2008).

48.

Harris, K. D. et al. Classes and continua of hippocampal CA1 inhibitory neurons revealed

by single-cell transcriptomics.

PLoS Biol

, e2006387 (2018).

49.

Thomas, J. D. et al. RNA isoform screens uncover the essentiality and tumor-suppressor

activity of ultraconserved poison exons.

Nat. Genet.

, 84–94 (2020).

50.

Karlsson, K. & Linnarsson, S. Single-cell mRNA isoform diversity in the mouse brain.

BMC Genomics

, 126 (2017).

51.

Beerenwinkel, N., Pachter, L. & Sturmfels, B. Epistasis and shapes of fitness landscapes.

Stat. Sinica

, 1317–1342 (2007).

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in

published maps and institutional affiliations.

Open Access

This article is licensed under a Creative Commons Attribution

4.0 International License, which permits use, sharing, adaptation, distribution

and reproduction in any medium or format, as long as you give appropriate

credit to the original author(s) and the source, provide a link to the Creative Commons license,

and indicate if changes were made. The images or other third party material in this article are

included in the article’s Creative Commons license, unless indicated otherwise in a credit line

to the material. If material is not included in the article’s Creative Commons license and your

intended use is not permitted by statutory regulation or exceeds the permitted use, you will

need to obtain permission directly from the copyright holder. To view a copy of this license,

visit

http://creativecommons.org/licenses/by/4.0/

Article

Methods

All of the results and figures in the paper are reproducible starting

with the raw reads using scripts and code downloadable from

https://

github.com/pachterlab/BYVSTZP_2020

. The repository makes the

method choices completely transparent, including all parameters and

thresholds used. All

-values were corrected using Bonferroni correc

tion and all error bars denote ±1× s.d. from the mean.

Tissue collection and isolation of cells

Mouse breeding and husbandry.

All procedures were carried out in ac

cordance with Institutional Animal Care and Use Committee protocols

at the Allen Institute for Brain Science. Mice were provided food and

water ad libitum and were maintained on a regular 12-h day:night cycle

at no more than five adult animals per cage. For this study, we enriched

for neurons by using Snap25-IRES2-Cre mice

(MGI:J:220523) crossed

to Ai14

(MGI: J:220523), which were maintained on the C57BL/6J back

ground (RRID:IMSR_JAX:000664). Mice were euthanized at 53−59 days

of postnatal age. Tissue was collected from both males and females for

scRNA SMART and scRNA 10x v3 analysis.

Single-cell isolation.

We isolated single cells by adapting previously

described procedures

. The brain was dissected, submerged in

artificial cerebrospinal fluid (ACSF)

, embedded in 2% agarose and

sliced into 250-μm (SMART-seq) or 350-μm (10x Genomics) coronal

sections on a compresstome (Precisionary Instruments). The Allen

Mouse Brain Common Coordinate Framework version 3 (CCFv3,

RRID:SCR_002978)

ontology was used to define MOp for dissections.

For SMART-seq, MOp was microdissected from the slices and dis

sociated into single cells with 1 mg ml

−1

pronase (Sigma P6911-1G) and

processed as previously described

. For 10x Genomics analysis, tissue

pieces were digested with 30 U ml

−1

papain (Worthington PAP2) in ACSF

for 30 min at 30 °C. Enzymatic digestion was quenched by exchanging

the papain solution three times with quenching buffer (ACSF with 1% FBS

and 0.2% BSA). The tissue pieces in the quenching buffer were triturated

through a fire-polished pipette with 600-μm diameter opening approxi

mately 20 times. The solution was allowed to settle and supernatant

containing single cells was transferred to a new tube. Fresh quenching

buffer was added to the settled tissue pieces, and trituration and super

natant transfer were repeated using 300-μm and 150-μm fire polished

pipettes. The single-cell suspension was passed through a 70-μm filter

into a 15-ml conical tube with 500 μl high-BSA buffer (ACSF with 1% FBS

and 1% BSA) at the bottom to help cushion the cells during centrifuga-

tion at 100

in a swinging-bucket centrifuge for 10 min. The supernatant

was discarded and the cell pellet was resuspended in quenching buffer.

All cells were collected by fluorescence-activated cell sorting (BD Aria

II, RRID: SCR_018091) using a 130-μm nozzle. Cells were prepared for

sorting by passing the suspension through a 70-

μm filter and adding DAPI

(to the final concentration of 2 ng ml

−1

). The sorting strategy was as previ

ously described

, with most cells collected using the tdTomato-positive

label. For SMART-seq, single cells were sorted into individual wells of

8-well PCR strips containing lysis buffer from the SMART-seq v4 Ultra

Low Input RNA Kit for Sequencing (Takara 634894) with RNase inhibitor

(0.17 U μl

−1

), immediately frozen on dry ice, and stored at −80 °C. For 10x

Genomics, 30,000 cells were sorted within 10 min into a tube containing

500 μl of quenching buffer. Each aliquot of 30,000 sorted cells was gently

layered on top of 200 μl of high-BSA buffer and immediately centrifuged

at 230

for 10 min in a swinging-bucket centrifuge. Supernatant was

removed and 35 μl of buffer was left behind, in which the cell pellet was

resuspended. The cell concentration was quantified, and immediately

loaded onto the 10x Genomics Chromium controller.

Genomic library preparation and sequencing

For SMART-seq library preparation, we performed the procedures

with positive and negative controls as previously described

. The

SMART-seq v4 (SSv4) Ultra Low Input RNA Kit for Sequencing (Takara

634894) was used to reverse transcribe poly(A) RNA and amplify

full-length cDNA. Samples were amplified with 18 cycles in 8-well

strips, in sets of 12–24 strips at a time. All samples proceeded through

Nextera XT DNA Library Preparation (Illumina Cat# FC-131-1096) using

Nextera XT Index Kit V2 (Illumina FC-131-2001) and a custom index set

(Integrated DNA Technologies). Nextera XT DNA Library preparation

was performed according to the manufacturer’s instructions, with a

modification to reduce the volumes of all reagents and cDNA input to

0.4× or 0.5× of the original protocol.

For 10x v3 library preparation, we used the Chromium Single Cell

3′ Reagent Kit v3 (10x Genomics 1000075). We followed the manufac-

turer’s instructions for cell capture, barcoding, reverse transcription,

cDNA amplification, and library construction. We targeted sequencing

depth of 120,000 reads per cell.

Sequencing of SMART-seq v4 libraries was performed as described

In brief, libraries were sequenced on an Illumina HiSeq2500 platform

(paired-end with read lengths of 50 bp). The 10x v3 libraries were

sequenced on Illumina NovaSeq 6000 (RRID:SCR_016387).

Pre-processing single-cell RNA-seq data

The 6,295 SMART-seq cells were processed using kallisto with the

‘kallisto pseudo’ command

. The 94,162 10x Genomics v3 cells were

pre-processed with kallisto and bustools

. Gene-count matrices were

made by using the -genecounts flag and TCC matrices were made by

omitting it. The mouse transcriptome reference used was GRCm38.p3

(mm10) RefSeq annotation gff file retrieved from NCBI on 18 January

2016 (

https://www.ncbi.nlm.nih.gov/genome/annotation_euk/all/

for consistency with the reference used by the BICCN consortium

The GTF and the GRCm38 genome fasta file (

https://github.com/

pachterlab/BYVSTZP_2020/releases/tag/biorxiv_v1

), provided by

the BICCN consortium, were used to create a transcriptome fasta file,

transcripts-to-genes map, and kallisto index using kb ref -i index.idx,

-g t2g.txt -f1 transcriptome.fa genome.fa genes.gtf. To validate the

SMART-seq isoform quantifications, we first examined the robust

ness of the quantifications to gene annotation, and found an average

correlation at the isoform level of 0.965 between the BICCN-derived

quantifications we used in our analysis

and the mouse GENCODE

M25-derived quantifications (Supplementary Fig. 3). The GENCODE

M25 mouse transcriptome reference (

https://github.com/pachterlab/

BYVSTZP_2020/blob/master/reference/gencode/fasta_link.txt

) and

the kallisto index were built using kallisto index -i index.idx gencode.

vM25.transcripts.fa.gz.

Isoform and gene-count matrices were generated for the Smart-seq2

data using the kallisto pseudo command. Cluster assignments were

associated with cells using cluster labels generated by the BICCN

consortium

. The labels are organized in a hierarchy of three levels:

classes, subclasses and clusters. The cluster labels for the cells can be

downloaded from

https://github.com/pachterlab/BYVSTZP_2020

Clustering and cell type assignment

Our analyses used SMART-seq cell labels produced by the BICCN

. In

brief, the assignment of cell types to the SMART-seq cells was based

on an extension of the cluster merging algorithm in the scrattch.hicat

package

. The clustering method generates three hierarchies of cells:

classes, subsets of cells within classes called subclasses, and subsets

of cells within subclasses called clusters.

To build a common adjacency matrix incorporating samples from

all the datasets, first a subset of datasets (reference datasets) was

selected. The 10x v2 single cell dataset from Allen (

scRNA 10x v2 A

)

and 10x v3 single nucleus dataset from Broad (

snRNA 10x v3 B

) were

used as references.

The key steps of the pipeline are as follows: (1) perform single-dataset

clustering, (2) select anchor cells for each reference dataset, (3)

select highly variable genes (HVG), (4) compute

-nearest neighbors,

(5) compute the Jaccard similarity, (6) perform Louvain clustering,

(7) merge clusters, (8) cluster iteratively, and (9) compile and merge

clusters. Further details are in ref.

Normalization and filtering of SMART-seq data

Isoform counts were first divided by the length of transcript to obtain

abundance estimates proportional to molecule copy numbers. Since

reads can come from anywhere in the transcriptome, it is likely that

longer isoforms are enriched. Therefore normalizing isoform abun

dances by length is crucial to accurately estimating mRNA copy number.

This has been shown in numerous studies on the accurate estimate of

isoform abundance

After normalizing by length, we then removed isoforms that had

fewer than one count and that were in fewer than one cell. We also

removed genes and their corresponding isoforms that had a disper

sion of less than 0.001.

To generate the cell-by-gene matrix we summed the isoforms that

correspond to the same gene. Cells with fewer than 250 gene counts and

with greater than 10% mitochondrial content were removed. Cells were

normalized to TPM by dividing the counts in each cell by the sum of the

counts for that cell, then multiplying by 1,000,000. The count matrices

were then transformed with log1p and the columns were scaled to unit

variance and zero mean. The resulting gene and isoform matrix con

tained 6,160 cells and 19,190 genes, corresponding to 69,172 isoforms.

Highly variable isoforms and genes were identified by first computing

the dispersion for each feature, and then binning all of the features into

20 bins. The dispersion for each feature was normalized by subtracting

the mean dispersion and dividing by the variance of the dispersions

within each bin. Then the top 5,000 features were retained based on the

normalized dispersion. This was computed

using scanpy.pp.highly_

variable_genes with n_top_genes = 5000, flavor=seurat, and n_bins=20.

Normalization and filtering of 10xv3 data

To generate the cell-by-gene matrix we used ‘bustools count –gene-

counts’. The cell-by-isoform matrix was generated using ‘bustools

count’ and restricting to the equivalence classes that contained only

one isoform thus generating a cell-by-isoform matrix. Both matrices

were loaded into python using kb python. Cells with less than 250

gene counts and with greater than 21.5% mitochondrial content were

removed. Cells were normalized to counts per million (CPM) by divid

ing the counts in each cell by the sum of the counts for that cell, then

multiplying by 1,000,000. The count matrices were then transformed

with log1p and the columns scaled to unit variance and zero mean.

The resulting gene matrix contained 94,162 cells and 24,575 genes.

We removed the cells that were identified as low quality by the BICCN

consortium. We identified batch effect among cells assayed on differ-

ent dates so we restricted our analysis to only the cells assayed on the

same date and selected the date with the most number of cells (Sup

plementary Fig. 2). Additionally, we performed pairwise comparison

of gene counts for each of the 4 10xv3 batches and found the Pearson

correlation to be very high for all pairs, with a mean of 0.9979 indicat-

ing limited batch effect between batches assayed on the same date.

Highly variable isoforms and genes were identified by first computing

the dispersion for each feature, and then binning all of the features into

20 bins. The dispersion for each feature was normalized by subtracting

the mean dispersion and dividing by the variance of the dispersions

within each bin. Then the top 5,000 features were retained based on the

normalized dispersion. This was computed using

scanpy.pp.highly_

variable_genes with n_top_genes = 5000, flavor=seurat, and n_bins=20.

Dimensionality reduction and visualization

To visualize the SMART-seq data with predefined cluster labels pro

duced via a joint analysis with many other data types we performed

NCA

on the full scaled log(TPM + 1) matrix using the subcluster

labels, to ten components.

-SNE

was then performed on the 10 NCA

components. NCA takes as input not just a collection of cells with their

associated abundances, but also cluster labels for those cells, and seeks

to find a projection that minimizes leave-one-out

-nearest neighbour

error

. Moreover,

-SNE applied to PCA (Supplementary Fig. 1b) scram

bles the proximity of glutamatergic and GABAergic cell types, while

-SNE of NCA appears to respect global structure of the cells. While

UMAP applied to PCA of the data (Supplementary Fig. 1c) appears to

be better than

-SNE in terms of preserving global structure, it still does

not separate out the cell types as well as NCA (Supplementary Fig. 1d).

-SNE was computed using sklearn.manifold.

-SNE was generated with

default parameters and random state 42. Similarly uniform manifold

approximation was performed on the 10 NCA components and the 50

truncated singular value decomposition (SVD)-derived components.

UMAP

was computed with the umap package with default parameters.

To ensure that NCA was not overfitting cells to their corresponding

subclasses, we randomly permuted all of the subclasses labels and reran

the NCA-to-

-SNE dimensionality-reduction method. We observed

uniform mixing of the permuted subclass labels, indicating that NCA

was not overfitting the cells to their corresponding subclasses.

Sample size

No explicit calculations were performed to determine sample size. We

analysed 6,160 mouse MOp cells assayed with SMART-seq, 280,327 cells

assayed with MERFISH, and 94,162 cells assayed with 10x Genomics

Chromium v3. We analysed both male and female mice to understand

differences in gene and isoform expression. The sample size for dif

ferential expression was set to be such that 90% of cells in a cluster

have a non-zero expression of the tested gene. The smallest cluster

size contained seven cells, with all cells having non-zero expression of

the tested genes. We computed error bars for all tests to ensure that

sample sizes were sufficient.

Batch effects

After finding a meaningful projection that appears to respect global

structure of the cells we searched for possible sources of batch effect

within the datasets. We found evidence of batch effect in the 10xv3

data by assay date (Supplementary Fig. 2a). To ensure that our findings

were not confounded by this batch effect we selected the set of cells

from only one assay date and picked the set with the largest number of

cells and the one with cells present in all clusters. We then looked at the

MERFISH data and found minimal evidence of batch effect across sam

ples based on the distribution of batch labels across clusters where the

observed fraction of cells per batch in each cluster was almost exactly

the expected fraction of cells per batch assuming uniform mixing (Sup

plementary Fig. 2b).

In further examining the single 10xv3 batch we settled on, we noted

a low correlation in one case, the L5 IT subclass. The low correlation

was also observed in a comparison between SMART-seq and MERFISH

gene expression data (Extended Data Fig. 10a), and 10xv3 and MERFISH

data (Extended Data Fig. 10b). We hypothesized that this low correla-

tion stems from a subclass-specific sex effect within the L5 IT, where

those cells differ drastically in their overall expression compared to

other subclasses. The L5 IT subclass contains seven clusters in the

SMART-seq data, four clusters in the 10xv3 data, and four clusters in

the MERFISH data.

To determine the source of the low correlation within the L5 IT

between SMART-seq, 10x and MERFISH data, we examined differential

genes between male and female cells within each subclass. We found

that cells within the L5 IT of the 10x and SMART-seq data exhibited

sex-specific segregation (Supplementary Fig. 4a, b). After perform

ing differential expression between male and female cells within all

subclasses we found that the L5 IT had the highest amount of uniquely

differential genes (Supplementary Fig. 4c) and that the SMART-seq

and 10x data had 37 common genes that were differentially expressed

(Supplementary Fig. 4d). The other subclasses, however, did not exhibit

Article

sex-based segregation. Without being able to rule out that the low

correlation for L5 IT cells across the technologies was due to confound

ing between batch and sex in the dataset, we decided to excluded the

subclass from our analyses.

Measuring number of isoforms per gene

We parsed the transcripts-to-genes map, grouping together transcripts

that had the same end site that were in the same gene. We then counted

the number of these end site sets within a gene and plotted them against

the number of isoforms within that gene.

Cross-technology cluster correlation

The correlation between 10xv3-SMART-seq, 10xv3-MERFISH, and

SMART-seq-MERFISH, was performed at the gene level and between

cells grouped by subclasses for all three pairs of technologies, and at

the isoform level and between cells grouped by cluster for only the

10xv3 and SMART-seq. For each pair we started with two raw matri

ces and restricted to the set of genes or isoforms common to the two.

Then we normalized the counts for each matrix per cell to one million,

log1p-transformed the entire matrix, and scaled the features to zero

mean and unit variance. Within each cluster, we restricted the features

to those present in at least 50% of the cells. We then found the mean cell

within the respective clusters in the two matrices and computed the

Pearson correlation between them. These methods were implemented

for Extended Data Figs. 3, 4, 10. In terms of accuracy of different tech-

nologies, we found good agreement between quantifications from

SMART-seq, 10xv3 and MERFISH (Extended Data Fig. 10).

Comparisons of different scRNA-seq technologies have tended to

focus on throughput, cost and gene-level accuracy

in a winner-takes-all

competition. Our results shed some light on the matter; it has been pre

viously shown that quantification of isoform abundance is necessary

for accurate gene-level estimates

, and we found that it matters in

practice (Supplementary Fig. 7, Supplementary Tables 11a, b, 12a, b).

This highlights the importance of proper isoform quantification of

SMART-seq data, even for gene-centric analysis

–

, when used in

conjunction with 10x Genomics and MERFISH data.

Isoform atlas

For each level of clustering, class, subclass and cluster, we performed a

-test for each gene or isoform between the cluster and its complement

on the log1p counts. To identify isoform enrichment that is masked at

a gene-level analysis, we looked for isoforms that were upregulated by

checking that the gene containing that isoform was not significantly

expressed in that cluster relative to the complement of that cluster.

Isoforms that were expressed in less than 90% of the cells in that cluster

were ignored. All

-tests used a significance level of 0.01 and all

-values

were corrected for multiple testing using Bonferroni correction.

MERFISH isoform extrapolation

First, we identified the genes that mark the specific subclass within

the MERFISH data. The

Pvalb

gene is a marker for the

Pvalb

subclass.

Then we performed differential analysis on the SMART-seq data at

the isoform level on the subclasses to identify the isoforms that mark

each of the SMART-seq subclasses. Only one of the two isoforms for

Pvalb

marked the

Pvalb

cluster. This allowed us to extrapolate the fact

that the specific

Pvalb

isoform is being detected in the MERFISH data.

Additionally, we identified all of the genes that mark the specific sub

classes in the MERFISH data through differential analysis and checked

if their underlying isoforms were also differentially expressed. We then

noted which isoforms were differentially expressed for the spatial

isoform atlas.

Weighted least-squares regression

First, we selected a representative slice of the MOp. Then we found the

outer hull of the MOp by using scipy.spatial.ConvexHull. We selected

the points that defined the upper boundary of the Mop, then performed

linear regression to fit a line to those points using sklearn.linear_model.

LinearRegression(). For each subclass in the glutamatergic class of

cells we identified the centroid of the subclass and determined the

perpendicular distance of the centroid to the MOp boundary line. We

normalized the set of distances by dividing by the centroid with the

largest distance to the boundary.

We look at the isoforms for which all of the subclasses had non-zero

expression in at least 90% of cells. For each isoform, we performed

weighted least squares regression for all of the subclasses with the

weights equal to the variance of isoform expression for each sub

class. We used the statsmodel.api.sm.WLS function. All weighted

least-squares tests used a significance level of 0.01 and all

-score

-values were corrected for multiple testing using Bonferroni correc-

tion. Monotonicity was checked for isoforms with an absolute value

slope greater than 1.5.

Grouping transcripts by start site

Using the transcripts to genes map and the filtered isoform matrix, we

grouped isoforms by their TSS into TSS classes and summed the raw

counts for the isoforms within each TSS class to create a cell-by-TSS

matrix. Differential analysis was then performed in exactly the same

way as above. For each cluster and each TSS or isoform, a

-test was per-

formed between the cells in that cluster and the cells in the complement

of that cluster. All statistical tests used a significance level of 0.01 and all

-values were corrected for multiple testing using Bonferroni correction.

Comparison of naive and EM quantification

Naive gene-count matrices were constructed from the SMART-seq data

by summing the counts corresponding to a single gene. Gene-count

matrices quantified by the expectation maximization (EM) algorithm and

normalized appropriately were made with SMART-seq by first dividing

isoform abundances by the length of their transcripts, and then summing

the abundances of isoforms by gene. Differential analysis was performed

independently on these two gene-count matrices and the resultant differ

ential genes were compared. Differential expression was then performed

on all of the genes for both the EM and naive gene quantifications. All

statistical tests used a significance level of 0.01 and all

-values were

corrected for multiple testing using Bonferroni correction.

Software versions

Software versions used were:

Anndata 0.7.1

bustools 0.39.4

awk (GNU

awk) 4.1.4

grep (GNU grep) 3.1

kallisto 0.46.1

kb_python 0.24.4

Mat-

plotlib 3.0.3

Numpy 1.18.1

Pandas 0.25.3

Scanpy 1.4.5.post3

Scipy

1.4.1

sed (GNU sed) 4.4

sklearn 0.22.1

statsmodels 0.12.1

tar (GNU

tar) 1.29

umap 0.3.10.

Reporting summary

Further information on research design is available in the Nature

Research Reporting Summary linked to this paper.

Data availability

The single-cell RNA-seq data used in this study were generated as part

of the BICCN consortium

. The 10xv3 and SMART-seq data can be

downloaded from

http://data.nemoarchive.org/biccn/lab/zeng/tran

scriptome/scell/

. The MERFISH data are available at

https://caltech.

box.com/shared/static/dzqt6ryytmjbgyai356s1z0phtnsbaol.gz

. All

cell annotations and cluster labels are available at

https://github.com/

pachterlab/BYVSTZP_2020/tree/master/reference

Code availability

The software used to generate the results and figures of the paper is

available at

https://github.com/pachterlab/BYVSTZP_2020

52.

Harris, J. A. et al. Anatomical characterization of Cre driver mice for neural circuit

mapping and manipulation.

Front. Neural Circuits

, 76 (2014).

53.

Madisen, L. et al. A robust and high-throughput Cre reporting and characterization

system for the whole mouse brain.

Nat. Neurosci.

, 133–140 (2010).

54.

Lein, E. S. et al. Genome-wide atlas of gene expression in the adult mouse brain.

Nature

445

, 168–176 (2007).

55.

Melsted, P. et al. Modular and efficient pre-processing of single-cell RNA-seq. Preprint at

https://doi.org/10.1101/673285

(2019).

56.

Jiang, H. & Wong, W. H. Statistical inferences for isoform expression in RNA-seq.

Bioinformatics

, 1026–1032 (2009).

57.

Trapnell, C. Defining cell types and states with single-cell genomics.

Genome Res

1491–1498 (2015).

58.

Wolf, F., Angerer, P. & Theis, F. SCANPY: large-scale single-cell gene expression data

analysis.

Genome Biol.

, 15 (2018).

59.

Salakhutdinov, R. & Hinton, G. Learning a nonlinear embedding by preserving class

neighbourhood structure. In

Proc. 11th International Conference on Artificial Intelligence

and Statistics

412-419 (2007).

60.

Ziegenhain, C. et al. Comparative analysis of single-cell RNA sequencing methods.

Mol. Cell

, 631-643.e4 (2017).

61.

Trapnell, C. et al. Differential analysis of gene regulation at transcript resolution with

RNA-seq.

Nat. Biotechnol.

, 46–53 (2013).

62.

Westoby, J., Herrera, M. S., Ferguson-Smith, A. C. & Hemberg, M. Simulation-based

benchmarking of isoform quantification in single-cell RNA-seq.

Genome Biol

, 191

(2018).

63.

Zhang, C., Zhang, B., Lin, L.-L. & Zhao, S. Evaluation and comparison of computational

tools for RNA-seq isoform quantification.

BMC Genomics

, 583 (2017).

Acknowledgements

We thank members of the BICCN consortium, especially the Mini-MOp

analysis group, for helpful conversations related to transcriptome analysis of the MOp. We

thank N. Volovich, V. Ntranos and P. Melsted for help with a preliminary quantification of the

SMART-seq data. Figure

was created from scratch using the tools available on Biorender.com.

Extended Data Fig. 6a was obtained from

http://atlas.brain-map.org/atlas

. This work was

funded by the NIH Brain Initiative via grant U19MH114930 to H.Z. and L.P.

Author contributions

A.S.B. and L.P. conceived the study. A.S.B. implemented the methods

and produced the results and figures. A.S.B. and L.P. analysed the data and wrote the

manuscript. Z.Y., C.v.V., K.S., B.T. and H.Z. produced the SMART-seq and 10xv3 data.

Competing interests

The authors declare no competing interests.

Additional information

Supplementary information

The online version contains supplementary material available at

https://doi.org/10.1038/s41586-021-03969-3

Correspondence and requests for materials

should be addressed to Lior Pachter.

Peer review information

Nature

thanks Chris Burge and the other, anonymous, reviewer(s) for

their contribution to the peer review of this work. Peer reviewer reports are available.

Reprints and permissions information

is available at

http://www.nature.com/reprints