D-SPIN constructs gene regulatory network models from
multiplexed scRNA-seq data revealing organizing principles
of cellular perturbation response
Jialong Jiang
1,*
, Sisi Chen
1,2,11*
, Tiffany Tsou
1,2
, Christopher S. McGinnis
3
, Tahmineh
Khazaei
1
, Qin Zhu
3
, Jong H. Park
1,2
, Inna-Marie Strazhnik
1
, John Hanna
4
, Eric D.
Chow
5,6
, David A. Sivak
7
, Zev J. Gartner
3,8,9,10
, and Matt Thomson
1,2
†
1
Division of Biology and Biological Engineering, California Institute of Technology,
Pasadena, California, 91125, USA
2
Beckman Single-Cell Profiling and Engineering Center, California Institute of Technology,
Pasadena, CA, 91125, USA
3
Department of Pharmaceutical Chemistry, University of California San Francisco, San
Francisco, CA, 94143, USA
4
Department of Pathology, Harvard Medical School and Brigham and Women’s Hospital,
Boston, MA, 02115, USA
5
Department of Biochemistry and Biophysics, University of California San Francisco, San
Francisco, CA, 94143, USA
6
Center for Advanced Technology, University of California San Francisco, San Francisco,
CA, 94143, USA
7
Department of Physics, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
8
Helen Diller Family Comprehensive Cancer Center, San Francisco, CA, 94115, USA
9
Chan Zuckerberg BioHub, University of California San Francisco, San Francisco, CA,
94143, USA
10
Center for Cellular Construction, University of California San Francisco, San Francisco,
CA, 94143, USA
11
Apertura Gene Therapy, 345 Park Ave South, New York, NY 10010
*
These authors contributed equally to this work
†
mthomson@caltech.edu
1
.
CC-BY-NC 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted May 20, 2023.
;
https://doi.org/10.1101/2023.04.19.537364
doi:
bioRxiv preprint
Abstract
Gene regulatory networks within cells modulate the expression of the genome in re-
sponse to signals and changing environmental conditions. Reconstructions of gene
regulatory networks can reveal the information processing and control principles used
by cells to maintain homeostasis and execute cell-state transitions. Here, we intro-
duce a computational framework, D-SPIN, that generates quantitative models of gene-
regulatory networks from single-cell mRNA-seq data sets collected across thousands of
distinct perturbation conditions. D-SPIN models the cell as a collection of interacting
gene-expression programs, and constructs a probabilistic model to infer regulatory in-
teractions between gene-expression programs and external perturbations. Using large
Perturb-seq and drug-response datasets, we demonstrate that D-SPIN models reveal
the organization of cellular pathways, sub-functions of macromolecular complexes, and
the logic of cellular regulation of transcription, translation, metabolism, and protein
degradation in response to gene knockdown perturbations. D-SPIN can also be applied
to dissect drug response mechanisms in heterogeneous cell populations, elucidating
how combinations of immunomodulatory drugs can induce novel cell states through
additive recruitment of gene expression programs. D-SPIN provides a computational
framework for constructing interpretable models of gene-regulatory networks to reveal
principles of cellular information processing and physiological control.
Introduction
The human genome encodes more than 30
,
000 genes, but a typical human cell expresses only
5,000 genes at any given time [1]. Gene regulatory networks within cells modulate gene expression
based upon environmental cues and the cell’s internal state [2, 3, 4, 5]. In gene regulatory networks,
transcription factors and associated regulatory proteins interact with one another and gene promot-
ers to activate or repress gene transcription [6, 7, 8]. Gene regulatory networks play a central role
in cellular decision-making processes [9, 10, 11, 12]. However, we have a limited understanding of
the global logic encoded within gene regulatory networks and the underlying principles of cellular
information processing.
Global reconstruction of gene regulatory networks in E. coli, yeast, and sea urchin embryos have
revealed features of biological information processing, including network modularity, recurring net-
work motifs, and combinatorial logic through the assembly of transcription factor complexes at gene
promoters [13, 14, 15, 16, 17, 18, 19]. However, our understanding of the information-processing
principles of gene regulatory networks is based on a limited number of network reconstructions per-
formed on model organisms over a limited range of physiological conditions. Even in E. coli a large
fraction of genes have unknown regulators [20]. In metazoans, gene network models have primarily
focused on sub-circuits involved in specific processes like T-cell activation, T-cell fate selection, and
embryonic stem cell differentiation [10, 11, 12]. The comparative analysis of gene regulatory net-
work architectures across cell types, developmental stages of an organism, and species, therefore,
remains nascent, and there are very few quantitative models of regulatory networks that can pre-
dict the response of a cell to signals, genetic perturbations, or therapeutics. Cells simultaneously
regulate tens of different processes including transcription, metabolism, protein degradation, and
differentiation. How gene regulatory networks globally modulate core cellular processes in parallel
in response to environmental conditions remains poorly understood.
Historically, gene regulatory network reconstruction and modeling have been constrained by the
number of biochemical or genetic measurements required to reconstruct networks with hundreds to
thousands of interacting protein components. Classical biochemical approaches perform bottom-up
2
.
CC-BY-NC 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted May 20, 2023.
;
https://doi.org/10.1101/2023.04.19.537364
doi:
bioRxiv preprint
gene network reconstructions through pairwise binding measurements requiring order-
M
2
biochem-
ical assays for a network with
M
components [21, 22]. Genetic approaches use a top-down strategy
that infers gene regulatory network architecture from perturbation-response experiments where
genes are knocked down or protein activity inhibited with small molecules. Perturbation of one
network component alters gene expression for subsets of genes; through the top-down association of
perturbations with responses, perturbation approaches enable the identification of gene regulators,
the organization of regulators into pathways, and the construction of integrated network models
[23, 24, 25, 26]. For global analysis of gene regulatory networks, perturbation approaches require
knocking out hundreds to thousands of genes while monitoring transcription across thousands of
genes. Historically, experimental realization of genome-wide network reconstruction through per-
turbation has, like biochemistry, been limited by experimental scale when genes were knocked out
in bulk assays.
Developments in single-cell genomics and perturbation bar-coding circumvent some of the con-
ventional limitations of perturbation-driven network reconstruction. Perturbation bar-coding ap-
proaches, including Perturb-seq, click-tags, and MULTI-seq, allow the transcriptional state of each
cell in a cell population to be measured across thousands of different genetic, signaling, and small
molecule conditions [27, 28, 29, 30]. These experimental approaches identify the perturbation de-
livered to every single cell in a population while also measuring changes in the entire transcriptome
through single-cell mRNA-seq. Given single-cell perturbation measurements allow interrogation of
a cell population’s response to thousands of genetic, signaling, or small molecule perturbations,
a core challenge is developing computational methods to integrate data from thousands of such
experiments into a gene regulatory network model that can classify perturbations, map the flow of
information across a regulatory network, and predict cellular response to novel perturbations.
Here, we introduce a mathematical modeling and network inference framework, D-SPIN
(Dimension-reduced Single-cell Perturbation Integration Network), that constructs gene regulatory
network models directly from single-cell perturbation-response data. D-SPIN models the cell as
a collection of interacting gene-expression programs, and constructs a probabilistic model, known
as Markov random fields, to infer regulatory interactions between gene-expression programs and
applied perturbations. To enable scaling of D-SPIN to large data sets with millions of single cells,
D-SPIN exploits a natural factoring within the mathematical structure of Markov random fields
inference to separate the learning problem into two steps, construction of a unified gene regulatory
network model and inference of how each perturbation interacts with the gene programs within the
unified network. To maximize the interpretability of D-SPIN models and computational efficiency,
D-SPIN operates on gene-expression programs, rather than single genes, enabling D-SPIN to derive
a coarse-grained description of a cellular regulatory network in terms of physiological units—gene-
expression programs—and their interactions with applied perturbations. In benchmarking tests, we
demonstrate that D-SPIN exhibits state-of-the-art performance on simulated gene network inference
tasks.
We apply D-SPIN to construct gene regulatory network models from experimental data sets
with thousands of perturbations and millions of cells, including two of the largest single-cell mRNA-
seq data sets in existence, a genome-wide Perturb-seq experiment [28] and a new human immune cell
drug-response experiment we performed. The integrated regulatory network models reveal global
organizing principles of cellular regulation and perturbation responses, including the organizing
principles of cellular pathways and networks of pathways used to maintain homeostasis in response
to perturbation. Applied to profile the response of human immune cells to combinations of im-
munomodulatory drugs, D-SPIN models remonstrate that drug combinations can generate novel
cell states through additive recruitment or superposition of drug-specific gene expression programs.
Broadly, D-SPIN provides a computational framework for large-scale modeling and comparative
analysis of gene regulatory networks across cell types, physiological conditions, and organisms.
3
.
CC-BY-NC 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted May 20, 2023.
;
https://doi.org/10.1101/2023.04.19.537364
doi:
bioRxiv preprint
Figure 1:
D-SPIN constructs unified gene regulatory network models from single-cell mRNA-seq data
collected across perturbation conditions.
(A) D-SPIN accepts as input single-cell mRNA-seq data from a cell
population profiled across a series of different perturbed experimental conditions such as genetic perturbations, drug
treatments, signaling conditions, or disease/healthy conditions. (B) To reduce the dimensionality of the network-
inference task, D-SPIN extracts gene-expression programs from the data using orthogonal non-negative matrix fac-
torization. (C) D-SPIN then constructs a unified regulatory network
J
, whose edges represent inferred interactions
between gene-expression programs. D-SPIN also infers interactions between the network and applied perturbations.
(D) Mathematically, D-SPIN constructs the regulatory network model by inferring a probabilistic model that encodes
the probability of each transcriptional state in terms of an interaction matrix
J
and sample-specific perturbation
response vectors
h
(
n
)
. D-SPIN scales to large data sets with millions of cells through a paralleled inference maximum-
likelihood procedure. At each step of inference, D-SPIN generates model samples for each perturbation condition
in parallel, and then uses the difference between model and data cross-correlation and mean to update the network
weights
J
and each perturbation response vector
h
(
n
)
. (E) D-SPIN yields a probabilistic model that can be interpreted
as a gene regulatory network. The D-SPIN model can be applied to (i) estimate the distribution of transcriptional
state defined by the regulatory network model under a specific condition, (ii) reveal principles of gene regulatory
network organization including modularity and sub-network architecture, and (iii) classify perturbations into groups
that reflect biological function.
4
.
CC-BY-NC 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted May 20, 2023.
;
https://doi.org/10.1101/2023.04.19.537364
doi:
bioRxiv preprint
D-SPIN infers a unified regulatory network model from single-cell perturbation
data
Here, we develop mathematical and computational strategies to integrate data across many dif-
ferent single-cell perturbation experiments into a network model that provides qualitative insights
into cellular pathways and regulatory strategies. D-SPIN solves the gene regulatory network infer-
ence problem by constructing a probabilistic model that encodes the distribution of transcriptional
states within a cell population across a set of perturbed conditions (Fig. 1, SI Text Sec. 1.1). Specif-
ically, D-SPIN builds a spin-network model or Markov random field, which can be interpreted as
an interaction network. Spin-network models were initially introduced in physics to study magnetic
materials known as spin glasses whose properties emerge through positive and negative interactions
between localized magnetic moments or spins. The models have been generalized and applied to
many different systems of interacting elements including neural networks, bird flocks, and proteins
[31, 32, 33]. In general, spin-network models consider a network of
M
interacting elements
s
i
.
The goal of the framework is to construct a model that encodes the probability
P
(
s
1
,...,s
M
) of
each potential configuration of the network of elements based upon inferred interactions between
elements, as quantified in a coupling matrix
J
.
In general, D-SPIN can be applied to model networks of individual genes; however, cells ex-
press thousands of genes, and gene-level models are both computationally expensive and also dif-
ficult to interpret. Therefore, we designed D-SPIN to construct reduced or low-dimensional spin-
network models by modeling interactions between gene programs, i.e., co-regulated groups of genes
[34, 35, 36]. Cells regulate their transcriptional state by modulating transcription factors that im-
pact sets of genes related to differentiation state, cellular pathways, and core physiological processes.
By focusing on gene programs, rather than single genes, D-SPIN generates low-dimensional models
of cellular regulation that are interpretable and can be inferred from data through a computa-
tionally efficient inference procedure. Building upon previously published work, D-SPIN identifies
co-regulated groups of genes, gene programs, through unsupervised orthogonal nonnegative ma-
trix factorization (oNMF) and phenotypic gene-program annotation (Fig. 1A, B, SI Text Sec. 1.4)
[37, 38].
Following the extraction of gene programs, D-SPIN applies a maximum likelihood statistical
inference procedure to infer a network of regulatory interactions between gene-expression programs
and between programs and applied perturbations (Fig. 1C). Regulatory interactions between gene
programs are represented in a gene-program interaction matrix
J
where each entry
J
ij
quantifies
the inferred interaction between program
i
and
j
. Interactions between the regulatory network and
perturbations are encoded in a set of perturbation response vectors
h
(
n
)
where
h
(
n
)
i
quantifies the
interaction of the perturbation
n
with gene program
i
.
Mathematically, an energy function
E
(
s
) computes the effective energy of a given transcrip-
tional state
s
by balancing the influence of the regulatory network (encoded in
J
) and the perturbation-
specific inputs (encoded in each
h
(
n
)
). The resulting probability of a given transcriptional state
s
is
P
(
s
) =
1
Z
exp[
−
E
(
s
;
J
,
h
)], where
E
(
s
;
J
,
h
(
n
)
) =
−
X
ij
J
ij
s
i
s
j
−
X
i
h
(
n
)
i
s
i
and
Z
is a normalizing constant ensuring all probabilities sum to 1 (SI Text Sec. 1.5) [39, 40]. The
regulatory network accounts for positive (
J
ij
>
0) and negative (
J
ij
<
0) interactions between gene
programs. Perturbations exert a bias on each gene program to increase the likelihood of being on
(
h
i
>
0) or off (
h
i
<
0).
Given single-cell gene-expression data collected from a series of perturbation conditions, D-
5
.
CC-BY-NC 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted May 20, 2023.
;
https://doi.org/10.1101/2023.04.19.537364
doi:
bioRxiv preprint
SPIN performs an optimization procedure through gradient ascent to estimate model parameters
J
and
h
(
n
)
given the data (Fig. 1D). For spin-network models, the optimization procedure yields
closed-form parameter-update rules
∆
J
ij
=
ε
X
n
⟨
s
i
s
j
⟩
(
n
)
Data
−⟨
s
i
s
j
⟩
(
n
)
Model
(1)
∆
h
(
n
)
i
=
ε
⟨
s
i
⟩
(
n
)
Data
−⟨
s
i
⟩
(
n
)
Model
(2)
where
⟨
s
i
s
j
⟩
(
n
)
Data
are gene-program cross-correlations computed from data for each of
n
perturbation
conditions,
⟨
s
i
s
j
⟩
(
n
)
Model
are gene-program cross-correlations computed from the model, and
ε
is a
learning rate (SI Text Sec. 1.7). At each step of model inference, the procedure samples from the
model for each perturbation condition and updates the regulatory network
J
and each perturbation
vector
h
(
n
)
. In general, spin-network model inference is a convex optimization problem, so the
optimization has a single unique solution.
The form of the spin-network model learning rules enables us to develop a highly efficient
and scalable training algorithm for D-SPIN by exploiting intrinsic parallelization. Training of
general graphical models requires estimating the gradient function on every data point separately,
which is typically computationally extensive. In contrast, in D-SPIN the gradient is pooled for each
experimental condition, and only depends on the mean
⟨
s
i
⟩
and cross-correlation
⟨
s
i
s
j
⟩
between gene
programs in each condition. Consequently, the training of the network can be deployed in parallel,
with each computational thread computing the gradient based on the mean and cross-correlation of
a single experimental condition. This approach eliminates the need to estimate gradients on every
data point separately (i.e., every cell’s transcriptional state) and minimizes data communication by
only requiring the mean and cross-correlation to be exchanged during parallelization. Therefore,
we routinely use hundreds of CPU cores for the optimization, enabling efficient network inference
over large data sets. Practically, we estimate the network using a subset of experimental conditions
with high diversity, and utilize the inferred network to identify response vectors of each condition
independently (SI Text Sec. 1.7, 1.8).
Following inference of
J
and
h
, D-SPIN models can be analyzed as network models that reveal
regulatory connections between cellular processes and identify interactions between applied pertur-
bations and cellular gene-expression programs. (Fig. 1E, ii-iii). D-SPIN models can also be used to
predict the distribution of transcriptional states in a cell population under normal and perturbed
conditions and explain how interactions within a gene network lead to a specific distribution of
cell states in an experimental population. The models provide a computational strategy that can
bridge geometric, cell clustering approaches like UMAP with mechanistic modeling approaches that
view the cell as a regulatory network with links between cellular processes, pathways, and pertur-
bations. D-SPIN can reveal how perturbations to specific nodes of a network alter the distribution
of transcriptional states in a cell population and, thus, alter the rendering of a cell population in
UMAP.
D-SPIN achieves state-of-the-art performance on gene regulatory network infer-
ence
To evaluate the accuracy of D-SPIN in performing network inference, we first applied D-SPIN
to reconstruct and analyze a model of the hematopoietic stem cell (HSC) differentiation network us-
ing the network simulation and benchmarking framework BEELINE [41]. The HSC network has two
major groups of regulatory genes, the first group includes Pu1 controlling granulocyte/monocyte
differentiation and the second group includes Gata1 controlling megakaryocyte/erythrocyte differen-
6
.
CC-BY-NC 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted May 20, 2023.
;
https://doi.org/10.1101/2023.04.19.537364
doi:
bioRxiv preprint
tiation [42]. BEELINE simulates single-cell transcription data from biologically identified regulatory
networks by generating and simulating ordinary differential equation (ODE) models with stochastic
noise. BEELINE also includes existing network inference methods for benchmark accuracy evalu-
ation. We used the BEELINE framework to generate synthetic gene-expression profiles from the
HSC network for 22 perturbation conditions encompassing knockdown/activation of each network
node individually by turning down/up the transcription rate (Fig. 2B, C, SI Text Sec. 1.9, SI Ta-
ble 1). For interpretation of the data we used the biological functions of each transcription factor to
classify the single-cell expression profiles into 7 different cell state designations including monocytes
(Mono), granulocytes (Gran), erythrocytes (Ery) and megakaryocytes (Mega) [42].
7
.
CC-BY-NC 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted May 20, 2023.
;
https://doi.org/10.1101/2023.04.19.537364
doi:
bioRxiv preprint
Figure 2:
D-SPIN achieves state-of-the-art performance on network inference benchmarking task, re-
vealing cell-fate control strategies for hematopoietic stem cell differentiation network
(A) HSC regulatory
network model [42] contains 11 transcription factors that interact through positive (blue) and negative interactions
(red) to modulate the differentiation of HSCs into different cell types. (B) Example simulated gene-expression pro-
file heatmaps generated using published computational model of the network for a series of simulated single-gene
knockdown and activation conditions. (C) The simulated network generates a distribution of cell states that can be
classified by their transcription factor expression into 7 cell states. The heatmap shows transcription factor expression
patterns that define each cell state generated across all simulated conditions, visualized as a UMAP. (D) Using the
simulated data, D-SPIN infers an integrated regulatory network model that encodes inferred interactions between
transcription factors as an interaction graph. D-SPIN also infers perturbation vectors that estimate how knockdown
or activation of transcription factors (e.g., Gata1 knockdown) impacts the regulatory network through up- or down-
regulation of transcription factors. (E) The D-SPIN model has greater accuracy than state-of-the-art methods [41], as
shown through the rendering of the network inferred from data by each method as well as a box plot of the number of
correctly inferred edges across different repeats of simulated gene-expression data. The box plot shows the accuracy
of top k edges for different values of k where the ground truth network has a total of 20 edges. (F) The D-SPIN model
predicts the cell-fate distribution generated through perturbations to the underlying network. UMAP plots of cell
state distribution generated for control and Pu1 activation/knockdown are shown for simulated data and the D-SPIN
model. Bar graphs quantify the proportion of HSC-derived cell states in each condition from simulated data and the
D-SPIN model. The data and model distribution are highly consistent across all 22 perturbations as quantified by
the cosine similarity between cell state distributions of data and model. (G) The model can be used to understand
how perturbations to distinct transcription factors can generate similar cell populations. Both Pu1 activation and
knockdown of Cebpa or Gfi1 generate an increased Mono state by directly activating Mono genes or indirectly through
the interaction between Gfi1 and EgrNab.
8
.
CC-BY-NC 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted May 20, 2023.
;
https://doi.org/10.1101/2023.04.19.537364
doi:
bioRxiv preprint
We applied D-SPIN to reconstruct a gene regulatory network model from the simulated data
and compared the accuracy of the D-SPIN model to 3 different network reconstruction methods
(PIDC, GENIE3, and GRNBoost2) that were the top performers in the BEELINE benchmarking
study [41]. We applied D-SPIN to generate a single network model encoded in the gene interac-
tion matrix
J
and generate perturbation vectors
h
(
n
)
for each perturbation condition (Fig. 2D,
SI Fig. 1A). We applied the algorithms to individual genes, rather than gene programs, given the
relatively compact size (11 nodes) of the network used in the BEELINE benchmarking evaluation.
Each method generates a set of inferred edges and provides confidence estimates for each edge.
We compared the methods by considering the top 10, top 15, and top 20 edges identified by each
method.
On the simulated network reconstruction task, D-SPIN consistently achieved higher perfor-
mance than comparable methods, primarily due to the natural way in which D-SPIN can accom-
modate perturbation data. For D-SPIN, on average 0
.
96 of the top 10 edges found by the model
were correct across inference runs, as compared to 0
.
6, 0
.
77, and 0
.
7 for PIDC, GRNBOOST2, and
GENIE3, respectively (Fig. 2E). For D-SPIN, 0
.
75 of the top 20 edges were correct, compared with
0
.
625, 0
.
645, and 0
.
665 for PIDC, GRNBOOST2, and GENIE3, respectively. As an example where
D-SPIN outperforms existing methods, the networks inferred by PIDC, GENIE3, and GRNBoost2
all identified a spurious activating interaction inside the Mega.-Ery. gene module, e.g., Scl-Fog1 or
Scl-Eklf, while D-SPIN pruned these false positive edges and revealed the central regulatory role
of Gata1. D-SPIN also pruned the spurious inhibiting interactions between the two modules, e.g.,
Pu1-Eklf or Pu1-Fli1, revealing the actual regulatory target of Pu1. When spurious edges occur
they were similar across the different inference frameworks: all 4 methods identified spurious edges
in strongly connected regions of the HSC network, struggling to separate direct versus indirect con-
nections [43]. However, for D-SPIN uniquely, spurious edges can be further reduced through the
sampling of higher-order perturbations (SI Fig. 1B, SI Table 1).
D-SPIN predicts the impact of simulated perturbations on differentiation
D-SPIN is a probabilistic or generative model that can predict the distribution of transcriptional
states in a cell population subjected to a perturbation, unlike the other models, which use either
regression or information-theoretic measures to find gene linkages. D-SPIN infers a quantitative
model of the cell population generated by the HSC network that can be sampled to generate
quantitative predictions regarding the distribution of cell types induced by each gene knockdown
or activation. Unlike existing methods, D-SPIN allowed us to ask how specific gene knockdown or
over-expression strategies could be applied to modulate the differentiation trajectory of HSC cells.
We applied D-SPIN to compute the population structure of a simulated HSC cell population
when different transcription factors are perturbed. The model distributions quantitatively agree
with data, and capture important biological impacts of the perturbations (Fig. 2F, SI Fig. 1C).
Analysis of the model allowed us to gain insight into how disruption of different underlying network
components might impact the differentiation trajectory of the cells. For example, when Pu1 is
knocked down, it results in a complete loss of Monocyte cell states and an increase in Granulocyte
cell states. The change in population structure occurs because Pu1 directly activates EgrNab and
cJun, and the three factors together induce the Monocyte cell state while EgrNab represses Gfi1,
a key regulator of the Granulocyte cell state. Knocking down Pu1 turns off the Pu1-cJun-EgrNab
group leading to loss of the Monocyte cell state as well as activation of Gfi1. Gfi1 activation leads
to the activation of Cebpa and the Granulocyte differentiation program. The Mega-Ery module
(Eklf, Fog1, Gata1, and Scl) is affected by both the reduced inhibition from Pu1 and increased
inhibition from Cebpa. Consequently, the cell states Mega and Ery remain present, albeit in reduced
proportions.
9
.
CC-BY-NC 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted May 20, 2023.
;
https://doi.org/10.1101/2023.04.19.537364
doi:
bioRxiv preprint
Both Pu1 activation and the knockdown of Cebpa and Gfi1 lead to an increased prevalence
of Monocyte cell states (Fig. 2G). Pu1 directly activates the monocyte regulatory network that
contains EgrNab and cJun. Gfi1 represses monocyte differentiation through repression of EgrNab,
and knockdown of Gfi leads to activation of EgrNab which in turn activates Pu1 and cJun, resulting
in increased Monocyte states. The knockdown of Cebpa turns off Gfi1 because of the positive
interaction between the two, leading to the activation of Monocyte states in a similar mechanism
as Gfi1 knockdown.
Thus, D-SPIN, using simulated data, can infer an accurate and generative regulatory network
model of a single-cell population by integrating information across perturbation conditions. D-
SPIN prunes the ambiguity of alternative models, and builds a coherent integrated network upon
which perturbations act. The inferred regulatory network reveals functional interactions between
regulatory genes and provides a map of how the distribution of cell states in the underlying HSC
cell population is generated through interactions between internal regulators.
10
.
CC-BY-NC 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted May 20, 2023.
;
https://doi.org/10.1101/2023.04.19.537364
doi:
bioRxiv preprint
Figure 3: Caption on the following page
11
.
CC-BY-NC 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted May 20, 2023.
;
https://doi.org/10.1101/2023.04.19.537364
doi:
bioRxiv preprint
Figure 3
D-SPIN generates a regulatory network model from genome-wide Perturb-seq data
that reveals the architecture of cellular pathways and sub-functions for pathway components.
(A) Genome-wide Perturb-seq experiment [28] delivered 9,867 gene knockdown perturbations individually
to
∼
2M human K562 cells using guide RNAs. (ii) D-SPIN model construction was applied to knockdown
perturbations that induced more than 10 differentially expressed genes and more than 20 cells passed quality
filtering. (iii) D-SPIN model was applied to generate a regulatory network model, inferring interactions be-
tween gene-expression programs and impacts of perturbations that classify guides into groups. (B) Inferred
D-SPIN model contains a core gene regulatory network composed of 30 gene-expression programs and the re-
sponse of each gene knockdown. Gene programs are annotated by a combination of function annotation tools
and manual lookup. D-SPIN classified gene knockdowns into 40 different guide classes with Leiden cluster-
ing. The interactions between gene-expression programs and perturbations are rendered as positive (blue) or
negative (red) edges, and edge thickness scales with the absolute value of edge weight. (C) D-SPIN network
model exhibits modular architecture with 7 core network modules that can be assigned physiological func-
tions: metabolism, transcription, translation, modification, protein degradation, DNA replication, and cell
division. Interactions between programs in the same modules are mostly activation and contain the majority
of strong activation interactions. (D) Guide RNA classes are each enriched for physiological functions and
cellular pathways in gene ontology and KEGG pathway databases. The bubble size represents the percentage
of enriched guides in the class. (E) UMAP representation of the response vector of each guide RNA agrees
with the global organization of major cellular pathways including transcription, translation, respiration, and
cell cycle. (F) Regulatory coherence of macromolecular complex subunits, quantified by the average pair-
wise cosine similarity between the perturbation response vector for knockdown of complex sub-components.
High-coherence complexes have subunits with similar impacts on the D-SPIN gene regulatory network. In
low-coherence complexes, knocking down different subunits leads to different impacts, suggesting distinct
functions for complex subunits. (G) Perturbation vectors for subunits of the ribosome, proteasome, and
integrator complexes. The ribosome has a coherent signature where components of the complex have similar
inferred response vectors, while the proteasome and integrator each have signatures of sub-functionalization.
In the example of proteasome, low-coherence is observed as only a subset of subunits have an impact on a
gene program (P5, P6), or even display opposite effects on a gene program (P1). D-SPIN also identified
recently annotated pathway components IPO9 and AKIRIN2 (proteasome) and C7orf26 (integrator).
Constructing a regulatory network model from genome-wide Perturb-seq data
Peturb-seq (or its derivatives) involves the knockdown of individual genes, using Cas9 and a targeting
guide RNA. Single-cell mRNA-seq is used to assess the transcriptional state of each cell and to identify the
guide RNA delivered to the cell. The technique can be applied at scale to interrogate the impact of thousands
of distinct gene knockdowns on cellular gene expression. We used D-SPIN to construct a gene regulatory
network model using data from a genome-wide Perturb-seq experiment on the erythrocyte leukemia K562
cell line [28]. K562 is an erythrocyte precursor cell line with a BCR-ABL mutation, derived from a chronic
myelogenous leukemia (CML) patient [44, 45]. In the genome-wide Perturb-seq experiment, 9
,
867 genes
were knocked down individually across 2 million single cells using CRISPRi. Prior to D-SPIN analysis, we
pre-processed the data to identify guide RNAs that have an impact on the K562-cell transcriptional state
(SI Text Sec. 1.2). Guide RNA perturbation impact varies (Fig. 3A): 70.5% of perturbations had less than
10 differentially expressed genes (DEGs), 14.7% had 10-100 DEGs, and 14.8% had more than 100 DEGs.
There is also great variation in the number of collected cells: 20.3% of perturbations had less than 100 cells
collected. We focused our analysis on 3,138 perturbations with more than 10 DEGs and 20 collected cells,
and D-SPIN was able to construct the network model in 8 hours with parallelization on 500 CPU cores.
We constructed a D-SPIN model using a set of 30 gene programs that D-SPIN extracted with its
automated oNMF procedure (Fig. 3B, SI Fig. 2A, SI Text Sec. 1.4, SI Table 2). The extracted gene programs
reflect both general cell biology as well as lineage-specific gene programs that reflect the specific biology
of K562 cells. We annotate the gene programs with a combination of bioinformatic databases including
DAVID, Enrichr, and manual lookup [46, 47]. Among the programs, D-SPIN extracted a series of programs
that are enriched for genes involved in specific core cell processes including transcription, translation, DNA
replication, mitosis, and RNA processing. D-SPIN also identified lineage-specific programs including an
erythroid-specific program with hemoglobin (HBG1, HBG2, HGZ) and glycophorin (GYPA, GYPB, GYPE)
12
.
CC-BY-NC 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted May 20, 2023.
;
https://doi.org/10.1101/2023.04.19.537364
doi:
bioRxiv preprint
genes, as well as two myeloid-associated programs with phagosome/actin-organization (ACTB, ACTG1,
ARPC3) and immune-response (LAPTM5, VASP, RAC2) genes respectively, which agrees with the multi-
lineage potential of K562 cells (Fig. 3B).
Given the set of 30 gene programs, D-SPIN generated a gene regulatory network model that provides
a wiring diagram of K562 internal regulation and the response of the K562 regulatory network to gene
knockdown perturbations (Fig. 3B, SI Fig. 2B, SI Table 2). The D-SPIN network contains gene programs,
as nodes, and interactions between gene programs as edges that represent inferred interactions. D-SPIN
identified 84 activating and 21 inhibiting interactions between gene programs, corresponding respectively to
19.3% and 4.8% of all gene-program pairs. The D-SPIN regulatory network reconstruction contains a set
of sub-networks or modules of gene programs that interact with one another through dense positive (blue)
interactions (Figs. 3B, C). We automatically decomposed the D-SPIN network into seven modules through
the Leiden community detection algorithm [48] (SI Text Sec. 1.10). Each identified module reflects a core
cellular function, including transcription, translation, protein-degradation, and cell-cycle-related functions.
The network contains a set of inhibitory interactions between gene programs that are expressed in dis-
tinct cellular states. The strongest inhibitory interaction is between gene programs expressed at different
stages of mitosis, for example, P29 Spindle microtubule inhibits both P25 DNA replication and P27 Histone.
The strong positive interactions inside the cell cycle sub-network of P25-P30 reconstruct the transcription
state distribution during cell cycle progression (SI Fig. C, D). We also find strong inhibiting interactions
between distinct stress-response programs, P23 Unfolded protein response, and P2 Lysosome. The Unfolded
protein response program consists of heat-shock protein (HSP90AB1, HSPD1, HSP90AA1) and other molecu-
lar chaperones (CCT8, CCT3, CCT5) genes, while the Lysosome program contains genes involved in lysosome
genesis (PSAP, ASAH1, PLD3). We also observe inhibition between P4 Erythroid and P6 Phagosome, where
the phagosome is an innate immune response mounted by macrophages and other myeloid cell types [49].
This indicates the presence of two, mutually exclusive differentiation paths that lead to erythroid and myeloid
cell fates.
D-SPIN reconstructs the architecture of core cellular pathways
In addition to inferring the structure of the core K562 gene regulatory network, D-SPIN also inferred
interactions between each gene program and each gene knockdown perturbation. The interactions between
perturbations and gene programs are represented by perturbation response vectors
h
(
n
)
. The entries of
h
(
n
)
represent positive or negative interactions between perturbations and individual gene programs. The
response vectors can be represented as directed edges that connect perturbations to the core regulatory
network (Fig. 3B).
Similarly to the grouping of the K562 gene regulatory network into interacting gene modules, we grouped
gene knockdown perturbations into a set of 40 perturbation classes that we refer to as G1-G40 (Guide RNA
group 1 through 40) through unsupervised Leiden clustering of the
h
(
n
)
vectors (SI Table 2). Identified
clusters reflect the pathway-level organization in the K562 cell, revealing both well-known cell biology and
more novel or cryptic organization of pathways (Figs. 3D, E). Guide clusters are enriched for specific cell-
biological functions, including DNA replication, MAPK signaling pathway, and RNA degradation (Fig. 3D).
The guide clusters also provide detailed information regarding the structure of cellular pathways. As a specific
example, the core components of the proteasome fall into a single perturbation cluster. The cluster also
contains AKIRIN2 and IPO9, two protein components recently associated with the proteasome. Similarly,
D-SPIN identifies C7orf26 as associated with the integrator subunits INTS10, INTS13 and INTS14, a key
result of the original primary Perturb-seq study [28].
Predicting macro-molecular complex component function
Macromolecular complexes like the ribosome, the proteasome, and the integrator complex play a critical
role in cell physiology and gene regulation. We used D-SPIN perturbation vectors to analyze the functions
of individual macromolecular-complex components. D-SPIN generates a perturbation vector
h
(
n
)
for each
subunit of each macromolecular complex that was knocked down in the Perturb-seq experiment. We applied
D-SPIN to gain insight into sub-functions of complex components by comparing cosine similarity between
h
(
n
)
for different components of known complexes in the CORUM database [50]. We selected complexes
that have more than two-thirds of subunits in the dataset and picked representative ones from subunits with
more than half overlapping subunits, resulting in a list of 46 core cellular complexes. For each complex in
13
.
CC-BY-NC 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted May 20, 2023.
;
https://doi.org/10.1101/2023.04.19.537364
doi:
bioRxiv preprint
the list, we defined a coherence score by computing the average cosine similarity between response vectors,
h
(
n
)
, for all pairs of complex components (SI Text Sec. 1.13). When subunits of a complex interact to
achieve a common function, disruption of individual subunits will disrupt that common function, leading
to high cosine similarity between subunit response vectors and an average cosine similarity across complex
components approaching 1. When complexes can achieve distinct functions through combinations of distinct
protein components, then such complexes have with response vectors with average cosine similarity closer to
0 which is the random expectation.
Across the 46 complexes in CORUM, our analysis revealed a wide range of coherence scores across
different complexes, ranging from 0.01 (large Drosha complex) to 0.73 (mitochondrial large ribosome) (Fig. 3F,
SI Fig. 2E). Notably, complexes that play critical roles in specific biological functions including translation
and metabolism, such as the mitochondrial ribosome, ribosome, exosome, DNA repair, and respiratory
chain, exhibited the highest coherence scores. In contrast, complexes with known regulatory roles including
the large Drosha complex, DGCR8 complex, ALL-1 supercomplex (involved in chromatin regulation), the
ATAC complex (histone acetyltransferase), and spliceosome all had lower coherence scores
< .
25. The
low-scoring complexes are involved in versatile regulatory roles in epigenetic regulation and microRNA and
mRNA processing. D-SPIN provides a list of the candidate cellular sub-processes (gene programs) that are
controlled by individual sub-units of each macromolecular regulatory complex (SI Table 2) providing a specific
framework for generating experimental hypothesis to determine how modulation of subunits composition
of these macromolecular regulatory complexes might enable the complex to achieve a range of regulatory
functions.
The ribosome behaves as a coherent complex where the knockdown of individual genes in the large or
small ribosome subunit has a similar impact on D-SPIN gene programs as illustrated by the monochromatic
edge connections between subunit perturbations and gene programs in the interaction diagram (Fig. 3G).
Perturbations to ribosome subunits activate protein digestion-related gene programs including P2 Lysosome,
P3 Endosome, and P20 P21 Proteasome. The perturbations inhibit ribosome programs, P1 Mitochondria, P19
Endoplasmic reticulum, P23 Unfolded protein response, and P24 Oxidation. Thus, perturbing any ribosome
subunits induces the cell into a stress response state with upregulated protein digestion and inhibition of
protein production and secretion processes.
The proteasome and integrator behave as incoherent complexes, and have response vectors that split
subunits into smaller sub-groups. The integrator complex subunits are partitioned into 4 sub-groups that
agree with the original Perturb-seq study (Fig. 3J): an INTS10-13-14-C7orf26 module, a cleavage module
(INTS3, INTS4, INTS8, INTS11), and two shoulder and backbone modules (INTS2, INTS5, INTS7, INTS8;
INTS1, INTS6, INTS12). Knockdown of the cleavage module activates P21 Proteasome and inhibits P22
rRNA processing, P18 Nuclear transport, P24 Oxidation, and P5 Lipid metabolism, while knockdown of
the shoulder and backbone modules activates P4 Erythroid and P7 Innate immunity and inhibited P21
Proteasome, P28 Chromosome segregation, P17 RNA splicing, and P13 Ribosome programs.
Similarly, the proteasome complex subunits are split into two groups of core particles (20S protea-
some) (PSMA6, PSMB1, PSMB3, PSMB6; PSMA2, PSMA7, PSMB2, PSMB7) and two regulatory particles
(PSMC3, PSMC3, PSMC3, PSMD2, PSMD6; PSMD4, PSMD13). D-SPIN finds that the knockdown of
proteasome subunit PSMC6 (also known as TBP7 and Rpt4) has a negative impact on respiration, and this
result is consistent with the finding that PSMC6 interacts with mitochondrial proteins including TRAP1 in
the endoplasmic reticulum to assist the folding of mitochondrial proteins [51]. Further, AKIRIN2 and IPO9
are grouped together with the 20S proteasome subunits by the response vectors. These results demonstrate
that the response vectors reveal the impacts of genetic perturbations, and provide a compact signature for
identifying functional associations between genes [52].
14
.
CC-BY-NC 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted May 20, 2023.
;
https://doi.org/10.1101/2023.04.19.537364
doi:
bioRxiv preprint
Figure 4:
D-SPIN network model identifies global regulatory response strategies mounted by K562
cells to distinct classes of gene perturbation.
(A) A coarse-grained network model is generated by grouping
gene programs into 7 identified gene modules and grouping guide RNA gene knockdown classes based on network
impact. The resulting coarse-grained model enables global analysis of cellular regulatory responses. (B) The minimal
model reveals that K562 cells mount 4 distinct classes of global regulatory responses in response to gene knockdown
perturbations. For example, when RNA polymerase is knocked down, cells generate the metabolism upregulation
response, where the cell upregulates metabolism and downregulates transcription, translation, and degradation. As
a second example, when ribosome subunits are knocked down, cells upregulate protein degradation and metabolism
while downregulating translation. (C) Gene-expression heatmap showing the upregulation (blue) or downregulation
(red) of genes in gene programs to perturbation of individual genes in a series of cellular pathways. Heatmap indicates
the change in gene expression for perturbed cell population compared to the control where a non-targeting guide RNA
is introduced to cells.
15
.
CC-BY-NC 4.0 International license
available under a
was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprint (which
this version posted May 20, 2023.
;
https://doi.org/10.1101/2023.04.19.537364
doi:
bioRxiv preprint