D-SPIN constructs gene regulatory network models from multiplexed scRNA-seq data revealing organizing principles of cellular perturbation response

D-SPIN constructs gene regulatory network models from

multiplexed scRNA-seq data revealing organizing principles

of cellular perturbation response

Jialong Jiang

1,*

, Sisi Chen

1,2,11*

, Tiffany Tsou

1,2

, Christopher S. McGinnis

, Tahmineh

Khazaei

, Qin Zhu

, Jong H. Park

1,2

, Inna-Marie Strazhnik

, John Hanna

, Eric D.

Chow

5,6

, David A. Sivak

, Zev J. Gartner

3,8,9,10

, and Matt Thomson

1,2

†

Division of Biology and Biological Engineering, California Institute of Technology,

Pasadena, California, 91125, USA

Beckman Single-Cell Profiling and Engineering Center, California Institute of Technology,

Pasadena, CA, 91125, USA

Department of Pharmaceutical Chemistry, University of California San Francisco, San

Francisco, CA, 94143, USA

Department of Pathology, Harvard Medical School and Brigham and Women’s Hospital,

Boston, MA, 02115, USA

Department of Biochemistry and Biophysics, University of California San Francisco, San

Francisco, CA, 94143, USA

Center for Advanced Technology, University of California San Francisco, San Francisco,

CA, 94143, USA

Department of Physics, Simon Fraser University, Burnaby, BC V5A 1S6, Canada

Helen Diller Family Comprehensive Cancer Center, San Francisco, CA, 94115, USA

Chan Zuckerberg BioHub, University of California San Francisco, San Francisco, CA,

94143, USA

Center for Cellular Construction, University of California San Francisco, San Francisco,

CA, 94143, USA

Apertura Gene Therapy, 345 Park Ave South, New York, NY 10010

These authors contributed equally to this work

†

mthomson@caltech.edu

CC-BY-NC 4.0 International license

available under a

was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which

this version posted May 20, 2023.

;

https://doi.org/10.1101/2023.04.19.537364

doi:

bioRxiv preprint

Abstract

Gene regulatory networks within cells modulate the expression of the genome in re-

sponse to signals and changing environmental conditions. Reconstructions of gene

regulatory networks can reveal the information processing and control principles used

by cells to maintain homeostasis and execute cell-state transitions. Here, we intro-

duce a computational framework, D-SPIN, that generates quantitative models of gene-

regulatory networks from single-cell mRNA-seq data sets collected across thousands of

distinct perturbation conditions. D-SPIN models the cell as a collection of interacting

gene-expression programs, and constructs a probabilistic model to infer regulatory in-

teractions between gene-expression programs and external perturbations. Using large

Perturb-seq and drug-response datasets, we demonstrate that D-SPIN models reveal

the organization of cellular pathways, sub-functions of macromolecular complexes, and

the logic of cellular regulation of transcription, translation, metabolism, and protein

degradation in response to gene knockdown perturbations. D-SPIN can also be applied

to dissect drug response mechanisms in heterogeneous cell populations, elucidating

how combinations of immunomodulatory drugs can induce novel cell states through

additive recruitment of gene expression programs. D-SPIN provides a computational

framework for constructing interpretable models of gene-regulatory networks to reveal

principles of cellular information processing and physiological control.

Introduction

The human genome encodes more than 30

000 genes, but a typical human cell expresses only

5,000 genes at any given time [1]. Gene regulatory networks within cells modulate gene expression

based upon environmental cues and the cell’s internal state [2, 3, 4, 5]. In gene regulatory networks,

transcription factors and associated regulatory proteins interact with one another and gene promot-

ers to activate or repress gene transcription [6, 7, 8]. Gene regulatory networks play a central role

in cellular decision-making processes [9, 10, 11, 12]. However, we have a limited understanding of

the global logic encoded within gene regulatory networks and the underlying principles of cellular

information processing.

Global reconstruction of gene regulatory networks in E. coli, yeast, and sea urchin embryos have

revealed features of biological information processing, including network modularity, recurring net-

work motifs, and combinatorial logic through the assembly of transcription factor complexes at gene

promoters [13, 14, 15, 16, 17, 18, 19]. However, our understanding of the information-processing

principles of gene regulatory networks is based on a limited number of network reconstructions per-

formed on model organisms over a limited range of physiological conditions. Even in E. coli a large

fraction of genes have unknown regulators [20]. In metazoans, gene network models have primarily

focused on sub-circuits involved in specific processes like T-cell activation, T-cell fate selection, and

embryonic stem cell differentiation [10, 11, 12]. The comparative analysis of gene regulatory net-

work architectures across cell types, developmental stages of an organism, and species, therefore,

remains nascent, and there are very few quantitative models of regulatory networks that can pre-

dict the response of a cell to signals, genetic perturbations, or therapeutics. Cells simultaneously

regulate tens of different processes including transcription, metabolism, protein degradation, and

differentiation. How gene regulatory networks globally modulate core cellular processes in parallel

in response to environmental conditions remains poorly understood.

Historically, gene regulatory network reconstruction and modeling have been constrained by the

number of biochemical or genetic measurements required to reconstruct networks with hundreds to

thousands of interacting protein components. Classical biochemical approaches perform bottom-up

CC-BY-NC 4.0 International license

available under a

was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which

this version posted May 20, 2023.

;

https://doi.org/10.1101/2023.04.19.537364

doi:

bioRxiv preprint

gene network reconstructions through pairwise binding measurements requiring order-

biochem-

ical assays for a network with

components [21, 22]. Genetic approaches use a top-down strategy

that infers gene regulatory network architecture from perturbation-response experiments where

genes are knocked down or protein activity inhibited with small molecules. Perturbation of one

network component alters gene expression for subsets of genes; through the top-down association of

perturbations with responses, perturbation approaches enable the identification of gene regulators,

the organization of regulators into pathways, and the construction of integrated network models

[23, 24, 25, 26]. For global analysis of gene regulatory networks, perturbation approaches require

knocking out hundreds to thousands of genes while monitoring transcription across thousands of

genes. Historically, experimental realization of genome-wide network reconstruction through per-

turbation has, like biochemistry, been limited by experimental scale when genes were knocked out

in bulk assays.

Developments in single-cell genomics and perturbation bar-coding circumvent some of the con-

ventional limitations of perturbation-driven network reconstruction. Perturbation bar-coding ap-

proaches, including Perturb-seq, click-tags, and MULTI-seq, allow the transcriptional state of each

cell in a cell population to be measured across thousands of different genetic, signaling, and small

molecule conditions [27, 28, 29, 30]. These experimental approaches identify the perturbation de-

livered to every single cell in a population while also measuring changes in the entire transcriptome

through single-cell mRNA-seq. Given single-cell perturbation measurements allow interrogation of

a cell population’s response to thousands of genetic, signaling, or small molecule perturbations,

a core challenge is developing computational methods to integrate data from thousands of such

experiments into a gene regulatory network model that can classify perturbations, map the flow of

information across a regulatory network, and predict cellular response to novel perturbations.

Here, we introduce a mathematical modeling and network inference framework, D-SPIN

(Dimension-reduced Single-cell Perturbation Integration Network), that constructs gene regulatory

network models directly from single-cell perturbation-response data. D-SPIN models the cell as

a collection of interacting gene-expression programs, and constructs a probabilistic model, known

as Markov random fields, to infer regulatory interactions between gene-expression programs and

applied perturbations. To enable scaling of D-SPIN to large data sets with millions of single cells,

D-SPIN exploits a natural factoring within the mathematical structure of Markov random fields

inference to separate the learning problem into two steps, construction of a unified gene regulatory

network model and inference of how each perturbation interacts with the gene programs within the

unified network. To maximize the interpretability of D-SPIN models and computational efficiency,

D-SPIN operates on gene-expression programs, rather than single genes, enabling D-SPIN to derive

a coarse-grained description of a cellular regulatory network in terms of physiological units—gene-

expression programs—and their interactions with applied perturbations. In benchmarking tests, we

demonstrate that D-SPIN exhibits state-of-the-art performance on simulated gene network inference

tasks.

We apply D-SPIN to construct gene regulatory network models from experimental data sets

with thousands of perturbations and millions of cells, including two of the largest single-cell mRNA-

seq data sets in existence, a genome-wide Perturb-seq experiment [28] and a new human immune cell

drug-response experiment we performed. The integrated regulatory network models reveal global

organizing principles of cellular regulation and perturbation responses, including the organizing

principles of cellular pathways and networks of pathways used to maintain homeostasis in response

to perturbation. Applied to profile the response of human immune cells to combinations of im-

munomodulatory drugs, D-SPIN models remonstrate that drug combinations can generate novel

cell states through additive recruitment or superposition of drug-specific gene expression programs.

Broadly, D-SPIN provides a computational framework for large-scale modeling and comparative

analysis of gene regulatory networks across cell types, physiological conditions, and organisms.

CC-BY-NC 4.0 International license

available under a

was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which

this version posted May 20, 2023.

;

https://doi.org/10.1101/2023.04.19.537364

doi:

bioRxiv preprint

Figure 1:

D-SPIN constructs unified gene regulatory network models from single-cell mRNA-seq data

collected across perturbation conditions.

(A) D-SPIN accepts as input single-cell mRNA-seq data from a cell

population profiled across a series of different perturbed experimental conditions such as genetic perturbations, drug

treatments, signaling conditions, or disease/healthy conditions. (B) To reduce the dimensionality of the network-

inference task, D-SPIN extracts gene-expression programs from the data using orthogonal non-negative matrix fac-

torization. (C) D-SPIN then constructs a unified regulatory network

, whose edges represent inferred interactions

between gene-expression programs. D-SPIN also infers interactions between the network and applied perturbations.

(D) Mathematically, D-SPIN constructs the regulatory network model by inferring a probabilistic model that encodes

the probability of each transcriptional state in terms of an interaction matrix

and sample-specific perturbation

response vectors

(

n

)

. D-SPIN scales to large data sets with millions of cells through a paralleled inference maximum-

likelihood procedure. At each step of inference, D-SPIN generates model samples for each perturbation condition

in parallel, and then uses the difference between model and data cross-correlation and mean to update the network

weights

and each perturbation response vector

(

n

)

. (E) D-SPIN yields a probabilistic model that can be interpreted

as a gene regulatory network. The D-SPIN model can be applied to (i) estimate the distribution of transcriptional

state defined by the regulatory network model under a specific condition, (ii) reveal principles of gene regulatory

network organization including modularity and sub-network architecture, and (iii) classify perturbations into groups

that reflect biological function.

CC-BY-NC 4.0 International license

available under a

was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which

this version posted May 20, 2023.

;

https://doi.org/10.1101/2023.04.19.537364

doi:

bioRxiv preprint

D-SPIN infers a unified regulatory network model from single-cell perturbation

data

Here, we develop mathematical and computational strategies to integrate data across many dif-

ferent single-cell perturbation experiments into a network model that provides qualitative insights

into cellular pathways and regulatory strategies. D-SPIN solves the gene regulatory network infer-

ence problem by constructing a probabilistic model that encodes the distribution of transcriptional

states within a cell population across a set of perturbed conditions (Fig. 1, SI Text Sec. 1.1). Specif-

ically, D-SPIN builds a spin-network model or Markov random field, which can be interpreted as

an interaction network. Spin-network models were initially introduced in physics to study magnetic

materials known as spin glasses whose properties emerge through positive and negative interactions

between localized magnetic moments or spins. The models have been generalized and applied to

many different systems of interacting elements including neural networks, bird flocks, and proteins

[31, 32, 33]. In general, spin-network models consider a network of

interacting elements

The goal of the framework is to construct a model that encodes the probability

(

,...,s

) of

each potential configuration of the network of elements based upon inferred interactions between

elements, as quantified in a coupling matrix

In general, D-SPIN can be applied to model networks of individual genes; however, cells ex-

press thousands of genes, and gene-level models are both computationally expensive and also dif-

ficult to interpret. Therefore, we designed D-SPIN to construct reduced or low-dimensional spin-

network models by modeling interactions between gene programs, i.e., co-regulated groups of genes

[34, 35, 36]. Cells regulate their transcriptional state by modulating transcription factors that im-

pact sets of genes related to differentiation state, cellular pathways, and core physiological processes.

By focusing on gene programs, rather than single genes, D-SPIN generates low-dimensional models

of cellular regulation that are interpretable and can be inferred from data through a computa-

tionally efficient inference procedure. Building upon previously published work, D-SPIN identifies

co-regulated groups of genes, gene programs, through unsupervised orthogonal nonnegative ma-

trix factorization (oNMF) and phenotypic gene-program annotation (Fig. 1A, B, SI Text Sec. 1.4)

[37, 38].

Following the extraction of gene programs, D-SPIN applies a maximum likelihood statistical

inference procedure to infer a network of regulatory interactions between gene-expression programs

and between programs and applied perturbations (Fig. 1C). Regulatory interactions between gene

programs are represented in a gene-program interaction matrix

where each entry

quantifies

the inferred interaction between program

and

. Interactions between the regulatory network and

perturbations are encoded in a set of perturbation response vectors

(

)

where

(

)

quantifies the

interaction of the perturbation

with gene program

Mathematically, an energy function

(

) computes the effective energy of a given transcrip-

tional state

by balancing the influence of the regulatory network (encoded in

) and the perturbation-

specific inputs (encoded in each

(

)

). The resulting probability of a given transcriptional state

(

) =

exp[

−

(

;

)], where

(

;

(

)

) =

−

(

)

and

is a normalizing constant ensuring all probabilities sum to 1 (SI Text Sec. 1.5) [39, 40]. The

regulatory network accounts for positive (

0) and negative (

0) interactions between gene

programs. Perturbations exert a bias on each gene program to increase the likelihood of being on

(

0) or off (

0).

Given single-cell gene-expression data collected from a series of perturbation conditions, D-

CC-BY-NC 4.0 International license

available under a

was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which

this version posted May 20, 2023.

;

https://doi.org/10.1101/2023.04.19.537364

doi:

bioRxiv preprint

SPIN performs an optimization procedure through gradient ascent to estimate model parameters

and

(

)

given the data (Fig. 1D). For spin-network models, the optimization procedure yields

closed-form parameter-update rules

∆

⟨

⟩

(

)

Data

−⟨

⟩

(

)

Model

(1)

∆

(

)

⟨

⟩

(

)

Data

−⟨

⟩

(

)

Model

(2)

where

⟨

⟩

(

)

Data

are gene-program cross-correlations computed from data for each of

perturbation

conditions,

⟨

⟩

(

)

Model

are gene-program cross-correlations computed from the model, and

is a

learning rate (SI Text Sec. 1.7). At each step of model inference, the procedure samples from the

model for each perturbation condition and updates the regulatory network

and each perturbation

vector

(

)

. In general, spin-network model inference is a convex optimization problem, so the

optimization has a single unique solution.

The form of the spin-network model learning rules enables us to develop a highly efficient

and scalable training algorithm for D-SPIN by exploiting intrinsic parallelization. Training of

general graphical models requires estimating the gradient function on every data point separately,

which is typically computationally extensive. In contrast, in D-SPIN the gradient is pooled for each

experimental condition, and only depends on the mean

⟨

⟩

and cross-correlation

⟨

⟩

between gene

programs in each condition. Consequently, the training of the network can be deployed in parallel,

with each computational thread computing the gradient based on the mean and cross-correlation of

a single experimental condition. This approach eliminates the need to estimate gradients on every

data point separately (i.e., every cell’s transcriptional state) and minimizes data communication by

only requiring the mean and cross-correlation to be exchanged during parallelization. Therefore,

we routinely use hundreds of CPU cores for the optimization, enabling efficient network inference

over large data sets. Practically, we estimate the network using a subset of experimental conditions

with high diversity, and utilize the inferred network to identify response vectors of each condition

independently (SI Text Sec. 1.7, 1.8).

Following inference of

and

, D-SPIN models can be analyzed as network models that reveal

regulatory connections between cellular processes and identify interactions between applied pertur-

bations and cellular gene-expression programs. (Fig. 1E, ii-iii). D-SPIN models can also be used to

predict the distribution of transcriptional states in a cell population under normal and perturbed

conditions and explain how interactions within a gene network lead to a specific distribution of

cell states in an experimental population. The models provide a computational strategy that can

bridge geometric, cell clustering approaches like UMAP with mechanistic modeling approaches that

view the cell as a regulatory network with links between cellular processes, pathways, and pertur-

bations. D-SPIN can reveal how perturbations to specific nodes of a network alter the distribution

of transcriptional states in a cell population and, thus, alter the rendering of a cell population in

UMAP.

D-SPIN achieves state-of-the-art performance on gene regulatory network infer-

ence

To evaluate the accuracy of D-SPIN in performing network inference, we first applied D-SPIN

to reconstruct and analyze a model of the hematopoietic stem cell (HSC) differentiation network us-

ing the network simulation and benchmarking framework BEELINE [41]. The HSC network has two

major groups of regulatory genes, the first group includes Pu1 controlling granulocyte/monocyte

differentiation and the second group includes Gata1 controlling megakaryocyte/erythrocyte differen-

CC-BY-NC 4.0 International license

available under a

was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which

this version posted May 20, 2023.

;

https://doi.org/10.1101/2023.04.19.537364

doi:

bioRxiv preprint

tiation [42]. BEELINE simulates single-cell transcription data from biologically identified regulatory

networks by generating and simulating ordinary differential equation (ODE) models with stochastic

noise. BEELINE also includes existing network inference methods for benchmark accuracy evalu-

ation. We used the BEELINE framework to generate synthetic gene-expression profiles from the

HSC network for 22 perturbation conditions encompassing knockdown/activation of each network

node individually by turning down/up the transcription rate (Fig. 2B, C, SI Text Sec. 1.9, SI Ta-

ble 1). For interpretation of the data we used the biological functions of each transcription factor to

classify the single-cell expression profiles into 7 different cell state designations including monocytes

(Mono), granulocytes (Gran), erythrocytes (Ery) and megakaryocytes (Mega) [42].

CC-BY-NC 4.0 International license

available under a

was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which

this version posted May 20, 2023.

;

https://doi.org/10.1101/2023.04.19.537364

doi:

bioRxiv preprint

Figure 2:

D-SPIN achieves state-of-the-art performance on network inference benchmarking task, re-

vealing cell-fate control strategies for hematopoietic stem cell differentiation network

(A) HSC regulatory

network model [42] contains 11 transcription factors that interact through positive (blue) and negative interactions

(red) to modulate the differentiation of HSCs into different cell types. (B) Example simulated gene-expression pro-

file heatmaps generated using published computational model of the network for a series of simulated single-gene

knockdown and activation conditions. (C) The simulated network generates a distribution of cell states that can be

classified by their transcription factor expression into 7 cell states. The heatmap shows transcription factor expression

patterns that define each cell state generated across all simulated conditions, visualized as a UMAP. (D) Using the

simulated data, D-SPIN infers an integrated regulatory network model that encodes inferred interactions between

transcription factors as an interaction graph. D-SPIN also infers perturbation vectors that estimate how knockdown

or activation of transcription factors (e.g., Gata1 knockdown) impacts the regulatory network through up- or down-

regulation of transcription factors. (E) The D-SPIN model has greater accuracy than state-of-the-art methods [41], as

shown through the rendering of the network inferred from data by each method as well as a box plot of the number of

correctly inferred edges across different repeats of simulated gene-expression data. The box plot shows the accuracy

of top k edges for different values of k where the ground truth network has a total of 20 edges. (F) The D-SPIN model

predicts the cell-fate distribution generated through perturbations to the underlying network. UMAP plots of cell

state distribution generated for control and Pu1 activation/knockdown are shown for simulated data and the D-SPIN

model. Bar graphs quantify the proportion of HSC-derived cell states in each condition from simulated data and the

D-SPIN model. The data and model distribution are highly consistent across all 22 perturbations as quantified by

the cosine similarity between cell state distributions of data and model. (G) The model can be used to understand

how perturbations to distinct transcription factors can generate similar cell populations. Both Pu1 activation and

knockdown of Cebpa or Gfi1 generate an increased Mono state by directly activating Mono genes or indirectly through

the interaction between Gfi1 and EgrNab.

CC-BY-NC 4.0 International license

available under a

was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which

this version posted May 20, 2023.

;

https://doi.org/10.1101/2023.04.19.537364

doi:

bioRxiv preprint

We applied D-SPIN to reconstruct a gene regulatory network model from the simulated data

and compared the accuracy of the D-SPIN model to 3 different network reconstruction methods

(PIDC, GENIE3, and GRNBoost2) that were the top performers in the BEELINE benchmarking

study [41]. We applied D-SPIN to generate a single network model encoded in the gene interac-

tion matrix

and generate perturbation vectors

(

)

for each perturbation condition (Fig. 2D,

SI Fig. 1A). We applied the algorithms to individual genes, rather than gene programs, given the

relatively compact size (11 nodes) of the network used in the BEELINE benchmarking evaluation.

Each method generates a set of inferred edges and provides confidence estimates for each edge.

We compared the methods by considering the top 10, top 15, and top 20 edges identified by each

method.

On the simulated network reconstruction task, D-SPIN consistently achieved higher perfor-

mance than comparable methods, primarily due to the natural way in which D-SPIN can accom-

modate perturbation data. For D-SPIN, on average 0

96 of the top 10 edges found by the model

were correct across inference runs, as compared to 0

6, 0

77, and 0

7 for PIDC, GRNBOOST2, and

GENIE3, respectively (Fig. 2E). For D-SPIN, 0

75 of the top 20 edges were correct, compared with

625, 0

645, and 0

665 for PIDC, GRNBOOST2, and GENIE3, respectively. As an example where

D-SPIN outperforms existing methods, the networks inferred by PIDC, GENIE3, and GRNBoost2

all identified a spurious activating interaction inside the Mega.-Ery. gene module, e.g., Scl-Fog1 or

Scl-Eklf, while D-SPIN pruned these false positive edges and revealed the central regulatory role

of Gata1. D-SPIN also pruned the spurious inhibiting interactions between the two modules, e.g.,

Pu1-Eklf or Pu1-Fli1, revealing the actual regulatory target of Pu1. When spurious edges occur

they were similar across the different inference frameworks: all 4 methods identified spurious edges

in strongly connected regions of the HSC network, struggling to separate direct versus indirect con-

nections [43]. However, for D-SPIN uniquely, spurious edges can be further reduced through the

sampling of higher-order perturbations (SI Fig. 1B, SI Table 1).

D-SPIN predicts the impact of simulated perturbations on differentiation

D-SPIN is a probabilistic or generative model that can predict the distribution of transcriptional

states in a cell population subjected to a perturbation, unlike the other models, which use either

regression or information-theoretic measures to find gene linkages. D-SPIN infers a quantitative

model of the cell population generated by the HSC network that can be sampled to generate

quantitative predictions regarding the distribution of cell types induced by each gene knockdown

or activation. Unlike existing methods, D-SPIN allowed us to ask how specific gene knockdown or

over-expression strategies could be applied to modulate the differentiation trajectory of HSC cells.

We applied D-SPIN to compute the population structure of a simulated HSC cell population

when different transcription factors are perturbed. The model distributions quantitatively agree

with data, and capture important biological impacts of the perturbations (Fig. 2F, SI Fig. 1C).

Analysis of the model allowed us to gain insight into how disruption of different underlying network

components might impact the differentiation trajectory of the cells. For example, when Pu1 is

knocked down, it results in a complete loss of Monocyte cell states and an increase in Granulocyte

cell states. The change in population structure occurs because Pu1 directly activates EgrNab and

cJun, and the three factors together induce the Monocyte cell state while EgrNab represses Gfi1,

a key regulator of the Granulocyte cell state. Knocking down Pu1 turns off the Pu1-cJun-EgrNab

group leading to loss of the Monocyte cell state as well as activation of Gfi1. Gfi1 activation leads

to the activation of Cebpa and the Granulocyte differentiation program. The Mega-Ery module

(Eklf, Fog1, Gata1, and Scl) is affected by both the reduced inhibition from Pu1 and increased

inhibition from Cebpa. Consequently, the cell states Mega and Ery remain present, albeit in reduced

proportions.

CC-BY-NC 4.0 International license

available under a

was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which

this version posted May 20, 2023.

;

https://doi.org/10.1101/2023.04.19.537364

doi:

bioRxiv preprint

Both Pu1 activation and the knockdown of Cebpa and Gfi1 lead to an increased prevalence

of Monocyte cell states (Fig. 2G). Pu1 directly activates the monocyte regulatory network that

contains EgrNab and cJun. Gfi1 represses monocyte differentiation through repression of EgrNab,

and knockdown of Gfi leads to activation of EgrNab which in turn activates Pu1 and cJun, resulting

in increased Monocyte states. The knockdown of Cebpa turns off Gfi1 because of the positive

interaction between the two, leading to the activation of Monocyte states in a similar mechanism

as Gfi1 knockdown.

Thus, D-SPIN, using simulated data, can infer an accurate and generative regulatory network

model of a single-cell population by integrating information across perturbation conditions. D-

SPIN prunes the ambiguity of alternative models, and builds a coherent integrated network upon

which perturbations act. The inferred regulatory network reveals functional interactions between

regulatory genes and provides a map of how the distribution of cell states in the underlying HSC

cell population is generated through interactions between internal regulators.

CC-BY-NC 4.0 International license

available under a

was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which

this version posted May 20, 2023.

;

https://doi.org/10.1101/2023.04.19.537364

doi:

bioRxiv preprint

Figure 3: Caption on the following page

CC-BY-NC 4.0 International license

available under a

was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which

this version posted May 20, 2023.

;

https://doi.org/10.1101/2023.04.19.537364

doi:

bioRxiv preprint

Figure 3

D-SPIN generates a regulatory network model from genome-wide Perturb-seq data

that reveals the architecture of cellular pathways and sub-functions for pathway components.

(A) Genome-wide Perturb-seq experiment [28] delivered 9,867 gene knockdown perturbations individually

∼

2M human K562 cells using guide RNAs. (ii) D-SPIN model construction was applied to knockdown

perturbations that induced more than 10 differentially expressed genes and more than 20 cells passed quality

filtering. (iii) D-SPIN model was applied to generate a regulatory network model, inferring interactions be-

tween gene-expression programs and impacts of perturbations that classify guides into groups. (B) Inferred

D-SPIN model contains a core gene regulatory network composed of 30 gene-expression programs and the re-

sponse of each gene knockdown. Gene programs are annotated by a combination of function annotation tools

and manual lookup. D-SPIN classified gene knockdowns into 40 different guide classes with Leiden cluster-

ing. The interactions between gene-expression programs and perturbations are rendered as positive (blue) or

negative (red) edges, and edge thickness scales with the absolute value of edge weight. (C) D-SPIN network

model exhibits modular architecture with 7 core network modules that can be assigned physiological func-

tions: metabolism, transcription, translation, modification, protein degradation, DNA replication, and cell

division. Interactions between programs in the same modules are mostly activation and contain the majority

of strong activation interactions. (D) Guide RNA classes are each enriched for physiological functions and

cellular pathways in gene ontology and KEGG pathway databases. The bubble size represents the percentage

of enriched guides in the class. (E) UMAP representation of the response vector of each guide RNA agrees

with the global organization of major cellular pathways including transcription, translation, respiration, and

cell cycle. (F) Regulatory coherence of macromolecular complex subunits, quantified by the average pair-

wise cosine similarity between the perturbation response vector for knockdown of complex sub-components.

High-coherence complexes have subunits with similar impacts on the D-SPIN gene regulatory network. In

low-coherence complexes, knocking down different subunits leads to different impacts, suggesting distinct

functions for complex subunits. (G) Perturbation vectors for subunits of the ribosome, proteasome, and

integrator complexes. The ribosome has a coherent signature where components of the complex have similar

inferred response vectors, while the proteasome and integrator each have signatures of sub-functionalization.

In the example of proteasome, low-coherence is observed as only a subset of subunits have an impact on a

gene program (P5, P6), or even display opposite effects on a gene program (P1). D-SPIN also identified

recently annotated pathway components IPO9 and AKIRIN2 (proteasome) and C7orf26 (integrator).

Constructing a regulatory network model from genome-wide Perturb-seq data

Peturb-seq (or its derivatives) involves the knockdown of individual genes, using Cas9 and a targeting

guide RNA. Single-cell mRNA-seq is used to assess the transcriptional state of each cell and to identify the

guide RNA delivered to the cell. The technique can be applied at scale to interrogate the impact of thousands

of distinct gene knockdowns on cellular gene expression. We used D-SPIN to construct a gene regulatory

network model using data from a genome-wide Perturb-seq experiment on the erythrocyte leukemia K562

cell line [28]. K562 is an erythrocyte precursor cell line with a BCR-ABL mutation, derived from a chronic

myelogenous leukemia (CML) patient [44, 45]. In the genome-wide Perturb-seq experiment, 9

867 genes

were knocked down individually across 2 million single cells using CRISPRi. Prior to D-SPIN analysis, we

pre-processed the data to identify guide RNAs that have an impact on the K562-cell transcriptional state

(SI Text Sec. 1.2). Guide RNA perturbation impact varies (Fig. 3A): 70.5% of perturbations had less than

10 differentially expressed genes (DEGs), 14.7% had 10-100 DEGs, and 14.8% had more than 100 DEGs.

There is also great variation in the number of collected cells: 20.3% of perturbations had less than 100 cells

collected. We focused our analysis on 3,138 perturbations with more than 10 DEGs and 20 collected cells,

and D-SPIN was able to construct the network model in 8 hours with parallelization on 500 CPU cores.

We constructed a D-SPIN model using a set of 30 gene programs that D-SPIN extracted with its

automated oNMF procedure (Fig. 3B, SI Fig. 2A, SI Text Sec. 1.4, SI Table 2). The extracted gene programs

reflect both general cell biology as well as lineage-specific gene programs that reflect the specific biology

of K562 cells. We annotate the gene programs with a combination of bioinformatic databases including

DAVID, Enrichr, and manual lookup [46, 47]. Among the programs, D-SPIN extracted a series of programs

that are enriched for genes involved in specific core cell processes including transcription, translation, DNA

replication, mitosis, and RNA processing. D-SPIN also identified lineage-specific programs including an

erythroid-specific program with hemoglobin (HBG1, HBG2, HGZ) and glycophorin (GYPA, GYPB, GYPE)

CC-BY-NC 4.0 International license

available under a

was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which

this version posted May 20, 2023.

;

https://doi.org/10.1101/2023.04.19.537364

doi:

bioRxiv preprint

genes, as well as two myeloid-associated programs with phagosome/actin-organization (ACTB, ACTG1,

ARPC3) and immune-response (LAPTM5, VASP, RAC2) genes respectively, which agrees with the multi-

lineage potential of K562 cells (Fig. 3B).

Given the set of 30 gene programs, D-SPIN generated a gene regulatory network model that provides

a wiring diagram of K562 internal regulation and the response of the K562 regulatory network to gene

knockdown perturbations (Fig. 3B, SI Fig. 2B, SI Table 2). The D-SPIN network contains gene programs,

as nodes, and interactions between gene programs as edges that represent inferred interactions. D-SPIN

identified 84 activating and 21 inhibiting interactions between gene programs, corresponding respectively to

19.3% and 4.8% of all gene-program pairs. The D-SPIN regulatory network reconstruction contains a set

of sub-networks or modules of gene programs that interact with one another through dense positive (blue)

interactions (Figs. 3B, C). We automatically decomposed the D-SPIN network into seven modules through

the Leiden community detection algorithm [48] (SI Text Sec. 1.10). Each identified module reflects a core

cellular function, including transcription, translation, protein-degradation, and cell-cycle-related functions.

The network contains a set of inhibitory interactions between gene programs that are expressed in dis-

tinct cellular states. The strongest inhibitory interaction is between gene programs expressed at different

stages of mitosis, for example, P29 Spindle microtubule inhibits both P25 DNA replication and P27 Histone.

The strong positive interactions inside the cell cycle sub-network of P25-P30 reconstruct the transcription

state distribution during cell cycle progression (SI Fig. C, D). We also find strong inhibiting interactions

between distinct stress-response programs, P23 Unfolded protein response, and P2 Lysosome. The Unfolded

protein response program consists of heat-shock protein (HSP90AB1, HSPD1, HSP90AA1) and other molecu-

lar chaperones (CCT8, CCT3, CCT5) genes, while the Lysosome program contains genes involved in lysosome

genesis (PSAP, ASAH1, PLD3). We also observe inhibition between P4 Erythroid and P6 Phagosome, where

the phagosome is an innate immune response mounted by macrophages and other myeloid cell types [49].

This indicates the presence of two, mutually exclusive differentiation paths that lead to erythroid and myeloid

cell fates.

D-SPIN reconstructs the architecture of core cellular pathways

In addition to inferring the structure of the core K562 gene regulatory network, D-SPIN also inferred

interactions between each gene program and each gene knockdown perturbation. The interactions between

perturbations and gene programs are represented by perturbation response vectors

(

n

)

. The entries of

(

n

)

represent positive or negative interactions between perturbations and individual gene programs. The

response vectors can be represented as directed edges that connect perturbations to the core regulatory

network (Fig. 3B).

Similarly to the grouping of the K562 gene regulatory network into interacting gene modules, we grouped

gene knockdown perturbations into a set of 40 perturbation classes that we refer to as G1-G40 (Guide RNA

group 1 through 40) through unsupervised Leiden clustering of the

(

n

)

vectors (SI Table 2). Identified

clusters reflect the pathway-level organization in the K562 cell, revealing both well-known cell biology and

more novel or cryptic organization of pathways (Figs. 3D, E). Guide clusters are enriched for specific cell-

biological functions, including DNA replication, MAPK signaling pathway, and RNA degradation (Fig. 3D).

The guide clusters also provide detailed information regarding the structure of cellular pathways. As a specific

example, the core components of the proteasome fall into a single perturbation cluster. The cluster also

contains AKIRIN2 and IPO9, two protein components recently associated with the proteasome. Similarly,

D-SPIN identifies C7orf26 as associated with the integrator subunits INTS10, INTS13 and INTS14, a key

result of the original primary Perturb-seq study [28].

Predicting macro-molecular complex component function

Macromolecular complexes like the ribosome, the proteasome, and the integrator complex play a critical

role in cell physiology and gene regulation. We used D-SPIN perturbation vectors to analyze the functions

of individual macromolecular-complex components. D-SPIN generates a perturbation vector

(

n

)

for each

subunit of each macromolecular complex that was knocked down in the Perturb-seq experiment. We applied

D-SPIN to gain insight into sub-functions of complex components by comparing cosine similarity between

(

n

)

for different components of known complexes in the CORUM database [50]. We selected complexes

that have more than two-thirds of subunits in the dataset and picked representative ones from subunits with

more than half overlapping subunits, resulting in a list of 46 core cellular complexes. For each complex in

CC-BY-NC 4.0 International license

available under a

was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which

this version posted May 20, 2023.

;

https://doi.org/10.1101/2023.04.19.537364

doi:

bioRxiv preprint

the list, we defined a coherence score by computing the average cosine similarity between response vectors,

(

n

)

, for all pairs of complex components (SI Text Sec. 1.13). When subunits of a complex interact to

achieve a common function, disruption of individual subunits will disrupt that common function, leading

to high cosine similarity between subunit response vectors and an average cosine similarity across complex

components approaching 1. When complexes can achieve distinct functions through combinations of distinct

protein components, then such complexes have with response vectors with average cosine similarity closer to

0 which is the random expectation.

Across the 46 complexes in CORUM, our analysis revealed a wide range of coherence scores across

different complexes, ranging from 0.01 (large Drosha complex) to 0.73 (mitochondrial large ribosome) (Fig. 3F,

SI Fig. 2E). Notably, complexes that play critical roles in specific biological functions including translation

and metabolism, such as the mitochondrial ribosome, ribosome, exosome, DNA repair, and respiratory

chain, exhibited the highest coherence scores. In contrast, complexes with known regulatory roles including

the large Drosha complex, DGCR8 complex, ALL-1 supercomplex (involved in chromatin regulation), the

ATAC complex (histone acetyltransferase), and spliceosome all had lower coherence scores

< .

25. The

low-scoring complexes are involved in versatile regulatory roles in epigenetic regulation and microRNA and

mRNA processing. D-SPIN provides a list of the candidate cellular sub-processes (gene programs) that are

controlled by individual sub-units of each macromolecular regulatory complex (SI Table 2) providing a specific

framework for generating experimental hypothesis to determine how modulation of subunits composition

of these macromolecular regulatory complexes might enable the complex to achieve a range of regulatory

functions.

The ribosome behaves as a coherent complex where the knockdown of individual genes in the large or

small ribosome subunit has a similar impact on D-SPIN gene programs as illustrated by the monochromatic

edge connections between subunit perturbations and gene programs in the interaction diagram (Fig. 3G).

Perturbations to ribosome subunits activate protein digestion-related gene programs including P2 Lysosome,

P3 Endosome, and P20 P21 Proteasome. The perturbations inhibit ribosome programs, P1 Mitochondria, P19

Endoplasmic reticulum, P23 Unfolded protein response, and P24 Oxidation. Thus, perturbing any ribosome

subunits induces the cell into a stress response state with upregulated protein digestion and inhibition of

protein production and secretion processes.

The proteasome and integrator behave as incoherent complexes, and have response vectors that split

subunits into smaller sub-groups. The integrator complex subunits are partitioned into 4 sub-groups that

agree with the original Perturb-seq study (Fig. 3J): an INTS10-13-14-C7orf26 module, a cleavage module

(INTS3, INTS4, INTS8, INTS11), and two shoulder and backbone modules (INTS2, INTS5, INTS7, INTS8;

INTS1, INTS6, INTS12). Knockdown of the cleavage module activates P21 Proteasome and inhibits P22

rRNA processing, P18 Nuclear transport, P24 Oxidation, and P5 Lipid metabolism, while knockdown of

the shoulder and backbone modules activates P4 Erythroid and P7 Innate immunity and inhibited P21

Proteasome, P28 Chromosome segregation, P17 RNA splicing, and P13 Ribosome programs.

Similarly, the proteasome complex subunits are split into two groups of core particles (20S protea-

some) (PSMA6, PSMB1, PSMB3, PSMB6; PSMA2, PSMA7, PSMB2, PSMB7) and two regulatory particles

(PSMC3, PSMC3, PSMC3, PSMD2, PSMD6; PSMD4, PSMD13). D-SPIN finds that the knockdown of

proteasome subunit PSMC6 (also known as TBP7 and Rpt4) has a negative impact on respiration, and this

result is consistent with the finding that PSMC6 interacts with mitochondrial proteins including TRAP1 in

the endoplasmic reticulum to assist the folding of mitochondrial proteins [51]. Further, AKIRIN2 and IPO9

are grouped together with the 20S proteasome subunits by the response vectors. These results demonstrate

that the response vectors reveal the impacts of genetic perturbations, and provide a compact signature for

identifying functional associations between genes [52].

CC-BY-NC 4.0 International license

available under a

was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which

this version posted May 20, 2023.

;

https://doi.org/10.1101/2023.04.19.537364

doi:

bioRxiv preprint

Figure 4:

D-SPIN network model identifies global regulatory response strategies mounted by K562

cells to distinct classes of gene perturbation.

(A) A coarse-grained network model is generated by grouping

gene programs into 7 identified gene modules and grouping guide RNA gene knockdown classes based on network

impact. The resulting coarse-grained model enables global analysis of cellular regulatory responses. (B) The minimal

model reveals that K562 cells mount 4 distinct classes of global regulatory responses in response to gene knockdown

perturbations. For example, when RNA polymerase is knocked down, cells generate the metabolism upregulation

response, where the cell upregulates metabolism and downregulates transcription, translation, and degradation. As

a second example, when ribosome subunits are knocked down, cells upregulate protein degradation and metabolism

while downregulating translation. (C) Gene-expression heatmap showing the upregulation (blue) or downregulation

(red) of genes in gene programs to perturbation of individual genes in a series of cellular pathways. Heatmap indicates

the change in gene expression for perturbed cell population compared to the control where a non-targeting guide RNA

is introduced to cells.

CC-BY-NC 4.0 International license

available under a

was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

The copyright holder for this preprint (which

this version posted May 20, 2023.

;

https://doi.org/10.1101/2023.04.19.537364

doi:

bioRxiv preprint