Characterization of human transcription
factor function and patterns of gene regulation
in HepG2 cells
Belle A. Moyers,1 E. Christopher Partridge,1 Mark Mackiewicz,1 Michael J. Betti,2 Roshan Darji,1
Sarah K. Meadows,1 Kimberly M. Newberry,1 Laurel A. Brandsmeier,1 Barbara J. Wold,3
Eric M. Mendenhall,1 and Richard M. Myers1
1 HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA;
2 Vanderbilt University Medical Center, Nashville, Tennessee 37232, USA;
3 Merkin Institute for Translational Research, California Institute of Technology, Pasadena, California 91125, USA
Transcription factors (TFs) are trans-acting proteins that bind cis-regulatory elements (CREs) in DNA to control gene
expression. Here, we analyzed the genomic localization profiles of 529 sequence-specific TFs and 151 cofactors and
chromatin regulators in the human cancer cell line HepG2, for a total of 680 broadly termed DNA-associated proteins
(DAPs). We used this deep collection to model each TF's impact on gene expression, and identified a cohort of 26
candidate transcriptional repressors. We examine high occupancy target (HOT) sites in the context of three-dimensional
genome organization and show biased motif placement in distal-promoter connections involving HOT sites. We also found
a substantial number of closed chromatin regions with multiple DAPs bound, and explored their properties, finding that
a MAFF/MAFK TF pair correlates with transcriptional repression. Altogether, these analyses provide novel insights into
the regulatory logic of the human cell line HepG2 genome and show the usefulness of large genomic analyses for
elucidation of individual TF functions.
[Supplemental material is available for this article.]
Gene expression is regulated and modulated by the association, either
direct or indirect, of various classes of proteins with DNA, including
RNA polymerase and transcription-associated proteins,
histone modifiers, and a broad suite of transcription factors (TFs)
and associated cofactors. Together, these DNA-associated proteins
(DAPs) are encoded by ~10% of all protein-coding genes in the
human genome (Vaquerizas et al. 2009; Lambert et al. 2018). DAPs
are known to associate with DNA either through recognition of
discrete small sequence motifs, by interactions with degenerate se-
quences having little complexity, or by cofactor recruitment. The
most common assay for genome-wide identification of genomic
binding or association sites for DAPs is chromatin immunoprecip-
itation followed by high-throughput sequencing (ChIP-seq),
which provides a statistically identified snapshot of regions re-
ferred to as peaks (Barski et al. 2007; Johnson et al. 2007;
Robertson et al. 2007; Kharchenko et al. 2008; Zhang et al. 2008;
Savic et al. 2015; Meadows et al. 2020). For those TFs with DNA se-
quence specificity, associations occur with enough frequency to be
detectable as a consistent DNA sequence motif through use of ge-
nome-wide binding data (Bailey et al. 2015) or in vitro molecular
binding assays (Chai et al. 2011).
The Encyclopedia of DNA Elements (ENCODE) Consortium
has completed and released 3194 ChIP-seq data sets for 1139
DAPs using both traditional antibody ChIP-seq and epitope-tagged
ChIP-seq methods (The ENCODE Project Consortium 2012; The
ENCODE Project Consortium et al. 2020; Partridge et al. 2020).
The human liver cancer-derived cell line HepG2 currently has the
largest number (n=814) of ENCODE-released ChIP-seq data sets,
some of which are repetitions of different ChIP-seq experiments
with the same target for a total of 680 unique DAP targets. With
this wealth of occupancy profiles for a single cell type, the HepG2
ChIP-seq data allow for the assessment of biological roles of DAPs
in a broad genomic context, including analyses of similarity and
coassociation frequency, association with regulatory region types,
and impact on gene expression. These data sets provide the oppor-
tunity to explore the functional impact of individual TFs and asso-
ciated proteins on gene expression and genome organization.
Here, we present an analysis of ChIP-seq data in HepG2 cells
that greatly expands on our previous work with this cell type
(Partridge et al. 2020), including 492 ChIP-seq data sets not ana-
lyzed in that prior work, as well as a lentiviral massively parallel re-
porter assay (lentiMPRA, or MPRA) to functionally test elements.
We provide an overview of this resource and highlight novel findings
with TFs and trans-regulatory proteins on cis-regulatory
sequences, including patterns of TF genomic localization in the
context of the three-dimensional (3D) organization of high
occupancy target (HOT) sites and the association of TFs with
closed chromatin regions that influence gene repression.
Corresponding authors: rmyers@hudsonalpha.org, emendenhall@hudsonalpha.org
Article published online before print. Article, supplemental material, and publication date are at
https://www.genome.org/cgi/doi/10.1101/gr.278205.123. Freely available online through the Genome Research
Open Access option.
© 2023 Moyers et al. This article, published in Genome Research, is available under a Creative Commons License
(Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.
Genome Research 33:1879–1892. Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/23; www.genome.org
Results
It is estimated that there are 1639 sequence-specific TFs encoded in
the human genome (Lambert et al. 2018), only a subset of which
are expressed in any given cell type. To gain a deeper
understanding of gene regulatory mechanisms, we analyzed TF
binding data, much of which we generated, in HepG2 cells and
leveraged the large number of TFs assayed in that cell line as the most
comprehensive resource available. The expression level of any in-
dividual TF is not necessarily correlated with its biological signifi-
cance; proteins can be expressed at a very low level and still
perform important biological functions in a given context.
Pragmatically, however, we have observed diminishing rates of
success for ChIP-seq and epitope-tagged ChIP-seq data sets as the
expression level of those TFs decreases (Meadows et al. 2020).
Therefore, we identified all TFs in HepG2 cells that are expressed
at levels of at least two transcripts per million (TPM), as measured
by RNA-seq (ENCSR181ZGR). There are 895 TFs expressed at this
level in HepG2 cells. We compiled the existing data sets produced
from our laboratory and others from the ENCODE portal for 479
(53.5%) of these 895 TFs (see Methods) (Supplemental Table 1).
In addition to these 479 TFs, we also analyzed data for 50 TFs ex-
pressed at fewer than two TPM for which we were able to generate
high-quality ChIP-seq data despite their low expression and for
151 non-TF DAPs and nine histone marks, for a total of 680 unique
ChIP-seq DAP targets in HepG2 and nine histone modifications.
This expanded catalog of DAPs and associated gene regulatory
data sets provides a rich resource to characterize and understand
the functional impact of DAP binding on gene regulation.
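The selection described above (keep TFs at or above two TPM, then ask how many of them have ChIP-seq data) can be sketched as follows. This is only an illustrative outline, not the authors' pipeline: the function name `summarize_tf_coverage`, the dictionary input format, and the toy TF names are assumptions introduced here.

```python
def summarize_tf_coverage(tf_tpm, assayed, min_tpm=2.0):
    """Summarize ChIP-seq coverage of expressed TFs.

    tf_tpm  : dict mapping TF name -> expression in TPM (hypothetical input format)
    assayed : set of TF names that have a usable ChIP-seq data set
    min_tpm : expression cutoff; the text uses two TPM

    Returns (n_expressed, n_expressed_assayed, fraction_assayed, n_low_assayed).
    """
    # TFs expressed at or above the cutoff
    expressed = {tf for tf, tpm in tf_tpm.items() if tpm >= min_tpm}
    # Expressed TFs that also have ChIP-seq data
    expressed_assayed = expressed & assayed
    # TFs below the cutoff that were nevertheless assayed successfully
    low_assayed = {tf for tf in assayed if tf in tf_tpm and tf_tpm[tf] < min_tpm}
    fraction = len(expressed_assayed) / len(expressed) if expressed else 0.0
    return len(expressed), len(expressed_assayed), fraction, len(low_assayed)


# Toy data with made-up TF names and TPM values
tf_tpm = {"TF_A": 35.0, "TF_B": 2.0, "TF_C": 0.4, "TF_D": 12.1}
assayed = {"TF_A", "TF_C"}
summary = summarize_tf_coverage(tf_tpm, assayed)
print(summary)

# The percentage reported in the text follows from the stated counts
reported_percent = round(479 / 895 * 100, 1)
print(reported_percent)
```

With the real inputs, the first two values would correspond to the 895 expressed TFs and the 479 with data, and the last value to the 50 low-expression TFs that were still assayed.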
DAP associations at cCREs reveal the interaction of DAP function
and regulatory context
TFs impact expression by associating with or binding to DNA, specifically at cis-regulatory elements (CREs). We
therefore sought to determine which cis-regulatory elements are bound by TFs and the patterns of activity that those
bound regions display. We examined cis-regulatory elements for the presence of at least one DAP peak. To do this, we
used the Registry of Candidate cis-Regulatory Elements (V4 cCREs) derived from the ENCODE data (The ENCODE Project
Consortium et al. 2020; JE Moore, HE Pratt, K Fan, et al., in prep.). These candidate cis-regulatory elements (cCREs)
represent genomic regulatory elements across multiple human cell types and are derived from chromatin accessibility
assays (DNase-seq and ATAC-seq), histone modifications, and DAP-binding data. To filter for cCREs that are relevant in
HepG2, we overlapped with HepG2 ATAC-seq data, generating a set of 318,567 HepG2 cCREs. Of these, 84.2% have at least
one of the assayed DAPs associated, and those cCREs with no DAPs associated in HepG2 are largely distal enhancer-like
sequences (Fig. 1A; Supplemental Fig. 1; Supplemental Table 2). We compared this pattern of binding with
dinucleotide-matched control sequences and found that these regions are significantly more bound than controls
(Supplemental Fig. 2; Supplemental Table 2). As the number of associated DAPs increases at cCREs, the proportion of
cCREs defined as "promoter-like" increases (Supplemental Figs. 3, 4; Supplemental Table 3). We therefore conclude that
the coverage of cCREs with at least some subset of their associated DAPs is approaching completeness.
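The bound/unbound tally in this paragraph amounts to intersecting cCRE intervals with DAP peak intervals and counting the distinct DAPs per cCRE. The sketch below illustrates that idea on toy data. It makes several assumptions not stated in the text: BED-style half-open coordinates, an overlap of at least 1 bp counting as "bound", and the hypothetical names `count_daps_per_ccre` and `fraction_bound`. The nested loop is for clarity only; at genome scale an interval index or a tool such as bedtools would be the practical choice.

```python
from collections import defaultdict


def count_daps_per_ccre(ccres, peaks):
    """Map each cCRE to the set of DAPs with an overlapping peak.

    ccres : list of (chrom, start, end, ccre_id), half-open coordinates (assumed)
    peaks : list of (chrom, start, end, dap_name), half-open coordinates (assumed)
    """
    # Group peaks by chromosome so each cCRE is only compared with peaks on its own chromosome
    peaks_by_chrom = defaultdict(list)
    for chrom, start, end, dap in peaks:
        peaks_by_chrom[chrom].append((start, end, dap))

    bound = {}
    for chrom, c_start, c_end, ccre_id in ccres:
        daps = set()
        for p_start, p_end, dap in peaks_by_chrom.get(chrom, []):
            # Two half-open intervals overlap when each starts before the other ends
            if p_start < c_end and c_start < p_end:
                daps.add(dap)
        bound[ccre_id] = daps
    return bound


def fraction_bound(bound):
    """Fraction of cCREs with at least one associated DAP."""
    if not bound:
        return 0.0
    return sum(1 for daps in bound.values() if daps) / len(bound)


# Toy intervals with made-up identifiers
ccres = [("chr1", 100, 300, "cCRE_1"),
         ("chr1", 1000, 1200, "cCRE_2"),
         ("chr2", 50, 250, "cCRE_3")]
peaks = [("chr1", 250, 400, "DAP_X"),
         ("chr1", 280, 350, "DAP_Y"),
         ("chr2", 250, 300, "DAP_X")]  # starts exactly where cCRE_3 ends, so no overlap

bound = count_daps_per_ccre(ccres, peaks)
print({ccre: len(daps) for ccre, daps in bound.items()})
print(fraction_bound(bound))
```

Binning the per-cCRE DAP counts and tabulating cCRE class within each bin would give the kind of summary the paragraph describes for promoter-like elements.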
We also found 50,446 (15.8%) annotated cCREs overlapping
with an ATAC-seq peak in HepG2 cells but with no DAP peaks in
our data set. Given their predicted regulatory activity and their
open chromatin state in this cell type, we would expect that they
should be bound by some DAP. At least three explanations are pos-
sible: (1) These cCREs are unbound by any DAP, (2) they are bound
by DAPs that have not yet been assayed in HepG2 cells, and/or (3)
DAP binding was potentially missed as false negatives in the ChIP-
seq assays. To measure functional activity of these elements (as
well as for other analyses below), we performed a lentiMPRA (or
MPRA) following established methods (Gordon et al. 2020).
MPRAs functionally validate the regulatory activity of thousands
of DNA elements simultaneously by insertion of DNA upstream
of or downstream from a transcribed element (Klein et al. 2020).
Our MPRA experiment contained 69,210 elements of 170 bp
each, selected from various promoter and distal cCREs and from
non-cCREs, as well as a set of synthetic, nongenomic elements
with various numbers of TF motifs. We supplemented this data
set by also analyzing a publicly available HepG2 lentiMPRA data
Figure 1. Genomic properties and activities of DAP-bound regions in genomic and reporter contexts. (A) The majority of
cCREs of each type are bound by an assayed DAP. Bars show the number of sites of each cCRE class (x-axis) with at least
one DAP association ("bound") and those with none in our data set ("unbound") when restricted to those overlapping with
an ATAC-seq peak in HepG2. (PLS) Promoter-like signature, (pELS) proximal enhancer-like signature, (dELS) distal
enhancer-like signature, (CA-H3K4me3) chromatin-accessible H3K4me3 region, (CA-CTCF) chromatin-accessible CTCF-bound
region, (CA-TF) chromatin-accessible TF-bound region, (TF) TF-bound region lacking chromatin accessibility, and (CA)
chromatin accessibility only. (B) Promoter elements from locally performed lentiMPRA experiments require fewer DAPs
binding for high activity in lentiMPRA than do distal elements. Boxes show MPRA signal (natural log of normalized RNA
reads over normalized DNA reads) of promoter elements as a function of binned number of DAPs (x-axis) with a peak in the
genomic region. Promoters are defined as elements whose bounds overlapped with a 200-bp region centered on GENCODE
TSSs. Distal elements are defined as elements at least 5 kb from annotated TSSs. Positive and negative control elements
are plotted for comparison. In B and D, boxes represent 25%–75% quartiles with lines indicating the median, whiskers
extend to ±1.5 × IQR (interquartile range) past the boxes, and when present, points are observations falling outside of
this range. Unpaired t-tests were used to identify significant differences in the means between distal and promoter
element activity in each category. (∗) P = 0.05, (∗∗) P = 0.0001, (∗∗∗) P < 2.2 × 10^−16. (C) The fraction of distal loci
with an ABC connection as a function of binned number of DAPs at a distal element. (D) Expression of genes genome-wide
increases as the number of factors bound and connected distal elements increases. The y-axis indicates the natural log
expression distribution of the ABC-supported gene as a function of binned number of DAPs at a distal element. Unpaired
t-tests were used to identify significant differences in the means between the expression of a given category compared
with expression in the zero category. (∗) P = 0.05.
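The MPRA signal in the legend of Figure 1B is described as the natural log of normalized RNA reads over normalized DNA reads. A minimal sketch of such a ratio is given below. The legend does not specify the normalization, so the library-size scaling (counts per million), the pseudocount, and the function name `mpra_signal` are all assumptions made for illustration and should not be read as the authors' procedure.

```python
import math


def mpra_signal(rna_counts, dna_counts, pseudocount=1.0):
    """Natural-log ratio of normalized RNA to normalized DNA counts per element.

    rna_counts, dna_counts : dicts mapping element ID -> raw read (or barcode) count
    pseudocount            : added to each count to avoid log(0); an assumption
    Normalization here is counts per million of the library total; also an assumption.
    """
    rna_total = sum(rna_counts.values())
    dna_total = sum(dna_counts.values())
    if rna_total <= 0 or dna_total <= 0:
        raise ValueError("both libraries must contain at least one read")

    signal = {}
    for element, dna in dna_counts.items():
        rna = rna_counts.get(element, 0)
        rna_norm = (rna + pseudocount) / rna_total * 1e6
        dna_norm = (dna + pseudocount) / dna_total * 1e6
        signal[element] = math.log(rna_norm / dna_norm)
    return signal


# Toy counts for two made-up elements: one over-represented in RNA, one under-represented
rna = {"element_1": 99, "element_2": 9}
dna = {"element_1": 9, "element_2": 99}
signal = mpra_signal(rna, dna)
print(signal)
```

Positive values indicate elements that yield more RNA than expected from their DNA abundance, and negative values indicate the opposite; grouping these values by the binned number of DAPs gives the kind of box plot shown in panel B.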