MPAthic: Quantitative Modeling of Sequence-Function Relationships for massively parallel assays

Massively parallel assays (MPAs) are being rapidly adopted for studying a wide range of DNA, RNA, and protein sequence-function relationships. However, the software available for quantitatively modeling these relationships is severely limited. Here we describe MPAthic, a software package that enables the rapid inference of such models from a variety of MPA datasets. Using both simulated and previously published data, we show that the modeling capabilities of MPAthic greatly improve on those of existing software. In particular, only MPAthic can accurately quantify the strength of epistatic interactions. These capabilities address a major need in the analysis of MPA data.


Background
Understanding how sequence governs function is one of the central challenges in modern biology. The classic success story in this endeavor is the cracking of the genetic code, which maps each 3-nucleotide mRNA codon to a corresponding amino acid [1]. But unlike the genetic code, which can be represented in a simple tabular form, most sequence-function relationships require a quantitative description. For instance, mutating one nucleotide in a transcriptional enhancer or one amino acid in an enzyme will typically have a quantitative, not qualitative, effect on biological function. Deciphering such quantitative sequence-function relationships has proved to be far more difficult than early success on the genetic code might have suggested.
Fortunately, a variety of massively parallel assays (MPAs), which have been developed in recent years, * Correspondence: jkinney@cshl.edu 2 Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 11375, Cold Spring Harbor, NY Full list of author information is available at the end of the article are providing new hope for deciphering quantitative sequence-function relationships. MPAs couple functional measurements to ultra-high-throughput DNA sequencing in ways that allow thousands to millions of sequences to be assayed in a single experiment; see [2,3,4,5,6] for reviews of these methods. Fig. 1 provides an illustration of three prototypical MPA assays: the Sort-Seq assay of Kinney et al., (2010) [7], the massively parallel reporter assay (MPRA) of Melnikov et al. [8], and the deep mutational scanning (DMS) assay of Fowler et al., (2010) [9].
Since the publication of these and other early works, the use of MPAs has rapidly expanded in multiple areas of molecular biology and genetics. Already, MPAs are revolutionizing the study of transcriptional regulation in vitro [10], in model organisms [7,11,12,13], and in human cells [8,14,15], the study of splicing [16] and untranslated regions [17,18,19] of mRNA , the study of protein sequence-function relationships [9,20,21,22], and the study of fitness landscapes and their evolutionary consequences [23,24,25,26,27]. The recent demonstration [28] of MPAs on sequence libraries integrated into mammalian chromosomes using CRISPR-Cas9 is likely to substantially increase the utility and prominence of such experiments in the near future.
Often, MPAs are used to simply catalog the activities of a large number of sequences. Multiple software packages have been developed to assist in the type of analysis that such cataloging requires [29,30,31]. But just tallying measurements is insufficient for characterizing most sequence-function relationships. This is because sequence space is often far too vast for every possible sequence of interest to be assayed. For example, the σ 70 RNA polymerase holoenzyme of Escherichia coli (RNAP) has a binding site roughly 40 bp in length. The sequence specificity of RNAP is of great biological interest, and in fact RNAP was the first DNA-binding protein to have its specificity characterized [32]. But it would be impossible to store quantitative measurements for all 4 40 ≈ 10 24 possible binding sites, let alone make these measurements, because the result-ing dataset would overwhelm the world's information technology infrastructure [33].
Understanding a quantitative sequence-function relationship ultimately requires quantitative modeling: the construction of a mathematical function that can predict the activity of any sequence of interest regardless of whether that sequence was previously assayed in an experiment. The data that many MPAs provide is well-suited for training such models. For example, [7] used quantitative modeling of Sort-Seq data to measure the in vivo interaction energies governing transcriptional activation at the lac promoter of E. coli. In later work, [8] used quantitative modeling of MPRA data to design enhancers with increased inducibility in human cells.
Unfortunately, the quantitative modeling of MPA data has been severely hindered by the lack of appropriate software. The only published software that provides any quantitative modeling capabilities appropriate to MPA data is dms tools [31]. dms tools, however, has multiple limitations. First, it only supports the inference of 'matrix models,' in which each position in a sequence is assumed to contribute independently to activity. [1] These models are blind to possible epistatic interactions between different positions within a sequence, interactions that are of great interest in MPA experiments [25]. Second, the parameters of the models that dms tools returns are inferred using enrichment ratio calculations. Although this inference approach is common in bioinformatics, the mathematical justification for this procedure relies on multiple assumptions that are often violated in real MPA experiments [34]. Moreover, enrichment ratios can be computed only when one's data consists of precisely one 'selected' set of sequences and one 'unselected' set. Many MPAs, e.g. [7,9,11,18,19,22,23,26], yield three or more sets of sequences and it is unclear how dms tools can be applied to these data without throwing away valuable measurements.
Here we introduce MPAthic, a software package that enables quantitative modeling of sequence-function relationships using data from a variety of MPAs. MPAthic improves on dms tools in multiple ways. In addition to supporting the inference of matrix models using enrichment ratios, MPAthic offers two alternative inference procedures: least-squares optimization and mutual information maximization. Both of these inference methods make use of all the data produced [1] We use the term 'matrix model' here instead of the more common term 'position weight matrix' to distinguish the mathematical form of such models from the method by which model parameters are inferred from data. The term 'position-specific scoring matrix' (PSSM) is synonymous with 'matrix model.' by MPAs. Least-squares optimization facilitates the rapid analysis of MPA data, while mutual information maximization yields models that are more rigorous in theory [34] and, as will be shown, more accurate in practice. Unlike dms tools, MPAthic also supports the inference of 'neighbor' models, which describe epistatic interactions between neighboring positions in a sequence.
We demonstrate MPAthic on both simulated data and on data from the MPA studies illustrated in Fig.  1 [7,8,9]. The modeling capabilities of MPAthic are shown to dramatically outperform those of dms tools in almost all cases. In particular, we find that matrix models fit using mutual information maximization consistently work better than models inferred using enrichment ratio calculations. This finding validates previous theoretical arguments [34]. Moreover, we show that MPAthic can successfully infer epistatic interactions present in both simulated and real sequencefunction relationships.
MPAthic incorporates other capabilities in addition to quantitative modeling: it provides methods for simulating massively parallel data using a user-specified model, for computing useful summary statistics such as 'information footprints' [7], and for evaluating quantitative models on arbitrary input sequences. These additional features are described in Supplemental Information (SI). All of MPAthic's functionality is accessible at the command line. Information on obtaining MPAthic is provided in the "Availability of data and materials" section at the end of this paper.

Constraints on data
Many massively parallel experiments, including those of [7,8,9,10,12,13,17,18,19,20,21,22,23,24,25,26,27,28], share the common form illustrated in Fig.  2A. One begins with a specific "wild type" sequence of interest. A library comprising variants of this wild type sequence, variants that have scattered substitution mutations, is then generated. These library sequences are then used as input to an experimental procedure that measures a specific sequence-dependent activity and, as a result of this measurement, outputs sequences into one or more "bins." Finally, the number of occurrences of each variant sequence in each bin is assayed using high-throughput sequencing.
The resulting data consists of a set aligned sequences, each associated with a specific number of counts within each of the experimental bins. MPAthic is designed specifically for the analysis of data having this form. The aligned nature of these sequences greatly simplifies the process of quantitative modeling.
In particular, when the assayed sequences have a natural alignment, one can sensibly model the quantitative effect of each nucleotide or amino acid position.
We note that not all massively parallel experiments have this form. SELEX-SEQ and related experiments often use library DNA that is completely random (e.g., [35,36,37,38,39,40]), and thus cannot be expected to have well-aligned features. Other experiments, including Sort-Seq and MPRA experiments, have used libraries that contain specified arrangements of binding sites or large numbers of different genomic regions (e.g., [11,41,42,43]). These other datasets are, in principle, amenable to quantitative modeling as well. This modeling task is more complex than for aligned sequences, however, and is not supported in the current version of MPAthic.

Formalization
We formalize the problem of inferring quantitative models of sequence-function relationships as follows. We represent massively parallel data as a set of N sequence-measurement pairs, {(S n , M n )} N n=1 , where each measurement M n is a non-negative integer corresponding to the bin in which the n'th sequence sequence, S n , was found. We assume that all sequences S have the same length L, and that at each of the L positions in each sequence there is one of C possible characters (C = 4 for DNA and RNA; C = 21 for protein, representing 20 amino acids and the termination signal). In what follows, each sequence S is represented as a binary C × L matrix having elements Here, l = 1, 2, . . . , L indexes positions within the sequence, while c = 1, 2, . . . , C indexes possible nucleotides or amino acids. Note that, in this representation, the same sequence S will typically be observed multiple times in each dataset and will often fall into multiple different bins. Our goal is to derive a function that can, given a sequence S, predict the activity R measured by the experiment. To do this we assume that the activity value R is given by a function r(S, θ) that depends on the sequence S and a set of parameters θ. Before we can infer the values of the parameters θ from data, we must first answer two distinct questions: 1 What functional form do we choose for r(S, θ)? 2 How do we use the data {S n , M n } and the model predictions r(S, θ) to infer parameter values θ? MPAthic provides two different options for the function r: a "matrix" model, where each position in S contributes independently to the predicted activity, and a "neighbor" model, which accounts for potential epistatic interactions between neighboring positions. MPAthic also provides three different methods for fitting the parameters θ to data: parameters values can be inferred by computing enrichment ratios (a method applicable only to matrix models), by performing least squares optimization, or by maximizing the mutual information between model predictions and measurements. These different model types and inference methods are elaborated below.

Matrix models and neighbor models
Matrix models have the form In this context, θ is a C × L matrix where each element θ cl represents the contribution of character c at position l to the overall sequence-dependent activity. For example, Fig. 3B shows the parameters of a matrix model that describes the sequence specificity of RNAP. These parameters were inferred from the Sort-Seq data of [7] using MPAthic. Neighbor models have the form Such models comprise C 2 (L − 1) parameters, denoted θ cdl , that represent the contributions of all possible adjacent di-nucleotides or di-amino-acids within S. Fig.  3C illustrates one such model which, as above, describes RNAP and was inferred using MPAthic operating on data from [7].

Enrichment ratio inference
The computation of enrichment ratios is the simplest way to infer quantitative sequence-function relationships from massively parallel data. The motivation for this inference method traces back to the seminal work of Berg and von Hippel [44,45], and the resulting models can be thought of as the incarnation of position weight matrices [46] in the context of massively parallel experiments. We note that the calculation of enrichment ratios is one of the primary types of analyses reported in the DMS literature [2]. Enrichment ratio inference, however, places strong restrictions on the types of models and data that one can use. Specifically, one is restricted to using matrix models only, and the data used to compute parameter values can consist of only two bins of sequences: an unselected 'library' bin (M = 0) and a selected 'enriched' bin (M = 1). Moreover, the validity of this inference procedure depends on additional assumptions that are often not satisfied by real-world experiments [34].
Still, if one is willing to impose these restrictions and make the necessary assumptions, then model parameters θ ER are computed using where Here, f M cl denotes the fraction of sequences in bin M having character c at position l, N M is the total number of sequences in bin M , λ is a nonnegative pseudocount (specified by the user), and the sum in Eq. (5) is restricted to sequences S n that lie within bin M (i.e., for which M n = M ). We note that that Eq. (5) arises in a Bayesian calculation as a the maximum a posteriori estimate of θ ER cl for appropriate choice of prior.
The calculation of enrichment ratios is the only type of inference supported by dms tools [31]. This approach, however, is implemented in two different ways. dms tools supports the computation of matrix model parameters using the formulas in Eq. (4) and Eq. (5). This leads to very rapid calculations, and for this reason such inference is also supported by MPAthic. dms tools also supports a much more computationally intensive inference procedure, which uses Monte Carlo sampling of an explicit Bayesian posterior distribution. In [31], this Monte Carlo approach is advocated as providing a more accurate estimate of model parameters than do Eqs. (4) and (5). In what follows, we use 'DT' to label matrix models inferred by dms tools using this Monte Carlo approach. Supplemental Information shows, however, that all DT models discussed in this manuscript are nearly indistinguishable from the models inferred by analytic enrichment ratio calculations.

Least squares inference
Least squares provides a computationally simple inference procedure that overcomes the most onerous restrictions of enrichment ratio calculations. It can be used to infer any type of linear model, including both matrix models and neighbor models. It can also be used on data that consists of more than two bins.
The idea behind the least squares approach is to choose parameters θ LS that minimize a quadratic loss function. Specifically, we use where Here, µ M is the assumed mean activity of sequences in bin M , σ 2 M is the assumed variance in the activities of such sequences, i indexes all parameters in the model, and α is a "ridge regression" regularization parameter [47]. By using the objective function L(θ), one can rapidly compute values of the optimal parameters θ using standard algorithms [48].
One downside to least squares inference is the need to assume specific values for µ M and for σ 2 M for each bin M . MPAthic allows the user to manually specify these values. There is a danger here, since assuming incorrect values for µ M and σ 2 M will generally lead to bias in the inferred parameters θ LS [34]. In practice, however, the default choice of µ M = M and σ 2 M = 1 often works surprisingly well when bins are arranged from lowest to highest average activity.
Another downside to least squares is the need to assume that experimental noise -specifically, p(R|M ) -is Gaussian. Only in such cases does least squares inference correspond to a meaningful maximum likelihood calculation. In massively parallel assays, however, noise is often strongly non-Gaussian. In such situations, least squares inference cannot be expected to yield correct model parameters for any choice of µ M and σ 2 M [34].

Information maximization inference
An alternative inference procedure, one that does not suffer from the need to assume a specific quantitative form for experimental noise, is the maximization of mutual information. In the large data limit, information maximization is equivalent to performing maximum likelihood inference when the quantiative form of experimental noise is unknown [34,49,50]. This approach was originally proposed for receptive field inference in sensory neuroscience [51, 52, 53], but has since been applied in multiple molecular biology contexts [49,54], including in the analysis of massively parallel experiments [7,8].
In this approach, parameter values are chosen to maximize the mutual information between model predictions and measurements. Specifically, one chooses where is the mutual information between the bins M in which sequences are found and the corresponding model prediction R for those sequences. In what follows, I(θ) is referred to as the "predictive information" of the model. For each choice of θ, computing predictive information requires a regularized estimate of the joint probability distribution p(R, M ). MPAthic currently uses standard kernel density methods [47] to estimate these distributions, although field-theoretic density estimation [55, 56] may ultimately prove superior in this context. Following [7], MPAthic optimizes parameters using a Metropolis Monte Carlo algorithm in which each choice for θ has relative probability exp[N I(θ)]. Because this Monte Carlo procedure is computationally expensive, information maximization is much slower than enrichment ratio calculations or least squares inference. Running on a standard laptop computer, our current algorithm takes between 30 minutes and 2 hours for each of the information maximization tasks described below.

Results
To test the capabilities of MPAthic, we analyzed data from previously published Sort-Seq [7], MPRA [8], and DMS [9] studies. Each of these studies generated multiple independent datasets, allowing us to test the inference capabilities of MPAthic by training and testing models on separate data. We also analyzed simulated data in order to assess the ability of MPAthic to accurately recover known parameter values.

Sort-Seq data
In their studies of the E. coli lac promoter, Kinney et al. [7] performed six independent Sort-Seq experiments, which they referred to as rnap-wt, crp-wt, fullwt, full-500, full-150, and full-0. All of these experiments assayed the transcriptional activity of variant sequences spanning a 75 bp region of the lac promoter (Fig. 3A). This region is known to bind two proteins, RNAP and CRP, at two separate binding sites. In the original study [7], models for the sequence specificity of these two proteins were inferred by modeling how transcription depends on sequence variation within their respective binding sites.
For both RNAP and CRP, we used each of these six datasets to infer both matrix models and neighbor models. [2] Inference was performed using two methods supported by MPAthic: least squares optimization (LS) and mutual information maximization (IM). In addition, matrix models were inferred using the Monte Carlo estimates of enrichment ratios provided by dms tools (DT). [3] The performance of each of these models on each of the available datasets was then quantified by estimating the predictive information I[R; M ]. Fig. 4A illustrates the performance of each inferred RNAP model (columns) on each of the published datasets (rows). Fig. 4B shows similar results for the inferred CRP models. [4] For both CRP and RNAP, the IM-inferred matrix models consistently outperformed the LS-and DT-inferred matrix models when evaluated on independent test data (Figs. 4C,4D,4E,4F). This finding lends support to the theoretical arguments of [34] that information maximization has substantial advantages over other methods for inferring quantitative sequence-function relationships from massively parallel data. It also demonstrates the superior performance of MPAthic relative to dms tools for the inference of matrix models.
We also investigated whether neighbor models, which account for epistatic interactions between neighboring positions in a sequence, might provide better descriptions of RNAP and CRP than simple matrix models do. To our knowledge, the presence of such interactions in either of these well-studied proteins has yet to be definitively established (although see [57]). We therefore compared the predictive performance of matrix and neighbor models that were trained (using IM) on the same datasets (Figs. 4G, 4H).
Neighbor models did not always outperform matrix models in these tests. However, for both CRP and RNAP, neighbor models did perform better than their corresponding matrix models when the predictive information of the matrix model was high (Figs. 4E,4F). Such high matrix model predictive information values [2] Raw data from [7] is available on NCBI SRA, accession number SRA012345; processed data formatted for use with MPAthic is provided on the MPAthic project homepage. [3] Because enrichment ratio calculations require exactly two sequence bins, only one high-activity and one low-activity set of sequences were used for the enrichment ratio calculations in Figs. 4 and 5.
[4] RNAP models were not trained or tested on the crpwt dataset because the RNAP binding site was not mutagenized in that experiment. Similarly, CRP models were not trained or tested on the rnap-wt dataset. CRP models were also not trained or tested on the full-0 dataset because cAMP, a ligand that CRP requires in order to bind DNA, was absent in that experiment. are expected to occur when the data used to train models is of high quality. We interpret this finding as evidence for epistatic interactions in the specificities of both CRP and RNAP. This also indicates that MPAthic, but not dms tools, is capable of quantifying such interactions. We note that the crp-wt data should be expected to be the best dataset for training models of CRP because the mutation rate used in this experiment was the highest (24%). This expectation is consistent with our finding that the CRP neighbor model inferred from crp-wt outperformed every other model inferred for CRP.

Simulated data
To further establish the ability of MPAthic to properly infer quantitative models, we next analyzed simulated Sort-Seq data. To generate simulated Sort-Seq data, we used the simulation capabilities of MPAthic together with the nbr-IM models for RNAP and CRP that were inferred from the full-wt dataset of [7]. Eight datasets were simulated in total, four for RNAP and four for CRP. In each simulation, 10 6 cells were sorted into either 10 or 2 bins; see SI for simulation details. Half of these simulated datasets (labeled "train") were then used to infer matrix and neighbor models as described in the previous section. The other half (labeled "test") were used solely to evaluate model performance. Fig. 5A shows results for the simulated RNAP data, while Fig. 5B shows corresponding results for simulated CRP data. The nbr-IM models performed best in every case tested, with virtually no apparent difference in performance between training and test data. In particular, all of the nbr-IM models performed substantially better than the mat-IM models, demonstrating the ability of MPAthic to accurately quantify epistatic interactions. It is also worth noting that, as in the analysis of Sort-Seq data, the matrix models found by MPAthic using IM inference outperformed those computed by dms tool using enrichment ratio calculations. Figs. 5C and 5D plot the values of parameters for inferred neighbor models against the corresponding parameter values of the neighbor models used to generate the data. We found very strong agreement, with a signal-to-noise ratio of 31 across the 528 parameters of the RNAP neighbor model, and a signal-to-noise ratio of 49 across the 336 parameters of the CRP neighbor model.

MPRA and DMS data
MPAthic is designed to facilitate the quantitative modeling of data from a variety of massively parallel assays, including MPRA and DMS experiments. To test the utility of MPAthic in these contexts, we inferred matrix models using MPRA data from [8] and DMS data from [9]. [5] In [8], replicate MPRA experiments were performed on a synthetic cAMP responsive element (CRE). These experiments tested ∼ 2.7 × 10 4 microarray-synthesized CREs having randomly scattered substitution mutations (10% per nucleotide position) throughout an 87 bp region. Using MPAthic, we inferred matrix models spanning this entire 87 bp region using IM-and LS-based inference. We also computed matrix models using dms tools. We found that on both replicate datasets, the IM-inferred models found by MPAthic performed the best in cross-comparisons (Fig. 6A). Moreover, both of these models performed better on both replicate datasets relative to the matrix model described in the original publication [8].
The DMS experiments of [9] assayed a variable region spanning 33 aa within a WW domain protein.
Specifically, the gene sequences of this WW domain was mutagenized at ∼2% per base. Multiple rounds of panning using a peptide ligand were then used to select WW-domain variants displayed on the surface of phage. The WW domain coding sequences present in the phage library after 0, 3, and 6 rounds of selection were then sequenced.
Using MPAthic, we fit models to either the round 0 and round 3 libraries, or to the round 3 and round 6 libraries. When trained on round 0,3 data and tested on round 3,6 data, the IM-inferred matrix models returned by MPAthic performed better than LS-inferred models and about the same as the matrix models returned by dms tools (Fig. 6B). However, IM models fit to round 3,6 data actually performed worse than the corresponding models of dms tools. This is the only situation we encountered where DT models outperformed IM models.
The poor performance of IM in this context is most likely due to the sparsity of data in the round 3,6 dataset. Specifically, in the round 3,6 dataset, we observed 8 amino-acid-position combinations with no representation in the data. Furthermore, 16 aminoacid-position combinations were represented by data [5] The preprocessed MPRA data of [8] was obtained from NCBI GEO, accession number GSE31982. The preprocessed DMS data of [2] was kindly provided by Douglas Fowler; raw data is available from NCBI SRA, accession number SRA020603. Processed data from both publications, formatted for use with MPAthic, is provided on the MPAthic project homepage. The neighbor models fit to data from both of these studies performed poorly relative to matrix models. We therefore ignore these neighbor models in what follows. from only one sequence read. By contrast, the round 0,3 dataset contained data on all amino-acid-position combinations, and for only 2 of these combinations did this data come from a single sequence. Our results therefore suggest that the IM-based inference of MPAthic can perform at least as well on DMS data as enrichment ratio calculations, but only when datasets are sufficiently rich. More generally, the existence of 20 amino acids compared to 4 DNA/RNA bases places a significantly larger burden on the amount of data needed to obtain accurate models from DMS data relative to Sort-Seq or MPRA data. This is true regardless of the inference method one uses.

Discussion
MPAthic provides routines for inferring quantitative models from MPA data. Such modeling is essential for understanding quantitative sequence-function relationships. The lack of published software available for this purpose has likely restricted the range of biological problems to which MPAs have been applied. Currently, the only published software for learning quantitative models from MPA data is dms tools [31]. As shown here, MPAthic improves upon dms tools in two key ways.
First, MPAthic is better than dms tools at inferring the parameters of matrix models, the simplest and most widely used type of model for describing sequence-function relationships. This improved performance is due to MPAthic supporting the use of mutual information maximization as a way to infer parameter values. dms tools, by contrast, is limited to enrichment ratio calculations. Mutual information maximization has been theoretically shown to provide an optimal inference method in the large data limit [34]. By contrast, the use of enrichment ratio calculations requires multiple assumptions that are often violated in realworld MPA experiments. Moreover, mutual information maximization makes use of all the available data, while enrichment ratio calculations often require one to discard valuable measurements. As we showed on both real and simulated data, the mutual information maximization routines provided by MPAthic almost always yield better matrix models than do enrichment ratio calculations. The only exception to this observation was found, unsurprisingly, in the analysis of a dataset having comparatively sparse coverage.
Second, unlike dms tools, MPAthic enables the quantification of epistatic interactions via the inference of neighbor models. Using simulated data, we showed that MPAthic is able to recover nearest-neighbor epistatic interactions with high accuracy. When applied to the published Sort-Seq data of [7], MPAthic was able to find neighbor models for both RNAP and CRP that had higher predictive power than the corresponding optimal matrix models. This indicates the successful quantification of real epistatic interactions that had not been previously known for either of these well-studied proteins.
The quantitative modeling of sequence-function relationships will ultimately require capabilities beyond those currently supported by MPAthic. For instance, there are a variety of other types of models that are likely to prove useful. Particularly promising are models with sparse all-versus-all pairwise interactions [57], models with interactions based on higher-order sequence features [58], deep neural network models [59], and nonlinear models that reflect specific biophysical mechanisms [7]. MPAthic does, however, provide a framework into which such modeling capabilities can be incorporated in the future, and through which different modeling strategies can be compared in a transparent way.
Availability of data and materials

Competing interests
The authors declare that they have no competing interests.
Author's contributions WTI and JBK participated in all aspects of this research. The MPRA assay of [8]. Variant enhancers (blue) are used to drive the transcription of RNA that contains enhancer-specific tags (shades of brown). Expression constructs are transfected into cell culture, after which tag-containing RNA is isolated and sequenced. Output sequences consist of the variant enhancers that correspond to expressed tags. (C) The DMS assay of [9]. Randomly mutagenized gene sequences (blue) produce variant proteins (colored bells) that are expressed on the surface of phage (gray rectangles). Panning is used to enrich for phage that express proteins that bind a specific ligand of interest (brown circles). The variant coding regions enriched after one or more rounds of panning are then sequenced.   In all massively parallel assays, a library of sequences is used as input to an experiment (black box) that outputs these sequences into one or more bins. The prevalence of each sequence in each bin depends on the assayed activity of that sequence. MPAthic can be used to analyze data from such experiments when the input library consists of substitution-mutated versions of a specific "wild type" sequence. (B) The data from such experiments can be represented as a table listing the number of occurrences of each unique sequence in each bin. MPAthic provides routines for inferring quantitative models from datasets that have this form. Routines are also provided for simulating data, for computing summary statistics, and for evaluating inferred models on arbitrary sequences.  [7]. This region contains binding sites for two proteins: CRP and RNAP. As shown in Fig. 4, multiple types of quantitative models for both CRP and RNAP (spanning the two indicated regions) were inferred from the datasets of [7]    Sort-Seq data from [7]. Each column corresponds to an inferred model; column headers indicate the dataset (rnap-wt, crp-wt, full-wt, full-500, full-150, or full-0) used to train the model, the type of model inferred (neighbor (nbr) or matrix (mat)), and the inference method used by MPAthic (information maximization (IM) or least squares (LS)). Columns corresponding to matrix models inferred using dms tools are indicated by DT. Rows indicate the datasets used to evaluate model performance. Heatmap values give the predictive information of each inferred model (column) on each test set (row). These values are expressed as a percentage of the maximal predictive information achieved on each test set (i.e., along each row). (C-H) Scatter plot comparisons of predictive information values for (C,D) matrix models fit using IM inference (I mat,IM ) vs. using dms tools (I mat,DT ), (E,F) matrix models fit using IM vs. LS inference (I mat,LS ), and (G,H) IM-inferred matrix models versus IM-inferred neighbor models (I nbr,IM   Analysis of simulated data. Sort-Seq data was simulated using the RNAP and CRP neighbor models inferred using MPAthic (in IM mode) from the full-wt data of [7]. Four datasets were generated for each model: one training and one test set were generated by sorting into 10 bins, while one training and one test set generated by sorting into 2 bins. (A,B) Performance of (A) RNAP and (B) CRP models inferred from and evaluated on these simulated datasets. Columns indicate the dataset used to train the model, the type of model inferred (nbr or mat), and the inference method used for training (MPAthic in IM or LS mode, or dms tools (DT)). Rows indicate the datasets on which models were evaluated. As in Figs. 3A and 3B, heatmaps show predictive information values expressed as a percentage of the maximal predictive information achieved on each dataset. (C,D) Comparison of the parameters of the neighbor models used in these simulations to the parameters of the neighbor models fit to the corresponding "sim-10 train" data using MPAthic in IM mode. Also shown is the signal-to-noise ratio, defined as the variance in the abscissa divided by the variance in the deviation of the ordinate from the diagonal.  Cross-comparison of matrix models fit to data from two replicate MPRA experiments reported in [8]. The performance of the matrix model reported in the original publication (Pub) is also shown. (B) In the DMS experiments of [9], sequence data was gathered after 0, 3, and 6 rounds of selection. Shown is a cross-comparison of matrix models fit to data from either rounds 0 and 3, or to data from rounds 3 and 6, using either MPAthic in IM or LS mode, or using dms tools (DT).