Chapter VII

Published as …

A version of this chapter originally appeared as Chure, G, Razo-Mejia, M., Belliveau, N.M., Kaczmarek, Zofii A., Einav, T., Barnes, Stephanie L., Lewis, M., and Phillips, R. (2019). Predictive shifts in free energy couple mutations to their phenotypic consequences. PNAS 116(37) DOI: https://doi.org/10.1073/pnas.1907869116. G.C., M.R.M, N.M.B., Z.A.K., and S.L.B designed the experiments and collected and analyzed data. G.C. developed theoretical treatment of free energy shifts. G.C., M.R.M, N.M.B., Z.A.K., T.E., S.L.B., and R.P. designed the research project. G.C. and R.P. wrote the paper. M.L. provided guidance and advice.

Non-Monotonic Behavior of ΔF Under Changing K_A and K_I

In Chapter 3, we illustrated that perturbations only to the allosteric parameters K_A and K_I relative to the wild-type values can result in a non-monotonic dependence of ΔF on the inducer concentration c. In this section, we prove that when the ratio of K_A to K_I is the same between the mutant and wild-type proteins, the function must be non-monotonic. This section is paired with an interactive figure available on the paper website which illustrates how scaling K_A and K_I relative to the wild-type value results in non-monotonic behavior.

We define a monotonic function as a continuous function whose derivative does not change sign across the domain upon which it is defined. To show that ΔF is non-monotonic when K_A and K_I are perturbed, we can compute the derivative of ΔF with respect to the inducer concentration c and evaluate its sign at the limits of inducer concentration. If the sign of the derivative differs between the limits c = 0 and c ≫ 0, the function is non-monotonic. However, if the sign is the same in both limits, we cannot say conclusively whether the function is monotonic and must consider other diagnostics.

The free energy difference between a mutant and wild-type repressor when all parameters other than K_A and K_I are unperturbed can be written as
$$ \beta\Delta F(c) = -\log\left({\left[1 + e^{-\beta\Delta\varepsilon_{AI} }\left({1 + {c \over K_I^\text{(mut)} } \over 1 + {c \over K_A^\text{(mut)} }}\right)^2\right]^{-1} \over \left[1 + e^{-\beta\Delta\varepsilon_{AI} }\left({1 + {c \over K_I^\text{(wt)} } \over 1 + {c \over K_A^\text{(wt)} }}\right)^2\right]^{-1} }\right), \qquad(1)$$
in which Δε_AI is the energy difference between the active and inactive states of the repressor, c is the inducer concentration, and β = 1/k_BT where k_B is the Boltzmann constant and T is the temperature. The derivative with respect to c, which we determined using Mathematica’s (Wolfram Research, version 11.2) symbolic computing ability, is given as
$$ \begin{aligned} {\partial \beta\Delta F(c) \over \partial c} &= 2e^{-\beta\Delta\varepsilon_{AI} }\times \\ &\left(\frac{ {K_A^\text{(mut)} }^2 \left(K_A^\text{(mut)} - K_I^\text{(mut)}\right)\left(c + K_I^\text{(mut)}\right)}{\left(c + K_A^\text{(mut)}\right) \left[\left(c + K_A^\text{(mut)}\right)^2{K_I^\text{(mut)} }^2 + e^{-\beta\Delta\varepsilon_{AI} }{K_A^\text{(mut)} }^2 \left(c + K_I^\text{(mut)}\right)^2\right] } \right. \\ &- \left. \frac{ {K_A^\text{(wt)} }^2\left(K_A^\text{(wt)} - K_I^\text{(wt)}\right)\left(c + K_I^\text{(wt)}\right)}{\left(c + K_A^\text{(wt)}\right) \left[\left(c + K_A^\text{(wt)}\right)^2{K_I^\text{(wt)} }^2 + e^{-\beta\Delta\varepsilon_{AI} }{K_A^\text{(wt)} }^2\left(c + K_I^\text{(wt)}\right)^2\right]} \right). \end{aligned} \qquad(2)$$
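
As a sanity check on the symbolic result, Eq. 2 can be compared against a finite-difference derivative of Eq. 1. The following is a minimal sketch using only the Python standard library. The function names are our own, the wild-type values are taken from the caption of Fig. 1, and the mutant values are hypothetical numbers chosen purely for illustration.

```python
import math

def p_act_inverse(c, ka, ki, ep_ai):
    """Return 1 / p_act = 1 + exp(-ep_ai) * ((1 + c/ki) / (1 + c/ka))**2."""
    return 1 + math.exp(-ep_ai) * ((1 + c / ki) / (1 + c / ka)) ** 2

def delta_F(c, ka_mut, ki_mut, ka_wt, ki_wt, ep_ai):
    """Eq. 1: free energy shift between mutant and wild type, in units of k_BT."""
    return (math.log(p_act_inverse(c, ka_mut, ki_mut, ep_ai))
            - math.log(p_act_inverse(c, ka_wt, ki_wt, ep_ai)))

def dF_term(c, ka, ki, ep_ai):
    """One of the two terms of Eq. 2, including the 2 * exp(-ep_ai) prefactor."""
    e = math.exp(-ep_ai)
    num = 2 * e * ka**2 * (ka - ki) * (c + ki)
    den = (c + ka) * ((c + ka) ** 2 * ki**2 + e * ka**2 * (c + ki) ** 2)
    return num / den

def d_delta_F(c, ka_mut, ki_mut, ka_wt, ki_wt, ep_ai):
    """Eq. 2: analytic derivative of Eq. 1 with respect to c."""
    return dF_term(c, ka_mut, ki_mut, ep_ai) - dF_term(c, ka_wt, ki_wt, ep_ai)

# Wild-type values from the Fig. 1 caption; the mutant values are hypothetical.
wt = dict(ka_wt=200.0, ki_wt=1.0, ep_ai=4.5)
mut = dict(ka_mut=50.0, ki_mut=2.0)

# Central finite difference of Eq. 1 at an arbitrary concentration (in µM).
c, h = 25.0, 1e-4
numeric = (delta_F(c + h, **mut, **wt) - delta_F(c - h, **mut, **wt)) / (2 * h)
analytic = d_delta_F(c, **mut, **wt)
print(f"central difference: {numeric:.8f}, Eq. 2: {analytic:.8f}")
```

Agreement between the two numbers at several values of c is a quick guard against transcription errors in the unwieldy analytic expression.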

This unwieldy expression can be simplified by defining the values of K_A^(mut) = θK_A^(wt) and K_I^(mut) = θK_I^(wt) as relative changes to the wild-type values, where θ is a scaling parameter. Although K_A^(mut) and K_I^(mut) could be permitted to vary by different degrees, we consider the case in which they are equally perturbed such that the ratio of K_A to K_I is the same between the mutant and wild-type versions of the repressor. The equations become more cumbersome when one permits the dissociation constants to vary by different amounts (i.e. θ_{K_A}, θ_{K_I}), but one arrives at the same conclusion. This definition allows us to rewrite Eq. 2 in the form of
$$ \begin{aligned} {\partial \beta\Delta F(c) \over \partial c} &= 2{K_A^\text{(wt)} }^2 e^{-\beta\Delta\varepsilon_{AI} }\times \\ & \left({\theta^3 \left({K_A^\text{(wt)} }- {K_I^\text{(wt)} }\right)\left(c + \theta {K_I^\text{(wt)} }\right) \over \left(c + \theta {K_A^\text{(wt)} }\right)\left[\theta^2 {K_I^\text{(wt)} }^2\left(c + \theta {K_A^\text{(wt)} }\right)^2 + e^{-\beta\Delta\varepsilon_{AI} }\theta^2{K_A^\text{(wt)} }^2\left(c + \theta {K_I^\text{(wt)} }\right)^2\right]} \right. \\ &- \left. {\left({K_A^\text{(wt)} } - {K_I^\text{(wt)} }\right)\left(c + {K_I^\text{(wt)} }\right) \over \left(c + {K_A^\text{(wt)} }\right) \left[\left(c + {K_A^\text{(wt)} }\right)^2{K_I^\text{(wt)} }^2 + e^{-\beta\Delta\varepsilon_{AI} }{K_A^\text{(wt)} }^2\left(c + {K_I^\text{(wt)} }\right)^2\right]} \right). \end{aligned} \qquad(3)$$
With this derivative in hand, we can examine the limits of inducer concentration. As discussed in the main text, the free energy difference between the mutant and wild-type repressors when c = 0 should be equal to 0. However, the derivative at c = 0 will be different between the wild-type and the mutant. In this limit, Eq. 3 simplifies to
$$ {\partial \beta\Delta F(c) \over \partial c}\bigg\vert_{c = 0} = {2e^{-\beta\Delta\varepsilon_{AI} }\left({K_A^\text{(wt)} } - {K_I^\text{(wt)} }\right) \over {K_A^\text{(wt)} }{K_I^\text{(wt)} }\left(1 + e^{-\beta\Delta\varepsilon_{AI} }\right)}\left({1 \over \theta} - 1\right). \qquad(4)$$

When θ < 1, meaning that the affinity of the active and inactive states of the repressor to the inducer is increased relative to wild-type, the derivative is positive. Thus, the repressor bound state of the promoter becomes less energetically favorable than the repressor unbound state. Similarly, if θ > 1, binding of the inducer to the mutant repressor is weaker than to the wild-type repressor, making ∂βΔF(c)/∂c < 0, meaning the repressor bound state becomes more energetically favorable than the repressor unbound state of the promoter.

With an intuition for the sign of the derivative when c = 0, we can compute the derivative at another extreme where c ≫ 0. Here, Eq. 3 reduces to
$$ {\partial \beta\Delta F(c) \over \partial c}\bigg\vert_{c \gg 0} \approx {2e^{-\beta\Delta\varepsilon_{AI} }{K_A^\text{(wt)} }^2({K_A^\text{(wt)} } - {K_I^\text{(wt)} }) \over c^2\left({K_I^\text{(wt)} }^2 + e^{-\beta\Delta\varepsilon_{AI} }{K_A^\text{(wt)} }^2\right)} \left(\theta - 1\right). \qquad(5)$$
When θ > 1, Eq. 5 is positive, which is the opposite of the sign of the derivative at c = 0 for the same θ (Eq. 4). When θ < 1, Eq. 5 is negative whereas Eq. 4 is positive. As the derivative of ΔF with respect to c changes sign across the defined range of inducer concentrations, we can say the function is non-monotonic.
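
The sign change can also be seen numerically by evaluating Eq. 1 on a grid of inducer concentrations with both dissociation constants scaled by θ. The sketch below is illustrative only. It uses the wild-type values from the caption of Fig. 1, the two values of θ are arbitrary choices, and the function names are our own.

```python
import math

def beta_F(c, ka, ki, ep_ai):
    """Allosteric free energy log(1 / p_act), in units of k_BT."""
    return math.log(1 + math.exp(-ep_ai) * ((1 + c / ki) / (1 + c / ka)) ** 2)

def delta_F_scaled(c, theta, ka_wt=200.0, ki_wt=1.0, ep_ai=4.5):
    """Eq. 1 with K_A and K_I both scaled by theta (wild-type ratio preserved)."""
    return (beta_F(c, theta * ka_wt, theta * ki_wt, ep_ai)
            - beta_F(c, ka_wt, ki_wt, ep_ai))

# Inducer concentrations from 1e-3 to 1e7 (same units as K_A and K_I), log spaced.
c_range = [10 ** (-3 + 10 * i / 400) for i in range(401)]

for theta in (0.1, 10.0):
    dF = [delta_F_scaled(c, theta) for c in c_range]
    slopes = [b - a for a, b in zip(dF, dF[1:])]
    # Count how often the finite-difference slope changes sign along the grid.
    sign_changes = sum(1 for s0, s1 in zip(slopes, slopes[1:]) if s0 * s1 < 0)
    extremum = max(dF, key=abs)
    print(f"theta = {theta}: extremum of dF = {extremum:.2f} k_BT, "
          f"slope sign changes = {sign_changes}")
```

For θ < 1 the free energy shift rises from zero, passes through a maximum, and decays back toward zero at saturating inducer; for θ > 1 the curve is mirrored below zero.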

Fig. 1 shows the non-monotonic behavior of ΔF when K_A and K_I change by the same factor θ (maintaining the wild-type ratio, Fig. 1 (A)) and when K_A and K_I change by different factors (Fig. 1 (B)). In both cases, non-monotonic behavior is observed with the peak difference in the free energy covering several k_BT. We have hosted an interactive figure similar to Fig. 1 on the paper website where the reader can modify how K_A and K_I are affected by a mutation and examine how the active probability, free energy difference, and ∂βΔF/∂c are tuned.

Figure 1: **Non-monotonic behavior of ΔF with changes in K_A and K_I.** Middle column shows the allosteric contribution of free energy F plotted as a function of the inducer concentration. Right column shows the free energy difference ΔF as a function of inducer concentration, revealing non-monotonicity. (A) Behavior of F and ΔF when the values of K_A and K_I change relative to wild-type, but maintain the same ratio. θ is the scaling factor for both inducer dissociation constants. (B) Behavior of F and ΔF when the values of K_A and K_I change relative to the wild-type, but by different factors. In both panels, the wild-type parameter values were taken to be K_A = 200 μM, K_I = 1 μM, and Δε_AI = 4.5 k_BT. An interactive version of this figure is available on the paper website. The Python code (`ch7_figS1.py`) used to generate this figure can be found on the thesis GitHub repository.

Bayesian Parameter Estimation for DNA Binding Mutants

In this section, we outline the statistical model used in this work to estimate the DNA binding energy for a given mutation in the DNA binding domain. The methodology presented here is similar to that performed in Chapter 2 and outlined in accompanying Chapter 6. In the following text, we take a very detailed approach to vetting the robustness of our statistical inference machinery as determination of parameter values is critical to assessing the effects of mutations. Similarly to what is presented in Chapter 6, we begin with a derivation of our statistical model using Bayes’ theorem and then perform a series of principled steps to validate our choices of priors, ensure computational feasibility, and assess the validity of the model given the collected data. This work follows the analysis pipeline outlined by Michael Betancourt in his case-study entitled “Towards A Principled Bayesian Workflow.”

The second subsection Building a Generative Statistical Model lays out the statistical model used in this work to estimate the DNA binding energy and the error term σ. The subsequent subsections – Prior Predictive Checks, Simulation Based Calibration, and Posterior Predictive Checks – define and summarize a series of tests that ensure that the parameters of the statistical model can be identified and are computationally tractable. To understand how we defined our statistical model, only the second subsection is needed.

Calculation of the Fold-Change in Gene Expression

We appreciate the subtleties of the efficiency of photon detection in the flow cytometer, fluorophore maturation and folding, and autofluorescence correction, and we understand the importance of modeling the effects that these processes have on the reported value of the fold-change. However, in order to be consistent with the methods used in the literature, we took a simpler approach to calculate the fold-change. Given a set of fluorescence measurements of the constitutive expression control (R = 0), an autofluorescence control (no YFP), and the experimental strain (R > 0), we calculate the fold-change as
$$ \text{fold-change} = {\langle I_\text{cell}(R > 0)\rangle - \langle I_\text{autofluorescence}\rangle \over \langle I_\text{cell}(R = 0) \rangle - \langle I_\text{autofluorescence}\rangle}. \qquad(6)$$
It is important to note here that for a given biological replicate, we consider only a point estimate of the mean fluorescence for each sample and perform a simple subtraction to adjust for background fluorescence. For the analysis going forward, all mentions of measured fold-change are determined by this calculation.
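
A minimal sketch of Eq. 6 is given below. The intensity values are hypothetical numbers in arbitrary units, used only to show the arithmetic, and the function name is our own.

```python
from statistics import mean

def fold_change(i_rep, i_delta, i_auto):
    """Eq. 6: background-subtracted ratio of mean fluorescence intensities."""
    auto = mean(i_auto)
    return (mean(i_rep) - auto) / (mean(i_delta) - auto)

# Hypothetical single-cell intensities (arbitrary units), for illustration only.
i_auto = [95.0, 100.0, 105.0]        # autofluorescence control (no YFP)
i_delta = [1050.0, 1100.0, 1150.0]   # constitutive expression control (R = 0)
i_rep = [280.0, 300.0, 320.0]        # experimental strain (R > 0)

fc = fold_change(i_rep, i_delta, i_auto)
print(fc)
```

Because only the sample means enter the calculation, the cell-to-cell spread within a sample does not propagate into the reported fold-change.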

Building a Generative Statistical Model

To identify the minimal parameter set affected by a mutation, we assume that mutations in the DNA binding domain of the repressor alter only the DNA binding energy Δε_RA, while the other parameters of the repressor are left unperturbed from their wild-type values. As a first approach, we can assume that all of the other parameters are known without error and can be taken as constants in our physical model. Ultimately, we want to know how probable a particular value of Δε_RA is given a set of experimental measurements y. Bayes’ theorem computes this distribution, termed the posterior distribution, as
$$ g(\Delta\varepsilon_{RA}\,\vert\, y) = {f(y\,\vert\, \Delta\varepsilon_{RA}) g(\Delta\varepsilon_{RA}) \over f(y)} \qquad(7)$$
where we have used g and f to represent probability densities over parameters and data, respectively. The expression f(y |Δε_RA) captures the likelihood of observing our data set y given a value for the DNA binding energy under our physical model. All knowledge we have of what the DNA binding energy could be, while remaining completely ignorant of the experimental measurements, is defined in g(Δε_RA), referred to as the prior distribution. Finally, the likelihood that we would observe the data set y while being ignorant of our physical model is defined by the denominator f(y). In this work, this term serves only as a normalization factor and as a result will be treated as a constant. We can therefore say that the posterior distribution of Δε_RA is proportional to the joint distribution between the likelihood and the prior,
g(Δε_RA | y) ∝ f(y | Δε_RA)g(Δε_RA). (8)

We are now tasked with translating this generic notation into a concrete functional form. Our physical model derived in Chapter 2 computes the average fold-change in gene expression. Speaking practically, we make several replicate measurements of the fold-change to reduce the effects of random errors. As each replicate is independent of the others, it is reasonable to expect that these measurements will be normally distributed about the theoretical value of the fold-change μ, computed for a given Δε_RA. We can write this mathematically for each measurement as
$$ f(y\,\vert\, \Delta\varepsilon_{RA}) = {1 \over (2\pi\sigma^2)^{N/2}}\prod\limits_{i}^N \exp\left[-(y_i - \mu(\Delta\varepsilon_{RA}))^2 \over 2\sigma^2\right], \qquad(9)$$
where N is the number of measurements in y and y_i is the i^th experimental fold-change measurement. We can write this likelihood in shorthand as
f(y | Δε_RA) = Normal{μ(Δε_RA), σ} (10)
which we will use for the remainder of this section.
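
To make the likelihood concrete, the logarithm of Eq. 9 can be written as a short function. This is a sketch under stated assumptions: the form of the mean fold-change μ and the number of nonspecific binding sites `N_NS` are not restated in this section and are assumed here from the physical model referenced in Chapter 2, while the default values of R, K_A, K_I, and Δε_AI are the wild-type values quoted later in this section. The measurements are hypothetical.

```python
import math

N_NS = 4.6e6  # assumed number of nonspecific binding sites

def fold_change_theory(c, ep_ra, R=260, ka=139.0, ki=0.53, ep_ai=4.5):
    """Assumed form of the mean fold-change; energies in k_BT, c in µM."""
    p_act = 1 / (1 + math.exp(-ep_ai) * ((1 + c / ki) / (1 + c / ka)) ** 2)
    return 1 / (1 + p_act * (R / N_NS) * math.exp(-ep_ra))

def log_likelihood(y, c, ep_ra, sigma):
    """Logarithm of Eq. 9 for measurements y taken at inducer concentrations c."""
    n = len(y)
    resid_sq = sum((yi - fold_change_theory(ci, ep_ra)) ** 2
                   for yi, ci in zip(y, c))
    return -0.5 * n * math.log(2 * math.pi * sigma**2) - resid_sq / (2 * sigma**2)

# Hypothetical measurements at four inducer concentrations (µM).
c_obs = [0.0, 10.0, 100.0, 1000.0]
y_obs = [0.02, 0.10, 0.60, 0.95]
for ep_ra in (-16.0, -14.0, -12.0):
    print(ep_ra, log_likelihood(y_obs, c_obs, ep_ra, sigma=0.05))
```

Working with the logarithm avoids numerical underflow when the product in Eq. 9 runs over many measurements.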

Using a normal distribution for our likelihood has introduced a new parameter σ which describes the spread of our measurements about the true value. We must therefore include it in our parameter estimation and assign an appropriate prior distribution such that the posterior distribution becomes
g(Δε_RA, σ |y) ∝ f(y | Δε_RA, σ)g(Δε_RA)g(σ). (11)
We are now tasked with assigning functional forms to the priors g(Δε_RA) and g(σ). Though one hopes that the result of the inference is not too dependent on the choice of prior, it is important to choose one that is in agreement with our physical and physiological intuition of the system.

We can impose physically reasonable bounds on the possible values of the DNA binding energy Δε_RA. We can say that it is unlikely that any given mutation in the DNA binding domain will result in an affinity greater than that of biotin to streptavidin (1 fM ≈ −35 k_BT, BNID 107139 (Milo et al. 2010)), one of the strongest known non-covalent bonds. Similarly, it is unlikely that a given mutation will result in a large, positive binding energy (∼ 1 to 10 k_BT), indicating that non-specific binding is preferable to specific binding. While it is unlikely for the DNA binding energy to exceed these bounds, it is not impossible, meaning we should not impose these limits as hard boundaries. Rather, we can define a weakly informative prior as a normal distribution whose mean is approximately the average of these bounds and whose standard deviation is of the same magnitude,
g(Δε_RA) ∼ Normal{ − 12, 12} (12)
whose probability density function is shown in Fig. 2 (A).

By definition, fold-change is restricted to the bounds [0, 1]. Measurement noise and fluctuations in autofluorescence background subtraction mean that experimental measurements of fold-change can extend beyond these bounds, though not substantially. The scale parameter σ must, by definition, be greater than zero. We also know that for the measurements to be of any use, the error should be less than the available range of fold-change, 1.0. We can choose such a prior as a half normal distribution
$$ g(\sigma) = {1 \over \phi}\sqrt{2 \over \pi}\exp\left[-{\sigma^2 \over 2\phi^2}\right];\, \forall\, \sigma \geq 0 \qquad(13)$$
where ϕ is the standard deviation. By choosing ϕ = 0.1, it is unlikely that σ ≥ 1, yet not impossible, permitting the occasional measurement significantly outside of the theoretical bounds. The probability density function for this prior is shown in Fig. 2 (B).
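
The two priors can be interrogated directly to see how much probability they place inside and outside the physical bounds discussed above. The following sketch uses closed-form expressions from the Python standard library; the helper names are our own.

```python
import math

def half_normal_pdf(sigma, phi):
    """Eq. 13; the density is zero for negative sigma."""
    if sigma < 0:
        return 0.0
    return math.sqrt(2 / math.pi) / phi * math.exp(-sigma**2 / (2 * phi**2))

def half_normal_tail(x, phi):
    """P(sigma >= x) under the half-normal prior of Eq. 13, for x >= 0."""
    return math.erfc(x / (phi * math.sqrt(2)))

def normal_cdf(x, mu, sd):
    """Cumulative distribution function of Normal(mu, sd)."""
    return 0.5 * math.erfc(-(x - mu) / (sd * math.sqrt(2)))

# Prior mass of Eq. 12 between the approximate physical bounds of -35 and 10 k_BT.
mass_inside = normal_cdf(10, -12, 12) - normal_cdf(-35, -12, 12)
print(f"P(-35 < ep_RA < 10) = {mass_inside:.3f}")

# Probability that the noise scale exceeds the full fold-change range.
print(f"P(sigma >= 1) = {half_normal_tail(1.0, 0.1):.2e}")
```

The first number shows that the bounds are soft, since a small fraction of the prior mass lies outside them, while the second confirms that σ ≥ 1 is permitted but extremely improbable.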

While these choices for the priors seem reasonable, we can check their appropriateness by using them to simulate a data set and checking that the hypothetical fold-change measurements obey our physical and physiological intuition.

Figure 2: **Prior distributions and prior predictive check for estimation of the DNA binding energy.** (A) Prior probability density function for DNA binding energy Δε_RA as ∼ Normal( − 12, 12). (B) Prior probability density function for the standard deviation in measurement noise σ∼ HalfNormal(0, 0.1). (C) Percentiles of values drawn from the likelihood distribution given draws from prior distributions given R = 260, K_A = 139 μM, K_I = 0.53 μM, and Δε_AI = 4.5 k_BT, which match the parameters used in Razo-Mejia et al. (2018). Black points at top of (A) and (B) represent draws used to generate fold-change measurements from the likelihood distribution. Percentiles in (C) generated from 800 draws from the prior distributions. For each draw from the prior distributions, a data set of 70 measurements over 12 IPTG concentrations (ranging from 0 to 5000 μM) were generated from the likelihood. The Python code (`ch7_figS2.py`) used to generate this figure can be found on the thesis GitHub repository.

Prior Predictive Checks

If our choice of prior distribution for each parameter is appropriate, we should be able to simulate data sets using these priors that match our expectations. In essence, we would hope that these prior choices would generate some data sets with fold-change measurements above 1 or below zero, but they should be infrequent. If we end up getting primarily negative values for fold-change, for example, then we can surmise that there is something wrong in our definition of the prior distribution. This method, coined a prior predictive check, was first put forward in Good (1950) and has received newfound attention in computational statistics.

We perform the simulation in the following manner. We first draw a random value for Δε_RA out of its prior distribution stated in Eq. 12 and calculate what the mean fold-change should be given our physical model. With this in hand, we draw a random value for σ from its prior distribution, specified in Eq. 13. We then generate a simulated data set by drawing ≈ 70 fold-change values across twelve inducer concentrations from the likelihood distribution which we defined in Eq. 10. This roughly matches the number of measurements made for each mutant in this work. We repeat this procedure for 800 draws from the prior distributions, which is enough to observe the occasional extreme fold-change value from the likelihood. As the DNA binding energy is the only parameter of our physical model that we are estimating, we had to choose values for the others. We kept the values of the inducer binding constants K_A and K_I the same as the wild-type repressor (139 μM and 0.53 μM, respectively). We chose to use R = 260 repressors per cell as this is the repressor copy number we used in the main text to estimate the DNA binding energies of the three mutants.
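
The simulation procedure can be sketched in a few lines. This is not the code used to generate Fig. 2. The fold-change function and `N_NS` are assumed from the physical model of Chapter 2, the inducer grid is a hypothetical set of twelve concentrations spanning 0 to 5000 μM, and six replicates per concentration (72 values) stand in for the ≈ 70 measurements per data set.

```python
import math
import random

N_NS = 4.6e6  # assumed number of nonspecific binding sites

def fold_change_theory(c, ep_ra, R=260, ka=139.0, ki=0.53, ep_ai=4.5):
    """Assumed form of the mean fold-change; energies in k_BT, c in µM."""
    p_act = 1 / (1 + math.exp(-ep_ai) * ((1 + c / ki) / (1 + c / ka)) ** 2)
    return 1 / (1 + p_act * (R / N_NS) * math.exp(-ep_ra))

random.seed(0)
# Hypothetical grid of twelve inducer concentrations (µM).
c_range = [0, 0.1, 5, 10, 25, 50, 75, 100, 250, 500, 1000, 5000]
n_draws, n_reps = 800, 6

simulated = []  # one simulated data set per draw from the priors
for _ in range(n_draws):
    ep_ra = random.gauss(-12, 12)        # Eq. 12
    sigma = abs(random.gauss(0, 0.1))    # Eq. 13, half-normal via |Normal(0, phi)|
    data = [random.gauss(fold_change_theory(c, ep_ra), sigma)
            for c in c_range for _ in range(n_reps)]
    simulated.append(data)

# Pool all simulated values and report the central 95% range.
values = sorted(v for data in simulated for v in data)
lower = values[int(0.025 * len(values))]
upper = values[int(0.975 * len(values)) - 1]
print(f"central 95% of simulated fold-change values: [{lower:.2f}, {upper:.2f}]")
```

Pooling across concentrations gives only a coarse summary; the percentiles in Fig. 2 (C) are computed separately at each inducer concentration.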

The draws from the priors are shown in Fig. 2 (A) and (B) as black points above the corresponding distribution. To display the results, we computed the percentiles of the simulated data sets at each inducer concentration. These percentiles are shown as shaded regions in Fig. 2 (C). The 5th percentile (dark purple band) has the characteristic profile of an induction curve. Given that the prior distribution for Δε_RA is centered at −12 k_BT and we chose R = 260, we expect the generated data sets to cluster about the induction profile defined by these values. More importantly, approximately 95% of the generated data sets fall between fold-change values of −0.1 and 1.1, which is within the realm of possibility given the systematic and biological noise in our experiments. The 99^th percentile maximum is approximately 1.3 and the minimum approximately −0.3. While we could tune our choice of prior further to minimize draws this far from the theoretical bounds, we err on the side of caution and accept these values, as fold-change measurements this high or low can be observed, albeit rarely. Through these prior predictive checks, we feel confident that these choices of priors are appropriate for the parameters we wish to estimate. We can now move forward and make sure that the statistical model as a whole is valid and computationally tractable.

Sensitivity Analysis and Simulation Based Calibration

Satisfied with our choice of prior distributions, we can proceed to check other properties of the statistical model and root out any pathologies lurking in our model assumptions.

To build trust in our model, we could generate a data set ỹ with a known value for σ and Δε_RA, estimate the posterior distribution g(Δε_RA, σ | ỹ), and determine how well we were able to retrieve the true value of the parameters. However, running this once or twice for handpicked values of σ and Δε_RA will not reveal edge-cases in which the inference fails, some of which may exist in our data. Rather than performing this operation once, we can run this process over a variety of data sets where the ground truth parameter value is drawn from the prior distribution (as we did for the prior predictive checks). For an arbitrary parameter θ, the joint distribution between the ground truth value θ̃, the inferred value θ, and the simulated data set ỹ can be written as
π(θ, ỹ, θ̃) = g(θ | ỹ)f(ỹ | θ̃)g(θ̃). (14)
If this process is run for a large number of simulations, Eq. 14 can be marginalized over all data sets ỹ and all ground truth values θ̃ to yield the original prior distribution,
∫dθ̃∫dỹπ(θ, ỹ, θ̃) = g(θ). (15)

This result, described by Talts et al. (2018), holds true for any statistical model and is a natural self consistency property of Bayesian inference. Any deviation between the distribution of our inferred values for θ and the original prior distribution g(θ) indicates that either our statistical model is malformed or the computational method is not behaving as expected. There are a variety of ways we can ensure that this condition is satisfied, which we outline below.
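
The self-consistency property of Eq. 15 can be demonstrated on a toy problem where the posterior is known exactly. The sketch below deliberately swaps in a conjugate normal model with known noise, not the fold-change model of this chapter, so that no MCMC is needed; all parameter values are arbitrary.

```python
import random
import statistics

random.seed(1)
mu0, tau0 = 0.0, 1.0     # toy prior Normal(mu0, tau0)
noise, n_obs = 0.5, 5    # known measurement noise and data set size

def exact_posterior(y):
    """Mean and standard deviation of the conjugate normal posterior."""
    precision = 1 / tau0**2 + len(y) / noise**2
    post_mean = (mu0 / tau0**2 + sum(y) / noise**2) / precision
    return post_mean, precision ** -0.5

pooled = []
for _ in range(2000):
    truth = random.gauss(mu0, tau0)                          # draw from g(theta~)
    y = [random.gauss(truth, noise) for _ in range(n_obs)]   # draw from f(y~ | theta~)
    post_mean, post_sd = exact_posterior(y)
    pooled.append(random.gauss(post_mean, post_sd))          # draw from g(theta | y~)

# The pooled posterior draws should be distributed like the prior.
print(statistics.mean(pooled), statistics.stdev(pooled))
```

Although each individual posterior is much narrower than the prior, the posterior draws pooled over all simulated data sets recover the mean and width of the prior, as Eq. 15 requires.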

Using the data sets generated for the prior predictive checks (shown in Fig. 2 (C)), we sampled the posterior distribution, computed Δε_RA and σ for each simulation, and checked that they matched the original prior distribution. To perform the inference, we use Markov chain Monte Carlo (MCMC) to sample the posterior distribution. Specifically, we use the Hamiltonian Monte Carlo algorithm implemented in the Stan probabilistic programming language (Carpenter et al. 2017). The specific code files can be accessed through the paper website or the associated GitHub repository. The original prior distribution and the distribution of inferred parameter values can be seen in Fig. 3 (A) and (B). For both Δε_RA and σ, we can accurately recover the ground truth distribution (purple) via sampling with MCMC (orange). For Δε_RA, there appears to be an upper and lower limit past which we are unable to accurately infer the binding energy. This can be seen in both the histogram and the empirical cumulative distribution of Fig. 3 (A) as deviations from the ground truth when the DNA binding energy is below ≈ −25 k_BT or above ≈ −5 k_BT. These limits hinder our ability to comment on exceptionally strong or weak binding affinities. However, as all mutants queried in this work exhibited binding energies between these limits, we surmise that the inferential scheme permits us to draw conclusions about the inferred DNA binding strengths.

Figure 3: **Comparison of averaged posterior and prior distributions for Δε_RA and σ.** (A) Distribution of the average values for the DNA binding energy Δε_RA (orange) overlaid with the ground truth distribution (purple). (B) Data averaged posterior (orange) for the standard deviation of fold-change measurements overlaid with the ground truth distribution (purple). Top and bottom show the same data with different visualizations. The Python code (`ch7_figS3.py`) used to generate this figure can be found on the thesis GitHub repository.

Rather than examining the agreement of the data-averaged posterior and the ground truth prior distribution solely by eye, we can compute summary statistics using the mean μ and standard deviation σ of the posterior and prior distributions which permit easier identification of pathologies in the inference. One such quantity is the posterior z-score, which is defined as
$$ z = {\mu_\text{posterior} - \tilde\theta \over \sigma_\text{posterior}}. \qquad(16)$$
This statistic summarizes how accurately the posterior recovers the ground truth value beyond simply reporting the mean, median, or mode of the posterior distribution. Z-scores around 0 indicate that the posterior is concentrating tightly about the true value of the parameter whereas large values (either positive or negative) indicate that the posterior is concentrating elsewhere. A useful feature of this metric is that the width of the posterior is also considered. It is possible that the posterior could have a mean very close to the ground truth value, but have such a narrow spread that it does not overlap with the ground truth. Only comparing the mean value to the ground truth would suggest that the inference “worked.” However, the small standard deviation generates a very large z-score, telling us that something has gone awry.

If our inferential model is behaving properly, the width of the posterior distribution should be significantly smaller than the width of the prior, meaning that the posterior is being informed by the data. The level to which the posterior is being informed by the data can be easily calculated given knowledge of both the prior and posterior distribution. This quantity, aptly named the shrinkage s, can be computed as
$$ s = 1 - {\sigma^2_\text{posterior} \over \sigma^2_\text{prior}}. \qquad(17)$$
When the shrinkage is close to zero, the variance of the posterior is approximately the same as the variance of the prior, and the model is not being properly informed by the data. When s ≈ 1, the variance of the posterior is much smaller than the variance of the prior, indicating that it is being highly informed by the data. A shrinkage less than 0 indicates that the posterior is wider than the prior distribution, revealing a severe pathology in either the model itself or the implementation.
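
Both summary statistics are straightforward to compute from a set of posterior samples. The sketch below implements Eq. 16 and Eq. 17 with the Python standard library. The five posterior samples and the ground truth value are hypothetical, and the prior standard deviation of 12 is taken from Eq. 12.

```python
import statistics

def posterior_z_score(samples, truth):
    """Eq. 16: distance of the posterior mean from the truth in posterior widths."""
    return (statistics.mean(samples) - truth) / statistics.stdev(samples)

def shrinkage(samples, prior_sd):
    """Eq. 17: one minus the ratio of posterior to prior variance."""
    return 1 - statistics.variance(samples) / prior_sd**2

# Hypothetical posterior samples of the DNA binding energy (k_BT).
samples = [-14.2, -14.0, -13.9, -14.1, -13.8]
z = posterior_z_score(samples, truth=-14.0)
s = shrinkage(samples, prior_sd=12.0)
print(z, s)
```

In practice these two numbers are computed once per simulated data set and plotted against each other, so that poorly identified or biased inferences stand out as points far from z ≈ 0 and s ≈ 1.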

In Fig. 4, we compute these summary statistics for each parameter. For both Δε_RA and σ, we see clustering of the z-score about 0 with the extrema reaching ≈ ± 3. This suggests that for the vast majority of our simulated data sets, the posterior distribution concentrated about the ground truth value. We also see that for both parameters, the posterior shrinkage s is ≈ 1, indicating that the posterior is being highly informed by the data. There is a second distribution centered ≈ 0.8 for Δε_RA, indicating that for a subset of the data sets, the posterior is only ≈ 80% narrower than the prior distribution. These samples are those that were drawn outside of the limits of ≈ − 25 to − 5 k_BT where the inferential power is limited. Nevertheless, the posterior still significantly shrank, indicating that the data strongly informs the posterior.

Figure 4: **Inferential sensitivity for estimation of Δε_RA and σ**. The posterior z-score for each posterior distribution inferred from a simulated data set is plotted against the shrinkage for (A) the DNA binding energy Δε_RA and (B) the standard deviation of fold-change measurements σ. The Python code (`ch7_figS4.py`) used to generate this figure can be found on the thesis GitHub repository.

The general self-consistency condition given by Eq. 15 provides another route to ensure that the model is computationally tractable. Say that we draw a value for the DNA binding energy from the prior distribution, simulate a data set, and sample the posterior using MCMC. The result of this sampling is a collection of N values of the parameter which may be above, below, or equal to the ground-truth value. From this set of values, we select L of them and rank order them by their value. Talts et al. (2018) derived a general theorem which states that the number of samples less than the ground truth value of the parameter (termed the rank statistic) is uniformly distributed over the interval [0, L]. As Eq. 15 must hold true for any statistical model, deviations from uniformity signal that there is a problem in the implementation of the statistical model. How the distribution deviates is also informative as different types of failures result in different distributions. The nature of these deviations, along with a more formal proof of the uniform distribution of rank statistics can be found in Talts et al. (2018) where it was originally derived.
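
The rank statistic itself is a one-line computation, and its uniformity can be illustrated on a toy problem. As a stand-in for thinned MCMC output, the sketch below draws L independent samples from the exact posterior of a conjugate normal model with known noise; this toy model and its parameter values are our own assumptions and are not the model used in this chapter.

```python
import random

def rank_statistic(samples, truth):
    """Number of posterior samples that fall below the ground truth value."""
    return sum(1 for s in samples if s < truth)

random.seed(2)
mu0, tau0 = 0.0, 1.0      # toy prior Normal(mu0, tau0)
noise, n_obs = 0.5, 5     # known measurement noise and data set size
L = 100                   # posterior samples kept per simulated data set

ranks = []
for _ in range(1000):
    truth = random.gauss(mu0, tau0)
    y = [random.gauss(truth, noise) for _ in range(n_obs)]
    # Exact conjugate posterior, standing in for thinned MCMC output.
    precision = 1 / tau0**2 + n_obs / noise**2
    post_mean = (mu0 / tau0**2 + sum(y) / noise**2) / precision
    samples = [random.gauss(post_mean, precision ** -0.5) for _ in range(L)]
    ranks.append(rank_statistic(samples, truth))

# Coarse histogram of the ranks; uniformity gives roughly equal counts per bin.
n_bins = 5
counts = [0] * n_bins
for r in ranks:
    counts[min(r * n_bins // (L + 1), n_bins - 1)] += 1
print(counts)
```

With real MCMC output the samples are autocorrelated, which is why a subset of L samples is selected from the N available before ranking.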

Given the sampling statistics for each of the simulated data sets, we took 800 of the MCMC samples of the posterior distribution for each of the 800 simulated data sets and computed the rank statistic. The distributions are shown in Fig. 5 as both histograms and ECDFs for the DNA binding energy and standard deviation. The distribution of rank statistics for both parameters appears to be uniform. The purple band overlaying the histograms (top row) as well as the purple envelopes overlaying the ECDFs (bottom row) represent the 99^th percentile expected from a true uniform distribution. The uniformity of this distribution, along with the well-behaved z-scores and shrinkage for each parameter, tells us that there are no underlying pathologies in our statistical model and that it is computationally tractable. However, this does not mean that it is correct. Whether this model is valid for the actual observed data is the topic of the next section.

Figure 5: **Rank distribution of the posterior samples from simulated data.** Top row shows a histogram of the rank distribution with n = 20 bins. Bottom row is the cumulative distribution for the same data. Purple bands correspond to the 99th percentile of expected variation from a uniform distribution. (A) Distribution for the DNA binding energy Δε_RA and (B) for the standard deviation σ. The Python code (`ch7_figS5.py`) used to generate this figure can be found on the thesis GitHub repository.

Parameter Estimation and Posterior Predictive Checks

We now turn to applying our vetted statistical model to experimental measurements. While the same statistical model was applied to all three DNA binding mutants, here we only focus on the mutant Q18M for brevity.

Using a single induction profile, we sampled the posterior distribution over both the DNA binding energy Δε_RA and the standard deviation σ using MCMC implemented in the Stan programming language. The output of this process is a set of 4000 samples of both parameters along with the value of their log posterior probabilities, which serves as an approximate measure of the probability of each value. The individual samples are shown in Fig. 6 (A). The joint distribution between Δε_RA and σ is shown in the lower left hand corner, and the marginal distributions for each parameter are shown above and to the right of the joint distribution, respectively. The joint distribution is color coded by the value of the log posterior, with bright orange and dark purple corresponding to high and low probability, respectively. The symmetric shape of the joint distribution is a telling sign that there is no correlation between the two parameters. The marginal distributions for each parameter are also relatively narrow, with the DNA binding energy covering a range of ≈ 0.6 k_BT and σ spanning ≈ 0.02. To more precisely quantify the uncertainty, we computed the shortest interval of the marginal distribution for each parameter that contains 95% of the probability. The bounds of this interval, coined the Bayesian credible region, can accommodate asymmetry in the marginal distribution since the upper and lower bounds of the estimate are reported. In the main text, we reported the DNA binding energy estimated from these data to be −15.43^{+0.06}_{−0.06} k_BT, where the first value is the median of the distribution and the super- and subscripts correspond to the upper and lower bounds of the credible region, respectively.
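
The shortest interval containing a given fraction of the probability can be computed directly from sorted samples. The sketch below is one simple way of doing so and is not necessarily the routine used in this work. The samples are drawn from a skewed gamma distribution purely to show how the interval accommodates asymmetry.

```python
import math
import random
import statistics

def shortest_interval(samples, mass=0.95):
    """Shortest interval containing a fraction `mass` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    n_in = math.ceil(mass * n)  # number of samples inside the interval
    # Width of every window of n_in consecutive sorted samples.
    widths = [ordered[i + n_in - 1] - ordered[i] for i in range(n - n_in + 1)]
    start = widths.index(min(widths))
    return ordered[start], ordered[start + n_in - 1]

random.seed(4)
# Hypothetical, deliberately skewed stand-in for a marginal posterior.
samples = [random.gammavariate(2.0, 1.0) for _ in range(4000)]
low, high = shortest_interval(samples)
median = statistics.median(samples)
print(f"{median:.2f} (+{high - median:.2f} / -{median - low:.2f})")
```

For a skewed distribution the upper and lower offsets from the median differ, which is why both bounds are reported rather than a single symmetric error.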

While looking at the shape of the posterior distribution can be illuminating, it is not enough to tell us if the parameter values extracted make sense or accurately describe the data on which they were conditioned. To assess the validity of the statistical model in describing actual data, we again turn to simulation, this time using the posterior distributions for each parameter rather than the prior distributions. The likelihood of our statistical model assumes that across the entire induction profile, the observed fold-change is normally distributed about the theoretical prediction with a standard deviation σ. If this is an accurate depiction of the generative process, we should be able to draw values from the likelihood using the sampled values for Δε_RA and σ that are indistinguishable from the actual experimental measurements. This process is known as a posterior predictive check and is a Bayesian method of assessing goodness-of-fit.

For each sample from the posterior, we computed the theoretical mean fold-change given the sampled value for Δε_RA. With this mean in hand, we used the corresponding sample for σ and drew a data set from the likelihood distribution the same size as the real data set used for the inference. As we did this for every sample of our MCMC output (a total of ≈ 4000), it is more instructive to compute the percentiles of the generated data than to show the entire output. In Fig. 6 (B), the percentiles of the generated data sets are shown overlaid with the data used for the inference. We see that all of the data points fall within the 99th percentile of simulated data sets with the 5th percentile tracking the mean of the data at each inducer concentration. As there are no systematic deviations or experimental observations that fall far outside those generated from the statistical model, we can safely say that the statistical model derived here accurately describes the observed data.

Figure 6: **Markov Chain Monte Carlo (MCMC) samples and posterior predictive check for DNA binding mutant Q18M.** (A) Marginal and joint sampling distributions for DNA binding energy Δε_RA and σ. Each point in the joint distribution is a single sample. Marginal distributions for each parameter are shown adjacent to joint distribution. Color in the joint distribution corresponds to the value of the log posterior with the progression of dark purple to bright orange corresponding to increasing probability. (B) The posterior predictive check of the model. The measurements of the fold-change in gene expression are shown as black open-faced circles. The percentiles are shown as colored bands and indicate the fraction of simulated data drawn from the likelihood that fall within the shaded region. The Python code (`ch7_figS6.py`) used to generate this figure can be found on the thesis GitHub repository.

Inferring the Free Energy from Fold-Change Measurements

In this section, we describe the statistical model to infer the free energy F from a set of fold-change measurements. We follow the same principled workflow as described previously for the DNA binding estimation, including declaration of the generative model, prior predictive checks, simulation based calibration, and posterior predictive checks. Finally, we determine an empirical limit in our ability to infer the free energy and define a heuristic which can be used to identify measurements that are likely inaccurate. To understand the statistical model and the empirical limits of detection, only the subsections Building A Generative Model and Sensitivity Limits and Systematic Errors in Inference are necessary.

Building a Generative Model

In Chapter 2, we showed that the fold-change equation can be rewritten in the form of a Fermi function,
$$ \text{fold-change} = {1 \over 1 + e^{-F / k_BT}}, \qquad(18)$$
where F corresponds to the free energy difference between the repressor bound and unbound states of the promoter. While the theory prescribes a way for us to calculate the free energy based on our knowledge of the biophysical parameters, we can directly calculate the free energy of a measurement of fold-change by simply rearranging Eq. 18 as
$$ F = -k_BT\log\left({1 \over \text{fold-change}} - 1\right). \qquad(19)$$
With perfect measurement of the fold-change in gene expression (assuming no experimental or measurement noise), the free energy can be directly calculated. However, actual measurements of the fold-change in gene expression can extend beyond the theoretical bounds of 0 and 1, for which the free energy is mathematically undefined.
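
As a brief illustration of Eqs. 18 and 19, the following sketch converts between fold-change and free energy in units of k_BT. The function names are hypothetical, and returning `None` for fold-change values outside the open interval (0, 1) is a choice made for this example to flag where Eq. 19 is undefined.

```python
import math


def fold_change_from_free_energy(free_energy):
    """Eq. 18: fold-change for a free energy F given in units of k_BT."""
    return 1.0 / (1.0 + math.exp(-free_energy))


def free_energy_from_fold_change(fold_change):
    """Eq. 19: free energy in units of k_BT, or None where it is undefined."""
    if not 0.0 < fold_change < 1.0:
        # The logarithm has no real value at or beyond the bounds 0 and 1.
        return None
    return -math.log(1.0 / fold_change - 1.0)


# A noisy measurement slightly above 1 cannot be mapped to a free energy.
undefined_case = free_energy_from_fold_change(1.05)
# A fold-change of exactly one half corresponds to F = 0.
midpoint = free_energy_from_fold_change(0.5)
```

This makes the motivation for the statistical model concrete: single noisy measurements near the bounds can fall outside (0, 1), so the free energy is computed from an inferred mean rather than from each raw value.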

As the fold-change measurements between biological replicates are independent, it is reasonable to assume that they are normally distributed about a mean value μ with a standard deviation σ. While the mean value is restricted to the bounds of [0, 1], fold-change measurements outside of these bounds are still possible given that they are distributed about the mean with a scale of σ. Thus, if we have knowledge of the mean fold-change in gene expression about which the observed fold-change is distributed, we can calculate the mean free energy as
$$ F = -k_BT\log\left({1 \over \mu} - 1\right). \qquad(20)$$
For a given set of fold-change measurements y, we wish to infer the posterior probability distribution for μ and σ, given by Bayes’ theorem as
$$ g(\mu, \sigma \,\vert\, y) \propto f(y \,\vert\, \mu, \sigma)\,g(\mu)\,g(\sigma), \qquad(21)$$
where we have dropped the normalization constant f(y) and assigned a proportionality between the posterior and joint probability distribution. Given that the measurements are independent, we define the likelihood f(y | μ, σ) as a normal distribution,
$$ f(y \,\vert\, \mu, \sigma) \sim \text{Normal}\{\mu, \sigma\}. \qquad(22)$$
While the mean μ is restricted to the interval [0, 1], there is no reason a priori to think that it is more likely to be closer to either bound. To remain uninformative and be as permissive as possible, we define a prior distribution for μ as a Uniform distribution between 0 and 1,
$$ g(\mu)=\begin{cases}{1\over \mu_\text{max} - \mu_\text{min}} & \mu_\text{min} < \mu < \mu_\text{max}\\ 0 & \text{otherwise}\end{cases}. \qquad(23)$$
Here, μ_min = 0 and μ_max = 1, reducing g(μ) to 1. For σ, we can again assume a half-normal distribution with a standard deviation of 0.1, as was used for estimating the DNA binding energy (Eq. 13),
$$ g(\sigma) = \text{HalfNormal}\{0, 0.1\}. \qquad(24)$$
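
The generative model of Eqs. 21 to 24 can be written out as an unnormalized log posterior. The sketch below is a plain-Python illustration rather than the Stan model used for the inference; the function name `log_posterior` is an assumption made for the example.

```python
import math


def log_posterior(mu, sigma, y, sigma_scale=0.1):
    """Unnormalized log posterior of Eq. 21 for fold-change measurements y."""
    # Uniform prior on mu (Eq. 23) and positivity of sigma bound the support.
    if not 0.0 < mu < 1.0 or sigma <= 0.0:
        return -math.inf
    # Normal likelihood (Eq. 22), summed over independent measurements.
    log_like = sum(
        -0.5 * math.log(2.0 * math.pi * sigma**2)
        - (y_i - mu) ** 2 / (2.0 * sigma**2)
        for y_i in y
    )
    # The uniform prior on (0, 1) has density 1, so it contributes log(1) = 0.
    log_prior_mu = 0.0
    # Half-normal prior on sigma (Eq. 24): twice a zero-centered normal density.
    log_prior_sigma = (
        math.log(2.0)
        - 0.5 * math.log(2.0 * math.pi * sigma_scale**2)
        - sigma**2 / (2.0 * sigma_scale**2)
    )
    return log_like + log_prior_mu + log_prior_sigma
```

Note that measurements in `y` may lie outside [0, 1] without issue; only the mean μ is restricted to that interval.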

With a full generative model defined, we can now use prior predictive checks to ensure that our choices of prior are appropriate for the inference.

Prior Predictive Checks

To check the validity of the chosen priors, we pulled 1000 combinations of μ and σ from their respective distributions (Fig. 7 (A)) and subsequently drew a set of 10 fold-change values (a number comparable to the number of biological replicates used in this work) from a normal distribution defined by μ and σ. To visualize the range of values generated from these checks, we computed the percentiles of the empirical cumulative distributions of the fold-change values, as can be seen in Fig. 7 (C). Approximately 95% of the generated fold-change measurements were between the theoretical bounds of [0, 1] whereas 5% of the data sets fell outside with the maximum and minimum values extending to ≈ 1.2 and − 0.2, respectively. Given our familiarity with these experimental strains and the detection sensitivity of the flow cytometer, these excursions beyond the theoretical bounds agree with our intuition. Satisfied with our choice of prior distributions, we can proceed to check the sensitivity and computational tractability of our model through simulation based calibration.
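
The prior predictive procedure described above can be sketched as follows. This is a minimal standard-library illustration under the priors of Eqs. 23 and 24; the function name, the seed, and the summary statistic are assumptions for the example, and the numbers it produces are not those reported in the text.

```python
import random


def prior_predictive(n_draws=1000, n_reps=10, seed=0):
    """Draw (mu, sigma) from the priors, then n_reps fold-change values for each."""
    rng = random.Random(seed)
    datasets = []
    for _ in range(n_draws):
        mu = rng.uniform(0.0, 1.0)          # uniform prior on the mean
        sigma = abs(rng.gauss(0.0, 0.1))    # half-normal prior with scale 0.1
        datasets.append([rng.gauss(mu, sigma) for _ in range(n_reps)])
    return datasets


datasets = prior_predictive()
values = [v for data in datasets for v in data]
# Fraction of simulated measurements that stay within the theoretical bounds.
frac_inside = sum(0.0 <= v <= 1.0 for v in values) / len(values)
```

Inspecting `frac_inside` together with the minimum and maximum of `values` gives a quick check of whether the priors generate excursions beyond [0, 1] of a plausible size.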

Simulation Based Calibration

To ensure that the parameters can be estimated with confidence, we sampled the posterior distribution of μ and σ for each data set generated from the prior predictive checks. For each inference, we computed the z-score and shrinkage for each parameter, shown in Fig. 8 (A). For both parameters, the z-scores are approximately centered about zero, indicating that the posteriors concentrate about the ground truth value of the parameter. The z-scores for σ (green points in Fig. 8 (A)) appear to be slightly off centered with more negative values than positive. This suggests that σ is more likely to be slightly overestimated in some cases. The shrinkage parameter for μ (red points) is very tightly distributed about 1.0, indicating that the prior is being strongly informed by the data. The shrinkage is more broadly distributed for σ with a minimum value of ≈ 0.5. However, the median shrinkage for σ is ≈ 0.9, indicating that half of the inferences shrank the prior distribution by at least 90%. While we could revisit the model to try and improve the shrinkage values, we are more concerned with μ which shows high shrinkage and zero-centered z-scores.
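
For reference, the two sensitivity diagnostics can be computed from posterior samples as sketched below. The definitions used here are the commonly used ones (posterior mean minus ground truth in units of the posterior standard deviation, and one minus the ratio of posterior to prior variance) and are assumed to match those introduced earlier in the chapter, which are not reproduced in this section. The function names and example samples are hypothetical.

```python
import math
import statistics


def posterior_z_score(samples, truth):
    """Distance of the posterior mean from the ground truth in posterior sd units."""
    return (statistics.fmean(samples) - truth) / statistics.stdev(samples)


def posterior_shrinkage(samples, prior_sd):
    """One minus the ratio of posterior variance to prior variance."""
    return 1.0 - statistics.variance(samples) / prior_sd**2


# Standard deviations of the two priors used in this model.
prior_sd_mu = math.sqrt(1.0 / 12.0)                    # Uniform(0, 1)
prior_sd_sigma = 0.1 * math.sqrt(1.0 - 2.0 / math.pi)  # HalfNormal with scale 0.1

# Hypothetical posterior samples of mu for a simulated data set with true mu = 0.5.
mu_samples = [0.49, 0.50, 0.51]
z = posterior_z_score(mu_samples, truth=0.5)
shrinkage = posterior_shrinkage(mu_samples, prior_sd_mu)
```

A shrinkage close to 1 means the posterior is far narrower than the prior, while a z-score near zero means the posterior is centered on the value used to simulate the data.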

To ensure that the model is computationally tractable, we computed the rank statistic of each parameter for each inference. The empirical cumulative distributions for μ (black) and σ (red) can be seen in Fig. 8 (B). Both distributions appear to be uniform, falling within the 99th percentile of the variation expected from a true uniform distribution. This indicates that the self-consistency relation defined by Eq. 15 holds for this statistical model. With a computationally tractable model in hand, we can now apply the statistical model to our data and verify that data sets drawn from the data-conditioned posterior are indistinguishable from the experimental measurements.
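
The rank statistic itself is simple to compute: it counts how many posterior samples fall below the ground-truth value. The sketch below demonstrates it on a toy conjugate model for which exact posterior draws are available without MCMC. The toy model is an assumption chosen purely for illustration and is not the model for μ and σ used in the text.

```python
import random


def rank_statistic(posterior_samples, truth):
    """Number of posterior samples that fall below the ground-truth value."""
    return sum(sample < truth for sample in posterior_samples)


def toy_sbc_ranks(n_sims=500, n_samples=19, seed=1):
    """Ranks for a toy model: theta ~ Normal(0, 1), y | theta ~ Normal(theta, 1)."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_sims):
        theta = rng.gauss(0.0, 1.0)   # ground truth drawn from the prior
        y = rng.gauss(theta, 1.0)     # one simulated observation
        # Exact posterior for this conjugate pair: Normal(y / 2, sqrt(1 / 2)).
        posterior = [rng.gauss(y / 2.0, 0.5**0.5) for _ in range(n_samples)]
        ranks.append(rank_statistic(posterior, theta))
    return ranks


ranks = toy_sbc_ranks()
```

When the posterior is sampled correctly, the ranks are uniformly distributed over the integers from 0 to `n_samples`; a histogram or cumulative distribution of `ranks` that departs from uniformity signals a problem with the sampler or the model.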

Figure 8: **Sensitivity measurements and rank statistic distribution of the statistical model estimating μ and σ.** (A) Posterior z-score of each inference plotted against the posterior shrinkage factor for the parameters μ (blue points) and σ (green points). (B) Distribution of rank statistics for μ (red) and σ (black). Purple envelope represents the 99th percentile of a true uniform distribution. The Python code (`ch7_figS8.py`) used to generate this figure can be found on the thesis GitHub repository.

Posterior Predictive Checks

The same statistical model was applied to every unique set of fold-change measurements used in this work. Here, we focus only on the set of fold-change measurements for the double mutant Y17I-Q291V at 50 μM IPTG. The samples from the posterior distribution conditioned on this dataset can be seen in Fig. 9 (A). The joint distribution, shown in the lower left-hand corner, appears fairly symmetric, indicating that μ and σ are independent. There is a slight asymmetry in the sampling of σ, which can be more clearly seen in the corresponding marginal distribution to the right of the joint distribution.

For each MCMC sample of μ and σ, we drew 10 samples from a normal distribution defined by these parameters. From this collection of data sets, we computed the percentiles of the empirical cumulative distribution and plotted them over the data, as can be seen in Fig. 9 (B). We find that the observed data fall within the 99th percentile of the generated data sets. This illustrates that the model can produce data that are identically distributed to the actual experimental measurements, validating our choice of statistical model.
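
A posterior predictive check of this kind can be sketched as below. The posterior samples here are synthetic stand-ins with hypothetical values, and summarizing the generated data sets by an envelope around each order statistic is one possible way of computing percentiles of the empirical cumulative distributions; it is not necessarily the procedure used to generate Fig. 9.

```python
import random


def posterior_predictive(mu_samples, sigma_samples, n_reps=10, seed=0):
    """Draw one data set of n_reps values for every (mu, sigma) posterior sample."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sigma) for _ in range(n_reps)]
            for mu, sigma in zip(mu_samples, sigma_samples)]


def ecdf_envelope(datasets, lower=0.005, upper=0.995):
    """Bounds on the k-th smallest value across data sets, for each k."""
    sorted_sets = [sorted(data) for data in datasets]
    envelope = []
    for k in range(len(sorted_sets[0])):
        kth_values = sorted(data[k] for data in sorted_sets)
        low = kth_values[int(lower * (len(kth_values) - 1))]
        high = kth_values[int(upper * (len(kth_values) - 1))]
        envelope.append((low, high))
    return envelope


# Synthetic stand-ins for 4000 posterior samples (hypothetical values).
rng = random.Random(42)
mu_samples = [rng.gauss(0.6, 0.02) for _ in range(4000)]
sigma_samples = [abs(rng.gauss(0.05, 0.01)) for _ in range(4000)]
envelope = ecdf_envelope(posterior_predictive(mu_samples, sigma_samples))
```

Sorting the observed measurements and checking that the k-th smallest one lies within `envelope[k]` gives a numerical counterpart to the visual comparison in the figure.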

Figure 9: **MCMC sampling output and posterior predictive checks of the statistical model for the mean fold-change μ and standard deviation σ.** (A) Corner plot of sampling output. The joint distribution between σ and μ is shown in the lower left hand corner. Each point is an individual sample. Points are colored by the value of the log posterior with increasing probability corresponding to transitions from purple to orange. Marginal distributions for each parameter are shown adjacent to the joint distribution. (B) Percentiles of the cumulative distributions from the posterior predictive checks are shown as shaded bars. Data on which the posterior was conditioned are shown as white orange circles and lines. The Python code (`ch7_figS9.py`) used to generate this figure can be found on the thesis GitHub repository.

Sensitivity Limits and Systematic Errors in Inference

Considering the results from the prior predictive checks, simulation based calibration, and posterior predictive checks, we can say that the statistical model for inferring μ and σ from a collection of noisy fold-change measurements is valid and computationally tractable. Upon applying this model to the experimental data of the wild-type strain (where the free energy is theoretically known), we observed that systematic errors arise when the fold-change is exceptionally high or low, making the resulting inference of the free energy inaccurate.

To elucidate the source of this systematic error, we return to a simulation based approach in which the true free energy is known (black points in Fig. 10 (A)). For a range of free energies, we computed the theoretical fold-change prescribed by Eq. 18. For each free energy value, we pulled a value for σ from the prior distribution defined in Eq. 13 and generated a data set of 10 measurements by drawing values from a normal distribution defined by the true fold-change and the drawn value of σ (purple points in Fig. 10 (A)). We then sampled the statistical model over these data and inferred the mean fold-change μ (orange points in Fig. 10 (A)). By eye, the inferred points appear to collapse onto the master curve, in many cases overlapping the true values. However, the points with a free energy less than ≈ − 2 k_BT and greater than ≈ 2 k_BT are slightly above or below the master curve, respectively. This becomes more obvious when the inferred free energy is plotted as a function of the true free energy, shown in Fig. 10 (B). Points in which the difference between μ and the nearest boundary (0 or 1) is less than the value of σ are shown as purple or green. When this condition is met, the inferred mean free energy strays from the true value, introducing a systematic error. This suggests that the spread of the fold-change measurements sets the detection limit of fold-change close to either boundary. Thus, the narrower the spread in the fold-change the better the estimate of the fold-change near the boundaries.

These systematic errors can be seen in experimental measurements of the wild-type repressor. Data from Razo-Mejia et al. (2018), in which the IPTG titration profiles of seventeen different bacterial strains were measured, are shown collapsed onto the master curve in Fig. 10 (C) as red points. Here, each point corresponds to a single biological replicate. The inferred mean fold-change μ and 95% credible regions are shown as purple, blue, or green points. The color of these points corresponds to the relative value of μ or 1 − μ to σ. The discrepancy between the predicted and inferred free energy of each measurement set can be seen in Fig. 10 (D). The significant deviation between the predicted and inferred free energy occurs past the detection limit set by σ. In this work, we therefore opted to not display inferred free energies at the extrema where the inferred fold-change was closer to the boundaries than the corresponding standard deviation, as it reflects limitations in our measurement rather than a deviation from the theoretical predictions.
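
The heuristic described above amounts to comparing the inferred mean fold-change with its distance to each boundary. A minimal sketch follows; the function names and the returned labels are hypothetical and chosen only for this example.

```python
def within_detection_limit(mu, sigma):
    """True when the inferred mean is farther than sigma from both 0 and 1."""
    return mu > sigma and (1.0 - mu) > sigma


def classify_inference(mu, sigma):
    """Label an inference by which boundary, if any, limits its reliability."""
    if mu <= sigma:
        return "too close to 0"
    if (1.0 - mu) <= sigma:
        return "too close to 1"
    return "reliable"


# Median values of mu and sigma from three hypothetical inferences.
labels = [classify_inference(mu, sigma)
          for mu, sigma in [(0.50, 0.05), (0.02, 0.05), (0.99, 0.05)]]
```

Inferences labeled as too close to either boundary are the ones whose free energies would be withheld from display under this criterion.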

Figure 10: **Identification of systematic error in simulated and real data when considering the free energy.** (A) The true fold-change (black points), simulated fold-change distribution (purple points), and inferred mean fold-change (orange) are plotted as a function of the true free energy. Error bars on inferred fold-change correspond to the 95% credible region of the mean fold-change μ. (B) Inferred free energy plotted as a function of the true free energy. Black line indicates perfect agreement between the ground truth free energy and inferred free energy. Blue points correspond to the inferred free energy where the median values of the parameters satisfy the condition μ > σ and 1 − μ > σ. Purple points correspond to those where the inferred mean fold-change μ < σ. Green points correspond to those where the inferred mean fold-change 1 − μ < σ. Error bars correspond to the bounds of the 95% credible region. (C) Biological replicate data from Razo-Mejia et al. (2018) (red points) plotted as a function of the theoretical free energy. Inferred mean fold-change μ and the 95% credible region are shown as blue points. Purple and green points are colored by the same conditions as in (B). (D) Inferred free energy as a function of the predicted free energy colored by the satisfied condition. Error bars are the bounds of the 95% credible region. All inferred values in (A – D) are the median values of the posterior distribution. The Python code (`ch7_figS10.py`) used to generate this figure can be found on the thesis GitHub repository.

Additional Characterization of DNA Binding Mutants

In Chapter 3, we estimated the DNA binding energy of each mutant using the mutant strains that had approximately 260 repressors per cell. In this section, we examine the effect of the choice of fit strain on the predictions of both the induction profiles and ΔF for each DNA binding domain mutant.

We applied the statistical model derived in Section 2 of this chapter for each unique strain of the DNA binding mutants and estimated the DNA binding energy. The median of the posterior distribution along with the upper and lower bounds of the 95% credible region are reported in Table 7.1. We found that the choice of fitting strain did not strongly influence the estimate of the DNA binding energy. The largest deviations appear for the weakest binding mutants paired with the lowest repressor copy number. In these cases, such as for Q18A, the difference in binding energy between the repressor copy numbers is ≈ 1 k_BT which is small compared to the overall DNA binding energy. Using these energies, we computed the predicted induction profiles of each mutant with different repressor copy numbers, shown in Fig. 11. In this plot, the rows correspond to the repressor copy number of the strain used to estimate the DNA binding energy. The columns correspond to the repressor copy number of the predicted strains. The diagonals, shaded in grey, show the induction profile of the fit strain along with the corresponding data. In all cases, we find that the predicted profiles are relatively accurate with the largest deviations resulting from using the lowest repressor copy number as the fit strain.

Table 7.1: Estimated DNA binding energy for DNA binding domain mutants with different repressor copy numbers. Reported values are the median of the posterior distribution with the upper and lower bounds of the 95% credible regions.
Mutant	Repressors	DNA Binding Energy [k_BT]
Q18A	60	−9.8_{−0.2}^{+0.2}
	124	−10.3_{−0.1}^{+0.1}
	260	−11.0_{−0.1}^{+0.1}
	1220	−11.3_{−0.1}^{+0.1}
Q18M	60	−15.83_{−0.08}^{+0.08}
	124	−15.7_{−0.1}^{+0.1}
	260	−15.43_{−0.06}^{+0.07}
	1220	−15.27_{−0.07}^{+0.07}
Y17I	60	−9.4_{−0.3}^{+0.3}
	124	−9.5_{−0.1}^{+0.1}
	260	−9.9_{−0.1}^{+0.1}
	1220	−10.1_{−0.2}^{+0.2}

The predicted change in free energy ΔF using each fit strain can be seen in Fig. 12. In this figure, the rows represent the repressor copy number of the strain to which the DNA binding energy was fit, whereas the columns correspond to each mutant. In each plot, we have shown the data for all repressor copy numbers with the fit strain represented by white filled circles. Much as for the induction profiles, we see little difference in the predicted ΔF for each strain, all of which accurately describe the inferred free energies. The ability to accurately predict the majority of the induction profiles of each mutant with repressor copy numbers ranging over two orders of magnitude strengthens our assessment that for these DNA binding domain mutations, only the DNA binding energy is modified.

Figure 11: **Pairwise comparisons of DNA binding mutant induction profiles.** Rows correspond to the repressor copy number of the strain used to estimate the DNA binding energy for each mutant. Columns correspond to the repressor copy number of the strains that are predicted. Diagonal panels, which show the data used to estimate the DNA binding energy, have a gray background. The Python code (`ch7_figS11.py`) used to generate this figure can be found on the thesis GitHub repository.

Figure 12: **Dependence of ΔF predictions on the fitting strain for DNA binding domain mutants.** Rows correspond to the repressor copy number used to estimate the DNA binding energy. Columns correspond to the particular mutant. Colored lines are the bounds of the 95% credible region of the predicted ΔF. Open face points indicate the strain to which the DNA binding energy was fit. The Python code (`ch7_figS12.py`) used to generate this figure can be found on the thesis GitHub repository.

Bayesian Parameter Estimation for Inducer Binding Domain Mutants

In Chapter 3, we put forward two naïve hypotheses for which parameters of our fold-change equation are affected by mutations in the inducer binding domain of the repressor. The first hypothesis was that only the inducer dissociation constants, K_A and K_I, were perturbed from their wild-type values. Another hypothesis was that the inducer dissociation constants were affected in addition to the energetic difference between the active and inactive states of the repressor, Δε_AI.

In this section, we first derive the statistical model for each hypothesis and then perform a series of diagnostic tests that expose the inferential limitations of each model. With well-calibrated statistical models, we then apply each to an induction profile of the inducer binding mutant Q291K and assess the validity of each hypothesis. To understand the statistical models for each hypothesis, only the subsection Building A Generative Statistical Model is necessary.