Chapter VII

Supplemental Information For Chapter III: Predictive Shifts in Free Energy Couple Mutations to Their Phenotypic Consequences

Published as …

A version of this chapter originally appeared as Chure, G, Razo-Mejia, M., Belliveau, N.M., Kaczmarek, Zofii A., Einav, T., Barnes, Stephanie L., Lewis, M., and Phillips, R. (2019). Predictive shifts in free energy couple mutations to their phenotypic consequences. PNAS 116(37) DOI: https://doi.org/10.1073/pnas.1907869116. G.C., M.R.M, N.M.B., Z.A.K., and S.L.B designed the experiments and collected and analyzed data. G.C. developed theoretical treatment of free energy shifts. G.C., M.R.M, N.M.B., Z.A.K., T.E., S.L.B., and R.P. designed the research project. G.C. and R.P. wrote the paper. M.L. provided guidance and advice.

Non-Monotonic Behavior of ΔF Under Changing KA and KI

In Chapter 3, we illustrated that perturbations only to the allosteric parameters KA and KI relative to the wild-type values can result in a non-monotonic dependence of ΔF on the inducer concentration c. In this section, we prove that when the ratio of KA to KI is the same between the mutant and wild-type proteins, the function must be monotonic. This section is paired with an interactive figure available on the paper website which illustrates how scaling KA and KI relative to the wild-type value results in non-monotonic behavior.

     We define a monotonic function as a continuous function whose derivative does not change sign across the domain upon which it is defined. To show that ΔF is non-monotonic when KA and KI are perturbed, we can compute the derivative of ΔF with respect to the inducer concentration c and evaluate the sign of the derivative at the limits of inducer concentration. If the sign of the derivative is different at the limits of c = 0 and c ≫ 0, we can see that the function is non-monotonic. However, if the sign is the same in both limits, we can not say conclusively if it is non-monotonic and must consider other diagnostics.

     The free energy difference between a mutant and wild-type repressor when all parameters other than KA and KI are unperturbed can be written as
$$ \beta\Delta F(c) = -\log\left({\left[1 + e^{-\beta\Delta\varepsilon_{AI} }\left({1 + {c \over K_I^\text{(mut)} } \over 1 + {c \over K_A^\text{(mut)} }}\right)^2\right]^{-1} \over \left[1 + e^{-\beta\Delta\varepsilon_{AI} }\left({1 + {c \over K_I^\text{(wt)} } \over 1 + {c \over K_A^\text{(wt)} }}\right)^2\right]^{-1} }\right), \qquad(1)$$
in which ΔεAI is the energy difference between the active and inactive states of the repressor, c is the inducer concentration, and β = 1/kBT where kB is the Boltzmann constant and T is the temperature. The derivative with respect to c, which we determined using Mathematica’s (Wolfram Research, version 11.2) symbolic computing ability, is given as
$$ \begin{aligned} {\partial \beta\Delta F(c) \over \partial c} &= 2e^{-\beta\Delta\varepsilon_{AI} }\times \\ &\left(\frac{ {K_A^\text{(mut)} }^2 \left(K_A^\text{(mut)} - K_I^\text{(mut)}\right)\left(c + K_I^\text{(mut)}\right)}{\left(c + K_A^\text{(mut)}\right) \left[\left(c + K_A^\text{(mut)}\right)^2{K_I^\text{(mut)} }^2 + e^{-\beta\Delta\varepsilon_{AI} }{K_A^\text{(mut)} }^2 \left(c + K_I^\text{(mut)}\right)^2\right] } \right. \\ &- \left. \frac{ {K_A^\text{(wt)} }^2\left(K_A^\text{(wt)} - K_I^\text{(wt)}\right)\left(c + K_I^\text{(wt)}\right)}{\left(c + K_A^\text{(wt)}\right) \left[\left(c + K_A^\text{(wt)}\right)^2{K_I^\text{(wt)} }^2 + e^{-\beta\Delta\varepsilon_{AI} }{K_A^\text{(wt)} }^2\left(c + K_I^\text{(wt)}\right)^2\right]} \right). \end{aligned} \qquad(2)$$

This unwieldy expression can be simplified by defining the values of KA(mut) = θKA(wt) and KI(mut) = θKI(wt) as relative changes to the wild-type values where θ is a scaling parameter. While we can permit KA(mut) and KI(mut) to vary by different degrees, we will consider the case in which they are equally perturbed such that the ratio of KA to KI is the same between the mutant and wild-type versions of the repressor. While the equations become more cumbersome when one permits the dissociation constants to vary by different amounts (i.e. θKA, θKI), one arrives at the same conclusion. This definition allows us to rewrite Eq. 2 in the form of
$$ \begin{aligned} {\partial \beta\Delta F(c) \over \partial c} &= 2{K_A^\text{(wt)} }e^{-\beta\Delta\varepsilon_{AI} }\times \\ & \left({\theta^3 \left({K_A^\text{(wt)} }- {K_I^\text{(wt)} }\right)\left(c + \theta {K_I^\text{(wt)} }\right) \over \left(c + \theta {K_A^\text{(wt)} }\right)\left[\theta^2 {K_I^\text{(wt)} }^2\left(c + \theta {K_A^\text{(wt)} }\right)^2 + e^{-\beta\Delta\varepsilon_{AI} }\theta^2{K_A^\text{(wt)} }^2\left(c + \theta {K_I^\text{(wt)} }\right)^2\right]} \right. \\ &- \left. {\left({K_A^\text{(wt)} } - {K_I^\text{(wt)} }\right)\left(c + {K_I^\text{(wt)} }\right) \over \left(c + {K_A^\text{(wt)} }\right) \left[\left(c + {K_A^\text{(wt)} }\right)^2{K_I^\text{(wt)} }^2 + e^{-\beta\Delta\varepsilon_{AI} }{K_A^\text{(wt)} }^2\left(c + {K_I^\text{(wt)} }\right)^2\right]} \right). \end{aligned} \qquad(3)$$
     With this derivative in hand, we can examine the limits of inducer concentration. As discussed in the main text, the free energy difference between the mutant and wild-type repressors when c = 0 should be equal to 0. However, the derivative at c = 0 will be different between the wild-type and the mutant. In this limit, Eq. 3 simplifies to
$$ {\partial \beta\Delta F(c) \over \partial c}\bigg\vert_{c = 0} = {2e^{-\beta\Delta\varepsilon_{AI} }\left({K_A^\text{(wt)} } - {K_I^\text{(wt)} }\right) \over {K_A^\text{(wt)} }{K_I^\text{(wt)} }\left(1 + e^{-\beta\Delta\varepsilon_{AI} }\right)}\left({1 \over \theta} - 1\right). \qquad(4)$$

     When θ < 1, meaning that the affinity of the active and inactive states of the repressor to the inducer is increased relative to wild-type, the derivative is positive. Thus, the repressor bound state of the promoter becomes less energetically favorable than the repressor bound state. Similarly, if θ > 1, binding of the inducer to the mutant repressor is weaker than the wild-type repressor, making βΔF(c)/∂c < 0, meaning the repressor bound state becomes more energetically favorable than the repressor unbound state of the promoter.

With an intuition for the sign of the derivative when c = 0, we can compute the derivative at another extreme where c ≫ 0. Here, Eq. 3 reduces to
$$ {\partial \beta\Delta F(c) \over \partial c}\bigg\vert_{c \gg 0} \approx {2e^{-\beta\Delta\varepsilon_{AI} }{K_A^\text{(wt)} }^2({K_A^\text{(wt)} } - {K_I^\text{(wt)} }) \over c^2\left({K_I^\text{(wt)} }^2 + e^{-\beta\Delta\varepsilon_{AI} }{K_A^\text{(wt)} }^2\right)} \left(\theta - 1\right). \qquad(5)$$
     When θ > 1, Eq. 5 is positive. This is the opposite sign of the derivative when c = 0 when θ > 1. When θ < 1, Eq. 5 becomes negative whereas Eq. 4 is positive. As the derivative of ΔF with respect to c changes signs across the defined range of inducer concentrations, we can say the function is non-monotonic.

     Fig. 1 shows the non-monotonic behavior of ΔF when KA and KI change by the same factor θ (maintaining the wild-type ratio, Fig. 1 (A)) and when KA and KI change by different factors (Fig. 1 (B)). In both cases, non-monotonic behavior is observed with the peak difference in the free energy covering several kBT. We have hosted an interactive figure similar to Fig. 1 on the paper website where the reader can modify how KA and KI are affected by a mutation and examine how the active probability, free energy difference, and βΔF/∂c are tuned.

Figure 1: Non-monotonic behavior of ΔF with changes in KA and KI. Middle column shows the allosteric contribution of free energy F plotted as a function of the inducer concentration. Right column shows the free energy difference ΔF as a function of inducer concentration, revealing non-monotonicity. (A) Behavior of F and ΔF when the values of KA and KI change relative to wild-type, but maintain the same ratio. θ is the scaling factor for both inducer dissociation constants. (B) Behavior of F and ΔF when the values of KA and KI change relative to the wild-type, but by different factors. In both panels, the wild-type parameter values were taken to be KA = 200 μM, KI = 1 μM, and ΔεAI = 4.5 kBT. An interactive version of this figure is available on the paper website. The Python code (ch7_figS1.py) used to generate this figure can be found on the thesis GitHub repository.

Bayesian Parameter Estimation for DNA Binding Mutants

In this section, we outline the statistical model used in this work to estimate the DNA binding energy for a given mutation in the DNA binding domain. The methodology presented here is similar to that performed in Chapter 2 and outlined in accompanying Chapter 6. In the following text, we take a very detailed approach to vetting the robustness of our statistical inference machinery as determination of parameter values is critical to assessing the effects of mutations. Similarly to what is presented in Chapter 6, we begin with a derivation of our statistical model using Bayes’ theorem and then perform a series of principled steps to validate our choices of priors, ensure computational feasibility, and assess the validity of the model given the collected data. This work follows the analysis pipeline outlined by Michael Betancourt in his case-study entitled “Towards A Principled Bayesian Workflow.”

     The second subsection Building a Generative Statistical Model lays out the statistical model used in this work to estimate the DNA binding energy and the error term σ. The subsequent subsections – Prior Predictive Checks, Simulation Based Calibration, and Posterior Predictive Checks – define and summarize a series of tests that ensure that the parameters of the statistical model can be identified and are computationally tractable. To understand how we defined our statistical model, only the second subsection is needed.

Calculation of the Fold-Change in Gene Expression

     We appreciate the subtleties of the efficiency of photon detection in the flow cytometer, fluorophore maturation and folding, and autofluorescence correction, and we understand the importance in modeling the effects that these processes have on the reported value of the fold-change. However, in order to be consistent with the methods used in the literature, we took a more simplistic approach to calculate the fold-change. Given a set of fluorescence measurements of the constitutive expression control (R = 0), an autofluorescence control (no YFP), and the experimental strain (R > 0), we calculate the fold-change as
$$ \text{fold-change} = {\langle I_\text{cell}(R > 0)\rangle - \langle I_\text{autofluorescence}\rangle \over \langle I_\text{cell}(R = 0) \rangle - \langle I_\text{autofluorescence}\rangle}. \qquad(6)$$
It is important to note here that for a given biological replicate, we consider only a point estimate of the mean fluorescence for each sample and perform a simple subtraction to adjust for background fluorescence. For the analysis going forward, all mentions of measured fold-change are determined by this calculation.

Building a Generative Statistical Model

To identify the minimal parameter set affected by a mutation, we assume that mutations in the DNA binding domain of the repressor alters only the DNA binding energy ΔεRA, while the other parameters of the repressor are left unperturbed from their wild-type values. As a first approach, we can assume that all of the other parameters are known without error and can be taken as constants in our physical model. Ultimately, we want to know how probable a particular value of ΔεRA is given a set of experimental measurements y. Bayes’ theorem computes this distribution, termed the posterior distribution as
$$ g(\Delta\varepsilon_{RA}\,\vert\, y) = {f(y\,\vert\, \Delta\varepsilon_{RA}) g(\Delta\varepsilon_{RA}) \over f(y)} \qquad(7)$$
where we have used g and f to represent probability densities over parameters and data, respectively. The expression f(y |ΔεRA) captures the likelihood of observing our data set y given a value for the DNA binding energy under our physical model. All knowledge we have of what the DNA binding energy could be, while remaining completely ignorant of the experimental measurements, is defined in g(ΔεRA), referred to as the prior distribution. Finally, the likelihood that we would observe the data set y while being ignorant of our physical model is defined by the denominator f(y). In this work, this term serves only as a normalization factor and as a result will be treated as a constant. We can therefore say that the posterior distribution of ΔεRA is proportional to the joint distribution between the likelihood and the prior,
g(ΔεRA | y) ∝ f(y | ΔεRA)g(ΔεRA).   (8)

     We are now tasked with translating this generic notation into a concrete functional form. Our physical model derived in Chapter 2 computes the average fold-change in gene expression. Speaking practically, we make several replicate measurements of the fold-change to reduce the effects of random errors. As each replicate is independent of the others, it is reasonable to expect that these measurements will be normally distributed about the theoretical value of the fold-change μ, computed for a given ΔεRA. We can write this mathematically for each measurement as
$$ f(y\,\vert\, \Delta\varepsilon_{RA}) = {1 \over (2\pi\sigma^2)^{N/2}}\prod\limits_{i}^N \exp\left[-(y_i - \mu(\Delta\varepsilon_{RA}))^2 \over 2\sigma^2\right], \qquad(9)$$
where N is the number of measurements in y and yi is the ith experimental fold-change measurement. We can write this likelihood in shorthand as
f(y | ΔεRA) = Normal{μ(ΔεRA), σ}    (10)
which we will use for the remainder of this section.

     Using a normal distribution for our likelihood has introduced a new parameter σ which describes the spread of our measurements about the true value. We must therefore include it in our parameter estimation and assign an appropriate prior distribution such that the posterior distribution becomes
g(ΔεRA, σ |y) ∝ f(y | ΔεRA, σ)g(ΔεRA)g(σ).   (11)
We are now tasked with assigning functional forms to the priors g(ΔεRA) and g(σ). Though one hopes that the result of the inference is not too dependent on the choice of prior, it is important to choose one that is in agreement with our physical and physiological intuition of the system.

     We can impose physically reasonable bounds on the possible values of the DNA binding energy ΔεRA. We can say that it is unlikely that any given mutation in the DNA binding domain will result in an affinity greater than that of biotin to streptavidin (1 fM ≈  − 35 kBT, BNID 107139 (Milo et al. 2010)), one of the strongest known non-covalent bonds. Similarly, it is unlikely that a given mutation will result in a large, positive binding energy, indicating that non-specific binding is preferable to specific binding ( ∼ 1 to 10 kBT). While it is unlikely for the DNA binding energy to exceed these bounds, it is not impossible, meaning we should not impose these limits as hard boundaries. Rather, we can define a weakly informative prior as a normal distribution with a mean and standard deviation as the average of these bounds,
g(ΔεRA) ∼ Normal{ − 12, 12}   (12)
whose probability density function in shown in Fig. 2 (A).

     By definition, fold-change is restricted to the bounds [0, 1]. Measurement noise and fluctuations in autofluorescence background subtraction means that experimental measurements of fold-change can extend beyond these bounds, though not substantially. By definition, the scale parameter σ must be positive and greater than zero. We also know that for the measurements to be of any use, the error should be less than the available range of fold-change, 1.0. We can choose such a prior as a half normal distribution
$$ g(\sigma) = {1 \over \phi}\sqrt{2 \over \pi}\exp\left[-{\sigma^2 \over 2\phi^2}\right];\, \forall\, \sigma \geq 0 \qquad(13)$$
where ϕ is the standard deviation. By choosing ϕ = 0.1, it is unlikely that σ ≥ 1, yet not impossible, permitting the occasional measurement significantly outside of the theoretical bounds. The probability density function for this prior is shown in Fig. 2 (B).

     While these choices for the priors seem reasonable, we can check their appropriateness by using them to simulate a data set and checking that the hypothetical fold-change measurements obey our physical and physiological intuition.

Figure 2: Prior distributions and prior predictive check for estimation of the DNA binding energy. (A) Prior probability density function for DNA binding energy ΔεRA as Normal( − 12, 12). (B) Prior probability density function for the standard deviation in measurement noise σ HalfNormal(0, 0.1). (C) Percentiles of values drawn from the likelihood distribution given draws from prior distributions given R = 260, KA = 139 μM, KI = 0.53 μM, and ΔεAI = 4.5 kBT, which match the parameters used in Razo-Mejia et al. (2018). Black points at top of (A) and (B) represent draws used to generate fold-change measurements from the likelihood distribution. Percentiles in (C) generated from 800 draws from the prior distributions. For each draw from the prior distributions, a data set of 70 measurements over 12 IPTG concentrations (ranging from 0 to 5000 μM) were generated from the likelihood. The Python code (ch7_figS2.py) used to generate this figure can be found on the thesis GitHub repository.

Prior Predictive Checks

If our choice of prior distribution for each parameter is appropriate, we should be able to simulate data sets using these priors that match our expectations. In essence, we would hope that these prior choices would generate some data sets with fold-change measurements above 1 or below zero, but they should be infrequent. If we end up getting primarily negative values for fold-change, for example, then we can surmise that there is something wrong in our definition of the prior distribution. This method, coined a prior predictive check, was first put forward in Good (1950) and has received newfound attention in computational statistics.

     We perform the simulation in the following manner. We first draw a random value for ΔεRA out of its prior distribution stated in Eq. 12 and calculate what the mean fold-change should be given our physical model. With this in hand, we draw a random value for σ from its prior distribution, specified in Eq. 13. We then generate a simulated data set by drawing 70 fold-change values across twelve inducer concentrations from the likelihood distribution which we defined in Eq. 10. This roughly matches the number of measurements made for each mutant in this work. We repeat this procedure for 800 draws from the prior distributions, which is enough to observe the occasional extreme fold-change value from the likelihood. As the DNA binding energy is the only parameter of our physical model that we are estimating, we had to choose values for the others. We kept the values of the inducer binding constants KA and KI the same as the wild-type repressor (139 μM and 0.53 μM, respectively). We chose to use R = 260 repressors per cell as this is the repressor copy number we used in the main text to estimate the DNA binding energies of the three mutants.

     The draws from the priors are shown in Fig. 2 (A) and (B) as black points above the corresponding distribution. To display the results, we computed the percentiles of the simulated data sets at each inducer concentration. These percentiles are shown as red shaded regions in Fig. 2 (C). The 5th percentile (dark purple band) has the characteristic profile of an induction curve. Given that the prior distribution for ΔεRA is centered at  − 12 kBT and we chose R = 260, we expect the generated data sets to cluster about the induction profile defined by these values. More importantly, approximately 95% of the generated data sets fall between fold-change values of -0.1 and 1.1, which is within the realm of possibility given the systematic and biological noise in our experiments. The 99th percentile maximum is approximately 1.3 and the minimum approximately  − 0.3. While we could tune our choice of prior further to minimize draws this far from the theoretical bounds, we err on the side of caution and accept these values as it is possible that fold-change measurements this high or low can be observed, albeit rarely. Through these prior predictive checks, we feel confident that these choices of priors are appropriate for the parameters we wish to estimate. We can now move forward and make sure that the statistical model as a whole is valid and computationally tractable.

Sensitivity Analysis and Simulation Based Calibration

Satisfied with our choice of prior distributions, we can proceed to check other properties of the statistical model and root out any pathologies lurking in our model assumptions.

     To build trust in our model, we could generate a data set with a known value for σ and ΔεRA, estimate the posterior distribution g(ΔεRA, σ | ), and determine how well we were able to retrieve the true value of the parameters. However, running this once or twice for handpicked values of σ and ΔεRA will not reveal edge-cases in which the inference fails, some of which may exist in our data. Rather than performing this operation once, we can run this process over a variety of data sets where the ground truth parameter value is drawn from the prior distribution (as we did for the prior predictive checks). For an arbitrary parameter θ, the joint distribution between the ground truth value θ̃, the inferred value θ, and the simulated data set can be written as
π(θ, , θ̃) = g(θ | )f( | θ̃)g(θ̃).   (14)
If this process is run for a large number of simulations, Eq. 14 can be marginalized over all data sets and all ground truth values θ̃ to yield the original prior distribution,
dθ̃dπ(θ, , θ̃) = g(θ).   (15)

     This result, described by Talts et al. (2018), holds true for any statistical model and is a natural self consistency property of Bayesian inference. Any deviation between the distribution of our inferred values for θ and the original prior distribution g(θ) indicates that either our statistical model is malformed or the computational method is not behaving as expected. There are a variety of ways we can ensure that this condition is satisfied, which we outline below.

      Using the data set generated for the prior predictive checks (shown in Fig. 2 (C)), we sampled the posterior distribution and compute ΔεRA and σ for each simulation and checked that they matched the original prior distribution. To perform the inference, we use Markov chain Monte Carlo (MCMC) to sample the posterior distribution. Specifically, we use the Hamiltonian Monte Carlo algorithm implemented in the Stan probabilistic programming language (Carpenter et al. 2017). The specific code files can be accessed through the paper website or the associated GitHub repository. The original prior distribution and the distribution of inferred parameter values can be seen in Fig. 3 (A) and (B). For both ΔεRA and σ, we can accurately recover the ground truth distribution (purple) via sampling with MCMC (orange). For ΔεRA, there appears to be an upper and lower limit past which we are unable to accurately infer the binding energy. This can be seen in both the histogram Fig. 3(A) and the empirical cumulative distribution Fig. 3 (B) as deviations from the ground truth when DNA binding is below  ≈  − 25 kBT or above  ≈  − 5 kBT. These limits hinder our ability to comment on exceptionally strong or weak binding affinities. However, as all mutants queried in this work exhibited binding energies between these limits, we surmise that the inferential scheme permits us to draw conclusions about the inferred DNA binding strengths.

Figure 3: Comparison of averaged posterior and prior distributions for ΔεRA and σ. (A) Distribution of the average values for the DNA binding energy ΔεRA (orange) overlaid with the ground truth distribution (purple). (B) Data averaged posterior (orange) for the standard deviation of fold-change measurements overlaid with the ground truth distribution (purple). Top and bottom show the same data with different visualizations. The Python code (ch7_figS3.py) used to generate this figure can be found on the thesis GitHub repository.

     Rather than examining the agreement of the data-averaged posterior and the ground truth prior distribution solely by eye, we can compute summary statistics using the mean μ and standard deviation σ of the posterior and prior distributions which permit easier identification of pathologies in the inference. One such quantity is the posterior z-score, which is defined as
$$ z = {\mu_\text{posterior} - \tilde\theta \over \sigma_\text{posterior}}. \qquad(16)$$
This statistic summarizes how accurately the posterior recovers the ground truth value beyond simply reporting the mean, median, or mode of the posterior distribution. Z-scores around 0 indicate that the posterior is concentrating tightly about the true value of the parameter whereas large values (either positive or negative) indicate that the posterior is concentrating elsewhere. A useful feature of this metric is that the width of the posterior is also considered. It is possible that the posterior could have a mean very close to the ground truth value, but have an incredibly narrow distribution/spread such that it does not overlap with the ground-truth. Only comparing the mean value to the ground truth would suggest that the inference “worked.” However with a small standard deviation generates a very large z-score, telling us that something has gone awry.

     If our inferential model is behaving properly, the width of the posterior distribution should be significantly smaller than the width of the prior, meaning that the posterior is being informed by the data. The level to which the posterior is being informed by the data can be easily calculated given knowledge of both the prior and posterior distribution. This quantity, aptly named the shrinkage s, can be computed as
$$ s = 1 - {\sigma^2_\text{posterior} \over \sigma^2_\text{prior}}. \qquad(17)$$
When the shrinkage is close to zero, the variance of the posterior is approximately the same as the variance of the prior, and the model is not being properly informed by the data. When s ≈ 1, the variance of the posterior is much smaller than the variance of the prior, indicating that it is being highly informed by the data. A shrinkage less than 0 indicates that the posterior is wider than the prior distribution, revealing a severe pathology in either the model itself or the implementation.

     In Fig. 4, we compute these summary statistics for each parameter. For both ΔεRA and σ, we see clustering of the z-score about 0 with the extrema reaching  ≈  ± 3. This suggests that for the vast majority of our simulated data sets, the posterior distribution concentrated about the ground truth value. We also see that for both parameters, the posterior shrinkage s is  ≈ 1, indicating that the posterior is being highly informed by the data. There is a second distribution centered 0.8 for ΔεRA, indicating that for a subset of the data sets, the posterior is only  ≈ 80% narrower than the prior distribution. These samples are those that were drawn outside of the limits of  ≈  − 25 to  − 5 kBT where the inferential power is limited. Nevertheless, the posterior still significantly shrank, indicating that the data strongly informs the posterior.

Figure 4: Inferential sensitivity for estimation of ΔεRA and σ. The posterior z-score for each posterior distribution inferred from a simulated data set is plotted against the shrinkage for (A) the DNA binding energy ΔεRA and (B) the standard deviation of fold-change measurements σ. The Python code (ch7_figS4.py) used to generate this figure can be found on the thesis GitHub repository.

     The general self-consistency condition given by Eq. 15 provides another route to ensure that the model is computationally tractable. Say that we draw a value for the DNA binding energy from the prior distribution, simulate a data set, and sample the posterior using MCMC. The result of this sampling is a collection of N values of the parameter which may be above, below, or equal to the ground-truth value. From this set of values, we select L of them and rank order them by their value. Talts et al. (2018) derived a general theorem which states that the number of samples less than the ground truth value of the parameter (termed the rank statistic) is uniformly distributed over the interval [0, L]. As Eq. 15 must hold true for any statistical model, deviations from uniformity signal that there is a problem in the implementation of the statistical model. How the distribution deviates is also informative as different types of failures result in different distributions. The nature of these deviations, along with a more formal proof of the uniform distribution of rank statistics can be found in Talts et al. (2018) where it was originally derived.

     Given the sampling statistics for each of the simulated data sets, we took 800 of the MCMC samples of the posterior distribution for each of the 800 simulated data sets and computed the rank statistic. The distributions are shown in Fig. 5 as both histograms and ECDFs for the DNA binding energy and standard deviation. The distribution of rank statistics for both parameters appears to be uniform. The purple band overlaying the histograms (top row) as well as the purple envelopes overlaying the ECDFs (bottom row) represent the 99th percentile expected from a true uniform distribution. The uniformity of this distribution, along with the well-behaved z-scores and shrinkage for each parameter, tells us that there are no underlying pathologies in our statistical model and that it is computationally tractable. However, this does not mean that it is correct. Whether this model is valid for the actual observed data is the topic of the next section.

Figure 5: Rank distribution of the posterior samples from simulated data. Top row shows a histogram of the rank distribution with n = 20 bins. Bottom row is the cumulative distribution for the same data. Purple bands correspond to the 99th percentile of expected variation from a uniform distribution. (A) Distribution for the DNA binding energy ΔεRA and (B) for the standard deviation σ. The Python code (ch7_figS5.py) used to generate this figure can be found on the thesis GitHub repository.

Parameter Estimation and Posterior Predictive Checks

We now turn to applying our vetted statistical model to experimental measurements. While the same statistical model was applied to all three DNA binding mutants, here we only focus on the mutant Q18M for brevity.

     Using a single induction profile, we sampled the posterior distribution over both the DNA binding energy ΔεRA and the standard deviation σ using MCMC implemented in the Stan programming language. The output of this process is a set of 4000 samples of both parameters along with the value of their log posterior probabilities, which serves as an approximate measure of the probability of each value. The individual samples are shown in Fig. 6. The joint distribution between ΔεRA and σ is shown in the lower left hand corner, and the marginal distributions for each parameter are shown above and to the right of the joint distribution, respectively. The joint distribution is color coded by the value of the log posterior, with bright orange and dark purple corresponding to high and low probability, respectively. The symmetric shape of the joint distribution is a telling sign that there is no correlation between the two parameters. The marginal distributions for each parameter are also relatively narrow, with the DNA binding energy covering a range of  ≈ 0.6 kBT and σ spanning  ≈ 0.02. To more precisely quantify the uncertainty, we computed the shortest interval of the marginal distribution for each parameter that contains 95% of the probability. The bounds of this interval, coined the Bayesian credible region, can accommodate asymmetry in the marginal distribution since the upper and lower bounds of the estimate are reported. In the main text, we reported the DNA binding energy estimated from these data to be 15.43 − 0.06 + 0.06kBT, where the first value is the median of the distribution and the super- and subscripts correspond to the upper and lower bounds of the credible region, respectively.

     While looking at the shape of the posterior distribution can be illuminating, it is not enough to tell us if the parameter values extracted make sense or accurately describe the data on which they were conditioned. To assess the validity of the statistical model in describing actual data, we again turn to simulation, this time using the posterior distributions for each parameter rather than the prior distributions. The likelihood of our statistical model assumes that across the entire induction profile, the observed fold-change is normally distributed about the theoretical prediction with a standard deviation σ. If this is an accurate depiction of the generative process, we should be able to draw values from the likelihood using the sampled values for ΔεRA and σ that are indistinguishable from the actual experimental measurements. This process is known as a posterior predictive check and is a Bayesian method of assessing goodness-of-fit.

     For each sample from the posterior, we computed the theoretical mean fold-change given the sampled value for ΔεRA. With this mean in hand, we used the corresponding sample for σ and drew a data set from the likelihood distribution the same size as the real data set used for the inference. As we did this for every sample of our MCMC output (a total of  ≈ 4000), it is more instructive to compute the percentiles of the generated data than to show the entire output. In Fig. 6 (B), the percentiles of the generated data sets are shown overlaid with the data used for the inference. We see that all of the data points fall within the 99th percentile of simulated data sets with the 5th percentile tracking the mean of the data at each inducer concentration. As there are no systematic deviations or experimental observations that fall far outside those generated from the statistical model, we can safely say that the statistical model derived here accurately describes the observed data.

Figure 6: Markov Chain Monte Carlo (MCMC) samples and posterior predictive check for DNA binding mutant Q18M. (A) Marginal and joint sampling distributions for DNA binding energy ΔεRA and σ. Each point in the joint distribution is a single sample. Marginal distributions for each parameter are shown adjacent to joint distribution. Color in the joint distribution corresponds to the value of the log posterior with the progression of dark purple to bright orange corresponding to increasing probability. (B) The posterior predictive check of the model. The measurements of the fold-change in gene expression are shown as black open-faced circles. The percentiles are shown as colored bands and indicate the fraction of simulated data drawn from the likelihood that fall within the shaded region. The Python code (ch7_figS6.py) used to generate this figure can be found on the thesis GitHub repository.

Inferring the Free Energy from Fold-Change Measurements

In this section, we describe the statistical model to infer the free energy F from a set of fold-change measurements. We follow the same principled workflow as described previously for the DNA binding estimation, including declaration of the generative model, prior predictive checks, simulation based calibration, and posterior predictive checks. Finally, we determine an empirical limit in our ability to infer the free energy and define a heuristic which can be used to identify measurements that are likely inaccurate. To understand the statistical model and the empirical limits of detection, only the subsections Building A Generative Model and Sensitivity Limits and Systematic Errors in Inference are necessary.

Building a Generative Model

In Chapter 2, we showed that the fold-change equation can be rewritten in the form of a Fermi function,
$$ \text{fold-change} = {1 \over 1 + e^{-F / k_BT}}, \qquad(18)$$
where F corresponds to the free energy difference between the repressor bound and unbound states of the promoter. While the theory prescribes a way for us to calculate the free energy based on our knowledge of the biophysical parameters, we can directly calculate the free energy of a measurement of fold-change by simply rearranging Eq. 18 as
$$ F = -k_BT\log\left({1 \over \text{fold-change}} - 1\right). \qquad(19)$$
With perfect measurement of the fold-change in gene expression (assuming no experimental or measurement noise), the free energy can be directly calculated. However, actual measurements of the fold-change in gene expression can extend beyond the theoretical bounds of 0 and 1, for which the free energy is mathematically undefined.

     As the fold-change measurements between biological replicates are independent, it is reasonable to assume that they are normally distributed about a mean value μ with a standard deviation σ. While the mean value is restricted to the bounds of [0, 1], fold-change measurements outside of these bounds are still possible given that they are distributed about the mean with a scale of σ. Thus, if we have knowledge of the mean fold-change in gene expression about which the observed fold-change is distributed, we can calculate the mean free energy as
$$ F = -k_BT\log\left({1 \over \mu} - 1\right). \qquad(20)$$
For a given set of fold-change measurements y, we wish to infer the posterior probability distribution for μ and σ, given by Bayes’ theorem as
g(μ, σ | y) ∝ f(y | μ, σ)g(μ)g(σ),   (21)
where we have dropped the normalization constant f(y) and assigned a proportionality between the posterior and joint probability distribution. Given that the measurements are independent, we define the likelihood f(y | μ, σ) as a normal distribution,
f(y | μσ) ∼ Normal{μ, σ}.   (22)
While the mean μ is restricted to the interval [0, 1], there is no reason a priori to think that it is more likely to be closer to either bound. To remain uninformative and be as permissive as possible, we define a prior distribution for μ as a Uniform distribution between 0 and 1,
$$ g(\mu)=\begin{cases}{1\over \mu_\text{max} - \mu_\text{min}} & \mu_\text{min} < \mu < \mu_\text{max}\\ 0 & \text{otherwise}\end{cases}. \qquad(23)$$
Here, μmin = 0 and μmax = 1, reducing g(μ) to 1. For σ, we can again assume a half-normal distribution with a standard deviation of 0.1 as was used for estimating the DNA binding energy Eq. 13,
g(σ) = HalfNormal{0, 0.1}.   (24)

     With a full generative model defined, we can now use prior predictive checks to ensure that our choices of prior are appropriate for the inference.

Prior Predictive Checks

     To check the validity of the chosen priors, we pulled 1000 combinations of μ and σ from their respective distributions (Fig. 7 (A)) and subsequently drew a set of 10 fold-change values (a number comparable to the number of biological replicates used in this work) from a normal distribution defined by μ and σ. To visualize the range of values generated from these checks, we computed the percentiles of the empirical cumulative distributions of the fold-change values, as can be seen in Fig. 7 (C). Approximately 95% of the generated fold-change measurements were between the theoretical bounds of [0, 1] whereas 5% of the data sets fell outside with the maximum and minimum values extending to  ≈ 1.2 and  − 0.2, respectively. Given our familiarity with these experimental strains and the detection sensitivity of the flow cytometer, these excursions beyond the theoretical bounds agree with our intuition. Satisfied with our choice of prior distributions, we can proceed to check the sensitivity and computational tractability of our model through simulation based calibration.

Figure 7: Prior predictive checks for inference of the mean fold-change. (A) The prior distributions for μ (left) and σ (right). The vertical axis is proportional to the probability of the value. Black points above distributions correspond to the values used to perform the prior predictive checks. (B) Percentiles of the data generated for each draw from the prior distributions shown as a cumulative distribution. Percentiles were calculated for 1000 generated data sets, each with 10 fold-change measurements drawn from the likelihood given the drawn values of μ and σ. The Python code (ch7_figS7.py) used to generate this figure can be found on the thesis GitHub repository.

Simulation Based Calibration

     To ensure that the parameters can be estimated with confidence, we sampled the posterior distribution of μ and σ for each data set generated from the prior predictive checks. For each inference, we computed the z-score and shrinkage for each parameter, shown in Fig. 8(A). For both parameters, the z-scores are approximately centered about zero, indicating that the posteriors concentrate about the ground truth value of the parameter. The z-scores for σ green points in Fig. 8) (A) appear to be slightly off centered with more negative values than positive. This suggests that σ is more likely to be slightly overestimated in some cases. The shrinkage parameter for μ (red points) is very tightly distributed about 1.0, indicating that the prior is being strongly informed by the data. The shrinkage is more broadly distributed for σ with a minimum value of  ≈ 0.5. However, the median shrinkage for σ is  ≈ 0.9, indicating that half of the inferences shrank the prior distribution by at least 90%. While we could revisit the model to try and improve the shrinkage values, we are more concerned with μ which shows high shrinkage and zero-centered z-scores.

     To ensure that the model is computationally tractable, we computed the rank statistic of each parameter for each inference. The empirical cumulative distributions for μ (black) and σ (red) can be seen in Fig. 8 (B). Both distributions appear to be uniform, falling within the 99th percentile of the variation expected from a true uniform distribution. This indicates that the self-consistency relation defined by Eq. 15. holds for this statistical model. With a computationally tractable model in hand, we can now apply the statistical model to our data and verify that data sets drawn from the data-conditioned posterior are indistinguishable from the experimental measurements.

Figure 8: Sensitivity measurements and rank statistic distribution of the statistical model estimating μ and σ. (A) Posterior z-score of each inference plotted against the posterior shrinkage factor for the parameters μ (blue points) and σ (green points). (B) Distribution of rank statistics for μ (red) and σ (black). Purple envelope represents the 99th percentile of a true uniform distribution. The Python code (ch7_figS8.py) used to generate this figure can be found on the thesis GitHub repository.

Posterior Predictive Checks

The same statistical model was applied to every unique set of fold-change measurements used in this work. Here, we focus only on the set of fold-change measurements for the double mutant Y17I-Q291V at 50 μM IPTG. The samples from the posterior distribution conditioned on this dataset can be seen in Fig. 9 (A). The joint distribution, shown in the lower left-hand corner, appears fairly symmetric, indicating that μ and σ are independent. There is a slight asymmetry in the sampling of σ, which can be more clearly seen in the corresponding marginal distribution to the right of the joint distribution.

     For each MCMC sample of μ and σ, we drew 10 samples from a normal distribution defined by these parameters. From this collection of data sets, we computed the percentiles of the empirical cumulative distribution and plotted them over the data, as can be seen in Fig. 9 (B). We find that the observed data falls within the 99th percentile of the generated data sets. This illustrates that the model can produce data which is identically distributed to the actual experimental measurements, validating our choice of statistical model.

Figure 9: MCMC sampling output and posterior predictive checks of the statistical model for the mean fold-change μ and standard deviation σ. (A) Corner plot of sampling output. The joint distribution between σ and μ is shown in the lower left hand corner. Each point is an individual sample. Points are colored by the value of the log posterior with increasing probability corresponding to transitions from purple to orange. Marginal distributions for each parameter are shown adjacent to the joint distribution. (B) Percentiles of the cumulative distributions from the posterior predictive checks are shown as shaded bars. Data on which the posterior was conditioned are shown as white orange circles and lines. The Python code (ch7_figS9.py) used to generate this figure can be found on the thesis GitHub repository.

Sensitivity Limits and Systematic Errors in Inference

Considering the results from the prior predictive checks, simulation based calibration, and posterior predictive checks, we can say that the statistical model for inferring μ and σ fold-change from a collection of noisy fold-change measurements is valid and computationally tractable. Upon applying this model to the experimental data of the wild-type strain (where the free energy is theoretically known), we observed that systematic errors arise when the fold-change is exceptionally high or low, making the resulting inference of the free energy inaccurate.

     To elucidate the source of this systematic error, we return to a simulation based approach in which the true free energy is known (black points in Fig. 10 (A)). For a range of free energies, we computed the theoretical fold-change prescribed by Eq. 18. For each free energy value, we pulled a value for σ from the prior distribution defined in Eq. 13 and generated a data set of 10 measurements by drawing values from a normal distribution defined by the true fold-change and the drawn value of σ (purple points in Fig. 10 (A)). We then sampled the statistical model over these data and inferred the mean fold-change μ (orange points in Fig. 10 (A)). By eye, the inferred points appear to collapse onto the master curve, in many cases overlapping the true values. However, the points with a free energy less than  ≈  − 2 kBT and greater than  ≈ 2 kBT are slightly above or below the master curve, respectively. This becomes more obvious when the inferred free energy is plotted as a function of the true free energy, shown in Fig. 10 (B). Points in which the difference between μ and the nearest boundary (0 or 1) is less than the value of σ are shown as purple or green. When this condition is met, the inferred mean free energy strays from the true value, introducing a systematic error. This suggests that the spread of the fold-change measurements sets the detection limit of fold-change close to either boundary. Thus, the narrower the spread in the fold-change the better the estimate of the fold-change near the boundaries.

     These systematic errors can be seen in experimental measurements of the wild-type repressor. Data from Razo-Mejia et al. (2018) in which the IPTG titration profiles of seventeen different bacterial strains were measured is shown collapsed onto the master curve in Fig. 10 (C) as red points. Here, each point corresponds to a single biological replicate. The inferred mean fold-change μ and 95% credible regions are shown as purple, blue, or green points. The color of these points correspond to the relative value of μ or 1 − μ to σ. The discrepancy between the predicted and inferred free energy of each measurement set can be seen in Fig. 10 (D). The significant deviation from the predicted and inferred free energy occurs past the detection limit set by σ. In this work, we therefore opted to not display inferred free energies at the extrema where the inferred fold-change was closer to the boundaries than the corresponding standard deviation, as it reflects limitations in our measurement rather than a deviation from the theoretical predictions.

Figure 10: Identification of systematic error in simulated and real data when considering the free energy. (A) The true fold-change (black points), simulated fold-change distribution (purple points), and inferred mean fold-change (orange) is plotted as a function of the true free energy. Error bars on inferred fold-change correspond to the 95% credible region of the mean fold-change μ. (B) Inferred free energy plotted as a function of the true free energy. Black line indicates perfect agreement between the ground truth free energy and inferred free energy. Blue points correspond to the inferred free energy where the median values of the parameters satisfy the condition μ > σ and 1 − μ > σ. Purple points correspond to the inferred mean fold-change μ < σ. Green points correspond to those where the inferred mean fold-change 1 − μ < σ. Error bars correspond to the bounds of the 95% credible region. (C) Biological replicate data from Razo-Mejia et al. (2018) (red points) plotted as a function of the theoretical free energy. Inferred mean fold-change μ and the 95% credible region are shown as blue points. Purple and green points are colored by the same conditions as in (B). (D) Inferred free energy as a function of the predicted free energy colored by the satisfied condition. Error bars are the bounds of the 95% credible region. All inferred values in (A – D) are the median values of the posterior distribution. The Python code (ch7_figS10.py) used to generate this figure can be found on the thesis GitHub repository.

Additional Characterization of DNA Binding Mutants

In Chapter 3, we estimated the DNA binding energy of each mutant using the mutant strains that had approximately 260 repressors per cell. In this section, we examine the effect of the choice of fit strain on the predictions of both the induction profiles and ΔF for each DNA binding domain mutant.

     We applied the statistical model derived in Section 2 of this chapter for each unique strain of the DNA binding mutants and estimated the DNA binding energy. The median of the posterior distribution along with the upper and lower bounds of the 95% credible region are reported in Table 7.1. We found that the choice of fitting strain did not strongly influence the estimate of the DNA binding energy. The largest deviations appear for the weakest binding mutants paired with the lowest repressor copy number. In these cases, such as for Q18A, the difference in binding energy between the repressor copy numbers is  ≈ 1 kBT which is small compared to the overall DNA binding energy. Using these energies, we computed the predicted induction profiles of each mutant with different repressor copy numbers, shown in Fig. 11. In this plot, the rows correspond to the repressor copy number of the strain used to estimate the DNA binding energy. The columns correspond to the repressor copy number of the predicted strains. The diagonals, shaded in grey, show the induction profile of the fit strain along with the corresponding data. In all cases, we find that the predicted profiles are relatively accurate with the largest deviations resulting from using the lowest repressor copy number as the fit strain.

Estimated DNA binding energy for DNA binding domain mutants with different repressor copy numbers. Reported values are the median of the posterior distribution with the upper and lower bounds of the 95% credible regions.
Mutant Repressors DNA Binding Energy [kBT]
Q18A 60  − 9.8 − 0.2 + 0.2
124  − 10.3 − 0.1 + 0.1
260  − 11.0 − 0.1 + 0.1
1220  − 11.3 − 0.1 + 0.1
Q18M 60  − 15.83 − 0.08 + 0.08
124  − 15.7 − 0.1 + 0.1
260  − 15.43 − 0.06 + 0.07
1220  − 15.27 − 0.07+0.07
Y17I 60  − 9.4 − 0.3 + 0.3
124  − 9.5 − 0.1 + 0.1
260  − 9.9 − 0.1 + 0.1
1220  − 10.1 − 0.2 + 0.2

     The predicted change in free energy ΔF using each fit strain can be seen in Fig. 12 . In this figure, the rows represent the repressor copy number of the strain to which the DNA binding energy was fit whereas the columns correspond to each mutant. In each plot, we have shown the data for all repressor copy numbers with the fit strain represented by white filled circles. Much as for the induction profiles, we see little difference in the predicted ΔF for each strain, all of which accurately describe the inferred free energies. The ability to accurately predict the majority of the induction profiles of each mutant with repressor copy numbers ranging over two orders of magnitude strengthens our assessment that for these DNA binding domain mutations, only the DNA binding energy is modified.

Figure 11: Pairwise comparisons of DNA binding mutant induction profiles. Rows correspond to the repressor copy number of the strain used to estimate the DNA binding energy for each mutant. Columns correspond to the repressor copy number of the strains that are predicted. Diagonals in which the data used to estimate the DNA binding energy are shown with a gray background. The Python code (ch7_figS11.py) used to generate this figure can be found on the thesis GitHub repository.
Figure 12: Dependence of fitting strain on ΔF predictions of DNA binding domain mutants. Rows correspond to the repressor copy number used to estimate the DNA binding energy. Columns correspond to the particular mutant. Colored lines are the bounds of the 95% credible region of the predicted ΔF. Open face points indicate the strain to which the DNA binding energy was fit. The Python code (ch7_figS12.py) used to generate this figure can be found on the thesis GitHub repository.

Bayesian Parameter Estimation for Inducer Binding Domain Mutants

In Chapter 3, we put forward two naïve hypotheses for which parameters of our fold-change equation are affected by mutations in the inducer binding domain of the repressor. The first hypothesis was that only the inducer dissociation constants, KA and KI, were perturbed from their wild-type values. Another hypothesis was that the inducer dissociation constants were affected in addition to the energetic difference between the active and inactive states of the repressor, ΔεAI.

     In this section, we first derive the statistical model for each hypothesis and then perform a series of diagnostic tests that expose the inferential limitations of each model. With well-calibrated statistical models, we then apply each to an induction profile of the inducer binding mutant Q291K and assess the validity of each hypothesis. To understand the statistical models for each hypothesis, only the subsection Building A Generative Statistical Model is necessary.

Building a Generative Statistical Model

For both hypotheses, we assume that the underlying physical model is the same while a subset of the parameters are modified. As the fold-change measurements for each biological replicate are statistically independent, we can assume that they are normally distributed about the theoretical fold-change value. Thus, for each model, we must include a parameter σ which is the standard deviation of the distribution of fold-change measurements. For the first hypothesis, in which only KA and KI are changed, we are interested in sampling the posterior distribution
g(KA, KI, σ | y) ∝ f(y | KA, KI, σ)g(KA)g(KI)g(σ),   (25)
where y corresponds to the set of fold-change measurements. In the above model, we have assumed that the priors for KA and KI are independent. It is possible that it is more appropriate to assume that they are dependent and that a single prior distribution captures both parameters, g(KA, KI). However, assigning this prior is more difficult and requires strong knowledge a priori about the relationship between them. Therefore, we continue under the assumption that the priors are independent.

     The generic posterior given in Eq. 25 can be extended to evaluate the second hypothesis in which ΔεAI is also modified,
g(KA, KI, ΔεAI, σ | y) ∝ f(y | KA, KI, ΔεAI, σ)g(KA)g(KI)g(ΔεAI)g(σ)   (26)
where we have included ΔεAI as an estimated parameter and assigned a prior distribution.

     As we have assumed that the fold-change measurements across replicates are independent and normally distributed, the likelihoods for each hypothesis can be written as
f(y | KA, KI, σ) ∼ Normal{μ(KA, KI), σ},   (27)
for the first hypothesis and
f(y | KA, KI, ΔεAI, σ) ∼ Normal{μ(KA, KI, ΔεAI), σ},   (28)
for the second. Here, we have assigned μ(…) as the mean of the normal distribution as a function of the parameters defined by our fold-change equation.

     With a likelihood distribution in hand, we now turn toward assigning functional forms to each prior distribution. As we have used in the previous sections of this chapter, we can assign a half-normal prior for σ with a standard deviation of 0.1, namely,
g(σ) ∼ HalfNormal{0, 0.1}.   (29)
It is important to note that the inducer dissociation constants KA and KI are scale invariant, meaning that a change from 0.1 μM to 1 μM yields a decrease in affinity equal to a change from 10 μM to 100 μM. As such, it is better to sample the dissociation constants on a logarithmic scale. We can assign a log normal prior for each dissociation constant as
$$ g(K_A) = {1 \over K_A\sqrt{2\pi\phi^2}}\exp\left[-{(\log {K_A \over 1\,\mu\text{M}} - \mu_{K_A})^2 \over 2\phi^2}\right], \qquad(30)$$
or with the short-hand notion of
g(KA) ∼ LogNormal{μKA, ϕ}.   (31)
For KA, we assigned a mean μKA = 2 and a standard deviation ϕ = 2. For KI, we chose a mean of μKI = 0 and ϕ = 2, capturing our prior knowledge that KA > KI for the wild-type LacI. While the prior distributions are centered differently, they both show extensive overlap, permitting mutations in which KA < KI. For ΔεAI, we assign a normal distribution of the prior centered at 0 with a standard deviation of 5 kBT,
g(ΔεAI) ∼ Normal{0, 5}.   (32)
This permits values of ΔεAI that are above or below zero, meaning that the inactive state of the repressor can be either more or less energetically favorable to the active state. A standard deviation of 5 kBT permits a wide range of energies with  + 5 kBT and  − 5 kBT corresponding to  ≈ 99.5% and  ≈ 0.5% of the repressors being active in the absence of inducer, respectively.

Prior Predictive Checks

To ensure that these choices of prior distributions are appropriate, we performed prior predictive checks for each hypothesis as previously described in the second section of this chapter. We drew 1000 values from the prior distributions shown in Fig. 13 (A) for KA, KI, and ΔεAI. Using the draws from the KA, and KI priors alone, we generated data sets of  ≈ 70 measurements. The percentiles of the fold-change values drawn for the 1000 simulations is shown in the top panel of Fig. 13 (B).

     It can be seen that in the absence of inducer, the fold-change values are close to zero and are distributed about the leakiness value due to σ. This is in contrast to the data sets generated when ΔεAI is permitted to vary along with KA and KI. In the bottom panel of Fig. 13 (B), the fold-change when c = 0 can extend above 1.0 which is possible only when ΔεAI is included, which sets what fraction of the repressors is active. Under both hypotheses, the 99th percentile of the fold-change extends to just above 1 or just below 0, which matches our intuition of how the data should behave. Given these results, we are satisfied with these choices of priors and continue onto the next level of calibration of our model.

Figure 13: Prior predictive checks for two hypotheses of inducer binding domain mutants. (A) Probability density functions for KA, KI, ΔεAI, and σ. Black points correspond to draws from the distributions used for prior predictive checks. (B) Percentiles of the simulated data sets using draws from the KA and KI distributions only (top, purple bands) and using draws from KA, KI, and ΔεAI (bottom, orange bands). The Python code (ch7_fig13.py) used to generate this figure can be found on the thesis GitHub repository.

Simulation Based Calibration

With an appropriate choice of priors, we turn to simulation based calibration to root out any pathologies lurking in the model itself or the implementation through MCMC. For each parameter under each model, we compute the z-score and shrinkage of each inference, shown in Fig. 14. Under the first hypothesis in which KA and KI are the only perturbed parameters Fig. 14(A), we see that all parameters have zscores clustered around 0, indicating that the value of the ground-truth is being accurately estimated through the inference. While the shrinkage for σ is close to 1 (indicating the prior is being informed by the data), the shrinkage for KA and KI is heavily tailed with some values approaching zero. This is true for both statistical models, indicating that for some values of KA and KI, the parameters are difficult to pin down with high certainty. In the application of these models to data, this will be revealed as large credible regions in the reported parameters. Under the second hypothesis in which all allosteric parameters are allowed to change, we see moderate shrinkage for ΔεAI purple points in 14(B) with the minimum shrinkage being around 0.5. The samples resulting in low shrinkage correspond to values of ΔεAI that are highly positive or highly negative, in which small changes in the active fraction of repressors cannot be accurately measured through our model. However, the median shrinkage for ΔεAI is approximately 0.92, meaning that the data highly informed the prior distributions for the majority of the inferences. The rank distributions for all parameters under each model appear to be highly uniform, indicating that both statistical models are computationally tractable.

      With knowledge of the caveats of estimating KA and KI for both models, we proceed with our analysis and examine how accurately these models can capture the phenomenology of the data.

Figure 14: Simulation based calibration of statistical models for inducer binding domain mutants. (A) Sensitivity statistics and rank distribution for a statistical model in which KA and KI are the only parameters permitted to vary. (B) Sensitivity statistics and rank distribution for a model in which all allosteric parameters KA, KI, and ΔεAI are allowed to be modified by the mutation. Gray envelope in the bottom plots correspond to the 99th percentile of variation expected from a true uniform distribution. The Python code (ch7_figS14.py) used to generate this figure can be found on the thesis GitHub repository.

Posterior Predictive Checks

With a properly calibrated statistical model for each hypothesis, we now apply it to a representative dataset. While each model was applied to each inducer binding domain mutant, we only show the application to the mutant Q291K with 260 repressors per cell paired with the native lac operator O2.

     The results from applying the statistical model in which only KA and KI can change is shown in Fig. 15. The joint and marginal distributions for each parameter (Fig. 15 (A)) reveal a strong correlation between KA and KI whereas all other parameters are symmetric and independent. While the joint and marginal distributions look well-behaved, the percentiles of the posterior predictive checks (Fig. 15 (B)) are more suspect. While all data falls within the 95th percentile, the overall trend of the data is not well predicted. Furthermore, the percentiles expand far below zero, indicating that the sampling of σ is compensating for the leakiness in the data being larger than it should be if only KA and KI were the changing parameters.

     We see significant improvement when ΔεAI is permitted to vary in addition to KA and KI. Fig. 16 (A) shows the joint and marginal distributions between all parameters from the MCMC sampling. We still see correlation between KA and KI, although it is not as strong as in the case where they are the only parameters allowed to change due to the mutation. We also see that the marginal distribution for σ has shrunk significantly compared to the marginal distribution in Fig. 15 (A). The percentiles of the posterior predictive checks, shown in Fig. 16 (B) are much more in line with the experimental measurements, with the 5th percentile following the data for the entire induction profile.

     In this section, we have presented two hypotheses for the minimal parameter set needed to describe the inducer binding mutations, derived a statistical model for each, thoroughly calibrated its behavior, and applied it to a representative data set. The posterior predictive checks (Fig. 15 and Fig. 16) help us understand which hypothesis is more appropriate for that particular mutant. The incredibly wide percentiles and significant change in the leakiness that result from a model in which only KA and KI are perturbed suggests that more than those two parameters should be changing. We see significant improvement in the description of the data when ΔεAI is altered, indicating that it is the more appropriate hypothesis of the two.

Figure 15: Posterior predictive checks for inducer binding domain mutants where only KA and KI are changed. (A) MCMC sampling output for each parameter. Joint distributions are colored by the value of the log posterior with increasing probability corresponding to transition from blue to yellow. (B) Percentiles of the data generated from the likelihood distribution for each sample of KA, KI, and σ. Overlaid points are the experimentally observed measurements. The Python code (ch7_figS15-S16.py) used to generate this figure can be found on the thesis GitHub repository.
Figure 16: Posterior predictive checks for inducer binding domain mutants where all allosteric parameters can change. (A) MCMC sampling output for all parameters. Joint distributions are colored by the value of the log posterior with increasing probability corresponding to the transition from blue to yellow. Marginal distributions are shown adjacent to each joint distribution. (B) Percentiles of the data generated from the likelihood for each sample of KA, KI, ΔεAI, and σ. The corresponding experimental data for Q291K are shown as black open-faced circles. The Python code (ch7_figS15-S16.py) used to generate this figure can be found on the thesis GitHub repository.

Additional Characterization of Inducer Binding Domain Mutants

To predict the induction profiles of the inducer binding mutants, we used only the induction profile of each mutant paired with the native O2 lac operator to infer the parameters. Here, we examine the influence the choice of fit strain has on the predictions of the induction profiles and ΔF for each mutant.

     In Chapter 3, we dismissed the hypothesis that only KA and KI were changing due to the mutation and based the fit to a single induction profile. In Fig. 17, the fits and predictions for each mutant paired with each operator sequence queried. Here, the rows correspond to the operator sequence of the fit strain while the columns correspond to the operator sequence of the predicted strain. The diagonals show the fit induction profiles and the corresponding data. Regardless of the choice of fit strain, the predicted induction profiles of the repressor paired with the O3 operator are poor, with the leakiness in each case being significantly underestimated. We also see that fitting the allosteric parameters to O3 results in poor predictions with incredibly wide credible regions for the other two operators. In Razo-Mejia et al. (2018) and in Chapter 6, we also found that fitting KA and KI to the induction profile of O3 generally resulted in poor predictions of the other strains with comparably wide credible regions.

     When ΔεAI is included as a parameter, however, the predictive power is improved for all three operators, as can be seen in Fig. 18. While the credible regions are still wide when fit to the O3 operator, they are much narrower than under the first hypothesis. We emphasize that we are able to accurately predict the leakiness of nearly every strain by redetermining ΔεAI whereas the leakiness was not predicted when only KA and KI were considered. Thus, we conclude that all three allosteric parameters KA, KI, and ΔεAI are modified for these four inducer binding domain mutations. The values of the inferred parameters are reported in Table 7.3.

      We also examined the effect the choice of fit strain has on the predicted ΔF, shown in Fig. 19. We find that the predictions agree with the data regardless of the choice of fit strain. One exception is the prediction of the Q291K ΔF when the parameters fit to the O3 induction profile are used. As the induction profile for Q291K paired with O3 is effectively flat at a fold-change of 1, it is difficult to properly estimate the parameters of our sigmoidal function. We note that all measurements of ΔF for Q291K are described by using the parameters fit to either O1 or O3 induction profiles, suggesting that the choice of fit strain makes little difference.

Figure 17: Pairwise comparison of fit strain versus predictions assuming only KA and KI are influenced by the mutation. Rows correspond to the operator sequence of the strain used for the parameter inference. Columns correspond to the operator sequence of the predicted strain. Colors identify the mutation. Diagonal positions show the induction fit strain and profiles. The Python code
(ch7_figS17-S18.py)
used to generate this figure can be found on the thesis GitHub repository.
Figure 18: Pairwise comparison of fit strain versus predictions assuming all allosteric parameters are affected by the mutation. Rows correspond to the operator of the strain used to fit the parameters. Columns correspond to the operator of the strains whose induction profile is predicted. Mutants are identified by color. Diagonals (gray background) show the induction profiles of the strain to which the parameters were fit. The Python code
(ch7_figS17-S18.py)
used to generate this figure can be found on the thesis GitHub repository.
Figure 19: Comparison of choice of fit strain on predicted ΔF profiles. Rows correspond to the operator of the strain to which the parameters were fit. Columns correspond to mutations. Points are colored by their operator sequence. The data corresponding to the operator of the fit strain are shown as white-faced points. The Python code
(ch7_figS19.py)
used to generate this figure can be found on the thesis GitHub repository.
Inferred values of KA, KI, and ΔεAI for inducer binding domain mutants. Values reported are the mean of the posterior distribution with the upper and lower bounds of the 95% credible region.
Mutant Operator KA [μM] KI [μM] ΔεAI [kBT]
F164T O1 290 − 56 + 60 1 − 0.98 + 4 4 − 3 + 5
O2 165 − 65 + 90 3 − 3 + 6 1 − 2 + 5
O3 110 − 105 + 700 7 − 4 + 5  − 0.9 − 0.3 + 0.4
Q291K O1  > 1000 410 − 100 + 150  − 3.2 − 0.1 + 0.1
O2  > 1000 310 − 60 + 70  − 3.11 − 0.07 + 0.07
O3 10 − 10 + 200 1 − 1 + 9  − 7 − 5 + 3
Q291R O1 3 − 3 + 27 2 − 2 + 20  − 1.9 − 0.3 + 0.4
O2 9 − 9 + 20 8 − 8 + 20  − 2.32 − 0.09 + 0.01
O3 6 − 6 + 24 9 − 9 + 30  − 2.6 − 0.5 + 0.4
Q291V O1  > 1000 3 − 3 + 13 6 − 4 + 4
O2 650 − 250 + 450 8 − 8 + 8 3 − 3 + 6
O3 100 − 90 + 400 22 − 18 + 33 0.1 − 0.6 + 0.8

Comparing Parameter Values to the Literature

In this section, we compare and contrast the biophysical parameter values we use to characterize the wild-type Lac repressor with the rich literature that consists of in vitro and in vivo experiments. This section has an accompanying interactive figure available on the paper website which allows the reader to examine different combinations of parameter values and their agreement or disagreement with data taken from Garcia and Phillips (2011), Brewster et al. (2014), and Razo-Mejia et al. (2018).

     While the mutations used in this work and those in Daber, Sochor, and Lewis (2011) are the same, we report significantly different values for the inducer binding, DNA binding parameters, and the relative energy difference between active and inactive states of the mutant repressors. The apparent disagreement of parameter values between the present work and those presented in Daber, Sochor, and Lewis (2011) in part stem from different treatments of the values for the wild-type Lac repressor. Since its isolation by Gilbert and Müller-Hill in the 1960’s Gilbert and Müller-Hill (1966), the Lac repressor has been the subject of intense biochemical and structural study. Many measurements of the inducer and DNA binding kinetics of the repressor in vitro (such as O’Gorman et al. 1980) and their values have informed the fitting of other parameters from measurements in vivo (such as Daber, Sochor, and Lewis (2011) and Daber, Sharp, and Lewis (2009)). All of these measurements, however, do not directly measure the DNA- or inducer-binding kinetics, nor the equilibrium constant between the active and inactive states of the repressor. To properly estimate the parameters, one must either have direct measurement of a subset of the parameters or make assumptions regarding the states of the system. Examples of the estimated allosteric parameter values of the wild-type LacI repressor from our previous work (Razo-Mejia et al. 2018), that of Daber, Sochor, and Lewis (2011), and in vitro measurements from O’Gorman et al. (1980) are given in Table 7.3 The theoretical predictions for the fold-change in gene expression, along with the values reported in Daber, Sharp, and Lewis (2009), can be seen using the interactive figure on the paper website, where the reader can also enter their own parameter values and independently assess the agreement or lack thereof with the data.

      It is notable that differences between the various references shown in Table 7.3 can be drastic, in some cases differing by almost an order of magnitude. Of particular note is the disagreement in the energy difference between the active and inactive states of the repressor, ΔεAI. Daber, Sochor, and Lewis (2011) determines a negative value of ΔεAI meaning that the inactive state of the repressor is energetically favorable to the active state. In stark contrast is the value reported in our previous work of  + 4.5 kBT, implying that the active state is significantly more stable than the inactive state. The now seminal in vitro measurements reported in O’Gorman et al. (1980) suggest that the two states are nearly energetically equivalent.

     The wide range of these reported values demonstrate that such thermodynamic models are highly degenerate, meaning that many combinations of parameter values can result in nearly equally good descriptions of the data. To illustrate this point, we estimated KA, KI, and DNA binding energy ΔεRA for each operator to the data reported in Razo-Mejia et al. (2018), Garcia and Phillips (2011), and Brewster et al. (2014), using the three values of ΔεAI shown in Table 7.4. The resulting fits can be seen in Fig. 20. Despite the drastically different values of ΔεAI, it is possible to generate adequate fits by modulating the other parameters. The parameter values of these fits, reported in Table 7.3 further illustrate this point as they differ significantly from one another.

     The theoretically predicted fold-change in the presence of multiple promoters, shown in Fig. 20 (E) is perhaps the most informative test of the parameter values. The theoretical advancements made in Weinert et al. (2014) provide a means to mathematically grasp the intricacies of plasmid-borne expression through calculation of the chemical potential. This formalism transforms the intimidating combinatorics of R repressors and N specific binding sites (i.e. plasmid reporter genes) into a two-state system where one can compute an effective repressor copy number regulating a single promoter. Using this formalism, the input-output function for the fold-change in gene expression can be written as
$$ \text{fold-change} = {1\over 1 + \lambda_{R}(c)e^{-\beta\Delta\varepsilon_{RA}}}, \qquad(33)$$
where the effective repressor copy number (termed the fugacity) is denoted as λR(c). In Chapter 6 of this thesis, we show that the inflection point of Eq. 33, whose curves shown in the bottom right-hand plot of Fig. 20, is located where the number of specific binding sites for the repressor is approximately equal to the number of repressors in the active state. Using this key feature, one can infer ΔεAI given prior knowledge of ΔεRA and the total number of repressors per cell. As shown in Fig. 20, using values of ΔεAI from Razo-Mejia et al. (2018) and O’Gorman et al. (1980) approximately agree with the measurements whereas the predicted curves using ΔεAI from Daber, Sochor, and Lewis (2011) overestimates the fold-change, even though these values accurately describe the simple-repression data shown in the other panels of Fig. 20.

Figure 20: Degenerate fits of data from Razo-Mejia et al. (2018); Brewster et al. (2014); Garcia and Phillips (2011) using different values for active/inactive state energy difference ΔεAI. In all plots, the solid, dashed, and dotted lines correspond to the best-fit curves conditioned on the data using parameter values for ΔεAI of 4.5 kBT,  − 1.75 kBT, and 0.35 kBT, respectively. Induction profiles from Razo-Mejia et al. (2018) for operators O1 (A), O2 (B), and O3 (C) are shown as points and errors which correspond to the mean and standard error of at least 10 biological replicates. (D) Leakiness measurements of the simple repression motif with one unique regulated reporter gene. Data shown are from Garcia and Phillips (2011); Brewster et al. (2014). (E) The transcription factor titration effect. For gene expression measurements on plasmids, the fold-change as a function of repressor copy number exhibits strong nonlinearities. We used this effect as a way to independently infer the parameter ΔεAI and it can be seen that this breaks the degeneracy between different parameters. The Python code (ch7_figS20.py) used to generate this figure can be found on the thesis GitHub repository.

     Without some direct in vivo measurements of these parameters, one must make assumptions about the system to make any quantitative progress. We chose to use the parameter values defined in our laboratory as the repressor fugacity provides us with an independent, albeit indirect, measurement of ΔεAI which other works such as Daber, Sochor, and Lewis (2011) and O’Gorman et al. (1980) do not provide. Both of these works determine all of the parameter values simultaneously by fitting them to a single set of measurements. While these measurements accurately describe their data, their parameter values are less successful in accounting for data from Brewster et al. (2014), Garcia and Phillips (2011), and Razo-Mejia et al. (2018). In the context of this work, we emphasize that we make many of the same qualitative conclusions as in Daber, Sochor, and Lewis (2011) with respect to the effects of the mutations using our particular set of parameter values.

     While we use different values for ΔεAI, the qualitative results between this work and that of Daber, Sochor, and Lewis (2011) are in agreement. For example, both works determine that mutations in the DNA binding domain alter only the DNA binding affinity whereas the mutations in the inducer binding pocket affect onlys the allosteric parameters. Because the biological variables such as repressor and reporter gene copy number are tightly controlled in our system, we are able to more precisely measure features of the induction profiles such as the leakiness in gene expression. This ability allows us to detect changes in the active/inactive equilibrium which were masked in Daber, Sochor, and Lewis (2011) by the experimental design. While the precise parameter values may be different between publications, the exploration of free energy differences resulting from mutations are parameter-value independent and any parameter disagreements do not change our theoretical model. Fig. 2 of the main text is presented with no knowledge of parameter values – it simply shows the mathematics of the model. While the value of ΔF will ultimately depend on the parameter values, the formalism of this work remains agnostic to the parameter values and can be a useful tool for classifying mutations and couple the sequence-level variation to the systems-level response.

Thermodynamic parameter values of wild-type LacI from the literature.
Parameter Value Reference
ΔεAI  ≥ 4.5 kBT Razo-Mejia et al. (2018)
 − 1.7 kBT Daber, Sochor, and Lewis (2011)
0.35 kBT O’Gorman et al. (1980)
KA 139 − 20 + 22μM Razo-Mejia et al. (2018)
16 μM Daber, Sochor, and Lewis (2011)
133 μM O’Gorman et al. (1980)
KI 0.53 − 0.01 + 0.01μM Razo-Mejia et al. (2018)
2 μM Daber, Sochor, and Lewis (2011)
4 μM O’Gorman et al. (1980)
Estimated parameters from global fits of data from literature.
Parameter Value ΔεAI Reference Value
ΔεRA (O1 operator)  − 15.1 − 0.1 + 0.1kBT 4.5 kBT (Razo-Mejia et al. 2018)
 − 17.1 − 0.1 + 0.1kBT  − 1.75 kBT (Daber, Sochor, and Lewis 2011)
 − 15.7 − 0.1 + 0.1kBT 0.35 kBT (O’Gorman et al. 1980)
ΔεRA (O1 operator)  − 15.1 − 0.1 + 0.1kBT 4.5 kBT (Razo-Mejia et al. 2018)
 − 17.1 − 0.1 + 0.1kBT  − 1.75 kBT (Daber, Sochor, and Lewis 2011)
 − 15.7 − 0.1 + 0.1kBT 0.35 kBT (O’Gorman et al. 1980)
ΔεRA (O2 operator)  − 13.4 − 0.1 + 0.1kBT 4.5kBT (Razo-Mejia et al. 2018)
 − 15.4 − 0.1 + 0.1kBT  − 1.75 kBT (Daber, Sochor, and Lewis 2011)
 − 14.0 − 0.1 + 0.1kBT 0.35 kBT (O’Gorman et al. 1980)
ΔεRA (O3 operator)  − 9.21 − 0.06 + 0.06kBT 4.5kBT (Razo-Mejia et al. 2018)
 − 11.29 − 0.06 + 0.06kBT  − 1.75 kBT (Daber, Sochor, and Lewis 2011)
 − 9.85 − 0.05 + 0.06kBT 0.35 kBT (O’Gorman et al. 1980)
KA 225 − 10 + 10μM 4.5 kBT (Razo-Mejia et al. 2018)
290 − 20 + 20μM  − 1.75 kBT (Daber, Sochor, and Lewis 2011)
270 − 20 + 20μM 0.35 kBT (O’Gorman et al. 1980)
KI 0.81 − 0.05 + 0.05μM 4.5 kBT (Razo-Mejia et al. 2018)
8.2 − 0.5 + 0.5μM  − 1.75 kBT (Daber, Sochor, and Lewis 2011)
5.5 − 0.3 + 0.5μM 0.35 kBT (O’Gorman et al. 1980)

Parameter Estimation Using All Induction Profiles

In Chapter 3 and Sec. 7.1 and 7.2 of this chapter, we have laid out our strategy for inferring the various parameters of our model to a single induction profile and using the resulting values to predict the free energy and induction profiles of other strains. In this section, we estimate the parameters using all induction profiles of a single mutant and using the estimated values to predict the free energy profiles.

      The inferred DNA binding energies considering induction profiles of all repressor copy numbers for the three DNA binding mutants are reported in Table 7.5. These parameters are close to those reported in Table 7.1 for each repressor copy number with Q18A showing the largest differences. The resulting induction profiles and predicted change in free energy for these mutants can be seen in Fig. 21. Overall, the induction profiles accurately capture the data. We acknowledge that even when using all repressor copy numbers, the fit to Q18A remains imperfect. However, we contend that this disagreement is comparable to that observed in Razo-Mejia et al. (2018) which described the induction profile of the wild-type repressor. We find that the predicted change in free energy (bottom row in Fig. 21 (B)) narrows compared to that in Fig. 12 and Fig. 3.3 of Chapter 3, confirming that considering all induction profiles improves our inference of the most-likely DNA binding energy. There appears to be a very slight trend in the ΔF for Q18A at higher inducer concentrations, though the overall change in free energy from 0 to 5000 μM IPTG is small.

Figure 21: Induction profiles and predicted change in free energy using parameters estimated from the complete data sets. Top row shows fold-change measurements (points) as mean and standard error with ten to fifteen biological replicates. Shaded lines correspond to the 95% credible regions of the induction profiles using the estimated values of the DNA binding energies reported in Table 7.5. Bottom row shows the 95% credible regions of the predicted change in free energy (shaded lines) along with the inferred free energy of data shown in the top row. In all plots, the inducer concentration is shown on a symmetric log scale with linear scaling between 0 and 10 − 2μM and log scaling elsewhere. The Python code (ch7_figS21.py) used to generate this figure can be found on the thesis GitHub repository.
Estimated DNA binding energies for each DNA binding domain mutant using all repressor copy numbers
Mutant ΔεRA [kBT]
Y17I  − 9.81 − 0.08 + 0.04
Q18A  − 10.60 − 0.07 + 0.07
Q18M  − 15.61 − 0.05 + 0.05

     We also estimated the allosteric parameters (KA, KI, and ΔεAI) for all inducer binding domain mutations using the induction profiles of all three operator sequences. The values, reported in Table 7.6 are very similar to those estimated from a single induction profile (Table 7.3). We note that for Q291R, it is difficult to properly estimate the values for KA and KI as the observed induction profile is approximately flat. The induction profiles and predicted change in free energy for each inducer binding mutant is shown in Fig. 22. We see notable improvement in the agreement between the induction profiles and the observed data, indicating that considering all data significantly shrinks the uncertainty of each parameter. The predicted change in free energy is also improved compared to that shown in Fig. 19. We emphasize that the observed free energy difference for each point assumes no knowledge of the underlying parameters and comes directly from measurements. The remarkable agreement between the predicted free energy and the observations illustrates that redetermining the allosteric parameters is sufficient to describe how the free energy changes as a result of the mutation.

Figure 22: Induction profiles and predicted change in free energy using parameters estimated from the complete data sets for inducer binding domain mutants. Top row shows fold-change measurements (points) as mean and standard error with ten to fifteen biological replicates. Shaded lines correspond to the 95% credible regions of the induction profiles using the estimated values of the allosteric parameters reported in Table 7.6. Bottom row shows the 95% credible regions of the predicted change in free energy (shaded lines) along with the inferred free energy of data shown in the top row. In all plots, the inducer concentration is shown on a symmetric log scale with linear scaling between 0 and 10 − 2muM and log scaling elsewhere. The Python code (ch7_figS22.py) used to generate this figure can be found on the thesis GitHub repository.
Estimated values for KA, KI, and ΔεAI for inducer binding domain mutations using induction profiles of all operator sequences.
Mutant KA [μM] KI [μM] ΔεAI [kBT]
F161T 300 − 60 + 60 12.7 − 0.1 + 0.1  − 0.9 − 0.3 + 0.3
Q291K  > 1 mM 330 − 70 + 60  − 3.17 − 0.07 + 0.07
Q291R  > 1 mM  > 1 mM  − 2.4 − 0.2 + 0.2
Q291V  > 1 mM 53 − 13 + 17 0 − 0.3 + 0.3

Generalizability of Data Collapse to Other Regulatory Architectures

In Chapters 2, 3, and 4, we have stated that the input-output function for the fold-change in gene expression can be rewritten in the form of a Fermi function. However, this result can be derived directly by coarse graining the transcription factor bound and unbound states of the systems’ partition function. In this section, we show how coarse graining of promoter occupancy states results in a general Fermi function so long as the transcription factor bound states and transcriptionally active states do not overlap.

     As shown in Chapter 2, the partition function 𝒵 for the simple repression motif (ignoring allosteric control) can be enumerated as
$$ \mathcal{Z} = \overbrace{1 + {P \over N_{NS}}e^{-\beta\Delta\varepsilon_P}}^\text{repressor unbound states} + \underbracee^{-\beta\Delta\varepsilon_{R}}}_\text{repressor bound state} = \mathcal{Z}_{\neg r} + \mathcal{Z}_{r}, \qquad(34)$$
where R is the number of repressors, P is the number of polymerases, NNS is the number of nonspecific binding sites, and ΔεP and ΔεR are the binding energies of the polymerase and repressor to the DNA, respectively. The states can be grouped into either repressor bound states (right-hand terms) or repressor unbound states (left hand terms), denoted as 𝒵r and 𝒵¬r, respectively. The probability of the repressor not being bound to the promoter can be computed as
$$ P(\neg r) = {\mathcal{Z}_{\neg r} \over \mathcal{Z}_{\neg r} + \mathcal{Z}_r}. \qquad(35)$$

    In this coarse-grained description, the transcriptionally active states are separate from the repressor bound states. Thus, to calculate the fold-change in gene expression, we can compute the ratio of Pr|R > 0) when repressor is present to Pr | R = 0) when repressor is absent. As the latter term is equal to 1, the fold-change in gene expression is equivalent to Eq. 35. We can compute an effective free energy of each coarse-grained state as
¬r =  − kBTlog 𝒵¬r ; r =  − kBTlog 𝒵r.   (36)
With an effective energy in hand, we can write Eq. 35 (which is equivalent to the fold-change) in terms of the Boltzmann weights of these two coarse-grained states as
$$ \text{fold-change} = {e^{-\beta \tilde{F}_{\neg r}} \over e^{-\beta \tilde{F}_{\neg r}} + e^{-\beta\tilde{F}_r}}. \qquad(37)$$
A simple rearrangement of this result produces a Fermi function
$$ \text{fold-change} = {1 \over 1 + e^{-\beta \tilde{F}}}, \qquad(38)$$
where is the effective free energy of the system defined as
 =  − kBT(log𝒵r−log𝒵¬r).   (39)

     In the case of the simple repression motif, we apply the weak promoter approximation which is based on the fact that (P/NNS)e − βΔεP ≪ 1, reducing 𝒵¬r = 1. With this approximation, the effective free energy defined in Eq. 39 can be interpreted as the free energy between the repressed and transcriptionally active states. In general, this coarse-graining approach can be applied to any regulatory architecture in which the transcription factor bound states and transcriptionally active states share no overlapping substates. Fig. 23 shows various repression based architectures along with the coarse graining of states and the effective free energy.

      This approach cannot be applied to architectures in which the transcription factor bound and transcriptionally active states cannot be separated. One such example is the case of simple activation in which the binding of an activator increases gene expression. For this architecture, the total partition function (as derived in Bintu et al. 2005) is
$$ \mathcal{Z} = \overbrace{1 + {P \over N_{NS}}e^{-\beta\Delta\varepsilon_{P}}}^\text{activator unbound states} + \underbracee^{-\beta\Delta\varepsilon_A} + {A \over N_{NS}}{P \over N_{NS}}e^{-\beta(\Delta\varepsilon_{A} + \Delta\varepsilon_{P} + \varepsilon_\text{interaction})}}_\text{activator bound states} = \mathcal{Z}_{\neg a} + \mathcal{Z}_a. \qquad(40)$$
Here, we have used A to denote the number of activators, ΔεA is the binding energy of the activator to its specific binding site, and εinteraction is the interaction energy between the activator and polymerase which makes the activator-polymerase-DNA state more energetically favorable to the polymerase-DNA state. Unlike in the case of repression, the transcriptionally active states are present in the 𝒵a. While we can compute the probability of the activator not being bound Pa), this is not equivalent to the fold-change in gene expression as was the case for simple repression. Rather, the fold-change in gene expression can be written as
$$ \text{fold-change} = {\mathcal{Z}_a -1 + \mathcal{Z}_{\neg a} - {A \over N_{NS}}e^{-\beta\Delta\varepsilon_{A}} \over \mathcal{Z}_{a} + \mathcal{Z}_{\neg a}}{\mathcal{Z}_a \over \mathcal{Z}_a - 1}. \qquad(41)$$
This is a more cumbersome expression than in the case for simple repression, and cannot be massaged into a one-parameter description of the fold-change. Thus, the analysis presented here is only applicable to cases in which the occupancy of the promoter by either the polymerase or the transcription factor are mutually exclusive.

Figure 23: Various repression-based regulatory architectures and their coarse-grained states. The left column shows a schematic of the regulatory architecture. The middle column shows the equilibrium states of the system coarse grained into repressor bound (orange boxes) and repressor unbound (purple boxes) states. States in which the polymerase is bound is shaded in white and are neglected by the weak promoter approximation. The right column illustrates that the formulation of the effective free energy is the same for all architectures, although the formulation of 𝒵¬r and 𝒵r are different between the examples. The bottom row illustrates the coarse graining of a simple repression motif in which the same repressor regulates multiple copies of the same gene. Rather than considering all states, the gene expression can be calculated by considering one promoter and an effective copy number, described by the repressor fugacity λr, and is derived in Weinert et al. (2014). Grayed-out states illustrate this further level of coarse graining.

Strain and Oligonucleotide Information

Escherichia coli strains used in this work
Mutation Operator R Genotype
22 MG1655::ΔlacZYA
O1 0 MG1655::ΔlacIZYA; galK<>25O1+11-YFP
O2 0 MG1655::ΔlacIZYA; galK<>25O2+11-YFP
O3 0 MG1655::ΔlacIZYA; galK<>25O3+11-YFP
WT O2 260 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1027LacI
Y17I O2 60 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1147LacI(Y17I)
Y17I O2 124 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS446LacI(Y17I)
Y17I O2 260 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1027LacI(Y17I)
Y17I O2 1220 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1LacI(Y17I)
Q18A O2 60 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1147LacI(Q18A)
Q18A O2 124 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS446LacI(Q18A)
Q18A O2 260 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1027LacI(Q18A)
Q18A O2 1220 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1LacI(Q18A)
Q18M O2 60 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1147LacI(Q18M)
Q18M O2 124 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS446LacI(Q18M)
Q18M O2 260 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1027LacI(Q18M)
Q18M O2 1220 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1LacI(Q18M)
F161T O1 260 MG1655::ΔlacIZYA; galK<>25O1+11-YFP; ybcN<>3*1-RBS1027LacI(F161T)
F161T O2 260 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1027LacI(F161T)
F161T O3 260 MG1655::ΔlacIZYA; galK<>25O3+11-YFP; ybcN<>3*1-RBS1027LacI(F161T)
Q291V O1 260 MG1655::ΔlacIZYA; galK<>25O1+11-YFP; ybcN<>3*1-RBS1027LacI(Q291V)
Q291V O2 260 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1027LacI(Q291V)
Q291V O3 260 MG1655::ΔlacIZYA; galK<>25O3+11-YFP; ybcN<>3*1-RBS1027LacI(Q291V)
Q291K O1 260 MG1655::ΔlacIZYA; galK<>25O1+11-YFP; ybcN<>3*1-RBS1027LacI(Q291K)
Q291K O2 260 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1027LacI(Q291K)
Q291K O3 260 MG1655::ΔlacIZYA; galK<>25O3+11-YFP; ybcN<>3*1-RBS1027LacI(Q291K)
Q291R O1 260 MG1655::ΔlacIZYA; galK<>25O1+11-YFP; ybcN<>3*1-RBS1027LacI(Q291R)
Q291R O2 260 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1027LacI(Q291R)
Q291R O3 260 MG1655::ΔlacIZYA; galK<>25O3+11-YFP; ybcN<>3*1-RBS1027LacI(Q291R)
Y17I-F161T O2 260 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1027LacI(Y17IF161T)
Y17I-Q291V O2 260 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1027LacI(Y17IQ291V)
Y17I-Q291K O2 260 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1027LacI(Y17IQ291K)
Q18A-F161T O2 260 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1027LacI(Q18AF161T)
Q18A-Q291V O2 260 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1027LacI(Q18AQ291V)
Q18A-Q291K O2 260 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1027LacI(Q18AQ291K)
Q18M-F161T O2 260 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1027LacI(Q18MF161T)
Q18M-Q291V O2 260 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1027LacI(Q18MQ291V)
Q18M-Q291K O2 260 MG1655::ΔlacIZYA; galK<>25O2+11-YFP; ybcN<>3*1-RBS1027LacI(Q18MQ291K)
Oligonucleotides used for mutant generation.
Primer Name Sequence (5' 3') Description Method
10.1 acctctgcg gaggggaag cgtgaacctc tcacaagacg gcatcaaatt acactagca acaccagaac agc Integration into ybcN locus λ-Red Recombineering
10.3 ctgtagatgt gtccgttca tgacacgaat aagcggtgta gccattacg ccggctaat gcacccagt aagg Integration into ybcN locus λ-Red Recombineering
GCMWC-001 ccggcatac tctgcgaca Amplification of plasmid QuickChange Mutagenesis
GCMWC-002 gtgtctctta tATGaccgt ttcccgc Q18M Mutation (CAGATG) QuickChange Mutagenesis
GCMWC-003 tgtctcttat GCGaccgttt cccgc Q18A Mutation (CAGGCG) QuickChange Mutagenesis
GCMWC-004 gttaacggcg ggatataac Amplification of plasmid QuickChange Mutagenesis
GCMWC-005 caccatcaaa GTGgatttt cgcctgc Q291V Mutation (CAG GTG) QuickChange Mutagenesis
GCMWC-006 caccatcaaa AAGgattttcg cc Q291K Mutation (CAGAAG) QuickChange Mutagenesis
GCMWC-007 cagtattatt ACCtcccatga agacgg F161T Mutation (TTCACC) QuickChange Mutagenesis
GCMWC-008 ttgatgggtg tctggtcag Amplification of plasmid QuickChange Mutagenesis
GCMWC-009 gcatactctg cgacatcgta taa Amplification of plasmid QuickChange Mutagenesis
GCMWC-010 cggtgtctct ATTcagaccg tttc Y17I Mutation (TATATT) QuickChange Mutagenesis
GCMWC-017 ccatcaaaAG Ggattttcgc ctgctggg gcaaaccag Q291R Mutation (CAGAGG) Gibson Assembly
GCMWC-018 ggcgaaaatc CCTtttgatg gtggttaa cggcggg Q291R Mutation (CTGCCT) Gibson Assembly

References

Bintu, Lacramioara, Nicolas E Buchler, Hernan G Garcia, Ulrich Gerland, Terence Hwa, Jané Kondev, and Rob Phillips. 2005. “Transcriptional Regulation by the Numbers: Models.” Current Opinion in Genetics & Development 15 (2): 116–24. https://doi.org/10.1016/j.gde.2005.02.007.

Brewster, Robert C., Franz M. Weinert, Hernan G. Garcia, Dan Song, Mattias Rydenfelt, and Rob Phillips. 2014. “The Transcription Factor Titration Effect Dictates the Level of Gene Expression.” Cell 156 (6): 1312–23. https://doi.org/10.1016/j.cell.2014.02.022.

Carpenter, Bob, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software 76 (1): 1–32. https://doi.org/10.18637/jss.v076.i01.

Daber, Robert, Kim Sharp, and Mitchell Lewis. 2009. “One Is Not Enough.” Journal of Molecular Biology 392 (5): 1133–44. https://doi.org/10.1016/j.jmb.2009.07.050.

Daber, Robert, Matthew A. Sochor, and Mitchell Lewis. 2011. “Thermodynamic Analysis of Mutant Lac Repressors.” Journal of Molecular Biology, The Operon Model and its Impact on Modern Molecular Biology, 409 (1): 76–87. https://doi.org/10.1016/j.jmb.2011.03.057.

Garcia, H. G., and R. Phillips. 2011. “Quantitative Dissection of the Simple Repression Input-Output Function.” Proceedings of the National Academy of Sciences 108 (29): 12173–8. https://doi.org/10.1073/pnas.1015616108.

Gilbert, Walter, and Benno Müller-Hill. 1966. “Isolation of the Lac Repressor.” Proceedings of the National Academy of Sciences 56 (6): 1891–8.

Good, Isadore J. 1950. Probability and the Weighting of Evidence. New York City: Hafner Publishing Company.

Milo, Ron, Paul Jorgensen, Uri Moran, Griffin Weber, and Michael Springer. 2010. “BioNumbersthe Database of Key Numbers in Molecular and Cell Biology.” Nucleic Acids Research 38 (suppl_1): D750–D753. https://doi.org/10.1093/nar/gkp889.

O’Gorman, R. B., J. M. Rosenberg, O. B. Kallai, R. E. Dickerson, K. Itakura, A. D. Riggs, and K. S. Matthews. 1980. “Equilibrium Binding of Inducer to Lac Repressor Operator DNA Complex.” Journal of Biological Chemistry 255 (21): 10107–14.

Razo-Mejia, Manuel, Stephanie L. Barnes, Nathan M. Belliveau, Griffin Chure, Tal Einav, Mitchell Lewis, and Rob Phillips. 2018. “Tuning Transcriptional Regulation Through Signaling: A Predictive Theory of Allosteric Induction.” Cell Systems 6 (4): 456–469.e10. https://doi.org/10.1016/j.cels.2018.02.004.

Talts, Sean, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018. “Validating Bayesian Inference Algorithms with Simulation Based Calibration.” arXiv, April. http://arxiv.org/abs/1804.06788.

Weinert, Franz M., Robert C. Brewster, Mattias Rydenfelt, Rob Phillips, and Willem K. Kegel. 2014. “Scaling of Gene Expression with Transcription-Factor Fugacity.” Physical Review Letters 113 (25). https://doi.org/10.1103/PhysRevLett.113.258101.