Efficient Ensemble Refinement by Reweighting

Ensemble refinement produces structural ensembles of flexible and dynamic biomolecules by integrating experimental data and molecular simulations. Here we present two efficient numerical methods to solve the computationally challenging maximum-entropy problem arising from a Bayesian formulation of ensemble refinement. Recasting the resulting constrained weight optimization problem into an unconstrained form enables the use of gradient-based algorithms. In two complementary formulations that differ in their dimensionality, we optimize either the log-weights directly or the generalized forces appearing in the explicit analytical form of the solution. We first demonstrate the robustness, accuracy, and efficiency of the two methods using synthetic data. We then use NMR J-couplings to reweight an all-atom molecular dynamics simulation ensemble of the disordered peptide Ala-5 simulated with the AMBER99SB*-ildn-q force field. After reweighting, we find a consistent increase in the population of the polyproline-II conformations and a decrease of α-helical-like conformations. Ensemble refinement makes it possible to infer detailed structural models for biomolecules exhibiting significant dynamics, such as intrinsically disordered proteins, by combining input from experiment and simulation in a balanced manner.


Gradients of the Log-Posterior for Correlated Gaussian Errors
In the following, we present a detailed derivation of the expressions for the gradients of the negative log-posterior given by eq 4 of the main text in the log-weights and forces formulations for correlated Gaussian errors. The expressions for the gradients for uncorrelated errors presented in the main text are special cases of the more general expressions derived here.
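For correlated Gaussian errors, the likelihood is given by $P(\{y_i\}|w) \propto \exp(-\chi^2/2)$, with

$$\chi^2 = r^T S^{-1} r, \tag{1}$$

such that eq 4 of the main text takes on the form

$$L = \theta S_\mathrm{KL} + \frac{1}{2} r^T S^{-1} r. \tag{2}$$

The components of the vector of residuals $r$ are given by

$$r_i = \sum_{\alpha=1}^{N} w_\alpha y_i^\alpha - Y_i = \sum_{\alpha=1}^{N} w_\alpha r_i^\alpha, \tag{3}$$

where we introduced $r_i^\alpha = y_i^\alpha - Y_i$. $S$ is the symmetric and positive definite covariance matrix of the statistical errors. Note that for uncorrelated errors the covariance matrix is diagonal, $S = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_M^2\}$. Denoting the $ij$ elements of the inverse of $S$ as $S^{-1}_{ij}$, we may write

$$\chi^2 = \sum_{i,j=1}^{M} r_i S^{-1}_{ij} r_j. \tag{4}$$

We derive the gradients of the negative log-posterior given by eq 2 by separately evaluating the gradients of the relative entropy $S_\mathrm{KL}$ and of $\chi^2$. To derive the gradients of the relative entropy $S_\mathrm{KL}$ given by eq 3 of the main text in the log-weights and forces methods below, we use the chain rule and first derive here the gradient with respect to the weights $w_\alpha$. We take into account that the weights are normalized, set $w_N = 1 - \sum_{\alpha=1}^{N-1} w_\alpha$, and write, with reference weights $w_\alpha^0$,

$$S_\mathrm{KL} = \sum_{\alpha=1}^{N-1} w_\alpha \ln\frac{w_\alpha}{w_\alpha^0} + \left(1 - \sum_{\alpha=1}^{N-1} w_\alpha\right) \ln\frac{1 - \sum_{\alpha=1}^{N-1} w_\alpha}{w_N^0}. \tag{5}$$

The derivative of the first term of eq 5 with respect to $w_\gamma$ is given by

$$\frac{\partial}{\partial w_\gamma} \sum_{\alpha=1}^{N-1} w_\alpha \ln\frac{w_\alpha}{w_\alpha^0} = \ln\frac{w_\gamma}{w_\gamma^0} + 1. \tag{6}$$

The derivative of the second term of eq 5 with respect to $w_\gamma$ is given by

$$\frac{\partial}{\partial w_\gamma} \left[ w_N \ln\frac{w_N}{w_N^0} \right] = -\ln\frac{w_N}{w_N^0} - 1. \tag{7}$$

Adding eqs 6 and 7, we obtain for $\gamma < N$

$$\frac{\partial S_\mathrm{KL}}{\partial w_\gamma} = \ln\frac{w_\gamma}{w_\gamma^0} - \ln\frac{w_N}{w_N^0}. \tag{8}$$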
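The two building blocks of this section, $\chi^2$ for correlated errors and the derivative of the relative entropy with $w_N$ eliminated by normalization, can be checked numerically. The following is a minimal sketch in plain Python, not the BioEn implementation. All function names and the toy data are hypothetical, and the inverse covariance matrix is assumed to be supplied directly. The analytical derivative $\ln(w_\gamma/w_\gamma^0) - \ln(w_N/w_N^0)$ is compared with a central finite difference in which $w_N$ absorbs every change of $w_\gamma$.

```python
import math

def residuals(w, y, Y):
    """r_i = sum_alpha w_alpha * y[alpha][i] - Y[i]."""
    return [sum(w[a] * y[a][i] for a in range(len(w))) - Y[i]
            for i in range(len(Y))]

def chi2(w, y, Y, S_inv):
    """chi^2 = sum_ij r_i (S^-1)_ij r_j for correlated Gaussian errors."""
    r = residuals(w, y, Y)
    M = len(r)
    return sum(r[i] * S_inv[i][j] * r[j] for i in range(M) for j in range(M))

def relative_entropy(w, w0):
    """S_KL = sum_alpha w_alpha ln(w_alpha / w0_alpha)."""
    return sum(wa * math.log(wa / w0a) for wa, w0a in zip(w, w0))

def grad_entropy_constrained(w, w0):
    """dS_KL/dw_gamma for gamma < N, with w_N = 1 - sum of the other weights."""
    log_ratio_last = math.log(w[-1] / w0[-1])
    return [math.log(w[g] / w0[g]) - log_ratio_last for g in range(len(w) - 1)]

def grad_entropy_numeric(w, w0, h=1e-6):
    """Central finite difference; the last weight compensates each shift."""
    grad = []
    for g in range(len(w) - 1):
        plus, minus = list(w), list(w)
        plus[g] += h
        plus[-1] -= h
        minus[g] -= h
        minus[-1] += h
        grad.append((relative_entropy(plus, w0) - relative_entropy(minus, w0)) / (2 * h))
    return grad

# Hypothetical toy problem: N = 3 ensemble members, M = 2 observables.
w0 = [1 / 3, 1 / 3, 1 / 3]
w = [0.5, 0.3, 0.2]
y = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.5]]
Y = [1.2, 1.0]
S_inv = [[8 / 7, -2 / 7], [-2 / 7, 4 / 7]]  # inverse of S = [[1, 0.5], [0.5, 2]]

print(chi2(w, y, Y, S_inv))
print(grad_entropy_constrained(w, w0))
print(grad_entropy_numeric(w, w0))
```

For a diagonal inverse covariance matrix with entries $1/\sigma_i^2$, `chi2` reduces to the familiar sum of squared, error-normalized residuals.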

Log-Weights
We derive the gradient of the negative log-posterior given by eq 2 with respect to the log-weights given by eq 12 of the main text.
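To calculate the gradient of the relative entropy we apply the chain rule, i.e.,

$$\frac{\partial S_\mathrm{KL}}{\partial g_\mu} = \sum_{\gamma=1}^{N-1} \frac{\partial S_\mathrm{KL}}{\partial w_\gamma} \frac{\partial w_\gamma}{\partial g_\mu}. \tag{9}$$

Inserting eq 8 into eq 9 and using, for normalized weights $w_\gamma = e^{g_\gamma} / \sum_\alpha e^{g_\alpha}$,

$$\frac{\partial w_\gamma}{\partial g_\mu} = w_\gamma \left( \delta_{\gamma\mu} - w_\mu \right)$$

and

$$\sum_{\gamma=1}^{N} \frac{\partial w_\gamma}{\partial g_\mu} = 0,$$

we obtain

$$\frac{\partial S_\mathrm{KL}}{\partial g_\mu} = w_\mu \left[ \ln\frac{w_\mu}{w_\mu^0} - \left\langle \ln\frac{w}{w^0} \right\rangle \right],$$

where

$$\left\langle \ln\frac{w}{w^0} \right\rangle = \sum_{\alpha=1}^{N} w_\alpha \ln\frac{w_\alpha}{w_\alpha^0} = S_\mathrm{KL}.$$

To calculate the gradient of $\chi^2$ given by eq 4 with respect to the log-weights we use the chain rule,

$$\frac{\partial \chi^2}{\partial g_\mu} = \sum_{i=1}^{M} \frac{\partial \chi^2}{\partial r_i} \frac{\partial r_i}{\partial g_\mu}, \qquad \frac{\partial \chi^2}{\partial r_i} = 2 \sum_{j=1}^{M} S^{-1}_{ij} r_j, \qquad \frac{\partial r_i}{\partial g_\mu} = w_\mu \left( r_i^\mu - r_i \right).$$

Thus, the gradient of eq 4 with respect to the log-weights becomes

$$\frac{\partial \chi^2}{\partial g_\mu} = 2 w_\mu \sum_{i,j=1}^{M} \left( r_i^\mu - r_i \right) S^{-1}_{ij} r_j.$$

Consequently, the gradient of the negative log-posterior with respect to the log-weights for correlated Gaussian errors is given by

$$\frac{\partial L}{\partial g_\mu} = \theta w_\mu \left[ \ln\frac{w_\mu}{w_\mu^0} - S_\mathrm{KL} \right] + w_\mu \sum_{i,j=1}^{M} \left( r_i^\mu - r_i \right) S^{-1}_{ij} r_j.$$

For uncorrelated errors the covariance matrix is diagonal and the gradient of $\chi^2$ simplifies to

$$\frac{\partial \chi^2}{\partial g_\mu} = 2 w_\mu \sum_{i=1}^{M} \frac{\left( r_i^\mu - r_i \right) r_i}{\sigma_i^2},$$

such that we recover eq 14 of the main text as expected.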
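The log-weights gradient can be verified against finite differences of the negative log-posterior. The sketch below is a plain-Python illustration under stated assumptions, not the BioEn code: the weights are taken as the normalized exponentials of the log-weights, all $N$ log-weights are treated as free variables, and the function names, toy data, and value of θ are hypothetical.

```python
import math

def softmax(g):
    """Normalized weights w_alpha = exp(g_alpha) / sum_gamma exp(g_gamma)."""
    m = max(g)
    e = [math.exp(x - m) for x in g]
    z = sum(e)
    return [x / z for x in e]

def neg_log_posterior(g, w0, y, Y, S_inv, theta):
    """L = theta * S_KL + chi^2 / 2 for correlated Gaussian errors."""
    w = softmax(g)
    N, M = len(w), len(Y)
    r = [sum(w[a] * y[a][i] for a in range(N)) - Y[i] for i in range(M)]
    s_kl = sum(wa * math.log(wa / w0a) for wa, w0a in zip(w, w0))
    chi2 = sum(r[i] * S_inv[i][j] * r[j] for i in range(M) for j in range(M))
    return theta * s_kl + 0.5 * chi2

def grad_log_weights(g, w0, y, Y, S_inv, theta):
    """Analytical dL/dg_mu = w_mu * [theta * (ln(w_mu/w0_mu) - S_KL)
    + sum_ij (y_i^mu - <y_i>) (S^-1)_ij r_j]."""
    w = softmax(g)
    N, M = len(w), len(Y)
    avg = [sum(w[a] * y[a][i] for a in range(N)) for i in range(M)]
    r = [avg[i] - Y[i] for i in range(M)]
    s_kl = sum(wa * math.log(wa / w0a) for wa, w0a in zip(w, w0))
    s_inv_r = [sum(S_inv[i][j] * r[j] for j in range(M)) for i in range(M)]
    grad = []
    for mu in range(N):
        entropy_part = math.log(w[mu] / w0[mu]) - s_kl
        chi2_part = sum((y[mu][i] - avg[i]) * s_inv_r[i] for i in range(M))
        grad.append(w[mu] * (theta * entropy_part + chi2_part))
    return grad

def grad_numeric(g, w0, y, Y, S_inv, theta, h=1e-6):
    """Central finite-difference gradient of L with respect to the log-weights."""
    grad = []
    for mu in range(len(g)):
        gp, gm = list(g), list(g)
        gp[mu] += h
        gm[mu] -= h
        grad.append((neg_log_posterior(gp, w0, y, Y, S_inv, theta)
                     - neg_log_posterior(gm, w0, y, Y, S_inv, theta)) / (2 * h))
    return grad

# Hypothetical toy problem: N = 3 ensemble members, M = 2 observables.
g = [0.2, -0.1, 0.4]
w0 = [0.2, 0.3, 0.5]
y = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.5]]
Y = [1.2, 1.0]
S_inv = [[8 / 7, -2 / 7], [-2 / 7, 4 / 7]]  # inverse of S = [[1, 0.5], [0.5, 2]]
theta = 2.0

print(grad_log_weights(g, w0, y, Y, S_inv, theta))
print(grad_numeric(g, w0, y, Y, S_inv, theta))
```

Because a uniform shift of all log-weights leaves the normalized weights unchanged, the components of this gradient sum to zero, which gives a second quick consistency check.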

Generalized Forces
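For correlated Gaussian errors, the generalized forces are given by

$$f_i = -\frac{1}{\theta} \sum_{j=1}^{M} S^{-1}_{ij} \left( \langle y_j \rangle - Y_j \right),$$

where $\langle y_j \rangle - Y_j = r_j$ is the residual of eq 3 and $\langle y_j \rangle = \sum_\alpha w_\alpha y_j^\alpha$. These forces determine the weights via eq 19 of the main text, $w_\alpha \propto w_\alpha^0 \exp\left( \sum_i f_i y_i^\alpha \right)$, such that $\partial w_\gamma / \partial f_k = w_\gamma \left( y_k^\gamma - \langle y_k \rangle \right)$. To calculate the gradient of $L$ given by eq 2 with respect to the forces we use the chain rule,

$$\frac{\partial L}{\partial f_k} = \sum_{\gamma=1}^{N-1} \frac{\partial L}{\partial w_\gamma} \frac{\partial w_\gamma}{\partial f_k}.$$

By applying the chain rule, we obtain the gradient of the relative entropy with respect to the forces,

$$\frac{\partial S_\mathrm{KL}}{\partial f_k} = \sum_{i=1}^{M} f_i \left( \langle y_i y_k \rangle - \langle y_i \rangle \langle y_k \rangle \right),$$

where we used eqs 8 and 19.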
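The relative-entropy gradient in the forces formulation can likewise be checked numerically. The sketch below is plain Python and assumes the sign convention $w_\alpha \propto w_\alpha^0 \exp(\sum_i f_i y_i^\alpha)$ for the weights. This convention, the function names, and the toy data are assumptions made for illustration and are not taken from the BioEn code. Under this assumption the gradient is the covariance matrix of the observables in the reweighted ensemble applied to the force vector.

```python
import math

def weights_from_forces(f, w0, y):
    """Assumed convention: w_alpha proportional to w0_alpha * exp(sum_i f_i y_i^alpha)."""
    x = [math.log(w0[a]) + sum(f[i] * y[a][i] for i in range(len(f)))
         for a in range(len(w0))]
    m = max(x)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in x]
    z = sum(e)
    return [v / z for v in e]

def relative_entropy(w, w0):
    """S_KL = sum_alpha w_alpha ln(w_alpha / w0_alpha)."""
    return sum(wa * math.log(wa / w0a) for wa, w0a in zip(w, w0))

def covariance(w, y):
    """C_ki = <y_k y_i> - <y_k><y_i>, averaged with the weights w."""
    N, M = len(w), len(y[0])
    avg = [sum(w[a] * y[a][i] for a in range(N)) for i in range(M)]
    return [[sum(w[a] * (y[a][k] - avg[k]) * (y[a][i] - avg[i]) for a in range(N))
             for i in range(M)] for k in range(M)]

def grad_entropy_forces(f, w0, y):
    """Analytical dS_KL/df_k = sum_i C_ki f_i."""
    w = weights_from_forces(f, w0, y)
    C = covariance(w, y)
    M = len(f)
    return [sum(C[k][i] * f[i] for i in range(M)) for k in range(M)]

def grad_entropy_numeric(f, w0, y, h=1e-6):
    """Central finite-difference gradient of S_KL with respect to the forces."""
    grad = []
    for k in range(len(f)):
        fp, fm = list(f), list(f)
        fp[k] += h
        fm[k] -= h
        s_plus = relative_entropy(weights_from_forces(fp, w0, y), w0)
        s_minus = relative_entropy(weights_from_forces(fm, w0, y), w0)
        grad.append((s_plus - s_minus) / (2 * h))
    return grad

# Hypothetical toy problem: N = 3 ensemble members, M = 2 observables.
f = [0.3, -0.5]
w0 = [0.2, 0.3, 0.5]
y = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.5]]

print(grad_entropy_forces(f, w0, y))
print(grad_entropy_numeric(f, w0, y))
```

In this parameterization the number of optimization variables equals the number of observables $M$ rather than the ensemble size $N$, and vanishing forces return the reference weights.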
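Next, we calculate the gradient of $\chi^2$ given by eq 4 with respect to $w_\alpha$. Because of the normalization condition $\sum_{\alpha=1}^{N} w_\alpha = 1$, we only have $N - 1$ independent variables. Using $w_N = 1 - \sum_{\alpha=1}^{N-1} w_\alpha$, we write the residuals as

$$r_i = \sum_{\alpha=1}^{N-1} w_\alpha r_i^\alpha + \left( 1 - \sum_{\alpha=1}^{N-1} w_\alpha \right) r_i^N.$$

Consequently, $\partial r_i / \partial w_\gamma = r_i^\gamma - r_i^N$ for $\gamma < N$. We obtain for the gradient of eq 4 with respect to $w_\gamma$ for $\gamma < N$:

$$\frac{\partial \chi^2}{\partial w_\gamma} = 2 \sum_{i,j=1}^{M} \left( r_i^\gamma - r_i^N \right) S^{-1}_{ij} r_j, \tag{22}$$

and $\partial \chi^2 / \partial w_N = 0$. By applying the chain rule and using eqs 19 and 22, we obtain

$$\frac{\partial \chi^2}{\partial f_k} = 2 \sum_{i,j=1}^{M} C_{ki} S^{-1}_{ij} r_j,$$

where $C_{ki} = \langle y_k y_i \rangle - \langle y_k \rangle \langle y_i \rangle$ is the covariance of observables $k$ and $i$ and $\langle \cdot \rangle$ denotes the average over the ensemble with weights $w_\alpha$. Consequently, for correlated Gaussian errors the gradient of the negative log-posterior in eq 2 with respect to the forces is given by

$$\frac{\partial L}{\partial f_k} = \sum_{i=1}^{M} C_{ki} \left( \theta f_i + \sum_{j=1}^{M} S^{-1}_{ij} r_j \right). \tag{23}$$

For uncorrelated errors, eq 23 simplifies to

$$\frac{\partial L}{\partial f_k} = \sum_{i=1}^{M} C_{ki} \left( \theta f_i + \frac{r_i}{\sigma_i^2} \right),$$

i.e., we recover eq 20 of the main text, which in the notation used here in the Supplementary Information takes on the form

$$\frac{\partial L}{\partial f_k} = \sum_{\alpha=1}^{N} w_\alpha \left( y_k^\alpha - \langle y_k \rangle \right) \sum_{i=1}^{M} y_i^\alpha \left( \theta f_i + \frac{r_i}{\sigma_i^2} \right).$$

Refinement of Ala-5 using J-Couplings

Comparison of Optimization using Generalized Forces and Log-Weights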
Ensemble refinements using generalized forces and log-weights give very similar results across the full range of the confidence parameter θ. The correlation of the optimal weights for Ala-5 refined against J-couplings is shown in Figure S1A.

BioEn ensemble refinement produced very similar trends regardless of which Karplus parameters were used to calculate the J-couplings. We performed independent Ala-5 ensemble refinements with three different sets of Karplus parameters: the empirical parameters of ref 2 (original) and two sets of parameters obtained from density functional theory in ref 3 (DFT1 and DFT2).
For further analysis of the optimization with the original and DFT1 parameter sets, we picked refined ensembles with $S_\mathrm{KL} = 0.5$. Irrespective of which set of Karplus parameters we used to calculate the J-couplings, the polyproline-II conformation becomes more populated and the α-helical-like conformation less populated (Figure 3D in main text, Figure S4D and

Agreement for Individual J-Couplings
Comparing the agreement between the simulated ensemble and experiments for individual observables (Figure S6) shows which data points drive the ensemble refinement. Here we focus on ensemble refinement using J-couplings calculated with the DFT2 set of Karplus parameters. For the $^3J_{\mathrm{CC'}}$ (Figure S6D), $^3J_{\mathrm{HNH\alpha}}$ (Figure S6A), and $^3J_{\mathrm{H\alpha C}}$ (Figure S6C) couplings, the agreement between experiment and simulation improves considerably for the optimal ensemble at θ = 6.65. $^3J_{\mathrm{CC'}}$ was measured only for residue 2 of Ala-5, and $\chi^2$ decreased from ≈ 8 to ≈ 2. For $^3J_{\mathrm{H\alpha C}}$, the improvement is driven by residue 4, which fits poorly in the initial ensemble, whereas for the other residues the agreement is already very good in the initial ensemble. Some improvement in the fit was obtained for $^2J_{\mathrm{NC\alpha}}$ (Figure S6G) and $^3J_{\mathrm{HNC\alpha}}$ (Figure S6H), with $\chi^2$ reduced from 3 to < 1 and from 2 to ≈ 0.5, respectively. Only very small changes were seen for $^1J_{\mathrm{NC\alpha}}$ (Figure S6F). Note that the $^1J_{\mathrm{NC\alpha}}$ coupling for residue 5 is uninformative in our analysis, as evidenced by the flat $\chi^2$ across the full range of θ values.
The ψ dihedral angle is not defined for the terminal residue, and the calculated $^1J_{\mathrm{NC\alpha}}$ depends on ψ in the current parameterization. For $^3J_{\mathrm{HNC}}$ (Figure S6B) and $^3J_{\mathrm{HNC\beta}}$ (Figure S6E), the agreement is extremely good to start with and deteriorates somewhat with the refinement.
Importantly, as discussed in the main text, the refinement removes systematic offsets for $^3J_{\mathrm{HNH\alpha}}$, $^3J_{\mathrm{H\alpha C}}$, and $^2J_{\mathrm{NC\alpha}}$.

Chemical Shifts
Experimental chemical shifts for Ala-5 from Graf et al. (ref 2) were compared to the chemical shifts calculated from the initial ensemble and from the reweighted, optimal ensemble (Figure S7). The error in the comparison of calculated and measured shifts is dominated by the forward model. Hence, the error bars show the previously determined root-mean-square error of SPARTA+ (ref 6) predictions for the respective nuclei.

Figure S7: Ala-5 chemical shifts calculated from the initial and optimal ensemble with the DFT2 Karplus parameters.