Scalable Constant pH Molecular Dynamics in GROMACS

Molecular dynamics (MD) computer simulations are used routinely to compute atomistic trajectories of complex systems. Systems are simulated in various ensembles, depending on the experimental conditions one aims to mimic. While constant energy, temperature, volume, and pressure are rather straightforward to model, pH, which is an equally important parameter in experiments, is more difficult to account for in simulations. Although a constant pH algorithm based on the λ-dynamics approach by Brooks and co-workers [Kong, X.; Brooks III, C. L. J. Chem. Phys.1996, 105, 2414–2423] was implemented in a fork of the GROMACS molecular dynamics program, uptake has been rather limited, presumably due to the poor scaling of that code with respect to the number of titratable sites. To overcome this limitation, we implemented an alternative scheme for interpolating the Hamiltonians of the protonation states that makes the constant pH molecular dynamics simulations almost as fast as a normal MD simulation with GROMACS. In addition, we implemented a simpler scheme, called multisite representation, for modeling side chains with multiple titratable sites, such as imidazole rings. This scheme, which is based on constraining the sum of the λ-coordinates, not only reduces the complexity associated with parametrizing the intramolecular interactions between the sites but also is easily extendable to other molecules with multiple titratable sites. With the combination of a more efficient interpolation scheme and multisite representation of titratable groups, we anticipate a rapid uptake of constant pH molecular dynamics simulations within the GROMACS user community.


.1 Biasing and pH-dependent potentials
To enhance sampling at the physically relevant states in the λ-dynamics simulations, we introduce a biasing potential of the form: After providing a value for the barrier height, the eight parameters k, a, b, d, s, w, r and m, are determined in an iterative fashion, as in Donnini et. al. 1 Parameters for the default barrier of 7.5 kJ/mol and a barrier of 5.0 kJ/mol are listed in Table S1. The blue curves in Figure S1 illustrate the potential with a barrier of 7.5 kJ/mol.
We also emphasize here, that the iterative procedure to obtain parameters k, a, b, d, s, w, r and m reported in the paper by Donnini et. al, 1 differed from the one actually used in both the previous and current implementations. Here, were describe the difference. Parameters d, s, w, r, and m were calculated in the exact same manner, as described in the original paper, 1 however, the parameters k, a, and b were computed differently. Initially, k was set to half of the desired barrier height, a to 0.05, and b to -0.1. Next, parameters k, a, and b were iteratively modified until the terminate conditionals, introduced further, were not satisfied.
The iteration loop was organised as follows: 1. The local minimum depth k is updated, so that the total barrier height corresponds to the desired value: where h is the desired barrier height 2. The local minimum position b is adjusted: where x 0 is the average position of λ: 3. The local minimum width a is adjusted: where σ is the dispersion of λ: and σ 0 is the desired dispersion of λ, which is set to 0.02.
The difference with the original routine, described by Donnini et. al, 1 is in the requirement for V bias < 0 in equations for x 0 and σ. This requirement is essential for the biasing potential to converge to the desired shape. Additionally, if the desired barrier is smaller than 0.45 kJ/mol, it is always forced to zero.
The pH-dependent potential used in this work is where k 1 and x 0 depend on the parameters r and a of biasing potential (Equation S1 ). Here, we use k 1 = 2.5r and x 0 = 2a. This pH potential, as well as the combination with the biasing potential is plotted in Figure S1 for two pH values Table S1: Parameters for the double well biasing potential for barrier heights 5.0 kJ/mol and 7.5 kJ/mol.  Figure S1: Combination of biasing and pH potentials at pH=3.0 (left) and pH=5.0 (right), when pK a =4.0. The blue profile shows the biasing potential V bias (Equation S1) with a barrier height of 7.5 kJ/mol. The red profile shows the pH-dependent potential V pH (Equation S7). The yellow profile shows the sum of the two potentials V total = V bias + V pH

Partial charges of atoms in different protonation states
Default partial charges for the atoms in the side chains of Asp, Glu, and His were used to model the electrostatic interactions of these residues in their different protonation states for constant pH simulations with both the CHARMM36 and Martini 2.0 force fields. These charges are provided via a separate coefficients.dat file that contains for each amino acid a list of atoms with their charges in the various protonation states. This file is included in the supplementary archive SI_constant_ph_gromacs.zip.

Correction potential V MM
The quantum mechanical contributions to proton binding that are missing from classical molecular mechanics force fields are compensated for by a correction potential V MM . To propagate λ-coordinates we only need the gradients of this potential with respect to the λcoordinates. Analytical expressions for these gradients are obtained as polynomial fits to the derivatives of the reference free energy along the charge interpolation path. Below we provide the details for the fitting of ∂V MM /∂λ for both the single-site and multisite representations of a titratable residue.

Correction potential for single site representation
If the single site representation is used, V MM is obtained as follows: First, the trajectory averages of ⟨ ∂V ∂λ ⟩ λ are computed at various values of λ between -0.1 and 1.1. In our implementation, we can accumulate ⟨ ∂V ∂λ ⟩ λ at fixed values of λ by setting the fictitious mass of the λ-particle to zero. Subsequently, ∂V MM /∂λ is obtained as a polynomial fit to the ⟨ ∂V ∂λ ⟩ λ values.

Correction potential for multisite representation
The correction potential for the multisite representation with n protonation states is obtained by performing a multi-dimensional fit to the trajectory averages of ⟨ ∂V ∂λ k ⟩ (λ (α) We represent the correction potential by a (k + 1) th order polynomial where p i indicates the power of λ i . The derivative of this polynomial with respect to λ j is of order k: Because of the constraint (Equation S8), we can substitute λ n = 1 − n−1 where we have introduced the coefficients b t,j q 1 ,q 2 ,...,q n−1 as short-hand notations for the linear combinations of a p 1 ,p 2 ,...,pn : where m t,j,q 1 ,...,q n−1 are the expansion coefficients obtained by contracting the terms between brackets in Equation S11. To simplify notation, we write expression S12 in matrix form: · · · m 0,n,0,0,...,0 and a = (a k+1,0,...,0 , a k,1,...,0 , ..., a 0,0,...,0 ) T .
The best multi-dimensional fit for ∂V MM /∂λ k is obtained by finding the coefficients b t,j q 1 ,q 2 ,...,q n−1 that minimize the squared deviation between the ∂V MM (λ 1 , ..., λ n )/∂λ k and the n ) at the grid points: n ) at the grid points obtained from the trajectories, the residual sum of squares S(b) can be cast in matrix notation: where ∥...∥ denotes the Euclidean norm, and f and X are: Here, 0 is a 0-filled matrix with the same dimensions as X single , which is defined by: where Λ j α is the value of n−1 l=1 (λ where l is an index for each of the n c * n − rank(M) linearly dependent combinations with 1 ≤ l ≤ n c * n − rank(M); and ϵ l i are the expansion coefficients for these combinations. In matrix notation with Q the full-rank matrix of the coefficients ϵ l i . If rank(Q) = 0, there are no linear dependencies between the coefficients b t,j q 1 ,q 2 ,...,q n−1 and the solution for the optimization problem S15 is obtained asb where X T X is considered invertible, 2 which will always be the case if the number of grid points (N grid ) is sufficiently large for a given maximum order of the polynomial. If rank(Q) > 0, the optimization in S15 transforms into a constrained estimation. From equations S13 and S21 the constraints on the estimator b can be written as and the solution for optimal estimator is given by where R is a matrix such that R T Q = 0 and rank [(Q R)] = n c * n. R is obtained by computing the basis set of complement space of Q.

Coefficients for Asp, Glu and His
The procedure for obtaining the gradients of force field free energy associated with deprotonation of the amino acids in their reference states is explained in the Methods section of the main text. The coefficients obtained by applying the fitting procedures described above to these gradients are provided as input and are listed in coefficients.dat file that is included in the supporting archive SI_constant_ph_gromacs.zip. Figure S2: Titration curves for tripeptide Asp using ∂V MM /∂λ coefficients obtained for pentaAsp (on the left) and triAsp (on the right). Dots show the fraction of deprotonated acid from CpHMD simulations, and dashed lines are the theoretical curve for reference pK a = 3.65.

Using tripeptides as reference compounds
In this work, we performed both the thermodynamic integration runs for obtaining V MM and the test simulations for tripeptides, whereas the experimental reference pK a values for the amino acids were obtained for pentapeptides. 3,4 We validated this choice by computing also the titration curve for the Asp tripeptide using coefficients obtained from thermodynamic integration runs on a pentapeptide system. As shown in Figure S2, the titration curves are virtually identical, suggesting that the correction potentials obtained from the calibration runs on the tripeptides, are transferable.
In this work, we interpolated the charges between protonation states, but not the Lennard-Jones parameters. The motivation for this choice is twofold: First, while implementing the interpolations is conceptually straightforward, re-organizing GROMACS to realize such interpolations is very time-consuming. Second, the contribution to the proton affinity due to changes in the Lennard-Jones interactions between the protonation states is much smaller than the contribution from the Coulomb interactions. 5 Therefore, we consider it a reasonable approximation to neglect the effect of the changes in Lennard-Jones parameters.
To verify the validity of this approximation for the two force fields used in this work, We interpolated the interactions in 21 equidistant steps between λ = 0 and λ = 1.
Each step was simulated for 51 ns, of which the first nanosecond was used for equilibration and hence discarded from the analysis. We used Bennett's acceptance ratio method 6 to extract the free energies. In Figure S3, we plot ∆∆G, defined as the difference between the free energy differences (∆G) associated with deprotonating Asp in the protein and in the peptide, for the four TI runs. While the contribution of interpolating the Lennard-Jones to the ∆∆G values is below 10% for both the CHARMM36 AA and Martini 2.0 CG force field, the contribution of interpolating also the bonded interactions is of the same order but of opposing sign. We, therefore, conclude that for the force fields used in this work neglecting the interpolation of both bonded and Lennard-Jones interactions and only interpolate the charges, provides a sufficiently accurate description of the effect of the environment on the proton affinities. Figure S3: Difference between the free energy difference associated with deprotonation of aspartic acid in cardiotoxin V and in the tripeptide (i.e., ∆∆G), for both the AA CHARMM36 and CG Martini force fields, when interpolating all interactions, only Coulomb interactions, Coulomb plus Lennard-Jones interactions, and Coulomb plus bonded interactions.

Titrations using single site representation
The all-atom titration simulations presented in the main text were performed with the multisite representation for all amino acids. While for His the multisite representation is necessary, the Asp and Glu side chains could have been modeled with the singe-site representation as well. To demonstrate that this choice is not relevant for the results of the simulations, we repeated the titration simulations with the single-site representation for the tripeptides, keeping all other simulation parameters the same. Comparing the titration curves obtained with the single-site representation in Figure S4 to the curves obtained with the multisite representation in Figure 3 of the main text, we conclude that both representations yield identical results. Figure S4: Titration curves of tripeptides Glu and Asp in water, using AA CHARMM36 force field in the single site representation. Charge constraints, in combination with 20 buffer particles were used to keep the box neutral in all simulations. Dots show the fraction of frames in which the residue is deprotonated, and the dashed lines represent the fits to the Henderson-Hasselbalch equation. From these fits the pK a values were estimated and included inside the graphs.

Comparison of HEWL pK a values to previous calculations
The scatter plot in Figure S5 compares the calculated pK a values for HEWL obtained in this work to those obtained by others, 7,8 as well as to experiment. 9 Figure S5: Correlation between the experimental and calculated pK a for the HEWL protein.

Convergence of computed pK a values
To monitor the convergence of the predicted pK a values in cardiotoxin V and HEWL, we computed the cumulative average of the pK a values (figure S6 and S8) and S deprot ( figure S7 and S9) at different pH values as a function of simulation time window. The time window over which S deprot and pK a were averaged was increased from 5 to 50 ns in steps of 5 ns. The results, averaged over the replicas, are shown in Figures S6-S7 for Cardiotoxin and S8-S9 for HEWL.
Within 50 ns all pK a values, as well as S deprot at high, average, and low pH values level off, indicating convergence. Because S deprot converges slower than the pK a values, convergence of the latter is no proof that the simulations are converged. The slow convergence of protonation states of Asp52-Glu35 pair in HEWL has been observed by others, who also discussed the origin for the slow convergence as well. 7,10    6 Structural analysis of proteins at different pH

Cardiotoxin V
The effect of pH on the structure of cardiotoxin V is moderate. Overall, the protein is stable over the whole pH range, and no major conformational changes are observed (figure S10,S11). Figure S10: RMSD of cardiotoxin V with respect to average structure at low, average and high pH. RMSD was computed for C α atoms. Figure S11: RMSF of cardiotoxin V residues with respect to average structure at low, average and high pH. RMSF was computed for C α atoms.

HEWL
The effect of pH on the structure of HEWL is moderate. Overall, the protein is stable over the whole pH range, and no major conformational changes are observed (figure S12,S13).
The major effect of pH on the structure is observed in loops formed by residues 40-50 and 65-75. The flexibility of these loops increases with decreasing pH. These trends are in line with previous findings for HEWL and have been discussed in detail elsewhere. 10 Figure S12: RMSD of HEWL with respect to average structure at low, average and high pH. RMSD was computed for C α atoms. Figure S13: RMSF of HEWL residues with respect to average structure at low, average and high pH. RMSF was computed for C α atoms.

Residue hydration as a function of pH
To determine if there is correlation between the protonation state and the solvent exposure of a residue, we monitored the number of water molecules interacting with the residue sidechain.
In Figures S14 and S15  are buried within the hydrophobic core of HEWL, access to water is restricted. Instead, we observe that upon deprotonation, these residues form more hydrogen bonds with the protein, as shown in Figure S16.

Charge interpolation for coupled sites
In the main text, we demonstrated that for "chemically", or topologically uncoupled sites, the gradient of the Coulomb energy with respect to the λ k coordinates can be evaluated directly from the electrostatic potential if rather than interpolating the Coulomb potential functions linearly, we interpolate the charges linearly instead. Here, we demonstrate that this is also true for chemically coupled sites within the multisite representation.
In the multisite representation, the interpolated charge on an atom i is where the sum runs over all λ k -groups that contribute to the charge of that atom. Thus, the total electrostatic interaction of this atom with the other atoms in the system does not depend on the value of a single, but on the values of multiple λ k coordinates.
The total λ-dependent electrostatic energy for a system with N g titratable residues, l, each of which with n l atoms, and described by ν l λ k coordinates coupled within the multisite representation, is: This electrostatic potential has contributions from (i) interactions between atoms that are not part of any titratable residue, V rest-rest ; (ii) interactions between the n l atoms that are part of titratable residue l, and the atoms that are not part of any titratable residue, V g-rest ; (iii) interactions between atoms that belong to different titratable residues, V g-g ; and (iv) interactions between atoms within the same titratable residue, V g . The first contribution is independent of λ k , and is not considered further.
Substituting the expression for the total charge on the atoms (Equation S25), in the second contribution yields: where N λ−groups = Ng l=1 ν l is the total number of λ-groups in the system, which is equivalent to N sites used in the main text. Because, in contrast to the single-site representation, in which the λ-coordinate connects two protonation states, each protonation state requires a separate λ-coordinate in the multisite representation, N λ−groups can be larger than N sites .
For example, the histidine side chain has two sites (N δ and N ϵ ), but is described by three λ-groups. After the substitution, the expression for V g−rest is identical to V λ−rest in equation 16 of the main text.
Substituting the expression for the total charge on the atoms (Equation S25) also in the contribution describing the interactions between atoms that are part of different titratable residues, yields: with g l (λ k ) denoting the l-th titratable residue, which has λ k -coordinate under the multisite constraint. Typically g l (λ k ) constitutes a residues with ν l protonation states. Here we have again replaced the combined sums over the N g and ν by sums over N λ−groups . The superscript "λ-λ, different groups" indicates that the Coulomb potential is computed with the interpolated charges of atoms that belong to two different groups, thus not within the same residue.