System-Specific Parameter Optimization for Nonpolarizable and Polarizable Force Fields

The accuracy of classical force fields (FFs) has been shown to be limited for the simulation of cation–protein systems despite their importance in understanding the processes of life. Improvements can result from optimizing the parameters of classical FFs or by extending the FF formulation by terms describing charge transfer (CT) and polarization (POL) effects. In this work, we introduce our implementation of the CTPOL model in OpenMM, which extends the classical additive FF formula by adding CT and POL. Furthermore, we present an open-source parametrization tool, called FFAFFURR, that enables the (system-specific) parametrization of OPLS-AA and CTPOL models. The performance of our workflow was evaluated by its ability to reproduce quantum chemistry energies and by molecular dynamics simulations of a zinc-finger protein.


Introduction
Metal ions are essential in biological systems and are involved in physiological functions ranging from maintaining protein structure and stability to directly participating in catalytic activities. 1 Approximately one-third of all proteins contain metal ions. 2 As an abundant cation in the human body, 3 Zinc ions are known to play an important role in enzyme catalysis or protein folding/stability. In aqueous solutions, Zn 2+ normally coordinates with six water molecules in an octahedral coordination geometry. However, in a protein environment, Zn 2+ is often observed to form a tetrahedral coordination structure with four ligating amino acid residues, 4 commonly His and Cys. Due to the nature of electrostatic interactions, Zn 2+ also tends to be close to negatively charged residues such as Asp or Glu. Zn 2+ is involved in various biological functions by interacting with these residues. For example, metallothioneins (MTs) 5,6 are present in all living organisms and are involved in various diseases. [7][8][9] Under physiological conditions, the four mammalian MT isoforms have Zn 3 Cys 9 clusters and Zn 4 Cys 11 clusters in their centers as functional groups. Zinc-finger proteins are another well-studied class of Zinc-containing proteins. Multiple fingers can combine together to carry out many complex functions, such as regulating DNA/RNA transcription, 10,11 protein folding and assembly, lipid binding, Zinc sensing, 12 and even protein recognition. 13 The most well characterized Zinc-finger proteins feature a binding domain with two Cys and two His residues. The study of the classical Cys 2 His 2 Zinc-finger structures is crucial for a better understanding of their broader functions.
Molecular dynamics (MD) simulations employing molecular mechanics (MM) are widely used in the study of complex biological processes, such as protein folding, protein dynamics, and enzyme catalysis because of their ability to model systems at atomic scales ranging in sizes from thousands to millions of atoms and time scales of milli-seconds. [14][15][16] The majority of current MD studies employ classical force-fields (FFs) such as OPLS-AA, 17 AM-BER, 18 CHARMM 19 and GROMOS. 20 It is a challenge for classical force-field models to describe metal-protein interactions due to the strong local electrostatic field and induction effect, [21][22][23][24][25][26] for example, computer simulation of Zinc-containing proteins has been a longstanding challenge that appears hard to tackle without explicit treatment of charge-transfer or polarization.
One approach to improve the accuracy of force-fields is to refine the parameters by fitting the model to more and more accurate experimental data or quantum mechanical (QM) calculations. For example, force-matching algorithms 27 were used to fit parameters to reproduce ab initio forces. Empirical Continuum Correction (ECC) [28][29][30] force-fields scale the charges to implicitly take electronic polarization into account. Several works 31,32 tune the Lennard-Jones (LJ) parameters or use a 12-6-4 LJ-type model to simulate charge-induced dipole interactions. These efforts have been successful to some extent, however, reparameterization is often time-consuming and labor-intensive. There are a few automatic parameterization tools, for example, CHARMM General force-field (CGenFF), 33 LigParGen, 34 and Antechamber. 35,36 These programs typically generate missing parameters for a given system based on analogies with atom types and the relevant parameters available in the corresponding FF or through parameter estimation algorithms. 37 However, the accuracy of assigning approximate parameters to a specific system is limited, and parameters already present in a given FF may also have to be optimized. FFparam 38 and ForceBalance 39 enable the tuning of existing FF parameters. All these parameterization tools share a common assumption of transferability, which assumes a set of parameters optimal for small organic molecules for a given atom type can be applied in a wide range of chemical and spatial contexts. It is well known that the presence of electron donors and acceptors can significantly affect molecular properties by polarization effects. 40 LJ parameters are also sensitive to the local environment 41,42 and long-range electrodynamic screening. 43 In this regard, a fundamentally different approach to derive environment-specific or molecule-specific parameters is proposed in references. [44][45][46] However, parameters still remain fixed despite structures and environments changing over the course of, e.g., MD simulations.
Another approach to improve FF accuracy in metalloprotein simulations is to introduce more physics in to the model. Including polarization effects is a significant step to improve force-fields. 47,48 There is growing evidence that polarizable force-fields describe ionic systems more accurately than classical force-fields. It has been found that the inclusion of polarization plays an important role in the simulation of ion channels, 49

OPLS-AA functional form
OPLS-AA is one of the major families of classical force-fields. It is used as the starting point for parameterization in this work. OPLS-AA uses the harmonic functional form to represent the potential energy shown in eq. 1.
where E FF is the potential energy of the system. E bonds , E angles , E torsions and E improper correspond to bonded or so-called covalent terms of bond stretching, bond-angle bending, dihedral-angle torsion, and improper dihedral-angle bending (or out-of-plane distortions) in the molecules. E vdW and E ele are nonbonded terms. They describe van der Waals (vdW) and Coulomb (electrostatic) interactions, respectively.
The energy terms in eq. 1 are depicted in detail in eq. 2.
where K r ij , K θ ij , V ij n , and V ij 2imp are force constants, r 0 ij and θ 0 ij are the reference bond length and bond angle, r ij , θ ij and ϕ ij are current bond length, bond angle and dihedral angle, respectively, n is the periodicity, ϕ 0 ij is the phase offset, σ ij is the distance at zero energy, ε ij sets the strength of the interaction, q i and q j are the charges of the two particles, and f ij is the scaling factor for short distances (i.e. "1-4 pairs") of nonbonded interaction.
In OPLS-AA, the pairwise LJ parameters σ ij and ε ij are calculated as the geometric mean of those of individual atom types (σ i and ε i ).
Classical force-field simulations were performed using OpenMM7, a high performance toolkit for molecular simulations. 70

CTPOL model
The CTPOL 66,67 model introduces charge transfer and polarization effects into classical force fields. Instead of a fixed-charge model, CTPOL takes the charge transfer from a ligand atom L (O, S, N) to a metal cation into account. The amount of transferred charge, ∆q L−Me , is assumed to depend linearly on the inter-atomic distance, r L−Me where a L and b L are parameters to be determined that are specific for pairs of ligand L (O, S, N) and a metal cation. The parameters a L and b L are of opposite sign, so that the magnitude of charge transfer decreases with distance. The distance at which ∆q L−Me becomes 0 is Beyond this distance, we assume charge transfer to be 0. This approximates real-life charge transfer, which is generally negligible at distances greater than the sum of the vdW radii of atoms i and j, r vdW ij .
Thus, charge on ligand atom L, q L , can be calculated as where q 0 L refers to the charge on atom L in a fixed-charge model. Polarization energy, E pol r , can be computed as where µ i is the induced dipole on atom i and E 0 i is the electrostatic field produced by the current charge distribution in the system at the polarizable site i. The summation is over the metal and the metal-bonded residues. A cutoff distance r cutoff , which is equal to the sum of the vdW radii of atoms i and j scaled by a parameter γ = 0.92, is introduced to avoid unphysically high induced dipoles at close distance. If the distance between atom i and j, r ij , is smaller than r cutoff , we set r ij equal to r cutoff . The only parameter here is the atomic polarizability: where E i is the total electrostatic field on atom i due to the charges and induced dipoles in the system.
In this work, we have implemented the CTPOL model in OpenMM via a python script, which can be found at https://github.com/XiaojuanHu/CTPOL_MD. 71 This represents a proof-of-concept implementation, which runs on CPUs. Further code optimization and a transfer to GPUs will likely speed up simulations substantially.

Reference data set
To evaluate the performance of the parameterization protocol on dipeptide and dipeptidecation systems, we created a quantum chemistry data set. The data set consists of six All DFT calculations in this work were performed with the numeric atom-centered basis set all-electron code FHI-aims. [73][74][75] The PBE 76 generalized-gradient exchange-correlation functional augmented by the correction of van der Waals interactions using the Tkatchenko- The conformers of AcAla 2 NMe, AcAla 2 NMe+Na + , and AcHisDNMe+Zn 2+ were obtained by a conformational search algorithm as shown in the studies of Rossi et al. 83 and Schneider et al. 25 First, a global conformational search was performed with the basin-hopping approach 84,85 at the force-field level (OPLS-AA). 86 The scan program of the TINKER molec-ular modeling package 87,88 was employed to perform the basin-hopping search strategy. An energy threshold of 100 kcal/mol for local minima and a convergence criterion for local geometry optimizations of 0.0001 kcal/mol were used. All obtained conformers were relaxed at PBE+vdW T S level with tier 1 basis set and light setting employed. A clustering scheme was then applied to exclude duplicates using the root-mean-square deviations (RMSD) of atomic positions. Finally, further relaxation was accomplished at the PBE+vdW T S level using tier 2 basis set and tight setting.

Parameter optimization
Optimization methods used in this work include LASSO (least absolute shrinkage and selection operator) 90 regression, Ridge regression 91 and particle swarm optimization (PSO). 92,93 If the parameters enter the force-field function in a quadratic way, e.g. V ij n , the optimization can be performed by solving a set of linear equations. In this case, LASSO and Ridge regression were employed to treat the potential overfitting. The regularization parameter λ in LASSO and Ridge regression was selected by 10-fold cross-validation. LASSO and Ridge regression were performed with Python's scikit-learn 94 library. If the parameters can not be obtained by solving a set of linear equations, e.g. the charge transfer parameters a L , PSO was employed. PSO is a powerful population-based global optimization algorithm. It relies on a population of candidate solutions, called particles, and finds the optimal solution by moving these particles through a high-dimensional parameter space based on their position and velocity. PSO was performed with the python package pyswarm. 95

FFAFFURR
Force field parameterization is an optimization problem with three challenging aspects: 96 1. The optimization problem has to be defined, which consists of the objective of the optimization and, following this, the selection of training data as well as force-field parameters to adjust.
2. In order to perform the force-field parameterization, a preferably automated procedure has to be implemented. The framework and algorithms used in FFAFFURR are explained in this paper.
3. The obtained set of force-field parameters has to be validated against other data than the training data.
Regarding item 1, the parameters of every energy term in a force-field have to be optimized since energy terms and parameters are interdependent and only adjusting a subset may cause inconsistency. Items 1 and 3, training and validation, heavily rely on high-quality data. We use DFT data for comparing potential energies and further validate by MD simulation. Some practical points were considered when establishing the FFAFFURR framework: (i) the framework should be straightforward to set up and use, (ii) it should be easy to extend with other FF parameters or functional forms, and (iii) the result should be immediately usable by a molecular simulation package. FFAFFURR acts as a "wrapper" between the molecular mechanics package openMM 70 and the ab initio molecular simulation package FHI-aims. [73][74][75] The code reads QM data directly from the output of FHI-aims and the output itself is a parameter file that can be processed by openMM. FFAFFURR is designed as the next step of the genetic algorithm package Fafoom. 89 Conformers obtained by Fafoom through global search can be directly parsed to FFAFFURR. FFAFFURR is an open source tool and can be found at https://github.com/XiaojuanHu/ffaffurr-dev/ releases/tag/version1.0.
Bond and angle parameterization K r ij , K θ ij , r 0 ij and θ 0 ij are empirical parameters of bond-stretching and angle-bending terms. The "spring" parameters K r ij and K θ ij are unaltered in FFAFFURR. The focus simply lies on the "torsional" and "non-bonded" parameters. Bond-stretching and angle-bending terms intend to model small displacements away from the lowest energy structure. We adjust r 0 ij and θ 0 ij by simply taking the average of the respective bond or angle over all local minima in the quantum chemistry data set.

Torsion angle parameterization
The torsion angle term represents a combination of the bonded and nonbonded interactions.
It has been reported that torsional parameters fitted to gas phase QM data perform similarly to those fitted to the experimental data. 97 Although torsional parameters can be derived from vibrational analysis or using vibrational spectra as target data, this approach is complicated and requires a more elaborate treatment. 38,98,99 In the case of the torsion term, force constants V ij n and V ij 2imp can be tuned by LASSO or Ridge regression to minimize the difference between the FF and QM torsional energies. The "torsions contribution" from QM E QM torsions is calculated as: where E QM total represents the total energy of a conformer from a QM calculation, E FF nonbonded , E FF bond and E FF angle represent energies of nonbonded terms, bonded terms, and angle terms from FF calculations, respectively.

Electrostatic parameterization
A key difference between FF parameter sets is the origin of the atomic partial charges.
Deriving charges from QM data is widely used. The workflow of FFAFFURR tested three choices of partial charges: Hirshfeld, 79,80 ESP 79,81 and RESP 82 charges. The charge of each atom type of the force-field is defined as the average value of QM charges. The scaling factor f ij used to scale the electrostatic interactions between the third neighbors (1,4-interactions) can also be adjusted by fitting to minimize the difference between the FF and QM energies.

LJ parameterization
Pair-specific Lennard-Jones (LJ) interaction parameters (referred to as NBFIX in the CHARMM force-fields) have been proven to better describe the interaction between cations and carbonyl groups of a protein backbone. 21  In recent years, progress has been made in the calculation of pairwise dispersion interaction strength from the ground-state electron density of molecules. [100][101][102] The interatomic pairwise parameter σ ij can be derived using the atomic Hirshfeld partitioning scheme, which has already been used in the pairwise Tkatchenko-Scheffler vdW model. With the concept of the vdW radius, the LJ energy can be written as where R min ij refers to the atomic distance where the vdW potential is at its minimum. With the definition of the free and effective atomic volume V free and V eff , R min ij is estimated as the sum of effective atomic van der Waals radii of atom i and atom j. The effective vdW radius of an atom is given by where R 0 free is the free-atom vdW radii that correspond to the electron density contour value determined for the noble gas on the same period using its vdW radius by Bondi. 103 Pairwise σ ij can be calculated as The ε ij parameter from eq. 9 can be tuned by fitting FF LJ energies to reproduce QM vdW energies by LASSO or Ridge regression.

Deriving charge transfer parameters
In all Zinc-finger proteins and most enzymes, Zn 2+ coordinates to four ligands. However, due to the setup of the QM data set with monomeric and dimeric peptides, the cations have coordination numbers (CNs) of one or two. Therefore we added a correction factor for CN in eq. 3 k, a L , and r cutoff can be adjusted by PSO. The target objective of fitting can be the QM potential energy, the QM interaction energy, or the electrostatic potential derived from electron densities. b L can be calculated with the assumption that charge transfer is zero at the cutoff distance.

Polarization energy
To get the value of atomic polarizability α i in eq. 7, we use the definition of effective polarizability of an atom in a molecule, where the free-atom polarizabilitiy is scaled according to its close environment with a partitioning: where V eff and V free are the same as in eq. 10, and α 0 free is the isotropic static polarizability.
α i is taken by averaging over all atoms with the same atom type in the quantum chemistry data set. FFAFFURR also supports slightly adjusting α i by fitting force-field energies to reproduce QM energies via PSO.

Boltzmann-type weighted fitting
The quantum chemistry data set covers a wide range of relative energies. By transitioning from, in our case, DFT to an additive force-field, even including charge transfer and polarization, we reduce dimensionality of the energy function and therewith the ability to correctly/fully represent the PES. Consequently, a force-field, describing, e.g., such a cationprotein system, cannot fully reproduce a DFT PES. Hence, it appears advisable to put focus on the accuracy of distinct areas of the PES. RMSD between two surfaces is a common fitting criteria, but this approach gives more weight to areas of the energy surface with larger absolute values, while the real weight should more closely represent the Boltzmann weight of the energy surface. Consequently, we calculate Boltzmann-type weights and apply them as a scoring function. The weighted RMSD, wRM SD, is given as: where RMSD is modified by including a Boltzmann-type factor, where A is the normalization constant (so that w i = 1) and RT is a temperature factor that has no physical meaning in the context of this application, but affects the flatness of the distribution. Our previous work 23 has shown how Boltzmann-type weighted RMSD with an appropriate choice of RT can be utilized as the objective function for force-field parameter optimization. Therefore, we implemented Boltzmann-type weighted fitting in FFAFFURR by scaling the energies with the corresponding Boltzmann-type weights.

Assessment of the energies
To evaluate the performance of the parameterization, energies of conformers in the test set calculated with optimized parameters were compared to DFT energies by mean absolute errors (MAEs) and maximum errors (MEs). The MAE for the relative energies between FF energies and QM energies is calculated as where N is the number of conformers in a given data set. ∆E i refers to the energy difference between conformer i and the lowest-energy conformer in the set. The adjustable parameter c is used to shift the FF or QM energy hierarchies to one another to get the lowest MAE.
ME is calculated as:

Molecular dynamics simulations
We performed MD simulations of the NMR structure 1ZNF 104 with different parameter sets to evaluate the performance of FFAFFURR. All MD simulations were performed using OpenMM7. 70 The structure of 1ZNF was placed in a cubic box of 68 Å side length filled with TIP3P water. Four Cl − were added to neutralize the system.

Results and discussion
To assess the performance of FFAFFURR and describe which protocol to use to create a parameter set, we optimized the parameters of OPLS-AA with FFAFFURR and extended the OPLS-AA model by the CTPOL model. The quality of optimized parameters was assessed by examining the structural stability of the Zinc-finger motif in MD simulations.

OPLS-AA parameterization
Although studies have shown that it is difficult to implicitly incorporate the polarization effects into classical FFs, 23,106 fine-tuning parameters of fixed-charge models to describe cationprotein systems is still attractive due to its low computational cost and easier parameterization. Here we tested the performance of the fixed-charge model OPLS-AA parametrized using FFAFFURR. Five systems were tested: (1) AcAla 2 NMe; (2) AcAla 2 NMe+Na + ; (3) AcCys − NMe; (4) AcCys − NMe+Zn 2+ ; and (5) AcCys − 2 NMe+Zn 2+ . AcAla 2 NMe and AcAla 2 NMe + Na + were used as reference models since the charge transfer and polarization effects caused by Na + are less than that of Zn 2+ . On the contrary, Cys − is one of the ligands that interact with Zn 2+ in proteins, and charge transfer between Cys − and Zn 2+ is significant. For each system, 80 percent of the conformers were randomly selected as the training set, and the remaining 20 percent were used as the test set.
We first demonstrate the functionality of FFAFFURR on the example of OPLS-AA parameterization. The key steps of OPLS-AA parameterization are briefly described in Figure   2 (a). We showed the ability to reproduce PES by optimizing parameters of bonds, angles, electrostatic interactions, LJ interactions, and torsional interactions. Users can choose which energy items to adjust according to their needs. In Figure 2 (a), the parameters in blue boxes are derived from DFT calculations and the parameters in red boxes are fitted by LASSO or Ridge regression as described in Sections 2.4 and 2.5. Here, we only tested RESP partial charges, the LASSO method in deriving ε ij , and Ridge regression in deriving V ij n . The parameters derived from DFT calculations are tuned first because they are considered fixed with respect to changes of the other parameters, then different tuning orders of the parameters for the other energy terms in the FF formula are tested to choose the order that gives the smallest errors between DFT and FF energies. The final order of the protocol is shown in Figure 2 (a). Figure 2 (b-f) shows the comparison of FF energies with optimized parameters after each step in Figure 2 (a). Noticeably, charges for AcAla 2 NMe, AcCys − NMe and AcAla 2 NMe+Na + were not altered since the original charges yielded errors lower than average RESP charges from QM calculations, while average RESP charges were employed for AcCys − NMe+Zn 2+ and AcCys − 2 NMe+Zn 2+ . Figure 2 (e) and (f) indicate that using average RESP charges significantly reduces absolute errors for AcCys − NMe+Zn 2+ and AcCys − 2 NMe+Zn 2+ . This could be due to the capture of charge transfer to some extent. In the case of AcAla 2 NMe and AcCys − NMe, the MAEs were improved from 2.72 kcal/mol and 3.59 kcal/mol to 0.61 kcal/mol and 0.98 kcal/mol, respectively, which is well within the 1 kcal/mol chemical accuracy.
In the case of AcAla 2 NMe+Na + , the MAE was improved from 3.99 kcal/mol to 1.67 kcal/mol. Although the optimized MAE is above the chemical accuracy, the maximum error is significantly reduced. However, in the cases of AcCys − NMe+Zn 2+ and AcCys − 2 NMe+Zn 2+ , the MAEs were improved from 51.75 kcal/mol and 43.47 kcal/mol to 16.8 kcal/mol and 16.59 kcal/mol, respectively. Although these are by numbers great improvements, the MAEs are much higher than for the other systems. Calculations based on parameters of such quality have no predictive power. This confirms the necessity of explicitly including charge transfer and polarization effects to describe divalent ion-dipeptide systems. We note that for dipeptides and dipeptides with monovalent cation systems, optimization of torsional parameters has the greatest impact on the improvement of the accuracy. Previous studies by some of us 78,107 have shown that mono-and divalent cations strongly modify the preferences of torsion angles. While for dipeptides with divalent cations, apparently, the adjustment and treatment of charge interactions plays the most important role. This further confirms that the capture of charge transfer and polarization is crucial for the accurate description of systems with divalent cation. We also note that the maximum errors are greatly reduced after the parameterization of LJ interactions of the five systems.

CTPOL parameterization
The CTPOL model introduces both local polarization and charge-transfer effects into classical force-fields. We investigated the performance of the CTPOL model on the cationdipeptide systems: AcAla 2 NMe+Na + , and two challenging systems AcCys − NMe+Zn 2+ and AcCys − 2 NMe +Zn 2+ . The major steps of the CTPOL parameterization workflow are depicted in Figure 3 (a). Following the methodology of OPLS-AA optimization, parameters unaffected by others are adjusted first, then different orders are tested to choose the order that gives smallest errors between DFT and FF energies. Charges are taken from OPLS-AA in step 1 to step 3. In step 4, charge transfer was introduced. As already mentioned, the parameters in blue boxes are derived from DFT calculations and the parameters in red boxes are fitted by LASSO or Ridge regression. The parameters in green boxes are obtained by PSO. Noticeably, α i is tuned twice. In step 3, α i is taken as the average effective polarizability calculated from the ab initio method. In step 5, we tried to slightly tune α i by PSO.
An additional round of parameterization from step 4 to step 5 can be performed to better optimize the FF parameters.
Absolute errors of each step in Figure 3 (a) are illustrated in Figures 3 (b-d). Absolute errors of optimized OPLS-AA (opt-opls) are also shown in Figure 3 to compare the performance of FFAFFURR on OPLS-AA and CTPOL models. As shown in Figure 3, the introduction of polarization effects in step 3 didn't improve the accuracy much, and the errors of the AcAla 2 NMe+Na + system even increased. This may be due to the fact that classical force-fields already take some account of polarization effects, since the charges come from fitting to reproduce quantum mechanical or experimental electrostatic field distributions. 67 Including charge transfer from ligand atoms to the cation reduces atomic charges, therefore compensating for the electrostatic potential. Not surprisingly, errors are significantly reduced after including charge transfer as displayed in Figure 3. After the parameterization, the MAEs of AcAla 2 NMe+Na + , AcCys − NMe+Zn 2+ and AcCys − 2 NMe+Zn 2+ reached 1.45 kcal/mol, 7.42 kcal/mol, and 8.12 kcal/mol, respectively. In contrast, the MAEs of the optimized OPLS-AA are 1.67 kcal/mol, 16.8 kcal/mol, and 16.59 kcal/mol, respectively.
Apparently, the inclusion of charge transfer and polarization effects better describes systems involving cations than classical force-fields, especially for systems with divalent cations. To focus the fitting on the low-energy part of the PES, we applied Boltzmann-type weights to the scoring function during the fitting of the charge transfer parameters. In Figure 4, the AcCys − NMe+Zn 2+ system is taken as an example. Figure S1 shows the Boltzmann-type weights (w i ) along QM relative energies with different temperature factor (RT) values. For all RT, the weight decreases as the relative energy increases, but increasing RT decreases the weights on low-energy conformations. Figure 4 shows the difference in mean absolute errors between unweighted fitting and weighted fitting with RT = 16. In Figure 4, the height of the bar represents the mean absolute error for conformers whose relative energies are smaller than the right node of the bar. Interestingly, the weighted fitting improves accuracy substantially in the low-energy region, while high-energy regions do not get worse.

Mean absolute error (kcal/mol)
No weighting RT=16 Figure 4: Absolute errors of optimized FF energies with respect to QM energies by weighted/unweighted fitting of AcCys − NMe+Zn 2+ system. The height of the bar represents the mean absolute error for conformers whose relative energies are smaller than the right node of the bar.

Validation with molecular dynamics simulations
The 1ZNF PDB structure 104 is one of the first Zinc-finger structures to be resolved experimentally. It is also the simplest, containing only 25 amino acids and one Cys 2 His 2 Zn 2+ binding domain where the Zinc ion is in a stable coordination geometry consisting of cysteine sulfurs and histidine nitrogens in the first coordination shell (see Figure 5). Due to its compact size, the 1ZNF structure provides an ideal case study for an MD validation of a FFAFFURR parameterization workflow. One potential application of FFAFFURR to this system is to optimize selected parameters for the interaction center ( Figure 5 bottom-left), since that is the region of most complexity.
In this paper, we used an approach similar to Li et al., 108 giving the residues in the interaction center unique residue names to distinguish them from similar residues in the rest of the protein. This allows us to target only atom types within the binding domain for parameterization, without affecting the parameters of similar atom types away from the binding site.
Four parameter sets were tested with MD in this study, as described in Table 1. For the unparameterized OPLS-AA force-field, we observed unbinding of the two histidine residues from the Zn 2+ interaction center, as shown in Figure 6, almost immediately after the start of the simulation. To try and prevent this, we optimized pair-wise LJ parameters between atoms in HisD and Zn 2+ . The parameters that are optimized are listed in Table S2. The LJ parameters between atoms in Cys and Zn 2+ are kept untouched since we haven't seen strange behaviors between Cys and Zn 2+ . The optimized LJ parameters were used in opt-OPLS-AA and opt-CTPOL sets. In the CTPOL and opt-CTPOL models, charge transfer was introduced for S/N/O/Zn atoms in the binding site, and polarization effects between non-hydrogen atoms and Zn 2+ were added. Oxygen is included as some of the structures in the ab-initio dataset have Zn 2+ interacting with a peptide oxygen atom. We ran three 40 ns long simulations with each of the four models listed in Table 1. We also used the 37 experimental NMR structures of 1ZNF to compare structural features between our simulations and NMR observations. Figure 5 shows the RMSD of each of the parameter sets, using the first model of the NMR structures as a reference. In the same figure, we also plot the RMSD of the 37 NMR models with respect to the same first model to see how much variation occurs among those.
It is clear from Figure 5  This is evident from the RMSD of the backbone, as shown in the bottom panel of Figure   5. This is primarily due to the Histidines breaking away from the binding with Zn 2+ , as supported by Figure S2.

LJ parameterization makes the CTPOL model more robust
To evaluate the effect of optimized pair-wise LJ parameters we compared the CTPOL model without any LJ parameterization (CTPOL) to the CTPOL model with LJ parameterization (opt-CTPOL). From Figure 5, it may appear that such optimization has little effect, and in fact may slightly worsen the overall structure due to the higher RMSD of the backbone.
However, while both models preserve the interaction center much better than OPLS-AA and opt-OPLS-AA, opt-CTPOL appears to produce a much more stable binding domain than CTPOL. This can be seen when we recompute RMSD after varying the initial conditions.
To test the impact of initial conditions, we ran 40 independent 1ns long simulations, with the initial frame randomly chosen from a 4 ns MD simulation and random initial velocities. These are reasonable initial conditions that should exhibit similar behavior, as they are taken from a simulation. Figure 7 shows that the 40 ns trajectory of CTPOL using the NMR structure as the starting point is more or less stable. However, when running simulations from different initial conditions, this stability is not guaranteed, as seen from the spikes in RMSD. On the other hand, opt-CTPOL appears to be stable for all initial conditions.
A reason for this is the abnormal charge transfers to Zn 2+ in CTPOL as seen in Figure 8.
This occurs around the same time as the binding domain fluctuations in Figure 7. A closer inspection of the distances between Zn 2+ and coordinating nitrogens ( Figure 9) reveals that these fluctuations are perfectly correlated with these distances. As the binding site breaks down, the coordinating histidines containing these nitrogens move far away, as much as 9 Å away, but the sulfurs remain in close proximity at all times. At such distances, the charge transfer contribution of the nitrogens drops to zero, and the only contribution is from the sulfurs, and hence the lower total charge transfer. However, opt-CTPOL appears to have no such fluctuation in either the 40 ns or 40 × 1 ns trajectories.
These unfolding events within 1ns occur about 20% of the time for CTPOL, thus making CTPOL without LJ-optimization unreliable.  In cyan, we have the charge transfer, in green, the average of the distances of Zn-N20 and Zn-N24, and in yellow the average of the distances of Zn-S4 and Zn-S7. Out of the 40 independent simulations, the average distance of Zn-N20/24 rises above 3 Å 8 times.

opt-CTPOL shows improvement with a caveat to be addressed in the future
To evaluate how parameters affect the coordination of Zn 2+ , we plotted the radial distribution function of non-hydrogen protein atoms around the cation in Figure 10 (top). We can see immediately that NMR and opt-CTPOL have a similar peak structure, but the distances are shorter in opt-CTPOL. In CTPOL, the first and second peaks, containing Nitrogens and Sulfurs respectively, overlap completely and are indistinguishable. In the NMR models, the first and second peaks at 2.0 and 2.3 correspond to Zn 2+ -N(His) and Zn 2+ -S(Cys), respectively. In contrast to CTPOL, the opt-CTPOL peaks are distinct, with only a small percentage (< 2%) of trajectories showing Nitrogens in the 2nd peak dominated by Sulfur.
These features are also seen in similar analyses of the 40 × 1 ns trajectories ( Figure 11).
Based on the analyses in this manuscript we can conclude that CTPOL does not reproduce NMR binding domain as well as opt-CTPOL even for the stable continuous 40 ns trajectory.
After identifying the peaks, and selecting a range of distances ( Figure 10 (top)), we determined which atoms comprise each peak and what fraction of the trajectory these atoms remain in that peak, as shown in Figure 10 (bottom). The 1st and 2nd peaks in CTPOL appear to be contaminated by other atom types which do not appear in NMR peaks at all.
In the 40 × 1 ns trajectory, since CTPOL binding site has been shown to break apart in a few cases, it is no surprise that water also appears in Peak 1 of CTPOL ( Figure 11 (bottom)).
The opt-CTPOL model has no other atom types in the first peak, and only relatively few others in the 2nd peak not present in NMR.
We should note that the NMR model we used does not contain any explicit water molecules. To determine if water could be present in the binding site, we looked at 15 Zinc-finger X-ray crystallography structures from the Protein Data Bank 112 (PDB) website (http://www.rcsb.org/pdb/) to find binding sites which are similar to this one (see S4 for a full list). We looked at binding sites which had a total of 2 histidines and 2 cysteines, similar to 1ZNF. We found 8 binding sites from the 15 crystal structures, and the smallest water distance to Zn 2+ was 4.38 Å, well outside even the 3rd peak range in the NMR models. We   further relaxed the matching criterion for the binding site to any binding site that contains a total of 4 histidines or cysteines (i.e., the number of coordinating histidines and cysteines sum to 4, but does not have to be 2 each). This resulted in a total of 60 binding sites. From these, we found the smallest water distance to be 3.98 Å, still beyond the peak 3 range.
Thus, the inclusion of water in the 1st and 2nd peaks, as is the case in CTPOL model, is uncharacteristic of Zn-finger binding sites of similar nature to 1ZNF. The opt-CTPOL model does a better job of keeping the water outside these peaks, with only a small fraction of water in the 2nd peak.
Angle and distance distributions Figure 12: Binding site with Zn 2+ at the center (grey atom), the sulfurs from Cys7 (S7) and Cys4 (S4), the NE nitrogens from His20 (N20) and His24 (24). Hydrogens have been removed for clarity. The yellow triangle on top has vertices on Zn, S4 and S7, while the blue triangle at the bottom has vertices on Zn, N20, and N24. The angle between the planes of these triangles are used for plotting the distributions in Figure 14. The red and blue arrows ( ⃗ S bi and ⃗ N bi ) are vectors that bisect angles S7-Zn-S4 and N20-Zn-N24 respectively. The distributions of angle θ between these two bisectors are plotted on Figure 14 (b). The distributions of some of the distances between the 5 atoms shown in this figure are shown on Figure 15, while the distributions for some of the angles are shown in Figure 13.
To further evaluate the stability and accuracy of the binding domain in the CTPOL and opt-CTPOL frameworks, we analyzed a number of geometric quantities which are defined in Figure 12 and its caption. Here we only consider the 40 ns continuous trajectory for which the binding domain is stable for CTPOL, since these geometric quantities would not make sense for the 40 × 1ns trajectory where the binding domain destabilizes. Figure 13 shows the distribution of most of the angles that the coordinating atoms make with Zn 2+ . Additionally, Figure 14 (a) shows the distributions of angles between the planes shown in Figure 12, and Figure 14 (b) shows the distributions of the angles between the bisectors, also defined in Figure 12. It is quite clear that opt-CTPOL reproduces the NMR distributions of angles as well or better than CTPOL. The distribution of the S4-Zn-S7 angle appears to agree particularly well with NMR, as does the angle between the bisectors.
While the CTPOL 40 ns trajectory showed a slightly better overall RMSD from Figure 5, it is clearly not reproducing these angles as well as opt-CTPOL. This implies that opt-CTPOL is maintaining the shape of the binding domain better, which is in accordance with the RDF distribution and peak analysis of Figure 10.

Distance (Å)
Probability density Probability density Figure 15: Probability distribution of distances (using Kernel Density Estimation 109,110 ) over entire trajectory (for simulations) and over 37 models (for NMR data).

Issues we observe and their possible origins
The main issue we face is clearly the contraction of distances in the Zn 2+ -binding site in comparison to NMR data. To investigate the source of this issue, we first analyzed the relevant distances in the ab initio data set, as depicted in Figure 16. We see that the Zn-N distances are mostly around the 2 Å mark, and Zn-S distances are mostly around 2.3 Å, as it is indicated by the maxima of the histograms. In both cases, they cover a wide range of energies as well, suggesting that these are preferred distances for various conformations of the dipeptide. However, these are (certainly almost exclusively) minima on the PES.
The larger distances in Figure 16 are due to coordination of Zn with other atoms (such as carbonyl oxygen). To make a meaningful comparison with the 1ZNF system, we further filtered the QM data down to where the nearest atom to Zn is atom SG or atom NE2, since that is the case in the 1ZNF system. From the filtered data, we computed the average distances, and compare them in Table 2, where it is clear that the QM data agrees well with NMR, unlike opt-CTPOL, which systematically gives shorter distances. Given that the shorter distances are not an artefact that was carried from the gas phase QM data, in the following we discuss a few likely causes for this behavior of our model: • We demonstrate in this paper (see Figure 3) that our approach can properly match energy hierarchies. However, the focus on equilibrium geometries may result in insufficient treatment / parameterization of repulsive (off-equilibrium) geometries. The effect may even be amplified due to the Boltzmann-weighting used in the parameter optimization process.
• The Lennard-Jones-(12,6)-potential is intended to model pairwise repulsive and non-Coulomb attractive interactions among atoms. However, due to the nature of classical Relative Energy (kcal/mol) Figure 16: Distribution of ab-initio distances, with bars representing histograms of the numbers of conformers (left axis) over distance range, and points representing the distance and energy of individual conformers (right axis). The plot on the left takes distances between Zn 2+ and S(Cys) from AcCys − -NMe + Zn 2+ ab-initio conformations. The plot on the right takes distances between Zn 2+ and N(His) from AcCys − -His + Zn 2+ ab-initio conformations.
force-field formulations, it may also contain aspects of polarization, dipole-dipole interactions etc. These aspects may be incorporated in parameters as well as in the functional form (e.g. in the 12-6 form).
• Although we are matching MM energies at the exact same conformations, we do not know if the MM minima themselves correspond to those conformations. It is possible that we are simply matching the MM energies at the QM minima, but the MM minima may be in different states altogether, in this case, those that result in shorter ligand distances.
• None of the ab initio references contain both Sulfur and Nitrogen in the coordination of Zn 2+ , or even specifically S-N, N-N, and S-S pairwise interactions. We only parameterize Zn 2+ -ligand LJ interactions in this work as an example, but it is possible that optimizing ligand-ligand interactions will improve the result.

Conclusion and outlook
The availability of sufficiently accurate force-field parameters for cation-peptide systems is a major obstacle in metalloprotein simulations. One approach to facilitate the development of new force-field parameters is to construct tools to derive parameters from QM calculations.
The benefits of such an approach was shown in previous work 23 on the QM-driven parameterization of Drude and CTPOL models for ion-protein interactions in MD simulations.
Since the explicit Drude model includes polarization as a degree of freedom subject to forces, it requires shorter time steps and a dual thermostat and it introduces additional parameters such as the mass of Drude particles and spring constants. CTPOL -as an implicit model -requires the minimization of dipole moments at every time step, but allows for normal time steps. Both models exhibit their own challenges and -in particular our re-implementation of CTPOL -their own room for improvements.
FFAFFURR is developed as a python tool to facilitate the parameterization of classical and polarizable CTPOL models. In this paper, we chose to parameterize OPLS-AA as an example, although the tool can be adjusted to work with other similar force-fields such as CHARMM and AMBER once the code is generalized. It automatically parses QM calculation outputs from FHI-aims and generates parameter files that can be directly processed by the molecular dynamics package OpenMM. FFAFFURR also allows users to choose which energy terms to adjust.
We utilized an extensive data set of model peptide-cation conformers from DFT calculations. Structures cover an extended relative energy range and all required properties for the parameterization were extracted. Due to the sheer size of the utilized data set -by means of number of conformers and energy range -we think that we have covered sufficient individual structural diversity to properly derive parameters for the FF energy terms. The performance of optimized parameters in each energy term was evaluated by comparing of FF energies and QM potential energies.
We showed that the CTPOL model outperforms OPLS-AA in terms of the accuracy of reproducing the QM energy hierarchies for divalent-dipeptide systems.
One potential usage of FFAFFURR is the rapid construction of FFs for troublesome metal centers in metalloproteins. We tested this function by performing MD simulations on the 1ZNF Zinc-finger protein 104 and comparing simulation results with NMR models.
With the parameters optimized from FFAFFURR, we found that CTPOL much better reproduces the overall structure of the protein, while with OPLS and opt-OPLS, the Zn 2+ binding site unravels. However, to better stabilize and reproduce structural features of the binding domain, LJ optimization (opt-CTPOL) was necessary, since CTPOL alone had some shortcomings in correctly reproducing the binding domain, or keeping it stable under various initial conditions. The LJ optimization resulted in coordination composition and geometry that better agrees with the NMR models than CTPOL alone.
On the other hand, the optimization of LJ does lead to a shrunken binding domain. While we briefly discussed reasons for this behavior above, we focus here on possible remedies: • A possible lack of off-equilibrium geometries in the parameterization could be tested first by omitting the Boltzmann-weighting. This would give more impact to higherenergy structures that are more likely to feature at least partly more compressed structures. Maybe this already would lead to improved geometries.
• The repulsive part of the Lennard-Jones potential could be better captured by locally exploring the PES by adding points through dislocating atoms in the cation interaction site. This could either be realized by carefully altering Cartesian coordinates by small values in the +/− space directions or through short DFT-based MD simulations. This would also capture whether or not the MM-minima are at those conformations.
• The functional form of the Van der Waals potential may be better generalized by alternatives to the Lennard-Jones form which allow for better tuning of the shape of the repulsive components. For example, the Mie potential is a generalized version of Lennard-Jones where the exponents and coefficients of each term can be varied. The Buckingham potential is another alternative where the repulsive term is replaced with an exponential decay with variable amplitude and decay rate. Such adjustments of the Van der Waals potential also requires a data base that contains sufficient off-equilibrium geometries.
• Lastly, the dataset could be expanded to contain models with more than one aminoacid side chain, in order to capture complex coordination structures, allowing us to parameterize other pairwise interaction such as Sulfur and Nitrogen, and not just Zn 2+ and ligand atoms.
In summary, FFAFFURR has a wide range of functions to facilitate and perform parameter optimization in peptide systems. It can provide better additive force-field parameters for systems with no cations, as seen from Figures 2b and 2c. Additionally, FFAFFURR and can provide almost all the functions required for the cation-peptide parameterization process, including for CTPOL force-fields. FFAFFURR helps users remove labor-intensive steps in force-field optimization.
Despite the success of FFAFFURR in this study, we see several directions to discuss in future research. Note that only the parameters of the Zinc-finger protein interaction center were optimized with FFAFFURR in the MD simulation, while the standard OPLS-AA parameters were used for the rest of the protein. While our study indicates the compatibility of the optimized parameters with the standard FF parameters, this may have to be investigated in more detail in a future study. One characteristic of FFAFFURR is that it can be employed to derive parameters for a specific system. This helps to grasp the specific environment of the system. However, QM calculations are required when a new system is under investigation. We created a data set of cation-dipeptides containing several divalent cations, which can be automatically parsed to FFAFFURR. 69 If the user's system goes beyond the scope of the dataset, an in-house genetic algorithm package Fafoom 89 can be used to generate conformers and do the QM generation fast and automatically.

Available data and code
The reference data can be found on the NOMAD repository via the DOI  Table S3: The CTPOL parameters. The a and b are parameters in eq. 12, r is the cutoff distance. The correction factor k in eq. 12 is set as 3.418.   Angle between His1 C and His1 N planes Figure S3: Probability distributions of select dihedral angles. (top) The dihedral angle between the two coordinating histidine planes. The planes were determined using the CG, CD, and CE atoms of histidine. (bottom) The dihedral angle between plane defined by His1 CG, CD, and CE1 atoms, and plane defined by His1 CG, ND, and NE atoms. This is to check for internal distortion of the plane. The values are close to 180 (instead of 0) because one set of atoms goes clockwise, and the other counter clockwise, when defining the planes.