Pseudo-Improper-Dihedral Model for Intrinsically Disordered Proteins

We present a new coarse-grained Cα-based protein model with a nonradial multibody pseudo-improper-dihedral potential that is transferable, time-independent, and suitable for molecular dynamics. It captures the nature of backbone and side-chain interactions between amino acid residues by adapting a simple improper dihedral term for a one-bead-per-residue model. It is parameterized for intrinsically disordered proteins and applicable to simulations of such proteins and their assemblies on millisecond time scales.


INTRODUCTION
Molecular dynamics simulations provide insights into the properties of biomolecular systems. They make use of empirical potentials 1−3 that depend on the length scale of the description. The atomic level descriptions 4−9 are substantially distinct from the coarse-grained (CG) residuelevel descriptions in which a residue is represented by a single bead. In between, there are CG models in which several atoms are grouped into beads with still different sets of potentials. 10−14 The one-bead-per-residue CG protein models are especially useful when analyzing large systems at long time scales such as those occurring in the dynamics of virus capsids (assembly and indentation) 15−17 or protein stretching at nearexperimental speeds. 18−20 The simplest versions of such a model are structure-based or Go-like, 21−24 meaning that all parameters in the potentials are derived from the experimentally determined native structure 25−28 and the solvent is implicit. There is no unique way to construct a Go-like model because its most important descriptor is the contact map, and there are various criteria to define it. In addition, various contact potentials and the backbone stiffness terms can be employed. Our benchmarking to the stretching experiments 24 indicates that the Lennard-Jones (LJ) potential between the effective beads combined with the native contacts determined through an atomic overlap criterion 29−31 can correctly recreate protein dynamics of stretching and, in addition, leads to proper folding. The criterion involves checking for an overlap between spherical spaces associated with heavy atoms in a pair of residues in the native state. Its presence introduces an attractive potential well, and its absence results in a soft repulsion between the beads.
The structure-based approach clearly cannot be applied to the intrinsically disordered proteins (IDP) 32 because such proteins dynamically adopt significantly differing conformations and there is no dominant "native state". Sampling their rich energy landscape may be challenging for all-atom models, especially in the case of simulating processes involving multiple chains, like aggregation. 33 Short all-atom simulations may still be used to parameterize CG models built solely for the purpose of simulating one specific system (like in the multiscale approaches 34−36 ); however, our goal is to construct a CG potential that is transferable to many systems.
We have argued 37 that the contact-based CG description for IDPs is still possible provided that the contacts are determined dynamically from the instantaneous shape of the backbone, as described by the locations of the C α atoms, and are thus allowed to form and disappear in an adiabatic fashion. The contacts can effectively arise either from the side-chain−sidechain, side-chain−backbone, or backbone−backbone interactions. This model also includes electrostatic interactions as described by a Debye−Huckel (D-H) potential, 38 and it leads to a reasonable agreement with experimental and all-atom theoretical results pertaining to the average geometry of the conformations for a set of systems. It is also appealing computationally because it effectively involves only two-body interactions. We have already used this model to determine the phase diagram for aggregation of polyglutamines, 39 which involved simulating 1800 residues for over 1 ms.
However, switching the contacts dynamically on and off violates the detailed balance and can lead to nonequilibrium stationary states, which have been studied, for example, in the context of active biomembranes. 40−42 One may hope that sufficiently slow, adiabatic switches may mask such glitches, but it is nevertheless desirable to construct a model without any time-dependent potentials.
In our approach, we distinguish side-chain and backbone interactions using only the positions of the C α atoms. Determining the type of interaction between two residues in the previous model required knowledge of the positions of six residues, but after the contact was quasi-adiabatically turned on, the forces were acting only between a pair of residues. Without this switching, we could either return to Monte Carlo sampling (where the idea was originally implemented 43,44 ) or introduce a multibody term in the potential (such terms are crucial in reproducing the fine structure of residues that is lost in coarse-graining 45, 46 ). An example of such a term is the one used for dipole−dipole interactions that can describe hydrogen bonding in the protein backbone. 47 Here, we propose a new and empirically motivated molecular dynamics model in which the short-range interactions are represented by time-independent four-body potentials. The point of departure is an observation that the backbone stiffness energy involves computation of a four-body dihedral potential 48 that restricts the angle between two planes, each set by three residues. This potential can be used in an "improper" way to take into account the rigidity of the side chain in a two-bead-per-residue model. When two residues interact, this picture can be further simplified by removing the side-chain bead (in our one-bead-per-residue model, the beads are centered on the C α atoms). We can still use the improper dihedral term by replacing the side-chain bead by the bead of the second residue that participates in the interaction and vice versa: the side-chain bead of the second residue in the interacting pair can be replaced by the bead representing the first residue. The planes defined by this procedure are shown in Figure 1. Thus, despite using the four-body terms, we still retain the pairwise nature of interactions. In our model, each contact between residues is described by two pseudoimproper-dihedral (PID) angles associated with each residue.
We made a survey of the structures from the Protein Data Bank, which showed specific patterns made by the PID angles. Those patterns are clearly different for the interactions made by the backbone and the side-chain groups, which proves that we can distinguish these two cases in our approach.
We show that our new model can successfully recreate the experimentally determined radii of gyration (R g ) for a set of 23 IDPs. Because replacing the quasi-adiabatic contacts with fourbody PID potentials significantly changed the dynamics, we had to reparameterize all aspects of the model. In order to quantify which variant of the model after reparameterization agrees best with the experiment, we computed Pearson coefficients that show how close the simulation and experimental results are. The best variants surpass our previous model.
We also compared energy distributions of the models, which proved that adiabatic switching caused discrepancies from the Boltzmann distribution. Those discrepancies were not present in the new model with the PID potential. However, the new model turned out to require significantly more computational resources.

METHODS
2.1. Results of the PDB Survey. Virtually all proteins are made from the same set of 20 amino acids, and the geometry of interactions between the residues should be similar for the case of IDPs and structured proteins even if the relative occurrence of those interactions may be different (since, e.g., the IDPs contain much fewer hydrophobic residues). Therefore, we made a survey of 21,090 structured proteins from the CATH database 49 (the set of proteins with the sequence similarity not exceeding 40%: cath-dataset-nonredundant-S40.pdb) to determine which PID angles are favorable in the inter-residue interactions. We computed the values of PID angles for each pair of contacting residues from the database, where a contact between residues is defined through the overlap criterion. These heavy atoms may be a part of a backbone or a side chain. If a backbone atom from one residue overlaps with a side-chain atom from another residue, we call it a backbone−side-chain contact (bs). We analogously define side-chain−side-chain (ss) and backbone−backbone (bb) contacts. There may be many overlaps, so one residue pair can form more than one type of contact simultaneously.
In order to associate a characteristic set of PID angles and distances with a unique type of contact, we derive subdistributions corresponding to situations in which the overlaps arise only in one class of atoms (e.g., only ss). Table 1 shows how many overlap contacts of each type are in the database. Figure 2 shows PID angle distributions for the ss, bs, and bb cases. The distinction between contact types is based on the overlap criterion described above (green histograms) or on a subset of these contacts that fulfill directional criteria introduced in the previous model 37 (red histograms). Figure  2 shows distributions of contacts that are only one type (e.g., only bb overlaps). Figure S2 in the Supporting Information is the same as Figure 2, but contacts can be of more than one type (differences are minor and refer only to the bs case). The Boltzmann inversion potential V B = − k B T ln (p(Ψ PID )) made from these distributions was fitted by an analytical function described in the next section. We observe that side-chain and backbone contacts are associated with different sets of the PID angles. The two minima for the bb case correspond to the parallel or antiparallel β sheet or to the right-or left-handed α helix (for the i, i +3 contacts, only one minimum is present, which reflects right-handedness, see Figure S3 in the Supporting Information). Figure 1. Idea of the PID angles. The interaction between residues i and j involves angles Ψ ij (defined by i − 1, i, i + 1, and j beads) and Ψ ji (defined by j − 1, j, j + 1, and i beads). Journal of Chemical Theory and Computation pubs.acs.org/JCTC Article Figure 3 shows two-dimensional distributions, where the PID angle Ψ PID is on one axis, and the C α −C α distance r is on the other axis. Different side chains result in different distance distributions, as shown for GLN or ALA residues, but the PID angle distributions stay mostly the same (with the exception of special cases, PRO and GLY). We find that the backbone contacts correspond to smaller and better defined distances than the side-chain contacts, which result in a diffuse cloud for Ψ PID ≈ 0 rad and r > 6 Å (side chain) and sharp peaks for Ψ PID ≈ ±1 rad and r < 6 Å (backbone). Two-dimensional distributions of the PID angle and distance for contacts made by all 20 types of residues are available in Figure S4 in the Supporting Information, and one-dimensional distance distributions used to determine the equilibrium distances for ss interactions (r min ss ) are available in our previous article; 37 however, the values of r min ss are reprinted in Table S1 in the Supporting Information.
2.2. Implementation of the PID Potential. In our model, the interaction between two residues depends on their distance and the two PID angles they make. Therefore, we chose our PID potential to be a product of three terms: V(ψ A , ψ B , r) = λ A (ψ A )λ B (ψ B )ϕ(r), where ψ A is the first PID angle in a pair, ψ B is the second angle in the pair, and r is the C α −C α distance. As the first approximation, we decided to use the cosine function for λ and the LJ potential for where r min is the minimum, and ε LJ is the depth (discussed later). Due to the broad character of the bs distribution (green histograms in Figure 2), we take only the bb and ss contacts into account (see section 2.3 in the Supporting Information). This feature is distinct from our previous model in which the bs interactions were included. 37 For the bb and ss interactions, we have clearly defined peaks that can be fitted to the potential function (separately for PID angles and distances). Each peak has a different width and center, so the detailed form of the λ function is Each pair of the ss contacts has r min ss corresponding to the minimum identified in our previous work. 37 Because for r < r min the ϕ(r) potential becomes strongly repulsive, α parameters for ss and bb contacts must be chosen so that if λ bb ≠ 0, then λ ss = 0 and vice versa.
Because the bb PID angle distribution has two peaks, the bb potential has two terms corresponding to both of them (ψ 0 bb+ and ψ 0 bb− ). Therefore The repulsive part of the LJ potential should always be present for small distances to prevent the residues from passing through one another (excluded volume effect). This is why the bb terms have a more complicated form:  This formula ensures the presence of the excluded volume because, for r < r min bb , the LJ potential ϕ bb is no longer multiplied by the λ bb factors. For the ss contacts, V ss (ψ A , ψ B , r) = λ ss (ψ A )λ ss (ψ B )ϕ ss (r) and r min ss depends on the types of amino acids in the pair in contact (see Section 2.4 about the details of the LJ potential). The total PID potential is V = V ss + V bb .
Fitting the function to the Boltzmann inversion potential based on the contact distributions of only one type (bb, bs, or ss) that fulfill directional criteria defined in ref 37 (red histograms in Figure 2) resulted in the following parameters: α bb+ = 6.4, α bb− = 6.0, α ss = 1.2, ψ 0 ss = − 0.23 rad, ψ 0 bb+ = 1.05 rad, and ψ 0 bb− = − 1.44 rad. Fitting to the distributions that do not have to fulfill directional criteria (green histograms in Figure 2) resulted in a model that agreed poorly with the experiment (see section 3.2 of the Supporting Information). In order to improve numerical efficiency, the cosine function was replaced by its algebraic approximation, defined in section 1 of the Supporting Information.
Values of r min are given in ref 37 and Table S1 in the Supporting Information. In Section 3, we denote the pseudoimproper-dihedral potential by letter P and the old, quasiadiabatic model by letter A. , where m is the average amino acid mass, r⃗ i is the position of the residue, F ⃗ i is the force resulting from the potential, γ = 2m/τ is the damping coefficient, and Γ ⃗ i is the thermal white noise with the variance σ 2 = 2γk B T. The time unit τ ≈ 1 ns was verified for a different model with the same equations of motion, 50 and even if for this new model τ is expected to be slightly different, it should still correspond to an overdamped case with diffusional (not ballistic) dynamics.
The residues in our model are connected harmonically with the spring constant k = 100 Å −2 ·ε and equilibrium distance 3.8 Å (ε ≈ 1.5 kcal/mol is the energy unit 51 ). The backbone stiffness potential in our model consists of a bond angle and (proper) dihedral terms. Its depth and form were obtained from a Boltzmann inversion potential based on a random coil library. 52 Its exact analytical form is described in ref 37. It is defined in kcal/mol, so it can be used to verify what is the effective room temperature. The results for polyproline, which cannot form side-chain−side-chain contacts (see the next section) and has high backbone stiffness, indicate 37 that simulations for the room temperature 0.38 ε/k B give the best agreement with experimental values for the polyproline end-toend distance, so we set the temperature to 0.38 in the reduced units.
Because the potential for backbone stiffness, as in the previous model, is based on a random coil library, it does not favor any secondary structure. In order to make structures like α helices or β turns possible, we allow attractive contacts between the ith and i + 3rd residues in the chains as these contacts correspond to hydrogen bonds between backbone atoms in an all-atom representation. 53 However, the nature of i, i + 4 contacts is different (see section 2.1 in the Supporting Information), so in the results, we tried models with i, i + 4 contacts (+ in superscript) and without them (− in superscript).
2.4. Depth and Form of the Lennard-Jones Potential. The energy unit ε = 110 pN·Å ≈ 1.5 kcal/mol is taken from the earlier models, 24,37 and although it corresponds to the strength of one contact in those models, it is not necessarily the case for the PID model. We keep ε bb = ε because it is roughly equal to the energy of one hydrogen bond made by the protein backbone, 54 but the varied nature of ss contacts required parameterization. We checked values between 0 and 1 ε for the uniform ε ss potential (we denote this case as ME) and tried two matrices where ε ss depends on the pair of residues: the classic Miyazawa−Jernigan matrix based on PDB statistics from 1996 55 (denoted MJ) and the MDCG matrix derived from all-atom simulations 56 (denoted MD). We also tried scaling them by a factor from 0 to 1 (in the results, the factor is denoted in subscript). In all three cases (ME, MJ, and MD), ε ss = 0 for PRO and GLY residues.
The distance distributions for some pairs of residues are very broad. 37 This is caused by the ability of longer side chains to deform and adapt different conformations. This ability is lost in any CG model in which each residue is represented by a spherical bead. To correct this, one may imagine a modified form of the LJ potential for ss contacts: We tested both this modified form (denoted by letter F, for flat) and the traditional form (denoted by letter L) of the LJ potential.
2.5. Electrostatics. Our previous model used a Debye− Huckel-screened electrostatic potential with electric permittivity ϵ = 4 Å/r depending on the C α − C α distance r following the approach of Tozzini et al. 38 (thus we denote this version of electrostatic interactions by letter T). However, this approach was designed for structured proteins, where the permittivity inside the hydrophobic core is significantly lower than in water. IDPs lack this core and are more solvent-exposed, so we tried a simpler term with ϵ = 80 (we denote this term by letter C). In both cases, the form of the electrostatic potential is , where s is the screening length that depends on the ionic strength. In our model, histidine is considered to be uncharged.

RESULTS AND DISCUSSION
We tested our model on a benchmark of 23 IDPs whose radius of gyration was measured in SAXS (small-angle X-ray scattering) experiments 57−67 (their sequences, screening lengths used, and a Pappu diagram 68 are in section 3.1 of the Supporting Information). We considered over 200 variants of the model: with a pseudo-improper-dihedral potential (denoted by letter P) or the previous version with a quasiadiabatic potential (letter A), with or without i, i + 4 attractive contacts (+ or − in superscript), with the standard (letter L) or flat (letter F) form of the LJ potential, different depths of this Journal of Chemical Theory and Computation pubs.acs.org/JCTC Article potential (we considered three residue−residue matrices: uniform, ME; Miyazawa−Jernigan, MJ; and all-atom based, MD; all scaled by a factor denoted in subscript), and with standard (letter C) or distance-dependent (letter T) electric permittivity. Thus, the name of each version is made from four symbols (the full legend is available in Table S4 in the Supporting Information), e.g., A + L ME 1 T is the model with quasi-adiabatic switching, attractive i, i + 4 contacts, LJ potential with uniform ε, and electrostatics used by Tozzini et al. 38  Applying a pseudo-improper-dihedral potential to this model results in worse agreement with the experiment (panel (b)), so we reparameterized the model by changing five features described above. The end result (panel (c)) has much better agreement with the experiment. We checked that this is indeed the result of using the new form of the potential as applying the reparameterized features to the previous model (panels (d) and (e)) improves it only slightly. Panel (f) shows the results of the Gaussian chain model 69  , where n is the number of residues, and b is the effective Kuhn length that may be treated as a fit parameter. By fitting with the least-squares method to the set of 23 IDPs, we obtained b = 6.7 Å.
The R g values for the set of 23 IDPs have been measured in SAXS experiments. We note that the direct result of a biomolecular SAXS experiment is a scattering profile that contains information not only about the value of R g but also about the average shape and size of the IDP under study. It is possible to compute such a scattering profile from an ensemble of simulation structures and compare it with the experiment, 70 but this task involves fitting parameters and requires raw experimental data, which are not available for many proteins used in the testing set. On the other hand, the value of R g is a single number that can be easily compared to simulation results. For these reasons, we decided to compare our results only with the R g values. An example of a comparison with a SAXS intensity profile for protein 6AAA is shown in Figure  S12 in the Supporting Information.
Another way to validate our model is to perform a histogram test, 71 where we can check if our simulation is consistent with the Boltzmann distribution of energy: the state with energy E should occur with the probability p(E) = Ω(E) exp ( − E/ k B T)/Q, where Ω(E) is the state density, and Q is the normalization factor. If we perform simulations using two different temperatures T 1 and T 2 , then we can compute the following quantity: log(p 1 (E)/p 2 (E)) = log (Q 1 /Q 2 ) + E · (1/ k B T 2 − 1/k B T 1 ). This quantity should depend on the energy linearly with the coefficient (1/k B T 2 − 1/k B T 1 ). We can plot this dependence (treating log(Q 1 /Q 2 ) as a fit parameter) for the new version (P − F MD 0.1 C) and previous version (A + L ME 1 T) of the model. 37 Such a plot for proteins his5 (24 residues), NRG1 (75 residues), and p53 (93 residues) is shown in Figure 5. The data points obtained with the new model lie closer to the line with the coefficient ( The new version of the model has significant advantages over the previous one: it better agrees with the experimental data and the Boltzmann distribution. However, even though  Journal of Chemical Theory and Computation pubs.acs.org/JCTC Article the computational speed of both models scales almost linearly with the system size, the new model is usually at least five times slower (see Figure 6).
The new model is also harder to parallelize (due to multibody terms) and does not perform well for folded proteins: we tried folding small proteins (PDB access codes 1L2Y, 1UBQ, and 1ERY) with it but arrived with an RMSD of 6 Å or higher even for the smallest protein, 1L2Y (see section 4 of the Supporting Information). This is not surprising because IDPs have very weak inter-residue interactions: the top seven variants of our model considered in IDP parameterization had their residue−residue matrices multiplied by a factor smaller than 0.5 (see Figure 7). It is interesting to note that two out of our top seven variants multiply the matrix by 0, meaning that there are no interactions with the exception of excluded volume, backbone stiffness, and electrostatics. A simple Gaussian chain model also works quite well for IDPs (panel (f) in Figure 4). This proves that water is a good solvent for IDPs, and their inter-residue interactions affect chain dimensions in a minor way. 32 This fact was also used in a recent hierarchical approach for studying IDPs, where only local fragments are modeled in detail, and the whole chain is constructed from those fragments. 72 The top-ranked variants of the model may be further refined by fine-tuning the ss interactions. In future studies, our new CG model can be used to explore conformational dynamics of IDPs and their assemblies. It can also be adapted to study physical properties of flexible linkers in multidomain proteins. 73,74 In this case, each of the structured domains can be kept stable by a Go-model potential, 24 and each of the linkers (as well as their interactions with the domains) can be described by the PID potential.  Journal of Chemical Theory and Computation pubs.acs.org/JCTC Article