Using Local States To Drive the Sampling of Global Conformations in Proteins

Conformational changes associated with protein function often occur beyond the time scale currently accessible to unbiased molecular dynamics (MD) simulations, so that different approaches have been developed to accelerate their sampling. Here we investigate how the knowledge of backbone conformations preferentially adopted by protein fragments, as contained in precalculated libraries known as structural alphabets (SA), can be used to explore the landscape of protein conformations in MD simulations. We find that (a) enhancing the sampling of native local states in both metadynamics and steered MD simulations allows the recovery of global folded states in small proteins; (b) folded states can still be recovered when the amount of information on the native local states is reduced by using a low-resolution version of the SA, where states are clustered into macrostates; and (c) sequences of SA states derived from collections of structural motifs can be used to sample alternative conformations of preselected protein regions. The present findings have potential impact on several applications, ranging from protein model refinement to protein folding and design.

extracted from a non-redundant subset of the PDB and classified in 517 subclasses. Among these, 21 have more than 100 members and were considered for further analysis (ArchDB.BN4.100). The structures were downloaded from the PDB and encoded with the M32K25 SA and its reduced version rSA. Only the loops with all the C α atoms resolved were considered, for a total of 3314 structures.
The loops in the 5 most populated rSA motifs shown in Figure 5 were clustered with the gromos 1 method and a cutoff of 0.6 Å on C α atoms.

Selection of KUN loops
We identified loops with similar shape from structurally related proteins by filtering the ArchDB.BN4.100 database for β-hairpin loops in Immunoglobulin-like domains (ArchDB.BN4.100.Ig in Table S3), which are ubiquitous and have a very well conserved fold 6 . The loops with one of the most populated rSA encodings ('KUN', Table S3) were then selected. A structural superimposition showed that KUN structures are clustered in two groups (KUNg1 and KUNg2), with KUNg2 structures featuring a more bent conformation around the residue in position p4 compared to KUNg1 and significantly different Ramachandran plots at the p4 and p5 position ( Figure 6). Inspection of the sequences ( Figure S5A) shows that a charged or aromatic residue in position p5 in KUNg1 loops is replaced by a Gly residue in the KUNg2 group, which allows for the larger bending of the loop in KUNg2. Two sequences representative of the two groups were selected for the simulations (bold in Figure S5A), namely the D-E loop from the Novel immunetype receptor 10 (KUNg1) and the C-D loop from the B and T lymphocyte attenuator (KUNg2).

System setup -folding simulations
The GROMACS 4.5.5 program 7 was used to prepare the initial system coordinates. The Amber99SB*-ildn 8 force field was used for all the simulations.
The experimental structures of the fast folders GB1 β-hairpin (number of residues N = 16) and Trp-Cage (N = 20) and of the immunoglobulin β-hairpins KUNg1 (N = 14) and KUNg2 (N = 10) were extracted from the PDB (Table S10). The protein termini were capped with ACE (N-term) and NME (C-term) groups. The charge of the ionisable residues was set to that of their standard protonation state at pH 7. To generate starting unfolded conformations, preliminary simulations were run where the number of C α contacts was steered towards low values. Unfolding simulations were stopped when extended conformations with no significant secondary structure content were found. The unfolded structures were then solvated with an octahedral box of TIP3P water molecules, with an initial minimal distance between the protein and the box boundaries of 8 Å. The systems were then neutralized by adding the appropriate number of counterions, resulting in systems with a total number of atoms ranging from ~ 10,000 to ~ 30,000 (Table S10).

System setup -CASP8 target for model refinement
The protein used for model refinement (XF2673 from Xylella fastidiosa, UniProt ID Q87A02) is the TR464 target of the model refinement category in the 8th community wide experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP8) 9 . Simulations were performed on the initial model provided to the CASP8 participants by the organisers. This was the best model among those submitted in the normal structure prediction exercise. The target description provided to the participants indicated the regions around residues 24 and 42 (41 and 59 in the best model numbering) as problematic, i.e. as having a particularly large deviation from the experimental structure. A PSIPRED 10-11 prediction of secondary structure (version 3.3, default parameters) indicated irregular or extended elements in the region around residue 24 (residues 19-28) ( Figure S6A) instead of the helical elements found in the model structure. Residue 42 is instead found in a β-hairpin loop (residues 40-43). Comparison with the experimental structure shows that in the best model this loop is in a non-native conformation that does not allow the formation of a E5-R42 salt bridge ( Figure 7A). These data were used to define the regions L1 (19-30) and L2 (39-44) included in the CV rSA used for the refinement. The starting TR464 model (residues 18-86, N = 69) was solvated with an octahedral box of TIP3P water molecules, with an initial minimal distance between the protein and the box boundaries of 10 Å. The system was then neutralized by adding 2 Na+ ions, resulting in systems with a total number of 13,922 atoms.

System equilibration
MD simulations were performed with GROMACS 4.5.5 7 . Periodic boundary conditions were imposed. The equations of motion were integrated using the leap-frog method with a 2-fs time step.
All the protein covalent bonds were frozen with the LINCS 12 method, while SETTLE 13 was used for water molecules. The electrostatic interactions were calculated with the particle mesh Ewald 14 method, with a 9-Å cutoff for the direct space sums, a 1.2-Å FFT grid spacing, and a 4-order interpolation polynomial for the reciprocal space sums. A 9-Å cutoff was used for van der Waals interactions. Long-range corrections to the dispersion energy were included. The neighbour list for non-covalent interactions was updated every 5 steps.
The systems were first minimized with 2000 steps of steepest descent, followed by 2000 steps of conjugate gradient minimization. Harmonic positional restraints with a force constant of 4.8 kcal/mol/Å 2 were imposed onto the protein heavy atoms and gradually reduced to 1.2 kcal/mol/Å 2 in 160 ps, while the temperature was increased from 200 to 300 K at constant volume. The systems were then simulated at constant temperature (300 K) and pressure (1 bar) for 200 ps. After removal of harmonic restraints, 1 ns of equilibration was run in NPT conditions, followed by a 1-ns NVT simulation with the volume fixed to the NPT equilibrated value. The Berendsen algorithm 15 was employed for temperature and pressure regulation during the equilibration runs, with coupling constants of 0.2 and 1 ps, respectively. In the last NVT equilibration, the thermostat was switched to the v-rescale 16 method with a 0.1-ps coupling constant, which was then used for the subsequent production runs. In the refinement of target TR464, after removal of heavy atom restraints, weak positional restraints were kept on secondary structure elements not included in regions L1 and L2 (C α atoms only), with a force constant of 0.1 kcal/mol/Å 2 .
Gaussians were deposited every 5 ps with height 0.48 kcal/mol. The CV SA parameters (eq. 1) were set to n = 8, m = 10 and ρ 0 = 0.6 Å. The Gaussian width was set to 0.1 on the basis of preliminary unbiased simulations. In the folding simulations, a CV describing a generic (i.e. not containing information on the native state) global property was used in addition to the CV SA . Different global CVs were tested and the best performance was obtained when using a CV measuring the total number of C α contacts calculated considering all the possible pairs of residues. The CMAP CV implemented in PLUMED-1.3 17 was used, with n = 8, m = 10, a contact threshold of 6.5 Å and a Gaussian width of 2. To reduce the sampling of completely extended conformations in the folding simulations, an upper limit was set on the radius of gyration R g , using a quartic function as restraining potential (UWALL option of PLUMED-1.3) with an energy constant of 9.6 kcal/mol/Å 4 .
For the fast folders and the KUN hairpins, the R g limit was estimated on the basis of the empirical formula approximating the experimental R g of the unfolded state R g unf = R g 0 ×N v , where R g 0 = 1.927 Å, v = 0.598 and N is the number of residues 20 . This results in a rounded value of 10 Å for GB1 βhairpin, 11 Å for Trp-Cage, 9 Å for KUNg1 and 8 Å for KUNg2. An upper limit on R g was also set in the simulations for the refinement of TR464, where R g was allowed to increase by at most 1 Å from the starting value of 10.5 Å. Each metadynamics simulation was run for a total of 100 ns.
Coordinates were saved every 1 ps. In the analysis of the trajectories, high-CV SA ensembles were selected as composed of structures with CV SA >= CV SA max -2. Due to the reduced percentage of high-CV SA structures in the Metadynamics simulation of Trp-Cage using the rSA encoding (0.03%), in this case the definition was extended to the structures with CV SA >= CV SA max -3 (2%).

Enhanced sampling -Steered Molecular Dynamics (SMD)
Steered MD simulations were started from the equilibrated systems. The CV SA was steered using a harmonic restraint with the reference value moving at constant velocity from the initial CV SA value to CV SA max , corresponding to the number of fragments used for its definition. The CV SA parameters (eq. 1) were set to n = 8, m = 10 and ρ 0 = 0.6 Å. Multiple simulations were run starting from different atomic velocities (  Figure S8). In all the cases considered in this study, the trajectories associated with the first peak (low-work transformations) were found to have a higher success rate in producing a correctly folded structure, with an increase in productive trajectories compared to the whole set of SMD runs from 36 to 38% (β-hairpin, p Nat FiltW in Table S2), from 33 to 44% (TrpCage, Table S2), from 36 to 39% (KUNg1 , Table S5), from 72 to 82% (KUNg2 , Table S5) and from 18 to 33% (TR464 , Table   S7). SMD runs were filtered so that only the trajectories with work values within the first peak (low-work runs) were retained. Additional simulations were run on the final structures of these trajectories to allow for side chain and solvent relaxation. The CV SA was restrained to its final CV SA max value, but with the RMSD cutoff threshold ρ 0 (eq. 1) increased from 0.6 to 0.8 Å, thus allowing also for small adjustments of the backbone conformation. These relaxation simulations were run for 5 (fast folders and KUN β-hairpins) and 10 (TR464) ns (Table S11). They were usually very effective in recovering key side chain interactions ( Figure 7C and S5B), together with small improvements in the average RMSD nat (Table S2, S5 and S7).

Clustering of trajectories
Representative          The representative letter of the cluster is highlighted in bold. b Pseudo-angle defined by the fragment C α atoms 1, 2 and 3. c Pseudo-angle defined by fragment C α atoms 2, 3 and 4. d Pseudo-torsion around fragment C α atoms 2 and 3.       a Percentage of structures in the whole trajectory with RMSD nat <= 2 Å. b Percentage of structures in the high-CV SA ensemble (CV SA >= CV SA max -2) with RMSD nat <= 2 Å. c Percentage of productive SMD trajectories (final structure with RMSD nat <= 2 Å). The number of productive over total SMD runs is reported in parentheses. d C α RMSD from the native structure. Average vales are reported, calculated over the final structures of productive SMD trajectories. The minimum final RMSD nat value is reported in parentheses. e Percentage of productive SMD trajectories after filtering for low-work trajectories. The number of productive over low-work SMD runs is reported in parentheses. f RMSD nat values calculated as in d after filtering for low-work trajectories. g Percentage of productive SMD trajectories after filtering for low-work trajectories and relaxation. The number of productive over low-work SMD runs is reported in parentheses. h RMSD nat values calculated as in d after filtering for low-work trajectories and after relaxation.