Analysis of Protein Folding Simulation with Moving Root Mean Square Deviation

We apply moving root-mean-square deviation (mRMSD), which does not require a reference structure, as a method for analyzing protein dynamics. This method can be used to calculate the root-mean-square deviation (RMSD) of structure between two specified time points and to analyze protein dynamics behavior through time series analysis. We applied this method to the Trp-cage trajectory calculated by the Anton supercomputer and found that it shows regions of stable states as well as the conventional RMSD. In addition, we extracted a characteristic structure in which the side chains of Asp1 and Arg16 form hydrogen bonds near the most stable structure of the Trp-cage. We also determined that ≥20 ns is an appropriate time interval to investigate protein dynamics using mRMSD. Applying this method to NuG2 protein, we found that mRMSD can be used to detect regions of metastable states in addition to the stable state. This method can be applied to molecular dynamics simulations of proteins whose stable structures are unknown.


■ INTRODUCTION
Molecular dynamics (MD) simulations are widely used to investigate phenomena in biomolecules such as proteins. This method is particularly suitable for investigating detailed motions at the atomic level on femto-or picosecond time scales, which are difficult to observe experimentally. In addition, application of a graphics processing unit (GPU) to scientific calculations has accelerated many MD programs, such as Amber, 1,2 NAMD, 3,4 OpenMM, 5 gromacs, 6,7 and ACEMD. 8 Using these programs, it has become possible to perform MD simulations of the microsecond order. With the use of special purpose supercomputers for MD simulation such as MD-GRAPE 9,10 and Anton, 11−13 it is possible to run MD simulations on the order of milliseconds. This has made it possible to perform protein folding processes in conventional MD simulations. 14 Long-time conventional MD simulations of protein systems are especially important when using time-dependent analysis methods, e.g., relaxation mode analysis (RMA), 15−24 dynamic component analysis (DCA), 25,26 and time-lagged or timestructure based independent component analysis (tICA). 27−31 RMA was developed to investigate the dynamic properties of random spin systems 32 and homopolymer systems for Monte Carlo 33 and molecular dynamics. 34 It has also been applied to biomolecular systems 15−24 and can be used to extract slow motion that cannot be extracted by the principal component analysis because it uses time information. 15 In free-energy topography using RMA, the minima tend to be aligned along the axis, and transitions between stable and metastable states have been well described. 16 The differences among RMA, DCA, and tICA have been previously explained in refs 20, 21, and 23. Generating time series data to use these methods requires long-time conventional MD simulations.
In conventional MD simulation, we use a time series of rootmean-square deviation (RMSD) for heavy or C α atom positions to investigate protein behavior. 35,36 In addition, plotting various energies against RMSD is a useful approach for investigating the structural stability of a protein. 37−43 However, the reference structure of RMSD, which, in such a case, is a native structure, must be known in advance. One possible approach would be to use an extended structure as a reference. However, in this case, the value indicating the native structure is unclear.
In the previous study, we plotted the various energies of four proteins as a function of the end-to-end distance between the N-and C-termini. 44 The distance does not require a reference structure, and it is experimentally measurable using techniques such as fluorescent resonance energy transfer (FRET). However, the correlations between RMSD and end-to-end distance depend on individual proteins. In other words, end-toend distance is not an appropriate indicator for investigating stable or metastable states of proteins.
In this work, we attempted to analyze the MD trajectory using time series RMSD without a reference structure. We calculated RMSD using the structure before a specified time rather than a fixed reference structure. Therefore, it is useful for MD simulations of proteins whose native structures have not yet been experimentally determined. The method of calculating such RMSD values is a known technique for checking equilibration in short-time MD simulation. This is also a simplified version of moving RMSD (mRMSD) proposed by Harada et al. for sampling efficiency of the nontargeted parallel cascade selection molecular dynamics. 45−47 We used the time series of mRMSD to analyze Anton's long-time simulations of the Trp-cage 48,49 and NuG2 50 proteins. 14 ■ METHODS Moving RMSD. RMSD is a commonly used measure for comparing protein conformations. 35,36 We considered two different structures, v and w. These structures have n threedimensional coordinates; RMSD was defined by The two protein conformations were prealigned to minimize the distance between them. There are several ways to calculate RMSD: all-atom RMSD, heavy-atom RMSD, and C α -atom RMSD.
Here we chose C α -atom RMSD to investigate protein structural stability. Conventionally, RMSD is calculated by fixing a reference. In the case of protein folding simulations, native structures obtained experimentally are used as references. Now consider a simulation of a protein whose the native structure is unknown. In the absence of a suitable reference structure, we calculate the RMSD as follows. v is a structure at time t, and w is one at time t − Δt, where Δt is the time interval for RMSD calculation. This description is a simplified version of the moving RMSD (mRMSD). 45,46 In the original mRMSD, a reference is an average of several structures before and after the time t of interest. (See Figure 1 in ref 45.) In this work, a single structure before a certain time is used as a reference, and the RMSD between v(t) and w(t − Δt) is referred to as the mRMSD. We calculated the time series of RMSD(v(iΔs), w(iΔs − Δt)) with the sampling time interval Δs at time step i to investigate the protein dynamics behavior. The advantage of this measure is that it does not require a reference structure. We mention that mRMSD is not suitable as an axis for energy plots because the value of mRMSD varies with time interval Δt.
Total Energy. We introduce the total energy, E tot , which is the sum of the conformational energy, E conf , and the solvation free energy (SFE), E sol : The conformational energy is determined from the protein structure and depends on the force field used in the calculation. This energy can be calculated using molecular mechanics (MM) or MD programs. The SFE is defined as the amount of change in chemical energy when one solute molecule is moved from one position in the gas phase to another in the solvent. The SFE can be calculated using the three-dimensional reference interaction site model (3D-RISM) theory, 51,52 the energy representation (ER) theory, 53−55 and the grid inhomogeneous solvation theory (GIST). 56,57 This total energy is a reasonable indicator of protein structural stability. [41][42][43][44]58 ■ COMPUTATIONAL DETAILS We used the following two proteins: the Trp-cage and the triple mutant of the redesigned protein G variant NuG2 (an α + β protein). The details of Anton's MD simulations for these proteins can be found in the Supporting Information in ref 14. We used GROMACS 59 for calculating the conformational energy with CHARMM22* force field. 60−62 The SFE of the proteins was calculated using the 3D-RISM theory with the reference-modified density functional theory. 63−65 The number density of water was 0.033329 Å −3 and the optimal hard sphere diameter was 2.88 Å for the thermodynamic states at 298.15 K. We performed the SFE calculation using 3D-RISM theory with the original code for GPU. 66,67 An SFE calculation of a NuG2 protein takes 10 s on NVIDIA V100 GPU (256 3 grids, water solvent).
Structures were extracted from Anton's trajectory in the calculations and used every 2 ns (every 10 samples). For NuG2 protein, we used the fourth trajectory (NuG2−3 Trajectory). The total number of sampling structures was 104 400 and 86 377 for Trp-cage and NuG2, respectively. For the C α -RMSD calculations with respect to the native structure, the references were taken from 2JOF.pdb (model 1) and 1MI0.pdb for Trp-cage and NuG2, respectively. We used 2, 5, 10, 20, 40, and 80 ns time intervals for the mRMSD calculations. We set Δs = Δt in the mRMSD calculations. The end-to-end distance is defined as the distance between C α atoms of the N-and C-termini. Energy calculations for Trpcage were performed every 20 ns. In addition, calculations were performed every 2 ns during the 20−26.2 μs period.

■ RESULTS AND DISCUSSION
Trp-cage. Trp-cage is an artificial protein consisting of 20 residues that forms a tryptophan-centered hydrophobic core. It is used as a test system for various new methods. 68−75 In addition, Trp-cage is often used to study the folding process of proteins. 76−80 We used the N1D/L2A/I4A triple mutant of Trp-cage 49 in this study.
The time series of the conventional RMSD of C α atom from a native structure and the total energy are shown in Figure 1.
Here, the native structure is the first coordinate of 2JOF.pdb. The stable state has an RMSD value of ∼2 Å, with ∼10 distinct regions. The total energy in these regions is lower than in regions with larger RMSD values. In other words, the total energy is also an indicator of the protein's stable state. (The most stable structures in the stable state regions are shown in Figure S1 of the Supporting Information.) Figure 2 shows the mRMSD time series of the C α atom with time intervals as follows: (a) 2, (b) 5, (c) 10, (d) 20, (e) 40, and (f) 80 ns. At all time intervals, we were able to identify the regions of the stable state indicated by the conventional RMSD. In the stable region, the mRMSD shows values of ∼1 Å while the conventional RMSD shows values of ∼2 Å. However, in (a) 2 and (b) 5 ns, where time intervals were short, the mRMSD values were less than 2 Å even in the nonstable state. This is because the conformational change of the protein is smaller when the time interval is short. As the time interval increased to (c) 10 or (d) 20 ns, the mRMSD values in the nonstable state region increased while the values in the stable state did not change. As the time interval is further lengthened, the separation between stable and nonstable states becomes clearer (Figure 2e,f).
Next, we show the comparison of RMSD (black) and the mRMSD with a 20 ns time interval (red) near the stable state regions in Figure 3. Labels A−F correspond to A−F of the stable state regions shown in Figure 1. The mRMSD values are low and stable in all stable state regions indicated by RMSD. This suggests that the mRMSD functions properly as an indicator of the stable state. The regions indicated by α, β, γ, and σ in Figure 3 are metastable state regions. In these regions, discrepancies between RMSD and mRMSD values were observed. In addition, the fluctuations in values were small for RMSD, but large for mRMSD. This can be explained as follows. In the conventional RMSD, which uses a reference, the changes of the RMSD value become smaller when the difference between the structure and the reference is large. On the other hand, the mRMSD is more sensitive to structural changes. In the metastable state, the value of mRMSD fluctuates significantly because the structural fluctuation is larger than in the stable state. Figure 4 shows the correlation between the RMSD and the mRMSD with a 20 ns time interval. A circular region centered at point (1.5, 1.5) is observed. This indicates that there was a clear correlation in the stable state region. On the other hand, around RMSD 6 Å, the distribution extends vertically. Around RMSD 6 Å, there were various structures that deviated from the stable state. Since these structures undergo various fluctuations and structural changes, the mRMSD values fluctuate. The pattern of the correlation distribution depends on the time interval of mRMSD, but the correlation in the stable state remains independent of the time interval. Figure 5 shows histograms of mRMSD values. The counts of histograms are normalized. At (a) 2 ns, there is a strong peak near 1 Å that falls gently to 10 Å. However, as the time interval increases from (b) 5 ns to (c) 10 ns, a new peak becomes visible around 6 Å. This indicates that when the mRMSD time interval is short, the conformational change of the protein is too small to detect the movement of several tens of nanoseconds. Furthermore, at (d) 20 ns, the first peak splits into two peaks at 0.7 and 1.6 Å. We will discuss the peak at 1.6 Å later. Further increasing the time interval does not  significantly change the position of the peak or the shape of the histogram (Figure 5e,f). Therefore, we consider that a mRMSD time interval of ≥20 ns is appropriate for analysis of protein dynamics.
We show the values of the total energy of Trp-cage as functions of C α -RMSD in Figure 6. Panels a and b were sampled every 20 ns from the entire trajectory, while panels c and d were sampled every 2 ns in region A (20−26.2 μs) of Figure 1. In Figure 6a,c, the first structure of 2JOF.pdb was used as a reference for C α -RMSD, and in Figure 6b,d, the structure with the lowest total energy in each sample was used as a reference. Panels c and d assume that the MD calculation was able to sample sufficient stable structures in region A. Comparing panels a and c, the lowest total energy in Figure 6a is −205.3 kcal/mol, while it is −206.9 kcal/mol in Figure 6c. In addition, there are seven structures below −200 kcal/mol in Figure 6a, while there are 19 in Figure 6c. Naturally, it is possible to obtain lower energy structures by intensively sampling stable state regions. In other words, if the goal is to find a stable structure of a protein, stopping the MD simulation at the end of region A and sampling will achieve this. Next, comparing panel a with panel b and panel c with panel d, the separated distributions are observed around 0.8 and 1.6 Å in Figure 6b,d. These separations of distributions correspond to the separations of the first peak observed in Figure 5d,e,f. This suggests that Trp-cage has a structure with a similar shape to the stable structure. On the other hand, it is difficult to distinguish these two in Figure 6a,c. Thus, depending on the structure used as a reference, the value of RMSD will change, and the information obtained from it will vary. Figure 7 shows Trp-cage backbone structures with the side chain of the sixth amino acid tryptophan. Panel a is the first structure of 2JOF.pdb, panels b and c are the most stable structures in Figure 6a,c, respectively, and panel d is the most stable structure, with an RMSD value ∼1.6 Å in Figure 6d. Structures a, b, and c are very similar in shape, but structure a differs in the shape of the central helix. Except for this, structures a and c are very similar in terms of the shape of the first helix. We can say that structure c is an intermediate structure between structures a and b. Structure d is close to the most stable structure, but the N-terminus is closer to the Cterminus. The N-terminal Asp1 side chain forms a hydrogen bond with the side chain of Arg16. It is well-known that the side chain of Arg16 forms a hydrogen bond with the side chain of Asp9 in the stable structure of Trp-cage. 49

Journal of Chemical Information and Modeling pubs.acs.org/jcim Article
We calculated the end-to-end distance in region A ( Figure  8). Between 20 and 26.2 μs, the end-to-end distance moves back and forth between values of 8 and 12 Å. The number of oscillations is 107, giving an average period of ∼58 ns. It was found that a mRMSD time interval of ≥20 ns was required to detect this mode of hydrogen bond formation and breakage. Since this phenomenon lasts for ≥6 μs, it is not expected to have a significant effect on the structural stability of the Trpcage.
At the end of this subsection, we show the structures of the metastable regions in Figure 9. Panels a, b, c, and d are the structures in the metastable regions of α, β, γ, and σ in Figure  3, respectively. These structures are not exactly in a metastable state because they do not form secondary structures. They are similar to those analyzed for clustering in previous studies. 88 −90 These structures form loops and are further maintained by contact between hydrophobic side chains. By examining the time series of mRMSD, we can find states that are different from the native structure but less fragile.

Journal of Chemical Information and Modeling
pubs.acs.org/jcim Article NuG2. NuG2, which contains 56 amino acids, is the N37A/ A46D/D47A triple mutant of protein G. 50 NuG2 protein consists of a single α-helix packed against two β sheets. In other words, the first β (β 1 ) strand, the second β (β 2 ) strand, the α-helix, the third β (β 3 ) strand, and the fourth β (β 4 ) strand are arranged in order from the N-terminus to the Cterminus. NuG2 is redesigned to switch folding pathways and to fold through a similar pathway to protein L in the early folding process. In the early folding process, the first and second β strands formed an N-terminal antiparallel β-sheet (βhairpin). NuG2 has been analyzed by several groups using Anton's MD trajectories. 41,91−98 In addition, we have suggested the existence of many metastable states in a previous study using RMA. 22 Figure 10 shows the time series of (a) the RMSD of the native structure (PDB 1MI0) and the mRMSD with time intervals of (b) 20 ns and (c) 80 ns. (See Supporting Information for the time series of mRMSD with time intervals 2, 5, 10, and 40 ns.) The three stable state regions with lower RMSD ∼2 Å are shown in Figure 10a. The first one is designated as stable state region A in Figure 10c. As in the case of Trp-cage, the stable state regions can be seen in the same location in Figure 10b,c. In Figure 10c, α and β indicate the metastable regions, which have slower structural changes. In both RMSD and mRMSD, the beginning and end of the region β can be clearly identified. On the other hand, it is difficult to determine the starting point of region α from the time series of RMSD. This shows the superiority of mRMSD, which can be used to calculate the difference from the structure before a certain time interval. The mRMSD is a very useful value because it can extract regions of the stable structure and metastable structures in which structural changes are slow.
The correlation between the RMSD and the mRMSD with a 20 ns time interval is shown in Figure 11. Similar to the Trpcage, there is a strong correlation near the region centered at (2, 2), which indicates the stable state. Compared with the Trp-cage correlation, longitudinally extending distribution is shifted around RMSD 10 Å. This is because the value of RMSD for unstable structures is protein dependent. Structures with mRMSD values smaller <5 Å are considered to be metastable states because of small structural changes. In other words, we can obtain information on metastable structures using mRMSD even if the RMSD value is large. Figure 12 shows the histograms of mRMSD values for each time interval. No separation is observed in the peaks at (a) 2 ns or (b) 5 ns. At time intervals of ≥10 ns, separation of the peaks is observed. The splitting of the first peak observed in Trp-cage is not observed at (d) 20 ns. The separation of the first peak is a Trp-cage specific phenomenon. This is because NuG2 is larger and thus mRMSD values are less sensitive to terminal  fluctuations. At (f) 80 ns, three peaks related to NuG2 dynamics were observed. Figure 13 shows the characteristic structures of (a, b) the metastable state region α, (c) the metastable region β, and (d) the stable state region A. (a) Just before the start of region α, NuG2 forms a random coil structure. Immediately after, a parallel β-sheet is formed at the N-and C-termini. (b) This parallel β-sheet is stable, and then antiparallel β-sheets are formed between β 1 and β 2 and between β 3 and β 4 . This structure resembles the native structure without the α-helix. This structure fits into B1 in the ref 22 classification. However, there is no direct transition from this metastable structure to the native structure. At the end of region α, the parallel β-sheet of β 1 and β 2 breaks down and NuG2 becomes a random coil structure. (c) In the β region, the structure is formed as shown in Figure 13c. This is classified as the R4 in ref 22. The structure which forms the parallel β-sheets between β 1 and β 4 strands and between β 2 and β 3 strands cannot be the expected structure. This structure also does not directly change to the native structure. Between regions β and A, the parallel β-sheets of β 1 and β 4 are broken and NuG2 protein takes on an extended structure. Between regions β and A, the end-to-end distance exceeds 80 Å, as observed in Figure 14. (d) Then, β 1 Figure 11. Correlation between the RMSD with respect to the native structure and the mRMSD with 20 ns time interval.

Journal of Chemical Information and Modeling
pubs.acs.org/jcim Article and β 2 strands form the antiparallel β-sheet and NuG2 folds rapidly to reach the native structure, as shown in Figure 13d.
The time series of the end-to-end distance between the Nand C-termini is shown in Figure 14. α, β, and A indicate the same regions as in Figure 10. Similar to RMSD and mRMSD, in addition to region A, regions of stable states starting from 125 and 155 μs can be observed. However, the value of the end-to-end distance of the native structure and the width of the fluctuation of the end-to-end distance depends on the type of protein. 44 Therefore, it is not easy to search for stable states using only this measure. On the other hand, we can obtain information from the end-to-end distance in a different way from RMSD or mRMSD. For example, large values are observed at the end of the stable or metastable states. These indicate that the parallel β-sheet between β 1 and β 4 was broken. RMSD and mRMSD can only show that the structure changed significantly.

■ CONCLUSIONS
We investigated protein folding processes by calculating the mRMSD, which does not require a reference structure. We have discussed here how the mRMSD can be incorporated into the stable and metastable states of proteins. In MD simulation trajectories of Trp-cage and NuG2 proteins, the time series of mRMSD can be used to correctly detect native structures. In addition, it is also able to detect the metastable states of both proteins. We also showed the difference in RMSD values when using each of the most stable structures within the MD simulation and the experimental structure as a reference. The misalignment between the reference structure and the most stable structure in the MD simulation affects the resolution of the energy analysis.
The mRMSD can be used to analyze MD simulations of proteins whose native structure is unknown. The behavior of the time series of the mRMSD can be examined to determine whether the simulation should be stopped. It is possible to determine with equal or higher accuracy than generating references with AlphaFold2, 99 RoseTTAFold, 100 etc. The flow of analyzing an MD simulation of a protein of unknown structure is as follows: 1. Perform the MD simulation, calculate mRMSD on the fly, and stop the simulation at the end of the stable state region after the stable state one appears several times. 2. Sample the structures of the stable state regions and determine the most stable structure by calculating the total energy or the frequency distribution. 3. Calculate conventional RMSD using the most stable structure as a reference, and analyze the various energies of the protein.
In general, it is difficult to extract native or metastable states from trajectories of MD simulations with large structural changes. The regions extracted by mRMSD calculation can be analyzed by other analytical methods to clarify the mechanisms of protein stability, dynamics, and folding processes. For example, the stability of the most stable and metastable states can be investigated by calculating total energy. The folding process can also be studied by analyzing the structures before and after the regions using clustering methods.
If a target is an intrinsically disordered protein (IDP) with partially flexible regions, this method may not work well. However, by splitting the structure into two or three parts, we can examine the behavior of RMSD of an IDP. An IDP undergoes transitions to more ordered states upon binding to its targets. 101 Calculating mRMSD is also useful to investigate such binding processes between an IDP and its diverse partners.
The appropriate time interval is ≥20 ns in the case of Trpcage and NuG2. The appropriate time interval depends on the system. For larger proteins, the appropriate time interval for mRMSD calculations would be longer. If the sampling interval Δs and the mRMSD time interval Δt are the same, the number of samples is smaller. However, the sampling interval and mRMSD time interval need not be the same. We can increase the number of samples by adjusting the time interval of the number of samples and the mRMSD time interval separately. We intend to make such improvement in future work.

■ ASSOCIATED CONTENT Data Availability Statement
MD trajectory data was provided by D. E. Shaw RESEARCH (e-mail: inquiries@deshaw.com). 3D-RISM input/output, mRMSD code, and parameter files obtained from the present  Journal of Chemical Information and Modeling pubs.acs.org/jcim Article calculations can be downloaded from the following repository: https://github.com/omron-sinicx/mRMSD-paper-data. The code for mRMSD was written using MDAnalysis (ver. 2.3.0). 102 The 3D-RISM with RMDFT functional code for the single GPU system is available from the following repository: https://github.com/drmaruyama/3D-RISM-CUDA/tree/rmdft.

Notes
The authors declare no competing financial interest.

■ ACKNOWLEDGMENTS
Numerical calculations were conducted in part using Cygnus at the Center for Computational Sciences, University of Tsukuba. Molecular graphics were designed using UCSF Chimera, which was developed by the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco. 103