Avoiding False Positive Conclusions in Molecular Simulation: The Importance of Replicas

: Molecular simulations are a computational technique used to investigate the dynamics of proteins and other molecules. The free energy landscape of these simulations is often rugged, and minor di ﬀ erences in the initial velocities, ﬂ oating-point precision, or underlying hardware can cause identical simulations (replicas) to take di ﬀ erent paths in the landscape. In this study we investigated the magnitude of these e ﬀ ects based on 310 000 ns of simulation time. We performed 100 identically parametrized replicas of 3000 ns each for a small 10 amino acid system as well as 100 identically parametrized replicas of 100 ns each for an 827 residue T-cell receptor/MHC system. Comparing randomly chosen subgroups within these replica sets, we estimated the reproducibility and reliability that can be achieved by a given number of replicas at a given simulation time. These results demonstrate that conclusions drawn from single simulations are often not reproducible and that conclusions drawn from multiple shorter replicas are more reliable than those from a single longer simulation. The actual number of replicas needed will always depend on the question asked and the level of reliability sought. On the basis of our data, it appears that a good rule of thumb is to perform a minimum of ﬁ ve to 10 replicas.


■ INTRODUCTION
Molecular simulations are a computational technique 1 with a wide range of applications. Examples include protein folding, 2 protein dynamics, 3 RNA 4 /DNA 5 dynamics, membrane dynamics, 6 and the dynamics of large systems. 7 Multiple simulations of identical atomic coordinates are called replicas, repeats, or independent ensembles. In this study, the term "replicas" refers to simulations of identical structures with identical parameters where only the initial velocities are created randomly according to a Maxwell distribution. Velocities are usually chosen randomly because (1) the system is assumed ergodicity of the system, (2) velocities cancel out of ensemble averages, and (3) the emerging Markovian property of complex dynamical systems makes the detail of initial velocities unimportant after very short times. Each replica simulation runs independently and is not influenced by the other replicas. Replicas of this type often produce differing trajectories 8 caused by the ruggedness of the energy landscape, which contains many local minima often separated by high energy barriers. 9 Random velocities are not the only cause of differing trajectories: replicas with identical initial velocities also produce different trajectories because of differences in floating-point precision on different machines, the number and type of processors, compiling options, system-specific random number generators, and/or dynamic load balancing. 8 The central limit theorem and ergodic hypothesis for dynamical systems tell us that in principle this should not matter for infinitely long simulations. The problem, however, is that real-life simulations are not infinitely long, and quantifying the uncertainty to which this leads is challenging. 10 With the assumptions of finite runtime and finite sampling, the question arises as to the simulation time and number of replicas that are necessary for reliable and reproducible conclusions to be drawn from molecular simulation studies. In the following we explain this using an example: An often seen study is the investigation of the dynamics of an X-ray structure beyond its static structure. In our example, a research group chooses the X-ray structure of a highly immunogenic T-cell receptor/peptide/major histocompatibility complex (TCRpMHC). They perform a single molecular dynamics (MD) simulation of 100 ns to investigate the molecular background of the experimentally known high immunogenicity. The group is interested in the number of H-bonds between the two TCR chains, as it has been suggested that the mechanical forces acting on the TCR could be involved in TCR triggering. 11 In this single simulation, they find an average number of 19.12 intra-TCR-chain H-bonds (orange histogram in Figure 1A). In the X-ray structure, the number of H-bonds is 15. They could conclude from this analysis that the additional four transient H-bonds cause a rigidification within the TCR. Therefore, the cause of the high immunogenicity of this complex is the increased number of H-bonds. This conclusion could be further strengthened by quoting the original X-ray paper 12 that states that the immunological hotspot (Y7 of the peptide) binds tightly between CDR1α and CDR3β of the TCR.
Imagine a second research group that independently and unaware of the research discussed above performs exactly the same simulation but uses a different initial seed for the velocities and/or hardware. They find an average number of 12.38 intra-TCR-chain H-bonds (blue histogram in Figure 1A). This simulation suggests that 2.6 H-bonds of the X-ray structure are present only transiently in the crystal environment and that the lower number of H-bonds observed in the simulation allows additional TCR flexibility in the interaction with CD3. This finding is proposed as the structural reason for the immunogenicity of this complex.
The two groups would have drawn two diametrically opposed conclusions based on two simulations of the same complex (replicas). An analysis of 100 replicas (yellow histogram in Figure 1B) shows that both conclusions would have been incorrect. The average number of intra-TCR-chain H-bonds over all 100 replicas is 15.28, which is almost identical to the X-ray value of 15. These 100 replicas show that the average number of intra-TCR-chain H-bonds is in reality not different from the number seen in the X-ray structure. Studies built only on the outcome of one simulation without replicas could lead to false positive conclusions and publications.
This example shows that single simulation studies may lead to false positive conclusions. However, only a few studies 13,14 have investigated the use of replicas, and to date no study has quantified the magnitude of uncertainty within identical replicas at different simulation lengths. In this work, we performed a large-scale MD simulation study to quantify the differences between replica simulations. We consider two systems. The first was a small 10 amino acid "toy" example. In this system, we hoped to attain a complete exploration of all of the degrees of freedom. The second was the above-described 827 amino acid "real-world" TCRpMHC example. In this system, exploration of the entire configurational space is not possible, and therefore, we rather sought an exhaustive exploration of one or multiple local free energy basins around the folded X-ray state. For these two examples, we quantified the reliabilities of different numbers of replicas in combination with varying simulation lengths.

■ METHODS
Structures. First we considered a small "toy" system, the structure of the chignolin mutant CLN025 (Protein Data Bank (PDB) 15 accession code 2RVD 16 ). Despite its small size (only 10 amino acids), this structure shows the essential characteristics of a protein, such as thermal stability, a folding pathway, and a free energy surface. 16 The average folding time has been determined to be about 600 ns. 2 Despite its protein characteristics, we refer to this system in our study as "the peptide".
Second we considered a large protein complex in which the Lc13 TCR is bound to the Epstein−Barr virus (EBV) peptide FLRGRAYGL loaded onto the human MHC allele HLA-B*08:01. The structure formed by the 827 amino acids that make up this TCRpMHC complex is available via PDB accession code 1MI5. 12 The entire system was used, including the constant regions of the TCR, as we recently showed their importance for reliable results. 17 This TCRpMHC structure is a good case study as it has been investigated previously. 3, 12 We refer to this system as "TCRpMHC". Missing side-chain atoms in the X-ray structures were modeled using SCWRL 18 via The inset on the right shows a zoomed version. The areas of the two simulations of (A) occupy rather extreme parts of the search space and amount to 2% of the total yellow area (2 out of 100 replicas).
the PeptX framework, 19 as this has been shown to yield the highest accuracy for peptide/MHC modeling. 20,21 MD Simulations. Both systems, the peptide and TCRpMHC, were immersed in dodecahedral simulation boxes of explicit SPC water, allowing for a minimum distance of 1.5 nm between the box boundary and the protein. Random water molecules were replaced with Na + and Cl − ions to achieve a neutral charge and a salt concentration of 0.15 mol/L. Energy minimization using the steepest-descent algorithm was applied. Then the system was warmed from 0 K to 310 K using the velocity rescaling thermostat 22 and position restraints on all bond lengths over 500 ps. Production runs with random velocities generated at 310 K were carried out using GROMACS 4 23 and the GROMOS force field in the 53A6 parametrization 24 on the ARCUS-b cluster of the Oxford Advanced Research Computing (ARC) facility.
We performed 100 replica simulations on the peptide system, each of 3000 ns, yielding a total simulation time of 300 000 ns. These simulations total 500 times the folding time of the system, which means that it is likely that we achieved good coverage of the free energy landscape and solution space, i.e., that we sampled the set of (almost) all feasible atom arrangements.
The TCRpMHC system was simulated for 100 ns, and 100 replicas were performed, yielding a total simulation time of 10 000 ns. In this case, the simulation time did not allow complete sampling of the system. Because of the size of the system, 100 ns is a usual simulation time for current MD studies of TCRpMHC. 25 Methods of Trajectory Evaluation. Trajectories were evaluated using the Gromacs tools g_hbond and g_dist 23 as well as the gro2mat package. 26 Simulations were manually inspected with VMD 27 and the vmdICE plugin. 28 Reliability and Reproducibility Analysis. In this work, we analyzed the relationship among the number of replicas, simulation time, and reproducibility by comparing the similarities of groups of replicas. We performed our analysis as follows: From the set of 100 replicas, we uniformly sampled n replicas at random with replacement (group 1). We repeated this sampling to obtain a second set of replicas of size n (group 2). Then for each group we computed the average A i (i = 1, 2) of a single descriptor (e.g., the number of H-bonds or a specific distance) over all frames and replicas. We repeated this sampling procedure 5000 times and recorded the difference d k between group 1 and group 2 in each repetition (k = 1, 2, ···, 5000). We also computed and recorded the average of the 5000 d k values. This average corresponds to a point in Figure 2D. Each point represents a unique combination of the number of replicas n used within each group and the length of the trajectory (number of frames), which is truncated at different time points (x axis of Figure 2). This allows a visualization of the influence of the number of replicas and simulation length on the reliability and reproducibility.

■ RESULTS
The Peptide. We analyzed the peptide system (PDB code 2RVD) using 100 replicas of 3000 ns each. An example of one replica is shown in the Appendix Movie.
The first property we analyzed was the Euclidian distance between the termini of the peptides (the distance in space between residues Tyr1 and Tyr10). This is a simple property that discriminates between folded and unfolded states of the peptide. The distribution of the distance over all 100 replicas and time frames shows a bimodal distribution with means at 5.2 and 9.1 Å ( Figure 2A). There is a slight tail to the right indicating the rare occurrence of elongated conformations. In Figure 2B we show an example conformation from each peak and a semielongated configuration. The overall distribution appears to be smooth, and the two peaks have nearly normal distributions around them. Figure 2C shows that not all replicas sample both states equally well. We depict the replicas with the on-average lowest, highest, and most average distance. Replica 65 spends most of the simulation time in a closed conformation (average distance 6.18 Å), while replica 68 spends more time in (semi)extended conformations (average distance 10.26 Å). Replica 98 has a mean value closest to the overall mean of all 100 replicas and also follows a similar distribution (average distance 7.76 Å).
While the most extreme examples among the 100 replicas differ greatly (e.g., Figure 1A and Figure 2C), the more interesting question is how much replicas differ on average from each other and how this difference changes with increased simulation time and/or more replicas per group. In Figure 2D we analyze this dependency (see Reliability and Reproducibility Analysis). Short simulations (<10 ns) tend to agree with each other. For example, two simulations of 10 ns differ on average by only 0.61 Å (top line in Figure 2D). However, this is not due to the good sampling quality but rather is due to undersampling and exploration of just the local neighborhood solution space (e.g., side-chain conformations) instead of the global solution space (e.g., unfolding and folding events of the backbone). This undersampling is reflected by the sharp increase in the average difference between groups between 10 and 400 ns. For example two simulations of 400 ns differ on average by 1.48 Å. Once the simulations are longer than 400 ns, the difference between groups starts to fall rapidly until about 1200 ns, after which the fall continues more slowly up to 3000 ns. If we move from the analysis of one replica per group to 10 (black line in Figure 2D), we find a median drop of 63.7% in the group difference. The curve is smoother and the decrease in group difference starts 200 ns earlier. If we then consider 50 replicas in each group, the median drop in difference is 83.7%. Even with 50 replicas per group and 3000 ns simulations, the difference between groups is not zero (green circle at the bottom right in Figure 2D). These differences can also be considered in the light of computational cost: using 10 replicas achieves 76.1% of the improvement of using 50 replicas at only 20% of the computational cost. The average standard deviation (over all 5000 permutations and 21 simulation lengths) of the replica group difference in peptide termini distance (in Å) drops considerably with increasing number of replicas from 1.23 (one replica) to 0.49 (five replicas), 0.32 (10 replicas), 0.18 (25 replicas), and 0.12 (50 replicas) (Appendix Figure 2). Figure 2D shows the variation for a various simulation lengths and numbers of replicas. With these data we can also calculate the optimal way to invest computational resources. If it is assumed that budget restrictions allow only a certain amount of simulation time to be computed, is it better to run multiple replicas or longer simulations? In Figure 2D, the blue squares mark different possibilities assuming fixed computational resources of 2000 ns, the blue diamonds 6000 ns, and the blue stars 18 000 ns. In all three cases, the investment of computational time into a larger number of shorter replicas yields better results (lower difference in mean). For example, using one replica of 2000 ns per group yields a mean difference of 0.82 Å between groups, while using five replicas of 400 ns per group lowers the difference to 0.68 Å and using 20 replicas of 100 ns In Appendix Figure 4 we show the robustness of our approach using a slightly different methodology: instead of using two groups of equal sizes, we use one group that consists of all 100 replicas and a second group of n replicas, where n runs from 1 to 50. In this way we illustrate the difference between the assumed truth (100 replicas) and n replicas. While the scales of the y axes in Figure 2 and Appendix Figure 4 are different, the overall shapes are highly similar, showing the robustness of our method.
In Appendix Figure 5 we show the same data, but we use the number of replicas (instead of the simulation length) for the x axis and the simulation time (instead of the number of replicas) for the color code. This figure emphasizes the convergence with increasing number of replicas, but the "peak" at around 100 ns in Figure 2 is only visible indirectly by overlapping lines. It can be seen that there is a strong decrease in average difference when more replicas are used at a fixed investment of 2000 ns but less difference at a fixed investment of 18 000 ns. This shows that performing more replicas instead of longer simulations is especially important if limited overall computational resources are available.
The distribution of the number of H-bonds over all 100 replicas and time frames is shown in Figure 3A. This is a more global property of the peptide. A nearly normal distribution with a mean of 6.49 H-bonds is observed. There is a slight tail to the left indicating few conformations with few or zero H-bonds. In Figure 3B we show three example conformations: a semielongated conformation with zero H-bonds and two closed conformations with two and eight H-bonds, respectively. Both open and closed conformations can have few or many H-bonds, but there is a moderate negative correlation (r = −0.504) between the total number of H-bonds and the distance between the peptide termini.
Once again it is possible to find extreme examples within our 100 replicas that deviate from the overall mean of all replicas to a significant extent: 5.53 H-bonds for replica 68 and 6.97 H-bonds for replica 65 versus the overall mean of 6.49 H-bonds ( Figure 3C). A detailed analysis of replica differences shows a picture similar to that for the peptide termini distance ( Figure 2D). The main difference is that for very short simulations (≤5 ns) there is no high agreement. This is due to the high on and off rates of H-bonds, where angle changes of a few degrees between the donor and acceptor can influence the presence/absence of a H-bond. Between around 5 and 40 ns an area of "misleading" good agreement between replicas is present (similar to the area at the beginning of the peptide termini analysis) that is probably caused by the lack of sampling of backbone configurations. The peak of disagreement between replicas is once again located between around 40 and 200 ns, with a rapid fall in average difference between groups to about 1000 ns and a slower fall afterward. Extending the analysis to more replicas, we find median drops in group difference of 64.5% for 10 replicas (black line in Figure 3D) and 83.9% for 50 replicas, but of using 10 replicas already achieves 76.7% of the improvement of using 50 replicas at 20% of the computational cost. The average standard deviation (over all 5000 permutations and 21 time cuts) of replica group difference in H-bonds drops considerably with the increasing number of replicas from 0.59 (one replica) to 0.23 (five replicas), 0.15 (10 replicas), 0.09 (25 replicas), and 0.06 (50 replicas) (Appendix Figure 3).
The analysis of efficient resource usage for H-bond reproducibility shows results similar to those for the peptide termini distance: using a larger number of shorter replicas yields better results (blue markers in Figure 3D). For example, using one replica of 2000 ns per group yields a mean difference of 0.32 H-bonds between groups, while using five replicas of 400 ns per group lowers the difference to 0.27 H-bonds and using 20 replicas of 100 ns yields a mean difference of 0.20 H-bonds.
TCRpMHC. We performed a total of 100 replicas, each of 100 ns, on the TCRpMHC protein complex (see Methods).
First we analyzed the Euclidian distance between the center of mass of the C-terminal peptide residue Leu9 and the center of mass of CDR3β. This is a local property of the TCRpMHC system. A high degree of flexibility in this distance is expected because terminal peptide residues are known to often unstably bind to the MHC. 29−31 The distribution of the distance over all 100 replicas and times shows a nearly normal distribution with a mean of around 13 Å ( Figure 4A). In Figure 4B we show an example configuration with a distance of 13.5 Å. Figure 4C shows extreme examples that deviate from the overall mean value of 13 Å (10.5 Å for replica 58 and 15.6 Å for replica 25).
For one replica per group, the analysis of replica differences ( Figure 4D) shows a quick increase in the average group difference up to about 4 ns and a slower but constantly present increase afterward. The "jump" between 10 and 20 ns is an intended optical effect as the x-axis step size changes from 1 to 10 ns. With five to 50 replicas per group, only a marginal increase in group difference is observable after 20 ns.
The use of more replicas per group reduces the average group difference: we find median drops in group difference of 64.9% for 10 replicas (black line in Figure 4D) and 84.1% for 50 replicas, but using 10 replicas already achieves 77.1% of the improvement of using 50 replicas at 20% of the computational cost. The average standard deviation (over all 5000 permutations and 21 time cuts) of the replica group difference in distance (in Å) drops considerably with increasing number of replicas from 0.63 (one replica) to 0.25 (five replicas), 0.18 (10 replicas), 0.11 (25 replicas), and 0.08 (50 replicas) (Appendix Figure 6). For this larger system, the analysis of efficient resource usage for reproducibility shows results similar to those for the peptide system: Using a larger number of shorter replicas yields better results (blue markers in Figure 4D). For example, using five replicas of 100 ns yields a mean difference of 0.44 Å between groups, while using 10 replicas of 50 ns yields a mean difference of 0.28 Å.
Finally, we analyzed the number of H-bonds between the TCR αand β-chains based on our 100 TCRpMHC replica simulations. This is the same descriptor that we used as motivation in the Introduction (Figure 1), and it takes into account many amino acids of the TCR. A nearly normal distribution of the number of H-bonds is shown in Figure 5A, and Figure 5B illustrates an example configuration with 15 Hbonds present. As expected, the number of intra-TCR-chain Hbonds is not correlated with the distance between Leu9 of the peptide and CDR3β (r = 0.042). The most extreme examples of replicas were found to have on average 12.38 (replica 49) and 19.12 (replica 38) intra-TCR-chain H-bonds ( Figure 5C).
The analysis of replica differences shows a more stable picture than the Leu9−CDR3β distance analysis ( Figure 5D). After a slight increase in group difference over the first nanosecond the average does not change much over the remaining 99 ns. However, the use of more replicas reduces the average group difference: we find median drops in group difference of 68.4%

Journal of Chemical Theory and Computation
Article for 10 replicas (black line in Figure 5D) and 85.7% for 50 replicas, but using 10 replicas already achieves 79.6% of the improvement of using 50 replicas at 20% of the computational cost. The average standard deviation (over all 5000 permutations Once again, using a larger number of shorter replicas yields better results (blue markers in Figure 5D). For example, using five replicas of 100 ns yields a mean difference of 0.77 H-bonds between groups, while using 10 replicas of 50 ns yields a mean difference of 0.54 H-bonds.
The same type of reliability analysis based on group differences is shown in Appendix Figure 9 for other descriptors: radii of gyration of CDR3α and CDR3β, backbone root-mean-square

Journal of Chemical Theory and Computation
Article deviation, distance between CDR3α and CDR3β, solventaccessible surface area (SASA) of the peptide within the TCRpMHC, overall SASA of TCRpMHC, and the numbers of peptide/MHC, TCR/MHC, TCR/peptideMHC, and TCR/ peptide H-bonds.

■ DISCUSSION
Reproducibility is a central theme in science. 32 The use of replicas to increase reproducibility and reliability is a standard method in many areas of research, including clinical trials, 33 gene expression analysis, 34 and experimental design. 35,36 The law of large numbers 37 states that performing an experiment multiple times will bring the average result closer to the true outcome. The variance of the mean of n replicas is n times smaller than the variance of the variable measured in each independent replica. 36 However, in the context of molecular simulations, performing replicas is computationally expensive. It is tempting to perform one longer simulation, and for many readers a paper that describes a single 2000 ns simulation is more impressive than a paper describing 20 replicas of 100 ns each. It is therefore perhaps not surprising that replicas are not regularly used in current molecular (dynamics) simulation studies (e.g., refs 38−44). The results of the analysis in this paper suggest that in terms of reproducibility and reliability, multiple replicas should be preferred over single long simulations. These findings agree with observations made in molecular docking approaches. 45,46 Another reason for the frequent use of single simulations might be that even single long simulations struggle to reach biologically relevant time scales. For example, although a 100 ns TCRpMHC simulation is the current state of the art for this type of system, 25 it is unlikely that such a simulation will provide insights into global rearrangements. Our results suggest that after 100 ns this 827 amino acid system is in a similar phase of sampling as our 10 amino acid peptide system after 5 ns (compare Figure 4D with the 0 to 5 ns portion of Figure 2D). In both cases, the local neighborhood has been explored but no fundamental changes have occurred. It is likely that the curves for the TCRpMHC simulations in Figures 4 and 5 would form peaks similar to those in Figures 2 and 3 if sufficiently long simulations were carried out. How long the simulations of TCRpMHC complexes would need to be extended to show global rearrangements similar to those for our peptide system is hard to predict. Most likely the time scale would be far beyond the microsecond scale, as even 1000 ns simulations of TCRpMHC still do not show global changes. 29,47 The length of the simulations (once they have reached a minimum length and left local solution space) is not as important for reliability as the number of replicas. This agrees with previous studies showing that very short simulations of only few nanoseconds can effectively explore the local neighborhood around X-ray structures, leading to high correlations between predicted and experimental binding affinities using 50 replicas of only 4 ns each. 47,48 It could be argued that better agreement between multiple shorter simulations is not surprising because the simulations have the same starting configuration and are likely to have a similar start of the trajectory. While this might be partly true, it can be seen from the peptide simulations ( Figure 2) that even three 1000 ns simulations against three 1000 ns simulations on average show better agreement than one 3000 ns simulation against one 3000 ns simulation. It is unlikely that this effect is due to failure of the 1000 ns simulation to leave the initial local solution space. A possible explanation why long simulations do not necessarily agree with each other is that they can get trapped in local minima for a considerable amount of time or drift away in solution space by unfolding structure elements that should not be unfolded ( Figure 6). This explanation model agrees with previous studies showing that long simulations can get trapped in local minima. 49,50

Journal of Chemical Theory and Computation
Article For shorter simulations (e.g., 100 ns) of larger systems (e.g., 800 amino acids), one might expect differences between identical simulations, but the extent to which we observed this was surprising. We found an average difference in the number of intra-TCR-chain H-bonds between replica simulations of 1.75 ± 1.26 ( Figure 5 and Appendix Figure 8), while the total average number of H-bonds is about 15.28 (11.5% difference or 19.7% including the upper boundary of the standard deviation). This means that if a study based on a single 100 ns simulation finds an increase/decrease by about three H-bonds (20% of the total H-bonds), there is no strong reason to believe that this increase/ decrease is true.
This uncertainty can be brought down considerably by using five replicas of the same length (0.77 ± 0.58 H-bonds; 5.0% or 8.8%), 10 replicas (0.55 ± 0.41; 3.6% or 6.3%), 25 replicas (0.34 ± 0.26; 2.2% or 3.9%), or 50 replicas (0.24 ± 0.18; 1.6% or 2.7%), depending on the level of precision sought. If it is acceptable to be wrong by about one H-bond then five replicas will probably be sufficient; if an error of only half a H-bond is acceptable, 25 replicas might be the better choice. In our own experience, multiple simulations per comparison group have been shown to be beneficial for the reproducibility of a wide range of applications, such as H-bond networks 51 and free energy calculations. 47 Even in very different approaches, such as hierarchical Monte Carlo simulations, the challenge of non-reproducible trajectories persists (see Figure 5 in ref 29). In previous work, 29 we used the Mosaics package, 54 which has been successfully applied before for a wide variety of challenges, including receptor signaling, 55 prediction of nucleosome occupancy, 56 empty MHC binding groove dynamics, 57 energy contributions of epigenetic DNA marks, 58 RNA mutations, 4 and macromolecule refinement. 59 We employed a coarse-grained model based on a three-point representation of each amino acid (α-carbon, carbonyl oxygen, and side-chain centroid) in combination with a knowledgebased potential. 52 Movements were based on hierarchically grouped regions to restrict unnecessary degrees of freedom while allowing essential degrees. Potential chain breaks were resolved using a stochastic chain closure algorithm. 53 We used repeated simulated annealing cycles to allow for a rapid local and global search for energetically favorable states. Monte Carlo approaches like this are able to sample a larger part of the free energy landscape, but replicas still differ considerably from each other, emphasizing the need for multiple replicas in molecular simulation approaches.
Another important question involves the minimum simulation length needed in order to avoid missing potentially important transitions? For our small peptide system we can approximate an answer on the basis of Figure 2D: once replicas have left local sampling after about 20 ns, we become aware of the undersampling problem as reflected in the peak at around 100 ns. After this peak the difference lines start to flatten out, and we find the area around 1000 ns to be a good recommendation for the minimum simulation length of the peptide system. However, this does not mean that shorter simulations are not useful in general, since free energy calculations, for example, can reproduce experimental data very well using a large number of very short simulations. 47,48 The actual choice between simulation length and number of replicas (given a fixed amount of available computational resources) will always depend on the question asked and the classical statistical dilemma of whether false negative results (i.e., many short simulations do not see relevant motion because they are too short) or false positive results (i.e., one or few very long simulation show interesting results, but these results are outliers, and the average over many very long simulations would have shown a different picture, as discussed in the Introduction) are more acceptable.
We conclude that opting for multiple replicas over single molecular simulations allows for more reproducible results and thereby increases the reliability of molecular simulation papers. We strongly recommend that scientists perform multiple replicas before drawing conclusions from molecular simulations.

Journal of Chemical Theory and Computation
Article