Enhanced Grand Canonical Sampling of Occluded Water Sites Using Nonequilibrium Candidate Monte Carlo

Water molecules play a key role in many biomolecular systems, particularly when bound at protein–ligand interfaces. However, molecular simulation studies on such systems are hampered by the relatively long time scales over which water exchange between a protein and solvent takes place. Grand canonical Monte Carlo (GCMC) is a simulation technique that avoids this issue by attempting the insertion and deletion of water molecules within a given structure. The approach is constrained by low acceptance probabilities for insertions in congested systems, however. To address this issue, here, we combine GCMC with nonequilibium candidate Monte Carlo (NCMC) to yield a method that we refer to as grand canonical nonequilibrium candidate Monte Carlo (GCNCMC), in which the water insertions and deletions are carried out in a gradual, nonequilibrium fashion. We validate this new approach by comparing GCNCMC and GCMC simulations of bulk water and three protein binding sites. We find that not only is the efficiency of the water sampling improved by GCNCMC but that it also results in increased sampling of ligand conformations in a protein binding site, revealing new water-mediated ligand-binding geometries that are not observed using alternative enhanced sampling techniques.

system is given by: where r N represents the positions of the system particles, r M −N represents those of the ideal gas, p M are the momenta of all particles (note that these need not be separated), h is Planck's constant, Q M V T is the partition function of this canonical ensemble, and E is the total energy, calculated as follows: Note that the potential energy has no dependence on the positions of the particles in the ideal gas.
We begin this derivation using the generalised acceptance criterion for moves employing nonequilibrium candidate Monte Carlo (NCMC), as published by Nilmeier et al.: S2 where π(x 0 ) is the equilibrium probability of state x 0 , P (Λ p |x 0 ) is the probability of selecting protocol Λ p , given state x 0 , α(X|Λ p ) is the cumulative probability of all the perturbation steps from protocol Λ p , and ∆S(X|Λ p ) is the conditional path action difference. The ratio of equilibrium probabilities is simply dependent on the exponentiated change in total energy over the course of the move: In the grand module, a single type of nonequilibrium protocol is defined per move, where the direction in which this protocol is applied is determined by whether the particle is alchemically coupled or decoupled (or: inserted or deleted). Therefore, the cumulative probability of the perturbation steps being applied is equal for insertion and deletion moves, so that S-2 α(X|Λ p ) = α(X|Λ p ), and therefore: More important here, is the probability of proposing the forward and reverse moves, denoted by P (Λ p |x 0 ) and P (Λ p |x T ), respectively. An insertion move involves choosing a particle at random from the ideal gas, and inserting it at a random location (with infinitesimal volume, dr) in the simulated system, whilst retaining its orientation -in practice, this involves generating a random orientation upon insertion, to account for the fact that the ideal gas is not explicitly simulated. The probability of proposing such a move is: where the first term is the probability of selecting a specific ideal gas particle at random, and the second is the probability of choosing a particular insertion point at random. Conversely, the probability of attempting a deletion move to return the particle to the same location in the ideal gas is: where it should be noted that the first term accounts for the fact that the simulated system contains N + 1 particles, following an insertion. The ratio of these two terms is: For a system of N particles in a volume V , the ideal component of the canonical partition function can be calculated analytically by: S-3 where Λ is the thermodynamic wavelength. From Eq. 9 the ideal contribution to the Helmholtz free energy for the system particles can also be calculated analytically: Assuming a very large ideal gas allows the use of Stirling's approximation, applicable for very large numbers, such that Eq. 10 can be rewritten as: The chemical potential is defined as the derivative of the free energy with respect to the particle number meaning the ideal chemical potential can be calculated as: This allows the chemical potential to be related to the number density of the ideal gas (ρ ideal ) as follows: In the limit of an infinitely large ideal gas, the first term of Eq. 8 reduces to the number density of the ideal gas meaning the relationship in Eq. 13 allows the probability ratio in Eq. 8 to be written as: Eqs. 3, 4, 5 and 14 can then be combined to give: S-4 The Adams parameter S3,S4 is defined as: Substituting the Adams parameter into Eq. 15 gives: When using equilibrium-preserving propagation techniques (such as the BAOAB Langevin integrator used in this work S5,S6 ), it has been shown that the conditional path action difference is related to the negative heat of the nonequilibrium process: S2 THe heat can be calculated as the sum of the total energy change of the system over all the propagation steps: where the sum runs over all the propagation kernels, with E(x * t ) and E(x t ) representing the energy of the system before and after relaxation.
The energy change of the system over the course of the move can be expressed in terms of heat and work: The acceptance ratio in Eq. 17 can therefore be rewritten in terms of the nonequilibrium work, W (X|Λ p ): The corresponding derivation for a deletion move follows the same steps, and arrives at S-5 the following acceptance ratio: It should be noted that Eqs. 21 and 22 very closely resemble those of instantaneous GCMC moves, with the only difference being the replacement of the potential energy change, with the nonequilibrium work.
When running a simulation in which the GCMC region is a sphere, focused on a particular region, V sys must be replaced with the volume of the GCMC sphere. Additionally, the value of N must correspond to the number of water molecules within the GCMC sphere -given that grand allows waters to permeate the GCMC region, this must be updated at the end of each move, in case the number of water molecules has changed by more than one. Therefore, Eqs. 21 and 22 are replaced with Eqs. 23 and 24: where N 0 is the number of waters present in the sphere in the initial state (x 0 ) and N T is the number present in the sphere in the proposed state (x T ). If the water which is subjected to the switching ends up outside the GCMC sphere, the move must be automatically rejected as the reverse move is impossible, as the probability of the reverse protocol becomes zero.

S-6
Restrained Simulations To further ensure that the GCNCMC/MD method could predict both water locations and their occupancies with similar results to GCMC/MD, we selected a representative state from each of the four binding poses, applied restraints to the nonsolvent heavy atoms and performed simulations with each of GCNCMC/MD, GCMC/MD and water hopping as described previously. This provides a better comparison between the methods, as it removes the issue of the different protein-ligand sampling observed, thereby reducing the comparison to just the sampling of water binding sites. We therefore expect to see much better agreement between the different methods for the number of waters observed in the binding site. The results are shown in Fig. S4.
For the dry state, GCNCMC/MD and GCMC/MD generated 98.6 ± 0.1 % and 98.5 ± 0.1 % of frames respectively with no waters present in the binding region. The remaining frames contained either one or two transient waters which had an even spatial distribution about the binding region and none of which persisted for longer than a few simulation frames.
One of the eight water hopping repeats inserted a water into the binding region. This water was subsequently deleted after 1123 moves, compared to GCNCMC/MD where nearly all waters inserted were deleted within 10 moves of their insertion.
Similar results are observed for the crystal state, with the grand canonical methods showing populations of 99.2 ± 0.1 % and 99.1 ± 0.1 % for the one-water conformation, with the remaining states containing a second water. At no point was the crystal water deleted.
Water hopping did not accept any moves resulting in translation into or out of the binding region.
Wet state 1 again showed strong agreement between the GCNCMC/MD and GCMC/MD results with the population of the states containing 1-4 waters being almost identical across the two methods. As expected, the majority of states contained the two waters shown in Fig.   6 in the main text with the remaining states containing either 1, 3 or 4 waters. The slightly greater variance in the data for this state compared to the previous two can be attributed to the increased space created by the ligand conformation, increasing the acceptance probabil-S-7 ities of insertion moves. The slow convergence times of water hopping were highlighted by only four of the 8 repeats inserting a third water. Whereas the GCNCMC/MD simulations led to a water binding/unbinding event on average once every 100 moves, the water hopping simulations only achieved this on average once every 5333 moves.
Wet state 2 saw small differences between the two grand canonical methods. This is, again, owing to the additional space created by the ligand moving to the back of the pocket allowing for greater variation in the number of waters present -the water sites also appear to be less well defined, as more space is available. The starting structure contained five waters, as shown in Fig. 6 in the main text, and the results show that 3, 4 or 5 waters within the binding site are all reasonably populated configurations. Both GCNCMC/MD and GCMC/MD generated a few states with only 1 or 2 waters and one 6-water state was accepted with GCNCMC/MD, lasting 3 moves before a water was deleted. In 6 out of the 8 repeats using the water hopping method, the waters inside the pocket were left unchanged.
In the other two repeats, a water was successfully translated out of the binding region which, in one case, was subsequently replaced by the translation of another water. Across the 8 repeats of each method, GCMC/MD simulations accepted on average one move every (3.5 ± 0.7) × 10 5 force evaluations, GCNCMC/MD every (6.4 ± 0.4) × 10 5 force evaluations and water hopping one every (7.1 ± 5.0) × 10 6 force evaluations. evaluations however (Fig. S6), shows that GCMC/MD appears to be the quickest to produce converged results, particularly for the unrestrained simulations. It is important to consider though that this is partly owing to the method's inability to capture some of the ligand conformations observed with GCNCMC/MD and therefore although the results may appear converged, they have not necessarily converged to the correct value. The water hopping method clearly does not converge within the simulation time, with only a few moves being accepted across the eight repeats performed.

Comparison of Simulation Convergence and Efficiency
Simulations were also run to assess the efficiency with which the different methods could equilibrate the water network in a dry binding site. Here, we use the crystal state and wet state 1 of MUP-I to assess the ability of GCNCMC/MD, GCMC/MD and water hopping to equilibrate the number of waters within the cavity -in each case starting with no waters in the binding site. For the sake of this investigation, we define the system to be equilibrated at the first point that there is a water molecule within 1.4 Åof the hydration site(s) previously identified as shown in Fig. 6 in the main text (one site for the crystal state and two sites for wet state 1). Restraints were applied as before to prevent protein and ligand conformational changes previously observed.
For the two NCMC methods, simulations were run over a range of switching times (n prop =50) and moves were separated by 1000 steps of MD (2 ps). The GCMC/MD simulations were run in iterations of 50 GCMC moves followed by 1000 steps MD.
As shown in Fig. S7, on average GCNCMC/MD and GCMC/MD require a similar number of force evaluations to equilibrate the waters within the binding site, with the results appearing fairly independent of switching time. Water hopping takes longer than both grand canonical methods to achieve the same equilibration. To equilibrate the crystal state using GCNCMC/MD, an average of (1.8 ± 0.3) × 10 5 force evaluations were required, using GCMC/MD required (2.5 ± 1.1) × 10 5 and water hopping (5.6 ± 0.9) × 10 6 force evaluations.
For the wet state of MUP-I tested, the GCMC/MD sampling proved the most efficient, requiring an average of (1.3 ± 0.3) × 10 5 force evaluations and GCNCMC/MD an average S-9 of (2.6 ± 0.3) × 10 5 . The gap between the grand canonical methods and water hopping is greater, with the latter taking an average of (3.8 ± 0.5) × 10 7 force evaluations to equilibrate the two waters. In general, GCMC/MD is the most efficient method for equilibrating water networks in a dry binding pocket, although this relies on the assumption that there are no coupled protein or ligand motions, necessary for the equilibration of the water network, that GCMC/MD would fail to sample. To test the switching times, values of 5, 7, 9, 11, 13 and 15 ps were used with five independent repeats performed at each switching time. The results are shown in Fig. S8 where it can be seen that there is no discernible correlation between the switching time used and the number of waters observed.

Dependence of Sampled
To test the n prop values, we used a constant switching time of 10 ps and n prop values of 10, 20, 40, 50 and 100. The results are shown in Fig. S9 and again there is no correlation between the parameter choice and the observed water occupancy.
These results provide further evidence that the additional conformations we observe with the GCNCMC/MD method are in fact genuine and physically relevant conformations, and not a result of our choice of parameters.
S-10 Table S1: Details of the simulation protocols used for GCNCMC/MD and water hopping simulations on the protein test systems. In the cases of the water hopping simulations for HSP90 and restrained MUP-I, the switching times used were double those used for GCNCMC/MD as water hopping involves both a decoupling and a subsequent coupling of a water whereas GCNCMC/MD only requires one or the other. The n pert value refers to the the total number of perturbations per NCMC move.  Figure S1: The efficiency of different GCNCMC protocols when carried out on a water box. The protocols are grouped by the n prop parameter which defines the number of MD steps between each perturbation step during the GCNCMC move. Efficiency is defined as the number of accepted moves within 12 hours of wall time. Data points have been included at a switching time of 50 ps to highlight the decrease in efficiency after the broad peak.
S-12   Figure S4: Distributions of the number of waters observed using each of the water placement methods in the MUP-I binding site, when the protein-ligand complex is restrained to the previously described conformations. Positional restraints were applied to a representative structure of each of the four dominant MUP-I conformations previously identified. Simulations were performed using the three different methods (GCNCMC/MD, GCMC/MD and water hopping) to assess the agreement and convergence of the results. Eight independent repeats were run for each combination of method and conformation. The number of waters present in the binding site is shown along the x-axis with the height of the bars indicating the average percentage of frames containing each number of waters. The error bars show the standard error of the mean

S-15
Simulation Convergence and Efficiency  S-20