How to Compute Atomistic Insight in DFT Clusters: The REG-IQA Approach

The relative energy gradient (REG) method is paired with the topological energy partitioning method interacting quantum atoms (IQA), as REG-IQA, to provide detailed and unbiased knowledge on the intra- and interatomic interactions. REG operates on a sequence of geometries representing a dynamical change of a system. Its recent application to peptide hydrolysis of the human immunodeficiency virus-1 (HIV-1) protease (PDB code: 4HVP) has demonstrated its full potential in recovering reaction mechanisms and through-space electrostatic and exchange–correlation effects, making it a compelling tool for analyzing enzymatic reactions. In this study, the computational efficiency of the REG-IQA method for the 133-atom HIV-1 protease quantum mechanical system is analyzed in every detail and substantially improved by means of three different approaches. The first approach of smaller integration grids for IQA integrations reduces the computational overhead by about a factor of 3. The second approach uses the line-simplification Ramer–Douglas–Peucker (RDP) algorithm, which outputs the minimal number of geometries necessary for the REG-IQA analysis for a predetermined root mean squared error (RMSE) tolerance. This cuts the computational time of the whole REG analysis by a factor of 2 if an RMSE of 0.5 kJ/mol is considered. The third approach consists of a “biased” or “unbiased” selection of a specific subset of atoms of the whole initial quantum mechanical model wave-function, which results in more than a 10-fold speed-up per geometry for the IQA calculation, without deterioration of the outcome of the REG-IQA analysis. Finally, to show the capability of these approaches, the findings gathered from the HIV-1 protease system are also applied to a different system named haloalcohol dehalogenase (HheC). In summary, this study takes the REG-IQA method to a computationally feasible and highly accurate level, making it viable for the analysis of a multitude of enzymatic systems.

where n = 133 and N steps = 11 or eleven geometries of the PES. Considering that these integrations involve a six-dimensional integral over the finite volume of each atom, these are very time-consuming. Moreover, IQA depends on the wave-function obtained from QM calculations. Different levels of theory (e.g., HF, DFT, CCSD, MP2 with different basis sets) change the number of Gaussian primitives that make up the wavefunction.

Interacting Quantum Atoms (IQA)
Note that the deformation energy is defined as a difference between , and , , which is more meaningful when it comes to molecular displacement, as encountered in reactions or in van der Waals complexes. Moreover, it has been recently demonstrated that , where is the intra-atomic steric hindrance while is the intra-atomic charge transfer. This makes a more faithful quantifier. However, this quantity will not be analysed in this study but will prove to be helpful in future analyses. The chemical information obtained from the terms can be derived from (S2) where is the monopolar/charge-transfer term obtained by using a simple Coulomb equation (eq S3) between the total (net) charge of each atomic basin (q A and q B ) and their internuclear distance , while consists of polarisation energy that the whole system induces to the interaction between atoms A and B when all monopolar energy has been removed (eq S4), This separation gives a more detailed and intuitive understanding of the classical electrostatic energy recovered within IQA.

Relative Energy Gradient (REG)
As a general example for explaining the REG methodology we quote the Morse potential, which describes the energy of two atoms moving from infinity to short-range or the other way around and is shown in eq S5, which can be rewritten as where D e is the well depth, r e the equilibrium bond distance between the atoms and a is ultimately related to the spring constant in Hooke's model in the regime very close to the energy minimum. Note that D e is subtracted in eq S5 because the zero-reference of the potential energy corresponds to the two atoms at infinite distance. Equation S6 shows a combination of a repulsive potential (first term), which governs the left side of the Morse potential, and an attractive one (second term), which regulates the right side. This behaviour follows from Figure S1. This analysis is useful to understand what happens in many PESs and is thus universal for the REG-IQA framework. Hence, a single PES depends on specific interactions at different stages of the dynamical evolution. Figure S1. Morse potential for two hydrogen atoms forming a hydrogen molecule. The blue and red curves represent the combination of the repulsive and attractive potential shown in eq S6, and are separated by an energy minimum at 0.74 Å.

IQA Grids Benchmark
Here we show results in connection with Section 3.2 of the main manuscript. The default settings of the program AIMAll and its custom settings (mesh=sparse, outer angular quadrature = GS1) are compared for the V xc and V cl energy terms. Table S4 shows the REG values of custom setttings and Table S3 those obtained with the defaultsettings. The results obtained using computationally cheap customised integration grids perfectly match those obtained by the default integration settings for the HIV-1 protease system of 133 atoms.  Figure S2-S3. The main point to explain is the nature of the original RDP tolerance epsilon (ε) in the context of PESs. This value in RDP is an actual distance in a geometrical space (e.g. pixels versus pixels of an image). During the recursive RDP iterations ε is strictly compared to the maximum perpendicular distance of a point to the polyline, which has the same units as ε. Thus, whichever the units of ε are, the algorithm will work correctly if the perpendicular distance has its same units. This is what makes the algorithm applicable to PESs (i.e. y=energy, x=distance) even if epsilon does not have a physically relevant unit.
As an example, the IRC of a simple S N 2 reaction is used to benchmark the RDP algorithm. Details on the QM calculations are omitted because they are not relevant in the context of RDP. Figure S2 shows the effect of the RDP algorithm at different RMSE threshold values. It is clear that with an RMSE of ~0.7 kJ/mol, the potential energy surface (PES) is trimmed from 31 points to only 9 points.
Given that the error is only ∼2% (0.7 kJ/mol to ~35 kJ/mol) of the maximum energy difference (the energy of activation from reactants to transition state), it is safe to assume that the REG-IQA analysis on the 9 points is reliable.
In Figure S3, the same first-principles PES is defined with 50 and 100 points. As mentioned earlier, the RDP algorithm acts independently of the number of points of the initial polyline. Nevertheless, the polylines that were recovered at the same RMSE of tolerance (0.1 kJ/mol) are very similar. This means that there is not a specific number of points necessary for the RDP algorithm to operate on. However, the user should be aware of the computational time to obtain the original PES. Indeed, a 100-points PES scan is computationally much more demanding than a 50-points one (also depending on the system size). On the contrary, fewer starting points would make the RDP algorithm too severe while reducing the number of points, leading to an unreliable REG-IQA analysis. Note that Figure S2 and S3 show a different value of RMSE compared to the chosen RMSE of confidence of 0.1 kJ/mol. That is because the algorithm is jumping between a polyline that has an RMSE lower than 0.1 kJ/mol to another one that has a higher one, and therefore chooses the one lower than the chosen threshold.  5 Selection of a subset of atoms 8.5 0.93 V cl (c58,o67) 9.5 0.98 6 Correlation Curves as an Alternative Error Metric The recovery error, defined as the RMSE between the true system's energy and the sum of IQA "recovered" energies, is a valuable metric to assess the REG-IQA method quantitatively. However, it has been shown that this value increases with the system size. This value seems very large in the case of the HIV-REG lite system analysis (around 33 kJ/mol), although the latter still outputs relevant REG rankings and agrees with the full-system calculations. Indeed, having more atoms in a system means that many more IASs are involved, which leads the clocking up of integration errors (see deviations from zero in equation 4 of main manuscript). However, these errors are not relevant considering that they make up a very low percentage of the total absolute energy of a system, which is usually in the range of thousands of kJ/mol. A better way to assess the capabilities of REG-IQA is to remove any bias present due to the IQA energy integrations. Specifically, one can use a familiar concept that is the foundation of the REG analysis: Suppose the R coefficient is close to the value of one. In that case, it means that the IQA integrations (both intra-atomic and interatomic) correlate well to the actual ab initio energies and that the ensuing REG analysis will be reliable. Also, the closer the linear coefficient is to 1 for any PES segment, the better the IQA energies are recovering the full PES. This is an immediate consequence of equation S8. In the best-case scenario, where each IQA summed energy translated over the energy mean equals the wave-function energy, the calculated m will be equal to 1 (equation S8). Indeed, thanks to the additive property of REG, when all REG values of a segment are summed together, a value of 1 is obtained if no errors are present due to integrations, Note that this kind of approach is still giving a quantitative idea on the reliability of the REG-IQA method, even if we are correlating energy differences rather than absolute energies. The correlation curves as presented can also be applied to the lite system when using the full wave-function. Indeed, if the correlation between the IQA energies and the large system wave-function energies is adequate, it means that most of the relevant intra-and interatomic interactions have been considered throughout the REG analysis. In other words, even if fewer interactions are considered, a reliable REG-IQA analysis is obtained as long as a balance of favourable and unfavourable interactions is maintained. In contrast, this error metric will not be insightful, although it will technically work, for the lite system corresponding to a small wave-function because the QM PES is not representative of the chemical reaction considered and the IQA integrations did not recover the energy gradients of the real system properly, as mentioned earlier.
Representative graphs and correlation curves for the HIV-1 Protease case study are shown in Figure S4 to demonstrate the concept of correlation curves as an error metric. The first comparison is made between (a) and (d), where the IQA recovered energies are in favourable agreement with the QM calculated energies.
However, each stationary point is recovered differently, and there appear to be more IQA energy fluctuations in the 11-steps graph (a) than in the 6-steps graph (c). This is confirmed by looking at the corresponding correlation curves (b) and (d), where a slightly stronger correlation for the latter is observed. Both approaches show that IQA recovers the reaction trend despite the always present integration errors, which are accounted for by the translation over the energy averages. Figure S4 (e) shows the HIV-REG lite system (27 atoms) using the large wave-function. It is observed that IQA is not fully recovering the wave-function energy as the two PES are not overlapping well. This is expected because only 729 out of the possible 17,689 energies of the 133-atom systems are considered. Nevertheless, the correlation between the two energies is still strong (R = 0.977), and the linear coefficient is close to 1 (m = 1.229) as shown in panel (f), explaining that the interactions considered for the REG analysis are those that most contribute to the shape of the PES.
Finally, panels (g) and (h) shows the PES and the correlation curve for the HIV-REG lite system calculated considering the small wave-function of 27 atoms. Although the results seem reasonable due to a surprisingly good correlation between the IQA energy and the WFN energy, it must be noticed that the PES is not representative of the HIV-1 peptide hydrolysis. Indeed, the hypothetical ∆E of activation for this smaller system is around 58.7 kJ/mol, which is different from the 66.5 kJ/mol obtained from the 133-atom PES.
Moreover, this PES in Figure S4 Figures S7-S8). These extreme cases show how correlation curves can be used to understand if relevant atoms (and thus their interactions) were correctly or incorrectly chosen for the IQA integrations in a biased approach (as described in the main text). Figure S6 (a) shows the PES recovered with the sum of 36 (=6x6) IQA energy terms and the wave-function energy of the 133-atom system. Figure S6 (b) represents the correlation curve between the two types of energies. We observe an acceptable correlation but a very high linear coefficient m, which is the sum of the 36 REG values of the considered interactions. This high value confirms that strong interactions are considered for that specific segment of the PES but too few to represent the wave-function energy gradients correctly. Figure S5. HIV-1 protease active site 133-atom model represented by a lite system that is too small. The lite system is marked in orange with black labels and the rest in cyan. The lite system consists of 6 atoms on which the REG-IQA analysis was performed.
(b) (a) Figure S6. (a) Wave-function PES and IQA PES comparison for a 6-atoms lite system, and (b) relative correlation curve. Figure S8 (a) shows the PES recovered with the sum of the IQA energies of 24 atoms (depicted in Figure S7) the wave-function energy of the 133-atom system. In this case, there is no good correlation between wave-function energy and IQA energies ( Figure S8 (b)). Indeed, given the trend of the original PES, which goes from lower to higher energy, similar behaviour from the recovered IQA PES is expected to occur. However, this does not occur due to the selection of primarily atoms that are irrelevant to the reaction mechanism. In other words, to ensure the atom selection is correct, it is likely that the sum of the REG values of the intra-atomic and interatomic interactions is close to unity, given the additive property of REG.

Hydrogen Addition
As pointed out in the main text, hydrogen atoms were added at the point of truncation at a distance equal to their van der Waals radii. However, geometry optimisation of the newly formed H-A, with A being any atom left un-bonded from the original bond, would be intuitively needed given that the H atom is not at its lowest energy state due to being added at an arbitrary distance. However, does this strongly affect the energy gradients of the PES under study? Table S7 shows a comparison of the m, R 2 and ∆ ∆ error metrics for atom groups G, H, I and J of Figure S10 in their optimised and non-optimised states. The optimisation was performed by constraining (i.e. "freezing") all the atoms of a group at each stationary point of the PES while letting the H atoms added from the truncation to relax. Note that optimising the whole group is beyond the scope of this study and would lead to prolonged simulation protocols. Indeed, these geometries are a truncation of a larger system that is already optimised, and the objective is to find the optimal number of atoms that best recovers the energy gradients of that initial system. It is easily observed that the optimisation of the added hydrogen atoms has little to no impact on the error metrics; thus, the energy gradients of the PES of these truncations are almost the same in both cases, and there is no practical advantage in optimising the truncations for the following REG-IQA analysis. The REG-IQA results for the 68 atoms system of group G ( Figure S10 (g)) are presented here. The first detail is that the number atoms that can be considered are not 68 but 64 because four hydrogens were added due to truncation. These hydrogens are not relevant to the REG-IQA analysis because they are not part of the full 133-atom system; thus, there is no need to calculate their interactions with other atoms. However, for the sake of completeness these atoms were still included in the computation, but not in the final analysis. Table   S8 shows the REG-V xc and REG-V cl values and their ranking. These values should be compared with the ones for the full 133-atom system (Table S3). A strong agreement of both REG values and rankings is observed, thus demonstrating that using almost half of the atoms for the IQA analysis gives the same chemically relevant information. Moreover, the correlation curve in Figure S12 shows nearly perfect agreement between summed IQA energies and the original QM energies.
This approach is similar to the one shown in the main manuscript, where 27 atoms were arbitrarily picked, and the REG-IQA analysis was performed on the small wave-function. However, in this case, the wave-function at each stationary point has the same properties as the full system wave-function, while the chosen 27 atoms were not enough to fully recover the PES. Moreover, this procedure is also more accurate than the one used in the "biased" approach because the pool of IQA energies is substantially increased while maintaining computational efficiency. In that case, 27 atoms were selected, and the IQA integrations were done using the full system wave-function. That resulted in a two-day computation time per geometry using 8 CPU cores, and precise IQA energies were calculated but with discrepancies in the REG analysis due to consideration of fewer interactions. The 68-atom calculations do not recover the exact energy values, given the smaller and slightly different wave-function employed, but output a more meaningful REG analysis compared to the 27-atom system. Moreover, the computational efficiency is still maintained, if not improved, due to a smaller wave-function. Indeed, the calculations overall took ∼2 CPU days per stationary point on 8 CPU cores. analyses, as CPU hours averaged over all single-point energies. Note that the calculations were carried out on heterogeneous hardware and that each geometry was computed on 8 CPU cores. Original and unbiased calculations were also carried out with the lowest settings for the IQA integrations, which correspond to "mesh = sparse" and "outer angular quadrature = GS1" (≤ 1800 points). Figure S12. Correlation curve between the wave-function and IQA energies for the HIV-1 protease 68-atoms truncation.

Original
9 HheC system 9.1 RDP algorithm Figure S13. Ramer-Douglas-Peucker algorithm applied to the reaction coordinate of the HheC system. The transition state is on the left and the reactant on the right. The RMSE is chosen to be 0.25 kJ/mol. Points [1,2,3,4,10,11,12,13] are the points selected by the algorithm with a 0.23 kJ/mol actual RMSE. These 8 points are highlighted with small disks.