Electrostatic Embedding of Machine Learning Potentials

This work presents a variant of an electrostatic embedding scheme that allows the embedding of arbitrary machine learned potentials trained on molecular systems in vacuo. The scheme is based on physically motivated models of electronic density and polarizability, resulting in a generic model without relying on an exhaustive training set. The scheme only requires in vacuo single point QM calculations to provide training densities and molecular dipolar polarizabilities. As an example, the scheme is applied to create an embedding model for the QM7 data set using Gaussian Process Regression with only 445 reference atomic environments. The model was tested on the SARS-CoV-2 protease complex with PF-00835231, resulting in a predicted embedding energy RMSE of 2 kcal/mol, compared to explicit DFT/MM calculations.


S1 Molecules excluded from the dataset
shows four congeneric molecules excluded from the dataset. All have highly distorted geometries for the tertiary carbon atom in the three-membered ring, causing large prediction error for the atomic charge.

S2 Selection of reference atomic environments
The reference environments were selected using Informative Vector Machine 1,2 approach: given a set of candidate datapoints, the process starts with a randomly picked datapoint and then adds the point which has the largest variance with the current model. The process is repeated until the largest variance for any of the excluded datapoints falls below a certain threshold. An example of this process with a toy 1D model is presented on Figure S2.  Table S1.  3 sparse GPR approach aims to find the basis weights that minimize the prediction error on a large training set, while using only a limited number of basic atomic environments. The derivation is based on the assumption that the mean of the learned function is zero. If this is not the case, it is suggested that the mean is subtracted before learning and then added to the predictions. This essentially introduces an additional parameter to the model. To avoid that, here an alternative expression is used, which is based on the full GPR expression and provides the values of the learned function at the reference locations so that the prediction error over the full dataset is minimized. Additionally, a constant shift is added to the predictions and the mean of the observations is constrained to zero. This is identical to subtracting the mean from the observations and adding it to the predictions. Moreover, learning the values instead of weights makes the free parameters physically meaningful, in a sense that they can be interpreted as an estimate of the physical observables corresponding to the basis locations based on the given training set. Therefore, they are much less dependent on a particular choice of features and kernel function compared to the GPR weights.
We start with the full GPR expression for N training observations and M basis locations Where y are the training observations and λ is a Lagrange multiplier. By differentiating Eq.
(S3) w.r.t. y ′ ref and y 0 we get the following set of equations: solving which gives: The observations for the basis locations that serve as the free parameters of the model are then:

S4 Molecular dipolar polarizability from Thole model
The molecular dipolar polarizability tensor α mol provides the induced molecular dipole µ ind in response to the uniform external electric field E through the following expression: In the induction model presented in this work, the induced dipole can be calculated as a sum of induced atomic dipoles obtained with the Thole model, µ ind = i µ i . Then, α mol is obtained as follows: where B = A −1 and b ij are 3x3 blocks of B.

S6 Calculation of electrostatic potential
The electrostatic potentials created by the QM subsystem at the MM atomic positions were calculated using orca vpot. The GFN2-xTB reference calculations were also done through ORCA interface, which does not support generation of density files, so for GFN2-xTB the potentials were not calculated. In the case of AM1 and PM3, it seems to be a bug in either orca or orca vpot, resulting in huge errors for the seliempirical electrostatic potential values compared to the DFT ones (RMSE ≈ 10 E h /e), so these were not used for discussion. The bug is reported (https://orcaforum.kofo.mpg.de/viewtopic.php?f=11&t=9451).

S7 χ from MBIS partitioning
While electronegativities are not directly available from QM reference data and are obtained by sparse GPR fitting to MBIS charges, once the scaling factor a QEq is known, it is possible to derive χ from the MBIS atomic charges by rearranging Eq. 16: The obtained electronegativities provide exactly the MBIS charges when used in QEq with the same parameters (valence widths and a QEq ). "Average" electronegativities (⟨χ M BIS ⟩) used in Table 3 are obtained by averaging χ M BIS values for each atom over the dataset.   Figure S4: Learning workflow. Figure S5: Prediction workflow.