Learning Electronic Polarizations in Aqueous Systems

The polarization of periodically repeating systems is a discontinuous function of the atomic positions, a fact which seems at first to stymie attempts at their statistical learning. Two approaches to building models for bulk polarizations are compared. In the first, a simple point charge model is used to preprocess the raw polarization, giving a learning target that is a smooth function of atomic positions, and the total polarization is learned as a sum of atom-centered dipoles. In the second, the average position of Wannier centers around atoms is predicted instead. For a range of bulk aqueous systems, both methods perform comparably well, with the former being slightly better but often requiring extra effort to find a suitable point charge model. As a challenging test, we also analyze the performance of the models at the air–water interface. In this case, while the Wannier center approach delivers accurate predictions without further modifications, the preprocessing method requires augmentation with information from isolated water molecules to reach similar accuracy. Finally, we present a simple protocol to preprocess the polarizations in a data-driven way using a small number of derivatives calculated at a much lower level of theory, thus overcoming the need to find point charge models without appreciably increasing the computational cost. We believe that the training strategies presented here will aid the construction of accurate polarization models required for the study of the dielectric properties of realistic complex bulk systems and interfaces with ab initio accuracy.

Further information is given in the README.txt files of these folders.

Water Slab Learning Curves
Fig. S1 shows learning curves for the polarization of the air-water interface systems, obtained either with the data pre-processing approach or by learning the positions of Wannier centres.
There are several differences between these learning curves and those shown in Fig. 1(a) in the main text for pure bulk water. Most notably, the Wannier centre model performs much better than the pre-processed data model. As discussed in the main text, the symmetry of the water slabs biases the predicted molecular dipole moments, making them systematically smaller; this bias is likely the reason for the relatively poorer performance. In addition, the model for average Wannier centre displacements saturates very quickly: the slab systems contain 128 water molecules, as opposed to the 32 molecules used for the bulk model, and so the effective number of training points is proportionally larger. Given that this model is already excellent, there is little need, for our current purposes, to improve it further.

Dipoles
The predictions of SA-GPR models trained on the polarization of a system are given by Eq. (S1), where P_pred(X_i) is the polarization for the system represented by X_i, K_αβ(X_i, X_j) is the (α, β) component of the vector SA-GPR kernel between systems X_i and X_j, and w_j,β is a weight that is computed using the projected-processes approach,9 as in Eq. (S2). Here, K_NM is the kernel matrix between members of the (N-dimensional) training set and those of the (M-dimensional) active set, and K_MM is the kernel matrix between pairs of members of the active set. σ^2 is a regularization parameter and P_calc is a vector containing the polarizations of members of the training set. It should be noted that, e.g., the vectors w and P_calc are "vectors of vectors" (i.e., matrices), but that we use this notation as it is familiar from standard Gaussian process regression (GPR).
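For reference, the two expressions described above can be written out in the standard projected-process form. This is a reconstruction from the symbol definitions in the text and not a copy of the original equations: in particular, the placement of σ^2 and the range of the sum over j (taken here to run over the M members of the active set) are assumptions.

```latex
\begin{align}
P^{\alpha}_{\mathrm{pred}}(X_i) &= \sum_{j=1}^{M} \sum_{\beta} K_{\alpha\beta}(X_i, X_j)\, w_{j,\beta}, \tag{S1} \\
\mathbf{w} &= \left( K_{NM}^{\mathsf{T}} K_{NM} + \sigma^{2} K_{MM} \right)^{-1} K_{NM}^{\mathsf{T}}\, \mathbf{P}_{\mathrm{calc}}. \tag{S2}
\end{align}
```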
Eq. (S2) can be modified to include molecular dipole moments, in this case calculated using the SPC/E model. In the modified expression, K_PM is the kernel between individual water molecules (of which there are P in the training set) and members of the active set, µ_calc is a vector containing the calculated molecular dipole moments, and γ is a hyperparameter that determines the relative importance of fitting to molecular dipoles compared to total polarizations. γ was tuned to a value of 0.05, which gave good agreement with the molecular dipoles of a validation set without losing accuracy on its polarizations.
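One way to realize the modification described above is as a regularized least-squares problem in which the molecular dipole residuals enter with weight γ. The sketch below assumes that form. The function name `projected_process_weights`, the flattened array shapes, and the loss itself are illustrative assumptions, not the implementation used in this work.

```python
import numpy as np


def projected_process_weights(K_NM, K_MM, P_calc, sigma2,
                              K_PM=None, mu_calc=None, gamma=0.0):
    """Solve for projected-process weights, optionally adding dipole data.

    Assumed loss (hypothetical, for illustration only):
        |P_calc - K_NM w|^2 + gamma |mu_calc - K_PM w|^2 + sigma2 w^T K_MM w

    K_NM    : (n, m) kernel between training systems and the active set
    K_MM    : (m, m) kernel between pairs of active-set members
    P_calc  : (n,)   calculated polarizations, Cartesian components flattened
    K_PM    : (p, m) kernel between individual molecules and the active set
    mu_calc : (p,)   calculated molecular dipoles, flattened the same way
    """
    # Normal equations for the polarization-only fit.
    A = K_NM.T @ K_NM + sigma2 * K_MM
    b = K_NM.T @ P_calc
    if K_PM is not None and mu_calc is not None:
        # The dipole term enters with relative weight gamma.
        A = A + gamma * (K_PM.T @ K_PM)
        b = b + gamma * (K_PM.T @ mu_calc)
    return np.linalg.solve(A, b)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K_NM = rng.normal(size=(12, 4))
    K_PM = rng.normal(size=(20, 4))
    K_MM = np.eye(4)
    w_ref = rng.normal(size=4)
    w = projected_process_weights(K_NM, K_MM, K_NM @ w_ref, 1e-8,
                                  K_PM, K_PM @ w_ref, gamma=0.05)
    print(np.abs(w - w_ref).max())
```

Setting `gamma=0.0`, or omitting the dipole arguments, recovers the polarization-only fit, so the same routine covers both cases.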

Local Perpendicular Dielectric Constants
As described in Olivieri et al.,10 the local dielectric constant ε_⊥(z) was computed as a function of the z position relative to the centre of the slab.

Electrolyte Solution Molecular Dipole Moments
Fig. S4(a) shows the scatterplot of molecular dipole moment components calculated using Wannier centres against those predicted using models trained on pre-processed data, baselined against the point charge polarization. As for the molecular dipole moments of bulk water, the correlation between the two predictions is excellent. This is further underscored by Fig. S4(b), which gives the histograms of molecular dipole moment magnitudes: the distribution of Wannier centre dipole moments is broader, as before, but the positions of the maxima are in very good agreement. These results show that it is possible to use SA-GPR to build polarization models that give an excellent reproduction of the local dipole moments, even in a system where the electrostatics are more complex than in pure water.

Data-Driven Unwrapping for Electrolyte Solutions
Fig. S5 illustrates the data-driven pre-processing approach for concentrated NaCl solutions.
This can be compared directly to Fig. 4 in the main text, for bulk water: the data-driven method works for both of these systems.
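The unwrapping step summarized in the caption of Fig. S5 (fit a straight line to the main branch, then shift every point by an integer number of polarization quanta towards that line) can be sketched for a single Cartesian component as follows. The function name `unwrap_to_main_branch` is hypothetical, and the main-branch mask is taken as an input because the text does not specify how the main branch is identified.

```python
import numpy as np


def unwrap_to_main_branch(p_model, p_wrapped, main_branch, quantum):
    """Shift wrapped polarizations by integer multiples of the quantum.

    p_model     : (N,) component predicted by the derivative-trained model
    p_wrapped   : (N,) same component of the calculated, wrapped polarization
    main_branch : (N,) boolean mask of points assumed to lie on the main branch
    quantum     : polarization quantum Q along this component
    """
    p_model = np.asarray(p_model, dtype=float)
    p_wrapped = np.asarray(p_wrapped, dtype=float)
    # Straight-line fit through the main-branch points only.
    slope, intercept = np.polyfit(p_model[main_branch],
                                  p_wrapped[main_branch], 1)
    line = slope * p_model + intercept
    # Integer number of quanta that brings each point closest to the line.
    n = np.rint((line - p_wrapped) / quantum).astype(int)
    return p_wrapped + n * quantum, n


if __name__ == "__main__":
    # Synthetic data: a smooth target that has been wrapped into one branch.
    x = np.linspace(-1.0, 1.0, 21)
    smooth = 0.9 * x + 0.05
    Q = 0.5
    k = np.rint(smooth / Q).astype(int)
    wrapped = smooth - k * Q
    unwrapped, n = unwrap_to_main_branch(x, wrapped, k == 0, Q)
    print(np.abs(unwrapped - smooth).max())
```

For a full polarization vector the same routine would be applied to each component separately, which gives the vector of integers n.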

Fig. S7 shows the polarization autocorrelation functions normalized by their values at t = 0. This normalizes out the ability of the models to correctly describe the fluctuations of the polarization, and shows instead only their ability to describe how these fluctuations relax towards equilibrium. Although the H2O-I PP SR model still performs worst for the air-water interface, the discrepancy is much less pronounced than in the main text, suggesting that the main problem with this model is its description of the magnitude of the fluctuations, rather than the dynamics, of the polarization.
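A minimal sketch of a normalized polarization autocorrelation function, C_PP(t)/C_PP(0), is given below. It assumes that the correlation is taken between fluctuations about the trajectory mean and averaged over all available time origins; neither choice is stated in the text, and the function name `normalized_autocorrelation` is hypothetical.

```python
import numpy as np


def normalized_autocorrelation(p, max_lag):
    """Return C_PP(t) / C_PP(0) for lags t = 0 .. max_lag (in frames).

    p       : (T, 3) polarization along a trajectory
    max_lag : largest lag to evaluate; must be smaller than T
    """
    p = np.asarray(p, dtype=float)
    dp = p - p.mean(axis=0)  # fluctuations about the mean (assumption)
    n_frames = len(dp)
    c = np.empty(max_lag + 1)
    for lag in range(max_lag + 1):
        # Dot product of fluctuations separated by `lag`, averaged over origins.
        c[lag] = np.mean(np.sum(dp[: n_frames - lag] * dp[lag:], axis=1))
    return c / c[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = rng.normal(size=(200, 3))
    c = normalized_autocorrelation(p, 10)
    c_scaled = normalized_autocorrelation(3.0 * p, 10)
    # Rescaling the polarization leaves the normalized function unchanged.
    print(np.allclose(c, c_scaled))
```

Because of the division by C_PP(0), a model that misjudges only the magnitude of the fluctuations gives the same normalized curve as the reference, which is why this quantity isolates the relaxation dynamics.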

Figure S1: Learning curves for the von Mises error in predicting the polarization of water slabs as a function of the number of frames used to train the model, either by pre-processing the data using a point charge model (solid red line) or by learning the average displacement of Wannier centres from the oxygen atoms (dashed black line).

Fig. S2 compares the average magnitudes of local molecular dipole moments as a function of their distance from the centre of a water slab, for the calculated Wannier centres and for H2O-B W SR, a model trained on Wannier displacements from bulk water systems. The discrepancy between these two results leads us to conclude that, without having been trained on local environments that contain interfaces, the Wannier centre models from pure bulk simulations give results that are quantitatively incorrect. Fig. S3 shows the average magnitudes of local molecular dipole moments as a function of their distance from the slab centre for the calculated Wannier displacements and for the Wannier displacements predicted by an SA-GPR model. The two results are in excellent agreement.

Figure S2: Average molecular dipole moment as a function of distance from the centre of a slab of water; solid black lines show the predictions using the calculated Wannier centre displacements, and dashed green lines the predictions of a model trained on bulk data (H2O-B W SR). The dashed vertical line shows the position of the Gibbs dividing surface.

Figure S3: Average molecular dipole moment as a function of distance from the centre of a slab of water; solid black lines show the predictions using the calculated Wannier centre displacements, while dashed green lines give the predictions of a model trained on the Wannier centre displacements (H2O-I W SR). The dashed vertical line shows the position of the Gibbs dividing surface.

Figure S4: (a) Scatterplot of partially resummed dipole moments from the NaCl PP-B SR model, comparing the prediction of the atom-centred dipole model with the calculated Wannier dipoles. (b) Histograms of molecular dipole moment magnitudes from calculated Wannier centres (solid black lines) and from NaCl PP-B SR predictions (dashed red lines).

Figure S5: (a) Blue crosses show the total polarizations predicted using a model trained on their derivatives. The points in red are identified as the "main branch", which is fit to a straight line. All points are then shifted by Qn to be as close as possible to the straight line, with Q the quantum of polarization and n a vector of integers. (b) Predictions of an SA-GPR model trained on the pre-processed data.

Figure S6: x-component (left-hand panels) and y-component (right-hand panels) of the system polarization along a molecular dynamics trajectory for (a) bulk water, (b) the air-water interface and (c) concentrated NaCl solutions. In all cases, solid black lines give the results calculated using density functional theory (DFT), dashed red lines the predictions from SA-GPR models for the post-processed polarization (in the case of the NaCl solution, this has been baselined against the polarization from a simple point charge model), and dotted blue lines the predictions from SA-GPR models for the Wannier displacements around each oxygen atom. Additionally, for the air-water interface the dash-dotted green line shows the predictions of an SA-GPR model trained on the total system polarization as well as on molecular dipole moments.

Figure S7: Polarization autocorrelation function C_PP(t), normalized by its value at t = 0, for (a) bulk water, (b) the air-water interface and (c) concentrated NaCl solutions. In all cases, solid black lines give the results calculated using density functional theory (DFT), dashed red lines the predictions from SA-GPR models for the post-processed polarization (in the case of the NaCl solution, this has been baselined against the polarization from a simple point charge model), and dotted blue lines the predictions from SA-GPR models for the Wannier displacements around each oxygen atom. Additionally, for the air-water interface the dash-dotted green line shows the predictions of an SA-GPR model trained on the total system polarization as well as on molecular dipole moments.