Delta Machine Learning for Predicting Dielectric Properties and Raman Spectra

Raman spectroscopy is an important characterization tool with diverse applications in many areas of research. We propose a machine learning method for predicting polarizabilities with the goal of providing Raman spectra from molecular dynamics trajectories at reduced computational cost. A linear-response model is used as a first step and symmetry-adapted machine learning is employed for the higher-order contributions as a second step. We investigate the performance of the approach for several systems including molecules and extended solids. The method can reduce training set sizes required for accurate dielectric properties and Raman spectra in comparison to a single-step machine learning approach.

Atomic motions are often key to physical and chemical phenomena occurring at finite temperature, both in solid-state and molecular systems.Experimentally, the dynamical behavior of such systems can be probed by Raman spectroscopy.2][13] Computational predictions of Raman spectra using first-principles calculations are an important counterpart to experimental measurements.They provide further insight into the behavior of materials, facilitate the interpretation of measured spectra via theory-experiment comparisons, and enable predictions of dynamical properties in new compounds.The central quantity for theoretical calculations of Raman spectra is the polarizability tensor, α.It describes the first-order dielectric response of a system to external electric fields.In practice, the dielectric tensor, ϵ, can be used for periodic systems since it contains the same information; 14 in the following text, we shall use both quantities interchangeably.6][17] But this method is limited since it cannot capture anharmonic effects, higher-order Raman scattering, or the explicit temperature dependencies of Raman modes.These effects are relevant in a variety of physical systems and scenarios.For example, a description of phase-transitions in solid materials requires temperature-dependent phonon modes, 18 which cannot be captured in a strictly harmonic phonon picture.Molecular dynamics (MD) simulations offer a way to include these effects and overcome limitations of the harmonic approach.A Raman spectrum can be computed from an MD trajectory by calculating Fouriertransformed velocity autocorrelation functions of the components of α. [19][20][21] However, this requires computing a time series of α values from multiple MD snap-arXiv:2307.10578v2[cond-mat.mtrl-sci] 1 Feb 2024 shots.The number of data points needed in such an MD-Raman approach depends on the desired frequency resolution and the total range of frequencies that needs to be covered, but typically at least a few hundred points are needed.Such polarizability calculations can be done from first-principles using density functional perturbation theory (DFPT), 22 but this renders MD-based Raman calculations computationally demanding.The large computational efforts involved in MD-Raman calculations limit the range of physical scenarios and systems one can investigate with the method.The computational cost of first-principles calculations can be significantly reduced by machine learning (ML) techniques.4][25] More recently, these methods have also been applied to predict tensorial properties including polarizabilities, which currently is a very active research area. 26oth symmetry-adapted kernel-based methods [27][28][29][30][31][32] and neural-network approaches [33][34][35][36][37][38] have been used for this, as well as a physically-based small parametric model. 391][42] Delta ML (∆-ML) is a combined approach to predicting physical quantities: a computationally inexpensive approximation is used as a first step and ML methods are then applied to learn only the differences between first-step predictions and true values. 43The challenge in any ∆-ML method is the search for a physical model to achieve sound first-step predictions that can seamlessly be combined with the proceeding ML method.It has been previously shown that the prediction of polarizabilities in a molecular crystal can be improved by using the polarizability of the molecular monomer as a first step. 27However, to the best of our knowledge, no systematic ∆-ML approach that can be applied beyond the case of molecular crystals has been proposed and investigated yet for prediction of dielectric properties such as polarizabilities and Raman spectra.In this letter, we propose a ∆-ML method for predicting dielectric properties of molecules and materials.Focusing on polarizabilities, we suggest a linear response model (LRM) that encodes key information on the dielectric response of the system as a first step.Combination of the LRM with ML for tensorial properties is assessed via inspection of polarizability components and Raman spectra.We find that the ∆-ML method increases accuracy and reduces required training-set size compared to a direct ML approach where the same ML model is applied to the polarizability data directly, i.e., without using any first-step approximation.Applying ∆-ML to small molecules and extended solids including more complicated materials, as well as discussing its inherent limitations and potential for further improvements, we demonstrate its predictive power for practical MD-Raman calculations across a broad range of physical systems.In order to develop our method, we start from a Taylor expansion of a component of α with respect to atomic displacements (∆x) from their respective equilibrium positions or any other reference structure (x 0 ): where N at is the number of atoms of the system and the index i enumerates the components of a 3N at dimensional atomic position vector x.Eq. ( 1) is an exact formula which can be approximated to arbitrary order.Here, we consider the simplest, first-order variant: a LRM is constructed by determining the constant term, α(x 0 ), and the respective derivatives, ∂α αβ /∂x i .The constant term can be determined via a single DFPT calculation for the equilibrium structure.The first-order derivatives can be obtained through additional DFPT calculations on displaced structures.Applying a central difference method, two calculations are necessary for each atomic degree of freedom.We note that the problem of choosing the coordinate frame in systems with rotational degrees of freedom can easily be solved by employing rotationally and translationally invariant internal coordinates, as discussed in Sec.I.E of the Supporting Information (SI).
A numerical demonstration that rigid rotations do not significantly deteriorate predictions of our LRM model is given in Sec.I.H of the SI.Symmetry considerations can be used in this procedure in order to reduce the number of required calculations, since any two symmetry-equivalent atoms imply derivatives that are related via similar symmetry operations as position vectors of atoms.Thus, a total of 6N + 1 calculations are required to parameterize the LRM, where N is the number of symmetryinequivalent atoms in the system.Further details about the LRM are presented in the SI.We parameterize the LRM via DFPT calculations 44 using VASP 45 and the PBE exchange-correlation functional. 46This allows for predicting polarizabilities of MD snapshots by extracting the displacements ∆x ij from it and applying Eq. ( 1).Fig. 1a shows the performance of predictions with the LRM for the α xx component in SiO 2 along a selected portion of an MD trajectory at 300 K, which we obtain using density functional theory (DFT) in VASP.Compared to the DFPT reference data for the same MD trajectory, the LRM accurately captures many of the α xx oscillations.
The LRM has several limitations: first, the polarizability can have a local extremum at the reference positions, in which case the derivatives, ∂α αβ /∂x i , vanish.This occurs when atomic motion along the respective axis is first-order Raman inactive.Additionally, it is also possible that these derivatives are zero only for certain components of α, which will be discussed below.In these cases, the LRM can only capture the constant term and the predicted polarizability timeseries for these components will therefore be approximately constant.These are important limitations of the present LRM, which, in principle, can be overcome in a straightforward way by accounting for further terms in the expansion of Eq. ( 1).
Here, we focus on the fact that these limitations motivate an ML model as an additional second step that can capture higher-order displacement responses.
For the second step in our ∆-ML approach we employ kernel-based methods. 47,48The underlying idea is to use a descriptor that captures relevant aspects of atomic configurations and a kernel function that allows to quantify similarities between different configurations along MD trajectories.These similarities can then be used to perform fitting using kernel ridge regression (KRR).For many physical quantities, descriptors based on overlap integrals are an appropriate choice.0][51][52] For fitting tensorial quantities such as α, the extended λ-SOAP approach is wellsuited because it also takes into account tensorial covariance properties. 53While the choice for λ-SOAP implies no difference for some tensor invariants such as the mean polarizability (see Sec. IC in the SI), it does provide a benefit for fitting the off-diagonal elements of α.
We perform the fitting procedure in KRR with λ-SOAP using the dscribe 54 and librascal 55 packages together with scikit-learn. 56The training data for the ML model are obtained from DFPT calculations on a subset of MD trajectories.Note that the 6N + 1 configurations that were required for the LRM can, in principle, be reused as training data for the ML model, which we did not attempt here to ease the comparison.Furthermore, a validation dataset is selected from the MD trajectory in order to optimize ML hyperparameters with respect to validation error.For the sizes N t and N v of the training and validation set, respectively, we used a constant ratio N t : N v = 5 : 1.This ensures that the validation set is scaled in proportion whenever the number of training data points is increased.In the spirit of ∆-ML, LRM predictions are always subtracted from DFPT values before feeding them to ML, such that only differences are learnt.Further details on the fitting procedure can be found in Sec.I.F of the SI.Fig. 2 shows scatterplots of a polarizability timeseries for SiO 2 predicted with ∆-ML and an otherwise identical ML approach without the LRM (direct-ML), comparing both to reference DFPT calculations along an MD tra- jectory.The diagonal components of α are predicted well already with direct-ML compared to the DFPT reference data.For this particular case, even LRM alone achieved fairly accurate predictions (cf.Fig. 1a), implying that the diagonal components are relatively easy to capture for SiO 2 .However, the off-diagonal components are predicted far less accurately in direct-ML.Remarkably, we find that our ∆-ML method provides similarly accurate predictions for both the diagonal and off-diagonal components of α (see Fig. 2b and Fig. 1b).The findings suggests that the ∆-ML approach allows for more efficient learning of all components of α because the LRM provides a good first approximation for dynamic fluctuations of this quantity.To assess the accuracy of our method, we compared it again to direct-ML for the case of SiO 2 , using DFPT results as a reference and calculating the coefficient of determination, R 2 (see SI for details).We focus on the tensor invariants a (mean polarizability) and γ 2 (anisotropy) of α (see Sec. IC of the SI), since these are directly relevant for Raman calculations, and show R 2 for SiO 2 as a function of N t in Fig. 3.We find that for achieving R 2 close to 1 for both a and γ 2 , ∆-ML requires N t to be on the order of only 20, which outperforms direct-ML by at least a factor of 2. Thus, the LRM and ML methods encoded in ∆-ML complement each other well and may offer accurate predictions of α with smaller training-set sizes.The size of the training set required to obtain good prediction performance is expected to strongly depend on the system.Therefore, we investigate the versatility of ∆-ML by computing R 2 for a broader range of physical systems that include extended solids and gas-phase molecules.The N t values that are required to achieve good prediction performance are listed in Table I for each system.The solid AlN is an interesting, challenging example because of its known LO/TO splitting that makes prediction of its dielectric and Raman properties a difficult problem. 17 Thus, AlN illustrates several complications that require an accurate computational methodology for prediction of its Raman spectrum.For this case as well as for SiO 2 , we find that ∆-ML achieves a significant reduction of required N t compared to direct-ML.The small gas-phase molecules (H 2 O, CH 4 , and CH 3 OH) we consider here are found to require similarly low N t ∼ 10 in both approaches, further demonstrating the broad applicability of the ∆-ML method.We consider two more extended systems, Si and NaCl, in order to demonstrate how the aforementioned inherent limitations of the chosen LRM are compensated by the proceeding ML step: the change of the diagonal components of α with ∆x i is approximately an even function in Si (see SI), for which the simple LRM predicts a constant timeseries for the α αα (x(t)) values that does not capture the pertinent temporal fluctuations in the system.NaCl is another challenging case because it is not first-order Raman active and the LRM therefore merely predicts α αβ (x(t)) = const.Table I shows that ∆-ML and direct-ML lie on par for these two cases, i.e., the proceeding ML can compensate for the lack of dynamical information in the underlying LRM.Hence, even when, for physical reasons, the choice of our specific LRM may not provide any benefit for learning components of α, it does not worsen prediction performance.Therefore, the ∆-ML approach can seamlessly integrate a simple physical model and ML procedure in order to capture the temporal and spatial fluctuations of α.These findings suggest that ∆-ML is a promising approach for dielectric predictions of dynamical systems.Furthermore, since the LRM is the simplest approximation to Eq. ( 1), it can still be extended in a straightforward way to further improve prediction performance of ∆-ML if needed.Close connections of the here discussed dielectric quantities to Raman spectra are established via a correlationfunction analysis.Specifically, calculation of the Raman spectrum requires the Fourier-transformed velocity autocorrelation functions (VACFs) of the tensor components of α, i.e., The terms a τ and γ 2 τ can then be computed from the VACFs, 20,57 see Sec.ID in the SI for details.A spherically-averaged Raman spectrum can then be obtained as Note that the frequency-dependent prefactor in Eq. ( 3) is not exact.Several other versions of this equation also exist, which can be derived based on different approximations for taking into account the quantum nature of the atomic motion. 58e applied our ∆-ML method to calculate MD-Raman spectra at 300 K for the test systems described above, using full DFPT calculations as a reference.Fig. 4 showcases two examples and the other systems are discussed in the SI.While our focus is on the performance of the ∆-ML method, we point out that the spectra are also affected by other aspects of the computational setup, such as the choice of density functional approximation.Good agreement between ∆-ML and DFPT is obtained for SiO 2 as expected from our above findings and further quantified through the calculated cosine similarity, S C , shown as spectrum overlap in Fig. 3 (see SI).The case of NaCl is particularly interesting since, as we noted  above, it is a first-order Raman inactive material, which means that only higher-order effects contribute to the Raman spectrum.For this reason, the spectrum contains contributions from q-points other than Γ. 59 This makes MD-Raman calculations particularly challenging, requiring a large simulation cell in order to provide sufficient sampling of q-space (see SI for details).In addition, we find that the ML procedure required relatively large N t in this case.We speculate that N t could be reduced via improvement of the first step in the ∆-ML procedure, for example by including higher-order terms.With sufficient accuracy of the terms a and γ, the Raman spectrum for NaCl agrees well with DFPT and recent experimental data. 59To our knowledge, this is the first ML-based MD-Raman calculation for a higher-order Raman material.It is interesting to note that for MD-Raman calculations we observed a certain insensitivity on errors in the individual components of α.Specifically, relatively inaccurate predictions of tensor components can still result in a relatively accurate Raman spectrum (see spectrum overlap in Fig. 3).In addition, errors at different points along α(x(t)) are, to a good approximation, random and independent of each other.Thus, these errors tend to average out when computing autocorrelation functions.Finally, we mention that our method provides room for further adaptation in future work.Possibilities include use of more advanced first-step models instead of the LRM, different descriptors in the ML step, and changes to the hyperparameter optimization scheme such as employing cross validation.In conclusion, we proposed a ∆-ML approach that unifies a physical model with symmetry-adapted ML for prediction of dielectric properties and Raman spectra and is applicable to a broad range of different systems.We focused on the polarizability tensor and chose a simple LRM as a starting point to describe the dynamic dielectric fluctuations, which is completed by an ML procedure in a second step.The ∆-ML method can perform better than an otherwise identical direct-ML approach with the same training set size.This is especially because the LRM step provides a benefit for predicting off-diagonal components of the polarizability tensor.Since the data points needed for parameterizing the LRM can be reused as ML training data, ∆-ML does not necessarily increase computational costs compared to direct-ML.We also investigated specific systems for which the LRM method provides no benefit by design and found that it does not deteriorate ML prediction performance for these cases.Our findings show that ∆-ML is a promising approach for predictions of dielectric properties and Raman spectra of molecules and materials at finite temperature.It provides a way to reliably compute spectra that capture the full extent of atomic motions in molecules and materials without relying on the harmonic approximation at a reasonable computational cost.
We speculate that the ∆-ML approach might also be useful for the calculation of other properties that require time-correlation functions, such as infrared spectra or transport coefficients.
FIG. 1.(a) LRM predictions of the αxx component of the polarizability tensor in SiO 2 over a period of 2000 fs, compared to DFPT reference data.(b) ∆-ML predictions for the same data, obtained with Nt = 50 and Nv = 10.Note that the values shown here are polarizability per volume.
FIG. 2. (a) Scatterplot comparing direct-ML predictions of diagonal (left) and off-diagonal (right) polarizability components in SiO 2 to DFPT reference data, obtained with Nt = 20 and Nv = 4.(b) Scatterplot comparing ∆-ML predictions to DFPT, obtained with the same training and prediction set.Note that the values shown here are polarizability per volume.

R 2
FIG. 3. (a) Performance metrics for direct ML predictions in SiO 2 as function of training set size Nt.(b) The same performance metrics for ∆-ML predictions.

C
.v.S. thanks Felix Schwarzfischer for hepful discussions.Funding provided by the Alexander von Humboldt-Foundation in the framework of the Sofja Kovalevskaja Award, endowed by the German Federal Ministry of Education and Research, by TUM.solar in the context of the Bavarian Collaborative Research Project Solar Technologies Go Hybrid (SolTech), and by TU Munich -IAS, funded by the German Excellence Initiative and the European Union Seventh Framework Programme under Grant Agreement No. 291763, are gratefully acknowledged.The authors further acknowledge the Gauss Centre for Supercomputing e.V. for funding this project by providing computing time through the John von Neumann Institute for Computing on the GCS Supercomputer JUWELS at Jülich Supercomputing Centre.Part of the research was obtained using the computational resources procured in the national project National competence centre for high performance comput-ing within the Operational programme Integrated infrastructure (project code: 311070AKF2).T.B. acknowledges support from Slovak Research and Development Agency under the Contract No. APVV-20-0127.

TABLE I .
Minimum required training set sizes to achieve R 2 > 0.8 in both a (mean polarizability) and γ 2 (anisotropy) of α for different systems.Nt/Nv was kept constant at 5:1, Nt was increased in steps of 20 for AlN and NaCl and in steps of 10 for all other systems, and R 2 was evaluated on separate data sets of size 400.