Machine Learning Approaches toward Orbital-free Density Functional Theory: Simultaneous Training on the Kinetic Energy Density Functional and Its Functional Derivative

Orbital-free approaches might offer a way to boost the applicability of density functional theory by orders of magnitude in system size. An important ingredient for this endeavor is the kinetic energy density functional. Snyder et al. [Phys. Rev. Lett. 2012, 108, 253002] presented a machine learning approximation for this functional that achieves chemical accuracy on a one-dimensional model system. However, its poor performance with respect to the functional derivative, a crucial element in iterative energy minimization procedures, necessitated the application of a computationally expensive projection method. In this work we circumvent this issue by including the functional derivative in the training of various machine learning models. Besides kernel ridge regression, the original method of choice, we also test the performance of convolutional neural network techniques borrowed from the field of image recognition.


Data Sets
The training and test data are generated to match as closely as possible the data used by Snyder et al., S1 in order to clearly demonstrate the improved accuracy achieved by including the functional derivative in the training algorithm. In Sections 1.1 and 1.2 the data sets for N = 1 and N = 2, respectively, are inspected in greater detail. Additionally, the performance achieved by simple models is investigated as a reference for the more complex models explored in the main article.
The parameter triplets a, b, and c, used to generate the potentials of the training and test set, are available online in CSV format.
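As an illustration, the potentials can be rebuilt from these triplets along the following lines. This is a minimal sketch assuming the functional form of Ref. S1, a sum of three Gaussian wells on a G = 500 point grid in a box of unit length; the file name and column layout of the CSV file are placeholders.

```python
import numpy as np

# Sketch of how the potentials could be rebuilt from the published (a, b, c) triplets,
# assuming the form used by Snyder et al. (S1): a sum of three Gaussian wells
#   v(x) = -sum_i a_i * exp(-(x - b_i)^2 / (2 c_i^2))
# on a grid of G = 500 points in the unit box. File name and column order are placeholders.
G = 500
x = np.linspace(0.0, 1.0, G)

def potential(a, b, c):
    """Discretized potential for one parameter triplet (a, b, c each of length 3)."""
    return -sum(a_i * np.exp(-(x - b_i) ** 2 / (2.0 * c_i ** 2))
                for a_i, b_i, c_i in zip(a, b, c))

params = np.loadtxt("parameters.csv", delimiter=",")  # one row per potential: a1..a3, b1..b3, c1..c3
potentials = np.array([potential(row[0:3], row[3:6], row[6:9]) for row in params])
```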

Data for N = 1
The training set consists of a total of M = 100 densities. Figure S1 shows a typical example of an input density as well as a histogram of the corresponding kinetic energies. In order to evaluate the complexity of the data set we fit simple regression models and test their performance on the test set. The results for all of these models are summarized in Table S1.
The simplest possible model is a constant model, T^model[n] = b, where the least squares solution for the parameter b is given by the mean kinetic energy of the training examples, b = ∑_j^M T_j / M. The mean absolute error achieved by this model is known as the mean absolute deviation (MAD) in statistics. A slightly more complex model is given by a standard least squares linear fit relating the G = 500 dimensional input densities to the kinetic energy, T^model[n] = ∑_g^G w^(g) n^(g) + b, where the index (g) denotes the gth entry of a vector and the parentheses are used to distinguish grid point indices from training example indices. Note that the mean absolute error for this model is significantly higher than for the constant model, most likely because fitting 500 weight parameters to only 100 training examples is heavily underdetermined and leads to overfitting. Ridge regression addresses this problem by introducing an additional regularization term that penalizes large weights in the linear fit. Since ridge regression is the linear variant of the kernel ridge regression used in the main article, the errors achieved here represent an upper bound on the expected performance of KRR. The results are obtained with a regularization parameter of λ = 10⁻⁶ and the ridge regression algorithm as implemented in the scikit-learn python package. S2
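For reference, the ridge regression baseline can be reproduced with a few lines of scikit-learn. The array names and file handling below are placeholders; only the regularization parameter λ = 10⁻⁶ is taken from the text.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Ridge regression baseline; densities are rows of an (M, G) array, energies a length-M vector.
# The .npy file names are placeholders for however the data set is stored.
n_train, T_train = np.load("train_densities.npy"), np.load("train_energies.npy")
n_test, T_test = np.load("test_densities.npy"), np.load("test_energies.npy")

model = Ridge(alpha=1e-6)                       # regularization parameter lambda = 1e-6
model.fit(n_train, T_train)
mae = np.mean(np.abs(model.predict(n_test) - T_test))
print(f"ridge regression mean absolute error: {mae}")
```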

Data for N = 2
The data set for N = 2 used in Section 3.3 and Section 3.4 of the main article is based on the same potentials as the data presented in the previous section. While the addition of a second particle leads to higher values for the total kinetic energy, the standard deviation does not increase significantly.
We attribute this to the fact that the second particle is less influenced by the potential and the corresponding wave function resembles that of a particle moving freely in a hard wall box. A summary of this training set is given in Figure S2. In Section 3.3 and Section 3.4 of the main article the kinetic energy is expressed as a machine learning correction to the von Weizsäcker functional,

T[n] = T_vW[n] + T^ML[n].

The models are therefore not trained on the data set presented in Figure S2 but rather on the difference between these exact values and the prediction obtained with the von Weizsäcker functional.

Figure S3: Comparison of the exact kinetic energy values of the N = 2 test set on the x-axis and the corresponding predictions by the Thomas-Fermi functional (left panel) and the von Weizsäcker functional (right panel) on the y-axis.
In Figure S3 the kinetic energy predictions of the von Weizsäcker functional and the Thomas-Fermi functional (included for the sake of completeness) are plotted against the exact values in order to show that the von Weizsäcker model cannot describe systems consisting of two spatial orbitals.
While the Thomas-Fermi model roughly captures the correct functional correlation, the predictions by the von Weizsäcker functional exhibit an opposing trend. The accuracy of the Thomas-Fermi model can be improved further by adding a constant term. Fitting this constant offset using the training set yields reduced values of 110.9 kcal/mol, 83.1 kcal/mol, and 453.5 kcal/mol for the mean, the standard deviation and the maximum of the absolute error, respectively.
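A minimal sketch of the two reference functionals on the discretized grid is given below (atomic units assumed). The von Weizsäcker term is T_vW[n] = 1/8 ∫ (n′)²/n dx; the Thomas-Fermi prefactor π²/6 is the standard 1D expression for spinless fermions and may differ from the convention used in the article.

```python
import numpy as np

def von_weizsaecker(n, dx):
    """T_vW[n] = 1/8 * integral (n')^2 / n dx (atomic units)."""
    grad_n = np.gradient(n, dx)
    return 0.125 * np.sum(grad_n ** 2 / np.maximum(n, 1e-12)) * dx  # guard against n = 0 at the walls

def thomas_fermi(n, dx):
    """1D Thomas-Fermi kinetic energy for spinless fermions, T_TF[n] = pi^2/6 * integral n^3 dx."""
    return np.pi ** 2 / 6.0 * np.sum(n ** 3) * dx

# The machine learning models of Sections 3.3/3.4 are then trained on the residual, e.g.
#   targets = T_exact - np.array([von_weizsaecker(n, dx) for n in densities])
```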
Due to the opposing functional behavior of the von Weizsäcker model, subtracting the von Weizsäcker kinetic energy from the exact values leads to an increased variance in the training data for the machine learning model, as depicted in Figure S4.

Figure S4: Histogram of the kinetic energies in the training set for N = 2 after subtracting the von Weizsäcker kinetic energy.
Table S3 shows that this fact is also reflected in the reduced accuracy achieved by the constant model. Note, however, that both linear models achieve significantly lower mean absolute errors.
We conclude that the von Weizsäcker term captures some of the nonlinear contributions to the kinetic energy in this data set.

Kernel Ridge Regression
In this section we provide a detailed derivation of the Kernel Ridge Regression (KRR) algorithm.
In Section 2.1 the KRR method is reviewed and a standard notation valid in both the SI and the main manuscript is introduced. Section 2.2 shows how this concept can be extended to include training data for the functional derivative. Note that in both sections the formulas are derived in the so-called weight space view and transformed only in the last step to the kernel space expressions used in the main article.

Standard KRR
The main idea in KRR is that even nonlinear data can be described by a linear model after the transformation to a higher dimensional vector space,

T[n] = ∑_d w_d φ_d(n) = wᵀ φ(n),

where the w_d are the fit coefficients and φ denotes the transformation from the input space R^G to a higher dimensional feature space R^D. The weights are determined by minimization of a cost function consisting of the squared error and a regularization term,

C(w) = ∑_j^M ( wᵀ φ(n_j) − T_j )² + λ wᵀ w,

where the parameter λ controls the regularization strength. The cost function is minimized by setting the derivative with respect to w_k equal to zero:

∂C/∂w_k = 2 ∑_j^M ( wᵀ φ(n_j) − T_j ) φ_k(n_j) + 2 λ w_k = 0.

Differentiating with respect to all weights gives a total of D such equations, which can be rewritten as a matrix equation,

( Φ Φᵀ + λ I_D ) w = Φ T,

where Φ is a (D × M) matrix whose columns contain the transformed input densities, T is the vector of training energies, and I_D is a unit matrix of size (D × D). The weights can be calculated by inversion of the matrix on the left hand side,

w = ( Φ Φᵀ + λ I_D )⁻¹ Φ T.

This expression for w can be rearranged further by the following matrix identity, often referred to as the push-through identity: S3-S5

( A + P R Qᵀ )⁻¹ P R = A⁻¹ P ( R⁻¹ + Qᵀ A⁻¹ P )⁻¹.

Setting A = λ I_D, P = Q = Φ, and R⁻¹ = I_M yields

w = Φ ( Φᵀ Φ + λ I_M )⁻¹ T.

One of the advantages of this rearrangement is that the costly matrix inversion can be performed either on the (D × D) matrix of the original expression or on the (M × M) matrix of the rearranged one. As an analog to the D-dimensional weight vector w, the M-dimensional vector α is defined as

α = ( Φᵀ Φ + λ I_M )⁻¹ T,    such that    w = Φ α.

Plugging this weight vector back into the linear model yields

T[n] = wᵀ φ(n) = αᵀ Φᵀ φ(n).

Another advantage of applying the push-through identity is that it allows for the so-called kernel trick, where the scalar product in feature space is replaced with a kernel function, φ(n_i)ᵀ φ(n_j) = k(n_i, n_j). After defining the kernel matrix K = Φᵀ Φ with elements K_ij = k(n_i, n_j) and a vector k(n) = Φᵀ φ(n) with elements k_j(n) = k(n_j, n), the final result can be written as

T[n] = ∑_j^M α_j k(n_j, n) = αᵀ k(n),    with    α = ( K + λ I_M )⁻¹ T.

As shown by Snyder et al., S1 the corresponding prediction for the functional derivative is given by

δT[n]/δn(x_g) ≈ (1/∆x) ∑_j^M α_j ∂k(n_j, n)/∂n^(g).

Note that the division by ∆x in the discretized functional derivative is necessary to eliminate the dependence on the number of grid points G.
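These expressions translate directly into a few lines of NumPy. The following sketch assumes a Gaussian kernel k(n_i, n_j) = exp(−‖n_i − n_j‖²/2σ²); array shapes and names are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2)) for rows of A and B."""
    d2 = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_fit(X, T, sigma, lam):
    """Solve alpha = (K + lambda I)^-1 T for training densities X (M, G) and energies T (M,)."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(T)), T)

def krr_energy(n, X, alpha, sigma):
    """T[n] = sum_j alpha_j k(n_j, n)."""
    return gaussian_kernel(n[None, :], X, sigma)[0] @ alpha

def krr_functional_derivative(n, X, alpha, sigma, dx):
    """delta T / delta n(x_g) ~ (1/dx) sum_j alpha_j dk(n_j, n)/dn^(g) for the Gaussian kernel."""
    k = gaussian_kernel(n[None, :], X, sigma)[0]
    grad = ((X - n) / sigma ** 2) * (alpha * k)[:, None]
    return grad.sum(axis=0) / dx
```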

Including Derivative Information
In a similar fashion, taking the derivative of the linear model with respect to the input density yields the prediction of the functional derivative in weight space,

(1/∆x) ∂T[n]/∂n^(g) = (1/∆x) ∑_d w_d ∂φ_d(n)/∂n^(g) = (1/∆x) wᵀ ∇_(g) φ(n).

The gth component of the gradient is denoted as

∇_(g) φ(n_j) = ∂φ(n_j)/∂n_j^(g),

where the parentheses are used to distinguish grid point indices from training example indices. Using this expression, the cost function can be extended to include the squared error of the functional derivative, weighted by an additional hyperparameter κ:

C(w) = ∑_j^M ( wᵀ φ(n_j) − T_j )² + κ ∑_j^M ∑_g^G ( wᵀ ∇_(g) φ(n_j) − [∇_(n_j) T_j]^(g) )² + λ wᵀ w,

where the ∇_(n_j) T_j are the gradients of the reference energies with respect to the discretized densities; divided by ∆x they are the reference vectors of the discretized functional derivative. The weights are determined by setting the derivative with respect to w_k equal to zero:

2 ∑_j^M ( wᵀ φ(n_j) − T_j ) φ_k(n_j) + 2 κ ∑_j^M ∑_g^G ( wᵀ ∇_(g) φ(n_j) − [∇_(n_j) T_j]^(g) ) ∇_(g) φ_k(n_j) + 2 λ w_k = 0.

Collecting all terms proportional to the weights on the left hand side yields

( ∑_j^M φ(n_j) φ(n_j)ᵀ + κ ∑_j^M ∑_g^G ∇_(g) φ(n_j) ∇_(g) φ(n_j)ᵀ + λ I_D ) w = ∑_j^M T_j φ(n_j) + κ ∑_j^M ∑_g^G [∇_(n_j) T_j]^(g) ∇_(g) φ(n_j).

By extending the matrices defined above, this can again be rewritten as a matrix equation,

( Φ_ext Λ Φ_extᵀ + λ I_D ) w = Φ_ext Λ T_ext,

with the extended transformation matrix of shape (D × M(1 + G)),

Φ_ext = ( φ(n_1), …, φ(n_M), ∇_(1) φ(n_1), …, ∇_(G) φ(n_M) ),

the extended target vector

T_ext = ( T_1, …, T_M, [∇_(n_1) T_1]^(1), …, [∇_(n_M) T_M]^(G) )ᵀ,

and a (M(1 + G) × M(1 + G)) diagonal matrix containing the scaling factor for the relative importance of derivative information,

Λ = diag( 1, …, 1, κ, …, κ ).

Solving this matrix equation for the weight vector w and applying the push-through identity with A = λ I_D, P = Q = Φ_ext, and R = Λ yields

w = Φ_ext ( Φ_extᵀ Φ_ext + λ Λ⁻¹ )⁻¹ T_ext.

The weights are again rewritten as

w = ∑_j^M α_j φ(n_j) + ∑_j^M ∑_g^G β_j^(g) ∇_(g) φ(n_j),

where the coefficients α and β are determined by solving the matrix equation

( K_ext + λ Λ⁻¹ ) ( α, β_1, …, β_M )ᵀ = T_ext.

Note that the individual β_j are vectors of length G. The extended kernel matrix K_ext is calculated by applying the kernel trick to the extended transformation matrix Φ_ext; its blocks are given by

(K_ext)_ij = k(n_i, n_j),   (K_ext)_{i,(j,g)} = ∂k(n_i, n_j)/∂n_j^(g),   (K_ext)_{(i,h),(j,g)} = ∂²k(n_i, n_j)/∂n_i^(h) ∂n_j^(g).

Using these new weights for the prediction of the kinetic energy of a previously unseen density n in the linear model yields

T[n] = ∑_j^M α_j k(n_j, n) + ∑_j^M ∑_g^G β_j^(g) ∂k(n_j, n)/∂n_j^(g).

The corresponding prediction of the derivative is given by

δT[n]/δn(x_h) ≈ (1/∆x) [ ∑_j^M α_j ∂k(n_j, n)/∂n^(h) + ∑_j^M ∑_g^G β_j^(g) ∂²k(n_j, n)/∂n_j^(g) ∂n^(h) ],

or, in the vector notation used in the main article,

δT[n]/δn ≈ (1/∆x) [ ∑_j^M α_j ∇_n k(n_j, n) + ∑_j^M ∑_g^G β_j^(g) ∇_n ∂k(n_j, n)/∂n_j^(g) ].
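For illustration, the extended equations can be assembled explicitly for a Gaussian kernel as sketched below. The block structure of K_ext follows the derivation above; note that the full matrix has dimension M(1 + G), so this direct construction is only feasible for small grids and is meant to clarify the structure rather than to be efficient.

```python
import numpy as np

def extended_krr_fit(X, T, dT, sigma, lam, kappa):
    """Solve (K_ext + lambda Lambda^-1) (alpha, beta) = T_ext for a Gaussian kernel.

    X: (M, G) training densities, T: (M,) energies,
    dT: (M, G) gradients of the energies with respect to the discretized densities
        (division by dx would convert them to functional derivatives)."""
    M, G = X.shape
    N = M * (1 + G)
    K_ext = np.empty((N, N))
    for i in range(M):
        for j in range(M):
            d = X[i] - X[j]
            k = np.exp(-(d @ d) / (2.0 * sigma ** 2))
            K_ext[i, j] = k                                            # k(n_i, n_j)
            K_ext[i, M + j * G:M + (j + 1) * G] = d / sigma ** 2 * k   # dk/dn_j
            K_ext[M + i * G:M + (i + 1) * G, j] = -d / sigma ** 2 * k  # dk/dn_i
            K_ext[M + i * G:M + (i + 1) * G, M + j * G:M + (j + 1) * G] = \
                (np.eye(G) / sigma ** 2 - np.outer(d, d) / sigma ** 4) * k
    targets = np.concatenate([T, dT.ravel()])
    lam_inv = np.concatenate([np.ones(M), np.full(M * G, 1.0 / kappa)])   # Lambda^-1
    coeff = np.linalg.solve(K_ext + lam * np.diag(lam_inv), targets)
    return coeff[:M], coeff[M:].reshape(M, G)                             # alpha, beta

def extended_krr_energy(n, X, alpha, beta, sigma):
    """T[n] = sum_j alpha_j k(n_j, n) + sum_jg beta_j^(g) dk(n_j, n)/dn_j^(g)."""
    T = 0.0
    for j in range(len(alpha)):
        d = n - X[j]
        k = np.exp(-(d @ d) / (2.0 * sigma ** 2))
        T += alpha[j] * k + (beta[j] @ (d / sigma ** 2)) * k
    return T

def extended_krr_gradient(n, X, alpha, beta, sigma):
    """Gradient of the prediction with respect to n (divide by dx for the functional derivative)."""
    grad = np.zeros_like(n)
    for j in range(len(alpha)):
        d = n - X[j]
        k = np.exp(-(d @ d) / (2.0 * sigma ** 2))
        grad += (-alpha[j] * d / sigma ** 2
                 + beta[j] / sigma ** 2
                 - (beta[j] @ d) * d / sigma ** 4) * k
    return grad
```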

KRR Hyperparameter Search
The hyperparameters for the KRR models are determined using 5-fold cross validation on a rectangular search grid. A total of 29 points, log-uniformly distributed between 10 and 500, are used for the length scale parameter σ, and 5 points, log-uniformly distributed between 10⁻¹⁰ and 10⁻¹⁴, for the regularization parameter λ. Both the mean absolute error for the kinetic energy and that for the functional derivative are evaluated using cross validation, and the hyperparameters are chosen such that the sum of these two errors is minimized. Note that this choice is biased toward the derivative score, as the mean absolute error on the derivative is typically significantly larger. Figure S5 shows that the rough grid search displays a trend toward smaller values for both σ and λ with increasing training set size M. Figure S6 shows the cross validation score for both the kinetic energy and the functional derivative for κ ∈ {0.1, 1, 10, 100, 1000}. The weighting parameter is set to κ = 1 for all of the hyperparameter investigations, since it is difficult to judge the relative importance of the two errors for the presented application.
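A possible implementation of this grid search, reusing the extended KRR sketch from Section 2, is outlined below; the data files and the exact definition of the two error measures are placeholders.

```python
import numpy as np
from itertools import product
from sklearn.model_selection import KFold

# X: (M, G) densities, T: (M,) energies, dT: (M, G) density gradients (placeholder file names)
X = np.load("train_densities.npy")
T = np.load("train_energies.npy")
dT = np.load("train_gradients.npy")

sigmas = np.geomspace(10.0, 500.0, 29)      # 29 log-uniform points for the length scale
lambdas = np.geomspace(1e-14, 1e-10, 5)     # 5 log-uniform points for the regularization
kappa = 1.0

best = (np.inf, None, None)
for sigma, lam in product(sigmas, lambdas):
    fold_scores = []
    for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        alpha, beta = extended_krr_fit(X[tr], T[tr], dT[tr], sigma, lam, kappa)
        mae_T = np.mean([abs(extended_krr_energy(x, X[tr], alpha, beta, sigma) - t)
                         for x, t in zip(X[va], T[va])])
        mae_dT = np.mean([np.mean(np.abs(extended_krr_gradient(x, X[tr], alpha, beta, sigma) - g))
                          for x, g in zip(X[va], dT[va])])
        fold_scores.append(mae_T + mae_dT)     # hyperparameters chosen on the summed error
    score = np.mean(fold_scores)
    if score < best[0]:
        best = (score, sigma, lam)
print("best (sigma, lambda):", best[1:])
```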

Neural Network Hyperparameters and Training Behavior
This section summarizes all the hyperparameters used during the training of the convolutional neural network models for the sake of reproducibility, but without discussing their influence on the actual training behavior or the final model performance.
The neural network models are trained by minimizing a cost function that, analogous to the extended KRR approach, combines the squared errors of the kinetic energy and of the functional derivative, with the derivative contribution weighted by κ. Most choices for the hyperparameters are shared by all of the models; the remaining hyperparameters are listed in Table S4. Since the relative weighting of the contributions in the cost function is overdetermined by four scaling factors, the weighting parameter κ is set to a fixed value of κ = 1.
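A minimal sketch of how such a cost function can be implemented in tensorflow is shown below, using a nested GradientTape to obtain the network gradient with respect to the input density. The architecture, layer sizes, and learning rate are illustrative placeholders and do not correspond to the models of Table S4.

```python
import tensorflow as tf

def make_model(G=500):
    # Toy 1D CNN mapping a density on G grid points to a scalar kinetic energy.
    inputs = tf.keras.Input(shape=(G, 1))
    x = tf.keras.layers.Conv1D(16, 5, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv1D(16, 5, padding="same", activation="relu")(x)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    return tf.keras.Model(inputs, tf.keras.layers.Dense(1)(x))

model = make_model()
optimizer = tf.keras.optimizers.Adam(1e-3)
kappa = 1.0   # relative weight of the derivative error in the cost function

@tf.function
def train_step(n, T_ref, dT_ref):
    # n: (batch, G, 1) densities, T_ref: (batch, 1) energies,
    # dT_ref: (batch, G, 1) reference gradients with respect to the discretized density.
    with tf.GradientTape() as outer:
        with tf.GradientTape() as inner:
            inner.watch(n)
            T_pred = model(n, training=True)
        dT_pred = inner.gradient(T_pred, n)       # network gradient with respect to the input
        loss = tf.reduce_mean((T_pred - T_ref) ** 2) \
             + kappa * tf.reduce_mean((dT_pred - dT_ref) ** 2)
    grads = outer.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```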
The network parameters are determined with the Adam optimizer. S6 Figure S10 shows the mean absolute error for the functional derivative over the course of the training. Comparing Figures S9 and S10 shows that the early stages of the training are dominated by a reduction in the derivative error. In part due to the decreasing learning rate, the derivative error stops improving roughly at the half-way point of the training process, while the kinetic energy error keeps decreasing.

Finding Minimum Energy Densities

Iteration on the Density and Local Principal Component Analysis
The direct minimization algorithm used to find the minimum energy density is a standard steepest descent optimization using a Lagrange multiplier µ to ensure the correct number of particles N.
The corresponding Lagrange function is given by

L[n] = E[n] − µ ( ∫ n(x) dx − N ),

where the total energy functional E[n] for noninteracting particles is simply the sum of the kinetic energy T[n] and the potential energy,

E[n] = T[n] + ∫ v(x) n(x) dx.

The steepest descent update rule with step size η is given by

n_(i+1)(x) = n_i(x) − η ( δE[n_i]/δn(x) − µ ),

where the Lagrange multiplier µ is chosen such that the norm ∫ n_(i+1) dx = N is ensured. Assuming that n_i is properly normalized, this is equivalent to

µ ∫ dx = ∫ δE[n_i]/δn(x) dx.

The introduction of a projection operator P for the functional derivative, as suggested by Snyder et al., S1

n_(i+1)(x) = n_i(x) − η P[ δE[n_i]/δn − µ ](x),

yields a similar result for the Lagrange multiplier µ:

µ ∫ P[1](x) dx = ∫ P[ δE[n_i]/δn ](x) dx.

Note that the local principal component analysis (PCA) of Ref. S1 uses the differences between the density at the current step n_i and a subset of the training densities as basis for the projection operator. Since the integral over the difference of two valid densities is zero, the same is true for all functions projected onto this subspace. The correct norm of the density n_(i+1) is therefore ensured without the need for a Lagrange multiplier.
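A compact sketch of this projected steepest descent scheme is given below. The construction of the local PCA basis (number of neighbors and principal components) follows the idea of Ref. S1, but the specific values are illustrative.

```python
import numpy as np

def local_pca_projector(n, train_densities, n_neighbors=30, n_components=4):
    """Projection onto the leading principal components of the differences between the
    current density and its nearest training densities (illustrative parameter choices)."""
    dists = np.linalg.norm(train_densities - n, axis=1)
    nearest = train_densities[np.argsort(dists)[:n_neighbors]]
    diffs = nearest - n
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    basis = vt[:n_components]                       # orthonormal rows spanning the PCA subspace
    return lambda f: basis.T @ (basis @ f)

def minimize_density(n0, dE_dn, train_densities, eta=1e-3, steps=500):
    """Projected steepest descent; dE_dn(n) returns the discretized delta E / delta n."""
    n = n0.copy()
    for _ in range(steps):
        project = local_pca_projector(n, train_densities)
        # every basis vector is a combination of differences of normalized densities and thus
        # integrates to zero, so the projected step conserves the particle number without
        # an explicit Lagrange multiplier
        n = n - eta * project(dE_dn(n))
    return n
```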

Iteration on the Variable ϕ
In Section 3.3 and Section 3.4 of the main article a basis of sine functions is used to restrict the search space instead of the local PCA. This necessitates additionally enforcing a nonnegativity constraint, which is typically achieved by iterating on the variable ϕ = √n instead of the density n. The steepest descent update rule for ϕ is given by

ϕ_(i+1)(x) = ϕ_i(x) − η ( δE[ϕ_i]/δϕ(x) − 2 µ ϕ_i(x) ),

where δE[ϕ]/δϕ(x) = 2 ϕ(x) δE[n]/δn(x) by the chain rule, and the Lagrange multiplier µ is again chosen such that the norm ∫ n_(i+1) dx = ∫ ϕ_(i+1)² dx = N is ensured:

∫ [ ϕ_i − η ( δE[ϕ_i]/δϕ − 2 µ ϕ_i ) ]² dx = N.

Assuming that ϕ_i is properly normalized, ∫ ϕ_i² dx = ∫ ϕ_(i+1)² dx = N, this simplifies to

2 η ∫ ϕ_i ( δE[ϕ_i]/δϕ − 2 µ ϕ_i ) dx = η² ∫ ( δE[ϕ_i]/δϕ − 2 µ ϕ_i )² dx.

Since the step size η is an arbitrary positive number, the integrals on both sides have to be equal to zero, which yields

µ = ( 1 / 2N ) ∫ ϕ_i δE[ϕ_i]/δϕ dx.

Inspired by the local PCA method, the basis of sine functions is introduced via a projection operator acting on the functional derivative:

ϕ_(i+1)(x) = ϕ_i(x) − η P[ δE[ϕ_i]/δϕ − 2 µ ϕ_i ](x).

The norm of n_(i+1) is then given by

∫ n_(i+1) dx = ∫ [ ϕ_i − η P[ δE[ϕ_i]/δϕ − 2 µ ϕ_i ] ]² dx.

Again assuming that ∫ ϕ_i² dx = ∫ ϕ_(i+1)² dx = N yields

2 η ∫ ϕ_i P[ δE[ϕ_i]/δϕ − 2 µ ϕ_i ] dx = η² ∫ ( P[ δE[ϕ_i]/δϕ − 2 µ ϕ_i ] )² dx.

Using the same arguments as before and the fact that P is a linear operator, the multiplier µ can be calculated via

µ = ∫ ϕ_i P[ δE[ϕ_i]/δϕ ] dx / ( 2 ∫ ϕ_i P[ϕ_i] dx ).
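The corresponding iteration on ϕ with the sine-basis projection can be sketched as follows, assuming a box of unit length with hard walls; the number of basis functions, the step size, and the explicit renormalization guard are illustrative and not prescribed by the scheme above.

```python
import numpy as np

def sine_projector(x, n_basis=20):
    """Projection onto the first n_basis orthonormal sine functions on the unit box."""
    basis = np.array([np.sqrt(2.0) * np.sin((m + 1) * np.pi * x) for m in range(n_basis)])
    dx = x[1] - x[0]
    return lambda f: basis.T @ (basis @ f) * dx

def minimize_phi(phi0, dE_dphi, x, N=1, eta=1e-3, steps=500):
    """Steepest descent on phi = sqrt(n); dE_dphi(phi) returns delta E / delta phi
    = 2 * phi * (delta E / delta n) on the grid."""
    dx = x[1] - x[0]
    project = sine_projector(x)
    phi = phi0.copy()
    for _ in range(steps):
        g = project(dE_dphi(phi))
        p_phi = project(phi)
        mu = np.sum(phi * g) / (2.0 * np.sum(phi * p_phi))   # Lagrange multiplier (dx cancels)
        phi = phi - eta * (g - 2.0 * mu * p_phi)
        phi *= np.sqrt(N / (np.sum(phi ** 2) * dx))          # optional renormalization guard
    return phi ** 2                                           # return the density n = phi^2
```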

Kernel Ridge Regression with Offset
The main problem of the iterative calculation of minimum energy densities is that the search algorithm is likely to leave the valid region, i.e. the region where the machine learning model offers correct predictions. Snyder et al. S1 solved this by restricting the search space using a projection onto the subspace spanned by the training densities. Denzel and Kästner S8 suggested a different solution for a similar problem in machine learning accelerated molecular geometry optimization: the extrapolation behavior of kernel based machine learning models can be tuned by introducing a constant offset b into the kernel expansion, so that the prediction approaches b far away from the training data. The corresponding results are shown in Figure S11 and summarized in Table S5. All presented models use the extended KRR formalism and κ = 1, λ = 10⁻¹² for the remaining hyperparameters. Table S5 shows that the largest value for the length scale parameter, σ = 10.0, yields the best accuracy for both the kinetic energy and the functional derivative. The value of the offset parameter b has a smaller influence on the model performance than the length scale parameter. The results for the iteratively found densities are summarized in Table S6. All three models with length scale σ = 10.0 show a large number of minimization runs that leave the region spanned by the training examples, leading to extremely large mean absolute error values. While the results for σ = 2.5 suggest a better behavior during the minimization (at least for b = max(T_j) + 5 and b = max(T_j) + 10), the final errors achieved by these models are clearly limited by the poor model performance shown in Table S5. A length scale of σ = 5.0 therefore represents a balanced tradeoff between model error and the ability to restrict the search space. In fact, the performance for b = max(T_j) + 10 and σ = 5.0 is comparable to the results achieved using the principal component analysis presented in the main article. In general, however, this method is not practical for large scale applications, as it is difficult to determine during training a set of hyperparameters that will lead to models suitable for use in iterative minimizations. This problem does not arise in the original application of this approach to geometry optimization tasks, since there the training set is constantly extended by additional ab initio calculations during the minimization.

Figure S12 shows the effect of the local PCA projection on the functional derivative predictions of the various models for a sample potential. The integrated absolute error values for the projected functional derivatives are summarized in Table S7. Note that both the exact and the machine learning predicted functional derivatives are projected.
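Returning to the offset approach introduced at the beginning of this section, the following sketch shows the idea for plain KRR (the article combines the offset with the extended, derivative-trained formalism): the model is fit to the residuals T_j − b, so that far from the training data, where the Gaussian kernel values vanish, the prediction approaches the offset b. Choosing b above the largest training energy then makes the extrapolation region energetically unfavorable during the minimization.

```python
import numpy as np

def offset_krr_fit(X, T, sigma, lam, offset):
    """Fit alpha = (K + lambda I)^-1 (T - b) so that T(n) = b + sum_j alpha_j k(n_j, n)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.linalg.solve(K + lam * np.eye(len(T)), T - offset)

def offset_krr_predict(n, X, alpha, sigma, offset):
    """Prediction tends to the offset b wherever all kernel values are negligible."""
    k = np.exp(-np.sum((X - n) ** 2, axis=1) / (2.0 * sigma ** 2))
    return offset + k @ alpha

# e.g. offset = np.max(T_train) + 10.0 (in the article's energy units) and sigma = 5.0
```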
Learning Curve

Figure S13 shows the learning curve, that is, the accuracy of the models as a function of the number of training examples. We use the hyperparameters given by Snyder et al. for the standard KRR method. A description of the method for optimizing the extended KRR hyperparameters as well as a detailed analysis of the results is given in Section 3. The hyperparameters for both KRR models are summarized in Table S8. For all of the neural network training runs the same hyperparameters as presented in Section 4 are used, and the batch size is reduced to the number of training examples M. Figure S13 shows a general trend of decreasing error with increasing number of training examples. Once again the advantage of including derivative information becomes apparent, as all three models trained on derivatives achieve chemical accuracy already for the smallest training set size M = 40.
Note the slight increase in the kinetic energy error for the ResNet when the number of training examples M is increased from 60 to 80. This variation in performance is to be expected due to the random initialization of the weights at the beginning of the training process. A detailed comparison of neural network learning curves could therefore only be made after averaging the results over several training runs.

Computational Timing

Figure S14 shows the computational time required by the various models for evaluating a single density of the test set. All times are measured by averaging over 100 runs on a workstation equipped with an Intel i7-920 and without GPU acceleration for tensorflow. Note that the individual evaluation times could be improved significantly by further optimization of the computational implementation and more specialized hardware. Nevertheless, this graph clearly shows that the computational effort for the kernel based methods increases with the number of training examples, while the neural network models maintain a constant evaluation time.