SPAHM(a,b): Encoding the Density Information from Guess Hamiltonian in Quantum Machine Learning Representations

Recently, we introduced a class of molecular representations for kernel-based regression methods—the spectrum of approximated Hamiltonian matrices (SPAHM)—that takes advantage of lightweight one-electron Hamiltonians traditionally used as a self-consistent field initial guess. The original SPAHM variant is built from occupied-orbital energies (i.e., eigenvalues) and naturally contains all of the information about nuclear charges, atomic positions, and symmetry requirements. Its advantages were demonstrated on data sets featuring a wide variation of charge and spin, for which traditional structure-based representations commonly fail. SPAHM(a,b), as introduced here, expand the eigenvalue SPAHM into local and transferable representations. They rely upon one-electron density matrices to build fingerprints from atomic and bond density overlap contributions inspired from preceding state-of-the-art representations. The performance and efficiency of SPAHM(a,b) is assessed on the predictions for data sets of prototypical organic molecules (QM7) of different charges and azoheteroarene dyes in an excited state. Overall, both SPAHM(a) and SPAHM(b) outperform state-of-the-art representations on difficult prediction tasks such as the atomic properties of charged open-shell species and of π-conjugated systems.


Introduction
Physics-based machine learning representations, also known as representations for quantum machine learning (QML), [1][2][3][4][5] are rooted in the fundamental principle that all the (static) information about a neutral chemical system is uniquely encoded into the system-specific parameters that fix the electronic Schrödinger equation: nuclear charges {Z I } and positions {R I }.Owing to their physical origins, these representations are highly general and have a deep connection to quantum-chemical targets.Hence, they have been broadly exploited to supply fast and accurate predictions of a myriad of atomistic chemical properties.
Each of these categories of representations have led to impressive performances for the predictions of both prototypical and complex molecular or material properties 30 such as atomization energies, 23 multipole moments, 31 polarizabilities, 17,32 HOMO-LUMO gaps, 33,34 molecular forces, [35][36][37] potential energy surfaces, 13,38,39 electron densities, [40][41][42][43] density functionals, 44 and many-body wavefunctions. 45Yet, since such representations are functions of {Z I } and {R I } only, achieving the same level of accuracy for chemical targets inherently dependent upon changes in electron delocalization, spin, or charge remains a challenge and additional electronic information (i.e., the Hamiltonian) is needed.An alternative approach consists in adding one more layer between the geometry and the representation and complementing the latter with some quantum-chemical information computed from the former.Illustrative examples include OrbNet, 46,47 which uses quantum-mechanical operators obtained from a converged semiempirical computation as input features for a neural network, as well as methodologies such as EHML-ML 48 and DFTB-ML 49 aiming at refining the parameters characteristic of semiempirical methods (e.g., Hückel theory and DFTB) to achieve higherlevel accuracy.Alternative models like EPNN ?propose a heuristic neural-network-based partitioning scheme to provide fast and reliable quantum-like atomic charges as input for predictive models.AIMNet 50 with the neural spin-charge equilibration unit 51 takes {R I }, {Z I }, and total molecular charge and spin multiplicity to learn a state-specific representation with a message-passing neural network.More computationally demanding alternatives consist in featurizing components of fully converged Hartree-Fock-level matrices, operators, densities, or determinants, as in DeePHF, 52 DeePKS, 53 MO-ML, [54][55][56] the orbital-based FJK representation, 57 and the kernel density functional approximation 58 (KDFA).Also relevant to this category is the recent introduction 59 of Coulomb lists and smooth overlap of electron densities that bridge geometry-based descriptors with electronic structure theory.The recently introduced matrix of orthogonalized atomic orbital coefficients proposes a compact although more expensive representation derived from an orbital localization scheme. 60th the same purpose of encoding valuable electronic information, we recently introduced the spectrum of approximated Hamiltonian matrices (SPA H M) representation fam-ily, 61 which has the advantage of avoiding the self-consistent field (SCF) procedure.Specifically, the eigenvalue SPA H M (ε-SPA H M) is a compact global representation consisting of occupied-orbital eigenvalues extracted from lightweight one-electron Hamiltonians traditionally used as an SCF initial guess in molecular quantum chemistry codes.
Owing to a seamless generalization to open-shell systems, ε-SPA H M performs well on datasets characterized by a wide variation of charge and spin, for which the traditional structure-based representations commonly fail.However, it suffers from some limitations: i) its global nature limits transferability, 62 ii) it only exploits eigenvalues, despite the availability of additional information (e.g., the eigenvectors and associated electron densities), and iii) comparing the orbital energies of compounds having different size and composition lacks physical sense.
To address such limitations, in this work we expand SPA H M and build two types of representations exploiting the electron density extracted from the same approximated Hamiltonians.We then bridge the conceptual advantages of both SOAP 16 and atomic version of SLATM 26 (aSLATM) to obtain atomic-density overlap fingerprints, SPA H M(a), or bonddensity based representation, SPA H M(b).
The predictive power of SPA H M(a,b) is demonstrated on local (atomic) properties such as atomic partial charges, spin densities, and isotropic magnetic shielding on the QM7 dataset. 23,63We then show the excellent performance of the models on datasets made of a mix of neutral and radical cationic organic molecules and of radical cations of push-pull azoheteroarene-based photoswitches.These results importantly highlight the possibility of achieving fast and efficient predictions of chemical properties sensitive to the electronic structure (e.g., charge carrier organic materials or transition-metal-catalyzed reaction steps).

Theory
This section provides a concise description of the proposed atom-based SPA H M(a) and bondbased SPA H M(b) models introduced in this work.The general workflow used to generate the representation is sketched on Fig. 1, and detailed derivations are shown in Sec.S1 of the Supplementary Material.
x atom Ci : Guess density (a)

SPA H M(a)
This work extends our former representation built from the eigenvalues of lightweight model Hamiltonians.To achieve locality and transferability, the new representations focus on the eigenvectors of those Hamiltonians.However, to avoid dealing with permutational invariance, instead of the eigenvectors our extension is based on the electron density ρ(r) or more specifically the pre-processed density matrix D (Fig. 1a).
Local representations are designed to encode information about each atom within a molecule into a vector.[76] After performing Löwdin orthogonalization 66 of atomic orbitals, we obtain a separate density matrix attributed to each atom (Fig. 1b, left).We then proceed with the density fitting and can take the coefficients c I(I) of the functions centered on the atom of interest (Fig. 1c).
The density fitting step allows to take into account the contribution of atoms J ̸ = I to the ρ I (r), using the coefficients c J(I) of decomposition of ρ I (r) centered on nucleus J, thus implicitly including bonding information.The detailed description and comparison with other atomic partitioning schemes is provided in Sec.S4.Note that in order to include bonding information explicitly, the proposed approach can be generalized to obtain density matrices attributed to each bond (Sec.2.2).
The vector of coefficients c I is not rotationally-invariant, and hence cannot be directly exploited as a representation.The next step corresponds to construction of a symmetryadapted vector v I (Fig. 1d).
Inspired by the SOAP kernel, 16 we compute the similarity between two atoms A and B as overlap of ρ A (r) and ρ B (r).To ensure rotational invariance, the overlap is integrated over all possible rotations in 3D space, (To obtain the overlap, the atoms A and B are virtually put at the same point of space.)Note that this expression can be generalized to ensure rotational equivariance to learn higher-order tensorial properties in spirit of λ-SOAP. 17 Each atomic density ρ I (r) is expressed as a sum of terms, centered on the nucleus I, hence the overlap kernel can be written as a scalar product of two vectors, K overlap

A,B
= v ⊺ A v B (see Sec. S1 A for a detailed derivation of the expression for v I ).We disregard the overlap kernel and use v I as a representation vector of an atom.It provides the following advantages: i) kernel computation is significantly simplified; ii) an atomic-density representation can be combined with other vectors; iii) the representations can be used with any other kernel function, such as widely-used Laplacian and Gaussian kernels.
Another way to obtain a symmetry-adapted vector from c I , reported in the context of ML density functionals, 58 is to use sum of squares of density-fitting coefficients for each shell.
Comparison with our representation is provided in Sec.S8.
The last step is to construct an atomic representation x I from the symmetry-adapted vectors v J(I) .We regroup all the vectors according to the charge of the nucleus J into "bags" of element types, inspired by the construction of aSLATM. 26Finally, we sum up the features in each bag to form the final vector.This procedure is illustrated on Fig. 1e (top).

SPA H M(b)
As discussed in the previous section, the bonding information is included only implicitly into SPA H M(a). A complementary approach consists of building an explicit representation for a bond IJ by extracting the corresponding density matrix and the density ρ IJ (r) with the Löwdin formalism 66 (Fig. 1b, right).
Using the standard density fitting approach, ρ IJ could be expressed as a sum of terms centered on I and J, but this would preclude rewriting the kernel as a scalar product and then extracting a representation vector.For this reason, we instead decompose ρ IJ (r) onto a basis set centered in the middle of the IJ bond (Fig. 1c).
Even though most of the information on the bond-density close to nuclei is lost during this procedure, the behavior in the midbond region is well captured.Bond-centered bases are often used to extend atomic-orbital bases for obtaining accurate interatomic potentials, 77,78 but not for density fitting.We thus optimized the basis for each bond present in the datasets studied (involving elements H, C, N, O, F, S, see Sec. 5).The basis set construction is described in Sec.S6 A.
Comparison of two bonds AB and CD involves aligning them along the z-axis and superimposing their geometrical centers.The similarity is then computed as an overlap of ρ AB (r) and ρ CD (r), integrated over the rotation around the z-axis (Fig. 1d), (See Sec.S1 B for a detailed derivation.) Simplifications to reduce both the time needed to compute the vector v IJ and its size are possible.For the fitting one can, for instance, use only basis functions with magnetic quantum number m = 0 to drop the integration over rotation around the z-axis, or even leave only a single s-or p-orbital.Sec.S6 B illustrates how these simplifications provide a useful compromise for certain datasets.
With the bond-representation vectors {v IJ } at hand, the similarity can be computed between two bonds.While this could be readily used to train bond-property models (e.g., bond dipole moments, dissociation energies) this work focuses on atomic properties requiring one additional step to use the bond vectors and construct an atomic representation.
As for the atom-density representation (Sec.2.1), we chose an aSLATM 26 -inspired "bagging" procedure (Fig. 1e, bottom).For each atom A i , all the vectors v A i B j are grouped according to the element B and summed up prior to concatenation.Here, the difference between the bagging of SPA H M(a) and SPA H M(b) is that the former is sorted according to unique elements (one-body terms in the language of SLATM) and the latter -according to pairs of unique elements (two-body terms).This difference illustrates the complementary focus of the two new variations of SPA H M to convert the information from lightweight Hamiltonians into local atomic and bond environments.Learning curves for different datasets on the exemplary task of predicting local properties of nitrogen atoms: (a) atomic charges and (b) isotropic magnetic shielding constants for QM7 and atomic charges for (c) radical cations of 3600 QM7 molecules (QM7/2-RC) and (d) mix of 3600 QM7 molecules and 3600 radical cations (QM7/2+QM7/2-RC).The QM7/2+QM7/2-RC aSLATM curve is missing since aSLATM is not injective and therefore inappropriate for this dataset.

Classic benchmark dataset: QM7
We assess the learning ability of SPA H M(a,b) by predicting two distinct local atomic properties -atomic charges and isotropic magnetic shielding constants -computed for the QM7 database. 23For each element (H, C, N, O, S) and property, a separate kernel ridge regression (KRR) model is trained using its own hyperparameters (see Sec. 5).Each set was randomly divided into a training and test set (80%-20% split).
For each molecule, the LB 79 guess Hamiltonian paired with a minimal basis set 80 is diagonalized to obtain the atomic SPA H M(a,b) representations following the procedure described in Sec. 2 and Fig. 1.The LB guess was chosen owing to its best performance for the eigenvalue-based SPA H M (ε-SPA H M). 61 Comparison with the Hückel 81,82 and PBE0 83 Hamiltonians are provided in Sec.S7.Briefly, there is a correlation between the quality of the initial guess and the performance of the representation, which opens the way to improving SPA H M(a,b) through modifying the underlying guess Hamiltonian.
The learning curves of SPA H M(a) and SPA H M(b) for nitrogen atomic charges are shown in Fig. 2a with comparisons with those of aSLATM 26 (learning curves for other elements and properties are reported in Sec.S2 A).SPA H M(a) errors are comparable with those of aSLATM with no clear systematic trend across all the distinct elements (see Sec. S2 A).
The generally good performance of SPA H M(a) arises from its well-suited atomic-density fingerprints, which encode similar information to atomic charges.Interestingly, the somewhat more sophisticated bond-variant SPA H M(b) performs worse than SPA H M(a), implying that the bonding information is less relevant for this task.This contrasts with the predictions of isotropic shielding constants (Fig. 2b) for which SPA H M(b) is systematically superior to SPA H M(a) owing to its dependence on the presence of multiple bonds and π-conjugation, which are better captured by the bond density-based model.Yet, for this property neither SPA H M(a,b) outperform aSLATM.Specifically, for the hydrogen atom (Sec.S2 A), most frequently analyzed in NMR studies of organic compounds, the SPA H M(b) error is ∼ 1.5 times higher than the aSLATM one.This result is however not surprising as it was previously demonstrated with ε-SPA H M that the strength of the approach lies in capturing the properties of datasets covering a broad range of chemical compositions and electronic structures featuring a variety of charges and spins. 61We thus train the model on two databases containing more electronically-diverse species.The first one is made of radical cations of 3600 structures randomly selected from QM7 (QM7/2-RC), and the second is a mixture of these 3600 neutral molecules and 3600 radical cations (QM7/2+QM7/2-RC).The learning curves for nitrogen atomic charge are shown on Fig. 2c and Fig. 2d, respectively.For the QM7/2-RC dataset, we also predict atomic spin densities (the complete set of learning curves available in Sec.S2 A).The SPA H M(a,b) vectors for open-shell molecules are built from concatenation of vectors obtained for α and β densities (see Sec. S5 for more details).Fig. 2c illustrates the improved performance of SPA H M(a,b) with respect to aSLATM, for the difficult task of learning atomic charges of charged species.Thanks to its rooting in the electron density, SPA H M is able to capture local changes in the electronic structure.For QM7/2+QM7/2-RC (Fig. 2d), aSLATM cannot be used as it yields the same representation vector for a neutral molecule and its radical cation.On the other hand, SPA H M(a,b) seamlessly include the electronic information.The overall prediction errors are approximately averaged errors for the QM7 and QM7/2-RC datasets.However, SPA H M(b) is always worse than SPA H M(a) for both charges and spins for the same reason as discussed above.

Tunable push-pull azoheteroarene-based dyes
To assess the performance of SPA H M(a,b) beyond prototypical molecular examples, we consider a combinatorial database of push-pull azoheteroarene-based photoswitches 84 (APS), containing 3429 molecules.While this database was originally designed to analyze the tunability of excited states for this class of dyes, we first investigate their hole-carrier properties and train predictive models for the atomic charges and spins of radical cations (APS-RC).look at the chemical composition of the two sets.QM7 consists of organic molecules with seven or less heavy atoms.While it contains a large amount of structures with multiple and/or π-conjugated bonds, there are only a few aromatic molecules and thus a restricted number of fragments promoting extensive π-electron delocalization.In contrast, the APS dyes are built from 2-4 donor-acceptor aromatic groups interacting through the azo moieties, forming a fairly long π-conjugated scaffold prone to high charge and spin delocalization.The bond-centered representation, which relies upon basis functions with components spatially orthogonal to the bond, is suited to capture these electronic changes.A deeper analysis of an individual molecule is provided in Sec.S3 A.

Predicting excited-state properties
Next, we challenged the representations with the APS dataset, considering a productive π → π * excited state.In line with the original work, 84 we focused on learning atomic contributions to the hole and particle densities (computed in the same way as Hirshfeld charges).SPA H M is computed from a ground-state initial guess, thus it cannot be expected to predict excited-state properties well.Since the targets are atomic hole and particle contri-butions, a reasonable approach is to use radical cation and anion densities, respectively, as a starting point.Another choice would be to compute SPAHM from the guess HOMO and LUMO densities, but it is not assessed here.The prediction errors for neutral, cation, and anion SPA H M(a,b) and aSLATM for nitrogen are shown in Fig. 4 (see Fig. S3 and Fig. S4 of Sec.S2 B for other elements).
Among the three SPA H M density sources, the anion one is systematically the best for the particle contributions, and the cation one -for the hole contributions, while, as expected, the neutral one is usually the worst.
Only the anion-SPA H M contains information on the LUMO, which explains its good performance for the particle contributions (particle density consists of unoccupied orbitals) and low performance for the hole contributions (hole density consists of occupied orbitals thus the LUMO information just adds extra noise).
The better performance of cation-SPA H M in the case of hole contributions could be explained differently.For open-shell systems, SPA H M(a,b) consist of two concatenated vectors constructed from the α-and β-densities.Thus the full vector implicitly contains the information on the HOMO orbital density which is removed from the β-vector with respect to α.
In total, for the productive π → π * state of the azo-photoswitches we found the cation-SPA H M(b) and anion-SPA H M(b) to be the best for the hole and particle properties, respectively, and expect the same trend for similar excited states.For all the elements in the dataset this approach outperforms aSLATM, which proves that the SPA H M family can be useful also for excited-state properties.

Out-of-sample prediction
We also predicted hole and particle contributions for an out-of-sample molecule, for which a graphical representation is shown in Fig. 5, and the numerical values are provided in Sec.S3 B. This structure is one of several that were excluded from training in the original The predicted values show a picture typical for this excitation: 84,85 the particle density is mostly localized on the azo group (which makes it much easier to learn than the hole density).Conversely, the hole density is delocalized all over the π-system, its asymmetry showing the push-pull character of the excitation.

Efficiency
To complete this work, we evaluate the efficiency of our models compared to aSLATM, for both the feature vectors generation and kernel computation.Since the training of KRR models consists of the kernel matrix inversion, which does not explicitly depend on the representation type, the inversion times are not included.
We randomly selected a subset of 1000 molecules from the QM7 (sub-QM7) and APS (sub-APS, fluorine-containing molecules excluded) databases and generated the SPA H M(a), SPA H M(b), and aSLATM representations.The user times are reported in Table 1.We extend the analysis by computing the full Laplacian kernel matrices from these repre-sentations (for each element separately), the user times are reported in This can be advantageous for molecular dynamics in active learning setups. 86We also note that in all cases the prediction time is negligible with respect to TDDFT computations, We note that SPA H M(a) and SPA H M(b) encode the electronic information while retaining compactness with feature vectors about 4-and 9-fold smaller than aSLATM, respectively.
In particular, the size of the representations does not depend on the molecular sizes in the dataset (i.e., the system size) but rather on the number of unique elements contained in it.
Detailed analysis of the efficiency of the models reveals that this constitutes a significant advantage for the kernel construction.
Overall, the proposed representations afford a transferable (local) and efficient alternative for quantum machine learning in the prediction of various electronic-state properties.We also expect the new SPA H M variants to provide a powerful and chemically intuitive framework for the prediction of properties of chemical reactions, which require a bond-focus 87 as found in SPA H M(b), and for the description of molecular properties for which geometrical structures do not inherently coincide with electronic structures (e.g., organic electronic materials).

Methods
The codes used in this paper are available on a dedicated GitHub repository at https:// github.com/lcmd-epfl/SPAHM-RHOand on Q-stack, a broader package for custom quantumchemical routines to promote quantum machine learning, at https://github.com/lcmd-epfl/Q-stack.
The initial guess densities were obtained in a minimal basis (MINAO 80 ) using the LBm potential. 61,79(Comparison with the Hückel 81,82 and PBE0 83 potentials is provided in Sec.S7.) To construct the SPA H M(a) representation, the cc-pVDZ/JKFIT 88,89 atom-centered density fitting basis was used.To construct the SPA H M(b) representation, a bond-centered density fitting basis was optimized, the procedure is described in Sec.S6 A. The QML 90 and TENSOAP (SOAPFAST) 91 packages were used to construct the aSLATM 26 and SOAP The atomic charges and spins and/or hole and particle contributions were computed using dominant Hirshfeld partitioning 67 at the PBE0 83 /cc-pVQZ 92,93 level for the QM7 23 and QM7/2-RC datasets and at the ωB97X-D 94 /def2-SVP 95 level for the APS 84 and APS-RC datasets.The excited-state properties were computed with TDDFT within the Tamm-Dancoff approximation. 96The isotropic shielding constants were computed at the PBE Let us consider two atoms, A and B. Each atomic density ρ I (r) is represented as a linear combination of atomcentered spherical Gaussian basis functions {ϕ nℓm }, labeled with their radial channel number n and angular ℓ and magnetic m quantum numbers, and each nucleus is virtually positioned at the origin.The overlap kernel K overlap

A,B
between atoms A and B is the squared overlap of ρ A and ρ B averaged over all possible relative orientations R, For a given orientation, the overlap k AB ( R) is where D are Wigner D-matrices for real spherical harmonics S1 and A ℓ nn ′ = ⟨ϕ nℓm |ϕ n ′ ℓm ⟩ ∀m.

The kernel becomes
Thanks to orthogonality of the real Wigner D-matrices, S1 i.e., the kernel is further simplified to With p = (n 1 , n 2 , ℓ), q = (n ′ 1 , n ′ 2 , ℓ) it can be rewritten as a dot product where v I is the representation of an atomic electron density ρ I (r) and is an analog of the power spectrum of atomic neighbor density.Now let us consider two bonds, AB and XY .The (Löwdin) bond densities ρ AB (r) and ρ XY (r) are decomposed onto basis sets centered in the middle of each bond, where a function ϕ i is defined by a radial channel number n i and angular ℓ i and magnetic m i quantum numbers.Both bonds are aligned along the z-axis and their midpoints are put at the origin.The overlap kernel K overlap AB,XY = I 1 between the two bonds AB and XY is defined as a overlap integral I 2 (φ) squared averaged over rotations φz around the z-axis, With the decomposition (S8), the overlap integral I 2 (φ) is rewritten with overlap of the basis functions, as well as the kernel I 1 , With the rules for rotation of real spherical harmonics around the quantization axis, the overlap I 3 (φ) becomes where ϕ ȷ is the same basis function as ϕ j but with an opposite phase (i.e.m j = −m ȷ).The integral over rotation I 6 is simplified to and the overlap kernel I 1 -to When p and q are centered at the same point, S pq = δ ℓpℓq δ mpmq A ℓp npnq .Thus I 1 is further simplified to ) which can be rewritten as a dot product in the same spirit as the atom-density kernel, where v IJ is the representation of a bond density ρ IJ (r).

S2. LEARNING CURVES
A. QM7 and its derivatives

S3. OUT-OF-SAMPLE SYSTEM
A. APS-RC Analysis of an individual system clearly illustrates the relevance of our models.From the APS-RC dataset we selected an out-of-sample structure and used previously trained SPA H M(a,b) models to predict the atomic charges of its radical cation.Fig. S5 compares the predicted and computed values of atomic charges for a selection of atoms included in the π-conjugated system.For SPA H M(b), the predicted values accurately reproduce the computed ones within 0.01 a.u., thus verifying its performance.However, by taking the changes in atomic charges for all the constituing atoms and summing them up (i.e.k (q cation k − q neutral k )) we obtain a total molecular charge ∼ 0.9, approximately yielding the removed electron.1) concatenation of representation vectors x obtained from ρ α and ρ β separately ("αβ"); 2) representation vector obtained from ρ = ρ α +ρ β , the total electron density as in case of closed-shell systems ("+"); 3) concatenation of representation vectors obtained from ρ = ρ α + ρ β and ρ m = ρ α − ρ β separately ("+−").
They were tested on the QM7/2-RC dataset with the SPA H M(b) representation.The results are shown on Fig. S8.As expected, in most cases the "+" model, having no information on the spin density, performed the worst, whereas the "αβ" model showed the best results and was chosen as the default option.The decomposition of the bond density onto a midbond-centered basis set required optimization of a suitable basis.First, we followed the procedure described in Ref. S7 used to optimize a basis to fit the on-top pair density.
For each bond of interest in a molecule, we search for the set of coefficients {c i } that approximates the bond density in the least-squares sense, where S is the overlap matrix, b i = ⟨ρ AB |ϕ i ⟩, and the decomposition error is Thus, to optimize the exponents, we minimize the sum of decomposition errors E for the molecules chosen for the bond of interest.The exponents {α µ } for all the angular momenta are optimized simultaneously.The exponents are parameterized as α µ = exp(p µ ), and the first derivatives of the loss functions E with respect to the exponents are computed as follows, with the overlap integrals and their derivatives taken numerically.All the bonds were treated separately.For each bond (or atom pair) presented in the QM7 and APS datasets we chose representative molecules containing it (e.g., H 2 for H-H; C 2 H 2 , C 2 H 4 , and C 2 H 6 for C-C; H 2 O and H 2 O 2 for H-O), and the sum of the molecular decomposition errors was minimized.The maximum angular momentum ℓ max and the number of functions n ℓ for each ℓ were gradually increased and optimized on each step, until addition further radial functions or angular momenta did not provide any significant decrease of error.The optimized exponents are available separately in Q-stack (https://github.com/lcmd-epfl/Q-stack).
However, for some of the bonds the fitting errors were huge (up to 20%) due to the fact that largest fraction of the bond density is still localized on participating nuclei, thus the fine-tuning of the fitting basis could not improve much.This could be solved with adding a single Gaussian centered in the midbond as a weight function.Our tests showed that, however the fitting error significantly decreased, the quality of learning was almost the same.
On Fig. S9 we compare the performance of SPA H M(b) computed using the fully-optimized basis for each bond ("normal") and using the same (C-C) basis for every bond ("same basis").It is clear that the representation quality does not depend on the exponents of the basis thus their optimization can be omitted.(the role of angular momenta is discussed in Sec.S6 B).

S7. EFFECT OF THE HAMILTONIAN
We compared the SPA H M(a,b) representations built upon the density matrices obtained from the Hückel guess S8,S9 , the LB S10 guess (default), and a converged PBE0 S11 computation.The learning curves are shown on Fig. S11.As expected, the worst approximation, the Hückel guess, gives the worst regression results.In contrast to the eigenvalue SPA H M, S12 , the converged density makes the best representation, sometimes overperforming SLATM, which opens the way to improvement of SPA H M(a,b) through improvement of the underlying guess Hamiltonian.

Figure 1 :
Figure 1: Scheme illustrating the steps required to compute SPA H M(a) and SPA H M(b) representations.

Figure 3 :
Figure 3: Learning curves of atomic charges and spins of nitrogen for the APS-RC dataset.The reference properties are computed at the ωB97X-D/def2-SVP level (see Sec. 5).

Figure 4 :
Figure 4: Prediction errors at the full training set for contributions of nitrogen atoms to the hole and particle densities of the productive π → π * state for the APS dataset; (+) and (−) indicate SPA H M computed for radical cations and anions, respectively.The reference properties are computed with TDDFT at the ωB97X-D/def2-SVP level (see Sec. 5).

Figure 5 :
Figure 5: Qualitative picture for (a) an out-of-sample structure: atomic contributions to the (b) particle and (c) hole densities of the productive π → π * state, predicted by SPA H M(b). (a): The elements are color-coded by dark gray for H, light gray for C, blue for N, and yellow for S. (b,c): The contribution of each atom is represented by the color intensity from black (0) to red (max.value).
SPA H M(a,b) computation requires diagonalization (per molecule) and density-fitting (per atom/bond in molecule) procedures, in the worst case scaling cubically with the number of atoms, resulting in being computationally expensive.Compared to aSLATM, generating SPA H M(a) vectors is approximately 9-fold more time-consuming (squared for SPA H M(b) relatively) for the sub-QM7 dataset.Moving toward more complex systems, i.e. the sub-APS dataset, reveals a larger observable time complexity than aSLATM: it took about twice more time to compute the aSLATM representation, compared to QM7, and about five times for either SPA H M(a,b).However, the overall speed is implementation-dependent, and efficiency is being addressed in ongoing efforts.Moreover, the simplified bond models discussed in Sec.S6 B open the route for future optimizations of SPA H M(b).
whilst SPA H M(b) shows good results for excited-state properties.Finally, comparison of the original SPA H M(b) with simplified versions shows little deterioration of the overall performance and offers promising routes toward more efficient implementations (See Sec.S6 B).Additionally, limiting the extent of the bond-based environments by optimizing the cutoff distance would preclude the computation of distant pairs while maintaining relevant motifs.While being currently under investigation, this effort is expected to significantly reduce run times especially for large systems.4 Conclusions This work expands our lightweight and efficient eigenvalue SPA H M representation into a local electron density-based variant.The adopted strategy extends the class of fingerprints derived from an approximated Hamiltonian with two local density-matrix-based representations: SPA H M(a) and SPA H M(b), accounting for atom and bond contributions.Combining strategies inspired from state-of-the-art local representations (i.e., SOAP, aSLATM) while simultaneously encoding electronic information, the SPA H M variants show excellent predictive power on local atomic properties (e.g., atomic charges, atomic spin density, and isotropic magnetic shielding) of neutral and charged species for both the prototypical QM7 and more challenging (azoheteroarene-based dyes) sets.SPA H M(a,b) were shown to outperform aSLATM for predicting properties of cationic species generated from the QM7 database as well as for those of highly conjugated cationic systems.Validation on the azoheteroarene-based dye database also demonstrated that SPA H M(b) is especially adapted to describe changes in electron delocalization typically observed in extended π-conjugated systems. S2
FIG. S2.Learning curves of atomic charges and spins for the APS-RC dataset.
(b)-anions SPA H M(a)-anions SPA H M(b)-cations SPA H M(a)-cations SPA H M(b) SPA H M(a) N (π → π * hole) (b)-anions SPA H M(a)-anions SPA H M(b)-cations SPA H M(a)-cations SPA H M(b) SPA H M(a) N (π → π * particle) FIG. S3.Learning curves of atomic contributions to the hole and particle densities of the productive π → π * state for the APS dataset; (+) [dashed line] and (−) [solid line] indicate SPA H M computed for radical cations and anions, respectively.

F 3
FIG. S4.Histograms of the full training set errors of atomic contributions to the hole and particle densities of the productive π → π * state for the APS dataset; (+) [dashed line] and (−) [solid line] indicate SPA H M computed for radical cations and anions, respectively.
FIG. S5.Predicted by SPA H M(a) (red) and SPA H M(b) (blue) and computed (black) atomic charges for a radical cation of an out-of-sample structure on a selection of atoms (highlighted).

πππFIG. S6 .
FIG. S6.Predicted by SPA H M(a,b) atomic to the hole and particle densities of the productive π → π * state for an out-of-sample structure.

FIG. S7 .
FIG. S7.Learning curves of atomic charges and shielding constants for the QM7 dataset.The color code reflects the different models used to construct the SPA H M(a) representation from the rotationally-invariant vectors.
FIG. S8.Learning curves of atomic charges spins for QM7/2-RC dataset and the SPA H M(b) representation.The color code reflects the different models used to generalize the representation to open-shell systems.
FIG. S10.Learning curves of atomic charges and spins for the dataset.The color code reflects the different basis sets used to generate the SPA H M(b) representations: "normal": fully-optimized basis for each bond; "only m = 0": optimized basis with m ̸ = 0 orbitals excluded.Learning curves for SLATM are given for comparison.

FIG. S11 .
FIG. S11.Learning curves of atomic charges and shielding constants for the QM7 dataset.The color code reflects the different Hamiltonians used to generate the SPA H M(a,b) representations.
FIG. S12.Learning curves of atomic charges and shielding constants for the QM7 dataset.The color code reflects the different representations."MR2021" stands for the KDFA S13 representation.

Table 1 :
User times required to generate the the SPA H M(a), SPA H M(b), and aSLATM representations and to compute the Laplacian kernel for the sub-QM7 and sub-APS sets (1000 randomly selected molecules).The values are averaged over 5 runs.

Table 1 .
For a fixed set, the theoretical complexity of kernel computation is proportional to the representation vector length (number of features).Since the SPA H M(a,b) vectors are ∼ 10 times more compact than the aSLATM ones, kernel computation for the former is significantly faster, which is especially important for multiple runs needed for hyperparameter search.While extension of SPA H M(a,b) to open-shell systems leads to a two times increase of the vector length, it is still linear with respect to the number of elements in the dataset in contrast to the cubic dependency of aSLATM.Thus, despite the computation of the SPA H M(a,b) vectors requiring more time than the aSLATM ones, training SPA H M(a,b)+KRR models is more efficient than aSLATM+KRR.