Electronic Excited States from Physically Constrained Machine Learning

Data-driven techniques are increasingly used to replace electronic-structure calculations of matter. In this context, a relevant question is whether machine learning (ML) should be applied directly to predict the desired properties or combined explicitly with physically grounded operations. We present an example of an integrated modeling approach in which a symmetry-adapted ML model of an effective Hamiltonian is trained to reproduce electronic excitations from a quantum-mechanical calculation. The resulting model can make predictions for molecules that are much larger and more complex than those on which it is trained and allows for dramatic computational savings by indirectly targeting the outputs of well-converged calculations while using a parametrization corresponding to a minimal atom-centered basis. These results emphasize the merits of intertwining data-driven techniques with physical approximations, improving the transferability and interpretability of ML models without affecting their accuracy and computational efficiency and providing a blueprint for developing ML-augmented electronic-structure methods.


I. INTRODUCTION
Machine learning (ML) techniques have been applied very successfully to circumvent the complexity and computational cost of physics-based modeling techniques.[1,2] For example, ML interatomic potentials trained on quantum mechanical (QM) calculations have become ubiquitous, making it possible to simulate the structure and stability of complex systems in realistic thermodynamic conditions [3][4][5]. ML approaches are increasingly applied to a broader array of quantum mechanical properties [6], from the ground-state electron density [7][8][9] to electronic excitations [10][11][12][13][14], the latter of which are the main focus of the present study. In QM calculations, these properties are often the result of a sequence of computational steps that operate on intermediate quantities describing the electronic structure of a molecule. For instance, mean-field methods such as Hartree-Fock and Kohn-Sham density functional theory (DFT) [15,16] evaluate a self-consistent, effective single-particle Hamiltonian, which can be diagonalized to obtain numerous properties of the system. Methods based on this single-particle picture also form the basis of more accurate ab initio methods such as complete active space (CAS) or coupled-cluster (CC) theories, as well as of less demanding semi-empirical methods [17][18][19], which parametrize the Hamiltonian using ab initio and empirical data. In addition to the parametrization, which avoids computing many expensive integrals, semi-empirical schemes usually work in a "minimal basis" [20,21] and consider only valence electrons, discarding core electrons and high-energy virtual orbitals [22], to further speed up the calculations. Significant work has been devoted to exploring these approximations, which can serve as inspiration for the design of ML schemes for electronic properties [23].
When the single-particle wavefunction is expressed in terms of an atom-centered basis, the elements of the Hamiltonian matrix are indexed by two atoms and determined by interactions with their neighbors, which makes them very well suited as the target of an ML model built on geometric and chemical information.[24] Over the past few years, several works have discussed the prediction of single-particle electronic Hamiltonians of molecular systems [25][26][27][28][29][30]. Physical symmetries constrain their electronic structure, which can be exploited by constructing equivariant ML schemes that ensure that the data-driven model conforms to these constraints.[24,31,32] Once the Hamiltonian matrix is obtained, all sorts of ground-state properties, such as the electron density, can be obtained with simple, inexpensive manipulations. Furthermore, excited states can also be predicted, at least approximately, by post-processing the ground-state single-particle Hamiltonian. In principle, an ML algorithm could also be built to directly target the property one wants to predict, e.g. electronic excitation energies [10].
Here we pursue an alternative strategy, which integrates data-driven modeling and QM calculations more closely.
We build a symmetry-adapted ML model with an intermediate layer that mimics a minimal-basis, single-particle electronic Hamiltonian, which is then used to compute the molecular orbital (MO) energy levels and atomic charges. We then train this model against the MOs obtained from quantum chemical calculations with a richer basis set. The resulting architecture inherits the accuracy of QM calculations, as well as the transferability to larger, more complex molecules, while being orders of magnitude faster. This approach also enables predictions of molecular excited states, which we demonstrate using the simplified Tamm-Dancoff approximation (sTDA) [33,34] to calculate valence excitation energies.
We analyze the resulting model for a dataset of hydrocarbons, training on a few small molecules and assessing predictions on much larger systems. Our model not only reproduces the energies and shapes of the MOs of previously unseen molecules but also predicts their excitation energies with remarkable accuracy. We finally showcase an example of calculating the vibronic spectrum of a molecule not present in the training set. Our observations have relevance beyond the specific type of excited-state calculations we apply, as they indicate that models combining data-driven steps with physically motivated manipulations and constraints combine the advantages of both approaches and deserve to be more widely adopted as a tool for accurate and affordable atomistic modeling of the electronic properties of molecules and materials.

II. RESULTS
The details of our framework are described in the Materials and Methods section and the Supplementary Materials. However, to better appreciate the results, we outline the motivation behind some of our technical choices. Our overarching goal is to demonstrate how a hybrid ML scheme can achieve transferability along different axes. On one hand, we aim to show that it can be trained on small, simple molecules and reach useful accuracy when making predictions on larger, more complicated compounds. On the other hand, we want to demonstrate that the model can predict quantities other than those it has been trained on: specifically, electronic excitations and their coupling with nuclear vibrations. With these goals in mind, we train and validate our model on small hydrocarbon systems (specifically ethane, ethene, butadiene, hexane, hexatriene, styrene and isoprene), which provide a concise but representative palette of saturated, unsaturated, and aromatic motifs. These structures were generated from high-temperature replica-exchange molecular dynamics (REMD) simulations, which contain multiple conformers and distorted configurations. We then use the trained model for predictions on larger and more complex molecules, from azulene to beta-carotene. We select classes of compounds with interesting yet well-understood physical effects (e.g. the dependence of the band gap on the extent of a conjugated system), and use a simple approximation to compute electronic excitations, allowing for a comparison with explicit quantum mechanical calculations for large molecules and for more subtle properties, such as vibronic spectra. Even though our architecture is fully compatible with any deep-learning scheme, we use an equivariant model based on linear regression to emphasize the impact of the coupling between the ML scheme and the physical approximations over the fine-tuning of the architecture of a non-linear ML model.

A. Hybrid ML architecture
Most ML frameworks that predict the Hamiltonian directly target the elements of a single-particle electronic matrix H (the Fock, or Kohn-Sham, matrix) obtained from a quantum mechanical calculation [24,28,31,32], which describes the interactions between a suitable set of basis functions centered on the different atoms. The molecular orbitals (MOs) and their energies {ε_n} are obtained by diagonalizing the predicted (or quantum mechanical) H. A notable exception is the SchNet+H approach [35], in which the target is a "pseudo-Hamiltonian" H that is invariant to the molecular orientation and is diagonalized to obtain eigenvalues {ε_n} that are then compared with the reference calculation. This was shown to provide better accuracy (and a much-simplified architecture) than targeting the matrix elements directly, at the cost of losing the natural symmetries of the physical Hamiltonian. Ref. 24 also discusses the construction of a symmetry-adapted projected Hamiltonian that reproduces the MO energies from a converged calculation using a smaller set of orbitals, and was then used as the target of the ML model. This simplifies the calculation (the cost of diagonalizing a matrix scales cubically with the number of orbitals), but introduces a non-physical training target, as there is no unique definition of the reduced matrix.
The hybrid architecture we propose here is shown schematically in Fig. 1, and combines the most desirable features of these earlier schemes. We use a symmetry-adapted ridge regression model based on equivariant 2-center-1-neighbor atom-density correlation features [24] to parameterize the matrix elements of an effective minimal-basis Hamiltonian H̃. This matrix has the structure and the O(3) symmetries corresponding to an atom-centered STO-3G basis, containing 1s orbitals for H and 1s, 2s and 2p orbitals for C atoms. H̃ can be used in different ways, corresponding to different strategies to incorporate data-driven techniques in an electronic structure calculation, and to the different models and loss functions depicted in Fig. 1 (b).
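As a concrete illustration of the structure imposed on the effective matrix, the short sketch below enumerates the minimal-basis orbitals stated above (1s for H; 1s, 2s, 2p for C). The helper name and dictionary layout are our own illustrative choices, not part of the actual implementation.

```python
# minimal-basis layout used for the effective Hamiltonian:
# H -> 1s ; C -> 1s, 2s, 2p (five orbitals per carbon, one per hydrogen)
MINIMAL_BASIS = {"H": ["1s"], "C": ["1s", "2s", "2px", "2py", "2pz"]}

def orbital_layout(symbols):
    """Map each orbital index to (atom index, orbital label); the effective
    Hamiltonian is organised in atom-pair blocks over these orbitals."""
    return [(i, orb) for i, s in enumerate(symbols) for orb in MINIMAL_BASIS[s]]

# ethane, C2H6: 2*5 + 6*1 = 16 orbitals, i.e. a 16x16 effective matrix
assert len(orbital_layout(["C", "C"] + ["H"] * 6)) == 16
```

The atom-pair block structure over this orbital list is what allows the model to predict each block from features of the corresponding pair of atomic environments.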

In model 1, we take the effective Hamiltonian matrix as a literal prediction of the results of a self-consistent STO-3G calculation and compute the loss as the L2 norm of the difference between the predicted H̃ and the target H. In model 2, we compute the eigenvalues ε̃_n by diagonalizing H̃, and define the loss as the error in reproducing the eigenvalues ε_n of the STO-3G calculation. In this case, H̃ plays the role of a "pseudo-Hamiltonian" that (in contrast to SchNet+H) has the correct symmetry properties, but is not bound to be equal to the matrix elements computed in a specified minimal basis. The imposed symmetry properties help the model recover the correct shape of the MOs [24]. Finally, in model 3 we supplement the MO energies with other quantities computed from the electronic-structure calculation and compute a combined loss that measures the errors in reproducing all the physical constraints. In our case, we use the Löwdin atomic charges from the QM STO-3G calculation. This constraint is meant to guide the model towards a correct prediction of the QM density on each atom. We stress that in all cases but model 1, minimizing the model loss is a non-convex optimization problem, despite the fact that we use a linear expression for the relation between the matrix elements of H̃ and the structural descriptors.
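The three loss functions can be sketched concretely. In the minimal numpy sketch below, the helper names, the closed-shell occupation factor of 2, and the relative weight `alpha` of the charge term are our illustrative assumptions, not the exact implementation used in this work; the effective basis is taken to be orthonormal, so diagonalization is a plain (not generalized) eigenproblem.

```python
import numpy as np

def eigenvalues(H):
    """MO energies of an effective Hamiltonian on an orthonormal
    minimal basis (overlap = identity, so a plain eigenproblem)."""
    return np.linalg.eigvalsh(H)

def lowdin_charges(H, Z, orbital_atom, n_occ):
    """q_A = Z_A - 2 * sum_{mu in A} sum_n^occ |C_{mu n}|^2 for a
    closed-shell system, in an orthonormal basis."""
    _, C = np.linalg.eigh(H)
    pop = 2.0 * np.sum(C[:, :n_occ] ** 2, axis=1)  # per-orbital population
    q = np.array(Z, dtype=float)
    for mu, A in enumerate(orbital_atom):
        q[A] -= pop[mu]
    return q

# model 1: direct matrix-element loss ||H_pred - H_ref||^2
def loss_model1(H_pred, H_ref):
    return np.sum((H_pred - H_ref) ** 2)

# model 2: loss on the eigenvalues only
def loss_model2(H_pred, eps_ref):
    return np.sum((eigenvalues(H_pred) - eps_ref) ** 2)

# model 3: combined loss on eigenvalues and Lowdin charges
def loss_model3(H_pred, eps_ref, q_ref, Z, orbital_atom, n_occ, alpha=1.0):
    dq = lowdin_charges(H_pred, Z, orbital_atom, n_occ) - q_ref
    return loss_model2(H_pred, eps_ref) + alpha * np.sum(dq ** 2)
```

Since models 2 and 3 pass the prediction through a diagonalization before the loss is evaluated, their optimization is non-convex even for a linear model of the matrix elements.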
As shown in Table I, targeting the matrix elements of the Hamiltonian in a direct learning setup (model 1) leads to consistently larger prediction errors for the single-particle energy levels than using the ML model of the Hamiltonian as an intermediate step in the calculation of the eigenvalues (model 2). However, the indirect optimization leads to dramatic errors on other quantities that can be computed from the Hamiltonian matrix. Model 2 gives unphysical predictions for atomic charges, such as negative charges on hydrogen atoms (Fig. 2). Combining multiple targets (model 3) achieves a better-balanced model that improves upon direct Hamiltonian learning for both eigenvalues and atomic charges.

FIG. 4. Performance of the ML model in predicting the excitation energies of the first two singlet excited states. The target is computed with sTDA B3LYP/def2-TZVP; the prediction is computed by coupling the ML prediction with sTDA. The blue circles and the yellow diamonds represent the first and second excited states, respectively. The MAE is reported for both.

B. Large basis targets
The comparison between different architectures reveals the need to balance data- and physics-based considerations when optimizing the overall architecture of the ML model. Even though the accuracy of predictions for the indirect model (model 3) is remarkable, one has to keep in mind that a minimal-basis description of the electronic structure is very far from converged: errors on the electronic eigenvalues, particularly for low-lying excited states, are often of the order of several eV. As anticipated, an indirect training strategy can also be used to predict target quantities that are computed with larger basis sets, while keeping a model architecture consistent with a minimal basis. Specifically, we use as targets the valence-state eigenvalues ε^LB_n and Löwdin charges q^LB_i computed with a large triple-zeta basis (which contains 1s, 2s, 2p and 3s orbitals for H and 1s, 2s, 2p, 3s, 3p, 3d, 4s, 4p, 4d, 4f and 5s orbitals for C atoms). The performance of this model is shown in Figure 3 for both MO energies and atomic charges (see Figure S6 for a detailed plot).
The errors in these "large-basis target" (LBT) calculations are considerably larger (up to a factor of 10 for ethane) than in those targeting the properties computed with a minimal basis, which is unsurprising given that the electronic-structure problem is considerably under-parameterized relative to the reference calculations. These errors, however, are at least an order of magnitude smaller than those of an explicit minimal-basis electronic-structure calculation (Figure 3 (b) and (c)). The model learns an effective pseudo-Hamiltonian that reproduces to high accuracy the desired electronic properties of a converged calculation. Such a pseudo-Hamiltonian has the symmetries of a minimal-basis H, but is not explicitly tied to a choice of atom-centered basis functions. Nonetheless, the MO shapes can still be inspected with a pseudo-basis with the correct angular symmetries and an arbitrary radial basis function. The molecular orbitals are shown in Figure 3 (d) using the STO-3G atomic orbital basis. The LUMO of ethane is the only orbital exhibiting a mismatch with the ML prediction. The difficulty in predicting this MO lies in its Rydberg character: Rydberg orbitals are known to appear in calculations on small molecules in the gas phase, and their diffuse character proves particularly challenging for ML when using a minimal-basis pseudo-Hamiltonian. All the other orbitals show that the symmetry and the nodal structure of the LBT one-electron wavefunctions are learned correctly by the model. In the remainder of this study we focus exclusively on this type of LBT hybrid model, which offers an excellent tradeoff between accuracy and computational expense.

C. Electronic excitations from an ML Hamiltonian
FIG. 5. The ML time (blue) is the sum of the featurization time, the prediction time, and the sTDA step. Each timing is measured on a single core of an Intel Xeon Gold 5120 processor. B3LYP/def2-TZVP and B3LYP/STO-3G calculations are performed with PySCF. More details are given in Table S5 in the SI.

One of the advantages of explicit electronic-structure calculations is that, after having determined the self-consistent single-particle Hamiltonian, they allow the prediction of many molecular properties through simple and inexpensive post-processing steps. Our benchmarks thus far only validate the accuracy for properties that are explicitly trained on. To check whether the predicted pseudo-Hamiltonian can also be manipulated to access other properties, we consider the problem of predicting excitation energies based on time-dependent DFT (TD-DFT), which is commonly used to compute excited states for large molecules. We use, in particular, an approximation of TD-DFT developed by Grimme [33], the simplified Tamm-Dancoff approximation (sTDA). Its appeal in the present case is that the integrals in sTDA are approximated with Löwdin charges that are readily available from our ML prediction (see Materials and Methods). As shown in Fig. 4, the excitation energies computed from the effective ML Hamiltonian are predicted with good accuracy, with a balanced error for both the first and second excited states. The error increases as the molecule becomes larger and more flexible, as for hexane, but it always remains well below 200 meV. As a point of comparison, excitation energies computed at the STO-3G level differ by at least 1 eV from those computed with a large basis set (see Table S5). What makes these results particularly remarkable is that the model does not explicitly use the sTDA excitations as targets, which indicates good generalization capabilities of the ML model, and at the same time scope for further improvement of the accuracy by including these additional targets in the model optimization step. Overall, our indirect learning strategy delivers predictions of electronic properties with an accuracy comparable to that of converged settings, at a cost that is much smaller than that of a minimal-basis DFT calculation, and many orders of magnitude faster when compared to the large basis (see Fig. 5). The speed gain is mainly due to the ML algorithm itself, which directly predicts the blocks of the self-consistent Hamiltonian. The minimal-basis formulation reduces the cost of the diagonalization step, although, for the larger molecules, evaluating the sTDA excitations becomes the rate-limiting step for predictions that start from the ML-based pseudo-Hamiltonian. The local nature of the model also means that the ML Hamiltonian is highly sparse, which would be beneficial in combination with linear-scaling solvers [36]. We stress that the current implementation is not optimized for speed, being instead focused on making it easy to test new ideas: a more efficient calculation of the symmetry-adapted features, as well as the use of more sophisticated model architectures, therefore leaves much room to further improve the accuracy and computational requirements of the ML model.
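To make the structure of such a post-processing step concrete, the sketch below assembles a Tamm-Dancoff-type response matrix from MO energies and atomic transition charges, in the spirit of sTDA. The damped-Coulomb kernels, the parameters `c_J` and `c_K`, and the helper names are simplified stand-ins (actual sTDA uses Mataga-Nishimoto/Ohno-Klopman-type kernels with functional-dependent parameters), so this illustrates the data flow, not Grimme's parametrization.

```python
import numpy as np

def transition_charges(C, orbital_atom, n_atoms):
    """q[p, q, A] = sum_{mu in A} C[mu, p] * C[mu, q] (orthonormal basis)."""
    n_orb = C.shape[1]
    q = np.zeros((n_orb, n_orb, n_atoms))
    for mu, A in enumerate(orbital_atom):
        q[:, :, A] += np.outer(C[mu, :], C[mu, :])
    return q

def stda_like_singlets(eps, C, orbital_atom, coords, n_occ, c_J=2.0, c_K=2.0):
    """Diagonalize a TDA matrix A_{ia,jb} = delta(eps_a - eps_i)
    + 2 (ia|jb) - (ij|ab), with integrals approximated by monopoles."""
    n_atoms = len(coords)
    q = transition_charges(C, orbital_atom, n_atoms)
    R = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # placeholder damped Coulomb kernels (stand-ins for the sTDA ones)
    gam_J = 1.0 / np.sqrt(R**2 + c_J**2)
    gam_K = 1.0 / np.sqrt(R**2 + c_K**2)
    occ, vir = range(n_occ), range(n_occ, C.shape[1])
    pairs = [(i, a) for i in occ for a in vir]
    A = np.zeros((len(pairs), len(pairs)))
    for x, (i, a) in enumerate(pairs):
        A[x, x] = eps[a] - eps[i]
        for y, (j, b) in enumerate(pairs):
            K = q[i, a] @ gam_K @ q[j, b]   # exchange-type (ia|jb)
            J = q[i, j] @ gam_J @ q[a, b]   # Coulomb-type  (ij|ab)
            A[x, y] += 2.0 * K - J
    return np.linalg.eigvalsh(A)  # TDA excitation energies
```

The key point for the hybrid scheme is that only the MO energies, the MO coefficients, and atomic monopoles enter, all of which are produced by the ML pseudo-Hamiltonian.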

D. Extrapolative predictions
The calculation of the excitation energies based on the ML-derived minimal-basis Hamiltonian demonstrates the ability of our framework to generalize to new properties, beyond those it has been trained on. As we shall see, it also demonstrates excellent transferability to new structures, much larger and more complex than those included in the training set. As a first benchmark, we test our ML model by predicting the excitation energies of polyalkenes of different lengths. For octatetraene and decapentene, we extract 100 conformations from a REMD trajectory spanning a broad temperature range, following a protocol analogous to that used to generate the training set. Figure 6 (a) shows the prediction of the first excited state for several conformations of octatetraene and decapentene. Errors are comparable to those observed on the validation set, indicating excellent transferability to bigger molecules. This is confirmed by the small errors obtained for the prediction of the MO energies (Figure 6 (b) and Figure S8) of longer polyalkenes with up to 10 double bonds.
In this benchmark, errors are dominated by the presence of distorted configurations. The excitation energies for the optimized geometries (Figure 6 (c)) provide clearer insights into specific physical effects, without the noise introduced by thermal distortion. The model perfectly captures the dependence of the S1 energy on the increase in conjugation length, but there is a redshift of approximately 300 meV relative to the target values. This redshift illustrates a limitation of the ML framework, which uses short-ranged features to parameterize the pseudo-Hamiltonian: all matrix elements involving atoms beyond the cutoff are predicted to be zero (Figure S9). This results in a systematic underestimation of the HOMO-LUMO gap (Figures S9 and S10), which translates into an underestimation of the S1 excitation energy. Increasing the cutoff of the ML features would be an obvious strategy to address this problem, which however also leads to a model that is less accurate and transferable, as it would require training on larger molecules to thoroughly sample longer-range interactions (see Figure S2). An alternative, very effective solution is to use an explicit minimal-basis calculation as a baseline, using the ML model to learn the large-basis targets by correcting the STO-3G Hamiltonian. As shown in the SI, this baselined model yields much lower errors, and completely eliminates the redshift. This is relevant in light of several recent efforts that use low-cost electronic-structure properties as molecular descriptors [37,38], and representative of the kind of trade-offs that are necessary when designing hybrid modeling schemes. It entails, however, a substantial increase in computational cost, and we will restrict our investigation to a model that does not rely on a baseline.
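The baselined variant amounts to a Δ-learning setup: the ML model predicts only a correction on top of the Löwdin-orthogonalized explicit STO-3G Fock matrix. The function below is a hypothetical sketch of this idea; the names and interfaces are ours.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def baselined_hamiltonian(H_sto3g, S_sto3g, delta_ml):
    """Lowdin-orthogonalize the explicit minimal-basis Fock matrix,
    H0 = S^{-1/2} H S^{-1/2}, then add the ML-predicted correction."""
    s_inv_half = fractional_matrix_power(S_sto3g, -0.5)
    h0 = s_inv_half @ H_sto3g @ s_inv_half
    return h0 + delta_ml
```

The baseline carries the long-range matrix elements that the short-ranged features truncate, which is why this construction removes the systematic redshift, at the price of requiring an explicit SCF calculation for every prediction.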
We also consider the case of four representative aromatic molecules, none of which is included in the training set (Figure 6 (d)). Even though styrene is the only aromatic molecule used during training, the prediction of S1 for the test aromatic molecules matches the reference values very well, except for azulene, which contains a pentagonal motif that is completely absent from the training data. Figure 6 (d) also shows the transition density associated with the first excited state for both the ML prediction and the target. The ML-based transition density matches well the qualitative nature of the reference, indicating that the model predicts the correct excitation despite small quantitative differences in shape due to the use of a minimal basis set. Naphthalene is the only exception: for this molecule, the La and Lb states are particularly challenging for computational methods [39], and for B3LYP/def2-TZVP, in particular, they are very close in energy. As a result, the small errors of the ML model lead to an exchange in their ordering (see Figure S11). These results, together with the ability of the model to capture the correct trend in the excitation energy of polyenes, demonstrate that a hybrid model based on the evaluation of an effective Hamiltonian as an intermediate step can capture global effects such as conjugation and aromaticity, despite being built from local features. This is in stark contrast to models that target the excitation energy directly, which would not be capable of capturing changes for molecules larger than those included in the training set (as observed, e.g., for the polarizability). An analysis of the dependence of the excitation energies of biphenyl and butadiene on molecular distortions confirms that an sTDA calculation based on the ML Hamiltonian correctly captures the qualitative behavior of the excited states, and is more accurate relative to the converged DFT calculations than commonly used semiempirical methods, even when used in an extrapolative regime (Figure S13).
As a final example, we use our model to predict the Hamiltonian of very large molecules, namely C60 fullerene and β-carotene (Figure 6 (e) and (f)). Despite its size and complexity, the MO energies of β-carotene are predicted with an MAE of less than 200 meV, which is largely due to the underestimation of the HOMO-LUMO gap, similar to what we observed in much simpler, linear polyalkenes (Figure 6 (c)). The peculiar structural features of C60, with the presence of pentagonal rings, the complete lack of hydrogen atoms, and high curvature, make it a complete outlier relative to the training set. Nevertheless, the LBT model achieves an MAE of 654 meV on the MO energies, which is much smaller than the error of a minimal-basis calculation (MAE in excess of 4.7 eV). For β-carotene the atomic Löwdin charges are predicted with an accuracy comparable to that observed for small molecules, while for C60 they are exactly zero due to symmetry (Figure S12), which is captured by the ML model thanks to its equivariant structure.

E. Vibronic spectra
As a final application, we demonstrate how our hybrid model can be further combined with advanced simulation techniques to obtain accurate yet inexpensive predictions of subtle quantum-mechanical effects. We estimate the vibronic spectrum of anthracene via second-order cumulant expansion theory (see Materials and Methods). Within this framework, the vibronic structure of the excitation is encoded by the spectral density. Among the different approaches to compute this quantity [42,43], we rely on the calculation of the autocorrelation function of the excitation energy along an MD trajectory. This dynamical method is typically very accurate [44], and transparently incorporates vibrational information. This is an application where a fast ML model for excited states can make prohibitively demanding quantum-chemical calculations routine: a single sTDA calculation for anthracene at the B3LYP/def2-TZVP level requires approximately 8 CPU minutes, versus half a second for the hybrid model, a speed-up of three orders of magnitude.
The spectral density and the vibronic spectrum of anthracene are shown in Figure 7 (a) and (b), respectively. The low cost of the ML calculations allows us to average the spectral density over several 10 ps windows. The corresponding averaged spectral density for sTDA B3LYP/def2-TZVP is much more demanding, and we show its value computed for a single window. The target spectral density lies mostly within the confidence interval reported for the ML averaged spectral density (blue interval) (Figure 7 (a)). A comparison with the ML spectral density computed in the same window shows that all the peaks are slightly overestimated by the ML, with the exception of a peak at around 400 cm−1 which is absent from the ML prediction. This peak is responsible for the vibronic shoulder visible in the experimental absorption spectrum, which is missing from the one predicted by the ML simulations (Figure 7 (b) and Figure S15). Inspection of this normal mode shows that it is a global "breathing"-like motion of the entire molecule (see inset in Figure 7 (a)) arising from small alterations of the interatomic distances in anthracene, and thus hard to characterize with short-ranged features. Besides this minor discrepancy, the absorption spectrum is in good agreement with the experimental one. Indeed, the ML spectrum is competitive with ZINDO (Figure S14), a similarly fast semiempirical method specifically built to target singlet excited states of organic molecules, and sometimes used to compute spectral densities in complex biomolecules [45,46].
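The dynamical route to the spectral density can be sketched as follows. This is a schematic implementation under common assumptions (classical energy-gap fluctuations, a harmonic prefactor βω/2, a Hann window damping the noisy tail of the correlation function); prefactor conventions and windowing choices vary in the literature, and the function names are ours.

```python
import numpy as np

def autocorrelation(x):
    """Unbiased autocorrelation C(t) of a mean-free signal, computed via
    FFT (Wiener-Khinchin), zero-padded to avoid wrap-around."""
    n = len(x)
    f = np.fft.rfft(x, 2 * n)
    acf = np.fft.irfft(f * np.conj(f))[:n]
    return acf / np.arange(n, 0, -1)

def spectral_density(gap, dt, beta, omega):
    """J(w) ~ (beta*w/2) * cosine transform of C(t); `gap` is the
    excitation energy sampled every `dt` along an MD trajectory."""
    c = autocorrelation(gap - np.mean(gap))
    t = dt * np.arange(len(c))
    c = c * np.hanning(2 * len(c))[len(c):]   # damp the noisy tail of C(t)
    J = np.array([dt * np.sum(c * np.cos(w * t)) for w in omega])
    return 0.5 * beta * omega * J
```

Because each frame of the trajectory requires one excited-state evaluation, the cost of this estimator is dominated by the excitation-energy calculation, which is exactly where the hybrid model provides its speed-up.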

III. DISCUSSION
At the most fundamental level, the difference between physically motivated and data-driven modeling approaches is that between primarily deductive and naively inductive paradigms of scientific knowledge. The former strives for universality, and to reduce the complexity of empirical observations to a minimal set of fundamental laws, while the latter infers patterns from data, and is usually more limited in generalization power, but can be more precise, or computationally efficient, in capturing structure-property relations. The approach we discuss here to describe electronic excitations, combining machine learning of an effective minimal-basis, single-particle Hamiltonian with training on a large basis set and its further use to evaluate physics-based approximations of molecular excitations, demonstrates the advantages of a middle-ground, hybrid solution.
Despite the relatively small training set, composed of MD snapshots of seven small hydrocarbon molecules, and the simplistic choice of ML architecture, our model shows excellent transferability, both in terms of the evaluation of derived electronic properties and in terms of making predictions for larger, more complex compounds. We probe the former aspect by computing excitation energies in the sTDA approximation, or the vibronic spectrum based on molecular dynamics trajectories, and the latter by making predictions on molecules that are larger and more complex than those included in the training. In both cases, we demonstrate excellent transferability, with similar errors observed in the interpolative and extrapolative regimes, except for cases, such as C60, that entail dramatically different chemical motifs. In every case we consider, we obtain an accuracy that is much better than that afforded by an explicit minimal-basis quantum calculation, at a considerably reduced cost. We capture qualitative physical effects, such as the dependence of molecular excitations on the extent of conjugation or on internal molecular rotations. We can also easily interpret the quantitative errors, which we trace back to the aggressive local truncation of the descriptors.
We make a few observations that could guide the development of similar hybrid models. (1) Reproducing the mathematical structure of the quantum mechanical approximations is more effective than explicitly targeting the value of approximate electronic-structure quantities. (2) Using indirect properties as targets, such as single-particle eigenvalues, makes it possible to "promote" the model accuracy to a higher level of theory, e.g. using a larger basis set, at no additional cost. (3) Indirect learning should be sufficiently constrained to avoid overfitting and unphysical predictions. (4) The use of well-principled descriptors that incorporate the symmetries of the problem is beneficial in reproducing the qualitative behavior of the excitations. (5) Locality plays a fundamental role in facilitating transferability, but there is a tradeoff with the asymptotic accuracy that the model can achieve. One of the main challenges is that, despite the common features [24,47], the design space of ML models for chemistry is very large [48]. Incorporating physical approximations into an ML model can make the architecture easier to interpret, but also introduces more degrees of freedom that need to be explored more extensively. We suggest that restricting the investigation to simple, easy-to-interpret models, translating some of the insights that have become well-established in the construction of ML interatomic potentials to electronic-structure targets, and emphasizing generalization power over in-sample benchmark accuracy (which is the main advantage of physics-based, deductive modeling) are the next logical steps in advancing the integration between machine learning and quantum chemistry.

IV. MATERIALS AND METHODS
a. Model Architecture Our ML model aims to reproduce the single-particle states (MOs) from a mean-field quantum chemistry calculation such as Hartree-Fock or Kohn-Sham DFT. These MOs are expanded on a basis set of atomic orbitals (AOs), and their coefficients are obtained from the solution of the self-consistent-field (SCF) generalized eigenvalue equation

HC = SCE,    (1)

where H is the Fock (or Kohn-Sham) matrix, S is the overlap matrix of the AO basis, C collects the MO coefficients, and E is a diagonal matrix containing the MO energies. As H depends on the orbitals themselves, the iterative solution of this equation is numerically expensive, especially for large basis sets.
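For concreteness, a generalized eigenproblem of this form can be solved with standard linear algebra routines; the toy matrices below are arbitrary illustrative numbers, not quantities from an actual calculation.

```python
import numpy as np
from scipy.linalg import eigh

# toy Fock and overlap matrices (symmetric H, symmetric positive-definite S)
H = np.array([[-1.0, -0.2],
              [-0.2,  0.5]])
S = np.array([[ 1.0,  0.1],
              [ 0.1,  1.0]])

eps, C = eigh(H, S)                         # solves H C = S C E
assert np.allclose(H @ C, S @ C @ np.diag(eps))   # the eigenvalue equation
assert np.allclose(C.T @ S @ C, np.eye(2))        # MOs are S-orthonormal
```

In an SCF calculation this diagonalization is repeated until H, which depends on the orbitals, is self-consistent; the effective Hamiltonian introduced next removes that loop entirely.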
We learn an effective Hamiltonian H̃ that substitutes H in the eigenvalue equation and directly yields the MOs, bypassing the SCF solution. To simplify and constrain the problem, we require that H̃ is defined on a minimal and orthonormal AO basis, akin to standard semiempirical methods. In addition, the minimal basis over which we learn H̃ is only implicitly defined, a feature that improves the model flexibility. We train our model so that the solutions of

H̃C̃ = C̃Ẽ    (2)

generate the same selected subset of MOs and MO energies as the original SCF equations. The obtained MO coefficients and energies can be used to predict additional quantities, such as electronic excitation energies, within a physics-based model. For model 1, which directly targets the entries of an orthogonalized Hamiltonian, we use the Löwdin-symmetrized Fock matrix H_LSF = S^{-1/2} H S^{-1/2} computed with B3LYP/STO-3G as our target. When we target an SCF Fock matrix H in a larger basis, which yields more accurate results, we cannot link its entries to our prediction H̃ in an implicit minimal basis. Instead, we ensure that our model generates the desired subset of MOs, with energies that are as close as possible to the quantum-chemical calculations. To do so, we first train our model indirectly on B3LYP/STO-3G targets, minimizing a loss on MO energies (model 2) and on MO energies and Löwdin charges (model 3). The Löwdin charge q_A on atom A is computed as

q_A = Z_A − 2 Σ_{μ∈A} Σ_i^{N_occ} |C_{μi}|²,    (3)

where Z_A is the atomic number of A, μ indexes an atomic orbital centered on A, N_occ is the number of occupied MOs (the factor 2 accounts for their double occupation in these closed-shell systems), and the MO coefficients C are obtained from Eq. 2. The final LBT ML model is trained on MO energies and Löwdin charges, as for model 3, but on targets computed with B3LYP/def2-TZVP.

b. Dataset Generation We use a dataset of 1000 different geometries of ethane, ethene, butadiene, and octatetraene that was originally presented in Ref. 8. We extend it with configurations of hexane, hexatriene, isoprene, styrene, and decapentene following a similar protocol as in the reference. In summary, we carry out a replica-exchange molecular dynamics (REMD) simulation with a timestep of 0.5 fs, for a total of 150 ps of sampling per replica, attempting exchanges every 2 fs. Molecular dynamics trajectories for each replica were integrated in the constant-temperature ensemble using a generalized Langevin equation (GLE) thermostat. Forces were computed at the DFTB3-UFF/3OB level of theory. The simulation was run with i-PI [49] in combination with DFTB+ [50]. For each molecule, 1000 structures were selected from all trajectories, using farthest point sampling (FPS) [51] on SOAP [52] descriptors averaged over the structures. The SOAP descriptor was computed with rascaline [53], using a cutoff of 4.5 Å, n_max = 6, l_max = 4, and a Gaussian width of 0.2 Å. FPS was performed with scikit-matter [54]. For these datasets we then performed DFT calculations with both the STO-3G and def2-TZVP basis sets at the Gaussian-like B3LYP functional (b3lypg) level of theory using PySCF [55], to obtain the Fock and overlap matrices and other required electronic-structure properties. All calculations were performed in the spherical atomic basis, and molecular symmetry was not taken into account even when present, as is the case for some optimized geometries.
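The Löwdin-charge evaluation defined above reduces to a few lines once the MO coefficients in the orthonormal basis are available. This sketch assumes closed-shell (doubly occupied) MOs and a hypothetical `ao_to_atom` map from AO index to atom index:

```python
import numpy as np

def lowdin_charges(C, Z, ao_to_atom, n_occ):
    """Löwdin charges in an orthonormal AO basis:
    q_A = Z_A - 2 * sum_{mu in A} sum_i^{N_occ} |C_mu_i|^2.
    The factor 2 assumes doubly occupied MOs (closed shell); C has one
    column per MO, and ao_to_atom maps each AO index to its atom."""
    pops = 2.0 * (C[:, :n_occ] ** 2).sum(axis=1)  # per-AO Löwdin populations
    q = np.array(Z, dtype=float)
    for mu, atom in enumerate(ao_to_atom):
        q[atom] -= pops[mu]
    return q

# A symmetric two-orbital "H2-like" toy: one doubly occupied bonding MO
# delocalized over two atoms leaves both atoms neutral.
C = np.array([[2 ** -0.5], [2 ** -0.5]])
assert np.allclose(lowdin_charges(C, [1, 1], [0, 1], n_occ=1), 0.0)
```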
c. Symmetry-adapted Hamiltonian regression The single-particle Hamiltonian is expressed in terms of AO basis functions φ_a(x − r_i), where a = (ñ, l, m) denotes the orbital symmetry and r_i the atom on which the orbital is centered. We use the shorthand ⟨ia|Ĥ|jb⟩ = H_{ia,jb}(A) to denote the Hamiltonian matrix element between an orbital φ_a centered on atom i and an orbital φ_b centered on atom j of a structure A. Due to the presence of two atomic indices, these elements must be equivariant to permutations of the orbital labels associated with each atomic center. The rotations of each block of the matrix (involving all the corresponding m, m′ indices) can be decomposed into rotations of irreducible representations (irreps) of O(3); the transformation from the uncoupled |l m; l′ m′⟩ basis to the coupled angular (irrep) basis |λμ⟩ is effected through Clebsch-Gordan coefficients. We model each irreducible block separately, so that our targets have the form H^{pτλμ}_{ij}, where (λμ) is the SO(3) symmetry index, τ captures additional symmetries (e.g., inversion parity) associated with this block, and p enumerates the other variables, i.e., the angular (l, l′) and radial (ñ, ñ′) basis indices, as well as the chemical nature of the atoms. Given this symmetry-based decomposition of H, we can employ ML models of arbitrary complexity as long as they are equivariant to the same permutation and SO(3) symmetries. Here, we use linear models with descriptors ξ^{τλ}_μ(A_ij) with the same symmetries as the target:

H^{pτλμ}_{ij} = w_{pτλ} · ξ^{τλ}_μ(A_ij) + δ_{λ0} b_{pτλ},    (4)

where the w_{pτλ} are invariant weights and δ_{λ0} b_{pτλ} is the intercept for the scalar (λ = 0) blocks, which in our model we take to be zero. We stress here that this framework, while described for linear models, is equally compatible with any deep-learning scheme.

d. Symmetry-adapted features Our descriptors ξ^{τλ}_μ(A_ij) are the two-centered features described in Ref. 24, obtained as generalizations of the atom-centered density-correlation descriptors [52,56,57] to simultaneously represent multiple atomic centers and their connectivity. Briefly, these features rely on pair-density coefficients c_{nlm}(A_ij) as the core ingredients, which essentially specify the position of atom j relative to atom i through the expansion

g_{ij}(x) = Σ_{nlm} c_{nlm}(A_ij) R_{nl}(x) Y_l^m(x̂),    (5)

where g_{ij} is a smooth (e.g., Gaussian) density centered on the interatomic vector.
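The Clebsch-Gordan decomposition of a Hamiltonian block into its (λμ) irreducible components can be sketched for a p-p (l = l′ = 1) block. This minimal example builds the coupling matrix from Clebsch-Gordan coefficients via sympy (for complex spherical harmonics; real harmonics would require an additional unitary transformation, which is omitted here); the matrix is orthogonal, so the decomposition is exactly invertible:

```python
import numpy as np
from sympy.physics.quantum.cg import CG

l1, l2 = 1, 1  # a p-p block of the Hamiltonian: 3x3 in (m, m')

# Clebsch-Gordan matrix T mapping the uncoupled |l1 m1> x |l2 m2> basis
# to the coupled irreps |lambda mu>, lambda = 0, 1, 2.
T = np.array([
    [float(CG(l1, m1, l2, m2, lam, mu).doit())
     for m1 in range(-l1, l1 + 1) for m2 in range(-l2, l2 + 1)]
    for lam in range(abs(l1 - l2), l1 + l2 + 1)
    for mu in range(-lam, lam + 1)
])

# T is orthogonal (CG orthogonality relations), so nothing is lost.
dim = (2 * l1 + 1) * (2 * l2 + 1)
assert np.allclose(T @ T.T, np.eye(dim))

# Decompose a block into irreducible components and reconstruct it.
rng = np.random.default_rng(0)
H_block = rng.standard_normal((2 * l1 + 1, 2 * l2 + 1))
h_coupled = T @ H_block.ravel()          # the H^{lambda mu} targets
assert np.allclose((T.T @ h_coupled).reshape(3, 3), H_block)
```

Each slice of `h_coupled` (1 + 3 + 5 entries for λ = 0, 1, 2) is a regression target that transforms as a single irrep under rotation.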
In this form, the spatial description has been discretized on a more tractable basis of radial functions R_{nl} (GTO-style) and spherical harmonics Y_l^m(r̂) for the radial and angular degrees of freedom, respectively, as is commonly done in quantum-chemistry codes.
Summing over one of the indices (j) for all pairs within a set cutoff distance leads to a neighbor density c_{nlm}(A_i) = Σ_j c_{nlm}(A_ij), which describes the local correlations of a single center i and its neighbors. Combinations (through tensor products) of the neighbor density express higher-order correlations with multiple neighbors. On the other hand, the combination of the neighbor density with c_{nlm}(A_ij) yields a richer description of the specific pair of atoms (ij). In particular, the simplest such combination, c_{n′l′m′}(A_i) c_{nlm}(A_ij), describes the correlations of three atoms: the two centers i, j and the neighbors of i. The cutoff distance enforces locality of the descriptor and is usually an optimizable hyperparameter; however, a cutoff smaller than the interatomic separation between i and j means that there will be no features corresponding to the pair, and hence a zero prediction for all the associated Hamiltonian blocks (see, e.g., Figure S3). The choice of spherical harmonics as the angular basis and the implementation of the combinations (tensor products) through a Clebsch-Gordan coupling (similar to the one described for the Hamiltonian blocks) ensure rotational equivariance of the features; symmetry with respect to the permutation of the atom labels can be similarly enforced by averaging over the permutation group. We direct the interested reader to Ref. 24 for more details about the symmetrization of the features.
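A toy version of the pair and neighbor-density coefficients, restricted to a single Gaussian radial function and real spherical harmonics up to l = 1 (the function names and discretization choices here are illustrative, not the paper's implementation), demonstrates the key property that the per-l norm of the neighbor density is rotationally invariant:

```python
import numpy as np

def pair_coeffs(r_vec, sigma=0.3):
    """Pair-density coefficients c_{nlm}(A_ij) for one Gaussian radial
    function (n = 0) and real spherical harmonics up to l = 1.
    Returns a dict {l: array of 2l+1 coefficients}. Toy discretization."""
    r = np.linalg.norm(r_vec)
    radial = np.exp(-r ** 2 / (2 * sigma ** 2))
    y0 = np.array([1.0 / np.sqrt(4 * np.pi)])
    # real l = 1 harmonics, ordered (m = -1, 0, +1) ~ (y, z, x) / r
    y1 = np.sqrt(3 / (4 * np.pi)) * np.array([r_vec[1], r_vec[2], r_vec[0]]) / r
    return {0: radial * y0, 1: radial * y1}

def neighbor_density(r_vecs):
    """c_{nlm}(A_i) = sum_j c_{nlm}(A_ij): pair coefficients summed over neighbors."""
    coeffs = [pair_coeffs(v) for v in r_vecs]
    return {l: sum(c[l] for c in coeffs) for l in (0, 1)}

# Rotating the whole environment leaves sum_m |c_lm|^2 invariant for each l.
rng = np.random.default_rng(0)
neigh = rng.standard_normal((5, 3))
th = 0.7
R = np.array([[np.cos(th), -np.sin(th), 0.0],
              [np.sin(th),  np.cos(th), 0.0],
              [0.0, 0.0, 1.0]])
c, c_rot = neighbor_density(neigh), neighbor_density(neigh @ R.T)
for l in (0, 1):
    assert np.isclose(np.sum(c[l] ** 2), np.sum(c_rot[l] ** 2))
```

Equivariant (rather than merely invariant) combinations of these coefficients are what the Clebsch-Gordan couplings described above provide.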
e. sTDA Excitation energies were computed using Grimme's simplified Tamm-Dancoff approach (sTDA) [33,34], solving the TDA equations AX = ΩX, where X indicates the configuration-interaction-singles (CIS) amplitudes that describe the excitations. sTDA uses an approximated form for A, in which exchange-correlation terms are neglected and integrals are simplified:

A_{ia,jb} = δ_{ij} δ_{ab} (ε_a − ε_i) + 2(ia|jb)′ − (ij|ab)′,    (6)

where ε_i is the MO energy of the i-th orbital, the indices i, j refer to occupied orbitals, and a, b refer to virtual orbitals. The primed integrals are evaluated using a monopole approximation:

(pq|rs)′ = Σ_{A,B} q^A_{pq} γ_{AB} q^B_{rs},    (7)

where q^A_{pq} is the Löwdin transition charge between MOs p and q for atom A, the sums run over all the atoms of the molecule, and γ_{AB} is a Mataga-Nishimoto-Ohno-Klopman term [33]. The Löwdin transition charges are computed from the predicted MO coefficients C as

q^A_{pq} = Σ_{μ∈A} C_{μp} C_{μq},    (8)

where the sum runs over the atomic orbitals μ centered on atom A. Additional approximations are present in sTDA to speed up the calculation: the CI space is truncated using a user-defined threshold, followed by an additional selection of the most important electronic configurations to be included. For details we refer to the original publication [33].
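The monopole approximation and the Löwdin transition charges can be sketched directly from MO coefficients in an orthogonalized basis; `ao_to_atom` is again a hypothetical AO-to-atom map, and γ is passed in as a precomputed matrix:

```python
import numpy as np

def transition_charges(C, ao_to_atom, n_atoms):
    """Löwdin transition charges q^A_pq = sum_{mu in A} C_mu_p C_mu_q,
    with C the MO coefficients in a Löwdin-orthogonalized AO basis.
    Returns an array of shape (n_atoms, n_mo, n_mo)."""
    n_mo = C.shape[1]
    q = np.zeros((n_atoms, n_mo, n_mo))
    for mu, atom in enumerate(ao_to_atom):
        q[atom] += np.outer(C[mu], C[mu])
    return q

def monopole_integral(q, gamma, p, r, s, t):
    """(pr|st)' = sum_AB q^A_pr gamma_AB q^B_st (monopole approximation)."""
    return q[:, p, r] @ gamma @ q[:, s, t]

# Two AOs on two atoms, MOs identical to the AOs, gamma = identity:
# the "self-repulsion" (00|00)' reduces to a single on-site term.
q = transition_charges(np.eye(2), [0, 1], 2)
assert np.isclose(monopole_integral(q, np.eye(2), 0, 0, 0, 0), 1.0)
```

In the full method γ_{AB} would be the Mataga-Nishimoto-Ohno-Klopman interaction evaluated from interatomic distances.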
All the ingredients needed for sTDA (namely, the MO energies and the Löwdin transition charges) are available from the ML model, which makes coupling the ML model to sTDA straightforward. All our calculations are performed with an in-house implementation of sTDA in the spherical basis [58] written in PyTorch [59].
f. Vibronic Spectra The anthracene trajectory was generated with DFTB3/3OB dynamics in the NVT ensemble, using a Langevin thermostat with a coupling constant of 1 ps⁻¹. The temperature was set to 300 K. We used a timestep of 0.5 fs, for a total simulation time of 100 ps; coordinates were saved every 1 fs. The simulation was run with AMBER [61]. Vibronic spectra were computed using the second-order cumulant expansion formalism [62,63]. Starting from a correlated trajectory, the excitation energy U(t) is computed for each frame and used to evaluate its autocorrelation function c_UU(t) = ⟨U(t)U(0)⟩. The autocorrelation is used to calculate the spectral density function J(ω) encoding the vibronic coupling:

J(ω) = (βω/2) ∫_0^∞ c_UU(t) cos(ωt) dt.    (9)

The autocorrelation was damped so as to fall smoothly to zero in a time window of 10 ps. The vibronic homogeneous lineshape is obtained from the spectral density through the lineshape function g(t):

g(t) = (1/π) ∫_0^∞ (J(ω)/ω²) [coth(βω/2)(1 − cos(ωt)) − i (sin(ωt) − ωt)] dω,    (10)

from which the homogeneous absorption lineshape is computed as

σ(ω) ∝ ω ∫_{−∞}^{+∞} e^{i(ω−ω_eg)t} e^{−g(t)} dt.    (11)

Finally, the absorption spectrum is obtained by incorporating into the homogeneous spectrum a static disorder, modelled by random sampling from a Gaussian distribution with a full width at half maximum (FWHM) of 400 cm⁻¹. The absorption spectrum was computed with SPECDEN [64].

The optimizer, implemented in PyTorch with default parameters and a learning rate of 1, was used to train the model until the loss converged to a plateau, at which point the training was interrupted. The loss on the train and validation sets is shown in Figure S1.
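The first step of the cumulant pipeline, going from the energy-gap autocorrelation to a spectral density, can be sketched as a damped cosine transform; the prefactor convention and the window shape below are illustrative assumptions, not the settings of SPECDEN:

```python
import numpy as np

def spectral_density(c_uu, dt, beta, t_damp):
    """Spectral density from the energy-gap autocorrelation (second-order
    cumulant). Prefactor conventions vary in the literature; here
    J(w) = beta*w/2 * int_0^inf c_UU(t) cos(w t) dt, with a cosine-squared
    window damping c_UU to zero over t_damp. Sketch only."""
    t = np.arange(len(c_uu)) * dt
    window = np.where(t < t_damp, np.cos(0.5 * np.pi * t / t_damp) ** 2, 0.0)
    c_d = c_uu * window
    w = np.linspace(0.0, 0.5 * np.pi / dt, 2000)  # frequency grid
    integ = (c_d[None, :] * np.cos(w[:, None] * t[None, :])).sum(axis=1) * dt
    return w, 0.5 * beta * w * integ

# A gap autocorrelation oscillating at w0 yields a peak of J(w) near w0.
dt, w0 = 0.05, 3.0
t = np.arange(4000) * dt
w, J = spectral_density(np.cos(w0 * t), dt, beta=1.0, t_damp=100.0)
assert abs(w[np.argmax(J)] - w0) < 0.1
```

The damping window plays the same role as the 10 ps window used for the anthracene autocorrelation above.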

S2. TRAINING WITH A LARGER CUTOFF
We train our models with features computed for atoms within a cutoff distance of 3.5 Å. This makes it impossible to predict matrix elements for pairs of atoms separated by more than the cutoff. While this is surely a limitation when the interaction between distant atoms becomes significant, it is also beneficial from a model-regularization perspective. In fact, a small cutoff for computing the features reduces their complexity and allows for better transferability and robustness of our models. This can be confirmed by training a model with features computed with a larger cutoff and comparing its performance on extrapolated molecules with that of a model trained using features with a smaller cutoff.
Here, we target the B3LYP/def2-TZVP Hamiltonian, using features with a cutoff of 3.5 Å (the model shown in the main text) and with a larger cutoff of 5.5 Å. This latter cutoff is only moderately larger than the first, and many elements of the resulting Hamiltonian are still strictly zero. Nonetheless, the model exhibits clear signs of overfitting when it comes to extrapolating to completely out-of-sample molecules, as shown in Figure S2: the errors for C60 and β-carotene increase significantly with the larger cutoff. Thus, a larger cutoff yields a more powerful model that, unless trained extensively on a very large dataset, can overfit and lose overall generalizability.

S3. A ∆ML MODEL FOR THE ELECTRONIC HAMILTONIAN
In the main text, as well as in the preceding section, we discussed the limitations associated with using a small feature cutoff while training a model, and also highlighted how using a larger cutoff instead is detrimental to the model's generalization capability.
It is also known that the use of local features poses problems when long-range interactions become important [68][69][70]. While this limitation can be lifted in future refinements of the model through the use of long-range features [71], we show here an alternative approach that mitigates the limitations arising from the use of a small cutoff by including long-range interactions through a ∆ML model.
We use a B3LYP/STO-3G QM calculation as our baseline model, from which we obtain a zeroth-order Fock matrix. This baseline is then corrected with a ∆ Fock matrix learned with ML in the very same way as the model discussed in the main text. We are therefore able to include long-range interactions within the Fock matrix at a lower cost, since they are computed with a low-level QM calculation. A similar strategy has been reported in Ref. [31], although here it is used to learn an effective Fock matrix with a minimal basis size and symmetries, targeting the properties of a Fock matrix in a larger basis.
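The ∆ML combination itself is a sum of baseline and correction followed by a diagonalization; this sketch uses placeholder matrices and checks that a zero correction recovers the baseline:

```python
import numpy as np

def delta_ml_predict(H_base, H_delta):
    """Delta-ML Fock matrix: low-level QM baseline plus learned correction.
    The correction is symmetrized so the total stays Hermitian."""
    H = H_base + 0.5 * (H_delta + H_delta.T)
    eps, C = np.linalg.eigh(H)  # MO energies and coefficients
    return eps, C

rng = np.random.default_rng(0)
n = 4
H_base = rng.standard_normal((n, n))
H_base = 0.5 * (H_base + H_base.T)

# With a zero correction, the model reduces exactly to the baseline.
eps, C = delta_ml_predict(H_base, np.zeros((n, n)))
assert np.allclose(eps, np.linalg.eigvalsh(H_base))
```

In training, the resulting MO energies and Löwdin charges would enter the loss against the higher-level reference.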
We denote the baseline Fock matrix obtained with, e.g., STO-3G as H^base_QM, and the ∆ Fock matrix learned with ML as H^∆_ML; the Fock matrix H predicted by the model is then given as

H = H^base_QM + H^∆_ML.

This matrix is diagonalized to obtain MO energies and Löwdin charges, and H^∆_ML is trained so that the MO energies and charges are as close as possible to those of a higher-level, large-basis QM calculation (e.g., B3LYP/def2-TZVP).

Vibronic spectrum in a 10 ps window of the anthracene trajectory, computed with ML (blue) and with sTDA B3LYP/def2-TZVP (orange). The static disorder is not added in this spectrum, to better compare the vibronic peaks of the two spectra. The vertical dashed line denotes the peak missing from the ML prediction. The 95% confidence interval of the mean is reported for the ML model.

FIG. 1. Schematic representation of the indirect learning framework. (a) An ML model designed to target the molecular-orbital (MO) energies ε_n and the Löwdin charges q by using an equivariant model of the single-particle Hamiltonian as an intermediate layer. Cross-hatching shows the Hamiltonian blocks for each pair of atoms (i, j) in the structure, which are modeled by learnable functions f with corresponding input features ξ. Arrows are color-coded to distinguish data-driven predictions from physics-based approximations. (b) We compare different training protocols, targeting the elements of the Hamiltonian (model 1), the minimal-basis eigenvalues (model 2), and both the eigenvalues and Löwdin charges (model 3). (c) Illustration of a model that is trained on charges and a selection of MO energies computed with a larger basis. Once the effective Hamiltonian is learned, it can be used to compute other electronic-structure quantities beyond those used during training.
TABLE I. Performance comparison of the three different ML models: errors on the MO energies and the Löwdin charges obtained from models 1, 2, and 3 for minimal-basis targets. L = MSE_H is the mean squared loss on the Hamiltonian used in model 1, L = MSE_ε is the mean squared loss on the MO energies used in model 2, and L = MSE_{ε,q} is the sum of mean squared losses on the MO energies and the Löwdin charges used in model 3.

FIG. 3. Performance of the ML model on large-basis targets. (a) The hydrocarbons in the training dataset: ethane, ethene, butadiene, hexane, hexatriene, isoprene, and styrene. The two panels on the right show the performance of the ML model on test geometries of the training molecules, for the MO energies (b) and the Löwdin charges (c). The target is always computed with B3LYP/def2-TZVP; the prediction is either the ML one (blue points) or the STO-3G calculation (orange points). The mean absolute error (MAE) is reported for both. Different shades of blue and orange correspond to the different molecules (panel (a)). A detailed plot showing the prediction for each molecule separately is reported in Figure S6. (d) Comparison of the HOMO and LUMO molecular orbitals between the ML prediction in a minimal pseudo-basis and the target B3LYP/def2-TZVP.

FIG. 4. Prediction of the electronic excited states. Performance of the ML model in predicting the excitation energies of the first two singlet excited states. The target is computed with sTDA B3LYP/def2-TZVP; the prediction is computed by coupling the ML prediction with sTDA. The blue circles and the yellow diamonds represent the first and second excited states, respectively. The MAE is reported for both.

FIG. 5. Timings of the ML model and QM calculations. Time to compute excited-state energies for molecules with an increasing number of electrons. The B3LYP/def2-TZVP time (orange) and the B3LYP/STO-3G time (green) are computed as the sum of the DFT calculation and sTDA. The ML time (blue) is the sum of the featurization time, the prediction time, and sTDA. Each timing is measured on a single core of an Intel Xeon Gold 5120 processor. B3LYP/def2-TZVP and B3LYP/STO-3G calculations are performed with PySCF. More details are in Table S5 in the SI.

FIG. 6. Generalization performance of the ML model. The target is B3LYP/def2-TZVP, coupled with sTDA to obtain the excitation energies. (a) ML prediction of the excitation energy of the first singlet excited state of octatetraene and decapentene. The MAE is reported for both molecules; the subscript "oct" refers to octatetraene, "dec" to decapentene. (b) Distribution of absolute errors for the MO energies of polyalkenes with a progressively longer conjugated chain. The dashed lines are the first, second, and third quartiles of the distribution. (c) ML prediction of the excitation energy of the first singlet excited state for polyalkenes with from six up to ten double bonds. (d) ML prediction of the excitation energy of the first singlet excited state for some aromatic molecules; the transition density is visualized on the right. For the ML model, we have used the STO-3G basis to obtain the transition-density cubes. (e) ML prediction of the MO energies of C60. (f) ML prediction of the MO energies of β-carotene.

FIG. 7. Vibronic spectrum of anthracene. (a) Spectral density predicted with the ML model (solid blue line) and corresponding 95% confidence interval. The spectral density is an average over several windows of the 100 ps MD trajectory. The spectral density for the target sTDA B3LYP/def2-TZVP (orange dashed line) is reported for a single window of 10 ps, alongside the corresponding ML prediction (blue dashed line). (b) Absorption spectrum predicted with the ML model (blue) and compared with the experimental spectrum [40] (orange). The ML spectrum is the average over 100 ps of anthracene MD; the blue interval denotes the 95% confidence interval of the mean. The vertical dashed black line in (a) denotes a peak around 400 cm⁻¹ that is not captured by the ML model. The black dashed line in (b), shifted by 400 cm⁻¹ from the grey line, denotes the corresponding missing vibronic shoulder. The normal mode at around 400 cm⁻¹ is shown in the anthracene sketch in panel (a).

ACKNOWLEDGMENTS

Funding:
BM, LC, and EC acknowledge funding from the European Research Council (ERC) under Grant ERC-AdG-786714 (LIFETimeS). MC, DS, and JN acknowledge funding from the European Research Council (ERC) under the research and innovation program (Grant Agreement No. 101001890-FIAMMA) and the NCCR MARVEL, funded by the Swiss National Science Foundation (SNSF, grant number 182892). Author contributions: Conceptualization: MC, BM,

FIG. S1. Loss versus epoch curves for the training and validation losses. The MSE on MO energies ε (first term in Eq. (S1)) and the MSE on Löwdin charges (second term in Eq. (S1)) are shown separately.
FIG. S2. Effect of the feature cutoff on model performance. Mean absolute errors (MAE) when features are computed with the larger cutoff compared to the smaller one, for (a) C60 and (b) β-carotene. The MAE is reported including the valence MOs only; when core MOs are included, the MAE is considerably larger.
FIG. S3. Illustration of the ∆ML model idea. A baseline Fock matrix H^base_QM is computed with a low-level QM calculation, which can be DFT with a smaller basis (e.g., STO-3G as in our example) or a semiempirical method such as PM6. An ML model is trained to learn a ∆ Fock matrix H^∆_ML that corrects the baseline so as to obtain the final Fock matrix. Long-range interactions that are not present in the ML model (empty blocks in the H^∆_ML matrix shown) are accounted for by the baseline calculation.
FIG. S4. Comparison of the ML and ∆ML models. Predictions on polyalkenes with from 6 up to 10 double bonds. (a) Prediction of the HOMO-LUMO gap. (b) Prediction of the S1 excitation energy.
FIG. S10. Target versus predicted HOMO-LUMO gaps for polyalkenes. For polyalkenes with from 6 up to 10 double bonds, we compare the HOMO-LUMO gap computed at the B3LYP/def2-TZVP level from QM with the values predicted by the model.
FIG. S11. Excited states of naphthalene: sTDA B3LYP/def2-TZVP vs. sTDA ML predictions. Transition density of the first and second excited states of naphthalene, as computed with sTDA B3LYP/def2-TZVP (left) and with sTDA ML (right). The La state (predominant HOMO → LUMO character) is the S1 state according to sTDA B3LYP/def2-TZVP, with a close-lying Lb state (nearly equal contributions of HOMO−1 → LUMO and HOMO → LUMO+1) about 130 meV above in energy. The order is inverted with sTDA ML, which predicts the Lb state about 30 meV below the La state.
FIG. S12. Predicted Löwdin charges for extrapolated molecules. Prediction of atomic Löwdin charges for (a) C60 and (b) β-carotene. The target charges are computed with B3LYP/def2-TZVP. The MAE is reported in both cases. Note that the scale of panel (a) is multiplied by 10⁻⁶: the charges should be zero because of symmetry, and the data simply show that the ML model has less numerical noise than the reference calculations.