Machine Learning for Electronically Excited States of Molecules

Electronically excited states of molecules are at the heart of photochemistry, photophysics, and photobiology, and also play a role in materials science. Their theoretical description requires highly accurate quantum chemical calculations, which are computationally expensive. In this review, we focus not only on how machine learning is employed to speed up such excited-state simulations but also on how this branch of artificial intelligence can be used to advance this exciting research field in all its aspects. Discussed applications of machine learning for excited states include excited-state dynamics simulations and static calculations of absorption spectra, as well as many others. To put these studies into context, we discuss the promises and pitfalls of the involved machine learning techniques. Since the latter are mostly based on quantum chemistry calculations, we also provide a short introduction to excited-state electronic structure methods and approaches for nonadiabatic dynamics simulations, and describe tricks and problems encountered when using them in machine learning for the excited states of molecules.


From Foundations to Applications
In recent years, machine learning (ML) has become a pioneering field of research and has an increasing influence on our daily lives. Today it is a component of almost all applications we use. For example, when we talk to Siri or Alexa, we interact with a voice assistant and make use of natural language processing. 1,2 ML is applied for refugee integration, 3 for playing board games, 4 in medicine, 5 for example for image recognition, 6 or for autonomous driving. 7 A short historical overview of general ML is provided in ref 8. Recently, ML has also gained increasing interest in the field of quantum chemistry. 9,10 The power of (big) data-driven science is even seen as the "fourth paradigm of science", 11 which has the potential to accelerate and enable quantum chemical simulations that were considered unfeasible just a few years ago. In general, the field of ML in quantum chemistry is progressing ever faster. In this review, we focus on an emerging part of this field, namely ML for electronically excited states. In doing so, we concentrate on singlet and triplet states of molecular systems, since almost all existing approaches of ML for the excited states focus on singlet states and only a few studies consider triplet states. 12-15 We note that electron detachment or uptake further leads to doublet and quartet states, and even higher spin multiplicities, such as quintets, sextets, etc., are common in transition metal complexes, where an important task is to identify which multiplicity yields the lowest energy and is thus the ground state. 15 Refs 16-19 give a good overview of such processes.
The theoretical study of the excited states of molecules is crucial to complement experiments and to shed light on many fundamental processes of life and nature. 20 For example, photosynthesis, human vision, photovoltaics, or photodamage of biologically relevant molecules are results of light-induced reactions. 20-37 Experimental techniques like UV/visible spectroscopy or photoionization spectroscopy 38-45 lack the ability to directly describe the exact electronic mechanisms of photo-induced reactions. The theoretical simulation of the corresponding experiments can go hand-in-hand with experimental results and can provide the missing details of photodamage and -stability of molecules. 20,27,35,44 However, the computation of the excited states is highly complex, costly, and often necessitates expert knowledge. 68 As ML models have only recently been applied in the field of photochemistry, keeping track of the approaches is still possible and this field is still in its initial stage.
Due to the multi-faceted photochemistry of molecular systems, ML models can target this research field in many different ways, which are summarized in Figure 1. For example, the choice of relevant molecular orbitals for active space selection can be assisted with ML. 70 The fundamentals of quantum chemistry, e.g., obtaining an optimal solution to the Schrödinger equation or to the equations of Density Functional Theory, can themselves be central ML applications. For the ground state, ML approximations to the molecular wave function 71-79 or to the density (functional) of a system exist. 69,79-88 Obtaining a molecular wave function from ML can be seen as the most powerful approach from many perspectives, as any property we wish to know could be derived from it. Unfortunately, such models for the excited states are lacking and have so far only been investigated for a one-dimensional system, 89 leaving much room for improvement.
Most ML studies instead focus on predicting the output of a quantum chemical calculation, the so-called "secondary output". 69 Hence they fit a manifold of energetic states of different spin multiplicities, their derivatives, and properties thereof. With respect to different spin states of molecular systems, only a few studies exist, which predict spins of transition metal complexes 15 or singlet and triplet energies of carbenes 12 of different composition, or which focus on the conformational changes within one molecular system 13,90,91 for the sake of improving molecular dynamics (MD) simulations. The energies of a system in combination with its properties, i.e., the derivatives, the coupling values between the states, and the permanent and transition dipole moments, 13,14,90-97 can be used for MD simulations to study the temporal evolution of a system in the ground state and in the excited states. 13,14,90-92,130,136-149 With energies and different properties, tertiary outputs can be computed, such as absorption, ionization, or X-ray spectra, 150-153 gaps between the HOMO (highest occupied molecular orbital) and LUMO (lowest unoccupied molecular orbital), or vertical excitation energies. 154-157 In addition, quantum chemical outputs can also be analyzed or fitted in a direct way; e.g., reaction kinetics as results of dynamics simulations can be mapped to a set of molecular geometries and predicted with ML models. 158 Excitation energy transfer properties can be learned, 159,160 and structure-property correlations can be explored to design materials with specific properties. 16,76,131,152,161-170

Scope and Philosophy of this Review
ML for the excited states is developing at a slower pace than the exploding field of ML for the electronic ground state. 168,171-173 The reason is, in our opinion, mainly the complexity and high expense of the underlying reference calculations and the associated complexity of the corresponding ML models. Simulation techniques to understand excited-state processes are not yet viable for many applications at an acceptable cost and accuracy. Therefore, within this review we also want to highlight the existing problems of quantum chemical approaches that might be solvable with ML and put emphasis on identifying challenges and limitations that hamper the application of ML for the excited states. The young age of this research field leaves much room for improvement and new methods.

Figure 1: Targets of ML for the excited states of molecules. All areas of excited-state quantum chemistry (QC) calculations can be enhanced with ML, ranging from input to primary outputs that are used in the computation of secondary outputs, which in turn are employed to calculate tertiary outputs. Analysis can be carried out at all stages. This classification is inspired by the one in Ref. 69.

This review is structured as follows: (1) We will (non-exhaustively) discuss ground-state processes, since they are inherently linked to excited-state processes and should also be considered here. We will therefore start by discussing the differences between the ground-state potential energy hypersurfaces (PESs) and the excited-state PESs and will also emphasize the differences in their properties in section 2.
(2) Section 3.1 gives an overview of the theoretical methods that can be used to describe the excited states of molecules. In the forthcoming discussion, we will describe different reference methods with a view to their application in time-dependent simulations, namely MD simulations. 27,172 It is worth mentioning that, unlike for the ground state, where many different methods can provide reliable reference computations for training, choosing a proper quantum chemistry method for the treatment of excited states is a challenge of its own. Many methods require expert knowledge, further prohibiting their use. 37,174 In addition, not every method can provide the necessary properties for every type of application. Subsequently, we aim to review the different flavors of excited-state MD simulations with a focus on those methods that have lately been enhanced with ML models.
(3) After having provided the basic theoretical background, we will discuss how to generate a compact, yet comprehensive training set for the excited states from quantum chemistry data. We will summarize the existing approaches that are applied to create such a training set and put emphasis on the bottlenecks of existing methods that can also limit the application of ML. This will provide the reader with knowledge about starting points for future research questions and clarify where method development is needed. It further provides the basis for the discussion of ML models for the excited states of molecular systems.
(4) A summary of state-of-the-art ML methods for photochemistry follows. We will differentiate between single-state and multi-state ML models, as well as between single-property and multi-property ML models. 93 As mentioned before, ML models can tackle a quantum chemical calculation in many different ways, see Figure 1. The different ML models will be classified according to the ways in which they enhance quantum chemical simulations. Most approaches aim at providing an ML-based force field for the excited states, so most focus will be put on this topic. Finally, the prospects of ML models to revolutionize this field of research and future avenues for ML will be highlighted.
Notably, we focus on the excited states of molecules, as the excited electronic states in the condensed phase are challenging to fit and are thus often not explicitly considered in conventional approaches. 175-180 In solid-state physics, for example, the electronic states are usually treated as continua. The density of states at the Fermi level, 181 band gaps, 182-184 and electronic friction tensors 123,185,186 have been described with ML models to date, and especially the electronic friction tensor is useful to study the indirect effects of electronic excitations in materials. 187-192 Electron transfer processes as a result of electron-hole-pair excitations can further be investigated, along with multi-quantum vibrational transitions, by discretizing the continuum of electronic states and fitting them (often manually) to reproduce experimental or quantum chemical data in a model Hamiltonian. 178,193-198 Yet, to the best of our knowledge, the excited electronic states in the condensed phase have not been fitted with ML. A recent review on reactive and inelastic scattering processes and the use of ML for quantum dynamics in the gas phase and at gas-phase interfaces can be found in ref 199. Besides the electronic excitations that take place in molecules after light excitation, ML models have also successfully entered research fields that focus on other types of excitations. These are, for example, vibrational or rotational excitations giving rise to Raman or infrared spectra, 43,109,200-204 nuclear magnetic resonance, 205 or magnetism, 206,207 which we will also not consider in this review.
2 General Background: From the Ground State to the Excited States

Figure 2: Excited-state processes that can take place after excitation of a molecule by light. Absorption of light can make the molecule enter a higher electronic singlet state. Intersystem crossing to a triplet state or internal conversion to another state of the same spin multiplicity can take place. Radiative emission, i.e., fluorescence and phosphorescence, are possible reactions from an excited singlet and triplet state, respectively.

Figure 2 gives an overview of the excited-state processes that will be discussed within this review. It shows a schematic one-dimensional representation of the potential energy curves for the ground and excited states as a function of molecular coordinates. Figure 2 illustrates that the ground-state potential energy curve, given by a dark-blue solid line, is mostly a smooth function of the reaction coordinate and features several local minima. In the ground state, many methods exist to describe the physico-chemical properties of molecules and materials reasonably well, ranging from small systems up to proteins, DNA, or nanoparticles. For small system sizes, highly accurate ab-initio methods can be applied, while cruder approximations have to be used for larger systems. The unfavorable scaling of many quantum chemical methods with the size of the system under investigation requires this compromise between accuracy and system size; crude approximations become inevitable for systems larger than several hundred atoms. 20,24,37 The chemistry we are interested in, however, is not static, but rather depends to a large extent on the changes that matter undergoes. In this regard, it is more intuitive to study the temporal evolution of a system. Much effort has been devoted to developing methods to study the temporal evolution of matter in the ground-state potential.
As an example, potential energy functions can be obtained with conventional force fields, such as AMBER, 208 CHARMM, 209 or GROMOS. 210,211 The first of these already date back to the 1940s-1950s. Such force fields enable the study of large and complex systems, protein dynamics, or binding free energies on time scales up to a couple of nanoseconds. 175,212-220 However, their applicability is restricted by their limited accuracy and inability to describe bond formation and breaking. Novel approaches, such as reactive force fields, exist, but have not yet entered the mainstream and still face the problem of generally low accuracy. 221 The accuracy of ab-initio methods can be combined with the efficiency of conventional force fields through ML models. The latter have been shown to advance simulations in the ground state considerably and allow for the fitting of almost any input-output relation. 135,172,222 Accurate and reactive PESs of molecules in the ground state can be obtained with a comprehensive reference data set, which contains the energies, forces, and ground-state properties of the system under investigation. Proper training of an ML model then ensures that the accuracy of the reference method is retained, while inferences can be made much faster. In this way, such models allow for a description of reactions and can overcome the limitations of existing force fields. 133,171,223-227 Regarding the excited states, processes become much more complex and the computation of excited-state PESs is far more difficult than the computation of the ground-state PESs. As can be seen in Figure 2, many different classes of excited states, e.g., singlet states (shown as continuous blueish lines) or triplet states (shown as dashed reddish lines), have to be accounted for, which are characterized by several transition states, local minima, and crossing points.
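To make the ML-as-force-field idea concrete, the following sketch fits a minimal kernel ridge regression model to energies of a one-dimensional model ground-state PES. The Morse parameters, kernel width, and regularization are arbitrary illustrative choices (they are not taken from any of the cited works); in practice, the training energies would come from ab-initio reference calculations.

```python
import numpy as np

def morse(r, d_e=0.176, a=1.02, r_e=1.40):
    """Model ground-state PES (Morse potential, atomic units); stands in for ab-initio data."""
    return d_e * (1.0 - np.exp(-a * (r - r_e))) ** 2

# "Reference calculations": energies at 25 sampled geometries
r_train = np.linspace(1.0, 4.0, 25)
e_train = morse(r_train)

def kernel(x1, x2, sigma=0.5):
    """Gaussian kernel between two sets of 1D geometries."""
    return np.exp(-((x1[:, None] - x2[None, :]) ** 2) / (2.0 * sigma ** 2))

# Train: solve (K + lambda * I) alpha = E
lam = 1e-8
alpha = np.linalg.solve(kernel(r_train, r_train) + lam * np.eye(r_train.size), e_train)

# Inference at unseen geometries is a cheap matrix-vector product
r_test = np.linspace(1.1, 3.9, 100)
e_pred = kernel(r_test, r_train) @ alpha
max_err = float(np.max(np.abs(e_pred - morse(r_test))))
print(f"max interpolation error: {max_err:.2e}")
```

Once trained, each energy evaluation costs a single kernel row instead of a full electronic structure calculation; for the excited states, the same recipe must fit several coupled surfaces at once, which is where the difficulties discussed below arise.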
This complexity makes a separate treatment of each electronically excited state inaccurate and leads to further challenges that prohibit the straightforward and large-scale use of many existing quantum chemical methods and, consequently, also of existing ML models for the ground state.
Additionally, computations of the excited states suffer from being generally less efficient. To name only one central problem: The larger the system becomes, the closer the electronic states lie in energy, and the more excited-state processes can usually take place. The necessary consideration of an increasing number of excited states increases the already substantial computational expense even more and restricts the use of accurate methods to systems containing only a few dozen atoms, if results are to be obtained in a reasonable amount of time with current computers. This increasing complexity makes not only the reference computations, but also the application of ML models for the excited states more complicated than for the ground state. At the same time, the application of ML models for the excited states might also be more promising, because higher speed-ups can be achieved.
For the excited states, methods similar to force fields, like the linear vibronic coupling (LVC) approach, 228,229 are usually limited to small regions of conformational space and restricted to a single molecule. General force fields that are valid for different molecules in the excited states do not exist. The ML analogue, so-called transferable ML models, to fit the excited-state PESs of molecules throughout chemical compound space is likewise unavailable to date. Nevertheless, it is beyond question that an ML model capable of describing the photochemistry of several different molecular systems, e.g., different amino acids or DNA bases of different sizes, is highly desirable. A lot remains to be done in order to achieve this goal; to the best of our knowledge, no more than about 20 atoms and 3 electronic states of distinct multiplicity have yet been fitted accurately with ML models. 13,14,90-92,94,130,136-149 Whether or not the excited states of a molecular system become populated depends on the ability of the molecule to absorb energy in the form of light or, more generally, electromagnetic radiation of a given wavelength. Usually, the so-called resonance condition has to be fulfilled, i.e., the energy gap between two electronic states has to be equivalent to the photon energy of the incident light. Note, however, that multi-photon processes can also occur, where several photons have to be absorbed at once to bridge the energy difference between two electronic states. 230-232 Further, the absorption of light most often provides access not to one, but to a manifold of energetically close-lying states. The number of states that can be excited is related to the range of photon energies contained in the electromagnetic radiation.
This energy range is inversely proportional to the duration of the electric field, e.g., of a laser pulse, due to the Fourier relation of energy and time. 233 However, the energy range of the photons and the energy difference between the electronic states are not the only factors influencing the absorption of light, which gives rise to questions like: Is the molecule able to absorb light of a considered wavelength? Which of the excited states is populated with the highest probability?
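The energy-time Fourier relation can be checked numerically. For a transform-limited Gaussian pulse, the product of the standard deviations of energy and time satisfies ΔE·Δt ≥ ħ/2; the sketch below evaluates the minimum bandwidth for a few pulse durations (the durations are arbitrary example values).

```python
HBAR_EV_S = 6.582119569e-16  # reduced Planck constant in eV*s

def min_bandwidth_ev(dt_s):
    """Minimum energy spread (std. dev., eV) of a Gaussian pulse with
    temporal std. dev. dt_s (seconds): dE = hbar / (2 * dt)."""
    return HBAR_EV_S / (2.0 * dt_s)

for dt_fs in (5.0, 10.0, 50.0):
    de = min_bandwidth_ev(dt_fs * 1e-15)
    print(f"{dt_fs:5.1f} fs pulse -> bandwidth >= {de * 1e3:7.3f} meV")
```

A few-femtosecond pulse thus spans tens of meV and can coherently address several close-lying electronic states, while a long pulse selects a narrow energy window.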
An answer to these questions can be obtained from an analysis of the oscillator strength. In order to make an electronic transition possible, an oscillating dipole must be induced as a result of the interaction of the molecule with light. The oscillator strength, $f^{\mathrm{osc}}_{ij}$, between two electronic states, $i$ and $j$, is proportional to the respective transition dipole moment, $\boldsymbol{\mu}_{ij}$, and the respective energy difference, $\Delta E_{ij}$; in atomic units (a.u.), 234

$$ f^{\mathrm{osc}}_{ij} = \frac{2}{3}\, \Delta E_{ij}\, |\boldsymbol{\mu}_{ij}|^2 $$

If the transition dipole moment between two states is zero, no transition is allowed. One reason can be that a change of the electronic spin would be required, making the transition spin forbidden. Another reason can be molecular symmetry, leading to symmetry-forbidden transitions. The latter are common in molecules that possess an inversion centre, where transitions that conserve parity are forbidden. 235 An electronic state is called dark if the transition dipole moment is very small or zero. In contrast, a state is called bright if the transition dipole moment is large. Most often, studies that target the photochemistry of molecules focus on excitation to the lowest bright singlet state, i.e., the state that absorbs most of the incident energy.
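The standard atomic-units expression for the oscillator strength, f = (2/3) ΔE |μ|², is easy to evaluate; the sketch below does so for a hypothetical transition (the gap and dipole values are invented for illustration, not taken from any calculation).

```python
def oscillator_strength(delta_e_au, mu_au):
    """f_ij = (2/3) * dE_ij * |mu_ij|^2, all quantities in atomic units."""
    mu_sq = sum(c * c for c in mu_au)
    return (2.0 / 3.0) * delta_e_au * mu_sq

EV_TO_HARTREE = 1.0 / 27.211386

# Hypothetical bright transition: 4 eV gap, transition dipole (0.5, 0, 0) a.u.
f_bright = oscillator_strength(4.0 * EV_TO_HARTREE, (0.5, 0.0, 0.0))
# A vanishing transition dipole (spin- or symmetry-forbidden) gives a dark state
f_dark = oscillator_strength(4.0 * EV_TO_HARTREE, (0.0, 0.0, 0.0))
print(f_bright, f_dark)
```

The bright/dark distinction thus reduces to the magnitude of the transition dipole moment at a fixed excitation energy.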
After an excitation process, the molecule is considered to move on the excited-state PESs and is expected to undergo further conversions. The excess energy a molecule carries as a result of the initial absorption is most often converted into heat, into light, such as fluorescence or phosphorescence, or into chemical energy. If the molecule returns to its original state, the molecule is photostable. Otherwise, either photodamage, such as decomposition, or useful photochemical reactions including bond breaking/formation occur. In all cases, heat or light can be emitted, which can also be harnessed in light-emission applications. 27,236-238 With respect to photostability, ultrafast transitions in the range of femto- to picoseconds (10^-15 to 10^-12 seconds) take place and lead the molecule back to the ground state. This means the electronic energy is converted into vibrations of the molecule, and the molecule is termed hot. This heat is usually dissipated into the environment, a process that is often neglected in excited-state simulations due to the cost of describing the surrounding molecules.
Radiationless transitions from one electronic state to another take place in so-called critical regions of the PESs. As the name already suggests, critical regions are crucial for the dynamics of a molecule, but they are also challenging to model accurately. A transition from one state to another that conserves the spin multiplicity is called internal conversion. Furthermore, states of different spin multiplicities may be accessible via intersystem crossing. The critical points, where transitions are most likely to occur, are called conical intersections and are illustrated in Figure 2. At these crossing points, PESs computed with quantum chemistry can show discontinuities. These discontinuities can also occur in other excited-state properties and pose an additional challenge for an ML model when fitting excited-state quantities.
In addition to the aforementioned complications of treating a manifold of excited states, the probability of a radiationless transition between them also has to be computed. This probability is usually determined by couplings between two approaching PESs. Between states of the same spin multiplicity, nonadiabatic couplings (NACs) arise, while spin-orbit couplings (SOCs) give rise to the transition probability between states of different spin multiplicities. These couplings are intimately linked to the excited-state PESs and therefore should also be considered with ML. However, only a handful of publications describe couplings with ML, 13,90,92-94,138,143,144,147,239 which highlights the difficulty of providing the necessary reference data as well as the challenge of fitting them accurately. New methods are constantly needed to further enhance this exciting research field.
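A minimal two-state model illustrates why such couplings are hard to fit. Assume (purely for illustration) diabatic potentials ±x coupled by a constant c; diagonalizing H(x) = [[x, c], [c, -x]] gives adiabatic surfaces whose gap closes to 2c at the avoided crossing, while the NAC magnitude for this model, |d12(x)| = c / (2(x² + c²)), spikes to 1/(2c) there:

```python
import numpy as np

c = 0.01  # diabatic coupling strength (arbitrary model value)
xs = np.linspace(-0.5, 0.5, 1001)

def adiabatic_gap(x):
    """Gap between the eigenvalues of the diabatic model H = [[x, c], [c, -x]]."""
    e_lo, e_hi = np.linalg.eigvalsh(np.array([[x, c], [c, -x]]))
    return e_hi - e_lo

gaps = np.array([adiabatic_gap(x) for x in xs])
nac = c / (2.0 * (xs ** 2 + c ** 2))  # analytic |d12| for this model

print(f"minimum gap: {gaps.min():.4f} (= 2c)")
print(f"peak NAC:    {nac.max():.1f} (= 1/(2c))")
```

The smaller the coupling c, the smaller the minimum gap but the taller and narrower the NAC peak; this near-singular behaviour close to (conical) intersections is exactly what an ML model for couplings must reproduce.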

Quantum Chemical Theory and Methods
In this section, we present some key aspects of quantum theory for excited states, because (i) the outcomes of the corresponding calculations serve as training data for ML and (ii) we wish to clarify the employed nomenclature. We put special emphasis on describing the differences between excited-state computations and computations in the ground state, and on the challenges that arise from the treatment of a manifold of excited states. These challenges also point at issues that are problematic for ML. These explanations will provide the groundwork to evaluate different quantum chemical methods for their use in generating a training set for ML and for different types of applications, such as excited-state MD simulations. Naturally, we can only provide a general idea of this field and refer the interested reader to pertinent textbooks and reviews, such as Refs. 26,29,36,37,240-248. In order to follow a consistent notation within this review, we try to explain all basic concepts with notations that are frequently used in the literature. Currently, a zoo of different notations for the same property can be found. For example, the NACs, or derivative couplings, are sometimes referred to as so-called interstate couplings, i.e., couplings between two states multiplied by the corresponding energy gap between those two states, 142 while in other works interstate couplings refer to off-diagonal elements of the Hamiltonian in another basis, in which the potential energies are not eigenvalues of the electronic Schrödinger equation. We want to avoid confusion between the different notations and thus provide a consistent definition below. For the excited states, a number of different electronic states are required. Throughout this review, we adopt the following labelling convention for different electronic states: lower-case Latin letters, i, j, etc., will be used to denote different electronic states.
The abbreviations N S , N M , and N A will indicate the number of states, molecules and atoms, respectively.
The foundation for the following sections is a separation of electronic and nuclear degrees of freedom, which is based on the work of Born and Oppenheimer. 249 However, the famous Born-Oppenheimer approximation is later on (partly) lifted and the coupling between electrons and nuclei is taken into account in nonadiabatic dynamics simulations.

Electronic Structure Theory for Excited States
The main goal when carrying out an electronic structure calculation is usually to compute the potential energy and other physico-chemical properties of a compound. We distinguish between two overarching theories to achieve this goal: Wave Function Theory (WFT) and Density Functional Theory (DFT), as outlined, e.g., by Kohn in his Nobel lecture. 250 The basis of WFT, as for any electronic structure calculation, is the electronic Schrödinger equation 251,252 with the electronic Hamilton operator, $\hat{H}_{el}$, and the N-electron wave function $\Psi_i(\mathbf{R}, \mathbf{r})$ of electronic state $i$, which depends on the electronic coordinates $\mathbf{r}$ and parametrically on the nuclear coordinates, $\mathbf{R}$:

$$ \hat{H}_{el}\, \Psi_i(\mathbf{R}, \mathbf{r}) = E_i(\mathbf{R})\, \Psi_i(\mathbf{R}, \mathbf{r}) $$

From the wave function, the eigenvector of this eigenvalue equation, any property of the system under investigation can be derived. How to solve the electronic Schrödinger equation exactly to obtain the potential energy of an electronic state $i$, $E_i$, is known in theory. However, from a practical point of view, the computation is infeasible for molecules more complex than, for example, H$_2$, He$_2^+$, and similar systems. 253 In order to make the computation of larger and more complex systems viable, approximate wave functions are introduced.
In contrast to WFT, DFT reformulates the energy of a system in terms of the ground-state electron density rather than the N-electron wave function, and the energy is expressed as a functional thereof. The advantage of DFT over WFT is a rather high accuracy at a rather low computational cost. If applied properly, DFT is considered one of the most efficient ways to obtain reliable and reasonably accurate results for molecules of up to hundreds of atoms. In solid-state physics, DFT is even the workhorse of most studies aiming to describe ground-state properties. 254 However, the problem is that the exact equations to be solved are unknown. The missing piece is the exact exchange-correlation functional of a system. To date, researchers have come up with many different approximations to this functional that can be used to treat specific problems, but a universal functional capable of describing different problems equally accurately has not yet been found. Moreover, there is no systematic way to improve a density functional. The results obtained with DFT therefore depend critically on the choice of the functional. 253,255 In the following sections, we will describe both theories in the light of the excited states of molecules. We will start with ab-initio methods, which means that they are derived from first principles without parametrization.

Wave Function Theory (WFT)
The basis of all discussed ab-initio methods is the Hartree-Fock method. The N-electron wave function is represented by a single Slater determinant, φ_0, which turns the N-body problem into N coupled one-electron problems. This Slater determinant is the anti-symmetric product of one-electron wave functions, the spin orbitals, which can be atomic, molecular, or crystal orbitals, depending on the system. In the case of molecular (or also crystal) orbitals, they are usually expanded as a linear combination of atomic orbitals, where the expansion coefficients are optimized during the calculation. In order to do so efficiently, the atomic orbitals are themselves expanded with the help of a basis set. The N-electron wave function is therefore obtained as a double expansion. Two approximations are thus applied: the use of a finite basis set to represent the atomic orbitals, and in turn also the molecular orbitals, on the one hand, and the use of a single Slater determinant on the other. This usually gives a poor description of the system under investigation, due to the lack of electronic correlation.
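The double expansion described above can be written compactly in standard notation (χ are spin orbitals evaluated at electron coordinates x, φ_μ are atomic basis functions, and C_{μp} are the MO coefficients optimized during the calculation):

```latex
% Slater determinant: antisymmetrized product of N spin orbitals
\Phi_0 = \frac{1}{\sqrt{N!}}
\begin{vmatrix}
\chi_1(\mathbf{x}_1) & \cdots & \chi_N(\mathbf{x}_1) \\
\vdots & \ddots & \vdots \\
\chi_1(\mathbf{x}_N) & \cdots & \chi_N(\mathbf{x}_N)
\end{vmatrix},
\qquad
% LCAO: each molecular orbital expanded in a finite atomic-orbital basis
\psi_p = \sum_{\mu} C_{\mu p}\, \phi_\mu
```

The two approximations named in the text correspond to truncating the sum over μ (finite basis set) and to using a single determinant Φ_0.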
Electronic correlation describes how much the motion of an electron is influenced by all other electrons. Since the Hartree-Fock method can be seen as a mean-field theory, where an electron "feels" only the average of the other electrons, correlation is quantified by the correlation energy, which is the difference between the Hartree-Fock energy and the exact energy of a system.
Unsurprisingly, all further discussed quantum chemical methods aim at improving the Hartree-Fock method. They can be seen as different flavors of the same solution to the problem: They all include more determinants in one way or another. Accordingly, the wave function is expanded as a linear combination of determinants, where a determinant consists of molecular orbitals, which are expanded in atomic orbitals. This ansatz contains two types of coefficients that can be optimized: the ones for the determinants and the ones yielding the molecular orbitals. If the latter are kept the same for all determinants, we speak of a single-reference wave function. If both types of coefficients are adapted, we speak of a multi-reference wave function. Similarly, the electron correlation is also divided into two parts, termed dynamic correlation and static correlation. Single-reference methods improve on the dynamic correlation, while a multi-reference wave function accounts for static correlation. However, the separation is not strict, as can be seen from the following fact: Both the single-reference variant and the multi-reference variant become equivalent when an infinite number of terms is included, and both deliver the exact solution to the Schrödinger equation if, in addition, an infinite basis set is used.

Configuration Interaction
In the case of single-reference methods, the orbitals obtained from the reference calculation (usually Hartree-Fock) are kept fixed. Since usually more orbitals than the number of electrons in the system are calculated, different Slater determinants can be constructed from these orbitals and used to expand the actual wave function: 256,257

$$ \Psi = \sum_I c_I\, \phi_I $$

Each Slater determinant, φ_I, is weighted by a coefficient, c_I. These coefficients can be obtained variationally by minimizing the total energy under the constraint of fixed orbitals, which yields the configuration interaction (CI) methods. φ_0 is the reference (Hartree-Fock) wave function. In principle, the exact solution can be obtained by considering all possible Slater determinants in combination with a complete basis set. The use of all possible configurations is called Full-CI and represents the case where all electrons are arranged in all possible ways. This approach is infeasible for almost all molecular systems more complex than, e.g., He, and truncated methods are needed. Examples are CIS (CI Singles) or CISD (CIS and Doubles), where only single excitations, or additionally double excitations, are accounted for, respectively. Figure 3 gives a schematic overview of the possible improvements of CI. A great advantage of these methods is that the route to the exact solution is known and that they are systematically improvable. However, truncated CI does not scale correctly with the system size and is therefore neither size-extensive nor size-consistent (i.e., the energy of two fragments A and B at large distance computed together, E(A + B), is not equal to the sum of the energies of the fragments from separate calculations, E(A) + E(B)). 258

Figure 3: Different arrangements of electrons in molecular orbitals giving rise to the configuration interaction (CI) method. Inclusion of excited configurations in addition to the ground-state reference determinant, φ_0, allows one to go beyond the Hartree-Fock method. Electrons are excited into higher electronic orbitals, and Slater determinants are indicated using the letters S, D, T, and Q, which refer to single, double, triple, and quadruple excitations.
The CI scheme can be employed to improve the ground-state wave function by mixing the Hartree-Fock determinant and determinants of different electron configurations. In the same way, wave functions of excited states can be computed. Then, the coefficients c_I are optimized for higher eigenvalues of the electronic Hamiltonian instead of the lowest one. Beginners in the field often get confused by terms like single excitation in comparison to first excited state. A single-excitation determinant (see Fig. 3) can be part of the wave function of the first excited state, but it can also be part of the ground-state wave function.
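The linear-variational idea behind CI can be sketched numerically. The following toy example uses a made-up 3×3 model Hamiltonian in a basis of three determinants (all numbers are arbitrary illustration values, not from any real molecule); a single diagonalization yields the ground state and the excited states together with their CI coefficients:

```python
import numpy as np

# Model electronic Hamiltonian in a basis of three determinants
# (reference + two excited configurations); values in eV are invented
# purely for illustration.
H = np.array([
    [0.0, 0.4, 0.1],
    [0.4, 5.0, 0.3],
    [0.1, 0.3, 7.5],
])

# One diagonalization gives all states at once: the lowest eigenvalue is
# the correlated ground state, higher eigenvalues are excited states, and
# the eigenvector entries are the CI coefficients c_I of each state.
energies, coeffs = np.linalg.eigh(H)
print("state energies:", energies)
print("ground-state CI coefficients:", coeffs[:, 0])
```

Note that the lowest state is pushed below the reference energy by the mixing with excited configurations, while its wave function stays dominated by the reference determinant (large |c_0|), illustrating the distinction between a single-excitation determinant and the first excited state made above.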
Electron Propagator Methods

Another class of methods that we briefly want to mention here are electron propagator methods, which are based on the one-electron Green's function and represent another variant of perturbation theory schemes. One popular method based on this approach is the algebraic diagrammatic construction scheme to second-order perturbation theory (ADC(2)). 259 ADC(2) is a single-reference method and can be used to efficiently compute excited states of molecules. It offers a good compromise between computational efficiency and accuracy, while being systematically improvable (higher-order variants like ADC(2)-x or ADC(3) exist). The time evolution of a system's polarizability is obtained by applying the polarization propagator, which contains information on a system's excited states. 256,260-263 The ground-state energy of ADC(2) is based on second-order Møller-Plesset perturbation theory, 264,265 MP2, which can formally be shown to include double excitations for the improvement of Hartree-Fock, see ref 256. The dependence of ADC(2) on MP2 gives rise to instabilities in regions where excited states come close to the ground state or where homolytic dissociation takes place. The excited states of bound molecules, however, are described with reasonable accuracy. Compared to multi-reference CI methods (see below), the black-box behavior of ADC(2) is a clear advantage. 259

Coupled Cluster

The gold standard of ab initio methods for the ground state is the family of coupled cluster (CC) methods. CC is often referred to as the size-extensive and size-consistent version of CI. The different electronic configurations accounting for single or double excitations (such as in CIS and CISD, for example) are obtained by applying an excitation operator, T̂, to the reference wave function: 266

Ψ_CC = e^T̂ Ψ_0

Similarly to CI, this operator can be truncated.
If T̂ = T̂_1 + T̂_2, single and double excitations are accounted for. Excited states can be computed in a single-reference approach by equation-of-motion CC (EOM-CC), where the excited-state wave function is written as an excitation operator acting on the ground-state wave function. For further details, see, e.g., the reviews 267,268.
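The size-consistency failure of truncated CI mentioned above can be checked numerically with a toy model: two identical, noninteracting two-level "monomers", whose exact (Full-CI) dimer energy must equal twice the monomer energy. All parameters are arbitrary illustration values:

```python
import numpy as np

# One two-level monomer (delta: excitation energy, g: coupling; arbitrary).
delta, g = 10.0, 1.0
h = np.array([[0.0, g], [g, delta]])
e_mono = np.linalg.eigvalsh(h)[0]

# Noninteracting dimer: H = h (x) 1 + 1 (x) h (Kronecker sum).
I = np.eye(2)
H_dimer = np.kron(h, I) + np.kron(I, h)
e_full = np.linalg.eigvalsh(H_dimer)[0]   # Full CI: exactly 2 * e_mono

# Truncated CI: drop the simultaneously excited configuration |11>
# (basis index 3), the analogue of truncating higher excitations.
H_trunc = H_dimer[:3, :3]
e_trunc = np.linalg.eigvalsh(H_trunc)[0]

print(e_full - 2 * e_mono)    # ~0: Full CI is size-consistent
print(e_trunc - 2 * e_mono)   # > 0: truncated CI misses correlation energy
```

The exponential ansatz of CC avoids this defect because e^T̂ generates the product excitations (such as the dropped |11⟩ configuration) automatically, even when T̂ itself is truncated.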
CASSCF

The problem of missing static correlation in the Hartree-Fock approach is tackled by a multi-reference ansatz for the wave function. 255 This treatment is important for many excited-state problems, but also for some transition metal complexes in their ground state, for transition states, or for homolytic bond breaking, with the dissociation of the N2 molecule being a notoriously difficult example. 269,270 The multi-configurational self-consistent field (MCSCF) method can be seen as the multi-reference counterpart to the Hartree-Fock method. 271 One of the most popular variants of MCSCF methods is the complete active space SCF (CASSCF), 272,273 where important molecular orbitals and electrons are selected, giving rise to an active space. An example is shown in Figure 4. According to this scheme, the orbitals are split into an inactive, doubly occupied part, an active part, and an inactive, empty part. Within the active space, a Full-CI computation is carried out. The active space has to be chosen manually by selecting a number of active electrons and active orbitals. CASSCF is thus not a black-box method, and a meaningful active space selection is the full responsibility of the user. As an advantage, CASSCF can describe static correlation well, which is necessary in systems with configurations that are nearly degenerate with the reference Slater determinant. For completeness, state averaging (i.e., SA-CASSCF) is most often applied, where states belonging to the same symmetry are averaged. Another variant of MCSCF methods is the restricted active space SCF (RASSCF), which is very similar to CASSCF, but within RASSCF the active space is restricted and no Full-CI computation is carried out. 256

MR-CI

Even higher accuracy can be obtained with multi-reference CI methods, 29,274,275 such as MR-CISD, which additionally add single and double excitations out of the active space and are therefore based on CASSCF wave functions. With this approach, the full electronic correlation, i.e., static and dynamic correlation, can be treated.
CASPT2

Alternatively, complete active space perturbation theory of second order, CASPT2, 276-278 can account for electronic correlation effects by treating multi-reference problems with perturbation theory. This variant of multi-reference perturbation theory uses the CASSCF wave function as the zeroth-order wave function. CASPT2 can be applied to each state separately (single-state (SS)-CASPT2), or correlated states can be mixed at second order, resulting in a multi-state perturbation treatment (MS-CASPT2). 276-278 Other perturbation approaches for multi-reference problems exist, like the n-electron valence state perturbation theory (NEVPT2). 279-281

MRCC

In addition to multi-reference methods based on CI, multi-reference variants of CC approaches exist. A relatively efficient implementation is, for example, the Mk-MRCC approach of Mukherjee and co-workers 282 or the Brillouin-Wigner approach, 283 which is, however, not size-extensive. Notably, the development of multi-reference CC approaches is a rather young research field compared to other excited-state methods, and the computation of properties and forces is not well explored. Many studies therefore focus on the simulation of energies of low-lying states with MRCC methods. Additionally, such methods suffer from algebraic complexity and numerical instabilities. Interested readers seeking a more extensive summary of existing MRCC methods are referred to refs 29,284,285.
Challenges

Probably the biggest drawback of the aforementioned multi-reference methods is that their protocols are very demanding. Finding a proper active space is a tedious task that often requires expert knowledge. Too small active spaces can lead to inaccurate energies, and problems with so-called intruder states are common. Those are electronic states that are high in energy at a reference molecular geometry but become very low in energy at another molecular geometry that is visited along a reaction coordinate. The active space then changes along this path. This behavior can result in inconsistent potential energies. In the case of CASPT2, the configurations of intruder states can lead to large contributions in the second-order energy, rendering the assumption of small perturbations invalid. Especially for describing molecular systems with many energetically close-lying states and for the generation of a training set for ML, such inconsistencies are problematic. Figure 5 shows an example of potential energy curves of 3 singlet states and 4 triplet states of tyrosine computed with (a) CASSCF(12,11) and (b) CASPT2(12,11), where 12 refers to the number of active electrons and 11 to the number of active orbitals.
We used OpenMolcas 286 to compute an unrelaxed scan along the reaction coordinate, which is a stretching of the O-H bond located at the phenyl ring of tyrosine.

Figure 5: Potential energy curves of the three lowest singlet (S_0-S_2) and the four lowest triplet states (T_1-T_4) of the amino acid tyrosine along the O-H bond length of the hydroxy group located at the phenyl ring (Ph-OH) computed with CASSCF(12,11)/ano-rcc-pVDZ and CASPT2(12,11)/ano-rcc-pVDZ. 287

Intruder states are no exception. Actually, they are quite common in small to medium-sized organic molecules.
A large enough reference space can mitigate this problem but makes computations almost infeasible. The computational costs increase exponentially with the number of active orbitals. In many cases, the improved accuracy due to a larger active space cannot justify the considerably higher expense. At best, and with massively parallel simulations, an active space of about 20 electrons in 20 orbitals can be treated, 288 which is impracticable for many applications, such as dynamics simulations. For medium-sized molecules, the active space that would be required for a given simulation might even be way too large to be feasible for calculations in a static picture.
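The exponential growth can be made concrete by counting determinants: for a CAS(n, m) singlet with n/2 alpha and n/2 beta electrons in m orbitals, the number of determinants is a product of two binomial coefficients. A small sketch (the helper `n_determinants` is ours, not from any package):

```python
from math import comb

def n_determinants(n_elec, n_orb):
    """Number of Slater determinants spanned by a CAS(n_elec, n_orb) space,
    assuming a singlet with n_elec/2 alpha and n_elec/2 beta electrons."""
    n_alpha = n_beta = n_elec // 2
    return comb(n_orb, n_alpha) * comb(n_orb, n_beta)

# The tyrosine active space of Figure 5 vs. the ~(20,20) practical limit:
print(n_determinants(12, 11))   # CAS(12,11): 213,444 determinants
print(n_determinants(20, 20))   # CAS(20,20): ~3.4e10 determinants
```

Going from CAS(12,11) to CAS(20,20) thus inflates the Full-CI space by about five orders of magnitude, which illustrates why even modest active-space extensions quickly become prohibitive.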
Worth mentioning at this point are also Rydberg states, which often need to be considered in small to medium-sized molecules. Rydberg states can be strongly interlaced with valence excited states. In such cases, the active space needs to be large enough to treat both the valence and the Rydberg molecular orbitals. Additionally, the one-electron basis set should be flexible enough to describe both types of orbitals, which increases the computational costs further. More details on the inclusion of Rydberg states in simulations can be found in refs 289-292. A promising tool to eliminate the complex choice of active orbitals is autoCAS. 293-295 It provides a measure of the entanglement of molecular orbitals that is based on the density matrix renormalization group (DMRG). A DMRG-SCF calculation is similar to a CASSCF calculation, but instead of a Full-CI solution within the active space, an approximate solution obtained with DMRG is used to avoid the exponential scaling of the computational costs with the number of active orbitals. 296-301 As an alternative, ML can be used to determine an active space. 70

Density Functional Theory
A complementary view on how to obtain the energy of a system is provided by DFT. DFT dates back to 1964, when it was formulated by Hohenberg and Kohn 302 entirely in terms of the electron density, η(r). A one-to-one correspondence exists between this density and an external potential, v(r), which acts on the electron density. The energy can be formulated in terms of a universal functional, F[η(r)], of the electron density, which is independent of the external potential. In this way, the energy of a system's ground state can be computed with the following equation:

E[η(r)] = F[η(r)] + ∫ v(r) η(r) dr

The most widely used implementations of DFT rely on the Kohn-Sham approach. 303 In fact, Kohn-Sham DFT is so successful that it is often simply referred to as DFT. In this approach, an auxiliary wave function in the form of a Slater determinant is employed. Since a single Slater determinant is the exact solution for a system of noninteracting electrons, this DFT approach can be seen as describing a system of noninteracting electrons that are forced to behave as if they were interacting. The latter effect can be achieved only by an unknown modification of the Hamiltonian or, rather, of the aforementioned functional. In other words, a Slater determinant as wave function ansatz is exact but the Hamiltonian can only be approximated, in contrast to Hartree-Fock, where the true electronic Hamiltonian is used but the Slater determinant is only an approximate wave function. The functional F[η(r)] can be separated into Coulombic interactions and a non-Coulombic part. The latter can further be divided into two terms: the kinetic energy of the noninteracting electrons and the exchange-correlation part, which describes the interaction of electrons and thus also corrects the kinetic energy by the difference between the real kinetic energy and that of the fictitious system of noninteracting electrons.
The exchange-correlation functional is the unknown part of DFT, and finding it remains the holy grail of the field.
In principle, if the exact functional were known, the exact ground-state energy of a system could be computed. Unfortunately, it is not known, and the success of a DFT calculation critically depends on the approximation used for the unknown exchange-correlation functional. For completeness, KS-DFT as described here applies to closed-shell systems; in the case of open-shell systems, two spin densities are distinguished, resulting in spin-polarized KS theory. 304 As explained above, the electron density is computed from a single-reference Kohn-Sham wave function, i.e., the one of noninteracting electrons with the density of the real system. This makes DFT a single-reference method, and, in fact, most failures of DFT are a consequence of an improper description of static correlation. 255 In order to describe excited states, the time-dependent (TD) version of DFT, namely TDDFT, can be used. The foundation of this theory was laid in the 1980s with the Runge-Gross theorems, 305 which can be regarded as analogues of the Hohenberg-Kohn theorems. They are based on the assumption that a one-to-one correspondence also exists between a time-dependent potential and a time-dependent electron density in this potential. A system can therefore be completely described by its time-dependent density. Also in the time-dependent case, a variational principle for the density is proposed.
The most widely used approach of TDDFT is linear-response TDDFT (LR-TDDFT). Again, TDDFT is often used synonymously with LR-TDDFT due to its extensive use. Within this theory and the KS approximation, no time-dependent density is necessary to compute excitation energies and excited-state properties; linear-response theory can be applied directly to the ground-state density. 306,307 Casida's formulation of this theory is the most popular one and gives rise to random-phase-approximation pseudo-eigenvalue equations, which are also known as the Casida equations. Within the adiabatic approximation, they are implemented efficiently in many existing electronic structure programs. The Tamm-Dancoff approximation 308,309 further simplifies the equations to a Hermitian eigenvalue problem, resulting in the counterpart to CIS. 310 Especially when the time evolution of a system is studied, the Tamm-Dancoff approximation is beneficial, since it leads to more stable computations close to critical regions of the PESs. 253,304 The advantage of LR-TDDFT is its computational efficiency. Its reasonable accuracy, if a proper functional is chosen, makes this approach often the method of choice to study the photochemistry of medium-sized to large and complex systems, which are not feasible to treat with costly multi-reference WFT-based methods. 29,311,312 A shortcoming of LR-TDDFT is the incorrect dimensionality of conical intersections, which are, however, among the most important regions during nonadiabatic MD simulations. 313-315 With standard TDDFT implementations, this leads to a qualitatively incorrect description of such critical regions. The missing couplings can be recovered, for example, with the CI-corrected Tamm-Dancoff approximation 316 or the hole-hole Tamm-Dancoff approximation, 317 which provide the correct dimensionality at conical intersections.
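As a minimal numerical sketch of the Tamm-Dancoff simplification: excitation energies become eigenvalues of a Hermitian matrix A whose diagonal carries orbital-energy differences (ε_a − ε_i) and whose off-diagonal elements carry coupling matrix elements. The numbers below are invented purely for illustration:

```python
import numpy as np

# Toy Tamm-Dancoff matrix for three occupied->virtual excitations (eV).
# Diagonal: orbital-energy differences; off-diagonal: coupling elements.
# All values are arbitrary illustration numbers.
A = np.array([
    [4.5, 0.3, 0.1],
    [0.3, 5.2, 0.2],
    [0.1, 0.2, 6.8],
])

# A X = omega X: a plain Hermitian eigenvalue problem, in contrast to the
# pseudo-eigenvalue structure of the full Casida equations.
omega, X = np.linalg.eigh(A)
print("TDA excitation energies:", omega)
```

The numerical robustness of solving a single Hermitian eigenproblem is one reason the Tamm-Dancoff approximation behaves more stably near critical regions of the PESs, as noted above.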
In addition, one should be aware that double excitations cannot be accounted for with LR-TDDFT within the adiabatic approximation. The computation of double excitations can be achieved by using a frequency-dependent exchange kernel, which is known as dressed TDDFT. 318,319 Alternatively, spin-flip TDDFT 320,321 can be used, where a triplet state is taken as the reference state and single excitations are treated with a flip of the electron's spin. However, spin contamination is quite common within these methods. In general, the description of double excitations from a multi-reference state would be more favorable, although spin-flip TDDFT is often considered to be a multi-reference method. In order to compute specific orbital occupations and consequently excitations and charge-transfer states, an alternative approximation exists, which is known as the ∆-SCF approach. In this approach, the electrons are forced into specific KS orbitals, and the SCF is applied to converge the energy with respect to this configuration. 322-324 Other multi-reference variants of TDDFT exist, too. However, their description is beyond the scope of this review, and we refer the reader to a review covering this topic in much more detail. 29 Last but not least, we briefly want to discuss the most critical part of a DFT calculation, which is the proper choice of the exchange-correlation functional. In the case of excited states, the treatment of valence excitations, Rydberg states, and long-range charge-transfer excitations on the same footing is highly problematic. While hybrid (meta-)generalized gradient approximation (GGA) or range-separated hybrid functionals 325 are, for example, well suited for vertical excitations, and the latter also for Rydberg states, global hybrid meta-GGA or range-separated hybrid GGA functionals are better suited to describe charge transfer. 253,326 Most often, functionals are accurate for one specific problem but fail to describe others.
Although much effort has been devoted to developing functionals, a universal functional for DFT is still far from being achieved. 29,175,253,304 In summary, it should be stressed that, in general, there is not a single solution to a particular problem, but that many possible ways can be considered which lead to an equivalent description of it. Considering the excited states of molecules, it is of utmost importance to think carefully about the photochemical processes that may occur in order to find the most appropriate method for most of the expected reactions. It often happens that, within the same molecular system, one method can describe a certain photochemical reaction quite well, while another reaction is described better with another method. However, mixing methods is not practicable for standard applications. Recently, ML studies have emerged that combine the different strengths of several methods, e.g., ∆-learning techniques 327,328 or transfer learning. 329 These methods could be well-suited solutions for many future applications to overcome the current limitations of existing quantum mechanical methods for the excited states. Even more than for ground-state properties, the quality of the excited states depends critically on the ability of a method to describe the different possible reactions, as a consequence of the larger accessible configuration space of a molecular system. Even for medium-sized systems, it should be clear that a suitable method may already be computationally impracticable, and a balance between accuracy and computational effort has to be found.

Bases
The potentials computed with the aforementioned methods for different nuclear geometries can be represented in different bases, which are connected by unitary transformations. An example of five states in different bases is given in Figure 6. Note that often a system in a certain basis is also referred to as being in a certain picture or representation; here we will not use the term representation in order not to confuse the reader with the molecular representations used in ML. As visible in the figure, we focus on three types of bases: (a) the diabatic basis, (b) the adiabatic (spin-diabatic) basis, i.e., the direct output of standard electronic structure programs, and (c) the diagonalized version of (a) and (b), i.e., the spin-adiabatic basis. Throughout the literature, different names are given to these bases, which are summarized in Table 1. They stem from a partition of the total wave function into a sum of products of electronic and nuclear contributions, which can be written for all bases as:

Ψ_tot(r, R, t) = Σ_i ψ_i^basis(r, R) χ_i^basis(R, t)

In a similar way as the number 20 can be factored into 4·5 or 4.5·4.4, the total wave function can be expanded in the different bases. Here, ψ_i^basis(r, R) corresponds to the eigenfunctions of the electronic Hamiltonian only for one of the bases (namely the one from column b of Table 1). Associated with these functions are the corresponding potentials, depicted for a model system in Fig. 6. Note that a different approach is taken in the exact factorization method, 330 where the total wave function is expanded only in a single product, i.e., without the sum in eq 6, giving rise to only one (time-dependent) potential.

Table 1: Commonly used names of bases for the excited-state potential energy surfaces based on refs 228,242,331-334. The labels a, b, and c are consistent with Fig. 6.

  (a)               (b)              (c)
  diabatic          adiabatic        diagonal
  crude adiabatic   spin-diabatic    spin-adiabatic
  spectroscopic     MCH              field-adiabatic
  quasi-diabatic    field-free       field-dressed

Adiabatic (Spin-Diabatic) Basis
The direct output of an electronic structure calculation usually provides the eigenenergies and eigenfunctions of the electronic Hamiltonian. In many cases, only one spin multiplicity is calculated. If this procedure is repeated along a nuclear coordinate, potential curves result that are termed adiabatic. Adiabatic means "not going through" (from Greek a = not, dia = through, batos = passable) and, indeed, the potentials never cross when considering one multiplicity. This situation is schematically illustrated in Figure 6(b) for singlet S_i and singlet S_j.
Within one multiplicity, 3N_A-dimensional adiabatic PESs are obtained that are strictly ordered by energy. Hence, the states are usually denominated with the first letter of the multiplicity and a number as subscript, e.g., S_0, S_1, etc. For states of the same multiplicity, critical points and seams exist. These regions of the PESs are referred to as conical intersections (seams), in which the corresponding states become degenerate. Such features make adiabatic PESs non-smooth functions of the atomic coordinates, which makes them difficult to predict with the intrinsically smooth regressors of ML. At a conical intersection, the approaching potential energy curves form a cone, and the NACs, denoted as C_ij^NAC, between the states show singularities as a result of their inverse proportionality to the vanishing energy gap: 274,335

C_ij^NAC = ⟨ψ_i | ∇_R ψ_j⟩ = ⟨ψ_i | ∇_R Ĥ_el | ψ_j⟩ / (E_j − E_i)

Second-order derivatives are neglected here, as is done in many quantum chemistry programs that compute NAC vectors. The blue dashed curve in panel (b) of Fig. 6 illustrates the norm of the NAC vector, C_ij^NAC, that couples the states S_i and S_j. At the avoided crossing of the states, the NAC norm shows a sharp spike but is almost vanishing elsewhere. If more than one multiplicity is considered, the term adiabatic is not adequate anymore, because potentials of different multiplicity might cross each other. This situation is then called diabatic with respect to the spin multiplicities, or spin-diabatic in short. For example, singlets are adiabatic among each other and triplets are adiabatic among each other, but singlets are diabatic with respect to triplets. However, also the diabatic basis (see Fig. 6(a) and below) qualifies as spin-diabatic.
Due to this nomenclature issue, which sometimes confuses even experts, we refer to this basis as the MCH (Molecular Coulomb Hamiltonian) basis, because it is obtained from the eigenfunctions and eigenvalues of the non-relativistic electronic Hamiltonian, in which only Coulomb interactions are considered.
As an example, a crossing of a singlet state and a triplet state is shown in Fig. 6(b). As visible there, the triplet components, which are distinguished by different magnetic quantum numbers, are degenerate. The states are coupled by SOCs (denoted as C_ij^SOC), which are usually obtained as smooth potential couplings with standard quantum chemistry programs: 28,242,334

C_ij^SOC = ⟨ψ_i | Ĥ_SO | ψ_j⟩

These couplings are single real-valued or complex-valued properties. 34,336 Whether they are complex or not depends on the electronic structure program employed, but they can be converted into each other. 34,36,242,337 Ĥ_SO in eq 8 is the spin-orbit Hamiltonian, which describes the relativistic effect due to interactions of the electron spin with the orbital angular momentum, allowing states of different spin multiplicities to couple. 337-339 Note that SOCs also exist between different states of the same multiplicity, except for singlets. No exact expression for how to include relativistic effects into the many-body equations has been found yet. Among the most popular approximations is the Breit equation, 340 which applies an adapted Hamiltonian instead of the electronic Hamiltonian, comprising, among other terms, a relativistic part. This additional part of the Hamiltonian accounts for spin-orbit effects and is proportional to the atomic charge, 34,36,337,339,340 which led to the belief that SOCs would only be relevant in systems with heavy atoms. 341,342 Today it is known that spin-orbit effects also play a crucial role in many other molecular systems and are important for intersystem crossing between states of different spin multiplicities. 37,343-345 The states in the MCH basis can also be coupled via external electromagnetic fields, e.g., by sunlight or a laser. The corresponding couplings stem from the transition dipole moments multiplied with the electric field. Since the effect of the field is not included in the potentials but enters as off-diagonal potential couplings, the MCH basis is also called field-free. 331-333,346 However, the diabatic basis also qualifies as field-free.
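The singular behavior of the NACs near an avoided crossing can be reproduced with a minimal two-state model: linear diabatic potentials ±ax coupled by a constant c (all values arbitrary). The sketch also fixes the arbitrary eigenvector sign at each geometry, anticipating the phase problem discussed in section 4.2:

```python
import numpy as np

# Two-state avoided-crossing model (arbitrary units and parameters).
a, c = 1.0, 0.02
x = np.linspace(-1.0, 1.0, 2001)
dx = x[1] - x[0]

energies, vecs = [], []
prev = None
for xi in x:
    V = np.array([[a * xi, c], [c, -a * xi]])
    e, v = np.linalg.eigh(V)
    # Align each eigenvector with the previous grid point to remove the
    # arbitrary sign returned by the diagonalizer (the "phase problem").
    if prev is not None:
        for s in range(2):
            if np.dot(v[:, s], prev[:, s]) < 0:
                v[:, s] *= -1
    energies.append(e)
    vecs.append(v)
    prev = v

energies = np.array(energies)
gap = energies[:, 1] - energies[:, 0]

# Finite-difference NAC, C_ij ~ <psi_i | d psi_j / dx>, on interior points.
nac = np.array([
    np.dot(vecs[i][:, 0], (vecs[i + 1][:, 1] - vecs[i - 1][:, 1]) / (2 * dx))
    for i in range(1, len(x) - 1)
])

print("smallest gap:", gap.min())            # 2c at the avoided crossing
print("largest |NAC|:", np.abs(nac).max())   # ~a/(2c): inverse to the gap
```

The NAC norm forms a sharp Lorentzian-like spike exactly where the gap is minimal and nearly vanishes elsewhere, which is the behavior sketched by the blue dashed curve in Fig. 6(b).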

Diabatic Basis
In the diabatic basis, the electronic wave function is not parametrically dependent on the nuclear coordinates. Note that such a strictly diabatic basis does not exist in practice for polyatomic systems, and only approximate, so-called quasi-diabatic, PESs can be fit. In the literature, quasi-diabatic PESs are most often referred to as diabatic ones, so we will also use this notation here. Further, diabatic potentials usually need to be determined from adiabatic potentials and are not unique, i.e., they depend on the method and the reference point chosen in the adiabatic basis to fit the diabatic potentials. 228,242 An example of a system in the diabatic basis is given in panel (a) of Figure 6, and commonly used notations can be found in the first column of Table 1. In regions where an avoided crossing is present in the adiabatic basis, the coupled diabatic potential energy curves cross. Since the electronic wave function of a state is ideally independent of the nuclear coordinates, its character is conserved. Consequently, the states are labeled according to their character and multiplicity, e.g., as 1ππ*, or according to symmetry labels. Similar to the character, spectroscopically important quantities like the dipole moment are also mostly conserved or vary smoothly along the nuclear coordinates. Therefore, spectroscopic experiments can easily be interpreted when using the diabatic basis, which is thus sometimes also called the spectroscopic basis. Note that labels like S_1, etc., are sometimes used also when referring to the diabatic basis, especially in experimental papers when an identification of the wave function's character has not been carried out and only one geometry is considered. However, at a different geometry, the energetic order of the states might have changed, such that a state previously labeled as S_2 might now be lower in energy than a state previously labeled as S_1. Because this labeling scheme in the diabatic basis can lead to confusion with the labels from the MCH basis, we suggest reserving it for the MCH basis only.
Due to the mostly conserved characters and the crossing of states, diabatic potentials are smooth functions of the nuclear coordinates, in contrast to adiabatic potentials. A diabatic PES is thus highly favorable for several numerical applications including ML.
The MCH and diabatic bases can be interconverted by a unitary transformation with a unitary matrix, U, that is determined up to an arbitrary sign (as a result of the arbitrary sign of the wave function, which will be discussed in detail in section 4.2). In the case of two states, U is a rotation matrix that depends on the rotation angle, θ:

U = ( cos θ   −sin θ )
    ( sin θ    cos θ )

Accordingly, the peaked NACs, which are obtained as derivative couplings (also called kinetic couplings) in the MCH basis, are converted to smooth potential couplings in the diabatic basis. The smooth SOCs from the MCH basis become even smoother (ideally constant) in the diabatic basis. While one can straightforwardly apply diagonalization to convert diabatic PESs to adiabatic PESs (and similarly adiabatic PESs to diagonal PESs), a dilemma arises when one wants to go the inverse way and obtain diabatic PESs from adiabatic ones (and similarly adiabatic PESs from diagonal ones). In fact, finding diabatic PESs is highly complex and most often requires expert knowledge. To date, only small molecules could be represented with accurate diabatic potentials, and developing a method to automatically generate diabatic PESs remains an active field of research. Existing methods to obtain diabatic potentials require human input and are mostly applicable to small systems and certain reaction coordinates. Early pioneering works can be found in refs 228,347. Today, many more variants exist. Examples are the propagation diabatization procedure, 348 diabatization by localization, 349 Procrustes diabatization, 239 or diabatization by ansatz. 140
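For the two-state case, the forward direction (diabatic to adiabatic) can be written out explicitly. The sketch below (arbitrary numbers) builds U from the mixing angle θ that diagonalizes a 2×2 real symmetric diabatic potential matrix and checks that the rotation reproduces the adiabatic energies a direct diagonalization would give:

```python
import numpy as np

# Two-state diabatic potential matrix at one geometry (arbitrary values, eV).
V_dia = np.array([[1.0, 0.3],
                  [0.3, 1.4]])

# Mixing angle that removes the off-diagonal coupling of a real symmetric
# 2x2 matrix: tan(2 theta) = 2 V12 / (V11 - V22).
theta = 0.5 * np.arctan2(2 * V_dia[0, 1], V_dia[0, 0] - V_dia[1, 1])
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# The unitary transformation eliminates the potential coupling ...
V_adi = U.T @ V_dia @ U
print("off-diagonal after rotation:", V_adi[0, 1])
# ... and its diagonal reproduces the adiabatic (eigen)energies:
print(np.sort(np.diag(V_adi)))
print(np.linalg.eigvalsh(V_dia))
```

The inverse direction has no such closed recipe: θ(R) is not determined by the adiabatic energies alone, which is the dilemma of diabatization described above.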

Diagonal Basis
As the name indicates, the diagonal basis can be obtained by diagonalization from the MCH or diabatic bases. In this case, a strictly adiabatic picture is obtained, in which states never cross. 242 Accordingly, the concept of a multiplicity for a single state is lost, because a state might be of singlet character in one region and of triplet character in another. Therefore, this basis is also called spin-mixed or spin-adiabatic. 36,336,363 The states are strictly ordered by energy and can be labeled simply with numbers (see Fig. 6(c)). The resulting wave functions are eigenfunctions of the relativistic electronic Hamiltonian. 36,242,344 These eigenfunctions, as well as the eigenenergies, can also be obtained directly with, e.g., relativistic two-component or four-component calculations, 364 instead of via diagonalization.
In this basis, the effect of the SOCs is incorporated into the PESs to a large extent. What remains are localized kinetic couplings, which are similar in nature to the NACs in the MCH basis. An example is given in Fig. 6(c). The parts of the potentials that correspond to the different triplet components in the MCH basis are split energetically in the diagonal basis. In the case of small SOCs, the diagonal potentials look similar to the MCH potentials. However, if the SOCs are strong, potentials that are degenerate in the MCH basis can easily be shifted apart by 1 eV in the diagonal basis. Such splittings are then also experimentally observable, and the diagonal basis yields a more intuitive interpretation of these experiments. 45,365,366 As mentioned above, the states in the MCH basis can also be coupled via electromagnetic fields. A diagonalization of the potential matrix then yields so-called field-dressed states or light-induced potentials, which can also be termed field-adiabatic. 331,346,367-369 Since the fields are usually time-dependent, the most important axis along which the potentials in this field-dressed basis need to be plotted is time. 346 In principle, all these bases are equivalent, but only if an infinite number of terms is considered in eq 6. In practice, potentials represented in different bases have different advantages for dynamics simulations, especially in combination with the different approximations made in the different dynamics methods, as outlined below.
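How diagonalization lifts the degeneracy of the triplet components can be seen in a four-state toy model: one singlet coupled to three degenerate triplet components by made-up, real-valued SOC matrix elements (real SOCs and identical couplings are simplifications chosen only for illustration):

```python
import numpy as np

# MCH potential matrix at one geometry: a singlet (E_S) and the three
# degenerate components of a triplet (E_T), coupled by SOC (all values
# in eV, invented for illustration).
E_S, E_T, soc = 3.0, 2.9, 0.05
H_MCH = np.array([
    [E_S,  soc,  soc,  soc],
    [soc,  E_T,  0.0,  0.0],
    [soc,  0.0,  E_T,  0.0],
    [soc,  0.0,  0.0,  E_T],
])

# Diagonalizing yields the spin-mixed (diagonal) basis: one triplet
# combination couples to the singlet and splits off, while the two
# combinations orthogonal to it stay at E_T.
E_diag = np.linalg.eigvalsh(H_MCH)
print("diagonal-basis energies:", E_diag)
```

In this example the triplet manifold, degenerate at 2.9 eV in the MCH basis, splits in the diagonal basis, and the splitting grows with the SOC strength, mirroring the qualitative discussion above.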

Excited-State Dynamics Simulations
In order to investigate the temporal evolution of an isolated molecular system in the excited states, the time-dependent Schrödinger equation, iħ ∂Ψ/∂t = ĤΨ, has to be solved. 240 From a technical point of view, a sequence of time steps is computed, where in every step the electronic problem is solved to yield potentials, which determine the forces acting on the nuclei such that the nuclear equations of motion can be solved for the current time step. Ideally, the nuclei are treated quantum mechanically. In this case, the PESs are usually computed in advance and either interpolated or stored on a grid for later use. The hope is that ML can improve the interpolation of potentials drastically. Such global PESs are needed because a wave function is employed for the nuclei, which extends over a range of nuclear coordinates at the same time (see Fig. 7(a)). An overview of the corresponding dynamics methods is given in section 3.3.1.
The nuclear dynamics can also be approximated classically while quantum potentials are used, i.e., mixed quantum-classical dynamics (MQCD) simulations are carried out. Such methods are discussed in section 3.3.2. Since the classical nuclear trajectories are defined only at one nuclear geometry at a time (see Fig. 7(b)), on-the-fly calculations of the potential energies are possible. An on-the-fly scheme is computationally advantageous if the number of geometries visited during the dynamics is smaller than the number of points needed to represent the conformational space on a grid or via interpolation. 26,28,242,313,314,344,370 No fitting of PESs is necessary in an on-the-fly approach, but fitted PESs can still be used as an alternative. Since ML approaches provide such interpolated potentials, the number of training points generated with quantum chemistry must be smaller than the number of points needed in an on-the-fly approach in order to be advantageous. This demand is satisfied, e.g., for long time scales or if many trajectories are necessary.
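This break-even argument can be made concrete with a back-of-the-envelope comparison; the function name and all numbers below are purely illustrative and are not taken from any cited study.

```python
def ml_fitting_pays_off(n_training_points: int, n_trajectories: int,
                        n_steps_per_trajectory: int) -> bool:
    """Rough break-even check: fitting the PESs (e.g., with ML) pays off
    when the number of quantum chemistry calculations needed to build the
    training set is smaller than the number of on-the-fly evaluations,
    i.e., one electronic structure call per trajectory time step."""
    on_the_fly_calls = n_trajectories * n_steps_per_trajectory
    return n_training_points < on_the_fly_calls

# 4000 training points vs. 100 trajectories of 10000 steps each:
print(ml_fitting_pays_off(4000, 100, 10000))     # True
# ... but not if the training set itself requires millions of points:
print(ml_fitting_pays_off(2_000_000, 100, 100))  # False
```

The same comparison also shows why long time scales or large trajectory ensembles favor fitted potentials: the on-the-fly cost grows with every additional step and trajectory, while the training-set cost stays fixed.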
In the following, we will shortly discuss the different types of nuclear motion and the opportunities of ML models to enhance the respective dynamics simulations.

Quantum Nuclear Dynamics
The computational cost of an exact nuclear dynamics simulation scales exponentially with the number of nuclear degrees of freedom. Hence, simulations are limited to small systems, typically containing less than 5 atoms. 27,34,371 Still, the calculation of the PESs of the molecule can be a rather expensive part of the whole scheme, and the use of ML algorithms is advisable even for such small systems.
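To make the exponential scaling concrete, and to sketch the kind of grid-based propagation referred to above, the following counts grid points for a multi-dimensional grid and then propagates a Gaussian wavepacket on a one-dimensional harmonic potential with the standard split-operator method (atomic units; all parameters purely illustrative):

```python
import numpy as np

# A grid with N points per nuclear degree of freedom needs N**d points
# in total, which is the origin of the exponential scaling noted above.
N, d = 32, 9
print(N**d)  # ~3.5e13 grid points for only 9 degrees of freedom

# Minimal 1D split-operator propagation of a Gaussian wavepacket on a
# harmonic potential (atomic units; illustrative parameters only).
n, L = 256, 20.0
x = np.linspace(-L/2, L/2, n, endpoint=False)
dx = x[1] - x[0]
k = 2.0*np.pi*np.fft.fftfreq(n, d=dx)      # momentum-space grid
m, dt = 1.0, 0.05
V = 0.5*x**2                               # harmonic potential
psi = np.exp(-(x - 2.0)**2).astype(complex)  # displaced Gaussian
psi /= np.sqrt(np.sum(np.abs(psi)**2)*dx)  # normalize

half_v = np.exp(-0.5j*dt*V)                # half step in the potential
full_t = np.exp(-0.5j*dt*k**2/m)           # full kinetic step, k**2/(2m)
for _ in range(200):
    psi = half_v*np.fft.ifft(full_t*np.fft.fft(half_v*psi))

norm = np.sum(np.abs(psi)**2)*dx           # unitary steps conserve norm
```

Each split-operator step is unitary, so the wavepacket norm stays at 1; on a grid, however, every additional nuclear degree of freedom multiplies the storage and the FFT cost by another factor of N, which is exactly why such propagation is restricted to few-atom systems.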
To treat larger systems, approximations have to be invoked. A prominent approach that can be converged to the exact solution is the multi-configurational time-dependent Hartree (MCTDH) approach. 49,372-374 Its high efficiency stems from the use of time-dependent basis functions to represent the nuclear wave functions. Nonetheless, the computations remain costly, and the nuclear degrees of freedom are often reduced to only a few important key coordinates, 228,375 where classical simulations can help identify the latter. 376 Whether quantum dynamics of such reduced-dimensionality models are better than classical dynamics of a full-dimensional system is still under debate and probably depends on the system. The potentials need to be presented to the algorithm in the diabatic basis, mostly due to numerical stability (e.g., smooth couplings are easier to integrate than singular ones). For more than 20 years, (modified) Shepard interpolation has been used to fit diabatic potentials. 149,377-380 Notably, the Grow algorithm 149 can be used to efficiently generate the database of points upon which the interpolation is based. However, it is clearly desirable to treat larger systems, and ML models like neural networks (NNs) promise higher performance or more flexibility in such cases. 141,144,145,147,348,359-362 More recently, on-the-fly methods addressing quantum dynamics have been developed. 143,381-383 They mostly rely on a combination of Gaussians to represent the nuclear wave function. 26 For example, the direct dynamics variational multi-configuration Gaussian (dd-vMCG) method 384 offers a variational and thus accurate solution for the equations of motion. Also, full multiple spawning 46,371,385 can be regarded as fully quantum mechanical, as it describes the wave function with a number of time-dependent Gaussian functions that follow classical trajectories with quantum mechanically determined time-dependent coefficients.
In its more affordable ab-initio multiple spawning variant, more approximations are introduced, such that the results sometimes approach the classical solutions. 386,387 Further related methods exist, like the ab-initio multiple cloning method 388 or the thawed Gaussian approximation. 389 Another class of dynamics methods are semi-classical approaches, which allow the inclusion of quantum effects, such as quantum mechanical tunnelling or coherence, in the classical dynamics of the nuclei. 390 Note that these methods, where the nuclear dynamics is treated semi-classically, should not be confused with the MQCD approaches (see below) that are also often termed semi-classical (because the nuclei are treated classically and the electrons quantum mechanically). The semi-classical dynamics methods range from the initial value representation, 391,392 adapted with the Zhu-Nakamura approach leading to the Zhu-Nakamura-Herman-Kluk initial value representation, 393 to path integral approaches. 394 The path integral formalism is especially interesting when the quantum and classical degrees of freedom should be coupled in a dynamically consistent manner. By using so-called ring polymers, i.e., replicas of the original classical system, a deviation of the nuclear dynamics from the classical path can be obtained and the time evolution of a system including nuclear quantum effects can be investigated. However, ring-polymer dynamics suffers from high computational effort as a consequence of the large number of replicas required. Accelerated formalisms exist, which are, for example, implemented in the Python wrapper i-PI, 395,396 which allows interfacing path-integral methods with programs that provide PESs but is mostly dedicated to the electronic ground state. To date, only a few implementations of semi-classical methods in atomistic simulation software are available. Compared to classical mechanics, the computational costs increase by a factor of about 10 to 100. 390,397,398

Mixed Quantum-Classical Molecular Dynamics
While semi-classical methods are promising for highly accurate simulations of molecular systems containing up to tens of atoms, the study of larger systems is still dominated by the computationally cheaper MQCD methods, where the nuclear motion is treated fully classically. 27,397-399 In contrast to quantum dynamics, the motion of the nuclei can be computed very fast using classical mechanics, and the computation of the PESs, on which the nuclei are assumed to move, remains the time-limiting step. In this sense, ML models have a huge potential to enhance MQCD simulations by providing the electronic PESs and enabling the investigation of reactions that are not feasible with conventional approaches. 13,400-402 In fact, most studies that describe photochemistry with ML to date aim to replace the quantum chemical calculation of the PESs in MQCD approaches.
The most popular MQCD method is trajectory surface hopping, 403-405 schematically represented in Figure 7(b). A manifold of independent trajectories is required to obtain statistically relevant results and to mimic the extended nuclear wave functions. For a single trajectory, the nuclei move classically on one of the quantum potentials, hence only one state is considered to be active, but transitions between different states are allowed. 406 Different approaches exist to determine the probability of such a transition, also called hop or jump in surface hopping methods. To this aim, different quantities are needed, which are commonly provided in the MCH basis, as it is the direct outcome of a quantum chemical simulation. One of the first implementations to compute the hopping probability is based on the Landau-Zener formalism. 407,408 Based on the Landau-Zener formula, the potential energy differences are used to determine the hopping probability. No information about couplings is required, which implies that the approach must fail for states that do not couple but lie close in energy. Very similar to this approach is the Zhu-Nakamura theory. 409-412 Also here, the computation of couplings is omitted and only information about the PESs is used. Among the most widely used hopping algorithms is Tully's fewest-switches algorithm, 403 which is valid for many cases and based on the NACs between different PESs. An extension to other couplings is provided, e.g., in the SHARC (surface hopping including arbitrary couplings) method. 344 When couplings are considered, an internal transformation from the MCH basis to the diagonal basis is most advantageous because the localized couplings of the diagonal picture indicate precisely where the few switches of the fewest-switches approach should take place.
In cases where the PESs are fitted in advance, either with ML models or with other types of analytical functions, the use of a diabatic basis is favorable (because of the Berry phase, see below), but the potentials should be transformed to the diagonal picture for the calculation of hopping probabilities. Other flavors to account for transitions exist; however, they have not been applied in simulations with ML algorithms yet. Interested readers are therefore referred to refs 36,48,242,314,344,403,410,413-417 for further information.
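As an illustration of how such a hopping probability can be evaluated, the following is a minimal sketch of Tully's fewest-switches formula under simplifying assumptions (adiabatic basis, real-valued NACs, vanishing off-diagonal Hamiltonian elements); the function and variable names are our own, and the numbers are made up:

```python
import numpy as np

def fewest_switches_prob(c, v, nac, active, dt):
    """Hopping probabilities from the active state to all other states,
    following Tully's fewest-switches formula (adiabatic basis, real NACs,
    no off-diagonal Hamiltonian elements). c: complex state amplitudes
    (n_states,), v: nuclear velocities (n_atoms, 3), nac: NAC vectors
    (n_states, n_states, n_atoms, 3) with nac[l, k] = -nac[k, l]."""
    k = active
    a_kk = np.abs(c[k])**2                     # active-state population
    g = np.zeros(len(c))
    for l in range(len(c)):
        if l == k:
            continue
        # b_lk = -2 Re(c_l* c_k (v . d_lk)); negative values are clipped.
        b_lk = -2.0 * np.real(np.conj(c[l]) * c[k] * np.sum(v * nac[l, k]))
        g[l] = max(0.0, dt * b_lk / a_kk)
    return g

# Illustrative two-state, one-atom example with made-up numbers.
c = np.array([0.9 + 0.0j, 0.1 + 0.3j])
v = np.array([[1.0e-3, 0.0, 0.0]])
nac = np.zeros((2, 2, 1, 3))
nac[0, 1, 0, 0], nac[1, 0, 0, 0] = 5.0, -5.0   # antisymmetric NAC
g = fewest_switches_prob(c, v, nac, active=0, dt=20.0)
print(g)  # g[1] is the probability of hopping from state 0 to state 1
```

In an actual surface hopping run, g would be compared against a uniform random number at every time step to decide whether the trajectory switches its active state.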
The bottleneck of approaches that require NACs is that the computation of the couplings remains one of the most expensive parts of a quantum chemical calculation. The computational effort to compute a NAC vector is comparable to that of a force calculation. However, more NACs are present than there are forces, i.e., N_S(N_S - 1)/2 NACs need to be computed, whereas only N_S forces are needed (each with entries for the Cartesian coordinates of every nucleus). Note that in the case of PESs fitted with ML, all of these vectors have to be computed for each data point. Conventional approaches with an ab-initio on-the-fly evaluation of the PESs can make use of the fact that only one active state needs to be considered at a certain time step. Many MD programs therefore only require a computation of the forces of the active state and the respective couplings arising from this state.
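The quadratic growth of the coupling count with the number of states, compared to the linear growth of the force count, can be checked quickly (the function name is ours):

```python
def n_nac_vectors(n_states: int) -> int:
    """Number of unique NAC vectors between n_states electronic states,
    i.e., one vector per unordered pair of states."""
    return n_states * (n_states - 1) // 2

# With 3 states, the 3 NAC vectors merely match the 3 force vectors,
# but with 5 states the couplings (10) already outnumber the forces (5).
print(n_nac_vectors(3), n_nac_vectors(5))  # 3 10
```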
Note that despite the benefits of MQCD simulations, they obey micro-reversibility only approximately, 418 and effects due to coherences or tunneling necessitate additional considerations as a consequence of the classical treatment of the nuclear motion. 419 A more approximate approach is the Ehrenfest dynamics method, also referred to as the mean-field trajectory method. It is often used for large systems and also frequently in material science. 178,189 The Ehrenfest method is based on the approximation that the nuclei move classically on an average potential, rather than switching from one specific state to another. 314,420,421 Because each electronic state is treated separately, surface hopping methods can accurately describe the bifurcation into different reaction channels, while such effects are neglected in a mean-field treatment of the PESs.
The main limitation of MQCD approaches is the expensive evaluation of ab-initio potentials, which allows dynamics simulations only for up to a couple of picoseconds. In addition, rare reaction channels are hardly explored as a result of usually poor statistics. 36,422,423 In this sense, MQCD simulations offer a perfect place for ML to enter this field of research and advance it significantly. The fast evaluation of ML PESs can help to explore different reaction channels and to obtain accurate reaction kinetics. Observables and macroscopic properties can be computed directly or with post-processing and analysis runs, offering another entry point for ML. The computed observables can then be compared directly to experiments.

Dipole Moments and Spectra
An important property for comparing experiment and theory is the dipole moment. The permanent dipole moment of the ground state is a frequent target of studies with ML. 109,424-435 The permanent dipole moment, µ_i (or µ_ii), of a state i can be obtained via the dipole moment operator (see eq. (12) below) or as a sum over all N_A atoms of the partial charges, q_a,i, of atom a in state i, multiplied by the vector r_a that points from the center of mass of the molecule to the position of atom a: µ_i = Σ_a q_a,i r_a. It can be used for the computation of infrared spectra with MD simulations. The spectrum is then obtained as the Fourier transform of the time auto-correlation function of the time derivative of the dipole moment. 436 In contrast to the ground state, excited-state simulations often make use of the transition dipole moments, which are computed from the dipole moment operator within many quantum chemistry programs. The ground-state dipole moment can differ strongly from those in the excited states, due to a frequency shift and an altered electron distribution upon light excitation. 437 Transition and permanent dipole moments can be fit with the charge model of ref 109, where point charges are never learned directly, but instead are inferred as latent variables by an NN dipole model making use of r_a. 435 Noticeably, the computation of absolute values of permanent and transition dipole moments is very challenging even when highly accurate quantum chemistry methods are employed, and experimental values are hardly reproduced. 93,438 However, experimental studies also provide absolute values only in a few cases. Most computational studies therefore do not aim to reproduce the absolute values of transition dipole moments but rather use relative values to obtain reasonably accurate absorption spectra, which can be compared to experiments. 56,240,242,439-441 Since many molecules absorb in the UV, the terms UV spectra and absorption spectra are often used interchangeably. However, absorption can take place in many regions of the electromagnetic spectrum, including, e.g., X-rays, where core electrons rather than valence electrons are excited. 31 As already mentioned, absorption spectra can be obtained from a calculation of excited-state energies and oscillator strengths, which are proportional to the squared transition dipole moments. Noticeably, the transition dipole moment is only defined up to an arbitrary sign as a result of the arbitrary phase of the wave function (see section 4.2). To circumvent this ill-definition, oscillator strengths or the lengths of the dipole vectors can be fitted with ML. However, this workaround can be problematic if explicit field-dipole interactions should be considered with ML models.
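As a sketch of how such a spectrum can be assembled from excitation energies and transition dipoles, the following uses the standard oscillator strength expression f = (2/3) ΔE |µ|² (atomic units) plus Gaussian broadening; all molecular numbers are made up, and the function names are ours. Note how squaring the transition dipole makes its arbitrary sign drop out:

```python
import numpy as np

def oscillator_strength(delta_e, trans_dip):
    """f = (2/3) * dE * |mu|**2 in atomic units; the arbitrary sign of
    the transition dipole vector cancels in the square."""
    return (2.0/3.0) * delta_e * np.sum(np.asarray(trans_dip)**2)

def gaussian_spectrum(energies, strengths, grid, width=0.1):
    """Broaden a stick spectrum (energies on the same scale as grid,
    e.g., eV) with Gaussians of the given width."""
    spec = np.zeros_like(grid)
    for e, f in zip(energies, strengths):
        spec += f * np.exp(-((grid - e)**2) / (2.0*width**2))
    return spec

# Hypothetical molecule with two excitations at 4.2 and 5.6 eV.
f1 = oscillator_strength(0.154, [0.8, 0.0, 0.1])   # dE in hartree
f2 = oscillator_strength(0.206, [-0.5, 0.2, 0.0])  # sign is irrelevant
grid = np.linspace(3.0, 7.0, 400)
spectrum = gaussian_spectrum([4.2, 5.6], [f1, f2], grid)
```

Because only |µ|² enters, such a spectrum can be predicted by ML models that fit oscillator strengths or dipole lengths directly, sidestepping the sign problem discussed above.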

Data Sets for Excited States
The basis of any successful ML model is a training set that describes the required conformational space of a molecule comprehensively and accurately, with as little noise as possible. 442 While electronic structure theory for ground-state problems is almost free of noise, the same cannot be said so easily for problems in the excited states. "Bad points with abrupt changes" 14 within ab-initio calculations for the excited states are frequently observed; they can occur even far away from any critical point of the PESs and are difficult to detect. 13,14,92 The amount of noise in the reference data does not only depend on the chosen method (and, in the case of multi-reference methods, on the selected active space), but also on the number of electronic states considered and on the photochemistry of the molecule under investigation.
In general, the computation of excited-state PESs is much more expensive than the computation of the ground-state potential of the same molecule. Not only do highly accurate ab-initio methods have to be applied for many systems, but forces and couplings are also required for the considered states. A high density of electronic states in a molecular system can thus increase the costs of a calculation considerably. In this regard, an active, efficient, and meaningful training set generation is indispensable, especially when photodynamics simulations are the target of a study.
Keeping in mind that the quality of the reference data confines the quality of an ML model, several key questions can be identified when designing a study based on ML potentials. We believe the following questions to be important for the selection of a suitable reference method: 1) What is the goal of the ML model and what properties must it predict in order to benefit from the advantages that ML can offer? Are only energy gaps between the different electronic states and the electronic ground state necessary, or are gaps between other states and couplings between them also relevant? Especially the description of couplings requires further consideration, as they cannot be calculated with all quantum chemistry methods and additionally face the problem of random sign jumps along different reaction coordinates. 90,92,454 2) How many excited states are relevant and which method is computationally affordable to treat the required number of states? A comparison with experiment and the computation of vertical excitation spectra with reference methods can help to answer this question.
3) How large is the system under investigation and how complex are the excited-state processes that are considered to be important? This question is important in order to identify whether single-reference methods like LR-TDDFT or ADC(2) make sense for certain reactions that might occur. While large and flexible molecules with many energetically close-lying states can give rise to a multifaceted photochemistry including dissociation, homolytic bond breaking, and bond formation, the dynamics of rigid molecules might be dominated by only one main reaction channel, with the excess energy dissipated in the form of molecular vibrations. The complexity of the excited-state processes can help to estimate the number of data points necessary to describe the relevant configurational space of the molecule.
In case multi-reference methods are necessary to describe many different excited-state processes of a molecule, the training set generation can become infeasible. For example, 356 data points were computed for the 15-atom cyclopentoxy molecule with MR-CISD(5,3)/cc-pVD(T)Z. 94 The respective calculations comprised 19,302,445 configuration state functions, and one reaction coordinate could be fitted in the diabatic basis. We ran into a similar problem when fitting the excited states of the amino acid tyrosine, containing 24 atoms, which also requires a multi-reference treatment. The size of the active space and the number of states needed for an accurate description made multi-reference methods like CASSCF or CASPT2 computationally too expensive, see Fig. 5. In these cases, the computation of an ample training set is far too expensive with multi-reference methods, and the quantum chemistry calculations remain the bottleneck even when using ML.
In addition to the aforementioned intricacies of building up a meaningful, yet accurate training set for the excited states, the process is further complicated by the arbitrary phase of the wave function. As a consequence, excited-state properties resulting from two different electronic states, such as transition dipole moments or couplings between different electronic states, 13,14,90,92,93,454 are not uniquely defined and cannot simply be fitted with conventional ML models. Either an additional data preprocessing or an adaptation of the learning algorithm has to be incorporated to render the data learnable with ML models.

Phase of the Wave Function
In contrast to ground-state properties, excited-state properties such as transition dipole moments, NACs, or SOCs arise from two different electronic states. As a consequence of the arbitrary phase of the wave function of each electronic state, properties resulting from two different states carry an arbitrary sign, which makes them generally double-valued. In the case of vectorial properties, such as dipole moments or coupling vectors, the whole vector can be multiplied by +1 or -1 and is still a valid solution. Similarly, single-valued properties, such as SOCs obtained from electronic structure programs, can be multiplied by +1 or -1 and are equally correct. This additional complexity prevents conventional ML algorithms from learning such raw quantum chemistry data and hampers the training process in finding a proper relation between a molecular geometry and the excited-state property. 92,454 A one-dimensional example of this problem is illustrated in Figure 8 for the NAC (exemplified using one single value along the reaction coordinate) that couples an excited singlet state, S_i, and a second excited singlet state, S_j. A positively signed function of the atomic coordinates is shown by dashed blue lines with a cusp at the point at which the two singlet states are degenerate. Such a smooth function (apart from the sharp spike at the conical intersection) is highly desirable when fitting with ML models is the aim. It is worth mentioning that a consistently negative sign (light-blue dashed line) along this reaction coordinate is equally correct and that it is desirable to seek one global sign. However, the direct output of a quantum chemistry program along this reaction coordinate looks more similar to the dashed magenta line in between the blue curves. As one can imagine, no proper training can be guaranteed with these inconsistent data.
Note that existing MD programs for the excited states usually track such phase jumps within the electronic wave functions in order to account for nonadiabatic transitions correctly. 242 The idea of phase tracking can also be applied in ML in order to thwart the problems due to the arbitrariness within coupling or dipole elements. Some algorithms have been developed to remove the arbitrary sign jumps and provide smooth functions of the atomic coordinates. 13,14,92,455 Noticeably, the properties obtained after a transformation to the diabatic basis are already smoothly varying functions of the atomic coordinates. 336 However, the challenges arising due to the arbitrary phase of the wave function still persist, because the inconsistencies within the adiabatic properties have to be removed in order to make the diabatization process feasible. 14,90 It is worth mentioning at this point that another kind of phase exists that cannot be eliminated in the aforementioned way. It is called the Berry phase or geometric phase. After performing a loop in space around a conical intersection and returning to the original point, a change of π in the phase of the wave function can be observed, i.e., the same wave function is only recovered after two loops around the conical intersection. Neglecting this effect can lead to false transition probabilities, depending on the dynamics method and the system. While in most MQCD cases the Berry phase can be safely neglected, this is not possible in quantum dynamics simulations. A diabatic basis is advantageous in this case, because the Berry phase is absent in this picture. However, the Berry phase has to be kept in mind when fitting diabatic potentials. 456-460

Phase Correction of Adiabatic Data
The first ML studies on dynamics in the adiabatic basis omitted a preprocessing and were unable to reproduce reference results based on ML alone, 138 or avoided the phase problem by using the Zhu-Nakamura method. 137,139 Evidently, potentials and forces can be learned with conventional ML approaches, but adaptations or a preprocessing of the data are necessary to learn coupling elements or transition dipole moments. Independent of the purpose, i.e., the fitting of adiabatic quantities 92,454 or the diabatization of adiabatic data with property-dependent diabatization schemes, 14 the adiabatic data have to be corrected to remove the arbitrary sign jumps that are due to the arbitrary phase of the wave function. Several ways for these corrections exist, which have been shown to work well for different excited-state problems.
One possibility is to preprocess the data according to the wave function overlap between the wave functions of a geometry of interest and a reference geometry for each electronic state. This process is termed phase correction 242,454 and has been applied by us in order to generate a training set for three singlet states of CH2NH2+ 92 and two singlet and two triplet states of CSH2. SOCs, 13 NACs, 13,92,93 and transition dipole moments 92,93 could be fitted in the adiabatic basis with deep NNs and kernel ridge regression (KRR). 13,92,93 Very recently, Zhang et al. 95 applied this procedure to describe transition dipole moments of N-methylacetamide.
The wave function overlap matrix, S, with size N_S × N_S, is computed between two molecular geometries α and β as S_ij = ⟨Ψ_i(α)|Ψ_j(β)⟩. 461 In many cases along a given reaction path, the off-diagonal elements of the overlap matrix are very close to zero and the diagonal elements are very close to +1 or -1, indicating whether the phase of a state has changed along this path or not. Whenever a new state enters along the reaction path or adiabatic states switch their character, which is common after passing through a conical intersection, for example, the off-diagonal elements provide the relevant phase information instead of the diagonal elements. Taking all these effects into account, a phase vector, p, can be derived for each given molecular geometry. A property resulting from the electronic states i and j then has to be multiplied by the corresponding phase factors of these states. 92 An advantage of this algorithm is that it does not require any manual fitting of the data. However, this procedure has to be carried out for every data point included in the training set with respect to one pre-defined reference wave function. This reference wave function can be, for example, the wave function of the ground-state equilibrium structure of the molecule and needs to be identified to guarantee an almost globally consistent sign of the elements. During a photo-initiated simulation, it is common that geometries quickly start to differ from the reference geometry. The wave function overlap then tends to zero and cannot provide information about the correct sign of a certain electronic state. In this case, the phase must be propagated from the reference geometry onward over n interpolation steps. The phase vector applicable for the correction of the data point to be included in the training set is then obtained as the element-wise product of all previously obtained phase vectors, p_0 to p_(n-1). Intruder states prohibit a proper tracking because their wave function is absent at the earlier geometries.
Hence, a phase correction may be rendered infeasible for systems with a high density of states.
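A minimal sketch of this overlap-based phase tracking might look as follows; the function names are ours, and picking the previous state with the dominant |overlap| is our simplification of how state crossings are handled:

```python
import numpy as np

def phase_vector(overlap, prev_phases):
    """Derive per-state phase factors (+1/-1) from a wave function
    overlap matrix S between a previous and the current geometry.
    For each current state, the previous state with the largest
    |overlap| carries the phase information; this also covers state
    crossings, where the dominant overlap is off-diagonal. Chaining
    prev_phases realizes the element-wise product p_0 ... p_(n-1)."""
    n = overlap.shape[0]
    phases = np.ones(n)
    for j in range(n):
        i = np.argmax(np.abs(overlap[:, j]))    # dominant previous state
        phases[j] = np.sign(overlap[i, j]) * prev_phases[i]
    return phases

def correct_coupling(value_ij, phases, i, j):
    """Multiply an inter-state property by the phases of both states."""
    return phases[i] * phases[j] * value_ij

# Step 1: state 1 flips its sign (negative diagonal overlap).
S1 = np.array([[0.99, 0.01], [0.02, -0.98]])
p = phase_vector(S1, np.ones(2))
# Step 2: the two states switch character (dominant off-diagonals).
S2 = np.array([[0.03, -0.97], [0.98, 0.04]])
p2 = phase_vector(S2, p)                        # phases propagate onward
```

A raw NAC or transition dipole element between states i and j at the current geometry would then be multiplied by `phases[i] * phases[j]` before entering the training set.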
In order to obtain the correct phase, more states can be included in the simulations, which however increases the computational cost. A solution is to take many electronic states into account only close to the reference geometry. The number of states can then be reduced along a given reaction coordinate and relevant states can be disentangled from irrelevant ones. Further, it makes sense to save the already phase-corrected wave functions of several geometries in addition to the reference geometry. Whenever a new data point is to be included in the training set, the distance to each saved data point can be computed in order to find the closest available structure and reduce the number of interpolation steps. 92,400 This problem has also been recognized by Robertson et al. 358 for a diabatization process, where a sufficiently large vector space of the CAS wave function is required for a proper diabatization. The overlaps of the electronic states can be maximized by rotation of the CI vectors of the CAS wave function states. A similar way of using the information of CI vectors for diabatization was applied by Williams et al., 140 who used NNs to assist the diabatization of adiabatic NO3 potentials.
Another way to correct the sign of data points was pursued by Guan et al., 14 who fitted diabatic PESs and dipole moment surfaces of the 1¹A and 2¹A states of NH3 from MR-CISD/aug-cc-pVTZ data with NNs. The diabatic PESs were taken from a previous study and obtained with the Zhu-Yarkony diabatization procedure. 462-464 By diagonalization, the rotation matrix defined in eq. (10) could be obtained, which connects the diabatic and the adiabatic basis (see eq. (9)). The adiabatic dipole moments, µ_MCH, could then be transformed into the diabatic basis using the unitary matrix, U. As the unitary matrix U is only defined up to an arbitrary sign, the signs of the diabatic dipole moments have to be corrected in order to provide a consistent diabatic dipole moment surface. This correction has been done with a so-called cluster-growing algorithm. 455 The cluster-growing algorithm requires an initial set of phase-corrected data points. In this work, 347 data points were adjusted manually for this purpose. Subsequently, a Gaussian process regression (GPR) model 465 was fitted to these data points. The signs of the remaining data points were then adjusted with the GPR model. Several iterations were carried out, where each iteration aims at the inclusion of points lying close to the cluster, which lends the "cluster growing" algorithm its name. 146 The singularities in regions close to conical intersections can make this algorithm fail. Therefore, data points in such regions were removed by setting a threshold: data points with energy gaps lower than this threshold were excluded from the cluster. The regions around conical intersections could therefore not be fitted as comprehensively as other regions of the PESs. As another drawback, the authors note that the initial manual fitting of the signs is a tedious task, especially when larger systems and more dimensions are described.
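The cluster-growing idea can be sketched as follows; note that a simple local linear predictor stands in for the GPR model of the original work, the function names are ours, and the one-dimensional data are purely illustrative:

```python
import numpy as np

def grow_cluster(x_seed, y_seed, x_rest, y_abs_rest):
    """Sketch of a cluster-growing sign correction: starting from a small
    set of manually sign-corrected points (x_seed, y_seed), predict the
    value at each uncorrected point (here with a local linear fit through
    the two nearest cluster points, standing in for a GPR model) and pick
    the sign of the known magnitude |y| that best matches the prediction;
    corrected points then join the cluster."""
    xs, ys = list(x_seed), list(y_seed)
    # Process points in ascending order; the seed sits at small x, so
    # this grows the cluster outward from the already-corrected region.
    for x, y_abs in sorted(zip(x_rest, y_abs_rest)):
        d = np.abs(np.array(xs) - x)
        i1, i2 = np.argsort(d)[:2]                 # two nearest members
        slope = (ys[i1] - ys[i2]) / (xs[i1] - xs[i2])
        pred = ys[i1] + slope * (x - xs[i1])
        sign = 1.0 if abs(y_abs - pred) <= abs(y_abs + pred) else -1.0
        xs.append(x)
        ys.append(sign * y_abs)
    return np.array(xs), np.array(ys)

# Smooth target y = sin(x): feed only unsigned magnitudes and let the
# growing cluster recover the signs, including across the zero crossing.
x_seed = np.array([0.5, 1.0])
y_seed = np.sin(x_seed)                            # "manually corrected"
x_rest = np.linspace(1.2, 6.0, 25)
xs, ys = grow_cluster(x_seed, y_seed, x_rest, np.abs(np.sin(x_rest)))
```

As in the original algorithm, the scheme relies on the underlying surface varying smoothly; near degeneracies, where the magnitudes become small and the prediction uncertain, such a procedure can assign wrong signs, which is why points close to conical intersections were excluded by a threshold.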
Two of the authors also fitted diabatic PESs of two singlet states and one triplet state, as well as the SOCs between the singlets and the triplet, of formaldehyde, CH2O, with NNs. 90 The electronic structure reference method was MR-CISD/cc-pVTZ. The diabatic potentials were obtained using an adapted version of the Boys localization. 351 The energy differences between two states are incorporated in the equations in order to remove earlier identified diabolic singularities. 146 The rotation angle for the diabatization covers a range of π, which guarantees a proper treatment of the Berry phase. The diabatization procedure further requires consistent transition dipole moments, which were adjusted manually for this purpose. The diabatic SOCs were then obtained as a linear combination of the adiabatic SOCs by applying the same rotation matrix as for the energies. A separate NN was used to fit each coupling value and each electronic state.
It becomes clear that only a small number of works on this topic exist. At the moment, many problems remain unsolved for generating a training set that properly accounts for both types of phases, the arbitrary phase and the Berry phase, and is applicable for large systems with many states. An automatic phase correction procedure without the need of manual input would be very advantageous, especially when larger and more flexible systems are treated. Further developments are needed.

ML-Based Internal Phase Correction
One step toward a routine application of ML for photochemical studies and an easier training set generation with quantum chemistry is an ML-based internal phase correction, which we have implemented in the SchNarc approach for photodynamics simulations. 13 In contrast to the phase correction algorithm that corrects the training data, this procedure renders the learning of inconsistent quantum chemical data possible. A modification of the training process, termed phase-free training, is required for this purpose. 13 We implemented this training algorithm in a combination of the deep continuous-filter convolutional-layer NN SchNet, 428,432 adapted for excited states, and the MD program SHARC. 242,344,466 Similar to standard training algorithms, the parameters of an ML model are optimized in order to minimize a cost function. Most frequently, the L_1 or L_2 loss functions are applied, which take the mean absolute error or the mean squared error between predicted and reference data into account. The phase-free training algorithm uses a phase-less loss function, which includes all trained properties at once and additionally removes the influence of the random phase switches. In this way, the computational costs for the training set generation can be reduced.
Compared to previously reported ML models for photochemistry, where each state was fitted independently, 14,90,139 SchNarc is capable of describing all PESs at once, including the coupling elements resulting from different pairs of states. This results in an overall loss function with several terms, where each term is weighted with a different trade-off value, t, that can be defined manually. If only energies (E) and forces (F) are fitted, the loss function is a linear combination of L2 loss functions for energies and forces. 13,109 Analogous terms are added for the SOCs and NACs. The error for SOCs and NACs that enters the loss function is the minimum error that can be achieved when trying out all possible combinations of phases for each pair of states, i.e., 2^(N_S−1) possible solutions. The algorithm takes into account that the signs of SOCs and NACs coupling different pairs of states depend on each other, which yields error functions containing all possible phase solutions for the SOCs, ε_phase^SOC, and the NACs, ε_phase^NAC. This phase-less loss procedure does not require any preprocessing of the training data; quantum chemistry calculations can be fitted directly with this adaption of the loss function. The power of this approach is that, once a phase vector for a data point has been found, it can be directly applied to correct the arbitrary signs of other properties, such as transition dipole moments. If other properties are targeted, the loss function applied for NACs can similarly be used for other vectorial properties, and the loss function applied for SOCs for any other single- or complex-valued element of arbitrary sign. 13 However, as a consequence of the higher complexity of the loss function, the training process is generally more expensive. The computational effort required for training can be reduced if only one type of coupling is treated within MD simulations.
In these cases, a simpler adaption of the phase-free loss is also applicable. 13
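The minimum-over-phase-vectors error described above can be sketched in a few lines. This is a minimal illustration, not the SchNarc implementation: the coupling-matrix layout, the fixed phase of the first state, and the plain L2 error are simplifying assumptions.

```python
import itertools
import numpy as np

def phaseless_coupling_loss(c_ref, c_pred):
    """Minimum L2 error over all phase assignments.  c_ref and c_pred are
    (n_states, n_states) matrices of coupling values; the wavefunction phase
    of each state is arbitrary, so the sign of a coupling between states i
    and j flips with p_i * p_j.  Fixing the phase of state 0 leaves
    2**(n_states - 1) sign vectors to try."""
    n = c_ref.shape[0]
    best = np.inf
    for signs in itertools.product([1.0, -1.0], repeat=n - 1):
        p = np.array([1.0, *signs])        # phase vector, state 0 fixed
        phased = c_ref * np.outer(p, p)    # apply p_i * p_j element-wise
        best = min(best, np.mean((phased - c_pred) ** 2))
    return best
```

Because the sign of a coupling between states i and j is the product of the two state phases, enumerating per-state signs (rather than per-element signs) captures the interdependence of the coupling signs mentioned above.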

Training Set Generation
The requirements and desirable specifications for a training set can vary strongly, depending on the type of application. When the focus of a study is the investigation of the huge chemical space, the search for certain patterns therein, or the design of new molecules with targeted properties, the training set should usually be as large as possible to cover as many molecules as possible. In the best case, the data points are computed with high accuracy, and this reference method is accurate for the excited states of many different types of systems. In terms of accuracy and general applicability, ab initio methods are more suitable, as they do not require the selection of a density functional, which might be accurate for some cases but fail for others. However, the cost and complexity of highly accurate multi-reference ab initio methods limit their applicability, so that TDDFT remains the method of choice when making predictions throughout chemical compound space. 150,327,467 The most widely applied approach to generate a training set for this purpose is to start from an existing (ground-state) database that already covers a large chemical space of certain types of molecules. In this way, not much effort has to be devoted to the exploration of chemical space and to structure optimizations to obtain the most stable conformations of different molecules.
For the purpose of ML-based excited-state dynamics simulations, things look quite different. Note that for photodynamics simulations, only molecule-specific ML models exist until now; these could potentially develop into a universal excited-state force field, but much remains to be done to achieve this goal. Indeed, the generalization of excited-state PESs and the corresponding couplings is expected to be a highly complex task, especially due to the problematic generalization of excited states. 92 A comparison of the isoelectronic molecules CH2NH2+ and C2H4 can serve as an example. Their conical intersection between the first excited singlet state and the ground state is accompanied by a rotation along the dihedral angle, which could suggest very similar photo-initiated processes. However, the higher-lying excited states are ordered completely differently in the two molecules, and excitation leads to completely different photodynamics. 92,[468][469][470][471][472][473][474][475][476][477][478] As it stands, existing ML models for photodynamics simulations are developed to investigate the photo-initiated processes of one specific molecule.
Overall, we arrive at the following wish list for the training set, which has also been identified for MD in the ground state: 101,115,479,480 1) The training set should be as small as possible to keep the number of reference calculations at a minimum. 2) At the same time, the relevant conformational space of the molecule that is required for the reaction under investigation should be sampled comprehensively. 92,115,328,442,480 Keeping this in mind, an efficient procedure to obtain relevant molecular structures has to be applied. A large number of schemes to achieve this goal have been proposed, which are mainly based on two different strategies. One approach is to simulate MD in the ground and excited states with the reference method and put much effort into covering critical regions of the PESs comprehensively. [137][138][139] Structure-based sampling or subsequent clustering is beneficial in this case. 137,138,[481][482][483] The other strategy is to use an active learning approach, which decreases the number of necessary reference calculations considerably, but is usually more time-consuming. 480 Notably, within ML for quantum chemistry, active learning often refers to an approach where an initial training set is used to fit an ML model and this previously learned information is applied to expand the training set. 402 The latter approach is often carried out with the help of MD simulations, but has also recently been adapted in a trajectory-free way. 402,484

Basic Sampling Techniques and Existing Databases
To find patterns within certain groups of molecules, to explore chemical space, and to develop new methods that can fit, for example, different properties of molecules, such as the valence density used in DFT, 81 or large molecules from small building blocks, 107 a good starting point is often considered to be an already existing database. Prominent examples are the QM databases, namely QM7, QM7b, QM8, and QM9, 424 which have been used in a large number of publications to date and provide a benchmark for many ML studies. 12,150,428,433,434,446,[485][486][487][488][489] Especially the QM9 424 data set, containing more than 133k small organic molecular structures and corresponding DFT energies, enthalpies, harmonic frequencies, and dipole moments (to name only a few properties), is very popular in the scientific community and has also been used in challenges on Kaggle, where researchers and laypersons from all over the world can compete against each other to find the most suitable solution to a given task. Prizes of up to several thousand dollars are quite common. 490 In a similar spirit, the QM9 IPAM ML 2016 challenge required predicting the energies of QM9 from only 100 training points within chemical accuracy (error of ≈0.05 eV). 491 All aforementioned databases originate from the GDB databases, [492][493][494] and are often subsets thereof. The chemical universe GDB databases have been designed using molecular graphs to sample a comprehensive space of molecular structures for the search of new lead compounds in drug design. 494 One of the first databases available to the scientific community that treats the excited states of molecules is most probably the QM7b 495 data set, which contains the excitation energies computed with TDDFT for a total of >14k molecules composed of the atoms C, N, O, H, S, and Cl. This data set is based on the molecular geometries of the QM7 100,494 data set plus an additional 7211 molecules containing a chlorine atom.
The excitation energies of the first singlet state and other properties were recomputed for each optimized molecular geometry. In a similar fashion, the QM8 327 database was developed, based on the GDB-17 database. 496 This data set can be used for the computation of vertical excitation spectra. It hence includes not only the vertical excitation energies of the first excited singlet state, but also the corresponding oscillator strengths. Oscillator strengths are also reported in an autogenerated data set for optoelectronic materials with DFT. 467 Note that the oscillator strength is computed from the squared transition dipole moment, hence any arbitrary phase factor cancels out and the data does not have to be preprocessed. In addition to the TDDFT energies, CCSD energies are reported, which enabled the development of the so-called Δ-learning approach, a powerful way to obtain the accuracy of highly accurate ab initio methods with only a small number of respective reference calculations. Two ML models are trained in this approach, one on a less accurate method and another on the difference between the less accurate and the more sophisticated method. 497 This scheme can also be applied multiple times to achieve increasing accuracy with little additional computational effort 328 and has been adapted for spectroscopy in the condensed phase as well. 151 The QM9 data set has further been the basis of a very recently constructed data set for singlet and triplet states of >13k carbene structures, termed QMspin. 12 4,000 geometries from the QM9 data set were randomly selected, hydrogen atoms were removed, and singlet and triplet states were optimized using CASSCF(2,2)/cc-pVDZ-F12 and open-shell restricted KS-DFT with the B3LYP 498,499 functional, respectively. The MR-CI method was subsequently used to compute the electronic energies of the singlet and triplet states.
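The Δ-learning scheme can be illustrated with a one-dimensional toy sketch. The "cheap" and "expensive" functions are hypothetical stand-ins for, e.g., TDDFT and CCSD energies, and the plain kernel ridge regression helper is only a minimal example, not any of the published implementations.

```python
import numpy as np

def krr_fit(X, y, sigma=1.0, lam=1e-6):
    """Plain kernel ridge regression with a Gaussian kernel (toy helper)."""
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * sigma ** 2))
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return lambda x: np.exp(-(x - X) ** 2 / (2 * sigma ** 2)) @ alpha

# Hypothetical "cheap" and "expensive" reference methods on a 1D coordinate
cheap = lambda x: np.sin(x)
expensive = lambda x: np.sin(x) + 0.1 * x

X_big = np.linspace(0, 6, 50)     # many points with the cheap method
X_small = np.linspace(0, 6, 8)    # only a few points with the expensive method

base = krr_fit(X_big, cheap(X_big))                              # cheap-method model
delta = krr_fit(X_small, expensive(X_small) - cheap(X_small))    # difference model

predict = lambda x: base(x) + delta(x)    # Δ-learning prediction
```

The difference between the two methods is usually much smoother than the methods themselves, which is why the second model gets away with far fewer expensive reference points.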
This data set has been used to investigate structural and electronic relationships in carbenes, which are important intermediates in many organic reaction networks. 12 The OE62 500 database, a benchmark data set applicable for spectroscopy, is another descendant of several existing data sets, such as the QM8 and QM9 data sets. It consists of >61k organic molecules able to form crystals, including up to 174 non-hydrogen atoms. Reported are the orbital energies of the molecules computed with DFT/PBE. 501 Another database, which also contains excited-state data, is the PubChemQC database. 502 It contains over three million molecules, whose structures are reported along with the energies at the DFT/B3LYP/6-31G* level of theory. In addition, the excitation energies of at least three million structures are reported for the 10 energetically lowest-lying singlet states at the TDDFT/B3LYP/6-31G* level of theory.
A simple strategy was carried out by Kolb et al., 503 who used an existing analytical PES to create an ML potential: They randomly sampled data points, trained an ML model, and added more points in regions with deviations from the original PES. Other strategies have been carried out mainly for the fitting of ground-state potentials and for materials, which are, however, also relevant for the excited states. One novel, suitable strategy is, for example, the "de novo exploration" of PESs using a similarity measure provided by ML models. 504 At least for materials discovery, this method can be used to omit any additional active learning procedure to converge PESs. Another way to build a training set is to employ molecule-generating ML models, 164,505,506 such as the recently developed G-SchNet. 507 Alternatively, MD simulations with the reference method can provide a good starting point for training. 120,486,508 Ye et al. 509 sampled 70k conformations of N-methylacetamide via MD simulations with the OPLS force field 510 within GROMACS 511 for subsequent UV spectra calculations. We have applied a similar scheme to generate a training set of SO2 based on an LVC model. 229 Surface hopping MD simulations with the SHARC method 242,344,466 were carried out with the reference method LVC(MR-CISD), ending up in >200k data points of different conformations of SO2. 13 Due to the crude sampling and low cost of the reference method, no emphasis was put on clustering the training set into a smaller, still comprehensive set.
90k data points were required in an ML-based surface hopping study of CH2NH with the Zhu-Nakamura method. Reference data for the ground and first excited singlet state, S0 and S1, were generated with CASSCF(2,2)/6-31G via ground-state and surface hopping MD simulations. The latter method was applied to sample the regions around conical intersections between the S0 and S1 states. 139 Similarly, Hu et al. 137 sampled 200k data points of 6-aminopyrimidine using ground-state and surface hopping MD with CASSCF(10,8)/6-31G*. State-averaging over three singlet states was applied. In addition, structures that led to hops between different states were used as starting points to find minimum-energy conical intersections, and clustering was carried out to reduce the amount of data for training.
One way to select data points more efficiently is a structure-based sampling scheme, as proposed for instance by Ceriotti et al. with sketch-map, 481,512,513 an algorithm for dimensionality reduction of atomistic MD simulations or enhanced sampling simulations. Likewise, Dral et al. 138 applied a grid-based sampling method to construct the PESs of a model spin-boson Hamiltonian in order to execute surface hopping MD with KRR. The energetically low-lying regions of the PESs were first sampled via an inexpensive method and subsequently the distances between the molecular structures were computed. In this way, 10,000 data points were obtained. 138,482 ML models trained on only 1,000 data points were accurate enough to reproduce the reference dynamics. This approach was compared with random sampling for the methyl chloride molecule and was shown to reduce the amount of training data needed by up to 90% for static calculations. 482,483
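Structure-based selection of this kind can be sketched with greedy farthest-point sampling. This is an illustrative variant, not the sketch-map or grid-based algorithms themselves, and `D` is an assumed precomputed pairwise-distance matrix between candidate geometries.

```python
import numpy as np

def farthest_point_sample(D, n_select):
    """Greedy structure-based selection: D is a pairwise-distance matrix
    between candidate geometries (e.g. computed from a structural
    descriptor).  Starting from candidate 0, repeatedly pick the candidate
    farthest from everything already selected, so that near-duplicate
    conformations are skipped."""
    selected = [0]
    for _ in range(n_select - 1):
        dist_to_set = D[:, selected].min(axis=1)   # distance to nearest chosen point
        selected.append(int(dist_to_set.argmax()))
    return selected
```

On a toy set of four 1D "geometries" at 0, 1, 2, and 10, selecting two points picks the extremes (indices 0 and 3) and only then fills in the interior, which is exactly the behavior that keeps the number of redundant reference calculations down.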

Active Learning
As shown in the previous section, training sets built around the equilibrium structures of a large number of molecules are very powerful for investigating the huge chemical space or for the design of new molecules. However, the usefulness of such training sets for photodynamics is rather questionable. The reason for this deficiency is that, especially in MD simulations in the excited states, the excess energy carried by a molecule very quickly leads to conformations that are far beyond the equilibrium structure and most likely far away from the originally sampled structures. The formation and breaking of bonds is quite common in photodynamics simulations and is usually only accessible from an excited, dissociative state. The use of photodynamics simulations with the reference method could solve this problem, but is not feasible if specific reactions occur on a rather slow time scale or if many different processes take place. 27,28,36,37,171,400 As previous studies have shown, inefficient sampling techniques lead to a huge amount of data, which still does not guarantee that the training set is comprehensive enough for excited-state MLMD simulations. In fact, ML models fail dramatically in under-sampled and extrapolative regions of the PESs. A smarter sampling technique is advantageous in these cases in order to efficiently identify such under-sampled regions and build trustworthy ML models.
Active learning, where the ML model "asks" for its training data, is one solution to create a data set more efficiently. An example from chemistry is the adaption of an initially generated training set based on an uncertainty measure of ML models trained on this initial set. This concept was introduced as early as 1992 as query by committee 514 and was adapted for quantum chemistry rather quickly due to the required fitting and interpolation of PESs for grid-based quantum dynamics simulations. Pioneering works by Collins and co-workers 148,149,377,515 applied modified Shepard interpolation to fit PESs and iteratively adapt them in out-of-confidence regions using the GROW algorithm. 515,516 Since then, several sampling techniques have been developed that are based on MD and an extension of databases using interpolating moving least squares, 517,518 permutationally invariant polynomial fitting, 519,520 and different ML models for the ground state 99,101,109,479,480,521-533 and also the excited states. 13,92,138 As active learning starts from already trained ML models, an initial training set has to be provided. Some strategies to provide this initial reference data set are discussed first, followed by strategies applied to adapt this initial training set. Note that all previously discussed methods can similarly be applied to generate an initial training set.
Initial training set
In general, an initial training set can be obtained in many different ways. As photo-initiated MD simulations usually start from a vertical excitation of the ground-state equilibrium geometry, this structure is commonly used as the starting point and reference geometry for the training set generation. In principle, any technique can then be applied to add conformations and obtain a preliminary training set. A good starting guess is to use the normal modes of a molecule, as they are generally important for dynamics. In two recent works, we carried out scans along different normal modes and combinations thereof to sample conformations of small molecules. 13,92 Normal modes are also sampled for generating the ANI-1 NN PESs. 113 For the excited states, it is favorable to include critical regions of the molecule in the initial training set by optimizing these geometries and including the calculations in the training set. 92,137 When small molecules are targeted, this initial training set can already be comprehensive enough to start the training of ML models and adapt the training set based on an uncertainty measure provided by the ML models. 92 In case more flexible and larger molecules are studied that give rise to a complex photochemistry and a high density of states, including different spin multiplicities, a small initial training set might not be sufficient and a larger conformational space of the molecule needs to be sampled. This can be done, for example, via Wigner sampling 534 and also with MD simulations in the ground state. 535,536 Suitable methods are, for example, umbrella sampling, 537 trajectory-guided sampling, 538 enhanced sampling 539 or metadynamics 540 in combination with a cheap electronic structure method like the semi-empirical tight-binding based quantum chemistry method GFN2-xTB 541 or existing ground-state force fields.
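A normal-mode scan of the kind mentioned above might be sketched as follows. This is purely illustrative; `x_eq`, `modes`, and `amplitudes` are hypothetical inputs (equilibrium coordinates, unit displacement vectors, and a grid of displacement amplitudes).

```python
import numpy as np

def normal_mode_scan(x_eq, modes, amplitudes):
    """Sketch of an initial training-set generator: displace the equilibrium
    geometry x_eq (flattened Cartesian coordinates) along each normal mode
    by a grid of amplitudes, giving a cheap first guess at the relevant
    conformational space around the Franck-Condon region."""
    geoms = []
    for mode in modes:                 # each mode: unit displacement vector
        for a in amplitudes:
            geoms.append(x_eq + a * mode)
    return np.array(geoms)
```

Combinations of modes can be sampled the same way by displacing along sums of mode vectors; the resulting geometries are then recomputed with the reference method to form the initial training set.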
A large number of different geometries can be created very quickly and inexpensively, which can then be clustered to exclude similar conformations of the molecule and keep the number of reference simulations at a minimum. The selected data points for the training set can then be computed with the chosen reference method, whose accuracy is targeted with ML. Additionally, if certain reaction coordinates have been shown to be important in experiments or previous studies, it is favorable to include data from scans along these reaction coordinates. 94,400 As soon as meaningful ML models can be obtained from the initial training set, active learning techniques can be applied to enlarge the set. The number of data points that turns out to be sufficient for the initial training set depends on many factors, such as the size and flexibility of the molecule under investigation, the number of excited electronic states described, and the ML model and descriptor applied. 92,93 To give a ballpark figure, we note that we used approximately 1000 data points as the initial training set for small molecules in recent studies using deep multi-layer feed-forward NNs. 13,92
Strategies for actively expanding the training set
The next step in active learning is to expand the initial training set by adding points from out-of-confidence regions. The detection of these under-sampled regions can be done in many different ways, whereby most approaches rely on MD simulations.
Among the most popular strategies is the iterative sampling scheme of Behler, 480 originally developed for fitting ground-state PESs. Today, it is widely used, see for example refs 101, 479, and 542, and has been modified into a so-called adaptive sampling approach. 109 The latter has been adapted by us for the generation of a training set for the excited-state PESs of molecules, including couplings. 92 The basis of almost any iterative or adaptive sampling scheme is a similarity measure to judge whether a molecular geometry can be predicted reliably with ML models or not. While kernel methods intrinsically provide a measure of similarity for each molecular geometry, NNs do not. Therefore, adaptive sampling with NNs requires at least two ML models. In the case of KRR or GPR, two ML models can be used as well, but are not strictly needed. Indeed, the statistical uncertainty estimate of the predictions remains a huge advantage of GPR models. 442,543 The adaptive sampling scheme for the excited states is illustrated in Figure 9 and exemplified with two ML models. The whole process starts with an initial training set, which is used to train the two (or more) preliminary ML models. These models differ in their initial weights or model parameters. The resulting dissimilar ML architectures guarantee that the ML models do not predict the exact same number for a given molecular input. The hypothesis underlying this scheme is that the inferences of different ML models trained on the same training set will be similar to each other as long as an interpolative regime is given. If a molecular input lies in an unknown or under-sampled region of the PESs, the inferences of the ML models are inaccurate and should differ from each other to a much larger extent.
In order to find such regions, sampling steps are carried out, e.g., by running (excited-state) MD simulations based on the mean of the inferences made by the different ML models for the energies, E^ML, forces, F^ML, and, if required, also couplings, C^ML. In each sampling step, the variances of each predicted property are computed over the N_ML different ML models, ζ, used for adaptive sampling. In the present example, energies and forces are treated together as σ^ML_(E+F) (but they can also be treated separately), separately from the variance of the couplings, σ^ML_C. If a variance exceeds a pre-defined threshold, the ML models diverge and the predictions are deemed untrustworthy. Note that the variance is averaged over all states for energies and forces and over all pairs of states for couplings that are described with the ML models. As a variant, each state could also be treated separately. However, as the different electronic states are not independent of each other, a mean-treatment is assumed to be advantageous. 93 Each data point for which a property is predicted with a variance larger than the pre-defined threshold is recomputed with the reference quantum chemistry method and added to the training set. In this way, under-sampled or generally unknown regions of the PESs are identified. Whenever the variance of each property is within the range that is deemed reliable, the mean of the inferences is forwarded to the MD program to propagate the nuclei and continue the MLMD simulations. The name adaptive sampling reflects the recommendation to choose a rather large threshold at the beginning of the adaptive sampling procedure and to adapt this threshold to smaller values as the ML models become more accurate and robust. 109 A first estimate for the initial value of the threshold can be obtained from the MAE of the corresponding ML model on the initial training set.
Figure 9: Adaptive sampling scheme illustrated using two ML models (blue and red blocks). The active learning procedure starts from an initial, preliminary training set (yellow), which is used to train the ML models. A sampling step, e.g., a time step of an MD simulation, is executed: The ML models take the molecular geometry of the sampling step as an input and predict the energies of the considered excited states, their derivatives, and additional required photochemical properties. In case the predictions of the ML models are deemed too different, quantum chemical reference calculations are carried out, the ML models are retrained, and the serial steps are carried out again. This procedure is executed until the desired quality of the ML PESs is attained in order to sufficiently describe the chemical problem under investigation.
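A single decision step of such an adaptive sampling loop might look as follows. This is a sketch under the assumption that each ensemble member maps a geometry to a tuple of energies and couplings; the model interface and the thresholds are hypothetical.

```python
import numpy as np

def query_step(models, geometry, thresh_E, thresh_C):
    """One adaptive-sampling decision for an ensemble of ML models.
    Each model maps a geometry to (energies, couplings).  If the ensemble
    variance of a property, averaged over states (or pairs of states),
    exceeds its threshold, the geometry is flagged for a reference quantum
    chemistry calculation; otherwise the ensemble mean is passed on to the
    MD propagator."""
    E = np.array([m(geometry)[0] for m in models])   # (n_models, n_states)
    C = np.array([m(geometry)[1] for m in models])   # (n_models, n_pairs)
    var_E = E.var(axis=0).mean()   # variance averaged over states
    var_C = C.var(axis=0).mean()   # variance averaged over pairs of states
    if var_E > thresh_E or var_C > thresh_C:
        return "recompute", None
    return "propagate", (E.mean(axis=0), C.mean(axis=0))
```

Forces would enter in exactly the same way as the energies; they are omitted here only to keep the sketch short.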
In principle, adaptive sampling can be carried out for every property that should be represented with ML potentials, and is not restricted to energies, forces, and couplings. Similarly, it does not need to be executed with excited-state dynamics, but could also be done with ground-state MD or any sampling method that is considered suitable.
As a negative side effect, this procedure is generally more time-consuming than many other sampling techniques, because the ML models have to be retrained each time a new data point is added to the training set. To apply adaptive sampling more efficiently, it is advantageous to execute not only one ML trajectory, but many hundreds of trajectories in parallel, as is usually done in MD simulations. The ML models should then only be retrained when all ML-based trajectories have reached an under-sampled conformational region. 92,109,480 Despite the higher complexity of adaptive sampling compared to random sampling, it can reduce the number of data points required for MLMD simulations substantially. In this regard, the computational costs for the training set generation can also be kept at a minimum.
Adaptive sampling was carried out successfully to generate a training set of 4,000 data points of CH2NH2+ containing three singlet states and couplings. ML-based surface hopping MD simulations could be carried out on long time scales using the average of two deep NNs. The concept of iterative sampling also proved beneficial for the long MD simulation to guarantee accurate ML potentials throughout the production run. Here, the threshold was not adapted anymore and the MD was continued from the current geometry after a training cycle was completed. 92 In addition, the average of several NNs turned out to be more accurate than the prediction of only one NN, which was also shown in ref 109.
Another quality control, besides the property-based one proposed by Behler, can be obtained by comparing the molecular structures at each time step, as done by Dral et al. 138,482 and Ceriotti et al. 481 A combination of structure-based and property-based detection of sparsely sampled regions of the PESs has been carried out by Zhang et al. and Guo et al. 362,524,[544][545][546] Very recently, an alternative approach that does not require MD simulations has been applied with NNs by Lin et al. 402 It is based on the finding that the negative of the squared difference surface obtained from NNs approaches zero in regions where no data points are available. 518 Therefore, new points can be computed at the minima of the negative squared difference surface of at least two NNs (or, equivalently, at local maxima of the squared difference surface). This method is supposed to be very efficient in cases where different conformations are separated by large energy barriers or where strongly stabilized local minima are common; MD simulations would take a long time to overcome the potential barriers and reach the region of unknown molecular structures. 402 The idea behind this technique is similar to previous works with GPR. A measure of confidence can be provided with GPR models that enables the search for regions with large variance in the ML predictions. In these regions, data points can be added to build up a training set. 484,[547][548][549] Similarly, Bayesian Optimisation Structure Search (BOSS) has been proposed for constructing energy landscapes of organic and inorganic interfaces. 550 A combination of different approaches has also been applied by Häse et al., 160 who fitted TDDFT excited-state energies of a light-harvesting system. Given a large enough, error-free, and comprehensive data set, ML has the potential to determine known and unknown (un)physical laws within the data. 551
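The squared-difference idea can be illustrated with two toy one-dimensional surfaces standing in for two NNs trained on the same data: where both fits agree the squared difference is near zero, and its largest values mark candidate geometries for the next reference calculation (the functions and the disagreement region are hypothetical).

```python
import numpy as np

# Two hypothetical fitted surfaces; they agree everywhere except near x = 4,
# mimicking a region without training data where the fits diverge.
f1 = lambda x: np.sin(x)
f2 = lambda x: np.sin(x) + 0.3 * np.exp(-(x - 4.0) ** 2)

grid = np.linspace(0, 6, 601)
diff2 = (f1(grid) - f2(grid)) ** 2          # squared difference surface
next_point = grid[diff2.argmax()]           # candidate for the next reference calculation
```

Because the maxima of the squared difference surface are located by a direct search rather than by propagating trajectories, this criterion can reach regions behind large energy barriers that MD-based sampling would take a long time to visit.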

ML Models
Besides the training set, which defines the highest possible accuracy an ML model can attain, the type of regressor and the descriptor used to represent a molecule to the ML model also play important roles. 552 Improper choices of regressors and descriptors can result in inaccurate ML models.

ML Models: Type of Regressor
Given the vast number of ML algorithms applied in the field of computational chemistry, one might ask which one to use or adapt for photochemistry. As recent studies applying ML to quantum chemistry have shown, many possible choices of ML approaches exist and there is no single solution. Nevertheless, a trend can be observed: Many studies that use ML in the research field of quantum chemistry employ labelled data sets, i.e., supervised learning techniques. Within supervised learning, one can distinguish between regression and classification. Classification aims at finding patterns and at grouping data into certain clusters. 553 These types of ML models are often used, e.g., in spam filters, in medicine to diagnose diseases, 554,555 or in food research, e.g., to guarantee a certain wine quality or origin. 556 Examples of classification models applied in the field of computational chemistry are support vector machines, random forests, or decision trees, used, e.g., to classify enzymes 557 or for the selection of an active space. 70,558 More often than classification models, regression models are applied to assist the search for a solution of a quantum chemical problem. Regression is used to fit functions that relate a molecular input, X, to a quantum chemical output, Y. The simplest relation that can be assumed is linear, Y = Xw + b (eq 23). Although many quantum chemical problems cannot be accurately described with such a linear function, it can serve as a baseline model to evaluate the minimum accuracy one can obtain. 92,171,553,559,560 The regression coefficients, also known as weights, w, and biases, b, are tailored to the given problem under investigation. In the case of linear regression, ordinary least squares regression can be applied to find these coefficients. The process of finding the optimal relation between X and Y is termed training.
The coefficients are optimized by minimizing a so-called loss function, L, which monitors the error between the original property, Y_QC, and the property predicted by the ML model, Y_ML, with respect to the training instances. Most often, the L1 loss or the L2 loss is used as an indicator of training convergence. The L1 loss monitors the mean absolute error (MAE) and the L2 loss the mean squared error (MSE) of the predictions:
L1 = (1/N_M) Σ_β |Y_QC^β − Y_ML^β|,
L2 = (1/N_M) Σ_β (Y_QC^β − Y_ML^β)².
The Greek letter β runs over all N_M molecules inside the training set. In principle, any error estimate can be used to train an ML model and find suitable regression coefficients. An example specifically developed for excited-state problems is the aforementioned phase-less loss (see section 4.2.2). 13 Such adapted loss functions, and also conventional ones, are employed in different types of ML models. In the following, we focus on the two most widely used models for the description of the excited states: kernel methods and NNs.
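In code, the two loss functions are a direct transcription of their definitions:

```python
import numpy as np

def l1_loss(y_qc, y_ml):
    """L1 loss: mean absolute error over the N_M training molecules."""
    return np.mean(np.abs(y_qc - y_ml))

def l2_loss(y_qc, y_ml):
    """L2 loss: mean squared error over the N_M training molecules."""
    return np.mean((y_qc - y_ml) ** 2)
```

The L2 loss penalizes large outliers more strongly than the L1 loss, which is one reason it is the more common default for fitting smooth PESs.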
Kernel methods
Kernel methods 561 are based on a similarity measure between data points. Examples are KRR and GPR, which go beyond linear regression by applying the kernel trick and ridge regression. Ridge regression is used to find the weights and differs from linear regression by a regularization term, λ:
w = (K + λI)^(−1) Y_QC,
where Y_QC refers to the training data and K to the kernel matrix. The kernel trick makes it possible to apply ridge regression to non-linearly separable data by mapping the data into a higher-dimensional feature space, in which the data points are linearly separable. To this end, a kernel function, k, e.g., a Gaussian or Laplacian, is placed on each compound to measure the distance to all other compounds in the training set. The kernel function defines the non-linearity of the model. A property of a query compound, α, can be obtained as the weighted sum of regression coefficients and kernel instances:
Y_α^ML = Σ_β w_β k(X_α, X_β).
The size of the kernel matrix depends on the number of training points, and hence the depth of the model is inherently linked to the size of the training set, which is why such methods are called "non-parametric". 553,562 An advantage of kernel methods is that they mainly contain two hyperparameters, i.e., internal model parameters, which need to be optimized for proper training. Most important are the width of the non-linear kernel function, σ, and the regularization. The latter is used to prevent the model from overfitting, the case when the model fits the training data, including noise, almost exactly and fails to accurately predict data points not included in the training set but stemming from an interpolative regime. As quantum chemical data is most often noise-free, the regularization term is usually small.
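A minimal transcription of these two expressions, with toy one-dimensional descriptors, a Gaussian kernel, and arbitrarily chosen σ and λ:

```python
import numpy as np

# Toy data: X are 1D "descriptors" of the training compounds, y the reference property.
X = np.linspace(-2, 2, 20)
y = X ** 2

sigma, lam = 0.5, 1e-6
K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * sigma ** 2))  # Gaussian kernel matrix
w = np.linalg.solve(K + lam * np.eye(len(X)), y)                # ridge regression weights

def predict(x):
    """Property of a query compound as the weighted sum of kernel instances."""
    return np.exp(-(x - X) ** 2 / (2 * sigma ** 2)) @ w
```

With the small λ appropriate for noise-free quantum chemical data, the model essentially interpolates the training points; increasing λ trades fit quality for smoothness.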
As the optimization of hyperparameters is often a tedious task, kernel methods with their few hyperparameters are easier to use than, e.g., NNs with many hyperparameters. Nonetheless, kernel methods can provide almost exact solutions to the problems under investigation. 125 A drawback is, however, that the inversion of the kernel matrix can become expensive and even infeasible on current computers, since the memory requirements grow with the training set size. 93 Further, kernel methods are usually defined to map an input to only a single output. Therefore, they can treat only one electronic state at a time in standard implementations and can thus be referred to as single-state models. A single-state treatment requires a separate ML model for each electronic state or for each property resulting from a pair of states, whereas a multi-state ML model describes all electronic states and the properties resulting from different pairs of states at once. 93,400 Hence, in their standard implementation, the treatment of several excited states necessitates the use of several kernel models, which is commonly done in the research field of quantum chemistry. 137,138,147,563,564 The description of forces is possible for the ground state or a single excited state and is implemented, e.g., in the QML toolkit using KRR and the Faber-Christensen-Huang-Lilienfeld (FCHL) representation, 433 in the symmetric gradient domain ML (sGDML) method, 120 or with smooth overlaps of atomic positions (SOAP) 565 for GPR. 119

Neural Networks Another prominent approach in ML is the use of NNs as highly flexible parametric functions, which can fit huge amounts of data and map a molecular input to many quantum chemical outputs. 93 The simplest form of NNs are multi-layer feed-forward NNs, which are schematically represented in Fig. 10.

Figure 10: Schematic representation of a multi-layer feed-forward NN with inputs, X, nodes, n, and outputs, Y. In the usual implementation for the fitting of PESs, the NN maps a molecular geometry to the ground state, which could similarly be done for any other single state. In case a manifold of excited states is described, one molecular input can also be mapped to a vector of different excited states, and additional properties can be included. The forces are treated as derivatives of the NN potentials with respect to Cartesian coordinates.

As is visible in Fig. 10, the width of the model depends on the number of nodes, n_r^t, which are connected to each other using weights, w_{rs}^{tu}. The indices refer to a connection between node r and node s from layer t and layer u, respectively. The number of nodes and hidden layers can be chosen independently of the training set size.
Due to the highly flexible functional form of NNs, highly complex relationships can be fit, but an analytical solution for the weights is not available (in contrast to KRR). A numerical solution can be obtained with stochastic gradient algorithms, which are frequently applied to update the weights step-wise:

w_{i+1} = w_i - l_r \frac{\partial L}{\partial w_i}

The gradient of the loss function as given in eq. (24) with respect to the weights is multiplied by a so-called learning rate, l_r. This hyperparameter is deemed one of the most important ones used for training. 9,566 In order to obtain an optimal solution, the learning rate needs to be chosen properly. Algorithms such as AdaGrad 567 or Adam 568 can automatically adapt the learning rate during training. Further, second-order derivatives can be included in the algorithms, as is done, for instance, in the global extended Kalman filter, 569 in its parallel variant, 570 or in the element-decoupled variant. 103 The loss function can be adapted so that more than one property is trained at once. This is often done to include the forces in the training process. In general, NNs possess various hyperparameters, such as the learning rate, regularizers, the number of nodes, etc. As a consequence, an extensive hyperparameter search complicates the use of NNs and makes them more complex to apply than kernel methods.
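The step-wise weight update can be sketched for a single parameter and a hypothetical quadratic loss (in practice, L is the loss of eq 24 and w comprises all NN weights):

```python
def sgd_step(w, grad_fn, lr):
    """One gradient-descent update: w <- w - lr * dL/dw."""
    return w - lr * grad_fn(w)

# toy loss L(w) = (w - 3)^2 (hypothetical), with gradient dL/dw = 2*(w - 3)
grad = lambda w: 2.0 * (w - 3.0)

w = 0.0
for _ in range(100):
    w = sgd_step(w, grad, lr=0.1)
print(round(w, 6))  # converges toward the minimum at w = 3
```

A learning rate that is too large (here, lr > 1.0) would make the iteration diverge instead, which illustrates why this hyperparameter must be chosen properly.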
Besides simple multi-layer feed-forward NNs, high-dimensional variants exist. These networks comprise several atomic NNs, which represent atoms in their chemical and structural environment and are thus also called atomistic NNs. The local atomic contributions, E_a, are summed up to provide the energy of the whole system, E, which is well known to work for ground-state PESs:

E = \sum_{a}^{N_{atoms}} E_a   (28)

This approach was originally implemented by Behler to construct high-dimensional NN potentials. 571 Embedded-atom NNs 533 are similar to high-dimensional NNs in the way they construct the energy of a system but differ in the underlying descriptors from those of Behler. The atomic contributions to the energy depend on the embedded density of atoms and are summed up according to eq 28. These embedded density-like descriptors are approximated from atomic orbitals.
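A minimal sketch of this sum over atomic contributions is given below; the per-element "atomic networks" are hypothetical stand-in functions rather than trained models, and the descriptor vectors are illustrative:

```python
import numpy as np

def total_energy(atomic_descriptors, atomic_nets):
    """E = sum_a E_a (eq 28): sum the per-atom network outputs.
    atomic_descriptors: list of (element, descriptor) pairs;
    atomic_nets: maps element symbol -> callable E_a(descriptor)."""
    return sum(atomic_nets[el](d) for el, d in atomic_descriptors)

# hypothetical stand-ins for trained element-specific atomic NNs
nets = {
    "O": lambda d: -75.0 + 0.1 * np.sum(d),
    "H": lambda d: -0.5 + 0.05 * np.sum(d),
}
# a water-like molecule: one O atom and two H atoms with toy descriptors
water = [("O", np.array([0.2, 0.1])), ("H", np.array([0.3])), ("H", np.array([0.3]))]
print(total_energy(water, nets))
```

Because the same element-specific networks are reused for every atom, a model of this form can in principle be evaluated for molecules of arbitrary size and composition.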
Independent of a simple or an atomistic architecture, the model can be used to fit a single output or a vector of many outputs at the same time. For ground-state problems, a single-state model is usually used, which maps an input to a single output, e.g., the PES of the ground state. Oftentimes, this single-state fashion is adapted to fit different excited states with different NN models. 14,139,199,509 However, it has been shown that including several excited states in one model can be advantageous, 93 as the excited states are inherently linked to each other and so are the excited-state properties. 37 A model treating many excited states can be referred to as a multi-state model, and the inclusion of more properties can result in a multi-property model. 76,93,95,186,400 The different properties can be weighted with respect to their magnitudes or their importance for the chemical problem under investigation, such that the best possible accuracy can be obtained. 13 Another type of network is the convolutional NN, which is most often applied in image or speech recognition 572-574 but can also be adapted to process a molecular input and identify an optimal molecular descriptor. This type of network can be combined in an end-to-end fashion with an architecture that fits the generated molecular representation to a query output. 428,432,508,575,576 An important ingredient of all these ML models is the descriptor, which is mapped to the output. In most studies, the descriptor is one of many possible ways to represent a molecule, which will be discussed in the next section.

Descriptors and Features
Electronic structure methods can process and uniquely identify molecules using, e.g., Cartesian coordinates. In contrast, such inputs are not optimal for ML models: the same molecular geometry, merely translated or rotated, could only be mapped to the same output with great effort and unnecessary computational cost. Hence, a molecular descriptor should fulfill the following requirements: it should be translationally, rotationally, and permutationally invariant as well as differentiable. 102 It should also be unique with respect to the relative spatial arrangement of atoms, universally applicable to any kind of system, and computationally efficient. 552 However, a descriptor can be more than that; it can already include a part of the mapping, e.g., from a molecular structure to an energy. It can thus ease the task of the regressor and help to attain the best possible accuracy for a given training set.
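These invariance requirements can be checked numerically. The sketch below verifies that an inverse-distance descriptor is unchanged under translation and rotation of a hypothetical three-atom geometry (it is, however, not permutationally invariant, a point taken up again further below):

```python
import numpy as np

def inverse_distances(coords):
    """Upper triangle of the inverse distance matrix: depends only on
    interatomic distances, hence translationally and rotationally invariant."""
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.sqrt(np.sum(diff ** 2, axis=-1))
    iu = np.triu_indices(len(coords), k=1)
    return 1.0 / d[iu]

coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.5, 0.0]])
shifted = coords + np.array([2.0, -1.0, 0.5])     # rigid translation
theta = 0.7                                        # rotation about the z-axis
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
rotated = coords @ R.T

print(np.allclose(inverse_distances(coords), inverse_distances(shifted)))  # True
print(np.allclose(inverse_distances(coords), inverse_distances(rotated)))  # True
```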
The ways to represent a molecule to an ML model can be classified roughly into two categories: molecule-wise descriptors, which represent the molecule as a whole to the ML model, and atom-wise descriptors, which represent atoms in their chemical and structural environment and build up a property from local contributions. 102,480 Both ways of describing a molecular system have their merits and pitfalls and will be discussed in the following, along with their applications in recent studies of the excited states.

Molecule-wise descriptors
The distance matrix is one of the simplest descriptors that preserves rotational and translational invariance. Most often it is used in its inverse form, with the inverse distances between atoms a and b giving rise to the symmetric inverse distance matrix, D:

D_{ab} = \frac{1}{\left| R_a - R_b \right|}, \quad a \neq b

The diagonal elements are ill-defined and not differentiable; they are therefore excluded, and only the upper or lower triangular matrix is used to represent a molecule to an ML model. 564 Since distances enter the Hamiltonian in the denominator, it makes sense to use the matrix of inverse distances rather than the distances themselves. 92 The matrix of inverse distances is very similar to the Coulomb matrix, C: 100

C_{ab} = \begin{cases} 0.5 \, Z_a^{2.4} & \text{if } a = b \\ \frac{Z_a Z_b}{\left| R_a - R_b \right|} & \text{if } a \neq b \end{cases}

but the Coulomb matrix additionally considers the atomic charges, Z. These types of descriptors are frequently used in ML studies of the excited states. For example, MLMD simulations in the excited states could be advanced using these simple descriptors, 92,93,137,138 which were also accurate enough to fit NNs and KRR models for excited-state properties. 92,93,160,327,509,563 Distance-based descriptors are further implemented in several program packages that have been used for photodynamics simulations with KRR. For example, MLAtom 577 contains the Coulomb matrix and a representation that includes all nuclear pairs in the form of normalized inverted internuclear distances. 482 The QML toolkit 578 includes the Coulomb matrix in addition to other representations, such as bag of bonds. 579 Another variant is polynomials formed from inverse distances. 92 These molecule-wise descriptors have the advantage of being easy to use and implement. Especially for small molecular systems, and with regard to the training of an ML model, they are cheap. However, they might miss important information about angular distributions. It is also currently under investigation whether representations based on two-body or three-body terms are accurate enough to uniquely identify a molecule. 580
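For illustration, the Coulomb matrix of a diatomic can be built directly from its definition (a sketch; the H2 geometry in Angstrom is illustrative, so the off-diagonal 1/r entries are not in atomic units):

```python
import numpy as np

def coulomb_matrix(Z, coords):
    """C_ab = Z_a*Z_b / |R_a - R_b| for a != b, and 0.5*Z_a**2.4 on the diagonal."""
    n = len(Z)
    C = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            if a == b:
                C[a, a] = 0.5 * Z[a] ** 2.4
            else:
                C[a, b] = Z[a] * Z[b] / np.linalg.norm(coords[a] - coords[b])
    return C

# H2 with a 0.74 Angstrom bond length (illustrative geometry)
Z = np.array([1.0, 1.0])
coords = np.array([[0.0, 0.0, 0.0], [0.74, 0.0, 0.0]])
print(coulomb_matrix(Z, coords))
```

The diagonal entries encode the element identity (here 0.5 for hydrogen), while the off-diagonal entries carry the same inverse-distance information as the matrix D above.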
A problematic issue with the aforementioned distance-based molecular descriptors is that they are not permutationally invariant. 102,400,480,576 This problem can be mitigated by data augmentation, i.e., random permutation of atoms by mixing matrix rows, which results in more data points for the same molecular input. However, the additional amount of data increases rapidly with the system size and can lead to long training times. 480,576 Alternatively, a metric other than the commonly used L_1 or L_2 norms can be employed, the so-called Wasserstein metric, which has been tested with the Coulomb matrix. 581 Permutationally invariant polynomials (PIPs), introduced by Bowman and co-workers, 519,520,582 are frequently applied in a PIP-NN approach by Guo and co-workers to investigate photochemical problems. 141,142,145,359-362 The advantage of these polynomials is that they are invariant to permutation of atoms and to inversion. 145 They comprise single-valued functions, p_ab, such as logarithmic or Morse-like functions, which incorporate the internuclear distances, r_ab. The PIP vector, G, is obtained by applying a symmetrization operator, Ŝ, which accounts for all possible permutation operations:

G = \hat{S} \prod_{a<b} p_{ab}

with, as an example of p_ab, the Morse-like variable

p_{ab} = e^{-r_{ab}/c}

Evidently, additional hyperparameters such as c have to be optimized, and the choice of PIPs is generally not unique. 360,543 Another negative aspect of molecule-wise descriptors is that they can treat only one molecular system, because the input size is fixed. The input dimension could, in principle, be defined according to the largest system included in the training set, but this would lead to unnecessarily large input vectors for smaller systems, which would then contain many zero values. 571,576 Training several ML models, each for one specific system size, is a possible solution 160 but obviously necessitates the training and evaluation of more than one ML model.
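A brute-force sketch of the symmetrization idea is given below: Morse-like variables p_ab = exp(-r_ab/c) are averaged over permutations of identical atoms, so that swapping the two equivalent light atoms of a water-like geometry leaves the feature vector unchanged. Real PIP implementations use polynomial bases rather than this simple average, and the geometry is illustrative:

```python
import numpy as np
from itertools import permutations

def morse_variables(coords, c=1.0):
    """Morse-like variables p_ab = exp(-r_ab / c) for all atom pairs a < b."""
    n = len(coords)
    return np.array([np.exp(-np.linalg.norm(coords[a] - coords[b]) / c)
                     for a in range(n) for b in range(a + 1, n)])

def pip_vector(coords, like_atoms, c=1.0):
    """Crude symmetrization operator S: average the Morse variables over all
    permutations of the identical atoms in like_atoms (brute force, small
    molecules only)."""
    n = len(coords)
    feats = []
    for perm in permutations(like_atoms):
        order = list(range(n))
        for i, j in zip(like_atoms, perm):
            order[i] = j
        feats.append(morse_variables(coords[order], c))
    return np.mean(feats, axis=0)

# water-like geometry: one heavy atom and two equivalent light atoms (illustrative)
coords = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
g1 = pip_vector(coords, like_atoms=[1, 2])
g2 = pip_vector(coords[[0, 2, 1]], like_atoms=[1, 2])  # the two light atoms swapped
print(np.allclose(g1, g2))  # True: invariant under the permutation
```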

Atom-wise descriptors
In contrast, atom-wise representations allow for fitting molecules of arbitrary size and composition. Such descriptors are state-of-the-art for ground-state problems, with commonly used examples being SOAP, 565 atom-centered symmetry functions (ACSFs), 571 weighted ACSFs, 446,583 and the FCHL representation. 125,487 These representations describe atoms in their local chemical and structural environment and usually rely on a cut-off function. This cut-off function defines the sphere around an atom that is deemed important and is therefore considered when modelling the atomic local environment. Radial distribution functions, so-called second-order terms, account for interatomic distances and are often used together with angular distribution functions, i.e., third-order terms. It is further beneficial to include first-order terms, i.e., the stoichiometry of atoms. 125,428,446,583 Terms of higher than third order are most often not included due to increasing costs and little improvement in accuracy. 576 The description of PESs from atomic contributions is beneficial for treating systems of arbitrary sizes and for using systematic molecular fragmentation methods. 107 Admittedly, the validity of this approach is less clear for the excited states, and consequently, such representations are less frequently used in ML studies targeting the excited states. To date, only small molecules have been fitted with atom-wise representations, which are too small to prove the validity of excited-state PESs constructed from local atomic contributions. To the best of our knowledge, the largest molecule fitted with atom-wise descriptors, N-methylacetamide, contained 12 atoms. 95 Other molecules were CH2NH2+, 13,93 CH2NH, 139 SO2, 13 and CSH2. 13 Further studies are needed to demonstrate whether an atom-wise construction of excited-state properties and PESs is possible or not.
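A common choice for the cut-off function just mentioned is the cosine form used with ACSFs, which decays smoothly to zero at the cut-off radius r_c so that atoms entering or leaving the local sphere do not cause discontinuities (a sketch; the radii are illustrative):

```python
import numpy as np

def cosine_cutoff(r, r_c):
    """Cosine cut-off: 0.5 * (cos(pi * r / r_c) + 1) for r < r_c, zero beyond.
    Smoothly switches off contributions from atoms outside the local sphere."""
    r = np.asarray(r, dtype=float)
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

r = np.array([0.0, 3.0, 6.0, 8.0])   # illustrative distances in Angstrom
print(cosine_cutoff(r, r_c=6.0))     # [1.0, 0.5, 0.0, 0.0]
```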
Nevertheless, this approach is most powerful for studies that aim to describe large and complex systems, which could potentially be built up from smaller building blocks. For instance, a DNA double strand or a peptide could, at least in principle, be constructed from ML models that are trained on their smaller subsystems, i.e., DNA bases and amino acids, respectively. Unfortunately, we are far away from having achieved a description of large molecular systems in the excited states, let alone the construction of accurate PESs of medium-sized molecular systems, such as DNA bases or amino acids.
Other types of descriptors Besides the benefits that high-dimensional ML models offer for the fitting of molecular PESs, descriptors are not restricted to the aforementioned examples. In general, any type of descriptor might be suitable for a given problem. Applied descriptors range from topological and binary features generated from SMILES strings 584 to normal modes, which are often used as a coordinate system and as descriptors to fit diabatic PESs. 14,97,134,141,143-145,147,359-362,585 Other types of molecular features besides structure-based ones, e.g., electronegativity, bond order, or oxidation states, 15,70 are also used.

Automatically generated descriptors
The selection of an optimal descriptor and the optimization of its related parameters is no trivial task and requires expert knowledge in many cases. 576 A way to circumvent an extensive parameter search is offered by the aforementioned message passing NNs, 575 which include the descriptor parameters in the network architecture. In this way, they automatically fit the optimal parameters of a descriptor for a given problem, i.e., the training set under investigation. Such tailored descriptors can provide highly accurate solutions if the NN model is trained properly. PhysNet 586 is one example of such an architecture.

Application of ML for Excited States
In this chapter, we review ML studies of the excited states and their properties. We aim to show how ML has been employed to improve static and dynamics calculations, focusing on the type of regressor, descriptor, training set, and property used. We classify the approaches according to Figure 1.

Parameters for Quantum Chemistry
At the current state of research, the user must decide whether a multi-reference method is necessary or a single-reference method is sufficient to describe a chemical problem. It would be helpful if ML models could suggest a suitable reference method, e.g., based on a literature search. Unfortunately, such a tool is not yet available, but ML can help to select an active space for multi-reference methods. Jeong et al. 70 developed an ML classification protocol based on XGBoost 558 to allow for a "black box" use of many multi-reference methods by automatically selecting the relevant active space for molecular systems. The tedious selection of active orbitals and active electrons can thus be avoided. The accuracy of this approach was demonstrated for diatomic molecules in the dissociation limit; the molecules were represented via the molecular orbital bond order and the average electronegativity of the system.

ML of Primary Outputs
To the best of our knowledge, no ML models providing primary outputs of quantum chemistry exist for the excited states (see Figure 1). Targeting the primary output of a quantum chemistry simulation, i.e., the N-electron wave function, or providing ML densities (or density functionals) is far from trivial even for ground-state problems. 71-79,81,89,588-591 However, such an approach for the excited states could solve many problems and allow for wave function analysis, providing additional insights such as the excited-state characters. 592 We therefore expect such models to appear in the near future.

ML of Secondary Outputs
In the following, we summarize the contributions of ML models that fit the secondary outputs of quantum chemical calculations, i.e., PESs, SOCs, NACs, and transition as well as permanent dipole moments in the adiabatic and diabatic basis (Figure 1). The prediction of this manifold of quantities (see Fig. 2) can be done in two ways, i.e., in a single-state fashion or in a multi-state fashion. 93 The applicability of such ML models to the simulation of photodynamics will be discussed.

ML in the Diabatic Basis
Diabatic PESs have been fitted with ML and related methods for more than 25 years. 149,377 An advantage of diabatic PESs is their smoothness, which is perfectly matched by ML models built upon smooth functions. However, the tedious procedure to generate diabatic PESs remains, and some effort is therefore devoted to developing ML-assisted diabatization procedures to eliminate this limiting step. One example uses NNs: due to the high dimensionality of the system, the authors resorted to regularization in the fitting algorithm and an adapted loss function to obtain an accurate representation of two-state diabatic PESs. This novel strategy is envisioned for the computation of the photoelectron spectrum of cyclopentoxide. 94 Fitting 39 degrees of freedom in the diabatic basis is a huge improvement in this research field. The authors further note that a comprehensive sampling of the full relevant PESs in such a high-dimensional space is problematic. Due to the aforementioned problems, the description of medium-sized to large molecules with diabatic potentials is often done with cruder approximations. 140,376 Examples are the LVC model, 228 with its one-shot variant, 229 and the exciton model. 177,593 For more details on this topic, the reader is referred to refs 63, 228, 313, and 594-596. The Frenkel exciton Hamiltonian can be used to describe light-harvesting systems or charge transfer. 177,593 Such a Hamiltonian was constructed to investigate the excited-state energies of the bacteriochlorophylls of the Fenna-Matthews-Olson complex. Multi-layer feed-forward NNs with the Coulomb matrix as a molecular descriptor could accelerate the construction of such Hamiltonians for the prediction of excited-state energies. 160 The effective Hamiltonian of the whole complex was subsequently used to predict excitation energy transfer times and efficiencies; to this end, Häse et al. used exciton Hamiltonians as an input. 159
Fitting diabatic potentials and properties Given diabatic PESs, ML models can be used to fit them. KRR models are often employed for this task due to their ease of use and their ability to provide accurate predictions, as mentioned above. Recent studies by Habershon and co-workers focus on the interpolation of diabatic PESs and their use in grid-based quantum dynamics methods, i.e., variational Gaussian wavepackets and MCTDH. The butatriene cation has been investigated in two dimensions comprising two electronic states. 147 The description of this molecule has recently been advanced with a new diabatization scheme, namely Procrustes diabatization. The method was evaluated with two-state direct-dynamics MCTDH (DD-MCTDH) simulations of LiF and applied to four electronic states of butatriene. 239 Some of the authors also carried out DD-MCTDH 4-mode/2-state 143 and subsequently 12-mode/2-state dynamics of pyrazine. 144 The investigation of the higher-dimensional space of pyrazine was achieved by systematic tensor decomposition of KRR and advances conventional MCTDH simulations considerably with respect to accuracy and computational efficiency. Further, the method was applied to investigate the ultrafast photodynamics of mycosporine-like amino acids, which are suitable as ingredients in sunscreens due to their photochemical properties and photostability. 597 However, the reduced 6-dimensional and 14-dimensional DD-MCTDH simulations with KRR-interpolated PESs were unable to reproduce the expected ultrafast photodynamics, which had been observed in previously performed surface hopping calculations and is typical for sunscreen ingredients. The authors note that the inclusion of more adiabatic states in the diabatization procedure and the consideration of additional relevant modes can lead to more accurate results. All of the reference simulations were carried out at the CASSCF level of theory with KRR-fitted diabatic PESs.
In addition to KRR models, NNs have also been used to describe diabatic PESs. Seminal works include PIP-based NNs by Guo, Yarkony, and co-workers. Absorption spectra and the excited-state dynamics of NH3 and H2O could be studied by fitting potential energy matrix elements. 141,145,359-362,543 Subsequently, some of the authors fitted the dipole moments corresponding to the diabatic 1,2 1A surface of NH3. 14 SOCs of formaldehyde were learned with NNs in the diabatic picture, 90 using 341 data points for the training of the SOCs. A singlet and a triplet state in the adiabatic basis were transformed to diabatic states using Boys localization. 351 Since this diabatization is based on transition dipole moments, the respective properties of the excited states had to be phase corrected. The authors proved the accuracy of their fitted PESs and emphasized the usability of the ML models for describing full-dimensional quantum dynamics. 14,90,543 Very recently, they investigated the OH + H2 reaction, i.e., the nonadiabatic quenching of the hydroxyl radical colliding with molecular hydrogen. Four diabatic potentials including forces and couplings were fitted using a least-squares fitting procedure; 1345 data points of the 1,2,3 2A adiabatic PESs were computed with MR-CISD. 543 The aforementioned ML models are single-state models: each energetic state and each coupling or dipole moment value resulting from different pairs of states is fitted with a separate ML model. While this yields justifiable accuracy for energies and diabatic coupling values, 93 dipole moments are vectorial properties and need to preserve rotational covariance. 95 As the aforementioned studies show, ML models are generally powerful tools to advance quantum dynamics simulations of the excited states and can also assist the construction of effective Hamiltonians. Currently, however, diabatic PESs cannot simply be fitted for systems of arbitrary size and complexity.
The diabatization remains a methodological bottleneck, where additional developments are needed.
The investigation of medium-sized to larger molecular systems, and especially of their temporal evolution, is more often carried out in the adiabatic basis using on-the-fly simulations. An increasing number of recent studies focus on fitting such adiabatic PESs. The inconsistencies in adiabatic properties make such quantities generally more challenging to fit, which is why this field of research gained attention relatively late, i.e., only in the last three years.

ML in the Adiabatic Basis
Surface hopping MD Probably the first ML models for MQCD calculations date back to 2008. 96 Nonadiabatic MD simulations were carried out with NN-interpolated PESs to investigate O2 scattering from Al(111). Symmetry functions were used as descriptors. 598 A spin-unpolarized singlet and a spin-polarized triplet state at the DFT level of theory were fitted with 3768 data points. 598,599 This two-state spin-diabatic problem allowed for the evaluation of coupling values and singlet-triplet transitions with the fewest-switches surface hopping approach. 403,404 In a later study, another adiabatic spin-polarized PES was included, and coupling values were computed between singlets and triplets 600 and evaluated from constructed Hamiltonian matrices. 91 MD simulations were executed using a manifold of ML-fitted PESs according to different spin configurations. The studies showed that singlet-triplet transitions are highly probable during the scattering event of O2 on Al(111). 91,96 After these two seminal studies, interest in advancing MQC photodynamics simulations in the adiabatic basis has increased, mainly in the last three years. One of the first works during this time was conducted by Hu et al., 137 who investigated the nonadiabatic dynamics of 6-aminopyrimidine with KRR and the Coulomb matrix. Due to the many degrees of freedom of the molecule and the inclusion of three singlet states, a large amount of training data was required (> 65k data points). Coupling values were not fitted; instead, the Zhu-Nakamura approach was used to compute hopping probabilities.
Later, Dral et al. 138 applied KRR models to accurately fit a two-state spin-Boson Hamiltonian and reproduce reference dynamics using 1,000 and 10,000 data points. NAC vectors were fit in a single-state fashion. During dynamics simulations, conformations close to critical regions were computed with the reference method instead of the ML model in order to allow for accurate transitions.
In another study, Chen et al. 139 used two separate deep NNs to fit the energies and forces of two adiabatic singlet states of CH 2 NH. About 90k data points were used to generate these single-state models. Using the Zhu-Nakamura approach to account for hopping probabilities, the reference dynamics could be reproduced and quantum chemical calculations were replaced completely during the dynamics.
Cui and co-workers 601 further developed a multi-layer energy-based fragmentation method to study the excited-state dynamics and photochemistry of larger systems. This scheme decomposes a molecular system into a photochemically active (inner) region and a photochemically inert (outer) region. In the original scheme, the active region and its interactions with the outer region are described with the multi-reference method CASSCF, whereas the outer region is treated with DFT. This decomposition of the total energy of a system makes it possible to treat larger systems that cannot be described fully with CASSCF. The approach is similar to QM/MM (quantum mechanics/molecular mechanics) schemes in the mechanical embedding framework. The authors simulated two-state photodynamics of CH3N=NCH3 (inner region) including five water molecules (outer region) without the use of ML; the Zhu-Nakamura approximation was applied to model the hopping probabilities in the nonadiabatic MD simulations. 601 In order to make the simulations more efficient, the authors replaced the DFT calculations with deep multi-layer feed-forward NNs using a distance-based descriptor, 123 i.e., they described the ground-state energies and forces of the photochemically inert region with ML and the S1 and S0 states of the inner region with CASSCF. The hybrid ML multi-layer energy-based fragmentation method can reproduce the photodynamics of the system. 443 Subsequently, the deep NNs were replaced with embedded-atom NNs, 533 and accurate second derivatives could be computed efficiently. 444 Recently, we sought to fit NACs and transition and permanent dipole moments, in addition to the energies and forces of three singlet states of the methylenimmonium cation, CH2NH2+, using deep NNs and the matrix of inverse distances as a molecular descriptor. 92 We were able to perform ML-enhanced excited-state MD simulations with hopping probabilities based on ML-fitted NACs.
NNs could replace the reference method, MR-CISD, completely during the dynamics. Long-time-scale photodynamics simulations of 1 ns were achieved using the mean of two NN models in approximately two months, whereas the reference method would have taken an estimated 19 years to compute 1 ns of dynamics on the same computer. This study demonstrated the possibility of MLMD simulations going beyond the time scales of conventional methods. As another benefit of the ML models, it was shown that a large ensemble of trajectories could be calculated, still at lower cost than a few trajectories with the reference method. 92 With the same training set, we further assessed the performance of KRR together with von Lilienfeld and co-workers. 93 The operator formalism 602 and the FCHL representation 125,487 were used to fit the three singlet states of CH2NH2+. A single-state treatment and a multi-state treatment for predicting energies were compared. To this aim, a multi-state KRR approach was developed with an additional kernel that encodes the quantum energy levels. The accuracy of the KRR models could be improved using this extended approach. 93 The KRR models were further compared to deep NN models regarding their ability to predict dipole moments and NACs. While NNs yielded slightly higher accuracy at the largest available training set size, KRR models exhibited a steeper learning curve and hence more efficient learning. The different performance of NNs and KRR models was proposed to be a result of the parametric dependence of the depth of NNs and the non-parametric dependence of the depth of KRR models. The results further suggested that small differences between the reference method and the ML models, especially in critical regions of the PESs, can lead to completely wrong photodynamics simulations. 93 Nevertheless, the multi-reference quantum chemical potential energy curves of the three singlet states of CH2NH2+ could be faithfully reproduced with both KRR and NN models.
In order to avoid the extensive hyperparameter search for the descriptor and regressor, we further developed the SchNarc approach for photodynamics, 13 which is based on SchNet. 428,432 SchNarc allows for (1) a description of SOCs, (2) an NAC approximation based on ML-fitted PESs and their first and second derivatives with respect to Cartesian coordinates, and (3) a phase-free training algorithm that enables training on raw quantum chemical data. The SchNarc approach is based on the message passing NN SchNet, 428,432 which we adapted for the treatment of a manifold of excited electronic states. Additionally, this model can describe dipole moments using the charge model of ref 109, also adapted for the excited states. All excited-state properties can be described in one ML model in a multi-state fashion. The performance of SchNarc was evaluated with surface hopping dynamics: three singlet and three triplet states of SO2 were computed with ML models for 700 fs, with the underlying PESs based on a "one-shot" LVC(MR-CISD) model. 229 CSH2 was investigated using two singlet and two triplet states for 3 ps at the CASSCF level of theory, representing slow population transfer, and the performance of SchNarc in reproducing ultrafast transitions during dynamics was assessed using CH2NH2+ with the aforementioned training set. The hopping probabilities were computed from ML-fitted SOCs and NACs, the latter being fitted in a rotationally covariant way as derivatives of virtual ML properties and approximated from ML PESs. In all cases, excellent agreement with the reference method could be achieved. Noticeably, all the aforementioned photodynamics studies with ML models 13,92,93,137-139 make use of Tully's fewest-switches surface hopping approach 403,404 with hopping probabilities based on coupling values or approximated schemes.
Exemplary Timings for MLMD, LVC Dynamics, and MQCD

The speed-up of simulations is one of the main arguments for promoting ML in quantum chemistry. In order to give an idea of the computational time required by different approaches, we provide an example here. Table 2 lists, for three exemplary molecules, the timings of surface hopping MD with analytical PESs (from LVC), with quantum chemical PESs, and with ML-fitted PESs using either fitted NACs or NACs approximated from Hessians.
Obviously, crude excited-state force fields like the LVC model are faster than ML models, e.g., for SO₂. We note that even such force-field implementations can probably be streamlined further for speed, but they will always be more expensive than ground-state MD simulations, where it takes approximately 0.005 seconds to simulate 100 fs for the gas-phase methylenimmonium cation, CH₂NH₂⁺, using a state-of-the-art program like Amber. 208 However, dynamics based on highly accurate quantum chemical calculations can be accelerated significantly with ML-fitted PESs, e.g., SchNarc models for CH₂NH₂⁺ based on MR-CISD/aug-cc-pVDZ. 13 The speed-up is higher if NACs are learned directly (MLMD1) than if they are approximated from Hessians (MLMD2). Many Hessian evaluations are required in this example because ultrafast transitions occur in CH₂NH₂⁺. The second-order derivatives reduce the efficiency by a factor of about ten. Nevertheless, Hessian calculations of ML-PESs can be accelerated by a factor of about 5-10 using a GPU (depending on the molecule and the GPU used). Table 2 further shows that a cheaper underlying reference method, such as the CASSCF(6,5)/def2-SVP used for CSH₂, does not allow for such a significant speed-up. In this example, however, the difference between simulations with learned NACs and approximated NACs is small, because the dynamics of CSH₂ is characterized by slow population transfer. Hence, fewer Hessian evaluations are required to estimate the hopping probabilities.
The time required to train a SchNarc model on a GeForce GTX 1080 Ti GPU is approximately 11 hours for energies and forces of 3 singlet states with 3,000 data points of CH₂NH₂⁺, about 13 hours for energies, forces, and SOCs of 2 singlet and 2 triplet states using 4,000 data points of CSH₂, and about 4 hours for energies and forces of 3 singlet states of SO₂ using 5,000 data points.

Dipole Moments

In addition to the investigation of the temporal evolution of systems in the excited states, permanent and transition dipole moments have been computed with ML models. As mentioned before, in our earlier approaches, we fitted permanent and transition dipole moments as single values with NNs and KRR; strictly speaking, we were neglecting the rotational covariance of the vectors (since rotations were negligible in these simulations). 92,93 The SchNarc model improved on this description by treating dipole moments as vectorial properties. The NN and KRR models for dipole moments were evaluated and compared to quantum chemical reference dipole moments using learning curves and MAEs, and their potential to compute UV spectra was emphasized. The use of dipole moments to actually simulate UV spectra was demonstrated by Jiang, Mukamel, and co-workers using N-methylacetamide, a model system for investigating peptide bonds. 95,509 They evaluated the ability of ML to describe transition dipole moments at the TDDFT level of theory. In a first attempt, 509 the authors predicted dipole vectors as independent values. A set of 14 internal coordinates in combination with multi-layer feed-forward NNs was used to predict transition energies of N-methylacetamide. Cartesian (xyz) representations served as input for fitting ground-state dipole moments. The Coulomb matrix was employed to fit transition dipole moments for the nπ* and ππ* transitions, but did not lead to sufficiently accurate results.
Higher accuracy was obtained by replacing the atomic charges in the Coulomb matrix (eq 30) with charges from natural population analysis. The choice of descriptors was justified by screening different types of descriptors for the prediction of different properties. In a later work, some of the authors used embedded-atom NNs to predict transition dipole moments from atomic contributions in a rotationally covariant way. The transition dipole moment vector between two states i and j was obtained as a linear combination of three contributions: µ_T^i and µ_T^j were modeled using the charge model of ref 109, and a third contribution, µ_T^3, was obtained as the cross product of µ_T^i and µ_T^j. The charge-like quantities underlying µ_T^i and µ_T^j and the coefficients q_a^3 were outputs of the same embedded-atom NN.
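One plausible reading of this construction can be sketched in a few lines of NumPy. The per-atom charges, the scalar weight of the cross-product term, and the geometry below are all hypothetical placeholders, not values or code from the cited work; the point of the sketch is only that a dipole built as a charge-weighted sum of positions, plus a cross product of two such dipoles, transforms covariantly under rotations.

```python
import numpy as np

def dipole_from_charges(q, r):
    """mu = sum_a q_a r_a: dipole vector from per-atom charge-like outputs."""
    return (q[:, None] * r).sum(axis=0)

def transition_dipole(q_i, q_j, q3, r):
    """Three-contribution transition dipole: two charge-model dipoles plus
    a cross-product term scaled by the (hypothetical) coefficient q3."""
    mu_i = dipole_from_charges(q_i, r)
    mu_j = dipole_from_charges(q_j, r)
    mu_3 = q3 * np.cross(mu_i, mu_j)   # rotationally covariant third term
    return mu_i + mu_j + mu_3

# hypothetical 3-atom geometry and NN outputs
r = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0], [0.0, 1.0, 0.5]])
q_i = np.array([0.3, -0.1, -0.2])
q_j = np.array([-0.2, 0.4, -0.2])
mu = transition_dipole(q_i, q_j, 0.5, r)
```

Because every term is built from position vectors (sums and one cross product of such sums), rotating the geometry rotates the predicted dipole by the same proper rotation, which is exactly the covariance the earlier single-value fits lacked.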

ML of Tertiary Outputs
The secondary outputs, such as dipole moments or excited-state energies, can be used to calculate oscillator strengths (eq 1) and energy gaps (Fig. 1(d)). These properties can serve for the modelling of UV absorption spectra. UV spectra were computed in the previously described studies of N-methylacetamide with the ML-fitted transition dipole moments. Jiang, Mukamel, and co-workers 509 applied the transition dipole moments and additionally fitted the nπ* and ππ* excitation energies to compute UV spectra of this molecule with NNs. Subsequently, some of the authors 95 used these excitation energies and transition dipole moments to model a Frenkel exciton Hamiltonian for proteins based on amino acid residues and peptide bonds. This effective Hamiltonian could subsequently be used to approximate UV spectra of proteins. The interaction between amino acid residues and peptides was neglected, so only the isolated peptide excitation energies, i.e., those of N-methylacetamide, and the respective transition dipole moments were needed to construct the Hamiltonian. The authors made use of the dipole-dipole approximation 603 and applied embedded-atom NNs.
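The structure of such an effective exciton Hamiltonian can be sketched as follows. This is a generic Frenkel Hamiltonian with point dipole-dipole couplings in atomic units, using hypothetical site energies, dipoles, and positions; it is not the parametrization of the cited work.

```python
import numpy as np

def dipole_coupling(mu_m, mu_n, r_m, r_n):
    """Point dipole-dipole coupling J_mn (atomic units)."""
    d = r_n - r_m
    R = np.linalg.norm(d)
    n = d / R
    return (mu_m @ mu_n - 3.0 * (mu_m @ n) * (mu_n @ n)) / R ** 3

def frenkel_hamiltonian(energies, dipoles, positions):
    """Site excitation energies on the diagonal, dipole-dipole couplings
    between distinct sites on the off-diagonal."""
    ns = len(energies)
    H = np.diag(np.asarray(energies, dtype=float))
    for m in range(ns):
        for n in range(m + 1, ns):
            J = dipole_coupling(dipoles[m], dipoles[n],
                                positions[m], positions[n])
            H[m, n] = H[n, m] = J
    return H

# two hypothetical peptide sites, 10 bohr apart, parallel transition dipoles
E = [0.20, 0.21]                                   # site energies (hartree)
mu = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])  # transition dipoles (a.u.)
r = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 10.0]])
H = frenkel_hamiltonian(E, mu, r)
exciton_energies = np.linalg.eigvalsh(H)           # delocalized exciton states
```

The diagonal entries are exactly the quantities the ML models provide (isolated-site excitation energies), and the off-diagonal couplings need only the ML-fitted transition dipoles, which is why interactions between sites did not have to be learned explicitly.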
Ramakrishnan et al. 327 predicted the excitation energies of the two lowest-lying excited singlet states, S₁ and S₂, as well as the corresponding oscillator strengths, obtained from TDDFT calculations, with KRR. The QM8 496 database, consisting of about 20k organic molecules, was used. With the ∆-learning approach, CC2 accuracy could be obtained. Very recently, Xue et al. 563 assessed the performance of KRR models with normalized inverse distances as the molecular descriptor to predict absorption spectra of benzene and of an acridine derivative containing 38 atoms. To this end, the authors learned the excited-state energy gaps of several states and the corresponding oscillator strengths in a single-state fashion. Applying Gaussian broadening, the absorption cross sections could be computed at TDDFT accuracy.
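Turning predicted excitation energies and oscillator strengths into a spectrum by Gaussian broadening is straightforward; a minimal sketch (with hypothetical stick positions, strengths, and broadening width, not data from the cited studies) looks like this:

```python
import numpy as np

def gaussian_spectrum(E_exc, f_osc, grid, sigma):
    """Broaden a stick spectrum (excitation energies E_exc with oscillator
    strengths f_osc) with area-normalized Gaussians of width sigma."""
    E = np.asarray(E_exc, dtype=float)[:, None]
    f = np.asarray(f_osc, dtype=float)[:, None]
    g = np.exp(-(grid[None, :] - E) ** 2 / (2 * sigma ** 2))
    g /= sigma * np.sqrt(2 * np.pi)       # each Gaussian integrates to 1
    return (f * g).sum(axis=0)

grid = np.linspace(3.0, 8.0, 501)          # energy grid in eV
sticks_E = [4.5, 6.2]                      # hypothetical S1, S2 energies (eV)
sticks_f = [0.02, 0.30]                    # hypothetical oscillator strengths
spec = gaussian_spectrum(sticks_E, sticks_f, grid, sigma=0.15)
```

Because each Gaussian is area-normalized, the integrated intensity of the broadened spectrum equals the sum of the oscillator strengths, so broadening changes the line shape but conserves the total transition strength.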
Pronobis et al. 156 compared 2-body, 3-body, and automatically designed descriptors to learn TDDFT HOMO-LUMO gaps as well as first and second vertical excitation energies. More than 20k molecules of the QM9 database 424,496 were selected for this purpose, and learning curves were used to evaluate the learning behaviour of the different ML models. While atom-wise descriptors worked well for HOMO-LUMO gaps, the authors concluded that the accuracy of the predicted transition energies was not sufficient and suggested that advanced non-local descriptors might be necessary to achieve higher accuracy. They further proposed the idea of encoding information about the electronic state in the ML model. 156 Indeed, our recent study, in which we compared the performance of KRR and NN models with atom-wise and molecule-wise descriptors, demonstrated that encoding the energy level is advantageous. 93 Recently, Kang et al. 584 used 500,000 molecules of the PubChemQC 502 database to train a random forest model on the excitation energy and oscillator strength corresponding to the electronic state with the highest oscillator strength. The 10 singlet states available in the PubChemQC database were evaluated for this purpose. The authors used SMILES (simplified molecular-input line-entry system) strings and converted them into descriptors comprising several topological 604 and binary 605 fingerprints, which were calculated with the help of the RDKit library. 606 The authors compared the prediction accuracy to the aforementioned models and stated that their model outperformed previous ML models in predicting accurate oscillator strengths and excitation energies for the most probable transition in organic molecules. Analysis of important features led the authors to identify nitrogen-containing heterocycles as important for high oscillator strengths in molecules.
The authors concluded that their study could serve the design of new fluorophores with high oscillator strengths. 584 Ghosh et al. 150 used multi-layer feed-forward NNs, convolutional NNs, and DTNNs to fit the 16 highest occupied orbital energies from DFT, i.e., the respective eigenvalues, for the computation of molecular spectra with a full width at half maximum of 0.5 eV for Gaussian broadening. Geometries from the QM7b 494,495 and QM9 424,496 databases were used for training, and the predicted molecular spectra were tested on 10k additional diastereomers, which were also used by Ramakrishnan et al. 327 to evaluate the ∆-learning approach. The convolutional NNs with the Coulomb matrix and the DTNNs with an automatically generated representation outperformed the simpler NNs. Overall, good agreement with reference DFT spectra could be achieved. 150 Markland and co-workers 447 trained NNs with atom-centered Chebyshev polynomial descriptors 108 on the TDDFT/CAM-B3LYP/6-31+G* S₀-S₁ energy gap of deprotonated trans-thiophenyl-p-coumarate (the chromophore of photoactive yellow protein) in water and of the Nile red chromophore in water and in benzene. Farthest point sampling 121 was used to select about 2,000 data points from a larger set of 36,000 data points and was compared to random sampling. The authors assessed the performance of three different ML approaches to compute absorption spectra, spectral densities, and 2-dimensional electronic spectra. One model (hidden solvation) completely ignored any environmental effects and only described the chromophore, another model (indirect solvation) incorporated environmental effects within a 5 Å cutoff of the atomistic descriptor for the chromophore, and a third model (direct solvation) treated the whole system, i.e., the chromophore and the atoms of the solvent, explicitly.
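The farthest point sampling used there to pick diverse training geometries is a simple greedy algorithm, sketched below on toy 2-D "descriptors" (the clustered data are hypothetical and only serve to show that the selection spreads over the whole data set rather than oversampling dense regions).

```python
import numpy as np

def farthest_point_sampling(X, n_select, seed=0):
    """Greedily select n_select rows of X: start from a random point, then
    repeatedly add the point farthest from everything chosen so far."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    dmin = np.linalg.norm(X - X[chosen[0]], axis=1)   # distance to chosen set
    for _ in range(n_select - 1):
        nxt = int(np.argmax(dmin))                    # farthest remaining point
        chosen.append(nxt)
        dmin = np.minimum(dmin, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(chosen)

# three tight, well-separated clusters of hypothetical descriptors
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((50, 2)) for c in centers])
idx = farthest_point_sampling(X, 3)   # one representative per cluster
```

Random sampling of 3 points from these 150 would frequently draw two from the same cluster, which is exactly the redundancy farthest point sampling avoids.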
As expected, the hidden solvation model turned out to be insufficiently accurate for systems with strong solvent-chromophore interactions, but it performed comparably to the other models when describing the Nile red chromophore in benzene, where such interactions are weak. The indirect and direct solvation models were comparable to each other, but with respect to computational efficiency, the indirect solvation model was beneficial. This model could reproduce reference linear absorption spectra and spectral densities, and could capture the spectral diffusion in 2-dimensional electronic spectra of all treated chromophores. 447 Penfold and co-workers 153 applied deep multi-layer feed-forward NNs to prove the ability of ML to predict X-ray absorption spectra (XAS), which provide a wealth of information on the geometry and electronic structure of chemical systems, especially in the near-edge structure region. Note that X-ray free-electron lasers can further be used to generate ultrashort X-ray pulses to follow photodynamics in real time. The training set for the prediction of Fe K-edge X-ray near-edge structure spectra contained 9,040 data points. The inputs for the NNs were generated using local radial distributions around the Fe absorption site of arbitrary systems taken from the Materials Project Database. 607 Qualitatively accurate peak positions and intensities could be obtained in a computationally efficient manner, and the structural refinement of nitrosylmyoglobin and [Fe(bpy)₃]²⁺ was assessed with NNs. The authors noted that future development is needed to accurately capture structures far from equilibrium as well as irregularities in the bulk.
Another study was carried out by Aarva et al., 608 who focused on XAS and X-ray photoelectron spectra of functionalized amorphous carbonaceous materials. By clustering DFT data with unsupervised ML techniques, average fingerprint spectra of distinct functionalized surfaces could be obtained. The authors used GPR. Similarly to the aforementioned state encoding, 93 they encoded the electronic structure, i.e., the ∆-Kohn-Sham values (core-electron binding energies), in a Gaussian kernel. This kernel was then linearly combined with a structure-based kernel built on the SOAP 609 descriptor. The spectra computed from the different clusters were used to fit experimental spectra, allowing the composition of experimental samples to be approximated on a semi-quantitative level. The so-called fingerprint spectra, which enabled the differentiation of the spectral signatures, were assessed in a previous study using different models for amorphous carbon, 610 among them an ML-fitted PES using GPR. 110,611

Kulik and co-workers 15 used deep NNs to predict the spin-state ordering in transition metal complexes, i.e., to determine the spin of the energetically lowest state in open-shell systems. The determination of spin states is important for evaluating the catalytic and material properties of metal complexes. Descriptors based on a selection of empirical features were used to capture the bonding in inorganic molecular systems, and the performance of descriptors including different features was assessed for a set of octahedral complexes with first-row transition metals. The most important features were identified to be the atom that connects the ligand to the metal, its environment and electronegativity, the metal identity and oxidation state, as well as the formal charge and denticity of the ligand. 612 The ML models were tested on spin-crossover complexes and could assign the correct spin in most cases.
Additionally, ML models were applied for the discovery of inorganic complexes. [613][614][615][616] The inverse design of molecules with specific properties was further targeted by Schütt et al., 76 who developed SchNOrb, a deep NN model based on SchNet. The automatically generated descriptor was extended with a description of atom pairs in their chemical and structural environment. In this way, an analytic representation of the electronic structure of a molecular system was obtained in a local atomic orbital basis. The analytic derivatives of this representation allowed for the optimization of electronic properties, which was demonstrated by minimizing and maximizing the HOMO-LUMO gap of malonaldehyde. 486 Besides, the ML method was used to predict the lowest 20 molecular orbitals of ethanol at the DFT level of theory, to investigate proton transfer in malonaldehyde using ground-state dynamics, and to analyze the bond order and partial charges of uracil.
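The principle behind such property optimization, namely following the derivatives of an ML model with respect to the nuclear coordinates, can be reduced to a one-line gradient ascent. The 1-D "gap model" below is purely hypothetical (SchNOrb provides analytic derivatives of a full orbital representation; here we use a toy function and finite differences), but the optimization loop has the same structure.

```python
import numpy as np

def gap_model(x):
    """Hypothetical smooth ML prediction of a HOMO-LUMO gap along one
    nuclear coordinate x, maximal at x = 2."""
    return 1.0 - (x - 2.0) ** 2

def numerical_grad(f, x, h=1e-6):
    """Central finite-difference derivative (stand-in for an analytic one)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def maximize(f, x0, lr=0.1, steps=200):
    """Gradient ascent on the predicted property w.r.t. the coordinate."""
    x = x0
    for _ in range(steps):
        x += lr * numerical_grad(f, x)
    return x

x_opt = maximize(gap_model, x0=0.5)   # converges toward the gap maximum
```

Minimization is the same loop with the sign of the update flipped; with analytic model derivatives, the finite-difference step becomes unnecessary.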
Bayesian NN models were applied by Häse et al. 158 to relate molecular geometries to the outcome of nonadiabatic MD simulations obtained with CASSCF. Normal modes, with and without velocities, of initial conditions served as input for the NN models. Including velocities in addition to normal modes improved the accuracy of the ML models only slightly, indicating that the normal modes already contain enough information for the purpose of this study. The targeted output was the dissociation times of 1,2-dioxetane obtained from the nonadiabatic MD simulations. The NNs could faithfully reproduce the dissociation times and further provided a measure of uncertainty. The authors noted that their method could be particularly interesting for the analysis of MLMD simulations.

ML-Assisted Analysis
The aforementioned studies have shown that ML enables MD simulations and spectra predictions at low computational cost. The computational efficiency allows for enhanced statistics, i.e., in the case of MD simulations, a huge number of trajectories and simulations on long time scales. 13,92 As a consequence, the subsequent analysis of production runs can become the time-limiting step of a study. This problem was identified in the aforementioned study on the dissociation times of 1,2-dioxetane by Häse et al., 158 who therefore further used their method to interpret the outcomes of nonadiabatic MD simulations. 1,2-Dioxetane was the target of their study as it is the smallest molecule known to show chemiluminescence after nonadiabatic transitions from an excited state to the ground state. The chemiluminescent properties of this compound are related to its decomposition rate into two formaldehyde molecules. By analyzing the ML models that fit the dissociation times, correlations could be observed between the normal modes and the dissociation times. For example, the modes corresponding to C-C and C-O bond stretching were relevant for the accurate prediction of dissociation times. The authors further emphasized that, although the findings of the NNs were expected and obey physical laws, ML models are helpful for extracting relevant information from large amounts of data and could potentially serve as an inspiration to humans.
Time-resolved experimental photoluminescence spectra could be analyzed with the LumiML software developed by Ðorđević et al., 617 who applied linear regression models to learn from computer-generated photoluminescence data. The software was employed to predict decay rate distributions 618 of perovskite nanocrystals from data generated with femtosecond broadband fluorescence upconversion spectroscopy. 619 The authors highlighted the applicability of their method to enhance studies on the optimization and design of optical devices and further noted that their approach can also be used to analyze transient absorption spectra. Aspuru-Guzik and co-workers 152 applied Bayesian NNs to find correlations between nanoaggregates and electronic coupling in semiconducting materials using absorption spectra. In general, the analysis of experimental spectra and the inverse design of compounds are most frequently applied in the field of materials science. Their description goes beyond the scope of this review and the reader is referred to ref 169.

Conclusion and Future Perspectives
In the last few years, machine learning (ML) has slowly started to enter the research field of photochemistry, especially the photochemistry of molecular systems. Although this field of research is rather young compared to ML for the electronic ground state, some groundbreaking works have already shown the potential of ML models to significantly accelerate and improve existing simulation techniques. So far, most studies provide a proof of concept using small molecular systems or model systems. A wide range of applications is targeted and will continue to be pursued in the future, ranging from dynamics with excited-state ML potentials via absorption spectra to the interpretation of data, see Fig. 1.
Analysing the different studies reviewed here, some trends in the choice of reference methods, ML models, and descriptors can be observed. These trends are illustrated in Figure 11.
The pie chart in panel 11(a) shows the reference methods used for the computation of training sets describing the excited states or excited-state properties of molecules. As can be seen, about half of the training sets are computed with multi-reference methods. [12][13][14]70,90,[92][93][94]137,[139][140][141][142][143][144][145]147,158,239,359,360,362,597 The employed single-reference approaches are exclusively based on DFT. 15,76,91,95,96,150,156,327,447,467,509,584,6 Analytical methods or experimental data are also applied. 138,152,159,160,617 When the analysis is restricted to studies targeting dynamics, the fraction that employs multi-reference methods increases further: about 70% of all dynamics studies use multi-reference methods to compute the training data for the ML models, 15% use single-reference methods, and an equally large portion applies model Hamiltonians or analytical potentials. This shows that most chemical problems concerning the excited states of molecules require multi-reference accuracy.
Recent studies of ML-based photodynamics simulations have shown that many thousands of data points are necessary to describe a few excited-state potentials of small molecular systems. To the best of our knowledge, excited-state dynamics with ML for molecules with more than 12 atoms in full dimensionality has not yet been investigated. 137,143,144 Especially the huge number of required data points is concerning in this regard, as larger molecules with more energetic states and a complex photochemistry could require many more data points. A meaningful training set generation, which can be achieved with active learning, adaptive sampling, and structure-based sampling techniques, is thus essential for dynamics simulations. 92,109,479,480 Clustering molecular geometries obtained from dynamics simulations with a cheap method is further beneficial for selecting important reference geometries. 137,138,[481][482][483] Still, the high cost and complexity of multi-reference methods for computing an ample training set also hampers the application of ML models to fit the excited states of larger polyatomic systems, whose accurate photochemical description is often additionally complicated by a high density of electronic states.
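The adaptive sampling idea mentioned above can be sketched with a query-by-committee criterion: train an ensemble of models, run dynamics, and flag geometries where the ensemble disagrees for recomputation with the reference method. The two "models" and the disagreement threshold below are hypothetical placeholders standing in for trained NN or KRR ensembles.

```python
import numpy as np

def select_by_disagreement(geoms, models, threshold):
    """Return indices of geometries where the spread of ensemble predictions
    exceeds the threshold; these would be recomputed with quantum chemistry
    and added to the training set."""
    preds = np.stack([m(geoms) for m in models])    # (nmodels, ngeoms)
    spread = preds.max(axis=0) - preds.min(axis=0)  # committee disagreement
    return np.where(spread > threshold)[0]

# two hypothetical PES models that agree near equilibrium (x ~ 1)
# and diverge in an undersampled region (x > 2)
model_a = lambda x: (x - 1.0) ** 2
model_b = lambda x: (x - 1.0) ** 2 + 0.2 * np.maximum(x - 2.0, 0.0)

geoms = np.linspace(0.5, 3.0, 26)   # 1-D stand-in for visited geometries
picked = select_by_disagreement(geoms, [model_a, model_b], threshold=0.05)
```

Iterating this loop concentrates expensive reference calculations in exactly the regions the current training set fails to cover, which is what keeps the training set size manageable.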
Single-reference methods, such as time-dependent DFT, are advantageous with respect to the computational cost of the training set, but suffer from qualitatively incorrect PESs in some conformational regions of molecules, such as dissociative regions. In principle, these conformational regions could be excluded from the training set and the remaining conformational space could be interpolated with ML, but the training set would then remain incomplete and so would the dynamics. Schemes like the ∆-learning approach 327 or transfer learning 329 could be helpful in this regard: they might allow ML models to learn from single-reference data and adjust their accuracy toward multi-reference methods. The direct use of approximate methods, such as time-dependent DFT-based tight binding, is most likely not suitable for photodynamics on long time scales, because such approaches can easily be quantitatively incorrect. Of particular concern is the accumulation of quantitatively tiny errors in the underlying potentials, leading to wrong dynamics trends. At the current stage of research, it is not clear whether such approximate potentials can provide qualitatively correct trends for reaction dynamics. 400 In addition to the aforementioned problems, the training set generation is complicated by the arbitrariness of the signs of coupling values and of properties resulting from two different electronic states. 13,14,90,92,93,95 This arbitrariness has to be removed in order to make the data learnable with conventional methods. Such a correction scheme is termed phase correction and has been applied to correct coupling values and dipole moments. 14,90,92,95,454 An alternative phase-correction training algorithm has been shown to be beneficial with respect to the cost of training set generation and has enabled the learning of raw quantum chemical data. 13 Figure 11(b) shows which ML models are applied in the discussed studies.
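The sign arbitrariness discussed above can be made concrete with a deliberately simplified sketch: each quantum chemistry calculation may return an inter-state property vector with either global sign, and before training the signs are aligned against a reference. Actual phase-correction schemes track wave function overlaps along a path between geometries; the fixed-reference sign flip below is only the simplest special case, shown to illustrate why uncorrected data are not learnable.

```python
import numpy as np

def phase_align(vectors, reference):
    """Remove the arbitrary global sign of inter-state property vectors
    (e.g. NACs, transition dipoles) by flipping each vector whose overlap
    with the reference is negative."""
    aligned = []
    for v in vectors:
        overlap = float(np.dot(v.ravel(), reference.ravel()))
        aligned.append(-v if overlap < 0 else v)
    return np.array(aligned)

# hypothetical NAC-like vectors from two calculations at nearby geometries:
# physically similar, but returned with opposite global signs
ref = np.array([1.0, 0.0, 0.0])
raw = np.array([[0.9, 0.1, 0.0],
                [-0.8, -0.1, 0.0]])
fixed = phase_align(raw, ref)
```

Without the alignment, a regressor would be asked to fit two nearly identical geometries to targets of opposite sign, an inherently discontinuous and therefore unlearnable mapping.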
About two thirds rely on NNs, whereby simple multi-layer feed-forward NNs are employed most often. Several research directions were advanced with NN-fitted functions: photodynamics simulations, 13,[91][92][93]96,138,139,141,142,145,359,360,362 spectra predictions and analysis, 95,150,153,447,509,563 excited-state properties, 13-15,90,93,95,509 diabatization procedures, 94,140 interpretation of reaction outcomes, 158,617 and the prediction of HOMO-LUMO gaps or gaps between energetic states. 76,150,156 KRR methods were mainly applied to interpolate diabatic potentials 143,144,147,239,597 and in studies focusing on more than one molecular system. 327 In general, only a few studies focused on extrapolation throughout chemical compound space in the excited states. So far, only energies, HOMO-LUMO gaps, or spectra based on fitted oscillator strengths could be predicted using a single ML model for different molecules. 15,153,156,327 Decision trees were used to select an active space for diatomic molecules. 70 One drawback of recently developed ML models is that they are molecule-specific and thus not universal. In part, this issue is related to the molecular descriptors used. As can be seen in panel (c) of Figure 11, most studies apply descriptors that capture molecules as a whole. The few studies that describe PESs and properties of molecular systems from atomic contributions either treat small molecular systems 13,93,95 or predict properties related to the ground-state equilibrium structure of a molecular system or to electronic ground-state calculations, e.g., HOMO-LUMO gaps. 76,150,156 Due to the limited transferability of existing ML models for predicting excited-state PESs and properties of different molecular systems, an extrapolation throughout chemical compound space is hindered in many cases.
In order to fully exploit the advantages that ML models offer and to achieve the aforementioned goal of a transferable ML model for the excited states, a highly versatile descriptor is required that can describe atoms in their chemical and structural environment and enables an ML model to treat molecules of arbitrary size and composition. It would be highly desirable if an ML model could then describe the photochemistry of large systems, which are too expensive to compute with precise multi-reference methods, using only small building blocks, i.e., ones small enough for their electronic structure to be described accurately. For example, the excited states of proteins or DNA strands could potentially be predicted from contributions of amino acids or DNA bases, respectively, which to date is most often done using effective model Hamiltonians. 56 A local description of the excited-state PESs and of the properties derived from the ML-fitted PESs could further provide a way toward excited-state ML/MM simulations akin to QM/MM (quantum mechanics/molecular mechanics) techniques. 400,443,601 Unfortunately, it is not yet known whether the excited-state PESs and properties can be constructed from atomic contributions or not. 400 In studies comparing different ML models, it was even suggested that non-local descriptors might be needed or that the electronic state has to be encoded explicitly in the molecular representation to enable a transferable description of the excited states with ML. 93,156 To conclude, the reviewed studies address almost all aspects of excited-state quantum chemistry and improve them successfully: ML models can help to choose a proper active space for multi-reference methods, they predict secondary and tertiary outputs of quantum chemical calculations, and they help in the interpretation of theoretical studies.
ML models push the boundaries of accessible time scales 92 and are used to investigate and analyze the huge amounts of data produced every day in experiments or with high-performance computers. 158,617 It should be emphasized once more that the recent studies show that the goal of ML is not to replace existing methods completely, but to provide a way to improve them. In fact, ML models for the excited states are at their current stage far from replacing existing quantum chemical methods, and they are also far from being routine. Without human intervention, ML cannot solve existing problems, and much remains to be done to describe systems beyond single, isolated molecules.
To the best of our knowledge, what is still missing is proof that ML can provide an approximation to the multi-reference wave function of a molecular system. Such an achievement would be a great advancement for the research field of photochemistry, as any property we wish to know could possibly be derived from the ML wave function. An ML representation of the electronic structure would further be beneficial for the inverse design of molecules with specific properties, which has been shown to be feasible for the ground state of a molecular system. 76 The optimization of photochemical properties with respect to molecular geometries would be useful for many exciting research fields, e.g., photocatalysis, 165 photosensitive drug design 620 or photovoltaics. 609,621 The multi-faceted photochemistry offers a perfect playground for ML models. It may be important to highlight that, despite the negative image ML has suffered in some research communities, it cannot be denied that it opens up many new ways and possibilities to improve simulations and to make studies feasible that were considered unattainable only a few years, if not only months, ago. 442 The computational efficiency and high flexibility of deep learning models can lead this research field toward simulations on long time and large length scales. The possibilities that ML models offer are far from being exhausted. The enormous chemical space, estimated to consist of more than 10⁶⁰ molecules, 622 and the desire for methods that could develop into universal approximators make ML models perfectly suited to advance this research field. The ability of deep ML models to process huge amounts of data can also assist the interpretation and analysis 158,617 of many photochemical studies, help to explore unknown physical relations, and be a source of human inspiration.