Thermal Half-Lives of Azobenzene Derivatives: Virtual Screening Based on Intersystem Crossing Using a Machine Learning Potential

Molecular photoswitches are the foundation of light-activated drugs. A key photoswitch is azobenzene, which exhibits trans–cis isomerism in response to light. The thermal half-life of the cis isomer is of crucial importance, since it controls the duration of the light-induced biological effect. Here we introduce a computational tool for predicting the thermal half-lives of azobenzene derivatives. Our automated approach uses a fast and accurate machine learning potential trained on quantum chemistry data. Building on well-established earlier evidence, we argue that thermal isomerization proceeds through rotation mediated by intersystem crossing, and incorporate this mechanism into our automated workflow. We use our approach to predict the thermal half-lives of 19,000 azobenzene derivatives. We explore trends and trade-offs between barriers and absorption wavelengths, and open-source our data and software to accelerate research in photopharmacology.


S1 Extended methods
All models used the PaiNN architecture 1 and were implemented in PyTorch 2 . As in our previous work, we used five convolutions instead of the three used originally, as this substantially improved model performance 3 . We also allowed the k values in the radial basis functions to be updated during training. Remaining hyperparameters can be found in Ref. 3 , and an in-depth explanation of their meaning can be found in Ref. 1 . Further hyperparameter optimization is likely possible 4 . After training the model on data without empirical dispersion, we added D3BJ dispersion 5,6 to the total energy prediction: The forces were computed with automatic differentation 7 . Dispersion parameters were taken from the PhysNet repository 8,9 . No cutoff was used. D3 dispersion is a function only of the atom types and positions, and could therefore be added analytically without approximation. The model performance in the main text used SF-D3BJ TDDFT as the ground truth. The geometries for SF-TDDFT/C-PCM calculations were generated through active learning ( Fig. 3(a)). In each round of active learning, the TS workflow in Fig. 3(b) was performed for 1,000 (SMILES, mechanism) pairs using one of the trained models. The pairs were randomly sampled from a set of 73,000 reactions with four mechanisms each. We selected 3,000 generated geometries for DFT calculations, added the new data to the training set, and repeated. 50% of the geometries were selected by uncertainty, 30% by energy, and 20% randomly. For uncertaintybased selection, we chose the geometries with the highest variance in forces predicted by the three models. For energy selection, we sampled a geometry i with probability p i ∝ e Ei/kBT . The energy was computed relative to other geometries of the same species in the same simulation. The energy selection sampled near-TS geometries from relaxed scans, and high-energy MTD structures from conformer generation.
Active learning was repeated 15 times, yielding 43,938 calculations in total. The final model was trained on 42,938 geometries, with 500 used for validation and 500 for testing.
We aimed to avoid steric clashes in tetra-ortho substitution, since this can make experimental synthesis more difficult. For this reason we separated substituents by size. Those with four total atoms or fewer were labeled small, and all others were labeled large. For all tetra-ortho substitutions, we only allowed one side of the N=N bond to be substituted with large groups. The other side could only be substituted with small groups. Substitution with small groups at all positions was also allowed.

S3.1 Relaxed scans
To perform the relaxed scans, we first identified the CNNC atoms using a substructure search of azobenzene in RDKit 23 . We then adjusted their angles and/or dihedral angle by a constant amount in each step of the scan. The angles and dihedrals were adjusted using the atomic simulation environment (ASE) 24 . For the inversion mechanism we set the final CNN or NNC angle to 179.5 • . We used 179.5 • instead of 180 • because the Cartesian derivative of the latter is undefined, and hence the constraining forces described below would also be undefined. For rotation we set the final dihedral angle to ±90 • . We also set the final CNN and NNC angles to 122 • . Without this constraint, we found that some of the scans collapsed to inversion TSs instead of rotation TSs. 122 • is the angle of the rotational TS when optimizing azobenzene with SF-TDDFT 25 . 20 steps were used in each scan. After the angles and/or dihedrals were adjusted, the geometry was optimized with forces given by F + F restrain . The restraining forces are the negative gradients of the following restraining energies: Here α denotes an angle, ω denotes a dihedral, n denotes the n th step of the optimization, and α n , ω n are target values at the n th step. E α was used for the inversion mechanism, while both E α and E ω were used for the rotation mechanism. k α and k ω were each set to 1 Ha. The optimization was performed with the BFGS algorithm [26][27][28][29] in ASE. A rather loose convergence threshold of f max = 0.05 eV/Å was used. This was because of the conformer generation performed after the scans. As described in Sec. S3.2, each stage of conformer generation was restarted if a new conformer had a lower energy than the seed conformer. We used a tolerance of f max = 0.05 eV/Å to minimize the optimization time spent on high-energy conformers. Tight convergence criteria were only used once no lower energy conformers were found. If the seed conformer were optimized with tighter thresholds than the generated ones, it could have a lower energy simply due to its stricter thresholds. Hence it needed to be optimized with the same thresholds.
Angles and dihedrals were set with ASE at each step. ASE adjusts only these internal coordinates, without rotating the rest of the molecule as a solid body. RDKit uses a solid body rotation, and in principle should then require fewer optimization steps at each stage of the scan. However, we found that this led to unstable scans, and so we used ASE instead.
For all optimizations and dynamic simulations we updated the neighbor list every ten steps. To account for neighbors coming into the cutoff radius between updates, we computed the neighbors using a cutoff of r cut + r skin , where r cut = 5.0Å is the model cutoff radius and r skin = 2.0Å is the cutoff skin. Distances between atoms and neighbors were computed at each step, and only those within r cut of each other were used in the model.

S3.2 Conformer generation
Conformer searches were performed for all reactants, products, and transition states. We implemented a method based on CREST 31 using our NN. CREST combines metadynamics, high-temperature molecular dynamics, and genetic structure crossing 33 to sample conformational space. The sampled geometries are then optimized with a multi-level filtering scheme. In this approach, optimizations are performed with progressively tighter thresholds, and structures are discarded if their energy is above a maximum value that is lowered in each step. By default CREST uses the fast semi-empirical method GFN2-xTB 34 to compute energies and forces.
The key component for phase space exploration is metadynamics (MTD). The collective variables used for MTD are the root-mean-square displacements (RMSDs) from previously visited structures. This drives the molecule away from previously visited configurations and towards new regions of phase space. The associated biasing potential is given by  Here the sum is over n previously visited structures, where a new structure is added every 1 ps. ∆ i is the RMSD of the current structure with respect to the i th previous structure, k i is a pushing strength parameter, and α i is a width parameter. The forces at each step are the sum of the actual forces and the negative gradient of V bias . 14 MTD runs are performed in CREST, using different combinations of α i and k i . Initially we used CREST to perform conformer searches. However, given the speed of relaxed scans and eigenvector following with our NN, we found that CREST was by far the most time-consuming step in the workflow. For example, with access to 24 Xeon-P8 CPU nodes and 1,152 cores in total, we ran 82 CREST jobs at a time with 14 cores per job. The average job took 4.2 hours, which meant a throughput of 470 geometries per day. Given four TS guesses per species, plus the reactant and product geometries, this meant that only 78 species could be screened per day. Screening 25,000 species, as done in this work, would have taken 11 months.
CREST has built-in options for conformer searches that are faster but less extensive. These involve fewer MTD runs and tighter energy windows for each level of optimization. However, given less extensive sampling and tighter energy windows, it makes sense to use our NN instead xTB. For example, CREST has a default energy window of 6.0 kcal/mol for the final ensemble, even though conformers of energy greater than approximately 2.5 kcal/mol are not populated at room temperature. This is to compensate for errors in xTB, so that low-lying conformers mistakenly assigned a high energy by xTB are not discarded. These conformers can be subsequently re-ranked or re-optimized with a higher level of theory. Therefore, it is more reasonable to use a tight energy window for the NN, which is specifically trained to reproduce SF-TDDFT for azobenzene derivatives. Similarly, the sampling should be more extensive for xTB than for the NN. This is because errors in xTB mean that conformational space must be thoroughly sampled up to an energy of 6.0 kcal/mol. The NN, by contrast, only needs to thoroughly sample the space up to an energy of 2.5 kcal/mol. Alternatively, one could retain the extensive sampling but use the ultra-fast GFN force-field 35 . However, we found that the force-field produced poor results for TSs.
We therefore implemented our own conformer search with the NN, and used tighter energy windows and fewer dynamics simulations. The workflow is shown schematically in Fig. S1. The approach closely follows that of CREST, but uses only one MTD run per stage, and does not use MD or genetic structure crossing. As in CREST, we determined the MTD time using a flexibility measure derived from the chemical graph (the current formula in Table S1: Parameters used in our NN conformer generation workflow. From top to bottom, the sections contain MTD parameters, optimization parameters, and cregen parameters. N is the number of atoms in the system. The cregen parameters are the defaults used in CENSO, a program that refines CREST ensembles with DFT 32 . CREST differs from that of the original publication; we used the up-to-date version in the source code 36 , commit 6bb6355). To partially compensate for the decreased total simulation time, we used a minimum MTD time of 15 ps, instead of the 5 ps used in CREST. The parameters used in our workflow are given in Table S1. Several points require further discussion. First, we used the CREST cregen tool to remove both duplicates and rotamers in the ensemble. Duplicates are pairs of geometries with ∆E < ∆E thr , RMSD < R thr , and B < B thr , where ∆ is the difference between the two quantities, E is the energy, B is the rotational constant, and thr denotes a threshold. Rotamers are structures with ∆E < ∆E thr , RMSD > R thr , and B < B thr . We removed rotamers because the difference in rotamer count per conformer should be small, and because accounting for all rotamers is quite difficult. This is especially true when MD and genetic structure crossing are not included, since they are included in CREST primarily to find rotamers 31 .
Second, we used the L-BFGS algorithm 37 in ASE to perform optimizations. The convergence criterion is max i,α |f i,α | < f max , where f i,α is a force component, i ∈ {1, N } is the atom index, α ∈ {x, y, z} is the Cartesian index, and f max is a threshold. For equilibrium geometries we used f ultra max = 0.005 eV/Å. For TS geometries we used f ultra max = 0.01 eV/Å, since the average change in energy between the two thresholds is only 0.1 kcal/mol, while the number of extra steps can be significant. We used f max = 0.005 eV/Å for eigenvector following performed on the five lowest-energy conformers. Note also that the algorithm is implemented in Cartesian coordinates. This makes it less efficient than the internal optimizer in xTB, which uses internal coordinates. Use of an internal coordinate optimizer may be of interest in the future.
Third, unlike in CREST 30 , we did not constrain bond lengths with SHAKE 38 . Typically SHAKE is used to allow longer time steps and therefore accelerate the dynamics. Indeed, CREST uses a default time step of 5 fs, while the maximum value for unconstrained dynamics is typically 0.5 fs. However, the SHAKE implementation in ASE is rather slow, and therefore became a bottleneck instead of reducing run times. One can also increase the time step by artificially increasing the mass of hydrogen, since, as the lightest element, its motion is usually the fastest in the system. We followed the default in CREST and set the hydrogen mass to 4 amu. This allowed us to use a time step of 2 fs. Interestingly, we found that CREST automatically reduced the time step from 5 fs to 2 fs for TS conformer searches, since trial MTD runs with larger time steps all failed. This happened even though SHAKE is used in CREST. The same thing did not happen for conformer searches of equilibrium geometries. Hence our time step was the same as the CREST time step for TSs, even though we did not use SHAKE. Yet an efficient implementation of SHAKE is still of interest, since fixing the bond lengths during MTD might decrease the number of subsequent optimization steps needed.
Fourth, we fixed the CNNC atoms in each molecule when performing TS conformer searches. We experimented S4 a b Figure S2: Energy differences between TSs generated with NN conformer searches and with CREST.
with Hookean constraints on the CNN angles and CNNC dihedrals, but this led to unstable dynamics. This was likely because the Cartesian gradient of an angle is undefined when the angle is 180 • . We excluded the fixed atoms from the RMSD computation in MTD. For equilibrium geometries we excluded the CNNC atoms from the RMSD, but did not fix the atoms. Excluding these atoms ensured that cis did not isomerize to trans.
To evaluate our approach, we performed conformer searches with CREST and with our method on 100 relaxed scan TS geometries. We then optimized each CREST geometry with the NN, removed duplicates, and retained the geometries with energies under 2.5 kcal/mol. The top 5 conformers from each method were then optimized with eigenvector following. We checked that the mechanism of each TS matched the target mechanism of the relaxed scan (e.g. rotational TS for a rotation relaxed scan). We removed one species that had relaxed to an inversion TS with CREST after being seeded with a rotational TS. The distribution of TS energy differences is shown in Fig.  S2(a). The mean value of E NN, MTD TS − E CREST TS is 0.28 kcal/mol. The mean increases to 0.48 kcal/mol after removing outliers with absolute energy differences over 6 kcal/mol. The agreement between the two methods is quite good, despite using only one MTD simulation in the NN approach instead of 14. Further, the error in the ∆E † is expected to be smaller than the error in E TS . This is because the conformer search can only overestimate the lowest energy, not underestimate it, and so some error cancellation can be expected.
We also compared the completeness of the ensembles in each of the two methods. To do so we computed the conformational free energy, defined in SI Sec. S3.6. Figure S2(b) shows the difference in conformational free energy between the two methods. Interestingly, we see that the NN generates a more complete ensemble on average. The mean value of G NN, MTD conf − G CREST conf is −0.36 kcal/mol. This trend persists even when the lowest-energy conformers are quite close in energy between the two methods. For example, restricting ourselves to species in which the |E NN, MTD TS − E CREST TS | ≤ 0.5 kcal/mol, we find that the difference in G conf is −0.60 kcal/mol. We conclude that the NN ensembles are quite complete, and that the associated G conf is reliable.
The NN approach is faster than CREST, mainly because fewer MTD runs are performed. NN energy and force calculations also tend to be faster than xTB for large molecules. A precise comparison is difficult, because xTB is run on CPUs and NNs are run on GPUs. This means that one cannot simply compare wall-clock times using equivalent hardware. Indeed, to compare the number of species that can be screened per day, one must also consider the number of GPUs available to the user vs. the number of CPUs. Nevertheless, some rough comparisons can still be made. For example, running 189 ps of MTD for a molecule with 102 atoms took 24 hours and 48 minutes with xTB on one Xeon-P8 core. If the xTB calculations were parallelized over all 48 cores on the node, then the run would take 31 minutes given perfect parallelization. Hence 1.9 MTD runs could be performed per node per hour. By contrast, the corresponding NN MTD run took 4 hours with one NVIDIA A100 GPU, using a batch of 50 species. This corresponds to 12.5 MTD runs per GPU per hour, or 25 per node per hour, assuming 2 GPUs per node. In this case, NN MTD is effectively 25/1.9 ≈ 13 times faster than xTB MTD.
Lastly, since the conformer search is the rate-limiting step in our workflow, any additional speed-ups would increase our throughput. While MTD enables exploration of conformational space, it is fundamentally limited by the fact that all steps must be taken in series. Uncorrelated geometries can only be generated after taking several hundred steps. Stochastic conformer generators are attractive because they can rapidly produce many uncorrelated structures. Examples include the ETKDG method in RDKit 39 and the commercial software Omega 40,41 . To adapt these generators for TSs, one would have to generate conformers of equilibrium geometries, and perform a relaxed scan from the equilibrium structures to the TS. Even if TS geometries would be generated directly, the structures would still have to be optimized. In either case, optimization would incur significant cost. Indeed, structure optimizations Factor for scaling the correction to the SD step 0.33 sd corr parabolic fit Do a parabolic fit for finding the optimal SD step length True tol max g Maximum gradient for convergence 2 × 10 −3 Ha/Bohr tol rms g RMS gradient for convergence 5 × 10 −4 Ha/Bohr take the most time of any step in CREST and NN conformer generation. Therefore, avoiding MTD would not reduce the computation time to that of a stochastic conformer generator. Another approach is to train a generative model to produce TS geometries 42 . ML for equilibrium conformer generation has rapidly progressed in recent years [43][44][45][46][47][48][49][50][51] , and many of the methods could also be applied to TS generation. These avenues would certainly be of interest in future work.

S3.3 Eigenvector following
EVF was implemented with Baker's rational function optimization method 52 . In the first step, the numerical Hessian was computed with finite differences, using the ASE vibrations package. We used a step size of 0.005Å, and checked that the results were indistinguishable from the analytical Hessian computed with PyTorch. We used finite differences rather than automatic differentiation to minimize the consumed GPU memory. In subsequent steps the Hessian was updated using Powell's method 53 . We used a convergence threshold of f max = 0.005 eV/Å.

S3.4 Intrinsic reaction coordinate
We implemented an IRC algorithm based on the approach in Orca 54 . This approach is itself based on the method of Morokuma and coworkers 55 . Details can be found in the Orca 5.0.2 manual. The parameters that we used are given in Table S2. Note that the force tolerances are rather large, since only a loose optimization is needed to see if the IRC has the product and reactant on either end.

S3.5 Singlet-triplet crossing searches
Singlet-triplet crossings were located with a two-step procedure. In the first step, we performed IRC to locate the crossings one either side of the TS. In the second step, we optimized each of the two crossings by minimizing the energy while keeping the singlet-triplet gap close to zero. This yielded MECPs on each side of the TS.
In the first step we performed the normal IRC algorithm on the S 0 state. We also kept track of the T 1 energy, and stopped the optimization once E T1 − E S0 changed sign. We then used the final geometry as the starting point for the MECP optimization.
The MECPs were optimized with the approach of Ref 56 . In a regular optimization, the objective function is the energy E, and so its negative gradient is F. In an MECP optimization, the objective function is the energy subject to the constraint that the singlet-triplet gap is zero. We therefore used the following effective forces in the optimization: The first term is the mean of the singlet and triplet forces,F, with the gradient of the energy gap ∆E projected out by P. Its effect is to minimize the average singlet and triplet energies without increasing their gap. The second term contains the force difference ∆F. Since it is scaled by the energy difference ∆E, its effect is to minimize the S6 singlet-triplet gap. α is a constant with units of energy. The terms are given by where a superscript T denotes transposition. We implemented this approach in ASE, using a calculator that produced the forces of Eq. (S6). We then performed the optimization using the BFGS algorithm in ASE. The maximum force tolerance for convergence was set to 0.005 eV/Å. The constant α was set to 12.55 kcal/mol following a related MECP algorithm for conical intersections 57 . We found that the optimized MECPs were insensitive to the choice of α. For example, nearly identical MECP energies were obtained when using α = 1.0 kcal/mol.

S3.6 Free energy calculations
The free energy was computed for reactants, products, and TSs, using where G is the Gibbs free energy, H is the enthalpy, and S is the entropy. The enthalpy was computed as where the last term is the average conformer energy, and the sum is over all conformers. p i is the Boltzmann probability for conformer i, given by p i ∝ exp(−E i /k B T ). E el is the electronic energy, E ZPE is the zero-point energy, and E vib (T ) is the thermal vibrational energy. Standard expressions for E ZPE and E vib (T ) can be found in Ref. 13 . Apart from E conf , all terms in Eq. (S12) were computed using the ASE thermochemistry package. The entropy was computed as where S mRRHO is the modified rigid rotor-harmonic oscillator entropy (see below), and S conf is the conformational entropy: The sum is over all conformers of all four mechanisms when using TS theory, but only over rotational conformers when using the ISC approach. Note that if two mechanisms are the same by symmetry, then the conformational entropy is increased by kT log 2, while ∆G is lowered by this amount. This is the same factor that would arise from using the symmetry number σ = 2 in the calculations 58 . The modified rigid rotor harmonic oscillator approximation 59 interpolates between a free rotor at low frequencies and a harmonic oscillator at high frequencies. This is more physically sound than a standard harmonic approximation, and avoids the divergent entropy of low-frequency harmonic modes. The rotor cutoff frequency was set to 50 cm −1 , following the default in xTB 34 . All other parameters were unchanged from Ref. 59 .
Harmonic modes and frequencies were computed in the normal way 60 by diagonalizing the mass-weighted Hessian after projecting out rotation and translation. For reactants and products we checked that there were no imaginary frequencies. For TSs we checked that there was only one imaginary frequency, with magnitude ≥ 200 cm −1 . Species with reactants or products not satisfying these conditions were removed. Species were kept if they had at least one TS geometry from each mechanism satisfying these conditions.
For G X , H X , and S X we used thermostatistical quantities from the rotational TSs. In principle the vibrational terms should be computed with the average Hessian of the singlet and triplet states at the crossing 56 . However, we found that this approach led to a sensitive dependence on α in Eq. (S6), and so we used the rotational TS results S7 instead.

S4 Non-adiabatic transition state theory
The reaction rate from non-adiabatic transition state theory (NA-TST) is given by 56 Here β = 1/(k B T ), Z X and Z R are the rovibrational partition functions at the crossing point and reactant geometry, respectively, and P (ε) is the probability of transitioning between the two electronic states at energy ε. P (ε) can be computed within the Wentzel-Kramers-Brillouin (WKB) approximation 61,62 . The result is 63 where the intersystem crossing rate is and α = 4H Here H SO is the spin-orbit coupling, µ is the reduced mass of the reaction coordinate, F g is the geometric mean of the singlet and triplet forces at the crossing, N is the number of atoms, and ∆F is the force difference (Eq. (S8)).
In NA-TST, the reaction coordinate is the direction of ∆F. The reduced mass is then given by where m n is the mass of the n th atom. Notice that Eq. (S18) scales quadratically with H SO , as expected from Fermi's golden rule 64 . In re-expressing the equation from Ref. 63 we have used the relation G rv = −k B T logZ rv for the rovibrational free energy, and written G X = E X + G X rv . We have also corrected two typos (Eq. (7) in Ref. 63 should not contain k B T , and F g involves a square root rather than a square).
In this work we have approximated H SO as constant among all azobenzene derivatives. We tested this approximation by computing H SO at the rotational TS of several derivatives. We included a derivative with chlorine, since, as a relatively heavy third row element, it could increase the coupling. Couplings are available in Q-Chem for TDFFT, but not for SF-TDDFT or CASPT2. They are also not available for CASPT2 in Orca. We therefore used TDDFT with both the B3LYP and BHHLYP functionals, and found that H SO ≈ 40 cm −1 , with a relative range of 25% between the smallest and biggest values. The coupling is within a factor of two of the (14, 12) CASPT2 result 65 , and the small variation indicates that H SO ≈ const. is a good approximation. It also suggests that the couplings from CASPT2 would be close to constant. In our calculations we therefore set H SO = 20 cm −1 , following the (presumably) more accurate CASPT2 result.
It is informative to estimate the impact of this constant coupling approximation. To do so, we note that the ISC rate scales quadratically with the spin-orbit coupling. Thus a relative change of 25% leads to a 56% change in the rate. Converting this to an effective activation entropy, and using t ISC = 4.2 ps as discussed below, gives a change of 0.26 kcal/mol to ∆G eff . Thus the maximum error from this approximation is only 0.26 kcal/mol, and is thus negligible compared to other sources of error.
The ISC rate must be multiplied by two for the singlet → triplet transition, and by three for the triplet → singlet transition 65 . Using Eq. (S18) together with H SO = 20 cm −1 , we found that t ISC = 5.7 ps for S → T and 4.2 ps for T → S. These are within a factor of two of the results quoted in Ref. 65 using Fermi's golden rule, thus validating both their approach and ours. Note that we use the same spin-orbit coupling constant as Ref. 65 , and so the differences come solely from the different rate formulas. The benefit of Eq. (S18) is that it does not contain the Franck-Condon factors used in Fermi's golden rule expressions 65,66 . Such terms are computationally expensive, and cannot be easily computed for anharmonic modes such as the reaction coordinate. We found that t ISC was essentially constant among derivatives: the minimum time was 4.8 ps, and the maximum time was 5.8 ps.

S5 Connection between non-adiabatic and Eyring transition state theory
The intersystem crossing rate in NA-TST can be connected to the experimental ∆S † . Experimentally one starts by assuming an Arrhenius transition rate, Here E a is the activation energy and A is the Arrhenius prefactor. The two parameters are obtained experimentally from a plot of log k vs. 1/T 58 . Differentiating log k with respect to 1/T in Eqs. (2) and (S23), equating the two results, and invoking Eq. (2) yields 58 where the slight temperature dependence of S and H has been neglected. If the process is actually mediated by ISC, then we can apply the same logic to Eq. (S17), which yields Inserting Eqs. (S25) and (S23) into (S17) then gives Typically η/T 3 ≪ 1, and so the last exponential in Eq. (S27) is of order one. Neglecting this term and inserting the result into Eq. (S24) gives the effective activation entropy, This is the entropy that would be inferred from experiment, given a triplet-mediated process with intersystem crossing rate k ISC , and an entropy difference ∆S X between the intersection geometry and the reactant geometry.
S6 Activation free energies by mechanism SI Fig. S3 compares the activation free energies of the different mechanisms. The plots are for the 19,000 compounds screened in the main text. Panel (a) compares triplet-mediated isomerization with standard S 0 isomerization. ∆G eff is lower than ∆G † in 65% of cases. This indicates that, at the SF-TDDFT level of theory, isomerization proceeds through T 1 for roughly 2/3 of the molecules. However, this result should be interpreted with caution. As shown in Table S3, SF-TDDFT overestimates the energy at both the MECP and the TS relative to the highly accurate SF-EOM-CCSD(dT). The overestimate is 5.4 kcal/mol at the MECP and 3.9 kcal/mol at the TS. This means that ∆G eff − ∆G † is overestimated by 1.5 kcal/mol. We therefore expect that, for the 19,000 compounds screened, ∆G eff should be lower than ∆G † in more than 65% of cases. Indeed, if we assume that these overestimates are constant among species, and thus subtract 1.5 kcal/mol from all ∆G eff , we find that the triplet mechanism is preferred in 88% of cases. This result should also be interpreted with caution, however, since it neglects errors in ∆S eff and ∆S † , and assumes that the energy error is constant among all species. Panel (b) compares the rotation and inversion mechanisms in TST. ∆G † rot is lower than ∆G † inv in 55% of all cases. Interestingly, this means that there is no strong preference for either mechanism at the SF-TDDFT level of a b Figure  theory. This should be contrasted with the species in Fig. 4, for which rotation was preferred in every case. S7 Thermal isomerization rates SI Fig. S4 compares predicted and experimental thermal lifetimes, rather than ∆G values. It contains the same information as Fig. 4 in the main text, but with ∆G converted to a rate using Eq. (2), with T = 298.15 K. The lifetime is the reciprocal of the rate. Notice that we first converted experimental rates to ∆G, given the temperature of the experiment, and then back to a rate using T = 298.15 K. This ensures that the lifetimes are all compared at the same temperature. The line of best fit was computed for ∆G, and then converted to a lifetime.
We see that the experimental lifetimes span five orders of magnitude. Species 1 has the shortest lifetime at 10 minutes, while species 6 has the longest lifetime at 2.5 years. The systematic overestimation of the energies at the MECP and the TS leads to an overestimate of the lifetimes. The predicted lifetimes are respectively 4, 5, and 7 orders of magnitude too high for TST, ISC, and TST with only inversion. However, since the errors are systematic, the model still has a high Spearman rank correlation with experiment. Converting from ∆G to lifetime does not affect the ranking, and so the Spearman rank coefficients are the same as in Fig. 4. R 2 is negative because of the exponential dependence of lifetime errors on ∆G errors.

S8.1 Singlet models
Three different models were pre-trained on 680,736 gas-phase SF-TDDFT/6-31G* calculations from Ref. 3 . 95% of the data was used for training, 4% was used for validation, and 1% for testing. Each model was initialized with a different random seed and trained on different train/validation/test splits. Each model was then fine-tuned with SF-TDDFT/6-31G* using a C-PCM model of water [67][68][69] . The final models were trained on 42,938 geometries, with 500 used for validation and 500 for testing. Different splits were used to fine-tune each of the final models.
The singlet ground-state corresponds to an excitation from a triplet reference state in SF-TDDFT. We identified two singlets as the two states with the lowest ⟨S 2 ⟩ of the three-lowest energy excitations 3 . The lower energy singlet was then taken as the ground singlet. Geometries with ⟨S 2 ⟩ > 1 in the ground singlet state were discarded. The spin contamination in the training set was fairly low, with mean⟨S 2 ⟩ = 0. 16.
Approximately 6% of species in the training set had non-zero charge (+1, +2, or +4). The model does not use charge in the input; however, all charge states in this work can be inferred from the number of bonds for each atom. For example, if nitrogen has a single bond to four different atoms, then it must have a partial charge of +1 to have a full octet. Since bonds can be inferred from atom type and distances 5 , the model can learn the effect of different charge states from positions and atomic numbers alone.
Training was performed over energies and forces/force couplings in units of kcal/mol and kcal/mol/Å, respectively. Per-species reference energies were subtracted from each energy. These were obtained by summing atomic reference energies, computed using multi-variable linear regression from (atom type, count) to relaxed geometry energy computed with SF-TDDFT in vacuum. The set of vacuum reference energies was used for both vacuum and solvent training. Configurations with 10-σ energy outliers were removed prior to training. Those with forces ≥450 kcal/mol/Å from the mean or energies ≥900 kcal/mol from the mean were also removed. Geometries were not removed on the basis of X-σ force deviations, since many geometries were close to TSs and hence had near-zero forces.
Models were trained with the Adam algorithm, using a batch size of 60 for the gas-phase models and 20 for the fine-tuned models. We used an MAE loss for the forces and energies: Here ρ E = 0.2 is the energy weight, ρ F = 1.0 is the force weight, M is the number of geometries in a batch, N i is the number of atoms in geometry i,X is a predicted quantity and X is its true value. We used an MAE loss instead of a mean-squared-error loss, since this led to significantly better model performance.
The learning rate was initialized to 10 −4 and reduced by a factor of two if the validation loss had not improved in X epochs. X was set to 10 for the gas-phase models and 50 for the fine-tuned models. Training was stopped when the learning rate fell below 10 −5 . The final model was selected as the one with the lowest validation loss. Training was performed on a single 32 GB Nvidia Volta V100 GPU, and took approximately 2 days for each of the models.

S8.2 Triplet models
Triplet models were trained on the S 0 /T 1 gap. They were pre-trained using energies from Ref. 3 . Gradients were not used since they had not been computed. For fine-tuning we computed gradients of both the singlet and triplet state for all geometries, and thus used both the gap and its gradient for training. The same loss tradeoff was used as for the singlet model. The learning rate was set 5 × 10 −5 for pre-training and 10 −4 for fine-tuning; the former was chosen because of instabilities in pre-training. To identify the triplet state, we first identified the two singlet states from the three lowest-energy excitations using the method above. The triplet state was then the remaining one. Data points with ⟨S 2 ⟩ < 2.0 or ⟨S 2 ⟩ > 2.4 were discarded.
SF-TDDFT uses an M S = 1 triplet reference state. The excited M S = 0 triplet and the reference M S = 1 triplet should have the same energy. We have found this to be true for spin-flip coupled-cluster methods, but not for SF-TDDFT. This non-zero energy difference was also observed in Ref. 70 . The authors explained that the difference is due to orbital relaxation, since the orbitals are relaxed for the reference state but not for the TDDFT states.
It is not immediately clear which triplet energy should be used for ISC. We tested results using the reference triplet, the excited triplet, and an average of the two. We compared E X to that of spin-flip coupled cluster for S11 azobenzene (SI Sec. S12), and found that the best agreement was for the triplet average. The excited triplet overestimated E X by 5 kcal/mol, and the reference triplet underestimated it by nearly the same amount. However, we found that the correlation with experiment was slightly better when we used the SF-TDDFT triplet. Further, this is more consistent with the principle of treating singlet and triplet states on "equal footing". We therefore trained our models on the excited-state triplet energies. The same network architecture and training parameters were used as for the singlet models.

S8.3 Singlet excitation model
The singlet excitation model was trained directly on the S 0 /S 1 gap. It was first pre-trained using energies and gradients from Ref. 3 . It was then fine-tuned using energies only, since we did not compute S 1 gradients on the new geometries.

S9 Filtering the screening results
We only kept the species that satisfied the following conditions: 1. Endpoints are done. Both cis and trans must have conformer ensembles and a minimum-energy geometry with no imaginary frequencies.
2. Endpoints maintain cis/trans isomerism. The optimized cis and trans isomers must actually be cis and trans, respectively. This is checked with RDKit.
3. TS conformer generation is finished for each mechanism. A TS conformer ensemble must be generated for each of the four mechanisms.
4. At least one TS is converged for each mechanism. To be converged, the TS must have only one imaginary frequency, and its magnitude must exceed 200 cm −1 .

5.
Graph is unchanged. All of the TS and endpoint geometries must have the same molecular graph as the input SMILES. This is checked using D3 coordination numbers. Atom pairs with coordination numbers ≥ 0.95 are assigned a bond. The resulting bond list is checked against that of the original SMILES string.

7.
For the two singlet-triplet MECPs, one is closer to cis and the other to trans. This is checked using the root-mean-square displacement of the CNNC atoms after alignment.
8. Each rotational TS connects to cis and trans through the IRC. This is true if condition 7 is satisfied.
25,000 species were screened in total, and 18,877 remained after applying these filters.

S10 Experimental data
Most papers reported either the thermal half-life τ 1/2 , or the thermal lifetime τ . These are related through τ = (e/2) τ 1/2 . In some cases the authors did not provide τ or τ 1/2 , but did plot the absorption spectrum at different time intervals. In these cases we inferred the thermal lifetime from the absorption plots. In all cases we computed the activation free energy from Eyring TST using τ : Note that different works used different temperatures, and so T depends on the source. S11 Note on mechanisms S11.1 Concerted inversion In our own calculations we have found that concerted inversion has two imaginary frequencies, and therefore is not a true TS. This is because it is a combination of two different inversions, each of which has one imaginary frequency. It is therefore higher in energy than a single inversion, not a true TS, and can be excluded from possible thermal mechanisms. Table S3: Summary of the methods and results from our benchmark. ∆E † is the energy difference between the TS and the reactant, and ∆E X is the energy difference between the singlet-triplet crossing and the reactant. They are shown schematically in Fig. 2

S11.2 Inversion-assisted rotation
A relaxed scan from α ≈ 120 • to α ≈ 180 • will either end in a pure inversion TS or an inversion-assisted rotation TS, depending on the substituents. One can also be converted into the other through a conformational search with fixed CNNC atoms. Therefore, we grouped inversion and inversion-assisted rotation together.

S12 Quantum chemistry benchmark
Here we compare the torsional energy profiles of unsubstituted azobenzene predicted by different electronic structure methods. We analyze results from DFT, DLPNO-UCCSD(T) (domain based local pair-natural orbitals 71-80 for unrestricted coupled cluster with single, double, and perturbative triple excitations), SF-TDDFT, CASPT2 (complete active space with second-order perturbation theory), and SF-EOM-CCSD(dT) (spin-flip equation-of-motion coupledcluster with single, double, and perturbative triple excitations 81 ). The methods are summarized in Table S3. For DFT we use the B3LYP functional 82 with D3 5 Becke-Johnson (BJ) dispersion 6 . For SF-TDDFT we use the BHHLYP functional both with and without D3BJ dispersion. For B3LYP-D3BJ we use the double-zeta correlation-consistent (cc) basis set cc-pVDZ 83 . This model chemistry predicts activation free energies of azoarene isomerization that are in good agreement with experiment 84 . For DLPNO-UCCSD(T) we extrapolate to the complete basis-set limit using explicit correlation (F12) 85 with doubleand triple-zeta basis sets (cc-pV(D, T)Z). Such high-accuracy basis sets can be used because of the computational efficiency of DLPNO. For BHHLYP SF-TDDFT we use the 6-31G* basis 86 , since this is a medium-cost, medium-accuracy basis set that can be used to generate large amounts of data for ML. Similar quality basis-sets are used for CASPT2 and SF-EOM-CCSD(dT) due to computational constraints. B3LYP-D3BJ, CASPT2, and DLPNO-CCSD(T) calculations are performed with Orca 5.2 54 . For DFT and CCSD(T) calculations in Orca we use the resolution of identity 87-93 and chain-of-spheres 94 approximations (RIJCOSX). The TightPNO setting is used for DLPNO calculations. All other calculations are performed with Q-Chem 5.3.
We performed a relaxed scan from ω = 0 to ω = 180 • in 1 • steps using our trained NN potential. We constrained the dihedral angle at each step using a force constant of 1 Ha. We then performed single-point gas-phase calculations with the methods given above. Calculations were performed on all geometries, except for SF-EOM-CCSD(dT), which was performed on only nine geometries due to high computational cost. Those points are shown as stars, and are connected by a smooth line using a quadratic interpolation. Triplet calculations with DLPNO-CCSD(T) failed to converge and are hence not reported.
Results are shown in Fig. S5. The plots contain several salient features. First, the single-reference methods DFT and UCCSD(T) predict a cusp near ω = 90 • , while the multi-reference methods produce a smooth maximum of lower energy. This can be seen in Fig. S5(d), where DFT (red) closely tracks UCCSD(T) (gray) until ω = 90 • . Both methods have a cusp at ω = 90 • , but the energy from DFT is lower. Similar results have been found for ethylene torsion 95 .
As shown in Table S3, DFT gives an activation energy of 35.6 kcal/mol, while UCCSD(T) gives 42.1 kcal/mol. The cusp in the DFT energy explains why previous work could not optimize the rotational TS with DFT 58 . Optimization methods such as EVF 52 use a quadratic expansion of the PES to locate critical points. This expansion fails in the vicinity of a cusp, since the Hessian is undefined at this point.
Previous calculations with DFT found that inversion is the preferred mechanism in gas phase 13 . However, given that the DFT activation energy is 10 kcal/mol higher than CASPT2, such conclusions are not to be trusted. Indeed, the inversion activation energy was found to be 31 kcal/mol with CASPT2 (10, 8) 96 , which is 6 kcal/mol higher than the CASPT2 rotation barrier here. Thus according to CASPT2, isomerization would proceed by rotation even if it were not mediated by ISC. It is also interesting that the maximum value of the CCSD(T) T 1 -diagnostic 97 is 0.013, which occurs at ω = 90 • . The minimum value is 0.01. A value above 0.02 indicates that the problem has multi-reference character, and that CCSD(T) should not be trusted. The diagnostic stays well below 0.02 despite the clear multi-reference nature of the problem. We also note that the unrestricted CCSD(T) results collapsed to the restricted results, with the square spin ⟨S 2 ⟩ = 0 for all geometries.
A second important result is that all multi-reference methods predict smooth TS maxima with singlet-triplet crossings on either side. Examples are CASPT2 and SF-EOM-CCSD(dT) in Fig. S5(b), and SF-EOM-CCSD(dT) and SF-TDDFT in Fig. S5(c). However, the ∆E † and ∆E X predicted by the different methods are in quantitative disagreement. The activation energies predicted by CASPT2, SF-EOM-CCSD(dT), and SF-TDDFT (D3BJ) are 25.4, 28.7, and 32.6 kcal/mol, respectively. The singlet-triplet crossing energies are 18.7, 23.5, and 28.9 kcal/mol respectively. Given the disagreement between CASPT2 and SF-EOM, it is not immediately clear which method should be taken as the "ground truth".
On balance, however, it seems that SF-EOM-CCSD(dT) is likely more accurate. One reason is that its predictions are in quantitative agreement with MR-CISD+Q (multi-reference configuration interaction with singles and doubles, plus a Davidson correction) for a standard rotational conical intersection benchmark 98 . Its treatment of static correlation should therefore be sufficient for azobenzene, while its treatment of dynamic correlation is better than that of CASPT2. Moreover, its predicted activation enthalpy is in better agreement with experiment. The experimental activation enthalpy is 21.1 kcal/mol 99 , which corresponds to ∆H X in the ISC picture. Our NN calculations indicate that ∆H X ≈ ∆E X − 2.5 kcal/mol for azobenzene. Therefore, ∆H X ≈ 21.0 kcal/mol with SF-EOM-CCSD(dT), while ∆H X ≈ 16.2 kcal/mol with CASPT2. The former agrees better with experiment than the latter.
A third important result is that SF-TDDFT's predictions are in qualitative agreement with other multi-reference methods, but not quantitative agreement. As in the other methods, SF-TDDFT predicts a smooth maximium with singlet-triplet crossings on either side of the TS. However, its predicted barrier is 3.9 kcal/mol higher than that of SF-EOM, and 7.2 kcal/mol higher than that of CASPT2. Its singlet-triplet crossing energy is 5.4 kcal/mol higher than SF-EOM, and 10.2 kcal/mol higher than CASPT2. These errors make it difficult to compare the ISC reaction rate with the Eyring reaction rate using SF-TDDFT. Hence in the main text we simply assume that all azobenzene derivatives isomerize through ISC.
Lastly, the plots show the importance of adding dispersion corrections to DFT. Figure S5(a) includes results of SF-TDDFT with and without dispersion. These are shown in green and cyan, respectively. When dispersion is included, all single-and multi-reference methods are in good agreement away from ω = 90 • . When dispersion is not included, SF-TDDFT overestimates the cis energy, because the attractive force between the two benzene rings is underestimated. Similar conclusions have been made in previous work 58,84 . There is still some residual error in the cis energy, even with dispersion, which does not occur with B3LYP-D3BJ. This is likely because the D3 parameters for BHHLYP were chosen for ground-state DFT, not SF-TDDFT.