Machine Learning of Reaction Properties via Learned Representations of the Condensed Graph of Reaction

The estimation of chemical reaction properties such as activation energies, rates, or yields is a central topic of computational chemistry. In contrast to molecular properties, where machine learning approaches such as graph convolutional neural networks (GCNNs) have excelled for a wide variety of tasks, no general and transferable adaptations of GCNNs for reactions have been developed yet. We therefore combined a popular cheminformatics reaction representation, the so-called condensed graph of reaction (CGR), with a recent GCNN architecture to arrive at a versatile, robust, and compact deep learning model. The CGR is a superposition of the reactant and product graphs of a chemical reaction and thus an ideal input for graph-based machine learning approaches. The model learns to create a data-driven, task-dependent reaction embedding that does not rely on expert knowledge, similar to current molecular GCNNs. Our approach outperforms current state-of-the-art models in accuracy, is applicable even to imbalanced reactions, and possesses excellent predictive capabilities for diverse target properties, such as activation energies, reaction enthalpies, rate constants, yields, or reaction classes. We furthermore curated a large set of atom-mapped reactions along with their target properties, which can serve as benchmark data sets for future work. All data sets and the developed reaction GCNN model are available online, free of charge, and open source.
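To make the superposition concrete, the sketch below builds a minimal CGR edge set from an atom-mapped reaction SMILES using RDKit. The featurization (bond orders on each side of the reaction) is purely illustrative and is not the exact feature set used by the model described here.

```python
# Minimal sketch of a condensed graph of reaction (CGR), assuming
# atom-mapped reaction SMILES and RDKit; the features are illustrative,
# not the exact featurization used in this work.
from rdkit import Chem

def cgr_edges(rxn_smiles: str):
    """Superimpose reactant and product graphs via atom-map numbers.

    Returns one entry per bond that exists in the reactant and/or the
    product, keyed by the atom-map numbers of its endpoints, with the
    bond order on each side (0.0 if the bond is absent on that side).
    """
    reactant_smi, product_smi = rxn_smiles.split(">>")
    reactant = Chem.MolFromSmiles(reactant_smi)
    product = Chem.MolFromSmiles(product_smi)

    def bond_dict(mol):
        bonds = {}
        for b in mol.GetBonds():
            i = b.GetBeginAtom().GetAtomMapNum()
            j = b.GetEndAtom().GetAtomMapNum()
            bonds[frozenset((i, j))] = b.GetBondTypeAsDouble()
        return bonds

    r_bonds, p_bonds = bond_dict(reactant), bond_dict(product)
    # The CGR edge set is the union of both edge sets; each edge carries
    # its reactant-side and product-side bond order, so broken, formed,
    # and changed bonds are all encoded in a single graph.
    return {key: (r_bonds.get(key, 0.0), p_bonds.get(key, 0.0))
            for key in set(r_bonds) | set(p_bonds)}

# Example: an SN2-type substitution; the C-Br bond breaks, the C-O bond forms.
print(cgr_edges("[CH3:1][Br:2].[OH-:3]>>[CH3:1][OH:3].[Br-:2]"))
```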

Figure S2: Comparison of test set mean absolute error (left) and root mean squared error (right) between different models for the E2/SN2 computational activation energy dataset. Error bars correspond to the standard deviation between five folds. Dummy model performance indicated by dashed line and gray area (standard deviation between folds). Best model system highlighted in red, line corresponds to best performance.

Figure S3: Comparison of test set mean absolute error (left) and root mean squared error (right) between different models for the SNAr experimental activation energy dataset. Error bars correspond to the standard deviation between five folds. Dummy model performance indicated by dashed line and gray area (standard deviation between folds). Best model system highlighted in red, line corresponds to best performance.

Figure S4: Comparison of test set mean absolute error (left) and root mean squared error (right) between different models for the computational rate constants dataset. Error bars correspond to the standard deviation between five folds. Dummy model performance indicated by dashed line and gray area (standard deviation between folds). Best model system highlighted in red, line corresponds to best performance.

Figure S5: Comparison of test set mean absolute error (left) and root mean squared error (right) between different models for the experimental phosphatase reaction yield dataset. Error bars correspond to the standard deviation between five folds. Dummy model performance indicated by dashed line and gray area (standard deviation between folds). Best model system highlighted in red, line corresponds to best performance.

Figure S6: Comparison of test set R² scores between different models for the Rad-6-RE computational reaction enthalpy dataset. Error bars correspond to the standard deviation between five folds. Best model system highlighted in red or orange, line corresponds to best performance. Red: best performance if Grambow's dual GCNN model is allowed to relax to a large number of FFN layers, which can learn the data by heart. Orange: best performance if Grambow's dual GCNN model is constrained to one FFN layer, which reduces overfitting. The orange dot refers to the MAE of the dual GCNN in that case.
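The dummy baselines shown in these figures can be reproduced along the following lines, assuming the dummy model is a constant predictor of the training-set mean (a common convention; the exact definition is not restated here). The sketch uses scikit-learn's DummyRegressor and mirrors the per-fold error bars.

```python
# Sketch of a training-mean dummy baseline evaluated per fold, assuming
# five cross-validation folds as in the figures; fold handling is illustrative.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

def dummy_scores(folds):
    """folds: list of (y_train, y_test) array pairs, one per CV fold."""
    maes, rmses = [], []
    for y_train, y_test in folds:
        model = DummyRegressor(strategy="mean")
        # DummyRegressor ignores the features, so placeholder X suffices.
        model.fit(np.zeros((len(y_train), 1)), y_train)
        pred = model.predict(np.zeros((len(y_test), 1)))
        maes.append(mean_absolute_error(y_test, pred))
        rmses.append(np.sqrt(mean_squared_error(y_test, pred)))
    # Mean and standard deviation across folds, as in the gray areas.
    return (np.mean(maes), np.std(maes)), (np.mean(rmses), np.std(rmses))
```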

S2 Prediction of reaction enthalpies

A performance comparison across different models is complicated for this task, since the Rad-6-RE database consists of a reaction network, i.e., all possible reactions between a set of molecules. It is therefore impossible to find a data split which separates out a completely independent test set without losing some of the data. The same problem was already described in Ref. S1. Since some of the molecules in the test sets thus also appear in the training sets, the true predictive accuracy of the models may be somewhat worse than the reported values, and the different models can exploit the data leakage to a different extent, as described in the following.
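To illustrate why a fully independent test set necessarily loses data, the sketch below performs a molecule-disjoint split of a reaction network; the function and data layout are hypothetical. Any reaction whose molecules straddle the train/test partition must be discarded.

```python
# Sketch of a molecule-disjoint split for a reaction network, assuming each
# reaction is represented by the set of molecules it involves (reactants and
# products); names and data layout are hypothetical.
import random

def molecule_disjoint_split(reactions, test_fraction=0.1, seed=0):
    """reactions: list of sets of molecule identifiers."""
    molecules = sorted(set().union(*reactions))
    rng = random.Random(seed)
    rng.shuffle(molecules)
    n_test = int(test_fraction * len(molecules))
    test_molecules = set(molecules[:n_test])

    train, test, discarded = [], [], []
    for rxn in reactions:
        if rxn.isdisjoint(test_molecules):
            train.append(rxn)        # no molecule overlaps the test set
        elif rxn <= test_molecules:
            test.append(rxn)         # all molecules belong to the test set
        else:
            discarded.append(rxn)    # straddles the split: must be dropped
    return train, test, discarded

# Tiny example: half of the molecules go to the test side, and the
# straddling reactions are lost.
rxns = [{"A", "B"}, {"B", "C"}, {"C", "D"}, {"A", "D"}]
print(molecule_disjoint_split(rxns, test_fraction=0.5))
```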
Figure S7: Comparison of test set mean absolute error (left) and root mean squared error (right) between different models for the Rad-6-RE computational reaction enthalpy dataset. Error bars correspond to the standard deviation between five folds. Dummy model performance indicated by dashed line. Best model system highlighted in red or orange, line corresponds to best performance. Red: best performance if Grambow's dual GCNN model is allowed to relax to a large number of FFN layers, which can learn the data by heart. Orange: best performance if Grambow's dual GCNN model is constrained to one FFN layer, which reduces overfitting. The orange dot refers to the MAE of the dual GCNN in that case.

The CGR approach yields low mean absolute errors (0.13 eV), which come close to the errors reported in Ref. S1, although the model of Stocker et al. takes the geometry of each molecule into account, thus leading to a more accurate representation of the enthalpy of a molecule. For force field geometries, their Kernel Ridge Regression (KRR) with the SOAP S2 representation yields a mean absolute error of 0.09 eV, but if given DFT geometries as input, the MAE is only 0.05 eV. The performance of the CGR GCNN method developed in this study, which does not require any 3-D information as input, is therefore very encouraging.
Grambow's dual GCNN model performs exceptionally well on this task and outperforms the CGR model, reaching mean absolute errors of only 0.08 eV (red markers in Fig. S7).
We attribute this performance boost to the architecture of the dual GCNN model, which subtracts the reactant embeddings from the product embeddings. Since the enthalpy of reaction is itself the difference between product and reactant enthalpies, this gives the dual GCNN model a large advantage for two reasons. First, the model can naturally pick up the structure of the data. Second, the model can exploit molecules which overlap between the training and test sets. Good performance is only observed for a large number of tunable parameters, here 20M, and only at a large MPNN depth (which makes the molecular hidden representations unique) and with more than one FFN layer. This indicates that the model is in fact learning the energies of the molecules by heart, instead of finding a meaningful relationship. The best performing model with a single FFN layer (orange markers in Fig. S7), where the model cannot easily learn the individual energies by heart, yields a mean absolute error above 0.4 eV. In contrast, the CGR GCNN model already yields acceptable performance at only 0.4M parameters and, due to its architecture, cannot exploit this data leakage, so that the CGR GCNN model still seems preferable.
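For reference, the subtraction architecture discussed above can be summarized schematically as follows. This is a PyTorch sketch with a stand-in encoder module, not Grambow's actual implementation.

```python
# Schematic of a dual-GCNN readout that subtracts reactant from product
# embeddings; the encoder is a stand-in for the MPNN, and this sketch is
# not Grambow's actual code.
import torch
import torch.nn as nn

class DualReadout(nn.Module):
    def __init__(self, encoder: nn.Module, hidden: int, n_ffn_layers: int):
        super().__init__()
        self.encoder = encoder  # maps a molecular graph to a fixed-size vector
        layers = []
        for _ in range(n_ffn_layers - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        layers.append(nn.Linear(hidden, 1))
        self.ffn = nn.Sequential(*layers)

    def forward(self, reactant_graph, product_graph):
        # Mirrors the structure of a reaction enthalpy: the FFN acts on the
        # difference of product and reactant embeddings. With a deep MPNN and
        # several FFN layers, per-molecule energies can simply be memorized.
        diff = self.encoder(product_graph) - self.encoder(reactant_graph)
        return self.ffn(diff)

# Toy usage with an identity "encoder" over precomputed embeddings:
model = DualReadout(encoder=nn.Identity(), hidden=8, n_ffn_layers=2)
out = model(torch.randn(4, 8), torch.randn(4, 8))  # shape: (batch, 1)
```

With n_ffn_layers=1 the readout collapses to a single linear layer on the embedding difference, which corresponds to the constrained (orange) variant above.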
Both GCNN approaches outperform fingerprint-based models by a large margin. Tables S1 and S2 list the hyperparameters, model sizes, and performances for all systems and models. 'Opt' refers to the best parameters found by a hyperparameter search using 20 steps of Bayesian optimization as implemented in Chemprop, S3 scanning MPNN depths of 2 to 6, 300 to 2400 hidden nodes in the MPNN and FFN, dropout rates of 0.0 to 0.4, and 1 (no hidden layer) to 3 FFN layers.
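The quoted search space can be written down directly with the hyperopt library, which Chemprop's Bayesian optimization builds on. In the sketch below, train_and_evaluate is a hypothetical stand-in for a full training run; the step sizes of the grid are illustrative assumptions.

```python
# Sketch of the 20-step Bayesian hyperparameter search described above,
# using the hyperopt library; train_and_evaluate and the grid step sizes
# are illustrative assumptions, not Chemprop's exact implementation.
from hyperopt import fmin, hp, tpe

space = {
    "depth": hp.quniform("depth", 2, 6, 1),                  # MPNN depth
    "hidden_size": hp.quniform("hidden_size", 300, 2400, 100),
    "dropout": hp.quniform("dropout", 0.0, 0.4, 0.05),
    "ffn_num_layers": hp.quniform("ffn_num_layers", 1, 3, 1),
}

def train_and_evaluate(depth, hidden_size, dropout, ffn_num_layers):
    # Hypothetical stand-in: train a model with these hyperparameters and
    # return the validation error. Replaced by a dummy value here so the
    # sketch runs end to end.
    return dropout + 0.01 * abs(depth - 4) + 1e-5 * hidden_size

def objective(params):
    # quniform returns floats; cast the integer-valued hyperparameters.
    params = {k: (round(v) if k != "dropout" else v) for k, v in params.items()}
    return train_and_evaluate(**params)

# 20 evaluations of the objective, guided by the Tree-structured Parzen
# Estimator (TPE), corresponding to the 20 optimization steps above.
best = fmin(objective, space, algo=tpe.suggest, max_evals=20)
print(best)
```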