Reinforcement Learning for Traversing Chemical Structure Space: Optimizing Transition States and Minimum Energy Paths of Molecules

In recent years, deep learning has made remarkable strides, surpassing human capabilities in tasks, such as strategy games, and it has found applications in complex domains, including protein folding. In the realm of quantum chemistry, machine learning methods have primarily served as predictive tools or design aids using generative models, while reinforcement learning remains in its early stages of exploration. This work introduces an actor–critic reinforcement learning framework suitable for diverse optimization tasks, such as searching for molecular structures with specific properties within conformational spaces. As an example, we show an implementation of this scheme for calculating minimum energy pathways of a Claisen rearrangement reaction and a number of SN2 reactions. The results show that the algorithm is able to accurately predict minimum energy pathways and, thus, transition states, providing the first steps in using actor–critic methods to study chemical reactions.

in tasks like strategy games, and it has found applications in complex domains, including protein folding.In the realm of quantum chemistry, machine learning methods have primarily served as predictive tools or design aids using generative models, while reinforcement learning remains in its early stages of exploration.This work introduces an actor-critic reinforcement learning framework suitable for diverse optimization tasks, such as searching for molecular structures with specific properties within conformational spaces.As an example, we show an implementation of this scheme for calculating minimum energy pathways of a Claisen rearrangement reaction and a number of S N 2 reactions.Our results show that the algorithm is able to accurately predict minimum energy pathways and thus, transition states, therefore providing the first steps in using actor-critic methods to study chemical reactions.
In recent years, machine learning (ML) has had a large impact on society in many different areas.
From large language models [1][2][3] such as ChatGPT [4] to various applications in the financial sector.[5][6][7] However, ML has only recently found its way to subject areas outside of a computational or financial setting.In physics and chemistry, ML is still in its infancy.Initial works in ML for chemistry mainly look at predicting primary outputs such as wave functions [8,9] or electron density [10,11], secondary outputs such as energies, forces or dipole moments [12][13][14][15][16][17][18] and tertiary outputs such as reaction rates [19,20] or fundamental gaps [21][22][23] of quantum systems using supervised ML.These developments have allowed scientists to calculate electronic and other properties at a much larger speed and hence lower computational cost than the associated quantum mechanical reference methods.
In a general setting, the task of predicting properties of a quantum system, or, in principle, any predictive task, involves constructing a function that maps a parameter space to a one-dimensional property space, such as energy or forces, or to a higher-dimensional property space, for example, the parameters of a wave function or excited states.[24][25][26] This function aims to closely approximate the true properties of the system.Current quantum chemical methods operate on the basis of a similar principle.In these methods, the objective is to build a parameterized wave function or probability density that minimizes the energy of the quantum system using the variational principle [27].For each new molecular structure, another optimization of the wave function must be performed.A general function that adequately describes the mapping between the set of molecular structures and their associated properties still remains unknown.
Neural networks are known to be universal function approximators [28] due to their high parameter redundancy, which allows them to construct mappings between almost any two sets.Thus, in theory, they should be able to construct this function form given the relevant dataset.Current work in quantum chemistry looks at encoding useful physical properties into neural network architectures [15,[29][30][31][32] to produce a functional form more similar to the true mapping of molecular structures to the associated properties, thus reducing the number of parameters needing to be optimized to obtain similar errors.
The problem of finding these mappings can also be formulated slightly differently, rather than optimizing parameters with a currently existing optimizer we could look to have another neural network to find the best fitting parameters for a specific objective.To do this, one can think about the process of changing parameters as a Markov chain in parameter space, where at each step the parameters from the previous step are adapted.A neural network can then learn to explore the parameter space to find the parameters that maximize or minimize a certain property.This is the idea behind reinforcement learning.The advantage of using reinforcement learning in comparison to a normal optimization process is that a larger amount of exploration is obtained in the parameter space.One area where the optimization of parameters is of interest is the space of different molecular structures where one wishes to optimize the positions of atoms against some reward function.Some examples are geometry optimization, minimum energy pathway calculation, or the search for critical points on excited-state potential energy surfaces, such as conical intersections.In this work, we explore the application of reinforcement learning to this task and test it on the search for minimum energy pathways and transition states.
A scheme of the algorithm developed in this work is illustrated in Figure 1.The exploration of a reinforcement learning algorithm can be expressed as a Markov chain.Initially, a configuration is taken and the positions are adapted with an action constructed by the neural network.The reward r i is then calculated and fed back to the neural network so that it can re-evaluate its decision to improve in the future.More precisely, the definitions of the different components that are to be defined for this task and in the context of molecular systems are summarized and described below: • State: Current conformation of an m-atomic molecule described by atomic numbers, Z i , and positions, R i ∈ R 3 : • Action: Given the atomic numbers and positions in the state, the set of corresponding new positions can be constructed.The new positions, R i + δR i , are then constructed as follows: To maintain symmetry constraints, a switch to internal coordinates may be conducted.
• Reward: The reward is used to let the reinforcement learning agent assess the success of its actions and is dependent on the particular task at hand.In the case of geometry optimizations, the task would be for instance to maximize E −1 t to obtain the minimum energy structure.
The reward calculated is defined as the immediate feedback after a change in the current state.
However, in this case, the quantity of interest is the long term reward or the reward given by the final structure.As the long term reward is not known until the task is complete, an estimation of an expected long term reward given by the current state is calculated.Two main methodologies in reinforcement learning have been developed for this purpose [33], namely value-based learning, such as Q-learning [34], and policy based methods, such as actor-critic methods [35].
FIG. 1. Actor-critic workflow.The actor constructs an action which alters the positions of the atoms in the molecule.The current configuration of the molecules is known as the state.A reward is calculated using the new configuration which is passed to the critic so that the long term reward can be estimated.
The long term reward and current state are then used in the optimization of the actor's policy.Given the updated policy the process is then repeated.
The Actor-Critic method [36,37] used in this work is a class of reinforcement learning shown to be effective in high dimensional problems [38,39], with which will be dealt with in many situations in chemistry.This method involves three components: an environmental state, an actor that interacts with the environment, and a critic.This concept is illustrated in Figure 1.The objective of the actor is to develop a policy, π(s|θ), to maximize long term rewards and the critic attempts to predict the expected long term reward, V (S k ).
More precisely, looking at each step the actor receives a state that is the result of the action performed on the environment.In this work, the action consists of changes to the current positions of a molecule, the state.The reward of the given state is calculated using some evaluator, i.e., the energy of the new state or the corresponding forces.This reward along with the state is passed to the critic, which then estimates what the expected long term reward will be given the current state.The expected long term reward is the core quantity in actor-critic models from which the actor learns to adjust itself.However, this quantity is not straightforward to calculate as it would be needed to calculate all possible moves from a particular state to then take the final reward.To avoid this tedious task, a neural network is used to estimate this quantity.b) Again the solid lines represent how the reward evolves over the whole episode.The episode can then be broken up into several pieces, in each segment an estimation of the expectation is calculated.In the first segment the expectation is the critic's estimate at state S 0 , at step S 33 it is given by the critic estimate at step 33 plus the previous rewards and finally in the final segment the estimate is the critics estimate at step 66 plus the previous rewards.given the previous actions the actor has performed; if the starting state is S k then this value is written as V (S k ).However, this scheme can prove quite inefficient since every critic update must be performed at the end of the episode.To provide an approximation of V (S k ) before the end of the episode, temporal difference learning is used.Temporal difference learning is illustrated in Figure 3b.It involves breaking down an episode into smaller pieces of size m, then looking at the final reward received plus the estimate from the critic at the final state.In this way, the actor and critic are able to update themselves more frequently.
To define mathematically, the cumulative rewards received through the episode, R k , can be rewritten: with r t being the rewards of each individual step.As can be seen above the total rewards received are split through the episode into two sums, one for the smaller episode piece of length m and then the sum for the remainder of the episode which is approximated by the critic.This new estimate is used to update the critic.Due to the output of the critic being used to adjust the critic in the training process, actor-critic methods can become unstable so correctly selecting hyper-parameters is critical for proper training and preventing divergence in the loss.
Now that an estimate of the long term rewards of a particular molecular state has been obtained it can be integrated with the agent or actor so that better decisions are obtained.As mentioned earlier there are two main methods of doing this but the focus will be on policy based methods namely actor-critic methods.Actor-critic methods operate by taking the estimated long term reward using the critic and measuring if the decisions of the actor lead to a value higher or lower than the expected long term reward.The advantage can be written as If A k+n > 0 then the series of actions taken by the actor can be considered as positive since they lead to rewards that were above average.This in turn leads to these actions being positively reinforced in the actor.On the contrary if A k+n < 0 this corresponds to the actor selecting actions which lead to the rewards being below average.This changes the actors behaviour in order to reduce the probability that these actions happen again.Over time the actor should learn the steps which lead the maximization of the reward function by selecting actions with positive advantages.Now that the overall structure of the reinforcement learning algorithm has been described, we can consider how to implement this in the case of molecular structures.The decisions of the actor can be either deterministic or stochastic in nature, i.e., either the actor output is the new positions of the atoms or a probability distribution from which a new position is sampled.The second case is preferable since it enforces more exploration and thus a higher likelihood of finding an optimal solution.Let us introduce this mathematically, define R t i as the position of atom i in the molecule at step t in the episode then the position of atom i at time step t+1 is sampled from the distribution The probability distribution for the next configuration in the episode at time step t can then be written as, Here, α is the normalisation constant.The positions can then be sampled from the generated probability distributions to construct the new conformer.This process is repeated to form a Markov chain.
To reiterate, the critic must estimate the long term rewards V (S k ) but also be permutation invariant of the input by considering each atom individually then summing their associated contributions.Thus V (S k ) can expanded as follows, where summing over r i t corresponds to each atom's estimated contribution to the total reward at time step t.This structure allows easy implementation of the critic as a neural network for arbitrary-sized molecules.Now the overview of an actor and critic system is given we will attempt to apply this to the case of minimum energy pathway prediction as an application of actor-critic methods in a molecular setting.
Conventional methods to compute minimum energy pathways require many sequential evaluations of the quantum Hamiltonian leading to high computational costs and due to the vastness and intricate local topological structure of the potential energy surface, usually requiring lots of human input to successfully converge.Therefore, ML is promising to advance this field, but only few works exist in this direction [40][41][42][43].One work is TS-Net [40], which applies a tensor-field network to predict the structure of a transition state.However, this method requires a training set with transition states and the generation of transition states for a training set is computationally expensive and time-consuming, thus limiting the applicability of this model, especially when targeting large systems.Another work reformulates the transition state search into a shooting game using reinforcement learning techniques [41].This technique is related to a common computational workflow, namely, transition path sampling, and operates by choosing a coordinate in phase-space from which two trajectories are started with opposite momenta.If trajectories reach the desired products and reactants, the episode is considered successful.While this approach is powerful in theory, it requires Monte-Carlo techniques [44] to identify promising pathways before training.As a consequence this method can become highly expensive as the molecular systems become larger.Thus, to study larger systems, a faster way of performing quantum mechanical calculations is required along with a way to intelligently search the potential energy surface.
To improve on the methods mentioned, the developed model is based on the frequently used nudged elastic band method (NEB) [45,46] to predict minimum energy pathways.NEB involves dividing the reaction pathway into a series of discrete images or "beads".Each bead represents a possible intermediate state.Given a reaction pathway, the force on each image is determined by a combination of the internal inter-atomic forces of each image perpendicular to the reaction pathway, F int ⊥, and the virtual harmonic spring forces holding the images together parallel to the reaction pathway, F spr ∥.More precisely, the set of forces on the atoms in image j is given by: Here R i represents the positions of atom i in image j, τ i is the tangent to the reaction path at image j and k is the spring constant controlling the strength of the harmonic springs.Using this, the maximum force F max on any one atom in any image along the reaction pathway can be computed.
Here the aim is to optimize the molecular pathways by adapting molecular configurations in a way that minimizes F max .
Keeping in mind the NEB algorithm and the actor-critic model derived above, the task of finding the minimum energy pathway can be formulated as a Markov process with each state being formed by a series of molecular conformations along the reaction path.In a similar way to the NEB method, at each step in the process the positions of the molecules on the reaction pathway are modified.
This differs from the previous derivations since a set of molecular structures needs to be considered in contrast to a single molecular structure.Therefore, the previously defined formula need to be rewritten such that they account for multiple geometries.The full derivation of this will be left in the supplementary material, see section S5.Due to the extremely large configuration space made up by the images along the reaction pathway, a subset of configurations is constructed, which provides a good starting point.To do this the F int ⊥ and F spr ∥ values for the atoms in each image are constructed using a PaiNN model [13].Then for each image a linear combination of these values is generated to form the desired subspace of potential moves.Analogous to above, the reinforcement learning algorithm can again be broken down into a series of components.The state is represented by molecules along the reaction path described by atomic numbers and positions with each reaction path being described by n-interpolated images.Thus, the state space consists of n molecules with their associated positions and atomic numbers.The action as mentioned before consists of a linear interpolation of the vectors F int ⊥ and F spr ∥ for an individual image.The new positions can be constructed from this.The reward can be defined as the maximum force F max , as shown above.
The reward, r t , at each step is given by (F max ) −1 .Maximizing the given reward and minimizing the F max value are equivalent.
In order to achieve an accurate description of the policy and value functions, it is important to convert the Cartesian coordinates into a representation that is both rotationally and translationally invariant [13].This can be accomplished through the use of the PaiNN representation developed by Schütt et al. [13].Based on this representation, the aim is to predict changes to the positions of atoms in each image along the reaction pathway at each step with the goal to move closer to the minimum energy pathway.Around each atom in the molecule, a fine discrete grid of α i and β i values is constructed to produce a linear combination of F int ⊥ and F spr ∥.A distribution is then constructed by the model on this grid and the new position is sampled from the respective distribution.An illustration of this and how the associated model is built can be seen in Figure 3.Further details of this can be found in section S2 in the supplementary material.Additionally details of the training procedure and loss can be found in section S1 of the supplementary material.To incorporate symmetry considerations into the predictions, the outputs of a series of PaiNN layers are utilized.These layers form the initial feature set that is subsequently integrated into the remainder of the model.Next, a self-interaction layer [47] is added to condense the information between neighboring images along the reaction pathway into a single feature vector.These feature vectors are then passed through a series of atom-wise layers.A distribution for α i is produced from which values are sampled.These are used to construct an additional feature vector containing information about the magnitude of F int ⊥ along with an associated dot product with F spr ∥.This feature vector along with the feature vector produced by the self-interaction layer are combined together using a point-wise multiplication followed by a series of atom-wise layers.The associated β i distribution is then produced and sampled from in order to construct the new position.An illustration of this can be seen in Figure 3.The full path of the reaction is attached as supporting information.
To assess the reward, a PaiNN [13] S2).Following from this, an initial guess is used of the reaction pathway obtained via geodesic interpolation.For consideration of computational efficiency, the model is allowed to sample for 10 episodes of length 50 to find a pathway of lower F max value as the new initial starting guess.The agent is now trained from this starting guess to minimize the F max values.Exact implementation details can be found in the supplementary material in section S4.4.
The training process can be followed in Figure 4b that illustrates a consistent increase in the reward function and thus a decrease in the associated F max value.Figure 4c shows the different energy curves and transition states found with the model and quantum chemistry (QC) using standard NEB with DFT at PBE0-D4/def2-TZVP level of theory.The activation energy obtained using the model was calculated to be about 37 kcal/mol, which is in very good agreement to the reference value of 35 kcal/mol, especially when considering the initial guess of over 150 kcal/mol.
In the second task, the aim is to minimize the F max value of a series of S N 2 reactions and predict the associated transition state structures.The S N 2 reactions under consideration are as mentioned, reactions of the following form: and X ̸ = Y .Again, the reward is computed using PaiNN.The MAEs for energies and forces are 0.87 kcal/mol and 0.20 kcal/mol/Å ,respectively, on a hold-out test set (see section S4.3 and Figure S3 for further details).
In contrast to the previous experiment, no initial sampling is done by the model prior to the training period, thus the episodes commence initially from the geodesic interpolation.Again for computational efficiency, in each episode uses 10 steps.Looking at Figure5a, in all cases the reward  Additionally further research will also look into integrating actor-critic methods into other molecular tasks.In addition, the restriction to hyper planes generated by the internal and spring forces can be removed to allow for larger search space but with a larger computational cost.As the reward function can be changed depending on the use case to allow the actor-critic algorithm to target conformations with a certain set of properties, we expect that the reinforcement learning model has the potential to become a valuable tool not only for the estimation of transition states and minimum energy paths, but also can advance the search for molecular conformations with target properties.
The values a ′ k,i and a k,i are the chosen action and an arbitrary action respectively.N a corresponds to the total number of actions.Additionally, to generate an initial starting guess for the reaction pathway we use geodesic interpolation followed by high speed sampling using the model's initial distribution over the actions.This produces a guess much closer to the minimum energy pathway meaning less training time is wasted on pathways which are not useful.The critic is trained using standard temporal difference learning.In this case, the rewards are obtained from the (F max ) −1 values.

II. ACTOR IMPLEMENTATION DETAILS
Based on the representation mentioned in the paper, we aim to predict changes to the positions of atoms in each image along the reaction pathway at each step in order to move closer to the minimum energy pathway.To achieve changes in atomic positions, we look to generate distributions for positions changes around each atom.To elaborate further, around each atom in the molecule, we construct a fine discrete grid of α i and β i values to produce a linear combination of F int ⊥ and F spr ∥.A distribution is then constructed by the model on this grid and the new position is sampled from the respective distribution.It is important to note these α i and β i values can be generated per atom in each image or be the same for every atom in the image with the former being the most computationally intense.In the following section we will describe the case where an α i and β i value are generated for each individual atom.
In more detail, let us take the initial state S t , of a reaction pathway, which contains the atomic types {Z i } and associated positions {R in } n } of all atoms in each molecule along the reaction pathway where n is the number of images and i j corresponds to the index of the atom in image j.The probability distributions of the positions of the atoms in the reaction pathway at step T is given by {{P (R i )} j ∀j ∈ [0, n]}.To accommodate the auto-regressive nature of the reinforcement model we can write this as the product of conditional probabilities of the positions at each step which can be decomposed as changes in positions: From now on we will abbreviate R t−1 i,j−1 , R t−1 i,j , R t−1 i,j+1 with Y t−1 i,j .Additionally, we can decompose the distribution into the joint probability distribution of α i and β i values from which the changes in positions for each molecule along the reaction path are constructed: Furthermore, we add a dependency of β i on the α i value chosen.This leads to the joint distribution being rewritten so that β i depends on α i ; These distributions are produced as an output of the actor model and combined into a categorical distribution over a fine grid.A new position is then sampled from this grid for each atom in every image.The methodology is then implemented into a deep learning architecture as seen in the paper.

III. CRITIC IMPLEMENTATION DETAILS
As mentioned in the paper, the critic model employs identical PaiNN representation layers as the actor model, with the exception that the final layers are utilized to furnish an estimation of the value function At each step in the auto-regressive structure of the RL algorithm, the reward is given from the NEB calculation using the F max quantity and a conventional PaiNN model.The atom and image of which F max occurs in future steps is highly uncertain.However, it can be decomposed as the probability of each atom causing the largest force multiplied by the estimated force if this is the case; the expected F max for a particular atom.More formally the value function decomposes as follows: Using the shared representation layer between the actor and critic models we then predict the expected values of each atom in each image, as it is shown in Figure 1.The resulting expectations are then summed in the final layers to give an estimate of the value function.

IV. HYPERPARAMETERS AND TRAINING
All experiments and trainings were run on an NVIDIA RTX A2000 12 GB GPU as well as 20 i7-12700 12th Gen CPUs.
The target properties used for training of the Claisen rearrangement and S N 2 reactions were energy and forces.The loss weighting of energy and forces was 0.02 and 0.98, respectively, for both datasets.
The dataset was split into training, validation, and test sets randomly.For the claisen rearragement reaction, 50000 data points were used for training, 6000 for validation, and 5000 for testing.We used a batch size of 50.For the S N 2 dataset, 350000 data points were used for training, 50000 for validation, and 52709 for testing.A batch size of 500 was used.
For both datasets a cutoff function was applied.In addition, for the Claisen rearrangement reaction, the means of each target property were substracted from the dataset as well.Both models were built using SchNetpack [13] in combination with the equivariant PaiNN [12] representation.
The hyperparameters for the model used in both the Claisen rearrangement and S N 2 reactions are as stated in Table 1.Different hyperparameters were tested starting from the default values of PaiNN and changing them such that a good compromise between accuracy and computational efficiency could be achieved.In SchNetPack, interaction layers are followed by a series of atom-wise dense layers [14] where the following layer is compromised of half the number of neurons of the previous until 1 is reached, i.e. 64, 32, 16, 8, 4, 2, 1.  distributions.All energies and forces throughout were predicted using the PaiNN model from the previous section.
To streamline the performance, the spring force vector is used.The spring force vector is precalculated as an input into the model.In addition, the internal force vector is derived using the pretrained PaiNN model, serving as an additional input to the model.This incorporation of both force vectors reduces the computational time needed for training.For the Claisen rearrangement reaction and the S N 2 reactions, the model utilizes specific hyperparameters, which are provided in  pathway corresponding to the minimum activation energy and minimum F max found through all episodes.Due to the stochastic sampling procedure it is worth to mention that results are likely not reproducible exactly, but within a certain statistical accuracy.For better reproducibility, however, the associated trained PaiNN models are provided in addition to this document and code.

V. DETAILS OF IMPLEMENTED NUDGED ELASTIC BAND (NEB) METHOD
The reaction pathway is a crucial concept in understanding chemical reactions.It provides insights into the sequence of structural changes occurring between the initial and final states of a reaction.In order to construct a reaction pathway, a method called the nudged elastic band (NEB) [8] is commonly used.
The reaction pathway is formed starting from the start and end structures and using a series of N + 1 images connected by virtual spring forces forming an elastic band.The elastic band with N + 1 images can be written as {R 0 , R 1 , ..., R n−1 , R n } where R 0 and R n are the initial and final structures respectively.The initial and final images correspond to specific molecular configurations, which correspond to local energy minima.The positions of the endpoints are thus held fixed to maintain a constant energy throughout the process.
To connect these images and simulate the structural changes along the pathway, virtual spring forces are introduced.The key component of these spring forces is the tangent vector, denoted as τ i , which describes the direction of the pathway at each image.Initially, a simple estimate of the tangent vector can be obtained by calculating the normalized line segments between neighboring images and summing them: However, this basic approach can lead to the formation of kinks or irregularities in the pathway.
More information about this can be found in reference 8. To address this issue, a more sophisticated method is proposed.Let i max be the index of the image with the maximum potential energy; this corresponds to the transition state.
With the tangent vectors established, the respective spring forces between neighboring images can be calculated.Additionally these forces are parameterized by a spring constant, denoted as k, which quantifies the strength of the interaction between neighboring images.The spring force denoted as F i spr ∥, can be expressed as: Additionally, the perpendicular force, denoted as F int ⊥, which arises from internal molecular forces is considered.This term is defined as the internal force vector that acts orthogonal to the spring force vector.The combined force acting on each image, denoted as F i , is then defined as the sum of the spring force and the internal molecular force acting perpendicular to the pathway: After each episode, molecular positions are then optimised using an optimisation algorithm.In the NEB experiments performed optimisation was done using BFGS [18].By applying the BFGS algorithm to the force vector, one can iteratively adjust the positions of the images along the reaction coordinate, seeking the minimum energy pathway.The algorithm uses the current force vector and the approximated Hessian matrix to determine the search direction and step length for each iteration.The positions of the images are updated accordingly, aiming to converge towards a pathway with lower reaction barrier.More information on this can be found in 18.The model constructs a new forces vector with the predicted α i and β i values; one step is then taken with the BFGS algorithm to generate the new positions given the relevant force vectors.The BFGS algorithm and script to calculate F i values given the α i and β i used the Atomic Simulation Environment (ASE).[10] VI.QUANTUM CHEMICAL REFERENCE CALCULATIONS All quantum chemical calculations were conducted using the ORCA 5.0.3 [11] program suite.
For SCF cycles the convergence levels were set to tight SCF and the integration grid was set to defgrid2.In the dataset for the Claisen rearrangement reaction all calculations were conducted using the TZVP basis set and the PBE0 [1] functional.For the S N 2 dataset all calculations were performed with the TZVP basis set, the DSD-BLYP functional [9] and the D3BJ [6] dispersion correction.The details for methods used in constructing the respective datasets for the Claisen rearrangement reaction and S N 2 reactions can be found at [5] and [15] respectively.
The minimum energy pathway for the Claisen rearrangement reaction was calculated using the NEB implementation in ORCA.Initially the NEB was run with the PBE functional [4] and SVP basis set.This pathway was used as a starting point for the following NEB.This NEB was done using the TZVP basis set with the PBE0 functional and D4 dispersion correction [3].The NEB calculations were run using the climbing image implementation [7].
To calculate the root mean squared deviation (RMSD) between the generated transition states and the reference transition states we did a transition state optimisation with DFT.For the S N 2 reactions, predicted transition states from the model were taken as the initial starting points for transition state optimisation.An eigenvalue following algorithm was applied using OptTS option in ORCA.The TZVP basis set was used along with the RI-C auxiliary basis set [2].Again the DSD-BLYP functional and the D3BJ dispersion correction were used.

A. Frequency Calculations
The frequency calculations of generated structures were performed using the DSD-BLYP functional, D3BJ dispersion correction, TZVP basis set and RI-C auxiliary basis set.The frequency calculations were computed numerically.The associated frequencies for each of the S N 2 reactions of the following form can be seen in the table below.
X -+ H 3 C-Y → X-CH 3 + Y -X, Y ∈ {F, Cl, Br, I}   The frequency calculation for the transition state of the Claisen rearrangement reaction generated by the model was performed with the TZVP basis set and the PBE0 functional.As with the S N 2 reaction all frequencies were computed numerically.The molecules has 63 normal modes hence results will be provided in an additional document.The structure has 6 negative frequencies, which reveals that the structure is not exactly the transition state.However, most frequencies are very small in their magnitude, i.e., only two are smaller than -300, which indicates that the structure is close to the transition state.In fact, many transition states found with NEB and quantum chemistry reference methods often comprise of more than one negative frequency.The RMSD of

FIG. 2 .
FIG. 2. Possible rewards generated by an actor-critic algorithm.a) The solid lines represent how the reward evolves over the episode and the dotted line represents the critic's estimate at step S 0 , V (S 0 ).

Figure 2
Figure 2 illustrates an example of the rewards generated by a set of random walks.The random walks show how an actor might change the position of the single atom along with the associated rewards from an initial state S k .The straight line represents the critics estimate of the rewards

FIG. 3 .
FIG. 3. Actor architecture.The PaiNN representation forms the initial feature vectors for the actor.The positions of each atom in the current image are passed through a self interaction layer and a dense layer to produce the α i distribution.A similar process is used to construct the associated β i distribution.The two distributions are then sampled from to produce the change in positions at the current time step.

FIG. 4 .
FIG. 4. Claisen rearrangement reaction.a) An illustration of the start, intermediate and end structures in the allyl-p-tolyl ether Claisen rearrangement reaction.b) Rewards and episode number over the course of the training phase representing the moving average rewards values with 3 respective window sizes.c) Energies and structures of start and end geometry and found transition state (TS) obtained with quantum chemistry (QC, blue), actor-critic model (red), and the initial guess (yellow) from geodesic interpolation.

FIG. 5 .
FIG. 5. S N 2 reactions.a) Rewards through a training period of 300 episodes for a series of S N 2 substitution reactions.b) Transition states obtained with the model (solid) compared to transition states obtained with the original quantum chemical method (transparent) including root mean squared deviations (RMSD).
is maximized and the associated F max is minimized.In order to give a comparative view on the effectiveness of the model, the transition structures produced by the model are compared with the reference structures.The structures are plotted on top of each other in Figure 5b.Transparent are the structures obtained with quantum chemistry.As can be seen, the structures are in excellent agreement to each other.For quantitative measure, further computation of the root mean squared deviations (RMSDs) is performed, which are shown below the images and are below 0.1 Åin all cases.The test cases show the use of the model to predict energy pathways and transition states of a series of reaction mechanisms as demonstrated for the organic allyl-p-tolyl ether Claisen rearrangement reaction and S N 2 substitution reactions.The results show that the model produces transition states and corresponding energy curves, which closely resemble the quantum chemical reference values.While the model can efficiently be trained on single reaction, future research is needed to assess the performance on training of many diverse reactions at once.Further effort will thus be devoted to the development of expanding the model to generalize more easily to a whole set of reactions.

FIG. 1 .
FIG. 1. Critic architecture.The PaiNN model provides an initial feature representation, (x 1 , ..., x m ) j with j referring to the index of the image.For each image in the reaction path we pass the feature vectors for each atom through a series of atom-wise layers.The output contains the expectation of each atom in a molecule.The expectations are then summed over all atoms in each image and finally over all images to deliver the final expectation.

Figure 2
Figure 2 and 3 show the energies and forces in the test dataset against the energies and forces predicted by the PaiNN [12] models respectively.

BFIG. 2 .FIG. 4 .
FIG. 2. Claisen rearrangement reaction dataset: Scatter plot of a) energies and b) forces showing the test data against the values predicted by the PaiNN model.

S N 2 [cm − 1 ]
Reactions FrequenciesDegrees of FreedomX = Cl, Y = Br [cm −1 ] X = Cl, Y = I model was trained on a dataset containing structures of allyl-p-tolyl ether obtained from metadynamics simulations taken from ref.14 to form the initial representation input for the model.The mean absolute errors (MAEs) for energies and forces are 0.22 kcal/mol and 0.37 kcal/mol/Å , respectively, on a hold-out test set (details on training of PaiNN models are in the supplementary material in section S4.1-S4.2 and Figure

TABLE I .
Hyperparameters of the final PaiNN models trained on the Claisen rearrangement reaction and S N 2 reactions.

Table 2 .
During the training period two pathways found by the model are stored.These are the

TABLE II .
Hyperparameters used for the model when training on the Claisen rearrangement reaction and S N 2 reactions.

TABLE III .
Normal modes of the found transition state for the S N 2 reactions.