Influence of Template Size, Canonicalization, and Exclusivity for Retrosynthesis and Reaction Prediction Applications

Heuristic and machine learning models for rank-ordering reaction templates comprise an important basis for computer-aided organic synthesis regarding both product prediction and retrosynthetic pathway planning. Their viability relies heavily on the quality and characteristics of the underlying template database. With the advent of automated reaction and template extraction software and consequently the creation of template databases too large for manual curation, a data-driven approach to assess and improve the quality of template sets is needed. We therefore systematically studied the influence of template generality, canonicalization, and exclusivity on the performance of different template ranking models. We find that duplicate and nonexclusive templates, i.e., templates which describe the same chemical transformation on identical or overlapping sets of molecules, decrease both the accuracy of the ranking algorithm and the applicability of the respective top-ranked templates significantly. To remedy the negative effects of nonexclusivity, we developed a general and computationally efficient framework to deduplicate and hierarchically correct templates. As a result, performance improved considerably for both heuristic and machine learning template ranking models, as well as multistep retrosynthetic planning models. The canonicalization and correction code is made freely available.

fingerprints do not generalize well, i.e., they perform worse than a Morgan fingerprint on a test set. If a Morgan fingerprint is concatenated to the learned fingerprint, the model performs better but still slightly worse than the ML-fixed model, which we attribute to the larger number of parameters, which again makes the model prone to overfitting.

Figure S1: Examples of template extraction where radius 1 (no special groups) leads to similar templates, but the default parameters yield different reaction templates. Panels: full reaction, default template, radius 1 template.


S3 Inclusion of hydrogens to reaction rules
An alternative means of promoting the exclusivity of templates is to include hydrogens at all template atoms. However, this added specificity decreases the top-N-accuracy and applicability, as shown in Fig. S3 for the example of templates at radius 1 with special groups (default). Furthermore, it cannot resolve some exclusivity issues, such as those associated with special groups. We therefore did not pursue this approach further.

S4 Hierarchical template correction
A pseudocode representation of the template correction algorithm is given in Algorithm 1.
For a list of templates, each template is compared to a list of the most general templates to be kept (starting with an empty list). The template is kept if it is more general than another template; in that case, the more specific template is removed from the list. Parent-child relations are recorded so that the respective specific templates can be correctly replaced with more general templates.
It is important to cluster the templates according to their reaction centers (minimal, most general templates) to avoid partial matches of the reaction center. Fig. S4 shows an example of two templates that produce a theoretically correct SMARTS pattern match, but encode different transformations with different leaving groups. Via the hierarchical, iterative correction of clustered templates, we avoid this unwanted behavior, since the two templates are never clustered into the same group due to the difference in reaction centers.
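The procedure described above can be illustrated with a minimal Python sketch. It is not the published implementation: the `Template` type, the string-valued reaction center, and the `is_more_general` test (plain set inclusion over environment constraints) are hypothetical stand-ins for real SMARTS templates and substructure matching. The sketch only shows the bookkeeping: clustering by reaction center, keeping the most general templates, and recording parent-child relations.

```python
from collections import defaultdict
from typing import NamedTuple


class Template(NamedTuple):
    center: str            # reaction center (hypothetical string key)
    constraints: frozenset  # extra environment constraints (toy stand-in for SMARTS)


def is_more_general(a, b):
    """Toy generality test (assumption): fewer constraints means more general.
    A real implementation would test SMARTS substructure matches instead."""
    return a.constraints <= b.constraints


def correct_templates(templates):
    """Map every template to its most general parent within its cluster."""
    # Cluster by reaction center so that templates with different
    # reaction centers are never compared with each other.
    clusters = defaultdict(list)
    for t in templates:
        clusters[t.center].append(t)

    parent = {}
    for members in clusters.values():
        kept = []  # most general templates found so far (starts empty)
        for t in members:
            general = next((k for k in kept if is_more_general(k, t)), None)
            if general is not None:
                if general != t:  # exact duplicates need no parent entry
                    parent[t] = general
                continue
            # t is more general than some kept templates: replace them.
            for k in [k for k in kept if is_more_general(t, k)]:
                kept.remove(k)
                parent[k] = t
            kept.append(t)

    def resolve(t):
        # Follow the recorded parent-child relations to the top.
        while t in parent:
            t = parent[t]
        return t

    return {t: resolve(t) for t in templates}


# Hypothetical example data.
a = Template("C-N", frozenset())
b = Template("C-N", frozenset({"aryl"}))
c = Template("C-N", frozenset({"aryl", "ortho-F"}))
d = Template("C-O", frozenset({"aryl"}))
mapping = correct_templates([c, b, d, a])
```

In this toy example, `b` and `c` are both replaced by `a`, while `d` is untouched because it belongs to a different reaction-center cluster.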
S5 Canonicalization of templates
As described in the main article, Weisfeiler-Lehman refinement was used as the ranking method within our template canonicalization approach (Label function in Algorithm 2).
Within the Label function, the Combine function is a simple label concatenation and compression in most cases, except when combining neighbor features, where a commutative and associative operation must be used. One can use numeric addition or sort the neighbor features before concatenation. The canonicalization process (Algorithm 2) starts by parsing the template and standardizing the atomic SMARTS in the template, because they are used as features in the Weisfeiler-Lehman refinement. In this study, we rely on RDKit to parse and write the SMARTS strings. Next, the reactant and product graphs are merged according to their atom mapping numbers. Before the Weisfeiler-Lehman refinement, the ConnectedSubgraphSize function computes the size of the subgraph, which serves as an invariant to distinguish highly symmetric loops. Then, the Weisfeiler-Lehman refinement is performed together with chirality fixing, which removes unnecessary chiral tags that arise from identical neighbors around tetrahedral chiral centers or double bonds. At this point, the canonical form of the graph has been generated, but additional steps are required to standardize the atom mapping numbers if there is symmetry in the graph. The symmetry can be broken by assigning a unique and reproducible tag to the smallest non-unique node according to the order of the previously computed canonical labels.
This tie-breaking process is repeated until all canonical labels are unique. The order of the canonical labels then gives the canonical rank of the nodes and also their unique atom mapping numbers. In some applications, e.g., retrosynthesis planning, the mirror image of a chiral template carries the same chemical transformation, because R→S and S→R templates both indicate chirality inversion and the reaction templates may not apply selectively to only one of the enantiomers. In this scenario, one may simply choose the lexicographically smaller template string as the canonical form.
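The refinement and tie-breaking steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Algorithm 2 itself: the graph is a plain adjacency dictionary, the node features are arbitrary strings standing in for standardized atomic SMARTS, and chirality fixing, the subgraph-size invariant, and serialization are omitted. The function names `refine` and `canonical_ranks` are hypothetical. Neighbor labels are sorted before being combined, which is one of the two order-independent options mentioned above.

```python
from collections import Counter


def refine(adj, labels):
    """Weisfeiler-Lehman refinement until the partition stops splitting."""
    while True:
        # Combine each node's label with its *sorted* neighbor labels,
        # so the result does not depend on neighbor order.
        sig = {v: (labels[v], tuple(sorted(labels[u] for u in adj[v])))
               for v in adj}
        # Compress the signatures to dense integer ranks.
        rank = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new = {v: rank[sig[v]] for v in adj}
        if len(set(new.values())) == len(set(labels.values())):
            return new
        labels = new


def canonical_ranks(adj, features):
    """Refine, then break ties until every node has a unique rank."""
    labels = refine(adj, features)
    while len(set(labels.values())) < len(adj):
        counts = Counter(labels.values())
        tied = min(l for l, c in counts.items() if c > 1)
        # Pick one node of the smallest non-unique class (assumption:
        # members of a tied class are interchangeable by symmetry).
        chosen = next(v for v in adj if labels[v] == tied)
        # Double all labels to make room, then give the chosen node
        # a unique label just below its former class.
        labels = {v: 2 * l for v, l in labels.items()}
        labels[chosen] -= 1
        labels = refine(adj, labels)
    return labels


def canonical_form(adj, features):
    """Node features and edges expressed in canonical ranks."""
    r = canonical_ranks(adj, features)
    nodes = sorted((r[v], features[v]) for v in adj)
    edges = sorted({tuple(sorted((r[u], r[v]))) for u in adj for v in adj[u]})
    return nodes, edges


# Two different numberings of the same C-O-C graph (hypothetical example).
form_1 = canonical_form({0: [1], 1: [0, 2], 2: [1]}, {0: "C", 1: "O", 2: "C"})
form_2 = canonical_form({0: [1, 2], 1: [0], 2: [0]}, {0: "O", 1: "C", 2: "C"})
```

Both numberings yield the same canonical form, because the two symmetric carbon atoms are separated only in the tie-breaking step and the oxygen keeps the highest rank in either case.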

Algorithm 2 Canonicalization of templates
G_r, G_p: Graphs for the reactant and product.
...
G ← Merge G_r, G_p into a condensed graph using atom mapping.
...
return Serialize(G)
end procedure

S6 Supplemental figures
Top-N-accuracies: Fig. S5 and S6 depict top-N-accuracies for N = 1 and 50 for the USPTO-50k dataset. Top-N-accuracies for N = 1, 5 and 50 for the USPTO-460k dataset are shown in Fig. S7, S8 and S9. Fig. S10 depicts top-N-accuracies for the regular and canonical-corrected templates only for the ML-fixed model. Across the different datasets, N, and models, hierarchical correction of templates consistently leads to improved model performance and closes the gap between different evaluation schemes (identifying the correct template (darker shade) or precursor (lighter shade)). Canonicalizing templates also leads to a reduction in the gap between the two evaluation schemes, but to a smaller extent.
Larger templates cause the ranking models to perform worse across all models (similarity, ML-fixed and ML-learned). Fig. S11 depicts top-N-accuracies for USPTO-460k via rank evaluation either per template (left) or per precursor (right), where one template might produce multiple precursors. The latter case is more relevant to multi-step synthesis pathway planning, since an excessive number of precursors in each step makes an iterative search computationally expensive. Templates at radius 0 are readily applicable and produce a large number of precursors, thus decreasing the top-N-accuracies per precursor even below the respective results for radius 1 or default templates. The same effect was observed for USPTO-50k in the main article.
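The difference between ranking per template and ranking per precursor can be made concrete with a small sketch. It reflects our reading of the two evaluation modes and is not the evaluation code used in this study. The template names, the `outcomes` dictionary (template to list of precursors), and both function names are hypothetical.

```python
def top_n_hit_by_template(ranked_templates, true_template, n):
    """Hit if the recorded template is among the n highest-ranked templates."""
    return true_template in ranked_templates[:n]


def top_n_hit_by_precursor(ranked_templates, outcomes, true_precursor, n):
    """Hit if the recorded precursor is among the first n precursors obtained
    by applying the templates in rank order (one template may yield several)."""
    precursors = []
    for t in ranked_templates:
        for p in outcomes.get(t, []):
            if p not in precursors:
                precursors.append(p)
    return true_precursor in precursors[:n]


# Hypothetical example: a very general top-ranked template produces three
# precursors and pushes the recorded precursor down to position four.
ranked = ["t_general", "t_true"]
outcomes = {"t_general": ["p_a", "p_b", "p_c"], "t_true": ["p_true"]}

hit_template_top2 = top_n_hit_by_template(ranked, "t_true", 2)
hit_precursor_top2 = top_n_hit_by_precursor(ranked, outcomes, "p_true", 2)
hit_precursor_top4 = top_n_hit_by_precursor(ranked, outcomes, "p_true", 4)
```

In this example the recorded template is a top-2 hit, but the recorded precursor is only a top-4 hit, which mirrors the effect described above for readily applicable radius 0 templates.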
Applicabilities of highest ranking templates: Applicabilities of the top 1 and 50 templates with the USPTO-50k dataset are shown in Fig. S12 and S13. For USPTO-460k, applicabilities of the top 1, 5 and 50 templates are depicted in Fig. S14, S15 and S16. For all datasets, models and number of evaluated templates, larger radii decrease applicability.
Hierarchically corrected templates help the model to identify more applicable templates, especially at large template sizes ('Default', 'Radius 2' and 'Radius 3'). The influence of canonicalization is negligible.
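As a minimal illustration of the applicability measure, the following sketch computes the fraction of the N highest-ranked templates that yield at least one outcome for a given molecule. The function name and the `outcomes` dictionary are hypothetical, and treating a template as applicable when it produces at least one outcome is our assumption.

```python
def applicability(ranked_templates, outcomes, n):
    """Fraction of the n highest-ranked templates that yield any outcome."""
    top = ranked_templates[:n]
    if not top:
        return 0.0
    # A template counts as applicable if applying it gives a non-empty result.
    return sum(1 for t in top if outcomes.get(t)) / len(top)


# Hypothetical example: t2 matches nothing and t3 was never applicable.
ranked = ["t1", "t2", "t3", "t4"]
outcomes = {"t1": ["p"], "t2": [], "t4": ["q"]}

top1 = applicability(ranked, outcomes, 1)
top4 = applicability(ranked, outcomes, 4)
```

Averaging this value over all test molecules would give a dataset-level applicability for a chosen N.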
Figure S5: Top-1-accuracies of proposed retrosynthetic disconnections (top) and forward predictions (bottom) for the USPTO-50k dataset using the 'sim', 'ml-fixed', and 'ml-learned' models. The darker shade in a bar corresponds to evaluation via comparing templates, the lighter shade to comparing precursors or products. Each set of four bars shows the effects of canonicalizing ('can') or hierarchically correcting ('cor') the regular uncorrected templates ('reg'), or both ('can-cor').
Figure S6: Top-50-accuracies of proposed retrosynthetic disconnections (top) and forward predictions (bottom) for the USPTO-50k dataset using the 'sim', 'ml-fixed', and 'ml-learned' models. The darker shade in a bar corresponds to evaluation via comparing templates, the lighter shade to comparing precursors or products. Each set of four bars shows the effects of canonicalizing ('can') or hierarchically correcting ('cor') the regular uncorrected templates ('reg'), or both ('can-cor').
Figure S7: Top-1-accuracies of proposed retrosynthetic disconnections (top) and forward predictions (bottom) for the USPTO-460k dataset using the 'sim', 'ml-fixed', and 'ml-learned' models. The darker shade in a bar corresponds to evaluation via comparing templates, the lighter shade to comparing precursors or products. Each set of four bars shows the effects of canonicalizing ('can') or hierarchically correcting ('cor') the regular uncorrected templates ('reg'), or both ('can-cor').
Figure S8: Top-5-accuracies of proposed retrosynthetic disconnections (top) and forward predictions (bottom) for the USPTO-460k dataset using the 'sim', 'ml-fixed', and 'ml-learned' models. The darker shade in a bar corresponds to evaluation via comparing templates, the lighter shade to comparing precursors or products. Each set of four bars shows the effects of canonicalizing ('can') or hierarchically correcting ('cor') the regular uncorrected templates ('reg'), or both ('can-cor').
Figure S9: Top-50-accuracies of proposed retrosynthetic disconnections (top) and forward predictions (bottom) for the USPTO-460k dataset using the 'sim', 'ml-fixed', and 'ml-learned' models. The darker shade in a bar corresponds to evaluation via comparing templates, the lighter shade to comparing precursors or products. Each set of four bars shows the effects of canonicalizing ('can') or hierarchically correcting ('cor') the regular uncorrected templates ('reg'), or both ('can-cor').
Figure S10: Dependence of top-N-accuracies of proposed retrosynthetic disconnections (top) and forward predictions (bottom) on the template scheme for the USPTO-460k dataset, ML-fixed model. 'reg' corresponds to uncorrected, 'can-cor' to canonical and corrected templates. P means evaluated by precursors or products (continuous line), T means evaluated by template match (dashed line).

Figure S12: Fraction of applicable templates among the highest-ranked (top-1) templates of proposed retrosynthetic disconnections (top) and forward predictions (bottom) for the USPTO-50k dataset using the 'sim', 'ml-fixed', and 'ml-learned' models. Each set of four bars shows the effects of canonicalizing ('can') or hierarchically correcting ('cor') the regular uncorrected templates ('reg'), or both ('can-cor').
Figure S13: Fraction of applicable templates among the 50 highest-ranked templates of proposed retrosynthetic disconnections (top) and forward predictions (bottom) for the USPTO-50k dataset using the 'sim', 'ml-fixed', and 'ml-learned' models. Each set of four bars shows the effects of canonicalizing ('can') or hierarchically correcting ('cor') the regular uncorrected templates ('reg'), or both ('can-cor').

Figure S14: Fraction of applicable templates among the highest-ranked (top-1) templates of proposed retrosynthetic disconnections (top) and forward predictions (bottom) for the USPTO-460k dataset using the 'sim', 'ml-fixed', and 'ml-learned' models. Each set of four bars shows the effects of canonicalizing ('can') or hierarchically correcting ('cor') the regular uncorrected templates ('reg'), or both ('can-cor').
Figure S15: Fraction of applicable templates among the five highest-ranked templates of proposed retrosynthetic disconnections (top) and forward predictions (bottom) for the USPTO-460k dataset using the 'sim', 'ml-fixed', and 'ml-learned' models. Each set of four bars shows the effects of canonicalizing ('can') or hierarchically correcting ('cor') the regular uncorrected templates ('reg'), or both ('can-cor').

S7 Top-N-accuracies of all systems