Unbiasing Retrosynthesis Language Models with Disconnection Prompts

Data-driven approaches to retrosynthesis are limited in user interaction, diversity of their predictions, and recommendation of unintuitive disconnection strategies. Herein, we extend the notions of prompt-based inference in natural language processing to the task of chemical language modeling. We show that by using a prompt describing the disconnection site in a molecule we can steer the model to propose a broader set of precursors, thereby overcoming training data biases in retrosynthetic recommendations and achieving a 39% performance improvement over the baseline. For the first time, the use of a disconnection prompt empowers chemists by giving them greater control over the disconnection predictions, which results in more diverse and creative recommendations. In addition, in place of a human-in-the-loop strategy, we propose a two-stage schema consisting of automatic identification of disconnection sites, followed by prediction of reactant sets, thereby achieving a considerable improvement in class diversity compared with the baseline. The approach is effective in mitigating prediction biases derived from training data. This provides a wider variety of usable building blocks and improves the end user’s digital experience. We demonstrate its application to different chemistry domains, from traditional to enzymatic reactions, in which substrate specificity is critical.


A. Results and Discussion
Figure S1: Top-N accuracy for the ability to reproduce the disconnection site across the disconnection-aware and baseline models, where N is the number of predictions. The disconnection site was determined by reconstituting the reaction from the predicted precursors and the query product. If the predicted precursors could be obtained from the pre-labelled disconnection site, the prediction counts toward the top-N accuracy metric. We observe that top-N accuracy decreases for the baseline models: their predicted precursors correspond to a disconnection site other than the one pre-labelled in the test set, and no suitable precursors are generated for the desired disconnection. In comparison, the disconnection-aware models are consistently able to predict suitable sets of precursors, with the exception of the USPTO50k dataset, for which there is a decline in performance.
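As a minimal illustration of the metric, a generic top-N accuracy function is sketched below (the function name and data layout are illustrative, not from the paper's code; here the "target" would be the pre-labelled disconnection site recovered by reconstituting the reaction):

```python
def top_n_accuracy(ranked_predictions, targets, n):
    """Fraction of queries whose ground truth appears among the
    first n ranked predictions (best prediction first).

    ranked_predictions: list of prediction lists, one per query.
    targets: the ground-truth item for each query.
    """
    hits = sum(target in preds[:n]
               for preds, target in zip(ranked_predictions, targets))
    return hits / len(targets)
```

A prediction only counts if the ground truth appears within the first n entries, so the metric is non-decreasing in N for a fixed model.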

D. Prompt Generation - Extracting Atom-Tags
Algorithm 1 sketches the pseudocode used to prepare the training data for the model.

Algorithm 1: A function that converts the molecule objects for the precursors and product (typically in RDKit Mol format) to the format required for training the model. all_atom_map_numbers is a function returning the list of atom map indices present in the product object; neighborhood refers to the neighboring atoms and their corresponding bond types.
Input: precursors: molecule object for the precursors, including atom mapping information; product: molecule object for the product, including atom mapping information.
Output: (precursors, product): a tuple containing the new molecule objects for the precursors and product.
The comparison step of the listing reads:
if neighborhood(precursors_atom) = neighborhood(product_atom) then
    transformed_atoms.append(product_atom)

Figure S2: Performance of the tag autocompletion model with respect to its ability to reconstruct the disconnection sites, shown for different sizes of the disconnection site as represented by the number of atom-tags. The ground-truth distributions of atom-tags for the respective datasets are shown. A correlation between the ability to reconstruct atom-tags and the training data is observed.
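Only a fragment of the listing survives above, so the sketch below is an interpretation rather than the paper's implementation: it assumes the disconnection-site atoms are those product atoms whose mapped neighborhood differs between product and precursors, and it abstracts the RDKit Mol objects into plain dictionaries (map number → set of (neighbour map number, bond type) pairs). The function name and data layout are hypothetical.

```python
def find_disconnection_tags(precursors_nbrs, product_nbrs):
    """Compare atom environments between product and precursors.

    Both arguments map an atom-map number to the set of
    (neighbour_map_number, bond_type) pairs for that atom -- an
    abstraction of the neighborhood lookup on RDKit Mol objects
    (unmapped atoms, e.g. a leaving-group OH, are excluded).
    Returns the map numbers of product atoms whose environment
    changed, i.e. the atoms assumed to be tagged.
    """
    return sorted(
        map_num for map_num, env in product_nbrs.items()
        if precursors_nbrs.get(map_num) != env
    )
```

For an amide coupling CH3-C(=O)-OH + H2N-CH3 → CH3-C(=O)-NH-CH3 (atoms mapped 1-5, carbonyl O as 3), the carbonyl carbon (2) and the nitrogen (4) gain or change bonds, so those two atoms are returned as the disconnection site.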

E. Tag Completion
The tag-autocompletion model achieved 77% accuracy on average across all datasets for tag reconstruction. Broken down by the number of tagged atoms, the accuracy curve follows the tag distribution (Figure S2): performance is highest when the number of tags equals two and drops when the number of tags equals four, as observed previously, owing to a lack of training data. The single-tag case is omitted because it was not permuted, given that no permutations exist.
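The single-tag omission falls out naturally if incomplete inputs are generated combinatorially. The exact permutation scheme is not reproduced in this section; the sketch below assumes, purely for illustration, that each non-empty proper subset of the full tag set forms one incomplete input paired with the full set as the target (function name hypothetical):

```python
from itertools import combinations

def partial_tag_examples(tags):
    """Generate (partial, full) training pairs for tag
    autocompletion: every non-empty proper subset of the tagged
    atoms serves as an incomplete input, paired with the full tag
    set as the completion target."""
    full = tuple(sorted(tags))
    pairs = []
    for r in range(1, len(full)):  # proper subsets only
        for subset in combinations(full, r):
            pairs.append((subset, full))
    return pairs
```

With a single tag the subset loop is empty, so no training pairs exist, which is consistent with the single-tag case being excluded from the evaluation.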

F. OpenNMT Model Training
The following command defines a transformer-based sequence-to-sequence model and trains it by optimizing the negative log-likelihood.
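The command itself did not survive extraction. A representative OpenNMT-py invocation is given below; every hyperparameter value shown is illustrative (typical transformer settings for chemical seq2seq models), not necessarily what was used in this work:

```shell
# Illustrative OpenNMT-py command: trains a transformer
# encoder-decoder by minimizing the negative log-likelihood.
# Data paths and all hyperparameter values are placeholders.
onmt_train \
    -data preprocessed_data \
    -save_model model \
    -encoder_type transformer -decoder_type transformer \
    -layers 4 -rnn_size 384 -word_vec_size 384 \
    -heads 8 -transformer_ff 2048 \
    -position_encoding -share_embeddings \
    -batch_size 4096 -batch_type tokens -normalization tokens \
    -optim adam -adam_beta1 0.9 -adam_beta2 0.998 \
    -decay_method noam -warmup_steps 8000 -learning_rate 2 \
    -label_smoothing 0.0 -dropout 0.1 \
    -train_steps 500000 -save_checkpoint_steps 10000 \
    -world_size 1 -gpu_ranks 0
```

Token-based batching with Noam learning-rate decay is the standard recipe for transformer training in OpenNMT-py; the actual layer count, hidden size, and step budget would need to match the paper's configuration.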

Table S5: Experiments conducted to determine the approach to take when presented with an incomplete specification of the disconnection site, generated either by a human or by the automatic tagging model. The disconnection-aware model is able to handle incomplete specification of the disconnection site, albeit with slightly lower disconnection accuracy.

H. Improved Class Diversity
Figure S4: USPTO50k reaction class diversity.