Deep Generative Models for 3D Linker Design

Rational compound design remains a challenging problem for both computational methods and medicinal chemists. Computational generative methods have begun to show promising results for the design problem. However, they have not yet used the power of three-dimensional (3D) structural information. We have developed a novel graph-based deep generative model that combines state-of-the-art machine learning techniques with structural knowledge. Our method (“DeLinker”) takes two fragments or partial structures and designs a molecule incorporating both. The generation process is protein-context-dependent, utilizing the relative distance and orientation between the partial structures. This 3D information is vital to successful compound design, and we demonstrate its impact on the generation process and the limitations of omitting such information. In a large-scale evaluation, DeLinker designed 60% more molecules with high 3D similarity to the original molecule than a database baseline. When considering the more relevant problem of longer linkers with at least five atoms, the outperformance increased to 200%. We demonstrate the effectiveness and applicability of this approach on a diverse range of design problems: fragment linking, scaffold hopping, and proteolysis targeting chimera (PROTAC) design. As far as we are aware, this is the first molecular generative model to incorporate 3D structural information directly in the design process. The code is available at https://github.com/oxpig/DeLinker.

We implemented the function f, which maps the hidden state of a node to its atom type, as a linear classifier with attention that takes the node's hidden vector and assigns one of the node types.
The attention mechanism is similar to Bahdanau et al. 3 and allows the label for a given node to depend on the hidden states of the other expansion nodes.
Similarly, we augmented the edge selection and edge labelling steps by adding attention between the feature vectors of all candidate edges. This allows the score of a candidate edge to depend on the other possible edges. We trained the model for 10 epochs using the Adam optimiser with a learning rate of 0.001.
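The node-labelling step described above can be illustrated with a minimal sketch of additive (Bahdanau-style) attention over the expansion nodes. The exact parameterisation used in DeLinker is not given here, so the score function, the concatenation of hidden state and context, and all names (`additive_attention`, `node_type_logits`, `W_q`, `W_k`, `v`, `W_out`) are assumptions made for illustration only; the weights are random rather than trained.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(H, W_q, W_k, v):
    """Additive attention among node hidden states H.

    score(i, j) = v . tanh(W_q h_i + W_k h_j); weights are a softmax over j.
    Returns the attention weights and one context vector per node.
    """
    queries = [matvec(W_q, h) for h in H]
    keys = [matvec(W_k, h) for h in H]
    dim = len(H[0])
    weights, contexts = [], []
    for q in queries:
        scores = [
            sum(vi * math.tanh(qi + ki) for vi, qi, ki in zip(v, q, k))
            for k in keys
        ]
        a = softmax(scores)
        weights.append(a)
        # Context: attention-weighted sum of all node hidden states.
        contexts.append(
            [sum(a_j * h[d] for a_j, h in zip(a, H)) for d in range(dim)]
        )
    return weights, contexts


def node_type_logits(H, contexts, W_out):
    """Linear classifier on [h_i ; context_i] (assumed combination)."""
    return [matvec(W_out, h + c) for h, c in zip(H, contexts)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_nodes, dim, att_dim, n_types = 6, 4, 3, 5  # illustrative sizes only
H = rand_matrix(n_nodes, dim, rng)
W_q = rand_matrix(att_dim, dim, rng)
W_k = rand_matrix(att_dim, dim, rng)
v = [rng.uniform(-1, 1) for _ in range(att_dim)]
W_out = rand_matrix(n_types, 2 * dim, rng)

weights, contexts = additive_attention(H, W_q, W_k, v)
logits = node_type_logits(H, contexts, W_out)
```

Because each node's logits depend on a context vector built from every expansion node, the predicted label for one node can change when the hidden states of the others change, which is the dependence the attention mechanism is meant to provide.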

Hyperparameter search
We performed a limited hyperparameter search of the following parameters (final parameters in bold):

The model was fairly robust to the choice of hyperparameters. Performance was measured via the validation reconstruction loss and not generative performance.

Data curation
Fragment-molecule pairs for the ZINC 4 and CASF 5 sets were constructed as follows. First, all possible fragmentations of each molecule were produced by enumerating all double cuts of acyclic single bonds. 6 These were then filtered to remove trivial and unrealistic cases using the following constraints: (i) minimum linker length: 3 atoms, (ii) minimum fragment size: 5 atoms, (iii) the linker must have fewer heavy atoms than either fragment, and (iv) minimum path length between fragments: 2 atoms.
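The four constraints above can be expressed as a simple predicate. The following is a minimal sketch, not the authors' implementation: it assumes the atom counts and path length have already been computed for each candidate fragmentation, and the `Fragmentation` record and `passes_constraints` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Fragmentation:
    """Hypothetical summary of one double-cut fragmentation."""
    linker_atoms: int                # heavy atoms in the linker
    fragment_atoms: Tuple[int, int]  # heavy atoms in each fragment
    path_length: int                 # atoms on the shortest path between fragments


def passes_constraints(f, min_linker=3, min_fragment=5, min_path=2):
    """Apply constraints (i)-(iv) to a candidate fragmentation."""
    smallest_fragment = min(f.fragment_atoms)
    return (
        f.linker_atoms >= min_linker              # (i) minimum linker length
        and smallest_fragment >= min_fragment     # (ii) minimum fragment size
        and f.linker_atoms < smallest_fragment    # (iii) linker smaller than either fragment
        and f.path_length >= min_path             # (iv) minimum path length
    )


candidates = [
    Fragmentation(4, (8, 10), 3),  # satisfies all four constraints
    Fragmentation(2, (8, 10), 2),  # linker too short
    Fragmentation(6, (5, 10), 4),  # linker not smaller than the smallest fragment
]
kept = [c for c in candidates if passes_constraints(c)]
```

Only the first candidate is retained; the other two fail constraints (i) and (iii) respectively.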
The remaining fragment-molecule pairs were filtered for several 2D properties: (i) the synthetic accessibility (SA) score 7 of the molecule must be lower than that of the fragments with exit vectors represented by dummy atoms, (ii) the molecule must pass pan-assay interference (PAINS) 8 filters, and (iii) rings must be either saturated aliphatic or aromatic (according to RDKit 9 valency rules). PAINS filters were implemented by SMARTS substructure searching with RDKit, using the RDKit version of the Saubern et al. 10 filters.

Additional results
Table S2: Ablation study for DeLinker, our deep generative method, on the ZINC data set. We show the effect on the 2D metrics of removing all of the structural information ("No info") and including only the distance information ("Distance") compared to our full protocol ("DeLinker"), the database baseline ("Database"), and a graph-based baseline 1 ("CGVAE"). See Data curation for a description of the 2D property filters.

We show the effect on the 3D metrics of removing all of the structural information ("No info") and including only the distance information ("Distance") compared to our full protocol ("DeLinker") that includes both distance and angle information, the database baseline ("Database"), and a graph-based baseline 1 ("CGVAE"). See Methods - Assessment metrics for a description of the metrics.

Figure S1: A random sample of 50 novel linkers generated by DeLinker during testing on the held-out ZINC data set.

Figure S2: Fragment linking case study. The top 20 molecules generated by DeLinker that met the 3D similarity threshold, ranked by AutoDock Vina 11,12 score. Labels are the docking score from minimizing the aligned molecular conformer according to the Vina energy function.

Figure S3: Scaffold hopping case study. The top 20 molecules generated by DeLinker that met the 3D similarity threshold, ranked by AutoDock Vina 11,12 score. Labels are the docking score from minimizing the aligned molecular conformer according to the Vina energy function.

Figure S4: PROTAC design case study. The top 20 molecules generated by DeLinker that met the 3D similarity threshold, ranked by AutoDock Vina 11,12 score. Labels are the docking score from minimizing the aligned molecular conformer according to the Vina energy function.