CENsible: Interpretable Insights into Small-Molecule Binding with Context Explanation Networks

We present a novel, interpretable approach for assessing small-molecule binding using context explanation networks. Given the specific structure of a protein/ligand complex, our CENsible scoring function uses a deep convolutional neural network to predict the contributions of precalculated terms to the overall binding affinity. We show that CENsible can effectively distinguish active from inactive compounds for many systems. Its primary benefit over related machine-learning scoring functions, however, is that it retains interpretability, allowing researchers to identify the contribution of each precalculated term to the final affinity prediction, with implications for subsequent lead optimization.


CENsible architecture
Figure S1. CENsible has the same architecture as gnina's default2018 model, except that the final layer outputs weight vectors (coefficients) rather than affinities. Each Conv3D layer is followed by a rectified linear unit (ReLU), not shown. N is the batch size. In total, there are 8,269,024 fitted parameters. To calculate the loss per batch, the model determines per-batch affinity predictions by taking the row-wise dot products of these weights and the precalculated terms.
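As a minimal illustration of the final prediction step described in the caption (not the actual CENsible implementation; the batch size, term count, and values below are arbitrary placeholders), the affinity prediction reduces to a batch of row-wise dot products between predicted weights and precalculated terms:

```python
import numpy as np

rng = np.random.default_rng(0)
n_batch, n_terms = 4, 10  # hypothetical sizes

# The network's final layer emits one coefficient per precalculated term.
weights = rng.normal(size=(n_batch, n_terms))  # predicted weight vectors
terms = rng.normal(size=(n_batch, n_terms))    # precalculated smina-style terms

# Per-example affinity prediction: row-wise dot product of weights and terms.
affinities = np.einsum("bt,bt->b", weights, terms)
```

The loss is then computed between these per-batch affinity predictions and the experimental affinities, so the gradient flows through the predicted coefficients.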

Evaluating Overfitting
Figure S2. Evaluation of model overfitting. Pearson correlation coefficients for the training (blue) and testing (orange) sets are plotted across epochs for three distinct folds. The performance on the testing sets does not degrade over time, and all three models converge to similar coefficients, indicating consistent generalization across different data subsets and suggesting that overfitting is not severe.
To assess whether our training protocol is prone to overfitting, we retrained the same threefold cross-validation models described in Table 1, this time calculating the Pearson correlation coefficients for all training-set and testing-set predictions per epoch (Figure S2). As expected, each model is better at predicting the affinities of examples in its training set. Such a gap is typical because models generally perform better on the data used for training, but it is especially pronounced here, given our efforts to ensure that the training and testing sets were independent in terms of protein-sequence similarity. Most importantly, the testing-set performance does not substantially degrade over time, suggesting that overfitting is not severe.
The three models also converge to roughly the same Pearson correlation coefficients, even though they are evaluated on different testing sets. This consistency suggests that the models generalize well, indicating that severe overfitting to the training data is unlikely.
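The per-epoch tracking above relies on the standard Pearson correlation coefficient between predicted and experimental affinities. A small self-contained sketch (equivalent to, e.g., SciPy's stats.pearsonr statistic):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))
```

Applied per epoch to the training-set and testing-set predictions separately, this yields the blue and orange curves in Figure S2.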

Training Evaluation
Figure S3. Training evaluations for the first (left), second (middle), and third (right) folds when using clustered cross-validation. These analyses were performed using the same models described in Table 1. Top row: the loss per batch. The orange line represents a moving average with a window size of 100. Middle row: the Pearson coefficient on the withheld test set for each fold, per epoch of training. Bottom row: a scatter plot of experimental vs. CENsible-predicted affinities (negative logarithm transforms).

CENsible and Smina Capture Similar Physicochemical Contributions to Binding
Following the approach detailed in the main manuscript, we investigated the correlation between the steric- and repulsion-term contributions to the smina and CENsible scores.
To assess steric contributions to binding, we considered the gauss(o=0,_w=0.5,_c=8) and gauss(o=3,_w=2,_c=8) terms, adding these weighted terms together to capture their combined contributions to the final scores. The steric contributions associated with the two scoring functions are again highly correlated (Figure S4A), suggesting that both scoring functions generally capture the same binding principles.
To assess the repulsion-term contributions to binding, we considered the repulsion(o=0,_c=8) term, which smina and CENsible also have in common. The analysis again revealed a strong correlation (Figure S4B); however, we found that CENsible systematically scales down the weight on the repulsion term, making its contribution to the overall score far smaller than in smina. The repulsion-term contribution to a few of the PDBbind crystal structures is very large; we suspect that CENsible has reduced the influence of this term to overcome this apparent artifact.
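The comparison above can be sketched as follows. All data here are synthetic placeholders (the real analysis used the ~19,000 PDBbind complexes), and the constant weights are only approximations of smina's published Vina-style defaults; the point is the mechanics of summing weighted terms per complex and correlating the two functions' contributions:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000  # placeholder number of complexes

# Hypothetical per-complex values of the two shared steric terms.
gauss_narrow = rng.uniform(0, 50, n)  # gauss(o=0,_w=0.5,_c=8)
gauss_wide = rng.uniform(0, 80, n)    # gauss(o=3,_w=2,_c=8)

# smina uses one fixed weight per term for every complex (values approximate).
w1, w2 = -0.0356, -0.0052
# CENsible predicts per-complex weights; modeled here as noisy perturbations.
cen_w1 = w1 + rng.normal(0, 0.005, n)
cen_w2 = w2 + rng.normal(0, 0.001, n)

# Combined steric contribution to each function's final score.
smina_steric = w1 * gauss_narrow + w2 * gauss_wide
cen_steric = cen_w1 * gauss_narrow + cen_w2 * gauss_wide

r = np.corrcoef(smina_steric, cen_steric)[0, 1]
```

The same recipe applies to the repulsion term, using repulsion(o=0,_c=8) values in place of the two gauss terms.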

CENsible Scores Depend on Accurate Poses
To provide evidence that CENsible scores depend on the accuracy of the underlying ligand poses, we considered all the known-active compounds from the three virtual screens presented in the main text: glucokinase, neuraminidase, and methionyl-tRNA synthetase. We rescored these compounds' best-scoring and worst-scoring smina poses with CENsible. We hypothesize that CENsible leverages information about the protein/ligand complex to make its assessments, so the CENsible scores associated with the worst-scoring (likely incorrect) poses should be lower than those associated with the best-scoring (more likely correct) poses.
As shown in Figure S5, this is indeed the case. We used SciPy's 1 stats.mannwhitneyu function to perform three two-sided Mann-Whitney U tests 2 to assess whether the differences between the rescored best/worst smina poses are statistically significant. We selected this approach because Shapiro-Wilk tests 3 suggested that some of the score distributions shown in Figure S5 are not normally distributed, complicating the use of a straightforward Student's t-test 4. Per this assessment, the median values are statistically different in all three cases. Further, in all three cases, the median score of the rescored best-scoring smina pose was higher (better) than that of the rescored worst-scoring smina pose.
These results suggest that CENsible scores improve when reassessing accurate poses, further implying that CENsible's assessments do depend on the structure of the protein/ligand complex.
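The statistical procedure above can be reproduced with SciPy as follows. The score distributions here are synthetic stand-ins for the rescored best-pose and worst-pose CENsible scores (the real values come from Figure S5):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
best_scores = rng.normal(6.0, 1.0, 200)   # hypothetical best-pose rescores
worst_scores = rng.normal(5.0, 1.0, 200)  # hypothetical worst-pose rescores

# Shapiro-Wilk tests for normality guided the choice of a nonparametric test.
_, p_shapiro = stats.shapiro(best_scores)

# Two-sided Mann-Whitney U test, as in the analysis above.
u_stat, p_value = stats.mannwhitneyu(best_scores, worst_scores,
                                     alternative="two-sided")
```

A small p_value indicates that the two distributions' central tendencies differ, matching the reported conclusion that best- and worst-pose rescores are statistically distinguishable.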

CENsible Is Not Well Suited to Every Receptor
Though CENsible generally performs better than smina (Pearson's coefficients on testing sets; Table S1), it is not well suited to all systems, as is generally the case with scoring functions [5][6][7][8][9]. A benchmark virtual screen targeting HIV integrase illustrates this variability. The protein and compounds for this screen were again sourced from the DUD-E dataset (id HIVINT). The foundational smina screen was notably predictive (AUROC: 0.7864; Figure S6). However, when the smina poses were reassessed using CENsible, the AUROC declined markedly (0.4949; Figure S6). Similarly, rescoring the poses with the default gnina scoring function 10,11 also reduced the AUROC considerably (0.5237; data not shown).
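The AUROC values cited above can be computed directly from ranked scores. A compact sketch using the rank-sum identity (sklearn.metrics.roc_auc_score gives the same result; the data below are illustrative):

```python
import numpy as np

def auroc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    # Fraction of (active, decoy) pairs where the active outscores the decoy;
    # tied pairs count half.
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)
```

An AUROC of 0.5 corresponds to the line of no discrimination (a purely random classifier), which is why the 0.4949 value indicates essentially no predictivity for this target.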
Figure S6. A virtual screen targeting HIV integrase illustrates that CENsible, like all scoring functions, is not well suited to every receptor. On the left, the ROC curves associated with the smina (blue) and CENsible-rescored (red) virtual screens. The area under each curve is labeled as AUROC. The line of no discrimination (dotted black line), corresponding to a purely random classifier, is shown for reference. On the right, the top-compound enrichment factors (i.e., the extent to which the top-ranking compounds are enriched with true binders).

Virtual Screen Enrichment Factors
In the main text, we use three benchmark virtual screens to assess whether CENsible's learned principles of ligand binding apply to docked poses. We reason that if CENsible has truly learned general principles of ligand binding, it should be able to assess binding across a wide range of affinities. We thus used the AUROC metric to assess these screens (Figure 3) because it captures predictivity from the best-ranked compound to the worst.
Enrichment factors (EFs) are also useful for assessing virtual-screen predictivity. These factors capture the extent to which the top-ranked compounds are enriched with true binders. The EF metric is most useful when the goal is not to interpretably assess ligand binding but rather to identify novel ligands. Though EFs are not ideally suited for our assessment, we include them here for completeness.
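A minimal sketch of the EF calculation described above (the top fraction and input data are illustrative; EF is the actives fraction among the top-ranked compounds divided by the actives fraction overall):

```python
import numpy as np

def enrichment_factor(scores, labels, top_frac=0.01):
    """EF = (actives fraction in top-ranked subset) / (overall actives fraction)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_top = max(1, int(round(top_frac * len(scores))))
    order = np.argsort(-scores)          # best (highest) scores first
    top_hits = labels[order[:n_top]].mean()
    return float(top_hits / labels.mean())
```

An EF of 1.0 means the top-ranked subset contains actives at the background rate; values above 1.0 indicate enrichment, as plotted in Figure S7.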
By the EF metric, CENsible performs well relative to smina on the glucokinase and (especially) the neuraminidase screens, but smina has better enrichment in the T. brucei methionyl-tRNA synthetase screen (Figure S7). Regardless, CENsible's primary utility lies in its ability to provide interpretable output tailored to a specific protein receptor.

Comparison with Other Scoring Functions at Evaluating Crystallographic Poses
Our primary goal was to develop an interpretable machine-learning scoring function, but we were also curious how well our approach could score crystallographic poses compared to other related functions (e.g., smina 12 and the dense, gnina2018, gnina, crossdock_default2018, redock_default2018, and default2017 neural-network scoring functions available via the gnina software package 10,11).
We could not use the clustered cross-validation approach described in the main text to create independent training and testing sets for these other scoring functions because we do not know which proteins were included in their training sets. To ensure a fair comparison, we therefore trained three new CEN models using unclustered threefold cross-validation. Specifically, we randomly split the PDBbind data into three folds without regard for sequence similarity and trained three new CEN models on different combinations of two folds, withholding the third as a testing set in each case.
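The unclustered split described above can be sketched as follows (complex IDs are placeholders for the PDBbind entries; the real split operated on the full dataset):

```python
import numpy as np

rng = np.random.default_rng(42)
complex_ids = np.arange(30)        # placeholder for PDBbind complex IDs

# Shuffle and cut into three folds WITHOUT regard for sequence similarity.
shuffled = rng.permutation(complex_ids)
folds = np.array_split(shuffled, 3)

# Each model trains on two folds and is tested on the withheld third.
splits = []
for i in range(3):
    test = folds[i]
    train = np.concatenate([folds[j] for j in range(3) if j != i])
    splits.append((train, test))
```

The clustered protocol in the main text differs only in how the folds are formed: there, complexes are grouped by protein-sequence similarity before assignment, so similar proteins never straddle the train/test boundary.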
Table S1 shows that the three CEN scoring functions performed better than smina. This comparison is critical because the CEN scoring functions incorporate some smina terms. The difference is that smina uses a single set of weights for all protein/ligand complexes, whereas the CENs essentially customize the weights based on the structure of each protein/ligand complex.
We also evaluated the same poses present in each cross-validation testing set using the gnina neural-network-based scoring functions 10,11. Table S1 shows that the CENs are substantially better at predicting the binding affinities of crystallographic protein/ligand complexes than even the best of the gnina scoring functions. Given these results, it is tempting to conclude that the CEN scoring functions are better suited for prospective virtual screens than the gnina functions. We note, however, that the best-performing gnina function (gnina default) is an ensemble of five models, some of which were trained on docked rather than crystallographic data. The gnina default function may therefore be more predictive in the context of a virtual screen 10,11. Regardless, the strength of the CEN approach is that it is far more interpretable.
For Table S1, the training data were randomly split into thirds; for each fold, we trained a CEN model on two-thirds of the data and tested on the remaining third (top row). For comparison, we also rescored the test-set examples with smina and gnina-based scoring functions (subsequent rows). In the case of the smina scores, we considered the absolute values.

CENsible Performance on the CASF-2016 Benchmark
As stressed in the main text, CENsible is designed to provide interpretable insights into the specific physicochemical interactions that contribute to protein/ligand binding, given a correct ligand pose. Its primary goal is not to predict overall binding affinity but to evaluate the extent to which specific atomic interactions contribute to that final affinity prediction, suggesting new strategies for lead optimization. That said, we certainly acknowledge that CENsible must be reasonably effective at predicting affinity if we are to have any confidence in its interpretations.
We used the CASF-2016 benchmark core dataset to further assess CENsible's accuracy and compare it to other scoring functions. This set contains 285 protein-ligand complexes taken from the PDBbind2016 dataset, in addition to decoy molecules selected for docking and screening evaluation 13. We rescored the full core set using the allcen3 CENsible model, the most up-to-date version of the trained model available in the published GitHub repository.
CENsible ostensibly outperforms all other methods included in the CASF-2016 benchmark dataset in the scoring- and ranking-power benchmarks (Table S2). Given the "correct" (crystallographic) poses of known ligands, the scoring- and ranking-power benchmarks assess how well a scoring function predicts experimental binding affinities and affinity rankings, respectively. These benchmarks thus closely match the task CENsible was trained to perform (i.e., assessing known/correct ligand poses to characterize the physicochemical properties and interactions that contribute to binding). That said, we note that the PDBbind2016 core set overlaps with CENsible's training set, so CENsible's performance on this benchmark is likely inflated. In contrast, ΔvinaRF 14, ostensibly the next-best scoring function (and the only other machine-learning-based model), was trained on a dataset that specifically excluded PDBs present in the CASF-2007 or CASF-2013 benchmarks.
For the sake of time, we did not retrain CENsible on a new dataset that excluded the PDBbind2016 core-set structures. However, the cross-validation studies described in the previous section give a sense of how much CENsible's performance on the scoring-power benchmark might be inflated. When trained on two-thirds of the PDBbind data and evaluated on the remaining (withheld) third, CENsible achieved Pearson correlation coefficients of ~0.7 (Table S1). This value is lower than the 0.893 measured on the CASF-2016 benchmark. Nonetheless, if a retrained version of CENsible tailored specifically to CASF-2016 had similar performance, it would still rank among the best scoring functions evaluated per the scoring-power metric.
In contrast, CENsible's performance on the docking-power benchmark was poor (Table S3). This benchmark assesses how well a scoring function can identify a true ligand's correct pose from among many computer-generated incorrect poses of the same ligand. The docking-power task differs meaningfully from the one CENsible was trained to perform. CENsible provides interpretable insights given a correct ligand pose; it was trained only on correct (crystallographic) poses, not incorrect ones.
Finally, CENsible's performance on the screening-power benchmark was also poor (Table S3). This benchmark assesses how well a scoring function can identify true ligands from a set that includes many decoy molecules (presumed non-binders). This task also differs meaningfully from the one CENsible was trained to perform. Specifically, CENsible was trained on known-ligand poses and is not necessarily suited to the tasks of assessing incorrect poses or the poses of compounds that do not bind to the target.
Table S2 compares CENsible with the other scoring functions in scoring and ranking power. All benchmarks were performed using the crystallographic poses. Scoring power is measured as the Pearson correlation between computed scores and experimentally confirmed binding affinities. Ranking power is measured as the Spearman's rank correlation coefficient or the Predictive Index (see ref. 13).
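The ranking-power statistic mentioned above, Spearman's rank correlation, is simply the Pearson correlation of the ranks. A minimal sketch that assumes tie-free scores (scipy.stats.spearmanr handles ties and gives the same result on tie-free data):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation for tie-free data: Pearson of the ranks."""
    def rank(a):
        a = np.asarray(a, dtype=float)
        order = np.argsort(a)
        r = np.empty(len(a))
        r[order] = np.arange(len(a), dtype=float)  # 0-based ranks
        return r

    rx, ry = rank(x), rank(y)
    rxc, ryc = rx - rx.mean(), ry - ry.mean()
    return float(rxc @ ryc / np.sqrt((rxc @ rxc) * (ryc @ ryc)))
```

Because it depends only on rank order, this metric rewards correct affinity ordering of ligands against a target even when the absolute predicted affinities are miscalibrated.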

Similar CENsible Weight Vectors are Generally Adjacent in t-SNE Space
The t-SNE analysis presented in the main text is predicated on the assumption that similar predicted weight vectors, w, remain adjacent when projected into t-SNE space. To verify this assumption, we first used K-means clustering to identify groups of similar predicted weight vectors. We separately divided the data into 2, 3, 4, and 5 clusters and calculated silhouette coefficients for each clustering. These coefficients compare each example's distance to the other members of its own cluster with its distance to the nearest neighboring cluster, such that high silhouette values indicate better cluster separation. We found that the best separation occurred when we clustered the data into two or three clusters (Figure S8).
We next projected the members of these clusters into 2D t-SNE space to determine whether predicted weights belonging to the same cluster remained adjacent in the t-SNE space. Adjacency was largely preserved, especially when using two or three clusters (Figure S9, upper row), but also when dividing the data into four or five clusters (Figure S9, remaining panels). This analysis confirms that the clusters remain spatially coherent in the lower-dimensional space.
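The clustering and projection steps above can be sketched with scikit-learn (a plausible toolset; the 10-D data below are synthetic stand-ins for the predicted weight vectors, drawn as two well-separated blobs):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

# Synthetic placeholder for the predicted weight vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (30, 10)),
               rng.normal(3.0, 0.3, (30, 10))])

# Silhouette coefficients for k = 2..5 identify the best-separated clustering.
sil = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil[k] = silhouette_score(X, labels)

# Project into 2-D t-SNE space; in the real figures, points are colored
# by their K-means cluster to check that clusters stay adjacent.
embedding = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X)
```

With two clearly separated blobs, k = 2 yields the highest silhouette, mirroring the finding that two or three clusters separate the real weight vectors best.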

Figure S4. Correlations between smina and CENsible scoring contributions for shared physicochemical terms across ~19,000 PDBbind protein-ligand complexes. A) Combined steric terms. B) Repulsion term. All panels show linear regression lines that pass through the origin.

Figure S5. Probability distributions of CENsible-rescored smina poses. Only known active compounds from the three virtual screens are included. Bins are spaced 0.5 units apart, though only every other bin is labeled on the X axis. The score distributions associated with the worst and best smina-docked poses are shown in green and orange, respectively. U statistics and p values are given in the upper-right corner of each panel.

Figure S7. Top-compound enrichment factors (i.e., the extent to which the top-ranking compounds are enriched with true binders).

Figure S9. Predicted term weights projected onto 2D t-SNE space, colored by K-means cluster. Colors are randomly assigned.

Table S1. Pearson's coefficients across the three test splits, calculated via a linear fit between known and predicted affinities.

Table S2. Comparison of performance between CENsible and all scoring functions included in the CASF-2016 benchmark.