Learning Molecule Drug Function from Structure Representations with Deep Neural Networks or Random Forests

Empirical testing of chemicals for desirable properties, such as drug efficacy, toxicity, solubility, and thermal conductivity, costs many billions of dollars every year. Further, the choice of molecules for testing relies on expert knowledge and intuition in the relevant domain of application. The ability to predict the action of a molecule in silico would greatly increase the speed and decrease the cost of prioritizing molecules with desirable function for experimental testing. We explore the use of molecule structures to predict their drug classification without any explicit biological model (e.g. protein structure or cell system). Two approaches were compared: (1) a two-dimensional image representation of molecule structure with transfer learning from a pre-trained convolutional neural network (CNN), and (2) Morgan molecular fingerprints (MFP) with a random forest (RF). Both methods using only molecule information as input achieved significantly better accuracy than previous work that used transcriptomic measurements of molecular effects. Because these data suggest that much of a molecule’s function is encoded by its chemical structure, we explore which general molecular properties are drug class-specific and how misclassification of structures might provide drug repurposing opportunities.


Images with Convolutional Neural Networks (IMG+CNN)
Molecule images with RGB channels were resized to 224x224 pixels and used for training and validation of a CNN with predetermined weights from resnext101_64 31 implemented using fastai and pytorch 32 . The loss function used was binary cross entropy and the output layer was logsoftmax. During training, all the layers were frozen except the output layer, which was retrained for various numbers of epochs with a learning rate of 0.01, or with cyclic learning rate decreasing from 0.01 for various numbers of epochs 33 . The CNN was trained without data augmentation unless otherwise noted.

Molecular Fingerprints with Random Forests (MFP+RF)
Random forests 29 are ensembles of decision trees, where each tree is learned on a subsample of data points and features (in this case bits in a molecular fingerprint). Benchmarking studies often include MFP+RF models because of their strong performance on a variety of computational chemistry tasks 14,[34][35][36][37][38][39] The random forest model is implemented with scikit-learn 40 . The RF was configured to use 4000 trees, minimum of 1 sample per leaf, balanced class weight, and log2 max features, hyperparameters that performed well in a previous high-throughput chemical screening study 34 .

CNN Latent Representations
Latent image representations were used to compare the MFP and IMG representations of molecules. To generate latent representations of molecule images, the resnext101_64 CNN model was loaded and not trained on any molecule images. The final three layers, a batch normalization layer, linear layer, and logsoftmax, were removed. The same molecule images used for training and classification above were then passed through the truncated network to generate a 512-length vector of their latent representation in the final hidden layer.

Comparison of Drug-like Properties
CIDs were used to download Molecular weight, XLogP, HBondAcceptorCount, HBondDonorCount, and IsomericSMILES values using the pubchempy module. Compounds were then filtered to include non-redundant IsomericSMILES values to include only drugs with a single class label (see Supplemental Data File 1). XLogP values were not available for all queried compounds (see Supplemental Table 1). Violin plots were created using ggplot2 (see Figure 4). For each quantitative feature, a Welch's ANOVA and Games-Howell post hoc test (implemented in R's userfriendlyscience package) was used to compare differences between chemical features between drug-class groups. This test was selected because it does not assume a normal distribution, even variance, or equal sample sizes between groups 41 . Adjusted p-values from Games-Howell post hoc test are reported in Supplemental table  2). Drug-class level distribution and pairwise relations of chemical features were further visualized with Seaborn (see Supplemental Figure 2).

Model Comparison and Evaluation
The primary IMG+CNN and MFP+RF model comparison used default hyper parameters that have worked well previously 33 . Subsequent analyses assessed the sensitivity of the IMG+CNN model to the hyperparameters and training data augmentation. For model validation, five validation sets were constructed by stratified random sampling of 20% of the molecules in each class. The classification accuracy on the five validation sets was used as the primary model comparison metric. Principal component analysis (PCA) was used to compare the Morgan fingerprints and latent molecule representations from the CNN, and the results were visualized using plots of the first two principal components. Class-specific prediction accuracy of the models was compared using confusion matrices. We also compared our prediction accuracy with the accuracy previously reported by Aliper et al. 26 . However, it should be clearly noted that we used different training and validation sets in this paper so the accuracies are not directly comparable.

General Overview
Two structure-based representations were used for training and classification with two different model architectures: (1) IMG+CNN or (2) MFP+RF (Figure 1). Molecules were split into three subtask sets as described previously 26 . Each subtask set contained 3105, 5762, or 6955 molecules for the 3-, 5-and 12-class problems, respectively. Table 2 gives a summary of the validation set accuracy for the models described here in comparison with results from Aliper et al. (note that our training and validation sets were different than that used by Aliper et al.) For all subtasks, MFP+RF performed best, achieving up to 88.6 % accuracy, even on the 12-class prediction task. The IMG+CNN models with default settings were worse, but still significantly outperformed previous results using a multilayer perceptron DNN or support vector machine (SVM) and gene expression inputs.
Additional training strategies were assessed for their impact on the CNN hyperparameter sensitivity on the 3-class subtask ( Table 3). Cyclic learning rates were also tested, and improved neural network training 26 . Batch size was varied from 50-800 pictures per batch using either constant learning rate or cyclic learning rates. Validation set accuracy slightly depends on the batch size and number of epochs using both static learning rates and cyclic learning rates, with batch size of 400 or more generally performing worse than batches of 200 pictures or less. When training over 255 epochs, cyclic learning had no significant effect on model performance. However, with 1023 training epochs, a cyclic learning rate was better than static learning rate (two-sided ttest p-value = 0.035). The best results were achieved using cyclic learning rates with 10 cycles with a batch size of 200. Most interestingly, when the CNN was trained with grayscale images instead of color images, validation accuracy was significantly lower than when using color molecule images (bottom row, Table 3, Color: 0.817 ± 0.005, Greyscale: 0.801 ± 0.005, 2-tailed t-test p-value = 0.001).
A common strategy for improving the performance of CNNs is data augmentation by randomly altering training images 42 . We also tried training with pictures augmented by randomly flipping horizontally or vertically, randomly rotating up to 10 degrees, and randomly zooming by up to 10%. Despite the input data's relative simplicity, data augmentation was effective at improving validation set accuracy by 3.7-6.1%. On a single cross-validation fold (validation set #1), IMG+CNN plus data augmentation achieved 0.866, 0.8226, and 0.779 validation accuracy on the 3-class, 5-class, and 12class problems, respectively (2047 epochs, initial learning rate 0.01 with cosine annealing).

Model and Representation Comparisons
To better understand the differences between model inputs, PCA was used to compare the multidimensional distributions of latent images and MFPs. This analysis revealed that MFP representations capture much greater variance than CNN-processed images. The first 10 principle components (PCs) from PCA of MFPs explain only 20.4% of the variance in the data (Figure 2A), but the first 10 PCs of the explained 54.0% of the variance in the CNN-transformed data ( Figure 2B). This suggests that MFPs allow greater embedding of structural uniqueness than the transformed images.
Given the unequal stratification among examples within classes in the training and validation sets, the per-class performance of both models was compared. Confusion matrices of true class versus predicted class from IMG+CNN for the 3 subtasks reveal differences in per-class validation accuracy. In the 3-class prediction subtask, the prediction performance has similar high accuracy among the three groups, but 17% of cardiovascular drugs are predicted incorrectly as central nervous system drugs ( Figure 3A). In the 5-class prediction subtask, which includes the 3-class drugs plus gastrointestinal and anti-infective drugs, prediction performance is slightly decreased relative to the 3-class performance. However, the model learns anti-infective drugs with high accuracy (89%). Gastrointestinal drugs are predicted poorly with only 57% accuracy (Figure 3B). In the 12-class prediction task, the accuracy for antiinfective molecules remains the highest at 89%, but smaller classes are predicted less accurately ( Figure 3C). The smallest class 'Urological Agent', which contains only 26 molecules, was rarely predicted correctly in the validation set. The difference in class sizes likely contributes to this deficiency, and we did not directly control for this during training. The same analysis of per-class validation set accuracy for results from MFP+RF revealed more consistent accuracies across the classes (Supplemental Figure 1).

Understanding Molecule Properties and Misclassification
A compound's drug-like properties are conventionally believed to be a function of chemical features such as molecular weight, lipophilicity, and number of hydrogen bond donors and acceptors. For instance, Lipinski's rule of five famously states that a druglike compounds should have no more than five hydrogen bonds donors, 10 hydrogen bond acceptors, a logP greater than 5, and molecular weight under 500 3 . These properties are directly or indirectly encoded in the graphical representation of a compound's chemical structure. In the compound dataset explored in this study, antiinfective and respiratory system drugs exhibit different distributions in this chemical feature space (Figure 4). Based on a Welch's ANOVA and Games-Howell post hoc test, these two classes of drugs appear different in their distributions of molecular weight, hydrogen bond donors, hydrogen bond acceptors, but not XLogP (Supplemental Table 2). However, it is important to consider that XLogP values for this dataset (collected via the pubchempy API) are sparse (Supplemental Table 1), are computed 43 rather than measured, and appear to have a different distribution than the other chemical features (Supplemental Figure 2), so comparisons of XLogP values in this dataset may be skewed. Assuming that there is in fact no meaningful difference between the XLogP values for anti-infective and respiratory drugs, perhaps the source of the classifier's errors (Figure 3 and Supplemental Figure 1) is it's overweighting of features related to lipophilicity relative to molecular weight or hydrogen bond donors or acceptors. Conversely, while gastrointestinal drugs and antineoplastic drugs do not show statistically significant differences across this drug-like chemical feature space, the classifier is able to easily resolve them. This phenomenon perhaps suggests that there are important structural features of a drug that the classifier is able to learn that fall beyond the conventional framework of how chemists understand the drug-like properties of compounds.

Misclassification for Mechanism or Drug Repurposing Opportunities
Misclassification can be interpreted in at least two ways: (1) the model hasn't learned enough to accurately predict the true class, or (2) the model has learned something beyond our understanding about the drugs and classes. Although the latter is more interesting, the former is the safer and more likely interpretation. However, cases where the model is wrong might present opportunities for drug repurposing. In addition, in this work we hypothesize those incorrect predictions might inform us about drug mechanisms. For example, among the 6 molecules in the urological drug validation set, the IMG+CNN misclassified Trospium as a central nervous system (CNS) agent. This is not surprising, however, because Trospium is known mechanistically as a muscarinic antagonist 44 , which is a common function of CNS drugs. In fact, the structure of Trospium is remarkably similar to another muscarinic antagonist used to treat Parkinson's disease, Benztropine 45 ( Figure 5).

Conclusion
Here we report two drug classification models that use structure-based representations. Both models greatly exceed a previous benchmark on the same prediction task. Our models use molecular structures directly as inputs. In contrast, the previous work used alterations of the transcriptome as a proxy for the molecule. The results presented in this paper suggest that experimental measurement of a molecule's influence in biological systems may not be needed to accurately predict some types of chemical properties, such as drug classes. Further research is required to determine which types of chemical prediction tasks require information about the biological state changes included by chemicals instead of the chemical properties alone. The richness and stochasticity of biological systems cannot possibly be captured without some direct empirical measure of the system. We believe there must be room to improve effect-based models that would outperform models based solely on molecule structure, especially on more complex prediction tasks.
A related point is that the images and MFPs can be generated directly from the chemical structure for almost any chemical. A major limitation of the Aliper et al. featurization, or any empirically-determined featurization, is that it requires new experiments for each new compound. Therefore, there must be a major return for that extra cost in a virtual screening or chemical prediction setting. Therefore, we propose that future studies on this task or any chemical prediction task that use empiricallydetermined featurizations must also use models that consider only chemical structure features as a baseline.
Although there are several chemistry problems where DNNs outperform other machine learning methods 38,46,47 , here the MFP+RF performed best in all three classification tasks. However, from our results we cannot determine if this is due to the featurization, the type of model, or the model configuration. For example, images may be worse featurizations than MFPs, but CNNs may be harder to train than the RF, especially when the amount of training data is limited. Our results suggest that MFPs are a more dispersed representation than images, suggesting that use of MFPs may help the accuracy of our RF model (Figure 2). A widespread hyperparameter search could be used to assess whether CNNs can outperform RFs when both models are directly optimized for this task. Even though our CNN method was worse than MFP+RF for these hyperparameter settings, our results show that transfer learning and molecule images for CNNs provide a simple and effective method for predicting drug function.
Much can be learned about chemical function from the cases where we find misclassification of chemical structures. We show cases where this can be rationalized by chemical properties of the molecules and cases where these properties that we often use to define the character of a chemical cannot explain the classification performance. In the latter case, this may mean that our models have learned something about chemistry that may not be comprehensible by chemists. Still, the class-specific differences in molecular properties are interesting to compare. Further, when the models misclassify a structure, we can interpret this both as suggestive of a shared drug mechanism, and as an opportunity for drug repurposing. Drug repurposing is an especially important aspect of this work because application of an already-approved drug is much less costly that de novo approval of a new chemical. We followed the setting from Aliper et al. that excluded chemicals with multiple therapeutic uses, but modeling these drugs will be important to more fully understand the model's potential for drug repurposing.    All chemical identifiers for 12 major medical subheadings (MeSH) therapeutic use classifications were downloaded from pubchem. The identifiers were then filtered for only those mapping to a single MeSH class and then converted to SMILES. SMILES representations were converted to either 2-dimensional images or molecular fingerprints. The 2D images were used to train a convolutional neural network (CNN). The fingerprints were used to train a random forest (RF). The models were used separately to predict classes of drugs using stratified 5-fold cross validation.  Each matrix is the classification using the random 20% validation set. Each confusion matrix is a random representative of the five models from the 5-fold cross validation.