Prediction Model of Clearance by a Novel Quantitative Structure–Activity Relationship Approach, Combination DeepSnap-Deep Learning and Conventional Machine Learning

Some targets predicted by machine learning (ML) in drug discovery remain a challenge because of poor prediction. In this study, a new prediction model was developed and rat clearance (CL) was selected as a target because it is difficult to predict. A classification model was constructed using 1545 in-house compounds with rat CL data. The molecular descriptors calculated by Molecular Operating Environment (MOE), alvaDesc, and ADMET Predictor software were used to construct the prediction model. In conventional ML using 100 descriptors and random forest selected by DataRobot, the area under the curve (AUC) and accuracy (ACC) were 0.883 and 0.825, respectively. Conversely, the prediction model using DeepSnap and Deep Learning (DeepSnap-DL) with compound features as images had AUC and ACC of 0.905 and 0.832, respectively. We combined the two models (conventional ML and DeepSnap-DL) to develop a novel prediction model. Using the ensemble model with the mean of the predicted probabilities from each model improved the evaluation metrics (AUC = 0.943 and ACC = 0.874). In addition, a consensus model using the results of the agreement between classifications had an increased ACC (0.959). These combination models with a high level of predictive performance can be applied to rat CL as well as other pharmacokinetic parameters, pharmacological activity, and toxicity prediction. Therefore, these models will aid in the design of more rational compounds for the development of drugs.


■ INTRODUCTION
Quantitative structure−activity relationship (QSAR) analysis is a method to predict the absorption, distribution, metabolism, and excretion (ADME) parameters of small-molecule compounds based on their molecular structure. QSAR is used to predict ADME parameters including solubility, 1 protein binding, 2 permeability, 3 blood-to-plasma concentration ratios, 4 and metabolic stability, 5 as well as in vivo pharmacokinetic (PK) parameters [clearance (CL), volume of distribution (Vd), and half-life]. 6,7 The construction of QSAR models has mostly used molecular descriptors and fingerprints as features of compounds, as well as multiple regression, partial least-squares regression, random forest, support vector machines, neural networks, and XGBoost as algorithms. However, some prediction methods that use conventional ML for ADME parameters have a poor prediction accuracy.
Recently, applying Deep Learning (DL) to the prediction of ADME parameters demonstrated accuracy improvements over conventional ML methods. 8−10 In these reports, molecular descriptors and fingerprints were used for DL as features of the compounds. However, in addition to DL itself, the use of new types of features is expected to improve the prediction accuracy further. Uesawa recently reported a new method called DeepSnap, which uses images of compounds as features for DL. 11 DeepSnap and Deep Learning (DeepSnap-DL) provided better predictions of toxicological targets including mitochondrial membrane potential disruption, constitutive androstane receptor (CAR), and aryl hydrocarbon receptor (AhR) compared with conventional ML. 11−17 However, there have been no reports on the prediction of ADME parameters using DeepSnap-DL. Prediction models for ADME parameters constructed using DeepSnap-DL might therefore achieve good prediction accuracy, similar to that for toxicological targets.
Among ADME parameters, CL is an important PK parameter for drug discovery. CL in animal species, such as rats, is used to understand the relationship between compound exposure in animals and humans regarding PK, toxicity, and drug effects. It is desirable to obtain compounds that have an acceptable PK profile. However, most compounds do not have an acceptable PK profile at the early drug discovery stage. Grime et al. reported the efficient and cost-effective pursuit of candidate compounds with acceptable PK profiles. 18 Pharmaceutical companies typically perform PK experiments in rats, including intravenous and oral administration, to determine whether compounds have an acceptable profile. To reduce the overall drug discovery cost, time, and animal usage, it is ideal to predict the rat PK profile before new chemical synthesis. QSAR can make predictions at early stages of drug development, even for virtual compounds, and can therefore help in the rational design of drug compounds.
Muegge et al. and McIntyre et al. reported QSAR models of rat clearance based on 6000−17,529 expanded in-house compounds. 19,20 They constructed rat CL models using conventional ML. In their models, molecular descriptors and fingerprints were used as features of compounds. The algorithms were the naive Bayesian method (McIntyre et al.) and random forest or support vector machines (Muegge et al., who did not provide specific details of the algorithm). However, these prediction models did not show sufficient prediction performance [accuracy (ACC) of 0.74 and an area under the curve (AUC) of 0.82]. Although the prediction of rat CL is important in drug discovery, it is one of the targets that are difficult to predict by conventional ML. Therefore, in this study, we selected rat CL as a difficult prediction target in drug discovery and developed a new prediction model using a combination of DeepSnap-DL and conventional ML to improve the evaluation metrics.

■ RESULTS
Separation of Compounds into Training and Test Datasets and Their Verification by Chemical Space Analysis. Principal component analysis (PCA) was performed using a dataset of 1545 compounds with 11 representative molecular descriptors to confirm that the compounds had been separated appropriately. It was previously reported that PCA could show the distribution of chemical space in the dataset. 21 Components 1, 2, and 3 explained 62.3, 12.0, and 8.0% of the variance, respectively. Figure 1 shows that the compounds were effectively separated into the training and test datasets.
Construction of CL Prediction Models Using Molecular Descriptors by DataRobot. The CL prediction models were constructed using 4795 molecular descriptors by DataRobot. First, random forest was selected as the algorithm based on the logloss results of internal validation. Then, 100 molecular descriptors were selected from the 4795 molecular descriptors using the permutation importance of random forest (Table S1). Next, these 100 descriptors were used to build a prediction model. Over 40 prediction models were constructed and evaluated. The top three models are shown in Table 1, and all results are shown in Table S2. Among these models, random forest showed the lowest logloss. Based on this result, a final prediction model was constructed by random forest using 100% of the training data. The results of the evaluation metrics for the test datasets are shown in Table 2.
Ensemble Model with Combination DeepSnap-DL and Conventional ML. The average of the predicted probabilities obtained from conventional ML using the molecular descriptors and from DeepSnap-DL was calculated as the new predicted probability (ensemble model). Table 2 shows the results of the evaluation metrics of the test sets using these averaged probabilities. AUC, BAC, ACC, sensitivity, specificity, F-measure, precision, recall, and MCC were 0.943, 0.868, 0.874, 0.835, 0.901, 0.845, 0.855, 0.835, and 0.739, respectively. The evaluation metrics of the ensemble model were better than those of either conventional ML using the molecular descriptors or DeepSnap-DL alone.
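The ensemble step described above amounts to averaging two probability vectors compound by compound. A minimal sketch follows; the function names, the example probabilities, and the 0.5 classification cutoff are illustrative assumptions and are not taken from the study.

```python
def ensemble_probabilities(p_ml, p_dl):
    """Average the predicted probabilities of two models, compound by compound."""
    if len(p_ml) != len(p_dl):
        raise ValueError("both models must score the same compounds")
    return [(a + b) / 2 for a, b in zip(p_ml, p_dl)]


def classify(probabilities, cutoff=0.5):
    """Convert probabilities to 0/1 class labels (the cutoff is an assumption)."""
    return [1 if p >= cutoff else 0 for p in probabilities]


# Hypothetical probabilities for four compounds (not data from the study).
p_descriptor_model = [0.90, 0.40, 0.20, 0.70]
p_deepsnap_dl = [0.70, 0.70, 0.10, 0.20]

p_ensemble = ensemble_probabilities(p_descriptor_model, p_deepsnap_dl)
ensemble_labels = classify(p_ensemble)
```

In this toy case the two models disagree on the second and fourth compounds, and the averaged probability decides the class in each case.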
Consensus Model with Combination DeepSnap-DL and Conventional ML. The confusion matrices for the test set obtained with conventional ML using the molecular descriptors and with DeepSnap-DL are shown in Table 4a,b. Based on these results, a consensus model was constructed using only the compounds for which the two classifications agreed (Table 4c). BAC, ACC, sensitivity, specificity, F-measure, precision, recall, and MCC were 0.958, 0.959, 0.953, 0.963, 0.948, 0.943, 0.953, and 0.915, respectively (Table 2). The number of predictable compounds decreased from 309 to 214. However, the consensus model was highly accurate for all the evaluation metrics.
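The consensus rule can be sketched as a filter that keeps a prediction only where both classifiers agree, which is why accuracy and coverage have to be reported together. The function names and toy labels below are hypothetical; only the counts 214 and 309 come from the text.

```python
def consensus_predictions(labels_ml, labels_dl):
    """Keep a label only where both models agree; None marks 'no call'."""
    return [a if a == b else None for a, b in zip(labels_ml, labels_dl)]


def coverage(consensus):
    """Fraction of compounds that received a consensus call."""
    return sum(c is not None for c in consensus) / len(consensus)


def accuracy_on_called(consensus, truth):
    """Accuracy restricted to the compounds with a consensus call."""
    called = [(c, t) for c, t in zip(consensus, truth) if c is not None]
    if not called:
        raise ValueError("the two models never agreed")
    return sum(c == t for c, t in called) / len(called)


# Hypothetical class labels for six compounds (not data from the study).
labels_ml = [1, 0, 1, 0, 1, 0]
labels_dl = [1, 1, 1, 0, 0, 0]
truth = [1, 0, 1, 0, 1, 1]

consensus = consensus_predictions(labels_ml, labels_dl)
toy_coverage = coverage(consensus)
toy_accuracy = accuracy_on_called(consensus, truth)

# Coverage implied by the counts reported in the text (214 of 309 compounds).
reported_coverage = 214 / 309
```

With the reported counts, the consensus model makes a call for roughly 69% of the test compounds.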

■ DISCUSSION
During drug discovery, prediction models are constructed for various targets such as toxicity and ADME parameters, but the prediction performance of these models is insufficient for some targets. Therefore, a new prediction model that has a high level of predictive performance is desired. In this study, we focused on the prediction of rat CL as an important and difficult prediction target in drug discovery. To improve the prediction performance, we developed a new prediction model with DeepSnap-DL, which uses images for ML.
For rat CL dataset creation, compounds were separated into training and test sets (Table 5 and Figure 1). To ensure unbiased segregation, PCA was conducted using 11 representative molecular descriptors (Table S4), which are generally considered important for synthetic expansion. 21 We also examined the distribution of each training set and test set for the 11 descriptors (Figure S1). As shown in Figure 1 and Figure S1, the separation was well balanced and the cumulative contribution ratio of principal components 1 to 3 was 82.31%. For rat CL prediction, two models have been reported to date, although a direct comparison is difficult because both used different in-house compounds. 19,20 However, the prediction performance of both models was low, with an ACC of 0.74 and an AUC of 0.82. 19,20 These models also used molecular descriptors and fingerprints as features of compounds, as well as random forest (or support vector machines) and naive Bayesian as algorithms. In this study, we used molecular descriptors obtained from three software packages, Molecular Operating Environment (MOE), alvaDesc, and ADMET Predictor, and constructed a model using DataRobot, which allows multiple algorithms to be considered simultaneously as conventional ML. As a result, the model achieved an ACC of 0.825 and an AUC of 0.883 (Table 2). Although it is difficult to make a direct comparison because of the different compounds used, we constructed a prediction model that surpassed previous models by adopting multiple software packages for molecular descriptors and multiple algorithms.

ACS Omega http://pubs.acs.org/journal/acsodf Article
We developed a prediction model using DeepSnap, which uses images as features of compounds and DL as an algorithm. First, we examined the hyperparameters of DeepSnap-DL for predicting rat CL. The results for all hyperparameter combination conditions in this study are shown in Table S3, and the AUC results for internal validation are shown in Table 3. The condition of 145°, learning rate of 0.000001, and maximum epoch of 300 showed the highest value of AUC [DeepSnap (Validation)] (Table 3). We evaluated the prediction performance of the test sets using this condition for the final model of DeepSnap-DL. As shown in Table 2, the ACC was 0.832 and the AUC was 0.905, both higher than those of the molecular descriptor-based method (conventional ML). It was reported that DeepSnap-DL had a higher prediction performance than conventional ML for multiple toxicity targets of progesterone receptor, CAR, and AhR. 15−17 Although DeepSnap-DL had previously been used only for toxicity targets, it also showed high prediction performance for a PK parameter.
In this study, we focused on multiple QSAR models to improve the prediction performance further. Multiple QSAR models [ensemble learning, combinatorial (combi) QSAR, and consensus classification] are used to improve prediction and obtain stable results by combining different features or algorithms. 22−24 Various multiple QSAR models have been reported to date. Combi QSAR is a method for constructing models by combining molecular descriptors from multiple commercial software packages and multiple algorithms (k-nearest neighbor, support vector machine, decision trees, and random forest). 25−28 It was reported that high prediction accuracy and stable results were obtained by using these methods. 25−28 Furthermore, Brownfield et al. proposed a prediction method that combined three class systems by a fusion process as a consensus classification. 29 Kim et al. and Wang et al. showed an improvement in prediction accuracy by using a molecular descriptor and the parameter of transporter as a biological descriptor for the prediction model. 30,31 From these findings, it was expected that the use of multiple prediction models and different types of features might improve the prediction performance. Therefore, we investigated the combination of prediction models using a molecular descriptor-based model and DeepSnap-DL. We developed two multiple QSAR models, the ensemble model and the consensus model. For the ensemble model, the average of the prediction probabilities obtained from the molecular descriptor-based model and DeepSnap-DL was used. As a result, the AUC improved from 0.883−0.905 to 0.943 and the ACC from 0.825−0.832 to 0.874 (Table 2). To confirm that this was not a coincidence, we examined different test partitions (Figure S2), which showed that the average of the prediction probabilities improved the prediction performance (Table S5).
For the consensus model, only the results for which the molecular descriptor-based method and DeepSnap-DL agreed were used. The number of evaluated compounds decreased from 309 to 214 because compounds with conflicting predictions could not be used. However, the accuracy of prediction using the consensus model improved from 0.825−0.832 to 0.959 (Table 2). We examined different test partitions as was done for the ensemble model and found that using the consensus model improved the prediction performance (Figure S2 and Table S5). Although the consensus model showed the highest prediction accuracy, it cannot evaluate all test compounds. In fact, the number of compounds that could be evaluated by the consensus model was reduced to 214−226 (Table 2 and Table S5). In the early stages of drug screening, a large number of compounds need to be evaluated comprehensively without omission. A model that can evaluate all compounds is more suitable for this purpose, so further efforts are needed for practical application. In this study, the ensemble model and the consensus model improved the prediction performance. To the best of our knowledge, this is the first report showing an improvement in prediction performance using images and molecular descriptors of compounds. The likely reason for the improved prediction performance of this combination is that compounds are recognized differently in the image space and the molecular descriptor space. This suggests that the high prediction performance was achieved by using the information in each space.

■ CONCLUSIONS
In this study, we constructed a novel combination model using different types of compound features that had high performance for rat CL prediction. Although this combination model was evaluated only for rat CL, it may also be applicable to other pharmacokinetic parameters, toxicity, and pharmacological activity. This combination QSAR method enables virtual screening from a library of compounds and accelerates drug discovery. Furthermore, this approach is expected to enable the construction of prediction models that outperform previous models for drug discovery as well as other compound-based targets. Therefore, it is expected to be widely applicable in various fields in which prediction performance is an issue.

■ EXPERIMENTAL SECTION
Experimental Data. All procedures for the animal experiments were approved by the Animal Ethics Committee of Japan Tobacco Inc., Central Pharmaceutical Research Institute. The CL of compounds that were synthesized in multiple in-house projects and subjected to rat PK was obtained from the in-house database. All results were obtained after the intravenous administration of compounds to rats at doses of 0.03−10.0 mg/kg. CL was estimated after the intravenous administration by the noncompartmental analysis of individual plasma concentration−time profiles. In this study, the results of the CL of 1545 in-house compounds were used to construct the prediction models. The threshold for the classification of CL values was 1 L/h/kg, which is approximately 30% of the hepatic blood flow rate in rats. 32 This threshold is equivalent to approximately 70% for the bioavailability (BA) of compounds eliminated by the liver alone, assuming that the fraction absorbed (Fa) and fraction intestinal availability (Fg) are 1.
McIntyre et al. developed similar prediction models, but they used 70% of hepatic blood flow as their threshold. 20 This is equivalent to 30% BA when FaFg = 1. However, it is difficult to obtain a compound with FaFg = 1 early in drug discovery, and the BA is therefore likely to fall below 30%. Therefore, in this study, the threshold was set at 1 L/h/kg, which is equivalent to 70% BA.
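The correspondence between the two CL thresholds and BA can be checked with the usual hepatic extraction relationship, assuming elimination by the liver alone and that CL is compared directly with hepatic blood flow. The symbols E_h (hepatic extraction ratio), Q_h (hepatic blood flow), and F (bioavailability) are notation introduced here for illustration, and the last line uses a hypothetical FaFg of 0.5.

```latex
\begin{align*}
E_h &= \frac{CL}{Q_h}, \qquad F = F_a \, F_g \, (1 - E_h) \\
CL = 0.3\,Q_h,\; F_a F_g = 1 \;&\Rightarrow\; F = 1 - 0.3 = 0.7 \\
CL = 0.7\,Q_h,\; F_a F_g = 1 \;&\Rightarrow\; F = 1 - 0.7 = 0.3 \\
CL = 0.7\,Q_h,\; F_a F_g = 0.5 \;&\Rightarrow\; F = 0.5 \times 0.3 = 0.15
\end{align*}
```

The last line illustrates why a threshold at 70% of hepatic blood flow leaves little margin once absorption or intestinal availability is incomplete.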
Calculation of Molecular Descriptors. Water molecules and counterions were removed from the structural data of the compounds by salt stripping. Subsequently, the 3D structure of each compound was optimized using "Rebuild 3D" and the force field calculations (Amber10:EHT) were conducted in MOE version 2019.0102 (MOLSIS Inc., Tokyo, Japan). Structural descriptors were calculated using MOE, alvaDesc (1.0.16) (Alvascience srl, Lecco, Italy), and ADMET Predictor (9.5.0.16) (Simulations Plus, New York, NY, USA). At the time of descriptor generation, descriptors of string type were removed in ADMET Predictor and descriptors with a variance of 0 were removed in alvaDesc. Overall, 4795 descriptors were selected for further analysis.
Separation of Compounds into Training and Test Sets and Their Verification by Chemical Space Analysis. Using stratified random sampling, the compounds in the dataset were separated into a training set and a test set at a ratio of 4:1 (Table 5). To investigate the chemical space, PCA was performed on 11 molecular parameters, as reported previously, using JMP Pro software 14.3.0 (SAS Institute Inc., Cary, NC, USA). 21 The parameters included molecular weight, SlogP (log octanol/water partition coefficient), topological polar surface area (TPSA), h_logD [octanol/water distribution coefficient (pH = 7)], h_pKa [acidity (pH = 7)], h_pKb [basicity (pH = 7)], a_acc (number of H-bond acceptor atoms), a_don (number of H-bond donor atoms), a_aro (number of aromatic atoms), b_ar (number of aromatic bonds), and b_rotN (number of rotatable bonds). Principal components 1 to 3 were calculated.
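A stratified 4:1 split can be sketched by sampling each class separately so that the training and test sets keep approximately the same class ratio. The function name, the random seed, and the toy labels are assumptions for illustration; the software used for the split in the study is not specified in this section.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), drawing the test set class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical binary CL classes for 100 compounds (not data from the study).
labels = [1] * 40 + [0] * 60
train_idx, test_idx = stratified_split(labels)
```

For the 1545 compounds in the study, a 4:1 ratio corresponds to 1236 training and 309 test compounds, which matches the test set size quoted in the text.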
Construction of Rat CL Models Based on Molecular Descriptors. Model construction based on 4795 molecular descriptors was performed using DataRobot (SaaS, DataRobot, Tokyo, Japan). All analyses were conducted from 29 July 2020 to 31 July 2020. DataRobot automatically performs a modeling competition in which a wide selection of algorithms and data preprocessing techniques compete with one another, as reported previously. 33,34 Prior to training, 20% of the training dataset was randomly selected as the holdout and excluded from training. Five-fold cross-validation was implemented and the partitions were determined with stratified sampling. After the selection of models based on the logloss scores of internal validation, the 4795 molecular descriptors were reduced to 100 using the permutation importance. Using these 100 selected molecular descriptors, over 40 models were created and random forest was selected as the final algorithm based on the validation results. Following the logloss scores of internal validation, the best model was constructed using 100% of the training data. This final model was used to calculate the prediction accuracy of the test sets (Figure 2).
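Model selection by logloss can be illustrated with a short calculation. This is a generic sketch of binary logloss and a lowest-score selection, not DataRobot's implementation; the model names and validation values are hypothetical.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary logloss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def select_best_model(y_true, candidates):
    """Score each candidate's probabilities and return the lowest-logloss name."""
    scores = {name: log_loss(y_true, p) for name, p in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical validation labels and predicted probabilities.
y_valid = [1, 0, 1, 0]
candidates = {
    "model_a": [0.9, 0.1, 0.8, 0.3],
    "model_b": [0.6, 0.4, 0.6, 0.4],
}
best_name, scores = select_best_model(y_valid, candidates)
```

Logloss rewards confident correct probabilities and penalizes confident errors, so it can separate models that would tie on accuracy at a fixed cutoff.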
Deep Learning. All the two-dimensional (2D) PNG images produced by DeepSnap were resized by utilizing NVIDIA DL GPU Training System (DIGITS) version 6.0.0 software (NVIDIA, Santa Clara, CA, USA) on four-GPU systems, Tesla-V100 (32 GB), with a resolution of 256 × 256 pixels as input data, as previously reported. 11−17 To rapidly train and fine-tune the highly accurate Convolutional Neural Network (CNN) using the input DeepSnap (Training) and DeepSnap (Validation) datasets based on the image classification and by building the pretrained prediction model, we used a pretrained
Figure 2. Flowchart of the modeling process for rat CL prediction. For modeling, the 80% training dataset and 20% test dataset were set. The 80% dataset was used to construct prediction models using the molecular descriptor-based method by DataRobot and DeepSnap-DL. Ensemble and consensus models were constructed using the molecular descriptor-based method and DeepSnap-DL. The evaluation metrics of each prediction model were calculated using test sets.
Combination DeepSnap-DL and Conventional ML. In this study, we investigated the combination of DeepSnap-DL and conventional ML using two methods. The first method was the average of the prediction probabilities. The prediction probabilities obtained by DeepSnap-DL and conventional ML were averaged, and this average value was used as the prediction probability of the new prediction model (ensemble model) (Figure 2). The second method used a prediction model that adopted only the results on which DeepSnap-DL and conventional ML agreed (consensus model) (Figure 2).
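Evaluation of the Predictive Model. The performance of each model in predicting rat CL was evaluated in terms of the following metrics: AUC, BAC, ACC, sensitivity, specificity, F-measure, precision, recall, and MCC calculated using KNIME (4.1.4) (KNIME, Konstanz, Germany). These performance metrics were defined as follows:
sensitivity = recall = TP/(TP + FN)
specificity = TN/(TN + FP)
BAC = (sensitivity + specificity)/2
ACC = (TP + TN)/(TP + FP + TN + FN)
precision = TP/(TP + FP)
F-measure = 2 × precision × recall/(precision + recall)
MCC = (TP × TN − FP × FN)/√[(TP + FP)(TP + FN)(TN + FP)(TN + FN)]
where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively.
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsomega.1c03689.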
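Apart from AUC, the metrics named in the Evaluation of the Predictive Model paragraph can all be computed from a single confusion matrix. The sketch below assumes the standard confusion-matrix definitions; the function name and the counts are hypothetical and do not reproduce Table 4.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute confusion-matrix metrics from true/false positive/negative counts."""
    sensitivity = tp / (tp + fn)  # identical to recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    bac = (sensitivity + specificity) / 2
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "BAC": bac,
        "ACC": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "F-measure": f_measure,
        "precision": precision,
        "recall": sensitivity,
        "MCC": mcc,
    }


# Hypothetical confusion-matrix counts (not taken from the study).
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Because recall and sensitivity share one definition, they take the same value, which is consistent with the identical sensitivity and recall values reported for each model in the text.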