Revisiting the Application of Machine Learning Approaches in Predicting Aqueous Solubility

The solubility of chemical substances in water is a critical parameter in pharmaceutical development, environmental chemistry, agrochemistry, and other fields; however, predicting it accurately remains a challenge. This study evaluates and compares the effectiveness of several of the most popular machine learning (ML) modeling methods and molecular featurization techniques for predicting aqueous solubility. Although these methods were not implemented in a competitive setting, several of them surpassed previous benchmarks, offering incremental but significant improvements. Our results show that methods based on graph convolution and graph attention mechanisms achieved exceptional predictive performance on high-quality data sets, albeit with sensitivity to noise and errors in the data. In contrast, models leveraging molecular descriptors not only offered better interpretability but also proved more resilient to noise and errors inherent in the data. Our analysis of over 4000 molecular descriptors used across the various models identified that approximately 800 of these descriptors contribute significantly to solubility prediction. These insights offer guidance and direction for future developments in solubility prediction.

Table S3.1: Summary of results, reported as mean ± SD over fifty runs (ten runs for CV), in terms of RMSE, R², and QCK, with the best results highlighted in bold. Note: *, **, and *** indicate increasing levels of statistical significance, with * being p < 0.05, ** p < 0.01, and *** p < 0.001; a The probability value p measures the strength of evidence against the null hypothesis H0 (there are no significant differences in the variances) in the hypothesis test.
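The RMSE and R² summarized in Table S3.1 can be computed per run and then aggregated as mean ± SD. A minimal sketch with NumPy (the per-run RMSE values shown are hypothetical illustrations, not results from this study):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between experimental and predicted log S."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination R^2."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Aggregate repeated runs as mean +/- SD (hypothetical per-run RMSE values)
run_rmses = [0.61, 0.58, 0.63, 0.60, 0.59]
print(f"RMSE = {np.mean(run_rmses):.3f} +/- {np.std(run_rmses, ddof=1):.3f}")
```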
Table S3.2: Based on the prediction results detailed in Supplemental Information 2 (SI2), we conducted the Brown-Forsythe ANOVA test to examine whether there were significant differences in the predictive performance of the various ML methods across the different test sets.
The results showed that, on every dataset, at least one ML method differed significantly (***) from the others. Consequently, post hoc tests are required to identify the specific groups between which these differences are significant.
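As a sketch of this testing procedure: SciPy's `levene` with `center='median'` implements the Brown-Forsythe test. The per-method values below are synthetic stand-ins for the SI2 prediction results, not data from this study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic per-run RMSE values for three hypothetical ML methods
# (stand-ins for the SI2 prediction results, fifty runs each)
method_a = rng.normal(0.60, 0.05, size=50)
method_b = rng.normal(0.62, 0.05, size=50)
method_c = rng.normal(0.75, 0.15, size=50)

# Brown-Forsythe test = Levene's test with group medians as centers
stat, p = stats.levene(method_a, method_b, method_c, center="median")
print(f"W = {stat:.3f}, p = {p:.4g}")
if p < 0.05:
    print("Reject H0 -> follow up with post hoc pairwise comparisons")
```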

Figure S3.1: Predictions from XGBoost and 1DCNN are plotted against the literature log S values for compounds from the test sets 19SC1, 19SC2, and 08SC. The black diagonal line represents perfect agreement between predicted and experimental solubility values, while the blue line indicates the best linear regression fit to these predictions.
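The blue best-fit line in these parity plots can be obtained by ordinary least squares on predicted versus experimental log S. A minimal sketch with hypothetical values (not data from this study):

```python
import numpy as np

# Hypothetical experimental vs predicted log S values for a small test set
log_s_exp = np.array([-1.2, -2.5, -3.8, -0.7, -4.9, -2.1])
log_s_pred = np.array([-1.0, -2.7, -3.5, -1.1, -4.4, -2.3])

# Least-squares fit for the parity plot (the "blue line" in the figures);
# perfect agreement (the black diagonal) corresponds to slope 1, intercept 0
slope, intercept = np.polyfit(log_s_exp, log_s_pred, deg=1)
print(f"fit: y = {slope:.3f} x + {intercept:+.3f}")
```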

Figure S3.2: Predictions from LightGBM and GCN are plotted against the literature log S values for compounds from the test sets 19SC1, 19SC2, and 08SC. The black diagonal line represents perfect agreement between predicted and experimental solubility values, while the blue line indicates the best linear regression fit to these predictions.

Figure S3.3: Predictions from GAT and GATv2 are plotted against the literature log S values for compounds from the test sets 19SC1, 19SC2, and 08SC. The black diagonal line represents perfect agreement between predicted and experimental solubility values, while the blue line indicates the best linear regression fit to these predictions.

Figure S3.4: Predictions from AttentiveFP and MPNN are plotted against the literature log S values for compounds from the test sets 19SC1, 19SC2, and 08SC. The black diagonal line represents perfect agreement between predicted and experimental solubility values, while the blue line indicates the best linear regression fit to these predictions.

Figure S3.5: The distribution of RMSE and R² values for the prediction results on the 19SC1 test set, obtained from fifty training iterations using different ML modeling methods. The curves represent the kernel density estimates for these distributions.
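The kernel density curves overlaid on these distributions can be reproduced with a Gaussian KDE. A minimal sketch using synthetic RMSE values in place of the fifty actual runs:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Synthetic RMSE values standing in for fifty training iterations of one method
rmse_runs = rng.normal(0.62, 0.04, size=50)

# Gaussian kernel density estimate of the RMSE distribution
kde = gaussian_kde(rmse_runs)
grid = np.linspace(rmse_runs.min() - 0.05, rmse_runs.max() + 0.05, 200)
density = kde(grid)
print(f"density peaks near RMSE = {grid[np.argmax(density)]:.3f}")
```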

Figure S3.6: The distribution of RMSE and R² values for the prediction results on the 19SC2 test set, obtained from fifty training iterations using different ML modeling methods. The curves represent the kernel density estimates for these distributions.

Figure S3.7: The distribution of RMSE and R² values for the prediction results on the 08SC test set, obtained from fifty training iterations using different ML modeling methods. The curves represent the kernel density estimates for these distributions.
Note: *, **, and *** indicate increasing levels of statistical significance, with * being p < 0.05, ** p < 0.01, and *** p < 0.001; a The probability value p measures the strength of evidence against the null hypothesis H0 (there are no significant differences in the prediction capability across these ML methods) in the hypothesis test.