Random Forest Models To Predict Aqueous Solubility

David S. Palmer, Noel M. O'Boyle, Robert C. Glen, and John B. O. Mitchell*
Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
J. Chem. Inf. Model., 2007, 47 (1), pp 150–158
DOI: 10.1021/ci060164k
Publication Date (Web): December 2, 2006
Copyright © 2007 American Chemical Society

Current address: Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, U.K.

* Corresponding author phone: +44-1223-762983; fax: +44-1223-763076; e-mail: jbom1@cam.ac.uk.

Abstract

Random Forest regression (RF), Partial-Least-Squares (PLS) regression, Support Vector Machines (SVM), and Artificial Neural Networks (ANN) were used to develop QSPR models for the prediction of aqueous solubility, based on experimental data for 988 organic molecules. The Random Forest regression model predicted aqueous solubility more accurately than the models created by PLS, SVM, and ANN. It also offered methods for automatic descriptor selection, an assessment of descriptor importance, and an in-parallel measure of predictive ability, all of which serve to recommend its use. The prediction of log molar solubility for an external test set of 330 molecules that are solid at 25 °C gave r² = 0.89 and RMSE = 0.69 log S units. For a standard data set selected from the literature, the model performed well with respect to other documented methods. Finally, the diversity of the training and test sets is compared to the chemical space occupied by molecules in the MDL Drug Data Report, on the basis of molecular descriptors selected by the regression analysis.
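
The two figures of merit quoted in the abstract, r² and RMSE, can be illustrated with a minimal Python sketch. It assumes that r² denotes the squared Pearson correlation between observed and predicted log S, which the abstract does not state explicitly. The function names and the four log S values are hypothetical, invented purely for illustration, and are not data from the paper.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pearson_r_squared(observed, predicted):
    """Squared Pearson correlation coefficient (assumed definition of r^2)."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov * cov / (var_o * var_p)


# Hypothetical log S values, invented for illustration (not from the paper).
observed_log_s = [-1.0, -2.0, -3.0, -4.0]
predicted_log_s = [-1.5, -2.0, -2.5, -4.0]

print(f"RMSE = {rmse(observed_log_s, predicted_log_s):.2f} log S units")
print(f"r^2  = {pearson_r_squared(observed_log_s, predicted_log_s):.2f}")
```

Because RMSE is expressed in log S units, an RMSE of 0.69 corresponds to a typical error factor of roughly 10^0.69, or about 5, in molar solubility.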

History

  • Published In Issue January 22, 2007
  • Received May 5, 2006
