Counting Clusters Using R-NN Curves

Rajarshi Guha,* Debojyoti Dutta, David J. Wild, and Ting Chen
School of Informatics, Indiana University, Bloomington, Indiana 47406, and Department of Computational Biology, University of Southern California, Los Angeles, California 90089
J. Chem. Inf. Model., 2007, 47 (4), pp 1308–1318
DOI: 10.1021/ci600541f
Publication Date (Web): June 30, 2007
Copyright © 2007 American Chemical Society
* Corresponding author. E-mail: rguha@indiana.edu.

Abstract

Clustering is a common task in the field of cheminformatics. A key parameter that needs to be set for nonhierarchical clustering methods, such as k-means, is the number of clusters, k. Traditionally, the value of k is obtained by performing the clustering with different values of k and selecting the value that leads to the optimal clustering. In this study, we describe an approach to selecting k, a priori, based on the R-NN curve algorithm described by Guha et al. (J. Chem. Inf. Model., 2006, 46, 1713−1722), which uses a nearest-neighbor technique to characterize the spatial location of compounds in arbitrary descriptor spaces. The algorithm generates a set of curves for the data set, which are then analyzed to estimate the natural number of clusters. We then performed k-means clustering with the predicted value of k, as well as with nearby values, to check that the correct number of clusters was obtained. In addition, we compared the predicted value to the number indicated by the average silhouette width, used as a cluster quality measure. We tested the algorithm on simulated data as well as on two chemical data sets. Our results indicate that the R-NN curve algorithm is able to determine the natural number of clusters and is in general agreement with the average silhouette width in identifying the optimal number of clusters.
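The traditional procedure mentioned in the abstract (cluster with several values of k, then keep the k with the best quality score) can be illustrated with a minimal sketch. This is not the R-NN curve method and not the authors' code. The toy data, the function names `kmeans` and `average_silhouette_width`, the deterministic farthest-first initialization, and the candidate range of k are all assumptions made for illustration. The silhouette width follows the usual definition, s = (b − a) / max(a, b), with singleton clusters scored as 0.

```python
import math

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with farthest-first initialization (an assumption
    chosen so that the example is deterministic). Returns one label per point."""
    centers = [points[0]]
    while len(centers) < k:
        # Add the point that is farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def average_silhouette_width(points, labels):
    """Mean silhouette over all points; assumes at least two non-empty
    clusters and that the points are not all identical."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical toy data: three well-separated groups of four 2-D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (0, 10), (0, 11), (1, 10), (1, 11)]

# Cluster with each candidate k and keep the k with the largest average width.
scores = {k: average_silhouette_width(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k)
```

Because every candidate k requires a full clustering run, the cost of this search grows with the number of candidates, which is the motivation for estimating k a priori.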

History

  • Published In Issue July 23, 2007
  • Received November 28, 2006
