R-NN Curves: An Intuitive Approach to Outlier Detection Using a Distance Based Method

Rajarshi Guha,§ Debojyoti Dutta,§ Peter C. Jurs,* and Ting Chen
Department of Chemistry, Pennsylvania State University, University Park, Pennsylvania 16802, and Department of Computational Biology, University of Southern California, Los Angeles, California 90089
J. Chem. Inf. Model., 2006, 46 (4), pp 1713–1722
DOI: 10.1021/ci060013h
Publication Date (Web): June 1, 2006
Copyright © 2006 American Chemical Society

Pennsylvania State University.

§ These authors contributed equally to this paper.

University of Southern California.

* Corresponding author e-mail: pcj@psu.edu.

Abstract

Libraries of chemical structures are used in a variety of cheminformatics tasks, such as virtual screening and QSAR modeling, and are generally characterized using molecular descriptors. When working with libraries, it is useful to understand the distribution of compounds in the space defined by a set of descriptors. We present a simple approach to analyzing the spatial distribution of the compounds in a library in general, and to outlier detection in particular, based on counts of neighbors within a series of increasing radii. The resultant curves, termed R-NN curves, appear to follow a logistic model for any given descriptor space, which we justify theoretically for the 2D case. The method can be applied to data sets of arbitrary dimensions. The R-NN curves provide a visual method for easily detecting compounds that lie in a sparse region of a given descriptor space. We also present a method to numerically characterize the R-NN curves, thus allowing identification of outliers in a single plot.
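The neighbor-counting idea in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the function name `rnn_curves`, the choice of Euclidean distance, the grid of radii (equal steps up to the maximum pairwise distance), and the `half_radius` summary are all assumptions made for illustration, and the synthetic data stand in for a real descriptor matrix.

```python
import numpy as np

def rnn_curves(X, n_radii=100):
    """Count, for every point, its neighbors within a series of increasing radii.

    Returns (radii, counts), where counts[i, j] is the number of other points
    within distance radii[j] of point i. Assumed conventions: Euclidean
    distance, radii evenly spaced up to the largest pairwise distance.
    """
    X = np.asarray(X, dtype=float)
    # Full pairwise distance matrix (n x n); fine for small illustrative sets.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    d_max = D.max()
    radii = np.linspace(d_max / n_radii, d_max, n_radii)
    # Compare every distance with every radius, then subtract 1 so that a
    # point is not counted as its own neighbor.
    counts = (D[:, :, None] <= radii[None, None, :]).sum(axis=1) - 1
    return radii, counts

# Synthetic 2D "descriptor space": a dense cluster plus one distant point.
rng = np.random.default_rng(0)
cluster = rng.normal(size=(50, 2))
X = np.vstack([cluster, [[20.0, 20.0]]])

radii, counts = rnn_curves(X)

# Hypothetical one-number summary of each curve (not the paper's measure):
# the smallest radius at which a point has at least half of the other
# points as neighbors. Points in sparse regions need a large radius.
threshold = (X.shape[0] - 1) / 2
half_radius = radii[np.argmax(counts >= threshold, axis=1)]
suspect = int(np.argmax(half_radius))
```

Plotting `counts[i]` against `radii` for each point gives one curve per compound. In this sketch the curve for the isolated point stays at zero until the radius reaches the cluster, whereas the curves for cluster members rise early. The broadcasted comparison uses memory proportional to n × n × `n_radii`, so a larger library would need a chunked or tree-based neighbor count instead.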

History

  • Published In Issue July 24, 2006
  • Received January 10, 2006
