Combinatorial chemistry is redefining the way new pharmaceuticals and agrochemicals are discovered and developed (1). In the 1970s, the advent of automated peptide synthesis for molecular biological analysis laid the groundwork for creating the corresponding exhaustive combinatorial libraries of, for example, hexapeptides (2). The technology has evolved through several demonstrations in principle to its current state (3), where the term "combinatorial chemistry" embraces a wide range of often contrasting strategies, including parallel synthesis and shuffle-and-split, one-per-well and mixture synthesis, and array synthesis (4).
We use combinatorial chemistry in a broad sense to mean any chemistry in which sets of related reactants are combined to create a library of structurally related products with well-defined relationships to each other. All possible combinations may be realized in practice; but, in many cases, the bulk of a library is virtual.
Published work still centers on solid-support synthesis, although considerable momentum is building for chemistry in solution (5). As with many new technologies, combinatorial chemistry has created challenges in collateral fields that have become apparent only as the discipline has evolved. We discuss a few of the chemical information and computational analysis challenges that have arisen as combinatorial chemistry has developed within the pharmaceutical and agrochemical research communities: how to store and manipulate combinatorial libraries (definition), how to make intelligent selections of subsets from combinatorial libraries (dissection), and how to evaluate the performance of the metrics used in making such selections (validation).
Click here to see Figure 1Structures 1 and 2 in Figure 1 are generic, but they are not combinatorial; there is a subtle but important distinction between the two. Such generic structures are open-ended definitions; each represents a generalized class of compounds (indeed, they are used as queries in 2- and 3-D searching). The inductive literature implication is that individual examples that can be represented by that structure will probably behave similarly. In a combinatorial context, a closed, summary definition is used deductively: every combination of specified variations is implied. A (small) combinatorial library and the reaction that produces it from reagent sets 3 and 4 are given in Figure 2. Note that the combinatorial SLN (CSLN) 5 involves Y macro-atoms, each of which corresponds to a list of specific variations; compare these with the nonspecific X groups in the generic (query) SLNs for 1 and 2 (Figure 1). Periods serve as place holders between unconnected fragments that, because they all come from the same reagent, constitute a single variation (e.g., CH3.H.OH). Note, too, that the line notation can incorporate valence attributes (v = 1, 5, 6, etc.) for each variation along with other associated data such as name or reagent price.
Click here to see Figure 2Pattern recognition is intimately involved in the exquisite usefulness of 2-D structures to human beings, but line notations (or connectivity matrix representations) are better suited to database registration, automated substructure searching, and other computational manipulations. Several combinatorial definition systems are now available in the chemical information management arena, each with its own builder interface. Libraries can be encapsulated in CSLN format using Legion (Tripos Inc.).
It is also possible to encode libraries using other combinatorial line notations derived from SMILES, such as CHORTLES and CHUCKLES. Project Library (MDL Information Systems Inc., San Leandro, CA), on the other hand, produces a set of interlinked substructure files in SD format (6). CSLN is the most general of the available combinatorial formats, in that multivalent/multicenter macro atoms (e.g., Y_03 in 5) can be accommodated as well as simpler cases involving independent monovalent variations such as X1 and X5 in 2.
Combinatorial formulas are very powerful tools for conceptualization and communication, but their usefulness is dependent on the ability of the human mind to generalize from the particular to the abstract. Computers lack this ability, so dealing directly with combinatorial libraries has been difficult for database managers. Instead, many of the data management systems available for working with combinatorial libraries require complete enumeration of each example for substructure searching. Working only with fully enumerated combinatorial libraries is analogous to perusing every title in a public library to find a desired book.
Full enumeration is inherently inefficient because most of the structural information in such libraries is, by definition, redundant: The computational and memory overhead costs involved in storing all of the individual structures can be daunting, especially when virtual libraries are involved. Even in fairly simple cases, the numbers can be enormous. There are, for example, 534 commercially available diamines and 15,627 available acid equivalents with which to react them. The corresponding combinatorial library encompasses 13 X 109 compounds. In such a case, it makes sense to enumerate only when necessary. At present, registration and searching of combinatorial structures per se are only possible for CSLNs in UNITY databases (Tripos Inc.); some similar capability will be available in Central Library (MDL Information Systems Inc.). Related functionalities are reportedly being developed elsewhere.
In some combinatorial builders, libraries are defined by entering lists of substituents; in others, substituent fragments are extracted from reactant lists based on defined cores common to all reactants. Both modes are available in Legion.
Partitioning. One attractive way to make selections involves partitioning the library into a uniform array of "cells" on the basis of two or more descriptor coordinates and then taking one representative compound from each cell (7). To do this, characteristic descriptors are derived from molecular connectivity matrices for each compound in a data set and the range of values found for each descriptor is divided into equal-sized bins. The selection sample is built up by taking one compound from each cell in the resulting multidimensional lattice.
This is illustrated for our centipede in Figure 3A, where the superimposed lattice corresponds to two axes x and y (principal components derived from the descriptors a, b, and c). The vertex (compound) that falls closest to the center of each cell has been selected and highlighted. This example also illustrates some of the problems that can be encountered in partitioning: Many of the cell blocks are empty or nearly so, and it is clear that the center of each cell is not necessarily representative.
Click here to see Figure 3Partitioning is practical and fast, even for large data sets; however, to be effective, it requires a fairly uniform distribution of compounds across a low-dimensionality descriptor space (at high dimensionalities, the distribution gets too sparse and too many cells are empty) that is more or less continuous. But the structural space defined by a combinatorial library is inherently discontinuous and of high dimensionality. This problem can be minimized by careful construction of characteristic pairs of descriptors, as epitomized in the DiverseSolutions software created by R. Pearlman at the University of Texas-Austin (available from Tripos Inc.). Where the mapping between the biochemical response of interest and the structural space being explored can successfully be reduced to a few such descriptors, this approach is powerful and effective.
Dissimilarity-based selection. In addition to the limitations cited above, partitioning is only feasible in descriptor spaces in which distance and coordinates are well defined. Somewhat surprisingly, this is not true for all descriptor spaces of interest; the edge metric shown in the box figure represents such a metric. In particular, it is not true for 2-D and 3-D fingerprints--long-bit vectors in which each element is 1 or 0, depending on whether a particular substructure is present in the corresponding compound.
Such fingerprints are widely used in database searching applications and lend themselves to rapid, large-scale manipulations. Three-dimensional fingerprints based on the relative spatial distribution of pharmacophoric groups--hydrogen bond donors, hydrogen bond acceptors, aromatic rings, and so on--are particularly useful for identifying new lead chemistries (7). Atom pair (AP) fingerprints are related descriptors in which each bit corresponds to the presence or absence of a pair of atom types separated by a particular number of bonds (8). As the name implies, fingerprints are designed to be distinctive and characteristic of each molecule. Their usefulness for evaluating molecular similarity makes it seem intuitively obvious that they should be useful for assessing molecular diversity as well.
Fingerprints are usually a thousand bits or more long, giving a high dimensionality that is unsuitable for partitioning. Moreover, the similarity between two fingerprints a and b is most usefully described by the Tanimoto coefficient T(a,b), which is the ratio of shared substructures (i.e., the number of bits set to 1 in both fingerprints) to the number of substructures that appear in at least one of the fingerprints (i.e., the number of bits set to 1 in either) (9); the dissimilarity between two fingerprints a and b is generally taken as 1-T(a,b). But T(a,b) is only defined between pairs of fingerprints, so there is no corresponding coordinate system, and it lacks the mathematical properties of a distance that are a prerequisite for partitioning.
Pairwise Tanimoto (dis)similarities can be used to build up diverse samples from fingerprint data sets by iteratively testing unselected candidate elements from the target data set against those that have already been selected. One method entails a minimum dissimilarity criterion: If a randomly selected candidate exceeds some preset, minimum acceptable pairwise dissimilarity with respect to all the compounds that have already been selected, it is added to the list of selected compounds. This algorithm is used in PDTriplets (Tripos Inc.) and ChemDiverse (Chemical Design, Chipping Norton, U.K.) software for working with pharmacophore fingerprints; it is available as a selection option in DiverseSolutions (9). The algorithm is fast, and a good representative subset is generally selected when a relatively large subset is being chosen.
When a smaller fraction of a large data set (realized or virtual) is being selected, finding a diverse subset is often more important than is selecting a representative one. The smaller the subset, the more information each selection must carry and the more important that each selected compound be distinctive with respect to the others. A good algorithm for selecting diverse subsets uses a maximal dissimilarity criterion (10): At each iteration, the unselected candidate that is most dissimilar to those already selected is added to the selection list (down to some minimum acceptable dissimilarity). As pointed out by Holliday and Willett (11), dissimilarity to the set of those already selected can be defined in several ways. In Selector's dbdiss executable (Tripos Inc.), the dissimilarity to a set of fingerprints is taken as equal to the smallest dissimilarity with respect to any fingerprint in the set, but the algorithm can be generalized to other definitions as well (11). As the name suggests, maximal dissimilarity is very good for creating relatively small, diverse subsets.
In Figure 3B, we applied dissimilarity selection to the edge metric in our schematic centipede (although the Euclidean abc space gives nearly identical results in this case). Our example illustrates a potential weakness for this kind of subset selection. In some cases, maximal dissimilarity selection will produce excessively diverse subsets, which leads to preferential selection of peripheral, extreme examples; in this case, only "feet" get selected. Such outliers often behave differently than do more mainstream compounds. Bias toward outliers is also a problem for D-optimal design approaches when a linear or quadratic response surface is (somewhat arbitrarily) assumed (12).
Cluster-based selection. Partitioning and dissimilarity-based selection methods are isometric in that they make strong assumptions about what descriptor similarity means in terms of biochemical similarity across the full range of variation in an entire library. One alternative is to use more localized measures of similarity to build up clusters of similar compounds and then select one example from each cluster. When the clusters are distinctive enough, such approaches produce subsets that are representative and diverse.
Several approaches are available for creating such clusters in structural space, two of which are currently in use. In Jarvis-Patrick (JP) clustering, the criterion for including two compounds in the same cluster is simply that they have at least some minimum number of neighbors in common (13). The algorithm used in JP clustering can be very fast, particularly when selection from subsets is not allowed; it is available in Selector (Tripos Inc.) and in MERLIN (Daylight Chemical Information Systems Inc., Irvine, CA).
JP clustering is only practical, however, for selecting relatively dense subsets (5% or more of the total data set). Even then it is difficult to obtain a specified number of clusters. Moreover, it tends to produce a few large, distended clusters plus many singletons (14, 15). This is illustrated for our schematic library in Figure 3C. Here each leg or pair of legs becomes an isolated set. When only a few examples are called for, this means that many clusters go unrepresented or--looked at another way--have been arbitrarily pooled for sampling.
A more rigorous approach is hierarchical clustering (16). In this technique, all compounds start as singleton clusters. In each iteration, the two most similar clusters are consolidated, and the pairwise similarities between the new cluster and other clusters are evaluated. The result is more balanced and seems more natural than JP clustering (Figure 3D). Like JP clustering, hierarchical clustering will work with any descriptor for which pairwise similarity can be evaluated, so it readily accommodates Tanimoto coefficients and fingerprints. We used the edge metric in Figure 3, C and D, for example.
As for maximal dissimilarity selection, there are several ways to evaluate the similarity between sets, each of which produces somewhat different results. Ward's method, which has its roots in statistical theory, has been used with considerable success in this regard (15). Complete linkage, in which the similarity between sets A and B is equal to the minimum pairwise similarity between any element of A and any element of B, is also attractive because it guarantees a common, maximal internal dissimilarity (diameter) for all clusters (17). Complete linkage is the default hierarchical clustering method used for clustering in Selector.
The characteristic qualitative differences between JP and hierarchical clustering are illustrated by the statistics given in Table 1. Sixty-eight compounds from a data set compiled for validating ClogP algorithms (18) were clustered using both techniques. The JP algorithm assigned most of the data set to a single large cluster consisting of 46 compounds and produced 3 singletons; the (geometric) mean and median cluster sizes, which are representative of the general level of discrimination, were only 1.81 and 2.5, respectively. Complete-linkage hierarchical clustering, on the other hand, gave a more balanced distribution across the largest clusters and no singletons. The increased level of discrimination was also reflected in the mean and median cluster sizes, which were 3.6 and 6, respectively.
Hierarchical clustering can be used with any descriptor or set of descriptors as long as the set can be consolidated into a single meaningful pairwise measure of similarity and is not sensitive to dimensionality. Unfortunately, the so-called classical algorithm is prohibitively memory- and CPU-intensive for application to large (>2000 compound) data sets. A much faster and more efficient calculation method known as reciprocal nearest neighbor (RNN) analysis is available (16), but its usefulness is limited to scalar descriptors for which centroids have a well-defined meaning.
In some clustering studies, fingerprints have been treated as simple vectors, and similarities between them have been evaluated as Euclidean distances in lieu of using Tanimoto similarity so that RNN could be used. But comparing fingerprints on the basis of Euclidean distances means that bits that are set to 0 in each of a pair (indicating that both of the corresponding compounds lack some particular substructure) contribute as much to the similarity as do bits that are set to 1 in both fingerprints, which indicates that the two compounds share a common substructure. Results can be contradictory because this method obviates any scaling of the similarity coefficient: Ethane and methane can turn out to be more similar to each other than lysozyme is to an analogue in which a single glycine residue has been replaced by an alanine!
Making selections from a library (i.e., dissecting the centipede) is only a means to an end: determining how biochemical activity is distributed across the section of structural space spanned by the library. Descriptors are not equally useful for making such selections. Two possible situations are shown in Figure 4.
Click here to see Figure 4In many descriptor spaces, active compounds cluster together loosely and may even intermingle as in the left side of Figure 4. LogP (octanol-to-water distribution ratio) falls into this class, as do pharmacophore patterns. Although there may be an optimal logP for benzodiazepine tranquilizers, an antihistamine with that optimal logP will almost certainly be completely inactive against the benzodiazepine receptor. Similarly, a successful pharmacophore search will be one in which 15% of the compounds identified as bearing an "active" pharmacophore pattern actually exhibit the desired activity (21). Similarity in such descriptor spaces is a necessary condition for similarity of biochemical activity, but it is not sufficient.
In the complementary scenario (Figure 4--see above), each activity may be associated with several distinct islands in the descriptor space, but few inactives are near any active. Similarity in descriptors of this sort is a sufficient condition for similarity in biochemical activity, but not a necessary one. Such descriptors are said to possess the Neighborhood Property (22).
When selecting compounds by any technique, whether from a realized library for assay or from a virtual one for synthesis, one would like to be assured that the descriptors used to evaluate similarity exhibit the Neighborhood Property. Otherwise, the actual biochemical activity (or inactivity) of the selected compound cannot be expected to reflect the activity (or inactivity) of passed-over compounds in the section of descriptor space it represents.
There are several different approaches to validating particular metrics for use in evaluating molecular diversity. In this case, the metric that we want to validate is a descriptor or a set of descriptors taken in conjunction with a particular measure of similarity. Here we consider how two (categorical and local neighborhood validation) can be applied to Tanimoto coefficients between 2-D and between AP fingerprints, and to Euclidean distances between inertially oriented molecular fields. For calculating the latter metric, the principal (steric) axes of a molecule are used to define the axes of steric, electrostatic, and H-bonding molecular fields (23).
Categorical validation.
In this method, compounds from
several activity classes are grouped on the basis of similarity in the
metric being validated. Assigning pharmacologically related compounds
to the same group is then counted as a success. One example is the
actives-versus-inactives groups defined in Brown and Martin
(15). Here, success will be quantified in terms of the more
broadly applicable 2
statistic,
calculated with respect
to the distribution expected for a purely random allocation of
compounds among groups (24). The larger the
2
statistic, the more strongly the corresponding result deviates from a
random distribution.
The data set that we used earlier for Table 1 has its 68 compounds
distributed across six pharmacological classes and can be used for
testing different ways of estimating logP values. When the data set was
divided up using different metrics into nine clusters (some
classes have natural subclasses), 2-D fingerprints performed very well
(2 = 125.31, P < 0.005), but AP fingerprints
failed to give a significantly nonrandom classification
(
2 = 23.64, P < 0.99). Inertially
oriented steric fields were also successful (
2 =
68.92, P < 0.005), although not as successful as 2-D
fingerprints.
Data for running this kind of analysis are readily available from various public and private databases. Unfortunately, the results are not directly relevant to selection of representative diverse subsets for lead discovery, because they inextricably confound (in the statistical sense) structural similarity with biochemical similarity. The islands of pharmacological activity represented are artificially compact and structurally homogeneous by virtue of historical accident. The benzamides in the data set we used above, for example, were developed from one another or from common progenitors (18) and thus contain common substructures quite distinct from those in the phenothiazines and the ß-blockers. When the analysis is carried out carefully, meaningful results can be obtained with corporate databases (15), but the qualitative nature of the evaluation (e.g., "active" vs. "inactive" ) complicates analysis across thresholds and is an inherent weakness of this method.
Neighborhood validation.
An alternative approach is to
assess the neighborhood behavior directly by examining the quantitative
relationship between differences in a metric and differences in
specific biochemical activities (22). If the Neighborhood
Property holds for a particular metric, small differences in
biochemical activity will be consistently associated with small
differences in the metric. Note that the converse will not necessarily
be true--very dissimilar compounds may have very similar differences in
biochemical activity. Hence, a plot of differences in biochemical
activity as a function of metric difference will have a relatively
empty top-left triangle, if the metric exhibits the Neighborhood
Property. The degree of emptiness again can be quantitatively evaluated
in terms of 2 with respect to the expected
distribution
for a random selection.
Results from such analyses have been described in detail elsewhere (22) and are summarized in Table 2. Of the metrics examined, only 2-D fingerprints and molecular fields consistently exhibit strong Neighborhood Properties. AP fingerprints and autocorrelation coefficients (averages of atom pair property differences weighted by their separations) (25) were sometimes useful, showing significant neighborhood behavior in one-third of the data sets examined. Calculated molar refractivity (CMR; BioByte Corp.), ClogP, and molecular connectivity indices (CI) (19), though they have proven very useful in QSAR studies, rarely exhibit neighborhood behavior, at least not when considered in isolation from other metrics (22). In general, the higher the dimensionality of a descriptor, the tighter is its neighborhood behavior.