Impact of Chemist-In-The-Loop Molecular Representations on Machine Learning Outcomes

Cite this: J. Chem. Inf. Model. 2020, 60, 10, 4449–4456
Publication Date (Web):August 10, 2020
https://doi.org/10.1021/acs.jcim.0c00193
Copyright © 2020 American Chemical Society

Abstract

The development of molecular descriptors is a central challenge in cheminformatics. Most approaches use either algorithms that extract atomic environments or end-to-end machine learning. However, a looming question is how these approaches compare with the critical eye of trained chemists. The CAS fingerprint engages expert chemists to curate chemical motifs that they deem could influence bioactivity. In this paper, we benchmark the CAS fingerprint against commonly used fingerprints using a well-established benchmark set of 88 targets. We show that the CAS fingerprint outperforms most of the commonly used molecular fingerprints. Analysis of the CAS fingerprint reveals that experts tend to select features that are rarely reported in the literature, though not all rare features are selected. Our analysis also shows that the CAS fingerprint provides a different source of information compared to other commonly used fingerprints. These results suggest that anthropomorphic insights do have predictive power and highlight the importance of a chemist-in-the-loop approach in the era of machine learning.

Introduction

The representation of a molecular structure as a numerical vector, a molecular fingerprint, is a longstanding challenge in cheminformatics. This problem was first encountered in the field of chemical database construction, where the goal was to enable users to rapidly retrieve molecules in the database that are chemically similar to a query.(1) The pioneering work of Morgan in 1965(2) reported an algorithm that generates a unique numerical identifier for each molecular structure based on a tabular arrangement of atoms and bonds (i.e., a connection table). This algorithm remains a foundational component of the CAS Registry, a comprehensive collection of more than 155 million chemical substances extracted by scientists from journal articles, patents, and other sources dating back more than 150 years. Since Morgan’s work, molecular fingerprints have been used in clustering molecules(3) and bioactivity prediction,(4) where those fingerprints are fed as inputs into statistical machine learning models to predict activity.
Most 2D molecular fingerprints to date rely on algorithmic rules that systematically enumerate chemical environments around every atom in a molecule up to a predefined distance and then map those chemical environments into a fixed-length vector via hashing.(5) This results in a binary vector, where each entry corresponds to the presence or the absence of (potentially multiple) chemical fragments. Variations between fingerprints are mostly in how the chemical environments are defined. These definitions include concentric circles emanating from the central atom (e.g., extended connectivity fingerprint(6)), linear paths along chemical bonds (e.g., Daylight fingerprint(5)), and pairs/triplets/quadruplets of atoms separated by a predefined number of bonds (e.g., atom pair and topological torsion fingerprints(7,8)). However, the challenge is that biologically relevant motifs do not necessarily fit into the algorithmic definition of chemical motifs.
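As a rough illustration of the hashing step described above, the following sketch folds a list of fragment strings into a fixed-length binary vector. The fragment strings, the function name `fold_fragments`, and the use of SHA-1 are assumptions made for illustration only; real toolkits enumerate atom environments from the molecular graph and use their own hash functions.

```python
import hashlib


def fold_fragments(fragments, n_bits=1024):
    """Map an iterable of fragment strings onto a fixed-length bit vector."""
    bits = [0] * n_bits
    for frag in fragments:
        # Hash the fragment and reduce the hash to a bit position.
        digest = hashlib.sha1(frag.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_bits
        bits[index] = 1  # distinct fragments may collide on the same bit
    return bits


# Hypothetical fragments; a real fingerprint would derive these from atom environments.
example_fragments = ["C-C", "C-O", "C(=O)O", "c1ccccc1"]
fingerprint = fold_fragments(example_fragments, n_bits=64)
```

Because several fragments can land on the same position, a set bit may stand for more than one fragment, which is the "potentially multiple" caveat in the text.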
Recent machine learning breakthroughs reported algorithms that learn the optimal representation of molecules directly from the data, stored as molecular graphs, in an end-to-end manner.(9) Although in theory, these methods are superior to ad hoc definitions of a chemical environment, the advantage of learned molecular representation becomes significant only when there is an abundant amount of data(10)—a luxury typically not encountered in drug discovery projects, where the goal is to drive decisions before a lot of unfruitful data has been generated.
Rather than employing algorithmic definitions of chemical environments or end-to-end machine learning, some fingerprints register the presence or the absence of a fixed list of hand-curated common molecular motifs.(11) However, these fingerprints typically focus on common chemical motifs; thus, two molecules could have similar fingerprints yet potentially display very different bioactivity profiles.
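A toy sketch of a structural-key fingerprint may make this limitation concrete. Here each molecule is represented simply as a set of motif labels; the key names, the `rare_spirocycle` label, and the function `structural_key_fingerprint` are hypothetical and are not taken from any published key set.

```python
def structural_key_fingerprint(molecule_motifs, key_list):
    """Return one bit per key: 1 if the motif is present in the molecule, else 0."""
    present = set(molecule_motifs)
    return [1 if key in present else 0 for key in key_list]


# Hypothetical list of common motifs used as keys.
KEYS = ["carboxylic_acid", "benzene_ring", "primary_amine", "hydroxyl"]

mol_a = {"benzene_ring", "hydroxyl"}
mol_b = {"benzene_ring", "hydroxyl", "rare_spirocycle"}  # extra motif not in KEYS

fp_a = structural_key_fingerprint(mol_a, KEYS)
fp_b = structural_key_fingerprint(mol_b, KEYS)
```

In this sketch the two molecules receive identical fingerprints because the motif that distinguishes them is absent from the key list.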
In this paper, we report and benchmark the CAS fingerprint, an expert system produced by CAS, a division of the American Chemical Society specializing in scientific information solutions. The CAS fingerprint is a curated set of 7851 chemical motifs deemed by expert chemists in CAS to be potentially important to chemical properties. We show that the CAS fingerprint outperforms most of the commonly used molecular fingerprints. Moreover, we find that human experts tend to pick out features that are rare. Our findings complement a recent study showing that human chemists predict solubility just as well as state-of-the-art models,(12) and highlight the potential value of well-curated expert systems in the application of machine learning.

Methods

CAS Fingerprint

Based on their professional judgment, over 25 000 different structural features were proposed by CAS scientists for inclusion into the CAS fingerprint. Many of the proposed features were newly identified for this purpose while some features had been used internally by CAS for other purposes. To determine which features to ultimately select, CAS conducted a heuristic-based iterative analysis to determine which combination of structural features had the greatest performance impact on in-house benchmarks. The in-house benchmarks consisted of almost 9000 similarity search queries used for quality assurance testing purposes as well as two binary classification (active/inactive) tasks associated with two targets that did not overlap with the targets considered in this study. These initial targets, including a member of the RAS family of oncogenes and a member of the isocitrate/isopropylmalate dehydrogenase family of enzymes, were targets associated with drug discovery consulting projects focused on identifying promising new drug-like molecules with a desired property profile. The heuristic-based iterative analysis consisted of systematically determining the isolated performance impact associated with each proposed feature for inclusion and checking whether combinations of proposed features further improved the performance of the entire set. Based on this analysis, CAS selected the most chemically meaningful structural features that optimized the performance of the entire set of molecular descriptors.
The underlying structural features encoded into the CAS fingerprint consist of a variety of different molecular descriptors based on a molecule’s atoms, the bonds that connect them, and the spatial arrangement of the atoms. For example, some of the structural features are defined in terms of the concentric area surrounding each atom, while others are based on the paths of atoms and bonds throughout a given molecule. Given a set of known or theoretical (virtual) molecules, a connection table (i.e., the atoms and bonds that comprise the basic structure of the substance) is algorithmically generated for each molecule. The molecular descriptors that make up the predefined set of structural features are then automatically generated from the connection table and encoded into a binary bit string (i.e., the CAS fingerprint).
The composition of the CAS fingerprint is proprietary. However, the CAS fingerprint is being used by CAS as part of consulting projects that incorporate machine learning algorithms to predict molecular activities and properties.

Benchmarking Data and Methodology

In this paper, we use the benchmark dataset and machine learning methodologies reported by Riniker and Landrum.(13) In summary, they considered 88 binary classification (active/inactive) tasks drawn from the public domain. These tasks simulate a typical virtual screening campaign in terms of realistic numbers of actives and inactives and chemical diversity. To benchmark the CAS fingerprint, we utilized 10 2D fingerprints belonging to three different classes of fingerprints: structural keys, circular, and path-based fingerprints. All of the selected fingerprints were used in the Riniker and Landrum ligand-based virtual screening benchmark study.(13) The selected fingerprints are also consistent with subsequent studies, which found the extended connectivity fingerprints(6) with bond diameters of 4 and 6 as well as the topological torsion fingerprint(8) among the best-performing fingerprints when ranking diverse structures by similarity, while the atom pair fingerprint(7) was found to be the best performing when ranking very close analogues.(14)
Since the CAS fingerprint can be classified as a structural key-based fingerprint, the Avalon fingerprint (Avalon)(11) and the Molecular ACCess System (MACCS)(15) served as baseline fingerprints. The extended connectivity fingerprint with a bond diameter of 6 (ECFP6) represented the only circular fingerprint utilized in this study, while the path-based fingerprints included the bit vector form of the atom pair fingerprint (hashap), the bit vector form of the topological torsion fingerprint (hashtt), and the RDKit fingerprint, a relative of the Daylight fingerprint,(5) with a maximum path length of six (RDK6).(16) All of these fingerprints were calculated using RDKit, an open-source cheminformatics and machine learning toolkit.(16) For all bit-string fingerprints, sizes of 1024 bits and 7851 bits (the length of the CAS fingerprint) were used.
To evaluate the comparative performance of different fingerprints, we employ the sign test discussed in a recent publication by some of the authors.(17) This test detects consistent differences between pairs of observations and evaluates the quality of a method by the number of times it outperforms another method. We use the area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (PRC-AUC) as the figures of merit.
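The counting idea behind the sign test can be sketched in a few lines. This is a minimal illustration, not the code from the released platform: the function name `sign_test_statistic`, the example scores, and the choice to count ties as non-wins are all assumptions.

```python
def sign_test_statistic(scores_a, scores_b):
    """Fraction of paired tasks on which method A scores strictly higher than method B."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    # Assumption: a tie is not counted as a win for method A.
    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    return wins / len(scores_a)


# Made-up per-task AUC values for two hypothetical fingerprints.
auc_a = [0.90, 0.75, 0.82, 0.60]
auc_b = [0.85, 0.80, 0.70, 0.55]
statistic = sign_test_statistic(auc_a, auc_b)  # A wins on 3 of the 4 tasks
```

A statistic above 0.5 indicates that method A wins on more than half of the paired tasks.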
We release our benchmarking platform, adapted from ref 13, as open-source code (https://github.com/mc-robinson/benchmarking_platform_p23). Our codebase implements the sign test and automatically generates a report with the relevant plots and interpretable analytics. Furthermore, we have attempted to simplify the execution of the original code, and we have upgraded the platform from Python 2 to Python 3. This platform can be easily adapted for testing future novel fingerprints.

Results

Benchmarking Results

We first compare the performance of the CAS fingerprint against other commonly used fingerprints in the panel of 88 tasks. The fingerprints are fed into a random forest (RF) model. Figures 1–4 show that the CAS fingerprint is among the top-ranking methods using ROC-AUC and PRC-AUC metrics. An alternative way to visualize the performance of different fingerprints is via their average rank across all 88 targets.

Figure 1. Performance of the fingerprints, evaluated by the ROC-AUC, across 88 different datasets using the random forest (RF) method. The CAS fingerprint is highlighted for clarity.

Figure 2. Performance of CAS fingerprint relative to the best- and worst-performing fingerprint, evaluated by the ROC-AUC, across 88 different datasets using the random forest (RF) method. Figure 1 is replotted to highlight the best-performing fingerprint, the worst-performing fingerprint, and the CAS fingerprint.

Figure 3. Performance of the fingerprints, evaluated by the PRC-AUC, across 88 different datasets using the random forest (RF) method. The CAS fingerprint is highlighted for clarity.

Figure 4. Performance of the CAS fingerprint relative to the best- and worst-performing fingerprint, evaluated by the PRC-AUC, across 88 different datasets using the random forest (RF) method. Figure 3 is replotted to highlight the best-performing fingerprint, the worst-performing fingerprint, and the CAS fingerprint.

To provide a quantitative evaluation of the different methods, Tables 1 and 2 show the results of the sign test comparing the CAS fingerprint against other fingerprints. The sign test essentially counts the proportion of “wins” for a given fingerprint over another across all 88 targets included in the dataset. Therefore, the sign test statistic can be interpreted as the probability that the CAS fingerprint outperforms another fingerprint method. For example, the sign test statistic of the CAS fingerprint against the 7851 bit ECFP6 fingerprint (a fingerprint of the same length as the CAS fingerprint) is 0.818 (cf. Table 1), suggesting that if one were to only use the CAS fingerprint rather than the 7851 bit ECFP6 fingerprint, one would have picked the best-performing method 81.8% of the time.
Table 1. CAS Fingerprint Significantly Outperforms Most of the Commonly Used Fingerprints with Bit Size 7851 (a)

sign test against CASfp | ecfp6_7851           | rdk6_7851            | avalon_7851          | hashap_7851
ROC-AUC                 | 0.818 (0.725, 0.885) | 0.727 (0.626, 0.809) | 0.659 (0.555, 0.750) | 0.522 (0.420, 0.624)
PRC-AUC                 | 0.522 (0.420, 0.624) | 0.807 (0.712, 0.876) | 0.568 (0.464, 0.667) | 0.580 (0.475, 0.677)

(a) The table shows the results of the sign test comparing the CAS fingerprint against the commonly used fingerprints with bit size 7851. The 95% Wilson score intervals are included in parentheses.
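The Wilson score interval quoted in parentheses can be reproduced with the standard formula. The sketch below assumes z = 1.96 for the 95% level and infers a win count of 72 out of 88 from the reported statistic of 0.818; the win count itself is not stated in the text, and the function name `wilson_interval` is an illustrative choice.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion wins / n."""
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 72 wins out of 88 targets is inferred from the reported 0.818 (an assumption).
low, high = wilson_interval(72, 88)
print(round(72 / 88, 3), round(low, 3), round(high, 3))  # 0.818 0.725 0.885
```

Under these assumptions, the result agrees with the ROC-AUC entry for the 7851 bit ECFP6 fingerprint, 0.818 (0.725, 0.885).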

Table 2. CAS Fingerprint Significantly Outperforms the Commonly Used Fingerprints with Bit Size 1024 (a)

sign test against CASfp | ecfp6_1024           | rdk6_1024            | avalon_1024          | hashap_1024
ROC-AUC                 | 0.886 (0.803, 0.937) | 0.739 (0.638, 0.819) | 0.693 (0.590, 0.780) | 0.659 (0.555, 0.750)
PRC-AUC                 | 0.727 (0.626, 0.809) | 0.807 (0.712, 0.876) | 0.648 (0.544, 0.740) | 0.739 (0.638, 0.819)

(a) The table shows the results of the sign test comparing the CAS fingerprint against the commonly used fingerprints with bit size 1024. The 95% Wilson score intervals are included in parentheses.

We note that the fingerprint size may affect the performance because of a trade-off between underfitting and overfitting. Fingerprints with a small number of parameters may have high bias and low variance, while fingerprints with a large number of parameters may have high variance and low bias. This trade-off can be a hyperparameter in the model, but in practice, cross-validating over the fingerprint size will significantly increase the computational overhead.
In the Supporting Information, we evaluate the fingerprints using logistic regression (LR), Naïve Bayes (NB), and Tanimoto similarity (Tanimoto). Figure 5 summarizes the CAS fingerprint performance by showing its average rank, across all 88 tasks, for different machine learning methods and metrics. The overall trend is that although the CAS fingerprint is not always the best possible fingerprint for every task, there is a nonnegligible number of tasks where the CAS fingerprint is the best method. For example, the worst performance of the CAS fingerprint across all methods is with Tanimoto similarity when compared against hashed topological torsion fingerprints, with a sign test score of 0.375. This means that if one were to use only hashed topological torsion fingerprints, one would have missed the better method 37.5% of the time. As such, practically speaking, the CAS fingerprint is a consistently useful member of a cheminformatics toolkit.

Figure 5. Average rank of each fingerprint across 88 targets. Note that a lower average rank (i.e., closer to one) denotes better performance.
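The average-rank summary can be sketched as follows. The method names and scores are invented, the function `average_ranks` is hypothetical, and ties are broken by input order, which is an assumption rather than the convention used in the paper.

```python
def average_ranks(scores_by_method):
    """Average rank of each method over tasks; rank 1 is the highest score on a task."""
    methods = list(scores_by_method)
    n_tasks = len(next(iter(scores_by_method.values())))
    totals = {m: 0.0 for m in methods}
    for t in range(n_tasks):
        # Sort methods from best to worst on task t (stable sort: ties keep input order).
        ordered = sorted(methods, key=lambda m: scores_by_method[m][t], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_tasks for m in methods}


# Made-up AUC values for three hypothetical fingerprints on three tasks.
scores = {
    "fp_a": [0.9, 0.7, 0.8],
    "fp_b": [0.8, 0.9, 0.6],
    "fp_c": [0.7, 0.8, 0.7],
}
ranks = average_ranks(scores)
```

In this toy example `fp_a` has the lowest average rank even though it is the worst method on the second task, mirroring the observation that a strong fingerprint need not win every task.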

CAS Fingerprint Captures Distinct Sources of Chemical Information

We next turn to examine whether the information present in the CAS fingerprint is different from common fingerprints used in the literature. To this end, we evaluate the correlation between the output probability of the classifier using different fingerprints as input. To facilitate interpretation, we focus on the challenging hERG dataset reported by some of the authors in a previous benchmarking study.(18)
Figure 6 shows that the correlation between the CAS fingerprint output and output from other fingerprints is low, suggesting that this expert system not only outperforms most other fingerprints, but the information present in it is novel and orthogonal to fingerprints in the literature. The orthogonality of information means that the CAS fingerprint can pick out counterintuitive compounds not identified by other commonly used fingerprints. Moreover, our results suggest that the CAS fingerprint is not only a strong method in and of itself but also a useful member of a toolkit if one were to construct an ensemble of fingerprints.
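One way to quantify agreement between the rank orderings produced by two classifiers is the Spearman rank correlation. The text does not state which correlation coefficient was used, so the choice of Spearman, the helper names, and the example probabilities below are assumptions for illustration.

```python
import math


def rank(values):
    """Return 1-based ranks, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(rank(x), rank(y))


# Made-up predicted probabilities from classifiers trained on two fingerprints.
probs_fp1 = [0.10, 0.40, 0.35, 0.80]
probs_fp2 = [0.20, 0.50, 0.45, 0.90]  # same ordering as probs_fp1
probs_fp3 = [0.90, 0.30, 0.40, 0.10]  # reversed ordering
same_order = spearman(probs_fp1, probs_fp2)
reversed_order = spearman(probs_fp1, probs_fp3)
```

A value near zero between two fingerprints would indicate that their classifiers rank compounds largely independently, which is the sense in which the text describes the information as orthogonal.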

Figure 6. Performance of the CAS fingerprint is uncorrelated with other fingerprints, suggesting that the fingerprint is capturing orthogonal chemical signals. The figure shows the correlation between the rank ordering of active (orange) and inactive (blue) by an algorithm tested on the CAS fingerprint and other fingerprints. The plots on the diagonal show the distribution of a classifier score for active (orange) and inactive (blue) compounds.

Human Experts Identify Rare Chemical Features

Having identified the power of the CAS fingerprint, we turn to interrogate the reason behind the performance of human experts compared to algorithms. We suggest that one possible reason is the shallow heuristic notion of novelty—the ability of experts to separate meaningful features from the mundane.
Hashed fingerprints are produced by generating all possible linear paths of connected atoms through the molecule. Since hashed fingerprints do not require a predefined set of structural features, they, in principle, have the advantage of being more broadly applicable. However, they often have poor selectivity as they consist of relevant and irrelevant structural features, and often, the latter outnumbers the former. These irrelevant features can often include features that occur very frequently in molecules and, as a result, are unlikely to be discriminating.
Based on their experience and scientific expertise in curating the CAS Registry, the CAS scientists were able to identify structural features that not only fit chemically meaningful patterns but are also rarely found in the molecules present in the CAS Registry. Table 3 shows that nearly 60% of the structural features present in the CAS fingerprint are found in 1% or fewer of the molecules in the CAS Registry, while almost 75% are found in 3% or fewer. The CAS Registry contained more than 155 million molecules as of January 20, 2020.
Table 3. Human Experts Tend to Pick Out Rare Structural Features, Which Are Less Frequently Found in Other Molecules (a)

frequency range | count | percentage of total count
0.00–1.00%      | 4613  | 59%
1.01–2.00%      | 771   | 10%
2.01–3.00%      | 448   | 6%
3.01–4.00%      | 328   | 4%
4.01–5.00%      | 211   | 3%
5.01–10.00%     | 582   | 7%
over 10.00%     | 898   | 11%

(a) The table shows the frequency of the occurrence of the 7851 features included in the CAS fingerprint for all of the molecules present in the CAS Registry.
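The percentages in Table 3 follow directly from the counts, as the short check below shows. The counts are copied from the table; the variable names and the ASCII bin labels are illustrative choices.

```python
# Feature counts per frequency bin, copied from Table 3.
counts = {
    "0.00-1.00%": 4613,
    "1.01-2.00%": 771,
    "2.01-3.00%": 448,
    "3.01-4.00%": 328,
    "4.01-5.00%": 211,
    "5.01-10.00%": 582,
    "over 10.00%": 898,
}

total = sum(counts.values())  # should equal the 7851 features in the fingerprint
percentages = {label: round(100 * n / total) for label, n in counts.items()}

# Cumulative shares quoted in the text ("nearly 60%" and "almost 75%").
share_up_to_1 = 100 * counts["0.00-1.00%"] / total
share_up_to_3 = 100 * sum(
    counts[k] for k in ("0.00-1.00%", "1.01-2.00%", "2.01-3.00%")
) / total
```

The counts sum to 7851, and the cumulative shares come to about 58.8% and 74.3%, consistent with the "nearly 60%" and "almost 75%" quoted in the text.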

Capturing rare chemical features is a double-edged sword: if one considers only very niche features, one could end up with a descriptor set that is zero for most molecules. Figure 7 shows the distribution of the number of motifs in the CAS fingerprint that are found in the molecules in the benchmark dataset. On average, a molecule contains 403 CAS fingerprint motifs, and every molecule contains at least 21 CAS fingerprint features. As such, the CAS fingerprint captures information relevant for typical small organic molecules.

Figure 7. Histogram of the distribution of the number of motifs in the CAS fingerprint that are found in the molecules in the benchmark dataset.
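The per-molecule motif counts behind such a histogram amount to counting the set bits of each fingerprint. The toy bit vectors and the function name `motif_counts` below are assumptions; the actual CAS fingerprints are proprietary and are not reproduced here.

```python
from statistics import mean


def motif_counts(fingerprints):
    """Number of set bits (motifs present) in each binary fingerprint."""
    return [sum(fp) for fp in fingerprints]


# Toy 6-bit fingerprints standing in for real 7851-bit vectors.
toy_fingerprints = [
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 1],
]
counts = motif_counts(toy_fingerprints)
summary = {
    "mean": mean(counts),
    "min": min(counts),
    "n_all_zero": sum(1 for c in counts if c == 0),
}
```

A minimum above zero, as reported for the benchmark dataset, means that no molecule is mapped to an all-zero descriptor.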

While it is theoretically possible to exhaustively enumerate all possible chemical motifs, our results suggest that human crowdsourcing provides a priori knowledge of important chemical motifs. It follows from the bias-variance trade-off that a priori knowledge of important motifs means that higher performance can be achieved with less data.

Conclusions

The application of machine learning to drug discovery has thus far focused mostly on benchmarking algorithms or comparing human experts with algorithms. In this work, we demonstrate the value of a chemist-in-the-loop approach, where expert insights are used to curate potentially biologically relevant molecular motifs and machine learning is used to predict activity from those motifs. Experts are shown to pick out relevant “interesting” features that yield distinct sources of information compared to other fingerprints in the literature. As the enterprise of drug discovery is premised upon searching for novelty, these results indicate that the CAS fingerprint is a valuable resource to the drug discovery and cheminformatics community.

Supporting Information

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.0c00193.

  • Performance of the fingerprints, evaluated by the ROC-AUC and PRC-AUC, across 88 different datasets using random forest, Naïve Bayes, logistic regression, and Tanimoto similarity methods (PDF)

  • Table of the AUC (ROC and PRC) and SD values for all 88 targets for each of these machine learning methods (XLSX)

Author Information

  • Corresponding Author
  • Authors
    • Dmitrii A. Polshakov - CAS, P.O. Box 3012, Columbus, Ohio 43210-0012, United States
    • Matthew C. Robinson - PostEra Inc., 1209 Orange Street, Wilmington, Delaware 19801, United States
    • Alpha A. Lee - PostEra Inc., 1209 Orange Street, Wilmington, Delaware 19801, United States
  • Notes

    The authors declare the following competing financial interest(s): This work was sponsored and financially supported by the CAS Innovation Lab.

References

This article references 18 other publications.

  1. Christie, B. D.; Leland, B. A.; Nourse, J. G. Structure Searching In Chemical Databases By Direct Lookup Methods. J. Chem. Inf. Model. 1993, 33, 545–547, DOI: 10.1021/ci00014a004
  2. Morgan, H. L. The Generation Of An Unique Machine Description For Chemical Structures – A Technique Developed At Chemical Abstracts Service. J. Chem. Doc. 1965, 5, 107–113, DOI: 10.1021/c160017a018
  3. McGregor, M. J.; Pallai, P. V. Clustering Of Large Databases of Compounds: Using The MDL “Keys” as Structural Descriptors. J. Chem. Inf. Comput. Sci. 1997, 37, 443–448, DOI: 10.1021/ci960151e
  4. Sheridan, R. P.; Kearsley, S. K. Why Do We Need So Many Chemical Similarity Search Methods? Drug Discovery Today 2002, 7, 903–911, DOI: 10.1016/S1359-6446(02)02411-X
  5. Sastry, M.; Lowrie, J. F.; Dixon, S. L.; Sherman, W. Large-scale Systematic Analysis Of 2D Fingerprint Methods And Parameters to Improve Virtual Screening Enrichments. J. Chem. Inf. Model. 2010, 50, 771–784, DOI: 10.1021/ci100062n
  6. Rogers, D.; Hahn, M. Extended-connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754, DOI: 10.1021/ci100050t
  7. Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom Pairs as Molecular Features In Structure-Activity Studies: Definition And Applications. J. Chem. Inf. Model. 1985, 25, 64–73, DOI: 10.1021/ci00046a002
  8. Nilakantan, R.; Bauman, N.; Dixon, J. S.; Venkataraghavan, R. Topological Torsion: A New Molecular Descriptor For SAR applications. Comparison With Other Descriptors. J. Chem. Inf. Model. 1987, 27, 82–85, DOI: 10.1021/ci00054a008
  9. Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; Adams, R. P. Convolutional Networks On Graphs For Learning Molecular Fingerprints. In the Proceedings of Advances in Neural Information Processing Systems 28, 2015; pp 2215–2223.
  10. Yang, K.; Swanson, K.; Jin, W.; Coley, C.; Eiden, P.; Gao, H.; Guzman-Perez, A.; Hopper, T.; Kelley, B.; Mathea, M.; Palmer, A.; Settels, V.; Jaakkola, T.; Jensen, K.; Barzilay, R. Analyzing Learned Molecular Representations For Property Prediction. J. Chem. Inf. Model. 2019, 59, 3370–3388, DOI: 10.1021/acs.jcim.9b00237
  11. Gedeck, P.; Rohde, B.; Bartels, C. QSAR - How Good Is It In Practice? Comparison Of Descriptor Sets On An Unbiased Cross Section Of Corporate Data Sets. J. Chem. Inf. Model. 2006, 46, 1924–1936, DOI: 10.1021/ci050413p
  12. Boobier, S.; Osbourn, A.; Mitchell, J. B. Can Human Experts Predict Solubility Better Than Computers? J. Cheminf. 2017, 9, 63, DOI: 10.1186/s13321-017-0250-y
  13. Riniker, S.; Landrum, G. A. Open-source Platform To Benchmark Fingerprints For Ligand-based Virtual Screening. J. Cheminf. 2013, 5, 26, DOI: 10.1186/1758-2946-5-26
  14. O’Boyle, N. M.; Sayle, R. A. Comparing structural fingerprints using a literature-based similarity benchmark. J. Cheminf. 2016, 8, 36, DOI: 10.1186/s13321-016-0148-0
  15. Durant, J. L.; Leland, B. A.; Henry, D. R.; Nourse, J. G. Reoptimization Of MDL Keys For Use In Drug Discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273–1280, DOI: 10.1021/ci010132r
  16. RDKit: Cheminformatics and Machine Learning Software, 2020. http://www.rdkit.org.
  17. Robinson, M. C.; Glen, R. C.; Lee, A. A. Validating the Validation: Reanalyzing a Large-scale Comparison of Deep Learning and Machine Learning Models for Bioactivity Prediction. J. Comput.-Aided Mol. Des. 2020, 717–730, DOI: 10.1007/s10822-019-00274-0
  18. Lee, A. A.; Yang, Q.; Bassyouni, A.; Butler, C. R.; Hou, X.; Jenkinson, S.; Price, D. A. Ligand Biological Activity Predicted By Cleaning Positive And Negative Chemical Correlations. Proc. Natl. Acad. Sci. U.S.A. 2019, 116, 3373–3378, DOI: 10.1073/pnas.1810847116

Cited By


This article has not yet been cited by other publications.

    • Abstract

      Figure 1

      Figure 1. Performance of the fingerprints, evaluated by the ROC-AUC, across 88 different datasets using the random forest (RF) method. The CAS fingerprint is highlighted for clarity.

      Figure 2

      Figure 2. Performance of CAS fingerprint relative to the best- and worst-performing fingerprint, evaluated by the ROC-AUC, across 88 different datasets using the random forest (RF) method. Figure 1 is replotted to highlight the best-performing fingerprint, the worst-performing fingerprint, and the CAS fingerprint.

      Figure 3

      Figure 3. Performance of the fingerprints, evaluated by the PRC-AUC, across 88 different datasets using the random forest (RF) method. The CAS fingerprint is highlighted for clarity.

      Figure 4

      Figure 4. Performance of the CAS fingerprint relative to the best- and worst-performing fingerprint, evaluated by the PRC-AUC, across 88 different datasets using the random forest (RF) method. Figure 3 is replotted to highlight the best-performing fingerprint, the worst-performing fingerprint, and the CAS fingerprint.

      Figure 5

      Figure 5. Average rank of each fingerprint across 88 targets. Note that a lower average rank (i.e., closer to one) denotes better performance.

      Figure 6

      Figure 6. Performance of the CAS fingerprint is uncorrelated with other fingerprints, suggesting that the fingerprint is capturing orthogonal chemical signals. The figure shows the correlation between the rank ordering of active (orange) and inactive (blue) by an algorithm tested on the CAS fingerprint and other fingerprints. The plots on the diagonal show the distribution of a classifier score for active (orange) and inactive (blue) compounds.

      Figure 7

      Figure 7. Histogram of the distribution of the number of motifs in the CAS fingerprint that are found in the molecules in the benchmark dataset.

    • References

      ARTICLE SECTIONS
      Jump To

      This article references 18 other publications.

      1. 1
        Christie, B. D.; Leland, B. A.; Nourse, J. G. Structure Searching In Chemical Databases By Direct Lookup Methods. J. Chem. Inf. Model. 1993, 33, 545547,  DOI: 10.1021/ci00014a004
      2. 2
        Morgan, H. L. The Generation Of An Unique Machine Description For Chemical Structures– A Technique Developed At Chemical Abstracts Service. J. Chem. Doc. 1965, 5, 107113,  DOI: 10.1021/c160017a018
      3. 3
        McGregor, M. J.; Pallai, P. V. Clustering Of Large Databases of Compounds: Using The MDL “Keys” as Structural Descriptors. J. Chem. Inf. Comput. Sci. 1997, 37, 443448,  DOI: 10.1021/ci960151e
      4. 4
        Sheridan, R. P.; Kearsley, S. K. Why Do We Need So Many Chemical Similarity Search Methods?. Drug Discovery Today 2002, 7, 903911,  DOI: 10.1016/S1359-6446(02)02411-X
      5. 5
        Sastry, M.; Lowrie, J. F.; Dixon, S. L.; Sherman, W. Large-scale Systematic Analysis Of 2D Fingerprint Methods And Parameters to Improve Virtual Screening Enrichments. J. Chem. Inf. Model. 2010, 50, 771784,  DOI: 10.1021/ci100062n
      6. 6
        Rogers, D.; Hahn, M. Extended-connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742754,  DOI: 10.1021/ci100050t
      7. 7
        Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom Pairs as Molecular Features In Structure-Activity Studies: Definition And Applications. J. Chem. Inf. Model. 1985, 25, 6473,  DOI: 10.1021/ci00046a002
      8. 8
        Nilakantan, R.; Bauman, N.; Dixon, J. S.; Venkataraghavan, R. Topological Torsion: A New Molecular Descriptor For SAR applications. Comparison With Other Descriptors. J. Chem. Inf. Model. 1987, 27, 8285,  DOI: 10.1021/ci00054a008
      9. 9
        Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; Adams, R. P. Convolutional Networks On Graphs For Learning Molecular Fingerprints. In the Proceedings of Advances in Neural Information Processing Systems 28, 2015; pp 22152223.
      10. 10
        Yang, K.; Swanson, K.; Jin, W.; Coley, C.; Eiden, P.; Gao, H.; Guzman-Perez, A.; Hopper, T.; Kelley, B.; Mathea, M.; Palmer, A.; Settels, V.; Jaakkola, T.; Jensen, K.; Barzilay, R. Analyzing Learned Molecular Representations For Property Prediction. J. Chem. Inf. Model. 2019, 59, 33703388,  DOI: 10.1021/acs.jcim.9b00237
      11. 11
        Gedeck, P.; Rohde, B.; Bartels, C. QSAR – How Good Is It In Practice? Comparison Of Descriptor Sets On An Unbiased Cross Section Of Corporate Data Sets. J. Chem. Inf. Model. 2006, 46, 1924–1936, DOI: 10.1021/ci050413p
      12. 12
        Boobier, S.; Osbourn, A.; Mitchell, J. B. Can Human Experts Predict Solubility Better Than Computers? J. Cheminf. 2017, 9, 63, DOI: 10.1186/s13321-017-0250-y
      13. 13
        Riniker, S.; Landrum, G. A. Open-source Platform To Benchmark Fingerprints For Ligand-based Virtual Screening. J. Cheminf. 2013, 5, 26,  DOI: 10.1186/1758-2946-5-26
      14. 14
        O’Boyle, N. M.; Sayle, R. A. Comparing structural fingerprints using a literature-based similarity benchmark. J. Cheminf. 2016, 8, 36,  DOI: 10.1186/s13321-016-0148-0
      15. 15
        Durant, J. L.; Leland, B. A.; Henry, D. R.; Nourse, J. G. Reoptimization Of MDL Keys For Use In Drug Discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273–1280, DOI: 10.1021/ci010132r
      16. 16
        RDKit: Cheminformatics and Machine Learning Software, 2020. http://www.rdkit.org.
      17. 17
        Robinson, M. C.; Glen, R. C.; Lee, A. A. Validating the Validation: Reanalyzing a Large-scale Comparison of Deep Learning and Machine Learning Models for Bioactivity Prediction. J. Comput.-Aided Mol. Des. 2020, 717–730, DOI: 10.1007/s10822-019-00274-0
      18. 18
        Lee, A. A.; Yang, Q.; Bassyouni, A.; Butler, C. R.; Hou, X.; Jenkinson, S.; Price, D. A. Ligand Biological Activity Predicted By Cleaning Positive And Negative Chemical Correlations. Proc. Natl. Acad. Sci. U.S.A. 2019, 116, 3373–3378, DOI: 10.1073/pnas.1810847116
    • Supporting Information

      The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.0c00193.

      • Performance of the fingerprints, evaluated by the ROC-AUC and PRC-AUC, across 88 different datasets using random forest, Naïve Bayes, logistic regression, and Tanimoto similarity methods (PDF)

      • Table of the AUC (ROC and PRC) and SD values for all 88 targets for each of these machine learning methods (XLSX)

