Impact of Chemist-In-The-Loop Molecular Representations on Machine Learning Outcomes
- Todd J. Wills* (Email: [email protected]), CAS, P.O. Box 3012, Columbus, Ohio 43210-0012, United States
- Dmitrii A. Polshakov, CAS, P.O. Box 3012, Columbus, Ohio 43210-0012, United States
- Matthew C. Robinson, PostEra Inc., 1209 Orange Street, Wilmington, Delaware 19801, United States
- Alpha A. Lee, PostEra Inc., 1209 Orange Street, Wilmington, Delaware 19801, United States
Abstract

The development of molecular descriptors is a central challenge in cheminformatics. Most approaches rely either on algorithms that extract atomic environments or on end-to-end machine learning. However, a looming question is how these approaches compare with the critical eye of trained chemists. The CAS fingerprint engages expert chemists to curate chemical motifs that they deem likely to influence bioactivity. In this paper, we benchmark the CAS fingerprint against commonly used fingerprints using a well-established benchmark set of 88 targets. We show that the CAS fingerprint outperforms most of the commonly used molecular fingerprints. Analysis of the CAS fingerprint reveals that experts tend to select features that are rarely reported in the literature, though not all rare features are selected. Our analysis also shows that the CAS fingerprint provides a different source of information compared to other commonly used fingerprints. These results suggest that anthropomorphic insights do have predictive power and highlight the importance of a chemist-in-the-loop approach in the era of machine learning.
Introduction
Methods
CAS Fingerprint
Benchmarking Data and Methodology
Results
Benchmarking Results
Figure 1. Performance of the fingerprints, evaluated by the ROC-AUC, across 88 different datasets using the random forest (RF) method. The CAS fingerprint is highlighted for clarity.
Figure 2. Performance of the CAS fingerprint relative to the best- and worst-performing fingerprint, evaluated by the ROC-AUC, across 88 different datasets using the random forest (RF) method. Figure 1 is replotted to highlight the best-performing fingerprint, the worst-performing fingerprint, and the CAS fingerprint.
Figure 3. Performance of the fingerprints, evaluated by the PRC-AUC, across 88 different datasets using the random forest (RF) method. The CAS fingerprint is highlighted for clarity.
Figure 4. Performance of the CAS fingerprint relative to the best- and worst-performing fingerprint, evaluated by the PRC-AUC, across 88 different datasets using the random forest (RF) method. Figure 3 is replotted to highlight the best-performing fingerprint, the worst-performing fingerprint, and the CAS fingerprint.
| sign test against CASfp | ecfp6_7851 | rdk6_7851 | avalon_7851 | hashap_7851 |
|---|---|---|---|---|
| ROC-AUC | 0.818 (0.725, 0.885) | 0.727 (0.626, 0.809) | 0.659 (0.555, 0.750) | 0.522 (0.420, 0.624) |
| PRC-AUC | 0.522 (0.420, 0.624) | 0.807 (0.712, 0.876) | 0.568 (0.464, 0.667) | 0.580 (0.475, 0.677) |
The table shows the results of the sign test comparing the CAS fingerprint against the commonly used fingerprints with bit size 7851. The 95% Wilson score intervals are included in parentheses.
| sign test against CASfp | ecfp6_1024 | rdk6_1024 | avalon_1024 | hashap_1024 |
|---|---|---|---|---|
| ROC-AUC | 0.886 (0.803, 0.937) | 0.739 (0.638, 0.819) | 0.693 (0.590, 0.780) | 0.659 (0.555, 0.750) |
| PRC-AUC | 0.727 (0.626, 0.809) | 0.807 (0.712, 0.876) | 0.648 (0.544, 0.740) | 0.739 (0.638, 0.819) |
The table shows the results of the sign test comparing the CAS fingerprint against the commonly used fingerprints with bit size 1024. The 95% Wilson score intervals are included in parentheses.
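The intervals in the sign-test tables can be reproduced with the standard Wilson score formula. The following is a minimal sketch, assuming each table entry is the fraction of the 88 targets on which the CAS fingerprint scored higher than the named fingerprint; the function name `wilson_interval` and the win count of 72 are illustrative choices, not taken from the paper.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion wins/n.

    z = 1.959964 is the two-sided 95% normal quantile.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed illustration: 72 wins out of 88 targets is a win fraction of 0.818.
low, high = wilson_interval(72, 88)
print(f"{72 / 88:.3f} ({low:.3f}, {high:.3f})")
```

Under that assumption, 72 wins out of 88 gives an interval that rounds to (0.725, 0.885), which agrees with the ecfp6_7851 ROC-AUC entry. An interval whose lower bound stays above 0.5 indicates that the CAS fingerprint wins on more than half of the targets at the 95% level.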
Figure 5. Average rank of each fingerprint across 88 targets. Note that a lower average rank (i.e., closer to one) denotes better performance.
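An average rank of this kind can be computed by ranking the fingerprints on each target and then averaging those ranks per fingerprint. The sketch below is one way to do that, assuming higher AUC is better and that tied fingerprints share their mean rank; the function name, the fingerprint labels, and the AUC values are made up for illustration and are not results from the paper.

```python
from typing import Dict, List


def average_ranks(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Average per-target rank of each fingerprint (1 = best; ties share the mean rank)."""
    names = list(scores)
    n_targets = len(scores[names[0]])
    totals = {name: 0.0 for name in names}
    for t in range(n_targets):
        # Order fingerprints from highest to lowest score on this target.
        ordered = sorted(names, key=lambda name: scores[name][t], reverse=True)
        i = 0
        while i < len(ordered):
            # Extend j to cover every fingerprint tied with position i.
            j = i
            while j + 1 < len(ordered) and scores[ordered[j + 1]][t] == scores[ordered[i]][t]:
                j += 1
            mean_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
            for name in ordered[i:j + 1]:
                totals[name] += mean_rank
            i = j + 1
    return {name: totals[name] / n_targets for name in names}


# Made-up AUC values for three hypothetical fingerprints on three targets.
toy_scores = {
    "fp_a": [0.90, 0.80, 0.70],
    "fp_b": [0.85, 0.80, 0.75],
    "fp_c": [0.60, 0.65, 0.72],
}
print(average_ranks(toy_scores))
```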
CAS Fingerprint Captures Distinct Sources of Chemical Information
Figure 6. Performance of the CAS fingerprint is uncorrelated with that of other fingerprints, suggesting that the fingerprint is capturing orthogonal chemical signals. The off-diagonal plots show the correlation between the rank ordering of active (orange) and inactive (blue) compounds by an algorithm tested on the CAS fingerprint and on the other fingerprints. The plots on the diagonal show the distribution of the classifier score for active (orange) and inactive (blue) compounds.
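One common way to quantify agreement between two rank orderings is the Spearman rank correlation. The sketch below is an illustration only: the source does not state which correlation measure was used for this figure, and the function names and scores are hypothetical.

```python
import math
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ascending ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in order[i:j + 1]:
            out[k] = mean_rank
        i = j + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical classifier scores for the same four compounds under two fingerprints.
scores_fp1 = [0.9, 0.7, 0.4, 0.2]
scores_fp2 = [0.8, 0.9, 0.1, 0.3]
rho = spearman(scores_fp1, scores_fp2)
print(round(rho, 3))
```

A value near zero would correspond to the uncorrelated behavior described in the caption, while a value near one would mean the two fingerprints order the compounds almost identically.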
Human Experts Identify Rare Chemical Features
| frequency range | count | percentage of total count |
|---|---|---|
| 0.00–1.00% | 4613 | 59% |
| 1.01–2.00% | 771 | 10% |
| 2.01–3.00% | 448 | 6% |
| 3.01–4.00% | 328 | 4% |
| 4.01–5.00% | 211 | 3% |
| 5.01–10.00% | 582 | 7% |
| over 10.00% | 898 | 11% |
The table shows how frequently each of the 7851 features included in the CAS fingerprint occurs across all of the molecules present in the CAS Registry.
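The percentage column follows directly from the counts. As a quick arithmetic check, the sketch below recomputes it from the tabulated counts and also derives the combined share of features that occur in at most 5% of molecules; the variable names are illustrative.

```python
# Counts per frequency range, copied from the table above.
feature_counts = {
    "0.00–1.00%": 4613,
    "1.01–2.00%": 771,
    "2.01–3.00%": 448,
    "3.01–4.00%": 328,
    "4.01–5.00%": 211,
    "5.01–10.00%": 582,
    "over 10.00%": 898,
}

total = sum(feature_counts.values())

# Percentage of all features in each range, rounded to whole percent as in the table.
percentages = {rng: round(100 * n / total) for rng, n in feature_counts.items()}

# Combined share of features in the five ranges up to 5.00%.
at_most_5pct = sum(list(feature_counts.values())[:5])
share_at_most_5pct = at_most_5pct / total

print(total)
print(percentages)
print(f"{share_at_most_5pct:.1%}")
```

The counts sum to the 7851 features of the fingerprint, and roughly four in five of them occur in at most 5% of the molecules.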
Figure 7. Histogram of the number of motifs in the CAS fingerprint that are found in the molecules in the benchmark dataset.
Conclusions
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.0c00193.
Performance of the fingerprints, evaluated by the ROC-AUC and PRC-AUC, across 88 different datasets using random forest, Naïve Bayes, logistic regression, and Tanimoto similarity methods (PDF)
Table of the AUC (ROC and PRC) and SD values for all 88 targets for each of these machine learning methods (XLSX)
Cited By
Abstract

Figure 1

Figure 1. Performance of the fingerprints, evaluated by the ROC-AUC, across 88 different datasets using the random forest (RF) method. The CAS fingerprint is highlighted for clarity.
Figure 2

Figure 2. Performance of CAS fingerprint relative to the best- and worst-performing fingerprint, evaluated by the ROC-AUC, across 88 different datasets using the random forest (RF) method. Figure 1 is replotted to highlight the best-performing fingerprint, the worst-performing fingerprint, and the CAS fingerprint.
Figure 3

Figure 3. Performance of the fingerprints, evaluated by the PRC-AUC, across 88 different datasets using the random forest (RF) method. The CAS fingerprint is highlighted for clarity.
Figure 4

Figure 4. Performance of the CAS fingerprint relative to the best- and worst-performing fingerprint, evaluated by the PRC-AUC, across 88 different datasets using the random forest (RF) method. Figure 3 is replotted to highlight the best-performing fingerprint, the worst-performing fingerprint, and the CAS fingerprint.
Figure 5

Figure 5. Average rank of each fingerprint across 88 targets. Note that a lower average rank (i.e., closer to one) denotes better performance.
Figure 6

Figure 6. Performance of the CAS fingerprint is uncorrelated with other fingerprints, suggesting that the fingerprint is capturing orthogonal chemical signals. The figure shows the correlation between the rank ordering of active (orange) and inactive (blue) by an algorithm tested on the CAS fingerprint and other fingerprints. The plots on the diagonal show the distribution of a classifier score for active (orange) and inactive (blue) compounds.
Figure 7

Figure 7. Histogram of the distribution of the number of motifs in the CAS fingerprint that are found in the molecules in the benchmark dataset.
References
ARTICLE SECTIONSThis article references 18 other publications.
- 1. Christie, B. D.; Leland, B. A.; Nourse, J. G. Structure Searching In Chemical Databases By Direct Lookup Methods. J. Chem. Inf. Model. 1993, 33, 545–547, DOI: 10.1021/ci00014a004
- 2. Morgan, H. L. The Generation Of An Unique Machine Description For Chemical Structures – A Technique Developed At Chemical Abstracts Service. J. Chem. Doc. 1965, 5, 107–113, DOI: 10.1021/c160017a018
- 3. McGregor, M. J.; Pallai, P. V. Clustering Of Large Databases of Compounds: Using The MDL “Keys” as Structural Descriptors. J. Chem. Inf. Comput. Sci. 1997, 37, 443–448, DOI: 10.1021/ci960151e
- 4. Sheridan, R. P.; Kearsley, S. K. Why Do We Need So Many Chemical Similarity Search Methods? Drug Discovery Today 2002, 7, 903–911, DOI: 10.1016/S1359-6446(02)02411-X
- 5. Sastry, M.; Lowrie, J. F.; Dixon, S. L.; Sherman, W. Large-scale Systematic Analysis Of 2D Fingerprint Methods And Parameters to Improve Virtual Screening Enrichments. J. Chem. Inf. Model. 2010, 50, 771–784, DOI: 10.1021/ci100062n
- 6. Rogers, D.; Hahn, M. Extended-connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754, DOI: 10.1021/ci100050t
- 7. Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom Pairs as Molecular Features In Structure-Activity Studies: Definition And Applications. J. Chem. Inf. Model. 1985, 25, 64–73, DOI: 10.1021/ci00046a002
- 8. Nilakantan, R.; Bauman, N.; Dixon, J. S.; Venkataraghavan, R. Topological Torsion: A New Molecular Descriptor For SAR applications. Comparison With Other Descriptors. J. Chem. Inf. Model. 1987, 27, 82–85, DOI: 10.1021/ci00054a008
- 9. Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; Adams, R. P. Convolutional Networks On Graphs For Learning Molecular Fingerprints. In the Proceedings of Advances in Neural Information Processing Systems 28, 2015; pp 2215–2223.
- 10. Yang, K.; Swanson, K.; Jin, W.; Coley, C.; Eiden, P.; Gao, H.; Guzman-Perez, A.; Hopper, T.; Kelley, B.; Mathea, M.; Palmer, A.; Settels, V.; Jaakkola, T.; Jensen, K.; Barzilay, R. Analyzing Learned Molecular Representations For Property Prediction. J. Chem. Inf. Model. 2019, 59, 3370–3388, DOI: 10.1021/acs.jcim.9b00237
- 11. Gedeck, P.; Rohde, B.; Bartels, C. QSAR - How Good Is It In Practice? Comparison Of Descriptor Sets On An Unbiased Cross Section Of Corporate Data Sets. J. Chem. Inf. Model. 2006, 46, 1924–1936, DOI: 10.1021/ci050413p
- 12. Boobier, S.; Osbourn, A.; Mitchell, J. B. Can Human Experts Predict Solubility Better Than Computers? J. Cheminf. 2017, 9, 63, DOI: 10.1186/s13321-017-0250-y
- 13. Riniker, S.; Landrum, G. A. Open-source Platform To Benchmark Fingerprints For Ligand-based Virtual Screening. J. Cheminf. 2013, 5, 26, DOI: 10.1186/1758-2946-5-26
- 14. O’Boyle, N. M.; Sayle, R. A. Comparing structural fingerprints using a literature-based similarity benchmark. J. Cheminf. 2016, 8, 36, DOI: 10.1186/s13321-016-0148-0
- 15. Durant, J. L.; Leland, B. A.; Henry, D. R.; Nourse, J. G. Reoptimization Of MDL Keys For Use In Drug Discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273–1280, DOI: 10.1021/ci010132r
- 16. RDKit: Cheminformatics and Machine Learning Software, 2020. http://www.rdkit.org.
- 17. Robinson, M. C.; Glen, R. C.; Lee, A. A. Validating the Validation: Reanalyzing a Large-scale Comparison of Deep Learning and Machine Learning Models for Bioactivity Prediction. J. Comput.-Aided Mol. Des. 2020, 34, 717–730, DOI: 10.1007/s10822-019-00274-0
- 18. Lee, A. A.; Yang, Q.; Bassyouni, A.; Butler, C. R.; Hou, X.; Jenkinson, S.; Price, D. A. Ligand Biological Activity Predicted By Cleaning Positive And Negative Chemical Correlations. Proc. Natl. Acad. Sci. U.S.A. 2019, 116, 3373–3378, DOI: 10.1073/pnas.1810847116
Supporting Information
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.0c00193.
Performance of the fingerprints, evaluated by the ROC-AUC and PRC-AUC, across 88 different datasets using random forest, Naïve Bayes, logistic regression, and Tanimoto similarity methods (PDF)
Table of the AUC (ROC and PRC) and SD values for all 88 targets for each of these machine learning methods (XLSX)