Binary Classification of a Large Collection of Environmental Chemicals from Estrogen Receptor Assays by Quantitative Structure–Activity Relationship and Machine Learning Methods

ORISE Postdoctoral Fellow and National Center for Computational Toxicology, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711, United States
§ Bioinformatics Research Center, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, United States
*Mailing address: 109 T.W. Alexander Drive, Research Triangle Park, NC 27711, USA. Phone: (919) 541-3085. Fax: (919) 541-1194. E-mail: [email protected]
Cite this: J. Chem. Inf. Model. 2013, 53, 12, 3244–3261
Publication Date (Web): November 26, 2013
Copyright © 2013 American Chemical Society

    Thousands of environmental chemicals are subject to regulatory decisions for endocrine disrupting potential. The ToxCast and Tox21 programs have tested ∼8200 chemicals in a broad panel of in vitro high-throughput screening (HTS) assays for estrogen receptor (ER) agonist and antagonist activity. The present work uses this large data set to develop in silico quantitative structure–activity relationship (QSAR) models using machine learning (ML) methods and a novel approach to manage the imbalanced data distribution. Training compounds from the ToxCast project were categorized as active or inactive (binding or nonbinding) based on a composite ER Interaction Score derived from a collection of 13 ER in vitro assays. A total of 1537 chemicals from ToxCast were used to derive and optimize the binary classification models, while 5073 additional chemicals from the Tox21 project, evaluated in 2 of the 13 in vitro assays, were used to externally validate model performance. To handle the imbalanced distribution of active and inactive chemicals, we developed a cluster-selection strategy to minimize information loss and increase predictive performance, and compared this strategy to three currently popular techniques: cost-sensitive learning, oversampling of the minority class, and undersampling of the majority class. QSAR classification models were built to relate the molecular structures of chemicals to their ER activities using linear discriminant analysis (LDA), classification and regression trees (CART), and support vector machines (SVM), with 51 molecular descriptors from QikProp and 4328 structural fingerprint bits as explanatory variables. A random forest (RF) feature selection method was employed to extract the structural features most relevant to ER activity. The best model was obtained using SVM in combination with a subset of descriptors identified from the large set via the RF algorithm; it recognized the active and inactive compounds with accuracies of 76.1% and 82.8%, respectively, with a total accuracy of 81.6% on the internal test set and 70.8% on the external test set. These results demonstrate that a combination of high-quality experimental data and ML methods can lead to robust models with high predictive accuracy that are potentially useful for facilitating the virtual screening of chemicals for environmental risk assessment.
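
The abstract compares resampling techniques for imbalanced classes and reports separate accuracies for active and inactive compounds. As a rough illustration only, the sketch below shows generic random undersampling of the majority class, random oversampling of the minority class, and how per-class and total accuracy follow from TP, FN, TN, and FP counts. It is not the authors' cluster-selection strategy or their implementation; the function names (`undersample_majority`, `oversample_minority`, `class_accuracies`) and all numbers are hypothetical toy values, not data from the paper.

```python
import random
from collections import Counter


def undersample_majority(labels, seed=0):
    """Keep every minority sample plus an equal-size random subset of the rest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    keep = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    # Draw majority samples without replacement to match the minority count.
    keep += rng.sample(majority_idx, len(keep))
    return sorted(keep)


def oversample_minority(labels, seed=0):
    """Keep every sample and repeat random minority samples until classes balance."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    n_majority = len(labels) - len(minority_idx)
    # Draw the missing minority samples with replacement.
    extra = rng.choices(minority_idx, k=n_majority - len(minority_idx))
    return list(range(len(labels))) + extra


def class_accuracies(tp, fn, tn, fp):
    """Return (active accuracy, inactive accuracy, total accuracy) from counts."""
    active_acc = tp / (tp + fn)      # fraction of actives predicted active
    inactive_acc = tn / (tn + fp)    # fraction of inactives predicted inactive
    total_acc = (tp + tn) / (tp + fn + tn + fp)
    return active_acc, inactive_acc, total_acc


if __name__ == "__main__":
    # Toy data: 20 actives (1) and 80 inactives (0); not from the paper.
    toy_labels = [1] * 20 + [0] * 80
    under = undersample_majority(toy_labels)
    over = oversample_minority(toy_labels)
    print(Counter(toy_labels[i] for i in under))
    print(Counter(toy_labels[i] for i in over))
    # Toy confusion-matrix counts; not from Tables S3-S5.
    print(class_accuracies(tp=15, fn=5, tn=64, fp=16))
```

Both resampling helpers return indices rather than copies of the data, so the same selection can be applied to a descriptor matrix and its label vector. Undersampling discards majority-class samples, which is the kind of information loss the abstract says the cluster-selection strategy aims to minimize.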

    Supporting Information

    Table S1: 51 molecular descriptors and properties generated from QikProp. Table S2: Top 19 bits of structural fingerprints selected from random forest. Tables S3–S5: True positive (TP), false negative (FN), true negative (TN), and false positive (FP) derived from LDA, CART, and SVM models. This material is available free of charge via the Internet at

    Cited By

    This article is cited by 49 publications.

