Evaluation of Mutual Information and Genetic Programming for Feature Selection in QSAR

School of Biological Sciences, University of Exeter, Exeter EX4 4QF, Great Britain and School of Engineering and Computer Science, University of Exeter, Exeter EX4 4QF, Great Britain
Cite this: J. Chem. Inf. Comput. Sci. 2004, 44, 5, 1686–1692
Publication Date (Web): August 11, 2004
Copyright © 2004 American Chemical Society


    Feature selection is a key step in Quantitative Structure-Activity Relationship (QSAR) analysis. Chance correlations and multicollinearity are two major problems often encountered when attempting to find generalized QSAR models for use in drug design. Optimal QSAR models require an objective variable relevance analysis step for producing robust classifiers with low complexity and good predictive accuracy. Genetic algorithms coupled with information theoretic approaches such as mutual information have been used to find near-optimal solutions to such multicriteria optimization problems. In this paper, we describe a novel approach for analyzing QSAR data based on these methods. Our experiments with the Thrombin dataset, previously studied as part of the KDD (Knowledge Discovery and Data Mining) Cup 2001, demonstrate the feasibility of this approach. We found that it is important to take into account the data distribution, the rule “interestingness”, and the need to look at more invariant and monotonic measures of feature selection.
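
As a rough illustration of the information-theoretic side of this idea, the sketch below scores discrete features by their mutual information with a class label, I(X;Y) = Σ p(x,y) log2[p(x,y) / (p(x)p(y))], and ranks them. It is not the authors' procedure: the function names `mutual_information` and `rank_features`, the toy feature names, and the toy data are all hypothetical, and the genetic search described in the abstract is not shown.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in bits, of two discrete sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_features(columns, labels):
    """Return (feature name, MI score) pairs, most informative first."""
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: binary descriptors and a binary activity label.
labels = [1, 1, 0, 0]
columns = {
    "f_noise": [1, 0, 1, 0],  # independent of the label in this sample
    "f_copy": [1, 1, 0, 0],   # identical to the label
}
ranking = rank_features(columns, labels)
```

A univariate ranking like this ignores redundancy between features, which is one reason the multicollinearity problem mentioned above calls for a search over feature subsets rather than a simple cutoff on individual scores.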


     School of Biological Sciences, University of Exeter.


     Corresponding author e-mail:  [email protected].

     School of Engineering and Computer Science, University of Exeter.

