Random Forest Refinement of the KECSA2 Knowledge-Based Scoring Function for Protein Decoy Detection
- Jun Pei, Department of Chemistry, Michigan State University, 578 S. Shaw Lane, East Lansing, Michigan 48824, United States
- Zheng Zheng, Department of Chemistry, Michigan State University, 578 S. Shaw Lane, East Lansing, Michigan 48824, United States
- Kenneth M. Merz, Jr.* (E-mail: [email protected]), Department of Chemistry, Michigan State University, 578 S. Shaw Lane, East Lansing, Michigan 48824, United States; Institute for Cyber Enabled Research, Michigan State University, 567 Wilson Road, East Lansing, Michigan 48824, United States
Abstract

Knowledge-based potentials generally perform better than physics-based scoring functions in detecting the native structure from a collection of decoy protein structures. Through the use of a reference state, the pure interactions between atom/residue pairs can be obtained by removing the contributions from ideal-gas state potentials. However, it is a challenge for conventional knowledge-based potentials to assign different importance factors to different atom/residue pairs. In this work, via the use of the “comparison” concept, Random Forest (RF) models were successfully generated using unbalanced data sets. These models assign different importance factors to atom pair potentials to enhance their ability to identify native proteins from decoy proteins. Individual and combined data sets consisting of 12 decoy sets were used to test the performance of the RF models. We find that RF models increase the recognition of native structures without affecting their ability to identify the best decoy structures. We also created models using scrambled atom types, which produce physically unrealistic probability functions, in order to test the ability of the RF algorithm to create useful models from scrambled probability functions. From this test, we find that we are unable to create models of similar quality to those built from the unscrambled probability functions. Next, we created uniform probability functions where the peak positions are the same as the original, but each interaction has the same peak height. Using these uniform potentials, we were able to recover models as good as the ones using the full potentials, suggesting that only the experimental peak positions matter in these models. The KECSA2 potential, along with all codes used in this work, is available at https://github.com/JunPei000/protein_folding-decoy-set.
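The abstract does not spell out how the “comparison” concept turns an unbalanced native-versus-decoy set into training data. One plausible reading, sketched below, is that each native/decoy pair contributes two difference vectors of per-atom-pair-type scores, one in each direction, so that the two classes are equally populated. Everything here is an assumption for illustration: the function names, the toy atom-pair types and scores, and the choice of which direction receives label 1 are hypothetical and are not taken from the KECSA2 code.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: atom-pair type -> summed pair score for one structure.
Features = Dict[str, float]


def difference(a: Features, b: Features, pair_types: List[str]) -> List[float]:
    """Per-pair-type score difference a - b, in a fixed column order."""
    return [a.get(t, 0.0) - b.get(t, 0.0) for t in pair_types]


def comparison_dataset(
    native: Features, decoys: List[Features], pair_types: List[str]
) -> Tuple[List[List[float]], List[int]]:
    """Build two rows per decoy: (decoy - native) and (native - decoy).

    Assumed labeling: 1 for decoy-minus-native, 0 for native-minus-decoy.
    Each decoy yields one row of each class, so the classes are balanced
    regardless of how many decoys exist per native structure.
    """
    rows: List[List[float]] = []
    labels: List[int] = []
    for decoy in decoys:
        rows.append(difference(decoy, native, pair_types))
        labels.append(1)
        rows.append(difference(native, decoy, pair_types))
        labels.append(0)
    return rows, labels


# Toy, made-up scores for one native structure and two decoys.
pair_types = ["C-C", "C-N", "N-O"]
native = {"C-C": -12.0, "C-N": -7.5, "N-O": -3.0}
decoys = [
    {"C-C": -10.0, "C-N": -6.0, "N-O": -1.0},
    {"C-C": -11.0, "C-N": -7.0},  # a missing pair type is treated as 0.0
]

rows, labels = comparison_dataset(native, decoys, pair_types)
for row, label in zip(rows, labels):
    print(label, row)
```

Rows built this way could then be passed to any random forest implementation (for example, scikit-learn's `RandomForestClassifier`), whose per-feature importances would play the role of the importance factors assigned to atom pair potentials.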