Natural Language Processing Methods for the Study of Protein–Ligand Interactions
- James Michels, Department of Computer and Information Science, University of Mississippi, University, Mississippi 38677, United States
- Ramya Bandarupalli, Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, Mississippi 38677, United States
- Amin Ahangar Akbari, Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, Mississippi 38677, United States
- Thai Le, Department of Computer Science, Indiana University, Bloomington, Indiana 47408, United States
- Hong Xiao* (E-mail: [email protected]), Department of Computer and Information Science and Institute for Data Science, University of Mississippi, University, Mississippi 38677, United States
- Jing Li* (E-mail: [email protected]), Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, Mississippi 38677, United States
- Erik F. Y. Hom* (E-mail: [email protected]), Department of Biology and Center for Biodiversity and Conservation Research, University of Mississippi, University, Mississippi 38677, United States
Abstract
Natural Language Processing (NLP) has revolutionized the way computers are used to study and interact with human languages and is increasingly influential in the study of protein and ligand binding, which is critical for drug discovery and development. This review examines how NLP techniques have been adapted to decode the “language” of proteins and small molecule ligands to predict protein–ligand interactions (PLIs). We discuss how methods such as long short-term memory (LSTM) networks, transformers, and attention mechanisms can leverage different protein and ligand data types to identify potential interaction patterns. Significant challenges are highlighted, including the scarcity of high-quality negative data, difficulties in interpreting model decisions, and sampling biases in existing data sets. We argue that focusing on improving data quality, enhancing model robustness, and fostering both collaboration and competition could catalyze future advances in machine-learning-based predictions of PLIs.
This publication is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND) license.
License Summary*
You are free to share (copy and redistribute) this article in any medium or format within the parameters below:
Creative Commons (CC): This is a Creative Commons license.
Attribution (BY): Credit must be given to the creator.
Non-Commercial (NC): Only non-commercial uses of the work are permitted.
No Derivatives (ND): Derivative works may be created for non-commercial purposes, but sharing is prohibited.
*Disclaimer
This summary highlights only some of the key features and terms of the actual license. It is not a license and has no legal value. Carefully review the actual license before using these materials.
1. Introduction
2. The Languages of Life
Figure 1. Language of protein sequences and the ligand SMILES representation: NLP methods can be applied to text representations to infer local and global properties of human language, proteins, and molecules alike. Local properties are inferred from subsequences in text: (left) for human language, this includes a part of speech or role a word serves; (middle) for protein sequences, this includes motifs, functional sites, and domains; and (right) for SMILES strings, this can include functional groups and special characters used in SMILES syntax to indicate chemical attributes. Similarly, global properties can theoretically be inferred from a text in its entirety.
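To make the text analogy concrete, the short Python sketch below shows two common ways such strings are broken into “words”: overlapping k-mers for a protein sequence and a regular-expression tokenizer for a SMILES string. The example sequence, the aspirin SMILES, and the simplified regex are illustrative choices, not the tokenizers of any particular model discussed in this review.

```python
import re

def protein_kmers(sequence: str, k: int = 3) -> list[str]:
    """Split a protein sequence into overlapping k-mer 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# A simplified SMILES tokenizer: bracketed atoms and two-letter symbols first, then single characters.
SMILES_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|Si|@@|[A-Za-z]|\d|[=#\-\+\(\)\\/%.]")

def smiles_tokens(smiles: str) -> list[str]:
    """Split a SMILES string into tokens (atoms, bonds, branches, ring closures)."""
    return SMILES_PATTERN.findall(smiles)

if __name__ == "__main__":
    protein = "MKTAYIAKQR"             # toy protein fragment (one-letter amino acid codes)
    ligand = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin, as a SMILES string
    print(protein_kmers(protein))      # ['MKT', 'KTA', 'TAY', ...]
    print(smiles_tokens(ligand))       # ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', ...]
```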
2.1. The “Language” of Proteins
2.2. The “Language” of Ligands
3. Protein–Ligand Interaction Data and Data Sets
Data Set Name | Year | Proteins | Ligands | Interactions | Protein Category | Ligand Category | Task
---|---|---|---|---|---|---|---
Functional Data Available | | | | | | |
Protein Data Bank (PDB) (92) | 2000 | 220,777 | – | – | General (Structure) | General (Structure) | C
BRENDA (103) | 2002 | 8,423 | 38,623 | – | Enzymes | General | R, C
PDBBind^b (96) | 2004 | – | – | 23,496 | General (Structure) | General | R, C
DrugBank^b (91) | 2006 | 4,944 | 16,568 | 19,441 | Human Proteome | General | C
BindingDB (104) | 2007 | 2,294 | 505,009 | 1,059,214 | General | General | R, C
PubChem (73,92) | 2009 | 248,623 | 119,108,078 | 250,633 | General | General | R, C
Davis (94) | 2011 | 442 | 68 | 30,056 | Kinases (Sequence) | Kinase Inhibitors (SMILES) | R
PSCDB (105) | 2011 | – | – | 839 | Human Proteome | General | R, C
ChEMBL (90) | 2012 | 15,398 | 2,399,743 | 20,334,684 | General (Protein ID) | General (SMILES) | R, C
DUD-E (106) | 2012 | 102 | 22,886 | 2,334,372 | General | General | R, C
Iridium Database^b (107) | 2012 | – | – | 233 | General | General | R, C
KIBA (95) | 2014 | 467 | 52,498 | 246,088 | Kinases (Protein ID) | Kinase Inhibitors (SMILES) | R
Natural Ligand Database (NLDB)^b (108) | 2016 | 3,248 | – | 189,642 | Enzymes (Structure) | General | R, C
PDID (109) | 2016 | 3,746 | 51 | 1,088,789 | Human Proteome | General | R, C
dbHDPLS^b (110) | 2019 | – | – | 8,833 | General (Structure) | General | C
CovPDB^b (111) | 2022 | 733 | 1,501 | 2,294 | General (Structure) | General | C
PSnpBind^b (112) | 2022 | 731 | 32,261 | 640,074 | General | General | R, C
Protein Binding Atlas Portal^b (112) | 2023 | 1,716 | 30,360 | 129,333 | Drug Targets | Drug Molecules | R, C
Protein–Ligand Binding Database (PLDB)^b (113) | 2023 | 12 | 556 | 1,831 | Carbonic Anhydrases, Heat Shock Proteins | General | R
BioLiP2 (114) | 2023 | 426,209 | – | 823,510 | General (Structure) | General | R, C
PLAS-20k^b (115) | 2024 | – | – | 20,000 | Enzymes | General | R, C
Functional Data Unavailable | | | | | | |
Database of Interacting Proteins (116) | 2004 | 28,850 | – | 81,923 | Various Species | – | C
Protein Small-Molecule Classification Database^b (117) | 2009 | 4,916 | 8,690 | – | General (Structure) | General (Structure) | C
CavitySpace^b (118) | 2022 | 23,391 | – | 23,391 | General (Structure) | General | C
Note: Data sets categorized as “General” provide broad information without focusing on specific categories of proteins or ligands. Data types (e.g., sequence, structure) are denoted in parentheses. Categories labeled with “Protein ID” include protein IDs from established databases. Data sets may receive periodic updates. Suggested tasks are denoted as “R” for regression and “C” for classification. “–” indicates that exact information is either not included in the source or is not readily obtainable.
^b Protein–ligand complexes are available with the data set.
4. Machine Learning and NLP for PLIs
Figure 2. Summary of the data preparation, model creation, and model evaluation workflow. Model Creation for PLI studies follows an Extract-Fuse-Predict Framework: input protein and ligand data are extracted and embedded, combined, and passed into a machine learning model to generate predictions.
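As a minimal illustration of this Extract-Fuse-Predict pattern, the PyTorch sketch below embeds tokenized protein and ligand sequences with two independent extractors, fuses them by concatenation, and predicts a binding-affinity value with a fully-connected head. All layer sizes and the choice of simple mean-pooled embeddings are illustrative assumptions, not the architecture of any specific model reviewed here.

```python
import torch
import torch.nn as nn

class ExtractFusePredict(nn.Module):
    """Toy PLI model: separate extractors, concatenation fusion, FCN prediction head."""

    def __init__(self, protein_vocab=26, ligand_vocab=64, embed_dim=64, hidden_dim=128):
        super().__init__()
        # Extract: independent embedding tables (stand-ins for LSTM/transformer extractors).
        self.protein_embed = nn.Embedding(protein_vocab, embed_dim, padding_idx=0)
        self.ligand_embed = nn.Embedding(ligand_vocab, embed_dim, padding_idx=0)
        # Predict: fully-connected network on the fused representation.
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # regression output, e.g. a binding affinity
        )

    def forward(self, protein_tokens, ligand_tokens):
        # Extract: mean-pool token embeddings into fixed-size vectors.
        p = self.protein_embed(protein_tokens).mean(dim=1)   # (batch, embed_dim)
        l = self.ligand_embed(ligand_tokens).mean(dim=1)     # (batch, embed_dim)
        # Fuse: simple concatenation of protein and ligand representations.
        fused = torch.cat([p, l], dim=-1)                    # (batch, 2 * embed_dim)
        # Predict.
        return self.head(fused).squeeze(-1)                  # (batch,)

if __name__ == "__main__":
    model = ExtractFusePredict()
    proteins = torch.randint(1, 26, (4, 100))  # 4 integer-encoded protein sequences
    ligands = torch.randint(1, 64, (4, 40))    # 4 integer-encoded SMILES strings
    print(model(proteins, ligands).shape)      # torch.Size([4])
```

In practice, the embedding tables would be replaced by the LSTM, transformer, or graph extractors listed in Tables 2–5, and the concatenation could be swapped for cross-attention or an interaction matrix.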
4.1. The Extract-Fuse-Predict Framework
Model Name | Protein Extractor | Ligand Extractor | Fusion | Prediction
---|---|---|---|---
LSTM | | | |
Affinity2Vec (140) | ProtVec | Seq2Seq | Heterogeneous Network | Gradient-Boosting Trees (R)
DeepLPI (141) | ResNet | ResNet | Concatenation with LSTM | FCN (C, R)
FusionDTA (142) | BiLSTM | BiLSTM | Concatenation with Linear Attention | FCN (R)
Transformer | | | |
Shin et al. (181) | CNN | Transformer | Concatenation | FCN (R)
MolTrans (182) | Transformer | Transformer | Interaction Matrix^b with CNN | FCN (C)
ELECTRA-DTA (180) | CNN with Squeeze-and-Excite Mechanism | CNN with Squeeze-and-Excite Mechanism | Concatenation | FCN (R)
MGPLI (184) | Transformer, CNN | Transformer, CNN | Concatenation | FCN (C)
SVSBI (183) | Transformer, LSTM, and AutoEncoder | Transformer, LSTM, and AutoEncoder | k-embedding fusion^c | FCN, Gradient-Boosting Trees^d (R)
Non-Transformer Attention | | | |
DeepCDA (121) | CNN with LSTM | CNN with LSTM | Two-Sided Attention^d | FCN (R)
HyperAttentionDTI (151) | CNN | CNN | Cross-Attention, Concatenation | FCN (C)
ICAN (150) | Various | Various | Cross-Attention, Concatenation | 1D CNN (C)
Other NLP Methods | | | |
GANsDTA (202) | GAN Discriminator | GAN Discriminator | Concatenation | 1D CNN (R)
Multi-PLI (203) | CNN | CNN | Concatenation | FCN (C, R)
ChemBoost (124) | Various | SMILESVec | Concatenation | Gradient-Boosting Trees (R)
Note: A model’s task of Classification (C) and/or Regression (R) is denoted beside the “Prediction” column entries in parentheses. Definitions for specific terms may be found in the Glossary (Table 6). Terms Defined by the Cited Authors:
^b Interaction Matrix: Output from dot product operations to measure interactions between protein subsequence and ligand substructure pairs.
^c k-embedding fusion: The use of machine learning to find an optimal combination of lower-order embeddings via different integrating operations.
^d Two-sided Attention: Attention mechanism that computes scores using the products of both pairs of protein/ligand fragments and protein/ligand feature vectors.
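Footnote b above describes an interaction map built from dot products between protein subsequence and ligand substructure embeddings. The NumPy sketch below illustrates that idea, with random embeddings standing in for learned ones; all dimensions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings: 200 protein subsequences and 30 ligand substructures, 64-dim each.
protein_sub = rng.normal(size=(200, 64))
ligand_sub = rng.normal(size=(30, 64))

# Interaction matrix: one dot-product score per (protein subsequence, ligand substructure) pair.
interaction = protein_sub @ ligand_sub.T          # shape (200, 30)

# A downstream CNN or pooling step would consume this map; here we just inspect it.
print(interaction.shape)                          # (200, 30)
print(interaction.max(), interaction.argmax())    # strongest-scoring pairing
```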
Model Name | Protein Extractor | Ligand Extractor | Fusion | Prediction
---|---|---|---|---
Transformer | | | |
UniMol (122) | Transformer-Based Encoder | Transformer-Based Encoder | Concatenation | Transformer-Based Decoder (R)
Other Attention | | | |
Lim et al. (160) | GNN | GNN | Attention | FCN (C)
Jiang et al. (152) | GCN | GCN | Concatenation | FCN (R)
GEFA (153) | GCN | GCN | Concatenation | FCN (R)
Knutson et al. (155) | GAT | GAT | Concatenation | FCN (C, R)
AttentionSiteDTI (158) | GCN with Attention | GCN with Attention | Concatenation, Self-Attention | FCN (C, R)
HAC-Net (156) | GCN with Attention Aggregation | GCN with Attention | Combined Graph Representation | FCN (R)
BindingSite-AugmentedDTI (157) | GCN with Attention | GCN with Attention | Concatenation, Self-Attention | Various (R)
PBCNet (154) | GCN | Message-Passing NN | Attention | FCN (R)
Note: A model’s task of Classification (C) and/or Regression (R) is denoted beside the “Prediction” column entries in parentheses. Definitions for specific terms may be found in the Glossary (Table 6).
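Many of the extractors in this table are graph convolutional networks (GCNs) operating on molecular or residue graphs. As a point of reference, the NumPy sketch below applies one layer of the standard GCN propagation rule, H_next = ReLU(D^(-1/2) (A + I) D^(-1/2) H W), to a toy four-atom graph; the graph, feature sizes, and random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy molecular graph with 4 atoms (nodes) and its adjacency matrix.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = rng.normal(size=(4, 8))          # initial 8-dim node features
W = rng.normal(size=(8, 16))         # layer weights (random stand-in for learned values)

# GCN layer: add self-loops, symmetrically normalize, aggregate neighbors, transform.
A_hat = A + np.eye(4)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
H_next = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)  # ReLU activation

print(H_next.shape)  # (4, 16): updated node embeddings after one round of aggregation
```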
Model Name | Input Type | Protein Extractor | Ligand Extractor | Fusion | Prediction
---|---|---|---|---|---
LSTM | | | | |
Zheng et al. (204) | P: Struct. L: Seq | Dynamic CNN^b with Attention | BiLSTM with Attention | Concatenation | FCN (C)
DeepGLSTM (85) | P: Seq L: Struct. | BiLSTM with FCN | GCN | Concatenation | FCN (R)
Transformer | | | | |
TransformerCPI (86) | P: Seq L: Struct. | Transformer Encoder | GCN | Transformer Decoder | FCN (C)
DeepPurpose (201) | P: Seq L: Either | 4 Various Encoders | 5 Various Encoders | Concatenation | FCN (C, R)
CAT-CPI (185) | P: Seq L: Image | Transformer Encoder | Transformer Encoder | Concatenation | CNN and FCN (C)
Non-Transformer Attention | | | | |
Tsubaki et al. (205) | P: Seq L: Struct. | CNN | GNN | Attention and Concatenation | FCN (C)
DeepAffinity (206) | P: Seq L: Struct. | RNN-CNN with Attention | RNN-CNN with Attention | Concatenation | FCN (R)
MONN (207) | P: Seq L: Struct. | CNN | GCN | Pairwise Interaction Matrix,^c Attention | Linear Regression (C, R)
GraphDTA (197) | P: Seq L: Struct. | CNN | 4 GNN Variants | Concatenation | FCN (R)
CPGL (208) | P: Seq L: Struct. | LSTM | GAT with Attention | Two-Sided Attention,^d Concatenation | Logistic Regression (C)
CAPLA (161) | P: Both L: Struct. | Dilated Convolutional Block | Dilated Convolutional Block with Cross-Attention to Binding Pocket | Cross-Attention, Concatenation | FCN (R)
Note: A model’s task of Classification (C) and/or Regression (R) is denoted beside the “Prediction” column entries in parentheses. Definitions for specific terms may be found in the Glossary (Table 6). The input representations for sequence and structure are abbreviated for brevity. Terms Defined by the Cited Authors:
^b Dynamic CNN: ResNet-based CNN modified to handle inputs of variable lengths by padding the sides of the input with zeroes.
^c Pairwise Interaction Matrix: A [number of atoms]-by-[number of residues] matrix in which each element is a binary value indicating if the corresponding atom-residue pair has an interaction. (207)
^d Two-sided Attention: Attention mechanism that uses dot product operations between protein AA and ligand atom pairs, while taking matrices of learned weights into account.
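Footnote d above describes a “two-sided” attention that scores protein residue-ligand atom pairs with dot products modulated by learned weight matrices. The NumPy sketch below computes such a bilinear score map and normalizes it into attention weights; the single weight matrix and all dimensions are illustrative simplifications of the cited formulations.

```python
import numpy as np

rng = np.random.default_rng(1)

residues = rng.normal(size=(150, 64))   # protein residue embeddings
atoms = rng.normal(size=(25, 32))       # ligand atom embeddings
W = rng.normal(size=(64, 32)) * 0.1     # learned bilinear weights (random stand-in)

# Bilinear score for every residue-atom pair, then a softmax over atoms for each residue.
scores = residues @ W @ atoms.T                             # shape (150, 25)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

print(attn.shape)          # (150, 25): each row sums to 1
print(attn[0].argmax())    # ligand atom most attended to by residue 0
```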
Model Name | Protein Extractor | Ligand Extractor | Additional Features Used | Fusion | Prediction
---|---|---|---|---|---
LSTM | | | | |
HGDTI (209) | BiLSTM | BiLSTM | Disease and Side Effect Information | Concatenation | FCN (C)
ResBiGAAT (87) | Bidirectional GRU with Attention | Bidirectional GRU with Attention | Global Protein Features | Concatenation | FCN (R)
Transformer | | | | |
Gaspar et al. (125) | Transformer or LSTM | ECFC4 Fingerprints | Multiple Sequence Alignment Information | Concatenation | Random Forest (C)
HoTS (210) | CNN | FCN | Binding Region | Transformer Block | FCN (C, R)
PLA-MoRe (88) | Transformer | GIN and AutoEncoder | Bioactive Properties | Concatenation | FCN (R)
AlphaFold 3 (89) | Attention-Based Encoder^b | Attention-Based Encoder^b | Post-Translational Modifications, Multiple Sequence Alignment Information | Attention | Diffusion Transformer^c
Other NLP Methods | | | | |
MultiDTI (123) | CNN with FCN | CNN with FCN | Disease and Side Effect Information | Heterogeneous Network | FCN (C)
Note: A model’s task of Classification (C) and/or Regression (R) is denoted beside the “Prediction” column entries in parentheses. Definitions for specific terms may be found in the Glossary (Table 6). Terms Defined by the Cited Authors:
^b Atom Attention Encoder: An attention-based encoder that uses cross-attention to capture local atom features.
^c Diffusion Transformer: A transformer-based model that aims to remove noise from predicted atomic coordinates until a suitable final structure is output.
Term | Definition
---|---
AutoEncoder | A neural network tasked with compressing and reconstructing input data, often used for feature learning. (262) |
BiLSTM | Bidirectional Long Short-Term Memory, a variant of LSTM where two passes are made over the input sequence, one reading in forward order, and one in reverse order. |
CNN | Convolutional Neural Network, a type of neural network that processes grid-like data, such as images, through a gradually-optimized filter that slides across input data to discern important features. |
Dilated Convolutional Block | Convolutional Neural Network operations with defined gaps between kernels, which can capture larger receptive fields with fewer parameters. |
ECFC4 Fingerprint | A molecular fingerprint that encodes information about the presence of specific substructures within a diameter of 4 bonds from each atom. (263) (See the fingerprint sketch following this table.)
FCN | Fully-Connected Network, a feedforward neural network where each neuron in one layer connects to every neuron in the next layer. FCNs can also be referred to as Multi-Layer Perceptrons.
GAN Discriminator | An NN part of Generative Adversarial Networks (GAN) that learns important features to distinguish between real and artificial data. |
GAT | Graph Attention Network, a type of Graph Neural Network that uses attention mechanisms to decide how much each neighboring node contributes when updating a given node’s information. (264)
GCN | Graph Convolutional Network, a type of Graph Neural Network that aggregates neighboring node features through a first-order approximation on a local filter of the graph. (265) |
GIN | Graph Isomorphism Network, a type of Graph Neural Network that uses a series of functions to ensure embeddings are the same no matter what order nodes are presented in. (266) |
Gradient-Boosting Trees | A machine learning technique where many decision trees are trained sequentially, such that each new tree learns to correct the errors of the trees before it. The predictions of all trees are then combined to produce the final output for each input.
GRU | Gated Recurrent Unit, a simplified variant of Long Short-Term Memory that uses a gating mechanism to retain and forget information but with fewer parameters and less computation. (137)
Heterogeneous Network | A graph where nodes and edges represent different types of information, often used to convey complex relationships in biological systems (e.g., drug, target, side-effect, etc.). |
Message-Passing NN | Type of Graph Neural Network that computes individual messages to be passed between nodes so that representations for each node contain information from its neighbors. (267) |
ProtVec | A method for representing protein sequences as dense vectors using skip-gram neural networks. (268) |
Random Forest | A machine learning method where many decision trees are constructed, and the result of the ensemble is the mode of the individual tree predictions. |
ResNet | Short for Residual Network. A neural network architecture that adds skip (shortcut) connections so that layers learn residual functions relative to their inputs, allowing very deep networks to be trained more easily. (269)
Seq2Seq | A machine learning method used for language translation in NLP, featuring an encoder-decoder structure. (266) |
SMILESVec | Prior work by the same authors: 8-character ligand SMILES fragments are each assigned a vector through a single-layer neural network, and an input SMILES string’s vector is the mean of the vectors of the fragments present in that string. (270)
Squeeze-And-Excite Mechanism | Mechanism for Convolutional Neural Networks that uses global information to adapt the model to emphasize more important features. (271) |
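For readers who want to reproduce a circular fingerprint such as the ECFC4 entry above, the sketch below uses RDKit’s Morgan fingerprint at radius 2 (substructures up to a diameter of 4 bonds). The bit-vector (ECFP-style) variant, the 2048-bit length, and the aspirin example are illustrative choices, and RDKit is assumed to be installed.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Aspirin as an example ligand.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# Morgan fingerprint with radius 2 (i.e., substructures up to a diameter of 4 bonds),
# hashed into a 2048-bit vector; a count-based variant would correspond to ECFC rather than ECFP.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

print(fp.GetNumOnBits())          # number of substructure bits set for this molecule
print(list(fp.GetOnBits())[:10])  # indices of the first few set bits
```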
4.2. Extraction of Embeddings
Figure 3. Framework diagrams for RNN (and its variant LSTM), transformer, and attention, with arrows representing the flow of information. (A) The "unrolled" structure of an RNN and its recurrent units, where hidden states propagate across time steps. The recurrent unit takes the current token X_t as input, combines it with the current hidden state h_t, and computes their weighted sum before generating the response O_t and an updated hidden state h_(t+1). Weighted sums depend upon the associated network weights W_xh, W_hh, or W_oh, which connect input to hidden state, hidden state to hidden state, and hidden state to output, respectively. LSTM differs in that a memory state is updated during each iteration, facilitating long-term dependency learning. (B) A simplified framework of a transformer's encoder-decoder architecture and its associated attention mechanism. A scaled product of the Query and Key vectors yields attention weights that can provide interpretability, and the new embedding (output) vector is updated based on this specific key.
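The recurrence and attention operations sketched in this figure can be written in a few lines of NumPy. The snippet below implements one recurrent update with weights W_xh, W_hh, and W_oh as in panel A, and a single scaled dot-product attention step as in panel B; all dimensions and the tanh/softmax choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, d_k = 16, 32, 8, 24

# (A) One step of a vanilla RNN: combine the current token x_t with the hidden state h_t.
W_xh = rng.normal(size=(d_in, d_hid)) * 0.1
W_hh = rng.normal(size=(d_hid, d_hid)) * 0.1
W_oh = rng.normal(size=(d_hid, d_out)) * 0.1

x_t = rng.normal(size=d_in)
h_t = np.zeros(d_hid)
h_next = np.tanh(x_t @ W_xh + h_t @ W_hh)   # updated hidden state h_(t+1)
o_t = h_next @ W_oh                         # output/response O_t

# (B) Scaled dot-product attention: queries attend over keys to mix the value vectors.
Q = rng.normal(size=(5, d_k))    # 5 query tokens
K = rng.normal(size=(7, d_k))    # 7 key tokens
V = rng.normal(size=(7, d_out))  # values aligned with the keys

scores = Q @ K.T / np.sqrt(d_k)                              # scaled similarity
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)                # softmax attention weights
context = weights @ V                                        # new (output) embeddings

print(h_next.shape, o_t.shape, weights.shape, context.shape)  # (32,) (8,) (5, 7) (5, 8)
```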
4.2.1. Recurrent Neural Networks
4.2.2. Attention-Based Architectures
Figure 4. Sample attention weights relating protein and ligand. The heatmaps on the left visualize the weighted importance of select protein residues and ligand atoms in a PLI. Structural views of the protein–ligand binding pocket are shown in the middle, with insets of the 2D ligand structures on the right. The colored residues and red highlights indicate AAs in the protein binding pocket and ligand atoms with high attention scores. Reproduced from Figure 7 of Wu et al. (148) under license CC BY 4.0. Copyright 2023 The Author(s). Published by Elsevier Ltd.
4.2.3. Transformers
4.3. Fusion of Protein–Ligand Representations: Concatenation or Cross-Attention
4.4. Prediction of Target Variables
4.5. Evaluation
5. Challenges and Future Directions
5.1. Lack of “True Negatives”
5.2. Diversity Bias in PLI Data Sets
5.3. Interpretable and Generalizable Design in PLI Predictions
5.4. The Insufficiency of an NLP-Only Approach for PLI Studies?
6. Conclusion
Data Availability
No data or software was generated for this review.
Acknowledgments
This work was supported in part by NIGMS/NIH Institutional Development Award (IDeA) #P20GM130460 to J.L., NSF award #1846376 to E.F.Y.H., and University of Mississippi Data Science/AI Research Seed Grant award #SB3002 IDS RSG-03 to J.M., J.L., T.L., and E.F.Y.H.
References
This article references 271 other publications.
- 1Songyang, Z.; Cantley, L. C. Recognition and specificity in protein tyrosine kinase-mediated signalling. Trends Biochem. Sci. 1995, 20, 470– 475, DOI: 10.1016/S0968-0004(00)89103-3Google Scholar1Recognition and specificity in protein tyrosine kinase-mediated signalingSongyang, Zhou; Cantley, Lewis C.Trends in Biochemical Sciences (1995), 20 (11), 470-5CODEN: TBSCDB; ISSN:0968-0004. (Elsevier Trends Journals)A review, with 46 refs. There are several factors that contribute to the specificities of protein tyrosine kinases (PTKs) in signal transduction pathways. While protein-protein interaction domains, such as the Src homol. (SH2 and SH3) domains, regulate the cellular localization of PTKs and their substrates, the specificities of PTKs are ultimately detd. by their catalytic domains. The use of peptide libraries has revealed the substrate specificities of SH2 domains and PTK catalytic domains, and has suggested cross-talk between these domains.
- 2Johnson, L. N.; Lowe, E. D.; Noble, M. E.; Owen, D. J. The Eleventh Datta Lecture. The structural basis for substrate recognition and control by protein kinases. FEBS Lett. 1998, 430, 1– 11, DOI: 10.1016/S0014-5793(98)00606-1Google Scholar2The structural basis for substrate recognition and control by protein kinasesJohnson, Louise N.; Lowe, Edward D.; Noble, Martin E. M.; Owen, David J.FEBS Letters (1998), 430 (1,2), 1-11CODEN: FEBLAL; ISSN:0014-5793. (Elsevier Science B.V.)A review with 49 refs. Protein kinases catalyze phospho transfer reactions from ATP to serine, threonine or tyrosine residues in target substrates and provide key mechanisms for control of cellular signaling processes. The crystal structures of 12 protein kinases are now known. These include structures of kinases in the active state in ternary complexes with ATP (or analogs) and inhibitor or peptide substrates (e.g. cAMP dependent protein kinase, phosphorylase kinase and insulin receptor tyrosine kinase); kinases in both active and inactive states (e.g., CDK2/cyclin A, insulin receptor tyrosine kinase and MAPK); kinases in the active state (e.g. casein kinase 1, Lck); and kinases in inactive states (e.g. twitchin kinase, calcium calmodulin kinase 1, FGF receptor kinase, c-Src and Hck). This paper summarizes the detailed information obtained with active phosphorylase kinase ternary complex and reviews the results with ref. to other kinase structures for insights into mechanisms for substrate recognition and control.
- 3Kristiansen, K. Molecular mechanisms of ligand binding, signaling, and regulation within the superfamily of G-protein-coupled receptors: molecular modeling and mutagenesis approaches to receptor structure and function. Pharmacol. Ther. 2004, 103, 21– 80, DOI: 10.1016/j.pharmthera.2004.05.002Google Scholar3Molecular mechanisms of ligand binding, signaling, and regulation within the superfamily of G-protein-coupled receptors: molecular modeling and mutagenesis approaches to receptor structure and functionKristiansen, KurtPharmacology & Therapeutics (2004), 103 (1), 21-80CODEN: PHTHDT; ISSN:0163-7258. (Elsevier Science B.V.)A review. The superfamily of G-protein-coupled receptors (GPCRs) could be subclassified into 7 families (A, B, large N-terminal family B-7 transmembrane helix, C, Frizzled/Smoothened, taste 2, and vomeronasal 1 receptors) among mammalian species. Cloning and functional studies of GPCRs have revealed that the superfamily of GPCRs comprises receptors for chem. diverse native ligands including endogenous compds. like amines, peptides, and Wnt proteins (i.e., secreted proteins activating Frizzled receptors); endogenous cell surface adhesion mols.; and photons and exogenous compds. like odorants. The combined use of site-directed mutagenesis and mol. modeling approaches have provided detailed insight into mol. mechanisms of ligand binding, receptor folding, receptor activation, G-protein coupling, and regulation of GPCRs. The vast majority of family A, B, C, vomeronasal 1, and taste 2 receptors are able to transduce signals into cells through G-protein coupling. However, G-protein-independent signaling mechanisms have also been reported for many GPCRs. Specific interaction motifs in the intracellular parts of these receptors allow them to interact with scaffold proteins. Protein engineering techniques have provided information on mol. mechanisms of GPCR-accessory protein, GPCR-GPCR, and GPCR-scaffold protein interactions. Site-directed mutagenesis and mol. dynamics simulations have revealed that the inactive state conformations are stabilized by specific interhelical and intrahelical salt bridge interactions and hydrophobic-type interactions. Constitutively activating mutations or agonist binding disrupts such constraining interactions leading to receptor conformations that assocs. with and activate G-proteins.
- 4West, I. C. What determines the substrate specificity of the multi-drug-resistance pump?. Trends Biochem. Sci. 1990, 15, 42– 46, DOI: 10.1016/0968-0004(90)90171-7Google ScholarThere is no corresponding record for this reference.
- 5Vivier, E.; Malissen, B. Innate and adaptive immunity: specificities and signaling hierarchies revisited. Nat. Immunol. 2005, 6, 17– 21, DOI: 10.1038/ni1153Google Scholar5Innate and adaptive immunity: specificities and signaling hierarchies revisitedVivier, Eric; Malissen, BernardNature Immunology (2005), 6 (1), 17-21CODEN: NIAMCZ; ISSN:1529-2908. (Nature Publishing Group)A review. The conventional classification of known immune responses by specificity may need re-evaluation. The immune system can be classified into two subsystems: the innate and adaptive immune systems. In general, innate immunity is considered a nonspecific response, whereas the adaptive immune system is thought of as being very specific. In addn., the antigen receptors of the adaptive immune response are commonly viewed as 'master sensors' whose engagement dictates lymphocyte function. Here the authors propose that these ideas do not genuinely reflect the organization of immune responses and that they bias the authors' view of immunity as well as the authors' teaching of immunol. Indeed, the level of specificity and mode of signaling integration used by the main cellular participants in the adaptive and innate immune systems are more similar than previously appreciated.
- 6Desvergne, B.; Michalik, L.; Wahli, W. Transcriptional regulation of metabolism. Physiol. Rev. 2006, 86, 465– 514, DOI: 10.1152/physrev.00025.2005Google Scholar6Transcriptional regulation of metabolismDesvergne, Beatrice; Michalik, Liliane; Wahli, WalterPhysiological Reviews (2006), 86 (2), 465-514CODEN: PHREA7; ISSN:0031-9333. (American Physiological Society)A review. Our understanding of metab. is undergoing a dramatic shift. Indeed, the efforts made towards elucidating the mechanisms controlling the major regulatory pathways are now being rewarded. At the mol. level, the crucial role of transcription factors is particularly well-illustrated by the link between alterations of their functions and the occurrence of major metabolic diseases. In addn., the possibility of manipulating the ligand-dependent activity of some of these transcription factors makes them attractive as therapeutic targets. The aim of this review is to summarize recent knowledge on the transcriptional control of metabolic homeostasis. We first review data on the transcriptional regulation of the intermediary metab., i.e., glucose, amino acid, lipid, and cholesterol metab. Then, we analyze how transcription factors integrate signals from various pathways to ensure homeostasis. One example of this coordination is the daily adaptation to the circadian fasting and feeding rhythm. This section also discusses the dysregulations causing the metabolic syndrome, which reveals the intricate nature of glucose and lipid metab. and the role of the transcription factor PPARγ in orchestrating this assocn. Finally, we discuss the mol. mechanisms underlying metabolic regulations, which provide new opportunities for treating complex metabolic disorders.
- 7Atkinson, D. E. Biological feedback control at the molecular level: Interaction between metabolite-modulated enzymes seems to be a major factor in metabolic regulation. Science 1965, 150, 851– 857, DOI: 10.1126/science.150.3698.851Google ScholarThere is no corresponding record for this reference.
- 8Huang, S.-Y.; Zou, X. Advances and challenges in protein-ligand docking. Int. J. Mol. Sci. 2010, 11, 3016– 3034, DOI: 10.3390/ijms11083016Google Scholar8Advances and challenges in protein-ligand dockingHuang, Sheng-You; Zou, XiaoqinInternational Journal of Molecular Sciences (2010), 11 (), 3016-3034CODEN: IJMCFK; ISSN:1422-0067. (Molecular Diversity Preservation International)A review. Mol. docking is a widely-used computational tool for the study of mol. recognition, which aims to predict the binding mode and binding affinity of a complex formed by two or more constituent mols. with known structures. An important type of mol. docking is protein-ligand docking because of its therapeutic applications in modern structure-based drug design. Here, we review the recent advances of protein flexibility, ligand sampling, and scoring functions - the three important aspects in protein-ligand docking. Challenges and possible future directions are discussed in the conclusion.
- 9Chaires, J. B. Calorimetry and thermodynamics in drug design. Annu. Rev. Biophys. 2008, 37, 135– 151, DOI: 10.1146/annurev.biophys.36.040306.132812Google Scholar9Calorimetry and thermodynamics in drug designChaires, Jonathan B.Annual Review of Biophysics (2008), 37 (), 135-151CODEN: ARBNCV ISSN:. (Annual Reviews Inc.)A review. Modern instrumentation for calorimetry permits direct detn. of enthalpy values for binding reactions and conformational transitions in biomols. Complete thermodn. profiles consisting of free energy, enthalpy, and entropy may be obtained for reactions of interest in a relatively straightforward manner. Such profiles are of enormous value in drug design because they provide information about the balance of driving forces that cannot be obtained from structural or computational methods alone. This perspective shows several examples of the insight provided by thermodn. data in drug design.
- 10Serhan, C. N. Signalling the fat controller. Nature 1996, 384, 23– 24, DOI: 10.1038/384023a0Google ScholarThere is no corresponding record for this reference.
- 11McAllister, C. H.; Beatty, P. H.; Good, A. G. Engineering nitrogen use efficient crop plants: the current status: Engineering nitrogen use efficient crop plants. Plant Biotechnol. J. 2012, 10, 1011– 1025, DOI: 10.1111/j.1467-7652.2012.00700.xGoogle ScholarThere is no corresponding record for this reference.
- 12Goldsmith, M.; Tawfik, D. S. Enzyme engineering: reaching the maximal catalytic efficiency peak. Curr. Opin. Struct. Biol. 2017, 47, 140– 150, DOI: 10.1016/j.sbi.2017.09.002Google Scholar12Enzyme engineering: reaching the maximal catalytic efficiency peakGoldsmith, Moshe; Tawfik, Dan S.Current Opinion in Structural Biology (2017), 47 (), 140-150CODEN: COSBEF; ISSN:0959-440X. (Elsevier Ltd.)A review. The practical need for highly efficient enzymes presents new challenges in enzyme engineering, in particular, the need to improve catalytic turnover (kcat) or efficiency (kcat/KM) by several orders of magnitude. However, optimizing catalysis demands navigation through complex and rugged fitness landscapes, with optimization trajectories often leading to strong diminishing returns and dead-ends. When no further improvements are obsd. in library screens or selections, it remains unclear whether the maximal catalytic efficiency of the enzyme (the catalytic 'fitness peak') has been reached; or perhaps, an alternative combination of mutations exists that could yield addnl. improvements. Here, we discuss fundamental aspects of the process of catalytic optimization, and offer practical solns. with respect to overcoming optimization plateaus.
- 13Vajda, S.; Guarnieri, F. Characterization of protein-ligand interaction sites using experimental and computational methods. Curr. Opin. Drug Discovery Devel. 2006, 9, 354– 362Google Scholar13Characterization of protein-ligand interaction sites using experimental and computational methodsVajda, Sandor; Guarnieri, FrankCurrent Opinion in Drug Discovery & Development (2006), 9 (3), 354-362CODEN: CODDFF; ISSN:1367-6733. (Thomson Scientific)A review. The ability to identify the sites of a protein that can bind with high affinity to small, drug-like compds. has been an important goal in drug design. Accurate prediction of druggable sites and the identification of small compds. binding in those sites have provided the input for fragment-based combinatorial approaches that allow for a more thorough exploration of the chem. space, and that have the potential to yield mols. that are more lead-like than those found using traditional high-throughput screening. Current progress in exptl. and computational methods for identifying and characterizing druggable ligand binding sites on protein targets is reviewed herein, including a discussion of successful NMR, x-ray crystallog. and tethering technologies. Classical geometric and energy-based computational methods are also discussed, with particular focus on two powerful technologies, i.e., computational solvent mapping and grand canonical Monte Carlo simulations (as used by Locus Pharmaceuticals Inc). Both methods can be used to reliably identify druggable sites on proteins and to facilitate the design of novel, low-nanomolar-affinity ligands.
- 14Du, X.; Li, Y.; Xia, Y.-L.; Ai, S.-M.; Liang, J.; Sang, P.; Ji, X.-L.; Liu, S.-Q. Insights into protein-ligand interactions: Mechanisms, models, and methods. Int. J. Mol. Sci. 2016, 17, 144, DOI: 10.3390/ijms17020144Google Scholar14Insights into protein-ligand interactions: mechanisms, models, and methodsDu, Xing; Li, Yi; Xia, Yuan-Ling; Ai, Shi-Meng; Liang, Jing; Sang, Peng; Ji, Xing-Lai; Liu, Shu-QunInternational Journal of Molecular Sciences (2016), 17 (2), 144/1-144/34CODEN: IJMCFK; ISSN:1422-0067. (MDPI AG)Mol. recognition, which is the process of biol. macromols. interacting with each other or various small mols. with a high specificity and affinity to form a specific complex, constitutes the basis of all processes in living organisms. Proteins, an important class of biol. macromols., realize their functions through binding to themselves or other mols. A detailed understanding of the protein-ligand interactions is therefore central to understanding biol. at the mol. level. Moreover, knowledge of the mechanisms responsible for the protein-ligand recognition and binding will also facilitate the discovery, design, and development of drugs. In the present review, first, the physicochem. mechanisms underlying protein-ligand binding, including the binding kinetics, thermodn. concepts and relationships, and binding driving forces, are introduced and rationalized. Next, three currently existing protein-ligand binding models-the "lock-and-key", "induced fit", and "conformational selection"-are described and their underlying thermodn. mechanisms are discussed. Finally, the methods available for investigating protein-ligand binding affinity, including exptl. and theor./computational approaches, are introduced, and their advantages, disadvantages, and challenges are discussed.
- 15Fan, F. J.; Shi, Y. Effects of data quality and quantity on deep learning for protein-ligand binding affinity prediction. Bioorg. Med. Chem. 2022, 72, 117003, DOI: 10.1016/j.bmc.2022.117003Google Scholar15Effects of data quality and quantity on deep learning for protein-ligand binding affinity predictionFan, Frankie J.; Shi, YunBioorganic & Medicinal Chemistry (2022), 72 (), 117003CODEN: BMECEP; ISSN:0968-0896. (Elsevier B.V.)Prediction of protein-ligand binding affinities is crucial for computational drug discovery. A no. of deep learning approaches have been developed in recent years to improve the accuracy of such affinity prediction. While the predicting power of these systems have advanced to some degrees depending on the dataset used for model training and testing, the effects of the quality and quantity of the underlying data have not been thoroughly examd. In this study, we employed erroneous datasets and data subsets of different sizes, created from one of the largest databases of exptl. binding affinities, to train and evaluate a deep learning system based on convolutional neural networks. Our results show that data quality and quantity do have significant impacts on the prediction performance of trained models. Depending on the variations in data quality and quantity, the performance discrepancies could be comparable to or even larger than those obsd. among different deep learning approaches. In particular, the presence of proteins in the training data leads to a dramatic increase in prediction accuracy. This implies that continued accumulation of high-quality affinity data, esp. for new protein targets, is indispensable for improving deep learning models to better predict protein-ligand binding affinities.
- 16Sousa, S. F.; Ribeiro, A. J. M.; Coimbra, J. T. S.; Neves, R. P. P.; Martins, S. A.; Moorthy, N. S. H. N.; Fernandes, P. A.; Ramos, M. J. Protein-Ligand Docking in the New Millennium A Retrospective of 10 Years in the Field. Curr. Med. Chem. 2013, 20, 2296– 2314, DOI: 10.2174/0929867311320180002Google ScholarThere is no corresponding record for this reference.
- 17Morris, C. J.; Corte, D. D. Using molecular docking and molecular dynamics to investigate protein-ligand interactions. Mod. Phys. Lett. B 2021, 35, 2130002, DOI: 10.1142/S0217984921300027Google Scholar17Using molecular docking and molecular dynamics to investigate protein-ligand interactionsMorris, Connor J.; Corte, Dennis DellaModern Physics Letters B (2021), 35 (8), 2130002CODEN: MPLBET; ISSN:0217-9849. (World Scientific Publishing Co. Pte. Ltd.)A review. Mol. docking and mol. dynamics (MD) are powerful tools used to investigate protein-ligand interactions. Mol. docking programs predict the binding pose and affinity of a protein-ligand complex, while MD can be used to incorporate flexibility into docking calcns. and gain further information on the kinetics and stability of the protein-ligand bond. This review covers state-of-the-art methods of using mol. docking and MD to explore protein-ligand interactions, with emphasis on application to drug discovery. We also call for further research on combining common mol. docking and MD methods.
- 18Lecina, D.; Gilabert, J. F.; Guallar, V. Adaptive simulations, towards interactive protein-ligand modeling. Sci. Rep. 2017, 7, 8466, DOI: 10.1038/s41598-017-08445-5Google Scholar18Adaptive simulations, towards interactive protein-ligand modelingLecina Daniel; Gilabert Joan F; Guallar Victor; Guallar VictorScientific reports (2017), 7 (1), 8466 ISSN:.Modeling the dynamic nature of protein-ligand binding with atomistic simulations is one of the main challenges in computational biophysics, with important implications in the drug design process. Although in the past few years hardware and software advances have significantly revamped the use of molecular simulations, we still lack a fast and accurate ab initio description of the binding mechanism in complex systems, available only for up-to-date techniques and requiring several hours or days of heavy computation. Such delay is one of the main limiting factors for a larger penetration of protein dynamics modeling in the pharmaceutical industry. Here we present a game-changing technology, opening up the way for fast reliable simulations of protein dynamics by combining an adaptive reinforcement learning procedure with Monte Carlo sampling in the frame of modern multi-core computational resources. We show remarkable performance in mapping the protein-ligand energy landscape, being able to reproduce the full binding mechanism in less than half an hour, or the active site induced fit in less than 5 minutes. We exemplify our method by studying diverse complex targets, including nuclear hormone receptors and GPCRs, demonstrating the potential of using the new adaptive technique in screening and lead optimization studies.
- 19Ikebata, H.; Hongo, K.; Isomura, T.; Maezono, R.; Yoshida, R. Bayesian molecular design with a chemical language model. J. Comput. Aided Mol. Des. 2017, 31, 379– 391, DOI: 10.1007/s10822-016-0008-zGoogle ScholarThere is no corresponding record for this reference.
- 20Rives, A.; Meier, J.; Sercu, T.; Goyal, S.; Lin, Z.; Liu, J.; Guo, D.; Ott, M.; Zitnick, C. L.; Ma, J.; Fergus, R. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U. S. A. 2021, DOI: 10.1073/pnas.2016239118Google ScholarThere is no corresponding record for this reference.
- 21Cao, Y.; Shen, Y. TALE: Transformer-based protein function Annotation with joint sequence–Label Embedding. Bioinformatics 2021, 37, 2825– 2833, DOI: 10.1093/bioinformatics/btab198Google ScholarThere is no corresponding record for this reference.
- 22Wang, S.; Guo, Y.; Wang, Y.; Sun, H.; Huang, J. SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction. Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. New York, NY, USA 2019, 429– 436Google ScholarThere is no corresponding record for this reference.
- 23Chithrananda, S.; Grand, G.; Ramsundar, B. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. arXiv , 2020.Google ScholarThere is no corresponding record for this reference.
- 24Kumar, N.; Acharya, V. Machine intelligence-driven framework for optimized hit selection in virtual screening. J. Cheminform. 2022, 14, 48, DOI: 10.1186/s13321-022-00630-7Google ScholarThere is no corresponding record for this reference.
- 25Erikawa, D.; Yasuo, N.; Sekijima, M. MERMAID: an open source automated hit-to-lead method based on deep reinforcement learning. J. Cheminform. 2021, 13, 94, DOI: 10.1186/s13321-021-00572-6Google ScholarThere is no corresponding record for this reference.
- 26Zhou, M.; Duan, N.; Liu, S.; Shum, H.-Y. Progress in neural NLP: Modeling, learning, and reasoning. Engineering (Beijing) 2020, 6, 275– 290, DOI: 10.1016/j.eng.2019.12.014Google ScholarThere is no corresponding record for this reference.
- 27Patwardhan, N.; Marrone, S.; Sansone, C. Transformers in the real world: A survey on NLP applications. Inf. 2023, 14, 242, DOI: 10.3390/info14040242Google ScholarThere is no corresponding record for this reference.
- 28Bijral, R. K.; Singh, I.; Manhas, J.; Sharma, V. Exploring Artificial Intelligence in Drug Discovery: A Comprehensive Review. Arch. Comput. Methods Eng. 2022, 29, 2513– 2529, DOI: 10.1007/s11831-021-09661-zGoogle ScholarThere is no corresponding record for this reference.
- 29Ray, P. P. ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems 2023, 3, 121– 154, DOI: 10.1016/j.iotcps.2023.04.003Google ScholarThere is no corresponding record for this reference.
- 30Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. https://www.mikecaptain.com/resources/pdf/GPT-1.pdf, Accessed: 2023–10–27.Google ScholarThere is no corresponding record for this reference.
- 31Goodside, R, Papay, Meet Claude: Anthropic’s Rival to ChatGPT. https://scale.com/blog/chatgpt-vs-claude, 2023.Google ScholarThere is no corresponding record for this reference.
- 32Bing Copilot. Bing Copilot; https://copilot.microsoft.com/.Google ScholarThere is no corresponding record for this reference.
- 33Rahul; Adhikari, S.; Monika NLP based Machine Learning Approaches for Text Summarization. 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC) 2020, 535– 538Google ScholarThere is no corresponding record for this reference.
- 34Nasukawa, T.; Yi, J. Sentiment analysis: capturing favorability using natural language processing. Proceedings of the 2nd international conference on Knowledge capture. New York, NY, USA 2003, 70– 77Google ScholarThere is no corresponding record for this reference.
- 35Lample, G.; Charton, F. Deep Learning for Symbolic Mathematics. arXiv , 2019.Google ScholarThere is no corresponding record for this reference.
- 36Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; Zhou, M. CodeBERT: APre-Trained Model for Programming and Natural Languages. arXiv , 2020.Google ScholarThere is no corresponding record for this reference.
- 37Mielke, S. J.; Alyafeai, Z.; Salesky, E.; Raffel, C.; Dey, M.; Gallé, M.; Raja, A.; Si, C.; Lee, W. Y.; Sagot, B.; Tan, S. Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP. arXiv , 2021.Google ScholarThere is no corresponding record for this reference.
- 38Camacho-Collados, J.; Pilehvar, M. T. From word to sense embeddings: A survey on vector representations of meaning. J. Artif. Intell. Res. 2018, 63, 743– 788, DOI: 10.1613/jair.1.11259Google ScholarThere is no corresponding record for this reference.
- 39Ashok, V. G.; Feng, S.; Choi, Y. Success with style: Using writing style to predict the success of novelsd.Google ScholarThere is no corresponding record for this reference.
- 40Barberá, P.; Boydstun, A. E.; Linn, S.; McMahon, R.; Nagler, J. Automated text classification of news articles: A practical guide. Polit. Anal. 2021, 29, 19– 42, DOI: 10.1017/pan.2020.8Google ScholarThere is no corresponding record for this reference.
- 41Wang, H.; Wu, H.; He, Z.; Huang, L.; Church, K. W. Progress in machine translation. Engineering (Beijing) 2022, 18, 143– 153, DOI: 10.1016/j.eng.2021.03.023Google ScholarThere is no corresponding record for this reference.
- 42Sønderby, S. K.; Winther, O. Protein Secondary Structure Prediction with Long Short Term Memory Networks. arXiv , 2014.Google ScholarThere is no corresponding record for this reference.
- 43Guo, Y.; Li, W.; Wang, B.; Liu, H.; Zhou, D. DeepACLSTM: deep asymmetric convolutional long short-term memory neural models for protein secondary structure prediction. BMC Bioinformatics 2019, 20, 341, DOI: 10.1186/s12859-019-2940-0Google ScholarThere is no corresponding record for this reference.
- 44Bhasuran, B.; Natarajan, J. Automatic extraction of gene-disease associations from literature using joint ensemble learning. PLoS One 2018, 13, e0200699, DOI: 10.1371/journal.pone.0200699Google ScholarThere is no corresponding record for this reference.
- 45Pang, M.; Su, K.; Li, M. Leveraging information in spatial transcriptomics to predict super-resolution gene expression from histology images in tumors. bioRxiv , 2021, 2021.11.28.470212.Google ScholarThere is no corresponding record for this reference.
- 46Bouatta, N.; Sorger, P.; AlQuraishi, M. Protein structure prediction by AlphaFold2: are attention and symmetries all you need?. Acta Crystallogr. D Struct Biol. 2021, 77, 982– 991, DOI: 10.1107/S2059798321007531Google ScholarThere is no corresponding record for this reference.
- 47Skolnick, J.; Gao, M.; Zhou, H.; Singh, S. AlphaFold 2: Why It Works and Its Implications for Understanding the Relationships of Protein Sequence, Structure, and Function. J. Chem. Inf. Model. 2021, 61, 4827– 4831, DOI: 10.1021/acs.jcim.1c01114Google Scholar47AlphaFold 2: Why It Works and Its Implications for Understanding the Relationships of Protein Sequence, Structure, and FunctionSkolnick, Jeffrey; Gao, Mu; Zhou, Hongyi; Singh, SureshJournal of Chemical Information and Modeling (2021), 61 (10), 4827-4831CODEN: JCISD8; ISSN:1549-9596. (American Chemical Society)AlphaFold 2 (AF2) was the star of CASP14, the last biannual structure prediction expt. Using novel deep learning, AF2 predicted the structures of many difficult protein targets at or near exptl. resoln. Here, the authors present the authors' perspective of why AF2 works and show that it is a very sophisticated fold recognition algorithm that exploits the completeness of the library of single domain PDB structures. It also learned local side chain packing rearrangements that enable it to refine proteins to high resoln. The benefits and limitations of its ability to predict the structures of many more proteins at or close to at. detail are discussed.
- 48Adadi, A.; Berrada, M. Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 2018, 6, 52138, DOI: 10.1109/ACCESS.2018.2870052Google ScholarThere is no corresponding record for this reference.
- 49Box, G. E. P. Science and Statistics. J. Am. Stat. Assoc. 1976, 71, 791– 799, DOI: 10.1080/01621459.1976.10480949Google ScholarThere is no corresponding record for this reference.
- 50Geirhos, R.; Jacobsen, J.-H.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F. A. Shortcut learning in deep neural networks. Nature Machine Intelligence 2020, 2, 665– 673, DOI: 10.1038/s42256-020-00257-zGoogle ScholarThere is no corresponding record for this reference.
- 51. Outeiral, C.; Nissley, D. A.; Deane, C. M. Current structure predictors are not learning the physics of protein folding. Bioinformatics 2022, 38, 1881–1887, DOI: 10.1093/bioinformatics/btab881
- 52. Steels, L. Modeling the cultural evolution of language. Phys. Life Rev. 2011, 8, 339–356, DOI: 10.1016/j.plrev.2011.10.014
- 53. Maurya, H. C.; Gupta, P.; Choudhary, N. Natural language ambiguity and its effect on machine learning. Int. J. Modern Eng. Res. 2015, 5, 25–30
- 54. Tenney, I.; Xia, P.; Chen, B.; Wang, A.; Poliak, A.; McCoy, R. T.; Kim, N.; Van Durme, B.; Bowman, S. R.; Das, D.; Pavlick, E. What do you learn from context? Probing for sentence structure in contextualized word representations. arXiv, 2019.
- 55. Miyagawa, S.; Berwick, R. C.; Okanoya, K. The emergence of hierarchical structure in human language. Front. Psychol. 2013, 4, 71, DOI: 10.3389/fpsyg.2013.00071
- 56. Liu, H.; Xu, C.; Liang, J. Dependency distance: A new perspective on syntactic patterns in natural languages. Phys. Life Rev. 2017, 21, 171–193, DOI: 10.1016/j.plrev.2017.03.002
- 57. Frank, S. L.; Bod, R.; Christiansen, M. H. How hierarchical is language use? Proc. Biol. Sci. 2012, 279, 4522–4531, DOI: 10.1098/rspb.2012.1741
- 58. Oesch, N.; Dunbar, R. I. M. The emergence of recursion in human language: Mentalising predicts recursive syntax task performance. J. Neurolinguistics 2017, 43, 95–106, DOI: 10.1016/j.jneuroling.2016.09.008
- 59. Ferruz, N.; Höcker, B. Controllable protein design with language models. Nat. Mach. Intell. 2022, 4, 521–532, DOI: 10.1038/s42256-022-00499-z
- 60. Ofer, D.; Brandes, N.; Linial, M. The language of proteins: NLP, machine learning & protein sequences. Comput. Struct. Biotechnol. J. 2021, 19, 1750–1758, DOI: 10.1016/j.csbj.2021.03.022
- 61. Ptitsyn, O. B. How does protein synthesis give rise to the 3D-structure? FEBS Lett. 1991, 285, 176–181, DOI: 10.1016/0014-5793(91)80799-9
- 62. Yu, L.; Tanwar, D. K.; Penha, E. D. S.; Wolf, Y. I.; Koonin, E. V.; Basu, M. K. Grammar of protein domain architectures. Proc. Natl. Acad. Sci. U.S.A. 2019, 116, 3636–3645, DOI: 10.1073/pnas.1814684116
- 63. Petsko, G. A.; Ringe, D. Protein Structure and Function; Primers in Biology; Blackwell Publishing: London, England, 2003.
- 64. Shenoy, S. R.; Jayaram, B. Proteins: sequence to structure and function-current status. Curr. Protein Pept. Sci. 2010, 11, 498–514, DOI: 10.2174/138920310794109094
- 65. Takahashi, M.; Maraboeuf, F.; Nordén, B. Locations of functional domains in the RecA protein. Overlap of domains and regulation of activities. Eur. J. Biochem. 1996, 242, 20–28, DOI: 10.1111/j.1432-1033.1996.0020r.x
- 66. Liang, W.; KaiYong, Z. Detecting “protein words” through unsupervised word segmentation. arXiv, 2014.
- 67. Kuntz, I. D.; Crippen, G. M.; Kollman, P. A.; Kimelman, D. Calculation of protein tertiary structure. J. Mol. Biol. 1976, 106, 983–994, DOI: 10.1016/0022-2836(76)90347-8
- 68. Rodrigue, N.; Lartillot, N.; Bryant, D.; Philippe, H. Site interdependence attributed to tertiary structure in amino acid sequence evolution. Gene 2005, 347, 207–217, DOI: 10.1016/j.gene.2004.12.011
- 69. Eisenhaber, F.; Persson, B.; Argos, P. Protein structure prediction: recognition of primary, secondary, and tertiary structural features from amino acid sequence. Crit. Rev. Biochem. Mol. Biol. 1995, 30, 1–94, DOI: 10.3109/10409239509085139
- 70. Garfield, E. Chemico-linguistics: computer translation of chemical nomenclature. Nature 1961, 192, 192, DOI: 10.1038/192192a0
- 71. Wigh, D. S.; Goodman, J. M.; Lapkin, A. A. A review of molecular representation in the age of machine learning. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2022, DOI: 10.1002/wcms.1603
- 72. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36, DOI: 10.1021/ci00057a005
- 73. Wang, Y.; Xiao, J.; Suzek, T. O.; Zhang, J.; Wang, J.; Bryant, S. H. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009, 37, W623–33, DOI: 10.1093/nar/gkp456
- 74. Degtyarenko, K.; de Matos, P.; Ennis, M.; Hastings, J.; Zbinden, M.; McNaught, A.; Alcántara, R.; Darsow, M.; Guedj, M.; Ashburner, M. ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res. 2007, 36, D344–50, DOI: 10.1093/nar/gkm791
- 75. Wishart, D. S.; Knox, C.; Guo, A. C.; Cheng, D.; Shrivastava, S.; Tzur, D.; Gautam, B.; Hassanali, M. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008, 36, D901–6, DOI: 10.1093/nar/gkm958
- 76. Wang, X.; Hao, J.; Yang, Y.; He, K. Natural language adversarial defense through synonym encoding. Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence 2021, 823–833
- 77. Bjerrum, E. J. SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules. arXiv, 2017.
- 78. Lee, I.; Nam, H. Infusing Linguistic Knowledge of SMILES into Chemical Language Models. arXiv, 2022.
- 79. Skinnider, M. A. Invalid SMILES are beneficial rather than detrimental to chemical language models. Nat. Mach. Intell. 2024, 6, 437, DOI: 10.1038/s42256-024-00821-x
- 80. O’Boyle, N.; Dalke, A. DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures. ChemRxiv, 2018.
- 81. Krenn, M.; Häse, F.; Nigam, A.; Friederich, P.; Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Mach. Learn.: Sci. Technol. 2020, 1, 045024, DOI: 10.1088/2632-2153/aba947
- 82. Gohlke, H.; Mannhold, R.; Kubinyi, H.; Folkers, G. In Protein-Ligand Interactions; Gohlke, H., Ed.; Methods and Principles in Medicinal Chemistry; Wiley-VCH Verlag: Weinheim, Germany, 2012.
- 83. Jumper, J. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589, DOI: 10.1038/s41586-021-03819-2
- 84. Landrum, G. RDKit: A software suite for cheminformatics, computational chemistry, and predictive modeling. http://www.rdkit.org/RDKit_Overview.pdf, 2013; Accessed: 2023–12–13.
- 85. Mukherjee, S.; Ghosh, M.; Basuchowdhuri, P. Proceedings of the 2022 SIAM International Conference on Data Mining (SDM); Society for Industrial and Applied Mathematics, 2022; pp 729–737.
- 86. Chen, L.; Tan, X.; Wang, D.; Zhong, F.; Liu, X.; Yang, T.; Luo, X.; Chen, K.; Jiang, H.; Zheng, M. TransformerCPI: improving compound–protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics 2020, 36, 4406–4414, DOI: 10.1093/bioinformatics/btaa524
- 87. Aly Abdelkader, G.; Ngnamsie Njimbouom, S.; Oh, T.-J.; Kim, J.-D. ResBiGAAT: Residual Bi-GRU with attention for protein-ligand binding affinity prediction. Comput. Biol. Chem. 2023, 107, 107969, DOI: 10.1016/j.compbiolchem.2023.107969
- 88. Li, Q.; Zhang, X.; Wu, L.; Bo, X.; He, S.; Wang, S. PLA-MoRe: A Protein–Ligand Binding Affinity Prediction Model via Comprehensive Molecular Representations. J. Chem. Inf. Model. 2022, 62, 4380–4390, DOI: 10.1021/acs.jcim.2c00960
- 89. Abramson, J. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024, 636, E4, DOI: 10.1038/s41586-024-08416-7
- 90. Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; Overington, J. P. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012, 40, D1100–7, DOI: 10.1093/nar/gkr777
- 91. Wishart, D. S.; Knox, C.; Guo, A. C.; Shrivastava, S.; Hassanali, M.; Stothard, P.; Chang, Z.; Woolsey, J. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006, 34, D668–72, DOI: 10.1093/nar/gkj067
- 92. Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235–242, DOI: 10.1093/nar/28.1.235
- 93. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2017, 45, D158–D169, DOI: 10.1093/nar/gkw1099
- 94. Davis, M. I.; Hunt, J. P.; Herrgard, S.; Ciceri, P.; Wodicka, L. M.; Pallares, G.; Hocker, M.; Treiber, D. K.; Zarrinkar, P. P. Comprehensive analysis of kinase inhibitor selectivity. Nat. Biotechnol. 2011, 29, 1046–1051, DOI: 10.1038/nbt.1990
- 95. Tang, J.; Szwajda, A.; Shakyawar, S.; Xu, T.; Hintsanen, P.; Wennerberg, K.; Aittokallio, T. Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. J. Chem. Inf. Model. 2014, 54, 735–743, DOI: 10.1021/ci400709d
- 96. Wang, R.; Fang, X.; Lu, Y.; Wang, S. The PDBbind database: collection of binding affinities for protein-ligand complexes with known three-dimensional structures. J. Med. Chem. 2004, 47, 2977–2980, DOI: 10.1021/jm030580l
- 97. Chen, S.; Zhang, S.; Fang, X.; Lin, L.; Zhao, H.; Yang, Y. Protein complex structure modeling by cross-modal alignment between cryo-EM maps and protein sequences. Nat. Commun. 2024, 15, 8808, DOI: 10.1038/s41467-024-53116-5
- 98. Bishop, C. M. Pattern Recognition and Machine Learning, 1st ed.; Information Science and Statistics; Springer: New York, NY, 2006.
- 99. Yang, J.; Shen, C.; Huang, N. Predicting or Pretending: Artificial Intelligence for Protein-Ligand Interactions Lack of Sufficiently Large and Unbiased Datasets. Front. Pharmacol. 2020, 11, 69, DOI: 10.3389/fphar.2020.00069
- 100. Kryshtafovych, A.; Schwede, T.; Topf, M.; Fidelis, K.; Moult, J. Critical assessment of methods of protein structure prediction (CASP)–Round XIII. Proteins 2019, 87, 1011–1020, DOI: 10.1002/prot.25823
- 101. Janin, J.; Henrick, K.; Moult, J.; Eyck, L. T.; Sternberg, M. J. E.; Vajda, S.; Vakser, I.; Wodak, S. J. CAPRI: A Critical Assessment of PRedicted Interactions. Proteins 2003, 52, 2–9, DOI: 10.1002/prot.10381
- 102. Lensink, M. F.; Nadzirin, N.; Velankar, S.; Wodak, S. J. Modeling protein-protein, protein-peptide, and protein-oligosaccharide complexes: CAPRI 7th edition. Proteins 2020, 88, 916–938, DOI: 10.1002/prot.25870
- 103. Schomburg, I.; Chang, A.; Schomburg, D. BRENDA, enzyme data and metabolic information. Nucleic Acids Res. 2002, 30, 47–49, DOI: 10.1093/nar/30.1.47
- 104. Liu, T.; Lin, Y.; Wen, X.; Jorissen, R. N.; Gilson, M. K. BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res. 2007, 35, D198–201, DOI: 10.1093/nar/gkl999
- 105. Amemiya, T.; Koike, R.; Kidera, A.; Ota, M. PSCDB: a database for protein structural change upon ligand binding. Nucleic Acids Res. 2012, 40, D554–8, DOI: 10.1093/nar/gkr966
- 106. Mysinger, M. M.; Carchia, M.; Irwin, J. J.; Shoichet, B. K. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking. J. Med. Chem. 2012, 55, 6582–6594, DOI: 10.1021/jm300687e
- 107. Warren, G. L.; Do, T. D.; Kelley, B. P.; Nicholls, A.; Warren, S. D. Essential considerations for using protein-ligand structures in drug discovery. Drug Discovery Today 2012, 17, 1270–1281, DOI: 10.1016/j.drudis.2012.06.011
- 108. Puvanendrampillai, D.; Mitchell, J. B. O. L/D Protein Ligand Database (PLD): additional understanding of the nature and specificity of protein-ligand complexes. Bioinformatics 2003, 19, 1856–1857, DOI: 10.1093/bioinformatics/btg243
- 109. Wang, C.; Hu, G.; Wang, K.; Brylinski, M.; Xie, L.; Kurgan, L. PDID: database of molecular-level putative protein-drug interactions in the structural human proteome. Bioinformatics 2016, 32, 579–586, DOI: 10.1093/bioinformatics/btv597
- 110. Zhu, M.; Song, X.; Chen, P.; Wang, W.; Wang, B. dbHDPLS: A database of human disease-related protein-ligand structures. Comput. Biol. Chem. 2019, 78, 353–358, DOI: 10.1016/j.compbiolchem.2018.12.023
- 111. Gao, M.; Moumbock, A. F. A.; Qaseem, A.; Xu, Q.; Günther, S. CovPDB: a high-resolution coverage of the covalent protein-ligand interactome. Nucleic Acids Res. 2022, 50, D445–D450, DOI: 10.1093/nar/gkab868
- 112. Ammar, A.; Cavill, R.; Evelo, C.; Willighagen, E. PSnpBind: a database of mutated binding site protein-ligand complexes constructed using a multithreaded virtual screening workflow. J. Cheminform. 2022, 14, 8, DOI: 10.1186/s13321-021-00573-5
- 113. Lingė, D. PLBD: protein-ligand binding database of thermodynamic and kinetic intrinsic parameters. Database 2023, DOI: 10.1093/database/baad040
- 114. Wei, H.; Wang, W.; Peng, Z.; Yang, J. Q-BioLiP: A Comprehensive Resource for Quaternary Structure-based Protein–ligand Interactions. bioRxiv, 2023, 2023.06.23.546351.
- 115. Korlepara, D. B. PLAS-20k: Extended dataset of protein-ligand affinities from MD simulations for machine learning applications. Sci. Data 2024, DOI: 10.1038/s41597-023-02872-y
- 116. Xenarios, I.; Rice, D. W.; Salwinski, L.; Baron, M. K.; Marcotte, E. M.; Eisenberg, D. DIP: the database of interacting proteins. Nucleic Acids Res. 2000, 28, 289–291, DOI: 10.1093/nar/28.1.289
- 117. Wallach, I.; Lilien, R. The protein-small-molecule database, a non-redundant structural resource for the analysis of protein-ligand binding. Bioinformatics 2009, 25, 615–620, DOI: 10.1093/bioinformatics/btp035
- 118. Wang, S.; Lin, H.; Huang, Z.; He, Y.; Deng, X.; Xu, Y.; Pei, J.; Lai, L. CavitySpace: A Database of Potential Ligand Binding Sites in the Human Proteome. Biomolecules 2022, 12, 967, DOI: 10.3390/biom12070967
- 119. Otter, D. W.; Medina, J. R.; Kalita, J. K. A survey of the usages of deep learning for natural language processing. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 604–624, DOI: 10.1109/TNNLS.2020.2979670
- 120. Wang, Y.; You, Z.-H.; Yang, S.; Li, X.; Jiang, T.-H.; Zhou, X. A high efficient biological language model for predicting Protein-Protein interactions. Cells 2019, 8, 122, DOI: 10.3390/cells8020122
- 121. Abbasi, K.; Razzaghi, P.; Poso, A.; Amanlou, M.; Ghasemi, J. B.; Masoudi-Nejad, A. DeepCDA: deep cross-domain compound–protein affinity prediction through LSTM and convolutional neural networks. Bioinformatics 2020, 36, 4633–4642, DOI: 10.1093/bioinformatics/btaa544
- 122. Zhou, G.; Gao, Z.; Ding, Q.; Zheng, H.; Xu, H.; Wei, Z.; Zhang, L.; Ke, G. Uni-Mol: A Universal 3D Molecular Representation Learning Framework. ChemRxiv, 2023.
- 123. Zhou, D.; Xu, Z.; Li, W.; Xie, X.; Peng, S. MultiDTI: drug–target interaction prediction based on multi-modal representation learning to bridge the gap between new chemical entities and known heterogeneous network. Bioinformatics 2021, 37, 4485–4492, DOI: 10.1093/bioinformatics/btab473
- 124. Özçelik, R.; Öztürk, H.; Özgür, A.; Ozkirimli, E. ChemBoost: A chemical language based approach for protein–ligand binding affinity prediction. Mol. Inform. 2021, 40, e2000212, DOI: 10.1002/minf.202000212
- 125Gaspar, H. A.; Ahmed, M.; Edlich, T.; Fabian, B.; Varszegi, Z.; Segler, M.; Meyers, J.; Fiscato, M. Proteochemometric Models Using Multiple Sequence Alignments and a Subword Segmented Masked Language Model. ChemRxiv , 2021.Google ScholarThere is no corresponding record for this reference.
- 126Arseniev-Koehler, A. Theoretical foundations and limits of word embeddings: What types of meaning can they capture. Sociol. Methods Res. 2022, 004912412211401Google ScholarThere is no corresponding record for this reference.
- 127. Lake, B. M.; Murphy, G. L. Word meaning in minds and machines. Psychol. Rev. 2023, 130, 401–431, DOI: 10.1037/rev0000297
- 128. Winchester, S. A Verb for Our Frantic Times. https://www.nytimes.com/2011/05/29/opinion/29winchester.html, 2011; Accessed 2024-09-15.
- 129. Panapitiya, G.; Girard, M.; Hollas, A.; Sepulveda, J.; Murugesan, V.; Wang, W.; Saldanha, E. Evaluation of deep learning architectures for aqueous solubility prediction. ACS Omega 2022, 7, 15695–15710, DOI: 10.1021/acsomega.2c00642
- 130. Wu, X.; Yu, L. EPSOL: sequence-based protein solubility prediction using multidimensional embedding. Bioinformatics 2021, 37, 4314–4320, DOI: 10.1093/bioinformatics/btab463
- 131. Krogh, A. What are artificial neural networks? Nat. Biotechnol. 2008, 26, 195–197, DOI: 10.1038/nbt1386
- 132. Rumelhart, D.; Hinton, G. E.; Williams, R. J. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition; MIT Press, 1986; pp 673–695.
- 133. Salehinejad, H.; Sankar, S.; Barfett, J.; Colak, E.; Valaee, S. Recent Advances in Recurrent Neural Networks. arXiv, 2017.
- 134. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
- 135. Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. arXiv, 2018.
- 136. Chen, G. A gentle tutorial of recurrent neural network with error backpropagation. arXiv, 2016.
- 137. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv, 2014.
- 138. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780, DOI: 10.1162/neco.1997.9.8.1735
- 139. Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005, 18, 602–610, DOI: 10.1016/j.neunet.2005.06.042
- 140. Thafar, M. A.; Alshahrani, M.; Albaradei, S.; Gojobori, T.; Essack, M.; Gao, X. Affinity2Vec: drug-target binding affinity prediction through representation learning, graph mining, and machine learning. Sci. Rep. 2022, 12, 4751, DOI: 10.1038/s41598-022-08787-9
- 141. Wei, B.; Zhang, Y.; Gong, X. 519. DeepLPI: A Novel Drug Repurposing Model based on Ligand-Protein Interaction Using Deep Learning. Open Forum Infect. Dis. 2022, 9, ofac492.574, DOI: 10.1093/ofid/ofac492.574
- 142. Yuan, W.; Chen, G.; Chen, C. Y.-C. FusionDTA: attention-based feature polymerizer and knowledge distillation for drug-target binding affinity prediction. Brief. Bioinform. 2022, DOI: 10.1093/bib/bbab506
- 143. West-Roberts, J.; Valentin-Alvarado, L.; Mullen, S.; Sachdeva, R.; Smith, J.; Hug, L. A.; Gregoire, D. S.; Liu, W.; Lin, T.-Y.; Husain, G.; Amano, Y.; Ly, L.; Banfield, J. F. Giant genes are rare but implicated in cell wall degradation by predatory bacteria. bioRxiv, 2023.
- 144. Hernández, A.; Amigó, J. Attention mechanisms and their applications to complex systems. Entropy (Basel) 2021, 23, 283, DOI: 10.3390/e23030283
- 145. Yang, X. An overview of the attention mechanisms in computer vision. 2020.
- 146. Hu, D. An introductory survey on attention mechanisms in NLP problems. arXiv, 2018.
- 147. Vig, J.; Madani, A.; Varshney, L. R.; Xiong, C.; Socher, R.; Rajani, N. F. BERTology Meets Biology: Interpreting Attention in Protein Language Models. arXiv, 2020.
- 148. Wu, H.; Liu, J.; Jiang, T.; Zou, Q.; Qi, S.; Cui, Z.; Tiwari, P.; Ding, Y. AttentionMGT-DTA: A multi-modal drug-target affinity prediction using graph transformer and attention mechanism. Neural Netw. 2024, 169, 623–636, DOI: 10.1016/j.neunet.2023.11.018
- 149. Koyama, K.; Kamiya, K.; Shimada, K. Cross Attention DTI: Drug-target interaction prediction with cross attention module in the blind evaluation setup. BIOKDD 2020.
- 150. Kurata, H.; Tsukiyama, S. ICAN: Interpretable cross-attention network for identifying drug and target protein interactions. PLoS One 2022, 17, e0276609, DOI: 10.1371/journal.pone.0276609
- 151. Zhao, Q.; Zhao, H.; Zheng, K.; Wang, J. HyperAttentionDTI: improving drug–protein interaction prediction by sequence-based deep learning with attention mechanism. Bioinformatics 2022, 38, 655–662, DOI: 10.1093/bioinformatics/btab715
- 152. Jiang, M.; Li, Z.; Zhang, S.; Wang, S.; Wang, X.; Yuan, Q. Drug-target affinity prediction using graph neural network and contact maps. RSC Adv. 2020, 10, 20701, DOI: 10.1039/D0RA02297G
- 153. Nguyen, T. M.; Nguyen, T.; Le, T. M.; Tran, T. GEFA: Early Fusion Approach in Drug-Target Affinity Prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 2022, 19, 718–728, DOI: 10.1109/TCBB.2021.3094217
- 154. Yu, J.; Li, Z.; Chen, G.; Kong, X.; Hu, J.; Wang, D.; Cao, D.; Li, Y.; Huo, R.; Wang, G.; Liu, X.; Jiang, H.; Li, X.; Luo, X.; Zheng, M. Computing the relative binding affinity of ligands based on a pairwise binding comparison network. Nat. Comput. Sci. 2023, 3, 860–872, DOI: 10.1038/s43588-023-00529-9
- 155. Knutson, C.; Bontha, M.; Bilbrey, J. A.; Kumar, N. Decoding the protein–ligand interactions using parallel graph neural networks. Sci. Rep. 2022, 12, 1–14, DOI: 10.1038/s41598-022-10418-2
- 156. Kyro, G. W.; Brent, R. I.; Batista, V. S. HAC-Net: A Hybrid Attention-Based Convolutional Neural Network for Highly Accurate Protein–Ligand Binding Affinity Prediction. J. Chem. Inf. Model. 2023, 63, 1947–1960, DOI: 10.1021/acs.jcim.3c00251
- 157. Yousefi, N.; Yazdani-Jahromi, M.; Tayebi, A.; Kolanthai, E.; Neal, C. J.; Banerjee, T.; Gosai, A.; Balasubramanian, G.; Seal, S.; Ozmen Garibay, O. BindingSite-AugmentedDTA: enabling a next-generation pipeline for interpretable prediction models in drug repurposing. Brief. Bioinform. 2023, DOI: 10.1093/bib/bbad136
- 158. Yazdani-Jahromi, M.; Yousefi, N.; Tayebi, A.; Kolanthai, E.; Neal, C. J.; Seal, S.; Garibay, O. O. AttentionSiteDTI: an interpretable graph-based model for drug-target interaction prediction using NLP sentence-level relation classification. Brief. Bioinform. 2022, DOI: 10.1093/bib/bbac272
- 159. Bronstein, M. M.; Bruna, J.; Cohen, T.; Veličković, P. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv, 2021.
- 160. Lim, J.; Ryu, S.; Park, K.; Choe, Y. J.; Ham, J.; Kim, W. Y. Predicting Drug–Target Interaction Using a Novel Graph Neural Network with 3D Structure-Embedded Graph Representation. J. Chem. Inf. Model. 2019, 59, 3981–3988, DOI: 10.1021/acs.jcim.9b00387
- 161. Jin, Z.; Wu, T.; Chen, T.; Pan, D.; Wang, X.; Xie, J.; Quan, L.; Lyu, Q. CAPLA: improved prediction of protein–ligand binding affinity by a deep learning approach based on a cross-attention mechanism. Bioinformatics 2023, 39, btad049, DOI: 10.1093/bioinformatics/btad049
- 162. Lin, Z.; Akin, H.; Rao, R.; Hie, B.; Zhu, Z.; Lu, W.; Smetanin, N.; Verkuil, R.; Kabeli, O.; Shmueli, Y.; Dos Santos Costa, A.; Fazel-Zarandi, M.; Sercu, T.; Candido, S.; Rives, A. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023, 379, 1123–1130, DOI: 10.1126/science.ade2574
- 163. Zhang, S.; Fan, R.; Liu, Y.; Chen, S.; Liu, Q.; Zeng, W. Applications of transformer-based language models in bioinformatics: a survey. Bioinform. Adv. 2023, 3, vbad001, DOI: 10.1093/bioadv/vbad001
- 164. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv, 2014.
- 165. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015, 3156–3164.
- 166. Sutskever, I.; Vinyals, O.; Le, Q. V. Sequence to sequence learning with Neural Networks. arXiv, 2014.
- 167. Cho, K.; van Merrienboer, B.; Bahdanau, D.; Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv, 2014.
- 168. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv, 2018.
- 169. Zeyer, A.; Bahar, P.; Irie, K.; Schlüter, R.; Ney, H. A Comparison of Transformer and LSTM Encoder Decoder Models for ASR. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019, 8–15.
- 170. Irie, K.; Zeyer, A.; Schlüter, R.; Ney, H. Language Modeling with Deep Transformers. arXiv, 2019.
- 171. Zouitni, C.; Sabri, M. A.; Aarab, A. A Comparison Between LSTM and Transformers for Image Captioning. Digital Technologies and Applications 2023, 669, 492–500, DOI: 10.1007/978-3-031-29860-8_50
- 172. Parisotto, E.; Song, F.; Rae, J.; Pascanu, R.; Gulcehre, C.; Jayakumar, S.; Jaderberg, M.; Kaufman, R. L.; Clark, A.; Noury, S.; Botvinick, M.; Heess, N.; Hadsell, R. Stabilizing Transformers for Reinforcement Learning. Proceedings of the 37th International Conference on Machine Learning 2020, 7487–7498.
- 173. Bilokon, P.; Qiu, Y. Transformers versus LSTMs for electronic trading. arXiv, 2023.
- 174. Merity, S. Single Headed Attention RNN: Stop Thinking With Your Head. arXiv, 2019.
- 175. Ezen-Can, A. A Comparison of LSTM and BERT for Small Corpus. arXiv, 2020.
- 176. Unsal, S.; Atas, H.; Albayrak, M.; Turhan, K.; Acar, A. C.; Doğan, T. Learning functional properties of proteins with language models. Nat. Mach. Intell. 2022, 4, 227–245, DOI: 10.1038/s42256-022-00457-9
- 177. Brandes, N.; Ofer, D.; Peleg, Y.; Rappoport, N.; Linial, M. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 2022, 38, 2102–2110, DOI: 10.1093/bioinformatics/btac020
- 178. Luo, S.; Chen, T.; Xu, Y.; Zheng, S.; Liu, T.-Y.; Wang, L.; He, D. One Transformer Can Understand Both 2D & 3D Molecular Data. arXiv, 2022.
- 179. Clark, K.; Luong, M.-T.; Le, Q. V.; Manning, C. D. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. arXiv, 2020.
- 180. Wang, J.; Wen, N.; Wang, C.; Zhao, L.; Cheng, L. ELECTRA-DTA: a new compound-protein binding affinity prediction model based on the contextualized sequence encoding. J. Cheminform. 2022, 14, 14, DOI: 10.1186/s13321-022-00591-x
- 181. Shin, B.; Park, S.; Kang, K.; Ho, J. C. Self-Attention Based Molecule Representation for Predicting Drug-Target Interaction. Proceedings of the 4th Machine Learning for Healthcare Conference 2019, 230–248.
- 182. Huang, K.; Xiao, C.; Glass, L. M.; Sun, J. MolTrans: Molecular Interaction Transformer for drug–target interaction prediction. Bioinformatics 2021, 37, 830–836, DOI: 10.1093/bioinformatics/btaa880
- 183. Shen, L.; Feng, H.; Qiu, Y.; Wei, G.-W. SVSBI: sequence-based virtual screening of biomolecular interactions. Commun. Biol. 2023, 6, 536, DOI: 10.1038/s42003-023-04866-3
- 184. Wang, J.; Hu, J.; Sun, H.; Xu, M.; Yu, Y.; Liu, Y.; Cheng, L. MGPLI: exploring multigranular representations for protein–ligand interaction prediction. Bioinformatics 2022, 38, 4859–4867, DOI: 10.1093/bioinformatics/btac597
- 185. Qian, Y.; Wu, J.; Zhang, Q. CAT-CPI: Combining CNN and transformer to learn compound image features for predicting compound-protein interactions. Front. Mol. Biosci. 2022, 9, 963912, DOI: 10.3389/fmolb.2022.963912
- 186. Cang, Z.; Mu, L.; Wei, G.-W. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLoS Comput. Biol. 2018, 14, e1005929, DOI: 10.1371/journal.pcbi.1005929
- 187. Chen, D.; Liu, J.; Wei, G.-W. Multiscale topology-enabled structure-to-sequence transformer for protein-ligand interaction predictions. Nat. Mach. Intell. 2024, 6, 799–810, DOI: 10.1038/s42256-024-00855-1
- 188. Bubeck, S.; Chandrasekaran, V.; Eldan, R.; Gehrke, J.; Horvitz, E.; Kamar, E.; Lee, P.; Lee, Y. T.; Li, Y.; Lundberg, S.; Nori, H.; Palangi, H.; Ribeiro, M. T.; Zhang, Y. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv, 2023.
- 189. Elnaggar, A.; Essam, H.; Salah-Eldin, W.; Moustafa, W.; Elkerdawy, M.; Rochereau, C.; Rost, B. Ankh: Optimized protein language model unlocks general-purpose modelling. bioRxiv, 2023.
- 190. Hwang, Y.; Cornman, A. L.; Kellogg, E. H.; Ovchinnikov, S.; Girguis, P. R. Genomic language model predicts protein co-regulation and function. Nat. Commun. 2024, 15, 2880, DOI: 10.1038/s41467-024-46947-9
- 191. Vu, M. H.; Akbar, R.; Robert, P. A.; Swiatczak, B.; Greiff, V.; Sandve, G. K.; Haug, D. T. T. Linguistically inspired roadmap for building biologically reliable protein language models. arXiv, 2022.
- 192. Xu, M.; Zhang, Z.; Lu, J.; Zhu, Z.; Zhang, Y.; Ma, C.; Liu, R.; Tang, J. PEER: A comprehensive and multi-task benchmark for Protein sEquence undERstanding. arXiv, 2022, 35156–35173.
- 193. Schmirler, R.; Heinzinger, M.; Rost, B. Fine-tuning protein language models boosts predictions across diverse tasks. Nat. Commun. 2024, 15, 7407, DOI: 10.1038/s41467-024-51844-2
- 194. Heinzinger, M.; Elnaggar, A.; Wang, Y.; Dallago, C.; Nechaev, D.; Matthes, F.; Rost, B. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 2019, 20, 723, DOI: 10.1186/s12859-019-3220-8
- 195. Manfredi, M.; Savojardo, C.; Martelli, P. L.; Casadio, R. E-SNPs&GO: embedding of protein sequence and function improves the annotation of human pathogenic variants. Bioinformatics 2022, 38, 5168–5174, DOI: 10.1093/bioinformatics/btac678
- 196. Anteghini, M.; Martins Dos Santos, V.; Saccenti, E. In-Pero: Exploiting deep learning embeddings of protein sequences to predict the localisation of peroxisomal proteins. Int. J. Mol. Sci. 2021, 22, 6409, DOI: 10.3390/ijms22126409
- 197. Nguyen, T.; Le, H.; Quinn, T. P.; Nguyen, T.; Le, T. D.; Venkatesh, S. GraphDTA: predicting drug–target binding affinity with graph neural networks. Bioinformatics 2021, 37, 1140–1147, DOI: 10.1093/bioinformatics/btaa921
- 198. Nam, H.; Ha, J.-W.; Kim, J. Dual attention networks for multimodal reasoning and matching. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, 299–307.
- 199. Wang, X.; Liu, D.; Zhu, J.; Rodriguez-Paton, A.; Song, T. CSConv2d: A 2-D Structural Convolution Neural Network with a Channel and Spatial Attention Mechanism for Protein-Ligand Binding Affinity Prediction. Biomolecules 2021, DOI: 10.3390/biom11050643
- 200. Anteghini, M.; Santos, V. A. M. D.; Saccenti, E. PortPred: Exploiting deep learning embeddings of amino acid sequences for the identification of transporter proteins and their substrates. J. Cell. Biochem. 2023, 124, 1803, DOI: 10.1002/jcb.30490
- 201. Huang, K.; Fu, T.; Glass, L. M.; Zitnik, M.; Xiao, C.; Sun, J. DeepPurpose: a deep learning library for drug–target interaction prediction. Bioinformatics 2021, 36, 5545–5547, DOI: 10.1093/bioinformatics/btaa1005
- 202. Zhao, L.; Wang, J.; Pang, L.; Liu, Y.; Zhang, J. GANsDTA: Predicting Drug-Target Binding Affinity Using GANs. Front. Genet. 2020, 10, 1243, DOI: 10.3389/fgene.2019.01243
- 203. Hu, F.; Jiang, J.; Wang, D.; Zhu, M.; Yin, P. Multi-PLI: interpretable multi-task deep learning model for unifying protein–ligand interaction datasets. J. Cheminform. 2021, 13, 30, DOI: 10.1186/s13321-021-00510-6
- 204. Zheng, S.; Li, Y.; Chen, S.; Xu, J.; Yang, Y. Predicting Drug Protein Interaction using Quasi-Visual Question Answering System. bioRxiv 2019, 588178.
- 205. Tsubaki, M.; Tomii, K.; Sese, J. Compound–protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics 2019, 35, 309–318, DOI: 10.1093/bioinformatics/bty535
- 206. Karimi, M.; Wu, D.; Wang, Z.; Shen, Y. DeepAffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks. Bioinformatics 2019, 35, 3329–3338, DOI: 10.1093/bioinformatics/btz111
- 207. Li, S.; Wan, F.; Shu, H.; Jiang, T.; Zhao, D.; Zeng, J. MONN: a multi-objective neural network for predicting compound-protein interactions and affinities. Cell Syst. 2020, 10, 308–322, DOI: 10.1016/j.cels.2020.03.002
- 208. Zhao, M.; Yuan, M.; Yang, Y.; Xu, S. X. CPGL: Prediction of Compound-Protein Interaction by Integrating Graph Attention Network With Long Short-Term Memory Neural Network. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023, 20, 1935–1942, DOI: 10.1109/TCBB.2022.3225296
- 209. Yu, L.; Qiu, W.; Lin, W.; Cheng, X.; Xiao, X.; Dai, J. HGDTI: predicting drug–target interaction by using information aggregation based on heterogeneous graph neural network. BMC Bioinformatics 2022, 23, 126, DOI: 10.1186/s12859-022-04655-5
- 210. Lee, I.; Nam, H. Sequence-based prediction of protein binding regions and drug-target interactions. J. Cheminform. 2022, 14, 5, DOI: 10.1186/s13321-022-00584-w
- 211. Gönen, M.; Heller, G. Concordance probability and discriminatory power in proportional hazards regression. Biometrika 2005, 92, 965–970, DOI: 10.1093/biomet/92.4.965
- 212. Deller, M. C.; Rupp, B. Models of protein-ligand crystal structures: trust, but verify. J. Comput. Aided Mol. Des. 2015, 29, 817–836, DOI: 10.1007/s10822-015-9833-8
- 213. Kalakoti, Y.; Yadav, S.; Sundar, D. TransDTI: Transformer-based language models for estimating DTIs and building a drug recommendation workflow. ACS Omega 2022, 7, 2706–2717, DOI: 10.1021/acsomega.1c05203
- 214. Chatterjee, A.; Walters, R.; Shafi, Z.; Ahmed, O. S.; Sebek, M.; Gysi, D.; Yu, R.; Eliassi-Rad, T.; Barabási, A.-L.; Menichetti, G. Improving the generalizability of protein-ligand binding predictions with AI-Bind. Nat. Commun. 2023, 14, 1989, DOI: 10.1038/s41467-023-37572-z
- 215. Gudivada, V.; Apon, A.; Ding, J. Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. Int. J. Adv. Softw. 2017, 10, 1–20.
- 216. Nasteski, V. An overview of the supervised machine learning methods. Horizons 2017, 4, 51–62, DOI: 10.20544/HORIZONS.B.04.1.17.P05
- 217. Kozlov, M. So you got a null result. Will anyone publish it? Nature 2024, 631, 728–730, DOI: 10.1038/d41586-024-02383-9
- 218. Edfeldt, K. A data science roadmap for open science organizations engaged in early-stage drug discovery. Nat. Commun. 2024, 15, 5640, DOI: 10.1038/s41467-024-49777-x
- 219. Mlinarić, A.; Horvat, M.; Šupak Smolčić, V. Dealing with the positive publication bias: Why you should really publish your negative results. Biochem. Med. 2017, 27, 030201, DOI: 10.11613/BM.2017.030201
- 220. Fanelli, D. Negative results are disappearing from most disciplines and countries. Scientometrics 2012, 90, 891–904, DOI: 10.1007/s11192-011-0494-7
- 221. Albalate, A.; Minker, W. Semi-supervised and unsupervised machine learning: Novel strategies; Wiley-ISTE, 2013.
- 222. Sajadi, S. Z.; Zare Chahooki, M. A.; Gharaghani, S.; Abbasi, K. AutoDTI++: deep unsupervised learning for DTI prediction by autoencoders. BMC Bioinformatics 2021, 22, 204, DOI: 10.1186/s12859-021-04127-2
- 223. Najm, M.; Azencott, C.-A.; Playe, B.; Stoven, V. Drug Target Identification with Machine Learning: How to Choose Negative Examples. Int. J. Mol. Sci. 2021, 22, 5118, DOI: 10.3390/ijms22105118
- 224. Sieg, J.; Flachsenberg, F.; Rarey, M. In Need of Bias Control: Evaluating Chemical Data for Machine Learning in Structure-Based Virtual Screening. J. Chem. Inf. Model. 2019, 59, 947–961, DOI: 10.1021/acs.jcim.8b00712
- 225. Volkov, M.; Turk, J.-A.; Drizard, N.; Martin, N.; Hoffmann, B.; Gaston-Mathé, Y.; Rognan, D. On the Frustration to Predict Binding Affinities from Protein–Ligand Structures with Deep Neural Networks. J. Med. Chem. 2022, 65, 7946–7958, DOI: 10.1021/acs.jmedchem.2c00487
- 226. Shivakumar, D.; Williams, J.; Wu, Y.; Damm, W.; Shelley, J.; Sherman, W. Prediction of absolute solvation free energies using molecular dynamics free energy perturbation and the OPLS force field. J. Chem. Theory Comput. 2010, 6, 1509–1519, DOI: 10.1021/ct900587b
- 227. El Hage, K.; Mondal, P.; Meuwly, M. Free energy simulations for protein ligand binding and stability. Mol. Simul. 2018, 44, 1044–1061, DOI: 10.1080/08927022.2017.1416115
- 228Ngo, S. T.; Pham, M. Q. Umbrella sampling-based method to compute ligand-binding affinity. Methods Mol. Biol. 2022, 2385, 313– 323, DOI: 10.1007/978-1-0716-1767-0_14Google ScholarThere is no corresponding record for this reference.
- 229Pandey, M.; Fernandez, M.; Gentile, F.; Isayev, O.; Tropsha, A.; Stern, A. C.; Cherkasov, A. The transformational role of GPU computing and deep learning in drug discovery. Nat. Mach. Intell. 2022, 4, 211– 221, DOI: 10.1038/s42256-022-00463-xGoogle ScholarThere is no corresponding record for this reference.
- 230Bibal, A.; Cardon, R.; Alfter, D.; Wilkens, R.; Wang, X.; François, T.; Watrin, P. Is Attention Explanation? An Introduction to the Debate. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland, 2022, 3889– 3900Google ScholarThere is no corresponding record for this reference.
- 231Wiegreffe, S.; Pinter, Y. Attention is not not Explanation. arXiv , 2019.Google ScholarThere is no corresponding record for this reference.
- 232Jain, S.; Wallace, B. C. Attention is not Explanation. arXiv , 2019.Google ScholarThere is no corresponding record for this reference.
- 233Lundberg, S. M.; Lee, S.-I. A unified approach to interpreting model predictions. Neural Inf. Process. Syst. 2017, 30, 4765– 4774Google ScholarThere is no corresponding record for this reference.
- 234Gu, Y.; Zhang, X.; Xu, A.; Chen, W.; Liu, K.; Wu, L.; Mo, S.; Hu, Y.; Liu, M.; Luo, Q. Protein-ligand binding affinity prediction with edge awareness and supervised attention. iScience 2023, 26, 105892, DOI: 10.1016/j.isci.2022.105892Google ScholarThere is no corresponding record for this reference.
- 235 Rodis, N.; Sardianos, C.; Papadopoulos, G. T.; Radoglou-Grammatikis, P.; Sarigiannidis, P.; Varlamis, I. Multimodal Explainable Artificial Intelligence: A Comprehensive Review of Methodological Advances and Future Research Directions. arXiv, 2023.
- 236 Gilpin, L. H.; Bau, D.; Yuan, B. Z.; Bajwa, A.; Specter, M.; Kagal, L. Explaining explanations: An overview of interpretability of machine learning. 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA) 2018, 80–89
- 237 Luo, D.; Liu, D.; Qu, X.; Dong, L.; Wang, B. Enhancing generalizability in protein-ligand binding affinity prediction with multimodal contrastive learning. J. Chem. Inf. Model. 2024, 64, 1892–1906, DOI: 10.1021/acs.jcim.3c01961
- 238 Rao, R.; Bhattacharya, N.; Thomas, N.; Duan, Y.; Chen, X.; Canny, J.; Abbeel, P.; Song, Y. S. Evaluating protein transfer learning with TAPE. bioRxiv, 2019.
- 239 Suzek, B. E.; Wang, Y.; Huang, H.; McGarvey, P. B.; Wu, C. H.; UniProt Consortium. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 2015, 31, 926–932, DOI: 10.1093/bioinformatics/btu739
- 240 Eguida, M.; Rognan, D. A Computer Vision Approach to Align and Compare Protein Cavities: Application to Fragment-Based Drug Design. J. Med. Chem. 2020, 63, 7127–7142, DOI: 10.1021/acs.jmedchem.0c00422
- 241 Öztürk, H.; Özgür, A.; Ozkirimli, E. DeepDTA: deep drug-target binding affinity prediction. Bioinformatics 2018, 34, i821–i829, DOI: 10.1093/bioinformatics/bty593
- 242 Evans, R. Protein complex prediction with AlphaFold-Multimer. bioRxiv, 2021.
- 243 Omidi, A.; Møller, M. H.; Malhis, N.; Bui, J. M.; Gsponer, J. AlphaFold-Multimer accurately captures interactions and dynamics of intrinsically disordered protein regions. Proc. Natl. Acad. Sci. U. S. A. 2024, 121, e2406407121, DOI: 10.1073/pnas.2406407121
- 244 Zhu, W.; Shenoy, A.; Kundrotas, P.; Elofsson, A. Evaluation of AlphaFold-Multimer prediction on multi-chain protein complexes. Bioinformatics 2023, 39, btad424, DOI: 10.1093/bioinformatics/btad424
- 245 Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) 2022, 10684–10695
- 246 Dhariwal, P.; Nichol, A. Diffusion models beat GANs on image synthesis. Neural Inf. Process. Syst. 2021, 8780–8794
- 247 Karras, T.; Aittala, M.; Aila, T.; Laine, S. Elucidating the Design Space of Diffusion-Based Generative Models. Adv. Neural Inf. Process. Syst. 2022, 26565–26577
- 248 Buttenschoen, M.; Morris, G.; Deane, C. PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences. Chem. Sci. 2024, 15, 3130–3139, DOI: 10.1039/D3SC04185A
- 249 Wee, J.; Wei, G.-W. Benchmarking AlphaFold3's protein-protein complex accuracy and machine learning prediction reliability for binding free energy changes upon mutation. arXiv, 2024.
- 250 Bernard, C.; Postic, G.; Ghannay, S.; Tahi, F. Has AlphaFold 3 reached its success for RNAs? bioRxiv, 2024.
- 251 Zonta, F.; Pantano, S. From sequence to mechanobiology? Promises and challenges for AlphaFold 3. Mechanobiology in Medicine 2024, 2, 100083, DOI: 10.1016/j.mbm.2024.100083
- 252 He, X.-H.; Li, J.-R.; Shen, S.-Y.; Xu, H. E. AlphaFold3 versus experimental structures: assessment of the accuracy in ligand-bound G protein-coupled receptors. Acta Pharmacol. Sin. 2024, 1–12, DOI: 10.1038/s41401-024-01429-y
- 253 Desai, D.; Kantliwala, S. V.; Vybhavi, J.; Ravi, R.; Patel, H.; Patel, J. Review of AlphaFold 3: Transformative advances in drug design and therapeutics. Cureus 2024, 16, e63646, DOI: 10.7759/cureus.63646
- 254 Baek, M. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021, 373, 871–876, DOI: 10.1126/science.abj8754
- 255 Ahdritz, G. OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nat. Methods 2024, 21, 1514–1524, DOI: 10.1038/s41592-024-02272-z
- 256 Liao, C.; Yu, Y.; Mei, Y.; Wei, Y. From words to molecules: A survey of Large Language Models in chemistry. arXiv, 2024.
- 257 Bagal, V.; Aggarwal, R.; Vinod, P. K.; Priyakumar, U. D. MolGPT: Molecular generation using a transformer-decoder model. J. Chem. Inf. Model. 2022, 62, 2064–2076, DOI: 10.1021/acs.jcim.1c00600
- 258 Janakarajan, N.; Erdmann, T.; Swaminathan, S.; Laino, T.; Born, J. Language models in molecular discovery. arXiv, 2023.
- 259 Park, Y.; Metzger, B. P. H.; Thornton, J. W. The simplicity of protein sequence-function relationships. Nat. Commun. 2024, 15, 7953, DOI: 10.1038/s41467-024-51895-5
- 260 Stahl, K.; Warneke, R.; Demann, L.; Bremenkamp, R.; Hormes, B.; Brock, O.; Stülke, J.; Rappsilber, J. Modelling protein complexes with crosslinking mass spectrometry and deep learning. Nat. Commun. 2024, 15, 7866, DOI: 10.1038/s41467-024-51771-2
- 261 Senior, A. W. Improved protein structure prediction using potentials from deep learning. Nature 2020, 577, 706–710, DOI: 10.1038/s41586-019-1923-7
- 262 Kramer, M. A. Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 1991, 37, 233–243, DOI: 10.1002/aic.690370209
- 263 Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754, DOI: 10.1021/ci100050t
- 264 Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. arXiv, 2017.
- 265 Kipf, T. N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv, 2016.
- 266 Xu, Z.; Wang, S.; Zhu, F.; Huang, J. Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery. Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, New York, NY, USA, 2017, 285–294
- 267 Gilmer, J.; Schoenholz, S.; Riley, P. F.; Vinyals, O.; Dahl, G. E. Neural Message Passing for Quantum Chemistry. ICML 2017, 1263–1272
- 268 Asgari, E.; Mofrad, M. R. K. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS One 2015, 10, e0141287, DOI: 10.1371/journal.pone.0141287
- 269 He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2015, 770–778
- 270 Öztürk, H.; Ozkirimli, E.; Özgür, A. A novel methodology on distributed representations of proteins using their interacting ligands. Bioinformatics 2018, 34, i295–i303, DOI: 10.1093/bioinformatics/bty287
- 271 Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. Proc. IEEE Conf. Comput. Vis. Pattern Recognition 2018, 7132–7141
Figure 1. Language of protein sequences and the ligand SMILES representation: NLP methods can be applied to text representations to infer local and global properties of human language, proteins, and molecules alike. Local properties are inferred from subsequences in text: (left) for human language, this includes a part of speech or role a word serves; (middle) for protein sequences, this includes motifs, functional sites, and domains; and (right) for SMILES strings, this can include functional groups and special characters used in SMILES syntax to indicate chemical attributes. Similarly, global properties can theoretically be inferred from a text in its entirety.
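To make the analogy in Figure 1 concrete, the short Python sketch below tokenizes a protein sequence into overlapping k-mer "words" and splits a SMILES string into chemically meaningful symbols. The k-mer size and the simplified SMILES token pattern are illustrative assumptions for this discussion, not the tokenizers of any specific model reviewed here.

```python
import re

def protein_kmers(sequence: str, k: int = 3) -> list:
    """Tokenize a protein sequence into overlapping k-mer 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Simplified SMILES token pattern (an illustrative assumption): bracketed atoms,
# two-letter halogens, organic-subset atoms, ring-closure digits, bonds, and branches.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnops]|[0-9]|[()=#+\-/\\@.%]")

def smiles_tokens(smiles: str) -> list:
    """Split a SMILES string into NLP-style tokens."""
    return SMILES_TOKEN.findall(smiles)

print(protein_kmers("MKTAYIAKQR"))             # ['MKT', 'KTA', 'TAY', ...]
print(smiles_tokens("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin -> ['C', 'C', '(', '=', 'O', ')', ...]
```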
Figure 2. Summary of the data preparation, model creation, and model evaluation workflow. Model Creation for PLI studies follows an Extract-Fuse-Predict Framework: input protein and ligand data are extracted and embedded, combined, and passed into a machine learning model to generate predictions.
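The Extract-Fuse-Predict pattern summarized in Figure 2 can be illustrated with a minimal sketch. The snippet below assumes PyTorch, token-index inputs, recurrent encoders, and invented layer sizes; it is a schematic of the workflow under those assumptions, not the architecture of any published PLI model.

```python
import torch
import torch.nn as nn

class PLIPredictor(nn.Module):
    """Schematic Extract-Fuse-Predict model: encode each input, concatenate, predict."""
    def __init__(self, protein_vocab=26, ligand_vocab=64, embed_dim=128, hidden_dim=256):
        super().__init__()
        # Extract: separate encoders for protein residues and SMILES tokens.
        self.protein_embed = nn.Embedding(protein_vocab, embed_dim, padding_idx=0)
        self.ligand_embed = nn.Embedding(ligand_vocab, embed_dim, padding_idx=0)
        self.protein_encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.ligand_encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Predict: fused representation -> a single binding score (e.g., affinity).
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, protein_ids, ligand_ids):
        _, h_prot = self.protein_encoder(self.protein_embed(protein_ids))
        _, h_lig = self.ligand_encoder(self.ligand_embed(ligand_ids))
        # Fuse: concatenate the final hidden states of both encoders.
        fused = torch.cat([h_prot[-1], h_lig[-1]], dim=-1)
        return self.head(fused).squeeze(-1)

model = PLIPredictor()
protein_ids = torch.randint(1, 26, (4, 200))  # batch of 4 tokenized proteins
ligand_ids = torch.randint(1, 64, (4, 60))    # batch of 4 tokenized SMILES
print(model(protein_ids, ligand_ids).shape)   # torch.Size([4])
```

In practice, the recurrent encoders in this sketch could be swapped for CNNs, transformers, or pretrained embeddings, and the fusion step for attention or other mixing operations, without changing the overall Extract-Fuse-Predict shape of the workflow.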
Figure 3. Framework diagrams for the RNN (and its LSTM variant), the transformer, and the attention mechanism, with arrows representing the flow of information. (A) The "unrolled" structure of an RNN and its recurrent units, where hidden states propagate across time steps. The recurrent unit takes the current token Xt as input, combines it with the current hidden state ht, and computes their weighted sum before generating the output Ot and an updated hidden state ht+1. The weighted sums depend on the associated network weights Wxh, Whh, and Woh, which connect input to hidden state, hidden state to hidden state, and hidden state to output, respectively. The LSTM differs in that a memory state is updated at each step, facilitating the learning of long-term dependencies. (B) A simplified view of a transformer's encoder-decoder architecture and its attention mechanism. A scaled dot product of the Query and Key vectors yields attention weights that can provide interpretability, and the new embedding (output) vector is updated according to these weights.
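The attention step summarized in panel B can be written compactly. The NumPy sketch below implements generic single-head scaled dot-product attention (no masking or multi-head logic); the returned weight matrix is the quantity typically inspected for interpretability.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Generic single-head attention: softmax(Q K^T / sqrt(d)) applied to V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # each row sums to 1
    return weights @ V, weights                          # weighted sum of values + weights

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 16)) for _ in range(3))
output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape, attn.shape)                          # (5, 16) (5, 5)
```

In PLI models, the rows and columns of such a weight matrix can correspond to protein residues and ligand atoms, which is how heatmaps like those in Figure 4 are produced.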
Figure 4. Sample attention weights relating a protein and a ligand. The heatmaps on the left visualize the weighted importance of selected protein residues and ligand atoms in a PLI. Structural views of the protein–ligand binding pocket are shown in the middle, with insets of the 2D ligand structures on the right. Colored residues and red highlights indicate amino acids in the protein binding pocket and ligand atoms with high attention scores. Reproduced from Figure 7 of Wu et al. (148) under license CC BY 4.0. Copyright 2023 The Author(s). Published by Elsevier Ltd.
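For readers who wish to render such heatmaps from their own models, a minimal matplotlib sketch is given below. The residue and atom labels are invented placeholders, and attn stands for any residues-by-atoms attention-weight matrix, such as the one returned by the sketch following Figure 3.

```python
import numpy as np
import matplotlib.pyplot as plt

residues = ["TYR32", "ASP86", "LYS101", "PHE110"]   # placeholder binding-pocket residues
atoms = ["C1", "N2", "O3", "C4", "C5"]              # placeholder ligand atoms
attn = np.random.default_rng(1).random((len(residues), len(atoms)))  # stand-in weights

fig, ax = plt.subplots()
im = ax.imshow(attn, cmap="Reds")                   # darker cells = higher attention
ax.set_xticks(range(len(atoms)))
ax.set_xticklabels(atoms)
ax.set_yticks(range(len(residues)))
ax.set_yticklabels(residues)
fig.colorbar(im, ax=ax, label="attention weight")
plt.show()
```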
References
This article references 271 other publications.
- 1 Songyang, Z.; Cantley, L. C. Recognition and specificity in protein tyrosine kinase-mediated signalling. Trends Biochem. Sci. 1995, 20, 470–475, DOI: 10.1016/S0968-0004(00)89103-3
- 2 Johnson, L. N.; Lowe, E. D.; Noble, M. E.; Owen, D. J. The Eleventh Datta Lecture. The structural basis for substrate recognition and control by protein kinases. FEBS Lett. 1998, 430, 1–11, DOI: 10.1016/S0014-5793(98)00606-1
- 3 Kristiansen, K. Molecular mechanisms of ligand binding, signaling, and regulation within the superfamily of G-protein-coupled receptors: molecular modeling and mutagenesis approaches to receptor structure and function. Pharmacol. Ther. 2004, 103, 21–80, DOI: 10.1016/j.pharmthera.2004.05.002
- 4 West, I. C. What determines the substrate specificity of the multi-drug-resistance pump? Trends Biochem. Sci. 1990, 15, 42–46, DOI: 10.1016/0968-0004(90)90171-7
- 5 Vivier, E.; Malissen, B. Innate and adaptive immunity: specificities and signaling hierarchies revisited. Nat. Immunol. 2005, 6, 17–21, DOI: 10.1038/ni1153
- 6 Desvergne, B.; Michalik, L.; Wahli, W. Transcriptional regulation of metabolism. Physiol. Rev. 2006, 86, 465–514, DOI: 10.1152/physrev.00025.2005
- 7 Atkinson, D. E. Biological feedback control at the molecular level: Interaction between metabolite-modulated enzymes seems to be a major factor in metabolic regulation. Science 1965, 150, 851–857, DOI: 10.1126/science.150.3698.851
- 8 Huang, S.-Y.; Zou, X. Advances and challenges in protein-ligand docking. Int. J. Mol. Sci. 2010, 11, 3016–3034, DOI: 10.3390/ijms11083016
- 9 Chaires, J. B. Calorimetry and thermodynamics in drug design. Annu. Rev. Biophys. 2008, 37, 135–151, DOI: 10.1146/annurev.biophys.36.040306.132812
- 10 Serhan, C. N. Signalling the fat controller. Nature 1996, 384, 23–24, DOI: 10.1038/384023a0
- 11 McAllister, C. H.; Beatty, P. H.; Good, A. G. Engineering nitrogen use efficient crop plants: the current status. Plant Biotechnol. J. 2012, 10, 1011–1025, DOI: 10.1111/j.1467-7652.2012.00700.x
- 12 Goldsmith, M.; Tawfik, D. S. Enzyme engineering: reaching the maximal catalytic efficiency peak. Curr. Opin. Struct. Biol. 2017, 47, 140–150, DOI: 10.1016/j.sbi.2017.09.002
- 13 Vajda, S.; Guarnieri, F. Characterization of protein-ligand interaction sites using experimental and computational methods. Curr. Opin. Drug Discovery Devel. 2006, 9, 354–362
- 14 Du, X.; Li, Y.; Xia, Y.-L.; Ai, S.-M.; Liang, J.; Sang, P.; Ji, X.-L.; Liu, S.-Q. Insights into protein-ligand interactions: Mechanisms, models, and methods. Int. J. Mol. Sci. 2016, 17, 144, DOI: 10.3390/ijms17020144
- 15 Fan, F. J.; Shi, Y. Effects of data quality and quantity on deep learning for protein-ligand binding affinity prediction. Bioorg. Med. Chem. 2022, 72, 117003, DOI: 10.1016/j.bmc.2022.117003
- 16 Sousa, S. F.; Ribeiro, A. J. M.; Coimbra, J. T. S.; Neves, R. P. P.; Martins, S. A.; Moorthy, N. S. H. N.; Fernandes, P. A.; Ramos, M. J. Protein-Ligand Docking in the New Millennium: A Retrospective of 10 Years in the Field. Curr. Med. Chem. 2013, 20, 2296–2314, DOI: 10.2174/0929867311320180002
- 17 Morris, C. J.; Corte, D. D. Using molecular docking and molecular dynamics to investigate protein-ligand interactions. Mod. Phys. Lett. B 2021, 35, 2130002, DOI: 10.1142/S0217984921300027
- 18 Lecina, D.; Gilabert, J. F.; Guallar, V. Adaptive simulations, towards interactive protein-ligand modeling. Sci. Rep. 2017, 7, 8466, DOI: 10.1038/s41598-017-08445-5
- 19 Ikebata, H.; Hongo, K.; Isomura, T.; Maezono, R.; Yoshida, R. Bayesian molecular design with a chemical language model. J. Comput. Aided Mol. Des. 2017, 31, 379–391, DOI: 10.1007/s10822-016-0008-z
- 20 Rives, A.; Meier, J.; Sercu, T.; Goyal, S.; Lin, Z.; Liu, J.; Guo, D.; Ott, M.; Zitnick, C. L.; Ma, J.; Fergus, R. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. U. S. A. 2021, DOI: 10.1073/pnas.2016239118
- 21 Cao, Y.; Shen, Y. TALE: Transformer-based protein function Annotation with joint sequence–Label Embedding. Bioinformatics 2021, 37, 2825–2833, DOI: 10.1093/bioinformatics/btab198
- 22 Wang, S.; Guo, Y.; Wang, Y.; Sun, H.; Huang, J. SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction. Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, New York, NY, USA, 2019, 429–436
- 23 Chithrananda, S.; Grand, G.; Ramsundar, B. ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. arXiv, 2020.
- 24 Kumar, N.; Acharya, V. Machine intelligence-driven framework for optimized hit selection in virtual screening. J. Cheminform. 2022, 14, 48, DOI: 10.1186/s13321-022-00630-7
- 25 Erikawa, D.; Yasuo, N.; Sekijima, M. MERMAID: an open source automated hit-to-lead method based on deep reinforcement learning. J. Cheminform. 2021, 13, 94, DOI: 10.1186/s13321-021-00572-6
- 26 Zhou, M.; Duan, N.; Liu, S.; Shum, H.-Y. Progress in neural NLP: Modeling, learning, and reasoning. Engineering (Beijing) 2020, 6, 275–290, DOI: 10.1016/j.eng.2019.12.014
- 27 Patwardhan, N.; Marrone, S.; Sansone, C. Transformers in the real world: A survey on NLP applications. Inf. 2023, 14, 242, DOI: 10.3390/info14040242
- 28 Bijral, R. K.; Singh, I.; Manhas, J.; Sharma, V. Exploring Artificial Intelligence in Drug Discovery: A Comprehensive Review. Arch. Comput. Methods Eng. 2022, 29, 2513–2529, DOI: 10.1007/s11831-021-09661-z
- 29 Ray, P. P. ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems 2023, 3, 121–154, DOI: 10.1016/j.iotcps.2023.04.003
- 30 Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. https://www.mikecaptain.com/resources/pdf/GPT-1.pdf, accessed 2023-10-27.
- 31 Goodside, R.; Papay, S. Meet Claude: Anthropic's Rival to ChatGPT. https://scale.com/blog/chatgpt-vs-claude, 2023.
- 32 Bing Copilot. https://copilot.microsoft.com/.
- 33 Rahul; Adhikari, S.; Monika. NLP based Machine Learning Approaches for Text Summarization. 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC) 2020, 535–538
- 34 Nasukawa, T.; Yi, J. Sentiment analysis: capturing favorability using natural language processing. Proceedings of the 2nd International Conference on Knowledge Capture, New York, NY, USA, 2003, 70–77
- 35 Lample, G.; Charton, F. Deep Learning for Symbolic Mathematics. arXiv, 2019.
- 36Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; Zhou, M. CodeBERT: APre-Trained Model for Programming and Natural Languages. arXiv , 2020.There is no corresponding record for this reference.
- 37Mielke, S. J.; Alyafeai, Z.; Salesky, E.; Raffel, C.; Dey, M.; Gallé, M.; Raja, A.; Si, C.; Lee, W. Y.; Sagot, B.; Tan, S. Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP. arXiv , 2021.There is no corresponding record for this reference.
- 38Camacho-Collados, J.; Pilehvar, M. T. From word to sense embeddings: A survey on vector representations of meaning. J. Artif. Intell. Res. 2018, 63, 743– 788, DOI: 10.1613/jair.1.11259There is no corresponding record for this reference.
- 39Ashok, V. G.; Feng, S.; Choi, Y. Success with style: Using writing style to predict the success of novelsd.There is no corresponding record for this reference.
- 40. Barberá, P.; Boydstun, A. E.; Linn, S.; McMahon, R.; Nagler, J. Automated text classification of news articles: A practical guide. Polit. Anal. 2021, 29, 19–42, DOI: 10.1017/pan.2020.8
- 41. Wang, H.; Wu, H.; He, Z.; Huang, L.; Church, K. W. Progress in machine translation. Engineering (Beijing) 2022, 18, 143–153, DOI: 10.1016/j.eng.2021.03.023
- 42. Sønderby, S. K.; Winther, O. Protein Secondary Structure Prediction with Long Short Term Memory Networks. arXiv, 2014.
- 43. Guo, Y.; Li, W.; Wang, B.; Liu, H.; Zhou, D. DeepACLSTM: deep asymmetric convolutional long short-term memory neural models for protein secondary structure prediction. BMC Bioinformatics 2019, 20, 341, DOI: 10.1186/s12859-019-2940-0
- 44. Bhasuran, B.; Natarajan, J. Automatic extraction of gene-disease associations from literature using joint ensemble learning. PLoS One 2018, 13, e0200699, DOI: 10.1371/journal.pone.0200699
- 45. Pang, M.; Su, K.; Li, M. Leveraging information in spatial transcriptomics to predict super-resolution gene expression from histology images in tumors. bioRxiv, 2021, 2021.11.28.470212.
- 46. Bouatta, N.; Sorger, P.; AlQuraishi, M. Protein structure prediction by AlphaFold2: are attention and symmetries all you need?. Acta Crystallogr. D Struct. Biol. 2021, 77, 982–991, DOI: 10.1107/S2059798321007531
- 47. Skolnick, J.; Gao, M.; Zhou, H.; Singh, S. AlphaFold 2: Why It Works and Its Implications for Understanding the Relationships of Protein Sequence, Structure, and Function. J. Chem. Inf. Model. 2021, 61, 4827–4831, DOI: 10.1021/acs.jcim.1c01114
- 48. Adadi, A.; Berrada, M. Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 2018, 6, 52138, DOI: 10.1109/ACCESS.2018.2870052
- 49. Box, G. E. P. Science and Statistics. J. Am. Stat. Assoc. 1976, 71, 791–799, DOI: 10.1080/01621459.1976.10480949
- 50. Geirhos, R.; Jacobsen, J.-H.; Michaelis, C.; Zemel, R.; Brendel, W.; Bethge, M.; Wichmann, F. A. Shortcut learning in deep neural networks. Nature Machine Intelligence 2020, 2, 665–673, DOI: 10.1038/s42256-020-00257-z
- 51. Outeiral, C.; Nissley, D. A.; Deane, C. M. Current structure predictors are not learning the physics of protein folding. Bioinformatics 2022, 38, 1881–1887, DOI: 10.1093/bioinformatics/btab881
- 52. Steels, L. Modeling the cultural evolution of language. Phys. Life Rev. 2011, 8, 339–356, DOI: 10.1016/j.plrev.2011.10.014
- 53. Maurya, H. C.; Gupta, P.; Choudhary, N. Natural language ambiguity and its effect on machine learning. Int. J. Modern Eng. Res. 2015, 5, 25–30
- 54. Tenney, I.; Xia, P.; Chen, B.; Wang, A.; Poliak, A.; McCoy, R. T.; Kim, N.; Van Durme, B.; Bowman, S. R.; Das, D.; Pavlick, E. What do you learn from context? Probing for sentence structure in contextualized word representations. arXiv, 2019.
- 55. Miyagawa, S.; Berwick, R. C.; Okanoya, K. The emergence of hierarchical structure in human language. Front. Psychol. 2013, 4, 71, DOI: 10.3389/fpsyg.2013.00071
- 56. Liu, H.; Xu, C.; Liang, J. Dependency distance: A new perspective on syntactic patterns in natural languages. Phys. Life Rev. 2017, 21, 171–193, DOI: 10.1016/j.plrev.2017.03.002
- 57. Frank, S. L.; Bod, R.; Christiansen, M. H. How hierarchical is language use?. Proc. Biol. Sci. 2012, 279, 4522–4531, DOI: 10.1098/rspb.2012.1741
- 58. Oesch, N.; Dunbar, R. I. M. The emergence of recursion in human language: Mentalising predicts recursive syntax task performance. J. Neurolinguistics 2017, 43, 95–106, DOI: 10.1016/j.jneuroling.2016.09.008
- 59. Ferruz, N.; Höcker, B. Controllable protein design with language models. Nature Machine Intelligence 2022, 4, 521–532, DOI: 10.1038/s42256-022-00499-z
- 60. Ofer, D.; Brandes, N.; Linial, M. The language of proteins: NLP, machine learning & protein sequences. Comput. Struct. Biotechnol. J. 2021, 19, 1750–1758, DOI: 10.1016/j.csbj.2021.03.022
- 61. Ptitsyn, O. B. How does protein synthesis give rise to the 3D-structure?. FEBS Lett. 1991, 285, 176–181, DOI: 10.1016/0014-5793(91)80799-9
- 62. Yu, L.; Tanwar, D. K.; Penha, E. D. S.; Wolf, Y. I.; Koonin, E. V.; Basu, M. K. Grammar of protein domain architectures. Proc. Natl. Acad. Sci. U.S.A. 2019, 116, 3636–3645, DOI: 10.1073/pnas.1814684116
- 63. Petsko, G. A.; Ringe, D. Protein Structure and Function; Primers in Biology; Blackwell Publishing: London, England, 2003.
- 64. Shenoy, S. R.; Jayaram, B. Proteins: sequence to structure and function - current status. Curr. Protein Pept. Sci. 2010, 11, 498–514, DOI: 10.2174/138920310794109094
- 65. Takahashi, M.; Maraboeuf, F.; Nordén, B. Locations of functional domains in the RecA protein. Overlap of domains and regulation of activities. Eur. J. Biochem. 1996, 242, 20–28, DOI: 10.1111/j.1432-1033.1996.0020r.x
- 66. Liang, W.; KaiYong, Z. Detecting “protein words” through unsupervised word segmentation. arXiv, 2014.
- 67. Kuntz, I. D.; Crippen, G. M.; Kollman, P. A.; Kimelman, D. Calculation of protein tertiary structure. J. Mol. Biol. 1976, 106, 983–994, DOI: 10.1016/0022-2836(76)90347-8
- 68. Rodrigue, N.; Lartillot, N.; Bryant, D.; Philippe, H. Site interdependence attributed to tertiary structure in amino acid sequence evolution. Gene 2005, 347, 207–217, DOI: 10.1016/j.gene.2004.12.011
- 69. Eisenhaber, F.; Persson, B.; Argos, P. Protein structure prediction: recognition of primary, secondary, and tertiary structural features from amino acid sequence. Crit. Rev. Biochem. Mol. Biol. 1995, 30, 1–94, DOI: 10.3109/10409239509085139
- 70. Garfield, E. Chemico-linguistics: computer translation of chemical nomenclature. Nature 1961, 192, 192, DOI: 10.1038/192192a0
- 71. Wigh, D. S.; Goodman, J. M.; Lapkin, A. A. A review of molecular representation in the age of machine learning. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2022, DOI: 10.1002/wcms.1603
- 72. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36, DOI: 10.1021/ci00057a005
- 73. Wang, Y.; Xiao, J.; Suzek, T. O.; Zhang, J.; Wang, J.; Bryant, S. H. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009, 37, W623–33, DOI: 10.1093/nar/gkp456
- 74. Degtyarenko, K.; de Matos, P.; Ennis, M.; Hastings, J.; Zbinden, M.; McNaught, A.; Alcántara, R.; Darsow, M.; Guedj, M.; Ashburner, M. ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res. 2007, 36, D344–50, DOI: 10.1093/nar/gkm791
- 75. Wishart, D. S.; Knox, C.; Guo, A. C.; Cheng, D.; Shrivastava, S.; Tzur, D.; Gautam, B.; Hassanali, M. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008, 36, D901–6, DOI: 10.1093/nar/gkm958
- 76. Wang, X.; Hao, J.; Yang, Y.; He, K. Natural language adversarial defense through synonym encoding. Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence 2021, 823–833
- 77. Bjerrum, E. J. SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules. arXiv, 2017.
- 78. Lee, I.; Nam, H. Infusing Linguistic Knowledge of SMILES into Chemical Language Models. arXiv, 2022.
- 79. Skinnider, M. A. Invalid SMILES are beneficial rather than detrimental to chemical language models. Nature Mach. Intell. 2024, 6, 437, DOI: 10.1038/s42256-024-00821-x
- 80. O’Boyle, N.; Dalke, A. DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of Chemical Structures. ChemRxiv, 2018.
- 81. Krenn, M.; Häse, F.; Nigam, A.; Friederich, P.; Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Mach. Learn.: Sci. Technol. 2020, 1, 045024, DOI: 10.1088/2632-2153/aba947
- 82. Gohlke, H.; Mannhold, R.; Kubinyi, H.; Folkers, G. In Protein-Ligand Interactions; Gohlke, H., Ed.; Methods and Principles in Medicinal Chemistry; Wiley-VCH Verlag: Weinheim, Germany, 2012.
- 83. Jumper, J.; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589, DOI: 10.1038/s41586-021-03819-2
- 84. Landrum, G. RDKit: A software suite for cheminformatics, computational chemistry, and predictive modeling. http://www.rdkit.org/RDKit_Overview.pdf, 2013 (accessed 2023-12-13).
- 85. Mukherjee, S.; Ghosh, M.; Basuchowdhuri, P. In Proceedings of the 2022 SIAM International Conference on Data Mining (SDM); Society for Industrial and Applied Mathematics, 2022; pp 729–737.
- 86. Chen, L.; Tan, X.; Wang, D.; Zhong, F.; Liu, X.; Yang, T.; Luo, X.; Chen, K.; Jiang, H.; Zheng, M. TransformerCPI: improving compound–protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics 2020, 36, 4406–4414, DOI: 10.1093/bioinformatics/btaa524
- 87. Aly Abdelkader, G.; Ngnamsie Njimbouom, S.; Oh, T.-J.; Kim, J.-D. ResBiGAAT: Residual Bi-GRU with attention for protein-ligand binding affinity prediction. Comput. Biol. Chem. 2023, 107, 107969, DOI: 10.1016/j.compbiolchem.2023.107969
- 88. Li, Q.; Zhang, X.; Wu, L.; Bo, X.; He, S.; Wang, S. PLA-MoRe: A Protein–Ligand Binding Affinity Prediction Model via Comprehensive Molecular Representations. J. Chem. Inf. Model. 2022, 62, 4380–4390, DOI: 10.1021/acs.jcim.2c00960
- 89. Abramson, J.; et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024, 636, E4, DOI: 10.1038/s41586-024-08416-7
- 90. Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; Overington, J. P. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012, 40, D1100–7, DOI: 10.1093/nar/gkr777
- 91. Wishart, D. S.; Knox, C.; Guo, A. C.; Shrivastava, S.; Hassanali, M.; Stothard, P.; Chang, Z.; Woolsey, J. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006, 34, D668–72, DOI: 10.1093/nar/gkj067
- 92. Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235–242, DOI: 10.1093/nar/28.1.235
- 93. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2017, 45, D158–D169, DOI: 10.1093/nar/gkw1099
- 94. Davis, M. I.; Hunt, J. P.; Herrgard, S.; Ciceri, P.; Wodicka, L. M.; Pallares, G.; Hocker, M.; Treiber, D. K.; Zarrinkar, P. P. Comprehensive analysis of kinase inhibitor selectivity. Nat. Biotechnol. 2011, 29, 1046–1051, DOI: 10.1038/nbt.1990
- 95. Tang, J.; Szwajda, A.; Shakyawar, S.; Xu, T.; Hintsanen, P.; Wennerberg, K.; Aittokallio, T. Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. J. Chem. Inf. Model. 2014, 54, 735–743, DOI: 10.1021/ci400709d
- 96. Wang, R.; Fang, X.; Lu, Y.; Wang, S. The PDBbind database: collection of binding affinities for protein-ligand complexes with known three-dimensional structures. J. Med. Chem. 2004, 47, 2977–2980, DOI: 10.1021/jm030580l
- 97. Chen, S.; Zhang, S.; Fang, X.; Lin, L.; Zhao, H.; Yang, Y. Protein complex structure modeling by cross-modal alignment between cryo-EM maps and protein sequences. Nat. Commun. 2024, 15, 8808, DOI: 10.1038/s41467-024-53116-5
- 98. Bishop, C. M. Pattern Recognition and Machine Learning, 1st ed.; Information Science and Statistics; Springer: New York, NY, 2006.
- 99. Yang, J.; Shen, C.; Huang, N. Predicting or Pretending: Artificial Intelligence for Protein-Ligand Interactions Lack of Sufficiently Large and Unbiased Datasets. Front. Pharmacol. 2020, 11, 69, DOI: 10.3389/fphar.2020.00069
- 100. Kryshtafovych, A.; Schwede, T.; Topf, M.; Fidelis, K.; Moult, J. Critical assessment of methods of protein structure prediction (CASP)–Round XIII. Proteins 2019, 87, 1011–1020, DOI: 10.1002/prot.25823
- 101. Janin, J.; Henrick, K.; Moult, J.; Eyck, L. T.; Sternberg, M. J. E.; Vajda, S.; Vakser, I.; Wodak, S. J. CAPRI: A Critical Assessment of PRedicted Interactions. Proteins 2003, 52, 2–9, DOI: 10.1002/prot.10381
- 102. Lensink, M. F.; Nadzirin, N.; Velankar, S.; Wodak, S. J. Modeling protein-protein, protein-peptide, and protein-oligosaccharide complexes: CAPRI 7th edition. Proteins 2020, 88, 916–938, DOI: 10.1002/prot.25870
- 103. Schomburg, I.; Chang, A.; Schomburg, D. BRENDA, enzyme data and metabolic information. Nucleic Acids Res. 2002, 30, 47–49, DOI: 10.1093/nar/30.1.47
- 104. Liu, T.; Lin, Y.; Wen, X.; Jorissen, R. N.; Gilson, M. K. BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res. 2007, 35, D198–201, DOI: 10.1093/nar/gkl999
- 105. Amemiya, T.; Koike, R.; Kidera, A.; Ota, M. PSCDB: a database for protein structural change upon ligand binding. Nucleic Acids Res. 2012, 40, D554–8, DOI: 10.1093/nar/gkr966
- 106. Mysinger, M. M.; Carchia, M.; Irwin, J. J.; Shoichet, B. K. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking. J. Med. Chem. 2012, 55, 6582–6594, DOI: 10.1021/jm300687e
- 107. Warren, G. L.; Do, T. D.; Kelley, B. P.; Nicholls, A.; Warren, S. D. Essential considerations for using protein-ligand structures in drug discovery. Drug Discovery Today 2012, 17, 1270–1281, DOI: 10.1016/j.drudis.2012.06.011
- 108. Puvanendrampillai, D.; Mitchell, J. B. O. L/D Protein Ligand Database (PLD): additional understanding of the nature and specificity of protein-ligand complexes. Bioinformatics 2003, 19, 1856–1857, DOI: 10.1093/bioinformatics/btg243
- 109. Wang, C.; Hu, G.; Wang, K.; Brylinski, M.; Xie, L.; Kurgan, L. PDID: database of molecular-level putative protein-drug interactions in the structural human proteome. Bioinformatics 2016, 32, 579–586, DOI: 10.1093/bioinformatics/btv597
- 110. Zhu, M.; Song, X.; Chen, P.; Wang, W.; Wang, B. dbHDPLS: A database of human disease-related protein-ligand structures. Comput. Biol. Chem. 2019, 78, 353–358, DOI: 10.1016/j.compbiolchem.2018.12.023
- 111. Gao, M.; Moumbock, A. F. A.; Qaseem, A.; Xu, Q.; Günther, S. CovPDB: a high-resolution coverage of the covalent protein-ligand interactome. Nucleic Acids Res. 2022, 50, D445–D450, DOI: 10.1093/nar/gkab868
- 112. Ammar, A.; Cavill, R.; Evelo, C.; Willighagen, E. PSnpBind: a database of mutated binding site protein-ligand complexes constructed using a multithreaded virtual screening workflow. J. Cheminform. 2022, 14, 8, DOI: 10.1186/s13321-021-00573-5
- 113. Lingė, D. PLBD: protein-ligand binding database of thermodynamic and kinetic intrinsic parameters. Database 2023, DOI: 10.1093/database/baad040
- 114. Wei, H.; Wang, W.; Peng, Z.; Yang, J. Q-BioLiP: A Comprehensive Resource for Quaternary Structure-based Protein–ligand Interactions. bioRxiv, 2023, 2023.06.23.546351.
- 115. Korlepara, D. B. PLAS-20k: Extended dataset of protein-ligand affinities from MD simulations for machine learning applications. Sci. Data 2024, DOI: 10.1038/s41597-023-02872-y
- 116. Xenarios, I.; Rice, D. W.; Salwinski, L.; Baron, M. K.; Marcotte, E. M.; Eisenberg, D. DIP: the database of interacting proteins. Nucleic Acids Res. 2000, 28, 289–291, DOI: 10.1093/nar/28.1.289
- 117. Wallach, I.; Lilien, R. The protein-small-molecule database, a non-redundant structural resource for the analysis of protein-ligand binding. Bioinformatics 2009, 25, 615–620, DOI: 10.1093/bioinformatics/btp035
- 118. Wang, S.; Lin, H.; Huang, Z.; He, Y.; Deng, X.; Xu, Y.; Pei, J.; Lai, L. CavitySpace: A Database of Potential Ligand Binding Sites in the Human Proteome. Biomolecules 2022, 12, 967, DOI: 10.3390/biom12070967
- 119. Otter, D. W.; Medina, J. R.; Kalita, J. K. A survey of the usages of deep learning for natural language processing. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 604–624, DOI: 10.1109/TNNLS.2020.2979670
- 120. Wang, Y.; You, Z.-H.; Yang, S.; Li, X.; Jiang, T.-H.; Zhou, X. A high efficient biological language model for predicting Protein-Protein interactions. Cells 2019, 8, 122, DOI: 10.3390/cells8020122
- 121. Abbasi, K.; Razzaghi, P.; Poso, A.; Amanlou, M.; Ghasemi, J. B.; Masoudi-Nejad, A. DeepCDA: deep cross-domain compound–protein affinity prediction through LSTM and convolutional neural networks. Bioinformatics 2020, 36, 4633–4642, DOI: 10.1093/bioinformatics/btaa544
- 122Zhou, G.; Gao, Z.; Ding, Q.; Zheng, H.; Xu, H.; Wei, Z.; Zhang, L.; Ke, G. Uni-Mol: A Universal 3D Molecular Representation Learning Framework. ChemRxiv, 2023.
- 123Zhou, D.; Xu, Z.; Li, W.; Xie, X.; Peng, S. MultiDTI: drug–target interaction prediction based on multi-modal representation learning to bridge the gap between new chemical entities and known heterogeneous network. Bioinformatics 2021, 37, 4485– 4492, DOI: 10.1093/bioinformatics/btab473
- 124Özçelik, R.; Öztürk, H.; Özgür, A.; Ozkirimli, E. ChemBoost: A chemical language based approach for protein–ligand binding affinity prediction. Mol. Inform. 2021, 40, e2000212, DOI: 10.1002/minf.202000212
- 125Gaspar, H. A.; Ahmed, M.; Edlich, T.; Fabian, B.; Varszegi, Z.; Segler, M.; Meyers, J.; Fiscato, M. Proteochemometric Models Using Multiple Sequence Alignments and a Subword Segmented Masked Language Model. ChemRxiv, 2021.
- 126Arseniev-Koehler, A. Theoretical foundations and limits of word embeddings: What types of meaning can they capture. Sociol. Methods Res. 2022, 004912412211401
- 127Lake, B. M.; Murphy, G. L. Word meaning in minds and machines. Psychol. Rev. 2023, 130, 401– 431, DOI: 10.1037/rev0000297
- 128Winchester, S. A Verb for Our Frantic Times. https://www.nytimes.com/2011/05/29/opinion/29winchester.html, 2011; Accessed: 2024-09-15.
- 129Panapitiya, G.; Girard, M.; Hollas, A.; Sepulveda, J.; Murugesan, V.; Wang, W.; Saldanha, E. Evaluation of deep learning architectures for aqueous solubility prediction. ACS Omega 2022, 7, 15695– 15710, DOI: 10.1021/acsomega.2c00642
- 130Wu, X.; Yu, L. EPSOL: sequence-based protein solubility prediction using multidimensional embedding. Bioinformatics 2021, 37, 4314– 4320, DOI: 10.1093/bioinformatics/btab463
- 131Krogh, A. What are artificial neural networks?. Nat. Biotechnol. 2008, 26, 195– 197, DOI: 10.1038/nbt1386
- 132Rumelhart, D. E.; Hinton, G. E.; Williams, R. J. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition; MIT Press, 1986; pp 673– 695
- 133Salehinejad, H.; Sankar, S.; Barfett, J.; Colak, E.; Valaee, S. Recent Advances in Recurrent Neural Networks. arXiv, 2017.
- 134Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30
- 135Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. arXiv, 2018.
- 136Chen, G. A gentle tutorial of recurrent neural network with error backpropagation. arXiv, 2016.
- 137Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv, 2014.
- 138Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735– 1780, DOI: 10.1162/neco.1997.9.8.1735
- 139Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005, 18, 602– 610, DOI: 10.1016/j.neunet.2005.06.042
- 140Thafar, M. A.; Alshahrani, M.; Albaradei, S.; Gojobori, T.; Essack, M.; Gao, X. Affinity2Vec: drug-target binding affinity prediction through representation learning, graph mining, and machine learning. Sci. Rep. 2022, 12, 4751, DOI: 10.1038/s41598-022-08787-9
- 141Wei, B.; Zhang, Y.; Gong, X. 519. DeepLPI: A Novel Drug Repurposing Model based on Ligand-Protein Interaction Using Deep Learning. Open Forum Infect. Dis. 2022, 9, ofac492.574, DOI: 10.1093/ofid/ofac492.574
- 142Yuan, W.; Chen, G.; Chen, C. Y.-C. FusionDTA: attention-based feature polymerizer and knowledge distillation for drug-target binding affinity prediction. Brief. Bioinform. 2022, DOI: 10.1093/bib/bbab506
- 143West-Roberts, J.; Valentin-Alvarado, L.; Mullen, S.; Sachdeva, R.; Smith, J.; Hug, L. A.; Gregoire, D. S.; Liu, W.; Lin, T.-Y.; Husain, G.; Amano, Y.; Ly, L.; Banfield, J. F. Giant genes are rare but implicated in cell wall degradation by predatory bacteria. bioRxiv, 2023.
- 144Hernández, A.; Amigó, J. Attention mechanisms and their applications to complex systems. Entropy (Basel) 2021, 23, 283, DOI: 10.3390/e23030283
- 145Yang, X. An overview of the attention mechanisms in computer vision. 2020.
- 146Hu, D. An introductory survey on attention mechanisms in NLP problems. arXiv, 2018.
- 147Vig, J.; Madani, A.; Varshney, L. R.; Xiong, C.; Socher, R.; Rajani, N. F. BERTology Meets Biology: Interpreting Attention in Protein Language Models. arXiv, 2020.
- 148Wu, H.; Liu, J.; Jiang, T.; Zou, Q.; Qi, S.; Cui, Z.; Tiwari, P.; Ding, Y. AttentionMGT-DTA: A multi-modal drug-target affinity prediction using graph transformer and attention mechanism. Neural Netw. 2024, 169, 623– 636, DOI: 10.1016/j.neunet.2023.11.018
- 149Koyama, K.; Kamiya, K.; Shimada, K. Cross attention dti: Drug-target interaction prediction with cross attention module in the blind evaluation setup. BIOKDD 2020.
- 150Kurata, H.; Tsukiyama, S. ICAN: Interpretable cross-attention network for identifying drug and target protein interactions. PLoS One 2022, 17, e0276609, DOI: 10.1371/journal.pone.0276609
- 151Zhao, Q.; Zhao, H.; Zheng, K.; Wang, J. HyperAttentionDTI: improving drug–protein interaction prediction by sequence-based deep learning with attention mechanism. Bioinformatics 2022, 38, 655– 662, DOI: 10.1093/bioinformatics/btab715
- 152Jiang, M.; Li, Z.; Zhang, S.; Wang, S.; Wang, X.; Yuan, Q. Drug-target affinity prediction using graph neural network and contact maps. RSC Adv. 2020, 10, 20701, DOI: 10.1039/D0RA02297G
- 153Nguyen, T. M.; Nguyen, T.; Le, T. M.; Tran, T. GEFA: Early Fusion Approach in Drug-Target Affinity Prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 2022, 19, 718– 728, DOI: 10.1109/TCBB.2021.3094217
- 154Yu, J.; Li, Z.; Chen, G.; Kong, X.; Hu, J.; Wang, D.; Cao, D.; Li, Y.; Huo, R.; Wang, G.; Liu, X.; Jiang, H.; Li, X.; Luo, X.; Zheng, M. Computing the relative binding affinity of ligands based on a pairwise binding comparison network. Nature Computational Science 2023, 3, 860– 872, DOI: 10.1038/s43588-023-00529-9
- 155Knutson, C.; Bontha, M.; Bilbrey, J. A.; Kumar, N. Decoding the protein–ligand interactions using parallel graph neural networks. Sci. Rep. 2022, 12, 1– 14, DOI: 10.1038/s41598-022-10418-2
- 156Kyro, G. W.; Brent, R. I.; Batista, V. S. HAC-Net: A Hybrid Attention-Based Convolutional Neural Network for Highly Accurate Protein–Ligand Binding Affinity Prediction. J. Chem. Inf. Model. 2023, 63, 1947– 1960, DOI: 10.1021/acs.jcim.3c00251
- 157Yousefi, N.; Yazdani-Jahromi, M.; Tayebi, A.; Kolanthai, E.; Neal, C. J.; Banerjee, T.; Gosai, A.; Balasubramanian, G.; Seal, S.; Ozmen Garibay, O. BindingSite-AugmentedDTA: enabling a next-generation pipeline for interpretable prediction models in drug repurposing. Brief. Bioinform. 2023, DOI: 10.1093/bib/bbad136
- 158Yazdani-Jahromi, M.; Yousefi, N.; Tayebi, A.; Kolanthai, E.; Neal, C. J.; Seal, S.; Garibay, O. O. AttentionSiteDTI: an interpretable graph-based model for drug-target interaction prediction using NLP sentence-level relation classification. Brief. Bioinform. 2022, DOI: 10.1093/bib/bbac272
- 159Bronstein, M. M.; Bruna, J.; Cohen, T.; Veličković, P. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv, 2021.
- 160Lim, J.; Ryu, S.; Park, K.; Choe, Y. J.; Ham, J.; Kim, W. Y. Predicting Drug–Target Interaction Using a Novel Graph Neural Network with 3D Structure-Embedded Graph Representation. J. Chem. Inf. Model. 2019, 59, 3981– 3988, DOI: 10.1021/acs.jcim.9b00387
- 161Jin, Z.; Wu, T.; Chen, T.; Pan, D.; Wang, X.; Xie, J.; Quan, L.; Lyu, Q. CAPLA: improved prediction of protein–ligand binding affinity by a deep learning approach based on a cross-attention mechanism. Bioinformatics 2023, 39, btad049, DOI: 10.1093/bioinformatics/btad049
- 162Lin, Z.; Akin, H.; Rao, R.; Hie, B.; Zhu, Z.; Lu, W.; Smetanin, N.; Verkuil, R.; Kabeli, O.; Shmueli, Y.; Dos Santos Costa, A.; Fazel-Zarandi, M.; Sercu, T.; Candido, S.; Rives, A. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023, 379, 1123– 1130, DOI: 10.1126/science.ade2574
- 163Zhang, S.; Fan, R.; Liu, Y.; Chen, S.; Liu, Q.; Zeng, W. Applications of transformer-based language models in bioinformatics: a survey. Bioinform. Adv. 2023, 3, vbad001, DOI: 10.1093/bioadv/vbad001
- 164Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv, 2014.
- 165Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015, 3156– 3164
- 166Sutskever, I.; Vinyals, O.; Le, Q. V. Sequence to sequence learning with Neural Networks. arXiv, 2014.
- 167Cho, K.; van Merrienboer, B.; Bahdanau, D.; Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv, 2014.
- 168Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv, 2018.
- 169Zeyer, A.; Bahar, P.; Irie, K.; Schlüter, R.; Ney, H. A Comparison of Transformer and LSTM Encoder Decoder Models for ASR. 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2019, 8– 15
- 170Irie, K.; Zeyer, A.; Schlüter, R.; Ney, H. Language Modeling with Deep Transformers. arXiv, 2019.
- 171Zouitni, C.; Sabri, M. A.; Aarab, A. A Comparison Between LSTM and Transformers for Image Captioning. Digital Technologies and Applications 2023, 669, 492– 500, DOI: 10.1007/978-3-031-29860-8_50
- 172Parisotto, E.; Song, F.; Rae, J.; Pascanu, R.; Gulcehre, C.; Jayakumar, S.; Jaderberg, M.; Kaufman, R. L.; Clark, A.; Noury, S.; Botvinick, M.; Heess, N.; Hadsell, R. Stabilizing Transformers for Reinforcement Learning. Proceedings of the 37th International Conference on Machine Learning 2020, 7487– 7498
- 173Bilokon, P.; Qiu, Y. Transformers versus LSTMs for electronic trading. arXiv, 2023.
- 174Merity, S. Single Headed Attention RNN: Stop Thinking With Your Head. arXiv, 2019.
- 175Ezen-Can, A. A Comparison of LSTM and BERT for Small Corpus. arXiv, 2020.
- 176Unsal, S.; Atas, H.; Albayrak, M.; Turhan, K.; Acar, A. C.; Doğan, T. Learning functional properties of proteins with language models. Nature Machine Intelligence 2022, 4, 227– 245, DOI: 10.1038/s42256-022-00457-9
- 177Brandes, N.; Ofer, D.; Peleg, Y.; Rappoport, N.; Linial, M. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 2022, 38, 2102– 2110, DOI: 10.1093/bioinformatics/btac020
- 178Luo, S.; Chen, T.; Xu, Y.; Zheng, S.; Liu, T.-Y.; Wang, L.; He, D. One Transformer Can Understand Both 2D & 3D Molecular Data. arXiv, 2022.
- 179Clark, K.; Luong, M.-T.; Le, Q. V.; Manning, C. D. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. arXiv, 2020.
- 180Wang, J.; Wen, N.; Wang, C.; Zhao, L.; Cheng, L. ELECTRA-DTA: a new compound-protein binding affinity prediction model based on the contextualized sequence encoding. J. Cheminform. 2022, 14, 14, DOI: 10.1186/s13321-022-00591-x
- 181Shin, B.; Park, S.; Kang, K.; Ho, J. C. Self-Attention Based Molecule Representation for Predicting Drug-Target Interaction. Proceedings of the 4th Machine Learning for Healthcare Conference 2019, 230– 248
- 182Huang, K.; Xiao, C.; Glass, L. M.; Sun, J. MolTrans: Molecular Interaction Transformer for drug–target interaction prediction. Bioinformatics 2021, 37, 830– 836, DOI: 10.1093/bioinformatics/btaa880
- 183Shen, L.; Feng, H.; Qiu, Y.; Wei, G.-W. SVSBI: sequence-based virtual screening of biomolecular interactions. Commun. Biol. 2023, 6, 536, DOI: 10.1038/s42003-023-04866-3
- 184Wang, J.; Hu, J.; Sun, H.; Xu, M.; Yu, Y.; Liu, Y.; Cheng, L. MGPLI: exploring multigranular representations for protein–ligand interaction prediction. Bioinformatics 2022, 38, 4859– 4867, DOI: 10.1093/bioinformatics/btac597
- 185Qian, Y.; Wu, J.; Zhang, Q. CAT-CPI: Combining CNN and transformer to learn compound image features for predicting compound-protein interactions. Front. Mol. Biosci. 2022, 9, 963912, DOI: 10.3389/fmolb.2022.963912
- 186Cang, Z.; Mu, L.; Wei, G.-W. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLoS Comput. Biol. 2018, 14, e1005929, DOI: 10.1371/journal.pcbi.1005929
- 187Chen, D.; Liu, J.; Wei, G.-W. Multiscale topology-enabled structure-to-sequence transformer for protein-ligand interaction predictions. Nat. Mach. Intell. 2024, 6, 799– 810, DOI: 10.1038/s42256-024-00855-1
- 188Bubeck, S.; Chandrasekaran, V.; Eldan, R.; Gehrke, J.; Horvitz, E.; Kamar, E.; Lee, P.; Lee, Y. T.; Li, Y.; Lundberg, S.; Nori, H.; Palangi, H.; Ribeiro, M. T.; Zhang, Y. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv, 2023.
- 189Elnaggar, A.; Essam, H.; Salah-Eldin, W.; Moustafa, W.; Elkerdawy, M.; Rochereau, C.; Rost, B. Ankh: Optimized protein language model unlocks general-purpose modelling. bioRxiv, 2023.
- 190Hwang, Y.; Cornman, A. L.; Kellogg, E. H.; Ovchinnikov, S.; Girguis, P. R. Genomic language model predicts protein co-regulation and function. Nat. Commun. 2024, 15, 2880, DOI: 10.1038/s41467-024-46947-9
- 191Vu, M. H.; Akbar, R.; Robert, P. A.; Swiatczak, B.; Greiff, V.; Sandve, G. K.; Haug, D. T. T. Linguistically inspired roadmap for building biologically reliable protein language models. arXiv, 2022.
- 192Xu, M.; Zhang, Z.; Lu, J.; Zhu, Z.; Zhang, Y.; Ma, C.; Liu, R.; Tang, J. PEER: A comprehensive and multi-task benchmark for Protein sEquence undERstanding. Adv. Neural Inf. Process. Syst. 2022, 35, 35156– 35173
- 193Schmirler, R.; Heinzinger, M.; Rost, B. Fine-tuning protein language models boosts predictions across diverse tasks. Nat. Commun. 2024, 15, 7407, DOI: 10.1038/s41467-024-51844-2
- 194Heinzinger, M.; Elnaggar, A.; Wang, Y.; Dallago, C.; Nechaev, D.; Matthes, F.; Rost, B. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 2019, 20, 723, DOI: 10.1186/s12859-019-3220-8
- 195Manfredi, M.; Savojardo, C.; Martelli, P. L.; Casadio, R. E-SNPs&GO: embedding of protein sequence and function improves the annotation of human pathogenic variants. Bioinformatics 2022, 38, 5168– 5174, DOI: 10.1093/bioinformatics/btac678
- 196Anteghini, M.; Martins Dos Santos, V.; Saccenti, E. In-Pero: Exploiting deep learning embeddings of protein sequences to predict the localisation of peroxisomal proteins. Int. J. Mol. Sci. 2021, 22, 6409, DOI: 10.3390/ijms22126409
- 197Nguyen, T.; Le, H.; Quinn, T. P.; Nguyen, T.; Le, T. D.; Venkatesh, S. GraphDTA: predicting drug–target binding affinity with graph neural networks. Bioinformatics 2021, 37, 1140– 1147, DOI: 10.1093/bioinformatics/btaa921
- 198Nam, H.; Ha, J.-W.; Kim, J. Dual attention networks for multimodal reasoning and matching. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, 299– 307
- 199Wang, X.; Liu, D.; Zhu, J.; Rodriguez-Paton, A.; Song, T. CSConv2d: A 2-D Structural Convolution Neural Network with a Channel and Spatial Attention Mechanism for Protein-Ligand Binding Affinity Prediction. Biomolecules 2021, 11, 643, DOI: 10.3390/biom11050643
- 200Anteghini, M.; Santos, V. A. M. D.; Saccenti, E. PortPred: Exploiting deep learning embeddings of amino acid sequences for the identification of transporter proteins and their substrates. J. Cell. Biochem. 2023, 124, 1803, DOI: 10.1002/jcb.30490
- 201Huang, K.; Fu, T.; Glass, L. M.; Zitnik, M.; Xiao, C.; Sun, J. DeepPurpose: a deep learning library for drug–target interaction prediction. Bioinformatics 2021, 36, 5545– 5547, DOI: 10.1093/bioinformatics/btaa1005
- 202Zhao, L.; Wang, J.; Pang, L.; Liu, Y.; Zhang, J. GANsDTA: Predicting Drug-Target Binding Affinity Using GANs. Front. Genet. 2020, 10, 1243, DOI: 10.3389/fgene.2019.01243
- 203Hu, F.; Jiang, J.; Wang, D.; Zhu, M.; Yin, P. Multi-PLI: interpretable multi-task deep learning model for unifying protein–ligand interaction datasets. J. Cheminform. 2021, 13, 30, DOI: 10.1186/s13321-021-00510-6
- 204Zheng, S.; Li, Y.; Chen, S.; Xu, J.; Yang, Y. Predicting Drug Protein Interaction using Quasi-Visual Question Answering System. bioRxiv 2019, 588178
- 205Tsubaki, M.; Tomii, K.; Sese, J. Compound–protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics 2019, 35, 309– 318, DOI: 10.1093/bioinformatics/bty535
- 206Karimi, M.; Wu, D.; Wang, Z.; Shen, Y. DeepAffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks. Bioinformatics 2019, 35, 3329– 3338, DOI: 10.1093/bioinformatics/btz111
- 207Li, S.; Wan, F.; Shu, H.; Jiang, T.; Zhao, D.; Zeng, J. MONN: a multi-objective neural network for predicting compound-protein interactions and affinities. Cell Systems 2020, 10, 308– 322, DOI: 10.1016/j.cels.2020.03.002
- 208Zhao, M.; Yuan, M.; Yang, Y.; Xu, S. X. CPGL: Prediction of Compound-Protein Interaction by Integrating Graph Attention Network With Long Short-Term Memory Neural Network. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023, 20, 1935– 1942, DOI: 10.1109/TCBB.2022.3225296
- 209Yu, L.; Qiu, W.; Lin, W.; Cheng, X.; Xiao, X.; Dai, J. HGDTI: predicting drug–target interaction by using information aggregation based on heterogeneous graph neural network. BMC Bioinformatics 2022, 23, 126, DOI: 10.1186/s12859-022-04655-5
- 210Lee, I.; Nam, H. Sequence-based prediction of protein binding regions and drug-target interactions. J. Cheminform. 2022, 14, 5, DOI: 10.1186/s13321-022-00584-w
- 211Gönen, M.; Heller, G. Concordance probability and discriminatory power in proportional hazards regression. Biometrika 2005, 92, 965– 970, DOI: 10.1093/biomet/92.4.965
- 212Deller, M. C.; Rupp, B. Models of protein-ligand crystal structures: trust, but verify. J. Comput. Aided Mol. Des. 2015, 29, 817– 836, DOI: 10.1007/s10822-015-9833-8
- 213Kalakoti, Y.; Yadav, S.; Sundar, D. TransDTI: Transformer-based language models for estimating DTIs and building a drug recommendation workflow. ACS Omega 2022, 7, 2706– 2717, DOI: 10.1021/acsomega.1c05203
- 214Chatterjee, A.; Walters, R.; Shafi, Z.; Ahmed, O. S.; Sebek, M.; Gysi, D.; Yu, R.; Eliassi-Rad, T.; Barabási, A.-L.; Menichetti, G. Improving the generalizability of protein-ligand binding predictions with AI-Bind. Nat. Commun. 2023, 14, 1989, DOI: 10.1038/s41467-023-37572-z
- 215Gudivada, V.; Apon, A.; Ding, J. Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. Int. J. Adv. Softw. 2017, 10, 1– 20
- 216Nasteski, V. An overview of the supervised machine learning methods. Horizons 2017, 4, 51– 62, DOI: 10.20544/HORIZONS.B.04.1.17.P05
- 217Kozlov, M. So you got a null result. Will anyone publish it?. Nature 2024, 631, 728– 730, DOI: 10.1038/d41586-024-02383-9
- 218Edfeldt, K. A data science roadmap for open science organizations engaged in early-stage drug discovery. Nat. Commun. 2024, 15, 5640, DOI: 10.1038/s41467-024-49777-x
- 219Mlinarić, A.; Horvat, M.; Šupak Smolčić, V. Dealing with the positive publication bias: Why you should really publish your negative results. Biochem. Med. 2017, 27, 030201, DOI: 10.11613/BM.2017.030201
- 220Fanelli, D. Negative results are disappearing from most disciplines and countries. Scientometrics 2012, 90, 891– 904, DOI: 10.1007/s11192-011-0494-7
- 221. Albalate, A.; Minker, W. Semi-supervised and unsupervised machine learning: Novel strategies; Wiley-ISTE, 2013.
- 222. Sajadi, S. Z.; Zare Chahooki, M. A.; Gharaghani, S.; Abbasi, K. AutoDTI++: deep unsupervised learning for DTI prediction by autoencoders. BMC Bioinformatics 2021, 22, 204, DOI: 10.1186/s12859-021-04127-2
- 223. Najm, M.; Azencott, C.-A.; Playe, B.; Stoven, V. Drug Target Identification with Machine Learning: How to Choose Negative Examples. Int. J. Mol. Sci. 2021, 22, 5118, DOI: 10.3390/ijms22105118
- 224. Sieg, J.; Flachsenberg, F.; Rarey, M. In Need of Bias Control: Evaluating Chemical Data for Machine Learning in Structure-Based Virtual Screening. J. Chem. Inf. Model. 2019, 59, 947–961, DOI: 10.1021/acs.jcim.8b00712
- 225. Volkov, M.; Turk, J.-A.; Drizard, N.; Martin, N.; Hoffmann, B.; Gaston-Mathé, Y.; Rognan, D. On the Frustration to Predict Binding Affinities from Protein–Ligand Structures with Deep Neural Networks. J. Med. Chem. 2022, 65, 7946–7958, DOI: 10.1021/acs.jmedchem.2c00487
- 226. Shivakumar, D.; Williams, J.; Wu, Y.; Damm, W.; Shelley, J.; Sherman, W. Prediction of absolute solvation free energies using molecular dynamics free energy perturbation and the OPLS force field. J. Chem. Theory Comput. 2010, 6, 1509–1519, DOI: 10.1021/ct900587b
- 227. El Hage, K.; Mondal, P.; Meuwly, M. Free energy simulations for protein ligand binding and stability. Mol. Simul. 2018, 44, 1044–1061, DOI: 10.1080/08927022.2017.1416115
- 228. Ngo, S. T.; Pham, M. Q. Umbrella sampling-based method to compute ligand-binding affinity. Methods Mol. Biol. 2022, 2385, 313–323, DOI: 10.1007/978-1-0716-1767-0_14
- 229. Pandey, M.; Fernandez, M.; Gentile, F.; Isayev, O.; Tropsha, A.; Stern, A. C.; Cherkasov, A. The transformational role of GPU computing and deep learning in drug discovery. Nat. Mach. Intell. 2022, 4, 211–221, DOI: 10.1038/s42256-022-00463-x
- 230. Bibal, A.; Cardon, R.; Alfter, D.; Wilkens, R.; Wang, X.; François, T.; Watrin, P. Is Attention Explanation? An Introduction to the Debate. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 2022, 3889–3900
- 231. Wiegreffe, S.; Pinter, Y. Attention is not not Explanation. arXiv, 2019.
- 232. Jain, S.; Wallace, B. C. Attention is not Explanation. arXiv, 2019.
- 233. Lundberg, S. M.; Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774
- 234. Gu, Y.; Zhang, X.; Xu, A.; Chen, W.; Liu, K.; Wu, L.; Mo, S.; Hu, Y.; Liu, M.; Luo, Q. Protein-ligand binding affinity prediction with edge awareness and supervised attention. iScience 2023, 26, 105892, DOI: 10.1016/j.isci.2022.105892
- 235. Rodis, N.; Sardianos, C.; Papadopoulos, G. T.; Radoglou-Grammatikis, P.; Sarigiannidis, P.; Varlamis, I. Multimodal Explainable Artificial Intelligence: A Comprehensive Review of Methodological Advances and Future Research Directions. arXiv, 2023.
- 236. Gilpin, L. H.; Bau, D.; Yuan, B. Z.; Bajwa, A.; Specter, M.; Kagal, L. Explaining explanations: An overview of interpretability of machine learning. 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), 2018, 80–89
- 237. Luo, D.; Liu, D.; Qu, X.; Dong, L.; Wang, B. Enhancing generalizability in protein-ligand binding affinity prediction with multimodal contrastive learning. J. Chem. Inf. Model. 2024, 64, 1892–1906, DOI: 10.1021/acs.jcim.3c01961
- 238. Rao, R.; Bhattacharya, N.; Thomas, N.; Duan, Y.; Chen, X.; Canny, J.; Abbeel, P.; Song, Y. S. Evaluating protein transfer learning with TAPE. bioRxiv, 2019.
- 239. Suzek, B. E.; Wang, Y.; Huang, H.; McGarvey, P. B.; Wu, C. H.; UniProt Consortium. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 2015, 31, 926–932, DOI: 10.1093/bioinformatics/btu739
- 240. Eguida, M.; Rognan, D. A Computer Vision Approach to Align and Compare Protein Cavities: Application to Fragment-Based Drug Design. J. Med. Chem. 2020, 63, 7127–7142, DOI: 10.1021/acs.jmedchem.0c00422
- 241. Öztürk, H.; Özgür, A.; Ozkirimli, E. DeepDTA: deep drug-target binding affinity prediction. Bioinformatics 2018, 34, i821–i829, DOI: 10.1093/bioinformatics/bty593
- 242. Evans, R. Protein complex prediction with AlphaFold-Multimer. bioRxiv, 2021.
- 243. Omidi, A.; Møller, M. H.; Malhis, N.; Bui, J. M.; Gsponer, J. AlphaFold-Multimer accurately captures interactions and dynamics of intrinsically disordered protein regions. Proc. Natl. Acad. Sci. U. S. A. 2024, 121, e2406407121, DOI: 10.1073/pnas.2406407121
- 244. Zhu, W.; Shenoy, A.; Kundrotas, P.; Elofsson, A. Evaluation of AlphaFold-Multimer prediction on multi-chain protein complexes. Bioinformatics 2023, 39, btad424, DOI: 10.1093/bioinformatics/btad424
- 245. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, 10684–10695
- 246. Dhariwal, P.; Nichol, A. Diffusion models beat GANs on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 8780–8794
- 247. Karras, T.; Aittala, M.; Aila, T.; Laine, S. Elucidating the Design Space of Diffusion-Based Generative Models. Adv. Neural Inf. Process. Syst. 2022, 26565–26577
- 248. Buttenschoen, M.; Morris, G.; Deane, C. PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences. Chem. Sci. 2024, 15, 3130–3139, DOI: 10.1039/D3SC04185A
- 249. Wee, J.; Wei, G.-W. Benchmarking AlphaFold3's protein-protein complex accuracy and machine learning prediction reliability for binding free energy changes upon mutation. arXiv, 2024.
- 250. Bernard, C.; Postic, G.; Ghannay, S.; Tahi, F. Has AlphaFold 3 reached its success for RNAs? bioRxiv, 2024.
- 251. Zonta, F.; Pantano, S. From sequence to mechanobiology? Promises and challenges for AlphaFold 3. Mechanobiology in Medicine 2024, 2, 100083, DOI: 10.1016/j.mbm.2024.100083
- 252. He, X.-H.; Li, J.-R.; Shen, S.-Y.; Xu, H. E. AlphaFold3 versus experimental structures: assessment of the accuracy in ligand-bound G protein-coupled receptors. Acta Pharmacol. Sin. 2024, 1–12, DOI: 10.1038/s41401-024-01429-y
- 253. Desai, D.; Kantliwala, S. V.; Vybhavi, J.; Ravi, R.; Patel, H.; Patel, J. Review of AlphaFold 3: Transformative advances in drug design and therapeutics. Cureus 2024, 16, e63646, DOI: 10.7759/cureus.63646
- 254. Baek, M. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021, 373, 871–876, DOI: 10.1126/science.abj8754
- 255. Ahdritz, G. OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nat. Methods 2024, 21, 1514–1524, DOI: 10.1038/s41592-024-02272-z
- 256. Liao, C.; Yu, Y.; Mei, Y.; Wei, Y. From words to molecules: A survey of Large Language Models in chemistry. arXiv, 2024.
- 257. Bagal, V.; Aggarwal, R.; Vinod, P. K.; Priyakumar, U. D. MolGPT: Molecular generation using a transformer-decoder model. J. Chem. Inf. Model. 2022, 62, 2064–2076, DOI: 10.1021/acs.jcim.1c00600
- 258. Janakarajan, N.; Erdmann, T.; Swaminathan, S.; Laino, T.; Born, J. Language models in molecular discovery. arXiv, 2023.
- 259. Park, Y.; Metzger, B. P. H.; Thornton, J. W. The simplicity of protein sequence-function relationships. Nat. Commun. 2024, 15, 7953, DOI: 10.1038/s41467-024-51895-5
- 260. Stahl, K.; Warneke, R.; Demann, L.; Bremenkamp, R.; Hormes, B.; Brock, O.; Stülke, J.; Rappsilber, J. Modelling protein complexes with crosslinking mass spectrometry and deep learning. Nat. Commun. 2024, 15, 7866, DOI: 10.1038/s41467-024-51771-2
- 261. Senior, A. W. Improved protein structure prediction using potentials from deep learning. Nature 2020, 577, 706–710, DOI: 10.1038/s41586-019-1923-7
- 262. Kramer, M. A. Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 1991, 37, 233–243, DOI: 10.1002/aic.690370209
- 263. Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754, DOI: 10.1021/ci100050t
- 264. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. arXiv, 2017.
- 265. Kipf, T. N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv, 2016.
- 266. Xu, Z.; Wang, S.; Zhu, F.; Huang, J. Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery. Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, New York, NY, USA, 2017, 285–294
- 267. Gilmer, J.; Schoenholz, S.; Riley, P. F.; Vinyals, O.; Dahl, G. E. Neural Message Passing for Quantum Chemistry. ICML 2017, 1263–1272
- 268. Asgari, E.; Mofrad, M. R. K. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS One 2015, 10, e0141287, DOI: 10.1371/journal.pone.0141287
- 269. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2016, 770–778
- 270. Öztürk, H.; Ozkirimli, E.; Özgür, A. A novel methodology on distributed representations of proteins using their interacting ligands. Bioinformatics 2018, 34, i295–i303, DOI: 10.1093/bioinformatics/bty287
- 271. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2018, 7132–7141