DeeplyTough: Learning Structural Comparison of Protein Binding Sites

Cite this: J. Chem. Inf. Model. 2020, 60, 4, 2356–2366
Publication Date (Web):February 5, 2020
https://doi.org/10.1021/acs.jcim.9b00554
Copyright © 2020 American Chemical Society

Abstract

Protein pocket matching, or binding site comparison, is of importance in drug discovery. Identification of similar binding pockets can help guide efforts for hit-finding, understanding polypharmacology, and characterization of protein function. The design of pocket matching methods has traditionally involved much intuition and has employed a broad variety of algorithms and representations of the input protein structures. We regard the high heterogeneity of past work and the recent availability of large-scale benchmarks as an indicator that a data-driven approach may provide a new perspective. We propose DeeplyTough, a convolutional neural network that encodes a three-dimensional representation of protein pockets into descriptor vectors that may be compared efficiently in an alignment-free manner by computing pairwise Euclidean distances. The network is trained with supervision (i) to provide similar pockets with similar descriptors, (ii) to separate the descriptors of dissimilar pockets by a minimum margin, and (iii) to achieve robustness to nuisance variations. We evaluate our method using three large-scale benchmark datasets, on which it demonstrates excellent performance for held-out data coming from the training distribution and competitive performance when the trained network is required to generalize to datasets constructed independently. DeeplyTough is available at https://github.com/BenevolentAI/DeeplyTough.

Introduction


Analysis of the three-dimensional (3D) structures of proteins and, in particular, the examination of functional binding sites, is of importance in drug discovery. Binding site comparison, also known as pocket matching, can be used to predict the selectivity of ligand binding as an approach for hit-finding in early drug discovery or to suggest the function of, as yet, uncharacterized proteins. (1) Structural approaches for pocket matching have been shown to be more predictive of shared ligand binding between two proteins than the global structure or sequence similarity. (2) Increasingly, there is interest in applying pocket-matching approaches to large datasets of protein structures to enable proteome-wide analysis. (3) Existing approaches for quantifying the structural similarity between a pair of putative protein binding sites exhibit a range of hand-crafted pocket representations as well as a combination of alignment-dependent and alignment-free algorithms for comparison. (1,4)
A key measure of success for pocket matching algorithms is the ability to assign similarity to pairs of protein pockets that have been shown to bind identical ligands. (5,6) This measure is useful because it encourages the concept of pocket similarity toward biological relevance. The binding of identical ligands to unrelated pockets, however, is highly dependent on the nature of the ligand (as well as the protein), and often no common structural pattern exists between the pair of binding sites. (7) As noted by Barelier et al., (5) “The same ligand might be recognized by different residues, with different interaction types, and even different ligand chemotypes may be engaged.” It is therefore unsurprising that varied algorithms for pocket matching differ in the manner by which cavities are represented or as to how different feature types are weighted in the resulting similarity measure. We argue that an approach able to learn from data is expected to perform well because this offers the possibility to remove the bias associated with hand-engineered protein pocket representations and their matching; others have also expressed this view. (4)
Deep learning has become the standard machine learning framework for solving many problems in computer vision, natural language processing, and other fields. This trend has also reached the drug discovery community showing utility in a wide range of scenarios. (8) In particular, trainable methods that make use of protein structure data have been applied to a number of tasks, including protein–ligand affinity prediction, (9−12) protein structure prediction, (13,14) binding pocket inpainting, (15) binding site detection, (16) and prediction of protein–protein interaction sites, (17,18) and recently Pu et al., (19) have described an approach to classify pockets into functional categories. However, to our knowledge, a deep learning approach to pairwise pocket matching has not been described.
A challenge for the machine learning-based method for pocket matching is presented by the available data, specifically the quantity of known protein pocket pairs and the quality of their annotations. Conveniently, Govindaraj and Brylinski (20) have recently compiled TOUGH-M1, a large-scale dataset of over one million pairs of pockets classified by whether or not they bind structurally similar ligands. In this work, we rely on the TOUGH-M1 collection for training and expect that its large scale will help our method overcome noise inherently present in automated strategies for gathering data.
We introduce DeeplyTough, a pocket matching method that uses a convolutional neural network (CNN) to encode protein-binding sites into descriptor vectors. Once computed, descriptors can be compared very efficiently in an alignment-free way by a simple measurement of pairwise Euclidean distance. This efficiency makes the proposed approach especially suited to investigations on large datasets. Our main contribution is the formulation of pocket matching as a supervised learning problem with three goals: (i) to provide similar pockets with similar descriptor vectors, (ii) to separate descriptor vectors of dissimilar pockets by a minimum margin, and (iii) to achieve robustness toward selected kinds of nuisance variability, such as the specific orientation and delineation of pocket definition. We thoroughly evaluate our method on three recent large-scale benchmarks for pocket matching. Concretely, we demonstrate excellent performance on held-out data coming from the training distribution (TOUGH-M1) and competitive performance when the trained network is required to generalize to datasets constructed independently by Chen et al. (6) and Ehrt et al. (21)

Methods


The problem of protein pocket matching is seen from the perspective of computer vision in our approach. The main idea is to regard pockets as 3D images and process them using a CNN to obtain their representation in a vector space where proximity indicates structural similarity; see Figure 1. In this section, we first discuss the choice of the training dataset and featurization. Then, we pose our method as a descriptor learning problem and describe a training strategy that encourages robustness to nuisance variability. Finally, we describe the architecture of the neural network and the details relevant to its implementation.

Figure 1

Figure 1. Illustration of learning to match pockets with a contrastive loss function (green and red arrow) and a stability loss function (blue arrow). Pockets are represented as multichannel 3D images p_i and encoded using a CNN dθ into n-dimensional descriptor vectors (filled symbols), which can be compared quickly and easily by computing their pairwise Euclidean distances. The network is trained to make descriptor vectors of matching pocket pairs as similar to each other as possible and to separate the descriptor vectors of nonmatching pocket pairs by at least margin distance m. In addition, descriptor vectors are encouraged to be robust to small perturbations of the representation, shown as hollow symbols.

Training Dataset

Moving from intuition-based featurization schemes toward learned representations presumes the availability of a large training corpus of pocket pairs with associated 3D structures. To frame the task as a supervised machine learning problem, we assume each pocket pair to be of a certain similarity. Here, we restrict ourselves to the binary case of similar and dissimilar pocket pairs. Curation of such a dataset is nontrivial and has surfaced as an underlying theme in a range of benchmarking efforts. Indeed, the performance of pocket matching algorithms has been shown to depend strongly on the manner of dataset construction, (22) and we expect similar behavior to arise when using such datasets for training as well.
Whereas evolutionarily similar pocket pairs can be easily identified through sequence similarity, pairs of unrelated proteins binding similar ligands (5) represent less obvious examples of pocket pairs that may be presumed similar. We expect these cases to represent more closely the needs of desired applications because they are not detectable by sequence-based approaches.
Generally speaking, similarity can be defined on two levels of granularity: for pairs of proteins and for pairs of pockets.
Protein-level similarity is often derived from chemical similarity among the respective binding ligands of two proteins. For example, Chen et al. (6) discriminate protein pairs sharing common active ligands from those without active ligands in common. Whereas the empirical measurement of bioactivity-focused protein-level similarity is scalable and cost-efficient, pin-pointing exact binding sites (and protein conformations) responsible for the observed behavior is problematic. This uncertainty makes pocket pair datasets defined at the protein-level unfit for training a pocket-level similarity predictor directly. Nevertheless, such datasets can be used to evaluate pocket matching algorithms by estimating protein similarity as the maximum predicted pocket similarity computed over all pockets and all structures of the respective proteins. (6)
Pocket-level similarity is derived directly from 3D protein–ligand complexes, such as those available in the protein data bank (PDB), (23) and is often enhanced by the heuristic assumption that similar ligands bind to similar pockets and vice versa. Although pocket-level derived annotations provide detailed localization of protein–ligand binding, data acquisition is more expensive, and therefore data is sparser. Also, pockets are usually observed in a single bound (holo-) conformation, which may encourage training on such data not to generalize to other induced fit or unbound (apo-) protein conformations—although studies suggest that conformational variability may be limited. (24)
Historically, datasets for pocket matching have been constructed as classification experiments involving sets of protein structures bound to commonly occurring ligands. (25,26) However, these datasets tend to be small (10² to 10³ structures) and may not be sufficiently representative of possible protein binding site space for training purposes. Recently, Govindaraj and Brylinski (20) proposed a large dataset, TOUGH-M1, of roughly one million pairs of protein–ligand binding sites curated from the PDB, representing a step toward larger, more general datasets. Specifically, the authors considered a subset of the PDB including protein structures binding a single druglike ligand. Structures were clustered based on sequence similarity, and representative structures bound to a diverse set of ligands were chosen from each cluster. As the dataset is designed with the prospective use case in mind, where the location of ligand binding is not available, pockets were calculated using Fpocket, (27) and predicted cavities having the greatest overlap with known binding residues were selected. Finally, bound ligands were also clustered, and globally dissimilar protein pairs were identified either within (positive) or between (negative) each ligand cluster. The resulting TOUGH-M1 dataset consists of 505,116 positive pocket pairs and 556,810 negative pocket pairs.
In this work, we choose to train our approach on the TOUGH-M1 dataset. From a machine learning perspective, TOUGH-M1 has the advantages of being large and balanced and offers pocket-level similarities. Notwithstanding, this dataset represents a specific method for curating pocket similarities, and it is thus unclear if a trained method can generalize to other datasets constructed in possibly different ways. We will return to this question in the Results and Discussion and answer it affirmatively.
Finally, let us emphasize that while it is often functional binding sites that are of biological interest, we refer to protein cavities indiscriminately as pockets because the method discussed is agnostic to the biological relevance of the pockets analyzed.

Data Splitting Strategy

The definition of independent subsets of data, as desired for meaningful evaluation of machine learning-based methods, is not straightforward when working with pairs of protein structure pockets. Indeed, a number of recent works have commented on the need for robust splitting methodologies when working with protein structure data. (28,29) In our case, a single protein structure may take part in multiple pairwise relationships, some possibly being in the training set and some in the test set, leading to a potential for information leakage. In addition, protein structure datasets represent a many-to-one scenario where multiple structures pertain to a single protein family.
Here, we propose to split on the structure level (instead of pairs), devoting 80% of structures for training and reserving the remaining 20% as a held-out test set. Any pair connecting training and test protein structures is discarded. We adopt a sequence-based clustering approach whereby protein structures sharing more than 30% sequence identity are always allocated to the same cluster; clusters are then allocated to either a training or test set according to a random seed. Clusters are assigned using predefined protein chain sets available from the RCSB PDB as suggested by Kramer and Gedeck. (30)
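The cluster-aware splitting procedure above can be sketched as follows. The helper is illustrative (function and variable names are our own), assuming each structure has already been mapped to a precomputed 30% sequence-identity cluster:

```python
import random
from collections import defaultdict

def split_by_cluster(structure_to_cluster, pairs, test_fraction=0.2, seed=0):
    """Assign whole sequence clusters to the train or test set, then
    discard any pocket pair that straddles the two sets (sketch)."""
    clusters = defaultdict(list)
    for structure, cluster in structure_to_cluster.items():
        clusters[cluster].append(structure)

    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    # Fill the test set cluster by cluster until ~test_fraction is reached.
    test_structures, n_total = set(), len(structure_to_cluster)
    for cluster in cluster_ids:
        if len(test_structures) >= test_fraction * n_total:
            break
        test_structures.update(clusters[cluster])
    train_structures = set(structure_to_cluster) - test_structures

    # Pairs connecting training and test structures are discarded.
    train_pairs = [p for p in pairs
                   if p[0] in train_structures and p[1] in train_structures]
    test_pairs = [p for p in pairs
                  if p[0] in test_structures and p[1] in test_structures]
    return train_pairs, test_pairs
```

Because whole clusters move between splits together, no two structures sharing more than 30% sequence identity can end up on opposite sides of the training/test boundary.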

Volumetric Input Representation

Similar to recent works addressing pocket detection (16) and protein–ligand affinity prediction, (10,11) we regard protein structures as 3D images with c channels (4D tensors). This is analogous to the treatment of color images in computer vision as functions assigning a vector of intensities of three primary colors to each pixel.
In our case, there are c = 8 feature channels assigned to every point in the 3D image, expressing the presence or absence (occupancy) of atoms in general as well as the presence of atoms exhibiting seven pharmacophoric properties: hydrophobicity, aromaticity, ability to accept or donate a hydrogen bond, positive or negative ionizability, and being metallic. Each atom is thus assigned to at least one feature channel. Occupancy information is given by a smooth indication function of the van der Waals radii of atoms. More precisely, occupancy f(x)_h at point x ∈ ℝ³ in channel h ∈ {1, ..., c} corresponds to the strongest indication function over the set A_h of protein atoms assigned to that channel, formally

f(x)_h = max_{a ∈ A_h} (1 − exp(−(r_a/∥x − x_a∥2)^12))   (1)

where r_a is the van der Waals radius and x_a is the position of atom a. Protein structures are retrieved from the PDB, and molecules that are not annotated as part of the main chain are ignored (e.g., water and ligands). This featurization process is analogous to that used for DeepSite (16) and is based on AutoDock 4 atom types (31) and computed using the high-throughput molecular dynamics (HTMD) package. (32)
A pocket is represented as a tensor created by sampling the corresponding protein structure image f over a grid of shape d × d × d with step s Å. To denote the representation of a particular pocket in f centered at point μ ∈ ℝ³ and seen under angle ϕ, we use the functional notation p = p(f,μ,ϕ). In our datasets of interest, μ is either the geometric center of a pocket, that is, the centroid of a convex hull of alpha spheres in the case of pockets detected with Fpocket 2.0, (27) or the centroid of a convex hull of surrounding residues lying within 8 Å of any ligand heavy atom in the case of pockets defined by their bound ligands.
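A minimal NumPy sketch of this featurization, assuming atom coordinates, van der Waals radii, and per-channel atom masks are already available (the smooth 12th-power indicator follows eq 1; the channel assignment itself is left to an external tool such as HTMD, and all names here are illustrative):

```python
import numpy as np

def occupancy_grid(coords, radii, channel_masks, center, d=24, s=1.0):
    """Sample a multichannel occupancy image on a d x d x d grid of step
    s (Å) around `center`. `channel_masks[h]` selects the atoms assigned
    to channel h; occupancy is the smooth indicator 1 - exp(-(r/dist)^12),
    taking the strongest indication over a channel's atoms."""
    axis = (np.arange(d) - (d - 1) / 2.0) * s
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    grid = np.stack([gx, gy, gz], axis=-1) + center   # (d, d, d, 3)

    image = np.zeros((len(channel_masks), d, d, d))
    for h, mask in enumerate(channel_masks):
        for xa, ra in zip(coords[mask], radii[mask]):
            dist = np.linalg.norm(grid - xa, axis=-1)
            dist = np.maximum(dist, 1e-6)             # guard against x == xa
            occ = 1.0 - np.exp(-((ra / dist) ** 12))
            image[h] = np.maximum(image[h], occ)      # strongest indication
    return image
```

The resulting c × d × d × d tensor corresponds to the pocket representation p(f,μ,ϕ) for a fixed orientation; rotations would be applied to the coordinates before sampling.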

Learning Pocket Descriptors

We draw inspiration from computer vision, where comparing local image descriptors is the cornerstone of many tasks, such as stereo reconstruction or image retrieval. Here, hand-crafted descriptors such as SIFT (33) have been recently matched in performance by descriptors learned from raw data. (34,35)
Descriptor learning (also known as embedding learning) is usually formulated as a supervised learning problem. Given a set of positive pocket pairs and a set of negative pairs, the goal is to learn a representation such that the descriptors of structurally similar pockets are close to one another in the learned vector space, whereas descriptors of dissimilar pockets are kept far apart. Several objective (loss) functions have been introduced in past work that typically operate on pairs or triplets of descriptors. Triplets are formed by selecting a positive and negative partner for a chosen anchor, (36,37) which is problematic in a pocket matching scenario, as the ground truth relationship between most pocket pairs is unknown: in fact, only 3991 out of 505,116 positive pairs in TOUGH-M1 can be used for constructing such triplets. Therefore, we build on the pair-wise setup following Simo-Serra et al., (38) which has shown success in computer vision tasks. (35) Specifically, given a pair of pockets Q = {(f1, μ1), (f2, μ2)} and orientations ϕ1, ϕ2, we minimize the following contrastive loss function (39) for a pair of pocket representations p1 = p(f1,μ1,ϕ1) and p2 = p(f2,μ2,ϕ2)

ℓ(Q,ϕ1,ϕ2) = y(Q) ∥dθ(p1) − dθ(p2)∥2 + (1 − y(Q)) max(0, m − ∥dθ(p1) − dθ(p2)∥2)   (2)

where y(Q) = 1 for positive pairs and y(Q) = 0 for negative pairs, and dθ is the description function (a neural network) with learnable parameters θ computing n-dimensional descriptors of pockets. The loss encourages the descriptors of positive pairs to be identical while separating those of negative pairs at least by margin m > 0 in the Euclidean space. The ability to compute descriptors independently (and in parallel, taking advantage of modern graphics processing units (GPUs)) and compare them efficiently by evaluating the L2 norm (Euclidean distance) of their difference is very advantageous, especially for large-scale searches and all-against-all scenarios.
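A sketch of the contrastive term for a single descriptor pair, in NumPy. This hinge-on-distance variant is one common formulation of the contrastive loss; details of the exact form used are left by the text to the cited references, so treat this as an assumption:

```python
import numpy as np

def contrastive_loss(desc1, desc2, is_positive, margin=1.0):
    """Contrastive loss on two n-dimensional descriptors: pull matching
    pairs together, push non-matching pairs at least `margin` apart."""
    dist = np.linalg.norm(desc1 - desc2)
    if is_positive:
        return dist                      # drive matching pairs toward zero
    return max(0.0, margin - dist)       # penalize only if within the margin
```

Because the loss depends on the pair only through the Euclidean distance of precomputed descriptors, large all-against-all comparisons reduce to cheap pairwise distance computations.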

Toward Descriptor Robustness

A highly desirable property of pocket matching tools is robustness with respect to the chosen pocket representation and the inherent variability in pocket definition. In particular, this includes discretization artifacts due to input grid resolution s, the orientation of pockets in 3D space ϕ (there is neither canonical orientation of proteins nor their pockets in space), and their precise delimitation (the inclusion or exclusion of a small number of atoms toward the edges of the bounding box) affecting the position of their geometric centers μ. Robustness here would also render the network stable to protein conformational variability.
Robustness has been traditionally addressed by using fuzzy featurization schemes and explicit alignment techniques in previous pocket matching tools (4) or by using data augmentation in machine learning methods. (40) The latter strategy is also applicable in our case, where data augmentation amounts to randomly sampling ϕ (implemented as random rotation around a random axis) and adding a random vector ϵ, ∥ϵ∥2 ≤ 2 Å to μ for each pocket seen during training in order to stimulate the descriptor function dθ to become invariant to such perturbations. However, we have not been able to achieve a sufficient level of invariance in practical experiments with this approach, which we consider related to known vulnerabilities of neural networks to small geometric input transformations in both adversarial (41) and benign settings. (42)
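The augmentation described above can be sketched as follows: Rodrigues' rotation formula implements a rotation by a random angle about a random axis, and the center jitter keeps ∥ϵ∥2 ≤ 2 Å (function names are illustrative):

```python
import numpy as np

def random_rotation(rng):
    """Rotation by a random angle about a random axis (Rodrigues' formula)."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    theta = rng.uniform(0.0, 2.0 * np.pi)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])          # cross-product matrix
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def perturb_center(mu, rng, max_shift=2.0):
    """Add a random vector epsilon with ||epsilon||_2 <= max_shift Å to mu."""
    eps = rng.normal(size=3)
    eps *= rng.uniform(0.0, max_shift) / np.linalg.norm(eps)
    return mu + eps
```

During training, each pocket would be sampled on the grid after applying a fresh rotation and center perturbation, so the network sees a different view of the same pocket at every iteration.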
This motivates us to introduce an additional, explicit stability objective. (43) Given two perturbed representations of the same pocket (f, μ), p̃1 = p(f,μ + ϵ1,ϕ1) and p̃2 = p(f,μ + ϵ2,ϕ2), we encourage their descriptors to be identical by minimizing the following stability loss

ℓstab(p̃1,p̃2) = ∥dθ(p̃1) − dθ(p̃2)∥2   (3)

The contrastive loss and the stability loss are then minimized jointly in a linear combination weighted with hyperparameter λ > 0 as

ℓtotal = ℓ + λℓstab   (4)
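A toy sketch of the joint objective for one pocket pair, combining the contrastive term with λ-weighted stability terms (one per pocket, each computed from two perturbed views of that pocket); the hinge-on-distance contrastive form and all names are our assumptions:

```python
import numpy as np

def stability_loss(d_tilde1, d_tilde2):
    """Distance between descriptors of two perturbed views of one pocket."""
    return np.linalg.norm(d_tilde1 - d_tilde2)

def joint_loss(d1, d2, is_positive, stab_terms, margin=1.0, lam=1.0):
    """Joint objective (sketch): contrastive term for the pair plus a
    lambda-weighted sum of precomputed stability terms."""
    dist = np.linalg.norm(d1 - d2)
    contrastive = dist if is_positive else max(0.0, margin - dist)
    return contrastive + lam * sum(stab_terms)
```

In training, both terms are evaluated on descriptors produced by the same network, so gradients from the stability term directly push dθ toward invariance to the nuisance perturbations.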

Network Architecture

Our description function dθ is a relatively shallow CNN. CNNs are hierarchical machine learning models consisting of layers of several types, see, for example, Goodfellow et al. (44) for an overview. To support the above-mentioned desire for translationally and rotationally invariant descriptors, we draw on the recent progress in learning rotationally equivariant features. Concretely, we use 3D steerable CNNs, (45) where 3D convolutional filters are parameterized as a linear combination of a complete steerable kernel basis. Such a technique for parameter sharing allowed us to considerably decrease the number of learnable parameters down to the order of 10⁵ and therefore reduce possible overfitting.
The network, described in detail in Table 1, consists of six convolutional layers. We prefer striding preceded by low-pass filtering, as recommended by Azulay and Weiss, (42) over pooling, which has empirically led to more stable networks. The computed descriptors are additionally normalized to have unit length, as per usual practice. (33,35)
Table 1. Architecture of the Network Used in the Experiments, in Top-Down Order^a
SCB: kernel size 7, padding 3, stride 2, 4 × 16 (4 for ProSPECCTs) fields of orders 0–3
SCB: kernel size 3, padding 1, stride 1, 4 × 32 (8 for ProSPECCTs) fields of orders 0–3
SCB: kernel size 3, padding 1, stride 2, 4 × 48 (16 for ProSPECCTs) fields of orders 0–3
SCB: kernel size 3, padding 0, stride 1, 4 × 64 (32 for ProSPECCTs) fields of orders 0–3
SCB: kernel size 3, padding 0, stride 2, 256 fields of order 0
C: kernel size 1, padding 0, stride 1, n output channels

^a SCB denotes a steerable 3D convolution block with batch normalization and with ReLU (scalar) and sigmoid (gate) activations. (45) C denotes a standard 3D convolution layer preceded by ReLU activation and batch normalization. (46) A smaller architecture with the reduced widths given in parentheses is used for ProSPECCTs to combat overfitting on a comparatively smaller effective training set.
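As a quick sanity check on the layer hyperparameters listed above, the standard convolution output-size formula shows how a 24³ input grid collapses to a single spatial position, so the final layer emits exactly one n-dimensional descriptor per pocket:

```python
def conv_out(size, kernel, pad, stride):
    """Standard convolution output-size formula along one spatial axis."""
    return (size + 2 * pad - kernel) // stride + 1

# (kernel, padding, stride) for the six layers of the network, top down
layers = [(7, 3, 2), (3, 1, 1), (3, 1, 2), (3, 0, 1), (3, 0, 2), (1, 0, 1)]

size = 24  # input grid is d x d x d with d = 24
sizes = []
for k, p, s in layers:
    size = conv_out(size, k, p, s)
    sizes.append(size)
print(sizes)  # spatial extent shrinks layer by layer, ending at 1
```

The spatial extent follows 24 → 12 → 12 → 6 → 4 → 1 → 1, confirming that the last convolution acts as a fully connected map onto the descriptor.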

Training Details

Besides the strategies for rotational and translational data augmentation described above, with probability 0.1 we sample random points on the protein instead of pocket centers μ in negative pairs, in order to increase the variability of negatives and regularize the behavior of the network outside pockets over the whole protein. We set margin m = 1, loss weight λ = 1, and descriptor dimensionality n = 128. Networks are trained on balanced batches of 16 quadruples for 6000 iterations with a variant of stochastic gradient descent, Adam, (47) with a weight decay of 5 × 10⁻⁴ and a learning rate of 0.001, step-wise annealed after 4000 iterations. Training takes about a day on a single GPU. We observe that higher resolution and larger spatial context are generally beneficial and set d = 24 and s = 1 Å as a compromise between computational efficiency and performance in this work. Finally, let us remark that we use the same training parameters for all networks presented in this work.
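The step-wise annealing can be sketched as a single drop of the learning rate; the decay factor below is an assumption on our part, as the text only states that the rate of 0.001 is annealed after 4000 iterations:

```python
def learning_rate(iteration, base_lr=0.001, anneal_at=4000, factor=0.1):
    """Step-wise annealed learning rate: constant at base_lr, then
    multiplied once by `factor` (assumed value) from `anneal_at` on."""
    return base_lr * (factor if iteration >= anneal_at else 1.0)
```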

Evaluation Details

As we train and evaluate our method using existing datasets, the primary goal of our evaluation strategy is to enable comparison with past and future works. In all three cases, the authors report results using the receiver operating characteristic (ROC). Therefore, we also focus our analysis on the ROC and the associated area under the curve (AUC). In information retrieval, it is often preferable to evaluate performance using precision-recall analysis, particularly in scenarios involving class imbalance—as is often the case in drug discovery. (48) Therefore, for the evaluation of DeeplyTough, we also provide metrics for precision-recall in the Supporting Information.
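Since the score produced by DeeplyTough is a distance (smaller means more similar), ROC AUC can be computed directly as the probability that a randomly chosen positive pair is closer than a randomly chosen negative pair. A NumPy sketch of this rank-based view (names are illustrative):

```python
import numpy as np

def auc_from_distances(pos_dists, neg_dists):
    """ROC AUC when the score is a distance: the probability that a
    random positive pair is closer than a random negative pair,
    counting ties as 1/2."""
    pos = np.asarray(pos_dists, dtype=float)[:, None]
    neg = np.asarray(neg_dists, dtype=float)[None, :]
    wins = (pos < neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)
```

This exhaustive pairwise comparison is quadratic in the number of pairs; for the dataset sizes reported here, a sort-based (rank) implementation or a library routine would be preferable, but the quantity computed is the same.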

Results and Discussion


TOUGH-M1 Dataset

TOUGH-M1 (20) is a dataset of 505,116 positive and 556,810 negative protein pocket pairs defined from 7524 protein structures. Pockets are isolated computationally with Fpocket 2.0 (27) and filtered to include only predicted cavities having the greatest overlap with known binding residues; see section Training Dataset in Methods. As the TOUGH-M1 dataset is used for both training and evaluation but has not been used in a machine learning setting previously, we also define a sequence-based clustered splitting strategy, described in the Data Splitting Strategy section of Methods. We describe our strategy for training and evaluation, before we report our results and compare to several baseline methods.

Training and Evaluation Strategy

For the TOUGH-M1 dataset, sequence-based clusters were assigned to all PDB entries—20 entries for which no cluster could be identified were removed. Global structure-based alignment scores, shown in Figure S1, indicate that there are very few structures of high similarity distributed across training and test splits—shown for the first of 10 random partitions of the data.
Following Govindaraj and Brylinski, (20) the performance is measured by the ROC, and the corresponding AUC is reported. In addition, we report precision-recall curves in Figure S2. To estimate the sensitivity to the choice of particular splits, we use repeated random subsampling (Monte Carlo) validation. (49) Keeping any hyperparameter fixed, we repeat splitting and training on 10 random permutations of TOUGH-M1 and measure the standard error over the respective test sets.

Baselines

We compare DeeplyTough to four alignment-based methods chosen by the authors of the TOUGH-M1 dataset and to an additional alignment-free approach. APoc (50) optimizes the alignment between pocket pairs by iterative dynamic and integer programming, considering the secondary structure and fragment fitting. G-LoSA (22) uses iterative maximum clique search and fragment superposition. SiteEngine (51) uses geometric hashing and matching of triangles of physicochemical property centers. TM-align (52) is a protein structure alignment tool considering Cα coordinates and secondary structure elements, which we apply to individual pockets here. (21) Last, PocketMatch (53) is an alignment-free method which represents pockets as lists of sorted distances encoding their shape and chemical properties. While we reuse the list of matching scores published by Govindaraj and Brylinski (20) for each alignment-based method and split them into folds according to our evaluation strategy, we compute matching scores for PocketMatch ourselves.

Results

The measured ROC curves for the TOUGH-M1 dataset are shown in Figure 2. DeeplyTough achieves an AUC of 0.913, outperforming all other approaches by a large margin, with a substantial improvement over the second-best performing method, SiteEngine (AUC 0.732). G-LoSA achieves a greater AUC than PocketMatch and APoc (AUCs 0.694, 0.644, and 0.644, respectively). TM-align shows close to random performance (AUC 0.52, TM-align/Fpocket). However, if pockets are defined using their bound ligands instead of Fpocket predictions, the performance of TM-align increases to AUC 0.654 (TM-align/ligand), although this result is not strictly comparable with the other scores. Furthermore, the performance of all methods is fairly stable across different test (and training) sets, with DeeplyTough achieving the lowest standard error.

Figure 2

Figure 2. ROC plot with associated AUC values evaluating the performance of pocket matching algorithms on TOUGH-M1 testing folds. Standard error, denoted as se, is measured over 10 random splits. The dashed line represents random predictions.

Analysis of TOUGH-M1 positive pocket pairs that are assigned large distances (false negatives) highlights potentially questionable ground truths in the dataset. In particular, bound ligands of false negative pockets show enrichment of biologically versatile endogenous molecules such as nucleotides [adenosine triphosphate (ATP) and ACO], amino acid monomers [tyrosine (TYR) and aspartic acid (ASP)], and sugars [glucose (GLC) and NDG], as well as nonbiologically relevant ligands involved in the production of protein structure data (MPD, CIT, and TRS)—three-letter codes refer to PDB chemical component (CC) identifiers. Whereas these pocket pairs do represent instances where related ligands are bound to unrelated proteins (constituting the definition of a positive pocket pair), we argue that in some cases, there is limited structural similarity between pockets, and shared binding may be attributed to the conformational flexibility (54) or nonspecificity of the bound ligand. These cases may be considered a limitation of the current dataset.
On the other hand, analysis of TOUGH-M1 negative pocket pairs that are assigned small distances (false positives) suggests an interconnected network of pocket pairs according to distances generated by DeeplyTough. Of the false positive pocket pairs examined, many pockets bind polar ligand moieties containing anionic groups such as phosphates (2P0 and T3P), sulfonamides (E49 and 3JW), and carboxylates (G39 and BES). Furthermore, false positive pocket pairs seem to be enriched with polar residues, suggesting that there may be similarity between these pockets, despite their negative annotation; however, this hypothesis would need to be validated in a future work. Inherently, there is a high potential for false positives in pocket pair datasets because the absence of common bound ligands in extant data does not preclude two binding sites from binding related ligands.
Methods based on machine learning are susceptible to the biases embodied by their training data. (55) In our case, we use TOUGH-M1, which in turn is derived from the PDB. There is nonuniform representation of both protein classes and ligand chemical space in the PDB. In particular, the five most common ligands in the TOUGH-M1 dataset are ADP, NAD, NDP, ATP, and FAD, all of which contain the adenine moiety. This reflects the prevalence of endogenous cofactors in the underlying data distribution, and consideration of such bias should be a feature of future machine learning approaches to pocket matching. We expect that careful dataset design would be the key to maximize the performance of pocket matching in this setting. It should not be assumed, however, that the preponderance of such cofactors present an easy learning task because these are flexible ligands that bind a range of pockets, (56) as described in the analysis of false negatives above.
DeeplyTough performs well on the TOUGH-M1 dataset; however, data is still drawn from the same underlying distribution. It is therefore interesting to evaluate the performance of DeeplyTough under the so-called domain shift on further datasets constructed independently. These differ in their annotation label definitions and in the manner of defining pocket centers, as described below.

Vertex Dataset

The Vertex dataset introduced by Chen et al. (6) comprises 6598 positive and 379 negative protein pairs defined from 6029 protein structures. The protocol for annotation of protein pairs derives from commonalities (or lack thereof) among experimentally measured ligand activities. In this benchmark, predicted protein-level similarities are obtained from a set of pocket-level similarities. In particular, the smallest of k × l calculated pocket distances is assigned to each protein pair of interest, with k ligand binding sites collected from a set of PDB structures of one protein and l from the other. For the Vertex dataset, this amounts to 1,461,668 positive and 102,935 negative pocket comparisons in total. Unlike the TOUGH-M1 dataset, where binding sites are obtained from predicted cavities, the Vertex dataset defines pockets using their bound ligands directly. Specifically, we define pockets as all protein residues with any atom falling within 8 Å of any ligand atom.
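The protein-level aggregation described above can be sketched as the minimum over the k × l matrix of pocket descriptor distances (array shapes and names are illustrative):

```python
import numpy as np

def protein_pair_distance(desc_A, desc_B):
    """Protein-level score for the Vertex-style evaluation: the smallest
    of the k x l pocket descriptor distances between two proteins, where
    desc_A has shape (k, n) and desc_B has shape (l, n)."""
    diffs = desc_A[:, None, :] - desc_B[None, :, :]   # (k, l, n)
    dists = np.linalg.norm(diffs, axis=-1)            # k x l distance matrix
    return dists.min()
```

Taking the minimum distance mirrors taking the maximum predicted pocket similarity over all pockets and structures of the two proteins.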

Training and Evaluation Strategy

The network is trained on the whole TOUGH-M1 dataset. However, to prevent information leakage, we discard all TOUGH-M1 structures with more than 30% sequence identity to any structure in the Vertex dataset, as well as 20 PDB entries to which no sequence cluster could be assigned, leaving 6548 structures and 710,009 pairs for training. Global structure-based alignment scores, shown in Figure S1, indicate that only a few structures of high similarity are distributed across splits. Following Chen et al., (6) we measure performance with the ROC curve and the corresponding AUC. In addition, we report precision-recall curves in Figure S3.
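The ROC AUC used throughout can be computed directly from its probabilistic interpretation: the chance that a randomly chosen positive pair is scored as more similar than a randomly chosen negative pair. A minimal numpy sketch, with negated descriptor distances serving as similarity scores:

```python
import numpy as np

def roc_auc(labels: np.ndarray, scores: np.ndarray) -> float:
    """AUC as the probability that a positive pair outscores a negative one.

    labels: 1 for positive (similar) pairs, 0 for negative pairs.
    scores: higher means predicted more similar (e.g. negated distances).
    """
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()   # positive ranked higher
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count half
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

labels = np.array([1, 1, 0, 0])
distances = np.array([0.2, 0.9, 0.4, 1.5])  # hypothetical descriptor distances
print(roc_auc(labels, -distances))  # → 0.75
```

This quadratic-time formulation is fine for illustration; library implementations use a rank-based equivalent for large pair sets.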

Baselines

We compare our approach to SiteHopper, (57) a structure-based pocket matching method chosen by the authors of the dataset. SiteHopper is an alignment-based method that represents binding sites as sets of points describing the molecular surface and nearby physicochemical features, which are aligned by maximizing the overlap of point-centered Gaussian functions. We also compare to PocketMatch and TM-align, as in the TOUGH-M1 analysis. G-LoSA was omitted from this study because its projected running time amounts to hundreds of days on a single processor. Results for SiteHopper were kindly provided in personal communication by Chen et al. (6)

Results

The measured ROC curves are shown for the Vertex dataset in Figure 3. DeeplyTough shows performance (AUC 0.830) that lies between SiteHopper (AUC 0.887) and TM-align (AUC 0.767), whereas PocketMatch performs more poorly (AUC 0.604). Importantly, the result indicates that our method generalizes well across two different methods for defining binding site geometric centers (computational and ligand-based). These results also hint that even though ground truth annotation labels in TOUGH-M1 and Vertex have been defined by different protocols (clustering by ligand structure and using in vitro binding activity data, respectively), they still share some compatibility. Whereas SiteHopper performs more favorably than DeeplyTough on the Vertex dataset, there is also a substantial difference between their respective runtimes; we revisit this more quantitatively below.

Figure 3

Figure 3. ROC plot with associated AUC values evaluating the performance of pocket matching algorithms on the Vertex dataset (6977 protein pairs). The dashed line represents random predictions.

ProSPECCTs Datasets

ProSPECCTs (21) is a collection of 10 benchmarks recently assembled to better understand the performance of pocket matching for various practical applications. As for the Vertex dataset above, binding sites are defined by their bound ligands, and we include complete protein residues with any atom that falls within 8 Å of any ligand atom.

Training and Evaluation Strategy

The network is trained on the whole TOUGH-M1 dataset. To prevent information leakage, we discard all TOUGH-M1 structures with more than 30% sequence identity to any structure in any ProSPECCTs dataset, as well as 20 PDB entries to which no sequence cluster could be assigned, leaving 4862 structures and 401,366 pairs for training. Because more structures, and by extension pairs, overlap between TOUGH-M1 and the ProSPECCTs datasets than between TOUGH-M1 and Vertex, this training set is smaller; in response, we use a smaller architecture for DeeplyTough on ProSPECCTs to limit the potential for overfitting, as described in the Network Architecture section of Methods. For each ProSPECCTs dataset, global structure-based alignment scores indicate that only a few structures of high similarity are distributed across train and test splits (Figure S1). Following Ehrt et al., (21) performance is measured with ROC curves and the corresponding AUCs.

Baselines

We compare our approach to 21 pocket matching methods chosen by Ehrt et al., (21) directly reusing their published results.

Results

The measured AUC scores for the ProSPECCTs datasets are given in Table 2 and compactly visualized in Figure 4.

Figure 4

Figure 4. AUC values for DeeplyTough (green) and 21 other pocket matching methods (gray) on each of 10 ProSPECCTs datasets.

Table 2. AUC Values for 22 Pocket Matching Methods on Each of 10 ProSPECCTs Datasets
| Method | P1 | P1.2 | P2 | P3 | P4 | P5 | P5.2 | P6 | P6.2 | P7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cavbase | 0.98 | 0.91 | 0.87 | 0.65 | 0.64 | 0.60 | 0.57 | 0.55 | 0.55 | 0.82 |
| FuzCav | 0.94 | 0.99 | 0.99 | 0.69 | 0.58 | 0.55 | 0.54 | 0.67 | 0.73 | 0.77 |
| FuzCav (PDB) | 0.94 | 0.99 | 0.98 | 0.69 | 0.58 | 0.56 | 0.54 | 0.65 | 0.72 | 0.77 |
| grim | 0.69 | 0.97 | 0.92 | 0.55 | 0.56 | 0.69 | 0.61 | 0.45 | 0.65 | 0.70 |
| grim (PDB) | 0.62 | 0.83 | 0.85 | 0.57 | 0.56 | 0.61 | 0.58 | 0.45 | 0.50 | 0.64 |
| IsoMIF | 0.77 | 0.97 | 0.70 | 0.59 | 0.59 | 0.75 | 0.81 | 0.62 | 0.62 | 0.87 |
| KRIPO | 0.91 | 1.00 | 0.96 | 0.60 | 0.61 | 0.76 | 0.77 | 0.73 | 0.74 | 0.85 |
| PocketMatch | 0.82 | 0.98 | 0.96 | 0.59 | 0.57 | 0.66 | 0.60 | 0.51 | 0.51 | 0.82 |
| ProBiS | 1.00 | 1.00 | 1.00 | 0.47 | 0.46 | 0.54 | 0.55 | 0.50 | 0.50 | 0.85 |
| RAPMAD | 0.85 | 0.83 | 0.82 | 0.61 | 0.63 | 0.55 | 0.52 | 0.60 | 0.60 | 0.74 |
| shaper | 0.96 | 0.93 | 0.93 | 0.71 | 0.76 | 0.65 | 0.65 | 0.54 | 0.65 | 0.75 |
| shaper (PDB) | 0.96 | 0.93 | 0.93 | 0.71 | 0.76 | 0.66 | 0.64 | 0.54 | 0.65 | 0.75 |
| VolSite/shaper | 0.93 | 0.99 | 0.78 | 0.68 | 0.76 | 0.56 | 0.58 | 0.71 | 0.76 | 0.77 |
| VolSite/shaper (PDB) | 0.94 | 1.00 | 0.76 | 0.68 | 0.76 | 0.57 | 0.56 | 0.50 | 0.57 | 0.72 |
| SiteAlign | 0.97 | 1.00 | 1.00 | 0.85 | 0.80 | 0.59 | 0.57 | 0.44 | 0.56 | 0.87 |
| SiteEngine | 0.96 | 1.00 | 1.00 | 0.82 | 0.79 | 0.64 | 0.57 | 0.55 | 0.55 | 0.86 |
| SiteHopper | 0.98 | 0.94 | 1.00 | 0.75 | 0.75 | 0.72 | 0.81 | 0.56 | 0.54 | 0.77 |
| SMAP | 1.00 | 1.00 | 1.00 | 0.76 | 0.65 | 0.62 | 0.54 | 0.68 | 0.68 | 0.86 |
| TIFP | 0.66 | 0.90 | 0.91 | 0.66 | 0.66 | 0.71 | 0.63 | 0.55 | 0.60 | 0.71 |
| TIFP (PDB) | 0.55 | 0.74 | 0.78 | 0.56 | 0.57 | 0.54 | 0.53 | 0.56 | 0.61 | 0.66 |
| TM-align | 1.00 | 1.00 | 1.00 | 0.49 | 0.49 | 0.66 | 0.62 | 0.59 | 0.59 | 0.88 |
| DeeplyTough | 0.95 | 0.98 | 0.90 | 0.76 | 0.75 | 0.67 | 0.63 | 0.54 | 0.54 | 0.83 |
| rank (DeeplyTough) | 10 | 11 | 15 | 3–4 | 7–8 | 6 | 6–7 | 14–16 | 18–19 | 8 |
Dataset P1 evaluates sensitivity to the binding site definition by comparing structures with identical sequences binding chemically distinct ligands at identical sites; Dataset P1.2 measures this exclusively for chemically similar ligands. Reaching AUCs of 0.95 and 0.98, respectively, DeeplyTough is fairly robust to varying pocket definitions, which may be attributed to our stability loss as well as our data augmentation strategy. In Table 3, we observe that the stability loss alone is responsible for an increase of more than 0.1 AUC across multiple ProSPECCTs datasets. In addition, the box plot in Figure 5 illustrates a clear distance-based separation between identical and nonmatching binding site pairs, likely a virtue of our margin-based contrastive loss function.
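Schematically, the two loss components mentioned here behave as in the following simplified numpy sketch of a margin-based contrastive loss (39) and a stability penalty (43); the exact DeeplyTough formulation and hyperparameter values may differ:

```python
import numpy as np

def contrastive_loss(dist: float, is_match: bool, margin: float = 1.0) -> float:
    """Margin-based contrastive loss on a descriptor distance.

    Matching pockets are pulled together; non-matching pockets are only
    penalized while their distance is below the margin.
    """
    if is_match:
        return 0.5 * dist ** 2
    return 0.5 * max(margin - dist, 0.0) ** 2

def stability_penalty(desc: np.ndarray, desc_perturbed: np.ndarray) -> float:
    """Penalize descriptor changes under nuisance perturbations of the
    input (e.g. a slightly shifted pocket center)."""
    return float(np.sum((desc - desc_perturbed) ** 2))

print(contrastive_loss(0.3, True))                # matching pair: pulled closer
print(contrastive_loss(2.0, False))               # distant non-match: zero loss
print(stability_penalty(np.ones(4), np.ones(4)))  # identical descriptors: 0.0
```

Non-matching pairs already separated by more than the margin contribute no gradient, which is what produces the gap between the two distance distributions in Figure 5.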

Figure 5

Figure 5. Distribution of distances between DeeplyTough descriptors of matching and nonmatching binding sites of structures with identical sequences (ProSPECCTs Dataset P1).

Table 3. Effect of Training Dataset Size, Expressed as the Number of Positive and Negative Binding Site Pairs or of Unique PDB Structures, and of the Stability Loss, Measured in AUC on Each of 10 ProSPECCTs Datasets (± Standard Error Where Applicable)
| Setting | P1 | P1.2 | P2 | P3 | P4 | P5 | P5.2 | P6 | P6.2 | P7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 × 100k pairs | 0.94 ± 0.00 | 0.99 ± 0.00 | 0.89 ± 0.01 | 0.75 ± 0.01 | 0.73 ± 0.01 | 0.69 ± 0.00 | 0.64 ± 0.00 | 0.52 ± 0.01 | 0.53 ± 0.00 | 0.81 ± 0.00 |
| 2 × 10k pairs | 0.91 ± 0.02 | 0.98 ± 0.00 | 0.86 ± 0.01 | 0.75 ± 0.02 | 0.73 ± 0.01 | 0.68 ± 0.00 | 0.63 ± 0.00 | 0.55 ± 0.01 | 0.55 ± 0.01 | 0.82 ± 0.00 |
| 2 × 1k pairs | 0.85 ± 0.00 | 0.92 ± 0.00 | 0.84 ± 0.02 | 0.64 ± 0.03 | 0.63 ± 0.04 | 0.64 ± 0.00 | 0.60 ± 0.00 | 0.51 ± 0.05 | 0.52 ± 0.05 | 0.80 ± 0.00 |
| 3k structures | 0.88 ± 0.01 | 0.95 ± 0.03 | 0.86 ± 0.00 | 0.71 ± 0.02 | 0.67 ± 0.00 | 0.67 ± 0.01 | 0.62 ± 0.01 | 0.54 ± 0.02 | 0.54 ± 0.02 | 0.81 ± 0.01 |
| 2k structures | 0.81 ± 0.00 | 0.90 ± 0.01 | 0.81 ± 0.00 | 0.66 ± 0.01 | 0.67 ± 0.00 | 0.65 ± 0.00 | 0.61 ± 0.01 | 0.53 ± 0.04 | 0.53 ± 0.04 | 0.77 ± 0.00 |
| 1k structures | 0.76 ± 0.00 | 0.86 ± 0.01 | 0.78 ± 0.04 | 0.63 ± 0.01 | 0.65 ± 0.01 | 0.61 ± 0.00 | 0.58 ± 0.01 | 0.48 ± 0.04 | 0.49 ± 0.04 | 0.73 ± 0.01 |
| no stability loss | 0.84 | 0.89 | 0.76 | 0.64 | 0.61 | 0.64 | 0.59 | 0.59 | 0.60 | 0.77 |
| proposed method | 0.95 | 0.98 | 0.90 | 0.76 | 0.75 | 0.67 | 0.63 | 0.54 | 0.54 | 0.83 |
Dataset P2 assesses the sensitivity to binding site flexibility by comparing the pockets of nuclear magnetic resonance structures with multiple models in the structure ensemble. DeeplyTough achieves AUC 0.90, which indicates a slight susceptibility to the conformational variability of proteins. We believe this could be addressed by introducing an appropriate data augmentation strategy in the training process.
Next, two decoy datasets evaluate the discrimination between nearly identical binding sites differing by five artificial mutations that alter physicochemical properties (Dataset P3) or both physicochemical and shape properties (Dataset P4). Performing at AUC 0.75–0.76, DeeplyTough has some difficulty ranking original binding site pairs with identical sequences higher than pairs consisting of an original structure and a decoy structure. This suggests that the learned network might be overly robust and may not pay enough attention to pocket modifications. Compared to existing approaches, however, our method still ranks well: fourth and eighth, respectively. Moreover, performance correlates well with the number of mutations in the sites (AUC 0.57, 0.65, 0.68, and 0.75 for one to four mutations in Dataset P3, and AUC 0.55, 0.63, 0.68, and 0.70 in Dataset P4), consistent with the intuition that pockets with more mutations should be easier to differentiate.
Another two datasets contain sets of dissimilar proteins binding to identical ligands and cofactors. Datasets P5 and P5.2 have been compiled by Kahraman et al. (25) and contain 100 structures bound to 1 of 10 ligands (excluding and including phosphate binding sites, respectively). Datasets P6 and P6.2 contain pairs of unrelated proteins bound to identical ligands, assembled by Barelier et al. (5) (excluding and including cofactors, respectively). Our method scores better on the Kahraman dataset (AUC 0.67) than on unrelated proteins (AUC 0.54), consistent with reports that the Kahraman dataset represents an easy benchmark since the chosen ligands may be distinguished solely by their sizes. (26)
Finally, Dataset P7 is a retrieval experiment measuring the recovery of known binding site similarities within a set of diverse proteins. DeeplyTough reaches AUC 0.83 (average precision 0.45), which places it in the better-performing half of the baseline approaches.
In summary, our method performs consistently well across ProSPECCTs datasets. For practical applications, we suggest these results support the use of DeeplyTough as a fast universal tool, rather than a specialist one.
Our evaluations also show that DeeplyTough generalizes across different levels of class imbalance. Whereas the TOUGH-M1 training set is balanced, the Vertex dataset contains 17× more positive pairs than negative, and conversely, ProSPECCTs Datasets P7 and P2 have 489× and 13× more negative than positive pairs, respectively.

Running Time

Ehrt et al. (21) also published the running time of each algorithm on Dataset P5 (100 pockets, 10,000 comparisons). The runtime for DeeplyTough is 206.4 s in total: preprocessing with AutoDock 4 (31) and HTMD (32) requires 191.4 s (serialized on a single CPU core), and the descriptor computation and comparison take 15 s on an Nvidia Titan X. This makes ours the fourth fastest approach in the benchmark, behind PocketMatch, RAPMAD, and TM-align. SiteHopper is considerably slower, with a total runtime of 3982.6 s (17th of the ProSPECCTs baseline methods). For DeeplyTough, we expect that further reduced runtimes may be achieved by fully parallelizing the initial preprocessing.

Protein Binding Site Space

We use t-SNE (58) to visualize the learned descriptor space obtained by DeeplyTough. Figure 6 shows the embeddings of pockets in ProSPECCTs Dataset P1 colored by their associated UniProt accession numbers. For the most part, pockets derived from the same protein family are clustered together, suggesting that the network embeds similar pockets close to each other in the descriptor space. Similar conclusions can be drawn for pockets derived from the Vertex dataset in Figure 7, wherein pockets are colored by their respective top-level SCOPe classifications. These embeddings suggest that DeeplyTough may be useful for large-scale analyses of protein binding site space. (3)
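Such a projection can be reproduced for any set of descriptors with scikit-learn's t-SNE implementation; a minimal sketch, in which random vectors stand in for real DeeplyTough descriptors:

```python
import numpy as np
from sklearn.manifold import TSNE  # t-SNE (58) as implemented in scikit-learn

# Hypothetical stand-ins for pocket descriptors (here 128-dimensional)
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(60, 128))

# Embed into 2-D for plotting; perplexity must be below the sample count
coords = TSNE(n_components=2, perplexity=10.0, random_state=0).fit_transform(descriptors)
print(coords.shape)  # → (60, 2)
```

The resulting 2-D coordinates can then be scatter-plotted and colored by UniProt accession or SCOPe class, as in Figures 6 and 7.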

Figure 6

Figure 6. t-SNE visualization of descriptors of binding sites in 12 protein groups, denoted by their UniProt accession numbers, contained in ProSPECCTs Dataset P1.

Figure 7

Figure 7. t-SNE visualization of descriptors of binding sites in the Vertex dataset, labeled by the top-level SCOPe class of their proteins.

Training Data Ablation

Deep learning models are widely held to require large amounts of data for training. To provide insight into the dependence of DeeplyTough on the large scale of the TOUGH-M1 dataset, we experiment with artificially limiting the training dataset size in two ways, in both cases validating on ProSPECCTs as an independent set. For simplicity, all training and network hyperparameters are kept fixed. However, this causes overfitting in lower-data regimes; we believe that investigating smaller networks or stronger regularization would improve the results discussed below.
First, we restrict the number of pocket pairs available to the network for training. Random subsets of between 1000 and 100,000 pairs are sampled from the original training set for both positive and negative pocket pairs. The results in Table 3 suggest that the network does not strongly suffer from the removal of training data in this way, even when the training set is reduced by 2 orders of magnitude. The likely cause is that even for a reduced training set of only 2000 pairs, the effective number of structures remains relatively high (about 2000 PDB entries).
Second, we constrain the number of distinct PDB structures available for training. Random subsets of between 1000 and 3000 structures are sampled from the 4862 structures in the TOUGH-M1 dataset, and only pairs lying within these subsets are retained. Results shown in Table 3 indicate that performance starts to deteriorate severely as the number of structures drops below 2000, even though this still corresponds to about 65,000 induced pairs. We may therefore conclude that, for our method, structural diversity in the data matters more than the number of ground truth relationships. This observation suggests that it may be appropriate to construct new pocket matching datasets using as many structures from the PDB as possible, even if relatively few pocket pairs are defined. In addition, we expect pretrained DeeplyTough to be amenable to fine-tuning on smaller task-specific datasets, thus adapting to their specific biases.
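The structure-level ablation (retaining only pairs whose two structures both fall inside a random subset) can be sketched as follows; the PDB identifiers are hypothetical:

```python
import random

def subsample_by_structure(pairs, n_structures, seed=0):
    """Keep only pocket pairs in which both PDB structures belong to a
    randomly chosen subset of n_structures structures."""
    structures = sorted({s for pair in pairs for s in pair})
    random.Random(seed).shuffle(structures)
    keep = set(structures[:n_structures])
    return [p for p in pairs if p[0] in keep and p[1] in keep]

pairs = [("1abc", "2def"), ("1abc", "3ghi"), ("2def", "4jkl")]
# Keeping all four structures retains every induced pair
print(subsample_by_structure(pairs, 4))
```

Note that shrinking the structure subset removes pairs superlinearly, which is why a few thousand structures can still induce tens of thousands of pairs.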

Conclusions


In this work, we have proposed a deep learning-based approach for pocket matching. DeeplyTough encodes the 3D structure of protein binding sites using a CNN such that the similarity between binding sites is reflected in the Euclidean distance between their descriptor vectors. Once a set of descriptors is computed, pocket matching is simple and efficient, without any alignment operation taking place. In a thorough evaluation on three benchmark datasets, we have demonstrated excellent performance on held-out data coming from the training distribution (TOUGH-M1) and competitive performance when the network needs to generalize to independently constructed datasets (Vertex, ProSPECCTs). We have taken advantage of several recent innovations such as rotationally and translationally invariant CNNs, data augmentation, and the inclusion of an explicit stability loss function to encourage robustness of the network toward nuisance variability of the input. Overall, we expect trained methods for pocket matching to remove biases associated with intuition-based featurization schemes and also enable effective large-scale binding site analyses.
Having presented one of the first trainable methods for pocket matching, there are many exciting avenues for future research. Exploring different methods for obtaining supervision is perhaps the most promising direction. For example, binary labels could be replaced with continuous labels based on chemical similarity of ligands. In addition, the problem could be cast as multiple-instance learning in order to use protein-level similarity as a form of weak supervision. Another direction is to investigate other input representations, such as graphs or surfaces. Finally, experiments with model explainability techniques will give practitioners insights into the currently rather black-box nature of the algorithm.

Supporting Information


The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.9b00554.

  • Box plots of TM-scores indicating the structural similarity of pockets in a test set to their nearest neighbor pockets in a training set for each training scenario, precision-recall plot with associated average precision values evaluating the performance of pocket matching algorithms on TOUGH-M1 testing folds and on the Vertex dataset, and average precision values for DeeplyTough on each of 10 ProSPECCTs datasets (PDF)


Author Information


Acknowledgments


We thank Marwin Segler, Mohamed Ahmed, Nathan Brown, and Amir Saffari for helpful discussions as well as Lidio Meireles for providing us with the results for SiteHopper.

References


This article references 58 other publications.

  1. Ehrt, C.; Brinkjost, T.; Koch, O. Impact of Binding Site Comparisons on Medicinal Chemistry and Rational Molecular Design. J. Med. Chem. 2016, 59, 4121–4151. DOI: 10.1021/acs.jmedchem.6b00078
  2. Illergård, K.; Ardell, D. H.; Elofsson, A. Structure is Three to Ten Times More Conserved than Sequence–A Study of Structural Response in Protein Cores. Proteins: Struct., Funct., Bioinf. 2009, 77, 499–508. DOI: 10.1002/prot.22458
  3. Meyers, J.; Brown, N.; Blagg, J. Mapping the 3D Structures of Small Molecule Binding Sites. J. Cheminf. 2016, 8, 70. DOI: 10.1186/s13321-016-0180-0
  4. Naderi, M.; Lemoine, J. M.; Govindaraj, R. G.; Kana, O. Z.; Feinstein, W. P.; Brylinski, M. Binding Site Matching in Rational Drug Design: Algorithms and Applications. Briefings Bioinf. 2019, 20, 2167. DOI: 10.1093/bib/bby078
  5. Barelier, S.; Sterling, T.; O’Meara, M. J.; Shoichet, B. K. The Recognition of Identical Ligands by Unrelated Proteins. ACS Chem. Biol. 2015, 10, 2772–2784. DOI: 10.1021/acschembio.5b00683
  6. Chen, Y.-C.; Tolbert, R.; Aronov, A. M.; McGaughey, G.; Walters, W. P.; Meireles, L. Prediction of Protein Pairs Sharing Common Active Ligands Using Protein Sequence, Structure, and Ligand Similarity. J. Chem. Inf. Model. 2016, 56, 1734–1745. DOI: 10.1021/acs.jcim.6b00118
  7. Meyers, J.; Chessum, N. E. A.; Ali, S.; Mok, N. Y.; Wilding, B.; Pasqua, A. E.; Rowlands, M.; Tucker, M. J.; Evans, L. E.; Rye, C. S.; O’Fee, L.; Le Bihan, Y.-V.; Burke, R.; Carter, M.; Workman, P.; Blagg, J.; Brown, N.; van Montfort, R. L. M.; Jones, K.; Cheeseman, M. D. Privileged Structures and Polypharmacology within and between Protein Families. ACS Med. Chem. Lett. 2018, 9, 1199–1204. DOI: 10.1021/acsmedchemlett.8b00364
  8. Rifaioglu, A. S.; Atas, H.; Martin, M.-J.; Cetin-Atalay, R.; Atalay, V.; Dogan, T. Recent Applications of Deep Learning and Machine Intelligence on In Silico Drug Discovery: Methods, Tools and Databases. Briefings Bioinf. 2019, 20, 1878. DOI: 10.1093/bib/bby061
  9. Wallach, I.; Dzamba, M.; Heifets, A. AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery. 2015, arXiv preprint arXiv:1510.02855.
  10. Stepniewska-Dziubinska, M. M.; Zielenkiewicz, P.; Siedlecki, P. Development and Evaluation of a Deep Learning Model for Protein-Ligand Binding Affinity Prediction. Bioinformatics 2018, 34, 3666–3674. DOI: 10.1093/bioinformatics/bty374
  11. Jiménez, J.; Škalič, M.; Martínez-Rosell, G.; De Fabritiis, G. KDEEP: Protein–Ligand Absolute Binding Affinity Prediction via 3D-Convolutional Neural Networks. J. Chem. Inf. Model. 2018, 58, 287–296. DOI: 10.1021/acs.jcim.7b00650
  12. Imrie, F.; Bradley, A. R.; van der Schaar, M.; Deane, C. M. Protein Family-Specific Models Using Deep Neural Networks and Transfer Learning Improve Virtual Screening and Highlight the Need for More Data. J. Chem. Inf. Model. 2018, 58, 2319–2330. DOI: 10.1021/acs.jcim.8b00350
  13. Billings, W. M.; Hedelius, B.; Millecam, T.; Wingate, D.; Corte, D. D. ProSPr: Democratized Implementation of Alphafold Protein Distance Prediction Network. 2019, bioRxiv.
  14. Gao, M.; Zhou, H.; Skolnick, J. DESTINI: A Deep-Learning Approach to Contact-Driven Protein Structure Prediction. Sci. Rep. 2019, 9, 3514. DOI: 10.1038/s41598-019-40314-1
  15. Skalic, M.; Varela-Rial, A.; Jiménez, J.; Martínez-Rosell, G.; De Fabritiis, G. LigVoxel: Inpainting Binding Pockets Using 3D-Convolutional Neural Networks. Bioinformatics 2019, 35, 243–250. DOI: 10.1093/bioinformatics/bty583
  16. Jiménez, J.; Doerr, S.; Martínez-Rosell, G.; Rose, A. S.; De Fabritiis, G. DeepSite: Protein-Binding Site Predictor Using 3D-Convolutional Neural Networks. Bioinformatics 2017, 33, 3036–3042. DOI: 10.1093/bioinformatics/btx350
  17. Fout, A.; Byrd, J.; Shariat, B.; Ben-Hur, A. Protein Interface Prediction Using Graph Convolutional Networks. Advances in Neural Information Processing Systems; Curran Associates, Inc., 2017; pp 6530–6539.
  18. Townshend, R. J.; Bedi, R.; Dror, R. O. Generalizable Protein Interface Prediction with End-to-End Learning. 2018, arXiv preprint arXiv:1807.01297.
  19. Pu, L.; Govindaraj, R. G.; Lemoine, J. M.; Wu, H.-C.; Brylinski, M. DeepDrug3D: Classification of Ligand-binding Pockets in Proteins with a Convolutional Neural Network. PLoS Comput. Biol. 2019, 15, e1006718. DOI: 10.1371/journal.pcbi.1006718
  20. Govindaraj, R. G.; Brylinski, M. Comparative Assessment of Strategies to Identify Similar Ligand-Binding Pockets in Proteins. BMC Bioinf. 2018, 19, 91. DOI: 10.1186/s12859-018-2109-2
  21. Ehrt, C.; Brinkjost, T.; Koch, O. A Benchmark Driven Guide to Binding Site Comparison: An Exhaustive Evaluation Using Tailor-Made Data Sets (ProSPECCTs). PLoS Comput. Biol. 2018, 14, e1006483. DOI: 10.1371/journal.pcbi.1006483
  22. Lee, H. S.; Im, W. Protein Function Prediction; Springer, 2017; pp 97–108.
  23. Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235–242. DOI: 10.1093/nar/28.1.235
  24. Brylinski, M.; Skolnick, J. What is the Relationship between the Global Structures of Apo and Holo Proteins? Proteins: Struct., Funct., Bioinf. 2008, 70, 363–377. DOI: 10.1002/prot.21510
  25. Kahraman, A.; Morris, R. J.; Laskowski, R. A.; Thornton, J. M. Shape Variation in Protein Binding Pockets and Their Ligands. J. Mol. Biol. 2007, 368, 283–301. DOI: 10.1016/j.jmb.2007.01.086
  26. Hoffmann, B.; Zaslavskiy, M.; Vert, J.-P.; Stoven, V. A New Protein Binding Pocket Similarity Measure Based on Comparison of Clouds of Atoms in 3D: Application to Ligand Prediction. BMC Bioinf. 2010, 11, 99. DOI: 10.1186/1471-2105-11-99
  27. Le Guilloux, V.; Schmidtke, P.; Tuffery, P. Fpocket: An Open Source Platform for Ligand Pocket Detection. BMC Bioinf. 2009, 10, 168. DOI: 10.1186/1471-2105-10-168
  28. Li, Y.; Yang, J. Structural and Sequence Similarity Makes a Significant Impact on Machine-Learning-Based Scoring Functions for Protein-Ligand Interactions. J. Chem. Inf. Model. 2017, 57, 1007–1012. DOI: 10.1021/acs.jcim.7b00049
  29. Feinberg, E. N.; Sur, D.; Wu, Z.; Husic, B. E.; Mai, H.; Li, Y.; Sun, S.; Yang, J.; Ramsundar, B.; Pande, V. S. PotentialNet for Molecular Property Prediction. ACS Cent. Sci. 2018, 4, 1520–1530. DOI: 10.1021/acscentsci.8b00507
  30. Kramer, C.; Gedeck, P. Leave-cluster-out Cross-validation is Appropriate for Scoring Functions Derived from Diverse Protein Data Sets. J. Chem. Inf. Model. 2010, 50, 1961–1969. DOI: 10.1021/ci100264e
  31. Morris, G. M.; Huey, R.; Lindstrom, W.; Sanner, M. F.; Belew, R. K.; Goodsell, D. S.; Olson, A. J. AutoDock4 and AutoDockTools4: Automated Docking with Selective Receptor Flexibility. J. Comput. Chem. 2009, 30, 2785–2791. DOI: 10.1002/jcc.21256
  32. Doerr, S.; Harvey, M. J.; Noé, F.; De Fabritiis, G. HTMD: High-Throughput Molecular Dynamics for Molecular Discovery. J. Chem. Theory Comput. 2016, 12, 1845–1852. DOI: 10.1021/acs.jctc.6b00049
  33. Lowe, D. G. Object Recognition from Local Scale-Invariant Features. Proceedings of the Computer Vision and Pattern Recognition Conference, 1999; pp 1150–1157.
  34. Schönberger, J. L.; Hardmeier, H.; Sattler, T.; Pollefeys, M. Comparative Evaluation of Hand-Crafted and Learned Local Features. Proceedings of the Computer Vision and Pattern Recognition Conference, 2017; pp 6959–6968.
  35. Balntas, V.; Lenc, K.; Vedaldi, A.; Mikolajczyk, K. HPatches: A Benchmark and Evaluation of Handcrafted and Learned Local Descriptors. Proceedings of the Computer Vision and Pattern Recognition Conference, 2017.
  36. Wang, J.; Song, Y.; Leung, T.; Rosenberg, C.; Wang, J.; Philbin, J.; Chen, B.; Wu, Y. Learning Fine-Grained Image Similarity with Deep Ranking. Proceedings of the Computer Vision and Pattern Recognition Conference, 2014; pp 1386–1393.
  37. Hoffer, E.; Ailon, N. Deep Metric Learning Using Triplet Network. International Workshop on Similarity-Based Pattern Recognition, 2015; pp 84–92.
  38. Simo-Serra, E.; Trulls, E.; Ferraz, L.; Kokkinos, I.; Fua, P.; Moreno-Noguer, F. Discriminative Learning of Deep Convolutional Feature Point Descriptors. Proceedings of the Computer Vision and Pattern Recognition Conference, 2015; pp 118–126.
  39. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality Reduction by Learning an Invariant Mapping. Proceedings of the Computer Vision and Pattern Recognition Conference, 2006; pp 1735–1742.
  40. Kauderer-Abrams, E. Quantifying Translation-Invariance in Convolutional Neural Networks. 2017, arXiv preprint arXiv:1801.01450.
  41. Fawzi, A.; Frossard, P. Manitest: Are Classifiers Really Invariant? 2015, arXiv preprint arXiv:1507.06535.
  42. Azulay, A.; Weiss, Y. Why Do Deep Convolutional Networks Generalize So Poorly to Small Image Transformations? 2018, arXiv preprint arXiv:1805.12177.
  43. Zheng, S.; Song, Y.; Leung, T.; Goodfellow, I. Improving the Robustness of Deep Neural Networks via Stability Training. Proceedings of the Computer Vision and Pattern Recognition Conference, 2016; pp 4480–4488.
  44. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, 2016; Vol. 1.
  45. Weiler, M.; Geiger, M.; Welling, M.; Boomsma, W.; Cohen, T. 3D Steerable CNNs: Learning Rotationally Equivariant Features in Volumetric Data. 2018, arXiv preprint arXiv:1807.02547.
  46. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. 2015, arXiv preprint arXiv:1502.03167.
  47. Kingma, D. P.; Ba, J. Adam: A Method for Stochastic Optimization. 2014, arXiv preprint arXiv:1412.6980.
  48. Liu, S.; Alnammi, M.; Ericksen, S. S.; Voter, A. F.; Ananiev, G. E.; Keck, J. L.; Hoffmann, F. M.; Wildman, S. A.; Gitter, A. Practical Model Selection for Prospective Virtual Screening. J. Chem. Inf. Model. 2018, 59, 282–293. DOI: 10.1021/acs.jcim.8b00363
  49. Dubitzky, W.; Granzow, M.; Berrar, D. P. Fundamentals of Data Mining in Genomics and Proteomics; Springer Science & Business Media, 2007.
  50. Gao, M.; Skolnick, J. APoc: Large-Scale Identification of Similar Protein Pockets. Bioinformatics 2013, 29, 597–604. DOI: 10.1093/bioinformatics/btt024
  51. Shulman-Peleg, A.; Nussinov, R.; Wolfson, H. J. SiteEngines: Recognition and Comparison of Binding Sites and Protein–Protein Interfaces. Nucleic Acids Res. 2005, 33, W337–W341. DOI: 10.1093/nar/gki482
  52. Zhang, Y.; Skolnick, J. TM-align: A Protein Structure Alignment Algorithm Based on the TM-score. Nucleic Acids Res. 2005, 33, 2302–2309. DOI: 10.1093/nar/gki524
  53. Yeturu, K.; Chandra, N. PocketMatch: A New Algorithm to Compare Binding Sites in Protein Structures. BMC Bioinf. 2008, 9, 543. DOI: 10.1186/1471-2105-9-543
  54. Haupt, V. J.; Daminelli, S.; Schroeder, M. Drug Promiscuity in PDB: Protein Binding Site Similarity Is Key. PLoS One 2013, 8, 1–15. DOI: 10.1371/annotation/0852cc69-8cea-4966-bb8a-ae0b348d1bd9
  55. Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A Survey on Bias and Fairness in Machine Learning. 2019, arXiv preprint arXiv:1908.09635.
  56. Stockwell, G. R.; Thornton, J. M. Conformational Diversity of Ligands Bound to Proteins. J. Mol. Biol. 2006, 356, 928–944. DOI: 10.1016/j.jmb.2005.12.012
  57. Batista, J.; Hawkins, P. C.; Tolbert, R.; Geballe, M. T. SiteHopper—A Unique Tool for Binding Site Comparison. J. Cheminf. 2014, 6, P57. DOI: 10.1186/1758-2946-6-s1-p57
  58. van der Maaten, L.; Hinton, G. Visualizing Data Using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.

Cited By

This article is cited by 26 publications.

  1. Rishal Aggarwal, Akash Gupta, Vineeth Chelur, C. V. Jawahar, U. Deva Priyakumar. DeepPocket: Ligand Binding Site Detection and Segmentation using 3D Convolutional Neural Networks. Journal of Chemical Information and Modeling 2022, 62 (21) , 5069-5079. https://doi.org/10.1021/acs.jcim.1c00799
  2. Oliver B. Scott, Jing Gu, A.W. Edith Chan. Classification of Protein-Binding Sites Using a Spherical Convolutional Neural Network. Journal of Chemical Information and Modeling 2022, Article ASAP.
  3. Jinze Zhang, Hao Li, Xuejun Zhao, Qilong Wu, Sheng-You Huang. Holo Protein Conformation Generation from Apo Structures by Ligand Binding Site Refinement. Journal of Chemical Information and Modeling 2022, Article ASAP.
  4. Vikram K. Mulligan Parisa Hosseinzadeh . Computational Design of Peptide-Based Binders to Therapeutic Targets. ,,, 55-102. https://doi.org/10.1021/bk-2022-1417.ch003
  5. Andrew T. McNutt, David Ryan Koes. Improving ΔΔG Predictions with a Multitask Convolutional Siamese Network. Journal of Chemical Information and Modeling 2022, 62 (8) , 1819-1829. https://doi.org/10.1021/acs.jcim.1c01497
  6. Ankur Kumar Gupta, Krishnan Raghavachari. Three-Dimensional Convolutional Neural Networks Utilizing Molecular Topological Features for Accurate Atomization Energy Predictions. Journal of Chemical Theory and Computation 2022, 18 (4) , 2132-2143. https://doi.org/10.1021/acs.jctc.1c00504
  7. Mingyuan Xu, Ting Ran, Hongming Chen. De Novo Molecule Design Through the Molecular Generative Model Conditioned by 3D Information of Protein Binding Sites. Journal of Chemical Information and Modeling 2021, 61 (7) , 3240-3254. https://doi.org/10.1021/acs.jcim.0c01494
  8. Merveille Eguida, Didier Rognan. A Computer Vision Approach to Align and Compare Protein Cavities: Application to Fragment-Based Drug Design. Journal of Medicinal Chemistry 2020, 63 (13) , 7127-7142. https://doi.org/10.1021/acs.jmedchem.0c00422
  9. Lin Gu, Bin Li, Dengming Ming. A multilayer dynamic perturbation analysis method for predicting ligand–protein interactions. BMC Bioinformatics 2022, 23 (1) https://doi.org/10.1186/s12859-022-04995-2
  10. Wentao Shi, Manali Singha, Limeng Pu, Gopal Srivastava, Jagannathan Ramanujam, Michal Brylinski. GraphSite: Ligand Binding Site Classification with Deep Graph Learning. Biomolecules 2022, 12 (8) , 1053. https://doi.org/10.3390/biom12081053
  11. Chinmayee Choudhury, N. Arul Murugan, U. Deva Priyakumar. Structure-based drug repurposing: Traditional and advanced AI/ML-aided methods. Drug Discovery Today 2022, 27 (7) , 1847-1861. https://doi.org/10.1016/j.drudis.2022.03.006
  12. You-Wei Fan, Wan-Hsin Liu, Yun-Ti Chen, Yen-Chao Hsu, Nikhil Pathak, Yu-Wei Huang, Jinn-Moon Yang. Exploring kinase family inhibitors and their moiety preferences using deep SHapley additive exPlanations. BMC Bioinformatics 2022, 23 (S4) https://doi.org/10.1186/s12859-022-04760-5
  13. Adam Bess, Frej Berglind, Supratik Mukhopadhyay, Michal Brylinski, Nicholas Griggs, Tiffany Cho, Chris Galliano, Kishor M. Wasan. Artificial intelligence for the discovery of novel antimicrobial agents for emerging infectious diseases. Drug Discovery Today 2022, 27 (4) , 1099-1107. https://doi.org/10.1016/j.drudis.2021.10.022
  14. R.S.K. Vijayan, Jan Kihlberg, Jason B. Cross, Vasanthanathan Poongavanam. Enhancing preclinical drug discovery with artificial intelligence. Drug Discovery Today 2022, 27 (4) , 967-984. https://doi.org/10.1016/j.drudis.2021.11.023
  15. Wentao Shi, Manali Singha, Gopal Srivastava, Limeng Pu, J. Ramanujam, Michal Brylinski. Pocket2Drug: An Encoder-Decoder Deep Neural Network for the Target-Based Drug Design. Frontiers in Pharmacology 2022, 13 https://doi.org/10.3389/fphar.2022.837715
  16. Zhesen Tan, Chi Ho Chan, Michael Maleska, Bryan Banuelos Jara, Brian K. Lohman, Nathan J. Ricks, Daniel R. Bond, Ming C. Hammond. The Signaling Pathway That cGAMP Riboswitches Found: Analysis and Application of Riboswitches to Study cGAMP Signaling in Geobacter sulfurreducens. International Journal of Molecular Sciences 2022, 23 (3) , 1183. https://doi.org/10.3390/ijms23031183
  17. Adrià Fernández-Torras, Arnau Comajuncosa-Creus, Miquel Duran-Frigola, Patrick Aloy. Connecting chemistry and biology through molecular descriptors. Current Opinion in Chemical Biology 2022, 66 , 102090. https://doi.org/10.1016/j.cbpa.2021.09.001
  18. Wei Lin, Wenpu Wang. Smart Sociolinguistic Intelligent Analysis Framework Based on Feature Extraction and Matching of Structural Data. 2022, 1216-1220. https://doi.org/10.1109/ICSSIT53264.2022.9716571
  19. Vikram Khipple Mulligan. Computational Methods for Peptide Macrocycle Drug Design. 2022, 79-161. https://doi.org/10.1007/978-3-031-04544-8_3
  20. Shiliang Li, Chaoqian Cai, Jiayu Gong, Xiaofeng Liu, Honglin Li. A fast protein binding site comparison algorithm for proteome‐wide protein function prediction and drug repurposing. Proteins: Structure, Function, and Bioinformatics 2021, 89 (11) , 1541-1556. https://doi.org/10.1002/prot.26176
  21. Joshua Meyers, Benedek Fabian, Nathan Brown. De novo molecular design and generative models. Drug Discovery Today 2021, 26 (11) , 2707-2715. https://doi.org/10.1016/j.drudis.2021.05.019
  22. Vikram Khipple Mulligan. Current directions in combining simulation-based macromolecular modeling approaches with deep learning. Expert Opinion on Drug Discovery 2021, 16 (9) , 1025-1044. https://doi.org/10.1080/17460441.2021.1918097
  23. Jacob Kerner, Alan Dogan, Horst von Recum. Machine learning and big data provide crucial insight for future biomaterials discovery and research. Acta Biomaterialia 2021, 130 , 54-65. https://doi.org/10.1016/j.actbio.2021.05.053
  24. Wonmoon Song, Junghyeon Ko, Young Hwan Choi, Nathaniel S. Hwang. Recent advancements in enzyme-mediated crosslinkable hydrogels: In vivo -mimicking strategies. APL Bioengineering 2021, 5 (2) , 021502. https://doi.org/10.1063/5.0037793
  25. Wenhao Gao, Sai Pooja Mahajan, Jeremias Sulam, Jeffrey J. Gray. Deep Learning in Protein Structural Modeling and Design. Patterns 2020, 1 (9) , 100142. https://doi.org/10.1016/j.patter.2020.100142
  26. Jacob Kerner, Alan Dogan, Horst von Recum. Machine Learning and Big Data Provide Crucial Insight for Future Biomaterials Discovery and Research. SSRN Electronic Journal 2020, 37 https://doi.org/10.2139/ssrn.3746801
  • Abstract


    Figure 1. Illustration of learning to match pockets with a contrastive loss function (green and red arrows) and a stability loss function (blue arrow). Pockets are represented as multichannel 3D images p_i and encoded using a CNN d_θ into n-dimensional descriptor vectors (filled symbols), which can be compared quickly and easily by computing their pairwise Euclidean distances. The network is trained to make descriptor vectors of matching pocket pairs as similar to each other as possible and to separate the descriptor vectors of nonmatching pocket pairs by at least a margin distance m. In addition, descriptor vectors are encouraged to be robust to small perturbations of the representation, shown as hollow symbols.
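    The two objectives in Figure 1 can be sketched as follows. This is a minimal illustration of the loss functions described in the caption, not the paper's implementation; the function names and the margin default are assumptions.

    ```python
    import numpy as np

    def contrastive_loss(d_a, d_b, is_match, margin=1.0):
        """Pull descriptors of matching pockets together; push
        nonmatching pairs apart by at least `margin` (Figure 1,
        green and red arrows)."""
        dist = float(np.linalg.norm(np.asarray(d_a) - np.asarray(d_b)))
        if is_match:
            return dist ** 2
        return max(0.0, margin - dist) ** 2

    def stability_loss(d_p, d_p_perturbed):
        """Keep the descriptor of a slightly perturbed pocket close
        to the descriptor of the original (Figure 1, blue arrow)."""
        diff = np.asarray(d_p) - np.asarray(d_p_perturbed)
        return float(np.sum(diff ** 2))
    ```

    A matching pair is penalized quadratically in its descriptor distance, while a nonmatching pair contributes nothing once it is separated by more than the margin, so the optimizer stops pushing already-well-separated negatives apart.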


    Figure 2. ROC plot with associated AUC values evaluating the performance of pocket matching algorithms on TOUGH-M1 testing folds. Standard error, denoted as se, is measured over 10 random splits. The dashed line represents random predictions.
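    Because descriptors are compared by plain Euclidean distance, the ROC AUC reported in Figure 2 can be computed directly from the pairwise distances and match labels. The sketch below uses the Mann-Whitney U formulation of AUC (the function name is illustrative, and this is not the paper's evaluation code):

    ```python
    import numpy as np

    def pocket_pair_auc(distances, labels):
        """ROC AUC for pocket matching: a smaller descriptor distance
        should indicate a matching pair (label 1). Computed as the
        fraction of (positive, negative) pairs ranked correctly."""
        scores = -np.asarray(distances, dtype=float)  # smaller distance = higher score
        labels = np.asarray(labels)
        pos = scores[labels == 1]
        neg = scores[labels == 0]
        greater = (pos[:, None] > neg[None, :]).sum()
        ties = (pos[:, None] == neg[None, :]).sum()   # ties count half
        return (greater + 0.5 * ties) / (len(pos) * len(neg))
    ```

    An AUC of 0.5 corresponds to the dashed random-prediction diagonal in the plot; a method that always ranks matching pairs closer than nonmatching ones scores 1.0.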


    Figure 3. ROC plot with associated AUC values evaluating the performance of pocket matching algorithms on the Vertex dataset (6977 protein pairs). The dashed line represents random predictions.


    Figure 4. AUC values for DeeplyTough (green) and 21 other pocket matching methods (gray) on each of 10 ProSPECCTs datasets.


    Figure 5. Distribution of distances between DeeplyTough descriptors of matching and nonmatching binding sites of structures with identical sequences (ProSPECCTs Dataset P1).


    Figure 6. t-SNE visualization of descriptors of binding sites in 12 protein groups, denoted by their UniProt accession numbers, contained in ProSPECCTs Dataset P1.


    Figure 7. t-SNE visualization of descriptors of binding sites in the Vertex dataset, labeled by the top-level SCOPe class of their proteins.
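    Figures 6 and 7 use t-SNE to project the n-dimensional descriptors to 2D. As a dependency-free stand-in for intuition, classical MDS on the same pairwise Euclidean descriptor distances produces a comparable 2D map; the sketch below is an illustration, not the visualization code used in the paper:

    ```python
    import numpy as np

    def embed_descriptors_2d(desc):
        """Project n-dimensional pocket descriptors to 2D via classical
        MDS on their pairwise Euclidean distances (a simple stand-in for
        the t-SNE used in Figures 6 and 7)."""
        desc = np.asarray(desc, dtype=float)
        n = len(desc)
        d2 = np.square(np.linalg.norm(desc[:, None] - desc[None, :], axis=-1))
        j = np.eye(n) - np.ones((n, n)) / n          # double-centering matrix
        b = -0.5 * j @ d2 @ j                        # Gram matrix of centered points
        vals, vecs = np.linalg.eigh(b)
        top = np.argsort(vals)[::-1][:2]             # two largest eigenvalues
        return vecs[:, top] * np.sqrt(np.clip(vals[top], 0, None))
    ```

    Unlike t-SNE, classical MDS preserves large distances faithfully at the expense of local cluster separation, so tightly packed UniProt or SCOPe groups would appear less sharply separated than in the published figures.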

  • Supporting Information


    The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.9b00554.

    • Box plots of TM-scores indicating the structural similarity of pockets in a test set to their nearest neighbor pockets in a training set for each training scenario, precision-recall plot with associated average precision values evaluating the performance of pocket matching algorithms on TOUGH-M1 testing folds and on the Vertex dataset, and average precision values for DeeplyTough on each of 10 ProSPECCTs datasets (PDF)


