Optimization of Cavity-Based Negative Images to Boost Docking Enrichment in Virtual Screening

Molecular docking is a key in silico method used routinely in modern drug discovery projects. Although docking provides high-quality ligand binding predictions, it regularly fails to separate the active compounds from the inactive ones. In negative image-based rescoring (R-NiB), the shape/electrostatic potential (ESP) of docking poses is compared to the negative image of the protein’s ligand binding cavity. While R-NiB often improves the docking yield considerably, the cavity-based models do not reach their full potential without expert editing. Accordingly, a greedy search-driven methodology, brute force negative image-based optimization (BR-NiB), is presented for optimizing the models via iterative editing and benchmarking. Thorough and unbiased training, testing and stringent validation with a multitude of drug targets, and alternative docking software show that BR-NiB ensures excellent docking efficacy. BR-NiB can be considered as a new type of shape-focused pharmacophore modeling, where the optimized models contain only the most vital cavity information needed for effectively filtering docked actives from the inactive or decoy compounds. Finally, the BR-NiB code for performing the automated optimization is provided free-of-charge under MIT license via GitHub (https://github.com/jvlehtonen/brutenib) for boosting the success rates of docking-based virtual screening campaigns.


■ INTRODUCTION
Despite the pivotal role of molecular docking in protein structure-based drug discovery, 1−4 docking-based screening often falls short of expectations. The problem is not necessarily the inadequacy of the docking sampling; instead, the default scoring cannot rank the bioactive binding poses at the top and, thus, effectively identify the active ligands. Several rescoring, consensus scoring, or force field-based postprocessing methodologies aim to fix the problem, but, so far, their successes have been case specific or costly. 5−9 A potential solution to this persistent problem is to score accurately the shape complementarity between the docked ligand and its target protein's binding cavity. 10−12 Negative image-based rescoring (R-NiB; Figure 1) is a cavity-based docking rescoring methodology that takes on this challenge by focusing squarely on the shape/electrostatic potential (ESP) complementarity. 13 R-NiB has been shown to improve the yields with several docking algorithms (e.g., DOCK, 14 GLIDE, 15,16 or PLANTS 17 ) and multiple drug targets, such as neuraminidase (NEU) and retinoid X receptor alpha (RXRα; Figure 1A). 18 R-NiB can be performed using free/academic software from start to finish following a straightforward workflow. First, the binding poses of ligands are sampled flexibly against the target's cavity using a docking algorithm (Figure 1C,D). Second, a negative image-based (NIB) model (Figure 1B) is generated as a mirror image of the cavity using PANTHER, 19 a cavity detection/filling software that was developed for cavity-based rigid docking or NIB screening. 20 Third, the shape/ESP of the NIB model, containing partially negative, positive, and neutral cavity atoms (Figure 1D), is compared against the docking poses using the similarity comparison algorithm SHAEP 21 (Figure 1E).
R-NiB typically elicits a decent or excellent docking performance (Figure 1F), 22 but the model fitness can be improved massively by adjusting the settings or via manual editing. 13,18,22 Usually, the PANTHER-generated NIB models contain at least a few "extra" cavity atoms that do not provide optimal shape/ESP for the filtering of active ligands from the "inactive" decoys and, thus, the models can be improved by removing some of them. Regardless, it can be perplexing why the removal of a few atoms makes or breaks the rescoring. The aim of this study was to make the model optimization both thorough and automatic following the initial docking and NIB model generation (Figure 2; Videos S1 and S2).

Figure 2. (A) During the first generation (Gen #1) of the optimization, each cavity atom is removed from the initial NIB model (Gen #0) one at a time to generate eight new seven-atom variants. If one of the variants improves enrichment (node boxed green) more than the other variants or the input in docking rescoring, it is used as the input for another round of editing/benchmarking. Here, the iteration goes through Gens #2−4 as the enrichment improves, and none of the last variants (Gen #5 = Gen #X) improves on the best model (Gen #4; node with a green background). (B) Atom composition evolution is shown for retinoid X receptor alpha for the input (Gen #0), mid-point (Gen #11), and fully optimized (Gen #23) NIB models. The rescoring with the final model (red line) is superior to docking (blue line), as seen in the ROC curves (x axis logarithmic). (C) The optimization protocol includes testing and potentially even validation prior to the virtual screening usage.
Machine learning techniques are emerging for both compound-centric and structure-based computer-aided drug discovery. 24−27 However, due to the ultra-fast speed of R-NiB (∼2−4 ms/cmpd.), 13 the optimization could be performed via a greedy search-driven method labeled here as brute force negative image-based optimization (BR-NiB; Figure 2). 28 In BR-NiB, the effect of each cavity atom on the NIB model (Figure 1B) is evaluated systematically by iterative removal and rescoring (Figure 2; Videos S1 and S2). In pruning, the atoms are removed one at a time, and the fitness of each new (-1 atom) model variant is tested with SHAEP 21 using the docked training set. In branching, the model producing the best enrichment goes into the next round of iterative removal and rescoring (Figure 2). Although not all permutations are tested in this heuristic scheme, the greedy search is repeated for as many generations as the yield improves (Figure 2). BR-NiB (Figure 2) was tested thoroughly with seven Database of Useful (Docking) Decoys-Enhanced (DUD-E) database 29 sets for finding the best optimization practices (Table S1 in the Supporting Information). Likewise, seven equivalent DUD 30 sets were tested. To increase the target diversity and vary the database design, extra testing was done with 10 Maximum Unbiased Validation (MUV 31 ) sets, 5 more DUD-E sets, 5 DUDE-Z sets, and 5 Extrema sets. 32 In docking rescoring or testing, the optimized NIB models improved the overall enrichment consistently and substantially; importantly, the early enrichment was typically massively boosted. Although BR-NiB is scalable to supercomputers, in most cases, it can be performed with moderately sized training sets using a desktop computer. Moreover, the effectiveness of BR-NiB is not docking algorithm-specific, and the optimized models can even be cross-used effectively.
Finally, the optimized models for two DUD-E targets were validated with newly crafted validation sets, in which minute amounts of active ligands at varying potency levels were mixed into an enormous drug-like small-molecule library. The success in this validation step, having extremely low odds of finding hits by chance, demonstrates that BR-NiB has real potential in helping actual drug discovery projects.
In summary, if a reliable training set and a high-quality target protein 3D structure are available, the BR-NiB-optimized models present a tangible way for improving the effectiveness of docking-based virtual screening campaigns via ultra-fast rescoring.

■ RESULTS
Using Greedy Search to Boost Negative Image-Based Rescoring. Negative image-based rescoring ( Figure 1) produces excellent or at least moderate enrichment for docking (Gen #0 in Tables S2−S3). 13,18 Although the NIB models can be tested prior to their docking screening use, as it stands, the R-NiB protocol does not include optimization steps assuring consistently excellent yields. 18,22 The goal was to develop a fast and easy-to-use method for optimizing the models by utilizing the categorical active/inactive ligand sets in a systematic and yet cost-effective manner. In BR-NiB (Figure 2; Videos S1 and S2), the effect of each cavity atom is estimated by removing it systematically from the model and then benchmarking each new (-1 cavity atom) model variant. By automating the optimization, BR-NiB takes full advantage of the ultra-fast SHAEP rescoring and the existing compound activity data.
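The pruning/branching loop described above can be sketched in a few lines of Python. Here, `fitness` is a hypothetical stand-in for the real SHAEP rescoring plus enrichment benchmarking, and the toy atom names and fitness function are invented purely for illustration; the actual brutenib implementation differs in detail.

```python
# Minimal sketch of the BR-NiB greedy search: each generation removes the
# single cavity atom whose deletion most improves the enrichment metric.

def brute_nib(model, fitness):
    """Greedy atom pruning: iterate while any single-atom removal improves fitness."""
    best_model, best_score = list(model), fitness(model)
    while True:
        # Pruning: benchmark every (-1 atom) variant of the current best model.
        variants = [best_model[:i] + best_model[i + 1:] for i in range(len(best_model))]
        scored = [(fitness(v), v) for v in variants]
        gen_score, gen_model = max(scored, key=lambda t: t[0])
        if gen_score <= best_score:          # Gen #X reached: no variant improves
            return best_model, best_score
        # Branching: the best variant seeds the next generation.
        best_model, best_score = gen_model, gen_score

# Toy example: the fitness rewards keeping atoms from a hypothetical "vital" subset
# and penalizes every extra atom (a crude proxy for enrichment benchmarking).
vital = {"N1", "O2", "C3"}
def toy_fitness(model):
    kept = set(model)
    return len(kept & vital) - 0.1 * len(kept - vital)

best, score = brute_nib(["N1", "O2", "C3", "C4", "C5", "C6"], toy_fitness)
print(sorted(best))   # only the vital atoms survive the optimization
```

As in the paper's scheme, the loop never revisits a removal: each generation commits to the single best edit, so only a small fraction of all atom-subset permutations is ever evaluated.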
Selecting the Target Metric for the Optimization. BR-NiB promotes the removal of those cavity atoms that improve the selected target enrichment metric at any given iteration until the improvement halts (Gen #X in Figure 2; Video S1). Three target metrics, including the area under the curve (AUC), early enrichment factor 1% (EFd 1%), and Boltzmann-enhanced discrimination of the receiver operating characteristic (BEDROC) 33 with alpha value 20 (BR20), were tested using four test sets (Tables S1 and S4), including compounds in either the active or "inactive" decoy category. Each metric worked better on certain targets than on others. The EFd 1% is the most ill-suited metric for guiding BR-NiB because it is sensitive to changes at the very top of the ranking list and to the size of the training set. As target metrics, AUC or BR20 produce a better overall enrichment than EFd 1% because they allow the models to undergo larger transformations during the greedy search without getting hung up on losing a few actives from the top. Because BR20 produced the best overall results on early enrichment (Table S4), it was chosen as the default target metric for the testing. When plotted against the BR-NiB generations, BR20 showed a smooth, gradual improvement; meanwhile, the upward curves of the non-target metrics are more jagged (Figures S2−S3).
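For reference, the two metrics emphasized above can be computed as sketched below. The BEDROC expression follows the standard Truchon-Bayly formulation cited in the paper; the enrichment-factor definition shown is one common variant and may differ in detail from the paper's EFd. The toy ranking is invented for illustration.

```python
import math

def bedroc(ranks, n_total, alpha=20.0):
    """Boltzmann-enhanced discrimination of ROC (Truchon & Bayly).
    `ranks`: 1-based ranks of the active compounds in the score-sorted list."""
    n = len(ranks)
    ra = n / n_total
    s = sum(math.exp(-alpha * r / n_total) for r in ranks)
    # RIE = observed exponential sum over its expectation for a random ranking.
    rie = (s / n) / ((1.0 / n_total) * (1 - math.exp(-alpha))
                     / (math.exp(alpha / n_total) - 1))
    return (rie * ra * math.sinh(alpha / 2)
            / (math.cosh(alpha / 2) - math.cosh(alpha / 2 - alpha * ra))
            + 1 / (1 - math.exp(alpha * (1 - ra))))

def ef(ranks, n_total, fraction=0.01):
    """Enrichment factor: fraction of actives recovered in the top
    `fraction` of the list, relative to random expectation."""
    cutoff = max(1, int(n_total * fraction))
    hits = sum(1 for r in ranks if r <= cutoff)
    return (hits / len(ranks)) / fraction

# Toy ranking: 10 actives among 1000 compounds, 5 of them in the top 10.
ranks = [1, 2, 3, 5, 8, 150, 300, 450, 600, 900]
print(ef(ranks, 1000))           # 5 of 10 actives in the top 1% -> EF1% = 50.0
print(round(bedroc(ranks, 1000), 3))
```

With alpha = 20, roughly 80% of the BEDROC score comes from the top ∼8% of the list, which is why it tolerates mid-list reshuffling far better than a 1% enrichment factor does.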
Model Optimization Ensures Consistent Docking Performance Improvement. The proof-of-concept testing of BR-NiB was performed using seven established DUD-E test sets, including cyclooxygenase-2 (COX2), retinoid X receptor alpha (RXRα), mineralocorticoid receptor (MR), NEU, phosphodiesterase 5 (PDE5), estrogen receptor (ER), and peroxisome proliferator-activated receptor gamma (PPARγ) (Table S1). First, the input NIB models were trained using the complete compound sets with BR20 as the target metric. This naive approach (100:100 training/test set), lacking a random training/test set division, served as a trial run for BR-NiB. Likewise, the naive testing was repeated for the equivalent but considerably smaller DUD database sets (Table S5). 30 Second, to avoid bias, the active and "inactive" decoy ligands were divided randomly into training and test sets (Table S1, Figure 2). Two ratios were applied: (1) the 70:30 ratio of training/test sets represents a situation in which there exists a wealth of data for the model training, and (2) the 10:90 ratio represents a situation in which the available set is considerably smaller (e.g., nine actives with MR).
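A category-preserving random division like the 70:30 or 10:90 splits above can be sketched as follows. This is an illustrative sketch only; the actual BR-NiB tooling may perform the split differently.

```python
import random

def split_sets(actives, decoys, train_frac=0.7, seed=42):
    """Random training/test division applied separately to the actives and
    decoys, so both categories keep the chosen ratio (e.g., 70:30)."""
    rng = random.Random(seed)   # fixed seed keeps the division reproducible
    out = {}
    for label, pool in (("active", list(actives)), ("decoy", list(decoys))):
        rng.shuffle(pool)
        k = int(round(train_frac * len(pool)))
        out[label] = (pool[:k], pool[k:])   # (training, test)
    return out

# Toy set: 100 actives and 1000 decoys divided 70:30.
sets = split_sets(range(100), range(1000), train_frac=0.7)
print(len(sets["active"][0]), len(sets["active"][1]))   # 70 30
```

Splitting each category independently avoids the degenerate case where a small active set ends up entirely in one half, which matters most for the 10:90 scenario with only a handful of actives.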
BR-NiB treatment improved on the docking or R-NiB rescoring yields consistently in both training (Table S6) and testing (Table 1). BR-NiB improved on the docking yield also for the 10:90 sets regarding both training (Table S6) and testing (Table 1). The EFd 1% improvement varied between 1.4- and 16.5-fold and BR20 between 1.1- and 3.5-fold (Table 1; Figures S8−S9). In comparison to straight-up R-NiB (Gen #0 in Tables S2−S3), BR-NiB provided 1.2- to 7.6-fold (10:90 MR and 100:100 NEU sets with shape only, respectively) and 1.1- to 2.0-fold (e.g., 70:30 RXRα and 100:100 MR sets with the 50/50 shape/ESP weight) improvement in EFd 1%. With PPARγ, the docking results were not improved using the 50/50 weight of shape/ESP scoring; however, the EFd 1% value was improved 1.5-fold with both the 100:100 and 70:30 ratios using shape-only similarity scoring. In addition, with PDE5, the results for the 10:90 ratio were improved only when excluding the ESP similarity from the optimization. Interestingly, the docking did even worse with an alternative PPARγ structure with the 70:30 sets, whereas the opposite was true for BR-NiB regardless of the applied shape/ESP weight (Table 2). Limited compound sets can be a problem for validating the BR-NiB performance. For example, the yield improvements with the DUD sets were excellent, but due to the relatively low compound numbers, the training/test set divisions were omitted (Table S5). Moreover, when the training set is tiny, the optimized model is likely to become too specialized or even overfitted to match closely only those ligands that are present. Nevertheless, the BR-NiB execution is clearly worthwhile even with limited training sets. In fact, a large training set does not automatically guarantee the best results, as good or even better enrichment can sometimes be achieved with the smaller sets (70:100 and 10:100 in Table S8) in comparison to the complete training sets (100:100 in Table 1).
Merging Models for the Optimization. With targets such as PDE5 that have spacious cavities, it is difficult to generate, in one go, a single NIB model that recognizes all active ligand sub-groups. When the PDE5 model was limited to the sildenafil-bound (Model I in Figure 3) or tadalafil-bound (Model II in Figure 3) cavity volume, the docking performance was typically not improved (Table S3). The BR-NiB processing that relied solely on the shape similarity provided only a limited boost to the EFd values of these individual PDE5 models. While the fusion of the two models did not improve the R-NiB results, the optimization of this combined model worked on every level (Tables 1 and S3; Figure 3). For example, with the 70:30 ratio, the EFd 1% improved 1.4- and 1.7-fold over docking with the 50/50 shape/ESP weight and the shape-only similarity score, respectively. Similarly, BR-NiB improved the AUC values of PDE5 docking from 0.77 ± 0.02 to 0.82 ± 0.02 (50/50 shape/ESP) and 0.87 ± 0.02 (shape only). In other words, the docking performance can be improved even in difficult cases by fusing multiple models together for the optimization.
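Conceptually, the model fusion is a union of the cavity atoms of two NIB models. The near-duplicate distance filter below is an assumed rule added for illustration, not necessarily how PANTHER-derived models are actually combined; the coordinates are invented toy data.

```python
import math

def merge_models(model_a, model_b, min_dist=0.5):
    """Fuse two NIB models: keep all atoms of model_a, then append atoms of
    model_b that are not within `min_dist` angstroms of an accepted atom.
    Each atom is a tuple: (element, x, y, z)."""
    merged = list(model_a)
    for atom in model_b:
        if all(math.dist(atom[1:], kept[1:]) >= min_dist for kept in merged):
            merged.append(atom)
    return merged

# Toy models: the first atom of m2 overlaps an atom of m1 and is dropped.
m1 = [("C", 0.0, 0.0, 0.0), ("N", 1.5, 0.0, 0.0)]
m2 = [("C", 0.1, 0.0, 0.0), ("O", 3.0, 0.0, 0.0)]
print(len(merge_models(m1, m2)))   # 3 atoms survive the fusion
```

The merged model is deliberately oversized: any redundant or conflicting atoms introduced by the fusion are exactly what the subsequent BR-NiB pruning is designed to remove.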
Changes to the Model Composition and Fitness during Optimization. BR-NiB improves the fitness gradually atom-by-atom until the best model is found (Figures S2 and S3). With RXRα, the input model comprised 79 cavity atoms, and the best enrichment was acquired at Gen #23 with a model of 45 atoms (Figure 2). The model at Gen #0 already produced a substantial improvement over docking (e.g., AUC: 0.77 ± 0.02 vs 0.95 ± 0.01; Tables S2 vs 1); however, the optimization pushed the yield gains much higher (Table 1; Figure 2). Whereas R-NiB could improve the EFd 1% of docking from 11.5 to 56.5, the BR-NiB-optimized model generated an EFd 1% value of 77.9. The impressive 6.7-fold EFd 1% improvement by BR-NiB over docking is clearly visible when comparing the ROC curve of docking to that of Gen #23 of BR-NiB (Figure 2). Notably, a close-to-optimal NIB model is generated already at the midpoint of the optimization (Gen #11 in Figure 2).
A closer look into the BR-NiB-optimized models, which perform far better in rescoring than the extracted co-crystal ligands (Figure S11), shows that they contain information related to a wide spectrum of molecules. The optimization lowers the nonpolar atom content by 2−5 percentage points even when relying solely on the shape similarity; in other words, the mix of dissimilar atomic radii of the polar (N/O) and nonpolar (C) atoms assists in depicting the volume optimally (Table S10). While there are target-specific differences, the polar atoms of the optimized models continue to overlap with those chemical groups of ligands that form key bonding interactions (Figure S10B,C). 19 The surface atoms that do not overlap with the docked active ligands are typically removed first, thus making the models slimmer (Gen #0 vs Gen #11 in Figure 2C). Likewise, atoms with minimal adverse effects are removed last, which decreases the perceived "ligand-likeness" of the optimized models (Gen #11 vs Gen #23 in Figure 2C; Table S11). Notably, focusing the removals on those atoms that improve the model's average similarity match with only the actives does not generate higher yields than the BR20-guided BR-NiB (Table S12).
Effect of Docking Sampling on the Optimization. While there are software- and target-specific variations, BR-NiB generated invariably higher yields than the default docking scoring. BR-NiB was tested with three alternative docking software on four targets. Every enrichment metric was improved with the training and test sets for COX2, RXRα, NEU, and MR using the docking software DOCK and GOLD 37 (Tables S13−S15). With GLIDE SP, 15,16 the AUC values of COX2 and RXRα could not be boosted. Notably, BR-NiB produced an even higher enrichment for NEU with GLIDE and for NEU and COX2 with GOLD in comparison to PLANTS. Overall, the ability of BR-NiB to work on an equal footing with multiple docking software was expected based on the prior R-NiB results. 18 The BR-NiB-optimized models do not merely mirror the sampling/scoring choices of the docking algorithms; instead, they focus on the core recognition aspects of the active ligands. Different docking software sample and output different binding predictions and, moreover, they can skip compounds (e.g., GLIDE with MR; Table S13). This variation, however small it might be, affects the optimized model composition (Figure 4). Therefore, the cross-use of NIB models, optimized using docking poses from different software, was probed. The limited testing indicates that the optimized models could indeed be cross-used (Table S15). When the PLANTS poses were rescored using the models optimized with the poses of GLIDE, DOCK, or GOLD, all enrichment metrics were improved. This universal applicability of the optimized models is remarkable, given that the AUC values are at a statistical tie despite the cross-use (Tables 1 vs S15).
Computing Demands of the Optimization. Individual R-NiB runs are ultra-fast (∼2 to 4 ms/cmpd.), 13 but the greedy search of BR-NiB can still take a long time and plenty of parallel computing (Table S16). The exact demands depend on both the training set size and the input NIB model size. In theory, the optimization can be almost instantaneous if enough parallel computing resources are committed, but, in practice, the number of CPUs (or cores) and the amount of RAM are the bottlenecks.
BR-NiB was tested using 15 CPUs, with the training sets divided into 10,000-compound subsets for the parallel execution. In total, the duration of the optimization ranged from 2 h (MR) through 4 h 43 min (NEU) to 5 h 47 min (COX2) (70:30 ratio in Table S14). The optimization becomes gradually faster when moving closer to the conclusion (e.g., Gen #0: 8 min vs Gen #X: 6 min with MR). PPARγ is a prime example of a demanding target regarding both the cavity size (N = 144; Figure S1) and the training/test set size (N = 25,750; Table S1). The scalability of BR-NiB was probed by repeating the initial steps of the PPARγ model optimization using a supercomputer: doubling the CPU count from 20 to 40 increased the pace from ∼1.5 to ∼1 h/gen.
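The chunked parallel benchmarking can be sketched as below. Threads stand in here for the separate CPU processes used in the paper, and the per-compound scoring function is a trivial placeholder for a SHAEP similarity call; both are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def rescore_chunk(chunk):
    """Placeholder per-compound scoring; a real run would call SHAEP here."""
    return [c * 0.5 for c in chunk]

def parallel_rescore(compounds, chunk_size=4, workers=2):
    """Split the training set into fixed-size chunks (cf. the 10,000-compound
    subsets) and rescore the chunks on parallel workers, preserving order."""
    chunks = [compounds[i:i + chunk_size]
              for i in range(0, len(compounds), chunk_size)]
    with ThreadPoolExecutor(workers) as ex:
        parts = ex.map(rescore_chunk, chunks)   # map keeps chunk order
    return [score for part in parts for score in part]

scores = parallel_rescore(list(range(10)))
print(scores[:3])   # first three placeholder scores
```

Because each generation benchmarks every (-1 atom) variant independently, the workload is embarrassingly parallel, which is why doubling the CPU count yields a near-proportional speedup until per-chunk overheads dominate.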
If necessary, the BR-NiB iteration can be stopped at an earlier point because, typically, a decent enrichment is already acquired at the midpoint (Figures S2−S3). Alternatively, the branching could be circumvented by pruning, in one go, all those atoms whose individual deletion improves the enrichment (Figure S10) and performing the normal BR-NiB after these removals. Although this cut-and-go approach is undeniably fast and typically improves the enrichment over docking or standard R-NiB (Table S17), it is not on par with the full-scale BR-NiB (Table 1). The cavity atoms and, for better or worse, their removals are interconnected; hence, it pays off to balance their effects and improve the model fitness via successive iterations.
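The cut-and-go shortcut amounts to a single pruning pass. In this sketch, `fitness` is again a hypothetical stand-in for the rescoring benchmark, and the toy atoms are invented for illustration.

```python
# Cut-and-go: delete, in one pass, every atom whose individual removal
# improves the metric; the normal generation-wise BR-NiB would resume after.

def cut_and_go(model, fitness):
    base = fitness(model)
    # Keep an atom only if deleting it alone did NOT improve the fitness.
    return [atom for i, atom in enumerate(model)
            if fitness(model[:i] + model[i + 1:]) <= base]

# Toy fitness: vital atoms score, every extra atom costs a small penalty.
vital = {"N1", "O2"}
def toy_fitness(model):
    kept = set(model)
    return len(kept & vital) - 0.1 * len(kept - vital)

print(cut_and_go(["N1", "O2", "C3", "C4"], toy_fitness))   # extras pruned
```

The weakness noted in the text is visible in the structure of the code: every deletion is judged against the same baseline, so interactions between removals (two atoms whose deletions only help jointly, or hurt jointly) are ignored, unlike in the generation-by-generation search.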
When examining the Gen #1 model variants of RXRα (N = 57) in detail, the nine cavity atoms whose individual removals improved the BR20 value the most were also removed from the fully optimized model (Figure S10). In contrast, the nine cavity atoms whose removal decreased the BR20 value the most in Gen #1 were similarly kept throughout the optimization. Therefore, in this case, the greedy search could have been sped up by ∼30% by removing or locking in place these atoms before embarking on a full-scale optimization. This practice would require the careful use of balanced thresholding during the optimization, which is not considered further here.

Figure 4. The optimized models are alike; however, the composition differences due to the alternative docking sampling affect the ROC curves.
Testing on Additional Targets and Different Databases Suggests Wider Applicability. Because the optimization with the random 70:30 training/test set division generated high enrichment (Table 1), five extra DUD-E sets were selected for performing BR-NiB with only this computationally light set up (Tables 2, S1 and S7). The targets, which were selected with additional diversity in mind, included serine/threonine-protein kinase AKT (AKT1), dopamine receptor D3 (DRD3), acetylcholinesterase (ACES), catechol-O-methyltransferase (COMT), and focal adhesion kinase 1 (FAK1). The AUC values of docking were improved substantially for all targets except COMT in the testing (Table 2). BR-NiB provided the biggest boost for ACES, as its AUC value improved from 0.39 to 0.69. Likewise, the AUC value of FAK1 improved from 0.80 to 0.95, indicating an exceptional ability of the optimized model to increase the docking yield. Importantly, BR-NiB was able to improve the EFd 1% values of docking 3.6-, 36.7-, 2.0-, and 3.7-fold for AKT1, DRD3, COMT, and FAK1, respectively. The results were impressive also for ACES, as its EFd 1% value jumped from 0 to 18 in the testing. To sum up, the DUD-E benchmarking (Tables 1 and 2) suggests that BR-NiB works exceptionally well with diverse drug targets.
Testing was also carried out with five DUDE-Z and Extrema benchmarking sets recently introduced by Stein et al. 32 The sets included NEU, adenosine A2A receptor (AA2AR), ACES, heat shock protein 90 (HSP90), and androgen receptor (AR). The random 70:30 training/test set division results indicate that BR-NiB clearly provides an excellent boost for the docking performance, especially regarding the early enrichment, with both of the benchmarking databases (Table 3). The success with the Extrema set, which is designed to expose the overoptimization of the electrostatics, is explained by the fact that both the R-NiB 13 and BR-NiB approaches focus heavily on shape similarity; thus, the decoys with extreme charges should not challenge either of them. With DUDE-Z, the most difficult case was the HSP90 set, which has been described as one of the most demanding targets for structure-based drug discovery methods. 29 Here, the AUC was not significantly improved by BR-NiB over docking (from 0.52 to 0.56); however, the early enrichment was improved (e.g., EFd 1% from 0.0 to 6.5).
The BR-NiB performance was also tested with 10 sets from the MUV 31 database, which has been specifically designed to avoid analogue bias (Table S18). Like the DUD database, MUV also contains too few active ligands for building robust training/test set divisions. Both the default docking and BR-NiB generated low AUC values, which reflects the fact that the sets contain only a few active compounds representing different chemotypes. As the MUV sets are mainly intended for ligand-based approaches, they are not optimal for docking benchmarking; that is, no reference protein structures are given, and the included compounds may have alternative binding sites. Regardless, BR-NiB was able to generate decent or even excellent early enrichment values compared to docking (Table S18). With Rho kinase 2, the AUC and EFd 1% values were improved from 0.44 ± 0.05 and 3.3, respectively, to 0.71 ± 0.05 and 33.3. Likewise, with the human immunodeficiency virus (HIV) set, the corresponding values improved from 0.45 ± 0.05 and 3.3, respectively, to 0.69 ± 0.05 and 23.3.

Table 3 footnotes: (a) Ref 34 was used for the AUC error estimation. The PDB codes for the target structures used are the following: 1B9V (NEU), 3EML (AA2AR), 6LTK (HSP90), 2AM9 (AR), and 1E66 (ACES). (b) The DUDE-Z is the optimized version of the DUD-E set. (c) The active compounds from the DUD-E set were used (training/test) in addition to the Extrema decoys.
Validation: Very Early Enrichment Improvement Using the Preoptimized Models. BR-NiB does not aim to recognize correct binding poses but only to separate the actives from the inactives. 13,18 Regardless, the limited root-mean-square deviation (RMSD) analysis with co-crystals (1−3 Å ranges in Table S18) indicates that neither BR-NiB nor R-NiB steers the selection toward "wrong" poses in comparison to the default docking scoring. A better measure of success is the hit rate in demanding benchmarking tests involving multiple targets. Thus, as a further validation, the BR-NiB-optimized models were tested using new benchmarking sets for MR and NEU, for which enough additional data had accumulated since the generation of their DUD-E sets. The aim was to emulate the extreme difficulty of a virtual screening campaign, in which only minute amounts of active ligands are buried in a vast and diverse library composed of drug-like compounds. The problems in building unbiased test sets are well-documented; 39−41 thus, this validation step is intended to act only as an extra check of the new method's effectiveness. The included decoy compounds are neither property matched with the known actives nor experimentally verified (not typical with other benchmarking sets either), but the massive difference in the compound numbers should lean heavily in favor of finding decoys rather than active ligands by chance. Here, the rescoring results are directly comparable only against the original docking algorithm.
Because high potency levels are typically reached only via the optimization of low potency hits, three sets with varying potency levels were generated for both targets. The validation sets included 0.014% of "high potency" (IC50 < 1 μM), "mid-to-low potency" (IC50 = 1−50 μM), or "high-to-low potency" (IC50 < 50 μM) active ligands buried in a large library. Overall, the validation results (Table 4) were similar to the test set results (Table 1), suggesting that BR-NiB offers far better enrichment than the docking scoring. A direct comparison between the testing and validation results is skewed because the latter sets undoubtedly contain considerably more inactive compounds in the decoy set (MR: 1.8% vs 0.014%; NEU: 1.6% vs 0.014%). Regardless, the AUC values showed that the BR-NiB-optimized models worked at a comparable level with both sets. Even when focusing on the EFd 1% values, BR-NiB fared only slightly better in the testing than during the validation (Tables 1 vs 4).
Importantly, by testing just 0.1% (∼140 cmpd.) or 0.5% (∼700 cmpd.) of the top-ranked compounds, the screening hit rates would have been at least 5−15% for MR and 10−55% for NEU (Table 4). The enrichment improvement was highest for the "high potency" set, but the upward trend was also clear for the less potent sets that better exemplify non-optimized screening hits. When this is translated into hit numbers, the excellence of BR-NiB becomes apparent. With MR, the BR-NiB-guided screening would have found 1−3 hits, whereas the default docking scoring would have found hardly any. With NEU, the BR-NiB-guided selection would have performed even better by recognizing 2−7 or 6−11 hits as opposed to the 1−3 found by the default scoring.
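The hit-count arithmetic behind such figures is simple, as sketched below; the library size matches the ∼140,000-compound scale implied above, but the active ranks are invented for illustration and are not the paper's data.

```python
def hits_in_top(active_ranks, n_total, fraction):
    """Count how many actives land within the top `fraction` of a ranked
    library of `n_total` compounds (ranks are 1-based)."""
    cutoff = int(n_total * fraction)
    return sum(1 for r in active_ranks if r <= cutoff)

n_total = 140_000                        # large drug-like library
active_ranks = [50, 300, 900, 5_000]     # hypothetical ranks of 4 buried actives
top = hits_in_top(active_ranks, n_total, 0.001)   # test top 0.1% (~140 cmpd.)
print(top, f"hit rate {top / (n_total * 0.001):.1%}")
```

With 0.014% actives in the library, a random pick of 140 compounds would contain on average only ∼0.02 actives, which is why even one or two confirmed hits in the top 0.1% signals strong enrichment.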

■ DISCUSSION
Target-tailored rescoring methods are frequently needed for improving the molecular docking yields in virtual screening. 42 Herein, we report the BR-NiB (Figure 2; Videos S1 and S2) method that optimizes the composition of protein cavity-based NIB models (Figure S1; Table S10) for an improved docking rescoring performance. The NIB models, which are used in a shape/ESP similarity comparison with the flexible docking poses, are subjected to iterative atom removals and benchmarking (Figure 2). In the negative image-based rescoring (Figure 1), the BR-NiB-optimized models boost docking massively, that is, the active ligands are effectively separated from the decoys. The effectiveness of BR-NiB was verified by rigorous testing with multiple targets/sets (Tables 1−3, S5 and S18) and with demanding validation sets for two targets (MR and NEU in Table 4).
BR-NiB (Figure 2; Video S2) is a hybrid method that builds on the strengths of both ligand- and structure-based drug discovery methods. Therefore, the biggest hurdle of BR-NiB is the need for both a protein 3D structure and a compound training set with validated activity information (Table S1). Regardless, the training set requirements of BR-NiB are moderate and, thus, the method should be well-suited for early-stage drug discovery projects, where only a limited amount of compound data is available. While the optimization can take a lot of time depending on the target, the process can easily be sped up by extra parallelization or, in theory, via the careful use of thresholding (Table S16). Finally, despite the automation, the input NIB models must be generated beforehand and, likewise, the training set used in the optimization must be docked separately. However, because the atomic composition is optimized, the input model obviously does not have to be perfect to begin with and, moreover, the method is neither docking software-specific (Tables S13−S15; Figure 4) nor big data-driven (Table S1). BR-NiB resembles regular pharmacophore (PHA) modeling, which can work excellently in virtual screening by focusing on a few key ligand descriptors with or without the added protein "exclusion zone" information. 43 Similar to BR-NiB, it is also possible to use docking poses as the pre-aligned conformation database to perform the PHA screen. Because both the active ligands and "inactive" decoys influence the final NIB model composition, BR-NiB could conceptually be perceived even as shape/ESP-based or shape-focused PHA modeling. Additionally, BR-NiB bears resemblance to field-based quantitative structure-activity relationship (F-QSAR) modeling and even machine learning techniques, but there are major differences.
In practice, the most predictive F-QSAR models require focusing on highly specific compound series, prefiltered before the model building, and, furthermore, the model's connection to the protein structure can be elusive. Machine learning techniques 24−27 can have limited usability due to their massive data requirements and, frankly, their "black box" logic can also be a cause of unease. A direct comparison to these established methods is moot because BR-NiB is exclusively a rescoring method and, thus, tied to the limitations of docking sampling. Regardless, in BR-NiB, both the cavity model and the underlying docking poses are always tightly connected to the protein (Figure 1), the data requirements are moderate (Table S1), and the optimization steps can be easily backtracked (Figure 2).
There is plenty of room for improvement in the implementation of BR-NiB. Establishing the best practices for the optimization requires yet more testing and experimental validation, an iterative process that is outside the scope of this singular study. However, the fact that BR-NiB ignores the possible target protein flexibility or ligand binding induced-fit effects is likely to be a very persistent problem. This is seen, for example, with PPARγ, where the BR-NiB results improved noticeably when an alternative protein structure was used for the docking and input NIB model generation (Table 2). A potential workaround is to generate multiple NIB models for different protein structures or even to combine and optimize multiple models simultaneously (e.g., PDE5; Figure 3; Tables 1 and S3); however, this sort of approach is computationally costly and requires extra effort from the user. To be fair, this protein conformation problem is inherent to using molecular docking as a high-throughput virtual screening method in the first place. 44,45

Because the training and testing both show a similar enrichment boost to docking (Tables 1 vs S6 or Tables 2 vs S7), there is no serious concern that the BR-NiB method is overfitting. Regardless, it is likely that the NIB models become more focused on certain active ligand chemotypes present in the training set than on others during the optimization. While this specialization should, at least in theory, lower the diversity or perceived novelty of the eventual virtual screening hits, BR-NiB did slightly better than the default docking in the similarity analysis with the top-ranked validation set compounds (data not shown). One could argue that this phenomenon is commonplace irrespective of the applied method when working with limited training sets. However, even without considering the induced-fit effects, it is not possible that a singular NIB model could be a match for all active ligands present in a truly diverse training set.
The optimization is a give-and-take process that makes greedy choices based on improvement in the target enrichment metric (Figure 2C). Accordingly, a potential workaround could be to train multiple NIB models with purposefully different sets and to apply the alternatively trained models in the virtual screening. Regardless, the BR-NiB testing should continue with alternative benchmarking sets and different databases in the future. Testing with the MUV database 31 (Table S18), DUDE-Z and Extrema (Table 3), and the newly made validation sets (Table 4) shows that BR-NiB also works with benchmarking databases other than DUD or DUD-E (Tables S5, 1 and 2). Ultimately, the method's practical usability must be confirmed in actual screening use that includes in vitro validation.
Limited access to effective drugs and healthcare is a major obstacle to equal prospects of wellbeing in the world, and unsustainably high drug prices are a big contributor to this severe problem even in the developed Western countries. 46 Inarguably, it is a multilayered problem involving prohibitive costs related to regulation and safety, but risk-aversion and profit-driven tactics are also to blame. 47 Computer-aided drug discovery, especially virtual screening, provides the means to lower development costs and, importantly, also opens the door for academic and open-access efforts by nonprofit institutions. 48 Thus, it is noteworthy that BR-NiB can be performed with free software, that the BR-NiB code itself is freely downloadable under the MIT license via GitHub (https://github.com/jvlehtonen/brutenib), and that, in most cases, the optimization can be done using a regular desktop or even a laptop computer within a reasonable amount of time (Table S16).
In summary, a new rescoring methodology, BR-NiB, is presented for improving docking enrichment to a level that facilitates effective drug discovery in virtual screening campaigns.

■ EXPERIMENTAL SECTION
Protein and Ligand Structure Preparation for Benchmarking. The protein X-ray crystal structures were acquired from the Protein Data Bank (PDB; Table S1) 49 for the molecular docking and cavity detection/filling. The 3D structure editing and PDB-to-MOL2 conversion were done using BODIL. 50 The structures were protonated to match pH 7.4 using REDUCE 3.24. 51 No crystal waters were considered during the docking or cavity detection.
Five DUDE-Z and five Extrema benchmarking sets were also acquired for testing 32 (Table 3). DUDE-Z is described as an optimized version of the DUD-E database; in the DUDE-Z design, special attention has been given to the charge matching between the active and decoy compounds. The Extrema design consists of decoys with extreme charges to prevent overoptimization based on the electrostatic interactions. With Extrema, the active compounds of the corresponding DUD-E sets were used as the active molecules. Altogether, 10 test sets were acquired from the MUV database 31 for the BR-NiB testing (Table S18). Because the database is designed mainly for ligand-based virtual screening, MUV aims to minimize the analogue bias by paying special attention to spatial randomness, and it contains only 1.16 compounds per scaffold class. Each of the 17 sets contains only 30 active ligands and 15,000 decoy molecules. If a suitable target protein 3D structure with a valid co-crystal ligand was available in the PDB, the MUV set in question was also tested. Although no inhibitor-bound X-ray crystal structure was available for the ephrin type-A receptor 4 set, the same binding cavity was used for docking as in a prior study by Gu et al. 52 The dopamine receptor D1 set was excluded because it included allosteric modulators that could bind to multiple different allosteric sites. 53 Prior to the flexible docking, the ligands were translated from the simplified molecular-input line-entry system (SMILES) strings into the 3D SYBYL MOL2 format, protonated at pH 7.4, and alternative tautomeric states were generated; finally, OPLS3 partial charges were incorporated using LIGPREP and MOL2CONVERT in MAESTRO 2017-1 (Schrodinger, LLC, New York, NY, USA, 2017).
The active/decoy compounds of the DUD-E sets were also divided randomly into training (70 or 10%) and test (30 or 90%) sets (Table S1) for unbiased testing (Figure 2). The random shuffling was done using the standard C++ library Mersenne Twister 19937 pseudo-random number generator. 54

Flexible Molecular Docking. PLANTS with its default scoring function ChemPLP was used as the primary docking algorithm due to its proven effectiveness with R-NiB; 18 however, limited testing was additionally done using GLIDE 2018-1 15,16 (MAESTRO 2018-1, Schrodinger, LLC, New York, NY, USA, 2018) at standard precision (SP), DOCK 6.8, 14 and GOLD 5.6.3. 37 If available, the co-crystallized ligands (Figure 1B; Table S1) were used as the docking centroids with a search radius of 10 Å. Some MUV targets lacked the co-crystallized ligand data, so the centroid was set inside the putative ligand-binding cavity. The relevant docking settings are detailed in prior studies. 13,18

Negative Image-Based Model Generation. The cavity-based NIB models (Figure S1) were generated using default PANTHER 0.18.15, 19 but parameters such as the packing method, ligand distance limit, and/or box radius were adjusted, if necessary. If possible, the cavity centroids were taken from the co-crystallized ligands (Table S1; Figure S1). With PDE5, the NIB models could also be combined to generate a hybrid model (Figure 3). It was verified visually that the NIB models roughly covered the cavity volume involved in the known ligand binding. The input NIB models are given in the Supporting Information, and the PANTHER settings for DUD and DUD-E are explained in prior studies. 13,18,22

Negative Image-Based Rescoring. In negative image-based rescoring (R-NiB; Figure 1), 13 the shape/electrostatics of the flexible docking poses is compared against the cavity-based NIB model using the similarity comparison algorithm SHAEP. 21 Use of the --noOptimization option ensures that the poses are not realigned during the rescoring. The similarity comparison was performed with either the default 50/50 ratio of the shape/ESP scores or the shape score only. The performance was not tested with the ESP score only because the shape typically plays the dominant role in the SHAEP scoring. The R-NiB protocol is explained thoroughly at the practical level in a prior study. 22 Likewise, ligand-based rescoring was performed by comparing the co-crystallized ligands (Figure S1; Table S1) against the docking poses without realignment.
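As a minimal sketch of the rescoring step, the combined score can be thought of as a weighted average of the shape and ESP similarities with the default 50/50 ratio described above. The function name and the pose scores below are illustrative, not part of the SHAEP software itself.

```python
def combined_score(shape_sim, esp_sim, shape_weight=0.5):
    """Weighted average of shape and ESP similarities (default 50/50 ratio)."""
    return shape_weight * shape_sim + (1.0 - shape_weight) * esp_sim

# Rank three hypothetical docking poses against a NIB model:
# (shape similarity, ESP similarity) pairs are made up for illustration.
poses = {"pose1": (0.82, 0.40), "pose2": (0.75, 0.70), "pose3": (0.60, 0.55)}
ranked = sorted(poses, key=lambda p: combined_score(*poses[p]), reverse=True)
print(ranked)  # ['pose2', 'pose1', 'pose3']
```

Setting shape_weight to 1.0 reproduces the shape-only scoring variant mentioned above.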
Brute Force Negative Image-Based Optimization. In BR-NiB (Figure 2; Videos S1 and S2), the effect of each cavity atom in the NIB model is evaluated iteratively and automatically. A true exhaustive search would be guaranteed to generate the global optimum model, whereas a greedy search is more likely to stop at a local optimum; however, the calculations would be far too time-consuming with such a brute force search. If the input model contains 50 cavity atoms (a_n) and its optimal size is 40 atoms (a_1), a true exhaustive search would run all possible 1.3e10 test cases before concluding (eq 1), while the greedy search would need only 495 (eq 2).
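The quoted counts can be checked directly, assuming eq 1 counts every removal subset of size 1 through 10 tested along the way and eq 2 counts one evaluation per remaining atom per generation, including a final generation that confirms no further improvement:

```python
from math import comb

n, n_opt = 50, 40                  # input model size and optimal model size
removals = n - n_opt               # 10 atoms to remove

# Exhaustive search: every removal subset of size 1..10 is a test case (eq 1).
exhaustive = sum(comb(n, k) for k in range(1, removals + 1))

# Greedy search: one test per remaining atom per generation, i.e.
# 50 + 49 + ... + 41 tests, plus 40 tests in the confirming generation (eq 2).
greedy = sum(range(n_opt, n + 1))

print(exhaustive, greedy)  # 13432735555 495
```

The exhaustive count, about 1.3e10, is dominated by the last term, comb(50, 10) ≈ 1.0e10.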
BR-NiB is performed using a template of the NIB model from which cavity atoms are removed one at a time before the individual R-NiB runs. The iterative editing, rescoring, and enrichment evaluation are performed using a specific target enrichment metric, such as the EFd, AUC, or BEDROC 33 with an alpha value of 20 (BR20), using ROCKER 0.1.4. 55 Note that the optimal BR value depends on the size of the training set, but it was not altered here to keep the results comparable. The model producing the best performance in each generation is put forth into the next cycle of editing and rescoring. The greedy search is deterministic and, thus, all things being equal, it should always lead to the same model composition. For the same reason, it is also unlikely to produce the global optimum that a true exhaustive search would; however, BR-NiB boosts docking with a sensible amount of computing. The iteration ends when the atom removals stop improving the target metric (Figure 2; Video S1). The final BR-NiB-optimized NIB models are given in the Supporting Information. The BR-NiB code, needed for performing the optimization automatically, is provided online for free academic or commercial use under the MIT license via GitHub (https://github.com/jvlehtonen/brutenib).
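The greedy loop described above can be sketched as follows. This is not the actual BR-NiB implementation: the real code invokes SHAEP and ROCKER for each candidate model, whereas here a toy scoring function stands in for the target enrichment metric, and the atom labels are illustrative.

```python
def greedy_optimize(atoms, score):
    """Remove one cavity atom per generation while the metric improves."""
    best_model, best_score = list(atoms), score(atoms)
    while True:
        # Each candidate of this generation drops exactly one atom.
        candidates = [best_model[:i] + best_model[i + 1:]
                      for i in range(len(best_model))]
        gen_best = max(candidates, key=score)
        if score(gen_best) <= best_score:      # no improvement: stop iterating
            return best_model, best_score
        best_model, best_score = gen_best, score(gen_best)

# Toy metric: atoms "a" and "b" hurt enrichment, every other atom helps.
toy_score = lambda m: len(m) - 5 * sum(x in ("a", "b") for x in m)
model, final = greedy_optimize(list("abcdef"), toy_score)
print(model, final)  # ['c', 'd', 'e', 'f'] 4
```

Because each generation keeps only the single best candidate, the search is deterministic, as noted above, but it can stop at a local optimum.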
Validation: Benchmarking against a Massive Drug-like Compound Library. New validation sets were generated for MR and NEU with varying activity ranges. The "high potency set" included active ligands with an IC50 value of <1 μM, whereas the "mid-to-low potency set" and the "high-to-low potency set" included ligands with IC50 ranges of 1−50 μM and <50 μM, respectively. The active ligands were taken randomly from the ChEMBL database. Duplicates with the original DUD-E sets were avoided and, furthermore, it was checked that the active ligands could fit into the binding pockets. Molecules with a MW of over 550 g/mol were excluded from the data sets to ensure that the molecules were not too complicated for the docking software. In practice, 20 randomly selected active ligands of each potency category (<1 μM, <50 μM and 1−50 μM) were mixed with the drug-like small-molecule compounds of the SPECS database (acquired 08/2020; N = 140,626; MW of over 550 g/mol and rotatable bond number of >8 excluded to include only the most ligand-like molecules), making the verified active ligand concentration 0.014% for each validation set (at least 1.5% for the DUD-E sets; Table S1). Finally, all compounds were prepared with LIGPREP, docked with PLANTS, and rescored with SHAEP using the BR-NiB-optimized NIB models, similarly as described for the DUD-E sets.

Data Analysis and Figure Preparation. Figures were prepared using BODIL 50 and VMD 1.9.2. 56 ROCKER 55 was used to plot the ROC curves, and the AUC, BR20, and EFd values were calculated using ROCKER. The Wilcoxon statistic 34 estimates the standard deviation in the AUC calculations. The EFd 0.1, 0.5, 1, and 5% values correspond to the percentage of true positive ligands found when 0.1, 0.5, 1, or 5% of the "inactive" decoy compounds have been discovered.
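The EFd definition above can be sketched in a few lines. ROCKER computes these values in the study; the function and the ranked list below are only illustrative, with 1 marking an active compound and 0 a decoy, best-scored first.

```python
def efd(ranked_labels, decoy_pct):
    """Percentage of actives found when decoy_pct % of decoys have been seen."""
    n_actives = sum(ranked_labels)
    n_decoys = len(ranked_labels) - n_actives
    decoy_cutoff = n_decoys * decoy_pct / 100.0
    found_actives = decoys_seen = 0
    for label in ranked_labels:
        if label:
            found_actives += 1
        else:
            decoys_seen += 1
            if decoys_seen > decoy_cutoff:   # past the decoy threshold
                break
    return 100.0 * found_actives / n_actives

ranked = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]      # 4 actives among 6 decoys
print(efd(ranked, 50))                       # 75.0: 3 of 4 actives recovered
```

In the benchmarks above, the reported thresholds are 0.1, 0.5, 1, and 5% of the decoys rather than the 50% used in this toy example.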
The compounds that were skipped during the docking were added to the bottom of the results in an order corresponding to random picking to make the early enrichment results comparable. The RMSD values were calculated for those DUD-E compounds with both docking poses and co-crystallized ligand conformers using rmsd.py (MAESTRO 2018-1). The conformers were aligned in 3D space based on the protein backbone Cα atoms using VERTAA in BODIL. 50 The co-crystallized ligands (Table S1), which were prepared similarly to the DUD-E compounds but without allowing a heavy atom realignment, were taken from the PDB entries with the highest available resolution.
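The handling of skipped compounds can be sketched as follows: molecules that failed during docking are appended to the bottom of the ranked results in a random order, so that the early enrichment metrics remain comparable across methods. The compound identifiers and the helper function below are illustrative.

```python
import random

def pad_ranking(ranked, skipped, seed=1):
    """Append skipped compounds to the bottom of the ranking in random order."""
    tail = list(skipped)
    random.Random(seed).shuffle(tail)    # randomized order mimics random picking
    return list(ranked) + tail

ranked = ["mol3", "mol1", "mol4"]        # scored during rescoring
skipped = ["mol2", "mol5"]               # failed during docking
full = pad_ranking(ranked, skipped)
print(full[:3])  # ['mol3', 'mol1', 'mol4']
```

The scored part of the ranking is untouched; only the unscored tail is randomized.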
Target structures and benchmarking sets, Gen #0 training set results, PDE5 model results, effect of target metric on optimization, BR-NiB training set enrichments, BR-NiB training/test set 70:100 and 10:100 division enrichment results, ligand-based rescoring results with the X-ray co-crystals, NIB model compositions, shape fitting results, BR-NiB results with alternative docking software, docking cross-use results with the BR-NiB-optimized NIB models, computing demands of the optimization, single-generation optimization results, enrichment metrics of the MUV set, RMSD results of BR-NiB-selected poses, BR-NiB enrichment evolution over optimization generations, ROC curves, effect of cavity atom removals, and references (PDF)

Initial NIB models, BR-NiB-optimized models, and compositions of the benchmarking and validation sets (ZIP)

BR-NiB optimization (MOV)

BR-NiB generation #0 vs generation #49 (MOV)

■ ACKNOWLEDGMENTS