
Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks

Institute of Organic Chemistry & Center for Multiscale Theory and Computation, Westfälische Wilhelms-Universität Münster, 48149 Münster, Germany
Hit Discovery, Discovery Sciences, AstraZeneca R&D, Gothenburg, Sweden
Department of Medicinal Chemistry, IMED RIA, AstraZeneca R&D, Gothenburg, Sweden
Department of Physics & International Centre for Quantum and Molecular Structures, Shanghai University, Shanghai, China
Cite this: ACS Cent. Sci. 2018, 4, 1, 120–131
Publication Date (Web): December 28, 2017
https://doi.org/10.1021/acscentsci.7b00512

Copyright © 2017 American Chemical Society. This publication is licensed under these Terms of Use.

Abstract

In de novo drug design, computational strategies are used to generate novel molecules with good affinity to the desired biological target. In this work, we show that recurrent neural networks can be trained as generative models for molecular structures, similar to statistical language models in natural language processing. We demonstrate that the properties of the generated molecules correlate very well with the properties of the molecules used to train the model. In order to enrich libraries with molecules active toward a given biological target, we propose to fine-tune the model with small sets of molecules, which are known to be active against that target. Against Staphylococcus aureus, the model reproduced 14% of 6051 hold-out test molecules that medicinal chemists designed, whereas against Plasmodium falciparum (Malaria), it reproduced 28% of 1240 test molecules. When coupled with a scoring function, our model can perform the complete de novo drug design cycle to generate large sets of novel molecules for drug discovery.

Synopsis

Using artificial neural networks, computers can learn to generate molecules with desired target properties. This can aid in the creative process of drug design.

Introduction

Chemistry is the language of nature. Chemists speak it fluently and have made their discipline one of the true contributors to human well-being, which has “change[d] the way you live and die”. (1) This is particularly true for medicinal chemistry. However, creating novel drugs is an extraordinarily hard and complex problem. (2) One of the many challenges in drug design is the sheer size of the search space for novel molecules. It has been estimated that 10^60 drug-like molecules could possibly be synthetically accessible. (3) Chemists have to select and examine molecules from this large space to find molecules that are active toward a biological target. “Active” means, for example, that a molecule binds to a biomolecule and thereby causes an effect in the living organism, or inhibits the replication of bacteria. Modern high-throughput screening techniques allow testing on the order of 10^6 molecules in the lab. (4) However, larger experiments will get prohibitively expensive. Given this practical limitation of in vitro experiments, it is desirable to have computational tools to narrow down the enormous search space. Virtual screening is a commonly used strategy to search for promising molecules among millions of existing or billions of virtual molecules. (5) Searching can be carried out using similarity-based metrics, which provide a quantifiable numerical indicator of closeness between molecules. In contrast, in de novo drug design, one aims to directly create novel molecules that are active toward the desired biological target. (6, 7) Here, like in any molecular design task, the computer has to
(i) create molecules,

(ii) score and filter them, and

(iii) search for better molecules, building on the knowledge gained in the previous steps.

Task i, the generation of novel molecules, is usually solved with one of two different protocols. (7) One strategy is to build molecules from predefined groups of atoms or fragments. Unfortunately, these approaches often lead to molecules that are very hard to synthesize. (8) Therefore, another established approach is to conduct virtual chemical reactions based on expert coded rules, with the hope that these reactions could then also be applied in practice to make the molecules in the laboratory. (9) These systems give reasonable drug-like molecules and are considered as “the solution” to the structure generation problem. (2) We generally share this view. However, we have recently shown that the predicted reactions from these rule-based expert systems can sometimes fail. (10-12) Also, focusing on a small set of robust reactions can unnecessarily restrict the possibly accessible chemical space.
Task ii, scoring molecules and filtering out undesired structures, can be solved with substructure filters for undesirable reactive groups in conjunction with established approaches such as docking (13) or machine learning (ML) approaches. (7, 14, 15) The ML approaches are split into two branches: Target prediction classifies molecules into active and inactive, and quantitative structure–activity relationships (QSAR) seek to quantitatively predict a real-valued measure for the effectiveness of a substance (as a regression problem). As molecular descriptors, signature fingerprints, extended-connectivity (ECFP), and atom pair (APFP) fingerprints and their fuzzy variants are the de facto standard today. (16-18) Convolutional networks on graphs are a more recent addition to the field of molecular descriptors. (19-22) Jastrzebski et al. proposed to use convolutional neural networks to learn descriptors directly from SMILES strings. (23) Random forests, support vector machines, and neural networks are currently the most widely used machine learning models for target prediction. (24-35)
This leads to task iii, the search for molecules with the right binding affinity combined with optimal molecular properties. In earlier work, this was performed (among other methods) with classical global optimization techniques, for example genetic algorithms or ant-colony optimization. (7, 36) Furthermore, de novo design is related to inverse QSAR. (37-40) While in de novo design a regular QSAR mapping X → y from molecular descriptor space X to properties y is used as the scoring function for the global optimizer, in inverse QSAR one aims to find an explicit inverse mapping y → X, and then maps back from optimal points in descriptor space X to valid molecules. However, this is not well-defined, because molecules are inherently discrete (the space is not continuously populated), and the mapping from a target property value y to possible structures X is one-to-many, as usually several different structures with very similar properties can be found. Several protocols have been developed to address this, for example enumerating all structures within the constraints of hyper-rectangles in the descriptor space. (37-42) Gómez-Bombarelli et al. proposed to learn continuous representations of molecules with variational autoencoders, based on the model by Bowman et al., (43) and to perform Bayesian optimization in this vector space to optimize molecular properties. (44) While promising, this approach was not applied to create active drug molecules and often produced syntactically invalid molecules and highly strained or reactive structures, for example cyclobutadienes. (44)
In this work, we suggest a complementary, completely data-driven de novo drug design approach. It relies only on a generative model for molecular structures, based on a recurrent neural network, that is trained on large sets of molecules. Generative models learn a probability distribution over the training examples; sampling from this distribution generates new examples similar to the training data. Intuitively, a generative model for molecules trained on drug molecules would “know” how valid and reasonable drug-like molecules look and could be used to generate more drug-like molecules. However, for molecules, these models have been studied rarely, and rigorously only with traditional models such as Gaussian mixture models (GMM). (41, 45, 46) Recently, recurrent neural networks (RNNs) have emerged as powerful generative models in very different domains, such as natural language processing, (47) speech, (48) images, (49) video, (50) formal languages, (51) computer code generation, (52) and music scores. (53) In this work, we highlight the analogy of language and chemistry, and show that RNNs can also generate reasonable molecules. Furthermore, we demonstrate that RNNs can also transfer their learned knowledge from large molecule sets to directly produce novel molecules that are biologically active by retraining the models on small sets of already known actives. We test our models by reproducing hold-out test sets of known biologically active molecules.

Methods

Representing Molecules

To connect chemistry with language, it is important to understand how molecules are represented. Usually, they are modeled by molecular graphs, also called Lewis structures in chemistry. In molecular graphs, atoms are labeled nodes. The edges are the bonds between atoms, which are labeled with the bond order (e.g., single, double, or triple). One could therefore envision having a model that reads and outputs graphs. Several common chemistry formats store molecules in such a manner. However, in models for natural language processing, the input and output of the model are usually sequences of single letters, strings or words. We therefore employ the SMILES format, which encodes molecular graphs compactly as human-readable strings. SMILES is a formal grammar which describes molecules with an alphabet of characters, for example c and C for aromatic and aliphatic carbon atoms, O for oxygen, and −, =, and # for single, double, and triple bonds (see Figure 1). (54) To indicate rings, a number is introduced at the two atoms where the ring is closed. For example, benzene in aromatic SMILES notation would be c1ccccc1. Side chains are denoted by round brackets. To generate valid SMILES, the generative model would have to learn the SMILES grammar, which includes keeping track of rings and brackets to eventually close them. In morphine, a complex natural product, the number of steps between the first 1 and the second 1, indicating a ring, is 32. Having established a link between molecules and (formal) language, we can now discuss language models.
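To make the ring-closure bookkeeping concrete, here is a minimal Python sketch using RDKit (the toolkit we use later for descriptor calculations; the SMILES strings are those discussed above). A parser rejects a string whose ring label is never closed, which is exactly the long-range constraint a generative model must learn:

```python
from rdkit import Chem

# Benzene: the ring bond is closed between the two atoms labeled "1".
benzene = Chem.MolFromSmiles("c1ccccc1")
print(Chem.MolToSmiles(benzene))  # canonical SMILES: c1ccccc1

# The same string with the closing "1" missing is not a valid molecule,
# so MolFromSmiles returns None (after logging a parse error).
broken = Chem.MolFromSmiles("c1ccccc")
print(broken)  # None
```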

Figure 1

Figure 1. Examples of molecules and their SMILES representation. To correctly create SMILES, the model has to learn long-term dependencies, for example, to close rings (indicated by numbers) and brackets.

Language Models and Recurrent Neural Networks

Given a sequence of words (w_1, ..., w_i), language models predict the distribution of the (i+1)th word w_{i+1}. (55) For example, if a language model received the sequence “Chemistry is”, it would assign different probabilities to possible next words: “fascinating”, “important”, or “challenging” would receive high probabilities, while “runs” or “potato” would receive very low probabilities. Language models can both capture the grammatical correctness (“runs” in this sentence is wrong) and the meaning (“potato” does not make sense). Language models are implemented, for example, in message autocorrection in many modern smartphones. Interestingly, language models do not have to use words. They can also be based on characters or letters. (55) In that case, when receiving the sequence of characters chemistr, a language model would assign a high probability to y, but a low probability to q. To model molecules instead of language, we simply swap words or letters with atoms, or, more practically, characters in the SMILES alphabet, which form a (formal) language. For example, if the model receives the sequence c1ccccc, there is a high probability that the next symbol would be a “1”, which closes the ring, and yields benzene.
More formally, the language model assigns to a sequence S of symbols s_t at steps t = 1, ..., T a probability of

P_\theta(S) = P_\theta(s_1) \prod_{t=2}^{T} P_\theta(s_t \mid s_{t-1}, \ldots, s_1)   (1)

where the parameters θ are learned from the training set. (55) In this work, we use a recurrent neural network (RNN) to estimate the probabilities of eq 1. In contrast to regular feedforward neural networks, RNNs maintain state, which is needed to keep track of the symbols seen earlier in the sequence. In abstract terms, an RNN takes a sequence of input vectors x_{1:n} = (x_1, ..., x_n) and an initial state vector h_0, and returns a sequence of state vectors h_{1:n} = (h_1, ..., h_n) and a sequence of output vectors y_{1:n} = (y_1, ..., y_n). The RNN consists of a recursively defined function R, which takes a state vector h_i and input vector x_{i+1} and returns a new state vector h_{i+1}. Another function O maps a state vector h_i to an output vector y_i. (55)

\mathrm{RNN}(x_{1:n}, h_0) = h_{1:n}, y_{1:n}   (2)
h_i = R(h_{i-1}, x_i)   (3)
y_i = O(h_i)   (4)
The state vector h_i stores a representation of the information about all symbols seen in the sequence so far. As an alternative to the recursive definition, the recurrent network can also be unrolled for finite sequences (see Figure 2). An unrolled RNN can be seen as a very deep neural network, in which the parameters θ are shared among the layers, and the hidden state h_t is passed as an additional input to the next layer. Training the unrolled RNN to fit the parameters θ can then simply be done by using backpropagation to compute the gradients with respect to the loss function, which is categorical cross-entropy in this work. (55)

Figure 2

Figure 2. (a) Recursively defined RNN. (b) The same RNN, unrolled. The parameters θ (the weight matrices of the neural network) are shared over all time steps.

As the specific RNN function, in this work, we use the long short-term memory (LSTM), which was introduced by Hochreiter and Schmidhuber. (56) It has been used successfully in many natural language processing tasks, (47) for example in Google’s neural machine translation system. (57) For excellent in-depth discussions of the LSTM, we refer to the articles by Goldberg, (55) Graves, (58) Olah, (59) and Greff et al. (60)
To encode the SMILES symbols as input vectors x_t, we employ the “one-hot” representation. (58) This means that if there are K symbols, and k is the symbol to be input at step t, then we can construct an input vector x_t with length K, whose entries are all zero except the kth entry, which is one. If we assume a very restricted set of symbols {c, 1, \n}, input c would correspond to x_t = (1, 0, 0), 1 to x_t = (0, 1, 0), and \n to x_t = (0, 0, 1).
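As an illustration, a short numpy sketch of this encoding for the restricted three-symbol alphabet from the text (the helper name is ours):

```python
import numpy as np

symbols = ["c", "1", "\n"]                     # restricted example alphabet
index = {s: k for k, s in enumerate(symbols)}  # symbol -> position k

def one_hot(symbol: str) -> np.ndarray:
    """Length-K vector that is zero everywhere except at the symbol's index."""
    x = np.zeros(len(symbols))
    x[index[symbol]] = 1.0
    return x

print(one_hot("c"))   # [1. 0. 0.]
print(one_hot("1"))   # [0. 1. 0.]
print(one_hot("\n"))  # [0. 0. 1.]
```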
The probability distribution P_\theta(s_{t+1} \mid s_t, \ldots, s_1) of the next symbol given the already seen sequence is thus a multinomial distribution, which is estimated using the output vector y_t of the recurrent neural network at time step t by

P_\theta(s_{t+1} = k \mid s_t, \ldots, s_1) = \frac{\exp(y_t^k)}{\sum_{k'=1}^{K} \exp(y_t^{k'})}   (5)

where y_t^k corresponds to the kth element of vector y_t. (58) Sampling from this distribution would then allow generating novel molecules: After sampling a SMILES symbol s_{t+1} for the next time step t + 1, we can construct a new input vector x_{t+1}, which is fed into the model, and via y_{t+1} and eq 5 yields P_\theta(s_{t+2} \mid s_{t+1}, \ldots, s_1). Sampling from the latter generates s_{t+2}, which again serves as the model's input for the next step (see Figure 3). This symbol-by-symbol sampling procedure is repeated until the desired number of characters has been generated. (58)

Figure 3

Figure 3. Symbol generation and sampling process. We start with a random seed symbol s_1, here c, which gets converted into a one-hot vector x_1 and input into the model. The model then updates its internal state h_0 to h_1 and outputs y_1, which is the probability distribution over the next symbols. Here, sampling yields s_2 = 1. Converting s_2 to x_2 and feeding it to the model leads to updated hidden state h_2 and output y_2, from which we can sample again. This iterative symbol-by-symbol procedure can be continued as long as desired. In this example, we stop it after observing an EOL (\n) symbol, and obtain the SMILES for benzene. The hidden state h_i allows the model to keep track of opened brackets and rings, to ensure that they will be closed again later.

To indicate that a molecule is “completed”, each molecule in our training data finishes with an “end of line” (EOL) symbol, in our case the single character \n (which means that the training data is just a simple SMILES file). Thus, when the system outputs an EOL, a generated molecule is finished. However, we simply continue sampling, thus generating a regular SMILES file that contains one molecule per line.
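The sampling procedure can be sketched as follows (numpy only; `predict_step`, which maps the current one-hot input and hidden state to the output vector y_t and the new state, stands in for the trained RNN and is an assumption, not the actual implementation):

```python
import numpy as np

def sample_smiles(predict_step, one_hot, symbols, seed="c", max_len=200):
    """Sample one SMILES string symbol-by-symbol until the EOL symbol."""
    sequence, state = [seed], None
    for _ in range(max_len):
        # y holds the unnormalized scores for the next-symbol distribution.
        y, state = predict_step(one_hot(sequence[-1]), state)
        probs = np.exp(y) / np.exp(y).sum()        # softmax of eq 5
        next_symbol = np.random.choice(symbols, p=probs)
        if next_symbol == "\n":                    # EOL: molecule finished
            break
        sequence.append(next_symbol)
    return "".join(sequence)
```

In practice we keep sampling past the EOL symbol, so repeated draws simply yield one molecule per line of a SMILES file.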
In this work, we used a network with three stacked LSTM layers, using the Keras library. (61) The model was trained with back-propagation through time, (58) using the ADAM optimizer at standard settings. (62) To mitigate the problem of exploding gradients during training, a gradient norm clipping of 5 is applied. (58)

Transfer Learning

For many machine learning tasks, only small data sets are available, which might lead to overfitting with powerful models such as neural networks. In this situation, transfer learning can help. (63) Here, a model is first trained on a large data set for a different task. Then, the model is retrained on the smaller data set, which is also called fine-tuning. The aim of transfer learning is to learn general features on the bigger data set, which also might be useful for the second task in the smaller data regime. To generate focused molecule libraries, we first train on a large, general set of molecules, then perform fine-tuning on a smaller set of specific molecules, and after that start the sampling procedure.
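In code, fine-tuning is nothing more than continued training of the pretrained model on the small data set (a hedged Keras sketch; the file name and the one-hot-encoded arrays X_actives/Y_actives are placeholders):

```python
from keras.models import load_model

# Load the chemical language model pretrained on the large ChEMBL set.
model = load_model("chembl_language_model.h5")  # hypothetical file name

# Continue training ("fine-tune") on the small set of known actives,
# encoded as one-hot input sequences with next-symbol targets.
model.fit(X_actives, Y_actives, batch_size=128, epochs=20)
```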

Target Prediction

To verify whether the generated molecules are active on the desired targets, standard target prediction was employed. Machine learning based target prediction aims to learn a classifier c: M → {1, 0} to decide whether a molecule m ∈ molecular descriptor space M is active or not against a target. (14, 15) The molecules are split into actives and inactives using a threshold on a measure for the substance effectiveness. pIC50 = −log10(IC50) is one of the most widely used metrics for this purpose. IC50 is the half maximal inhibitory concentration, that is, the concentration of drug that is required to inhibit 50% of a biological target’s function in vitro.
To predict whether the generated molecules are active toward the biological target of interest, target prediction models (TPMs) were trained for all the tested targets (5-HT2A, Plasmodium falciparum and Staphylococcus aureus). We evaluated random forest, logistic regression, (deep) neural networks, and gradient boosting trees (GBT) as models with ECFP4 (extended connectivity fingerprint with a diameter of 4) as the molecular descriptor. (16, 17) We found that GBTs slightly outperformed all other models and used these as our virtual assay in all studies (AUC[5-HT2A] = 0.877, AUC[Staph. aur.] = 0.916). ECFP4 fingerprints were generated with CDK version 1.5.13. (64, 65) scikit-learn, (66) XGBoost, (67) and Keras (61) were used as the machine learning libraries. For 5-HT2A and Plasmodium, molecules are considered as active for the TPM if their IC50 reported in ChEMBL is <100 nM, which translates to a pIC50 > 7, whereas for Staphylococcus, we used pMIC > 3.
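A sketch of such a virtual assay in Python (here with RDKit's Morgan fingerprint of radius 2, which corresponds to ECFP4, and XGBoost; our actual fingerprints were computed with CDK, so details may differ):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from xgboost import XGBClassifier

def ecfp4(smiles: str, n_bits: int = 2048) -> np.ndarray:
    """ECFP4 bit vector: Morgan fingerprint with radius 2 (diameter 4)."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    return np.array(fp)

# train_smiles and pic50 stand for the ChEMBL training data (placeholders);
# actives are molecules with pIC50 > 7, i.e., IC50 < 100 nM.
X = np.array([ecfp4(s) for s in train_smiles])
y = (pic50 > 7).astype(int)
tpm = XGBClassifier().fit(X, y)

# Score de novo molecules: predicted probability of being active.
scores = tpm.predict_proba(np.array([ecfp4(s) for s in novel_smiles]))[:, 1]
```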

Data

The chemical language model was trained on a SMILES file containing 1.4 million molecules from the ChEMBL database, which contains molecules and measured biological activity data. The SMILES strings of the molecules were canonicalized (which means finding a unique representation that is the same for isomorphic molecular graphs) (68, 69) before training with the CDK chemoinformatics library, yielding a SMILES file that contained one molecule per line. (64, 65) It has to be noted that ChEMBL contains many peptides, natural products with complex scaffolds, Michael acceptors, benzoquinones, hydroxylamines, hydrazines, etc., which is reflected in the generated structures (see below). This corresponds to 72 million individual characters, with a vocabulary size of 51 unique characters. 51 characters is only a subset of all SMILES symbols, since the molecules in ChEMBL do not contain many of the heavy elements. As we have to set the number of symbols as a hyperparameter during model construction, and the model can only learn the distribution over the symbols present in the training data, this implies that only molecules with these 51 SMILES symbols seen during training can be generated during sampling.
The 5-HT2A, the Plasmodium falciparum, and the Staphylococcus aureus data sets were also obtained from ChEMBL. As these molecules were intended to be used in the rediscovery studies, they were removed from the training data before fitting the chemical language model.

Model Evaluation

To evaluate the models, for a test set T and a set of molecules G_N generated from the model by sampling, we report the ratio of reproduced molecules |G_N ∩ T| / |T|, and the enrichment over random (EOR), which is defined as

\mathrm{EOR} = \frac{n / N}{m / M}   (6)

where n = |G_N ∩ T| is the number of molecules from T reproduced by sampling a set G_N of |G_N| = N molecules from the fine-tuned generative model, and m = |R_M ∩ T| is the number of molecules from T reproduced by sampling a set R_M of |R_M| = M molecules from the generic, unbiased generative model trained only on the large data set. Intuitively, EOR indicates how much better the fine-tuned models work when compared to the general model.
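Computationally, both metrics reduce to set intersections over canonicalized SMILES (a minimal sketch; the function and argument names are ours):

```python
def ratio_reproduced(generated: set, test: set) -> float:
    """Fraction of the test set T recovered by the generated molecules."""
    return len(generated & test) / len(test)

def eor(fine_tuned: set, unbiased: set, test: set, N: int, M: int) -> float:
    """Enrichment over random (eq 6): the hit rate of the fine-tuned model
    divided by the hit rate of the unbiased model."""
    n = len(fine_tuned & test)   # hits among N fine-tuned samples
    m = len(unbiased & test)     # hits among M unbiased samples
    return (n / N) / (m / M)
```

All sets must contain canonical SMILES, since otherwise identical molecules with different string representations would not intersect.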

Results and Discussion

In this work, we address two points: First, we want to generate large sets of diverse molecules for virtual screening campaigns. Second, we want to generate smaller, focused libraries enriched with possibly active molecules for a specific target. For the first task, we can train a model on a large, general set of molecules to learn the SMILES grammar. Sampling from this model would generate sets of diverse, but unfocused molecules. To address the second task, and to obtain novel active drug molecules for a target of interest, we perform transfer learning: We select a small set of known actives for that target and we refit our pretrained chemical language model with this small data set. After each epoch, we sample from the model to generate novel actives. Furthermore, we investigate if the model actually benefits from transfer learning, by comparing it to a model trained from scratch on the small sets without pretraining.

Training the Recurrent Network

We employed a recurrent neural network with three stacked LSTM layers, each with 1024 dimensions, and each one followed by a dropout (70) layer, with a dropout ratio of 0.2, to regularize the neural network. The model was trained until convergence, using a batch size of 128. The RNN was unrolled for 64 steps. It had 21.3 × 10^6 parameters.
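With these hyperparameters, the architecture can be written down in a few lines of Keras (a hedged reconstruction from the stated settings, not the released code; vocab_size is the 51-character SMILES alphabet described above):

```python
from keras.models import Sequential
from keras.layers import LSTM, Dropout, TimeDistributed, Dense
from keras.optimizers import Adam

vocab_size = 51  # unique SMILES characters in the ChEMBL training data
seq_len = 64     # number of steps the RNN is unrolled for

model = Sequential([
    LSTM(1024, return_sequences=True, input_shape=(seq_len, vocab_size)),
    Dropout(0.2),
    LSTM(1024, return_sequences=True),
    Dropout(0.2),
    LSTM(1024, return_sequences=True),
    Dropout(0.2),
    # Per-step softmax over the next-symbol distribution (eq 5).
    TimeDistributed(Dense(vocab_size, activation="softmax")),
])
model.compile(loss="categorical_crossentropy",
              optimizer=Adam(clipnorm=5.0))  # gradient norm clipping of 5
```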
During training, we sampled a few molecules from the model every 1000 minibatches to inspect progress. Within a few thousand steps, the model starts to output valid molecules (see Table 1).
Table 1. Molecules Sampled during Training

Generating Novel Molecules

To generate novel molecules, 50,000,000 SMILES symbols were sampled from the model symbol-by-symbol. This corresponded to 976,327 lines, of which 97.7% were valid molecules after parsing with the CDK toolkit. Removing all molecules already seen during training yielded 864,880 structures. After filtering out duplicates, we obtained 847,955 novel molecules. A few randomly selected generated molecules are depicted in Figure 4. The Supporting Information contains more structures. The created structures are not just formally valid but also mostly chemically reasonable.

Figure 4

Figure 4. A few randomly selected, generated molecules. Ad = Adamantyl.

In order to check if the de novo compounds could be considered as valid starting points for a drug discovery program, we applied the internal AstraZeneca filters. (71) At AstraZeneca, this flagging system is used to determine if a compound is suitable to be part of the high-throughput screening collection (if flagged as “core” or “backup”) or should be restricted for particular use (flagged as “undesirable” since it contains one or several unwanted substructures, e.g., undesired reactive functional groups). The filters were applied to the generated set of 848 k molecules, and most of them, 640 k (75%), were flagged as either core or backup. Since the same ratio (75%) of core and backup compounds is observed for the ChEMBL collection, we conclude that the algorithm generates predominantly valid screening molecules and faithfully reproduces the distribution of the training data.
To determine whether the properties of the generated molecules match the properties of the training data from ChEMBL, we followed the procedure of Kolb: (72) We computed several molecular properties, namely, molecular weight, BertzCT, the number of H-donors, H-acceptors, and rotatable bonds, logP, and total polar surface area, for randomly selected subsets of both sets with the RDKit (73) library version 2016.03.1. Then, we performed dimensionality reduction to 2D with t-SNE (t-distributed stochastic neighbor embedding, a nonlinear dimensionality reduction technique), (74) which is shown in Figure 5. Both sets overlap almost completely, which indicates that the generated molecules recreate the properties of the training molecules very well.
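A sketch of this analysis with RDKit and scikit-learn (the descriptor choice follows the text; the t-SNE settings of the original run are not specified, so the defaults here are an assumption):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.manifold import TSNE

def properties(smiles: str) -> list:
    """The seven physicochemical descriptors named in the text."""
    m = Chem.MolFromSmiles(smiles)
    return [
        Descriptors.MolWt(m),
        Descriptors.BertzCT(m),
        Descriptors.NumHDonors(m),
        Descriptors.NumHAcceptors(m),
        Descriptors.NumRotatableBonds(m),
        Descriptors.MolLogP(m),
        Descriptors.TPSA(m),
    ]

# chembl_sample and generated_sample are placeholder lists of SMILES.
X = np.array([properties(s) for s in chembl_sample + generated_sample])
coords = TSNE(n_components=2).fit_transform(X)  # 2D embedding (Figure 5)
```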

Figure 5

Figure 5. t-SNE projection of 7 physicochemical descriptors of random molecules from ChEMBL (blue) and molecules generated with the neural network trained on ChEMBL (green), to two unitless dimensions. The distributions of both sets overlap significantly.

Furthermore, we analyzed the Bemis–Murcko scaffolds of the training molecules and the sampled molecules. (75) Bemis–Murcko scaffolds contain the ring systems of a molecule and the moieties that link these ring systems, while removing any side chains. They represent the scaffold, or “core”, of a molecule, which a series of drug molecules often has in common. The number of scaffolds common to both sets divided by the number of scaffolds in the union of both sets (the Jaccard index) is 0.12, which indicates that the language model does not just modify side chain substituents but also introduces modifications at the molecular core.
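A sketch of this scaffold comparison with RDKit (the set names are placeholders):

```python
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold(smiles: str) -> str:
    """Canonical SMILES of the Bemis-Murcko scaffold of a molecule."""
    return MurckoScaffold.MurckoScaffoldSmiles(smiles)

train_scaffolds = {scaffold(s) for s in training_smiles}
gen_scaffolds = {scaffold(s) for s in generated_smiles}

# Jaccard index: shared scaffolds divided by all scaffolds in either set.
jaccard = len(train_scaffolds & gen_scaffolds) / len(train_scaffolds | gen_scaffolds)
```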

Generating Active Drug Molecules and Focused Libraries

Targeting the 5-HT2A Receptor

To generate novel ligands for the 5-HT2A receptor, we first selected all molecules with pIC50 > 7 that were tested on 5-HT2A from ChEMBL (732 molecules), and then fine-tuned our pretrained chemical language model on this set. After each epoch, we sampled 100,000 characters, canonicalized the resulting molecules, and removed any that were already contained in the training set. Following this, we evaluated the generated molecules of each round of retraining with our 5-HT2A target prediction model (TPM). In Figure 6, the ratio of molecules predicted to be active by the TPM after each round of fine-tuning is shown. Before fine-tuning (corresponding to epoch 0), the model generates almost exclusively inactive molecules. After only 4 epochs of fine-tuning, the model produced a set in which 50% of the molecules are predicted to be active.

Figure 6

Figure 6. Epochs of fine-tuning vs ratio of actives.

Diversity Analysis

In order to assess the novelty of the de novo molecules generated with the fine-tuned model, a nearest-neighbor similarity/diversity analysis was conducted using a commonly used 2D fingerprint (ECFP4) based similarity method (Tanimoto index). (72) Figure 7 shows the distribution of the nearest-neighbor Tanimoto index obtained by comparing all the novel molecules and the training molecules before and after n epochs of fine-tuning. For each bin, the white bars indicate the molecules generated from the unbiased, general model, while the darker bars indicate the molecules after several epochs of fine-tuning. Within the bins corresponding to lower similarity, the number of molecules decreases, while the bins of higher similarity get populated with increasing numbers of molecules. The plot thus shows that the model starts to output molecules more and more similar to the target-specific training set. Notably, after a few rounds of training not only are highly similar molecules produced but also molecules covering the whole range of similarity, indicating that our method could deliver not only close analogues but also new chemotypes or scaffold ideas to a drug discovery project. (5) To have the best of both worlds, that is, diverse and focused molecules, we therefore suggest sampling after each epoch of retraining and not just after the final epoch.

Figure 7

Figure 7. Nearest-neighbor Tanimoto similarity distribution of the generated molecules for 5-HT2A after n epochs of fine-tuning against the known actives. The generated molecules are distributed over the whole similarity range. Generated molecules with a medium similarity can be interesting for scaffold-hopping. (5)
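The nearest-neighbor similarity underlying Figure 7 can be sketched as follows (RDKit; the variable names are ours):

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fp(smiles: str):
    """ECFP4-like Morgan fingerprint (radius 2) as a bit vector."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

train_fps = [fp(s) for s in training_smiles]

def nearest_neighbor_similarity(smiles: str) -> float:
    """Highest Tanimoto similarity of a molecule to any training molecule."""
    return max(DataStructs.BulkTanimotoSimilarity(fp(smiles), train_fps))

similarities = np.array([nearest_neighbor_similarity(s)
                         for s in generated_smiles])
# A histogram of `similarities` gives the distribution in Figure 7.
```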

Targeting Plasmodium falciparum (Malaria)

Plasmodium falciparum is a parasite that causes the most dangerous form of malaria. (76) To probe our model on this important target, we used a more challenging validation strategy. We wanted to investigate whether the model could also propose the same molecules that medicinal chemists chose to evaluate in published studies. To test this, first, the known actives against Plasmodium falciparum with a pIC50 > 8 were selected from ChEMBL (Table 2). Then, this set was split randomly into a training (1239 molecules) and a test set (1240 molecules). The chemical language model was then fine-tuned on the training set. 7500 molecules were sampled after each of the 20 epochs of refitting.
Table 2. Reproducing Known Actives in the Plasmodium Test Set
no. | pIC50 | training | test | gen mols | reprod (%) | EOR(a)
1   | >8    | 1239     | 1240 | 128,256  | 28         | 66.9
2   | >8    | 100      | 1240 | 93,721   | 7          | 19.0
3   | >9    | 100      | 1022 | 91,034   | 11         | 35.7

(a) EOR: Enrichment over random.

This yielded 128,256 unique molecules. Interestingly, we found that our model was able to “redesign” 28% of the unseen molecules of the test set. In comparison to molecules sampled from the unspecific, untuned model, an enrichment over random (EOR) of 66.9 is obtained. With a smaller training set of 100 molecules, the model can still reproduce 7% of the test set, with an EOR of 19.0. To test the dependence on the pIC50 cutoff, we repeated the experiment with a cutoff of pIC50 > 9, taking 100 molecules for the training set and 1022 for the test set. 11% of the test set could be recreated, with an EOR of 35.7. To visually explore how the model populates chemical space, Figure 8 shows a t-SNE plot of the ECFP4 fingerprints of the test molecules and 2000 generated molecules that were predicted to be active by the target prediction model for Plasmodium falciparum. It indicates that the model has generated many similar molecules around the test examples.

Figure 8

Figure 8. t-SNE plot of the pIC50 > 9 test set (blue) and the de novo molecules predicted to be active (green). The language model populates chemical space around the test molecules.

Targeting Staphylococcus aureus (Golden Staph)

To evaluate a different target, we furthermore conducted a series of experiments to reproduce known active molecules against Staphylococcus aureus. Here, we used actives with a pMIC > 3. MIC is the minimum inhibitory concentration, the lowest concentration of a compound that prevents visible growth of a microorganism. As above, the actives were split into a training and a test set. However, here, the availability of the data allows larger test sets to be used. After fine-tuning on the training set of 1000 molecules (Table 3, entry 1), our model could retrieve 14% of the 6051 test molecules. When scaling down to a smaller training set of 50 molecules (the model gets trained on less than 1% of the data!), it can still reproduce 2.5% of the test set, and performs 21.6 times better than the unbiased model (Table 3, entry 2). Using a lower learning rate (0.0001, entry 3) for fine-tuning, which is often done in transfer learning, does not work as well as the standard learning rate (0.001, entry 2). We additionally examined whether the model benefits from transfer learning. When trained from scratch, the model performs much worse than the pretrained and subsequently fine-tuned model (see Figure 9 and Table 3, entry 4). Pretraining on the large data set is thus crucial to achieve good performance against Staphylococcus aureus.
Table 3. Reproducing Known Actives in the Staphylococcus Test Set
entry | pMIC | training | test | gen mols | reprod (%) | EOR(a)
1     | >3   | 1000     | 6051 | 51,052   | 14         | 155.9
2     | >3   | 50       | 7001 | 70,891   | 2.5        | 21.6
3(b)  | >3   | 50       | 7001 | 85,755   | 1.8        | 6.3
4(c)  | >3   | 50       | 7001 | 285      | 0          | -
5(d)  | >3   | 0        | 7001 | 60,988   | 6          | 59.6

(a) EOR: Enrichment over random. (b) Fine-tuning learning rate = 10^-4. (c) No pretraining. (d) 8 generate-test cycles.

Figure 9

Figure 9. Different training strategies on the Staphylococcus aureus data set with 1000 training and 6051 test examples. Fine-tuning the pretrained model performs better than training from scratch (lower test loss [cross entropy] is better).

Simulating Design-Synthesis-Test Cycles

The experiments we conducted so far are applicable if one already knows several actives. However, in drug discovery, one often does not have such a set to start with. Therefore, high-throughput screens are conducted to identify a few hits, which serve as a starting point for the typical cyclical drug discovery process: Molecules get designed, synthesized, and then tested in assays. Then, the best molecules are selected, and based on the gained knowledge new molecules are designed, which closes the cycle. Therefore, as a final challenge for our model, we simulated this cycle by iterating molecule generation (“synthesis”), selection of the best molecules with the machine learning based target prediction (“virtual assay”), and retraining the language model with the best molecules (“design”), with Staphylococcus aureus as the target. We thus do not use a set of known actives to start the structure generation procedure (see Figure 10).

Figure 10

Figure 10. Scheme of our de novo design cycle. Molecules are generated by the chemical language model and then scored with the target prediction model (TPM). The inactives are filtered out, and the RNN is retrained. Here, the TPM is a machine learning model, but it could also be a robot conducting synthesis and biological assays, or a docking program.

We started with 100,000 sampled molecules from the unbiased chemical language model. Then, using our target prediction model, we extracted the molecules classified as actives. After that, the RNN was fine-tuned for 5 epochs on the actives, sampling ≈10,000 molecules after each epoch. The resulting molecules were filtered with the target prediction model, and the new actives appended to the actives from the previous round, closing the loop.
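A sketch of this generate-score-retrain loop (the helper functions tie together the sketches above and are placeholders, not the actual code):

```python
# Start from the unbiased chemical language model, with no known actives.
actives = []
candidates = sample_molecules(model, n=100_000)  # hypothetical sampler

for iteration in range(8):                       # 8 generate-test cycles
    # "Assay": keep the molecules the TPM classifies as active.
    actives += [s for s in candidates if tpm_predict(s) == 1]
    # "Design": fine-tune the language model on all actives found so far.
    fine_tune(model, actives, epochs=5)
    # "Synthesis": sample a fresh batch from the retrained model.
    candidates = sample_molecules(model, n=10_000)
```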
Already after 8 iterations, the model reproduced 416 of the 7001 test molecules from the previous task, which is 6% (Table 3, entry 5), and exhibits an EOR of 59.6. This EOR is higher than if the model is retrained directly on a set of 50 actives (entry 2). Additionally, we obtained 60,988 unique molecules that the target prediction model classified as active. This suggests that, in combination with a target prediction or scoring model, our model can at least simulate the complete de novo design cycle.

Why Does the Model Work?

Our results indicate that the general model trained on a large molecule set has learned the SMILES rules and can output valid, drug-like molecules, which resemble the training data. However, sampling from this model does not help much if we want to generate actives for a specific target: We would have to generate very large sets to find actives for that target among the diverse range of molecules the model creates, which is indicated by the high EOR scores in our experiments.
When fine-tuned to a set of actives, the probability distribution over the molecules captured by our model is shifted toward molecules active toward our target. To study this, we compare the Levenshtein (string edit) distance of the generated SMILES to their nearest neighbors in the training set in Figure 11. The Levenshtein distance of, e.g., benzene c1ccccc1 and pyridine c1ccncc1 would be 1. Figure 11 shows that while the model often seems to have made small replacements in the underlying SMILES, in many cases it also made more complex modifications or even generated completely different SMILES. This is supported also by the distribution of the nearest neighbor fingerprint similarities of training and rediscovered molecules (ECFP4, Tanimoto, Figure 12). Many rediscovered molecules are in the medium similarity regime.
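For reference, a compact dynamic-programming implementation of the Levenshtein distance used for this comparison (a standard textbook algorithm, not code from this work):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("c1ccccc1", "c1ccncc1"))  # 1: benzene vs pyridine
```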

Figure 11

Figure 11. Histogram of Levenshtein (string edit) distances of the SMILES of the reproduced molecules to their nearest neighbor in the training set (Staphylococcus aureus, model retrained on 50 actives). While in many cases the model makes changes of a few symbols in the SMILES, resembling the typical modifications applied when exploring series of compounds, the distribution of the distances indicates that the RNN also performs more complex changes by introducing larger moieties or generating molecules that are structurally different, but isofunctional to the training set.

Figure 12

Figure 12. Violin plot of the nearest-neighbor ECFP4-Tanimoto similarity distribution of the 50 training molecules against the rediscovered molecules in Table 3, entry 2. The distribution suggests that the model has learned to make typical small functional group replacements, but can also reproduce molecules which are not too similar to the training data.

Because we perform transfer learning, the model does not “forget” during fine-tuning what it learned during pretraining. A plausible explanation for why the model works is therefore that it can transfer the modifications that are regularly applied when series of molecules are studied to the molecules it has seen during fine-tuning.

Conclusion

In this work, we have shown that recurrent neural networks based on the long short-term memory (LSTM) can be applied to learn a statistical chemical language model. The model can generate large sets of novel molecules with similar physicochemical properties to the training molecules. This can be used to generate libraries for virtual screening. Furthermore, we demonstrated that the model performs transfer learning when fine-tuned to smaller sets of molecules active toward a specific biological target, which enables the creation of novel molecules with the desired activity. By iterating cycles of structure generation with the language model, scoring with a target prediction model (TPM) and retraining of the model with increasingly larger sets of highly scored molecules, we showed that we do not even need a set of known active molecules to start our procedure with, as the TPM could also be a docking program, or a robot conducting synthesis (77) and biological testing.
We see three main advantages of our method. First, it is conceptually orthogonal to established molecule generation approaches, as it learns a generative model for molecular structures. Second, our method is very simple to set up, to train, and to use; it can be adapted to different data sets without any modifications to the model architecture; and it does not depend on hand-encoded expert knowledge. Furthermore, it merges structure generation and optimization in one model. A weakness of our model is its lack of interpretability. In contrast, existing de novo design methods settled on virtual reactions to generate molecules, which has advantages: it reduces the chance of obtaining “overfit”, chemically unreasonable molecules, and increases the chance of finding synthesizable compounds. (2, 7)
To extend our work, it is just a small step to cast molecule generation as a reinforcement learning problem, where the pretrained LSTM generator could be seen as a policy, which can be encouraged to create better molecules with a reward signal obtained from a target prediction model. (78) In addition, different approaches for target prediction, for example, docking, could be evaluated. (7, 13)
Deep learning is not a panacea, and we join Gawehn et al. in expressing “some healthy skepticism” regarding its application in drug discovery. (31) Generating molecules that are almost right is not enough, because in chemistry, a miss is as good as a mile, and drug discovery is a “needle in the haystack” problem—in which also the needle looks like hay. Nevertheless, given that we have shown in this work that our model can rediscover those needles, and other recent developments, (31, 79) we believe that deep neural networks can be complementary to established approaches in drug discovery. The complexity of the problem certainly warrants the investigation of novel approaches. Eventually, success in the wet lab will determine if the new wave (26) of neural networks will prevail.

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acscentsci.7b00512.

  • A set of molecules sampled from our model (PDF)

  • 400,000 molecules as SMILES (ZIP)

Author Information

  • Corresponding Authors
  • Authors
    • Thierry Kogej - Hit Discovery, Discovery Sciences, AstraZeneca R&D, Gothenburg, Sweden
    • Christian Tyrchan - Department of Medicinal Chemistry, IMED RIA, AstraZeneca R&D, Gothenburg, Sweden
  • Funding

    M.H.S.S. and M.P.W. acknowledge funding by Deutsche Forschungsgemeinschaft (DFG, SFB 858). M.P.W. would like to acknowledge support from the Shanghai Eastern Scholar Program. T.K. and C.T. are employees of AstraZeneca.

  • Notes
    The authors declare no competing financial interest.

Acknowledgment

The project was conducted during a research stay of M.H.S.S. at AstraZeneca R&D Gothenburg. We thank H. Chen and O. Engkvist for valuable discussions and feedback on the manuscript, and G. Klambauer for helpful suggestions.

References


  1. Whitesides, G. M. Reinventing chemistry. Angew. Chem., Int. Ed. 2015, 54, 3196–3209. DOI: 10.1002/anie.201410884
  2. Schneider, P.; Schneider, G. De Novo Design at the Edge of Chaos: Miniperspective. J. Med. Chem. 2016, 59, 4077–4086. DOI: 10.1021/acs.jmedchem.5b01849
  3. Reymond, J.-L.; Ruddigkeit, L.; Blum, L.; van Deursen, R. The enumeration of chemical space. Wiley Interdisc. Rev. Comp. Mol. Sci. 2012, 2, 717–733. DOI: 10.1002/wcms.1104
  4. Schneider, G.; Baringhaus, K.-H. Molecular design: concepts and applications; John Wiley & Sons: 2008.
  5. Stumpfe, D.; Bajorath, J. Similarity searching. Wiley Interdisc. Rev. Comp. Mol. Sci. 2011, 1, 260–282. DOI: 10.1002/wcms.23
  6. Schneider, G.; Fechner, U. Computer-based de novo design of drug-like molecules. Nat. Rev. Drug Discovery 2005, 4, 649–663. DOI: 10.1038/nrd1799
  7. Hartenfeller, M.; Schneider, G. Enabling future drug discovery by de novo design. Wiley Interdisc. Rev. Comp. Mol. Sci. 2011, 1, 742–759. DOI: 10.1002/wcms.49
  8. Hartenfeller, M.; Zettl, H.; Walter, M.; Rupp, M.; Reisen, F.; Proschak, E.; Weggen, S.; Stark, H.; Schneider, G. DOGS: reaction-driven de novo design of bioactive compounds. PLoS Comput. Biol. 2012, 8, e1002380. DOI: 10.1371/journal.pcbi.1002380
  9. Hartenfeller, M.; Eberle, M.; Meier, P.; Nieto-Oberhuber, C.; Altmann, K.-H.; Schneider, G.; Jacoby, E.; Renner, S. A collection of robust organic synthesis reactions for in silico molecule design. J. Chem. Inf. Model. 2011, 51, 3093–3098. DOI: 10.1021/ci200379p
  10. Segler, M. H.; Waller, M. P. Modelling chemical reasoning to predict and invent reactions. Chem. - Eur. J. 2017, 23, 6118–6128. DOI: 10.1002/chem.201604556
  11. Segler, M. H.; Waller, M. P. Neural-Symbolic Machine Learning for Retrosynthesis and Reaction Prediction. Chem. - Eur. J. 2017, 23, 5966–5971. DOI: 10.1002/chem.201605499
  12. Segler, M. H.; Preuss, M.; Waller, M. P. Learning to Plan Chemical Syntheses. ArXiv 2017, 1708.04202.
  13. Kitchen, D. B.; Decornez, H.; Furr, J. R.; Bajorath, J. Docking and scoring in virtual screening for drug discovery: methods and applications. Nat. Rev. Drug Discovery 2004, 3, 935–949. DOI: 10.1038/nrd1549
  14. Varnek, A.; Baskin, I. Machine learning methods for property prediction in chemoinformatics: quo vadis? J. Chem. Inf. Model. 2012, 52, 1413–1437. DOI: 10.1021/ci200409x
  15. Mitchell, J. B. Machine learning methods in chemoinformatics. Wiley Interdisc. Rev. Comp. Mol. Sci. 2014, 4, 468–481. DOI: 10.1002/wcms.1183
  16. Riniker, S.; Landrum, G. A. Open-source platform to benchmark fingerprints for ligand-based virtual screening. J. Cheminf. 2013, 5, 26. DOI: 10.1186/1758-2946-5-26
  17. Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. DOI: 10.1021/ci100050t
  18. Alvarsson, J.; Eklund, M.; Engkvist, O.; Spjuth, O.; Carlsson, L.; Wikberg, J. E.; Noeske, T. Ligand-based target prediction with signature fingerprints. J. Chem. Inf. Model. 2014, 54, 2647–2653. DOI: 10.1021/ci500361u
  19. Baskin, I. I.; Palyulin, V. A.; Zefirov, N. S. A neural device for searching direct correlations between structures and properties of chemical compounds. J. Chem. Inf. Comp. Sci. 1997, 37, 715–721. DOI: 10.1021/ci940128y
  20. Merkwirth, C.; Lengauer, T. Automatic generation of complementary descriptors with molecular graph networks. J. Chem. Inf. Model. 2005, 45, 1159–1168. DOI: 10.1021/ci049613b
  21. Duvenaud, D. K.; Maclaurin, D.; Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. Adv. Neural Inf. Proc. Syst. 2015, 2224–2232.
  22. Altae-Tran, H.; Ramsundar, B.; Pappu, A. S.; Pande, V. Low Data Drug Discovery with One-Shot Learning. ACS Cent. Sci. 2017, 3, 283–293. DOI: 10.1021/acscentsci.6b00367
  23. Jastrzebski, S.; Lesniak, D.; Czarnecki, W. M. Learning to SMILE(S). In International Conference on Learning Representations; 2016.
  24. Zupan, J.; Gasteiger, J. Neural networks: A new method for solving chemical problems or just a passing phase? Anal. Chim. Acta 1991, 248, 1–30. DOI: 10.1016/S0003-2670(00)80865-X
  25. Gasteiger, J.; Zupan, J. Neural networks in chemistry. Angew. Chem., Int. Ed. Engl. 1993, 32, 503–527. DOI: 10.1002/anie.199305031
  26. Zupan, J.; Gasteiger, J. Neural networks in chemistry and drug design; John Wiley & Sons, Inc: 1999.
  27. Lusci, A.; Pollastri, G.; Baldi, P. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. J. Chem. Inf. Model. 2013, 53, 1563–1575. DOI: 10.1021/ci400187y
  28. Unterthiner, T.; Mayr, A.; Klambauer, G.; Steijaert, M.; Wegner, J. K.; Ceulemans, H.; Hochreiter, S. Deep learning as an opportunity in virtual screening. In Proceedings of the Deep Learning Workshop at NIPS; 2014; Vol. 27, pp 1–9.
  29. Unterthiner, T.; Mayr, A.; Klambauer, G.; Hochreiter, S. Toxicity prediction using deep learning. ArXiv 2015, 1503.01445.
  30. Schneider, P.; Müller, A. T.; Gabernet, G.; Button, A. L.; Posselt, G.; Wessler, S.; Hiss, J. A.; Schneider, G. Hybrid Network Model for “Deep Learning” of Chemical Data: Application to Antimicrobial Peptides. Mol. Inf. 2017, 36, 1600011. DOI: 10.1002/minf.201600011
  31. Gawehn, E.; Hiss, J. A.; Schneider, G. Deep learning in drug discovery. Mol. Inf. 2016, 35, 3–14. DOI: 10.1002/minf.201501008
  32. Ramsundar, B.; Kearnes, S.; Riley, P.; Webster, D.; Konerding, D.; Pande, V. Massively multitask networks for drug discovery. ArXiv 2015, 1502.02072.
  33. Behler, J. Constructing high-dimensional neural network potentials: A tutorial review. Int. J. Quantum Chem. 2015, 115, 1032–1050. DOI: 10.1002/qua.24890
  34. Behler, J.; Parrinello, M. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Rev. Lett. 2007, 98, 146401. DOI: 10.1103/PhysRevLett.98.146401
  35. Ma, J.; Sheridan, R. P.; Liaw, A.; Dahl, G. E.; Svetnik, V. Deep neural nets as a method for quantitative structure–activity relationships. J. Chem. Inf. Model. 2015, 55, 263–274. DOI: 10.1021/ci500747n
  36. Reutlinger, M.; Rodrigues, T.; Schneider, P.; Schneider, G. Multi-Objective Molecular De Novo Design by Adaptive Fragment Prioritization. Angew. Chem., Int. Ed. 2014, 53, 4244–4248. DOI: 10.1002/anie.201310864
  37. Miyao, T.; Arakawa, M.; Funatsu, K. Exhaustive Structure Generation for Inverse-QSPR/QSAR. Mol. Inf. 2010, 29, 111–125. DOI: 10.1002/minf.200900038
  38. Miyao, T.; Kaneko, H.; Funatsu, K. Inverse QSPR/QSAR Analysis for Chemical Structure Generation (from y to x). J. Chem. Inf. Model. 2016, 56, 286–299. DOI: 10.1021/acs.jcim.5b00628
  39. Takeda, S.; Kaneko, H.; Funatsu, K. Chemical-Space-Based de Novo Design Method To Generate Drug-Like Molecules. J. Chem. Inf. Model. 2016, 56, 1885–1893. DOI: 10.1021/acs.jcim.6b00038
  40. Mishima, K.; Kaneko, H.; Funatsu, K. Development of a new de novo design algorithm for exploring chemical space. Mol. Inf. 2014, 33, 779–789. DOI: 10.1002/minf.201400056
  41. White, D.; Wilson, R. C. Generative models for chemical structures. J. Chem. Inf. Model. 2010, 50, 1257–1274. DOI: 10.1021/ci9004089
  42. Patel, H.; Bodkin, M. J.; Chen, B.; Gillet, V. J. Knowledge-based approach to de novo design using reaction vectors. J. Chem. Inf. Model. 2009, 49, 1163–1184. DOI: 10.1021/ci800413m
  43. Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A. M.; Jozefowicz, R.; Bengio, S. Generating sentences from a continuous space. In SIGNLL Conference on Computational Natural Language Learning (CONLL); 2016.
  44. Gómez-Bombarelli, R.; Duvenaud, D.; Hernández-Lobato, J. M.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ArXiv 2016, 1610.02415.
  45. Voss, C. Modeling Molecules with Recurrent Neural Networks; 2015; http://csvoss.github.io/projects/2015/10/08/rnns-and-chemistry.html.
  46. Firth, N. de novo Design Without the Chemistry; 2016; https://medium.com/@nf508/de-novo-design-without-the-chemistry-d183e8a9f150.
  47. Jozefowicz, R.; Vinyals, O.; Schuster, M.; Shazeer, N.; Wu, Y. Exploring the limits of language modeling. ArXiv 2016, 1602.02410.
  48. Graves, A.; Eck, D.; Beringer, N.; Schmidhuber, J. Biologically plausible speech recognition with LSTM neural nets. In International Workshop on Biologically Inspired Approaches to Advanced Information Technology; 2004; pp 127–136.
  49. van den Oord, A.; Kalchbrenner, N.; Kavukcuoglu, K. Pixel Recurrent Neural Networks. In International Conference on Machine Learning; 2016.
  50. Srivastava, N.; Mansimov, E.; Salakhutdinov, R. Unsupervised learning of video representations using LSTMs. In Proceedings of the 32nd International Conference on Machine Learning; 2015; pp 843–852.
  51. Gers, F. A.; Schmidhuber, E. LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Transactions on Neural Networks 2001, 12, 1333–1340. DOI: 10.1109/72.963769
  52. Bhoopchand, A.; Rocktäschel, T.; Barr, E.; Riedel, S. Learning Python Code Suggestion with a Sparse Pointer Network. ArXiv 2016, 1611.08307.
  53. Eck, D.; Schmidhuber, J. Finding temporal structure in music: Blues improvisation with LSTM recurrent networks. In Proc. 12th IEEE Workshop Neural Networks for Signal Processing; 2002; pp 747–756.
  54. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Model. 1988, 28, 31–36. DOI: 10.1021/ci00057a005
  55. Goldberg, Y. A Primer on Neural Network Models for Natural Language Processing. J. Artif. Intell. Res. 2016, 57, 345–420.
  56. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Computation 1997, 9, 1735–1780. DOI: 10.1162/neco.1997.9.8.1735
  57. Johnson, M.; Schuster, M.; Le, Q. V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.; Wattenberg, M.; Corrado, G. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. ArXiv 2016, 1611.04558.
  58. Graves, A. Generating sequences with recurrent neural networks. ArXiv 2013, 1308.0850.
  59. Olah, C. Understanding LSTM Networks; http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
  60. Greff, K.; Srivastava, R. K.; Koutník, J.; Steunebrink, B. R.; Schmidhuber, J. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 2017, 28, 2222–2232. DOI: 10.1109/TNNLS.2016.2582924
  61. Chollet, F. Keras; https://github.com/fchollet/keras; retrieved 2016-10-24.
  62. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. In International Conference for Learning Representations; 2015.
  63. Cireşan, D. C.; Meier, U.; Schmidhuber, J. Transfer learning for Latin and Chinese characters with deep neural networks. In The 2012 International Joint Conference on Neural Networks (IJCNN); 2012; pp 1–6.
  64. Steinbeck, C.; Hoppe, C.; Kuhn, S.; Floris, M.; Guha, R.; Willighagen, E. L. Recent developments of the Chemistry Development Kit (CDK) - an open-source Java library for chemo- and bioinformatics. Curr. Pharm. Des. 2006, 12, 2111–2120. DOI: 10.2174/138161206777585274
  65. Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E. The Chemistry Development Kit (CDK): An open-source Java library for chemo- and bioinformatics. J. Chem. Inf. Comp. Sci. 2003, 43, 493–500. DOI: 10.1021/ci025584y
  66. Pedregosa, F. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  67. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. 22nd ACM SIGKDD Int. Conf. 2016, 785. DOI: 10.1145/2939672.2939785
  68. Weininger, D.; Weininger, A.; Weininger, J. L. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Model. 1989, 29, 97–101. DOI: 10.1021/ci00062a008
  69. https://en.wikipedia.org/wiki/Graph_canonization.
  70. Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
  71. Cumming, J. G.; Davis, A. M.; Muresan, S.; Haeberlein, M.; Chen, H. Chemical predictive modelling to improve compound quality. Nat. Rev. Drug Discovery 2013, 12, 948–962. DOI: 10.1038/nrd4128
  72. Chevillard, F.; Kolb, P. SCUBIDOO: A Large yet Screenable and Easily Searchable Database of Computationally Created Chemical Compounds Optimized toward High Likelihood of Synthetic Tractability. J. Chem. Inf. Model. 2015, 55, 1824–1835. DOI: 10.1021/acs.jcim.5b00203
  73. RDKit: Open-source cheminformatics; http://www.rdkit.org.
  74. van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
  75. Bemis, G. W.; Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 1996, 39, 2887–2893. DOI: 10.1021/jm9602928
  76. Williamson, A. E.; Todd, M. H. Open source drug discovery: highly potent antimalarial compounds derived from the Tres Cantos arylpyrroles. ACS Cent. Sci. 2016, 2, 687–701. DOI: 10.1021/acscentsci.6b00086
  77. Ley, S. V.; Fitzpatrick, D. E.; Ingham, R.; Myers, R. M. Organic synthesis: march of the machines. Angew. Chem., Int. Ed. 2015, 54, 3449–3464. DOI: 10.1002/anie.201410744
  78. Sutton, R. S.; Barto, A. G. Reinforcement learning: An introduction; MIT Press: Cambridge, 1998; Vol. 1.
  79. Ching, T.; Himmelstein, D. S. Opportunities And Obstacles For Deep Learning In Biology And Medicine. bioRxiv 2017, 142760.

Figures

Figure 1. Examples of molecules and their SMILES representations. To generate correct SMILES, the model has to learn long-term dependencies, for example, to close rings (indicated by numbers) and brackets.
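
To make the ring- and bracket-matching constraint concrete, here is a minimal sketch (ours, not from the original article) that uses RDKit (ref 73) to check whether a SMILES string parses; an unmatched ring-closure digit or bracket makes the string invalid.

    from rdkit import Chem

    valid = "c1ccccc1"     # benzene: ring-closure digit "1" is opened and closed
    broken = "c1ccccc"     # the ring opened by "1" is never closed
    branched = "CC(=O)O"   # acetic acid: parentheses must pair up as well

    for smi in (valid, broken, branched):
        mol = Chem.MolFromSmiles(smi)  # returns None for invalid SMILES
        print(smi, "->", "valid" if mol is not None else "invalid")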

Figure 2. (a) Recursively defined RNN. (b) The same RNN, unrolled. The parameters θ (the weight matrices of the neural network) are shared over all time steps.
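
In equations, the recursion depicted in Figure 2 can be written as follows (our notation: f is the hidden-state update and g the output map, not symbols introduced by the authors):

    h_t = f(h_{t-1}, x_t; \theta), \qquad y_t = g(h_t; \theta), \qquad t = 1, \ldots, T

with the same parameters \theta reused at every time step t, which is exactly what the unrolled view in panel (b) shows.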

Figure 3. Symbol generation and sampling process. We start with a random seed symbol s1, here c, which is converted into a one-hot vector x1 and fed into the model. The model then updates its internal state from h0 to h1 and outputs y1, the probability distribution over the next symbols. Here, sampling yields s2 = 1. Converting s2 to x2 and feeding it to the model leads to the updated hidden state h2 and output y2, from which we can sample again. This iterative symbol-by-symbol procedure can be continued as long as desired. In this example, we stop after observing an EOL (\n) symbol and obtain the SMILES for benzene. The hidden state hi allows the model to keep track of opened brackets and rings, ensuring that they are closed again later.
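
The following sketch shows this sampling loop in Python. Here `step` is a hypothetical stand-in for one forward pass of the trained network (one-hot input and hidden state in, next-symbol distribution and updated state out); the exact model interface is an assumption, not the paper's API.

    import numpy as np

    def sample_smiles(step, vocab, h0, seed="c", max_len=200, rng=np.random):
        """Generate one SMILES string symbol-by-symbol, as in Figure 3."""
        index = {s: i for i, s in enumerate(vocab)}
        h, sym, out = h0, seed, [seed]
        for _ in range(max_len):
            x = np.zeros(len(vocab))
            x[index[sym]] = 1.0          # one-hot encode the current symbol s_t
            p, h = step(x, h)            # y_t (distribution) and new hidden state
            sym = rng.choice(vocab, p=p) # sample s_{t+1} from y_t
            if sym == "\n":              # EOL symbol terminates the SMILES
                break
            out.append(sym)
        return "".join(out)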

Figure 4. A few randomly selected generated molecules. Ad = adamantyl.

Figure 5. t-SNE projection of 7 physicochemical descriptors of random molecules from ChEMBL (blue) and molecules generated with the neural network trained on ChEMBL (green) to two unitless dimensions. The distributions of both sets overlap significantly.
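
A sketch of this kind of analysis, using RDKit descriptors and scikit-learn's t-SNE (refs 66, 73, 74); the seven descriptors below are an illustrative choice on our part, not necessarily the exact set used for the figure.

    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import Descriptors
    from sklearn.manifold import TSNE

    def descriptor_matrix(smiles_list):
        """Compute seven simple physicochemical descriptors per molecule."""
        rows = []
        for smi in smiles_list:
            mol = Chem.MolFromSmiles(smi)
            if mol is None:
                continue  # skip unparsable SMILES
            rows.append([
                Descriptors.MolWt(mol),
                Descriptors.MolLogP(mol),
                Descriptors.TPSA(mol),
                Descriptors.NumHDonors(mol),
                Descriptors.NumHAcceptors(mol),
                Descriptors.NumRotatableBonds(mol),
                Descriptors.RingCount(mol),
            ])
        return np.array(rows)

    X = descriptor_matrix(["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"])
    coords = TSNE(n_components=2, perplexity=2).fit_transform(X)  # 2D projection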

Figure 6. Epochs of fine-tuning vs ratio of actives.

Figure 7. Nearest-neighbor Tanimoto similarity distribution of the generated molecules for 5-HT2A after n epochs of fine-tuning against the known actives. The generated molecules are distributed over the whole similarity range. Generated molecules with medium similarity can be interesting for scaffold hopping. (5)
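
Nearest-neighbor similarities of this kind can be computed with RDKit's Morgan fingerprints of radius 2 (the ECFP4 analogue); a minimal sketch, assuming all SMILES parse:

    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def nn_similarities(generated, actives):
        """Highest Tanimoto similarity of each generated molecule to any active."""
        fp = lambda s: AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(s), 2, nBits=2048)
        active_fps = [fp(s) for s in actives]
        return [max(DataStructs.TanimotoSimilarity(fp(s), a) for a in active_fps)
                for s in generated]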

Figure 8. t-SNE plot of the pIC50 > 9 test set (blue) and the de novo molecules predicted to be active (green). The language model populates chemical space around the test molecules.

Figure 9. Different training strategies on the Staphylococcus aureus data set with 1000 training and 6051 test examples. Fine-tuning the pretrained model performs better than training from scratch (lower test loss [cross entropy] is better).
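
A hedged Keras sketch (refs 61, 62) of the two training arms compared in Figure 9; the architecture, sizes, file name, and data arrays here are placeholders, not the paper's exact setup. In the fine-tuning arm the weights would be loaded from a pretrained checkpoint instead of being freshly initialized.

    import numpy as np
    from keras.models import Sequential
    from keras.layers import LSTM, Dense, TimeDistributed

    VOCAB, MAXLEN = 50, 80  # assumed vocabulary size and padded sequence length

    model = Sequential([
        LSTM(256, return_sequences=True, input_shape=(MAXLEN, VOCAB)),
        LSTM(256, return_sequences=True),
        TimeDistributed(Dense(VOCAB, activation="softmax")),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    # Fine-tuning arm: model.load_weights("pretrained.h5")  # hypothetical file

    X = np.zeros((1000, MAXLEN, VOCAB))  # placeholder: one-hot encoded SMILES
    Y = np.zeros((1000, MAXLEN, VOCAB))  # placeholder: inputs shifted by one symbol
    model.fit(X, Y, epochs=20, batch_size=128)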

Figure 10. Scheme of our de novo design cycle. Molecules are generated by the chemical language model and then scored with the target prediction model (TPM). The inactives are filtered out, and the RNN is retrained. Here, the TPM is a machine learning model, but it could also be a robot conducting synthesis and biological assays, or a docking program.
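
In pseudocode-like Python, the cycle reads as follows; `generate`, `tpm_score`, and `retrain` are hypothetical stand-ins for the language-model sampler, the target prediction model, and the fine-tuning step.

    def design_cycle(model, tpm_score, n_rounds=4, n_sample=10000, threshold=0.5):
        """Alternate generation, scoring, and retraining, as in Figure 10."""
        predicted_actives = []
        for _ in range(n_rounds):
            candidates = model.generate(n_sample)  # sample SMILES strings
            predicted_actives = [s for s in candidates
                                 if tpm_score(s) >= threshold]  # drop inactives
            model.retrain(predicted_actives)       # fine-tune on the keepers
        return predicted_actives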

Figure 11. Histogram of Levenshtein (string edit) distances from the SMILES of the reproduced molecules to their nearest neighbors in the training set (Staphylococcus aureus, model retrained on 50 actives). In many cases the model changes only a few symbols in the SMILES, resembling the typical modifications applied when exploring a compound series; however, the distribution of distances indicates that the RNN also performs more complex changes, introducing larger moieties or generating molecules that are structurally different but isofunctional to the training set.
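
For reference, the Levenshtein distance used for this histogram is the minimum number of single-symbol insertions, deletions, and substitutions needed to turn one SMILES string into another; a plain-Python sketch:

    def levenshtein(a, b):
        """Minimum edit distance between strings a and b (dynamic programming)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # delete ca
                               cur[j - 1] + 1,              # insert cb
                               prev[j - 1] + (ca != cb)))   # substitute
            prev = cur
        return prev[-1]

    assert levenshtein("c1ccccc1", "c1ccccc1C") == 1  # one inserted symbol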

Figure 12. Violin plot of the nearest-neighbor ECFP4-Tanimoto similarity distribution of the 50 training molecules against the rediscovered molecules in Table 3, entry 2. The distribution suggests that the model has learned to make typical small functional-group replacements, but can also reproduce molecules that are not very similar to the training data.

Supporting Information

    The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acscentsci.7b00512.

    • A set of molecules sampled from our model (PDF)

    • 400000 molecules as SMILES (ZIP)


