Molecular Design Method Using a Reversible Tree Representation of Chemical Compounds and Deep Reinforcement Learning

Automatic design of molecules with specific chemical and biochemical properties is an important process in material informatics and computational drug discovery. In this study, we designed a novel coarse-grained tree representation of molecules (Reversible Junction Tree; “RJT”) for the aforementioned purposes, which is reversely convertible to the original molecule without external information. By leveraging this representation, we further formulated the molecular design and optimization problem as a tree-structure construction using deep reinforcement learning (“RJT-RL”). In this method, all of the intermediate and final states of reinforcement learning are convertible to valid molecules, which could efficiently guide the optimization process in simple benchmark tasks. We further examined the multiobjective optimization and fine-tuning of the reinforcement learning models using RJT-RL, demonstrating the applicability of our method to more realistic tasks in drug discovery.


■ INTRODUCTION
Exploration of novel chemical compounds is an important process in drug and material discovery. The process entails iterative trials, including the proposal of candidate compounds and their experimental evaluation, which are usually expensive and time-consuming. Furthermore, while the chemical space of feasible and chemically synthesizable compounds is astronomically large (estimated to be on the order of 10 60 ), only a small portion of this space has been explored for drugs or functional materials. As it is impossible to explore compounds having the desired properties from this vast chemical space using ad-hoc experimental trials, computational exploration of chemical space has been proposed and is now gaining considerable attention owing to recent advancements in computational power. In computational exploration, all of the steps in the iterative trials, including the proposal of novel chemical compounds and their property evaluation, were performed in silico. Using extensive computational resources, this method extends the exploration range within the vast chemical space remarkably. However, owing to the discrete nature of chemical structure representation (i.e., graphs in discrete mathematics), the difficulty to efficiently explore the chemical space still persists.
To address this problem, new exploration methods using deep neural networks 1 (DNNs; for comprehensive reviews see refs 2, 3) have been proposed. One example is the latent-spacebased method, which involves a variational autoencoder (VAE 4 ). In this method, a DNN, known as a generative model, learns the mappings between the discrete space of the chemical compounds and the continuous latent manifold. After learning the generative model, exploration is performed over the continuous latent manifold. 5 Bayesian optimization or metaheuristics-based methods were used for the optimization process because the molecular properties used as scoring functions are usually not differentiable with respect to the latent manifold, even though the latent space itself is continuous. However, this method has several limitations. The first concerns the dimensions of the latent manifold. For example, on using the practical size of the training dataset (10 5 −10 6 compounds), learned latent manifolds with a reasonable reconstitution rate usually result in dimensions greater than 50, which is not easy for the available optimization algorithms. Second, no reliable indicator exists for assessing the rationality of learned mapping. Although we can evaluate the model using the reconstruction error against the training dataset, it is unclear whether a low reconstruction error is sufficient to guide the chemical structure optimization with the given scoring functions. The third problem is that the learning process of the generative model is detached from the optimization process of the chemical compounds with regard to the scoring function. The pretrained generative model remained unchanged during optimization and did not learn the scoring function. Recently, to address the third problem, the retraining of the generative model based on optimization results has been proposed. 6 This method requires an iterative process of optimization and training, which is computationally expensive and time consuming.
Another example is a reinforcement learning (RL)-based method. In this method, the molecular design problem is formulated as a Markov decision process wherein an agent learns the optimal policy based on the rewards offered by its surrounding environment. The RL-based method can partly alleviate the problem of latent-space-based methods because the policy approximated by the DNN model learns the scoring function during optimization. RL-based methods can be classified into two types based on the representation of chemical compounds. The first type uses text representation (e.g., SMILES), which allows us to leverage techniques used in natural language processing. Although this representation is widely used (including in latent-space-based methods 5 ), the generated text may contain grammatically invalid results that can affect optimization performance. This type of method includes the REINVENT. 7 The second type uses a graph representation of the chemical structure. Examples of this type include GCPN 8 and MolDQN. 9 In this representation, the generated graphs are theoretically decodable to valid molecules, thereby avoiding the grammatically invalid results in the text representation. However, the molecular properties can change drastically even with a single action of RL, such as in the last step of closing a conjugated aromatic ring. This makes it difficult to guide the agent directly to the optimal action at each RL step. Thus, the molecular representation using chemical groups (e.g., phenyl groups) as building blocks seem effective for RL-based molecular design. In contrast, molecular structure generation algorithms using predefined fragments as building blocks have been continuously studied over the last century, and many approaches have been proposed. 10−15 However, it is difficult to integrate such methods with learning-based optimization algorithms, including RL.
Recently, Jin et al. 16 proposed a novel coarse-grained molecule representation by leveraging junction tree (JT) decomposition to a molecule graph. In this method, a molecule is represented as a tree (i.e., a graph without circular or closed paths) with nodes corresponding to rings or bonds. Similar to the all-atom graph representation, this JT representation always corresponds to a valid molecule by composing valid chemical fragments as building blocks. This method has been successfully applied to several molecular-generative models. 16,17 However, the JT representation cannot be reversibly converted to the original molecule without auxiliary information. Therefore, additional neural networks (such as atom-based graph convolution networks) are required to supplement this information for decoding molecules from the generated JT representation, making this technique complicated and computationally intensive.
In this study, we propose a novel RL-based molecular generation and optimization framework that leverages JTbased representation. For this purpose, we tailored the JT representation to be reverse-convertible to the original molecule without auxiliary neural networks (reversible JT; RJT). This reversible molecular representation enabled us to formulate the molecular generation and optimization problem as tree-structure construction using RL (RJT-RL). The reversible nature of the RJT enables the evaluation of molecules decoded from the intermediate states in the RL episodes. We tested the method using simple molecular design tasks as benchmarks and subsequently applied it to more realistic tasks in drug discovery involving multiobjective scoring functions.

Reversible Tree Representation of Molecules.
In the original JT representation, the molecular graph is converted into a tree representation using the JT algorithm. 16 Briefly, the molecular graph is fragmented into bonds and rings using the smallest set of smallest rings (SSSR) algorithm implemented in RDKit. 18 Then, a node is assigned to each bond or ring, and the nodes are connected by an edge when they share atoms in to construct the tree representation ( Figure S1). If the bond nodes intersect at more than one node, the resulting creates a cyclic structure. This situation is avoided by inserting a single atom shared between these nodes as a "singleton node" (see the original study 16 for a detailed description of the algorithm). Hereafter, we term the nodes representing the bond and ring as "bond nodes" and "ring nodes," respectively. A possible combination of unique atom clusters in the dataset forms a vocabulary . Each node N i is assigned a word ID w i based on the atom cluster in the node. This coarse-grained representation allows us to build a molecule from chemically valid fragments, thereby avoiding atom-by-atom molecular creation through chemically invalid intermediates. However, the JT representation cannot be reversibly converted into the original molecule because the connection information between the nodes is lost in this coarse-grained representation. In previous studies, 16,17 this problem was avoided by introducing supplementary neural networks, which complicated the JT representation framework.
In this study, we introduce the reversible JT (RJT) representation ( ) of molecules by extending the JT decomposition of . To eliminate arbitrariness in the node connection (i.e., to determine atoms shared between two nodes connected by an edge), we record the ID of the atoms shared between two adjacent nodes (N i and N j ) as the "site information" σ i,j . The site information is defined as follows ( Figure 1). The original JT implementation involves sharing of one or two atoms between adjacent nodes connected by the JT edge. Hereafter, the JT edge sharing one atom is denoted as a "type-1 edge" (Figure 1A), while that sharing two atoms as a "type-2 edge" ( Figure 1B). Type-1 edges can be further classified into three cases: edges connecting (i) bonds and singleton nodes, (ii) two bond nodes, and (iii) bond and ring nodes. In these three cases, the indices of the shared atom in both nodes are recorded as the site information σ i,j . In contrast, type-2 edges always connect two ring nodes, and the indices of the two atoms in both the ring nodes are recorded. Because the atoms shared between the nodes are always adjacently located in the ring, the indices of the first atom and the direction ID (+1 or −1) of the second atom are recorded as site information σ i,j . A spiro connection between two rings is treated as a special case of the type-2 edge by assigning a direction ID of 0, although only one atom is shared between the nodes. The most appropriate location for storing this site information is edge E i,j , which represents the existence of shared atoms in N i and N j in . Thus, in this study, the site information is encoded as edge features.
The algorithm for converting the RJT representation to molecular graph (Algorithm 1) can be described as follows: this method is similar to the assembly algorithm of Jin et al., 16 except that enumerating all combinations of node-to-node attachments in is not necessary. In our method, the predicted is traversed in the depth-first order by attaching a subgraph, corresponding to the child node, to the constructed graph. Using the site information, the atom(s) shared between two nodes can be uniquely determined, and the nodes could be deterministically assembled to convert to . However, it is possible that the predicted site information is incompatible with the node types (e.g., more than four atoms are connected to one carbon atom). In such cases, the second (or third) probable site information can be utilized based on the output of the softmax logits of the neural network (described later).
Alternatively, an exception could be raised to indicate that a particular RJT contains invalid information. The latter implementation was used to simplify the code.

RJT-Based Neural Network.
The neural network architecture that encodes the RJT representation into hidden vectors with fixed sizes ({h i }) can be described as follows Each node N i is represented by a one-hot vector of word ID w i , while edge E i,j is represented by a one-hot vector of site information σ i,j . These one-hot representations of w i and σ i,j are then converted to node and edge features x i and y i,j using the learnable embedding matrices E node and E edge , respectively, as follows To encode the RJT, including both node and edge features, we extended the tree-based gated recurrent unit (GRU) 16 as follows In the above formulas, we extended the tree-based GRU; however, the tree-based long short-term memory (LSTM) 17 could also be extended to accommodate the edge features in a similar manner. The message vectors m i,j are updated in two phases (bottom-up and top-down), as in the original treebased GRU. Finally, the hidden vector h i of node N i is computed by aggregating the message vectors m i,j as follows where ReLU stands for the rectified linear unit, defined as ReLU(x) = max(0, x). RL Using the RJT Representation. In this study, we formulated a molecular design task for tree generation using RL. 19 In general RL settings, we consider that an agent receives state s t from the environment and selects an action a s ( ) t t according to its policy π = π(a t |s t ) at each time step t, where is a set of possible states and s ( ) t is a set of possible actions at state s t .
After taking an action, the agent receives the next state s t + 1 and a scalar reward r t , and proceeds to the next step t + 1. The episode ends when the agent receives a terminal state at time step T and then proceeds to the next episode with an initial time step t = 0. The return R t is the total accumulated reward from time step t to T with discount rate γ, as follows Action value Q π (s, a) is defined as the expected return on selecting action a in state s after following policy π, whereas the value of state V π (s) is defined as the expected return from state s after following policy π. The goal of the agent is to maximize the value of state s t , which is the expected return R t from each state s t .
Policy-based RL methods directly parameterize the policy π θ and optimize its parameter θ using the gradient estimator of [ ] R t . The policy gradient theorem states that the unbiased estimate of [ ] R t can be estimated as 20 To reduce the variance of this gradient estimate, the estimate of advantage (Ât; A(s, a) = Q(s, a) − V(s)) was used instead of R t . 21,22 Although the use of this advantage increases stability, it often leads to destructively large policy updates.
In the proximal policy optimization (PPO) algorithm, 23 the clipped surrogate objective L t CLIP is optimized instead of L t PG to alleviate this problem as follows where r t (θ) = π θ (a t |s t )/(π θd old (a t |s t )), and clip(•) clips r t (θ) outside the interval between [1 − ϵ, 1 + ϵ]. Then, the total loss function L t PPO , given as 2 is the squared error loss of the value function and S[π θ ] (s t ) is an entropy bonus term to ensure efficient exploration. For Ât, the truncated version of the generalized advantage estimation is used, as in the original PPO implementation, as follows , and λ is the generalized advantage estimator parameter. 22 Definition of Policy and Value Function Networks. The RJT representation of the molecule was used as the state s t in RL, where is a possible set of molecules that can be converted to RJT. The agent takes the action a s ( ) t t to modify the RJT of s t and constructs the RJT in the next step ( Figure 2A). Here, the action a t is defined using the following four components: (i) selection of the word ID w i of the new node N i , (ii) selection of the existing node N j for attaching to the new node, (iii) prediction of the site information σ i,j (for connecting the two subgraphs represented by N i and N j ), and (iv) determining whether the episode ends ( Figure 2B). The policy can then be expressed as a probability distribution function for each component of the action defined here. In this

Journal of Chemical Information and Modeling
pubs.acs.org/jcim Article study, these probability distribution functions were approximated by RJTNN, using the RJT of s t ( ) as an input ( Figure  2A). First, we calculated the hidden vectors {h i } for each node of the RJT using RJTNN, as follows Then, each component of the policy distribution was calculated as follows where n node is the number of nodes in , n voc is the size of vocabulary , and n site is the maximum number of possible combinations of site information. The MLP function stands for a multilayer perceptron with two layers, including the ReLU activation function. The action (a node , a word , a site , a stop ) taken by the agent is determined by sampling from this policy distribution as follows The value function is also approximated using the shared RJTNN with the policy network as follows Using these neural network functions, the PPO target loss function (L t PPO (θ)) was minimized using the Adam optimizer during RL training. PyTorch 24 and PFRL 25 were used for the deep learning and RL frameworks, respectively, for the experiments.
Expert Learning. To guide the exploration space of RL for drug-like molecules, the policy network was pretrained using the given dataset ("expert dataset") similar to previous studies. 7,8 For pretraining, the dataset molecules were converted to RJTs, and the root node was randomly selected from the sampled from this dataset. The path for traversing from the root node was generated in the breadth (or depth)-first order, from which one edge (connecting N i and N j ) was randomly selected. This edge represents the process of creating a new tree from a state comprising N i and its ancestral nodes. Thus, the calculated state s t and action a t from the edge are used to minimize the negative log likelihood of the policy function, i.e., In the training phase of RL, the learned policy function is not expected to deviate significantly from the pretrained policy and generate non-drug-like molecules. The use of augmented likelihood in the REINVENT algorithm 7 prevents the learned policy from deviating significantly from the pretrained model. In this study, we jointly trained the policy using an expert dataset with the target function of the PPO by minimizing the following target function where coefficient c exp is a hyperparameter that controls for the effect of the expert dataset. Experiment Settings. In all experiments, the valences of the atoms in the assembly process (conversion of RJT to a molecule) were checked. If excess valence was detected (e.g., a carbon atom with more than five bonds), a small negative score (c invalid ) was given as the reward for step t to discourage the actions that produced invalid molecules. Next, in RJT-RL, any molecular property could be used as the reward function r t , which is calculated at every RL step because RJTs corresponding to intermediate states are convertible to valid molecules. Thus, we examined the effectiveness of this "step reward" and compared it with the "final reward" case.
Here, R step and R fin are the step and final reward functions, respectively, defined according to the specific tasks, and t max is the last time step in this episode. We also defined R step (−1) � 0. Finally, the effect of the "duplication count penalty" on reward function was examined. One problem of RL is the balance between exploration and exploitation; poor exploration that maximizes short-term rewards results in trapping in the local minima. To avoid this problem, the entropy bonus term S[π θ ](s t ) was introduced into the loss function of the PPO. 23 To further facilitate the exploration of the chemical space, we introduced a duplication count penalty function, which modifies the final reward R fin calculated at the end of the episode as follows where n dup is the duplication count of the generated molecule in this episode and s is the score calculated for the generated molecule. This function is intended to reset R (the total return of the episode) to zero if the duplication count exceeds a specified threshold. A threshold value of N lim = 2 was utilized for all experiments in this study. This penalty term is inspired by previous studies, 26−28 wherein the count-base bonus term was shown to facilitate exploration in various RL tasks. All of the experiments described in the Results and Discussion section utilized the ethane molecule ("CC" in SMILES representation), which is the simplest atom cluster in vocabulary , unless otherwise stated. The policy network with a hidden vector size of 128 dimensions was pretrained using a dataset derived from the ZINC250k dataset. 5 In the derived dataset, macrocyclic compounds or rings containing more than eight atoms were removed. Additionally, bridge compounds containing more than eight substitution sites were removed. The resulting dataset contains 246,416 molecules. We tested whether all RJT representations of the molecules in this dataset could be reversibly converted to the original molecules. For comparison, REINVENT, 7 which is a SMILESbased molecular generator that uses RL, was used. The policy network was pretrained using the same dataset, and RL training was performed using the same scoring function and hyperparameters of k = 1 and σ = 20. In addition, CReM, 10 a rule-based structure generation method using fragments as building blocks, was employed. The fragment library for CReM was generated using the ZINC250k dataset, and the optimization method described in this study was used for structure generation. Similar computational resources were used (CPU: one core of Intel Xeon Gold 6254 CPU@ 3.10GHz and GPU: NVIDIA Tesla V100) for all experiments.

■ RESULTS AND DISCUSSION
To examine the effectiveness of RJT-RL, we performed several experiments using molecular design as a benchmark. All of the experiments and their abbreviations are summarized in Tables 1, 2, and 3. We calculated several measures defined in the MOSES benchmark suite 29 for molecules generated in the last 1/5 episodes.
Penalized Log P Optimization. First, we verified the effectiveness of our method using a single chemical property that is easily calculated using only chemical structure. In the first experimental setup, we evaluated the penalized Log P score, 30 which is the water−octanol partition coefficient (Log P) that also accounts for the ring size and synthetic accessibility (SA) score. 31 Although the maximization of the penalized Log P score itself has no actual application in drug discovery, we first attempted this task to assess whether our method can optimize the computationally easy problem beyond the example molecules in the training dataset. Furthermore, many previous studies that performed optimization in the chemical space have employed this target score, thereby allowing the comparison of our method with these The MOSES metrics 29 were calculated for the molecules generated in the last 1000 episodes (Novel: novelty, Valid: validity, Uniq: uniqueness, Frag: fragment similarity, Scaf: scaffold similarity, IntDiv: internal diversity (p = 1), and Filt: fraction of molecules passing the unwanted structure filter).  The metrics of the molecules generated by the last 10,000, 8000, and 2000 episodes for D1−D4, M1−M2, and F1, respectively, were calculated, as shown in Table 1.

Journal of Chemical Information and Modeling
pubs.acs.org/jcim Article studies. For RJT-RL, we examined three cases (P1, P2, and P3 in Table 1) that differed in the settings of the reward evaluation (step or final) and the duplication penalty. In P1 and P3, the penalized Log P score was used as the step reward function, R step , but the duplication penalty was turned off in P1. In P2, the penalized Log P score was used as R fin , and no reward was given for the intermediate steps.
The results are shown in Figures 3, and S2 and Table 1. In all cases (P1, P2, and P3), molecules with high rewards were continuously generated, even in the absence of the duplication penalty ( Figure 3A−C). Diverse molecules were generated without any penalty (uniqueness in Table 1); thus, the duplication penalty effect may be limited in this experimental setup. The comparison of the final and step reward cases (P2 vs P1/P3) showed that the reward functions tended to rise faster in P1/P3 than in P2 ( Figure 3A−C). A similar tendency was observed in all other experiments, starting from different random states ( Figure S3). These results suggest that the rewards of the RL intermediate states facilitated the optimization process. The REINVENT model (P4) also continuously generated high-scoring molecules, but the scores did not exceed those generated by our method ( Figure 3D). The top three penalized Log P scores obtained in the experiments are summarized in Table 4, along with the results from previous studies. The comparison shows that our method performed comparably to the weighted retraining method, 6 which also updates the generative model during the optimization process according to the target function.
Similarity-Guided Molecule Generation. Next, we verified the effectiveness of our method using a simple task to generate molecules similar to a given query structure. To measure the similarity between molecules i and j, we used the Jaccard index 32 J i,j of the RDKit implementation of the FCFP4 fingerprints. 33 Although the similarity-guided optimization problem could seem trivial to human intuition, this task is significant for computational optimization using fingerprints as the similarity measure. For example, when attempting to generate an ether group, the alcohol moiety must be generated before reaching the final state. The fingerprint bits of alcohol and ether oxygens are different in the definition of FCFP4; 33    Table 2). The similarity scores for each episode (or iteration) are plotted in blue, and the moving average and maximum values are plotted in orange and green, respectively. The chemical structures of compounds with the highest scores (top three) are shown. Similarity scores are noted below the chemical structures.

Journal of Chemical Information and Modeling
pubs.acs.org/jcim Article  Table 2). The similarity scores for each episode (or iteration) are plotted in blue, and the moving average and maximum values are plotted in orange and green, respectively. The chemical structures of compounds with the highest scores (top three) are shown. Similarity scores are noted below the chemical structures.

Journal of Chemical Information and Modeling
pubs.acs.org/jcim Article the optimization target score decreases in the step generating the intermediate alcohol moiety. Therefore, the landscape of the fingerprint-based similarity score is not smooth but contains many local minima wherein simple optimization algorithms can get trapped. This local-minima problem of fingerprint similarity could be avoided using the final reward (i.e., evaluating only the final molecule in the episodes). However, this makes it impossible to evaluate each RL state value in the episodes. To evaluate the effectiveness of different reward settings, we examined three different cases of RJT-RL (Table 2). Vortioxetine Rediscovery Experiment. In the first experiment (S1−S5; Table 2), vortioxetine was used as the query structure ( Figure 4A). The most similar molecule in the pretraining dataset is shown in Figure 4B, with a similarity of 0.58. The results of the vortioxetine rediscovery experiment are summarized in Figures 4C−G and S4. In case S1, optimization was performed with a step reward and without a duplication penalty (Table 2), thereby achieving a similarity score of 0.724 ( Figure 4C). However, after convergence to a local minimum, the exploration efficiency suddenly deteriorated, and similar or the same molecules were repeatedly generated (uniqueness in Table 2 and Figure S4A). In case S2, when only the final reward and duplication penalty were applied (Table 2), molecules with a similarity score of approximately 0.5 were continuously generated (Figure 4D), and a highest similarity of 0.771 was achieved ( Figure 4D). In case S3, with the application of step reward and duplication penalty, diverse molecules with high similarity scores were continuously generated ( Figure 4E). The average similarity score between the last 1000 episodes was 0.681, and a structure that was almost identical to the query structure was generated ( Figure  4E). The best and average performances for S3 outperformed those for S2, demonstrating that the step-by-step evaluation of the molecular properties successfully guided the generation of optimum molecules. Additionally, several runs were conducted with different random seeds for S1−S3 ( Figure S5), and it was observed that the duplication penalty strongly facilitated optimization in all cases. The step reward (S3) tends to produce better results than the final reward (S2), but it is influenced by random states to some extent. These results suggest that the advantage of step reward outweighs the disadvantage arising from the local minima in this problem.
In S4, REINVENT also generated structures that were similar, but not identical, to the query structure, with the highest similarity score of 0.732 ( Figure 4F). However, structures with high rewards were generated infrequently, and an average similarity score of approximately 0.32 was obtained at the end of training ( Figure 4F). This suggested that the agent failed to adequately explore the chemical space around the query structure. In S5, CReM also generated similar structures, with the highest similarity score of 0.776 ( Figure 4G). The average similarity score at the end of the optimization was approximately 0.5 ( Figure 4G). However, it requires many evaluations (approximately 50,000) of the score function as compared to the other experiments (approximately 10,000).

Journal of Chemical Information and Modeling pubs.acs.org/jcim Article
Celecoxib Rediscovery Experiment. Next, similarity-guided optimization of celecoxib, which is also included in the GuacaMol benchmark suite, 34 was performed to observe the effects of the query structures ( Figure 5A). For the experiment, RJT-RL, REINVENT, and CReM were used under conditions similar to those for vortioxetine (C1−C5; Table 3). The molecule most similar to celecoxib in the pretraining dataset is shown in Figure 5B, with a similarity of 0.606.
The results of the experiment are summarized in Figures  5C−G and S6. In RJT-RL, C3 (step reward) performed better than C1 and C2, as observed in the vortioxetine experiments (S1−S3). C3 generated a structure identical to celecoxib with a similarity of 1.0, but some effect of the random state on the generated structures was observed ( Figure S7). CReM (C5) also successfully generated a structure identical to celecoxib after optimization of the 8th generation with an evaluation of approximately 60,000 compounds ( Figure 5G). This result suggests that CReM is also a powerful method but requires a large number of score evaluations. In contrast, the best molecule generated by REINVENT (C4) has a similarity of 0.62 ( Figure 5F), which is slightly better than the most similar structure in the dataset ( Figure 5B). The original paper on  Table 3). The reward (left panel), interaction (middle panel), and docking (right panel) scores for each episode are plotted in blue, and the moving average and maximum values are plotted in orange and green, respectively.

Journal of Chemical Information and Modeling
pubs.acs.org/jcim Article REINVENT 7 stated that it generated an identical structure to celecoxib under similar conditions but with a larger size (1.5 million) of the pretraining dataset. REINVENT may require a larger pretraining dataset to achieve the best performance. Structure-Based Scaffold Hopping. We further examined the efficiency of RJT-RL in the realistic task of structurebased scaffold hopping, wherein known drug candidate compounds with structural information (i.e., important interactions between the target protein and compounds) are available. B-Raf kinase with its inhibitor complex structure 35 (PDB ID: 3TV6, Figure 6A) was used as an example. In this structure, the pyrimidine nitrogen of the inhibitor hydrogen bonds with the main-chain nitrogen atom of Gly 596 ( Figure  6B) in the kinase hinge region. In addition to this hinge interaction, this inhibitor interacts with the main-chain nitrogen atom of Cys 532 through its sulfonamide group ( Figure 6C). In practical structure-based drug discovery applications, it is crucial to search for molecules with different scaffolds that retain known important interactions with target proteins. Thus, we attempted to design novel compounds that preserved the interaction between the hinge region of the enzyme and the pyrimidine group in the compounds while changing the scaffold in the compound interacting with Cys 532.
In this task (D1−D4; Table 3), we used the docking score and the interaction score pertaining to the hydrogen-bond acceptor of the compound and the main-chain nitrogen atoms of Gly 596 and Cys 532. The ETKDG algorithm implemented in RDKit was used to generate three-dimensional (3D) conformations from two-dimensional (2D) structures. 36 In this process, we randomly selected one stereoisomer from all possible stereoisomers except for D3, wherein we enumerated all possible stereoisomers up to 16, attempted docking simulations, and then selected the isomer with the best docking score. The interaction score is defined as follows where HBA is the hydrogen-bond acceptor atom in the compound, ΔG is the AutoDock Vina score 37 of the docking pose of the compound, and c intr is a hyperparameter. Given the scale of the typical docking score (around −10) and interaction score (0−1) terms, c intr = 10.0 was used in this experiment. The docking score was clipped by 500 because the PPO algorithm was destabilized by an extremely high value of the docking score. The pyrimidine group was used as the initial state for the RL training. The initial binding pose of the pyrimidine group to the protein was restrained to the same as that in the crystal structure. Two types of reward functions were compared as in the other experiments (Table 3). In cases D1 and D3, the aforementioned reward function was used for the final step, while R step (t) = 0 was maintained for all other intermediate steps.
In case D2, the same function was used for R step (t). Furthermore, for comparison with the rule-based methods, CReM was examined using the same scoring function, including the docking and interaction scores (D4; Table 3).
The results for D1−D3 demonstrated that RJT-RL successfully generated optimized molecules with the desired interactions ( Figure 7A−C). Both the interaction and docking scores were gradually optimized during the RL training episodes, and high-score compounds were continuously sampled during the later training episode stages. However, the distributions of the Log P, SA, and QED 38 scores of the generated compounds tended to deviate from those of the training dataset ( Figure S8A−C). Compounds with high docking scores that satisfied the desired interactions are shown in Figure S9. Next, we compared the results of the final and step rewards (D1 and D2, respectively). The step reward was marginally better than the final reward. D2 obtained a higher reward than D1 for the same number of episodes and converged to the best score faster than D1 ( Figure 7A,B). This result also demonstrates the potential of the step-by-step evaluation of the molecular properties. However, the speed of D2 was much lower than that of D1 because the docking simulation was performed for each RL step in D2. Notably, D2 took approximately seven times more time than D1 to reach the same number of episodes (Table 5).
In D3, we enumerated all stereoisomers, selected the one with the best docking score, and used the final reward function. This result demonstrates that the enumeration of stereoisomer accelerated the speed of optimization compared to that of D1 ( Figure 7C). However, the speed of D3 was approximately four times lower than that of D1, as it performed up to 16 docking simulations per episode (Table 5). In D1 and D2, the randomly determined stereoisomers in the 3D-embedding process may result in different docking scores even though the agent performed the same actions. This random selection of stereoisomers may affect the optimization speed.
Next, we compared the results for RJT-RL (D1) and CReM (D4). Because the score evaluation, including the docking simulation, is a bottleneck in this task, we considered one score evaluation in CReM corresponding to one episode in RJT-RL with a final reward. The results show that the generated molecules are optimized against the target score function at a speed comparable to that of D1 ( Figure 7D). However, the distribution of each score term is quite different from that of RJT-RL (D1; Figure 7A). In D3, the docking score term is well optimized, while the interaction term is less optimized than in D1 ( Figure 7D). One possible reason for this unbalanced optimization is the tendency of CReM to generate large molecules ( Figures S8D and S9D). This tendency may increase the van der Waals interaction terms in the Vina scoring function, 37 thereby causing an unbalanced optimization. This could be avoided by optimizing c intr , the balancing hyperparameter of the score terms, or using the ligand efficiency 39 instead of the docking score. Finally, we examined several runs with different random seeds for D1, D2, and D3 ( Figure S10). In all cases, the optimization proceeds similarly, but the convergence speeds differ depending on the runs in D1 and D2. In particular, in another run of D2, the speed of convergence is observed to be similar to that of D1 ( Figure S10B). Next, we examined the differences in the molecules generated from different runs. We calculated the pairwise similarities of the top 100 molecules from the two runs and plotted their distributions ( Figure  S10D). The results demonstrate that diverse molecules are generated from different runs with the same hyperparameters. This could be an advantage of RJT-RL because it is important to obtain a variety of compounds with high target scores in actual drug design tasks. Consequently, given the level of improvement and deterioration by the step reward (D2) and  Table 3). In the upper panel, the reward, interaction, and docking scores for each episode are plotted in blue, and the moving average and maximum values are plotted in orange and green, respectively. In the lower panel, histograms of the Log P and SA score distributions of the training dataset (blue) and episodes 0−1000 (orange), 20,000−21,000 (green), and 39,000−40,000 (red) are plotted. stereoisomer enumeration (D3), we conclude that the final reward (D1) is best under the current experimental settings.
Multiobjective Optimization. While the above cases targeted simple rewards with one or two objectives, a typical drug design task requires the improvement of multiple objective functions. In particular, the compounds generated in cases D1−D3 tended to contain highly hydrophobic and/or possibly unstable structures ( Figures S8A−D and S9), as also evident from the low "filter" score of D1−D3 in Table 3. These unfeasible structures can be suppressed using multiobjective target functions, including penalties for unwanted structures. In M1, we introduced SA 31 and Log P scores to penalize such unfeasible structures and examined the performance of our method in multiobjective optimization tasks. We define two penalty terms as follows. where SA and Log P are the SA and Log P scores, respectively, of the target molecule. Thus, these terms are intended to restrict the SA score to less than 4 and the Log P score to the range of 0−5. The total reward function is defined as follows where c intr = 10, c SA = 1, and c LogP = 5 were used. We standardized and neutralized the compounds using the functions implemented in RDKit 18 before generating the 3D conformations. Additionally, we examined the CReM generator with similar settings, except for the target score function, which is defined as R fin above.
The result of M1 demonstrated that the multiobjective reward function increased similar to the cases D1−D3, and the distributions of SA and Log P values were restricted to the specified ranges ( Figures 8A and S11). In D1, molecules with high SA scores (SA > 4) were frequently generated ( Figure  S8A), whereas the generation of such molecules was suppressed in M1 ( Figure 8A). The Log P value of the generated molecules was also restricted to the specified range compared to those in the D1 experiments ( Figures 8A vs S8A). Consequently, the "filter score" in M1 was drastically improved compared to that of D1 (Table 3). The top three compounds in terms of reward and docking pose of the best compound are shown in Figure 9. Although the second and third compounds in Figure 9 are similar, the diversity of the generated structures of M1 can be observed from their statistics and plots (Table 3 and Figure S11). Collectively, these results demonstrate that RJT-RL can be successfully applied to multiobjective optimization. Moreover, another run of M1 was performed, and a similar tendency in the property distributions was observed ( Figure S12). A similarity plot between these runs ( Figure S12D) demonstrated that diverse molecules were generated from the different runs in this case.
Next, the results for RJT-RL (M1) and CReM (M2) were compared. CReM also successfully improved the overall score ( Figure 8B). However, an unbalanced optimization of the score terms was observed, similar to the case of D4 ( Figure  7D). Although the docking score was improved by optimization, the other terms (i.e., interaction, SA, and Log P terms) were less improved than those of M1 ( Figure 8). Even though the increasing tendency of the SA score was suppressed, a large number of compounds with SA > 4.0 were still generated ( Figure 8B). As for the Log P property, D4 generated more molecules with Log P in the specified range (0 < Log P < 5) than M2 (Figures S8D and 8B). To perform multiobjective optimization with CReM, we may have to perform an intensive hyperparameter search for the best weighting terms (c intr , c SA , and c LogP ) in the scoring function and/or to introduce another penalty term for large molecules.
Fine-Tuning to Different Reward Functions. One of the advantages of learning-based methods is their transferability to various tasks. To assess the transferability of the RJT-RL models (policy and value function), a fine-tuning experiment, F1, using different reward functions (Table 3) was performed. In this experiment, we attempted fine-tuning the model trained in D1 to the multiobjective reward function defined in M1.
The results of F1 are shown in Figures 10 and S13. From the episodes in the early training stage, the agent generated molecules with high docking and interaction scores, which were maintained during the training (Figure 10). The SA and Figure 9. Compounds were generated in a multiobjective reward experiment, M1 (see Table 3). The chemical structures of the top three compounds, with their rewards and docking scores, are shown in the top panels. The binding poses of the best compounds are shown in the bottom panel. The protein backbone and bound inhibitor are shown using the ribbon and transparent ball-and-stick models, respectively. The original compound bound to the crystal structure is shown as semitransparent. The possible hydrogen-bonding interactions are indicated by yellow dashed lines.
Log P scores also gradually improved as the training proceeded, and their distributions fell within the range specified by the penalty score (i.e., SA < 4 and 0 < Log P < 5) (Figure 10). The filter score of F1 was improved from that of D1 (Table 3), suggesting that the distribution of the generated compounds contained more feasible structures than before fine-tuning. In contrast, the diversity of molecules generated by F1 was slightly lower than that of M1, as evident from their statistics (uniqueness and IntDev in Table 3). The top three compounds in terms of reward and the docking pose of the best compound are shown in Figure 10. This result demonstrates that the agent adapted to the different reward functions containing the penalty terms after fewer training steps than those in M1.

■ CONCLUSIONS
In this study, we introduced RJT, a coarse-grained representation of molecules, which is directly convertible to the original chemical structure. We leveraged this representation for drug discovery, formulating a molecule design task as reinforcement learning to generate an RJT. The proposed method exhibited a better or comparable performance to other state-of-the-art methods in simple benchmark tasks. The results indicate the potential of the step-by-step evaluation of molecular properties, which is only possible in our framework using RJT. The step reward was shown to be advantageous when the computational cost of the score calculation was low because the score required multiple evaluations to generate a single compound. We further demonstrated that our method is applicable to real-world tasks such as structure-based scaffold hopping with a multiobjective reward function. Another advantage of RTJ-RL is the fine-tuning of the policy and value function models, which is not possible with rule-based fragment growth methods. This feature may be useful for tuning scoring functions in the compound design process.
The results of the experiments also suggest several problems with the proposed method. For example, considering the 3D generation involving docking simulation, efficient training of the RJT-RL model requires proper handling of the stereoisomers of the generated compounds. In this study, we Figure 10. Results of fine-tuning experiment F1 (see Table 3). (A) In the upper panel, the reward, interaction, and docking scores for each episode are plotted in blue, and the moving average and maximum values are plotted in orange and green, respectively. In the lower panel, histograms of the Log P and SA score distributions of episodes 0−1000 (orange), 5000−6000 (green), and 9000−10,000 (red) are plotted. (B) Chemical structures of the top three compounds with their rewards and docking scores are shown in the top panels. The binding poses of the best compounds are shown in the bottom panel as in Figure 9. enumerated the possible stereoisomers and searched for the best stereoisomers using the brute-force method. However, the search was performed on a single CPU, which reduced the overall performance ( Table 5). Parallelization of the 3D conformer generation and docking simulation could improve performance because this calculation is embarrassingly parallelizable. Another solution to this problem is to extend the site information in the RJT representation to properly handle the chirality flag of the atom so that the agent can generate the molecule, including the stereoisomers. Finally, in this study, we only considered the de novo design of compounds from a relatively small starting fragment size. To apply the proposed method for the optimization of lead compounds with larger and more complex structures, it is necessary to extend the action to include node deletion and mutation to enable the modification of the starting scaffolds.