Tuning Reinforcement Learning Parameters for Cluster Selection to Enhance Evolutionary Algorithms

The ability to find optimal molecular structures with desired properties is a popular challenge, with applications in areas such as drug discovery. Genetic algorithms are a common approach to global minima molecular searches due to their ability to search large regions of the energy landscape and decrease computational time via parallelization. In order to decrease the amount of unstable intermediate structures being produced and increase the overall efficiency of an evolutionary algorithm, clustering was introduced in multiple instances. However, there is little literature detailing the effects of differentiating the selection frequencies between clusters. In order to find a balance between exploration and exploitation in our genetic algorithm, we propose a system of clustering the starting population and choosing clusters for an evolutionary algorithm run via a dynamic probability that is dependent on the fitness of molecules generated by each cluster. We define four parameters, MFavOvrAll-A, MFavClus-B, NoNewFavClus-C, and Select-D, that correspond to a reward for producing the best structure overall, a reward for producing the best structure in its own cluster, a penalty for not producing the best structure, and a penalty based on the selection ratio of the cluster, respectively. A reward increases the probability of a cluster’s future selection, while a penalty decreases it. In order to optimize these four parameters, we used a Gaussian distribution to approximate the evolutionary algorithm performance of each cluster and performed a grid search for different parameter combinations. Results show parameter MFavOvrAll-A (rewarding clusters for producing the best structure overall) and parameter Select-D (appearance penalty) have a significantly larger effect than parameters MFavClus-B and NoNewFavClus-C. In order to produce the most successful models, a balance between MFavOvrAll-A and Select-D must be made that reflects the exploitation vs exploration trade-off often seen in reinforcement learning algorithms. Results show that our reinforcement-learning-based method for selecting clusters outperforms an unclustered evolutionary algorithm for quinoline-like structure searches.


INTRODUCTION
When it comes to large molecules, the search for global minima structures is a difficult task, for which many different algorithmic solutions have been proposed.Techniques such as minimahopping, 1 Monte Carlo methods, 2,3 random search, 4 particle swarm optimization, 5,6 simulated annealing 7 via basin-hopping, 2 neural network based methods, 8 and evolutionary or genetic algorithms 9−14 have been implemented to address this search.
An evolutionary algorithm (EA) is a metaheuristic optimization algorithm that is well suited to potential energy surface minimization problems due to their ability to search large regions of the energy landscape and increase the efficiency of the search via parallelization. 15The trademark of this algorithm is its creation of a new candidate molecule by the crossover of two parent molecules (randomly selected from a starting population pool) via cut and splice and random changes to the new candidate via mutations. 9,16For molecular crystals, symmetric crossover has been used to ensure a parent's space group is maintained in an offspring structure. 15Concepts of Darwinian evolution are used by ranking molecules in the population pool according to fitness and adjusting the probability so that molecules with higher fitness are more likely to be selected.
A shortcoming of a traditional evolutionary algorithm is its tendency to converge on local minima structures rather than global minima.In order to balance exploration and exploitation in an evolutionary algorithm-based search of the energy landscape, techniques such as improving upon the initial candidate pool, 17 integrating a Bayesian acquisition function into the fitness function, 18 dynamic management of evolutionary operators, 13 following the ab initio gradient of the potential energy surface, 14 and clustering the parent population 11,12,17 have been proposed.
According to Pereira and Marques, when searching for the global minima potential energy structure (PES), one should consider structural information for estimating dissimilarities between different molecules rather than their fitness values. 19ganov et al. state dissimilar parent structures from different local minima tend to produce offspring with higher energy. 17In order to prevent excess computation producing unviable offspring, clustering the parent population by structural similarity has been implemented in multiple works with beneficial results. 11,12,17,20In addition to suppressing unfavorable areas of the energy landscape, clustering allows for more exploration by having the evolutionary algorithm search different areas of the energy landscape rather than focusing on the most fit members of the population (exploitation) and possibly converging on local minima.
Previous work has shown that some clusters tend to produce better candidates than others on average. 12Therefore, it can be beneficial to spend more time searching for more favorable clusters over others.Sankaranarayanan et al. used clustering and genetic algorithms to create multiple tribes that compete against each other to increase population size.This increase in population size for successful tribes translates to a more expansive search in promising areas of the PES. 21In order to search the PES in an efficient but thorough manner, a few different attempts to achieve a delicate balance between exploration of lesser known PES regions and exploitation of regions known to have high fitness have been made. 18,20,21n this paper, we choose to address this exploration versus exploitation dilemma by clustering the initial parent population and developing a mechanism to rationally select these clusters for subsequent evolutionary algorithm runs based on their performance.By doing this, we equate the problem of finding global minima structures to the multiarmed bandit problem.The multiarmed bandit problem is a problem in which a decision maker iteratively selects from a fixed set of choices where the properties of each choice are only partially known.The multiarmed bandit problem stands as a classic challenge in reinforcement learning, illustrating the dilemma of balancing exploration and exploitation.Reinforcement learning is considered one of the three basic machine learning paradigms alongside supervised and unsupervised learning methods.In reinforcement learning, an agent learns which actions are most beneficial through a process of trial and error; after each action, the rewards/consequences of this action are calculated by a defined reward function, and a policy on how to effectively use these actions are developed.Because of the ability of reinforcement learning to address the multiarmed bandit problem, we decided to implement it as a solution to global minima searches via evolutionary algorithms with clustered populations.
We offer a novel approach to this problem of balancing exploration versus exploitation by using a reinforcement learning algorithm to select the clusters used in evolutionary algorithm runs.In our self-developed reinforcement learning algorithm, we define four learning parameters (MFavOvrAll-A, MFavClus-B, NoNewFavClus-C, and Select-D) that increase or decrease a cluster's likelihood of subsequent selections for an evolutionary algorithm run based on its previous performance.MFavOvrAll-A, MFavClus-B, NoNewFavClus-C, and Select-D correspond to a selection probability increase for producing the most favorable structure overall, increase for producing the most favorable structure from its own cluster, decrease for not producing a most favorable structure, and decrease for overselected clusters, respectively.Our results show that the best parameter combination requires a fine balance between parameter MFavOvrAll-A and parameter Select-D that mirrors the need to balance exploration and exploitation in searches.

Computational Details
The evolutionary algorithm used in this research is implemented using the Atomic Simulation Environment (ASE). 22,23The EA produces new molecules by picking two parent molecules from the population and combining them with the cut-and-splice operator. 9,16Mutations are not used here given that they tend to enhance exploration, and we want to test the exclusive effect of the learning agent.−26 The accuracy loss resulting from the utilization of DFTB is not a significant concern in our case, as our primary emphasis is on the EA's capacity to produce energetically favorable structures rather than precisely determining their energy levels.

Data Set
In this paper, we will focus on structures with the chemical composition C 9 H 7 N in order to test how effective our dynamic cluster selection based evolutionary algorithm is.Our version of DFTB+ identifies 4h-quinolizin-4-ylidene as the lowest energy structure and quinoline as the second lowest energy structure.We therefore tasked our algorithm with finding either molecule.We chose this system to test our method because previous research has found this problem to be sufficiently challenging to easily gauge the success of different method implementations. 11,12he library of structures that were used for the initial pool of candidates was generated following an in-house algorithm that builds molecular structures atom-by-atom.When a certain stoichiometry is assigned to this locally developed code, it randomly picks one of the atoms from the list of possible choices.Depending on the element, one of the possible hybridization geometries to that element is assigned to the atom.For example, if a carbon atom is chosen, it can be assigned either sp, sp 2 , or sp 3 hybridization.This defines where the next randomly selected atom is placed at a distance equal to the covalent radii of both atoms added together.As each new atom is placed and the geometry of its connections to future atoms assigned, the code also checks if the atom is too close to any other atoms, defined as less than 70% of a bond radius, as determined above.If two atoms are too close, then the process starts over.The code was executed using two strategies for hybridization assignment: equal likelihood of all hybridizations, and predetermined nonequal assignments.In the equal likelihood regime, the hybridization is assigned randomly with equal probability for all types.For the predetermined nonequal assignment regime, the total probability of selecting any hybridization is 100% but it is not equally distributed.Instead, varying probabilities for different assignments were run.For example, in one run, one hybridization had a 100% probability of being applied each time, while the others have 0%.In another run, one hybridization would have a 75% chance of being selected, while the remaining 25% is distributed among the other options.This process produced a large variety of chemically relevant structures and is summarized in a paper by Kellas and Groves. 11A similar strategy is implemented for molecular crystals in the Genarris code. 27Using the Kellas and Groves method, we generated 812 molecules in the ASE trajectory format. 22,23

Clustering
In order to cluster molecules, we employ a fingerprinting function to measure their dissimilarity.There are many fingerprinting functions to choose from including atompairs, 28 bag-of-bonds, 29 atom-centered symmetry functions, 30 smooth overlap of atomic positions, 31 overlap matrix fingerprint, 32 and the partial radial distribution function. 33Many of these examples seem to follow a similar idea of quantifying the local environment where some form of a two-body term is used, while the more involved ones also try to quantify three-and fourbody terms.We intend to extend this work to larger structures; however, to test the learning agent, we are going to use a wellestablished fingerprint function that focuses on quantifying twobody terms.Following the work of Oganov et.al., 34 we implement the following fingerprint function This fingerprint is a radial distribution function where i = 1, 2, ..., N A , j = 1, 2, ..., N B , R ij is the distance between atom A i and atom B j , and V uc is the unit cell volume.Each peak is smoothed before calculating the sum using a Gaussian kernel, δ, and accumulated into a histogram with bin size Δ; we use the values δ = 0.2 and Δ = 0.5 Å.Because we are not dealing with a molecular crystal, the unit cell volume is purposeless in our calculation and acts more as a scaling factor.We set R max = 8 Å and = V R uc 4 3 max 3 .We set R max to 8 Å because most molecules had interatomic distances below this value, allowing us to consider the contributions of all carbon−carbon and carbon− nitrogen distances (for feasible molecules) while preventing too much of the fingerprint from extending too far and diluting our distance metric.Since we are dealing with quinoline (C 9 H 7 N), we only utilize the carbon−carbon fingerprint, F C,C , and the carbon−nitrogen fingerprint, F C,N .Following similar research, 12 we assume any contributions from the hydrogen atoms are redundant and do not include them in our calculation.The fingerprints of the two molecules can be seen in Figure 1.
Once the fingerprint of each atom has been calculated, we calculated the distance between each molecule using the cosine distance function.When comparing two molecules, we produce two distances: the first between the carbon−carbon fingerprints and the second between the carbon−nitrogen fingerprints.We then calculate the total distance using the distance formula (see Figure 1).We allow the carbon−carbon and carbon−nitrogen distances to contribute equally to the total distance and though we recognize it is possible to prioritize one distance over another with the inclusion of a scaling factor or coefficient, we choose not to do so here.
Once the distances between all 812 generated molecules are calculated, we create a distance matrix and use agglomerative clustering via the scikit-learn aggolomerative clustering library to place these 812 molecules into clusters.Agglomerative clustering is a hierarchical clustering technique in which each molecule starts in its own cluster and clusters are merged according to the defined linkage setting.Scikit-learn has four possible linkage settings: observations of the two sets.The goal for clustering was to produce clusters of similar sizes with no cluster having fewer than 20 molecules.We found that only the "complete" linkage setting managed to prevent the creation of a single, large, and central cluster and significantly smaller outlier clusters.After choosing this "complete" linkage setting, a threshold value of 0.026 was picked through trial and error.This means that the distance between the two most dissimilar molecules is calculated and used to determine whether a cluster should combine.If that distance is above 0.026, the clusters will not combine because of our set threshold.The threshold value of 0.026 is likely specific to our quinoline-like structures and was found to produce a notable number of clusters of similar size.Figure 2 shows an example of two representative structures from three clusters.
Molecules that were too dissimilar to be put into clusters of size n ≥ 20 were all put into one cluster.Using this method, we clustered 812 molecules (generated by the variational hybridization method outlined earlier in the methodology) into 9 clusters, making sure the global minima structure was not already included.An extremely simplified version of this can be seen in Figure 3.The number of clusters being set to 9 was not an intentional choice but rather reflects a need to balance the formation of a large central cluster and having too many small groups.If the distance threshold is set too high, one large cluster forms, with many small clusters (n ≤ 5).If the distance threshold is set too low, all except one or two clusters have less than 20 molecules.Even with our carefully tuned distance threshold, we still needed to create a new "misfit" cluster (brown in Figure 3) comprised of clusters too small to use in our EA algorithms.

Tuning Learning Parameters
In order to determine how often a cluster should be chosen, each cluster is assigned a probability that changes, depending on its performance throughout the evolutionary algorithm run.In order to select a cluster, we use the random.choices()method from the Python random module where we define the weights that dictate how likely a cluster is to be chosen.For example, if molecules 1, 2, 3, and 4 had relative weights of 10, 20, 15, and 5, this would correspond to selection likelihoods of 20%, 40%, 30%, and 10%, respectively.Each cluster starts with its weight set at 100 and is then modified by our reinforcement learning system defined below.The choice to set these weights at 100 was somewhat arbitrary.We wanted the weights to be large enough that we could leave our learning parameters (defined below) as integer values.
We defined a set of rules with variable parameters that attempt to find the optimal balance between exploration and exploitation by changing the cluster selection weights throughout the evolutionary algorithm run.The incorporation of these variable  1.If a chosen cluster produces the lowest energy molecule of any cluster, the relative probability of that cluster being chosen in the future is increased by MFavOvrAll-A.
2. If a chosen cluster produces the lowest energy molecule in its own cluster, the relative probability of that cluster being chosen in the future is increased by MFavClus-B.
3. If a chosen cluster does not satisfy condition 2, its relative probability is decreased by NoNewFavClus-C, where a cutoff exists to prevent negative probabilities.
4. The chosen cluster will have its relative probability decreased by the ratio of how many times the cluster has been selected multiplied by Select-D.
Parameters MFavOvrAll-A, MFavClus-B, and NoNewFav-Clus-C all work to prioritize high performing clusters by increasing or decreasing a cluster's selection weight by a constant amount.The intention of Select-D is to prevent a successful cluster from dominating the selection cycle and to promote the selection of clusters that have shown subpar performance throughout the dynamic cluster selection based evolutionary algorithm process.The rationale is that every cluster is capable of producing the lowest energy molecular structure, so we want to give each cluster a chance while spending more time on clusters with favorable outputs.
Unlike MFavOvrAll-A, MFavClus-B, and NoNewFavClus-C, Select-D does not decrease a cluster selection weight by a constant amount.Instead we use a ratio which allows us to more effectively promote exploration since an overselected cluster is receiving a harsher penalty than its underselected counterpart would.Also, if a cluster had a poor performance initially but began producing molecules of interest later in the process, the rise to a higher selection probability would be easier.
Optimizing the variable parameters via evolutionary algorithm runs is extremely computationally expensive.In order to reduce computational time and resources, we modeled the performance of each cluster as the initial parent population of an evolutionary algorithm by taking the mean and standard deviation of molecular energies produced by 100 evolutionary algorithm runs.Modeling the values produced by each cluster as a Gaussian distribution, we can then quickly produce values from each cluster and calculate the cumulative success of different parameter permutations.In order to find the optimal values of our parameters, we did a grid search of all integer value combinations for MFavOvrAll-A ∈ As stated above, each parameter combination will be used for 100 evolutionary algorithm trials (modeled by using a Gaussian distribution) and have its performance evaluated by the cumulative success.Specifically for the task of finding the optimal parameter combination, we segment each EA trial into batches where 5 clusters are randomly chosen (with replacement so that a cluster can be chosen multiple times), and each chosen cluster simulates an EA run of 50 iterations.Between each batch (selection of 5 more clusters for 50 iteration EA runs), we take the lowest energy values from each cluster and update the selected clusters' probability for subsequent selection based on our learning parameters.If the an energy value produced is below our defined energy threshold value, E min , the EA trial is stopped and no further batches occur.Each batch consists of 250 Gaussian simulated EA iterations in total, and each trial can run for a maximum of 20 batches.Therefore, if the EA trial runs without producing an energy value below E min in its 20 batches, the maximum number of iterations (5000) will have occurred.This process is illustrated in Figure 5.The segmentation of each of the 100 trials per parameter combination into batches was meant to simulate parallelization of each actual EA run with the optimized learning parameters.Ultimately, we found it more suitable to parallelize the actual EA trials without batches (one 50 iteration run at a time).This means each EA trial takes longer, but we can run more of them at a given time.The effects of the batches being used in the testing of the learning parameters but not the actual EA runs will be discussed further in the Discussion section.
In order to determine the performance of each parameter combination, we ran 100 simulation trials for each combination and generated a cumulative success graph.The performance of each parameter combination was then compared by calculating the area under the cumulative success graph.An example of a cumulative success graph is shown in Figure 6.
Each parameter combination MFavOvrAll-A, MFavClus-B, NoNewFavClus-C, and Select-D had 100 trials performed in order to create a metric related to its average performance.Each of the 100 trials consists of up to 5000 iterations.In each trial, a success is defined as producing an energy value below a defined value E min (energy of quinoline as defined by DFTB+ in this paper).Each iteration produces a lowest energy value; if that lowest energy is below E min , then that trial is considered a success, the trial is stopped, and the iteration number is noted.
The cumulative success graph then measures how many successes have been recorded in X iterations or less.The cumulative success graph starts at zero and by the 5000 iterations could go up as far as 1.However, if any of the trials fail to produce an energy value below E min in 5000 iterations, then the trial never succeeds and the final cumulative success value (at 5000 iterations here) is less than 1.We purposefully set E min at a value where no trial is expected to reach 1 by 5000.This makes it easier to differentiate between the top parameter combinations and is a major motivation behind selecting quinoline-like structures.
Since the graph is cumulative, the earlier a parameter combination's trials begin to mark success, the higher the area underneath a cumulative success graph (cumulative success area) will be.While we think taking the average number of iterations needed for success could be sufficient as well, we believed the cumulative success area would more accurately characterize a parameter combination's success.
Of the 5,760,000 combinations attempted, the highest performing parameter combination is taken, and the dynamic probability-based cluster selection EA was implemented.

Evolutionary Algorithm Implementation
The EA used in this work is the implementation included in ASE 22,23 as described in a paper by Vilhelmsen and Hammer. 10 population can be generated randomly from a predefined set of structures or a combination of both.Each member of the population is assigned a fitness which is quantified in eq 1: where E min and E max are the minimum and maximum energy of any structure in the population.To ensure that a diverse set of candidates can be selected, the probability of selecting a given candidate also relies on a uniqueness factor that tends toward zero the more times the candidate is selected and based on the number of structurally equivalent structures outside of the population.The functional form of the uniqueness factor is quantified in eq 2: where n i is the number of times candidate i has participated in a pairing and m i is the number of candidates which are structurally similar to the ith candidate but not included in the population because the ith candidate is already present.The probability that a given candidate is selected is Vilhelmsen and Hammer 10 find that the inclusion of the uniqueness factor increases the effectiveness of the EA.
The cut-and-splice method used to create candidate structures once two parents are selected was devised by Deavon and Ho. 9,35In short, the process involves creating a dividing surface and orienting it randomly through the two parent structures' center of mass.This cuts the two parent structures roughly in half.The newly formed candidate structure is then composed of the atoms from one parent structure on one side of the dividing surface and the atoms from the second parent structure on the other side of the dividing surface.If the stoichiometry of the system is not maintained, then the following rules are applied: if there are too many atoms of a certain element, then the extra atoms that are furthest away from the dividing plane are removed.If there are too few atoms of a certain element, then atoms are randomly added from the parent structures that were not originally included in the new candidate structure.Finally a check is performed to ensure that atoms are not too close together.If any distance is smaller than the set threshold, the new candidate is rejected, and the process starts again with a new randomly oriented dividing plane.Once a candidate is created, the resulting structure is relaxed 100 SCF cycles or to a maximum force of 0.05 eV/Å with the DFTB+ calculator.This is to limit resources being tied up in trying to fully relax a structure that is not viable.Once one of these two criteria is met, the relaxed structure is compared to the weakest member of the current population.If it is more favorable, then it is added at the expense of the weakest member.If it is not more favorable than the weakest member, then it is rejected.In either case, this process of selecting two parents, creating a candidate from them, relaxing it, and comparing it to the members of the current population counts as a single iteration of the EA.For the evolutionary algorithm implementation, we will be testing the performance of an unclustered EA and three different versions of a dynamic probability based cluster selection EA where the method for the formation of the parent population varies.We will now outline the differences in methodology between each version and the measures taken to produce comparable results.
The unclustered evolutionary algorithm represents a baseline EA that we can use to analyze the performance of our dynamic probability-based cluster selection method.In the unclustered EA, the entire initial population of 812 molecules is drawn from parent molecules.The molecules produced during each step of the EA run are added to the parent population and can be drawn from in the future.For all EA methods, molecules with lower energies are deemed more fit and are more likely to be chosen as the parents.Each unclustered EA trial consists of 5000 iteration steps, where each step corresponds to the selection of two parents, production of a child molecule via cut-and-splice, and evaluation of the child molecule.500 EA trials were completed in total to estimate the cumulative success of this method.
For the dynamic probability-based cluster selection EA, all three versions follow the same implementation, with the exception of the usage of the parent population.In both cluster-selection evolutionary algorithms, 20 molecules from each cluster are randomly selected to form the initial parent population for each cluster.After this initialization, there are up to 100 cluster selections.In each cluster selection, a cluster is chosen to run for 50 iterations steps.This was done to equate the total possible number of iteration steps to the unclustered EA's 5000 steps.After each cluster selection, each clusters' probability of selection is updated to the learning parameters, and the parent population for the next selected cluster is formed.  .Parent population selection process is outlined for each methods.A: For method 3, all outputs of previous EA runs from that cluster are drawn from with the lowest energy molecule outside of the energy threshold of the previously selected molecule being selected for the next parent population.B: For method 1, only the output of the previous EA run from that cluster is used with molecules being selected for the parent population if no other molecule exists in its energy threshold.C: For method 2, only the output of the previous EA run from that cluster is used with the lowest energy molecule outside of the energy threshold of the previously selected molecule being selected for the next parent population.
The difference between the three dynamic probability-based cluster selection EA methods lies solely in the formation of the parent population.All three methods implement an energy threshold, ΔE, meant to eliminate redundancy in the parent population and prevent early convergence of local minima in the energy landscape.
In the first version, which we will refer to as method 1, the next parent population is drawn only from the output of the previous EA output from that cluster.In this version, the lowest energy molecules that have no other molecules in its energy threshold are selected.In other words, if the next lowest energy molecule has an energy E, and any other molecule (selected or not) has an energy in the range E − ΔE to E + ΔE, the molecule will not be selected.
In method 2, the next parent population is also drawn only from the EA output of the previous EA output from that cluster.However, in contrast to method 1, the next lowest energy molecule will only be skipped if the previously selected molecule has an energy between E − ΔE to E. Unlike method 1, molecules that are not selected have no impact for method 2.
In method 3, the next parent population is drawn from the output of every EA output from that cluster.Like method 2, the next lowest energy molecule that does not cross the energy threshold of the previously selected molecule is added to the parent population.
The difference between methods 1 and 2 can be seen in Figure 7. Method 3 follows the same selection procedure as method 2 except that it is looking at all molecules previously generated by that cluster's EA (not just the previous EA iteration).An example of the parent population selection process for methods 1, 2, and 3 can be seen in Figure 8.
Ideally, methods 1 and 2 should be drawing from 70 molecules (50 generated from the previous EA and 20 from the previous parent population).Since method 3 is drawing from all previous outputs belonging to a cluster, it draws from significantly more molecules as the EA run progresses.For ΔE values that are too large, it is possible that less than 20 molecules are selected the next parent population.This problem may occur only in method 3 during the beginning phases where each cluster does not have many outputs yet.
The remaining methodologies for these three versions will be identical.These parent population formation methods were employed to allow the EA to continue progressing between cluster selection phases and more accurately reflect the dynamics of the unclustered EA algorithm.

RESULTS
After clustering the initial 812 molecules 11 into 9 clusters, 100 evolutionary algorithm runs are done on each cluster with the lowest energy molecule from each run being recorded.The mean and standard deviation of these 100 molecules' energy are then calculated for each cluster and can be see in Table 1.The mean energy ranged between −558.11eV and −555.47 eV.On average, cluster 6 produced the lowest energy molecules during its EA runs.When simulating the EA runs, each cluster's output will be modeled as a Gaussian distribution with the mean and standard deviation calculated in this table.2 shows the five highest recorded parameter combinations with the highest cumulative success areas, while Table 3 shows the five parameter combinations with the lowest cumulative success areas.From these tables, we can see that the cumulative success areas ranged between 579.In Figure 9, all of the cumulative success areas are grouped into a histogram.The histogram is bimodal and has been split into four sections for further analysis.The first section corresponds to cumulative success areas below 1000 and represents the first peak.The second section corresponds to cumulative success areas between 1000 and 1200 and represents the area between the first and second peak.The third area corresponds to cumulative success areas between 1200 and 1500 and represents the majority of the second peak.The last area represents the best cumulative success areas at the right edge of the second peak with a cumulative success area greater than 1500.The second peak is higher suggesting more parameter values are clustered between cumulative success areas between 1300 and 1450.The histogram takes on a bimodal form that we suspect may be a result of most learning parameters either increasing or decreasing the baseline performance (rather than having no effect).However, further work would need to be done in order to validate this suspicion.
In Figure 10, a histogram of each parameter was then made showing the relationships between parameter values and area on the cumulative success area histogram.The value of parameter MFavOvrAll-A has a clear effect on the value of the cumulative success area.As the value of parameter MFavOvrAll-A increases, the proportion of cumulative success values from the lower value peak increases (see Figure 10A) .Parameter MFavClus-B and NoNewFavClus-C appear to have a much smaller effect on the cumulative success.However, for cumulative success areas between 1500 and 1600, parameter MFavClus-B tends toward smaller values while parameter NoNewFavClus-C tends toward values between toward the upper half of its distribution.(see Figure 10A,B).Upon looking at Figure 10D, parameter select-d appears to prefer values between 40 and 60.This is a consequence of its relationship with A that will be further explored.
Figure 11 shows a bidimensional histogram plot for MFavOvrAll-A vs Select-D where Figure 11A is for cumulative success areas between 1200 and 1500 and Figure 11B is for cumulative success areas greater than 1500.On the right graph in Figure 11, we can see a clear relationship between parameters MFavOvrAll-A and Select-D for cumulative success areas between 1200 and 1500.This area can approximately be described by combinations that satisfy the inequality D < 0.86A + 15.There is a band missing from Figure 11A that is explained by Figure 11B.Figure 11B represents the highest achieving parameter combinations and appears to require a strict balance between MFavOvrAll-A and Select-D that broadens as the parameter values increase.This relationship between the parameter MFavOvrAll-A and the parameter Select-D can be seen in Figure 10D.
The cumulative success distribution for the unclustered evolutionary algorithm vs the clustered evolutionary algorithm with the optimal imposed energy threshold, ΔE, can be seen in Figure 12.Method 1, Method 2, and Method 3 refer to the different methods we employed to curate the starting population at each iteration (see Section 2.5 for further detail).The unclustered EA consisted of EA runs of 5000 iterations, while the clustered EAs had 100 EA runs consisting of 50 iterations.The cluster-selection EA method used the optimal integer parameters MFavOvrAll-A = 79, MFavClus-B = 3, NoN-ewFavClus-C = 19, and Select-D = 68 for the dynamic cluster selection for all three methods.By the end of the 5000 iterations, the unclustered EA had a cumulative success rate of 0.436.With a ΔE = 0.3, methods 1 and 2 has final cumulative success rates of 0.752 and 0.636, respectively.With ΔE = 0.4, method 3 had a final cumulative success rate of 0.726.The unclustered EA begins with a larger cumulative success graph.However, by iteration 1221, all 3 cluster-selection EA methods outperform the unclustered EA method.
The cumulative success rates for different energy threshold values, ΔE, can be seen in Figure 13.Although method 1 has the highest overall cumulative success rate when ΔE = 0.3 eV, the cumulative success drops significantly at higher ΔE values.On the contrary, method 3, has a slightly lower best performance (compared to method 1) but has more consistent performance at different ΔE values.Method 2 has slightly more stable performance at larger ΔE values but under performs both methods 1 and 3 for most energies.

DISCUSSION
From the Results section, it is clear that we have developed a method for a dynamic cluster selection based evolutionary algorithm that has outperformed a regular unclustered evolutionary algorithm in our search for optimal structures consisting of 9 carbon atoms, 7 hydrogen atoms, and a nitrogen atom.
The steps of our dynamic cluster selection based evolutionary algorithm can be summarized in three steps.Our first step of clustering the molecules follows similar research 11 with the exception that we've replaced the bag of bonds method with Oganov et al.'s fingerprint function.While successful for fingerprinting quinoline-like structures in this paper, we realize that more testing could have been done to find values of δ and Δ that were more sensitive to differences in our structure.This is something that we suspect will especially need to be tuned for larger structures or unit cells.For example, adapting this fingerprint to larger chemical structures may be challenging as small structural differences may be overwhelmed by the size of the structure.While this fingerprint worked well for our molecules, we recognize that a different fingerprinting method may be more suitable for future work with larger structures.Using DFTB+ as an energy calculator allows us to very quickly estimate a structure's energy though the accuracy can be questionable at times.For example, we do not believe that 4hquinolizin-4-ylidene is actually the lowest energy version of C 9 H 7 N.However, the accuracy of this calculator is not a large concern for our method.The reinforcement learning based cluster selection method we employ prioritizes a cluster based on its ability to produce lower and lower energy structures.We could switch DFTB+ with any desired calculator and should expect similar results.
We consider the process of defining a set of learning parameters and tuning them with Gaussian approximations of cluster performance to be where we ventured farthest from previous research.Before deciding to create our own learning parameters, we looked into reinforcement learning solutions to the multiarmed bandit problem such as the epsilon-greedy strategy 36 and the upper confidence bound (UCB) strategy. 37e ultimately decided to create our own set of personalized learning parameters with the hope that it would outperform these generalized reinforcement learning strategies.While we leave the testing of these generalized reinforcement learning methods to future work, we recognize that the lack of notable effect from parameter MFavClus-B and NoNewFavClus-C has left our model looking extremely similar to the upper confidence bound strategy where one term rewards the system for consistently good performance and the second term penalizes systems that are overselected.The notable difference between our method and UCB is that our method rewards a cluster's ability to create the lowest energy molecule seen so far, whereas UCB rewards average performance.
Similar to other solutions to the multiarmed bandit problem, we found the need for balance between exploration and exploitation to appear organically in our results in two areas.First, we can see the need for this balance by the most optimal learning parameter combinations requiring a strict balance between parameter MFavOvrAll-A (bonus for finding optimal structures) and parameter Select-D (penalty for a cluster being overselected).Parameter MFavOvrAll-A represents exploitation in our search since a cluster that is producing favorable molecules would be repeatedly selected if this parameter were to be set to an extremely large value.On the other hand, molecules would be selected almost randomly if parameter Select-D were to be set to an extremely large value.This would represent a thorough exploration of each cluster regardless of its output.
The second area where we see the need to balance exploration vs exploitation is in the balance between clustering molecules for similarity and imposing energy thresholds to force dissimilarity between molecules.Oganov et al. stated that dissimilar molecules tend to produce unfavorable offspring.However, he does also state that if parent molecules are too similar the EA can stagnate. 17,20Our results show that the clustered EA can be boosted by implementing an energy difference threshold between members of the parent population.Therefore, within each cluster, we can conclude that there is an optimal way to sample a parent population that balances picking the lowest energy molecules (exploitation) and including diversity in the parent population (exploration).
We trained our learning parameters with the intention that each evolutionary algorithm could be parallelized to increase the efficiency.During the course of our research, we realized the importance of improving upon the parent population and had changed our method from 20 batch restarts with 5 evolutionary algorithms of 50 iterations running simultaneously to 100 evolutionary algorithms of 50 iterations running serially.We acknowledge that if our Gaussian distribution model used to optimize the learning parameters had been updated for 100 evolutionary algorithms consisting of 50 iterations, then the learning parameters may vary slightly from our current optimum.
The MFavOvrAll-A, MFavClus-B, NoNewFavClus-C, and Select-D parameters are solely based on the ability of a genetic algorithm to keep producing lower energy molecules than those already been generated.We believe that this would allow for a sufficient search for most chemical structures.However, there are considerations that we plan to implement in future work.The learning parameters were not normalized for the number of structures.In other words, MFavOvrAll-A, MFavClus-B, NoNewFavClus-C, and Select-D are specifically set for 9 clusters with initial weights of 100.Systems with a number of clusters notable different from 9 may experience suboptimal performance.Notably larger structures may receive rewards significantly less often and may need an adjustment, as well.In future implementations of this work, we plan to account for the number of cluster normalization.The goal for this implementation would be to allow little to no adjustment of these parameters.In any case, the A, B, C, and D parameters found in this paper could serve as a starting point to any global minimization search following our method.The omission of parameters B and C would also allow for significantly faster optimization.
In order to prevent the parent population of each cluster from becoming filled with duplicates, we imposed varying energy difference requirements on all members of the parent population.For all three methods, the optimal energy difference was found to be the nearest tenth eV.Imposing this threshold had a strong effect on the cumulative success of the method.While further research needs to be done in this area, we believe that the success of the energy threshold in a cluster EA search tells us the ideal parent population requires a balance between similarity and dissimilarity.
In order to impose dissimilarity in the parent population of each cluster, a threshold energy difference, ΔE, was implemented.The energy threshold difference was implemented over the fingerprint distance due to its simplicity in both implementation and computational complexity.However, we believe that imposing the fingerprint distance instead of the threshold energy difference would yield similar and possibly better results.One clear limitation of the threshold energy difference is that for larger structures molecules with dissimilar structures may have similar energies.Because all three methods appear to have an ideal energy threshold between 0.2 and 0.5, we suspect this would translate into their existing ideal fingerprint threshold range as well.
Two cases where the effect of clustering was directly related to the efficacy of EA are works by Jørgensen et al. and Hartke. 12,38n Jørgensen et al. the EA was clustered and selecting outliers, structures that were outside of existing clusters, were favored to be selected as a parent, while structures from larger clusters were penalized.This resulted in a success rate increase from 28% to 41% in a TiO 2 edge reconstruction over 2000 iterations.In this case, exploration appears to be prioritized over exploitation, however, it seems to produce favorable results.In the case of Hartke, Lennard-Jones clusters are examined using an EA.Clusters with 75−77 atoms typically take 20−40 times the computational resources as clusters with 74 or 78 atoms.However, by setting up a method that discriminates between different types of atomic clusters and ensuring diversity in the population, the resources required drop to 5−10 times the resources required to find the global minimum structure for clusters with 74 or 78 atoms.Based on these results, this paper is achieving similar speedups; however, our learning agent is balancing both exploration and exploitation.With further study, we predict that the methodology here might outperform these two methods that focus purely on promoting exploration.

CONCLUSION
In this paper, we developed a method that could further increase the efficiency of clustering in evolutionary algorithms by optimizing the procedure in which clusters are selected.In this demonstration, we clustered 812 generated quinoline-like molecules into 9 clusters, sampled the output of evolutionary algorithms from each cluster and modeled them as Gaussian distributions, and optimized a set of learning parameters by running simulations with these models.We called these parameters MFavOvrAll-A, MFavClus-B, NoNewFavClus-C, and Select-D where each parameter will increase or decrease the probability of a cluster being selected if its conditions is satisfied.
The highest performing parameter combinations require a fine balance between parameter MFavOvrAll-A (reward for producing the lowest energy molecule in the trial) and parameter Select-D (penalty for appearance) that is emblematic of the need to balance exploration vs exploitation in evolutionary algorithm-based searches and general reinforcement learningbased solutions.
In order to prevent stagnation within the evolutionary algorithm, we required a minimum ΔE between members of the parent population.We believe that the success of this ΔE implementation demonstrates the need to prevent premature convergence that can be accelerated by clustering parent populations.In doing so, our parents become similar enough to produce more favorable offspring but dissimilar enough to avoid converging on a local minima (funneling).
Taking the highest performing parameter combination, we were able to achieve a cumulative success rate of 0.752 after 5000 iterations.We found this to be a notable increase compared to the cumulative success rate of 0.436 yielded by the unclustered EA after 5000 iterations.
While future work needs to be done to generalize this method, the results of this paper show that reinforcement learning based methods of cluster selection are successful in increasing the efficiency of evolutionary algorithm searches for global minima molecular structures.We plan to further improve this method so that it can be used for future EA searches to determine the morphology of the physical defects in materials.

Figure 1 .
Figure 1.Left: The carbon−carbon fingerprint (top) and carbon−nitrogen fingerprint (bottom) of two molecules.Right: The distance between the two molecules is then generated using the cosine distance function on both fingerprints.

Figure 2 .
Figure 2. Extremely simplified version of our clustering model's performance on a small set of molecules.

Figure 3 .
Figure 3. Illustration of how the 812 molecules were distributed among each of the 9 clusters.The final cluster (brown) is consists of many smaller clusters that had to be manually combined.

Figure 4 .
Figure 4. Diagram outlining the how the selection weight of a cluster changes based on its performance with our defined parameters.Each cluster begins with a selection weight of 100.

Figure 5 .
Figure 5. Diagram outlining each of the 100 simulated EA trials used to test a learning parameter combination's performance.Batches are used here to simulate parallelizing each EA trial.

Figure 6 .
Figure 6.Example of a cumulative success rate generated during each parameter simulation.In this cumulative success rate plot, a = 79, b = 3, c = 19, and d = 68.

Figure 7 .
Figure 7. Example of a the parent population selection process for method 1 and method 2. Method 1 will not add a molecule to the parent population if any other molecule within an energy of ΔE is present.Method 2 will only reject a molecule if it is within ΔE of the previously selected molecule.

Figure 8
Figure 8. Parent population selection process is outlined for each methods.A: For method 3, all outputs of previous EA runs from that cluster are drawn from with the lowest energy molecule outside of the energy threshold of the previously selected molecule being selected for the next parent population.B: For method 1, only the output of the previous EA run from that cluster is used with molecules being selected for the parent population if no other molecule exists in its energy threshold.C: For method 2, only the output of the previous EA run from that cluster is used with the lowest energy molecule outside of the energy threshold of the previously selected molecule being selected for the next parent population.

Figure 9 .
Figure 9. Histogram showing the distribution of cumulative success areas for all parameter combinations.The histogram has been separated into four areas of interest that will be used in later analysis.These areas cover ranges of <1000, 1000−1200, 1200−1500, and >1400.

Figure 10 .
Figure 10.Stacked histogram of each parameter (MFavOvrAll-A upper-left, MFavClus-B upper-right, NoNewFavClus-C lower-left, Select-D lowerright).The cumulative success areas have been split into four areas of interest.

1
Clustering molecules using Oganov et al.'s 34 fingerprint function, the cosine distance function, and hierarchical agglomerative clustering. 2 Modeling each cluster's EA output as a Gaussian distribution to optimize learning parameters.3 Running the cluster-selection evolutionary algorithm with the optimal learning parameters.

Figure 11 .
Figure 11.Bidimensional histogram plot for parameter MFavOvrAll-A vs Select-D.(A) shows the frequency of which a combination of MFavOvrAll-A and Select-D achieved a cumulative success area between 1200 and 1500.(B) shows the frequency of which a combination of MFavOvrAll-A and Select-D achieved a cumulative success area greater than 1500.

Figure 12 .
Figure 12.Cumulative success rate of each method with its optimal energy threshold difference ΔE.An unclustered EA ran for the same number of iterations is included for comparison.

Figure 13 .
Figure 13.Scatter plot of energy threshold ΔE vs cumulative success rate for all three clustering methods.

Table 1 .
Average and Standard Deviation of Lowest Energy Molecules Produced by 100 EA Runs for Each Cluster

Table 2 .
Most Successful Parameter Combinations

Table 3 .
Least Successful Parameter Combinations