Data-Driven Strategies for Accelerated Materials Design

Conspectus The ongoing revolution of the natural sciences by the advent of machine learning and artificial intelligence sparked significant interest in the material science community in recent years. The intrinsically high dimensionality of the space of realizable materials makes traditional approaches ineffective for large-scale explorations. Modern data science and machine learning tools developed for increasingly complicated problems are an attractive alternative. An imminent climate catastrophe calls for a clean energy transformation by overhauling current technologies within only several years of possible action available. Tackling this crisis requires the development of new materials at an unprecedented pace and scale. For example, organic photovoltaics have the potential to replace existing silicon-based materials to a large extent and open up new fields of application. In recent years, organic light-emitting diodes have emerged as state-of-the-art technology for digital screens and portable devices and are enabling new applications with flexible displays. Reticular frameworks allow the atom-precise synthesis of nanomaterials and promise to revolutionize the field by the potential to realize multifunctional nanoparticles with applications from gas storage, gas separation, and electrochemical energy storage to nanomedicine. In the recent decade, significant advances in all these fields have been facilitated by the comprehensive application of simulation and machine learning for property prediction, property optimization, and chemical space exploration enabled by considerable advances in computing power and algorithmic efficiency. In this Account, we review the most recent contributions of our group in this thriving field of machine learning for material science. We start with a summary of the most important material classes our group has been involved in, focusing on small molecules as organic electronic materials and crystalline materials. 
Specifically, we highlight the data-driven approaches we employed to speed up discovery and derive material design strategies. Subsequently, our focus lies on the data-driven methodologies our group has developed and employed, elaborating on high-throughput virtual screening, inverse molecular design, Bayesian optimization, and supervised learning. We discuss the general ideas, their working principles, and their use cases with examples of successful implementations in data-driven material discovery and design efforts. Furthermore, we elaborate on potential pitfalls and remaining challenges of these methods. Finally, we provide a brief outlook for the field as we foresee increasing adaptation and implementation of large scale data-driven approaches in material discovery and design campaigns.

demonstrated for the efficient exploration of the near-infinite reticular chemical space and inverse design of reticular materials with desired functions like gas separation.
• Nigam, A.; Friederich, P.; Krenn, M.; Aspuru-Guzik, A. Augmenting Genetic Algorithms with Deep Neural Networks for Exploring the Chemical Space. In International Conference on Learning Representations; 2020. 3 The proposal of a genetic algorithm enhanced by a neural network for inverse molecular design that can avoid convergence and bias molecule generation based on existing data sets.

■ INTRODUCTION
The tremendous rise of data science and machine learning (ML) in the last decades led to the suggestion that it constitutes the fourth pillar of science. 5 While data has always been at the heart of research, current hardware enables its utilization at an unprecedented scale. 5 Accordingly, our group, the Matter Lab, has been using ML extensively to accelerate the discovery of new materials, especially for clean energy technologies to combat climate catastrophe and enable innovative technologies.
In this Account, we define discovery as observing a previously unknown natural phenomenon or object, 6,7 and design as rationally devising an object based on a particular plan. 8 Typically, discovery precedes and inspires materials design, as design requires at least minimal knowledge of the necessary features. Therefore, large scale discovery helps to speed up the establishment of material design principles, i.e., heuristics to realize particular designs, because they enable identifying patterns in known matter with desired properties. In turn, successful design catalyzes the realization of new materials by restricting the search space to only the most promising regions in subsequent campaigns.
Herein, we review our work on organic electronic materials, crystalline materials, and data-driven methodologies for materials discovery and design, particularly high-throughput virtual screening, supervised learning, inverse molecular design, and Bayesian optimization. Moreover, we formulate general strategies for data-driven materials design our lab has adopted over the years and show how to implement them using ML. Finally, investigating these approaches critically, we propose typical use cases and highlight unsolved challenges.

■ APPLICATIONS
Organic Electronic Materials
One of our research foci has been organic electronic materials. 9 Compared to silicon-based electronics, they offer several advantages, including low cost, low density, high mechanical flexibility and toughness, low energy consumption, and easy processability. Further, chemical derivatization is well-established, making the accessible candidate space vast.
Accordingly, solar cells have experienced a remarkable surge because of the vast energy available from the sun and increasing efforts against a climate catastrophe. Organic photovoltaics 10 (OPVs) could replace commercial silicon-based devices if their power conversion efficiencies (PCEs) surpassed 10% and their lifetimes exceeded several thousands of hours. Notably, state-of-the-art OPVs reach 18% PCE in laboratory devices. 11 The Harvard Clean Energy Project (CEP) was initiated to find photoactive organic materials with high efficiencies. 12 A library of 10^7 potential donors was generated from 26 building blocks, selected based on expert knowledge to maximize performance and synthesizability. 13 The candidates were evaluated using high-throughput virtual screening (HTVS, vide infra) via increasingly expensive property predictions. First, the library was assessed using linear descriptor models constructed from experimental data. Subsequently, electronic structure calculations were performed, and PCEs were estimated using the Scharber model with a fullerene as acceptor. 14 That way, about 1000 candidates with estimated PCEs of 11% and higher were identified.
Additionally, statistical analysis of the top-performing molecules revealed design principles for photoactive donors, identifying building blocks more likely to exhibit high performance. Notably, the screening efforts led to the experimental characterization of an organic crystal with one of the highest hole mobilities reported at the time. 15 Subsequently, extending the CEP to nonfullerene acceptors, over 51 000 candidates were generated based on 107 expertly chosen fragments. 16 More sophisticated property calibration with Gaussian processes and a modified Scharber model improved PCE predictions with a well-studied electron donor. Overall, 838 molecules with predicted PCEs of 8% or larger were found. Moreover, statistical analysis of the candidate structures was performed with respect to both Morgan fingerprints and the building blocks, establishing a general architecture for nonfullerene acceptors.
Similarly, organic light-emitting diodes 17 (OLEDs) have found wide adoption in small displays, are becoming prevalent in screens and lighting applications, and are entering the market in flexible displays. Thermally activated delayed fluorescence (TADF) emitters have become the main OLED class because of their high quantum efficiency, operational stability, and low cost. Their essential property is a small energy gap between the first excited singlet and triplet states so that energetically favored but nonemissive triplet excitons can be upconverted to emissive singlet excitons. Based on knowledge about the TADF mechanism, our group carried out HTVS of emitters covering 10^6 candidates (Figure 1). 1 Key methodology included efficient quantum chemistry, calibrated against experiment via supervised learning (vide infra). Linear regression and neural networks were used for property predictions across the entire space.
Exploration was performed iteratively using a neural network to predict the most promising candidates, which were then simulated, minimizing evaluations. Not only were known emitters rediscovered, but new structures were also uncovered. Additionally, the systematic exploration exposed both established property trade-offs and unknown property limits. Moreover, the best leads were evaluated by human experts concerning synthesizability and novelty. Consequently, the most promising molecules after both computer and human-based evaluations were synthesized and incorporated into devices leading to high external quantum efficiencies of over 20%. This study serves as a prototype for the entire data-driven discovery pipeline from defining the candidate space to device integration.
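The iterative exploration described above (predict with a cheap surrogate, simulate only the most promising candidates, retrain) can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the `simulate` objective, the one-dimensional candidate pool, and the batch size are hypothetical, and a 1-nearest-neighbour lookup stands in for the neural network.

```python
def simulate(x: float) -> float:
    """Stand-in for an expensive simulation (hypothetical toy objective)."""
    return -(x - 0.7) ** 2


def predict(x: float, labeled: dict) -> float:
    """1-nearest-neighbour surrogate: reuse the label of the closest evaluated point.

    This replaces the neural network of the text purely for brevity.
    """
    nearest = min(labeled, key=lambda known: abs(known - x))
    return labeled[nearest]


# Hypothetical candidate pool described by a single descriptor in [0, 1].
pool = [i / 99 for i in range(100)]

# Seed the surrogate with three simulated candidates.
labeled = {x: simulate(x) for x in (pool[0], pool[50], pool[99])}

for _ in range(8):
    unlabeled = [x for x in pool if x not in labeled]
    # Rank unlabeled candidates by predicted performance, simulate only the top five.
    batch = sorted(unlabeled, key=lambda x: predict(x, labeled), reverse=True)[:5]
    for x in batch:
        labeled[x] = simulate(x)

best_x = max(labeled, key=labeled.get)
n_evaluations = len(labeled)  # far fewer than len(pool)
```

Only a fraction of the pool is ever simulated, which is the point of the approach; a purely greedy selection like this one can miss distant optima, so practical campaigns usually add some exploration.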
Finally, renewable energy like wind and solar is intermittent, requiring large storage capacities to meet consumer demands. Redox-flow batteries (RFBs) resolve that by separating energy from power, enabling large grids to store immense amounts of energy scalable to varying demand loads. 18 Organic RFBs 19 (ORFBs) represent a sensible advancement, as redox-active organic electrolytes are tunable and cheaper than inorganic alternatives. 20 To identify ideal organic electrolytes, our group performed HTVS of quinones, which are well-known for their single-electron redox pairs. 21 The screening spanned 1710 single- and double-electron redox pairs to validate existing studies and find new redox couples. The results indicated that quinone-exclusive electrolytes were promising aqueous ORFBs and revealed that functionalizations near the carbonyl groups largely affected redox potential and those away largely affected solubility. Subsequently, several experimental studies verified these predictions. 22,23 However, decomposition was found to deteriorate battery capacity irreversibly. 24 Hence, our group performed combined computational and experimental studies on the decomposition of quinones in aqueous environments. 18 HTVS was performed for over 140 000 redox pairs, including decomposition product analysis. The results identified a trade-off between redox potential, with a maximum near 0.95 V, close to experimental results at 0.85 V, 25 and stability. These results provide roadmaps for future studies, which are ongoing in our group, as the trade-off suggests that electrolyte stability must be considered.
Accounts of Chemical Research pubs.acs.org/accounts Article

Crystalline Materials
Crystalline energy storage materials with high energy density at low cost are cornerstones of renewable energy applications. For instance, multivalent calcium ion batteries 26 (CIBs) improve upon monovalent lithium-ion counterparts through increased capacities and higher material abundance while maintaining comparable operating voltages. 27 However, the development of CIBs is hindered by the failure of traditional graphite and calcium metal anodes due to difficulties in intercalation and the lack of efficient electrolytes. Recently, a high voltage (4.45 V) CIB cell using tin as the anode was reported to achieve a remarkable cyclability (over 300 cycles). 28 Importantly, designing CIB anodes with improved performance requires a thorough exploration of the alloying space as calcium mixes with many elements. Hence, our group constructed a workflow to discover novel multivalent CIBs. 29 First, the tin electrochemical calciation reaction was investigated computationally and the reaction driving force as a function of calcium content was simulated. This exploration allowed the identification of threshold voltages governing the calciation limits. Consequently, a four-step screening strategy was adopted to look for high-performance CIB anodes. First, 357 metal−calcium binary and ternary compounds were identified from the Inorganic Crystal Structure Database (ICSD) 30 and further filtered to 115 candidates with existing decalciated metal/metalloid or binary intermetallic compounds. The calciation voltage profiles were calculated, and two threshold calciation voltages were defined, one stricter, based on the tin−calcium system, and the other more relaxed to account for potential differences in the driving force requirements. For each threshold, the maximum capacities, output voltages, volume expansions, and energy densities of the respective material were determined.
Finally, metal−calcium systems with higher energy density than tin−calcium were identified. Among them, metalloids (Si, As, Sb, Ge), post-transition metals (Al, Pb, Cu, Cd, CdCu2, Ga, Bi, In, Tl, Hg), and noble metals (Ag, Pt, Pd, Au) showed promise as alloying candidates for CIB anodes and call for further experimental validation.
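To make the role of a threshold voltage concrete, the sketch below computes step voltages along a calciation path and reports the highest calcium content reached before the voltage drops below a threshold. It assumes the common zero-temperature approximation in which the voltage of the step Ca_x1 M → Ca_x2 M is V = −[E(x2) − E(x1) − (x2 − x1)·E(Ca)] / (2·(x2 − x1)), with total energies in eV per formula unit and two electrons per calcium. All energies, the threshold values, and the function names are hypothetical and are not taken from the cited study.

```python
from typing import List, Tuple

Z_CA = 2  # electrons transferred per calcium ion


def step_voltage(x1: float, e1: float, x2: float, e2: float, e_ca: float) -> float:
    """Average voltage (V) for calciating Ca_x1 M to Ca_x2 M.

    Energies are total energies in eV per formula unit; entropy and
    volume terms are neglected (an assumption of this sketch).
    """
    return -(e2 - e1 - (x2 - x1) * e_ca) / (Z_CA * (x2 - x1))


def calciation_limit(profile: List[Tuple[float, float]], e_ca: float,
                     threshold: float) -> float:
    """Largest Ca content reachable while every step voltage stays >= threshold."""
    limit = profile[0][0]
    for (x1, e1), (x2, e2) in zip(profile, profile[1:]):
        if step_voltage(x1, e1, x2, e2, e_ca) < threshold:
            break
        limit = x2
    return limit


# Hypothetical (Ca content, energy) points along a calciation path.
profile = [(0.0, -4.0), (1.0, -7.0), (2.0, -9.4), (3.0, -11.5)]
e_ca = -2.0  # hypothetical energy of bulk calcium per atom

relaxed_limit = calciation_limit(profile, e_ca, threshold=0.1)
strict_limit = calciation_limit(profile, e_ca, threshold=0.3)
```

A stricter threshold stops calciation earlier and therefore yields a lower maximum capacity, which is why the choice of threshold matters for the ranking of candidates.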
Additionally, reticular frameworks 31 (RFs), which include metal−organic frameworks (MOFs), are crystalline porous materials with high internal surface area and high stability and can be used for gas storage, gas separation, and electrochemical energy storage. They are constructed via self-assembly of molecular building blocks and exhibit a near-infinite combinatorial space, complicating their systematic exploration. Recently, our group developed an invertible and efficient RF representation (Figure 2). 2,32 MOF fragments were extracted from the computation-ready, experimental (CoRE) MOF database 33 and augmented randomly with common functional groups. Furthermore, we added sets of multiconnected metal or organic nodes and sets of known MOF topologies, generating a data set with around 2 × 10^6 MOF structures. Moreover, property simulations were performed for a random subset of about 40 000 MOF structures. The supramolecular variational autoencoder (SmVAE) with a MOF structure encoder-decoder, property prediction model, and framework generation algorithm was constructed with these structures (Figure 2), which can locate high performing MOFs through property optimization in the latent space. We demonstrated its capabilities for automatic design by proposing top candidates for gas separation adsorbent materials. We believe that the MOFs discovered are highly competitive against the best-performing MOFs/zeolites ever reported. So far, their performance has been validated only computationally; experimental verification is under way. Furthermore, the as-built platform can be applied to various supramolecular systems (e.g., covalent-organic frameworks, coordination polymers, etc.) and applications (e.g., batteries, catalysis, drug delivery).

■ METHODOLOGY
High-Throughput Virtual Screening (HTVS)
Virtual screening 34 denotes a selection process of candidate materials. Chemicals, either generated on-the-fly or from databases, are subject to simulations that estimate application-specific properties. Candidates failing computational tests are rejected, with the proviso that predicted performance is likely translatable to experimental performance. Thus, HTVS is a technique that reduces large candidate spaces to a manageable set of promising materials (Figure 3). In our search for new TADF emitters (vide supra), 1 the candidate space was narrowed down by 5 orders of magnitude via HTVS. Importantly, HTVS on large chemical spaces is inverse molecular design (vide infra) because, rather than designing structures directly, the computational tests and the candidate space are designed, which leads to the final hits based on the predicted properties. 35 Moreover, it can provide the basis for both generative and supervised models (vide infra), as they all rely on validated data.
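The funnel logic described above can be sketched as a sequence of filters ordered by cost, so that expensive tests run only on candidates that survived the cheap ones. This is a minimal sketch under stated assumptions: the `Stage` type, the fragment letters, the scoring functions, and the thresholds are all hypothetical placeholders for real descriptor models and simulations.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    score: Callable[[str], float]  # hypothetical property estimator
    threshold: float               # keep candidates scoring >= threshold


def screen(candidates: Iterable[str], stages: List[Stage]) -> Tuple[List[str], List[int]]:
    """Apply stages in the given (cheap to expensive) order, rejecting failures early."""
    survivors = list(candidates)
    counts = []
    for stage in stages:
        survivors = [c for c in survivors if stage.score(c) >= stage.threshold]
        counts.append(len(survivors))
    return survivors, counts


# Combinatorial candidate space built from three hypothetical fragments.
fragments = ["A", "B", "C"]
library = ["".join(combo) for combo in product(fragments, repeat=3)]

stages = [
    # Cheap first pass: a toy descriptor model.
    Stage("descriptor model", lambda m: m.count("A"), 1.0),
    # Expensive second pass: a toy stand-in for a simulation.
    Stage("simulation", lambda m: m.count("A") + 0.5 * m.count("B"), 2.5),
]

hits, counts = screen(library, stages)
```

In this toy run the 27 candidates shrink to 19 after the cheap stage and to 4 after the expensive one; the design choice that matters is the ordering, since the expensive estimator is called on 19 candidates rather than 27.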
Accordingly, HTVS is a powerful accelerator because computer simulation can be significantly less expensive than the respective experiments. 34 The continuing growth in computational power, which will soon reach the exascale, has made virtual screening highly scalable as it is embarrassingly parallel. Although HTVS is about 20 years old, 36 it only recently started transforming materials science, owing to advances in the accuracy and efficiency of density functional theory (DFT). 37 Besides computational cost, the main appeal of DFT was the possibility to tailor functional parameters to reproduce experiments, which increased its predictive power significantly.
For instance, linear response time-dependent DFT (TD-DFT) is accurate and computationally inexpensive for excited state properties. More importantly, it is robust, can be used in a black-box manner, and is readily deployed in simulations of tens of thousands of molecules with minimal failure rates. 14 However, one pernicious failure mode of TD-DFT is the description of excited states with significant double-excitation character, which is, inter alia, important in describing molecules with inverted singlet−triplet gaps, 38,39 such as the INVEST emitters recently described by our group. 40 Nevertheless, as computing power is increasing, more sophisticated ab initio approaches can be used in HTVS, allowing one to tackle ever more complicated problems and new material classes.
Yet, the impact of HTVS has been hampered by the difficulty in scaling the experimental confirmation of candidates, 1 as simulations feasible for high-throughput are still largely qualitative for condensed-phase properties. 41 A loose screen that accounts for computational inaccuracies minimizes false negatives, but the high cost of experimental validation means that almost all candidates must be rejected. The accuracy of computational screening can be maximized by implementing …

Inverse Molecular Design
Model-based ML algorithms for inverse design use neural networks to learn patterns in molecular structures from existing data. After training, these models suggest new molecules covering important chemical features from the data set. Several methodologies exist. Herein we will discuss variational autoencoders (VAEs) and generative adversarial networks (GANs) because our group, to the best of our knowledge, was the first to apply these tools in chemistry. VAEs (Figure 4a) are capable of forming continuous (latent) spaces from discrete representations. They are trained to minimize the combined losses of latent space smoothness and input reconstruction, enabling gradient-based optimization in the latent space. For inverse design, the latent space of VAEs is coupled with a property estimation model using supervised learning (vide infra). 44 Consequently, the latent space is arranged based on the property values, allowing for a direct search of desired materials. GANs (Figure 4b) are generative models with joint training of two competing networks, a generator and a discriminator. The generator produces examples from a high dimensional (often Gaussian) space, attempting to fool the discriminator, which tries to distinguish generated samples from reference structures. For molecules, our group proposed a sequential GAN (ORGAN), where the model is trained using reinforcement learning. 45 Desired molecular properties are used as a reward for generating good structures.
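For orientation, the two training objectives mentioned above can be written in their standard textbook forms. These are the generic formulations, given here as an assumption about the general setup; the cited works may use variants, for example an additional property-prediction term in the VAE loss or a reinforcement-learning reward in ORGAN. The symbols (encoder q_φ, decoder p_θ, prior p(z), generator G, discriminator D) are introduced only for this illustration.

```latex
% VAE: reconstruction term plus a regularizer that keeps the latent space smooth
\mathcal{L}_{\mathrm{VAE}}(\theta,\phi;x)
  = -\,\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  + D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)

% GAN: generator G and discriminator D play a minimax game
\min_{G}\,\max_{D}\;
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p(z)}\!\left[\log\!\left(1 - D(G(z))\right)\right]
```

The first term of the VAE loss corresponds to input reconstruction and the Kullback-Leibler term to latent space smoothness; in the GAN objective, the discriminator maximizes its ability to tell reference from generated samples while the generator minimizes it.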
Notably, both VAEs and GANs are trained in a supervised way. Hence, they rely on existing data and mimic their distribution. Thus, they are limited in the exploration of the chemical space as compared to evolutionary techniques such as genetic algorithms (GAs, cf. Figure 4c). As their name implies, GAs are inspired by natural evolution. An initial population seeds the algorithm, and each member is evaluated. The top-performing members proceed to the next iteration, and the worst members are removed or replaced by better offspring. For inverse molecular design, the fitness function corresponds to the determination of desired molecular properties.
In contrast to deep learning-based models, GAs are not biased by user-defined data sets. Therefore, they are superior in unbiased explorations. 3 Recently, we have shown that GAs augmented with neural networks to estimate the similarity of a molecule with a given data set can explore specific structural classes without the large data requirements of GANs and VAEs. Additionally, neural network-based learning was used to detect and avoid local minima trapping the GA to amplify exploration by avoiding convergence. 3 Notably, this shows that ML-based inverse design techniques can be effectively combined with evolutionary algorithms.
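The basic GA loop described above (seed a population, evaluate, keep the top performers, refill with mutated offspring) can be sketched in a few lines. This is a plain GA on toy token strings, assuming a hypothetical four-letter alphabet and a toy fitness function; it does not include the neural-network augmentation of ref 3 and does not operate on real molecular representations.

```python
import random

ALPHABET = "CNOF"  # hypothetical token set, not a real molecular grammar
LENGTH = 12


def fitness(genome: str) -> int:
    """Toy objective: number of 'N' tokens (stands in for a predicted property)."""
    return genome.count("N")


def mutate(genome: str, rng: random.Random) -> str:
    """Replace one randomly chosen position with a random token."""
    pos = rng.randrange(len(genome))
    return genome[:pos] + rng.choice(ALPHABET) + genome[pos + 1:]


def evolve(generations: int = 30, size: int = 20, keep: int = 5, seed: int = 0):
    rng = random.Random(seed)
    population = ["".join(rng.choice(ALPHABET) for _ in range(LENGTH))
                  for _ in range(size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[:keep]                # top performers survive unchanged
        offspring = [mutate(rng.choice(parents), rng)
                     for _ in range(size - keep)]  # the rest are replaced
        population = parents + offspring
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population, history


population, history = evolve()
```

Because the best members are carried over unchanged, the best fitness in `history` never decreases; the price of this elitism is a tendency to converge early, which is the behaviour the neural-network augmentation in the text is meant to counteract.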
Importantly, in all these approaches, molecular representation plays a crucial role. Molecular graphs are used for computational efficiency, as they avoid conformations. Simplified Molecular Input Line Entry System (SMILES) 46 strings are commonly used as a flat encoding of molecular graphs. However, they have a complex structure, making a large fraction of molecules decoded from arbitrary SMILES invalid. This problem was solved recently by our group in a fundamental way by replacing SMILES with SELFIES (Self-Referencing Embedded Strings), 47 which is available on GitHub. 48 SELFIES is a 100% valid molecular string representation suitable as input for any inverse-design algorithm that outperformed alternative approaches in many benchmarks, such as validity and diversity of generated molecules, molecular density in the latent space of VAEs, or molecular optimization tasks with GAs. 3

Bayesian Optimization
Several tasks across chemistry can be framed as optimization problems, where controllable parameters optimizing a desired objective are sought. For materials, such optimizations are challenging, as they are typically high-dimensional, nonconvex, and subject to noise and the objectives are expensive to evaluate. Suitable optimization strategies ought to be sample-efficient, global, and noise-tolerant. That is, they need to identify optimal parameter choices with as few measurements as possible, be able to escape local minima, and mitigate the detrimental effect of noise. A plethora of experiment planning strategies for optimization are currently available, 49 from traditional design of experiment to evolutionary and heuristic approaches. Among these, Bayesian optimization 50 (BO) has emerged as the strategy that best meets these requirements. BO is an experiment planning algorithm that, in contrast to most other approaches, uses an ML model to learn from previous observations before suggesting the next iteration (Figure 5a). 50 In its most widely adopted form, BO employs techniques such as Gaussian processes to build a surrogate model that captures the features of the underlying objective function. Based on this surrogate, an acquisition function is defined, which determines the strategy used to propose new experiments (Figure 5b). Just like BO formulations using different ML models exist, various acquisition functions have been developed. Due to the use of an ML model, BO is sample-efficient. It is also noise-tolerant, as these models explicitly account for it. Finally, BO is a global approach that balances the exploitation of the best local optima identified with the exploration of unprobed areas of parameter space.
Figure 3. High-throughput virtual screening starts from a large space of candidates (e.g., generated combinatorically, as illustrated). Using virtual screening, most candidates are eliminated, such that fewer (more expensive and time-consuming) experimental tests can be performed.
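The surrogate-plus-acquisition loop of BO can be sketched with a small Gaussian process written in plain Python. This is a minimal sketch of the generic Gaussian-process form of BO, not of Phoenics or Gryffin: the objective, the kernel length scale, the noise level, the candidate grid, and the upper-confidence-bound acquisition with `kappa = 2` are all assumptions chosen for illustration.

```python
import math


def objective(x: float) -> float:
    """Hypothetical expensive experiment: global maximum near 0.7, local one near 0.2."""
    return math.exp(-(x - 0.7) ** 2 / 0.02) + 0.5 * math.exp(-(x - 0.2) ** 2 / 0.01)


def rbf(a: float, b: float, length: float = 0.1) -> float:
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def solve_upper(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Zero-mean GP posterior mean and standard deviation at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = solve_upper(L, solve_lower(L, ys))
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(rbf(x_new, x_new) - sum(t * t for t in v), 0.0)
    return mean, math.sqrt(var)


grid = [i / 50 for i in range(51)]
xs = [grid[0], grid[25], grid[50]]      # initial experiments
ys = [objective(x) for x in xs]
kappa = 2.0                             # exploration weight (assumed value)

for _ in range(15):
    candidates = [x for x in grid if x not in xs]
    # Acquisition: upper confidence bound = predicted mean + kappa * uncertainty.
    scores = []
    for x in candidates:
        mean, std = gp_posterior(xs, ys, x)
        scores.append(mean + kappa * std)
    x_next = candidates[scores.index(max(scores))]
    xs.append(x_next)
    ys.append(objective(x_next))

best_y = max(ys)
best_x = xs[ys.index(best_y)]
```

The acquisition function is where exploitation and exploration are traded off: a larger `kappa` favours uncertain regions, a smaller one favours the current best estimate. The cubic cost of the Cholesky factorization in the number of observations is one reason standard BO becomes expensive for high-throughput settings.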
Typical BO approaches are inherently sequential and require heavy computations for each iteration. Therefore, BO can be unduly expensive when used in conjunction with high-throughput evaluations. Thus, our group has developed Phoenics (Figure 5c), a linear-scaling BO approach that supports parallel experiments. 4 Phoenics employs Bayesian neural networks (BNNs) to build a kernel density estimate of the objective function, and its acquisition function allows for selection of batches of evaluations to be run in parallel. Importantly, Phoenics is suitable for the optimization of continuous parameters, such as temperature and concentration. To also optimize categorical parameters, such as the choice of solvent, we developed Gryffin (Figure 5d), which uses categorical kernel densities that can be relaxed to continuous ones. 51 In addition, Gryffin allows for expert knowledge, in the form of descriptors for each categorical choice, to be provided to improve the optimization efficiency. Often, multiple competing objectives are present in materials science. Chimera (Figure 5e) is a general-purpose approach to multiobjective optimization. 52 It allows defining a hierarchy of objective preferences, which are combined into a single function to be optimized with any algorithm of choice.
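The idea of ranking candidates by a hierarchy of objectives can be illustrated with a simple lexicographic rule: a lower-priority objective only matters once every higher-priority objective meets its tolerance. This is a simplified stand-in written for illustration and is not the Chimera algorithm itself, which combines the objectives into a single scalar function; the objective names, values, and tolerances below are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def hierarchy_key(values: Sequence[float], tolerances: Sequence[float]) -> Tuple[int, float]:
    """Sort key for minimization objectives listed from highest to lowest priority.

    A candidate is judged on the first objective whose tolerance it misses;
    if it satisfies all tolerances, it is judged on the last objective.
    Larger keys are better.
    """
    for level, (value, tolerance) in enumerate(zip(values, tolerances)):
        if value > tolerance:
            return (level, -value)
    return (len(values), -values[-1])


# Hypothetical candidates with (yield loss, cost); both are to be minimized.
candidates: Dict[str, Tuple[float, float]] = {
    "A": (0.05, 10.0),
    "B": (0.20, 2.0),
    "C": (0.08, 5.0),
}
tolerances = (0.10, 6.0)  # yield loss has priority over cost

ranking = sorted(candidates,
                 key=lambda name: hierarchy_key(candidates[name], tolerances),
                 reverse=True)
best = ranking[0]
```

Candidate B is cheapest but misses the tolerance on the primary objective, so it ranks last; C satisfies both tolerances and ranks first. A discrete key like this is enough for ranking a finite set, whereas a single scalar function is needed when the result is handed to an arbitrary optimizer.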
Importantly, all the aforementioned algorithms can be combined with automated laboratories to enable autonomous experimentation. 42 These self-driving platforms are able to execute closed-loop workflows for the self-optimization of materials and processes. However, this requires robust software connections between automated hardware and experiment … 53,54 Accordingly, in our laboratory, we have deployed ChemOS, together with Phoenics, Gryffin, and Chimera, for the autonomous optimization of manufacturing processes of thin-film materials, 55 multicomponent polymer OPV blends, 56 and reaction conditions of stereoselective Suzuki coupling. 57

Supervised Learning
The costs associated with property measurement, from both experiments and simulations, are a major obstacle to the widespread expansion of HTVS, optimization, and inverse design. All of these techniques require some form of data acquisition, i.e., simulations, measurements, or data mining. However, adapting experimental design to suit the needs of automated protocols is challenging, despite self-driving approaches likely being overall cost-effective. The promise of accurate and practically free inference of new results from existing data via supervised learning is a major driver of the ongoing ML revolution in the physical sciences. 58 Supervised learning requires a data set of features and labels. 59 For molecular property prediction, this data set contains molecules in a specific representation (features) and their corresponding properties (labels). First, the data set is split into three parts: training, validation, and holdout sets. The model is trained stepwise on the training set, usually by gradient descent or related algorithms. In general, hyperparameters, i.e., choice of features, training set, and model architecture, influence predictive performance. These hyperparameters are optimized by maximizing prediction accuracy on the validation set. Eventually, model performance is evaluated via prediction accuracy for the holdout set, and the final model can be used to predict properties for unlabeled molecules. The entire workflow is illustrated in Figure 6. Our group developed several model architectures for supervised learning of molecular properties, most notably graph convolutional neural networks. 60,61 Importantly, supervised learning has been used successfully for materials discovery. For example, our group used the CEP data set for property prediction. 62 After training on more than 200 000 molecules, a neural network predicted the result of DFT calculations consistently at a fraction of the computational expense.
Additionally, our group applied this approach to reduce the number of simulations in HTVS significantly, with training on a set of similar size. 1 Moreover, our group also used Gaussian process regression to calibrate for systematic errors in DFT. 16 Crucially, in these studies, ML algorithms, representations, acquisition of training data, and validation procedures for models were tightly integrated with an understanding of the problem space, as opposed to sole reliance on existing data from various sources. We believe these considerations are key when it comes to the practical application of ML in chemistry.
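The train/validation/holdout workflow described above can be sketched with a deliberately simple model. The example assumes a hypothetical one-dimensional descriptor `x` with a noisy quadratic label and uses k-nearest-neighbour regression in place of a neural network; the only hyperparameter tuned is `k`. None of the data or settings come from the cited studies.

```python
import random


def knn_predict(train, x, k):
    """Average the labels of the k training points closest to x."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(label for _, label in neighbours) / k


def mean_abs_error(train, data, k):
    return sum(abs(knn_predict(train, x, k) - y) for x, y in data) / len(data)


# Synthetic labeled data set: feature x, label x^2 plus noise (hypothetical).
rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, x ** 2 + rng.gauss(0.0, 0.05)))

# Split once into training, validation, and holdout sets.
rng.shuffle(data)
train, valid, holdout = data[:200], data[200:250], data[250:]

# Hyperparameter selection uses the validation set only.
best_k = min((1, 3, 5, 9, 15), key=lambda k: mean_abs_error(train, valid, k))

# The holdout set is touched once, for the final performance estimate.
holdout_mae = mean_abs_error(train, holdout, best_k)
```

Keeping the holdout set out of every tuning decision is what makes `holdout_mae` an honest estimate; reusing it to choose `k` would turn it into a second validation set.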
Moreover, fruitful applications of supervised learning in materials science start from well-defined scientific goals. In contrast, the excitement brought about by ML has generated many studies that focus on learning performance rather than scientific objectives. Generally, this is based on the (debatable and often unsupported) idea that performance metrics on one data set are transferable to other data sets or related problems. However, ML algorithms are highly parametrized and thus can readily overfit. 63 Indeed, the model choice can itself become a form of overfitting, especially when done on performance considerations alone. 64 Moreover, training data bias can contaminate predictions, 65 but accounting for these biases appropriately is problem-specific. Furthermore, many studies are focused on error estimates obtained from statistical measures such as cross-validation. Although validation error can be a useful guide to the true prediction error on new data, it is not a replacement for it 66 and is often too optimistic. 67 In many ways, these issues arise when focus on the scientific goals is lost, as ultimately the best test of supervised learning is whether it solves problems.

■ CONCLUSION AND OUTLOOK
In this Account, we have reviewed data-driven approaches our group has employed for the design of materials, especially for clean energy applications, in the past decade. One of the first large scale campaigns our group embarked on was the CEP, where we implemented supervised learning together with HTVS using quantum chemistry simulations to investigate 10^7 potential donor molecules for organic solar cells and devised design principles by statistical analysis of structure−function relationships. 12 In the subsequent years, we refined these ML strategies and expanded our efforts toward other important materials such as OLEDs, ORFBs, multivalent CIBs, and RFs. In all these projects, data-driven workflows were key to speed up both the discovery and the design of new materials. However, we believe that the full potential of data-driven strategies is yet to be unleashed. For instance, many properties are currently not investigated in HTVS because of their prohibitive computational cost. One such property is molecular stability with respect to common decomposition pathways. The associated problem is the huge dimensionality of potential reactions molecules can undergo, which greatly exceeds the chemical compound space in complexity. Recently, our group developed a method for the automatic discovery of chemical reactions based on the selection of reactive internal coordinates such as weak chemical bonds. 68 We believe this approach, together with empirical rules or heuristics for selecting reactive internal coordinates, could be used for HTVS of reactivity and stability of materials, and research in that direction is ongoing. Other properties too costly for HTVS include the influence of explicit solvation on spectroscopic properties and the direct simulation of amorphous solid-state structures and properties. The main challenge therein is the large number of particles and degrees of freedom in the model systems and the associated multitude of interactions.
Furthermore, some of the methodologies we developed have only been tested on benchmark problems but are yet to be employed in real applications. Particularly, the genetic algorithm augmented with neural networks using SELFIES as molecular representation 47 our group proposed recently has outperformed most alternative generative models in benchmarks. However, it has yet to be implemented for designing functional materials, and we are actively working on that. 3 Finally, one of the most critical challenges of ML is model interpretability. Typically, supervised learning approaches are employed in a black box fashion without gaining insight into what the model actually learned. However, our group has shown recently that regression methods such as gradient boosting, when trained on molecular graph features, can be used to reveal important chemical moieties influencing the properties. 69,70 The trained model can be interpreted by human experts and rationalizing the feature importance can lead to new scientific understanding. We believe that similar approaches have the potential to change the way science is carried out in the near future.
However, the bottleneck of materials design campaigns is experimental synthesis and characterization, usually by a large margin. 71 Any material, no matter how good its (predicted) performance, needs to be synthesized for it to be used in real life. In particular for clean energy applications, material syntheses need to be performed on a huge scale requiring reliable, safe and green chemical processes. Accordingly, the continuing speed-up in computer power providing unprecedented prediction capabilities needs to be paralleled by increased experimental throughput. Accelerating materials design ultimately requires close integration of computer simulation, ML and experimentation in self-driving platforms, which our group termed Materials Acceleration Platforms (MAPs). 43 One essential feature of MAPs is a closed-loop materials discovery workflow incorporating experimentation, computation, and human intuition. Online characterization techniques in conjunction with automated robotic synthesis 72−74 are central enabling technologies in these platforms. Making and measuring molecules on-demand in a feedback loop with self-correcting computational screening and ML is key to finding true "needle-in-a-haystack" materials. Currently, our group is implementing such an MAP for the realization of innovative materials making use of robust cross-coupling chemistry, parallel robotic synthesis, and in-line characterization of spectroscopic properties coupled with computer simulation and ML. Details of this implementation will be described in an upcoming Account from our group. Accordingly, the data-driven methods described above are a stepping stone to accelerate materials design. However, to realize their true potential, they need to percolate into experimental systems, and we are looking forward to witnessing applications of these methods in closed-loop experimental material design campaigns in the near future.
Figure 6. Workflow for supervised learning of molecular properties. A known (labeled) data set is used to optimize a model, which is subsequently used to estimate molecular properties for an unknown (unlabeled) data set.
Gabriel dos Passos Gomes is an NSERC Banting postdoctoral fellow at the University of Toronto.
Matteo Aldeghi is a postdoctoral fellow at the Vector Institute for Artificial Intelligence and the University of Toronto.
Riley J. Hickman is a PhD student at the University of Toronto.
Mario Krenn is an Erwin Schrödinger postdoctoral fellow at the University of Toronto and the Vector Institute for Artificial Intelligence.
Cyrille Lavigne is a postdoctoral fellow at the University of Toronto.
Michael Lindner-D'Addario is a PhD student at the University of Toronto.
AkshatKumar Nigam is a researcher at the University of Toronto.
Cher Tian Ser is a PhD student at the University of Toronto.
Zhenpeng Yao is a postdoctoral fellow at the University of Toronto.