Intuition-Enabled Machine Learning Beats the Competition When Joint Human-Robot Teams Perform Inorganic Chemical Experiments

Traditionally, chemists have relied on years of training and accumulated experience to discover new molecules. But the space of possible molecules is so vast that only a limited exploration can ever be achieved with traditional methods. This means that many opportunities for the discovery of interesting phenomena have been missed; in addition, the inherent variability of these phenomena can make them difficult to control and understand. The current state of the art is moving toward automated, and eventually fully autonomous, systems coupled with in-line analytics and decision-making algorithms. Yet even these, despite the substantial progress achieved recently, still cannot easily tackle large combinatorial spaces, as they are limited by the lack of high-quality data. Herein, we explore the utility of active learning methods for exploring chemical space by comparing human experimenters collaborating with an algorithm-based search against either working individually, probing the self-assembly and crystallization of the polyoxometalate cluster Na6[Mo120Ce6O366H12(H2O)78]·200H2O (1). We show that the human-robot teams are able to increase the prediction accuracy to 75.6 ± 1.8%, from 71.8 ± 0.3% with the algorithm alone and 66.3 ± 1.8% with the human experimenters alone, demonstrating that human-robot teams can beat robots or humans working alone.

The stock solutions, along with H2O, were connected to the inlets of the assigned pumps: pump 5 for H2O, pump 6 for solution A, pump 7 for solution D, pump 8 for solution C and pump 9 for solution B. For this experiment, all 10 pumps were active: five pumps (no. 5-9) for the reagent solutions, four (no. 1-4) for functions such as washing (using deionized water as solvent) and sampling, and one pump (no. 10) for the spacer volume required between reactions. The volumetric fraction of each reagent can either be decided by an algorithm and translated into a set of commands recognizable by the pumps, or provided as a list ready to be input into the software by the human experimenters. The total reaction volume (15 mL), the temperature (90 °C), the number of iterations (10) and the reaction time (30 min) were defined beforehand and loaded from the software used to control the experiment. Dark blue samples are collected automatically at the end of each iteration using a Gilson FC204 fraction collector. After 1 day we obtain dark blue, prismatic crystals of {Mo120Ce6} (unit cell match).
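As a concrete illustration of the step above, the conversion from algorithm-proposed volumetric fractions to per-pump dispense volumes for the fixed 15 mL reaction can be sketched as below. The function and variable names are illustrative, not the actual platform control code; only the pump assignment and total volume come from the text.

```python
# Illustrative sketch: map {reagent: volumetric fraction} to {pump: volume}
# for the fixed 15 mL reaction. Pump numbers follow the assignment above.
TOTAL_VOLUME_ML = 15.0
PUMP_ASSIGNMENT = {"H2O": 5, "A": 6, "D": 7, "C": 8, "B": 9}

def fractions_to_volumes(fractions):
    """Convert volumetric fractions into absolute volumes (mL) per pump."""
    total = sum(fractions.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError("volumetric fractions must sum to 1")
    return {PUMP_ASSIGNMENT[reagent]: frac * TOTAL_VOLUME_ML
            for reagent, frac in fractions.items()}

volumes = fractions_to_volumes(
    {"H2O": 0.4, "A": 0.2, "B": 0.2, "C": 0.1, "D": 0.1})
```

The resulting dictionary maps each pump number to the volume it should dispense; the actual software would translate these into pump commands.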

The platform
The automated platform used for our experiments consists of 10 programmable syringe pumps (C3000 model, Tricontinent Ltd, CA, USA), each fitted with a 5 mL syringe and a 3-way solenoid valve (Figure S3). Four pumps (pumps 1, 2, 3 and 4) were designated for functions such as the washing protocol and sampling.

Figure S3: Image of the automated platform used for our experiments. The 10 programmable syringe pumps were fitted with 5 mL syringes. Five of the pumps were designated for the stock reagent solutions and four were assigned for functions such as sampling and cleaning. The remaining pump was used to avoid blockage issues during the experiments.

The general connectivity scheme of the pumps in the platform is as follows: pump 1 is connected to a deionized water tank at all times and is used for washing the plastic tubing after the sampling has been completed; pump 2 is used for sampling from the reactor (Figure S4a) to the Gilson FC204 fraction collector; pump 3 is used for emptying the reactor before washing; and pump 4 is connected to a 4-way connector (Figure S4b) and used for switching the flow in the tubing between sampling (pump 2) and washing (pump 1).

The algorithm implementation
The algorithm is based on uncertainty sampling. It uses its current knowledge to predict the outcome of an experiment (in our case: crystal/no-crystal) and evaluates its confidence in those predictions. The algorithm then selects the experiment it is least confident about, which lies at the perceived boundary between the crystal and no-crystal zones.

Figure S5: The algorithm steps for the active learning used in our study. The algorithm needs a small set of initial data (A) in order to train a first model (B), and generates many potential experiments to do next (C). Given the learned classifier, it predicts the outcome of those experiments (D) and selects the most uncertain experiment (E), i.e. the one for which it is least confident about its own prediction. The selected experiment is performed on the real system (F) using the platform, and the result is added to the dataset (G) and used to train a new classifier; the process is repeated up to a given termination criterion (H). The final dataset should be of higher quality than one collected using a non-active acquisition method (I).
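The loop (A)-(H) above can be sketched in a few lines with a scikit-learn SVM. This is a minimal illustration, not the actual study code: the "oracle" is a synthetic crystal/no-crystal rule standing in for the real platform, and uncertainty is taken as closeness to the SVM decision boundary.

```python
# Minimal uncertainty-sampling sketch (synthetic stand-in for the platform).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def oracle(X):
    """Toy crystal/no-crystal rule standing in for the real chemistry."""
    return (X[:, 0] + X[:, 1] > 1.0).astype(int)

# (A) small initial dataset, constructed to contain both classes
X = np.vstack([rng.random((8, 2)), [[0.1, 0.1], [0.9, 0.9]]])
y = oracle(X)

for run in range(10):                        # (H) fixed stopping criterion
    clf = SVC(kernel="rbf").fit(X, y)        # (B) train a classifier
    candidates = rng.random((200, 2))        # (C) generate candidate experiments
    margin = np.abs(clf.decision_function(candidates))  # (D) distance to boundary
    pick = candidates[np.argmin(margin)]     # (E) most uncertain candidate
    result = oracle(pick[None, :])           # (F) "perform" the experiment
    X = np.vstack([X, pick[None, :]])        # (G) grow the dataset...
    y = np.concatenate([y, result])          # ...and retrain on the next pass
```

The selected points concentrate near the decision boundary, which is exactly the behaviour described above: the dataset grows where the model is least certain.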

Machine learning parameters
In the context of machine learning, there are parameters (characteristics of the system that can be inferred from the data), such as the number of reagents, the temperature and the class of objects (e.g. crystal, no-crystal); and there are hyperparameters (characteristics of the system that cannot be inferred from the data), such as the number and size of layers in a neural network.
The hyperparameters are extremely important for tuning the model that is built by the learning algorithm, and different hyperparameters can result in different performance [2]. One of the most common approaches to hyperparameter optimization is grid search, an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm: once a list of possible values is created, the classifier is trained and validated against each value. The best hyperparameters are those with the best performance. Grid search is easy to implement, and it makes it easy to verify that enough of the parameter space has been covered.
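A minimal grid search of this kind can be written with scikit-learn's GridSearchCV; the dataset here is a synthetic stand-in, and the value lists for C and γ are illustrative.

```python
# Grid search over the two RBF-SVM hyperparameters (C, gamma),
# cross-validated on synthetic data (an illustrative sketch only).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.random((100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 0.5).astype(int)  # non-linear boundary

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)       # trains and cross-validates every (C, gamma) pair

best = search.best_params_   # the pair with the highest validation score
```

Every pair in the grid is trained and scored by internal cross-validation, and `best_params_` holds the settings that achieved the highest score, exactly as described above.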
A typical support vector machine (SVM) classifier with an RBF (radial basis function) kernel has at least two hyperparameters that need to be tuned for good performance: a regularization constant C and a kernel parameter γ. Both parameters are continuous, so to perform the grid search we select a finite set of reasonable values. The grid search trains the SVM with each pair (C, γ) and evaluates its performance on a validation set or by internal cross-validation on the training set. Finally, the algorithm outputs the settings that achieved the highest score in the validation procedure.

Figure S8: The 2D contour map of the relationship between the selected (green points) and the non-selected (purple points) experiments for T1 for the middle of the experimental procedure (runs 3-6). Notice the transition happening in runs 3 and 4, where the search starts taking into account the need for certain amounts of perchloric acid to be present in order to isolate compound (1). The value (0.0) denotes a small probability of selecting an experiment, and the value (1.0) denotes a large probability of selecting an experiment.

Figure S9: The 2D contour map of the relationship between the selected (green points) and the non-selected (purple points) experiments for T1 for the tail of the experimental procedure (runs 7-10). The preference of the human experimenter for the reducing agent and the perchloric acid is demonstrated, especially after run 8. The value (0.0) denotes a small probability of selecting an experiment, and the value (1.0) denotes a large probability of selecting an experiment.

Figure S10: The 2D contour map of the relationship between the selected (green points) and the non-selected (purple points) experiments for T2 for the first two runs. Notice how the main focus of the experimenter in this team is centered on both the reducing agent and the Mo/Ce ratio. The value (0.0) denotes a small probability of selecting an experiment, and the value (1.0) denotes a large probability of selecting an experiment.

Figure S11: The 2D contour map of the relationship between the selected (green points) and the non-selected (purple points) experiments for T2 for the middle of the experimental procedure (runs 3-6). The value (0.0) denotes a small probability of selecting an experiment, and the value (1.0) denotes a large probability of selecting an experiment.

Figure S12: The 2D contour map of the relationship between the selected (green points) and the non-selected (purple points) experiments for T2 for the tail of the experimental procedure (runs 7-10). The preference of the human experimenter for the reducing agent and the perchloric acid is demonstrated, especially after run 7. The value (0.0) denotes a small probability of selecting an experiment, and the value (1.0) denotes a large probability of selecting an experiment.

Figure S13: The 2D contour map of the relationship between the selected (green points) and the non-selected (purple points) experiments for T3 for the first two runs. Notice how the main focus of the experimenter in this team is centered on the Mo/Ce ratio. The value (0.0) denotes a small probability of selecting an experiment, and the value (1.0) denotes a large probability of selecting an experiment.

Figure S14: The 2D contour map of the relationship between the selected (green points) and the non-selected (purple points) experiments for T3 for the middle of the experimental procedure (runs 3-6). The value (0.0) denotes a small probability of selecting an experiment, and the value (1.0) denotes a large probability of selecting an experiment.

Figure S15: The 2D contour map of the relationship between the selected (green points) and the non-selected (purple points) experiments for T3 for the tail of the experimental procedure (runs 7-10). The value (0.0) denotes a small probability of selecting an experiment, and the value (1.0) denotes a large probability of selecting an experiment.

Principles
Considering the crystallization area as a volume in the parameter space of the chemicals involved in the experiments, a valuable metric is to estimate how much of the crystallization volume has been explored by each method. But this true volume is unknown to us. An alternative is to compute the volume created by the experiments leading to crystals. One could argue that the bigger this volume, the better the algorithm is at exploring the boundaries between crystal and no-crystal zones. To do so, we compute the volume of the convex hull formed by the experiments leading to crystals. The convex hull is the smallest convex volume that encompasses all of the experimental points in our chemical space. In a 3D space, this process can be illustrated as in Figure S22. The values for the initial data are 7.97 and 0.0 respectively.

Figure S23: Evolution of the explored crystallization space for each method individually.
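The convex-hull metric itself is a one-liner with SciPy; the sketch below uses toy crystal-forming points (the corners of a unit cube in a 3D parameter space) rather than the real dataset.

```python
# Convex-hull volume of crystal-forming experiments (toy 3D example).
import numpy as np
from scipy.spatial import ConvexHull

# Hypothetical crystal points in a 3D chemical space: unit-cube corners.
crystals = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1],
                     [1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]], float)

hull = ConvexHull(crystals)
volume = hull.volume   # explored crystallization volume (unit cube: 1.0)
```

Tracking this volume run by run, as in Figure S23, shows how much of the crystal-forming region each acquisition method has covered.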

Prediction accuracy

Figure S24: Evolution of the prediction accuracies for each method individually.

Similarity between experiments
We count the average number of points within a specific distance of all other points. That is, given an experimental point, how many other experiments lie within a specific radius in the parameter space, measured as a Euclidean distance. This distance acts as a similarity measure between experiments: a small value indicates similar experiments.

Figure S25: The similarity of the experiments plotted as a comparison of the number of crystals found within a given radius of another crystal. Note how two different groups emerge around a radius of R = 2. The first group consists of (H2, R1 and R2) and the second group consists of (H1, A1, A2, T1, T2, T3).
To select a good value for R, we chose the standard deviation of our measure across the methods as our metric. The logic is that the radius for which the standard deviation is highest captures the finest variations between methods: as the standard deviation increases, the different runs of our experiment become more separated from each other, and a larger deviation from the average value means better separation. Figure S26 shows this measure and indicates that a radius of 2 is optimal given our metric. We use R = 2 in the following to study the evolution of our similarity measure in more detail.
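The neighbour-counting measure described above can be sketched directly from pairwise Euclidean distances; the points below are hypothetical, and the function name is illustrative.

```python
# Average number of other experiments within radius R of each experiment.
import numpy as np

def mean_neighbours(points, R):
    """Mean count of other points within Euclidean distance R of each point."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise distance matrix
    within = (dist <= R).sum(axis=1) - 1       # exclude the point itself
    return within.mean()

# Toy experiments: three clustered points and one outlier.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
```

Sweeping R and plotting this mean (or its standard deviation across methods) reproduces the kind of curve shown in Figures S25 and S26.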
Figure S26: The similarity of the experiments plotted as a comparison of the number of crystals found within a given radius.
In total, 989 experiments were performed: 307 produced crystals and 682 did not. For each procedure (team, algorithm, humans, random), 100 experiments were performed with the platform at each run.

Comparing methods
We compared three different classifiers: SVM with RBF kernel [3], RandomForest [4] (as presented in the main document in Figure 5), and Adaboost [5] on DecisionTree [6], all implemented within the scikit-learn python library [7] (Figures S27, S28 and S29 respectively). All three classifiers are able to capture non-linear decision boundaries between classes. It is also important to check classifiers other than the SVM: because the SVM was used in the active learning method described in the main manuscript, we might have collected data solely tailored to the model built by the SVM classifier. Using two other classifiers allows us to verify that the data gathered are actually useful and meaningful for other modelling methods. Figures S28 and S29 show, respectively for RandomForest and Adaboost, the evolution of the prediction accuracy of each classifier trained on the data collected by each method at each run. 10-fold cross-validation on the full dataset was used to select the set of parameters for each classifier. The same trends appear on both plots: the machine-learning algorithm was able to collect better quality data and improved its classification accuracy the most. In comparison, the humans showed a less significant improvement and the random method did not improve in accuracy.

Figure S27: Average of the prediction accuracies for the four methods with error bars on the full dataset, as implemented by SVM.

Figure S28: Average of the prediction accuracies for the four methods with error bars on the full dataset, as implemented by RandomForest.

Figure S29: Average of the prediction accuracies for the four methods with error bars on the full dataset, as implemented by Adaboost.
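The comparison can be sketched as below with scikit-learn: the same (here synthetic) dataset is scored by 10-fold cross-validation under each of the three classifiers. AdaBoost is used with its default decision-tree base learner, and the data are stand-ins, not the collected experiments.

```python
# Compare SVM (RBF), RandomForest and AdaBoost-on-DecisionTree by
# 10-fold cross-validation on a shared synthetic dataset (sketch only).
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.random((200, 4))
y = (X.sum(axis=1) > 2.0).astype(int)   # toy crystal/no-crystal labels

classifiers = {
    "SVM (RBF)": SVC(kernel="rbf"),
    "RandomForest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),  # decision-tree base
}
scores = {name: cross_val_score(clf, X, y, cv=10).mean()
          for name, clf in classifiers.items()}
```

If a dataset is genuinely informative rather than tailored to one model, all three classifiers should show the same qualitative trend, which is the check performed above.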

Team 1
In this team, the human experimenter observed that there are two areas of crystallization in the initial dataset, around 0.7 mL and 1.4 mL of reducing agent respectively. … 3 and Na2MoO4) against NH2NH2·2HCl. They made the assumption that H2O is the least significant ingredient and therefore did not take it into account. They made these calculations for every experiment that produced crystals and for every experiment that did not produce crystals from the previous batches.
Finally, after plotting this information, they were able to select their next choices from the suggested experiments.

Additional information provided from reference 28

Figure S30: The average of the prediction accuracies for the three methods on the full dataset, as implemented by SVM.

Experiments selected by
Figure S31: The average of the prediction accuracies for the three methods on the full dataset, as implemented by Adaboost.

Figure S32: The average of the prediction accuracies for the three methods on the full dataset, as implemented by RandomForest.