Text Mining the Literature to Inform Experiments and Rationalize Impurity Phase Formation for BiFeO3

We used data-driven methods to understand the formation of impurity phases in BiFeO3 thin-film synthesis through the sol–gel technique. Using a high-quality dataset of 331 synthesis procedures and outcomes extracted manually from 177 scientific articles, we trained decision tree models that reinforce important experimental heuristics for the avoidance of phase impurities but ultimately show limited predictive capability. We find that several important synthesis features, identified by our model, are often not reported in the literature. To test our ability to correctly impute missing synthesis parameters, we attempted to reproduce nine syntheses from the literature with varying degrees of “missingness”. We demonstrate how a text-mined dataset can be made useful by informing new controlled experiments and forming a better understanding of impurity phase formation in this complex oxide system.


S2.2 Mol2Vec Chemical Embedding Convergence
The convergence plot for the average change in cosine similarity between principal components of the chemical embeddings is shown in Figure S2. The plot shows convergence at around 20-30 principal components. The cosine similarities for all possible reagents when considering 30 principal components are given in Figure S3.
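As a rough illustration, this convergence curve can be computed as in the sketch below, assuming `embeddings` is an array holding one 300-dimensional Mol2Vec vector per reagent (the function name and component cutoff are placeholders, not the exact script used in this work).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

def convergence_curve(embeddings, max_components=50):
    """Average absolute change in pairwise cosine similarity as PCA components are added."""
    max_components = min(max_components, *embeddings.shape)
    deltas, prev_sim = [], None
    for k in range(2, max_components + 1):
        reduced = PCA(n_components=k).fit_transform(embeddings)
        sim = cosine_similarity(reduced)            # pairwise similarities between reagents
        if prev_sim is not None:
            deltas.append(np.mean(np.abs(sim - prev_sim)))
        prev_sim = sim
    return deltas   # flattens once additional components no longer change the similarities
```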

S4.2 k-Nearest Neighbors Imputation
We tested the ability of k-nearest neighbors (kNN) missing-value imputation to correctly impute values by randomly masking values for features of interest (Bi:Fe ratio, precursor, concentration, and mixing time/temperature, in alignment with the missing-values analysis from Figure 3) and comparing the imputed values with the true values. We first acquired a subset of 110 rows from our dataset for which none of the values of the aforementioned features were missing. Then we randomly masked 20% of each of those feature values so that they would be labeled as "missing". We then performed kNN imputation on a scaled version of the dataset (since nearest neighboring data points are identified using Euclidean distance) and compared with the dataset of true values to find the frequency of exact matches. We found that, with 95 total values masked and imputed, we achieved perfect imputation for 35 of the values when k = 5 nearest neighbors, 51 values when k = 3, and 72 values when k = 1. There is a risk of overfitting when using small values of k; thus, we moved forward with k = 5 so that the imputer is not confined to the local structure around a single data point.
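A minimal sketch of this masking-and-imputation test is given below, assuming the 110 complete rows are held in a numerically encoded DataFrame named `complete` (the column names and exact-match tolerance are illustrative rather than the script used here).

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = ["bi_fe_ratio", "precursor", "concentration", "mix_time", "mix_temp"]  # placeholder names

# Randomly mask 20% of each feature of interest.
masked = complete.copy()
for col in features:
    idx = rng.choice(masked.index, size=int(0.2 * len(masked)), replace=False)
    masked.loc[idx, col] = np.nan

# Impute on a scaled copy, since neighbors are found by Euclidean distance.
scaler = StandardScaler().fit(masked)
imputed_scaled = KNNImputer(n_neighbors=5).fit_transform(scaler.transform(masked))
imputed = pd.DataFrame(scaler.inverse_transform(imputed_scaled),
                       index=masked.index, columns=masked.columns)

# Count exact matches (to within rounding) among the cells that were masked.
hits = (imputed[features].round(6) == complete[features].round(6)) & masked[features].isna()
print("exact matches among masked cells:", int(hits.values.sum()))
```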

S5 Classifier Model Comparison
We provide details on a performance comparison, as well as on hyperparameter tuning, for four different classifier algorithms (decision trees, random forest, extra trees, and XGBoost).

S5.1 Hyperparameter Grids
The following lists the specific hyperparameters and values considered in cross-validation for each classifier algorithm. For the decision tree classifier we used sklearn's GridSearchCV module, and for the random forest, extra trees, and XGBoost classifiers we used sklearn's RandomizedSearchCV module. Cross-validation was stratified, with 5 folds, 10 repeats, and the same random state set for each training. The F1 score (with phase impurity formation as the positive class) was used to determine the best estimator from cross-validation.
For randomized search cross-validation, 50 iterations were used for random forest and extra trees and 500 iterations were used for XGBoost.
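For reference, the cross-validation setup described above can be reproduced along the lines of the sketch below (the decision tree grid shown is illustrative, not the exact grid we searched; `X_train` and `y_train` are assumed to be defined upstream).

```python
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)  # 5 folds, 10 repeats
param_grid = {"max_depth": [3, 5, 7, None],          # illustrative values only
              "min_samples_leaf": [1, 2, 5, 10],
              "criterion": ["gini", "entropy"]}

# F1 with impurity formation as the positive (label 1) class selects the best estimator.
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                      scoring="f1", cv=cv, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
# For random forest, extra trees, and XGBoost, RandomizedSearchCV is used instead,
# with n_iter=50 (random forest, extra trees) or n_iter=500 (XGBoost).
```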

S5.2 Performance Comparison Between Algorithms
Tables S2-S5 show the averages and standard deviations for several evaluation metrics for sets of 10 trained models within each algorithm type, separated by the different imputation and classification task frameworks.
Evaluation metrics include the F1 score (where the positive class corresponds to forming phase impurities), precision, recall, the Matthews correlation coefficient (MCC), normalized false positives (calculated as the number of false positives divided by the total number of positive cases, similar to a Type I error rate), normalized false negatives (calculated as the number of false negatives divided by the total number of negative cases, similar to a Type II error rate), and the macro-averaged F1 score (calculated as the average of the class-wise F1 scores).
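The sketch below shows how these metrics can be assembled from a confusion matrix using scikit-learn's built-in scorers where available; the normalizations follow the definitions stated above.

```python
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, matthews_corrcoef)

def evaluate(y_true, y_pred):
    # Positive class (label 1) = phase impurities formed.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "f1": f1_score(y_true, y_pred, pos_label=1),
        "precision": precision_score(y_true, y_pred, pos_label=1),
        "recall": recall_score(y_true, y_pred, pos_label=1),
        "mcc": matthews_corrcoef(y_true, y_pred),
        "norm_fp": fp / (tp + fn),   # false positives / total positive cases
        "norm_fn": fn / (tn + fp),   # false negatives / total negative cases
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }
```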

Figure S1 depicts a flowchart for acquiring chemical embeddings from the Mol2Vec model. First, chemicals are identified by their SMILES strings, which are used to collect the list of Morgan fingerprint identifiers for chemical substructures. Each substructure has been given a 300-dimensional embedding from a pre-trained model trained on amino acid structures. The dimension of this collection of substructure embeddings is then reduced using principal component analysis. The resulting coordinates can then be used to create vector summations of the substructures to form a molecule, similar to the vector summation construction of words into phrases from the Word2Vec model.
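A minimal sketch of this pipeline is shown below, assuming a pre-trained 300-dimensional Word2Vec model saved to disk (the file name, fingerprint radius, and helper function are placeholders, and collecting identifiers directly from the Morgan bit information is a simplified stand-in for Mol2Vec's sentence construction). The PCA reduction to roughly 30 components (Section S2.2) would then be applied across the collected substructure vectors before the summation.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from gensim.models import word2vec

# Pre-trained Mol2Vec/Word2Vec model; the file name is a placeholder.
model = word2vec.Word2Vec.load("model_300dim.pkl")

def molecule_vector(smiles, radius=1):
    """SMILES -> Morgan substructure identifiers -> summed substructure embeddings."""
    mol = Chem.MolFromSmiles(smiles)
    info = {}
    AllChem.GetMorganFingerprint(mol, radius, bitInfo=info)   # substructure identifiers
    identifiers = [str(i) for i in info]
    vectors = [model.wv[i] for i in identifiers if i in model.wv]
    return np.sum(vectors, axis=0)   # molecule vector = sum of its substructure vectors
```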

Figure S2: Convergence plot for chemical-embedding PCA cosine similarity. Average change in cosine similarity between principal components of the chemical embeddings for all chemical species across the possible PCA vector space.

Figure S3: Table showing pairwise cosine difference values between vector representations of each chemical using the Mol2Vec algorithm.

Figure S4 depicts the frequency of specific substrates used in the text-mined dataset.

Figure S4: Frequency of substrates: Frequency of specific substrates used in the text-mined dataset.

Figure S5 depicts the range of values employed for all conditions in the synthesis dataset, as well as the frequency of omitted values for each condition.

Figure S5: Synthesis condition heatmap: Individual syntheses from the text-mined dataset (rows) encoded by their conditions (columns). The color of each cell represents the deviation from the mean value for that condition in that particular procedure. Grey cells represent missing values.