Bandgap Engineering in the Configurational Space of Solid Solutions via Machine Learning: (Mg,Zn)O Case Study

Computer simulations of alloys' properties often require calculations in a large space of configurations in a supercell of the crystal structure. A common approach is to map density functional theory results onto a simplified interaction model using so-called cluster expansions, which are linear in the cluster correlation functions. Alternative descriptors have not been sufficiently explored so far. We show here that a simple descriptor based on the Coulomb matrix eigenspectrum clearly outperforms the cluster expansion for both total energy and bandgap energy predictions in the configurational space of an MgO–ZnO solid solution, a prototypical oxide alloy for bandgap engineering. Bandgap predictions can be further improved by introducing non-linearity via gradient-boosted decision trees or neural networks based on the Coulomb matrix descriptor.


Bandgap correction using the screened hybrid functional HSE
The PBE functional generally underestimates bandgaps. The bandgaps obtained with HSE calculations tend to be in better agreement with experiment, but at a much higher computational cost. Interestingly, we found that the PBE- and HSE-derived bandgaps are highly correlated for this system. In order to obtain a linear relation that would allow us to calculate near-HSE-quality bandgaps from PBE ones, we first chose five random configurations from each of the four bandgap energy quartiles (Figure 1a), thus ensuring representation of the full Egap range. The correlation between the HSE and PBE bandgap energies is shown in Figure S1a. Equation S1 provides us with a way to correct the PBE values and obtain bandgaps as close as possible to the HSE ones, at a fraction of the cost.
To test the validity of this approach, we applied the linear correction to a further 20 randomly selected configurations (across the four quartiles). These corrected PBE Egap values were plotted against the corresponding HSE bandgaps, as shown in Figure S1b. It is clear that the linear transformation is effective for correcting values for configurations on which it was not fitted.
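The fit-and-correct procedure above can be sketched as follows. The bandgap values in this sketch are synthetic placeholders (constructed to be exactly linear, for illustration) rather than the actual fitting data; the true coefficients are those of Equation S1, obtained from the configurations selected across the four quartiles.

```python
import numpy as np

# Synthetic PBE and HSE bandgaps (eV), exactly linear for illustration;
# the actual coefficients (Equation S1) come from the DFT fitting set.
egap_pbe = np.array([1.0, 1.5, 2.0, 2.5])
egap_hse = 1.2 * egap_pbe + 0.7

# Least-squares fit of the linear relation E_HSE ≈ a·E_PBE + b
a, b = np.polyfit(egap_pbe, egap_hse, 1)

def correct_pbe_gap(egap):
    """Apply the linear correction to obtain a near-HSE-quality bandgap."""
    return a * egap + b
```

The same fitted `correct_pbe_gap` can then be applied to configurations outside the fitting set, which is exactly the validation performed in Figure S1b.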

Calculation of the cluster correlation functions
The simulation cell that we have employed in all our calculations consists of a 2×2×2 repetition of the parent MgO unit cell. The parent structure has the rock-salt structure, space group Fm-3m (No. 225). Since the cubic unit cell of MgO has a lattice parameter of 4.248 Å, the 2×2×2 supercell used in our calculations has an edge length of 8.496 Å.
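As a quick consistency check, the supercell edge length and the raw (symmetry-unreduced) number of Zn/Mg arrangements over the cation sites can be computed directly:

```python
from math import comb

a0 = 4.248               # MgO cubic lattice parameter (Å)
supercell_edge = 2 * a0  # edge length of the 2x2x2 supercell (Å)

# 8 Zn atoms on 32 cation sites: raw number of decorations; the
# supercell symmetry reduces these to 8043 distinct configurations
n_raw = comb(32, 8)

print(supercell_edge, n_raw)  # 8.496 10518300
```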
There are 32 cation sites in this supercell, of which 24 are occupied by Mg cations and 8 by Zn cations. In order to calculate efficiently the energies of the different (Mg24Zn8)O32 configurations we make use of cluster expansion (CE) theory.1 Within this approach, the energy of a particular configuration σ of the solid solution can be calculated as a linear expansion over the so-called cluster correlation functions (CCFs) X_α(σ):

E(σ) = ∑_{α=1}^{n_c} m_α J_α X_α(σ),

where the sum extends over all clusters α of exchangeable sites (e.g. points, pairs, trios, quartets, etc.), the J_α are the effective cluster interactions (ECIs), n_c is the number of symmetrically distinct clusters, and the m_α are the multiplicities (number of symmetrically equivalent clusters of type α). The CCFs can be defined as the average of the product of a basis function Γ of the occupation variables s_i (−1 and 1 for each element in a binary alloy) over all the clusters α′ of type α:

X_α(σ) = ⟨ ∏_{i∈α′} Γ(s_i) ⟩_{α′≡α}.

There is some flexibility in the choice of basis functions. We used two types of basis functions:1 i) a non-orthonormal, binary basis, Γ(s_i) = (1 + s_i)/2, which takes values 0 and 1 for Mg and Zn occupancy, respectively; and ii) an orthonormal basis of discrete Chebyshev polynomials. The results obtained when using both types of cluster basis are very similar, so in the manuscript we only report the results using the orthonormal basis functions.
We have employed the CELL tool2 to obtain the CCF vectors s = {X_α(σ)} for each of the 8043 symmetrically different configurations of the supercell with composition Mg24Zn8O32. There are 92 symmetrically distinct clusters up to fourth order, namely 1 empty cluster, 1 one-point cluster, 5 two-point clusters, 14 three-point clusters and 71 four-point clusters. Figure S2 shows some of these clusters. Because we are only comparing configurations with the same composition, the CCFs for the empty cluster and for the one-point cluster are constant for all configurations. The vectors s formed from the remaining 90 correlation functions are then used as configurational descriptors for the linear regression (i.e. for finding the ECIs) or for non-linear analysis using different techniques, as described in the text.
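To make the CCF construction concrete, the sketch below evaluates pair-cluster correlation functions on a toy periodic 1D lattice using the binary (0/1) basis. The actual calculations use CELL on the 3D supercell, so the lattice, occupations and cluster orbits here are illustrative only.

```python
import numpy as np

# Toy periodic 1D "lattice" of 8 cation sites; 1 = Zn, 0 = Mg (binary basis)
occ = np.array([0, 1, 0, 0, 1, 0, 1, 0])

def pair_ccf(occ, distance):
    """Average of the product of basis functions over all pairs of sites
    separated by `distance` (with periodic boundaries)."""
    n = len(occ)
    return np.mean([occ[i] * occ[(i + distance) % n] for i in range(n)])

# One CCF per pair-cluster "orbit" (here labelled by distance); together
# with higher-order clusters, these averages form the descriptor vector s.
s = [pair_ccf(occ, d) for d in (1, 2, 3)]
```

For this occupation pattern the nearest-neighbour CCF is zero (no two Zn sites are adjacent), while the longer-range pair CCFs are non-zero, illustrating how the vector s encodes the cation ordering.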

Performance of MLP neural network using CCF descriptor
We found that relaxing the linearity condition on the CCFs did not significantly improve the performance of the descriptor for mixing energies or bandgaps. Training the deep MLP model using the CCF descriptor yielded similarly poor correlations, as shown in Figure S3.

Comparison of shallow vs deep MLP neural networks using CME descriptor
We compared the performance of the deep and shallow MLP models for predicting bandgaps. Though the deep MLP outperformed the shallow MLP in terms of MAE, we report the results of the shallow MLP here for comparison. Shallow MLP architectures may have the advantage of avoiding overfitting in certain use cases, although no evidence of overfitting was encountered in the present work. The performance of the shallow MLP increased with the training dataset size, as is the case for the deep MLP (reported in the main text).
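For reference, the Coulomb matrix eigenspectrum (CME) descriptor that these MLPs take as input (defined in the main text) can be sketched in its standard form from atomic numbers Z and Cartesian positions R; the two-atom Mg–O geometry in the usage line is a hypothetical example, not a structure from the dataset.

```python
import numpy as np

def coulomb_matrix_eigenspectrum(Z, R):
    """Sorted (descending) eigenvalues of the Coulomb matrix:
    M_ii = 0.5 * Z_i**2.4, M_ij = Z_i * Z_j / |R_i - R_j|."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    # Sorting the eigenvalues makes the descriptor permutation-invariant
    return np.sort(np.linalg.eigvalsh(M))[::-1]

# Hypothetical two-atom example: one Mg (Z=12) and one O (Z=8)
cme = coulomb_matrix_eigenspectrum([12, 8],
                                   [[0.0, 0.0, 0.0], [2.124, 0.0, 0.0]])
```

The descriptor length equals the number of atoms, so all configurations of the fixed-composition supercell map to vectors of the same dimension.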

Figure S4. Shallow multilayer perceptron Egap prediction based on the CME descriptor for a) 80% of the configurations used in training; b) 10% of the configurations used in training.
Table S1 reports a comparison of the shallow and deep MLPs in terms of performance. Although the shallow MLP performs very well and incurs slightly shorter training times, the deep MLP is superior when using the same amount of data for training. For reference, training with all GBDT methods required less than five minutes, while LR training required less than ten seconds in each case.

Further details about the machine learning algorithms
GBDT was subject to hyperparameter optimization. The following settings were employed in the process (the corresponding name in the class sklearn.ensemble.GradientBoostingRegressor is shown in parentheses):
• Learning rate (learning_rate) = 0.01
• Maximum depth of the individual regression estimators (max_depth) = 4
• Minimum number of samples required to split an internal node (min_samples_split) = 5
• Number of boosting stages to perform (n_estimators) = 1000
Both MLP architectures featured 30% dropout at each layer to prevent overfitting, with ReLU activation for the hidden layers and linear activation for the final output layer.3 MLPs were trained with batch sizes of 32, using the Adam stochastic gradient descent optimizer.4 Convergence of the MAE on the validation data was used to truncate MLP training, using the EarlyStopping method in Keras. This provided maximum training while avoiding potential overfitting from extensive training beyond numerical convergence.
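Assuming scikit-learn and Keras as named above, these settings translate into the following sketch; the MLP layer widths, input dimension and early-stopping patience are illustrative placeholders that are not specified in the text.

```python
from sklearn.ensemble import GradientBoostingRegressor
from tensorflow import keras
from tensorflow.keras import layers

# GBDT with the hyperparameters listed above
gbdt = GradientBoostingRegressor(
    learning_rate=0.01,     # shrinkage applied to each boosting stage
    max_depth=4,            # depth of the individual regression estimators
    min_samples_split=5,    # samples required to split an internal node
    n_estimators=1000,      # number of boosting stages
)

# MLP sketch: 30% dropout per layer, ReLU hidden layers, linear output.
# n_features and the layer widths (64) are placeholders, not the paper's.
n_features = 64
model = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="linear"),
])
model.compile(optimizer="adam", loss="mae")

# Truncate training once the validation MAE stops improving
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=20,
                                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.1, batch_size=32,
#           epochs=1000, callbacks=[early_stop])
```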

Information about codes and data
We provide the following resources in open access repositories:

Data:
• All data is available in a Zenodo repo with a persistent DOI to ensure continued availability: https://doi.org/10.5281/zenodo.4736810
• The repo contains both the raw data (in .csv format) and pre-processed data (in .pkl format, ready for training the ML models).
• Distributed under the Creative Commons Attribution 4.0 International License.

Code:
• All the code required to reproduce our results is available in a GitHub repo (https://github.com/scott-midgley/Machine-Learning-for-Solid-Solutions).
• The GitHub repo is linked to the Zenodo data repo; by combining the two repositories anyone should be able to reproduce all our results.
• The code repo includes a conda environment specification (in .yml format), so that it is possible to recreate the environment settings that we used to perform the study.
• All code is written as Jupyter notebooks, with full instructions and comments to make it easy to follow.
• The repo contains instructions on how to run the codes, recreate the appropriate environment, and link to the data.
• Distributed under the MIT License.

Models:
• The models can be recreated from the code and data, typically within a short time.
• The deep MLP networks take longer to train, so we have also provided the pre-trained models as serialized objects in .h5 files, which can be easily loaded and re-run. The code for doing so is also provided in notebooks.
For more detailed information, please refer to the README files within each repository.