Pretraining Strategies for Structure Agnostic Material Property Prediction

In recent years, machine learning (ML), especially graph neural network (GNN) models, has been successfully used for fast and accurate prediction of material properties. However, most ML models rely on relaxed crystal structures to develop descriptors for accurate predictions. Generating these relaxed crystal structures can be expensive and time-consuming, thus requiring an additional processing step for models that rely on them. To address this challenge, structure-agnostic methods have been developed that use fixed-length descriptors engineered from human knowledge about the material. However, these hand-engineered descriptors require extensive domain knowledge and are generally not used with learnable models, which are known to offer superior performance. Recent advancements have proposed learnable frameworks that construct representations from stoichiometry alone, combining the flexibility of deep learning with structure-agnostic learning. In this work, we propose three different pretraining strategies for these structure-agnostic, learnable frameworks to further improve downstream material property prediction performance. We incorporate self-supervised learning (SSL), fingerprint learning (FL), and multimodal learning (MML) and demonstrate their efficacy on downstream tasks for the Roost architecture, a popular structure-agnostic framework. Our results show significant improvements on small datasets and improved data efficiency on larger datasets, underscoring the potential of our pretraining strategies to effectively leverage unlabeled data for accurate material property prediction.

Consequently, we finalized our hyperparameter settings with a batch size of 16,384 and training for 50 epochs. We also experimented with validation ratios of 0.01 and 0.05 during model optimization. Although the validation loss was lower with a ratio of 0.01 during pretraining, better performance was observed on downstream tasks with a ratio of 0.05. A key takeaway is that setting the validation ratio too low can compromise effectiveness on downstream tasks.
Additionally, we explored varying the embedding size used in the Barlow Twins loss.
While the original paper suggests that larger embedding sizes lead to better performance, larger embedding sizes demanded more GPU memory, which was a limiting factor in our experiments. Taking these considerations into account, we set the embedding size to 1024 for our model. The hyperparameters used for SSL pretraining are shown in Table S1.
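For reference, the sketch below is a minimal PyTorch implementation of the Barlow Twins objective with the 1024-dimensional embeddings chosen above. The off-diagonal weight and the batch normalization of the two views follow the original Barlow Twins formulation and are assumptions where this section and Table S1 do not state the values; the inputs z_a and z_b stand for the encoder outputs of two augmented views of the same batch of compositions.

```python
import torch
import torch.nn as nn

class BarlowTwinsLoss(nn.Module):
    """Redundancy-reduction loss between two augmented views of a batch.

    embed_dim=1024 matches the embedding size chosen above; lambda_offdiag
    is the off-diagonal weight from the original Barlow Twins paper (an
    assumption, since the exact value used here is not stated).
    """

    def __init__(self, embed_dim: int = 1024, lambda_offdiag: float = 5e-3):
        super().__init__()
        self.bn = nn.BatchNorm1d(embed_dim, affine=False)  # normalize each dim over the batch
        self.lambda_offdiag = lambda_offdiag

    def forward(self, z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
        n = z_a.size(0)
        # Cross-correlation matrix of the batch-normalized embeddings, shape (d, d)
        c = self.bn(z_a).T @ self.bn(z_b) / n
        diag = torch.diagonal(c)
        on_diag = (diag - 1.0).pow(2).sum()            # pull matching dimensions toward correlation 1
        off_diag = c.pow(2).sum() - diag.pow(2).sum()  # decorrelate all remaining dimension pairs
        return on_diag + self.lambda_offdiag * off_diag
```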

Hyperparameters for Fingerprint Learning (FL)
To ensure that the Roost embeddings closely resemble the Magpie fingerprint, we aim to minimize the mean squared error (MSE) loss using gradient descent. This process requires all values to be numerical. However, one of the features, "compound possible," is a Boolean variable. As a result, we decided to exclude this feature from our analysis.
Furthermore, it is essential to normalize all values to ensure equal contribution towards approximating the fingerprint and to prevent the model from being biased by features with large values. After normalizing the remaining 144 Magpie features, we use them as the fingerprint for our Fingerprint Learning framework. The hyperparameters for FL, shown in Table S2, were aligned with the SSL strategy, except for adjustments to the batch size and the number of epochs. We determined the number of epochs by observing the point at which the loss no longer decreased; pretraining was run for 100 epochs using this strategy. The batch size was optimized based on the memory constraints of the GPU.
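As a rough illustration of this setup, the sketch below normalizes the Magpie features after dropping the Boolean "compound possible" column and regresses them from the Roost embedding with an MSE loss. The projection head, function names, and variable names are hypothetical placeholders, not the exact implementation.

```python
import numpy as np
import torch
import torch.nn as nn

def prepare_fingerprints(magpie: np.ndarray, bool_col: int) -> torch.Tensor:
    """Drop the Boolean 'compound possible' column and z-score the
    remaining 144 Magpie features so each contributes equally to the loss."""
    x = np.delete(magpie, bool_col, axis=1).astype(np.float32)
    x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
    return torch.from_numpy(x)

class FingerprintLearner(nn.Module):
    """Maps the structure-agnostic composition embedding onto the 144-dim fingerprint."""

    def __init__(self, encoder: nn.Module, embed_dim: int, n_fingerprint: int = 144):
        super().__init__()
        self.encoder = encoder                     # Roost composition encoder (passed in)
        self.proj = nn.Linear(embed_dim, n_fingerprint)

    def forward(self, composition_batch):
        return self.proj(self.encoder(composition_batch))

# One pretraining step (illustrative):
#   pred = model(composition_batch)
#   loss = nn.functional.mse_loss(pred, fingerprint_batch)
#   loss.backward(); optimizer.step()
```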

Hyperparameters for Multimodal Learning (MML)
In this study, we leverage the CGCNN backbone, obtained from the Crystal Twins framework, 1 to generate embeddings for the hMOF dataset. We then utilize a structure-agnostic model to predict these embeddings. The hyperparameters for MML, shown in Table S3, were kept similar to those of the other pretraining strategies, with the exception of the batch size and the number of epochs. We increased the number of epochs because we observed that the pretraining loss decreased and plateaued after 90 epochs; consequently, training was terminated at 100 epochs. The batch size was optimized based on the memory constraints of the GPU.
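A minimal sketch of this objective is given below: CGCNN embeddings for the hMOF structures are precomputed with the Crystal Twins backbone, and the composition-only encoder is trained to regress them. The MSE objective and all class and variable names are assumptions made for illustration rather than the authors' exact code.

```python
import torch
import torch.nn as nn

class StructureEmbeddingRegressor(nn.Module):
    """Predicts a precomputed CGCNN structure embedding from composition alone."""

    def __init__(self, roost_encoder: nn.Module, roost_dim: int, cgcnn_dim: int):
        super().__init__()
        self.encoder = roost_encoder                 # structure-agnostic encoder
        self.head = nn.Linear(roost_dim, cgcnn_dim)  # maps to the CGCNN embedding space

    def forward(self, composition_batch):
        return self.head(self.encoder(composition_batch))

def mml_step(model, optimizer, composition_batch, cgcnn_embeddings):
    """Single pretraining step: match the CGCNN embedding of the same material."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(composition_batch), cgcnn_embeddings)
    loss.backward()
    optimizer.step()
    return loss.item()
```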

Details about the Pretraining dataset

Data quality
The quality of pretraining data has a direct impact on the representations learned by the model. In the case of SSL for images, noisy images can lead to the learning of misleading features. Similarly, for materials property prediction, certain materials with only one or two elements may present challenges. After masking one node in such materials, they could lose 50% or even 100% of their node features, which may cause the model to learn irrelevant features.
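To make this concern concrete, the heuristic below counts the distinct elements in a formula and flags compositions where masking a single element node would remove half or more of the nodes. It is an illustrative check only, not a filter the pretraining pipeline is stated to apply, and it uses a simplified element-symbol parser.

```python
import re

def masking_is_risky(formula: str, mask_count: int = 1, max_loss: float = 0.5) -> bool:
    """Return True if masking `mask_count` element nodes would remove at least
    `max_loss` of the composition's nodes (e.g. unary or binary compounds).

    A full composition parser (e.g. pymatgen) would be preferable in practice;
    the regex below only extracts element symbols.
    """
    elements = set(re.findall(r"[A-Z][a-z]?", formula))
    return len(elements) == 0 or mask_count / len(elements) >= max_loss

# Example: masking one node in Fe2O3 removes 50% of its element nodes.
assert masking_is_risky("Fe2O3")       # binary -> risky
assert not masking_is_risky("BaTiO3")  # ternary -> below the 50% threshold
```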

Data diversity and Quantity
A diverse pretraining dataset, encompassing various examples from the domain of interest, allows the SSL model to learn generalizable features that can be applied to different downstream tasks. The pretraining dataset should include a wide range of materials, such as metals, ceramics, polymers, composites, perovskites, and others, in order to provide the model with a comprehensive view of the materials landscape. If the pretraining data is limited to a specific class of materials, like metals, the model may become biased towards patterns specific to that class, hindering its ability to generalize to other materials property prediction tasks.
To identify the most appropriate pretraining data for downstream tasks, we conducted experiments using a variety of dataset combinations. These datasets primarily consisted of data from the Roost paper (OQMD and the experimental band gaps of non-metal materials), the Matbench datasets, and MOF (Metal-Organic Framework) data. The results are shown in Table S4. We observe that the set of unique materials drawn from all datasets gives the highest improvement in performance on the materials property prediction tasks, since it has the greatest quantity and diversity of data. This ensures the model learns robust and generalizable features.
Data diversity and quantity are highly correlated. We also examine how the size of the pretraining dataset impacts downstream performance: if the pretraining dataset contains only a small number of samples, the model may not learn to capture the complex relationships in materials during pretraining. To this end, we examine the influence of the availability of pretraining data on the Roost model. We define the datasets used in the Roost paper 2 as Roost data, the aggregation of all the datasets from the Matbench suite 3 as Matbench data, and the hMOF database 4 as MOF data. The results of our experiments with different pretraining datasets are shown in Table S4.

Performance Improvements after Finetuning
We evaluate performance improvements compared to the baseline Roost 2 model. We observe that pretraining with the SSL strategy is the most effective. Impressive gains are observed for small and medium-sized datasets with both the SSL and FL strategies. The SSL and FL strategies are unable to improve performance on the larger datasets when compared to supervised learning. The MML strategy shows gains on the larger datasets, probably because the effects of learning structure features are most prominent for larger datasets. The performance gains are shown in Table S5. The mp-non-metals dataset is a subset of the non-metal materials in the mp-gap dataset. Additionally, we also investigate the significance of these improvements for the SSL strategy.

Hyperparameters for Finetuning on Matbench datasets
We evaluate the performance of pretrained models on the Matbench suite. Table S6 shows the hyperparameters that we use to finetune the Roost encoder for the Matbench suite.
We would like to note that the same hyperparameters were used for all datasets.
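For completeness, the sketch below shows how a pretrained encoder checkpoint might be reused for finetuning on a downstream task: the encoder weights are loaded and a fresh regression head is attached. The checkpoint layout, head architecture, and learning rates are illustrative placeholders rather than the values listed in Table S6.

```python
import torch
import torch.nn as nn

def build_finetune_model(encoder: nn.Module, embed_dim: int, checkpoint_path: str,
                         lr_encoder: float = 1e-4, lr_head: float = 1e-3):
    """Load pretrained encoder weights and attach a fresh single-target head.

    Assumes `checkpoint_path` stores the encoder state_dict saved after
    pretraining; the learning rates here are illustrative, not Table S6 values.
    """
    encoder.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
    head = nn.Sequential(nn.Linear(embed_dim, 64), nn.SiLU(), nn.Linear(64, 1))
    model = nn.Sequential(encoder, head)
    # Use a smaller learning rate for the pretrained encoder than for the new head.
    optimizer = torch.optim.AdamW([
        {"params": encoder.parameters(), "lr": lr_encoder},
        {"params": head.parameters(), "lr": lr_head},
    ])
    return model, optimizer
```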

Table S2: Hyperparameters for FL

Table S5: Improvement in downstream task performance compared to the supervised Roost model