How Beneficial Is Pretraining on a Narrow Domain-Specific Corpus for Information Extraction about Photocatalytic Water Splitting?

Language models trained on domain-specific corpora have been employed to improve performance on specialized tasks. However, little previous work has examined how specific a “domain-specific” corpus should be. Here, we test a number of language models trained on corpora of varying specificity by employing them in the task of extracting information about photocatalytic water splitting. We find that more specific corpora can benefit performance on downstream tasks. Furthermore, PhotocatalysisBERT, a model pretrained from scratch on scientific papers on photocatalytic water splitting, demonstrates improved performance over previous work in associating the correct photocatalyst with the correct photocatalytic activity during information extraction, achieving a precision of 60.8(+11.5)% and a recall of 37.2(+4.5)%.


Details on Pre-training Corpus
The pre-training corpus was gathered using ChemDataExtractor's scraping tools from Royal Society of Chemistry and Elsevier papers. In Elsevier's case, this interfaced with their official API for scraping. For both publishers, the results from the following set of queries were used:

• photocatalytic water splitting
• photocatalyst water hydrogen
• catalysis water splitting
• hydrogen production catalyst
• catalysis hydrogen production
• water splitting hydrogen
• photocatalysis
• photocatalyst

Results were gathered for the years 2000 to 2023 for Elsevier. This resulted in a photocatalytic water splitting-specific corpus of 11 GB.
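The exact corpus-gathering code is not reproduced here; the following is a minimal sketch of how such a query loop could be issued against Elsevier's public ScienceDirect Search API. The endpoint, parameter names, and response fields are assumptions based on Elsevier's documented search interface, not the ChemDataExtractor-based pipeline actually used, and the API key is a placeholder.

```python
# Illustrative sketch only: approximates the corpus-gathering query loop.
# The endpoint, parameters, and response layout below are assumptions based
# on Elsevier's public ScienceDirect Search API, not the authors' exact code.
import requests

API_KEY = "YOUR_ELSEVIER_API_KEY"  # hypothetical placeholder
SEARCH_URL = "https://api.elsevier.com/content/search/sciencedirect"  # assumed endpoint

QUERIES = [
    "photocatalytic water splitting",
    "photocatalyst water hydrogen",
    "catalysis water splitting",
    "hydrogen production catalyst",
    "catalysis hydrogen production",
    "water splitting hydrogen",
    "photocatalysis",
    "photocatalyst",
]


def search_elsevier(query, year_range="2000-2023", count=100, start=0):
    """Return one page of search hits for a query (assumed GET interface)."""
    params = {
        "query": query,
        "date": year_range,  # restrict to the 2000-2023 window described above
        "count": count,
        "start": start,
        "apiKey": API_KEY,
    }
    response = requests.get(SEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    dois = set()
    for query in QUERIES:
        page = search_elsevier(query)
        # 'search-results'/'entry'/'prism:doi' follow Elsevier's usual response
        # layout, but should be verified against the live API.
        for entry in page.get("search-results", {}).get("entry", []):
            doi = entry.get("prism:doi")
            if doi:
                dois.add(doi)
    print(f"Collected {len(dois)} candidate DOIs for full-text download.")
```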

Pre-training Hyperparameters
To maintain comparability with the original BERT base model, the PhotocatalysisBERT and PhysicalSciencesBERT models were trained to have the same number of parameters: we used WordPiece embeddings 1 with a 30,000-token vocabulary, a hidden size of 768, 12 attention heads, and 12 hidden layers, resulting in 110 million total parameters. Other training hyperparameters, which were kept equal between PhotocatalysisBERT and PhysicalSciencesBERT, are as follows:

• Optimiser type: AdamW
• Adam beta 1: 0.9
• Adam beta 2: 0.999

The BERT models were trained on the Masked Language Modelling (MLM) task on the ALCF Polaris cluster containing NVIDIA A100 GPUs, with training distributed across multiple nodes using DeepSpeed 2. A batch size of 2,048 was used, and the models were pre-trained for 187,500 steps. The maximum sequence length was kept at 512 tokens.
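For illustration, the sketch below shows how an MLM pre-training run with the hyperparameters stated above could be configured using the Hugging Face Transformers Trainer. This is not the authors' training code: the masking probability, learning rate schedule, per-device batch size, tokenizer checkpoint, and DeepSpeed JSON file are not given in the text and are placeholders.

```python
# Minimal sketch of an MLM pre-training setup matching the stated hyperparameters
# (30,000-token WordPiece vocabulary, hidden size 768, 12 heads, 12 layers,
# AdamW with beta1=0.9 / beta2=0.999, global batch size 2,048, 187,500 steps,
# 512-token sequences). Values marked as assumptions are NOT from the paper.
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

config = BertConfig(
    vocab_size=30_000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=512,
)
model = BertForMaskedLM(config)  # ~110M parameters, as in BERT base

# Stand-in WordPiece tokenizer; the paper trained its own 30,000-token vocabulary.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # BERT's default masking rate, assumed here
)

args = TrainingArguments(
    output_dir="photocatalysisbert-pretrain",
    max_steps=187_500,
    per_device_train_batch_size=64,  # assumed; scaled across nodes to a global batch of 2,048
    adam_beta1=0.9,
    adam_beta2=0.999,
    deepspeed="ds_config.json",      # placeholder DeepSpeed config for multi-node training
)

# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=tokenized_corpus)  # tokenized_corpus: pre-tokenized text dataset
# trainer.train()
```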

Photocatalysis Extraction Results for All Models
All language models listed in this work were assessed in the same way as the PhotocatalysisBERT and PhysicalSciencesBERT models in the main paper, by evaluating exact match on our photocatalysis dataset 3. SQuAD v2.0 performance levels quoted are for the development set.
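As a rough illustration of exact-match scoring, the sketch below computes precision and recall over extracted records. The record fields, normalisation, and example values are illustrative assumptions; the gold annotations are defined by our photocatalysis dataset 3.

```python
# Hedged sketch of exact-match precision/recall for extracted records.
# Field names and normalisation are illustrative assumptions.
def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivially different strings still match exactly."""
    return " ".join(text.lower().split())


def exact_match_scores(predicted: list[dict], gold: list[dict]) -> tuple[float, float]:
    """Precision = correct predictions / all predictions; recall = correct / all gold records."""
    pred_set = {tuple(sorted((k, normalise(v)) for k, v in record.items())) for record in predicted}
    gold_set = {tuple(sorted((k, normalise(v)) for k, v in record.items())) for record in gold}
    correct = len(pred_set & gold_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


# Example with hypothetical records:
pred = [{"photocatalyst": "TiO2", "activity": "1200 umol h-1 g-1"}]
gold = [
    {"photocatalyst": "TiO2", "activity": "1200 umol h-1 g-1"},
    {"photocatalyst": "g-C3N4", "activity": "350 umol h-1 g-1"},
]
print(exact_match_scores(pred, gold))  # -> (1.0, 0.5)
```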


Table S1: Precision and Recall on our photocatalysis dataset 3 for BatteryOnlyBERT and BatteryBERT 4, models pre-trained on battery corpora from scratch and with continued training from BERT-base weights, respectively. BatteryOnlyBERT achieved a SQuAD v2.0 exact match of 71.8 and an F1 score of 75.7 on the development set. BatteryBERT achieved a SQuAD v2.0 exact match of 73.3 and an F1 score of 77.0 on the development set. As discussed in the main text, BatteryOnlyBERT performs better when extracting co-catalysts, but BatteryBERT performs better for all other properties.

Table S2: Precision and Recall on our photocatalysis dataset 3 for OpticalPureBERT and OpticalBERT 5, models pre-trained on optical-materials corpora from scratch and with continued training from BERT-base weights, respectively. OpticalPureBERT achieved a SQuAD v2.0 exact match of 73.0 and an F1 score of 77.0 on the development set. OpticalBERT achieved a SQuAD v2.0 exact match of 74.3 and an F1 score of 78.0 on the development set.

Table S3: Precision and Recall on our photocatalysis dataset 3 for DeBERTa v3 base and DeBERTa v3 large 6-8. DeBERTa v3 large achieved a SQuAD v2.0 exact match of 88.1 and an F1 score of 91.2 on the development set. DeBERTa v3 base achieved a SQuAD v2.0 exact match of 83.8 and an F1 score of 87.4 on the development set.

Table S4: Precision and Recall on our photocatalysis dataset 3 for MatSciBERT 9. The model achieved a SQuAD v2.0 exact match of 71.4 and an F1 score of 75.2 on the development set.

Table S5: Precision and Recall on our photocatalysis dataset 3 for SciDeBERTa 10. While trained on the same scientific corpus as PhysicalSciencesBERT (S2ORC 11), this model was trained on general scientific data rather than being restricted to the physical sciences, so it was plotted in the graphs in the main paper as a general-purpose model instead of a materials science model. The model achieved a SQuAD v2.0 exact match of 80.2 and an F1 score of 83.5 on the development set.

Table S6: Precision and Recall on our photocatalysis dataset 3 for BERT base uncased 12,13. Being trained with similar hyperparameters, this model is the closest analogue to PhotocatalysisBERT and PhysicalSciencesBERT trained on a general-purpose corpus. The model achieved a SQuAD v2.0 exact match of 73.7 and an F1 score of 77.9 on the development set.

Table S7: Precision and Recall on our photocatalysis dataset 3 for XLM RoBERTa large 14,15, a cross-lingual model. The model achieved a SQuAD v2.0 exact match of 81.8 and an F1 score of 84.9 on the development set.

Table S8: Precision and Recall on our photocatalysis dataset 3 for MiniLM 16,17, a model distilled from a BERT base-sized UniLM v2 18 model. The model achieved a SQuAD v2.0 exact match of 76.1 and an F1 score of 79.5 on the development set.

Table S9: Precision and Recall on our photocatalysis dataset 3 for tinyBERT 19,20, another distilled model. The model achieved a SQuAD v2.0 exact match of 71.9 and an F1 score of 76.4 on the development set.

Table S10: Precision and Recall on our photocatalysis dataset 3 for distilled BERT medium 21, where a BERT large model was used as the teacher model. The model achieved a SQuAD v2.0 exact match of 68.6 and an F1 score of 72.8 on the development set.