Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition

Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline, combining dictionary matching and rule-based keyword searching, automatically annotated enzyme entities in >4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an F1-score of 0.86 (precision = 1.00, recall = 0.76), surpassed by fine-tuned transformers for F1-score (BioBERT: 0.89, SciBERT: 0.88) and recall (0.86) with BiLSTM models having higher precision (0.94) than transformers (0.92). The annotation pipeline runs in seconds on standard laptops with almost perfect precision, but was outperformed by fine-tuned transformers in terms of F1-score and recall, demonstrating generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision, and BioBERT-based models exhibited higher recall, highlighting the importance of vocabulary and architecture. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Codes for automated annotation and model generation are available from https://github.com/omicsNLP/enzymeNER and https://zenodo.org/doi/10.5281/zenodo.10581586.
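As a quick consistency check on the scores above, the F1-score is the harmonic mean of precision and recall. The minimal sketch below recomputes it from the rounded precision and recall quoted for the annotation pipeline; the function name is ours for illustration and is not taken from the released code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values reported for the annotation pipeline.
pipeline_f1 = f1_score(precision=1.00, recall=0.76)
print(round(pipeline_f1, 2))  # 0.86
```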

• Supplementary Table S1. Additional examples where no metabolites are annotated within enzymes.
• Supplementary Table S2. Additional examples where metabolites are annotated within enzymes.

S1. BERN2 installation
We describe setting up the BERN2 model on a Linux workstation with a GPU. In the "run_bern2.sh" script, we changed "python" to "python3". Every time it runs, BERN2 creates a "log/" directory to record run details, which can help identify the location of a problem.
From this, we found an issue with the installation of GNormPlusJava (the gene/protein normalisation tool) which required us to redownload an alternate version of the CRF++ tool.
BERN2 may try to access a port that is already in use; if the process using the port is not important, it can be killed with the command fuser -k [port number]/tcp. If BERN2 is not running properly, it can be restarted cleanly by deleting the "log/" folder, stopping BERN2 using the appropriate script, and deleting any CUDA artefacts.
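Two of these checks can be scripted before a restart. The sketch below is an illustration under stated assumptions, not part of BERN2: the helper names are ours, and the port number and installation path in the usage lines are placeholders to be replaced with the values for the local setup.

```python
import shutil
import socket
from pathlib import Path


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds.
        return sock.connect_ex((host, port)) == 0


def remove_log_dir(bern2_root: str) -> bool:
    """Delete the "log/" folder under bern2_root; return True if it existed."""
    log_dir = Path(bern2_root) / "log"
    if log_dir.is_dir():
        shutil.rmtree(log_dir)
        return True
    return False


if __name__ == "__main__":
    PORT = 12345             # placeholder: use the port reported in the logs
    BERN2_ROOT = "./BERN2"   # placeholder: path to the local installation
    if port_in_use(PORT):
        print(f"Port {PORT} is busy; consider: fuser -k {PORT}/tcp")
    print("log/ removed" if remove_log_dir(BERN2_ROOT) else "no log/ folder found")
```

The port check only reports a conflict; killing the process is left as a manual step so that nothing important is stopped by accident.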

S2. Guidelines for annotators
Both of our annotators were knowledgeable about enzyme nomenclature. They were trained to use the TeamTat software [S1] and viewed documents independently. Rather than starting from scratch, they viewed documents annotated by our automated pipeline. This included identifiers for entities with exact matches to our dictionary entries (E.C. codes) or the root term identified. Both annotators were aware of the dictionary used, and of the changes we made to it (removing prefixes such as 'human' and 'bacterial' when the ontology contained multiple synonyms). The ontology did not contain synonyms for all types of isoforms, and annotators were asked to include these in their annotations. Likewise, we initially asked our annotators to add identifiers for the annotations they made; however, this slowed down the task considerably, and it was decided to drop this aspect as it was not relevant for the NER task of this work. For this reason we did not use TeamTat's tools for reporting inter-annotator agreement, as the metrics also require a matching conceptID (i.e. identifiers) for a complete match.
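Agreement can still be measured on spans alone when identifiers are not recorded. The following is a minimal sketch of such a span-level comparison, not the procedure used in this work: it treats each annotation as a (document ID, start offset, end offset) triple, ignores concept identifiers, and reports pairwise F1 over exact span matches. The tuple layout, function name, and toy offsets are assumptions for illustration.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[str, int, int]  # (document ID, start offset, end offset)


def span_agreement_f1(annotator_a: Iterable[Span], annotator_b: Iterable[Span]) -> float:
    """Pairwise F1 over exact span matches, ignoring concept identifiers."""
    spans_a: Set[Span] = set(annotator_a)
    spans_b: Set[Span] = set(annotator_b)
    if not spans_a and not spans_b:
        return 1.0  # neither annotator marked anything: full agreement
    shared = len(spans_a & spans_b)
    return 2 * shared / (len(spans_a) + len(spans_b))


# Toy example with made-up offsets.
a = [("doc1", 10, 32), ("doc1", 80, 95), ("doc2", 5, 20)]
b = [("doc1", 10, 32), ("doc2", 5, 20), ("doc2", 40, 52)]
print(round(span_agreement_f1(a, b), 2))  # 0.67 (2 shared spans, 3 + 3 in total)
```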
After the first round of individual annotations (in which they could see only the machine annotations), the annotators worked in collaborative mode, in which they could see the other person's annotations alongside their own. This process resolved the vast majority of conflicts and greatly improved the inter-annotator agreement. Any differences that remained were then identified and an arbiter set up to rule on these. However, in our case, after the second round the only differences were some missed annotations that were already annotated elsewhere in the document. We worked with only two annotators, but TeamTat allows multiple annotators to work simultaneously on the task.

Table S1: Example output from metabolite NER [S2] (underlined) and enzyme NER (in bold). All sentences (in addition to the table in the main manuscript) for all 18 articles used for the evaluation (see Materials and Methods). Correct annotations are indicated in green, false positive metabolite annotations within enzymes in red, and false negative and other false positive metabolite annotations in orange. Note: formatting (e.g., superscript) is removed after processing with Auto-CORPus [S3] before NER. The table is split in two parts; part A contains all cases where no metabolites are annotated within enzymes.

A. No metabolites annotated within enzymes
PMC3406255: "We also determined that pyruvate dehydrogenase, turnover of the TCA cycle, anaplerosis and de novo glutamine and glycine synthesis contributed significantly to the ultimate disposition of glucose carbon."
PMC3406255: "The 4-5 doublet in glutamate and glutamine carbon 4 is derived from [1,2-13C]acetyl-CoA produced from [U-13C]glucose, demonstrating that glucose was metabolized to acetyl-CoA via pyruvate dehydrogenase (PDH)."
PMC3406255: "Detection of glycine in the breast metastasis is of particular interest since phosphoglycerate dehydrogenase, an enzyme in the serine/glycine biosynthesis pathway, is commonly over-expressed in human breast adenocarcinoma, is amplified at the genomic level in a subset of these tumors, and is essential for tumor growth in a human breast cancer xenograft model [29]."
PMC3406255: "For example, mutations in isocitrate dehydrogenase isoforms 1 and 2 are commonly found in low-grade gliomas and influence intermediary metabolism, including some of the pathways analyzed in this work [44][45][46]."
PMC4525767: "However, another equally plausible hypothesis is that reductions in the activity of branched chain ketoacid dehydrogenase (BCKD) and tyrosine aminotransferase (TAT) in states of insulin resistance lead to increased tissue and circulating BCAA and AAA concentrations, respectively [19]."
PMC4525767: "Metformin has previously been shown to reduce gluconeogenesis and hepatic glucose production [56] and this effect may be at least partially mediated via reduction in glutamic acid/glutamate concentration, although gluconeogenesis is highly regulated by the rate-limiting enzyme, phosphoenolpyruvate carboxykinase."
PMC4525767: "There is also recent evidence demonstrating that metformin suppresses gluconeogenesis by inhibiting mitochondrial glycerol phosphate dehydrogenase [57]."
PMC5287439: "N1-methylinosine is found 3' adjacent to the anticodon at position 37 of eukaryotic tRNAs and is formed from inosine by a specific S-adenosylmethionine-dependent methylase.25"
PMC6224486: "Moreover, a tryptophan catabolic enzyme, indoleamine 2,3-dioxygenase, has been reported as a central driver of malignant development and progression (39)."
PMC6316856: "Serum glucose, HbA1c, triglycerides, total cholesterol, LDL and HDL-cholesterol, alanine aminotransferase (ALT), aspartate aminotransferase (AST), gamma-glutamyl transpeptidase (γ GT), creatinine, uric acid, vitamin D, folic acid, ferritin, c-reactive protein, thyroxin and thyroid-stimulating hormone were measured in a certified clinical laboratory, using standard protocols."
PMC6875299: "In the glucose-alanine cycle pathway, glutamate dehydrogenase in muscle catalyzes the binding of α-ketoglutaric acid to ammonia to form glutamate, followed by glutamate catalyzed by alanine aminotransferase; pyruvic acid forms alpha-ketoglutarate and alanine [30]."

Table S2: Example output from metabolite NER [S2] (underlined) and enzyme NER (in bold). All sentences (in addition to the table in the main manuscript) for all 18 articles used for the evaluation (see Materials and Methods). Correct annotations are indicated in green, false positive metabolite annotations within enzymes in red, and false negative and other false positive metabolite annotations in orange. Note: formatting (e.g., superscript) is removed after processing with Auto-CORPus [S3] before NER. The table is split in two parts; part B contains all cases where metabolites are falsely contained within enzymes.
B. Annotation of metabolite within an enzyme, false positive metabolite annotations can be filtered out