Automated Annotation of Untargeted All-Ion Fragmentation LC–MS Metabolomics Data with MetaboAnnotatoR

Untargeted metabolomics and lipidomics LC–MS experiments produce complex datasets, usually containing tens of thousands of features from thousands of metabolites whose annotation requires additional MS/MS experiments and expert knowledge. All-ion fragmentation (AIF) LC–MS/MS acquisition provides fragmentation data at no additional experimental time cost. However, analysis of such datasets requires reconstruction of parent–fragment relationships and annotation of the resulting pseudo-MS/MS spectra. Here, we propose a novel approach for automated annotation of isotopologues, adducts, and in-source fragments from AIF LC–MS datasets by combining correlation-based parent–fragment linking with molecular fragment matching. Our workflow focuses on a subset of features rather than trying to annotate the full dataset, saving time and simplifying the process. We demonstrate the workflow in three human serum datasets containing 599 features manually annotated by experts. Precision and recall values of 82–92% and 82–85%, respectively, were obtained for features found in the highest-rank scores (1–5). These results equal or outperform those obtained using MS-DIAL software, the current state of the art for AIF data annotation. Further validation for other biological matrices and different instrument types showed variable precision (60–89%) and recall (10–88%) particularly for datasets dominated by nonlipid metabolites. The workflow is freely available as an open-source R package, MetaboAnnotatoR, together with the fragment libraries from Github (https://github.com/gggraca/MetaboAnnotatoR).


The serum samples were analysed by reversed-phase (C8) ultra-performance liquid chromatography using gradient elution (Lipid+ and Lipid- datasets) as well as by hydrophilic interaction (HILIC) ultra-performance liquid chromatography (HILIC+ dataset). Prior to analysis, all samples were thawed, and serum proteins were precipitated using cold isopropanol (Lipid datasets) or cold acetonitrile (HILIC dataset); the samples were then incubated for 2 h at -20 °C and centrifuged. A quality control (QC) sample, prepared by pooling all analysed samples, was used for LC column equilibration and analytical drift correction (internal QC). Additionally, a commercial serum sample (external QC) and another serum sample unrelated to the study (long-term reference) were used to assess inter-batch variability. Each serum supernatant was analysed on an ACQUITY UPLC® system coupled to a Xevo G2-S ToF mass spectrometer (Waters, Milford, MA, USA). For the Lipid datasets, the samples were separated on an ACQUITY UPLC® BEH C8 1.7 µm column at 55 °C. Mobile phase A was 5 mM ammonium acetate + 0.05% acetic acid in a 25:25:50 mixture of isopropanol, acetonitrile, and ultra-pure water; mobile phase B was 5 mM ammonium acetate + 0.05% acetic acid in a 50:50 mixture of acetonitrile and isopropanol. After injection of 10 μL of sample, the chromatography was run at a flow rate of 0.6 mL/min using the gradient: 99% A (0-2 min); 70% A (2-11.5 min) and 10% A (11.5-12 min).
The HILIC dataset was collected on the same UPLC instrumental setup using a 2.1 × 150 mm ACQUITY BEH HILIC column (Waters Corp., Milford, MA, USA) maintained at 40 °C during analysis. The mobile phases consisted of acetonitrile with 0.1% formic acid (50:50 mixture) (mobile phase A) and 20 mM ammonium formate in water with 0.1% formic acid (mobile phase B). The chromatographic separation was run at a flow rate of 0.6 mL/min. After sample injection, a 0.1 min isocratic separation occurred at initial conditions (95% A). This was followed by a linear gradient from 95% to 80% A between 0.1 and 4.6 min. A more rapid gradient was then applied from 80% to 50% A between 4.6 and 5.50 min, followed by an isocratic period between 5.50 and 7.00 min (50% A). The gradient conditions were changed to 95% A at 7.10 min and the flow rate was gradually increased to 1 mL/min until 12.50 min. After this time, the flow rate was returned to 0.6 mL/min until 15 min to re-establish the initial conditions and enable the injection of a new sample.
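The HILIC gradient program described above can be written down as a small table of breakpoints. The following is only an illustrative sketch: the names `HILIC_GRADIENT` and `percent_a` are hypothetical, and the linear change between 7.00 and 7.10 min is an assumption, since the text states only that the conditions were changed to 95% A at 7.10 min.

```python
# Breakpoints (time in min, % mobile phase A) transcribed from the text.
# Assumption: the return from 50% to 95% A between 7.0 and 7.1 min is linear.
HILIC_GRADIENT = [
    (0.0, 95.0),
    (0.1, 95.0),   # end of initial isocratic hold
    (4.6, 80.0),   # end of slow linear gradient
    (5.5, 50.0),   # end of rapid gradient
    (7.0, 50.0),   # end of isocratic period at 50% A
    (7.1, 95.0),   # back to initial composition
    (15.0, 95.0),  # re-equilibration until the next injection
]


def percent_a(t_min, program=HILIC_GRADIENT):
    """Return % mobile phase A at time t_min by piecewise-linear interpolation."""
    if not program[0][0] <= t_min <= program[-1][0]:
        raise ValueError("time is outside the gradient program")
    for (t0, a0), (t1, a1) in zip(program, program[1:]):
        if t0 <= t_min <= t1:
            return a0 + (a1 - a0) * (t_min - t0) / (t1 - t0)


if __name__ == "__main__":
    for t in (0.05, 2.35, 5.0, 6.0, 10.0):
        print(f"{t:5.2f} min: {percent_a(t):.1f}% A")
```

The flow-rate changes (0.6 to 1 mL/min and back) could be tabulated in the same way, but they are left out here to keep the sketch short.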
For both RP-C8 and HILIC separations, the MS data were collected separately in positive and negative electrospray ionization modes. The capillary voltage was set to 1.5 kV for positive mode and 1.0 kV for negative mode, the cone voltage was 20 V, and the source temperature was set at 120 °C, with a cone gas (nitrogen) flow rate of 50 L/h, a desolvation gas temperature of 600 °C, and a nebulization gas (nitrogen) flow of 1000 L/h. MS data were acquired in MSE data acquisition mode, in which MS scans are acquired by alternating all-ion fragmentation with no fragmentation.3 Mass spectral data were collected in centroid mode using a mass range of 50-2000 m/z for low-collision energy MS scans and 100-2000 m/z for high-collision energy MS scans for RP-C8, and 50-1200 m/z for both low- and high-collision energy scans for HILIC. A low collision energy (4 eV) was used when no fragmentation was employed (odd scans), and a high collision energy ramp (10-30 eV) was used for the fragmentation scans (even scans). Leucine enkephalin (2 ng/μL, 50% ACN, 0.1% FA), infused at 20 µL/min, was used for lock mass correction. Lock mass data were collected every 60 s for 0.2 s.
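As a minimal illustration of the alternating acquisition scheme described above, the sketch below separates a list of scans in acquisition order into low-energy (odd-numbered) and high-energy (even-numbered) scans. The function name `split_mse_scans` and the list-based scan representation are hypothetical; in practice the vendor software stores the two scan types as separate functions.

```python
def split_mse_scans(scans):
    """Split scans given in acquisition order into (low_energy, high_energy).

    Scan numbering is 1-based as in the text: odd scans are acquired without
    fragmentation (MS1), even scans with all-ion fragmentation (AIF).
    """
    low_energy = scans[0::2]   # scans 1, 3, 5, ...
    high_energy = scans[1::2]  # scans 2, 4, 6, ...
    return low_energy, high_energy


if __name__ == "__main__":
    ms1, aif = split_mse_scans(["scan1", "scan2", "scan3", "scan4", "scan5"])
    print("MS1:", ms1)
    print("AIF:", aif)
```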
The HILIC negative-mode dataset was not used in this work due to the lower number of annotations available for it. The LC-MS chromatograms were converted to netCDF format using Waters DataBridge software, and each function 1 (MS1, low collision energy) was imported and processed using XCMS v. 3.1 to produce feature tables, from which the features were selected for manual and automated annotation. Features were annotated manually using MS/MS spectra acquired using data-dependent acquisition and MSE acquisition. The XCMS processing parameters are given in Table S1. Briefly, after importing the data files, the chromatograms from MS1 and AIF were peak picked using the centWave algorithm and treated as a single dataset. To reduce the computational time, the Lipid- and HILIC+ datasets, which were larger than Lipid+, were peak picked using higher values for prefiltering (Table S1). Before non-linear RT alignment, a first grouping was applied using the bandwidths detailed in Table S1. A second grouping was performed after RT correction. Finally, any missing peaks were gap-filled using the function "fillPeaks" with the method = "chrom" option. Each of the three datasets was processed separately.

Processing of MESA UPLC-MS datasets using XCMS and RAMClustR
RAMClustR was run on each of the final XCMS processing objects, using the function "ramclustR" after specifying the experimental metadata and indicating the tags for MS1 and AIF scans. The resulting RAMClustR object, whose clusters group the features related to the same parent ion (i.e., the deconvoluted pseudo-MS/MS spectra), was used as input for annotation by MetaboAnnotatoR.
... divided equally between the other fragments, so that the total score sums to 1.

Construction of the metabolite fragment libraries
For the non-lipid metabolites, due to their structural diversity, experimental MS/MS spectra were used. These consisted of MS/MS spectra from the MassBank and GNPS databases acquired on different types of instruments, from low-resolution triple quadrupoles to high-resolution Fourier transform-type instruments, using a wide range of collision energies between 5 and 65 eV, including collision energy ramps, as detailed in Supplementary File 1.
The occurrence score for each fragment was calculated from the MS/MS peak relative intensities. A score of 0.9 was divided equally between the peaks above 10% of the most intense peak. A score of 0.1 was divided equally between the peaks with relative intensities between 0.05% and 10%, and peaks below 0.05% relative intensity were considered noise. Custom libraries can be imported into MetaboAnnotatoR from .txt and .msp formats.
Figure S1 - Agreement between manual and automated annotations as a function of scoring weights
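The occurrence-score rule for the experimental MS/MS spectra can be sketched as follows. This is an illustrative reading of the description, not the package's implementation: the function name `occurrence_scores`, the dictionary input, and the handling of peaks at exactly 10% (assigned to the lower band) are assumptions. The text also does not say what happens when no peak falls in the 0.05-10% band; in this sketch the scores then sum to 0.9.

```python
def occurrence_scores(peaks, major_cut=10.0, noise_cut=0.05):
    """Assign occurrence scores to the fragments of one MS/MS spectrum.

    peaks: dict mapping fragment m/z to absolute intensity.
    Cut-offs are relative intensities in % of the most intense peak.
    """
    if not peaks:
        raise ValueError("spectrum has no peaks")
    base = max(peaks.values())
    if base <= 0:
        raise ValueError("base peak intensity must be positive")

    relative = {mz: 100.0 * inten / base for mz, inten in peaks.items()}
    # Peaks above 10% of the base peak share a score of 0.9.
    major = [mz for mz, r in relative.items() if r > major_cut]
    # Peaks between 0.05% and 10% share a score of 0.1 (assumed inclusive).
    minor = [mz for mz, r in relative.items() if noise_cut <= r <= major_cut]
    # Peaks below 0.05% are treated as noise and receive no score.

    scores = {mz: 0.9 / len(major) for mz in major}
    scores.update({mz: 0.1 / len(minor) for mz in minor})
    return scores


if __name__ == "__main__":
    # Made-up intensities, for illustration only.
    spectrum = {100.0: 1000.0, 150.0: 500.0, 200.0: 50.0, 250.0: 0.2}
    for mz, score in sorted(occurrence_scores(spectrum).items()):
        print(f"m/z {mz:.1f}: score {score:.3f}")
```

With the made-up spectrum above, the two peaks above 10% each receive 0.45, the 5% peak receives 0.1, and the 0.02% peak is discarded as noise.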

Supporting figures
Using the MESA Lipid+ dataset, the matching weight parameter was varied from 0.5 to 1 and its effect on the agreement between manual annotations and the automated annotations provided by MetaboAnnotatoR was inspected (Fig. S1).

A matching weight of 0.5 was considered adequate for the dataset and was applied throughout the automated annotations reported in the text.
A typical feature annotation output obtained using RAMClustR pseudo-MS/MS objects is shown in Fig. S2. No EICs are shown in this graphical report because the RAMClustR object only contains processed data, and EICs can only be obtained from the raw chromatograms.
Figure S2 - Automated annotation results using RAMClustR object