Unsupervised Discovery and Comparison of Structural Families Across Multiple Samples in Untargeted Metabolomics

In untargeted metabolomics approaches, the inability to structurally annotate relevant features and map them to biochemical pathways is hampering the full exploitation of many metabolomics experiments. Furthermore, variable metabolic content across samples result in sparse feature matrices that are statistically hard to handle. Here, we introduce MS2LDA+ that tackles both above-mentioned problems. Previously, we presented MS2LDA, which extracts biochemically relevant molecular substructures (“Mass2Motifs”) from a collection of fragmentation spectra as sets of co-occurring molecular fragments and neutral losses, thereby recognizing building blocks of metabolomics. Here, we extend MS2LDA to handle multiple metabolomics experiments in one analysis, resulting in MS2LDA+. By linking Mass2Motifs across samples, we expose the variability in prevalence of structurally related metabolite families. We validate the differential prevalence of substructures between two distinct samples groups and apply it to fecal samples. Subsequently, within one sample group of urines, we rank the Mass2Motifs based on their variance to assess whether xenobiotic-derived substructures are among the most-variant Mass2Motifs. Indeed, we could ascribe 22 out of the 30 most-variant Mass2Motifs to xenobiotic-derived substructures including paracetamol/acetaminophen mercapturate and dimethylpyrogallol. In total, we structurally characterized 101 Mass2Motifs with biochemically or chemically relevant substructures. Finally, we combined the discovered metabolite families with full scan feature intensity information to obtain insight into core metabolites present in most samples and rare metabolites present in small subsets now linked through their common substructures. We conclude that by biochemical grouping of metabolites across samples MS2LDA+ aids in structural annotation of metabolites and guides prioritization of analysis by using Mass2Motif prevalence.


S1. Urine Samples
Urine samples from anonymized human volunteers were used from a clinical sample set in the Glasgow Polyomics archive. These samples were obtained as part of a trial for which ethical approval was applied for through the Multi-Centre Research and Ethics Committee (MREC), which was granted by the Scottish MREC and (with MREC N°06/MRE00/106). Informed consent was obtained from all individual study participants. Spot urine samples were obtained from the cohort of elderly hypertensive patients upon their first admission in the clinic.
Urine extracts of 22 patients were selected as follows: diagnosed with stroke, administering a variety of drugs including a number of antihypertensives, and availability of the sample extract in the Glasgow Polyomics archive. The resulting subject's age range spanned from 52 to 85; 13 were male, and 9 female; 4 were smokers (and 4 were reported as ex-smokers); 2 were reported to have diabetes; 13 were diagnosed with hypertension; and each subject reportedly took at least one antihypertensive drug. All recorded details of the patients can be found in Table S1.

S2. Beer Sample Specifications
10 mL samples of 18 different beers were collected from bottles over a period of 5 months and frozen immediately after sampling. One beer was sampled twice from different bottles. S-3

Data acquisition:
Blank runs, quality control samples (beer and serum extracts in accordance with standard procedures at Glasgow Polyomics) to assess the performance of the mass spectrometer in terms of chromatography and mass intensities, and three standard mixes containing 150 reference compounds were run to assess the quality of the mass spectrometer and to aid in metabolite annotation and identification 1 . The pooled sample was run prior to and across the batch every 6th sample to monitor the stability and quality of the LC-MS run, whereas the samples were run in a randomized order. Immediately after acquisition, all raw files were converted into mzXML format, thereby centroiding the mass spectra and separating positive and negative ionization mode spectra into two different mzXML files using the command line version of MSconvert (ProteoWizard). Accurate masses of standards were obtained well within 3 ppm accuracy and intensities of the quality control samples (a beer extract and a serum extract) were within specifications.

Data Conversion:
Structural family discovery in MS2LDA+ begins with the data conversion stage, which prepares the acquired full scan and fragmentation data into suitable input format.
MzXML (full scan) files are processed using XCMS 2 and MzMatch 3 for LC-MS peak detection and alignment. MS1 data in .mzXML format is processed using the CentWave algorithm from the XCMS library, resulting in detected peaks. This is converted into the peakML format used by the MzMatch library. Alignment is performed to match peaks across samples in different peakML files, resulting in a table of conserved precursor (MS1) peaks across samples.
For fragmentation data, Our MS2LDA+ pipeline takes as input fragmentation files in .mzML format. Peaks are again detected from fragmentation mzML files using the CentWave algorithm. We then relied on strategies developed by RMassBank 4 to find and select MS2 spectra with MS1 features. A linking step via greedy search is performed to match the most intense MS2 spectrum in a scan to its precursor MS1 peak, and a filtering step based on RT and intensity is applied.
For each pair of full-scan (MS1) data and fragmentation (MS2) data, a bag-of-word count matrix is produced. Entries in this matrix is the co-occurrence counts of discrete MS2 features in a fragmentation spectra. Two types of features are extracted: fragment (discretised m/z values of MS2 peaks) and loss (discretised neutral loss values) features. The same set of discrete features have to be defined across MS2 spectra, so we used a gap-based method to discriminate between different MS2 features.
These count matrices are used as input to the MS2LDA+ model, which infers the presence of structural families across multiple sets of MS1-MS2 data (corresponding to multiple files, where a file is the resulting MS1-MS2 pairing from acquired data from a sample). The model is described in the next section. S-4

S4. MS2LDA+ model
The next step for structural family discovery is the discovery of Mass2Motifs (patterns of co-occurring fragment and loss features). For MS2LDA+, we extend the standard LDA model by allowing a single set of motifs (also known as topics in standard LDA) to be shared across multiple files. This results in the following graphical model in Figure S4-1B. Here, the document-to-topic distribution is specific to each file , but the multinomial topic distributions (drawn from the Dirichlet with parameter vector ) are shared across multiple files. The file-specific prevalence of the different topics for the -th file is captured by a parameter vector .
The generative process of fragmentation spectra in the MS2LDA+ model is therefore: 1. For each file, draw ~( ).

For each feature in fragmentation spectra in file :
a. Draw a Mass2Motif (topic) assignment for the feature, ~( ).
S-5 b. Draw a fragment or loss feature, f from ( f | , ), a multinomial probability conditioned on the topic assignment .
Note that unlike the standard LDA model, each file now has its own prior Dirichlet distribution parameterised by and all documents in the same file has their document-to-topic distributions drawn from the same prior Dirichlet specific to the file.
For inference, we have developed a variational Bayes 5 scheme in which each file is modelled with standard LDA except for the updates of the motifs which are pooled across the files. Within each file, the e-step (updating the variational parameters (document-specific Dirichlet) and (word to motif probabilities) (see e.g. 5 ) are exactly as in the original LDA model. The updates for are also identical. The key difference lies in the update for the feature probabilities for each motif ( ) which is now influenced by all the files. The update for is now: where indexes files, spectra (within a file), fragment/loss feature (within a spectra). Here, is the probability that the -th feature in spectra of file is assigned to motif , while is 1 if the -th feature in spectra (in file ) is feature .
Our variational scheme therefore proceeds by alternating an update of the values in with updates of the and parameters within each file. We run the algorithm until we see satisfactory convergence of the values in between successive updates. In all experiments here, we ran 1000 iterations (resulting in total absolute change in of less than 1e-4).

S5. Overlap Score
The overlap score measures how much of the Mass2Motif is present in the spectrum rather than how much of the spectrum is explained by the Mass2Motif. The higher that score, the more mass fragments and neutral losses from the Mass2Motif can be found in the fragmentation spectrum of the parent ion, increasing the confidence that the substructure is indeed present.
The overlap score is computed from a combination of the individual feature to motif probabilities ( ) and the motif to feature probabilities . The score between a spectra and a motif is computed as (dropping the file index for clarity):

= ∑
where is any for which = 1. I.e. the probability of any instance of feature in spectra belonging to motif (in the variational inference scheme these values are identical for the same feature ).
This can be interpreted as how much of the -th motif is represented in the -th spectra. For example, if a Mass2Motif has several features with non-zero probability and the instances of these features in the spectra are all assigned to motif with a probability of 1, the overlap score would be 1. If the spectra has half the features (as measured by probability) and each is assigned to this topic with probability 1 then the overlap score would be 0.5. Figure S4-2. Visualisation of the overlap score with two hypothetical molecular spectra and three hypothetical Mass2Motifs. In example A, molecule 1 has an overlap score of 1 with Mass2Motif 1 as the motif is in the molecule in its entirety. The probability of this connection is 0.5 as it accounts for half of the molecule's total intensity. In example B, the same molecule has an overlap score of (approximately) 0.5 with Mass2Motif 2 as it includes roughly half of the Mass2Motif. The probability is again 0.5. Molecule 2's entire spectrum is explained by Mass2Motif 3 (probability is 1.0) but these peaks only account for about 40% of the Mass2Motif so the overlap is 0.4. All codes will be available for download from the university repository (http://researchdata.gla.ac.uk/402/ -DOI: 10.5525/gla.researchdata.402). All codes can also be found in GitHub (https://github.com/sdrogers/lda).

S6. Tables of Characterised Mass2Motifs
Table S6-1: with A) 30 most variable Mass2Motifs of which 23 were structurally characterized as containing a biochemically relevant substructure -22 of which can be described as xenobiotic (drug, food-related, or another source, B) 30 least variable Mass2Motifs of which 21 were structurally characterized as containing a biochemically relevant substructure, and C) an additional 57 structurally characterized Mass2Motifs with biochemical relevant substructures or with related ion products like isotopes.

S7. Pearson correlation between the vectors
Pearson correlation between the vectors for each sample was used to visualise the similarity between the Mass2Motifs profiles with edges between Mass2Motifs where their correlation exceeded a threshold. Computing the Pearson correlation between values for Mass2Motifs and visualising the results in a network highlighted xenobiotic-related substructures (like drugs) as they tended to not correlate with other Mass2Motifs (other than originating from the same drug i.e., group of paracetamol related Mass2Motifs) (see Figure S7-1). Endogenous Mass2Motifs were more highly connected, reflecting their existence within metabolic networks and pathways. This property aids in the classification, highlighting, and subsequent characterisation of xenobiotic Mass2Motifs and therefore metabolites.  S-13 S8. Table of core acylcarnitine metabolites found across 22 Urines   Table S8-1: 47 core acylcarnitines were annotated from the fragmented metabolites containing the acylcarnitine Mass2Motif. 30% of the annotated core acylcarnitine species could be matched to HMDB 6 entries.

S13. Stool sample analysis
Stool samples of Crohn's Disease patients and healthy controls for which LC-MS/MS data were available were analysed with MS2LDA+. Faecal extracts represent a challenging matrix influenced by many factors such as diet, drug administration, gut microbiota, and the individual's metabolome. Crohn's disease is a chronic inflammatory condition of the gut for which no curable treatment is available. To study metabolic differences during disease induction treatment with exclusive enteral nutrition (EEN) 8,9 , samples from children with Crohn's disease and healthy controls were analysed with HILIC LC-MS/MS and MS2LDA+ as follows: LC-MS/MS data in positive ionization mode for two time series of CD patients and two healthy controls was obtained using HILIC chromatography as described in Creek et al. 1 and mass spectrometry as described in the "MS and MS/MS settings" section of this paper and 7,10 . After pre-processing using a 2E6 minimum intensity cut-off for MS1 feature inclusion, MS2LDA+ was run on -on average -2664 (±384) fragmented MS1 features per sample, indicating that both abundant and lesser abundant metabolite features were included. 300 Mass2Motifs were discovered in the 12 samples, of which 30 were subsequently characterized. The substructure PCA plot was constructed and differential prevalence analysis was done between after treatment and healthy control samples in one group versus during treatment samples in the other group.
The substructure-based PCA showed separation of during treatment samples versus samples taken after treatment and from healthy controls ( Figure S13-A), also indicating commonalities between gut substructure contents of healthy controls and patients that finished their treatments. Differential prevalence analysis showed adenine and guanine substructures and H2O-rich losses being significantly prevalent in the healthy control/end-of-treatment group, as is also clear from the topic prevalence plots ( Figure S13-B-S13-D). Currently undergoing MS1-based analyses on a larger sample set of the same study point to treatment related changes in, among others, phosphatidylcholines and 1-Deoxynojirimycin, indicating the complementary nature of traditional MS1-based metabolomics workflows and MS2LDA+.
The stool sample analysis shows promising results and with increased amounts of fragmentation data becoming available and more routinely acquired, MS2LDA+ will offer a workflow to produce complementary results to classical MS1-based metabolomics workflows. S-20