OnPLS-Based Multi-Block Data Integration: A Multivariate Approach to Interrogating Biological Interactions in Asthma

Integration of multiomics data remains a key challenge in fulfilling the potential of comprehensive systems biology. Multiple-block orthogonal projections to latent structures (OnPLS) is a projection method that simultaneously models multiple data matrices, reducing feature space without relying on a priori biological knowledge. In order to improve the interpretability of OnPLS models, the associated multi-block variable influence on orthogonal projections (MB-VIOP) method is used to identify variables with the highest contribution to the model. This study combined OnPLS and MB-VIOP with interactive visualization methods to interrogate an exemplar multiomics study, using a subset of 22 individuals from an asthma cohort. Joint data structure in six data blocks was assessed: transcriptomics; metabolomics; targeted assays for sphingolipids, oxylipins, and fatty acids; and a clinical block including lung function, immune cell differentials, and cytokines. The model identified seven components, two of which had contributions from all blocks (globally joint structure) and five that had contributions from two to five blocks (locally joint structure). Components 1 and 2 were the most informative, identifying differences between healthy controls and asthmatics and a disease–sex interaction, respectively. The interactions between features selected by MB-VIOP were visualized using chord plots, yielding putative novel insights into asthma disease pathogenesis, the effects of asthma treatment, and biological roles of uncharacterized genes. For example, the gene ATP6 V1G1, which has been implicated in osteoporosis, correlated with metabolites that are dysregulated by inhaled corticoid steroids (ICS), providing insight into the mechanisms underlying bone density loss in asthma patients taking ICS. These results show the potential for OnPLS, combined with MB-VIOP variable selection and interaction visualization techniques, to generate hypotheses from multiomics studies and inform biology.

I n the postgenomic era, data-driven science has become increasingly necessary because of the vast array of instrumentation that is capable of generating thousands of data points for a single analytical observation. 1,2 In addition to using classical univariate statistical methods, machine-learning techniques have become routinely used to interrogate and understand vast amounts of data. 3,4 Two common characteristics of -omics data are that the number of measured variables is vastly greater than the number of observations 5 and that there is a degree of multicollinearity between variables. 6 As such, computational methods that project high dimensional data into a smaller number of component variables have become commonplace. 7 Multivariate projection methods such as principal components analysis (PCA), 8 partial least squares discriminant analysis (PLS-DA), 9 and canonical variate analysis (CVA), 8 together with hierarchical cluster analysis (HCA), 10 random forests, 11 and support vector machines (SVM), 12 are all used to analyze -omics data. 3,4 PLS-DA and its extension, orthogonal projection to latent structures discriminant analysis (OPLS-DA), 13,14 have become popular projection methods in the metabolomics community. 15 As modeling methods become increasingly complicated, they have also become concomitantly difficult to interpret. Assignment of the variable importance often becomes an a posteriori statistical process based on either permutation testing or random resampling (e.g., confidence intervals derived from bootstrap/jackknife statistics). 16 For methods based on a PLS algorithm, the direct statistical method of variable influence on projection (VIP) 17,18 is often used to estimate variable contribution to the resulting models.
In recent years, as the -omics sciences have matured, it has become common to acquire data from multiple -omics platforms in a single biological experiment. As such, each biological sample is interrogated by multiple analytical platforms, which in turn can be linked to multiple sources of experimental metadata. Data from each platform (or measurement context) can be considered a discrete block, with multiple blocks making up the complete data set of the experiment. Multivariate projection methods such as OPLS-DA have proven successful in modeling the underlying latent biological structure within a single high dimensional data block; however, they are theoretically unsuitable for modeling multiple data blocks simultaneously. There are two reasons for this issue. First, if multiple data blocks are concatenated into a single matrix, with no accounting for measurement context, then the subsequent model can be considered as a single projection model, where the weighting of each variable is governed by the total sum of squares. 19 This, in principle, demands that each block is normalized to the same size, to avoid a projection model that is biased toward the impact of the data set with the most variables. In practice, this can be problematic, particularly when there are a mix of blocks of vastly different sizes. For example, in a model concatenating 20 000 transcripts, 200 metabolites, and 20 clinical variables, the transcripts would over-represent the global data structure and thus have a larger contribution to the resulting model. In multiblock modeling, this is not an issue, as each block is treated independently. This approach leaves flexibility to scale individual variables according to importance and also to keep variables in their original unit. Second, each individual data set is associated with its own underlying structure, 19,20 describing the true biological variance and also platform-specific measurement error. Covariance of biological latent structure across multiple data blocks is implicit; however, it is a fair assumption that the measurement error across multiple blocks will be independent and thus easily ignored at this block-interaction level. Conversely, if multiple data blocks are concatenated into a single data set before projection, the model will struggle to effectively separate true biological structure from block-specific noise and result in erroneous interpretation of the conglomerate projection model.
To address the need for multivariate methods to simultaneously model multiple data matrices, a number of multi-block data integration methods have been proposed. 21−23 In 2011, Lofstedt and Trygg 24 proposed a novel multi-block multivariate method called OnPLS, which utilizes the framework of OPLS to decompose data from more than two input matrices. Multiblock models, such as OnPLS, are fully symmetric, meaning each data block is weighted to allow an equal contribution to the model, regardless of the number of variables or underlying data structure within each block. 25 Multi-block approaches offer further advantages over single block or block concatenation in biomarker discovery. First, the validity of any true biological biomarker is significantly increased if there is a clear covariance between data blocks, thus reducing the possibility of false discovery. 26 Second, contrary to block-concatenation modeling, which is strongly biased toward the globally joint variation, multi-block analysis decomposes the different levels of variation (global, local, unique) 27 such that relatively small but informative trends are also identified. Recently, Galindo-Prieto et al. adapted the VIP concept for multi-block data analysis (multi-block variable influence on orthogonal projections method, 28 MB-VIOP) to identify the variables that contribute to these different levels of joint structure.
The aim of this study was to combine OnPLS and MB-VIOP with data visualization methods to create a workflow capable of simultaneously modeling and investigating interactions between multiple -omics data blocks. The study chosen for this purpose was a subset from a previously reported asthma cohort, for which multiple -omics data sets were acquired in isolation. 29,30 These analyses included untargeted metabolomics, targeted metabolite assays, differential immune cell population analyses, and cytokine arrays. Additionally, for the present study, transcriptomics of peripheral blood T cells was performed. OnPLS modeling and MB-VIOP were then used to integrate the disparate data blocks into a single model, which was then interrogated to identify novel interactions between the data blocks and disease status as well as other clinical end points.

■ EXPERIMENTAL SECTION
Clinical Cohort. Briefly, 12 healthy controls and 10 severe asthmatics were included from the original study. 29 Transcriptomics was subsequently performed on peripheral blood T cells, and metabolomics/metabolic profiling assays were performed on serum. All participants were enrolled from the NIHR Southampton Respiratory Biomedical Research Unit and University Hospital Southampton outpatient clinics; all provided written informed consent. The National Research Ethics Service Committee South CentralSouthampton B ethics committee (UK; ref 10/H0504/2) approved this study. Clinical classification and enrollment criteria were previously described. 29,31 Participant data were included in the present study if they were classified as either healthy control or severe asthmatic individuals in the existing cohort, and data from all data blocks (described in the next section) were collected.
Sample Collection and Analyses. Details of sample collection and transcriptomics analyses are available in the Supporting Information. Details of analytics, quality control, and data cleaning for metabolomics, targeted metabolic assays, and clinical assays were performed as previously described. 29,30 Data Blocks and Processing. Six data blocks were used for modeling: Transcriptomics, Sphingolipids, Metabolomics, Fatty

Analytical Chemistry
Article Acids, Oxylipins, and Clinical Data (Figure 1). A complete list of all variables included for each data block is provided in Tables S1−S6. The data blocks were defined by a priori knowledge about both the system under observation and the measurement technology. 19 The primary consideration was that the underlying structure of the data could possibly confound the biological interaction between blocks. To avoid bias in combining information from different probes for one gene, all non-QC probes were included for OnPLS modeling; the Transcriptomics block included 54 613 variables. This approach is commonly used for analyzing transcriptomics data. 32 Four data blocks represented serum metabolites: Sphingolipids (28 variables, targeted assay), Metabolomics (66 variables, untargeted assay screened against an in-house chemical library), Fatty Acids (14 variables, targeted assay), and Oxylipins (38 variables, targeted assay). A total of 23 clinical variables were combined into the Clinical data block; these variables were derived from typical clinical assays and measurements and included lung function tests, bronchoalveolar lavage fluid and peripheral blood T cell populations, serum cytokines, and serum vitamin D3. For clinical data, values that were missing due to being below the limit of detection (LOD) of the respective assay were imputed with 1/10 of the lowest measured value, because the LOD was not known for each assay, and OnPLS cannot process missing values. Data that were missing for an entire subset array of the clinical data (e.g., for individuals missing the cytokine assay) were imputed using the median value of the corresponding clinical group (control or asthma). Remaining missing values were replaced using PCA imputation. Prior to OnPLS model calculation, all data (except for transcriptomics) were logtransformed. All data were then scaled to unit variance.
OnPLS Model Calculation and Visualization. The OnPLS model simultaneously analyzed the data matrices, returning output matrices of shared information (components), as described. 27 These output matrices reveal shared data structure on three levels for each data matrix, which can be summarized as Globally joint components reveal structure that is shared by all input data matrices. Locally joint components reveal structure shared by two or more, but not all, of the input matrices. Finally, unique components identify latent structure that is present in only one input matrix. The OnPLS model returned separate score vectors for each data block in each component. To identify the sources of biological variance explained by the OnPLS components, the component scores for each block were correlated with metadata variables not included in the clinical data block: clinical class (control vs asthma), sex, age, BMI, dose of inhaled and oral corticosteroids, and smoking (current/ former smoker vs never smoked). The resulting Pearson correlation coefficients were presented as a metadata correlation plot. 33 To visualize the overall OnPLS model, hierarchical principal component analysis (PCA) 34 was used to summarize the 30 OnPLS score vectors, resulting in 2 PCA components describing the relationships in the OnPLS model. Prior to calculating the PCA model, the score vectors were scaled to unit variance. The PCA score plot showed individual participants, and the loadings plot displayed the score vectors from the OnPLS model, labeled by block type and OnPLS model component number.
MB-VIOP Concept, Motivation, and Calculation. Multiblock variable influence on orthogonal projections (MB-VIOP) is a feature selection method that (i) sorts the input variables by importance for data interpretation in OnPLS models, either for the total model (all variation types together) or per component (global, local, or unique variations separately), and (ii) explores the connections between the variables (either in the same or a different data matrix) that contribute to explain the same component (latent variable) in the multi-block system. Multiblock-VIOP is a model-based variable selection method, because it uses the n preprocessed data matrices, the score vectors, and the normalized loading vectors from an OnPLS model. OnPLS regression can relate the data matrices according to the model component; however, it must be emphasized that not all input variables of these related matrices will connect among themselves to explain the variation contained in a specific model component. The MB-VIOP algorithm is necessary to sort the input variables according to their connections for interpreting the variation contained in one or more specific components. Furthermore, MB-VIOP finds the degree of importance of each variable in the correct proportion for a multi-block system, which cannot be achieved by the OnPLS normalized loadings plot. 35 The calculation of the MB-VIOP values can be summarized as the Hadamard products of the normalized loadings multiplied by the ratio of the variation explained by a model component and the cumulated variation. After a block-and component-wise iterative algorithm with all input variables from the six data matrices involved, the resulting MB-VIOP vectors were normalized by Euclidean norm and by the number of original (input) variables raised to the 1/2 power. The variables of interest that were identified by MB-VIOP were selected as a subset for further multivariate analysis as shown below. For additional details about the MB-VIOP fundamentals and algorithm, readers are referred to the original reference. 28 Data Visualization. The between-block covariance of the subset of variables contributing to Components 1 and 2 of the OnPLS model were visualized using chord plots. 36 Using the variables reaching a defined MB-VIOP threshold, a chord plot was constructed by first calculating the Spearman rank correlation coefficient (r) for each pairwise combination of variables with MB-VIOP values above a threshold. Those variables where a significant (p < 0.001) between-block correlation existed were presented as nodes in a circle (grouped by block), and the correlation represented as a colored arc (yellow being a positive correlation and purple a negative correlation). The number of arcs associated with a given node is recorded in parentheses next to the name of the variable. Each chord plot was constrained such that within-block correlations were ignored.
Data modeling (OnPLS), variable selection (MB-VIOP), a posteriori analyses, and creation of plots were performed using MATLAB 2018a (Mathworks, Natick, MA, USA). Correlation coefficients for the metadata correlation plots were calculated using functions from SciPy (http://www.scipy.org/), and the plot was created using the Matplotlib library. 37 SIMCA v15 (Umetrics, Umea, Sweden) was used to perform OPLS-DA analysis.
■ RESULTS AND DISCUSSION Study Population. A total of 22 participants from a previously described cohort 29,30 were included in this study (12 healthy control individuals and 10 individuals with severe asthma). Clinical information is presented in Table 1. Age and BMI were significantly higher in the severe asthmatic group and thus represented confounders in the study. Furthermore, all individuals in the severe asthmatic group were treated with inhaled and/or oral corticosteroids (ICS/OCS). Although the sex ratio and proportion of smokers were also different, they were not significantly altered between the two groups.
OnPLS Model. The OnPLS model calculated seven components that shared joint structure between at least two of the data blocks (Table 2). Two components (1 and 4) had globally joint structure, with contributions from all six blocks. The remaining components had locally joint structure, with between 2 and 5 data blocks contributing to the joint structure. The model did not identify any unique components.
The amount of variance explained in each component, for each data block, as well as cumulative variance explained by the model is reported in Table 2. Only 37% of the total variance in the Transcriptomics data block was explained, indicating that the majority of the information contained in this block is not descriptive for describing asthma. This could be due to the global and unbiased nature of the platform and/or the fact that the transcriptomics data were derived from the entire peripheral blood CD3+ T cell population. It would be of more clinical relevance to target specific cell subpopulations in a single-cell transcriptomics approach. 38 The clinical data described only 55% of the variance in the Clinical block; however, 16 of the 22 variables were either differential immune cell counts/subpopulation frequencies or cytokines produced by immune cells. Given the pathophysiological heterogeneity of asthma, traditional cell population and cytokine measures alone are insufficient to describe the disease. 39 The OnPLS model explained >70% of the variance in each metabolic profiling data block (Sphingolipids, Fatty Acids, and Oxylipins) and 56% of the Metabolomics block. This higher degree of explained variance can be attributed to the selective association between these variables and asthma. These targeted assays were performed to confirm findings from the initial metabolomics screen. 30 While not all targeted metabolites were originally detected using metabolomics, they represent biological processes known to be involved in inflammation. This point is of particular relevance in that it is not the number of variables in a given data block that is the primary driver but rather the inherent biological content. 4,9 This facet makes it meaningful to combine disparate -omics blocks of varying structure into a single OnPLS model and demonstrates the utility of this approach for data modeling. However, there is the expected caveat that data blocks that contain higher levels of biological structure will have a concomitant increase in contribution to the overall OnPLS model.
To determine the biological factors associated with each OnPLS component and data block, model score vectors were correlated with a number of known biological factors ( Figure 2). Component 1 scores from all blocks positively and significantly (p < 0.05) correlated with disease status (healthy vs asthma), age, and BMI. All blocks, except Fatty Acids, positively and significantly (p < 0.05) correlated with ICS and OCS dose. Transcriptomics, Fatty Acids, and Clinical scores correlated with smoking status (nonsmoker vs has ever smoked). As expected, age, BMI, and corticosteroid treatment were all confounded with disease status (Table 1); thus, explained variation in the model because of these factors was not distinguished from that of disease. The Component 2 scores for the Transcriptomics, Oxylipins, and Clinical blocks significantly (p < 0.05) and positively correlated with sex. While sex was not a significant confounder in this study, the distribution between the two classes was different. This highlights the utility for OnPLS to

Analytical Chemistry
Article identify biological sources for variation in -omics data. The scores for Components 3−6 did not correlate significantly with any of the listed biological factors and likely describe either a combination of recorded biological factors or biological factors that were either not observed or not recorded in this study. As such, this highlights the importance of strict experimental design measures and extensive record keeping in data-driven sciences.
Despite being a confounder in the study, age negatively correlated with Component 7 Fatty Acid scores and highlights the potential for OnPLS to identify underlying biology associated with data blocks. PCA of OnPLS Score Vectors. To visualize the entire OnPLS model, principal components analysis (PCA) was performed on the scaled OnPLS score vectors (hierarchical PCA, Figure 3). The first principal component (PC1) showed a separation between healthy controls and asthmatic individuals in the score plot ( Figure 3A). Aligning with the results of the correlation analysis, this separation was driven by the OnPLS Component 1 score vectors ( Figure 3B). It was then expected that PC2 would solely describe a sex difference, as OnPLS Component 2 score vectors drove the separation. Interestingly, PC2 actually described an interaction between disease and sex ( Figure 3A). While there was a sex difference among asthmatics, this was not observed in the controls. Investigating the interaction between sex and disease was not an aim of the original cohort study; however, this interaction was identified by simultaneously modeling all the data in combination with integrative visualization. In addition, the hierarchical PCA model corroborates the correlation analysis, showing that OnPLS Components 1 and 2 contain the most structural information. Therefore, these components were selected for further exploration with MB-VIOP analysis.
Multi-block Variable Influence on Orthogonal Projections (MB-VIOP). To further investigate the variables and their interactions underlying the shared structure of OnPLS components 1 and 2, MB-VIOP variable selection and subsequent correlation analysis were applied. A MB-VIOP threshold of >1.0 was used to select the variables of interest from  clinical variables contributed to explaining the shared structure describing disease separation ( Figure 4A). For visualization purposes, the MB-VIOP threshold was increased to 2.0 for the Transcriptomics data block, leaving 151 variables. For Component 2, 14 618 transcripts, 28 metabolites, 9 sphingolipids, 20 oxylipins, and 12 clinical variables contributed to explaining the shared structure describing the interaction between sex and disease ( Figure 4B). The Transcriptomics block appeared to have a strong influence on the disease−sex interaction, with 1487 transcripts passing the higher MB-VIOP threshold of 2.0; thus, the threshold was further increased to >2.5 to identify only the strongest contributions, leaving 203 transcripts. The complete list of all MB-VIOP values calculated for Components 1 and 2 is presented in Tables S1−S6.
In order to identify between-block biological interactions in Components 1 and 2, chord plots were used to visualize correlations of variables passing the specified MB-VIOP thresholds ( Figure 5). This approach revealed a number of interesting interactions, of which a selected few are discussed as examples of the application of the proposed workflow. Five metabolites that correlated with ICS dose 30 (cortisol; cortisone; dehydroepiandrosterone sulfate, DHEA-S; N-palmitoyltaurine, pipecolate) passed the MB-VIOP threshold criteria for Component 1. These metabolites correlated with the transcripts of 21 unique genes ( Figure 5A), of which ATP6 V1G1 was particularly interesting. ATP6 V1G1 has been implicated in osteoporosis and specifically osteoclast function, 40 which is a known side-effect of ICS treatment. 41 This novel link may provide insights to the mechanisms underlying bone density loss in asthma patients taking ICS. In addition, NPAS2, a transcription factor involved in mediating circadian rhythm, 42 correlated with five metabolites, four of which were ceramides ( Figure 5A). Evidence suggests that ceramide levels fluctuate diurnally; 43,44 however, to our knowledge, this is the first time an association has been made between NPAS2 and ceramides. More importantly, as all samples were collected at the same time of day (between 09:00 and 11:00), this supports emerging evidence of dysregulated circadian rhythm gene expression in  Variables with a significant (p < 0.01) between-block correlation were presented in the chord plots. Nodes represent variables. Text color is associated with block: gray, Transcriptomics; green, Metabolomics; yellow, Sphingolipids; blue, Fatty Acids; orange, Oxylipins; red, Clinical. The number of correlations associated with a given node is noted in parentheses next to the name of the variable. Node color represents direction of change. Component 1: blue, increased in asthma; red, decreased in asthma. Component 2: white, increased in females; black, increased in males. Chords represent correlations: yellow, positive correlation; purple, negative correlation. Each chord plot was constrained such that within-block correlations were ignored. (---) denotes noncoding gene transcripts.

Analytical Chemistry
Article DOI: 10.1021/acs.analchem.8b03205 Anal. Chem. 2018, 90, 13400−13408 asthma. 45 Indeed, experiencing nocturnal symptoms more than once per week was a classification criterion of severe asthma. 29 The disease−sex interaction identified by Component 2 was largely driven by differential bronchoalveolar lavage cell profiles (eosinophils, macrophages) and oxylipins ( Figure 5B). It also identified a high degree of correlation between the oxylipins and both PCDH10 and the uncharacterized gene locus LOC284219, suggesting that these genes may play a previously unidentified role in oxylipin metabolism. Together, these examples highlight the value of this method for interrogating biology and generating hypotheses from multiomics data.
By combining OnPLS multi-block modeling with MB-VIOP variable selection and various visualization methods, the composite of data derived from this study could be interrogated. Where methods such as OPLS are useful for identifying covariance in isolated data blocks, OnPLS offers the advantage of identifying combined covariance, thus offering a more complete understanding of the whole system. For example, when OPLS was applied to the Metabolomics data block in isolation, 21 variables had a VIP OPLS > 1.0 with dehydroepiandrosterone-sulfate (DHEA-S) being the strongest driver of the control−asthma difference (Supplemental Tables). Component 1 of OnPLS had 31 variables with a MB-VIOP > 1.0, 15 of which were unique to OnPLS modeling. Whereas DHEA-S was a major driver in the covariance in the single-block analysis, it was less important in the combined covariance of the OnPLS model. The Transcriptomics, Oxylipin, and Clinical data blocks showed similar trends, with OPLS and OnPLS revealing different biological insights (data not shown).
While the present study shows the potential for OnPLS-based modeling to be useful for simultaneously modeling multiple data blocks and generating hypotheses, it is limited by sample size and study power. Furthermore, OnPLS is currently unable to derive a block weighting such that MB-VIOP values can be scaled and directly compared across all blocks. Accordingly, MB-VIOP values can only be directly compared within a given data block and not between blocks. In interpreting the results, one must consider the overall contribution, not only of the block per se but also of the individual variables, to the respective component.

■ CONCLUSIONS
The multi-block OnPLS method combined with MB-VIOP variable selection and interaction visualization techniques yielded putative novel insights into asthma disease pathogenesis, the effects of asthma treatment, and biological roles of genes. The current study was performed in a worst-case scenario approach using a small sample set, with unbalanced groups and multiple study confounders. While these issues limit the ability of the different components of the OnPLS model to identify unique biological sources of variation, it demonstrates the potential for this method for identifying key structure in -omics data integration. It is likely that in large well-designed studies, the different components would be able to identify and explain other sources of biological and/or experimental variability (e.g., therapeutics, center bias, diet). It is also possible that this approach would be useful in identifying subphenotypes of disease, with different subgroups and/or mechanisms described by different components. We therefore propose that OnPLS modeling can be incorporated into large-scale molecular phenotyping studies for stratified medicine. Given that the -omics technologies detect molecules that function in a highly interdependent and dynamic manner within a living system, multi-block methods such as OnPLS, together with MB-VIOP and interaction visualization, provide a logical approach to investigating systems biology.

* S Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10 Table S6; Rho values for correlations between variables presented in Figure 5A (Component 1), Table S7; P values for correlations between variables presented in Figure 5A (Component 1), Table S8; Rho values for correlations between variables presented in Figure 5B (Component 2), Table S9; P values for correlations between variables presented in Figure 5B (Component 2),