Optimized Phenotypic Biomarker Discovery and Confounder Elimination via Covariate-Adjusted Projection to Latent Structures from Metabolic Spectroscopy Data

Metabolism is altered by genetics, diet, disease status, environment, and many other factors. Modeling either one of these is often done without considering the effects of the other covariates. Attributing differences in metabolic profile to one of these factors needs to be done while controlling for the metabolic influence of the rest. We describe here a data analysis framework and novel confounder-adjustment algorithm for multivariate analysis of metabolic profiling data. Using simulated data, we show that similar numbers of true associations and significantly less false positives are found compared to other commonly used methods. Covariate-adjusted projections to latent structures (CA-PLS) are exemplified here using a large-scale metabolic phenotyping study of two Chinese populations at different risks for cardiovascular disease. Using CA-PLS, we find that some previously reported differences are actually associated with external factors and discover a number of previously unreported biomarkers linked to different metabolic pathways. CA-PLS can be applied to any multivariate data where confounding may be an issue and the confounder-adjustment procedure is translatable to other multivariate regression techniques.

. Results of consistently contributing variables for dataset with 61% overlap. Table S2. Results of consistently contributing variables for dataset with 41% overlap.    Table S5. Abbreviations and full names of metabolites in the metabolic reaction network.

Glossary of mathematical operations
PLS algorithm -To avoid the computationally expensive inflating step in non-iterative partial least squares (1) (NIPALS) we use a modified version (2) of the original SIMPLS algorithm (3) to deal with a data matrix (X) where the number of samples (n) is much smaller than the number of variables (p), n<p. The modification is regarding the first step, where the original SIMPLS (proposed for n>p), in the case of n<p, would result in a square matrix of order p. In the modified algorithm, a samplesample association matrix (A) is defined as A=XX T (line 1, see pseudo-code below), a square matrix of order n, and both X and Y are assumed to be centered and, if required, scaled. The latent variables are calculated in an iterative manner until a stop criterion (specified number of components or the maximum number of n components) is reached (lines 4-13). The algorithm gives the same results as the NIPALS algorithm for a univariate response, however with a significant time gain (3). Same as for S4 NIPALS, this algorithm can also be used to do OSC-PLS, for which the data matrix X is replaced by the orthogonal signal corrected (4) matrix X osc .
Random Matrix Theory -The number of un-correlated components (υ) from the auto-correlation matrix of the regression coefficients is estimated using Random Matrix Theory (5) (RMT). RMT numerically computes a Jacobian matrix (J) that calculates the theoretical eigenvalue distribution of r B (lines 1-27). The partial J can be used to estimate υ (lines 28-29, see pseudo-code below).
Simulated data -In order to compare our method to standard PLS and OSC-PLS we simulated 100 data sets with 50 case (Y i =1) and 50 control (Y i =0) objects. In the data generation we added two confounding effects, one which was non-orthogonal to Y (r = ±0.36) and another which was almost orthogonal to Y (r = 0.04). The Pearson correlation between the two confounders was r = ±0.2. From the total of 2000 variables in each data set sampled from a normal distributions with unit variance (σ 2 = σ = 1), X ij ~ φ(µ,σ 2 =1), 10% of variables were sampled using a different mean for one class to induce a class-separation, these are considered to be true positives. The same goes for both confounders; with the addition that there is a 25% overlap in affected variables between each confounder and the 10% of variables affected by the class separation. The difference in mean between distributions (effect size) was 1 (61% overlap) and 1.645 (41% overlap) for the first 50 and the second 50 data sets, respectively (see Supplementary Figure 1). In total there are 2 3 unique classes, and the class of each object i is coded as a 3-tuple S xyz , where x, y and z indicate whether it is case or control, affected or not by confounder 1 and affected or not by confounder 2, respectively. The same goes for the 2000 variables, and each variable j has a 3-tuple code V xyz , again where x, y and z indicate whether the variable is affected by the case/control status, confounder 1 and confounder 2, respectively. Additionally, the effect of each factor (Y, confounder 1, confounder 2) was assigned a sign for each variable at random, where for variable j, a j , b j and c j are the signs for Y, confounder 1 and confounder 2, respectively. This results in a total of 3 3 unique types of variables with different effects for the covariates.
Generation of simulated data -The data for each variable is drawn from specific normal distributions (with σ 2 = σ = 1) depending on the sample, effect size (es) and effect signs (a j , b j and c j ).
Data for object i and a variable j which is unaffected by any of the covariates is simply drawn from a normal distribution as follows: Data for variable j which only contains information of case/control status is calculated using: In this equation only one Kronecker Delta has a value of 1 depending on object i 3-tuple. The same goes for variables which only contain information from confounders 1 and 2: The equations are expanded when multiple effects play a rule for a variable. Data for object i for variables with information on Y and confounder 1 is generated as follows: Note again that only one Kronecker Delta has a value of 1 in this equation. The c in the Kronecker Deltas indicates that the value for confounder 2 from the object 3-tuple has no effect here as the variable is unaffected by confounder 2. The same goes for variables affected by Y and confounder 2 (where confounder 1 plays no role, as indicated by the b in the Kronecker Deltas): And also for variables where Y plays no role (indicated by a in the Kronecker Deltas), but both confounder 1 and 2 play a role: Last, the data for variables affected by Y, confounder 1 and confounder 2 is drawn for the following distributions: If a variable with 3-tuple of type V 1yz is found to be significant it is considered a true positive finding, as these variables contain information of case/control status. All other variables (V 0yz ) are considered to be false positives if they are found to be significantly contributing to the models. In order to compare CA-(O)PLS with PLS and OSC-PLS for the simulated data sets, the same seed for the random number generator was used, to ensure the random sampling was performed exactly the same for all methods. All simulated data sets are available upon request.   Supplementary Table 1. Full results table showing the mean percentage (95% confidence intervals are given in parenthesis) of consistently-and-similarly contributing associations for the simulated data using an effect size of 1 (61% overlap). The type of variable is shown as J x,y,z , where x, y and z are the influence of the response variable, confounder 1 and confounder 2, respectively. The value for x, y and z indicate the type of variable and how each factor influences the generated variable, e.g. ±J 1,0,1 indicates the variable is influenced by the response variable and confounder 2 only, and that they have the same sign, and ±J1,-1,0 means that the variable is influenced by both the response variable and the first confounder, however the sign of both is inverse. The bottom section shows the total number of true positives (TP), false negatives (FN, type-II errors), false positives (FP, type-I errors) and true negatives (TN).