Molecular Signatures of Fusion Proteins in Cancer

Although gene fusions are recognized as driver mutations in a wide variety of cancers, the general molecular mechanisms underlying oncogenic fusion proteins are insufficiently understood. Here, we employ large-scale data integration and machine learning and (1) identify three functionally distinct subgroups of gene fusions and their molecular signatures; (2) characterize the cellular pathways rewired by fusion events across different cancers; and (3) analyze the relative importance of over 100 structural, functional, and regulatory features of ∼2200 gene fusions. We report subgroups of fusions that likely act as driver mutations and find that gene fusions disproportionately affect pathways regulating cellular shape and movement. Although fusion proteins are similar across different cancer types, they affect cancer type-specific pathways. Key indicators of fusion-forming proteins include high and nontissue specific expression, numerous splice sites, and higher centrality in protein-interaction networks. Together, these findings provide unifying and cancer type-specific trends across diverse oncogenic fusion proteins.

types (leukemia/lymphoma, epithelial, mesenchymal, other) affected by the mutations. A list of 1217 tumour suppressor genes (TSGs) was acquired from the TSGene 2.0 database [7], which also contained information on which TSGs were downregulated and in which cancer types. A total of 985 TSGs were labeled as being downregulated in 1 or more of 11 available cancer types. Finally, the identity of 239 human oncogenes (OGs) was retrieved from the Tumour Associated Gene (TAG) database [8], of which 216 were protein-coding.

Gene multifunctionality
Gene ontology terms for all human Ensembl genes were acquired from the Gene Ontology (GO) Consortium [9] via BioMart. GOSlim terms, which offer a broader annotation of gene function, were also acquired. GO and GOSlim accession counts per gene were tallied as simple measures of gene multifunctionality.

Gene essentiality
A list of 2750 essential human genes was extracted from the Online Gene Essentiality database [10], which derived its essentiality annotation from the collection and analysis of large-scale, genome-wide experimental data, supplemented with text mining.
Dosage sensitivity and loss of function upon haploinsufficiency
Data on 373 haploinsufficient genes in human, as well as curated annotation on whether gene loss results in a loss of function, were downloaded from the Clinical Genome Resource consortium [11] ftp site (ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/clingen/).

Protein kinase dataset
Human kinase genes and kinase classifications were retrieved from the Uniprot [12] kinase resource (http://www.uniprot.org/docs/pkinfam; March 2016 release), based on the Kinase Database [13] and previous kinase classification work [14]. A total of 509 kinase Uniprot accessions were mapped to 515 Ensembl gene IDs. All "atypical" subgroups of kinases were grouped together for a total of 11 kinase classes (Table S1).

Transcription factors and associated regulatory interactions
A set of 748 human transcription factors (TFs) and 8,015 TF-mediated regulatory links (i.e. activating or repressive relationship between a TF and target gene) was acquired from the TRRUST database [15]. The TF-target regulatory interactions in TRRUST were identified via manual curation of Medline abstracts [15]. Counts of the total number of unique TF-mediated regulatory links (including links of an "Unknown" regulatory nature) were tallied per Ensembl gene, as were the number of activating and repressive links.

Epigenetic complexes and chromatin modifying genes
A list of epigenetic modifier genes (i.e. genes involved in DNA modification, histone modification, or chromatin remodelling) and subunits of epigenetic complexes were downloaded from the EpiFactors database [16] (version 1.7.1). Data were cleaned, and a total of 770 Ensembl gene IDs with epigenetic gene functions were identified. To simplify gene function labels, the number of categories was reduced from 56 to 3 (chromatin modifying, n=151; histone modification, n=523; other, n=112, largely RNA and DNA processing and histone chaperones) using a combination of regular expressions and manual curation. Further, using the same resource, 302 Ensembl gene IDs were identified as participating in 69 epigenetic complexes.

Protein domain dataset
A collection of all domains and domain families in the human proteome was acquired from the Pfam database [18] using BioMart. The number of Pfam domains per Ensembl protein, the average length of Pfam domain per protein, and the density of Pfam domains (domain count over protein length) were calculated. These values were averaged over isoforms to obtain values per gene.

Intrinsic structural disorder
Residue by residue predictions for intrinsic structural disorder in all human proteins were acquired by running the IUPred program [19] on FASTA sequences (using the "long" option). Intrinsic disorder scores were calculated for genes as an average over isoforms.

Interaction-mediating protein segments and domains
Interaction-mediating segments of proteins were acquired from Interactome3D [20] (May 2015 release). Only the "representative" set of interactions and associated segments, comprising the top ranking structures and models, were used. Per Uniprot protein, unique stretches were identified using protein accessions and start and stop coordinates, the number of interaction-mediating stretches counted, and the length of the longest stretch identified. For the 35 Uniprot accessions that mapped to multiple Ensembl gene IDs, the maximum values for these quantities was taken and assigned to the gene. Additionally, the INstruct database of curated protein-protein interactions resolved to the level of protein domains was acquired [21], and the number and density of these interaction-mediating domains per protein were calculated.

Structural interface-forming residues
Protein residues involved in the formation of macromolecular interfaces were acquired, as in [24], from the Protein Interfaces, Surfaces and Assemblies (PISA) database [22], which collects and details macromolecular interfaces in the Protein Data Bank [23]. The number and density of PISA residues were summarized per protein.
Linear motif data
A list of 1548 human linear motifs (LMs) was extracted from the Eukaryotic Linear Motif (ELM) database [25] and filtered down to 1538 by finding entries with unique combinations of Uniprot accessions and start and end coordinates. The LM analysis was expanded by taking a further 1,036,282 putative short protein-binding regions, previously identified by the ANCHOR program [26] (as in [24]).

Post-translational modification sites
Post-translational modification (PTM) sites were acquired from the dbPTM resource [27] (version 3.0; accessed March 2016). Only unique PTM locations were considered, yielding 118047 unique PTMs in human, and PTM counts and densities per gene were calculated. Ubiquitination sites were acquired by subsetting dbPTM. Phosphorylation site subtypes were grouped into one category using regular expressions. Furthermore, a set of PTM sites which putatively function in the regulation of protein-protein interactions was acquired from the PTMcode database [28] (v2). Gene-wise counts and densities of ubiquitination, phosphorylation, and PTMcode PTM sites were calculated.

Cancer pathway involvement
A list of genes involved in human cancer pathways (KEGG pathway id: hsa05200) was acquired from the KEGG [29,30] database (http://www.kegg.jp/). A total of 402 Ensembl genes were identified as participating in cancer signaling pathways.

General and tissue-specific protein interactions
Genome-wide physical protein-protein interactions (PPIs) were acquired from the BioGRID database [31] (release 3.4.135, March 2016), converted to Ensembl accessions, and unique interactions taken. The centrality of all genes (nodes) in the network was summarized using several common metrics (degree, betweenness, closeness, and eigenvector centrality), which capture different notions of a node's importance in a graph (Table S1). For each gene, the number of interactions with known oncogenes, tumour suppressor genes, and cancer genes was identified through the (non tissue-specific) BioGRID PPI set, using cancer gene set annotation (see above).
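As a minimal illustration of the simplest of these metrics, degree centrality can be computed directly from an edge list. The sketch below is a pure-Python toy on a hypothetical interaction set; in the study itself, all four metrics were computed on the full BioGRID network with igraph.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Degree centrality of each node in an undirected PPI network:
    the node's degree divided by (n - 1), where n is the node count."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    n = len(adj)
    return {node: len(neigh) / (n - 1) for node, neigh in adj.items()}

# Toy edge list (illustrative only)
ppi = [("TP53", "MDM2"), ("TP53", "EP300"), ("TP53", "BRCA1")]
centrality = degree_centrality(ppi)
```

Betweenness, closeness, and eigenvector centrality capture longer-range notions of importance (paths through a node, average distance to all others, and connectedness to other well-connected nodes, respectively) and require the full graph structure rather than local degree counts.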
Gene expression levels and tissue-specificity of expression
Averaged, log-transformed tissue-specific gene expression values in Reads Per Kilobase per Million (RPKM) covering 27 human tissues (Table S1) were obtained from a study analyzing several RNAseq data sources [32]. Mean and maximum expression levels per gene across all tissues were also acquired. One popular metric for quantifying expression heterogeneity is tau [33], which has been shown to be superior to several other methods due to its robustness and ability to separate ubiquitously-expressed (sometimes called "housekeeping") and tissue-specific genes [32]. Tau (\tau) ranges from 0 (broadly expressed) to 1 (entirely tissue-specifically expressed) and is defined for each gene by:

\tau = \frac{\sum_{i=1}^{n} (1 - \hat{x}_i)}{n - 1}

where n is the total number of tissues and \hat{x}_i is defined to be:

\hat{x}_i = \frac{x_i}{\max_{1 \le j \le n} x_j}

where x_i is a gene's expression in tissue i.
For simplicity, the tissue-specific (TS) expression values were omitted and only the two derived summary values (i.e. average and maximum expression over all tissues) were retained.
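The tau calculation follows directly from its definition; a minimal pure-Python sketch (the expression values below are hypothetical):

```python
def tau(expression):
    """Tissue-specificity index tau: 0 = broadly expressed,
    1 = expressed in a single tissue.

    expression: non-negative expression values, one per tissue."""
    n = len(expression)
    x_max = max(expression)
    if x_max == 0:
        return 0.0  # gene not expressed anywhere; a convention only
    x_hat = [x / x_max for x in expression]
    return sum(1 - x for x in x_hat) / (n - 1)

# A gene expressed in only one of four tissues is maximally tissue-specific
tau([8.0, 0.0, 0.0, 0.0])  # -> 1.0
```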

Missing data handling and imputation
In the compiled dataset (Table S1), 10 variables possessed missing values, namely: degree_centrality, betweenness_centrality, closeness_centrality and eigenvector_centrality (n missing/total = 4978/20295); tau, mean_expression and max_expression (2377/20295); num_GO_terms and num_GOSlim_terms (1576); and avg_disorder (1278). Gene set label features (e.g. is_oncogene, is_cancer_gene, is_kinase) have no missing values, since all genes without an explicit gene set label are deemed to be non-members of the gene set (i.e. genes which are not known kinases are assumed to be non-kinases). Missing values were imputed using multiple imputation by chained equations (MICE) [34], which generates native-like, model-based imputations for each variable with missing data [35]. The R MICE package [36] was employed for the imputation procedure, using predictive mean matching, with parameters m (number of multiple imputations) and maxit (maximum number of iterations) set to 5.
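Predictive mean matching can be illustrated with a deliberately simplified, single-predictor toy: fit a regression on the complete cases, then fill each missing value with the observed value of the donor whose fitted value is closest to the missing case's prediction. This hypothetical sketch is not the procedure actually used; the mice package iterates this idea across all incomplete variables, uses donor pools, and draws multiple imputations.

```python
def pmm_impute(y, x):
    """Toy single-predictor predictive mean matching.

    y: list of floats with None for missing values; x: fully observed predictor.
    Returns y with missing entries replaced by donor values."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(obs)
    mx = sum(xi for xi, _ in obs) / n
    my = sum(yi for _, yi in obs) / n
    # Ordinary least squares slope and intercept on complete cases
    b = sum((xi - mx) * (yi - my) for xi, yi in obs) / sum((xi - mx) ** 2 for xi, _ in obs)
    a = my - b * mx
    filled = []
    for xi, yi in zip(x, y):
        if yi is not None:
            filled.append(yi)
        else:
            pred = a + b * xi
            # Donor = complete case with the closest fitted value
            donor = min(obs, key=lambda o: abs((a + b * o[0]) - pred))
            filled.append(donor[1])
    return filled
```

Because the imputed value is always an observed value from a donor, PMM preserves the empirical distribution of the variable ("native-like" imputations), unlike plain regression imputation.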
Identifying parent protein and fusion event functional groups using PCA and hierarchical clustering
To improve cluster interpretability and stability, a dimensionality reduction method (principal components analysis; PCA) was employed as a preprocessing step before hierarchical clustering. PCA involves a linear transformation of the data, specifically the projection of a set of potentially correlated variables onto orthogonal (linearly uncorrelated) axes called principal components (PCs). As a result of the PCA procedure, the dataset is transformed into a new coordinate system, whereby the most variance by projection occurs in the direction of the first principal component, then the second, and so on up to p PCs, the dimensionality of the original dataset. Formally, PCA proceeds by finding the eigenvalue decomposition of the symmetric covariance matrix, where the PCs are the eigenvectors and the corresponding eigenvalues capture the amount of variance present in the data in that direction. The eigenvector with the highest eigenvalue is the first PC, and the eigenvalues of the different PCs can be visualized in a Scree plot. These principal components are by definition uncorrelated; hence, PCA acts as a reframing of the data using new axes that better capture variance. PCA is frequently used for dimensionality reduction, since considering fewer than p principal components generates a lower-dimensional space that still captures a large amount of the variance observed in the original data. This is useful for visualization and for revealing dominant trends in complex data using only a few interpretable latent variables. PCs can be interpreted by examining which of the original variables correlate most heavily with the PC values. The first portion of the PCs (e.g. the first 10 PCs) can then be used in place of the original dataset as input into a clustering procedure.
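The eigendecomposition procedure described above can be sketched in a few lines of NumPy; this is an illustration of the mathematics, not the FactoMineR implementation actually used.

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the symmetric covariance matrix.

    Eigenvectors are the principal components; eigenvalues give the variance
    captured along each. Returns (projected data, sorted eigenvalues)."""
    Xc = X - X.mean(axis=0)                  # center each variable
    cov = np.cov(Xc, rowvar=False)           # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # largest variance first = PC1
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return Xc @ eigvecs[:, :n_components], eigvals
```

Plotting the sorted eigenvalues reproduces the Scree plot, and because the eigenvectors are orthogonal, the projected columns are uncorrelated by construction.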
Two separate clustering tasks were performed in this study: the identification of clusters of similar parent proteins, and the identification of clusters of similar fusion events (i.e. pairs of parent protein features). PCA was first performed on the datasets using the R package FactoMineR [37]. Variables that were >90% sparse were omitted from the PCA and clustering analysis. The first 10 PCs were clustered using agglomerative hierarchical clustering with Ward's method on Euclidean distances. The optimal number of clusters (based on relative inertia loss levels across different tree cuts) suggested by FactoMineR was accepted and individual genes were plotted on PC1 and PC2 axes and colored by cluster. Outlier parent proteins (n=7 parent proteins, n=6 fusion events) were identified as instances assigned to clusters containing <10 members when n_clust was set to 10, and these instances were omitted. PCs were characterized by describing feature correlations with PC values, where the correlation coefficient |r|≥0.4. Distributions of the top correlated features in the first 4 PCs were visualized by cluster. A set of 5 "paragon" examples, defined to be instances closest to cluster centroids were generated for each cluster.
A chi-squared test of independence was performed on the frequency matrix of parent protein clusters by 5' and 3' parent cluster membership. As an additional analysis of whether certain parent protein clusters fuse disproportionately often, a randomization analysis was conducted by generating 1000 random gene fusions by bootstrap sampling two sets of parent genes, and calculating the proportion of fusions by parent cluster in each batch. 10000 batches were generated to construct a null distribution of same-cluster fusion proportions, and empirical p values were calculated. Enrichment for cancer types by fusion event cluster was examined with a chi-squared test on a contingency table restricted to cancers in which all cell counts were n≥20.

Functional enrichment analysis
Gene sets corresponding to the three clusters of parent proteins were tested for enrichments of GOSlim molecular functions using PantherDB [38], using the set of all human genes as background, with p values corrected for multiple testing using Bonferroni's correction. For each gene set, the log of fractional difference (observed versus expected number of genes) was calculated. Categories with <10 genes were disregarded.

Fusion of metabolic pathways
Gene to REACTOME pathway and cellular process mappings for 8088 Ensembl genes were obtained using Ensembl BioMart, and REACTOME pathway IDs were mapped to descriptions using annotation from the REACTOME website (http://www.reactome.org/). Of the 2,371 gene fusions in our dataset, 577 possessed available pathway annotations for both parent genes (572 with fusion event outliers excluded); in these cases, the frequency with which each pair of pathways was joined via fusion was calculated (pathway counts per gene ranged from 1 to 334, IQR=4-9, median=5, mean=11.01). To estimate null expected frequencies of pathway fusions (i.e. the frequencies with which any given pair of pathways would be expected to fuse if fusion events were randomly generated from any protein-coding genes), 1,000,000 random fusions with pathway annotation were simulated and pathway fusion frequencies calculated. A repeated set of calculations was performed whereby random fusions were generated from only the set of parent proteins. Enrichments were calculated as the log10 of observed/random pathway fusion frequencies (log chosen to normalize the enrichment distribution), and, to increase stability, only where pairs were fused ≥10 times. In the cancer-type specific analyses, pathway pairs were required to be fused ≥5 times and only lung, lymphoid, head-neck, breast, and ovary were analyzed, due to a lack of annotation availability in other cancer types. By-cluster trends in pathway fusion were investigated and enrichments calculated (cluster 1 n=184, cluster 2 n=232, cluster 3 n=156). In the main text figures, stroke weights under 1 were increased to 1 for improved visibility. We note that fusion does not guarantee that the parent pathways actually come into contact (e.g. both parent proteins could be too severely truncated), but instead provides a broad overview of affected pathways; furthermore, since just under a third of protein-coding genes have REACTOME pathway participation annotations, the findings of the enrichment calculations may change as coverage and annotation improve.
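The enrichment calculation can be illustrated with a toy sketch using hypothetical gene-to-pathway mappings; the real analysis used REACTOME annotation, 1,000,000 simulated fusions, and frequency thresholds as described above.

```python
import math
import random
from collections import Counter

def pair_freqs(fusions, gene_pathways):
    """Relative frequency of each (sorted) pathway pair joined by a set of
    fusions; every 5' pathway is paired with every 3' pathway."""
    counts = Counter()
    for g5, g3 in fusions:
        for p5 in gene_pathways[g5]:
            for p3 in gene_pathways[g3]:
                counts[tuple(sorted((p5, p3)))] += 1
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}

def enrichment(observed, genes, gene_pathways, n_random=10000, seed=0):
    """log10(observed / null) frequency per pathway pair, with the null built
    from randomly paired genes (a small-scale analogue of the simulation)."""
    rng = random.Random(seed)
    random_fusions = [(rng.choice(genes), rng.choice(genes)) for _ in range(n_random)]
    null = pair_freqs(random_fusions, gene_pathways)
    obs = pair_freqs(observed, gene_pathways)
    return {pair: math.log10(f / null[pair]) for pair, f in obs.items() if pair in null}
```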

Class balancing
Given the unbalanced nature of the datasets (e.g. most genes are not parent genes, most parent genes are not recurrent), and the issues inherent to building models on unbalanced data (i.e. signal loss and an inflated sense of model performance due to many algorithms tending towards universally classifying data points as the majority case), datasets were balanced with respect to the target class frequencies. Outcome class balancing was done by undersampling the majority classes without replacement until minority class frequencies were matched.
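The undersampling step can be sketched in a few lines (a minimal, hypothetical helper; any downsampling utility behaves equivalently):

```python
import random

def undersample(instances, labels, seed=42):
    """Balance class frequencies by undersampling the majority classes
    without replacement down to the minority class count."""
    rng = random.Random(seed)
    by_class = {}
    for inst, lab in zip(instances, labels):
        by_class.setdefault(lab, []).append(inst)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for lab, members in by_class.items():
        for inst in rng.sample(members, n_min):  # sample without replacement
            balanced.append((inst, lab))
    return balanced
```

Undersampling discards majority-class information, but avoids the duplicated minority examples (and consequent overfitting risk) that oversampling introduces.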
Feature normalization, categorical variable handling, and train/test splitting
Random forest models are scale invariant and were fed raw feature data, whereas numerical features input to logistic regression models were scaled and centered to have a mean of 0 and variance of 1. To ensure symmetrical treatment, all categorical features were encoded into as many binary dummy variables as there were factor levels. Unless stated otherwise, in each classification task, 70% of the data was used as the training set and 30% was held out for testing.
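The two preprocessing transforms are simple enough to sketch directly (pure-Python illustrations; the study used R's standard tooling for these steps):

```python
import statistics

def standardize(values):
    """Scale and center a numeric feature to mean 0 and unit variance."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD; sample SD is also common
    return [(v - mu) / sd for v in values]

def one_hot(values):
    """Encode a categorical feature as one binary dummy per factor level."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}
```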
Predictive modeling and feature importance ranking
Random forest (RF) and penalized/regularized logistic regression (RLR) models generate, as part of their model structure, intuitive measures of the relative predictive importance of variables (see below). The two models have highly different mathematical formulations and may therefore capture different predictive signals within the datasets, and together generate a diversified view of feature importance. The 10 highest importance features from the RF and RLR models were acquired for each prediction task and assigned integer ranks from 10 (most important) to 1 (least important). These ranks were then added and displayed by feature, and distributions of the most informative features were visualised with violin and boxplots. Statistical significance of differences between continuous distributions was assessed using Wilcoxon rank sum tests with continuity corrections. For categorical variables, contingency tables were constructed and Pearson's Chi-squared test with Yates' continuity correction was used to assess variable dependence; if counts of any cell were less than 30, Fisher's exact test for count data (H0=two sided) was used instead. In the few subplots where boxplots and violin plots were not appropriate or visible (e.g. although the count_tissues_cancer_mutation variable is continuous, 97.3% of its values are 0, which renders continuous distribution geometries inappropriate), means were plotted by category instead.
Modeling with random forests
Random forests, a type of recursive partitioning model [39], are powerful and robust tools for classification and regression [40,41]. Random forests ensemble together base learner classification trees, each of which recursively bifurcates bootstrapped data by splitting on features that maximize class purity at the terminal nodes (Figure 7). Each bifurcation considers a sample of the available features, and optimal splits are identified by greedily minimizing a purity criterion, which measures the homogeneity of the target variable. Commonly, the purity criterion used is either Gini impurity or entropy loss. The Gini index is given by:

G = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})

which measures the total variance across the K classes, where \hat{p}_{mk} is the proportion of training examples in the mth region from the kth class. The Gini index takes a small value if all \hat{p}_{mk} are close to 0 or 1 (i.e. the tree terminal regions are highly pure). An alternative popular measure of purity is information gain (termed "entropy" in the scikit-learn implementation of a random forest classifier), defined as:

D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}

Random forest models are characterized by several hyperparameters, the most important of which are the choice of purity criterion, the maximum tree depth, and the maximum number of features considered at each split. These values were optimized for each classification task using grid search and 10-fold cross-validation. Cross-validation (CV) is a method for estimating model performance using only the training data, and is a powerful tool for hyperparameter optimization: individual models (i.e. models using different values for the hyperparameters) are trained on a portion of the training data (e.g. 9/10ths in the case of 10-fold CV) and tested on the remaining held-out portion. The procedure is then repeated until the entire training set has been used for validation (e.g. 10 times in the case of 10-fold CV, using each 1/10th of the training data as the held-out fold exactly once).
The optimal hyperparameter setting from the grid of possible options is the one that maximizes the average performance metric of interest across the CV rounds. Scikit-learn implements Gini importance [42,43] for measuring feature importance in random forest models, defined as the decrease in node impurity when splitting on the feature, weighted by the probability of data points reaching the node, and averaged over the base learner trees in the ensemble. The ten most important features in each classification task were identified from these feature importance rankings.
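The two purity criteria above can be computed in a few lines of Python (a hedged illustration from the formulas, using log base 2 for entropy; scikit-learn's internal implementations are what the models actually use):

```python
import math

def gini(class_counts):
    """Gini impurity of a node: G = sum_k p_k * (1 - p_k).
    Small (near 0) when the node is pure."""
    total = sum(class_counts)
    props = [c / total for c in class_counts]
    return sum(p * (1 - p) for p in props)

def entropy(class_counts):
    """Entropy: D = -sum_k p_k * log2(p_k), with 0*log(0) taken as 0.
    Also small when the node is pure."""
    total = sum(class_counts)
    props = [c / total for c in class_counts]
    return -sum(p * math.log2(p) for p in props if p > 0)
```

A split's quality is then the parent node's impurity minus the size-weighted impurity of its children, which the tree-growing procedure greedily maximizes.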
Modeling with regularized logistic regression
L1 regularized multiple logistic regression is a common and efficient binary classification method [44], even in the presence of many irrelevant features. The algorithm estimates the natural logarithm of the odds ratio that a data point will belong to class 1 based on the values of predictors X1, X2, …, Xp. The probability of being in class 1 (P(Y = 1), or simply p(X)) is given by the sigmoidal logistic function:

p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}

where the probability of being in class 0 is 1 minus this quantity. The logistic regression model has a logit that is linear in X:

\log \frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p

where the maximum likelihood fitted coefficients β1, β2, …, βp represent the change in log odds of being class 1 per unit change in the corresponding X1, X2, …, Xp. Model training involves estimating the values of the coefficients. As with other regression methods, logistic regression is sensitive to overfitting when used on high-dimensional feature spaces. There is an additional issue of potential instability in coefficient values when non-negligible levels of correlation are present among the predictors. Regularization techniques alter the standard regression loss function (such as root mean squared error) in order to penalize coefficient size by shrinking the maximum likelihood estimates towards 0. Penalization promotes model stability and increases generalization, and has the general form:

\hat{\beta} = \operatorname{argmin}_{\beta} \; L(D, f(D)) + \lambda \sum_{j=1}^{p} |\beta_j|^k

where D is the dataset, f(D) are the predicted values, L(D, f(D)) is the loss function, and λ is the regularization parameter. k specifies the regularization type: k=0 denotes the Akaike information criterion or Bayes information criterion; k=1 results in L1 regularization or (in the case of linear regression) the Lasso; k=2 results in L2 regularization (or ridge regression).
An L1 penalty, which attaches a penalty proportional to the sum of the absolute sizes of the model coefficients, generally results in many 0 coefficients, with the remainder of the coefficients being characterised by relatively little shrinkage (cf. an L2 penalty, which is proportional to the squares of the model coefficients, and produces many small but non-zero coefficients). The amount of shrinkage is controlled by the regularization parameters λ1 and λ2, where a value of 0 indicates no shrinkage and generates the normal maximum likelihood estimate, and a value of infinity leads to infinite shrinkage and coefficients of 0. Since the L1 penalty leads to increasingly sparse solutions as the regularization strength increases, it can be used as a form of feature selection. In the context of L1 penalized logistic regression [45], the algorithm minimizes the penalized negative log-likelihood of the binomial model. We use L1 regularized logistic regression models that were built using the penalized R package [46]. Response variables were coded as 0 (non-parent, non-recurrent, etc.) or 1 (parent, recurrent, etc.), and all numerical features were normalized to have mean 0 and variance 1. Models were trained on 70% of the data, and tested on the remaining 30%. A series of penalized logistic regression models were built over a range of regularization strengths, and the model which resulted in 10 non-zero coefficients (or the closest value greater than 10; in practice, 11) was extracted, with absolute values of the coefficients plotted in descending order as a measure of feature importance.
Generating predictions with trained logistic regression models produces probabilities. These continuous values between 0 and 1 were binarized, such that test instances were classified as the "1" class when p≥0.5, and as "0" when p<0.5. For final predictions, optimal regularization strengths λ1 were found using 10-fold cross-validation on training data. Values of λ1 between 0.5 and 500 were tested, with the optimal value being the one that maximized the log-likelihood, as suggested by the penalized R package [46].
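The probability-then-threshold prediction step follows directly from the logistic function above; a minimal sketch with hypothetical, hand-set coefficients (real coefficients come from the penalized fit):

```python
import math

def predict_proba(x, beta0, betas):
    """Logistic model probability p(x) = 1 / (1 + exp(-(b0 + b . x)))."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, beta0, betas, threshold=0.5):
    """Binarize: class 1 when p >= threshold, else class 0."""
    return 1 if predict_proba(x, beta0, betas) >= threshold else 0
```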
All human proteins were ranked by their "similarity" to known parent proteins on the basis of the RF and RLR model predictions (i.e. by ordering proteins by their RF class and RLR probability of being parent proteins).

Predicting fusion event pairing patterns
To assess if randomly-generated fusion events are distinguishable from observed fusion events, a mostly sparse (99.92%) 1849 × 1876 product matrix of all possible fusion events was first constructed, composed of 1849 distinct 5' genes and 1876 distinct 3' genes. A paired feature space consisting of the features of the 5' and 3' partners of known gene fusion pairings, where both parents had available feature vectors, was created. To identify potential differences between known pairings and random pairings, an equal number (n=2189) of hypothetical parent gene pairings was sampled from the matrix of possible fusion events, and a corresponding paired feature space was generated. Randomly generated fusion pairs were constrained to be novel combinations of known 5' and 3' parents. As before, RF and RLR models were trained to distinguish between the two classes. The procedure was repeated by generating randomized fusions where a fusion could contain any parent (i.e. 5' and 3' parents were pooled).
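The constrained sampling of novel hypothetical fusions can be sketched as follows (a toy helper on hypothetical gene lists; the study sampled from the full 1849 × 1876 product matrix):

```python
import random

def random_novel_pairs(five_prime, three_prime, known_pairs, n, seed=0):
    """Sample n hypothetical fusions (5' gene, 3' gene) that are novel,
    i.e. not among the observed fusion pairs; parents keep their 5'/3'
    roles, matching the order-conserved randomization."""
    rng = random.Random(seed)
    known = set(known_pairs)
    pairs = set()
    while len(pairs) < n:
        cand = (rng.choice(five_prime), rng.choice(three_prime))
        if cand not in known:
            pairs.add(cand)
    return list(pairs)
```

Pooling the 5' and 3' parent sets (the second randomization) amounts to passing the same combined gene list as both `five_prime` and `three_prime`.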

Software
All data processing, visualization and feature engineering were conducted in the R environment for statistical computing [47], with substantial use of the dplyr [48], tidyr [49], and RMySQL [50] R packages for data processing and the ggplot2 [51] package for data visualization. Network analyses were conducted using the igraph R package version 1.0.0 [52]. Random forests were built with the scikit-learn [53] Python library and regularized logistic regression models were built with the penalized R package [46].

Figure S1. Biological function enrichments by parent protein cluster. Parent gene sets by cluster were tested for GOSlim molecular function enrichments using PantherDB, with the set of all human genes as background. P values associated with enrichments, shown to the right of each subplot, were corrected for multiple testing using Bonferroni's correction. For each gene set, the log of fractional difference (observed versus expected number of genes) was calculated. Biological process categories that did not have 10 or more member genes in at least one parent cluster were excluded from the visualization.

Figure S5. Analysis of the randomness of 5'-3' pairing patterns in fusion events. (a)
Comparison of observed versus random pairings between 5' and 3' fusion partners. Gene fusions were randomly generated by sampling once from the 5' parent set and once from the 3' parent set. RF and RLR classifiers were trained to attempt to distinguish between the features of randomly generated fusions and observed fusions. Note that gene symbols are used for illustration, while Ensembl gene accessions were used for randomization. Random forest predictive performance when (b) 5' and 3' gene ordering within fusions was conserved (i.e. in randomly generated fusion events, the 5' parent must be sampled from the set of known 5' parents and the 3' parent from the set of known 3' parents) and when (c) 5' and 3' genes are pooled (i.e. a randomly generated fusion can be composed of any combination of two known parent genes). (d) Model performance of the trained random forest model on test data, showing a confusion matrix and classification report.