Large-Scale Modeling of Sparse Protein Kinase Activity Data



Abstract
Protein kinases are a protein family that plays an important role in several complex diseases such as cancer, and cardiovascular and immunological diseases. Protein kinases have conserved ATP binding sites, which when targeted can lead to similar activities of inhibitors against different kinases. This can be exploited to create multi-target drugs. On the other hand, selectivity (lack of similar activities) is desirable in order to avoid toxicity issues. There is a vast amount of protein kinase activity data in the public domain, which can be used in many different ways. Multi-task machine learning models are expected to excel for these kinds of datasets because they can learn from implicit correlations between tasks (in this case activities against a variety of kinases). However, multi-task modelling of sparse data poses two major challenges: (i) creating a balanced train-test split without data leakage and (ii) handling missing data. In this work, we construct a protein kinase benchmark set composed of two balanced splits without data leakage, using random and dissimilarity-driven cluster-based mechanisms, respectively. This dataset can be used for benchmarking and developing protein kinase activity prediction models. Overall, the performance on the dissimilarity-driven cluster-based split is lower than on random split-based sets for all models, indicating poor generalizability of models. Nevertheless, we show that multi-task deep learning models, on this very sparse dataset, outperform single-task deep learning and tree-based models. Finally, we demonstrate that data imputation does not improve the performance of (multi-task) models on this benchmark set.

Introduction
Protein kinases are a family of over 500 different enzymes responsible for protein phosphorylation. Most signalling pathways contain kinases, making them pivotal players in all aspects of protein regulation. 1,2 They are found in animals, plants, bacteria and archaea, indicating their importance in sustaining life, and their deregulation often leads to undesirable effects and pathologies. 3,4 These include multiple forms of cancer as well as inflammatory, cardiovascular, immunological and infectious diseases. 5,6 Thus, if the functioning of specific kinases in the body can be selectively altered, a wide range of diseases could potentially be treated, making protein kinases very interesting targets for drug discovery.
In the two decades since the FDA approved imatinib, more than 70 protein kinase inhibitors, mostly for applications in oncology, have been approved and many more inhibitors are in (pre-)clinical pipelines. 6,7 Despite this success, there is still a need for better inhibitors that can either selectively target a single protein kinase, a subset of targets (so-called poly-pharmacological compounds, which can modulate multiple targets), or mutant protein kinases to address resistance. However, the development of a new drug from early-stage drug discovery to clinical development is a challenging and expensive process that takes on average more than ten years and costs more than two billion dollars. 8 Computer-aided drug design (CADD) can reduce these costs by decreasing the number of compounds to be synthesised and the number of experiments needed, especially when applied in early-stage drug discovery. Moreover, CADD can enable early discontinuation of compounds predicted to fail. Machine learning-based quantitative structure-activity relationship (QSAR) models trained on experimental data are often used to predict activities from a compound's 2D or 3D structure. 9 Historically, QSAR models were single-task (ST), i.e. a single model was developed for activity against a single target. However, activities against targets with a conserved ATP binding site, such as protein kinases, are often correlated. [10][11][12] Single-task models cannot take advantage of such correlations, but multi-task (MT) models should be able to utilise these implicit correlations, making the training process more efficient and the model predictions for each kinase more robust, in that they suffer less from individual experimental errors and are applicable to a larger region of chemical space. [13][14][15][16] A hurdle to developing good multi-task models for activity predictions is the sparsity of the experimental data. 
Compounds with experimental data against multiple or ideally all targets are rare, making the data density of the molecules-targets matrix very low. Data imputation has been proposed as a solution. [17][18][19] Imputation is the process of using predicted values for missing data points in the dataset used to train the machine learning models. The complexity of the imputation strategy to obtain the predictions ranges from the simple computation of the mean value of the known data points per task to the use of deep learning models. 20 Subsequently, the imputed values are used together with the experimental activity data to train the models.
In drug design we aim for generalizable models, i.e. models that, as much as possible, predict the properties of novel compounds well. Therefore, model performance should be evaluated with a 'realistic' split (i.e. one corresponding, to the maximum extent possible, to real-life situations), where the chemical similarity between the training and test sets is minimised. For single-task modelling, this is often straightforward, but for multi-task modelling, the construction of realistic, balanced training-validation-test splits without data leakage between tasks is not as straightforward. 21 If the splits are done per target, this will lead to data leakage, i.e. the same compound can be in the training set for one task and in the validation or test set for another. As a consequence, the training, validation and test sets overlap, which in turn leads to an overestimation of the performance of the models. On the other hand, when a 'general' random, cluster, or temporal split is applied to the overall dataset, the datasets will be unbalanced (different ratios of compounds in the training, validation and test sets for the various tasks). Moreover, a random split does not result in chemical dissimilarity between the training and test sets, and as a consequence, a model's generalizability and performance will be overestimated. 22 A basic cluster split, where clusters are combined to create the training-validation-test sets without enforcing molecular dissimilarity, is a variant of the random split and works approximately as well as a random split. A temporal split on public data, in turn, can often lead to extremely unbalanced splits, where some tasks have little to no data in a given set. Two of the three challenges (balance and no data leakage) in creating a good training-validation-test split for multi-task modelling can be addressed with the random global equilibrated selection (RGES) split introduced in this paper. 
All three of them (no data leakage, balance and dissimilarity) can be addressed with the dissimilarity-driven global balanced clustering (DGBC) split recently proposed by Tricarico et al. 21 It simultaneously maximises dissimilarity and balances the individual sets globally. Furthermore, an ensemble of the best-performing models had a predictive accuracy exceeding that of single-dose kinase activity assays. 23 Beyond classic single- and multi-task modelling, cross-kinome correlations have been utilised in proteochemometric modelling, 24-26 and the multi-task imputation models profile QSAR 27 and Alchemite 20 have shown promising results.
In this paper, we introduce two large, curated datasets of kinase activity values from the public domain:
• Kinase200 contains 197 kinases with at least 200 activity data points per kinase
• Kinase1000 contains 74 kinases with at least 1,000 activity data points per kinase
Selected kinases are highlighted on the human protein kinase tree in Figure 1. 28 We also propose two well-balanced 80-10-10 multi-task splits for activity prediction models: one based on random allocation and the other on clustering. 21 Finally, we benchmarked the capacity of different approaches to utilise large-scale protein kinase activity data to build prediction models. This was done by comparing the performance of a set of single-task models and multi-task models with and without data imputation. All data and code are shared publicly so that they can be used as a benchmark set for building predictive protein kinase activity models.

The datasets were created based on the Papyrus dataset (version 05.6). 31 The Papyrus dataset is a curated dataset containing multiple large publicly available datasets, such as ChEMBL, and several smaller datasets. The data in the database have been standardised and normalised. Initially, all protein kinase activity data points (Ki, KD, IC50, EC50) of kinases labelled as 'high' quality were retrieved. Compounds with a molecular weight larger than 1000 Da, and activity points composed of multiple measurements with a standard deviation larger than 1.0 log units, were filtered from the dataset. 
Furthermore, allosteric data points were removed based on text mining of ChEMBL assay descriptions and abstracts from PubChem, PubMed, CrossRef and Google Patents for keywords. Additionally, we removed all compounds that had a Tanimoto similarity, calculated from Morgan fingerprints (radius 3, 2048 bits), above 0.8 to at least one of the molecules annotated as allosteric in the previous step.
The final datasets were constructed by selecting kinases with 200 or more data points (Kinase200) and with 1000 or more data points (Kinase1000), respectively. Unless mentioned otherwise, all following analyses and figures are done for the Kinase200 dataset; corresponding figures for the Kinase1000 dataset can be found in the Supplementary Information. An overview of the datasets is summarised in Table 1.

Data splitting
The datasets were split into training, test and validation sets using either random global equilibrated selection (RGES) split or dissimilarity-driven balanced cluster (DGBC) split with 80% of data going into the training set, 10% into the test set and 10% into the validation set.
Random global equilibrated selection (RGES) split was done by sorting targets from the target with the most data points to those with the least. Then, for each target, a random split was made. If a compound belonged to a different (training, validation, test) set for a different target, its final label was set to the label of that compound for the target lowest on the sorted list. This mechanism was chosen because reassigning labels for targets with larger numbers of compounds has smaller relative effects on the balance.
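The reassignment logic above can be sketched as follows (a minimal pure-Python illustration; the function and variable names are ours, not from the released code):

```python
import random

def rges_split(activity, ratios=(0.8, 0.1, 0.1), seed=42):
    """activity: dict mapping target -> list of compound ids.
    Returns a dict compound -> 'train'/'valid'/'test'."""
    rng = random.Random(seed)
    # Sort targets from the most to the least data points.
    targets = sorted(activity, key=lambda t: len(activity[t]), reverse=True)
    labels = {}
    for target in targets:
        compounds = list(activity[target])
        rng.shuffle(compounds)
        n = len(compounds)
        n_train, n_valid = int(ratios[0] * n), int(ratios[1] * n)
        for i, c in enumerate(compounds):
            if i < n_train:
                labels[c] = "train"
            elif i < n_train + n_valid:
                labels[c] = "valid"
            else:
                labels[c] = "test"
            # Targets lower on the sorted list overwrite earlier labels, so a
            # compound's final label comes from its smallest target, where
            # reassignment has the largest relative effect on balance.
    return labels
```

Because targets with few compounds are processed last, their per-target ratios are preserved exactly, while the larger targets absorb the resulting imbalance.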
Dissimilarity-driven balanced cluster (DGBC) split was made by using a method developed by Tricarico et al. 21 First the compounds in the dataset were clustered using sphere exclusion clustering on ECFP6 fingerprints with a Tanimoto distance of 0.736 between cluster centroids. 32 Fingerprint generation and sphere exclusion clustering were done using RDKit (version 2020.09.05). 33 The clusters were distributed over the training, validation and test sets using linear programming to simultaneously achieve maximum dissimilarity between the sets and the desired training-validation-test ratio for each target.
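The paper uses RDKit's sphere exclusion implementation on ECFP6 fingerprints; the greedy idea behind it can be sketched in pure Python, here on fingerprints represented as sets of on-bits (an illustrative simplification, not the actual implementation):

```python
def tanimoto_distance(a, b):
    """Tanimoto distance between fingerprints given as sets of on-bits."""
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return 1.0 - (inter / union if union else 0.0)

def sphere_exclusion(fps, min_dist=0.736):
    """Greedy sphere exclusion: a point becomes a new centroid unless it lies
    within min_dist of an existing centroid, in which case it is assigned to
    its nearest centroid. Returns centroid indices and cluster assignments."""
    centroids = []    # indices of cluster centroids
    assignment = []   # cluster index per fingerprint
    for i, fp in enumerate(fps):
        dists = [tanimoto_distance(fp, fps[c]) for c in centroids]
        if dists and min(dists) < min_dist:
            assignment.append(dists.index(min(dists)))
        else:
            centroids.append(i)
            assignment.append(len(centroids) - 1)
    return centroids, assignment
```

With the 0.736 threshold used in the paper, any two cluster centroids are guaranteed to be at least that Tanimoto distance apart.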
Evaluation of the splits was done in three different ways:
• Data balance - data percentage per set and target
• Data distribution - distribution of pChEMBL values in each set
• Chemical dissimilarity - distribution of the minimum Tanimoto distance of compounds in each set compared to all compounds in the other sets
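The third criterion, the minimum Tanimoto distance of each compound in one set to all compounds in the other sets, can be computed as in this sketch (pure Python on on-bit sets; illustrative only):

```python
def min_tanimoto_distances(query_fps, ref_fps):
    """For each fingerprint in query_fps (sets of on-bits), return its minimum
    Tanimoto distance to any fingerprint in ref_fps."""
    def dist(a, b):
        inter = len(a & b)
        union = len(a) + len(b) - inter
        return 1.0 - (inter / union if union else 0.0)
    return [min(dist(q, r) for r in ref_fps) for q in query_fps]
```

A distribution of these values shifted towards larger distances indicates a more chemically dissimilar, and thus more challenging, split.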

Models
The main aim of this work is two-fold: (i) to assess to what extent accurate QSAR models can be developed on the basis of a large but sparse kinase-compound activity matrix and (ii) to determine to what extent imputation can improve these models. For this, we first developed QSAR models without imputation and subsequently investigated whether data imputation can improve performance.
To develop baseline models, we used two well-known and widely used tree-based single-task methods: random forest and gradient boosting. The third model was a multi-task version of gradient boosting. The fourth and fifth models were developed with directed message-passing neural networks (D-MPNN), as implemented in chemprop (CP). 34 The D-MPNN approach was applied to both single- and multi-task models, referred to as CP ST and CP MT , respectively. The sixth and seventh models applied the multi-task D-MPNN approach where the missing values were imputed either by mean imputation or by a single-task RF prediction, referred to as CP Mean MT and CP RF MT , respectively. The eighth, and last, model was a reimplementation of the profile QSAR model (pQSAR) by Martin et al. 27 The models were developed on a server containing 24 Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz CPUs, 7 NVIDIA GeForce GTX 1080 GPUs and 1 NVIDIA GeForce RTX 2080 Ti GPU.

Random Forest, XGBoost and PyBoost
Sets of single-task RF ST and XGB ST models were developed for each kinase with the sklearn 35 and xgboost 36 packages, respectively, and the multi-task PB MT model, which masks missing data in the loss function, was developed with the pyboost 37 package. All three model types used Morgan fingerprints as features (radius 3, 2048 bits). Initially, all models were developed with default parameters (RF: n_estimators=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features=1.0; XGB: ntrees=100, learning_rate=0.3, n_estimators=100, min_child_weight=1, colsample_bytree=1, scale_pos_weight=1, max_depth=6, subsample=1; PB: lr=0.05, min_gain_to_split=0, lambda_l2=1, gd_steps=1, max_depth=6, colsample=1, subsample=1, quantization='Quantile'). Subsequently, the hyperparameters of the models were optimised on the validation set using a random grid search, selecting the best model based on the R 2 -score.
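Masking missing data in the loss, as done for the PB MT model, amounts to computing the loss and its gradient only over observed entries. A numpy sketch of a masked MSE (our illustration of the idea, not pyboost's internals):

```python
import numpy as np

def masked_mse(y_true, y_pred):
    """MSE over observed entries only; missing activities are NaN in y_true.
    Returns (loss, gradient w.r.t. y_pred), with zero gradient on missing
    cells so they contribute nothing to the boosting updates."""
    mask = ~np.isnan(y_true)
    diff = np.where(mask, y_pred - np.nan_to_num(y_true), 0.0)
    n_obs = mask.sum()
    loss = (diff ** 2).sum() / n_obs
    grad = 2.0 * diff / n_obs
    return loss, grad
```

This is what allows a single multi-task model to be trained on a sparse compound-by-kinase activity matrix without imputation.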
Chemprop Both a set of single-task (CP ST ) models and a single multi-task (CP MT ) model were developed, masking missing data in the loss function. 34 The chemprop models were run with both default hyperparameters (hidden_size=300, depth=3, dropout=0.0, ffn_num_layers=300, activation=ReLU, bias=False, max_lr=1e-3, epochs=30) and with hyperparameters that were optimised on the validation set using Optuna. 38 The optimised parameters were obtained from 150 trials on five randomly chosen targets from the Kinase1000 dataset with the DGBC split: P00533, P04626, P06239, Q5S007 and O75116.
Chemprop with data imputation Two chemprop multi-task models with data imputation were also developed: one with mean imputation (CP Mean MT ) and the other one with RF imputation (CP RF MT ). In the former case, missing values are filled in by the arithmetic mean of the mean of available activities per compound and the mean of available activities per kinase. In the latter case, by making a single-task RF prediction. These new datasets with imputed values are then used to train a chemprop multi-task model. Both models were run with default and optimised parameters, respectively.
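The mean imputation rule described above, the arithmetic mean of the compound's row mean and the kinase's column mean, can be sketched with numpy (illustrative; the function name is ours):

```python
import numpy as np

def mean_impute(matrix):
    """Fill NaNs in a compounds x kinases activity matrix with the arithmetic
    mean of the compound's row mean and the kinase's column mean."""
    row_means = np.nanmean(matrix, axis=1, keepdims=True)  # per compound
    col_means = np.nanmean(matrix, axis=0, keepdims=True)  # per kinase
    fill = (row_means + col_means) / 2.0  # broadcasts to the full matrix
    return np.where(np.isnan(matrix), fill, matrix)
```

The densified matrix produced this way (or by per-kinase RF predictions) is then used as the training target matrix for the multi-task chemprop model.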
pQSAR We reimplemented the pQSAR 2.0 method from Martin et al. 27 in Python. In the first step, missing data are imputed with single-task RF models. In the second step, for each kinase, a partial least squares (PLS) regression model was developed with the experimental and imputed activity values of the other kinases as input data. As described in the pQSAR 2.0 paper, the number of components for PLS is selected based on an adjusted score, penalising the R 2 by 0.002 * the number of components. Both the RF and PLS models were built with scikit-learn. The implementation was validated by training it on the dataset accompanying the pQSAR paper by Martin et al. and comparing the results of our implementation to the results published in the paper. These results are shown in section S5.
In the original pQSAR paper, the train-test set split was done per assay instead of per kinase, so for some kinases, multiple different assays were included. In their implementation, the data split was done per task instead of on the complete dataset. That led to data leakage between assays (Type-1 leakage), in that several compounds were in the training set for one assay and in the test set for another. Furthermore, during the PLS training process, the input data included both training and test compounds, leading to further data leakage (Type-2 leakage). As a consequence of this double leakage, the training and test sets overlapped, which in turn led to an overestimation of the performance of the models. In all work described here, we do not allow Type-1 data leakage. In order to assess the effect of Type-2 data leakage, we used our pQSAR implementation with the RGES and DGBC splits, respectively, both with and without including the test set during PLS training and comparing the performance of both approaches.
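Checking for Type-1 leakage reduces to a set intersection over the per-task splits; a minimal sketch (names are ours):

```python
def find_type1_leakage(splits):
    """splits: dict task -> dict with 'train'/'test' sets of compound ids.
    Returns compounds that appear in the training set for one task and the
    test set for another (or the same) task."""
    train_all = set().union(*(s["train"] for s in splits.values()))
    test_all = set().union(*(s["test"] for s in splits.values()))
    return train_all & test_all
```

A global split, where each compound receives one label across all tasks, makes this intersection empty by construction.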
Metrics Predicted activity values were compared to experimental activity values by calculating the coefficient of determination (R 2 ) and root-mean-squared error (RMSE) per kinase. Medians, means and standard deviations of these distributions are reported.
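Over a sparse activity matrix, the per-kinase metrics are computed on the observed entries only; a numpy sketch (our illustration):

```python
import numpy as np

def per_kinase_metrics(y_true, y_pred):
    """y_true, y_pred: compounds x kinases arrays, NaN where unmeasured.
    Returns (r2, rmse) arrays with one entry per kinase."""
    r2, rmse = [], []
    for col in range(y_true.shape[1]):
        mask = ~np.isnan(y_true[:, col])
        t, p = y_true[mask, col], y_pred[mask, col]
        ss_res = ((t - p) ** 2).sum()
        ss_tot = ((t - t.mean()) ** 2).sum()
        r2.append(1.0 - ss_res / ss_tot)
        rmse.append(np.sqrt(ss_res / mask.sum()))
    return np.array(r2), np.array(rmse)
```

Medians, means and standard deviations of these per-kinase arrays then give the summary numbers reported in the results tables.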

Data splits
We constructed two sets consisting of 80% (training) / 10% (validation) / 10% (test) splits for multi-task modelling of protein kinases. These sets are well-balanced for all targets and there is no data leakage between targets. The first set was created with the RGES split and the second with Tricarico's DGBC split.

Balanced sets without data leakage
As illustrated in Figure 2A and summarised in Table S1, both datasets are well-balanced. For both splits, the mean of the ratio of the molecules per target is close to 80%/10%/10% and the standard deviations are small. The RGES split has a slightly more balanced ratio of molecules than the DGBC split. In Figure 2B, we show the pActivity (-log(activity)) value distribution per set. The distributions are very similar to each other indicating that activity values are also well-distributed between all sets.

Sets for interpolation and extrapolation
For both splits, the chemical similarity of the sets is illustrated in Figure 2C, showing the distribution of the minimum Tanimoto distance of molecules (min(d T )) in a set to all the molecules in the other sets. The mean values per set are summarised in Table S1. As expected by design, the DGBC split yields more chemically dissimilar sets than the RGES split. This makes DGBC a more challenging split, and therefore better suited for testing the generalizability of a model.

Modelling kinase activity
We benchmarked the performance of a variety of well-known single-and multi-task ML models to model large-scale protein kinase activity data. All models have been evaluated both with the RGES and the DGBC splits. The overall results are summarised in Table 2 and Figure 3.

Importance of data splitting
For all models, the performance is significantly better on the random split than on the dissimilarity-driven cluster-based split. Most models reach median R 2 > 0.6 and RMSE < 0.6, thresholds often used to consider a model predictive, on the RGES split. However, on average the median R 2 is 0.3 lower and the median RMSE is 0.2 higher for the DGBC split than for the random split, showing that the models do not perform as well on this data split. These results are in line with expectations and previously published results 21,39 and show the importance of assessing model performance with a realistic split.

Hyperparameter optimisation improves performance
We optimised the hyperparameters of the tree- and chemprop-based models. The hyperparameters of the tree-based models were optimised separately for each model. To limit computation time, we only optimised the CP MT model's hyperparameters on the DGBC split and then applied the optimised parameters to all chemprop models. We show the performance of the optimised models in Figure 3 and Table 2; the performances of the default models are summarised in Table S2.
On the RGES split, the hyperparameter optimisation significantly increases the performance of all models, except the RF ST models for which the performance does not change. The median R 2 and RMSE were improved by 0.12 and 0.07 on average, respectively, and the largest improvements are seen for the PB MT model (∆R 2 =0.30 and ∆RMSE=0.19). In the case of the DGBC split, the effect of the hyperparameter optimisation is much smaller, with the median R 2 and RMSE only improving by 0.07 and 0.03 on average, respectively. The largest improvements are again seen for the PB MT model (∆R 2 =0.16 and ∆RMSE=0.07) and the second largest for the XGB ST models, indicating that hyperparameter optimisation is necessary for gradient boosting models. It is surprising that the effect is consistently larger on the RGES than on the DGBC split for the D-MPNN-based models, as the hyperparameter optimisation was done on a small subset with the cluster-based split. Nevertheless, all models were improved by hyperparameter optimisation, and future studies could also use algorithms, like FABOLAS, 40 that can handle hyperparameter optimisation on very large datasets to yield even greater gains in performance for the multi-task D-MPNN models. The subsequent analyses in this paper were done with the optimised models.

Table 2: Performance of kinase activity prediction models. Median, mean and standard deviations of R 2 and RMSE metrics per kinase for single-task random forest (RF ST ), xgboost (XGB ST ), and chemprop single-task (CP ST ) models, the multi-task pyboost model (PB MT ), chemprop multi-task models without data imputation (CP MT ), with mean imputation (CP Mean MT ) and with RF imputation (CP RF MT ), and the profile QSAR model without data leakage (pQSAR).

Multi-task models outperform a collection of single-task models
To evaluate the effect of utilising correlations between kinases in sparse data, we compared pairwise the performance of single- and multi-task (i) gradient boosting models (XGB ST s vs. PB MT ) and (ii) chemprop models (CP ST s vs. CP MT ). For both cases and both splits, the multi-task models are superior to the set of single-task models, with increased average R 2 -scores and decreased RMSEs (Table 3).

Table 3: Effect of multi-task modelling. Difference between the mean R 2 and RMSE per kinase of single- and multi-task models.
The effect of multi-task modelling is larger for the deep learning model than for the gradient-boosting one. The CP models are based on learned representations; on small, single-task datasets, the model is most likely not able to extract generalisable embeddings. The multi-task model uses a much larger dataset from which it is able to extract more generalisable embeddings, leading to a significant improvement. On the R 2 -score, on which we see the larger gains, the improvements are larger on the RGES split for the deep learning models and on the DGBC split for the gradient boosting models. This indicates that the exploitation of inter-target correlation can be useful when predicting the activities of compounds dissimilar to the ones in the training set. Furthermore, in the case of the deep learning models, there is a speed-up of a factor of ∼30 in computation time between running a single multi-task model covering all 198 kinases and running 198 separate single-task models.
Similarly, in a recent study using an in-house dataset with random splits for binary kinase activity predictions, Moriwaki et al. 41 showed promising results for multi-task graph neural networks that outperform single-task models. Another option to exploit inter-target correlation is proteochemometric modelling. 42 However, both our single- and multi-task models seem to outperform the proteochemometric models used by Born et al. (RMSEs above 0.75 in a random split-based evaluation) 26 in their large-scale kinase modelling.

Tree-based machine learning outperforms deep learning
Both in the case of single-task models (RF ST , XGB ST vs. CP ST ) and multi-task models (PB MT vs. CP MT ), we see that the classical tree-based machine learning methods outperform the deep learning models. Even though in general deep learning models have been shown to outperform classic machine learning approaches in activity prediction, 43 these results are in line with in-house results and other publications 23,44,45 that show that in some cases classic ML methods perform as well as deep learning models. The superior performance of tree-based methods over deep learning models has been well described on small- to medium-sized tabular datasets; several reasons for this are that NNs are biased towards overly smooth solutions, that uninformative features are more problematic for NNs, and that NNs are typically not rotationally invariant, as discussed in Ref. 46. Moreover, the deep learning approach is based on learned representations; on small, single-task datasets, the model is most likely not able to extract generalisable embeddings. This would explain why the difference between the single-task models is larger than between the multi-task ones. Furthermore, the hyperparameters of each tree-based model were optimised separately, whereas, due to computation time, the deep learning models use parameters optimised once on a subset for the multi-task model, so they might not be optimal for each separate model.

Performance is not correlated with data density
We have evaluated whether there is a correlation between data density and model performance, and whether there is a difference in performance when using datasets at different density levels. Figure 4 illustrates that the performance of the CP MT model for each kinase, assessed by the R 2 and the RMSE, is poorly correlated with the number of data points for that kinase.
To evaluate how the multi-task model performance is affected by adding activity data on targets that have lower data density, the CP MT model was run on both datasets, Kinase1000 and Kinase200, which have different densities. The performance difference for targets present in both the Kinase200 and Kinase1000 datasets is shown by the distributions of ∆R 2 = R 2 kinase1000 − R 2 kinase200 and ∆RMSE = RMSE kinase1000 − RMSE kinase200 in Figure 5. For the RGES split, it is clear that there is no difference in performance between the two datasets, so adding activity data from targets that have fewer data points does not improve the performance of the models on the other targets. For the DGBC split, there are larger differences in performance between the sets, but there is no systematic improvement with the addition of the sparser kinases to the dataset. Overall, adding more kinases with fewer data points leads to a sparser data matrix, and it does not improve model performance.

Data imputation does not improve chemprop performance
As the dataset is very sparse, making it difficult to exploit inter-kinase correlations, we investigated whether data imputation could improve the performance of the multi-task models. The chemprop multi-task models with imputation, CP Mean MT and CP RF MT , were compared to the model without imputation (Figure 3 and Table 2). CP RF MT has very similar performance to CP MT , and CP Mean MT underperforms significantly compared to any other model, single- or multi-task.

Figure 5: Distributions of ∆R 2 = R 2 kinase1000 − R 2 kinase200 (top) and ∆RMSE = RMSE kinase1000 − RMSE kinase200 (bottom) for kinases present in both datasets. In green and purple, kinases for which model performance increases or decreases with the addition of sparse kinases to the dataset.

pQSAR without data leakage underperforms
In Martin et al., 27 the test set is included when training the PLS model, which leads to data leakage between kinases. To investigate the impact of this, we trained the PLS both with and without the test set. In Figure 6 and Table S3, we show that when the test set is not included in the training of the PLS, the performance measures turn out worse than when the test set is included. The pQSAR model without the test set performs worse than the single-task RF ST s, whereas when the test set is included in training it outperforms the RFs. The test set should be independent, so we exclude the test set from the training of the PLS models. In this scenario, the pQSAR model underperforms compared to every other model except CP Mean MT . Alchemite, a commercial deep learning-based method using imputation, adopts an expectation-maximisation algorithm and has also shown promising results on the cluster-based pQSAR dataset, but as mentioned before the splits from pQSAR suffer from data leakage. 20

Figure 6: Performance of pQSAR with and without data leakage. Distributions of R 2 and RMSE values between predictions and experimental values of the data in the test set for each target kinase in the kinase200 dataset. Predictions were made using single-task random forest models (RF ST ), pQSAR with PLS trained on the training and test sets (pQSAR w/ test ) and pQSAR with PLS trained on the training set only (pQSAR w/o test ). Results are shown both for the random global equilibrated selection (RGES) split and the dissimilarity-driven global balanced cluster (DGBC) split.

Conclusion
In this study, we have investigated the large-scale modelling of protein kinase activity with various machine-learning approaches. For this, we created two large protein kinase datasets from the curated Papyrus database, 31 comprising 198 kinases and 83K molecules (kinase200) and 66 kinases and 71K molecules (kinase1000), respectively. Other recently applied kinase datasets contain either a more limited number of targets 30 or molecules, 27,29 except for the dataset based on all human kinases used by Born et al. 26 For each of our datasets, two balanced multi-task splits without data leakage between targets were created, a random (RGES) and a dissimilarity-driven cluster-based (DGBC) split, to rigorously evaluate the performance of the models within exploitation and exploration schemes. Other publications on large-scale kinase modelling have either used only random-based splits 26 or suffered from data leakage between targets. 27 We then compared the performance of eight models to predict protein kinase activity: sets of single-task random forest, xgboost and chemprop models, a multi-task pyboost model, chemprop multi-task models without and with data imputation, and our implementation of pQSAR 2.0. Both in the case of single- and multi-task models, we see that the deep learning-based models are outperformed by classic machine learning methods on both splits. The performance of the gradient boosting and D-MPNN models on both splits can be improved by creating multi-task models, indicating that exploitation of inter-target correlation can be useful when predicting activities of compounds both similar and dissimilar to the ones in the training set. The multi-task models' performance could not be improved by data imputation. Moreover, using pQSAR without data leakage between targets also does not lead to higher predictive power.
As expected, we see that every model performs significantly worse with the dissimilarity-driven split than with the random split. Most of the models show some predictive power with the random split but struggle with the cluster-based split. The dissimilarity-driven cluster-based split gives a more conservative estimation of performance and is thus a more realistic assessment of the performance of machine learning models in real drug discovery projects. As the poor performance of the different models on the cluster-based split shows (for PB MT , the overall best-performing model, only 19% of kinases have an R 2 above 0.4), further developments are needed to efficiently model activity with large-scale sparse datasets. The inclusion of additional data in the form of protein information may improve this performance.

Data and Software Availability
The code and the datasets are provided to the community at https://github.com/CDDLeiden/kinase-modelling as a benchmark set for developing large-scale multi-task models for kinases.

Supporting Information Available
Available free of charge. S1 -Extra analysis of the kinase200 split, S2 -analysis of the kinase1000 splits, S3 -models' performance with default parameters on kinase200 set, S4 -models' performance with default parameters on kinase1000 set, S5 -pQSAR implementation validation.
