Quantifying the Hardness of Bioactivity Prediction Tasks for Transfer Learning

Today, machine learning methods are widely employed in drug discovery. However, the chronic lack of data continues to hamper their further development, validation, and application. Several modern strategies aim to mitigate the challenges associated with data scarcity by learning from data on related tasks. These knowledge-sharing approaches encompass transfer learning, multitask learning, and meta-learning. A key question that remains to be answered for these approaches is to what extent their performance can benefit from the relatedness of the available source (training) tasks; in other words, how difficult ("hard") a test task is for a model, given the available source tasks. This study introduces a new method for quantifying and predicting the hardness of a bioactivity prediction task based on its relation to the available training tasks. The approach involves the generation of protein and chemical representations and the calculation of distances between the bioactivity prediction task and the available training tasks. Using the example of meta-learning on the FS-Mol data set, we demonstrate that the proposed task hardness metric is inversely correlated with performance (Pearson's correlation coefficient r = −0.72). The metric will be useful in estimating the task-specific gain in performance that can be achieved through meta-learning.

• ESM2 (esm2_t6_8M_UR50D): This transformer has 6 layers and 8 million trainable parameters. The mean output of the last layer (layer 6) was used as the protein representation. The command used was: python scripts/extract.py esm2_t6_8M_UR50D fsmol_sequences.fasta embeddings_output --repr_layers 6 --include mean --truncation_seq_length 4096. The output of this model is a 320-dimensional array for each protein (amino acid sequence).
• ESM2 (esm2_t12_35M_UR50D): This transformer has 12 layers and 35 million trainable parameters. The mean output of the last layer (layer 12) was used as the protein representation. The command used was: python scripts/extract.py esm2_t12_35M_UR50D fsmol_sequences.fasta embeddings_output --repr_layers 12 --include mean --truncation_seq_length 4096. The output of this model is a 480-dimensional array for each protein (amino acid sequence).
• ESM2 (esm2_t30_150M_UR50D): This transformer has 30 layers and 150 million trainable parameters. The mean output of the last layer (layer 30) was used as the protein representation. The command used was: python scripts/extract.py esm2_t30_150M_UR50D fsmol_sequences.fasta embeddings_output --repr_layers 30 --include mean --truncation_seq_length 4096. The output of this model is a 640-dimensional array for each protein (amino acid sequence).
• ESM2 (esm2_t33_650M_UR50D): This transformer has 33 layers and 650 million trainable parameters and outputs a representation for each token (amino acid residue) at each layer. The output of the last layer (layer 33) was used in this study. To obtain a representation of the whole protein (rather than of individual residues), the token (residue) representations were averaged. The command used was: python scripts/extract.py esm2_t33_650M_UR50D fsmol_sequences.fasta embeddings_output --repr_layers 33 --include mean --truncation_seq_length 4096. The output of this model is a 1280-dimensional array for each protein (amino acid sequence).
• ESM2 (esm2_t36_3B_UR50D): This transformer has 36 layers and 3 billion trainable parameters. The mean output of the last layer (layer 36) was used as the protein representation. The command used was: python scripts/extract.py esm2_t36_3B_UR50D fsmol_sequences.fasta embeddings_output --repr_layers 36 --include mean --truncation_seq_length 4096. The output of this model is a 2560-dimensional array for each protein (amino acid sequence).
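For downstream use, the per-protein mean embeddings written by extract.py can be loaded as sketched below. The file layout and the 'mean_representations' key follow the facebookresearch/esm repository; the directory name and layer index (33, for esm2_t33_650M_UR50D) are illustrative and version-dependent.

    import torch
    from pathlib import Path

    # Each sequence yields one .pt file holding a dict; with --include mean, the
    # dict maps 'mean_representations' -> {layer number: 1D tensor}.
    embeddings = {}
    for pt_file in Path("embeddings_output").glob("*.pt"):
        record = torch.load(pt_file)
        embeddings[record["label"]] = record["mean_representations"][33]  # e.g., 1280-dim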

Molecule featurization
Table S1 lists the methods (featurizers) used for calculating the molecular representations explored in this study.

Molfeat
Software version: Molfeat commit hash 4390f9f from the authors' public code repository: https://github.com/datamol-io/molfeat. The input to the featurizer is a SMILES string or a list of SMILES strings, and the output is the corresponding feature vector for each SMILES.
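For illustration, a minimal molfeat sketch is given below. The FPCalculator/MoleculeTransformer classes follow the molfeat documentation; the ECFP featurizer and example SMILES are placeholders for the featurizers actually listed in Table S1.

    import numpy as np
    from molfeat.calc import FPCalculator
    from molfeat.trans import MoleculeTransformer

    smiles = ["CCO", "c1ccccc1O"]  # illustrative molecules

    calc = FPCalculator("ecfp")                        # stand-in featurizer
    transformer = MoleculeTransformer(calc, dtype=np.float32)
    features = transformer(smiles)                     # one feature vector per SMILES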

Unimol
Software version: Uni-Mol commit hash b6427ce from the authors' public code repository: https://github.com/dptech-corp/Uni-Mol. The input to the featurizer is a SMILES string or a list of SMILES strings, and the output is the features for each SMILES: a 512-dimensional array per molecule.
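A minimal sketch of computing Uni-Mol molecule-level representations is given below. The unimol_tools package is distributed with the Uni-Mol repository, but class and key names may differ between versions, so treat the details as assumptions.

    from unimol_tools import UniMolRepr

    smiles = ["CCO", "c1ccccc1O"]  # illustrative molecules

    repr_model = UniMolRepr(data_type="molecule")
    out = repr_model.get_repr(smiles)
    cls_repr = out["cls_repr"]  # one 512-dimensional embedding per molecule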

Optimal transport dataset distance (OTDD)
For determining the distance between chemical spaces (molecule-label pairs), the optimal transport dataset distance was used [3]. Software version: otdd commit hash 72f1b22 from the authors' public code repository: https://github.com/microsoft/otdd. The algorithm takes pairs of inputs (x, y), where x represents a molecule and y is a bioactivity label (a binary value indicating the molecule's activity on a target). The molecular representation (x) is obtained from the various featurizers described above. All parameters are set to their default values as used in the original repository. Additionally, 'max_samples' has been set to 1000; this does not significantly affect our results, as the majority of our data sets comprise fewer than 1000 samples, so the outcome is not sensitive to this parameter.
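A minimal sketch of an OTDD computation between two molecule-label data sets, following the usage pattern in the microsoft/otdd README, is given below; the loader construction and the random feature tensors are illustrative and may need adapting to the otdd version used.

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from otdd.pytorch.distance import DatasetDistance

    def make_loader(features, labels):
        # features: (n_molecules, dim) tensor from a featurizer; labels: (n_molecules,) 0/1 tensor
        return DataLoader(TensorDataset(features, labels), batch_size=64)

    loader_a = make_loader(torch.randn(500, 300), torch.randint(0, 2, (500,)))
    loader_b = make_loader(torch.randn(800, 300), torch.randint(0, 2, (800,)))

    dist = DatasetDistance(loader_a, loader_b,
                           inner_ot_method="exact",
                           debiased_loss=True,
                           p=2, entreg=1e-1,
                           device="cpu")
    d = dist.distance(maxsamples=1000)  # 'max_samples' set to 1000, as in this study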

Prototypical Network
To validate the relevance of our proposed distance and hardness measures, we compared the performance of a prototypical network [4] with the hardness assigned to each task. We utilized the FS-Mol implementation of the prototypical network, available at https://github.com/microsoft/FS-Mol. Specifically, the benchmarking study employed the following architecture and features for the prototypical network (a minimal sketch of the prototype computation is given after the list below):

Features and Architecture:
• Fully connected layer on top of features derived from a graph neural network (GNN) and Extended-Connectivity Fingerprints (ECFP).
• Increasing the number of training samples improved the performance of both methods but also shrank the performance gap between them (the more training data are available, the smaller the improvement of the prototypical network over a single-task approach).
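For orientation, a minimal, generic sketch of the core prototypical-network step is given below: class prototypes are the mean embeddings of the support set, and query molecules are classified by their distance to the prototypes. This is an illustration of the technique, not the FS-Mol implementation itself; the embedding dimensions and sample counts are placeholders.

    import torch

    def prototypical_logits(support_emb, support_y, query_emb):
        # support_emb: (n_support, d) embeddings (e.g., from the GNN + ECFP encoder)
        # support_y: (n_support,) binary activity labels; query_emb: (n_query, d)
        prototypes = torch.stack([support_emb[support_y == c].mean(dim=0) for c in (0, 1)])
        dists = torch.cdist(query_emb, prototypes) ** 2   # squared distance to each prototype
        return -dists                                     # softmax over classes gives probabilities

    logits = prototypical_logits(torch.randn(16, 64), torch.randint(0, 2, (16,)), torch.randn(10, 64))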

Figure S1: Proportion of active compounds in each target task set. Most target task sets comprise a comparable number of active and inactive compounds.

Figure S2: Representation of enzyme types among the single-protein assay source tasks and target tasks within the FS-Mol data set. Transferases dominate both data sets, followed by hydrolases and oxidoreductases. Within the source tasks, some tasks have more than one assigned protein family (7%) or no specified protein family (33%); these are not reported in this figure.

Figure S3: Number of molecules representing the individual source and target tasks within the FS-Mol data set.

Figure S4: Meta-learning, in particular the prototypical network, achieved better performance (measured as ROC-AUC) than random forest (a single-task method) for all values of k (number of training data points).

Figure S5: Correlation of EXT_CHEM vs (A) EXT_PROT and (B) INT_CHEM for each of the 157 test tasks. Small molecules are represented with GIN supervised infomax; proteins are represented with ESM2_t33_650M. The number of nearest neighbors (k; training tasks) for calculating the hardness from the distance matrix is 50, with the weighted average used for computing EXT_CHEM and the plain average used for computing EXT_PROT. INT_CHEM is measured with a random forest model trained on 16 randomly selected training samples.
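To make the k-nearest-neighbor hardness computation concrete, a minimal sketch is given below. The inverse-distance weighting scheme and the random distance matrix are assumptions for illustration only.

    import numpy as np

    def external_hardness(dist, k=50, weighted=True):
        # dist: (n_test, n_source) matrix of task-to-task distances
        # (e.g., OTDD for EXT_CHEM, protein-embedding distances for EXT_PROT)
        nn = np.sort(dist, axis=1)[:, :k]            # k smallest distances per test task
        if not weighted:
            return nn.mean(axis=1)                   # plain average (EXT_PROT)
        w = 1.0 / (nn + 1e-8)                        # assumed weighting: closer source tasks count more
        return (nn * w).sum(axis=1) / w.sum(axis=1)  # weighted average (EXT_CHEM)

    hardness = external_hardness(np.random.rand(157, 5000), k=50)  # illustrative sizes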

Figure S6: Relationship between the performance improvement (ΔAUPRC) obtained by using the

Figure S7: Spearman's r between EXT_CHEM measures based on different molecule representations. The number of nearest neighbors (k; training tasks) for calculating the hardness from the distance matrix is 50.

Figure S8: Spearman's r between EXT_PROT measures based on different protein representations. The number of nearest neighbors (k; training tasks) for calculating the hardness from the distance matrix is 50.

Figure S9: Pearson's r of EXT_CHEM with the performance (ROC-AUC) of the prototypical network as a function of k (i.e., the number of nearest-neighbor source tasks considered by the hardness components; weighted average used for computing EXT_CHEM): (A) GIN supervised masking, (B) GIN supervised contextpred, (C) UniMol, and (D) Roberta-Zinc 480M-102M. The figure shows that the correlation is sensitive to this parameter; based on this plot, a k of around 0.1-1% of the number of source tasks appears to be optimal.

Figure S10: Pearson's r of EXT_PROT with the performance (ROC-AUC) of the prototypical network as a function of k (i.e., the number of nearest-neighbor source tasks considered by the hardness components; average used for computing EXT_PROT): (A) ESM2_t6_8M, (B) ESM2_t12_35M, (C) ESM2_t30_150M, and (D) ESM2_t36_3B. The figure shows that the correlation is sensitive to this parameter; based on this plot, a k of around 0.1-1% of the number of source tasks appears to be optimal.

Table S2: Pearson's r for the final task hardness metric composed of EXT_CHEM, EXT_PROT, and INT_CHEM vs prototypical network performance (ROC-AUC). The number of nearest neighbors (k; training tasks) for calculating the hardness from the distance matrix is 50. Internal hardness (INT_CHEM) for each test task is the ROC-AUC of a random forest trained on 16 data points (samples).
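As an illustration of how such a correlation can be evaluated, the sketch below combines the three hardness components into one metric and correlates it with per-task performance. The z-scored, equally weighted combination is an assumption for illustration; the composition used in the study may differ, and all arrays are placeholders.

    import numpy as np
    from scipy.stats import pearsonr

    ext_chem = np.random.rand(157)  # one value per test task (placeholders)
    ext_prot = np.random.rand(157)
    int_chem = np.random.rand(157)  # derived from the 16-sample random forest ROC-AUC
    roc_auc = np.random.rand(157)   # prototypical network performance

    def zscore(x):
        return (x - x.mean()) / x.std()

    hardness = (zscore(ext_chem) + zscore(ext_prot) + zscore(int_chem)) / 3
    r, p = pearsonr(hardness, roc_auc)  # expected to be negative: harder tasks, lower ROC-AUC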

Table S3: Pearson's r for the final task hardness metric composed of EXT_CHEM, EXT_PROT, and INT_CHEM vs prototypical network performance (ROC-AUC). The number of nearest neighbors (k; training tasks) for calculating the hardness from the distance matrix is 10. Internal hardness (INT_CHEM) for each test task is the ROC-AUC of a random forest trained on 64 data points (samples).

Table S4: Pearson's r for INT_CHEM with different splitting strategies vs prototypical network performance (ROC-AUC).