Synergizing Chemical Structures and Bioassay Descriptions for Enhanced Molecular Property Prediction in Drug Discovery

The precise prediction of molecular properties can greatly accelerate the development of new drugs. However, in silico molecular property prediction approaches have been limited so far to assays for which large amounts of data are available. In this study, we develop a new computational approach leveraging both the textual description of the assay of interest and the chemical structure of target compounds. By combining these two sources of information via self-supervised learning, our tool can provide accurate predictions for assays where no measurements are available. Remarkably, our approach achieves state-of-the-art performance on the FS-Mol benchmark for zero-shot prediction, outperforming a wide variety of deep learning approaches. Additionally, we demonstrate how our tool can be used for tailoring screening libraries for the assay of interest, showing promising performance in a retrospective case study on a high-throughput screening campaign. By accelerating the early identification of active molecules in drug discovery and development, this method has the potential to streamline the identification of novel therapeutics.

It should be noted that 67 % of the FS-Mol data is not annotated with target protein information. The distribution of protein families present in FS-Mol is shown in fig. S1.

Molecular representation
Bioassay-based large language model (LLM) text embeddings
The pipeline in this study requires titles, descriptions and protocols as an additional representation for each molecule and assay. Therefore, this text information is extracted from PubChem.S3 Using the application programming interfaces (APIs) of both PubChem and ChEMBL, a mapping from PubChem bioassay identifiers (AIDs) to ChEMBL IDs is performed.S2,S3 This is done to combine the information-rich text of PubChem with the ChEMBL-based FS-Mol benchmark, from which the FS-Mol data originate. Only the 122 overlapping tasks of the test set could be mapped using the PubChem and ChEMBL APIs.

Frequent hitter and assay artefact analysis
We apply GSK and REOS structural alerts using RDKit.S5 The more matches a compound has with these filters, the higher its chance of being a false positive. We categorise compounds into two classes: compounds with no alerts and compounds with one or more alerts.
We use the HitDexter 3 web server (https://nerdd.univie.ac.at/hitdexter3/) for batch processing to predict frequent hitters.S6 We analyse the correlation between TwinBooster predictions and the likelihood of a compound being highly promiscuous. The data is extracted from the columns HPROM-NPROM target-based assay data set model, HPROM-NPROM cell-based assay data set model and HPROM-NPROM extended cell-based assay data set model of the .csv output of the web server.S6 To investigate the performance of a kinase-only model, a random forest (RF) model is trained on the kinase subset of the FS-Mol training set (based on the protein super family).S1,S7

Case study
To highlight the zero-shot capabilities of TwinBooster, a case study of a biological high-throughput screening (HTS) campaign is conducted. The primary screen (AID 2732 *) is analysed by TwinBooster to predict the desired properties, and the predictions are then evaluated against the confirmatory screen (AID 504437 †). The columns PUBCHEM EXT DATASOURCE SMILES and PUBCHEM ACTIVITY OUTCOME, as well as the text information, are used for the TwinBooster prediction pipeline.S3

Performance evaluation
Recall measures the proportion of relevant instances that are retrieved; in this case study, these are the active compounds.S8 It can be expressed as the ratio of the retrieved active molecules (true positives (TPs)) to all active molecules (TPs and false negatives (FNs)).
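As a minimal sketch, the recall used in this case study can be computed directly from the TP and FN counts (the function name and the 0/1 label encoding are illustrative, not from the study's code):

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN): the fraction of active compounds (label 1)
    that are retrieved by the predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)
```

For example, with three actives of which two are retrieved, the recall is 2/3.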

Similarity estimation
The Tanimoto similarity of compounds is calculated using the corresponding RDKit function.S5 To highlight structural similarities and differences, 50 active compounds are sampled from the confirmatory screen and compared to the top 50 compounds selected based on the TwinBooster ranking of the primary screen.
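The study uses the corresponding RDKit function; as a library-free sketch, the Tanimoto similarity between two fingerprints represented as sets of on-bit indices can be computed as:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of
    on-bit indices: |A ∩ B| / |A ∪ B|."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0
```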

Models

Language models
Fine-tuning large language models
In this study, we used Microsoft's DeBERTa V3 S9 as the underlying architecture for our LLM and fine-tuned it on a comprehensive bioassay corpus obtained from PubChem.S3,S9,S10 We chose DeBERTa V3, a pretrained LLM, for this research due to its superior performance compared to the original DeBERTa and BERT models, respectively.S9 DeBERTa V3 uses the replaced token detection pretraining task, which is more sample-efficient than the traditional masked language modelling approach. This innovation enhances both training efficiency and model quality by removing the "tug-of-war" dynamics present in the vanilla embedding-sharing method used in ELECTRA.S9,S10 The fine-tuning process aimed to enhance the model's performance in the context of biomolecular properties.
The DeBERTa V3 base model S9 is fine-tuned on the PubChem corpus using ∼14 GB of video random-access memory (VRAM) for ∼15 h. In the augmented version, the description is shuffled ("." as delimiter) and 5 augmentations are used as the training corpus. For this, the Python S4 Transformers S11 library is used. The Optuna S12 hyperparameter optimisation library is used to find the best combination of hyperparameters for the LLM (ref. table S1).
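A minimal sketch of the augmentation step, assuming sentences are split on "." and each augmentation is an independent shuffle (the function name and seeding are illustrative, not the study's exact code):

```python
import random

def augment_description(text, n_augmentations=5, seed=42):
    """Create augmented copies of an assay description by shuffling its
    sentences, using '.' as the delimiter."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    augmented = []
    for _ in range(n_augmentations):
        shuffled = sentences[:]
        rng.shuffle(shuffled)  # reorder sentences, keep their content intact
        augmented.append(". ".join(shuffled) + ".")
    return augmented
```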
After 20 optimisation trials, the best hyperparameters, shown in table S2, were found.

One-hot encoding
To provide a rule-based baseline for comparison with ML or deep learning (DL) encodings, we use one-hot encoding of the assay type and protein super family information.
To the ECFP of each contained molecule we add 42 one-hot encoded, task-specific features. This method can only be applied where FS-Mol tasks are annotated, which is the case for only 33 % of all tasks. The information is taken from the target info.csv file found in the FS-Mol repository.
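A sketch of how the task-specific one-hot features could be appended to an ECFP; the split of the 42 features between assay type and protein super family is an assumption made here for illustration:

```python
def add_task_features(ecfp, assay_type_idx, family_idx,
                      n_assay_types, n_families):
    """Append one-hot encoded, task-specific features (assay type and
    protein super family) to a molecule's ECFP bit vector.
    Index and vocabulary-size arguments are illustrative assumptions."""
    assay_onehot = [0] * n_assay_types
    assay_onehot[assay_type_idx] = 1
    family_onehot = [0] * n_families
    family_onehot[family_idx] = 1
    return list(ecfp) + assay_onehot + family_onehot
```

In the study the two one-hot segments together contribute 42 extra features per molecule.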
Latent semantic analysis
The latent semantic analysis (LSA) model S15 is pretrained on the PubChem corpus. This process is carried out in a manner similar to the approach used in Seidl et al., using the Python classes TfidfVectorizer and TruncatedSVD.S7,S16 The resulting text embeddings for each bioassay have a dimensionality of 355.
Performance evaluation
For the evaluation, perplexity (PPL) is used as the metric for fine-tuning the LLM (PPL ∈ [0, ∞[, lower values indicate better performance).S17,S18 The underlying hypothesis is that the ability of a model to predict subsequent words in unseen sentences with high probability indicates a comprehensive understanding of the grammatical structures of the language. The LLM evaluation is performed on the test set (20 % split) of the corpus with a token corruption rate of 15 %. The training and evaluation procedures are conducted using the PubChem corpus.S3 Perplexity is mathematically defined as the exponential of the cross-entropy loss between the predicted word probabilities and the actual distribution in the test set. Given this definition, the perplexity of the model can be expressed as: PPL = exp(H(p, q)) = exp(eval loss), where H(p, q) denotes the cross-entropy loss.
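The stated relation between perplexity and the cross-entropy evaluation loss is directly computable:

```python
import math

def perplexity(eval_loss):
    """PPL = exp(H(p, q)): the exponential of the cross-entropy loss.
    A loss of 0 gives PPL = 1 (perfect prediction); higher loss, higher PPL."""
    return math.exp(eval_loss)
```

For instance, the reported perplexity of 1.52 corresponds to an evaluation loss of ln(1.52) ≈ 0.419.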

TwinBooster
Multilayer perceptron
Barlow Twins uses multilayer perceptrons (MLPs) for both the encoders and the projector.S19 The network architecture is altered from the original by having two encoders, a molecule encoder and a text encoder. The Barlow Twins model trains on negative as well as positive examples, which should help generalisation. Finally, the projector is shared for both representations.
Both encoders as well as the projector have the following structure, where l_i is the input layer and l_{i+1} is its output, with a flexible number of layers n and adjustable dimensionality of input and output. Furthermore, the variables W, b represent learnable weights and biases. A linear layer is followed by batch normalisation,S20 an activation function ϕ,S21,S22 and a final linear layer. The network is constructed using PyTorch.S23 The network is trained using the Barlow Twins loss S19 and the AdamW optimiser.S24 Manual hyperparameter tuning is performed on the range and set of parameters listed in table S3. The model is trained for 25 epochs, or until early stopping is engaged if a validation set is provided.
Furthermore, the model is trained using ECFPs and LLM embeddings. For inactive molecules, the embeddings are sign-changed.
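The sign change for inactive molecules can be sketched as follows (the function name and plain-list representation are illustrative):

```python
def signed_text_embedding(embedding, active):
    """Return the LLM text embedding unchanged for active molecules and
    sign-changed (element-wise negated) for inactive molecules."""
    return list(embedding) if active else [-x for x in embedding]
```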

Gradient boosting machine
The gradient boosting machine (GBM) package LightGBM is used for training based on the informational bottleneck embeddings provided by the Barlow Twins model.S25 Zero-shot predictions are performed by feeding the Barlow Twins model with the ECFPs and text information of the desired molecules.S26 To achieve optimal performance, the SMAC3 S27 multi-fidelity hyperparameter optimisation library is applied to find the optimal combination of hyperparameters (ref. table S4), using 80 % of the "train" data of the FS-Mol dataset for training. Finally, the LightGBM model is trained using the full "train" data of the FS-Mol dataset and the best hyperparameters listed in table S5.
Performance evaluation
When comparing models, we use the intersecting tasks of the "test" data to ensure a fair comparison. Metric selection is based on the FS-Mol benchmark.S1 ROC AUC is commonly used for classifier evaluation, while PR AUC is recommended for highly skewed classes, as it provides a more realistic view of classifier performance than ROC AUC.S8,S29-S31 Moreover, both metrics can be calculated from the predicted probabilities rather than the hard predictions, where a classification threshold problem can arise.S8 In the context of zero- and few-shot learning, a different form of PR AUC, known as ∆PR AUC, is used: ∆PR AUC(t_i) = PR AUC(t_i) − Σt_i / |t_i|, where t_i denotes a particular task or bioassay within the total set of i tasks, Σt_i represents the sum of all activity endpoints for a given task (i.e. the number of active molecules), and |t_i| corresponds to the size of the task. This metric is sensitive to the balance between classes, allowing for straightforward comparisons with a baseline benchmark, since the performance of a random classifier reflects the percentage of positive endpoints.S1 This metric is also used in the FS-Mol benchmark.S1

Conformal prediction
In our study, we apply the conformal prediction method using the LightGBM classifier.S26,S32 This technique involves a two-step process: calibration via cross-validation on the training data (5-fold), because no calibration set is provided, and prediction on the test data.S33 The GBM predictions are then analysed with the confidence level set to ϵ = 0.80.
This method is valuable in providing both predictive outputs and insights into the certainty of each prediction.S32
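A minimal sketch of the conformal step, assuming the standard inductive formulation in which a p-value is computed from calibration nonconformity scores and compared against the significance level 1 − ϵ (the scoring and acceptance rule here are generic assumptions, not the exact TwinBooster implementation):

```python
def conformal_p_value(calibration_scores, test_score):
    """Inductive conformal p-value: fraction of calibration nonconformity
    scores at least as large as the test score, with +1 smoothing."""
    n_ge = sum(1 for s in calibration_scores if s >= test_score)
    return (n_ge + 1) / (len(calibration_scores) + 1)

def predict_with_confidence(calibration_scores, test_score, epsilon=0.80):
    """Accept the prediction at confidence level epsilon if the p-value
    exceeds the significance level 1 - epsilon."""
    return conformal_p_value(calibration_scores, test_score) > (1 - epsilon)
```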

Statistical testing
Whenever possible, the two-sided Wilcoxon signed-rank test is used in this study due to the non-parametric nature of the populations.S34 When comparing means and standard deviations, the Welch t-test is used.S34 The Mann-Whitney U test is performed when assessing whether the Tanimoto similarities of two independent samples come from the same population.S34 A correction for multiple testing is applied where necessary; the Benjamini-Hochberg multiple test correction is used for performance comparisons.S35 The seed is set to 42 where possible, or to the number of the performed replicate (seeds := [0 .. n_r[). 10 replicates per model are performed.
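The Benjamini-Hochberg adjustment can be sketched in a few lines as the standard step-up procedure returning adjusted p-values (a generic implementation, not the study's code):

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values: for sorted p-values,
    p_adj(rank) = min over ranks >= rank of p * m / rank, capped at 1."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```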

Fine-tuned LLM
The augmented version is referred to in the main text as "PubChemDeBERTa". The non-augmented version achieved a perplexity of 2.32, while with augmentation the perplexity was reduced to 1.52.

Zero-shot benchmark
The zero-shot performances of 10 replicates are averaged and compared with baselines from FS-Mol.S1 The p-values are shown in table S7.‡ Only means and standard deviations are given, not per-task performances. The comparison on FS-Mol yields p ≃ 0.0020 for ROC AUC, PR AUC and ∆PR AUC on a Wilcoxon test.S34 Significance is indicated at p < α, where α = 0.05.
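The ∆PR AUC metric used in these comparisons can be sketched library-free, computing PR AUC as average precision and subtracting the fraction of actives (the random-classifier baseline):

```python
def average_precision(y_true, scores):
    """PR AUC estimated as average precision over the ranked predictions."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    tp, precisions = 0, []
    for i, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            precisions.append(tp / i)  # precision at each retrieved active
    return sum(precisions) / tp if tp else 0.0

def delta_pr_auc(y_true, scores):
    """∆PR AUC: PR AUC minus the fraction of actives, so a random
    classifier scores ~0 regardless of the class balance of the task."""
    return average_precision(y_true, scores) - sum(y_true) / len(y_true)
```

A perfect ranking of a balanced task gives ∆PR AUC = 0.5, matching the interpretation that the random baseline equals the positive rate.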
The performance across the different protein super-families of the FS-Mol test set is reported in table S8. We would also like to add that the predictions made by TwinBooster should not be interpreted as a dose/concentration, but as a zero-shot classification.

Exploring the challenges of interfering compounds in HTS
We performed a correlation analysis between the predictions from TwinBooster and the frequent hitter probability according to HitDexter 3.S6 The mean Pearson correlation coefficient S36 is ρ ≃ −0.0353 for the target-based assay dataset model, ρ ≃ 0.0536 for the cell-based model and ρ ≃ 0.0890 for the extended cell-based model, indicating no correlation in each case.
In addition, GSK and REOS structural alerts are used to identify possible molecular artefacts.S5 A ROC AUC ≃ 0.5179 (where ROC AUC = 0.5 denotes random performance) shows that the predictions do not correlate with structural alerts; the same conclusion can be drawn from fig. S2, where no trend towards a class can be observed. As a further ablation, we built a model trained only on the kinase information contained in the FS-Mol training set. TwinBooster significantly outperforms the kinase-only RF model on the non-kinase test tasks of FS-Mol (refer to fig. S3). The kinase-only model performs only slightly above random, with a ∆PR AUC = 5.33 ± 8.17 %, in contrast to TwinBooster with a ∆PR AUC = 14.42 ± 1.04 %.

Ablation study
The LSA model S15 is pretrained on the PubChem corpus. LSA is chosen because of its non-LLM nature and its use in the literature.S16 This process is carried out in a manner similar to the approach used in Seidl et al., using the Python classes TfidfVectorizer and TruncatedSVD.S7,S16 The resulting text embeddings for each bioassay have a dimensionality of 355.
In parallel, the hyperparameter optimisation for the GBM is performed in an analogous way as described in the main text.
The p-values of the statistical tests using the Wilcoxon test (table S9) with Benjamini-Hochberg multiple test correction are provided in table S10.S35

Conformal prediction
A comparison of zero-shot performances with or without conformal prediction on FS-Mol is shown in table S11.
Table S11: Comparing zero-shot performances with or without conformal prediction on FS-Mol. The confidence level is set to ϵ = 0.80. 10 replicates each are performed. Results that are both the best and statistically significant (Wilcoxon test,S34 α = 0.05) are highlighted in bold.

Prompt engineering
To analyse the contributions of the different parts of the textual description, we analyse AID 507074, which is included in the FS-Mol test set. The PubChem entry from which our textual data originates contains the title and description (sentences 1-8).S3 The methods information is excluded there, but can be found in the corresponding publication (sentences 9-12).S37 Both text sources are combined and performance is assessed for each trimmed text description, starting with the first sentence and adding one sentence at a time. We colour the text according to ∆PR AUC values, so the change in contribution can be observed while reading fig. S4.
(1) Inhibition of recombinant PI3Kdelta by radioactive phosphotransfer assay in presence of 10 uM ATP. (2) The clinical success of multitargeted kinase inhibitors has stimulated efforts to identify promiscuous drugs with optimal selectivity profiles. (3) It remains unclear to what extent such drugs can be rationally designed, particularly for combinations of targets that are structurally divergent. (4) Here we report the systematic discovery of molecules that potently inhibit both tyrosine kinases and phosphatidylinositol-3-OH kinases, two protein families that are among the most intensely pursued cancer drug targets. (5) Through iterative chemical synthesis, X-ray crystallography and kinome-level biochemical profiling, we identified compounds that inhibit a spectrum of new target combinations in these two families. (6) Crystal structures revealed that the dual selectivity of these molecules is controlled by a hydrophobic pocket conserved in both enzyme classes and accessible through a rotatable bond in the drug skeleton. (7) We show that one compound, PP121, blocks the proliferation of tumor cells by direct inhibition of oncogenic tyrosine kinases and phosphatidylinositol-3-OH kinases. (8) These molecules demonstrate the feasibility of accessing a chemical space that intersects two families of oncogenes. (9) For IC50 value determinations, purified kinase domains were incubated with inhibitors at two-fold or four-fold dilutions over a concentration range of 50 µM to 0.001 µM or with vehicle (0.1% DMSO) in the presence of 10 µM ATP, 2.5 µCi of [γ-32P]ATP and substrate. (10) Reactions were terminated by spotting onto nitrocellulose or phosphocellulose membranes, depending on the substrate; this membrane was then washed five or six times to remove unbound radioactivity and dried. (11) Transferred radioactivity was quantitated by phosphorimaging, and IC50 values were calculated by fitting the data to a sigmoidal dose-response using Prism (GraphPad). (12) Single concentration kinome profiling was performed using the Invitrogen SelectScreen assay.

Figure S4: Text contribution according to the zero-shot performance of AID 507074. The lower the value, the lighter the colour (min(∆PR AUC) = 0.0879); the higher the value, the darker the colour (max(∆PR AUC) = 0.3824). Sentences 1-8 are found in the PubChem entry and 9-12 are added from the publication.S37 In general, providing the model with more textual detail helps to improve performance.
It can be observed that sentences containing information that is rarely present in the context of PubChem assays, such as information about crystal structures (e.g. sentence 6), lead to a decrease in performance. On the contrary, experimental details such as substrate, cell line and conditions seem to increase performance (sentences 7-9). After sentence 10, performance seems to plateau. This could be explained by the token length of the fine-tuned LLM: when a string contains more than 128 words, the overhanging tokens are mean-averaged in the embedding space, potentially reducing the information carried by the text embeddings. We reach the token length after sentence 6 and a second time after sentence 10, suggesting that text descriptions over 256 words may not be beneficial for predictive performance.
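A sketch of the described mean-averaging of overhanging tokens, assuming embeddings are computed per 128-token window and the window vectors are then averaged (the exact windowing in the fine-tuned LLM is not specified, so this is illustrative):

```python
def pooled_embedding(token_embeddings, window=128):
    """Mean-pool each window of token vectors, then average the window
    means. Long texts therefore dilute the contribution of any single
    sentence, consistent with the plateau observed after the token limit.
    Each token 'embedding' here is a plain list of floats."""
    dim = len(token_embeddings[0])
    windows = [token_embeddings[i:i + window]
               for i in range(0, len(token_embeddings), window)]
    window_means = [
        [sum(tok[d] for tok in w) / len(w) for d in range(dim)]
        for w in windows
    ]
    return [sum(wm[d] for wm in window_means) / len(window_means)
            for d in range(dim)]
```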
The order of the sentences does not contribute to the change in performance, which can be explained by the training procedure of the LLM, since the augmentation is performed by shuffling the sentences. Due to the computational cost of predicting performance for all text permutations, this experiment was not conducted (12! ≃ 480 × 10^6 combinations).

Primary screen
The title, description and protocol used for the zero-shot prediction of the primary screen (AID 2732 §) are based on the PubChem entry: S3 HTS for small molecule inhibitors of CHOP to regulate the unfolded protein response to ER stress. Many genetic and environmental diseases result from defective protein folding within the secretory pathway so that aberrantly folded proteins are recognized by the cellular surveillance system and retained within the endoplasmic reticulum (ER). Under conditions of malfolded protein accumulation, the cell activates the Unfolded Protein Response (UPR) to clear the malfolded proteins, and if unsuccessful, initiates a cell death response. Preliminary studies have shown that CHOP is a crucial factor in the apoptotic arm of the UPR; XBP1 activates genes encoding ER protein chaperones and thereby mediates the adaptive UPR response to increase clearance of malfolded proteins. Inhibition of CHOP is hypothesized to enhance survival by preventing UPR programmed cell death. There are currently no known small molecule CHOP inhibitors either for laboratory or clinical use. To identify small molecule inhibitors of the UPR pathway, mediated by CHOP, a cell-based luciferase reporter assay using stably transfected CHO-K1 cells with luciferase driven by the CHOP promoter has been developed. The assay has been optimized and validated in 384-well format and used to screen for inhibitors of tunicamycin-induced CHOP in HTS.

§ https://pubchem.ncbi.nlm.nih.gov/bioassay/2732

Case study
These identified compounds will have potential therapeutic application to diverse disease states ranging from diabetes, Alzheimer's disease, and Parkinson's disease, to hemophilia, lysosomal storage diseases, and alpha-1 antitrypsin deficiency.
Investigating molecular structures
We have examined various molecular structures from our selected case study in fig. S6. As can be seen, the predicted top active molecules are indeed TPs and were also correctly prioritised in the primary screen. Structurally, the molecules differ significantly and are not derivatives of each other. This is also evident in fig. S5, where TwinBooster shows that scaffold diversity is enriched compared to random scaffold selection.

Figure S1 :
Figure S1: Normalised ratios of annotated tasks contained in the splits of FS-Mol according to their protein super family.The dagger denotes a missing protein family.

Figure S2 :
Figure S2: Number of molecules, according to their predicted class, with either no structural alerts or one or more structural alerts. The class threshold is set to the median of the predictions.

Figure S3 :
Figure S3: Performance of TwinBooster compared to a kinase-only RF model on the non-kinase test tasks of FS-Mol. We performed the Wilcoxon test.S34 Significance is indicated at p < α, where p ≃ 6.1 × 10^−5 and α = 0.05.

Figure S5
Figure S5 shows an enrichment of the number of unique Murcko scaffolds identified by TwinBooster and their proportional representation from the primary screen. This indicates that TwinBooster does not prioritise certain scaffolds, but enriches scaffold diversity compared to random scaffold selection. Highlighting the importance of scaffold variability in early drug development, this increase in diversity is critical to the identification of potential leads. The Mann-Whitney U test S34 is conducted to assess the difference in Tanimoto similarity scores and reveals p ≃ 3.5 × 10^−48.

Figure S5 :
Figure S5: Number of detected unique Murcko scaffolds in relation to the percentage of scaffolds from the primary screen.

Table S3 :
Barlow Twins hyperparameters. The set of parameters is listed and the best are highlighted in bold.
Optimisation is set to 200 trials. The evaluation is performed by assessing the precision recall area under curve (PR AUC) and the receiver operating characteristic area under curve (ROC AUC) on the "valid" data and the remaining 20 % of the "train" data of the FS-Mol benchmark. SMAC3's multi-fidelity implementation is used, with the budget parameter represented by the n estimators parameter of LightGBM.S26,S27

Table S6 :
Comparing zero-shot model performances across different metrics on FS-Mol. In zero-shot mode, no "test" molecules are provided. Results that are both the best and statistically significant (Welch's t-test,S34 α = 0.05, with Benjamini-Hochberg S35 multiple test correction) are highlighted in bold.

Table S7 :
All p-values are from pairwise Wilcoxon tests against the FS-Mol baselines, with Benjamini-Hochberg correction for multiple testing.S35 Only the means of PR AUC and ∆PR AUC per task are used for the Wilcoxon test. Significance is indicated at p_corr < α, where α = 0.05.

Table S8 :
Average performance of TwinBooster per protein super-family present in the test set of FS-Mol.

Table S9 :
Performance of different ablation experiments on FS-Mol. All results are tested pairwise using the Wilcoxon test with Benjamini-Hochberg multiple test correction.S34,S35 10 replicates each are performed. Significance in the Wilcoxon S34 test is indicated at α = 0.05 for all results except those in italics, which are not significant. Results that are both the best and statistically significant are highlighted in bold.

Table S10 :
All p-values are from pairwise Wilcoxon tests with Benjamini-Hochberg multiple test correction of the different ablation experiments on FS-Mol.S35 10 replicates each are performed. Significance is indicated at p_corr < α, where α = 0.05. If p_corr ≥ α, values are shown in italics. As an additional experiment, we explored one-hot encoding of the assay type and protein family. It should be noted, however, that one-hot encoding faces several challenges in the context of zero-shot molecular property prediction: the model can only predict for targets seen during training, as unseen targets cannot be encoded in an informative manner. Another requirement is the need for labelled tabular data, which is generally scarce in drug discovery. The one-hot encoding can only use the information from 33 % of all FS-Mol tasks, as only these are annotated. This is why the kinase-only model should not be directly compared with the above results. The performance is ∆PR AUC = 19.20 ± 0.09 %. Therefore, the one-hot-encoding-based model is outperformed by the zero-shot-capable baseline, where LSA text embeddings are added (Welch's t-test,S34 p < α, where p ≃ 2.3 × 10^−7 and α = 0.05).
TwinBooster and the LSA text embedding baselines outperform the one-hot encoded model; they can use more training data, because they do not require tabular annotation, and they can perform zero-shot predictions.

Table S12 :
Performance metrics of TwinBooster on the primary screen with or without conformal prediction.The confidence level is set to ϵ = 0.80.