GPCR-BERT: Interpreting Sequential Design of G Protein-Coupled Receptors Using Protein Language Models

With the rise of transformers and large language models (LLMs) in chemistry and biology, new avenues for the design and understanding of therapeutics have been opened up to the scientific community. Protein sequences can be modeled as language and can take advantage of recent advances in LLMs, especially given the abundance of accessible protein sequence data sets. In this letter, we developed the GPCR-BERT model for understanding the sequential design of G protein-coupled receptors (GPCRs). GPCRs are the target of over one-third of Food and Drug Administration-approved pharmaceuticals, yet there is a lack of comprehensive understanding regarding the relationships among amino acid sequence, ligand selectivity, and conformational motifs (such as NPxxY, CWxP, and E/DRY). By utilizing a pretrained protein model (Prot-Bert) and fine-tuning it on prediction tasks for variations within the motifs, we were able to shed light on several relationships between residues in the binding pocket and some of the conserved motifs. To achieve this, we interpreted the attention weights and hidden states of the model to extract the extent of the contributions of amino acids in dictating the type of the masked ones. The fine-tuned models demonstrated high accuracy in predicting hidden residues within the motifs. In addition, an analysis of the embeddings was performed over 3D structures to elucidate the higher-order interactions within the conformations of the receptors.


Introduction
Understanding the fundamental function of proteins, the most important molecules of life, enables the development of improved therapeutics for diseases.[13][14][15] Bioinformatics protein sequence models, for example, take advantage of co-evolution and a vast amount of proteomics and genomics data to enable prediction of structure or function. However, deciphering the complex relationships between the elements of sequences (amino acids) and their functions remains a challenge, as bioinformatics models are not sophisticated enough to model higher-order interactions within amino acid sequences.[24][25] There has been comparatively little focus on structure-agnostic AI bioinformatics models that predict the higher-order interactions between amino acids using sequences alone.[27][28] LLMs can take advantage of the Transformer architecture and learn higher-order relations in text.21,29,30

In this study, we leveraged recent advances of LLMs in protein modeling and developed a model for interpreting G protein-coupled receptor (GPCR) sequences. We developed GPCR-BERT, which investigates the higher-order interactions in GPCRs.[47][48][49][50][51] Although each receptor has its own amino acid sequence, a few motifs are conserved among them, such as NPxxY, CWxP, and E/DRY. Each conserved motif serves as a crucial component of the GPCRs,[52][53][54] and the type of each x can vary according to the receptor class (for instance, it varies between E and D in the E/DRY motif). We posed critical questions regarding the sequence design of GPCRs, including: (1) What is the correlation between the variations (xs) in the conserved regions of GPCRs (such as xx in the NPxxY motif) and the other amino acids in the sequence? (2) Can we predict the whole amino acid sequence of a GPCR given a partial sequence? (3) Can we find the amino acids that contribute most to others and may play key roles in GPCR conformational changes? To address these questions, we focused on three conserved motifs (NPxxY, CWxP, and E/DRY) in different types of receptors. We defined tasks to predict the x in these motifs given the rest of the sequence, and to quantify their correlations to other amino acids within the GPCR sequences using protein language models.

The emergence of Transformers has enabled machine learning (ML) models to effectively capture long-range dependencies between words and gain a deeper comprehension of language syntax.21,22 In addition, the attention mechanism enables the model to weigh the importance of each element based on its contextual relationships within the sequence.22 Prot-Bert, a pre-trained variant of Bidirectional Encoder Representations from Transformers (BERT), was employed for this research and fine-tuned with tokenized amino acid sequences. This approach facilitated the model's learning of the intrinsic patterns present within GPCR sequences.21,55 Furthermore, we interpreted the attention weights and hidden states of the fine-tuned model to extract the extent of the contributions of other amino acids in determining the type of the masked x amino acids in the conserved motifs. Investigating these contributions may assist us in identifying the amino acids correlated to the function of the conserved sequences; such correlated amino acids can serve as potential candidates for mutation studies and aid in generating new protein structures in protein engineering.

Methods
The overall architecture of GPCR-BERT is displayed in Fig. 1(a). Prot-Bert,55 a transformer-based model with 16 attention heads and 30 layers, was used as the pre-trained model.
Prot-Bert is pre-trained on a massive protein sequence corpus, UniRef100,56 which contains over 217 million unique sequences. A variant of the original BERT21 developed by Google, this architecture is derived from the encoder segment of the Transformer model and is comprised of multiple sequential layers of attention and feed-forward networks (Fig. 1(b)). Within the attention structure, each token is encoded into an input embedding, which is converted into keys, queries, and values. Keys and queries are combined via matrix multiplication to form the attention map, which is subsequently passed through a softmax57 function to generate a probability distribution. Following this process, the resulting distribution is used to scale (multiply) the value vectors. The feed-forward layer within each Transformer layer facilitates the learning of intricate patterns embedded in the input, whereas the attention mechanism is responsible for understanding and encoding the relationships among the various tokens. The multi-head attention structure divides the input across multiple parallel attention layers, or "heads". This setup enables each head to independently learn and specialize in capturing different types of patterns and relationships.22 This structure of the Transformer encoder in Prot-Bert enables the model to learn context-aware representations of amino acids in a protein sequence by considering each sequence as a "document". The model has been trained with a Masked Language Model (MLM) objective, a training method in which some percentage of the input tokens are masked at random and the model must predict those masked tokens from their context. One of the biggest advantages of Prot-Bert over some other pre-trained models (such as Prot-XL and Prot-Albert) is that its performance has exceeded these models on several benchmarks.55 For the fine-tuning process, a series of experiments was performed to identify the most suitable architecture for the regression head. As a result, a design of three fully connected layers was selected.
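To make the attention computation described above concrete, the following is a minimal sketch of single-head scaled dot-product attention in PyTorch. The dimensions (1024-dimensional embeddings, sequences up to 370 residues) follow the numbers quoted elsewhere in this paper, while the weight matrices and names are illustrative rather than taken from the GPCR-BERT code.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) token embeddings for one sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # project tokens into queries, keys, values
    scores = q @ k.T / (k.shape[-1] ** 0.5)        # pairwise token-token similarity
    weights = F.softmax(scores, dim=-1)            # each row becomes a probability distribution
    return weights @ v, weights                    # values scaled by the distribution + attention map

d_model, d_head, seq_len = 1024, 64, 370
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
out, attn = scaled_dot_product_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)                       # torch.Size([370, 64]) torch.Size([370, 370])
```

In multi-head attention this computation is simply repeated in parallel with separate projection matrices per head, and the per-head outputs are concatenated before the feed-forward layer.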

Data preprocessing
The GPCR sequence dataset used for fine-tuning Prot-Bert was obtained from the GPCRdb database.58 GPCRdb serves as a comprehensive resource of valuable experimental data on G protein-coupled receptors (GPCRs), including details on GPCR binding, configurations, and signal transduction.[59][60] From GPCRdb, all class A GPCR sequences that contain either the NPxxY, CWxP, or E/DRY motif were extracted, yielding a dataset of 293 protein sequences. Some of the extracted protein sequences were incomplete, containing missing residues denoted as 'X', which were especially prevalent in relatively long sequences. The aim was to remove sequences containing excessive missing residues and to avoid inefficient padding in the tokenization process. Considering the distribution of sequence lengths (Fig. S1), we removed sequences longer than 370 residues, which accounted for 13.3% of the initial data. This filtering process led to a dataset of 254 class A GPCR sequences distributed across 62 different receptor classes (Table S1). Although the three motifs are highly conserved, not every class A GPCR contains them; for example, residue C in the CWxP motif is conserved in only 71% of class A GPCRs.61 Thus, the initial dataset from GPCRdb was refined into three separate datasets, each including only the sequences that embody the corresponding conserved motif. This process yielded datasets of 238 sequences for the NPxxY prediction task, 168 for the CWxP task, and 212 for the E/DRY task.
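As a rough illustration of this preprocessing, the sketch below filters out sequences longer than 370 residues and splits the remainder into per-motif subsets. The FASTA file name and the motif regular expressions are assumptions made for the example, not the exact GPCRdb export or code used in this work.

```python
import re

MAX_LEN = 370
MOTIFS = {
    "NPxxY": re.compile(r"NP..Y"),
    "CWxP":  re.compile(r"CW.P"),
    "E/DRY": re.compile(r"[ED]RY"),
}

def load_fasta(path):
    """Very small FASTA reader returning {id: sequence}."""
    seqs, name = {}, None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:]
                seqs[name] = ""
            elif name:
                seqs[name] += line
    return seqs

sequences = load_fasta("class_a_gpcrs.fasta")                       # hypothetical export file
sequences = {k: s for k, s in sequences.items() if len(s) <= MAX_LEN}
datasets = {m: {k: s for k, s in sequences.items() if p.search(s)}  # one subset per motif
            for m, p in MOTIFS.items()}
for motif, subset in datasets.items():
    print(motif, len(subset))
```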
In the tokenization process, we searched through each sequence for the location of the conserved motifs, ensuring their proper localization within specific regions: NPxxY in transmembrane helix (TM) 7, CWxP in TM6, and E/DRY in TM3. The 'x' residue earmarked for prediction (for example, xx in NPxxY) was substituted with 'J', an alphabet character absent from the Prot-Bert tokenizer vocabulary. Simultaneously, label sequences were assembled, each retaining the ground-truth amino acid while substituting all other residues with 'J'. During tokenization, the 'J' characters in the input sequences were replaced with the [MASK] token. Utilizing the Prot-Bert tokenizer, a [CLS] token was inserted at the beginning of each sequence, while sequences falling short of the maximum input length were padded using the [PAD] token. Both the input and label sequences were then translated into the corresponding integers as dictated by the tokenizer vocabulary.
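A minimal sketch of this masking and tokenization step is shown below, using the publicly released Rostlab/prot_bert tokenizer (which expects space-separated residues). The toy sequence, the motif regular expression, and the helper name are illustrative; the paper's localization of motifs to TM3/TM6/TM7 is not reproduced here.

```python
import re
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)

def mask_motif_x(seq, motif_regex=r"NP..Y", mask_offsets=(2, 3)):
    """Replace the x positions of the first motif match with 'J' (later mapped to [MASK])."""
    match = re.search(motif_regex, seq)
    masked = list(seq)
    for off in mask_offsets:
        masked[match.start() + off] = "J"          # 'J' is absent from the tokenizer vocabulary
    return "".join(masked)

toy_seq = "MGKTWVFGNPIIYCRSPDFR"                    # toy sequence containing an NPxxY-type motif
masked = mask_motif_x(toy_seq)
text = " ".join(masked).replace("J", "[MASK]")      # space-separated residues, 'J' -> [MASK]
inputs = tokenizer(text, padding="max_length", max_length=372, return_tensors="pt")
print(inputs["input_ids"].shape)                    # torch.Size([1, 372]) after [CLS]/[SEP]/[PAD]
```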

Prediction tasks of conserved motifs
The curated datasets were partitioned into training and testing subsets at a ratio of 0.75:0.25 and were subsequently fed into the Prot-Bert architecture complemented by a regression head. A series of experiments was performed to identify the most suitable architecture for this head (see Methods), and the resulting fine-tuning performance is summarized in Table 1 and Fig. 2. The NPxxY and E/DRY tasks demonstrated exceptional performance, achieving nearly perfect accuracy with minimal error. In comparison, the performance on the CWxP task was lower, yet within an acceptable range of accuracy. This relative disparity in performance, particularly for the CWxP task, is assumed to be attributable to the smaller dataset size compared to the other two tasks.
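The sketch below outlines, under stated assumptions, how such a fine-tuning loop could look with the public Prot-Bert checkpoint: a 0.75/0.25 split and a masked-token objective where the loss is computed only at the masked positions. BertForMaskedLM stands in for the paper's custom three-fully-connected-layer head purely for brevity, and the toy data, learning rate, and variable names are illustrative.

```python
import torch
from sklearn.model_selection import train_test_split
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# (masked sequence, ground-truth sequence) pairs built as in the tokenization step above.
pairs = [("M G [MASK] [MASK] Y I V", "M G T A Y I V")] * 8        # toy data
train_pairs, test_pairs = train_test_split(pairs, train_size=0.75, random_state=0)

model.train()
for epoch in range(30):                                           # 30 epochs, as in the paper
    for masked_seq, true_seq in train_pairs:
        inputs = tokenizer(masked_seq, return_tensors="pt")
        labels = tokenizer(true_seq, return_tensors="pt")["input_ids"].clone()
        labels[inputs["input_ids"] != tokenizer.mask_token_id] = -100   # score only masked tokens
        loss = model(**inputs, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```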

Prediction tasks of masked sequences
To evaluate GPCR-BERT's capability to predict parts of the sequence other than the conserved motifs, and to find out whether a complete GPCR sequence can be inferred from partial information, we tested the model with several masked-sequence prediction tasks, as shown in Table 2. The dataset containing all 254 GPCR sequences was used, since these tasks are not tied to a particular conserved motif. For each GPCR sequence, the residues for each task were masked consecutively with 'J' tokens starting from sequence index 100, a randomly selected position. The number of masked residues was increased from 5 to 10, 15, and 50 to further examine the model's prediction capability. The 5-masked prediction task shows the highest prediction accuracy of 83.07%, while the 10-, 15-, and 50-masked tasks show similar accuracies of around 75%. This decent prediction accuracy indicates that GPCR-BERT is capable of predicting a large number of residues, not just the small number of residues within the conserved motifs.
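As a sketch of this task setup (with illustrative helper names and a toy sequence), the snippet below masks a fixed-length span starting at position 100 and scores how many of the masked residues are recovered.

```python
def mask_span(seq, start=100, n_masked=5):
    """Replace n_masked consecutive residues with 'J' placeholders; return masked seq + truth."""
    chars = list(seq)
    truth = "".join(chars[start:start + n_masked])      # ground truth kept for scoring
    chars[start:start + n_masked] = ["J"] * len(truth)
    return "".join(chars), truth

def span_accuracy(predicted, truth):
    """Fraction of masked positions predicted correctly."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

toy_seq = "A" * 100 + "MKTAYIVLLF" + "A" * 200           # toy sequence, not a real GPCR
masked_seq, truth = mask_span(toy_seq, n_masked=5)
print(truth, span_accuracy("MKTAY", truth))              # MKTAY 1.0
```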

Comparison with other models
To compare GPCR-BERT's performance with other ML models, we applied several prediction tasks to the original BERT and to a Support Vector Machine (SVM). The comparison with BERT allows us to investigate the effect of leveraging an encoder pre-trained on protein sequences as opposed to one trained on a generic English corpus. Testing the SVM allows us to investigate how simpler ML models perform when trained with limited protein sequence data. SVM is a popular supervised learning algorithm known for its high accuracy, its ability to handle high-dimensional data, and its versatility in modeling both linear and nonlinear relationships. SVM works by finding a hyperplane that best divides the dataset into classes; its main objective is to maximize the margin between the data points and the separating hyperplane, and the best hyperplane is the one for which this margin is maximized.64 The datasets used for all models have the same train/test split ratio and random state.
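For context, a baseline along these lines could be set up as in the sketch below: one-hot encoded sequences fed to a linear-kernel SVC with the same 0.75/0.25 split. The feature encoding, kernel choice, and toy data are assumptions made for illustration, not the paper's exact configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq, max_len=370):
    """Flattened, zero-padded one-hot encoding of a sequence."""
    mat = np.zeros((max_len, len(AMINO_ACIDS)))
    for i, aa in enumerate(seq[:max_len]):
        if aa in AA_INDEX:
            mat[i, AA_INDEX[aa]] = 1.0
    return mat.ravel()

# X: encoded context sequences, y: identity of the masked residue (toy data here).
rng = np.random.default_rng(0)
X = np.stack([one_hot("".join(rng.choice(list(AMINO_ACIDS), size=300))) for _ in range(40)])
y = rng.choice(list("ILV"), size=40)                     # toy labels standing in for the masked x
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.75, random_state=0)
clf = SVC(kernel="linear").fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```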
As shown in the results (Table 3), BERT was able to predict quite well (83% accuracy) on the relatively simple E/DRY task, but its accuracy decreased significantly for more intricate tasks such as NPxxY and the 5-masked prediction. From this, we can confirm the advantage of utilizing models pre-trained on protein sequence corpora for the interpretation of protein sequences. Interestingly, although its performance was not comparable to that of GPCR-BERT, the SVM showed better results than the original BERT. This suggests that BERT's pre-training on English text may have adversely affected its learning of protein sequences.

Discussion

Interpretability
The attention mechanism of the Transformer model facilitates the capturing of global sequence information within each token embedding in the encoder. Yet, in practical applications, the classification token, known as the [CLS] token, often serves as the aggregate representation of the entire sequence.65,66 The BERT tokenizer introduces the [CLS] token at the beginning of every sequence, which is then employed by BERT to assign the sequence to specific classes. Thus, the [CLS] token encapsulates information from all output embeddings. Using this insight, we aimed to verify whether GPCR-BERT effectively recognizes and differentiates among various GPCR types. To this end, we extracted the [CLS] token of each GPCR sequence from the final hidden state, which has 1024 nodes, and visualized them with t-distributed Stochastic Neighbor Embedding (t-SNE).67 t-SNE executes dimensionality reduction by determining pairwise similarities in the high-dimensional space and constructing a Gaussian joint probability distribution. Correspondingly, a similar probability distribution is defined in the low-dimensional space using a Student's t-distribution. The algorithm then seeks to minimize the Kullback-Leibler divergence between these two distributions. This minimization process inherently attracts similar data points toward each other while repelling dissimilar points, thereby enabling the clustering of similar data points. The results are shown in Fig. 3(a).

Interpreting the attention weights allows us to examine how GPCR-BERT is able to understand the grammar of GPCR sequences. Within the hierarchical structure of multiple attention layers in the encoder, the terminal layer is most likely to capture advanced semantic information in the context of natural language processing tasks, and analogous structural or binding information for protein sequences.21,68 Thus, the 16 attention heads of the last layer of GPCR-BERT are extracted and visualized through heatmaps (Fig. 3(b)). High attention weights are indicative of a greater degree of correlation between the corresponding residues in the input sequence. These elevated attention weights highlight specific regions of the sequence that the model perceives as containing valuable information. Attention allows every token in the input sequence to have a direct connection with all other tokens in the sequence, thereby facilitating the modeling of inter-dependencies between tokens irrespective of their relative positions.22 In addition, the softmax function of the attention enables the attention weights to serve as a probability distribution, with the sum of each row being 1. Thus, the attention weights of GPCR-BERT represent the relative correlation of the corresponding residue with the others. To scrutinize the interrelationships between the conserved motifs and the remaining sequence in GPCRs, the highest five weights associated with the variant residues within the conserved motifs were examined. In the context of the CWxP prediction downstream task, for instance, the attention weight heatmaps of the human β2 adreno receptor (4GBR) reveal clear residue-residue dependencies (Fig. 3(b); see Fig. S3 and Fig. S4 in the Supporting Information for heatmaps of all 16 heads), suggesting that GPCR-BERT was successful in learning representations that underscore the interrelations of tokens within the sequence. As shown in Fig. 3(b), the padding region (far right) received an attention weight of 0, while noticeably strong attention weights, visualized in dark colors, were detected elsewhere. This indicates that GPCR-BERT identified these residues as being more significantly correlated than others. This pattern was consistently observed across several heads (2, 3, 6, 8, 12, 16) of 4GBR and extended to other receptor classes of GPCRs. Notably, this pattern was particularly prevalent in heads 1 and 2, where the attention weights exceeded 0.1 in the corresponding region. Therefore, an analysis of these patterns is expected to yield valuable insights into the conserved motifs.
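A hedged sketch of this analysis pipeline is given below: the [CLS] vector is taken from the final hidden state for t-SNE, and the attentions of the last layer are collected for the heatmaps. The public Rostlab/prot_bert checkpoint and the toy sequences are assumptions standing in for the fine-tuned GPCR-BERT weights and the 254 GPCR sequences.

```python
import torch
from sklearn.manifold import TSNE
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert", output_attentions=True)
model.eval()

# Toy space-separated sequences; in practice these would be the curated GPCR sequences.
sequences = ["M K T A Y I V", "M N P I I Y V", "M C W L P A A", "M D R Y V A A"]
cls_embeddings, last_layer_attn = [], []
with torch.no_grad():
    for seq in sequences:
        out = model(**tokenizer(seq, return_tensors="pt"))
        cls_embeddings.append(out.last_hidden_state[0, 0])      # 1024-dim [CLS] vector
        last_layer_attn.append(out.attentions[-1][0])           # (16 heads, seq_len, seq_len)

coords = TSNE(n_components=2, perplexity=2).fit_transform(torch.stack(cls_embeddings).numpy())
print(coords.shape, last_layer_attn[0].shape)                   # 2D coordinates + attention maps
```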

Mapping Higher Order Interactions
The fine-tuned GPCR-BERT's representations of GPCRs are depicted via a t-SNE plot (Fig. 3(a)). For the sake of effective visualization, only classes consisting of more than three GPCRs were selected. As evidenced by the t-SNE plot, GPCRs of the same receptor class (depicted by identical color markers) are clustered together, indicating the model's successful classification of GPCRs based solely on their sequence information. This result also suggests that the [CLS] token within the embedding effectively captures the distinguishing characteristics of the individual GPCR classes. The model's ability to differentiate between classes such as the Orexin1 and Orexin2 receptors and the β1/β2 adreno receptors, both of which possess the NPIIY motif, implies that the model takes into account the entire GPCR sequence, rather than merely the conserved motifs, when executing prediction tasks.
Moreover, given that the inter-cluster distances within the t-SNE plot signify the degree of similarity between each cluster, it is evident that the GPCR-BERT model is able to discern the subtle variations among GPCR classes. This capability is particularly notable in the case of distinguishing between the β1 and β2 adreno receptor classes depicted in the lower right corner of Fig. 3(a). A more detailed version of this plot can be found in Fig. S2 (see Supporting Information).
Furthermore, we investigated whether residues with high correlation, other than those adjacent to the conserved motifs, align with experimental mutation findings. To this end, our correlation findings were compared with the mutagenesis data in GPCRdb,58 which contains the accumulated positions of mutated residues and their effects on GPCRs.
Table 4 shows the comparison for the β2 adreno receptor (results for other classes are shown in Tables S2-S4). It can be observed that heads 6 and 10 of GPCR-BERT identified residues related to receptor expression as significant, while heads 11 and 13 discerned the significance of residues linked to thermostabilization. We can also see that the multi-head attention mechanism of Transformer models enables each head to learn distinct patterns, a characteristic that can facilitate protein sequence analysis. Additionally, some highly correlated residues, such as residue E3.26 in head 10, were identified by GPCR-BERT as potentially significant but remain understudied. These insights may provide a foundation for further investigations into GPCR structures and their mechanisms. There are, however, limitations to this study given the minor inconsistency of the conserved motifs in class A GPCRs. As mentioned earlier, residue C in the CWxP motif is conserved in only 71% of class A GPCRs,61 while the NPxxY motif is conserved in 94%; the remaining variants take the NPxxxY (3%) or NPxxF (3%) form.71 Similarly, Y in the E/DRY motif has exceptions as well.72 Since the model is focused on predicting only the general form of these motifs, its understanding of GPCRs might not apply to receptors that carry variant versions of the motifs.

Conclusion
In this study, we explored the higher-order interactions in the sequence design of G protein-coupled receptors (GPCRs) and their role in determining receptor function by taking advantage of language models. We utilized Prot-Bert, a transformer-based language model, and fine-tuned it with tokenized amino acid sequences to predict the non-conserved amino acids (denoted by x) in the conserved motifs NPxxY, CWxP, and E/DRY across various GPCRs. The results showed that the model predicted these residues with high accuracy. The attention weights and hidden states of the model were also analyzed to understand the extent of the contributions of other amino acids in dictating the type of the masked ones. We demonstrated that GPCR-BERT successfully distinguished different GPCR classes based on their sequences and has the potential to reveal functional relationships within protein structures. These findings could also have implications for mutation studies, protein engineering, and drug development, further advancing our understanding of GPCR biology.

Figure 1 :
Figure 1: (a) The overall architecture of GPCR-BERT. GPCR amino acid sequences are tokenized and subsequently processed through Prot-Bert, followed by the regression head. (b) The structure of the Prot-Bert Transformer and the attention layer. The input token embedding is transformed into keys, queries, and values, which subsequently form the attention matrix. The output is passed through a fully connected neural network. This sequence of operations is iterated 30 times to reach the final output embedding of the GPCR sequence. (c) Representation of the top five amino acids most correlated to the first x (red) and second x (blue) in the NPxxY motif within a GPCR, obtained through attention heads. The thickness of the lines represents the strength of the correlations (weights).

Figure 2 :
Figure 2: The test results for the downstream tasks of predicting x in (a) NPxxY, (b) CWxP, and (c) E/DRY. Each task was subjected to training for 30 epochs prior to testing. For the NPxxY task, the prediction results for both masked positions were taken into account.

Figure 3 :
Figure 3: (a) t-SNE visualization of GPCRs. t-SNE was applied to the [CLS] token embedding from the last hidden state of GPCR-BERT to reduce the dimension to 2D. Different receptor classes are denoted by distinct colors, and the clustering of the same GPCR type is evident. (b) Attention weight heatmap of the human β2 adreno receptor (4GBR). The figure provides a visual representation of head 1 of the final attention layer in GPCR-BERT. The labels on the axes correspond to the respective tokens.

Fig. 4 introduces the top five amino acids most correlated to the x residues in the NPxxY and CWxP motifs, as well as to the E/D residue in the E/DRY motif, for a few receptors in the dataset (see Data Availability for the comprehensive list of GPCRs and their correlated residues). The common pattern found for attention, and consequently for correlation, between the NPxxY motif and other residues is that the xx residues mainly attend to their adjacent amino acids (NP). The first x in NPxxY usually attends to the upper region of the receptor in the extracellular segment of the protein. This higher-order interaction possibly demonstrates the connection between the NPxxY motif and the ligand-selective region of the receptor.70 This is significant because the xx residues vary between different types of GPCRs. Our finding elucidates why changes in the GPCR type correlate with the xx residues in NPxxY. Fig. 5 shows structural representations of the top three amino acids most correlated to the xs in the NPxxY (a) and CWxP (b) motifs and to E/D in the E/DRY motif within various receptors. The correlated amino acids are displayed in colors, and the numbers indicate the number of amino acids separating them. As the figure shows, similar types of receptors are grouped together based on the correlated amino acid predictions. As noted earlier, the xx in the NPxxY motif mostly attend to neighboring residues and to some residues at the extracellular part of the receptor that are responsible for ligand selectivity. However, the x in the CWxP motif and the E/D in the E/DRY motif mainly attend to residues in the binding pocket. We can conclude that the message passed through the ligand in the binding pocket may be modulated by the variations in the x and E/D residues within the NPxxY, CWxP, and E/DRY motifs. This finding will help us focus on the sequential design of these motifs in selecting the ligand.
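As a small illustration of how such a ranking can be read off an attention map (names, indices, and the dummy tensor below are hypothetical), the snippet takes the attention row of a masked position in one head and returns its five strongest partners.

```python
import torch

def top_correlated(attn, head, masked_pos, k=5, skip_special=1):
    """Return the k positions this residue attends to most strongly in one head."""
    row = attn[head, masked_pos].clone()
    row[masked_pos] = 0.0                                    # ignore self-attention
    row[:skip_special] = 0.0                                 # ignore the [CLS] token
    return torch.topk(row, k).indices.tolist()

attn = torch.softmax(torch.randn(16, 50, 50), dim=-1)        # dummy (heads, seq_len, seq_len) weights
print(top_correlated(attn, head=0, masked_pos=30))
```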

Figure 5 :
Figure 5: Structural representations of the top three amino acids most correlated to (a) xx in NPxxY, (b) x in CWxP, and (c) E/D in E/DRY motifs within various GPCRs. In the tables, the colors of the correlated amino acids correspond to the colors assigned to the amino acids across the protein, and the numbers denote the number of residues separating them.

Table 1 :
Fine-tuning results for each prediction task. The outcomes pertaining to loss and accuracy for each downstream task are displayed. Each task is averaged over the results of three runs.

Table 2 :
Results of the masked-sequence prediction tasks. Each task is averaged over the results of three runs.

Table 3 :
Test accuracy (%) of each prediction task for different ML models. Each task is averaged over the results of three runs.

Table 4 :
Comparison of GPCR-BERT correlation results with the mutagenesis data in GPCRdb for the β2 adreno receptor. Columns labeled 'head' and 'repeated' display the respective head number that detected the correlation and the frequency of its occurrence. The 'residue' and 'position' columns give the residue and its position according to the Ballesteros-Weinstein (BW) numbering system. The 'Match with mutagenesis data' column indicates whether the identified correlation aligns with experimental mutation results.