Serverless Prediction of Peptide Properties with Recurrent Neural Networks

We present three deep learning sequence-based prediction models for peptide properties, namely hemolysis, solubility, and resistance to nonspecific interactions, that achieve results comparable to state-of-the-art models. Our sequence-based solubility predictor, MahLooL, outperforms the current state-of-the-art methods for short peptides. These models are implemented as a static website without the use of a dedicated server or cloud computing. Web-based models like these allow for accessible and effective reproducibility. Most existing approaches rely on third-party servers that typically require upkeep and maintenance. Our predictive models require no servers, require no installation of dependencies, and work across a range of devices. The underlying architecture is a bidirectional recurrent neural network. This serverless approach is a demonstration of edge machine learning that removes the dependence on cloud providers. The code and models are accessible at https://github.com/ur-whitelab/peptide-dashboard.


INTRODUCTION
Deep learning models have been widely applied to extract information from big data in cheminformatics. Compared to classical machine learning algorithms, deep learning can perform feature extraction and learn patterns over various nonlinear layers of representations of the input data, 1 can explain the vanishing effects of gradients, 2 and can perform better with raw high-dimensional data. 3 There is a growing number of web-based implementations of deep learning frameworks that provide convenient public access and ease of use. 4−9 Notably, many web servers have been developed for sequence design tasks, like analysis of RNA, DNA, or proteins, for example, survival analysis based on mRNA data (GENT2, 10 PROGgeneV2, 11 SurvExpress, 12 MEXPRESS, 13 etc.), studying prognostic implications of noncoding RNA (PROGmiR, 14 SurvMicro, 15 OncoLnc, 16 TANRIC 17 ), survival analysis based on protein (TCPAv3.0, 18 TRGAted 19 ) and DNA (MethSurv, 20 cBioPortal 21 ) data, and multiple areas of assessing cancer therapeutics. 22 These scientific web servers and web-based services make complex inference algorithms available to a much broader user community and promote open science. This is especially important given the disparities between lower- and higher-income nations in the types of research activities that can be performed. 23 Cheminformatics research, the topic of this work, mostly takes place in nations privileged with resource-rich, adequately funded institutions. Yet, web-based implementations can broaden access to these methods.
Beyond disparities among institutions, web-based implementations are also a mechanism for reproducibility in science. In peptides specifically, Melo et al. 24 argue that deep learning sequence design should be accompanied by free public access to (1) the source code, (2) the training and testing data, and (3) the published findings. However, this is often not the case; Littmann et al. 25 found in an analysis of ML research articles in biomedicine and life sciences published between 2011 and 2016 that only 50% released software, while 64% released data. Web-based servers do not fit the exact definition of open science (due to lack of source code access), but they do accomplish the goal of enabling others with broader expertise to build on previous advances, and they are often more accessible and convenient than access to model and source code alone.
Thus, there is a compelling argument to continue building web-based tools. There are, however, two major drawbacks: source code can be inaccessible, as discussed above, and the tools rely on third-party or self-hosted servers. Deep learning inference often requires GPUs, which demand a specialized hosting service or a complex self-hosted setup. This creates significant ongoing expenses, and many tools are thus only available for a limited time after publication. Additionally, there can be little incentive to increase capacity; popular tools, like RoseTTAFold, 26 can have days-long queues. The expense and deployment problems can also create disparities in the impact of research between resource-rich and low-resource institutions, because not all researchers can afford to create web-based implementations.
To address the challenges above, we demonstrate a serverless deep learning web-based tool, https://peptide.bio, that predicts peptide properties using recurrent neural networks (RNNs) on users' local devices. The trained models are implemented in JavaScript and are loaded into a user's web browser. Users make predictions by running these trained models in a web browser on their local machines, or even cell phones, without having to install any modules. They can be run locally as well, if desired. Serverless computing describes a programming model and architecture where small code snippets are executed in the cloud without any control over the resources on which the code runs. 27 This is by no means an indication that there are no servers; it simply means that the developer leaves most operational concerns, such as resource provisioning and scalability, to the cloud provider (the end user in our case). Figure 1 provides a visual demonstration of how our approach contrasts with existing cheminformatics frameworks. John et al. 28 proposed the idea of serverless computing for efficient model utilization on cloud resources without specific constraints on the cloud provider. In this work, however, we seek to fully remove the need for a cloud provider and bypass this conventional dependency. While this is indeed impractical for the resource-intensive training step of deep learning models, trained models are typically cheap to evaluate; thus, inference is robustly feasible even on limited computing resources (i.e., commodity phones and laptops). This removes hosting costs and the conventional dependence on cloud providers or the self-hosting of resource-rich academic institutions. Although we make some compromises here on model size and complexity, we expect the continued improvement of hardware (i.e., Moore's law 29 ) to expand the types of models possible in JavaScript each year.
This serverless approach should accelerate reproducible ML science, narrow the gap between resource-rich universities and the rest, and enable better dissemination of research from a broader community of chemists.
This paper is organized as follows: We start by providing a brief overview of comparable predictive sequence-based models for the classification tasks in this work (hemolysis, solubility, and nonfouling) in Section 1.1. In Section 2, we describe the data sets, the architecture of our deep learning models, and the choices of hyperparameters, as well as a high-level overview of the methods used in previous comparable sequence-based models in the literature. This is followed by evaluating the model in a comparative setting against the state-of-the-art models in Section 3. Finally, we conclude the paper in Section 4 with a discussion of the implications of our findings.
1.1. Previous Work. Quantitative structure−activity relationship (QSAR) modeling is a well-established field of research that aims at mapping sequence and structural properties of chemical compounds to their biological activities. 30 QSAR models have been successfully applied to angiotensin-converting enzyme (ACE)-inhibitory peptides, 31−33 antimicrobial peptides, 34−37 and antioxidant peptides. 38−40 For solubility predictions, DSResSol (1) 41 improved prediction accuracy (ACC) and AUROC to 75.1% and 0.84, respectively, by identifying long-range interaction information between amino acid k-mers with dilated convolutional neural networks, outperforming all existing models such as DeepSol, 42 PaRSnIP, 43 SoluProt, 44 Protein−Sol, 45 and PROSO II. 46 HAPPENN 47 forms the state-of-the-art (SOTA) model for hemolytic activity prediction, with an ACC of 85.7%, and outperforms HemoPI 48 and HemoPred. 49 Hasan et al. 50 developed a two-layer prediction framework, called HLPpred-Fuse, that can distinguish between hemolytic and nonhemolytic peptides, as well as between their low and high activity, with an AUROC of 0.91 (averaged over two reported independent data sets). However, short peptides (<6 amino acid residues) are excluded from their data sets due to the difficulty of capturing meaningful sequence information from shorter peptides.

MATERIALS AND METHODS
2.1. Data Sets. 2.1.1. Hemolysis.
Hemolysis is defined as the disruption of erythrocyte membranes, which decreases the life span of red blood cells and causes the release of hemoglobin. Identifying nonhemolytic antimicrobial peptides is critical to their application as nontoxic and safe treatments against bacterial infections. However, distinguishing between hemolytic and nonhemolytic peptides is complicated, as they primarily exert their activity at the charged surface of the bacterial plasma membrane. Timmons and Hewage 47 differentiate between the two based on whether they are active at the zwitterionic eukaryotic membrane, as well as the anionic prokaryotic membrane. In this work, the model for hemolytic prediction is trained using data from the Database of Antimicrobial Activity and Structure of Peptides (DBAASP v3 51 ). The activity is defined by extrapolating a measurement, assuming dose−response curves, to the point at which 50% of red blood cells (RBC) are lysed. If the activity is below 100 μg/mL, the peptide is considered hemolytic. Each measurement is treated independently, so sequences can appear multiple times. The training data contain 9316 sequences (19.6% positives and 80.4% negatives) of only L- and canonical amino acids. Note that, due to the inherent noise in the experimental data sets used, in some observations (∼40%) an identical sequence appears in both the negative and positive classes. As an example, the sequence "RVKRVWPLVIRTVIAGYNLYRAIKKK" is found to be both hemolytic and nonhemolytic in two different lab experiments (i.e., two training examples).
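As an illustration of this labeling rule, consider the following minimal Python sketch. The function name and signature are ours; only the 100 μg/mL cutoff comes from the text.

```python
def label_hemolytic(hc50_ug_per_ml, threshold=100.0):
    """Label a measurement as hemolytic if the extrapolated concentration
    lysing 50% of red blood cells falls below the cutoff (100 ug/mL)."""
    return hc50_ug_per_ml < threshold

# Each assay measurement is labeled independently, so the same sequence
# can legitimately appear in both classes across different experiments.
```

Because the rule is applied per measurement rather than per sequence, the duplicated-sequence noise described above follows directly from the data, not from the labeling code.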
2.1.2. Solubility. The training data contain 18,453 sequences (47.6% positives and 52.4% negatives) based on data from PROSO II. 46 Solubility was estimated by retrospective analysis of electronic laboratory notebooks. The notebooks were part of a large effort called the Protein Structure Initiative and track sequences linearly through the following stages: Selected, Cloned, Expressed, Soluble, Purified, Crystallized, HSQC (heteronuclear single quantum coherence), Structure, and Deposited in PDB. 52 The peptides were identified as soluble or insoluble as described in ref 46.

2.1.3. Nonfouling. The negative examples come from insoluble and hemolytic peptides, as well as scrambled positives. The scrambled negatives are generated with lengths sampled from the same length range as the corresponding positive set, and residues are sampled from the frequency distribution of the soluble data set. Samples are weighted to account for the class imbalance caused by the data set size for negative examples. A nonfouling peptide (positive example) is defined using the mechanism proposed by White et al. 53 Briefly, White et al. showed that the exterior surfaces of proteins have a significantly different frequency of amino acids, and this difference increases in aggregation-prone environments, like the cytoplasm. Synthesizing self-assembling peptides that follow this amino acid distribution and coating surfaces with these peptides creates nonfouling surfaces. This pattern was also found inside chaperone proteins, another context where resistance to nonspecific interactions is important. 54

2.2. Model Architecture. To identify position-invariant patterns in the peptide sequences, we build a recurrent neural network (RNN), using a sequential model from the Keras framework 55 and the TensorFlow deep learning library back-end. 56 Specifically, the RNN employs bidirectional Long Short-Term Memory (LSTM) networks to capture long-range sequence correlations.
Compared to conventional RNNs, LSTM networks with gate control units (input gate, forget gate, and output gate) can learn dependency information between distant residues within peptide sequences more effectively. 57−59 They can also partly overcome the problem of vanishing or exploding gradients in the backpropagation phase of training conventional RNNs. 60 We use a bidirectional LSTM (bi-LSTM) to enhance the capability of our model to learn bidirectional dependence between N-terminal and C-terminal amino acid residues. An overview of the RNN architecture is shown in Figure 2.
Figure 2. RNN architecture. Fixed-length integer-encoded sequences are first fed to a trainable embedding layer, yielding a semantically more compact representation of the input amino acids. The bidirectional LSTMs and direct inputs of amino acid frequencies prior to the fully connected layers improve the learning of bidirectional dependency between distant residues within a sequence. The fully connected layers are downsized in three consecutive steps with layer normalization and dropout regularization. The final layer uses a sigmoid activation to output a scalar that gives the probability of being active for the desired training task.

Peptide sequences are represented as integer-encoded vectors of shape 200, where the integer at each position in the vector corresponds to the index of the amino acid in the alphabet of the 20 canonical amino acids: [A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V]. The maximum length of a peptide sequence is fixed at 200, and all sequences with greater lengths are excluded. Shorter sequences are zero-padded in the integer-encoded representation to keep the shape fixed at 200 for all examples. Note that this padding is applied primarily in the training step for implementation considerations; the trained model can make predictions on variable-length input sequences. Every integer-encoded peptide sequence is first fed to an embedding layer. The embedding layer converts the indices of discrete symbols (i.e., amino acids) into fixed-length vectors of a defined size. This is beneficial in creating a more compact representation of the input symbols, as well as placing semantically similar symbols close to one another in the vector space. This embedding layer is trainable, and its weights are updated during training along with the other layers of the RNN.
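The encoding step can be sketched in plain Python as follows. This is a simplified stand-in for the actual preprocessing code; reserving index 0 for padding is our assumption, consistent with the embedding input dimension of 21 (alphabet length plus one) used by the model.

```python
# Alphabet of the 20 canonical amino acids, in the order given above.
ALPHABET = "ARNDCQEGHILKMFPSTWYV"
# Index 0 is assumed reserved for padding, so residue indices start at 1.
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(ALPHABET)}

def integer_encode(seq, max_len=200):
    """Integer-encode a peptide sequence and zero-pad to a fixed length."""
    if len(seq) > max_len:
        raise ValueError("sequences longer than max_len are excluded")
    encoded = [AA_TO_INT[aa] for aa in seq]
    return encoded + [0] * (max_len - len(encoded))
```

For example, `integer_encode("ARN")` yields a length-200 vector whose first three entries are the 1-based alphabet indices, with zeros filling the rest.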
The output from the embedding layer goes either to a double-stacked bi-LSTM layer or to a single LSTM layer to identify patterns along a sequence that can be separated by large gaps. The former is used in predicting solubility and hemolysis, whereas the latter is used for predicting a peptide's resistance to nonspecific interactions (nonfouling). The rationale behind this choice for the nonfouling model is that the bi-LSTM layer did not contribute to better performance when compared with the LSTM layer (same ACC and AUROC of 82% and 0.93, respectively). The output from the LSTM layer is then concatenated with the relative frequency of each amino acid in the input sequence. This choice is partially based on our earlier work 61 and helps improve model performance. The concatenated output is then normalized and fed to a dropout layer with a rate of 10%, followed by a dense neural network layer with a ReLU activation function. This is repeated three times, and the final single-node dense layer uses a sigmoid activation function to force the final prediction to a value between 0 and 1. This scalar output gives the probability of the label being positive for the corresponding predicted peptide biological activity. We use this probability to evaluate the confidence of the model when making inferences on new sequences in our web-based implementation.
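The amino acid frequency features concatenated with the LSTM output can be computed as in the sketch below. This is a minimal stand-in of our own; in practice these features would be computed inside the model's input pipeline.

```python
ALPHABET = "ARNDCQEGHILKMFPSTWYV"

def relative_frequencies(seq):
    """Relative frequency of each canonical amino acid in a sequence.

    Returns a length-20 vector that sums to 1, which is concatenated
    with the LSTM output before the dense layers, as described above.
    """
    return [seq.count(aa) / len(seq) for aa in ALPHABET]
```

For instance, the sequence "AARV" gives 0.5 for A and 0.25 each for R and V, regardless of sequence length, which keeps this feature length agnostic.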
The hyperparameters are chosen based on a random search that resulted in the best model performance in terms of the Area Under the Receiver Operating Characteristic (AUROC) curve 62 and accuracy. The AUROC reflects the model's ability to discriminate between positive and negative examples as the discrimination threshold is varied, and the accuracy is defined as the ratio of correct predictions to the total number of predictions made by the model. The embedding layer has an input dimension of 21 (the alphabet length plus one to account for the padded zeros) and an output dimension of 32. The LSTM layer has 64 units, and the first, second, and third dense layers have 64, 16, and 1 units, respectively. We train with the Adam optimizer 63 and a binary cross-entropy loss function, which is defined as

L = −(1/N) ∑_{i=1}^{N} [y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i)]

where y_i is the true value of the ith example, ŷ_i the corresponding prediction, and N the size of the data set. The learning rate is adapted using a cosine decay schedule with an initial learning rate of 10^−3, decay steps of 50, and a minimum of 10^−6. The data split for training, validation, and testing is 81%, 9%, and 10%, respectively. To avoid overfitting, we add early stopping with a patience of 5 that restores the model weights from the epoch with the maximum AUROC on the validation set during training.
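Keras provides built-in equivalents of both components (`BinaryCrossentropy` and `CosineDecay`); the pure-Python sketch below mirrors the definitions above, with the small epsilon guard and the clamping at the end of the schedule as our own simplifications.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over a data set of size N.

    eps guards against log(0) for predictions of exactly 0 or 1."""
    n = len(y_true)
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for y, p in zip(y_true, y_pred)
    ) / n

def cosine_decay(step, initial=1e-3, decay_steps=50, minimum=1e-6):
    """Learning rate under a cosine decay schedule, floored at `minimum`."""
    t = min(step, decay_steps) / decay_steps
    return minimum + 0.5 * (initial - minimum) * (1 + math.cos(math.pi * t))
```

At step 0 the schedule returns the initial rate of 10^−3, and by the final decay step it has fallen smoothly to the 10^−6 minimum.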
Previous models for peptide prediction tasks use a variety of deep learning and classical machine learning methods. The prediction server PROSO II employs a two-layered structure, where the outputs of a primary Parzen 64 window model for sequence similarity and a logistic regression classifier of amino acid k-mer composition are fed to a second-level logistic regression classifier. HAPPENN uses normalized features selected by an SVM and an ensemble of Random Forests, which are fed to a deep neural network with batch normalization and dropout regularization to prevent overfitting. DSResSol (1) takes advantage of the integration of Squeeze-and-Excitation (SE) 65 residual networks 66 with dilated convolutional neural networks. 67 Specifically, the model includes five architectural units: a single embedding layer, nine parallel initial CNNs with different filter sizes, nine parallel SE-ResNet blocks, three parallel CNNs, and fully connected layers. Table 1 shows the classification performance for all three tasks, along with a comparison between our RNN model and the state-of-the-art methods (see Section 1.1 for a brief overview). All our models achieve results in the same range as the state-of-the-art methods. We compare the feature extraction capability of our RNN with other unconditional protein language models that provide pretrained sequence representations that transfer well to supervised tasks. Specifically, we train two machine learning models on the hemolytic data set, using a UniRep 68,69 representation of the peptide sequences followed by a logistic regression or a Random Forests 70 classifier. Our RNN architecture slightly outperforms both models in terms of AUROC. A one-hot representation of peptides followed by an RNN results in the best hemolysis model in terms of AUROC in ref 71. The choice of one-hots, however, requires training features specific to each position, so we do not expect that model to generalize.
In contrast, our model is length agnostic and will have a relatively smaller generalization error for sequences with lengths it has not observed before. Moreover, this removes the need for having training data at each position for each amino acid. Our predictive model for the solubility task, MahLooL, has an accuracy of 70.0%, which is mostly attributed to the difficulty associated with solubility predictions in cheminformatics. Note that the solubility data set used contains a large distribution of sequence lengths (18−198). DSResSol (1) outperforms all existing solubility models on the testing set from ref 72. Readers are encouraged to refer to ref 41 for the comparison of DSResSol (1) with all the state-of-the-art sequence-based solubility predictors. For a better comparison with our approach, we evaluate DSResSol (1)'s performance on the same testing set used for MahLooL. We explore the changes in model performance by training on the full training set but filtering the testing data based on sequence lengths, as illustrated in Table 2. MahLooL has comparable performance with respect to DSResSol (1) on the entire testing set. For short peptides (18−50 residues), it surprisingly outperforms DSResSol (1), with an AUROC of 0.95 and an accuracy of 91.3%. With longer peptide sequences, the property inference task becomes more difficult using only amino acid sequence information, as other experimental settings and conditions become important, adding more epistemic uncertainty to the predictions.
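The length-filtered evaluation just described can be sketched with a small helper of our own; the function and variable names are illustrative, not taken from our codebase.

```python
def filter_by_length(seqs, labels, lo, hi):
    """Keep only test examples whose sequence length falls within [lo, hi].

    Used to evaluate a model trained on the full training set against
    length-stratified slices of the testing data."""
    kept = [(s, y) for s, y in zip(seqs, labels) if lo <= len(s) <= hi]
    return [s for s, _ in kept], [y for _, y in kept]
```

Evaluating the same trained model on, e.g., the 18−50 residue slice versus the full test set is what separates the short-peptide results from the aggregate numbers reported in Table 2.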

RESULTS
To allow for transparency between users and developers, details of the models' performance, training procedures, intended use, and ethical considerations have been incorporated as model cards 73 on https://peptide.bio/. Model cards present information about how the model is trained, its intended use, caveats about its use, and any ethical or practical concerns when using model predictions. A brief overview of the model cards is presented in Table 3.
To evaluate the contribution of different architectural components to the model's performance, we conducted a set of ablative experiments on the solubility model only. In each ablation trial, an architectural component is removed, and the corresponding test AUROC and accuracy are reported via 5-fold cross-validation on the solubility data set. We remove the effect of regularization techniques (see Materials and Methods in Section 2) in our ablation trials by disregarding the early stopping callback and fixing the number of training epochs to 50. The learning rate is also set to a fixed value of 10^−3. This is the reason for the lower performance of the "full model". The results of our ablation study are shown in Table 4, sorted by the highest AUROC.
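The 5-fold protocol used in these trials can be sketched with plain index arithmetic; this is a generic stand-in of our own (in practice a library utility such as scikit-learn's `KFold` serves the same purpose).

```python
def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k roughly equal, disjoint folds.

    Each example appears in exactly one test fold, so every ablation
    variant is scored on the full data set across the k trials."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

Averaging the test AUROC and accuracy over the k folds gives the per-trial numbers reported in Table 4.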

DISCUSSION
We present three sequence-based classifiers to predict hemolysis, solubility, and resistance to nonspecific interactions of peptides and achieve competitive results compared with state-of-the-art models. The hemolytic model predicts the ability of a peptide to lyse red blood cells and is intended to be applied to peptides of 1 to 190 residues composed of L- and canonical amino acids (AUROC and accuracy of 0.84 and 84.0%, respectively). The hemolysis training data set is drawn from sequences thought to be antimicrobial or clinically relevant, so it may not generalize to all possible peptides. Our solubility model, MahLooL, is trained with data mostly containing long sequences; thus, it may not be as applicable to solid-phase synthesized peptides. MahLooL provides state-of-the-art sequence-based solubility predictions for short peptides (<50 residues), with an AUROC and accuracy of 0.95 and 91.3%, respectively. However, its accuracy is lower for long peptide sequences (>100 residues). Its intended use is for peptides or proteins expressed in E. coli that are less than 200 residues long, though it may provide solubility predictions that are more broadly applicable. The nonfouling model predicts the ability of a peptide to resist nonspecific interactions and is intended to be applied to short peptides of 2 to 20 residues (AUROC and accuracy of 0.93 and 82.0%, respectively). The nonfouling training data mostly contain short sequences, and insoluble peptides are overrepresented among the negative examples, so the accuracy may be inflated when comparing only soluble peptides.

CONCLUSIONS
Our proposed RNN models allow for automatic extraction of features from peptide sequences and remove the reliance on domain experts for feature construction. Moreover, these models are implemented in JavaScript, so that they can run on a static website through a browser on users' phones or desktops. This serverless approach removes the conventional dependence of deep learning models in cheminformatics on third-party hosted servers, thus reducing cost, increasing flexibility and accessibility, and promoting open science. Our work is impactful in two ways. First, we demonstrate a new paradigm for sharing methods seamlessly and without servers, which should enable better dissemination of research from a broader community of chemists. Second, we demonstrate a solubility predictor that outperforms the existing SOTA for short peptide sequences. Using the same architecture, we achieve competitive results for hemolysis and nonfouling predictions that should enable better design of peptide molecules given only their sequence information.

■ ASSOCIATED CONTENT Data Availability Statement
All data and code used to produce results in this study are publicly available in the following GitHub repository: https://github.com/ur-whitelab/peptide-dashboard. The JavaScript implementation of the models is available at https://peptide.bio/.
Table 4 footnote: For comparison, the performance of the model with the full architecture (as shown in Figure 2) is highlighted with an asterisk.

Journal of Chemical Information and Modeling
pubs.acs.org/jcim Article