Beyond Tripeptides: Two-Step Active Machine Learning for Very Large Data Sets

Self-assembling peptide nanostructures have been shown to be of great importance in nature and have many promising applications, for example in medicine as drug-delivery vehicles, biosensors, and antivirals. Because such peptides are promising candidates for the growing field of bottom-up manufacture of functional nanomaterials, previous work (Frederix et al., 2011 and 2015) screened all possible amino acid combinations of di- and tripeptides in search of such materials. However, the enormous complexity and variety of linear combinations of the 20 amino acids make exhaustive simulation of all combinations of tetrapeptides and above infeasible. Therefore, we have developed an active machine-learning method (also known as "iterative learning" or an "evolutionary search method") which leverages a lower-resolution dataset encompassing the whole search space and a just-in-time high-resolution dataset which further analyzes those target peptides selected by the lower-resolution model. The model uses the newly generated data from each iteration to improve both the lower- and higher-resolution models in the search for ideal candidates. Curation of the lower-resolution dataset is explored as a method to control the selected candidates, based on criteria such as log P. A major aim of this method is to produce the best results in the least computationally demanding way. The model has been developed to be broadly applicable to other search spaces with minor changes to the algorithm, allowing its use in other areas of research.


Wimley-White log P scale
The log P for each peptide is calculated via the Wimley-White log P scale [1,2], in which the log P is derived from the octanol-to-water solvation free energies of the individual amino acids (Equation S1).

Equation S1. Formula for determining the log P of peptides proposed by W. Wimley & H. White [1,2], where N is the number of amino acid residues in the peptide.

Figure S1. Correlation matrix of Mordred parameters for pentapeptides. Some parameters are highly correlated, but their correlations still vary between the different peptide chain-length datasets and these parameters have therefore been retained.

Figure S2. Correlation matrix of Mordred parameters for pentapeptides. Some parameters are highly correlated, but their correlations still vary between the different peptide chain-length datasets and these parameters have therefore been retained.
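Since the formula sums per-residue contributions over the N residues of a peptide, the calculation can be sketched as below. The per-residue values in this sketch are illustrative placeholders only, not the actual Wimley-White octanol-to-water scale; refs [1,2] give the real values.

```python
# Sketch of peptide log P from per-residue contributions (Wimley-White style).
# NOTE: the values below are ILLUSTRATIVE PLACEHOLDERS, not the published
# Wimley-White octanol-to-water scale -- see refs [1,2] for the actual data.
RESIDUE_LOGP = {"A": 0.1, "F": 1.1, "G": 0.0, "S": -0.5}  # placeholder values

def peptide_logp(sequence: str) -> float:
    """Sum the per-residue log P contributions over the N residues."""
    return sum(RESIDUE_LOGP[res] for res in sequence)

print(peptide_logp("AFGS"))
```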

Hyperparameters
Hyperparameters for each model were selected via 5-fold cross-validation on 80 % of the tripeptide dataset and validated against the remaining 20 %.
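This selection protocol (grid search under 5-fold cross-validation on an 80 % split, then scoring on the held-out 20 %) can be sketched as follows; the feature and target arrays are random stand-ins for the tripeptide descriptors and AP scores, and the grid values are examples only.

```python
# Sketch of the hyperparameter-selection protocol: 5-fold CV grid search on an
# 80% training split, then validation on the held-out 20%. The arrays below
# are random stand-ins for the real descriptor/AP data.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X, y = rng.random((100, 5)), rng.random(100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)
search = GridSearchCV(SVR(kernel="rbf"),
                      {"C": [10, 30, 100], "epsilon": [0.01, 0.1]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```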
The training of the non-linear SVM using the radial basis function for the kernel (SVMRBF) on Judred data used a C value of 100 and an epsilon value of 0.1 with a scaled kernel coefficient. There was no limit on the number of iterations and the tolerance was 0.0001 (Table S2). In the case of training on Mordred data, the C value was 30.

The training of the elastic net model on Judred data used an alpha value of 0.001 and an L1 ratio of 0.1, with the number of iterations capped at 500 and a tolerance of 0.0001. The intercept was fit, and the coefficients were updated by random selection on each iteration; these coefficients were allowed to be positive or negative. In the case of training on Mordred data, the maximum number of iterations was capped at 1000.

The training of the linear SVM model on Judred data used a regularization parameter (C) of 100 and the squared epsilon-insensitive (L2) loss function with an epsilon value of 0.1. The intercept was fit with an intercept scaling value of 0.1, and the number of iterations was capped at 500 with a tolerance of 0.0001. In the case of training on Mordred data, C was 75 and the intercept scaling was 0.5.

The training of the random forest model on Judred data used 300 estimators with no minimum impurity decrease and no early-stopping threshold for tree growth. The minimum number of samples required to be a leaf node was 1 and the minimum weighted fraction of the sum of weights required to be a leaf node was 0. The quality of a split was measured as mean absolute error, bootstrapping was disabled, and the maximum number of features considered when looking for the best split was the square root of the number of features. The use of out-of-bag samples to estimate R² on unknown data was disabled. For the neural network model, training on Mordred data used mini-batch gradient descent with a batch size of 80, while the number of hidden layers was increased to 3.
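Assuming a scikit-learn implementation (the reported parameter names closely match its estimator arguments, though the original code may differ), the Judred-data configurations above map onto constructors roughly as follows:

```python
# Sketch mapping the reported Judred-data hyperparameters onto scikit-learn
# estimators. ASSUMPTION: scikit-learn is the underlying library; the
# original implementation may differ in names and defaults.
from sklearn.svm import SVR, LinearSVR
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor

svm_rbf = SVR(kernel="rbf", C=100, epsilon=0.1, gamma="scale",
              tol=1e-4, max_iter=-1)                     # no iteration limit
elastic = ElasticNet(alpha=0.001, l1_ratio=0.1, max_iter=500, tol=1e-4,
                     fit_intercept=True, selection="random", positive=False)
linear_svm = LinearSVR(C=100, loss="squared_epsilon_insensitive", epsilon=0.1,
                       fit_intercept=True, intercept_scaling=0.1,
                       max_iter=500, tol=1e-4)
forest = RandomForestRegressor(n_estimators=300, criterion="absolute_error",
                               min_samples_leaf=1, min_weight_fraction_leaf=0.0,
                               min_impurity_decrease=0.0, bootstrap=False,
                               max_features="sqrt", oob_score=False)
```

For the Mordred feature set, only the noted parameters change (e.g. `C=30` for the RBF SVM, `max_iter=1000` for the elastic net, `C=75` and `intercept_scaling=0.5` for the linear SVM).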
The training of the decision tree model on Judred data used a random splitter with no maximum depth, maximum number of features, or maximum number of leaf nodes. The criterion optimised towards was the Friedman mean squared error. The minimum impurity decrease for a node split was 0.02, the minimum number of samples required to be at a leaf node was 1, and the minimum number of samples required to split an internal node was 2.

Figure S3. Receiver operating characteristic (ROC) curves for several machine learning models attempting to distinguish between self-assembling and non-assembling peptides as the cut-off AP score is varied. In most cases, use of the Mordred dataset outperforms the Judred dataset; otherwise they perform equally well.
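Under the same scikit-learn assumption as above, the decision-tree configuration just described can be sketched as:

```python
# Sketch of the decision-tree configuration described above.
# ASSUMPTION: scikit-learn estimator; parameter names may differ in the
# original implementation.
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(splitter="random",           # random splitter
                             criterion="friedman_mse",    # Friedman MSE
                             max_depth=None,              # no depth limit
                             max_features=None,           # no feature limit
                             max_leaf_nodes=None,         # no leaf-node limit
                             min_impurity_decrease=0.02,
                             min_samples_leaf=1,
                             min_samples_split=2)
```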

Initial training set
To discern whether the model was being biased heavily by this initial peptide, we ran the model 20 times, each time starting with a different sequence-uniform tripeptide. We found the maximum difference between any two mean AP values after 10 iterations to be 0.078, falling to 0.074 after 20 iterations; the plots are visualised in the Supporting Information, Figure S4.

Figure S4. The initial training set used for active learning was the relevant polyalanine for the dataset in question. This figure shows that the model is not unfairly biased by the choice of initial peptide; it merely functions to provide an initial step. The tripeptides reported by Frederix et al. [3] are shown in black.
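The bias metric used here (the maximum difference between any two runs' mean AP values) reduces to the spread of the per-run means, which can be sketched as below; the per-run values are random stand-ins for the 20 active-learning runs.

```python
# Sketch of the bias check: given the mean AP score from each of 20 runs
# started from different sequence-uniform tripeptides, the maximum difference
# between any two runs is simply max - min. Values are random stand-ins.
import numpy as np

rng = np.random.default_rng(2)
mean_ap_per_run = rng.uniform(1.5, 1.6, size=20)  # stand-in per-run mean APs

max_gap = mean_ap_per_run.max() - mean_ap_per_run.min()
print(round(float(max_gap), 3))
```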

Validation of model accuracy
For each dataset, 800 randomly selected peptides were analysed via 5-fold cross-validation using the SVMRBF machine learning model with optimized hyperparameters for both feature sets (Judred & Mordred); the Mordred predictions are found to be more accurate (Figure S5).
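This validation step can be sketched as a straightforward 5-fold cross-validation of the tuned RBF SVM; the 800 × 10 descriptor matrix here is a random stand-in for the sampled Judred or Mordred features.

```python
# Sketch of the accuracy validation: 5-fold cross-validation of the tuned
# SVMRBF model on 800 randomly selected peptides. The arrays are random
# stand-ins for the real descriptor/AP data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X, y = rng.random((800, 10)), rng.random(800)

scores = cross_val_score(SVR(kernel="rbf", C=100, epsilon=0.1), X, y, cv=5)
print(scores.mean())
```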

Alternative methods
Other sampling methods, such as composing each iteration of the best (highest-AP) tripeptide plus 9 randomly chosen weighted tripeptides (weight = AP), were analysed for comparison with our method; we find that they do not produce results as good.

Ranged sampling
Instead of submitting the top 10 AP predictions for CGMD simulations, we tested ranged sampling by submitting the highest-AP tripeptide plus 9 randomly chosen weighted tripeptides, where the weight was equal to the AP score. This method found 19/20 of the top tripeptides reported by Tuttle et al. [3] (compared to 20/20 for the top 10 AP predictions), with the top predictions spread into later rather than earlier iterations of the model (Figure S6).

Figure S6. Tripeptide active learning where A) each iteration consists of the best (highest-AP) tripeptide and 9 randomly chosen weighted tripeptides (weight = AP) and B) each iteration consists of the 10 best predicted peptides. Black dots show the top tripeptides reported by Frederix et al. [3]. Inset is the mean AP score of peptides found after 100 and 200 peptides (10 & 20 iterations).
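The per-iteration selection in ranged sampling (one greedy pick plus nine AP-weighted random picks) can be sketched as follows; the candidate IDs and AP scores are random stand-ins.

```python
# Sketch of ranged sampling: each iteration takes the single highest-AP
# candidate, then draws 9 further candidates without replacement with
# probability proportional to their predicted AP (weight = AP).
import numpy as np

rng = np.random.default_rng(4)
candidates = np.array([f"pep{i}" for i in range(100)])  # stand-in peptide IDs
ap = rng.random(100)                                    # stand-in AP predictions

best_idx = int(np.argmax(ap))
best = candidates[best_idx]
rest = np.delete(np.arange(100), best_idx)              # everything except best
weights = ap[rest] / ap[rest].sum()                     # normalise AP weights
batch = [best] + list(rng.choice(candidates[rest], size=9,
                                 replace=False, p=weights))
print(batch)
```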

Active learning using only Judred descriptors
We also compared a Judred-only active learning method without the second step of predicting based on Mordred descriptors. The results were compared for tetrapeptides with no restriction on log P and with restrictions of max = 0 and max = -4; the results were more sporadic and the model tended to learn more slowly than the two-step model (Figure S7).

Figure S7. The dual-resolution algorithm is compared to a Judred-only version of the algorithm for tetrapeptides with no restrictions on the dataset and with varying restrictions on the dataset. The results show that the Judred-only model takes longer to learn the trends for predicting a high AP score and returns more erratic results when tasked with the harder job of predicting aggregating/self-assembling soluble peptides. The red and blue lines represent the maximum and mean values from the random dataset, respectively. Inset is the mean AP score of peptides found after 100 and 200 peptides (10 & 20 iterations).
prediction accuracy against the tripeptides dataset.

SimChecker.py (Python, Scripts.zip/APMD/SimChecker.py): program for querying the status of CGMD jobs and restarting those that have failed or timed out.

Coarse-grained molecular dynamics data

Figure S8. Directory and file structures of coarse-grained molecular dynamics (CGMD) data.

Figure S8 shows the file structure of the CGMD zip file, which contains a zip file for each active learning process reported. Each zip file contains folders for each peptide, which in turn contain the following (where * represents a single-letter peptide code):
• *_min.gro: first frame, from which SASAinitial is calculated
• *_eq_centred.gro: final frame, from which SASAfinal is calculated
• SASA.xvg: SASA at each frame of the trajectory (not included due to size)
• *_A.itp: MARTINI topology of the peptide
• *.top: topology of the system

The content of the CGMD zip is given in Table S6.

Table S7. Output logs from the active learning process; break lines are included between each iteration of 10 peptides and the initial polyalanine is not logged.
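Given the SASAinitial and SASAfinal values extracted from the files above, an AP score can be computed as their ratio, assuming the AP definition of Frederix et al. [3] (SASAinitial/SASAfinal, so that AP > 1 indicates a loss of solvent-accessible surface, i.e. aggregation); the numbers below are stand-ins, not values from the actual trajectories.

```python
# Sketch of the AP score computed from the per-peptide CGMD output files.
# ASSUMPTION: AP = SASA_initial / SASA_final as in Frederix et al. [3];
# the SASA values here are stand-ins, not data from the real trajectories.
def aggregation_propensity(sasa_initial: float, sasa_final: float) -> float:
    """AP > 1 means SASA shrank during the run, i.e. the peptides aggregated."""
    return sasa_initial / sasa_final

print(aggregation_propensity(120.0, 60.0))
```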