ALDELE: All-Purpose Deep Learning Toolkits for Predicting the Biocatalytic Activities of Enzymes

Rapidly predicting the ability of enzymes to catalyze specific substrates is essential for identifying potential enzymes for industrial transformations. The demand for sustainable production of valuable industrial chemicals from biological resources has raised a pressing need to speed up biocatalyst screening using machine learning techniques. In this research, we developed an all-purpose deep-learning-based multiple-toolkit (ALDELE) workflow for screening enzyme catalysts. ALDELE incorporates both structural and sequence representations of proteins, alongside representations of ligands by subgraphs and overall physicochemical properties. Comprehensive evaluation demonstrated that ALDELE can predict the catalytic activities of enzymes; in particular, it identifies residue-based hotspots to guide enzyme engineering and generates substrate heat maps to explore the substrate scope of a given biocatalyst. Moreover, our models notably match empirical data: their predictions align with experimentally confirmed mutation sites, reinforcing the practicality and reliability of our approach. ALDELE offers a facile and comprehensive solution by integrating different toolkits tailored to different purposes at affordable computational cost, and would therefore be valuable for speeding up the discovery of new functional enzymes for exploitation by industry.


S3
different classes, i.e., no, low, intermediate, and high degree of activity, and applied them to the multi-class classification model.
Phosphatase dataset. The data were originally used by Huang et al., where the activities of 218 enzymes against 165 substrates were reported. However, many enzymes in the dataset showed no phosphatase activity toward any of the substrates. We selected a smaller set of 54 enzymes that displayed a certain extent of activity toward the substrates. When building sub-datasets for the substrate-discovery task, we further narrowed the number of enzymes to 22 to ensure balanced sub-datasets with R2 between 0 and 1 (the criterion being that non-zero items should make up more than 30% of the total data).
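The 30% non-zero criterion used to build balanced sub-datasets can be sketched as a simple filter (a minimal illustration with hypothetical enzyme IDs and activity values, not the actual dataset):

```python
# Keep only enzymes whose activity profile against the substrate panel
# is more than 30% non-zero, as required for balanced sub-datasets.

def nonzero_fraction(activities):
    """Fraction of non-zero activity values for one enzyme."""
    return sum(1 for a in activities if a != 0) / len(activities)

def select_enzymes(activity_matrix, threshold=0.3):
    """activity_matrix: dict mapping enzyme id -> list of activities,
    one per substrate. Returns the ids passing the threshold."""
    return [enzyme for enzyme, acts in activity_matrix.items()
            if nonzero_fraction(acts) > threshold]

# Toy example (hypothetical values)
matrix = {
    "enz_A": [0.0, 0.5, 0.7, 0.0, 0.2],   # 60% non-zero -> kept
    "enz_B": [0.0, 0.0, 0.0, 0.0, 0.1],   # 20% non-zero -> dropped
}
print(select_enzymes(matrix))  # ['enz_A']
```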
BVMO dataset. The BVMO thermostability dataset was built by collecting the melting temperatures of wild-type and mutated enzymes in the BVMO family. This dataset contains only enzyme properties and does not involve substrates. It was used for the enzyme-discovery task.

S5: Machine learning methods for comparison
Three traditional feature-based models, random forest (RF), support vector machine (SVM), and K-nearest neighbors (KNN), were used for comparison with our approach. The training, validation, and test sets in the compared baseline methods were the same as those used for the ALDELE methods. The most common features, molecular fingerprints from RDKit (a 208-dimensional feature vector), protein amino acid sequence composition descriptors generated by propy (an 8,567-dimensional feature vector), and the PSSM features from the "smooth" approach (a 420-dimensional feature vector), were all included for a fair comparison.
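For these baselines, the three feature blocks are concatenated into a single input vector per sample. A minimal sketch (the dimensions come from the text above; the function name and placeholder values are hypothetical):

```python
# Concatenate RDKit molecular fingerprints (208-d), propy sequence
# composition descriptors (8,567-d), and smoothed PSSM features (420-d)
# into one input vector for the RF/SVM/KNN baselines.

def concat_features(fingerprint, composition, pssm):
    assert len(fingerprint) == 208
    assert len(composition) == 8567
    assert len(pssm) == 420
    return list(fingerprint) + list(composition) + list(pssm)

# Placeholder zero vectors stand in for real featurizations
x = concat_features([0.0] * 208, [0.0] * 8567, [0.0] * 420)
print(len(x))  # 9195
```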
Goldman's model was tailored from the original model to a KNN-based model with two sets of features: ESM-1b, a pre-trained transformer protein language featurization (a 1,280-dimensional feature vector), and Morgan fingerprint features generated by RDKit (a 1,280-dimensional feature vector).
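The KNN step of this baseline can be sketched in a few lines (a toy re-implementation for illustration only, not Goldman's code; in practice the 1,280-dimensional ESM-1b and Morgan vectors would serve as the inputs):

```python
import math

def knn_predict(query, train_x, train_y, k=3):
    """Predict activity as the mean label of the k nearest
    training vectors under Euclidean distance."""
    dists = [(math.dist(query, x), y) for x, y in zip(train_x, train_y)]
    dists.sort(key=lambda t: t[0])
    nearest = [y for _, y in dists[:k]]
    return sum(nearest) / len(nearest)

# Toy 2-d stand-ins for the concatenated ESM-1b + Morgan vectors
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_y = [0.1, 0.2, 0.9]
print(knn_predict([0.1, 0.1], train_x, train_y, k=2))  # mean of 0.1 and 0.2
```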
The optimized hyper-parameters of the compared neural network models were taken from the original papers. Tsubaki's model, which was originally designed for classification tasks, was tailored for regression tasks in this research. The hyper-parameters of these neural network models are summarized as follows. Tsubaki's model: number of GNN layers = 3, number of CNN layers = 3, radius r of subgraphs = 2, n-gram of sequence = 3, vector dimension = 10, epochs = 100, learning rate = 0.001, learning rate decay = 0.5, decay interval = 10, weight decay = 1e-6.
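The n-gram hyper-parameter splits each protein sequence into overlapping amino-acid words; with n = 3 as above, a minimal sketch of the splitting (the sequence shown is a made-up example) is:

```python
def split_ngrams(sequence, n=3):
    """Split an amino-acid sequence into overlapping n-gram words,
    as used to featurize protein sequences (here n = 3)."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

print(split_ngrams("MKVLAT"))  # ['MKV', 'KVL', 'VLA', 'LAT']
```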
hbond_bb_sc Sidechain-backbone hydrogen bond energy.
hbond_sc Sidechain-sidechain hydrogen bond energy.
dslf_fa13 Disulfide geometry potential; supports D- and L-cysteine disulfides, plus homocysteine disulfides or disulfides involving beta-3-cysteine.
omega Omega dihedral angle in the backbone.
fa_dun Internal energy of sidechain rotamers.
p_aa_pp Probability of amino acid at given Φ/Ψ.
hxl_tors Sidechain hydroxyl group torsion preference for Ser/Thr/Tyr; supersedes yhh_planarity (which covers L- and D-Tyr only).
ref Reference energy for each amino acid; balances the internal energy of amino acid terms and plays a role in design.

The selection of hyper-parameters for deep learning performance was carried out on the Thiolase activity dataset and evaluated by learning curves.

Fig. S4-1: Learning curves with various hyper-parameters on the validation dataset: (a) various r-radius subgraphs and n-gram amino acids, (b) various dimension vectors and top RDKit descriptors, (c) various numbers of layers in the GNN, CNN, and NN.

Table S3-2: The features from the Rosetta score function by amino acid position.
hbond_sr_bb Backbone-backbone hydrogen bonds close in primary sequence.
hbond_lr_bb Backbone-backbone hydrogen bonds distant in primary sequence.

Table S6-3: The r.m.s.e. results using the activities of the Phosphatase dataset - an enzyme

Table S7-1: 16 substrates chosen from the phosphatase activity dataset for calculating the attention weights.