  • Open Access
Machine Learning and Deep Learning

Molecular Dynamics (MD)-Derived Features for Canonical and Noncanonical Amino Acids


Journal of Chemical Information and Modeling

Cite this: J. Chem. Inf. Model. 2025, 65, 4, 1837–1849
https://doi.org/10.1021/acs.jcim.4c02102
Published February 2, 2025

Copyright © 2025 The Authors. Published by American Chemical Society. This publication is licensed under CC-BY-NC-ND 4.0.

Abstract


Machine learning (ML) models have become increasingly popular for predicting and designing structures and properties of peptides and proteins. These ML models typically use peptides and proteins containing only canonical amino acids as the training data. Consequently, these models struggle to make accurate predictions for peptides and proteins containing new amino acids that are absent in the training data set (e.g., noncanonical amino acids). One approach to improve the accuracy of the models is to collect more training data with the desired amino acids. However, this strategy is suboptimal as new data may not be easily attainable, and additional time is required to retrain the ML models. Alternatively, the extendibility of the ML models can be improved if the amino acid features used are representative and generalizable to the unseen amino acids. Herein, we develop amino acid features using molecular dynamics (MD) simulation results. Specifically, for a given amino acid, we perform MD simulation of its dipeptide to create features based on its backbone (ϕ, ψ) distributions and its electrostatic potentials. We demonstrate that these new features enable our ML models to more accurately predict the structural ensembles of cyclic peptides containing amino acids not present in the original training data set. For example, we build ML models to predict cyclic pentapeptide structures, with the training data set containing a library of 15 amino acids and the test data set containing the same 15-amino-acid library or an extended 50-amino-acid library. When using popular features such as Morgan fingerprints and MACCS keys to represent amino acids, the ML models achieve R2 = 0.963 for structural predictions of test cyclic pentapeptides containing the same 15-amino-acid library. However, these models’ performances decrease significantly to R2 = 0.430 and R2 = 0.508, respectively, when tasked to predict the structures of cyclic pentapeptides containing a library of 50 amino acids. On the other hand, the model using our backbone (ϕ, ψ) features outperforms those using Morgan fingerprints and MACCS keys, with R2 = 0.700. Overall, instead of having to collect more training data, our new features enable predictions of peptide sequences containing amino acids not originally present in the training data set at the mere cost of performing new dipeptide simulations for the new amino acids.


Introduction


Peptides and proteins are crucial in maintaining biological processes. These molecules can also be designed to bind to specific disease-relevant targets, making them attractive drug candidates. (1,2) The ability to efficiently and accurately predict peptide or protein structures and properties would greatly improve our ability to rationally design peptide and protein therapeutics. Accordingly, developing machine learning (ML) models for predicting the structures and properties of peptides and proteins has become increasingly popular. (3−10)
While there is much literature on the successes of peptide- and protein-related ML models, questions about their applicability domain remain. For example, many of these models are trained using peptides with a limited set of amino acids (often canonical amino acids), and their performance is rarely demonstrated for peptides containing new amino acids that are absent in the training data set. Since it is a common strategy in the drug design process to incorporate new amino acids (e.g., noncanonical amino acids) to enhance properties like proteolytic resistance and membrane permeability, it would be highly beneficial if the ML models could be applied to new amino acids with good performance. (11−14)
To accurately make predictions for peptides and proteins containing new amino acids, one could collect more training data containing the new amino acids and retrain the ML models. However, collecting more training data could be time-consuming and difficult, making this approach undesirable. Alternatively, the features chosen to represent peptide and protein sequences for the ML models can offer a strategy to improve the models’ extendibility. If these features capture generalizable information across amino acids, then they have the potential to help the models make accurate predictions for peptides and proteins with new amino acids.
For example, generalization is very difficult for one-hot encoding: the digits representing new amino acids do not exist in the original model, and the model has no training data associated with them. However, outputs from SMILES, (15) CHUCKLES, (16) CHORTLES, (17) PLN, (18) BigSMILES, (19) and HELM (20) can be tokenized to create features that may be more generalizable. (21−24) For example, Morgan fingerprints (FPs) and molecular access system (MACCS) keys can be used to encode amino acids, where an amino acid is represented by a bit vector indicating the presence or absence of “sub-fragments” or chemical moieties. (25,26) Using these encoding schemes to extrapolate to new amino acids is possible because, in principle, if the new amino acids are similar to and share fragments with those seen in the training data set, then the ML model has some context about the new amino acids. However, new amino acids may contain chemical moieties that are simply absent in the original set of amino acids, and thus, more consideration may be necessary. For example, Schissel and co-workers used Morgan FPs to represent amino acids and developed an ML model to predict whether a miniprotein could deliver an antisense oligomer to the nucleus; however, when models were trained using data sets that excluded sequences with a specific amino acid, the models performed poorly when tasked with making predictions for sequences containing the excluded amino acid. (27)
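For reference, the sketch below shows how a single amino acid can be encoded with a Morgan fingerprint and with MACCS keys using RDKit; the bit-vector size and radius are illustrative choices and not prescribed by the works cited above.

```python
# A minimal sketch (not from the cited works) of encoding one amino acid as a
# Morgan fingerprint bit vector and as MACCS keys with RDKit.
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

mol = Chem.MolFromSmiles("C[C@@H](N)C(=O)O")   # alanine, as an example input

# 2048-bit Morgan fingerprint with radius 3 (illustrative settings).
morgan_fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048)
# 167-bit MACCS keys.
maccs_fp = MACCSkeys.GenMACCSKeys(mol)

morgan_bits = list(morgan_fp)                   # 0/1 vector of length 2048
maccs_bits = list(maccs_fp)                     # 0/1 vector of length 167
print(sum(morgan_bits), sum(maccs_bits))        # number of set bits in each encoding
```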
We hypothesize that more informative features for amino acids could improve the generalizability of ML models. While features like Morgan FPs and MACCS keys can embed useful information about molecular “fragments” or chemical moieties, if there are multiple instances of the same chemical moiety at different positions in the molecule, the encodings may not differentiate them. Furthermore, for amino acids, information about the relative locations of chemical moieties on the side chain and the backbone atoms may be useful in prediction tasks. To create a richer FP feature that includes this information, we develop position-aware side chain (PASC) features (Figure 1A).

Figure 1

Figure 1. Overview of custom features for amino acids. (A) Position-aware side chain (PASC) fingerprints are based on a “heavy-atom walk” along the amino acid side chain. Starting at the Cα atom, a Morgan fingerprint with a radius of 1 is generated (red circle). Morgan fingerprints with a radius of 1 centered at the Cβ atom (orange circle), Cγ atom (yellow circle), etc., are similarly generated to provide the PASC features. (B) MD simulations of amino acid dipeptides are used to generate MD-derived features. For the backbone (BB) features, the (ϕ, ψ) distribution is calculated from the dipeptide simulation and binned in a 2D grid. Then, the resulting 2D probability density is flattened into a 1D vector. For the voxel (VOX) features, the simulation frames are aligned to reference coordinates for C, Cα, and N, where the Cα atom is at the origin, the N atom is at (1.449, 0.000, 0.000), and the C atom is at (−0.523, 1.429, 0.000) (in Å). Then, frame-averaged molecular electrostatic potential is calculated on a 3D voxel and flattened into a 1D vector. See the Methods section for more details.

In addition to the efforts toward building upon the FP features, we also aim to develop novel molecular dynamics (MD)-derived amino acid features. We create features for amino acids based on the MD simulations of the dipeptides of the amino acids. We develop backbone (BB) features to capture the backbone conformational preferences of an amino acid by analyzing the dipeptides’ backbone (ϕ, ψ) distributions, and we develop voxel (VOX) features to capture the electrostatic environment created by an amino acid by calculating their frame-averaged electrostatic potentials on a voxel (Figure 1B). When extending the model to a data set that includes new amino acids, all one needs to do is run dipeptide simulations of the new amino acids (ranging from 100 to 300 ns of two parallel bias-exchange metadynamics (BE-META) MD simulations using two initial starting structures), calculate their (ϕ, ψ) distribution and electrostatic potentials to create encodings for the new amino acids, and then make the predictions without needing to retrain the model.
Herein, we demonstrate the applicability and value of our new features using our previously developed ML models for cyclic peptide structure prediction, namely, the Structural Ensembles Achieved by Molecular Dynamics and Machine Learning (StrEAMM) models. (5,6) Cyclic peptides have emerged as a promising drug modality, and efficiently characterizing their structures could aid in their design. (28−30) Given a cyclic peptide sequence, the StrEAMM models predict a population vector, where each value represents the population of a specific cyclic peptide structure in the structural ensemble. (5,6) The StrEAMM models are trained to reproduce the structural ensembles observed in MD simulations, where the backbone dihedrals of the cyclic peptides are measured and discretized, each frame in the simulations is given a structural digit string, and the numbers of frames with the same structural digit strings are counted to obtain the population vectors. Our original training and test data sets included cyclic pentapeptides and cyclic hexapeptides containing a limited 15 amino-acid (15-aa) library (Figure 2, black box). In this current work, we evaluate our new amino acid features on how well the models trained on the data set with a limited amino acid library predict test sequences containing an expanded 37-aa library (Figure 2, blue box) and an expanded 50-aa library that includes noncanonical amino acids (Figure 2, purple box). Compared to Morgan FPs and MACCS keys, our new MD-derived features perform better on test sets containing amino acids that were absent in the training data set. For example, models using Morgan FPs, MACCS keys, or our BB features predict the structures of cyclic pentapeptides containing an expanded library of 50 amino acids with R2 = 0.430, R2 = 0.508, and R2 = 0.700, respectively. Our MD-derived features offer a convenient way to improve a model’s generalizability without spending more time collecting new training data.
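To make the population-vector idea concrete, the sketch below discretizes per-frame backbone dihedrals into structural digit strings and counts them; the bin boundaries and labels are purely illustrative assumptions and do not reproduce the actual StrEAMM discretization.

```python
# A hypothetical sketch of turning discretized backbone dihedrals into a
# population vector, in the spirit of the StrEAMM workflow described above.
from collections import Counter
import numpy as np

def dihedral_label(phi, psi):
    """Map a (phi, psi) pair in degrees to a coarse region label (illustrative bins)."""
    if phi < 0 and -120 <= psi <= 30:
        return "A"   # roughly alpha-like region
    if phi < 0:
        return "B"   # roughly beta/PPII-like region
    return "L"       # left-handed alpha and everything else

def population_vector(phi_traj, psi_traj):
    """phi_traj, psi_traj: arrays of shape (n_frames, n_residues) in degrees."""
    counts = Counter()
    for phi_row, psi_row in zip(phi_traj, psi_traj):
        digits = "".join(dihedral_label(p, s) for p, s in zip(phi_row, psi_row))
        counts[digits] += 1          # one structural digit string per frame
    total = sum(counts.values())
    return {structure: n / total for structure, n in counts.items()}

# Example with random dihedrals standing in for a cyclic pentapeptide trajectory.
rng = np.random.default_rng(0)
phi = rng.uniform(-180, 180, size=(1000, 5))
psi = rng.uniform(-180, 180, size=(1000, 5))
print(sorted(population_vector(phi, psi).items(), key=lambda kv: -kv[1])[:3])
```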

Figure 2

Figure 2. Amino acids included in the training, validation, and test data sets for the StrEAMM models. The training and validation data sets include cyclic peptide sequences from a 15-amino-acid (15-aa) library (black box). The test data sets include sequences containing amino acids in the same 15-aa library, the 37-aa library (blue box), or the 50-aa library (purple box). *For brevity, only the L-forms of chiral amino acids are depicted, but their mirror images are also included in the library.

Methods


Charge Derivation for Noncanonical Amino Acids (ncAAs)

To derive the backbone (BB) and voxel (VOX) features for the noncanonical amino acids (ncAAs), namely, L/D-PCF, L/D-F4C, L/D-BFA, L/D-NAL, L/D-HSE, L/D-GML, and AIB (Figure 2), we first needed to derive the charges for these ncAAs so that we could perform MD simulations of the dipeptides of L-PCF, L-F4C, L-BFA, L-NAL, L-HSE, L-GML, and AIB. The BB and VOX features for the D-amino acids can be easily derived from those for the L-amino acids by symmetry.

Initial Charge Derivation

We first converted the SMILES string for each of the ncAAs into a 3D structure with RDKit (version 2021.03.05). (31) Each structure was then capped with an acetyl N-terminal cap (ACE) and an N-methyl C-terminal cap (NME) to form the dipeptide. Two conformations were built for each dipeptide (except for AIB) using PyMOL, with (ϕ, ψ) in the α-helix region (−60°, −40°) and the β-sheet region (−150°, 150°), respectively. (32) For the AIB dipeptide, only the α-helix conformation was used. Then, the software Antechamber (33) (AmberTools version 22) was used to prepare the ncAA topology files. The initial partial charges of the ncAA were derived in Antechamber using AM1-BCC. (34) However, these charges were only used for the initial simulated annealing simulations, which were used to sample side chain conformations for deriving the final partial charges using restrained electrostatic potential (RESP) fitting. (35) The parmchk2 module within AmberTools was used to assign AMBER ff99SB force field (36) parameters, and any missing parameters were supplemented with parameters from the generalized AMBER force field (GAFF). (37) Additionally, modifications from residue-specific force field 2 (RSFF2) were also applied to the ncAAs. (38) Specifically, the RSFF2 modifications for L-LEU, L-PHE, L-PHE, L-TYR, L-TYR, and L-ALA were applied to L-GML, L-BFA, L-NAL, L-F4C, L-PCF, and L-HSE, respectively. The RSFF2 modifications from both L-ALA and D-ALA were applied to AIB.

Simulated Annealing

To sample the side chain conformations and generate structures for RESP charge derivation, simulated annealing simulations starting from each backbone conformation of the ncAA dipeptides were carried out in GROMACS (version 2018.6). (39) Dihedral restraints with a force constant of 1,000 kJ·mol–1·rad–2 were applied to the (ϕ, ψ) dihedral angles of each ncAA dipeptide to maintain its corresponding backbone conformation. Each initial structure of the ncAA dipeptide was first energy-minimized in vacuum. The energy-minimized structure was used to generate 25 replicas with random initial velocities at 300 K. Next, each replica was heated from 300 to 800 K in vacuum in 100 ps and held at 800 K for another 100 ps in an NVT ensemble. For simulation steps in vacuum, neighbor search and nonbonded interactions were truncated at 999.0 nm, and the neighbor list was only built once and never updated. The resulting structure was solvated using TIP3P water. (40) The minimum distance between each dipeptide and the box walls was set to 1.0 nm. The system was neutralized with minimal counterions (Na+ or Cl–) when applicable. Next, the steepest descent algorithm was employed to perform energy minimization for each solvated and neutralized system. Then, the system was relaxed at 300 K for 500 ps in an NVT ensemble. After NVT equilibration, an annealing process was carried out in an NPT ensemble. Specifically, the system was heated from 300 to 500 K in 100 ps, held at 500 K for 100 ps, cooled to 300 K in 500 ps, kept at 300 K for 100 ps, and then cooled again to 5 K in 200 ps. The final dipeptide structures from the 50 replicas (25 replicas for AIB) were used for subsequent charge derivations.
For simulation steps in explicit solvent, neighbor search and nonbonded interactions were truncated at 1.0 nm. Long-range electrostatic interactions beyond the cutoff were treated with the particle mesh Ewald method (41) using a Fourier spacing of 0.12 nm and cubic interpolation. A long-range dispersion correction for energy and pressure was used to account for the 1.0 nm cutoff of the Lennard–Jones interactions. The Berendsen barostat with a time coupling constant of 2.0 ps and an isothermal compressibility of 4.5 × 10–5 bar–1 was applied to maintain a pressure of 1 bar. The v-rescale thermostat with a time coupling constant of 0.1 ps was employed for temperature control. The LINCS constraint algorithm was applied to bonds involving hydrogen atoms. The leapfrog integrator was used with a time step of 2 fs. Extra improper terms related to the C, O, N, and H atoms were applied to the two peptide bonds of each dipeptide to maintain planarity and keep a trans-amide configuration.

Final Charge Derivation

First, the final structures from the simulated annealing simulations of each ncAA dipeptide were geometry-optimized using the Gaussian16 software with tight convergence criteria at the HF/6-31G(d) level of theory, with the (ϕ, ψ) dihedral angles frozen for each structure. (42) Any structure that did not reach convergence after 100 optimization steps was rerun with the χ1 dihedral angle additionally frozen, where applicable. Similarly, reruns with the χ2 dihedral angle additionally frozen (where applicable) were performed when geometry optimization with the frozen χ1 dihedral angle failed to converge. Structures that did not converge after the additional optimizations with frozen χ2 dihedral angles were not used in subsequent charge derivation. Out of the 50 structures for each dipeptide (25 for AIB), one β-sheet structure of GML and one β-sheet structure of NAL did not converge with the χ1 dihedral angle frozen. The β-sheet structure of GML converged after additionally freezing the χ2 dihedral angle, whereas the β-sheet structure of NAL did not.
The optimized structures of each dipeptide were submitted to the R.E.D. web server for charge derivation. (43) Gaussian16 was used to compute the molecular electrostatic potential with IOp(6/33 = 2, 6/41 = 4, 6/42 = 6) at the HF/6-31G(d) level of theory, and the partial charges were obtained following the two-stage RESP charge fitting method. (35) We assigned the charges for the backbone carbonyl C and O atoms in the ncAA, as well as those for the backbone amide N and H atoms, to be consistent with the charges for all the neutral canonical amino acids in the AMBER ff99SB force field. (36) Since the caps were used to help mimic the peptide environment, we set the charges for the N and H atoms in the NME cap, and the C and O atoms in the ACE cap, to also be consistent with those of the backbone atoms of the ncAA. Additionally, the total net charges of the ACE and NME groups were constrained to zero. The derived partial charges of the ncAA fragment were used for subsequent MD simulations of ncAA dipeptides and cyclic peptides with ncAAs, while the partial charges of the caps from these RESP fittings were discarded.

MD Simulations of Amino Acid Dipeptides

Bias-exchange metadynamics (44) (BE-META) simulations for the two initial structures (α-helix and β-sheet conformations) of each dipeptide were run in parallel using GROMACS (version 2018.6) (39) patched with the PLUMED 2.5.1 plugin. (45) The initial structures were solvated in a cubic water box as in the simulated annealing simulations, with a minimum distance of 1.0 nm between the dipeptide and the walls of the box. The steepest descent algorithm was used to minimize the solvated structure. Upon minimization, the system was equilibrated in two stages: first, all heavy atoms of the dipeptide were position-restrained, and 50 ps of NVT simulation at 300 K and 50 ps of NPT simulation at 300 K and 1 bar were performed. In the second stage, the position restraints were removed, and the same sequence of NVT and NPT simulations was run again for 100 ps each. BE-META production runs of 100 ns were then performed in the NPT ensemble, at 300 K and 1 bar, with a 2 fs time step. Each BE-META production run had one biased replica with a 2D bias on the (ϕ, ψ) dihedral angles of the dipeptide and five neutral replicas with no bias.
All simulations were run using the leapfrog algorithm, with water geometry maintained using SETTLE and hydrogen-containing bonds constrained to equilibrium lengths using LINCS. The nonbonded interaction cutoff was set to 1.0 nm, with Coulombic interactions beyond the cutoff computed using particle mesh Ewald summation, with a Fourier spacing of 0.12 nm and cubic interpolation. Dispersion corrections for both energy and pressure were applied to the long-range van der Waals interactions. Temperature was controlled by velocity rescaling, with a coupling time constant of 0.1 ps. The Parrinello–Rahman barostat was used for pressure control, with a coupling time constant of 2.0 ps and isothermal compressibility of 4.5 × 10–5 bar–1.
To monitor simulation convergence, we first computed the 2D density distributions of the (ϕ, ψ) angles for the two parallel simulations of each dipeptide system using the last 50 ns of the neutral replicas. Then, normalized integrated products (NIPs) between the density distributions of two parallel simulations were calculated, where an NIP value of 1.0 represents perfect similarity. (46) Most dipeptides converged with NIP ≥ 0.99. For PHE, THR, TYR, and VAL dipeptides, the BE-META simulations were extended to 200 ns, and NIP ≥ 0.99 was achieved when using the 100–200 ns trajectories of the neutral replicas. For the F4C dipeptide, the BE-META simulations were extended to 300 ns, and NIP ≥ 0.99 was achieved when using 200–300 ns trajectories of the neutral replicas.
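As an illustration of this convergence check, the sketch below builds 2D (ϕ, ψ) densities from two parallel runs and computes their NIP. The NIP expression used here, the density overlap normalized by the densities' norms, is assumed to correspond to the definition in ref 46, and the bin count is an arbitrary choice.

```python
# A minimal sketch of comparing two parallel dipeptide runs via the normalized
# integrated product (NIP) of their (phi, psi) densities; NIP = 1.0 for
# identical densities.
import numpy as np

def phi_psi_density(phi, psi, bins=60):
    hist, _, _ = np.histogram2d(phi, psi, bins=bins,
                                range=[[-180, 180], [-180, 180]], density=True)
    return hist

def nip(density1, density2):
    overlap = np.sum(density1 * density2)
    norm = np.sqrt(np.sum(density1**2) * np.sum(density2**2))
    return overlap / norm

# Example: dihedrals (degrees) standing in for the last 50 ns of two parallel runs.
rng = np.random.default_rng(1)
run1_phi, run1_psi = rng.normal(-80, 25, 5000), rng.normal(120, 30, 5000)
run2_phi, run2_psi = rng.normal(-80, 25, 5000), rng.normal(120, 30, 5000)
print(nip(phi_psi_density(run1_phi, run1_psi),
          phi_psi_density(run2_phi, run2_psi)))
```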

Cyclic Pentapeptide and Cyclic Hexapeptide Data Sets

The training data set of 705 cyclic pentapeptide sequences generated from the 15-aa library and the test data set of 50 cyclic pentapeptide sequences from the same 15-aa library are from our previously published work developing the StrEAMM models. (5) The training data set of 705 cyclic hexapeptide sequences generated from the 15-aa library comes from 555 cyclic hexapeptide sequences used in the previous StrEAMM work (6) and an additional 150 cyclic hexapeptides simulated in this work. The 49 test cyclic hexapeptides from the same 15-aa library, as well as 25 test cyclic pentapeptides and 25 test cyclic hexapeptides from the expanded 37-aa library, were also from the previous StrEAMM work. (6) In this work, we aimed to follow the same MD simulation protocol as our previously published work to curate approximately 50 test cyclic pentapeptides and 50 test cyclic hexapeptides for the expanded 37-aa library and also for the expanded 50-aa library. (5,6) We performed two sets of BE-META simulations starting from two initial structures for each cyclic peptide and monitored the simulation convergence for each cyclic peptide by performing dihedral principal component analysis using the backbone dihedrals, calculating the density distributions in the 3D principal component space, and calculating the NIP between the 3D density distributions from the two parallel simulations. If the NIP between the two parallel simulations was ≥ 0.9, we assumed the simulations had converged. For most cyclic peptides, this was achieved when using the last 50 ns of the neutral replicas of 100 ns BE-META simulations, which were used for subsequent structural analysis. For the 25 additional cyclic pentapeptides containing amino acids from the 37-aa library, four sequences were extended to 200 ns, and convergence was achieved using the last 100 ns of the neutral replicas. For the 50 cyclic pentapeptides containing amino acids from the 50-aa library, one sequence was extended to 200 ns, and convergence was achieved using the last 100 ns of the neutral replicas. For the 25 additional cyclic hexapeptides containing amino acids from the 37-aa library, six sequences were extended to 200 ns, and convergence was achieved using the last 100 ns of the neutral replicas; one sequence was extended to 300 ns, and convergence was achieved using the last 200 ns of the neutral replicas. For the 50 cyclic hexapeptides containing amino acids from the 50-aa library, fourteen sequences were extended to 200 ns, and convergence was achieved using the last 100 ns of the neutral replicas; four sequences were extended to 300 ns, and convergence was achieved using the last 200 ns of the neutral replicas; three sequences did not converge in 300 ns and were discarded to ensure that we only included high-quality data.

Position-Aware Side Chain (PASC) Fingerprints

To embed information about the positions of amino acid side chain chemical moieties relative to the backbone, we generated position-aware side chain (PASC) features that encode amino acids as concatenated fingerprints generated from a “heavy-atom walk” of the side chain. To generate the PASC fingerprints, RDKit (version 2021.03.05) was used to construct a 1024-bit Morgan fingerprint (25) with a radius of 1 centered on each heavy atom along the amino acid’s side chain. The walk starts at Cα and continues along the side chain until reaching the second-to-last heavy atom. The last heavy atom in the side chain is not included in the encoding because, given the radius of 1, the fingerprint of the preceding atom already captures neighboring atoms within one bond. For an amino acid with a side chain length N (defined by heavy atoms), the fingerprints for each side chain position were concatenated to obtain an (N – 1) × 1024 bit vector, such that the bits toward the end of the concatenated feature vector reflect chemical moieties furthest away from the backbone. Because different amino acids have varying side chain lengths, and thus their encoding vectors could have different lengths, we used a fixed length based on the longest side chain. The longest side chain in the 50-aa library was that of L-BFA, with a length of 10, so the total encoding for each amino acid had 9 × 1024 = 9216 bits. For amino acids with shorter side chains than L-BFA, the fingerprint was generated for all the heavy atoms in the side chain, and the encoding was then padded with zeros to bring the final encoding length to 9216 bits. For multiple heavy atoms at the same position relative to the backbone, a fingerprint was made for each heavy atom, and the average of these fingerprints was used in the final concatenated bit vector.
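A minimal sketch of this construction is shown below, assuming the side-chain heavy-atom indices for the walk have already been identified; the helper signature and the serine example are illustrative and not the authors' released code.

```python
# A sketch of position-aware side chain (PASC) features: per-heavy-atom Morgan
# fingerprints of radius 1 along a side-chain walk starting at C-alpha, with
# atoms at the same depth averaged and the result zero-padded to 9216 bits.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

N_BITS, MAX_POSITIONS = 1024, 9          # 9 positions x 1024 bits = 9216 bits

def pasc_fingerprint(mol, walk_atom_indices):
    """walk_atom_indices: list of lists; entry k holds the heavy-atom indices at
    position k of the walk (position 0 is C-alpha). Identifying these indices
    from the molecule is assumed to be handled elsewhere."""
    blocks = []
    for atom_indices in walk_atom_indices[:MAX_POSITIONS]:
        fps = [np.array(list(AllChem.GetMorganFingerprintAsBitVect(
                   mol, radius=1, nBits=N_BITS, fromAtoms=[idx])), dtype=float)
               for idx in atom_indices]
        blocks.append(np.mean(fps, axis=0))          # average atoms at the same depth
    padded = np.zeros(MAX_POSITIONS * N_BITS)
    if blocks:
        encoding = np.concatenate(blocks)
        padded[:encoding.size] = encoding            # zero-pad shorter side chains
    return padded

# Example: serine, whose walk is C-alpha -> C-beta (the terminal O-gamma is
# covered by the C-beta-centered fingerprint and is therefore not encoded).
ser = Chem.MolFromSmiles("N[C@@H](CO)C(=O)O")
print(pasc_fingerprint(ser, walk_atom_indices=[[1], [2]]).shape)   # (9216,)
```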

Backbone (BB) Features

To generate the backbone (BB) features, we first binned the (ϕ, ψ) space for each amino acid dipeptide into an N × N grid. The N × N grids for the D-amino acids were derived from those of the L-amino acids by centro-symmetry. The N × N grids were then flattened into 1D feature vectors (Figure 1B). While the time required to train ML models increases with the size of this feature vector, having too small a feature vector would result in a large loss of resolution of the (ϕ, ψ) distribution. Therefore, we aimed to balance efficiency and accuracy and considered various N values between 2 and 100. The 40 × 40 grid size was ultimately chosen for our BB features, as it achieved resolution similar to that of grids made with larger N values (Figures S1–S4). After choosing the grid size, we normalized the probability densities using min-max normalization to obtain feature values between 0 and 1. This normalization was performed in two ways: in a “per-plot” normalization, each amino acid’s 40 × 40 grid was normalized based on its own minimum and maximum grid points; in a “per-grid” normalization, the minimum and maximum values for each grid point were taken across all 15 amino acids in the 15-aa library. We tested the different normalization schemes by training StrEAMM models on non-normalized BB features (BB0), “per-plot”-normalized features (BB1), or “per-grid”-normalized features (BB2), and observed that the BB2 models perform the best on the validation data sets for cyclic pentapeptides and cyclic hexapeptides (Figure S5). Therefore, the BB features referred to in the main text are “per-grid” normalized.
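The sketch below outlines, under our own assumptions about array shapes, how such BB features can be assembled with NumPy: a 40 × 40 (ϕ, ψ) histogram flattened to 1600 values, the “per-grid” min-max normalization across the amino acid library, and D-amino acid features obtained from the L-amino acid grids by centro-symmetry.

```python
# A minimal sketch of backbone (BB) feature construction.
import numpy as np

GRID = 40   # 40 x 40 bins over the full (phi, psi) range

def bb_feature(phi, psi):
    """phi, psi: 1D arrays of dipeptide backbone dihedrals in degrees."""
    hist, _, _ = np.histogram2d(phi, psi, bins=GRID,
                                range=[[-180, 180], [-180, 180]], density=True)
    return hist.flatten()                          # 1600-dimensional vector

def per_grid_normalize(features):
    """features: (n_amino_acids, 1600) array of raw BB features; each grid point
    is min-max normalized across all amino acids in the library."""
    lo, hi = features.min(axis=0), features.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)         # avoid division by zero
    return (features - lo) / span

def bb_feature_d_from_l(bb_l):
    """Centro-symmetry: the D-amino acid (phi, psi) histogram is the L-amino
    acid histogram rotated by 180 degrees about the origin."""
    return bb_l.reshape(GRID, GRID)[::-1, ::-1].flatten()
```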

Electrostatic Potential Voxel (VOX) Features

Frame-averaged molecular electrostatic potentials were created for each amino acid from their dipeptide simulation trajectories. To calculate electrostatic potentials for each dipeptide, first, the backbone N, Cα, and C atoms of all frames of the dipeptide were aligned to a three-atom amino acid backbone template structure, where the Cα atom was at the origin, the N atom was at (1.449, 0.000, 0.000), and the C atom was at (−0.523, 1.429, 0.000) (in Å, Figure 1B). The partial charges on each atom were taken from the simulation force field and were represented as Gaussian charge distributions with the standard deviation equal to the van der Waals radius for each atom type. The electrostatic potential at a point r and frame (or time) t can be calculated using the following equation written in atomic units
$$\phi_t(\mathbf{r}) = \sum_{i}^{n} \frac{Q_i}{\left|\mathbf{r}_{i,t} - \mathbf{r}\right|}\,\operatorname{erf}\!\left(\frac{\left|\mathbf{r}_{i,t} - \mathbf{r}\right|}{R_{\mathrm{vdW},i}\sqrt{2}}\right)$$
where $\mathbf{r}_{i,t}$ is the position of atom $i$ of the dipeptide at time $t$; $Q_i$ and $R_{\mathrm{vdW},i}$ are the partial charge and the van der Waals radius of atom $i$ of the dipeptide; and $n$ is the number of atoms of the dipeptide. Furthermore, the term in the sum simplifies as the distance between a grid point and an atom approaches zero:
$$\lim_{\left|\mathbf{r}_{i,t} - \mathbf{r}\right| \to 0} \frac{Q_i}{\left|\mathbf{r}_{i,t} - \mathbf{r}\right|}\,\operatorname{erf}\!\left(\frac{\left|\mathbf{r}_{i,t} - \mathbf{r}\right|}{R_{\mathrm{vdW},i}\sqrt{2}}\right) = \frac{Q_i}{R_{\mathrm{vdW},i}}\sqrt{\frac{2}{\pi}}$$
This form is useful for the Cα of the dipeptide, which is always aligned with the grid point at the origin. The electrostatic potential was calculated for each frame of the dipeptide simulation, and then the time-averaged molecular electrostatic potential $\Phi(\mathbf{r})$ was obtained by averaging $\phi_t(\mathbf{r})$ over all frames.
The frame-averaged molecular electrostatic potential was calculated on a grid. The grid points were evenly spaced in the x, y, and z directions using 1 Å intervals. To reduce the number of grid points, a grid was also generated using 2.5 Å intervals, which we used to generate our voxel features. The size of the grid was determined by the extreme atom positions across all frames and all dipeptides so that all amino acids fit on a single grid. A 5 Å buffer was included between the edges of the box and these extreme positions so that the electrostatic potential extending outward from the outermost atoms was captured. For the 1 Å grid, the edges of the grid were rounded to the nearest whole number (Table S1). For the 2.5 Å grid, the edges of the grid were rounded so that the total distance along the grid would be a multiple of 2.5 Å (Table S1). Molecular electrostatic potential representations for D-amino acids were calculated by flipping the L-amino acid representations across the xy-plane.
Similar to the BB features, we tested different normalization schemes by training StrEAMM models on non-normalized VOX features (VOX0), the “per-plot” normalization (VOX1), and the “per-grid” normalization (VOX2). We observed that the VOX0 models perform best on the validation data set for cyclic pentapeptides and hexapeptides (Figure S6). Therefore, the VOX features referred to in the main text are not normalized.
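Because the VOX features used in the main text are not normalized, the sketch below only illustrates the frame-averaged potential itself, using the Gaussian-smeared point-charge expression given above. The helper names, the grid extent, and the use of SciPy's error function are assumptions for illustration; the actual charges, van der Waals radii, and grid extents come from the force field and Table S1.

```python
# A minimal sketch of frame-averaged electrostatic potential (VOX) features.
import numpy as np
from scipy.special import erf

def frame_potential(grid_points, atom_positions, charges, vdw_radii):
    """Potential of one aligned frame on the grid (units follow the inputs).
    grid_points: (G, 3); atom_positions: (n, 3); charges, vdw_radii: (n,)."""
    phi = np.zeros(len(grid_points))
    for pos, q, rvdw in zip(atom_positions, charges, vdw_radii):
        d = np.linalg.norm(grid_points - pos, axis=1)
        near = d < 1e-8                               # grid point on top of an atom
        far = ~near
        phi[far] += q / d[far] * erf(d[far] / (rvdw * np.sqrt(2.0)))
        phi[near] += q / rvdw * np.sqrt(2.0 / np.pi)  # limiting value as d -> 0
    return phi

def vox_feature(frames, charges, vdw_radii, spacing=2.5, extent=10.0):
    """Average the per-frame potential over aligned frames and flatten.
    frames: (n_frames, n_atoms, 3) coordinates already aligned on N, CA, C."""
    axis = np.arange(-extent, extent + spacing, spacing)
    grid = np.array(np.meshgrid(axis, axis, axis, indexing="ij")).reshape(3, -1).T
    potentials = [frame_potential(grid, f, charges, vdw_radii) for f in frames]
    return np.mean(potentials, axis=0)                # 1D VOX feature vector
```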

StrEAMM Neural Network Models

The ML models used for the StrEAMM method are convolutional neural networks (CNNs), as described in previous work. (6) We chose the CNN architectures that performed the best for the cyclic pentapeptide data set (“CNN (1,2)” models) and for the cyclic hexapeptide data set (“CNN (1,2)+(1,3)+(1,4)” model) as a starting point. (6) In brief, the StrEAMM CNN models represented a cyclic peptide sequence with C amino acids as a matrix with C columns and R rows, where R is the dimension of the feature (for example, R = 1600 for the BB features made from the 40 × 40 grid). For a “CNN (1,2)” model, this R × C matrix was concatenated with another matrix that encodes the cyclic peptide sequence permutated by one position (i.e., if the original sequence is cyclic-(ABCDE), the permutated sequence is cyclic-(BCDEA)). The resulting concatenated matrix was (2 × R) × C so that the features representing the amino acids within a (1, 2) pair (i.e., directly adjacent amino acids in the sequence) were also in the same column. Then, this feature matrix was input into a 1D convolutional layer, followed by a fully connected hidden layer of a multilayer perceptron (MLP) with a rectified linear unit (ReLU) activation. (47) The output layer (with a dimension based on the number of structures in the ensemble) used a SoftMax activation function to make the predicted populations sum to 1. The loss function used was the summation of the squared error loss.
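A minimal PyTorch sketch of a “CNN (1,2)”-style model along the lines described above is shown below; the specific layer sizes and the kernel size of 1 are illustrative assumptions rather than the tuned StrEAMM hyperparameters.

```python
# A sketch of a "CNN (1,2)"-style architecture: the R x C feature matrix is
# concatenated with its one-position cyclic permutation, passed through a 1D
# convolution over the residue positions, then an MLP with ReLU, and a softmax
# output so that the predicted populations sum to 1.
import torch
import torch.nn as nn

class StrEAMMLikeCNN12(nn.Module):
    def __init__(self, feature_dim, n_residues, n_structures,
                 n_filters=256, n_hidden=512):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=2 * feature_dim,
                              out_channels=n_filters, kernel_size=1)
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_filters * n_residues, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_structures),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        # x: (batch, feature_dim, n_residues), one column per residue
        x_perm = torch.roll(x, shifts=-1, dims=-1)   # cyclic-(BCDEA) columns
        paired = torch.cat([x, x_perm], dim=1)       # (batch, 2R, C)
        return self.mlp(self.conv(paired))

# Example: BB features (R = 1600) for cyclic pentapeptides (C = 5), assuming a
# hypothetical ensemble of 100 distinct structures.
model = StrEAMMLikeCNN12(feature_dim=1600, n_residues=5, n_structures=100)
populations = model(torch.randn(8, 1600, 5))         # (8, 100); each row sums to 1
```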
For each feature type, we performed hyperparameter tuning with 3-fold cross-validation for up to 2000 epochs, using an initial grid search in which the learning rates tested were 1 × 10–3, 1 × 10–4, and 1 × 10–5. The grid search also included the number of filters (128, 256, 512, and 1024) and the number of nodes (256, 512, and 1024). The best-performing models were determined based on the best average mean squared error (across three folds) for the validation data sets, which include sequences containing the same 15-aa library used for the training data sets. We also extended the initial grid when the best model had its hyperparameter values at any of the initial grid’s extrema. The optimal epoch was selected using generalized error and percent change calculations as overfitting criteria (Figure S7). (48) All model training was performed using the Python package PyTorch (version 1.9.062). (49)
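For concreteness, the tuning loop could be organized as in the sketch below, where `train_and_validate` is a hypothetical helper that trains one model on two folds with the given hyperparameters and returns the validation mean squared error on the held-out fold.

```python
# A sketch of the 3-fold cross-validated grid search described above.
import itertools
import numpy as np
from sklearn.model_selection import KFold

learning_rates = [1e-3, 1e-4, 1e-5]
n_filters_grid = [128, 256, 512, 1024]
n_nodes_grid = [256, 512, 1024]

def grid_search(sequences, populations, train_and_validate, n_splits=3):
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    results = {}
    for lr, n_filters, n_nodes in itertools.product(
            learning_rates, n_filters_grid, n_nodes_grid):
        fold_mses = [
            train_and_validate(sequences, populations, train_idx, val_idx,
                               lr=lr, n_filters=n_filters, n_nodes=n_nodes)
            for train_idx, val_idx in kfold.split(sequences)
        ]
        results[(lr, n_filters, n_nodes)] = np.mean(fold_mses)   # average validation MSE
    best = min(results, key=results.get)
    return best, results
```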

Results and Discussion


Application of FPs, MACCS Keys, and Our New Features for Cyclic Peptide Structural Ensemble Prediction: 15-aa Training Data Set, 15-aa Test Data Set

To evaluate the generalizability of these new features, we apply them to predict cyclic peptide structural ensembles. We had previously developed the StrEAMM method, which enabled the prediction of cyclic peptide structural ensembles by using ML models trained on MD simulation results. (5) In the previous work, we simulated 705 cyclic pentapeptides containing a 15-aa library (Figure 2, black box) to generate a training data set. (5) Then, we trained linear regression and neural networks to predict the structural ensembles for a test data set containing 50 cyclic pentapeptides using the same 15-aa library as the training data set. (5,6) The StrEAMM models achieved good performance on this test set containing the 15-aa library. For example, the StrEAMM convolutional neural network (CNN) models, which used Morgan fingerprint (FP) features, reported an average R2 of 0.963 (Figure 3A, “15 AAs,” FP; training and validation results shown in Figure S8). The average performance on the test data set is calculated by performing 3-fold cross-validation and using each of the resulting three models to make predictions for the test data set.

Figure 3

Figure 3. Performance of different amino acid features on different cyclic pentapeptide test data sets. (A) The models are trained using 3-fold cross-validation. The table reports the average R2 (coefficient of determination) and standard deviation across the 3 folds. (B) The performance for one of the three models from the 3-fold cross-validation is plotted. The predicted population of each structure in the cyclic peptides’ structural ensembles is compared to its population observed in MD simulations. For clarity, only structures with either a predicted or observed (in MD) percent population of >1% are plotted for the cyclic peptides in the test data sets.

Herein, we train StrEAMM CNN models using features other than Morgan FPs, like MACCS keys and our new features. When we compare the models’ performances on the test set from the 15-aa library, we observe similar performances across all the features evaluated (Figure 3A, “15 AAs”). The average R2 for the models trained using MACCS keys, position-aware side chain fingerprints (PASC), backbone (BB), and voxel (VOX) features are 0.963, 0.963, 0.961, and 0.950, respectively. Parity plots show that the predicted populations of the test cyclic peptides are close to the populations observed in our MD simulation results (Figure 3B, top row). Statistical testing shows that there is no significant difference between the performance of all five features (Figure S9A, “15 AAs”).
In addition to developing models to predict cyclic pentapeptide structural ensembles, we also apply our StrEAMM method to cyclic hexapeptides. We have a training data set of 705 cyclic hexapeptides containing the 15-aa library and a test data set of 49 cyclic hexapeptides containing the same 15-aa library. Models trained using Morgan FPs and MACCS keys report average R2 values of 0.898 and 0.877 on the test data set, respectively (Figure 4A, “15 AAs,” FP and MACCS; training and validation results shown in Figure S10). Consistent with the cyclic pentapeptide model performances, our new PASC, BB, and VOX features perform similarly well, with average R2 values of 0.877, 0.892, and 0.847, respectively (Figure 4A, “15 AAs,” PASC, BB, and VOX; Figure 4B, top row; see Figure S9B “15 AAs” for statistical testing results).

Figure 4

Figure 4. Performance of different amino acid features on different cyclic hexapeptide test data sets. (A) The models are trained using 3-fold cross-validation. The table reports the average R2 (coefficient of determination) and standard deviation across the 3 folds. (B) The performance for one of the three models from the 3-fold cross-validation is plotted. The predicted population of each structure in the cyclic peptides’ structural ensembles is compared to its population observed in MD simulations. For clarity, only structures with either a predicted or observed (in MD) percent population of >1% are plotted for the cyclic peptides in the test data sets.

We further evaluate the models’ performance using weighted errors (WEs) as a metric. The general conclusions remain the same, i.e., all five features perform similarly for cyclic pentapeptides and cyclic hexapeptides (Figures S11 and S12, “15 AAs”).

Application of FPs, MACCS Keys, and Our New Features on Extended Test Data Sets: 15-aa Training Data Set, 37-aa and 50-aa Test Data Sets

In this present work, we expand the test data sets to include cyclic pentapeptides and cyclic hexapeptides containing a 37-aa library (Figure 2, blue box). While the 15-aa library is a representative subset of the canonical amino acids and their mirror images, the 37-aa library contains all the canonical amino acids (except for proline) and their mirror images. To assess the generalizability of the amino acid features, StrEAMM cyclic pentapeptide models trained on the data set generated from the 15-aa library are used to predict a test data set containing the 37-aa library. The models using the Morgan FPs and MACCS keys report poor performances of R2 of 0.364 and 0.531, respectively (Figure 3A, “37 AAs,” FP and MACCS). The models using the new PASC and VOX features also report poor performances, with R2 values of 0.352 and 0.406, respectively (Figure 3A, “37 AAs,” PASC and VOX). However, when using the BB features, the model performance increases to an average R2 of 0.735 (Figure 3A, “37 AAs,” BB). The BB features also outperform the other features in the StrEAMM cyclic hexapeptide models, reporting an average R2 of 0.683 (Figure 4A, “37 AAs,” BB). Statistical testing shows that the performance of the BB feature is statistically different from (better than) all the other features for the 37-aa tests (Figure S9, “37 AAs”).
We also test the generalizability of our new features to include noncanonical amino acids (ncAAs) by constructing a 50-aa library (Figure 2, purple box). The ncAAs selected in this study are the L- and D-form of p-chloro-phenylalanine (PCF), 4-carbamoyl-phenylalanine (F4C), biphenylalanine (BFA), β-(2-naphthyl)-alanine (NAL), homoserine (HSE), γ-methyl-leucine (GML), as well as 2-aminoisobutyric acid (AIB). The first six ncAAs are chosen because they have been included in peptide design schemes; when screening for peptides that have the potential to target some proteins of interest, these ncAAs were present in hits or optimized versions of hits. (50−52) AIB is chosen because of its achirality and ability to stabilize α-helices. (53,54) Using the same StrEAMM cyclic pentapeptide models trained on the data set generated from the 15-aa library, we assess the performance of the different amino acid features on a test data set generated from the 50-aa library. Similar to the models’ results on the test data set generated from the 37-aa library, we observe that the models using Morgan FPs and MACCS keys report poor performances on the test data set generated from the 50-aa library, with average R2 of 0.430 and 0.508, respectively (Figure 3A, “50 AAs,” FP and MACCS). The decreases in model performance on the extended test data sets using the FP-type encodings could be because there are bits in the bit-vectors of the new amino acids that are absent in the bits from the amino acids in the training data set. For example, for the Morgan FPs, while there are 97 unique bits present from all the amino acids in the 15-aa library (considering a 2048-bit vector with radius = 3), there are 247 unique bits present in the 37-aa library and 305 unique bits present in the 50-aa library. Therefore, there are 247 – 97 = 150 bits in the 37-aa library and 305 – 97 = 208 bits in the 50-aa library that are not present in the original training data set containing the 15-aa library.
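The bit-coverage comparison above can be reproduced with a short script like the sketch below; the SMILES lists for the amino acid libraries are assumed inputs.

```python
# A sketch of counting unique Morgan fingerprint bits (2048 bits, radius 3)
# set across an amino acid library, and the bits in an expanded library that
# never appear in the training library.
from rdkit import Chem
from rdkit.Chem import AllChem

def unique_bits(smiles_list, n_bits=2048, radius=3):
    bits = set()
    for smiles in smiles_list:
        fp = AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(smiles), radius=radius, nBits=n_bits)
        bits.update(fp.GetOnBits())
    return bits

# train_bits = unique_bits(library_15_smiles)   # 97 unique bits reported above
# test_bits = unique_bits(library_50_smiles)    # 305 unique bits reported above
# unseen = test_bits - train_bits               # 208 bits absent from training
```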
On the other hand, our BB features outperform all the other features on the test data set generated from the 50-aa library, with an average R2 of 0.700 (Figure 3A, “50 AAs,” BB). The BB features also outperform the other amino acid features on the 50-aa cyclic hexapeptide data set, reporting R2 = 0.592 (Figure 4A, “50 AAs”). Statistical testing shows that the performance of the BB feature is statistically different from (better than) all the other features for the 50-aa tests (Figure S9, “50 AAs”). We further evaluated the models’ performance using weighted errors (WEs) as a metric. The BB features outperform the other features for cyclic pentapeptides (Figure S11, “37 AAs” and “50 AAs”). For cyclic hexapeptides, both the BB and VOX features perform similarly well (Figure S12, “37 AAs” and “50 AAs”).
It is noted that the improved performance on the 37-aa and 50-aa test data sets when using the BB features is not as pronounced for the cyclic hexapeptide models as for the cyclic pentapeptide models. We hypothesize that, for cyclic hexapeptides, side-chain-related information such as that captured by the voxel (VOX) features may be almost as useful as backbone-based features, whereas for cyclic pentapeptides the BB features are particularly informative for the models’ structural ensemble predictions.

Evaluation of the StrEAMM Model Performances on the Extended Test Data Sets Using Combinations of Features

The amino acid features included in our comparisons represent different aspects of amino acids; for example, the BB features aim to capture the (ϕ, ψ) preferences for each amino acid, while the FP-type encodings capture the presence and absence of specific chemical moieties. Both types of information may be useful in the prediction tasks, and we evaluate the performance of the StrEAMM models using different pairs of features.
First, for the 15-aa test data set, we observe that the cyclic pentapeptide models using different feature-pair combinations perform similarly to those using only a single feature. For example, while the cyclic pentapeptide model using only BB features has an average R2 of 0.961, the models using BB+VOX, BB+PASC, BB+MACCS keys, and BB+FP have R2 of 0.961, 0.961, 0.962, and 0.963 respectively (Figure 5, top left; training and validation results shown in Figure S13). Similarly, the cyclic hexapeptide models using feature-pair combinations perform as well as those using only a single feature. For example, the model using only BB features has an average R2 of 0.892, and the models using BB+VOX, BB+PASC, BB+MACCS keys, and BB+FP have an average R2 of 0.898, 0.894, 0.900, and 0.899, respectively (Figure 5, bottom left).

Figure 5

Figure 5. Performance of different combinations of amino acid features on cyclic pentapeptide (top) and hexapeptide (bottom) test data sets containing 15 AAs (left), 37 AAs (middle), and 50 AAs (right). The models are trained using 3-fold cross-validation, and the average R2 and standard deviation are reported. The models using a single type of feature (e.g., “BB only”) are represented on the diagonals. The best-performing models for the 37 AA and 50 AA test data sets, based on the average R2, are boxed with bold black outlines.

We then use the feature pairs to predict the more challenging test data sets generated from the 37-aa and 50-aa libraries. For the cyclic pentapeptide models predicting the 37-aa test data set, the best-performing model based on the average R2 uses BB+VOX features, reporting an average R2 of 0.754 (Figure 5, top middle). This result is an improvement over the best single-feature model, BB-only, with an average R2 of 0.735 (Figure 5, top middle). For the cyclic pentapeptide models predicting the 50-aa test data set, the best-performing model is also a combination of features: the model using BB+MACCS keys reports an average R2 of 0.742, which is better than the best single-feature model, BB-only, with an average R2 of 0.700 (Figure 5, top right).
However, for the cyclic hexapeptide models, the feature-pair combination models do not necessarily perform better than the single-feature models. For the cyclic hexapeptide models predicting the 37-aa test data set, the overall best-performing model is the BB-only model with an average R2 of 0.683, while the best-performing feature-pair model is BB+MACCS with an average R2 of 0.666 (Figure 5, bottom middle). For the cyclic hexapeptide models predicting the 50-aa test data set, the overall best-performing model is the BB-only model with an average R2 of 0.592, while the best-performing feature-pair model is BB+VOX with an average R2 of 0.586 (Figure 5, bottom right). However, the performance of the BB-only model is within the standard deviation of the BB+VOX model. One possible reason why the models using feature-pair combinations are not better than those using single features when predicting the 37-aa and 50-aa test data sets is that we select the best hyperparameters based on the validation data sets (which are composed of amino acids from the 15-aa library), so the models may be biased toward, or overfit to, the 15-aa library containing only a subset of the canonical amino acids. Hence, the assumption that hyperparameters optimized on the 15-aa validation data set are also optimal for the 37-aa and 50-aa libraries may not hold, and we are interested in exploring the regularization of these models in the future. We also plan to perform, for example, SHAP analysis (55) to understand which specific bits in the single and combined features contribute the most to the model’s predictions.
In principle, the StrEAMM platform can be applied to predict the structural ensembles of linear peptides and stapled peptides as well. The new BB features are readily applicable to those cases and to any ML models for peptides and proteins in general.

Conclusions


We develop new amino acid features with the intention of improving the generalizability of ML models without needing to pay the additional cost of collecting more training data and retraining models. Toward this end, we create a position-aware side chain (PASC) fingerprint feature and features based on MD simulation results from amino acid dipeptides. We demonstrate that models implementing the commonly used features, namely, Morgan fingerprints (FPs) and MACCS keys, can accurately predict the structural ensembles of cyclic pentapeptides and cyclic hexapeptides when the test data set is generated from the same 15-aa library as the training data set. However, we show that the performance of these Morgan FPs and MACCS keys-based models decreases significantly when tasked to predict the structural ensembles of cyclic pentapeptides and cyclic hexapeptides containing amino acids that were not originally present in the training data set (i.e., cyclic peptides generated from the 37-aa and 50-aa libraries).
The new amino acid backbone (BB) features consistently outperform Morgan FPs and MACCS keys when applied to the test data sets from the expanded 37-aa and 50-aa libraries, highlighting the ability of the BB features to improve the extendibility of the models. For example, when compared to the models using Morgan FPs and MACCS keys to predict the structural ensembles of the test cyclic pentapeptides from a 37-aa library, our BB features outperform the other features with an average R2 of 0.735, compared to the average R2 of 0.364 and 0.531 for Morgan FPs and MACCS keys, respectively. This trend is also observed when predicting the structural ensembles of the test cyclic pentapeptides from a 50-aa library containing noncanonical amino acids, with our BB features reporting an average R2 of 0.700, compared to the average R2 of 0.430 and 0.508 for Morgan FPs and MACCS keys, respectively. The improved performances from the models using our BB features demonstrate both the utility of the features to help the extendibility of the models, as well as the advantage of leveraging information from MD simulation data. Specifically, the (ϕ, ψ) preferences of the amino acids are important for the cyclic peptide structural ensemble prediction task, and they can now be gauged using the dipeptide simulations and represented as features for the amino acids.
Overall, the promising results shown herein using expanded test data sets and new features highlight that our BB features improve the generalizability of the StrEAMM models for cyclic peptide structure predictions. We envision that these features could also be applied to and improve the generalizability of other ML models for peptides and proteins. The BB features only require dipeptide simulations of the amino acids to generate, and such MD simulation data provide valuable information about the amino acids’ backbone structural preferences. ML models trained using BB features are thus more transferable to new amino acids. If one is interested in making predictions for new amino acids not seen in the training data set, the cost is a mere dipeptide simulation, which is presumably minimal compared to the cost of collecting new training data that include the new amino acid and retraining the model. Therefore, the new BB features offer an attractive, cost-effective way to improve model extendibility.

Data Availability


The new features that were used to train the models in this study can be found in the public GitHub repository, https://github.com/thui16/MD_derived_features_for_aas/. The StrEAMM models and MD simulation data used to train these models are under a patent application (please see Competing Interests section) and are not publicly available.

Supporting Information


The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.4c02102.

  • Example of a (ϕ, ψ) distribution binned using various grid sizes; comparison of backbone (BB) features of amino acids in the different amino acid libraries; comparison of the different normalization schemes applied to the BB and voxel (VOX) features; examples of learning curves from hyperparameter tuning; model performances (reporting R2) on the training and validation data sets for the cyclic pentapeptides; p-values from t-tests comparing different features; model performances (reporting R2) on the training and validation data sets for the cyclic hexapeptides; model performances (reporting weighted error, WE) on the various test data sets for the cyclic pentapeptides and cyclic hexapeptides; model performances (reporting R2) using combinations of features on the training and validation data sets for the cyclic pentapeptides and cyclic hexapeptides (PDF)


Author Information


  • Corresponding Author
  • Authors
    • Tiffani Hui - Department of Chemistry, Tufts University, Medford, Massachusetts 02155, United States; https://orcid.org/0000-0002-1355-389X
    • Maxim Secor - Department of Chemistry, Tufts University, Medford, Massachusetts 02155, United States
    • Minh Ngoc Ho - Department of Chemistry, Tufts University, Medford, Massachusetts 02155, United States; https://orcid.org/0009-0000-8054-0362
    • Nomindari Bayaraa - Department of Chemistry, Tufts University, Medford, Massachusetts 02155, United States
  • Author Contributions

    T.H., M.S., and M.N.H. contributed equally to this work. T.H. contributed to methodology, model training, analysis, and writing the manuscript. M.S. contributed to methodology, analysis, and writing the manuscript. M.N.H. contributed to data collection, analysis, and writing the manuscript. N.B. contributed to data collection and editing the manuscript.

  • Notes
    The authors declare the following competing financial interest(s): A patent application (PCT/US2022/072941, “Cyclic peptide structure prediction via structural ensembles achieved by molecular dynamics and machine learning”) was filed on 2022/6/14. Y.-S.L. is on the scientific advisory board for Zonsen PepLib Biotech.

Acknowledgments


This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM124160 (PI: Y.-S.L.). We are grateful for the support from the Tufts Technology Services and for the computing resources at the Tufts Research Cluster. Initial structures for the simulations were built using UCSF Chimera, developed by the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco, with support from NIH Grant P41-GM103311.

References


This article references 55 other publications.

  1. 1
    Wang, L.; Wang, N.; Zhang, W.; Cheng, X.; Yan, Z.; Shao, G.; Wang, X.; Wang, R.; Fu, C. Therapeutic peptides: current applications and future directions. Signal Transduction Targeted Ther. 2022, 7, 48  DOI: 10.1038/s41392-022-00904-4
  2. 2
    Liu, K.; Li, M.; Li, Y.; Li, Y.; Chen, Z.; Tang, Y.; Yang, M.; Deng, G.; Liu, H. A review of the clinical efficacy of FDA-approved antibody–drug conjugates in human cancers. Mol. Cancer 2024, 23, 62  DOI: 10.1186/s12943-024-01963-7
  3. 3
    Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; Bridgland, A.; Meyer, C.; Kohl, S. A. A.; Ballard, A. J.; Cowie, A.; Romera-Paredes, B.; Nikolov, S.; Jain, R.; Adler, J.; Back, T.; Petersen, S.; Reiman, D.; Clancy, E.; Zielinski, M.; Steinegger, M.; Pacholska, M.; Berghammer, T.; Bodenstein, S.; Silver, D.; Vinyals, O.; Senior, A. W.; Kavukcuoglu, K.; Kohli, P.; Hassabis, D. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583589,  DOI: 10.1038/s41586-021-03819-2
  4. 4
    Baek, M.; DiMaio, F.; Anishchenko, I.; Dauparas, J.; Ovchinnikov, S.; Lee, G. R.; Wang, J.; Cong, Q.; Kinch, L. N.; Schaeffer, R. D.; Millán, C.; Park, H.; Adams, C.; Glassman, C. R.; DeGiovanni, A.; Pereira, J. H.; Rodrigues, A. V.; van Dijk, A. A.; Ebrecht, A. C.; Opperman, D. J.; Sagmeister, T.; Buhlheller, C.; Pavkov-Keller, T.; Rathinaswamy, M. K.; Dalwadi, U.; Yip, C. K.; Burke, J. E.; Garcia, K. C.; Grishin, N. V.; Adams, P. D.; Read, R. J.; Baker, D. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021, 373, 871876,  DOI: 10.1126/science.abj8754
  5. 5
    Miao, J.; Descoteaux, M. L.; Lin, Y.-S. Structure prediction of cyclic peptides by molecular dynamics + machine learning. Chem. Sci. 2021, 12, 14927–14936,  DOI: 10.1039/D1SC05562C
  6. 6
    Hui, T.; Descoteaux, M. L.; Miao, J.; Lin, Y.-S. Training neural network models using molecular dynamics simulation results to efficiently predict cyclic hexapeptide structural ensembles. J. Chem. Theory Comput. 2023, 19, 4757–4769,  DOI: 10.1021/acs.jctc.3c00154
  7. 7
    Wan, F.; Kontogiorgos-Heintz, D.; de la Fuente-Nunez, C. Deep generative models for peptide design. Digital Discovery 2022, 1, 195–208,  DOI: 10.1039/D1DD00024A
  8. 8
    Ferguson, A. L.; Ranganathan, R. 100th anniversary of macromolecular science viewpoint: data-driven protein design. ACS Macro Lett. 2021, 10, 327–340,  DOI: 10.1021/acsmacrolett.0c00885
  9. 9
    Strokach, A.; Kim, P. M. Deep generative modeling for protein design. Curr. Opin. Struct. Biol. 2022, 72, 226–236,  DOI: 10.1016/j.sbi.2021.11.008
  10. 10
    Chandra, A.; Tünnermann, L.; Löfstedt, T.; Gratz, R. Transformer-based deep learning for predicting protein properties in the life sciences. eLife 2023, 12, e82819  DOI: 10.7554/eLife.82819
  11. 11
    Oliva, R.; Chino, M.; Pane, K.; Pistorio, V.; De Santis, A.; Pizzo, E.; D’Errico, G.; Pavone, V.; Lombardi, A.; Del Vecchio, P.; Notomista, E.; Nastri, F.; Petraccone, L. Exploring the role of unnatural amino acids in antimicrobial peptides. Sci. Rep. 2018, 8, 8888  DOI: 10.1038/s41598-018-27231-5
  12. 12
    Lu, J.; Xu, H.; Xia, J.; Ma, J.; Xu, J.; Li, Y.; Feng, J. D- and unnatural amino acid substituted antimicrobial peptides with improved proteolytic resistance and their proteolytic degradation characteristics. Front. Microbiol. 2020, 11, 563030  DOI: 10.3389/fmicb.2020.563030
  13. 13
    Taechalertpaisarn, J.; Ono, S.; Okada, O.; Johnstone, T. C.; Lokey, R. S. A new amino acid for improving permeability and solubility in macrocyclic peptides through side chain-to-backbone hydrogen bonding. J. Med. Chem. 2022, 65, 5072–5084,  DOI: 10.1021/acs.jmedchem.2c00010
  14. 14
    Geurink, P. P.; van der Linden, W. A.; Mirabella, A. C.; Gallastegui, N.; de Bruin, G.; Blom, A. E.; Voges, M. J.; Mock, E. D.; Florea, B. I.; van der Marel, G. A.; Driessen, C.; van der Stelt, M.; Groll, M.; Overkleeft, H. S.; Kisselev, A. F. Incorporation of non-natural amino acids improves cell permeability and potency of specific inhibitors of proteasome trypsin-like sites. J. Med. Chem. 2013, 56, 1262–1275,  DOI: 10.1021/jm3016987
  15. 15
    Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36,  DOI: 10.1021/ci00057a005
  16. 16
    Siani, M. A.; Weininger, D.; Blaney, J. M. CHUCKLES: A method for representing and searching peptide and peptoid sequences on both monomer and atomic levels. J. Chem. Inf. Comput. Sci. 1994, 34, 588–593,  DOI: 10.1021/ci00019a017
  17. 17
    Siani, M. A.; Weininger, D.; James, C. A.; Blaney, J. M. CHORTLES: A method for representing oligomeric and template-based mixtures. J. Chem. Inf. Comput. Sci. 1995, 35, 1026–1033,  DOI: 10.1021/ci00028a012
  18. 18
    Jensen, J. H.; Hoeg-Jensen, T.; Padkjær, S. B. Building a biochemformatics database. J. Chem. Inf. Model. 2008, 48, 2404–2413,  DOI: 10.1021/ci800128b
  19. 19
    Lin, T.-S.; Coley, C. W.; Mochigase, H.; Beech, H. K.; Wang, W.; Wang, Z.; Woods, E.; Craig, S. L.; Johnson, J. A.; Kalow, J. A.; Jensen, K. F.; Olsen, B. D. BigSMILES: A structurally-based line notation for describing macromolecules. ACS Cent. Sci. 2019, 5, 1523–1531,  DOI: 10.1021/acscentsci.9b00476
  20. 20
    Zhang, T.; Li, H.; Xi, H.; Stanton, R. V.; Rotstein, S. H. HELM: A hierarchical notation language for complex biomolecule structure representation. J. Chem. Inf. Model. 2012, 52, 2796–2806,  DOI: 10.1021/ci3001925
  21. 21
    David, L.; Thakkar, A.; Mercado, R.; Engkvist, O. Molecular representations in AI-driven drug discovery: a review and practical guide. J. Cheminf. 2020, 12, 56  DOI: 10.1186/s13321-020-00460-5
  22. 22
    Nguyen-Vo, T.-H.; Teesdale-Spittle, P.; Harvey, J. E.; Nguyen, B. P. Molecular representations in bio-cheminformatics. Memetic Comput. 2024, 16, 519–536,  DOI: 10.1007/s12293-024-00414-6
  23. 23
    Wigh, D. S.; Goodman, J. M.; Lapkin, A. A. A review of molecular representation in the age of machine learning. WIREs Comput. Mol. Sci. 2022, 12, e1603  DOI: 10.1002/wcms.1603
  24. 24
    Sousa, T.; Correia, J.; Pereira, V.; Rocha, M. Generative deep learning for targeted compound design. J. Chem. Inf. Model. 2021, 61, 5343–5361,  DOI: 10.1021/acs.jcim.0c01496
  25. 25
    Morgan, H. L. The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service. J. Chem. Doc. 1965, 5, 107–113,  DOI: 10.1021/c160017a018
  26. 26
    Durant, J. L.; Leland, B. A.; Henry, D. R.; Nourse, J. G. Reoptimization of MDL Keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273–1280,  DOI: 10.1021/ci010132r
  27. 27
    Schissel, C. K.; Mohapatra, S.; Wolfe, J. M.; Fadzen, C. M.; Bellovoda, K.; Wu, C.-L.; Wood, J. A.; Malmberg, A. B.; Loas, A.; Gómez-Bombarelli, R.; Pentelute, B. L. Deep learning to design nuclear-targeting abiotic miniproteins. Nat. Chem. 2021, 13, 992–1000,  DOI: 10.1038/s41557-021-00766-3
  28. 28
    Ji, X.; Nielsen, A. L.; Heinis, C. Cyclic peptides for drug development. Angew. Chem., Int. Ed. 2024, 63, e202308251  DOI: 10.1002/anie.202308251
  29. 29
    Costa, L.; Sousa, E.; Fernandes, C. Cyclic peptides in pipeline: what future for these great molecules?. Pharmaceuticals 2023, 16, 996  DOI: 10.3390/ph16070996
  30. 30
    Zhang, H.; Chen, S. Cyclic peptide drugs approved in the last two decades (2001–2021). RSC Chem. Biol. 2022, 3, 18–31,  DOI: 10.1039/D1CB00154J
  31. 31
    Landrum, G. RDKit: Open-Source Cheminformatics Software. https://www.rdkit.org/.
  32. 32
    The PyMOL Molecular Graphics System, version 3.0; Schrödinger, LLC.
  33. 33
    Case, D. A.; Aktulga, H. M. A.; Belfon, K.; Ben-Shalom, I. Y.; Berryman, J. T.; Brozell, S. R.; Cerutti, D. S.; Cheatham, T. E., III; Cisneros, G. A.; Cruzeiro, V. W. D.; Darden, T. A.; Duke, R. E.; Giambasu, G.; Gilson, M. K.; Gohlke, H.; Goetz, A. W.; Harris, R.; Izadi, S.; Izmailov, S. A.; Kasavajhala, K.; Kaymak, M. C.; King, E.; Kovalenko, A.; Kurtzman, T.; Lee, T. S.; LeGrand, S.; Li, P.; Lin, C.; Liu, J.; Luchko, T.; Luo, R.; Machado, M.; Man, V.; Manathunga, M.; Merz, K. M.; Miao, Y.; Mikhailovskii, O.; Monard, G.; Nguyen, H.; O’Hearn, K. A.; Onufriev, A.; Pan, F.; Pantano, S.; Qi, R.; Rahnamoun, A.; Roe, D. R.; Roitberg, A.; Sagui, C.; Schott-Verdugo, S.; Shajan, A.; Shen, J.; Simmerling, C. L.; Skrynnikov, N. R.; Smith, J.; Swails, J.; Walker, R. C.; Wang, J.; Wang, J.; Wei, H.; Wolf, R. M.; Wu, X.; Xiong, Y.; Xue, Y.; York, D. M.; Zhao, S.; Kollman, P. A. Amber; University of California: San Francisco, 2022.
  34. 34
    Jakalian, A.; Jack, D. B.; Bayly, C. I. Fast, efficient generation of high-quality atomic charges. AM1-BCC model: II. Parameterization and validation. J. Comput. Chem. 2002, 23, 1623–1641,  DOI: 10.1002/jcc.10128
  35. 35
    Bayly, C. I.; Cieplak, P.; Cornell, W.; Kollman, P. A. A well-behaved electrostatic potential based method using charge restraints for deriving atomic charges: the RESP model. J. Phys. Chem. A 1993, 97, 10269–10280,  DOI: 10.1021/j100142a004
  36. 36
    Hornak, V.; Abel, R.; Okur, A.; Strockbine, B.; Roitberg, A.; Simmerling, C. Comparison of multiple amber force fields and development of improved protein backbone parameters. Proteins 2006, 65, 712–725,  DOI: 10.1002/prot.21123
  37. 37
    Wang, J.; Wolf, R. M.; Caldwell, J. W.; Kollman, P. A.; Case, D. A. Development and testing of a general amber force field. J. Comput. Chem. 2004, 25, 1157–1174,  DOI: 10.1002/jcc.20035
  38. 38
    Zhou, C.-Y.; Jiang, F.; Wu, Y.-D. Residue-specific force field based on protein coil library. RSFF2: modification of AMBER ff99SB. J. Phys. Chem. B 2015, 119, 1035–1047,  DOI: 10.1021/jp5064676
  39. 39
    Abraham, M. J.; Murtola, T.; Schulz, R.; Páll, S.; Smith, J. C.; Hess, B.; Lindahl, E. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 2015, 1–2, 19–25,  DOI: 10.1016/j.softx.2015.06.001
  40. 40
    Jorgensen, W. L.; Chandrasekhar, J.; Madura, J. D.; Impey, R. W.; Klein, M. L. Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 1983, 79, 926–935,  DOI: 10.1063/1.445869
  41. 41
    Essmann, U.; Perera, L.; Berkowitz, M. L.; Darden, T.; Lee, H.; Pedersen, L. G. A smooth particle mesh Ewald method. J. Chem. Phys. 1995, 103, 8577–8593,  DOI: 10.1063/1.470117
  42. 42
    Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Scalmani, G.; Barone, V.; Petersson, G. A.; Nakatsuji, H.; Li, X.; Caricato, M.; Marenich, A. V.; Bloino, J.; Janesko, B. G.; Gomperts, R.; Mennucci, B.; Hratchian, H. P.; Ortiz, J. V.; Izmaylov, A. F.; Sonnenberg, J. L.; Williams; ; Ding, F.; Lipparini, F.; Egidi, F.; Goings, J.; Peng, B.; Petrone, A.; Henderson, T.; Ranasinghe, D.; Zakrzewski, V. G.; Gao, J.; Rega, N.; Zheng, G.; Liang, W.; Hada, M.; Ehara, M.; Toyota, K.; Fukuda, R.; Hasegawa, J.; Ishida, M.; Nakajima, T.; Honda, Y.; Kitao, O.; Nakai, H.; Vreven, T.; Throssell, K.; Montgomery, J. A., Jr.; Peralta, J. E.; Ogliaro, F.; Bearpark, M. J.; Heyd, J. J.; Brothers, E. N.; Kudin, K. N.; Staroverov, V. N.; Keith, T. A.; Kobayashi, R.; Normand, J.; Raghavachari, K.; Rendell, A. P.; Burant, J. C.; Iyengar, S. S.; Tomasi, J.; Cossi, M.; Millam, J. M.; Klene, M.; Adamo, C.; Cammi, R.; Ochterski, J. W.; Martin, R. L.; Morokuma, K.; Farkas, O.; Foresman, J. B.; Fox, D. J. Gaussian 16, rev. C.01; Gaussian Inc.: Wallingford, CT, 2016.
  43. 43
    Vanquelef, E.; Simon, S.; Marquant, G.; Garcia, E.; Klimerak, G.; Delepine, J. C.; Cieplak, P.; Dupradeau, F. Y. R.E.D. Server: a web service for deriving RESP and ESP charges and building force field libraries for new molecules and molecular fragments. Nucleic Acids Res. 2011, 39, W511–W517,  DOI: 10.1093/nar/gkr288
  44. 44
    Piana, S.; Laio, A. A bias-exchange approach to protein folding. J. Phys. Chem. B 2007, 111, 4553–4559,  DOI: 10.1021/jp067873l
  45. 45
    Tribello, G. A.; Bonomi, M.; Branduardi, D.; Camilloni, C.; Bussi, G. PLUMED 2: New feathers for an old bird. Comput. Phys. Commun. 2014, 185, 604–613,  DOI: 10.1016/j.cpc.2013.09.018
  46. 46
    Damas, J. M.; Filipe, L. C.; Campos, S. R.; Lousa, D.; Victor, B. L.; Baptista, A. M.; Soares, C. M. Predicting the thermodynamics and kinetics of helix formation in a cyclic peptide model. J. Chem. Theory Comput. 2013, 9, 5148–5157,  DOI: 10.1021/ct400529k
  47. 47
    Nair, V.; Hinton, G. E. Rectified Linear Units Improve Restricted Boltzmann Machines, Proceedings of the 27th International Conference on Machine Learning (ICML’10), Haifa, Israel, June 21–24, Fürnkranz, J.; Joachims, T., Eds.; Omnipress: Madison, WI, 2010; pp 807–814.
  48. 48
    Prechelt, L. Automatic early stopping using cross validation: quantifying the criteria. Neural Networks 1998, 11, 761–767,  DOI: 10.1016/S0893-6080(98)00010-0
  49. 49
    Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; Chintala, S. PyTorch: An Imperative Style, High-Performance Deep Learning Library, Advances in Neural Information Processing Systems 32 (NeurIPS 2019); Wallach, H.; Larochelle, H.; Beygelzimer, A.; d’Alché-Buc, F.; Fox, E.; Garnett, R., Eds.; Curran Associates: Red Hook, NY, 2019; pp 8024–8035.
  50. 50
    Hayashi, K.; Uehara, S.; Yamamoto, S.; Cary, D. R.; Nishikawa, J.; Ueda, T.; Ozasa, H.; Mihara, K.; Yoshimura, N.; Kawai, T.; Ono, T.; Yamamoto, S.; Fumoto, M.; Mikamiyama, H. Macrocyclic peptides as a novel class of NNMT inhibitors: A SAR study aimed at inhibitory activity in the cell. ACS Med. Chem. Lett. 2021, 12, 1093–1101,  DOI: 10.1021/acsmedchemlett.1c00134
  51. 51
    Brousseau, M. E.; Clairmont, K. B.; Spraggon, G.; Flyer, A. N.; Golosov, A. A.; Grosche, P.; Amin, J.; Andre, J.; Burdick, D.; Caplan, S.; Chen, G.; Chopra, R.; Ames, L.; Dubiel, D.; Fan, L.; Gattlen, R.; Kelly-Sullivan, D.; Koch, A. W.; Lewis, I.; Li, J.; Liu, E.; Lubicka, D.; Marzinzik, A.; Nakajima, K.; Nettleton, D.; Ottl, J.; Pan, M.; Patel, T.; Perry, L.; Pickett, S.; Poirier, J.; Reid, P. C.; Pelle, X.; Seepersaud, M.; Subramanian, V.; Vera, V.; Xu, M.; Yang, L.; Yang, Q.; Yu, J.; Zhu, G.; Monovich, L. G. Identification of a PCSK9-LDLR disruptor peptide with in vivo function. Cell Chem. Biol. 2022, 29, 249–258.e5,  DOI: 10.1016/j.chembiol.2021.08.012
  52. 52
    Yoshida, S.; Uehara, S.; Kondo, N.; Takahashi, Y.; Yamamoto, S.; Kameda, A.; Kawagoe, S.; Inoue, N.; Yamada, M.; Yoshimura, N.; Tachibana, Y. Peptide-to-small molecule: a pharmacophore-guided small molecule lead generation strategy from high-affinity macrocyclic peptides. J. Med. Chem. 2022, 65, 10655–10673,  DOI: 10.1021/acs.jmedchem.2c00919
  53. 53
    Banerjee, R.; Basu, G.; Chène, P.; Roy, S. Aib-based peptide backbone as scaffolds for helical peptide mimics. J. Pept. Res. 2002, 60, 88–94,  DOI: 10.1034/j.1399-3011.2002.201005.x
  54. 54
    Karle, I. L. Controls exerted by the Aib residue: helix formation and helix reversal. Pept. Sci. 2001, 60, 351–365,  DOI: 10.1002/1097-0282(2001)60:5<351::AID-BIP10174>3.0.CO;2-U
  55. 55
    Lundberg, S. M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions; Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, December 4–9, 2017; Von Luxburg, U.; Guyon, I.; Bengio, S.; Wallach, H.; Fergus, R., Eds.; Curran Associates: Red Hook, NY, 2017; pp 4768–4777.

Cited By


This article is cited by 1 publication.

  1. Yanpeng Fang, Duoyang Fan, Bin Feng, Yingli Zhu, Ruyan Xie, Xiaorong Tan, Qianhui Liu, Jie Dong, Wenbin Zeng. Harnessing advanced computational approaches to design novel antimicrobial peptides against intracellular bacterial infections. Bioactive Materials 2025, 50, 510–524. https://doi.org/10.1016/j.bioactmat.2025.04.016


Figures

    Figure 1

    Figure 1. Overview of custom features for amino acids. (A) Position-aware side chain (PASC) fingerprints are based on a “heavy-atom walk” along the amino acid side chain. Starting at the Cα atom, a Morgan fingerprint with a radius of 1 is generated (red circle). Morgan fingerprints with a radius of 1 centered at the Cβ atom (orange circle), Cγ atom (yellow circle), etc., are similarly generated to provide the PASC features. (B) MD simulations of amino acid dipeptides are used to generate MD-derived features. For the backbone (BB) features, the (ϕ, ψ) distribution is calculated from the dipeptide simulation and binned in a 2D grid. Then, the resulting 2D probability density is flattened into a 1D vector. For the voxel (VOX) features, the simulation frames are aligned to reference coordinates for C, Cα, and N, where the Cα atom is at the origin, the N atom is at (1.449, 0.000, 0.000), and the C atom is at (−0.523, 1.429, 0.000) (in Å). Then, the frame-averaged molecular electrostatic potential is calculated on a 3D voxel grid and flattened into a 1D vector. See the Methods section for more details.
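
    As a rough illustration of the backbone (BB) featurization described in the caption above, the sketch below bins (ϕ, ψ) angles into a 2D probability density and flattens it into a 1D vector. The 24 × 24 grid, the NumPy-only implementation, and the synthetic angles are illustrative assumptions rather than the exact grid size or pipeline used in the paper.

```python
import numpy as np

def backbone_features(phi, psi, n_bins=24):
    """Bin (phi, psi) angles (degrees, in [-180, 180)) into a 2D probability
    density and flatten it into a 1D feature vector.

    The 24x24 grid (15-degree bins) is an illustrative choice; the paper and
    its Supporting Information discuss the effect of different grid sizes.
    """
    edges = np.linspace(-180.0, 180.0, n_bins + 1)
    hist, _, _ = np.histogram2d(phi, psi, bins=[edges, edges], density=True)
    return hist.flatten()

# Synthetic angles stand in for (phi, psi) values that would, in practice, be
# extracted from the amino acid dipeptide MD trajectory.
rng = np.random.default_rng(0)
phi = rng.uniform(-180.0, 180.0, size=10_000)
psi = rng.uniform(-180.0, 180.0, size=10_000)
bb = backbone_features(phi, psi)
print(bb.shape)  # (576,) for a 24x24 grid
```

    In practice, the dihedral angles would be computed from the dipeptide trajectory with a standard analysis tool, and the flattened densities for different amino acids would then serve as their BB feature vectors.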

    Figure 2

    Figure 2. Amino acids included in the training, validation, and test data sets for the StrEAMM models. The training and validation data sets include cyclic peptide sequences from a 15-amino-acid (15-aa) library (black box). The test data sets include sequences containing amino acids in the same 15-aa library, the 37-aa library (blue box), or the 50-aa library (purple box). *For brevity, only the L-forms of chiral amino acids are depicted, but their mirror images are also included in the library.

    Figure 3

    Figure 3. Performance of different amino acid features on different cyclic pentapeptide test data sets. (A) The models are trained using 3-fold cross-validation. The table reports the average R2 (coefficient of determination) and standard deviation across the 3 folds. (B) The performance of one of the three models from the 3-fold cross-validation is plotted. The predicted population of each structure in a cyclic peptide’s structural ensemble is compared to the population observed in MD simulations. For clarity, only structures with a predicted or MD-observed percent population of >1%, across all cyclic peptides in the test data sets, are plotted.
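
    As context for panel (B), the minimal sketch below scores predicted versus MD-observed percent populations with R2 and applies the >1% filter only when selecting points to plot, mirroring the caption. The synthetic arrays and the use of scikit-learn’s r2_score are assumptions for illustration; the paper’s exact evaluation code is not reproduced here.

```python
import numpy as np
from sklearn.metrics import r2_score

# Percent populations for each (cyclic peptide, structure) pair; synthetic
# placeholders stand in for model predictions and MD observations.
rng = np.random.default_rng(1)
md_pop = rng.uniform(0.0, 50.0, size=500)
pred_pop = np.clip(md_pop + rng.normal(0.0, 5.0, size=500), 0.0, 100.0)

# Overall agreement between predicted and MD-observed populations.
r2 = r2_score(md_pop, pred_pop)

# For plotting, keep only structures whose predicted or observed population
# exceeds 1%, as described in the figure caption.
mask = (pred_pop > 1.0) | (md_pop > 1.0)
print(f"R^2 = {r2:.3f}; plotted points: {mask.sum()} of {mask.size}")
```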

    Figure 4

    Figure 4. Performance of different amino acid features on different cyclic hexapeptide test data sets. (A) The models are trained using 3-fold cross-validation. The table reports the average R2 (coefficient of determination) and standard deviation across the 3 folds. (B) The performance of one of the three models from the 3-fold cross-validation is plotted. The predicted population of each structure in a cyclic peptide’s structural ensemble is compared to the population observed in MD simulations. For clarity, only structures with a predicted or MD-observed percent population of >1%, across all cyclic peptides in the test data sets, are plotted.

    Figure 5

    Figure 5. Performance of different combinations of amino acid features on cyclic pentapeptide (top) and hexapeptide (bottom) test data sets containing 15 AAs (left), 37 AAs (middle), and 50 AAs (right). The models are trained using 3-fold cross-validation, and the average R2 and standard deviation are reported. The models using a single type of feature (e.g., “BB only”) are represented on the diagonals. The best-performing models for the 37 AA and 50 AA test data sets, based on the average R2, are boxed with bold black outlines.
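
    The feature combinations explored in Figure 5 amount to concatenating per-residue feature vectors before they are passed to a model. The hypothetical helper below sketches that idea; the feature names, dimensions, and five-residue example sequence are made up for illustration and do not reproduce the StrEAMM models’ actual input construction.

```python
import numpy as np

def combine_features(feature_sets, residues):
    """Concatenate several per-residue feature vectors (e.g., BB and VOX)
    into a single input vector for a peptide sequence.

    feature_sets: list of dicts mapping residue identifiers to 1D arrays.
    residues: iterable of residue identifiers in sequence order.
    """
    per_residue = [
        np.concatenate([feats[aa] for feats in feature_sets]) for aa in residues
    ]
    return np.concatenate(per_residue)

# Hypothetical feature dictionaries for a made-up cyclic pentapeptide "GASVF".
rng = np.random.default_rng(2)
bb = {aa: rng.random(576) for aa in "GASVF"}    # e.g., flattened (phi, psi) grid
vox = {aa: rng.random(1000) for aa in "GASVF"}  # e.g., flattened voxel grid
x = combine_features([bb, vox], residues="GASVF")
print(x.shape)  # (7880,) = 5 residues x (576 + 1000) features
```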

    Supporting Information


    The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.4c02102.

    • Example of a (ϕ, ψ) distribution binned using various grid sizes; comparison of backbone (BB) features of amino acids in the different amino acid libraries; comparison of the different normalization schemes applied to the BB and voxel (VOX) features; examples of learning curves from hyperparameter tuning; model performances (reporting R2) on the training and validation data sets for the cyclic pentapeptides; p-values from t-tests comparing different features; model performances (reporting R2) on the training and validation data sets for the cyclic hexapeptides; model performances (reporting weighted error, WE) on the various test data sets for the cyclic pentapeptides and cyclic hexapeptides; model performances (reporting R2) using combinations of features on the training and validation data sets for the cyclic pentapeptides and cyclic hexapeptides (PDF)

