A Stochastic Landscape Approach for Protein Folding State Classification

Protein folding is a critical process that determines the functional state of proteins. Proper folding is essential for proteins to acquire their functional three-dimensional structures and execute their biological role, whereas misfolded proteins can lead to various diseases, including neurodegenerative disorders like Alzheimer’s and Parkinson’s. Therefore, a deeper understanding of protein folding is vital for understanding disease mechanisms and developing therapeutic strategies. This study introduces the Stochastic Landscape Classification (SLC), an innovative, automated, nonlearning algorithm that quantitatively analyzes protein folding dynamics. Focusing on collective variables (CVs) – low-dimensional representations of complex dynamical systems like molecular dynamics (MD) of macromolecules – the SLC approach segments the CVs into distinct macrostates, revealing the protein folding pathway explored by MD simulations. The segmentation is achieved by analyzing changes in CV trends and clustering these segments using a standard density-based spatial clustering of applications with noise (DBSCAN) scheme. Applied to the MD-based CV trajectories of Chignolin and Trp-Cage proteins, the SLC demonstrates apposite accuracy, validated by comparing standard classification metrics against ground-truth data. These metrics affirm the efficacy of the SLC in capturing intricate protein dynamics and offer a method to evaluate and select the most informative CVs. The practical application of this technique lies in its ability to provide a detailed, quantitative description of protein folding processes, with significant implications for understanding and manipulating protein behavior in industrial and pharmaceutical contexts.


S1 The BEAST Algorithm Details
The BEAST algorithm deals with time series data y i ∈ ⃗ y with samples i = 1 . . .N s where each sample is assumed to belong to one of several intervals m s = 1, 2 . . .M s with varying durations τ s,ms .These variables are represented altogether by a model vector ⃗ M .Intervals are further characterized by their trend and seasonality, 1 represented by a vector ⃗ β m .A Gaussian noise ϵ = N (0, σ 2 ) is assumed for each interval, with a changing variance for each segment.The algorithm aims to find the optimal segmentation of the data by estimating the interval durations, trend and seasonality parameters, and noise variance.This is achieved through Bayesian inference, which involves maximizing the posterior probability of the model parameters given the observed data, P pr = p( ⃗ M , ⃗ β m , σ 2 |⃗ y).The posterior probability is calculated by combining the likelihood probability (based on the assumption of Gaussian noise) with prior probabilities based on these assumptions: • Uniform prior the number of change-points in ⃗ M with a large upper bound.
• Uniform prior for the intervals duration in ⃗ M with a minimum length constraint.
• Normal-inverse Gamma prior for the noise variance amplitude, σ 2 .
• The trend and seasonality parameters that are normally distributed with the resulting variance of σ 2 .
While the posterior probability has a closed-form expression, finding the optimal parameters analytically is impossible.Therefore, the BEAST algorithm utilizes a Reverse Jump Markov chain Monte Carlo (MCMC) sampler within a Gibbs sampling framework to estimate the parameters iteratively.This eventually leads to optimal estimates of interval duration, trend and seasonality parameters, and noise variance.For additional details, please refer to the work of Zhao et.al . 1

S2 Implementation of the Stochastic Landscape Classification on Protein Folding
The Stochastic Landscape Classification (SLC) implementation follows the steps described in Table .1 in the main text.Additional information about the DBCAN classification and the Khun-Munkres label matching is given below.

S2.1 DBSCAN classification
While the CV segments are scattered in space (see Figs. Two parameters control the behavior of DBSCAN: minP ts, which defines the minimum number of neighbors needed for a dense cluster, and ϵ, which sets the search radius for finding neighbors.We chose these parameters based on commonly used DBSCAN heuristics. 9While for most applications, the minP ts value is suggested as twice the space dimension d, we choose minP ts = 2 • d − 1 = 5, where d = 3 (the stochastic landscape dimensionality).The ϵ value was heuristically chosen according to the knee point in a k-distance graph 2 .The knee point was determined using the following algorithm.First, move iteratively along the k-distance curve, examining one bisection point at a time.At each point, fit two lines, one to all points left of the current point and another to all points to the right.The "knee" is then identified as the bisection point where the sum of errors for these two line fits is minimized. 10After this classification, the cluster labels are assigned to each of the points in the scatter plots (see Figs. 1C and 3C of the main text).One cluster accounts for all the points considered as outliers.Finally, the labels are matched according to the Khun-Munkres algorithm.

S2.2 Khun-Munkres Label Matching
After applying the DBSCAN algorithm to the stochastic landscape data, labels are assigned to each point in the scatter plot based on the suggested clustering scheme.Subsequently, a comparison with the ground-truth data is pursued, giving rise to two challenges.First, discrepancies in the number of clusters may arise between the ground truth and the clustered data.Second, establishing the correspondence between labels in the clustered scheme and those in the ground truth necessitates attention.Addressing a similar issue, Chen et al. 11 proposed employing the Hungarian algorithm 12 (also known as the Kuhn-Munkres algorithm), aiming to match labels akin to solving an assignment problem.This approach facilitates a comparison of labels between clusters and states in the ground truth.Notably, if the number of states in the ground truth differs from the number of clusters identified by DBSCAN, the number of matched labels is constrained by the minimum of the two cluster counts.Consequently, the remaining clusters and their associated points are amalgamated with the outliers cluster (UC).Subsequently, the segment labels are generated and utilized as colors in scatter plots (refer to Figs. 1C and 1D for Chignolin protein, and Figs.3C and 3D for Trp-Cage protein).
In the context of this study, the clustering method was evaluated against a known ground truth.However, in applications where the ground truth is unknown, the label-matching scheme is irrelevant, and clusters should be interpreted based on the output of the DBSCAN algorithm, as described earlier.

S2.3 Additional CVs Segmented Trajectories
Following the SLC outline in              The trajectory length is measured as the percentage of the original CV trajectory length for the two proteins.Sharp metric transitions occur at 9% and 4.5% trajectory lengths for the Chignolin (total length: 5348 samples) and the Trp-Cage (total length: 6692 samples) after downsampling (see Table 1 in the main text).
We propose that sharp metric transitions (Fig. S14 C and D) around the SLC minimum state visitation counts indicate the method's inability to classify states correctly before these transitions (near-zero metric values appear).For the two proteins, the SLC minimum state visitation count values are determined by the state visitation count plots intersection with the dashed vertical line in Fig. S14 C and D. For the Chignolin, the resulting minimal number of state visitation counts is 7 for both the folded and unfolded states.For the Trp-cage, the resulting minimal number of state visitation counts are 9, 8, and 3 visitations, respectively, for the folded, unfolded, and misfolded states.The latter number of visitations is less than minP ts for the third state, which might seem counter-intuitive given the previous DBSCAN algorithm-based argument for the minimal visitation counts per state.This stems from the result of 9 and 8 visitations in the folded and unfolded states being sufficient to classify their clusters, regardless of the third state.This result yields, in this case, an already high value for the N M I and ARI classification metrics (0.62 and 0.79, respectively).This might be advantageous for some systems in which one protein state is rarely visited compared to the others.Here, for example, the third state can be regarded as merged with the unclassified 1B and 3B) vs. the stochastic coordinates, we aim to cluster them according to their position.This follows the assumption that different macro-states have different stochastic coordinates, originating in their respective different mean, STD, and average linear trends.The DBSCAN (Density-based spatial clustering of applications with noise) algorithm 2,3 is used widely for numerous classification tasks, such as superpixel segmentation, 4 radar data clustering, 5 gene clustering, 6 anomaly detection7 and others.8This algorithm analyzes data points in a given space and identifies clusters based on their density.Clustered points are assumed to be closely packed, with many surrounding neighbors.Points located in areas with low density, those further away from their nearest neighbors, are classified as outliers by the algorithm.All these outliers are clustered with the label UC (unclassified).
Fig. S13 shows the resulting correlation between the classification metric values and the unclassified data percentage.In all cases, the Pearson coefficients 13 indicate a negative correlation.

Figure S1 :
Figure S1: Protein state classification for CV Dist CH of Chignolin data.(A) The CV Dist CH value as a function of time (black dots), plotted every 20 ns, and the division into segments according to the trend change points detected by the BEAST algorithm (orange vertical lines).(B) Scatter plot for the Stochastic Landscape for the Chignolin trajectory, where each data point represents a segment corresponding to its stochastic coordinates y 1 , y 2 , and y 3 , which are the PCA components of the normalized mean, standard deviation, and average trend.(C) The result of the DBSCAN clustering algorithm depicted on the same scatter plot.Points in red and blue correspond to different segment clusters, respectively, whereas unclassified segments remain in grey.(D) Protein state labeling, determined by the Kuhn-Munkres algorithm applied to the DBSCAN clustering results, projected onto the original CV Dist CH trajectory.Purple and orange data points correspond to folded and unfolded states, respectively, whereas grey corresponds to unclassified samples.

Figure S2 :
Figure S2: Protein state classification for CV RMSD CH

Figure S3 :
Figure S3: Protein state classification for CV HLDA CH

Figure S4 :
Figure S4: Protein state classification for CV TPI-Deep-TDA CH of Chignolin data.(A) The CV TPI-Deep-TDA CH value as a function of time (black dots), plotted every 20 ns, and the division into segments according to the trend change points detected by the BEAST algorithm (orange vertical lines).(B) Scatter plot for the Stochastic Landscape for the Chignolin trajectory, where each data point represents a segment corresponding to its stochastic coordinates y 1 , y 2 , and y 3 , which are the PCA components of the normalized mean, standard deviation, and average trend.(C) The result of the DBSCAN clustering algorithm depicted on the same scatter plot.Points in red, blue, and green correspond to different segment clusters, respectively, whereas unclassified segments remain in grey.(D) Protein state labeling, determined by the Kuhn-Munkres algorithm applied to the DBSCAN clustering results, projected onto the original CV TPI-Deep-TDA CH trajectory.Purple and orange data points correspond to folded and unfolded states, respectively, whereas grey corresponds to unclassified samples.

Figure S5 :
Figure S5: Protein state classification for CV Deep-TICA-I TRP of Trp-Cage data.(A) The CV Deep-TICA-I TRP value as a function of time (black dots), plotted every 40 ns, and the division into segments according to the trend change points detected by the BEAST algorithm (orange vertical lines).(B) Scatter plot for the Stochastic Landscape for the Trp-Cage trajectory, where each data point represents a segment corresponding to its stochastic coordinates y 1 , y 2 , and y 3 , which are the PCA components of the normalized mean, standard deviation, and average trend.(C) The result of the DBSCAN clustering algorithm depicted on the same scatter plot.Points in red, blue, and green correspond to different segment clusters, respectively, whereas unclassified segments remain in grey.(D) Protein state labeling, determined by the Kuhn-Munkres algorithm applied to the DBSCAN clustering results, projected onto the original CV Deep-TICA-I TRP trajectory.Purple, orange, and teal data points correspond to folded, unfolded, and misfolded states, respectively, whereas grey corresponds to unclassified samples.

Figure S6 :
Figure S6: Protein state classification for CV Deep-TICA-II TRP of Trp-Cage data.(A) The CV Deep-TICA-II TRP value as a function of time (black dots), plotted every 40 ns,and the division into segments according to the trend change points detected by the BEAST algorithm (orange vertical lines).(B) Scatter plot for the Stochastic Landscape for the Trp-Cage trajectory, where each data point represents a segment corresponding to its stochastic coordinates y 1 , y 2 , and y 3 , which are the PCA components of the normalized mean, standard deviation, and average trend.(C) The result of the DBSCAN clustering algorithm depicted on the same scatter plot.Points in red, blue, and green correspond to different segment clusters, respectively, whereas unclassified segments remain in grey.(D) Protein state labeling, determined by the Kuhn-Munkres algorithm applied to the DBSCAN clustering results, projected onto the original CV Deep-TICA-II TRP trajectory.Purple, orange, and teal data points correspond to folded, unfolded, and misfolded states, respectively, whereas grey corresponds to unclassified samples.

Figure S7 :
Figure S7: Protein state classification for CV Deep-TDA-I TRP of Trp-Cage data.(A) The CV Deep-TDA-I TRP value as a function of time (black dots), plotted every 40 ns, and the division into segments according to the trend change points detected by the BEAST algorithm (orange vertical lines).(B) Scatter plot for the Stochastic Landscape for the Trp-Cage trajectory, where each data point represents a segment corresponding to its stochastic coordinates y 1 , y 2 , and y 3 , which are the PCA components of the normalized mean, standard deviation, and average trend.(C) The result of the DBSCAN clustering algorithm depicted on the same scatter plot.Points in red, light blue, dark blue, and green correspond to different segment clusters, respectively, whereas unclassified segments remain in grey.(D) Protein state labeling, determined by the Kuhn-Munkres algorithm applied to the DBSCAN clustering results, projected onto the original CV Deep-TDA-I TRP trajectory.Purple, orange, and teal data points correspond to folded, unfolded, and misfolded states, respectively, whereas grey corresponds to unclassified samples.

Figure S8 :
Figure S8: Protein state classification for CV Deep-TDA-III TRP of Trp-Cage data.(A) The CV Deep-TDA-II TRP value as a function of time (black dots), plotted every 40 ns, and the division into segments according to the trend change points detected by the BEAST algorithm (orange vertical lines).(B) Scatter plot for the Stochastic Landscape for the Trp-Cage trajectory, where each data point represents a segment corresponding to its stochastic coordinates y 1 , y 2 , and y 3 , which are the PCA components of the normalized mean, standard deviation, and average trend.(C) The result of the DBSCAN clustering algorithm depicted on the same scatter plot.Points in red, dark blue, light blue, pale blue, and green correspond to different segment clusters, respectively, whereas unclassified segments remain in grey.(D) Protein state labeling, determined by the Kuhn-Munkres algorithm applied to the DBSCAN clustering results, projected onto the original CV Deep-TDA-II TRP trajectory.Purple, orange, and teal data points correspond to folded, unfolded, and misfolded states, respectively, whereas grey corresponds to unclassified samples.

Figure S9 :
Figure S9: Protein state classification for CV RMSD-Backbone TRP of Trp-Cage data.(A) The CV RMSD-Backbone TRP value as a function of time (black dots), plotted every 40 ns, and the division into segments according to the trend change points detected by the BEAST algorithm (orange vertical lines).(B) Scatter plot for the Stochastic Landscape for the Trp-Cage trajectory, where each data point represents a segment corresponding to its stochastic coordinates y 1 , y 2 , and y 3 , which are the PCA components of the normalized mean, standard deviation, and average trend.(C) The result of the DBSCAN clustering algorithm depicted on the same scatter plot.Points in red and blue correspond to different segment clusters, respectively, whereas unclassified segments remain in grey.(D) Protein state labeling, determined by the Kuhn-Munkres algorithm applied to the DBSCAN clustering results, projected onto the original CV RMSD-Backbone TRP trajectory.Purple and orange data points correspond to folded and unfolded states, respectively, whereas grey corresponds to unclassified samples.

Figure S10 :
Figure S10: Protein state classification for CV Dist TRP of Trp-Cage data.(A) The CV Dist TRP value as a function of time (black dots), plotted every 40 ns, and the division into segments according to the trend change points detected by the BEAST algorithm (orange vertical lines).(B) Scatter plot for the Stochastic Landscape for the Trp-Cage trajectory, where each data point represents a segment corresponding to its stochastic coordinates y 1 , y 2 , and y 3 , which are the PCA components of the normalized mean, standard deviation, and average trend.(C) The result of the DBSCAN clustering algorithm depicted on the same scatter plot.Points in red, blue, and green correspond to different segment clusters, respectively, whereas unclassified segments remain in grey.(D) Protein state labeling, determined by the Kuhn-Munkres algorithm applied to the DBSCAN clustering results, projected onto the original CV Dist TRP trajectory.Purple, orange, and teal data points correspond to folded, unfolded, and misfolded states, respectively, whereas grey corresponds to unclassified samples.

Figure S11 :
Figure S11: Protein state classification for CV Contact TRP

Figure S12 :
Figure S12: Protein state classification for CV RMSD TRP of Trp-Cage data.(A) The CV RMSD TRP value as a function of time (black dots), plotted every 40 ns, and the division into segments according to the trend change points detected by the BEAST algorithm (orange vertical lines).(B) Scatter plot for the Stochastic Landscape for the Trp-Cage trajectory, where each data point represents a segment corresponding to its stochastic coordinates y 1 , y 2 , and y 3 , which are the PCA components of the normalized mean, standard deviation, and average trend.(C) The result of the DBSCAN clustering algorithm depicted on the same scatter plot.Points in red, light blue, dark blue, and green correspond to different segment clusters, respectively, whereas unclassified segments remain in grey.(D) Protein state labeling, determined by the Kuhn-Munkres algorithm applied to the DBSCAN clustering results, projected onto the original CV RMSD TRP trajectory.Purple, orange, and teal data points correspond to folded, unfolded, and misfolded states, respectively, whereas grey corresponds to unclassified samples.

Figure S14 :
Figure S14: Classification metric values and state visitations versus the CV trajectory length.(A) N M I (dotted blue) and ARI (dashed orange) classification metrics values versus the trajectory length for Chignolin and (B) Trp-Cage.The trajectory length is measured in percentage of the Chignolin (total length: 5348 samples) and the Trp-Cage (total length: 6692 samples) total trajectory length after downsampling.(C) Different protein state visitation counts versus the trajectory length for Chignolin and (D) Trp-Cage, for the folded (blue), unfolded (orange), and misfolded (yellow) states.The dashed vertical black lines correspond to the minimal state visitation counts estimated for both proteins, as described in the text.
Table. 1 of the main text, the following segmented CVs are depicted below for both the Chignolin and the Trp-Cage.Figures depicting protein state