Time-Lagged Independent Component Analysis of Random Walks and Protein Dynamics

Time-lagged independent component analysis (tICA) is a widely used dimension reduction method for the analysis of molecular dynamics (MD) trajectories and has proven particularly useful for the construction of protein dynamics Markov models. It identifies those “slow” collective degrees of freedom onto which the projections of a given trajectory show maximal autocorrelation for a given lag time. Here we ask how much information on the actual protein dynamics and, in particular, the free energy landscape that governs these dynamics the tICA-projections of MD-trajectories contain, as opposed to noise due to the inherently stochastic nature of each trajectory. To answer this question, we have analyzed the tICA-projections of high dimensional random walks using a combination of analytical and numerical methods. We find that the projections resemble cosine functions and strongly depend on the lag time, exhibiting strikingly complex behavior. In particular, and contrary to previous studies of principal component projections, the projections change noncontinuously with increasing lag time. The tICA-projections of selected 1 μs protein trajectories and those of random walks are strikingly similar, particularly for larger proteins, suggesting that these trajectories contain only little information on the energy landscape that governs the actual protein dynamics. Further the tICA-projections of random walks show clusters very similar to those observed for the protein trajectories, suggesting that clusters in the tICA-projections of protein trajectories do not necessarily reflect local minima in the free energy landscape. We also conclude that, in addition to the previous finding that certain ensemble properties of nonconverged protein trajectories resemble those of random walks; this is also true for their time correlations.


INTRODUCTION
The atomistic dynamics of proteins, protein complexes, and other biomolecules is exceedingly complex, covering time scales from subpicoseconds to up to hours. 1,2 It is governed by a similarly complex high-dimensional free energy landscape or funnel, 3 characterized by a hierarchy of free energy barriers, 4 and has been widely studied computationally by molecular dynamics (MD) simulations. 5 With particle numbers ranging from several hundreds to hundreds of thousands or more, 6−9 the correspondingly high-dimensional configuration space of the system poses considerable challenges to a fundamental understanding of biomolecular function, for example, of the conformational motions of these biological "nanomachines", 10,11 protein folding, 12 or specific binding.
Several attempts to reduce the dimensionality of the dynamics have addressed this issue. Most notable approaches are principal component analysis (PCA) to extract the essential dynamics 13 of the protein that contributes most to the atomic fluctuations, and time-lagged independent component analysis (tICA), which identifies those collective degrees of freedom that exhibit the strongest time-correlations for a given lag-time. 14,15 Both dimension reduction techniques can yield information on the conformational dynamics of a protein, that is, how the protein moves through several conformational substates, which can be defined as metastable conformations characterized by local free energy minima. 16 This property also renders these dimension reduction techniques highly useful as a preprocessing step to describing the conformational dynamics of macromolecules in terms of a discrete Markov process. 17−19 Currently tICA is most widely used, and it is preferred over PCA for this purpose 20 because it additionally uses time information on the input trajectory.
In this context, both PCA and tICA rely on MD trajectories as input, which raises the question how much of these analyses is determined by actual information on the protein dynamics, as opposed to noise due to the inherently stochastic nature of each trajectory, and, importantly, how these two can be quantified.
For PCA, this question has been answered by analysis of the principal components of a high-dimensional random walk in a flat energy landscape. 21,22 Unexpectedly, these turned out to approximate cosine functions, thus providing a very powerful criterion for the convergence of MD trajectories: The more an MD trajectory resembles a cosine, quantified by the cosine content, 21 the more it resembles a random walk, and the less information it contains on the actual protein dynamics or the underlying free energy landscape.
These analyses 21,22 have also suggested that clusters observed in low-dimensional PCA projections do not necessarily imply the existence of conformational substates and, instead, may also be a stochastic and/or projection artifact. Particularly the latter finding is highly relevant for the use of PCA for the construction of Markov models, 19 which thus may also in part reflect the randomness of one or several trajectories. Note that this holds also truealbeit probably to a lesser extentfor the construction of Markov models from several or many trajectories, as these have to be spawned from a seeding trajectory or from starting structures generated from other advanced sampling methods. 16,23−25 For tICA, no such analysis is available, but the inspection of several examples suggests that similar effects may also be at work. 26,27 To address this issue, here we will therefore analyze the tICA-projections of high dimensional random walks, and subsequently compare them to tICA-projections of selected protein trajectories. In particular, we will semianalytically derive an expression for random walk tICA-projections, which will prove analogous to the PCA cosine functions and thus can also serve as a criterion for convergence as well as for the quality of derived Markov models. Unexpectedly, and contrary to the regular behavior of random walk PCA projections, tICAprojections turn out to display much more complex behavior. In particular, we observed critical lag times at which the random walk projections change drastically and  for high dimensions  even discontinuously. The resulting much richer and more intricate structure of random walk projections renders the proper interpretation of tICA-projections of protein dynamics trajectories particularly challenging, and has profound implications for the proper constructions of Markov models.

THEORETICAL ANALYSIS AND METHODS
2.1. Definition of tICA. To establish notation, we briefly summarize the basic principle of tICA; for a more comprehensive treatment with particular focus on molecular dynamics applications, see ref 28.
C o n s i d e r a d -d i m e n s i o n a l t r a j e c t o r y T  with Cartesian coordinates x 1 , ..., x d , which for compact notation we assume to be meanfree, that is, the time average ⟨x(t)⟩ t is zero. TICA determines those "slowest" independent collective degrees of freedom where τ is a chosen lag time. Equivalently, using the time-lagged covariance matrix under the constraint that it is orthogonal to all previous degrees of freedom. Hence, the v k are the solutions of the generalized eigenvalue problem We will use the term "tICA-eigenvector" for the v k and "tICAprojection" for the projections y k onto the tICA-eigenvectors. In the literature, the term "tICA-component" is often used, but it is somewhat ambiguous and we will therefore avoid it.
For an infinite trajectory of a time-reversible system the matrices in this eigenvalue problem are symmetric. However, for the finite trajectories considered here, with time steps t = 1, ..., n, the matrix C(τ) is usually not symmetric. There are two slightly different symmetrization methods that circumvent this problem. The more popular one, which we denote the "main" method, uses an estimator that replaces the simple time-lagged averages mentioned by averages over all pairs (x t , x t+τ ) and (x t+τ , x t ), following, for example, Noé2 8 and the popular software package PyEMMA. 29 As a result, on the left-hand side of eq 1 C(τ) is replaced with and on the right-hand side C(0) is replaced with yielding a symmetrized version of eq 1 with real eigenvalues, The second "alternative" symmetrized version of eq 1 only differs on the right-hand side, where C(0) is not replaced with Σ, Our analysis is very similar for both versions, though with unexpectedly different results.
2.2. Theory. To render this symmetrized generalized eigenvalue problem more amenable to analysis, and following ref 30, we define a matrix formed from the trajectory i The latter two matrices serve to rewrite the above left and righthand sides, and, hence, also the symmetrized tICA-equation, This defining eq 4 for tICA can be converted into a more convenient form using the matrices Noting that and eq 4 reads This can be transformed into a normal eigenvalue problem using the AMUSE-algorithm 31,32 as follows. First diagonalize the right-hand side by an orthogonal matrix Q and a diagonal matrix Λ such that Substituting v k = Wu k , with W = QΛ −1/2 , and assuming all diagonal elements of Λ are nonzero, yields Note that this assumption is actually not necessarily true here, but since we are only interested in the nonzero eigenvalues and their eigenvectors the end results will still be correct. Since W is invertible, this equation is equivalent to where the matrix on the right-hand side turns out to be the unit matrix, Hence eq 5 simplifies to Now consider the following "swapped" version: 30 Notably, for each y k satisfying eq 7 there exists a corresponding eigenvector that solves eq 6. Indeed, choosing u k = W T XAy k yields Finally, up to normalization, y k is the projection of the trajectory onto the corresponding v k = Wu k , In other words, the tICA-projections of the trajectory are the eigenvectors (with nonzero eigenvalues) of the matrix M = X T WW T XA.
We will use this reformulation of the tICA defining equation to calculate the tICA-projections of random walks of given finite dimension and length.
2.3. Random Walks. For the numerical and semianalytical evaluation of tICA components, random walk trajectories ∈ t x( ) d  of dimension d were generated by carrying out n steps according to where is a d-dimensional univariate normal distribution centered at 0. Each trajectory was centered to zero before further processing. We verified empirically that other fixed probability distributions with mean 0 and finite variance yield similar results.
2.4. Molecular Dynamics Simulation. For two proteins a 1 μs molecular dynamics trajectory each was analyzed (Andreas Volkhardt, private communication). Both were generated using the GROMACS 4.5 software package 33 with the Amber ff99SB-ILDN force field 34 and the TIP4P-Ew water model. 35 The starting structures were taken from the PDB 36 entries 11AS 37 and 2F21, 38 respectively. From the latter, only a part of the structure (the WW-domain) was used. Energy minimization was performed using steepest descent for 5 × 10 4 steps. The hydrogen atoms were described by virtual sites. Each protein was placed within a triclinic water box using gmx-solvate, such that the smallest distance between protein surface and box boundary was larger than 1.5 nm. Natrium and chloride ions were added to neutralize the system, corresponding a physiological concentration of 150 mmol/L. Each system was first equilibrated for 0.5 ns in the NVT ensemble, and subsequently for 1.0 ns in the NPT ensemble at 1 atm pressure and temperature 300 K, both using an integration time step of 2 fs. The velocity rescaling thermostat 39 and Parrinello−Rahman pressure coupling 40 were used with coupling coefficients of τ = 0.1 ps and τ = 1 ps, respectively. All bond lengths of the solute were constrained using LINCS with an expansion order of 6, and water geometry was constrained using the SETTLE algorithm. Electrostatic interactions were calculated using PME, 41 with a real space cutoff of 10 Å and a Fourier spacing of 1.2 Å. The integration time step was 4 fs, and the coordinates of the alpha carbons were saved every 10 ps, such that 10 5 snapshots were available for each trajectory. Of these we discarded the first 10 4 steps, leading to trajectories of length n = 9 ×10 4 .

RESULTS AND DISCUSSION
To characterize the tICA components and projections of random walks, we will proceed in two steps. We will first analyze a special case, for which some analytical results can be obtained. Second, we will use the obtained insights to generalize this result to random walks of arbitrary length n and dimension d using a combined analytical/numerical approach. Subsequently, we will compare the obtained random walk projections to tICA analyses of biomolecular trajectories.
3.1. A Special Case. To gain first insight into the tICA components of a random walk, first consider the special case d = n, which allows for an almost fully analytical approach. In this case, all matrices in eq 7 are square and, assuming that X is invertible, such that eq 7 becomes independent of X, Note that the assumption that X is invertible is not strictly correct, as it has one zero-eigenvalue associated with the eigenvector given by y 0 = (1, ..., 1) T . This is also an eigenvector of B −1 A, but instead with eigenvalue 1. Therefore, all the eigenvectors and all but one eigenvalue of eq 7 are identical to those of eq 8, and the analysis can proceed using eq 8.
In the limit of large n, and using the above definitions for A and B, the matrix B −1 A approaches a circulant matrix with the These eigenvectors are complex, but since λ k = λ n−k and yk = yñ −k * , the real and imaginary part of yk (cosine and sine) are real eigenvectors for the same eigenvalues. Depending on τ and n, many of these eigenvalues are equal, since they only depend on τk mod n.
This result implies that for large n = d the eigenvalues of B −1 A approach those of the circulant matrix. More precisely, their eigenvalues are asymptotically equally distributed. 42 In contrast, the eigenvectors are only preserved in limits or under small perturbations if the respective adjacent eigenvalues are wellseparated from each other. 44 For the case at hand, however, this eigenvalue separation very quickly approaches zero for small k and large n (and for other k with |cos(2πτk/n)| ≈ 1). As a result, the eigenvectors of B −1 A for small k (and other k as before) differ from those of the circulant matrix even in this limit. Rather, they need to be represented as approximate linear combinations of This subtlety contributes to the complexity of the problem as well as of the solution, and has so far has prohibited us from proceeding further purely analytically both for finite d = n as well as for d = n → ∞. Nevertheless, the eigenvalue problem eq 8 provides a good starting point for a numerical approach. Still, the degeneracy discussed above needs to be taken properly into account, as the numerical eigenvectors are essentially arbitrarily chosen from the eigenspaces.
Inspecting the Fourier transforms of the numerical eigenvectors suggests that the eigenspaces of eq 8 for small k each contain an eigenvector that resembles a cosine function i k j j j y { z z z π ≈ y t tk n ( ) cos k with increasing accuracy for increasing n.
Another effect of the poor separation of the eigenvalues is that the above results are very sensitive to small changes to the matrix in eq 8. For example, when the alternative symmetrization method defined by eq 3 is used, the analysis in section 2.2 is unchanged, except that all diagonal entries of B become 2, and eq 8 reads For n = d → ∞, the same circulant matrix is obtained, such that the eigenvalues, eq 9, are unchanged. The numerical solution however reveals that the first few eigenspaces instead contain eigenvectors given by 3.2. General Solution. Next, we will consider the general case, that is, a random walk of length n in d < n dimensions. Unfortunately, we were unable to find analytical solutions similar to the above; however, the results of section 2.2 permit an elegant way for a numerical approach by computing the expectation value of the matrix M. To this aim, M was computed for a sample of 20000 random walks of given fixed dimension d and number of time steps n, from which an average matrix ⟨M⟩ was computed. The eigenvectors of ⟨M⟩ served as the semianalytical solution for the general case. We note that this does not necessarily produce the same results as averaging the individual tICA-projections directly. We have, however, tested that the eigenvectors of ⟨M⟩ are very similar to the averages of the tICA-projections. An exception to this is that averaging the tICA-projections can produce artifacts arising from to the fluctuating order of the eigenvectors, and these artifacts are not present in the eigenvectors of ⟨M⟩.
As an illustration, Figure 1 shows the first two resulting tICAprojections for random walks with n = 1000 and d = 50, revealing a strong dependence on the lag time τ. For short lag times τ, y 1 (t) ≈ cos(πt/n) and y 2 (t) ≈ cos(2πt/n). With increasing τ, these low-frequency cosines are gradually replaced by higherfrequency components, first in y 2 (starting at about τ = 90) and for further increasing τ > 150 also in y 1 . From then on, the frequencies of both y 1 and y 2 slowly decrease, maintaining a π phase shift.
In contrast to the special case considered above (section 3.1), our numerical studies suggest that for large lag times the averaged projections do not approach exact cosines for large n. Rather, "cosine like" functions appear, as can be seen for the high lag-times shown in Figure 1, where the circular shape that would be expected for exact cosines is noticeably distorted, even if n is further increased. In contrast, for short lag times, where the higher frequency components have not yet appeared (e.g., τ < 90 in Figure 1), the projections do seem to approach exact cosines with increasing n.
For the alternative symmetrization method, eq 3, the same method can be applied, and the obtained projections are shown in Figure 2. Indeed, when the two figures are compared, even more dramatic differences are seen as a result of this very small change. In particular, for short τ values, the cosine-like functions seem to be replaced by sine-like functions of twice the frequency, just like we have already seen for the special case d = n. Also, for increasing τ a much richer and complex behavior is seen. Finally, the onset of higher frequencies occurs for somewhat smaller τ values (at τ ≈ 100) compared to that in Figure 1 (at τ ≈110). This abrupt emergence of higher frequencies deserves closer inspection.
3.3. Abrupt Changes. To gain more insight into why these abrupt changes occur, Figure 3A shows the eigenvalues of ⟨M⟩ as a function of τ for dimension d = 30, revealing a strikingly complex pattern. For small lag times τ all eigenvalues decrease with τ, with associated cosine-shaped eigenvectors of period lengths 2n, 2n/2, 2n/3, ..., as annotated in the figure. The decrease of these curves reflects the sampling of the cosine- Also visible are several curves that monotonically increase with τ, each starting at zero for small τ. These curves represent two eigenvalues each, with cosine-shaped and sine-shaped eigenvectors of period lengths τ, 2τ, 3τ, ..., respectively, as also annotated in the Figure. Their increase is less obvious, as one might expect the autocorrelation of a τ-periodic function at lag time τ to be unity and, therefore, constant. Note, however, that the eigenvalue of ⟨M⟩ does not strictly represent this autocorrelation; rather, it represents the average of the autocorrelations of many instances of this eigenvector for each single random walkeach of which is not strictly periodic. For increasing period lengths, the eigenvectors approach cosines or sines, such that their average autocorrelation increases and so do the corresponding eigenvalues of ⟨M⟩.
At the intersections of these two sets of curves (black circles) the respective eigenvalues are degenerate and their order changes, which causes abrupt changes of the eigenvectors and, therefore, also of the projections onto these eigenvectors, the first two of which were discussed above.
For larger dimensions d, for example, for d = 50 as shown in Figure 3B, one would expect that the tICA-projections resemble cosine or sine functions increasingly closely, also at increasingly higher frequencies. As a result, the eigenvalues corresponding to the eigenvectors with period lengths τ, 2τ, 3τ, ... should increase with d at any given lag time τ, whereas the decreasing eigenvalue curves on the left side should remain unchanged. Therefore, the respective intersections should occur at smaller lag times τ. Comparison of the black circles in the two panels of Figure 3 shows that this is indeed the case. To illustrate this effect, Figure  4 shows the first two tICA-projections of random walks with To quantify this behavior, we generated a large number of random walks and determined the lag times τ at which the abrupt changes occur. Figure 5 shows the first and second of these critical lag times as a function of dimension d and for n ranging from 1000 to 5000 (colors). To enable direct comparison, the lag times τ have been normalized by n. As can be seen, for d between ca. 150 and n/2 both the first (upper curves) and second (lower curves) approximate power laws n/τ ∝ d b , as indicated by the respective fits (solid lines, the colors correspond to the values of n). For each fit, only dimensions d within the above range have been used.
The inset of Figure 5 shows the power law exponents b for varying n and for the first and second abrupt change, both of which apparently approach b = −1/2 for large n (also represented by the black lines in the main figure). Although we were unable to find a rigorous proof, this finding suggests that in the limit of large n and d, with d markedly smaller than n, the first few lag times at which abrupt changes occur scale as τ ∝ n/ √d.
3.4. Comparison of Random Walks and MD-Trajectories. We next compared the tICA-projections of random walks with those of molecular dynamics trajectories of proteins in solution. To that end, we used two MD-trajectories of length 1 μs each (generated as described in section 2.4), one of a comparatively large protein (PDB 11AS, 330 amino acids) 37 and one of a smaller protein (WW-domain of PDB 2F21, 34 amino acids). 38 As can be seen in Figure 6, the tICA-projections of the larger protein (top group) are indeed spectacularly similar to those of a random walk (bottom group). Even the strong dependence on the lag time is very similar, as are the abrupt changes discussed above.
Note that this striking similarity was obtained for a particular choice of d = 40 for the random walk; other dimensionalities yield less similar projections. Intriguingly, this finding thus suggests a new method of estimating an "effective" dimensionality of MD trajectories.
It is also worth noting that both the MD-trajectory and the random walk projections show apparent "clusters", for example, for τ = 500 and τ = 8000, which also look quite similar. The fact that such clusters are also seen for the random walk strongly suggests that these are mostly stochastic artifacts and do not point to minima of the underlying free energy landscape.
A closer inspection of the random walk projections offers an additional possible explanation for some of the clusters, which may also apply to the MD trajectory projections. Focusing, for example, at the averaged tICA-projections in Figure 1 immediately before the first abrupt change, one can see that the projection becomes overlaid with a cosine of higher frequency. Particularly at the ends of the curves, and in the presence of noise typical for single trajectories, this high frequency component can also produce apparent "clusters".
In contrast, for the smaller protein (Figure 7) no similarity to the tICA-projections of random walks is observed. In fact, the tICA-projections of the trajectory of the smaller protein show no resemblance to a cosine-like function at all. In light of the above analysis, this finding suggests that this trajectory is sufficiently long to explore one or several minima of the underlying free energy landscape, thereby deviating from a random walk. Further, one may infer that the three clusters seen in the figure actually point to conformational substates and, hence, can serve as proper Markov states.
It is an intriguing question whether or not, for given trajectory length, larger or more flexible proteins tend to more closely resemble random walks.

CONCLUSIONS
Here we have analyzed projections of random walks on tICA subspaces and subsequently compared those to tICA-projections of molecular dynamics trajectories of proteins. Our combined analytical and numerical study revealed a staggering complexity of the random walk tICA-projections, which showed a much richer mathematical structure than projections of random walks on principal components (PCA). 21,22 We attribute this complexity primarily to the fact that, in contrast to PCA, tICA components encode time information on the trajectory and, therefore, extract and process significantly more information. Mathematically, the complex behavior originates from the noncontinuous switch of the order of eigenvalues for increasing lag time τ, when passing through points of eigenvalue degeneracy. At these points, the associated eigenvectors change abruptly, and so do the corresponding projections of both random walks and molecular dynamics simulations. We also find that tICA can be very sensitive to very small changes in the definitions of the involved matrices. In particular, the projections of random walks are very different for the two discussed symmetrization methods.
A particularly striking example is the first abrupt change of the projections onto the two largest eigenvalues. Here, a closer inspection revealed an approximate square root relationship between the lag times at which this occurs and the dimensionality of the random walk. A similar square root law is already known for PCA: Approximately the first √d principal components of random walks resemble cosines. 21 Comparison of tICA-projections of random walks with those of a large protein (PDB 11AS) revealed striking similarities. This remarkable finding suggests that not only the ensemble properties of the finite protein trajectory resemble those of a random walk, as has been shown earlier via PCA, 21 but also the time correlations of the underlying protein dynamics. Here, the appearance of cosine-like functions in the projections onto the tICA-vectors associated with the longest correlation times clearly points to a nonconverged trajectory. For the comparatively small lag times typically used, the tICA-projections of random walks almost exactly resemble cosine functions, such that the cosine-content 22 of the tICA-projections should serve as a good quantifier of this.
In contrast, no resemblance to a random walk was seen for the second, smaller protein studied here, indicating that the projection reflects actual features of the underlying conformational dynamics of the protein.
The example in Figure 6 also illustrates the risk of overinterpreting apparent "clusters" seen in the tICA-projections as actual conformational substates, 4,16 which are defined as local minima of the protein free energy landscape that are sufficiently deep for the system to stay there for a certain amount of time. 16 Clearly, it is tempting to also see "clusters" in the random walk projections, which, however, by the definition of the random walk as a diffusion on a flat energy landscape, cannot represent conformational substates. This finding raises concerns for using automated clustering algorithms to identify, for example, folding intermediates or to characterize conformational motions from tICA-projections. 45 Because the additional parameter of a varying lag time provides a much richer structure and many instead of only one projection (as is the case for PCA), we speculate that the tICA resemblance to a random walk offers a much more sensitive tool to detect lack of convergence in MD trajectories of large biomolecules. Further, by adjusting the dimension of the random walk such as to maximize the similarity to a given MD trajectory, one can estimate the effective dimensionality of the underlying dynamics. The latter idea, as well as precisely how this "effective dimensionality" can be defined, clearly deserves further exploration.