Slicing and Dicing: Optimal Coarse-Grained Representation to Preserve Molecular Kinetics

The aim of molecular coarse-graining approaches is to recover relevant physical properties of a molecular system via a lower-resolution model that can be simulated more efficiently. Ideally, the lower resolution still accounts for the degrees of freedom necessary to recover the correct physical behavior. The selection of these degrees of freedom has often relied on the scientist's chemical and physical intuition. In this article, we argue that, in soft matter contexts, desirable coarse-grained models accurately reproduce the long-time dynamics of a system by correctly capturing the rare-event transitions. We propose a bottom-up coarse-graining scheme that correctly preserves the relevant slow degrees of freedom, and we test this idea on three systems of increasing complexity. We show that, in contrast to this method, existing coarse-graining schemes, such as those based on information theory or structure, are not able to recapitulate the slow timescales of the system.

The dynamics of the model protein were simulated using the simulation parameters outlined in ref. 1, with the only modification that $k_B T = 1.33$ instead of $k_B T = 1$. A single long trajectory of 50 million timesteps was generated, saving the coordinates every 5 steps. All simulations were carried out in OpenMM.2

Simulation Analysis
Here we show the implied timescale plot for the model protein, as obtained by TICA,3,4 as a function of the TICA lagtime (see Fig. S1), and the plot of the end-to-end distance as a function of time (see Fig. S2). The first is used to determine a suitable lagtime for the system, for the calculation of the VAMP score. The second justifies the slightly higher temperature of the system compared to that of ref. 1. Refer to the figure captions themselves for a more in-depth description.
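As a minimal numerical sketch of how implied timescales can be obtained, the snippet below implements the core of TICA from scratch: symmetrized covariance matrices, a generalized eigenvalue problem, and the conversion of eigenvalues into timescales via $t_i = -\tau/\ln \lambda_i$. The toy two-feature trajectory with one slow AR(1) mode is a stand-in for real MD features, not the data of this work.

```python
import numpy as np
from scipy.linalg import eigh

def tica_timescales(X, tau):
    """Implied timescales of a feature trajectory X (shape (T, d)) at lagtime tau."""
    X = X - X.mean(axis=0)
    A, B = X[:-tau], X[tau:]
    n_pairs = len(A)
    # symmetrized covariance estimators (impose detailed balance)
    C00 = (A.T @ A + B.T @ B) / (2 * n_pairs)
    C0t = (A.T @ B + B.T @ A) / (2 * n_pairs)
    # TICA: generalized eigenvalue problem  C0t v = lam C00 v
    lam = eigh(C0t, C00, eigvals_only=True)[::-1]   # descending order
    lam = np.clip(lam, 1e-12, 1 - 1e-12)            # guard the logarithm
    return -tau / np.log(lam)                       # slowest process first

# toy trajectory: one slow AR(1) mode mixed into two noisy features
rng = np.random.default_rng(0)
T = 200_000
slow = np.zeros(T)
for t in range(1, T):
    slow[t] = 0.99 * slow[t - 1] + rng.normal()     # relaxation time ~ -1/ln(0.99) ~ 100
X = np.stack([slow + 0.1 * rng.normal(size=T),
              0.5 * slow - rng.normal(size=T)], axis=1)

ts = tica_timescales(X, tau=50)
print(ts)  # leading timescale close to 100 steps
```

Plotting the leading timescales against several choices of `tau` reproduces the kind of implied-timescale plot shown in Fig. S1: once the estimates plateau, the lagtime is long enough for a Markovian description.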

Reconstruction Error Autoencoder Architecture and Training Strategy
The reconstruction error (RE) is calculated by training an autoencoder model whose encoder part performs the coarse-grained (CG) mapping and whose decoder part reconstructs the fine-grained (FG) structure using the CG coordinates as its input. We define the reconstruction error as the mean square error between the original coordinates and the reconstructed ones:

$$\mathrm{RE} = \left\langle \left\| x_{FG} - x_{rec} \right\|^2 \right\rangle,$$

where $x_{FG}$ are the FG coordinates and $x_{rec}$ the reconstructed ones, and the average is performed over an ensemble of configurations (from long MD trajectories). In this work, for each considered CG mapping, the encoder part is linear and is fixed by the mapping. Then, with the CG coordinates as input, the decoder part is trained to minimize the reconstruction error over training data sampled from equilibrium simulations.
The decoder consists of two modules, a linear one and a non-linear one, as shown in Fig. S3. The linear module is a $3n \times 3N$ matrix transforming the $3N$ CG coordinates into the $3n$ fine-grained ones, while the non-linear module is a neural network. The two modules are combined in a residual network architecture:

$$x_{rec} = A\, x_{CG} + f_\theta(x_{CG}),$$

where $x_{CG}$ are the CG coordinates, $A$ is the linear module, and $f_\theta$ is the neural network. We found that this design allows a faster convergence of the training than a decoder consisting only of a non-linear component. In practice, we first train the linear module until convergence, then add the non-linear correction and train it until convergence.
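To illustrate the first stage of this scheme, the sketch below computes the RE using only the linear module: for a fixed linear CG mapping, the optimal linear decoder is a least-squares fit, and the RE is the residual mean square error. The Gaussian ensemble and the three-atoms-per-bead averaging map are toy stand-ins, and the trained neural-network residual of the full scheme is omitted here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, T = 6, 2, 5000                 # n FG atoms, N CG beads, T configurations

# toy FG ensemble: correlated Gaussian configurations (stand-in for MD frames)
L = rng.normal(size=(3 * n, 3 * n))
x_fg = rng.normal(size=(T, 3 * n)) @ L.T

# fixed linear encoder: each CG bead is the average of 3 consecutive atoms
M = np.zeros((3 * N, 3 * n))
for b in range(N):
    for j in range(3):               # atoms belonging to bead b
        atom = 3 * b + j
        for k in range(3):           # x, y, z components
            M[3 * b + k, 3 * atom + k] = 1.0 / 3.0
x_cg = x_fg @ M.T

# optimal linear decoder by least squares; the full scheme then trains a
# neural-network residual correction on top of this linear reconstruction
A, *_ = np.linalg.lstsq(x_cg, x_fg, rcond=None)
x_rec = x_cg @ A
re = np.mean(np.sum((x_fg - x_rec) ** 2, axis=1))
print(re)   # > 0: information is lost by the 18 -> 6 dimensional mapping
```

Comparing `re` across candidate mapping matrices `M` ranks the mappings by how much fine-grained structure they preserve, which is the role the RE plays in this work.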

Data Set and Hyper-parameters
The hyper-parameters of the autoencoder, the data set, and the training for the 4-bead and model protein systems are shown in Table 1. The data are obtained from the MD simulation trajectories whose details are described above. Cross-validation is applied during training. In Fig. S4, we show the validation loss corresponding to the training of the CG mapping with the minimum RE, for the 4-bead and model protein systems. The two-stage convergence is caused by our separate two-step (linear and non-linear) training strategy.
Analytical VAMP Score for Harmonic Systems
As discussed in the main text, the VAMP score is defined as

$$s(\tau) = \left\| C_{00}^{-1/2}\, C_{0\tau}\, C_{00}^{-1/2} \right\|_F^2,$$

with the covariance matrix $C_{00}$ and the time-lagged covariance matrix $C_{0\tau}$ defined as:

$$C_{00} = \langle X_t X_t^\top \rangle, \qquad C_{0\tau} = \langle X_t\, (P_\tau X_t)^\top \rangle = \langle X_t X_{t+\tau}^\top \rangle,$$

where $X_t$ (or $X_{t+\tau}$) is a set of features of the coarse-grained model evaluated at time $t$ (or $t+\tau$, where $\tau$ is a lagtime), $P_\tau = \exp(\mathcal{L}\tau)$ is the propagator of the dynamics associated with the (full-resolution) harmonic system for a lagtime $\tau$, and $\mathcal{L}$ is the generator of the dynamics.5
If we take the features to be the displacements of the Cartesian coordinates, the covariance matrix $C_{00}$ is easily evaluated as

$$C_{00} = k_B T\, K^{-1},$$

where $K^{-1} = Q M \Gamma^{-1} M^\top Q^{-1}$ defines the effective CG interaction matrix $K$, with $M$ the CG mapping matrix and, as in ref. 6, $Q$ the projection operator that filters out the free translations. In the previous expression, the inverse $\Gamma^{-1}$ is defined by removing the zero eigenmode (that is, it is a pseudoinverse):

$$\Gamma^{-1} = \sum_{i:\, \lambda_i \neq 0} \frac{1}{\lambda_i}\, u_i u_i^\top,$$

where $u_i$ are the normalized eigenvectors of $\Gamma$ and $\lambda_i$ the corresponding eigenvalues.
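As a numerical sketch of this pseudoinverse construction, assume a toy one-dimensional harmonic chain with free ends (scalar coordinates rather than 3D, purely for brevity): its interaction matrix has the uniform translation as a zero eigenmode, and the eigenmode-based $\Gamma^{-1}$ coincides with the Moore-Penrose pseudoinverse.

```python
import numpy as np

# interaction matrix of a 1D harmonic chain with free ends:
# u(x) = 1/2 dx^T Gamma dx, nearest-neighbour springs of stiffness kappa
n, kappa = 5, 2.0
Gamma = np.zeros((n, n))
for i in range(n - 1):
    Gamma[i, i] += kappa
    Gamma[i + 1, i + 1] += kappa
    Gamma[i, i + 1] -= kappa
    Gamma[i + 1, i] -= kappa

# the uniform translation is a zero eigenmode of Gamma
assert np.allclose(Gamma @ np.ones(n), 0.0)

# pseudoinverse via the eigendecomposition, dropping the zero mode
lam, U = np.linalg.eigh(Gamma)
keep = lam > 1e-10
Gamma_inv = (U[:, keep] / lam[keep]) @ U[:, keep].T

# it agrees with the Moore-Penrose pseudoinverse
assert np.allclose(Gamma_inv, np.linalg.pinv(Gamma))

# Boltzmann covariance of the fine-grained displacements (with kT = 1)
kT = 1.0
C00_fg = kT * Gamma_inv
```

Applying a mapping matrix and the translation-filtering projector to `C00_fg` then yields the CG covariance entering the VAMP score.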
On the other hand, the evaluation of the time-lagged covariance matrix requires some additional calculations, as detailed below.
We assume an overdamped Langevin dynamics in a potential energy $u(x)$ for the time evolution of the system:

$$dX_t = -\frac{1}{\gamma}\, \nabla u(X_t)\, dt + \sqrt{\frac{2\, k_B T}{\gamma}}\, dB_t,$$

where $\gamma$ is the friction coefficient and $B_t$ is a Brownian motion.
The generator $\mathcal{L}$ associated with this dynamics is the differential operator acting on a generic function $f(x)$ as follows:

$$\mathcal{L} f(x) = -\frac{1}{\gamma}\, \nabla u(x) \cdot \nabla f(x) + \frac{k_B T}{\gamma}\, \Delta f(x).$$

In the specific case of a harmonic system with potential energy $u(x) = \frac{1}{2}\, \delta x^\top \Gamma\, \delta x$, we have:

$$\mathcal{L} f = -\frac{1}{\gamma}\, (\Gamma\, \delta x) \cdot \nabla f + \frac{k_B T}{\gamma}\, \Delta f.$$

With a change of variables from $x$ to $\eta = U^\top \delta x$, and using the eigenvalue decomposition of the interaction matrix $\Gamma = U \Lambda U^\top$, where $U$ is the matrix of the eigenvectors of $\Gamma$ and $\Lambda$ the diagonal matrix of its eigenvalues $\lambda_i$, the above equation becomes the sum of decoupled expressions along the normal modes:

$$\mathcal{L} f = \sum_i \left( -\frac{\lambda_i\, \eta_i}{\gamma}\, \frac{\partial f}{\partial \eta_i} + \frac{k_B T}{\gamma}\, \frac{\partial^2 f}{\partial \eta_i^2} \right).$$

The (properly normalized) eigenfunctions $\psi_k(x)$ and eigenvalues $\xi_k$ of the generator $\mathcal{L}$ for the 1-dimensional case (with potential energy $u(x) = \frac{1}{2} \lambda x^2$) are:7

$$\psi_k(x) = \frac{1}{\sqrt{2^k\, k!}}\, H_k\!\left( \sqrt{\frac{\lambda}{2 k_B T}}\, x \right), \qquad \xi_k = -\frac{k\, \lambda}{\gamma},$$

where $H_k(x)$ are the Hermite polynomials. The eigenfunctions are orthonormal with respect to the Boltzmann measure $\mu(x) \propto e^{-u(x)/k_B T}$:

$$\int \psi_j(x)\, \psi_k(x)\, \mu(x)\, dx = \delta_{jk}.$$

The analytical expression for the eigenfunctions of the generator $\mathcal{L}$ and propagator $P_\tau$ of the multidimensional process is then:

$$\Psi_{k_1 \ldots k_n}(x) = \prod_i \psi_{k_i}(\eta_i),$$

with the indices $k_i = 0, 1, \ldots$ for each mode $i$. The corresponding eigenvalues of the propagator $P_\tau$ are then:

$$\sigma_{k_1 \ldots k_n}(\tau) = \exp\!\left( -\frac{\tau}{\gamma} \sum_i k_i\, \lambda_i \right).$$

By using the normal mode decomposition $\delta x = U \eta = \sum_i u_i\, \eta_i$ and the orthonormality of the eigenfunctions, the expansion of the coordinate displacement $\delta x$ in terms of these eigenfunctions reduces to a finite sum over the first excited state of each mode:

$$\delta x = \sum_i c_i\, \psi_1(\eta_i),$$

and the projections $c_i$ are easily evaluated as:

$$c_i = C_n \sqrt{\frac{k_B T}{\lambda_i}}\; u_i,$$

where $C_n$ is a constant depending only on the number $n$ of atoms in the fine-grained model.
By using this decomposition, it is straightforward to obtain an expression for $P_\tau \delta x$:

$$P_\tau\, \delta x = \sum_i e^{-\lambda_i \tau / \gamma}\, c_i\, \psi_1(\eta_i) = \Omega_\tau\, \delta x,$$

where we have defined the matrix:

$$\Omega_\tau = U\, e^{-\Lambda \tau / \gamma}\, U^\top.$$
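The matrix $\Omega_\tau$ is easy to check numerically: the mode-by-mode definition coincides with the matrix exponential $\exp(-\Gamma \tau / \gamma)$. A small sketch with a random symmetric positive-definite $\Gamma$ (a toy stand-in, not one of the systems of this work):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
Gamma = A @ A.T + 4.0 * np.eye(4)   # random symmetric positive-definite matrix

gamma_f, tau = 1.0, 0.3             # friction coefficient and lagtime
lam, U = np.linalg.eigh(Gamma)
Omega = U @ np.diag(np.exp(-lam * tau / gamma_f)) @ U.T

# Omega_tau = U exp(-Lambda tau / gamma) U^T = exp(-Gamma tau / gamma)
assert np.allclose(Omega, expm(-Gamma * tau / gamma_f))

# at tau = 0 the propagator reduces to the identity
assert np.allclose(U @ np.diag(np.exp(-lam * 0.0)) @ U.T, np.eye(4))
```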

With this result, we can evaluate the time-lagged covariance matrix as:

$$C_{0\tau} = \langle X_t\, (P_\tau X_t)^\top \rangle = k_B T\, Q M\, \Gamma^{-1} \Omega_\tau\, M^\top Q^{-1}.$$

By putting everything together, we have an analytical expression for the VAMP score:

$$s(\tau) = \left\| C_{00}^{-1/2}\, C_{0\tau}\, C_{00}^{-1/2} \right\|_F^2,$$

with $C_{00}$ and $C_{0\tau}$ as given above; note that the factor $k_B T$ cancels.

Numerical VAMP Score
The VAMP score can also be evaluated directly numerically from the simulated trajectories.8 By using the definitions above, approximating the ensemble averages with time averages with the time index $t = 1, \ldots, T$ along a long trajectory (or a set of trajectories), and symmetrizing to impose detailed balance, we have:

$$C_{00} \approx \frac{1}{2(T-\tau)} \sum_{t=1}^{T-\tau} \left( X_t X_t^\top + X_{t+\tau} X_{t+\tau}^\top \right), \qquad C_{0\tau} \approx \frac{1}{2(T-\tau)} \sum_{t=1}^{T-\tau} \left( X_t X_{t+\tau}^\top + X_{t+\tau} X_t^\top \right),$$

where $X_t$ ($X_{t+\tau}$) is the vector containing the values of all the selected features (e.g., Cartesian coordinates, or other observables of the CG model) at time $t$ ($t+\tau$). In practice, the numerical evaluation of these matrices and of the VAMP score can be carried out with a standard linear-algebra Python package.
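A minimal NumPy version of this numerical evaluation (the VAMP-2 variant, which scores the squared singular values of the whitened time-lagged covariance) might look as follows; the AR(1) feature is a toy stand-in for real trajectory data:

```python
import numpy as np

def vamp2_score(X, tau):
    """VAMP-2 score of a feature trajectory X (shape (T, d)) at lagtime tau,
    with symmetrized covariances to impose detailed balance."""
    X = X - X.mean(axis=0)
    A, B = X[:-tau], X[tau:]
    n_pairs = len(A)
    C00 = (A.T @ A + B.T @ B) / (2 * n_pairs)
    C0t = (A.T @ B + B.T @ A) / (2 * n_pairs)
    lam, U = np.linalg.eigh(C00)
    C00_inv_sqrt = (U / np.sqrt(lam)) @ U.T      # C00^{-1/2}
    K = C00_inv_sqrt @ C0t @ C00_inv_sqrt
    return np.sum(K ** 2)                        # squared Frobenius norm

rng = np.random.default_rng(3)
T = 100_000

# pure white noise carries no slow dynamics: score close to 0
s_noise = vamp2_score(rng.normal(size=(T, 3)), tau=10)

# a persistent AR(1) feature contributes its lag-tau autocorrelation squared
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.95 * x[t - 1] + rng.normal()
s_slow = vamp2_score(np.stack([x, rng.normal(size=T)], axis=1), tau=10)

print(s_noise, s_slow)   # s_slow is much larger (roughly 0.95**20)
```

Mappings that retain slow degrees of freedom thus score higher, which is exactly the criterion used to rank CG mappings in this work.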

While the features $X_t$ chosen for the harmonic systems are the displacements of the Cartesian coordinates, for which the covariance matrices can be evaluated analytically (see the section above), the features for the model protein are the inter-bead distances, as calculations with internal coordinates are numerically more stable.
For the harmonic systems, we verified that numerical and analytical calculations yield the same results within numerical precision.
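A consistency check of this kind can be sketched for the simplest case, a one-dimensional overdamped oscillator integrated with Euler-Maruyama (a toy setup, not the systems of this work): the simulated $C_{00}$ and $C_{0\tau}$ should approach the analytical values $k_B T/\lambda$ and $(k_B T/\lambda)\, e^{-\lambda \tau / \gamma}$.

```python
import numpy as np

# Euler-Maruyama integration of dX = -(lam/gamma) X dt + sqrt(2 kT/gamma) dB
rng = np.random.default_rng(4)
lam, gamma_f, kT, dt = 1.0, 1.0, 1.0, 0.01
T = 500_000
x = np.zeros(T)
kicks = np.sqrt(2.0 * kT * dt / gamma_f) * rng.normal(size=T - 1)
for t in range(T - 1):
    x[t + 1] = x[t] - (lam / gamma_f) * x[t] * dt + kicks[t]

tau_steps = 100
tau = tau_steps * dt                    # lagtime in physical units
c00 = np.mean(x * x)                    # analytical: kT / lam = 1
c0t = np.mean(x[:-tau_steps] * x[tau_steps:])
# analytical: (kT / lam) * exp(-lam * tau / gamma_f), i.e. exp(-1) here
print(c00, c0t)
```

The agreement is limited only by the integration timestep and the statistical error of the time averages, which is the sense in which numerical and analytical results coincide "within numerical precision".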
Comparison with the Results of Foley et al.9
As mentioned in the main text, in ref. 9

Figure S1: The implied timescales of the model protein for the 4 leading processes. The implied timescales refer to the intrinsic relaxation times of the system. The shaded grey area indicates the region where Markov processes cannot be resolved. Note that the simulation data have been strided by a factor of 500 and the data are shown on a log-log scale. From this we select a lagtime of 100 to compute the VAMP score for this system.

Figure S2: End-to-end distance of the model protein at 5 different temperatures, indicated by the y-axis labels, increasing progressively from $k_B T = 1$ to $k_B T = 2$. For the top row, at $k_B T = 1$, we see no unfolding events, while the bottom row, at $k_B T = 2$, shows no apparent preference for the folded state. The x-axis indicates the timestep, again strided by 500. A temperature at which the system shows a few, but not too many, transitions between the folded and unfolded states is suitable for our study.

Figure S4: Validation loss corresponding to the training of the reconstruction autoencoder for the CG mapping with minimum RE, for the 4-bead harmonic model (a) and the model protein system (b).

Figure S5: A) Comparison of the Vibrational Power and VAMP score calculated for all possible CG maps defined by our mapping criterion (light blue dots), and for a sample of CG maps defined by the criterion of ref. 9 (orange dots), for the GNM of protein 2ERL coarse-grained into 4 beads. The values corresponding to the maximum VAMP score, maximum Vibrational Power, and maximum Mapping Entropy are highlighted by blue, red, and green dots, respectively, for our mapping choice, and by blue, red, and green diamonds for the mapping choice of ref. 9. The values corresponding to the block map (that is, the CG map where each of the 4 beads contains 10 consecutive atoms) are marked by an orange diamond. B) The same data from A) for a sample of CG maps defined by the criterion of ref. 9, plotted in a zoomed-in range of Vibrational Power to show the structure of the data.