An Information-Theory-Based Approach for Optimal Model Reduction of Biomolecules

In theoretical modeling of a physical system, a crucial step consists of the identification of those degrees of freedom that enable a synthetic yet informative representation of it. While in some cases this selection can be carried out on the basis of intuition and experience, straightforward discrimination of the important features from the negligible ones is difficult for many complex systems, most notably heteropolymers and large biomolecules. We here present a thermodynamics-based theoretical framework to gauge the effectiveness of a given simplified representation by measuring its information content. We employ this method to identify those reduced descriptions of proteins, in terms of a subset of their atoms, that retain the largest amount of information from the original model; we show that these highly informative representations share common features that are intrinsically related to the biological properties of the proteins under examination, thereby establishing a bridge between protein structure, energetics, and function.


I. INTRODUCTION
The computer-aided investigation of a boundless spectrum of systems, ranging from liquids composed of simple inorganic molecules to the most complex biological machines, has immensely benefited from the availability of all-atom molecular dynamic (MD) simulation methods [1,2].Particularly in the fields of computational biophysics and biochemistry, recent technological advancements, most notably massive parallelisation, GPU computing, and tailor-made computers such as ANTON [3], have extended the range of applicability of atomistic (AA) simulations to the scale of microseconds for molecular complexes composed of millions of atoms [4][5][6].Nevertheless, the vast majority of relevant biological processes occur over much longer length and time scales than those commonly accessible by state-of-the-art computers.It is thus natural to ask if it is possible to reduce the complexity of the system under study without losing relevant information.
This objective is pursued in the field of coarse-grained modelling [7][8][9][10].A coarse-grained (CG) model is a simplified representation of a high-resolution, usually atomistic system in terms of fewer degrees of freedom, or effective interaction centroids.Among these CG sites, interactions are parametrised so as to reproduce equilibrium properties of the reference system.The first step of the modelling process consists in the definition of an operator M that maps an AA configuration r i , i = 1, ..., n to a CG one R I , I = 1, ..., N < n * raffaello.potestio@unitn.it[8,11], where n and N are the number of atoms in the system and the number of CG sites chosen, respectively.The linear coefficients c Ii in Eq. 1 are constant, positive and subject to the normalisation condition i c Ii = 1 to preserve translational invariance.Furthermore, coefficients are generally taken to be specific to each site [11], that is, an atom i taking part to the definition of CG site I will not be involved in the construction of another site J (c Ji = 0 ∀ J = I).
Once the mapping M is chosen, effective interactions among CG sites must be determined.In this respect, several methodologies have been devised in the past decades to parametrise such CG potentials [7][8][9][10].These approaches can be divided in two main categories, depending on the source of information employed to set the values of the interaction coefficients: in top-down CG'ing one tunes the parameters so that the model reproduces some emergent (structural and/or thermodynamic) properties of the reference system; in bottom-up CG'ing, on the other hand, the lower-resolution CG model naturally arises via an effective integration of neglected degrees of freedom of the higher-resolution system.In this second approach, which is the one on which we focus here, a pivotal role is played by the reference system's potential of mean force (PMF) U 0 , where p R (R) is the probability for the atomistic model to sample a specific CG configuration R. In the canonical ensemble, one has p R (R) = drp r (r)δ(M(r) − R) where β = 1/k B T , u(r) is the microscopic potential energy of the system, p r (r) ∝ exp(−βu(r)) is the Boltzmann distribution and Z the associated configurational partition function.As such, the probability p R (R) of a CG macrostate is obtained as a sum of the probabilities of the high-resolution model p r (r) over all microscopic configurations r that map onto the same CG site coordinates R.
From Eqs. 2 and 3 it follows that a computer simulation of the low-resolution system performed with the potential U 0 (more precisely, a free energy) would allow the CG sites to sample their configurational space with the same probability as they would do in the reference system.Unfortunately, the intrinsically multi-body nature of U 0 is such that its exact determination is largely unfeasible in practice [12].Thus, considerable effort has been devoted to devise increasingly accurate methods to approximate the PMF [13][14][15]: the relevant large-scale equilibrium properties of the reference system are then expected to emerge as a consequence of fundamental interactions aptly entailed in these effective potentials, rather than targeted explicitly during parameterisation for exact (or near-exact) reproduction as it is the case in top-down modelling.
Inevitably, however, a CG model loses information about the high-resolution reference even if the interactions completely capture the underlying PMF, since the CG representation has reduced degrees of freedom [8,16].In the absence of limitations arising from an inaccurate parametrisation of the interactions, the amount of information lost depends only on the number and selection of retained degrees of freedom.Generally, the mappingi.e., the simplified structural representation-is chosen based on general and intuitive criteria: for example, it is rather natural to represent a protein in terms of one single centroid per amino acid (usually the choice falls on the α carbon of the backbone) [17].However, it is by no means assured that a given representation which is natural and intuitive to the human eye is also the one which allows the CG model to retain the largest amount of information about the original, higher-resolution system [18,19].Moreover, the choice of a given mapping will influence the ability of simple, computationally tractable interaction potentials to well-approximate the true PMF.A quantitative criterion to assess how much detail is lost upon structural coarsening is thus needed in order to perform an optimal choice.
Here, we tackle this issue using the concept of mapping entropy, S map [14,[20][21][22], a quantity that measures the quality of a CG representation in terms of the distance between probability distributions -the Boltzmann distribution of the reference, all-atom system, and the equivalent distribution when the AA probabilities are projected into the CG coordinate space.The mapping entropy is ignorant of the parametrisation of the effective interactions of the simplified model; in contrast, S map effectively compares the reference system, described through all its degrees of freedom, to the same system in which configurations are viewed through "coarse-graining lenses".The difference between these two representations only lies in the resolution, not in the microscopic physics.
In the following, we illustrate a computationally effective protocol that enables the approximate calculation of the mapping entropy.In analogy with the pioneering work of Ref. [22], we employ this novel scheme to identify those representations of the reference molecular system that feature the lowest mapping entropy-that is, allowing for the smallest amount of information loss upon resolution reduction.The method is applied to three proteins of different size; we then show that the optimal choice of retained degrees of freedom, guided by the objective of preserving the largest amount of information while reducing the complexity of the system, highlights biologically meaningful and a priori unknown structural features.

II. RESULTS
In this section we report the main findings of our work.Specifically, (i ) we outline the theoretical and computational framework that constitutes the basis for the calculation of the mapping entropy; (ii ) we illustrate the biological systems that are case studies for the method; and (iii ) we describe the results of the mapping entropy optimisation for these systems and the properties of the associated optimal mappings.

A. Theory
The concept of mapping entropy as a measure of the loss of information inherently generated by performing a CG'ing procedure on a system-i.e., regardless of our limitations in exactly parametrising its PMF-was first introduced by one of us in the framework of the relative entropy method [14,21], and subsequently expanded in Refs.[20,22].For the sake of brevity, we here omit the formal derivation connecting relative entropy and mapping entropy as well as a discussion of the former, all details being provided in Appendix A.
In the following we restrict our analysis to the case of decimation mappings M, in which a subset of N < n atoms of the original system is retained while the remaining ones are integrated out, so that M I (r) = σ i r i , σ i = 1 for one I, 0 otherwise, ( 4) In this case, the mapping entropy S map reads (see Appendix A) that is, a Kullback-Leibler (KL) divergence D KL [23] between the probability distribution p r (r) of the highresolution system and the distribution obtained by observing the latter through "coarse-graining glasses", pr (r).Following the notation of Ref. [20], pr (r) is defined as where p R (R) is the probability of the CG macrostate R presented in Eq. 3, while is the degeneracy of the macrostate-how many microstates map onto the CG configuration R.
The calculation of S map in Eq. 5 thus amounts at determining the distance (in the KL sense) between two, although both microscopic, conceptually very different distributions.In contrast to p r (r), Eq. 6 displays that pr (r) associates, to all configurations that map onto the same CG macrostate R, the same probability; the latter is given by the average of the original probabilities of these microstates.Importantly, pr (r) represents the high-resolution description of the system that would be accessible only starting from its low-resolution onei.e., p R (R).Grouping together configurations into a CG macrostate has the effect of flattening the detail of their original probabilistic weights.An attempt to revert the CG'ing procedure and restore an atomistic resolution by reintroducing the mapping operator M in p R (R) can only result in microscopic configurations that are uniformly distributed within each macrostate.
Due to the smearing in probabilities, the CG'ing transformations constitute a semi-group [24].This irreversible character highlights a fundamental consequence of CG'ing strategies: a loss of information about the system.The definition, based on the KL divergence, presented in Eq. 5 is useful for practical purposes.A more direct understanding of this information loss and how it is encoded in the mapping entropy, however, can be obtained by considering the non-ideal configurational entropies of the original and CG systems, respectively quantifying the information content of the associated probability distributions, p r (r) and p R (R): the higher the entropy, the lower the information in the distribution [25].By virtue of Gibbs' inequality, from Eq. 5 one has S map ≥ 0. Furthermore, see Appendix A s R − s r = S map ≥ 0, (10) so that the entropy of the CG system is always larger than the reference, microscopic one, implying that a loss of information occurs in decreasing the level of resolution [20,22].Critically, the difference between the two information contents is precisely the mapping entropy.Noteworthy, no reference to the form of the CG interactions is made in S map , so that the information that is lost in the CG'ing process only depends on the mapping operator M-in our case, on the choice of the retained sites.This paves the way for the possibility of assessing the quality of a CG mapping based on the amount of information it is able to retain about the original system, a qualitative advancement with respect to the more common a priori selection of CG representations [17].Unfortunately, Eq. 5 or 10 do not allow-except in very simple cases [22]-a straightforward computational estimate of S map for a system arising from a choice of its CG mapping, as the observables to be averaged involve logarithms of high-dimensional probability distributions, and ultimately configuration-dependent free energies.However, having introduced the loss of information per macrostate S map (R) defined by the relation [20,22] in Appendix B we show that this problem can be overcome by further subdividing microscopic configurations that map to a given macrostate according to their potential energy.Let us define the conditional probability P β (U |R) for the system, thermalized at inverse temperature β, to have energy U provided that is in macrostate R as so that S map (R) can be exactly rewritten as (see Appendix B): where U β|R is the average of the potential energy restricted to the CG macrostate R, This derivation enables a direct estimate of the mapping entropy S map from configurations sampled according to the microscopic probability distribution p r (r).For a given mapping, the histogram of these configurations with respect to CG coordinates R and energy U approximates the conditional probability P β (U |R) and, consequently, S map (R), see Eq. 13; the total mapping entropy can thus be obtained as a weighted sum of the latter over all CG macrostates, Eq. 11.
The only remaining difficulty consists in obtaining accurate estimates of the exponential average in Eq. 13, which are prone to numerical errors.As often in these cases [26,27], is it possible to rely on a cumulant expansion of Eq. 13, which truncated at second order provides Inserting Eq. 15 in Eq. 11 results in a total mapping entropy given by: For a CG representations to exhibit an exactly zero mapping entropy, it is required that all microstates r that map onto a given macrostate R = M(r) have the same energy in the reference system.Indeed, in this case one has P β (U |R) = δ(U − ūR ) in Eq. 13, with ūR being the potential energy common to all microstates within macrostate R, and consequently S map (R) = 0. Eq. 15 highlights that deviations from this condition result in a loss of information associated to a particular CG macrostate that is proportional to the variance of the potential energy of all the atomistic configurations that map to R. The overall mapping entropy is an average of these energy variances over all macrostates, each one weighted with the corresponding probability.
In the numerical implementation we thus seek to identify those mappings which cluster together atomistic configurations having the same, or at least very close energy, so as to minimise the information loss arising from CG'ing.With respect to Eq. 16, we further approximate S map to its discretised counterpart (see Methods), where we identify N cl discrete CG macrostates R i , each of which contributes to Smap with its own probability p R (R i ) taken as the relative population of the cluster.
We then employ an algorithmic procedure to estimate and efficiently minimise a cost function (Eq. 24 of the Methods section), which is defined as an average of values of Smap computed over different CG configuration sets, each of these being associated to a given number of conformational clusters N cl .

B. Biological structures
The results of the previous section are completely general and independent of the specific features of the un-derlying system.Here, we focus our attention on three proteins we chose to constitute a small yet representative set of case studies differing by size, conformation, and degree of internal flexibility, as well as biological function.
Each protein is equilibrated, then simulated for 200 ns in the NVT ensemble with physiological ion concentration.Out of 200 ns, snapshots every 20 ps are extracted from each trajectory, for a total 10 4 AA configurations per protein employed throughout the analysis.Details about the simulation parameters as well as a quantitative analysis of MD trajectories can be found in the Supporting Information.Hereafter we provide a description of each molecule, along with a brief summary of its behaviour as observed along MD simulations.
[TAM] A recently released 31-residue tamapin mutant (PDB code 6D93).Tamapin is the toxin produced by the Indian red scorpion.It features a remarkable selectivity towards a peculiar calcium-activated potassium channel (SK2), whose potential use in the pharmaceutical context has made it a preferred object of study during the past decade [28,29].Throughout our simulation almost every residue is highly solvent-exposed.Side chains fluctuate substantially, thus giving rise to an extreme structural variability.
[AKE] Adenylate Kinase (PDB code 4AKE).It is a 214 residue-long phosphotransferase enzyme which catalyses the interconversion between adenine diphospate (ADP) and monophospate (AMP) and their energetically rich complex, Adenine triphospate (ATP) [30].It can be subdivided in three structural domains, CORE, LID, and NMP [31].The CORE domain is stable, while the other two undergo large conformational changes [32].Its central biochemical role in the regulation of the energetic balance of the cell and relatively small size, combined with the possibility to observe conformational transitions over timescales easily accessible by plain MD [33], make it the ideal candidate to test and validate novel computational methods [18,34,35].In our MD simulation the protein displays many rearrangements in the two motile domains, which occur to be quite close at many points.Nevertheless, the protein does not undergo a full open ↔ closed conformational transition.
[AAT] α − 1 antitrypsin (PDB code 1QLP).With 5934 atoms (372 residues), this protein is almost two times bigger than adenylate kinase.α − 1 antytripsin is a globular biomolecule and it is well known to exhibit a conformational rearrangement over the timescales of the minutes [36][37][38].During our simulated trajectory the molecule experiences fluctuations particularly localised in correspondence of the most solvent-exposed residues.The protein bulk appears to be very rigid, and there is no sign of a conformational rearrangement.

C. Minimisation of the mapping entropy and characterisation of the solution space
The algorithmic procedure described in the Methods section and Appendix B enables one to quantify the information loss experienced by a system as a consequence of a specific decimation of its degrees of freedom.This quantification, which is achieved through the approximate calculation of the associated mapping entropy, opens the possibility of minimising such measure in the space of CG representations, so as to identify the mapping that, for a given number of CG sites N , is able to preserve as much information as possible about the AA reference.
In the following we allow CG sites to be located only on heavy atoms, thus reducing the maximum number of possible sites to N heavy .We then investigate the properties of various kinds of CG mappings having different numbers of retained sites N .Specifically, we consider three chemically-intuitive values of N for each biomolecule: (i ) N α , i.e., the number of C α atoms of the structure (equal to the number of amino acids); (ii ) N αβ , the number of C α and C β atoms; and (iii ) N bkb , which results from counting all the heavy atoms belonging to the main chain of the protein.The values of N for mappings (i )-(iii ) in the case of TAM, AKE and AAT are listed in Tab.I, together with the corresponding N heavy .
Even restricting N to N α , N αβ and N bkb , the combinatorial dependence of the number of possible decimation mappings on the amount of retained sites and N heavy makes their exhaustive exploration unfeasible in practice (see Methods).To identify the CG representations that minimise the information loss we thus rely on a Monte Carlo simulated annealing approach (SA, see Supporting Information) [39,40].For each analysed protein and value of N , we perform 48 independent optimisation runs and store the CG representation characterised by the lowest value of Σ in each run, thus resulting in a pool of optimised solutions.In order to assess their statistical significance and properties, we also generate a set of random mappings and calculate the associated Σ's, which constitute our reference values.
Fig. 1 displays, for each value of N considered, the distribution of mapping entropies obtained from a random choice of the CG representation of TAM, AKE, and AAT together with each's optimised counterpart.For N = N bkb and N = N α in Fig. 1 we also report the values of Σ associated to physically-intuitive choices of the CG mapping that are commonly employed in the literature: the backbone mapping (N = N bkb ), which neglects all atoms belonging to the side chains; and the C α mapping (N = N α ), in which we only retain the C α atoms of the structures.The first is representative of united-atom CG models, while the second is a ubiquitous and rather intuitive choice to represent a protein in terms of a single bead per amino acid [17].
The optimality of a given mapping with respect to a random choice of the CG sites can be quantified in terms where µ and σ represent mean and standard deviation of the distribution of Σ over randomly sampled mappings, respectively.Table II summarises the values of Z found for each N for the proteins under examination, including Z[backbone] and Z[C α ], which are computed with respect to the random distribution generated with N = N bkb and N = N α respectively.As for the physically intuitive CG representations, Fig. 1 shows that the value of Σ associated to the backbone mapping is very high for all structures.For TAM in particular, the amount of information retained is so low that the mapping entropy falls 4.37 standard deviations higher than the reference distribution of random mappings, see Table II.This suggests that neglecting the side chains in a CG model of a protein is detrimental, at least as far as the structural resolution is concerned.In fact, the backbone of the protein undergoes relatively minor structural rearrangements when exploring the neighbourhood of the native conformation, thereby inducing negligible energetic fluctuations; for side chains, on the other hand, the opposite is true, with comparatively larger structural variability and a similarly broader energy range associated to it.Removing side chains from the mapping induces the clustering of atomistically different structures with different energies onto the same coarse-grained configuration, the latter being solely determined by the backbone.The corresponding mapping entropy is thus large-worse than a random choice of the retained atoms-since it is related to the variance of the energy in the macrostate.Independently of this, an appropriate parametrisation of the effective interactions can in principle allow the low-resolution system to sample the conformational space correctly-that is, with a probability distribution equal or sufficiently close to the atomistic one (in terms of CG degrees of freedom R, see Eq. A3 in Appendix A); an effective potential arbitrarily close to the MB-PMF, however, does not imply that the model provides an informative representation of the original system as quantified by S map .
Calculations employing the C α mapping for the three structures show that this provides Σ values that are very close to the ones we find with the backbone mapping, thus suggesting that C α atoms retain about the same amount of information that is encoded in the backbone.This is reasonable, given the rather limited conformational variability of the atoms along the peptide chain.However, a comparison of the random case distributions for a number N α and N bkb of retained atoms in Fig. 1 reveals that the former generally has a broader spread than the latter, due to the lower number of CG sites; consequently, the Σ of the C α atoms mapping is closer to the bulk of the distribution of the random case than that of the backbone mapping.
We now discuss the case of optimised mappings, that is, CG representations retaining the maximum amount of information about the AA reference.Each of the 48 minimisation runs, which have been carried out for each protein in the set and value of N considered, provided an optimal solution-a deep local minimum in the space of CG mappings; the corresponding Σ's spread over a compact range of values that are systematically lower than, and do not overlap with, those of the random case distributions (Fig. 1).
Optimal solutions for AKE and AAT span a wide interval of values of Σ; when N = N α in particular, the support of this set and of the corresponding random reference have comparable sizes.A quantitative measure of this broadness is displayed in the distributions of Z scores of optimal solutions presented in Table II.On the other hand, TAM shows a narrower distribution of optimal values of Σ for all values of N .As discussed in Sec.II D, this can be ascribed to the fact that most of the energy fluctuations in TAM-and consequently mapping entropy-are carried by a subset of atoms that are almost always retained in each optimal mapping, in contrast to a random choice of the CG representation.At the same time, the associated Z scores are lower than the ones of the bigger proteins for all values of N under examination, as TAM conformations generally feature a lower variability in energy than the other molecules.
For all the investigated proteins, the absence of an overlap between the distributions of Σ associated to random and optimised mappings raises some relevant questions.First, one might wonder what kind of structure the solution space has, that is, if the identified solutions lie at the bottom of a rather flat vessel or, on the contrary, each of them is located in a narrow well, neatly separated one form the other.Second, it is reasonable to ask whether some degree of similarity exists between these quasi-degenerate solutions of the optimisation problems and, in case, what significance this has.
In order to answer these questions, for each structure we select four pairs of mapping operators M opt that result in the lowest values of Σ.We then perform 100 independent transitions between these solutions, constructing intermediate mappings by randomly swapping two nonoverlapping atoms from the two solutions at each step and calculating the associated mapping entropy.Fig. 2 shows the results of this analysis for the pair of mappings with the lowest Σ (all the other transitions are reported in Fig. 8 of the Supporting Information).It is interesting to notice that the endpoints (that is, the optimised mappings) correspond to the lowest values of Σ along each transition path; by increasing the size of the proteins, the values of Σ for intermediate mappings get closer to the average of Σ random .We cannot rule out, by this analysis, the absence of lower minima over all the possible paths, although it seems quite unlikely given the available sampling.
This analysis thus addresses the first question by showing that at least the deepest solutions of the optimisation procedure are distinct from each other.It is not possible to (quasi)continuously transform an optimal mapping into another through a series of steps keeping the value of the mapping entropy low.Each of the inspected solutions is a small town surrounded by high mountains in each direction, isolated from the others with no valley connecting them.
The second question, namely what similarity, if any, exists among these disconnected solutions, is tackled in the following section.

D. Biological Significance
The degree of similarity between the optimal mappings can be assessed by a simple average, returning the frequency with which a given atom is retained in the 48 solutions of the optimisation problem.
Fig. 3 shows the probability P cons of conserving each heavy atom, separately for each analysed protein, computed as the fraction of times it appears in the pool of optimised solutions.One can notice the presence of regions that appear to be more or less conserved.Quantitative differences can be observed between the three cases under examination: while the heat map of TAM shows narrow and pronounced peaks of conservation probability, optimal solutions for AKE feature a more uniform distribution, where the maxima and minima of P cons extend over secondary structure fragments rather than small sets of atoms.The distribution gets even more blurred for AAT.
As index proximity does not imply spatial proximity in a protein structure, we mapped the aforementioned probabilities to the three-dimensional configurations.Results for TAM are shown in Fig. 4, while the corresponding ones for AKE and AAT are provided in Supporting Information (Fig. 7).From the distribution of P cons at dif- FIG.2: Values of the mapping entropy Σ [kJ/mol/K] of mappings connecting two optimal solutions.In each plot, one per protein under examination, the two lowest-Σ mappings are taken as initial and final endpoints (black dots) for paths constructed by swapping pairs of atoms between them (blue dots).For each protein, 100 independent paths at given N = N αβ are constructed and the mapping entropy of each intermediate point is computed.In each plot, horizontal lines represent the mean (red) and minimum (green) S map obtained from the corresponding distribution of random mappings presented in Fig. 1.
FIG. 3: Probability P cons that a given atom is retained in the optimal mapping at various numbers N of CG sites and for each analysed protein, expressed as a function of the atom index.Atoms are ordered according to their number in the PDB file.
ferent degrees of CG resolution N it is possible to infer some relevant properties of optimal mappings.For what concerns TAM (Fig. 4), it seems that, at the highest degree of CG (N = N α ), only two sites are always conserved, namely two nitrogen atoms belonging to ARG6 and ARG13 residues (P cons (NH1, ARG6) = 0.92, P cons (NH2, ARG13) = 0.96).The atoms which constitute the only other arginine residue, ARG7, are well conserved but with lower probability.By increasing the degree of resolution (N = N αβ ), i.e. employing more CG sites, we see that the atoms in the side chain of LYS27 appear to be retained more than average together with atoms of GLU24 (P cons (NZ, LYS27) = 0.65, P cons (OE2, GLU24) = 0.75).At N = 124 the distribution becomes more uniform, but still sharply peaked around terminal atoms of ARG6 and ARG13.
Interestingly, ARG6 and ARG13 have been identified to be the main actors involved in the TAM-SK2 channel interaction [41][42][43]: Andreotti et al. [41] suggested that these two residues strongly interact with the channel through electrostatics and hydrogen bonding.Furthermore, Ramírez-Cordero et al. [43] showed that mutating one of the three arginines of TAM dramatically decreases its selectivity towards the SK2 channel.
It thus appears that the mapping entropy optimisation protocol was capable of singling out the two residues that are crucial for a complex biological process relying only on data obtained in standard MD simulations, naturally performed in absence of the channel.The rationale for this can be found in the fact that such atoms strongly interact with the remainder of the protein, so that small variations of their relative coordinates have a large impact on the value of the overall system's energy.Retaining these atoms, and fixing their position in the coarse-grained conformation, thus enables the model to discriminate effectively the energy of the microstates associated to each macrostate.
For the AKE (Fig. 7) we have that when N = N α the external, solvent-exposed part of the LID domain is heavily coarse-grained, while its internal region is more Residues presenting the highest retainment probability P cons across N (ARG6 and ARG13) are highlighted.
conserved.The CORE region of the protein is always largely retained, without noteworthy peaks in probability.Such peaks, on the contrary, appear in correspondence of some terminal nitrogens of ARG36, LYS57 and ARG88 (P cons (NH2, ARG36) = 0.52, P cons (NZ, LYS57) = 0.48, P cons (NH2, ARG88) = 0.58).The two arginine amino acids are located in the internal region of the NMP arm, at the interface with the LID domain.ARG88 is known to be the most important residue for catalytic activity [44,45], as it binds the α-phosphate group of Ap5A, being central of phosphoryl transfer [46].Phenylglyoxal [47], a drug that mutates ARG88 to a glycine, has been shown to substantially hamper the catalytic capacity of the enzyme [46].ARG36 is also bound to phosphate atoms [45].Finally, LYS57 is on the external part of NMP and has been identified to play a pivotal role in collaboration with ARG88 to block the release of adenine from the hydrophobic pocket of the protein [48].More generally, this amino acid is crucial for stabilising the closed conformation of the kinase [49,50], which was never observed throughout the simulation.The overall probability pattern persists as N increases, even though less pronounced.
As for AAT, Fig. 7 shows that the associated optimisations heavily coarse-grain the reactive center loop of the protein.On the other hand, two of the most conserved residues in the pool of optimised mappings, MET358 and ARG101, are central to the biological role of this serpin.MET358 (P cons (CE, MET358) = 0.31) constitutes the reactive site of the protein [51].Being extremely inhibitor-specific, mutations or oxidation of this amino acid lead to severe diseases.In particular, heavy oxidation of MET358 is one of the main causes of emphysema [52].The AAT Pittsburgh variant shows MET358-ARG mutation, which leads to diminished antielastase activity but markedly increased antithrombin activity [36,51,53].In turn, ARG101 (P cons (CZ, ARG101) = P cons (NH1, ARG101) = P cons (NH2, ARG101) = 0.35) has a crucial role is due to its connection to mutations that lead to severe AAT deficiency [37,38].
In summary we observe that, in all the proteins inves-tigated, the presented approach identifies biologically relevant residues.With the exception of MET358 of AAT, the most probably retained atoms belong to amino acids which are charged and highly solvent-exposed.To quantify the statistical significance of the selection operated by the algorithm, we note that the latter detects those fragments out of a pool of 8, 69 and 100 charged residues for TAM, AKE and AAT, respectively.If we account for solvent exposition, these numbers reduce to 7, 32 and 40 considering amino acids with solvent accessible surface area (SASA) higher than 1 nm 2 .

III. DISCUSSION AND CONCLUSIONS
Bottom-up coarse-graining methods use the configurational landscape of a reference, high-resolution model of a physical system to construct a simplified representation that retains its large-scale properties.The interactions among effective sites are parametrised by directly integrating out (in an exact or approximate manner) the higher-resolution degrees of freedom, and imposing the equality of the two representations' partition functions.
These approaches have a long and successful history in the field of statistical mechanics and condensed matter, the most prominent example probably being Kadanoff's spin block transformations of ferromagnetic systems [54].This process, which lies at the heart of real-space renormalisation group (RG) theory, allows the relevant variables of the system to naturally emerge out of a (potentially infinite) pool of fundamental interactions, thus linking microscopic physics to macroscopic behaviour [55,56].
The generality of the concepts of renormalisation group and coarse-graining has naturally taken them outside of their native environment [57], the whole field of coarsegrained modelling of soft matter being one of the most fruitful offsprings of this cross-fertilisation.However, a straightforward application of RG methods in this latter context is severely restricted by fundamental differences between the objects of study.Most notably, the fundamental assumptions of self-similarity and scale invariance, which justify the whole process of renormalisation at the critical point, clearly do not apply to, say, a protein, in that the latter does certainly not resemble itself upon resolution reduction.Furthermore, scaling laws cannot be applied to a system such as a biomolecule that is intrinsically finite, for which the thermodynamic limit is not defined.
Additionally, one of the key consequences of selfsimilarity at the critical point is that the filtering process put forward by the renormalisation group turns out to be largely independent of the specific coarse-graining prescription: the set of relevant macroscopic variables emerges as such for whatever choice of mapping operator is taken to bridge the system across different length scales [58].
In the case of biological matter, where the organisation of degrees of freedom is not fractal, rather hierarchicalfrom atoms, to residues, to secondary structure elements, and so on-the mapping operator acquires instead a central role in the "renormalisation" process.The choice of a particular transformation rule, projecting an atomistic conformation of a molecule to its coarse-grained counterpart, already implies an external -i.e.not emergentselection of which variables are relevant in the description of the system, and which others are redundant.In this way, what should be the main outcome of a genuine coarse-graining procedure is demeaned to be one of its ingredients.
It is in this context that information-theoretical measures, such as the mapping entropy, can represent gamechanging instruments [59].In fact, this quantity provides the necessary measure to gauge a given mapping, that is, a given selection of coarse-grained degrees of freedom, in terms of the amount of information that the corresponding lower-resolution representation can preserve.
In this work, we have tackled the issue of providing a simple and efficient manner to compute the mapping entropy of a molecular system in the canonical ensemble.This method, relying on controllable approximations and standard molecular dynamics simulations, proved to be appropriate to identify, by means of a stochastic optimisation procedure, mappings having low mapping entropy for a given number of retained atoms.These mappings have been shown to represent isolated instances, in that the values of the mapping entropy for intermediate mappings connecting two optimal solutions are much larger than the values of the latter.Furthermore, and most remarkably, we have observed that those atoms consistently retained with high probability across various optimal mappings at different CG site numbers tend to be located in amino acids which play a relevant role in the biological function of the three proteins under examination.
This result, which has been obtained on the basis of plain MD simulations of the substrate-free molecules in explicit solvent, suggests that a substantial amount of information about a protein's key functional residues is en-tailed in its structural and energetic features; moreover, the mapping entropy optimisation has proved capable of letting this information emerge.A procedure can thus be devised, which enables the parameter-free identification of those variables in a macromolecular system that recapitulate its relevant properties and large-scale behaviour.The protocol here described works based on a rationale that is qualitatively different from other approaches [60] which rely on exclusively structural, conformational, or topological features of the molecule to extract information about it.In the mapping entropy-based approach structural properties are associated to thermodynamical ones, so that both the conformational variability of the system and its energetics are taken into account -and made use of.
It is our opinion that the automated selection of coarsegrained sites entails a great potential for further development, being at the nexus between molecular mechanics, statistical mechanics, information theory, and biology.The presented method establishes a quantitative bridge between a molecules representation-and hence its information content-on the one side, and the structuredynamics-function relationship on the other.This protocol can thus hold the key to the identification of important regions of proteins, for example druggable sites and allosteric pockets, relying on relatively simple MD simulations and efficient analysis tools.

IV. METHODS
In this section we describe the procedure we employ to obtain the mapping function M(r), see Eq. 4, associated to minimal values of mapping entropy.
Equation 16 provides us with a way of measuring the mapping entropy of a biomolecular system associated to any particular choice of decimation of the atomistic degrees of freedom.One can visualise the decimation mapping (Eq.4) as an array of bits, where 0 and 1 correspond to not retained and retained atoms, respectively.Order matters: swapping two bits produces a different mapping operator.The total number of CG representations that may be designed for a biomolecule with such a prescription, i.e., selecting N atoms out of n, is: which is astronomical even for the smallest proteins.In our calculation we restrict the set of possibly retained sites to heavy atoms-excluding hydrogens-thus significantly reducing the cardinality of the space of mappings.Nonetheless, finding the global minimum of Eq. 16 for a reasonably large molecule would be computationally intractable whenever N is different from 1-2 and n−1-n−2.
Hence, it is necessary to perform the minimisation of this mapping entropy through a Monte Carlo-based optimisation procedure, specifically the simulated annealing (SA [39,40]) protocol.As it is typically the case with this method, the computational bottleneck consists in the calculation of the observable (the mapping entropy) at each SA step.
We develop an approximate method able to compute the mapping entropy of a biological system analysing a MD trajectory which can contain up to tens of thousands of frames.At each SA step, that is, for each putative mapping, the algorithm computes a similarity matrix among all the configurations, whose entries are given by the root mean square distance (RMSD) between structure pairs; CG configurations are identified as such by clustering frames based on this distance matrix, making use of the bottom-up hierarchical clustering (UPGMA [61]).The observable of interest is then calculated from the variances of the atomistic intramolecular potential energy of the protein corresponding to the frames that map onto the same CG conformational cluster.
The protocol is initiated with the generation of a mapping such that the overall number of retained sites is equal to N .Then, at each SA step, the following operations are performed: 1. swap a retained site (σ i = 1) and a removed site (σ j = 0) in the mapping; 2. compute a similarity matrix among CG configurations using the RMSD; 3. apply a clustering algorithm on the RMSD matrix in order to identify the CG macrostates R; 4. compute Smap using Eq.16.
Once the new value of Smap is obtained, the move is accepted/rejected using a Metropolis-like rule.The overall workflow of the algorithm is schematically illustrated in Fig. 5.
For the sake of the accuracy of the optimisation, the more exhaustive the sampling the better, so the number of sampled atomistic configurations should be at least of the order of the tens of thousands.However, in that case step 2 would require the computation of a very large matrix, which in turn would dramatically slow down the entire process.This problem is circumvented performing a reasonable approximation in the calculation of the RMSD matrix.

A. RMSD matrix calculation
The RMSD between two superimposed atomic structures is given by: where n is the number of atoms and x i , y i represent the cartesian coordinates of the i-th atom in the two sets.According to Kabsch [62,63] it is possible to find the superimposition that minimises this quantity, namely the rotation matrix U that has to be applied to x in order to reach the minimum of the RMSD.
The aforementioned procedure is not computationally heavy per se, but its repetition for all the pairs of configurations in a MD trajectory at each step of the Monte Carlo process would be intractable.
The simplest solution to this problem is to discard the differences in the Kabsch alignment between two structures differing by a pair of swapped atoms.This assumption is particularly appealing from the point of view of speed and memory, since the expensive and relatively slow alignment procedure produces a result (a rotation matrix) that can be stored with negligible use of resources.In order to take advantage of this simplification without losing accuracy, for each structure and degree of CG we select an interval of Simulated Annealing steps T K in which we consider rotation matrices constant.After these steps, the full Kabsch alignment is applied again.
This approximation results in a substantial reduction of the number of operations that we have to execute at each Monte Carlo step.At first, given the initial random mapping operator M, we build the sets of coordinates that have been conserved by the mapping operator Γ(M) = M(r).Then we compute the overall RMSD matrix between every pair of structures Γ 1 and Γ 2 , RM SD(Γ 1 (M), Γ 2 (M)).This quantity is updated at each step with this simple rule: where s and a are the removed (substituted) and added atom, respectively, and MSD is the Mean Squared Deviation.This approach clearly represents an approximation to the correct procedure; it has to be emphasised, however, that the impact of such approximation is increasingly perturbative as the size of the system grows.Furthermore, the computational gain that the described procedure enables is sufficient to counterbalance the fact that the exact protocol would be so inefficient to make the optimisation impossible.For example, choosing T K = 1000 for AAT with N = N bkb our approximation gives a speedup factor of the order of 10 3 .

B. Hierarchical Clustering of Coarse Grained configurations
Several clustering algorithms exist that have been applied to group molecular structures based on RMSD similarity matrices [64,65].Many such algorithms have been developed and incorporated in the most common libraries FIG.5: Schematic representation of the algorithmic procedure used to minimise the mapping entropy, calculated by means of Eq. 24.The full similarity matrix is computed once every T K steps, while in the intermediate steps we resort to the approximation given by Eq. 22. T K depends both on the protein and on N .T M AX is the number of simulated annealing steps, T M AX = 2 × 10 4 .
for data science such as scikit-learn [66].Among the various available methods we choose to resort on the agglomerative bottom-up hierarchical clustering with average linkage (UPGMA algorithm [61]).We here briefly recapitulate the basics underpinnings of this procedure.
1.At the first step, the minimum of the similarity matrix is found and the two corresponding entries x, y (leaves) are merged together in a new cluster k; 2. k is placed in the middle of its two constituents.The distance matrix is updated to take into account the presence of the new cluster in place of the two close structures: d(k, z) = (d(x, z) + d(y, z))/2; 3. Steps 1. and 2. are iterated until one root is found.The distance among clusters k and w is generalized as follows: where |k| and |w| are the populations of the clusters and k[i] and w[j] their elements; 4. The actual division in clusters can be performed by cutting the tree (dendrogram) using a threshold value on the inter-clusters distance or taking the first value of distance that gives rise to a certain number of clusters N cl .In both cases it is necessary to introduce a hyperparameter.In our case the latter is a more viable choice to reduce the impact of roundoff errors.Indeed, the first criterion would push the optimisation to create as many clusters as possible, in order to minimise the energy variance inside them (a cluster with one sample has zero variance in energy).
This algorithm, whose implementation [67,68] is available in Python Scipy [69], is simple, relatively fast (O(n 2 log n)), and completely deterministic: given the distance matrix, the output dendrogram is unique.
Although this algorithm scales well with the size of the dataset, it may not be robust with respect to small variations along the optimisation trajectory.In fact, even the slightest modifications of the dendrogram may lead to abrupt changes in Smap .This is perfectly understandable from an algorithmic point of view, but it is deleterious for the stability of the optimisation procedure.Furthermore, the aforementioned choice of N cl is somehow arbitrary.Hence, we perform the following analysis in order to enhance the robustness of Smap at each MC move and to provide a quantitative criterion to set the hyperparameter: 1. Compute the RMSD similarity matrix between all the heavy atoms of the biological system under consideration; 2. Apply UPGMA algorithm to this object, retrieving the all-atom dendrogram; where |CL| is the cardinality of the list we chose.
The overall procedure amounts at identifying many different sets of CG macrostates R on which Smap can be computed, assuming that the average of this quantities can be used effectively as driving observable inside the optimisation.Noteworthy, this trivial assumption allows to increase the robustness of the SA optimisation and to keep in memory all the values of Smap calculated at different distances from the root of the dendrogram.

C. Simulated Annealing
We use Monte Carlo simulated annealing to stochastically explore the space of the possible decimation mappings associated to each degree of CG'ing.We here briefly describe the main features of our implementation of the SA algorithm, referring the reader to a few excellent reviews for a comprehensive description of the techniques that can be employed in the choice of temperature decay and parameter estimation [70,71].
We run the optimisation for 2×10 3 MC epochs, each of which is composed by 10 steps.This amounts at keeping the temperature constant for 10 steps and then decreasing it according to an exponential law.For the i-th epoch we have that T (i) = T 0 e −i/ν .The hyperparameters T 0 and ν are crucial for a wellbehaved MC optimisation.We choose ν = 300 so that the temperature at i = 2000 is approximately T 0 /1000.In order to feed our algorithm with reasonable values of T 0 , for each of 100 random mappings we perform 10 MC stochastic moves, measuring ∆Σ, namely the difference between the observables computed at two consecutive steps.Then we estimate T 0 so that a move that leads to an increment of the observable equal to the average of ∆Σ would possess an acceptance probability of 0.75 at the first step.

FIG. 1 :
FIG. 1: Distributions of the values of mapping entropy Σ [kJ/mol/K] in Eq. 18 for random mappings (light blue histogram) and optimised solutions (green histograms).Each column corresponds to an analysed protein, each row to a given number N of retained atoms.In the first and last rows, corresponding to numbers of CG sites equal to the number of C α atoms and of backbone atoms, N α and N bkb respectively, the values of the mapping entropy associated to the physically-intuitive choice of the CG sites (see text) is indicated by vertical lines (red for N = N α , purple for N = N bkb ).Note that the S map ranges have the same width in all plots.

FIG. 4 :
FIG.4: Structure of tamapin (one bead per atom) coloured according to the probability for each atom to be retained in the pool of optimal mappings.Each structure corresponds to a different number N of retained CG sites.Residues presenting the highest retainment probability P cons across N (ARG6 and ARG13) are highlighted.

FIG. 6 :FIG. 7 :
FIG.6: RMSD (top) and RMSF (bottom) for the three protein of interest.RMSD values are always calculated with respect to the first frame of the trajectory.For adenylate kinase we display also the RMSD with respect to the PDB of the closed conformation (4AKE wrt 1AKE, monomeric structure).LID and NMP domains are also highlighted in the RMSF plot.

FIG. 8 :
FIG.8: Values of the mapping entropy Σ [kJ/mol/K] of mappings connecting optimal solutions.In each plot, one per protein under examination, 100 transitions have been sampled between the three next-to-lowest-Σ pairs of optimal mappings at N = N αβ .Black dots indicate initial and final endpoints for paths constructed by swapping pairs of atoms between them (coloured dots).In each plot, horizontal lines represent the mean (violet) and minimum (green) S map obtained from the corresponding distribution of random mappings presented in Fig.1.The behaviour illustrated in Fig.2is preserved.

TABLE I :
Values of N α , N αβ , N bkb and N heavy (see text) for each analysed protein.

TABLE II :
Table of Z scores of each analysed protein.We report the mean and standard deviation of the distribution of Z values of the optimised solutions, Z, for all values of N investigated.Results for the standard mappings-Z[backbone] for backbone atoms only and Z[C α ] for C α atoms only-are also included.

TABLE III :
Bounds on inter-clusters distance and correspondent number of clusters.3. Impose lower and upper bounds (see Table III) on the inter-clusters distance depending on the conformational variability of the structure; 4. Visualise the cut dendrogram to identify the number of different clusters available at each of the two values of the threshold (N + cl and N − cl ) (Table III); 5. Build a list CL of five integers selecting three (intermediate) values between N − cl and N + cl ; 6. Define the observable as the average over the values of Smap (see Eq. 17) computed choosing different