The Answer Lies in the Energy: How Simple Atomistic Molecular Dynamics Simulations May Hold the Key to Epitope Prediction on the Fully Glycosylated SARS-CoV-2 Spike Protein

SARS-CoV-2 is a health threat with dire socioeconomical consequences. As the crucial mediator of infection, the viral glycosylated spike protein (S) has attracted the most attention and is at the center of efforts to develop therapeutics and diagnostics. Herein, we use an original decomposition approach to identify energetically uncoupled substructures as antibody binding sites on the fully glycosylated S. Crucially, all that is required are unbiased MD simulations; no prior knowledge of binding properties or ad hoc parameter combinations is needed. Our results are validated by experimentally confirmed structures of S in complex with anti- or nanobodies. We identify poorly coupled subdomains that are poised to host (several) epitopes and potentially involved in large functional conformational transitions. Moreover, we detect two distinct behaviors for glycans: those with stronger energetic coupling are structurally relevant and protect underlying peptidic epitopes, and those with weaker coupling could themselves be prone to antibody recognition.


Introduction
The novel coronavirus SARS-CoV-2, the etiological agent of COVID-19 respiratory disease, has infected millions of people worldwide, causing more than 500000 deaths (as of June 30 th , 2020) and extensive social and economic disruption. Given the pandemic status of the outbreak, social distancing measures cannot be sufficient any longer in containing it on a worldwide scale. This emergency calls for the development of strategies to rapidly identify pharmacological agents or vaccines as the only way to contain and combat the disease, restoring normal social conditions. Indeed, a number of currently ongoing trials focus on developing vaccines (see for instance https://www.nytimes.com/interactive/2020/science/coronavirus-vaccine-tracker.html) or on repurposing drugs already developed for other disorders [1][2][3][4] .
SARS-CoV-2 is extraordinarily effective in exploiting the host's protein machinery for replication and spreading. This is a characteristic that it shares with other members of the Coronaviridae family, all of which are characterized by a highly selective tropism that determines the onset of a variety of diseases in domestic and wild animals as well as in humans, including central nervous system affections, hepatitis and respiratory syndromes [5][6] . As was the case with its human predecessors SARS-CoV and MERS, critical players regulating cell entry of SARS-CoV-2 are the homotrimeric viral Spike protein (S) ( Figure 1) and the host cell docking point, the protein receptor angiotensinconverting enzyme 2 (ACE2) 7 . The CoV S protein is then cleaved by a series of serine proteases, including trypsin, cathepsins, elastase, the host type 2 transmembrane serine protease (TMPRSS2, DPP4 for MERS), and plasmin, which promote virus entry into epithelial cells 4 .
Recent cryo-electron microscopy (cryo-EM) analyses allowed to precisely determine the structure of the full-length Spike protein in its trimeric form [8][9][10] , and the structural basis for the recognition of the Spike protein's Receptor Binding Domain (RBD; Figure 1) by the extracellular peptidase domain of ACE2 11 . In parallel, computational studies have started to provide atomically detailed insights into S protein dynamics and on the elaborate role of the diverse polysaccharide chains that decorate its surface, effectively shielding a large portion of it from the host [12][13][14] .
This detailed dynamic and structural knowledge can set the stage for understanding the molecular bases of S protein recognition by the host's immune system, providing information on which physicochemical determinants are required to elicit functional antibodies. Such understanding can then be exploited to design and engineer improved antigens based on S, for instance by identifying antigenic domains that can be expressed in isolation or short sequences (epitopes) that can be mimicked by synthetic peptides [15][16][17][18][19][20] : this would be a crucial first step in the selection and optimization of candidate vaccines and therapeutic antibodies, as well as in the development of serological diagnostic tools.
Furthermore, knowledge acquired today about such recognition mechanisms could well mean that we are better prepared to tackle similar pandemics in the future by contrasting them more efficiently with the application of efficient and well-tested methods to new protein variants.
Here, we analyze representative 3D conformations of the full-length trimeric S protein in its fully glycosylated form (Figure 1), extracted from atomistic molecular dynamics (MD) simulations provided by the Woods group 13 , in order to predict immunogenic regions. To this end, a simple ab initio epitope prediction method that we previously developed for unmodified proteins [21][22][23] is optimized and extended to cover glycoproteins. The method is based on the idea that antibody-recognition sites (epitopes) may correspond to localized regions only exhibiting low-intensity energetic coupling with the rest of the structure; this lower degree of coupling would allow them to undergo conformational changes more easily, to be recognized by a binding partner, and to tolerate mutations at minimal energetic expense.
We show that our approach is indeed able to identify regions-also comprising carbohydrates-that recent structural immunology studies have shown to be effectively targeted by antibodies. On the same basis, our method predicts several additional potential immunogenic regions (currently still unexplored) that can then be used for generating optimized antigens, either in the form of recombinant isolated domains or as synthetic peptide epitopes. Finally, our results help shed light on the mechanistic bases of the large-conformational changes underpinning biologically relevant functions of the protein.
To the best of our knowledge, this approach is one of the first that permits to discover epitopes in the presence of glycosylation (an aspect that is often overlooked), starting only from the analysis of the physico-chemical properties of the isolated antigen in solution. Importantly, the approach does not require any prior knowledge of antibody binding sites of related antigenic homologs and does not need to be trained/tuned with data sets or ad hoc combinations of information on sequences, structures, SASA or geometric descriptors. The procedure is thus immediately and fully portable to other antigens.

Results and Discussion
To reveal the regions of the S protein that could be involved in antibody (Ab) binding, we employ a combination of the Energy Decomposition (ED) and MLCE (Matrix of Low Coupling Energies) methods, which we previously introduced and validated 21-32 and discuss in full in the Methods section.
Starting from 6 combined 400 ns replicas of atomistic molecular dynamics simulations of the fully glycosylated S protein in solution 13 (built from PDB ID: 6VSB 8 ), we isolate a representative frame from each of the three most populated clusters. ED and MLCE analyses of protein energetics assess the interactions that each aminoacid and glycan residue in S protomers establishes with every other single residue in the same protomer. In particular, we compute the nonbonded part of the potential energy (van der Waals, electrostatic interactions, solvent effects) implicitly, via an MM/GBSA calculation (molecular mechanics/generalized Born and surface area continuum solvation) 33 , obtaining, for a protomer composed of N residues (including monosaccharide residues on glycans), a symmetric N × N interaction matrix Mij. Eigenvalue decomposition of Mij highlights the regions of strongest and weakest coupling. The map of pairwise energy couplings can then be filtered with topological information (namely, the residue-residue contact map) to identify localized surface networks of low-intensity coupling (i.e., clusters of spatially close residue pairs whose energetic coupling to the rest of the structure is weak and not energetically optimized through evolution).
In our model, when these fragments are located on or near the surface, contiguous in space and weakly coupled to the protein's 'stability core', they represent potential interaction regions (i.e., epitopes). Otherwise put, putative interacting patches are hypothesized to be characterized by nonoptimized intramolecular interactions. Actual binding to an external partner such as an Ab is expected to occur if favorable intermolecular interactions determine a lower free energy for the bound than the unbound state 21,23,34 . Furthermore, minimal energetic coupling with the rest of the protein provides these subregions with greater conformational freedom to adapt to a binding partner and improves their ability to absorb mutations without affecting the protein's native organization and stability in a way that could be detrimental for the pathogen: all these properties are indeed hallmarks of Ab-binding epitopes.
Once interacting vicinal residue pairs (i, j)are identified by cross-comparison with the residue-residue contact map (vide supra and Methods Section), identification of poorly coupled regions representing potential epitopes proceeds as follows. Residue pairs are firstly ranked in order of increasing interaction intensity (from weakest to strongest). Two distinct sets of energetically decoupled regions are then mapped by applying two distinct cutoffs ('softness thresholds') to the residue pair list: either from the first 15% or from the first 5% of the ranked pairs (i.e., the 5% or 15% of the residue pairs with the weakest energetic coupling).
The less restrictive 15% cutoff subdivides the full-length, fully folded S protein into potentially immunoreactive domains (see Figure 1B,C and Methods) 22, 25,34 . The goal is to uncover regions that may normally be hidden from recognition by Abs in the native protein structure, but that can be experimentally expressed as isolated domains. Highly reactive neutralizing epitopes may in fact be present only in specific but transient conformations that are not immediately evident in the static Xray and EM models of the protein or are not accessible even to large scale MD simulations. Presenting these (cryptic) regions for Ab binding through their isolated parent domains may prove more advantageous in developing new immunogens 22, 25 .
The more stringent epitope definition (5% cutoff) narrows the focus on those (smaller) intra-domain regions that could be directly involved in forming the interface with Abs, and that can then be used to guide the engineering of optimized antigens in the form of synthetic epitope peptidomimetics. In this context, to be defined as epitopes, the energetically uncoupled regions must be at least 6-residue long.
Upon there is a large overlap with regions recognized by Abs and nanobodies (revealed by recent X-ray and cryo-EM structures). Importantly, for example, our calculation correctly identifies the binding region of mAb CR3022 35 , known to target a cryptic epitope that is exposed only upon significant structural rearrangement of the protein 12 (Figures 2 and Figure 4). A second domain that is found to host a large network of non-optimized interactions corresponds to the N-terminal domain ( Figure 1B, C, D). The latter has been shown to bind the new antibody 4A8 in a paper that was published during the preparation of this manuscript 36 .
A third region predicted to be highly antigenic coincides with the central/C-terminal part of the S1A domain. In a recent cryo-EM study of polyclonal antibodies binding to the S protein, this substructure was shown to be in the vicinity of the density for COV57 Fab(s), a novel Ab whose neutralizing activity showed no correlation with that of RBD-targeting Abs 37 ( Figure 1B, D). Overall, these findings support the validity of our approach in identifying protein domains that can be aptly used as highly reactive immunogens, as they are most likely to be targeted by a humoral immune response. Our analysis predicts that regions other than the S protein RBD may represent alternative targets for neutralization or functional perturbation of SARS-CoV-2. On the one hand, this may be important considering the fact that RBD can also be the target on non-neutralizing antibodies, e.g. 3022 35 . Indeed, using cocktails of antibodies to target different regions of S has recently been proposed as a viable therapeutic option 36 .
Turning to our more stringent definition of epitope, based exclusively on the top 5% of the most weakly coupled residue pairs (5% cutoff), we next focus on those regions of the S antigen that can be involved in forming contacts with antibodies.
Importantly, one predicted conformational epitope with sequence (348)  Interestingly, an additional predicted patch comprising a set of decorating carbohydrates is correctly predicted to be part of the interface with antibody S309 (6wpt.pdb; 6wps.pdb) 41 , with aminoacidic sequence (332)ITNLC(336)-(361)C and with (N334-linked) carbohydrate sequence β-D-GlcNAc-(α-L-Fuc)-β-D-GlcNAc-β-D-Man (i.e., UYB-(0fA)-4YB-VMB according to the GLYCAM naming convention 43 ).This predicted region sits notably close to the RBD interaction surface with ACE2. Antibody EY6A (6zdh.pdb) binds the RBD in the region of the cryptic epitope described by Wilson and collaborators 35 (Figure 4). Importantly, our predicted patch (365)YSVLYN(370)-(384)PTKLN(388) covers a significant part of the epitope. Once again, it is worth remarking that identification of this potentially immunoreactive patch is simply and exclusively obtained from structural and energetic interaction data generated for a protomer of the glycosylated, isolated S protein, after simulation with a general use forcefield (see Methods section).  In general, our approach is able to identify potential immunoreactive domains and epitopes of the Spike protein based only on structural and energetic information. Sequences predicted to be reactive using the restrictive epitope definition (5% cutoff) can be used for generating optimized antigens in the form of synthetic peptide epitopes. Engineering such epitopes would entail the synthesis of conformationally preorganized peptidomimetics of the 'natural' reactive regions-with intra-and extracellular stability enhanced through, e.g, a combination of natural and non-natural aminoacidswhich could reproduce the main structural and energetic conditions required to elicit a humoral immune response, as well as constituting candidates for vaccine development. Furthermore, reactive peptides thus identified may be suitable for use as baits in serologic diagnostic applications (e.g., in ELISA assays and in microarrays), to capture and detect to capture and detect not only circulating antibodies that are expressed in response to SARS-CoV-2 infections but also those that are endowed with neutralization activity and thus potentially predicting the infection outcome. As a further application, these peptide-based baits can represent a useful tool for isolating new mAbs and the screening of small molecules for drug development. One the most significant aspects of our approach is that the S protein's entire glycan shield is explicitly accounted for in the prediction of the immunoreactive regions. Indeed, the various oligosaccharide chains appear to behave differently (see differential coloring of oligosaccharide chains in Figure 1). In light of their stronger energetic coupling to other areas of the protein, some of the glycans are not recognized as epitopes, and thus form an integral part of the stabilizing intramolecular interaction network of S (white chains in Figure 1B); on the other hand, MLCE also identifies a second subset of poorly coupled oligosaccharides as potentially reactive epitopes (or part thereof) (colored oligosaccharide chains in Figure 1B; carbohydrate cluster in S2 targeted by the glycan-dependent antibody HIV-1 bnAb 2G12, see Figure 1B, 2), highlighting potential vulnerable spots in the glycan shield that could be exploited to design novel immunoreagents and vaccine candidates.
The portion of the glycan shield falling within the former category thus mainly serves to protect the protein from recognition by antibodies and consequently enhances viral infectiousness, as well as providing extra structural support. Two such glycans are further exemplified in Figure 6. The first is the entire oligosaccharide fragment bound to N234 ( Figure 6A), which is recognized by Amaro and coworkers as being crucial in 'propping up' the RBD 12 . Experimental deletion of N-glycans at this position by way of a mutation to Ala significantly modifies the conformational landscape of the protein's RBD 44 . The second is the portion of the N165-linked glycan whose subunits are rendered in yellow ( Figure 6B): consistent with experimental studies indicating N165-linked oligosaccharides

A B
as structural modulators 44 , we also find that the portion in question is not identified as a potential epitope, being consequently involved in diverting antibodies from targeting the region around N165 and thus preserving control of the S protein's structural dynamics.
Reflecting the multifaceted roles of the glycan shield, the remaining part of the N165-linked glycan ( Figure 6B; orange) appears instead to belong to the category of glycans that are potentially able to act as epitopes, since, unlike the part in yellow, we do detect it to be decoupled from the rest of the protomer.
lt is particularly significant to underline that MLCE, whose physical basis is to identify non-optimized interaction networks, detects peptidic epitopes even when they are in proximity of (optimized, nonimmunogenic) shielding carbohydrates. In light of this, it is reasonable to suggest that the protective effect of these particular carbohydrates may be circumvented and neutralized by exposing the underlying peptidic substructures. Furthermore, information on oligosaccharides identified as epitope constituents can be exploited to design glycomimics or glycosylated peptides as synthetic epitopes.
The latter aspect is indeed particularly relevant: small synthetic molecules that mimic antigenic determinants (and effectively act as their minimal surrogates) offer enticing opportunities to develop immunoreagents with superior characteristics in terms of ease of handling, reproducibility of batchto-batch production, ease of purification, sustainable cost, and better stability under a variety of conditions. Furthermore, production of these molecules greatly reduces the risk of cross-reactivity with any copurified antigens, which is instead rife when dealing with recombinant proteins. In contrast to smaller peptides or sugar-decorated peptidomimetics, a full-length recombinant antigenic protein (or any protein-based detection device) would typically require more stringent conditions (e.g., in terms of temperature and humidity) for storage, transport, and management in order to preserve the protein in its properly folded active form. The same would be true for other vaccinal solutions such as deactivated pathogens.
Overall, our work confirms how simple and transparent structural and physico-chemical understanding of the molecule that is the key player in SARS-CoV-2 viral infection can be harnessed to guide the prediction of (in some cases experimentally confirmed) regions, that are involved in immune recognition and to understand its molecular bases. Agreement with experiment confirms that knowledge generated in the process has the potential of being translated into new molecules for vaccine and diagnostic development. In this context, we have also identified potentially reactive regions in the S protein stalk that are currently under experimental synthesis and testing. leads to a symmetric × interaction matrix %& . [50][51] The implicit GB solvation model used in these calculations is identical to the one used in the preceding minimization step (vide supra), except that the implicit ion concentration is set to 0.0 M, and SASA is computed with the ICOSA method (based on icosahedra around each atom that are progressively refined to spheres).

ENERGY DECOMPOSITION
The symmetric interaction matrix %& obtained from separate MM/GBSA calculations (vide supra) on each of the three S protein protomer models under study (with residues) can be expressed in terms of its eigenvalues and eigenvectors as where * is the a-th eigenvalue and % * is the i th component of the corresponding eigenvector.
It was previously shown in a number of cases that eigenvector ( . ), also called first eigenvector, associated with the lowest eigenvalue . allows to identify most of the crucial aminoacids necessary for the stabilization of a protein fold, and consequently those aminoacids that are minimally coupled to such core. The latter were shown to correspond to potential interaction regions.
In the case of multidomain proteins such as S, the first eigenvector is not sufficient, and more eigenvectors are needed to capture the essential interactions for folding/stability and binding. The interaction matrix %& is thus decomposed instead via the alternative approach developed by Genoni et al. 34 . In this scenario, the aim is to select the smallest set of / eigenvectors that cover the largest part of residues (i.e., components) with the minimum redundancy under the assumption that: (where this time the sum occurs over / essential eigenvectors instead of residues). As detailed by Genoni et al. 34 ., the essential folding matrix 0 %& is subsequently further filtered through a symbolization process to emphasize the significant non-bonded interaction, yielding 0 %& 1 and finally subjected to a proper clustering procedure leading to domain identification.
The final simplified matrix 0 %& 1 resulting from domain decomposition thus only reports those residue pairs in the protomer that exhibit the strongest and weakest energetic interactions.

MLCE
Final epitope predictions are made using the Matrix of Local Coupling Energies method (MLCE), in which analysis of a given protein's energetic properties is combined with that of its structural determinants 21,23 . This approach allows to identify nonoptimized, contiguous regions on the protein surface that are deemed to have minimal coupling energies with the rest of the structure, and that have a greater propensity for recognition by Abs or other binding partners.
The MLCE procedure entails cross-comparison of the simplified pairwise residue-residue energy interaction matrix 0 %& resulting from domain decomposition (vide supra) with a pairwise residueresidue contact matrix %& . The latter matrix namely considers a pair of residues to be spatially contiguous (i.e., 'in contact') if they are closer than an arbitrary 6.0 Å-threshold; contact distances are measured between Cβ atoms in the case of non-glycine aminoacid residues, H atoms in the case of glycine residues, and between C1 atoms in the case of glycan residues.
The Hadamard product of the two matrices yields the matrix of the local pairwise coupling energies %& : Deriving the MLCE matrix allows to rank spatially contiguous residue pairs with respect to the strength of their energetic interactions (weakest to strongest). Selection of proximal pairs showing the weakest coupling with the rest of the protein ultimately defines putative epitopes; two distinct selections are carried out on the basis of two possible weakness (softness) cutoffs (5% or 15%), corresponding to the top 5% or 15% spatially contiguous residue pairs with the lowest energetic interactions.