Global Structure of the Intrinsically Disordered Protein Tau Emerges from Its Local Structure

The paradigmatic disordered protein tau plays an important role in neuronal function and neurodegenerative diseases. To disentangle the factors controlling the balance between functional and disease-associated conformational states, we build a structural ensemble of the tau K18 fragment containing the four pseudorepeat domains involved in both microtubule binding and amyloid fibril formation. We assemble 129-residue-long tau K18 chains with atomic detail from an extensive fragment library constructed with molecular dynamics simulations. We introduce a reweighted hierarchical chain growth (RHCG) algorithm that integrates experimental data reporting on the local structure into the assembly process in a systematic manner. By combining Bayesian ensemble refinement with importance sampling, we obtain well-defined ensembles and overcome the problem of exponentially varying weights in the integrative modeling of long-chain polymeric molecules. The resulting tau K18 ensembles capture nuclear magnetic resonance (NMR) chemical shift and J-coupling measurements. Without further fitting, we achieve very good agreement with measurements of NMR residual dipolar couplings. The good agreement with experimental measures of global structure such as single-molecule Förster resonance energy transfer (FRET) efficiencies is improved further by ensemble refinement. By comparing wild-type and mutant ensembles, we show that pathogenic single-point P301L, P301S, and P301T mutations shift the population from the turn-like conformations of the functional microtubule-bound state to the extended conformations of disease-associated tau fibrils. RHCG thus provides us with an atomically detailed view of the population equilibrium between functional and aggregation-prone states of tau K18, and demonstrates that global structural characteristics of this intrinsically disordered protein emerge from its local structure.


■ INTRODUCTION
Intrinsically disordered proteins (IDPs) are enriched in the proteomes of higher eukaryotes, where they perform essential functions. 1−3 In healthy neurons, the paradigmatic IDP tau binds and stabilizes microtubules. 1 In diseased neurons, tau loses the ability to bind to microtubules and forms the toxic aggregates associated with Alzheimer's and other neurodegenerative diseases. 2 Hyperphosphorylation of tau correlates with the progression of Alzheimer's disease. Tau has recently been shown to form biomolecular condensates. 3−6 Dysregulation of the formation of biomolecular condensates by mutations 7 and aberrant post-translational modifications such as phosphorylation 4,7 may underlie the pathogenicity of tau. Some tau mutations, e.g., P301L and P301S, show drastic effects in patients and are used in mouse models of tauopathies. 8,9 The conformational dynamics of tau around P301 may play a direct role in modulating the aggregation of tau in disease, 10−12 as studied also by molecular dynamics (MD) simulations of tau fragments. 12 Efforts to gain a clearer picture of the local conformational dynamics of tau promise a deeper understanding of its roles in health and disease.
The challenges in resolving structural ensembles of IDPs call for an integrative approach. 13 Important progress in dealing with the high flexibility of disordered biomolecules has been made using nuclear magnetic resonance (NMR) spectroscopy, 14−17 small-angle X-ray scattering (SAXS), 18 and single-molecule Förster resonance energy transfer (FRET). 19−23 To harness the full power of these experiments and interpret the data in detail, the construction of ensembles of structures 24−32 has proved to be a powerful strategy, especially for the interpretation of NMR experiments and the combination of multiple experimental methods. 31,33,34 For instance, Borgia et al. 32 combined data from single-molecule FRET, SAXS, dynamic light scattering, and fluorescence correlation spectroscopy with MD simulations to characterize the ensembles of a marginally stable spectrin domain and the IDP ACTR over a broad range of solution conditions. Gomes and co-workers 35 recently described an ensemble of the disordered N-terminal region of the Sic protein, obtained by integrating different combinations of SAXS, single-molecule FRET, and NMR experiments using the ENSEMBLE approach. 36 Structural ensembles obtained from computational modeling can be combined with experimental data by using Bayesian and maximum entropy ensemble refinement methods. 29,37−45 The Bayesian formulation accounts naturally for uncertainties in the measurements, the model used to generate the ensemble, and the calculation of observables from the ensemble members. 39 Input ensembles 46 are obtained, e.g., from MD simulations 44,47−49 or chain growth, 26,28,50−53 and are then minimally modified to account for the experimental observations. However, for long protein or nucleic acid chains, it is difficult to create initial ensembles that have sufficient overlap with the final ensemble for reliable ensemble refinement.
For experimental data that report on the local structure along the chain of a disordered protein, we expect that cumulative systematic errors in the MD force field will cause the summed squared error $\chi^2$ between model and experiment to grow linearly with the length of the chain. As a consequence, the overlap between input and final ensemble deteriorates exponentially as the chain grows in length. For long IDPs, only a few chains will therefore tend to dominate the ensemble after refinement, with the rest of the large ensemble being mostly irrelevant.
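The exponential loss of overlap can be illustrated with a toy calculation. The sketch below is not the procedure used in this work: it assumes that every residue contributes an independent Gaussian correction of hypothetical width `sigma` to the log-weight of a chain, and it reports the Kish effective sample size as a fraction of the ensemble size. The function name and all parameter values are illustrative.

```python
import numpy as np

def effective_fraction(n_residues, n_chains=20000, sigma=0.3, seed=0):
    """Kish effective sample size divided by the number of chains.

    Toy model (an assumption, not the paper's procedure): each residue
    contributes an independent Gaussian correction with standard
    deviation `sigma` to the log-weight of a chain.
    """
    rng = np.random.default_rng(seed)
    log_w = rng.normal(0.0, sigma, size=(n_chains, n_residues)).sum(axis=1)
    w = np.exp(log_w - log_w.max())      # subtract the maximum for stability
    w /= w.sum()
    return 1.0 / (n_chains * np.sum(w**2))

# Fraction of effectively contributing chains for increasing chain length
for n in (5, 20, 60, 129):
    print(n, effective_fraction(n))
```

In this toy model the log-weight variance grows linearly with the number of residues, so the fraction of chains that effectively contribute after reweighting drops roughly exponentially with chain length.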
The problem of poor overlap between the initial and final ensemble can be overcome by applying a bias already in the generation of the initial ensemble, e.g., by imposing restraints directly on observables or related quantities in the initial MD simulations. The use of chemical shifts and other NMR data in the structural modeling of flexible systems has a long and productive history. Approaches based on fragment selection proved particularly powerful. 54 −56 Protocols have been developed that combine biased fragment choice with corrections to remove the biases introduced. 42 In an early combination of biased chain growth with Bayesian weighting applied to tau K18, 28 overlapping peptide fragments were stitched together. Fragment selection was biased to double the radius of gyration in an otherwise overly compact ensemble. Steric clashes were resolved by energy minimization in implicit solvent, and high-energy structures were randomly removed in a pruning step. Excellent agreement with NMR observables 27 could be achieved by adjusting the weights of the ensemble members. However, formal and practical questions are raised: how does one incorporate experimental data already during chain growth without compromising the Bayesian framework of ensemble refinement, where such information would normally be used a posteriori? And how does one ensure that the final ensemble is well-defined and fully reproducible?
We will show here that in a Bayesian formulation any bias in ensemble generation can be accounted for fully and quantitatively in a final global refinement step by exploiting the direct connection of ensemble refinement to traditional free energy calculations. 39 Meaningful input ensembles can thus be generated without sacrificing the rigor and reproducibility of the ensemble refinement procedure.
We propose reweighted hierarchical chain growth (RHCG) as a general method to integrate data reporting on local structure into models of disordered and flexible polymeric molecules such as disordered proteins or nucleic acids. Protein chains are assembled from fragment structures, as obtained here from MD simulations. As in hierarchical chain growth (HCG), 52 chains with steric clashes are consistently removed in such a way that the resulting ensemble does not depend on arbitrary choices such as the direction of chain growth, N-to-C versus C-to-N. In RHCG, fragment choice is biased according to experiments reporting on the local structure. In a final reweighting step, any resulting bias is then removed. RHCG is thus a form of importance sampling.
Using RHCG, we arrive at an integrative model of tau K18 with atomic detail. Tau K18 contains the four pseudorepeat domains R1−R4 involved both in functional binding to microtubules 57 and in forming amyloid fibrils. 10,12 NMR chemical shift data that report on local structure are incorporated already during chain growth. Electrostatic 58 and other interactions between regions distant in sequence can impact the global structure of IDPs. Deviations from random coil behavior can emerge also from local residual structure. 29 For tau K18, it is not clear a priori how its local and global structure are shaped. We show that the RHCG ensembles also capture the global structure of tau K18, as probed by NMR RDC, single-molecule FRET, and SAXS measurements. The global structure of tau K18 is thus determined to a significant degree by its local structure.
By comparing wild-type (WT) and mutant sequences, we provide a molecular view of possible differences between tau in a healthy cell and tau with pathogenic mutations. Our modeling of tau K18 reveals turns as in microtubule-bound states and extended structures as in tau fibrils. We found that pathogenic single-point P301 mutations shift the equilibrium from the former to the latter, emphasizing the close connection between functional forms of tau in solution and the fibrillar structures in tau-associated pathologies.

■ THEORY

Bayesian Ensemble Refinement of Polymeric Molecules
We combine molecular simulations with ensemble refinement to create ensembles of proteins or nucleic acids that faithfully reflect the distribution of conformations in experiment. To create an initial ensemble, we adapt the hierarchical chain growth (HCG) method introduced recently, 52 as described in detail below. We then use Bayesian Inference of Ensembles (BioEn) 39 to adjust the weights of the individual ensemble members according to the experimental data, e.g., NMR chemical shifts.
BioEn ensemble refinement minimally adjusts the vector $\mathbf{w} = (w_1, \ldots, w_C)$ of normalized weights of individual chains $c = 1, \ldots, C$ in the ensemble to match the experimental data. We define a posterior $P(\mathbf{w}\mid\mathrm{data}, I)$ as a function of the weights $\mathbf{w}$,

$$P(\mathbf{w}\mid\mathrm{data}, I) \propto P_0(\mathbf{w}\mid I)\, P(\mathrm{data}\mid\mathbf{w}, I) \qquad (1)$$

with $P_0(\mathbf{w}\mid I)$ being the prior and $P(\mathrm{data}\mid\mathbf{w}, I)$ being the likelihood. Here, $I$ denotes background information, e.g., that we model polymeric molecules with internal structure. The BioEn maximum-entropy prior 38 is given by

$$P_0(\mathbf{w}\mid I) \propto \exp(-\theta S_{\mathrm{KL}}) \qquad (2)$$

with the confidence parameter $\theta$ and the Kullback−Leibler divergence

$$S_{\mathrm{KL}} = \sum_{c=1}^{C} w_c \ln\frac{w_c}{w_c^0} \qquad (3)$$

which reports how close the normalized refined weights $w_c$ are to the normalized reference weights $w_c^0$. Assuming Gaussian uncorrelated errors, the likelihood is

$$P(\mathrm{data}\mid\mathbf{w}, I) \propto \exp(-\chi^2/2) = \exp\left[-\sum_{i=1}^{M_{\mathrm{data}}} \frac{\left(\sum_{c=1}^{C} w_c y_i^c - Y_i\right)^2}{2\sigma_i^2}\right] \qquad (4)$$

The first sum is over the different experimental observations $i = 1, \ldots, M_{\mathrm{data}}$ with measured values $Y_i$, and the second sum is over the ensemble members $c = 1, \ldots, C$. For each chain $c$ and observable $i$, we use a forward model to compute individual observations $y_i^c$. The error $\sigma_i^2$ is the sum of the squared standard errors of the measurements $Y_i$ and the forward calculations $y_i^c$. In applications of BioEn to long biopolymers, small but systematic weight corrections at the monomer level can add up to large corrections overall. For NMR chemical shifts, for instance, the sum over $i$ in eq 4 corresponds to a sum over residues. As a result, the $\chi^2$ statistic is extensive; i.e., it tends to grow linearly with the length of the chain. Reweighting of assembled chains thus becomes progressively more challenging as the length of the chain grows (i.e., for chains with more fragments). The reason is that it becomes progressively unlikely that all fragments in an assembled chain occupy the relevant subspace with proper weight. As a result, chains will contribute with very uneven weights after BioEn reweighting. In other words, a few chains will dominate, and the rest of the large ensemble will be more or less irrelevant.
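The quantity minimized in this kind of refinement is the negative log-posterior, i.e., the confidence parameter times the Kullback−Leibler divergence from the reference weights plus half the $\chi^2$ with Gaussian uncorrelated errors. The following minimal sketch evaluates it with NumPy. The function name and argument layout are our own illustrative choices and do not reproduce the interface of the BioEn software.

```python
import numpy as np

def bioen_neg_log_posterior(w, w0, y, Y, sigma, theta):
    """Return theta * S_KL + chi^2 / 2 for normalized weights w.

    w, w0 : (C,) refined and reference weights, each summing to 1
    y     : (M, C) forward-calculated observables y_i^c
    Y     : (M,) measured values
    sigma : (M,) combined errors of measurement and forward model
    theta : confidence parameter
    """
    w, w0 = np.asarray(w, float), np.asarray(w0, float)
    y, Y, sigma = np.asarray(y, float), np.asarray(Y, float), np.asarray(sigma, float)
    mask = w > 0                                  # 0 * ln(0) is taken as 0
    s_kl = np.sum(w[mask] * np.log(w[mask] / w0[mask]))
    chi2 = np.sum(((y @ w - Y) / sigma) ** 2)     # ensemble average vs experiment
    return theta * s_kl + 0.5 * chi2
```

For weights equal to the reference weights, the first term vanishes and only the misfit to the data remains; a large `theta` penalizes any departure from the reference ensemble.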

Reweighted Hierarchical Chain Growth
We address the problem of poor overlap between initial and final ensemble by using importance sampling. In MD simulations of complete biopolymer chains, bias potentials could be introduced, acting for instance on the torsion angles to better match NMR chemical shifts or J-couplings. Here, we focus instead on fragment-based chain growth. The key idea is to grow chains by using fragment libraries that have already been biased to enrich the ensemble with members of high weight, and then to correct for this biased choice of fragments in a final reweighting step. If the bias weights were chosen perfectly, the final step would give each chain equal weight.
In RHCG, we adapt HCG 52 to assemble polymer chains from fragments. At each of the $N$ positions, fragments are picked at random from a fragment library and then combined by superimposition of residues at their termini with the equivalent residues in the adjacent fragments. Any models with steric clashes are discarded. In HCG, all fragments have equal weight; in RHCG, the fragments in the library $\{i_n^f\}_{n=1,\ldots,N}^{f=1,\ldots,F}$ (with $F$ being the number of fragments created at position $n$) are picked according to a weight $w_n^f$ normalized to $\sum_{f=1}^{F} w_n^f = 1$ for all $n$. These weights have to be chosen appropriately, as described below, and constitute our initial guess as to how likely a particular fragment is in the final reweighted ensemble of chains. The probability $p[f^c]$ for a particular chain $c$ to be created in this way is given by the product of weights for each of its fragments,

$$p[f^c] = \prod_{n=1}^{N} w_n^{f_n^c} \qquad (5)$$

where $f_n^c \in \{1, \ldots, F\}$ is the index of the fragment chosen at position $n$ in chain $c$. Here, we construct the fragment libraries from MD simulations of short overlapping blocked peptides. Alternatively, fragment libraries can be constructed from MD simulations of full-length chains that are then broken up into overlapping segments and reassembled by chain growth. A similar approach has recently been used to explore the flexibility of the SARS-CoV-2 spike stalk. 59 Fragment libraries can also be built from experimentally resolved structures with appropriately defined weights.
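The biased fragment choice and the resulting chain probability can be sketched as follows. This is an illustration of the bookkeeping only: the steric-clash test and the hierarchical pairing of fragments are omitted, and the function name and the toy library are hypothetical.

```python
import numpy as np

def grow_chain_indices(frag_weights, rng):
    """Pick one fragment index per position according to the bias weights.

    frag_weights : list of N arrays; entry n holds the normalized weights
                   w_n^f of the fragments available at position n
    Returns the chosen indices f_n^c and ln p[f^c], the sum of log-weights.
    Clash checks and the hierarchical assembly of HCG are omitted here.
    """
    idx = [int(rng.choice(len(w), p=w)) for w in frag_weights]
    log_p = float(sum(np.log(w[f]) for w, f in zip(frag_weights, idx)))
    return idx, log_p

rng = np.random.default_rng(1)
# Hypothetical toy library: 4 positions with 3 fragments each
library = [np.array([0.5, 0.3, 0.2])] * 4
indices, log_p = grow_chain_indices(library, rng)
```

Storing the logarithm of the chain probability rather than the product itself avoids numerical underflow when many positions are combined.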
We used NMR chemical shifts to bias the fragment choice. The weights of the fragments $w_n^f$ were determined with BioEn applied to the fragment library at position $n$ with a confidence parameter $\theta_f$. This confidence parameter was chosen to produce nearly uniform weights $w_c$ of the assembled chains after a global BioEn reweighting (Figure S1C). Importantly, there is no issue of circularity because the bias applied during chain growth is fully accounted for, as described in the following section.
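As an illustration of how fragment weights could be obtained at a single position, the sketch below minimizes the confidence parameter times the Kullback−Leibler divergence plus half the $\chi^2$ by gradient descent on log-weights with backtracking. It is a plain NumPy stand-in under our own assumptions, not the optimizer of the BioEn package; the function name, step size, and toy data are hypothetical.

```python
import numpy as np

def bioen_weights(y, Y, sigma, theta, w0=None, n_steps=2000, step=0.1):
    """Minimize theta * S_KL + chi^2 / 2 by gradient descent on log-weights.

    y : (M, F) calculated observables for F fragments; Y, sigma : (M,)
    A plain-NumPy sketch with backtracking; not the BioEn package itself.
    """
    y, Y, sigma = np.asarray(y, float), np.asarray(Y, float), np.asarray(sigma, float)
    n_frag = y.shape[1]
    w0 = np.full(n_frag, 1.0 / n_frag) if w0 is None else np.asarray(w0, float)

    def weights(g):
        e = np.exp(g - g.max())
        return e / e.sum()

    def objective(w):
        s_kl = np.sum(w * np.log(w / w0))
        return theta * s_kl + 0.5 * np.sum(((y @ w - Y) / sigma) ** 2)

    g = np.log(w0)
    for _ in range(n_steps):
        w = weights(g)
        d_dw = theta * (np.log(w / w0) + 1.0) + y.T @ ((y @ w - Y) / sigma**2)
        grad = w * (d_dw - w @ d_dw)            # chain rule through the softmax
        trial = step
        while objective(weights(g - trial * grad)) > objective(w) and trial > 1e-12:
            trial *= 0.5                        # backtrack until no increase
        g = g - trial * grad
    return weights(g)

# Hypothetical example: one observable, three fragments
y_toy = np.array([[1.0, 2.0, 3.0]])
w_toy = bioen_weights(y_toy, Y=[2.5], sigma=[0.5], theta=10.0)
```

In this toy case the uniform average of 2.0 lies below the target of 2.5, so the optimization shifts weight toward the third fragment while the entropy term keeps the weights close to uniform.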

BioEn Reweighting of Assembled Chains
After the biased assembly of an ensemble of chains, we use BioEn 39,40 to correct for the bias in chain growth and to reweight the entire ensemble globally. To correct for the bias in chain assembly, chain $c$ enters the global BioEn refinement with a relative weight proportional to the reciprocal of the bias probability, $1/p[f^c]$, with which its fragments were selected. Normalization of these relative weights gives us

$$w_c^0 = \frac{1/p[f^c]}{\sum_{c'=1}^{C} 1/p[f^{c'}]} \qquad (6)$$

or, expressed in terms of reciprocal weight factors,

$$w_c^0 = \frac{\prod_{n=1}^{N} \left(w_n^{f_n^c}\right)^{-1}}{\sum_{c'=1}^{C} \prod_{n=1}^{N} \left(w_n^{f_n^{c'}}\right)^{-1}}$$

where the sum extends over the $C$ chains of the ensemble. To the ensemble with these initial weights, we then apply BioEn reweighting, using as a reference experimental data reporting on local or global structural properties.
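The bias correction amounts to normalizing the reciprocals of the chain probabilities. A minimal sketch, assuming that the logarithms of the bias probabilities have been stored during chain growth (the function name is our own):

```python
import numpy as np

def reference_weights(log_p):
    """Normalized reference weights proportional to 1 / p[f^c].

    log_p : (C,) natural logarithms of the bias probabilities of the chains.
    Working with logarithms avoids underflow for products over many positions.
    """
    log_inv = -np.asarray(log_p, float)
    log_inv = log_inv - log_inv.max()     # shift so the largest term is exp(0)
    w0 = np.exp(log_inv)
    return w0 / w0.sum()
```

For example, chains grown with probabilities 0.5, 0.25, and 0.25 receive reference weights 0.2, 0.4, and 0.4, so that chains favored by the bias are down-weighted accordingly.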

Chain Growth with Nonbonded Interactions beyond Steric Repulsion
Fragment assembly can, in principle, be extended beyond steric repulsion to account for other nonbonded interactions, e.g., electrostatic interactions between fragments. 60 This can be accomplished by using a free energy function $G(f_1^c, \ldots, f_N^c)$ that describes the interfragment interactions in chain $c$ and can be calculated from an implicit solvent model or, by free energy calculations, from explicit solvent models. Chains $c$ assembled from fragments $f_1^c, \ldots, f_N^c$ are then weighted by an additional factor $\exp[-\beta G(f_1^c, \ldots, f_N^c)]$ with $1/\beta = k_{\mathrm{B}} T$, $k_{\mathrm{B}}$ being the Boltzmann constant, and $T$ being the absolute temperature. In the Bayesian formulation, the normalized reference weight of chain $c$ in an ensemble of $C$ chains then becomes

$$w_c^0 = \frac{\exp[-\beta G(f_1^c, \ldots, f_N^c)]}{\sum_{c'=1}^{C} \exp[-\beta G(f_1^{c'}, \ldots, f_N^{c'})]} \qquad (7)$$

JACS Au
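pubs.acs.org/jacsau Article

To sample efficiently from this distribution, one can again use importance sampling by performing hierarchical assembly 52 with biased fragment selection. If, as above, $w_n^f$ is the bias weight factor to choose fragment $f$ at position $n$, then eq 7 becomes

$$w_c^0 = \frac{\exp[-\beta G(f_1^c, \ldots, f_N^c)] \prod_{n=1}^{N} \left(w_n^{f_n^c}\right)^{-1}}{\sum_{c'=1}^{C} \exp[-\beta G(f_1^{c'}, \ldots, f_N^{c'})] \prod_{n=1}^{N} \left(w_n^{f_n^{c'}}\right)^{-1}}$$

Here, we use only excluded volume interactions, which amounts to $\exp(-\beta G) = 1$ for chains without interfragment steric clashes and $\exp(-\beta G) = 0$ with clashes.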

Assessment of Importance Sampling
In ideal importance sampling, we would grow chains of equal relative importance. Global BioEn reweighting would then give each member of the resulting ensemble equal weight, $w_c = 1/C$. We use the Kullback−Leibler (KL) divergence of the BioEn-optimized weights $w_c$ from ideal importance sampling to assess the effectiveness of our bias in chain growth:

$$S_{\mathrm{KL}}^{\mathrm{bias}} = \sum_{c=1}^{C} w_c \ln\frac{w_c}{1/C} = \sum_{c=1}^{C} w_c \ln(C w_c)$$

If $S_{\mathrm{KL}}^{\mathrm{bias}} \lesssim 1$, the overlap between the ensembles produced by biased chain growth and after BioEn refinement is large; conversely, if $S_{\mathrm{KL}}^{\mathrm{bias}} \gg 1$, the chain growth protocol should be optimized. We use $S_{\mathrm{KL}}^{\mathrm{bias}}$ also to choose the confidence parameter $\theta_f$ quantifying the strength of the bias in fragment choice during RHCG. As illustrated in Figure S1C, $S_{\mathrm{KL}}^{\mathrm{bias}}$ as a measure of weight uniformity is minimal for a range of $\theta_f$ values given a confidence parameter $\theta$ in the global BioEn ensemble reweighting.
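The uniformity measure is the KL divergence of the optimized weights from the uniform distribution, i.e., the sum over chains of $w_c \ln(C w_c)$. A minimal sketch of its evaluation (the function name is our own):

```python
import numpy as np

def s_kl_bias(w):
    """KL divergence of normalized weights w from uniform weights 1/C."""
    w = np.asarray(w, float)
    nz = w[w > 0]                       # terms with w_c = 0 contribute zero
    return float(np.sum(nz * np.log(nz * len(w))))
```

The value is zero for perfectly uniform weights and reaches $\ln C$ if a single chain carries all the weight. The quantity $C \exp(-S_{\mathrm{KL}}^{\mathrm{bias}})$ can be read as the number of chains that effectively contribute to the ensemble.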

Hierarchical and Reweighted Hierarchical Chain Growth
We generated structural ensembles of tau K18 (residues 244−372) using HCG 52 and RHCG. (RHCG software can be downloaded free of charge at https://github.com/bio-phys/hierarchical-chain-growth.) Tau structures were assembled from 43 pentamer fragments with two residues overlap between subsequent fragments. All fragments had their N and C termini capped by acetyl and N-methyl groups, respectively. The first (N-terminal) fragment started from the last residue outside tau K18, which was then removed in chain assembly. Fragment structures were sampled in all-atom replica exchange molecular dynamics (REMD) with explicit solvent. For each fragment, we used 24 replicas spanning a temperature range of 278−420 K. Each pentamer fragment was simulated for 100 ns as in our previous study. 52 We used structures from the T = 278 K ensemble to assemble tau K18 chains, which corresponds to the temperature of the NMR experiments. 27 To investigate the effect of point mutations at the P301 position, we also sampled fragments with P301 and mutations P301L, P301S and P301T. We repeated fragment simulations for WT P301, P301L, P301S, and P301T fragments with residue 301 at the central position of their respective fragments instead of the second position of its respective pentamer. Since we lack detailed chemical shift information, the P301X mutant chains were assembled with HCG, not RHCG. We note that in all fragment REMD simulations P301 was sampled exclusively as trans isomer.
We biased the fragment selection in RHCG according to Cα chemical shifts measured by NMR. At each fragment position $n$, we performed independent BioEn reweighting 39,40 using the chemical shift data reported for the nonterminal residues in this fragment (Supporting Information (SI) text). A large confidence parameter of $\theta_f = 10$ ensured improved consistency of the chemical shifts (with the average $\chi^2$ across fragments dropping from 0.856 to 0.688) with minimal weight changes ($S_{\mathrm{KL}}^{\mathrm{BioEn}} = 0.004$ on average). These local BioEn calculations gave us fragment weight factors $w_n^f$. In numerical tests on comparably small ensembles of $10^4$ chains and with $\theta = 5$ fixed for the global BioEn ensemble reweighting, we found that $S_{\mathrm{KL}}^{\mathrm{BioEn}}$ was minimal for $\theta_f = 5$ to 10 (Figure S1C).
We then used RHCG to build ensembles of between 2000 and $10^6$ WT tau K18 models from the reweighted fragment libraries. For reference, we also constructed unbiased ensembles of WT tau using HCG 52 with unweighted fragment libraries. HCG was also used to construct tau K18 ensembles of P301 mutants. If not specified otherwise, the results shown are for ensembles of $C$ = 50 000 chains. Following the procedure described in ref 52, we assembled 10 000 representatives at each hierarchy level below the final assembly level to sample a high diversity of possible local conformations. At the final level, full-length models were assembled from this pool. The assembly process was trivially parallelized by using different random number seeds. In a final step, the RHCG ensembles were reweighted using BioEn to correct for the biased fragment choice while retaining consistency with the NMR chemical shift data. In this global BioEn reweighting step, the confidence parameter was set to $\theta = 5$ according to an L-curve analysis (SI text and Figure S1A). The resulting ensembles were structurally diverse and, among 50 000 HCG and RHCG structures, did not contain any knots (SI text).

Calculation of Experimental Observables
NMR Secondary Chemical Shifts and J Couplings. For comparison with NMR experiments, we calculated chemical shifts from fragments and full-length structures using SPARTA+. 61 We subtracted random-coil shifts calculated using POTENCI 62 to compare to secondary chemical shifts ΔC. We computed $^3J_{\mathrm{HNH\alpha}}$ couplings with the Karplus parameters by Vögeli et al. 63 with the mdtraj Python library. 64

NMR Residual Dipolar Couplings. RDCs were calculated from the ensembles of full-length structures with PALES 65,66 in the steric alignment mode. Even for random flight polymers, the presence of an ordering medium modeled as a hard surface induces nonzero RDCs. 67 The value $D_{\mathrm{HN}}(r)$ for a particular residue $r$ was calculated by computing the alignment of each chain $c$ in the ensemble with PALES and then taking the average over all structures,

$$D_{\mathrm{HN}}(r) = D_{\mathrm{HN}}^{\max} \sum_{c=1}^{C} w_c \left\langle P_2\left(\cos\vartheta_c(r)\right)\right\rangle$$

where $w_c$ is the weight of chain $c$, $D_{\mathrm{HN}}^{\max}$ = 21.7 kHz for an idealized amide bond length of 1.04 Å, 68 $\vartheta_c(r)$ is the angle between the amide bond vector of residue $r$ in chain $c$ and the external magnetic field, $P_2(x) = (3x^2 - 1)/2$ is the second-order Legendre polynomial, and ⟨...⟩ denotes an average over the orientations of the chain biased by the alignment.
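The ensemble averaging of the RDCs can be sketched as follows, assuming that the per-chain RDCs have already been obtained from an external alignment calculation such as PALES and collected in an array. The function names and the array layout are our own choices, not part of PALES.

```python
import numpy as np

def p2(x):
    """Second-order Legendre polynomial P2(x) = (3 x^2 - 1) / 2."""
    return 0.5 * (3.0 * np.asarray(x, float) ** 2 - 1.0)

def ensemble_rdc(rdc_per_chain, weights=None):
    """Weighted ensemble average of per-chain RDCs.

    rdc_per_chain : (C, R) array, one row per chain, one column per residue
    weights       : (C,) normalized chain weights; uniform if None
    """
    rdc = np.asarray(rdc_per_chain, float)
    if weights is None:
        weights = np.full(rdc.shape[0], 1.0 / rdc.shape[0])
    return np.asarray(weights, float) @ rdc
```

For an isotropic distribution of bond orientations, $\cos\vartheta$ is uniformly distributed between −1 and 1 and the average of $P_2$ vanishes, which is why nonzero RDCs require a net alignment of the chain.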
Small-Angle X-ray Scattering. We used FoXS 69 to calculate SAXS intensity profiles for the tau K18 structures in an ensemble and then calculated the weighted average over the ensemble. In the FoXS calculations, we took the solvation shell into account by setting $c_2 = 3$. The excluded-volume parameter was set to the default value of $c_1 = 1$. Geometric $R_G$ values were computed using the MDAnalysis library. 70,71 To compare measured scattering intensities to those predicted for the weighted ensemble, $I_{\mathrm{sim}}(q)$, we first estimated an intensity scale factor $a$ and a constant for background correction $b$ by performing least-squares fitting of $a I_{\mathrm{sim}}(q) + b$ to the SAXS intensities, with $q$ being the scattering vector. For a regime unaffected by aggregation, $q$ > 0.012 Å⁻¹, the best fit to experiment was achieved with the coefficients $a = 1.1 \times 10^{-11}$ and $b = 3.8 \times 10^{-5}$. For $q$ < 0.012 Å⁻¹, we took possible mild aggregation into account by approximating the scattering intensity including possible aggregates with a model that has two further parameters, an aggregate intensity $c$ and an aggregate size $R_a$. By least-squares fitting with fixed $a$ and $b$, we find an aggregate intensity of $c$ = 0.00156 and an aggregate size of $R_a$ = 234 Å. The fit to the combined model is shown in Figure S2. An earlier set of scattering data 18 is restricted to $q$ > 0.03 Å⁻¹.
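The estimate of the scale factor and the background constant is a linear least-squares problem. A minimal sketch with NumPy, assuming that the simulated and measured intensities are given on the same grid of scattering vectors (the function name and the optional error weighting are our own additions):

```python
import numpy as np

def fit_scale_and_background(i_sim, i_exp, sigma=None):
    """Least-squares estimate of a and b in a * I_sim(q) + b ~ I_exp(q).

    i_sim, i_exp : (Q,) simulated and measured intensities on the same q grid
    sigma        : optional (Q,) experimental errors used to weight the fit
    """
    i_sim, i_exp = np.asarray(i_sim, float), np.asarray(i_exp, float)
    wgt = np.ones_like(i_exp) if sigma is None else 1.0 / np.asarray(sigma, float)
    design = np.column_stack([i_sim, np.ones_like(i_sim)]) * wgt[:, None]
    (a, b), *_ = np.linalg.lstsq(design, i_exp * wgt, rcond=None)
    return float(a), float(b)
```

To restrict the fit to the regime unaffected by aggregation, one would pass only the points with `q > 0.012`, e.g., `fit_scale_and_background(i_sim[q > 0.012], i_exp[q > 0.012])` for arrays `q`, `i_sim`, and `i_exp` of equal length.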
Comparison to Single-Molecule FRET Experiments. We compared C α −C α distances extracted from FRET experiments using the SAW-ν polymer model 72 to RHCG models. To quantify the effect of the fluorescent dyes on the distance distribution, we performed additional calculations in which we adapted the RHCG method to add dyes 73 during chain growth (SI text and Figure S3).

Comparison to NMR Paramagnetic Relaxation Enhancement Measurements
NMR paramagnetic relaxation enhancement (PRE) measurements on tau K18 have been previously reported. 74 We computed PREs for the tau K18 ensembles using the PREdict 75 Python library (https://github.com/KULL-Centre/DEERpredict). PREdict adds explicit spin labels, modeled with a rotamer library, to the chains. The PRE is calculated in the fast-exchange limit with respect to both spin-label and chain dynamics. Details of the PRE calculation are given in the SI text.

Experiments
Single-Molecule FRET Experiments. For the single-molecule FRET experiments, tau K18 was labeled with Alexa Fluor 488 and CF660R at its naturally occurring cysteine residues, C291 and C322 (SI text). The labeled tau K18 was diluted to a concentration of 100 pM in 50 mM sodium phosphate buffer, pH 6. The SAW-ν model was used to analyze the single-molecule FRET data to extract distances and the polymer properties of tau K18 72 (SI text).
Small-Angle X-ray Scattering Experiments. SAXS data were collected at 298 K from monodisperse samples of K18 ranging from  Table S1 for the amino acid sequence of tau K18).

RHCG Produces a Diverse Ensemble of Tau K18 Chains
During chain assembly, we applied a gentle bias on the fragment choice by using fragment weights from BioEn reweighting against C α chemical shifts. To correct for the bias, the assembled chains were then reweighted with BioEn, again using the chemical shift data as experimental reference. In this global BioEn reweighting step, the chains were given near-uniform weights w c with S KL bias ≪ 1 ( Figure S1B). By comparison, the BioEn weights of the HCG ensemble created without bias are less uniform. The resulting ensemble of tau K18 is comprised of highly diverse structures with atomic detail ( Figure 1C). The typical C α root-mean-square distance (RMSD) between two chains is about 26 Å ( Figure S4 and SI text), and backbone dihedral angles are broadly sampled ( Figure S5).

RHCG Models of Tau K18 Capture the Average Local Structure of Tau as Reported by NMR
Chemical shifts are accurate reporters of local structure and secondary structure. 16,17,27,29,61,77 Overall, we found that the Cα chemical shifts calculated for the RHCG ensemble of tau K18 are close to random coil values, with secondary chemical shifts ΔC mostly close to zero. Despite the residual amplitude typically being smaller than the error of ≈1 ppm 61 in the forward chemical shift calculation, the models capture important features of the variation of experimental secondary chemical shifts along the tau K18 amino acid sequence, such as a drop in secondary chemical shift going from L285 to V300. HCG without reweighting of the fragment library underestimates the populations of extended and β-strand-like structures and overestimates the helix-like conformations. Going from HCG to RHCG, the average residual drops from 0.35 to 0.27 ppm and Pearson's r for the secondary chemical shifts ΔC of the Cα atoms increases from 0.28 to 0.41. RHCG lowers in particular positive ΔC values, e.g., at the S320 position (Figure S6A,B). In light of the considerable uncertainties in the forward calculation (≈1 ppm) and the small ΔC amplitudes, a lower θ value resulting in an even tighter fit was not justified (Figure S1A).
We also calculated NMR 3 J HNHα couplings, which report primarily on the ϕ-dihedral angles of the protein backbone. The couplings calculated for our models agree well with the NMR experimental data 27 ( Figure S7). Also in terms of 3 J HNHα couplings, which were not used in the RHCG procedure, RHCG somewhat improves the representation of the local structures over HCG, as reflected by the increase of Pearson's r from 0.59 to 0.62. The root-mean-squared error dropped from 0.47 Hz (HCG) to 0.41 Hz (RHCG). For reference, the uncertainty of the calculated 3 J HNHα couplings has been estimated at ∼0.9 Hz. 78 We do not expect a more significant improvement because the 3 J HNHα coupling is sensitive primarily to the ϕ backbone torsion, whereas the C α chemical shift used in RHCG is particularly sensitive to the ψ backbone torsion. Indeed, even for a simple Ala pentapeptide we found small but systematic differences between a state-of-the-art force field and 3 J HNHα couplings. 40 Overall we conclude that reweighting in fragment assembly alleviates the small but systematic deviations caused by small imbalances in state-of-the-art force fields used to generate fragment libraries. As a result, the local structure of the tau K18 chains produced by RHCG is more consistent with NMR chemical shift and J-coupling experiments.

The RHCG Ensemble of Tau K18 Reproduces the Experimental NMR Residual Dipolar Couplings
We calculated the RDCs for the assembled tau K18 chains using the steric alignment mode of PALES, 66 and then averaged the RDC values over the ensemble with the respective weight of each chain. The measured 27 and calculated RDCs agree remarkably well and capture both the signature as a function of position along the chain (Figure 1A) and the magnitude at individual residue positions (Figure 1B). Without further fitting, we obtained Pearson r correlation coefficients of 0.73 for RHCG and 0.70 for HCG for tau K18 ensembles of 50 000 models. This consistency not only validates the ensemble but also gives direct insights into the interpretation of the RDCs measured for IDPs. RDCs inform on how restricted a chain is locally, with larger absolute RDCs expected for more restricted segments than for fully flexible segments. 15 The RDC $D_{\mathrm{HN}} \propto \langle P_2(\cos\vartheta)\rangle$ reports on the relative orientation of an amide bond vector with respect to the magnetic field. Changes in the sign of the measured RDCs have been interpreted as changes in the direction of the protein backbone. 27 Our conformational ensemble reproduces the four changes in the sign of $D_{\mathrm{HN}}$ found in experiments. 27 Importantly, as highlighted for the region centered on L284-S285 in Figure 1C, our structures on average trace a turn in the region where the sign changes, as indicated by a shortened distance across the four-residue segments (Figure S8). By contrast, in regions such as Q276-I277, where the sign of $D_{\mathrm{HN}}$ does not change, our structures do not show a preference in the chain direction and scatter around an average straight chain (Figure 1C). We note that simple polymeric models that ignore amino acid chemistry and the correlations between subsequent residues tend not to capture the trends in the experimental RDCs, as previously noted. 15,27,79

Residual Dipolar Coupling Calculations Require Large Ensemble Sizes
The need for large ensembles has been highlighted before. 26 Building large ensembles relies on the possibility to quickly generate statistically independent atomically detailed models of IDPs. The RDC values predicted for particular residues in our models are widely and asymmetrically distributed with a range of about ±25 Hz (Figure 2A). By contrast, the experimental average is roughly in the range of −5 to 10 Hz ( Figure 1A). As a result, RDCs calculated from small ensembles are biased ( Figure 2B). We found that relatively large ensembles of ≥10 000 tau K18 chains are needed to get converged RDC values ( Figure 2B). We found in particular that Pearson's r correlation coefficient improved with increasing ensemble size. The ensemble-size dependence is similar for RHCG and HCG, even if the RHCG ensemble consistently performs somewhat better than the HCG ensemble ( Figures 1D, 2B,C, and S9).
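The dependence on ensemble size can be illustrated with synthetic numbers. The sketch below draws a broad, skewed distribution as a hypothetical stand-in for the per-chain RDCs of one residue (it is not data from this work) and measures the statistical scatter of the mean over random subsets of different sizes. It illustrates only this scatter, not the full PALES-based analysis.

```python
import numpy as np

def subsample_spread(values, sizes, n_repeats=200, seed=0):
    """Standard deviation of the subsample mean for each ensemble size."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, float)
    spread = {}
    for size in sizes:
        means = [rng.choice(values, size=size, replace=False).mean()
                 for _ in range(n_repeats)]
        spread[size] = float(np.std(means))
    return spread

# Synthetic stand-in for per-chain RDCs of one residue (hypothetical values):
# a broad, asymmetric distribution with a small mean.
rng = np.random.default_rng(42)
per_chain = rng.gamma(shape=2.0, scale=5.0, size=50000) - 8.0
spread = subsample_spread(per_chain, sizes=(100, 1000, 10000))
print(spread)
```

Because the width of the per-chain distribution is much larger than the ensemble average, the scatter of the mean decreases only with the square root of the ensemble size, so small ensembles leave the average poorly determined.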

RDCs from Short Chain Segments
In the modeling of RDCs of IDRs, it is frequently assumed that ensembles of short peptide segments of about 15 amino acids contain sufficient structural information to calculate RDCs. 80,81 We tested this assumption by cutting overlapping 15-mer segments out of the BioEn ensemble of full-length tau K18 and then calculating the average RDCs for their central 9 amino acids using a steric alignment. 66 We found that the RDCs calculated for the full ensemble and for the 15-mer segments are highly correlated (r = 0.91; Figure S10). Compared to the NMR RDCs, the correlation coefficient for segments (r = 0.61) is nearly as good as for full-length chains (r = 0.73). In line with earlier findings, 81 we conclude that comparably short peptide segments can indeed be used to model the RDCs of long IDRs such as tau. This finding also makes it possible to use RDC data during chain growth in RHCG. RDCs can be precalculated either directly for fragments of sufficient length or for a library of segments that have been assembled by chain growth. With the precalculated RDCs, subsequent chain growth can be biased to improve the overlap between the initial and BioEn-optimized ensembles of chains. Here, for tau K18, including RDCs in chain growth proved unnecessary because they were predicted accurately without any bias.

The RHCG Ensemble Captures the Extension of Tau K18 in Solution
The RHCG ensemble also captures the size and shape of tau K18 in solution as probed by SAXS measurements (Figure 1D). The mean scattering profiles calculated from our tau K18 models agree well with the experimental scattering profiles (Figure 1D), taking possible unspecific aggregation in the low q regime into account. The computed root-mean-square radius of gyration of approximately 39 Å coincides with the experimentally determined R G of 38 ± 3 Å. 18 The RHCG ensemble (⟨R h ⟩ = 34 Å) is also consistent with the hydrodynamic radius R h of 34 ± 6 Å, as reported by dynamic light scattering (DLS). 74 R h was computed from the RHCG ensemble using an empirical approach. 82,83 Our RHCG ensemble agrees quite well with previously reported NMR paramagnetic relaxation enhancement (PRE) measurements 74 (Figure S11), which were not used in the generation of our ensembles. Spin-label dynamics were modeled with a rotamer-library approach. 75 The overall shapes of the experimental profiles measured for four different spin labels 74 were captured without any refinement. 46 However, a fully quantitative comparison is challenging because of the sensitive dependence of the PRE on infrequent close contacts between proton and spin label in the fast-exchange regime. As a result, the calculated PRE profiles are noisy and, without weight adjustments, tend to underestimate the actual PRE for residues and labels close in sequence. The good agreement with SAXS, dynamic light scattering, and NMR measurements suggests that the RHCG ensemble captures the global conformational properties of tau K18 in solution quite well without further refinement. However, BioEn reweighting of the spin-label rotamers 46 used to calculate the PRE, and possibly also of the chains, should address some of the challenges in calculating PREs of disordered proteins.
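For reference, a root-mean-square radius of gyration over an ensemble can be computed as in the following Python sketch. The function names and the optional chain weights are illustrative assumptions. Unless masses are supplied, all atoms are weighted equally. The empirical conversion to R h is not reproduced here.

```python
import numpy as np

def radius_of_gyration(coords, masses=None):
    """Radius of gyration of one structure; coords has shape (n_atoms, 3)."""
    coords = np.asarray(coords, dtype=float)
    if masses is None:
        masses = np.ones(len(coords))  # assumption: equal weighting of atoms
    center = np.average(coords, axis=0, weights=masses)
    sq_dist = ((coords - center) ** 2).sum(axis=1)
    return float(np.sqrt(np.average(sq_dist, weights=masses)))

def rms_radius_of_gyration(ensemble, weights=None):
    """Root-mean-square R_G over an ensemble, optionally with per-chain weights."""
    rg_sq = np.array([radius_of_gyration(c) ** 2 for c in ensemble])
    return float(np.sqrt(np.average(rg_sq, weights=weights)))

# Two toy two-atom 'structures' with R_G of 1 and 3 (arbitrary units)
toy_ensemble = [
    [[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[-3.0, 0.0, 0.0], [3.0, 0.0, 0.0]],
]
print(rms_radius_of_gyration(toy_ensemble))
```

Averaging the squared radii before taking the root matters for broad ensembles: the root-mean-square value is never smaller than the plain mean of the per-chain radii.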

Structure of tau K18 as Assessed by Single-Molecule FRET
Comparison to single-molecule FRET experiments suggests that our RHCG models are somewhat too extended (Figure 1E), with longer C α −C α distances in the RHCG ensemble than those extracted from the FRET experiments. 45 This initial analysis of the FRET data with a commonly used polymer model 72 provides a valuable check on the validity of more involved comparisons with explicit representations of dyes. 45,73,84 In a BioEn calculation, we found that even a small adjustment of the RHCG chain weights suffices to match the mean distance deduced from FRET perfectly (RHCG* in Figure S3D and Table S2). The resulting RHCG* ensemble agrees as well with experiment as the RHCG ensemble in terms of the SAXS measurements, and slightly worse in terms of NMR RDC and PRE measurements (Figure S12 and Table S2). The Kullback−Leibler divergence of $S_{\mathrm{KL}} \approx 0.2$ corresponds to a change of the underlying MD simulation potential energy function of $S_{\mathrm{KL}} k_{\mathrm{B}} T = \int \mathrm{d}x\, p^{(\mathrm{opt})}(x)\,[U^{(\mathrm{opt})}(x) - U(x)] \approx 0.5$ kJ/mol on average. 39 Conversely, this sensitivity also highlights the intricacies of the free energy landscape of disordered proteins, where subtle shifts in the energetics result in appreciable changes in conformation. 85 We explored possible effects of the fluorescent dyes by generating RHCG models with dyes attached. For these models, we calculated the mean FRET efficiency and compared it directly to the experimental measurement (Figure S3C). We found that an even smaller force field correction of 0.35 kJ/mol on average 39 would be sufficient to achieve full consistency of the ensemble means (Figure S3D). Reweighting according to the FRET data changes the R G from 39.4 Å (RHCG) to 37.4 Å (RHCG*), and with explicit dye models from 40.1 Å (RHCG+dyes) to 39.1 Å (RHCG+dyes*), respectively.
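The conversion between a Kullback−Leibler divergence and an average energy change is simple arithmetic, sketched below in Python. The temperature of 300 K is an assumption (it is not stated in this section), and the function names are illustrative. The sketch also shows how the divergence of refined weights from reference weights would be evaluated.

```python
import numpy as np

GAS_CONSTANT = 8.314462618e-3  # kJ/(mol K), molar form of the Boltzmann constant

def kl_to_energy(s_kl, temperature=300.0):
    """Average energy change S_KL * k_B * T in kJ/mol (temperature is an assumption)."""
    return s_kl * GAS_CONSTANT * temperature

def kl_divergence(weights, reference_weights):
    """S_KL = sum_a w_a ln(w_a / w0_a) for normalized chain weights."""
    w = np.asarray(weights, dtype=float)
    w0 = np.asarray(reference_weights, dtype=float)
    w, w0 = w / w.sum(), w0 / w0.sum()
    mask = w > 0  # terms with zero weight contribute nothing
    return float(np.sum(w[mask] * np.log(w[mask] / w0[mask])))

print(f"S_KL = 0.2 -> {kl_to_energy(0.2):.2f} kJ/mol")
print(f"0.35 kJ/mol -> S_KL = {0.35 / kl_to_energy(1.0):.2f}")
```

At the assumed temperature, the divergence of 0.2 maps to about 0.5 kJ/mol, and the smaller correction of 0.35 kJ/mol found with explicit dyes corresponds to a divergence of about 0.14.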
The scaling exponent of 0.56 inferred from the SAW-ν model 72 is close to the value of an excluded-volume chain. The tau K18 segment is thus more extended than most moderately charged IDPs. 21 Interestingly, the transfer efficiency and average distance between the Cys residues of tau K18 from single-molecule FRET are virtually independent of salt concentration (Figure S3C), indicating that the rather pronounced expansion of this segment is not caused by charge repulsion. The FRET experiments are thus in line with our modeling, which highlights that local structural preferences along the chain rather than long-range charge−charge interactions primarily shape the ensemble of tau K18.

Aggregation-Prone Extended Structures Feature Prominently in the Solution Ensemble of Tau K18
Interestingly, a small but significant fraction of our atomically detailed models feature conformations of the two aggregation-prone hexapeptide motifs 10 as seen in the high-resolution structures of tau fibrils. 86,87 Chain growth thus captures biologically important structural features. For the first hexapeptide motif 275 VQIINK 280, we found that about 9% of the models are within 1 Å C α RMSD of a tau fragment fibril structure (PDB: 5V5B 87 ) (Figure 3A,C). A similar fraction of the tau K18 population has local structures matching that of a fibril from a corticobasal degeneration (CBD) patient sample 88 (PDB: 6TJO). The fraction of our ensemble that closely matches the experimental structures (Figures 3B and S13) is clearly larger than what would be expected for a random six-amino-acid segment. For the second hexapeptide motif 306 VQIVYK 311, we also found that about 8% of the models are within 1.0 Å C α RMSD of the X-ray structure (PDB: 2ON9 86 ) (Figure 3B,D), about 2.5 times more than what would be expected for random hexapeptide segments. We found similar consistency for the second hexapeptide motif with the structures of tau fibrils (Figure S13), as formed in Alzheimer's disease (PDB: 5O3O, 89
We found that a considerable fraction of WT tau K18 adopts locally compact turn-like structures (Figure 4A−C). Similar turn-like structures have been resolved by NMR transfer NOESY experiments probing the conformations of microtubule-bound tau, 57 with an O(300)−N(303) distance below 4 Å in 18 out of the 20 structures in the NMR ensemble (PDB: 2MZ7; see Figure 4B). In the WT RHCG ensemble, 15% of structures of the 300 VPGGG 304 segment are within 1 Å C α RMSD of the closest representative of the NMR ensemble (Figure 4A). This indicates that tau samples the turn-like structures of the microtubule-bound form also free in solution.
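Fractions of models within a given C α RMSD of a reference structure, as quoted above, can be obtained with a standard optimal-superposition calculation. The Python sketch below uses the Kabsch algorithm. It is a generic illustration under stated assumptions (C α coordinates already extracted as arrays; function names are hypothetical), not necessarily the procedure used for the figures.

```python
import numpy as np

def kabsch_rmsd(model, reference):
    """Minimal RMSD between two (n_atoms, 3) coordinate sets after optimal superposition."""
    p = np.asarray(model, dtype=float)
    q = np.asarray(reference, dtype=float)
    p = p - p.mean(axis=0)  # remove translation
    q = q - q.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ q)  # SVD of the covariance matrix
    # Guard against improper rotations (reflections)
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    diff = (rotation @ p.T).T - q
    return float(np.sqrt((diff ** 2).sum() / len(p)))

def fraction_within(models, reference, cutoff=1.0):
    """Fraction of models with RMSD to the reference below the cutoff (same units)."""
    rmsds = np.array([kabsch_rmsd(m, reference) for m in models])
    return float(np.mean(rmsds < cutoff))

# Toy check: a rotated and shifted copy matches; a twofold-expanded copy does not
rng = np.random.default_rng(4)
ref = 3.0 * rng.normal(size=(6, 3))
theta = 0.7
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta), np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
moved = ref @ rot_z.T + np.array([5.0, -2.0, 1.0])
print(kabsch_rmsd(moved, ref), fraction_within([moved, 2.0 * ref], ref))
```

For a comparison against an NMR ensemble with several representatives, one would take the minimum of `kabsch_rmsd` over the representatives before applying the cutoff.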

Chain Growth Captures the Effect of Mutations Toward Aggregation-Prone Structures
The PGG motifs at the end of each repeat favor turn-like structures. 94 We expect that mutations of the prolines shift the local structure away from turns. To test the effect of mutations at the 301 position, we considered the frontotemporal dementia with Parkinsonism linked to chromosome 17 (FTDP-17) mutations P301L, P301S, and P301T. Mutations of P301 have been shown to strongly promote tau aggregation 9,10 and are used in mouse models of tauopathies. 8,9 In our hierarchical modeling, the P301L, P301S, and P301T variants consistently form more extended structures than WT (Figure 4C,D), both in ensembles of full-length tau K18 (Figure 4C,E,F,G) and in fragment MD simulations (Figure S14). This loss of turn-like structures is indicated by a more than 2-fold reduction in the fraction of O−N distances < 4 Å between V300 and G303. The P301L mutation has been studied in detail by NMR and biophysical experiments. 11 The shift from turns to extended structures in our P301L ensemble is in line with smaller 15 N chemical shift values for K298, H299, and V300 in P301L tau K18. 11 The shift from turns to extended structures rationalizes the enhanced aggregation propensity of tau P301L in vitro 10,12 because extended structures predominate in fibrils. Locally more extended structures in the mutant proteins facilitate intermolecular contacts between tau chains and subsequent assembly and aggregation via intermolecular β-sheets. The shift to extended structures seen here also explains why P301L tau binds less strongly to microtubules. 11,95 In a population-shift mechanism, P301L, P301S, and P301T mutations thus appear to decrease the fraction of tau with locally compact turn structures, which are competent to bind to microtubules, and to increase the fraction of aggregation-prone extended structures (Figures 5 and S11). The combination of these two effects may render P301 mutations deleterious both with respect to a loss in function and an increased tendency to form disease-associated fibrils.
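The turn criterion used above, an O−N distance below 4 Å, translates directly into a population estimate. The following Python sketch assumes that the carbonyl O and amide N coordinates of the two residues have already been extracted per chain; the function names and the toy coordinates are illustrative, not data from the ensembles.

```python
import numpy as np

def turn_fraction(o_xyz, n_xyz, cutoff=4.0, weights=None):
    """(Weighted) fraction of chains with O-N distance below the cutoff in Angstrom.

    o_xyz and n_xyz have shape (n_chains, 3): one O and one N position per chain.
    """
    dist = np.linalg.norm(np.asarray(o_xyz, float) - np.asarray(n_xyz, float), axis=1)
    return float(np.average((dist < cutoff).astype(float), weights=weights))

def fold_reduction(fraction_wt, fraction_mutant):
    """Factor by which the turn population drops in the mutant relative to WT."""
    return fraction_wt / fraction_mutant

# Toy coordinates: O at the origin, N at 3, 5, 3.5, and 10 Angstrom along x
o_toy = np.zeros((4, 3))
n_toy = np.array([[3.0, 0, 0], [5.0, 0, 0], [3.5, 0, 0], [10.0, 0, 0]])
print(turn_fraction(o_toy, n_toy))
```

Passing refined chain weights through the `weights` argument gives the population in a reweighted ensemble rather than in the unweighted set of chains.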
According to chemical shift mapping, the P301L/P301S/P301T mutations do not significantly alter the overall structure of tau. 11 Whereas the tendency to form aggregation-prone extended structures at position 301 more than doubles (see Figure 4), the absolute increase in the extended population is small (<15%) and confined locally to the turn region. The change in the calculated radius of gyration compared to WT is small, ∼0.2 Å, and thus within the uncertainty of both calculations and measurements. The same limitation applies to the mean C α −C α distances of the fluorophore-labeled residues, which change by only ∼0.1−0.3 Å.

■ CONCLUSIONS
We showed that reweighted hierarchical chain growth captures both the local and the global structures of tau K18. Locally, NMR C α chemical shifts were reproduced within the expected uncertainties without any fitting. The agreement was improved further with only a gentle Bayesian ensemble refinement against NMR chemical shift data. Globally, the tau K18 chains assembled in this way reproduced SAXS, FRET, and NMR RDC measurements and thus captured the overall shape, dimension, and changes in orientation. In addition, the FRET experiments showed that the extension of tau K18 is insensitive to salt concentration, unlike that of other disordered proteins. 58 The global structure of tau K18 thus emerged from its local structure in the sense that the ensembles of global chain structures built by combining short peptide fragments capture the measured global structural properties with good accuracy.
Fragment assembly and coil models have proved highly successful in the modeling of disordered proteins. 24−26,28,36,52,60,79 The quality of the ensemble models can be improved even further by integrating experimental data. 26,35 In BioEn, 39,40 the data enter through a $\chi^2$ term. The summed squared error $\chi^2$ of the models often grows roughly linearly with chain length, e.g., because of systematic errors in the force field used to generate the fragment models. As a result, the relative weights of the assembled chains in a refined ensemble will vary widely. The overlap between the ensemble of assembled chains and the final ensemble, as measured by $\exp(-S_{\mathrm{KL}})$, then decreases exponentially with increasing chain length, and ensemble refinement becomes increasingly inefficient.
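The loss of overlap with chain length can be made concrete with a deliberately simplified toy model in Python. The assumptions are loud ones: each residue contributes an independent chi-square-distributed error, and the refined weight of a chain is taken proportional to exp(−χ²/2θ), which is a stand-in and not the actual BioEn optimum. The function name and parameter values are likewise illustrative.

```python
import numpy as np

def overlap_vs_length(lengths, n_chains=20000, theta=5.0, seed=1):
    """Toy estimate of the overlap exp(-S_KL) between initial and reweighted ensembles.

    Assumptions (not the BioEn solution): per-residue errors are independent
    chi-square(1) variables and weights are proportional to exp(-chi2 / (2 theta)).
    """
    rng = np.random.default_rng(seed)
    overlap = {}
    for n_res in lengths:
        chi2 = rng.chisquare(1, size=(n_chains, n_res)).sum(axis=1)
        log_w = -chi2 / (2.0 * theta)
        log_w -= log_w.max()  # numerical stability before exponentiation
        w = np.exp(log_w)
        w /= w.sum()
        # KL divergence relative to uniform reference weights 1/n_chains
        s_kl = float(np.sum(w * np.log(w * n_chains)))
        overlap[n_res] = float(np.exp(-s_kl))
    return overlap

overlap = overlap_vs_length([10, 40, 130])
for n_res, value in overlap.items():
    print(f"{n_res:>4d} residues: exp(-S_KL) = {value:.2f}")
```

In this toy model the divergence grows in proportion to the number of residues, so the overlap falls off exponentially, which is the inefficiency that biasing during assembly is meant to avoid.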
Reweighted hierarchical chain growth is an importance sampling procedure designed to address this problem by producing evenly weighted ensembles. By applying a bias already during chain assembly, we ensure that the assembled chains have near-uniform weights in the final ensemble. A poorly designed importance sampling scheme would produce ensembles with an uneven weight distribution, as indicated by a high value of S KL bias in eq 10. By using hierarchical chain growth 52 and correcting for any bias in the assembly process in a formally rigorous manner using a form of Bayesian ensemble refinement, BioEn, 39 we ensure further that the final ensemble is well-defined and independent of arbitrary choices in the assembly process, such as the strength of the bias in fragment selection or the direction of chain growth.
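The principle that a bias applied during sampling can be removed exactly by reweighting is generic importance sampling, and it can be sketched in Python on a toy fragment pool. This is not the RHCG implementation: the pool of 50 "fragments", the exponential bias, and the function name are assumptions chosen for illustration.

```python
import numpy as np

def reweighted_mean(values, bias_strength, n_draws=100_000, seed=2):
    """Self-normalized importance-sampling estimate of the unbiased pool average.

    Fragments are drawn with probability proportional to exp(-bias_strength * value)
    and each draw is reweighted by target/proposal to undo the bias.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    target = np.full(len(values), 1.0 / len(values))  # unbiased pool probabilities
    proposal = np.exp(-bias_strength * values)
    proposal /= proposal.sum()
    idx = rng.choice(len(values), size=n_draws, p=proposal)
    w = target[idx] / proposal[idx]  # importance weights
    return float(np.sum(w * values[idx]) / np.sum(w))

pool = np.linspace(0.0, 1.0, 50)  # hypothetical observable, true mean 0.5
estimates = {lam: reweighted_mean(pool, lam) for lam in (0.0, 1.0, 2.0)}
print(estimates)
```

The estimates agree within sampling noise for all bias strengths, which mirrors the statement that the corrected ensemble does not depend on the arbitrary strength of the bias. What the bias does change is how evenly the correction weights are distributed.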
In practice, RHCG may only be a starting point for further investigations and improvements. For instance, representative structures can be used as seeds for MD simulations of the full-length protein. 52 By drawing conformations according to the BioEn weights, one can systematically select subensembles that are consistent with the available experimental data. If BioEn 39,40 indicates that entire regions of configuration space require large changes in weights, up or down, one may need to bias chain growth accordingly or may have to use different or improved simulation force fields. 96 The tau K18 ensembles obtained by reweighted hierarchical chain growth revealed how patient-associated mutations shift the balance from protein function to disease. In modeling the effect of mutations, we took advantage of a chemically informed description 79,97−104 of the disordered tau protein.
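Drawing a subensemble according to refined weights amounts to weighted resampling, as in the Python sketch below. The function names are hypothetical, and the Kish effective sample size is added here only as a common diagnostic of how unevenly the weights are distributed. It is not a quantity reported in the text.

```python
import numpy as np

def draw_subensemble(weights, n_draw, seed=3):
    """Indices of chains drawn with replacement, with probability given by their weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(len(w), size=n_draw, replace=True, p=w)

def effective_sample_size(weights):
    """Kish effective sample size, 1 / sum(w^2) for normalized weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(1.0 / np.sum(w ** 2))

toy_weights = [0.4, 0.3, 0.2, 0.1]
print(draw_subensemble(toy_weights, 10), effective_sample_size(toy_weights))
```

Uniform weights give an effective sample size equal to the number of chains, whereas a single dominant chain drives it toward one, in which case a resampled subensemble would contain many duplicates.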
We found that, already free in solution, the microtubule-interacting regions of tau K18 populate local structures as observed in the microtubule-bound state by NMR. Also consistent with conformational selection, we found that a comparable fraction of free tau K18 chains exhibits local structures as observed in pathogenic tau fibrils. We could further show that the disease-associated mutations P301L, P301S, and P301T shift the balance away from the microtubule-bound local turn structures toward the fibril-associated extended structures (Figure 5). Such shifts can have dramatic effects on the kinetics of aggregation 105 by lowering the barrier to nucleation. Indeed, a shift to extended structures was recently reported to be associated with fibril formation in tau condensates. 106 The emergence of global structure from local structure thus extends beyond chain shape, dimension, and orientation to the competition between tau's role as microtubule-bound regulator of cellular transport and as fibril-forming driver of neuropathologies.
Structural and topological analysis of tau K18 ensembles, comparison to C α chemical shifts, construction of fragment pool, modeling of P301 mutations, global BioEn reweighting, analysis of turn structures detected by RDCs, comparison to PRE experiments, comparison to NMR structure of microtubule-bound 301 PGGG 304, details on the sample preparation and setup of single-molecule FRET experiments and further comparison of single-molecule FRET experiments and ensembles from RHCG, analysis of the structural consequences of P301 mutations, sequence of tau K18 and positions of spin labels and FRET dyes (PDF)
Scattering intensity data (XLSX)