Hydrodynamic Radii of Intrinsically Disordered Proteins: Fast Prediction by Minimum Dissipation Approximation and Experimental Validation

The diffusion coefficients of globular and fully unfolded proteins can be predicted with high accuracy solely from their mass or chain length. However, this approach fails for intrinsically disordered proteins (IDPs) containing structural domains. We propose a rapid predictive methodology for estimating the diffusion coefficients of IDPs. The methodology uses accelerated conformational sampling based on self-avoiding random walks and includes hydrodynamic interactions between coarse-grained protein subunits, modeled using the generalized Rotne−Prager−Yamakawa approximation. To estimate the hydrodynamic radius, we rely on the minimum dissipation approximation recently introduced by Cichocki et al. Using a large set of experimentally measured hydrodynamic radii of IDPs over a wide range of chain lengths and domain contributions, we demonstrate that our predictions are more accurate than the Kirkwood approximation and phenomenological approaches. Our technique may prove to be valuable in predicting the hydrodynamic properties of both fully unstructured and multidomain disordered proteins.


Recursive algorithm for generating Self-Avoiding Random Walks of Spheres (SARWS)
To efficiently generate GLM protein conformations, we use a recursive approach. The recursive implementation relies on the observation that for the whole chain to be free of self-intersections, each sub-chain within it has to be free of self-intersections as well. Based on that, we can generate conformations of a given length N recursively by randomizing two chains of length N/2 separately and then gluing them together.
For each half-chain, we add spheres successively, starting from one end of the protein, in such a way that each added sphere has one point of contact with the previous one. The position of the point of contact is selected randomly from a uniform probability distribution on the surface of the previous sphere. Then, after the whole chain is assembled, the final construct is checked for intersections between different spheres, and self-intersecting chains are discarded.
We note that an alternative approach of simply re-randomizing the location of the last attached sphere if an intersection is detected leads to biased distributions and therefore cannot be used to generate conformations. We implemented the recursive algorithm as part of the SARWS package, on which the GLM-MDA method is based.
This strategy leads to a significant performance benefit. Consider a situation where we try to generate a chain of length 4 and in our first round of randomization only beads 3 and 4 intersect. This would be detected when the recursion depth is equal to 3 (combining two chains of length 1), and only two beads would have to be re-randomized rather than four in the iterative approach. Further performance gains can be achieved by implementing the algorithm above with no memory allocations, as it requires only N memory cells for the locations of bead centers at any moment (in our case we chose std::span to pass locations and radii in an elegant way without performance drawbacks).
The recursive approach has a time complexity of O(N^(1+γ)) and provides a satisfactory and unbiased ensemble for the largest of the proteins considered here in under a minute on a personal computer (a single thread at 1.8 GHz). The speed of the recursive approach should be contrasted with an iterative one, where steps are simply added one by one and intersecting chains are discarded. This easier-to-implement method is characterized by a time complexity of ( ), which becomes prohibitively slow for chains with N > 20.
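The recursive strategy described in this section can be illustrated with a short Python sketch. It is not the SARWS package code (which, per the text, avoids memory allocations and uses std::span); the function names `sarws`, `chains_overlap`, and `random_unit_vector` are hypothetical, and the sketch allocates new lists freely for readability. On a failed glue step both halves and the contact point are redrawn together, which is the property that keeps the ensemble unbiased.

```python
import math
import random


def random_unit_vector(rng):
    """Direction drawn uniformly on the unit sphere."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    s = math.sqrt(1.0 - z * z)
    return (s * math.cos(phi), s * math.sin(phi), z)


def chains_overlap(left, left_r, right, right_r):
    """True if any bead of `left` overlaps any bead of `right`.

    The pair (last bead of left, first bead of right) is tangent by
    construction and is skipped to avoid round-off false positives.
    """
    for i, (p, rp) in enumerate(zip(left, left_r)):
        for j, (q, rq) in enumerate(zip(right, right_r)):
            if i == len(left) - 1 and j == 0:
                continue
            if math.dist(p, q) < rp + rq:
                return True
    return False


def sarws(radii, rng):
    """Bead centers of a self-avoiding chain of tangent spheres (illustrative)."""
    n = len(radii)
    if n == 1:
        return [(0.0, 0.0, 0.0)]
    half = n // 2
    while True:
        # Each half is already free of self-intersections by recursion.
        left = sarws(radii[:half], rng)
        right = sarws(radii[half:], rng)
        # Glue: place the first bead of `right` tangent to the last bead
        # of `left` at a uniformly random contact point.
        u = random_unit_vector(rng)
        gap = radii[half - 1] + radii[half]
        shift = tuple(a + gap * c - f for a, c, f in zip(left[-1], u, right[0]))
        moved = [tuple(x + s for x, s in zip(p, shift)) for p in right]
        # Only cross-intersections between the halves need checking.
        if not chains_overlap(left, radii[:half], moved, radii[half:]):
            return left + moved
        # On failure, redraw both halves and the contact point.
```

Beads of different radii, as used for structured domains in the GLM, are handled by passing the corresponding entries in `radii`.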

Fast convergence of the MDA-GLM algorithm for computation of Rh values.

Figure S1. Computed Rh value (blue) and computational time (orange) as a function of ensemble size for two cases: A) a small SAP 1A protein (n = 149, id = 13, Table S1) and B) a large H6-SUMO-GW182 SD-mCherry protein (n = 809, id = 42), presented with error bars of 2 standard deviations estimated using 10 rounds of bootstrap, which are included in the computation time. Even for moderate ensemble sizes (N = 20), Monte Carlo errors are smaller than hydrodynamic approximation errors.
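The bootstrap error bars mentioned in the caption can be sketched as follows. This is a generic illustration, not the authors' code: `bootstrap_std` and its `statistic` argument are hypothetical names, and in the setting of this work `statistic` would stand for the full Rh computation on a resampled conformer ensemble, which is not reproduced here.

```python
import random
import statistics


def bootstrap_std(samples, statistic, rounds=10, rng=None):
    """Standard deviation of `statistic` over bootstrap resamples.

    Each round draws len(samples) items with replacement and
    re-evaluates the statistic on the resampled ensemble.
    """
    rng = rng or random.Random()
    n = len(samples)
    estimates = []
    for _ in range(rounds):
        resample = [samples[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)
```

An error bar of 2 standard deviations, as in the figure, would then be `2 * bootstrap_std(ensemble, statistic)`.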

Chemicals
The chemicals for protein expression and purification were purchased from Merck (Sigma-Aldrich) and were analytically pure, grade A, or specified for molecular biology. The AF488 NHS ester was purchased from Lumiprobe GmbH. Alexa Fluor 546 NHS ester was purchased from Invitrogen.
Proteins were labelled using the AF488 NHS ester according to the manufacturer's protocol (Lumiprobe GmbH) and purified from the excess of unreacted dye by Zeba spin columns (Thermo Scientific), multi-step dialysis using Pur-A-Lyzers (Sigma-Aldrich), or another SEC run on Superdex 200 Increase 10/300 GL (Cytiva), depending on the protein properties. The residual presence of the unreacted dye was taken into account in the FCS data analysis as a second component.

Fluorescence correlation spectroscopy measurements
The FCS experiments were performed essentially as described previously 6, on a Zeiss LSM 780 with ConfoCor 3, in 50 mM Tris/HCl buffer pH 8.0 (at 25 °C), 150 mM NaCl, 0.5 mM EDTA, and 1 mM TCEP or DTT, in droplets of 25-30 µl. The buffer and the samples were filtered through a membrane with a 0.22 µm pore size immediately before the experiment. The protein concentrations were in the range of 10-20 nM after the filtration. The temperature inside the droplet, 25 ± 0.5 °C, was checked after the FCS measurements by means of a certified calibrated micro-thermocouple. A single measurement time was 3 to 6 s, repeated 10 to 100 times in a set. The set of measurements was repeated 3 times in 5 independent droplets.
The structural parameter (s) was determined each time using AF488 (D_AF488 = 435 μm² s⁻¹) or Alexa Fluor 546 (D = 341 μm² s⁻¹) in pure water 7, individually for each microscopic slide previously passivated with BSA in the working buffer. The actual solution viscosity was taken into account by comparing the diffusion times of AF488 or Alexa Fluor 546 in pure water and in the buffer at the same equipment calibration.
The experiments for proteins labelled by AF488, SUMO-mαEGFP-H6, and mαEGFP-H6 were performed at the 488 nm excitation wavelength with a relative Argon multiline laser power of 3 %, MBS 488 nm, BP 495-555 nm. For the mCherry-fused proteins and Alexa Fluor 546 calibration, the excitation wavelength was 561 nm at 2 % relative DPSS laser power, MBS 488/561 nm, LP 580 nm. A dampening factor of 10 % and a dust filter of 10 % were applied.
Photophysical processes of AF488 and the fluorescent proteins, mCherry and mαEGFP, were investigated in independent sets of experiments. A relative laser power ranging from 3 to 20 % at 488 nm was used for the AF488 triplet state lifetime measurements. The average lifetime was determined to be about 4 µs. In the case of mCherry and mαEGFP, the measurements were performed in 30 % glycerol to slow down the protein diffusion and extract the blinking 8. The fraction of the mCherry population that undergoes blinking was found to be about 24-28 % both for the fluorescent protein alone and in the fusion constructs, and about 15 % for mαEGFP.

FCS data analysis
The FCS data were analysed using the Zen2010 software (Zeiss). The raw measurements were closely inspected and refined to exclude possible oligomerization or aggregation of the protein sample in the confocal volume during the experiment. Global fitting of the autocorrelation curve was performed on data sets containing 10 to 50 single measurements. The autocorrelation function for 3D diffusion, including photophysical processes (triplet state for chemical dyes or blinking for fluorescent proteins), was fitted according to the equations 9: (eq. 5) where: G(τ) is the fitted autocorrelation function; G_T(τ), normalized autocorrelation function for photophysical processes; G_D(τ), normalized autocorrelation function for the diffusion of n components; P_T, triplet state or blinking fraction; τ_T, lifetime of the photophysical process; τ_D,i, diffusion time for the i-th component; s, structural parameter of the confocal volume; Φ_i, fraction of the i-th diffusing component.
A one-component model (n = 1) accounting for the fluorescent protein blinking was fitted for the fusion proteins, and a two-component model (n = 2), taking into account the AF488 triplet state and the presence of a residual freely diffusing dye, was used for the chemically labelled proteins. The mCherry and mαEGFP blinking fractions, as well as the AF488 triplet state lifetime determined from the independent experiments, were fixed during the global analysis.
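For orientation, a commonly used textbook form of the 3D-diffusion autocorrelation model that matches the symbol definitions given above is sketched below. It is an assumption rather than a recovery of the original equations: in particular, the amplitude term 1/N_p (with N_p, the mean number of particles in the confocal volume, a symbol not defined in this text) and the exact normalization of the photophysical term may differ from the expressions used in the cited reference.

```latex
\begin{align}
G(\tau)   &= 1 + \frac{1}{N_p}\, G_T(\tau)\, G_D(\tau), \\
G_T(\tau) &= 1 + \frac{P_T}{1 - P_T}\, e^{-\tau/\tau_T}, \\
G_D(\tau) &= \sum_{i=1}^{n}
  \frac{\Phi_i}{\left(1 + \tau/\tau_{D,i}\right)
  \sqrt{1 + \tau/\left(s^{2}\,\tau_{D,i}\right)}}
\end{align}
```

With n = 1 the sum reduces to a single diffusing species, and with n = 2 the second term describes the residual free dye.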
The Rh values were determined from the diffusion times, τ_D,i, accounting for the actual buffer viscosity, as follows: where η0 is the viscosity of pure water 10 at the temperature T, and the dye diffusion time is that of AF488 or Alexa Fluor 546 in the buffer at the same calibration.
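One way to combine the quantities named above is sketched here; it is a derivation under stated assumptions, not the authors' exact expression. Assuming that diffusion times scale inversely with diffusion coefficients at a fixed calibration, and that the buffer viscosity equals η0 times the ratio of the dye diffusion times in buffer and in water, the Stokes-Einstein relation gives Rh = k_B T τ_protein / (6π η0 D_dye τ_dye), where D_dye is the dye diffusion coefficient in pure water and τ_dye its diffusion time in the buffer. The function and parameter names are hypothetical.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def hydrodynamic_radius(tau_protein, tau_dye_buffer, d_dye_water,
                        eta_water, temperature):
    """Hydrodynamic radius in metres (illustrative Stokes-Einstein sketch).

    tau_protein    : protein diffusion time in the buffer (s)
    tau_dye_buffer : reference dye diffusion time in the same buffer (s)
    d_dye_water    : dye diffusion coefficient in pure water (m^2/s)
    eta_water      : viscosity of pure water at `temperature` (Pa s)
    temperature    : absolute temperature (K)

    The buffer viscosity cancels out: only the water viscosity and the
    dye diffusion time measured in the buffer remain.
    """
    return (K_B * temperature * tau_protein
            / (6.0 * math.pi * eta_water * d_dye_water * tau_dye_buffer))
```

For example, with AF488 as the reference, `d_dye_water` would be 435e-12 m² s⁻¹, as quoted in the measurement section.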
The numerical regressions were performed by Prism 6 (GraphPad Software).
The total experimental uncertainty was determined according to the propagation rules for small errors 11, taking into account the numerical uncertainty of the fitting, the statistical dispersion of the results, and the uncertainties of other experimental values used for calculation of the results.
A power function of the number of polymer units (N) was fitted to the experimental Rh values of folded proteins determined by FCS (Table S1), according to the equation: $R_h = R_0 N^{\nu}$. The critical exponent value, ν, was calculated as 0.33 ± 0.02, in agreement with the value of 1/3 for a polymer chain packed into a spherical shape, and R0 was determined as 3.9 ± 0.6 Å, which corresponds to an average Rh value for free amino acids, 3.2 ± 0.4 Å 12.
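A power-law fit of this kind can be sketched as a straight-line least-squares fit in log-log coordinates. This is only an illustration: the text states that the regressions were performed in Prism 6, whose nonlinear fit on the original scale can give slightly different parameters and uncertainties, and the function name `fit_power_law` is hypothetical.

```python
import math


def fit_power_law(n_values, rh_values):
    """Fit Rh = R0 * N**nu by linear least squares on log-transformed data.

    Returns (R0, nu). Uncertainties are not estimated in this sketch.
    """
    xs = [math.log(n) for n in n_values]
    ys = [math.log(r) for r in rh_values]
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    nu = sxy / sxx                           # slope in log-log space
    r0 = math.exp(y_mean - nu * x_mean)      # intercept mapped back
    return r0, nu
```

Applied to synthetic points lying exactly on a power law, the function returns the generating parameters, which gives a quick sanity check before using measured data.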

Bioinformatics
Example conformations of the IDPs were generated by AlphaFold 2.0 notebook 13,14. Protein structures were drawn using Discovery Studio v3.5 (Accelrys Software).
Identification of the protein sequence fragments to be treated as ordered regions and mimicked by larger balls in the globule-linker model (GLM) was done using Disopred3 15.
A fragment was assumed to be ordered if the disorder probability P was less than 50 % for at least three consecutive amino acid residues. Loops linking such fragments were included in the ordered region if they did not exceed 14 residues 16.
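The selection rule above can be read as a two-step procedure, sketched below under one interpretation of the text: first find runs of at least three residues with P below 0.5, then merge neighbouring runs when the loop between them is at most 14 residues long. The function name and its parameters are hypothetical, and the input is assumed to be a plain list of per-residue disorder probabilities such as those reported by a disorder predictor.

```python
def ordered_regions(disorder_prob, threshold=0.5, min_len=3, max_loop=14):
    """Return ordered regions as half-open, 0-based (start, end) index pairs."""
    # Step 1: runs of consecutive residues with P < threshold,
    # kept only if they are at least `min_len` residues long.
    runs = []
    start = None
    # A sentinel above any probability closes a run that reaches the C-terminus.
    for i, p in enumerate(list(disorder_prob) + [float("inf")]):
        if p < threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    # Step 2: merge runs separated by a loop of at most `max_loop` residues.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] <= max_loop:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged
```

Each returned region would then be represented by a single larger bead in the GLM, with the remaining residues treated as linker beads.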

Selection of Rh from literature data
The experimental benchmark set was complemented by Rh values selected from the literature.

1) The Rh value from FCS is slightly underestimated due to the residual presence of the freely diffusing dye, which could not be completely separated from the protein by SEC, and the short diffusion time of lysozyme.
2) Shown in Figures 2 and 3A (main text) for comparison with other proteins; not included in the analysis of the theoretical model.

... (Table S2) and experimental results (Table S1) for the benchmark set. IDPs (full green squares) and folded proteins (full black circles) from this work; IDPs from literature (blank squares); the two largest outliers are marked in red (fesselin, Id. 43, N = 996, SEC) and magenta (OMM-64, Id. 39, N = 608, AUC). Error bars reflect both theoretical (Table S2, column F) and experimental uncertainties (Table S1) calculated according to small errors propagation rules.

Figure S8. Correlation analysis of Rh values predicted for IDPs by MDA+GLM (Table S2, columns D, F) vs. experimental results (Table S1, the benchmark set excluding globular proteins); 1:1 relationship (thin black line); linear fit to the data points without free y-intercept (green broken line, except F); (A) all Rh values (full green squares, this work; blank green squares, literature); (B-F) subsets of results obtained using different experimental approaches, i.e. PFG-NMR, FCS (this work), SEC, DLS, and AUC, respectively.
The Snedecor's F-test for the linear functions with and without the y-intercept as a free parameter fitted to the data points from the IDPs benchmark set showed that the y-intercept is insignificantly different from zero, -0.26 ± 3.6. The fit (Figure S8 A) yielded a slope of 0.96 ± 0.03 (with a 90% confidence interval, CI, of 0.905 to 1.006). This means that MDA+GLM provides a good 1:1 correlation with the experimental results for IDPs even at the level of 90% CI.
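The comparison of the two nested linear models can be sketched as an extra-sum-of-squares F statistic. This is a minimal illustration with a hypothetical function name, not the Prism procedure used for the reported values; it ignores measurement uncertainties (no weighting) and returns only the statistic, which would then be compared with an F distribution with 1 and n - 2 degrees of freedom.

```python
def nested_linear_f_statistic(x, y):
    """F statistic for y = a*x (restricted) against y = a*x + b (full).

    Returns (slope_through_origin, (slope, intercept), F).
    """
    n = len(x)
    # Restricted model: line through the origin.
    a0 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    rss0 = sum((yi - a0 * xi) ** 2 for xi, yi in zip(x, y))
    # Full model: ordinary least squares with a free intercept.
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    a1 = sxy / sxx
    b1 = y_mean - a1 * x_mean
    rss1 = sum((yi - a1 * xi - b1) ** 2 for xi, yi in zip(x, y))
    # One extra parameter in the full model; n - 2 residual degrees of freedom.
    f_stat = (rss0 - rss1) / (rss1 / (n - 2))
    return a0, (a1, b1), f_stat
```

A small F value means that freeing the intercept does not improve the fit appreciably, which is the situation described in the text.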
The R² of the linear correlation between the predicted and experimental results for all IDPs from Table S1 is 0.7534 (Figure S8 A), which means that our model explains ~75% of the Rh variability within the IDP benchmark set. The remaining part of the variability, as well as the slightly underestimated slope value, can have several sources. Among the main reasons for the discrepancies are the intrinsic properties of individual experimental methods, which suffer from typical errors or limitations and are usually not taken into account when reporting the final experimental results.
The root mean square of the relative uncertainty for all experimental data (Table S1), when given, is 5.8%. Even for a perfect model that accurately predicts the diffusion coefficient, assuming the measurement uncertainty is only random (not systematic), achieving R² = 1 is impossible due to the inherent random noise in the data. The median R² values under these conditions, determined theoretically, are gathered in Table S5.
Table S5. Columns: relative error (%); median R² of a perfect model.

However, the value of 5.8% seems underestimated. This is because it relies on undervalued figures provided in the literature, where only some parts of the uncertainty are included in the error estimates, and in some cases, no error analysis is provided. Assuming a more realistic overall measurement error of 10%, which may still be considered small for certain measurements, the best possible model should give a typical R² of ~0.9.
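The effect of random measurement noise on the R² of a perfect model can be explored with a small Monte Carlo sketch. Several choices here are assumptions, not taken from the text: the noise is Gaussian and multiplicative, R² is computed as the coefficient of determination of the measured values against the noise-free predictions, and the set of true Rh values is supplied by the user. The resulting medians depend on how widely the true values are spread, so this sketch is not expected to reproduce the Table S5 entries unless the benchmark values and the authors' definitions are used.

```python
import random
import statistics


def r_squared(predicted, measured):
    """Coefficient of determination of `measured` against `predicted`."""
    mean_m = statistics.fmean(measured)
    ss_res = sum((m - p) ** 2 for p, m in zip(predicted, measured))
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def median_r2_perfect_model(true_values, rel_error, trials=1000, rng=None):
    """Median R² when a perfect model is compared with noisy measurements."""
    rng = rng or random.Random()
    results = []
    for _ in range(trials):
        # Each synthetic measurement is the true value with relative noise.
        measured = [v * (1.0 + rng.gauss(0.0, rel_error)) for v in true_values]
        results.append(r_squared(true_values, measured))
    return statistics.median(results)
```

Calling the function for a range of `rel_error` values shows the median R² of the perfect model falling below 1 as the assumed relative error grows.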
Considering that our GLM-MDA approach involves approximated hydrodynamics, the theoretical Rh values carry an error of ~5%. Therefore, one should expect an R² of at most about 0.85, even with exceptionally precise modeling of conformers, hydration layers, and other complex factors.
Intrinsically disordered benchmark proteins gathered in Table S1.
Sequence numbering according to Table S1.

Table S1. Experimental values of hydrodynamic radii, Rh, for the benchmark proteins. Most of them are intrinsically disordered proteins (unless otherwise noted in the Remarks column). N, number of amino acid residues in the protein chain.