Quantifying Protein–Protein Interactions in Molecular Simulations

Interactions among proteins, nucleic acids, and other macromolecules are essential for their biological functions and shape the physicochemcial properties of the crowded environments inside living cells. Binding interactions are commonly quantified by dissociation constants Kd, and both binding and nonbinding interactions are quantified by second osmotic virial coefficients B2. As a measure of nonspecific binding and stickiness, B2 is receiving renewed attention in the context of so-called liquid–liquid phase separation in protein and nucleic acid solutions. We show that Kd is fully determined by B2 and the fraction of the dimer observed in molecular simulations of two proteins in a box. We derive two methods to calculate B2. From molecular dynamics or Monte Carlo simulations using implicit solvents, we can determine B2 from insertion and removal energies by applying Bennett’s acceptance ratio (BAR) method or the (binless) weighted histogram analysis method (WHAM). From simulations using implicit or explicit solvents, one can estimate B2 from the probability that the two molecules are within a volume large enough to cover their range of interactions. We validate these methods for coarse-grained Monte Carlo simulations of three weakly binding proteins. Our estimates for Kd and B2 allow us to separate out the contributions of nonbinding interactions to B2. Comparison of calculated and measured values of Kd and B2 can be used to (re-)parameterize and improve molecular force fields by calibrating specific affinities, overall stickiness, and nonbinding interactions. The accuracy and efficiency of Kd and B2 calculations make them well suited for high-throughput studies of large interactomes.


INTRODUCTION
In biological cells, most protein, DNA, and RNA molecules have to bind to specific binding partners to perform their biological functions. These specific interactions compete with nonspecific interactions, and cells have evolved various mechanisms to minimize wasteful nonspecific binding. 1,2 However, nonspecific interactions shape the physicochemical properties of the crowded environments inside cells. 3 The quantification of binding affinities and interaction strengths of biological macromolecules is thus crucial for the understanding and modeling of cellular processes. In the following, we focus on protein−protein interactions, but all of our results are generally applicable to other specific and nonspecific binding interactions.
Experimentally, protein interactions are quantified by the dissociation constants K d and the second osmotic virial coefficient B ij of protein species i and j. We follow the common convention and use B 22 for self-interactions and B 23 for cross-interactions. The dissociation constant K d quantifies the amount of bound proteins and can be measured in isothermal titration calorimetry, surface plasmon resonance, or analytical ultracentrifugation experiments, for example. 4 The interaction strength of pairs of proteins in binding and nonbinding configurations can be quantified by measuring the second osmotic virial coefficient B ij , which relates the microscopic protein interactions to the macroscopic osmotic pressure. 5−7 Moreover, the second osmotic virial coefficient is related to solubility and used as a predictor for protein crystallization conditions. 8,9 In experiments, B ij is measured by sedimentation 10−12 and size-exclusion chromatography. 8 Scattering experiments, such as static light scattering (SLS) and small-angle X-ray scattering (SAXS) experiments, can provide approximate estimates for B ij . 13,14 K d and B ij are crucial quantities to relate molecular simulations of interacting proteins to the experiment. Such comparisons become increasingly important as molecular simulations of crowded cell-like environments have become computationally feasible, even in full atomic detail. 15,16 In simulations of strong binders, K d is usually determined by calculating the binding free energy to specific binding interfaces. 17 If binding interfaces are unknown, K d values are often calculated from the ratio of bound and unbound populations, 18 as recently applied to RNA−RNA binding. 19 As we will discuss here, this approximation is accurate only for special cases. B ij can be estimated by integration over the configuration space, 20−22 by Mayer sampling, 23,24 from molecular simulations using radial distribution functions or potentials of mean force, 25−28 and by simply counting all configurations in which proteins do not interact. 29,30 Here, we show that K d is fully determined by B ij and the fraction p b (V) of bound proteins estimated from molecular simulations of two proteins in a box with volume V, i.e.
Here N A is Avogadro's constant. In the derivation of this equation, we do not make any assumptions about the interaction strength or about the degrees of freedom of proteins or the solvent. Thus, it is generally applicable and valid not only for coarse-grained simulations using implicit solvents but also for fully atomistic molecular dynamics simulations using explicit solvents. We present two different routes to calculate B ij and thus K d . For simulations using implicit solvents, we can apply protein insertion and removal moves to estimate the free energy that corresponds to the two-particle partition function determining B ij . The insertion ensemble can be generated with any Monte Carlo or molecular dynamics code to sample from the canonical ensemble without modification. We estimate the partition function by combining the insertion and removal ensemble using either Bennett's acceptance ratio (BAR) method 31 or the binless weighted histogram analysis method (WHAM). 32−34 In contrast to the Mayer sampling method, 23,24 which uses molecular Monte Carlo integration to calculate virial coefficients even of higher orders, here, we use exactly the same simulation system for the calculation of B ij as we use to sample from the canonical ensemble.
For simulations using either implicit or explicit solvents, B ij can be calculated accurately by estimating the probability that the two proteins are outside of their interaction range. 29 We present mathematically simple expressions for B ij and K d in terms of this probability, which provide insights into their physical interpretations complementary to more common formulations based on radial distribution functions or potentials of mean force.
We quantify the interactions of the two proteins when they are not bound using K d and B ij . Previously, theoretical models for excluded volumes have been used to extract nonbinding interactions from experimentally measured B ij values. 35 Here, we use the fact that the contributions of bound configurations to B ij are completely determined by K d and show that the remaining contributions have a simple and clear interpretation. Moreover, we propose that these contributions of nonbinding interactions can be estimated in experiments.
The article is organized as follows. In Section 2, we derive expressions to calculate the dissociation constant and the second osmotic virial coefficient from simulations. We present the details of our methods in Section 3 and a validation of our methods and results for three weekly binding proteins using coarse-grained simulations in Section 4. We end with conclusions in Section 5.

THEORY
For simulations of two proteins in a box, we show that the dissociation constant K d is determined by the binding probability and the second osmotic virial coefficient B ij of protein species i and j. The latter is determined by the twoparticle partition function, which in general can be estimated from the fraction of states where proteins are outside of their interaction range 29

∑ ∑∑
where Π is the osmotic pressure, V m is the molar volume, R is the gas constant, T is the temperature, x i is the mole fraction of species i, and B ij is the osmotic second virial coefficient of proteins of species i and j.
We can express the second virial coefficients B ij of an arbitrarily shaped particle of species i and an arbitrarily shaped particle of species j, via one-and two-particle configurational partition functions. To do so, we extend the derivation by McQuarrie to nonspherical particles 7 and start from the grand canonical partition function where V is the volume, N i is the number of molecules of species i containing n i atoms each, and z i = exp(βμ i ) is the fugacity determined by the chemical potential μ i of species i and the inverse temperature β = 1/(k b T). k b is Boltzmann's constant. The osmotic pressure is a function of the fugacities and given by βΠV = ln Ξ(T,V, z 1 ,...,z m ). 38,39 Here, we write the canonical partition function Q(N 1 ,...,N m ) of m species of arbitrarily shaped particles as 4) where N N ( , ..., ) m 1 is the corresponding configurational partition function where the potential energy U(X) depends on the set X of all |X| = ∏ i N i n i atom positions. In eq 4, we introduced V ( ) i for the single-particle canonical partition function, e.g., = (0, 1, 0, ..., 0) 2 . For spherically symmetric particles, and we recover McQuarrie's expression 7 for Q(N 1 , ..., N m ). For rigid cylindrically symmetric and asymmetric particles, respectively. Note that in the following, we use "Z" instead of " " for these expressions for rigid molecules to distinguish them from the full configurational partition function of flexible molecules written as calligraphic " ". We obtain for the second osmotic virial coefficients where we introduced ij for the two-particle partition function, e.g., = (1, 1, 0, ..., 0) 12 for a pair of particles  (7) where c 0 = 1M is the standard concentration and Δp is the pressure difference between bound and unbound states. 40 The last term is usually small and can be neglected.
For large enough box volumes V, one would be tempted to estimate the dissociation constant of two proteins A and B directly from the binding probability p b (V).
For small box sizes typically used in simulations, this estimate suffers from finite-size effects. Accurate estimates using eq 8 would require unusually large boxes, as we show in Section 4, which makes sampling highly inefficient.
To overcome this finite-size effect, we effectively extend the box volume analytically and calculate K d in the limit of infinite volume ( Figure 1). We emphasize here that in the following derivation, we consider fully flexible proteins without any restrictions on their internal degrees of freedom. We remove the translational and rotational degrees of freedom of the protein of species i, which correspond to a factor Z i (V) = 8π 2 V in the partition function for asymmetric proteins. That is, we fix the position and orientation of the protein of species i, which leaves the internal degrees of freedom due to the flexibility of the protein unchanged. The corresponding partition function of j in the presence of i, with i fixed in position and orientation but internally flexible, is given by We extend this system with a fixed position and orientation of the flexible protein i by an additional volume ΔV accessible to the second protein. The contribution to the partition function of a protein of species j being in this additional volume ΔV is given by where Z j (ΔV) = 8π 2 ΔV gives the contribution due to the translational and rotational degrees of freedom of an asymmetric protein to the partition function.̃i and̃j are the partition functions of individual proteins i and j, whose positions and orientations are fixed in space. That is,̃i and j contain only contributions due to the respective internal degrees of freedom of free proteins and due to the degrees of freedom of solvent molecules in the vicinity of the proteins, which differ from the bulk due to the presence of the protein.
For rigid protein models in implicit solvents,̃=̃= 1 i j . The probability p b (V ex ) that the two proteins are bound in the extended volume V ex = V + ΔV is now given by the ratio of the partition function (b) of the bound proteins to the partition function of the extended system With the position of protein i fixed, (b) is independent of the size of the volume V and thus the same for the simulation box and for the extended system, i.e., To calculate a K d value unaffected by the finite size of the simulation box, we now substitute eq 11 into eq 8. We then take the limit ΔV → ∞ and use that Z j (V)/V = 8π 2 to obtain We can rewrite this equation realizing that the partition function of all bound states of the system, where also protein i can move and rotate, is given by Expressing V ( ) ij by the second osmotic virial coefficient defined in eq 6 To obtain a K d estimate independent of box size, we analytically extend the twoparticle partition function for the simulation box by the contributions of an extension volume ΔV (gray shaded area) and perform the limit ΔV → ∞. We calculate B ij from the probability p v (V) that the two proteins are within a subvolume v (green), which is at least large enough to cover all protein−protein interactions (yellow shaded area).
The Journal of Physical Chemistry B pubs.acs.org/JPCB Article and inserting the resulting expression in eq 12, we obtain the relationship between K d and B ij given in eq 1 As a corollary, the volume dependence of the fraction of bound proteins is parameterized by K d and B ij . As we derive in the following, K d and B ij fulfill the approximate relation This approximate relationship becomes an exact relationship if we define all interacting states as bound states 41,42 or for proteins that do not interact when they are not bound (see Section 2.4). We write V ( ) ij as a sum of the partition , and insert this expression in eq 14. We then obtain If unbound interactions are weak, theñ̃− such that where we used eq 13. Rearranging this equation, we arrive at eq 16. We can now insert this expression into eq 15 and obtain from which we can express K d to obtain an approximate estimate for K d , which we call K d ′, i.e.
Here, we introduced the fraction of unbound protein configurations as p u (V) = 1 − p b (V). Note that eq 21 corresponds to eq 13 of de Jong et al. 18 For the exact relationship between K d and K d ′, see Section 2.4. 2.3. Estimating the Second Osmotic Virial Coefficient. As we have shown above, we have to estimate B ij to accurately estimate K d . To do so, we apply the same concepts as we have used for the calculation of K d . We first remove contributions to the partition function due to the translational and rotational freedom of the whole system by keeping the position and the orientation of the otherwise flexible protein i fixed (eq 9). Around this protein, we define a subvolume v < V, which has to be big enough such that it captures all protein−protein interactions ( Figure 1). Outside this subvolume, protein− protein interactions can be neglected. That is, the flexible protein j moves freely when it is in volume δv = V − v.
The probability p v (V) that protein j is in subvolume v is given by where v ( ) is given analogous to eq 9 and Z j (δv) = 8π 2 δv for asymmetric proteins. We can express v ( ) from eq 22 as Usiing eq 9 for v ( ) , it follows that Using that Z j (v) and Z j (δv) are proportional to their arguments with the same prefactor (see Section 2.1) and Solving for p v (V), we obtain (27) which describes the dependence of p v (V) on the box volume V and the subvolume v.
We emphasize that eq 26 is generally valid for arbitrary binding partners, without making any assumptions about symmetry or the number of internal degrees of freedom of the binding partners or of the solvent. The only condition is that interactions between binding partners are negligible outside of the volume v. We can introduce correction terms based on an effective pairwise potential acting between the binding partners if this condition is not fulfilled (see Section 2.7).
To motivate the interpretation of eq 26, we rewrite it as Note that the prefactor in eq 28 contains the box volume V, whereas the prefactor in eq 26 contains the subvolume v. The first term in the brackets, determining the two-particle partition function, is the ratio of the probability of finding one protein outside of the subvolume v for the ideal system, 1 − v/V, to the corresponding probability for the interacting proteins, 1 − p v (V). This ratio, which is the inverse of the quantity f 2 (V) of Ashton and Wilding, 29 is independent of the subvolume v, chosen to be just large enough to cover the interaction range. Consequently, the first term in the brackets where we introduced the excess free energy of finding the two proteins outside of their interaction range in the box of volume V as We express K d as a function of p v (V) by inserting eq 28 into eq 1 and obtain We next establish the commonly used relationship of B ij to the partial radial distribution function g(r). 7 The ratio of can be estimated from the probability density of center-of-mass distances p(r) of two proteins in a box, for instance, which is itself related to the radial distribution function g(r). To do so, we define a spherical volume v = 4πR 3 /3 and a spherical shell around this sphere with volume δv = 4π[(δR + R) 3 − R 3 ]/3. The ratio is then given by We define a radial distribution function g(r) through We can choose the proportionality constant such that g(r) = 1 for r > R, where p(r) ∝ r 2 . Then, 4π ∫ R R+δR g(r)r 2 dr = δv and we may write Inserting this expression in eq 26 and using that ∫ 0 R 4πr 2 dr = v, we obtain By introducing an effective interaction potential βw(r) = −ln g(r), we can write eq 34 as it is commonly presented Using eq 28 instead of eq 34 or 35, we can avoid the computation of distance distribution functions and potentials of mean force, respectively, and the subsequent integration. Importantly, we also do not have to estimate the plateau value of g(r), which in simulations is different from one and which depends on system size and the thermodynamic ensemble. 29,43 Although these differences might be viewed only as a minor simplification, eq 28 emphasizes that B ij is independent of the detailed shapes of g(r) and w(r) and determined by the excess free energy F o (ex) (V) of finding the two proteins outside of their interaction range. Note that our results also apply to the infinite dilution limits of the Kirkwood where we use the superscript "(u)" to indicate contributions of the unbound states. For binding proteins, B ij (u) is given by the difference between B ij = B ij (u) + B ij (b) and the contributions to B ij (b) due to binding i.e., we can quantify the nonbinding interactions for two binding proteins via which becomes For hard spheres, p u (V) = 1 and where v exc is the excluded volume, such that B ij (u) = v exc /2. For attractive nonbinding interactions B ij (u) < v exc /2, and for repulsive nonbinding interactions B ij (u) > v exc /2. Note that for asymmetric proteins, v exc corresponds to an excluded region in the configuration space, which, for instance, is spanned by Cartesian coordinates and Euler angles in the case of rigid proteins. Thus, in general, v exc should be viewed as an effective volume corresponding to a thermodynamic free energy.
We now show that B ij (u) quantifies the difference between the approximate expression for K d in eq 21 and the box-sizeindependent expression for K d in eq 1. Inserting eq 11 into eq 21, we obtain i k j j j j j j j y { z z z z z z z (40) such that the relative difference is given by Consequently, the approximate estimate K d ′ deviates systematically from the true value K d , with deviations proportional to B ij (u) , but converges to the true value with increasing box volume as 1/V.
2.5. Indistinguishable Binding Partners (Homodimers). So far, we have assumed that the proteins are distinguishable, i.e., that they form heterodimers, but all expressions derived here are also valid for indistinguishable binding partners forming homodimers. To consider the case of two identical binding partners, we rewrite eq 13 as where we introduced for the partition function of two free proteins, which is determined by the product of two single-protein partition functions. For indistinguishable binding partners forming homodimers, both and The Journal of Physical Chemistry B pubs.acs.org/JPCB Article would have to be multiplied by a factor 1/2 to account for the indistinguishablity of the proteins. However, these factors then cancel in the ratio in eq 42.
2.6. K d and B ij from a Single Simulation. We can estimate K d and B ij from the fraction of bound protein p b (V) and the probability p v (V) of one protein being located in a subvolume v around the other. The latter determines B ij according to eq 28, which we then insert into eq 1 to obtain the finite-size corrected estimate of K d . We call this method the subvolume method. To calculate B ij , we can also estimate the two-particle partition function Z ij (V), now for simplicity but without loss of generality only considering rigid molecules, using free energy methods. 46 For implicit solvents, we can use insertion and removal moves of the proteins to efficiently estimate Z ij (V), as explained in the following. We call this method the insertion/removal method.
2.6.1. Estimating Two-Particle Configurational Partition Functions for Implicit Solvents. A simulation of a pair of proteins in a box of volume V at reciprocal temperature β gives us immediately the particle-removal energy distribution as the normalized distribution of potential energies. We define x i = (r i ,Ω i ), where r i are the Cartesian coordinates of the geometric center of protein i and Ω i are its Euler angles defining its orientation. We denote the configuration space as W = V × Ω to simplify the notation. The particle-removal energy distribution is then given by where Z 23 (β) = ∫ W 2 dx 2 dx 3 e −βU(x 2 ,x 3 ) and δ[·] is Dirac's delta function.
The particle-insertion energy distribution p ins (E) is formally given by where Z i (β = 0) = ∫ W dx i . Sampling the particle-insertion energy distribution p ins (E) for a given box size is straightforward. All one needs is a replica with reciprocal temperature β = 0 exactly. All moves will then be accepted, and the energies saved are those of random insertions. Alternatively, one could make trial moves of the two proteins with Monte Carlo move widths ±L/2, where L is the box length, and the orientation changes about random axes by ±π, and to write out the absolute trial (!) energies (not the energy differences or the accepted energies). With such a move protocol, it would not matter if one or both particles were moved and if moves are accepted or not. It also does not matter what the "acceptance rate" is (i.e., it can be zero!). What is important, though, is that the box volumes in insertion and removal runs are the same. The normalized removal and insertion energy distributions are related to each other by which follows from The ratio of partition functions defines the free energy of going from a system of two noninteracting particles to a system in which they interact where E i are the uncorrelated (by construction) insertion energies and E i are the uncorrelated removal energies. However, it is clear that this is problematic in cases where the proteins are strongly bound (forming a dimer!) because then one would have very little information about higher energies. This problem can be remedied using all of the data in a temperature replica exchange simulation. In effect, the hightemperature runs allow us to estimate an accurate density of states to a pretty high energy. The particle-insertion energies complement this density of states on the high-energy side. All of the runs at different temperatures can be combined with the list of insertion energies using binless WHAM. As a reference, we take the temperature of interest (β = β 1 without loss of generality). The bias energies at replicas with reciprocal temperature β i are then ΔU = (β i /β − 1)U. This formula works also for the insertion energies coming from a run with β i = 0. The insertion energies can be thought of as coming from a run with the bias potential ΔU = −U, i.e., on potential zero. A binless-WHAM analysis using these bias energies as input will produce the required free energy F as the difference between the reference state and the insertion run.
2.7. Practical Considerations. In the derivation of K d and B ij , we have assumed that the volume is large enough such that interactions between the protein with a fixed position and orientation and the protein in the extended volume can be neglected. If this condition is not fulfilled, then we can correct for residual interaction energies using a simple distancedependent interaction potential ϕ(r) in the calculation of , where denotes the Cartesian space defining ΔV. For example, at large distances, the interaction of charged proteins can be approximated by (screened) Coulomb interactions of the total charges located at the centers of charge. In such a case, we would include for the calculation of the fraction bound only configurations of the simulation where the two proteins are separated less than a cutoff distance, usually given by half the shortest box length. Such a system corresponds to a spherical volume with one protein at its center and the other one moving unrestrained. Doing so, we assume that the residual interaction modeled as a The Journal of Physical Chemistry B pubs.acs.org/JPCB Article simple pair-potential has a negligible effect on the internal degrees of freedom and the degrees of freedom of the surrounding solvent, i.e.,̃i and̃j are unchanged. Suitable definitions of bound states will depend on the molecular model we use for simulations. In our simulations of rigid proteins in implicit solvents, we consider a state as bound if the interaction energy of the two proteins is smaller than −2k b T. Additionally demanding that two proteins have to have a minimum C α distance smaller than 0.8 nm to be counted as bound does not have a noticeable effect on the binding probability. For molecular dynamics simulations using explicit solvents, a combination of distance-and energy-based criteria and using transition-based-assignment of states 47 might be necessary to reliably distinguish bound states from spurious contacts.
In simulations of two proteins in a box, we can estimate p v (V) using a distance-based criterion as has been introduced by Ashton and Wilding. 29 We define a distance between the two proteins, e.g., the center-of-mass distance r. We introduce a distance R such that interactions between proteins are negligible for distances r > R. For an ensemble of N structures, we count the number of structures N v for which r ≤ R. In these structures, the center-of-mass of protein 2 lies within a spherical volume v = 4πR 3 /3 centered at the center-of-mass of protein 1. We then estimate p v (V) = N v /N.
For strong binders and in boxes of typical size, ) in eq 28 diverges. Consequently, p v (V) has to be determined with sufficient numerical precision to obtain accurate estimates. For example, if we sample 10 000 configurations, then the numerical precision of p v (V) is limited to 1/10 000. The precision can be increased by sampling more configurations or, in the case of replica simulation, by including additional replicas using WHAM when calculating p v (V). For weak binders with K d ≳ 100 μM, 10 000 configurations are sufficient to estimate K d and B ij even without applying WHAM.

METHODS
We chose three weakly binding protein pairs with experimental K d values covering 3 orders of magnitude from ∼μM to ∼mM. The lysozyme homodimer has an experimental K d value of K d ≈ 2710 ± 240 μM 48 (PDB 6LYZ 49 ), the ubiquitin/CUE dimer (PDB 1OTR 50 ) has a K d ≈ 155 ± 9 μM, 50 and the dimer of the uracil-DNA glycosylase UDG and its uracil-DNA glycosylase inhibitor protein (Ugi) has a K d ≈ 1. 3 ± 0.3 μM 51 (PDB 1UUG 52 ).
To simulate these protein pairs, we use the amino-acid-level coarse-grained model developed by Kim and Hummer for weakly binding proteins 53 implemented in the Complexes++ software (https://www.github.com/bio-phys/complexespp). We treat all proteins as rigid bodies. In contrast to the original model, which is called the KH-model, we shift the original Miyazawa and Jernigan parameters 54,55 by e 0 = −1.875 k b T, where T = 300 K, to account for the solvation energy and we scale the resulting parameters by λ = 0.1243 to balance them with the electrostatic interactions. In the original model, e 0 = −2.27 k b T and λ = 0.159. The new values have been chosen to better reproduce the B 22 value of lysozyme and the K d value of the ubiquitin/UIM1 complex. We chose residue charges of −1.0e for Asp and Glu, +1.0e for Arg and Lys, and +0.5e for His because its isoelectric point is at pH 7. e is the elementary charge. Consequently, the total charges of the proteins are +8.5e for lysozyme, +0.5e for ubiquitin, −4.5e for CUE, +7.5e for UDG, and −11.5e for Ugi. We set the dielectric constant to 80 and the Debye length to 1 nm, corresponding to the conditions in an aqueous solution of 100 mM NaCl.
To generate Boltzmann ensembles of configurations, which also provide the removal energy distributions defined in eq 43, we perform temperature replica exchange Monte Carlo (REMC) simulations using 24 replicas. Temperatures were equally spaced between 300 and 530 K. In a Monte Carlo sweep, each protein performs one trial move on average, which can be translation or rotation. Replica exchanges are attempted every 10 sweeps. For the rotation move, a rotation axis is randomly generated by drawing a point from a sphere. Then, we rotate around this axis by an angle, which we draw from a box distribution with a width given by twice the maximum angle. This maximum angle is set to 0.1 rad for the coolest replica and to 1.25 rad for the hottest replica and spaced equidistantly in between. Similarly, we set the maximum displacement for the translation move to 0.2 nm in the coolest replica and to 1.35 nm in the hottest replica, with equal spacing in between. In our simulations, we use a cutoff radius of 3 nm to truncate our interaction potentials.
To sample the insertion energy distribution defined in eq 44 in simulations, we switch off all interactions by setting all interaction parameters and residue charges to zero. We use a maximum displacement of half the box length and a maximum rotation angle of π. We accept and sample all configurations to generate the insertion ensembles, for which we then recalculate all energies for switched-on potentials.
To estimate the two-particle partition function, we combine results from REMC simulations (removal ensemble) and the energies calculated for the ensemble of noninteracting proteins (insertion ensemble) using binless WHAM. 32−34 To avoid numerical problems, we clip interaction energies at 100 k b T. We define two proteins as being bound if their total interaction energy is below −2k b T.
For equilibration, we performed 10 6 Monte Carlo sweeps in each replica. For production, we performed 10 7 sweeps and we sampled every 100th sweep, yielding 10 5 structures for each protein pair per replica. We also performed 10 6 insertion moves for each pair, which by design creates uncorrelated configurations.
To study the box volume dependence of the fraction bound p b (V), we calculated for the coolest replica p b = N E≤−2k b T /N. N E≤−2k b T is the number of structures with energies E ≤ −2k b T, and N = 10 5 is the total number of structures. To study the box volume dependence of the subvolume probability p v (V), we calculated for the coolest replica p v = N v /N, where N v is the number of structures within the subvolume v. We defined this volume as a spherical volume with a radius given by the sum of (D i + D j )/2, where D i and D j are the largest diameters of proteins of species i and j, respectively, and our cutoff radius of 3 nm. The resulting radii are between ∼6.7 and ∼7.4 nm for the three proteins. For each protein pair, we performed simulations for 17 box sizes with volumes ranging from 3375 to 10 6 nm 3 . We calculated the standard errors of the mean by block averaging. 56,57 We validate the insertion/removal method and the subvolume method for the smallest boxes used here with volumeṼ = 15 3 nm 3 = 3375 nm 3 . With uniform probability, we selected at random 10 000 of the N = 10 5 samples and chose for each replica the configurations corresponding to the The Journal of Physical Chemistry B pubs.acs.org/JPCB Article same 10 000 indices. We also drew 10 000 configurations of the 10 6 configurations in the insertion ensemble with uniform probability. In the insertion/removal method, we then applied WHAM using these 250 000 configurations in total to calculate p b (Ṽ ) andṼ ( ) ij , from which we then estimated K d and B ij . We repeated this procedure 1000 times and calculated the averages of K d and B ij and their covariance matrices. We confirmed visually that the distributions of the estimates of K d and B ij are distributed according to two-dimensional Gaussians with the estimated covariance matrices. We use the same protocol to obtain estimates and uncertainties from resampling for the subvolume method, in which we do not use the insertion ensemble.

RESULTS
We calculated K d and B ij using the insertion/removal method and the subvolume method for three protein pairs, i.e., the lysozyme homodimer and the heterodimers ubiquitin/CUE and UDG/Ugi. As we will show, these estimates allow us to quantify the contributions due to binding and nonbinding interactions to B ij .
In the insertion/removal method, we determine K d and B ij from replica exchange simulations at a box volumeṼ and from insertion ensembles. We first estimated p b (Ṽ ) and Z ij (Ṽ ) by combining the insertion ensemble and the replicas of our temperature REMC simulations using WHAM. We then evaluated eq 14 to obtain B ij and used this value together with our estimate for p b (Ṽ ) in eq 1 to obtain K d . By resampling, we estimated the covariance matrix.
In the subvolume method, we first estimated p v (Ṽ ) and p b (Ṽ ) from all replicas using WHAM. We used eq 26 to calculate B ij from p v (Ṽ ) and used this estimate together with p b (Ṽ ) to estimate K d using eq 1.
We find that the estimates for K d and B ij from the insertion/ removal method and the subvolume method agree excellently with each other (Figure 2 and Table 1). Moreover, the estimates have similar uncertainties. K d values and B ij values calculated by resampling are correlated for both methods ( Figure 2). A smaller value of B ij , i.e., a more negative value, leads to a smaller value of K d according to eq 1.
For additional validation, we use the results for K d and B ij obtained at the box volumeṼ to predict the box-size dependence of the fraction bound p b (V) and the subvolume probability p v (V). We use eq 15 and our estimates for K d and B ij obtained at a box volumeṼ to calculate p b (V) (Figure 3). We use eq 27 and our estimates for B ij obtained at a box volumeṼ to calculate p v (V) (Figure 4). The resulting curves reproduce the box volume dependencies of p b (V) and p v (V) observed in the entire range of simulations, covering nearly 3 orders of magnitude in volume.
For strong binders, the fraction bound p b (V) and the subvolume probability p v (V) take on similar values (compare Figures 3 and 4). In these cases, p v (V) is dominated by binding. For small boxes, p b (V) is close to one and consequently so is p v (V). For box sizes large enough such that p b (V) is significantly below one, the contribution of the size of the subvolume v to p v (V) is small. For UDG/Ugi, the strongest binding complex considered here, the fraction bound dominates p v (V) such that the p v (V) curve in Figure 4 looks nearly identical to the corresponding p b (V) curve in Figure 3. However, the differences in these curves are significant as they are not only determined by the size of the subvolume v but also by the nonbinding interactions.
We can extract the contributions B ij (u) , eq 38, of nonbinding interactions to B ij . We can do so even in the case of strong binders for which the K d value is close to B ij (b) = −1/(2N A K d ) according to eq 16 ( Figure 5, top). With the estimates provided by either the insertion/removal method or the subvolume method, we can resolve the small difference B ij (u) = B ij − B ij (b) (Figure 5, center). Focusing on the results from the insertion/removal methods, we find that for lysozyme B ij (u) ≈ 83 ± 4 nm 3 > 0. This value is close to what one would expect for hard spheres of equal volume, i.e., B ij (u) = v exc /2 ≈ 70 nm 3 . For ubiquitin/CUE, the interactions are clearly attractive, but B ij (u) ≈ −9 ± 3 nm 3 nearly vanishes. For UDG/Ugi, B ij (u) ≈ −94 ± 5 nm 3 indicates attractive interactions ( Figure 5 and Table  1).
Note that for Ubi/CUE and UDG/Ugi, the estimates for B 23 (u) = B 23 − B 23 (b) are much smaller than the individual errors of B 23 and B 23 (b) (∼27 000 nm 3 for UDG/Ugi and ∼40 nm 3 for Ubi/CUE; Table 1). Naively, one would think that these large uncertainties preclude reliable estimates for the comparably small difference B 23 (u) in such a situation. However, the estimates for B 23 and B 23 (b) from resampling are highly correlated because of the strong correlation of B 23 and K d (Figure 2). That is, the individual errors of B 23 and B 23 (b) do not determine the errors of their difference.
Next, we show that the naive estimate of K d from concentrations using eq 8 actually suffers from a finite-size effect and that it converges to the estimates obtained with the insertion/removal and subvolume methods for large system sizes ( Figure 6). For comparison only, we evaluate eq 8 for our  Figure 3) and extrapolate the naive estimates for K d until convergence is reached. For typical box sizes used in simulations, K d is underestimated by about 10% for the lysozyme homodimer, the weakest binder considered here, and by 3 orders of magnitude for UDB/Ugi, the strongest binder considered here.
To reach convergence when using eq 8, the box volumes have to be increased by a factor ∼100 for the weakest binder and by a factor ∼100 000 for the strongest binder compared to typical box sizes. Using eq 1, we obtain finite-size effect-free estimates for K d at all box volumes (see Figure 6). In contrast, the estimates obtained using the approximate relation given by eq 21 (eq 13 of de Jong et al. 18 ) show small but systematic deviations determined by B ij (u) (eq 41); (see Figure 6). These systematic deviations decrease with increasing box volume as 1/V ( Figure  7). For the three dimers considered here, these differences are in the range of ±5% for the smallest box sizes used here.

CONCLUSIONS
We have shown how to calculate the dissociation constant K d of two proteins in a box from the fraction of protein dimers and the second osmotic virial coefficient B ij . We derived and validated two methods to calculate B ij : For implicit solvents, we can use standard Monte Carlo or molecular dynamics simulations of two proteins in a box and determine insertion and removal energy distribution functions. From the latter, we determine the two-particle partition function and thus B ij using   The Journal of Physical Chemistry B pubs.acs.org/JPCB Article BAR/WHAM. For implicit and explicit solvents, we can calculate the probability that the two proteins are within a volume at least covering the interaction range of the two proteins. 29 Calculating B ij from the radial distribution function or equally the potential of mean force via an integral is equivalent to this method. For the coarse-grained simulations performed here, both methods provide accurate results with comparable uncertainties. The relationship between K d and B ij given by eq 1 is also well suited for the quantification of protein interactions in molecular dynamics simulations using explicit solvents. Fully atomistic simulations of concentrated protein solutions in explicit solvents have become computationally feasible on the microsecond scale. 15,16 These studies have been facilitated by recent improvements in molecular force fields, which correct, among other things, for an increased stickiness of protein surfaces. 58−61 These parameterization efforts can benefit from comparisons of K d and B ij to the experiment.
Fully atomistic simulations are within reach for the protein pairs considered here. The box volumeṼ used here corresponds to about 300 000 particles in fully atomistic simulations using explicit solvents. The binding and unbinding of weakly binding proteins like lysozyme can be simulated atomistically without bias. 15 For more strongly binding proteins, enhanced sampling techniques have to be applied. 62 Binding and unbinding events of proteins and other molecules can be simulated efficiently without bias also in molecular dynamics simulations using explicit solvents using the MARTINI model, for example. 63−65 The sampling strategy used here for weak binders is different from the sampling strategy commonly used for strong binders. Strong binders usually have specific interfaces, and the dissociation constant is determined by the binding free energy to these specific interfaces. If these interfaces are known, then we only have to calculate the binding free energy for these specific binding poses dominating K d . 17 For weak binders, also /B ij | is close to one (top). In these cases, nonbinding contributions to B ij are relatively small, i.e., |B ij (u) /B ij | ≪ 1 (center). Figure 6. Finite-size correction gives box-size-independent dissociation constants K d (eq 1, red symbols). The naive estimate of K d (eq 8, blue symbols) suffers from finite-size effects and converges to the true value (gray horizontal line) for increasing box size. We illustrate this convergence by evaluating eq 8 for the predictions for p b (V) from the insertion/removal method (black solid line) and the subvolume method (red dashed line). Approximately corrected estimates (eq 21, eq 13 of de Jong et al., 18 green symbols) suffer from finite-size effects and converge to the true value for increasing box volume. Error bars have been obtained by resampling. Figure 7. Relative difference between the approximate estimates K d ′ (eq 21, eq 13 of de Jong et al. 18 ) and the box-size-independent estimates K d (eq 1) for the dissociation constant as shown in Figure 6 (discs) as functions of the inverse box volume 1/V. This difference is proportional to 1/V and to the contribution of unbound states to the second osmotic virial coefficient, B ij (u) (eq 41, lines).
The Journal of Physical Chemistry B pubs.acs.org/JPCB Article nonspecific binding can contribute significantly to K d and thus has to be sampled. B ij also plays an important role in understanding phase separation by which liquid droplets are formed within cells. 66 Specifically, the Flory−Huggins solution theory is used to model liquid−liquid phase separations. 67,68 In this framework, the Flory interaction parameter χ is determined by K d and B ij . 69 B ij also determines the "effective solvation volume" up to a proportionality constant, a quantity commonly used in polymer science. 70 The interactions of proteins in nonbinding configurations can be quantified by B ij (u) , which is fully determined by K d and B ij and which is thus a well-defined thermodynamic quantity. These interactions shape the physicochemical properties of the crowded environments inside cells. For example, nonbinding interactions can lead to demixing and therefore to colocalization of binding partners. This colocalization effectively increases the binding probability.
In principle, the contributions B ij (u) of nonbinding interactions to B ij can be determined experimentally. SAXS experiments provide information about B ij in the forward scattering intensities as well as information about dimerization, and thus K d , encoded in the radius of gyration. Varying protein concentrations in equilibrium sedimentation experiments can provide estimates for K d and B ij . 10 The latter is used to correct for the nonideality of the protein solution. Equation 1 can be viewed as such a correction for nonideality. Especially for weak binders, we expect that K d and B ij can be estimated accurately enough such that the contributions B ij (u) of nonbinding conformations to B ij can be determined. Similar to the calculations performed here, we expect that in sedimentation experiments, the uncertainties in the estimates for B ij (u) will be much smaller than the individual uncertainties in the estimates for K d and B ij .