Theoretical and Data-Driven Approaches for Biomolecular Condensates

Biomolecular condensation processes are increasingly recognized as a fundamental mechanism that living cells use to organize biomolecules in time and space. These processes can lead to the formation of membraneless organelles that enable cells to perform distinct biochemical processes in controlled local environments, thereby supplying them with an additional degree of spatial control relative to that achieved by membrane-bound organelles. This fundamental importance of biomolecular condensation has motivated a quest to discover and understand the molecular mechanisms and determinants that drive and control this process. Within this molecular viewpoint, computational methods can provide a unique angle to studying biomolecular condensation processes by contributing the resolution and scale that are challenging to reach with experimental techniques alone. In this Review, we focus on three types of dry-lab approaches: theoretical methods, physics-driven simulations and data-driven machine learning methods. We review recent progress in using these tools for probing biomolecular condensation across all three fields and outline the key advantages and limitations of each of the approaches. We further discuss some of the key outstanding challenges that we foresee the community addressing next in order to develop a more complete picture of the molecular driving forces behind biomolecular condensation processes and their biological roles in health and disease.


INTRODUCTION
Phase separation is a fundamental phenomenon of many biological molecules, such as proteins and RNA, characterized through the demixing of a homogeneous solution into a polymer-rich and a polymer-depleted phase. Recent studies have highlighted that inside living cells, phase separation processes can lead to formation of biomolecular condensates that provide cells with the means to spatially and temporally organize their contents beyond what is achievable with just membrane-bound compartments. Biomolecular condensates formed through phase separation processes have been found to play functionally important roles in a variety of biological pathways. 1,2 Some of the best-studied examples of functional condensates include nuclear condensates such as nucleoli, which are the site of ribosome biogenesis, 3,4 and cytoplasmic stress granules, which play a role in the cellular stress response. 5,6 In multiple instances, biomolecular condensates have also been found to regulate gene expression 7,8 and to facilitate cellular signaling pathways. 9−12 Condensates are important for their contributions to healthy cellular function, but also because of associations between condensate dysfunction and disease. 13 Disruptions to condensate formation have been linked to a number of diseases, most notably neurodegeneration 14 and multiple forms of cancer. 6,15 As the importance of biomolecular condensates is increasingly recognized and their varied roles in human health and disease are identified, it is crucial to develop approaches that can help us understand the factors governing separation phenomena that drive condensate formation. Insights both at the molecular level and in the context of the cell will be required to cellular condensates' contributions to healthy function and to disease.
In this Review, we discuss three types of dry-lab approaches that have been used to study biomolecular phase separation processes, and how these methods can complement each other to collectively support the goal of better understanding the driving forces and the biological role of biomolecular condensation processes ( Figure 1). We start by discussing the theory behind macromolecular phase separation and theoretical frameworks that have been developed for modeling macromolecular phase separation processes. We divide the latter frameworks into mean-field, molecule, sequence and residue level approaches according to the degree of coarsegraining they use. We proceed by reviewing simulations and molecular dynamics (MD) approaches and discuss how recent advancements in these areas have enabled their use for the examination of biomolecular phase separation processes, including in multicomponent systems. Given the large number of molecules that are required to simulate phase separating systems, we highlight the requirement for an appropriate balance between the detail that is used in simulations and its computational feasibility. Finally, we review how data-driven modeling and machine learning (ML) approaches have been used to predict protein properties and how these ideas have been extended to study protein phase behavior. In addition to reviewing the progress that has been made, we discuss the limitations of current methods and highlight the value of moving toward context specific predictions.
We note that while the different types of computational approaches each contribute unique insights, they also provide multiple interfaces for supporting and complementing each other. For instance, both simulations and machine learning models create computationally derived data that can be used to challenge and develop theories. Vice versa, the knowledge gained from theoretical approaches can enable the integration of appropriate inductive bias with machine learning models or with simulations. Focusing on the interface between MD and ML, ML can be used to increase the accuracy of MD force fields 17 or to accelerate simulations and provide new routes to simulating systems over long time scales. 18 In parallel, descriptors extracted from MD simulations can serve as inputs into ML models. Not only does this strategy lead to physically meaningful and interpretable inputs, it provides the possibility to capture how these properties vary in time. 19 Importantly, alongside our review of the different types of computational approaches, we emphasize that the ability to perform well-characterized measurements on phase separation and phase separating systems is a key prerequisite for the validation of any new dry-lab approach. We are therefore looking forward to even stronger integration of computational models and experimental studies, including in an active learning format where computational models are used to guide experimental design, and attention is paid to studying systems in regimes where computational models appear most uncertain. 20−22

THEORETICAL APPROACHES FOR MODELING PHASE SEPARATION
The physical concept of phases is formalized by assigning an order parameter ϕ(x) to each spatial coordinate x. The definition of the order parameter varies greatly from system to system. For example, in a liquid−gas transition the order parameter can simply be the density of molecules or compressibility, while in random packing of spheres an Figure 1. Theory, physics-driven simulations and data-driven machine learning approaches are all playing an important role in advancing our understanding of biomolecular condensation processes. This Review focuses on the recent advancements across all these three approaches outlining the advantages and limitations of each of the approaches. Simulation panel rendered with Avogadro. 16 ensemble of carefully constructed order parameters are needed. 23 Despite their differences, order parameters serve a common aim of characterizing a physical state with a quantifiable metric, so that a mathematical framework can be developed to study transitions between states. Once a ϕ(x) has been defined in the system, typically an associated free energy density f(ϕ) is also constructed. A physical system evolves toward configurations that minimize the total free energy F[ϕ] ≡∫ dxf [ϕ(x)] and two important classes of ϕ exist: nonconserved and conserved ϕ. In the former case, ϕ can be changed locally to minimize the free energy without global constraint; classical examples include ferromagnetism 24,25 and superconductivity, 26 with order parameters being magnetism, and electron wave function, respectively. The most stable configuration is then simply determined by df(ϕ)/dϕ = 0, and if multiple minima exist the one corresponding to the lowest free energy density will be the most stable state. In thermal equilibrium the system thus assumes a single value of ϕ throughout. By contrast, in biomolecular phase separation, the order parameter is the protein concentration and the total amount of protein has to be conserved in the whole system, leading to a globally constrained free energy minimization problem. Denoting the average protein concentration as φ ̅ and total volume V, we then require ϕ(x) satisfies Vϕ̅ = ∫ dxϕ(x) = const. while minimizing F [ϕ]. As a result, the stability of the solution changes as φ ̅ is changed and three regions can be identified: a mixed region with no phase separation, a metastable binodal region and an unstable spinodal region. The stability of a system can be probed by first partitioning it into separate compartments with different solute concen-trations while preserving mass balance, and then calculating the total free energy change before and after the partitioning. If the free energy decreases for an arbitrarily small density difference then phase separation occurs spontaneously, since any density fluctuation could induce phase separation. This is the spinodal region. On the other hand, if the total free energy decreases only for large density fluctuations, the system is metastable and said to be in a binodal region. Boundaries of the binodal region correspond to globally stable configurations and mathematically solving for them gives rise to concepts of phase equilibrium, 27 chemical potential μ and osmotic pressure Π.
The μ and Π are both derived from the free energy density f(ϕ) so most equilibrium modeling attempts in protein phase separation focus on computing f(ϕ) and finding phase boundaries by matching μ and Π. 28 In thermal equilibrium, the system is characterized by two regions having two different ϕ, and in each region ϕ is uniform except close to the interface. It is also possible to characterize proteins by proposing sequence-dependent parameters and investigating correlations between those parameters and phase separation propensity. Many approaches to modeling biomolecular phase separation exist. Here we discuss these approaches according to their degree of coarse-graining ( Figure 2a): at the Mean-field level, interactions between proteins, nonprotein solutes and the solvent are phenomenological, examples include the basic Flory−Huggins theory, random matrix theory and active phase separation; at the molecular level, the composition of the polymer is taken into account, with the sticker-spacer model and the Voorn−Overbeek model focusing on computing f(ϕ) and the charge asymmetry parameter σ describing generic At each temperature T below the critical point (hollow circle), three regions can be identified according to the total solute concentration ϕ: mixed, binodal and spinodal. Insets show characteristic appearances of the system in each region. c. Fitting of experimentally measured binodal concentrations of different hnRNPA1-LCD variants using the self-consistent solution. Adapted from ref 29. Copyright 2022 American Chemical Society. d. Sequence-dependent parameters that aim to capture composition and residue arrangement information in a single number. Illustrated are 2 groups of 3 sample sequences of 12 residues in length, with sequences within each group having the same composition and σ but different charge arrangements and thus κ and SCD. Blob size is set to 3 for κ calculation. e. Correlation between the radius of gyration ⟨R g ⟩ and κ. Adapted with permission from ref 30. Copyright 2013 National Academy of Sciences. polymeric behaviors; at the Sequence level, the order in which monomers are arranged in the polymer is further taken into consideration, with relevant parameters including the charge patterning parameter κ and the Sequence Charge Decoration (SCD) parameter; finally at the Residue, level calculations of f(ϕ) are performed from first-principles using either a particleto-field transformation as is done in Field Theoretic Simulation (FTS) and Random Phase Approximation (RPA) or a testpolymer thought experiment, leading to applications of the transfer matrix theory. A case can be made for all of the aforementioned approaches, while it remained an open question as to what degree of coarse-graining is most relevant for physiological systems and what should be expected of theories in the first place.

Mean-Field Theories Predict Rich Phenomenology
The Flory−Huggins formalism 31−33 captures the basic phenomenology of density transitions and is the earliest attempt at modeling phase separation of polymer solutions. In this picture, phase separation is driven by a competition between interaction free energy and entropic free energy. In the simple binary case with ϕ denoting the polymer volume fraction, the total free energy density is given by where N is the polymer chain length and χ the effective polymer−solvent interaction parameter. Surface tension can also be added as a |∇ϕ| 2 term to the above formulation that disfavors density fluctuations. 34,35 Although simplistic, this free energy form predicts the core phenomenology, namely spinodal decomposition and binodal phase separation ( Figure  2b). The spinodal concentrations can be worked out analytically 33 while a system of transcendental equations needs to be solved for binodal concentrations. As a result, analytical solution for binodal equations does not exist and computation of the binodal concentrations have been performed largely through numerical methods in the past 28,36 and only recently a self-consistent analytical solution has been derived. ? In the simple case of unit solute length N = 1, the dilute phase binodal concentration φ dil bin can be obtained as 1 exp tanh dil bin 3 6 8 + and for large χ the scaling law ϕ dil bin ≈ e −χ can be obtained. The expression for general N is more complicated, but a similar scaling law can be derived, giving ϕ dil bin ≈ e −χN . The exponential dependence of the concentration on interaction energy and chain length explains why phase separation can be observed at concentrations spanning orders of magnitude differences. In contrast, the spinodal concentration ϕ dil spi has a polynomial scaling law ϕ dil spi ≈ 1/2χN; these boundaries are thus qualitatively different. Furthermore, the self-consistent iteration approach provides an efficient way of fitting experimentally measured boundaries to extract interaction energies ( Figure 2c). An interesting result from the fitting above is that the effective polymer size is approximately 1.16 times the number of residues, reflecting the fact that one residue is slightly larger than the lattice size that is in turn determined by the correlation length of water molecules. It is straightforward to extend the theory to include more solute species 34,37 while exploration of the high-dimensional phase space still has to be done numerically.
Difficulty in determining accurately the binodal concentrations has also resulted in efforts exploring the use of the spinodal as a proxy to study multicomponent systems. 38,39 Diagonalizing the Hessian matrix of the free energy allows qualitative interpretation of experimentally measured tielines, 38 and for a large number of solutes this path led to the application of random matrix theory in phase separation. 40,41 This is especially pertinent considering the large number of protein species in living cells. 42 The central idea is to treat the pairwise interaction parameter χ as a 2-dimensional matrix with each entry drawn randomly from a fixed underlying distribution, with the rank corresponding to the total number of solute species s. In the limit of an infinite number of solute species s → ∞ the statistical properties of the system become predictable. This led to 2 distinct modes of phase separation: a "condensation" mode with all solutes colocalizing in a dense phase and ∼s/2 "demixing" modes that preferentially include and exclude some of the solutes. 41,43 The latter scaling is a direct result from Wigner's semicircle law 40 where ∼s/2 eigenvalues of the random matrix are smaller than 0, and the former arises from a distribution of interaction parameters with a nonzero mean. In comparison, the maximum number of phases that can coexist is ∼s, given by the Gibbs phase rule, 44 and numerical investigation has shown that both scalings can be reproduced with a suitable choice of interaction parameter distribution. 39,45,46 It still remains to be shown, however, that predictions from random matrix theory are physically and physiologically relevant.
The models presented so far only consider phase-separating systems at thermal equilibrium, while an important aspect of condensate formation in cells is the presence of energyconsuming cellular activities including post-translational modification (PTM) and cytoplasmic streaming. 47 Two classes of activities can be identified: molecular modifications and intrinsic motility. Molecular modifications can include PTM, chaperone actions or artificial activation of protein constructs. 47−50 In such systems, one can model different protein species as distinct types of solutes and use the Cahn−Hilliard framework of molecular fluxes 34,35,51 to study the evolution of dilute and dense phases in time and space, 52−54 and the activity is incorporated via introducing dynamic processes between solutes. Very generally, denoting different solute concentrations as ϕ i (x) with i an index for the solutes, and a generic free energy density f(ϕ i ), the fluxes J i (x) can be computed as with L ij the matrix of Onsager kinetic coefficients. 34,55,56 The local concentrations ϕ i (x) then evolves On the other hand, the proteins themselves can acquire motility through ATP consumption using, for example, a motor. 57 In this scenario, the activity becomes an intrinsic property of the solute and has to be modeled differently by either modifying the basic flux equation 58,59 in a bottom-up manner or introducing a nonintegrable term to the free energy functional 60 in a topdown manner. An interesting insight from the former approach is that given a density-dependent self-propulsion speed, phase separation arises from a jamming-like collective behavior even without any other interaction. 58 The appropriate framework for modeling active phase separation thus depends largely on the underlying physics.

Molecular Details Correct for Composition Effects
The phenomenological nature of the χ parameter in the Flory− Huggins model motivates more detailed descriptions of its origin, especially since condensate compositions can be measured experimentally. As such, efforts to include explicitly electrostatic interaction and hydrophobic sticker-like interaction have been made over the past decades. The Voorn− Overbeek theory 61 models electrostatic interaction arising from charged residues in the polymer by treating the solution as an ionic gas, from which results of the Debye−Huckel model 62 can be used to give the electrostatic interaction energy f VO ∝ Q 3/2 with Q the total charge density. A comparison with experimental data gave mixed results, 63 and discrepancies arise from intrachain correlations and charge density fluctuations that are not considered in the Voorn−Overbeek theory. Subsequent work to take polymeric properties into account showed that the charge-asymmetry parameter q q q q where q ± is the fraction of positive/negative charges on a single chain, separates different polymeric configuration regimes 64 and serves as a way to characterize a polymer based on its charge composition ( Figure 2d). The σ represents the amount of uncompensated charges on a polymer chain and acts as a repulsive term. The full electrostatic interaction also includes an attractive term arising from charge fluctuations and the interplay among these two terms and the entropic free energy leads to different spatial conformations of the polymer. The relationship between σ and phase separation propensity was not studied in detail but the σ parameter still represents a novel way to characterize polymers. Apart from electrostatic interactions, hydrophobic interaction also plays a large part in protein phase separation. 65−67 The sticker-spacer model 68,69 treats hydrophobic residues as generic stickers capable of forming strong, pairwise bonds, and the interaction energy arising from sticker bond formation can be computed by counting the total number of ways the stickers can be paired. When the amount of bonds exceeds a certain threshold a sol−gel transition is predicted 68,70 where a gel is defined as a cluster of infinite size. When coupled with the normal Flory−Huggins free energy, which poses as weak, nonspecific interactions, the phenomenology of liquid−liquid phase separation and gelation can be rationalized under a common framework. These two energy scales�the strong sticker−sticker interaction and the weak aspecific interaction� appear to support recent findings that prenucleation clusters are observed at concentrations much higher than that predicted by the classical nucleation theory, 71 and an interplay between phase separation and percolation was suggested.

Sequence Differences Lead to Distinct Behaviors
Interactions in the theoretical approaches presented above are determined by polymer composition but not the exact sequence in which residues are arranged in the chain. The sequence information, however, is an important aspect of protein interaction 66,72−75 and are increasingly addressed. To explore differences between sequences that have the same overall composition but different arrangements, series of such charged sequences have been studied computationally 30,76,77 and a charge patterning parameter κ was shown to correlate strongly with the polymer conformation state 30 (Figure 2d, e). The κ builds onto the σ parameter by first subdividing a polymer into blobs and computing σ for each blob, and then calculating the variance of σ before normalizing it between 0 and 1. By construction, κ is smallest when the charges in the sequence are homogeneously mixed and largest when segregated, thus encoding the sequence information into a single parameter. The κ is, however, phenomenological in nature, and subsequently the Sequence Charge Decoration (SCD) was derived mathematically from polymer path integral, 77 and it was further shown that effective interaction energy and binding affinity can both be computed from the S C D . 7 8 T with N the total number of residues and q i the charge at the i-th residue. The SCD is shown to correlate strongly with κ 77,79 and serves a similar role in characterizing the degree of charge mixing in a polymer chain (Figure 2d).

Residue-Level Calculations Mimic Real Protein Systems
At the highest level of molecular detail are theories that aim to calculate f(ϕ) from first-principles analytically. Transfer matrix theory, most commonly used in optics and lattice field theories, has been shown to be also applicable in studying polymer solutions 80 by drawing the parallel between time propagation and traversing along a polymer chain. The analysis starts by placing a "test polymer" in the solution and examining the monomer segment at the end of the chain. Different possibilities of the end segment environment are considered, where the segment can bind to a salt ion, a segment from a solute polymer or not bound to anything for example, and the corresponding Boltzmann factor assigned and organized in a column vector of size h, with h the total number of possible states. The theory then moves onto the next segment in the test polymer and the Boltzmann factors arising from possible configurations for the second segment are calculated and organized in an h by h matrix, because the energy for the second segment can be dependent on the first. This process is repeated in sequence until the end of the test chain and the partition function for the test polymer can be computed, from which the free energy is calculated analytically and phase equilibrium conditions can be solved numerically to find binodal concentrations. A drawback of this approach is that the theory is tractable for relatively simple polymers but h increases exponentially when polymers have more complicated sequences, and the resulting analytical form cannot be interpreted easily.
The alternative route, and one that represents the closest semblance to the physical system, is promotion of the particle picture to a field picture using the Hubbard−Stratonovich transformation 81,82 and evaluating the partition function either via the saddle point approximation, commonly termed Random Phase Approximation (RPA) for historical reasons, 83,84 or via complex Langevin sampling, termed Field-Theoretic Simulation (FTS). 72,85,86 The exact energy of a configuration is first written down as a sum of all pairwise interactions and a density function is defined simply by assigning a Dirac delta at each particle location, allowing the interaction energy to be written in an integral form. Gaussian smearing of the Dirac delta is then performed before applying the Hubbard−Stratonovich transformation, resulting in a field representation of the partition function. This way, not only interparticle distances but also correlation effects due to polymer connection are taken into account. The resulting field integral, however, is hard to evaluate, and RPA is applied to approximate the integrand to a Gaussian form so that the integral can be carried out. 79,85,87 RPA performs well in the dense phase, while large fluctuations in the dilute phase resulted in the breakdown of the Gaussian approximation, an issue that could only be addressed by numerical evaluation of the field integral via FTS. 72,85 Although FTS represents a simulation method at first glance, the particle-to-field transformation makes the polymer density a simulation parameter and changing it does not affect the time complexity of each step, and simulations at concentrations spanning orders of magnitude can be performed on similar time scales. The FTS integrals investigated only consider electrostatic interactions and pairwise excluded volume interaction, with hydrophobic sticker interactions still missing in the picture. Extension of current models can potentially shed more insight into realistic protein solutions.

SIMULATIONS
Molecular simulations can add specificity to models of protein phase separation by describing in detail the structure and dynamics of biomolecules that emerge from physically motivated interaction potentials between their components. As biomolecular condensation has become a topic of interest, molecular simulation techniques and energy functions have been developed to study phase separation, especially among intrinsically disordered proteins (IDPs). By explicitly considering the individual components of a system and mathematically representing their interactions, simulations can resolve details of interaction dynamics among molecules, including residue-specific effects like those described in the stickers and spacers framework, 66,68 and can describe systems of more complex composition, for example including multiple components and interactions among and between both folded and disordered domains, which would be much harder to do via analytical theory or machine learning approaches. Simulations can also provide insights into dynamical effects that would be harder to obtain by other methods.

Levels of Resolution
Much like the theoretical approaches described above, simulations can represent biomolecular systems with a range of resolutions, and the details about a system that are accessible from a simulation depend on the resolution with which the system is represented (Figure 3). Such biological systems comprise many macromolecules, requiring significant computational resources to simulate, and this is especially true when studying the large numbers of biomolecules (on the order of hundreds at a minimum) required to form welldefined dense and dilute phases. Higher-resolution representations of the same system such as all-atom simulations have even more interacting bodies and are thus more computationally intensive to simulate. As a result, when considering simulation approaches, it is necessary to balance the detail used to represent a system with the feasibility of its simulation.

Detailed Models Limit Simulation Length.
The most accurate description of molecular systems is quantum mechanical, allowing for changes in chemical bonding. However, quantum chemical calculations are prohibitively expensive for large biological systems, usually being used only in cases where it is necessary to account for possible bond breakage or formation, such as in enzyme catalysis via a multiscale approach. 88 So far, such effects have not been included in LLPS simulations, but they could potentially play a role in aging mechanisms; this might be most practically accomplished using empirically parametrized reactive potentials 89 or force fields derived by machine learning. 90 Without considering electronic degrees of freedom, the most detailed molecular simulations are so-called "all-atom" simulations in which all atoms in the system are represented as sites interacting via a pairwise potential, and held together by harmonic bonds within molecules; usually the surrounding solvent molecules and ions are also explicitly represented. By solving Newton's equations of motion to determine the evolution of the system given the starting positions and velocities of all components, all-atom models provide detailed information about structural ensembles, dynamics, and intramolecular interactions within biomolecules. All-atom force fields have been refined over more than 40 years, and while they have historically been tested on and applied to study globular folded proteins, 91 they are increasingly being adapted for IDPs. 92−95 Using the all-atom CHARMM22* force field, 96 Rauscher and Pomes simulated a collection of twenty-seven elastin proteins to confirm the liquid-like properties of its assemblies. 97 However, because of the larger sizes of most phase separating systems when considered at atomic resolution, the duration of all-atom simulations of biomolecular condensates is limited on widely available computational resources. As a result, it is challenging to sample condensates for the time scales required to escape local caging effects that may lead to the appearance of subdiffusive dynamics, which is necessary to obtain reliable diffusion coefficients and other dynamical properties. Additionally, it is generally not feasible to determine the phase separation equilibrium using an allatom representation from unbiased simulations because of the extremely slow escape of molecules from the protein-rich phase relative to accessible simulation time scales. Therefore, allatom simulations of condensates thus far have started with a preformed condensed phase. Despite these limitations, all-atom simulations still have a place in studies of biomolecular condensates. They have been used in combination with information from experiments and with coarser-grained simulations to provide detailed insights about specific interactions in systems where phase separation has been shown to occur. For instance, Zheng and colleagues performed microsecond-long all-atom simulations starting from various points in coarse-grained simulations of two different phase separating IDPs, which allowed them to track the diffusion of water and ions within and between the dense and dilute phases, and to identify the interactions contributing to IDP self-association in the dense phase. 98 Atomistic simulations have also been used to generate starting conformations for coarser-grained simulations, 99 to characterize the conformational ensembles of individual phase separating proteins and how they change with amino acid mutations, 100 to simulate phase behavior of collections of single amino acids of different types, 101 and to study condensate aging by zooming in on smaller, more specific structure forming sequence regions. 102,103 All-atom simulations are also useful and reasonably efficient for calculating properties of isolated intrinsically disordered proteins such as approximate polymer scaling exponents, which are often found to be linked to their phase behavior; the second virial coefficient, which can be obtained from simulations of two proteins, is similarly useful although somewhat more challenging to compute accurately. 105 Thus, predictions of single-chain properties from all-atom simulations under different conditions such as temperature or cosolvent concentration could, in principle, be used to predict the critical conditions for phase separation.
Properties of single chains or individual residues from allatom simulations have also been used to parametrize coarsergrained models for a range of systems, including disordered proteins undergoing phase separation. Joseph et al. used allatom representations of amino acids to calculate adjusted potentials of mean force between pairs of amino acids in the extreme salt conditions considered in their study, and to generate starting conformations for coarse-grained simulations of whole proteins. 99,106 This approach is commonly used for coarse-graining at various scales, though it does risk importing biases in the finer-grained model into the coarser-grained model. For example, if the all-atom model favors excessive secondary structure formation or produces more collapsed conformational ensembles than the protein inhabits, then a resulting coarse-grained model could also oversample secondary structure contacts and collapsed conformations. 107 Additionally, to reduce the number of calculations required without sacrificing atomic resolution in the biomolecules of interest, the solvent molecules in a biomolecular system can be replaced by a potential that implicitly accounts for protein− solvent interactions. 108 One way of accomplishing this is by representing the protein's solvation free energy with a combination of a mean-field interaction term and an electrostatic screening term as proposed by Lazaridis and Karplus 109 and further developed by Vitalis and Pappu. 108 The electrostatic screening can be tuned to reflect variations in the ionic strength of the solvent either by using explicit ions 108 or via implicit screening. The latter approach is taken by Generalized Born models of implicit solvation; however, these have generally not been optimized for disordered proteins. 110 3.1.2. Coarse-Graining Makes Large Systems Accessible. Zooming out from the atomic level, coarse-grained biomolecular simulations represent multiple atoms via a single bead. The Martini force field is a widely used coarse-groaning approach that represents an average of four atoms in a single interaction site, each of which falls into one of four broad categories based on the chemical properties of its component atoms. The interactions between chemically bonded beads are derived from all-atom simulations, and interactions between nonbonded beads include Lennard-Jones and Coulombic interactions, which were optimized to capture experimental free energies of partitioning between the polar and apolar phases of chemical compounds. Reducing the number of interacting sites allows the model to retain some chemical specificity while running at least an order of magnitude faster than all-atom simulations on systems of the same size. Additionally, validation against a collection of properties of lipid bilayers showed that the Martini model could accurately determine the stress profile across a bilayer, represent pore formation, and could be used to calculate free energies of lipid desorption and flipping across the bilayer in agreement with all-atom simulations. 111 However, when applied to model proteins and other biomolecules, it became clear that the original Martini models' intermolecular interactions were too strong. 113 As a result, Martini simulations could not recapitulate observed behavior or produce values that agreed with experimental findings. For instance, osmotic second virial coefficients from Martini simulations could be off by a factor of 100, 114,115 models of membrane-spanning proteins showed much more association and aggregation than structural and FRET data suggest, 116 and simulations of carbohydrates also displayed excess aggregation that disagreed with light scattering measurements. 115 These strong interactions also made it difficult to simulate protein phase separation using Martini, but recent adjustments to the model's protein−protein interaction parameters to capture excess transfer free energies between the dense and dilute phases of protein condensates have made it possible to simulate phase separation of the FUS low-complexity domain. 117 Similarly, comparison of the Martini 3 force field with global dimensions of disordered and multidomain proteins from small-angle X-ray scattering (SAXS) found the simulations were underestimating these proteins' sizes, prompting reparameterization of the model by increasing the strength of protein−water interactions. The reparameterized model captures the larger dimensions of disordered and multidomain proteins, and agrees with global dimension and self-association measurements from SAXS and paramagnetic relaxation enhancement experiments. 118 Both studies show that their adjustments to the force field make the Martini model more readily applicable to studies of phase separation. 117,118 Additionally, systems including both disordered regions or flexible linkers and folded domains, as are common in multivalent proteins that phase separate to form a condensed phase, can also be handled in Martini by including elastic network or Go ̅ interactions to define the folded domains. Remaining challenges include the development of an explicitly temperature dependent potential that would be required to reproduce temperature-dependent effects (e.g., the hydrophobic effect) in such a coarse-grained model.
The most common class of models that have been employed for the study of disordered protein phase separation use one bead to represent each residue in a disordered protein chain, Chemical Reviews pubs.acs.org/CR Review with electrostatic contributions and short-range interaction potentials used to describe the interactions between specific residue pairs. The first of these models was developed by Dignon and colleagues by combining a residue level coarsegrained protein model parametrized based on IDP radius of gyration data with a slab coexistence simulation approach that made it possible to converge system containing large numbers of molecules (Figure 4). 112 This approach made it possible to determine critical temperatures and dense and dilute phase protein concentrations, which showed agreement with theoretical descriptions of polymer phase transitions from Flory−Huggins theory. 31−33 Since the development of this coarse-graining framework, additional experimental data and atomistic simulations have been used to develop alternative residue-resolution coarse-grained models that aim to achieve quantitative agreement with experimental phase diagrams. 106,112,119−123 The improved coarse-grained models can now quantitatively determine saturation concentrations 106,119,120 and account for the effects of a range of temperatures and salt concentrations. 120 These models have been used in a variety of investigations of phase separated systems, including to identify the interactions driving and sustaining phase separation of the FUS low-complexity domain, 124 to understand salt-dependence in interactions driving phase separation, 99 and to look at the effect of condensate aging on material properties and dense phase interactions. 102,103 Coarse-grained models of nucleic acids have also made it possible to use simulations to look at protein− DNA interactions, including interactions between highly charged proteins and DNA in condensates, 125 the interplay of H1, HP1 and DNA, 126 to investigate the interactions driving phase separation of chromatin-associated proteins and its role in chromatin organization, and as part of a multiscale model of chromatin ( Figure 5). 127 Coarse-grained models also can be made more computationally efficient by confining molecular movements to a lattice as implemented in the LASSI model. 128 However, this comes at the expense of information about chain dynamics that and  transport properties such as diffusion and viscosity that can be gleaned from traditional molecular dynamics simulations. 102 Despite their widespread use for direct coexistence simulations of phase separating IDPs, there is no single coarse-grained force field that represents well the interactions of both disordered and folded proteins in the same simulation. As many multivalent proteins with disordered regions or flexible linkers between folded domains are known to phase separate to form a condensed phase, 129 representing heterogeneous systems involving both structured and disordered domains in a transferable, sequence-based model is an important challenge for accurate simulations of biomolecular condensates.
Beyond the sequence-based models described above, molecular simulations can also be used on a phenomenological level to test hypotheses about the principles governing a system's behavior. One step simpler than the one bead per residue models, beads-on-a-string models lack information about residue chemical specificity and are instead composed of a small number of bead types with distinct interaction parameters (similar to early HP models for protein folding). These models have been used to investigate general properties of disordered protein phase separation, including the contributions of hydrophobic residues and their sequential arrangement to protein phase behavior, the influence of terminal residues on critical concentrations, and the interplay of traditional liquid−liquid phase separation with re-entrant phase behavior and aggregation. 130 A similar model also showed that the sequence change decoration metric is helpful in distinguishing between proteins that will undergo phase separation versus aggregation, and that it is correlated well with critical densities for phase separation. 131 This type of approach has also been applied to study aging, showing that the relative rates of condensate formation and the formation of strong interprotein interactions within a condensate affect the condensate's transition to a more solid state and the shape of the resulting solid-like droplet. 103 Moving away from modeling IDPs as flexible chains, patchy colloid models represent molecules as spheres that interact through specified attractive regions arranged on their surface. These models can be used to investigate the effect of valence on collective behaviors in solutions of molecules, including the role of valence in various aspects of phase separation of multicomponent systems, such as the role of network connectivity among scaffold and client proteins in the dense phase for condensate stability, 104 and how differences in valency and binding affinity can promote layered organization of multicomponent condensates. 132 Patchy particles have also been used to understand the mechanisms underlying size conservation in condensates with surfactant clients 133 and the coexistence of multiple metastable droplets that do not immediately coalesce. 134 Models with multiple distinct classes of particles with different valencies and interaction strengths have also been used to study how macromolecules modulate the phase separation of multivalent proteins by altering their critical concentrations 135,136 and condensed phase material properties. 137 Adding RNA to the system in the form of a long chain with multiple protein binding sites has also made it possible to interrogate the effect of relative RNA−protein and protein−protein interactions on the nucleation of phase separation, the stability of formed condensates and the internal distributions of molecules within droplets. 138 Because of the fixed spherical shape of the molecules in these models, they are more appropriate for treating phase separation of multivalent folded proteins. It is naturally harder for such models to capture the role that flexible disordered regions of proteins play in promoting inter-and intramolecular interactions, although interactions between disordered proteins might be mapped to an equivalent "soft sphere" model. 139 Building hypothesized constraints into a model makes it possible to test whether those constraints are sufficient to reproduce key experimental observations. These types of ground-up models are helpful for identifying the driving forces of a variety of processes, and have been used to study a range of processes from protein-driven membrane deformation and cell division 140,141 to collagen self-assembly 142 and the phase separation in many-component systems with a range of intermolecular interactions. 45 While useful for focused investigations, these models are system-specific by nature, so they cannot be as readily applied to other types of systems.

Future Directions for Simulation Development
Molecular simulations' explicit representations of biomolecules allow them to build on theory by accounting for hydrophobic and electrostatic interactions simultaneously and by incorporating residue specificity into inter-and intramolecular interactions. As reviewed above, there are a range of simulation approaches that have been applied to study biomolecular condensates. These simulations have been used both for explaining and adding detail to experimental findings about phase separation, and to gain new insights about condensate formation that can drive experimental investigations.
As the field advances, for simulations to continue to serve as useful tools to study biomolecular condensates, they need to expand to account for the wide variety of components of cellular condensates and the cellular environment. The biomolecular condensates found in cells, and even those studied by experimentalists in vitro, are often complex assemblies of proteins and nucleic acids, sometimes with post-translationally modified residues, and surrounded by solutions that can include ions, carbohydrates, molecular crowding agents like polyethylene glycol (PEG) and other biomolecules. To accurately represent these complex systems, compatible models for other molecules (nucleic acids, PEG, salt, sugars) with well-parametrized intermolecular interactions are needed. Coarse-grained models of DNA have been developed and used with coarse-grained protein models, 125 and the Martini model was extended to model carbohydrates carbohydrates, 143 but further development for these and other molecules is needed. To complicate matters further, cells are not at equilibrium like most simulated systems�although characterizing the equilibrium behavior is still a useful starting point for systems close to equilibrium. While it is feasible to simulate multiple distinct states of a system that is out of equilibrium, such as a protein before and after a posttranslational modification, simulating active processes is much more challenging. An extreme example of an out of equilibrium system is the nucleolus, a condensate that is formed during ribosome synthesis, which might be characterized as a steady state in which production of ribosomes is balanced by creation of new ribosome components. In that case, a combination of equilibrium simulations with a kinetic or Markov state model would allow the complete process to be described. If developed, these additional components and capabilities would broaden applicability of molecular simulation approaches to more diverse biomolecular condensates and condensate-associated processes. In addition to allowing for simulation of more biologically relevant systems, simulations that can account for changes in ionic strength, pH and its effects on molecular ionization state, and even temperature can be used to probe the driving forces of phase behavior. 99 Finally, an ongoing challenge for computational models of proteins, which also has implications for models of phase separation, is to parametrize transferable force fields that can accurately represent both structured and disordered regions in a single molecule without arbitrary scaling of interactions between disordered and structured regions. A model that can represent proteins with both folded domains and disordered linkers or tails would make it possible to further dissect the roles of structured and flexible domains in phase separation. 144 Further, these models may also be used to consider a range of interaction specificities in a single system and dissect the role of specific, high-affinity interactions and lower-affinity associations to phase separation. Making coarse-grained models more representative of and compatible with details of the internal dynamics of condensates will make them more effective for studying processes that occur within the dense phase, such as the formation of internal structures during aging.

MACHINE LEARNING APPROACHES FOR PREDICTING PROTEIN PHASE BEHAVIOR
Machine learning and artificial intelligence have opened up new possibilities in many areas of life sciences and molecular biology. 145 In this Section, we discuss the advancements that such data-driven modeling approaches have enabled in the context of analyzing proteins and their phase behavior propensity. We also review the limitations of currently used approaches as they are applied to the study of protein phase separation processes and highlight outstanding questions and challenges in the field, specifically around what concerns movement toward higher resolution and context-specific predictions.

Machine Learning Models Can Capture a Diverse Set of Properties
For a number of decades, the key quest of molecular and computational biology has been to understand how the sequence of a protein defines its structure and, by extension, its function. However, it was only recently that experimental techniques capable of acquiring protein sequence and function data have generated sufficient data to enable the effective introduction of data science approaches, such as machine learning, to address the key questions in the field of molecular biology and protein sciences. Many of the top performing approaches that have been developed over the recent years for predicting protein structure as well as other properties have relied on combining machine learning with some form of evolutionary information, typically expressed as multiple sequence alignments or position-specific scoring matrices that describe the sequence variations of the protein of interest across species. Early successes of this approach occurred can be found in the area of protein secondary structure predic- The number of proteins from the human proteome that have been experimentally determined to undergo phase separation is a small fraction of the human proteome that has been found to localize into MLOs. (d) Based on existing data sets, a variety of machine learning approaches have been developed for predicting phase separation from the protein sequence. These can be divided into those that rely on individual knowledge-driven hand-crafted features (top), on nonexplicit protein featurisation from representation learning approaches (bottom) as highlighted in Figure 6, or on their ensemble (middle).

Chemical Reviews
pubs.acs.org/CR Review tion, 146,147 and they were later followed by examples from a variety of other prediction tasks, such as predicting the subcellular localization of proteins, 148 modeling protein binding pockets 149 and designing proteins with desired function. 150 Commonly recognized as the most notable milestone on the interface of computational sciences and molecular biology in the recent past was the application of deep learning with evolutionary information to predict protein structure from sequence, with AlphaFold 151 demonstrating, as part of the Critical Assessment of Protein Structure (CASP) challenge, 152 predictive accuracy comparable with the state-ofthe-art experimental techniques. 153 This achievement was soon followed by other approaches, such as RosettaFold, 154 OpenFold, 155 OmegaFold 156 and ESM, 157 often approximating or exceeding the performance of AlphaFold on the task of structure prediction. While powerful for prediction tasks with abundant labeled data, the use of models that rely on evolutionary information is not without its drawbacks. The drawbacks do not only pertain to the relatively expensive computational requirements for training an end-to-end model, but also to their limited applicability to proteins for which little evolutionary information is available. Therefore, an alternative set of approaches inspired by the rapid developments in the field of natural language processing (NLP) have opened a path toward learning protein properties without relying on evolutionary information as an explicit input. While the methods that rely on the development of protein-specific language models (pLMs) employ a diversity of network architectures and pretraining tasks, one common feature that they share is that they are trained in a self-supervised manner on large and diverse data sets of known protein sequences. During the training procedure, these models develop (or "learn") highdimensional representations of proteins, which can be used in a variety of ways (Figure 6a). For instance, their low-dimensional representations may allow direct extraction of biophysical information (Figure 6b, left) or they can be fed as input into another model trained on a specific prediction task with more labeled data (Figure 6a, bottom right).
The first NLP-inspired approaches that applied pretrained representations to protein property prediction were based on the principle of distributional semantics. 160 The successes of this idea led to an explosion of interest in the field and was soon followed by pretrained models that involved more complex architectures, such as the mLSTM-based UniRep 161 model and the ELMo-based SeqVec model. 162 These more advanced techniques better captured the long-range information in protein sequences and allowed for learning protein representations that were more context specific. More recently, transformer-based architectures have been developed and they have been shown to outperform their predecessors on a number of tasks. 159,163−165 It is worth emphasizing that the key feature of pLMs is that they output a learned representation of a protein sequence as a low-dimensional numerical vector. These representations have been shown to implicitly capture features relevant to the structural, spatial and functional organization of the proteome (Figure 6b), and are even being proposed for annotation of the full protein universe. 166 This is in stark contrast to manually encoding protein properties as an explicit feature vector. Both are valid approaches to represent a protein sequence in a form suitable for machine learning and can be used as input to a downstream model trained on labeled data for a specific predictive task (Figure 6a, bottom right).

Curated Databases of Protein Phase Separation
With investigations of biomolecular condensation processes becoming increasingly common across different laboratories, a variety of databases have been created to collate information on the phase behavior of different proteins. On the one hand are databases that specifically focus on gathering data from in vitro experiments. Such experiments present the possibility to understand in detail the molecular-level drivers of phase separation processes. The construction of these databases often relies on PubMed based keyword searches and while their broad focus overlaps, the details the databases provide differ. Here, we reviewed the data deposited in four of such databases (Figure 7a). First, the PhaSepDBdatabase 167 describes proteins that can undergo phase separation and highlights if the sequence has been observed to do this on its own or not. It lists 356 entries across proteins with 203 unique Uniprot IDs where proteins have undergone phase separation on their own. Next, the LLPSDB database 168,169 should be highlighted not only storing data on cases where proteins have phase separated but also doing this in conjunction with the experimental conditions under which their phase behavior has been probed. The database even includes a handful of phase diagrams collated from the literature. This feature gives the user the opportunity to use the information in a flexible manner to develop their own categorizations and annotations. Altogether, the database highlights 640 sequences that undergo phase separation with the sequences being protein constructs from 135 unique Uniprot IDs. Third, the PhaSePro database 170 lists a relatively small set of manually curated LLPS-prone proteins (121 in total). It substantiates this information with the specific molecular interactions that are likely to drive the condensate formation and the involvement of and dependency on other components such as RNA and membranes and even the effect of post-translational modifications. Finally, the DrLLPS database 171 stands out in the richness of its annotations. It aims to separate proteins into scaffolds (drive LLPS; 150 unique sequences), regulators (contribute to modulating LLPS; over 900 sequences) and clients (might be dispensable for formation of MLOs; over 8000 sequences). It further divides proteins into distinct condensate classes and outlines potential orthologs of these proteins across species. The database integrates information from a variety of other bioinformatic databases, so that the phase separation data would be accompanied by domain annotation, genetic variations, cancer mutations and drug-target relations.
On the other hand lie approaches and databases that do not aim to highlight scaffolding or self-phase separating proteins but instead focus on capturing the full diversity of sequences that have been found in cellular biomolecular condensates or membraneless organelles (MLOs). As has been highlighted by a number of studies, protein localization into MLOs is not necessarily dependent on the protein undergoing a phase separation process on its own. Many proteins are likely incoorporated into condensate bodies via interactions with other biomolecules or cofactors. The PhaSePro database outlined above captures this aspect by collating data from experiments that have shown protein localization into MLOs either in a high-throughput discovery-based manner (e.g., organelle purification, proximity labeling and immunofluorescence image-based screening) or in a lower throughput Chemical Reviews pubs.acs.org/CR Review manner via direct experimental validation (e.g., immunhistostraining). 167 Another relevant resource for MLO composition is the RNA Granule database (GranuleDB). In contrast to the PhaseSepDB that collates data across 73 different MLOs, the GranuleDB specifically focuses on proteins found in stress granules and P-bodies 172 (Figure 7b). However, it collates these data across a variety of different studies performed by using different cell lines and growth conditions or stressors. The GranuleDB similarly distinguishes between candidatebased cases (cell biological, physical association or genetic evidence for the components) and discovery based approaches (purification of the condensates or on proximity-dependent biotinylation techniques prior to a mass spectrometry-based characterization step). When focusing on the human proteome of these databases, we see a substantial overlap in the identified proteins with the majority of the proteins that are picked up by candidate-based approaches also identified in discovery based approaches as expected (Figure 8b). The substantial number of proteins captured by the PhaSepDB database but not by the GranuleDB database can be rationalized by the former database collating data across all different MLO bodies, rather than just stress granules and P-bodies. The high number of proteins that is part of the GranuleDB but not the PhaSepDB data set, however, is more surprising at the first sight but likely illustrates the degree of variation that the specific context (e.g., cell line or the stressor) introduces to the composition of the condensate body. Indeed, when focusing on the set of genes that had been identified by both databases, we can see that over half of these proteins have been observed in condensates by at least two experiments while within the set that does not overlap with the PhaSepDB data only about one-sixth has been seen in more than one study. Altogether, the databases that characterize the proteomes of different MLO systems list over 6600 proteins that have been found to localize into MLO bodies. This is a very notable fraction (around one-third) of the total human proteome, highlighting the importance of biomolecular condensates as a cellular phenomenon. Noteworthily, however, only 122 of these proteins have been seen to self-phase separate even when combining data across all the four databases outlined above (Figure 7c).

Data-Driven Approaches to Predicting Protein Phase Behavior
As described above, the emerging interest around phase separation processes has resulted in a multitude of experiments being performed that would allow elucidating the driving forces of phase separation processes on different proteins. Such experiments have either introduced different types of mutations (truncations, deletions, insertions) to the wild type protein sequence or, alternatively, have generated sequences de novo. 173 Cumulatively, the investigations have identified a number of sequence-derived features that are linked to an elevated propensity of a protein to undergo phase separation. For instance, experiments performed on the fused in sarcoma (FUS)-family proteins have highlighted that high abundance of arginine and tyrosine residues in prion-like domains anticorrelates strongly with the saturation concentration of the protein and, moreover, specifically shown that the latter value is inversely proportional to the product of the number of tyrosine and arginine residues in the sequence. 174 These arginine and tyrosine residues can be regarded as sticker regions according to the stickers-and-spacers framework. 66 Similarly, experiments with TAR DNA-binding protein 43 (TDP-43) have eluded the relative positioning of tryptophan and other aromatic amino acid residues within the sequence as a key factor that governs the phase behavior of this protein. 175 Examination of LAF-1, a DDX3 RNA helicase found in Pgranules inside Caenorhabditis elegans, has eluded electrostatic interactions within disordered regions and, specifically, the arginine-and glycine-rich domain of the N-terminal disordered region of LAF-1 to be a key feature for its condensate formation with this domain being both necessary and sufficient for the phase separation process to occur. 176 Crucially, multivalent interactions and oligomerization of the monomeric form of a protein, are believed to drive the phase separation of UBQLN2 177 and BAG2 178 proteins. Similarly, the phase separation of G3BP1, a common stress granule protein, has been shown to be initialized by oligomerization, specifically the NTF2 domain. 179 Building on these observations, a number frameworks have been established to estimate the propensity of a protein to undergo phase separation from its sequence. One of the earliest algorithms was PLAAC. 180 It used Hidden-Markov-Models trained on a small set of protein sequences to identify prion like domains (PLDs) within proteins. Despite being trained on yeast data, the approach has been found to work well on the human proteome. Two other predictors, R+Y and DDX4-like, have focused on understanding the key interactions that drive the phase separation of the FET-family and DDX4-family proteins, respectively, and expanded these leanings across the full proteome to identify additional condensation-prone sequences. The key determining features for these two approaches were found to be the statistical frequency and spacing of arginine and lysine residues for the FET-family proteins 174 and that of the arginine and phenalynine residues for the DDX4-family proteins. 181 CatGRANULE 182 was one of the first algorithms that, instead of identifying phase separation prone sequencing, aimed to allocate sequences a score. Relying on sequence composition statistics, specifically on the weighing of residues Figure 8. Capabilities of current predictive algorithms related phase separation processes can be advanced by accounting for the surrounding environment and its complexity to make the models context-specific (x-axis) as well as by understanding the molecularscale drivers of the processes to increase the resolution of information that is obtained (y-axis).
based on their predicted disorder, nucleic acid binding propensity, sequence length and the frequencies of arginine, glycine and phenalynine residues, the algorithm distinguished yeast proteins annotated as granule forming from the rest of the yeast proteome. Another statistical scoring algorithm, PScore, first used sequence composition to predict the availability of π-groups for forming planar contacts with other π-groups. It then used π contact predictions to distinguish known phase-separating proteins from proteins found in the protein database to estimate the phase separation propensity of each protein. 183 The planar π−π contacts on which this predictor was based are associated with structural disorder, kinks in beta strands, nucleic acid binding, increased interactions with solvent, aromatic residues, and arginine. These attributes set good basis for the algorithm serving as a good proxy for identifying phase separation prone proteins.
Vernon and Forman-Kay recently reviewed the predictive power of these approaches focusing on the PLAAC algorithm, LARKS, R+Y, DDX4-like, catGRANULE and PScore and CRAPome approaches. 184 They referred to these predictors as first-generation approaches as they exhibit a common feature of using just a single or a small set of knowledge-driven features to characterize how likely a protein is to undergo phase separation. Their systematic analysis showed that there is a common set of proteins captured correctly as phase separating by all of the predictors with RNA-binding proteins in particularly confidently predicted correctly. The investigation further revealed that proteins whose biological phase separation involved folded domains, either in multivalent interactions or via post-translational modifications, were classified incorrectly by all the predictors, clearly highlighting the limitations of these approaches.
The availability of databases that collate information across a variety of studies to describe the phase behavior of hundreds of different proteins as described in the previous section has recently opened up the possibility to use more complex dataheavy strategies for predicting biomolecular phase behavior. In contrast to the first generation approaches described above, these models always consider a variety of features at once and use available data to learn relationships between the designed features that best describe the available data.
A number of such predictive algorithms using this approach have been developed over the past few years 185−188 as is outlined in Figure 7d. Two of the first models put forward shortly after the first databases were released were the DeePhase predictor developed by Saar and Morgunov et al. 185 and the PSPredictor algorithm by Chu et al. 187 PSPredictor was reliant on features derived from pLMs, specifically the word2vec model (Figure 7d, bottom). This step limited the interpretability of the PSPredictor approach. The DeePhase algorithm combined learned representations with knowledge-driven hand-crafted features (Figure 7d, middle) with the final model relying on creating and ensemble between these approaches. This step of combining two orthogonal featurisation strategies was taken to facilitate arrival to a robust outcome in what is considered a relatively low-data regime. There have also been a variety of models that have relied solely on hand-crafted features (Figure 7d, top). One of the approaches that specifically exploits the fact that the model was built on explicitly defined descriptors rather than learned embeddings for introducing model intepretability is the LLPhyScore algorithm. 188 When developing the model, the authors carefully divided features and interactions into long-and short-range interactions. Through competitive feature training, their results suggested the relative importance of longrange interactions over short-range ones.

Open Challenges: Resolution and Context
The majority of the approaches reviewed in this Section have yielded as an outcome a score that described the propensity of the molecule to undergo a phase separation process. While useful as a high level descriptor, such a single value is inherently limited in its interpretability as phase separation processes are known to be highly context specific. In this Section we outline a handful approaches have emerged over the recent years that have specifically been designed to overcome this limitation dividing them into those that ensure that the predictions are more context specific or increase the resolution at which information could be obtained ( Figure 8).
First, Farahi et al. 189 used available phase separation databases (Section 4.2) to compile a list of 78 proteins for which they were able to quantify the minimum concentrations at which the proteins had been observed to phase separate in vitro. For the purpose of their analysis, they used the latter value as an estimate of the saturation concentration, c sat , of the protein. In parallel, they turned to the PaxDB database 190 to retrieve cell-and tissue-specific abundance of these proteins. Combining these two resources, their approach identified some proteins for which tissue specific concentrations reached values similar to the approximated saturation concentrations in vitro. This included nucleolar proteins (NPM1) and FUS-family RBPs (HNRNPA1, FUS, TARDP). For the majority of cases, however, the experimentally determined in vitro saturation concentrations was higher than the estimated cellular concentration of the protein, often by 1−2 orders of magnitude. Given that many of the 78 proteins included in their have been observed not only to phase separate in a purified form but to also localize into MLOs in cells, this discrepancy is likely due to strong local variations in protein concentration across the cell, the important role that other cellular components play in the formation of membraneless organelles, or their combination.
To advance a step further, recently, an attempt was made to not only predict the saturation concentrating of a protein at specific conditions but also its phase diagram. 191 The authors worked with a list of around 900 experiments performed on 366 unique sequences across 137 proteins and built a multihead neural attention-based deep learning model. The performance of the developed model in a classification task (predicting if phase separation is likely to occur at unseen concentration) remained limited, reaching an area under the receiver-operator characteristic curve (a measure of the usefulness of the model obtained by integrating the area under the curve highlighting the true positive rate (y-axis) and the false positive rate (x-axis) at all classification thresholds) of 0.64 in 5-fold cross-validation process, likely due to the sampling of the data points across a multidimensional space being sparse. However, the concept of transitioning from phase separation propensity to the prediction of the experimental conditions under which specific sequences are likely to undergo phase separation is to be acknowledged.
An additional dimension of context-specificity relates to the inclusion of additional components in the system. The data on how the inclusion of molecules, such as small drug-like components or RNA, has begun to emerge, but the volumes of these data have not made current data sets amenable to Chemical Reviews pubs.acs.org/CR Review machine learning approaches. Specific approximations can, however, be made to propose new insights. For instance, Zhu et al. 192 have highlighted the possibility to predict the effect of oligonucleotides on phase separation by building two different predictors: one that had been developed to predict phase separation in vitro and an additional model that predicted partitioning of proteins into reconstituted RNA-rich granules. By contrasting the two predictive models, the authors proposed a set of proteins that would have their phase separation propensity heavily enhanced in the presence of oligonucleotides. One of the predicted candidates, HMGA1, was experimentally demonstrated to follow this predicted trend. Models that predict protein localization into condensates inside the complex cellular environment have been developed. Indeed, it should be noted that some of the first generation predictors described earlier, such as PLAAC and catGRA-NULE, were built on known components of MLOs rather than purified systems. As such, these models could be regarded as the first generation MLO composition predictors. More recently, Kuechler et al. 193,194 gathered a comprehensive list of stress granule components across multiple studies and used this collated data set to examine how the properties of proteins, such as their amino acid composition or global physicochemical properties like solubility and melting temperature correlate with their localization into stress granules. Based on these data and descriptors, the authors subsequently developed a machine learning model that predicted the composition of stress granules (MaGS) and employed it to propose additional components that may be localizing into stress granules but had not been detected by current characterization protocols. Two of their proposed candidates, Zyxin and SynJ, were later validated to localize into stress granules upon arsenate stress, suggesting predictive models have the capability to unravel components of MLOs that are challenging to identify with hypothesis-free experimental approaches. A recent study by Chen at al. used the annotations in the PhaSepDB database to develop a machine learning model that predicted the localization of proteins into condensates from the protein sequence. As the training data for the model covered proteins from a variety of MLOs, the predictions were not specific to a single condensate system like the MaGS described previously, and the authors made limited attempts to understanding which of the high scoring components were likely to localize into the same condensate system. In conclusion, while these approaches have generated the possibility to investigate and model phase behavior in a complex cellular background, they are limited to a specific environment, such as the conditions inside the specific cell where the data was acquired and do not consider the cellular environment as an input to introduce context dependency.
In parallel, another direction toward which future approaches should aim to move is the resolution at which the systems are being characterized (Figure 8). ML approaches have so far made limited systematic attempts toward identifying the regions or domains of proteins that are relevant for protein separation. This is partly related to their inherent limited interpretability and it has been helpful to see certain recent predictors explicitly aiming to overcome this as described in the previous section. 188 Crucially, these attempts have remained global without highlighting which aspects define the phase behavior of a specific protein sequence. Movement toward such a capability would permit understanding the effect of specific mutations on protein phase behavior. In the context Figure 9. Computational approaches can be advanced further in multiple directions to better understand biomolecular phase separation processes. This includes but is not limited to understanding and elucidating (i) how environmental factors control phase separation, (ii) how the presence of other (bio)molecules, cofactors or other types of modulators affect the process, (iii) the role that post-translational modifications play in the process, and (iv) how cellular environment controls and enables phase separation and what the genomic signatures of biomolecular condensation are in cells.

Chemical Reviews
pubs.acs.org/CR Review of predicting phase behavior inside a cell, further resolution is desired to elucidate which components delocalize into the same condensate system. Current models that group different types of MLOs together into a single annotation are limited in achieving this. Experimental data from spatial proteomics studies, such as those acquired through fractination centrifugation 195 or through imaging, 196 can fill an important gap here and provide a route to subdividing condensation prone proteins into separate systems. Predicting such localization as a function of the cell state is likely to be one of the common goals driving further research in this field.

SUMMARY AND OUTLOOK
We have reviewed recent developments in theoretical, physicsdriven simulation and data-driven machine learning approaches to studying biomolecular condensation processes.
We have outlined how all these approaches have helped to advance our understanding of the biomolecular condensation processes with each method presenting its unique advantages and limitations. Despite this notable progress, there is no shortage of questions where any one of the discussed computational approaches can have further impact. We highlight some potential directions in Figure 9.
Considering these open areas of research one-by-one, first, to date there have been relatively few attempts to predict how environmental conditions affect the phase separation process. This scarcity has been in part due to the low volumes of available data as few experimental methods are able to describe the effect of environmental conditions on phase separation in a high-throughput manner. 197−199 While first machine learning models to this effect have been proposed, 191 on the level of performance, physics-driven MD simulations have shown the most promise as they have recently demonstrated a remarkable accuracy in the recapitulation of phase diagrams for a number of protein sequences. 106 However, it is worth highlighting that established MD methods considered only pairwise interactions between amino acid residues and are limited in their ability to capture cooperative interactions and three-and higher-body energy terms that are likely important for more complex multicomponent mixtures.
Next, the past few years have seen advancements in recapitulating the behavior of multicomponent systems. Physics-driven methods have moved in the direction of being able to accurately simulate systems more than a single component and theoretical approaches have advanced in the capability to model the key determinants of multivalent and cooperative interactions that give rise to phase separation. Machine learning models have not addressed the question of modeling multiphase systems in detail, but they have demonstrated the possibility to predict the components of MLOs in cells and discover hitherto unknown components of such systems. However, it is worth highlighting that there is no well-validated machine learning framework to predict the composition of distinct MLOs as most existing models concerning the predictions of the components of MLOs score proteins by propensities without allocating them to condensate systems.
An additional key direction that has been studied to a limited degree is the effect of post-translational modifications on biomolecular phase separation processes. 200−204 A recent study by Sridharan et al. analyzed systematically how phosphorylation can affects the composition of MLOs and demonstrated a possible methodology for studying the effect of post-transnational modifications experimentally across a condensate system. 205 With the advancement of experimental methods and condensate purification protocols, we expect larger volumes of data to emerge that characterizes how posttranslational modifications affect protein phase behavior.
Finally, it is evident that, inside a cellular milieu, there is a vast number of factors that are involved in defining the biomolecular landscape of a cell. The same condensate systems may often be present in some cell lines but not all. Within the same cell line, a specific condensate system likely appears only under certain conditions. To the best of our knowledge, there are currently no approaches that have endeavored to describe in a systematic manner how the presence or absence of biomolecular condensates or their composition is defined by the cell state or cellular genomics.
To conclude, we envisage the variety of high-throughput experimental approaches that have been established over the past decade as part of cross-disciplinary teams across biological, chemical and physical sciences to be be systematically used and applied to examine biomolecular condensation processes. This process will likely lead to the emergence of ample experimental data around some of the open questions outlined in this Section. Efficient integration of experimental data and computational methods can help us to address outstanding research questions in an accelerated manner in order to generate a more complete picture of the molecular driving forces of the biomolecular condensation process and their biological role.  Tuomas Knowles is professor of Physical Chemistry and Biophysics at the University of Cambridge, and codirector of the Cambridge Centre for Protein Misfolding Diseases. His research is focused on protein function and self-assembly in health and disease through both theoretical and experimental approaches. He is the recipient of multiple international prizes, including the Sackler Prize for Biophysics and the Corday−Morgan Prize from the Royal Society of Chemistry.