Structural Biology in the Multi-Omics Era.

ABSTRACT: Rapid developments in cryogenic electron microscopy have opened new avenues to probe the structures of protein assemblies in their near native states. Recent studies have begun applying single-particle analysis to heterogeneous mixtures, revealing the potential of structural-omics approaches that combine the power of mass spectrometry and electron microscopy. Here we highlight advances and challenges in sample preparation, data processing, and molecular modeling for handling increasingly complex mixtures. Such advances will help structural-omics methods extend to cellular-level models of structural biology.
With the sequencing of thousands of genomes, large biological data sets (-omics data) have become pervasive in most fields of biology, including development, 1,2 the classification of organisms, 3,4 and disease, 5−7 among many others. Disciplines embracing -omics strategies reach well beyond the central dogma of biology (genomics, transcriptomics, and proteomics) into such areas as metabolomics, 8 epigenomics, 9 pharmacogenomics, 10 and interactomics. 11 As with these other endeavors, structural biology has also expanded to embrace -omics approaches.
Major historic interactions of structural biology and -omics approaches have included, for example, electron tomography 12 to provide cellular context and spatial information to complement proteomics and interactomics data, 13−15 many efforts at proteome-scale modeling of three-dimensional (3D) structures and interactions, 16−18 and the entire field of structural genomics. 19−22 Structural genomics has employed techniques such as X-ray crystallography, NMR spectroscopy, and electron microscopy (EM) to solve structures of purified macromolecules in a high-throughput manner, targeting new protein folds and entire proteomes, which have been supplemented by molecular modeling and structure prediction to extend structural insights to new molecules.

METHODS
More recently, advances in single-particle cryogenic electron microscopy (cryo-EM) have opened interesting new opportunities to connect -omics approaches and structural biology. In particular, cryo-EM boasts several important features: it requires only small amounts of sample, there is no requirement for crystal screening and optimization, and as a result, it is possible to capture several states of a macromolecular machine of interest. Cryo-EM is also capable of imaging a large field of individual macromolecular complexes in a single image. With the advent of direct electron detectors, ultrastable electron microscopes, automated data collection strategies, 23 and real-time data processing, 24 the "resolution revolution" in cryo-EM provides a definite route forward for increasing the throughput of structural biology. 25 We can anticipate that structures from these methods, in combination with electron tomography, will produce information-rich cell atlases capturing high-resolution structures of the proteome and its spatial context that will synergize with other -omics approaches. Here we focus specifically on efforts to increase the applicability of single-particle cryo-EM to increasingly complex and heterogeneous samples, approaching cell lysates in complexity (as in shotgun cryo-EM), thus furthering the transformation of cryo-EM into a pipeline for structural-omics.
Mass spectrometry combined with electron microscopy has been shown to be well-suited for characterizing the architecture of protein complexes without purifying a specific target molecule, as demonstrated in yeast, 16 Desulfovibrio vulgaris, 26 macrophage cytoplasm, 27 the nuclear pore complex, 28−30 and most recently Plasmodium falciparum. 31 Protein−protein interactions identified through mass spectrometry in conjunction with advances in 3D structure determination have been used to investigate the architecture of multiple distinct protein complexes from mixtures such as fractionated cell lysate or even single cells. 32−34 To date, such studies have largely been limited to the identification of protein complexes that were easily recognizable (e.g., the proteasome and ribosome) or of high enough resolution to identify the proteins by comparing contiguous stretches of highly resolved amino acids to a reference proteome. 31 Currently, the field lacks robust and systematic computational pipelines for sorting, identifying, and performing molecular modeling of the myriad of structures that can potentially be solved from mixtures. The question remains: how can we break through these barriers?

SAMPLE PREPARATION AND DATA COLLECTION FOR HETEROGENEOUS MIXTURES
Even before molecular modeling comes into play, shotgun cryo-EM of mixtures faces several challenges in high-throughput data collection and processing. Sample preparation is often a major bottleneck in structural studies. In our hands, finding suitable freezing conditions for heterogeneous mixtures has proven as difficult as for a single purified sample, 35 with the addition of several new challenges. Notably, in the case of cell extracts, the presence of dominating, highly abundant macromolecules can make screening difficult, especially when the sizes and shapes of other, less abundant proteins are unfamiliar. Although multiple orthogonal chromatographic separations might help simplify mixtures, we find that sample preparation with similar-sized macromolecules improves the chances of success. We have also found that different buffers in combination with different support substrates such as graphene oxide can act as an additional "purification" step, ultimately determining which complexes are present on the grid. Furthermore, many 3D reconstructions are built from large data sets containing hundreds of thousands of particles per complex. Scaling this to samples containing tens to hundreds of complexes, which may be present in different quantities, could prove challenging simply from a data collection perspective. It will also be important to incorporate improved denoising and particle picking algorithms to assist users in picking difficult-to-recognize particles with multiple shapes and sizes. 36−38 Despite these challenges, several groups have already produced multiple structures to <5 Å resolution from fractionated lysates. 31,32 While work on sample preparation methods for investigating fractionated or whole-cell lysates is ongoing, there already exist many approaches that can be used to reduce the complexity or target specific molecules from a mixture.
Modified grid surfaces have been used for capturing proteins by His-tag, 39,40 biotin, 41 and antibody affinity. 42 These approaches can alleviate the need for purification, target low-abundance proteins, help with orientation bias, and be readily integrated in combination with clonal sets such as the ASKA library. 43 Other approaches include using microfluidic devices that can isolate and enrich target molecules. 44 To date, many of these studies have been limited to identifying only a few symmetric molecules from a mixture, and scaling these approaches for high throughput has yet to be attempted.
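The data collection burden noted in this section can be made concrete with a back-of-the-envelope estimate. The sketch below is illustrative only: the function name, the relative abundances, the target of 100,000 particles per complex, and the 200 particles per micrograph are all hypothetical values chosen for the example, not figures from the studies cited here.

```python
import math

def micrographs_needed(target_particles, fraction, particles_per_micrograph):
    """Micrographs required so that a complex making up `fraction` of all
    picked particles reaches `target_particles` on average."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Total particles that must be picked for this complex to reach its target.
    total_particles = target_particles / fraction
    return math.ceil(total_particles / particles_per_micrograph)

# Hypothetical mixture: fraction of picked particles belonging to each complex.
abundances = {"abundant": 0.60, "intermediate": 0.10, "rare": 0.01}

# Hypothetical acquisition parameters (assumptions for illustration).
budget = {
    name: micrographs_needed(100_000, fraction, 200)
    for name, fraction in abundances.items()
}

for name, count in budget.items():
    print(f"{name}: about {count} micrographs")
```

Under these assumptions the rarest complex dictates the size of the data set, since the required number of micrographs scales inversely with its abundance on the grid.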

DATA PROCESSING FOR HETEROGENEOUS MIXTURES
Apart from optimization of sample preparation and data collection, new data processing schemes will also need to be introduced. Currently, most cryo-EM data processing software operates under the assumption that samples contain one dominant structure that may contain conformational or subunit heterogeneity. In order to adapt such software for use on highly heterogeneous samples, we developed an auxiliary algorithm based on the principles of the projection-slice theorem to presort particles into homogeneous subsets prior to conventional 3D classification and therefore avoid the need to guess the number of underlying structures present in the data. 35 A subsequent challenge will be to identify the resulting models, which can range from low to high resolution. Recently, the cryoID software package was introduced, which uses a unique approach to sequence by structure from highly resolved, contiguous amino acids in a 3D reconstruction. 31 However, the challenges from sample preparation suggest that it is more likely that these studies will produce a number of low- to mid-resolution maps, and there still remains a significant challenge for identifying and modeling low- to mid-resolution reconstructions from a mixture when their identities are not known a priori.
Figure 1. A structural-omics pipeline. A broad goal in the field is to develop a high-throughput structural-omics approach for reconstructing complexes from a heterogeneous mixture. For example, whole-cell lysates, organelle lysates, and heterogeneous mixtures might be analyzed by both cryo-EM and mass spectrometry. Cryo-EM produces multiple 3D reconstructions of protein complexes, while mass spectrometry provides identity and interaction information for the proteins present in the sample. To merge the two, even more efficient computational pipelines are needed to build or retrieve individual structures of proteins, organize them by interactions, assemble them into complexes, and match them to their 3D reconstructions obtained from a sample.
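The projection-slice theorem mentioned in this section states that the Fourier transform of a projection equals a central slice through the Fourier transform of the object. The sketch below only checks the theorem numerically in its 2D-to-1D form on a random toy image. It is not the presorting algorithm of ref 35. The helper names are hypothetical, and a naive DFT is used to keep the example dependency-free (in practice an FFT library would be used).

```python
import cmath
import random

def dft(signal):
    """Naive 1D discrete Fourier transform (O(n^2); fine for a toy example)."""
    n = len(signal)
    return [
        sum(value * cmath.exp(-2j * cmath.pi * j * k / n)
            for j, value in enumerate(signal))
        for k in range(n)
    ]

def dft2(image):
    """2D DFT of a list-of-rows image, returned as F[ky][kx]."""
    row_transformed = [dft(row) for row in image]            # transform along x
    col_transformed = [dft(list(col))                        # transform along y,
                       for col in zip(*row_transformed)]     # indexed [kx][ky]
    return [list(row) for row in zip(*col_transformed)]      # back to [ky][kx]

def project_along_y(image):
    """Sum the image down its columns, giving a 1D projection indexed by x."""
    return [sum(col) for col in zip(*image)]

random.seed(0)
size = 8
image = [[random.random() for _ in range(size)] for _ in range(size)]

projection_ft = dft(project_along_y(image))   # Fourier transform of the projection
central_slice = dft2(image)[0]                # ky = 0 line of the 2D transform

# The two should agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(projection_ft, central_slice))
print(f"largest deviation: {max_diff:.2e}")
```

In single-particle cryo-EM the same relation holds one dimension up: each 2D particle image corresponds to a central plane of the 3D Fourier transform of the particle, which is why projections of the same structure share common lines while projections of different structures generally do not.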

■ APPROACHES FOR DOCKING ATOMIC MODELS INTO LOW- TO MID-RESOLUTION RECONSTRUCTIONS
Because lower-abundance proteins in mixtures will likely achieve only low- to mid-resolution 3D reconstructions, if only because fewer particles are available, there will continue to be a need to better leverage other structural data. For this reason, an important focus remains improving approaches for fitting both predicted and currently available atomic structures into these lower-resolution 3D reconstructions (Figure 1). These range from user-intensive to computation-intensive approaches. Ideally, given the ambiguity of fitting numerous subunits into 3D reconstructions of unknown identity, one would prefer a quick, efficient, and computationally driven method. The challenge of fitting subunits into a 3D reconstruction becomes increasingly difficult for multi-subunit complexes and may be additionally complicated by considerations of symmetry. Techniques such as MBP and Fab labeling of individual subunits have been used to identify specific subunits within multi-subunit complexes. 45,46 While this would prove cumbersome for identifying proteins in multiple complexes within a cell lysate, it may be useful for targeting a specific complex of interest.
One commonly employed user-driven approach for fitting atomic structures into 3D reconstructions involves segmenting the maps either manually or using the Segger tool 47 followed by rigid-body docking using Fit-in-Map into these segmented regions in UCSF Chimera. 48,49 Scoring of this approach can be optimized using a flexible fitting tool 50,51 such as MDFF, 50 which applies forces proportional to the density gradient of the EM map, while conserving stereochemistry, to fit atomic structures into EM maps with resolutions as low as 15 Å. While these methods may work well if structural information is known a priori, any manual approach of rigid docking faces the possibility of getting caught in a local minimum, suffering from user bias, and requiring numerous user hours. Furthermore, fitting atomic models into complexes becomes extremely challenging when their identities are incompletely known.
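The idea of steering atoms with forces proportional to the density gradient can be illustrated on a toy map. The sketch below is a simplified stand-in and not the MDFF implementation: it evaluates a central-difference gradient at a single voxel of a synthetic Gaussian density, whereas a real flexible-fitting run also interpolates the map, applies a molecular dynamics force field, and adds restraints to preserve stereochemistry. All function names and parameters are hypothetical.

```python
import math

def gaussian_map(size, center, sigma):
    """Toy density map: one Gaussian blob on a cubic grid, indexed [x][y][z]."""
    cx, cy, cz = center
    return [[[math.exp(-((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2)
                       / (2 * sigma ** 2))
              for z in range(size)]
             for y in range(size)]
            for x in range(size)]

def density_gradient(density, voxel):
    """Central-difference gradient of the density at an interior voxel."""
    x, y, z = voxel
    return (
        (density[x + 1][y][z] - density[x - 1][y][z]) / 2.0,
        (density[x][y + 1][z] - density[x][y - 1][z]) / 2.0,
        (density[x][y][z + 1] - density[x][y][z - 1]) / 2.0,
    )

def fitting_force(density, voxel, scale=1.0):
    """Force proportional to the local density gradient (points uphill in density)."""
    return tuple(scale * g for g in density_gradient(density, voxel))

density = gaussian_map(size=16, center=(8, 8, 8), sigma=3.0)
atom_voxel = (5, 8, 8)   # hypothetical atom sitting 3 voxels from the peak along x
force = fitting_force(density, atom_voxel)
print(force)
```

Here the force on the misplaced atom points along +x, toward the density peak, and vanishes in y and z where the atom is already centered. Because the force follows the local gradient only, a poorly placed starting model can still settle into a nearby local optimum, which is the limitation noted above for rigid docking as well.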
The development of integrative methods allows for a more hands-off approach, eliminating some of these biases. 52−54 These approaches combine data retrieved from various experiments such as yeast two-hybrid (Y2H) assays, mutagenesis, cross-linking, small-angle X-ray scattering, electron microscopy, and X-ray crystallography to build the multiprotein model. 55,56 Such methodologies have been successful in building models for a number of multiprotein complexes such as the nuclear pore complex, 57 16S rRNA complexed with methyltransferase A small subunit, 56 and the BBSome. 58 Recently, several models predicted by integrative modeling were validated against their experimentally determined high-resolution structures. 52 The results showed that for all atom models the positions of subunit centers were within 5 Å of the true model, demonstrating the power of this approach. 59−64 For those structures with resolution higher than 10 Å, not only can secondary structure elements be detected, but orientation and connectivity may also be predicted to validate the integrative models. 65 While these methods are promising for building a single multiprotein assembly with abundant data, they are computationally intensive, and whether they will be equally applicable to mixtures of multiple complexes from structural-omics data remains untested. Methods that could simplify model building by further constraining possible orientations, interactions, or flexibility may help going forward.

MACHINES WITHIN COMPLEX MIXTURES
Because of the size and complexity of the data that describe extremely heterogeneous samples, corresponding mass spectrometry data become pivotal in identifying the proteins present, estimating their relative abundances, and identifying those that interact to form complexes in the sample. Previous studies have shown that machine learning combined with cofractionation mass spectrometry can be used to detect proteins that interact to form complexes on the basis of their elution profiles from multiple separation techniques. 66 These predicted complexes can be prioritized by relative abundance for modeling. Additionally, identification of previously solved structures could reduce the number of 3D reconstructions that need to be considered for subsequent modeling. Pipelines such as GEM-PRO could accomplish this by streamlining searches of the Protein Data Bank: given a gene or protein sequence, they return protein structures, evaluate the quality of those structures, and prepare sequences for comparative modeling when no structure is known. 67 Recently, improved shape-based searches for protein complexes have been developed to better accommodate the low- to mid-resolution EM data produced from tomography. 68 Such shape-search tools might prove useful for searching 3D reconstructions in order to identify those known from prior structures. The 3D reconstructions that have been resolved and identified could then be used to revisit raw micrographs and pick specific particles with template matching approaches. 69 The remaining 3D models would subsequently have to be built de novo on the basis of, e.g., protein identities from mass spectrometry performed on the same samples.
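The principle behind cofractionation analysis is that subunits of a stable complex elute together, so their abundance profiles across fractions are similar. The sketch below is a deliberately simplified stand-in for the machine learning approach of ref 66: it scores pairs of proteins by the Pearson correlation of a single elution profile each. The protein names, profile values, and the 0.9 threshold are hypothetical.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def coeluting_pairs(profiles, threshold=0.9):
    """Return (protein, protein, score) for pairs whose profiles correlate strongly."""
    pairs = []
    for (name_a, prof_a), (name_b, prof_b) in combinations(sorted(profiles.items()), 2):
        score = pearson(prof_a, prof_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return pairs

# Hypothetical abundances per chromatographic fraction (arbitrary units).
profiles = {
    "protA": [0, 1, 8, 20, 9, 1, 0, 0],
    "protB": [0, 2, 9, 18, 8, 2, 0, 0],
    "protC": [0, 0, 0, 1, 3, 12, 15, 4],
}

pairs = coeluting_pairs(profiles)
for name_a, name_b, score in pairs:
    print(f"{name_a} and {name_b} co-elute (r = {score:.2f})")
```

A single correlation score is prone to false positives from proteins that merely share a similar size, which is one reason the published approaches combine several orthogonal separations and additional features before predicting complexes.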
Importantly, beyond the structures of proteins already solved and available in the Protein Data Bank, 70 3D structural models have now been computationally generated by many research groups at the proteome scale (such as those indexed by the UniProt 71 database), a success of the Protein Structure Initiative, using techniques of comparative modeling, 67,72 evolutionary couplings, 73 or even ab initio 74 approaches.
Any structural modeling of native protein assemblies would most likely require prior knowledge of which specific protein−protein interactions occur 75,76 as well as the stoichiometries of the interacting subunits. The latter, if unknown, might be obtainable using mass spectrometry. 57,66,77−79 Other approaches to deciphering stoichiometry might include using volume constraints, where volumes of different numbers of individual subunits are compared to the volume of a 3D reconstruction. Cross-linking mass spectrometry, where large numbers of pairwise protein interactions may be identified, can help in elucidating protein interaction partners. 80 Additionally, other pairwise restraints may be added, such as protein docking predictions, to reveal new assemblies. 81−83 However, protein docking becomes significantly more complex with more than two proteins and no knowledge of interaction interfaces or order of assembly. The problem resembles a jigsaw puzzle, where subunits must fit into the molecular envelope while respecting mutual packing interfaces. In general, such packing problems are known to be NP-complete, 84 meaning that no polynomial-time algorithm for solving them is known. Nonetheless, additional restraints can be brought to bear to reduce the search complexity. For example, as with a puzzle, one might determine interacting interfaces among the subunits, either by docking 18 or more approximate approaches, ideally algorithms that are rapid and partner-specific. In our own work, we have developed reduced representations of protein surfaces to help predict complementary interaction interfaces, which add a measure of robustness to minor structural deformations upon binding. 85 Combinations of such packing restraints could then be employed to help pack and refine 3D protein structures to EM maps.
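The volume-constraint idea for stoichiometry can be sketched as a small enumeration. The example below is a minimal illustration under stated assumptions: subunit volumes are estimated from molecular mass using roughly 1.21 Å^3 per dalton (corresponding to a typical protein partial specific volume of about 0.73 cm^3/g), and the subunit names, masses, copy-number limit, and 10% tolerance are hypothetical. In practice the enclosed map volume also depends on the chosen density threshold.

```python
from itertools import product

# Approximate protein volume per unit mass (assumption: ~0.73 cm^3/g).
ANGSTROM3_PER_DALTON = 1.21

def subunit_volume(mass_da):
    """Estimated volume (cubic angstroms) of a protein of the given mass."""
    return mass_da * ANGSTROM3_PER_DALTON

def candidate_stoichiometries(subunit_masses, map_volume, max_copies=4, tolerance=0.10):
    """Enumerate copy numbers whose summed volume is within a fractional
    `tolerance` of the volume enclosed by the 3D reconstruction."""
    names = sorted(subunit_masses)
    hits = []
    for copies in product(range(max_copies + 1), repeat=len(names)):
        if not any(copies):
            continue  # skip the empty assembly
        volume = sum(n * subunit_volume(subunit_masses[name])
                     for n, name in zip(copies, names))
        if abs(volume - map_volume) <= tolerance * map_volume:
            hits.append(dict(zip(names, copies)))
    return hits

# Hypothetical subunit masses in daltons.
masses = {"alpha": 50_000, "beta": 30_000}

# Pretend the map encloses an alpha2-beta2 assembly.
map_volume = subunit_volume(2 * 50_000 + 2 * 30_000)

hits = candidate_stoichiometries(masses, map_volume)
for hit in hits:
    print(hit)
```

With these numbers the enumeration returns three candidates, including the alpha2-beta2 assembly, which shows that volume alone usually narrows rather than determines the stoichiometry and is best combined with mass spectrometry or interaction restraints.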
In parallel, researchers have improved computational search algorithms for packing problems by using reduction or backtracking, 86,87 and the potential exists to crowdsource the problem, employing the visual acuity of humans to manually fit subunits into 3D reconstructions. 88
Structural-omics stands to benefit strongly from the cryo-EM resolution revolution, and in turn these approaches have the potential to greatly enhance our understanding of biology from a systems perspective. Toward this end, it is already clear that various low- to high-resolution complexes may be reconstructed from a cell lysate using single-particle electron microscopy. The development of new computational tools to efficiently sort and build atomic models into these low- to mid-resolution reconstructions or to solve the high-resolution structures from mixtures of increasing complexity will certainly help to further advance this field and put it on a path toward even richer structural cell atlases.