Long Time-Scale Atomistic Simulations of the Structure and Dynamics of Transcription Factor-DNA Recognition

Recent years have witnessed an explosion of interest in computational studies of DNA binding proteins, including both coarse grained and atomistic simulations of transcription factor-DNA recognition, in order to understand how these transcription factors recognize their binding sites on the DNA with such exquisite specificity. The present study performs μs-timescale all-atom simulations of the dimeric form of the lactose repressor (LacI), both in the absence of any DNA, and in the presence of both specific and non-specific complexes, considering three different DNA sequences. We examine, specifically, the conformational differences between specific and non-specific protein-DNA interactions, as well as the behavior of the helix-turn-helix motif of LacI when interacting with the DNA. Our simulations suggest that stable LacI binding occurs primarily to bent A-form DNA, with a loss of LacI conformational entropy and optimization of correlated conformational equilibria across the protein. In addition, binding to the specific operator sequence involves a slightly larger number of stabilizing DNA-protein hydrogen bonds (in comparison to non-specific complexes), that may account for the experimentally observed specificity for this operator. In doing so, our simulations provide a detailed atomistic description of potential structural drivers for LacI selectivity.


Introduction
Gene regulation, which is controlled by DNA-binding proteins such as transcription factors (TFs) 1 , is essential to cellular function.Thus, understanding protein-DNA interaction kinetics is of great biological and biophysical interest.Specifically, to carry out their in vivo function, DNA-binding proteins have to rapidly find their specific binding sites from a remarkable amount of non-specific DNA 2 .Recent years have seen substantial progress in addressing the molecular description of dynamic target searching on DNA sequences by proteins, leading to a current consensus view that can be summarized as follows [2][3][4] : (1) The DNA-binding protein first binds non-specifically to the DNA, and then starts quickly searching the DNA for its target, through one-dimensional sliding.(2) During sliding, when the protein encounters a specific site, a tighter complex will be formed to recognize the target, otherwise, (3) the protein continues onedimensional sliding or dissociates from the DNA at some point, and then moves in the three-dimensional space of the solution until it encounters a DNA segment 2,5 .
Intersegmental transfer has also been suggested to be involved in the search process 2 .
However, as it is not possible to directly observe the search process at sufficiently high spatio-temporal resolution using experimental approaches, it is still unknown (in molecular detail) how these TFs recognize their binding sites, and how they switch from non-specific to specific binding.In addition, both specific and non-specific binding events are considered to be key factors in the search process 5 .
To this end, molecular structures of protein-DNA complexes involving both specific and non-specific binding have been solved by NMR spectroscopy or X-ray crystallography for EcoRV 6 , BamHI 7 , the γ-repressor, 8 and the lactose repressor [9][10][11][12][13] .In the bacterium Escherichia coli, the lactose repressor (LacI) regulates the expression of a set of genes which are involved in the lactose metabolism 14 .LacI was discovered as the first gene-regulatory DNA-binding protein, and is still one of the most studied model systems when exploring protein-DNA recognition [9][10][15][16][17][18][19][20][21][22][23] . Structually, the LacI protein is a homo-tetramer, with four identical subunits that are assembled into two dimers connected by four α-helices at the C-terminus of each monomer (Figure 1) 9,24 .Each monomer consists in turn of 360 amino acids, which can be divided into four main regions: an N-terminal DNA-binding domain (residues 1-50), a regulatory domain (core domain, residues 61-340), a hinge region connecting the DNA-binding domain and the core domain (residues 51-60), and a C-terminal tetramerization region (residues 341-360) that joins the four monomers into an α-helix bundle 9,25 .The DNA-binding domain is, in turn, characterized by the presence of a helix-turn-helix (HTH) motif, with two symmetrically monomeric DNA binding domains forming a dimeric unit that binds a single operator site 9,25 .
In the specific complex, residues 51-60 from each monomer form a well-ordered α-helix, referred to as a "hinge helix" (highlighted in red in Figure 1).The two hinge helices in the LacI dimer interact with each other, binding to the minor groove of the DNA.The insertion of the hinge helices into the minor groove leads to the DNA molecule bending by 36° in the specific complex, thus forming a bent structure 11 .In contrast, when LacI is bound to a non-specific site, the hinge region has been argued based on NMR studies to be more disordered and in a random coil form, resulting in a more flexible and mobile HTH domain, and a DNA molecule that is not bent 12 , although in this study the system carries a mutation in a hinge helix residue, V52C, to promote the formation of a disulfide bond.This could be interfering with the structural integrity of the hinge helices.High-resolution structures of non-specific LacI-DNA complexes have not been reported as of yet, except for one solution NMR structure of the LacI DNA binding domain in complex with non-specific DNA 12 .

Figure 1.
The structure of tetrameric LacI in complex with two DNA operators (PDB ID: 1Z04 [26][27] ).Here, the hinge helices are highlighted in red.These helices are disordered when LacI is bound to non-specific DNA (see e.g.PDB ID: 1OSL 12,27 ), but become well-ordered in α-helical form when LacI is bound to specific DNA, such as the example shown here.This figure was made with Chimera v1.11.2 28 .
In addition, LacI has been shown to bind a range of operator sequences 23,[29][30][31][32] .The main DNA operator for LacI, O 1 , is essential for the function of the lac operon 30 .There are two more operators, O 2 and O 3 , located 401 base pairs downstream of O 1 and 92 base pairs upstream of O 1 , respectively 31 .It was reported that LacI binds to O 2 with 2-fold lower binding affinity than with O 1 , while it shows 200-fold lower affinity for the O 3 operator 23 .
Furthermore, a perfectly symmetric operator, O SymL (5´-AATTGTGAGCGCTCACAATT-3´) has been constructed that has been reported to bind LacI 10-fold more tightly than the other operators. 29In the case of the LacI dimer, specific binding to the O 1 operator is in the range of 10 -8 to 10 -9 M 33 compared to 10 -3 M for non-specific binding 34 , corresponding to at least 10 5 fold difference between specific and non-specific binding.[37][38][39] There exists abundant structural 9-10, 13, 20, 22-23, 25 and functional 15-19, 21, 40 data to describe LacI, making it an ideal model system for characterizing protein-DNA recognition processes at the atomic level using simulation techniques, and there have been several interesting studies of this system.Given the extremely large (from a simulation perspective) size of ~310,000 atoms of the LacI-DNA complex (for the full tetramer in complex with a rod model of DNA 26,41 ), several of these have been performed using coarse-grained models, in order to explore the conformational changes of the complexes 42 , as well as more recent atomistic simulations 41,[43][44][45][46][47][48] .Apart from the short-timescale all-atom study of tetrameric LacI by Villa et al. 41 , other atomistic studies have included only the binding domain of either monomeric or dimeric LacI in the simulations.There is a risk that not including the core domain and C-terminal regions of LacI leads in turn to missing the collective motions of the binding domain and hinge regions relative to the core region, which is part of the LacI-DNA recognition mechanism.
In addition, while a very powerful tool, coarse-grained simulations do not provide atomistic insight into the details driving the differences between specific and non-specific protein-DNA interactions.
Here we present μs-timescale all-atom simulations of the free LacI dimer as well as the dimer in specific and non-specific complexes with three different DNA sequences.
We focus on the LacI dimer both to make the simulation size more tractable and due to the extensive experimental data on this model system 5, 9-10, 13, 15-23, 25, 33-34, 40, 46, 49-50 , in contrast to the paucity of structural information on the tetramer.We focus in particular on understanding the dynamical differences between specific and non-specific interactions, the behavior of the "clamping" helices of the protein when interacting with DNA.We show that there are sequence-dependent differences in the overall dynamics of LacI in complex with different DNA sequences, which could help explain the differences in selectivity of this transcription factor.In doing so, we provide a detailed atomistic description of the structural drivers for LacI selectivity.

System Setup
Our starting point for all simulations in this work is a 2.6 Å crystal structure of a LacI dimer in complex with a symmetric highly specific (consensus) operator sequence, O SymL (5´-AATTGTGAGCGCTCACAATT-3´) and the anti-inducer orthonitrophenylfucoside (ONPF) (PDB ID: 1EFA 10,27 ).The ONPF molecules were removed from the system, and chains A, B, D and E, comprising the LacI dimer and the DNA molecule, were retained for further simulations.The missing C-terminal tetramerization helices were added with Chimera 28 by taking coordinates from a crystal structure of the lac repressor in complex with a 21 base pair symmetric operator DNA and the gratuitous inducer isopropyl-β-D-1thiogalactoside (IPTG) (PDB ID: 1LBH 9,27 ).In order to avoid the common problem 51 of fraying of the base pairs at the two ends of the DNA strand during the simulations, we 7 added 10 base pairs to each end of the operator sequence using the 3DNA web server 52 , to create a longer 40 base pair sequence centered on O SymL (5´-GCGTACAAGG-AATTGTGAGCGCTCACAATT-GTGTAGGCGC-3´, Table S1).The resulting complex, which is shown in Figure 2, then served as a baseline for all subsequent simulations.The simulations performed in this work have been summarized in Table S2.In brief, we have performed simulations of (1) free DNA, (2) free LacI, (3) LacI in complex with bent DNA (based on PDB ID: 1EFA 10,27 ) and (4) LacI in complex with straight DNA (based on complexes constructed from snapshots of our simulations of free DNA and free LacI, respectively).As mentioned above, all simulations were performed using the LacI dimer.In addition, all simulations of free DNA or LacI-DNA complexes were performed using both the O SymL operator sequence described above, as well as two mutated sequences, DNA mut1 and DNA mut2 , which were generated using an in-house script available for download from Zenodo (DOI: 10.5281/zenodo.1494296).The mutations were generated by mutating either every other base of the operator sequence from A→C and T→G (DNA mut1 , 5´-GCGTACAAGG-ACTGGGGCGAGATAAAACTG-GTGTAGGCGC-3´, purine → pyrimidine and pyrimidine → purine), leading to a nonspecific complex with less chemical similarity to the specific complex, or by mutating every other base of the operator sequence from A→G and T→C (DNA mut2 , 5´-GCGTACAAGG-AGTCGCGGGTGTTTATAGTC-GTGTAGGCGC-3´, purine → purine and pyrimidine → pyrimidine), leading to a non-specific complex with more structural similarity to the specific complex.Thus, in each case, three independent simulations were performed for each of O SymL , DNA mut1 and DNA mut2 , either as free DNA or in complex with LacI.In the case of the simulations of free DNA or LacI-DNA complexes with bent DNA, this was achieved simply by mutating every other base-pair directly in the crystallographic conformation of O SymL (from PDB ID: 1EFA 10,27 ).In the case of the complexes of LacI and straight DNA, these simulations were performed by generating 6 artificial complexes based on snapshots extracted from simulations of free DNA, and either three snapshots extracted from simulations of free LacI, or three complexes built based on the crystal structure.These were then positioned relative to each other such that the protein and DNA were, on average, 17.9 Å apart (with a standard deviation of 1.4 Å), based on calculating the distance between the centers of mass of the binding domain of LacI, and the binding site on the DNA sequences (Figure S1 and Table S3).For references, the corresponding distance in PDB ID: 1EFA 10, 27 is 10.6 Å, and therefore this corresponds to a ~7 Å displacement of the protein relative to the DNA.
Each system was then placed into a rectangular box filled with TIP3P water molecules 53 , with a distance of at least 8 Å from the solute to the surface of the box in each direction.The necessary number of Na + and Cl -counterions were then added to first neutralize the system, and then achieve a 0.15 M salt concentration, in a random scheme using addIonsRand from the LEaP module as implemented in AMBER16 54 (for simulation specifics per system, see Table S3).All simulations were performed using the AMBER 16 simulation package 54 , with the protein described using the AMBER ff14SB force field 55 , and the DNA described using the Parmbsc1 force field 56 .Finally, the LEaP module of AMBER16 54 was used to generate the topology and coordinate files for subsequent MD simulations.

Simulation Details
All molecular dynamics simulations were performed for each system (LacI-DNA complexes, free DNA or free LacI) using the same protocol, and using the CUDA version of the PMEMD module [57][58][59] of the AMBER16 simulation package 54 .Each solvated system was first subjected to a 10,000 step steepest descent minimization, with harmonic positional restraints applied to all atoms of the solute, using a 25 kcal mol -1 Å -2 force constant.Subsequently, two more 10,000 step steepest descent minimizations were performed for the system, using weaker harmonic positional restraints with a force constant of 5 kcal mol -1 Å -2 .The minimized systems were then gradually heated up from 100 to 300 K, and equilibrated for 500 ps of simulation time, coupled by the Berendsen thermostat 60 with a time constant of 0.5 ps, and 5 kcal mol -1 Å -2 harmonic positional restraints on all atoms.The system was then further optimized for 1000 ps in an NPT ensemble (300K, 1 atm), controlled again by the Berendsen thermostat, and the Berendsen barostat 60 using a 1 ps time constant.Five further 100 ps equilibration steps were performed in an NPT ensemble, during which the 5 kcal mol -1 Å -2 harmonic positional restraints were decreased until no positional restraint was left on the system, and a final 100 ps simulation (again NPT ensemble) was performed with no positional restraints on the system.Finally, production runs of varying lengths (at least 0.5 μs per trajectory) were performed for each system, as summarized in Table S2.Production simulations were performed at constant temperature (300 K) and constant pressure (1 atm), coupled by the Berendsen thermostat with a 1 ps coupling time, and the Monte Carlo barostat with a 1 ps time constant [61][62] .Each system was simulated using between 3 and 9 different replicas (Table S2), obtained by assigning different initial random velocities to each replicate.All simulations were performed using a 2 fs time step, and snapshots were saved from the simulation for analysis every 10 ps.The SHAKE algorithm 63 was applied to constrain all bonds involving hydrogen atoms.An 8 Å cut-off was applied to all non-bonded interactions, with the long range electrostatic interactions being described using the particle mesh Ewald (PME) approach 64 .A detailed summary of all the MD simulations performed in this work can be found in Table S2.

PCA Analysis
PCA analysis was performed using the gmx covar and gmx anaeig tools from GROMACS v. 2018.1 [65][66][67] .To better quantify LacI-DNA recognition, we additionally calculated native contacts 68 , distances, and electrostatic interactions (approximated by the Debye-Hückel equation 69 ) between the binding domain and the DNA binding sites, calculated using PLUMED v2.4. 70

Native Contacts Analysis
Native contacts (Q) provide an excellent collective variable with which to define natively bound and unbound states of LacI; they have previously been effectively used in for example studies of peptide folding [71][72] , peptide binding [73][74] and of loop motion 75 .The native contacts were calculated using Eq. 1 68 : where X is a conformation along the reaction coordinate,  79 () is the distance between atoms i and j in conformation X,  79 = is the distance between atoms i and j in the reference conformation, S is the number of all pairs of heavy atoms (i, j), β is set as 5 Å -1 , and λ is chosen as 1.8, based on previous studies 68 .
The distance between LacI and DNA was, in turn, simply calculated based on the center of mass of all heavy atoms of the binding domain (Lys2-Ser61), and that of the central 20 base pairs (nucleotide ID 11-30 and 51-70), using Eq.2: Here, A and B are the two groups of atoms, RBA is the vector of the center of mass of groups A and B, mi and mj are the masses of atoms i and j belonging to groups A and B, respectively, while ri and rj are the coordinates.Finally, in the case of the electrostatic interaction energy between the binding domain of LacI and the DNA, this which was approximated by the Debye-Hückel (DH) equation 69 : where A (B) is the set of atoms of the first (second) group, i and j are the atom indexes in the two sets A and B, |rij| = |ri -rj| denotes the distance between atoms i and j, and κ is the usual Debye-Hückel parameter 69 .
All native contacts analyses on the LacI-DNA simulations were performed using PLUMED v2.4 70 , where the native contacts, Q, are defined by taking into account heavyatom contacts between LacI and DNA, with a cut-off of 4.5 Å from the crystal structure (PDB ID: 1EFA 10,27 ).

Secondary Structure Analysis
The GROMACS do_dssp interface to DSSP [76][77] was used to monitor the secondary structure changes of the hinge helices.In the crystal structure (PDB ID: 1EFA 10,27 ), there are 6 α-helical residues (Arg51-Leu56) for each monomer, and thus the hinge helicity is calculated based by number of residues in helix divided by the total number of residues, 12, in α-helical form, as in the starting structure.The angle between the two hinge helices is calculated from the dot product between two vectors along the two hinge helices, respectively.For each of the two vectors, it is a sum of the vectors  →  WWWWWWWWWWWW⃗ and  →  WWWWWWWWWWWWW⃗ of the 6 residues in α-helical form in each monomer.The angle calculated by Chimera v1.13.1 28 is 125.6° in the crystal structure (PDB ID: 1EFA 10,27 ), which is very close to what we obtain using the dot product (127.5°).

Analysis of DNA Bending
An algorithm proposed by Curuksu et al. 78  DNA is then calculated based on the dot product between the two "arm" vectors.

Scoring A-and B-form DNA.
In order to better quantitatively characterize the DNA conformations that may fall intermediate to the A-and B-forms of DNA, we introduce here a single scalar along the A-B continuum, the A-B index (ABI).A linear combination of four structural parameters (S) was used to construct the ABI, following previous studies 79 in which two structural parameters were used.The parameters used in this case are the backbone torsion (δ), the puckering torsions (ν1 and ν3) and the glycosidic torsion χ.The four parameters are then normalized relative to their differences between A-DNA and B-DNA (the values of the parameters in 'typical' A-DNA and B-DNA conformations of O SymL are shown in Table 1).Based on these values, the ABI of nucleotide x can be defined using: where the difference between the different parameters, ΔS (Δδ, Δν1, Δν3 and Δχ), is calculated by: Here, the S(A) and S(B) for the four parameters are listed in Table 1.Using this definition, the DNA is in an A-like conformation when the ABI value is close to 1.0, while it is a B-like conformation when the ABI value is close to 0.0.

Analyzing LacI Motion on the DNA Strand
To describe diffusive motion of LacI relative to the DNA binding sites, the distance between the center of mass of the binding domain of LacI and the DNA binding sites was projected onto the DNA axial vector, which is a sum of the vector  \ WWW⃗ on the central four base pairs (nucleotide IDs 20-21, 60-61).The projected distance can be both positive and negative, which indicates that LacI sits in one (5´-direction) or the other half (3´-direction) of the DNA binding sites.

Other Analysis
All other simulations were analyzed using CPPTRAJ 80 from AmberTools 16 54 , the Visual Molecular Dynamics (VMD) 81 package and the MDTraj 1.8.0 library 82 .All structure figures were prepared using Chimera 28 .

Probing the Intrinsic Flexibility of the Binding Domain in LacI
In order to obtain insight into the intrinsic flexibility of the DNA binding domain, we performed as our starting point 9 x 0.5 μs MD simulations of free LacI starting from the O SymL -bound structure (PDB ID: 1EFA 10,27 ) in the absence of DNA.Table S2 contains a summary of all simulations performed in this work.As shown in Figure S5A, the backbone RMSD of LacI from the reference structure ranges between 2 and 8 Å over these simulations.This is due to the flexibility of the binding domain; the backbone RMSD of LacI with the DNA-binding domains omitted is consistently 2-4 Å during the same simulations (Figure S5B).
Per-residue RMSF plots (Figure 3A) further show that DNA-binding domain flexibility dominates the motions of free LacI.The core domains are more stable, with the C-terminal helices showing intermediate flexibility.We note that experiments suggest that the hinge helices are unfolded when not in a specific LacI-DNA complex 12 .As can be seen in Figure 3B, we observe a general overall loss of helicity in the hinge helices (assessed via DSSP [76][77] ) during our simulations, in good agreement with these experimental observations 83 .In addition, complete loss of helicity was observed in 0.02% of simulation snapshots.Accompanying the overall loss of helicity, the angle between the two helices also varies substantially, ranging from 50° to 180° (Figure 3C), indicating similar changes in helical orientation.For reference, the angle between the hinge helices is 127.5° in the crystal complex of LacI-O SymL (PDB ID: 1EFA 10,27 ).
The observed changes in the hinge helices in the simulations of free LacI are of interest, as the helix-turn-helix (HTH) motif observed in LacI (Figure 1) is a common feature of many DNA-binding proteins [84][85] , and the formation of the hinge helices has been argued to be an important feature for DNA bending in various lac repressoroperator complexes 83,86 .It has also been argued that the protein-DNA interface of the non-specific complex is flexible and that this assists in rapid and efficient search for the target site on the DNA 12 .In addition, a recent study has shown that the LacI hinge domain independently binds DNA 87 , with the hinge region mediating both specific and non-specific binding 87 , with only small differences in affinity between the two.Following from this, computational studies have suggested that the hinge region plays an important contribution to the electrostatic energy of the LacI-DNA complex, the salt dependence of this electrostatic energy, and the number of salt ions excluded as the LacI-DNA complex is formed 48 , and that the binding of the hinge region of the lac repressor to DNA perturbs networks of correlated motions in the protein that in turn regulates its DNA binding affinity 88 .
Finally, we estimated prevalent binding-domain conformations in free LacI as follows.
We aligned all simulation snapshots to the backbone atoms of the core residues (62-332)   in the starting LacI crystal structure and calculated positional covariance of all backbone atoms in the binding domain (residues 2-61).We then projected state density onto the first two principal components of this covariance matrix to estimate the free energy landscape of the binding domain in a low-dimensional visualization (Figure 4).This analysis shows that the binding domain is indeed flexible, with a variety of conformations sampled during the simulation.In addition, in agreement with previous studies 87 , it appears that the DNA binding helices are capable of moving independently of each other, which likely affects the search process on the DNA.This is particularly interesting because the hinge sequence itself may also contribute substantially to the binding affinity for both specific and non-specific DNA, while also playing an important role in the transition from non-specific to specific binding [86][87] .

Figure 3. (A)
The root mean square fluctuations (RMSF, Å) of the Cα atoms of the free LacI dimer, calculated over 9 independent 0.5 μs molecular dynamics simulations (Table S2).The blue and red lines indicate the two monomeric subunits in the dimer, respectively.(B) Population distribution of the % helicity of the hinge helices, calculated based on the number of residues involved in forming an α-helix or other forms of helices (π-and 310-helix), over 9 independent 0.5 μs molecular dynamics simulations of free LacI.A helicity of 100% is equivalent to 2 × 6 residues (Arg51-Leu56) in helical form, as in the crystal structure of LacI in complex with O SymL (PDB ID: 1EFA 10,27 ).The secondary structure composition was calculated using DSSP [76][77] .(C) The corresponding population distribution of the angle between the hinge helices during the same simulations.The process by which the angle was calculated is described in the Methodology section.

Simulations of Free DNA in the Absence of LacI
A second natural point to consider is the extent to which the intrinsic behavior of the DNA itself determines differences in LacI-DNA recognition.We therefore performed an additional 9 x 0.5 μs MD simulations for each of three DNA sequences considered in this work (summarized in Table S2).In these simulations, the differences between the average behaviors of the three DNA sequences are minor (analyzed via Curves+ 89 , shown in Figures S6 -S15, and Tables S4 -S13), and therefore unlikely to determine the observed differences in specificity.We note that in these simulations, the terminal five base pairs on each end of the DNA strand were not analyzed in order to avoid end-effects such as fraying.
The bent DNA structures observed in LacI-DNA complexes (PDB ID: 1EFA 10,27 ), however, suggest that formation of a bent DNA conformation may be related to LacI discrimination between operator sequences.We first analyze this bending propensity as a static property and then as a dynamic one.Comparison of our simulations of free DNA with the LacI-O SymL crystal structure shows that the DNA sequence stays in a bent Aform when it is bound to LacI and in a straight B-form in the absence of protein (Figures S4 and S16).These bent A-like DNA conformations have been previously suggested to be important to protein-DNA recognition 11,[90][91][92][93][94][95] .We therefore devised a metric for scoring DNA structures along a continuum between A-form and B-form DNA (see the Methodology section), which we term the A-B index (ABI).The ABI of every nucleotide in the specific and non-specific complexes is shown in Figures 5 (data from our simulations) and S16 (data from experimental structures).It can be seen that in the specific complex (PDB ID: 1EFA 10,27 ), the DNA is largely in an A-like conformation with ABI values >0.6, whereas in the non-specific complex (PDB ID: 1OSL 12,27 ), the DNA is largely in a B-like conformation, with ABI values <0.1 (Figure S16).For comparison, the free-DNA simulations show that these DNA sequences are more likely to stay in the B-form (a probability of ~10% for A-form DNA in the case of the DNA containing the O SymL operator sequence), with some sequence-dependent differences 20 between the three sequences (Figure 5).This is, in turn, in good agreement with the observations of ref. 96 .Surprisingly, when the operator sequence was radically mutated (DNA mut1 and DNA mut2 ), the ABI scores for the central 20 nucleotides of the operator where LacI binds showed large changes, while the terminal 20 nucleotides were relatively unchanged.Thus, the mutations do appear to influence both the motion and flexibility of the DNA binding sites.Fraying of the terminal base pairs was observed in all the simulations, as shown in

Figure S17
. We note here as an aside that it has been shown experimentally that DNA conformation is greatly sensitive to hydration (see e.g.ref. 97 and references cited therein), and analogously, it has been shown that with current DNA/RNA force fields, A-form DNA is not stable in water, even with the newly developed Parmbsc1 force field 56 (although it can be maintained for several hundred nanoseconds in organic solvent 56 ).
In order to analyze the flexibility of the unbound DNA sequences, we have examined the RMSF of all heavy atoms of each nucleotide of the three DNA sequences (Figure 6).
We note that while there are some minor differences between the sequences, the DNA containing the O SymL sequences is intermediate in flexibility to DNA mut1 and DNA mut2 and there appears to be no trend between flexibility and selectivity.We note that previous studies [98][99][100] have argued that DNA flexibility is quite sequence dependent, and suggested that increased flexibility can assist DNA recognition by DNA binding proteins.Our data, however, suggest only minor differences between the three DNA sequences (Figure 6) on the timescales sampled in our simulations.

Modeling Specific and Non-Specific LacI-DNA Interactions on Straight DNA
Having confirmed that the binding domain of LacI is indeed highly flexible and can take on multiple configurations even in the absence of DNA, we next tested if we can observe 'natural' clamping of the LacI binding domain onto straight DNA using unrestrained conventional molecular dynamics simulations.We therefore constructed and assessed simulations of 3 random conformations from simulations of free LacI positioned within a feasible binding range of 3 random snapshots each from simulations of specific and non-specific DNA sequences.We created an additional three complexes of the LacI crystal structure (PDB ID: 1EFA 10,27 ) positioned within binding range of 3 random snapshots from simulations of specific and non-specific DNA sequences, leading to six starting complexes per DNA sequence simulated (see Tables S2 and S3).
We simulated three DNA sequences: O SymL and two mutated DNA sequences, DNA mut1 and DNA mut2 , which contain either a more drastic mutation of every A→C and T→G (DNA mut1 ) or a more conservative mutation of every A→G and T→C (DNA mut2 ).
These substantial changes were selected because differences in binding affinity between specific and non-specific DNA sequences are very small [35][36][37][38][39] , so substantial structural perturbations improve our ability to detect selectivity between different DNA sequences.
Starting complexes were generated by taking the conformational snapshots mentioned above and bringing them together at various distances, 18.0±1.6Åfrom the DNA, defined based on distance between center of mass of the binding domain and the DNA, as described in the Methodology section (for comparison, the corresponding distance in the starting crystal structure is 10.6 Å, so this corresponds to displacing the protein ~7.4 Å from the DNA).For examples of such complexes, see Figure 7.These complexes were then used as starting points for subsequent molecular dynamics simulations.The high shape complementarity between the LacI dimer and the bent DNA structure in the specific complex (PDB ID: 1EFA 10,27 ), made it impossible to directly juxtapose LacI and straight DNA with the LacI binding clamps pointing towards the DNA, due to steric clashes formed in particular between the DNA and the side chains Tyr7, Tyr17 and His 29.This affected also the choice of starting complex, as we needed to start from complexes with reasonable shape complementarity (considering also different conformations of the binding clamps).S3.
From these simulations, we make two key observations: (1) even when LacI starts as close as geometrically possible to straight DNA), we do not observe clamping of the binding helices onto DNA, presumably due to the aforementioned steric clash between LacI and the straight DNA.Native-contacts analysis confirmed this, showing that the fraction of native contacts in the loose complexes ranged from between 0.0 to 0.4 (on a range from 0 to 1, Figure 8) for all systems, relative to the native contacts formed in the initial crystal structure.(2) When the DNA is not bent, we observe what appears to be a natural searching motion along the DNA of 1.3 ± 1.2, 1.2 ± 0.9 and 1.1 ± 0.8 base pairs for the O SymL , DNA mut1 and DNA mut2 systems, respectively, during 6 x 1 μs simulations per system (a reference base pair distance of 3.4 Å was used, Figure 9).Taken together, these data strongly suggest that while the search process itself likely occurs on straight DNA, the bending of the DNA itself is an important part of forming a tight-bound specific (and possibly also non-specific) LacI-DNA complex.each LacI-DNA complex, while the data shown here for simulations of LacI in complex with bent DNA is based on 3 x 1.5 μs simulations of each LacI-DNA complex (see Table S2).
Figure 9.The distributions of the projected distance between the center of mass of the binding domain and the DNA binding site along the DNA axial vector, calculated from simulations of LacI in a loose encounter complex with straight DNA, as described in the Methodology section.The data shown here is based on 6 × 1.0 μs simulations of each LacI-DNA complex (see Table S2).

Modeling Specific and Non-Specific LacI-DNA Interactions on Bent DNA
As our simulations do not show tight binding between LacI and straight DNA on the timescales sampled, we also performed 3 x 1.5 μs MD simulations per sequence of complexes between LacI and bent DNA, simulating both a specific complex of LacI and O SymL and non-specific complexes between LacI and DNA mut1 and DNA mut2 .These were again generated by simply mutating the specific DNA sequence, as described in the Methodology section.As can be seen in Figure 8, the fraction of native contacts during the simulations of LacI in complex with O SymL are almost all higher than 0.9 for all the simulation time, which is quite close to the natively bound state.However, the highly mutated complex (DNA mut1 ) has fractions of native contacts less than 0.9 for 4.4 μs out of the total 4.5 μs of simulation time, and the protein starts dissociating from the DNA.In contrast, the simulations with the sequence with the less radical perturbation, DNA mut2 , samples fewer conformations with Q values of less than 0.9 than the simulations with DNA mut1 .In addition, as seen in Figure S18 a small dissociation tendency was observed in the first 0.3 μs of simulations for all three replicas of LacI-DNA complexes with the DNA mut1 sequences, while the native contacts were well-maintained in two out of three replicas for DNA mut2 , with a radical drop in native contacts in the third replica.In all cases, the overall bent shape of the DNA is maintained throughout our simulations, although the hinge helices lose some of their secondary structure in simulations of LacI in complex with DNA mut1 and DNA mut2 (Figure S19).These data suggest that LacI spontaneously loosens contact with non-cognate DNA sequences on a submicrosecond timescale.Although it does not fully transition to the "encounter complex" conformation, this loosening does demonstrate DNA sequence-specificity and suggest that simulated koff values will be highly sequence-specific.
The changes in native contacts are coupled with associated changes in energetics, with the average total electrostatic energies for the interaction between the binding domain of LacI and DNA (evaluated using the Debye-Hückel equation as described in the Methodology section) ranging from -11.0±0.7 kcal mol -1 , -10.8±2.0 kcal mol -1 and -10.7±2.3 kcal mol -1 for simulations of the LacI in complex with straight O SymL , DNA mut1 and DNA mut2 , and -17.8±1.1 kcal mol -1 , -16.7±1.1 kcal mol -1 and -17.4±1.1 kcal mol -1 for simulations of the LacI in complex with bent O SymL , DNA mut1 and DNA mut2 .Therefore, the overall average electrostatic energies become more favourable when LacI is in complex with bent DNA compared to straight DNA, and there are some small sequence specific differences in these values (although within standard deviation), in agreement with the differences in degree of native contacts maintained over the different simulations.
Since our simulations suggest that forming stable LacI-DNA complexes requires ) and distributions for the three DNA sequences, with results that agree very well with the studies of Xiao et al. 100 and Curuksu et al. 78 .When the DNA is complexed with LacI, however, the bending angle goes to as high as ~70° (Figure 10D -F).We also sampled some conformations of the DNA (~0.1%)at high bending angles (>70°) in the simulations of free DNA, but they are not stable, as shown in Curuksu et al. 78 , which could also be a reason that we did not sample enough binding contacts during the simulations of LacI on straight DNA.Finally, as can be seen from this figure, the bent shape of the DNA is maintained slightly better in the O SymL Lac-DNA complex than in the two mutated complexes, based on average bending angles of 71.8±14.4°,68.2±13.0°and 59.2±13.5°for O SymL , DNA mut1 and DNA mut2 , respectively.These angles are somewhat different from some experimental reports, but the methods of computing angles in those reports were either not reported or differ from our method, so the values are not directly comparable, and if one normalizes to the value calculated for PDB ID: 1L1M 11,27 , the relative changes are similar 11 .
As a final piece of secondary structure analysis, we also calculated the helicity of the hinge helices and the angle between the two hinge helices.As can be seen from Figure S19, the loss of the hinge helicity in all three complexed systems is smaller than in free LacI, as presented in Figure 3B, and the hinge helicity was better preserved in the LacI-O SymL system than in the two mutated complexes, as shown in the panels A-C of Figure The bending angle was calculated as described in the Methodology section.
Following from this, we also created dynamic cross-correlation maps 101    Finally, we have analyzed differences in hydrogen bonding interactions between the binding domain of LacI and the DNA, using a donor-acceptor distance cut-off of 3.0 Å and a donor-hydrogen-acceptor angle cut-off of 135°.Hydrogen bonds that occurred in less than 30% of the simulation frames were discarded from further analysis.The position of the key residues forming strong hydrogen bonds during our simulations of LacI in complex with O SymL (specifically, Leu6, Ser16, Thr19, Thr34, and Tyr47), are shown in Figure 12.Based on this analysis, the average total number of hydrogen bonds over our simulations drops from 28.2±3.1 in the case of the complex with O SymL , to 25.8±3.2 in the case of the complex with DNA mut1 and 26.1±3.3 in the case of the complex with DNA mut2 .6][37][38][39] Here, the consistent loss or gain of up to two hydrogen bonds as an average property over the simulation time can be sufficient to account for these differences.

Conclusions
Despite extensive prior studies 5, 9, 14, 24, 29, 32, 34, 41-42, 46-50, 73, 87, 103-105 , the precise molecular determinants of LacI binding selectivity, as well as the kinetic factors that allow LacI to rapidly search for cognate DNA sequences, remain incompletely understood.In the present work, we have studied the binding of the LacI dimer to the O SymL operator sequence, as well as two heavily mutated DNA sequences (DNA mut1 and DNA mut2 ).Our data yields key differences in the conformational equilibria of LacI that may help explain these properties.Specifically, we observe that the DNA-binding domains of LacI are substantially more conformationally flexible in the absence than in the presence of DNA and that these domains are also oriented differently with respect to each other.The precise functional consequence of these changes for LacI-DNA association kinetics, however, remains unknown.In addition, it appears that unbound DNA is predominantly in a straight B-form conformation, while LacI binding is only stable to bent A-form DNA in our simulations (in good agreement with available LacI-DNA structural data).Two alternate hypotheses that explain this finding are that spontaneous conformational fluctuations of DNA sequences are required for LacI binding, or that some (as yet unidentified) external factor drives this change.However, our simulations suggest that such a conformational change is indeed required for DNA binding.
Our simulations also suggest differences in the conformational equilibria and binding stability of LacI to bent DNA of either the O SymL or the mutated sequences that may account for the ex.The experimentally observed selective binding towards O SymL .We note that the selectivity of DNA binding proteins for specific vs. non-specific binding is on the order of 2-3 kcal mol -1 [35][36][37][38][39] , although a larger number in the range of 7 kcal mol -1 is expected for the LacI dimer [33][34] .Our simulations suggest that the LacI-O SymL complex has on average 1-2 more hydrogen bonds between LacI and the DNA on our simulation timescales, which would be sufficient to explain even the up to ~7 kcal mol -1 differences in binding selectivity towards different operator sequences vs. non-specific binding 23,[33][34] .
Furthermore, LacI shows substantially different conformational equilibria when bound to the O SymL sequence than in either the DNA free form, or in complex with either DNA mut1 or DNA mut2 (which in turn resemble free LacI rather than LacI bound to an operator sequence).This suggests that the difference in conformational equilibria observed (quantified as correlated positional displacement across residues) can at least report on selective binding and potentially illuminate to the mechanistic changes involved in selectivity.
In summary, this study thus yields the following hypotheses regarding LacI binding: binding occurs primarily to bent A-form DNA, involves a loss of LacI conformational entropy, and increases anti-correlated conformational deviations across the protein, reflecting a concerted change rather than several independent ones.Simultaneously, binding to the O SymL operator sequence involves a slightly larger number of stabilizing DNA-protein hydrogen bonds compared to binding to DNA mut1 or DNA mut2 , that may account for the experimentally observed specificity for this operator sequence.These concrete hypotheses are experimentally testable, and may help provoke further simulation and experimental studies of the binding kinetics of LacI to partner DNA, a process which still remains out of the reach of unbiased atomistic molecular dynamics simulation.

Associated Content
Additional structural and dynamical analysis of our simulations is provided as Supporting Information.

Figure 2 .
Figure 2. The starting structure of the LacI-O SymL complex used in our simulations, based on PDB structure 1EFA 10, 27 , with the DNA strand extended as described in the Methodology section.LacI is shown in grey, with the hinge helices highlighted in red.The 20 base pair O SymL sequence is shown in blue, and the extended base pairs (see the Methodology section) are shown in tan.This figure was made with Chimera v1.13.1 28 .

a
to describe A-DNA and B-DNA in our A-B Transition Index (ABI) calculations.a These values are average values over all 80 nucleotides based on the standard A-DNA and B-DNA forms of the O SymL operator sequence, as generated by the 3DNA web server 52 .For the corresponding pernucleotide values, see Figure S3, and for representative DNA conformations see Figure S4.

Figure 4 .
Figure 4. Visualization of binding-domain conformational equilibria via projection onto the first two principal components of the binding-domain position.Representative snapshots corresponding to the minima from the estimated free energy landscape, numbered sequentially from highest to lowest occupancy, are also shown here in red to illustrate the different conformations taken by the binding domains during our simulations, while the crystal conformation of LacI (from PDB ID: 1EFA 10, 27 ) is highlighted on the plot as a red star for comparison.

Figure 5 .
Figure 5. Fraction of time each DNA base pair spends in the A-form.The percentage time each base pair in simulated DNA containing the O SymL sequence, DNA mut1 , or DNA mut2 spent in A-DNA conformations was calculated based on 9 x 0.5 μs simulations of free DNA for each system using the classification scheme described in the Methodology.Plotted in the top panels are averages for each nucleotide across simulations of the DNA containing the O SymL sequence, whereas plotted in the bottom panels are differences between the DNA containing the O SymL sequence and DNA mut1 and DNA mut2 , respectively.

Figure 6 .
Figure 6.The root mean square fluctuations (RMSF, Å) of all heavy atoms in each nucleotide during the same simulations of free DNA containing the O SymL (red), DNA mut1 (green) and DNA mut2 (blue) operator sequences, respectively.

Figure 7 .
Figure 7. Structures of six juxtaposed LacI-DNA "encounter complexes" used as starting conformations for MD simulations of LacI on straight DNA.The central 20 base pairs (corresponding to O SymL , DNA mut1 and DNA mut2 ) are shown in blue, and the hinge helices of LacI are shown in red.Structures are presented in the same order as in TableS3.

Figure 8 .
Figure 8.The distributions of native contacts during simulations of (A-C) LacI in a loose encounter complex with straight DNA and (D-F) LacI in a tight complex with bent DNA, for the DNA sequences studied in this work.Starting structures were generated as described in the Methodology section.The data shown here for the loose encounter complexes with straight DNA is based on 6 × 1.0 μs simulations of highly bent DNA conformations, we tested the propensity of different DNA sequences to bend in such a manner starting from (A) free B-form DNA, (B) loose complexes between LacI and B-form DNA, and (C) LacI-A-form DNA tight complexes.In the case of the free DNA (Figures10A -C), we obtain similar average values (22.9±12.1°for O sym , 23.1±12.1°for DNA mut1 and 23.1±12.2°for DNA mut2

Figure 10 .
Figure 10.The population of the bending angle obtained from (A-C) 9 × 0.5 μs simulations of free O SymL , DNA mut1 and DNA mut2 and (D-F) 3 x 1.5 μs simulations of LacI in complex with the same DNA sequences.
(DCCM) to examine correlated motions both within the two individual LacI monomers and between the two monomers.These data are shown in Figure 11, which shows correlated motions in (A) free LacI (with no DNA bound) and LacI in complex with DNA containing the (B) O SymL , (C) DNA mut1 and (D) DNA mut2 sequences.The bottom left and top right squares of each DCCM plot show correlated motions within each individual monomer of the dimer, and the top left and bottom right squares of each DCCM plot show correlated motionsbetween the two monomers of the dimer.From these data, it can be seen that LacI behaves similarly in the absence of DNA and when complexed with DNA mut1 or DNA mut2 , with little presence of either correlated or anti-correlated motions.In contrast, these motions become highly correlated (or anti-correlated) when LacI is in complex with DNA containing the O SymL operator sequence.This suggests sequence-specific changes in LacI dynamics88 which could (at least partially) account for the differences in DNA recognition by LacI.

Figure 11 .
Figure 11.Calculated dynamic cross correlation maps 101 (DCCM) for (A) free LacI (without DNA bound), as well as LacI in complex with bent DNA containing the (B) O SymL , (C) DNA mut1 or (D) DNA mut2 sequences.The DCCM plots were calculated with Bio3D 102 by considering only Cα atoms based on the

Figure 12 .
Figure 12.The key residues of the binding domain of LacI that form hydrogen bonds with the O SymL operator sequence, based on PDB ID: 1EFA 10, 27 .The operator sequence was extended to a 40 base pair DNA sequence as described in the Methodology section.The original O SymL operator sequence is shown in blue and the extension of the DNA is shown in tan.
was used in this study in order to characterize the degree of DNA bending.The orientation of one nucleotide can be The "arm" vector at each end of the DNA is the sum of the four  \ WWW⃗ on the two base pairs at the two ends of the DNA binding sites (nucleotide ID: 9-10 and 71-72 at one end, and 31-32 and 49-50 at the other end), referred to as  & WWWW⃗ and  \ WWWW⃗ .The bending angle of the