Determining Sequence-Dependent DNA Oligonucleotide Hybridization and Dehybridization Mechanisms Using Coarse-Grained Molecular Simulation, Markov State Models, and Infrared Spectroscopy

A robust understanding of the sequence-dependent thermodynamics of DNA hybridization has enabled rapid advances in DNA nanotechnology. A fundamental understanding of the sequence-dependent kinetics and mechanisms of hybridization and dehybridization remains comparatively underdeveloped. In this work, we establish new understanding of the sequence-dependent hybridization/dehybridization kinetics and mechanism within a family of self-complementary pairs of 10-mer DNA oligomers by integrating coarse-grained molecular simulation, machine learning of the slow dynamical modes, data-driven inference of long-time kinetic models, and experimental temperature-jump infrared spectroscopy. For a repetitive ATATATATAT sequence, we resolve a rugged dynamical landscape comprising multiple metastable states, numerous competing hybridization/dehybridization pathways, and a spectrum of dynamical relaxations. Introduction of a G:C pair at the terminus (GATATATATC) or center (ATATGCATAT) of the sequence reduces the ruggedness of the dynamics landscape by eliminating a number of metastable states and reducing the number of competing dynamical pathways. Only by introducing a G:C pair midway between the terminus and the center to maximally disrupt the repetitive nature of the sequence (ATGATATCAT) do we recover a canonical “all-or-nothing” two-state model of hybridization/dehybridization with no intermediate metastable states. Our results establish new understanding of the dynamical richness of sequence-dependent kinetics and mechanisms of DNA hybridization/dehybridization by furnishing quantitative and predictive kinetic models of the dynamical transition network between metastable states, present a molecular basis with which to understand experimental temperature jump data, and furnish foundational design rules by which to rationally engineer the kinetics and pathways of DNA association and dissociation for DNA nanotechnology applications.


INTRODUCTION
Over the last couple of decades, DNA has proven to be much more than a vessel for genetic information. From sensing to computing to directed self-assembly, the programmable and predictable nature of DNA has unlocked numerous unforeseen nanotechnology applications. 1−4 Recently, single molecule localization techniques have exploited the rapid and transient binding of short DNA oligomers in order to achieve superresolution microscopy and optical multiplexing. 5−7 Predictive understanding of the sequence-dependent thermodynamics of DNA hybridization/dehybridizationthe assembly/disassembly of a DNA duplex from two single strandshas underpinned the rational design of DNA oligomer sequences for nanotechnology applications, where sequence-dependent nearest-neighbor models can accurately account for mismatched pairs, dangling ends, and other non-native bonding effects. 8,9 Secondary DNA structures such as hairpins and G-quadruplexes have also been studied in depth and leveraged for nanotechnology applications. 10−12 Predictive models of the dynamical, as opposed to purely thermodynamical, behaviors of DNA have become increasingly important in developing technologies such as DNA-PAINT (DNA Points Accumulation for Imaging in Nanoscale Topography), but these technologies have outpaced our fundamental understanding of the dynamics themselves. 13−15 Many experimental and computational studies have investigated DNA dynamical phenomena over picosecond to millisecond time scales. 16−20 Kinetic models have been developed for particular DNA processes such as toehold exchanges and optical barcoding 21,22 and supervised machine learning techniques have been combined with experimental measurements to predict the on/off rates as a function of sequence. 6,23,24 A comprehensive understanding of the full dynamical landscape of hybridization/dehybridization accounting for the sequence-dependent metastable states and association/dissociation pathways remains lacking and fundamental questions remain unresolved. For example, it remains unclear the extent to which hybridization of short DNA oligomers largely proceeds in a conventionally assumed "all-or-nothing" fashion or if long-lived metastable states facilitate the transition. 17,19,25−28 Out-ofregister "shifted" base-paired structures 17,18,25,29−31 and frayed structures 32−35 stand as candidates for metastable states with the potential to mediate substantial deviations from all-ornothing behavior, but the degree to which these states are kinetically relevant is difficult to determine experimentally and is likely to be highly sequence-dependent. The development of predictive models and design rules with which to engineer DNA strands with tailored hybridization/dehybridization kinetics and pathways is vital to advancing rational design of DNA strands for nanotechnology applications and is also of importance in understanding fundamental biological processes such as transcription and gene regulation.
Our understanding of hybridization dynamics is built upon decades of experimentssuch as temperature-jump, salt-jump, pH-jump, and other perturbative methodsthat rapidly stimulate DNA and monitor relaxation to a new equilibrium. 28,36−42 More recently, single-molecule diffusion and tethered multi-fluorophore assays have facilitated analyses under equilibrium conditions, but these results can be hampered by slow data collection rates and fluorescent tags effects on strand dynamics, particularly for shorter oligomers. 23,43−45 A number of computational modeling approaches have also been employed to provide molecularlevel resolution of hybridization. Simplified lattice models can recapitulate the essential aspects of the hybridization pathways but lack the realism of continuous space representations. 25,46 The long time scales associated with hybridization/dehybridization events place them outside the reach of unbiased allatom molecular dynamics simulations, 18 but they can be observed by employing enhanced sampling techniques 18,30,47−53 or by using elevated temperature or denaturing solvent concentrations to induce one-way dissociation events. 54,55 The effect of the applied bias upon the thermodynamics can be rigorously corrected for using standard reweighting techniques. 56−59 Rigorous elimination of the bias in the kinetics is critical for the construction of robust and reliable kinetic models that reflect the true system dynamics and are uncontaminated by any residual effects of the biasing potentials used to induce good sampling. A number of approaches to correct the kinetics are also available, including Girsanov reweighting, transition-based reweighting analysis (TRAM), dynamic histogram analysis method (DHAM), and their derivatives. 60−66 The application of these methods under the conditions of high bias necessary for good sampling can, however, present challenges for numerical convergence. A number of coarse-grained DNA force fields have been developed that enable direct observation of these events over microsecond time scales via unbiased coarse-grained molecular dynamics simulations, 29,30,49,67,68 which, up to an acceleration factor associated with the smoothing of the underlying free energy landscape inherent to the coarse-graining procedure, can preserve a faithful model of the unbiased dynamics and associated pathways. These models have previously been used to study biological phenomena such as nucleosome dynamics 69,70 and transcription factor binding 71,72 as well as nanotechnology applications such as strand displacement 73,74 and DNA origami. 75,76 In this study, we choose to employ a coarse-grained model for DNA 49 that is sufficiently inexpensive to enable the collection of sufficient volumes of unbiased simulation trajectories and adequately sample configurational space that we do not need to appeal to biasing strategies to enhance convergence nor apply any post hoc corrections to the thermodynamics or kinetics.
In this work, we study a family of self-complementary pairs of 10-mer DNA oligomers using coarse-grained molecular simulation, machine learning of the slow dynamical modes, and data-driven inference of long-time kinetic models to establish new understanding of the influence of sequence upon hybridization/dehybridization kinetics and mechanisms. This family5′-ATATATATAT-3′ (AT-all), 5′-GATATATATC-3′ (GC-end), 5′-ATATGCATAT-3′ (GC-core), and 5′-ATGATATCAT-3′ (GC-mix)was designed to probe the influence of the placement of two G:C base pairs within an otherwise repetitive A:T sequence and has been the subject of our prior experimental investigations using temperature-jump infrared spectroscopy and simple lattice models. 19 We validate the new computational models of hybridization/dehybridization dynamics developed in this work against new experimental data and reinterpret our prior experimental observations in light of the new computational understanding. Consistent with previous studies, 18,25,30 we find the degree of repetitiveness in the sequenceand therefore the kinetic accessibility and thermodynamic stability of out-of-register shifted statesleads to richer dynamics populated by a diversity of long-lived metastable states. Our data-driven modeling and analysis rigorously quantifies these behaviors and furnishes accurate predictive models of the hybridization/dehybridization rates, dynamical pathways, and metastable states. Specifically, we demonstrate that disrupting repetitive stretches of A:T bases by placement of interrupting G:C base pairs enables us to tune the landscape from rich six-state to simple two-state "all-ornothing" behavior, and the specific location of the interrupting pair can be used to modulate the stability of long-lived frayed states. Taken together, our analyses establish new molecularlevel understanding of the sequence-dependent kinetics and pathways through quantitative predictive models for the longtime system dynamics, resolution of the dynamical folding pathways and metastable states, and elementary design rules with which to control the dynamical behaviors of the system. We anticipate that this new foundational understanding, and the extension of our approach to more extensive families of DNA sequences, can guide the rational design of DNA oligomers with tailored kinetic properties engineered for DNA nanotechnology applications such as DNA-PAINT. 6 We performed molecular dynamics simulations of four 10-base self-complementary double-stranded DNA sequences that we have previously studied by temperature-jump infrared spectroscopy: 19 5′-ATATATATAT-3′ (AT-all), 5′-GATATATATC-3′ (GC-end), 5′-ATATGCATAT-3′ (GC-core), and 5′-ATGATATCAT-3′ (GC-mix). We modeled the DNA sequences using the coarse-grained 3-Site-Per-Nucleotide v2 (3SPN.2) model that uses three spherical beads to represent the phosphate, deoxyribose sugar, and nitrogenous base of each nucleotide and employs anisotropic interaction potentials to accurately treat intrastrand base-stacking, interstrand cross-stacking, and base pairing. 49 The model was parametrized against experimental data on bond lengths, bend angles, torsional angles, base step energies, and base stacking free energies, and reliably reproduces the structure, melting temperatures, persistence lengths, and sequence, salt, concentration, and temperature effects on duplex formation. 49 The model enables access to millisecond time scales and has been has widely adopted to study numerous phenomena including DNA packing in viral capsids, protein−DNA binding, and nucleosome unwrapping. 69,77,78 Although the 3SPN.2 model was not directly parametrized against dynamical experimental data, we will show below that the predicted sequence-dependent kinetics and relaxations are, within a corrective scaling factor, in good agreement with observed experimental trends.
All calculations were performed using the LAMMPS simulation package (http://lammps.sandia.gov) in accordance with best practices for the 3SPN.2 model. 79 A single pair of self-complementary sequences were placed in a cubic periodic box with side length 7.8 nm corresponding to a single-strand concentration of 7 mM. This concentration is only 3.5× larger than the 2 mM concentration employed in our experimental analyses (section 2.2). Solvent effects were modeled implicitly by employing Langevin dynamics 80,81 with an experimentally motivated per-site friction coefficient of 9.94 × 10 −11 m 2 /s. 49,82 We specified a 240 mM implicit salt concentration and treated electrostatic interactions using the Debye−Huckel with a 5 nm cutoff radius. 83 Simulations were performed in the NVT ensemble employing a Langevin thermostat. 84 With the exception of the simulation data reported in section 3.1 where we draw comparisons against experimental data collected over a range of temperatures, each sequence was simulated at its respective melting temperature rat 7 mM concentration as dictated by the 3SPN.2 modelAT-all: 309 K, GC-end: 317 K, GC-core: 324 K, GC-mix: 324 Kin order to maximize the number of spontaneous transitions between dissociated and hybridized states. Melting temperatures for each sequence were determined empirically by comparing the ratio of hybridized to dissociated populations over a 10 K temperature ramp centered on the nearest neighbor (NN) model predicted melting temperature, 8,9 and selecting the temperature at which the hybridized and dissociated populations were approximately equal. The Langevin equations of motion were integrated using the scheme of Bussi and Parrinello 81 with a 20 fs integration time step. We performed 40 independent simulations for each of the four sequences with half of the runs initialized from the hybridized state and half from the dissociated state. The initial hybridized state was defined based on the crystal structure coordinates of Arnott et al. 85 The dissociated state was generated from the hybridized state by displacing one strand away from the other by 1 nm in each of the x, y, and z directions. The magnitude of this displacement did not affect the results so long as all native Watson−Crick (WC) bonds were completely broken. Initial bead velocities were assigned from a Maxwell−Boltzmann distribution at the temperature of interest. Each simulation was conducted for 26 μs and frames saved to disc every 100 ps. Each simulation required ∼24 CPU-hours on 28×Intel E5-2680v4 CPU cores. The first 1 μs of each run was discarded for equilibration providing us with 40 × 25 μs = 1 ms of simulation data for each sequence, during which time we observed 55−100 hybridization/dehybridization events.
2.1.2. Markov State Model Construction. Markov state models (MSMs) are a powerful approach to infer long-time kinetic models from short molecular simulation trajectories 86−90 that we employ in this work to construct high-resolution sequence-dependent kinetic models of DNA hybridization and dissociation. In brief, a MSM extracts from simulation trajectories an ensemble of long-lived metastable macrostates, their equilibrium occupancy populations, and the equilibrium probability fluxes between them. In this manner, they provide an interpretable and predictive model of the system thermodynamics and kinetics. MSMs have recently been implemented to study the hybridization mechanism of one particular 14-mer DNA oligomer, but determining the sequence-dependent kinetics and mechanisms was not the focus of this study. 91 An energy disconnectivity graph-based approach was used to interrogate the differences in hybridization rates and mechanisms between GGGGGG and GCGCGC hexamers to reveal strong deviations from "all-or-nothing" behaviors and the importance of zippering and out-of-register diffusion mechanisms. 17 Kinetic models were constructed not from the dynamical trajectories of the molecular model, but by estimating rate constants between local minima using a transition state theory approximation. A recent application of MSMs to the long-time dynamics of short RNA oligonucleotides revealed stacking time scales to be highly sequence dependent. 92 In this work, MSMs were constructed for each of the four DNA sequences at their respective 3SPN.2 melting temperatures at 7 mM concentration from the 40 × 25 μs simulation trajectories following a four-step protocol detailed in ref 93: (i) trajectory featurization, (ii) dimensionality reduction, (iii) microstate clustering and microstate transition matrix inference, and (iv) macrostate clustering and macrostate transition matrix inference. Calculations were performed using the PyEMMA software package. 94 Featurization. Trajectories comprising the Cartesian coordinates of the DNA strands as a function of time were featurized using the MDTraj Python libraries 95 to represent the system in a manner that exposes the essential system dynamics but eliminates trivial translation and rotational invariances. We adopt intermolecular pairwise distances d(i,j) between the centers of mass of the 10 bases as a natural rototranslationally invariant featurization that represents each system configuration as the 10 × 10 = 100-element vector of interstrand pairwise distances. One additional symmetry arises from the self-complementary nature of these sequencesthe sense and antisense strands in each pair are identicalsuch that the representation of the system under our featurization should remain unchanged upon inverting the arbitrary labeling of strand "1" and strand "2". 86 The 100-element pairwise distance vector is not invariant to this permutation, but can easily be made so via a simple symmetrization operation in which each of the ( ) 10 2 = 45 intermolecular pairwise distances are replaced by the mean of the two permutationally invariant distances. Specifically, (d(i 1 ,j 2 ) = d(i 2 ,j 1 )) ← 0.5(d(i 1 ,j 2 ) + d(i 2 ,j 1 )), where i 1 denotes the ith base on strand 1 and j 2 the jth base on strand 2. 86 Finally, we took the reciprocal of the permutationally symmetrized pairwise distances to provide higher resolution and differentiation between proximate strand configurations in the near hybridized state compared to distantly separated dissociated strands. VAMP-2 scoringcalculation of the sum of the squared estimated eigenvalues of the transfer operatorof trajectories under a particular featurization provides a measure of the kinetic variance carried by that featurization. 96−99 Performing VAMP-2 scoring at a lag time of τ = 1.2 ns and retaining the top five modes, reveals that the reciprocal permutationally symmetrized pairwise distances can carry up to twice the kinetic variance as the nonreciprocal distances, suggesting that the higher resolution offered at close intermolecular distances can indeed boost the dynamical representational power of the model. Somewhat surprisingly, we observed that augmenting our set of intermolecular distances between bases with intramolecular distances between bases on the same strand led to no improvement of the VAMP-2 score. This indicates that the kinetically relevant conformational state of the two strands are adequately represented via the intermolecular distances and leading us to employ only intermolecular distances within our featurization.
Dimensionality Reduction. The featurized trajectories were then projected into a low-dimensional space in preparation for microstate clustering. The standard approach to doing so is to employ timelagged independent components analysis (tICA) to learn a linear projection into a low-dimensional embedding that maximally preserves the kinetic variance in the data. 97,100,101 In this work, we instead employ state-free reversible VAMPnets (SRVs) that can be conceived of as a nonlinear version of tICA. 102 SRVs employ neural networks to learn flexible nonlinear functions of the trajectory featurization that better approximate the slow dynamical modes of the system and have been shown to produce substantially higher resolution MSMs than those developed using tICA. 93,102 SRV modes were learned independently for each system to best approximate the slow collective modes for that particular DNA sequence. SRVs were trained using the SRV package we previously developed (https://github.com/hsidky/srv) employing the default network architecture of two hidden layers each comprising 100 neurons and tanh activation functions, a learning rate of 0.001, and a batch size of 50 000. We adopted a lag time of τ = 1.2 ns as appropriately short time scale to resolve the dynamical details of the hybridization/dehybridization dynamics. 103 As observed by Husic and Pande, the lag time cannot be treated as a hyperparameter to be optimized via the VAMP-2 score, but must be selected as a physically motivated choice designed to expose the dynamical motions relevant at a particular time and length scale of interest. 104 As we shall show, this choice of lag time leads to high-resolution Markovian macrostate MSMs. We guarded against overfitting using 5-fold cross-validation in which we constructed five random partitions of the 40 independent simulation trajectories into a training set of 20 trajectories and a validation set of 20 trajectories. 93 We observed plateau of the validation loss and no evidence of overfitting after 20 epochs of training requiring approximately 22 GPU-minutes on a single NVIDIA Tesla K80 GPU card. A VAMP-2 scoring of the cumulative kinetic variance explained as a function of number of SRV collective modes also showed no evidence of overfittingas would be evinced by separation of the training and validation VAMP-2 scores 93 and exhibited a knee for each of the four DNA sequences after the fifth, fourth, third, and second slow modes for AT-all, GC-end, GC-core, and GC-mix, respectively ( Figure S1 in the Supporting Information), and motivating the construction of 5D, 4D, 3D, and 2D embeddings, respectively.
Microstate Clustering. The SRV projections of the 10 million frames recorded over the course of the 1 ms molecular simulation trajectories collected for each DNA sequence were then clustered into microstates using k-means clustering. The VAMP-2 score of the microstate transition matrix constructed for each sequence at the selected τ = 1.2 ns was insensitive to the choice of the number of microstates over the range 100−1000, motivating our selection of 200 microstate clusters for each system.
Macrostate Clustering. The 200 microstates comprising each system were coarsened into our terminal macrostate MSM. Since we are interested in the construction of models of the equilibrium kinetics, we explicitly enforce detailed balance in the construction of the MSM. 90 The microstate transition matrix for each system was computed at a range of lag times τ and then diagonalized to recover the corresponding eigenvalues λ i and associated implied time scales t i = −τ/ln|λ i |. 90 The implied time scale plots for the four DNA sequences are presented in Figure S2. We observe rapid convergence of the implied time scales t i with lag time τ for all systems, motivating the construction of high resolution macrostate MSMs at a lag time τ = 1.2 ns. For this choice of lag time, we recover 5, 4, 2, and 1 implied time scales for the AT-all, GC-end, GC-core, and GC-mix systems, respectively. The identification of (i − 1) implied time scales implies the presence of (i − 1) slow modes and motivates the coarsening of the system into i macrostates. We estimate these i macrostates by applying PCCA+ spectral clustering to the leading (i − 1) eigenvectors of the microstate transition matrix. 105−107 We then estimate the corresponding 6, 5, 3, and 2 macrostate transition matrices P for the AT-all, GC-end, GC-core, and GC-mix systems, respectively, by projecting the molecular simulation trajectories into these discrete macrostates. These macrostate MSMs constitute our terminal kinetic models. We validate the Markovian nature of the four MSMs by subjecting them to the Chapman−Kolmogorov (CK) test. 60,90,108 This test asserts that the transition matrix for a Markovian (i.e., memoryless) MSM constructed at a lag time τ should satisfy the condition P(kτ) = P k (τ), which states that k successive applications of the transition matrix constructed at a lag time τ should be equivalent to a single application of the transition matrix constructed at a lag time kτ. We present in Figure S3 the CK tests for each DNA sequence to demonstrate that the τ = 1.2 ns models perform very well in predicting transition probabilities out to kτ = 7.2 ns, validating the Markovian nature and kinetic validity of the four models.
2.2. Experimental Methods. 2.2.1. Sample Preparation. Each DNA oligonucleotide was purchased from Integrated DNA Technologies (IDT) at desalt grade purity. Oligonucleotides were purified with 3 kDa cutoff centrifugal filters (Amicon). All labile protons were exchanged in deuterium oxide (D 2 O, Cambridge Isotopes, 99.9%). Oligonucleotides were prepared at a total strand concentration of 2 mM in 50 mM pD 7.2 sodium phosphate buffer with 240 mM NaCl and 18 mM MgCl 2 . Prior to each measurement, DNA solutions were placed in a water bath at 90°C for 3 min and then cooled to room temperature under ambient conditions.
2.2.2. T-Jump IR Spectroscopy. The details of the technique and processing used to acquire temperature-jump infrared (T-jump IR) data have been described previously. 109−111 Briefly, heating was initiated through optical excitation of the O−D stretch overtone transition of D 2 O. The 1.98 μm pulses (5 ns, 20 mJ, 20 Hz) used for heating were generated from the frequency-doubled output of a Nd:YAG laser sent through an optical parametric oscillator. Nonlinear IR spectra are collected from 5 ns to 50 ms delays after the T-jump with a synchronized 1 kHz spectrometer. T-jump heterodynedetected vibrational echo (t-HDVE) IR spectra were acquired with Fourier transform spectral interferometry, 110 where the delay between the local oscillator (LO) and DVE signal was scanned in 5 fs steps between (−10) and 10 fs. t-HDVE spectra were acquired with a parallel pulse polarization scheme and presented as a dispersed pump−probe (t-DPP) spectrum. t-DPP data are reported as the difference spectra relative to the initial temperature.
The sample was placed between two 1 mm thick CaF 2 windows separated by a 50 μm Teflon spacer enclosed in a brass jacket. The initial temperature of the sample was set using a recirculating chiller connected to the brass sample jacket. The T-jump temperature change (ΔT) was set to 14−16°C for all measurements and monitored using the change in transmission of the D 2 O bend− libration combination band measured in the LO beam. The temperature change was quantified by comparing the change in transmission of the LO beam with a FTIR temperature series of D 2 O.

Determination of Fast and Slow Dissociation Rates.
To determine observed rates from the T-jump data, the time-domain t-HDVE data was inverse Laplace transformed into the rate domain using a maximum entropy approach (MEM-iLT). 112 Observed rates λ fast and λ slow were computed from the amplitude-weighted mean rate across detected IR frequency, as previously described. 28 The fast response k d fast is defined as this amplitude-weighted mean rate, whereas the dissociation rate constant k d slow was extracted from the observed rate of the slow response λ slow using a two-state model for selfcomplementary oligomers, 113 where [S] T f is the concentration of single-strand oligomer at the final temperature of the T-jump, and k a slow is the association rate constant. In practice, eq 1 is recast in terms of the dissociation equilibrium constant K d to solve for k d slow and k a slow as a function of K d and [S] T f , FTIR temperature series were measured for each sequence as reported previously, 19 and the second SVD component along temperature was fit to a two-state model to determine the fraction of intact duplex θ as a function of temperature. 114 K d and [S] T f were then determined from θ at the final temperature of each T-jump measurement, Journal of the American Chemical Society pubs.acs.org/JACS Article where c tot is the total oligonucleotide concentration, which was 2 mM for all measurements.
3. RESULTS AND DISCUSSION 3.1. Sequence-Dependent Coarse-Grained Kinetics Recapitulate T-Jump IR Measurements. We first sought to demonstrate that the coarse-grained 3SPN.2 model accurately recapitulates the experimentally observed kinetics of DNA oligomer hybridization/dissociation by validating our computational predictions against temperature-jump infrared (Tjump IR) experiments. We conducted T-jump IR experiments for each DNA sequence as a function of temperature and extracted the "slow" k d slow and "fast" k d fast rates, corresponding to processes proceeding on 10−30 μs and 70−100 ns time scales, respectively. The slow response has previously been attributed to duplex dissociation on microsecond time scales induced by the rapid heating of the initially hybridized duplex. 19,28 The fast response has been assigned to terminal base pair fraying, 19,28 with this process corresponding to a relatively complex dynamical process that can span time scales from picoseconds to microseconds. [33][34][35]115 All-atom simulations suggest that frayed ends can assume misaligned WC bonds, base-sugar hydrogen bonds, and terminal stacked conformations. 32,116 To computationally mimic the T-jump process in our 3SPN.2 simulations, we conducted 1 μs simulations of an initially hybridized DNA duplex over a range of temperatures and monitored its structural relaxation. We performed 120 independent simulations for each DNA sequence at each temperature, and from these extracted computational estimates of k d slow and k d fast ( Figure S4). First, we tracked the slow response corresponding to duplex dissociation in our simulations by compiling the distribution of times at which both of the central base pairs first separate to a distance of 2.0 nm starting from an initial fully hybridized duplex. This cutoff was selected as the distance beyond which the strands are effectively non-interacting and defines the dissociated state. We extracted our computational estimate of k d slow by fitting a decaying exponential to the fraction of hybridized sequences as a function of time f hybridized (t) = exp(−k d slow t). We verified that our cutoff was sufficiently large by observing that our calculated values for k d slow changed by an average of only 7% by adopting a 1.3 nm cutoff.
Second, we tracked the fast response corresponding to terminal base pair fraying by compiling the distribution of times at which either of the terminal base pairs first separated to a distance of 1.3 nm, corresponding to a complete breakage of the WC interaction. We extracted our computational estimate of k d fast through a decaying exponential fit to the fraction of unfrayed sequences as a function of time f unfrayed (t) = exp(−k d fast t). We present in Figure 1 a comparison of k d slow and k d fast estimated by computation and experiment. Although the 3SPN.2 model was not directly fitted against kinetic data, 49 its predictions of sequence-dependent T-jump relaxation rates are, within a systematic scaling factor in time and systematic shift in temperature, in good agreement with observed experimental trends. It is well-known that the smoothing of the underlying free energy landscape induced by coarse-graining artificially accelerates the kinetics of coarse-grained molecular simulations and that different degrees of freedom may be accelerated by different factors. 117−119 We find that the simulated slow responses corresponding to center-of-mass translation of the strands during dissociation of the duplex is ∼10× accelerated relative to experiment, whereas the fast responses corresponding to fraying of the terminal bases is ∼120× accelerated. We apply these sequence-independent scaling factors to our reported computational values in Figure 1. Although the 3SPN.2 model reproduces melting temperatures relatively well, we observed a systematic 4 K under-prediction relative to experiment and so we apply a universal (+4) K corrective The simulation results are corrected by a sequence-independent scaling factor that corrects for a 10× acceleration of the slow dissociation dynamics and 120× acceleration of the fast fraying dynamics. The simulated temperature in all cases is subjected to a (+4) K corrective calibration to account for an observed systematic under-prediction of the melting temperature by the 3SPN.2 model temperature shift to our computational results. We note that our simulations are expected to slightly over-predict the melting temperature since they are conducted at 3.5× higher concentration relative to experiment, and the concentrationadjusted corrective shift accounting for the approximations inherent in the 3SPN.2 model would be slightly larger. Owczarzy et al. present an analytical prescription to apply concentration corrections to the melting temperature using knowledge of the enthalpy changes associated with duplex formation and helix nucleation. 120 In the absence of these quantities, one could instead empirically estimate concentration corrected melting temperatures by conducting a suite of simulations over a range of temperatures and assume ideal molecular behavior to apply concentration corrections to the   121 ) Furthermore, we note that these empirical calibration factors to the 3SPN.2 predictions are applied only for the purposes of making an experimental comparison, but acknowledge that there are uncertainties introduced by assuming that the equilibrium dynamics at fixed temperature can be compared directly to relaxation kinetics following a 15°C T-jump. The computational time scales and melting temperatures reported in the remainder of the paper are not corrected by these calibration corrections since we are only interested in the relative trends in the behaviors of the four sequences. In Figure 1a we observe an exponential increase of k d slow with temperature, as expected from the large enthalpic barrier to duplex dissociation. 20,38,40 Under the time and temperature calibration corrections, we see generally very good agreement between the computational and experimental curves. Of the four sequences, GC-core shows the largest discrepancy between computation and experiment, although the general exponential trend is preserved. This may be a result of its high propensity to fray (cf. section 3.5). In Figure 1b we expose a largely linear dependence of k d fast upon temperature for AT-all, GC-end, and GC-mix compared to an exponential dependence for GC-end. These trends can be understood in light of the comparatively larger enthalpic barrier for dissociation of the terminal G:C base pair in GC-end compared to that for the A:T terminal pair in the other three sequences. Again we see good agreement between the scaled computational predictions and the experimental T-jump measurements. The favorable comparison of computation and experiment provides support for the capacity of the 3SPN.2 model to reliably reproduce sequence-dependent trends in the slow and fast kinetics of the four DNA oligomers, and lends confidence in the use of these data for the parametrization of Markov state models of the long-time dynamics of each sequence. We observe that the dynamical fingerprinting approach presents an elegant means to compare experimental relaxation data directly against a Markov state model extracted from simulation data in terms of a "fingerprint" of peaks with amplitudes and time scales related to the relaxation of particular system observables. 122 These techniques have been previously employed to in applications to base stacking of DNA dinucleoside monophosphates 123 and RNA. 92 We explored the use of this approach to validate the MD simulation data but encountered challenges in resolving fast dynamical motions below the lag time of the fitted MSM and the extremely high computational cost of collecting sufficient simulation data to fit MSMs at each temperature for which experimental data was collected. Accordingly, we instead elected to perform a direct comparison between the experimental data and MD trajectories to validate the simulations themselves, then proceed to train MSMs over these data and conduct analysis and experimental tests of the MSM predictions.
3.2. SRV-MSMs for Each Sequence. We then proceeded to construct Markov state models (MSMs) from 1 ms of aggregated simulation trajectories for each of the four sequences at their respective melting temperatures to generate sequence-dependent kinetic models. MSMs define the longlived metastable macrostates of the system, their equilibrium occupancies, and the equilibrium transition probabilities between them. As such, they are extremely valuable in providing both a quantitative predictive model and a physically comprehensible mechanistic understanding of the long-time dynamical evolution of the system between an ensemble of metastable macrostates. We present in Figure 2 the inferred MSMs for each of the four 10-base DNA sequences. Across all four sequences we identify a totality of seven metastable macrostates corresponding to the fully hybridized state (H) in which all native base pairings are intact, four shifted states in which the strands are translated out-of-register by two or four bases in the 5′ (5S2, 5S4) or 3′ (3S2, 3S4) direction, a frayed state (F4)unique to GC-corein which four terminal A:T base pairs are unbound, and the fully dissociated state (D). We present these seven macrostates in Figure 2a along with schematic and cartoon renderings of representative microstates contained within each of these macrostates. The representative microstates emblematic of each macrostate were selected randomly from those possessing high (>99%) macrostate membership probabilities. Since the microstate ensemble exhibits a range of conformations, however, it is useful to characterize the degree of structural heterogeneity to determine the degree to which these microstates are emblematic of the distribution. In Figure S5 we present a projection of our macrostates into two physical order parametersthe degree of 3′ shift and degree of 5′ shiftto provide an interpretable embedding of the macrostates that exposes base pairing patterns and structural heterogeneity within the microstate ensemble. In all cases, we find these microstate distributions to be relatively narrowly focused within the low-free energy core of each macrostate such that a single archetypal microstate is indeed a good representative for the ensemble and an accurate representation of the base pairing pattern of the macrostate. In Figure 2b we present the occupancy probabilities of each state at thermodynamic equilibrium. By virtue of the fact that each sequence is simulated at its corresponding melting temperature (T m ), the probability of the dissociated state (D) is, by construction, approximately equal to the sum of the probabilities over the remaining six states (H, 5S2, 3S2, 5S4, 3S4, F4). In Figure 2c we present a visualization of the macrostate MSMs for each of the four sequences showing the connectivity between the identified macrostates. The macrostates are represented as orange circles in proportion to their equilibrium probabilities and the gray arrows indicate the probability of hopping from one macrostate to another under one time step of the kinetic model. The fluxes between the macrostates provide a wealth of high-level, interpretable information on the sequence-dependent metastable states and hybridization/dehybridization pathways. Immediately, we identify that the AT-all sequence possesses a rich and complex dynamical landscape comprising six metastable states whereas at the other end of the spectrum GC-mix exhibits far simpler two-state "all-or-nothing" behavior. In Figure 2d we present the so-called implied time scales of each MSM. These time scales correspond to the relaxation times of the DNA dimer among its constituent metastable macrostates. The leading implied time scale for each system corresponds to the characteristic time scale for hybridization/ dehybridization. Since each system is simulated at the same 7 mM concentration and at its respective melting temperature, it is not surprising that the leading time scale is approximately equal for all four systems and corresponds to the characteristic time scale for hybridization/dehybridization. The spectrum of higher order time scales corresponds to increasingly quicker relaxations between the metastable macrostates within the kinetic model and resolve the interesting sequence-dependent Journal of the American Chemical Society pubs.acs.org/JACS Article differences in the hybridization/dehybridization kinetic pathways. The total number of implied time scales is typically one fewer than the number of metastable macrostates and the existence of large implied time scales is indicative of slowly relaxing kinetic processes. The dense spectrum of slow implied time scales for AT-all is indicative of its relatively complex kinetic landscape whereas that for the two-state GC-mix comprises only a single time scale corresponding to hybridization/dehybridization. We now proceed to analyze in detail the sequence-dependent thermodynamics, kinetic, and mechanisms exposed by the four MSMs.

Comparison of MSM Thermodynamic Predictions with NN Model.
We first compare the thermodynamic predictions for the equilibrium macrostate probabilities made by the MSM models fitted at the sequence 3SPN.2 melting temperatures to those of the nearest neighbor (NN) model as a popular empirical model of DNA hybridization thermodynamics. The NN model predicts the free energy of duplex formation as a sum over helix initiation terms and the hybridization free energies of nearest neighbor pairs of bases that account for both the specific WC pairings and the modulating effects of the local (i.e., nearest neighbor) environment. 8,9 The parameters of the NN model were estimated by regressing over 108 experimental measurements to furnish a predictive model for the free energy of association as a function of DNA sequence and explicitly account for stacking contributions of native pairs, internal mismatches, and dangling ends. We apply the NN model to predict the free energy F NN of each of the macrostates occupied by each of the four sequences. A full accounting of our application of the NN model is provided in Figure S6 and supporting text within the Supporting Information. The free energy of each macrostate is related to its equilibrium occupancy probability P via the statistical mechanical relationship F = −k B T ln P + C, where T is temperature, k B is Boltzmann's constant, and C is an additive constant reflecting our ignorance of the absolute scale of free energies. We use this relationship to convert the equilibrium occupancy probabilities predicted by our MSM and reported in Figure 2b into free energies F MSM . The unknown additive constants preclude comparisons of absolute free energies between the MSM and NN model, but it is legitimate to compare relative free energies between pairs of macrostates since the additive constant cancels in taking differences. As such, we arbitrarily set the additive constant C in both the MSM and NN model such that the hybridized state H defines a zero free energy reference state and we report the stability of all macrostates relative to the hybridized state as ΔF NN = F NN − F H NN for the NN model and ΔF MSM = F MSM − F H MSM for the MSM.
As illustrated in Figure 3, we see that the MSM tends to predict higher free energies for all macrostates relative to the H state compared to the NN model. Although the trends are in qualitative agreement, what is the root of the quantitative discrepancy of the MSM and NN models in the predicted relative stabilities? First, the MSM is constructed bottom-up from molecularly detailed 3SPN.2 simulations whereas the NN There are approximations inherent in the 3SPN.2 model, not least of which is the coarse-grained representation that integrates over atomic degrees of freedom, and in the NN model that was fitted to limited experimental data assuming a low-order expansion in terms of nearest neighbor additive contributions. Second, although 3SPN.2 is expected to capture some dangling end stabilization effects through base stacking and cross-stacking interactions, the model was not parametrized to fully capture the enthalpic contributions of the interaction of unbound bases with terminal base pairs, whereas this term is explicitly included within the NN model. Third, a well-known deficiency of the NN model is the absence of any treatment of inert tailsfree bases that extend beyond the dangling end tend to destabilize the duplex. 124 We can further explore the role of inert tails upon macrostate stability by analyzing the AT-all and GC-end sequences that occupy out-of-register shifted macrostates 5S2 and 3S2 comprising a one-base inert tail and 5S4 and 3S4 comprising a three-base inert tail (cf. Figure 2a). For both sequences, our simulations show that 5S2 is both the most stable of shifted states relative to H (Figure 3a) and has the smallest discrepancy (∼4 kJ/mol AT-all, ∼10 kJ/mol GC-end) compared to NN predictions (Figure 3b). The 3S2 and 5S4 states have nearly the same deviation from NN predictions (∼7 kJ/mol AT-all, ∼ 15 kJ/mol GC-end), indicating that longer tails in the 5′ direction are as destabilizing as shorter tails in the 3′ direction. It is known from differential scanning calorimetry studies 125 that 3′ inert tails are more destabilizing than 5′ tails, with the differential behavior attributed to a combination of 5′ tails preferentially stacking on the core duplex and 3′ tails perturbing the duplex structure. 124−126 Both the NN and MSM predictions for AT-all are consistent with this trend (i.e., F 3S2 > F 5S2 and F 3S4 > F 5S4 ). For GC-end, the MSM and NN models both predict the out-of-register shifted states to be less stable relative to the hybridized state than the corresponding predictions for AT-all. This is in line with expectations since the terminal G:C pairs in GC-end decrease by two the number of available WC pairings in out-of-register shifted states compared to AT-all. The GC-end NN predictions run contrary to the expectation that the 3′ inert tails should be more destabilizing than the 5′ tails, whereas the MSM predictions are consistent with this trend. Indeed, the MSM model for GC-end does not identify the 3S4 macrostate as a metastable conformation for the duplex.
In sum, the qualitative trends in the macrostate thermodynamic stabilities are in good agreement between the MSM and NN models, but show quantitative discrepancies for macrostates possessing inert tails. In these instances the MSM predicts these macrostates to be less stable relative to the hybridized state compared to the NN model predictions by 4.0−14.5 kJ/mol. The MSM predictions are also consistent with the experimental expectation that 3′ inert tails should be more destabilizing than the 5′ tails, whereas the NN predictions can be in conflict with this trend.
3.4. Out-of-Register States Facilitate Hybridization and Dissociation Dynamics (AT-All, GC-End). In addition to thermodynamic stabilities, the macrostate MSM also furnishes quantitative and interpretable predictions of hybridization and dehybridization pathways and mechanisms at the sequence melting temperatures. It should be noted that because we are using a reversible MSM framework, detailed balance is enforced by construction. We now proceed to analyze these predictions for each of the four sequences and illuminate the relationship between sequence and dynamics. Two of our sequences, AT-all and GC-end, support out-ofregister metastable states, and we commence our analysis with the role of these shifted states.
AT-all possesses the richest and most complex MSM of the four sequences by virtue of its repetitive nature, comprising a hybridized state (H), dissociated state (D), and four out-ofregister shifted states (5S2, 3S2, 5S4, 3S4) (Figure 2c). Analysis of the MSM transition probabilities reveal a critical role of the shifted states in mediating hybridization and dehybridization. Commencing from the dissociated state D, we observe approximately equal probabilities for transitions to each of the other five states, such that a transition to one of the out-of-register shifted states 5S2, 3S2, 5S4, or 3S4 is approximately 2.2 times more likely than a direct transition to the hybridized state H. Commencing from the hybridized state H, however, a direct transition to the dissociated state is approximately 1.2 times more likely than a transition to one of the two-base shifted states 5S2 or 3S2. Once in one of the four shifted states, the 5′ vs 3′ overhang and degree of shifting play an important role in determining whether the duplex will transition to more shifted states, more aligned states, or completely dissociate. Transitions from more shifted states toward more aligned states (i.e., 5S4 → 5S2, 5S2 → H, 3S4 → 3S2, 3S2 → H) are approximately an order of magnitude more probable than the reverse transitions from more aligned states to more shifted states. The largest single transition probability from the four shifted states 5S2, 3S2, 5S4, and 3S4 is, however, back to the dissociated state D. Consistent with the higher destabilizing effect of 3′ inert tails relative to 5′ tails, 124−126 the 3S4 → D transition probability is twice as large as the 5S4 → D, and the 3S2 → D is four times larger than the 5S2 → D. The transition probability from the 5S2 and 3S2 states back to the dissociated state D is equal to or greater than the transition probability to the hybridized state. A transition path theory analysis of the MSM reveals that 33% of productive hybridization trajectories D --> H (where the dashed arrow indicates the combination of both direct and indirect pathways) and dehybridization trajectories H --> D proceed through one or more out-of-register shifted states. Among these out-of-register hybridization pathways, the D --> 5S2 --> H transition is predicted to occur 57% of the time. A mean first passage time (MFPT) analysis returns a MFPT for D --> H of 3.0 μs and for H --> D of 2.5 μs. As expected by the fact that the calculations are performed at the melting temperature, the MFPTs are approximately equal.
GC-end comprises the next most complex MSM. The introduction of the G:C pairs at the termini of the strands maximally preserves the repetitive tract of A:T base pairings such that the GC-end MSM possesses all of the same macrostates in its dynamical landscape with the exception of the 3S4 state ( Figure 2c). As discussed in section 3.3, the 3S4 state is rendered unstable within the lag time of our MSM due to the presence of the destabilizing 3′ inert tail and only four WC base pairings compared to six in the case of AT-all. Analysis of the transition probabilities reveal significant differences compared to those in the AT-all kinetic network. Commencing from the dissociated state D, we observe a similar transition probability to the 5S4 state as for AT-all, but once in the 5S4 state there are no significant transition probabilities to any other state except back to D. As such, the 5S4 state acts as a kinetic trap rather than as an intermediate to Journal of the American Chemical Society pubs.acs.org/JACS Article hybridization. The D → H and H → D transition probabilities are commensurate with those for AT-all. However, the D → 5S2 and D → 3S2 transition probabilities are half or less of those in AT-all, and the reverse transitions are an order of magnitude larger. This may be attributed to the reduced thermodynamic stability of the 5S2 and 3S2 states in GC-end that comprise only six WC pairs compared to eight in AT-all (cf. Figure 3). The 5S2 → H and 3S2 → H transition probabilities are more than an order of magnitude larger than in AT-all, which may again be attributed to the lower thermodynamic stability of the two shifted states relative to the hybridized state H. Again, the transition probabilities out of the 3S2 state to D or H are comparatively higher than those out of the 5S2 state, consistent with the increased destabilizing effect of 3′ inert tails. 124−126 Commencing from the hybridized state H, a direct transition to the dissociated state is approximately seven times more likely than a transition to one of the two-base shifted states 5S2 or 3S2. A transition path theory analysis of the MSM reveals that only 7% of hybridization events D --> H and dehybridization events H --> D proceed through one or more out-of-register shifted states. The significantly reduced role for out-of-register shifted states in mediating the hybridization and dissociation pathways for GC-end relative to AT-all is consistent with the reduced thermodynamic stability of these states due to the elimination of possible out-of-register WC base pairing for the terminal G:C pairs and therefore a reduced accessibility of these states in the GC-end kinetic network. We compute a MFPT for D --> H of 1.6 μs and for H --> D of 2.1 μs, which are again approximately equal. The out-of-register kinetic landscape that defines AT-all and GC-end hybridization have been explored by a number of previous computational studies. Simulations have identified internal displacement mechanisms capable of correcting base pair alignment in 3SPN.2 18 as well as in the coarse-grained oxDNA 30 and BioModi 67 models. In all cases, these mechanisms were shown to be crucial components of the hybridization pathway for homogeneous and repetitive sequences. Xiao et al. performed an all-atom energy landscape-based analysis of 5′-GGGGGG-3′ and 5′-GCGCGC-3′ hexamers. 17 Out-of-register states for 5′-GCGCGC-3′ hexamers were identified as deep kinetic traps along the hybridization pathway and "slithering" through these states did not provide a significant hybridization pathway compared to an alternative "zippering" mechanism. (In contrast, slithering through out-ofregister shifted states and zippering served as two parallel pathways for hybridization of 5′-GGGGGG-3′.) This stands in contrast to our results for our AT-all (5′-ATATATATAT-3′) sequence, in which out-of-register states participated in 33% of productive hybridization events. It is conceivable that the stronger hydrogen bonding in G:C WC pairs relative to A:T pairs may render out-of-register shifted states less favorable to hybridization by suppressing fluctuation-driven rearrangements, 127,128 but additional studies would be required to reconcile these observations.
3.5. Central GC Placement Induces Long-Lived Frayed States (GC-Core). The GC-core MSM represents a departure from the relatively rich and complex kinetic networks dominated by out-of-register shifted states to a much simpler one dominated by fraying (Figure 2c). The MSM contains only three stateshybridized H, dehybridized D, and frayed F4. The F4 state is unique to GC-core and contains up to six WC pairsthe two central G:C core pairs and as many as four A:T pairs on one side or other of the core, while the other run of four A:T pairs remains free. (As expected by symmetry, the particular AT run that is free occurs with equal probability on either side of the core.) Although partially frayed states containing less than four free A:T bases on either end of the duplex are common, these tend to interconvert faster than the lag time and are not registered as metastable within our MSM. Our model reveals the absence of any direct hybridization or dehybridization transitions between the H and D states, with all pathways passing through the frayed state F4. Previous studies would suggest that hybridization of this sequence should proceed via a zippering mechanism, wherein upon formation of the strong central G:C WC base pairings the duplex helix should rapidly assemble in a middle-out fashion. 16,30 Our results are partially consistent with this expectation, but reveal the frayed state F4 to be unexpectedly metastable, serving as a long-lived state with an mean lifetime of 1.8 ns. The stability of the state is attributable to the enthalpic stabilization offered from up to six WC pairs and the entropic stabilization associated with the configurational entropy of the two free AT-tails.
Analysis of the transition probabilities show that commencing from the F4 state, progression to the hybridized state F4 → H is 25 times more likely than dissociation F4 → D. Thus, once a D → F4 transition has occurred, a F4 → H transition will likely proceed; concomitantly, H → F4 events tend to fall back to the H state and are less likely to proceed to complete dissociation D. We noted in section 3.1 that fraying dynamics in the 3SPN.2 model appear to be significantly accelerated relative to center-of-mass translation, and it is conceivable that this may lead to elevated sampling of the F4 state within the computational model relative to experiment and the induction of more frequent dissociation. Moreover, since GC-core is the sequence most prone to fraying, this effect could be the root of the relatively poorer agreement of the k d slow response for GCcore compared to the other sequences due to an artificially elevated computational prediction of this rate (Figure 1a). Our model predicts a MFPT for D → F4 → H of 3.4 μs and for H → F4 → D of 2.9 μs, which are again approximately equal.
Lattice models have previously identified frayed states as putative intermediates in DNA hybridization/dehybridization. 19,25,46 Araque et al. studied a 5′-ATGCGCAT-3′ octomer using a lattice model and identified a symmetrically A:T frayed state as a crucial part of the duplex transition path. 25 We previously studied the four sequences that are the subject of the present work using T-jump IR and 2D IR spectroscopy and identified GC-core as possessing the highest deviation from two-state behavior during dissociation when neglecting out-ofregister contributions. 19 This result was interpreted to arise from loss of A:T contacts and fraying around the central G:C core, and this hypothesis was supported by lattice model calculations that predicted the GC-core conformational ensemble to possess substantially more frayed configurations than the other three sequences. 46 Follow-up T-jump measurements and Smoluchowski simulations on model 1D free energy landscapes showed that AT termini fraying was an effectively barrierless process characterized by rapid interconversion between all accessible frayed states. 28 These prior results are consistent with the present findings that expose the GC-core sequence to be the only sequence that occupies the F4 frayed state and therefore the only one possessing a metastable frayed state on time scales exceeding the τ = 1.2 ns lag time of our MSMs.
Journal of the American Chemical Society pubs.acs.org/JACS Article 3.6. Disruption of Repetitive AT Tracts Promotes Two-State "All-or-Nothing" Kinetics (GC-Mix). The GCmix sequence is the only one of the four sequences studied that exhibits simple two-state "all-or-nothing" behavior. 17,19,25,26 This characterization is intended to reflect the mechanistic observation that GC-mix MSM comprises just two states, the hybridized H and dehybridized D (Figure 2c), indicating that association and dissociation of the strands proceeds directly without passing through any metastable intermediate states resolvable under the τ = 1.2 ns lag time of our MSM. Indeed, the sequence's propensity to fray in simulation and the role of termini fraying in dissociation experiments 19 indicate that there remain deviations from "all-or-nothing" behavior which are not fully captured within the MSM lag time. (We note that "all-ornothing" can also be used, somewhat ambiguously, to refer to a two-state thermodynamic model where the hybridized and dehybridized are separated by a large free energy barrier but that the transitions between them may proceed through a network of metastable states.) The two-state behavior appears to arise as a consequence of the placement of the G:C pair that maximally disrupts the repetitive AT tract within the decamer and destabilizing either out-of-register shifted states or frayed states. We note that we do observe substantial transient fraying of the terminal two-base AT tails within our dynamical simulations, but these frayed states are not sufficiently thermodynamically stable to produce a metastable macrostate within the resulting MSM. This stands in contrast to the metastable F4 state populated by GC-core. Our MSM predicts a MFPT for D → H of 2.9 μs and for H → D of 2.3 μs, which are again nearly equal. Symmetrically permutable distances (e.g., A 1 :T 10 and T 10 :A 1 ) are reflected across the x-axis to avoid congestion in the plot. Circles superposed on the x-axis indicate the instantaneous MSM state assignment as dissociated D (blue) or hybridized H (orange). Hybridization tends to occur by a nucleation−zippering mechanism, wherein a native G:C pair and adjacent A:T pair or 2−3 central A:T pairs first form prior to rapid formation of the duplex. Dehybridization tends to occur by a fraying−peeling mechanism wherein fraying of the two-base AT-tails on one or both sides of the duplex precedes dissociation of the central native base pairs and complete dissolution of the duplex. Four additional hybridization events and four additional dehybridization events are presented in Figure S7.
Given the very simple two-state "all-or-nothing" behavior of GC-mix and the absence of any intermediate metastable states, we sought to interrogate our simulation trajectory data to elucidate the hybridization and dehybridization mechanisms. To do so, we followed all 10 intermolecular distances between native WC base pairs and tracked their evolution through a number of hybridization and dehybridization events. We present one representative example of each event in Figure 4 and four more in Figure S7. During the hybridization process, we observe a global decrease in all 10 distances as the strands approach one another and the formation of key native WC contacts immediately prior to duplex formation: specifically, one of the G:C WC pairs and at least one neighboring A:T pair or 2−3 central A:T pairs. This behavior is consistent with a "nucleation−zippering" mechanism as has been reported in previous studies. 16,20,37,52 In dehybridization, we observe fraying of one half of the duplex with the strands remaining associated until the loss of the final A:T pairs and ultimately the last G:C pair. Qualitatively, we observed some short-lived states composed of two to four native WC base pair contacts immediately before full dissociation occurs, but, in contrast to the F4 state we observe in GC-core, these conformations do not constitute a metastable state within our MSM nor do they tend to reform intact duplexes. These dissociation dynamics are consistent with a "fraying−peeling" dehybridization mechanism. 32,54,55 We observe that the principle of microscopic reversibility for a molecular system at thermodynamic equilibrium imposes the conditions of detailed balance and symmetry of the classical equations of motion under time reversibility. 129 A consequence of these considerations is that if our system is indeed at equilibrium, then the "fraying−peeling" dehybridization mechanism can be considered a reversal of the "nucleation−zippering" hybridization pathway since the timereversed simulation trajectories represent an equally valid sampling of the system dynamics. This is indeed borne out by our mechanistic observations for GC-mix wherein the early stages of "nucleation−zippering" proceed by the formation of one G:C contact followed by one or more additional A:T pairs and the late stages by the formation of all remaining WC pairs, which we compare with the early stages "fraying−peeling" wherein one half of the duplex frays and the late stages wherein dissociation finally completes by the dissolution of the last few A:T contacts and the final G:C pair.
3.7. Long-Lived Metastable Shifted States Predicted by the MSM Are Resolved by T-Jump IR. Finally, we sought to validate the predictions of our sequence-dependent MSMs against experimental T-jump IR spectroscopy. T-jump IR measurements commence from a low temperature, apply a step jump in temperature, and track the relaxation of the system to the dehybridized state. We hypothesized that the influence of the out-of-register shifted states present in the ATall and GC-end sequences upon the system relaxation kinetics should manifest in the slow and/or fast responses measured by T-jump IR. As discussed in section 3.1, the slow IR response is largely attributed to dissociation events and the fast to terminal base fraying. With regards to the slow response, our MFPT analyses of our MSMs predict out-of-register shifting events (i.e., H → 3S2, 5S2, 3S4, 5S4) to proceed on microsecond time scales, which are commensurate with the 1.4−2.9 μs time scales for dehybridization (i.e., H → D) for each of the four sequences. As such, we anticipate that the dynamical relaxations associated with out-of-register shifted states proceed on similar time scales to, and may not be distinguishable from, the relaxation to the dehybridized state. Nevertheless, the presence of these out-of-register shifted states in the low-temperature equilibrium ensemble prior to the T-jump step may be observable via their influence on the fast T-jump IR response attributable to fraying. Specifically, we hypothesize that the dangling ends and inert tails present in the out-of-register shifted states should promote a broader fraying response over the course of the relaxation that is distinct from that of in-register fraying. This heterogeneity of configurations should lead to heterogeneous dynamics, manifested in the observation of a more stretched relaxation over experimental time scales of 70−100 ns. Analysis of the MSM equilibrium distributions (Figure 2b) reveals 10.0% of the equilibrium ensemble to reside in out-of-register shifted states 3S2, 5S2, 3S4, and 5S4 for AT-all, compared to just 0.23% for GC-end, and 0% for GC-core and GC-mix. It is our conjecture that a substantial population of out-of-register shifted states in the pre-T-jump AT-all ensemble should be distinguishable from Figure 5. T-Jump IR responses reflect sequence-dependent conformational heterogeneity. The final temperature for each measurement is AT-all, 320 K; GC-end, 319 K; GC-core, 327 K; and GC-mix, 326 K. (a) Mid-IR t-HDVE difference spectra for GC-core at time delays from 5 ns to 560 μs. Normalized time traces for each sequence are shown at (b) 1600 cm −1 and (c) 1660 cm −1 . The signal at 1600 cm −1 corresponds to changes in A and T ring vibrations while the signal at 1660 cm −1 contains contributions from G and T carbonyl vibrations. Each time trace is fit to the sum of a stretched exponential with two exponentials (solid lines): S(t) = A exp(−(t/τ fast ) β fast ) + B exp(−t/τ slow ) + C exp(−t/τ cool ). The stretched exponential describes the process from 5 ns to 1 μs, and the two exponentials describe the signal increase from 1 to 320 μs and signal decay from rehybridization induced by thermal relaxation back to the initial temperature. The stretch factor β fast for the fits at 1600 and 1660 cm −1 are reported directly on the plots in panels b and c.
the GC-end, GC-core, and GC-mix as an elongation of the fast relaxation response associated with terminal base fraying.
We present in Figure 5 our T-jump IR t-HDVE difference spectra and corresponding normalized time traces at 1600 and 1660 cm −1 . The signal at 1600 cm −1 corresponds to changes in A and T ring vibrations while the signal at 1660 cm −1 contains contributions from G and T carbonyl vibrations. Each time trace was fitted to the sum of a stretched exponential and two exponentials S(t) = A exp(−(t/τ fast ) β fast ) + B exp(−t/τ slow ) + C exp(−t/τ cool ). The stretched exponential describes the relaxation process from 5 ns to 1 μs, and the two exponentials describe the signal increase from 1 to 320 μs and signal decay from rehybridization induced by thermal relaxation back to the initial temperature. The fitting parameters A, B, and C correspond to the relative amplitudes of the three kinetic responses and the stretch factor β fast to the heterogeneity of fraying dynamics at the fast time scale. A testable prediction of our hypothesis is that the long-lived out-of-register shifted states in AT-all should result in a significantly smaller value for the fitted β fast parameter (i.e., a more stretched response) relative to those for GC-end, GC-core, and GC-mix. This hypothesis was supported by the experimental time series at both 1600 cm −1 , where β fast AT-all = 0.3 compared to β fast GC-end = 0.6, β fast GC-core = 0.7, and β fast GC-mix = 0.6, and 1660 cm −1 , where β fast AT-all = 0.4 compared to β fast GC-end = 0.7, β fast GC-core = 0.6, and β fast GC-mix = 0.6. This result supports our hypothesis and validates a testable prediction of our sequence-dependent MSMs.

CONCLUSIONS
We have conducted an integrated computational and experimental study of the sequence-dependent kinetic mechanisms for the hybridization and dehybridization dynamics of a family of four self-complementary 10-mer DNA oligomers: ATATATATAT (AT-all), GATATATATC (GC-end), ATATGCATAT (GC-core), and ATGATATCAT (GC-mix). We conducted 1 ms of unbiased coarse-grained molecular dynamics simulations at the melting temperature of each sequence and employed deep learning techniques to construct high-resolution Markov state models as predictive and interpretable models of the sequence dependent dynamics. T-jump IR spectroscopy was used to calibrate the kinetic time scales of the coarse-grained molecular model and validate the kinetic prediction of the Markov state models that the AT-all sequence should possess long-lived out-of-register shifted states that are detectable within T-jump IR t-HDVE time traces. Our results reveal that the specific placement of interrupting G:C pairs within an otherwise repetitive AT sequence can have a profound impact on the kinetic pathways and mechanisms for association and dissociation of the DNA duplex. In particular, we found AT-all to possess the richest and most complex kinetic landscape of the four sequences that is dominated by out-of-register shifted states that participate in 33% of complete hybridization eventspathways leading from the dissociated state to full duplex formationand dehybridization eventspathways from the complete duplex to full dissociation. Introduction of the G:C pairs at the end of the strand maintains an eight-base-pair repetitive AT tract and the GC-end kinetic landscape possess all but one of the same outof-register shifted states as AT-all. Destabilization of the GCend shifted states relative to AT-all, however, results in a far more limited participation of these states with only 7% of GCend hybridization and dehybridization events passing through one or more shifted states. Placing the G:C pairs in the center of the strand maintains two four-base AT tracts on either side of the core and results in qualitatively different kinetic behaviors for GC-core. In this case, no metastable out-ofregister shifted states are registered by our model with the hybridization and dehybridization pathways all passing through a strongly metastable frayed state in which one or other of the four-base AT-tracts is unbound to produce two free AT-tails. Finally, placing the G:C bases between the center and end of the strand to maximally disrupt the repetitive AT tracts results in no metastable out-of-register or frayed states for GC-mix and results in simple two-state "all-or-nothing" hybridization/ dehybridization behavior. Analysis of the specific pathways reveals hybridization to largely proceed by a nucleationzippering mechanism and dehybridization to proceed by a fraying-peeling mechanism.
The ordering of the computationally predicted kinetic landscapes from most to least complexAT-all > GC-end > GC-core > GC-mixis largely dictated by sequence repetitiveness, specifically the number of consecutive AT motifs. We note that this ordering differs from our previously reported ordering in terms of deviation from two-state behavior of GC-core > GC-mix > AT-all > GC-end. 19, 28 We can understand these two apparently discrepant orderings by understanding that the latter was deduced based on experimental analyses and lattice models that did not account for out-of-register states and focused largely on fraying behaviors. Indeed, under the assumption that fraying is the dominant kinetic process relative to out-of-register shifting, we can harmonize the predictions of the present work with our prior work by eliminating all out-of-register shifted states in our fitted MSMs (Figure 2c), in which case we find GC-core to contain an F4 frayed intermediate and the remaining sequences to all have simple two-state dynamics such that the predicted ordering is GC-core > GC-mix ≈ AT-all ≈ GCend.
In sum, our results demonstrate the profound effect of sequence upon the kinetic landscapes, metastable states, and hybridization/dehybridization mechanisms of short DNA oligomers. Our analysis of this small family of sequences expose preliminary design principles for the (meta)stability of out-of-register and frayed states but we anticipate much greater richness in the landscapes will emerge with studies of longer and more diverse sequences. Going forward, we will extend this work to discern more general trends in sequencedependent hybridization/dehybridization for a wider range of oligomer sequences and motivate strategies for experimental comparisons. We also suggest that the MSM approach followed in this work, possibly coupled with biased sampling and reweighting techniques, 94,130 may be well-suited to expose changes in the hybridization/dehybridization mechanism and kinetics as a function of temperature. We anticipate that these insights may provide foundational design rules by which to improve understanding of in vivo hybridization processes and rationally engineer optimized sequences for DNA nanotechnology applications such as DNA-PAINT 7 and DNA barcoding. 6 ■ ASSOCIATED CONTENT * sı Supporting Information Details of nearest neighbor (NN) model calculations, supplementary figures illustrating selection of optimal number of SRV slow modes for each sequence, convergence of MSM implied time scales, MSM Chapman−Kolmogorov tests, "computational T-jump" calculations, and GC-mix hybridization and dehybridization trajectories(PDF)