Systematic Finite-Temperature Reduction of Crystal Energy Landscapes.

Crystal structure prediction methods are prone to overestimate the number of potential polymorphs of organic molecules. In this work, we aim to reduce the overprediction by systematically applying molecular dynamics simulations and biased sampling methods to cluster subsets of structures that can easily interconvert at finite temperature and pressure. Following this approach, we rationally reduce the number of predicted putative polymorphs in CSP-generated crystal energy landscapes. This uses an unsupervised clustering approach to analyze independent finite-temperature molecular dynamics trajectories and hence identify a representative structure of each cluster of distinct lattice energy minima that are effectively equivalent at finite temperature and pressure. Biased simulations are used to reduce the impact of limited sampling time and to estimate the work associated with polymorphic transformations. We demonstrate the proposed systematic approach by studying the polymorphs of urea and succinic acid, reducing an initial set of over 100 energetically plausible CSP structures to 12 and 27 respectively, including the experimentally known polymorphs. The simulations also indicate the types of disorder and stacking errors that may occur in real structures.


S1 CSP 0 study and comparison with the GAFF Lattice Energies
After the minimization with GAFF, we calculated the lattice energy of each structure (SI Tab.S4 for urea and SI Tab.S5-S6 for succinic acid) defined as the energy difference between a molecule in the lattice and a molecule in vacuum: For urea, the energies obtained with GAFF 1 are comparable with the Ψ mol ones but the ranking is different.In Fig. S1C, we compare the two methods by showing the lattice energy difference relative to the Ψ mol global minimum of each structure.The known urea structures optimized with the GAFF force field, all appear among the lowest energy structures.Based on GAFF calculations the experimental forms I and IV rank among the 10 most stable urea structure, while Form A emerges as the most stable.Based on Ψ mol calculations experimental forms I and IV rank 60th and 41st -with an energy difference with respect to the global minimum of 9.0 kJ mol −1 and 7.8 kJ mol −1 , respectively.
For succinic acid, the Ψ mol ranking 2 identifies form β as the global minimum while using GAFF β is in 6th position with a difference of 0.6 kJ mol −1 with respect to the most stable structure.In the Ψ crys set, the global energy minimum is the labile polymorph γ, with α at +0.3 kJ mol −1 and β at +1.6 kJ mol −1 .Looking at the experimental structures, the GAFF force field is in good agreement with the Ψ mol CSP 0 study results showing the β polymorph as the most stable, followed by γ at +4.34 kJ mol −1 and α at 6.91 kJ mol −1 .Fig. S1D shows that the lattice energies are very sensitive to the method used with variations of similar magnitude between the three methods.In general, differences are smaller compared to urea (Fig. S1C), with the exception of structures S1107, S18857 and S15665 which are the only ones that do not exhibit the R 2 2 (8) H-bonding motif.Structures S1107 and S18857 have the H-bonding motif R 2 2 (6)C 1 1 (4) (Fig. S2C) and the molecules in a planar conformation.In WTmetaD simulations, they convert to geometries that exhibit the succinic acid in the gA-g conformation, with the C-C-C-O dihedral angles around 90 Table S1: Comparison between the experimental α structure from the CSD and the best match from the CSP 0 study, showing the unit cell, density, packing motif of the experimental structure and of its closest approximation identified by Ψ mol .The the lattice energy obtained with GAFF forcefield for the two structures is reported in the last row.

S2 Fingerprints and cluster analysis S2.1 Urea
In general, we can define a structural fingerprint for each structure, i, as a set of probability densities of the relative position, relative orientation and relevant conformational degrees of freedom: We assume that urea is a rigid molecule so that the fingerprint depends only on the relative position and orientation.The different geometries are distinguished by using: • The distribution of the distances between each pair of centers of mass, p i (r COM ) • The 2D distribution of the intermolecular orientation given by the angles θ and ψ between each pair of vectors − − → CO and Both these distributions are generated by post-processing the final 500 ps of a molecular dynamics trajectory using PLUMED2. 3However, some precautions must be considered.The distances between centers of mass are calculated without considering periodic boundary conditions.The result is that supercells of the same geometry but of different size will generate different distributions as the distance increases, as shown in Fig. S4A.For this reason, the distribution is truncated at the length of half of the shortest box vector.When computing the dissimilarity between centers of mass distributions the Hellinger distance is computed on a domain bounded by the shortest half-box distance.In Fig. S4A, we can see the difference in the distribution between two supercells obtained by replicating the same unit cell of structure U98, two or four times in each direction.
From ∼ 0.75 nm (half of the smallest box vector) the two distributions differ from each other and cannot be used for the comparison.
The use of a bivariate distribution of relative orientations in the definition of the urea fingerprint is IV and form A which all share overlapping probability distributions of θ.Furthermore, the use of two univariate distributions, p i (θ) and p i (ψ), failed to distinguish some structures, as shown in Fig. S4C where the form I structure is compared with a similar structure that exhibits stacking disorder along one axis.In this case, the two-dimensional distributions corresponding to different packings are immediately distinguishable while the one-dimensional ones overlap.
After generating these fingerprints for every structure and normalizing the distributions, we need to define a distance between each pair of structures and generate a (dis)similarity matrix (here labeled with ∆).To this aim, we use the Hellinger distance, H ij , a measure of the overlap between two normalized distributions. 4 For each pair of structures, i and j, we calculated H COM ij and H CO,N N ij , defined as: where p i and p j are the previously generated probability distributions and the integral term corresponds to the Bhattacharyya coefficient (BC(p i , p j )).The latter term quantifies the similarity between two probability distributions and ranges from 0 (completely different) to 1 (identical).
Therefore, a Hellinger distance equal to 0 refers to identical distributions while values from 0 to 1 indicate increasingly different ones.The Hellinger distance is inherently symmetric, and numerically well behaved when dealing with concentrated probability densities.
In order to equally weight the contribution of the Hellinger distances associated with the elements of the fingerprint in the calculation a dissimilarity between structure i and j we rescale and by their respective maximum values computed over the entire set of structure pairs.The norm of the vector of rescaled Hellinger distances, computed over all the distributions included in the fingerprint and appropriately normalized, defines a distance between two structures ∆ i,j : The resulting distance matrix is shown in Fig. S5A.We use the dissimilarity matrix to perform a clustering via the Fast Search and Find of Density Peaks (FSFDP) method proposed by Rodriguez & Laio, 5 which does not require prior knowledge of the number of clusters.Starting from the dissimilarity matrix, for each structure i, FSFDP calculates the density parameter, ρ i , that indicates how many other structures are close to i, and the distance parameter, σ i , that is the minimum distance between i and another structure of higher density as: By plotting ρ vs σ, as in Fig. S5B, structures characterized by high values of both σ and ρ are identified as cluster centers.As we aim to maintain also unique structures, that exhibit ρ = 0, a Table S2: Clusters identified following the finite T MD simulation.

Cluster Centers Structures
density cut-off is not applied and the only parameter to optimize is the σ cut-off.This parameter discriminates between cluster centers, and all other structures that will be later assigned to a cluster.In this particular application, an ideal cutoff choice should allow to identify as separate structures all the high-density cluster centers as well as all isolated structures that have survived after the finite-temperature simulation steps.In the cases examined in this work the chosen cutoff is of the order of σ 0.1.For urea, a distance cut-off equal to 0.05 was found to be a suitable value to group crystal structures.In Tab.S2, we show the resulting 30 cluster centers with the structures assigned to them.

S2.2 Succinic Acid
Flexible molecules need a further set of fingerprints to describe their different conformations.For succinic acid, a new distribution of the torsional angles of the carbon chain, p i (φ), is implemented to distinguish the planar and folded geometries.As discussed in the previous case we consider the distribution of the distances between centers of mass, p i (r COM ).Two vectors have been used to define the molecular orientation: one connecting the two central carbon atoms (θ) and the other connecting the terminal ones (ψ).The angle θ can distinguish how the central carbon atoms of a molecule are located with respect to molecules in adjacent chains or layers, as shown in Fig. S6A.
The angle ψ helps to define how different layers are oriented, as shown in Fig. S6B, and identifies the packing motifs of each structure.
The resulting distance matrix and decision graph are shown in Fig. S6C and S6D.The latter shows many cluster centers located around ρ = 1, as a consequence of the structures that the Ψ crys and Ψ mol sets have in common.For succinic acid, the distance cut-off is taken as σ = 0.115, which allow to retain for further analysis all the unique structures (ρ = 0).The FSFDP method produced 32 clusters, ordered in Tab.S3 according to their respective number of members.Table S3: Clusters identified following the finite T MD simulation, divided by the two sets.Experimental forms β, α, and γ are labelled E1, E2 and E3, respectively.

Urea
In Tab.S4, we show the evolution of the putative polymorphs of urea through the different steps.
Structures are ordered as their rank in the CSP 0 stage.The energies are expressed in kJ mol −1 but while in the CSP and energy minimization (EM) with GAFF steps they are lattice energies, in the NVT and NPT ones they are an average of the potential energy difference between a molecule in the crystal and a molecule in vacuum.

S3.2 Succinic Acid
As before, we show the evolution of the putative polymorphs of succinic acid through different steps.The CSP(Ψ crys ) column of Tab.S6 shows the relative lattice energies as a result of the Ψ crys study.We show the experimental forms (optimized in the Ψ crys set) as Form β, Form α and Form γ.

Contents
Figure S1: Lattice energy vs density plot of structures minimized at 0 K using the GAFF force field for the urea (A) and succinic acid (B) systems.(C-D) Comparison between the lattice energies of the different methods involved.Structures are ordered by the lattice energy ranking of Ψ mol , enhancing the previously known crystal forms.

Figure
Figure S2: (A) The dominant planar conformation (aAa) of succinic acid and (B) the top and side view of conformer gA-g.(C) The H-bonding networks with motifs C 1 1 (4)R 2 2 (6) of structures S1107, S18857 and S15665 and (D) C 1 1 (4) of structures NC1, NC2 and NC3, that contain molecules in the conformation shown in part B.
Figure S3: The two fingerprints considered for the cluster analysis of the urea set: the distributions of (A) the distances between each center of mass and (B) the intermolecular orientations obtained by computing the angles between vectors − − → CO i • − − → CO j and −−→ N N i • − −− → N N j of each pair of molecules, i and j.

Figure
Figure S4: (A) Effect of using supercells of different sizes on the distribution of centers of mass for the orthogonal structure U98 with cell vectors a = 7.5777, b = 8.4116, c = 7.5048.(B) In these three different geometries, we can see that the angle between the C-O vector of two molecules remains constant.On the other hand, using the N-N vector we can distinguish the geometries typical of form A (top), form IV (middle), and form I (bottom).(C) One and two dimensional distributions of form I (left) and structure U146 (right).

Figure
Figure S5: (A) Distance matrix showing the distance ∆ i,j between each pair of structures used for the the FSFDP clustering method.(B) The resulting decision graph with a distance cut-off of 0.05, shown with a red line, that divides the cluster centers from the other structures (in the red shaded area).These other structures are assigned to the different clusters based on the vicinity to the structures of higher densities, i.e. σ.Groups of more than 2 structures are shown with different colors while unique geometries can be identified with a black circle.

Figure
Figure S6: (A) Possible relative orientations of two succinic acid molecules belonging to different chains.(B) Examples of the different orientation between layers recognized analyzing the Ψ angle.(C)Distance matrix used for the FSFDP clustering method and resulting decision graph (D).In the latter, we show with different colors the most populated cluster centers, in grey other small clusters and with a black circle unique structures.All other structures, shown with a red circle, will be assigned to a cluster by the clustering algorithm.

Table S4 :
The energies of the CSP 0-generated structures at different steps and the cluster centers assigned by the FSFDP clustering algorithm.Melted structures are shown in bright red.

Table S4 :
The energies of the CSP 0-generated structures at different steps and the cluster centers assigned by the FSFDP clustering algorithm.Melted structures are shown in bright red.

Table S4 :
The energies of the CSP 0-generated structures at different steps and the cluster centers assigned by the FSFDP clustering algorithm.Melted structures are shown in bright red.

Table S4 :
The energies of the CSP 0-generated structures at different steps and the cluster centers assigned by the FSFDP clustering algorithm.Melted structures are shown in bright red.

Table S4 :
The energies of the CSP 0-generated structures at different steps and the cluster centers assigned by the FSFDP clustering algorithm.Melted structures are shown in bright red.

Table S5 :
The energies of the Ψ mol set of structures of succinic acid at different steps and the cluster centers assigned by the FSFDP clustering algorithm.The clustering method has been applied to the two sets together.Cluster centers are then distinguished by including Ψ crys in parentheses.Melted structures are shown in bright red.

Table S5 :
The energies of the Ψ mol set of structures of succinic acid at different steps and the cluster centers assigned by the FSFDP clustering algorithm.The clustering method has been applied to the two sets together.Cluster centers are then distinguished by including Ψ crys in parentheses.Melted structures are shown in bright red.

Table S5 :
The energies of the Ψ mol set of structures of succinic acid at different steps and the cluster centers assigned by the FSFDP clustering algorithm.The clustering method has been applied to the two sets together.Cluster centers are then distinguished by including Ψ crys in parentheses.Melted structures are shown in bright red.

Table S6 :
The energies of the Ψ crys set of structures of succinic acid at different steps and the cluster centers assigned by the FSFDP clustering algorithm.The clustering method has been applied to the two sets together.Cluster centers are then distinguished by including Ψ crys in parentheses.Melted structures are shown in bright red.

Table S6 :
The energies of the Ψ crys set of structures of succinic acid at different steps and the cluster centers assigned by the FSFDP clustering algorithm.The clustering method has been applied to the two sets together.Cluster centers are then distinguished by including Ψ crys in parentheses.Melted structures are shown in bright red.

Table S6 :
The energies of the Ψ crys set of structures of succinic acid at different steps and the cluster centers assigned by the FSFDP clustering algorithm.The clustering method has been applied to the two sets together.Cluster centers are then distinguished by including Ψ crys in parentheses.Melted structures are shown in bright red.