A Hybrid Machine Learning Approach for Structure Stability Prediction in Molecular Co-crystal Screenings

Co-crystals are a highly interesting material class as varying their components and stoichiometry in principle allows tuning supramolecular assemblies toward desired physical properties. The in silico prediction of co-crystal structures represents a daunting task, however, as they span a vast search space and usually feature large unit cells. This requires theoretical models that are accurate and fast to evaluate, a combination that can in principle be accomplished by modern machine-learned (ML) potentials trained on first-principles data. Crucially, these ML potentials need to account for the description of long-range interactions, which are essential for the stability and structure of molecular crystals. In this contribution, we present a strategy for developing Δ-ML potentials for co-crystals, which use a physical baseline model to describe long-range interactions. The applicability of this approach is demonstrated for co-crystals of variable composition consisting of an active pharmaceutical ingredient and various co-formers. We find that the Δ-ML approach offers a strong and consistent improvement over the density functional tight binding baseline. Importantly, this even holds true when extrapolating beyond the scope of the training set, for instance in molecular dynamics simulations under ambient conditions.


I. METHOD
Complementing the energy expression for the combined DFTB+D4 and GAP model presented in the main text, Eqs. 2 and 3 describe the analog expressions for forces and stresses. In order to evalutate the stress correction with the intramolecular model, each of the molecules is successively placed in the unit cell as depicted in Fig. S1. Evaluation of the model on each of these structures delivers their intramolecular stress correction contribution. In case periodic replicas of a molecule fall into the local environment of the central one-defined by the cutoff value of the SOAP descriptor-this would lead to wrong values, however. To avoid this, the unit cell is expanded until all periodic replica are outside the cutoff region. The obtained stress correction for these structures is then multiplied with a prefactor f e which back-corrects this expansion.
FIG. S1: Visual representations of energy, force and stress expressions for approximating the PBE(0)+MBD target level of theory with the combined DFTB+D4 baseline and GAP model (denoted by ∆-GAP).

A. Co-Crystal Stabilities
In the main document we discuss the agreement in terms of lattice energies obtained on the test set between the DFTB+D4 baseline, the ∆-GAP model and the PBE(0)+MBD target. While the corresponding correlation plot illustrates the predictive power in a compact way by capturing the various stochiometries of all four co-crystals systems, it does not allow for a direct comparison of crystal stabilites as conducted in CSP studies. Here, the energetic ordering of crystal structures is arguably of greater importance than the numerical value of the lattice energy-which might be subject to systematic deviations. For this reason, we provide a comparison of ranking orders in Fig. S2 focusing on the stoichiometries identical to the experimentally known co-crystals. It can be seen that even in absence of systematic deviations applying the ML correction significantly improves the stability assessment. The Spearman's rank correlation coefficient (r s ) assesses monotonic relationships between data sets and is used to compare the ranking orders of PBE(0)+MBD and the approximate methods. It provides correlation in a range between -1 (opposite ordering) and +1 (identical ordering). With ∆-GAP values consistently larger compared to DFTB+D4 and in a range close to 1 our findings underscore the capability of ∆-GAP to improve the detection of promising structure candidates during CSP studies.
FIG. S2: Correlation between ranking orders obtained from PBE(0)+MBD lattice energies and the two approximate methods DFTB+D4 and ∆-GAP for test set structures with identical stoichiometry as their experimental counterparts.

B. Density-dependence of Lattice Energies for Experimental Co-Crystals
To test the accuracy of our method on the known experimental structures of each cocrystal we start by conducting full unit cell relaxations on the different levels of theory.
As full unit cell relaxations are not possible at the PBE(0)+MBD level, we start from the PBE+MBD cell in this case. Subsequently, atomic positions are relaxed for scaled variants of this cell, in order to obtain a good approximation of the relaxed PBE(0)+MBD density.
Analogous scans were also performed with DFTB+D4 and ∆-GAP, around their respective minima. The corresponding energy vs. density curves are shown in Fig. S3.
Here, the difference between the DFTB+D4 baseline and the other two methods is striking. In particular, the lattice energies and minimum densities are significantly too high.

E. Molecular Dynamics Simulations
For molecular dynamics simulations at constant pressure (1 bar) and temperature (

F. Comparison of Computational Costs
To provide an estimate for the relation of computational costs for the individual methods CPU-times obtained for the experimental PcaThp crystal have been analysed. For the relations between DFTB+D4 : ∆-GAP : PBE+MBD : PBE(0)+MBD we find an increase in computational costs according to 1:2:3200:15000. The corresponding calculations were performed on Intel Xeon IceLake Platinum 8360Y processors with a base frequency of 2.4 GHz.