SSIPTools: Software and Methodology for Surface Site Interaction Point (SSIP) Approach and Applications

We present the SSIPTools suite of programs. SSIPTools is a collection of software modules enabling the use of the Surface Site Interaction Point (SSIP) molecular descriptors, used for the modeling of noncovalent interactions in neutral organic molecules. It contains an implementation of the workflow for the generation of the SSIP descriptors, as well as the Functional Group Interaction Profiles (FGIPs) and Solvent Similarity Indexes (SSIs) applications, based on the SSIMPLE (Surface Site Interaction model for the Properties of Liquids at Equilibria) approach.

Supplementary Information S1 XML file formats S1 S2 SSIP description generation S1 S2. 1  S1 XML file formats SSIP XML schema files are hosted on the huntergroup website (here: http://www-hunter.ch.cam.ac.uk/ schema/), and information. The schema are also hosted in a repository in the SSIPTools Gitlab project, HunterDatabaseSchema.

S2.1 Molecule Input
The cmlgenerator library provides a simple command line interface to be able to process multiple structure files in a single operation.

S2.2 MEPS generation
NOTE about NWChemCMLUTils library set up and CLI. The MEPS calculation is a multi-step process, which was automated and the functionality was packaged into the nwchemcmlutils repository. The Python application program interface within NWChem [1,2] was used to improve the efficiency of the workflow.
The Flow Diagram in figure S1 summarises the process of MEPS generation. The MEPS is generated on the 0.002 e bohr −3 electron density isosurface [3]. NWChem [1] was used for calculations of the MEPS with a DFT method, the B3LYP functional [4][5][6][7]. B3LYP was chosen due to the reliable results produced with a low computational cost [8]. The basis set used was 6-31G* [9,10]   To store the MEPS data produced during calculations, the unformatted cube file format is used. This is a raw text file that uses FORTRAN formatted information, and is outputted by QM codes to store property information, based on a specification by the Gaussian package [12].
The MEPS is calculated in seven different orientations such that a coarser grid can be used to reduce computational time, without loss of accuracy, and to reduce the likelihood of discretisation errors. The final surface which is outputted is the combination of the seven MEPS grids, which provides a comparable result to the use of a much finer grid. Figure S2 shows the MEPS of water generated from a single orientation, showing the points lie in planes due to the grid. An uneven distribution of points on the surface is heavily reliant on the initial geometry and the grid properties. Rotation by 45 or 120 degrees about the x, y and z axes is used to generate a further six surfaces. The resultant surface from the superposition of these seven surfaces after reorientation shown in figure S3, contains an even distribution of points over the entire surface, with a higher density.

S2.2.1 Molecule Volume
The volume enclosed by an isosurface is also calculated during this stage within NWChem. This information is required for the calculation of the temperature dependent SSIMPLE model previously described [13]. The number of voxels with electron densities greater than the specified surface is calculated for each orientation.

S3
The mean volume enclosed is then written in the merged cube file that is outputted from the calculation.

S2.3 Footprinting algorithm
Water is known to have four hydrogen bond interaction sites. Calero et al. [3] used this to define the surface area of a SSIP as a quarter the surface area of water which corresponds to A SSIP = 9.35Å 2 .
Hydrogen bond interactions occlude a segment of the surface from other interactions. This places a restriction on the proximity of SSIPs on a molecular surface, to avoid clashing. The minimum separation of SSIPs on the surface, d (in figure S4), influences the SSIP description produced. It was parameterised by Calero et al. [3], and assigned a value of 3.2Å.
The flow diagram in figure S4 summarises the footprinting process, previously described [3]. A SSIP represents a single surface segment. The partitioning of a molecule into a set of SSIPs requires knowledge of the number of SSIPs to be assigned to the molecule. This is calculated from the molecular surface area.

S2.3.1 Surface Area Calculation
The molecular surface area is calculated to be able to determine the required number of SSIPs to assign to a molecule in the first step of footprinting. This is done using the positional information for the MEPS points, which lie on a convex hull that encapsulates the molecule. The molecular surface area, S.A., is calculated using equation (S1).
Where r i is the position of the ith point on the surface, D ri is the density of points in the local environment, given by equation (S2). The local environment is defined by r D , the density radius, and δ(r i , r j ) (equation (S2) and equation (S3)).
The density of points in the local environment is simply the reciprocal of the sum of the number of other points within the defined search radius. Each point on the surface contributes a circular area, equal to the area occluded by the density radius, weighted by the density of points in the local environment. A value of r D = 0.5Å was used.

S2.3.2 MEPS to hydrogen bond parameter mapping
The mapping of molecular electrostatic potential surface (MEPS) value to hydrogen bond parameters uses second order polynomial functions, parameterised in [3]. SSIP values for positive MEPS segments are mapped to hydrogen bond donor parameters with equation (S4). The negative MEPS values are mapped to hydrogen bond acceptor parameters with equation (S5).

S2.3.3 Efficient searching for nearest neighbours
Calculating the distance between all points on the MEPS is an O N 2 operation, which could become very expensive for large molecules. To reduce the computational cost, a KDTree is used [14,15], to reduce the scaling to O (N logN ) making the search for points satisfying the distance cutoffs faster. The KDTree produces a space-partitioned data structure. A KDTree data structure can be more efficiently traversed than a standard array structure, which was used previously.

S2.3.4 Functional group identification
As noted in [3] the acceptor sites have a correction factor applied. The requirement of an empirical correction factor to improve the prediction of hydrogen bond acceptors produces an extra requirement of functional group identification within the code. This is done by expressing the molecule in a graph representation, with atoms as vertices, and bonds as edges. A subgraph isomorphism approach can then be used to identify any occurrences of the functional group, if it is also described as a graph. The algorithm used is the VF2 of Cordella [16], which has been implemented in this code by Teodor Nikolov.
The functional group identification routine employed prevents the assignment of multiple correction factors. This is done by the canonical ordering of functional group subgraphs by use of an enumeration object. Once an assignment to a functional group is made, no others can be assigned. The functional groups are sorted by IUPAC precedence [17], so are therefore ordered so the largest collection of atoms would be assigned first, i.e. an oxygen atom would be assigned to an ester rather than an ether.

S2.3.5 Evaluation of trial SSIP configuration
The positional assignment uses the MEPS value, rather than the value of the SSIP. This is done to ensure the correct surface features are found. Once the best configuration is found, the region around each SSIP assigned is then further examined. Since a SSIP interaction involves a small contact area and not just a single point, all MEPS points within this contact area must be inspected. This led to the assignment of a contact radius, r, assigned value of 1.1Å, in work by Calero et al. [3]. A SSIP assigned to the surface is replaced by a more polar point that is found within this radius, to more accurately account for the polarity of the region.

S2.3.6 Canonicalisation of output
To ensure reproducibility of footprints from the calculation, the output ordering of SSIPs when written to file must be consistent, to allow for unit testing. The SSIPs are ordered by decreasing numerical value. If there are any SSIPs with the same value, the distance of the SSIPs to the molecule (mass unweighted) centroid is then used to ensure consistent ordering. This is required because the traversal of the MEPS surfaces generating the different trial SSIP collections is not deterministic, so the same solution may be found with different additions to the SSIP description.

S3 Phasecalculator usage
Phasecalculator provides a convenient wrapper for the functionality required to calculate FGIPs and SSIs for solvent systems.

S3.1 Input generation and calculations
Operation of phasecalculator via the CLI is split into two categories by sub-commands; calculate and inpgen. The calculate option will run the calculations defined in the input phasecalculator XML. This requires access to a version of the ssip-phastransfer jar file on the local system. The inpgen command can generate the required XML for calculations based on predefined SSIP descriptions.

S3.2 SSIP Descriptions
The SSIP descriptions for solvent molecules studied were previously defined in [13,18]. The descriptions are present in the puresolventinfomration repository. Both sets of SSIP descriptions are accessible with phasecalculator when constructing input with the inpgen sub-command. The default description for inpgen, uses the polarised alcohol descriptions from [18,19]. The unpolarised alcohol descriptions used in [13] can be selected using the -unpolarised flag when using the inpgen CLI.

S3.3 Phasecalculator XML format specification
The XML file input to the calculate sub-command of phasecalculator provides an agnostic interface to calculate properties for non-standard solvents.
For solvents with components not defined, or to use descriptions not defined in the puresolventinformation repository, inpgen cannot be used.
To use non-standard SSIP descriptions, defined in the previous section, for solvents previously studied requires the creation of valid XML input for the calculate sub-command with the included paths to include in listing 1.
Listing 1: Example phase definition from an phasecalculator input XML file. The SSIP description to use for a molecule is defined by the ssipFilelocation attribute. If a different SSIP description for a moelcule predefined is required, this path requires modification to the file path of the SSIP description of interest.