pywindow: Automated Structural Analysis of Molecular Pores

Structural analysis of molecular pores can yield important information on their behavior in solution and in the solid state. We developed pywindow, a python package that enables the automated analysis of structural features of porous molecular materials, such as molecular cages. Our analysis includes the cavity diameter, number of windows, window diameters, and average molecular diameter. Molecular dynamics trajectories of molecular pores can also be analyzed to explore the influence of flexibility. We present the methodology, validation, and application of pywindow for the analysis of molecular pores, metal–organic polyhedra, and some instances of framework materials. pywindow is freely available from github.com/JelfsMaterialsGroup/pywindow.


Reconstruction of the periodic unit cell
As a result of periodic boundary conductions, the atom positions for a molecule enclosed in a unit cell will often cross through periodic boundaries. The analysis requires the reconstruction of the molecules based on the atomic positions and the connectivity between atoms. Therefore, a simple algorithm was implemented in pywindow that allows one to rebuild the systems (the rebuild_system() method of the MolecularSystem class). First, a 3 x 3 x 3 supercell is constructed resulting in 27 symmetry representations of the system, with the initial unit cell in the centre of such supercell. For all atomic positions (excluding some elements 1 ) the one closest to the origin is picked as the seed for the individual molecule reconstruction. With a formula to determine if a bond is present between two atoms A and B, we iterate through all atoms in the system. The atom pair A and B is connected if the distance (d A,B ) meets the follow criterion: where r cov,max is the largest covalent radius of all the atom types in the molecular system and t is a user defined cut-off (default = 0.4 Å). In each iteration, the atoms that are determined as bonded to any of the atoms in the previous set are sampled for their atomic neighbours. This is performed for the initial unit cell only, the role of the atom representations from the supercell is to allow cross boundary connections. The atoms from the supercell area can populate the search set. However, none of the atoms from the supercell area can act as a starting seed. Only one occurrence of an atom, out of the 27 symmetry equivalences, can be assigned to a molecule.
For each molecule found, the centre of mass (COM ) is determined and if it is in the initial unit cell, this molecule is then returned as part of the rebuilt molecular system. The performance of the algorithm is presented in Figure S1 with an example of a cubic, monoclinic 1 H, Cl, Br, F, He, Ar, Ne, Kr, Xe, Rn S2 and triclinic symmetry periodic unit cell. The individual molecules for the analysis are then extracted using the make_modular() method, of the MolecularSystem() class, that returns each molecule as the Molecule object. This is done on the connectivity criteria based on the individual atom pair distances and the covalent radii. Figure S1: Three examples of the periodic unit cells before (top) and after (bottom) the rebuild process. The example on the left is of the cubic symmetry unit cell containing eight discrete molecules, the middle example of a monoclinic unit cell containing three discrete molecules and the right example of a triclinic unit cell containing two discrete molecules. Any solvent molecules have been omitted for clarity.

Centre of mass
The centre of mass (COM ) of a molecule is calculated according to the equation: where M W is the molecular weight of the molecule, v is a vector containing the Cartesian coordinates of an atom and m is the atomic mass for the element. The formula yields COM , which is a vector that corresponds to the x, y and z coordinates of the centre of mass of the molecule. The centre of mass is calculated with the calculate_centre_of_mass() method of Molecule class.

Maximum dimension
The molecule's maximum diameter (D max ) is determined as the distance between two furthest atoms in the molecule. It is obtained using an Euclidean distance matrix between all atoms in the molecule. From the obtained matrix of distances, the largest value found is between the coordinates of the two furthest atoms. The distance value is then corrected for the corresponding r vdw for the pair of elements, by adding it to the calculated distance yielding the D max . The maximum dimension of a molecule is calculated with the calcu-late_maximum_diameter() method of Molecule class.

Intrinsic void diameter
The intrinsic void diameter (D void ) is determined as the distance between the COM of a molecule and the closest atom. This is calculated using the Euclidean distance matrix between the COM vector and the positions of all elements in the molecule. The smallest value in the set of distances is then corrected by subtracting the appropriate r vdw of the determined closest element and multiplying it by 2 to yield the D void . The intrinsic void S4 diameter is calculated with the calculate_pore_diameter() method of Molecule class.

Intrinsic void volume
The intrinsic void volume (V void ) of a molecule is calculated with a formula for a sphere volume, where the void radius R void equals to 1 2 D void . The intrinsic void volume is calculated with the calculate_pore_volume() method of the Molecule class.

Methodology of calculating window diameters
The process of calculating the window diameter (D window ) for each of the windows in the molecule, determined as the channel necking that connects the intrinsic void with molecule's surrounding, presented in Figure 2 in the main manuscript, is performed as following: 1. From the Cartesian coordinates of a single molecule, the COM is calculated and the molecule is shifted to the origin (O) of the Cartesian system by subtracting the COM S5 vector from the molecule's coordinates.
2. On a sphere, with radius equal to D max and the centre of the sphere located at O, a set of sampling points is generated with the Vogel's method 1 for a spiral distribution of points on a disc, adopted for a sphere using cylindrical coordinates. The approach allows one to obtain a uniform distribution of points on a sphere surface (see Figure   2a in the main manuscript).
3. The number of sampling points is system dependent and calculated according to the formula: where A is the surface area of the sphere in Å 2 . The logarithmic scaling is used that prevents oversampling of larger molecules, which is the bottleneck for the algorithm performance, and the 250 factor proved in the design and validation processes to give a sufficient sampling level. Further, the scaling factor a allows further control on the sampling level. For the remaining vectors, these that are assumed to pass through the windows (see Figure 2b in the main manuscript), a set of coordinates along the vector in increments is calculated and defined as the 'vector path' (see Figure S2). At each point along the vector's path, the distance of the point to the position of the closest atom is calculated and corrected for the appropriate r vdw of an element by subtracting it from the distance.

S6
5. Vectors are then clustered using the density-based spatial clustering algorithm (DBSCAN) from the sckit-learn package (see Figure 2c in the main manuscript). 2 The number of returned clusters defines the number of windows found in the molecule.
6. The found clusters are analysed separately. For each such cluster, the vector's paths sets are compared for its minimum, therefore the smallest distance in the set, and the vector with highest value for that minimum is determined as the rough estimate of the window centre and the coordinates of this minimum and the distance calculated is used to determine the window circular plane perpendicular to the vector (see Figure   2d in the main manuscript).
7. The diameter of a window (D window ) is determined by the largest circle that can be circumscribed into the window. The radius of that circle is defined by the distance between the centre of the window and the closest atom. From that radius, the appropriate r vdw of that element is subtracted and it finally is multiplied by 2 to yield D window . First the rotation of the molecule's coordinates is performed to align the sampling vector with the Z axis with the COM of the molecule kept at O (see Figure   S3). In such an arrangement, the plane of the window is perpendicular to the XY plane of the coordinate system. The two step minimisation, where the coordinates of the window centre are used as variables and D window as the output is performed. The x and y components are minimised by using them as the variables in the minimize function from SciPy package 3 and the negative of D window is returned. This ensures that the correct centre is found, i.e. one that has the largest distance to the closest atom in the window plane. Next, the optimisation of the z coordinate is performed in the Z -axis direction, also using the minimize function of SciPy package, with the . At each such point the distance to the closest atom (corrected for the appropriate van der Waals radius) is calculated. If vector passes through an atom, such as the p n+2 vector, it is discarded. From the resulting vectors, the one that passes closest to the centre of the window marked with a green dot (vector in yellow) is then analysed as shown in Figure S3. Figure S3: In this schematic the process of finding the window centre is presented. The coordinates of the molecule are translated and rotated so that the vector that passes closest to the window centre (in yellow in Figure S2) starts at the origin of the Cartesian system and is aligned to the Z axis. The window plane is then shifted to the origin. The x and y coordinates of the window centre are optimised to find the largest window diameter. The z coordinate is also minimised to find the necking of the window channel and true window centre.

Method of calculating the average molecular diameter
The molecule's average diameter (D avg ), calculated from the lowest energy modelled structures, can allow for comparison to the experimentally measured solvodynamic diameters, to help identify the reaction product when other experimental methods such as crystallisation or mass spectra cannot determine which stoichiometry product was synthesised. We used this approach successfully in recent work of ours. 4 The average diameter is calculated as follows: 1. A molecule is taken and the D max is determined.

2.
A set of sampling points is distributed evenly on a sphere with radius equal to the D max , using Vogel's golden vector approach for an even points distribution on a disc, modified for a sphere (see Figure S4a). 4. The molecule's outline, as a set of points determined by the vectors crossing through the van der Waals spheres, is created (see Figure S4b).
5. The average of the distances of the molecular outline points yields the D avg and is given by the equation: where x i , is equal to the distance of the i th outline point from the centre of mass (see Figure S4c). S10 Figure S4: com/JelfsMaterialsGroup/pywindow in the Examples directory. Figure S5: Results for YAQHOQ S12 Figure S6: Results for BATVUP Figure S7: Results for NUXHIZ Figure S12: The atoms chosen in the PUDXES molecules to represent the window diameters in the circumcircle method. For each window, three carbon atoms that lie in the same plane and describe the narrowest point of the window channel necking were chosen. These are labelled with numbers here. S18 Figure S13: Window diameters and window centres calculated using the circumcirle method in pywindow. Implemented as described in Holden et al. work. 5 The input atoms sets indexes are reduced by 1 to facilitate the numbering format in python that starts with 0.

Limitations of pywindow
The pywindow package can be unfit for the analysis of some molecular pores. The definition of the window that has been described earlier states that it is the necking in the molecule's structure that connects the intrinsic void and the molecules's exterior. However, the example of a molecular pore can be found where such a necking is not present, or to be more accurate, it is the intrinsic void that is the narrowest point, as shown in Figure S14. Figure S14: The example of a molecular pore where the intrinsic void (green solid sphere) is simultaneously the channel necking -the window.
Additionally, the assumption that the window is of a circular shape will result in variety of range of underestimations of the actual window's area. For the PUDXES molecule, in the middle of Figure S15, the window is actually of a triangular shape. However, the assumption of the circular window shape is intentional for its simplicity. In some extreme cases, such as cryptophane-A (far right in Figure S15), very narrow windows close to the shape of number 8 can result in each of the circles to being identified as a separate window and the wrong number of windows output by pywindow. Figure S15: Examples of molecular pores with various windows shapes and the possible fit of the spherical window approximation (green dashed circle) as assumed in the pywindow package window diameter calculation. Molecules not shown to scale.