MetalHawk: Enhanced Classification of Metal Coordination Geometries by Artificial Neural Networks

The chemical properties of metal complexes are strongly dependent on the number and geometrical arrangement of ligands coordinated to the metal center. Existing methods for determining either coordination number or geometry rely on a trade-off between accuracy and computational costs, which hinders their application to the study of large structure data sets. Here, we propose MetalHawk (https://github.com/vrettasm/MetalHawk), a machine learning-based approach to perform simultaneous classification of metal site coordination number and geometry through artificial neural networks (ANNs), which were trained using the Cambridge Structural Database (CSD) and Metal Protein Data Bank (MetalPDB). We demonstrate that the CSD-trained model can be used to classify sites belonging to the most common coordination numbers and geometry classes with balanced accuracy equal to 96.51% for CSD-deposited metal sites. The CSD-trained model was also found to be capable of classifying bioinorganic metal sites from the MetalPDB database, with balanced accuracy equal to 84.29% on the whole PDB data set and to 91.66% on manually reviewed sites in the PDB validation set. Moreover, we report evidence that the output vectors of the CSD-trained model can be considered as a proxy indicator of metal-site distortions, showing that these can be interpreted as a low-dimensional representation of subtle geometrical features present in metal site structures.


SI 1. Artificial distortion of existing metal sites
Metal sites from the CSD and PDB validation sets were distorted by manipulating the Cartesian coordinates of specific ligand atoms, progressively morphing the original coordination geometry into the desired one. The following passage describes only the distortion trajectories that transform SPL sites into TET sites and SQP sites into TBP sites, since the inverse transformations are obtained by reversing the procedures outlined.
SPL sites were aligned to the reference system of Cartesian axes by placing the four ligand atoms (Figure S1 (a)) along the y and z directions and the metal atom at the origin. Ligands placed along the y axis (L2 and L4) were rotated around the z axis in opposite directions, thereby compressing the L2-M-L4 angle. Ligands placed along the z axis (L1 and L3) were instead rotated around the y axis to compress the L1-M-L3 angle. The rotation step was determined by subtracting the desired value (109.5° for a perfect tetrahedron) from the initial angle formed by each of the two pairs of ligand atoms and dividing the resulting angle by 20.
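The SPL to TET trajectory described above can be sketched in a few lines of dependency-free Python. This is an illustrative reconstruction, not the authors' script: the function names, the dictionary layout and the idealized site are hypothetical, and it assumes that the per-step change in each L-M-L angle is split equally between the two ligands of a pair (each rotates by half the step in opposite senses), a detail the text does not state.

```python
import math

def rotate_z(p, deg):
    """Rotate point p = (x, y, z) about the z axis by deg degrees (right-hand rule)."""
    a = math.radians(deg)
    x, y, z = p
    return (x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a), z)

def rotate_y(p, deg):
    """Rotate point p = (x, y, z) about the y axis by deg degrees (right-hand rule)."""
    a = math.radians(deg)
    x, y, z = p
    return (x * math.cos(a) + z * math.sin(a), y, -x * math.sin(a) + z * math.cos(a))

def angle_deg(p, q):
    """Angle p-M-q in degrees, with the metal M at the origin."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.hypot(*p) * math.hypot(*q)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def spl_to_tet_trajectory(site, target_deg=109.5, n_steps=20):
    """Return n_steps + 1 frames morphing an aligned SPL site towards TET.

    Assumes the site is already aligned: L2/L4 on +y/-y, L1/L3 on +z/-z.
    """
    l1, l2, l3, l4 = (site[k] for k in ("L1", "L2", "L3", "L4"))
    # Rotation step = (initial angle - desired angle) / number of steps.
    step_24 = (angle_deg(l2, l4) - target_deg) / n_steps
    step_13 = (angle_deg(l1, l3) - target_deg) / n_steps
    frames = [dict(site)]
    for _ in range(n_steps):
        # Assumption: each ligand of a pair takes half of the angular step.
        l2 = rotate_z(l2, -step_24 / 2)  # L2 and L4 both move towards +x
        l4 = rotate_z(l4, +step_24 / 2)
        l1 = rotate_y(l1, -step_13 / 2)  # L1 and L3 both move towards -x
        l3 = rotate_y(l3, +step_13 / 2)
        frames.append({"L1": l1, "L2": l2, "L3": l3, "L4": l4})
    return frames

# Hypothetical idealized square-planar site with 2.0 Å metal-ligand distances.
ideal_spl = {"L1": (0.0, 0.0, 2.0), "L2": (0.0, 2.0, 0.0),
             "L3": (0.0, 0.0, -2.0), "L4": (0.0, -2.0, 0.0)}
frames = spl_to_tet_trajectory(ideal_spl)
final = frames[-1]
```

For this idealized input, the last frame has both compressed angles at 109.5° and the metal-ligand distances unchanged, since pure rotations about axes through the metal preserve them. For real sites, whose ligands do not lie exactly on the axes, the same rotations only bring the angles as close as possible to the target value.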
A similar procedure was followed to distort the SQP sites (Figure S1 (b)), but only the ligands placed along the y axis (L2 and L4) were rotated, around the z axis. This resulted in the compression of the L2-M-L4 angle, which was gradually brought as close as possible to the desired value (120° for equatorial ligands in a perfect trigonal bipyramid) by stepwise rotation over 20 steps.
The coordinates of the six nearest neighbors of the metal atom were recorded for all four sites at each step to calculate the input vectors, which were subsequently passed to CSD-NN to compute the probability assigned to each class and the entropy of the output vectors. All calculations necessary for the distortions were performed with a custom Python script using SciPy, NumPy, pymatgen and cosymlib.
The procedure to transform metal site geometry described above was applied to 213 TET, 183 SPL, 89 SQP and 132 TBP sites from the CSD validation set and to 20 TET, 9 SPL, 10 SQP and 14 TBP sites from the PDB validation set. Entropy was calculated for the original and all distorted structures and represented as heatmaps (Figure S2) and as boxplots divided into bins of CShM ratios (Figure S3). Finally, two pairs of SPL and SQP sites (Figure S4) were randomly chosen from the CSD and PDB validation sets to also show detailed class probability profiles and entropy evolution along the distortion trajectory (Main Text, Figure 6).
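As an illustration of the entropy computed on the output vectors, the following minimal sketch evaluates the Shannon entropy of a vector of class probabilities. The function name and the example probabilities are hypothetical (they are not model outputs), and the natural logarithm is an assumption, since the logarithm base is not specified here.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy H = -sum(p_i * ln p_i) of a class-probability vector.

    The vector is renormalized first; zero entries contribute nothing.
    The natural logarithm is an assumption of this sketch.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)

# Made-up output vectors for illustration only.
confident = [0.97, 0.01, 0.01, 0.01]   # one class dominates: low entropy
ambiguous = [0.50, 0.50, 0.00, 0.00]   # two classes tie: entropy = ln 2

h_confident = shannon_entropy(confident)
h_ambiguous = shannon_entropy(ambiguous)
```

Under this definition, a site midway along a distortion trajectory, where the network splits its probability between the starting and the target class, yields a higher entropy than an undistorted site assigned to a single class with high confidence.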

Figure S1: Distortion of SPL sites to TET and of SQP sites to TBP. Colored arcs represent

Figure S2: Entropy heatmaps for TET↔SPL distortion trajectories for a) CSD and c) PDB

Figure S4: Metal sites on which distortions were performed to study the relationship between

Figure S5: Per class distribution of metal identity across the entire CSD dataset.

Figure S6: Per class distribution of metal identity across the entire PDB dataset.

Figure S7: Per class distribution of RMSD computed against the idealized class models for

Figure S8: Full confusion matrices for classification by the CSD-NN model on the CSD test set

Figure S9: Full confusion matrices for classification by the CSD-NN model on the whole PDB

Figure S10: Distribution of entropy values for correctly classified CSD validation set sites inside

Table S2: Balanced accuracy scores of optimized models computed on the full training sets for a) CSD sites and b) PDB sites in cross-validation. Standard deviation is shown in parentheses next

Table S5: p-values of the t-test performed on the entropy values calculated by the CSD-NN model on the CSD test set and by the PDB-NN model on the PDB test set for each geometry class.

Table S6: Performance metrics on the whole PDB dataset for the hyperparameter-optimized CSD-NN model, with the support for each class.

Table S7: Analysis of statistical correlation between entropy and Continuous Shape Measure (CShM), RMSD and scaled RMSD (RMSDscaled). RMSD, RMSDscaled and CShM values are computed against the ideal geometry corresponding to each class using pymatgen and cosymlib.