Multipolar Atom Types from Theory and Statistical Clustering (MATTS) Data Bank: Impact of Surrounding Atoms on Electron Density from Cluster Analysis

The multipole model (MM) uses an aspherical approach to describe electron density and can be used to interpret data from X-ray diffraction in a more accurate manner than using the spherical approximation. The MATTS (multipolar atom types from theory and statistical clustering) data bank gathers MM parameters specific for atom types in proteins, nucleic acids, and organic molecules. However, it was not fully understood how the electron density of particular atoms responds to their surroundings and which factors describe the electron density in molecules within the MM. In this work, by applying clustering using descriptors available in the MATTS data bank, that is, topology and multipole parameters, we found the topology features with the biggest impact on the multipole parameters: the element of the central atom, the number of first neighbors, and planarity of the group. The similarities in the spatial distribution of electron density between and within atom type classes revealed distinct and unique atom types. The quality of existing types can be improved by adding better parametrization, definitions, and local coordinate systems. Future development of the MATTS data bank should lead to a wider range of atom types necessary to construct the electron density of any molecule.


Appendix 1. The computational details of rotation of local coordinate system
The procedure of rotating local coordinate systems consisted of several steps. First, the MATTS2021 data bank entries were divided into individual files, and copies with several options for the choice of local coordinate system were generated using local scripts. The symmetry filter was set to "no", so no filtering by symmetry was applied to the parameters. Lastly, the program bankMaker v0.016 from the DiSCaMB library was run with two kinds of input: files containing modified entries from the MATTS2021 data bank, and files containing coordinates and multipole model parameters for 2516 model molecules from the MATTS2021 collection [1]. bankMaker is a utility program within the newest version of the DiSCaMB library. The program generated new values as output. Atom types for which at least one parameter had a sample standard deviation larger than 0.05 e were assigned the flag "inconsistent".
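The flagging rule in the last sentence can be illustrated with a minimal sketch. This is not the bankMaker implementation. The function name, the data layout, the "consistent" label for the non-flagged case, and the numerical values below are all assumptions made for illustration; only the 0.05 e threshold and the "inconsistent" flag come from the text.

```python
from statistics import stdev

THRESHOLD_E = 0.05  # sample standard deviation limit quoted in the text, in e


def flag_atom_type(samples, threshold=THRESHOLD_E):
    """Return "inconsistent" if any parameter's sample std exceeds the threshold.

    samples: dict mapping a parameter name to the list of values found for
    that parameter (hypothetical layout, not the bankMaker file format).
    """
    for values in samples.values():
        # stdev() is the sample (n - 1) standard deviation; it needs >= 2 values
        if len(values) >= 2 and stdev(values) > threshold:
            return "inconsistent"
    return "consistent"  # label assumed here; the text names only "inconsistent"


# Hypothetical parameter values, for illustration only
tight = {"Pval": [4.01, 4.02, 3.99], "P20": [0.10, 0.11, 0.09]}
loose = {"Pval": [4.01, 4.02, 3.99], "P20": [0.10, 0.25, -0.05]}

print(flag_atom_type(tight))
print(flag_atom_type(loose))
```

A single parameter over the limit is enough to flag the whole atom type, which matches the "at least one parameter" wording above.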

Appendix 2. The computational details of topology clustering
Obtaining an extended tree from topology clustering required several steps, all of which relied on local scripts written in bash and Python together with the graph description language DOT. First, the MATTS2021 data bank entries were divided into individual files using bash. From each entry (one atom type), several pieces of information were automatically extracted and added to a new dataset: the ID of the atom type, the chemical element of the central atom, the number and types of first neighbors, planarity properties (planarity, planar rings with planar atom, size of rings), the symmetry from the data bank, the sum of second neighbors and their types, and a final label created for topology clustering purposes. Then, again using bash, a loop over all dataset entries generated a .py script with the parent-child relations that form the basis of a tree, written in a format suitable for the anytree 2.8.0 Python package [2]. This script was then executed using Spyder 4.1.4 [3], and a .dot file (the format of the graph description language DOT) was created. This file could already be used to generate an extended tree, but we edited it to include shorter labels for "parents" and "children" and added background color to show planarity. This was done automatically using local bash scripts. Finally, an extended tree was generated from the final .dot file using Graphviz [4] (an open-source visualization software for Windows) and saved as an image.
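The step from parent-child relations to a .dot file can be sketched as follows. The work described above used the anytree package for this; the sketch below is a plain standard-library stand-in that writes the DOT text directly, so it should not be read as the authors' script. The node labels, the function name, and the fill color used to mark planarity are hypothetical.

```python
def to_dot(edges, planar=()):
    """Build DOT source for a tree from (parent, child) label pairs.

    Nodes listed in `planar` get a filled background, mimicking the
    background color used to show planarity (color choice is an assumption).
    """
    nodes = []
    for parent, child in edges:
        for label in (parent, child):
            if label not in nodes:
                nodes.append(label)  # keep first-seen order, no duplicates

    lines = ["digraph tree {"]
    for label in nodes:
        if label in planar:
            lines.append(f'    "{label}" [style=filled, fillcolor=lightgrey];')
        else:
            lines.append(f'    "{label}";')
    for parent, child in edges:
        lines.append(f'    "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical labels: element -> number of first neighbors -> planarity
edges = [
    ("C", "C 3 neighbors"),
    ("C", "C 4 neighbors"),
    ("C 3 neighbors", "C 3 planar"),
]
dot = to_dot(edges, planar={"C 3 planar"})
print(dot)
```

If the printed text is saved as tree.dot, the Graphviz command `dot -Tpng tree.dot -o tree.png` renders it as an image.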

Appendix 3. The computational details of electron density clustering
The process of electron density clustering was divided into two steps: data preprocessing and the actual clustering. The MATTS2021 data bank entries were divided into individual files using bash. Then, the ID of the atom type and the multipole model parameters (κ, Pval, κ' and Plm) were automatically extracted from each entry and added to a new dataset in .csv format. This dataset was imported into Python using the pandas 1.0.5 [5] library. Other standard data science libraries were also used: NumPy v1.19.0 [6] (for data manipulation), Matplotlib 3.2.2 [6] (for plotting), and scikit-learn 0.23.1 [7] (for the DBSCAN algorithm). To find the Eps parameters, sklearn.neighbors.NearestNeighbors and the KneeLocator [8] library for Python were used.
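The usual way to choose Eps for DBSCAN is to sort the distances to the k-th nearest neighbor and look for the knee of that curve. The sketch below illustrates the idea with the standard library only. It is a stand-in for the NearestNeighbors and KneeLocator calls named above, not a reproduction of them: the knee is taken as the point farthest below the chord joining the ends of the curve, which is a simplified heuristic, and the points and the value of k are hypothetical.

```python
import math


def k_distances(points, k):
    """Sorted distances from each point to its k-th nearest neighbor."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


def knee_index(curve):
    """Index of the point lying farthest below the chord joining the curve's ends.

    Simplified knee heuristic for a convex, increasing curve
    (an assumption; not the algorithm used by KneeLocator).
    """
    last = len(curve) - 1
    best, best_gap = 0, -1.0
    for i, y in enumerate(curve):
        chord_y = curve[0] + (curve[-1] - curve[0]) * i / last
        gap = chord_y - y
        if gap > best_gap:
            best, best_gap = i, gap
    return best


# Hypothetical 2D data: two tight groups and one outlier
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (20.0, 20.0),
]
curve = k_distances(points, k=2)
eps = curve[knee_index(curve)]
print(f"estimated Eps = {eps:.3f}")
```

The estimate lands on the largest within-group neighbor distance, just before the curve jumps to the outlier's distance, which is the behavior one wants from an Eps value passed to DBSCAN.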

Appendix 4. Hydrogen atom types
Hydrogen atom types belong to the 1x group, with the exception of H122, which is the middle H in the oxonium ion H5O2+ and thus belongs to the 2x group (see Figure 11). Atom types from the 1x group divide into two main clusters (0___ and 1___), in line with expectations from their chemical properties; the separating factor is the type of the first neighbor, which indicates whether the hydrogen is polar or nonpolar. In those clusters, the P10 and P20 multipoles have positive values and the rest are zero, as expected (they were never refined in the model molecules
[...]
C7902 belong to 4-membered rings. Apart from the ring membership, rare and specific combinations of first neighbors occur in some types: C4691 is connected to Br and three C atoms, C882 to C and three F atoms, C452 to S, N, and two H atoms, C442 to S, N, and two C atoms, C432b to P, C, and two H atoms, C4111 to N and three C atoms, and finally C782 is a carbon atom from an SO2-CH(C)-SO2 chain. There are also groups such as: C756 and C889 (both connected to three F atoms and one C atom), C455 and C456 (both connected to two F and two C atoms), C7911 and C7931 (both in 4-membered rings and connected to two C atoms and one H atom; as a fourth neighbor, C7911 has an N atom and C7931 has an O atom), C435 and C953b (carbons from an epoxide 3-membered ring), C792, C7921, and C794 (4-membered rings, connected to H, two C, and either N or O atoms), C417a, C463, and C472 (connected to two or three C atoms and one or two N atoms), and C4041, C404D, C799, and C998 (all in 3-membered rings, with at least two C neighbors; the remaining two neighbors are C or H atoms).
The sample standard deviations of MM parameters of the largest clusters of the carbon 4n group usually do not exceed the desired values, or exceed them only slightly. However, the 4n carbon group also has a large percentage of atom types labeled with the "inconsistent" flag.
In all the clusters, the Plm values obey the maximum symmetry possible for a particular coordinate system and agree with the sp3 shape of electron density (positive P30 and P33 for ZaXb, negative P31 and P33 for XabYa, negative P32 for ZabXc, and negative P30 and positive P33 for ZabcXa). As with the 3p group, the XabYa coordinate system allows the largest cluster of similar atom types to form. The most restrictive is the ZabXc system, suggesting that only the 68 atom types for which all rotations belong to the 10_0_16_3 cluster truly have 4̄3m symmetry. Only one unique atom type, C840, does not appear in any of the above large clusters.

Appendix 6. Nitrogen atom types
Nitrogen atom types may belong to the 2p, 3n, 3p, or 4n groups; there is also one atom type in the [...] Ca. 70% of 3p nitrogen atom types cluster together in clusters 9_4__ (ZaXb system) and 9_5__ (XabYa system). Each of these clusters also contains the majority of 3n nitrogen atom types. Both clusters contain the same atom types and differ only in rotation type. The coexistence of atom types from two groups, 3n and 3p, in one cluster shows the significance of using the 'purple' local coordinate systems (shown previously in Figure 2), which were designed to allow such a situation. Those clusters were not further divided, as the standard deviation for each was smaller than the accepted limit of 0.05 e. Only 4 of the 20 atom types from the 3p group have inconsistent parameters at some rotations: N327, N335b, N3592, and N729. There is no inconsistency in the 3n group.

Many nitrogen atom types from the 3n group also form large clusters in the remaining two coordinate systems: cluster 9_7__ in ZabXc and cluster 9_8__ in ZabcXa. Almost all nitrogen atom types in cluster 9_7__ also appear in the previously mentioned cluster 9_5__. The Plm values of clusters 9_4__ and 9_5__ indicate that the clusters contain atom types with sp2 hybridization (negative P33 and vanishing P31 for XabYa; positive P30, negative P32, and vanishing P33 for ZaXb). However, the absolute Plm values are at most half of those for sp2 carbon atom types in clusters 10_0_0_0 and 10_1_0_0.
The sp2 hybridization of nitrogen types from the 3p group follows chemical intuition. However, it is somewhat unexpected that the majority of nitrogen atom types from the 3n group also seem to have electron density resembling sp2 more than sp3 hybridization. Deeper investigations are required to better understand the possible causes of this observation (specific geometry, a coordinate system definition that is not unique enough, etc.). Clusters 9_7__ and 9_8__, which contain only 3n nitrogen types, have Plm values that do not fully follow sp3 hybridization. In the ZabXc system, P32 has the largest absolute value but P30 has not disappeared, whereas in the ZabcXa system P33 is not well balanced by P30, the latter being about half as large. This is most probably because the electron density of the lone electron pair at a nitrogen atom has to be described solely by the multipole functions of that nitrogen atom, whereas the electron density of covalent bonds is usually described by the multipole functions of both the central nitrogen atom and its covalent neighbors. Thus, the multipole functions contributing to the lone electron pair description have to have higher populations.
As usual, there are some atom types that never appear in the largest clusters of the nitrogen 3p/3n group. For example, for some rotations, the N308 and N449 atom types from the 3n group seem to be more similar to each other than to the others and separate from the other 3n types (see, e.g., clusters from 9_10__ to 9_19__). In the 3p group, this happens for atom types N318, N335a, N452, and N453, as well as for a group made up of the N322a, N323a, and N325a atom types (cluster 9_57__) and a smaller one with the N3592 and N459 types (cluster 9_29__). Another unique group shows the importance of topology properties, as it contains all 3p atom types with N and O as first neighbors: N998 and N998a (clusters from 9_43__ to 9_48__). There are also unique atom types such as N334, N339, N312, N332, and N3591.
The nitrogen 4n group creates only 5 clusters, following the possible types of coordinate systems, and each cluster contains all atom types from this group, with one exception as outlined below: 24___ (ZaXb system), 25___ (XabYa system), 26___ (the majority of rotations from the ZabXc system), 27___ (ZabcXa system), and 28___ (some of the rotations from the ZabXc system).
The presence of two clusters for the ZabXc system is due to an incorrect flip of the sign of the P32 multipole in some rotations for the atom types N401a, N401b, N402, and N403. Within each cluster, the MM parameters have sample standard deviations smaller than the desired threshold, whereas their mean values clearly follow the electron density shape of sp3 hybridized atoms and fulfill the requirements for the highest possible symmetry in the particular coordinate system. Only one atom type from the 4n group, N404, has inconsistent parameters for some of the rotations.

Appendix 7. Oxygen atom types
Oxygen atom types belong to the 1p, 2p, or 3n group and cluster in a complicated way, which can be seen in Figure S8. None of the oxygen atom types was labeled as inconsistent. 44 atom types from the 1p group are divided into two main clusters that differ from each other in the P22 and Pval parameters. The first cluster, labeled 2_0_1_0, includes 26 atom types from the 1p group in the ZaXb system, for which P10 has a negative value and P22 a mostly positive value, but it also includes atom types from the 2p group in the ZabYa system. The mean value of Pval equals 6.163 e, or 6.126 e when the 2p group is excluded. The second cluster of oxygen 1p types (7 atom types, cluster 2_0_2_0) includes atom types for which P10 still has a negative value but P22 is zero and Pval is much larger, with a mean value of 6.273 e. The rest of the atom types from the 1p group either form distinct clusters (O101, O189) or occur in pairs (O121 and O1999, O370 and O122f, O113 and O998, O371 and O372).
26 out of 36 oxygen atom types from the 2p group cluster together in the ZabYa system (cluster 2_0_1_0, all rotations of the ZabYa system), the exceptions being the O001, O210, O211b, O257, O258, O272, and O793 atom types. As mentioned above, this is the same cluster that contains the majority of oxygen 1p types. The co-clustering of 1p and 2p oxygen types from two different coordinate systems can be understood by looking at the pictures visualizing the local coordinate systems (see Figure 3 or Figure S1) and remembering that 2p is derived from 4n with lone electron pairs in place of atoms c and d, and 1p is derived from 3p with lone electron pairs in place of atoms b and c. Such a combination of the number of first neighbors and the local coordinate system orientation makes it possible to orient the system the same way with respect to the lone electron pairs. The existence of such mixed clusters suggests that these oxygen types do not contribute any electron density to bonding with their covalent neighbors, but do contribute to the lone electron pair density. Nevertheless, the departure of oxygen types from sphericity is relatively small: their Plm values are much smaller than those observed for carbon types. The unique atom types from the 2p group are O001 (an oxygen atom from an H2O molecule), O210, and O272, the latter being the only oxygen atom type that belongs to a 3-membered ring.
Lastly, there are only two atom types that belong to the 3n group: O324 (in the oxonium ion H5O2+), which is unique, and O323 (H3O+), which in the XabYa system shows similarity to the 2p group, being in the 2_3_0_ cluster.

Appendix 8. Phosphorus atom types
Phosphorus atom types belong to either the 4n or the 6n group, the latter including only one atom type, P601, which naturally is unique. The remaining eleven types can be divided into two groups that are strongly differentiated by the signs and values of the multipoles already at the first level of clustering (see clusters from 29___ to 79___). One group includes atom types P401, P409a, P409c, and P410. The second group includes atom types P402, P403, P404a, P405, P407, and P408. Both are shown in Figure S9. Interestingly, P404b is an undecided atom type that can be similar to either group depending on the coordinate system. The fact that phosphorus types are highly divided even at the first level of clustering (they do not form any large cluster) might be because the Plm values of phosphorus types are in general among the highest observed for all atom types, so the Eps value that is optimal for the entire dataset is already too small to keep the phosphorus types together. Also, each 4n phosphorus atom type has inconsistencies among its multipole parameters for many rotations.

Appendix 9. Sulfur atom types
A visualization of the density clustering results on a general tree for sulfur atom types is shown in Figure S10. All six sulfur atom types from the 1p group form one cluster (4___). The sample standard deviations of MM parameters for this cluster are small and acceptable. The mean values make it clear that the types do not have cylindrical symmetry: P22 has not averaged to zero, though it has a smaller absolute value than P20 and P30.
The situation is more complicated for the remaining groups of sulfur atom types. In the 2p group, most atom types are together in clusters 2_4__ (ZaXb system), 2_5__ (XabYa system), and 2_6__ (ZabYa system), with the exception of S213 and S220, both of which have C and N as first neighbors. S213 is similar to the main group in the ZaXb and XabYa systems, but in the ZabYa system it becomes more like S220. Again, the sample standard deviations of the parameters are acceptable. The symmetries resulting from the Plm values are the highest that can be easily spotted within each of the coordinate systems. One can conclude that sulfur atom types belonging to the above clusters have electron densities consistent with sp3 hybridization, though the electron density lobes originating from the sulfur atom and directed toward the covalent partners differ from those directed toward the positions of the two lone electron pairs. Clearly, sulfur types are far more aspherical than oxygen types, as analogous multipoles of sulfur types have much higher populations. In the 2p group, there is only one atom type with inconsistent multipole populations: S220.
Next, the 3n group contains only four sulfur atom types, and three of them (S320, S399, S001) cluster together in three of the four possible local coordinate systems: ZaXb, XabYa, and ZabcXa (clusters 12_0__, 13_0__, and 15_0__). In the ZabXc system, each of them forms a separate cluster. Clusters 12_0__ and 13_0__ have acceptable sample standard deviations of their parameters and, as for the sulfur 2p clusters, almost follow the requirements for sp3 hybridization. The symmetries resulting from the Plm values, however, are unusual. For the ZabcXa system, the symmetry is much lower than the maximum possible, m instead of 3m, though the multipoles violating 3m symmetry have smaller populations than those fulfilling it. For the XabYa system, the symmetry is too high. Apparently, this system focuses too much on only two neighboring atoms and does not properly describe the contributions from the third neighbor and the lone electron pair. No inconsistency is observed among the 3n sulfur atom types.