From Local Atomic Environments to Molecular Information Entropy

The similarity of local atomic environments is an important concept in many machine learning techniques, which find applications in computational chemistry and material science. Here, we present and discuss a connection between the information entropy and the similarity matrix of a molecule. The resulting entropy can be used as a measure of the complexity of a molecule. Exemplarily, we introduce and evaluate two specific choices for defining the similarity: one is based on a SMILES representation of local substructures, and the other is based on the SOAP kernel. By tuning the sensitivity of the latter, we can achieve good agreement between the respective entropies. Finally, we consider the entropy of two molecules in a mixture. The gain of entropy due to the mixing can be used as a similarity measure of the molecules. We compare this measure to the average and best-match kernel. The results indicate a connection between the different approaches and demonstrate the usefulness and broad applicability of the similarity-based entropy approach.


Introduction
The concept of similarity is directly linked to the notion of the complexity of an object: 3][4][5][6] Unfortunately, the different complexity measures are often hardly comparable.
In this article, we establish a connection between the information entropy (in the sense of Shannon 7,8 ) and the similarity matrix constructed from the local atomic environments of a molecule.The resulting expression is analogous to the von-Neumann entropy 9 and can be used as a general framework for quantifying molecular complexity.
Similarity also plays an important role in many machine learning techniques, like kernelridge regression (KRR) or Gaussian process regression (GPR). 10In combination with descriptors of local atomic environments, 11,12 those methods have been very successful in different areas of computational chemistry and material science. 13The atomic environments are usually defined in terms of all atoms within a certain distance of a reference atom.Suitable descriptors are then found, for example, from an expansion of the density of atoms in the environment into radial basis functions and spherical harmonics, i.e. using a smooth overlap of atomic positions (SOAP). 14The kernel function entering KRR or GPR is calculated from such descriptors and can be interpreted as a measure of the similarity of two local atomic environments.Comparing all pairs of environments in this way leads to the similarity matrix of the specific molecule.On the other hand, for learning and predicting the global properties of molecules, one can introduce the similarity between different molecules. 15The latter can also be constructed from the similarity matrix of the local atomic environments. 16 order to calculate the molecular information entropy, we present two specific choices for defining a similarity function of local atomic environments: One is based on a graph representation of the molecule, choosing substructures around a reference atom and comparing the resulting SMILES strings. 17,18The second approach facilitates the aforementioned SOAP similarity kernel and thus uses the positions and atomic numbers of the atoms.Both approaches are suitable for automatized computational studies, and we present results for a selection of molecules from the QM9 dataset. 19,20nally, we investigate the information entropy for pairs of molecules which leads to the mixing entropy.The latter is the maximal gain of information entropy upon mixing two molecules. 6Based on this observation, we propose and construct a new similarity measure of molecules and compare it to previously studied kernels. 16Our results demonstrate the usefulness and broad applicability of the similarity-based entropy approach.

Information Entropy
Typically, information entropy is considered in contexts involving some kind of 'experiment' or 'process', where each time one of the events A 1 , A 2 , . . ., A n occurs at random. 21Knowing the probabilities p 1 , p 2 , . . ., p n for those events, one can characterize the amount of uncertainty about the outcomes by introducing the Shannon entropy 7,8 according to The logarithm can be taken with respect to any base, but it is usually assumed to be base two.Moreover, we take p i log p i = 0 if p i = 0.One readily sees, that the entropy vanishes if the probability for one event is one and the others are zero accordingly.This describes an experiment with no uncertainty, because the outcome would always be the same.On the other hand, one has maximum uncertainty and thus maximal entropy if all events have the same probability p i = 1/n.In this case, H = log n.
In the context of molecules and graphs 1,22 a different point of view might be more suitable.
If we decompose the molecule (or a graph) into n different parts (e.g., its atoms or vertices) and assign each of them with one of the equivalence classes (e.g., atom types) A 1 , A 2 , . . ., A m (m ≤ n), then we can construct a finite scheme by associating the probability p i = n i /n to the respective class.The number of parts we found for each class is denoted by n i , i.e., m i=1 n i = n.The entropy given by Eq. ( 1) can be viewed as a measure of the complexity of the object.If all parts belong to the same class (n 1 = n), the complexity is zero.Conversely, if all parts belong to a different class (n i = 1 and n = m), then the complexity is maximal for that particular system.

Information Entropy from Similarity
To obtain a connection between information entropy of a molecule M and the similarities of its atoms, we start from a similarity function as follows, If two atoms k and l in the molecule are chemically, or otherwise, equivalent, the function yields 1, and otherwise it gives 0. Using the similarity function for all pairs of atoms yields the similarity matrix S(M) of the molecule.This matrix is symmetric, positive semi-definite and its trace equals the number of atoms in the molecule.
By permuting rows and columns of a similarity matrix of size n × n arising from Eq. (2), it can always be written as a direct sum of matrices of ones, i.e., S ≃ The similarity matrix becomes block-diagonal with m blocks of size n i × n i , respectively.
Consequently, it has m non-zero eigenvalues, namely {n 1 , n 2 , . . ., n m } and n − m eigenvalues which are zero 1 .This suggests that we can obtain a finite scheme directly from the similarity matrix since the non-zero eigenvalues of S divided by n yield the required probabilities p i = n i /n.Moreover, we can directly calculate the associated entropy Here, Tr denotes the trace of the matrix and log is the matrix logarithm.The last expression is analogous to the von-Neumann entropy 9 which is not only used in quantum mechanics, but also in the context of complex networks. 23This analogy suggests that we can generalize the similarity function Eq. ( 2) to any positive semi-definite and symmetric function with a value range 0 ≤ S ≤ 1.The expression for the entropy remains unchanged.
We can also introduce the linear entropy by expanding the logarithm to the lowest order: It is given in terms of the average of the squared elements of the similarity matrix.If the latter only contains zeros and ones, the latter average is equal to the average of the elements themselves.While being approximate, this expression can be easier to calculate, especially for large similarity matrices.
Of course, the main question is how to find a suitable similarity function.From a chemical point of view one might have different options for specifying chemical equivalence 3,5 or one might use graph-theoretic concepts. 4,22In the following, two approaches are presented and discussed which are also motivated by applicability in computational settings.

Substructure-SMILES Similarity
As a first approach to find the similarity of atomic environments, we use a graph representation of the molecule under consideration.For each atom of the molecule, we select a subgraph which involves all atoms which are connected to the reference atom by at most N bonds.For example, N = 1 would entail the atom itself and the bonded neighbors.Each subgraph is then converted to a (canonical) SMILES string 17,18 with the reference atom as starting point.In practice, we use rdkit 24 to generate the environments and the SMILES strings.The similarity function is then simply defined via the comparison of the respective SMILES strings of the subgraphs, i.e., Instead of exact string comparison, one can also utilize any other string or SMILES similarity method. 25gure 1 illustrates the atomic environments of one of the carbon atoms in ethanol for different values of N .For N = 0, the SMILES strings only contain the element symbol of the respective atom and therefore all hydrogen atoms, and all carbon atoms are found to be identical (lower right block in the similarity matrix).Already for N = 1, this is no longer the case as the environments take neighboring atoms into account and the hydrogen atom bound to the oxygen is considered as distinct from the other hydrogens.The similarity matrices show the increasing differentiation between the different atoms until convergence is reached for N ≥ 2 in this case.

N
Figure 2: Molecular information entropies H SMILES calculated from the substructure-SMILES similarity approach for the molecules denoted on the upper axis.The values are obtained for different sizes of the atomic environments (N = 0, 1, 2, 3) and compared to values from literature (topmost row). 26,27Results

Molecular Information Entropies
First, we consider the molecular information entropies calculated from the substructure-SMILES similarity approach.To this end, we collected 13 small molecules for which the information entropies are known based on symmetry and chemical intuition. 26,27Figure 2 shows a comparison of the entropies obtained from Eq. (3) for different sizes N of the atomic environment.Consistent with the observation in Sec.2.3, the entropy increases with increasing N for some molecules until it converges to the expected value.In those cases, larger environments are necessary to achieve a sufficient differentiability between all the environments (as shown in Fig. 1 for CCO).
In the next step, we use the ground state geometries of the 13 molecules mentioned above from the QM9 dataset 19,20 and calculate the SOAP descriptors for each atom using dscribe 28,29 (with r cut = 6 Å, n max = 10, and l max = 6).The similarity matrix and entropy are then calculated via Eqs.( 8) and (3), respectively, for a given sensitivity exponent ζ.The different plots show SOAP entropies with increasing sensitivity exponent ζ.Perfect matching is indicated by the straight lines.
the substructure-SMILES similarity approach (with N = 3).As the sensitivity increases for larger ζ, the entropies also increase since the similarity function further distinguishes the atomic environments.Apparently, the entropies do not converge in all cases to the SMILES entropies.The reason is that the functional form of the similarity function in Eq.
(5) suppresses all entries which are ≤ 1 with increasing N which leads in the extreme case to all environments being dissimilar and yields the maximum entropy.
To further characterize the similarities arising from the two approaches, we consider the substructure-SMILES similarity as references and compute the Kullback-Leibler (KL) divergence 30 for the SOAP similarity of the same molecule, The KL divergence is a positive function and becomes zero if the two similarity matrices are identical.As an illustration, we take the first 184 molecules from the QM9 dataset  observations.Then, at ζ ≈ 64 it shows a minimum implying that the respective SOAP similarities on average match best to the SMILES similarities.For larger exponents, the KL divergence increases slowly as more and more entries in S SOAP are vanishing.Figure 4 also shows the comparison of the entropies obtained from the SOAP similarities for ζ = 64 and the SMILES-based similarities.We observe a good but not perfect agreement, which can again be attributed to the sweeping effect of the sensitivity exponent.It should be noted that the binary nature of the SMILES similarity is an extreme case which requires an atypical high sensitivity.

Mixing Entropy and Molecular Similarity
One important question concerns the complexity or information entropy of mixtures of molecules as this directly applies to chemical reactions. 26,27,31We take two molecules, M I and M II , and suppose that we have a similarity function S which can be applied to any pair of atomic environments.The resulting similarity matrix S(M I , M II ) = S I,II has a 2×2 block structure with the diagonal blocks (S I and S II ) referring to the similarity between parts of each individual molecule, and the off-diagonal blocks (S I,II = S T II,I ) refer to similarity between environments belonging to different molecules.
If the two molecules are identical, all four blocks are equal to S I and the non-vanishing eigenvalues of S I,II become those of 2S I .Since the total number of parts is 2n I , the entropy of the two molecules is the same as the one for the individual molecule.In the other case, if the two molecules do not share any equivalent atomic environments, the off-diagonal blocks are filled with zeros, and the similarity matrix can be written as a direct sum S I+II = S I ⊕S II .
In this case, the entropy becomes a weighted average of the individual entropies plus a term which can be called entropy of mixing, 26,27 with For n I = n II , the mixing entropy is equal to log 2 (or one bit per atom).For any pair of molecules, we can define the gain of entropy due to mixing via In general, this quantity will have values between 0 and H mix ≤ 1.
To illustrate the concept of mixing entropy, we take all pairs of the first 184 molecules from the QM9 dataset and calculate the combined similarity matrices S I,II and the corresponding entropies H(S I+II ).In Fig. 5 the latter are compared to the mixing entropies H mix (n I , n II ) for the molecules using the substructure-SMILES approach to calculate the similarities (with N = 3).For the data points on the diagonal, the molecules do not share equivalent atomic environments and ∆H = H mix .However, there are a number of molecular pairs for which  the entropy is smaller than the mixing entropy.
The results for the mixing entropies suggest that one can use the ratio of the gain of entropy due to mixing ∆H(M I , M II ) and the mixing entropy H mix (n I , n II ) as a measure for similarity of two molecules.This ratio is zero for two identical molecules and one for molecules not sharing atomic environments.Other approaches to compare two molecules based on similarities of local atomic environments include the average structural kernel and the best-match structural kernel. 16The former is calculated by averaging the elements of the off-diagonal block in S I,II , i.e., The best-match kernel can be formulated in terms of a rectangular assignment problem, 16,32 The matrix X contains only zeros and ones and additionally its elements are subject to j X ij = 1 ∀i and i X ij ≤ 1 ∀j.If N I > N II , the arguments M I and M II have to be interchanged.Here, we have generalized the two kernels by allowing arbitrary powers p of the entries in S to be used while in Ref. 16 one is summing the elements of the similarity matrix (p = 1) in both kernels.The reason for introducing p is given by the expression of the linear entropy in Eq. ( 4).There, only squared elements of the similarity matrix appear which suggests that taking the sum over the squared elements in K(p) or K(p) , i.e. p = 2, is more natural when comparing to the entropy-based similarity as also shown below.
Taking again the pairs of the 184 molecules from before, we compute the different molecular similarities for the substructure-SMILES and the SOAP approaches and compare them in Fig. 6.One sees that for the SMILES approach, the results for p = 1 and p = 2 are identical, which is expected since S ij = S 2 ij in that case.On the other hand, for the SOAP approach one sees a clear dependence on p.The best-match kernel with p = 2 yields similarities which agree on average with the entropy-based similarities.In contrast, the p = 1 kernel gives rise to a systematic nonlinear deviation.With the average kernels one obtains similarities which are rather different from the entropy-based measure.In either case, it appears that the kernels with p = 2 have a correspondence with the entropy-based similarities which is close to linear.The results show that the considered molecular similarity measures are related despite their different starting points.

Conclusions
In summary, we have presented and discussed a connection between the information entropy of a molecule and the similarity matrix of its local atomic environments.The entropy can be obtained from an expression, given by Eq. ( 3), which is akin to the von-Neumann entropy.
This relation provides a convenient framework for calculating the information entropy and also for analyzing its properties.
Two approaches for obtaining the similarity of atomic environments were given.The substructure-SMILES method is based on a graph representation of the molecule.The similarity is defined in terms of a comparison of SMILES strings for substructures.For a set of molecules, the entropies were shown to correspond to the known and expected values if the size of the substructures was sufficiently large.The second approach was based on SOAP descriptors which were obtained from positions and atomic numbers of atoms in a molecule.
The sensitivity of the resulting similarities can be tuned by an integer exponent ζ, and we found a good agreement between SOAP and SMILES based entropies for larger values of ζ.
It should be noted that any value < 1 in the similarity matrix is suppressed with increasing ζ, so that it is not expected that the values converge to the SMILES-based calculations.But the results show that both approaches can be used to estimate the information entropy of molecules.The choice of similarity function and the tuning of its hyperparameters (like ζ) should be adapted to the given application.
Finally, we have investigated the entropy of a pair of molecules.For identical molecules, this entropy corresponds to the one of the individual molecule.If the molecules do not share any similar atomic environments, the ensemble entropy becomes equal to a weighted sum of the individual entropies plus the mixing entropy which only depends on the number of atoms in each molecule.In general, the entropy of the pair takes values between these two extremes.This was the motivation for defining the similarity of two molecules in terms of the entropy gain due to mixing.We compared this measure to two other similarity kernels (the average structural kernel and the best-match structural kernel) for 184 molecules and found on average a very good agreement with a modified best-match kernel.Thus, the entropy-based molecular similarity provides an alternative measure for comparing molecules with a strong anchoring in the framework of molecular information entropies.

Figure 1 :
Figure 1: Top row: Atomic environments for one of the carbon atoms in ethanol with N = 0, 1, 2 (from left to right).The substructure SMILES strings are given below the structures.Bottom row: Similarity matrices obtained from Eq. (5) for N = 0, 1, 2.

Figure 3 5 HN = 3 , = 16 Figure 3 :
Figure3shows a comparison of the corresponding entropies with the entropies found from

[ 10 are
excluded due to SMILES mismatch between rdkit and QM9] and compute the average KL divergence for different sensitivity exponents.From Fig.4one can see that the KL divergence initially decreases with increasing sensitivity -in accordance with our previous

Figure 4 :
Figure 4: Left: Average Kullback-Leibler divergence ⟨D⟩ of the SMILES and SOAP similarity matrices of 174 molecules.Right: Comparison of entropies obtained from the substructure-SMILES (H SMILES ) and the SOAP approach (H SOAP ) near the minimum of ⟨D⟩ for 174 molecules.Perfect matching is indicated by the straight line.

Figure 5 :
Figure5: Comparison of the gain of entropy due to mixing of two molecules ∆H SMILES and the respective mixing entropy H mix .The similarities are calculated using the substructure-SMILES approach (with N = 3).Perfect matching is indicated by the straight line.