Quantitative Structure–Permittivity Relationship Study of a Series of Polymers

The dielectric constant is an important property that is widely used in many scientific fields and characterizes the degree of polarization of a substance under an external electric field. In this work, a structure–property relationship for the dielectric constants (ε) of a diverse set of polymers was investigated. A transparent, mechanistically explainable model was developed using a machine learning approach that combines a genetic algorithm with multiple linear regression analysis. Based on an evaluation using various validation criteria, four- and eight-variable models were proposed. The best model showed high predictive performance for the training and test sets, with R² values of 0.905 and 0.812, respectively. The statistical performance and the descriptors selected in the best models were analyzed and discussed. The validation procedures applied indicate that the models have good predictive ability and robustness for further applications in polymer permittivity prediction.


■ INTRODUCTION
Polymer properties related to electrical conductivity are useful in many applications, such as cable insulation, 1 capsules for electrical components, interlayer dielectrics, charge-storage capacitors, 2,3 and printed circuit boards. 4 Dielectric permittivity is an important quantity that is widely used and characterizes the degree of polarization of a substance under the action of an external electric field. A larger dielectric constant means a larger polarization of the medium between two charges. The dielectric constant therefore reflects the ability of a substance to separate charge or orient its molecular dipoles in an external electric field. −6 However, exact experimental values of the dielectric constant are often unavailable for polymers. Computational prediction of dielectric constants with theoretical approaches, such as machine learning predictive modeling, is therefore important in the molecular design of new polymeric materials with desired properties. Rapid and accurate predictions for a wide variety of chemical structures can significantly improve the performance and speed of such investigations. However, the theoretical calculation of a property such as the dielectric constant of a polymer is not easy, because the property is nonlinear and depends on several factors, including polymer structure and composition, temperature, material morphology, additives and plasticizers, impurities, and moisture in the bulk of the polymer.
Quantitative structure−activity/property relationship (QSAR/QSPR) modeling is a subfield of machine learning (ML) and chemical informatics that reveals relationships between the chemical structures of molecules and their activities or properties. −11 In cheminformatics, molecular descriptors are numbers that formally represent a molecule, obtained by a well-defined algorithm or a well-defined experimental procedure. In other words, a molecular descriptor is
the result of a mathematical expression that converts the chemical structure to a numerical value. 12 Each molecular descriptor describes a molecular structure by encoding a part of the structure or the whole structure. Molecular descriptors play a fundamental role in the development of QSPR models. One of the main features of the QSPR approach is that it requires only knowledge of the chemical structure and is independent of any experimental properties. Once a correlation is found, it can be applied to predict the properties of new compounds or materials that have not yet been synthesized or discovered. Therefore, the QSPR approach can accelerate the development of new molecules and materials with the required properties.
Using the QSPR approach, many different properties of polymers can be determined with sufficient accuracy. In particular, this approach has already been used to determine properties such as the refractive index, 4,13−21 glass transition temperature, 14,22−33 cohesive energy, 34 thermal decomposition temperature, 35 solubility parameter, 36 and fouling release properties. 37−41 However, few attempts have been made to predict the dielectric constants of polymers. 4,42 Liu et al. 42 introduced a model with a squared correlation coefficient (R²) of 0.908 and a standard error (s) of 0.001 for 22 polyalkenes using three descriptors, but the values of ε in this case cover only the range from 2.154 to 2.165. Bicerano 4 developed a QSPR model with R² = 0.958 and s = 0.087 that correlates ε with 32 topological and constitutional descriptors for 61 polymers. This model fits the data well but contains too many descriptors. The high correlation may be partly due to chance correlations arising from the large number of descriptors in the model and from the use of the whole data set as the training set. Moreover, neither of the two models was validated externally using a test set. In fact, validation is a crucial aspect of any QSPR/QSAR modeling. 
43 The purpose of this study was to develop a reliable predictive QSPR model, built on mechanistically explainable descriptors, that can be used to predict dielectric constant values in further design applications. The model was developed using a set of 71 polymers with large structural diversity and was then validated using specific validation approaches and an external set.

Data Set
The experimental data for polymers 1−56 were taken from Bicerano, 4 and the remaining data (polymers 57−71) were taken from Ku and Liepins; 5 all values refer to room temperature (298 K). In total, the data set for this study consists of 71 polymers with diverse structures (see Table 1). The data set contains polymers of the following types: polyvinyls, polyethylenes, polyoxides, polystyrenes, polyethers, polysulfones, polyacrylonitriles, polyamides, polyacrylates, polysiloxanes, polyxylylenes, and polycarbonates.

Computational Details
In this work, the structures of all polymers were computationally optimized and then used to calculate structural descriptors. Because polymers are macromolecules with a large size and a wide chain length distribution, structural descriptors could not be calculated from the original structural formulas using current descriptor-generating software. 23,30 Moreover, owing to the high molecular weight of the polymers, the effect of the terminal groups on the overall structure is quite small, which allows their contribution to be neglected. 45 Descriptors were therefore calculated for the monomeric units. The molecular structure of each polymer was drawn in ChemSketch software. 46 The optimization of the monomeric units, i.e., geometry optimization and the search for the minimum-energy conformation, is an important step that provides a realistic conformation of the investigated structure for further QSAR modeling. Molecular modeling is often used for optimization and property assessment of various chemical systems. 47−50 In this work, the geometry optimization was carried out in HyperChem software using the MM+ molecular mechanics force field. 51 The convergence criterion for the energy optimization was a gradient of 0.01 kcal/mol. The molecular descriptors for each polymer were calculated from the minimum-energy conformation using DRAGON software. 52 Dragon 6.0 can generate about 5000 descriptors per structure. 
52 The generated descriptors include the following categories: constitutional indices, 2D and 3D matrix-based descriptors, 2D autocorrelations, topological descriptors, indicator descriptors, connectivity indices, information indices, atom-centered fragments, charge-based descriptors, 0D, 2D, and 3D descriptors, molecular properties, and so on. 12 Constant, near-constant, and highly intercorrelated descriptors, which carry no or redundant information, were discarded based on the constant-value, near-constant (R > 0.95), and pair-correlation (R > 0.7) criteria. A total of 523 descriptors of different types remained from about 5000 after these initial filters were applied. Each descriptor represents a molecular graph invariant, describes a particular property, and adds to the overall description of the chemical diversity of the monomeric units.
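The prefiltering step described above can be illustrated with a minimal Python sketch. It is not the procedure implemented in the descriptor software: the function names (`pearson`, `prefilter`) are hypothetical, the greedy "keep the first, drop later correlated ones" order is an assumption, and only the constant-value and pair-correlation (|R| > 0.7) criteria are reproduced, because the text does not define the near-constant criterion precisely.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def prefilter(descriptors, max_pair_r=0.7):
    """Return the names of descriptors that survive the two filters.

    descriptors: dict mapping a descriptor name to its values over all polymers.
    Assumption: descriptors are screened in dictionary order, and a descriptor
    is dropped when it correlates too strongly with one that was already kept.
    """
    kept = []
    for name, values in descriptors.items():
        if max(values) == min(values):
            continue  # constant descriptor: carries no information
        if any(abs(pearson(values, descriptors[k])) > max_pair_r for k in kept):
            continue  # redundant with a descriptor that was already kept
        kept.append(name)
    return kept


# Toy data (made up for illustration, not taken from the study).
toy = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],   # almost perfectly correlated with "a"
    "c": [5.0, 5.0, 5.0, 5.0],   # constant
    "d": [1.0, -1.0, 1.0, -1.0],
}
surviving = prefilter(toy)
```

A different tie-breaking rule (for example, keeping the descriptor of each correlated pair that correlates better with ε) would give a different surviving set, so the order of screening is a design choice rather than a fixed part of the method.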
Model development was performed with QSARINS software 53,54 using the following setup. For the genetic algorithm (GA)-based variable selection step, the number of generations was set to 2000 and a mutation rate of 35% was used. For the selection of the best models, the population size of the final list of models was set to 20. For validation, multiple methods were applied, including leave-one-out (LOO) cross-validation, y-scrambling, and internal and external validation protocols. After the validation techniques were applied, the best model was chosen based on multiple criteria: (1) high R² and Q² values (with R² − Q² < 0.3); 43 (2) a low number of variables in the model; (3) low cross-correlation between the descriptors in the selected model; and (4) the best R² for the external validation set (test set), to avoid model overfitting. 43
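The two internal validation checks named above, LOO cross-validation and y-scrambling, can be sketched in plain Python for a multiple linear regression model. This is an illustrative sketch, not the QSARINS implementation: all function names are hypothetical, Q² is taken here as 1 − PRESS/SS_tot (a common definition that the text does not spell out), and the toy data are made up.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_mlr(X, y):
    """Ordinary least squares with intercept; returns [b0, b1, ..., bp]."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(coef, x):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


def r2(X, y):
    """Coefficient of determination of the fitted model on its own data."""
    coef = fit_mlr(X, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((t - predict(coef, x)) ** 2 for x, t in zip(X, y))
    return 1 - ss_res / sum((t - mean_y) ** 2 for t in y)


def q2_loo(X, y):
    """Leave-one-out Q2: refit without compound i, then predict compound i."""
    mean_y = sum(y) / len(y)
    press = 0.0
    for i in range(len(y)):
        coef = fit_mlr(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coef, X[i])) ** 2
    return 1 - press / sum((t - mean_y) ** 2 for t in y)


def mean_scrambled_r2(X, y, n_rounds=100, seed=0):
    """y-scrambling: average R2 after randomly shuffling the responses."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = y[:]
        rng.shuffle(shuffled)
        scores.append(r2(X, shuffled))
    return sum(scores) / len(scores)


# Toy one-descriptor data (illustrative only).
X = [[float(i)] for i in range(8)]
y = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.8, 15.2]
fit, cv, scr = r2(X, y), q2_loo(X, y), mean_scrambled_r2(X, y)
```

For a model that captures a real relationship, `cv` stays close to `fit` while `scr` drops toward zero; a scrambled R² comparable to the original R² would point to chance correlation.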

■ RESULTS AND DISCUSSION
In this work, a data set of 71 polymers was used to develop a quantitative structure−permittivity relationship model. For model validation, the set was split into training and test sets consisting of 57 (80%) and 14 (20%) polymers, respectively. The splitting was performed with care to ensure that each structural class present in the training set was represented by at least one compound in the test set. The best models were then searched for by iterating a genetic algorithm combined with multiple linear regression analysis (GA-MLRA). After the first round of GA-MLRA, five compounds were found to be outliers with high prediction errors: polymers 62, 63, 66, 67, and 69. After elimination of the outliers, the GA-MLRA iteration was repeated. The remaining set of 66 polymers was split into training and test sets containing 53 (80%) and 13 (20%) polymers, respectively. In the process of finding the best model, several candidate models that correlate best with the dielectric constants of the selected polymers were identified. Two models, with four and eight variables, are proposed; their statistical characteristics are given in Table 2.
Equations 1 and 2 represent the proposed models with four and eight variables, respectively. The four-variable model shows good performance, with R²(train) = 0.842 and R²(test) = 0.715. A graphical representation of the model for the training and test sets is given in Figure 1A. Compared to the four-variable model, the eight-variable model shows better R²(train) and Q² values for the training set, a smaller standard deviation s, and better predictive performance, with a higher R²(test) of 0.812 for the test set. The eight-variable model has a larger number of variables, which can lead to some degree of overfitting, but it remains robust. A graphical representation of the model for the training and test sets is presented in Figure 1B.
Both equations (1 and 2) show satisfactory statistical results that confirm the robustness of these models. However, considering the combined performance for the training and test sets, the second model performs better.
Descriptor selection was performed by GA variable selection followed by MLRA, together with a LOO cross-validation procedure. The number of descriptors in the final QSPR model was determined based on the size of the data set, the correlation coefficients of the training and test sets (R²(train) and R²(test)), the significance criterion F, and the standard errors.
A very important step in assessing a model's reliability is to check its applicability domain (AD). Predictions for a compound can be considered reliable only if the compound lies within the chemical space covered by the developed model. The AD check was performed with the leverage approach, i.e., by evaluating Williams plots for the final models. All data points were within three standardized residuals (±3σ), and almost all were below the critical leverage h*, where h is the leverage (the diagonal element of the HAT matrix). If the standardized residual of a compound exceeded these limits, or if a new compound fell outside the AD, its predicted value could be inaccurate because it would go beyond reasonable extrapolation. Compounds in the data set with h higher than h* are considered structurally influential contributors to the model. 55 As can be seen in the Williams plots (Figure 2), only two polymers in the first model (A) and only one polymer in the second (B) have h values higher than h*. However, these polymers have low residuals, which means that the models are stable enough to make reliable predictions for polymers structurally similar to those in the data set.
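The leverage values plotted on the horizontal axis of a Williams plot can be computed as sketched below. The sketch rests on two assumptions that the text does not state: the leverage is the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ of a model with an intercept, and the critical value is taken as h* = 3(p + 1)/n, a convention commonly used in QSAR work. Function names and toy data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def leverages(X):
    """Diagonal of the hat matrix for a design matrix with an intercept column."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    # h_i = x_i^T (X^T X)^-1 x_i, obtained by solving (X^T X) z = x_i.
    return [sum(a * b for a, b in zip(r, solve(xtx, r))) for r in rows]


def critical_leverage(n_compounds, n_descriptors):
    """Assumed convention: h* = 3 (p + 1) / n."""
    return 3.0 * (n_descriptors + 1) / n_compounds


# Toy one-descriptor data: the last point lies far from the others.
X = [[0.0], [1.0], [2.0], [3.0], [10.0]]
h = leverages(X)
most_influential = max(range(len(h)), key=lambda i: h[i])

# Under the assumed convention, a training set of 53 polymers would give
# h* of about 0.283 for four descriptors and about 0.509 for eight.
h_star_4 = critical_leverage(53, 4)
h_star_8 = critical_leverage(53, 8)
```

A useful sanity check is that the leverages of the training compounds sum to the number of fitted parameters (descriptors plus intercept), so a few structurally unusual compounds necessarily take a large share of the total.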
The obtained models contain the following descriptors: Me: mean atomic Sanderson electronegativity (scaled on carbon atom); AAC: mean information index on atomic composition; R5p+: R maximal autocorrelation of lag 5/weighted by polarizability; JGI1: mean topological charge index of order 1; GATS1p: Geary autocorrelation of lag 1 weighted by polarizability; Mor22v: signal 22/weighted by van der Waals volume; RARS: R matrix average row sum; ESpm11u: spectral moment 11 from edge adjacency matrix; R1v+: R maximal autocorrelation of lag 1/weighted by van der Waals volume; and nCt: number of total tertiary C(sp3).
More information about these descriptors can be found in the Dragon software user's guide 12,52 and the references therein.
As a rule, the value of the F ratio indicates how well the model describes the property values in the training set. The large F values of both equations (64.124 and 52.542 for eqs 1 and 2, respectively) indicate that both equations describe the ε values very well. The adjusted R² values, R²(adj), are 0.829 and 0.888 for eqs 1 and 2, respectively, which denotes very good agreement between the correlation and the variation in the data. The cross-validated correlation coefficients (Q² = 0.813 for eq 1 and Q² = 0.865 for eq 2) demonstrate the robustness of the models. The models were further validated using a y-randomization test. The obtained R²(Yscr) values plotted against the correlation coefficient between the original and shuffled data are shown in Figure 3. Figure 3 shows that the original models are not due to chance correlations, since the R²(Yscr) values are low. It is worth noting that model 1 (eq 1) was more robust in the y-scrambling test than model 2 (eq 2), although both models performed well. The values of ε calculated from eqs 1 and 2 for the training and test sets are shown in Table 1 and Figure 1.
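The adjusted R² and F values quoted above can be cross-checked against the reported R²(train) values with the standard textbook formulas for multiple linear regression. The sketch below assumes n = 53 (the training set size after outlier removal) and p = 4 or 8 descriptors; the function names are hypothetical, and the small differences from the reported F values are presumably due to rounding of R².

```python
def adjusted_r2(r2, n, p):
    """Adjusted R2 for a model with p descriptors fitted to n compounds."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def f_ratio(r2, n, p):
    """F statistic of the regression expressed through R2."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))


N_TRAIN = 53  # training set size after removal of the five outliers

adj_eq1 = adjusted_r2(0.842, N_TRAIN, 4)   # rounds to 0.829
adj_eq2 = adjusted_r2(0.905, N_TRAIN, 8)   # rounds to 0.888
f_eq1 = f_ratio(0.842, N_TRAIN, 4)         # close to the reported 64.124
f_eq2 = f_ratio(0.905, N_TRAIN, 8)         # close to the reported 52.542
```

The adjusted R² penalizes each added descriptor, which is why the gap between R² and R²(adj) is larger for the eight-variable model (0.905 vs 0.888) than for the four-variable model (0.842 vs 0.829).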
Based on the model selection procedure described earlier, the relative contribution of the descriptors to the respective models was determined and is shown in Figure 4. The contributions of the descriptors decrease in the following order: for eq 1, Me > AAC > R5p+ > JGI1; for eq 2, Me > AAC > RARS > R1v+ > GATS1p > ESpm11u > Mor22v > nCt.
One of the most important descriptors involved in both equations is the AAC information index. This descriptor contains information about each atom in a molecule through its own atom type, its bond type, and the atom types of its first neighbors. AAC is a measure of atomic composition associated with molecular complexity. When a molecule is larger and its elemental composition is more complex, the value of the descriptor increases. The positive contribution of this descriptor indicates that polymers with a more complex structure and, accordingly, a larger value of this descriptor would have larger values of ε. Another descriptor, ESpm11u, is based on the use of bond distances as weights in the diagonal entries of the edge matrix.
It is worth noting that the presented QSPR models offer a simple way to predict the permittivity of homopolymers. These models can be improved in future studies by increasing the size of the data set and the variety of polymers. We believe that the results of this study will pave the way for future steps in investigating the electrical conductivity mechanism of polymeric materials.

■ CONCLUSIONS
In this work, a machine learning-based structure−property relationship model for dielectric constants (ε) was developed for a diverse set of polymers. A transparent, mechanistically explainable model was obtained by applying the GA-MLRA approach. This work presents two QSPR models based on descriptors computed from the monomeric structures of the polymers. The reliability of the models was validated using several verification methods. The four- and eight-descriptor QSPR models achieve R² values of 0.842/0.715 and 0.905/0.812, respectively, for the training/test sets. The models are suitable for the further development of polymers with desired dielectric constants based on chemical structure information of the monomers.

Figure 1. Plots of experimental and predicted values of the dielectric constants for the entire data set. Yellow dots are the training set, and blue dots are the test set (A: eq 1; B: eq 2).

Table 1. Set of Experimental and Predicted Dielectric Constant Data for the Polymers Involved in the Experiment

Table 2. Statistical Characteristics of the Four- and Eight-Variable Models