Modified Group Contribution Scheme to Predict the Glass-Transition Temperature of Homopolymers through a Limiting Property Dataset

Previous studies on glass-transition temperature (Tg) prediction mainly focus on developing diverse methods with higher regression accuracy, but very little attention has been paid to the dataset. Generally, a large range of Tg values of a specified polymer could be found in the literature but which one should be selected into a dataset merely depends on the implicit preference rather than a recognized and clear criterion. In this paper, limiting glass-transition temperature (Tg(∞)), a constant value obtained at the infinite number-average molecular weight Mn, was validated to be an adequate bridge index in the Tg prediction models. Furthermore, a new dataset containing 198 polymers was established to predict Tg(∞) using the improved group contribution method and it showed a good correlation (R2 = 0.9925, adjusted R2 = 0.9894). The method could also generate Tg–Mn curves by introducing the Tg(∞) function and provide more information to polymer scientists and engineers for material selection, product design, and synthesis.


INTRODUCTION
The glass-transition temperature (Tg) marks the temperature at which an amorphous polymer passes from the rubbery to the glassy state on cooling. As rheological, mechanical, and dielectric properties change dramatically when the operating temperature passes through Tg, Tg plays a vital role in a wide range of scientific and industrial processes. To explore the correlation between Tg and the chain structure of the polymer, several attempts have been made using experiments, 1 molecular dynamics simulation, 2,3 and quantitative structure−property relationships. 4 Undoubtedly, reliable and quick property prediction for polymers that have not yet been synthesized is desirable, as it saves time and resources in basic research and industrial product development. In the Tg prediction literature, two types of methods are well-recognized and under continuous development: theoretical methods and semi-empirical methods. Theoretical methods are based on molecular descriptors of polymers, and many methods and tools have been introduced to predict Tg with high accuracy, such as connectivity indices, 5 the CODESSA program, 6 and an approach that treats the polymer repeating unit as a set of anharmonic oscillators. 7 In the last few years, neural network-based quantitative structure−property relationship (QSPR) methods have also come into vogue; 8−11 these methods perform better when many factors influence the target and large amounts of data are available.
Among the theoretical methods, the effect of the polymer structure on Tg has been studied systematically by Bicerano 5 and Askadskii. 7 Bicerano's approach is based on the principal descriptors of the polymer repeat unit structure, where bond indices are introduced to correlate with Tg. The method can predict Tg of polymers containing nine elements: C, H, O, N, F, Si, S, Cl, and Br. Askadskii's approach considers the repeat unit structure as a set of anharmonic oscillators that describe the thermal motion of atoms in the range of intra- and intermolecular forces, where atomic constants and empirical parameters that are independent of the polymer chemical structure are introduced in the calculation. The number of elements that the predicted polymer may contain extends to fifteen: C, H, O, N, F, Si, S, Cl, Br, I, P, B, As, Sn, and Pb. Subsequently, this method was applied to the prediction of Tg of polymer−solvent systems. 12
The most widely referenced semi-empirical method is the group contribution method, owing to its simplicity and reasonable accuracy. Weyland et al. 13 predicted the glass-transition temperature of polymers using weighted additive contributions of their constitutional structure groups, and this method was later improved by integrating the connectivity index method 14 or a neural network. 15 Over the last decades, the group contribution method has been applied to polymer melts, 16 polymer diluents, 17 copolymers, 18 and polymer networks, 19 and group contribution-based computer-aided molecular design (CAMD) has developed into a powerful approach for polymer product design. The group contribution method in the CAMD approach was first used to screen simple polymer structures 20 and then applied in the fields of drug delivery 21 and rubber polymers, 22 which shows its broad application prospects.
Most of the contributions above focused on method innovation with various descriptors, models, and algorithms, whereas dataset selection and validation have received scant attention. In fact, the Tg values reported in the literature for a specified polymer vary over a wide range. There are two main reasons. First, unlike the properties of small molecules, Tg is highly structure-dependent. Many factors besides the repeat unit structure are confirmed to influence Tg: molecular weight distribution, cross-linking, long side chains, chain stiffness, and even the interactions between polymer chains. 23 Second, Tg is more difficult to measure than the normal boiling point or melting point because it depends heavily on both the choice of method and the operation of the test. Even when the measuring method is fixed, the cooling/heating rate and the choice of measuring device also affect the result. The discrepancy among the reported data points for a specified polymer can therefore be very large. Table 1 shows the Tg values of some common polymers found in Bicerano's work, 5 Katritzky's work 6 (value at a high molecular weight), Askadskii's book, 7 Van Krevelen's book, 24 and the PoLyInfo database. 25 As a result of this one-to-many mapping and the implicit selection preference, inconsistencies and conflicts inevitably exist among different datasets, which makes the prediction results less convincing. Therefore, it is necessary to select a more suitable property index and to predict Tg based on a reliable dataset with a reasonable number of consistent data points.
Although several theories and models exist for discussing the factors affecting polymer Tg, such as pressure, 26 cross-linking, 27 and crystallinity, 28 molecular weight distribution still has the most critical influence on property prediction and synthesis/production processes. 29 It is now well-established from a variety of studies that the glass-transition temperature of a polymer increases with the number-average molecular weight (Mn), as proposed by Fox and Flory 30

$$T_g = T_g(\infty) - \frac{K_g}{M_n} \qquad (1)$$

where Tg(∞) is the limiting value of Tg at infinitely high molecular weight and Kg is an empirically determined constant. Equation 1 reflects that as the molecular weight tends to infinity, Tg stabilizes and becomes independent of the molecular weight distribution. Thus, compared with the many Tg values gathered under the previous data extraction strategy, Tg(∞) is consistent and unique, and it can be used in the dataset as a more convincing reference property index. In addition, given the relationship in eq 1, Tg of a specified polymer at different number-average molecular weights can also be predicted, which provides more valuable information for decision-making in scientific research and industrial development.
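As a minimal sketch of the Fox−Flory relationship described above, the following Python function evaluates Tg at a given Mn, assuming the standard form Tg = Tg(∞) − Kg/Mn. The function name and the numerical inputs are hypothetical and chosen only for illustration; they are not values from the dataset.

```python
def fox_flory_tg(tg_inf: float, k_g: float, m_n: float) -> float:
    """Return Tg (K) at number-average molecular weight m_n (g/mol).

    Assumes the Fox-Flory form Tg = Tg(inf) - Kg / Mn, with k_g in K*g/mol.
    """
    if m_n <= 0:
        raise ValueError("m_n must be positive")
    return tg_inf - k_g / m_n


# Hypothetical inputs for illustration only (not taken from the dataset).
TG_INF = 373.0   # K
K_G = 1.0e5      # K*g/mol

for m_n in (5e3, 2e4, 1e5, 1e6):
    tg = fox_flory_tg(TG_INF, K_G, m_n)
    print(f"Mn = {m_n:9.0f} g/mol -> Tg = {tg:6.1f} K")
```

The printed values approach Tg(∞) from below as Mn grows, which is the asymptotic behavior the text relies on.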
In this work, the limiting value Tg(∞) is adopted as the bridge index in Tg prediction models, which essentially establishes a one-to-one mapping between the polymer repeat unit structure and the property. A dataset containing 198 polymers is established through a standard selection criterion and detailed data extraction and treatment procedures. For the purpose of demonstration, a new regression on the proposed dataset is performed with the group contribution method. Finally, the comparison with other methods and the Tg−Mn curves are discussed.

RESULTS AND DISCUSSION
2.1. Determining Tg(∞) to be the Referenced Value in Prediction. Previous scholars have gathered data from the literature to find relationships between Tg and Mn. 5,31 To further demonstrate the applicability of the relationship in eq 1, we searched for Tg values with Mn information in the database 25 and the literature. 32−40 Seventy-two polymers were found with sufficient data points, of which 32 are shown with their regression results in Figure 1 (all polymers are listed in Table S1 in the Supporting Information). Although controversy has arisen over the correlation curve between Tg and Mn, 41−43 our results show that Tg increases asymptotically toward a constant limiting value with increasing Mn in the overwhelming majority of the cases with sufficient data points. Even though the existence of Tg(∞) has not been proved rigorously, Tg(∞) can still serve as an adequate index for preliminary property prediction by polymer scientists and engineers.
In addition to the empirical evidence, the theoretical basis of eq 1 comes from free volume theory. 30 The asymptotic increase of Tg with molecular weight can be explained by the impact of end groups or chain entanglement. 44 End groups sit at the two ends of the polymer chain and introduce additional free volume, which leads to a decrease in Tg. 31 When the molecular weight is low, the effect of end groups is significant, and the type of chain end also influences Tg. 45 As the molecular weight increases, however, the proportion of end groups decreases, and once the molecular weight is high enough, this proportion can be neglected and Tg approaches Tg(∞). Experiments on polystyrene 43 and poly(methyl methacrylate) 46 indicated that the type of chain end has little effect on Tg(∞), whether the polymer is linear or cyclic. In previous studies, the group contribution method was used to predict the common Tg value, and only repeat unit structures were used as molecular descriptors. However, as the effect of the end groups is eliminated at infinite molecular weight, this method is validated to be more suitable for Tg(∞) prediction.
2.2. Dataset Generation. The next step is to allocate a reliable Tg(∞) value to each candidate polymer in the dataset. Unlike the directly measured Tg, Tg(∞) requires additional processing of a series of raw Tg data obtained over a wide range of Mn. On the one hand, searching for reported Tg values through a large body of papers is cumbersome and tedious. On the other hand, information quality and multidimensionality (reporting Tg, Mn, test method, and parameters simultaneously) should be checked to ensure that the generated Tg(∞) values are trustworthy. Taking all factors into account, only homopolymers without any postprocessing are studied in this work.
2.2.1. Data Sources. The authors hold that scholarly journal articles with rigorous peer review are the most reliable sources. However, a direct search in Google Scholar and/or SciFinder is rather like looking for a needle in a haystack. As a guide, the following three sources were adopted in the preanalysis stage before literature retrieval and reading: Bicerano, 5 Katritzky et al., 6 and PoLyInfo. 25 Bicerano gathered Tg−Mn relationships of 35 polymers, and Katritzky reported 88 un-cross-linked homopolymers with high molecular weights. PoLyInfo contains over 16 000 homopolymers with polymer structure, synthesis, and measurement information, and it is the foundation of our searching, sorting, and data analytics. It should be noted that verifying all data against the original literature is an essential step.

2.2.2. Screening Criteria. According to its definition, Tg(∞) can be generated in two ways: regression and approximation. The former is obtained strictly by calculation and is preferred by the authors, while the latter adds more polymers empirically to meet the requirements of group contribution methods on the number and kind of data points. The information on Tg and molecular weight has been checked to ensure that outliers do not affect the performance of our prediction model. All of the selected polymers are linear homopolymers, which eliminates the influence of cross-linking. The influence of the measuring method cannot be ignored, so this information is also recorded in Table S1 in the Supporting Information.
2.2.2.1. Regression-Set 1 and Set 2. For polymers that have more than five data points with number-average molecular weight information in our collection, regression based on eq 1 was carried out to determine the relationship and find the limiting values. If R2 is higher than 0.8, the polymer is assigned to Set 1 (the most convincing data group), while Set 2 receives the regression values with R2 between 0.5 and 0.8. These values are listed in the "Regression value" column of Table S1.
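The regression and screening step described above can be sketched as follows. This is only an illustration under stated assumptions: eq 1 is taken in the form Tg = Tg(∞) − Kg/Mn and fitted by ordinary least squares of Tg against 1/Mn (the text does not specify the fitting procedure), the handling of R2 values exactly at 0.5 and 0.8 is a guess, and the function names and data points are hypothetical.

```python
def fit_fox_flory(m_n, t_g):
    """Ordinary least-squares fit of Tg = Tg_inf - Kg / Mn.

    Returns (Tg_inf, Kg, R2). Assumes at least two distinct Mn values
    and Tg values that are not all identical.
    """
    x = [1.0 / m for m in m_n]          # regress against 1/Mn
    y = [float(t) for t in t_g]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return intercept, -slope, 1.0 - ss_res / ss_tot


def assign_set(n_points, r2):
    """Screening rule from the text; boundary handling is an assumption."""
    if n_points <= 5:
        return None          # not enough Mn-resolved points for regression
    if r2 > 0.8:
        return "Set 1"
    if r2 >= 0.5:
        return "Set 2"
    return None


# Synthetic, noise-free points generated from Tg_inf = 373 K, Kg = 1e5 K*g/mol.
mn_points = [2e3, 5e3, 1e4, 2e4, 5e4, 1e5]
tg_points = [373.0 - 1.0e5 / m for m in mn_points]

tg_inf, k_g, r2 = fit_fox_flory(mn_points, tg_points)
print(f"Tg(inf) = {tg_inf:.1f} K, Kg = {k_g:.3g} K*g/mol, R2 = {r2:.4f}")
print(assign_set(len(mn_points), r2))
```

A nonlinear fit or a weighted fit could be substituted for the linear regression on 1/Mn without changing the screening logic.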

ACS Omega
http://pubs.acs.org/journal/acsodf Article
Only 25 of the polymers reported by Bicerano 5 were selected into our dataset, because some of the others are commercial products that are copolymers or homopolymers with long end groups (like the Fomblin series and Demnum); the selected values are named "Bicerano value" in Table S1. The data for polyethylene were also not selected because they come from oligomers.

2.2.2.2. Approximation-Set 3. The approximation relies on two characteristics: Tg(∞) should be higher than the common Tg value, and the Tg values of polymers with high Mn differ very little from Tg(∞). However, no consensus has been reached in the literature as to what value of Mn can be regarded as high. A quantitative analysis of the 45 polymers in Set 1 (Figure 2) indicated that at Mn = 20 kg/mol, Tg reaches or exceeds 0.95, 0.98, and 0.99 times Tg(∞) in over 80, 53.3, and 31.1% of the cases, respectively. Moreover, the Tg values of polymers with Mn = 50 kg/mol are very close to Tg(∞) (deviation < 2%) in 80% of the cases. Thus, in scenarios requiring low precision, the value of Tg (or Tg divided by a margin factor of 0.95, 0.98, or 0.99, chosen according to Mn) for a polymer with a high Mn can be regarded as an acceptable approximation of Tg(∞). Detailed information about this analysis can be found in Table S2.
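To make the margin-factor idea concrete, the sketch below inverts the Fox−Flory form Tg = Tg(∞) − Kg/Mn to find the Mn at which Tg reaches a chosen fraction of Tg(∞), and shows the Tg/margin-factor approximation. The function names and the numerical inputs are hypothetical; the actual Kg and Tg(∞) values for Set 1 are those reported in Table S2.

```python
def mn_at_fraction(tg_inf: float, k_g: float, fraction: float) -> float:
    """Mn (g/mol) at which Tg equals fraction * Tg(inf).

    From Tg = Tg_inf - Kg / Mn with Tg = fraction * Tg_inf:
        Mn = Kg / ((1 - fraction) * Tg_inf)
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    return k_g / ((1.0 - fraction) * tg_inf)


def approximate_tg_inf(t_g: float, margin: float) -> float:
    """Low-precision estimate of Tg(inf) as Tg / margin factor."""
    return t_g / margin


# Hypothetical polymer: Tg(inf) = 373 K, Kg = 1e5 K*g/mol.
for f in (0.95, 0.98, 0.99):
    print(f"Tg reaches {f:.2f} * Tg(inf) at Mn = {mn_at_fraction(373.0, 1.0e5, f):.0f} g/mol")
```

Because the required Mn scales with Kg/Tg(∞), polymers with a large Kg need a higher Mn before a measured Tg is a fair stand-in for Tg(∞).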
For polymers that have sufficient data points in the database (more than five points, though not always with Mn information), we attempt to find the values that are close to the limiting values. If the number of data points for a specified polymer exceeds ten, a histogram generated by PoLyInfo is also provided. As the limiting values are higher than the experimental values for the given polymers, the maximum value among the data points in the collection is considered the candidate limiting value before the Mn check. However, to keep the data clean, maximum points that are far higher than the other points and discontinuous with them for the same polymer were eliminated, because such values may result from the measuring method, pre- or post-treatment, or simply experimental error. These screened values are listed in the "PoLyInfo value" column of Table S1.
To further expand the dataset, data points from Katritzky with high molecular weights (Mn > 20 kg/mol) were also introduced. Owing to the large size and strong electron-withdrawing character of CF3, as well as the simplification of the side chain in our prediction, polymers with CF3 in a long side chain were eliminated. As a result, 85 polymers were selected, and these values are listed in the "Katritzky value" column of Table S1. All of these polymers were allocated to Set 3, and the detailed screening procedure for each polymer in this dataset is provided in the Supporting Information Part 2.
2.3. Dataset Description and Statistics. This version of the dataset consists of 198 homopolymers and is divided into three sets of differing precision. As shown in Figure 3, the dataset covers 16 polymer classes: polyolefins, polystyrenes, polyvinyls, polyacrylics, polyhalo-olefins, polydienes, polyoxides/ethers, polysulfides, polyesters, polyamides, polyimides, polyketones, polycarbonates, polyimines, polysiloxanes, and polyphenylenes. The dataset contains not only long side chain polymers but also long main chain polymers.
As far as we know, it is the largest Tg(∞) dataset for homopolymers, yet we still consider it too small. Regrettably, after searching and sorting thousands of papers over a whole year, these are all the data we could find, because molecular weight information is often missing in the literature. It is well-known that molecular weight is one of the most important basic parameters and affects many other properties, but it is still hard to find it reported alongside those properties. In the hope that this will change, the authors will keep up with the literature and update the dataset periodically. The open-source prediction tools based on the proposed dataset, currently under development, will be released on GitHub.
2.4. Prediction Performance. As mentioned before, the value of Tg is also influenced by the method and the duration of an experiment. Unfortunately, we cannot restrict the dataset to data points obtained with the same measurement and operation because of the lack of quality data in the literature. For this reason, an appropriate assessment criterion is proposed, followed by a detailed analysis. To regress the group contributions to Tg(∞), 58 characteristic groups (including 1 side chain variable) from 198 polymers are selected; the performance of the regression is described using two statistical indicators, R-squared (R2) and mean relative error (MRE).
The experimental and predicted Tg(∞) values of the 198 polymers are listed in Table S4. R2 of Yg(∞) is 0.9925, and the performance is shown in Figure 4. As regular R2 increases with the number of independent variables, adjusted R2 is introduced to avoid misleading results. Generally, the adjusted R2 can be written as follows

$$R^2_{\mathrm{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1} \qquad (2)$$

where n is the number of data points used in the regression and k is the number of polymer groups. The adjusted R2 in this work is 0.9894; this value is only slightly lower than R2, which indicates that the number of polymer groups used is appropriate.
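As a quick check of the reported figures, the short sketch below evaluates the usual adjusted R2 expression, 1 − (1 − R2)(n − 1)/(n − k − 1), with n = 198 polymers and k = 58 groups as stated in the text. The function name is hypothetical, and reading k as the full count of 58 groups is an assumption.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R2 for n data points and k regressors."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


print(round(adjusted_r2(0.9925, n=198, k=58), 4))  # 0.9894
```

Under these assumptions the result agrees with the adjusted R2 quoted above.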
MRE describes the average relative error, that is, the average relative difference between the experimental data X exp and the predicted data X est of property X for the given polymers. The equation can be written as follows

$$\mathrm{MRE} = \frac{1}{N}\sum_{j=1}^{N} \frac{\left| X_{\mathrm{exp},j} - X_{\mathrm{est},j} \right|}{X_{\mathrm{exp},j}} \times 100\% \qquad (3)$$

where N is the number of data points used in the regression. As mentioned above, because the Tg values in the literature come from different measuring methods under different conditions, an error remains even though the influence of the molecular weight on Tg is eliminated by using limiting values. Unlike the normal boiling or melting point, Tg is measured by indirect methods, and results differ greatly with the measuring method and the duration of the experiment, so this error cannot simply be regarded as measurement error. Under these circumstances, it is more suitable to evaluate the prediction performance with R2 than with the MRE or root-mean-square error used elsewhere in the literature.
Table 2 shows the prediction performance of various methods in the literature. Katritzky's method, Bicerano's method, and Askadskii's method are all theoretical methods based on molecular descriptors of polymers; the R2 of Katritzky's method is 0.946, and Bicerano and Askadskii reported 0.9749 and 0.998, respectively. Afantitis's method and Palomba's method apply neural networks on the basis of Katritzky's method, with reported R2 values of 0.9269 and 0.953, respectively. Gani's method and the method in this paper are both based on the group contribution method; Gani et al. reported R2 values of 0.9374 and 0.948 for the improvements achieved with high-order groups and the connectivity index.
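For reference, the mean relative error used in this section can be sketched as below, assuming it is the average of |X exp − X est|/X exp expressed as a percentage. The function name and the sample values are hypothetical.

```python
def mean_relative_error(x_exp, x_est):
    """Mean relative error in percent between experimental and predicted values."""
    if len(x_exp) != len(x_est) or len(x_exp) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(e - p) / abs(e) for e, p in zip(x_exp, x_est))
    return 100.0 * total / len(x_exp)


# Hypothetical Tg(inf) values in K, for illustration only.
experimental = [373.0, 250.0, 420.0]
predicted = [365.0, 262.0, 418.0]
print(f"MRE = {mean_relative_error(experimental, predicted):.2f}%")
```

Because the error is normalized by the experimental value, polymers with a low Tg(∞) contribute a larger relative error for the same absolute deviation in kelvin.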
An R2 of 0.9925 indicates that the proposed model fits the dataset very well. It is worth noting that, strictly speaking, R2 values are not suitable for direct comparison because different datasets are used in different studies. The higher R2 value in our paper is not intended to demonstrate the superiority of the group contribution method. The basic group contribution method is widely used and simple, and it has been shown in the literature not to be the best-performing model. However, as our work focuses on the dataset used for prediction, we want to demonstrate that Tg(∞) is a better index than Tg for use in the model. To better illustrate this idea, we chose this most traditional method rather than a more sophisticated one. Table 2 indicates that even a simple method can achieve sufficiently good performance with a proper index and dataset. More advanced QSPR methods can be expected to further improve the accuracy on the proposed limiting property dataset.
MRE can be used to reflect the accuracy for different polymer classes in the dataset. The total MRE of Tg(∞) is 8.09%: 96 of the 198 polymers have an MRE below 5%, 87 polymers have an MRE between 5 and 20%, and 15 polymers have an MRE above 20%. Regression performance statistics for the main types of polymers are shown in Figure 5. As shown in the figure, the MRE values for polyolefins are much higher than those of the other polymers. This may be due to the following reasons. First, we only define the number of backbone atoms in the side chain rather than the detailed side chain structure. This simplification can improve the generalization ability of the proposed model because it uses fewer independent variables, but the MRE increases for polymers with complicated side chain structures. Second, the literature reports more varied and complicated side chain structures for polyolefins; at least in the proposed dataset, the polyolefin subset has more data points with long and branched side chains than the other subsets. Thus, the side chain simplification leads to a higher MRE for polyolefins. Third, main chain and side chain structures do not have an equal effect on Tg, 1,47 and Van Krevelen's work 24 shows that side chain structures have more influence on Tg for polyolefins than for other polymers. In principle, the more complicated the side chain structure, the greater its influence on the main chain, and the more likely the simplification is to affect the prediction.
Compared with previous Tg prediction methods, this work can also predict Tg together with Mn information. Figure 6 displays the previously predicted values of Tg of four polymers, together with the Tg(∞) values and Tg−Mn curves predicted using our method. As shown, the predicted curves agree well with the experimental values extracted from the database and with the fitting results obtained using the Fox−Flory equation (eq 1). These results also indicate that predicting Tg without molecular weight information (these values are placed on the Tg axis) may confuse the user because of the large deviation between the experimental and predicted values. Thus, predicting Tg−Mn curves can provide practical guidelines for follow-up polymer synthesis and help achieve the target Tg value with fewer experiments.
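The generation of a Tg−Mn curve from a predicted Tg(∞) can be sketched as follows. The sketch assumes the Fox−Flory form Tg = Tg(∞) − Kg/Mn and a power-law estimate Kg = c·Tg(∞)^p. The default values c = 0.002715 and p = 3 are quoted from memory of Bicerano's published correlation and should be treated as an assumption to be checked against the original reference; the function names and the example Tg(∞) are hypothetical.

```python
def k_g_from_tg_inf(tg_inf: float, coeff: float = 0.002715, power: float = 3.0) -> float:
    """Estimate Kg (K*g/mol) from Tg(inf) via an assumed power law Kg = coeff * Tg_inf**power."""
    return coeff * tg_inf ** power


def tg_mn_curve(tg_inf: float, m_n_values, coeff: float = 0.002715, power: float = 3.0):
    """Return Tg (K) at each Mn (g/mol) using Tg = Tg_inf - Kg / Mn."""
    k_g = k_g_from_tg_inf(tg_inf, coeff, power)
    return [tg_inf - k_g / m for m in m_n_values]


# Hypothetical predicted limiting value, for illustration only.
TG_INF = 373.0  # K
mn_grid = [2e3, 5e3, 1e4, 2e4, 5e4, 1e5, 1e6]

for m, tg in zip(mn_grid, tg_mn_curve(TG_INF, mn_grid)):
    print(f"Mn = {m:9.0f} g/mol -> Tg = {tg:6.1f} K")
```

Only Tg(∞) and Mn are needed as inputs, which is why a predicted limiting value is enough to draw the whole curve; a polymer-specific Kg, where one has been reported, can replace the power-law estimate.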

CONCLUSIONS
Because of the large range of glass-transition temperature (Tg) values found in the literature, there is no uniform criterion for which point in that range should be used in prediction. To address this issue, the limiting glass-transition temperature Tg(∞), a single constant value for each polymer, was introduced into the Tg prediction model, and a new dataset of 198 polymers was established through a detailed selection procedure. To build the structure−property relationship, the group contribution method was adopted, in which polymers with long side chains can be predicted using the combination of a basic group and the number of backbone atoms in the side chain. In total, 58 structural groups were used in the regression, and a good correlation was achieved (R2 = 0.9925, adjusted R2 = 0.9894, MRE = 8.09%), with R2 higher than those reported in previous work. Finally, Tg−Mn curves can also be generated by the proposed method, which provides an additional dimension of average molecular weight information.
At present, it is still not easy to find molecular weight information in the literature, which makes it difficult to determine Tg(∞) and the Tg−Mn relationship for more polymers. The authors note that multidimensional experimental data not only support the researchers' own conclusions convincingly but also lay a foundation for follow-up data mining and knowledge discovery. In addition, the proposed methodology could be extended to other molecular weight-dependent properties, such as melting temperature, surface tension, or heat capacity, as well as to other structural parameters, such as degree of crystallinity, cross-linking, and branching.

Several methods currently exist for the prediction of Tg; among these, the group contribution method is simple and frequently used by polymer chemists and engineers. To demonstrate the effectiveness of the proposed methodology and dataset in a practical way, the group contribution method proposed by Van Krevelen is used to predict Tg(∞) with slight modifications. The general form of this method for predicting Tg can be written as follows

$$Y_g = \sum_i Y_{gi} \qquad (4)$$

where Y gi is the glass-transition temperature contribution of group i. In this work, however, to predict the limiting value of the glass-transition temperature, eq 4 is modified to

$$Y_g(\infty) = \sum_i N_i\, Y_{gi}(\infty) + N_b\, Y_{gb} \qquad (5)$$

where Y gi (∞) is the limiting glass-transition temperature contribution of group i and N i is the number of occurrences of group i. Similarly, Y gb and N b are the contribution and the number of backbone atoms in the side chain. Y g (∞) is defined as the limiting molar glass-transition function.
The next issue concerns the form of the left side of eq 5. For eq 4, both Satyanarayana 14 and Van Krevelen 24 have discussed the left-hand function, and the difference between them is whether an additional adjustable parameter, Y g0 , should be included in the function (as it is in the Satyanarayana method; here Mw is the formula weight of the repeat unit, e.g., Mw = 42 g/mol for polypropylene). The Satyanarayana method is meaningful for polymers in which Y g0 can be regarded as the influence of end groups, especially those with low molecular weights. However, as end groups can vary, the additional adjustable parameter Y g0 is not a constant. In the Van Krevelen method, end groups are thought to play a minor role in the whole polymer chain, so the repeat unit structure alone is used to express additive properties. In our work, we chose not to use Y g0 , although it would help improve the accuracy, because the influence of the end groups is eliminated for the limiting glass-transition temperature, as shown above. The group contribution method for estimating Tg(∞) can then be written as

$$T_g(\infty) = \frac{Y_g(\infty)}{M_w} = \frac{\sum_i N_i\, Y_{gi}(\infty) + N_b\, Y_{gb}}{M_w} \qquad (6)$$
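The estimate described above can be sketched in a few lines, assuming that Tg(∞) is obtained by dividing the summed group contributions, plus the side chain term N b · Y gb, by the formula weight Mw of the repeat unit (the Van Krevelen form without Y g0). The function name is hypothetical, and the contribution values below are placeholders for illustration, not the regressed values of Table S3; only Mw = 42 g/mol for polypropylene comes from the text.

```python
def tg_inf_group_contribution(group_counts, contributions, m_w, n_b=0, y_gb=0.0):
    """Estimate Tg(inf) in K from group contributions.

    group_counts:  {group name: occurrence number N_i in the repeat unit}
    contributions: {group name: Y_gi(inf) in K*g/mol}
    m_w:           formula weight of the repeat unit in g/mol
    n_b, y_gb:     number of side chain backbone atoms and their contribution
    """
    if m_w <= 0:
        raise ValueError("m_w must be positive")
    y_g_inf = sum(n_i * contributions[name] for name, n_i in group_counts.items())
    y_g_inf += n_b * y_gb
    return y_g_inf / m_w


# Placeholder contributions in K*g/mol (hypothetical, for illustration only).
placeholder_contributions = {"-CH2-": 2700.0, "-CH(CH3)-": 8000.0}

# Polypropylene repeat unit: one -CH2- and one -CH(CH3)-, Mw = 42 g/mol.
polypropylene = {"-CH2-": 1, "-CH(CH3)-": 1}
tg_inf = tg_inf_group_contribution(polypropylene, placeholder_contributions, m_w=42.0)
print(f"Estimated Tg(inf) = {tg_inf:.1f} K")
```

Keeping the side chain term as a separate argument mirrors the text's treatment of long side chains as a basic group plus a count of backbone atoms.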

4.2. Structure Description and Group Definition Criteria. Both the main chain and side chain exist in the polymer repeat unit structure, as shown in Figure 7. As the effect of the side chain on polymer properties is different from that of the main chain, 1 it is unwise to regard them as equal when defining groups for group contribution.
For the main chain structure, the definition of groups follows Van Krevelen's work, 24 which prefers bivalent and composed groups. In this paper, 36 of the 57 main chain groups are identical to those in Van Krevelen's work, and these are marked in Table S3. As some polymer repeat unit structures in our dataset cannot be fully described using these groups, 18 new groups are created based on Van Krevelen's rules: 24 (1) a long chain may consist mainly of bivalent groups; (2) it is better to regard a composed unit as one structural group. In addition, there are 3 special main chain groups in the group set, which are discussed later in this section.
The effect of structural groups on Tg(∞) differs between the polymer main chain and side chains, and different authors have adopted different strategies for defining the side chain. 47 When the side chain is small, its influence on Tg(∞) can be absorbed by combining it into the main chain group. When the side chain is large, it is treated as follows: a long side chain group contains two parts, the basic group, which comes from the main chain group set, and the side chain. The number of backbone atoms in the side chain, N b , is introduced to describe the length of the side chain, and hydrogen atoms are not counted in N b . The schematic diagram is shown in Figure 8. For groups with a long side chain, the contribution is calculated by combining the main chain and side chain contributions.
In particular, for polyoxides or polyethers, the contribution of the same group may differ depending on its neighboring atoms or groups. Considering the repeat unit structure of polyethers, several special groups are introduced: −O−(end), −O−(oxide), and −(CH2)n−(oxide). A polyether usually has an end oxygen group in the repeat unit structure, as shown in Figure 9a. When calculating Tg(∞), this kind of oxygen group, −O−(end), has a unique contribution value. Specifically, for polyoxides, which consist of only an oxygen group and several saturated hydrocarbon groups (as shown in Figure 9b), one −O−(oxide) group and n −(CH2)n−(oxide) groups are used in the calculation. All of the groups defined in this paper can be found in Table S3 in the Supporting Information.

4.3. Tg−Mn Relationship. Many correlations have been proposed to describe the relationship between Tg and Mn through calculation 27,48,49 and simulation. 50,51 It should be noted that Askadskii deduced a more rigorous Tg−Mn correlation (eq 7) based on Lin's theory. 52 Among these, eq 1 is simple and widely used. However, Kg is a constant that also depends on the polymer structure, so eq 1 can be applied only to polymers whose Kg has been reported or calculated. For a polymer with little data available in the database, eq 1 is inconvenient because Kg cannot be obtained. Several attempts have been made to predict Kg of new polymers whose Tg and Tg(∞) are known. Data from several studies suggest that, because of chain stiffness, Kg is higher for polymers with high Tg(∞) than for those with low Tg(∞), 53 and that Kg is proportional to a power of Tg(∞). Bicerano 5 gathered data from the literature and analyzed them statistically. The relationship is shown in eq 8

$$K_g = 0.002715\, T_g(\infty)^3 \qquad (8)$$

Substituting eq 8 into eq 1, we obtain the desired expression, eq 9

$$T_g = T_g(\infty) - \frac{0.002715\, T_g(\infty)^3}{M_n} \qquad (9)$$

Once the values of Tg(∞) and Mn are obtained, Tg can be predicted using eq 9, and this is why we chose this equation in this work.

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsomega.0c04499.
The dataset containing the limiting values and sources for 198 polymers; the Tg(∞) and Kg values of the polymers in Set 1 and their Mn values when Tg reaches 95, 98, and 99% of Tg(∞); the 58 structural groups with their formula weights and Tg(∞) contributions; and the comparison between the original and predicted Tg(∞) values are presented in the Supporting Information Part I. The detailed Tg(∞) determination for each polymer is presented in the Supporting Information Part II (PDF)