Data Science Approach to Estimate Enthalpy of Formation of Cyclic Hydrocarbons

In spite of increasing importance of cyclic hydrocarbons in various chemical systems, studies on the fundamental properties of these compounds, such as enthalpy of formation, are still scarce. One of the reasons for this is the fact that the estimation of the thermodynamic properties of cyclic hydrocarbon species via cost-effective computational approaches, such as group additivity (GA), has several limitations and challenges. In this study, a machine learning (ML) approach is proposed using a support vector regression (SVR) algorithm to predict the standard enthalpy of formation of cyclic hydrocarbon species. The model is developed based on a thoroughly selected dataset of accurate experimental values of 192 species collected from the literature. The molecular descriptors used as input to the SVR are calculated via alvaDesc software, which computes in total 5255 features classified into 30 categories. The developed SVR model has an average error of approximately 10 kJ/mol. In comparison, the SVR model outperforms the GA approach for complex molecules and can be therefore proposed as a novel data-driven approach to estimate enthalpy values for complex cyclic species. A sensitivity analysis is also conducted to examine the relevant features that play a role in affecting the standard enthalpy of formation of cyclic species. Our species dataset is expected to be updated and expanded as new data are available to develop a more accurate SVR model with broader applicability.


INTRODUCTION
With the development of alternative fuels coming from different sources, as well as new additives from petroleum, cyclic hydrocarbons have become important components of current and future fuels. 1,2 Furthermore, cyclic hydrocarbons, such as polycyclic aromatic hydrocarbons (PAH), are common intermediates in flames that lead to soot formation. Cyclic hydrocarbons are important not only in combustion chemistry but also in other fields; cyclic unsaturated hydrocarbons can lead to the formation of Criegee intermediates 3 and highly oxidized organic compounds, 4 with implications in pollutant formation and climate. Therefore, knowledge of their molecular properties can help build models for atmospheric and combustion modeling. Despite their importance, lesser is known about the oxidation process of cyclic hydrocarbons compared to their aliphatic counterparts, as more theoretical and experimental studies have been concerned with the latter. 5 Nevertheless, additional data for complex cyclic hydrocarbons, saturated and unsaturated, have been derived in the recent years. 6,7 Chemical kinetics and molecular thermochemistry studies are of vital importance for the development of kinetic models, which are useful to gain knowledge of the oxidation properties of fuels. Accurate thermochemical properties are critical for combustion modeling, as has been recently proved by vom Lehn et al., 8 who observed that ignition delay of diethyl ether mixtures is more sensitive to the enthalpy of formation of certain species than to kinetic parameters. Different tools can be used to estimate thermochemical properties. Quantum chemistry calculations can yield accurate predictions but become impractical for large molecules with many heavy atoms. A more computationally feasible alternative, although less accurate, is the group additivity method, 9 which calculates the enthalpy of formation from so-called group contribution values. However, this approach assumes that each group is independent, and the contributions of those groups are additive; this may result in a poor description of the thermochemistry of cyclic species, as ring strain is influenced by more than just immediate neighboring atoms considered when defining group values. To cope with this limitation, ring corrections are often added; however, this alternative group additivity method only yields accurate predictions for cyclic species closely related to ones that are included in the training database, thereby limiting its applicability. Some other difficulties in the implementation of ring corrections have been reported elsewhere. 9,10 Alternative techniques to estimate thermochemical properties are therefore necessary to promote further studies on the oxidation properties of cyclic hydrocarbons. Machine learning (ML) has become a popular tool to predict molecular properties in recent years. ML models for estimating several chemical properties, viz., octane number, 11 flash point, 12,13 cetane number, 12,13 melting point, 14 solubility, and toxicity, 15 have been developed using different input representations. Saldana et al. 14 used ML compounded with geometry optimization for estimation of heat of combustion using the data taken from the DIPPR database 16 and the Yaws' handbook of thermodynamic and physical properties of chemical compounds. 17 Although their dataset included a diverse set of compounds, most of them were aliphatic and thus showed highest errors for cyclic species, with an overall error of 52 kJ/mol. Li et al. 18 used "active learning" to develop a self-training and continuously evolving model for enthalpy of polycyclic species calculated from the low-level density functional theory (DFT) B3LYP/6-31G(2df,p). Since the training dataset was from low-level theory calculations, error in their work could be attributed to both error from the ML model and error from the DFT calculations, compared with experimental values or calculations at a higher level of theory. However, their large initial dataset and self-evolving algorithm provide a platform for predicting enthalpy for a wide spectrum of molecules.
In the present study, we focus on estimating the standard enthalpy of formation at 298 K and 1 atm (hereafter referred to as enthalpy) of cyclic species using ML techniques based on a dataset consisting exclusively of accurate experimental data. Our goal is to develop an ML model that yields accurate enthalpies of cyclic species, which are scarce, to facilitate further studies on the oxidation of cyclic hydrocarbons. Section 2 provides details on the dataset used in this study along with details of the input features used for the ML model.

DATA CURATION
The database used in this work consists of experimental measurements reported by Ghahremanpour et al., 19 CRC, 20 and Minenkov et al. 21 The dataset consists of 192 cyclic hydrocarbon species including both saturated and unsaturated species with up to 14 carbon atoms. Among those 192 species, both constitutional isomers and R and S enantiomers (species with chiral centers) are included, making our ML model able to discern between different kinds of isomers.
Due to the lack of data reported for cyclic hydrocarbons beyond C 14 species, we restricted our dataset to species with up to 14 carbon atoms. Larger species with several rings can assume different structures and conformers that alter their thermochemistry, so training an ML model for such species requires a large and comprehensive dataset. This is the case for C 15 −C 18 hydrocarbons, for which accurate enthalpies are only available for 10 species. In addition, three-membered ring species were also excluded due to their lower importance in combustion and their skewed values of enthalpy due to high cyclic strain energies. Thus, the final dataset consists of cyclic hydrocarbon species with a minimum of four-membered ring and a maximum of 14 carbon atoms. Figure 1 shows the distribution of species present in our dataset with respect to their number of carbon atoms. It should be noted that the present approach allows us to update the training dataset as new data become available to improve the ML model for a wider range of cyclic hydrocarbons. The dataset with the species considered for our ML model, along with the excluded ones, is provided in the Supporting Material.
Several input representation methods have been used in the past for predicting molecular properties, such as Coloumb matrices, 22 various sets of molecular descriptors, 12,14 and convolutions. 15 Coloumb matrices and convolution methods require a molecular fingerprint; training such methods requires a relatively large dataset to find relationships between plain molecular structure and a macroscopic property. On the contrary, molecular descriptors are formulations derived from the molecular structure and can be used for small to large datasets with a varying number of descriptors. Todeschini and Consonni 23 compiled a comprehensive set of molecular descriptors that can be calculated from simplified molecular input line entry system (SMILES) in alvaDesc. 24 Out of this comprehensive set of 5255 molecular descriptors sorted in 30 categories, a relevant subset of 5072 descriptors from 29 categories are used in this study. Duplicate descriptors with identical values, such as alcohol functional group counts, which are zero for all of the species in our dataset, are removed to reduce the number of descriptors to 2478. A process to further downselect among these descriptors is explained in Section 3 along with the ML workflow. The complete dataset used for the ML model containing all of the molecular descriptors is provided in the Supporting Material.

MACHINE LEARNING FRAMEWORK
The ML model development consists of a workflow and model training. The ML workflow comprises division of the dataset for error estimation and final model development. The training consists of using the ML algorithm to train on the dataset following the workflow and then fine-tuning the hyperparameters associated with the algorithm. Sections 3.1 and 3.2 discuss each part in detail. All of the scripts for this study are written in Python using scikit-learn library 25 with TensorFlow 26 backend. All scripts are available in the GitHub repository (github.com/kiranyalamanchi/enthalpy-cyclic-species).
3.1. Workflow. Two commonly used workflows in ML are as follows: (i) divide the dataset into training/validation/test sets used for training the model, then fine-tune the hyperparameters, and estimate the model performance, and (ii) divide the dataset into k sets and use a k-fold crossvalidation (k-fold CV) method to estimate the accuracy of the ML algorithm on the dataset, and then train the final model on the entire dataset. The k-fold CV method is computationally demanding but gives a better representation of model accuracy. This method is adopted herein because of the small dataset size. In the accuracy estimation part, the workflow consists of two loops, an outer loop and an inner loop. First, the entire dataset is divided into k-sets, i = 1 to k, as shown in Figure 2. The outer loop constitutes of i = 1 to k-folds iterated over until each fold has been used as the test set. For each of ith set in the outer loop, the remaining data points expect those in ith set are mixed and again divided into j = 1 to k sets for the inner loop, as shown in Figure 2. The inner loop is used to determine the best hyperparameters for each particular split of the outer loop from a defined search space. For each combination of hyperparameters, all of the j = 1 to k-folds in the inner loop are iterated over until each set has been used as a validation set. The accuracy of the inner j = 1 to k-folds is averaged and used as a metric for the performance of that particular hyperparameter combination. The hyperparameter combination that has the minimum average validation error is used to train a model for that particular ith outer split. This model is trained on the remaining k − 1 folds (leaving ith set from i = 1 to k-sets) as the training set and then tested on the test set (ith set). Since the entire dataset is divided into k sets for the outer loop, we have k errors, which represents the model's performance. Finally, a model is trained on the entire dataset to obtain the final model, which can be used for finding enthalpy of a new species that is not present in the original dataset. A similar cross-validation pipeline was followed in our previous work for the development of ML models for enthalpy of linear species, 27 and more details on adopted k-fold workflow can be found therein. It should be noted that this methodology follows best practice in ML and ensures that the final model is not simply an overfitting of the dataset. 3.2. Model Development. Support vector regression (SVR) and artificial neural network (ANN) are widely used ML algorithms for regression tasks. While ANN performs better for tasks with large datasets, SVR can be very efficient in dealing with smaller ones; 27 hence, the latter is selected for this study. SVR 28 is a regression counterpart developed from support vector machines (SVM), 29 which are large margin classifiers widely used for classification problems. The idea of SVM is to transform a feature space using kernels to create a hyperplane that separates different classes with a large margin. SVR is based on a similar concept except that the hyperplane aims to fit the data. The function of SVR can be mathematically described by eq 1, where ε and ξ i , ξ i * are positive numbers that describe the allowable error (ε) and additional error above ε for a data point i in the training set, respectively. Equation 2 describes the radial basis function (RBF) kernel. Therefore, constraint terms describe the distance of Y i (output of data point i) from the predicted value by the transformed hyperplane using a kernel with parameters ω. This distance is the prediction error and is permitted to be within ε, while the additional ξ i , ξ i * has to be minimized. The term a∥ωa∥ 2 is the regularization term used in the cost function to be minimized to avoid overfitting. The parameters C, ε, and γ are known as hyperparameters, which are adjusted using the validation set, i.e., in the inner loop for the k-fold CV method, as explained in Section 3.1.
The processed dataset consists of 192 species with 2478 features, which is a large amount of features compared to the data points present. Using this entire feature set could cause overfitting and noise. To reduce the number of features, the dataset is divided randomly into 90/10% training/test, and the test set accuracy is used to select the number of features. Note that this split method is just used for feature selection and not for error estimation, for which the k-fold CV workflow is used. Pearson's correlation coefficient 30 is a metric used to measure correlation between vectors and for reduction of features. For a given correlation coefficient threshold, input features with an absolute correlation coefficient exceeding the threshold are removed, and then hyperparameters are tuned on the training set. The model is then used to predict on the test set such that the mean absolute error (MAE) can be obtained, as shown in Figure 3. The plot consists of errors for five different random training/test splits (each shown in different color), which are considered to avoid any bias due to a particular split. All of the test set errors for different splits decrease for a correlation    Table 1 shows the number of descriptors present from each category in the final reduced set of descriptors, the individual list of which is provided in the Supporting Material. These 47 descriptors are used with 192 data points for error estimation using the k-fold CV method, and the results are discussed in Section 4.

RESULTS AND DISCUSSION
Using the reduced set of descriptors, the entire dataset was split randomly into k sets, and the k-fold CV method was applied. It is a common practice to set the value of k to 5, 10, or the size of dataset, which are called 5-fold, 10-fold, and leave one out CV (LOOCV), respectively. Since the dataset in this study is small, using 5-fold would leave out too much data for the model to adequately train. LOOCV is a computationally expensive workflow for very small datasets. Therefore, k is set to 10 for the present study such that there are 10 test sets in the outer loop of the k-fold CV, resulting in 10 sets of errors. The regression (R2) score and MAE for these 10 sets are given in Table 2. An average MAE of 9.71 kJ/mol is achieved using SVR with the 10-fold CV workflow. The highest errors observed for fold numbers 6 and 7 arise from the species with a high prediction error, azulene and cyclohepta-1,3,5-triene, respectively. This is due to their unique structures compared to other species in the dataset, i.e., a seven-membered ring with resonant double bonds.
It is also worth mentioning that the ML model prediction accuracy is more sensitive to the availability of data for species with similar structures than for species with the same number of carbon atoms. This can be explained by C 13 species having lower average error than C 10 species, as shown in Figure 4, despite the fact that the number of C 10 and C 13 species are 23 and 2, respectively. The two C 13 species, 9H-fluorene and diphenylmethane, are planar and comprise mainly benzene rings, for which there are much data available. In contrast, the C 10 species include some unique structures, such as azulene, for which it is more difficult to predict its enthalpy due to a few similar species in the database.
4.1. Comparison with Group Additivity. Bensons group additivity (GA) 9 is an easy and fast method for calculating thermochemical properties. Han et al. 31 refined the GA approach by including ring corrections to account for the complexity of ring structures to improve the performance of GA for cyclic species. Despite this improvement, and due to limitations in the underlying assumptions of GA, this method is only effective for estimating the molecular thermochemistry of relatively simple organic molecules, and its accuracy rapidly decreases as the species become more complex, as was reported by Li et al. 18 A direct estimation of the improvement achieved by our SVR model over the GA approach needs both models to be trained on the same data, but this is beyond the scope of this work. Therefore, in Figure 5, GA values taken from RMG 32 and present SVR predictions are compared to experimental    The Journal of Physical Chemistry A pubs.acs.org/JPCA Article values for the eight C 11 species in the dataset. It is important to mention that the SVR predictions in these comparisons span over all of the folds shown in Table 2. Therefore, every prediction is from a model that was trained on a dataset that does not contain the corresponding predicted species; the species was used for external validation in a test set in the kfold workflow. Both GA (5.77 kJ/mol MAE) and SVR (7.82 kJ/mol MAE) achieve good accuracy on predictions of enthalpy values for molecules with relatively simple structures such as 1-methylnaphthalene, 2-methylnaphthalene, 1-methyl-1,2,3,4-tetrahydronaphthalene, 1,1-dimethylindan, pentamethyl-benzene, and spiro [5.5]undecane. The slightly better performance of GA could be attributed to the fact that the SVR values are from a model that was not trained on these particular species, whereas this may not be the case for GA. SVR outperforms the GA method for molecules with more complex structures, such as 1-methyladamantane and 2methyladamantane. The average error for 1-methyladamantane and 2-methyladamantane is greater for GA (171.54 kJ/mol MAE) and smaller for SVR (17.57 kJ/mol MAE). This is a promising result, indicating that the SVR method yields more accurate predictions for complex cyclic molecules for which the GA approach usually fails, as has been noted by Li et al. 18 Cyclic molecules with multiple rings connected together assume complex and unique geometry that contributes to enthalpy via ring strain, making it difficult for the GA approach to capture complexity mere ring corrections.

Sensitivity Analysis.
To get more insights into the molecular descriptors affecting enthalpy, a sensitivity analysis was performed. The SVR model trained on the whole dataset is used for sensitivity analysis in this section. The perturb method, as discussed in Gevrey et al., 33 was used to find the sensitivity of enthalpy to each of the input features. All of the input features are first assigned their mean, and then one of the input features is increased by 10% of its standard deviation. The output from the SVR model using the modified input, when one of the feature values was increased, is compared to the output with all of the features assigned with their mean values. This difference is then normalized by the 10% of the standard deviation of the feature that is perturbed. Finally, sensitivity of a feature is determined by taking the absolute value of the resulting normalized values. These sensitivity values are used to sort the features, and the top 10 sensitive features are listed in Table 3.
Among the 10 features listed in Table 3, four (nCsp3, nDB, MW, and nTA) are simple constitutional descriptors describing chemical composition of a compound without any information about its molecular geometry or atom con-nectivity. Out of the remaining six, four are ring descriptors (nR03, nR04, Rbrid, and nR05) which describe the number and size of rings as well as the number of ring bridges; these descriptors are crucial to discern among different cycles and their arrangement within the dataset. MAXDN is the maximum of the field effects on the atoms in a molecule due to the perturbation of all other atoms. 34 Mor26u is a 3D MoRSE descriptor calculated from the whole molecule structure and is one of the few descriptors that could discern between species with chiral centers. This descriptor can be used to establish relationships between enthalpy and chirality. Although the Morse descriptor is difficult to interpret, many studies have found them to be useful in quantitative structure− property relationship (QSPR) studies. 35 MAXDN and Mor26U are the only sensitive descriptors that are related to the geometry of molecules, while the remaining are constitutional and ring descriptors.
These findings are in contrast to those of the noncyclic compounds previously studied 27 by us, wherein enthalpy predictions were found to be highly sensitive to P_VSA descriptors, which are related to the polarizability and electronic structure of the species. This indicates that enthalpy of cyclic compounds is highly influenced by the number and size of their cycles, as well as by the arrangement these cycles adopt, that is, by ring descriptors. These conclusions are helpful for the development of further ML models on cyclic and/or linear hydrocarbons, which can benefit from our sensitivity analysis that ranks the importance of different descriptors.

CONCLUSIONS
A data-driven approach based on the SVR algorithm was developed to predict enthalpy values for cyclic hydrocarbons with a dataset of 192 species collected from Ghahremanpour et al., 19 CRC, 20 and Minenkov et al. 21 Molecular descriptors from alvaDesc 24 were used as input features that are generated from the output of SMILES, which are chemical formulas encoded as text strings. A k-fold workflow with an SVR algorithm was used to determine the accuracy with which our ML model could predict standard enthalpy of formation of cyclic hydrocarbons. In comparison to the group additivity method, our ML model performs better for species with complex structures. Sensitivity analysis reveals that simple molecular descriptors reflecting the ring nature and overall size of species play a more important role compared to more complex descriptors involving the shape of a species. Our ML model represents an accurate and computationally practical alternative to well-established GA and quantum chemistry methods for the prediction of enthalpy data of cyclic species, which are scarce in the literature despite their importance in combustion.
Access to more data can greatly enhance the accuracy and predictive capability of the presented ML model. Therefore, it is expected that this ML model will be improved as new and accurate data for enthalpy of cyclic hydrocarbons become available. However, calculating accurate enthalpies for cyclic and large hydrocarbons can be computationally expensive, and methods like active learning used by Li et al. 18 for better design of experiments (DoE) should be considered for this task. Furthermore, it can be useful to train the existing model with large datasets consisting of uniform inputs, as opposed to widely varying features. In this sense, it would be also beneficial to identify a subset of cyclic species specifically relevant to