Deep Neural Networks for Multicomponent Molecular Systems

Deep neural networks (DNNs) represent promising approaches to molecular machine learning (ML). However, their applicability remains limited to single-component materials, and a general DNN model capable of handling various multicomponent molecular systems with composition data is still elusive; current ML approaches for multicomponent molecular systems are still molecular descriptor-based. Here, MEIA, a general DNN architecture that extends existing molecular DNN models to multicomponent systems, is proposed. Case studies showed that the MEIA architecture could extend two existing molecular DNN models to multicomponent systems with the same procedure and that, in the best model for each case study, the obtained models could learn both the molecular structure and composition information with accuracies equal to or better than those of a widely used molecular descriptor-based model. Furthermore, the case studies showed that, for ML tasks where the molecular structure information plays a minor role, the performance improvements by DNN models were small, whereas for ML tasks where the molecular structure information plays a major role, the performance improvements were large, and DNN models showed notable predictive accuracies for an extremely sparse dataset, which cannot be modeled without the molecular structure information. The enhanced predictive ability of DNN models for sparse datasets of multicomponent systems will extend the applicability of ML in multicomponent material design. Furthermore, the general capability of MEIA to extend DNN models to multicomponent systems will provide new opportunities to utilize the progress of actively developed single-component DNNs for the modeling of multicomponent systems.


■ INTRODUCTION
Multicomponent molecular systems such as polymer alloys, mixtures, and composite materials are used in various applications because of their multiple functions and tunable properties. The properties of these classes of materials vary with both chemical structures and component compositions, and the freedom in composition means that the design spaces of multicomponent molecular systems are exponentially larger than those of single-component systems, which makes an optimal recipe for multicomponent materials hard to find. To solve this problem, high-throughput screening methods are actively developed and used for a wide range of multicomponent systems including active layers of organic solar cells, 1 electrolyte additives for lithium ion batteries, 2 surface coating materials, 3 and industrial products. 4 Nevertheless, because of the tremendously large design space of multicomponent molecular systems, even high-throughput screening methods are compelled to limit their exploration space to a fixed combination or composition with a moderate number of component materials. In most cases, however, finding better recipes for multicomponent materials requires exploring much larger combination and composition freedoms, and therefore, the primary approach for multicomponent material design is still based on trial and error.
To accelerate the exploration of material design spaces, the material science community has recently paid much attention to machine learning-based data-driven approaches to material design. Although most molecular machine learning studies have focused on single-component systems, 5 attempts have also been made to model properties of multicomponent molecular systems using datasets generated from massive computer simulations 6 or systematic experiments. 7−23 However, it is rarely possible to obtain sufficient datasets for multicomponent materials because of the huge chemical space involved. Therefore, to extend the applicability of machine learning-based approaches to multicomponent material design, machine learning models must handle sparse datasets, which are produced on a trial-and-error basis during the material design process. Such a machine learning task is still challenging for current molecular machine learning models.
Conversely, recent years have seen the accuracy and applicability of molecular machine learning significantly improved by the use of deep neural networks (DNNs) for molecules. 24−33 One of the major advantages of DNNs is the ability to learn rules that generate fixed-size feature vectors suitable for a given task directly from unstructured data such as images and molecular structures. This ability of DNNs is widely called "feature representation learning". Figure 1a shows a typical workflow of the molecular DNNs that are capable of feature representation learning. Here, the overall network is composed of two subnetworks, referred to in this study as embedding and prediction networks, respectively. The embedding network is responsible for feature representation learning, and the prediction network makes a prediction using the fixed-size feature vector generated by the embedding network. Note that prediction networks are usually fully connected neural networks (FCNNs), i.e., classical neural networks. In contrast to DNNs, conventional shallow learning models, such as the support vector machine, random forest, and classical neural networks, require fixed-size vectors as inputs. Therefore, to handle molecular structures with shallow learning models, feature engineering that generates a fixed-size feature vector, called a molecular descriptor, is required before running machine learning models (Figure 1b). This approach is referred to in this study as the "descriptor-based model". Feature representation learning not only simplifies the workflow of machine learning but also enhances the predictive performance of the obtained models, and it has enabled the remarkable success of DNNs in machine learning tasks on unstructured data, including image recognition 34 and machine translation. 35 In the field of chemical science, several embedding networks 24−27,31,32 suitable for feature representation learning from molecular structures have been developed, and they have enabled molecular DNNs to outperform descriptor-based methods on various datasets. 36 Moreover, if there are sufficient data, prediction accuracies of molecular DNN models can be higher than those of density functional theory (DFT) calculations. 31,37 However, most of the molecular DNN models currently proposed are essentially for single-component systems and cannot be directly applied to multicomponent systems because they cannot handle composition information or an arbitrary number of molecular structures. Therefore, most existing approaches for machine learning of multicomponent molecular systems are still descriptor-based 6,7,9−23 (see Figure 1d for typical cases).
Another requirement for molecular DNN models applied to multicomponent molecular systems is permutation invariance of input components, a mathematical property of a model whereby its output is independent of the order of components in its input. For example, for a binary system composed of component materials $z_1$ ($r_1$%) and $z_2$ ($r_2$%), a model should satisfy the following equation:
$$\mathrm{model}(z_1, r_1, z_2, r_2) = \mathrm{model}(z_2, r_2, z_1, r_1)$$
Permutation invariance is a natural requirement for a machine learning model in multicomponent molecular systems with randomly associated components, such as mixtures, random copolymers, and polymer blends, because if the model lacks permutation invariance, its prediction changes artificially depending on the order in which the component material data are input to the model.
Although there is a DNN model that performs feature representation learning from a dataset of two-component systems, binary copolymers, with molecular structure and composition information, 8 this model is not applicable to materials with more than two components and does not have permutation invariance. Meanwhile, there are also DNN models that can handle nonmolecular solid materials with an arbitrary number of component species in a permutation invariant manner; 38,39 however, these models use only the composition information as an input and cannot perform feature representation learning from unstructured data such as molecular structures. It is also possible to treat molecular simulation models of a multicomponent copolymer as a single long-chain monomer, which can be handled by machine learning models for a single molecule; 6 however, such an approach is not applicable to randomly associated experimental copolymers with arbitrary compositions. Therefore, to extend the applicability of feature representation learning of molecular DNNs to multicomponent molecular systems, a new class of models is needed that is capable of learning feature representation and composition information from a multicomponent molecular system with an arbitrary number of unordered component molecules in a permutation invariant manner.
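The permutation invariance requirement described above can be checked numerically. The following minimal sketch uses two toy models that are assumptions for illustration only and are not models from this study: a composition-weighted sum of feature vectors, which is permutation invariant, and a simple concatenation of inputs, which is not. The function names are hypothetical.

```python
from itertools import permutations

def weighted_sum_model(components):
    """Toy permutation-invariant model: composition-weighted sum of features.

    `components` is a list of (feature_vector, fraction) pairs.
    """
    dim = len(components[0][0])
    return [sum(r * z[k] for z, r in components) for k in range(dim)]

def concat_model(components):
    """Toy order-dependent model: concatenates the inputs in the given order."""
    out = []
    for z, r in components:
        out.extend(list(z) + [r])
    return out

def is_permutation_invariant(model, components, tol=1e-9):
    """Return True if every ordering of the components gives the same output."""
    reference = model(components)
    for perm in permutations(components):
        output = model(list(perm))
        if any(abs(a - b) > tol for a, b in zip(output, reference)):
            return False
    return True

# Hypothetical binary system: two feature vectors with fractions 0.3 and 0.7.
binary = [([1.0, 2.0], 0.3), ([4.0, 0.5], 0.7)]
print(is_permutation_invariant(weighted_sum_model, binary))
print(is_permutation_invariant(concat_model, binary))
```

The exhaustive check over all orderings is only practical for a small number of components, which is sufficient for illustrating the property.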
In this study, a general DNN architecture that extends DNN models for single-component systems to multicomponent systems while maintaining permutation invariance is presented. To benefit from the feature representation learning abilities of existing DNN models for single molecules, the proposed DNN architecture uses embedding networks of the existing models to generate a fixed-size feature vector for each component (Figure 1c). Then, a mixing network merges the obtained feature vectors with composition information in a permutation-invariant manner to obtain a single fixed-size vector (Figure 1c), and this vector is finally used as an input for a prediction network to make a prediction. Such DNN models are attractive because, by substituting their embedding networks, one can easily extend the models to any type of unstructured chemical data such as images, spectra, and simulation results. However, no mixing network has yet been proposed that can merge a set of feature vectors with composition information in a permutation invariant manner. Instead, we start by considering a special case of the multicomponent material dataset where the compositions are fixed and only the combinations of molecular components determine the material properties. In this special case, multicomponent molecular system data can be regarded as set data and properly handled by the recently emerging class of DNN models for sets 40−46 proposed in the field of machine learning research, by substituting their embedding networks with those of molecular DNN models. Note that set data is a variable-sized collection of components without any natural order among them. Although the DNN models for sets cannot handle composition information, most of these models are permutation invariant and composed of embedding, mixing, and prediction networks, as with the DNN architecture for multicomponent molecular systems considered here (Figure 1c). 
Therefore, DNN models for sets provide a good starting point for designing a DNN architecture for multicomponent molecular systems.
In the following, the DNN architecture for multicomponent molecular systems is constructed based on the DNN models for sets, and the performances of the obtained models are examined in four case studies. These case studies showed that, by exploiting embedding networks of existing molecular DNN models, the proposed DNN architecture enables learning of both feature representation and composition information and that, in the best model for each case study, its accuracy is equal to or better than that of a widely used molecular descriptor-based model. The performance improvements by DNN models were notable for machine learning tasks where the molecular structure information in multicomponent materials data plays a major role. Furthermore, DNN models showed notable predictive accuracies for an extremely sparse dataset.

■ METHODS
In this section, the permutation-invariant DNN models for sets are first decomposed into common building blocks. These building blocks are then used to construct DNN models for multicomponent molecular systems with composition information.
DNN Models for Sets. Figure 2 shows a network architecture shared by most DNN models for sets, 40−45 which is referred to in this study as the embed-interact−aggregate architecture.
Here, the embed block corresponds to the embedding network in Figure 1c, which generates a set of fixed-size feature vectors, Z, from a set of molecular structures, X. Note that the same network parameters are used for embedding all input molecular structures. The pair of interact and aggregate blocks corresponds to the mixing network, which merges the set of feature vectors, Z, into a single vector, a. The interact block is a neural network that incorporates interactions among the elements of the input set and outputs an updated set of feature vectors, M. Because the interact block does not change the size of the set, this block can be omitted. The aggregate block can be either a neural network or an unlearnable function, such as simple summation, and merges the set of feature vectors, M, into the single vector, a.
More formally, for an input set of molecules, $X = \{x_1, x_2, \ldots, x_N\}$, models based on the embed-interact−aggregate architecture calculate the output value as follows:
$$z_i = \mathrm{embed}(x_i), \qquad m_i = \mathrm{interact}(z_i, \{z_j \mid j \neq i\}), \qquad a = \mathrm{aggregate}(M), \qquad y = \mathrm{predict}(a)$$
where $i = 1, \ldots, N$, $M = \{m_1, m_2, \ldots, m_N\}$, predict denotes the prediction network, and $y$ is the output value. To conserve overall permutation invariance, interact and aggregate blocks are invariant to permutation of feature vectors in $z_j$ ($j \neq i$) and M, respectively. The overall model can be trained using backpropagation. The embed-interact−aggregate architecture-based DNN models have been successfully applied to machine learning tasks on sets such as regression using multiple images. 41,44,45 Brief descriptions of these models are provided in the Supporting Information.
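To make the data flow of the embed-interact−aggregate architecture concrete, the following minimal sketch composes the four steps with toy stand-ins. All functions here are illustrative assumptions: the embedding is a fixed hand-written map rather than a learned graph network, the interaction adds the mean of the other feature vectors, and the prediction is a single linear unit. None of them reproduce the networks used in this study.

```python
def embed(x):
    """Toy embedding: maps a molecule (list of atomic numbers) to a 2-vector."""
    return [float(len(x)), float(sum(x))]

def interact(z_i, others):
    """Toy interaction: add the mean of the other feature vectors to z_i.

    The mean does not depend on the order of `others`.
    """
    if not others:
        return list(z_i)
    mean = [sum(col) / len(others) for col in zip(*others)]
    return [a + b for a, b in zip(z_i, mean)]

def aggregate(M):
    """Unlearnable aggregate block: element-wise summation."""
    return [sum(col) for col in zip(*M)]

def predict(a, weights=(0.5, -0.1), bias=1.0):
    """Toy prediction network: a single linear unit."""
    return sum(w * v for w, v in zip(weights, a)) + bias

def forward(X):
    """Embed each molecule, let the vectors interact, aggregate, then predict."""
    Z = [embed(x) for x in X]  # the same embed function is shared by all inputs
    M = [interact(Z[i], Z[:i] + Z[i + 1:]) for i in range(len(Z))]
    return predict(aggregate(M))

# Hypothetical set of three molecules given as lists of atomic numbers.
X = [[6, 6, 8], [6, 7], [6, 6, 6, 6]]
print(abs(forward(X) - forward(X[::-1])) < 1e-9)
```

Because each block is either applied per element or is order independent, reversing the input set leaves the output unchanged up to floating-point tolerance.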
DNN Models for Multicomponent Molecular Systems. Next, a permutation invariant mix-embed-interact−aggregate (MEIA) architecture is proposed for multicomponent molecular systems, using the building blocks of the embed-interact−aggregate architecture and incorporating composition information into them (Figure 3).
In this architecture, mix functions are introduced to incorporate composition information into interact and/or aggregate blocks of the embed-interact−aggregate architecture. A requirement for the mix function is that it conserves permutation invariance of the overall model, and this is accomplished by individually converting each feature vector in the input set of a building block using the composition information of the corresponding molecular component. The following is an example of such a mix function, concat-mix, which is applied to interact blocks in this study:
$$z'_i = \text{concat-mix}(z_i, r_i) = \mathrm{concatenate}(z_i, r_i), \qquad Z' = \{z'_1, z'_2, \ldots, z'_N\}$$
Here, $R = \{r_1, r_2, \ldots, r_N\}$ is the composition of each component. Such individual conversion of the feature vectors of molecular components in the set obviously does not break overall permutation invariance. In this way, the MEIA architecture can handle an arbitrary number of molecular structures with composition information in a permutation invariant manner.
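As an illustration of why an element-wise mix function preserves permutation invariance, the sketch below applies a concat-mix to each (feature vector, composition) pair and checks that mixing and reordering commute. The helper names and the numerical values are hypothetical and chosen only for this example.

```python
def concat_mix(z, r):
    """Append the composition r of a component to its feature vector z."""
    return list(z) + [r]

def mix_set(Z, R):
    """Apply concat-mix to each component individually."""
    return [concat_mix(z, r) for z, r in zip(Z, R)]

# Hypothetical three-component system.
Z = [[0.2, 1.0], [0.7, -0.3], [1.5, 0.4]]
R = [0.5, 0.3, 0.2]
order = [2, 0, 1]

mixed = mix_set(Z, R)
mixed_then_permuted = [mixed[i] for i in order]
permuted_then_mixed = mix_set([Z[i] for i in order], [R[i] for i in order])
print(mixed_then_permuted == permuted_then_mixed)
```

Since each component is converted independently of the others, reordering the components simply reorders the converted vectors, so any permutation-invariant block applied afterwards remains permutation invariant.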
Combining the building blocks, namely the embed, interact, aggregate, and mix functions, allows a total of seven types of MEIA-based models to be evaluated in the following case studies. These models are shown in Table 1 alongside the interact, aggregate, and mix functions used, and examples of MEIA-based models are shown in Figure 3 for WS, Concat-MHA, Concat-Self-MHA, and WS-Self-MHA models. Note that the embedding network of the graph convolution (GC) model 36 was used as the embed block unless otherwise stated, and all MEIA-based models examined here use an FCNN with a single hidden layer as the prediction network. In Table 1, summation represents the element-wise summation of feature vectors, and weight-mix($m_i$, $r_i$) = $r_i m_i$. Therefore, the mixing network of the WS model, which employs these functions, is a simple weighted summation function (Figure 3 WS). The other six models utilize attention neural networks to improve the treatment of interactions between components. In the field of molecular DNN research, attention neural networks are used to merge a set of feature vectors of atoms in a molecule into a single molecular feature vector incorporating interactions between the atoms. 31 Intuitively, attention neural networks learn a weight for each component in a set of feature vectors and perform weighted summation based on the obtained weights in a permutation invariant manner. Therefore, attention neural networks can be used as aggregate blocks in the MEIA architecture. In the case studies, two popular models of attention neural networks, RNN-based 40 and multihead 47 attentions, were used as aggregate blocks (Concat-MHA and Concat-RNNA in Table 1). As an example of the MEIA-based model with an attention neural network, the neural network architecture of the Concat-MHA model is shown in Figure 3. Note that the only difference between the Concat-MHA and Concat-RNNA models is the choice of the aggregate block.
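The weighted summation of the WS model and the intuition behind attention-based aggregation can be sketched as follows. The `ws_mixing` function follows the definitions of weight-mix and summation given above. The `attention_pool` function is a deliberately simplified single-query dot-product attention with a fixed, hypothetical query vector; it is an assumption for illustration and is neither the multihead nor the RNN-based attention used in this study.

```python
import math

def weight_mix(m, r):
    """weight-mix(m_i, r_i) = r_i * m_i."""
    return [r * v for v in m]

def summation(M):
    """Element-wise summation of a set of feature vectors."""
    return [sum(col) for col in zip(*M)]

def ws_mixing(M, R):
    """Mixing network of the WS model: composition-weighted summation."""
    return summation([weight_mix(m, r) for m, r in zip(M, R)])

def attention_pool(M, query):
    """Simplified attention pooling (illustrative, not MHA or RNNA).

    Scores each vector against `query`, converts the scores to softmax
    weights, and returns the weighted sum of the vectors.
    """
    scores = [sum(q * v for q, v in zip(query, m)) for m in M]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return summation([[w * v for v in m] for w, m in zip(weights, M)])

print(ws_mixing([[1.0, 2.0], [3.0, 4.0]], [0.25, 0.75]))  # [2.5, 3.5]
```

In both functions the weights are attached to individual components before an order-independent sum, which is why the result does not depend on the order of the input set.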
Meanwhile, attention neural networks can also be used to update a set of feature vectors without merging them by updating all vectors in the set based on the weighted sum of vectors generated by these neural networks. This type of attention is called self-attention 47 and can be used as interact blocks in the MEIA architecture. As with aggregate blocks, RNN-based and multihead self-attentions were used as interact blocks (Concat-Self-MHA, WS-Self-MHA, Concat-Self-RNNA, and WS-Self-RNNA in Table 1). As examples of the MEIA-based model with self-attentions, the neural network architectures of Concat-Self-MHA and WS-Self-MHA models are shown in Figure 3. Note that the only difference between models using RNN-based and multihead self-attentions is the choice of the interact block.
31 models are used as embed blocks with their default featurizers to generate the initial atom and bond features. All models were implemented as tensor graph objects in DeepChem and optimized using the L2 loss and ADAM optimizer. 48 To reduce overfitting, dropout and batch normalization layers were introduced to all models examined here (Figures S3−S9).

Figure 3. Schematic of the MEIA architecture for the multicomponent molecular system. Examples of neural network architectures and how they process multiple molecular structures with composition information are shown for WS, Concat-MHA, Concat-Self-MHA, and WS-Self-MHA models. Each colored pillar represents a feature vector. Elements of the feature vectors that have composition information incorporated by mix functions, concat-mix and weight-mix, are highlighted in red.

a "WS" in the names of models stands for the combination of weight-mix and summation. "Concat", "Self", "MHA", and "RNNA" stand for concat-mix, self-attention, multihead attention, and RNN-based attention, respectively.

ACS Omega
http://pubs.acs.org/journal/acsodf Article
Dataset. The chemical compositions in weight fraction are used as composition information. All datasets in the experiments were split into train and test sets, with the former used in training and hyperparameter tuning and the latter used only for performance evaluation. In case study 1, the dataset was randomly split, while in case study 2, the dataset was split according to the setting of "mixture-out validation", where each multicomponent material sample in the test sets comprises an unknown combination of component molecules. 12 In case studies 3 and 4, datasets were split according to the setting of "compound-out validation", where each multicomponent material sample in the test sets contains at least one unknown component molecule. 12
Input Molecular Structure. 2D molecular graphs without hydrogen atoms were used as input molecular structures for neural network models in all case studies. For random copolymers, monomer units (i.e., substructures in polymers) were used as input molecular structures, and the connections between repeating units were not considered. Note that although this simple treatment of monomers may not be generally applicable to all the properties of polymers, this approach is reasonable for modeling the glass transition temperature (Tg) of random copolymers, which can often be modeled based on the information about component monomers, 49,50 although other factors, such as molecular weights, are also known to change the Tg of copolymers.
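The "mixture-out" and "compound-out" settings described under Dataset can be expressed as simple set conditions. The sketch below is a minimal illustration under the assumption that each sample is a dictionary with a hypothetical `components` field listing molecule identifiers; it is not the splitting code used in this study.

```python
def split_mixture_out(samples, held_out_combinations):
    """Hold out whole combinations of molecules, regardless of composition."""
    held = {frozenset(c) for c in held_out_combinations}
    test = [s for s in samples if frozenset(s["components"]) in held]
    train = [s for s in samples if frozenset(s["components"]) not in held]
    return train, test

def split_compound_out(samples, held_out_molecules):
    """Hold out every sample that contains at least one held-out molecule."""
    held = set(held_out_molecules)
    test = [s for s in samples if held & set(s["components"])]
    train = [s for s in samples if not (held & set(s["components"]))]
    return train, test

def is_mixture_out(train, test):
    """Every test combination is absent from the train set."""
    seen = {frozenset(s["components"]) for s in train}
    return all(frozenset(s["components"]) not in seen for s in test)

def is_compound_out(train, test):
    """Every test sample has at least one molecule absent from the train set."""
    seen = {m for s in train for m in s["components"]}
    return all(any(m not in seen for m in s["components"]) for s in test)

# Hypothetical samples identified only by their component molecules.
samples = [{"components": c} for c in
           [("A", "B"), ("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]]
train_m, test_m = split_mixture_out(samples, [("B", "C")])
train_c, test_c = split_compound_out(samples, ["D"])
print(is_mixture_out(train_m, test_m), is_compound_out(train_m, test_m))
print(is_compound_out(train_c, test_c))
```

In this toy example the mixture-out test sample (B, C) consists of molecules that each appear in training, so the split satisfies the mixture-out condition but not the stricter compound-out condition.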
Performance Comparison. The performances of models on the test sets were compared using the coefficient of determination (R2), mean absolute error (MAE), root-mean-square error (RMSE), and Spearman's rank correlation coefficient (ρ), calculated for the average predictions of 50 and 20 randomly initialized models for GC- and MPNN-based models, respectively. As baselines, the performances of FCNN models with three hidden layers were also evaluated for composition-only and extended-connectivity fingerprint (ECFP) 51 inputs. The ECFP fingerprints were obtained using RDKit 52 with a radius of 2 in an unhashed manner, and molar-weighted sums of nonzero elements of the obtained fingerprints were used as inputs for the FCNN model. A detailed architecture of the FCNN model is shown in Figure S9.
Note that all MEIA-based models examined here share the same architecture of prediction networks in the last parts of networks, and the last part of the FCNN model also has the network architecture identical to the prediction networks of the MEIA-based models. Therefore, model performance differences are attributable to the way a fixed size vector used in the last part of the network is generated from molecular structures and compositions. In the MEIA-based models, this is done by embedding and mixing networks, whereas, in the FCNN models with an ECFP input, this is done by ECFP, weighted summation, and initial two fully connected layers of the FCNN model.
Hyper Parameter Optimization. The hyperparameters of the models were obtained by Bayesian hyperparameter optimization using GPyOpt. 53 In the Bayesian optimizations, the MAE in cross validation was minimized using 5 initial random samples and 20 subsequent optimization steps. For the "mixture-out" and "compound-out" validation tasks, the cross validations in the Bayesian optimizations were performed according to the setting of "compound-out validation". To construct a dataset for "k-fold compound-out cross validation", all the component molecules in the training set are first split into k subsets; then, for each subset, training samples containing any component molecule in the subset are used for validation and the remainder for learning. Details of the hyperparameters used in the Bayesian optimizations are shown in Tables S1 and S2, while the settings of the other hyperparameters are shown in Table S3.
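The construction of folds for "k-fold compound-out cross validation" described above can be sketched as follows. The sample format (a dictionary with a hypothetical `components` field) and the deterministic, strided assignment of molecules to subsets are assumptions made for this illustration; the way molecules were actually assigned to subsets is not specified here.

```python
def compound_out_folds(samples, k):
    """Build (learning, validation) pairs for k-fold compound-out validation.

    Molecules are split into k subsets; samples containing any molecule of a
    subset form the validation part of that fold, the rest the learning part.
    """
    molecules = sorted({m for s in samples for m in s["components"]})
    subsets = [set(molecules[i::k]) for i in range(k)]  # assumed assignment
    folds = []
    for held in subsets:
        validation = [s for s in samples if held & set(s["components"])]
        learning = [s for s in samples if not (held & set(s["components"]))]
        folds.append((learning, validation, held))
    return folds

# Hypothetical training samples identified by their component molecules.
samples = [{"components": c} for c in
           [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]]
for learning, validation, held in compound_out_folds(samples, 3):
    print(sorted(held), len(learning), len(validation))
```

Note that, unlike ordinary k-fold cross validation, a sample can appear in the validation part of more than one fold when its components fall into different subsets.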

■ RESULTS AND DISCUSSION
In the following, to examine whether the MEIA architecture can properly extend single-component DNN models and whether it allows DNN models to learn both feature representation and composition information, performances of the obtained models were compared with two shallow learning baselines, composition-only and descriptor-based (Figure 1d) models, in four case studies. For the composition-only model, an FCNN with composition input was used, and for the descriptor-based model, an FCNN with the molar-weighted sum of ECFP fingerprints 51 was used. Note that ECFP has often been used in benchmark studies of molecular DNNs 36 because, as with most molecular DNN models, ECFP uses only the information of the molecular graph to generate descriptors and does not use further detailed information about the molecules, such as physicochemical properties.
Case Study 1: Interpolation of the Composition Space. The first case study is on a previously examined interpolation task within a composition space of copolymers 8 where the test set contains no unknown components. This task could be handled without structural information, by interpolating the data points of a training set within the composition space. The dominant contribution of the composition information renders this task suitable for examining whether the MEIA models can properly handle the composition information for each component. Furthermore, this task is suitable for assessing the effect of permutation invariance introduced in MEIA-based models because this task was previously modeled by a recursive NN model, 8 a DNN model without permutation invariance. Therefore, in this case study, the result of the previous study using the recursive NN model was also used as a baseline, in addition to the composition-only and descriptor-based baselines.
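The input of the descriptor-based baseline, a composition-weighted sum of unhashed fingerprints, can be illustrated with sparse count dictionaries. In this sketch each fingerprint is assumed to be a mapping from a substructure identifier to a count; the identifiers, counts, and function names are hypothetical, and the fingerprints are not computed from real molecules here.

```python
def mixture_fingerprint(fingerprints, fractions):
    """Composition-weighted sum of sparse fingerprint count dictionaries."""
    mixed = {}
    for fp, r in zip(fingerprints, fractions):
        for key, count in fp.items():
            mixed[key] = mixed.get(key, 0.0) + r * count
    return mixed

def to_vector(mixed, vocabulary):
    """Arrange the sparse result as a fixed-size vector over a known vocabulary."""
    return [mixed.get(key, 0.0) for key in vocabulary]

# Hypothetical substructure identifiers and counts for two components.
fp_a = {101: 2, 202: 1}
fp_b = {202: 3, 303: 1}
mixed = mixture_fingerprint([fp_a, fp_b], [0.5, 0.5])
print(to_vector(mixed, [101, 202, 303]))  # [1.0, 2.0, 0.5]
```

Because the weighted sum runs over components independently, this descriptor is itself permutation invariant, but the mapping from structure to features is fixed rather than learned.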
The dataset used for this task was a collection of compositions and Tg of 275 random copolymers and homopolymers comprising 12 monomers. 8 Among them, 57 samples were selected as the test set, with the remaining samples used as the train set.
The evaluated performances are shown in Table 2, with the average scores over all the seven MEIA-based models, called MEIA (average). Six out of seven MEIA-based models evaluated here outperformed the recursive NN baseline in all performance metrics. Furthermore, scores for the recursive NN baseline fell outside one standard deviation from the scores of MEIA (average) in all performance metrics. On the other hand, all scores of composition-only and descriptor-based baselines fell within one standard deviation from the scores of MEIA (average), and relative superiority of these models is unclear from this comparison. The significantly enhanced performance relative to the permutation-dependent recursive NN baseline is considered attributable to the introduction of permutation invariance. Unexpectedly, the performance of the recursive NN baseline is worse than that of the composition-only baseline in all performance metrics. It is considered that in the recursive NN model, the negative effect of an artifact induced by permutation dependence exceeds the positive effect of incorporating structural information. Accordingly, in the following experiments, the recursive NN model was not considered in the performance comparisons.
To make a clear comparison among the prediction performances of MEIA-based models and the composition-only and descriptor-based baselines, a three-fold cross-validation test was performed in which the union of the original train and test sets was randomly divided into three new test sets. Average scores of the cross-validation test are shown in Table 3 with the minimum and maximum scores in the three runs. As a result, WS and WS-Self-RNNA showed at least comparable performances relative to the composition-only and descriptor-based baselines, with slightly better average performances in R2, MAE, and RMSE; however, considering the minimum and maximum ranges in the three runs (brackets in Table 3), there is no significant performance improvement by these MEIA-based models relative to the baselines. Furthermore, the average scores of MEIA (average) over the three runs were slightly worse than those of the composition-only and descriptor-based baselines for all the performance metrics, although the minimum and maximum ranges of each score overlapped between them. Considering the dominant role of composition information, which can be represented as tabulated data, this result is consistent with the well-known superiority of shallow learning models over DNNs for tabulated data. Although there is no significant performance improvement by MEIA-based models, the result that two MEIA-based models showed at least comparable accuracies relative to the composition-only and descriptor-based baselines in this interpolation task shows that MEIA-based models can properly handle the composition information for each component.
Case Study 2: Regression for Unknown Molecular Combinations. The second case study is on a previously examined regression task for multicomponent materials with unknown combinations of component molecules. In this task, each test sample comprises an unknown combination of component molecules. Such a dataset setting is known as "mixture-out validation" and is seen in quantitative structure−property relationship (QSPR) studies for mixtures. Among them, Oprisiu et al. exhaustively studied regression models for datasets comprising thousands of densities of binary mixtures, by changing molecular descriptors, mixing rules, and machine learning models. 12 They generated feature vectors of mixtures from classical molecular descriptors, such as ChemAxon, 54 Dragon, 55 and Inductive 56 descriptors, by applying simple mixing rules such as weighted summation (Figure 1d). Their study is one of the most sophisticated forms of machine learning modeling for multicomponent materials. Although the best model reported by Oprisiu et al. used support vector regression (SVR) and ChemAxon descriptors, 12 in this study, the FCNN with ECFP is used as the baseline because, in contrast to ECFP and MEIA-based models, ChemAxon uses much information about molecules, including 3D molecular structures and physicochemical properties, in the descriptor calculation. Also, the actual best model for this task has not been determined in this study.

Footnotes to Table 3: a Mean scores are reported with the minimum and maximum scores in the three runs (in brackets). b Scores for MEIA (average) were calculated by averaging the scores of all seven MEIA-based models in each run. Then, the mean, minimum, and maximum scores in the three runs are reported.
The "mixture-out" dataset of mixture density is available on the website of Oprisiu et al., OCHEM, and comprises 3857 and 672 samples in the train and test sets, featuring 118 and 46 component molecules, respectively. To evaluate the prediction performances of MEIA-based models on multiple "mixture-out" data splits, the train and test sets used by Oprisiu et al. were merged into a single dataset, and the 387 combinations of component molecules found in the resulting dataset were randomly divided into three groups. Then, three test sets were reconstructed according to this grouping, and for each test set, samples not contained in the test set were assigned to a train set. Finally, test set mixtures containing a component molecule that is not contained in the train set were reassigned to the train set. The regression objective of this task is the deviation of the experimental density of a mixture from the ideal density, defined as the weighted average of the densities of the components.
The evaluated average performances for baselines and MEIA-based models (models with GC in the embed column) are shown in Table 4 with the minimum and maximum scores in the three runs. As a result, the WS-Self-RNNA model showed at least comparable performance relative to the descriptor-based baseline, with better average scores in all the performance metrics and no overlap of the minimum and maximum ranges of R2 and RMSE. However, the other six MEIA-based models and MEIA (average) showed worse average performances relative to the descriptor-based baseline in most scores. This is probably because the embedding network of the GC model, which is used as the embed block in these MEIA-based models, was unsuitable for this dataset. Accordingly, the embedding network of MPNN, 31 another popular molecular DNN model, was also examined as the embed block.
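The regression objective described above, the deviation of the measured mixture density from an ideal weighted-average density, can be written as a short function. This is a minimal sketch: the kind of fraction used as the weight is not restated here, so the `fractions` argument is simply assumed to hold whatever composition weights the dataset provides, and the numerical values are hypothetical.

```python
def ideal_density(densities, fractions):
    """Weighted average of the pure-component densities."""
    total = sum(fractions)  # normalize in case the fractions do not sum to 1
    return sum(r * d for d, r in zip(densities, fractions)) / total

def density_deviation(measured, densities, fractions):
    """Regression target: measured mixture density minus the ideal density."""
    return measured - ideal_density(densities, fractions)

# Hypothetical binary mixture with equal fractions.
print(density_deviation(0.92, [0.8, 1.0], [0.5, 0.5]))
```

Modeling the deviation rather than the density itself removes the part of the target that is already explained by simple averaging, so the model only has to account for the non-ideal contribution.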
The evaluated performances for MPNN-based models are also shown in Table 4 (models with MPNN in embed column). As expected, enhanced average performances relative to corresponding GC-based models were observed for all MPNN-based models in all performance metrics. Moreover, four out of seven MPNN-based models showed better average performances than the descriptor-based baseline for all performance metrics, with no overlap of the minimum and maximum ranges of scores. MEIA (average) also showed better average scores than the descriptor-based baselines for all performance metrics, with almost no overlap of the minimum and maximum ranges of scores. These results show that the MEIA architecture can extend both GC and MPNN models to multicomponent systems using the same procedure and that performances of MEIA-based models are dependent on the suitability of the embedding networks for the given task.
The observation of clearly improved performances of MEIA-based models relative to the descriptor-based baseline is in contrast to case study 1, where the effect of the structural information on the modelling was small. To reveal the role of the structural information in this task, the performance of the composition-only baseline was also evaluated. In contrast to case study 1, the descriptor-based baseline showed clearly improved performances relative to the composition-only baseline, with no overlap of the minimum and maximum ranges of scores in all performance metrics, which shows that structural information plays a major role in this task. This indicates that, for a machine learning task on a multicomponent molecular system where structural information plays a major role, the feature representation learning by embedding networks largely benefits the modelling, which is consistent with the excellent performances of single-component molecular DNN models for machine learning tasks on molecular structures. 36
Results of the comparison between MEIA-based models and the best model reported by Oprisiu et al., 12 which employs SVR and Chemaxon descriptors, using the same training and test sets are also provided in Table S4. Three MPNN-based models showed better performances than the Chemaxon-based model in all the performance metrics; however, further experiments are needed to show the reproducibility of these results and to determine the best model for this dataset.
Footnotes to Table 4: (a) Mean scores are reported with the minimum and maximum scores in the three runs (in brackets). (b) Scores of models are shown in bold if the worst scores in the three runs are equal to or better than the best score for the descriptor-based baseline. (c) Scores for MEIA (average) were calculated by averaging the scores of all seven MEIA-based models in each run. Then, the mean, minimum, and maximum scores in the three runs are reported.
Case Study 3: Regression for Unknown Component Molecules. The third case study is on a "compound-out validation" for boiling points of binary mixtures, which was also studied by Oprisiu et al. 12 In this task, each test sample contains at least one unknown component molecule and cannot properly be predicted by interpolating the composition space. This setting is more difficult than the other two validation settings.
The mixture boiling point dataset comprises 3239 training and 1309 test samples featuring 67 and 56 component molecules, respectively. To evaluate the prediction performances of MEIA-based models on multiple "compound-out" data splits, the train and test sets used by Oprisiu et al. were merged into a single dataset, and the 98 component molecules found in the resulting dataset were randomly divided into three groups. Then, three test sets were reconstructed according to this grouping by assigning all mixtures containing molecules belonging to each group to the test set. Also, for each test set, samples not contained in the test set were assigned to the train set. Although the best model reported by Oprisiu et al. used SVR and Dragon descriptors, 12 in this study, the FCNN with ECFP is used as the baseline because, in contrast to ECFP and MEIA-based models, Dragon uses much information about molecules, including 3D molecular structures and physicochemical properties, in the descriptor calculation. Therefore, the actual best model for this task was not determined in this study.
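The "compound-out" grouping described above can be sketched in a few lines. The data layout and function name are assumptions for illustration only, and the round-robin assignment of shuffled molecules to groups is one possible way to realize the random division.

```python
import random


def compound_out_splits(samples, n_groups=3, seed=0):
    """Return one (train_indices, test_indices) pair per group of molecules.

    samples: list of dicts, each with a 'components' tuple of molecule identifiers.
    All mixtures containing a molecule of the held-out group go to the test set.
    """
    molecules = sorted({m for s in samples for m in s["components"]})
    random.Random(seed).shuffle(molecules)
    groups = [set(molecules[g::n_groups]) for g in range(n_groups)]

    splits = []
    for held_out in groups:
        test = [i for i, s in enumerate(samples) if held_out & set(s["components"])]
        test_set = set(test)
        train = [i for i in range(len(samples)) if i not in test_set]
        splits.append((train, test))
    return splits
```

By construction, every test mixture contains at least one molecule that never appears in the corresponding train set, which is what distinguishes this setting from the "mixture-out" split.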
As with the mixture density modeling in case study 2, the performances of both GC- and MPNN-based MEIA models were examined. The evaluated average performances are shown in Table 5 with the minimum and maximum scores in the three runs. As a result, 9 out of 14 GC- and MPNN-based MEIA models showed at least comparable performances relative to the descriptor-based baseline, with better average scores over the three runs in all the performance metrics, and MEIA (average) also showed slightly better performances than the descriptor-based baseline in the average over the three runs for all performance metrics. These results again show that the MEIA architecture can extend both GC and MPNN models to multicomponent systems in the same manner, with performances at least comparable to the descriptor-based baseline. However, the minimum and maximum ranges for all scores overlap between all the MEIA-based models and the descriptor-based baseline. Therefore, the performance improvement by the DNN models in this task is not clear. This is due to large deviations in the performances of models among the three different data splits.
Note that, in contrast to case study 2, clear superiority of MPNN-based models relative to GC-based models was not observed in this case study.
The performances of the composition-only and descriptor-based baselines were also compared, and the descriptor-based baseline showed improved average performances relative to the composition-only model in all performance metrics. However, the minimum and maximum ranges for MAE and RMSE overlapped between the composition-only and descriptor-based baselines due to large deviations in the performances of models among the three different data splits.
Results of the comparison between the MEIA-based models and the best model reported by Oprisiu et al., 12 which employs SVR and Dragon descriptors, using the same training and test sets are also provided in Table S5. Four out of 14 GC- and MPNN-based MEIA models showed better performances than the Dragon-based model in all the performance metrics; however, further experiments are needed to show the reproducibility of these results and to determine the best model for this dataset.
Footnotes to Table 5: (a) Mean scores are reported with the minimum and maximum scores in the three runs (in brackets). (b) Scores for MEIA (average) were calculated by averaging the scores of all seven MEIA-based models in each run. Then, the mean, minimum, and maximum scores in the three runs are reported.
Case Study 4: Regression on a Sparse Dataset. Finally, the performances of MEIA-based models were examined for the most challenging task, compound-out validation for a collection of small sets of experimental data. The dataset for this task was prepared by collecting compositions and Tg of film-shaped ternary linear random copolymers from the open-access PoLyInfo database. 57 Although some records in PoLyInfo do not contain complete information about their composition, records were collected as far as possible if the sum of the molecular weights of the component monomers was smaller than 1000. The collected data are therefore considered to reflect, at least to some extent, the natural distribution of multicomponent linear copolymers in the literature. The collected dataset contains 83 copolymers comprising 55 component monomers and originates from 36 literature studies. Figure S10 is a histogram of the number of appearances of each monomer in the dataset and shows that most monomers in the dataset are used only a few times. Note that although the DNN models in this study do not consider some factors that affect the Tg of polymers, such as molecular weights, this dataset can be used to examine the numerical properties of models on an extremely sparse ternary system.
To create the dataset for "compound-out validation", the 41 monomers that appear in the collected copolymers fewer than five times were chosen as "unknown compounds". To evaluate the prediction performances of MEIA-based models on multiple "compound-out" data splits, these "unknown compounds" were randomly divided into ten groups. Then, ten test sets were reconstructed according to this grouping by assigning all copolymers containing monomers belonging to each group to the test set. Also, for each test set, samples not contained in the test set were assigned to the train set. Because the dataset for this task is small, all scores were evaluated by merging all the predicted values for the ten test sets. Then, the reproducibility of the obtained results was confirmed by repeating the same test twice with different data splits and evaluating the mean, minimum, and maximum over the two repeated tests.
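The selection of rarely used monomers and the pooled scoring over the ten test sets can be sketched as follows. The function names and data layout are hypothetical, and R2 is assumed here to be the coefficient of determination computed on the merged predictions; this is an illustration rather than the evaluation code of this study.

```python
import math
from collections import Counter


def rare_components(samples, threshold=5):
    """Components that appear in fewer than `threshold` samples."""
    counts = Counter(m for s in samples for m in set(s["components"]))
    return sorted(m for m, c in counts.items() if c < threshold)


def pooled_scores(folds):
    """Scores computed after merging the predictions of all test sets.

    folds: list of (y_true, y_pred) pairs, one pair per test set.
    """
    folds = list(folds)
    y_true = [y for true, _ in folds for y in true]
    y_pred = [y for _, pred in folds for y in pred]
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "RMSE": math.sqrt(ss_res / n),
    }
```

Pooling avoids computing R2 on individual test sets that contain only a handful of copolymers, where the variance of the experimental values can be too small for a stable estimate.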
The performances of models were evaluated and compared with the composition-only and descriptor-based baselines, and the average results are shown in Table 6. Scatter plots of predicted versus experimental values are also shown in Figure 4.
Here, the composition-only baseline showed an average R2 of 0.076 with almost no predictive ability, as shown in Figure 4, underlining the importance of the contribution of monomers contained only in the test set. Meanwhile, the performance of the descriptor-based baseline remains poor, with an average R2 of 0.309. Conversely, MEIA-based models showed remarkable predictive abilities for this dataset. Enhanced average performances relative to the descriptor-based baseline were observed for all MEIA-based models, with no overlap of the minimum and maximum ranges of scores in all the performance metrics except the MAE of the Concat-Self-MHA and Concat-Self-RNN models. Also, the best R2 model, WS, showed a [...] 562. This remarkable performance improvement is attributable to the contribution of molecular structure information to the modelling of the sparse dataset. These results again indicate that, for a machine learning task on a multicomponent molecular system where structural information plays a major role, the feature representation learning by embedding networks largely benefits the modelling.
Footnotes to Table 6: (a) Mean scores are reported with the minimum and maximum scores in the two runs (in brackets). (b) Scores of models are shown in bold if the worst scores in the two runs are equal to or better than the best score for the descriptor-based baseline. (c) Scores for MEIA (average) were calculated by averaging the scores of all seven MEIA-based models in each run. Then, the mean, minimum, and maximum scores in the two runs are reported.

■ CONCLUSIONS
In this study, a permutation-invariant DNN architecture for multicomponent molecular systems, MEIA, was proposed. Introducing permutation invariance allows for significant reductions in prediction errors relative to the previously proposed DNN model without permutation invariance. Moreover, using the MEIA architecture, both GC and MPNN models were successfully extended to multicomponent systems by the same procedure, and the best of the MEIA-based models consistently showed comparable or higher accuracies relative to the descriptor-based baseline in all four case studies. These results showed that the MEIA architecture can extend single-component molecular DNN models to multicomponent systems and can allow DNN models to learn both feature representation and composition information directly from the information of molecular structures and their compositions. It is well known that deep learning models are suitable for unstructured data, while shallow learning models are suitable for tabular data. On the other hand, the suitability of machine learning models for multicomponent material data is not obvious. The comparison of the four case studies on multicomponent molecular systems indicates that, for machine learning tasks where the molecular structure information plays a minor role, the performance improvements by DNN models are small, as in case study 1, while for machine learning tasks where the molecular structure information plays a major role, as in case studies 2 and 4, the performance improvements by DNN models are large. This trend was notable for the sparse dataset in case study 4, where MEIA-based models showed significantly improved predictive abilities, whereas the conventional descriptor-based model showed poor performance. Currently, the usage of molecular machine learning in multicomponent material design remains limited due to the scarcity and nonsystematic nature of experimental datasets for this class of materials. Accordingly, enhancing the prediction ability for such sparse datasets by MEIA-based models will extend the applicability of molecular machine learning in the modelling of multicomponent materials.

■ FUTURE WORK
Improved predictive capabilities of MEIA-based models for the sparse dataset have the potential to change current screening approaches for multicomponent molecular systems. Currently, even in recent high-throughput screening studies, the exploration space is limited to fixed combinations or compositions with moderate numbers of component materials 1 because a screening effort on the full combination and composition freedoms of multicomponent systems by a limited number of experiments produces uninterpretable results with sparse data. However, there is no guarantee that materials with desired properties will be found in the predefined design space for high-throughput screening studies. To find optimal recipes, redefinition of the design space and screening within it often need to be repeated, which hinders the application of high-throughput screening methods to multicomponent materials design, and therefore, trial-and-error-based small-scale screening on combination and composition spaces of multicomponent systems is still the primary approach for the development of multicomponent materials. Meanwhile, MEIA-based models will allow screening on the full combination and composition freedoms of all candidate raw materials, in combination with self-driving laboratories 58 composed of high-throughput screening equipment and machine learning-based sequential optimization methods such as Bayesian optimization, which will largely expand the accessible design spaces of multicomponent systems. Furthermore, MEIA-based models will speed up non-high-throughput screening by providing, from libraries of available materials and based on sparse experimental datasets, recommendations for raw materials that are expected to improve material properties. Another attractive feature of DNNs is the transfer learning capability, where feature representations learned from other datasets are used to improve machine learning performance on small datasets. 59 By employing embedding networks of prelearned models, MEIA-based models can also perform transfer learning, which is especially useful for non-high-throughput screening studies that generate small experimental datasets.
Because this study focused on the general extension of DNNs to multicomponent systems and special treatments for each input material and output property were not considered, there is room for improvement of the performances of the MEIA-based models for individual applications. For copolymers, the treatment of input data of monomers is a promising direction for performance improvement, which includes the processing of connection points of monomers and the introduction of the reactant information of copolymers. Another direction for performance improvement is the variation of the form of the summation function, which allows generating various types of mixing rules, such as the weighted geometric mean and weighted square sum, which are also used in QSPR studies for mixtures. 22,15 Furthermore, introducing the embedding networks of state-of-the-art molecular machine learning models for each type of dataset as the embed block will be a convenient way to improve MEIA-based models because the prediction performances of MEIA-based models largely depend on the architectures of the embedding networks, as seen in case study 2, where the MPNN-based models showed improved performances relative to the GC-based models in the binary mixture density prediction.
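The mixing rules mentioned above can be illustrated as alternative composition-weighted aggregation functions over component feature vectors. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the feature vectors are plain lists, the geometric mean is defined only for strictly positive features, and the "weighted square sum" is read here as the weighted sum of squared features, which is one possible interpretation.

```python
import math


def weighted_sum(vectors, weights):
    """Composition-weighted arithmetic mixing rule: sum_i w_i * x_i."""
    dim = len(vectors[0])
    return [sum(w * v[k] for v, w in zip(vectors, weights)) for k in range(dim)]


def weighted_geometric_mean(vectors, weights):
    """Geometric mixing rule: exp(sum_i w_i * ln x_i); requires positive features."""
    dim = len(vectors[0])
    return [
        math.exp(sum(w * math.log(v[k]) for v, w in zip(vectors, weights)))
        for k in range(dim)
    ]


def weighted_square_sum(vectors, weights):
    """Square-sum mixing rule, read here as sum_i w_i * x_i ** 2."""
    dim = len(vectors[0])
    return [sum(w * v[k] ** 2 for v, w in zip(vectors, weights)) for k in range(dim)]
```

Each of these functions gives the same result when the components and their weights are reordered together, so swapping one for another keeps the aggregation step permutation invariant.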
It is worth pointing out that most DNN models for single-component systems can be divided into embedding and prediction networks, and the MEIA architecture can extend these models to multicomponent systems by exploiting their embedding networks. Therefore, the MEIA architecture, which enables feature representation learning from unstructured data incorporating composition information, will extend DNN models for a wide range of chemical data to multicomponent systems. A promising extension of the current study is DNN models for nonmolecular multicomponent systems, including inorganic crystals and metal alloys, using unstructured data such as spectra and simulation results.

Brief descriptions of embed-interact-aggregate models for sets, neural network architectures, histogram for the number of appearances of each monomer in the ternary copolymer dataset, hyper-parameter settings, and performances of models (PDF)
List of ternary copolymers used in the case study (XLS)