Polymer Informatics at Scale with Multitask Graph Neural Networks

Artificial intelligence-based methods are becoming increasingly effective at screening libraries of polymers down to a selection that is manageable for experimental inquiry. The vast majority of presently adopted approaches for polymer screening rely on handcrafted chemostructural features extracted from polymer repeat units, a burdensome task as polymer libraries, which approximate the polymer chemical search space, progressively grow over time. Here, we demonstrate that directly "machine learning" important features from a polymer repeat unit is a cheap and viable alternative to extracting expensive features by hand. Our approach, based on graph neural networks, multitask learning, and other advanced deep learning techniques, speeds up feature extraction by 1-2 orders of magnitude relative to presently adopted handcrafted methods without compromising model accuracy for a variety of polymer property prediction tasks. We anticipate that our approach, which unlocks the screening of truly massive polymer libraries at scale, will enable more sophisticated and large-scale screening technologies in the field of polymer informatics.


Introduction
Polymers have emerged as a powerful class of materials for a wide range of applications because of their low-cost processing, chemical stability, tunable chemistries, and low densities. 2,3 The result is a constant flux of materials data. Over the past decade, the polymer informatics community has translated this data stream into machine-learned property predictors that efficiently screen libraries of candidate polymers for subsequent experimental inquiry. 4,5 Currently, most approaches for polymer screening rely on handcrafted features, extracted from the chemical structure of a polymer repeat unit, as input for property predictors. 6,7 These approaches are highly accurate, but feature extraction becomes a bottleneck (as discussed in Section 3.1) when used to screen vast swathes of the polymer chemical space. This bottleneck is increasingly exposed by the proliferation of enumeration methods 8,9 and long-sought 10,11 inverse predictors, 12,13,14,15,16 which directly locate optimal pockets of the chemical space from a user-defined wish list of material properties. By leveraging these tools, the day when we routinely generate billions of polymer candidates is fast approaching. Advances in polymer screening and feature engineering are needed to keep up with this pace.
An alternative to handcrafting features is to "machine learn" them. One approach is to represent the material as raw text, such as a simplified molecular-input line-entry system (SMILES) 17 or BigSMILES 18 string, and then learn features with a neural network specifically designed for natural language processing. 19 Another promising approach is to represent the material as a graph and then train a Graph Neural Network (GNN) 20 to learn features.
To date, GNNs have outperformed approaches based on handcrafted features 20,21,22,23,24 on the massive QM9 database 25 for small molecules. Similarly, feature learning approaches have supplanted traditional methods in other domains (e.g., convolutional neural networks 26 in computer vision and transformers 27 in natural language processing) where the extraction of handcrafted representations from the input data is non-trivial or impractical. 26

The core idea behind multitask learning is that, by training a model to simultaneously learn multiple correlated target properties, the model is less likely to overfit to the training set of any one target property. 28 As a result, the predictive performance for each property is improved. This idea is seen in nature as well; for example, there is evidence that training in one sport can improve a young athlete's performance in another, related sport. 29

Several GNN architectures have recently been proposed for polymers. 31,32,33,34,35,36 The majority of these approaches are single task. The GNN proposed by Mohapatra et al. 34 is suitable for biopolymers, in which the monomer sequence is known. Other approaches, 32,33,35,36 geared toward synthetic polymers (the subject of interest in this work), represent a polymer using the graph of a predominant repeat unit. This introduces the need for invariance to certain transformations of the repeat unit graph: translation, addition, and subtraction (as defined in Section 2.2). A subset of the GNNs for synthetic polymers 35,36 are invariant to translation, but not to addition and subtraction. In other words, a GNN that preserves the invariant properties of polymer repeat units has not been developed until now. Our work, a powerful multitask GNN architecture (see Fig. 1) for polymers, fills this gap. We call this development the Polymer Graph Neural Network (polyGNN).
In the small molecule domain, the adoption of GNNs is motivated by systematic work 20 comparing GNNs and handcrafted approaches on an even footing across a diverse set of molecules and predictive tasks. Analogous studies are absent from the synthetic polymer domain. Previous works have compared feature learning and handcrafted approaches for up to two 31,35 polymer properties, or for several properties in the same class 30 (e.g., electronic properties).
In what follows, we compare polyGNN with the handcrafted fingerprint originally hosted under the Polymer Genome (PG) project 4 on a large and diverse data set consisting of more than 13,000 polymers and 30+ predictive tasks, spanning thermal, thermodynamic, physical, electronic, optical, dielectric, and mechanical properties, the Hildebrand solubility parameter, and the permeability of six gases.
Our benchmark, the PG fingerprint, contains descriptors that correspond to one of three length scales. The finest-level components are atomic triples (e.g., C_i O_j N_k, where the subscripts denote the atomic coordination). The next (block) level contains pre-defined atomic fragments (e.g., the common cycloalkenes). These two levels contain strictly one-hot features. At the highest (chain) level are numerical features that describe the atomic or block topology (e.g., the number of atoms in the largest side chain). The handcrafted PG fingerprint is the current state of the art in polymer representation and has shown success in the numerical representation of materials over a wide chemical and property space. 4,8,37 The handcrafted PG fingerprint-based property predictors thus serve as veritable performance baselines. We find that polyGNN, relative to these baselines, delivers one to two orders of magnitude faster fingerprinting with better or comparable model accuracy in most polymer property prediction tasks. polyGNN thus offers a powerful new polymer informatics option for screening the polymer chemical space at scale.

Data set and preparation
Our corpus contains measurements for up to 36 properties of 13,388 polymers, yielding over 21,000 data points in total. The unit and symbol for each property are listed in Fig. 2A.
The distribution of data points per property is plotted in Fig. 2B. These data points come from in-house density functional theory (DFT) computations, 38,39,40 experimental data collected from the literature, 41,42,43,44,45,46 printed handbooks, 47,48,49 and online databases. 50,51 Band gaps were calculated using DFT for both individual polymer chains (E_gc) and polymer crystal (bulk) structures (E_gb). DFT data contain uncertainties due to the choice of exchange-correlation functional, pseudopotentials, optimization procedure, etc., while data from physical experiments carry uncertainty due to sample and measurement conditions. Thus, data for the same property but from different sources (e.g., DFT-computed and experimentally measured refractive index) are treated separately and then co-learned with multitask learning.
Our multitask learning approach requires a few data preprocessing steps. First, the training data for each property was MinMax scaled between zero and one. This ensures that the optimizer of a multitask ML model weights the loss for each property equally during training.
Second, to better exploit correlations between properties, 5 we divided our entire 36-property data set into six "property groups": thermal properties, thermodynamic & physical properties, electronic properties, optical & dielectric properties, solubility & gas permeability, and mechanical properties. The stratification of properties by group is shown in Fig. 2B. Finally, we designate each property within a group a unique one-hot "selector" vector (see Fig. 1 for example selector vectors of the thermal properties). These vectors are used by our ML models to distinguish between multiple tasks.
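For illustration, the two preprocessing steps above can be sketched in a few lines of Python. The function names below are illustrative, not taken from the polyGNN codebase; this is a minimal sketch, assuming per-property MinMax scaling and one one-hot selector per property in a group.

```python
def minmax_scale(values):
    """MinMax scale a list of training targets to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def make_selectors(properties):
    """Assign each property in a group a unique one-hot selector vector."""
    n = len(properties)
    return {p: [1 if i == j else 0 for j in range(n)]
            for i, p in enumerate(properties)}

# Example: three properties from the thermal group (Tg, Tm, Td).
selectors = make_selectors(["Tg", "Tm", "Td"])
print(selectors["Tg"])                     # [1, 0, 0]
print(minmax_scale([300.0, 400.0, 500.0])) # [0.0, 0.5, 1.0]
```

During training, each scaled target is paired with its property's selector so a single multitask model can tell the tasks apart.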

polyGNN
All GNNs rely on a well-defined graph representation of their input. If the input is a small molecule, then building a corresponding graph is straightforward: each heavy (i.e., non-hydrogen) atom is a graph node and each bond between heavy atoms is a graph edge.
However, polymers are macromolecules with numerous atoms and bonds. Creating a node and an edge for every atom and bond would generate a massive graph, and machine learning based on thousands of such graphs would be computationally inefficient. Instead, we construct a polymer graph from its repeat unit alone and propose that additional information (e.g., molecular weight, end groups, etc.), if available, be concatenated to each computed atom or bond fingerprint and/or to the learned polymer fingerprint.
Ideally, our learned polymer fingerprint must respect the invariances present in a polymer repeat unit. We identify three key transformations, translation, addition, and subtraction, that repeat units of infinite 2D polymer chains are invariant to. We define translation as movement of the periodicity window, which can produce periodic repeat units that are all equivalent. For example, (OCC), (COC), and (CCO) are equivalent repeat units of polyethylene glycol, related to one another by translation. We define addition (subtraction) as the extension (reduction) of a repeat unit by one or more minimal repeat units. For example, (COCO) and (COCOCO) are equivalent repeat units, related to one another by the addition (or subtraction) of their minimal repeat unit, (CO). We have constructed polyGNN to be invariant under such transformations, as discussed below.
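The two transformation families can be made concrete with a character-level sketch. Real repeat units are SMILES strings, where rotation must respect multi-character atom and bond tokens; the helpers below (illustrative names, not from the polyGNN codebase) treat a repeat unit as a plain string purely to illustrate the idea.

```python
def translations(unit):
    """All cyclic rotations of a repeat unit, e.g. OCC -> {OCC, CCO, COC}."""
    return {unit[i:] + unit[:i] for i in range(len(unit))}

def additions(unit, max_copies=3):
    """Extensions of a repeat unit by whole copies of itself, e.g. CO -> COCO."""
    return [unit * k for k in range(1, max_copies + 1)]

print(translations("OCC"))  # the equivalent PEG repeat units from the text
print(additions("CO"))      # ['CO', 'COCO', 'COCOCO']
```

A fingerprinting scheme that is invariant to these transformations must map every element of `translations(u)` and `additions(u)` to the same vector.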
The polyGNN architecture is composed of three main modules: an Encoder for processing the repeat unit, a Message Passing Block for fingerprinting, and an Estimator to co-learn multiple properties. In the polyGNN Encoder, bonds are added between heavy atoms at the boundary of any input repeat unit, forming a periodic polymer graph (as shown in Fig. 1). This ensures that the graph of the repeat unit, and hence its learned fingerprint, is invariant to translation. Then, each atom and bond in the graph is given an initial feature vector (described later in Section 2.3) computed using RDKit. 52 The featurized graph is passed to the Message Passing Block and then to the aggregation function. In the Message Passing Block, the initial feature vectors are passed between neighboring atoms.
This information flow is the mechanism by which rich polymer features are learned (described later in Section 2.4).
After message passing, the sequence of learned atomic fingerprints is aggregated into a single polymer fingerprint by taking the mean. Taking the mean rather than the sum ensures that, for example, (COCO) and (COCOCO) are mapped to the same fingerprint.
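The choice of mean over sum aggregation can be verified directly on toy atom fingerprints. The vectors below are illustrative placeholders, not the fingerprints polyGNN actually learns.

```python
def aggregate(atom_fps, how="mean"):
    """Pool a list of per-atom fingerprint vectors into one polymer fingerprint."""
    dim = len(atom_fps[0])
    total = [sum(fp[d] for fp in atom_fps) for d in range(dim)]
    if how == "sum":
        return total
    return [t / len(atom_fps) for t in total]

c, o = [1.0, 0.0], [0.0, 1.0]  # toy atom fingerprints for C and O
co = [c, o]                    # repeat unit (CO)
coco = [c, o, c, o]            # equivalent repeat unit (COCO)

# Mean pooling is invariant to repeat-unit addition; sum pooling is not.
assert aggregate(co) == aggregate(coco)
assert aggregate(co, "sum") != aggregate(coco, "sum")
```

This is exactly why doubling a repeat unit leaves a mean-pooled fingerprint unchanged: every atom type appears in the same proportion.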
However, there are polymers (see Fig. 3A) where the desired invariance is not preserved.
These conflicts arise because RDKit treats periodic polymer graphs as cyclic molecules. To address them, we propose two approaches. In the first approach, which we will continue to refer to as polyGNN, the original training data set is augmented with transformed repeat units (see Fig. 3B). Thus, although polyGNN is not invariant to addition or subtraction in these complicated cases, it achieves approximate invariance after learning from augmented data. This choice was inspired by state-of-the-art image classification models, which are trained using cropped and flipped images. 53 In this work, we find that data augmentation is also effective for training polyGNNs but does increase training time, a one-time cost. As an alternative, we created a variant, polyGNN2, with guaranteed invariance to addition and subtraction (and thus no need for augmentation). Invariance is achieved by modifying the Encoder to compute features on an extended polymer graph instead of on the periodic graph (see Section S2). However, operating on the extended graph notably slows fingerprinting in polyGNN2, so we focus on polyGNN in what follows.

Fingerprinting graphs
The node features used in this work are element type, node degree, implicit valence, formal charge, number of radical electrons, hybridization, aromaticity (i.e., whether or not a given node is part of an aromatic ring), and number of hydrogen atoms. The edge features are bond type, conjugation (i.e., whether or not a given edge is part of a conjugated system), and ring membership (i.e., whether or not a given edge is part of a ring).
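In polyGNN these feature vectors are computed with RDKit; the hand-rolled sketch below hard-codes a few illustrative fields (a small assumed element and hybridization vocabulary) so the encoding idea is self-contained. It is not the exact featurization used in the paper.

```python
ELEMENTS = ["C", "O", "N", "S"]          # assumed, truncated vocabulary
HYBRIDIZATIONS = ["sp", "sp2", "sp3"]    # assumed, truncated vocabulary

def one_hot(value, choices):
    """One-hot encode a categorical value over a fixed vocabulary."""
    return [1 if value == c else 0 for c in choices]

def atom_features(element, degree, hybridization, aromatic, num_h):
    """Concatenate one-hot and numeric fields into an initial node vector."""
    return (one_hot(element, ELEMENTS)
            + [degree]
            + one_hot(hybridization, HYBRIDIZATIONS)
            + [int(aromatic), num_h])

# An sp3 carbon with two heavy neighbors and two hydrogens (as in PEG):
print(atom_features("C", 2, "sp3", False, 2))
# [1, 0, 0, 0, 2, 0, 0, 1, 0, 2]
```

Edge features (bond type, conjugation, ring membership) are encoded the same way: one-hot vectors for categorical fields, concatenated with boolean flags.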

Neural message passing
In GNNs, "messages" between neighboring atoms in a graph are iteratively passed along chemical bonds. After each iteration, every atom fingerprint is updated using the messages.
In this way, atoms learn about their local neighborhood over time. By fitting parameters (e.g., weights and biases) in the model, the information contained in each message is optimized for the task at hand. This process is captured by three general but abstract equations presented in Section S3. In this section, for concreteness, we demonstrate message passing using a highly simplified example.
First, consider the graph of infinite polyethylene glycol (PEG), shown in Fig. 1. We restrict our initial atom features to the element type and our initial bond features to the bond type. Thus, all edge fingerprints on the PEG graph are set to [1, 0, 0, 0] (indicating the presence of single bonds and no double, triple, or aromatic bonds). The two carbon atoms in PEG are initialized with a fingerprint of [1, 0] (indicating the presence of C atoms and not O atoms). We index these two nodes 0 and 1. The oxygen atom, with index 2, is initialized with a fingerprint of [0, 1]. Now, we compute messages m_{i,j} between all pairs of chemically bonded atoms using the functional form

m_{i,j} = ReLU( W_φ [x_j^(0), e_{i,j}]^T ),

where i, j are atom indices, x_j^(0) is an initial atom fingerprint, and e_{i,j} is a bond fingerprint. Note that, for simplicity, we ignore bias terms and use the Rectified Linear Unit (ReLU) activation in this example. W_φ is a matrix of parameters. Before training, the parameters are randomly initialized; during training, they are iteratively updated (i.e., learned) using some flavor of stochastic gradient descent. In this example, our choice of initial parameters is guided by mathematical convenience, and we do not consider subsequent weight updates. Choosing W_φ conveniently yields the messages, which can then be used to update the fingerprint of each atom using the functional form

x_i^(1) = ReLU( W_χ [x_i^(0), Σ_j m_{i,j}]^T ),

where W_χ is a matrix of parameters, and j runs over the atoms that share a chemical bond with atom i. After we conveniently initialize W_χ to a 2 × 10 all-ones matrix, we have x_0^(1) = [41, 41], x_1^(1) = [41, 41], and x_2^(1) = [33, 33].
So, by exchanging messages with neighbors, the fingerprint of each carbon atom in PEG was updated from [1, 0] to [41, 41] and the fingerprint of the oxygen atom was updated from [0, 1] to [33, 33]. The effect of message passing is clear. Initially, the oxygen atom was not aware of neighboring carbon atoms (that is, x_{2,1} = 0, where x_{i,l} is the l-th dimension of x_i). However, after passing one round of messages, the oxygen atom becomes aware of its carbonaceous neighbors (i.e., x_{2,1} ≠ 0). Likewise, the carbon atoms become aware of their neighboring oxygen atom over time.
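The mechanics of this example can be run end-to-end in a few lines. The weights below are chosen for illustration (copy messages, identity-like update) and are NOT the convenient initialization used in the text, so the resulting numbers differ from [41, 41] and [33, 33]; the mechanism, and the "awareness" effect, are the same.

```python
# Periodic PEG graph: two carbons (0, 1) and one oxygen (2), each bonded
# to the other two atoms once the boundary bond is added.
x = {0: [1, 0], 1: [1, 0], 2: [0, 1]}           # initial atom fingerprints
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}   # adjacency of the periodic graph

def message(x_j):
    """m_{i,j}: here simply a copy of the neighbor's fingerprint."""
    return list(x_j)

def update(x_i, msgs):
    """Concatenate x_i with the summed incoming messages (identity weights)."""
    summed = [sum(m[d] for m in msgs) for d in range(len(x_i))]
    return x_i + summed

new_x = {i: update(x[i], [message(x[j]) for j in neighbors[i]]) for i in x}
print(new_x[2])  # [0, 1, 2, 0]: the oxygen now "sees" its two carbon neighbors
```

After one round, the carbon-indicator slot of the oxygen's summed messages is nonzero, which is exactly the awareness effect described above.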
Results and Discussion

For each of the timings reported in Section 3.1, the time spent on the Encoder was fixed at 26 seconds. The remaining time was spent on the Message Passing Block which, unlike the Encoder, can run on CPUs or on graphics processing units (GPUs).
By extrapolation, fingerprinting a library of 1 billion polymers would take 26 days in the best case (shallow polyGNN run on a GPU) and 47 days in the worst case (deep polyGNN run on one CPU). Meanwhile, at a rate of 125.4 ms per polymer, fingerprinting a library of 1 billion polymers would take nearly 4 years on one CPU using the handcrafted PG approach. Of course, either approach can be further sped up with parallelization and/or increased random-access memory.
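The extrapolation is simple unit arithmetic, sketched below. The per-polymer rates are those reported in the benchmarking section; the day counts come out slightly below the rounded figures quoted in the text, presumably because the text uses slightly different effective rates.

```python
def days_for_library(ms_per_polymer, n_polymers=1_000_000_000):
    """Convert a per-polymer fingerprint rate (ms) into total days for a library."""
    seconds = ms_per_polymer * n_polymers / 1000
    return seconds / 86400  # seconds per day

print(round(days_for_library(2.2)))    # shallow polyGNN on one GPU: ~25 days
print(round(days_for_library(125.4)))  # handcrafted PG on one CPU: ~1451 days (~4 years)
```

At 125.4 ms per polymer, 1 billion polymers take about 1451 days, i.e., just under 4 years, consistent with the claim above.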

Benchmarking accuracy
Here we evaluate the predictive accuracy of polyGNN models on 34 of the 36 properties in our data set; the dielectric constant at 10^7 and 10^9 Hz (ε_7 and ε_9) was excluded because our corpus contains fewer than 50 data points for these properties. Data for the remaining properties was randomly cut into a training set and a test set in a 4:1 ratio. Three such random cuts were performed per property, so that statistics (e.g., standard deviation) of model performance could be computed.

Figure 4: Fingerprint time as a function of method, capacity, and hardware. Fingerprint time t, measured in milliseconds per polymer, is plotted on the y-axis. t was computed using a diverse set of 13,388 polymers. Above each bar is the total time (in plain text) in seconds taken to compute fingerprints for the entire set, as well as the speedup (in parentheses) relative to the handcrafted PG method. t varies depending on the fingerprint method, the hardware, and the model capacity. Method and hardware are labelled on the x-axis; CPU and GPU refer to one Intel Xeon Gold 6140 CPU core and to one 32 GB Nvidia Tesla V100-PCIE GPU, respectively. Capacity is denoted by bar color.
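The evaluation protocol (three independent random 4:1 train/test cuts per property) can be sketched as follows; the function name and seeds are illustrative, not taken from the polyGNN codebase.

```python
import random

def random_cut(data, train_frac=0.8, seed=0):
    """One random train/test cut of the data at the given ratio."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_train = int(len(data) * train_frac)
    train = [data[i] for i in idx[:n_train]]
    test = [data[i] for i in idx[n_train:]]
    return train, test

data = list(range(100))                          # stand-in for one property's data
cuts = [random_cut(data, seed=s) for s in range(3)]  # three cuts per property
for train, test in cuts:
    assert len(train) == 80 and len(test) == 20
```

Training a model on each cut yields three scores per property, from which a mean and standard deviation of performance can be reported.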
Kuenneth et al. 5 showed that multitask learning significantly improves the accuracy of polymer property prediction relative to single-task learning. Thus, we train single-task (ST) and multitask (MT) polyGNNs and compare both on the same data. As a benchmark, we also train both ST and MT "PG-MLPs" (i.e., MLPs that use the handcrafted PG fingerprint as input; see Section S4 for details on this architecture). A detailed discussion of our training procedure can be found in Section S5. The root-mean-squared error (RMSE) and R^2 values of polyGNN and PG-MLP are compared in Tables 1 and S1.
We note several observations from these results. First, our data augmentation strategy plays a critical role in teaching polyGNN models invariance to addition and subtraction (see Table S2). Second, we find that MT learning is an important component of our approach, especially in low-data situations (see Table S1). The relatively low performance on a handful of properties could be explained by the fact that the polyGNN models trained here struggle to learn the block- or chain-level features (which typically consist of 4+ atoms) present in the handcrafted PG fingerprint. In principle, increasing the number of message passing steps, so as to capture larger-length-scale features, should mitigate this challenge. In practice, however, we observe a threshold number of message passing steps: above three steps, model generalization only worsens, regardless of the property of interest. This empirical observation has been reported by others and is due to a collapse in which the learned fingerprints of all polymers, even chemically distinct ones, converge. 54,55 However, as evidenced by the impressive performance of the MT polyGNN models on the vast majority of properties (for example, the RMSE of the MT polyGNN approach on T_g is 31.7 ± 1.5 K), the inability to learn block- or chain-level features is often ameliorated by the ability to learn lower-level features that go beyond those currently present in the handcrafted PG fingerprint. Still, the development of techniques that encourage GNNs to surpass the message passing threshold is a critical next step. We leave this task for future work.

Summary and Outlook
In summary, we have produced polyGNN, the first protocol that integrates polymer feature learning from SMILES strings and other relevant features, invariant transformations, data augmentation, and multitask learning. Through careful comparison, we show that our protocol culminates in ultrafast polymer fingerprinting and accurate property prediction over the most comprehensive array of chemistries and properties studied to date. The gains in speed are essential when screening large candidate sets (e.g., millions or billions of polymers) and/or when computational resources are limited. Our approach is especially accurate when the data set size is moderate to large. Even with data sets containing fewer than 300 points, our approach is at least competitive with presently adopted methods in a majority of cases.
Looking ahead, though polyGNNs perform remarkably well in the experiments tried here, handcrafted polymer fingerprints retain advantages. In tasks where chain- or block-level features are essential, handcrafted fingerprinting approaches may yield the best model accuracy. Advances in the optimization of graph neural networks are needed to make the accuracy of polyGNNs competitive in these tasks. Finally, a handcrafted feature is, by definition, interpretable. In contrast, the features learned by the polyGNNs presented here are not. Following the work of others, 56 future polyGNN architectures may incorporate attention mechanisms for partial interpretability. However, making polyGNN features as interpretable as handcrafted features will require further innovation. Despite these shortcomings, we anticipate that the adoption of polyGNNs and related approaches will increase as they unlock the ability to screen truly massive polymer libraries at scale.

Public Use
The sources of data used in this work and the availability of each source are reported in the paper. The code used to train our polyGNN models is available at github.com/Ramprasad-Group/polygnn for academic use.

S2 polyGNN2 Encoder
The input to the polyGNN2 Encoder is a repeat unit. From this repeat unit, the trimer graph (shown in Figure S1) is created and then featurized. The atoms, bonds, and features corresponding to the central repeat unit of the trimer graph are used to create the periodic graph. Thus, the initial fingerprint of the periodic graph is always invariant to addition and subtraction, as shown for the example of polyacetylene in Figure S1.
However, the polyGNN2 Encoder is slower than the polyGNN Encoder: while the polyGNN Encoder takes 26 seconds to featurize the graphs of 13,338 polymers, the polyGNN2 Encoder takes 37 seconds to featurize the same set of polymer graphs.
Figure S1: Two equivalent repeat units ("A" and "B", where "A" and "B" refer to four- and six-atom repeat units, respectively) of polyacetylene and their corresponding trimer and periodic graphs. Each repeat unit is converted to a trimer graph. Each node (i.e., heavy atom) in the trimer graph is featurized. Each atom is labeled with a zero if the atom is aliphatic or with a one if the atom is aromatic. Other atomic features and all bond features are not shown for visual clarity. The atoms, bonds, and features at the center of the trimer graph (shaded in yellow) are used to form the periodic graph.
The fingerprint of the entire polymer, x_p, is calculated by the graph aggregation function A_g, as shown in Eq. 3.
Finally, x_p and the selector s can be passed to the Estimator. Here, these inputs are mapped to a polymer property prediction, y_p, via a parameterized function ψ. We implement ψ as a multilayer perceptron.
During training, the parameters of all φ^(k), χ^(k), and ψ are learned simultaneously. As shown in Eq. 2, our update step leverages skip connections, which have been shown to improve the optimization of shallow layers in deep neural networks. 53

S4 Handcrafted PG models
The handcrafted PG models are made up of five MLP submodels trained using five-fold cross-validation. The input to each MLP is the handcrafted PG fingerprint of a given polymer repeat unit together with a property selector; the output is the predicted property value.

S5 Training procedure
Each of the models discussed in the main text is an ensemble model composed of several submodels. The output of the ensemble is computed as a simple average of the submodels' outputs. For multitask ensembles, data for all properties within a group were combined, target values were scaled, and selectors were assigned as described in Section 2.1 of the main text. For single-task ensembles, properties were not combined into groups, though target values were still MinMax scaled.

Figure 1 :
Figure 1: The polyGNN architecture. The Encoder converts the repeat unit SMILES string to a periodic graph and then computes initial atomic and bond fingerprint vectors (green and purple squares, respectively). A subsequent set of atomic fingerprints (yellow squares) is learned by the Message Passing Block and then averaged, yielding the learned polymer fingerprint (light blue square). This fingerprint and a series of selector vectors are passed to the Estimator, producing a series of property predictions. T_d, T_m, and T_g refer to the critical temperatures for thermal decomposition, melting, and glass transition, respectively.

Figure 2 :
Figure 2: Breakdown of our data set. (A) The symbol, name, and unit of each property in our data set. For properties with data from both experiment and DFT calculations, the two sources are distinguished by the abbreviations "expt." and "DFT". Our data set includes the permeability µ_g of six gases g ∈ {He, H2, CO2, O2, N2, CH4}. Each permeability data point is scaled by x → log10(x + 1). Our experimental dielectric constant ε_f data contain measurements at nine frequencies f ∈ {1.78, 2, 3, 4, 5, 6, 7, 9, 15} in log10 Hz. The distributions of µ_g and ε_f are given in Section S1. (B) The data set size per property, shown both on the y-axis and above each bar. Bars of the same color belong to the same property class. "perm." stands for gas permeability.

Figure 3 :
Figure 3: Overview of data set augmentation. (A) Two equivalent repeat units of infinite polyacetylene and their corresponding periodic graphs. Each atom in the graph is labeled with a zero if the atom is aliphatic or with a one if the atom is aromatic. Other atomic features and all bond features are not shown for visual clarity. (B) Data augmentation strategy for polyGNN. Rows of the original training data are transformed by repeat unit addition.
Benchmarking speed

polyGNN was developed with a primary objective in mind: to increase the rate at which large libraries of polymers may be screened. We quantified this rate by measuring the time needed to fingerprint a data set of 13,388 known polymers at a variety of capacities and on a variety of hardware. Capacity, as used in this work, is a hyperparameter that specifies both the number of message passing steps and the depth of each multilayer perceptron (MLP) in the network.
The timings to compute 13,388 polymer fingerprints with a randomly initialized polyGNN model are given in Fig. 4. A shallow polyGNN (with a capacity of two) fingerprints the set of polymers in 32 seconds (2.4 ms per polymer) on one CPU or 30 seconds (2.2 ms per polymer) on one GPU. Meanwhile, a deep polyGNN (with a capacity of 12) takes 57 seconds (4.3 ms per polymer) to compute the fingerprint set on one CPU or 32 seconds on one GPU.
As shown in Table S1, polyGNNs that do not use MT learning exhibit erroneous predictions (i.e., negative R^2 values) for five properties (E_i, ε_1.78, ε_2, ε_5, and ε_6), each with 158 or fewer data points. In contrast, with MT learning, polyGNNs exhibit positive R^2 for each of the 34 properties studied.
For most properties with more than 300 data points, MT polyGNN models perform comparably to the MT PG-MLP benchmark (the RMSE of their predictions is within 5% of that property's standard deviation σ; see Table S3 for a complete list of σ values). However, for the 20 properties containing 300 data points or fewer, the situation becomes more complex. MT polyGNN models still perform well relative to the MT PG-MLP benchmark, but not for every property: they are more or comparably accurate for 16 properties but notably less accurate on four (experimental crystallization tendency X_e, ε_1.78, ε_2, and ε_6).

Table 1 :
Average RMSE plus/minus one standard deviation on unseen test data. Starred properties contain 300 or fewer data points. Models with the best, or comparable to the best, average RMSE are bolded. The unit of each RMSE value matches those listed in Fig. 2A.

Table S2 contains the average variance of models trained with and without augmentation, var_augment and var_no-augment, respectively, on each of the 36 properties studied in this work. The average variance is computed as

var = (1/|P|) Σ_{p∈P} var({f(e) : e ∈ E_p}),

where P is a set of non-equivalent repeat units, var is the variance function, f is a machine learning model, and E_p is a set of repeat units related to p (a repeat unit in P) by addition or subtraction. In this work, P is a set of 9 repeat units not seen by any model during training, and each E_p is composed of p, 2p, 3p, 4p, and 5p. For example, if p is (C), then E_p = {(C), (CC), (CCC), (CCCC), (CCCCC)}. Table S3 lists the standard deviations of the data in our corpus, grouped by property.
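The invariance diagnostic described above can be sketched directly: for each repeat unit p, evaluate a model f on the equivalent set E_p = {p, 2p, ..., 5p} and average the per-set prediction variances. A perfectly addition-invariant model scores zero. The toy models below are illustrative stand-ins, not trained polyGNNs.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def avg_invariance_variance(f, units, max_copies=5):
    """Average, over repeat units p, of the variance of f across E_p."""
    per_unit = []
    for p in units:
        preds = [f(p * k) for k in range(1, max_copies + 1)]  # E_p = {p, ..., 5p}
        per_unit.append(variance(preds))
    return sum(per_unit) / len(per_unit)

# A toy "model" that depends only on element fractions is exactly invariant:
invariant_f = lambda s: s.count("O") / len(s)
print(avg_invariance_variance(invariant_f, ["CO", "CCO"]))  # 0.0

# A toy model that depends on absolute length is not:
print(avg_invariance_variance(len, ["CO"]) > 0)  # True
```

A small var_augment relative to var_no-augment in Table S2 indicates that augmentation successfully taught the model approximate invariance.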

Table S1 :
Average R^2 plus/minus one standard deviation on unseen test data. Starred properties contain 300 or fewer data points.

Table S2 :
The average variance of models trained with and without augmentation (var_no-augment and var_augment). The unit of each property is given in Table S3.

Table S3 :
The standard deviation (σ) of data in our corpus, grouped by property.