Predicting Power Conversion Efficiency of Organic Photovoltaics: Models and Data Analysis

In this paper, the ability of three selected neural machine learning models and three baseline models to predict the power conversion efficiency (PCE) of organic photovoltaics (OPVs) from molecular structure information is assessed. The bidirectional long short-term memory (g-FSI/BiLSTM), attentive fingerprints (Attentive FP), and simple graph neural network (simple GNN) models, as well as the baseline support vector regression (SVR), random forests (RF), and high-dimensional model representation (HDMR) methods, are trained on both the large computational Harvard Clean Energy Project database (CEPDB) and the much smaller experimental Harvard Organic Photovoltaic 15 dataset (HOPV15). It was found that the neural models generally performed better on the computational dataset, with the Attentive FP model reaching state-of-the-art performance with a test set mean squared error of 0.071. The experimental dataset proved much harder to fit, with all of the models exhibiting rather poor performance; contrary to the computational dataset, the baseline models were found to perform better than the neural models. To improve the ability of machine learning models to predict PCEs for OPVs, either better computational results that correlate well with experiments or more experimental data obtained under well-controlled conditions are likely required.

S1.1 g-FSI/BiLSTM

The g-FSI/BiLSTM model was thus chosen as one of the neural models in this work. Only the key model aspects are provided here; readers are directed to refs. 1,2 for a more detailed description.
Feature generation. This step follows the common procedure as described previously.
Atom types and their aromaticity are used to encode the node features whereas bond types (single, double, triple and aromatic bonds) are used to encode the edge features.
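As an illustration, this featurization can be sketched with RDKit; the helper name featurize and the exact tuple layout are assumptions for this example, not the authors' implementation.

```python
# A minimal sketch of the node/edge feature encoding described above,
# using RDKit (illustrative only; the paper's featurization may differ).
from rdkit import Chem

def featurize(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    # Node features: atom type (atomic number) and aromaticity flag.
    node_features = [(atom.GetAtomicNum(), atom.GetIsAromatic())
                     for atom in mol.GetAtoms()]
    # Edge features: bond type (single, double, triple, aromatic).
    edge_features = [(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(),
                      str(bond.GetBondType()))
                     for bond in mol.GetBonds()]
    return node_features, edge_features

nodes, edges = featurize("c1ccccc1O")  # phenol as a small example
```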
Circular iteration. The Weisfeiler-Lehman algorithm 3 is used to circulate around the nodes of the molecular graphs generated in the previous step. The algorithm consists of two stages: (i) an initial assignment stage, in which each node and edge is assigned a unique integer identifier that represents its starting feature set, and (ii) an iterative updating stage, in which each node identifier is updated to reflect the identifiers of its neighbours and each edge identifier is updated to reflect the identifiers of the nodes it connects. Once all nodes and edges have generated their new identifiers, the algorithm cycle is repeated until the requested number of iterations is reached. At the end, the final node identifiers are assembled into the final molecular fingerprint vector. Since the node and edge features only include atom and bond type information, one can also think of this molecular fingerprint vector as a unique list of sub-molecular fragments building the parent molecule, hence the name graph fragment sequence identifiers (g-FSI) in the original publication 1. In this paper, only a single iteration of the Weisfeiler-Lehman algorithm has been performed by setting its radius parameter to one. To avoid confusion with ordinary BiLSTM networks, the model will be referred to as g-FSI/BiLSTM for the remainder of the paper.
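A minimal Python sketch of one such iteration (radius one, as used here) might look as follows; the data layout (adjacency lists, edge labels keyed by sorted node pairs) and the function name are assumptions for illustration. Identifiers are assigned on a first-found first-served basis, as discussed below.

```python
# A minimal sketch of one Weisfeiler-Lehman iteration on a molecular graph.
def wl_iteration(node_labels, edge_labels, adjacency, lookup):
    """node_labels: dict node -> initial label (e.g. atom type);
    edge_labels: dict (u, v) with u < v -> bond type;
    adjacency: dict node -> list of neighbour nodes;
    lookup: dict mapping discovered fragment signatures to integer ids."""
    new_labels = {}
    for v, neighbours in adjacency.items():
        # Collect (bond type, neighbour label) pairs; sorting makes the
        # signature independent of the neighbour enumeration order.
        env = sorted((edge_labels[tuple(sorted((v, u)))], node_labels[u])
                     for u in neighbours)
        signature = (node_labels[v], tuple(env))
        if signature not in lookup:          # first-found first-served
            lookup[signature] = len(lookup)  # no hashing involved
        new_labels[v] = lookup[signature]
    return new_labels
```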
The g-FSI vector encodes two important structural aspects: fragment (fingerprint) types and their order in the molecule. Note that the mapping of discovered fragment types to unique integer identifiers proceeds on a first-found first-served basis with no hashing. Therefore, it is dependent on the order in which the data are processed. In order to eliminate this dependency, prior to using the model, the selected datasets are processed first to create the fragment-identifier look-up dictionaries. This step also finds the molecules with the largest number of fragments, which in turn are used to define the maximum sequence length for the BiLSTM network. Furthermore, it is important to highlight that the order of fragments in the g-FSI vector depends on the order in which the molecular graph nodes are processed.
To keep fragments that are spatially close to each other together (e.g. two fragments that are part of the same aromatic ring), the molecular graph is iterated in breadth-first search (BFS) order.
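For illustration, such an ordering can be produced with a standard BFS traversal (sketch only; the choice of start node is left open here):

```python
# A possible BFS ordering of graph nodes so that spatially close fragments
# stay adjacent in the g-FSI sequence.
from collections import deque

def bfs_order(adjacency, start):
    order, seen, queue = [], {start}, deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for u in adjacency[v]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return order
```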
Node type embedding and BiLSTM. This step starts by passing the g-FSI vectors through a trainable embedding layer, which, for a single molecule, results in a sequence of real-valued fragment embedding vectors (one for each fragment integer identifier).
Next, the embedding vectors are processed by the forward and backward BiLSTM cells, resulting in a sequence of hidden forward and backward state vectors.
Readout. In order to obtain the final PCE value, the two BiLSTM hidden states are concatenated and combined in a weighted sum in the attention layer. This is followed by propagating the attention layer output through a multilayer perceptron.
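Putting the three steps together, a compact PyTorch sketch of this architecture could read as follows; all layer sizes, the padding handling, and the exact attention form are illustrative assumptions rather than the paper's hyperparameters.

```python
# A compact PyTorch sketch of the g-FSI/BiLSTM architecture described above:
# embedding -> BiLSTM -> attention-weighted sum -> MLP.
import torch
import torch.nn as nn

class GFSIBiLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=64):
        super().__init__()
        # Assumes identifier 0 is reserved for sequence padding.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.attention = nn.Linear(2 * hidden_dim, 1)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim),
                                 nn.ReLU(), nn.Linear(hidden_dim, 1))

    def forward(self, fragment_ids):           # (batch, seq_len) integer ids
        x = self.embedding(fragment_ids)       # (batch, seq_len, embed_dim)
        h, _ = self.bilstm(x)                  # concatenated fwd/bwd states
        # Attention-weighted sum over the sequence (padding mask omitted).
        weights = torch.softmax(self.attention(h), dim=1)
        context = (weights * h).sum(dim=1)
        return self.mlp(context).squeeze(-1)   # predicted PCE
```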

S1.2 Simple graph neural network
The neural network presented in this section is simple in three respects: (1) it belongs to the class of graph convolutional networks but is implemented in a straightforward manner.
Circular iteration. For each node $v$ in the molecular graph, the embedding vector corresponding to its node type is chosen and used as the initial node state $h^0(v) \in \mathbb{R}^{ED}$. The graph layers of the simple GNN are indexed with $l$ ranging from 1 to the overall layer count, CL. Each layer can be described in terms of the message passing neural network (MPNN) framework introduced in 6. First, messages with transformed states are received from $v$'s direct neighbour nodes $N(v)$ and aggregated,

$$m^l(v) = \frac{1}{\#N(v)} \sum_{u \in N(v)} W^l h^{l-1}(u),$$

where $W^l \in \mathbb{R}^{ED \times ED}$ denotes a matrix of trainable weights and $\#N(v)$ is the number of neighbour nodes. Finally, the state of node $v$ at layer $l$ is updated by

$$h^l(v) = \mathrm{relu}\!\left(m^l(v)\right),$$

where the rectified linear activation function, relu, is applied component-wise. For simplicity, $W^l$ is a square matrix such that all node states have the same dimension, ED, throughout all graph layers. Here, the layer count, CL, plays the same role as for Attentive FP, as the radius in the Weisfeiler-Lehman algorithm, and as the radius for generating Morgan fingerprints: it corresponds to the maximum distance of (direct and indirect) neighbour nodes that may contribute to the final state of node $v$.
Readout. The state of the graph $G$ is set to the empirical mean over all node states at the final layer,

$$h(G) = \frac{1}{\#N(G)} \sum_{v \in N(G)} h^{CL}(v),$$

where $N(G)$ is the set of all nodes in $G$. Finally, the graph state $h(G) \in \mathbb{R}^{ED}$ is fed into a multilayer perceptron with one output neuron for PCE prediction, analogously to the g-FSI/BiLSTM model.
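A minimal PyTorch sketch of one such graph layer and the mean readout, following the equations above, is given below; the dense adjacency representation is chosen here only for brevity. Stacking CL such layers reproduces the receptive-field behaviour described above.

```python
# A minimal sketch of a simple GNN layer with mean aggregation and relu,
# plus the empirical-mean readout over all node states.
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.W = nn.Linear(embed_dim, embed_dim, bias=False)  # W^l

    def forward(self, h, adj):
        # h: (n_nodes, ED) node states; adj: (n_nodes, n_nodes) 0/1 matrix.
        # Average the transformed neighbour states, then apply relu.
        messages = adj @ self.W(h) / adj.sum(dim=1, keepdim=True).clamp(min=1)
        return torch.relu(messages)

def readout(h):
    # Empirical mean over all node states: h(G).
    return h.mean(dim=0)
```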

S1.3 Attentive fingerprint
Attentive FP (fingerprint) is a graph neural network that was introduced by Xiong et al. 7 and achieves state-of-the-art performance on several datasets that are widely used in ML research, according to a comprehensive assessment of several neural networks and ML methods by Wu et al. 8 Attentive FP makes use of both Gated Recurrent Units (GRU) 9 and attention, and thus takes up ideas from Gated Graph Neural Networks 10 and Graph Attention Networks. 11 It applies the attention mechanism both on the atomic node level and on the molecular graph level to focus on substructures that are relevant for predicting target variables.

Readout. A virtual super node is added to the molecular graph and connected to all of its atom nodes. Its state represents the state of the entire graph and is initialized with the sum of all atom node states updated at the end of the last iteration. The super node state itself is updated over several iterations using its own GRU. In each iteration, the attention weight between the super node and an atom node is calculated in the same way as between two atom nodes. Finally, the super node state is fed into a fully connected layer, which has one output neuron in the case of PCE as the target variable.
The GRU can be regarded as a less complex version of an LSTM. The g-FSI/BiLSTM model uses breadth-first search for ordering the fragments of a molecule and applies two LSTMs in a bidirectional manner to find patterns within the horizontal sequence of ordered fragments. In contrast, the GRU in Attentive FP conveys information for each target node in the molecular graph vertically, i.e. from layer to layer; in each layer, the local environment from the node's direct neighbours is merged with its own information from previous layers (i.e. from previous time steps). Moreover, the information flows for all nodes layer-wise and in parallel, and the nodes share the same GRU and the corresponding trainable weights.
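To make the super-node readout concrete, a schematic PyTorch sketch is given below; the alignment layer, the number of iterations, and other details are simplifying assumptions and do not reproduce the exact Attentive FP equations.

```python
# A schematic sketch of the super-node readout: attention over atom node
# states followed by a GRU update of the graph state, repeated a few times.
import torch
import torch.nn as nn

class SuperNodeReadout(nn.Module):
    def __init__(self, dim, iterations=3):
        super().__init__()
        self.align = nn.Linear(2 * dim, 1)   # attention score per atom node
        self.gru = nn.GRUCell(dim, dim)      # the super node's own GRU
        self.out = nn.Linear(dim, 1)         # one output neuron for PCE
        self.iterations = iterations

    def forward(self, node_states):          # (n_atoms, dim)
        g = node_states.sum(dim=0)           # initialize with sum of states
        for _ in range(self.iterations):
            pairs = torch.cat([g.expand_as(node_states), node_states], dim=1)
            attn = torch.softmax(self.align(pairs), dim=0)   # (n_atoms, 1)
            context = (attn * node_states).sum(dim=0)        # (dim,)
            g = self.gru(context.unsqueeze(0), g.unsqueeze(0)).squeeze(0)
        return self.out(g)
```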

S1.4 Baseline models
In this paper, we use Morgan fingerprints as input features for all baseline methods. The main idea behind Morgan fingerprints stems from the Morgan algorithm 12, which aims to provide a unique numbering scheme for chemical structures. The Morgan algorithm is specific to molecular graphs and tries to solve the isomorphism problem between them, while the Weisfeiler-Lehman algorithm shares the same goal for graphs in general.
RDKit implements Morgan fingerprints along the lines of the Extended Connectivity Fingerprint algorithm, a modified Morgan algorithm described in 13. The generation of Morgan fingerprints requires two integer-valued parameters: the fingerprint radius, FR, and the fingerprint size, FS.

Readout. A fingerprint vector with FS bits is initialized with zeros. Each remaining identifier is converted to an integer i and the bit at position (i mod FS) in the fingerprint vector is set to one.
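In practice, such fingerprints can be generated directly with RDKit; a minimal sketch follows (the parameter values are examples only, not the settings used in this work):

```python
# Generating a Morgan fingerprint bit vector with RDKit; FR and FS are the
# two parameters named above (example values shown).
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

FR, FS = 2, 2048  # fingerprint radius and size
mol = Chem.MolFromSmiles("c1ccccc1O")
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=FR, nBits=FS)
x = np.array(fp)  # 0/1 feature vector of length FS
```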
The resulting bit vector is called a Morgan fingerprint. If the hash function is chosen properly and FS is large enough, collisions are unlikely and, consequently, different substructures tend to be represented by different bits. While Figure 2 emphasizes the iterative behaviour that the fingerprint algorithm has in common with the Weisfeiler-Lehman algorithm and graph neural networks, Morgan fingerprints are usually regarded as part of the feature generation. Next, we shortly describe the baseline methods and how they make use of Morgan fingerprints as input features.

Support Vector Regression (SVR). The support vector regression model uses support vector machines for function estimation. 14 Two important hyperparameters are the regularization parameter $C$ and the parameter $\varepsilon$, which specifies that the loss between the predicted value and the target value vanishes if their distance is smaller than $\varepsilon$. As in 15, we use the RBF kernel and replace the metric term between two points in Euclidean space by a distance term between two Morgan fingerprints. The resulting function is

$$k(x, x') = \exp\!\left(-GK\left(1 - T(x, x')\right)\right),$$

where the kernel coefficient $GK$ is a hyperparameter, $x$ and $x'$ are two bit vectors of size FS, and $T$ denotes the Tanimoto similarity index given by

$$T(x, x') = \frac{\sum_{i=1}^{FS} x_i x'_i}{\sum_{i=1}^{FS} x_i + \sum_{i=1}^{FS} x'_i - \sum_{i=1}^{FS} x_i x'_i}.$$
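A sketch of this baseline with scikit-learn is shown below; the use of a precomputed kernel matrix and the placeholder hyperparameter values are assumptions for illustration, not the paper's setup.

```python
# SVR with a Tanimoto-based kernel over Morgan fingerprints, following the
# equations above; scikit-learn accepts a precomputed kernel matrix.
import numpy as np
from sklearn.svm import SVR

def tanimoto(X, Y):
    # X: (n, FS), Y: (m, FS) binary matrices -> (n, m) similarity matrix.
    inter = X @ Y.T
    norm = X.sum(axis=1)[:, None] + Y.sum(axis=1)[None, :] - inter
    return inter / np.maximum(norm, 1)  # guard against all-zero vectors

def kernel(X, Y, GK=1.0):
    return np.exp(-GK * (1.0 - tanimoto(X, Y)))

# X_train: (n, FS) fingerprint matrix, y_train: PCE values (not shown here)
# model = SVR(kernel="precomputed", C=10.0, epsilon=0.1)
# model.fit(kernel(X_train, X_train), y_train)
# y_pred = model.predict(kernel(X_test, X_train))
```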

Random forests (RF). Random forests (RF) is an ensemble method that can be used for classification and regression. The method constructs a number of decision trees by randomly subsampling the dataset and by randomly selecting subsets of features. 16 It makes a prediction by averaging the outcomes of the individual decision trees. Important hyperparameters are the number of trees, the maximum number of samples, and the maximum number of features. A full list of the RF hyperparameters and their ranges is provided in Table S2 in the Appendix. In the case of Morgan fingerprints, the fingerprint size corresponds to the overall number of features, from which a maximum number of bits is selected for constructing an individual decision tree.
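A minimal scikit-learn version of this baseline might look as follows; the hyperparameter values here are placeholders, not the tuned values from Table S2.

```python
# A random forest regression baseline on Morgan fingerprint bits.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500,    # number of trees
                           max_samples=0.8,     # fraction of samples per tree
                           max_features="sqrt") # bits considered per split
# rf.fit(X_train, y_train); y_pred = rf.predict(X_test)
```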

High Dimensional Model Representation (HDMR). The High Dimensional Model Representation is the final chosen baseline model. The method works by approximating an unknown multivariable function, $f$, using its finite expansion 17. In this case, $f$ represents a mapping between the molecular fingerprint vector bit values, $x_i$, and the final power conversion efficiency of an organic heterojunction solar cell, $y$, with the molecular fingerprint vector generated from the structure of the donor candidate molecule. Under the HDMR representation, the function $f$ can be approximated using the following expression:

$$f(x) \approx f_0 + \sum_{i=1}^{N_x} f_i(x_i) + \sum_{1 \le i < j \le N_x} f_{ij}(x_i, x_j), \tag{S6}$$

where $N_x$ is the dimension of the input space (for this application it is equal to the fingerprint size, FS) and $f_0$ represents the mean value of $f(x)$. The above approximation is sufficient in many practical situations, since terms containing functions of more than two input parameters can often be ignored due to their negligible contributions compared to the lower-order terms 18. The terms in Equation S6 are evaluated by approximating the functions $f_i(x_i)$ and $f_{ij}(x_i, x_j)$ with orthonormal basis functions, $\phi_k(x_i)$. In this work, the basis functions have been constructed from ordinary polynomials 18. The HDMR parameters are listed in Table S2.
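As a rough illustration, a first-order HDMR fit with ordinary polynomial basis functions can be set up as a least-squares problem; the construction below is a sketch under stated assumptions, not the authors' implementation. Note that for binary fingerprint bits the polynomial terms reduce to linear ones, and the pairwise terms $f_{ij}$ (which would add columns $x_i x_j$) are omitted for brevity.

```python
# A rough numpy sketch of a first-order HDMR fit with ordinary polynomial
# basis functions phi_k(x) = x**k, fitted by linear least squares.
import numpy as np

def hdmr_design(X, degree=2):
    """Build a design matrix [1, phi_k(x_i)] for the first-order terms."""
    n, d = X.shape
    columns = [np.ones(n)]                  # constant term f0
    for i in range(d):                      # first-order terms f_i(x_i)
        for k in range(1, degree + 1):
            columns.append(X[:, i] ** k)
    return np.column_stack(columns)

# X: (n_samples, FS) fingerprint matrix, y: PCE values (not shown here)
# A = hdmr_design(X)
# coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
# y_pred = hdmr_design(X_new) @ coeffs
```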