Chemprop: A Machine Learning Package for Chemical Property Prediction

Deep learning has become a powerful and frequently employed tool for the prediction of molecular properties, thus creating a need for open-source and versatile software solutions that can be operated by nonexperts. Among the current approaches, directed message-passing neural networks (D-MPNNs) have proven to perform well on a variety of property prediction tasks. The software package Chemprop implements the D-MPNN architecture and offers simple, easy, and fast access to machine-learned molecular properties. Compared to its initial version, we present a multitude of new Chemprop functionalities such as the support of multimolecule properties, reactions, atom/bond-level properties, and spectra. Further, we incorporate various uncertainty quantification and calibration methods along with related metrics as well as pretraining and transfer learning workflows, improved hyperparameter optimization, and other customization options concerning loss functions or atom/bond features. We benchmark D-MPNN models trained using Chemprop with the new reaction, atom-level, and spectra functionality on a variety of property prediction data sets, including MoleculeNet and SAMPL, and observe state-of-the-art performance on the prediction of water-octanol partition coefficients, reaction barrier heights, atomic partial charges, and absorption spectra. Chemprop enables out-of-the-box training of D-MPNN models for a variety of problem settings in fast, user-friendly, and open-source software.


Example commands
To train a default model on the ESOL solubility dataset, 1 which is distributed with Chemprop as a CSV file, and save the results to the folder "checkpoint", run

chemprop_train --data_path data/delaney.csv --dataset_type regression --save_dir checkpoint --save_smiles_splits

on the command line after installing Chemprop following the instructions on GitHub. 2 This splits the data randomly into training, validation and test sets in the ratio 80/10/10, trains a default model, and computes the performance on the test set. To compute predictions using an already trained model, run

chemprop_predict --checkpoint_dir checkpoint --test_path checkpoint/fold_0/test_smiles.csv --preds_path checkpoint/test_preds.csv

which takes the previously generated test set, computes predictions using all models in the checkpoint folder, and saves them to the indicated path. For the use of Chemprop within a Python script or a graphical web interface, as well as the many options to customize the model, data splits, and performance metrics, please consult the instructions on GitHub 2 or the Chemprop documentation. 3 Hyperparameter optimization can be performed with similar commands, as detailed in the Discussion of Features section.

Additional features
Users can provide their custom additional features by adding keywords and paths to the data files containing the features.
For molecule-level features x_m, a path to the features can be specified using the keyword --features_path PATH/TO/FEATURES. The provided molecular features are concatenated to the learned molecular embedding prior to the FFN. The features can be provided as a numpy .npy file or a CSV file. For both file formats, the features must be in the same order as the SMILES strings in the data file. The features file should not contain the SMILES strings, since features are associated with the corresponding molecule based on the ordering in the file. The features file should contain numerical values, with columns corresponding to different features and rows corresponding to molecule data points. By default, provided features are normalized unless the flag --no_features_scaling is used.
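As a sketch of preparing such a molecule-level features file, the following writes a .npy array whose rows are aligned with the SMILES order of the data file. The molecules and descriptor values here are hypothetical placeholders, not part of any Chemprop dataset.

```python
import numpy as np

# Hypothetical molecule-level features (two descriptors per molecule),
# one row per SMILES string, in the same order as the data CSV.
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
features = np.array([
    [46.07, 1.2],   # row for CCO
    [78.11, 2.0],   # row for c1ccccc1
    [60.05, 0.9],   # row for CC(=O)O
])

# One row per molecule, and no SMILES column in the features file.
assert features.shape[0] == len(smiles)
np.save("features.npy", features)

# Round-trip check: the saved array matches what was written.
loaded = np.load("features.npy")
assert loaded.shape == (3, 2)
```

The resulting file could then be passed via --features_path features.npy alongside the data CSV.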
For additional atomic features x_v, the path to the features can be provided using the keyword --atom_descriptors_path PATH/TO/FEATURES. The supported file formats include .npz, .pkl, and .sdf. Two options are available to select the way atom descriptors are used. The option --atom_descriptors descriptor concatenates the additional features to the embedded atomic features after the D-MPNN. On the other hand, the option --atom_descriptors feature concatenates the features to the initial atomic feature vectors prior to the D-MPNN, such that they can be used during message passing. Additional bond-level features can be provided via --bond_descriptors_path PATH/TO/FEATURES in the same format as the atom-level features. Similarly, users must choose the way bond descriptors are used. The option --bond_descriptors descriptor concatenates the new bond-level features to the embedded bond features after the D-MPNN, which can only be used for bond-level property prediction, while the option --bond_descriptors feature concatenates the new features with the default bond feature vectors before the D-MPNN.
Users must ensure that the order of additional atom and bond features matches the atom and bond ordering in the RDKit molecule object. If users wish to use only their custom features instead of the default features, the keywords --overwrite_default_atom_features and --overwrite_default_bond_features can be used to overwrite the default atom and bond features, respectively. The overwrite option is only available when the additional features are used as feature. Similar to the molecule-level features, the atom- and bond-level features are normalized automatically by default. This can be disabled with the options --no_atom_descriptor_scaling and --no_bond_descriptor_scaling.
The inputs of atom and bond features can be provided via three file formats: .npz, .pkl/.pckl/.pickle, and .sdf. Examples for these formats are given at the end of this Supporting Information.

Regularization
Chemprop has two built-in forms of regularization, intended to help reduce overfitting in trained models. These two regularization techniques were present in the initial release of Chemprop. The second form of regularization is called dropout. During training, dropout regularization randomly zeroes out a fraction of the latent variables for that forward pass. This practice has been shown to reduce overfitting and lead to higher quality latent variables. 4 The level of dropout regularization can be specified using the option --dropout <p> where p is the dropout probability. By default, dropout is inactive. We have observed dropout to be a helpful addition to models in a variety of contexts and recommend that users include it in their choices of hyperparameters.
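The mechanism behind dropout can be sketched in a few lines of numpy. This is a generic illustration of inverted dropout (the variant used by most deep learning frameworks), not Chemprop's internal implementation:

```python
import numpy as np

def dropout(x, p, rng):
    """Inverted dropout: zero each latent variable with probability p
    and rescale the survivors so the expected activation is unchanged."""
    mask = rng.random(x.shape) >= p
    return np.where(mask, x / (1.0 - p), 0.0)

rng = np.random.default_rng(0)
x = np.ones(100_000)
y = dropout(x, p=0.2, rng=rng)

# Roughly 20% of entries are zeroed, and the mean is preserved in expectation.
frac_zero = np.mean(y == 0)
assert abs(frac_zero - 0.2) < 0.01
assert abs(y.mean() - 1.0) < 0.01
```

At inference time dropout is disabled, so the rescaling during training keeps the activation statistics consistent between the two regimes.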

Multi-molecule models
The number of molecules N is specified with the keyword --number_of_molecules. By default, a separate D-MPNN is trained for each molecule (Figure S1a). If the option --mpn_shared is specified, the same D-MPNN is used for all molecules (Figure S1b). In both cases, the resulting molecular vectors are concatenated and used as input to the FFN.

Reaction support
The initial atom and bond feature vectors in the CGR contain information on both the reactant and product features. Whenever information is not available, e.g. because a bond did not exist in either the reactants or products, the features are set to zero. A simple concatenation of reactant and product features can be used to obtain the pseudomolecule features (keyword --reaction_mode reac_prod). Since the atomic number does not change upon reaction, its one-hot encoding is not repeated in the second part of the feature vector. For many reaction properties the change in the local structure upon reaction, i.e. the difference between reactants and products, is very informative. Since neural networks are known to not perform well for addition and subtraction operations, we also provide options to include the difference in features directly. Namely, one can concatenate the difference in atom and bond features with the reactant features (keyword --reaction_mode reac_diff, default) or with the product features (keyword --reaction_mode prod_diff).

Hyperparameter optimization

Chemprop provides a command line utility, chemprop_hyperopt, that automates this process by removing the need to manually define the search space of hyperparameters. Users can simply supply a list of keywords from which to build a hyperparameter search space (Table S3). The number of trials of hyperparameter combinations to be tested can be set using the --num_iters argument. By default, the search space will first be randomly sampled for num_iters/2 trials before switching to targeted sampling via the tree-structured Parzen estimator algorithm 6,7 for the remaining trials. The number of random trials to be used can be changed by setting --startup_random_iters to a value less than num_iters.
Hyperparameter optimization can be the most resource-intensive step in model training.
In order to search a large parameter space adequately, a large number of trials is needed. Chemprop allows for parallel operation of multiple hyperparameter optimization instances, so that the entire set of trials does not need to be run in series. Parallel operation can be achieved by setting the location of trial checkpoint files with --hyperopt_checkpoint_dir to a single shared location for multiple hyperparameter optimization instances. This allows multiple instances of the program to share and contribute to the same trial history, significantly reducing the wall time needed to perform hyperparameter optimization.
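The three reaction modes described under Reaction support can be sketched on toy atom feature vectors. The feature values below are hypothetical placeholders; the real CGR features are one-hot encodings and related descriptors:

```python
import numpy as np

# Hypothetical atom feature vectors for one mapped atom in the
# reactants and in the products.
reac = np.array([1.0, 0.0, 2.0])
prod = np.array([1.0, 1.0, 0.0])

# --reaction_mode reac_prod: concatenate reactant and product features.
reac_prod = np.concatenate([reac, prod])

# --reaction_mode reac_diff (default): reactant features plus the
# product-minus-reactant difference.
reac_diff = np.concatenate([reac, prod - reac])

# --reaction_mode prod_diff: product features plus the difference.
prod_diff = np.concatenate([prod, prod - reac])

assert reac_diff.tolist() == [1.0, 0.0, 2.0, 0.0, 1.0, -2.0]
```

Providing the difference explicitly spares the network from having to learn subtraction, which is the motivation given for the reac_diff default.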

Atom/bond-level targets
The input is provided as a CSV file. The targets of atomic properties must be a 1D list in the same order as the atoms in the RDKit 8 molecule object. The bond properties can either be a 2D list of shape n × n, where n is the number of atoms, or a 1D list in the same order as the bonds in the RDKit molecule object. An example file with both atomic and bond targets is shown in Table S4. It is also important to note that Chemprop can autodetect whether a target is an atomic or a bond target. Alternatively, the --keeping_atom_map option can be used if users wish to use atom-mapped SMILES. To apply the summation constraint to properties for each molecule, a path to the constraints can be specified using the keyword --constraints_path PATH/TO/CONSTRAINTS, in the same order as the SMILES strings in the data file. Different constraints should be separated into different columns with a header row and one row per molecule, and the file should not contain the SMILES string. Which targets are constrained is controlled by the names of the tasks in the constraint file header. For properties without constraints, the atomic or bond embeddings are linked with FFN layers. Conversely, for properties with sum constraints, attention-based layers are also constructed for each target. 9 By default, the atom tasks share FFN weights and the bond tasks share FFN weights, so that the FFN weights might benefit from multitask training. The argument --no_shared_atom_bond_ffn can be used if users want to train the FFN weights for each task independently. The argument --no_adding_bond_types prevents the bond types determined by RDKit from being added to the output of bond targets. For attention-based constraining, the argument --weights_ffn_num_layers can be used to change the number of layers in the FFN that determines the weights used to correct the constrained targets (default 2).
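A minimal sketch of writing such an input CSV with atom-level targets follows. The column name "charges" and the charge values are hypothetical; the per-atom lists must follow the RDKit atom ordering of each molecule:

```python
import csv

# Hypothetical atom-level targets: one row per molecule, SMILES in the
# first column and the per-atom targets as a 1D list in the same order
# as the atoms in the RDKit molecule object.
rows = [
    {"smiles": "CCO", "charges": "[-0.42, 0.15, -0.60]"},
    {"smiles": "CO",  "charges": "[-0.31, -0.52]"},
]

with open("atom_targets.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["smiles", "charges"])
    writer.writeheader()
    writer.writerows(rows)

# The csv module quotes the list fields (they contain commas), so they
# survive a round trip intact.
with open("atom_targets.csv") as f:
    read_back = list(csv.DictReader(f))
assert read_back[0]["charges"] == "[-0.42, 0.15, -0.60]"
```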

Benchmark methods
In the following, we describe the hyperparameter tuning procedure for our benchmark studies as well as the source, splitting routines, and further information on all benchmark datasets employed in this study.

Hyperparameter tuning
Training of benchmark models was carried out using hyperparameters optimized for each task. Throughout the remainder of this study, we classify datasets as small/large if they contain less/more than 10k data points in total. Models trained on small datasets were optimized for hyperparameters using 100 search iterations, whereas models trained on large datasets were optimized for only 30 iterations. During hyperparameter tuning and final model training, we trained for 200/50 epochs for small/large datasets. All models were trained on a single data split, with an ensemble size of 5 during the final training and without ensembling for hyperparameter tuning. During hyperparameter tuning, we optimized the number of message passing steps, the hidden size during message passing, the number of layers of the feed forward neural network as well as its hidden size, and the dropout ratio.
For small datasets, the learning rate (initial, final, and maximum), warm-up period, and batch size were furthermore optimized. For both hyperparameter tuning and model production, scaled sums were used to aggregate atomic feature vectors into molecular ones. All other parameters were left at their default values.

Datasets
The benchmarking datasets used in this study are listed in Table S5. All datasets are publicly available from the literature as described in the following. Various evaluation metrics are used to assess the performance of the Chemprop models on each dataset and against other models previously reported in the literature:
• ROC-AUC: area under the receiver operating characteristic curve
• PRC-AUC: area under the precision-recall curve

Atom/bond-level targets
To predict atom-level and bond-level targets, we selected three benchmark datasets. The framework we used to predict atomic and bond properties in Chemprop was based on modifications made to the approach developed by Guan et al., 9 who published a dataset of atomic and bond QM descriptors. For benchmarking, we also used the BDE-db dataset from St. John et al. 17 This dataset contains bond dissociation enthalpies (BDEs) for 42,577 closed-shell organic molecules with up to 9 heavy atoms of types C, H, O, and N, resulting in 290,664 BDEs. BDEs were calculated using the M06-2X/def2-TZVP level of theory. We used the same data splits as their study, 18 with 40,577 data points as training set and 1000 molecules each in the validation and test sets.
Lastly, we included a dataset of DDEC partial charges, which includes partial charges calculated with different dielectric constants (ϵ = 4 for charges in protein and ϵ = 78 for charges in water). 19 The dataset comprises 130,267 moderate size organic molecules with elements of types C, H, N, O, S, P, F, Cl, Br, and I, curated from the ZINC and ChEMBL databases. A small fraction of data in the ϵ = 78 dataset was dropped due to issues with SMILES conversion. We then randomly split the datasets into 80% training, 10% validation, and 10% test data. Two external test sets of 146 organic liquids and 1081 FDA-approved drugs were used to test the transferability of the models.

Reaction barrier heights
To benchmark Chemprop's reaction functionality, four datasets of computational barrier heights were selected to cover a broad range of dataset size, diversity and quality.Since some of the original publications only report model mean absolute errors, we also report mean absolute errors, although we train on mean squared errors similar to all other benchmarks in this study.
First, we used the datasets of E2 and S N 2 reactions originally published in Ref. 20. Third, we used the RDB7 dataset, which contains 11,926 high-accuracy reaction barrier heights and enthalpies calculated at CCSD(T)-F12/cc-pVDZ-F12, as provided in Ref. 25. In contrast to the E2, S N 2, and cycloaddition datasets that focus on one specific reaction class, this dataset spans a large range of barrier heights and is used to assess Chemprop's performance on substantially more reaction diversity. We randomly split the data into 80% training, 10% validation, and 10% test data and then added reverse reactions to each set.
Fourth, the RGD1-CNHO dataset 26 was used, which comprises the largest and most diverse dataset out of the four, and also the most difficult to learn.We again randomly split the data into 80% training, 10% validation and 10% test data and then added reverse reactions to each set.

UV/Vis absorption
Multi-molecule models are demonstrated using the prediction of the UV/Vis peak absorption wavelength, a task that involves both the absorbing molecule and the solvent. Our dataset of the peak wavelength of maximum absorption (λ max,abs) is a combination of several databases 27-30 that were extracted from the experimental literature. There are 26,395 samples across a variety of dye molecule families and solvents. Each sample consists of a dye molecule SMILES, a solvent molecule SMILES, and a peak wavelength value. There are no multi-component species of either dyes or solvents. The train-validation-test splits are in 80/10/10 proportions and are constrained to avoid data leakage of highly correlated measurements of the same dye in multiple solvents.
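The constrained splitting described above can be sketched as a grouped split, where all measurements of a given dye land in the same partition. The sample data below are synthetic placeholders:

```python
import random

# Hypothetical (dye, solvent) samples; a grouped split keeps all
# measurements of the same dye in one split to avoid leakage of
# highly correlated measurements across splits.
samples = [(f"dye{i % 50}", f"solvent{i % 7}") for i in range(1000)]

dyes = sorted({dye for dye, _ in samples})
random.Random(0).shuffle(dyes)
n_train = int(0.8 * len(dyes))
n_val = int(0.1 * len(dyes))
train_dyes = set(dyes[:n_train])
val_dyes = set(dyes[n_train:n_train + n_val])

train = [s for s in samples if s[0] in train_dyes]
val = [s for s in samples if s[0] in val_dyes]
test = [s for s in samples if s[0] not in train_dyes | val_dyes]

# No dye appears in more than one split.
assert not ({d for d, _ in train} & {d for d, _ in test})
assert len(train) + len(val) + len(test) == len(samples)
```

Splitting by group rather than by sample gives a more honest estimate of generalization to unseen dyes.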

IR spectra
The dataset used for whole-spectra predictions was collected from infrared absorption spectra made public by NIST. 31 This dataset comprises 8,754 gas-phase spectra, with absorbance magnitudes indicated at 2 cm −1 intervals between 400 and 4000 cm −1 . The spectra for different molecules have different ranges of collected absorbance and may have regions of missing or excluded values. We randomly split this dataset into 80% training, 10% validation, and 10% test data.

HOMO-LUMO gaps
The PCQM4MV2 dataset is a collection of DFT-calculated molecular HOMO-LUMO gaps, originally collected as part of the PubChemQC project. 32

Benchmark results

Model Performance: General benchmarking
In the following, we present benchmarking results on predicting molecular targets on single-molecule datasets.

MoleculeNet & OGB
In the original publication of the algorithm behind Chemprop, 34 the MoleculeNet datasets were used as a benchmark to compare against other non-deep-learning algorithms, such as Morgan fingerprints used with random forest regression. 34 In this work, we do not fully repeat the original coverage of the MoleculeNet datasets. We revisit three of the datasets that continue to be of interest: QM9, HIV, and PCBA. First, we trained a multitask model on all 12 targets in the QM9 dataset, which produced an average MAE of 2.14 and RMSE of 3.96 across all targets. Though reporting averaged metrics is common, the differing orders of magnitude among the target properties bias the averaged result heavily toward targets of larger magnitudes. In Table S7, we report the test set metrics individually by task. We also trained benchmark single-task models on the U0 and HOMO-LUMO gap targets, reported in Table S7. In this benchmark, the performance observed on the single-task treatment of U0 is significantly better than the multitask version, with RMSEs of 2.45 and 3.21 Ha, respectively. The single-task model did not show a clear improvement on HOMO-LUMO gap performance.
The results of the benchmark Chemprop models trained on the HIV and PCBA datasets are presented in Table S8. Compared to the best model from OGB, which uses heterogeneous interpolation on graphs, 36 our model has a lower AP of 0.3028 on the PCBA scaffold split, but we are able to achieve better performance than the average models.

PCQM4Mv2
A benchmark model was trained on the PCQM4Mv2 dataset curated by the Open Graph Benchmark. 11 The test set used for this model is expected to be similar to, but not the same as, the test set reported on the leaderboard.

SAMPL
When training Chemprop to predict water-octanol partition coefficients (logP), we obtain an RMSE of 0.53 on a random test set with our conventional data splits of 80% training, 10% validation, and 10% test. This corresponds to an RMSE of 0.72 kcal mol −1 for the transfer free energy ∆G from water to octanol at 298 K, which is related to logP by ∆G = −RT ln(10) logP, where R is 8.314 J mol −1 K −1 and T is the temperature. We then retrained a production model on the full logP dataset without any validation or test splits using the same hyperparameters to predict the logP of the molecules in the SAMPL6, SAMPL7, and SAMPL9 blind prediction challenges. The performance of Chemprop is shown in Table S9, where Chemprop outperforms all other submissions from the SAMPL6, SAMPL7, and SAMPL9 challenges. We note that submissions range from quantum mechanics (QM) models and molecular mechanics (MM) models to empirical models relying on heuristic rules or machine learning, as well as mixtures thereof. Our model therefore outperforms not only other empirical models, but also a large variety of QM and MM models. In general, logP is often used in drug development, where it serves as an indicator of lipophilicity, which is known to impact the absorption, distribution, metabolism, excretion, and toxicity of drug candidates. 37 We thus demonstrate the ability of Chemprop to aid in important tasks such as drug discovery. Moreover, we note that the best performing submission in SAMPL7 was made by a biotechnology company independent of our group, using a Chemprop model trained on a different database, further highlighting the usefulness and impact of our software.
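The conversion between the logP error and the transfer free energy error quoted above can be verified directly, since errors scale by the factor RT ln(10):

```python
import math

# dG = -RT ln(10) * logP, so an error in logP maps to an error in dG
# scaled by RT ln(10).
R = 8.314          # J mol^-1 K^-1
T = 298.0          # K
kcal = 4184.0      # J per kcal

rmse_logp = 0.53
rmse_dg = rmse_logp * R * T * math.log(10) / kcal
# One logP unit corresponds to ~1.36 kcal/mol at 298 K, so an RMSE of
# 0.53 in logP is ~0.72 kcal/mol in dG, matching the value in the text.
assert abs(rmse_dg - 0.72) < 0.01
```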

Model Performance: Specific feature demonstrations
In the following, we present benchmarking results for speciality features of Chemprop, namely the training on reactions or multiple molecules, the prediction of atom/bond-level targets or spectra, and the use of uncertainty quantification methods.

Atom/bond-level targets
As shown in Fig. S2, the performance of a multitask constrained D-MPNN was evaluated on a dataset containing six atomic and bond QM descriptors, with testing errors agreeing well with previous findings. 9 BDE prediction was also examined using a single-task model for BDE and a multitask model for both BDE and partial charge. The single-task model achieved an MAE of 0.60 kcal mol −1 , which is comparable to the testing error of 0.58 kcal mol −1 reported for the GNN model in ALFABET. 18 However, the GNN model in ALFABET was exclusively engineered for the purpose of BDE prediction, whereas the multitask model in Chemprop is capable of training on several targets simultaneously. These findings suggest that Chemprop is promising for predicting various atom- and bond-level properties of molecules, with potential applications in drug discovery and materials science.

Reaction barrier heights
Table S10 summarizes the MAEs obtained for different reaction barrier height datasets.
For E2 and S N 2 reactions, we can directly compare our work against the models by Stuyver et al. 22 and Heinen et al. 23 We find that Chemprop significantly outperforms the Weisfeiler-Lehman (WL) architecture from Stuyver et al., 22,33 even when quantum-mechanical (QM) descriptors are added to the WL network (termed "ml-QM-GNN" in Table S10). A large benefit of Chemprop in reaction mode over all other architectures in Table S10 is furthermore its generality and versatility. It is straightforward to train a machine learning model on a single reaction type (like S N 2, E2, or cycloadditions), but finding a representation and architecture that can predict reaction properties of a large variety of reactions is much more difficult. Here, we showcase the ability of Chemprop to learn from diverse reaction datasets using the RDB7 25 and the RGD1-CNHO 26 datasets. We find larger MAEs compared to the simpler single-type datasets. Albeit not reaching chemical accuracy, our models still produce state-of-the-art performance given the diversity of reactions and range of barrier heights in both datasets. In Ref. 38, a refined Chemprop model with customized atom features, pretrained on DFT data of a lower level of theory, yields an MAE of 2.6 kcal mol −1 . Importantly, Chemprop does not make use of the three-dimensional structures of the reactants, products, and transition states but estimates barrier heights solely from the change in bonds, thus requiring minimal information to predict a new reaction. We furthermore note that simpler approaches, such as models trained only on reactant structures or descriptors, are not applicable to diverse reaction datasets.

UV/Vis absorption
Chemprop achieved an MAE of 15.5 nm, an RMSE of 29.7 nm, and an R 2 of 0.920 on our dataset of experimental absorption peak wavelengths across a diverse set of dye molecules in a variety of solvents. This was previously demonstrated to outperform state-of-the-art fingerprint-based methods. 39 Our train-validation-test splitting for this task was constrained such that all measurements of the same dye fall into the same split, avoiding leakage between highly correlated samples.

IR spectra
Chemprop spectra prediction was benchmarked using gas-phase IR absorbance data provided publicly by NIST. 31 Similarity between the predicted spectra and the target spectra is assessed using the spectral information divergence (SID). The benchmark average SID for predictions on the test set was 0.27. Qualitatively, a SID value of 0.27 is a good prediction, which generally tends to match the location and magnitude of all major peaks and the location of most minor peaks, while smoothing some of the details of peak shape. To give some context to this value, we also provide some simple baselines for comparison. The average SID of a uniform distribution against the dataset was 2.52. The average SID of a round-robin pairing of every spectrum in the dataset with every other spectrum in the dataset was 2.89. The average SID of a normalized sum of all the spectra against each individual member of the dataset was 1.45.
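A common form of the spectral information divergence is a symmetrized Kullback-Leibler divergence over normalized intensities; the sketch below uses that form and synthetic spectra, and Chemprop's exact implementation may differ in normalization details:

```python
import numpy as np

def sid(p, q, eps=1e-12):
    """Spectral information divergence between two spectra, computed as
    a symmetrized KL divergence over intensities normalized to sum to 1."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

a = [0.1, 0.5, 0.4]
b = [0.1, 0.5, 0.4]
c = [0.4, 0.5, 0.1]

assert sid(a, b) < 1e-9        # identical spectra diverge by ~0
assert sid(a, c) > sid(a, b)   # dissimilar spectra score higher
```

Lower values mean more similar spectra, consistent with the benchmark values quoted above.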

Uncertainty estimation for QM9 gap
Table S11 summarizes the performance of three uncertainty quantification (UQ) methods (ensemble, evidential, and mean-variance estimation (MVE)) selected from Chemprop's available UQ options. For the evidential uncertainty, we used the total uncertainty (the sum of aleatoric and epistemic components). We trained all models on the gap values from the QM9 dataset and calibrated the predictions using the z-scaling method 40 with the standard deviation as the regression calibrator metric. We then evaluated the methods based on four metrics: negative log likelihood (NLL), Spearman rank correlation (ρ), expected normalized calibration error (ENCE), and miscalibration area (MA). On this task, we observe that MVE performs the best across all four metrics, while ensemble performs the worst across all metrics, with evidential in between. However, we emphasize that UQ performance can vary depending on the task, dataset size, representation, and other factors. 43-46
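The idea behind z-scaling calibration can be sketched with synthetic data: a single scaling factor is fit so that the z-scores (error divided by scaled predicted standard deviation) have unit variance on a calibration set. This is an illustration of the concept, not Chemprop's implementation:

```python
import numpy as np

# Toy z-scaling recalibration on synthetic data.
rng = np.random.default_rng(0)
sigma_pred = rng.uniform(0.5, 1.5, size=10_000)
# True errors drawn with twice the predicted spread -> model is overconfident.
errors = rng.normal(0.0, 2.0 * sigma_pred)

# Fit s so that mean((errors / (s * sigma_pred))**2) == 1.
s = np.sqrt(np.mean((errors / sigma_pred) ** 2))
sigma_cal = s * sigma_pred

# After calibration the z-scores have unit variance, and s recovers
# the overconfidence factor of ~2.
z = errors / sigma_cal
assert abs(np.mean(z ** 2) - 1.0) < 1e-9
assert abs(s - 2.0) < 0.1
```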

Timing
Training and inference timing benchmarks for Chemprop can be found in Tables S12, S13, and S14. These benchmarks were measured on three systems: a compute cluster node with CPU only, a compute cluster node with a GPU resource, and a laptop. We used an Intel Xeon Platinum 8260 processor (2.4 GHz, 48 CPU cores) for cluster CPU benchmarks and an Intel Xeon Gold 6248 processor (2.5 GHz, 40 CPU cores) with an Nvidia Volta V100 GPU for the cluster GPU benchmark timing. Both devices are part of the MIT Supercloud. 47 For both systems, we restricted the maximum number of CPU cores accessible to Chemprop to 8. For laptop timing, we used a Thinkpad X1 Carbon with an Intel Core i7-1280P (1.8 GHz, 14 CPU cores) processor and no enabled GPU. Our benchmark datasets were randomly sampled subsets of the QM9 HOMO-LUMO gap targets with sizes of 100,000, 10,000, and 1,000.
The training times for Chemprop models found in Table S12 include all training processes, including time for data preprocessing and model evaluation. This training was carried out with an 80/10/10 training-validation-test split of the data. The hyperparameters were chosen to be in the typical range used for datasets of this size: hidden size of 1000, feed forward hidden size of 1000, 4 message passing layers, 2 feed forward layers, and 50 epochs.
The inference times are found in Tables S13 and S14. These times include all inference processes, including postprocessing of the predictions.
Training time shows significant speed improvement when moving from the laptop platform to the cluster CPU system and further improvement moving from the cluster CPU system to the cluster GPU system.This trend is followed across the tested dataset sizes.
In each case, the speedup is greater than a factor of 2. Training for single models on moderately sized datasets can be carried out reasonably even on a laptop.For large datasets, hyperparameter optimization, and model structures involving many submodels, training on cluster resources or using a GPU is recommended.Inference times in the 10,000 and 100,000 dataset sizes are also improved when moving from laptop to cluster CPU to cluster GPU, but the progressive improvement is smaller than for training.Inference using any of the system levels tested is relatively fast for these dataset sizes.

• .npz format: Atomic descriptors are saved as a 2D array ([number of atoms x number of descriptors]) for each molecule, in the exact same order as the SMILES strings in the data file. Similarly, bond descriptors are saved as a 2D array ([number of bonds x number of descriptors]). For example: np.savez('descriptors.npz', *descriptors), where descriptors is a list of atomic or bond descriptors as 2D arrays in the order of the molecules in the training/predicting data file.
• .pkl/.pckl/.pickle format: Contains a pandas dataframe with SMILES as index and a numpy array of descriptors as columns. For example:
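A round-trip sketch of the .npz format described above, with hypothetical descriptor arrays:

```python
import numpy as np

# Hypothetical atomic descriptors: one 2D array per molecule of shape
# [number of atoms x number of descriptors], in the order of the data file.
descriptors = [
    np.zeros((3, 4)),  # molecule 1: 3 atoms, 4 descriptors each
    np.ones((5, 4)),   # molecule 2: 5 atoms, 4 descriptors each
]
np.savez("descriptors.npz", *descriptors)

# np.savez stores positional arrays under the keys arr_0, arr_1, ...
with np.load("descriptors.npz") as data:
    loaded = [data[k] for k in sorted(data.files)]
assert loaded[0].shape == (3, 4) and loaded[1].shape == (5, 4)
```

Because the arrays are stored positionally, the list order must match the molecule order of the data file exactly.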
Both forms of regularization remain an important contributor to model quality. The first form of regularization is called early stopping. With early stopping, the performance of the model on the validation set is calculated at the end of each epoch. The version of the model that is stored at the end of training is the one saved at the end of the best scoring epoch. This has the effect of discarding later epochs of training where the model would be overfitting to the training data, continuing to improve the training loss at the cost of hurting performance on the validation and test sets. Contrary to what the name implies, early stopping as implemented in Chemprop does not shorten the amount of time needed for training.

For the example of a solute-solvent pair, N = 2. To train a new model using multiple molecules as an input, the SMILES string of each molecule must be provided as a separate column in the input CSV file. If N molecules are used, Chemprop assumes that the SMILES strings are located in the first N columns by default. Alternatively, the names of the specific columns containing the SMILES of the different molecules can be specified using the --smiles_columns <column_1> ... option. The embedding of multiple molecules in Chemprop can be done in two different ways, as schematically represented in Figure S1. When multiple molecules are used, by default Chemprop trains a separate D-MPNN for each molecule.
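A sketch of such an input CSV for a solute-solvent model follows. The column names, molecules, and target values are hypothetical placeholders:

```python
import csv

# Hypothetical input CSV for a solute-solvent model (N = 2): one SMILES
# column per molecule, plus the target column.
rows = [
    {"solute": "c1ccccc1O", "solvent": "O",   "lambda_max": 270.0},
    {"solute": "c1ccccc1O", "solvent": "CCO", "lambda_max": 274.0},
]

with open("pairs.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["solute", "solvent", "lambda_max"])
    writer.writeheader()
    writer.writerows(rows)

# With named columns, the flags would be along the lines of:
#   --number_of_molecules 2 --smiles_columns solute solvent
with open("pairs.csv") as f:
    header = f.readline().strip().split(",")
assert header[:2] == ["solute", "solvent"]
```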

Figure S1 :
Figure S1: Example of how two molecules (N = 2) can be embedded in Chemprop. a) A separate D-MPNN is used for each molecule or b) the same D-MPNN is used. After embedding, the different molecular vectors are concatenated (CAT) and used as input to the feed forward network (FFN) for property prediction.
For the PCBA dataset, which has 128 classification tasks, the test scores are averaged over all tasks. The Chemprop model achieves a ROC-AUC of 0.8028 on the scaffold split for the HIV prediction. While our model underperforms compared to the best models from the MoleculeNet and OGB leaderboards, it provides better predictions than the average models from both leaderboards on the HIV scaffold split. For the PCBA random split, the Chemprop model has a PRC-AUC of 0.2089, outperforming the best model from MoleculeNet, the DeepChem graph convolutional model, 35 with a PRC-AUC of 0.136.

This dataset, curated by the Open Graph Benchmark, 11 contains the molecular HOMO-LUMO gap calculated by DFT in units of eV. The benchmark Chemprop model achieved a test set MAE of 0.0956 eV and RMSE of 0.154 eV. The OGB hosts a leaderboard of performance for this dataset, based on a blinded test set. The test set used for this model was part of the open data.

Figure S2 :
Figure S2: Comparing QM computed descriptors with multitask constrained model predictions on a held-out testing set.

Table S1 :
Custom atomic features for each atom provided in 1D arrays.

Table S2 :
Multiple atomic features for each atom provided in multiple 1D arrays.

Table S3: Searchable hyperparameters using chemprop_hyperopt.
- depth: the number of message-passing steps in the D-MPNN encoder
- dropout: the dropout probability after each layer in both the D-MPNN encoder and FFN
- ffn_hidden_size: the size of each hidden layer in the FFN
- ffn_num_layers: the number of layers in the FFN
- hidden_size: the message size in the D-MPNN encoder
- linked_hidden_size: the size of both the messages in the D-MPNN encoder and the hidden layers in the FFN. This argument is overridden by either hidden_size or ffn_hidden_size
- max_lr: the maximum learning rate used in the learning rate scheduler
- init_lr: the initial learning rate, expressed as the ratio of init_lr to max_lr
- final_lr: the final learning rate, expressed as the ratio of final_lr to max_lr
- warmup_epochs: the number of epochs over which to ramp the learning rate up from init_lr to max_lr, expressed as a fraction of the total training epochs
- basic: search over depth, ffn_num_layers, dropout, and linked_hidden_size
- learning_rate: search over init_lr, max_lr, final_lr, and warmup_epochs
- all: all of the above hyperparameters
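The interplay of init_lr, max_lr, final_lr, and warmup_epochs can be illustrated with a minimal sketch of a warmup/decay schedule: a linear ramp from init_lr to max_lr over the warmup phase, followed by an exponential decay to final_lr. This is a simplified stand-in for illustration, not Chemprop's exact scheduler implementation:

```python
def lr_schedule(step, total_steps, warmup_steps, init_lr, max_lr, final_lr):
    """Sketch of a warmup/decay learning-rate schedule: linear ramp from
    init_lr to max_lr over warmup_steps, then exponential decay such that
    the final step reaches final_lr."""
    if step < warmup_steps:
        # Linear warmup.
        return init_lr + (max_lr - init_lr) * step / warmup_steps
    # Exponential decay from max_lr toward final_lr.
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * (final_lr / max_lr) ** frac

# Ramp up over the first 10 of 100 steps, then decay back down.
print(lr_schedule(0, 100, 10, 1e-4, 1e-3, 1e-4))    # 1e-4 (start)
print(lr_schedule(10, 100, 10, 1e-4, 1e-3, 1e-4))   # 1e-3 (peak)
print(lr_schedule(100, 100, 10, 1e-4, 1e-3, 1e-4))  # 1e-4 (end)
```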

Table S4: Example input file for atom- and bond-level property prediction. The value of hirshfeld_charges is presented as a 1D list, while the value of bond_index_matrix is presented as a 2D list.
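One possible way to assemble such a row is sketched below, with the list-valued entries serialized as strings for storage in a CSV cell. The molecule, charges, and bond-index values are hypothetical and for illustration only:

```python
import json

# Hypothetical row for a molecule with 3 heavy atoms: one Hirshfeld charge
# per atom (1D list) and one bond-index value per atom pair (2D list).
smiles = "CCO"
hirshfeld_charges = [-0.02, 0.05, -0.31]  # one entry per atom
bond_index_matrix = [                      # one row and column per atom
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]

row = {
    "smiles": smiles,
    "hirshfeld_charges": json.dumps(hirshfeld_charges),
    "bond_index_matrix": json.dumps(bond_index_matrix),
}
print(row["hirshfeld_charges"])  # [-0.02, 0.05, -0.31]
```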

• AP: average precision
• MAE: mean absolute error
• RMSE: root-mean-square error
• R 2 : coefficient of determination
• SID: spectral information divergence

2.2.1 MoleculeNet & OGB

The HIV and PCBA datasets from MoleculeNet 10 and Open Graph Benchmark (OGB) 11 were selected for classification tasks. Both MoleculeNet and OGB provide a diverse set of benchmark datasets that have been widely used to compare the performance of various machine learning models. They also host public leaderboards that allow us to directly compare our results to other public models. The HIV dataset contains results from an assay designed to detect HIV inhibition for 41,127 compounds. It has been observed that many of the species in the HIV dataset are at risk for assay result artifacts, 12 so in the narrowest sense dataset performance should be viewed as a test for the assay result rather than strictly as a predictor for HIV inhibition. The PCBA dataset includes the 128 biological activities selected from PubChem BioAssay 13 for 437,929 compounds. The datasets were evaluated using the random and scaffold splits that were provided by MoleculeNet and OGB. We adopted the training, validation, and test sets of the scaffold-split HIV data and the random-split PCBA data from MoleculeNet. The scaffold-split PCBA data were adopted from OGB, as MoleculeNet did not evaluate the PCBA model on the scaffold split. In all splits, the datasets were split into 80% training, 10% validation, and 10% test sets. For the random-split PCBA, MoleculeNet sets all missing targets to zero (in contrast to OGB), so we report performances for either case, i.e. with filled-in zeros for comparability to the MoleculeNet leaderboard, and without filled-in values, to showcase how the observed performance drops when adopting a scaffold split versus a random split.

QM9 is a dataset of DFT calculation values commonly used for chemical model benchmarking. The calculations for this dataset were originally carried out by Ramakrishnan et al. 14 and later distributed as part of the MoleculeNet benchmarks. 10 The dataset is made up of 133,885 molecules with properties and structures calculated at the B3LYP/6-31G(2df,p) level of theory. The molecules were chosen as the set of possible molecules containing up to nine heavy atoms of the types C, N, O, and F. Data sources for QM9 provide 3D coordinates for the atoms in the optimized structures, but we only use molecule SMILES as inputs for model training in this work. QM9 provides 12 target values for each molecule, provided in Table S6. In the MoleculeNet presentation of the properties, atomized versions of the thermochemical properties U0, U298, H298, and G298 are provided alongside the original versions of the properties. In this work, we will use the atomized thermochemical properties.

Experimental logP data of the SAMPL6, SAMPL7, and SAMPL9 challenges was downloaded from the SAMPL GitHub repository. 15 SAMPL runs a series of blind challenges for computational chemistry, providing the identity of test molecules for which predictions of physicochemical properties, among them the water-octanol partition coefficients, can be submitted using quantum-mechanics, molecular-mechanics, or empirical models. In this work, we build an empirical model based on Chemprop, which we train on a publicly available dataset of logP measurements from Ref. 16. Molecules present in the SAMPL challenges were removed from the logP training dataset. The remaining 23,469 data points were randomly split into 80% training, 10% validation, and 10% test data. The test data was used to obtain a measure

Table S5: Summary of the benchmarking datasets. a References for the data and data splits. b The size of the training set. The SAMPL6, SAMPL7, and SAMPL9 data are used as a test set. c Including reverse reactions.
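The random 80/10/10 split used throughout these benchmarks can be sketched as follows; this is a generic illustration, not Chemprop's internal splitting code:

```python
import random

def random_split(items, seed=0, frac_train=0.8, frac_val=0.1):
    """Shuffle a list reproducibly and split it into train/validation/test
    subsets (remaining fraction goes to the test set)."""
    rng = random.Random(seed)
    shuffled = items[:]          # copy so the input list is untouched
    rng.shuffle(shuffled)
    n_train = int(frac_train * len(shuffled))
    n_val = int(frac_val * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = random_split(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```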

Table S6: Target values present in the QM9 dataset, presented with the target labels used in the dataset.
quantum mechanical (QM) descriptors for 136,219 organic molecules with atom types H, C, O, N, F, S, Cl, Br, B, I, P, and Si. This dataset includes atomic charges, Fukui indices, NMR shielding constants, bond lengths, and bond orders. The molecules were optimized using GFN2-xTB and subjected to population analysis at the B3LYP/def2-SVP level of theory. We used this dataset to evaluate the performance of the different implementations, randomly splitting it into 80% training, 10% validation, and 10% test data.

and now curated as part of the Open Graph Benchmark. 11 This dataset contains HOMO-LUMO gaps measured in units of eV for 3,452,151 molecules. A further 294,470 molecules have targets privately held by the Open Graph Benchmark for blinded testing purposes and are not included in the benchmarks performed in this work. For benchmark training, the data we had available was randomly divided into 80% training, 10% validation, and 10% test data. The Open Graph Benchmark provides 3D coordinates for the training data used in the dataset, but we only use molecule SMILES as inputs for model training in this work.

Table S7: Test set metrics for the different targets of QM9. The top grouping of tasks was trained together in a single multitask model. The bottom grouping of results for U0 and gap shows the results for single-task models. The atomized basis of the thermochemical properties U0, U298, H298, and G298 was used for training.

Table S8: Test set results for HIV and PCBA classification tasks compared with MoleculeNet (MolNet) and OGB leaderboards (higher = better). For the PCBA random split, we also report the performance with missing targets set to None (in brackets). For the PCBA scaffold split, the test set only had a single class for the 'PCBA-493208' task, which was therefore omitted from the training, validation, and test sets.

Table S10: MAEs for predicting the barrier heights of organic reactions in kcal mol−1 for this work (top), other graph-convolutional approaches (middle, taken from Refs. 22 and 33), and simple machine learning approaches (bottom, taken from Refs. 22, 23, and 33).
also reports the performance of a Chemprop model, but trained only on the reactants, not the full reactions. Here, we can directly observe the advantage offered by using the full reaction to construct the input graph representations. Chemprop furthermore outperforms the multivariate regression of quantum-mechanical descriptors of Ref. 22. Compared to the kernel ridge regression (KRR) models of Ref. 23, Chemprop outperforms the models based on the BoB, SLATM, and FCHL19 representations by a large margin. The KRR model based on a simple one-hot encoding of the nucleophile, electrophile, and substituents close to the reactive center offers a slight performance benefit, at the disadvantage of not being able to generalize at all to new reactants or reactions. For [3 + 2] dipolar cycloadditions, we compare Chemprop in reaction mode to WL-type models with and without QM features, regressions on QM descriptors, as well as different

Table S11: Uncertainty evaluation metrics (dimensionless) for QM9 gap predictions. NLL: negative log likelihood; ρ: Spearman rank correlation; ENCE: expected normalized calibration error; MA: miscalibration area. The arrows indicate whether smaller or larger values indicate better performance.

measurements of the same molecule in different solvents would be assigned to the same split to avoid data leakage. Previous work has shown that data leakage can lead to overly optimistic estimates of generalization ability on similar datasets. 39 Any work on multi-molecule tasks should carefully consider the implications of the choice of splitting technique to avoid data leakage from highly correlated samples (in cases such as measurements of the same property in different solvents) or from duplicated samples with flipped molecule columns (in cases of symmetric multi-molecule properties).
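As a concrete example of one of these metrics, the average negative log likelihood of a set of observations under per-sample Gaussian predictive distributions N(μ, σ²) is (1/n) Σ [½ log(2πσᵢ²) + (yᵢ − μᵢ)²/(2σᵢ²)]. A minimal sketch, assuming Gaussian predictive distributions:

```python
import math

def mean_gaussian_nll(y_true, y_pred, y_var):
    """Average negative log likelihood of observations y_true under
    per-sample Gaussian predictions N(y_pred, y_var)."""
    nll = 0.0
    for y, mu, var in zip(y_true, y_pred, y_var):
        nll += 0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)
    return nll / len(y_true)

# Perfectly centered predictions with unit variance give 0.5*log(2*pi):
print(round(mean_gaussian_nll([1.0, 2.0], [1.0, 2.0], [1.0, 1.0]), 4))  # 0.9189
```

Larger prediction errors or poorly chosen variances increase the NLL, which is why it serves as a joint measure of accuracy and uncertainty calibration.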

Table S12: Train times in hours:minutes:seconds for subsets of the QM9 dataset.

Table S13: Average training times for an epoch in seconds for subsets of the QM9 dataset, excluding the first epoch.

Table S14: Inference times in hours:minutes:seconds for subsets of the QM9 dataset.