Automatic Prediction of Peak Optical Absorption Wavelengths in Molecules Using Convolutional Neural Networks

Molecular design depends heavily on optical properties for applications such as solar cells and polymer-based batteries. Accurate prediction of these properties is essential, and multiple predictive methods exist, from ab initio to data-driven techniques. Although theoretical methods such as time-dependent density functional theory (TD-DFT) calculations have well-established physical relevance and are among the most popular methods in computational physics and chemistry, they exhibit errors that are inherent in their approximate nature. These high-throughput electronic structure calculations also incur a substantial computational cost. With the emergence of big-data initiatives, cost-effective, data-driven methods have gained traction, although their usability is highly contingent on data quality and sparsity. In this study, we present a workflow that employs deep residual convolutional neural networks (DR-CNN) and gradient boosting feature selection (GBFS) to predict peak optical absorption wavelengths (λmax), which would normally be measured by UV–vis absorption spectroscopy, exclusively from SMILES representations of dye molecules and solvents. We use a multifidelity modeling approach, integrating 34,893 DFT calculations and 26,395 experimentally derived λmax data points, to deliver more accurate predictions via a Bayesian-optimized gradient boosting machine. Our approach is benchmarked against the state of the art reported in the scientific literature; the results demonstrate that representations learnt via a DR-CNN workflow integrated with other machine learning methods can accelerate the design of molecules with specific optical characteristics.


SI.1 Bayesian Optimization using Gaussian Processes
In Bayesian optimization, a probabilistic surrogate model is built to approximate an objective function f, which is based on a performance metric that is to be maximized (or minimized) under three constraints: (i) the analytical expression of f and its derivatives are unknown (i.e., no closed form), (ii) f is expensive to evaluate, and (iii) the evaluations of f may result in noisy responses. In this work, the surrogate model is a Gaussian process (GP): the generalization of a Gaussian distribution to a distribution over functions, characterized by mean and covariance (or positive-definite kernel) functions, and consisting of a prior distribution that represents the prior beliefs over all possible f. The optimization algorithm sequentially refines the surrogate model following a set of observations via Bayesian posterior updating, yielding posterior mean and variance functions that better approximate f over the space of objective functions [1-4].
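To make the "distribution over functions" view concrete, the following is a minimal sketch (not the authors' code) that draws candidate objective functions from a zero-mean GP prior; the squared-exponential kernel and its hyperparameters are illustrative assumptions, not settings from this work.

    # Minimal sketch: sampling objective functions from a GP prior.
    import numpy as np

    def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
        """Squared-exponential covariance k(x, x') = s^2 exp(-(x - x')^2 / (2 l^2))."""
        sq_dists = (X1[:, None] - X2[None, :]) ** 2
        return variance * np.exp(-0.5 * sq_dists / length_scale**2)

    # Query points theta in a one-dimensional hyperparameter space.
    theta = np.linspace(0.0, 10.0, 200)
    mean = np.zeros_like(theta)            # zero-mean prior belief over f
    cov = rbf_kernel(theta, theta)         # prior covariance between query points

    # Each draw is one plausible objective function f under the prior.
    rng = np.random.default_rng(0)
    prior_draws = rng.multivariate_normal(mean, cov + 1e-8 * np.eye(len(theta)), size=3)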
The use of GPs to construct the surrogate model stems from the fact that Gaussian distributions are self-conjugate with respect to a Gaussian likelihood function; that is, a Gaussian prior yields a Gaussian posterior when the likelihood function is also Gaussian:

P(f | D_{1:T}) ∝ P(D_{1:T} | f) P(f),

where the observations are of the form D_{1:T} = {(θ_1, ϕ_1), ..., (θ_T, ϕ_T)} for a total of T observations, with query points θ_t within the hyperparameter space and outputs ϕ_t of f. Therefore, with a conjugate prior for a Gaussian likelihood function, the posterior mean and covariance functions can be computed in closed form given the set of observations [3-6].
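The closed-form posterior update that this conjugacy affords can be sketched as follows, using the standard GP regression equations and the rbf_kernel defined in the previous sketch; the noise variance is an assumed parameter.

    # Minimal sketch of the GP posterior update: given observations
    # D = {(theta_t, phi_t)} and Gaussian noise with variance noise_var,
    # return the posterior mean and covariance at new query points.
    import numpy as np

    def gp_posterior(theta_train, phi_train, theta_query, kernel, noise_var=1e-6):
        K = kernel(theta_train, theta_train) + noise_var * np.eye(len(theta_train))
        K_s = kernel(theta_train, theta_query)      # train/query cross-covariance
        K_ss = kernel(theta_query, theta_query)     # query/query covariance

        L = np.linalg.cholesky(K)                   # numerically stable inversion
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, phi_train))

        mu = K_s.T @ alpha                          # posterior mean at query points
        v = np.linalg.solve(L, K_s)
        cov = K_ss - v.T @ v                        # posterior covariance
        return mu, cov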
Bayesian optimization employs an acquisition function, which uses the GP posterior, to evaluate the utility of candidate points within the hyperparameter space that may improve on the current best evaluation of f(θ). There exists a trade-off between exploration and exploitation: an acquisition function must 'explore' regions of high uncertainty in the posterior while 'exploiting' known optimal regions where the posterior mean, and therefore the objective function, is expected to be high. The next query point θ_{t+1} is realized by maximizing the acquisition function with respect to the exploration of high-variance regions and the exploitation of high-mean regions, both of which yield high acquisition values. The surrogate model is then sequentially updated after evaluating each query point, thereby producing a more informative posterior distribution [2,7-10]. Three acquisition functions, denoted by α, were employed in this work, each corresponding to one of the following acquisition schemes: (i) Probability of Improvement (PI) [7], (ii) Expected Improvement (EI) [8,9], and (iii) Upper Confidence Bound (UCB) [10].
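As a concrete illustration, the three acquisition schemes can be written as follows for a maximization problem; the jitter xi and the confidence multiplier kappa are illustrative trade-off parameters, not values taken from this work.

    # Minimal sketch of the PI, EI, and UCB acquisition functions for
    # maximization. mu and sigma are the GP posterior mean and standard
    # deviation at candidate points; y_best is the incumbent best observation.
    import numpy as np
    from scipy.stats import norm

    def probability_of_improvement(mu, sigma, y_best, xi=0.01):
        z = (mu - y_best - xi) / sigma
        return norm.cdf(z)                      # P(f(theta) > y_best + xi)

    def expected_improvement(mu, sigma, y_best, xi=0.01):
        z = (mu - y_best - xi) / sigma
        return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

    def upper_confidence_bound(mu, sigma, kappa=2.0):
        # High posterior mean (exploitation) and high posterior standard
        # deviation (exploration) both raise the acquisition value.
        return mu + kappa * sigma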
See Supporting Information 2 (SI.2) for the pseudo-code of the Bayesian optimization procedure implemented in this work.
Note that the acquisition strategies described assume that the objective function is based on a performance metric which is to be maximized. However, if the objective function is to be minimized, the next query point is determined by maximizing the negative of the acquisition functions or by taking the lower confidence bound.
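A brief sketch of the minimization case; kappa is again an illustrative assumption.

    # Minimization variants (sketch). For UCB, take the lower confidence bound;
    # for PI/EI, negate the objective and reuse the maximization forms above
    # with y_best taken as the maximum of the negated observations.
    def lower_confidence_bound(mu, sigma, kappa=2.0):
        # Low posterior mean and high uncertainty both attract the search;
        # minimize this quantity, or equivalently maximize its negative.
        return mu - kappa * sigma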

SI.3
Figure S3: Regression analysis of the ML-based predictions of the vertical excitation energies, and their corresponding peak wavelengths (λmax), against the DFT-based calculations. The predictions were made by Bayesian-optimized gradient boosting models that were trained on the sTDA-DFT data set via (a) the DR-CNN and (b) the GBFS sub-workflows, and on the TD-DFT data set via (c) the DR-CNN and (d) the GBFS sub-workflows. The solid blue line is a linear fit between the DFT- and ML-based predictions, generated using ordinary least squares refinement. The dashed red line represents the hypothetical case in which the ML-based predictions would equal the DFT-based calculations.
Figure S5: Distributions of the absolute errors of the ML-based predictions of λmax against the experimental measurements, partitioned by solvent type. The ratio signifies the occurrence frequency of each solvent in the training and test sets under scaffold splitting, respectively.
Algorithm 1: Bayesian optimization with Gaussian process prior

input: objective function f, hyperparameter space Θ, acquisition functions α, initial budget T_init, total budget T
for t = 1 to T_init do
    sample initial point θ_t from the hyperparameter space;
    compute exact objective function y_t ← f(θ_t);
    if y_t > y_best then
        θ_best ← θ_t;
        y_best ← y_t;
    end
end
for t = T_init + 1 to T do
    build probabilistic model for f conditioned on previous observations D_{1:t−1};
    compute the posterior distribution over candidate objective functions using Gaussian process regression;
    optimise acquisition functions α independently based on the posterior distribution and propose a candidate point for each acquisition scheme: θ_{t,s} ← argmax_θ α_s(θ | D_{1:t−1}) for s ∈ {PI, EI, UCB};
    choose next evaluation point θ_t ← argmax softmax(μ(θ_{t,s}));
    compute exact objective function y_t ← f(θ_t);
    if y_t > y_best then
        θ_best ← θ_t;
        y_best ← y_t;
    end
end
output: θ_best, y_best
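For reference, a minimal runnable sketch of this loop is given below, assuming the rbf_kernel, gp_posterior, and acquisition-function sketches from SI.1 are in scope. The grid-based candidate search and the n_init and n_iter settings are illustrative choices, not those used in this work.

    # Minimal sketch of Algorithm 1 over a fixed 1-D candidate grid.
    import numpy as np
    from scipy.special import softmax

    def bayesian_optimize(f, grid, n_init=5, n_iter=20, seed=0):
        rng = np.random.default_rng(seed)
        # Initial design: evaluate the exact objective at random grid points.
        theta_obs = rng.choice(grid, size=n_init, replace=False)
        phi_obs = np.array([f(t) for t in theta_obs])

        for _ in range(n_iter):
            # Posterior conditioned on all previous observations D_{1:t-1}.
            mu, cov = gp_posterior(theta_obs, phi_obs, grid, rbf_kernel)
            sigma = np.sqrt(np.clip(np.diag(cov), 1e-12, None))
            y_best = phi_obs.max()

            # One candidate point per acquisition scheme (PI, EI, UCB).
            candidates = np.array([
                grid[np.argmax(probability_of_improvement(mu, sigma, y_best))],
                grid[np.argmax(expected_improvement(mu, sigma, y_best))],
                grid[np.argmax(upper_confidence_bound(mu, sigma))],
            ])
            # Choose the candidate with the highest posterior mean; softmax
            # is monotonic, so argmax(softmax(mu)) == argmax(mu).
            mu_cand, _ = gp_posterior(theta_obs, phi_obs, candidates, rbf_kernel)
            theta_next = candidates[np.argmax(softmax(mu_cand))]

            theta_obs = np.append(theta_obs, theta_next)
            phi_obs = np.append(phi_obs, f(theta_next))

        i_best = np.argmax(phi_obs)
        return theta_obs[i_best], phi_obs[i_best]

    # Example usage on a toy maximization objective:
    # theta_star, y_star = bayesian_optimize(lambda t: -(t - 3.0)**2,
    #                                        np.linspace(0.0, 10.0, 200))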