Database Independent Automated Structure Elucidation of Organic Molecules Based on IR, 1H NMR, 13C NMR, and MS Data

Herein, we report a computational algorithm that follows a spectroscopist-driven elucidation process of the structure of an organic molecule based on IR, 1H and 13C NMR, and MS tabular data. The algorithm is independent from database searching and is based on a bottom-up approach, building the molecular structure from small structural fragments visible in spectra. It employs an analytical combinatorial approach with a graph search technique to determine the connectivity of structural fragments that is based on the analysis of the NMR spectra, to connect the identified structural fragments into a molecular structure. After the process is completed, the interface lists the compound candidates, which are visualized by the WolframAlpha computational knowledge engine within the interface. The candidates are ranked according to the predefined rules for analyzing the spectral data. The developed elucidator has a user-friendly web interface and is publicly available (http://schmarnica.si).


INTRODUCTION
The idea of using computers to solve chemical structures from experimental spectroscopic data dates from the 1960s. 1,2 Over the past decades, by combining the techniques of chemistry and computer science, a number of structure elucidation systems have been developed, 2−17 accompanied by reports of constant improvements of known and developments of novel impressive algorithms. 18−33 The goal of such expert systems is to determine unknown molecular structures from experimental data with minimal human intervention. Although CASE (Computer Aided Structure Elucidation) programs are beginning to provide very good results in structure elucidation, they are still not entirely automated and usually a number of 2D in addition to the 1D NMR spectra must be provided. 29 A recently reported system for fully automatic processing and assignment of 1 H and 13 C NMR spectra, that can further elucidate and determine the relative stereochemistry of complex molecules 34 indicates that the development of structure-elucidation systems is still a lively and evolving field.
Most of the reported CASE systems take a database-oriented approach to structural elucidation and therefore heavily rely on databases containing chemical structures and spectra. 2,18,31−33 The absence of structural motifs in databases and mismatching because of experimental differences of the recorded spectra are potential drawbacks of these approaches. To move away from the typical database-oriented computer-assisted elucidation systems, our aim is to develop an algorithm that will be independent of database-searching and would not rely on predefined molecular formulas. We envisage developing an algorithm that will, based on the provided IR and NMR data, first identify a set of small structural fragments and then bind them together based on NMR data, thus building a structure in a bottom-up fashion. The proposed process of structure elucidation mimics a spectroscopist-driven approach of resolving the chemical structure from spectroscopic data.
In the empirical elucidation of the structure of an unknown organic molecule, a spectroscopist must combine at least two spectroscopic methods (NMR and MS, NMR and IR, or even all three: IR, NMR, and MS) to arrive at the correct result, because IR, NMR, and MS each give only partial information on the structure of the molecule. While IR reveals some functional groups that are difficult to identify by other methods, it is impossible to determine the connection of structural fragments without NMR, whereas MS is indispensable in verifying the correctness of the proposed molecular structure and for elucidating the missing structural fragments that cannot be determined by IR or NMR. Therefore, we designed an algorithm that relies on a combination of all three methods to determine the correct structure. It produces the most fitting compound given the input data from 1H and 13C NMR, IR, and MS spectra and elucidates the chemical structure of a molecule similarly to a trained spectroscopist, mimicking the human-driven process of structure elucidation: (i) starting by identifying functional groups by IR and proton- and carbon-containing structural fragments by 1H and 13C NMR spectra, (ii) connecting the identified structural fragments together by relying on the 1H NMR spectrum, and (iii) completing (rechecking) the elucidated molecular structure by MS and the 13C NMR spectrum. The flowchart of the developed algorithm is presented in Figure 1.

METHODS
2.1. Description of the Algorithm. The input of the algorithm consists of tabulated IR, 1 H NMR, and MS data; in addition, the 13 C NMR data can also be added, although the algorithm can perform the elucidation process without 13 C NMR (vide infra). The 1 H NMR spectrum contains proton resonances which are described by three values: the chemical shift, the integral, and the splitting pattern. The chemical shift (denoted as shift in Table 1) is the resonant frequency of a proton relative to a TMS standard and is expressed in parts per million (ppm). Roughly, it provides information about the chemical surrounding to which the proton is bound. The integral (count) gives the relative number of protons present at each resonance, while the splitting pattern (splitting) provides detailed insights into the connectivity pattern of neighboring protons in a molecule. For the first-order 1 H NMR spectra, the splitting pattern to n chemically equivalent neighboring protons splits the proton resonance into a n + 1 multiplet with intensity ratios following the Pascal's triangle. In the IR spectrum, some functional groups give rise to characteristic absorption bands described by their intensity, position (frequency, in cm −1 ), and appearance. Although the bands can be intense or weak and broad or narrow, the algorithm specifically considers only broad absorption bands (broad in Table 1). The algorithm also makes use of the molecular mass (mass) of the compound. 13 C NMR data can be included if available. 13 C NMR allows the identification of nonequivalent carbon atoms in an organic molecule and shows a single peak for each chemically nonequivalent carbon atom. The chemical shift (denoted as shift in Table 1), analogous to 1 H NMR, is the resonant frequency of a carbon relative to a TMS standard or residual solvent and is expressed in parts per million (ppm).
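The n + 1 rule described above can be illustrated with a minimal sketch. The function name multiplet_intensities is hypothetical and is not taken from the authors' code; the sketch only assumes first-order coupling to n chemically equivalent neighboring protons.

```python
from math import comb


def multiplet_intensities(n_neighbors: int) -> list[int]:
    """Relative line intensities of a first-order multiplet.

    Coupling to n equivalent neighboring protons gives n + 1 lines whose
    intensities are the binomial coefficients of row n of Pascal's triangle.
    """
    return [comb(n_neighbors, k) for k in range(n_neighbors + 1)]


# Print the patterns for zero to four equivalent neighbors.
for n, name in enumerate(["singlet", "doublet", "triplet", "quartet", "quintet"]):
    print(name, multiplet_intensities(n))
```

A resonance reported as a triplet therefore implies two equivalent neighbors, which is the kind of connectivity information the algorithm uses later when joining fragments.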
The algorithm operates with predefined chemical shift (1H NMR) and frequency (IR) ranges that are associated with specific structural fragments; a structural fragment may correspond to only a few atoms, a functional group, or an even larger structural part of the molecule. Based on these, the algorithm searches for a set of candidate compounds by (i) identifying potential matches of fragments in the input spectra, (ii) connecting the identified fragments in both spectra into joint entities, and (iii) identifying the number of fragments in the analyzed molecule and filling in the elements that are not visible in the IR and 1H NMR spectra. Finally, the algorithm joins the fragments into candidate compounds and ranks them by evaluating a number of rules, yielding a list of candidate compounds ranked by a relevance score, which gives a higher rank to more likely matches. Tables with IR, 1H NMR, and 13C NMR frequency and chemical shift ranges for functional groups and proton- and carbon-based structural fragments, 35 along with IR, 1H NMR, 13C NMR, and MS data of 70 compounds from the literature, were employed for algorithm development and testing of its performance. 36 We describe the individual steps of the algorithm in more detail in the following paragraphs. We illustrate them on the very simple example of methanol, for which the input IR spectrum contains two peaks at IR1 = 3347 (broad) and IR2 = 2945 (narrow) [cm−1], the 1H NMR spectrum contains two peaks at NMR1 = {3.66 ppm, H-count = 1, coupling = singlet} and NMR2 = {3.43 ppm, H-count = 3, coupling = singlet}, the 13C NMR spectrum contains one peak at 50.1 ppm, and the molecular mass is 32.03.
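For the methanol input, the range lookup in the IR spectrum and the matching of fragments to 1H NMR integrals can be sketched as follows. The frequency ranges in IR_RANGES are illustrative assumptions rather than the tabulated ranges used by the authors, and the sketch ignores band shape, chemical shift, and splitting; the divisibility test in match_nmr_peak is likewise an assumed reading of "compatibility of the numbers of hydrogen atoms".

```python
# Hypothetical IR ranges in cm^-1 (illustrative values, not the authors' tables).
IR_RANGES = {
    "OH": (3200, 3600),
    "NH": (3300, 3500),
    "CH3": (2850, 3000),
    "CH2": (2850, 3000),
    "CH": (2850, 3000),
}
# Number of hydrogen atoms carried by each fragment.
H_PER_FRAGMENT = {"OH": 1, "NH": 1, "CH": 1, "CH2": 2, "CH3": 3}


def ir_fragments(peaks):
    """Collect every fragment whose range contains at least one IR peak."""
    found = set()
    for freq in peaks:
        for fragment, (low, high) in IR_RANGES.items():
            if low <= freq <= high:
                found.add(fragment)
    return found


def match_nmr_peak(h_count, fragments):
    """Map each compatible fragment to the number of copies needed.

    A fragment is treated as compatible if its hydrogen count divides
    the integral of the 1H NMR resonance.
    """
    return {
        f: h_count // H_PER_FRAGMENT[f]
        for f in sorted(fragments)
        if h_count % H_PER_FRAGMENT[f] == 0
    }


fragments = ir_fragments([3347, 2945])   # methanol IR peaks
nmr1 = match_nmr_peak(1, fragments)      # resonance at 3.66 ppm, 1 H
nmr2 = match_nmr_peak(3, fragments)      # resonance at 3.43 ppm, 3 H
print(sorted(fragments), nmr1, nmr2)
```

Because ranges overlap, a single peak is deliberately allowed to map to several fragments; the later steps are what narrow the options down.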
The functioning of the developed algorithm is presented in Figure 2. The algorithm performs elucidation in five steps, which are color-coded in Figure 2, that is, step 1 (green), step 2 (red), step 3 (blue), step 4 (yellow), and step 5 (violet).
2.1.1. Step 1: Identification of IR and NMR Structural Fragments. The algorithm first deduces the structural fragments (functional groups) from their positions (frequency) in the IR spectrum. Based on the predefined frequency ranges, the algorithm assigns the input peaks to individual fragments. Because the frequency ranges for some fragments can overlap, a peak may be assigned to several structural fragments. In our example, the list of possible IR fragments related to the two peaks is CH3, CH2, CH, OH, and NH. In the 1H NMR spectrum, the algorithm determines the proton-containing structural fragments and their neighboring protons/groups.
2.1.2. Step 2: Grouping IR and NMR Structural Fragments. The IR spectrum only provides information on the existence of structural fragments in the compound, but not their exact counts. On the other hand, the 1H NMR spectrum provides information on the hydrogen count of proton-containing structural fragments. In step 2, the algorithm relates the IR and 1H NMR structural fragments from step 1 into group(s) of structural fragments based on compatibility of the numbers of their hydrogen atoms. The fragments may be grouped in different ways, for example, one fragment deduced from IR can be attached to multiple NMR peaks and vice versa. Therefore, in step 2, the algorithm generates a full set of possible IR and 1H NMR fragment combinations, which are denoted as IR−NMR fragments. In our example, NMR1 (1 H atom) can be matched by CH, OH, and NH, while NMR2 (3 H atoms) can be matched by CH3, as well as by CH, OH, and NH (if they appear three times). The algorithm thus maps all of these combinations into a set of matched IR−NMR fragments.
2.1.3. Step 3: Identifying Compound Candidates. In step 3, the IR−NMR fragments are augmented in different ways to match the input molecular mass, thus producing groups of structural fragments, each group representing constituent parts of a candidate compound.
2.1.3.1. Expansion. Proton resonances in the 1 H NMR spectrum may reflect multiple structural fragments in the molecule. For example, an NMR proton resonance with the hydrogen count of 6 may represent two CH 3 fragments, 3 CH 2 fragments, or 6 CH fragments. The algorithm thus examines all IR−NMR fragments and expands them with all possible combinations of fragments that match the hydrogen count in the input data. In our example, the algorithm expands the combinations of NMR 2 with CH, OH, and NH by creating three copies of each fragment in the combination to match the hydrogen count of three.
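A minimal sketch of the expansion for a resonance with a hydrogen count of 6 is given below. The function expand is hypothetical. It enumerates every multiset of fragments whose hydrogens sum to the integral, including mixed ones; whether the authors' implementation admits mixed combinations is not stated in the text, so the homogeneous subset quoted above is extracted separately.

```python
def expand(h_count, h_per_fragment):
    """Enumerate all multisets of fragments whose hydrogens sum to h_count."""
    # Process fragments from the most to the fewest hydrogens.
    names = sorted(h_per_fragment, key=h_per_fragment.get, reverse=True)
    results = []

    def recurse(index, remaining, chosen):
        if index == len(names):
            if remaining == 0:
                results.append(dict(chosen))
            return
        name = names[index]
        per_copy = h_per_fragment[name]
        # Try every feasible number of copies of the current fragment.
        for copies in range(remaining // per_copy, -1, -1):
            if copies:
                chosen[name] = copies
            recurse(index + 1, remaining - copies * per_copy, chosen)
            chosen.pop(name, None)

    recurse(0, h_count, {})
    return results


combos = expand(6, {"CH3": 3, "CH2": 2, "CH": 1})
homogeneous = [c for c in combos if len(c) == 1]
print(len(combos), homogeneous)
```

The number of combinations grows quickly with the integral, which is one reason the later validity checks matter for run time.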
2.1.3.2. Multiplication. Because of isomorphism and the relative nature of 1 H NMR spectra, the count of individual IR− NMR fragments in the candidate compound may be incorrect (too low). Given the molecular mass, the algorithm checks if the counts should be augmented to match the mass and, in these cases, multiplies the number of fragments in all candidate compounds. In our simple example, no multiplication is needed.
2.1.3.3. Insertion. Some functional groups (e.g., ether) and commonly encountered halogen atoms (e.g., Cl, Br, and I) do not have specific signals in the IR and 1H NMR spectra. The algorithm therefore inserts these elements into the candidate compounds to match the molecular mass.
2.1.4. Step 4: Checking for Validity. In the fourth step, each candidate compound is checked for validity. First, the candidate compound's mass is compared to the input mass, and the candidate is removed if the difference is too large. Then, the Erdős−Gallai algorithm 37 is used to test if a connected graph can be constructed from the compound's fragments, given each fragment's number of connections. The compound is discarded if more than five unconnected edges are left after construction. Constructed compounds with fewer than five unconnected edges are still considered but penalized (see Table 2). In the investigated example of methanol, only a single candidate compound, CH3−OH, is valid according to all three criteria.
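The Erdős−Gallai test on the fragments' numbers of connections can be sketched as below. This is a textbook form of the theorem for simple graphs, not the authors' code; it checks only whether some simple graph with the given degrees exists, and it does not by itself handle connectedness, multiple bonds, or the tolerance for unconnected edges described above.

```python
def is_graphical(degrees):
    """Erdős-Gallai test: can the degrees be realized by a simple graph?"""
    d = sorted(degrees, reverse=True)
    # Negative degrees or an odd degree sum can never be realized.
    if any(x < 0 for x in d) or sum(d) % 2:
        return False
    n = len(d)
    for k in range(1, n + 1):
        lhs = sum(d[:k])
        rhs = k * (k - 1) + sum(min(x, k) for x in d[k:])
        if lhs > rhs:
            return False
    return True


# Methanol: CH3 and OH each have one open connection.
print(is_graphical([1, 1]))
# OH, three CH2, a para-substituted ring, and Cl (compound 1 discussed later).
print(is_graphical([1, 2, 2, 2, 2, 1]))
# One fragment needing three partners but only one partner available.
print(is_graphical([3, 1]))
```

The check is cheap compared with constructing graphs explicitly, so it is a natural filter to apply before the graph search in step 5.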
In addition, the algorithm optionally evaluates the combinations against the 13C NMR data, if given. Each carbon fragment in a candidate compound is checked by comparing the reported chemical shift in the 13C NMR spectrum with the predefined 13C NMR regions for the investigated structural fragment. If a reported shift belongs to multiple potential fragments, all options are considered correct. A candidate composition is therefore valid if all identified fragments are also compliant with their predefined 13C NMR ranges. In the investigated example of methanol, the CH3 fragment value of 50.1 ppm falls into the predefined region of aliphatic carbon atoms with an electronegative substituent, in this case, an aliphatic alcohol. For the one remaining compound candidate, the 13C NMR input data confirm the presence of the CH3 fragment and thus confirm the candidate.
2.1.5. Step 5: Creating and Ranking the Compounds. Step 4 results in several candidate compounds, each consisting of a number of unconnected fragments. In step 5, the algorithm treats the fragments of a candidate compound as vertices of a graph, for which the edges (bonds between elements) must be determined. The algorithm connects the edges in compliance with each fragment's neighbor count and the neighbors' sum of hydrogen atoms. In the process, some of the candidates are removed; these include compounds where not all the fragments can be connected because of limitations in connections between the elements, while others lack elements that were not visible in any of the input spectra and were not added in step 3. Each candidate compound can produce any number of valid graphs (multiple possible edge combinations); therefore, the algorithm evaluates each one using a set of rules, yielding a relevance score for each candidate compound. The rules are shown in Table 2. Initially, the algorithm assigns a relevance score of 100% to each candidate compound. Based on the rules, the algorithm lowers the score of the candidates accordingly.
The output of the algorithm is a list of compound candidates with the corresponding relevance scores. If a compound can be connected, does not contain unconnected edges, and the fragment set produces a connected graph with a mass, similar to the measured mass of the compound, it will receive a high relevance score. If it fails on one or more rules, the relevance score is lowered accordingly.

Computational Methods. The presented elucidation approach was developed in Python 3, 38 with the web service for online elucidation built on the Django 39 framework with JavaScript support on the front-end. The web service for online elucidation also includes the Pysmiles library, 40 which translates the elucidated compounds into SMILES representations. The Pysmiles library also employs the NetworkX library, 41 which enables the manipulation and creation of graphs. Using the online service WolframAlpha, 42 the web service visualizes the elucidated compounds that the algorithm provides as the output.
The algorithm employs a combinatorial approach to combine the identified fragments into compounds. Several time-optimizing approaches were also used to minimize the number of possible combinations. Using the Erdős−Gallai theorem, 37 we removed the impossible compound combinations. This theorem provides a necessary and sufficient condition for a finite sequence of non-negative integers (here, the numbers of bonds of the fragments) to be the degree sequence of a simple graph, and thus establishes whether a combination is potentially possible. Using backtracking, 43 we recursively built graphs from the compound combinations. Finally, we developed a simple decision model, 44 which ranks the built compounds given a set of rules.
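The recursive graph construction by backtracking can be illustrated with the following minimal sketch. The function build_graphs is hypothetical and uses only each fragment's number of open connections; it assumes single bonds between distinct fragments and omits the 1H NMR neighbor constraints and the scoring rules that the actual algorithm applies.

```python
def build_graphs(valences):
    """Enumerate connected simple graphs in which vertex i has degree valences[i]."""
    n = len(valences)
    solutions = set()
    if n == 0:
        return solutions

    def connected(edges):
        # Depth-first search from vertex 0 over the current edge set.
        adjacency = {i: set() for i in range(n)}
        for a, b in edges:
            adjacency[a].add(b)
            adjacency[b].add(a)
        seen, stack = {0}, [0]
        while stack:
            for neighbor in adjacency[stack.pop()]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        return len(seen) == n

    def backtrack(remaining, edges):
        # Always saturate the lowest-numbered vertex that still has open bonds.
        i = next((k for k, r in enumerate(remaining) if r > 0), None)
        if i is None:
            if connected(edges):
                solutions.add(frozenset(edges))
            return
        for j in range(i + 1, n):
            if remaining[j] > 0 and (i, j) not in edges:
                remaining[i] -= 1
                remaining[j] -= 1
                edges.add((i, j))
                backtrack(remaining, edges)
                edges.remove((i, j))  # undo the choice and try the next partner
                remaining[i] += 1
                remaining[j] += 1

    backtrack(list(valences), set())
    return solutions


print(build_graphs([1, 1]))                  # CH3 and OH: a single bond
print(len(build_graphs([1, 2, 2, 2, 2, 1])))  # OH, 3 x CH2, C6H4, Cl
```

For the six fragments in the second call, valence alone leaves many chain orderings open, which shows why the splitting and integral information from 1H NMR is needed to prune the search.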

RESULTS AND DISCUSSION
In the following, we describe a simplified elucidation process with the proposed algorithm on 3-(4-chlorophenyl)propan-1-ol (1, Figure 3). The input tabular data for IR, 1 H NMR, 13  ClO] + ). 46 The peak values of 3033 and 2978 cm−1 for C(sp2)-H and C(sp3)-H bond stretching were added to the IR spectrum because the algorithm fails to provide an elucidated structure if structural fragments present in the compound are not assigned (or are missing) in the IR spectrum (vide infra). Additionally, the multiplet proton resonance at δ 1.78−1.92 ppm was assigned as 1.85 (quint-like, 2H) (vide infra).
The algorithm started the elucidation process (step 1) by analyzing the IR spectrum. A broad peak at 3325 cm−1 was assigned to an OH group and peaks at 3033 and 2978 cm−1 to the structural fragments with C(sp2)-H and C(sp3)-H bonds, respectively. The 1495 and 1454 cm−1 peaks implied the presence of an aromatic ring (C=C stretching). Analysis of the 1H NMR spectrum revealed the presence of one proton bound to a heteroatom because of the broad singlet resonance at 1.76 ppm with the integral value of 1, which may correspond to an OH or NH moiety. A quintet-like resonance with an integral of 2 indicated a CH2 fragment having two pairs of chemically equivalent neighboring protons (see below). The two resonances with shifts in the aliphatic region of the NMR spectrum, that is, at 2.67 and 3.61 ppm, each with an integral of 2 and a triplet splitting pattern (2 neighbors), suggested the −CH2−CH2−CH2− hydrocarbon chain. The two doublets with integrals of 2 in the aromatic region of the NMR spectrum indicated a para-substituted aromatic ring. Analysis of the 13C NMR spectrum revealed the presence of four nonequivalent aromatic carbon atoms resonating at 140.0, 131.2, 129.5, and 128.1 ppm, two nonequivalent aliphatic carbon atoms connected in a hydrocarbon chain resonating at 33.7 and 31.1 ppm, and one heteroatom-substituted aliphatic carbon resonating at 61.3 ppm. In step 2, these elements were combined into IR−NMR fragments: an OH group, a −CH2−CH2−CH2− hydrocarbon chain, and a para-substituted aromatic ring. In step 3, the sum of the calculated masses of the fragments on the IR−NMR fragments list differed from the experimental input mass by 35, which corresponds to the mass number of chlorine-35; a chlorine atom was therefore added to the candidate compound. The process was completed (steps 4 and 5) by constructing the compound graph and attaching the chlorine atom to the aromatic ring para to the propyl alcohol substituent.
In this way, the algorithm arrived at the correct structure. The phenol structure candidate, also possible with respect to the IR−NMR fragments list, received a lower relevance score because of the 1H NMR chemical shift δ 1.76 ppm of the OH group. In step 4, each carbon fragment in the candidate compounds was checked with respect to the predefined ranges and reported values of 13C NMR data (vide supra). All carbon atoms in both structure candidates in Figure 3 were possible with respect to the reported 13C NMR input values. Selected examples of structures resolved from their corresponding literature IR, 1H NMR, 13C NMR, and MS tabulated data by the proposed algorithm are presented in Figure 4. The algorithm resolves the structures of various primary, secondary, tertiary, and aromatic amines 2−6, 16, 31, 34, 37, alcohols 7, 9, 17, 19, 20, 30, 31, 34, Figure 4 is collected in the Supporting Information. 36 The algorithm recognizes various functional groups and differentiates between structural isomers of organic molecules.
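The mass-difference reasoning used in step 3 for compound 1 can be sketched as follows. The function propose_insertions and the table INSERTABLE are hypothetical, nominal (integer) isotope masses are used, and the measured mass of 170 is an assumption computed here from the formula C9H11ClO with chlorine-35 rather than a value quoted from the input table. Only single insertions are considered.

```python
# Nominal masses of the most abundant isotopes.
NOMINAL_MASS = {"H": 1, "C": 12, "O": 16}
# Elemental composition of the fragments identified for compound 1.
FRAGMENT_FORMULAS = {
    "OH": {"O": 1, "H": 1},
    "CH2": {"C": 1, "H": 2},
    "C6H4": {"C": 6, "H": 4},   # para-substituted aromatic ring
}
# Units that are invisible in IR and 1H NMR (illustrative list).
INSERTABLE = {"Cl": 35, "Br": 79, "I": 127, "O (ether)": 16}


def fragment_mass(name):
    """Nominal mass of one copy of a fragment."""
    return sum(NOMINAL_MASS[el] * n for el, n in FRAGMENT_FORMULAS[name].items())


def propose_insertions(fragment_counts, measured_mass, tolerance=0):
    """Return the unexplained mass and the single units that would account for it."""
    accounted = sum(fragment_mass(f) * c for f, c in fragment_counts.items())
    deficit = measured_mass - accounted
    matches = [name for name, mass in INSERTABLE.items()
               if abs(mass - deficit) <= tolerance]
    return deficit, matches


deficit, matches = propose_insertions({"OH": 1, "CH2": 3, "C6H4": 1},
                                      measured_mass=170)
print(deficit, matches)
```

A fuller version would also try combinations of insertable units, which is where the exponential growth in candidates mentioned for halogen- and ether-containing compounds comes from.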
When elucidating structures of organic molecules, we solve a problem that is related to the class of inverse problems, which are most frequently ill-posed and usually do not have a unique solution. Therefore, the correct structure of the investigated molecule was not always the first on the list of the compound candidates with the highest relevance score. For example, for compound 33, the highest-ranking candidate was 2-oxo-5-phenylpentanoic acid (score 99.41%), whereas the correct structure of the investigated compound, 5-oxo-5-phenylpentanoic acid (33), was ranked second (score 99.05%). Close inspection of the 1H NMR spectrum [δ 11.10 (br s, 1H), 7.50 (m, 5H), 3.07 (t, 2H), 2.50 (t, 2H), 2.10 (quint-like, 2H)] 47 revealed that the terminal methylene groups of the −CH2−CH2−CH2− chain, with resonances at δ 3.07 (t, 2H) and 2.50 (t, 2H) ppm, have chemical shifts that could imply attachment to an aromatic ring, a carbonyl, or a carboxylic group. Therefore, both of the above-described compound candidates are plausible. However, in the investigated cases the correct structure of the molecule was usually the first or second, and in a few cases the third, result on the list of compound candidates and always had a relevance score above 99%.
The elucidator has a user-friendly web interface and is publicly available (http://schmarnica.si). Users can input numerical values of IR, 1H NMR, and MS spectral data, along with optional 13C NMR data (Figure 5), of the investigated compound into designated fields and run the elucidation process. The elucidation process can run with or without 13C NMR data, although 13C NMR data improve the process and result of elucidation. The process incorporates the described combinatorial search for potential fragments based on the two spectra, and a graph search algorithm, which evaluates the connectivity of the potential fragments from each possible combination. After the process is finished, the interface lists the compound candidates, which are visualized by the WolframAlpha computational knowledge engine within the interface. The candidates are ranked according to the predefined rules for analyzing the spectral data (vide supra), and the ranking score is displayed next to each candidate. If the algorithm successfully resolves the structure from the spectral data (vide infra), the candidate with the highest score almost always corresponded to the correct structure for the tested examples (vide supra). The processing time in which the elucidator derives the list of structure candidates depends on the number of structural fragments present in the investigated molecule and their connectivity. On the test data, the algorithm usually derived a list of structure candidates within a few seconds to 5 min. In cases of compounds with several structural fragments (e.g., compound 20) or elements that can be identified only by MS (e.g., Cl, Br, I atoms, and ethers), the calculation time can increase significantly because the number of possible combinations grows exponentially.
An evaluation of different optimization and data techniques was used to determine the efficiency of the proposed model. First, we evaluated the time complexity of the proposed approach from its baseline (no optimizations) to the final version. In order to evaluate the impact of the individual optimization technique on the process, we only evaluated the complexity and not the classification performance of the approach. The aggregated results are shown in Table 3.
The initial time needed to elucidate 40 compounds was 341.04 s. With weight checking and removal of potentially incorrect combinations, this time was significantly reduced to 21.66 s. Further optimizations, which additionally excluded the impossible compound combinations, reduced the calculation time by roughly another half, to 11.45 s. Inclusion of the 13C NMR data did not significantly affect the time complexity. However, the number of elucidation candidates was reduced by about 20%. This part of the process decreased the final number of candidates by removing incorrect candidates from the results, without negatively affecting the classification performance by removing correct results from the candidate list.
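The relative effect of each optimization stage follows directly from the three reported timings; a short calculation, using only the figures quoted above and a hypothetical helper pct_reduction, makes the reductions explicit.

```python
def pct_reduction(before, after):
    """Percentage by which the run time dropped between two stages."""
    return 100.0 * (before - after) / before


baseline, mass_filtered, fully_optimized = 341.04, 21.66, 11.45  # seconds, 40 compounds

print(f"mass check:     {pct_reduction(baseline, mass_filtered):.1f}% less time")
print(f"graph checks:   {pct_reduction(mass_filtered, fully_optimized):.1f}% less time")
print(f"overall factor: {baseline / fully_optimized:.1f}x faster")
```

Most of the gain comes from the mass check, with the exclusion of impossible combinations roughly halving the remaining time.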
Considering all optimizations, the average number of candidates per compound significantly decreased by considering the weight of the combinations, followed by the connectivity methods and tree realization checks. Additionally, the 13 C NMR data, which we added as an optional input to the proposed approach, reduced the number of candidates.
It is important to note that in its current form the algorithm can resolve relatively simple organic molecules (Figure 4). It currently processes only first-order 1H NMR spectra and fails to resolve the unknown structure if the IR spectrum does not provide the information on the functional groups that are present in the molecule and should be seen in the IR spectrum. For instance, if the −NH− group is present in the molecule, but the IR spectrum for some reason (hidden/superimposed/not assigned signal) does not provide the corresponding peak value for this group, that is, ca. 3300 cm−1, the algorithm fails to resolve the structure. In some cases, the 1H NMR spectral data were adjusted to fit first-order NMR data. For example, in the above-presented elucidation of compound 1, the splitting pattern of the proton resonance at δ 1.78−1.92 ppm was assigned as quintet-like (quint-like) at δ 1.85 ppm (vide supra). We decided to use this term as it corresponds well to what one can actually observe in the spectrum without knowing the structure of the compound. Depending on the resolution of the spectrum, the quintet-like splitting pattern commonly appears for the central methylene protons in an X−CH2−CH2−CH2−Y hydrocarbon chain, which are coupled to the nonequivalent neighboring pairs of X−CH2 and CH2−Y protons with similar coupling constants. In the literature, the resonance for this type of central methylene protons is correctly reported as a multiplet or triplet of triplets; however, at the current stage, the algorithm cannot process more complex splitting patterns (e.g., dd, dt, tt, etc.) or multiplets. The exception is the phenyl group, C6H5− (Ph−), which is defined as a multiplet resonance with an integral of 5 in the region around 7 ppm.
The algorithm can also process a para-substituted phenyl ring, as it frequently appears as two doublet resonances with an integral ratio of 2:2 in the aromatic region of the spectrum. Therefore, only compounds with mono- and para-substituted phenyl rings can currently be processed by the algorithm. One of the primary goals in further developments will be upgrading the algorithm to resolve more complex NMR data as well as to resolve structures from partly incomplete spectral data (vide supra). The algorithm proposed herein serves as a basis for further developments, which will increase its capacity for resolving more complex molecular structures.

CONCLUSIONS
The proposed algorithm is the first step in the development of a user-friendly and database-independent chemical structure elucidator that would mimic a spectroscopist-driven process of resolving the molecular structure from spectral data, that is, building the molecular structure from small predefined fragments. It recognizes various functional groups and differentiates between structural isomers of organic molecules. In its current form, the algorithm can resolve rather simple organic structures and will thus serve as a basis for further developments. The elucidator is publicly available through a web interface, which can be used to elucidate and visualize unknown compounds. For interested researchers, the source code of the elucidator is also publicly available.

ASSOCIATED CONTENT
Supporting Information. The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.0c01332. Literature spectroscopic data of compounds employed for algorithm development and testing of its performance (PDF)