CHEMTECH
September 1998
CHEMTECH 1998, 28(9), 35-40.
Copyright © 1998 by the American Chemical Society.



ENABLING SCIENCE

Building the shortest synthesis route

The goal is to make the target compound in the fewest steps possible, thus avoiding wasteful yield losses and minimizing synthesis time.

James B. Hendrickson

R&D laboratories synthesize many new compounds every year, yet there seems to be no clear protocol for designing acceptable and efficient routes to target molecules. Indeed, there must be millions of ways to do it. Some years ago, in an effort to use the power of the computer to generate all the best and shortest routes to any compound, my group at Brandeis began to develop the SYNGEN program (1-3).

The task is huge, even for the computer. Imagine a graph that traces the process of building up a target molecule; we call it a synthesis tree (Figure 1). The starting materials for the possible synthesis routes are molecules we can easily obtain. As the routes progress, new starting materials are added from time to time until the target is obtained. Each line represents a reaction step, or level, from one intermediate to another, and each step decreases the yield. Two of many possible routes are traced in Figure 1.

Figure 1.

To find these routes, we presume to start with the target structure and a catalog of all possible starting materials. Then, the computer generates all the points (intermediates) and lines (reactions) of the graph. If the computer has been programmed with an extensive knowledge of chemical reactions, it could do this by generating all possible reactions backward one step from the target structure to the intermediate structures, then repeating this on each intermediate as many times as necessary to return to the available starting materials.

At this stage, the problem gets too big. Suppose there are 20 possible last reactions to the target (level 1) and that each of these reactions also has 20 possible reactions back to level 2. Going back only five levels will generate 20^5 (3.2 million) routes. How do we select only one to try in the laboratory?
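The growth can be checked with a few lines of Python. This is only an illustrative sketch: the function name `count_routes` and the assumption of the same branching factor at every level are ours, not part of SYNGEN.

```python
def count_routes(branching: int, levels: int) -> int:
    """Number of backward routes if every structure has `branching`
    possible precursor reactions (a simplifying assumption)."""
    return branching ** levels


# Route count at each level back from the target, for 20 reactions per level.
for level in range(1, 6):
    print(level, count_routes(20, level))
```

Real synthesis trees do not branch uniformly, so the figure is an order-of-magnitude guide rather than an exact count.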

This generation of reactions and intermediates is a brute-force approach; clearly, it must be focused and simplified with some stringent logic. The central criterion should be economy: that is, to make the target in the fewest steps possible, thus avoiding wasteful yield losses and minimizing synthesis time.

A protocol for synthesis generation
The key to finding the shortest path seems to be to join the fewest possible starting materials and those that are closest to the target on the graph. The starting material skeletons are usually smaller than the target skeleton, so joining them to assemble the target will always require reactions that construct skeletal bonds. This underlying skeleton is revealed by deleting all the functional group bonds on a structure and leaving only the framework, usually just C-C sigma-bonds.

The central feature of any synthesis is the assembly of the target skeleton from the skeletons of the starting material. Looking for all the possible ways of cutting the target skeleton into the skeletons of available starting materials represents a major focus for examining the synthesis tree.
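As a rough illustration of what "deleting the functional group bonds" means, the sketch below reduces a small molecule to its carbon skeleton. The data layout (an element list plus `(atom, atom, order)` bond triples), the function name `carbon_skeleton`, and the choice of methyl vinyl ketone as the example are all assumptions made for this sketch; SYNGEN's internal representation is not described here.

```python
def carbon_skeleton(atoms, bonds):
    """Return the set of C-C connections, ignoring bond order.

    atoms: list of element symbols, indexed by atom number
    bonds: list of (i, j, order) triples; hydrogens are implicit
    """
    skeleton = set()
    for a, b, _order in bonds:
        # Keep only carbon-carbon links; bonds to heteroatoms are dropped,
        # and a C=C double bond collapses to a single skeletal link.
        if atoms[a] == "C" and atoms[b] == "C":
            skeleton.add((min(a, b), max(a, b)))
    return skeleton


# Methyl vinyl ketone, CH3-C(=O)-CH=CH2 (hypothetical example input).
mvk_atoms = ["C", "C", "O", "C", "C"]
mvk_bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1), (3, 4, 2)]

print(sorted(carbon_skeleton(mvk_atoms, mvk_bonds)))
```

The result is a plain four-carbon chain; the carbonyl and the double bond survive only as functionality to be recorded separately.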

We illustrate this task by looking at the steroid skeleton of estrone and cutting it in two at different points in the structure (Figure 2). Each cut creates two intermediate skeletons, and each skeleton is then cut in two again to obtain four skeletons. This procedure creates a convergent synthesis, and convergent routes are the most efficient (4). With four starting skeletons, we will need to construct only 6 (or fewer) of the 21 target skeleton bonds. We could keep dividing each skeleton until we ultimately arrive at a set of one-carbon skeletons, but it is not necessary to go that far, that is, to a "total synthesis".

Figure 2.

With our four starting skeletons, each skeleton represents a family of many compounds with different functional groups placed on the same skeleton. Suppose that we find a set in which all four skeletons are represented by real compounds in an available library of starting materials; this set could form the basis of a synthesis route with no more than six construction steps to the steroid if the functional groups are right. The skeletal bonds we cut, which must be constructed in the synthesis route, are called a bondset, and these bondsets are a basis for generating the shortest syntheses. Each skeletal bondset represents a whole family of potential syntheses.
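One way to picture the dissection is as a search for small sets of bonds whose removal splits the skeleton graph into two pieces. The sketch below is a brute-force illustration under our own assumptions: the names `components` and `bisecting_cuts`, the limit of two bonds per cut, and the eight-carbon example skeleton (a six-membered ring with a two-carbon side chain) are hypothetical and are not taken from SYNGEN.

```python
from itertools import combinations


def components(nodes, bonds):
    """Connected components of an undirected graph (depth-first search)."""
    adj = {n: set() for n in nodes}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    seen, parts = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, part = [start], set()
        while stack:
            n = stack.pop()
            if n not in part:
                part.add(n)
                stack.extend(adj[n] - part)
        seen |= part
        parts.append(part)
    return parts


def bisecting_cuts(nodes, bonds, max_cut=2):
    """Yield (cut, piece_a, piece_b) for each set of at most `max_cut` bonds
    whose removal leaves exactly two pieces, with every cut bond joining them."""
    for k in range(1, max_cut + 1):
        for cut in combinations(bonds, k):
            remaining = [b for b in bonds if b not in cut]
            parts = components(nodes, remaining)
            if len(parts) != 2:
                continue
            # Discard cuts containing a bond that lies inside one piece.
            if all((a in parts[0]) != (b in parts[0]) for a, b in cut):
                yield cut, parts[0], parts[1]


# Hypothetical skeleton: ring carbons 0-5, side chain 0-6-7.
nodes = list(range(8))
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6), (6, 7)]

for cut, piece_a, piece_b in bisecting_cuts(nodes, bonds):
    print(cut, sorted(piece_a), sorted(piece_b))
```

Applying the same function again to each piece gives the second level of a convergent dissection; the bonds cut across both levels together form one bondset.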

The ideal synthesis
There are two kinds of reactions: construction reactions, which build the target skeletal bonds (usually C-C bonds), and refunctionalization (ΔFG) reactions, which alter the functional groups without changing the skeleton. Any synthesis must do construction reactions, because the starting materials are smaller than the target, but must a synthesis route have any ΔFG reactions?

Imagine a synthesis route with its set of starting materials chosen so that their functional groups are correct to initiate the first construction, leave a product correctly functionalized for the second construction, and so on, continuing to construct skeletal bonds until the target skeleton is built. This is the ideal synthesis in that it must have the fewest steps possible. It requires no ΔFG reactions to get from one construction product to the next.

In a survey of many syntheses, we found that, on average,

  • the nonaromatic starting material has a skeleton of only three carbons,
  • one skeletal bond in three of the target is constructed, and
  • there are twice as many ΔFG reactions as constructions.
Therefore, for an average synthesis, the number of steps equals the number of target skeleton bonds: constructions number one-third of the bonds, and ΔFG reactions twice that.
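The survey averages can be turned into a step count and set against the ideal, constructions-only case. The sketch below is illustrative only; the function names are ours, and the inputs (21 skeletal bonds for estrone, a bondset of at most six bonds for four starting skeletons) are the figures quoted earlier for the steroid skeleton.

```python
def average_synthesis_steps(skeleton_bonds: int) -> dict:
    """Step count implied by the survey averages."""
    constructions = skeleton_bonds / 3          # one skeletal bond in three is built
    refunctionalizations = 2 * constructions    # twice as many refunctionalizations
    return {
        "constructions": constructions,
        "refunctionalizations": refunctionalizations,
        "total": constructions + refunctionalizations,
    }


def ideal_synthesis_steps(bondset_size: int) -> dict:
    """Constructions only: one step per bond in the bondset."""
    return {
        "constructions": bondset_size,
        "refunctionalizations": 0,
        "total": bondset_size,
    }


estrone_average = average_synthesis_steps(21)
estrone_ideal = ideal_synthesis_steps(6)
print(estrone_average["total"], estrone_ideal["total"])
```

On these assumptions the average route to a 21-bond skeleton takes about 21 steps, against no more than 6 for an ideal route from four starting skeletons.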

We think we can do better. Building the shortest, most economical syntheses requires first finding those skeletal dissection bondsets with the fewest bonds, to minimize construction reactions. It also requires no more than four correctly functionalized starting materials, to minimize ΔFG reactions. Common targets have 20 or fewer carbons, which implies an average starting material of 5 carbons. In our experience with catalogs of starting materials, functional diversity on the skeletons is ample up through five carbons but decreases sharply with larger molecules.

Generating the chemistry
Once we find the four commercially available starting materials, we need to make a second pass, down from the target through the ordered designated bonds of the bondset. This process generates the actual construction reactions we require, in reverse. So, we need a method of generalizing structures and reactions to quickly find the reactions appropriate to the functional groups present (1, 2, 5).

Any carbon in a structure can have four general kinds of bonds, as summarized in Figure 3: skeletal bonds to other carbons (R); pi-bonds to adjacent carbons (Π); bonds to heteroatoms that are electronegative (Z); and bonds to heteroatoms that are electropositive (H). The numbers of bonds are referred to as sigma, pi, z, and h, respectively. If we know the values of sigma and obtain h by subtraction from 4, only two digits (z and pi) are needed to describe each carbon. This description is summarized in Figure 3, where each carbon is marked in the example structure with its zpi value. This digitized general description of the structure is easy for the computer.
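A minimal sketch of this bookkeeping follows. The class name `Carbon`, its field names, and the two-character string returned by `zpi` are our own choices for illustration. We also assume that a C=O double bond counts as two bonds to an electronegative heteroatom (z = 2); that is our reading of the scheme and is not spelled out in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Carbon:
    sigma: int  # skeletal bonds to other carbons (R)
    pi: int     # pi-bonds to adjacent carbons
    z: int      # bonds to electronegative heteroatoms (Z)

    def __post_init__(self):
        # A carbon has four bonds in total, so the three counts cannot exceed 4.
        if min(self.sigma, self.pi, self.z) < 0:
            raise ValueError("bond counts must be non-negative")
        if self.sigma + self.pi + self.z > 4:
            raise ValueError("a carbon cannot have more than four bonds")

    @property
    def h(self) -> int:
        """Bonds to hydrogen or other electropositive atoms, by subtraction."""
        return 4 - self.sigma - self.pi - self.z

    @property
    def zpi(self) -> str:
        """Two-digit descriptor: z followed by pi."""
        return f"{self.z}{self.pi}"


# Methyl vinyl ketone, CH3-C(=O)-CH=CH2, carbon by carbon (hypothetical example).
mvk = [
    Carbon(sigma=1, pi=0, z=0),  # CH3
    Carbon(sigma=2, pi=0, z=2),  # C=O
    Carbon(sigma=2, pi=1, z=0),  # CH=
    Carbon(sigma=1, pi=1, z=0),  # =CH2
]
print([c.zpi for c in mvk], [c.h for c in mvk])
```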

Figure 3.

A reaction change at each carbon is just a simple exchange of one bond type for another. This change may be designated by the two letters for the bond made and for the bond lost. Thus, reaction HZ indicates making a bond to hydrogen by loss of a bond to a heteroatom, that is, a reduction. The 16 possible combinations are shown and described with general reaction families in Figure 3.
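The two-letter labels are easy to enumerate and apply mechanically. In the sketch below, the dictionary layout for bond counts, the function name `apply_change`, and the use of the ASCII letter `P` to stand in for Π are assumptions made for illustration.

```python
from itertools import product

BOND_TYPES = ("H", "Z", "R", "P")  # "P" stands in for the pi-bond symbol

# Every ordered pair (bond made, bond lost): 4 x 4 labels.
REACTION_LABELS = ["".join(pair) for pair in product(BOND_TYPES, repeat=2)]


def apply_change(counts: dict, label: str) -> dict:
    """Apply a two-letter change (made, lost) to one carbon's bond counts."""
    made, lost = label
    if counts[lost] < 1:
        raise ValueError(f"no {lost} bond available to lose")
    new_counts = dict(counts)
    new_counts[lost] -= 1
    new_counts[made] += 1
    return new_counts


# Ketone carbon: two skeletal bonds, two bonds to oxygen.
ketone_carbon = {"H": 0, "Z": 2, "R": 2, "P": 0}

# HZ: gain a bond to hydrogen, lose a bond to the heteroatom (a reduction).
alcohol_carbon = apply_change(ketone_carbon, "HZ")
print(len(REACTION_LABELS), alcohol_carbon)
```

Labels with the same letter twice, such as ZZ, leave the counts unchanged, as expected for a substitution of one bond by another of the same type.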

Using this system, we can generate all possible generalized reactions, forward or backward, from any structure. No routes are missed, and we can find all the best routes back from the target to real starting materials. Relatively few generalized reactions are created, and we refine the abstract into real chemistry only at the end. When starting materials are generated through successive applications of these reaction families, we can look them up in the catalog, where they are indexed by skeleton and by generalized zpi lists of the functionality on each skeletal carbon.
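A catalog indexed this way can be pictured as a dictionary keyed on the skeleton and the list of zpi values. The sketch below is hypothetical throughout: the class `Catalog`, the skeleton key `"C4-chain"`, and the two entries are invented for illustration, and it assumes that carbons are always listed in one canonical order so that equal structures give equal keys.

```python
from collections import defaultdict


class Catalog:
    """Toy starting-material index keyed by (skeleton, zpi list)."""

    def __init__(self):
        self._index = defaultdict(list)

    def add(self, name, skeleton_key, zpi_list):
        self._index[(skeleton_key, tuple(zpi_list))].append(name)

    def lookup(self, skeleton_key, zpi_list):
        # .get avoids creating empty entries for misses.
        return list(self._index.get((skeleton_key, tuple(zpi_list)), []))


catalog = Catalog()
catalog.add("methyl vinyl ketone", "C4-chain", ["00", "20", "01", "01"])
catalog.add("2-butanone", "C4-chain", ["00", "20", "00", "00"])

# A generated starting material is "available" if its key is in the index.
print(catalog.lookup("C4-chain", ["00", "20", "01", "01"]))
print(catalog.lookup("C4-chain", ["10", "00", "00", "00"]))
```

Because the lookup is a single hash access, checking every generated starting material against the catalog stays cheap even for a large library.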

The SYNGEN program
We have applied this approach in our SYNGEN program. An earlier version found its way into laboratories at Glaxo, Wyeth-Ayerst, and SmithKline Beecham, and the program is currently being improved significantly. The two phases of the generation are summarized in Figure 4 for one particular result, the Wyeth estrone synthesis. In the first phase (Figure 4, left side), we see the skeletal dissection down to four starting skeletons, all found in the catalog; in fact, the intermediate skeleton B also was found, so further dissection to E and F may not be needed.

Figure 4.

In the second phase (Figure 4, right side), this ordered bondset is followed, one bond at a time, generating the construction reactions for an ideal synthesis until all of the functional groups have been generated. These actual starting materials are found in the catalog, so a full synthesis route can be written from them that goes up the right side in a quick, constructions-only ideal synthesis of the target. The product of this three-step synthesis can be converted to estrone in two more steps. The prediction for an average synthesis would have been much longer.

The catalog for the current version of SYNGEN has about 6000 starting materials, but it is being expanded from available chemicals directories. After the target is drawn on the screen, the program generates the best routes in <1 min. It displays the bondsets, the starting materials used, and the actual routes, which are ordered by their calculated overall cost.

The output screen from SYNGEN for the example analyzed in Figure 4 is shown in Figure 5. Two other sample outputs, from a different bondset of the same target, are shown in Figures 6 and 7. The notations on the arrows use abbreviations to describe the nature of the reaction; explanations are available on a help screen. The routes shown are still in a generalized form and require further elaboration of chemical detail by the user. Literature precedents, however, are being added to the program, as described later.

Figure 5.

Figure 6.

Figure 7.

The future of SYNGEN
Three developments are currently under way on the SYNGEN program. The first and perhaps most important improvement is creating a graphical output presentation that is easy for a chemist to read and navigate; this work is nearing completion. The second deals with the problem of validating the generated reactions with real chemistry. The third development, currently supported by the U.S. Environmental Protection Agency (EPA), is to assign indexes of environmental hazard (such as toxicity and carcinogenicity) to starting materials, so that the routes generated may be flagged for environmental concern when these starting materials are involved.

The second development deals with a major problem in previous versions of SYNGEN: The program generated too many reactions that chemists saw as clearly nonviable. Such results tended to destroy their confidence in the program as a whole. We now have a way to validate the generated reactions from the literature, eliminating many of these nonviable reactions.

The generalizing procedure for describing structures and reactions in SYNGEN also was applied to create an index-and-retrieval system to find matches for any input query reaction from a large database of published reactions. This program, RECOGNOS, has been applied to an archive of 400,000 reactions originally published between 1975 and 1992 and packaged as a single CD-ROM that allows instant access to matching precedents in that archive (6, 7). The RECOGNOS program is available on CD-ROM from InfoChem GmbH, Munich, Germany, combined with their ChemReact database of 370,000 reactions and renamed "ChemReact for Macintosh".

This archive of literature reactions, now almost double its original size, has been distilled to more than 100,000 construction reactions. These reactions, in turn, have been converted into a look-up table for use by the SYNGEN program. With this tool, SYNGEN can validate any reaction it generates by searching for matches in the archive and determining the average yield. Unprecedented reactions are therefore set aside, and a realistic yield can be estimated for each reaction to be used in the overall cost accounting.
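The validation step can be sketched as a table lookup followed by a product of step yields. Everything specific below is hypothetical: the reaction keys, the yield values, and the function names are invented for illustration, and overall yield is used here only as a stand-in for SYNGEN's cost calculation, whose details are not given.

```python
from math import prod
from statistics import mean

# Hypothetical look-up table: generalized reaction key -> literature yields.
PRECEDENT_YIELDS = {
    "construction-A": [0.80, 0.60],
    "construction-B": [0.90],
}


def estimate_yield(reaction_key, table):
    """Average precedent yield, or None if the reaction is unprecedented."""
    yields = table.get(reaction_key)
    return mean(yields) if yields else None


def route_yield(route, table):
    """Overall yield of a linear route, or None if any step lacks precedent."""
    step_yields = [estimate_yield(key, table) for key in route]
    if any(y is None for y in step_yields):
        return None  # set the route aside
    return prod(step_yields)


routes = {
    "route 1": ["construction-A", "construction-B"],
    "route 2": ["construction-B", "construction-C"],  # C has no precedent
}
validated = {name: y for name, r in routes.items()
             if (y := route_yield(r, PRECEDENT_YIELDS)) is not None}

# Rank the surviving routes, best overall yield first.
for name, overall in sorted(validated.items(), key=lambda kv: -kv[1]):
    print(name, round(overall, 2))
```

A convergent route would need the yields combined along each branch rather than in one chain; the linear product is the simplest case.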

We believe that SYNGEN has considerable potential for discovering new alternatives for creating organic chemicals in the most economical way possible. Even when the program does not yield a directly usable synthesis, it often starts the chemist thinking about different approaches previously not considered. No chemist can think of all the possible routes to the target, but SYNGEN does this fast. It also provides a powerful and focused output of the possibilities.

Acknowledgments
In addition to welcoming new support from industry for the remaining developments, I thank the National Science Foundation for its patient support over two decades and the U.S. Environmental Protection Agency for its current support of SYNGEN development.


References
