Building Block-Centric Approach to DNA-Encoded Library Design

DNA-encoded library technology grants access to nearly infinite opportunities to explore the chemical structure space for drug discovery. Successful navigation depends on the design and synthesis of libraries with appropriate physicochemical properties (PCPs) and structural diversity while aligning with practical considerations. To this end, we analyze combinatorial library design constraints, including the number of chemistry cycles, bond construction strategies, and building block (BB) class selection, in pursuit of ideal library designs. We compare two-cycle library designs (amino acid + carboxylic acid, primary amine + carboxylic acid) in the context of PCPs and chemical space coverage, given different BB selection strategies and constraints. We find that broad availability of amines and acids is essential for enabling the widest exploration of chemical space. Surprisingly, cost is not a driving factor, and virtually the same chemical space can be explored with “budget” BBs.


Figure S1. UMAP analysis of Fmoc-AA, primary amine, and carboxylic acid BBs from Enamine without truncation. Density plots arranged by chemical similarity are compared for (A) Fmoc-amino acids (CONH2-R-NH2), (B) primary amines (R-NH2), and (C) carboxylic acids (R-CONH2). Grayscale intensity denotes the probability density of points. (D) A box and whisker plot summarizes the density of UMAP space when Fmoc-AA BBs are input as their corresponding truncates, amide representations, or fully intact structures. (E) A box and whisker plot summarizes the Tanimoto similarity scores when Fmoc-AA BBs are input as their corresponding truncates, amide representations, or fully intact structures.

Figure S6. Cost comparison between matched primary amine and Fmoc-AA BBs. (A) Selected examples of paired compounds and associated costs (250 mg). (B) Primary amine and Fmoc-AA BBs are plotted by cost, with lines indicating compound pairs. (C) Data in A are reduced by plotting the difference in the cost of each pair (Fmoc-AA cost - primary amine cost). Data points are jittered to avoid overplotting and overlaid with a violin plot indicating relative density.

Figure S7. MW and cLogP distributions for Fmoc-AA, primary amine, and carboxylic acid BB sets. (A-C) BBs were binned by MW (10 Da) and plotted with a corresponding density trace. (D-F) MW was calculated for BBs (Fmoc removed for AAs) and plotted against predicted cLogP. Density of points is indicated by grayscale.

Figure S9. Illustration of UMAP coverage from random, diversity-based, and uniform selections of 192 carboxylic acids.
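
As a rough illustration of how a random and a diversity-based selection can differ, the sketch below contrasts uniform random picking with a greedy MaxMin picker. This is a minimal sketch under stated assumptions, not the selection procedure used for the figure: fingerprints are represented as plain Python sets of on-bits, the function names are hypothetical, and MaxMin is a stand-in for whichever diversity algorithm was actually applied. The UMAP-based uniform selection is not reproduced here.

```python
import random

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def random_pick(fps, k, rng):
    """Pick k building-block indices uniformly at random (hypothetical helper)."""
    return rng.sample(range(len(fps)), k)

def maxmin_pick(fps, k, rng):
    """Greedy MaxMin (assumed stand-in for a diversity-based selection).

    Starts from a random seed, then repeatedly adds the candidate whose most
    similar already-picked member is the least similar among all candidates.
    """
    picked = [rng.randrange(len(fps))]
    # best_sim[i]: highest similarity of candidate i to anything picked so far
    best_sim = [tanimoto(fp, fps[picked[0]]) for fp in fps]
    while len(picked) < k:
        chosen = set(picked)
        nxt = min((i for i in range(len(fps)) if i not in chosen),
                  key=lambda i: best_sim[i])
        picked.append(nxt)
        for i, fp in enumerate(fps):
            best_sim[i] = max(best_sim[i], tanimoto(fp, fps[nxt]))
    return picked
```

With real BBs, `fps` would hold one fingerprint per carboxylic acid and `k` would be 192; the toy set representation only keeps the example dependency-free.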

Figure S10. Comparison of chemical similarity for BBs stratified by cost. (A) Cycle 1 primary amine and (B) Cycle 2 carboxylic acid BB sets are randomly sampled (192 BBs/sampling) at variable price cutoffs (≤ $100, ≤ $250, ≤ $500 / 250 mg) iteratively (n = 50). Average nearest neighbor scores are calculated for each sampling and plotted as a box and whisker plot. The interquartile range (IQR) of NN Tanimoto scores for each cost filter is indicated.
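
The sampling scheme in this caption can be sketched as follows. This is an illustrative sketch only: fingerprints are modeled as sets of on-bits, BBs as hypothetical `(fingerprint, price)` tuples, and the function names are not taken from the original analysis. For the figure, `size` would be 192 and `n_iter` would be 50.

```python
import random
import statistics

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def mean_nn_similarity(fps):
    """Average, over all members, of the similarity to the most similar other member."""
    nn_scores = []
    for i, a in enumerate(fps):
        nn_scores.append(max(tanimoto(a, b) for j, b in enumerate(fps) if j != i))
    return sum(nn_scores) / len(nn_scores)

def sample_by_cost(bbs, cutoff, size, n_iter, rng):
    """Return one average nearest-neighbor score per random sampling.

    bbs: list of (fingerprint, price) tuples (assumed layout).
    Only BBs priced at or below `cutoff` are eligible for sampling.
    """
    pool = [fp for fp, price in bbs if price <= cutoff]
    return [mean_nn_similarity(rng.sample(pool, size)) for _ in range(n_iter)]

def iqr(values):
    """Interquartile range of a list of scores (needs at least two values)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1
```

Running `sample_by_cost` once per price cutoff and comparing the resulting score distributions (for example through `iqr`) mirrors the comparison summarized in the box and whisker plot.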

Figure S11. Comparisons within or between an enumerated library and an Enamine Lead-like set. A 192 × 192 primary amine × carboxylic acid DEL is enumerated (random BB-sampling). The all × all 2D-Tanimoto similarity matrix is generated using the enumerated DEL (n = 36,864) and a downsampling of an Enamine lead-like commercial catalog ("Diversity set", n = 36,864). Summaries of (A) DEL compound intralibrary similarity, (B) Enamine lead-like set intralibrary similarity, (C) DEL vs Enamine library, and (D) Enamine library vs DEL similarity are plotted as histograms, with nearest neighbor scores in blue and average Tanimoto scores in orange.
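
The library size and the two per-compound summaries in this caption can be illustrated with a short sketch. It assumes fingerprints are stored as sets of on-bits and uses hypothetical function names; it is not the code behind the figure. Setting `exclude_self=True` corresponds to an intralibrary comparison (panels A and B), where a compound should not be counted as its own nearest neighbor.

```python
from itertools import product

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def enumerate_pairs(n_amines, n_acids):
    """All (amine index, acid index) combinations of a two-cycle library."""
    return list(product(range(n_amines), range(n_acids)))

def similarity_summary(queries, references, exclude_self=False):
    """For each query, return (nearest-neighbor score, average score) vs. references.

    With exclude_self=True, queries and references are assumed to be the same
    list, and the diagonal (self-comparison) is skipped.
    """
    summary = []
    for i, q in enumerate(queries):
        scores = [tanimoto(q, r) for j, r in enumerate(references)
                  if not (exclude_self and i == j)]
        summary.append((max(scores), sum(scores) / len(scores)))
    return summary

# 192 amines x 192 acids gives the 36,864 compounds quoted in the caption.
library = enumerate_pairs(192, 192)
```

Calling `similarity_summary(del_fps, enamine_fps)` and then `similarity_summary(enamine_fps, del_fps)` (both names hypothetical) gives the two directional comparisons of panels C and D, which need not match because nearest neighbors are not symmetric.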