Supporting Information for “Quantitative Comparison of Enrichment from DNA-Encoded Chemical Library Selections”

DNA-encoded chemical libraries (DELs) provide a high-throughput and cost-effective route for screening billions of unique molecules for binding affinity for diverse protein targets. Identifying candidate compounds from these libraries involves affinity selection, DNA sequencing, and measuring enrichment in a sample pool of DNA barcodes. Successful detection of potent binders is affected by many factors, including selection parameters, chemical yields, library amplification, sequencing depth, sequencing errors, library sizes, and the chosen enrichment metric. To date, there has not been a clear consensus about how enrichment from DEL selections should be measured or reported. We propose a normalized z-score enrichment metric using a binomial distribution model that satisfies important criteria that are relevant for analysis of DEL selection data. The introduced metric is robust with respect to library diversity and sampling and allows for quantitative comparisons of enrichment of n-synthons from parallel DEL selections. These features enable a comparative enrichment analysis strategy that can provide valuable information about hit compounds in early stage drug discovery.


S1. Nomenclature for n-synthons
Consider a hypothetical combinatorial library with 3 cycles of split-and-pool chemistry, where each cycle adds 1,000 unique sequence/building block pairs. When analyzing data from a selection, any combination of encoding sequences might be found to be enriched, which would correspond to different combinations of building blocks promoting binding to the target. These combinations of building blocks are often called n-synthons, where n is the number of cycles in the combination. Each of these n-synthons can be evaluated for enrichment in selection data separately by aggregating count data grouped by the different combinations of synthetic cycles in the library. The full set of synthon types in the example 10³ x 10³ x 10³ library is listed in Table S1. When plotting selection data in a standard 3D scatter plot (or "cubic view"), a plane feature in the cubic view represents a single conserved building block from one of the three cycles of this library and hence can be called a 1-synthon or mono-synthon. The example library contains 10³ different mono-synthons in each of the three axes (i.e., 10³ per cycle). Every mono-synthon represents a single building block which is a substructure of 10⁶ unique molecular structures in the library. Di-synthons and tri-synthons similarly represent higher-dimensional groupings of 2 and 3 cycles of building blocks, respectively; these correspond to lines and points, respectively, when plotted in the cubic view. Finally, n-synthons for the highest dimension n within a library (e.g., tri-synthons for a 3-cycle library) are sometimes referred to as singletons, because they each represent only one molecular superstructure.

Table S1. Nomenclature for different types of combinatorial features in 3-cycle DNA-encoded libraries. Within a 3-cycle combinatorial library, any combination of chemical building blocks may be found to be important for binding affinity to a target.
This leads to seven different feature axes belonging to three different synthon types from which specific features can be evaluated for enrichment. (a) Feature axes are described using a notation indicating the inclusion (1) or exclusion (0) of a cycle in a feature. (b) The total number of synthons per feature axis in a 3-cycle library wherein each cycle contains 10³ building blocks. (c) The number of unique library molecules represented by a single synthon in the example 3-cycle library.
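The bookkeeping in Table S1 follows directly from the combinatorics and can be reproduced with a short script. This is an illustrative sketch for the hypothetical 10³ x 10³ x 10³ library; the constants are those of the example, not of a real library.

```python
from itertools import product

CYCLE_SIZE = 1000  # building blocks per cycle in the example library
N_CYCLES = 3

# Enumerate every non-empty inclusion/exclusion pattern, e.g. (1, 0, 0) or (1, 1, 1)
feature_axes = [axis for axis in product((0, 1), repeat=N_CYCLES) if any(axis)]

for axis in feature_axes:
    n = sum(axis)  # synthon dimension (1 = mono, 2 = di, 3 = tri)
    synthons_per_axis = CYCLE_SIZE ** n
    molecules_per_synthon = CYCLE_SIZE ** (N_CYCLES - n)
    print(axis, n, synthons_per_axis, molecules_per_synthon)
```

This reproduces the seven axes of Table S1: three mono-synthon axes with 10³ synthons covering 10⁶ molecules each, three di-synthon axes with 10⁶ synthons covering 10³ molecules each, and one tri-synthon (singleton) axis whose 10⁹ synthons each represent a single molecule.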

S2.1. Candidate Enrichment Metrics
We evaluated the set of enrichment metrics in Table S2 for their ability to meet the enumerated criteria for a successful enrichment metric. The z-score metric is the often-utilized measure of the difference from the mean count in units of standard deviations in counts. The normalized z-score metric is similar, but it is normalized by the square root of the number of samples. The Count_ratio is the difference from the observed count to the expected count in units of the expected count, CBV_ratio is the ratio of the observed count to the Bonferroni-corrected 95% significance critical value from a fitted binomial distribution, logE is the logarithm of the ratio of observed to expected population fractions, and Cohen's h uses the arcsine transformation and is a variance-stabilizing metric for differences in proportions.

Table S2. List of evaluated enrichment metrics. For a given synthon, C is the observed count, E is the expected (mean) count, n is the total number of molecules sampled, σ is the standard deviation in counts from a binomial distribution model, E_cbv is the Bonferroni-corrected 95% critical value from a binomial distribution with n and p_i, p_o is the observed population fraction, and p_i is the expected population fraction.

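Since the formulas in Table S2 are compact, they can be sketched directly in code. The following is our own illustrative implementation, with E and σ taken from a binomial model (E = n·p_i, σ = √(n·p_i·(1−p_i))); CBV_ratio is omitted because it requires inverting the binomial CDF, and the logarithm base for logE is our assumption.

```python
import math

def binomial_moments(n, p_i):
    """Expected count and standard deviation under a binomial model."""
    E = n * p_i
    sigma = math.sqrt(n * p_i * (1.0 - p_i))
    return E, sigma

def z_score(C, n, p_i):
    E, sigma = binomial_moments(n, p_i)
    return (C - E) / sigma

def normalized_z_score(C, n, p_i):
    # z-score divided by the square root of the number of sampled molecules
    return z_score(C, n, p_i) / math.sqrt(n)

def count_ratio(C, n, p_i):
    E, _ = binomial_moments(n, p_i)
    return (C - E) / E

def log_e(C, n, p_i):
    p_o = C / n  # observed population fraction
    return math.log10(p_o / p_i)  # base 10 chosen for illustration

def cohens_h(C, n, p_i):
    # Variance-stabilizing arcsine transformation of the two proportions
    p_o = C / n
    return 2.0 * (math.asin(math.sqrt(p_o)) - math.asin(math.sqrt(p_i)))
```

For example, observing C = 20 where E = 10 (n = 1000, p_i = 0.01) gives a Count_ratio of 1 (the observed count is double the expectation) and a z-score of roughly 3.2.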
Each of the candidate metrics involves comparing observed versus expected populations, and here expected populations were evaluated by using only the diversity of each n-synthon. In other words, all synthetic yields were assumed to be equal, and therefore features with equal diversity are equally probable to be chosen in a random selection. Where expected counts and standard deviations are required, we modeled the data with binomial distributions. The binomial distribution was utilized because its probability mass function yields the probability of observing k counts given a fixed selection probability p_i and the total number of observations n. For small expected counts, the binomial distribution closely resembles the Poisson distribution, and for higher expected counts, it closely resembles the normal distribution. Thus, the binomial distribution can simultaneously model the wide range of selection probabilities for mono-, di-, and tri-synthon features in DEL selection data. The enrichment metrics in Table S2 were evaluated in the selection scenarios described below.

S2.2.1. Naïve library data.
Before a library is screened in a selection against a target, it is important to verify that the distribution of members in the unscreened library is reasonably close to the expected distribution.
For this reason, it is common to sequence each library after synthesis in its unselected, or "naïve", form. Generally, counts are expected to closely follow a binomial distribution with n equal to the number of molecules decoded and p_i equal to the probability of random selection.
Small errors during synthesis, amplification, and sequencing typically produce deviations from the expected binomial distribution, but in our experience these deviations are much smaller than typical perturbations due to affinity selection. Small random perturbations in count distributions are therefore acceptable in practice. Figure S1 shows the observed enrichment for each n-synthon in the sequencing of a naïve library. This 3-cycle library has a size of around 229 million members, and 99 million sequences were read during DNA sequencing. Of these, 81 million sequences represented valid barcodes, and after UMI filtering (see Supporting Information S3), the final data set included just under 15 million sampled library molecules. In this scenario, the expected count for a unique singleton in the library is below 1, at 0.065. This implies that every molecule observed is present at a minimum of 15 times the expected count. Since observed molecules in a naïve library sample are presumably observed only due to random selection, successful enrichment metrics should account for this random selection noise, especially for high-diversity features when the expected count is very low. In this naïve data set, it was observed that for the z-score metrics, enrichment is usually centered close to zero for each synthon type, i.e., the observed populations are, on average, close to the expected populations. The exception is for the (1, 1, 1) feature axis (also known as singletons or tri-synthons), because both the expected population and sampling ratio are low enough that an observed count of 1 is measured as being significant enrichment (an observed count of 1 is much greater than the expected count of 0.065). This effect is also prominent in the logE and Count_ratio metrics, where the evaluated enrichment for singletons is much larger than that of other n-synthons. 
These two metrics do not place enrichment for n-synthon types with different expected populations (i.e., diversities) on the same order of magnitude. On the other hand, the z-score metrics evaluate the enrichment of singletons to be of the same order of magnitude as other n-synthons. The CBV_ratio metric shows only a few features with a value greater than 1, the value above which enrichment would be considered statistically significant at the 95% confidence level. Cohen's h tends to tighten the distribution of enrichment values for high-diversity features, which have low expected probabilities, compared to lower-diversity features.
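The random-selection noise argument can be made concrete with a binomial tail (here approximated as Poisson, since p_i is tiny): the noise level is the smallest count k for which fewer than one of the library's N members is expected to reach k by chance. The function below is our own illustration of this idea, not the paper's exact procedure, evaluated with the naïve-library numbers quoted above.

```python
import math

def random_noise_threshold(n_sampled, library_size):
    """Smallest count k such that fewer than one library member is
    expected to reach k purely by random sampling (Poisson tail)."""
    lam = n_sampled / library_size  # expected count per library member
    k = 0
    tail = 1.0              # P(X >= 0) = 1
    term = math.exp(-lam)   # P(X = 0)
    while library_size * tail >= 1.0:
        tail -= term             # P(X >= k+1) = P(X >= k) - P(X = k)
        term *= lam / (k + 1)    # advance P(X = k) to P(X = k+1)
        k += 1
    return k

# Naïve data set from the text: ~229 million members, ~14.9 million decoded molecules
print(random_noise_threshold(14_900_000, 229_000_000))
```

With an expected count of only ~0.065 per member, single observations are common by chance across such a large library, which is why metrics that treat a count of 1 as strong singleton enrichment are misleading in this regime.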

S2.2.2. Non-enrichment.
An ideal enrichment metric should make it easy for the analyst to determine a lack of significant enrichment of features in a library. To investigate how the candidate metrics perform in the non-enrichment scenario, we examined a data set for which we believe that the library contains no binders with significant affinity for the protein target. We evaluated the metrics for the target-selection pool and the NTC pool and plotted the enrichment values of each feature against each other in Figure S2. In the non-enrichment scenario, an ideal metric would be expected to yield 1) low measures of enrichment and 2) similar enrichment for both target and NTC data sets with a small amount of additional random noise. In Figure S2, the z-score shows generally higher measured enrichment for the NTC data than the target data, while the normalized z-score more closely follows the diagonal line. This is consistent with our observation of 206,220 molecules for the NTC sample compared to 118,446 for the target sample. Thus, the unnormalized z-score can be skewed by the number of decoded molecules, while the normalized z-score is less sensitive to the amount of sampling. CBV_ratio is similarly affected by differences in sampling, while Cohen's h evaluates the two data sets to be more equal in enrichment. logE and Count_ratio again are very sensitive to the expected population and show large differences between synthon types in the magnitudes of enrichment.

Figure S2. Comparative enrichment plots for a selection with no significant target-specific enrichment. For each enrichment metric, enrichment for the NTC is plotted against the enrichment for the target selection data set. Enrichment is evaluated for each n-synthon within the library, and the points are colored by synthon type (dimension). The diagonal y = x line represents equal enrichment in the target and NTC data. The enrichment metrics are a) z-score,

S2.2.3. Enrichment
The main goal of any DEL analysis strategy is to enable straightforward detection of target-specific enrichment of any n-synthons of a library. We evaluated the performance of the enrichment metrics using a data set for which there was specific enrichment of a family of molecular structures with affinity for a protein target. In Figure S3, logE shows little separation between highly enriched and weakly enriched features, making interpretation nontrivial.
Count_ratio treats high-diversity features very differently than low-diversity features.

S2.2.4. Variable Sampling
One of the most important requirements for a useful enrichment metric is insensitivity to sampling. Sampling insensitivity is required to compare enrichment across multiple selection experiments without being affected by sampling bias. If a sampling bias were present in the enrichment metric, it would be difficult to determine whether a feature is enriched due to target-specific binding rather than the sampling bias. To examine this property, we evaluated the metrics on two data sets: one experimental data set with target-specific enrichment, and the same data set with 90% of the decoded ligands randomly removed. Thus, the two samples represent the same selection experiment but with a ten-fold difference in sampling. The two data sets are compared in Figure S4. The z-score and CBV_ratio metrics clearly show bias for the higher-sampled data set, while the normalized z-score, Count_ratio, and Cohen's h appear to be insensitive to the ten-fold difference in sampling. In comparison, logE shows much larger deviations between the two data sets, especially for less-enriched features.

The effect of errors in UMI sequences is illustrated in Figure S5, left, where the copy count per molecule distribution from using the simple unique counter on a data set from a non-target control (NTC; i.e., a selection in which no target protein is included) is shown in red. Rather than counting each unique UMI sequence as a unique molecule from a DEL sample pool, each completed graph (see below) is considered to be one unique molecule for which the DNA barcode has been subject to sequencing errors.
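The sampling-insensitivity argument of S2.2.4 can also be verified algebraically: if both the observed count and the total sampling scale by the same factor, the z-score grows with the square root of the sampling while the normalized z-score is unchanged. A minimal sketch with hypothetical numbers (a feature observed at twice its expected population fraction):

```python
import math

def z_score(C, n, p_i):
    E = n * p_i
    sigma = math.sqrt(n * p_i * (1.0 - p_i))
    return (C - E) / sigma

def normalized_z_score(C, n, p_i):
    return z_score(C, n, p_i) / math.sqrt(n)

p_i = 1e-4            # expected population fraction of the feature
n_full = 200_000      # decoded molecules in the fully sampled data set
n_sub = n_full // 10  # ten-fold reduced sampling

C_full = 2 * p_i * n_full  # observed count at 2x enrichment
C_sub = 2 * p_i * n_sub    # same enrichment level, ten-fold fewer reads

print(z_score(C_full, n_full, p_i), z_score(C_sub, n_sub, p_i))
print(normalized_z_score(C_full, n_full, p_i), normalized_z_score(C_sub, n_sub, p_i))
```

The z-score of the subsampled data set is smaller by a factor of √10, while the normalized z-score is identical in both cases, consistent with the behavior seen in Figure S4.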

S3.2. Graph-based UMI counting
To address sequence errors in the UMI region, we have adopted the strategy of Smith et al.
which involves a directed graph-based counting scheme (Figure S5, right).5 The method assumes that all sampled DNA barcodes have been decoded and that each unique decoded library member is associated with a set of UMI sequences. For each decoded library member, the set of associated UMIs and their populations is read, and the most populated UMI is set as the root node of a directed graph. UMI sequences which are within a set edit distance D and meet a set count ratio threshold R are added as child nodes. Remaining UMIs are likewise added to the graph as children of the root node or its descendants. This process is repeated until no more UMI sequences can be added to the graph. The next most populated UMI which has not yet been inserted into a graph is then assigned to be the root node of a new graph, and children are similarly added to this new graph. The process continues until each UMI sequence is assigned a position in one of the directed graphs. Each completed graph then represents a single unique library molecule, wherein the root node is interpreted as having been the original UMI sequence, and its child nodes are interpreted as products of errors during PCR amplification or sequencing.
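The procedure above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the code used in the study; for brevity it uses Hamming distance in place of a general edit distance, and `umi_counts` is a hypothetical mapping from UMI sequence to read count for one decoded library member.

```python
def hamming(a, b):
    """Mismatches between two equal-length UMI sequences."""
    return sum(x != y for x, y in zip(a, b))

def count_molecules(umi_counts, D=1, R=2):
    """Collapse a {UMI: read count} mapping for one decoded library member
    into a number of unique molecules via directed-graph grouping."""
    # Visit UMIs from most to least populated (ties broken lexicographically)
    umis = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    assigned = set()
    n_graphs = 0
    for root in umis:
        if root in assigned:
            continue
        # Start a new graph rooted at the most populated unassigned UMI
        n_graphs += 1
        assigned.add(root)
        frontier = [root]
        while frontier:
            parent = frontier.pop()
            for u in umis:
                # Absorb nearby UMIs whose counts are consistent with being
                # error-derived copies (parent/child count ratio >= R)
                if (u not in assigned
                        and hamming(parent, u) <= D
                        and umi_counts[parent] >= R * umi_counts[u]):
                    assigned.add(u)
                    frontier.append(u)
    return n_graphs
```

With D = 1 and R = 2, likely single-error UMIs are merged into their high-count parents: for example, reads {"AAAA": 100, "AAAT": 10, "CCCC": 50} collapse to two unique molecules.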
We have observed that this directed graph-based counting strategy generally improves agreement between theoretical and observed count distributions in naïve (unscreened) library data sets and removes significant discontinuities in copy count per molecule distributions (Figure S5, left, black).

S3.3. Triazine DEL naïve data set.
The naïve sequencing of the 174,145,836-member triazine DEL provides an excellent example of the effect of errors in UMI sequences. In Table S3, various molecule counting schemes are compared to the theoretical count distribution based on a binomial distribution model. Since each counting method results in a different total number of decoded molecules, n, the counting schemes must be compared to different evaluations of the binomial probability mass function with different values of n. We first compared the unique counter (U) to the binomial distribution model (U*). We found that 512 library members had observed counts above the expected noise level of 4. We additionally examined the performance of the graph-based methods, labeled as G(D, R), where G represents a graph built with an edit distance parameter D and a count ratio parameter R. Thus G(2, 2) corresponds to a graph built with child nodes added at an edit distance less than or equal to 2 and a ratio of parent count to child count greater than or equal to 2.
The corresponding binomial count distributions are provided in the table and labeled as G(D, R)*. We observed that for this data set, the G(2, 1) model was able to generate count data which matched expected noise levels from a binomial distribution model. The strict unique counter method (U) has a count distribution which overcounted some library members, and thus library members were observed with higher counts (k) than the expected random noise level from a binomial distribution model (U*). Using the graph-based G(2, 2) counter alleviated these high counts somewhat, but observed counts did not meet expected noise levels unless the G(2, 1) or G(3, 1) counters were used.

Samples were analyzed on a Thermo Vanquish UHPLC system coupled to an electrospray LTQ ion trap mass spectrometer. An ion-pairing mobile phase consisting of 15 mM TEA/100 mM HFIP in a water/methanol solvent system was used in conjunction with a Thermo DNAPac RP oligonucleotide column (2.1 x 50 mm, 4 µm) for all separations. All mass spectra were acquired in full-scan negative-ion mode over the mass range m/z 500-2000. Data analysis was performed by exporting the raw instrument data (.RAW) to an automated biomolecule deconvolution and reporting software (ProMass), which uses a novel algorithm known as ZNova to produce artifact-free mass spectra. The following deconvolution parameters were applied: peak width 3.0, merge width 0.2, and minimum and normalized scores of 2.0 and 1.0, respectively. The noise threshold was set at S/N 2.0. The processed data were directly exported to Microsoft Excel worksheets for further data comparisons. A sample MS analysis using ProMass software is presented in Figure S7.

After portioning 1,040 wells on 11 plates with 100 nmol of HP 2, each well was ligated with codon 1 by the general ligation procedure, followed by ethanol precipitation by the general procedure.
The pellets (~100 nmol, 1 equiv) were reconstituted in water (100 µL) and 250 mM pH 9.5 borate buffer (100 µL, 25,000 nmol, 250 equiv). After cooling to 4 °C, cyanuric chloride (25 µL, 40 mM in CH3CN, 1000 nmol, 10 equiv) was added, and the wells were monitored for triazine addition by LC/MS. After complete addition, a collection of 1,040 amines and amino acids was added to individual wells (25 µL, 200 mM in CH3CN/water, 5000 nmol, 50 equiv) and the reactions were left overnight at 4 °C. After analysis of all wells and controls by LC/MS, the DNA was precipitated by the general procedure. After reconstitution, the wells were quickly pooled and precipitated by the general procedure. In addition, a separate control of a small amount of the library pool with a triazine-functionalized "spike-in" oligo was monitored by LC/MS to ensure that residual contaminants were not reacting with the on-DNA diamino-chlorotriazine intermediates (no significant reaction was detected). After reconstitution, an estimated yield of cycle 1 was determined by OD measurement of the pooled solution (est. 74.5 µmol, 1.57 mM, 47.5 mL, 74.5% yield).
Then 1,008 amines (18.45 µL, 200 mM in CH3CN/water, 3690 nmol, 50 equiv) were added to individual wells and the plates were heated to 80 °C for 6 h (two wells received no amine as an encoded control). In addition, a series of non-library, parallel controls that mimicked library reaction substrates and conditions were included in several empty plate wells, and some library wells were augmented with a triazine-functionalized "spike-in" control oligo (8 nmol) to monitor the post-pool transformations by LC/MS. After analysis and positive confirmation of all control data, each well underwent ethanol precipitation by the general procedure. After reconstitution and ligation of codon 2 by the general procedure, all cycle 2 wells were pooled and precipitated by the general procedure. An OD measurement of the reconstituted cycle 2 pool indicated near-quantitative recovery (est. 74.5 µmol, 124 mL, 0.56 mM, quant).
Then 1,040 amines (28.4 µL, 200 mM in water/CH3CN, 5680 nmol, 80 equiv) were added to individual wells, followed by 4-(4,6-dimethoxy-1,3,5-triazin-2-yl)-4-methylmorpholinium chloride ("DMTMM", 56.8 µL, 360 mM in water, 20448 nmol, 288 equiv), and the plates were incubated at 30 °C overnight. As in cycle 2, a series of non-library parallel control wells was set up and some wells were augmented with an appropriately functionalized triazine "spike-in" control oligo (8 nmol). After analysis of all control well data, the library wells were precipitated by the general procedure, reconstituted, pooled, and again precipitated by the general procedure. The final yield of the cycle 3 pool was estimated by OD (62.99 µmol, 81.8 mL, 0.77 mM, 84.5% yield), although a secondary measurement comparing the intensity of the 3-cycle library band to a known standard (Low Molecular Weight DNA Ladder from New England Biolabs) on a native TBE gel suggested a recovery of ~52 µmol.

S4.2.6. Preparation of amplifiable triazine DEL samples ("shots") for selection experiments
On small scale (1-20 nmol of completed library), the triazine DEL material was ligated with two DNA oligonucleotides containing a DNA segment encoding the library design, a segment encoding the experimental usage, a degenerate segment serving as the UMI region, a segment increasing sequencing diversity, and a terminal primer segment to allow PCR amplification. After ethanol precipitation and reconstitution, the amount of amplifiable library material within prepared shots was quantified by qPCR, and shots were used without further purification. Alternatively, portions of the library were ligated on large scale (1-5 µmol) with a duplexed 12-bp DNA codon to encode the library design, followed by small-scale ligation (1-20 nmol) of the remaining regions needed for the amplifiable shot.