Quantification and Mapping of Alkylation in the Human Genome Reveal Single Nucleotide Resolution Precursors of Mutational Signatures

Chemical modifications to DNA bases, including DNA adducts arising from reactions with electrophilic chemicals, are well-known to impact cell growth, miscode during replication, and influence disease etiology. However, knowledge of how genomic sequences and structures influence the accumulation of alkylated DNA bases is not broadly characterized with high resolution, nor have these patterns been linked with overall quantities of modified bases in the genome. For benzo(a) pyrene (BaP), a ubiquitous environmental carcinogen, we developed a single-nucleotide resolution damage sequencing method to map in a human lung cell line the main mutagenic adduct arising from BaP. Furthermore, we combined this analysis with quantitative mass spectrometry to evaluate the dose–response profile of adduct formation. By comparing damage abundance with DNase hypersensitive sites, transcription levels, and other genome annotation data, we found that although overall adduct levels rose with increasing chemical exposure concentration, genomic distribution patterns consistently correlated with chromatin state and transcriptional status. Moreover, due to the single nucleotide resolution characteristics of this DNA damage map, we could determine preferred DNA triad sequence contexts for alkylation accumulation, revealing a characteristic DNA damage signature. This new BaP damage signature had a profile highly similar to mutational signatures identified previously in lung cancer genomes from smokers. Thus, these data provide insight on how genomic features shape the accumulation of alkylation products in the genome and predictive strategies for linking single-nucleotide resolution in vitro damage maps with human cancer mutations.

S3 °C, 800 rpm, and purified by ethanol precipitation. A total of 10 µg of purified DNA was used for N 2 -BPDE-dG quantification using LC-MS/MS, 1.5 µg was used for damage sequencing (described in N 2 -BPDE-dG-Damage-seq library preparation).

Total RNA extraction
Total RNA was extracted from BEAS-2B cells using TRIzol RNA Isolation Reagents (Thermo Fisher Scientific, Waltham, MA, United States). 1.3-1.8 × 10 5 cells were lysed for each extraction and homogenized in 1 mL of TRI-reagent. For each lysis, 0.2 mL chloroform per 1 mL TRI-reagent was added to the sample and vortexed for 15 s. The samples were incubated for 3 min, prior to centrifugation at 12,000 x g for 15 min at 4 °C. The aqueous phase was transferred to a new 1.5 mL tube and 0.6 mL of isopropanol per 1 mL of TRIzol reagent was added and mixed gently by inverting the tube several times. The samples were incubated at ambient temperature for 10 min, prior to centrifugation at 12,000 x g for 10 min at 4 °C and the supernatant was discarded. The RNA pellets were washed with 75% (v/v) ethanol and centrifuged at 7,500 x g for 10 min at 4 °C.
The supernatant was removed and the RNA was air dried and resuspended in an appropriate amount of RNase-free water. Samples were flash-frozen in liquid nitrogen and stored at -80 °C.
From the remaining solution, modified nucleosides were enriched by butanol extraction, by a procedure adapted from Klaene et al. (8). Per 450 µL sample, 150 µL of water-saturated butanol S4 (prepared by mixing 3 mL butanol with 300 µL Milli-Q water) was added. After thoroughly vortexing, the sample was centrifuged at 3,000 x g for 1 min. The upper butanol layer was transferred to a 1.5 mL DNA LoBind tube (Eppendorf). A second extraction was performed on the same sample by repeating the previous steps. The butanol layers were combined in the same LoBind tube, and additional Milli-Q water (300 µL) was added. The mixture was thoroughly vortexed and centrifuged at 3,000 x g for 1 min. The upper layer was transferred to a vial for high-performance liquid chromatography (HPLC) and evaporated to dryness using miVac Centrifugal Concentrator (Genevac, Ipswich, United Kingdom). The resulting residue was resuspended in 20 μL 50% MeOH for further mass spectrometric analysis.

Quantification of dG in isolated DNA
The dG content of enzymatically digested DNA samples was determined by HPLC. Nucleosides were separated with an Agilent 1100 or 1200 series HPLC system (Agilent, Santa Clara, CA, United States) equipped with a Phenomenex Kinetex 2.6 µm C18 100 Å column 150 x 2.1 mm (00F-4462-AN, Phenomenex, Torrance, CA, United States). The injection volume was 20 μL and the nucleotides were detected at 254 nm. Compounds were eluted using solvent A (3% acetonitrile in water) and solvent B (acetonitrile). A gradient elution was applied at a flow rate of 200 μL/min: 0%-15% B (0-6.0 min), 15%-80% B (6.0-6.1 min), 80% B (6.1-11.0 min), 80%-0% B (11.0-11.2 min), followed by re-equilibrium for 15 min. Calibration curves were produced from the analysis of dG in water using six calibration points in the range of 0.1-50 μM. The dG concentration from the 20 μL aliquots of the digestion mix (see DNA digestion) was determined from the calibration curve, and it was used to determine the total number of nucleotides (#nt) in each sample. The calculation assumed a GC content of 41% in the human genome according to the following equation in which Vtot is the total volume of the digestion mix in L described in Eq. 1:

Q5 polymerase extension assay
A BPDE-modified 24mer oligo at a single G (Table S2)

BPDE-dG-Damage-seq library preparation
Oligonucleotides used in library preparation are listed in Table S2 and a schematic of the library preparation is shown in Figure S13.   were considered, and the counts were strand-combined. Counts were summed per region, and methylation levels corresponding to the ratio between methylated and total events were presented as a 0 to 1 range, with 0 being a fully unmethylated state (methyl counts = 0) and 1 being a fully methylated state (methyl counts = total counts). LMRs (low methylated regions, including non-and partly methylated regions) were identified using methylSeekR package (11) with default parameters. To simplify the analysis, the LMRs from all the samples were combined in unified catalogs (merged LMRs), respectively, whereby elements with more than 60% overlap were merged to reduce redundancy. FMR (fully methylated regions) were regions not being classified as LMR. Methylation levels of the unified sets of LMRs were quantified separately for every replicate and used to compute differential methylation between experimental groups.

N 2 -BPDE-dG sequencing data analysis
Data quality was checked using FastQC, and BBMap (12) was used to filter common adaptor contaminations and the reads containing the AD1B sequence: 5′-GAC-TGG-TTC-CAA-TTG-AAA-GTG-CTC-TTC-CGA-TCT-3′. The paired-end reads were mapped to the reference human genome hg38 using Burrows-Wheeler Aligner (13). Reads were not filtered based on their mapping score.
Unmapped and duplicated reads were removed using Samtools (14) and Picard. The reads mapped to the ENCODE Blacklist (15) were removed, then the damage sites were extracted and counted in desired bins using custom scripts and bedtools (16). Further analysis was performed using R 4.1.0 and Bioconductor 3.1.2. The analysis window size was evaluated using NGSopwin (17). Scaling factors calculated using "DESeq2" (18) were applied to samples before analysis.
Aside from the analyses of nucleotide enrichment at damage sites and trinucleotide enrichment around damage sites, we considered only those reads that indicated a G at the -1 position.    Variables including coverage of genomic features per bin, BPDE+ (site count of 2 uM exposed cell, highlighted by blue dash lines), and relative abundance (2 uM BPDE-exposed cells, highlighted by green dash lines). (B) Boxplot showing the damage distribution on non-transcribed strand (NTS) and transcribed strand (TS) of high expressed genes (n= 10,612) from all exposed samples. Data represent the mean calculated from three replicates in a bin size of 25 bp. The significant test was calculated using paired t-test.