Biofoundry-Scale DNA Assembly Validation Using Cost-Effective High-Throughput Long-Read Sequencing

Biofoundries are automated high-throughput facilities specializing in the design, construction, and testing of engineered/synthetic DNA constructs (plasmids), often from genetic parts. A critical step of this process is assessing the fidelity of the assembled DNA construct to the desired design. Current methods utilized for this purpose are restriction digest or PCR followed by fragment analysis and sequencing. The Edinburgh Genome Foundry (EGF) has recently established a single-molecule sequencing quality control step using the Oxford Nanopore sequencing technology, along with a companion Nextflow pipeline and a Python package, to perform in-depth analysis and generate a detailed report. Our software enables researchers working with plasmids, including biofoundry scientists, to rapidly analyze and interpret sequencing data. In conclusion, we have created a laboratory and software protocol that validates assembled, cloned, or edited plasmids, using Nanopore long-reads, which can serve as a useful resource for the genetics, synthetic biology, and sequencing communities.


■ INTRODUCTION
A critical step of the Design Build Test Learn (DBTL) cycle widely adopted in synthetic biology 1 is verifying the fidelity of the DNA construct, obtained in the Build phase, to the designed DNA sequence.Several factors may lead to erroneous DNA constructs, including incorrect input DNA, problems during assembly or sample handling, misannealing overhangs, addition of point mutations, and homologous or other forms of recombination.Currently, the verification step relies mostly on a restriction digest or PCR followed by fragment analysis (FA) and sequencing. 2 FA provides a cost-efficient but indirect, low confidence confirmation of construct correctness by checking the presence of specific restriction enzyme recognition sites and the fragment size.Conversely, DNA sequencing approaches provide a nucleotide-level readout at the expense of substantially higher cost.Sanger sequencing, for example, is not feasible in most cases due to the cost and high number of reactions, but it may be useful for verifying targeted regions or in small batches of similar constructs assembled from shared genetic parts. 2Second-and third-generation sequencing methods provide a solution to sequencing large batches of plasmid constructs due to their high-throughput and no requirement for using primers.In this Technical Note, we describe a single-molecule sequencing DNA assembly quality control solution at the Edinburgh Genome Foundry (EGF) that can be utilized by biologists and the sequencing community.EGF is an automated high-throughput facility (biofoundry) specializing in the modular assembly of DNA constructs (plasmids), using Golden Gate cloning.EGF's platform is species agnostic, and its outputs are used in projects as diverse as programming of stem cells for personalized medicine applications, vaccine development, gene therapy, and many more.

■ RESULTS AND DISCUSSION
The DNA verification by sequencing step at the end of the Build phase of a synthetic biology (engineering) project aims to compare the fidelity of an assembled DNA construct to its corresponding designed sequence.Here, we present a one-step software pipeline that facilitates and expedites analysis and interpretation of sequencing data (Figure 1).Specifically, we wanted to obtain an annotated comparison of the sequenced and the expected DNA, and a judgment call for each sample, based on various checks.
We hereby refer to errors as differences (or mutations or variants) between the designed and assembled DNA sequence.These errors include single nucleotide variants (SNV), small insertions and deletions (indels), structural variations (SV), and sequencing errors. 3Although random sequencing errors can be mitigated with increased sequencing depth, systematic errors are much harder to avoid; here, we focused on SNVs and SVs.
We applied automated protocols for the Oxford Nanopore Technologies (ONT) rapid barcoding kits to simultaneously fragment each sample (plasmid or DNA construct) and ligate individual barcodes.Up to 96 barcoded samples are pooled into a single library and loaded onto Flongle flow cells in a MinION Mk1C sequencer (see Methods section).A Nextflow 4 pipeline, named Sequeduct, has been created to perform alignment, variant detection, and reporting (https://github.com/Edinburgh-Genome-Foundry/Sequeduct/).This pipeline requires the FASTQ files of the partial and full length reads obtained from sequencing, the reference Genbank files of the designed sequences, and a sample sheet.The generated PDF report contains a chapter for each barcode: read statistics are followed by a histogram of the read lengths, which is a good indicator of the presence of structural errors.A displayed coverage chart with the reference sequence visualizes any deletions (Supporting Information S1).Visualizing insertions in a similar plot is not straightforward, therefore a cumulative plot of the longest unaligned interval is provided as an indicator of large insertions.A variant (error) table is also provided in a simplified variant call format (VCF), 5 which lists point and few nucleotide variants but not structural variants such as large deletions or insertions.Variants at homopolymer stretches are flagged as this is a known systemic sequencing error. 3Variants are also annotated on the reference sequence on a second plot for an easy overview and navigation.Based on the above results, each of the plasmids with a sufficient number of reads is assigned a pass/fail outcome, as detailed in the Methods.A summary spreadsheet of the results, based on a review of the report, is also created.This can be revised by the user.A detailed guide on the interpretation of the report is published online (https://github.com/Edinburgh-Genome-Foundry/Sequeduct_demo).The pipeline also returns the variant call consensus FASTA sequence for each barcode.
If structural variants are found, then a subsequent task is to describe their nature and provide an explanation.This is largely beyond the scope of validation, but a second pipeline which aligns user-specified DNA part sequences against a de novo plasmid sequence assembled from the reads, using Canu, 6 is also provided.Alignments are reported and visualized in a PDF file (Supporting Information S2).This is useful for evaluating plasmids (or other DNA) that are assembled from parts using various Golden Gate toolkits, such as EMMA, 7 MoClo, 8 or Mobius, 9 or any other method, such as Gibson 10 assembly, and helps clarify whether part or sample mix-ups, recombination events, or overhang misannealing has occurred.
Multiple alternative approaches have been published by Oxford Nanopore Technologies and other research laboratories.The EPI2ME Clone validation workflow uses de novo assembly to produce a FASTA file for each sample. 11The SequenceGenie workflow analyzes data from a multiplexed sequencing approach using a novel sample barcoding system. 12he MinION Plasmid Sequence Verification Pipeline provides a cost-effective way of sequencing plasmids for clinical research applications. 13Circuit-seq creates de novo assemblies from multiplexed samples, 14 while OnRamp is reference-based. 15In comparison to these, Sequeduct performs an evaluation against an expected sequence and focuses on the produced report and downstream interpretation of results, which are more suitable for engineering biology and quality control purposes.
Several companies provide a Nanopore sequencing service of plasmids for a fee (∼$15/plasmid).Performing the sequencing in-house is cost-competitive and fast, provided that at least 24 samples are sequenced at the same time.In any case, Sequeduct is free software and can be used with FASTQ data from sequencing providers in order to generate a more detailed and targeted report.This addresses a general need of quickly interpreting sequencing verification results and helps researchers routinely verify a received plasmid, an introduced mutation, the results of a cloning experiment, or DNA assembled from parts.
Ongoing development aims to incorporate additional functionalities and improvements.These include deconvolut- ing mixed samples, where multiple plasmids are in the same sample or use the same barcode.This would allow sequencing polyclonal ("polyploid") samples or combinatorial assemblies or pooling multiple plasmids into the same barcode.Similarly, we plan to address more error scenarios in the pipeline as we accumulate sequencing results.We anticipate that analysis of synthetic design outcomes using full length sequence results will lead to the establishment of more robust DNA design rules.For example, analysis of a collection of plasmid sequencing data can point to sequence patterns that interfere with cloning or cause recombination.Alternatively, a review of the results can help find problematic components in a project so that these can be avoided in the next round of design.
In conclusion, we have set up a complete solution� consisting of laboratory and software protocols�for the validation of assembled, cloned, or edited plasmids and other DNA, using long-reads.The software is available under a free and open-source license (GPLv3) to encourage contributions and feedback from biofoundries and the sequencing community.

■ METHODS
Sequencing.Plasmid DNA is prepared using the Wizard SV 96 Plasmid DNA Purification System (Promega), and the concentration is measured by fluorescence-based quantification (Qubit dsDNA BR Assay Kit, Thermo Scientific).Samples are normalized to be within the 20−90 fmol/μL range, and 1 μL of sample is used for library preparation.The protocols for the ONT Rapid Barcoding Kit (SQK-RBK004) or Rapid Barcoding Kit 96 (SQK-RBK110.96)are performed following the manufacturer's instructions on an Opentrons OT-2 (SQK-RBK004) or Tecan Freedom EVO200 (SQK-RBK110.96)liquid handling robot.Libraries are loaded onto Flongle flow cells (R9.4.1) and run for up to 24 h on a MinION Mk1C device that performs basecalling using Guppy v4.3.4.
Analysis.The "pass" folder of the FASTQ sequencing data is used in the analysis.The pipeline is written in Nextflow and is available on GitHub with documentation and an example data set at https://github.com/Edinburgh-Genome-Foundry/Sequeduct_demo.The first workflow ("preview") generates summary plots of each barcode for an overview of the sequencing run, using NanoPlot, part of the NanoPack suite. 16he second workflow ("analysis") filters FASTQ files using NanoFilt followed by read alignment to the reference sequence using minimap2. 17The reference sequence files can be created with a sequence editor or batch simulated using DNA Cauldron. 18SAMtools 19 is used to obtain coverage data, and variants are called with freebayes. 20Consensus sequence files are created with BCFtools. 21As part of the pipeline, a Python package, Ediacara, was also written to generate a report PDF that visualizes results for each barcode/plasmid (https:// github.com/Edinburgh-Genome-Foundry/Ediacara).The package utilizes Biopython 22 and DNA Features Viewer. 23ach plasmid is assigned an outcome: samples with an insufficient number of reads (sequencing problems) resulting below 30× coverage are marked as "low coverage".The remaining samples are marked "fail" if problems are detected: zero coverage sections, a majority of reads having an unaligned (insert) segment, or a consensus sequence length outside tolerance.The "warning" label is applied for cases where the errors are below the set threshold levels.All other samples are marked as "pass".An explanation of the report is also included in its Appendix section.In addition to the pipeline, a Dockerfile is also provided to generate a Docker image with all of the required software.

* sı Supporting Information
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acssynbio.3c00589.Supporting Information S1: example PDF report of the "analysis" pipeline; Supporting Information S2: example PDF report of the "review" pipeline (PDF) ■

Figure 1 .
Figure 1.An overview of the sequencing pipeline.(A) The assembled or cloned plasmids are prepared into libraries using a liquid handling platform.Libraries are loaded onto a Flongle flow cell in an Oxford Nanopore Mk1C sequencer.(B) The FASTQ files are analyzed with a Nextflow pipeline (Sequeduct) that utilizes a Docker container with all required software.(C) The Ediacara Python package creates a PDF report for an easy overview and interpretation of the results.