Long-read sequences greatly facilitate the assembly of complex genomes and the characterization of structural variation. We describe a method to generate ultra-long sequences using nanopore-based sequencing platforms. The approach combines an optimized DNA extraction with modified library preparations to generate reads hundreds of kilobases long with moderate coverage from human cells.
Third-generation single-molecule DNA sequencing technologies offer significantly longer read lengths that can facilitate the assembly of complex genomes and the analysis of complex structural variants. Nanopore platforms perform single-molecule sequencing by directly measuring the current changes mediated by DNA passage through the pores and can generate reads hundreds of kilobases (kb) long with minimal capital cost. This platform has been adopted by many researchers for a variety of applications. Achieving longer sequencing read lengths is the most critical factor in leveraging the value of nanopore sequencing platforms. To generate ultra-long reads, special consideration is required to avoid DNA breakage and to improve the efficiency of generating productive sequencing templates. Here, we provide a detailed protocol for ultra-long DNA sequencing, including high molecular weight (HMW) DNA extraction from fresh or frozen cells, library construction by mechanical shearing or transposase fragmentation, and sequencing on a nanopore device. From 20-25 µg of HMW DNA, the method can achieve an N50 read length of 50-70 kb with mechanical shearing and an N50 read length of 90-100 kb with transposase-mediated fragmentation. The protocol can be applied to DNA extracted from mammalian cells to perform whole genome sequencing for the detection of structural variants and genome assembly. Additional improvements in the DNA extraction and enzymatic reactions will further increase the read length and expand its utility.
Over the past decade, massively parallel and highly accurate second-generation high-throughput sequencing technologies have driven an explosion of biomedical discovery and technological innovation1,2,3. Despite the technical advances, the short-read data generated by the second-generation platforms are ineffective in resolving complex genomic regions and are limited in the detection of genomic structural variants (SVs), which play important roles in human evolution and diseases4,5. Furthermore, short-read data are unable to resolve repeat variation and are unsuitable for discerning haplotype phasing of genetic variants6.
Recent progress in single-molecule sequencing offers significantly longer read length, which can facilitate the detection of the full spectrum of SVs7,8,9, and offers accurate and complete assembly of complex microbial and mammalian genomes6,10. The nanopore platform performs single-molecule sequencing by directly measuring the current changes mediated by DNA passage through the pores11,12,13. Unlike any existing DNA sequencing chemistry, nanopore sequencing can generate long (tens to thousands of kilobases) reads in real-time without relying on polymerase kinetics or artificial amplification of the DNA sample. Therefore, nanopore long-read sequencing (NLR-seq) holds great promise for generating ultra-long read lengths well beyond 100 kb, which would greatly advance genomic and biomedical analyses14, particularly in the low-complexity or repeat-rich regions of the genomes15.
The unique feature of nanopore sequencing is its potential to generate long reads without a theoretical length limitation. The read length therefore depends on the physical length of the DNA, which is directly affected by the DNA integrity and the sequencing template quality. Moreover, the quality of the DNA is highly variable, depending on the extent of manipulation and the number of steps involved, such as pipetting forces and extraction conditions. It is therefore challenging to obtain long reads by simply applying standard DNA extraction protocols and the manufacturer-supplied library construction methods. To this end, we have developed a robust method to generate ultra-long read (hundreds of kilobases) sequencing data starting from harvested cell pellets. We adopted multiple improvements in the DNA extraction and library preparation procedures and streamlined the protocol to exclude unnecessary procedures that cause DNA degradation and damage. The protocol comprises high molecular weight (HMW) DNA extraction, ultra-long DNA library construction, and sequencing on a nanopore platform. For a well-trained molecular biologist, it typically takes 6 h from cell harvesting to the completion of HMW DNA extraction, 90 min or 8 h for library construction depending on the shearing method, and up to a further 48 h for DNA sequencing. The use of the protocol will empower the genomics community to improve our understanding of genome complexity and gain new insight into genome variation in human diseases.
NOTE: The NLR-seq protocol consists of three consecutive steps: 1) extraction of high-molecular weight (HMW) genomic DNA; 2) ultra-long DNA library construction, which includes fragmentation of the HMW DNA into the desired sizes and ligation of sequencing adapters to the DNA ends; and 3) loading of the adapter-ligated DNA onto the arrays of nanopores (Figure 1).
1. HMW DNA extraction
2. Ultra-long DNA library construction
NOTE: There are two ways to construct the ultra-long DNA libraries, based on two different shearing methods coupled with nanopore sequencing kits. A mechanical shearing-based library produces data with an N50 of 50-70 kb and takes about 8 h to construct. A transposase fragmentation-based library produces data with an N50 of 90-100 kb and takes only 90 min to construct. With the same DNA input, identical versions of the sequencing adapter, and nanopore flow cells of comparable quality, the mechanical shearing protocol gives a higher yield.
3. Sequencing on the nanopore device
The ultra-long DNA sequencing protocol uses HMW DNA for library construction. Therefore, it is critical to choose well-cultured cells with a viability of >85% at the cell harvesting step. The number of cells used for DNA extraction affects both the quality and the quantity of the HMW DNA. Cell lysis does not work well when starting with too many cells. Using too few cells does not generate enough DNA for library construction because the HMW DNA precipitation is performed using gentle rotation by hand instead of high-speed centrifugation. An example of the HMW DNA after adding ice-cold 100% ethanol and rotating is shown as the white cotton-like precipitate in Figure 2.
It is important to check the quality of the input DNA before beginning the library construction. Degradation, incorrect quantification, contamination (e.g., proteins, RNAs, detergents, surfactant, and residual phenol or ethanol), and low molecular weight DNA can have a significant effect on the subsequent procedures and on the final read length. We recommend performing the QC analysis using DNA from three different locations in the tube containing HMW DNA. In the UV reading results for the HMW DNA, the OD260/OD280 value is approximately 1.9 and the OD260/OD230 value is approximately 2.3 (Figure 3A,B). For a good HMW DNA sample, these ratio values are consistent among the three tests. Different shearing methods require different volumes of input DNA. The concentration of HMW DNA needs to be >200 ng/µL for mechanical shearing and >1 µg/µL for transposase fragmentation. The concentration measured with a fluorometer is slightly lower than that from the UV reading. However, with both the fluorometer and the UV reading assays, the coefficient of variation of the concentration measured for the same HMW DNA sample should be less than 15%. Mechanical shearing uses a syringe with a needle to break the HMW DNA, so the number of passes through the needle affects the size of the sheared DNA and the final read length. We recommend performing size QC after needle shearing to ensure that the majority of the HMW DNA is larger than 50 kb, as illustrated in Figure 4. In the mechanical shearing method, 30 passes generated the best sequencing results considering both read length and output.
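The concentration criteria above (three readings per sample, a coefficient of variation below 15%, and a minimum concentration that depends on the shearing method) can be checked with a few lines of Python. This is only a minimal sketch under stated assumptions: the function names and the example readings are hypothetical, the sample standard deviation is assumed for the coefficient of variation, and the mean of the readings is assumed to be the value compared against the minimum concentration, since the text does not specify either choice.

```python
from statistics import mean, stdev

# Minimum concentrations stated in the text, in ng/uL.
MIN_CONC_NG_PER_UL = {"mechanical": 200.0, "transposase": 1000.0}


def coefficient_of_variation(values):
    """Return the coefficient of variation (sample SD / mean) as a percentage."""
    if len(values) < 2:
        raise ValueError("need at least two readings")
    return 100.0 * stdev(values) / mean(values)


def passes_concentration_qc(readings_ng_per_ul, method, max_cv_percent=15.0):
    """Check replicate readings against the concentration and CV criteria.

    Assumption: the mean reading is compared with the minimum concentration.
    """
    min_conc = MIN_CONC_NG_PER_UL[method]
    cv = coefficient_of_variation(readings_ng_per_ul)
    return mean(readings_ng_per_ul) > min_conc and cv < max_cv_percent


# Hypothetical readings (ng/uL) from three locations in the same tube.
readings = [310.0, 295.0, 330.0]
print(f"CV = {coefficient_of_variation(readings):.1f}%")
print("mechanical shearing:", passes_concentration_qc(readings, "mechanical"))
print("transposase fragmentation:", passes_concentration_qc(readings, "transposase"))
```

With these example readings, the sample would qualify for mechanical shearing but would be too dilute for transposase fragmentation.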
The N50 of a mechanical shearing-based library is 50-70 kb, whereas that of a transposase fragmentation-based library is 90-100 kb. The results of four runs using the HG00733 cell line are shown in Table 1. All four runs have over 2,300 reads longer than 100 kb. The maximum read length is greater in the transposase fragmentation-based libraries (455 kb and 489 kb) than in the mechanical shearing-based libraries (348 kb and 387 kb), while the latter produced more total reads, indicating a higher yield. The transposase fragmentation-based library construction has fewer steps and a shorter preparation time, so it introduces fewer short fragments. The two runs using transposase have a longer mean length (>30 kb) and median length (>10 kb). In addition, the data show consistently high quality in all runs (mean quality score of approximately 10.0, ~90% base accuracy). Approximately 97-98% of the total bases (96.9-98.0%; Table 1) were aligned to the human reference genome (hg19) using Minimap216 with the default settings. The expected size distributions of the raw reads are shown in Figure 5. All runs have a large proportion of data above 50 kb, while the transposase fragmentation-based libraries have a higher ratio of ultra-long reads (e.g., >100 kb). This protocol has been successfully applied in multiple human cell lines (Supplementary Table 1).
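The read-length metrics discussed above (N50, mean, median, maximum, and the number of reads longer than 100 kb) can be computed from a list of read lengths as sketched below. The function names and the toy read lengths are hypothetical and are not taken from the sequencing runs. N50 is computed here with the usual definition: the length L such that reads of length L or longer contain at least half of all sequenced bases.

```python
from statistics import mean, median


def n50(lengths):
    """Return the read length at which the cumulative bases reach half the total."""
    if not lengths:
        raise ValueError("no reads")
    half = sum(lengths) / 2
    running = 0
    # Walk from the longest read down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


def summarize(lengths, long_cutoff=100_000):
    """Summarize read lengths (bp) with the metrics reported per run."""
    return {
        "n50": n50(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "max": max(lengths),
        "n_reads": len(lengths),
        "n_long": sum(1 for x in lengths if x > long_cutoff),
        "total_bases": sum(lengths),
    }


# Toy read lengths in bp (hypothetical, for illustration only).
toy = [2_000, 5_000, 8_000, 40_000, 120_000, 150_000]
stats = summarize(toy)
for key, value in stats.items():
    print(key, value)
```

In this toy example a few long reads carry most of the bases, so the N50 is far above the median, which is the same pattern seen when the N50 and median lengths of a run are compared.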
Figure 1: Schematic overview of the nanopore long-read sequencing (NLR-seq) workflow. Orange, the transposase complex. Yellow-green, the nanopore adapter.
Figure 2: Representative DNA precipitation from phenol-chloroform extraction method. The white arrow indicates the HMW DNA.
Figure 3: Example QC results of the HMW DNA from UV reading. (A) HMW DNA from step 1.21.1 ready for mechanical shearing-based library construction. (B) HMW DNA from step 1.21.2 for transposase fragmentation-based library construction.
Figure 4: QC results of the needle sheared HMW DNA by pulsed-field gel electrophoresis. L1: Quick-load 1 kb DNA ladder; L2: Quick-load 1 kb extend DNA ladder. 1-8: DNA with different passing times through the needle shearing. 1-3, no shearing; 4, 10 times; 5, 20 times; 6, 30 times; 7, 40 times; 8, 50 times. This QC step is optional.
Figure 5: Expected size distributions of the nanopore ultra-long DNA libraries. MS, mechanical shearing-based libraries. TF, transposase fragmentation-based libraries.
Metric | Mechanical shearing_rep1 | Mechanical shearing_rep2 | Transposase fragmentation_rep1 | Transposase fragmentation_rep2 |
Cell line | HG00733 | HG00733 | HG00733 | HG00733 |
N50 of the reads | 55,180 | 63,007 | 98,237 | 95,629 |
Number of reads longer than 100 kb | 2,500 | 3,082 | 2,386 | 2,355 |
Number of total reads | 97,859 | 80,465 | 24,166 | 21,032 |
Maximum length (bp) | 348,482 | 387,113 | 454,660 | 489,426 |
Mean length (bp) | 17,861 | 20,395 | 33,528 | 38,175 |
Median length (bp) | 5,335 | 5,894 | 10,249 | 15,656 |
Mean quality of the reads | 10.0 | 10.1 | 9.9 | 10.0 |
Total bases of raw reads | 1,747,849,822 | 1,641,058,932 | 810,229,733 | 802,886,304 |
Total bases of aligned reads | 1,693,300,832 | 1,607,975,925 | 791,422,077 | 778,417,627 |
Mapped ratio of total bases (hg19, Minimap2) | 96.9% | 98.0% | 97.7% | 97.0% |
Number of active pores | 1225: 480, 402, 254, 89 | 1058: 480, 356, 176, 46 | 958: 452, 328, 148, 30 | 1092: 487, 367, 195, 43 |
Table 1: Performance metrics summary from runs with different shearing protocols.
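Two of the metrics in Table 1 follow from simple arithmetic: the mapped ratio is the aligned bases divided by the raw bases, and a Phred-scaled quality score Q corresponds to an error probability of 10^(-Q/10). The sketch below recomputes both from the Table 1 values. The function and variable names are hypothetical, and converting a mean per-read quality score into a single accuracy figure is only an approximation.

```python
def phred_to_accuracy(q):
    """Convert a Phred-scaled quality score to base accuracy (1 - error probability)."""
    return 1.0 - 10 ** (-q / 10.0)


def mapped_ratio(aligned_bases, raw_bases):
    """Return the percentage of raw bases that were aligned."""
    return 100.0 * aligned_bases / raw_bases


# (aligned bases, raw bases, mean read quality) copied from Table 1.
runs = {
    "Mechanical shearing_rep1": (1_693_300_832, 1_747_849_822, 10.0),
    "Mechanical shearing_rep2": (1_607_975_925, 1_641_058_932, 10.1),
    "Transposase fragmentation_rep1": (791_422_077, 810_229_733, 9.9),
    "Transposase fragmentation_rep2": (778_417_627, 802_886_304, 10.0),
}

for name, (aligned, raw, quality) in runs.items():
    print(
        f"{name}: mapped {mapped_ratio(aligned, raw):.1f}%, "
        f"approx. accuracy {phred_to_accuracy(quality):.1%}"
    )
```

A quality score of 10 corresponds to an error probability of 0.1, that is, about 90% base accuracy.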
Metric | Library 1 | Library 2 |
Cell line | K562 | GM19240 |
Cell Ordering Information | ATCC, cat. No. CCL-243 | Coriell Institute, cat. No. GM19240 |
Protocol | mechanical shearing | mechanical shearing |
N50 of the reads | 60,063 | 55,295 |
Number of total reads | 193,783 | 120,807 |
Median length (bp) | 1,843 | 4,688 |
Mean length (bp) | 9,825 | 17,408 |
Maximum length (bp) | 548,780 | 212,338 |
Total bases of raw reads | 1,903,989,686 | 2,103,015,331 |
Total bases of aligned reads | 1,837,350,047 | 1,997,419,761 |
Mapped ratio of total bases (hg19, Minimap2) | 96.6% | 95.0% |
Number of active pores | 1111: 482, 371, 203, 55 | 1032: 447, 333, 196, 56 |
Supplementary Table 1: Summary of two NLR-seq runs using other cell lines with the mechanical shearing protocol.
In principle, nanopore sequencing is able to generate reads from 100 kb to megabases in length11,12,13. Four major factors affect the performance of the sequencing run and the data quality: 1) the number of active pores and the activity of the pores; 2) the motor protein, which controls the speed of DNA passing through the nanopore; 3) the DNA template (length, purity, quality, mass); and 4) the sequencing adapter ligation efficiency, which determines the usable DNA from the input sample. The first two factors depend on the version of the flow cell and the sequencing kit provided by the manufacturer. The latter two factors are determined by the critical steps in this protocol (HMW DNA extraction, shearing, and ligation).
This protocol requires patience and practice. The quality of HMW DNA is important for ultra-long DNA libraries6. The protocol starts with cells collected at high viability (>85% viable cells preferred), which limits the degraded DNA from dead cells. Any harsh handling that may damage the DNA (e.g., vigorous agitation, shaking, vortexing, repeated pipetting, repeated freezing and thawing) should be avoided. The protocol is designed so that no pipetting is required during the entire DNA extraction process. Wide-bore tips should be used whenever pipetting is necessary after mechanical shearing, during library construction and sequencing. As the nanopores are sensitive to the chemistries in the chamber buffer12, there should be as few residual contaminants (e.g., detergents, surfactants, phenol, ethanol, proteins, RNAs, etc.) as possible in the DNA. Considering both length and yield, the phenol extraction method shows the best and most reproducible results among the different extraction methods tested so far.
Despite the ability of this protocol to produce long-read sequences, several limitations remain. First, this protocol was optimized based on the nanopore sequencing device available at the time of publication; thus, it is specific to that nanopore-based sequencing chemistry and could be suboptimal on other types of long-read sequencing devices. Second, the outcome is highly dependent on the quality of the DNA extracted from the starting material (tissues or cells). Read length will be compromised if the starting DNA is already degraded or damaged. Third, although multiple QC steps are incorporated in the protocol to check the DNA quality, the final yield and length of the reads can be affected by the flow cell and pore activity, which could be variable at this early stage of nanopore sequencing platform development.
The protocol described here uses human suspension cell line samples for DNA extraction. We have optimized the number of passes in needle shearing, the ratio of HMW DNA to transposase, and the ligation time to produce the described results. The protocol can be expanded in four ways. First, users can start with other cultured mammalian cells and with different numbers of cells, tissues, clinical samples, or other organisms. Further optimization of the lysis incubation time, reaction volume, and centrifugation will be needed. Second, it is hard to predict the target size for ultra-long read sequencing. If the read lengths are shorter than expected, users can adjust the number of passes in the mechanical shearing-based method or change the ratio of HMW DNA to transposase in the transposase fragmentation-based method. Longer binding and elution times during cleanup steps are helpful because the HMW DNA is highly viscous. Third, with different nanopore sequencing devices, one can adjust the amount and volume of the DNA to meet the criteria of the sequencer. Fourth, only DNA molecules ligated to sequencing adapters will be sequenced. To further improve ligation efficiency, one can titrate the adapter and ligase concentrations. Modified ligation times and molecular crowding agents such as PEG18 could also be explored in the future. The ultra-long DNA sequencing protocol combined with CRISPR19,20 may offer an effective tool for target enrichment sequencing.
The authors have nothing to disclose.
The authors thank Y. Zhu for her comments on the manuscript. Research reported in this publication was partially supported by the National Cancer Institute of the National Institutes of Health under Award Number P30CA034196. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Name | Company | Catalog Number | Comments |
Reagents | |||
Absolute ethanol | Sigma-Aldrich | E7023 | |
Agencourt AMPure XP beads | Beckman | A63881 | magnetic beads for cleanup |
BD conventional needles | Becton Dickinson | 305136 | 27G, for mechanical shearing |
BD Luer-Lok syringe | Becton Dickinson | 309628 | for mechanical shearing |
Blunt/TA Ligase Master Mix | NEB | M0367S | |
Countess Cell Counting Chamber Slides | Invitrogen | C10228 | for cell counting |
EDTA | Invitrogen | AM9261 | pH 8.0, 0.5 M, 500 mL |
Flow Cell | Oxford Nanopore Technologies | FLO-MIN106 | R9.4.1 |
HG00733 cells | Coriell Institute | HG00733 | cells used in this protocol |
Ligation Sequencing Kit 1D | Oxford Nanopore Technologies | SQK-LSK108 | nanopore ligation kit |
MaXtract High Density tubes | Qiagen | 129073 | gel tubes |
NEBNext FFPE DNA Repair Mix | NEB | M6630S | |
NEBNext Ultra II End Repair/dA-Tailing Module | NEB | M7546S | |
Nuclease-free water | Invitrogen | AM9937 | |
Phosphate-Buffered Saline, PBS | Gibco | 70011044 | 10X, pH 7.4 |
Phenol:chloroform:IAA | Invitrogen | AM9730 | |
Proteinase K | Qiagen | 19131 | 20 mg/mL |
Qubit dsDNA BR Assay Kit | Invitrogen | Q32850 | fluorometer assays for DNA quantification |
Rapid Sequencing Kit | Oxford Nanopore Technologies | SQK-RAD004 | nanopore transposase kit |
RNase A | Qiagen | 19101 | 100 mg/mL |
SDS | Invitrogen | AM9822 | 10% (wt/vol) |
Sodium chloride solution | Invitrogen | AM9759 | 5.0 M |
TE buffer | Invitrogen | AM9849 | pH 8.0 |
Tris | Invitrogen | AM9856 | pH 8.0, 1 M |
Triton X-100 solution | Sigma-Aldrich | 93443 | ~10% |
Equipment | |||
Bio-Rad C1000 Thermal Cycler | Bio-Rad | 1851196EDU | |
Centrifuge 5810R | Eppendorf | 22628180 | |
Countess II FL Automated Cell Counter | Life Technologies | AMQAF1000 | for cell counting |
DynaMag-2 Magnet | Life Technologies | 12321D | magnetic rack |
Eppendorf ThermoMixer | Eppendorf | 5382000023 | for incubation |
Freezer | LabRepCo | LHP-5-UFMB | |
GridION | Oxford Nanopore Technologies | GridION X5 | nanopore device used in this protocol |
HulaMixer Sample Mixer | Thermo Fisher Scientific | 15920D | rotator mixer |
MicroCentrifuge | Benchmark Scientific | C1012 | |
NanoDrop ND-1000 Spectrophotometer | Thermo Fisher Scientific | ND-1000 | for UV reading |
Pippin Pulse | Sage Science | PPI0200 | pulsed-field gel electrophoresis instrument |
Qubit 3.0 Fluorometer | Invitrogen | Q33216 | fluorometer |
Refrigerator | LabRepCo | LABHP-5-URBSS | |
Vortex-Genie 2 | Scientific Industries | SI-A236 | |
Water bath | VWR | 89501-464 |