Next-generation sequencing (NGS) is a powerful tool for genomic characterization that is limited by the high error rate of the platform (~0.5–2.0%). We describe our methods of error-corrected sequencing that allow us to obviate the NGS error rate and detect mutations at variant allele fractions as rare as 0.0001.
Conventional next-generation sequencing techniques (NGS) have allowed for immense genomic characterization for over a decade. Specifically, NGS has been used to analyze the spectrum of clonal mutations in malignancy. Though far more efficient than traditional Sanger methods, NGS struggles with identifying rare clonal and subclonal mutations due to its high error rate of ~0.5–2.0%. Thus, standard NGS has a limit of detection for mutations that are >0.02 variant allele fraction (VAF). While the clinical significance for mutations this rare in patients without known disease remains unclear, patients treated for leukemia have significantly improved outcomes when residual disease is <0.0001 by flow cytometry. In order to mitigate this artefactual background of NGS, numerous methods have been developed. Here we describe a method for Error-corrected DNA and RNA Sequencing (ECS), which involves tagging individual molecules with both a 16 bp random index for error-correction and an 8 bp patient-specific index for multiplexing. Our method can detect and track clonal mutations at variant allele fractions (VAFs) two orders of magnitude lower than the detection limit of NGS and as rare as 0.0001 VAF.
As we age, exposure to mutagens and stochastic errors during cell division result in the accumulation of somatic aberrations in the genome, and this underlies the fundamental pathogenesis of malignant transformation, neuro-developmental diseases, pediatric disorders and normal aging1,2. Somatic mutations with disease-driving potential are important diagnostic and prognostic biomarkers for early detection and risk management3,4,5. In order to better understand physiologic clonogenesis, which will inform clinical and research decisions, the accurate quantification and characterization of these mutations is of primary importance. Next-generation sequencing (NGS) is currently used to study clonal mutations in heterogeneous DNA samples; however, NGS is limited to identifying mutations at >0.02 variant allele fraction (VAF) — due to the inherent error-rate of 0.5–2.0% of the sequencing platforms6,7,8. As a result, tracking diagnostically and prognostically significant somatic variants at lower VAF cannot be achieved using standard NGS.
Recently, various methods have been developed in order to circumvent the error rate of NGS8,9,10,11. These methods utilize molecular tagging, which enables error correction after sequencing. Each molecule or genomic fragment in the sequencing library is tagged with a random Unique Molecular Identifier (UMI) that is specific to that molecule. The UMIs are constructed by permutations of a string of randomized nucleotides (8–16 N). A second sample-specific barcode is also integrated into the workflow that enables multiplexing multiple samples into the same NGS sequencing run. PCR amplification is performed on the molecularly tagged library, and subsequently the library is sent for sequencing. During library preparation, it is expected that errors will be randomly introduced to the genomic fragment during PCR amplification and sequencing8. To remove random sequencing errors, raw sequencing reads are grouped according to the UMI. Artifacts from sequencing are not expected to be present in all reads with the same UMI at the same genomic position due to the stochastic nature of introduction, whereas a true variant will be faithfully amplified and sequenced in all reads that share the same UMI. The artifacts are bioinformatically removed. Here, we describe three methods of Error-corrected Sequencing (ECS) optimized in the laboratory for DNA to identify single nucleotide variants (SNVs) and small insertion-deletions (Indels), and for RNA to facilitate quantification of gene expression below the NGS error threshold.
The first method describes a way to look for rare somatic event using gene specific primers designed by researchers. Prior to library preparation, researchers should design primers to target the fragments of interest. We used the web-app Primer3 (http://bioinfo.ut.ee/primer3-0.4.0/). Amplicons of 200–250 bp are ideal for polymerase chain reaction (PCR) as these will, once UMIs have been incorporated, generate overlapping paired-end reads with 150 bp paired-end reads. The optimal primer design conditions to be used are: Minimum primer size = 19; Optimum primer size = 25; Maximum primer size = 30; Minimum Tm = 64 °C; Optimum Tm = 70 °C; Maximum Tm = 74 °C; Maximum Tm difference = 5 °C; Minimum GC content = 45; Maximum GC content = 80; Number to return = 20; Maximum 3' end stability = 100.
In Method 2, we describe a method combining the ECS-DNA protocol with Illumina chemistry to survey for clonal SNVs and small Indels as rare as 0.0001 VAF using commercially available gene panels that include hundreds of amplicons. We have used the TruSight Myeloid Sequencing Panel (Illumina) for our experiment and designed an expanded panel to include additional genes of interest for pediatric myeloid diseases. These panels have not offered unique molecular identifiers (UMIs) that would facilitate error correction, so we have added our own adapter strategy to these panels. ECS should work equally well with any of other panels designed to enrich for genes associated with different diseases. Following DNA isolation and subsequent quantification from the tissue or sample of interest, it is recommended to have at least 500 ng of stock DNA per specimen. We routinely make a single sequencing library using 250 ng of DNA in order to capture as much unique genomic fragment as possible for downstream reads de-duplication and VAF calculation. An optional replicate sequencing library can be made with the remaining 250 ng of DNA. We always make two replicate libraries per specimen, and we consider only those events detected independently in both replicates as true positives. We also implemented a genomic position-specific binomial error model to increase the accuracy of variant calling4,13.
Lastly, we describe a method coupling ECS to RNA sequencing for transcript quantification using off-the-shelf QIAseq Targeted RNA panels (Qiagen). The UMIs required for de-duplication and error correction have been incorporated in the kits, and researchers can make libraries following manufacturer's recommendations. Bioinformatically, researchers can follow the pipeline outlined for ECS-DNA, which will be explained in detail in the PROTOCOL section.
1. Targeted Error-corrected Sequencing for DNA
2. Gene Panels with Error-corrected Sequencing of DNA
Table 1: Example demonstrating the way to construct a position-specific binomial error model.
3. Error-corrected Sequencing of RNA
With Targeted Error-Corrected Sequencing for DNA, we have performed a proof of principle experiment diluting mutant patient DNA in commercial genomic DNA. The patient had a mutation in GATA1 (chrX:48650264, C>G) with original VAF of 0.19. We demonstrate in Figure 1 that ECS is quantitative to a level of 1:10,000 for single nucleotide variant.
Figure 1: Dilution series of GATA1 SNV demonstrating that ECS is quantitative to the level of 1:10,000. Please click here to view a larger version of this figure.
We also show that the ECS-DNA reliably detects rare clonal mutations in genes recurrently in adult acute myeloid leukemia (AML) in healthy elderly individuals4. We obtained buffy coat samples from 20 healthy individuals in the Nurse's Health Study banked roughly ~10 years apart. We applied the ECS-DNA panel protocol on these samples. For this experiment, we adapted the Illumina TruSight Myeloid Sequencing Panel that consists of 568 amplicons (more information on gene list on https://www.illumina.com/products/by-type/clinical-research-products/trusight-myeloid.html) and sequenced 80 libraries from 20 individuals (2 collections at different time points, 2 replicates per individual per time point) using Illumina NextSeq platform, which generated an average of 47.7 million paired-end reads and an average of 3.4 million error-corrected consensus sequences per library4. The mean nucleotide coverage per library was roughly 6,000x (3.4 millions divided by 568). For each sample, we constructed a position-specific error profile using sequenced libraries that are not from the same sample. We found 109 clonal somatic mutations that were present in both replicates of at least one collection time point. These mutations have VAF ranging from 0.0003–0.1451. We selected 21 mutations with known COSMIC representations, and validated all 21 mutations in one or two collection time point(s) using ddPCR (n = 34, Figure 2, adapted from Young et al. 20164).
Figure 2: Mutations identified by ECS were verified via ddPCR with highly concordant VAFs. (n=34, modified from Young et al. 20164). Please click here to view a larger version of this figure.
With respect to error-corrected expression level using ECS-RNA protocol, we customized a gene panel using QIAseq chemistry that consists of 416 genes known to be associated with various cancers (adapted from QIAseq Human Cancer Transcriptome panel), and we amplified the most commonly expressed exon of a given gene (Gene list in Supplementary Material 1). We sequenced the libraries using Illumina MiSeq platform in paired-end format that gave an average of 8.3 million reads per library, and we managed to capture an average of 0.417 million error-corrected consensus sequences. We showed that the expression level of low abundance transcript (<1,000 transcript count in 50 ng of total RNA) is highly reproducible between replicates (data point n = 300, Figure 3). Validation by ddPCR (six selected genes of varying degree of expression) demonstrated that the expression level of genes had been correctly captured by the ECS protocol without the need for normalization.
Figure 3: Top, correlation of transcript counts by ECS-RNA between replicates of the same sample (n = 300). Bottom, transcript counts identified by ECS were verified by ddPCR (n = 6). Please click here to view a larger version of this figure.
Here, we demonstrate a suite of error-corrected sequencing protocols that can be easily implemented to study mutations with low VAFs in different diseases. The most important factor is the incorporation of UMIs with each molecule before sequencing as they enable error-correction of the raw reads. The methods described here allow researchers to incorporate customized UMIs to both commercially available gene panels and self-designed gene-specific oligos.
Standard NGS protocol precludes the detection of mutations with VAF below 2% due to the sequencing error rate, and this limits the application of NGS in studies where the detection of rare variants is crucial. By circumventing the standard NGS error rate, ECS enables sensitive detection of these raw variants. For instance, detection of pathogenic mutations when these mutations first arise (therefore having low VAF) is imperative to inform early intervention of the disease14,15. In leukemia research, the detection of minimal residual disease (residual leukemic cells post-treatment) informs risk stratification and could be used to inform treatment options in a manner that binary flow cytometric assessments cannot. In addition, ECS is applicable to detect circulating tumor nucleic acid and to evaluate metastatic potential in solid tumor patients by assessing for the presence/absence as well as the variant burden of certain mutations that are characteristics of the primary tumor16.
As demonstrated in Table 1, the power of using binomial distribution-based position-specific error model to call variants depends largely on the number of sequenced libraries as well as the depth of sequencing used to build the error model. The robustness of the error model increases with higher number of samples and more sequencing depth. It is recommended to use at least 10 sequenced samples with an average of error-corrected read coverage of 3000x per sample to build an error profile for each sample. The position-specific approach is similar to MAGERI, but instead of using an aggregate error rate for all six different substitution types (A>C/T>G, A>G/T>C, A>T/T>A, C>A/G>T, C>G/G>C, C>T/G>A)13, we model each substitution independently at every position. For instance, an error rate of C>T at a given genomic position is different from another position. Our approach also takes into account a sequencing batch effect, as the base substitution rate observed in one sequencing run might be different from another run. Hence it is important to model each position for all substitution types especially when samples from different sequencing runs are pooled to build the model.
An important consideration when designing an ECS experiment is the desired detection threshold. The beauty of NGS studies is that they can be easily scaled in terms of genes/targets of interest, detection threshold (dictated by depth of sequencing), and number of individuals queried. For example, if the researchers are interested to find rare mutations in two amplicons with a detection threshold of 0.0001, they can pool maximally 75 samples in a single sequencing run using MiSeq V2 chemistry which outputs up to 15 million reads (2 amplicons * 10,000 molecules * 10 reads for error-correction * 75 samples = 15 million sequencing reads). Researchers can vary the number of molecules going into sequencing or the number of pooled samples in a single sequencing run to adjust the detection threshold. In our studies, we aimed to find mutations with a detection threshold of 0.0001 VAF (1:10,000) using the Illumina gene panel. We routinely use 250 ng of starting DNA to ensure that sufficient molecules are captured in order to achieve the aforementioned detection threshold. Researchers can opt to start with lower amount of DNA (50 ng is recommended) if the desired detection limit is >0.001 VAF.
As the UMIs are appended onto the i5 indexes, sequencing settings have to be amended accordingly. For example, we used 16 N UMIs, and the sequencing settings were 2×144 paired end reads, 8 cycles of Index 1 and 16 cycles of index 2 as opposed to the usual 8 cycles of Index 2. The increase in Index 2 cycle is compensated by a decrease in the total number of cycles allocated to the reads. If researchers opt to use 12N UMIs10,17, the settings should be changed to 12 cycles of Index 2.
This UMI-based sequencing method is optimized to correct for sequencing errors. It remains suboptimal in dealing with PCR jackpotting, which is an issue for all amplification-based method. We performed rounds of post-sequencing and post-bioinformatics validation using ddPCR, and we hardly detect any false positives due to PCR jackpotting. Nonetheless, it is recommended that researchers conduct the experiments using high fidelity polymerase to ensure low amplification errors.
The authors have nothing to disclose.
We thank the participants in the Children's Oncology Group AAML1531 study and the Nurses' Health Study for their contributions in the form of patient samples. This work was funded by the National Institutes of Health (UM1 CA186107, RO1 CA49449 and RO1 CA149445), the Children's Discovery Institute of Washington University and St. Louis Children's Hospital (MC-II-2015-461), and Eli Seth Matthews Leukemia Foundation.
Q5 High Fidelity Hot Start Master Mix | New England BioLabs | M0492S | |
Agencourt AMPure XP | Beckman Coulter | A63880 | |
Qubit dsDNA HS Assay Kit | Thermo Fisher Scientific | Q32854 | |
SYBR Safe DNA Gel Stain | Thermo Fisher Scientific | S33102 | |
Truseq Custom Amplicon Index Kit | Illumina | FC-130-1003 | |
UMI i5 adapter sequences | Integrated DNA Technologies | – | |
NEBNext Ultra End Repair/dA-Tailing Module | New England BioLabs | E7442S | |
NEBNext Ultra II Ligation Module | New England BioLabs | E7595S | |
QX200 ddPCR EvaGreen Supermix | Bio-Rad | 1864034 | |
QX200 Droplet Generation Oil for EvaGreen | Bio-Rad | 1864005 | |
QX200 Droplet Digital PCR System | Bio-Rad | 1864001 | |
ddPCR 96-Well Plates | Bio-Rad | 12001925 | |
DG8 Cartridges for QX200/QX100 Droplet Generator | Bio-Rad | 1864008 | |
DG8 Gaskets for QX200/QX100 Droplet Generator | Bio-Rad | 1863009 | |
Bioanalyzer | Agilent Genomics | G2939BA | |
TapeStation | Agilent Genomics | G2991AA | |
TruSight Myeloid Sequencing Panel | Illumina | FC-130-1010 | |
Bowtie 2 | Johns Hopkins University | – | |
Customized QIAseq Targeted RNA Panel | Qiagen | – | |
Rneasy Plus Mini Kit (50) | Qiagen | 74134 |