Pooled DNA sequencing is a fast and cost-effective strategy to detect rare variants associated with complex phenotypes in large cohorts. Here we describe the computational analysis of pooled, next-generation sequencing of 32 cancer-related genes using the SPLINTER software package. This method is scalable, and applicable to any phenotype of interest.
As DNA sequencing technology has markedly advanced in recent years2, it has become increasingly evident that the amount of genetic variation between any two individuals is greater than previously thought3. In contrast, array-based genotyping has failed to identify a significant contribution of common sequence variants to the phenotypic variability of common disease4,5. Taken together, these observations have led to the Common Disease / Rare Variant hypothesis, which suggests that the majority of the “missing heritability” in common and complex phenotypes is instead due to an individual’s personal profile of rare or private DNA variants6-8. However, characterizing how rare variation impacts complex phenotypes requires the analysis of many affected individuals at many genomic loci, ideally compared with a similar survey in an unaffected cohort. Despite the sequencing power offered by today’s platforms, a population-based survey of many genomic loci and the subsequent computational analysis remain prohibitive for many investigators.
To address this need, we have developed a pooled sequencing approach1,9 and a novel software package1 for highly accurate rare variant detection from the resulting data. The ability to pool genomes from entire populations of affected individuals and survey the degree of genetic variation at multiple targeted regions in a single sequencing library provides substantial cost and time savings over traditional single-sample sequencing. With a mean sequencing coverage of 25-fold per allele, our custom algorithm, SPLINTER, uses an internal variant calling control strategy to call insertions, deletions and substitutions up to four base pairs in length with high sensitivity and specificity, detecting as little as 1 mutant allele in a pool of 500 individuals. Here we describe the method for preparing the pooled sequencing library, followed by step-by-step instructions on how to use the SPLINTER package for pooled sequencing analysis (http://www.ibridgenetwork.org/wustl/splinter). We show results from pooled sequencing of 947 individuals at over 20 kb per person, all of whom had also been genotyped by genome-wide array. Concordance between individual genotyping and pooled-sample calls for both tagged and novel variants was excellent. This method can be easily scaled up to any number of genomic loci and any number of individuals. By incorporating the internal positive and negative amplicon controls at ratios that mimic the population under study, the algorithm can be calibrated for optimal performance. This strategy can also be modified for use with hybridization capture or individual-specific barcodes and can be applied to the sequencing of naturally heterogeneous samples, such as tumor DNA.
This method was used in research reported in Vallania FLM et al. Genome Research 2010.
1. Sample Pooling and PCR Capture of Targeted Genomic Loci
2. Pooled PCR Library Preparation and Sequencing
3. Sequencing Reads Alignment and Analysis
4. Rare Variant Detection Using SPLINTER
5. Representative Results
We pooled a population of 947 individuals and targeted over 20 kb for sequencing. We applied SPLINTER for the detection of rare variants following our standard protocol. Each individual had previously been genotyped by genome-wide array. Concordance between individual genotyping and pooled-sample calls for both tagged and novel variants was excellent (Figure 6). Three variants, two of which (rs3822343 and rs3776110) were rare in the population, were called de novo from the sequencing results and were validated by individual pyrosequencing. Minor allele frequencies (MAF) in the pool were similar to the MAF reported in dbSNP build 129. The MAF concordance between pyrosequencing and pooled sequencing was excellent (Table 3).
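The allele frequency concordance described above can be illustrated with a minimal Python sketch. It assumes diploid genotypes coded as minor-allele counts (0, 1 or 2) per individual. The function names and the toy numbers are hypothetical; they are not taken from the study data or from the SPLINTER package.

```python
import math


def maf_from_genotypes(minor_allele_counts):
    """Minor allele frequency from per-individual minor-allele counts.

    Assumes a diploid locus, so each individual contributes two alleles
    and each count is 0, 1 or 2.
    """
    n_alleles = 2 * len(minor_allele_counts)
    return sum(minor_allele_counts) / n_alleles


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy example (made-up values): genotyped MAF per position versus the
# allele frequency estimated from the pooled sequencing data.
genotyped_maf = [0.005, 0.02, 0.11, 0.30, 0.45]
pooled_estimate = [0.006, 0.018, 0.12, 0.29, 0.46]
print(round(pearson_r(genotyped_maf, pooled_estimate), 4))
```

A correlation close to 1 across positions spanning rare and common variants is the kind of agreement summarized in Figure 6 and Table 3.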
Table 1. DNA oligonucleotide sequences for the positive control. Each sequence consists of a DNA fragment differing from the Wild Type reference by either two substitutions or one insertion and one deletion.
Table 2. Example of SPLINTER output. The first two rows represent the standard SPLINTER output for a substitution or a deletion (blue header). The last row represents the standard SPLINTER output for an insertion (purple header).
Table 3. Five known and three novel variants were identified from large populations and validated by individual genotyping. Individual validation was performed by pyrosequencing (rows 1-3), TaqMan assay (rows 4-6) or Sanger sequencing (rows 7,8). For a broad range of allele frequencies and including five positions with MAF <1%, concordance between pooled sequencing allele frequency estimation and individual genotyping was strong. Positions marked with an asterisk (*) are adapted from previously reported data9.
Figure 1. Pooled-DNA sequencing and SPLINTER analysis overview. Patient DNA is pooled and amplified at selected loci. The final PCR products are pooled together with a positive and negative control at equimolar ratios. The pooled mix is then sequenced and the resulting reads are mapped back to their reference. Mapped negative control reads are used to generate a run-specific error model. SPLINTER can then be used to detect rare SNPs and indels by incorporating information from the error model and the positive control. [Adapted from Vallania FLM et al, Genome Research 2010]
Figure 2. Pooled PCR amplicon ligation and sonication. As a demonstration of the ligation and random fragmentation steps in the library preparation protocol, pUC19 vector was enzymatically digested to the fragments shown in lane 2. These fragments were normalized by molecule number, combined and randomly ligated according to step 1.7 above. The resulting large concatamers are shown in lane 3. The ligated concatamers were equally divided and subjected to sonication as described in step 1.8 above. The resulting smears of DNA fragments for the two technical replicates are shown in lanes 4 and 5. The bracket highlights the size range used for gel extraction and sequencing library creation.
Figure 3. Accuracy as a function of coverage for a single allele in a pooled sample. Accuracy is estimated as the Area Under the Curve (AUC) of a Receiver Operating Characteristic (ROC) curve, which ranges from 0.5 (random) to 1.0 (perfect accuracy). AUC is plotted as a function of coverage per allele for the detection of single mutant alleles in pools of 200, 500 and 1000 alleles (A). AUC is plotted as a function of total coverage for substitutions, insertions and deletions (B). [Adapted from Vallania FLM et al, Genome Research 2010].
Figure 4. Error Plot shows the probability of incorporating an erroneous base at a given position. The error profile shows low error rates with an increasing trend toward the 3' end of the sequencing read. Notably, different reference nucleotides display different error probabilities (see for example probability of incorporating a C given a G as reference). [Adapted from Vallania FLM et al, Genome Research 2010].
Figure 5. Accuracy of SPLINTER in estimating allele frequency for positions that had greater than 25-fold coverage per allele. Based on the results in Figure 3, Panel A, which show optimal sensitivity for single variant detection with ≥25-fold coverage, pooled-DNA allele frequencies estimated by SPLINTER were compared with allele counts measured by GWAS. The two are very highly correlated (r = 0.999). [Adapted from Vallania FLM et al, Genome Research 2010].
Figure 6. Comparison of allele frequencies measured by GWAS with SPLINTER estimates from pooled sequencing of 947 individuals. There were 19 positions in common between the genotyped loci and the sequenced regions. The resulting correlation is very high (r = 0.99538).
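The run-specific error model mentioned in Figures 1 and 4 can be illustrated with a short Python sketch. This is a simplified illustration, not the SPLINTER implementation: it assumes ungapped negative-control reads already paired with the matching stretch of reference sequence, and it estimates the probability of observing each base given the reference base at each read position. The function name and the toy reads are hypothetical.

```python
from collections import defaultdict


def build_error_model(alignments):
    """Estimate P(observed base | reference base, read position).

    `alignments` is an iterable of (read, reference) string pairs of equal
    length with no gaps (an assumption made for this sketch). Because the
    negative control carries no true variants, every mismatch is counted
    as a sequencing error.
    """
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for read, reference in alignments:
        if len(read) != len(reference):
            raise ValueError("read and reference must have equal length")
        for position, (observed, ref_base) in enumerate(zip(read, reference)):
            counts[position][ref_base][observed] += 1

    model = {}
    for position, by_ref in counts.items():
        model[position] = {}
        for ref_base, by_observed in by_ref.items():
            total = sum(by_observed.values())
            model[position][ref_base] = {
                observed: n / total for observed, n in by_observed.items()
            }
    return model


# Toy negative-control reads (made up): one read carries a G>C error
# at read position 2.
toy_alignments = [
    ("ACGT", "ACGT"),
    ("ACGT", "ACGT"),
    ("ACCT", "ACGT"),
    ("ACGT", "ACGT"),
]
error_model = build_error_model(toy_alignments)
print(error_model[2]["G"])
```

Keeping the counts separate per read position and per reference base reflects the two features noted in Figure 4: error rates that rise toward the 3' end of the read and that differ by reference nucleotide.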
There is increasing evidence that the incidence and therapeutic response of common, complex phenotypes and diseases such as obesity8, hypercholesterolemia4, hypertension7 and others may be moderated by personal profiles of rare variation. Identifying the genes and pathways where these variants aggregate in affected populations will have profound diagnostic and therapeutic implications, but analyzing affected individuals separately can be time and cost prohibitive. Population-based analysis offers a more efficient method for surveying genetic variation at multiple loci.
We present a novel pooled-DNA sequencing protocol paired with the SPLINTER software package designed to identify this type of genetic variation across populations. We demonstrate the accuracy of this method in identifying and quantifying minor alleles within a large pooled population of 947 individuals, including rare variants that were called de novo from the pooled sequencing and validated by individual pyrosequencing. Our strategy differs from other protocols mainly in the incorporation of a positive and a negative control within every experiment. This allows SPLINTER to achieve much higher accuracy and power compared to other approaches1. The optimal coverage of 25-fold per allele is fixed independently of the size of the pool, so the total coverage required scales only linearly with pool size, which keeps the analysis of large pools feasible. Our approach is very flexible and can be applied not only to any phenotype of interest but also to samples that are naturally heterogeneous, such as mixed cell populations and tumor biopsies. Given the ever-increasing interest in pooled sequencing from large target regions such as the exome or genome, our library prep and SPLINTER analysis are compatible with custom-capture and whole-exome sequencing, but the alignment utility in the SPLINTER package was not designed for large reference sequences. We have therefore used the dynamic programming aligner Novoalign for genome-wide alignments, followed by variant calling from the pooled sample (Ramos et al., submitted). Thus, our pooled sequencing strategy can scale successfully to larger pools with increasing amounts of target sequence.
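The linear scaling of the coverage requirement can be made concrete with a small Python sketch. It assumes diploid individuals (two alleles each, as for autosomal loci) and uses 20,000 bp as a round stand-in for the "over 20 kb" target, so the output is an illustrative lower bound rather than a figure from the study. The function name is hypothetical.

```python
def required_coverage(n_individuals, target_bp, per_allele_coverage=25, ploidy=2):
    """Sequencing needed to reach a fixed per-allele coverage in a pool.

    Assumes every individual contributes `ploidy` alleles in equal amounts.
    """
    n_alleles = ploidy * n_individuals
    per_base_coverage = per_allele_coverage * n_alleles
    return {
        "alleles_in_pool": n_alleles,
        "per_base_coverage": per_base_coverage,
        "total_sequenced_bases": per_base_coverage * target_bp,
        # A single mutant allele is expected to appear at this read fraction.
        "single_allele_fraction": 1 / n_alleles,
    }


# 947 individuals and an assumed 20,000 bp target at 25-fold per allele.
plan = required_coverage(947, 20_000)
for key, value in plan.items():
    print(key, value)
```

Doubling the pool size doubles the per-base coverage required, while the expected number of reads carrying a single mutant allele stays at the per-allele coverage (25 under these assumptions).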
The authors have nothing to disclose.
This work was supported by the Children’s Discovery Institute grant MC-II-2006-1 (RDM and TED), the NIH Epigenetics Roadmap grant [1R01DA025744-01 and 3R01DA025744-02S1] (RDM and FLMV), U01AG023746 (SC), the Saigh Foundation (FLMV and TED), 1K08CA140720-01A1 and Alex’s Lemonade Stand “A” Award support (TED). We thank the Genome Technology Access Center in the Department of Genetics at Washington University School of Medicine for help with genomic analysis. The Center is partially supported by NCI Cancer Center Support Grant #P30 CA91842 to the Siteman Cancer Center and by ICTS/CTSA Grant #UL1RR024992 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH), and NIH Roadmap for Medical Research. This publication is solely the responsibility of the authors and does not necessarily represent the official view of NCRR or NIH.
Reagent Name | Company | Catalogue Number | Section |
PfuUltra High-Fidelity | Agilent | 600384 | 1.4 |
Betaine | SIGMA | B2629 | 1.4 |
M13mp18 ssDNA vector | NEB | N4040S | 1.5 |
pGEM-T Easy | Promega | A1360 | 1.5 |
T4 Polynucleotide Kinase | NEB | M0201S | 2.2 |
T4 Ligase | NEB | M0202S | 2.2 |
Polyethylene Glycol 8000 MW | SIGMA | P5413 | 2.2 |
Bioruptor sonicator | Diagenode | UCD-200-TS | 2.3 |