Amino acid-level signal-to-noise analysis determines the prevalence of genetic variation at a given amino acid position normalized to background genetic variation of a given population. This allows for identification of variant "hotspots" within a protein sequence (signal) that rises above the frequency of rare variants found in a population (noise).
Advancements in the cost and speed of next generation genetic sequencing have generated an explosion of clinical whole exome and whole genome testing. While this has led to increased identification of likely pathogenic mutations associated with genetic syndromes, it has also dramatically increased the number of incidentally found genetic variants of unknown significance (VUS). Determining the clinical significance of these variants is a major challenge for both scientists and clinicians. An approach to assist in determining the likelihood of pathogenicity is signal-to-noise analysis at the protein sequence level. This protocol describes a method for amino acid-level signal-to-noise analysis that leverages variant frequency at each amino acid position of the protein with known protein topology to identify areas of the primary sequence with elevated likelihood of pathologic variation (relative to population "background" variation). This method can identify amino acid residue location "hotspots" of high pathologic signal, which can be used to refine the diagnostic weight of VUSs such as those identified by next generation genetic testing.
The rapid improvement in genetic sequencing platforms has revolutionized the accessibility and role of genetics in medicine. Once confined to a single gene, or a handful of genes, the reduction in cost and increase in speed of next generation genetic sequencing has led routine sequencing of the entirety of the genome's coding sequence (whole exome sequencing, WES) and the entire genome (whole genome sequencing, WGS) in the clinical setting. WES and WGS have been used frequently in the setting of critically ill neonates and children with concern for genetic syndrome where it is a proven diagnostic tool that can change clinical management1,2. While this has led to increased identification of likely pathogenic mutations associated with genetic syndromes, it has also dramatically increased the number of incidentally found genetic variants, or unexpected positive results, of unknown diagnostic significance (VUS). While some of these variants are disregarded and not reported, variants localizing to genes associated with potentially fatal or highly morbid diseases are often reported. Current guidelines recommend reporting of incidental variants found in specific genes which may be of medical benefit to the patient, including genes associated with the development of sudden cardiac death-predisposing diseases such as cardiomyopathies and channelopathies3. Although this recommendation was designed to capture individuals at risk of a SCD-predisposing disease, the sensitivity of variant detection far exceeds specificity. This is reflected in a growing number of VUSs and incidentally identified variants with unclear diagnostic utility that far exceed the frequency of the respective diseases in a given population4. One such disease, long QT syndrome (LQTS), is a canonical cardiac channelopathy caused by mutations localizing to genes which encode cardiac ion channels, or channel interacting proteins, resulting in delayed cardiac repolarization5. This delayed repolarization, seen by a prolonged QT interval on resting electrocardiogram, results in an electrical predisposition to potentially fatal ventricular arrhythmias such as torsades de pointes. While a number of genes have been linked to the development of this disease, mutations in KCNQ1-encoded IKs potassium channel (KCNQ1, Kv7.1) is the cause of LQTS type 1 and is utilized as an example below6. Illustrating the complexity in variant interpretation, the presence of rare variants in LQTS-associated genes, so called "background genetic variation" has been previously described7,8.
In addition to large compendium-style databases of known pathogenic variants, several strategies exist for predicting the effect different variants will produce. Some are based on algorithms, such as SIFT and Polyphen 2, which can filter large numbers of novel non-synonymous variants to predict deleteriousness9,10. Despite broad use of these tools, low specificity limits their applicability when it comes to "calling" clinical VUSs11. "Signal-to-noise" analysis is a tool which identifies the likelihood of a variant being associated with disease based on the frequency of known pathologic variation at the loci in question normalized against rare genetic variation from a population. Variants localizing to genetic loci where there is a high prevalence of disease-associated mutations compared to population-based variation, a high signal-to-noise, are more likely to be disease-associated themselves. Further, rare variants found incidentally localizing to a gene with a high frequency of rare population variants compared to disease-associated frequency, a low signal-to-noise, may be less likely to be disease-associated. The diagnostic utility of signal-to-noise analysis has been illustrated in the latest guidelines for genetic testing for cardiomyopathies and channelopathies; however, it has only been employed at the whole gene level or domain-specific level12. Recently, given increased availability of both pathologic variants (disease databases, cohort studies in the literature) and population-based control variants (Exome Aggregation Consortium, ExAC and the Genome Aggregation Database, GnomAD13), this has been applied to the individual amino acid positions within the primary sequence of a protein. Amino acid-level signal-to-noise analysis has proven useful in categorizing incidentally identified variants in genes associated with LQTS as likely "background" genetic variation rather than disease-associated. Among the three major genes associated with LQTS, including KCNQ1, these incidentally identified variants lacked a significant signal-to-noise ratios, suggesting that the frequency of these variants at individual amino acid positions reflect rare population variation rather than disease-associated mutations. Furthermore, when protein-specific domain topology was overlaid against areas of high signal-to-noise, pathologic mutation "hotspots" localized to key functional domains of the proteins14. This methodology holds promise in determining 1) the likelihood a variant is disease- or population-associated and 2) identifying novel critical functional domains of a protein associated with human disease.
1. Identify the Gene and Specific Splice Isoform of Interest
NOTE: Here, we demonstrate the use of Ensembl15 to identify the consensus sequence for the gene of interest which is associated with the pathogenesis of the disease of interest (i.e. KCNQ1 mutations are associated with LQTS). Alternatives to Ensembl include RefSeq via the National Center for Biotechnology Information (NCBI)16 and the University of California, Santa Cruz (UCSC) Human Genome Browser17 (see Table of Materials).
2. Create the Experimental Genetic Variant Database (the "Signal")
NOTE: Here, we demonstrate how to create a database of disease-associated variants in the gene of interest with the frequency of the disease-associated variants among individuals with the disease of interest. This database can take many forms and represents the "signal" (phenotype-positive genetic variation) which will be normalized against the control variant database. This can include 1) disease-associated variants for comparison against VUSs to identify novel functional domains of the protein and/or 2) VUSs, including incidentally identified VUSs, to compare against disease-associated variants to determine likelihood of pathogenicity. Disease-associated variants in KCNQ1 will be presented for illustration; however, the method is the same for analysis of incidentally-identified VUSs or any other set of experimental variants.
3. Create the Control Genetic Variant Database (the "Noise")
NOTE: Here, we demonstrate how to create a database of control variants in the gene of interest with an associated frequency in a control population. This database represents the "noise" (phenotype-negative, population-based genetic variation) which is the background against which the experimental variant database will be normalized. This is referred to as "control" variation.
4. Amino Acid Level Signal-to-noise Calculation and Mapping
5. Protein Domain Topology Overlay
6. Variant Position Overlay
A representative result for amino acid-level signal to noise analysis for KCNQ1 is depicted in Figure 6. In this example, rare variants identified in the GnomAD cohort (control cohort), incidentally-identified WES variants (experimental cohort #1), and LQTS case-associated variants deemed likely disease-associated (experimental cohort #2) are depicted. Further, the signal-to-noise analysis comparing the WES and LQTS cohort variant frequency normalized against GnomAD variant frequency is depicted. LQTS-associated variants demonstrated high signal-to-noise ratios in domains corresponding with the channel pore, selectivity filter, and the KCNE1-binding domain. In comparison, incidentally identified variants in the WES cohort did not clearly demonstrate specific regions of high signal-to-noise elevation, suggesting that these variants reflect background genetic variation. This example did not utilize variant MAFs as noted above; however, it demonstrates all of the same principles as described.
Figure 1: Example of control variant database with MAF calculation. Column A, directly imported GnomAD control rare variants. Column B, deletion of left-sided, non-position-related text from the variant nomenclature using an example formula for character removal (i.e.: for B2 "=RIGHT(A2,LEN(A2)-5", see Table of Materials). Column C, deletion of right-sided, non-position-related text from the variant nomenclature using a related formula (i.e.: for C2 "=LEFT(B2,LEN(B2)-3"). Column D, resultant unsorted amino acid positions. Column E, amino acid positions sorted in an ascending fashion to allow for identification of duplicate positions. Column F, associated MAF for each variant as imported from GnomAD. Column G and H, combined MAF for a given amino acid position (sum of each variant MAF at a specific position). Please click here to view a larger version of this figure.
Figure 2: Example of experimental variant database with MAF calculation. Column A, a list of mock LQTS-associated mutations in KCNQ1 representing a disease-associated mutation experimental database. Column B, mutation position corresponding to each variant. Column C, a count of mutation-positive individuals within mock Study 1. Each are presumed to be heterozygous mutation carriers. The total number of individuals genotyped in the study is located at the bottom of the sheet. Column D, count of mutation-positive individual in mock Study 2. Column E, count of mutation-positive individual in mock Study 3. Column F, total mutation-positive individuals hosting the observed mutation across all studies. Note that distinct mutations associated with the same amino acid position should be combined. Column G, MAF of each mutation and amino acid position using an example formula (i.e.: for G2 "=2/(176*2) ", see Table of Materials). Note that since all individuals are presumed to be heterozygous and each individual presumed to carry 2 alleles of the KCNQ1 locus, the total individuals should be multiplied by 2 for the allele frequency. Please click here to view a larger version of this figure.
Figure 3: Example of rolling average calculation for control and experimental variants. Column A and B, GnomAD control variant positions and respective MAFs. Column C, all amino acid positions of KCNQ1 from amino acid position to final. Column D, GnomAD variant MAF for all positions with a MAF of 0 in place of positions without a variant. This can be automatically calculated using a VLOOKUP function (i.e. for D2, "=IFERROR(VLOOKUP(C2,A:B,2,),0), see Table of Materials). Column E, rolling average of position MAF using an example formula (i.e. for E2, "=SUM(D2:D7)/6" and for E7, "=SUM(D2:D12)/11"). Column G and H, LQTS experimental variant positions with respective MAFs. Column I, all amino acid positions of KCNQ1. Column J, LQTS variant MAF for all positions. Column K, rolling LQTS MAF. Gray fill cells are examples of where MAF values from columns B and H are expanded into column D and J, respectively, which correlate with respective positions in column C/I. Note that it is critical that all cells are formatted as "Numbers" for proper formula functioning. Please click here to view a larger version of this figure.
Figure 4: Example of signal-to-noise analysis and graphing. Left, example database and calculations. Column A, all amino acid positions of KCNQ1. Column B, LQTS experimental MAF rolling average for each position. Column C, GnomAD control MAF rolling average for each position. D: Signal-to-noise ratio (i.e. for D2, "=B2/C2"). Right, example of graph of signal-to-noise ratio (Y-axis) versus amino acid position (X-axis). Please click here to view a larger version of this figure.
Figure 5: Example of protein and variant position mapping. A, example database and calculations. Column A, all amino acid positions of KCNQ1. Column B, KCNQ1 positions which have a rare control variant identified in GnomAD. Column C, the domain mapping column where cells containing values correspond to the N or C-terminal aspect of identified KCNQ1 protein domains or features. As the most N-terminal domain is the S1 domain has the N-terminal boundary at amino acid 122, no values are noted here. Column D, the variant mapping column where cells containing a 1 correspond to KCNQ1 positions which localize rare variants. Gray fill cells are two examples of where variant positions in column B are expanded into column D which correlate with respective positions in column A. Please click here to view a larger version of this figure.
Figure 6: Example of amino acid-level signal-to-noise analysis of KCNQ1-encoded KCNQ1 (Kv7.1). Top, variant positions are demonstrated with vertical lines, including rare GnomAD cohort variants (black), incidentally-identified variants in WES referrals (blue), and variants identified in LQTS cases(green). Functional domains are noted. Relative frequency of LQTS case variants normalized to GnomAD variants (green line) is depicted compared to WES (blue line). S1-S6, transmembrane domains; SF, ion selectivity filter; KCNE1 and AKAP9, respective protein binding domains. Modified and reprinted with permission from previous work14. Please click here to view a larger version of this figure.
High-throughput genetic testing has advanced dramatically in its application and availability over the past decade. However, in many diseases with well-established genetic underpinnings, such as cardiomyopathies, expanded testing has failed to improve diagnostic yield21. Further, there is significant uncertainty regarding the diagnostic utility of many identified variants. This is partially due to a growing number of incidentally identified rare variants discovered on WES and WGS, which can lead to misdiagnosis22. Amino acid level signal-to-noise analysis is based on well-established strategies for predicting variant pathogenicity and provides the advantage of leveraging large-scale population-based genome studies to refine variant interpretation.
It follows that one of the most crucial steps to this protocol is the selection of control and experimental cohorts. Many of the publicly-available large genome studies are accessible through aggregate databases, such as GnomAD, that can allow for representative control cohorts in this protocol to be as large as 138,632 individuals at present date. Though not all subjects in these aggregate cohorts are ostensibly healthy, the large sample size in the setting of rare disease makes this resource invaluable and allows for a stringent MAF exclusion threshold. Exclusion of common variants is necessary as they are unlikely to be a cause of highly penetrant Mendelian disease. Based on previous work, a MAF threshold of 0.01 for channelopathy-associated genes and 0.0001 for cardiomyopathy genes may be appropriate and has been validated by independent groups23,24. Importantly, given the importance of the MAF threshold, this should be set and validated for each study independently. A MAF threshold need not be applied to an experimental cohort, given the well-established presence of founder mutations in channelopathies and cardiomyopathies. The size of the experimental cohort needs to be sufficient to identify areas where variants may cluster; however, there is no strict size. Additionally, the experimental cohort should not include variants known to be benign within the literature, as this would diminish the veracity of the pathogenic signal.
Properly selecting exclusion criteria is also crucial for interpretation and applicability of the result. Though this protocol recommends excluding certain mutation classes such as synonymous variants, these could feasibly be included for disease processes in which deleterious synonymous variants have been identified25,26. Additionally, when various exclusion criteria are applied to both experimental and control groups, it can allow for stratification of signal-to-noise mapping by mutation subclass (i.e. comparing missense to truncating variants).
Setting a rolling average for MAFs allow for inference of involvement to neighboring amino acids. For example, if amino acid position 35 contains a pathologic variant and resides in a critical protein domain, then position 36 may have a degree of pathogenicity when mutated. Likewise, should a stretch of primary sequence have a large amount of rare control variants, then amino acids within this region that do not host rare variants may yet have a higher likelihood of containing rare variants found in a population. While the rolling average in this protocol is +/- 5, this range can be vary based on the user's desired level of resolution of signal-to-noise ratio and the specific protein being studied. In the example of LQTS, the interrogated KCNQ1-encoded KCNQ1 channel has several transmembrane domains spanning ~10 amino acids, prompting the authors to adjust their desired resolution to reflect significant findings on that scale14. For proteins with a longer primary sequence and protein length, the span of the rolling average may need to be increased due to larger spans of protein sequence without control variation.
There are several limitations to this method. As previously stated, a sufficient phenotype-positive population hosting putative pathologic variants must be identified in order to drive a clear pathologic signal. Additionally, these pathologic variants may have variable penetrance, thus truly pathologic mutations may not manifest a disease phenotype or may otherwise not be fully penetrant and disease causing. While many publicly held databases, such as GnomAD, are often considered "healthy cohorts," the prevalence of genetic diseases is likely similar in this database as population studies. As detailed, this protocol focuses specifically on amino-acid level changes resulting from exonic gene variants that code for amino acids, which excludes the role that pathogenic intronic splicing variants may play in monogenic disease. Given their recently demonstrated role in cardiomyopathies, expansion of the resolution this approach may be warranted to identify intergenic "hotspots" as well. Furthermore, the application of a MAF threshold may miss certain "risk alleles" that, though existing in the population with a MAF higher than that of disease prevalence, may contribute to disease pathogenesis27,28. Despite these limitations, this analysis is adaptable and can play a key role in providing clinicians a relative probability of disease pathogenicity when appropriate applied.
Finally, given the predilection of this analysis to identify critical regions within a protein, amino acid-level signal-to-noise calculations utilizing pathologic mutations offers the possibility of identifying novel functional domains of the proteins being studied. Given the observation of high pathogenicity signal-to-noise at key locations of ion channels, such as the pore domain, selectivity filter, S2 transmembrane domain, and the KCNE1-binding domain of KCNQ1, identification of a "peak of pathogenicity" within an area of the protein without a known function may suggest a novel critical domain. For example, a marked peak of pathogenicity of LQTS-associated mutations has been identified localizing to amino acid residues 912-930 of KCNH2-encoded KCNH2 (Kv11.1). This region of the protein has no identifiable functional domain yet demonstrates a marked propensity for LQTS-associated mutations14. As the knowledge of protein topology expands, more sophisticated proteomics could feasibly improve the resolution of this method in the future from analyzing signal-to-noise ratio along a protein's primary structure to include its secondary, tertiary, or quaternary structure. Addition of advanced computational sciences to this analysis, such as machine learning and artificial intelligence, affords the opportunity to identify novel patterns among pathologic versus population-based genetic variation, if robust databases of these variants can be generated29,30. In turn, this method could aid in better characterizing and predicting the genotype-phenotype relationship of specific diseases and be used in conjunction with an individual's pre-test probability of disease to improve the diagnostic yield of genetic testing. Further, this analysis may discover novel protein biology and identify novel loci within the human genome which manifest with disease when altered.
The authors have nothing to disclose.
APL is supported by the National Institutes of Health K08-HL136839.
1000 Genome Project | N/A | www.internationalgenome.org | |
ClinVar | N/A | www.ncbi.nlm.nih.gov/clinvar | |
Ensembl Genome Browser | N/A | uswest.ensembl.org/index.html | |
Excel | Microsoft | office.microsoft.com/excel/ | Used for all example formulas and functions |
Exome Aggregation Consortium | N/A | www.exac.broadinstitute.org | |
Genome Aggregation Database | N/A | www.gnomad.broadinstitute.org | |
National Center for Biotechnology Information Domain and Structure Database | N/A | www.ncbi.nlm.nih.gov/guide/domains-structures/ | |
National Center for Biotechnology Information Gene Database | N/A | www.ncbi.nlm.nih.gov/gene/ | |
National Center for Biotechnology Information Protein Database | N/A | www.ncbi.nlm.nih.gov/protein/ | |
National Heart, Lung, and Blood Institute GO Exome Sequencing Project | N/A | www.evs.gs.washington.edu/EVS/ | |
SnapGene | GSL Biotech LCC | www.snapgene.com | |
University of California, Santa Cruz Human Genome Browser | N/A | www.genome.ucsc.edu |