We present a protocol that uses three-dimensional chromatin interactions to identify the functional implications of non-coding variants discovered by genome-wide association studies (GWAS).
Genome-wide association studies (GWAS) have successfully identified hundreds of genomic loci that are associated with human traits and disease. However, because the majority of the genome-wide significant (GWS) loci fall within the non-coding genome, the functional impact of many remains unknown. Three-dimensional chromatin interactions identified by Hi-C or its derivatives can provide useful tools to annotate these loci by linking non-coding variants to their actionable genes. Here, we outline a protocol to map GWAS non-coding variants to their putative genes using Alzheimer's disease (AD) GWAS and Hi-C datasets from human adult brain tissue. Putative causal single-nucleotide polymorphisms (SNPs) are identified by applying fine-mapping algorithms. SNPs are then mapped to their putative target genes using enhancer-promoter interactions based on Hi-C. The resulting gene set represents AD risk genes, as they are potentially regulated by AD risk variants. To garner further biological insights into the molecular mechanisms underlying AD, we characterize AD risk genes using developmental brain expression data and brain single-cell expression profiles. This protocol can be extended to any GWAS and Hi-C datasets to identify putative target genes and molecular mechanisms underlying various human traits and diseases.
Genome-wide association studies (GWAS) have played a pivotal role in revealing the genetic basis of a range of human traits and diseases. This large-scale genotyping has uncovered thousands of genomic variants associated with phenotypes ranging from height to schizophrenia risk. However, despite the enormous success of GWAS in identifying disease- and trait-associated loci, a mechanistic understanding of how these variants contribute to phenotype has been challenging because most phenotype-associated variants reside in the non-coding fraction of the human genome. Since these variants often overlap with predicted regulatory elements, they are likely to alter transcriptional control of a nearby gene. However, non-coding loci can influence transcription of genes at linear distances exceeding one megabase, making the genes affected by each variant difficult to identify. Three-dimensional (3D) chromatin structure plays an important role in mediating connections between distant regulatory loci and gene promoters and can be used to identify genes affected by phenotype-associated single-nucleotide polymorphisms (SNPs).
Gene regulation is mediated by a complex process that involves enhancer activation and chromatin loop formation, which physically connects enhancers to the gene promoters to which the transcriptional machinery can be directed1,2,3. Because chromatin loops often span several hundred kilobases (kb), detailed maps of 3D chromatin architecture are required to decipher gene regulatory mechanisms. Multiple chromatin conformation capture technologies have been invented to identify the 3D chromatin architecture4. Among these technologies, Hi-C provides the most comprehensive view of that architecture, as it captures genome-wide 3D chromatin interaction profiles. Hi-C datasets have been quickly adopted to interpret non-coding genome-wide significant (GWS) loci5,6,7,8,9,10,11,12,13, as they can link non-coding variants to their putative target genes based on chromatin interaction profiles.
In this article, we outline a protocol to computationally predict putative target genes of GWAS risk variants using chromatin interaction profiles. We apply this protocol to map AD GWS loci14 to their target genes using Hi-C datasets from the adult human brain9. The resulting AD risk genes are then characterized using other functional genomic datasets, including single-cell transcriptomic and developmental expression profiles.
1. Workstation Setup
2. Generation of a GRanges Object for Credible SNPs
3. Positional Mapping
NOTE: For each step, type the corresponding code into the console window in RStudio.
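The protocol performs positional mapping in R on GRanges objects. As a language-neutral illustration of the overlap logic only, the following is a minimal Python sketch. It assumes 1-based inclusive coordinates and a promoter defined as the 2 kb upstream of the transcription start site (the definition given in the Discussion). The `Interval` class, the helper names, and the toy SNPs and genes are hypothetical and are not part of the protocol.

```python
from dataclasses import dataclass

PROMOTER_UPSTREAM = 2000  # 2 kb upstream of the TSS (definition from the Discussion)


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    gene: str = ""


def promoter_of(chrom, tss, strand, gene, upstream=PROMOTER_UPSTREAM):
    """Return the strand-aware interval covering `upstream` bp before the TSS."""
    if strand == "+":
        return Interval(chrom, max(1, tss - upstream), tss - 1, gene)
    return Interval(chrom, tss + 1, tss + upstream, gene)


def overlaps(a, b):
    """True if two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def positional_mapping(snps, features):
    """Map each SNP id to the sorted genes whose promoter or exon it overlaps."""
    mapped = {}
    for snp_id, snp in snps.items():
        genes = sorted({f.gene for f in features if overlaps(snp, f)})
        if genes:
            mapped[snp_id] = genes
    return mapped


# Toy data (hypothetical identifiers and coordinates)
snps = {
    "rsA": Interval("chr1", 9500, 9500),
    "rsB": Interval("chr1", 20100, 20100),
    "rsC": Interval("chr2", 500, 500),
}
features = [
    promoter_of("chr1", 10000, "+", "GENE1"),  # chr1:8000-9999
    promoter_of("chr1", 19000, "-", "GENE2"),  # chr1:19001-21000
    Interval("chr1", 30000, 30500, "GENE3"),   # an exon
]
print(positional_mapping(snps, features))
```

SNPs absent from the returned dictionary (here `rsC`) are the unannotated ones that would be carried forward to chromatin-interaction-based mapping. The nested loop is quadratic; for genome-scale inputs an interval index, such as the one behind `findOverlaps` in GenomicRanges, would be the practical choice.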
4. Developmental Expression Trajectories
NOTE: For each step, type the corresponding code into the console window in RStudio.
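To make the idea of a developmental expression trajectory concrete, the following Python sketch averages the expression of a gene set within each sample and then summarizes samples by developmental stage. It is only an illustration under stated assumptions: the input layout (one list of values per gene, ordered like the sample labels), the function name `stage_means`, and the toy values are hypothetical and do not reflect the actual format of the developmental expression dataset.

```python
from statistics import mean


def stage_means(expression, stages, genes):
    """Average the gene set within each sample, then average samples per stage.

    expression: dict mapping gene -> list of values, one per sample
    stages:     list of stage labels, in the same sample order
    genes:      gene set of interest; genes missing from `expression` are skipped
    """
    present = [g for g in genes if g in expression]
    if not present:
        raise ValueError("none of the requested genes are in the expression table")
    per_stage = {}
    for i, stage in enumerate(stages):
        sample_mean = mean(expression[g][i] for g in present)
        per_stage.setdefault(stage, []).append(sample_mean)
    return {stage: mean(values) for stage, values in per_stage.items()}


# Toy data (hypothetical values)
stages = ["prenatal", "prenatal", "postnatal", "postnatal"]
expression = {
    "GENE1": [1.0, 2.0, 4.0, 5.0],
    "GENE2": [2.0, 2.0, 6.0, 6.0],
    "GENE3": [9.0, 9.0, 1.0, 1.0],  # not in the gene set below
}
risk_genes = ["GENE1", "GENE2", "MISSING"]
print(stage_means(expression, stages, risk_genes))
```

In practice, expression values are usually normalized or centered per gene before averaging so that a few highly expressed genes do not dominate the summary; that step is omitted here for brevity.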
5. Cell-type Expression Profiles
NOTE: For each step, type the corresponding code into the console window in RStudio.
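A cell-type expression profile of the kind summarized by violin plots can be sketched as follows: for each cell type, compute each gene's mean expression across the cells of that type, giving one distribution of per-gene values per cell type. This Python example is an assumption-laden illustration, not the protocol's R code; the input layout, the function name `per_gene_by_cell_type`, and the toy values are hypothetical.

```python
from statistics import mean, median


def per_gene_by_cell_type(expression, cell_types, genes):
    """For each cell type, return each gene's mean expression across its cells.

    expression: dict mapping gene -> list of values, one per cell
    cell_types: list of cell-type labels, in the same cell order
    genes:      gene set of interest; genes missing from `expression` are skipped
    """
    cells_of = {}
    for idx, cell_type in enumerate(cell_types):
        cells_of.setdefault(cell_type, []).append(idx)
    return {
        cell_type: [
            mean(expression[g][i] for i in idxs) for g in genes if g in expression
        ]
        for cell_type, idxs in cells_of.items()
    }


# Toy data (hypothetical values)
cell_types = ["microglia", "microglia", "neuron", "neuron"]
expression = {
    "GENE1": [4.0, 6.0, 1.0, 1.0],
    "GENE2": [3.0, 5.0, 0.0, 2.0],
}
profile = per_gene_by_cell_type(expression, cell_types, ["GENE1", "GENE2"])
# Cell type whose per-gene values have the highest median
top = max(profile, key=lambda cell_type: median(profile[cell_type]))
print(profile, top)
```

Each list in `profile` corresponds to the values that one violin would display for that cell type.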
6. Gene Annotation Enrichment Analysis of AD Risk Genes
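The protocol runs this analysis with HOMER. As a rough illustration of the kind of statistic that underlies gene-set enrichment, and not a reproduction of HOMER's implementation, the Python sketch below computes an upper-tail hypergeometric P-value for each term and applies a Benjamini-Hochberg correction. It assumes that every term gene and every risk gene belongs to the background; the function names, term names, and gene identifiers are hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K belong to the term."""
    upper = min(K, n)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enrich(risk_genes, terms, n_background):
    """Return (term, overlap, term size, P-value, FDR) for every term."""
    names = list(terms)
    overlaps = [len(risk_genes & terms[t]) for t in names]
    pvals = [
        hypergeom_sf(k, n_background, len(terms[t]), len(risk_genes))
        for t, k in zip(names, overlaps)
    ]
    fdrs = benjamini_hochberg(pvals)
    return [
        (t, k, len(terms[t]), p, q)
        for t, k, p, q in zip(names, overlaps, pvals, fdrs)
    ]


# Toy data (hypothetical): a background of 10 genes and 4 risk genes
risk = {"G1", "G2", "G3", "G4"}
terms = {"term_A": {"G1", "G2", "G9"}, "term_B": {"G4", "G7", "G8"}}
for row in enrich(risk, terms, n_background=10):
    print(row)
```

The overlap and term size returned for each term correspond to the numerator and denominator shown next to each enriched term in Figure 1B-D.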
The process described here was applied to a set of 800 credible SNPs that were defined by the original study14. Positional mapping revealed that 103 SNPs overlapped with promoters (43 unique genes) and 42 SNPs overlapped with exons (27 unique genes). After positional mapping, 84% of the SNPs (669) remained unannotated. Using Hi-C datasets from the adult brain, we were able to link an additional 208 SNPs to 64 genes based on physical proximity. In total, we mapped 284 AD credible SNPs to 112 AD risk genes (Figure 1A). AD risk genes were associated with amyloid precursor proteins, amyloid-beta formation, and immune response, reflecting the known biology of AD15,16,17,18 (Figure 1B-D). Developmental expression profiles of AD risk genes showed marked postnatal enrichment, consistent with the age-associated elevated risk of AD (Figure 2A). Finally, AD risk genes were highly expressed in microglia, the primary immune cells in the brain (Figure 2B). This agrees with recurrent findings that AD has a strong immune basis and that microglia are central players in AD pathogenesis14,19,20.
Figure 1: Defining putative target genes of AD GWS loci. (A) Credible SNPs derived from the top 29 AD loci were categorized into promoter SNPs, exonic SNPs, and unannotated non-coding SNPs. Promoter and exonic SNPs were directly assigned to their target genes by positional mapping, while chromatin interaction profiles in the adult brain were additionally used to map SNPs based on physical interactions. (B-D) Enrichment analysis of GO (B), KEGG (C), and Reactome (D) terms in AD risk genes was performed using HOMER as described in protocol section 6. The x axis represents the false discovery rate (FDR)-corrected -log10(P-value). Enriched terms with FDR < 0.1 were plotted. Grey vertical lines represent FDR = 0.05. APP, amyloid precursor protein. Numerator, the number of AD risk genes represented in each term; denominator, the number of genes in each term.
Figure 2: Characterization of AD risk genes. (A) AD risk genes are highly expressed in the postnatal cortex compared to the prenatal cortex. (B) Violin plots depicting distributions of gene expression values (normalized expression) in different cell types from the cortex. These results show that AD risk genes are highly expressed in microglia, consistent with previous studies14.
Supplementary File 1.
Supplementary File 2.
Supplementary File 3.
Here we describe an analytic framework that can be used to functionally annotate GWS loci based on positional mapping and chromatin interactions. This process involves multiple steps (for more details see this review13). First, given that chromatin interaction profiles are highly cell-type specific, Hi-C data should be obtained from the cell or tissue types that best capture the underlying biology of the disorder. Given that AD is a neurodegenerative disorder, we used adult brain Hi-C data9 to annotate GWS loci. Second, each GWS locus often has up to hundreds of SNPs that are associated with the trait because of linkage disequilibrium (LD), so it is important to obtain putative causal ('credible') SNPs, either by computationally predicting causality with fine-mapping algorithms21,22 or by experimentally testing regulatory activity with high-throughput approaches such as massively parallel reporter assays (MPRA)23 or self-transcribing active regulatory region sequencing (STARR-seq)24. For the work described here, we used the credible SNPs reported in Jansen et al.14. Third, promoter and exonic SNPs are annotated based on positional mapping. We used a simple positional mapping strategy in which SNPs were mapped to genes when they overlapped with promoters (defined as 2 kb upstream of the transcription start site) or exons. However, this approach can be further elaborated by assessing the functional consequences of exonic SNPs, such as whether the SNP induces nonsense-mediated decay, missense variation, or nonsense variation. Fourth, chromatin interaction profiles from the appropriate tissue or cell type can be used to assign SNPs to their putative target genes based on physical proximity. We used interaction profiles anchored to promoters, but these profiles can be further refined or expanded by taking enhancer activities (guided by histone H3 K27 acetylation or chromatin accessibility) or exonic interactions into account.
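The fourth step, assigning SNPs to genes through promoter-anchored chromatin interactions, can be sketched as a bin lookup. The Python example below is a minimal illustration under several assumptions: interactions are given as (chromosome, start of the distal 10 kb bin, gene whose promoter anchors the interaction), bin starts are multiples of the bin size, and all identifiers and coordinates are hypothetical. It is not the protocol's R implementation.

```python
from collections import defaultdict

BIN_SIZE = 10_000  # 10 kb, the resolution of the Hi-C interaction profiles used here


def bin_start(pos, bin_size=BIN_SIZE):
    """Start coordinate of the fixed-size bin that contains `pos`."""
    return (pos // bin_size) * bin_size


def build_index(interactions):
    """Index promoter-anchored interactions by their distal bin.

    interactions: iterable of (chrom, distal_bin_start, gene)
    """
    index = defaultdict(set)
    for chrom, distal_bin, gene in interactions:
        index[(chrom, distal_bin)].add(gene)
    return index


def hic_mapping(snps, index):
    """Map each SNP id to the genes whose promoter contacts the SNP's bin."""
    mapped = {}
    for snp_id, (chrom, pos) in snps.items():
        genes = index.get((chrom, bin_start(pos)))
        if genes:
            mapped[snp_id] = sorted(genes)
    return mapped


# Toy data (hypothetical identifiers and coordinates)
interactions = [
    ("chr1", 120_000, "GENE1"),
    ("chr1", 120_000, "GENE4"),
    ("chr1", 450_000, "GENE2"),
]
snps = {
    "rsD": ("chr1", 123_456),
    "rsE": ("chr1", 300_001),
    "rsF": ("chr2", 123_456),
}
print(hic_mapping(snps, build_index(interactions)))
```

A SNP can map to more than one gene when its bin contacts several promoters, as `rsD` does in this toy example. Restricting the index to bins that also carry enhancer marks would be one way to implement the refinement mentioned above.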
One important consideration in this process is to use a consistent human genome build. For example, if the genomic positions of the summary statistics are not based on hg19 (e.g., they are based on hg18 or hg38), either an appropriate version of the reference genome should be obtained or the summary statistics need to be converted to hg19 using liftOver25.
We applied this framework to identify putative target genes for AD GWAS, assigning 284 SNPs to 112 AD risk genes. Using developmental expression profiles26 and cell-type specific expression profiles9, we then demonstrated that this gene set was consistent with what is known about AD pathology, revealing the relevant cell type (microglia), biological functions (immune response and amyloid beta), and elevated risk with age.
While we presented a framework that delineates potential target genes of AD and its underlying biology, it is of note that Hi-C based annotation can be extended to annotate any non-coding variation. As more whole-genome sequencing data become available and our understanding of rare non-coding variation grows, Hi-C will provide a key resource for interpreting disease-associated genetic variants. A compendium of Hi-C resources obtained from multiple tissue and cell types will therefore be critical to facilitating wide application of this framework to garner biological insights into various human traits and diseases.
The authors have nothing to disclose.
This work was supported by NIH grants R00MH113823 (to H.W.) and R35GM128645 (to D.H.P.), a NARSAD Young Investigator Award (to H.W.), and a SPARK grant from the Simons Foundation Autism Research Initiative (SFARI, to N.M. and H.W.).
10 kb resolution Hi-C interaction profiles in the adult brain from PsychENCODE | http://adult.psychencode.org/
Developmental expression datasets | http://www.brainspan.org/
Fine-mapped credible SNPs for AD (Supplementary Table 8 from Jansen et al.14) | https://static-content.springer.com/
HOMER | http://homer.ucsd.edu/
R (version 3.5.0) | https://www.r-project.org/
RStudio Desktop | https://www.rstudio.com/
Single cell expression datasets | http://adult.psychencode.org/