Unlike DNA sequence data, epigenomic data are not readily subjected to text-based searches. Presented here are procedures for using an upgraded version of GeNemo, a web-based bioinformatics tool, to conduct pattern-based searches that compare a user's data with available online databases, including the Encyclopedia of DNA Elements, for similarities in epigenomic data.
Compared with the robust text-based search tools for genomic or RNA sequencing data, current methodologies for pattern-based searches of epigenomic and other functional genomic data are very limited. GeNemo is the first online search tool to support such pattern-based searches. Users input their functional genomic data in the Browser Extensible Data (BED), Peaks, or bigWig format, and may search for data in any of the three formats. Users may specify which types of datasets to search against, choosing from a variety of online datasets, including those from the Encyclopedia of DNA Elements (ENCODE), that represent different epigenomic marks, transcription factor binding sites, and chromatin hypersensitivities or accessibilities in specific cell types, developmental stages, or species (mouse or human). GeNemo returns a list of genomic regions with patterns matching the input data, which may be viewed in the browser as well as downloaded in the BED file format. The upgraded GeNemo has an improved graphical display and a more robust interface, and is no longer prone to errors due to changes in the University of California, Santa Cruz (UCSC) genome browser. Troubleshooting steps for common problems are discussed. As the amount of functional genomic data is expanding exponentially, there is a critical need to develop and refine new bioinformatic tools such as GeNemo for data analyses and interpretation.
Recent technological advances have allowed for a rapid expansion of epigenomic or functional genomic data depositories, which has outpaced the development of relevant analytic tools to extract biological insights. One important way to analyze epigenomic data is to search user-generated data against data depositories, especially those from the Encyclopedia of DNA Elements (ENCODE)1 projects, for matching patterns that could lead to new knowledge. For instance, identifying similarities in the patterns of two different epigenomic marks at defined loci across the genome may indicate coordinated action by different molecular players on chromatin conformation and transcriptional regulation2,3,4.
Conventional text-based search engines are ineffective in this regard because, unlike DNA sequence, epigenomic data predominantly exist in the format of intensities or functional genomic regions. GeNemo, standing for Gene Nemo (as in Finding Nemo), was developed to address this unmet need using pattern-based searches5. Its algorithm utilizes a Markov Chain Monte Carlo maximization process5. Users take their own data or a dataset downloaded from depositories and search an array of online epigenomic data to identify similarities in patterns.
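To illustrate why intensity data call for pattern-based rather than text-based comparison, the following minimal sketch scores the similarity of two binned signal vectors with a Pearson correlation. This is not GeNemo's algorithm (which, as noted above, uses a Markov Chain Monte Carlo maximization process); the function name `pearson` and all signal values are hypothetical and chosen only for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length signal vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("signals must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat signal has no pattern to compare
    return cov / math.sqrt(var_x * var_y)


# Hypothetical binned intensities over the same genomic window.
query = [0.1, 0.4, 2.5, 3.0, 0.6, 0.2]
track_a = [0.2, 0.8, 5.1, 6.2, 1.1, 0.3]  # similar shape, different scale
track_b = [3.0, 2.4, 0.3, 0.1, 1.9, 2.8]  # roughly inverted shape

score_a = pearson(query, track_a)
score_b = pearson(query, track_b)
```

In this toy case, `track_a` scores close to 1 even though none of its values equal those of the query, while `track_b` scores below 0; an exact-match text search would treat both tracks as equally unrelated to the query.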
The current version of GeNemo has an updated display, interfaces more robustly with the University of California, Santa Cruz (UCSC) genome browser6, and is less susceptible to issues caused by changes in the latter. In particular, while GeNemo's Results page used to be based on the UCSC genome browser interface, the current version of GeNemo supports its own Results page and consequently is no longer adversely affected by structural changes to the UCSC genome browser. GeNemo can use any genomic signal, including protein-binding, histone modification, chromatin accessibility, topological domains, and so on, as a query to find colocalized/similar segments among known data sets from large consortia. Therefore, it is an important tool to study the relationship between different epigenomic data of interest and known data generated in large scale genomic projects.
NOTE: The protocol can be paused anywhere.
1. Basic Setup
2. Optional Setup
Figure 1: GeNemo's front page with the necessary areas filled out. A user needs to input the species, search file and search range, and select tracks he/she wishes to search against. Email address and display file are optional.
3. Data Selection
Figure 2: Track selection window. This is brought up by clicking the "DATA SELECTION" button on the front page. Here, users select tracks to search the input file against. Some of the tracks are already selected by default.
Figure 3: Filter window. This is brought up by clicking the "FILTER" button on the Track selection window. Here, users can select many tracks at the same time, with relative ease.
Figure 4: How to use the filter function.
4. Search and Results
Figure 5: Results page. This particular search returned 363 matching regions. Displaying the first matching region can be done by clicking the "SHOW" button on the bottom left of each resulting region box. On the left part of the display window it can be seen that the two data files (input and selected track) are similar in signal strength pattern.
Shown here in Figure 5 is a simulated search. The human species was selected, and the corresponding sample file was used as the input data file. In addition, the default tracks, as seen in Figure 3, were selected. There were a total of 363 matching regions, and the first region is shown in the display page. It can be seen that the intensity pattern from base 17036000 to 17038000 on chromosome 1 for the input file and one of the selected tracks is very similar.
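Because the matching regions can be downloaded in the BED file format, they can also be inspected outside the browser. The sketch below, offered only as an illustration, reads BED records and keeps those overlapping a region of interest such as the one displayed above. It assumes that the first three columns follow the standard BED layout (chromosome, 0-based start, end); the helper names and the sample records are hypothetical and are not actual GeNemo output.

```python
def load_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples, skipping headers."""
    regions = []
    for line in lines:
        text = line.strip()
        if not text or text.startswith(("#", "track", "browser")):
            continue
        fields = text.split()
        regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions


def overlapping(regions, chrom, start, end):
    """Return regions sharing at least one base with chrom:start-end."""
    return [r for r in regions if r[0] == chrom and r[1] < end and r[2] > start]


# Hypothetical records standing in for a downloaded results file.
sample = [
    "track name=results",
    "chr1\t17036000\t17038000",
    "chr1\t20500000\t20502000",
    "chr2\t17036500\t17037500",
]

hits = overlapping(load_bed(sample), "chr1", 17036000, 17038000)
```

With a real download, the list `sample` would be replaced by an open file handle, for example `load_bed(open(path))`, where `path` is wherever the user saved the results.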
A thorough understanding of the epigenome is required to achieve the full potential of human genome sequencing in providing new biological insights8. Currently, online epigenomic datasets can be searched only by their data description and title (i.e., metadata)1. This severely limits the types of searches one can do with epigenomic data. Pattern-based search tools for epigenomic data are essential for exploring the relationship between different epigenomic marks, which may lead to new biological insights. GeNemo, which searches by the content of the data and not metadata, is the first service of its kind to compare patterns in epigenomic data from published depositories such as the ENCODE database with a user-generated or downloaded dataset5. This marks the beginning of the availability of an epigenomic search tool that is widely accessible to researchers around the world, just as text-based sequence search tools became widely available in the 1990s. Currently, there are no alternatives to GeNemo for pattern-based online searches of epigenomic data.
One potential use of GeNemo is to search for histone modifications and other epigenetic marks that co-appear with the transcription factor E2F6 in human embryonic stem cells (an example E2F6 binding signal file is available at the ENCODE data portal or at https://sysbio.ucsd.edu/public/xcao3/ENCODESample/ENCFF001UBC.bed). When this file is used as a query against all ENCODE datasets in H1-hESC, GeNemo shows that the E2F6 binding signal is heavily enriched with H3K4me1, H3K4me2, H3K4me3, and H3K27me3, which agrees with existing research9 showing that E2F6 regulates some genes via methylation of H3K27. In addition, E2F6 binding sites appear to colocalize with those of CtBP2, which is known10 to interact with E2F7, a factor in the same family. Such genome-wide results against a large number of epigenetic marks, transcription factor binding signals, and other signals included in ENCODE can be obtained fairly easily with GeNemo, providing potential targets for further analysis.
Since the first publication5 of GeNemo as a web-based epigenomic data search tool, the Results section of GeNemo has been updated to match the appearance of GeNemo's front page. The old Results section closely mirrored the UCSC genome browser results section, and was largely dependent on the remote UCSC server for display. With the new interface, GeNemo is more user-friendly and no longer dependent on the UCSC genome browser server (even though data are still fetched remotely). This makes GeNemo more robust and less susceptible to problems due to code changes at the UCSC server. Furthermore, the new, faster Polymer-based interface of GeNemo gives the user more tools to visualize and analyze patterns in the data.
Critical steps include providing the appropriate input file and selecting data tracks to search against. Users are strongly encouraged to experiment with the various track selection functions to become familiar with the selection process and with how different commands can be combined to achieve the intended outcome. In particular, note that the "Add" function is required to add the selected tracks to the query, while "Filter" and "Exclude" can be used as the logic gate commands "AND" and "OR", respectively. The "Update" function is required to apply all the selections before running the search. When no results are returned, a user may check the input data file, search more tracks, or increase the search range. Whenever there is an error, a pop-up window describes it. Some error messages are ambiguous, though. For example, when the window says that 'no file was uploaded,' either no file was uploaded, or the uploaded file was not in an acceptable format and, consequently, the program was not able to read it correctly. Acceptable formats include BED and Peaks files for both upload methods, and bigWig for online link upload only. The zipped versions of these file formats are also acceptable.
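Because an unreadable file and a missing file can produce the same 'no file was uploaded' message, it may help to check a BED file locally before uploading it. The following is a minimal sketch of such a check, assuming only the general BED convention of at least three whitespace-separated columns with integer start and end coordinates. It does not reproduce GeNemo's own parser, whose exact requirements may differ, and the function name `check_bed_lines` is hypothetical.

```python
def check_bed_lines(lines):
    """Return a list of (line_number, problem) pairs for malformed records."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        # Blank lines, comments, and track/browser headers carry no record.
        if not text or text.startswith(("#", "track", "browser")):
            continue
        fields = text.split()
        if len(fields) < 3:
            problems.append((number, "fewer than 3 columns"))
            continue
        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            problems.append((number, "start or end is not an integer"))
            continue
        if start < 0 or end < start:
            problems.append((number, "coordinates out of order or negative"))
    return problems


# Hypothetical file content with two deliberate mistakes.
example = [
    "chr1\t100\t200",
    "chr1\t300",          # missing end column
    "chr2\tabc\t500",     # non-numeric start
    "chr3\t50\t75\tpeak1",
]

report = check_bed_lines(example)
```

An empty report means only that the file passes these basic checks; a file in the Peaks or bigWig format would need a different check.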
Current limitations of this approach include the yet-to-be-optimized algorithms and functions employed in GeNemo. GeNemo cannot yet provide any guidance on the interpretation of the datasets returned; this task is left to users and requires significant knowledge and expertise in the biology of the genome and epigenome. Another current limitation is that users cannot change the sensitivity and noise level of the searches. We expect to continue to improve and expand GeNemo's pattern searching capabilities and dataset collection in the future.
The authors have nothing to disclose.
This work was supported by NIH grants including DP1HD087990 from NICHD and R01HG008135 from NHGRI. We thank members of the Zhong lab for valuable feedback.
Author Contributions:
X.C. and A.T.Z. updated GeNemo by coding the new interface and features; A.T.Z. produced the in-house sample video; A.T.Z., X.C., and S.Z. wrote the paper.