We present a public computational web site for the analysis of genomic sequences. It detects DNA sequence patterns with various non-random nucleotide compositions. This resource also generates randomized sequences with diverse levels of complexity.
Non-coding genomic regions in complex eukaryotes, including intergenic areas, introns, and untranslated segments of exons, are profoundly non-random in their nucleotide composition and consist of a complex mosaic of sequence patterns. These patterns include so-called Mid-Range Inhomogeneity (MRI) regions — sequences 30-10000 nucleotides in length that are enriched by a particular base or combination of bases (e.g. (G+T)-rich, purine-rich, etc.). MRI regions are associated with unusual (non-B-form) DNA structures that are often involved in regulation of gene expression, recombination, and other genetic processes (Fedorova & Fedorov 2010). The existence of a strong fixation bias within MRI regions against mutations that tend to reduce their sequence inhomogeneity additionally supports the functionality and importance of these genomic sequences (Prakash et al. 2009).
Here we demonstrate a freely available Internet resource, the Genomic MRI program package, designed for computational analysis of genomic sequences in order to find and characterize various MRI patterns within them (Bechtel et al. 2008). This package also generates randomized sequences with various properties and levels of correspondence to the natural input DNA sequences. The main goal of this resource is to facilitate examination of vast regions of non-coding DNA that are still scarcely investigated and await thorough exploration and recognition.
All programs used in this paper were written in Perl, and all web pages were created with PHP.
1. Starting Point:
Open the home page of the online Genomic MRI package at http://mco321125.meduohio.edu/~jbechtel/gmri/ . The web resource also provides instructions/explanations on the programs in the “Help (How-to/README)” link, while all published materials on Genomic MRI and similar algorithms are listed in the “Links to relevant resources” link.
2. Preparation and Uploading of Input Sequence(s).
Create a file with FASTA-formatted sequence(s) to start a GMRI analysis session. In this format, each nucleotide sequence should be preceded by a single header line that starts with the “>” character, followed by a sequence identifier and, optionally on the same line, a short description of the sequence. Nucleotide sequences for GMRI analysis may also contain characters such as R, Y, N, X, etc. However, characters other than A, T, C, and G will not be processed by the program and will be skipped. Sequences in which repetitive elements have been “masked” (replaced by “N”s) can be used as input. Note that sequence characters are case-insensitive.
NOTE: Henceforth the input sequences are referred to as “userfile”.
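The input conventions described in step 2 can be illustrated with a short Python sketch. It is not the package's own code (the Genomic MRI programs are written in Perl); the function names `read_fasta` and `clean_sequence` are hypothetical. It assumes that “skipped” means that characters other than A, C, G, and T are simply dropped from the sequence.

```python
import io
from typing import Iterator, Tuple

VALID_BASES = set("ACGT")

def read_fasta(handle) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, raw_sequence) pairs from a FASTA-formatted text stream."""
    identifier = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # First word after ">" is the identifier; the rest is the description.
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)

def clean_sequence(raw: str) -> str:
    """Upper-case the sequence and drop every character that is not A, C, G, or T."""
    return "".join(base for base in raw.upper() if base in VALID_BASES)

# Toy input: mixed case, masked positions (N), and ambiguity codes (R, Y, X).
example = io.StringIO(">seq1 toy intron\nacgtNNRYacgt\nGGXcc\n>seq2\nTTTT\n")
cleaned = {name: clean_sequence(seq) for name, seq in read_fasta(example)}
print(cleaned)  # {'seq1': 'ACGTACGTGGCC', 'seq2': 'TTTT'}
```

A local check of this kind may help confirm that a file is well-formed before it is uploaded as “userfile”.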
3. Get an Oligonucleotide Frequency Distribution of the Input Sequences (optional).
Click on the “SRI Analyzer” tab (top row) in order to get a distribution of oligonucleotide frequencies for the entire set of input sequences. The acronym SRI stands for short-range inhomogeneity. At this point, the user may specify the maximum length of oligonucleotides (from 2 up to 9 nucleotides, default 6 nts) for which frequencies will be calculated. This selection is made by clicking on the desired option within the “Maximum oligomer size” list box. Then press the “Analyze File” button to initiate computation. A brief summary of the input sequence composition will immediately appear as a short table in the middle of this web page and can be downloaded as “userfile.comp.tbl”. This table lists only the most and the least abundant oligonucleotides within the input sequences.
The entire frequency table for all possible oligonucleotides is generated as a file named “userfile.comp”, which can be obtained via the “Download composition file” link.
NOTE: SRI analyzer counts the entire set of all overlapping oligonucleotides.
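Counting overlapping oligonucleotides, as noted for step 3, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SRI analyzer's implementation: the function names `count_oligos` and `frequencies` are hypothetical, the input is assumed to contain only A, C, G, and T, and oligonucleotides of every length from 1 up to the chosen maximum are tallied.

```python
from collections import Counter
from typing import Dict

def count_oligos(sequence: str, max_size: int = 6) -> Dict[int, Counter]:
    """Count all overlapping oligonucleotides of length 1..max_size."""
    counts = {k: Counter() for k in range(1, max_size + 1)}
    for k in range(1, max_size + 1):
        # Slide one base at a time so that neighboring oligomers overlap.
        for i in range(len(sequence) - k + 1):
            counts[k][sequence[i:i + k]] += 1
    return counts

def frequencies(counter: Counter) -> Dict[str, float]:
    """Convert raw counts for one oligomer length into relative frequencies."""
    total = sum(counter.values())
    return {oligo: n / total for oligo, n in counter.items()} if total else {}

counts = count_oligos("ACGTACGT", max_size=3)
print(dict(counts[2]))  # {'AC': 2, 'CG': 2, 'GT': 2, 'TA': 1}
print(frequencies(counts[1])["A"])  # 0.25
```

Because the window advances by one base, a sequence of length n contributes n - k + 1 oligomers of length k.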
4. Generate Random Sequences with the Same Oligonucleotide Composition As in the Input Sequences (optional).
(Completion of step 3 of the protocol is required for this task).
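One common way to produce a random sequence whose oligonucleotide composition approximates that of a template is a Markov chain: each new base is drawn according to how often it follows the preceding bases in the input. The Python sketch below illustrates that general idea only. Whether the SRI generator works this way is not stated in this text, and the names `train_markov` and `generate_random`, the `order` parameter, and the fallback rule are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def train_markov(template: str, order: int):
    """For every context of `order` bases, count which base follows it."""
    transitions = defaultdict(Counter)
    for i in range(len(template) - order):
        transitions[template[i:i + order]][template[i + order]] += 1
    return transitions

def generate_random(template: str, order: int = 2, length=None, seed=None) -> str:
    """Generate a random sequence that approximates the template's
    (order + 1)-mer composition (hypothetical helper, not the package's code)."""
    if len(template) <= order:
        raise ValueError("template must be longer than the Markov order")
    rng = random.Random(seed)
    length = len(template) if length is None else length
    transitions = train_markov(template, order)
    base_counts = Counter(template)
    # Seed the chain with a context taken from a random template position.
    start = rng.randrange(len(template) - order + 1)
    out = list(template[start:start + order])
    while len(out) < length:
        context = "".join(out[len(out) - order:])
        # Assumption: if a context was never followed by any base in the
        # template, fall back to the overall base composition.
        followers = transitions.get(context) or base_counts
        bases = list(followers)
        weights = list(followers.values())
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out[:length])

template = "ACGT" * 50 + "AAAA" * 10
randomized = generate_random(template, order=2, seed=1)
print(len(randomized), randomized[:40])
```

A chain of order k - 1 matches k-mer frequencies only approximately and only for sufficiently long sequences; an exact match of composition would require a different technique, such as shuffling.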
5. Analysis of Mid-Range Inhomogeneity (MRI) of Input and Random Sequences.
6. Additional Programs Within the Genomic MRI Package (optional).
The Genomic MRI resource also has two advanced options for generation of very specific random sequences. They are available through the “MRI Generator” and “CDS Generator” tabs in the top row.
7. Representative Results
This protocol allows a user to study compositional inhomogeneity of nucleotide sequences. Importantly, it also supports the generation of a variety of randomized sequences with an oligonucleotide composition approximating that of the input sequences. Usually, genomic sequences of complex eukaryotes are not homogeneous in composition, but rather represent a complex mosaic of sequence segments enriched in particular nucleotides (for example, purine-rich, (G+T)-rich, (A+T)-rich, etc.). These patterns at mid-range scale (30-1000 bp) are visualized by the graphical output of MRI analyzer, which shows selected content-rich segments as upper blue spikes and content-poor segments as lower red spikes (see Figures 1 and 2). Typically, the number of content-rich and content-poor regions of any type in a natural sequence (Figure 1) is considerably higher than the number of the same types of regions in corresponding randomized sequences (Figure 2) that have the same oligonucleotide composition. These sequence segments with mid-range inhomogeneity in nucleotide composition may be of interest to the user. They are available from the Genomic MRI output files for further investigation.
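The idea of content-rich and content-poor segments can be made concrete with a simple window scan. The Python sketch below is an illustration under stated assumptions, not the MRI analyzer's algorithm: the function name `scan_content`, the non-overlapping 50-nt windows, and the 0.7/0.3 thresholds are all hypothetical values chosen for the example.

```python
from typing import List, Tuple

def scan_content(sequence: str, bases: str = "GC", window: int = 50,
                 step: int = 50, upper: float = 0.7,
                 lower: float = 0.3) -> List[Tuple[int, float, str]]:
    """Return (window_index, fraction, label) for each window along the sequence.

    A window is labeled "rich" if the fraction of the selected bases is at or
    above `upper`, "poor" if it is at or below `lower`, otherwise "neutral".
    """
    target = set(bases.upper())
    seq = sequence.upper()
    results = []
    for index, start in enumerate(range(0, len(seq) - window + 1, step)):
        chunk = seq[start:start + window]
        fraction = sum(base in target for base in chunk) / window
        if fraction >= upper:
            label = "rich"
        elif fraction <= lower:
            label = "poor"
        else:
            label = "neutral"
        results.append((index, fraction, label))
    return results

# Toy sequence: a GC-only block, an AT-only block, then a balanced block.
toy = "GC" * 25 + "AT" * 25 + "ACGT" * 25
for row in scan_content(toy):
    print(row)
# (0, 1.0, 'rich'), (1, 0.0, 'poor'), (2, 0.5, 'neutral'), (3, 0.5, 'neutral')
```

Running the same scan on a natural sequence and on a randomized counterpart, and comparing the number of “rich” and “poor” windows, mirrors the comparison between Figures 1 and 2.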
Figure 1. An example of the MRI analyzer graphical output from step 5.7. The results have been obtained on a sample of 44 human introns. Blue bars represent positions of GC-rich regions along these introns. Red bars represent GC-poor (or AT-rich) MRI regions. The y-axis contains upper and lower thresholds for the given content type.
Figure 2. MRI analyzer output for the random sequence “userfile.rand1_4”.
Graphical representation of MRI within a random sequence generated by the SRI generator program.
Figure 3. An example of the beginning of a textual output file from MRI analyzer.
All content-rich and content-poor sequences detected by the program are presented in the last (fourth) column. Their relative positions, measured in the number of windows, are shown in the first column. The second and third columns are indicators for content-rich and content-poor regions, respectively.
Regions with inhomogeneous nucleotide composition at mid-range scales (30-1000 nucleotides) are overabundant in the genomes of complex eukaryotes and can be found anywhere (intergenic regions, introns, untranslated regions of exons, repetitive elements). These regions are frequently associated with unusual DNA conformations. For instance, purine-/pyrimidine-rich sequences tend to form DNA triplexes (H-DNA); sequences with alternating purine/pyrimidine bases are associated with Z-DNA conformations; (G+C)-rich regions exhibit structural abnormalities in B-DNA and could be prone to backbone cleavage; (A+T)-rich regions might form an unusual structure – a DNA unwinding element; etc. (reviewed by Fedorov & Fedorova 2010). Some of these mid-range patterns (e.g. (G+T)-rich regions) are scarcely investigated and still await thorough exploration and recognition. The main aim of our Genomic MRI web resource is to help users in the identification of these MRI regions for their further experimental analysis and for exploration of their possible functions. Knowledge of MRI regions could be incorporated into and improve the new generation of gene predictor programs (Shepard 2010) and advance our understanding of genome functions and properties.
The authors have nothing to disclose.
We are thankful to Samuel Shepard, Peter Bazeley, and John David Bell for the administration of the Genomic MRI web pages. This work was supported by National Science Foundation Career award “Investigation of intron cellular roles” [grant number MCB-0643542].