Sequence specificity is critical for gene regulation, and regulatory proteins that recognize specific sequences are central to it. Defining functional binding sites for such proteins is a challenging biological problem. An iterative approach for identifying the binding site of an RNA-binding protein is described here and is applicable to all RNA-binding proteins.
Gene regulation plays an important role in all cells. Transcriptional, post-transcriptional (or RNA processing), translational, and post-translational steps are used to regulate specific genes. Sequence-specific nucleic acid-binding proteins target specific sequences to control spatial or temporal gene expression. The binding sites in nucleic acids are typically characterized by mutational analysis. However, numerous proteins of interest have no known binding site for such characterization. Here we describe an approach to identify previously unknown binding sites for RNA-binding proteins. It involves iterative selection and amplification of sequences starting with a randomized sequence pool. Following several rounds of these steps (transcription, binding, and amplification), the enriched sequences are sequenced to identify a preferred binding site(s). Success of this approach is monitored using in vitro binding assays. Subsequently, in vitro and in vivo functional assays can be used to assess the biological relevance of the selected sequences. This approach allows identification and characterization of a previously unknown binding site(s) for any RNA-binding protein for which an assay to separate protein-bound and unbound RNAs exists.
Gene regulation plays a central role in cell biology. Genes can be regulated at one or multiple steps along the gene expression pathway. These steps include transcription (initiation, elongation, and termination) as well as splicing, polyadenylation or 3’ end formation, RNA export, mRNA translation, and decay/localization of primary transcripts. At these steps, nucleic acid-binding proteins modulate gene regulation. Identification of binding sites for such proteins is an important aspect of studying gene control. Mutational analysis and phylogenetic sequence comparison have been used to discover regulatory sequences or protein-binding sites in nucleic acids, such as promoters, splice sites, polyadenylation elements, and translational signals1,2,3,4.
Pre-mRNA splicing is an integral step during gene expression and regulation. The majority of mammalian genes, including those in humans, have introns. A large fraction of these transcripts is alternatively spliced, producing multiple mRNA and protein isoforms from the same gene or primary transcript. These isoforms have cell-specific and developmental roles in cell biology. The 5’ splice site, the branch-point, and the polypyrimidine-tract/3’ splice site are critical splicing signals that are subject to regulation. In negative regulation, an otherwise strong splice site is repressed, whereas in positive regulation an otherwise weak splice site is activated. A combination of these events produces a plethora of functionally distinct isoforms. RNA-binding proteins play key roles in these alternative splicing events.
Numerous proteins are known whose binding site(s) or RNA targets remain to be identified5, 6. Linking regulatory proteins to their downstream biological targets or sequences is often a complex process. For such proteins, identification of their target RNA or binding site is an important step in defining their biological functions. Once a binding site is identified, it can be further characterized using standard molecular and biochemical analyses.
The approach described here has two advantages. First, it can identify a previously unknown binding site for a protein of interest. Second, it simultaneously performs the equivalent of saturation mutagenesis, yielding information about sequence requirements within the binding site that would otherwise be labor intensive to obtain. Thus, it offers a quicker, easier, and less costly tool to identify protein binding sites in RNA. Originally, this approach (SELEX or Systematic Evolution of Ligands by EXponential enrichment) was used to characterize the binding site for the bacteriophage T4 DNA polymerase (gene 43 protein), which overlaps with the ribosome binding site in its own mRNA. The binding site contains an 8-base loop sequence, representing 65,536 (4^8) randomized variants for analysis7. The approach was also independently used to show that specific binding sites or aptamers for different dyes can be selected from a pool of approximately 10^13 sequences8. In fact, this approach has been broadly used in many different contexts to identify aptamers (RNA or DNA sequences) for binding numerous ligands, such as proteins, small molecules, and cells, and for catalysis9. As an example, an aptamer can discriminate between two xanthine derivatives, caffeine and theophylline, which differ by the presence of one methyl group in caffeine10. We have extensively used this approach (SELEX or iterative selection-amplification) to study how RNA-binding proteins function in splicing or splicing regulation11, which will be the basis for the discussion below.
The random library: We used a random library of 31 nucleotides. The length consideration for the random library was loosely based on the idea that the general splicing factor U2AF65 binds to a sequence between the branch-point sequence and the 3’ splice site. On average, the spacing between these splicing signals in metazoans is in the range of 20 to 40 nucleotides. Another protein, Sex-lethal, was known to bind to a poorly characterized regulatory sequence near the 3’ splice site of its target pre-mRNA, transformer. Thus, we chose a random region of 31 nucleotides, flanked by primer binding sites with restriction enzyme sites to allow for PCR amplification and attachment of the T7 RNA polymerase promoter for in vitro transcription. The theoretical library size or complexity was 4^31, or approximately 10^18. We used a small fraction of this library to prepare our random RNA pool (~10^12 to 10^15) for the experiments described below.
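The library arithmetic above can be checked with a short calculation. The following is a minimal sketch that assumes every randomized position is an equal mix of the four nucleotides and that every molecule in the pool is unique, so the sampled fraction is an upper bound. The function names are illustrative and are not part of the protocol.

```python
# Library complexity for a fully randomized region.
# Assumption: each position is an unbiased mix of A, C, G, and U.

def library_complexity(random_length: int, alphabet_size: int = 4) -> int:
    """Number of distinct sequences in a fully randomized region."""
    return alphabet_size ** random_length


def fraction_sampled(pool_molecules: float, random_length: int) -> float:
    """Upper bound on the fraction of sequence space present in a pool.

    Assumption: every molecule in the pool carries a different sequence.
    """
    return min(1.0, pool_molecules / library_complexity(random_length))


if __name__ == "__main__":
    # 8-base loop used in the original T4 DNA polymerase selection
    print(library_complexity(8))            # 65536
    # 31-nucleotide random region used here
    print(f"{library_complexity(31):.1e}")  # 4.6e+18
    for pool in (1e12, 1e15):
        share = fraction_sampled(pool, 31)
        print(f"{pool:.0e} molecules cover at most {share:.1e} of sequence space")
```

Even the largest pool in this range covers well under 0.1% of the 31-nucleotide sequence space, which is why the starting pool is described as a small fraction of the theoretical library.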
NOTE: Figure 1 provides a summary of key steps in the iterative selection-amplification (SELEX) process.
1. Generation of a random library template
2. Generation of the DNA random library pool
3. Synthesis of pool 0 RNA
4. Protein binding reaction and separation of bound RNA
5. Reverse transcription and PCR amplification
6. Transcription and protein binding
7. Analysis of RNA-protein interactions
8. Cloning and sequencing
9. Sequence alignment
The following observations demonstrate successful selection-amplification (SELEX). First, we analyzed pool 0 and the selected sequences for binding to the protein used for the iterative selection-amplification approach. Figure 2 shows that the mammalian polypyrimidine-tract binding protein (PTB) has barely detectable binding to the pool 0 sequence but high affinity for the selected sequence pool. Binding to pool 0 remained barely detectable even when the protein concentration was about 300-fold higher than that used for the selected pool. Thus, there was at least a several hundred-fold difference in protein binding affinities between the random or starting pool and the selected pool. This observation experimentally confirms that the selection-amplification protocol described here is successful.
Second, we sequenced the selected pool and determined a consensus binding site. The consensus sequence obtained from alignment of the majority of selected sequences from the mammalian PTB-selected pool was: GCCUG(Y/G)UGCYYYYCYYYG(Y/G)CCC. This shows that we have selected unique pyrimidine-rich sequences that bind PTB11. When we performed iterative selection-amplification for the RNA-binding domain of the Drosophila PTB, we enriched CU-rich sequences interrupted by guanosines. Among the high affinity sequences that the Drosophila PTB selected was a sequence with 84% pyrimidine content: GCUUUCCUCUGUCGCCCUUCUUCGUCCCCUG. In fact, this sequence is similar to the pyrimidine-rich sequence present in the alpha-tropomyosin intron, which binds with high affinity to and is regulated by the mammalian PTB15. We have successfully used this approach repeatedly to study the RNA-binding properties and functions of splicing regulators and a splicing factor11, 15, 16. Table 1 shows successful examples of RNA-binding proteins for which SELEX was used to identify their preferred or consensus binding site(s).
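The pyrimidine content and the consensus notation quoted above can be made concrete in a few lines. This is only a sketch: it reads Y as C or U and (Y/G) as C, U, or G, and it treats the consensus as a strict pattern, whereas the consensus in the text summarizes the majority of selected sequences rather than all of them. The function names are hypothetical.

```python
import re

PYRIMIDINES = set("CU")

# Sequences quoted in the text
PTB_CONSENSUS = "GCCUG(Y/G)UGCYYYYCYYYG(Y/G)CCC"
DROSOPHILA_PTB_SELECTED = "GCUUUCCUCUGUCGCCCUUCUUCGUCCCCUG"


def pyrimidine_fraction(rna: str) -> float:
    """Fraction of positions that are C or U."""
    rna = rna.upper()
    return sum(base in PYRIMIDINES for base in rna) / len(rna)


def consensus_to_regex(consensus: str) -> str:
    """Translate the consensus notation into a regular expression.

    Assumption: Y means C or U, and (Y/G) means C, U, or G.
    All other letters are matched literally.
    """
    # Replace the bracketed alternative first so its Y is not expanded twice.
    return consensus.replace("(Y/G)", "[CUG]").replace("Y", "[CU]")


if __name__ == "__main__":
    print(len(DROSOPHILA_PTB_SELECTED))                          # 31
    print(f"{pyrimidine_fraction(DROSOPHILA_PTB_SELECTED):.0%}")  # 84%
    pattern = re.compile(consensus_to_regex(PTB_CONSENSUS))
    print(pattern.pattern)
```

The compiled pattern can then be applied with `pattern.search(sequence)` to flag cloned sequences that contain a strict consensus match; a position weight matrix would be a more tolerant alternative.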
Third, an in vitro splicing assay, which is based on alternative 3' splice site choice, shows the functional relevance of the distinct but overlapping RNA-binding specificities of polypyrimidine-tract binding proteins. Whereas an upstream 3' splice site is used by default, addition of the recombinant PTB leads to activation of the alternative or downstream 3' splice site (Figure 3). In contrast, addition of recombinant hnRNP C1 (ref. 17) leads to repression of both 3' splice sites. Addition of the recombinant general splicing factor U2AF65 reverses the hnRNP C1-mediated 3' splice site repression (Figure 3) as well as the PTB-mediated effect on downstream 3' splice site activation (data not shown). A simple explanation for these effects is a direct competition between the binding of the general splicing factor U2AF65 and PTB (also called hnRNP I), which preferentially binds to and represses certain 3' splice sites, or between U2AF65 and hnRNP C, which binds to and represses both 3' splice sites.
Figure 1: Summary of key steps in iterative selection-amplification process (SELEX).
Figure 2: Enrichment of PTB-binding RNAs. Increasing concentration (filled triangles) of recombinant PTB was used with either radiolabeled pool 0 RNA or the selected pool obtained following six rounds of selection and amplification. Positions of unbound RNA and the RNA:protein complex are indicated.
Figure 3: Splice site switching assay validates distinct binding specificities of pyrimidine-binding proteins. (A-Top) Schematics of the splicing substrate. The splicing substrate contains a 5' splice site and two alternative 3' splice sites flanking the intron. Rectangles (open, with horizontal lines, and solid) are exons and the line is an intron. (A-Bottom) hnRNP C1 represses the upstream 3' splice site (without activation of the downstream 3' splice site), whereas PTB leads to activation of the downstream 3' splice site. The splicing substrate was incubated in a HeLa cell nuclear extract. The splicing products (shown on the sides) were analyzed using a primer extension assay18 with splice-junction primers (arrows), which recognize splicing of the common 5' splice site to either the upstream or the downstream 3' splice site. (B) Recombinant U2AF65 (rU2AF65) reverses the repressive effect of hnRNP C1. Addition of the recombinant hnRNP C1, PTB, or rU2AF65 proteins to the splicing reaction is indicated by the + symbols.
Protein | Preferred sequence(s) |
U2AF65 | U-rich containing Cs |
SXL | U-rich containing 2-4 Gs |
PTB | UCUUC-rich with some Gs |
hnRNP C1 | U-rich (5-6 long) |
CstF64 | GU-rich |
hnRNP E1/E2 and K | C-rich |
U2AF65/U2AF35 heterodimer | UUUYYYYUNUAGGU |
Table 1: Preferred binding sites for some RNA-binding proteins.
Nucleic acid-binding proteins are important regulators of animal and plant development. A key requirement for the SELEX procedure is the development of an assay that can be used to separate protein-bound and unbound RNA fractions. In principle, this assay can be an in vitro binding assay such as the filter-binding assay, the gel mobility shift assay, or a matrix binding assay19 for recombinant proteins, purified proteins, or protein complexes. The assay can also be an enzymatic assay where the precursors and products (or intermediates) can be separated based on size or some other means20.
While mutagenesis has been widely used to characterize binding sites for proteins, it is laborious and time consuming, and longer sequences are not easily amenable to saturation mutagenesis. The iterative binding and amplification approach described here overcomes some of these limitations. Most importantly, it can identify previously unknown binding sites and, at the same time, provide information about nucleotide requirements at each position.
An important consideration for the success of iterative selection-amplification is binding affinity and specificity. Typically, 12 to 15 rounds of selection-amplification are employed and a sequence space of 10^12 to 10^15 molecules can be routinely sampled. The progress and eventual success of the selection-amplification protocol can be monitored using a binding assay or direct sequencing, which monitor affinity for or enrichment of specific sequences in intermediate pools, respectively. While the binding assay was traditionally used, the advent of next-generation sequencing allows analysis of sequence enrichment in ways not possible by manual Sanger sequencing14.
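One common way to follow enrichment by sequencing is to compare k-mer frequencies between the starting pool and a selected pool. The sketch below is an illustration only, not the analysis used in the cited work: the reads are made-up toy sequences, and the window length `k` and the pseudocount are arbitrary assumptions.

```python
from collections import Counter


def kmer_frequencies(sequences, k):
    """Relative frequency of every k-mer across a list of reads."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def kmer_enrichment(selected, starting, k, pseudocount=1e-6):
    """Ratio of k-mer frequency in the selected pool over the starting pool.

    Assumption: a small pseudocount avoids division by zero for k-mers
    that are absent from one of the pools.
    """
    sel = kmer_frequencies(selected, k)
    start = kmer_frequencies(starting, k)
    return {
        kmer: (sel.get(kmer, 0.0) + pseudocount) / (start.get(kmer, 0.0) + pseudocount)
        for kmer in set(sel) | set(start)
    }


if __name__ == "__main__":
    # Hypothetical toy reads, for illustration only (not experimental data)
    starting_pool = ["ACGUACGUAG", "GGAUCCAUGA"]
    selected_pool = ["UCUUCUCUUC", "CUCUUCUUCG"]
    ratios = kmer_enrichment(selected_pool, starting_pool, k=5)
    for kmer, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:3]:
        print(kmer, f"{ratio:.2e}")
```

With real data, the same comparison run on each intermediate pool shows whether a motif keeps rising from round to round or has reached a plateau.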
A critical step in the success of SELEX is fold enrichment of the desired molecules at each step. The number of cycles required for SELEX varies and depends on several factors. For example, if fold-enrichment of desired or specific sequences is higher in each round, fewer rounds will be sufficient. However, if an assay allows a high proportion of undesired sequences in the bound pool, additional rounds will become necessary to enrich desired RNA sequences. A limitation of the technique or an unintended consequence of the need for additional cycles of selection-amplification that must be kept in mind is the possibility that it might introduce artefacts or enrich sequences that have unrelated properties such as their ability to amplify. Finally, while some applications benefit from the highest affinity binders, for other uses, a balance must be struck during the selection-amplification process between binding affinity and function because tightest binding sequences might not necessarily be the most functional sequences in biological contexts (e.g., if a sequence is recognized multiple times by different proteins during splicing).
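The relationship between fold-enrichment per round and the number of rounds can be illustrated with a simple model. The sketch assumes a constant fold-enrichment in every round, applied to the odds of specific versus non-specific sequences; real selections rarely behave this uniformly, and the starting fraction and enrichment values below are hypothetical.

```python
import math


def fraction_after_rounds(start_fraction: float, fold_enrichment: float, rounds: int) -> float:
    """Fraction of specific sequences after a number of rounds.

    Assumption: each round multiplies the odds (specific : non-specific)
    by the same fold-enrichment.
    """
    odds = start_fraction / (1.0 - start_fraction) * fold_enrichment ** rounds
    return odds / (1.0 + odds)


def rounds_needed(start_fraction: float, fold_enrichment: float, target_fraction: float) -> int:
    """Smallest number of rounds that reaches the target fraction."""
    start_odds = start_fraction / (1.0 - start_fraction)
    target_odds = target_fraction / (1.0 - target_fraction)
    rounds = math.log(target_odds / start_odds) / math.log(fold_enrichment)
    return max(0, math.ceil(rounds))


if __name__ == "__main__":
    # Hypothetical numbers: one specific binder per 10^9 molecules at the start
    print(rounds_needed(1e-9, 100, 0.5))  # 5
    print(rounds_needed(1e-9, 10, 0.5))   # 9
    print(f"{fraction_after_rounds(1e-9, 100, 5):.2f}")  # 0.91
```

Under this model a tenfold drop in per-round enrichment nearly doubles the number of rounds, which matches the qualitative point above that a partitioning assay with higher background requires additional cycles.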
Among the modifications and troubleshooting steps that can improve the procedure, negative selection or counter selection can be employed to increase specificity. Similarly, use of different partitioning protocols, such as the filter binding assay followed by the gel mobility shift assay, can eliminate enrichment of unwanted sequences that bind, for example, to the nitrocellulose filter or a column matrix21. Given that protein-nucleic acid interactions have both specific and non-specific components, buffer conditions such as salt and pH affect RNA-protein interactions. Moreover, the protein concentration used can have a direct effect on retention of strong, weak, and non-specific binders. Selection pressure can be increased in successive rounds, for example, by including a competitor RNA, reducing protein concentration, or reducing the time of incubation. Thus, careful consideration and optimization of these parameters can affect the outcome of the SELEX protocol.
Recently, many variations or modifications of the original SELEX protocol have been developed that overcome some of the limitations mentioned above. These include high throughput-SELEX (HT-SELEX), which combines SELEX and massively parallel sequencing6; RNAcompete, which involves incubation with excess non-random RNA, pull-down of the bound RNA, fluorescent labeling of RNA, and analysis on microarrays5; RNA Bind-n-Seq, which performs RNA affinity analysis in a quantitative and high-throughput fashion22; and RAPID-SELEX, which shortens the process and includes a non-amplification step23.
Chemically modified bases have been used to expand the repertoire of the RNA molecules for specific applications24. Diagnostics, therapeutics, and molecules with catalytic activities are among the many applications (including in medicine) of the selected molecules25. Aptamers complement antibody-based protocols and provide excellent tools whose potential, for example, in diagnostics, therapeutics, and other applications, remains to be fully exploited26,27,28. Looking ahead, clinical benefits beyond what the first FDA-approved aptamer (pegaptanib sodium) could deliver for age-related macular degeneration are among the numerous desired applications. The scalable proteomic technology for protein measurements offers a step toward understanding health and disease24.
The authors have nothing to disclose.
The author thanks the National Institutes of Health for past funding.
Name | Company | Catalog Number | Comments |
Gel Electrophoresis equipment | Standard | Standard | |
Glass Plates | Standard | Standard | |
Nitrocellulose | Millipore | HAWP | |
Nitrocellulose | Schleicher & Schuell | PROTRAN | |
polyacrylamide gel solutions | Standard | Standard | |
Proteinase K | NEB | P8107S | |
Recombinant PTB | Laboratory Preparation | Not applicable | |
Reverse Transcriptase | NEB | M0277S | |
Vacuum manifold | Fisher Scientific | XX1002500 | Millipore 25mm Glass Microanalysis Vacuum Filter |
Vacuum manifold | Millipore | XX2702552 | 1225 Sampling Vacuum Manifold |
X-ray films | Standard | Standard |