This protocol describes a method for mapping pre-mRNA 3' end processing sites.
Studies in the last decade have revealed a complex and dynamic variety of pre-mRNA cleavage and polyadenylation reactions. mRNAs with long 3' untranslated regions (UTRs) are generated in differentiated cells whereas proliferating cells preferentially express transcripts with short 3'UTRs. We describe the A-seq protocol, now at its second version, which was developed to map polyadenylation sites genome-wide and study the regulation of pre-mRNA 3' end processing. Also this current protocol takes advantage of the polyadenylate (poly(A)) tails that are added during the biogenesis of most mammalian mRNAs to enrich for fully processed mRNAs. A DNA adaptor with deoxyuracil at its fourth position allows the precise processing of mRNA 3' end fragments for sequencing. Not including the cell culture and the overnight ligations, the protocol requires about 8 h hands-on time. Along with it, an easy-to-use software package for the analysis of the derived sequencing data is provided. A-seq2 and the associated analysis software provide an efficient and reliable solution to the mapping of pre-mRNA 3' ends in a wide range of conditions, from 106 or fewer cells.
The capture and sequencing of mRNA 3' ends allows the study of mRNA processing and the quantification of gene expression. Due to their poly(A) tails, eukaryotic mRNAs can be efficiently purified from total cell lysates with bead-immobilized oligo-deoxythymidine (oligo(dT)) molecules, which can also prime cDNA synthesis. However, this approach has two drawbacks. First, stretches of A's that are internal to transcripts can also prime cDNA synthesis, resulting in spurious poly(A) sites. Second, homogeneous poly(A) stretches pose specific challenges for sequencing, aside from not being informative for transcript identification. Various approaches have been proposed to circumvent these limitations, such as reverse transcription through poly(A) tails followed by RNase H digestion (3P-seq 1), use of a custom sequencing primer ending in 20 Ts (2P-seq 2), preselection of RNA fragments with poly(A) tails of over 50 nucleotides with a CU5T45 primer followed by RNase H digestion (3'READS 3), and the use of a oligo-dT primer that contains the 3' adapter in a hairpin (A-seq 4).
The recently developed A-seq2 method 5 aims to bypass sequencing through poly(A) and at the same time to minimize the proportion of dimers that are generated by self-ligation of adapters, particularly occurring when the molar concentration of adapters outweighs the insert concentration. This problem can be eliminated when both adapters are ligated to the same type of polynucleotide ends as in A-seq2, where the 3' adapters are ligated to the 5' end of RNA fragments and the 5' adapters to 5' ends of the cDNAs after reverse transcription. The method is more convenient than our previously proposed A-seq – in which sequencing was in the 5'-to-3' direction thereby requiring precisely controlled RNA fragmentation-, while maintaining a high accuracy of poly(A) site identification. Around 80 % of the sequenced reads in typical samples map uniquely to the genome and lead to identification of over 20,000 poly(A) site clusters, more than 70 % of which overlapping with annotated 3'UTRs.
In brief, the A-seq2 protocol starts with mRNA fragmentation and ligation of reverse-complement 3' adapters to the 5' ends of RNA fragments. Poly(A)-containing RNAs are then reverse transcribed with a 25 nucleotide (nt) long oligo(dT) primer that contains an anchor nucleotide at the 3' end, a dU at position 4 and a biotin at the 5' end, allowing binding of the cDNA to magnetic streptavidin beads. Most of the primer, including the biotin, is removed from the cDNA by cleavage at dU by the USER enzyme mix, containing Uracil DNA glycosylase (UDG) and the DNA glycosylase-lyase Endonuclease VIII. This reaction leaves intact ends for the ligation of a 5' adapter, and the three Ts left after cleavage remain to mark the location of the poly(A) tail. Because both 5' and 3' adapters are attached by ligation to recipient 5' ends, no adapter dimers are generated. Four nucleotide random-mers introduced at the beginning of reads allows cluster resolution on state-of-the-art sequencing instruments and can also serve as unique molecular identifier (UMI) for the detection and removal of PCR amplification artifacts. The size of the UMI can be further increased as done in other studies 6. The protocol generates reads that are reverse complementary to mRNA 3' ends, all starting with a randomized tetramer followed by 3 Ts. Processing of reads that have the 3 diagnostic Ts at their 5' end starts with the correction of PCR amplification artifacts by exploiting the UMIs, removal of 3' adapter sequences, and reverse complementation. Reads that may have originated from oligo(dT) priming at internal A-rich sites are also identified computationally and discarded. The spurious sites generally lack one of the 18 well-characterized and conserved poly(A) signals which should be located ~21 nucleotides upstream of the apparent cleavage site 7.
The protocol requires about 8 h hands-on time, not counting cell culture and the overnight ligations. The associated read analysis software enables a highly accurate poly(A) site identification. From the poly(A) site clusters created based on 4 samples further highlighted in this manuscript (two biological replicates of control siRNA and si-HNRNPC-treated cells) 84% overlap with an annotated gene, and of these, 75 % overlap with a 3' UTR, and 86 % with either a 3' UTR or a terminal exon. The Pearson correlation coefficient of expression of 3' ends in the replicate samples is 0.92, and values of over 0.9 are typically obtained with the method. Thus, A-seq2 is a convenient method that gives very reproducible results.
1. Cell Growth and mRNA Isolation
2. 5' end Phosphorylation and DNase Treatment
3. Blocking 3' Ends with Cordycepin Triphosphate
NOTE: It is essential to block the 3' ends of RNA fragments to avoid their concatemerization in the subsequent ligation reactions. 3' ends that are not already blocked by a (cyclic) phosphate after hydrolysis are treated by addition of a 3' dATP (cordycepin triphosphate) chain terminator nucleotide with the aid of poly(A) polymerase. Here, yeast poly(A) polymerase (yPAP), that was expressed and purified as described in 8 was used at a concentration of 0.5 mg/mL. Yeast or E. coli PAP both have almost the same activity for addition of 3'dATP and can be purchased commercially (see the Table of Materials).
4. Ligation of Reverse 3' Adapters to the 5' End of RNA Fragments
5. Reverse Transcription (RT)
6. Digestion with Uracil DNA Glycosylase Enzyme Mix
7. Ligation of 5' Adapters to 5' Ends of cDNA
8. Pilot PCR, Amplification of Libraries and Size Selection
9. Data Processing
NOTE: The resulting sequencing data (in fastq format) are processed with software available in the gitlab repository (https://git.scicore.unibas.ch/zavolan_public/A-seq2-processing). The analysis includes four main steps: (1) downloading the git repository, (2) installation of a virtual environment, (3) setting specific parameters in the configuration file and (4) launching the analysis through ‘snakemake’ 10. The entire analysis done in step 4 requires only one command. A detailed step-by-step description of the analysis can be found in the README file in the gitlab repository and a short description is available below. All individual processing steps are accomplished by the execution of publicly available tools, either from external sources or prepared in-house. The computational pipeline depends on an anaconda-based 11 python 3 virtual environment with the snakemake package available 10. It runs on machines with Unix-like operating system and was tested in a Linux environment with the CentOS 6.5 operating system installed and 40 GB RAM available.Software dependencies are automatically controlled within the virtual environment. The following publicly available software tools are required and thereby installed together with the environment: snakemake (v3.9.1) 10,fastx toolkit (v0.0.14) 12, STAR (v2.5.2a) 13, cutadapt (v1.12) 14, samtools (v1.3.1) 14,15, bedtools (v2.26.0) 16,17.
Poly(A)-containing RNA was isolated from cultured cells, fragmented by alkaline hydrolysis and cDNAs were made by reverse transcription with oligo(dT) primers. The resulting cDNA was immobilized on streptavidin beads, dU was cleaved in the uracil specific excision reaction, adapters were ligated to 5' and 3' ends of the cleaved fragment and the inserts were sequenced. Figure 1 depicts a graphical outline of the experiment.
For HeLa and HEK293 cells, 106 cells were sufficient to identify poly(A) sites for the vast majority of protein-coding genes at the end of the procedure. However, for other cell types or tissues it may be necessary to test the saturation in the number of identified poly(A) sites as the number of cells used in the experiment increases. Representative results of the pilot PCR step and of the DNA fragment analysis of the sample before sequencing are shown in Figure 2.
Figure 3 shows the pre-processing steps of the computational analysis, starting from the fastq file obtained from the sequencer and ending with the quality-checked, adapter-trimmed reads that are ready to be mapped to the genome. Figure 4 shows the analysis steps that start with the mapping of the reads to the corresponding genome and end with the catalog of mRNA 3' end processing sites that are identified in a particular sample.When multiple samples are analyzed, additional steps are carried out to match the 3' end processing sites that were found in individual samples and report their abundance across samples. These steps are shown in Figure 5.
Thus, once samples have been sequenced, the analysis of the resulting sequencing read files (in fastq format) through the available processing pipeline is straightforward. After adding the information about the samples to the configuration file, the execution of the pipeline will result in two main types of output files: 1) BED-files with all 3' end processing sites identified in individual samples (e.g. "sample1.3pSites.noIP.bed.gz"), and 2) a BED-file with all poly(A) site clusters (clusters.merged.bed) across all samples of the study. The output also includes the genome coordinates for all reads from each individual sample (e.g. "sample1.STAR_out/Aligned.sortedByCoord.out.bam") that can later be viewed in a genome browser like IGV16. Visual inspection of the read profile(s) generally provides a first glimpse of the distribution of poly(A) sites in the genome and the changes that occur upon the specific perturbations that were carried out in the study. For example, in Figure 6 the response of a specific gene to the knock-down of the HNRNPC protein is shown.
Summaries of these genome-wide distributions are also provided (Table 1). Specifically, output files in the "counts/annotation_overlap" directory contain fractions of sites that overlap with specific annotated features (from the gtf file provided as input; annotated are: 3' UTR, terminal exon, exon, intron, intergenic). Finally, for each sample, results of individual processing steps are also saved (e.g. "sample1.summary.tsv"). This includes the numbers of:raw reads in each sample, reads that have the expected structure of the 5' end, reads that remain after collapsing full PCR duplicates, high-quality reads according to the criteria defined at step 9.2, reads that map uniquely to the genome (after collapsing those that resulted from sequencing errors, see step 9.5), multi-mapping reads (after collapsing those that resulted from sequencing errors, see step 9.5), raw (not clustered) 3' end processing sites in each sample, raw 3' end processing sites without potential internal priming candidates, unique 3' end processing sites from all samples without internal priming candidates, and final set of poly(A) site clusters.
Figure 1: Main steps of the A-seq2 protocol. Individual steps are indicated on the left side of the figure. Insert RNA fragments are depicted as green lines that turn red for cDNA after reverse transcription; adapters are colored in light blue or in orange. Please click here to view a larger version of this figure.
Figure 2: Pilot PCR and final product profile. (a) Aliquots from the PCR reaction were collected at different cycles and separated on 2% agarose gels. Numbers to the left indicate size in nucleotides of the respective bands in the DNA ladder. In this experiment 12 cycles (*) were chosen for the large scale PCR reaction. (b) Example of a sample after size selection run on a fragment size analyzer revealing an average size of around 280 nucleotides. Numbers to the left [FU] indicate relative signal intensity. Please click here to view a larger version of this figure.
Figure 3: Outline of the pre-processing of sequencing reads. The fastq files with reads that are generated by the sequencing instrument-associated software are processed to identify high-quality reads that will be mapped to the corresponding genome. The figure shows the input/output specification of individual steps in the pipeline, with links to the individual steps of the protocol described in section "Data processing". Please click here to view a larger version of this figure.
Figure 4: Outline of sequence read processing, from the step of mapping to the genome to the generation of individual 3' end processing sites. The figure shows the input/output specification of individual steps in the pipeline, with links to the individual steps of the protocol described in section "Data processing". The main output file that is delivered to the user is marked in bold. Please click here to view a larger version of this figure.
Figure 5: Outline of the steps that are taken to generate clusters of co-regulated 3' end sequencing sites. The figure shows the input/output specification of individual steps in the pipeline, with links to the individual steps of the protocol described in section "Data processing". The main output file is marked in bold. Please click here to view a larger version of this figure.
Figure 6: Example results of the profile of 3' end processing reads along the terminal exon of the NUP214 gene, shown in the IGV 16 genome browser. A-seq2 reads were prepared from two samples of HEK 293 cells, treated either with a control-siRNA or with an HNRNPC siRNA. The reads that documented poly(A) sites that were annotated by the analysis pipeline were saved in BAM format that was used as input to the IGV genome browser. The 3' ends of the read peaks map to mRNA 3' ends that are annotated in Ensembl. The profiles indicate an increased use of the long 3' UTR isoform upon HNRNPC knock-down. Please click here to view a larger version of this figure.
si-Control replicate 1 | si-Control replicate 2 | |
id: 29765 | id: 32682 | |
number of raw reads | 44210258 | 68570640 |
number of valid reads after trimming and filtering | 14024538 | 21211793 |
number of uniquely mapping reads | 6953674 | 13946436 |
number of reads mapping to multiple loci | 2040646 | 2925839 |
number of individual 3' end processing sites | 1107493 | 1710353 |
Table 1: Example output of the analysis pipeline. Summaries of reads that were obtained at individual steps.
The multitude of core and auxiliary factors that are involved in pre-mRNA 3' end processing is reflected in a correspondingly complex polyadenylation landscape. Additionally, polyadenylation is also responsive to changes in other processes such as transcription and splicing. 3' end cleavage sites in pre-mRNAs are typically identified based on the characteristic poly(A) tails that are added to the 5' cleavage products. Most methods use oligo(dT) primers of variable lengths that allow the specific conversion of poly(A)-containing mRNAs to cDNAs in a reverse transcription reaction. A common problem of this approach is internal priming to A-rich sequences resulting in artifactual cleavage sites. Two methods that aim to circumvent this artifact at the stage of sample preparation have been proposed. In the 3P-seq method 1, adapters are specifically ligated to the ends of poly(A) tails with help of a splint oligo followed by partial RNase T1 digestion and reverse transcription with TTP in the reaction as the only deoxynucleotide. The resulting poly(A)-poly(dT) heteroduplexes are then digested with RNase H and the remaining RNA fragments are isolated, ligated to adapters, and sequenced. A simpler and elegant method, 2P-seq, that uses a custom sequencing primer skipping the remaining oligo(dT) stretch in the sequencing reaction was reported by the same authors 2. In a related method, 3'READS 3, an unusually long primer of 5 Us and 45 Ts, also containing a biotin is annealed to fragmented RNA, followed by stringent washes to select for RNA molecules with poly(A) tails of over 50 nucleotides. Although 3'READS drastically reduces the frequency of internal priming, it does not completely eliminate it 3. Protocols for direct RNA sequencing have also been proposed, but the resulting reads are short and have a high rate of error and this approach has not been further developed 18,19,20. The PolyA-Seq and the commercialized Quant Seq protocols combine oligo(dT) based priming with a random priming step for the cDNA second strand synthesis 20. The use of the template switch reverse transcription reaction with the Moloney Murine Leukemia Virus (MMLV) reverse transcriptase leads to the generation of cDNAs with linkers in a single step and thereby no adapter dimers can appear in the PAS-Seq and SAPAS methods 21,22.
The A-seq2 method presented here stands out in its utilization of a cleavable nucleotide (dU) within a biotinylated oligo(dT) primer. This modification combines the utility of enriching oligo(dT) hybridized, polyadenylated targets with the removal of most of the oligo (dT)25 sequence from the isolated fragments before libraries are prepared and the preservation of three Ts, which indicate the prior presence of the poly(A) tail. In contrast, methods that utilize RNase H to remove poly(A) from the RNA molecules randomly leave several As. Since in A-seq2, sequencing is done from the 3' end of the anti-sense strands, cleavage sites are predicted to be located after the NNNNTTT motif at the beginning of raw sequence reads. The randomized tetramers serve not only to allow base calling but also in the elimination of PCR amplification artifacts. Longer UMIs can also be accommodated. The possibility of internal priming remains in A-seq2 and is addressed computationally, first by discarding 3' ends with a genomically-encoded, A-rich downstream sequence and then by discarding 3' end clusters that could be explained by internal priming at the A-rich poly(A) signal itself. A recent analysis of poly(A) sites inferred uniquely by a large number of protocols indicates that the sites that are unique to A-seq2 have the expected nucleotide distribution and location within genes, similar to other 3' end sequencing protocols.
A critical step in A-seq2 is the selection of polyadenylated RNA and removal of ribosomal RNAs and various small RNAs. This is most easily done by an mRNA-isolation kit with oligo (dT)25 magnetic beads. In principle, total RNA isolated with phenol containing solutions also gives high quality RNA that can be further subjected to selection by the mRNA-isolation kit or oligo (dT) agarose. A step that can be varied in A-seq2 is the treatment with alkaline hydrolysis which can be shortened or extended to obtain RNA fragments of different sizes. Critical is also that addition of 3'dATP to 3' ends of RNA fragments by the poly(A) polymerase is efficient. In the protocol described here, this treatment is applied to all RNA fragments, to avoid concatemerization during the ligation reaction. Finally, we note that although RNA ligase 1 is normally used as an RNA ligase, it also ligates efficiently single stranded DNA, as we have done here to ligate an adapter to the 5' end of the cDNA molecules.
Thus, A-seq2 is an efficient and easy to implement protocol for the identification of pre-mRNA 3' end processing sites. Future developments could include further reducing the complexity of the protocol and the amount of required material. The associated set of computational data analysis tools further enable the homogeneous processing of 3' end sequencing reads obtained with a wide range of protocols.
The authors have nothing to disclose.
The authors thank Mrs. Béatrice Dimitriades for help with the cell culture. This work was supported by the Swiss National Science Foundation grants #31003A_170216 and 51NF40_141735 (NCCR RNA & Disease).
Materials | |||
Agarose, ultra pure | Invitrogen | 16500-500 | |
2100 Bioanalyzer | Agilent | G2940CA | |
Cordycepin triphosphate (3’ dATP) | SIGMA | C9137 | |
DNA low bind vials, 1.5 ml | Eppendorf | 22431021 | |
Dulbecco’s Phosphate Buffered Saline | SIGMA | D8637 | |
Dynabeads mRNA-DIRECT Kit | Ambion | AM61012 | |
GR-Green dye | Excellgen | EG-1071 | use 1:10,000 dillution |
HiSeq 2500 or NextSeq 500 next generation sequencers | Illumina | inquire with supplier | |
KAPA HiFi Hotstart DNA polymerase mix | KAPA/Roche | KK2602 | |
Nuclease free water | Ambion | AM9937 | |
Poly(A) polymerase, yeast | Thermo Fisher Scientific | 74225Z25KU | |
Poly(A) polymerase, E.coli | New England Biolabs | M0276L | |
Polynucleotide kinase | Thermo Fisher Scientific | EK0032 | |
QIAEX II Gel Extraction Kit | Qiagen | 20021 | |
QIAquick PCR Purification Kit | Qiagen | 28104 | |
QIAquick Gel Extraction Kit | Qiagen | 28704 | |
RNA ligase 1, high concentration | New England Biolabs | M0437M | includes PEG-8000 |
RNeasy MinElute RNA Cleanup kit | Qiagen | 74204 | |
RNase H | New England Biolabs | M0279 | |
RNasin Plus, ribonuclease inhibitor | Promega | N2618 | |
Superscript IV reverse transcriptase | Thermo Fisher Scientiific | 18090050 | |
Turbo DNase | Ambion | AM2238 | |
USER enzyme mix | New England Biolabs | M5505 | |
Dyna-Mag-2 magnetic rack | Thermo Fisher Scientific | 12321D | |
Thermomixer C | Eppendorf | 5382000015 | Heated mixer with heated lid |
MicroSpin columns | GE-Healthcare | 27-5325-01 | |
Name | Company | Catalog Number | Yorumlar |
Buffers | |||
Alkaline hydrolysis buffer, 1.5 x | Mix 1 part 0.1 M Na2CO3 and 9 parts 0.1 M NaHCO3. Add EDTA to 1 mM. Adjust pH to 9.2. Store aliquots at -20 °C. | ||
5x poly(A) polymerase buffer | Thermo Fisher Scientiific | 100 mM Tris-HCl, pH 7.0, 3 mM MnCl2, 0.1 mM EDTA, 1 mM DTT, 0.5 mg/ml acetylated BSA, 50% glycerol | |
Biotin binding buffer | 20 mM TrisCl pH 7.5, 2 M NaCl, 0.1% NP40 | ||
TEN buffer | 10 mM TrisCl, pH 7.5, 1 mM EDTA, 0.02% NP40 | ||
Name | Company | Catalog Number | Sequence |
Oligonucleotides according to Illumina TruSeq Small RNA Sample Prep Kits, for GA-IIx and Hiseq2000/2500 sequencers | Microsynth | ||
revRA3 (RNA) | Microsynth | 5’ amino CCUUGGCACCCGAGAAUUCCA 3’ | |
revDA5 | Microsynth | 5’ amino GTTCAGAGTTCTACAGTCCGAC GATCNNNN-3’ | |
Bio-dU-dT25, RT primer | Microsynth | 5' Biotin-TTTTTTTTTTTTTTTTTTTTTTTTTT-dU-TTTVN 3' (V = G, A or C) | |
PCR primer forward, RP1 | Microsynth | 5' AATGATACGGCGACCACCGAGA TCTACACGTTCAGAGTTCTACAGTC CGA 3' | |
PCR primer reverse, RPI1, barcode in bold | Microsynth | 5' CAAGCAGAAGACGGCATACGAGA TCGTGATGTGACTGGAGTTCCTTG GCACCCGAGAATTCCA 3' | |
Name | Company | Catalog Number | Yorumlar |
Oligonucleotides according to Illumina TruSeq HT-Small RNA Sample Prep Kits, for HiSeq2000/2500 and NextSeq500 sequencers | |||
HT-rev3A (DNA/RNA) | Microsynth | 5'-amino-GTGACTGGAGTTCAGACGTGTGCT CTTCCrGrAUrC-3' | |
HT-rev5A | Microsynth | 5' amino-ACACTCTTTCCCTACACGACGCTCT TCCGATCTNNNN 3' | |
Bio-dU-dT25, RT primer | Microsynth | 5' Biotin-TTTTTTTTTTTTTTTTTTTTTTTTTT-dU-TTTVN 3' | |
PCR primers forward (D501-506) | Microsynth or Illumina | 5'-AATGATACGGCGACCACCGAGAT CTACAC[i5]ACACTCTTTCCCTACAC GACGCTCTTCCGATCT -3' | |
PCR primers reverse (D701-D712) | Microsynth or Illumina | 5'-CAAGCAGAAGACGGCATACGAG A[i7]GTGACTGGAGTTCAGACGTG TGCTCTTCCGATC-3' | |
Documentation for Illumina multiplexing: | Illumina | https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/experiment-design/illumina-adapter-sequences_1000000002694-01.pdf |