This protocol provides researchers with a new tool to monitor the fidelity of transcription in multiple model organisms.
Accurate transcription is required for the faithful expression of genetic information. Surprisingly though, little is known about the mechanisms that control the fidelity of transcription. To fill this gap in scientific knowledge, we recently optimized the circle-sequencing assay to detect transcription errors throughout the transcriptome of Saccharomyces cerevisiae, Drosophila melanogaster, and Caenorhabditis elegans. This protocol will provide researchers with a powerful new tool to map the landscape of transcription errors in eukaryotic cells so that the mechanisms that control the fidelity of transcription can be elucidated in unprecedented detail.
The genome provides a precise biological blueprint of life. To implement this blueprint correctly, it is important for the genome to be transcribed with great precision. However, transcription is unlikely to be error free. For example, RNA polymerases have long been known to be error-prone in vitro1,2, and recently it was shown that they commit errors in vivo as well3,5,6, particularly when confronted with DNA damage7,8,9,10. Taken together, these observations indicate that transcription errors occur continuously in all living cells, suggesting that they could be a potent source of mutated proteins.
This process, termed transcriptional mutagenesis, differs from classical mutagenesis in two ways. First, in contrast to genetic mutations, transcription errors affect both mitotic and post-mitotic cells, as they do not depend on DNA replication. Studying the mechanisms that impact the fidelity of transcription will, therefore, provide valuable insight into the mutation load of both mitotic and post-mitotic cells. Interestingly, transcription errors have recently been implicated in the promotion of protein aggregation11,12,13 and have been hypothesized to contribute to both carcinogenesis10 and the development of antibiotic resistance in bacteria14.
Second, in contrast to genetic mutations, transcription errors are transient in nature. Their temporary existence makes transcription errors exceedingly difficult to detect. For example, while several labs have devised valuable reporter assays for the study of transcriptional mutagenesis, these assays can only measure transcription errors in a limited number of contexts and model organisms4,15. To overcome these limitations, many researchers have turned to RNA sequencing technology (RNA-seq), which theoretically allows transcription errors to be recorded throughout the transcriptome of any species. However, these studies are easily confounded by library construction artifacts, such as reverse transcription errors, PCR amplification errors, and the error-prone nature of sequencing itself. For example, reverse transcriptases commit approximately one error every 20,000 bases, while RNA polymerases (RNAPs) are expected to make only one error every 300,000 bases5,6. Because the error rate of reverse transcription alone is roughly 15-fold higher than the error rate of RNA polymerases inside cells, it is virtually impossible to distinguish true transcription errors from artifacts caused by the library preparation in traditional RNA-seq data (Figure 1A).
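To make the scale of this problem concrete, the short Python sketch below uses only the two error rates quoted above to estimate how many artifacts and true errors would be expected in a given number of sequenced bases. The function names and the example of 6 million bases are illustrative assumptions, and the calculation considers reverse transcription alone, ignoring PCR and sequencing errors.

```python
# Bases per error, as quoted in the text.
RT_BASES_PER_ERROR = 20_000      # reverse transcriptase
RNAP_BASES_PER_ERROR = 300_000   # RNA polymerase inside cells


def expected_errors(bases_sequenced: int) -> tuple[float, float]:
    """Expected (reverse transcription artifacts, true transcription errors)."""
    rt_artifacts = bases_sequenced / RT_BASES_PER_ERROR
    true_errors = bases_sequenced / RNAP_BASES_PER_ERROR
    return rt_artifacts, true_errors


def artifact_fraction() -> float:
    """Fraction of observed mismatches expected to be RT artifacts."""
    rt_rate = 1 / RT_BASES_PER_ERROR
    rnap_rate = 1 / RNAP_BASES_PER_ERROR
    return rt_rate / (rt_rate + rnap_rate)


if __name__ == "__main__":
    artifacts, errors = expected_errors(6_000_000)  # hypothetical dataset size
    print(f"RT artifacts: {artifacts:.0f}, transcription errors: {errors:.0f}")
    print(f"Fraction of mismatches that are artifacts: {artifact_fraction():.1%}")
```

Under these assumptions, roughly 15 of every 16 mismatches in a conventional library would be reverse transcription artifacts rather than transcription errors.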
To solve this problem, we developed an optimized version of the circle-sequencing (Cirseq, henceforth C-seq) assay5,16. This assay allows the user to detect transcription errors and other rare variants in RNA throughout the transcriptome5. The circle-sequencing assay carries this name because a key step in the assay is RNA circularization. Once the RNA targets are circularized, they are reverse transcribed in a rolling circle fashion to produce linear cDNA molecules that contain numerous copies of the same RNA template. If an error was present in one of these templates, it will also be present in every repeat contained within the cDNA molecule. In contrast, errors introduced by reverse transcription, PCR amplification, or sequencing tend to arise randomly and will thus be present in only one or two repeats. Thus, by generating a consensus sequence for each cDNA molecule and distinguishing random errors from errors that occur in all repeats, library construction artifacts can effectively be separated from true transcription errors (Figure 1B).
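The consensus idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the published pipeline: it assumes the repeats have already been extracted and are of equal length, it uses a plain majority vote rather than a quality-aware base call, and the names `consensus_from_repeats` and `mismatches` are hypothetical.

```python
from collections import Counter


def consensus_from_repeats(repeats: list[str], min_repeats: int = 3) -> str:
    """Majority-vote consensus across tandem repeats of one cDNA molecule.

    Positions without a strict majority are reported as 'N'.
    """
    if len(repeats) < min_repeats:
        raise ValueError("too few repeats to build a consensus")
    if len({len(r) for r in repeats}) != 1:
        raise ValueError("repeats must have equal length in this sketch")
    consensus = []
    for column in zip(*repeats):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if 2 * count > len(column) else "N")
    return "".join(consensus)


def mismatches(consensus: str, reference: str) -> list[tuple[int, str, str]]:
    """List (position, reference base, observed base) for each difference."""
    return [
        (i, ref, obs)
        for i, (ref, obs) in enumerate(zip(reference, consensus))
        if obs != ref and obs != "N"
    ]


if __name__ == "__main__":
    reference = "ACGTTGCA"
    # The T>A change at position 4 is present in every repeat (candidate
    # transcription error); the G>C change at position 2 occurs in only one
    # repeat (likely a library artifact) and is voted out.
    repeats = ["ACGTAGCA", "ACGTAGCA", "ACCTAGCA"]
    cons = consensus_from_repeats(repeats)
    print(cons, mismatches(cons, reference))
```

In this toy example, only the change shared by all three repeats survives consensus calling, which mirrors the logic of Figure 1B.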
If used properly, the C-seq assay can accurately measure the rate of base substitutions, insertions, and deletions in RNA throughout the transcriptome of any species (for example, see Traverse and Ochman17). We have used the C-seq assay to provide genome-wide measurements of the error rate of transcription in Saccharomyces cerevisiae, Drosophila melanogaster, and Caenorhabditis elegans at single-base resolution5 (unpublished observations). Originally used to accurately sequence RNA virus populations, this optimized version of the C-seq assay has been streamlined to minimize harsh conditions during the library preparation that contribute to library construction artifacts. In addition, the use of a number of commercially available kits greatly improves both the throughput and the user-friendliness of the assay. This assay can accurately detect thousands of transcription errors per replicate, thereby greatly improving on previous studies6. Overall, this method provides a powerful tool to study transcriptional mutagenesis and will allow the user to gain novel insights into the mechanisms that control the fidelity of transcription in a wide range of organisms.
1. Preparation
2. Cell and Animal Culture and Collection
3. Total RNA Purification
NOTE: At this point, all three protocols converge, and a single, unified approach can be used to generate C-seq libraries.
4. mRNA Enrichment
5. RNase III Fragmentation and RNA Clean Up
NOTE: (Important) To prepare circular RNA molecules appropriate for generating C-seq libraries, the RNA must be fragmented to roughly 60–80 bases in length. While previous methods have used chemical fragmentation, chemical fragmentation with heavy metals damages the RNA samples in ways that can be misinterpreted as transcription errors during the final analysis. To circumvent this problem, fragment the RNA with RNase III instead. An additional advantage of this enzymatic approach is that it creates the compatible ends required for ligation, obviating the end-repair step that is needed after chemical fragmentation.
6. RNA Circularization and Rolling Circle Reverse Transcription
7. Second Strand cDNA Synthesis and End Repair
8. Adapter Ligation and Size Selection of Prepared Libraries
9. PCR Amplification and Final Bead Purification
10. Bio-informatic Analysis of Circle Sequencing Data
NOTE: Analyzing and interpreting raw data from the C-seq assay requires a dedicated bio-informatic pipeline. A schematic of the pipeline that was used for our analyses is depicted in Figure 3. Download this pipeline at https://github.com/LynchLab/MAPGD.
Like all massively parallel sequencing approaches, each C-seq experiment produces a large, unwieldy dataset. First-time users may find these datasets difficult to handle; thus, it is recommended that all users contact an experienced bio-informatician prior to starting the experiment. On average, the expectation is that users will generate approximately 55–70 gigabases (Gbases) per run on most massively parallel sequencing platforms. For this protocol, typically, 12–30 samples were multiplexed per run so that for each sample approximately 2–6 Gbases were acquired. After trimming the adaptor sequences and low-quality base calls (quality score < 20) from this dataset, ~70% of the initial data size remained.
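When planning how many samples to multiplex, the figures above can be turned into a quick estimate. The Python sketch below is only a planning aid based on the numbers quoted in this paragraph; the function name `per_sample_yield` is hypothetical, and the ~70% retention after trimming is treated as a fixed fraction, which is an assumption.

```python
def per_sample_yield(run_gbases: float, samples: int,
                     retained_after_trimming: float = 0.70) -> tuple[float, float]:
    """Return (raw Gbases per sample, Gbases per sample after trimming)."""
    raw = run_gbases / samples
    return raw, raw * retained_after_trimming


if __name__ == "__main__":
    # Extremes of the ranges quoted in the text.
    for run_gbases, samples in [(55, 30), (70, 12)]:
        raw, trimmed = per_sample_yield(run_gbases, samples)
        print(f"{run_gbases} Gbases / {samples} samples: "
              f"{raw:.1f} raw, {trimmed:.1f} after trimming")
```

The two extremes bracket the approximately 2–6 Gbases per sample mentioned above.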
These bases are then analyzed further to investigate the efficiency of the RNA fragmentation and circularization by determining the size of the repeats that were generated. Most repeats tend to be 45–80 bases in length (Figure 4A) and approximately 50% of the bases that were sequenced are part of these repeats (Figure 4B). Since most of these bases are present in reads that contain 3 repeats or more, the number of unique bases that are sequenced is about one-third of the total number of bases sequenced. On average, >75% of these consensus sequences can be mapped back to the reference genome. Finally, approximately 25% of these bases are covered by 20 reads or more, which ultimately means that about 10% of the data can be used for error detection. An example of this analysis is given in Figure 4, which represents the sequencing information that was acquired during the set-up of this protocol for 1 replicate of a single C-seq library that was sequenced relatively shallowly and represents the lower limit of the data that users can expect to acquire.
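As an illustration of how repeat size might be determined from a single read, the Python sketch below scans candidate repeat lengths and picks the one at which the read best matches a shifted copy of itself. This is a minimal sketch and not the method used by the pipeline referenced above: the default 45–80 base window is taken from the text, whereas the 90% identity cutoff, the function names, and the assumption that reads contain no insertions or deletions are all illustrative.

```python
from typing import Optional


def find_repeat_period(read: str, min_period: int = 45, max_period: int = 80,
                       min_identity: float = 0.9) -> Optional[int]:
    """Return the repeat length that best explains the read, or None."""
    best_period, best_identity = None, 0.0
    for period in range(min_period, max_period + 1):
        overlap = len(read) - period
        if overlap < period:  # fewer than two full copies would fit
            break
        # Compare each base with the base one period downstream.
        matches = sum(read[i] == read[i + period] for i in range(overlap))
        identity = matches / overlap
        if identity > best_identity:
            best_period, best_identity = period, identity
    return best_period if best_identity >= min_identity else None


def split_into_repeats(read: str, period: int) -> list[str]:
    """Cut the read into consecutive full-length repeats."""
    return [read[i:i + period] for i in range(0, len(read) - period + 1, period)]


if __name__ == "__main__":
    unit = "ATGGCTAGCTAGGATCCGATTACGCTAGGCTAACGTTAGCCGATAGCTAA"  # 50 bases
    # Three copies, with one artificial substitution in the middle copy.
    read = unit + unit[:10] + "C" + unit[11:] + unit
    period = find_repeat_period(read)
    print(period, len(split_into_repeats(read, period)))
```

The repeats returned by `split_into_repeats` could then be passed to a consensus step; reads for which no period is found would be discarded in this sketch.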
Approximately 5,000–25,000 errors per run tend to be identified, although these numbers can vary significantly depending on the error rate itself (the higher the error rate, the more errors will be detected), the size of the transcriptome (the larger the transcriptome, the fewer bases will be covered by 20 reads, limiting the sequencing data that can be used for error detection), and the depth at which it is sequenced (deeper sequencing will make it more likely that a given base will be covered by 20 reads or more). These errors tend to be distributed across the entire genome, so that on average ~47% of the errors are located in mRNA molecules generated by RNA polymerase II (RNAP II), ~49% are located in rRNA molecules generated by RNAP I, and the remaining ~3% are located in RNAs generated by RNAP III and the mitochondrial RNA polymerase. However, these ratios can vary significantly depending on the cell type or organism under investigation (Figure 4C). For example, cell types that rely heavily on mitochondrial function, such as cardiomyocytes, contain significantly more mitochondrial RNA than other cell types, greatly increasing the number of mitochondrial RNA molecules that are sequenced, and thus the number of errors detected.
Once a list of errors has been compiled, and their locations across the genome are known, these errors can be used to identify the parameters that control the error rate of transcription in each organism. For example, the location of these transcription errors can be correlated with numerous features of the genome, such as the presence of DNA repeats, specific genetic contexts, or the expression rate, to understand how these features alter the fidelity of transcription5. In the future, users are expected to be able to determine how countless additional features affect transcriptional fidelity, including epigenetic markers, the 3-D organization of the genome, nutrient availability, age, or exposure to toxic compounds, to elucidate the contribution of genetic and environmental factors to the fidelity of transcription.
Figure 1: Schematic representation of RNA-seq versus C-seq. (A) Traditional RNA-seq experiments isolate RNA from a sample of interest, fragment the RNA, and reverse transcribe it prior to the final library preparation and sequencing. However, these preparation procedures introduce numerous technical artifacts into the library in the form of reverse transcription errors, PCR amplification errors, and sequencing errors. (B) This optimized C-seq assay allows for the correction of these technical artifacts by circularizing the fragmented RNA molecules prior to reverse transcription, which allows them to be reverse transcribed in a rolling circle fashion to produce linear cDNA molecules that contain several copies of the original RNA template in tandem repeat. These tandem repeats can then be used to distinguish true transcription errors from artifacts, as true transcription errors (star) will be present in all repeats at the same location, whereas artifacts such as reverse transcription errors (square) and PCR amplification errors (circle) are only present in one or two repeats of any given cDNA molecule.
Figure 2: Representative results for library preparation. These panels show an electrophoretogram for (A) a high sensitivity RNA screen tape ladder (see Table of Materials), (B) the total RNA purified from Saccharomyces cerevisiae, (C) RNase III-fragmented RNA, and (D) rolling circle reverse-transcribed cDNA. (E) The final size-selected cDNA library runs on a high sensitivity double-stranded DNA analysis chip (see Table of Materials).
Figure 3: Schematic of the bio-informatic pipeline used to analyze circle sequencing data. After trimming the sequencing reads, repeats are identified, and a consensus sequence is generated using the most likely base call at any given position. Then, the ligation point of the initial RNA template is identified and the consensus sequence is aligned to the reference genome so that potential transcription errors can be identified.
Figure 4: Example of results for C-Seq pipeline. (A) This panel shows the size distribution of the consensus sequences obtained from a typical C-Seq experiment. (B) This panel shows the number of bases, reads, and percentage of reads sequenced in each step of the C-Seq bioinformatics pipeline. (C) This panel shows the number of transcription errors detected and the percentage of errors attributed to RNA polymerase I, RNA polymerase II, RNA polymerase III, and mitochondrial RNA polymerase.
Here, we describe an optimized protocol for the preparation of C-seq libraries for the detection of transcription errors in Saccharomyces cerevisiae, Drosophila melanogaster, and Caenorhabditis elegans. This protocol has numerous advantages over existing protocols, as well as alternative techniques.
Over the past 15 years, numerous reporter systems have been developed that rely on luciferase7,8 or Cre-Lox recombination3,4,15,21 to detect transcription errors. These reporter systems have been invaluable to researchers' understanding of transcriptional fidelity because they allowed genes, alleles, and molecular mechanisms that directly regulate the fidelity of transcription to be identified. However, they only report on errors in artificially damaged templates, or within highly specific genetic contexts, which limits the scientific questions they can answer. One important advantage of the C-seq protocol is that it monitors the fidelity of transcription throughout the entire transcriptome5, greatly expanding the scientific knowledge of the accuracy with which genetic information is expressed. A second advantage is that, because the C-seq assay utilizes RNA as its source material, it can likely be adapted to any organism of choice, obviating the need to generate complicated reporter constructs for each transgenic model. This protocol also has advantages over existing massively parallel sequencing approaches17,22. Most notably, the majority of these approaches use heavy metals to fragment RNA libraries. However, these fragmentation methods introduce artifacts into the RNA that are indistinguishable from true transcription errors5. To solve this problem, this protocol fragments RNA enzymatically with RNase III, which has the added advantage that it leaves compatible ends for self-ligation, greatly simplifying RNA circularization and increasing the sensitivity of the assay ~10-fold.
At the same time, the C-seq assay has its own share of limitations. First, even though the assay measures transcription errors throughout the entire genome, a large component of the data tends to be derived from genes that are highly expressed, as transcripts from these genes tend to dominate most RNA libraries. Another factor that skews the analysis towards highly expressed genes is the threshold built into the bio-informatic pipeline that prevents genetic mutations and RNA editing events from confounding the final measurements. This threshold dictates that only bases that were sequenced 20 times or more can be used for the final analysis. A second limitation concerns library diversity. Like most kits used for RNA extraction, the kit used here does not efficiently purify small RNA molecules, which limits the number of reads derived from molecules synthesized by RNAP III. The diversity of the final sequencing library is further limited by the mRNA purification step employed here. Even though molecules generated by RNAP I can be efficiently sequenced, most of the remaining mitochondrial RNAs, tRNAs, and non-coding RNAs are lost during this step. To capture these molecules more frequently, a different kit that specifically purifies small RNA molecules should be used, or cell types should be targeted that are enriched for these molecules. Please note, though, that most of these problems can be solved by sequencing deeper than usual, or simultaneously sequencing the DNA from the same cells, although both of these solutions will require a larger financial investment by the investigators.
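The effect of the coverage threshold can be illustrated with a small Python sketch. Only the 20-read minimum comes from the text; the rule used here to set aside recurrent variants (a hypothetical `max_variant_fraction` of 5%), the `Site` record, and the function name are assumptions made for illustration and may differ from what the actual pipeline does.

```python
from dataclasses import dataclass

MIN_COVERAGE = 20  # threshold stated in the text


@dataclass
class Site:
    position: int
    coverage: int    # consensus reads covering this base
    mismatches: int  # consensus reads that disagree with the reference


def summarize_sites(sites: list[Site], max_variant_fraction: float = 0.05) -> dict:
    """Count usable bases and candidate errors, skipping low-coverage sites.

    Sites where the variant recurs in a large share of reads are set aside as
    possible genetic mutations or RNA editing events (assumed rule).
    """
    usable_bases, errors, recurrent = 0, 0, []
    for site in sites:
        if site.coverage < MIN_COVERAGE:
            continue  # too few reads to tell an error from a mutation
        if site.mismatches / site.coverage > max_variant_fraction:
            recurrent.append(site.position)
            continue
        usable_bases += site.coverage
        errors += site.mismatches
    rate = errors / usable_bases if usable_bases else None
    return {"usable_bases": usable_bases, "errors": errors,
            "recurrent_sites": recurrent, "error_rate": rate}


if __name__ == "__main__":
    sites = [Site(1, 40, 1), Site(2, 30, 15), Site(3, 10, 1), Site(4, 25, 0)]
    print(summarize_sites(sites))
```

In this toy input, the site covered by only 10 reads is ignored, and the site at which half of the reads carry the variant is treated as a likely mutation or editing event rather than a transcription error.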
In addition, future technological developments are poised to improve the C-seq assay at a fundamental level. For example, by extending the read-length of existing sequencing technology, it will be possible to sequence longer molecules, which means that more repeats can be used to ascertain the fidelity of transcription. Such improvements will automatically result in a greater sequencing depth and an increase in the percentage of reads that can be used for downstream analysis. A similar argument can be made for any sequencing technology that allows for more molecules to be sequenced in parallel. In addition, numerous components of the C-seq assay remain to be optimized, including the efficiency of RNA circularization, the stringency of size selection, and the time-consuming nature of the assay itself. Finally, it is strongly recommended that all users gain access to a dedicated, high-sensitivity tool for nucleic acid analysis and potential troubleshooting (see Table of Materials). Without this tool, it can be extremely difficult to determine at what point potential problems arise, thereby making them impossible to fix.
Although the entire protocol is fairly unforgiving (small mistakes tend to have significant consequences), several steps are absolutely critical to the success of the C-seq assay. One of the most important requirements is that the isolated RNA is treated as gently as possible. Most RNA extraction methods and downstream processing kits care little about the chemical composition of the RNA beyond basic industry standards, which is sufficient for most molecular analyses. However, the C-seq assay is tasked with identifying a single transcription error among hundreds of thousands of wild-type bases, greatly increasing the need to reduce molecular stress. Even stress that impacts only 1 in 10,000 bases can introduce artifacts that completely invalidate the results. For example, users must avoid, or at least limit, exposing the RNA to any temperature over 65 °C. Second, each user must carefully optimize the time required for the RNA fragmentation with RNase III. It is absolutely crucial for the success of the assay that the final molecules are approximately 60–80 bases in length. Shorter molecules may be difficult to map to the genome, and longer molecules will not allow for 3 independent repeats to be sequenced within a 250 bp read. Finally, it is recommended not to freeze and thaw the RNA more than once during the entire protocol. Instead, isolate the RNA and synthesize the cDNA in a single day. Because of the time and effort involved with this commitment, the complexity of the protocol itself, and the number of samples that are frequently processed at once, it is recommended that two investigators work together to generate these libraries.
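The relationship between fragment length and the number of repeats per read follows from simple division, as the Python sketch below shows. It assumes that the whole 250 bp read consists of tandem repeats and ignores adapter or primer sequence; the function names are hypothetical.

```python
def full_repeats_per_read(fragment_length: int, read_length: int = 250) -> int:
    """Number of complete copies of a fragment that fit in one read."""
    return read_length // fragment_length


def max_fragment_length(min_repeats: int = 3, read_length: int = 250) -> int:
    """Longest fragment that still yields `min_repeats` complete copies."""
    return read_length // min_repeats


if __name__ == "__main__":
    for length in (60, 80, 100):
        print(length, "bases:", full_repeats_per_read(length), "full repeats")
    print("longest fragment allowing 3 repeats:", max_fragment_length())
```

Under these assumptions, fragments of 60–80 bases give three or four complete copies per read, whereas a 100-base fragment gives only two, which is why over-long fragments cannot be used for consensus calling.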
When successful, these experiments will make it possible to accurately detect transcription errors in any organism, under any experimental condition. For example, by comparing control and diseased individuals to each other, it may be possible to determine whether certain diseases are associated with transcription errors, which may reveal a new component of the etiology of numerous human pathologies. Other parameters that could be probed are organismal aging, nutrition, genotype, and environmental factors such as exposure to toxic chemicals. Moreover, because this assay detects transcription errors throughout the transcriptome, it will allow researchers to dissect the different mechanisms that contribute to the fidelity of RNA polymerase I, II, and III, as well as the mitochondrial RNA polymerase. Accordingly, we expect that this protocol will open up a new field of mutagenesis to widespread experimentation, characterized by the study of mutations in RNA rather than in DNA.
The authors have nothing to disclose.
This publication was made possible by funding from grant T32ES019851 (to C. Fritsch), R01AG054641, and an AFAR young investigator grant (to M. Vermulst).
Name | Company | Catalog Number | Comments |
RiboPure RNA purification kit | ThermoFisher | AM1926 | Total RNA purification |
Genelute mRNA purification kit | Sigma-Aldrich | MRN70-1KT | mRNA purification |
Nuclease-free Water | Ambion | AM9937 | Elution and dilution |
Ambion RNase III | ThermoFisher | AM2290 | RNA fragmentation |
T4 RNA Ligase 1 (ssRNA Ligase) | New England Biolabs | M0204S | RNA circularization |
Ribolock | ThermoFisher | EO0381 | RNase inhibitor |
SuperScript III Reverse Transcriptase | ThermoFisher | 18080044 | Rolling circle reverse transcription |
10 mM dNTP mix | ThermoFisher | 18427013 | Rolling circle reverse transcription |
Random hexamers (50 ng/µl) | ThermoFisher | N8080127 | Rolling circle reverse transcription |
NEB Second Strand Synthesis Module | New England Biolabs | E6111S | Second Strand Synthesis |
NEBNext Ultra DNA Library Prep Kit for Illumina | New England Biolabs | E7370S | cDNA library preparation |
NEBNext index primers | New England Biolabs | E7335S | Multiplex PCR primers |
Oligo Clean & Concentrator | Zymo Research | D4061 | Clean up of RNA and DNA samples |
DynaMag-2 Magnet | ThermoFisher | 12321D | Magnetic bead purification |
AMPure XP beads | Beckman Coulter | A63881 | Magnetic bead purification |
Eppendorf 5424 Microcentrifuge | Fisher Scientific | 05-403-93 | Centrifugation |
INCU-Shaker 10L | Benchmark Scientific | H1010 | Cell culture |
T100 Thermal Cycler | Bio-Rad | 1861096 | Medium to high temperature cycling conditions |
PTC-200 Thermal Cycler | GMI | 8252-30-0001 | Low temperature cycling conditions |
RNase Away | Molecular Bioproducts | 700S-11 | Sterilization |
50 ml Centrifuge Tube | Corning | 430290 | Nuclease-free |
15 ml Centrifuge Tube | Corning | 430052 | Nuclease-free |
Eppendorf tubes | USA Scientific | 1615-5500 | Nuclease-free |
4200 Tapestation System | Agilent | G2991AA | Nucleotide analysis instrument for quality control of RNA and single stranded DNA samples |
High Sensitivity RNA Screen Tape | Agilent | 5067-5579 | Quality control of RNA and single stranded DNA samples |
RNA ScreenTape Sample Buffer | Agilent | 5067-5577 | Quality control of RNA and single stranded DNA samples |
RNA ScreenTape Ladder | Agilent | 5067-5578 | Quality control of RNA and single stranded DNA samples |
2100 Bioanalyzer Instrument | Agilent | G2939BA | Double stranded DNA quality control |
High Sensitivity DNA Kit | Agilent | 5067-4626 | Quality control for double stranded cDNA samples |
Water Bath | VWR | 462-0244 | Incubation |
NanoDrop 2000/2000C Spectrophotometer | ThermoFisher | ND-2000C | Determination of RNA concentration |