This method describes the steps to improve the quality and quantity of sequence data that can be obtained from formalin-fixed paraffin-embedded (FFPE) RNA samples. We describe the methodology to more accurately assess the quality of FFPE-RNA samples, prepare sequencing libraries, and analyze the data from FFPE-RNA samples.
Gene expression analysis by RNA sequencing (RNA-seq) enables unique insights into clinical samples that can potentially lead to mechanistic understanding of the basis of various diseases as well as resistance and/or susceptibility mechanisms. However, FFPE tissues, which represent the most common method for preserving tissue morphology in clinical specimens, are not the best sources for gene expression profiling analysis. The RNA obtained from such samples is often degraded, fragmented, and chemically modified, which leads to suboptimal sequencing libraries. In turn, these generate poor quality sequence data that may not be reliable for gene expression analysis and mutation discovery. In order to make the most of FFPE samples and obtain the best possible data from low quality samples, it is important to take certain precautions while planning experimental design, preparing sequencing libraries, and during data analysis. This includes the use of appropriate metrics for precise sample quality control (QC), identifying the best methods for various steps during the sequencing library generation, and careful library QC. In addition, applying correct software tools and parameters for sequence data analysis is critical in order to identify artifacts in RNA-seq data, filter out contamination and low quality reads, assess uniformity of gene coverage, and measure the reproducibility of gene expression profiles among biological replicates. These steps can ensure high accuracy and reproducibility for profiling of very heterogeneous RNA samples. Here we describe the various steps for sample QC, library preparation and QC, sequencing, and data analysis that can help to increase the amount of useful data obtained from low quality RNA, such as that obtained from FFPE-RNA tissues.
Use of next-generation sequencing approaches has enabled us to glean a wealth of information from various types of samples. However, old and poorly preserved samples remain unworkable for the commonly used methods of generating sequence data and often require modifications to well-established protocols. FFPE tissues represent such a sample type that has been widely utilized for clinical specimens1,2,3. While FFPE preservation maintains tissue morphology, the nucleic acids in FFPE tissues usually exhibit a wide range of damage and degradation, making it difficult to retrieve the genomic information that may lead to important insights about molecular mechanisms underlying various disorders.
Gene expression data generated by RNA sequencing is often instrumental in studying disease and resistance mechanisms and complements DNA mutation analysis. However, RNA is more susceptible to degradation, making it more challenging to generate accurate gene expression data from FFPE tissues. Furthermore, because the wide availability and affordability of sequencing is relatively recent, older specimens were often not stored in conditions required to preserve RNA integrity. Some of the issues for FFPE samples include degradation of RNA due to embedding in paraffin, chemical modification of RNA leading to fragmentation or refractoriness to enzymatic processes required for sequencing, and loss of the poly-A tails, limiting the applicability of oligo-dT as a primer for reverse transcriptase4. Another challenge is the handling/storage of FFPE samples under suboptimal conditions, which may lead to further degradation of labile molecules such as RNA in the tissues5. This is especially relevant for older samples that may have been collected at a time when gene expression analysis by RNA sequencing was not anticipated for the samples. All these lead to decreased quality and quantity of the extracted RNA available for generating useful sequence data. The low probability of success, combined with the high cost of sequencing, has dissuaded many researchers from trying to generate and analyze gene expression data from potentially useful FFPE samples. Some studies in recent years have demonstrated the usability of FFPE tissues for gene expression analysis2,6,7,8,9, albeit for fewer and/or more recent samples.
As a feasibility study, we used RNA extracted from FFPE tumor tissue specimens from three Residual Tissue Repositories from Surveillance, Epidemiology, and End Results (SEER) cancer registries for RNA sequencing and gene expression analysis10. Procured from clinical pathology labs, the FFPE tissues from high-grade ovarian serous adenocarcinomas were stored for 7–32 years under varying conditions before RNA extraction. Because in most cases these blocks had been stored at different sites for years without the expectation of any sensitive genetic analysis in the future, little care had been taken to preserve the nucleic acids. Thus, most of the samples exhibited poor quality RNA, with a large proportion of samples contaminated with bacteria. Nevertheless, we were able to perform gene quantification, measure the uniformity and continuity of gene coverage, and perform Pearson correlation analysis among biological replicates to measure reproducibility. Based on a panel of key signature genes, we compared the samples in our study with The Cancer Genome Atlas (TCGA) data and confirmed that approximately 60% of the samples had comparable gene expression profiles11. Based on the correlation between various QC results and sample metadata, we identified key QC metrics that have good predictive value for identifying samples that are more likely to generate usable sequence data11.
Here we describe the methodology used for FFPE-RNA quality assessment, generation of sequencing libraries starting from extracted RNA samples, and bioinformatic analysis of the sequencing data.
1. RNA quantity and quality assessment
2. Sequencing library preparation
3. Sequencing library QC
4. Sequencing
5. Data analysis and quality assessment
NOTE: A typical RNA-seq data analysis workflow (Figure 1) includes preprocessing and QC, alignment to the genome and post-alignment QC, gene and transcript quantification, sample correlation analysis, differential analysis between sample groups or treatment conditions, and gene set enrichment and pathway analysis.
The RNA-seq data may have quality issues that can affect the accuracy of gene profiling and lead to erroneous conclusions. Therefore, initial QC checks for sequencing quality, contamination, sequencing coverage bias, and other sources of artifacts are very important. Applying an RNA-Seq QC pipeline similar to the workflow described here is recommended to detect artifacts and apply filtering or correction before downstream analysis.
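The filtering step mentioned above can be illustrated with a minimal sketch that flags samples for review based on a few post-alignment QC metrics. The function name, the dictionary keys, and especially the threshold defaults are hypothetical placeholders chosen for illustration; they are not values recommended in this protocol and should be set for each sample set.

```python
def qc_flags(sample, max_contamination_pct=20.0, min_mrna_pct=10.0,
             min_genes_detected=5000):
    """Return a list of QC flags for one sample.

    `sample` is a dict with hypothetical keys:
      contamination_pct - percent of reads assigned to foreign species
      mrna_pct          - percent of aligned bases in mRNA regions
      genes_detected    - number of genes with nonzero counts
    All thresholds are illustrative placeholders, not recommendations.
    """
    flags = []
    if sample["contamination_pct"] > max_contamination_pct:
        flags.append("high_contamination")
    if sample["mrna_pct"] < min_mrna_pct:
        flags.append("low_mrna_content")
    if sample["genes_detected"] < min_genes_detected:
        flags.append("low_gene_count")
    return flags


# Two made-up samples: one clean, one failing every check.
clean = {"contamination_pct": 3.2, "mrna_pct": 25.0, "genes_detected": 12000}
poor = {"contamination_pct": 48.0, "mrna_pct": 4.0, "genes_detected": 3000}

clean_flags = qc_flags(clean)
poor_flags = qc_flags(poor)
```

Flagged samples can then be excluded or examined more closely before differential analysis, depending on the goals of the study.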
The methodology described above was applied to 67 FFPE samples that had been stored under a variety of conditions for 7–32 years (the median sample storage time was 17.5 years). The dataset and analysis results presented here were previously described and published in Zhao et al.11. On checking the sample quality as described earlier (see example traces in Figure 2), DV100 was found to be more useful than DV200 because it captures the proportion of smaller fragments more sensitively in highly degraded RNA samples.
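To make the two metrics concrete, the sketch below computes a DV value as the percentage of total signal contributed by fragments longer than a size threshold (100 nt for DV100, 200 nt for DV200). It is a simplification under stated assumptions: the trace is represented as hypothetical binned (size, signal) pairs with the marker peak already excluded, whereas instrument software integrates the area under the electropherogram. The function name and example values are illustrative only.

```python
def dv_percent(sizes_nt, signal, threshold_nt):
    """Percent of total signal from fragments longer than threshold_nt.

    sizes_nt and signal are parallel lists describing a binned trace
    (hypothetical representation; marker peak assumed to be removed).
    Returns None when there is no signal to evaluate.
    """
    if len(sizes_nt) != len(signal):
        raise ValueError("sizes_nt and signal must have the same length")
    total = sum(signal)
    if total <= 0:
        return None
    above = sum(s for size, s in zip(sizes_nt, signal) if size > threshold_nt)
    return 100.0 * above / total


# Made-up trace for a heavily degraded sample: most signal is below 200 nt.
sizes = [50, 80, 120, 150, 250, 400]
signal = [30.0, 25.0, 20.0, 15.0, 7.0, 3.0]

dv100 = dv_percent(sizes, signal, 100)
dv200 = dv_percent(sizes, signal, 200)
```

In this made-up trace, DV100 is much larger than DV200 because the fragments between 100 and 200 nt count toward DV100 only. The same effect is why DV100 can still separate samples that all sit near the floor of DV200.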
In the given sample set, only about 10% of the samples (7 of 67) were above the DV200 cutoff of 30%, as recommended by Illumina34. About 28% of the samples (19 of 67) had a DV100 > 60% (i.e., higher likelihood of generating good sequence data), 40% (27 of 67) were in the 40%–60% range for DV100 (i.e., acceptable, but with a lower likelihood of generating good sequence data), and about 10% (7 of 67) had a DV100 of <40% (i.e., very low likelihood of resulting in good sequence data). For the remaining 14 of 67 samples, the software was unable to determine the DV values. Table 1 shows a summary of QC metrics for the samples in different DV100 categories. For detailed QC analysis and data correlation for all 67 samples, please see Zhao et al.11.
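The DV100 categories used above can be expressed as a small helper for tabulating a sample set. This is a sketch only: the function name, the category labels, and the example values are illustrative, and the boundary values of exactly 40% and 60% are assigned to the middle group to match the 40%–60% range given in the text.

```python
from collections import Counter


def dv100_category(dv100):
    """Assign a DV100 value (in percent) to one of the categories in the text.

    None represents a sample for which the software could not
    determine a DV value.
    """
    if dv100 is None:
        return "undetermined"
    if dv100 > 60:
        return "high (>60%)"
    if dv100 >= 40:
        return "acceptable (40-60%)"
    return "low (<40%)"


# Made-up DV100 values for six samples, including one without a DV call.
values = [73, 51, 34, None, 60, 40]
counts = Counter(dv100_category(v) for v in values)
```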
Given the high degree of degradation in the sample set, a ‘total RNA’ library preparation method was chosen, and sequencing libraries were prepared using the NEBNext Ultra II RNA Library Prep Kit for Illumina (Table of Materials). To improve the representation of the sequencing libraries in spite of the high degree of sample degradation, the maximum possible amount of RNA (1,000 ng when available) was used as input for library preparation. Additionally, the high degradation of the FFPE-RNA samples necessitated the rRNA depletion method, because the degraded transcripts were unlikely to retain the poly-A tails needed for mRNA capture. Following the depletion of ribosomal RNA by hybridization to specific probes and digestion of the hybridized transcripts using RNase H, the remaining transcripts were converted into cDNA using random primers. Size selection was also avoided for libraries prepared from lower input samples. Example traces of final libraries are shown in Figure 3.
Highly degraded FFPE samples represent a great challenge for gene expression profiling in tumor samples. Thus, applying correct bioinformatics analysis methods and software tools is critical to detect artifacts or abnormalities in datasets and to ensure high accuracy and reproducibility of gene quantification. The software tools used in this study are listed in the Supplementary Table. In the given sample set, we performed sequencing and library quality assessment, with some example metrics shown in Figure 4. Overviews of raw FASTQ sequencing quality and adapter content per sample are shown in Figure 4A and Figure 4B, respectively. FastQ Screen can help detect contamination, such as bacterial and mouse contamination, in the samples, as shown in Figure 4C. In the given sample set, 41 of 67 samples had 5%–48% bacterial contamination, and six samples had 4%–11% mouse contamination (Figure 4C). STAR alignment results (Figure 4D) showed the proportion of reads mapped to the reference genome, the percentage of reads uniquely mapped to the reference genome, and the proportion of reads that were not mapped or mapped to multiple loci. Picard CollectRnaSeqMetrics was used to determine the percentages of mRNA, intronic, and intergenic bases in the alignment files (Figure 4E). To assess the uniformity of read coverage across genes and transcripts, we used the Picard software tool to generate a gene body coverage plot, which measures the percentage of reads that cover each nucleotide position of all genes scaled into bins from the 5′ UTR to the 3′ UTR. Figure 4F shows that some degraded libraries had 3′ bias, where more reads map closer to the 3′ end than to the 5′ end.
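The 3′ bias visible in a gene body coverage plot can be summarized as a single number. The sketch below is one simple way to do this, assuming the coverage has already been scaled into bins ordered from the 5′ end to the 3′ end: it divides the mean coverage of the last bins by the mean coverage of the first bins. This is an illustrative calculation, not the bias metric reported by Picard; the function name, the `edge_fraction` parameter, and the example profiles are hypothetical.

```python
def three_prime_bias(coverage, edge_fraction=0.2):
    """Ratio of mean coverage near the 3' end to mean coverage near the 5' end.

    coverage: per-bin coverage values ordered 5' -> 3'.
    edge_fraction: share of bins at each end used for the comparison
    (hypothetical default). A ratio near 1 suggests even coverage;
    values well above 1 suggest 3' bias.
    """
    n = len(coverage)
    if n < 2:
        raise ValueError("need at least two coverage bins")
    k = max(1, int(n * edge_fraction))
    five_prime = sum(coverage[:k]) / k
    three_prime = sum(coverage[-k:]) / k
    if five_prime == 0:
        return None  # ratio undefined without 5' coverage
    return three_prime / five_prime


# Made-up normalized coverage profiles with ten bins each.
uniform = [1.0] * 10
degraded = [0.2, 0.2, 0.4, 0.6, 0.8, 1.0, 1.0, 1.0, 1.0, 1.0]

uniform_ratio = three_prime_bias(uniform)
degraded_ratio = three_prime_bias(degraded)
```

Comparing this ratio across libraries gives a quick way to rank samples by coverage skew before inspecting the full plots.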
FFPE samples usually have large variability in gene expression profiles that may arise due to variable degradation during sample storage, RNA extraction, or sample processing. It is important to use appropriate statistical methods to uncover the underlying patterns and measure the variation and correlation among samples. We applied Principal Component Analysis (PCA) to six pairs of biological replicates from a subset of the 67 FFPE samples. A PCA plot showed that 26% of total variation was captured by the first principal component and 19% by the second and third components combined (Figure 5). Among the six pairs of replicates, two pairs showed higher variation (correlations below 0.22) than the other four pairs (correlation values between 0.7 and 0.8) when comparing gene expression values between the replicate pairs. Because the replicates were generated by extracting RNA from two different tissue curls cut from the same FFPE blocks, the tissue age was not a factor in the higher variance here, and it was likely caused by the different amounts of bacterial contamination (1%–55%) as well as different mRNA content (2–3 fold difference) between the replicates. The randomness of mRNA degradation after extraction could also contribute to the higher variance between samples of similar origin.
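The replicate comparison described above rests on the Pearson correlation between the gene expression values of two samples. A minimal, dependency-free sketch follows. The log2(x + 1) transform applied before correlation is an assumption made here to limit the influence of a few highly expressed genes; it is not stated in the text, and the function names and expression values are illustrative.

```python
import math


def log2p1(values):
    """log2(x + 1) transform (an assumed preprocessing step)."""
    return [math.log2(v + 1.0) for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Returns None if either sequence has zero variance.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


# Made-up expression values for five genes in three samples.
rep_a = [0, 3, 15, 63, 255]
rep_b = [0, 3, 15, 63, 255]   # identical profile to rep_a
rep_c = [255, 63, 15, 3, 0]   # reversed profile

r_same = pearson(log2p1(rep_a), log2p1(rep_b))
r_reversed = pearson(log2p1(rep_a), log2p1(rep_c))
```

Applied to each replicate pair in turn, the coefficient gives the kind of per-pair value quoted above, and pairs with low correlation can then be checked against their contamination and mRNA content.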
Figure 1: RNA-seq analysis workflow. The flowchart describes the analysis steps for preprocessing, quality assessment, mapping to reference, gene quantification, and differential analysis between different sample groups.
Figure 2: Example Bioanalyzer traces of six different FFPE-RNA samples. The horizontal axis denotes the molecular weight (bp), and fluorescence units (FU) on the vertical axis indicate the concentration of different sized fragments. The RNA Integrity Numbers (RIN), DV200 (i.e., percent of fragments >200 bp), and DV100 (i.e., percent of fragments >100 bp) values are indicated on each profile. A 25 bp peak in each profile indicates the molecular weight marker.
Figure 3: Example Bioanalyzer traces of final libraries prepared from four different samples. The horizontal axis denotes the molecular weight (bp), and fluorescence units (FU) on the vertical axis indicate the concentration of different sized fragments. The lower (35 bp or 50 bp) and upper (10,380 bp) marker peaks are labeled in green and purple, respectively.
Figure 4: Example multi-QC report for preprocessing QC results. (A) Line chart showing the percentages of Q30 bases of all sequencing reads in each sample. (B) Sequencing adapter content in raw fastq files. (C) Contamination screen to check closely matched species. (D) Genome mapping statistics. (E) Read distribution based on Gencode gene annotation. (F) Gene body/transcript coverage.
Figure 5: Example PCA to show sample group concordance. PCA for biological replicates. PCA plot with samples plotted in two dimensions using their projections onto the first two principal components. Biological replicates are shown in the same color.
DV100 category | Number of samples | Median input for lib prep (ng) | Median RIN | Median DV200 (%) | Median DV100 (%) | Median lib size (bp) | Median lib yield (ng) | Median lib molarity (nM) | Median specimen storage time (years) | Median % contamination | Median gene count |
DV100 <40% | 7 | 237.6 | 2.5 | 6 | 34 | 445 | 24.5 | 7 | 22 | 27.4 | 14,759 |
DV100 40-60% | 27 | 1000 | 2.5 | 12 | 51 | 408 | 19.8 | 5.9 | 18 | 9.9 | 10,202 |
DV100 >60% | 19 | 1000 | 2.3 | 26 | 73 | 355 | 84.9 | 24 | 13 | 3.2 | 9,993 |
Table 1: Summary of sample set QC metrics. The table shows the QC metrics of the samples, grouped according to their DV100 values. The number of samples in each group is listed, and median values for each metric are shown.
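The library size and molarity columns in Table 1 are related through the standard conversion for double-stranded DNA, which assumes an average mass of about 660 g/mol per base pair. The sketch below shows that conversion from a mass concentration and a mean fragment size; the function name and example numbers are illustrative and are not taken from Table 1, where the medians of each column are computed independently.

```python
def library_molarity_nM(conc_ng_per_ul, mean_size_bp, mass_per_bp=660.0):
    """Convert a dsDNA library concentration to molarity.

    conc_ng_per_ul: library concentration in ng/uL
    mean_size_bp:   average library fragment size in bp
    mass_per_bp:    approximate molar mass of one base pair in g/mol
    Returns molarity in nM.
    """
    if mean_size_bp <= 0:
        raise ValueError("mean_size_bp must be positive")
    return conc_ng_per_ul / (mass_per_bp * mean_size_bp) * 1e6


# Made-up example: a 350 bp library at 2.31 ng/uL.
example_nM = library_molarity_nM(2.31, 350)
```

Because the estimate depends on the assumed mean fragment size, it is a starting point only and does not replace qPCR-based quantification before loading the sequencer.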
Supplementary Table: Analysis software tools, parameters, and software references. The table lists the analysis software tools and parameters used in each step of the RNA-seq analysis. The software tool references are listed in the table.
The method described here outlines the main steps required to obtain good sequence data from FFPE-RNA samples. The main points to consider with this method are: (1) Ensure that the RNA is preserved as well as possible after extraction by minimizing sample handling and freeze-thaw cycles. Separate QC aliquots are very helpful. (2) Use a QC metric that is best for the given sample set. RIN values and DV200 are often not useful for degraded samples, and DV100 may be the metric of choice to assess the quality in a given sample set. (3) For more degraded samples, it is best to use a high sample input. Higher input amounts lead to better diversity and lower duplication in the final library, leading to improved data quality. Because not all RNA in FFPE-RNA samples is usable due to high degradation and refractoriness to enzymatic processes, these effects are more pronounced in FFPE-RNA compared to fresh frozen RNA. (4) Use random priming for the reverse transcription step as opposed to the use of oligo-dT or specific sequences as primers. Unless the set of specific probes is able to cover as much sequence as possible for all transcripts of interest, random primers are a safe bet to ensure the conversion of a maximum number of transcripts (or fragments thereof) into cDNA. Thus, total RNA library prep methods are more useful for degraded samples than mRNA methods, which rely on the presence of poly-A tails. (5) Accurate quantification of libraries by quantitative real-time PCR (qPCR) is important to avoid underperformance or overloading of the sequencers. (6) Assess potential contamination of the RNA as part of the standard post-sequencing RNA-seq QC protocols. Bacterial contamination and genomic DNA contamination are common for FFPE samples due to storage conditions and sample preparation procedures. Samples contaminated with foreign species can waste sequencing coverage, depending on the extent of contamination.
In addition, internal contamination can arise from incomplete rRNA depletion, leading to a high percentage of reads mapping to rRNAs. Inefficient genomic DNA removal during DNase digestion could lead to false positive expression detection of transcripts or erroneous de novo assembly of transcripts. Adapter contamination introduced during library preparation is also a common problem for highly degraded RNAs with very short RNA fragments. Contamination can affect the gene and transcript profiling accuracy and lead to false discovery. Therefore, it is important to accurately identify the contamination sources and remove the contamination, if possible, during the sample or library preparation steps, or filter the contaminating reads during the data processing step. (7) Preprocessing and post-alignment quality control are important to detect poor quality samples and samples with low mRNA content. Those samples should be eliminated from further analysis. Gene expression data from samples that generate low gene counts or poor coverage should be used with caution. (8) It is good practice to include biological replicates in order to measure sample variance and correlation and to ensure data reproducibility.
FFPE samples represent a very valuable resource for a large number of diseases. The ability to obtain reliable sequence information from such samples would aid a lot of studies aimed at understanding the molecular mechanisms behind various disorders, resistance, and susceptibility. Though the limitations imposed by the frequently suboptimal quality of RNA extracted from such samples do hamper such efforts, the steps described here help to mitigate those limitations to some extent and enable us to make the most of FFPE-RNA to obtain reliable gene expression information.
The authors have nothing to disclose.
We are thankful to Dr. Danielle Carrick (Division of Cancer Control and Population Sciences, National Cancer Institute) for continued help, especially for initiating this study, providing us with the samples, and for helpful suggestions during data analysis. We sincerely thank all members of the CCR Sequencing Facility at the Frederick National Laboratory for Cancer Research for their help during sample preparation and sequencing, especially Brenda Ho for assistance in sample QC, Oksana German for library QC, and Tatyana Smirnova for running the sequencers. We would also like to thank Tsai-wei Shen and Ashley Walton at the Sequencing Facility Bioinformatics Group for helping with data analysis and RNA-seq pipeline implementation. We also thank CCBR and NCBR for assistance with the RNA-seq analysis pipeline and best practices development.
Name | Company | Catalog Number | Comments |
2100 Bioanalyzer | Agilent | G2939BA | |
Agilent DNA 7500 Kit | Agilent | 5067-1506 | |
Agilent High Sensitivity DNA Kit | Agilent | 5067-4626 | |
Agilent RNA 6000 Nano Kit | Agilent | 5067-1511 | |
AllPrep DNA/RNA FFPE Kit | Qiagen | 80234 | |
CFX96 Touch System | Bio-Rad | 1855195 | |
Library Quantification kit v2-Illumina | KapaBiosystems | KK4824 | |
NEBNext Ultra II Directional RNA Library Prep Kit for Illumina | New England Biolabs | E7765S | https://www.neb.com/protocols/2017/02/07/protocol-for-use-with-ffpe-rna-nebnext-rrna-depletion-kit |
NEBNext rRNA Depletion Kit (Human/Mouse/Rat) | New England Biolabs | E6310L | |
NextSeq 500 Sequencing System | Illumina | SY-415-1001 | NextSeq 500 System guide: https://support.illumina.com/content/dam/illumina-support/documents/documentation/system_documentation/nextseq/nextseq-500-system-guide-15046563-06.pdf |
NextSeq PhiX Control Kit | Illumina | FC-110-3002 | |
NSQ 500/550 Hi Output KT v2.5 (150 CYS) | Illumina | 20024907 | |
10X Genomics Magnetic Separator | 10X Genomics | 120250 | |
Rotator Multimixer | VWR | 13916-822 | |
C1000 Touch Thermal Cycler | Bio-Rad | 1851197 | |
Sequencing reagent kit | Illumina | 20024907 | |
Flow cell package | Illumina | 20024907 | |
Buffer cartridge and the reagent cartridge | Illumina | 20024907 | |
Sodium hydroxide solution (0.2N) | Millipore Sigma | SX0607D-6 | |
TRIS-HCL Buffer 1.0M, pH 7.0 | Fisher Scientific | 50-151-871 |