Galaxy and DAVID have emerged as popular tools that allow investigators without bioinformatics training to analyze and interpret RNA-Seq data. We describe a protocol for C. elegans researchers to perform RNA-Seq experiments, access and process the dataset using Galaxy and obtain meaningful biological information from the gene lists using DAVID.
Next generation sequencing (NGS) technologies have revolutionized the nature of biological investigation. Of these, RNA Sequencing (RNA-Seq) has emerged as a powerful tool for gene-expression analysis and transcriptome mapping. However, handling RNA-Seq datasets requires sophisticated computational expertise and poses inherent challenges for biology researchers. This bottleneck has been mitigated by the open access Galaxy project that allows users without bioinformatics skills to analyze RNA-Seq data, and the Database for Annotation, Visualization, and Integrated Discovery (DAVID), a Gene Ontology (GO) term analysis suite that helps derive biological meaning from large data sets. However, for first-time users and bioinformatics’ amateurs, self-learning and familiarization with these platforms can be time-consuming and daunting. We describe a straightforward workflow that will help C. elegans researchers to isolate worm RNA, conduct an RNA-Seq experiment and analyze the data using Galaxy and DAVID platforms. This protocol provides stepwise instructions for using the various Galaxy modules for accessing raw NGS data, quality-control checks, alignment, and differential gene expression analysis, guiding the user with parameters at every step to generate a gene list that can be screened for enrichment of gene classes or biological processes using DAVID. Overall, we anticipate that this article will provide information to C. elegans researchers undertaking RNA-Seq experiments for the first time as well as frequent users running a small number of samples.
The first sequencing of the human genome, performed using Fred Sanger's dideoxynucleotide-sequencing method, took 10 years, and cost an estimated US $3 billion1,2. However, in a little over a decade since its inception, Next-Generation Sequencing (NGS) technology has made it possible to sequence the entire human genome within two weeks and for US $1,000. New NGS instruments that allow ever-increasing speeds of sequencing-data collection with incredible efficiency, along with sharp reductions in cost, are revolutionizing modern biology in unimaginable ways as genome sequencing projects are rapidly becoming commonplace. In addition, these developments have galvanized progress in many other areas such as gene-expression analysis through RNA-Sequencing (RNA-Seq), study of genome-wide epigenetic modifications, DNA-protein interactions, and screening for microbial diversity in human hosts. NGS-based RNA-Seq in particular has made it possible to identify and map transcriptomes comprehensively with accuracy and sensitivity, and has replaced microarray technology as the method of choice for expression profiling. While microarray technology has been used extensively, it is limited by its reliance on pre-existing arrays with known genomic information, and other drawbacks such as cross hybridization and restricted range of expression changes that can be measured reliably. RNA-seq, on the other hand, can be used to detect both known and unknown transcripts while producing low background noise due to its unambiguous DNA mapping nature. RNA-Seq, together with the numerous genetic tools offered by model organisms such as yeast, flies, worms, fish and mice, has served as the foundation for many important recent biomedical discoveries. However, significant challenges remain that make NGS inaccessible to the wider scientific community, including limitations of storage, processing, and most of all, meaningful bioinformatics analysis of large volumes of sequencing data.
The rapid advances in sequencing technologies and exponential data accumulation have created a great need for computational platforms that will allow researchers to access, analyze and understand this information. Early systems were heavily dependent upon computer programming knowledge, whereas, genome browsers such as NCBI that allowed non-programmers to access and visualize data did not permit sophisticated analyses. The web-based, open-access platform, Galaxy (https://galaxyproject.org/), has filled this void and proven to be a valuable pipeline that enables researchers to process NGS data and perform a spectrum of simple-to-complex bioinformatics analyses. Galaxy was initially established, and is maintained, by the laboratories of Anton Nekrutenko (Penn State University) and James Taylor (Johns Hopkins University)3. Galaxy offers a wide range of computational tasks making it a 'one-stop shop' for innumerable bioinformatics needs, including all the steps involved in an RNA-Seq study. Itallows users to perform data processing either on its servers or locally on their own machines. Data and workflows can be reproduced and shared. Online tutorials, help section, and a wiki-page (https://wiki.galaxyproject.org/Support) dedicated to the Galaxy Project provide consistent support. However, for first-time users, especially those with no bioinformatics training, the pipeline can appear daunting and the process of self-learning and familiarization can be time consuming. In addition, the biological system studied, and specifics of the experiment and methods used, impact the analytical decisions at several steps, and these can be difficult to navigate without instruction.
The Overall RNA-Seq Galaxy Workflow consists of data upload and quality check followed by analysis using the Tuxedo Suite4,5,6,7,8,9, which is a collective of various tools required for different stages of RNA-Seq data analysis10,11,12,13,14. A typical RNA-Seq experiment consists of the experimental part (sample preparation, mRNA isolation and cDNA library preparation), the NGS and the bioinformatics data analysis. An overview of these sections, and the steps involved in the Galaxy pipeline, are shown in Figure 1.
Figure 1: Overview of an RNA-Seq Workflow. Illustration of the experimental and computational steps involved in an RNA-Seq experiment to compare the gene-expression profiles of two worm strains (A and B, orange and green lines and arrows, respectively). The different modules of Galaxy utilized are shown in boxes with the corresponding step in our protocol indicated in red. The outputs of various operations are written in grey with the file formats shown in blue. Please click here to view a larger version of this figure.
The first tool in the Tuxedo Suite is an alignment program called 'Tophat'. It breaks down the NGS input reads into smaller fragments and then maps them to a reference genome. This two-step process ensures that reads spanning intronic regions whose alignment can otherwise be disrupted or missed are accounted for and mapped. This increases coverage and facilitates the identification of novel splice junctions. Tophat output is reported as two files, a BED file (with information about splice junctions that include genomic location) and a BAM file (with mapping details of each read). Next, the BAM file is aligned against a reference genome to estimate the abundance of individual transcripts within each sample using the subsequent tool in the Tuxedo Suite called 'Cufflinks'. Cufflinks functions by scanning the alignment to report full-length transcript fragments or 'transfrags' that span all the possible splice variants in the input data for every gene. Based on this, it generates a 'transcriptome' (assembly of all the transcripts generated per gene for every gene) for each sample being sequenced. These Cufflinks assemblies are then collapsed or merged together along with the reference genome to produce a single annotation file for downstream differential analysis using the next tool, 'Cuffmerge'. Finally, the 'Cuffdiff' tool measures differential gene expression between samples by comparing the TopHat outputs of each of the samples to the final Cuffmerge output file (Figure 1). Cufflinks uses FPKM/RPKM (Fragments/Reads Per Kilobase of transcript per Million mapped reads) values to report transcript abundances. These values reflect the normalization of the raw NGS data for depth (average number of reads from a sample that align to the reference genome) and gene length (genes have different lengths, so counts have to be normalized for length of a gene to compare levels between genes). FPKM and RPKM are essentially the same with RPKM being used for single-end RNA-Seq where every read corresponds to a single fragment, whereas, FPKM is used for paired-end RNA-Seq, as it accounts for the fact that two reads can correspond to the same fragment. Ultimately, the outcome of these analyses is a list of genes differentially expressed between the conditions and/or strains tested.
Once a successful Galaxy run is completed and a 'gene list' is generated, the next logical step requires more bioinformatics analyses to deduce meaningful knowledge from the datasets. Many software packages have emerged to cater to this need, including publicly-available web-based computational packages such as DAVID (the Database for Annotation, Visualization and Integrated discovery)15. DAVID facilitates assigning biological meaning to large gene lists from high-throughput studies by comparing the uploaded gene list to its integrated biological knowledgebase and revealing the biological annotations associated with the gene list. This is followed by Enrichment Analysis, i.e., tests to identify if any biological process or gene class is overrepresented in the gene list(s) in a statistically significant manner. It has become a popular choice because of a combination of a wide, integrated knowledge-base and powerful analytical algorithms that enable researchers to detect biological themes enriched within genomics-derived 'gene lists'10,16. Additional advantages include its ability to process gene lists created on any sequencing platform and a highly user-friendly interface.
The nematode Caenorhabditis elegans is a genetic model system, well known for its many advantages such as small size, transparent body, simple body plan, ease of culture and great amenability to genetic and molecular dissection. Worms have a small, simple and well-annotated genome that includes up to 40% conserved genes with known human homologs17. Indeed, C. elegans was the first metazoan whose genome was completely sequenced18, and one of the first species where RNA-Seq was used to map an organism's transcriptome19,20. Early worm studies involved experimentation with different methods for high-throughput RNA capture, library preparation and sequencing as well as bioinformatics pipelines that contributed to the advancement of the technology21,22. In recent years, RNA-Seq-based experimentation in worms has become commonplace. But, for traditional worm biologists the challenges posed by computational analysis of RNA-Seq data remain a major impediment for greater and better utilization of the technique.
In this article, we describe a protocol for using the Galaxy platform to analyze high-throughput RNA-Seq data generated from C. elegans. For many first-time and small-scale users, the most cost-efficient and straightforward way to undertake an RNA-Seq experiment is to isolate RNA in the lab and utilize a commercial (or in-house) NGS facility for preparation of sequencing cDNA libraries and the NGS itself. Hence, we have first detailed the steps involved in isolation, quantification and quality assessment of C. elegans RNA samples for RNA-Seq. Next, we provide step-by-step instructions for using the Galaxy interface for analyses of the NGS data, beginning with tests for post-sequencing quality-control checks followed by alignment, assembly, and differential quantification of gene expression. In addition, we have included directions to scrutinize the gene lists resulting from Galaxy for biological enrichment studies using DAVID. As a final step in the workflow, we provide instructions for uploading RNA-Seq data on to public servers such as the Sequence Read Archive (SRA) on NCBI (http://www.ncbi.nlm.nih.gov/sra) to make it freely accessible to the scientific community. Overall, we anticipate that this article will provide comprehensive and sufficient information to worm biologists undertaking RNA-Seq experiments for the first time as well as frequent users running a small number of samples.
1. RNA Isolation
2. RNA-Seq Data Analysis
Figure 2: Layout of the Galaxy User Interface Panel and Key RNA-Seq Functions. Key features of the page are expanded and highlighted. (A) highlights the 'Analyze data' function in the webpage header used to access Analysis Home View. (B) is the 'Progress bar' that indicates the space on the Galaxy server utilized by the operation. (C) is the 'Tools Section' that lists all the tools that can be run on the Galaxy interface. (D) shows the 'NGS: RNA Analysis' tool section used for RNA-Seq analysis. (E) depicts the 'History' panel that lists all the files generated using Galaxy. (F) shows an example of the dialogue box that opens up when clicking on any file in the History section. Within (F), the blue box highlights icons that can be used to view, editthe attributes or delete the dataset, the purple box highlights icons that can be used to 'edit' the dataset tags or annotation, and, the red box indicates icons to download the data, view details of the task performed or rerun the operation. Please click here to view a larger version of this figure.
3. Gene Ontology (GO) Term Analysis using DAVID
Figure 3: Layout of the DAVID Analysis Wizard Webpage and Examples of Operation Outputs. The 'Analysis Wizard' web user-interface lists the tools used to analyze uploaded gene list for enrichment based on various parameters. Clicking on these tools reports the analyzed data in a new web page. Examples of the tabular reports generated from 'Gene Functional Classification', 'Functional Annotation Chart' and 'Functional Annotation Clustering' are shown as insets (arrows). Please click here to view a larger version of this figure.
4. Uploading RAW Data onto the NCBI Sequence Read Archive (SRA)
In C. elegans, elimination of the germline stem cells (GSCs) extends lifespan, enhances stress resilience, and elevates body fat24,28. Loss of GSCs, either brought about by laser-ablation or by mutations such as glp-1, causes lifespan extension through activation of a network of transcription factors29. One such factor, TCER-1, encodes the worm homolog of the human transcription elongation and splicing factor, TCERG130. The following representative results illustrate how RNA-Seq was used to identify genes whose expression is modulated by TCER-1/TCERG1 following germline loss in our recently published study31. The transcriptomes of age-matched, day 2 adults of glp-1 and tcer-1;glp-1 mutants were compared. For each strain, mRNA was isolated from two biological replicates (four samples totally) using the protocol described in section 1. RNA samples were shipped to a commercial service provider that prepared cDNA libraries from the four samples and performed 50 bp single end sequencing. The raw NGS data was downloaded as described in section 2.1.
Post sequencing data evaluation
Table 1 is a compilation of test results to assess the quality of raw sequencing reads. 'FASTQ' quality check analysis highlights the number of sequences read with no 'poor quality' reads along with 48-49% GC content and a constant sequence read length of 51 bp. This step also checks the sequencing data for many other features such as Kmer content and is collectively made up of 11 tests in total. The C. elegans genome is ~100 Mbp. Based on the number of sequencing reads from each sample that mapped to the genome, the genome coverage (last column) was estimated using the Lander/Waterman equation 'C = LN/G', wherein, C stands for coverage, G is the haploid genome length, L is the read length and N is the number of reads. We used default parameters for all the steps and obtained 48 – 49% GC content in all the samples. As can be seen, genome coverage was between 9x to 11x in the samples.
Identification of TCER-1/TCERG-1-regulated Genes by Differential Gene Expression Analysis on Galaxy
Through the steps detailed in sections 2.2 to 2.4, the Galaxy pipeline3 was used to obtain a list of genes differentially expressed between glp-1 and tcer-1;glp-1 mutants. Galaxy enabled us to combine the NGS data from the two replicates for each strain and performed differential analysis to generate tabular files highlighting the genome wide expression profile. Using a threshold of at least one-fold change in magnitude and P value of at least 0.05, a list of 835 genes that were differentially expressed between the two strains was generated31. The list was divided based on whether expression of the genes was down-regulated in tcer-1;glp-1 mutants (359 UP genes whose transcription is likely enhanced by TCER-1/TCERG1) or up-regulated (476 DOWN genes whose transcription is likely repressed by TCER-1/TCERG1) as compared to glp-1 (Figure 4).
Figure 4: Identification of TCER-1/TCERG1-regulated Genes in Germline-less C. elegans Mutants using RNA-Seq: Results of Galaxy (A) and DAVID (B) Analyses. (A) Differential gene expression analysis of RNA-Seq data comparing the transcriptomes of glp-1 and tcer-1;glp-1 yielded a total of 835 genes, of which 359 were identified as being up-regulated by TCER-1/TCERG1 (UP) and 476 as down-regulated by TCER-1/TCERG1 (DOWN). (B) Results of 'Functional Annotation Clustering' analysis of genes identified as TCER-1/TCERG1 targets using DAVID. Percentage enrichment of Biological Processes for both the Up-regulated (UP) and Down-regulated (DOWN) Classes of TCER-1/TCERG1 targets. The graphic shown here is obtained by plotting the enriched gene groups (X-axis) and their respective percent enrichment (Y-axis) obtained as the output of DAVID analysis. Figure modified from Amrit et al.31 and reproduced with permission. Please click here to view a larger version of this figure.
Gene Ontology Enrichment Analysis
To obtain an overview of the gene classes enriched in TCER-1/TCERG1 targets, we carried out gene ontology (GO) term analysis using DAVID. The TCER-1/TCERG1-regulated UP and DOWN gene lists were uploaded independently onto DAVID and analyzed as described in section 3. Little was known about the genes and cellular processes targeted by TCER-1/TCERG1 previously30, so we found the DAVID analysis to be especially revealing and helpful. Functional Annotation analysis of the UP genes revealed five Annotation Clusters with an Enrichment Score of >1.3, the highest including Cytochrome P450 enzyme-encoding genes and xenobiotic response genes, followed by genes implicated in lipid modifications. This was reinforced by the results of the Gene Functional Classification analysis that identified groups attributed with similar molecular activities with significant enrichment scores. Using spreadsheet, the identified groups were plotted against their respective enrichment scores (Figure 4). Our previous data suggested that TCER-1/TCERG1 functioned with the conserved longevity transcription factor, DAF-16/FOXO3A, to promote the longevity of GSC-less adults30. DAF-16/FOXO3A, in turn, has been implicated in modulating lipid metabolism in recent studies27,32,33. Based on this evidence, and the identification of lipid-metabolic genes and pathways as potential TCER-1/TCERG1 targets in the DAVID analyses, we focused on the fat metabolism genes identified in the RNA-Seq study for detailed mechanistic studies. Following this lead, and through subsequent molecular genetic, biochemical, and functional experimentation, we demonstrated that TCER-1/TCERG1 along with DAF-16/FOXO3A coordinately enhanced both lipid catabolic and anabolic processes in response to germline loss31. Similarly, Functional Annotation Clustering of the DOWN TCER-1/TCERG1 targets identified Annotation Clusters enriched for cytoskeletal functions, positive regulation of growth, reproduction and aging (Figure 4). These observations, and our supporting experimental evidences, suggest that upon germline loss, TCER-1/TCERG1 also represses growth and reproductive physiology in somatic cells as well as the expression of anti-longevity genes31.
Sample | Total Sequences | Length | %GC | Total Reads (Galaxy) | Mapped Reads (Galaxy) | Genome Coverage |
glp-1 | 4000000 | 51 | 49 | 20700539 | ~16,000,000 | 11x |
glp-1; tcer-1 | 4000000 | 51 | 49 | 18055444 | ~13,000,000 | 9x |
glp-1 | 4000000 | 51 | 48 | 18947463 | ~14,000,000 | 10x |
glp-1; tcer-1 | 4000000 | 51 | 48 | 13829643 | ~10,000,000 | 7x |
Table 1: RNA-Seq Sample Details. Compilation of raw data attributes evaluated post-sequencing to confirm the success of the sequencing run. Sequencing data from the representative experiment consists of two biological conditions, a control strain (glp-1) and a mutant strain (tcer-1;glp-1) with two biological replicates sequenced for each. 'FastQC' quality check analysis highlights the number of sequences read with no "poor quality" reads, 48 – 49% GC content and a constant sequence read length of 51bp. Modified and reproduced with permission from Amrit et al.31.
Supplemental File: Command chain in brief for the tools run on the Galaxy pipeline for RNA-Seq data analysis. Please click here to download this file.
Significance of the Galaxy Sequencing Platform in Modern Biology
The Galaxy Project has become instrumental in helping biologists without bioinformatics training to process and analyze high-throughput sequencing data in a fast and efficient manner. Once considered a herculean task, this publicly-available platform has made running complex bioinformatics algorithms to analyze NGS data a straightforward, reliable, and easy process. Apart from hosting a wide range of bioinformatics tools, the key to success for Galaxy is also the simplicity of its user interface that laces together the various aspects of complex sequencing analysis in an intuitive and seamless manner. Owing to these features, the Galaxy pipeline has acquired wide usage amongst biologists, including C. elegans researchers. In addition to familiarizing the user with the RNA-Seq Analysis pipeline, Galaxy also helps lay the foundation for basic biologists to grasp the concept of data analysis and understand the tools involved. This knowledge primes the user to perhaps further pursue more complex bioinformatics platforms such as 'R' and 'Python'. Besides Galaxy, other tools and packages are available commercially and as open-source solutions, which can be used for RNA-Seq analysis. The commercial options are often stand-alone software packages that are user-friendly but can be expensive for individual researchers who do not use NGS often. Alternatively, open source platforms such as BioWadrobe34 and ArrayExpressHTS35 require working knowledge of the command line and running scripts, which poses significant challenges for non-bioinformaticians. Hence, Galaxy remains a popular and indispensable resource.
Critical steps within the protocol
The effortless advantages of Galaxy and DAVID notwithstanding, a successful RNA-Seq experiment still relies fundamentally on careful design and execution of the experimental step. For instance, it is critical to ensure genetic homogeneity before comparing two strains by RNA-Seq, and to determine if there are differences in developmental rates. Isolating RNA from age-matched strains is critical as well. Similarly, to account for variability of gene expression within the same strain, it is important to run two or more 'biological replicates' of each strain. This essentially means growing and harvesting worms from the strains being sequenced in at least twoindependent experiments, although three biological replicates is the recommended standard. Galaxy unifies the data from multiple biological replicates so that the reported gene-expression differences between strains are not simply a consequence of 'within-sample' variability.
A critical design decision is about the use of single-end vs. paired-end sequencing. With single-end sequencing, each fragment is sequenced uni-directionally so the process is faster, cheaper and suited for transcriptional profiling. In paired-end sequencing, once the fragment is sequenced from one end to the other, a second round of sequencing is resumed in the opposite direction. It provides more in-depth data and additional positioning information of the genome, so is more suited for de novo genome assembly, new SNP identification and for identifying epigenetic modifications, deletions, insertions, and inversions. Similarly, the total number of reads and extent of genome coverage required for adequate differential expression studies is context dependent. For small genomes, such as bacteria and fungi, ~5 million reads is sufficient, whereas, in worms and flies ~10 million reads provide adequate coverage. For organisms with large genomes such as mice and humans, 15-25 million reads is the required range. In addition, to the read number and coverage, it is also important that most of the NGS reads align to the reference genome. An alignment of <70% reads is indicative of poor NGS or the presence of contaminants. Overall, for C. elegans RNA-Seq studies, three biological replicates sequenced with 50 bp unidirectional sequencing resulting in ~10-15 million reads and ~5-10X genome coverage for each sample is an ideal aim.
Despite the ease of using Galaxy, there are a few points to remember in order to ensure a smooth and glitch-free data analysis experience. It is necessary for the user to have a basic understanding of the purpose and functioning of the various tools used. Each Galaxy tool requires selection of parameters and understanding the tool will help the user optimize settings based on the requirement of the experiment. The Galaxy help pages explain every parameter and it is recommended that the user peruse these details to decide on test variables.
The gene list obtained post RNA-Seq analysis is merely a list of genes until it is mined for biologically relevant data using DAVID. This is a crucial exercise that converts individual gene-based data into biological-process based results. Exploring the RNA-Seq gene list using the various analytics DAVID provides is therefore an integral and important part of the protocol.
Modifications, troubleshooting and limitations
A common glitch with NGS data analysis is tasks or tests that fail, especially at the quality-control stages. Of the tests that FastQC runs on a sample, a few could come up as failed. However, this does not inevitably mean the sample does not meet the fastq quality standards. The failure could have an alternative explanation that should be explored carefully.
For instance, if the 'Per base sequence content' test fails (suggesting that there is a greater than 10% difference between bases in any position), check the method for the oligodt library preparation. Previous work has shown that Illumina NGS libraries may have a propensity for the 13th base being sequenced to have a bias for certain bases causing the sample to fail the test. Similarly, a failure of the 'Kmer content' test can sometimes be attributed to the fact that libraries derived from random priming will nearly always show Kmer bias at the start due to an incomplete sampling of the random primers. Therefore, it is important to consider these and other impediments in the analysis pipeline before determining the fate of the experiment.
Another important feature that may impact RNA-Seq data analysis is the rapid and exponential advancements that are occurring in NGS methods and analytic software. Ideally, one expects an identical gene list to result from analyzing a sample NGS data on two pipelines or two versions of the same pipeline. However, while constantly improving algorithms are lowering aberrations in RNA-Seq analysis and producing gene lists of greater accuracy, this often leads to disparities. For instance, analyzing a sample NGS data using an older vs. newer version of the same toolset may produce significantly different gene lists. A modest variation is expected but users need to be aware that large discrepancies may be reflective of weaknesses in the design or performance of the experiment.
Collectively, the Galaxy Project and DAVID analytical tools have transformed the way NGS data can be harnessed to extract biologically relevant information. This has opened entirely new levels of independence and investigation to the scientific community, including C. elegans researchers. For instance, the constantly reducing cost of sequencing coupled with better and faster sequencing technology are ushering in an era of transcriptomics at the level of single worms, individual worm tissues and even few select worm cells. These endeavors involve dramatic increases in NGS data being generated. Keeping up with the analytical end of this workflow will be a challenge, but due to its versatility, Galaxy is likely to be instrumental in empowering the transition from whole organism transcriptomics to RNA-Seq at single cell level in C. elegans. The resulting advances in knowledge are likely to provide extraordinary insights into fundamental biology.
The authors have nothing to disclose.
The authors would like to express their gratitude to the laboratories, groups and individuals who have developed Galaxy and DAVID, and thus made NGS widely accessible for the scientific community. The help and advice provided by colleagues at the University of Pittsburgh during our bioinformatics training is acknowledged. This work was supported by an Ellison Medical Foundation New Scholar in Aging award (AG-NS-0879-12) and a grant from the National Institutes of Health (R01AG051659) to AG.
RNase spray | Fisher Scientific | 21-402-178 |
Trizol | Ambion | 15596026 |
Sonicator | Sonics Vibra Cell | VCX130 |
Centrifuge | Eppendorf | 5415C |
chloroform | Sigma Aldrich | 288306 |
2-propanol | Fisher Scientific | A416P-4 |
Ethanol | Decon Labs | 2705HC |
RNase-free water | Fisher Scientific | BP561-1 |
Bioanalyzer | Agilent | G2940CA |
Mac/PC |