Here we present a new approach to identify plant viruses with double-stranded DNA genomes. We use standard methods to extract DNA and RNA from infected leaves and carry out next-generation sequencing. Bioinformatic tools assemble the sequences into contigs, identify contigs representing virus genomes, and assign the genomes to taxonomic groups.
This metagenomic approach is used to identify plant viruses with circular DNA genomes and their transcripts. Plant DNA viruses that occur at low titers in their host, or that cannot be mechanically inoculated into another host, are often difficult to propagate to a greater titer of infectious material. Infected leaves are ground in a mild buffer with the pH and ionic composition recommended for purifying most bacilliform pararetroviruses. Urea is used to break up inclusion bodies that trap virions and to dissolve cellular components. Differential centrifugation further separates virions from plant contaminants. Proteinase K treatment then removes the capsids, and the viral DNA is concentrated and used for next-generation sequencing (NGS). The NGS data are used to assemble contigs, which are submitted to NCBI-BLASTn to identify the subset of virus sequences in the generated dataset. In a parallel pipeline, RNA is isolated from infected leaves using a standard column-based RNA extraction method. Ribosomal RNA depletion is then carried out to enrich for mRNA and virus transcripts. Assembled sequences derived from RNA sequencing (RNA-seq) are submitted to NCBI-BLASTn to identify the subset of virus sequences in this dataset. In our study, we identified two related full-length badnavirus genomes in the two datasets. This method is preferred to another common approach, which extracts the aggregate population of small RNA sequences to reconstitute plant virus genomic sequences. That metagenomic pipeline also recovers virus-related sequences that are retro-transcribing elements inserted into the plant genome, and is therefore coupled to biochemical or molecular assays to further discern the actively infectious agents. The approach documented in this study recovers sequences representative of replicating viruses that likely indicate active virus infection.
Emerging plant diseases drive researchers to develop new tools to identify the correct causal agent(s). Initial reports of new or recurring virus diseases are based on commonly occurring symptoms such as mosaic and malformations of the leaf, vein clearing, dwarfism, wilting, lesions, necrosis, or other symptoms. The standard for reporting a new virus as the causal agent of a disease is to separate it from other contaminating pathogens, propagate it in a suitable host, and reproduce the disease by inoculating it into healthy plants of the original host species. The limitation of this approach is that many genera of plant viruses depend upon an insect or other vector for transmission to a suitable host or back to the original host species. In this case, the search for the appropriate vector can be prolonged, it may be difficult to establish laboratory colonies of the vector, and further effort is necessary to devise a protocol for experimental transmission. If the conditions for successful laboratory transmission studies cannot be achieved, then the work falls short of the standard for reporting a new virus disease. For viruses that occur in their natural hosts at very low titers, researchers must identify alternative hosts for propagation to maintain sufficient infectious stocks for carrying out research. For virus species that infect only a few plants, this can also be an obstacle to growing stock cultures1.
In recent years, scientists have more often employed high-throughput NGS and metagenomic approaches to uncover virus sequences that are present in the environment, which may be unrelated to a known disease but can be assigned to taxonomic species and genera2,3,4. Such approaches to the discovery and categorization of genetic materials in a distinct environment provide a way to describe virus diversity in nature or virus presence in a certain ecosystem, but do not necessarily conform to a framework for defining the causal agents of an apparent disease.
The genus Badnavirus belongs to the family Caulimoviridae of pararetroviruses. These viruses are bacilliform in shape with circular double-stranded DNA genomes of approximately 7 to 9 kb. All pararetroviruses replicate through an RNA intermediate. Pararetroviruses exist as episomes and replicate independently of the plant chromosomal DNA5,6. Field studies of virus populations indicate that these populations are genetically complex. In addition, information obtained across a range of plant genomes by high-throughput sequencing has uncovered numerous examples of badnavirus genome fragments inserted by illegitimate integration events into plant genomes. These endogenous badnavirus sequences are not necessarily associated with infection7,8,9,10,11. Consequently, the use of NGS to identify new badnaviruses as the causal agents of disease is complicated by the subpopulation diversity of episomal genomes as well as the occurrence of endogenous sequences12,13.
While there is no single optimal pipeline for the discovery of novel pararetrovirus genomes, there are two common approaches to identify these viruses as causal agents of disease. One method is to enrich for small RNA sequences from infected leaves and then assemble these sequences to reconstitute the virus genome(s)14,15,16,17. Another approach uses rolling circle amplification (RCA) to amplify circular DNA virus genomes18. The success of RCA depends upon the age of the leaf and the virus titer in the selected tissue. The RCA products are subjected to restriction digestion and cloned into plasmids for direct sequencing19,20,21.
Canna yellow mottle virus (CaYMV) is a badnavirus and is described as the etiological cause of yellow mottle disease in canna, although only a 565 bp fragment of the genome has been previously isolated from infected cannas22. A contemporary study identified CaYMV in Alpinia purpurata (flowering ginger; CaYMV-Ap)23. The goal of this study was to recover complete badnavirus genome sequences from infected canna lilies. We describe a protocol for purifying virus from plant contaminants, isolating viral DNA from this preparation, and preparing a DNA library for use in NGS. This approach eliminates the need for intermediate molecular amplification steps. We also isolate mRNA from infected plants for RNA-seq. NGS, which includes RNA-seq, was carried out using each nucleic acid preparation. Assembled contigs in both datasets were found to relate to the Badnavirus taxon using the National Center for Biotechnology Information (NCBI) basic local alignment search tool for nucleic acids (BLASTn). We identified the genomes of two badnavirus species24.
1. General Virus Purification by Differential Centrifugation Using Standard Method by Covey et al. 25
2. Library Preparation Using DNA and Emulsion-based Clonal Amplification (emPCR Amplification)
Note: The library is typically prepared by an NGS facility which carries out customer-oriented work.
3. General mRNA Isolation and dsDNA Synthesis Starting with Infected Canna Leaves that Test Positive by RT-PCR for CaYMV Using Reported Diagnostic Primers
4. NGS of DNA Library Prepared from Crude Virus Preparation and dsDNA Library Prepared from mRNA
5. Quality Assessment of De Novo Sequencing by PCR Amplification of Virus Genomes from Infected Plants
This modified virus purification method provided an enrichment of virus DNA that was useful for identifying two virus species by NGS and bioinformatics. After the homogenate was centrifuged at 40,000 x g for 2.5 h, there was a green pellet at the bottom of the tube and a white pellet along its length. The green pellet was resuspended in one microcentrifuge tube and the white pellet was resuspended in two microcentrifuge tubes. PCR was carried out using standard CaYMV diagnostic primers, and products were detected in the solubilized white pellet but not in the green pellet (Figure 1A). A sample of the crude preparation was examined by transmission electron microscopy, and we observed bacilliform particles measuring 124-133 nm in length (Figure 1B). This is within the predicted modal length of most badnaviruses. DNA was extracted from the white and green pellets and resuspended separately. For Figure 1C, we loaded 5 µL of DNA extracted from the green and white pellet samples (1.6 µg of DNA for the green fraction and 3.1 µg of DNA for the white fraction) on a 0.8% agarose gel and analyzed the DNA following ethidium bromide staining. The green fraction contained low molecular weight DNA, whereas the white fraction produced two bands of higher molecular weight DNA as well as the lower molecular weight DNA (Figure 1C). The gel presented in Figure 1C was run for 40 min at 100 V, and the smear in lane 3 suggests that the gel voltage should be lowered to produce clearer bands. These data suggest that the white pellet was enriched for virions. The concentration of DNA extracted from the white sample (0.6 µg/mL) was low but adequate for NGS, which requires a minimum of 10 ng of DNA to proceed. Fragmented DNAs were used to prepare a library for NGS.
In parallel, RNA was extracted from infected canna plants (Figure 1D) for high-throughput RNA-seq. A standard workflow was carried out for library preparation, NGS, creating contigs, and identifying viral genome sequences (Figure 1E). The output results from using DNA and RNA as starting materials were compared.
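The input requirement quoted above can be checked with simple arithmetic. The following is a minimal Python sketch, assuming only the 0.6 µg/mL concentration and the 10 ng minimum stated in the text; the function name `min_volume_ul` is hypothetical and is not part of the original protocol.

```python
def min_volume_ul(required_ng: float, conc_ug_per_ml: float) -> float:
    """Return the volume (in µL) of a DNA stock that supplies required_ng of DNA."""
    if conc_ug_per_ml <= 0:
        raise ValueError("concentration must be positive")
    # 1 µg/mL equals 1 ng/µL, so no extra unit conversion factor is needed.
    return required_ng / conc_ug_per_ml


# Values taken from the text: 10 ng minimum input, 0.6 µg/mL stock.
volume = min_volume_ul(required_ng=10, conc_ug_per_ml=0.6)
print(f"{volume:.1f} µL")  # prints "16.7 µL"
```

Under these assumptions, roughly 17 µL of the white-pellet DNA preparation would be needed to reach the stated minimum input.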
We obtained 188,626 raw DNA reads by NGS using DNA isolated from the crude virus preparation. Reads were assembled into 13,269 contigs and BLASTn was used to search the NCBI database of nucleotide sequences (using Viridiplantae TaxID: 33090 and Viruses TaxID: 10239 as the limiting organisms) (Figure 1E). The NCBI-BLASTn results revealed that 93% of de novo assembled contigs were cellular sequences, 22% were unknown, and 0.3% were virus contigs (Figure 2A). The majority of contigs categorized as cellular sequences were identified as mitochondrial or chloroplast DNA. Within the dataset of virus contigs, 32% were related to members of Caulimoviridae (that were not Badnavirus sequences) and 58% were related to Badnavirus. Of the virus contigs, 29% were highly similar (e < 1 × 10⁻³⁰) to CaYMV isolate V17 ORF3 gene (EF189148.1), Sugarcane bacilliform virus isolate Batavia D, complete genome (FJ439817.1), and Banana streak CA virus complete genome (KJ013511). Within this population, there were long contigs that resembled two full-length genomes.
High-throughput RNA-seq produced 153,488 cleaned individual sequence reads with an average read length of < 500 bp. Contig assembly reduced this to 8,243 contigs. These were submitted to NCBI-BLASTn (using Viridiplantae TaxID: 33090 and Viruses TaxID: 10239 as the limiting organisms), and the outputs placed 76% of the contigs in a category of plant cellular sequences, 23% were unknown, and 0.1% were categorized as virus contigs (Figure 2B). Closer examination of the 0.1% population of virus contigs determined that 68% of these were assigned to Caulimoviridae (Figure 2B). Three large contigs within this population were identified with high similarity (e < 1 × 10⁻³⁰) to CaYMV isolate V17 ORF3 gene (EF189148.1), Sugarcane bacilliform virus isolate Batavia D, complete genome (FJ439817.1), and Banana streak CA virus complete genome (KJ013511). Examining the three contigs, we manually joined two of these to produce a full-length virus genome.
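Selecting the highly similar hits described above can be illustrated with a short Python sketch. It assumes the BLASTn results were exported in the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score); the text does not state which output format was used. The function name `best_hits`, the contig names, and the numeric values in the example rows are hypothetical; only the accession numbers and the 1 × 10⁻³⁰ cutoff come from the text.

```python
from typing import Dict, Iterable, Tuple


def best_hits(rows: Iterable[str], max_evalue: float = 1e-30) -> Dict[str, Tuple[str, float]]:
    """Keep, for each contig, the subject with the lowest E-value below max_evalue."""
    hits: Dict[str, Tuple[str, float]] = {}
    for row in rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue >= max_evalue:
            continue  # not similar enough under the chosen cutoff
        if query not in hits or evalue < hits[query][1]:
            hits[query] = (subject, evalue)
    return hits


# Hypothetical example rows; the numbers are invented for illustration only.
example_rows = [
    "contig_1\tEF189148.1\t91.2\t540\t40\t2\t1\t540\t12\t551\t2e-45\t210",
    "contig_1\tFJ439817.1\t85.0\t500\t70\t5\t1\t500\t100\t600\t1e-60\t260",
    "contig_2\tKJ013511\t78.0\t120\t26\t0\t1\t120\t1\t120\t1e-05\t40",
]
for contig, (subject, evalue) in best_hits(example_rows).items():
    print(contig, subject, evalue)
```

In this sketch, contig_2 is dropped because its only hit does not pass the cutoff, and contig_1 is reported once with its strongest hit.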
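The joining of two contigs described above was done manually. As an illustration of the underlying idea only, the following Python sketch merges two sequences when the end of one exactly matches the start of the other. It is a simplification under stated assumptions: exact matching with no mismatches or gaps, both contigs on the same strand, and a hypothetical function name (`merge_by_overlap`) and `min_overlap` default that do not come from the original study.

```python
from typing import Optional


def merge_by_overlap(left: str, right: str, min_overlap: int = 20) -> Optional[str]:
    """Join two contigs if a suffix of `left` exactly matches a prefix of `right`.

    Returns the merged sequence, or None when no exact overlap of at least
    `min_overlap` bases exists.
    """
    longest_possible = min(len(left), len(right))
    # Try the longest candidate overlap first so the merge is as compact as possible.
    for k in range(longest_possible, min_overlap - 1, -1):
        if k > 0 and left[-k:] == right[:k]:
            return left + right[k:]
    return None


# Toy sequences (not real virus data) sharing the 4-base overlap "TTGA".
print(merge_by_overlap("ACGTACGTTTGA", "TTGACCA", min_overlap=4))  # ACGTACGTTTGACCA
```

Real contigs usually need alignment-based joining that tolerates sequencing errors, so a merge of this kind would still be checked by eye or by PCR, as in the quality assessment step of the protocol.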
We compared the genome-length virus contigs produced by DNA and RNA sequencing, using them as mutual scaffolds to confirm the presence of two full-length virus genomes. One full-length virus genome of 6,966 bp was tentatively named Canna yellow mottle associated virus 1 (CaYMAV-1) (Figure 3A). The second genome was 7,385 bp and was a variant of CaYMV infecting Alpinia purpurata (CaYMV-Ap01) (Figure 3A).
Finally, PCR primers that were designed to clone a ~1,000 bp fragment of each virus were used to differentially detect both genomes in a population of 227 canna plants representing nine commercial varieties. In many instances, individual plants were infected with both viruses. We provide an example of RT-PCR detection of CaYMAV-1 and CaYMV-Ap01 in 12 plants. Three of these were positive only for CaYMV-Ap01 and nine were positive for both viruses (Figure 3B).
Figure 1: Virus nucleic acid preparations and NGS workflow. (A) Agarose (1.0%) gel electrophoresis of 565 bp PCR fragments of CaYMV genomes. Two PCR products were detected in samples prepared from the white pellet (lanes 1, 2) but not in the green pellet sample (lane 3). Positive control (+) represents a PCR product amplified from infected plant DNA that was isolated using an automated method involving standard paramagnetic cellulose particles. Lane L contains the DNA ladder used as a standard for measuring the size of linear DNA bands in sample lanes. (B) Example of a virus particle viewed by transmission electron microscopy in the white pellet recovered by crude fractionation of infected canna leaves. (C) Agarose (0.8%) gel electrophoresis of DNA recovered from the green (lane 1) and white (lane 2) pellets that tested positive by PCR in panel A. The red and yellow dots next to lane 2 identify two high molecular weight DNA bands that occur in the white fraction. (D) Agarose (1%) gel electrophoresis of total RNA recovered by column-based RNA purification. Lane L contains the DNA ladder used as a standard for measuring the size of linear bands in sample lanes. Lanes 1-6 contain RNA isolated from infected canna leaves, which were pooled into a single sample for ribo-depletion and RNA-seq. (E) Schematic pipeline of nucleic acid isolations, library preparation, sequencing, contig assembly, and virus genome discovery.
Figure 2: Krona charts visualizing the taxonomic categories of contigs. (A) The chart on the left shows the abundance and taxonomic distribution of contigs assembled from the crude virus preparation. The chart on the right depicts the proportions of virus contigs associated with the Caulimoviridae family, Badnavirus genus, and three closely related species. (B) The chart on the left shows the abundance of contigs derived from RNA-seq based on their taxonomic distribution. The chart on the right depicts the abundance of contigs within the population of virus contigs associated with the Caulimoviridae family, Badnavirus genus, and three closely related species.
Figure 3: Characterization of CaYMAV-1 and CaYMV-Ap01 genomes. (A) Diagrammatic representation of Canna yellow mottle associated virus 1 (CaYMAV-1) and Canna yellow mottle virus similar to the genome isolated from Alpinia purpurata (CaYMV-Ap01). Nucleotide positions 1-10 are identified as the start of the genome and contain a tRNAmet anticodon site typical of most badnavirus genomes. The stop and start positions for translation of open reading frames (ORFs) 1 and 2 are adjacent. These proteins have unknown functions. ORF3 is a polyprotein containing zinc finger (ZnF), protease (Pro), reverse transcriptase (RT), and RNase H domains. A 3' poly(A) signal sequence is conserved in both virus genomes. (B) RT-PCR analysis was carried out using RNA isolated from virus-infected leaves and primers that detect CaYMAV-1 and CaYMV-Ap01. In the same population of 12 plants, three were infected with CaYMV-Ap01 only, whereas the remaining nine were infected with both CaYMAV-1 and CaYMV-Ap01. (+) indicates positive control and (-) indicates negative control. This figure is reproduced/modified from Wijayasekara et al.24 with permission.
In recent years, a variety of methods have been employed to study plant virus biodiversity in natural environments, including enriching for virus-like particles (VLP) or virus-specific RNA or DNA2,3,44,45,46. These methods are followed by NGS and bioinformatic analysis. The goal of this study was to find the causal agent of a common disease in a cultivated plant. The disease was reported to be the result of an unknown virus that has non-enveloped bacilliform particles, and for which only a 565 bp fragment has been cloned47. This information was sufficient for prior researchers to hypothetically assign the virus to the genus Badnavirus within the family Caulimoviridae. While prior reports hypothesized that canna mottle disease in canna lilies was the result of a single badnavirus, using the metagenomic approach outlined in this study, we determined that the disease was caused by two tentative badnavirus species24. Thus, the strength of using a metagenomic approach to discover the causal agent of a disease is that we can now identify situations where there may be more than one cause.
Our approach of combining DNA and RNA sequencing data is thorough, and it demonstrates that the two approaches yielded consistent results and confirmed the presence of two related viruses. We employed a modified procedure for isolation of caulimoviruses and produced a sample that was enriched for virus-associated nucleic acids that were protected within the virus capsid. A service laboratory was contracted to carry out DNA sequencing. The essential concept of de novo sequencing is that DNA polymerase incorporates fluorescently labeled nucleotides into a DNA template strand during sequential cycles of DNA synthesis. The contigs assembled following NGS were submitted to a bioinformatic workflow, which produced a few contigs that were identified as virus contigs. Further confirmation of two virus genomes10,24,48,49,50 was obtained through bioinformatic analysis of RNA-seq data obtained from ribo-depleted RNA preparations. One interesting outcome was that the populations of sequences recovered by DNA and RNA sequencing provided similar distributions of non-viral and viral nucleic acids. For DNA and RNA sequencing, < 0.5% of sequences were of virus origin. Within the population of virus sequences, 78-82% belonged to the family Caulimoviridae. By comparing the assembled virus contigs from DNA and RNA sequencing, we confirmed that the two assembled genomes occurred in both datasets.
A concern with using only DNA sequencing to identify the new virus genomes is that the badnavirus genome is an open circular DNA. We surmised that sequences overlapping discontinuities in the genome might present obstacles for genome assembly from contigs. Initial examination of the DNA sequencing results revealed two similar virus genomes. We hypothesized that these genomes either represented genetic diversity of a species that has not been studied, or represented two species co-infecting the same plant24. Therefore, the collective bioinformatic analysis of datasets obtained by NGS DNA and RNA sequencing enabled confirmation of the presence of two full-length genomes.
Another report describes an alternative method for extracting VLP and nucleic acids from plant homogenates for metagenomic studies, based on procedures to recover DNA from Cauliflower mosaic virus (CaMV; a caulimovirus)3. This approach identified novel RNA and DNA virus sequences in non-cultivated plants. The steps derived from the caulimovirus isolation procedure used in this study to discover the causal agent of a disease of cultivated plants are unlike the steps derived for extracting VLP from naturally infected plants24. The success of both modified methods suggests that the framework procedure for caulimovirus isolation may be a valuable starting point for metagenomic studies of plant viruses in general.
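Because the genome is circular, two assemblies of the same genome can be linearized at different positions or reported on opposite strands, which makes a direct string comparison misleading. The following Python sketch shows one way to test whether two sequences are the same circle. It is a simplified illustration and not the comparison procedure used in this study: it requires exact identity, whereas contigs from independent DNA and RNA datasets would normally be compared by alignment. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def same_circular_sequence(a: str, b: str) -> bool:
    """True if `b` is a rotation of circular sequence `a`, on either strand."""
    if len(a) != len(b):
        return False
    a, b = a.upper(), b.upper()
    # Every rotation of a circular sequence appears as a substring of the
    # sequence written twice in a row.
    doubled = a + a
    return b in doubled or reverse_complement(b) in doubled


# Toy sequences (not real virus data).
print(same_circular_sequence("ACGTTG", "TTGACG"))  # True: same circle, different start
print(same_circular_sequence("ACGTTG", "ACGTTC"))  # False: one base differs
```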
The authors have nothing to disclose.
Research was funded by Oklahoma Center for Advancement of Science and Technology Applied Research Program Phase II AR 132-053-2; and by the Oklahoma Department of Agriculture Specialty Crops Research Grant Program. We thank Dr. HongJin Hwang and the OSU Bioinformatics Core Facility which was supported by grants from NSF (EOS-0132534) and NIH (2P20RR016478-04, 1P20RR16478-02 and 5P20RR15564-03).
Name | Company | Catalog Number | Comments |
NaH2PO4 | Sigma-Aldrich, St. Louis, MO | S5976 | Grinding buffer for virus purification |
Na2HPO4 | Sigma-Aldrich | S0751 | Grinding buffer for virus purification |
Na2SO3 | Thermo-Fisher, Waltham, MA | 28790 | Grinding buffer for virus purification |
Urea | Thermo-Fisher | PB169-212 | Homogenate extraction |
Triton X-100 | Sigma-Aldrich | X-100 | Homogenate extraction |
Cheesecloth | VWR, Radnor, PA | 21910-107 | Filter homogenate |
Tris | Thermo-Fisher | BP152-5 | Pellet resuspension & DNA resuspension buffers |
MgCl2 | Spectrum, Gardena, CA | M1035 | Pellet resuspension buffer |
EDTA | Spectrum | E1045 | Stops enzyme reactions |
Proteinase K | Thermo-Fisher | 25530 | DNA resuspension buffer |
Phenol:chloroform:isoamyl alcohol | Sigma-Aldrich | P2069 | Dissolve virion proteins |
DNAse I | Promega | M6101 | Degrade cellular DNA from extracts |
95% ethanol | Sigma-Aldrich | 6B-100 | Virus DNA precipitation |
Laboratory blender | VWR | 58984-030 | Grind leaf samples |
Floor model ultracentrifuge & Ti70 rotor | Beckman Coulter, Irving, TX | A94471 | Separation of cellular extracts |
Floor model centrifuge and JA-14 rotor | Beckman Coulter | 369001 | Separation of cellular extracts |
Magnetic stir plate | VWR | 75876-022 | Mixing urea into samples overnight |
Rubber policeman | VWR | 470104-462 | Dissolve virus pellet |
2100 Bioanalyzer Instrument | Agilent Genomics, Santa Clara, CA | G2939BA | Sensitive detection of DNA and RNA quality and quantity |
2100 Bioanalyzer RNA-Picochip | | 5067-1513 | Microfluidics chip used to move, stain and measure RNA quality in a 2100 Bioanalyzer |
2100 Bioanalyzer DNA-High Sensitive chip | | 5067-4626 | Microfluidics chip used to move, stain and measure DNA quality in a 2100 Bioanalyzer |
Nanodrop spectrophotometer | Thermo-Fisher | ND-2000 | Analysis of DNA/RNA quality at intermediate steps of procedures |
Plant total RNA isolation kit | Sigma-Aldrich | STRN50-1KT | Isolate RNA for RNA-seq |
RNase-free water | VWR | 10128-514 | Resuspension of DNA and RNA for NGS |
RNA concentrator spin column | Zymo Research, Irvine, CA | R1013 | Prepare RNA for RNA-seq |
rRNA removal kit | Illumina, San Diego, CA | MRZPL116 | Prepare RNA for RNA-seq |
DynaMag-2 Magnet | Thermo-Fisher | 12321D | Prepare RNA for RNA-seq |
RNA enrichment system | Roche | 7277300001 | Prepare RNA for RNA-seq |
Agarose | Thermo-Fisher | 16500100 | Gel analysis of DNA/RNA quality at intermediate steps of procedures |
Ethidium bromide | Thermo-Fisher | 15585011 | Agarose gel staining |
pGEM-T + JM109 competent cells | Promega, Madison, WI | A3610 | Clone genome fragments |
pFU Taq polymerase | Promega | M7741 | PCR amplify virus genome |
dNTPs | Promega | U1511 | PCR amplify virus genome |
PCR oligonucleotides | IDT, Coralville, IA | Custom order | PCR amplify virus genome |
Miniprep DNA purification kit | Promega | A1330 | Plasmid DNA purification prior to sequencing |
PCR clean-up kit | Promega | A9281 | Prepare PCR products for cloning |
pDRAW32 software | ACAClone | | Computer analysis of circular DNA and motifs |
MEGA6.0 software | MEGA | | Molecular evolutionary genetics analysis |
Primer 3.0 | Simgene.com | | |
Quant-iT™ RiboGreen™ RNA Assay Kit | Thermo-Fisher | R11490 | Fluorometric determination of RNA quantity |
GS Junior™ pyrosequencing System | Roche | 5526337001 | Sequencing platform |
GS Junior Titanium EmPCR Kit (Lib-A) | Roche | 5996520001 | Reagents for emulsion PCR |
GS Jr EmPCR Bead Recovery Reagents | Roche | 5996490001 | Reagents for emulsion PCR |
GS Junior EmPCR Reagents (Lib-A) | Roche | 5996538001 | Reagents for emulsion PCR |
GS Jr EmPCR Oil & Breaking Kit | Roche | 5996511001 | Reagents for emulsion PCR |
GS Jr Titanium Sequencing kit* | Roche | 5996554001 | Includes sequencing reagents, enzymes, buffers, and packing beads |
GS Jr. Titanium Picotiter Plate Kit | Roche | 5996619001 | Sequencing plate with associated reagents and gaskets |
IKA Turrax mixer | | 3646000 | Special mixer used with Turrax Tubes |
IKA Turrax Tube (specialized mixer) | | 20003213 | Specialized mixing tubes with internal rotor for creating emulsions |
GS Nebulizers Kit | Roche | 5160570001 | Nucleic acid size fractionator for use during library preparations |
GS Junior emPCR Bead Counter | Roche | 05 996 635 001 | Library bead counter |
GS Junior Bead Deposition Device | Roche | 05 996 473 001 | Holder for Picotiter plate during centrifugation |
Counterweight & Adaptor for the Bead Deposition Devices | Roche | 05 889 103 001 | Used to balance deposition device with picotiter plate centrifugation |
GS Junior Software | Roche | 05 996 643 001 | Software suite for controlling the instrument, collecting and analyzing data |
GS Junior Sequencer Control v. 3.0 | Roche | (Included in item 05 996 643 001 above) | |
GS Run Processor v. 3.0 | Roche | (Included in item 05 996 643 001 above) | |
GS De Novo Assembler v. 3.0 | Roche | (Included in item 05 996 643 001 above) | |
GS Reference Mapper v. 3.0 | Roche | (Included in item 05 996 643 001 above) | |
GS Amplicon Variant Analyzer v. 3.0 | Roche | (Included in item 05 996 643 001 above) |