A bioinformatics pipeline, namely miRDeep-P2 (miRDP2 for short), with updated plant miRNA criteria and an overhauled algorithm, could accurately and efficiently analyze microRNA transcriptomes in plants, especially for species with complex and large genomes.
MicroRNAs (miRNAs) are 20- to 24-nucleotide (nt) endogenous small RNAs (sRNAs) extensively existing in plants and animals that play potent roles in regulating gene expression at the post-transcriptional level. Sequencing sRNA libraries by Next Generation Sequencing (NGS) methods has been widely employed to identify and analyze miRNA transcriptomes in the last decade, resulting in a rapid increase of miRNA discovery. However, two major challenges arise in plant miRNA annotation due to increasing depth of sequenced sRNA libraries as well as the size and complexity of plant genomes. First, many other types of sRNAs, in particular, short interfering RNAs (siRNAs) from sRNA libraries, are erroneously annotated as miRNAs by many computational tools. Second, it becomes an extremely time-consuming process for analyzing miRNA transcriptomes in plant species with large and complex genomes. To overcome these challenges, we recently upgraded miRDeep-P (a popular tool for miRNA transcriptome analyses) to miRDeep-P2 (miRDP2 for short) by employing a new filtering strategy, overhauling the scoring algorithm and incorporating newly updated plant miRNA annotation criteria. We tested miRDP2 against sequenced sRNA populations in five representative plants with increasing genomic complexity, including Arabidopsis, rice, tomato, maize and wheat. The results indicate that miRDP2 processed these tasks with very high efficiency. In addition, miRDP2 outperformed other prediction tools regarding sensitivity and accuracy. Taken together, our results demonstrate miRDP2 as a fast and accurate tool for analyzing plant miRNA transcriptomes, therefore a useful tool in helping the community better annotate miRNAs in plants.
One of the most exciting discoveries in the last two decades in biology is the proliferating role of sRNA species in regulating diverse functions of the genome1. In particular, miRNAs constitute an important class of 20- to 24-nt sRNAs in eukaryotes, and mainly function at post-transcriptional level as prominent gene regulators throughout life cycle development stages as well as in stimulus and stress responses2,3. In plants, miRNAs arise from primary transcripts called pri-miRNAs, which are generally transcribed by RNA polymerase II as individual transcription units4,5. Processed by evolutionarily conserved cellular machinery (Drosha RNase III in animals, DICER-like in plants), pri-miRNAs are excised into the immediate miRNA precursors, pre-miRNAs, which contain sequences forming intra-molecular stem-loop structures6,7. Pre-miRNAs are then processed into double-stranded intermediates, namely miRNA duplexes, consisting of the functional strand, mature miRNA, and the less frequently functional partner, miRNA*2,8. After loaded into the RNA-induced silencing complex (RISC), the mature miRNAs could recognize their mRNA targets based on sequence complementarity, resulting in a negative regulatory function2,8. miRNAs could either destabilize their target transcripts or prevent target translation but the former manner is dominated in plants8,9.
Since the fortuitous discovery of the first miRNA in the nematode Caenorhabditis elegans10,11, much research has been committed to miRNA identification and its functional analysis, especially after the availability of NGS method. The wide application of the NGS method has greatly promoted the utilization of computational tools that were designed to capture the unique feature of miRNAs, such as the stem-loop structure of precursors and their preferential accumulation of sequence reads on mature miRNA and miRNA*. As a result, researchers have achieved remarkable success in identifying miRNAs in diverse species. Based on a previously described probability model12, we developed miRDeep-P13, which was the first computational tool for discovering plant miRNAs from NGS data. miRDeep-P was specifically aimed at conquering the challenges of decoding plant miRNAs featuring more variable precursor length and large paralogous families13,14,15. After its release, this program has been downloaded thousands of times and used to annotate miRNA transcriptomes in more than 40 plant species16. Propelled by NGS-based tools like miRDeep-P, there has been a dramatic increase in the number of registered miRNAs in the public miRNA repository miRBase17, where over 38,000 miRNA items are currently hosted (release 22.1) in comparison to only ~500 miRNA items (release 2.0) in 200818.
However, two new challenges have arisen from plant miRNA annotation. First, high ratios of false-positives have heavily impacted the quality of plant miRNA annotations16,19 for the following reasons: 1) a deluge of endogenous short interfering RNAs (siRNAs) from NGS sRNA libraries were erroneously annotated as miRNAs due to lacking of a stringent miRNA annotation criteria; 2) for species without a priori miRNA information, false-positives predicted based on NGS data are difficult to eliminate. Using miRBase as an example, Taylor et al.20 found one third of plant miRNA entries in the public repository21 (release 21) lacked convincing supporting evidence and even three-fourths of plant miRNA families were questionable. Second, it becomes an extremely time-consuming process for predicting plant miRNAs with large and complex genomes16. To overcome these challenges, we updated miRDeep-P by adding a new filtering strategy, overhauling the scoring algorithm and integrating new criteria for plant miRNA annotation, and released the new version miRDP2. In addition, we tested miRDP2 using NGS sRNA datasets with gradually increasing genome sizes: Arabidopsis, rice, tomato, maize and wheat. Compared to other five widely used tools and its old version, miRDP2 parsed these sRNA data and analyzed miRNA transcriptomes faster with improved accuracy and sensitivity.
Contents of the miRDP2 package
The miRDP2 package consists of six documented Perl scripts that should be run sequentially by the prepared bash script. Of the six scripts, three (convert_bowtie_to_blast.pl, filter_alignments.pl, and excise_candidate.pl) are inherited from miRDeep-P. The other scripts are modified from the original version. Functions of the six scripts are described in the following:
preprocess_reads.pl filters input reads, including reads that are too long or too short (<19 nt or >25 nt), and reads correlated with Rfam ncRNA sequences, as well as reads with RPM (Reads Per Million) less than 5. The script then retrieves reads correlated to known miRNA mature sequences. The input files are original reads in FASTA/FASTQ format and bowtie2 output of reads mapping to miRNA and ncRNA sequences.
The formula for calculating RPM is as the following:
convert_bowtie_to_blast.pl changes the bowtie format into BLAST-parsed format. BLAST-parsed format is a custom tabular separated format derived from standard NCBI BLASToutput format.
filter_alignments.pl filters the alignments of deep sequencing reads to a genome. It filters partial alignments as well as multi-aligned reads (user-specified frequency cutoff). The basic input is a file in BLAST-parsed format.
excise_candidate.pl cuts out potential precursor sequences from a reference sequence using aligned reads as guidelines. The basic input is a file in BLAST-parsed format and a FASTA file. The output is all potential precursor sequences in FASTA format.
mod-miRDP.pl needs two input files, signature file and structure file, which is modified from the core miRDeep-P algorithm by changing the scoring system with plant specific parameters. The input files are dot-bracket precursor structure file and reads distribution signature file.
mod-rm_redundant_meet_plant.pl needs three input files: chromosome_length, precursors and original_prediction generated by mod-miRDP.pl. It generates two output files, non-redundant predicted file and predicted file filtered by newly updated plant miRNA criteria. Details on the format of output file are described in section 1.4.
1. Installation and testing
2. Identifying novel miRNAs
3. Modifications and caution using miRDP2
The miRNA annotation pipeline, miRDP2, described herein is applied to 10 public sRNA-seq libraries from 5 plant species with gradually increased genome length, including Arabidopsis thaliana, Oryza sativa (rice), Solanum lycopersicum (tomato), Zea mays (maize) and Triticum aestivum (wheat) (Figure 1A). Overall, for each species, 2 representative sRNA libraries from different tissues (collapsed into unique reads, details in the protocol section) and their indexed genome sequences are processed as two inputs (Table 1). Five miRNA computational prediction tools (miRDeep-P13, miRPlant25, miR-PREFeR26, miRA27, miReNA28) were selected to make the comparison.
Running time test
To compare the runtime and performance of miRDP2 and other five tools, we installed five tools (miRDP2, miRDeep-P, miR-PREFeR, miRA, and miReNA) in a cluster server with Cent OS release 6.5 system. These programs were run with the same input files, hardware and resource (details in Supplementary File 1). Especially, miRPlant is controlled from a GUI written in Java and was not able to run on the server. Instead, we tested miRPlant on a PC with Windows 10 while we have also tested miRDP2 and miRDeep-P on this PC (details in Supplementary File 1).
For small genome species as Arabidopsis thaliana, Oryza sativa, and Solanum lycopersium, all the programs ran properly. However, for large genomes species such as Zea mays and Triticum aestivum (including Solanum lycopersium for miRA), some of the programs depleted all computing resources and broke down halfway. For instance, miReNA, miRA, and miR-PREFeR failed to generate results, probably due to memory deficiency while dealing with large sam files or intermediate files. In particular, miRPlant temporary files consumed too much space, and the result was not able to run on the PC when dealing with large genome species. miRDP2 finished these prediction processes in a very short time, from minutes to hours (Figure 1B). Thus, compared to its old version and other tools, the running time of miRDP2 was markedly shortened.
Sensitivity and accuracy test
Since miRNAs in Arabidopsis are intensively studied, we made use of known miRNAs in Arabidopsis in miRBase21 (release 22.1) to evaluate miRDP2, and made the comparison with other tools. As previously reported19,26, the following formulas are employed to calculate sensitivity and accuracy:
Known miRNAs are those annotated in miRBase. A miRNA is designated as expressed if the mature sequences have more than 5 RPM, and ≥75% reads on the precursor mapped to mature and star miRNA sequences. Two sequenced sRNA libraries from Arabidopsis (Table 1) were used to make the test. miRDP2 (Figure 1C,D) performed better in both sensitivity and accuracy compared to other tools.
Taken together, these results demonstrate that miRDP2 is a fast and accurate tool for analyzing the miRNA transcriptome in plants.
Figure 1: Performance of miRDP2. (A) Genome size (in Gb) of Arabidopsis thaliana (Ath), Oryza sativa (Osa), Solanum lycopersicum (Sly), Zea mays (Zma), Triticum aestivum (Tae). (B-D) Comparison of runtime, sensitivity and accuracy of miRDP2 and other five tools. Two dots corresponding to each tool indicate two tests were made by each tool. This figure has been adapted from Kuang et al.16. Please click here to view a larger version of this figure.
Species (abb.) | Genome version | sRNA libraries | ||||
Library ID | File size | Total reads | Unique reads | Tissue | ||
Arabidopsis thaliana (Ath) | version 10 | GSM2094927 | 24.9 Mb | 40.5M | 9.7M | Adult leaf |
GSM2412287 | 29.5 Mb | 45.1M | 11.1M | Leaf | ||
Oryza sativa (Osa) | version 7 | GSM2883136 | 44.2 Mb | 54.9M | 16.3M | Seedling |
GSM3030848 | 34.7 Mb | 49.1M | 13.0M | Flagleaf | ||
Solanum lycopersicum (Sly) | version 3 | GSM1213985 | 205.4 Mb | 161.5M | 58.0M | Leaf |
GSM1976413 | 118.5 Mb | 139.3M | 46.2M | Root | ||
Zea mays (Zma) | version 4 | GSM1277437 | 158.4 Mb | 266.1M | 60.5M | Seedling |
GSM1428531 | 144.1 Mb | 172.5M | 56.3M | Seed | ||
Triticum aestivum (Tae) | iwgsc 1 | GSM1294660 | 76.1 Mb | 59.2M | 29.6M | Shoot |
GSM1294661 | 113.6 Mb | 84.0M | 44.0M | Leaf |
Table 1: Genomes and sRNA libraries used for testing miRDP2 and other tools. This table has been adapted from Kuang et al.16.
Supplementary File 1: Comparison of runtime, sensitivity and accuracy of miRDP2 and other five tools. Please click here to download this file.
Supplementary File 2: Examples of authentic miRNAs with bifurcate structure in loops. Please click here to download this file.
Supplementary File 3: Updated criteria for plant miRNA annotation and criteria for 23-nt and 24-nt miRNAs. Please click here to download this file.
Supplementary File 4: Diagram of the workflow of miRDP2. Please click here to download this file.
With the advent of NGS, a large number of miRNA loci have been identified from an ever-increasing amount of sRNA sequencing data in diverse species29,30. In the centralized community database miRBase21, the deposited miRNA items have increased almost 100 times in the last decade. However, in comparison to miRNAs in animals, plant miRNAs have many unique features that make the identification/annotation more complicated13,14.
First, the precursors of plant miRNAs are more variable in length and structure (Supplementary File 2)16. Not like the relatively uniform length of animal miRNA precursors around 70-90 nt, the length of plant precursors vary by several folds and could reach several hundred nts13,31. This difference introduces a lot of uncertainty when predicting the secondary structure of miRNA precursors even though a cutoff of precursor length is usually set arbitrarily such as not exceeding 300 nt19 (this parameter was embedded in miRDP2, and experienced users of miRDP2 could adjust this by themselves). In addition, conserved plant miRNA families tend to have more members, and the length variation of these members is also often significant13. This is the reason why miRDP2 has the parameter –L, which indicates the potential largest miRNA families in member size. Together, the heterogeneity of plant miRNA precursors raises many difficulties for their accurate annotation.
Second, the noise or false-positives introduced by siRNAs is hard to eliminate. Alongside miRNAs, NGS methods also produce a deluge of siRNAs in the sequenced sRNA libraries. Even though siRNAs could be separated from miRNAs by their biogenesis and functions32,33, it is extremely difficult to distinguish them based on sequencing data and mining tools. The public databases such as miRBase, argued by many researchers, have deteriorated sharply by the large number of false-positives siRNAs, which are erroneously annotated as miRNAs20,31. Thus, refined tools with a new and strict set of criteria for plant miRNA annotation like the newly updated criteria25 (Supplementary File 3) are highly desired in the miRNA annotation pipeline/process.
Last but not least, the computational time for parsing sRNA libraries has increased exponentially when the same method is transplanted from a small size genome species to a large size one. The computational tools such as miRDeep-P13 and miR-PREFeR26, by capturing and quantifying the signature distribution of sRNA reads along miRNA precursors, have become two popular methods and are widely used to annotate miRNAs. The mapping strategy, the process of excising precursor candidates and subsequent secondary structure prediction demand considerable computing time16. When these tools are employed to parse the data from small size genomes like Arabidopsis to large ones like maize, the data processing time is increased from hours to days even weeks (Figure 1B), resulting in frequent collapse of the process. An innovation on the foregoing limitations is thus urgently in need.
Our new miRDP216 program, updated from miRDeep-P13, is designed to overcome the challenges mentioned above (Supplementary File 4). In this program, we employed a new filtering strategy, optimized the scoring algorithm, and incorporated newly updated plant miRNA annotation criteria. As a result of these new features, the running time was markedly shortened when tested using ten sRNA libraries from five plant species with increasing genome size. Additionally, compared to other tools, miRDP2 displayed superior performance in both sensitivity and accuracy (Figure 1). Taken together, these results demonstrate that miRDP2 is a fast and accurate tool for analyzing the miRNA transcriptomes in plants.
It should be cautioned that the current understanding on miRNA characteristics might limit the performance of any computational tools. Even the newly updated miRNA annotation criteria are based on a limited set of well-studied examples. The deduced information is thus only empirical. In fact, unique features of miRNAs have been shown to exist in different plant species or lineages3. In addition, characteristics such as the structures of upstream and downstream regions of the miRNA/miRNA* duplex also play critical roles in miRNA biogenesis34,35, which are not taken into account in current annotation tools. With the accumulation of well-studied examples in more plant species, it is likely that even more advanced annotation tools are developed in the future that can capture more subtle distinctions and classify miRNAs with a higher degree of accuracy than current methods. A promising new miRNA annotation direction is to incorporate machine learning approaches36 as the quality of training datasets and annotation criteria continually evolve.
The authors have nothing to disclose.
This work has been supported by Beijing Academy of Agriculture and Forestry Sciences (KJCX201917, KJCX20180425, and KJCX20180204) to XY and National Natural Science Foundation of China (31621001) to LL.
Computer/computing node | N/A | N/A | Perl is required; at least 8 GB RAM and 100 GB storage are recommended |