LEfSe (LDA Effect Size) is a tool for high-dimensional biomarker mining to identify genomic features (such as genes, pathways, and taxonomies) that significantly characterize two or more groups in microbiome data.
There is growing attention toward closed biological genomes in the environment and in health. To explore and reveal the differences among samples or environments, it is crucial to discover biomarkers that differ statistically among groups. Linear discriminant analysis Effect Size (LEfSe) can help find such biomarkers. Starting from the original genome data, quality control and quantification of sequences by taxa or genes are carried out. First, the Kruskal-Wallis rank test is used to identify features that differ significantly among the groups. Then, the Wilcoxon rank test is performed between pairs of groups for the features retained in the previous step to assess whether the differences are consistent. Finally, a linear discriminant analysis (LDA) is conducted to estimate, as LDA scores, the influence of each biomarker on the significantly different groups. In summary, LEfSe offers a convenient way to identify genomic biomarkers that characterize statistical differences among biological groups.
Biomarkers are measurable biological characteristics that can indicate phenomena such as infection, disease, or environment. Functional biomarkers may be biological functions specific to a single species or common to several species, such as genes, proteins, metabolites, and pathways. Taxonomic biomarkers, in contrast, indicate an unusual species, a group of organisms (kingdom, phylum, class, order, family, genus, species), an Amplicon Sequence Variant (ASV)1, or an Operational Taxonomic Unit (OTU)2. To find biomarkers more quickly and accurately, a tool for analyzing the biological data is necessary. LEfSe explains the differences between classes by coupling standard tests for statistical significance with additional tests encoding biological consistency and effect relevance3. LEfSe is available as a Galaxy module, a Conda formula, and a Docker image, and is included in bioBakery (VM and cloud)4.
Because the distribution of a sampled community is usually unknown, analyses of microbial diversity often use non-parametric tests. The rank-sum test is a non-parametric method that replaces the sample values with their ranks. Depending on the number of sample groups, it takes the form of the Wilcoxon rank-sum test for two groups or the Kruskal-Wallis test for multiple groups5,6. Notably, when significant differences are found among multiple groups, pairwise rank-sum tests between the groups should then be performed.
Linear Discriminant Analysis (LDA), invented by Ronald Fisher in 1936, is a supervised learning method also known as Fisher’s Linear Discriminant7. It is a classic and popular algorithm in machine learning and data mining.
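The rank-based idea described above can be illustrated with a minimal sketch that computes the Kruskal-Wallis H statistic by hand for one feature measured in three groups. The abundance values, the function names, and the omission of a tie correction are assumptions made for illustration; this is not the code LEfSe itself runs.

```python
from itertools import chain
from math import exp


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (without tie correction) for a list of groups."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical relative abundances of one taxon in groups A, B, and C.
group_a = [0.10, 0.12, 0.11, 0.13]
group_b = [0.20, 0.22, 0.21, 0.23]
group_c = [0.30, 0.32, 0.31, 0.33]

h_stat = kruskal_wallis_h([group_a, group_b, group_c])
# With three groups, H is compared with a chi-square distribution with
# 2 degrees of freedom, whose upper tail has the closed form exp(-H / 2).
p_value = exp(-h_stat / 2)
print(f"H = {h_stat:.3f}, approximate p = {p_value:.4f}")
```

Because only ranks enter the statistic, the result does not depend on the shape of the abundance distribution, which is why rank-sum tests suit community data of unknown distribution. A feature that passes this step would then be examined with pairwise rank-sum tests.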
Here, the LEfSe workflow is optimized for use with Conda and Galaxy servers. Three groups of 16S rRNA gene sequences are analyzed to demonstrate significant differences among the groups, using the LDA scores of the microbial communities and visualizations of the results.
NOTE: The protocol was sourced and modified from the research of Segata et al.3. The method is provided at https://bitbucket.org/biobakery/biobakery/wiki/lefse.
1. Preparation of input file for analysis
2. LEfSe native analysis (limited to the Linux server)
3. LEfSe online analysis (galaxy)
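For step 1, the following minimal sketch assembles a tab-delimited input table in memory. It assumes the commonly documented LEfSe layout: samples in columns, metadata rows (group label and sample ID) at the top, and one row per feature, with taxonomic levels joined by "|". The sample IDs, taxa, abundance values, and the helper name build_lefse_table are hypothetical; Table 1 remains the authoritative example of the expected file.

```python
import csv
import io

# Hypothetical sample IDs, group labels, and relative abundances.
samples = ["S1", "S2", "S3", "S4", "S5", "S6"]
classes = ["A", "A", "B", "B", "C", "C"]
abundances = {
    "Bacteria|Firmicutes": [0.40, 0.42, 0.30, 0.28, 0.10, 0.12],
    "Bacteria|Proteobacteria": [0.20, 0.18, 0.35, 0.37, 0.60, 0.58],
}


def build_lefse_table(samples, classes, abundances):
    """Return tab-delimited text: metadata rows first, then one row per feature."""
    if len(classes) != len(samples):
        raise ValueError("each sample needs exactly one class label")
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["class"] + classes)
    writer.writerow(["sample_id"] + samples)
    for feature, values in abundances.items():
        if len(values) != len(samples):
            raise ValueError(f"{feature}: expected {len(samples)} values")
        writer.writerow([feature] + [f"{v:.4f}" for v in values])
    return buffer.getvalue()


table_text = build_lefse_table(samples, classes, abundances)
print(table_text)
```

Writing the table to a string first makes it easy to check that every row has the same number of columns before saving it to a file for upload or native analysis.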
The LDA scores of microbial communities that differ significantly among groups, obtained by analyzing the 16S rRNA gene sequences of three samples, are shown in Figure 8. The color of each bar represents the group, while its length represents the LDA score, that is, the influence of a species that differs significantly between groups. The histogram shows only the species whose LDA score exceeds the preset value. The default preset value is 2.0, so only species with an absolute LDA score (abscissa) greater than 2.0 are shown in the plot.
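The thresholding described above can be sketched as a simple filter. The records below are invented for illustration and do not come from the protocol's data; the tuple layout (feature, group, LDA score) and the function name select_biomarkers are likewise assumptions rather than the LEfSe output format.

```python
# Hypothetical (feature, group, LDA score) records, made up for illustration.
records = [
    ("Firmicutes", "A", 4.1),
    ("Proteobacteria", "C", 3.6),
    ("Bacteroidetes", "B", 1.7),
    ("Actinobacteria", "B", 2.4),
]


def select_biomarkers(records, threshold=2.0):
    """Keep features whose absolute LDA score exceeds the threshold.

    The result is ordered by group, then by decreasing score, which mirrors
    how bars are usually grouped by color in an LDA score histogram.
    """
    kept = [r for r in records if abs(r[2]) > threshold]
    return sorted(kept, key=lambda r: (r[1], -abs(r[2])))


for feature, group, score in select_biomarkers(records):
    print(f"{group}\t{feature}\t{score:.1f}")
```

Raising the threshold shortens the list to the features with the largest effect sizes; lowering it admits weaker candidates.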
The biomarkers that differ significantly among groups, placed on a species tree spanning the classification levels, are shown in Figure 9. The circles radiating from the inside to the outside represent the classification levels from phylum to genus (the innermost yellow circle is the kingdom). The diameter of each small circle at a given classification level represents its relative abundance. Species with no significant difference are uniformly colored yellow, and significantly different biomarkers are colored according to their group. Classes A, B, and C are the group names of the collected microbial samples. Red nodes represent the microbial groups that play an important role in the red group (A); green nodes, those that play an important role in the green group (B); and blue nodes, those that play an important role in the blue group (C). The species names of biomarkers not labeled in the plot are listed on the right side, and the letter codes correspond to those in the plot (by default, only differential species from the phylum to the family level are shown, for aesthetic purposes).
The abundance of one biomarker that differs among groups according to the LEfSe results is shown in Figure 10. In the relative abundance barplot, the solid line represents the mean relative abundance, the dotted line represents the median relative abundance, and each column represents the relative abundance of one sample within its group.
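The two summary lines in such a plot can be reproduced with a minimal sketch using the standard library. The per-sample abundances and the helper name group_summary are hypothetical; group A deliberately contains one high value to show how the mean and median diverge.

```python
from statistics import mean, median

# Hypothetical relative abundances of one biomarker, one value per sample.
per_group = {
    "A": [0.10, 0.12, 0.11, 0.30],
    "B": [0.20, 0.22, 0.21, 0.23],
    "C": [0.05, 0.06, 0.04, 0.05],
}


def group_summary(per_group):
    """Return {group: (mean, median)} of the per-sample relative abundances."""
    return {group: (mean(values), median(values))
            for group, values in per_group.items()}


summary = group_summary(per_group)
for group, (avg, med) in summary.items():
    print(f"{group}: mean = {avg:.4f} (solid line), median = {med:.4f} (dotted line)")
```

When the two lines sit far apart within a group, a few samples are likely driving the average, which is worth checking before interpreting the feature as a biomarker.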
Table 1: The example file for LEfSe analysis online.
Figure 1: Schematic overview of the uploading process. Click on the red numbers in sequential order on the figure.
Figure 2: Schematic overview of the operational process for changing data format. Click on the red numbers in sequential order on the figure.
Figure 3: Schematic overview of the operational process for calculating LDA effect size. Click on the red numbers in sequential order on the figure.
Figure 4: Schematic overview of the operational process for plotting LEfSe results. Click on the red numbers in sequential order on the figure.
Figure 5: Schematic overview of the operational process for plotting cladogram. Click on the red numbers in sequential order on the figure.
Figure 6: Schematic overview of the operational process for plotting one feature. Click on the red numbers in sequential order on the figure.
Figure 7: Schematic overview of the operational process for plotting differential features. Click on the red numbers in sequential order on the figure.
Figure 8: Histogram of the distribution of LDA values. LDA scores of microbial communities with significant differences in each group were analyzed by LDA Effect Size according to their influences and correlations.
Figure 9: Cladogram. A typical cladogram obtained with the protocol, representing the differences among the three groups across classification levels.
Figure 10: One feature plot. The abundance barplot of one biomarker that differs among groups according to the LEfSe results is shown.
Here, the protocol for the identification and characterization of biomarkers within different groups is described. This protocol can easily be adapted for other sample types, such as OTUs of microorganisms. The statistical method used by LEfSe can find the characteristic microorganisms in each group (the default is LDA > 2), that is, the microorganisms that are more abundant in this group relative to the others12. LEfSe is available as a native Linux version and as a web version, so users can also perform the analysis on web pages. LEfSe is based on the LDA algorithm and needs species-level information to draw a species tree. With the tool, the relative abundance between groups can be compared. All differential biomarkers can be plotted in a single graph. Also, a single biomarker or all biomarkers can be plotted in batches.
Whether LEfSe is performed through the native server or an online site, many tunable parameters are available for producing the required figures. Because the input files have a complex structure and must be converted to the preferred data formats before further LEfSe analyses, some one-stop services have also been developed. Even so, making the operations easier can be challenging.
On the other hand, there are a few limitations when analyzing complex data using LEfSe. LDA projects the data onto at most one dimension fewer than the number of categories; if more features are needed, other methods must be introduced. Variants of LDA may solve some difficulties. For example, Kernel LDA is a solution if the original data cannot be separated well after projection. Because the computational cost of LDA depends on the dimension of the data, 2DLDA can greatly reduce that cost. Both LDA and PCA are commonly used dimensionality reduction techniques. PCA (Principal Component Analysis) dimensionality reduction is directly related to the data dimension, and the projected coordinate system is orthogonal. However, LDA focuses on the ability of classification according to the labeling of categories, so the projected coordinate system is generally not orthogonal.
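The dimensionality limit noted above can be made concrete with a minimal two-class sketch: with two categories, Fisher's criterion yields a single discriminant axis, obtained as the inverse of the pooled within-class scatter matrix applied to the difference of the class means. The two-feature data points and all function names are hypothetical, and this is a textbook illustration rather than the routine LEfSe uses internally.

```python
def mean_vector(points):
    """Mean of a list of (x, y) points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter_matrix(points, center):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around center."""
    sxx = sxy = syy = 0.0
    for x, y in points:
        dx, dy = x - center[0], y - center[1]
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
    return sxx, sxy, syy


def fisher_direction(class0, class1):
    """Fisher discriminant axis w = Sw^-1 (m1 - m0) for two 2-D classes."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    a0, b0, c0 = scatter_matrix(class0, m0)
    a1, b1, c1 = scatter_matrix(class1, m1)
    a, b, c = a0 + a1, b0 + b1, c0 + c1  # pooled scatter [[a, b], [b, c]]
    det = a * c - b * b
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # Apply the inverse of [[a, b], [b, c]] to the mean difference (dx, dy).
    return [(c * dx - b * dy) / det, (a * dy - b * dx) / det]


def project(points, w):
    """Project each 2-D point onto the axis w, giving one score per point."""
    return [x * w[0] + y * w[1] for x, y in points]


# Hypothetical abundances of two features in two groups of four samples each.
class0 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 2.0)]
class1 = [(5.0, 1.0), (6.0, 2.0), (7.0, 2.0), (6.0, 1.0)]

w = fisher_direction(class0, class1)
scores0, scores1 = project(class0, w), project(class1, w)
print("axis:", w)
print("class 0 scores:", scores0)
print("class 1 scores:", scores1)
```

Two classes give one axis regardless of how many features are measured, whereas PCA could return as many orthogonal components as there are features; the axis here is chosen using the class labels, which PCA ignores.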
LEfSe assists in the selection of biomarkers. With many advantages (e.g., adjustable parameters, detailed results for each part of the analysis, and applicability to two or more groups), it has been widely used13. With increasing demand for high-dimensional data analysis, this method will be applied ever more widely to explore biomarkers among the features (organisms, clades, operational taxonomic units, genes, or functions) that affect human health and disease.
The authors have nothing to disclose.
This work was supported by a grant from Fundamental Research Funds for the Central Public Welfare Research Institutes (TKS170205) and Foundation for Development of Science and Technology, and Tianjin Research Institute for Water Transport Engineering (TIWTE), M.O.T. (KJFZJJ170201).