We constructed an untargeted metabolomics workflow that integrates XY-Meta and metaX. This protocol shows how to use XY-Meta to generate a decoy spectral library from an open-access reference spectral library, identify metabolomics spectra under FDR control, and then use metaX to quantify the identified metabolites.
Untargeted metabolomics techniques have been widely used in recent years. However, the rapidly increasing throughput and number of samples generate an enormous number of spectra, which makes quality control of mass spectrometry spectra challenging. To reduce false positives, false discovery rate (FDR) quality control is necessary. We recently developed XY-Meta, a software tool for FDR control of untargeted metabolome identification based on a Target-Decoy strategy. Here, we demonstrate a complete analysis pipeline that integrates XY-Meta and metaX. This protocol shows how to use XY-Meta to generate a decoy database from an existing reference database and to perform FDR control using the Target-Decoy strategy for large-scale metabolome identification on an open-access dataset. Differential analysis and metabolite annotation are performed after running metaX for metabolite peak detection and quantitation. To help more researchers, we also developed a user-friendly cloud-based analysis platform for these analyses that requires no bioinformatics skills or programming knowledge.
Metabolites play important roles in biological processes. They often regulate processes such as energy transfer, hormone regulation, neurotransmitter regulation, cellular communication, and protein post-translational modification1,2,3,4. Untargeted metabolomics provides a global view of numerous metabolites5,6. With advances in mass spectrometry and chromatography technologies, the throughput of metabolome MS/MS spectra has increased rapidly in recent years7,8,9,10,11. To identify metabolites from these huge datasets, various annotation software tools have been developed11, such as MZmine12, MS-FINDER13, CFM-ID14, MetFrag15, and SLAW16. However, these identifications often contain many false positives, for several reasons: (1) MS/MS spectra contain random noise, which may mislead peak matching. (2) Isomers and differences in fragmentation energies produce multiple spectral fingerprints and thus increase the volume of the reference library. (3) The quality of reference libraries varies, and a proper standard for building a good reference spectral library is needed. Therefore, systematic false discovery rate (FDR) control for untargeted metabolomics is essential for functional metabolome research7,8,9,17.
Both the Empirical Bayes approach and the Target-Decoy strategy have been used to address the FDR control problem. Kerstin Scheubert et al. showed that the Target-Decoy strategy, applied to a decoy database generated by a fragmentation tree-based method, is the best method for FDR control9. Xusheng Wang et al. designed a decoy generation method based on the octet rule in chemistry and improved the precision of FDR estimation17. Generating the decoy database from a spectral library was demonstrated to give better performance18. Here, we improved the spectral library-based method and developed a software tool called XY-Meta19 that further improves the precision of FDR estimation. It uses an existing reference spectral library to generate a decoy library for FDR control under the Target-Decoy scheme. XY-Meta supports its own spectral matching and cosine similarity algorithms. It allows conventional and iterative search modes. In the FDR assessment step, it supports Target-Decoy concatenated mode and separated mode. For better flexibility, XY-Meta also accepts external decoy libraries.
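The Target-Decoy idea can be illustrated with a minimal Python sketch. It assumes the simple estimator FDR = (decoy matches) / (target matches) at or above a score threshold, applied to one pooled list of best matches; the estimator that XY-Meta actually implements, and the function names below, are not taken from the XY-Meta source and are used only for illustration.

```python
def target_decoy_fdr(matches):
    """Estimate FDR down a ranked list of spectrum matches.

    matches: iterable of (score, is_decoy) pairs, one per matched spectrum.
    Returns a list of (score, is_decoy, fdr) sorted by descending score.
    Assumption: FDR at a given rank = decoys so far / targets so far.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = decoys = 0
    result = []
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdr = decoys / targets if targets else 1.0
        result.append((score, is_decoy, min(fdr, 1.0)))
    return result


def accepted_targets(matches, fdr_cutoff=0.01):
    """Return the scores of target matches kept at the given FDR cutoff.

    The list is cut at the lowest-scoring position whose estimated FDR
    still satisfies the cutoff; decoy matches are then discarded.
    """
    ranked = target_decoy_fdr(matches)
    last_ok = -1
    for i, (_, _, fdr) in enumerate(ranked):
        if fdr <= fdr_cutoff:
            last_ok = i
    return [score for score, is_decoy, _ in ranked[: last_ok + 1] if not is_decoy]
```

In this sketch a higher score means a better match. Both functions take the same pooled list, which corresponds most closely to a concatenated search in which target and decoy spectra compete for each query spectrum.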
Peak detection and quantification of metabolites is also an important step in untargeted metabolome analysis. Peak detection is the main method for metabolome identification. In general, the accuracy of metabolite peak detection is affected by multiple factors, such as mass spectrometry noise signals, low metabolite abundance, contaminants, and metabolite degradation products20. When the number of samples is too large or the liquid chromatography column is replaced during an untargeted metabolome experiment, remarkable batch effects may appear, which are a major challenge for metabolome quantitation21,22,23. Currently, software such as XCMS24, Workflow4Metabolomics25, iMet-Q26, and metaX19 can perform peak detection and quantitation of the untargeted metabolome, but we suggest that the metaX pipeline is more complete and easier to use. Here, we demonstrate identification and FDR control for the publicly available dataset msv000084112 using XY-Meta, and peak detection and quantification of metabolites using metaX. This workflow requires only two groups, and each group needs at least two samples. MS/MS spectral data are required; the workflow is independent of the mass spectrometer platform, ionization mode, charge mode, and sample type, and it supports sample-based normalization and peak-based normalization. Following this example, researchers can perform metabolomics identification and quantification in an easy-to-handle way. Using this pipeline requires R programming capability. To help researchers without programming knowledge, we also developed a cloud analysis platform for metabolomics analysis, which is demonstrated in Supplementary Material 5.
1. Prepare metabolomics datasets for analysis
NOTE: In this demonstration, we use metabolomics datasets without QC samples. Data for both case and control groups are needed. For demonstration, we use a public dataset from the GNPS database27.
2. Data format conversion
NOTE: If the dataset is raw data generated directly by the mass spectrometer, it is usually in .raw, .wiff, or .cdf format and should be converted to mzXML and mgf formats. Here, we use the msconvert tool in the ProteoWizard29 package for format conversion.
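As a hedged illustration of this conversion step, the Python sketch below only assembles the two msconvert command lines (one for mzXML, one for mgf). It assumes that msconvert is on the system PATH and uses its `--mzXML`, `--mgf`, and `-o` options; the helper name and the file names are hypothetical, and any additional filters required by a given instrument are not shown.

```python
import subprocess


def msconvert_commands(raw_file, out_dir, exe="msconvert"):
    """Build msconvert command lines for mzXML and mgf output.

    raw_file, out_dir, and exe are illustrative placeholders; adjust them
    to the local ProteoWizard installation and data layout.
    """
    return [
        [exe, raw_file, "--mzXML", "-o", out_dir],  # input for peak detection
        [exe, raw_file, "--mgf", "-o", out_dir],    # input for spectral matching
    ]


def run_conversion(raw_file, out_dir, exe="msconvert"):
    """Run both conversions; raises CalledProcessError if msconvert fails."""
    for command in msconvert_commands(raw_file, out_dir, exe):
        subprocess.run(command, check=True)
```

Calling `run_conversion("sample.raw", "converted")` would then write both converted files into the `converted` folder, provided ProteoWizard is installed and the vendor format is supported on the machine.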
3. Prepare the reference spectral library for the metabolites
NOTE: XY-Meta supports reference spectral libraries only in mgf format.
4. Metabolite identification and FDR control
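To check that a reference library is readable before running the search, the mgf file can be inspected with a few lines of Python. The sketch below assumes the common mgf layout (spectra delimited by BEGIN IONS and END IONS, KEY=VALUE header lines, and whitespace-separated m/z and intensity pairs); which header fields XY-Meta itself requires is not specified here, and the function name is hypothetical.

```python
def parse_mgf(text):
    """Parse mgf-formatted text into a list of spectra.

    Each spectrum is a dict with 'params' (header fields, keys upper-cased)
    and 'peaks' (list of (mz, intensity) tuples).
    """
    spectra = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                fields = line.split()
                if len(fields) >= 2:
                    current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra
```

For a file on disk, `parse_mgf(open(path).read())` returns one entry per spectrum, so `len(...)` gives a quick count of the reference spectra in the library.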
5. Differential analysis
NOTE: metaX is an open-source R package. Please install it according to the guide at https://github.com/wenbostar/metaX. 8 GB of RAM is required for this analysis.
6. Integration of qualitative and quantitative results
The raw data of msv000084112 were converted with msconvert.exe to generate mgf files (Supplementary Material 6).
XY-Meta generated the GNPS-NIST14-MATCHES_Decoy.mgf file in the /database folder. This is the decoy library generated from the original reference spectral library GNPS-NIST14-MATCHES.mgf. This decoy library can be reused. To reuse it, set decoy_pattern to 1 in the parameter.default file and set decoyinput to the absolute path of the decoy library. The identification results were generated in the /mgf folder (with the suffix .meta) and include the spectral matching scores, FDR, m/z of the metabolites, retention time, and metabolite names (Supplementary Material 7).
The quantitative analysis results from metaX were stored in the /output folder. The overall quantitative distributions of NH and NT are similar, with low fluctuation of the mean values (Figure 1A). Only a small fraction of values was missing: only 3.39% of the metabolites have more than 30% missing values (Figure 1B). metaX remarkably increased the proportion of metabolites with CV ≤ 0.3 (Figure 1C). The box plots were stored in the /metaX_box folder. The elution profiles were stored in the /metaX_eic folder. The metabolite peaks were recorded in metaX-feature.txt. The quantitative values of the metabolites identified in both groups and the differential analysis results were stored in metaX_peaks.txt (Figure 1D). Applying the thresholds |LogFC| ≥ 1 and p-value < 0.05, 342 metabolites were differentially detected, with 206 up-regulated and 136 down-regulated (Supplementary Material 9).
We annotated the peaks detected by metaX using the identifications with FDR < 0.01. If a peak could be annotated by multiple metabolites, we took the one with the highest spectrum matching score as the final annotation. Using these criteria, we annotated six differential metabolite peaks (Figure 2).
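The thresholding step can be sketched in Python as follows. This is only an illustration of the |LogFC| ≥ 1 and p-value < 0.05 rule: the field names `id`, `log_fc`, and `p_value` are hypothetical and do not correspond to the actual column names in metaX_peaks.txt, and the sign convention (positive LogFC meaning up-regulated) is an assumption.

```python
def classify_features(rows, logfc_cutoff=1.0, p_cutoff=0.05):
    """Split features into up- and down-regulated lists.

    rows: iterable of dicts with hypothetical keys 'id', 'log_fc', 'p_value'.
    A feature is kept when |log_fc| >= logfc_cutoff and p_value < p_cutoff.
    Returns (up_ids, down_ids).
    """
    up, down = [], []
    for row in rows:
        if row["p_value"] >= p_cutoff or abs(row["log_fc"]) < logfc_cutoff:
            continue  # not significant or fold change too small
        if row["log_fc"] > 0:
            up.append(row["id"])
        else:
            down.append(row["id"])
    return up, down
```

Applied to a table parsed into dictionaries, the lengths of the two returned lists give the numbers of up-regulated and down-regulated features under the chosen thresholds.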
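The selection rule described above (keep identifications with FDR < 0.01, then take the highest-scoring one per peak) can be sketched in Python. The sketch assumes that each candidate identification has already been linked to a peak; how that link is established is not covered here, and the keys `peak_id`, `name`, `score`, and `fdr` are hypothetical rather than taken from the script in Supplementary Material 4.

```python
def annotate_peaks(candidates, fdr_cutoff=0.01):
    """Pick one annotation per peak.

    candidates: iterable of dicts with hypothetical keys
    'peak_id', 'name', 'score', and 'fdr'.
    Candidates with fdr >= fdr_cutoff are discarded; among the rest, the
    candidate with the highest score wins for each peak.
    Returns a dict mapping peak_id to the chosen metabolite name.
    """
    best = {}
    for cand in candidates:
        if cand["fdr"] >= fdr_cutoff:
            continue
        kept = best.get(cand["peak_id"])
        if kept is None or cand["score"] > kept["score"]:
            best[cand["peak_id"]] = cand
    return {peak_id: cand["name"] for peak_id, cand in best.items()}
```

When two candidates for the same peak have identical scores, this sketch keeps the one encountered first; a different tie-breaking rule may be preferable in practice.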
Figure 1. Quantitative analysis by metaX. (A) Box plot of quantified metabolites of all samples. (B) Histogram of missing value distribution. (C) PCA plot of two group samples. (D) Venn diagram of differentially detected metabolites from three statistical test methods.
Figure 2. Retention time (RT) and m/z distribution of all annotated metabolites. Red dots represent the significant and differentially detected metabolites.
Supplementary Material 1: The parameter file of XY-Meta.
Supplementary Material 2: Grouping information sheet of samples for metaX.
Supplementary Material 3: The script for integrating the workflow of XY-Meta and metaX.
Supplementary Material 4: The script for annotating the peaks using metabolome identifications.
Supplementary Material 5: A complete workflow for metabolome analysis using the cloud platform.
Supplementary Material 6: An mgf file converted from msconvert for a sample data of msv000084112.
Supplementary Material 7: An identification result table from XY-Meta for a sample data of msv000084112.
Supplementary Material 8: The cohort clinical information sheet of msv000084112.
Supplementary Material 9: Identification list of all metabolites and differential analysis results of all metabolite peaks.
FDR control in untargeted metabolite identification has been a great challenge. Here, we demonstrated a complete pipeline for large-scale untargeted metabolomics analysis (qualitative and quantitative) with FDR control. This pipeline effectively reduces false positives, which are very common in MS analysis.
Preparing an appropriate reference spectral library for the study is a key point. Successful and sensitive MS/MS identification requires not only proper matching algorithms but also proper reference spectral libraries. The applicability of public spectral libraries is limited for the following reasons: (1) Many public spectral libraries do not include complete metabolite lists. (2) The spectra in public spectral libraries originate from various MS instruments and/or various fragmentation conditions30,31. Therefore, we suggest collecting spectra of standard metabolites on the same instrument and under the same fragmentation conditions to construct an "exclusive" spectral library. These conditions should also be maintained during the actual measurements. Additionally, when modifying the parameter file, the tolerances of the precursor ions and the fragment ions should match the parameters of the instrument. Normally, the precursor tolerance should be between 10 ppm and 20 ppm, and the fragment tolerance should be between 0.01 Da and 0.5 Da. For this dataset, the instrument parameters are unknown, but a fragment tolerance of 0.05 Da is a conservative choice that allows this workflow to perform normally.
Users may still receive various error messages when they run this pipeline. Common mistakes include an incorrect input file path, a missing parameter file, and file access conflicts (e.g., access denied by the operating system or simultaneous access to the same file).
Note that this workflow is currently applicable only to the targeted and untargeted metabolomic analysis of small molecules of less than 1,000 Da and cannot be used to analyze the metabolomes of macromolecules such as glycan chains or lipid chains. In addition, neither data-independent acquisition (DIA) data nor ion mobility data are suitable for analysis with this workflow. This workflow does not support the use of m/z and retention time of metabolites to annotate peak detection results, and it only supports differential analysis of two groups of data with more than two samples.
For a long time, the identification results of untargeted metabolome analysis dominated by peak detection technology have tended to contain many false positives, mainly due to the large number of metabolite isomers and different ionic adduct forms. Comparing the MS/MS spectra of metabolites with the reference spectra of known metabolites can resolve the structure of the metabolites and distinguish isomers32. However, a metabolite cannot be identified if its reference spectrum is not publicly or commercially available7. Therefore, building a reliable library of metabolite reference spectra is a big challenge. Reference spectra of low quality or from structurally similar compounds lead to random matching of experimental spectra. Therefore, FDR control of identification results is necessary to ensure confident identifications. By providing the input data required by the protocol, users can use this pipeline to automatically identify the metabolome with FDR control and to perform quantitation and differential analysis. This is convenient and economical for many researchers, especially beginners.
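The difference between the two tolerance units can be made concrete with a short Python sketch. A ppm tolerance scales with m/z, whereas a Da tolerance is a fixed window. The function names are hypothetical, the default values simply reuse the figures mentioned above, and the sketch does not reproduce how XY-Meta applies these settings internally.

```python
def ppm_window(mz, ppm):
    """Return the (low, high) m/z window for a relative tolerance in ppm."""
    delta = mz * ppm / 1e6
    return mz - delta, mz + delta


def precursor_matches(query_mz, ref_mz, ppm=10.0):
    """True if the precursor m/z values agree within the ppm tolerance."""
    return abs(query_mz - ref_mz) <= ref_mz * ppm / 1e6


def fragment_matches(query_mz, ref_mz, tol_da=0.05):
    """True if the fragment m/z values agree within an absolute Da tolerance."""
    return abs(query_mz - ref_mz) <= tol_da
```

For example, at m/z 500 a tolerance of 10 ppm corresponds to ±0.005 Da, while the same 10 ppm at m/z 100 corresponds to only ±0.001 Da.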
The authors have nothing to disclose.
This work is supported by the National Key Research and Development Program (2018YFC0910200/2017YFA0505001) and the Guangdong Key R&D Program (2019B020226001).
GNPS | open source | n/a | https://gnps.ucsd.edu/ProteoSAFe/static/gnps-splash.jsp |
XY-Meta | open source | n/a | https://github.com/DLI-ShenZhen/XY-Meta |
metaX | open source | n/a | https://github.com/wenbostar/metaX |
ProteoWizard | Free Download | 3.0.22116.18c918b-x86_64 | https://proteowizard.sourceforge.io/download.html |
CHI.Client | Free Download | ndp48-x86-x64-allos-enu | http://www.chi-biotech.com/technology.html?ty=ypt |