An Integrated Workflow of Identification and Quantification on FDR Control-Based Untargeted Metabolome

Published: September 20, 2022
doi: 10.3791/63625

Summary

We constructed an untargeted metabolomics workflow that integrates XY-Meta and metaX. In this protocol, we show how to use XY-Meta to generate a decoy spectral library from an open-access reference spectral library, identify the metabolomics spectra under FDR control, and then use metaX to quantify the identified metabolites.

Abstract

Untargeted metabolomics techniques have been widely used in recent years. However, rapidly increasing throughput and sample numbers produce an enormous number of spectra, which makes quality control of mass spectrometry spectra challenging. To reduce false positives, false discovery rate (FDR) control is necessary. Recently, we developed XY-Meta, a software tool for FDR control of untargeted metabolome identification based on a Target-Decoy strategy. Here, we demonstrate a complete analysis pipeline that integrates XY-Meta and metaX. This protocol shows how to use XY-Meta to generate a decoy database from an existing reference database and perform FDR control using the Target-Decoy strategy for large-scale metabolome identification on an open-access dataset. Differential analysis and metabolite annotation are performed after running metaX for metabolite peak detection and quantitation. To help more researchers, we also developed a user-friendly cloud-based analysis platform for these analyses that requires no bioinformatics skills or programming experience.

Introduction

Metabolites play important roles in biological processes. They often regulate processes such as energy transfer, hormone regulation, neurotransmitter regulation, cellular communication, and protein post-translational modification1,2,3,4. Untargeted metabolomics provides a global view of numerous metabolites5,6. With advances in mass spectrometry and chromatography technologies, the throughput of metabolome MS/MS spectra has increased rapidly in recent years7,8,9,10,11. To identify metabolites from these huge datasets, various annotation software tools have been developed11, such as MZmine12, MS-FINDER13, CFM-ID14, MetFrag15, and SLAW16. However, these identifications often contain many false positives. The reasons include: (1) MS/MS spectra contain random noise, which may mislead peak matching. (2) Isomers and differences in fragmentation energies cause multiple spectral fingerprints and thus increase the size of the reference library. (3) The quality of reference libraries varies, and a proper standard for building a good reference spectral library is needed. Therefore, systematic false discovery rate (FDR) control for untargeted metabolomics is essential for functional metabolome research7,8,9,17.

Both the Empirical Bayes approach and the Target-Decoy strategy have been used to address FDR control. Kerstin Scheubert et al. showed that the Target-Decoy strategy with a decoy database generated by a fragmentation tree-based method is the best method for FDR control9. Xusheng Wang et al. designed a decoy generation method based on the octet rule in chemistry and improved the precision of FDR estimation17. Generating the decoy database from the spectral library was shown to give better performance18. Here, we improved the spectral library-based method and developed a software tool called XY-Meta18 that can further improve the precision of FDR estimation. It uses the existing reference spectral library to generate a decoy library for FDR control under the Target-Decoy scheme. XY-Meta implements its own spectral matching and cosine similarity algorithms. It provides conventional and iterative search modes. In the FDR assessment step, it supports Target-Decoy concatenated and separated modes. For better flexibility, XY-Meta also accepts external decoy libraries.
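To make the Target-Decoy idea concrete, the following is a minimal Python sketch of FDR estimation in concatenated mode. It is a generic illustration and not the XY-Meta implementation: the Match class, the function names, and the estimator (decoy hits divided by target hits at or above each score) are assumptions chosen for clarity.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Best library hit for one query spectrum (hypothetical structure)."""
    spectrum_id: str
    score: float     # similarity score, higher is better
    is_decoy: bool   # True if the best hit came from the decoy library


def estimate_fdr(matches):
    """Return (match, fdr) pairs sorted by descending score.

    The FDR at each position is estimated as the number of decoy hits
    divided by the number of target hits among all matches scoring at
    or above that position, capped at 1.0.
    """
    ranked = sorted(matches, key=lambda m: m.score, reverse=True)
    results = []
    decoys = 0
    targets = 0
    for match in ranked:
        if match.is_decoy:
            decoys += 1
        else:
            targets += 1
        fdr = decoys / targets if targets else 1.0
        results.append((match, min(fdr, 1.0)))
    return results


def accepted_targets(matches, threshold=0.01):
    """Target hits ranked at or above the lowest-scoring position
    whose estimated FDR is still within the threshold."""
    scored = estimate_fdr(matches)
    cutoff = -1
    for index, (_, fdr) in enumerate(scored):
        if fdr <= threshold:
            cutoff = index
    return [m for m, _ in scored[:cutoff + 1] if not m.is_decoy]
```

In separated mode, targets and decoys would be searched and ranked independently; the sketch above only covers the case where both libraries compete in a single ranked list.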

Peak detection and quantification of metabolites is also an important step of untargeted metabolome analysis. Peak detection is the main method for metabolome identification. In general, the accuracy of metabolite peak detection is affected by multiple factors, such as mass spectrometry noise signals, low abundance of metabolites, contaminants, and degradation products of metabolites20. When the number of samples is too large or the liquid chromatography column is replaced during an untargeted metabolome experiment, remarkable batch effects may appear, which is a major challenge for metabolome quantitation21,22,23. Currently, software such as XCMS24, Workflow4Metabolomics25, iMet-Q26, and metaX19 can perform peak detection and quantitation of the untargeted metabolome, but we suggest that the metaX pipeline is more complete and easier to use. Here, we demonstrate identification and FDR control for the publicly available dataset msv000084112 using XY-Meta, and peak detection and quantification of metabolites using metaX. This workflow requires only two groups, each with at least two samples. MS/MS spectra data are needed, regardless of the mass spectrometer platform, ionization mode, charge mode, and sample type, and the workflow supports sample-based and peak-based normalization. Following this example, researchers can perform metabolomics identification and quantification in an easy-to-handle way. Using this pipeline requires R programming capability. To help researchers without any programming knowledge, we also developed a cloud analysis platform for metabolomics analysis, which is demonstrated in Supplementary Material 5.

Protocol

1. Prepare metabolomics datasets for analysis

NOTE: In this demonstration, we use metabolomics datasets without QC samples. Data for case and control groups are needed. For demonstration, we use a public dataset from the GNPS database27.

  1. Go to the webpage https://gnps.ucsd.edu/ProteoSAFe/static/gnps-splash.jsp. Click Browse Datasets.
  2. Search the keyword "msv000084112" in the Title column. Click the dataset ID number for details and download the dataset using FTP.
  3. Put the raw data in the folder /msv000084112.
    NOTE: This dataset was acquired using C18 RP-UHPLC on a Q Exactive platform in positive mode. It contains urine sample data from a cohort with an uncharacterized metabolic disease, including 33 samples from healthy people, 12 blank samples, two mixed samples, and 82 samples from patients28 (Supplementary Material 8). To demonstrate the workflow, we randomly chose six samples from healthy people (NH) as the control group and six samples from patients (NT) as the case group.

2. Data format conversion

NOTE: If the dataset is raw data generated directly by the mass spectrometer, it is usually in .raw, .wiff, or .cdf format and should be converted to mzXML and mgf formats. Here, we use the msconvert tool in the ProteoWizard29 package for the format conversion.

  1. Download ProteoWizard from https://proteowizard.sourceforge.io/download.html and install it.
  2. Convert the data format using msconvert.exe under the ProteoWizard installation path.
    1. Convert the raw data to mzXML format and store them in the /mzXML folder:
       /msconvert.exe /raw/*.raw -o /raw/mzXML/ --filter "peakPicking true [1,2]" --filter "zeroSamples removeExtra" --mzXML --zlib --mz64 --filter "msLevel 1-2" --filter "titleMaker <RunId>.<ScanNumber>.<ScanNumber>.<ChargeState>"
    2. Convert the raw or mzXML data to mgf format and store them in the /mgf folder:
       /msconvert.exe /msv000084112/*.raw -o /msv000084112/mgf/ --filter "peakPicking true [1,2]" --filter "zeroSamples removeExtra" --mgf --mz64 --filter "msLevel 1-2" --filter "titleMaker <RunId>.<ScanNumber>.<ScanNumber>.<ChargeState>"
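When many raw files need converting, the msconvert calls can be assembled programmatically. The following Python sketch builds the argument list for one file using the filters given in the steps above, with --mzXML selected for the mzXML output so that the flag matches the stated target format. The function name and the example paths are hypothetical, and the sketch only constructs the arguments rather than running the tool.

```python
def msconvert_args(raw_file, out_dir, out_format):
    """Build the msconvert argument list for a single raw file.

    out_format is "mzXML" or "mgf", the two formats used in this protocol.
    Passing arguments as a list avoids shell quoting problems with the
    filter strings, which contain spaces and angle brackets.
    """
    if out_format not in ("mzXML", "mgf"):
        raise ValueError(f"unsupported output format: {out_format}")
    args = [
        "msconvert.exe", str(raw_file),
        "-o", str(out_dir),
        "--filter", "peakPicking true [1,2]",
        "--filter", "zeroSamples removeExtra",
        f"--{out_format}",
    ]
    if out_format == "mzXML":
        # zlib compression is only requested for the mzXML output above
        args.append("--zlib")
    args += [
        "--mz64",
        "--filter", "msLevel 1-2",
        "--filter", "titleMaker <RunId>.<ScanNumber>.<ScanNumber>.<ChargeState>",
    ]
    return args


# Example (paths are placeholders; adjust to the local installation):
#   import subprocess
#   subprocess.run(msconvert_args("/msv000084112/sample.raw",
#                                 "/msv000084112/mgf/", "mgf"), check=True)
```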

3. Prepare the reference spectral library for the metabolites

NOTE: XY-Meta supports reference spectral libraries only in mgf format.

  1. Go to the webpage https://gnps.ucsd.edu/ProteoSAFe/libraries.jsp. Search the keyword “NIST” to find the item. Click View for the details and download the library.
    NOTE: The GNPS Public Spectral Libraries collect many metabolite libraries, organized by type, origin, species, and collection mode. Although only a small fraction of these libraries were generated using standard materials, they are usually sufficient for most fundamental research.
  2. Put the downloaded library GNPS-NIST14-MATCHES.mgf into the /database folder.
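Before searching, it can be useful to check that the downloaded library parses cleanly. The following Python sketch reads spectra from mgf text, assuming the common mgf layout of BEGIN IONS and END IONS delimiters, KEY=VALUE header lines, and "m/z intensity" peak lines. The function name is hypothetical, and the exact header keys present in GNPS-NIST14-MATCHES.mgf are not assumed.

```python
def read_mgf(lines):
    """Yield one dict per spectrum from an iterable of mgf text lines.

    Each dict has "params" (header keys mapped to string values) and
    "peaks" (a list of (m/z, intensity) tuples).
    """
    spectrum = None
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None and line:
            if line[0].isdigit():
                # Peak line: m/z followed by intensity
                parts = line.split()
                if len(parts) >= 2:
                    spectrum["peaks"].append((float(parts[0]), float(parts[1])))
            elif "=" in line:
                # Header line such as PEPMASS=181.07
                key, value = line.split("=", 1)
                spectrum["params"][key.strip().upper()] = value.strip()


# Example usage (path taken from the protocol):
#   with open("/database/GNPS-NIST14-MATCHES.mgf", encoding="utf-8") as handle:
#       n_spectra = sum(1 for _ in read_mgf(handle))
```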

4. Metabolites identification and FDR control

  1. Download XY-Meta (Windows version). Find the parameter configuration file parameter.default under the /XY-Meta-Win/config/ folder. Change its content according to Supplementary Material 1.
    NOTE: In solution, metabolites often form adducts with anions or cations, which leads to a mass shift of the parent ions. Therefore, it is necessary to set the types of adducts. We provide adduct lists for ion exchange columns and reversed-phase analytical columns in positive and negative charge modes in the /adduct folder. Users may also edit their own adduct list according to their research project. The adduct list should be in the same format as the provided lists.
  2. Perform the metabolite identification and FDR control using XY-Meta:
     XY-Meta.exe -S /XY-Meta-Win/config/parameter.default -D /msv000084112/pos_wt-1_a.mgf -R /database/GNPS-NIST14-MATCHES.mgf
    NOTE: XY-Meta does not support wildcards in parameters. Therefore, each mgf file must be processed with its own command. For a large number of files, a batch file is recommended.
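Because each mgf file needs its own command, the per-file calls can be generated in a loop instead of being typed by hand. The following Python sketch builds one XY-Meta command per file using the -S, -D, and -R options shown in the step above. The function name is hypothetical, and the sketch only builds the argument lists; running them is left to the caller.

```python
def xy_meta_commands(mgf_files, parameter_file, reference_library,
                     exe="XY-Meta.exe"):
    """Return one XY-Meta argument list per mgf file.

    XY-Meta does not expand wildcards, so every input file gets its
    own command with the same parameter file and reference library.
    """
    commands = []
    for mgf in sorted(str(path) for path in mgf_files):
        commands.append([
            exe,
            "-S", str(parameter_file),
            "-D", mgf,
            "-R", str(reference_library),
        ])
    return commands


# Example usage (paths taken from the protocol; adjust to local paths):
#   import subprocess
#   from pathlib import Path
#   files = Path("/msv000084112/mgf").glob("*.mgf")
#   for cmd in xy_meta_commands(files,
#                               "/XY-Meta-Win/config/parameter.default",
#                               "/database/GNPS-NIST14-MATCHES.mgf"):
#       subprocess.run(cmd, check=True)
```

Running the files one after another, rather than in parallel, also avoids several processes writing to the same decoy library at the same time.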

5. Differential analysis

NOTE: metaX is an open-source R package. Please install it according to the guide at https://github.com/wenbostar/metaX. This analysis requires 8 GB of RAM.

  1. Edit a sampleList.txt file to specify the sample and its corresponding MS data. Please refer to Supplementary Material 2.
    NOTE: metaX supports quantitative analysis for the datasets with QC samples. When using QC samples, please modify the class property to NA for QC samples.
  2. Create the /output folder to store the results of the quantitative analysis. Use R to run the script in Supplementary Material 3, which uses metaX to quantify the MOCK and WT groups.
    NOTE: Before running the script in Supplementary Material 3, modify the paths in the script to the actual local paths.
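The sample list can also be generated rather than edited by hand. The following Python sketch writes a tab-delimited table with one row per sample. The column layout used here (sample, batch, class, order) and the single batch value are assumptions and should be checked against Supplementary Material 2; the sample names are placeholders. Only the NA class for QC samples comes from the note in step 5.1.

```python
def sample_list_text(groups, qc_samples=()):
    """Build tab-delimited sample list text.

    groups maps a class label (for example "NH" or "NT") to a list of
    sample names. QC samples, if any, get the class NA. The order
    column simply numbers the rows as supplied and should be replaced
    with the true injection order. Column names are an assumed layout.
    """
    rows = ["sample\tbatch\tclass\torder"]
    order = 1
    for class_name, samples in groups.items():
        for sample in samples:
            rows.append(f"{sample}\t1\t{class_name}\t{order}")
            order += 1
    for sample in qc_samples:
        rows.append(f"{sample}\t1\tNA\t{order}")
        order += 1
    return "\n".join(rows) + "\n"


# Example usage with placeholder sample names:
#   text = sample_list_text({"NH": ["NH_1", "NH_2"], "NT": ["NT_1", "NT_2"]})
#   with open("sampleList.txt", "w", encoding="utf-8") as handle:
#       handle.write(text)
```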

6. Integration of qualitative and quantitative results

  1. Run the R script in Supplementary Material 4 to annotate the peaks from the quantitative analysis with the metabolite identifications from the qualitative analysis.
    NOTE: Before running the script in Supplementary Material 4, please modify the paths in the script to your actual local paths.

Representative Results

The raw data of msv000084112 were converted to mgf files by msconvert.exe (Supplementary Material 6).

XY-Meta generated the GNPS-NIST14-MATCHES_Decoy.mgf file under the /database folder. This is the decoy library generated from the original reference spectral library GNPS-NIST14-MATCHES.mgf, and it can be reused. To reuse it, set decoy_pattern to 1 in the parameter.default file and set decoyinput to the absolute path of the decoy library. The identification results were generated under the /mgf folder (with the suffix .meta) and include spectral matching scores, FDR, m/z of the metabolites, retention time, and the names of the metabolites (Supplementary Material 7).

The quantitative results from metaX were stored in the /output folder. The overall quantitative distributions of NH and NT are similar, with low fluctuation of the mean values (Figure 1A). Only a small fraction of values were missing: only 3.39% of the metabolites had more than 30% missing values (Figure 1B). metaX remarkably increased the proportion of metabolites with CV ≤ 0.3 (Figure 1C). The box plots were stored in the /metaX_box folder, and the elution profiles in the /metaX_eic folder. The metabolite peaks were recorded in metaX-feature.txt. The quantitative values of the metabolites identified in both groups and the differential analysis results were stored in metaX_peaks.txt (Figure 1D). Applying the thresholds |LogFC| ≥ 1 and p-value < 0.05, 342 metabolites were differentially detected, with 206 up-regulated and 136 down-regulated (Supplementary Material 9).
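The differential thresholds above can be expressed as a small filter. The following Python sketch classifies peaks by |LogFC| ≥ 1 and p-value < 0.05 and tallies the directions. The function names and the input layout (pairs of LogFC and p-value) are hypothetical and do not reflect the column names in metaX_peaks.txt.

```python
from collections import Counter


def classify_peak(log_fc, p_value, fc_cutoff=1.0, p_cutoff=0.05):
    """Label one peak as "up", "down", or "unchanged".

    A peak is differential when |LogFC| >= fc_cutoff and
    p_value < p_cutoff; the sign of LogFC gives the direction.
    """
    if p_value < p_cutoff and abs(log_fc) >= fc_cutoff:
        return "up" if log_fc > 0 else "down"
    return "unchanged"


def tally_directions(peaks):
    """Count labels over an iterable of (log_fc, p_value) pairs."""
    return Counter(classify_peak(log_fc, p) for log_fc, p in peaks)
```

For example, a peak with LogFC = 1.0 and p-value = 0.049 is counted as up-regulated, whereas the same LogFC with p-value = 0.05 is not, because the p-value threshold is strict.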

We annotated the metaX detected peaks using the FDR < 0.01 identifications. If a peak can be annotated by multiple metabolites, we took the one with the highest spectrum matching score as the final annotation. Using these criteria, we annotated six differential metabolite peaks (Figure 2).
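The selection rule described above (keep identifications with FDR < 0.01, and for each peak keep the one with the highest spectrum matching score) can be sketched in Python as follows. This is an illustration and not the script in Supplementary Material 4: the dictionary keys peak_id, name, score, and fdr are hypothetical, and the sketch assumes each identification has already been linked to a peak.

```python
def annotate_peaks(identifications, fdr_cutoff=0.01):
    """Return the best identification per peak.

    identifications is an iterable of dicts with the hypothetical keys
    "peak_id", "name", "score", and "fdr". Identifications with
    fdr >= fdr_cutoff are discarded; among the remainder, the highest
    score wins for each peak (the first one seen wins on ties).
    """
    best = {}
    for ident in identifications:
        if ident["fdr"] >= fdr_cutoff:
            continue
        current = best.get(ident["peak_id"])
        if current is None or ident["score"] > current["score"]:
            best[ident["peak_id"]] = ident
    return best
```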

Figure 1. Quantitative analysis by metaX. (A) Box plot of quantified metabolites of all samples. (B) Histogram of missing value distribution. (C) PCA plot of two group samples. (D) Venn diagram of differentially detected metabolites from three statistical test methods.

Figure 2. Retention time (RT) and m/z distribution of all annotated metabolites. Red dots represent the significant and differentially detected metabolites.

Supplementary Material 1: The parameter file of XY-Meta.

Supplementary Material 2: Grouping information sheet of samples for metaX.

Supplementary Material 3: The script for integrating the workflow of XY-Meta and metaX.

Supplementary Material 4: The script for annotating the peaks using metabolome identifications.

Supplementary Material 5: A complete workflow for metabolome analysis using the cloud platform.

Supplementary Material 6: An mgf file converted from msconvert for a sample data of msv000084112.

Supplementary Material 7: An identification result table from XY-Meta for a sample data of msv000084112.

Supplementary Material 8: The cohort clinical information sheet of msv000084112.

Supplementary Material 9: Identification list of all metabolites and differential analysis results of all metabolite peaks.

Discussion

The FDR control of untargeted metabolites has been a great challenge. Here, we demonstrated a complete pipeline of large-scale untargeted metabolomics analysis (qualitative and quantitative) with FDR control. This effectively reduces the false positives, which are very common in MS analysis.

Preparing an appropriate reference spectral library for your study is a key point. Successful and sensitive MS/MS identification requires not only proper matching algorithms but also proper reference spectral libraries. The applicability of public spectral libraries is limited for the following reasons: (1) Many public spectral libraries do not include complete metabolite lists. (2) The spectra in public spectral libraries originate from various MS instruments and/or various fragmentation conditions30,31. Therefore, we suggest collecting spectra of standard metabolites on the same instrument and under the same fragmentation conditions to construct an "exclusive" spectral library. These conditions should also be maintained during the actual measurements. Additionally, when modifying the parameter file, the tolerances of the precursor ions and the fragment ions should match the parameters of the instrument. Normally, the precursor tolerance should be between 10 ppm and 20 ppm, and the fragment tolerance should be between 0.01 Da and 0.5 Da. For this dataset, the instrument parameters are unknown, but a fragment tolerance of 0.05 Da is a conservative choice that allows this workflow to perform normally.
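Because the precursor tolerance is relative (ppm) while the fragment tolerance is absolute (Da), it helps to see what each means at a given m/z. The following Python sketch converts a ppm tolerance into Da and checks whether an observed m/z falls within a tolerance of a reference m/z. The function names are hypothetical and are not XY-Meta parameters.

```python
def ppm_to_da(mz, ppm):
    """Absolute width in Da of a ppm tolerance at the given m/z."""
    return mz * ppm / 1e6


def within_tolerance(observed_mz, reference_mz, tolerance, unit="ppm"):
    """True if observed_mz lies within the tolerance of reference_mz.

    unit is "ppm" (relative to reference_mz) or "Da" (absolute).
    """
    if unit == "ppm":
        limit = ppm_to_da(reference_mz, tolerance)
    elif unit == "Da":
        limit = tolerance
    else:
        raise ValueError(f"unknown unit: {unit}")
    return abs(observed_mz - reference_mz) <= limit
```

For example, a 10 ppm precursor tolerance at m/z 500 corresponds to 0.005 Da, so the 0.05 Da fragment tolerance used for this dataset is ten times wider at that m/z.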

Users may still receive various error messages when they run this pipeline. Common mistakes include erroneous input file path, missing parameter file, and file access conflict (e.g., access denied by the operating system and simultaneous accessing of the same file).

Note that this workflow is currently only applicable to the targeted and untargeted metabolomic analysis of small molecules of less than 1,000 Da, and cannot be used to analyze the metabolomes of macromolecules such as glycan chains or lipid chains. In addition, neither data-independent acquisition (DIA) data nor ion mobility data are suitable for analysis with this workflow. This workflow does not support the use of m/z and retention time of metabolites to annotate peak detection results and only supports differential analysis of two groups of data with more than two samples.

For a long time, identification results of the untargeted metabolome dominated by peak detection technology have tended to contain many false positives, mainly due to the large number of metabolite isomers and different ionic adduct forms. Comparing the MS/MS spectra of metabolites with the reference spectra of known metabolites can resolve the structure of the metabolites and distinguish isomers32. However, a metabolite cannot be identified if its reference spectrum is not publicly or commercially available7. Therefore, building a reliable library of metabolite reference spectra is a big challenge. Low-quality reference spectra and reference spectra of structurally similar metabolites lead to random matching of experimental spectra. Therefore, FDR control of identification results is necessary to ensure confident identifications. By providing the input data required by the protocol, users can use this pipeline to automatically identify the metabolome with FDR control and to perform quantitation and differential analysis. This is convenient and economical for many researchers, especially beginners.

Disclosures

The authors have nothing to disclose.

Acknowledgements

This work is supported by National Key Research and Development Program (2018YFC0910200/2017YFA0505001) and the Guangdong Key R&D Program (2019B020226001).

Materials

Name | Source | Version | URL
GNPS | open source | n/a | https://gnps.ucsd.edu/ProteoSAFe/static/gnps-splash.jsp
XY-Meta | open source | n/a | https://github.com/DLI-ShenZhen/XY-Meta
metaX | open source | n/a | https://github.com/wenbostar/metaX
ProteoWizard | Free Download | 3.0.22116.18c918b-x86_64 | https://proteowizard.sourceforge.io/download.html
CHI.Client | Free Download | ndp48-x86-x64-allos-enu | http://www.chi-biotech.com/technology.html?ty=ypt

References

  1. Misra, B. B., Fahrmann, J. F., Grapov, D. Review of emerging metabolomic tools and resources: 2015-2016. Electrophoresis. 38 (18), 2257-2274 (2017).
  2. Idle, J. R., Gonzalez, F. J. Metabolomics. Cell Metabolism. 6 (5), 348-351 (2007).
  3. Fiehn, O., Town, C. Metabolomics: the link between genotypes and phenotypes. Functional Genomics. 155-171 (2002).
  4. Town, C. Functional Genomics. (2002).
  5. Dettmer, K., Aronov, P. A., Hammock, B. D. Mass spectrometry-based metabolomics. Mass Spectrometry Reviews. 26 (1), 51-78 (2007).
  6. Vinayavekhin, N., Saghatelian, A. Untargeted metabolomics. Current Protocols in Molecular Biology. 1-24 (2010).
  7. Chaleckis, R., Meister, I., Zhang, P., Wheelock, C. E. Challenges, progress and promises of metabolite annotation for LC-MS-based metabolomics. Current Opinion in Biotechnology. 55, 44-50 (2019).
  8. Palmer, A., et al. FDR-controlled metabolite annotation for high-resolution imaging mass spectrometry. Nature Methods. 14 (1), 57-60 (2017).
  9. Scheubert, K., et al. Significance estimation for large scale metabolomics annotations by spectral matching. Nature Communications. 8 (1), 1494 (2017).
  10. Schrimpe-Rutledge, A. C., Codreanu, S. G., Sherrod, S. D., McLean, J. A. Untargeted metabolomics strategies-challenges and emerging directions. Journal of the American Society for Mass Spectrometry. 27 (12), 1897-1905 (2016).
  11. Blaženović, I., Kind, T., Ji, J., Fiehn, O. Software tools and approaches for compound identification of LC-MS/MS data in metabolomics. Metabolites. 8 (2) (2018).
  12. Katajamaa, M., Miettinen, J., Oresic, M. MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics. 22 (5), 634-636 (2006).
  13. Tsugawa, H., et al. Hydrogen rearrangement rules: computational MS/MS fragmentation and structure elucidation using MS-FINDER software. Analytical Chemistry. 88 (16), 7946-7958 (2016).
  14. Wang, F., et al. CFM-ID 4.0: More accurate ESI-MS/MS spectral prediction and compound identification. Analytical Chemistry. 93 (34), 11692-11700 (2021).
  15. Ruttkies, C., Schymanski, E. L., Wolf, S., Hollender, J., Neumann, S. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. Journal of Cheminformatics. 8, 3 (2016).
  16. Delabriere, A., Warmer, P., Brennsteiner, V., Zamboni, N. SLAW: A scalable and self-optimizing processing workflow for untargeted LC-MS. Analytical Chemistry. 93 (45), 15024-15032 (2021).
  17. Wang, X., et al. Target-decoy-based false discovery rate estimation for large-scale metabolite identification. Journal of Proteome Research. 17 (7), 2328-2334 (2018).
  18. Li, D., et al. XY-Meta: a high-efficiency search engine for large-scale metabolome annotation with accurate FDR estimation. Analytical Chemistry. 92 (8), 5701-5707 (2020).
  19. Wen, B., Mei, Z., Zeng, C., Liu, S. metaX: a flexible and comprehensive software for processing metabolomics data. BMC Bioinformatics. 18 (1), 183 (2017).
  20. Aberg, K. M., Torgrip, R. J. O., Kolmert, J., Schuppe-Koistinen, I., Lindberg, J. Feature detection and alignment of hyphenated chromatographic-mass spectrometric data. Extraction of pure ion chromatograms using Kalman tracking. Journal of Chromatography A. 1192 (1), 139-146 (2008).
  21. Liu, Q., et al. Addressing the batch effect issue for LC/MS metabolomics data in data preprocessing. Scientific Reports. 10 (1), 13856 (2020).
  22. Han, W., Li, L. Evaluating and minimizing batch effects in metabolomics. Mass Spectrometry Reviews. 41 (3), 421-442 (2022).
  23. Fei, F., Bowdish, D. M. E., McCarry, B. E. Comprehensive and simultaneous coverage of lipid and polar metabolites for endogenous cellular metabolomics using HILIC-TOF-MS. Analytical and Bioanalytical Chemistry. 406 (15), 3723-3733 (2014).
  24. Smith, C. A., Want, E. J., O’Maille, G., Abagyan, R., Siuzdak, G. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Analytical Chemistry. 78 (3), 779-787 (2006).
  25. Giacomoni, F., et al. Workflow4Metabolomics: a collaborative research infrastructure for computational metabolomics. Bioinformatics. 31 (9), 1493-1495 (2015).
  26. Chang, H.-Y., et al. iMet-Q: A user-friendly tool for label-free metabolomics quantitation using dynamic peak-width determination. PLoS One. 11 (1), e0146112 (2016).
  27. Wang, M., et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nature Biotechnology. 34 (8), 828-837 (2016).
  28. Schmid, R., et al. Ion identity molecular networking for mass spectrometry-based metabolomics in the GNPS environment. Nature Communications. 12 (1), 3832 (2021).
  29. Kessner, D., Chambers, M., Burke, R., Agus, D., Mallick, P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics. 24 (21), 2534-2536 (2008).
  30. Johnson, S. R., Lange, B. M. Open-access metabolomics databases for natural product research: present capabilities and future potential. Frontiers in Bioengineering and Biotechnology. 3, 22 (2015).
  31. Horai, H., et al. MassBank: a public repository for sharing mass spectral data for life sciences. Journal of Mass Spectrometry: JMS. 45 (7), 703-714 (2010).
  32. Rawlinson, C., et al. Hierarchical clustering of MS/MS spectra from the firefly metabolome identifies new lucibufagin compounds. Scientific Reports. 10 (1), 6043 (2020).

Cite This Article
Li, D., Liang, J., Zhang, Y., Zhang, G. An Integrated Workflow of Identification and Quantification on FDR Control-Based Untargeted Metabolome. J. Vis. Exp. (187), e63625, doi:10.3791/63625 (2022).