This protocol presents an optimized workflow that combines fast, efficient sample preparation with the analysis of a large number of samples. In addition, it provides a step-by-step guide to reducing analytical variation for the high-throughput evaluation of metabolic GWAS.
Both gas chromatography-mass spectrometry (GC-MS) and liquid chromatography-mass spectrometry (LC-MS) are widely used metabolomics approaches to detect and quantify hundreds of thousands of metabolite features. However, applying these techniques to a large number of samples is subject to more complex interactions, particularly in genome-wide association studies (GWAS). This protocol describes an optimized metabolic workflow that combines fast, efficient sample preparation with the analysis of a large number of samples for legume crop species. The extraction method, slightly modified here, was initially developed for the analysis of plant and animal tissues and is based on extraction in a methyl tert-butyl ether:methanol solvent to capture both polar and lipid metabolites. In addition, we provide a step-by-step guide for reducing analytical variation, which is essential for the high-throughput evaluation of metabolic variance in GWAS.
Large-scale "omics" approaches have enabled the analysis of complex biological systems1,2,3 and a further understanding of the link between genotypes and the resulting phenotypes4. Metabolomics using ultra-high-performance liquid chromatography-mass spectrometry (UHPLC-MS) and GC-MS has enabled the detection of a plethora of metabolite features, of which only some are annotated to a certain degree, resulting in a high proportion of unknown metabolites. Complex interactions can be explored by combining large-scale metabolomics with the underlying genotypic variation of a diverse population5. However, handling large sample sets is inherently associated with analytical variation, which distorts the evaluation of metabolic variance in downstream processes. Specifically, the major sources of analytical variation are machine performance and instrumental drift over time6. The integration of batch-to-batch variation is challenging and especially problematic when analyzing large-scale structured plant populations. Multiple normalization procedures have been suggested to correct for non-biological variation, e.g., the use of internal, external, and isotope-labeled internal standards to correct for analytical errors, each of which is inherently associated with known problems and pitfalls7,8,9,10.
Beyond analytical variation, the choice of extraction protocol generally depends on the analytical method. Ultimately, phase separation-based extraction methods are desirable because they reduce material and labor costs and avoid the need for several aliquots of the same sample for different analytical processes. These methods were first introduced using chloroform:methanol/water solvents to fractionate polar and hydrophobic compounds11.
This protocol describes a fast high-throughput pipeline for a multi-omics platform to profile both polar metabolites and lipids in legume species. Further, it shows how those datasets can be appropriately corrected for analytical variation and normalized before integrating genotypic information to detect metabolite quantitative trait loci (QTL) by performing GWAS.
1. Experimental design and plant cultivation
NOTE: Set up the experiment according to the experimental hypothesis. For example, using a large-scale GWAS population decreases the need for multiple replicates, as statistical testing will be performed based on the haplotypes of all the individual SNPs instead of the accession. In contrast, multiple replicates are indispensable in other experimental approaches. The following points must be considered while preparing the experiment.
2. Preparation of biological plant material
3. Extraction reagents
4. Sample extraction
5. Analysis of lipids using UHPLC-MS
6. Analysis of polar and semi-polar metabolites using UHPLC-MS
7. Analysis of derivatized metabolites using GC-MS17,18
NOTE: The analysis of derivatized metabolites is based on a previously described protocol17. Handle all derivatization reagents in the fume hood. Ensure that N-methyl-N-(trimethylsilyl)trifluoroacetamide (MSTFA) does not come into contact with water or humidity.
8. Chromatogram processing and compound annotation
9. Normalization of large-scale metabolomics dataset
10. Genome-wide association studies (GWAS)32
11. QTL detection
Successful metabolomics GWAS experiments should begin with a proper experimental design, followed by sample collection, extraction, data acquisition, and processing, as illustrated in Figure 1. In this protocol, the MTBE method15 was used to extract and analyze hundreds of metabolites belonging to several compound classes. Chromatography depends highly on the properties of the column used as well as on the elution buffer mixtures. Figure 2 shows chromatograms of QC samples, indicating the elution pattern of some major lipid classes in this analytical system. The applied gradients for each platform are given in Table 1. Strong emphasis was placed on handling systematic errors, which are inherent to large-scale metabolomics experiments. For demonstration, we analyzed lipidomic data across several common bean species. Supplemental Table 1 provides the extracted raw lipidomic data obtained after chromatogram processing using the software indicated in the Table of Materials. Following this protocol enabled us to circumvent major issues in dealing with omics data, especially while handling large sample sets. The normalization procedure yields an accurate correction of batch-wise analytical errors, as demonstrated in Figure 3. Although increasing the number of QC samples would increase the power of the normalization, this is not always feasible because of cost and time constraints. For high-throughput metabolomics GWAS with non-targeted metabolic features, it is essential to illustrate the large number of trait-marker associations appropriately. A pleiotropic map38 combining multiple GWAS results could be used to highlight the genomic regions to which several traits are linked (Figure 4).
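As a minimal illustration of the idea behind such a map, the following Python sketch groups significant SNPs into fixed-size genomic bins and reports the bins that are hit by more than one trait. It is not the fujiplot implementation named in the Table of Materials; the function name, the 1 Mb bin size, and the trait names are hypothetical and serve only to show the bookkeeping.

```python
from collections import defaultdict


def pleiotropic_bins(hits, bin_size=1_000_000, min_traits=2):
    """Find genomic bins associated with several traits.

    hits:       iterable of (trait, chromosome, position_bp) tuples, one per
                significant SNP taken from the individual GWAS results
    bin_size:   bin width in base pairs (hypothetical default of 1 Mb)
    min_traits: minimum number of distinct traits for a bin to be reported
    """
    traits_per_bin = defaultdict(set)
    for trait, chrom, pos in hits:
        # Integer division assigns each SNP to a bin on its chromosome.
        traits_per_bin[(chrom, pos // bin_size)].add(trait)
    return {
        bin_key: sorted(traits)
        for bin_key, traits in traits_per_bin.items()
        if len(traits) >= min_traits
    }
```

Calling the function with the significant SNPs of all traits returns a dictionary keyed by (chromosome, bin index); regions shared across compound classes would then be candidates for highlighting.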
Figure 1: Flowchart of the metabolomics-based GWAS in plants. Several steps starting from the experimental design up to the detection of QTL are shown in the left panel. In the right panel, multiple figures are shown to support several steps mentioned in the left panel. Starting from the right top, (1) a suggested sequence of samples is shown for LC-MS, (2) pre- and post-normalized score plots of PCA, including a representative feature distribution pre- and post-processing, with red indicating QC sample intensities, and (3) a Manhattan plot with significant associations for which LD and haplotype distributions were generated. Abbreviations: GWAS = genome-wide association studies; QTL = quantitative trait loci; PCA = principal component analysis; QC = quality control; LD = linkage disequilibrium; MS = mass spectrometry; LC-MS = liquid chromatography-mass spectrometry; GC-MS = gas chromatography-mass spectrometry; LOESS = locally estimated scatterplot smoothing; MLM/MLMM = mixed linear model/multi-locus mixed-model.
Figure 2: Chromatogram processing. Two QC chromatograms (base peak; lipid data) from different batches demonstrate the batch-wise variation for certain lipid classes in the pooled QC samples. Four major lipid classes are indicated with their respective elution windows in the in-house LC-MS system. The chromatograms were exported from MzMine21. Abbreviations: QC = quality control; LC-MS = liquid chromatography-mass spectrometry.
Figure 3: Correction of systematic error. Principal component analysis of the acquired lipidomic data before (left, raw data) and after (right, batch LOESS) correction for systematic errors. The lower panels illustrate the distribution of a representative feature (Cluster_00005) over the samples (n = 650) and batches (n = 10) before (left) and after (right) correction for analytical variation. Abbreviations: PCA = principal component analysis; QC = quality control; LOESS = locally estimated scatterplot smoothing.
Figure 4: Pleiotropic map illustrating the combined GWAS results. The pleiotropic map highlights regions in the whole genome that are associated with several traits. The numbers on the outer rings indicate the corresponding chromosomes. Each circlet represents an individual trait with its significantly associated SNPs. The colors represent different compound classes (grey = compound class 1; green = compound class 2; purple = compound class 3; yellow = compound class 4). In the case of inter-compound class associations with the same genomic region, genes are highlighted. The inner grey circle shows the sum of all significant SNPs associated with a specific genomic position. The associations shown in this figure are artificially generated only for illustration. Abbreviations: GWAS = genome-wide association studies; SNPs = single-nucleotide polymorphisms.
UHPLC-MS settings for lipids
Time [min] | Eluent A to B [%]* | Information
0 – 1.00 | 45% A | Eluent A: 1% 1 M NH4-Acetate, 0.1% acetic acid in water (UHPLC grade)
1.00 – 4.00 | lg 45% – 25% A | Eluent B: 1% 1 M NH4-Acetate, 0.1% acetic acid in acetonitrile/2-propanol 7:3 (UHPLC grade)
4.00 – 12.00 | lg 25% – 11% A | Flow rate: 400 µL/min
12.00 – 15.00 | lg 11% – 0% A | Injection volume: 2 µL
15.00 – 19.50 | cw 0% A |
19.50 – 19.51 | 0% – 45% A |
19.51 – 24.00 | eq 45% A |
UHPLC-MS/MS settings for polar and semi-polar metabolites
Time [min] | Eluent A and B [%]* | Information
0 – 1.00 | 99% A | Eluent A: 0.1% formic acid in water (UHPLC grade)
1.00 – 11.00 | lg 99% – 60% A | Eluent B: 0.1% formic acid in acetonitrile (UHPLC grade)
11.00 – 13.00 | lg 60% – 30% A | Flow rate: 400 µL/min
13.00 – 15.00 | lg 30% – 1% A | Injection volume: 3 µL
15.00 – 16.00 | cw 1% A |
16.00 – 17.00 | lg 1% – 99% A |
17.00 – 20.00 | eq 99% A |
GC-MS settings for derivatized metabolites
Time [min] | Temperature [°C] | Information
0 – 2.00 | 85 | Carrier gas: Helium
2.00 – 18.66 | lg 80 – 330 | Flow rate: 2 mL/min
18.66 – 24.66 | cw 330 | Temperature gradient: 15 °C/min
24.66 | rapid cooling | Injection volume: 1 µL
Table 1: Gradient settings for each of the analytical platforms7. Abbreviations: lg = linear gradient; cw = column washing; eq = equilibrate; UHPLC-MS = ultra-high-performance liquid chromatography-mass spectrometry; UHPLC-MS/MS = ultra-high-performance liquid chromatography-tandem mass spectrometry; GC-MS = gas chromatography-mass spectrometry. * = percentage value corresponds to eluent A; remaining percentage value corresponds to eluent B.
Supplemental Table 1: Raw lipidomics data. Indicates the peak intensities for each of the detected clusters over each sample.
Both GC-MS and LC-MS are widely used tools for profiling complex mixtures of various metabolite classes. Handling large datasets with these tools is inherently associated with non-biological variation, e.g., analytical variation, which interferes with and biases the interpretation of the results. This protocol presents a robust and high-throughput extraction pipeline for comprehensive metabolic profiling to eliminate variation of non-biological origin and conduct large-scale "omics" studies. The volumes and concentrations used in this protocol were adjusted for different tissues of legume species. However, these parameters can be slightly modified and used for large-scale metabolic samples from other plant species as well.
The previously described MTBE-based extraction15 can be used to analyze derivatized metabolites, semi-polar metabolites, and lipids. It can be expanded to protein and plant hormone extractions39, which were beyond the scope of this protocol. Other extraction protocols rely on dichloromethane:ethanol mixtures40,41. Among these, the MTBE:methanol extraction protocol provides a favorable and less hazardous alternative to the existing chloroform-based extraction protocols42 and does not result in a protein pellet as an interphase between the polar and lipid phases. Furthermore, MTBE methods have already been used in several studies for various biological samples43,44,45.
This protocol discusses several crucial steps that might introduce variation while handling a large number of samples, e.g., harvesting12,13, extraction14, and randomization46. Furthermore, additional issues that have not been discussed in this protocol must be considered to ensure high-quality metabolomic data, e.g., matrix effects and ion suppression14.
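The randomization step mentioned above can be sketched as follows. This is a minimal Python illustration, not the sequence used in the protocol: the batch size, the spacing of the pooled QC injections, and the function name are hypothetical placeholders that would have to be adapted to the actual run.

```python
import random


def build_run_sequence(sample_ids, batch_size=65, qc_every=20, seed=1):
    """Return a list of batches, each a list of injection labels.

    Samples are shuffled once over the whole experiment and then split into
    batches. Every batch starts and ends with a pooled QC injection and
    receives an additional QC after every `qc_every` samples.
    All default values are assumptions made for illustration.
    """
    rng = random.Random(seed)      # fixed seed keeps the order reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)          # randomize accessions across all batches

    batches = []
    for start in range(0, len(shuffled), batch_size):
        chunk = shuffled[start:start + batch_size]
        batch = ["QC"]             # opening QC
        for i, sample in enumerate(chunk, start=1):
            batch.append(sample)
            if i % qc_every == 0 and i < len(chunk):
                batch.append("QC")  # periodic QC within the batch
        batch.append("QC")         # closing QC
        batches.append(batch)
    return batches
```

With 650 samples and a batch size of 65, the sketch yields ten batches, which matches the sample and batch counts given for Figure 3; the QC spacing remains an arbitrary choice.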
The power of QC-based normalization methods inherently depends on the number of QC samples in each batch. As mentioned earlier, although increasing the number would increase the power, the intra-batch variation of the QCs is relatively marginal compared to the inter-batch variation in these analytical systems, as illustrated in Figure 3. There are other QC-based normalization methods, such as systematic error removal using random forest (SERRF), which has been shown to outperform most other normalization methods, such as batch-wise ratio, normalization using an optimal selection of multiple internal standards (NOMIS), and probabilistic quotient normalization (PQN)47. However, SERRF relies on multiple QC samples in each batch, e.g., every tenth sample, which is not feasible while handling large numbers of samples. The main advantage of QC-based normalization over other data-driven or internal standard-based methods is that it retains the essential biological variation while accommodating unwanted technical variation28. Readers may refer to this review on the handling of variation28.
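The principle of QC-based correction can be illustrated with a short Python sketch for a single feature. Figure 3 refers to a batch-wise LOESS correction; to stay free of external dependencies, the sketch below swaps the LOESS fit for a piecewise-linear interpolation between the QC injections of each batch, so it is a simplification rather than the procedure applied in this protocol. The function names are hypothetical, and strictly positive QC intensities are assumed.

```python
from statistics import median


def _qc_trend(i, qc_idx, values):
    """Piecewise-linear QC trend at injection i (flat outside the QC range)."""
    if i <= qc_idx[0]:
        return values[qc_idx[0]]
    if i >= qc_idx[-1]:
        return values[qc_idx[-1]]
    for left, right in zip(qc_idx, qc_idx[1:]):
        if left <= i <= right:
            frac = (i - left) / (right - left)
            return values[left] + frac * (values[right] - values[left])


def qc_drift_correct(intensities, is_qc, batches):
    """Correct one feature for intra-batch drift and inter-batch offsets.

    intensities: raw intensities of the feature in injection order
    is_qc:       True where the injection is a pooled QC sample
    batches:     batch label of each injection
    Each value is divided by the QC trend of its batch and rescaled to the
    median QC intensity over all batches.
    """
    target = median(v for v, q in zip(intensities, is_qc) if q)
    corrected = list(intensities)
    for batch in set(batches):
        idx = [i for i, label in enumerate(batches) if label == batch]
        qc_idx = [i for i in idx if is_qc[i]]
        if not qc_idx:
            raise ValueError(f"batch {batch!r} contains no QC injection")
        for i in idx:
            corrected[i] = intensities[i] / _qc_trend(i, qc_idx, intensities) * target
    return corrected
```

After correction, the QC injections of all batches share the same level, while the ratio of each biological sample to its neighboring QCs is preserved. A real dataset would apply this feature by feature and would typically use a smoother fit when enough QCs are available.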
One main issue in GWAS is the rate of false positives, which arise mostly from the linkage of causal and non-causal sites48,49. Second, conservative statistical correction approaches, e.g., Bonferroni and FDR, correct for the number of independent tests, which is not equal to the number of assayed SNPs in GWAS because of the linkage between proximate SNPs50,51. Therefore, the actual number of independent tests is often lower. Another way to relax the conservative statistical threshold would be to reduce the number of SNPs tested in GWAS based on linkage decay over defined genomic regions52. The GWAS-integrated high-throughput metabolomics platform described in this protocol has a wide range of applications. In particular, it will facilitate improvements in crop breeding by shifting the metabolite/lipid composition toward industrially and nutritionally desired levels. Overall, over the last decades, metabolomics has provided in-depth insight into the genetic architecture of a plethora of metabolites and the metabolic diversification that occurred during crop domestication, indicating the vast potential of metabolomics-associated breeding53. Molecular biological approaches for downstream QTL validation include the generation of CRISPR/Cas9 mutant lines54, T-DNA insertion lines55, stable and/or transient overexpression lines56, virus-induced gene silencing (VIGS), and ex vivo metabolomics approaches57, alongside the conventional approach of generating F2 cross populations and cross-validation in different populations.
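The effect of the number of tests on the significance threshold discussed above can be made concrete with a small Python sketch. It shows a plain Bonferroni threshold and the Benjamini-Hochberg step-up procedure as one common FDR approach; the function names and all SNP counts in the usage note are hypothetical, and the protocol itself relies on the R packages listed in the Table of Materials.

```python
import math


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold for a family-wise error rate of alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of the p-values rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value passes.
        if p_values[i] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


def neg_log10(p):
    """Convert a p-value to the -log10 scale used in Manhattan plots."""
    return -math.log10(p)
```

For example, with an assumed 100,000 assayed SNPs the Bonferroni threshold is 0.05/100,000, whereas an assumed 20,000 effectively independent tests would give the less stringent 0.05/20,000; on the -log10 scale this lowers the significance line from roughly 6.3 to roughly 5.6.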
By performing the necessary correction for analytical variation as described above, several integrated approaches can be performed in addition to GWAS, such as metabolite-metabolite and metabolite-lipid correlation analyses, correlation analysis with phenomic data to shed light on more complex traits, and/or co-expression analysis to further unravel the basis of biological systems58.
The authors have nothing to disclose.
M.B. is supported by the IMPRS-PMPG 'Primary Metabolism and Plant Growth'. A.R.F. and S.A. acknowledge the financial support of the EU Horizon 2020 Research and Innovation Programme, project PlantaSYST (SGA-CSA No. 739582 under FPA No. 664620), and project INCREASE (GA 862862).
Reagents and standards | |||
1,2-diheptadecanoyl-sn-glycero-3-phosphocholine (17:0 PC) | Avanti Polar Lipids | 850360P | Internal standard for lipids |
Chloroform | Supelco | 67-66-3 | FAME solvent |
Isovitexin | Sigma Aldrich | 38953-85-4 | Internal standard for metabolites |
Lignoceric acid methyl ester | Sigma Aldrich | 2442-49-1 | FAME |
Methanol (MeOH) | Biosolve Chemicals | 13684102 | ULC-MS grade |
Methoxyamine hydrochloride | Sigma Aldrich | 593-56-6 | Metabolite derivatization |
Methyl laurate | Sigma Aldrich | 111-82-0 | FAME |
Methyl myristate | Sigma Aldrich | 124-10-7 | FAME |
Methyl palmitate | Sigma Aldrich | 112-39-0 | FAME |
Methyl stearate | Sigma Aldrich | 112-61-8 | FAME |
Methyl tert-butyl ether (MTBE) | Biosolve Chemicals | 13890602 | HPLC grade |
Methyl caprate | Sigma Aldrich | 110-42-9 | FAME |
Methyl caprylate | Sigma Aldrich | 111-11-5 | FAME |
Methyl docosanoate | Sigma Aldrich | 929-77-1 | FAME |
Methyl eicosanoate | Sigma Aldrich | 1120-28-1 | FAME |
Methyl hexacosanoate | Sigma Aldrich | 5802-82-4 | FAME |
Methyl octacosanoate | Sigma Aldrich | 55682-92-3 | FAME |
Methyl pelargonate | Sigma Aldrich | 1731-84-6 | FAME |
N-Methyl-N-(trimethylsilyl)trifluoroacetamide (MSTFA) | Macherey-Nagel | 24589-78-4 | Metabolite derivatization |
Pyridine | Supelco | 110-86-1 | Metabolite derivatization |
Ribitol | Supelco | 22566-17-2 | Internal standard for derivatized metabolites |
Triacontanoic Acid Methyl Ester | TCI Chemicals | 629-83-4 | FAME |
Water | Biosolve Chemicals | 23214102 | ULC-MS grade |
Equipment | |||
1.5 mL Safe-lock microcentrifuge tubes | Eppendorf | 3120086 | |
2 mL Safe-lock microcentrifuge tubes | Eppendorf | 3120094 | |
Balance | Sartorius Corporation | 14 557 572 | |
DB-35ms, 30 m, 0.25 mm, 0.25 µm | Agilent | 123-3832 | Analysis of derivatized metabolites |
GC-MS system | Leco Pegasus HT TOF-MS (LECO Corporation) | Analysis of derivatized metabolites | |
Grinding Balls, Stainless Steel | OPS DIAGNOSTICS | GBSS 196-2500-10 | |
MS system | Exactive, Orbitrap-type, MS (Exactive, Thermo Fisher Scientific) | Analysis of lipids | |
MS system | Q Exactive Focus (Q Exactive™ Focus Hybrid Quadrupole-Orbitrap™ Mass Spectrometer, Thermo Fisher Scientific) | Analysis of metabolites | |
Refrigerated microcentrifuge | Eppendorf, model 5427R | 22620701 | |
Reversed Phase (RP) Bridged Ethyl Hybrid (BEH) C8 column (100 mm × 2.1 mm containing 1.7 μm diameter particles) | Waters | 186002878 | Analysis of lipids |
RP High Strength Silica (HSS) T3 column (100 mm × 2.1 mm containing 1.8 μm diameter particles) | Waters | 186003539 | Analysis of metabolites |
Shaker | Eppendorf Thermomixer 5436 | 2050-100-05 | |
Sonicator | USC 300 TH | 142-0084 | |
Tissue grinding mixer mill | Retsch, Mixer Mill MM 300 | 20.746.0001 | |
UPLC system | Waters Acquity UPLC system (Waters) | ||
Vacuum concentrator | Scan Speed Maxi Vac Alpha Evaporators | 7.008.500.002 | |
Vortex mixer | Vortex-Genie 2, Model G560 | SI-0236 | |
Software | |||
MetAlign | Chromatogram processing | ||
MzMine | Chromatogram processing | ||
R package "data.table" | |||
R package "fujiplot" | pleiotropic map | |
R package "genetics" | |||
R package "lme4" | BLUPs calculation | |
R package "LDheatmap" | LD plots | ||
R package "MASS" | transformation | ||
R package "rMVP" | GWAS | ||
R version 4.0.4 | |||
RefinerMS | Chromatogram processing | ||
RefinerMS Genedata | Expressionist | Chromatogram processing | |
Tassel 5 | Genotype filtering | ||
Xcalibur | Thermo Fisher Scientific | OPTON-30965 | Chromatogram processing |