This protocol presents an approach to fingerprint and explore multi-dimensional data collected by comprehensive two-dimensional gas chromatography coupled to mass spectrometry. Dedicated pattern recognition algorithms (template matching) are applied to explore the chemical information encoded in the extra-virgin olive oil volatile fraction (i.e., volatilome).
Data processing and evaluation are critical steps of comprehensive two-dimensional gas chromatography (GC×GC), particularly when coupled to mass spectrometry. The rich information encoded in the data may be highly valuable but difficult to access efficiently. Data density and complexity can lead to long elaboration times and require laborious, analyst-dependent procedures. Effective yet accessible data processing tools, therefore, are key to enabling the spread and acceptance of this advanced multidimensional technique in laboratories for daily use. The data analysis protocol presented in this work uses chromatographic fingerprinting and template matching to achieve highly automated deconstruction of complex two-dimensional chromatograms into individual chemical features, for advanced recognition of informative patterns within individual chromatograms and across sets of chromatograms. The protocol delivers high consistency and reliability with little intervention. At the same time, analyst supervision is possible through a variety of settings and constraint functions that can be customized to provide flexibility and capacity to adapt to different needs and goals. Template matching is shown here to be a powerful approach to explore the extra-virgin olive oil volatilome. Cross-alignment of peaks is performed not only for known targets, but also for untargeted compounds, which significantly increases the characterization power for a wide range of applications. Examples are presented to demonstrate the performance for the classification and comparison of chromatographic patterns from sample sets analyzed under similar conditions.
Comprehensive two-dimensional gas chromatography combined with time-of-flight mass spectrometric detection (GC×GC-TOF MS) is currently the most informative analytical approach for the chemical characterization of complex samples1,2,3,4,5. In GC×GC, columns are serially connected and interfaced by a modulator (e.g., a thermal or valve-based focusing interface) that traps eluting components from the first dimension (1D) column before their re-injection into the second dimension (2D) column. This operation is done within a fixed modulation period (PM), generally ranging between 0.5 and 8 s. With thermal modulation, the process includes cryo-trapping and focusing of the eluting band, which benefits the overall separation power.
Although GC×GC is a two-dimensional separation technique, the process produces sequential data values. The detector analog-to-digital (A/D) converter samples the chromatographic signal output at a certain frequency. The data are then stored in proprietary formats that contain not only the digitized data but also related metadata (information about the data). The A/D converter employed in GC×GC systems maps the intensity of the chromatographic signal to a digital number (DN) as a function of time in the two analytical dimensions. Single-channel detectors (e.g., flame ionization detector (FID), electron capture detector (ECD), sulfur chemiluminescence detector (SCD), etc.) produce single values per sampling time, whereas multichannel detectors (e.g., mass spectrometric detector (MS)) produce multiple values (typically, over a spectral range) per sampling time along the analytical run.
To visualize 2D data, elaboration starts with rasterization of a single modulation period (or cycle) data values as a column of pixels (picture elements corresponding to detector events). Along the ordinate (Y-axis, bottom-to-top) the 2D separation time is visualized. Pixel columns are sequentially processed so that the abscissa (X-axis, left-to-right) reports 1D separation time. This ordering presents the 2D data in a right-handed Cartesian coordinate system, with the 1D retention ordinal as the first index into the array.
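The rasterization described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the routine used by the software: the function names, the 50 Hz sampling frequency, and the 0.1 s modulation period in the toy example are assumptions chosen to keep the numbers small.

```python
def rasterize(signal, sampling_hz, modulation_period_s):
    """Fold a sequential detector signal into a 2D array.

    raster[i][j] holds the sample from modulation cycle i (1D retention
    ordinal, first index) at position j within that cycle (2D retention).
    """
    samples_per_cycle = round(sampling_hz * modulation_period_s)
    n_cycles = len(signal) // samples_per_cycle  # an incomplete last cycle is dropped
    return [
        signal[i * samples_per_cycle:(i + 1) * samples_per_cycle]
        for i in range(n_cycles)
    ]


def retention_times(i, j, sampling_hz, modulation_period_s):
    """Return (1D retention time, 2D retention time) in seconds for pixel (i, j)."""
    return i * modulation_period_s, j / sampling_hz


# Toy example (hypothetical values): 17 samples, 5 samples per modulation cycle.
signal = list(range(17))
raster = rasterize(signal, sampling_hz=50, modulation_period_s=0.1)
```

Each inner list corresponds to one pixel column of the image; a multichannel detector would store a spectrum rather than a single value at each position.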
Data processing of 2D chromatograms gives access to a higher level of information than raw data, enabling 2D peak detection, peak identification, extraction of response data for quantitative analysis, and cross-comparative analysis.
The 2D peak patterns can be treated as the sample’s unique fingerprint and detected compounds as minutiae features for effective cross-comparative analysis. This approach, known as template-based fingerprinting6,7, was inspired by biometric fingerprinting6. Automatic biometric fingerprint verification systems, in fact, rely on unique fingertip characteristics: ridge bifurcations and endings, localized and extracted from inked impressions or detailed images. These characteristics, named minutiae features, are then cross-matched with available stored templates8,9.
As mentioned above, every GC×GC separation pattern is composed of 2D peaks rationally distributed over a two-dimensional plane. Each peak corresponds to a single analyte, has its informative potential, and can be treated as a single feature for comparative pattern analysis.
Here, we present an effective approach for chemical fingerprinting by GC×GC-TOF MS featuring tandem ionization. The goal is to comprehensively and quantitatively catalog features from a set of chromatograms.
Compared to existing commercial software or in-house routines10,11 that employ a peak-features approach, template-based fingerprinting is characterized by high specificity, efficiency, and limited computational time. In addition, it has an intrinsic flexibility that enables the cross-alignment of minutiae features (i.e., 2D peaks) between severely misaligned chromatograms, such as those acquired by different instrumentation or in long-time frame studies12,13,14.
The basic operations of the proposed method are described briefly to guide the reader to a good understanding of the 2D pattern complexity and information power. Then, by exploring the instrument output data matrix, chemical identification is performed and known targeted analytes located over the two-dimensional space. The template of targeted peaks is then built and applied to a series of chromatograms acquired within the same analytical batch. Metadata related to retention times, spectral signatures, and responses (absolute and relative) are extracted from re-aligned patterns of targeted peaks and adopted to reveal compositional differences in the sample set.
As an additional, unique step of the process, a combined untargeted and targeted (UT) fingerprinting is also performed on pre-targeted chromatograms to extend the fingerprinting potential to both known and unknown analytes. The process produces a UT template for a truly comprehensive comparative analysis that can be largely automated.
As a final step, the method performs the cross-alignment of features in two parallel detector signals produced with high and low electron ionization energies (70 and 12 eV).
The protocol is quite flexible in supporting analyses of a single chromatogram or a set of chromatograms and with variable chromatography and/or multiple detectors. Here, the protocol is demonstrated with a commercially available GC×GC software suite (see Table of Materials) combined with an MS library and search software (see Table of Materials). Some of the necessary tools are available in other software, and similar tools could be implemented independently from descriptions in the literature by Reichenbach and co-workers15,16,17,18,19. Raw data for the demonstration are derived from a research study on extra-virgin olive (EVO) oil conducted in the authors’ laboratory14. In particular, the volatile fraction (i.e., volatilome) of Italian EVO oils is sampled by headspace solid phase microextraction (HS-SPME) and analyzed by GC×GC-TOF MS to capture diagnostic fingerprints for quality and sensory qualification of samples. Details on samples, sampling conditions, and analytical set-up are provided in the Table of Materials.
Steps 1–6 describe pre-processing of the chromatograms. Steps 7–9 describe processing and analysis of individual chromatograms. Steps 10–12 describe template creation and matching, which are the basis for cross-sample analysis. Steps 13–16 describe applying the protocol across a set of chromatograms, with steps 14–16 for UT analysis.
1. Importing raw data
NOTE: This creates a two-dimensional raster array for visualization and processing.
2. Shifting the modulation phase
NOTE: This puts all peaks in each modulation cycle into the same image column, including the peaks that wrap around the end of the modulation period into the void time of the next modulation period20.
3. Baseline correction21
4. Coloring the chromatographic image using a value map and color map20
5. 2D peaks (i.e., blobs) detection for analytes18
6. 2D peaks filtration
NOTE: This is done to automatically remove meaningless detections due to column bleed along the 1D and streaks or tailing along the 2D.
7. Linear retention indices calibration
NOTE: Linear retention indices22 (IT) are calibrated from the retention times of a set of retention index (RI) standards (typically n-alkanes).
8. Searching for the peak spectra in the NIST17 MS library23
9. Review and correct analyte identifications
10. Create a template with targeted peaks15
11. Match and apply the template
NOTE: Matching recognizes the template pattern in the detected peaks of a new chromatogram. Applying the match sets identifications and other metadata in the new chromatogram from the template.
12. Transform the template for substantially different chromatography
NOTE: This step is not necessary unless chromatographic conditions vary substantially causing the template to be misaligned with a new chromatogram, such as can be the case over long-term studies or after a new column is installed. In such cases, the template can be geometrically transformed in the chromatographic retention-times plane to better fit the new chromatogram12,13. In this example, the peak patterns of the template and chromatogram are similar, but differ in the retention-times geometry, such as would be seen for different chromatographic conditions.
13. Perform combined untargeted and targeted analysis across a set of chromatograms
NOTE: A combined untargeted and targeted (UT) template, also referred to as feature template24,25, when matched to each of a set of chromatograms, establishes correspondences between untargeted and targeted analytes, then consistent cross-sample features are extracted for pattern recognition.
14. Modify the UT template for parallel MS analysis
NOTE: The analysis was performed with both 70 eV and 12 eV (i.e., high and low) electron ionization energies26,27.
15. Perform combined untargeted and targeted analysis across 12 eV chromatograms
GC×GC-TOF MS patterns of high-quality extra-virgin olive oil volatilome exhibit about 500 2D peaks above a signal-to-noise ratio (SNR) threshold of 100. Such a threshold was defined by previous investigations on food volatiles14,27 as the minimum relative signal over threshold to obtain reliable spectra for cross-comparative analysis. Components are distributed over the chromatographic space according to their relative retention in the two chromatographic dimensions, and specifically based on their volatility/polarity in the 1D and volatility in the 2D. Here, column combination is polar × semi-polar (i.e., Carbowax 20M × OV1701).
The 2D pattern shows a high degree of order. Relative retention patterns for homologous series and classes are shown in Figure 1A with annotations (graphics for groups and bubbles for peaks) for linear saturated hydrocarbons (black), unsaturated hydrocarbons (yellow), linear saturated aldehydes (blue), mono-unsaturated aldehydes (red), polyunsaturated aldehydes (salmon), primary alcohols (green), and short-chain fatty acids (cyan).
Detected 2D peaks can then be identified by library search using either the average MS spectrum extracted from the entire 2D peak (blob spectrum) or the spectrum with the largest response (apex spectrum). Figure 2 illustrates the output of the apex spectrum search for blob 5, which returns a high similarity match (first 10 hits) for (E)-2-hexenal. Databases explored are those pre-selected by the analyst in step 8 of the method.
The identification is validated by active retention indexing. The experimental IT value was calculated for the 2D peaks, so that at this stage the library search prioritizes results with coherent values of tabulated IT. Tolerance windows can be customized based on analyst experience, reliability of reference database values according to stationary phase, and analytical conditions applied. New tools for smart calibration of linear retention indices without experimental calibration with n-alkanes have been recently developed and discussed in a study by Reichenbach et al.19.
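As an illustration of retention indexing, the sketch below computes a linear (temperature-programmed) retention index by interpolating between the two bracketing n-alkanes, following the widely used van den Dool and Kratz formulation, and then checks it against a tabulated value within a tolerance window. The alkane retention times, the tabulated index, and the function names are hypothetical and are not taken from the study or from the software.

```python
from bisect import bisect_right


def linear_retention_index(rt, alkanes):
    """Linear retention index of a peak eluting at rt.

    alkanes: list of (carbon_number, retention_time) sorted by retention time.
    """
    times = [t for _, t in alkanes]
    if rt < times[0] or rt > times[-1]:
        raise ValueError("retention time outside the calibrated range")
    # Index of the alkane eluting just before (or at) rt, clamped so that a
    # bracketing pair always exists.
    k = min(bisect_right(times, rt) - 1, len(times) - 2)
    (n_lo, t_lo), (n_hi, t_hi) = alkanes[k], alkanes[k + 1]
    return 100.0 * (n_lo + (n_hi - n_lo) * (rt - t_lo) / (t_hi - t_lo))


def within_tolerance(ri_experimental, ri_tabulated, window):
    """True if the experimental index falls inside the tolerance window."""
    return abs(ri_experimental - ri_tabulated) <= window


# Hypothetical calibration: retention times (s) of C8, C9, and C10.
alkanes = [(8, 300.0), (9, 420.0), (10, 540.0)]
ri = linear_retention_index(450.0, alkanes)
coherent = within_tolerance(ri, ri_tabulated=930.0, window=10.0)
```

A wider window tolerates less reliable reference values at the cost of admitting more candidates, which is the trade-off left to the analyst in the protocol.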
The collection of identified 2D peaks (i.e., targeted peaks) can be adopted to build a template of targeted peaks to promptly establish reliable correspondences between the same compound across all sample chromatograms. The collection of targeted template peaks is visualized in Figure 1B. Red circles correspond to the 196 targeted compounds, including two Internal Standards (IS) linked to template peaks with connection lines. IS are used for response normalization and connection lines help to visualize which of the included IS will be adopted to normalize each 2D peak/blob response.
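The role of the connection lines can be illustrated with a minimal sketch of response normalization, in which each targeted peak response is divided by the response of the internal standard it is linked to. The blob volumes and the labels IS1 and IS2 are hypothetical placeholders; the sketch assumes a simple ratio and does not reproduce the software's implementation.

```python
def normalize_responses(volumes, is_links):
    """Divide each peak response by the response of its linked internal standard.

    volumes:  {peak name: blob volume (absolute response)}
    is_links: {peak name: name of the internal standard linked to that peak}
    """
    return {
        name: volumes[name] / volumes[is_name]
        for name, is_name in is_links.items()
    }


# Hypothetical absolute responses for two targeted peaks and two internal standards.
volumes = {"(E)-2-hexenal": 5000.0, "decanal": 300.0, "IS1": 1000.0, "IS2": 600.0}
is_links = {"(E)-2-hexenal": "IS1", "decanal": "IS2"}
normalized = normalize_responses(volumes, is_links)
```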
In Figure 1B, filled circles indicate positive matches between template peak and the actual pattern while empty circles are for template peaks for which the correspondence was not verified. False negative matches can be limited by appropriate selection of threshold parameters, reference spectra and constraint functions13,14,18,19. For complex patterns with multiple co-elutions, ion peak detection functions that are based on spectral deconvolution are advisable and could be a valid option19. Template peak metadata are shown in the enlarged panel of Figure 1B for (E)-2-hexenal.
The specificity of template matching relies on the possibility to apply constraint functions that limit positive correspondence to those candidate peaks that, falling within the search window of the algorithm, have MS spectral similarity above a certain threshold. In this case, in step 11, similarity thresholds23 were set at 700 according to previous experiments aimed at defining optimal parameters limiting false negative matches14. Highlighted areas of the template peak properties in Figure 1B show the information about the reference MS spectrum string and the qCLIC constraint function (i.e., (Match("<ms>") >= 700.0) and (RMatch("<ms>") >= 700.0)).
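The logic of such a constraint can be sketched as follows. The similarity used here is a plain cosine similarity scaled to 0–999, which is a deliberate simplification and not the NIST match algorithm; the reverse variant simply ignores ions absent from the reference spectrum. The spectra and function names are hypothetical.

```python
import math


def match_factor(query, reference, reverse=False):
    """Cosine similarity between two spectra ({m/z: intensity}), scaled to 0-999.

    With reverse=True, ions absent from the reference are ignored, which
    mimics the idea of a reverse match. Simplified; not the NIST algorithm.
    """
    mzs = set(reference) if reverse else set(query) | set(reference)
    q = [query.get(mz, 0.0) for mz in mzs]
    r = [reference.get(mz, 0.0) for mz in mzs]
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in r))
    if norm == 0:
        return 0.0
    return 999.0 * sum(a * b for a, b in zip(q, r)) / norm


def passes_constraint(query, reference, threshold=700.0):
    """Both the direct and the reverse match must reach the threshold."""
    return (match_factor(query, reference) >= threshold
            and match_factor(query, reference, reverse=True) >= threshold)


# Hypothetical reference spectrum and a candidate contaminated by a co-eluting ion.
reference = {41: 100.0, 55: 80.0, 69: 60.0}
clean_candidate = dict(reference)
coeluting_candidate = {41: 100.0, 55: 80.0, 69: 60.0, 91: 300.0}
```

In this toy case the contaminated candidate keeps a high reverse match but fails the direct match, so requiring both conditions rejects it; lowering the threshold relaxes the constraint.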
By applying the template to all chromatograms of a set, one could encounter challenging situations as in the case of partial misalignment of patterns. This can be due to oven temperature inconsistencies, carrier gas flow/pressure instabilities, or because of a manual intervention on the system as in the case of column substitution or modulator loop-capillary replacement14,28. Figure 3 shows a situation of a partial misalignment between the targeted template and the actual chromatogram. For minimal misalignments, interactive template transforms (Figure 3, control panel) can reposition template peaks for a better fit. Once repositioned, the template can be matched to establish correspondences. In the example, the template (Figure 3, step 12) peaks correctly match with the actual 2D pattern. In case of severe misalignments, not discussed here, the repetition of match-transform-update actions can iteratively adapt the template peaks position to the actual peak pattern12,13,14.
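To make the idea of a template transform concrete, the sketch below fits an independent scale and shift for each retention axis by least squares, using template peaks and the positions of their matched peaks, and then repositions the template. This is a simplified stand-in for the affine and polynomial transforms described in the cited literature; the function names and retention times are hypothetical.

```python
def fit_axis(src, dst):
    """Least-squares fit of dst = scale * src + shift along one retention axis."""
    n = len(src)
    mean_s = sum(src) / n
    mean_d = sum(dst) / n
    var = sum((s - mean_s) ** 2 for s in src)
    if var == 0:
        raise ValueError("at least two distinct retention times are needed")
    scale = sum((s - mean_s) * (d - mean_d) for s, d in zip(src, dst)) / var
    return scale, mean_d - scale * mean_s


def fit_transform(template_pts, matched_pts):
    """Fit one (scale, shift) pair per axis from matched (1D, 2D) retention times."""
    axis_1d = fit_axis([p[0] for p in template_pts], [p[0] for p in matched_pts])
    axis_2d = fit_axis([p[1] for p in template_pts], [p[1] for p in matched_pts])
    return axis_1d, axis_2d


def apply_transform(transform, pts):
    """Reposition template peaks with the fitted per-axis transform."""
    (s1, b1), (s2, b2) = transform
    return [(s1 * x + b1, s2 * y + b2) for x, y in pts]


# Hypothetical template peaks (1D min, 2D s) and where they were found in a new run.
template = [(10.0, 1.0), (20.0, 2.0), (30.0, 1.5)]
observed = [(10.5, 1.2), (20.5, 2.3), (30.5, 1.75)]
transform = fit_transform(template, observed)
repositioned = apply_transform(transform, template)
```

Repeating the match, fit, and reposition cycle corresponds to the iterative match-transform-update strategy mentioned for severe misalignments.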
Here, the targeted peaks (i.e., known analytes) provide about 40% of the chromatographic result (196 targeted peaks of about 500 detectable peaks on average). The other 60% of compounds, together with the information they bring, are not taken into consideration in targeted analysis. To make the investigation truly comprehensive, consistent cross-alignment of untargeted 2D peaks should also be established. The first application where template matching was extended to all detectable analytes dealt with the complex volatilome of roasted coffee7. This process is automated with software (e.g., Investigator), shown here in steps 14–15.
In this process, pre-targeted images belonging to the sample set under study (20 samples) are used to define reliable peaks by cross-matching of all image patterns29. Subsequently, a composite chromatogram is built from which one can identify UT reliable peaks and peak regions (i.e., 2D peaks footprint) in the so-called feature template17.
For analyses acquired at 70 eV, the process determined 144 reliable peaks with relaxed reliability29, 76 of which belong to the targeted peaks list. Based on these 144 reliable peaks, the process aligns all chromatograms consistently with the average retention times of the reliable peaks and then combines them to create a composite chromatogram. Figure 4 shows a list of all samples labeled according to the production region of the oil (left) and the list of reliable peaks/blob volumes in each sample (right).
The untargeted feature template is composed of 2D peaks from analytes detected in the composite chromatogram, shown in Figure 5A, that are matched by the reliable-peaks template (n = 168 – red circles for targeted peaks and green circles for untargeted peaks). The mass spectra of the composite peaks, as well as their retention times, are recorded in the feature template as shown for (Z)-3-hexenol acetate in the enlarged area. Peak-regions are shown in Figure 5B as red colored graphics; they are instead defined by the outlines of all 2D peaks detected in the composite chromatogram (n = 3578).
When unsupervised pattern recognition by Principal Component Analysis (PCA) is applied to the distribution of targeted peaks within the 20 analyzed samples, Sicilian and Tuscan oils cluster separately, suggesting that pedo-climatic conditions and terroir impact the relative prevalence of volatiles. The results are shown in Figure 6A, and the PCA results from the reliable peaks distribution are shown in Figure 6B. The two approaches cross-validate that oils from different geographical areas have different, yet coherent, chemical signatures whether targeted or untargeted compounds, or both, are mapped.
Finally, the software enables prompt and effective re-alignment of patterns across parallel detection channels. In this application, the re-alignment is proposed for tandem ionization signals. The ion source of the MS multiplexes between two ionization energies (i.e., 70 and 12 eV) at an acquisition frequency of 50 Hz per channel30. The two resulting chromatographic patterns are closely aligned while spectral data (i.e., spectral signatures and responses) bring complementary information with different dynamic ranges of response26,27. The aligned patterns allow extracting features (2D peaks and peak-regions) with univocal IDs (i.e., chemical names for targeted peaks and unique numbering # for untargeted peaks and peak-regions).
Template matching allows effective cross-alignment. In this situation, there is not much misalignment, but MS constraints must be relaxed to allow matches for UT peaks. On the other hand, UT peak-regions, which have no MS constraints, are promptly matched without any false negative matches. Figure 5C shows an enlarged area of a 12 eV chromatogram where the feature template built from 70 eV data is matched. Reliable UT peaks are positively matched because of the lowered qCLIC constraints (e.g., DMF threshold at 600). Notably, at 12 eV, there are fewer detected peaks due to the limited fragmentation induced by low ionization energy.
Figure 1: Bidimensional contour plot and targeted template. (A) Contour plot of the volatile fraction of an extra-virgin olive oil from Tuscany. Ordered patterns of homologous series and classes are highlighted with different colors and lines: linear saturated hydrocarbons (black line and 2D contours), unsaturated hydrocarbons (yellow), linear saturated aldehydes (blue), mono-unsaturated aldehydes (red), polyunsaturated aldehydes (salmon), primary alcohols (green), and short-chain fatty acids (cyan). (B) Superimposed targeted template of known analytes (red colored circles) with connection lines linking Internal Standards (ISs). Panels show 2D peak/blob properties metadata (Decanal) or Template peak properties.
Figure 2: Apex MS search. Output of the apex MS search for blob 5. List of the database entries with the highest similarity match and related metadata available from the library.
Figure 3: Template realignment. Workflow illustrating the steps that allow re-alignment of the template by transformation.
Figure 4: GC Investigator interface. Investigator panel with all selected images labeled according to the production Region of the oil (left) and the list of reliable peaks/blob volumes in each sample (right).
Figure 5: Targeted and UT template. (A) Reliable peaks as resulting from the automated processing in step 11; red circles correspond to known analytes while green circles are unknowns. In the superimposed panel, template object properties are shown for the (Z)-3-hexenal. (B) Enlarged area that shows the UT peaks (red and green circles) and peak-regions (red graphics) of the UT template matched on a sample oil acquired at 70 eV ionization energy. (C) UT template matched on a sample oil acquired at 12 eV ionization energy.
Figure 6: PCA loading plots. They show the natural conformation of samples (oils from Tuscany and Sicily) as resulting from (A) targeted peaks distribution or (B) UT peaks distribution.
Visualization of GC×GC-TOF MS data is a fundamental step for an appropriate understanding of the results achieved by comprehensive two-dimensional separations. Image plots with customized colorization allow analysts to appreciate detector response differences and thus the differential distribution of sample components. This visual approach completely changes the analysts' perspective on the interpretation and elaboration of chromatograms. This first step, once understood and confidently used by chromatographers, opens a new perspective in further processing.
Another fundamental aspect of data processing is the accessibility of the full data matrix (i.e., MS spectral data and responses) for all sample points, each of which corresponds to a single detector event. In this respect, 2D peak integration, i.e., the collection of the detector events corresponding to a single analyte, represents a critical step. In the current protocol, 2D peak detection is based on the watershed algorithm18 with some adaptations included to improve detection sensitivity in the case of partially co-eluting compounds. To make this process more specific, deconvolution must be done and more sophisticated procedures adopted. This is possible by performing ion peak detection for MS data; the algorithm processes the data array and isolates the response from single analytes based on spectral profiles19,31.
An important yet critical step of the protocol, and of any GC×GC-MS data interpretation process, relates to analyte identification. This procedure, proposed in steps 8 and 9, in the absence of a confirmatory analysis with authentic standards, must be carefully conducted by the analyst. Automated actions are available in any commercial software; they include MS spectral signature similarity evaluation against the collected reference spectra (i.e., spectral libraries) and evaluation of characteristic ratios among qualifier/quantifier ions. However, additional confirmatory criteria are needed to disambiguate identification of isomers. The protocol proposes the adoption of linear retention indices to prioritize the list of candidates; the limit here relates to the availability of retention data and its consistency.
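The general principle of watershed-based 2D peak detection can be illustrated with a generic sketch: pixels are visited from the most to the least intense, and each pixel either joins the blob of its brightest already-labeled neighbor or starts a new blob. This is not the adapted algorithm implemented in the software; the function names, the 8-neighbor connectivity, and the toy image are assumptions made for illustration.

```python
def detect_blobs(image, threshold=0):
    """Label 2D peaks (blobs) with a simple watershed-style pass.

    image: list of rows of intensities. Returns an array of the same shape
    with integer labels (0 = background, at or below the threshold).
    """
    n_rows, n_cols = len(image), len(image[0])
    labels = [[0] * n_cols for _ in range(n_rows)]
    # Visit pixels above the threshold in order of decreasing intensity.
    pixels = sorted(
        ((image[r][c], r, c)
         for r in range(n_rows) for c in range(n_cols)
         if image[r][c] > threshold),
        reverse=True,
    )
    next_label = 0
    for _, r, c in pixels:
        best = None  # (intensity, label) of the brightest labeled neighbor
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (dr or dc) and 0 <= rr < n_rows and 0 <= cc < n_cols \
                        and labels[rr][cc]:
                    candidate = (image[rr][cc], labels[rr][cc])
                    if best is None or candidate > best:
                        best = candidate
        if best is None:
            next_label += 1  # local maximum: start a new blob
            labels[r][c] = next_label
        else:
            labels[r][c] = best[1]
    return labels


def blob_volumes(image, labels):
    """Sum the intensities of the pixels assigned to each blob."""
    volumes = {}
    for row_values, row_labels in zip(image, labels):
        for value, label in zip(row_values, row_labels):
            if label:
                volumes[label] = volumes.get(label, 0) + value
    return volumes


# Toy image (hypothetical intensities) with two partially overlapping peaks.
image = [
    [0, 1, 0, 0, 0],
    [1, 9, 2, 1, 0],
    [0, 2, 1, 6, 1],
    [0, 0, 1, 1, 0],
]
labels = detect_blobs(image)
volumes = blob_volumes(image, labels)
```

Because blob boundaries follow the valleys of the total signal, partially co-eluting analytes with different spectra may still be merged or split incorrectly, which is why ion peak detection with spectral deconvolution is recommended for such cases.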
The main characteristic that makes this approach unique is template matching12,13,15,29. Template matching enables 2D pattern recognition in a very effective, specific, and intuitive way. It can be set, in terms of sensitivity and specificity, by applying customized threshold values and/or constraint functions while the analyst can supervise the procedure by actively interacting with transform function parameters. The peculiarity of this process relies on the possibility to cross-align targeted and untargeted peaks information between samples of a uniform batch but also between samples acquired with the same nominal conditions despite medium-to-severe misalignment. Advantages of this operation relate to the possibility to preserve all targeted analytes identifications, which is a time-consuming task for the analyst, and all metadata saved for targeted and untargeted peaks from previous elaboration sessions.
Template matching is also very effective in terms of computational time; low-resolution MS data files consist of about 1–2 GB of packed data, while high-resolution MS analyses may reach 10–15 GB per single analytical run. Template matching does not process the full data matrix every time; instead, it first performs retention-time alignment between chromatograms using template peaks and then processes candidate peaks within the search window for their similarity match with the reference in the template. In case of severe misalignment, the most challenging situation, global second-order polynomial transforms performed better than local methods while reducing computational time13.
For the GC×GC technique to spread widely beyond academia and research laboratories, data processing tools have to facilitate basic operations for visualization and chromatograms inspection; identification of analytes should offer the possibility to adopt standardized algorithms and procedures (e.g., NIST search algorithm and IT calibration); and cross-comparative analysis should be intuitive, effective and supported by interactive tools. The proposed approach addresses these needs while offering advanced options and tools to deal with complex situations such as analytes co-elution, multiple analytes calibration, group-type analysis, and parallel detection alignment.
The referenced literature covers many possible scenarios where GC×GC and, more generally, comprehensive two-dimensional chromatography offer unique solutions and reliable results that cannot be achieved by 1D-chromatography in single run analysis5,32,33. Although GC×GC is the most powerful tool for increasing separation capacity and sensitivity, there are always limitations to separation power, sensitivity, and other systemic capacities. As these systemic limits are approached, data analysis becomes progressively more difficult. Therefore, research and development must continue to improve the analytical tools at our disposal.
The authors have nothing to disclose.
The research was supported by Progetto Ager − Fondazioni in rete per la ricerca agroalimentare. Project acronym Violin – Valorization of Italian olive products through innovative analytical tools (https://olivoeolio.progettoager.it/index.php/i-progetti-olio-e-olivo/violin-valorization-of-italian-olive-products-through-innovative-analytical-tools/violin-il-progetto). GC Image software is available for a free trial for readers who wish to demonstrate and test the protocol.
1D SolGel-Wax column (100% polyethylene glycol; 30 m × 0.25 mm dc × 0.25 μm df) | Trajan SGE Analytical Science, Ringwood, Australia | PN 054796 | Carrier gas helium at a constant nominal flow of 1.3 mL/min. Oven temperature programming set as follows: 40°C (2 min) to 240°C (10 min) at 3.5°C/min. |
2D OV1701 column (86% polydimethylsiloxane, 7% phenyl, 7% cyanopropyl; 1 m × 0.1 mm dc × 0.10 μm df) | Mega, Legnano, Milan, Italy | PN MEGA-1701 | |
Automated system for sample preparation: SPR Autosampler for GC | SepSolve-Analytical, Llantrisant, UK | ||
Extra Virgin Olive oils: Sicily and Tuscany, Italy | Project VIOLIN (Ager – Fondazioni in rete per la ricerca agroalimentare) | Samples (n=10) were collected during the production year 2018 within the "Violin" project sampling campaign. Oils were submitted to HS-SPME to sample volatiles according to a reference protocol validated in a previous study of Stilo et al.14 | |
Gas chromatograph: Model 7890B GC | Agilent Technologies Wilmington DE, USA | ||
GC Image GC×GC edition V 2.9 | GC Image LLC, Lincoln, Nebraska | https://www.gcimage.com/gcxgc/trial.html | |
Image processing software | GC Image LLC, Lincoln, Nebraska | https://www.gcimage.com/gcxgc/trial.html | |
Mass spectrometer: BenchTOF-Select | Markes International Llantrisant, UK | ||
Methyl-2-octynoate (CAS 111-12-6) | Merck-Millipore/Supelco | PN: 68982 | |
Modulator controller: Optimode v2.0 | SRA Instruments, Cernusco sul Naviglio, Milan, Italy | ||
Modulator: KT 2004 loop type | Zoex Corporation Houston, TX, USA | ||
MS library and search software: NIST Library V 2017, Software V 2.3 | National Institute of Standards and Technology (NIST), Gaithersburg MD | https://www.nist.gov/srd/nist-standard-reference-database-1a-v17 | |
n-alkanes C8-C40 for retention indexing | Merck-Millipore/Supelco | PN: 40147-U | |
n-hexane (CAS 110-54-3) gas chromatography MS SupraSolv | Merck-Millipore/Supelco | PN: 100795 | |
Solid Phase Microextraction fiber | Merck-Millipore/Supelco | PN 57914-U | |
α- /β-thujone (CAS 546-80-5) | Merck-Millipore/Sigma Aldrich | PN: 04314 |