GCduo: An Open-Source Software for GC × GC–MS Data Analysis
Academic Background and Research Motivation
With the growing demand for the analysis of complex samples, chromatographic technologies, especially comprehensive two-dimensional gas chromatography coupled with mass spectrometry (GC×GC–MS), have emerged as a powerhouse for untargeted metabolomics and related fields thanks to their exceptional resolving power. GC×GC–MS can separate and detect hundreds or even thousands of compounds in a single experiment, but the large, structurally complex, high-dimensional data it generates poses a daunting interpretive challenge and has become the major bottleneck to its widespread adoption. Although commercial software attempts to automate data processing and interpretation, high cost, steep expertise requirements, and the opacity of “black-box” algorithms still restrict the depth and flexibility of data mining and research.
To overcome these challenges, chemometric concepts have gradually been introduced into the analysis of multidimensional chromatographic data, leading to tensor decomposition algorithms such as Parallel Factor Analysis (PARAFAC), which extract meaningful chemical information directly from high-dimensional raw data and enable peak picking, deconvolution, and quantitative analysis. However, PARAFAC assumes strict trilinearity, a condition often violated in real GC×GC–MS datasets by sample-to-sample drift, noise, and diffusion, which limits its applicability and accuracy. Extended models such as PARAFAC2 relax some of these restrictions, but their integration into open-source software remains limited.
Therefore, developing an efficient, modular, and chemometrically diverse open-source software for batch processing of raw GC×GC–MS data stands as a core requirement for advancing data science in this field, while also promoting research efficiency and innovation across metabolomics, environmental science, food safety, aroma analysis, and related disciplines.
Source of the Publication and Author Team
This paper, entitled “gcduo: an open-source software for GC × GC–MS data analysis,” was published in the internationally renowned journal Briefings in Bioinformatics (2025, Vol. 26, No. 2, bbaf080) and co-authored by leading scientists including Maria Llambrich, Frans M. van der Kloet, Lluc Sementé, Anaïs Rodrigues, Saer Samanipour, Pierre-Hugues Stefanuto, Johan A. Westerhuis, Raquel Cumeras, and Jesús Brezmes. The authors are primarily affiliated with Universitat Rovira i Virgili, University of Amsterdam, Hospital Universitari Sant Joan de Reus, University of Liège, and related life science and engineering research departments. The paper was received on October 28, 2024, revised on December 27, 2024, and accepted on February 17, 2025.
Research Workflow and Key Methods
This study reports the development of the open-source software gcduo and its systematic validation on multi-sample batches of raw GC×GC–MS data. The overall workflow comprises the following six modules, mirroring the gcduo pipeline:
1. Data Import
First, gcduo supports raw data in CDF format, i.e. netCDF (network Common Data Form), a vendor-independent storage standard for chromatographic raw data. The research team developed algorithms to reconstruct the CDF-stored vectors of scan acquisition time, intensity values, mass-to-charge ratio (m/z), and points-per-scan count into a four-dimensional tensor (i × j × k × l), where i represents sample IDs, j is m/z ion fragments, and k and l correspond to the first and second retention time points of GC×GC, respectively. This process incorporates parameters such as the modulation period and m/z range to ensure precise alignment of the time and m/z axes.
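As an illustration of this folding step, the sketch below bins each scan’s m/z points to nominal masses and reshapes the scan sequence by modulation period into an (m/z × 2nd RT × 1st RT) tensor for one sample. The function name, nominal-mass binning, and axis ordering are illustrative assumptions, not gcduo’s actual implementation:

```python
import numpy as np

def fold_cdf_scans(intensities, mz_values, point_count,
                   mz_range=(40, 300), scans_per_modulation=50):
    """Fold flat CDF-style vectors into an (m/z x 2nd-RT x 1st-RT) tensor.

    `intensities` and `mz_values` are the concatenated per-scan vectors as
    stored in a netCDF raw file; `point_count[i]` is the number of points
    recorded in scan i. This is an illustrative sketch, not gcduo's code.
    """
    mz_lo, mz_hi = mz_range
    n_mz = mz_hi - mz_lo + 1
    n_scans = len(point_count)
    scan_matrix = np.zeros((n_scans, n_mz))
    start = 0
    for i, n in enumerate(point_count):
        # Bin this scan's centroids to nominal masses within the m/z range.
        mz = np.rint(mz_values[start:start + n]).astype(int)
        keep = (mz >= mz_lo) & (mz <= mz_hi)
        np.add.at(scan_matrix[i], mz[keep] - mz_lo,
                  intensities[start:start + n][keep])
        start += n
    # Drop any incomplete final modulation, then reshape so that scans
    # within one modulation period form the second-dimension axis.
    n_mod = n_scans // scans_per_modulation
    scan_matrix = scan_matrix[:n_mod * scans_per_modulation]
    tensor = scan_matrix.reshape(n_mod, scans_per_modulation, n_mz)
    return tensor.transpose(2, 1, 0)   # (m/z, 2nd RT, 1st RT)
```

Stacking such per-sample tensors along a sample axis then yields the four-way structure described above.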
2. Region of Interest (ROI) Selection via Inverse Watershed Algorithm
To automatically define peak regions for deconvolution, gcduo applies an inverse watershed algorithm. The data is first segmented into rolling windows by modulation period (2–4 periods per window, balancing accuracy and speed), followed by morphological processing of the 2D chromatogram to identify prominent high signal-to-noise peaks (“blobs”) and automatically quality-check their spatial coordinates and morphology. The algorithm ensures peaks are not truncated at window boundaries and significantly reduces data volume and runtime. Each blob must meet the following criteria to enter deconvolution: a signal-to-noise ratio above a user-set threshold (e.g., 10), at least 5 sampling points along the second dimension, and a peak shape approximating a Gaussian distribution.
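A much-simplified stand-in for this ROI screen can convey the QC criteria: threshold the chromatogram at SNR times a robust noise estimate, label connected regions, and keep blobs that are wide enough and roughly Gaussian. This substitutes connected-component labeling for the true inverse watershed; the noise estimator, thresholds, and names are assumptions:

```python
import numpy as np
from scipy import ndimage

def select_blobs(chrom2d, snr=10.0, min_points_2d=5, gauss_r=0.90):
    """Screen candidate ROIs on a 2D chromatogram slice (illustrative).

    Rows are taken as the second-dimension axis here. Returns a list of
    (row_slice, col_slice) bounding boxes for blobs passing all checks.
    """
    # Robust noise level via the median absolute deviation of the surface.
    noise = np.median(np.abs(chrom2d - np.median(chrom2d))) + 1e-12
    labels, n = ndimage.label(chrom2d > snr * noise)
    keep = []
    for lab in range(1, n + 1):
        rows, cols = np.nonzero(labels == lab)
        box = chrom2d[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
        profile = box.sum(axis=1)          # second-dimension elution profile
        if len(profile) < min_points_2d:   # too few sampling points
            continue
        # Moment-matched Gaussian as a crude peak-shape check.
        x = np.arange(len(profile))
        mu = (x * profile).sum() / profile.sum()
        sigma = np.sqrt(((x - mu) ** 2 * profile).sum() / profile.sum()) + 1e-12
        gauss = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
        if np.corrcoef(profile, gauss)[0, 1] >= gauss_r:
            keep.append((slice(rows.min(), rows.max() + 1),
                         slice(cols.min(), cols.max() + 1)))
    return keep
```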
3. Blind PARAFAC Deconvolution
Within each sample and blob, a three-dimensional tensor (m/z × second-dimension retention time × first-dimension retention time) is reconstructed for PARAFAC decomposition. To enhance efficiency, a rolling window targets the relevant retention time range, and the number of components (factors) for high-SNR blobs is determined dynamically: components are added iteratively until the model’s R² no longer improves and the Tucker congruence coefficient exceeds 0.9. To avoid misclassifying noise as peak signal, only the top 5% of m/z channels by variance enter the initial round of decomposition. To prevent redundancy, characteristic peaks are deduplicated by retention time and major ion fragments, and homologous signals are merged across samples via cosine similarity. Peaks detected in only one sample are excluded.
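The dynamic component selection can be sketched with a minimal three-way CP (PARAFAC) fit by alternating least squares, increasing the number of components until R² stops improving by more than a tolerance. This toy implementation omits gcduo’s Tucker congruence check and variance-based m/z pre-selection; all function names and the stopping tolerance are illustrative:

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Khatri-Rao product of two factor matrices."""
    r = U.shape[1]
    return (U[:, None, :] * V[None, :, :]).reshape(-1, r)

def cp_als(X, rank, n_iter=300, seed=0):
    """Minimal three-way CP/PARAFAC fit by alternating least squares."""
    rng = np.random.default_rng(seed)
    dims = X.shape
    factors = [rng.random((d, rank)) for d in dims]
    # Mode-m unfoldings; column ordering matches khatri_rao below.
    unfold = [np.moveaxis(X, m, 0).reshape(dims[m], -1) for m in range(3)]
    for _ in range(n_iter):
        for m in range(3):
            U, V = [factors[i] for i in range(3) if i != m]
            kr = khatri_rao(U, V)
            gram = (U.T @ U) * (V.T @ V)
            factors[m] = unfold[m] @ kr @ np.linalg.pinv(gram)
    return factors

def pick_components(X, max_rank=5, tol=1e-3):
    """Add components until the gain in explained variance falls below tol."""
    best_r2, best = -np.inf, None
    for rank in range(1, max_rank + 1):
        factors = cp_als(X, rank)
        recon = np.einsum('ir,jr,kr->ijk', *factors)
        r2 = 1.0 - ((X - recon) ** 2).sum() / (X ** 2).sum()
        if r2 - best_r2 < tol:
            break
        best_r2, best = r2, factors
    return best[0].shape[1], best_r2, best
```

On a clean low-rank tensor this stops at the true number of chemical components; in practice the congruence check guards against degenerate solutions that R² alone would accept.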
4. Spectrum Matching and Peak Annotation
Consensus spectra are compared against built-in or third-party libraries (in MSP format) using cosine similarity; matches exceeding a set threshold are further refined by retention index (RI) to complete batch peak annotation. Experimentally, adding RI significantly improves annotation accuracy.
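The matching logic can be sketched as a cosine score on a shared nominal-mass axis plus an RI pre-filter. The library record layout stands in for parsed MSP entries, and the score and RI tolerances are illustrative assumptions:

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two intensity vectors on a shared m/z axis."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

def annotate(spectrum, ri, library, min_score=0.8, ri_tol=15.0):
    """Best library hit for a consensus spectrum (illustrative sketch).

    `library` entries are dicts {'name', 'spectrum', 'ri'}, a stand-in
    for parsed MSP records. Returns (name, score) or None.
    """
    best = None
    for entry in library:
        if abs(entry['ri'] - ri) > ri_tol:
            continue                      # retention-index pre-filter
        s = cosine_score(spectrum, entry['spectrum'])
        if s >= min_score and (best is None or s > best[1]):
            best = (entry['name'], s)
    return best
```

The RI filter is what removes spectrally similar but chromatographically implausible candidates, which is why adding RI improves annotation accuracy.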
5. Constrained PARAFAC2 Quantitative Integration
To address violations of trilinearity caused by peak-shape variation and misalignment among samples, gcduo performs batch-zone PARAFAC2 decomposition. This model allows the same component to take different chromatographic profiles in different samples, constraining the deconvolution with prior information from the previous module (number of components, retention time window, reference spectrum), and excels at precise extraction and quantification of low-abundance or borderline peaks. Each peak’s area and intensity across all samples are then output.
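PARAFAC2 proper estimates sample-specific profiles under a cross-product constraint; as a much-simplified illustration of its “shared spectrum, per-sample elution profile” idea, one can project each sample’s slab onto the reference spectrum carried over from the previous module and integrate the resulting profile. This least-squares projection and trapezoidal area are a sketch, not gcduo’s constrained PARAFAC2:

```python
import numpy as np

def extract_profiles(slabs, ref_spectrum):
    """Per-sample elution profiles under a fixed reference spectrum.

    `slabs` is a list of (time x m/z) matrices, one per sample; the
    recovered profiles may differ in shape and position across samples,
    which is exactly the flexibility PARAFAC2 permits.
    """
    s = np.asarray(ref_spectrum, float)
    s = s / (s @ s)                     # least-squares projection vector
    return [slab @ s for slab in slabs]

def peak_area(profile, dt=1.0):
    """Peak area as the trapezoidal area under the elution profile (AUC)."""
    y = np.clip(np.asarray(profile, float), 0, None)
    return float(0.5 * dt * (y[:-1] + y[1:]).sum())
```

For a peak whose position shifts between samples, the profile (and hence the area) is still recovered per sample, which is why this family of models tolerates misalignment that breaks strict trilinearity.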
6. Data Visualization
gcduo incorporates a variety of 3D and 2D visualization modules: single-sample chromatogram contour plots, post-modulation chromatogram comparisons (to reveal misalignment), and post-deconvolution peak shapes, enabling intuitive full-process QC and assisting manual review.
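A minimal version of the single-sample contour view can be produced with matplotlib: plot the 2D TIC surface with the first-dimension axis scaled by the modulation period and the second by the acquisition rate. The function name and axis-scaling conventions are assumptions for illustration:

```python
import matplotlib
matplotlib.use("Agg")          # headless rendering, no display needed
import matplotlib.pyplot as plt
import numpy as np

def plot_contour(tic2d, mod_period, scan_rate, out_png):
    """Contour plot of a 2D TIC surface (illustrative sketch).

    `tic2d` has shape (2nd-dimension points, modulations);
    `mod_period` is in seconds and `scan_rate` in Hz.
    """
    n2, n1 = tic2d.shape
    rt1 = np.arange(n1) * mod_period / 60.0    # 1st-dim RT, minutes
    rt2 = np.arange(n2) / scan_rate            # 2nd-dim RT, seconds
    fig, ax = plt.subplots(figsize=(6, 4))
    cs = ax.contourf(rt1, rt2, tic2d, levels=30, cmap="viridis")
    fig.colorbar(cs, ax=ax, label="intensity")
    ax.set_xlabel("1st dimension RT (min)")
    ax.set_ylabel("2nd dimension RT (s)")
    fig.savefig(out_png, dpi=100)
    plt.close(fig)
```

Overlaying such surfaces for several samples is what makes inter-sample misalignment visible before and after deconvolution.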
Experimental Design and Datasets
The research team used high-quality public and in-house datasets for both training and validation:
- Training set: an open fragrance standards mixture dataset from Weggler et al., with concentration gradients at 2, 1, 0.4, and 0.2 ppb, each with three replicates.
- Validation set: two independent datasets—one the published “fruitybeer” beer aroma omics dataset (multiple beer styles, four replicates), the other an in-house 12-component breath mixture at five concentrations plus a 13-component n-alkane mixture, run on a different instrument system.
Core Results and Scientific Significance
Data Preprocessing and ROI Accuracy
gcduo successfully reconstructs high-dimensional data tensors from raw CDF data; validation demonstrates that as long as parameters such as modulation period and retention time are accurately provided, the resulting tensor maintains excellent trilinearity. ROI selection, using the inverse watershed algorithm plus automated correction for Gaussianity and SNR, retains only truly valuable analysis regions, greatly reducing false positives and downstream deconvolution workload. For example, out of 17 initially screened blobs in a given window of the training data, only 4 met the criteria, the rest being low-SNR or poorly shaped peaks—evidence of gcduo’s powerful noise control and real peak capture.
Deconvolution and Peak Extraction Algorithm Performance
The blind PARAFAC module, with dynamic factor selection and multiple QC steps, nearly eliminates the risks of mistaking noise for peaks or missing peaks entirely. Experiments show that even for low-abundance or strongly overlapping peaks, the analysis window and fragment channel selection can be automatically adjusted to improve peak extraction accuracy. Consensus spectra, via cosine scoring, integrate cross-sample peak features with high precision, effectively controlling sample-to-sample drift and the false negatives caused by misalignment.
Annotation Accuracy and Quantitation Capability
After introducing the retention index, 22 of 33 on-library targets in the training set were correctly annotated in a single run, a 37.5% improvement over the non-RI approach. In real biological samples (such as the beer dataset), 85% of previously reported target peaks were also correctly annotated. For quantitation, gcduo’s measured areas achieved a Pearson correlation coefficient of 0.904 against the commercial reference software ChromaTOF. Across the full dilution gradient, the mean R² exceeded 0.95, demonstrating excellent quantitative accuracy.
Application and Added Value of New Algorithms (PARAFAC2)
gcduo is the first open-source software to fuse blind PARAFAC and constrained PARAFAC2 in a two-step deconvolution pipeline, markedly improving detection and quantification of low-abundance, heavily overlapping, and misaligned peaks: in the breath mix experiment, PARAFAC2 yielded quantitative results for peaks that conventional PARAFAC failed to detect. Areas are computed as the area under the curve (AUC), improving consistency with traditional manual integration. Batch processing integrates sample information and recognizes systematic errors across full datasets, countering the sample-wise errors and noise accumulation common to traditional software.
Ongoing Improvements and Limitations
While gcduo offers substantial advantages in high-dimensional data processing, algorithmic innovation, and open access, it is limited by R’s memory and parallelization shortcomings and the data intensity of GC×GC–MS itself; thus, very large or high-resolution datasets may still require high-performance computing. The authors recommend users carefully check raw chromatograms and tensor folding to prevent analytical failure due to broken trilinearity or redundant noise. The paper also notes that as newer chemometric methods (such as shape-sensitive congruence) emerge, ongoing algorithmic upgrades are anticipated.
Study Conclusions, Significance, and Application Prospects
Overall, gcduo provides a novel, open-source solution for fully integrated, automated, and visualized batch processing of GC×GC–MS data, filling the gap for contemporary chemometric algorithms in the open-source multidimensional chromatography data analysis landscape, and bringing research teams worldwide a more efficient and flexible technical tool. Its scientific and practical significance includes:
- Advancing big data analysis in metabolomics and other fields, and promoting deeper investigation into complex chemical systems;
- Lowering the entry barrier for data analysis, enabling non-specialist users to reliably and efficiently interpret and annotate GC×GC–MS data;
- Reducing reliance on expensive commercial software (such as ChromaTOF), allowing developers to flexibly adjust parameters, tweak algorithms, or build extensions for practical needs;
- A platform-based, modular design that facilitates subsequent algorithm upgrades and cross-disciplinary (biomedicine, environmental science, food safety, etc.) adoption.
Research Highlights and Innovations
- Multi-algorithm fusion—Pioneers the integration of blind PARAFAC, constrained PARAFAC2, and inverse watershed algorithms into a full-process, batch-ready open-source software system.
- Batch processing and peak alignment—End-to-end synchronous handling of all samples, vastly improving systemic error recognition and correction in high-throughput GC×GC–MS.
- Excellence in annotation and quantitation—Extensive library matching via cosine similarity and RI calibration greatly enhance both annotation accuracy and quantitative precision of complex bio-samples.
- Open-source and expandable—Backed by a GitHub repository, the platform is freely available for global users, supporting secondary development and algorithm extension.
- Automated trilinearity assessment and compatibility—Allows users to switch between PARAFAC and PARAFAC2 models to best suit the true complexity of real-world data.
Other Valuable Information
The paper also contains a detailed comparison of mainstream commercial and open-source GC×GC–MS software, summarizing their technical bottlenecks and use cases, and highlights gcduo’s unique value in algorithmic transparency, flexibility, and batch computation. The authors place great importance on data and code openness, with all datasets and software released on Zenodo and GitHub to promote academic exchange and standardization. The work is supported by innovation funding from the European Union and multiple Spanish and Belgian research agencies and institutions.
Summary
With the advancement of multidimensional chromatography and mass spectrometry, and the expansion of their applications, data analysis methods are urgently in need of upgrading. gcduo, with its algorithmic innovations, open-access philosophy, and integrated workflow, marks a new era of automation, intelligence, and “white-box” transparency in GC×GC–MS data interpretation. This paper provides a solid theoretical and technical foundation for further methodological advances and scientific breakthroughs in this field.