Chrombus-XMBD: A Graph Convolution Model Predicting 3D-Genome from Chromatin Features
Research Background and Disciplinary Significance
In eukaryotic cells, the three-dimensional (3D) spatial structure of chromatin plays a crucial role in gene expression regulation. Through complex folding, looping, and local spatial reconfiguration of DNA, different genetic elements (such as promoters and enhancers) are brought into spatial proximity, enabling precise cis-regulation. In recent years, whether in developmental biology, disease mechanisms, or epigenetic studies, the dynamic structure of the 3D genome has been repeatedly shown to be closely related to changes in gene expression.
Currently, experimental methods for capturing the spatial conformation of the genome mainly include 3C, 4C, 5C, Hi-C, ChIA-PET, and HiChIP. However, these methods are expensive, operationally complex, and often limited by the source of biological material, resolution, signal-to-noise ratio, and other conditions, making it challenging to provide large-scale data for diverse biological or disease studies. Meanwhile, with the accumulation of multi-omics data (especially DNA sequence, epigenetic modification, and protein binding information), scientists are increasingly interested in the question: “Can we infer a 3D genome interaction blueprint in silico based solely on more easily accessible chromatin features?” As a result, many predictive models based on machine learning and deep learning have been developed.
Existing methods such as Akita, DeepC, Epiphany, and C. Origami have attempted to quantitatively predict genome interactions from DNA sequence and chromatin features using convolutional neural networks (CNNs), LSTMs, and Transformers. Nevertheless, these algorithms generally have the following limitations:
- Limited prediction distance: Most can reliably predict only interactions spanning up to 1-2 megabases (Mb);
- Poor model generalizability: Algorithms often target or depend on a single cell line or specific sample, with weak cross-cell/cross-species prediction capability;
- Fixed kernel window and bin segmentation not matching real biological partitioning: Genome regions are often segmented into fixed-size bins, but biologically, chromatin forms physical domains bounded by non-uniform CTCF binding sites;
- Weak interpretability: The black-box nature of deep learning makes it difficult to interpret feature contributions.
To overcome these bottlenecks, a research team from Xiamen University and collaborating institutes carried out this study and developed the novel graph convolutional neural network Chrombus-XMBD. The goal is to predict the 3D genome’s spatial interaction map (Contact Map) ab initio from chromatin features in an automated, generalizable, and interpretable manner.
Source and Authorship
The study, entitled “CHROMBUS-XMBD: A Graph Convolution Model Predicting 3D-Genome from Chromatin Features,” was published in Briefings in Bioinformatics (2025, Vol. 26, Issue 3). Authors include Yuanyuan Zeng, Zhiyu You, Jiayang Guo et al.; the core corresponding institutions are the School of Medicine and Department of Hematology, First Affiliated Hospital, and National Institute for Data Science in Health and Medicine, Xiamen University, with collaborations from several top-level research institutes including the Fujian Provincial Key Laboratory of Cellular Stress Biology and the Fujian Key Laboratory of Sensing and Computing for Smart Cities. The article was received on November 16, 2024, accepted on March 26, 2025, and openly published by Oxford University Press.
Detailed Research Workflow
1. Graph Modeling of the 3D Genome — Innovative Basic Units
Rather than dividing the genome into uniform bins, the study uses CTCF (CCCTC-binding factor) peak sites as partition points, segmenting chromatin into functional fragments. Each fragment is defined as a vertex in the graph, which aligns more closely with real biological structures. The authors obtained between 40,000 and 60,000 CTCF segments per dataset; Hi-C data from three cell lines were used to assign edge attributes (interaction strength between adjacent segments).
Each fragment node is characterized by a 14-dimensional epigenetic feature vector, including DNase-I accessibility, POLR2A activity, promoter/enhancer marks (H3K4me3, H3K27ac), CTCF binding orientation, and relative positioning.
Edge weights are obtained from processed real Hi-C experimental data, using the average contact between fragments as the score.
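The graph construction described above can be illustrated with a minimal sketch. It assumes that segments are the intervals between consecutive CTCF peak positions and that an edge weight is the mean of the binned Hi-C contacts spanned by the two segments; the function names, the 10 kb bin size, and the toy matrix are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def segments_from_peaks(peaks, chrom_len):
    """Split [0, chrom_len) into segments at sorted CTCF peak positions."""
    cuts = sorted(set(p for p in peaks if 0 < p < chrom_len))
    bounds = [0] + cuts + [chrom_len]
    return list(zip(bounds[:-1], bounds[1:]))

def segment_bins(segment, bin_size):
    """Indices of the fixed-size Hi-C bins overlapped by a segment."""
    start, end = segment
    return np.arange(start // bin_size, (end - 1) // bin_size + 1)

def edge_weight(hic, seg_a, seg_b, bin_size):
    """Mean Hi-C contact over all bin pairs spanned by two segments."""
    rows = segment_bins(seg_a, bin_size)
    cols = segment_bins(seg_b, bin_size)
    return float(hic[np.ix_(rows, cols)].mean())

# Toy example (assumed values): 100 kb chromosome, 10 kb bins, three peaks
peaks = [20_000, 50_000, 80_000]
segments = segments_from_peaks(peaks, chrom_len=100_000)
hic = np.arange(100, dtype=float).reshape(10, 10)
hic = (hic + hic.T) / 2  # symmetric toy contact matrix
w = edge_weight(hic, segments[0], segments[1], bin_size=10_000)
```

Because the segments have unequal lengths, averaging over the covered bins keeps edge weights comparable between short and long fragments, which is one plausible reason for using an average rather than a sum.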
2. Chrombus Graph Convolution Model — Core Algorithm Design
CHROMBUS uses a graph autoencoder (GAE) with three layers of dynamic edge convolution integrated with multihead self-attention mechanisms. The workflow is as follows:
- Encoder: The 14-dimensional features are processed through three convolution-attention layers, generating a 32-dimensional latent embedding (z) that effectively integrates neighborhood context information.
- Edge Convolution & Multihead Attention: Introduces a distance-weighted sign rule that modifies standard Transformer self-attention to reflect the distance dependence of chromatin contacts (i.e., more distant segments receive a lower adjusted interaction probability).
- Decoder: Outputs a predicted n × n contact strength matrix via inner product, to be fitted against the true Hi-C adjacency matrix.
- Training method: Each chromosome is partitioned into subgraphs (batches) of 128 CTCF segments, and during training, edges are randomly constructed to simulate an Erdős–Rényi random graph.
- Loss function: Optimizes mean square error (MSE) to approximate real Hi-C signals.
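The decoder, random edge construction, and loss listed above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the embeddings are random stand-ins for the encoder output, the edge probability `p` is an arbitrary choice, and restricting the MSE to the sampled edges is an assumption about how the random graph enters the loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def decode(z):
    """Inner-product decoder: predicted contact matrix from embeddings."""
    return z @ z.T

def sample_er_edges(n_nodes, p, rng):
    """Erdos-Renyi style sampling: keep each pair i < j with probability p."""
    i, j = np.triu_indices(n_nodes, k=1)
    keep = rng.random(i.size) < p
    return i[keep], j[keep]

def mse_on_edges(pred, target, edges):
    """Mean squared error restricted to the sampled edges."""
    i, j = edges
    diff = pred[i, j] - target[i, j]
    return float(np.mean(diff ** 2))

n_nodes, latent_dim = 128, 32  # subgraph size and embedding size from the text
z = rng.normal(size=(n_nodes, latent_dim))  # stand-in for encoder output
pred = decode(z)
# Toy "Hi-C" target: the prediction plus small Gaussian noise (assumed)
target = pred + rng.normal(scale=0.1, size=(n_nodes, n_nodes))
edges = sample_er_edges(n_nodes, p=0.1, rng=rng)
loss = mse_on_edges(pred, target, edges)
```

An inner-product decoder always yields a symmetric matrix, which matches the symmetry of a Hi-C contact map.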
3. Rigorous Grouped Training and Cross-validation
Using the widely studied human lymphoblastoid cell line (GM12878) as an example, each of the 22 autosomes is held out in turn as an independent test set, with the remaining 21 used for training, yielding 22 models. Each model is trained for about 400 epochs. The experimental data span six cell lines (GM12878, K562, IMR90, HeLa-S3, HCT116, and CH12), covering human and mouse for cross-species validation.
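The hold-one-chromosome-out scheme can be written as a short generator. The sketch below only enumerates the splits; the function name and the `chr1` to `chr22` labels are illustrative assumptions.

```python
def leave_one_chromosome_out(chromosomes):
    """Yield (test_chrom, train_chroms) with each chromosome held out once."""
    for held_out in chromosomes:
        train = [c for c in chromosomes if c != held_out]
        yield held_out, train

autosomes = [f"chr{i}" for i in range(1, 23)]
folds = list(leave_one_chromosome_out(autosomes))
```

Splitting by whole chromosome means no test segment shares a chromosome with any training segment, so neighboring loci cannot leak between the two sets.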
4. Multidimensional Evaluation and Feature Interpretability Analysis
- Performance Evaluation: Uses Pearson correlation to measure the goodness-of-fit between predictions and real Hi-C scores, and ROC/AUC curves to distinguish intra- and inter-TAD (topologically associating domain) interactions.
- Feature Contribution Analysis: Utilizes GNNExplainer to quantify the importance of each input feature and reveals the biological correspondence of latent embedding space.
- Generalizability Assessment: Tests cross-cell-line and cross-species (human-mouse) predictions to verify robustness and universality.
- Comparison with Known Biological Events: Includes validation on eQTL (expression quantitative trait loci), enhancer-gene interactions, etc.
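The two performance metrics named above can be computed as follows. This is a minimal sketch with made-up toy scores: the AUC is obtained from its pairwise definition (the probability that an intra-TAD pair scores higher than an inter-TAD pair, with ties counted as one half), which is simple but needs memory proportional to the number of positive-negative pairs.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between predicted and observed scores."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def auc(scores, labels):
    """AUC as P(positive score > negative score), ties counted as 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

# Toy data (assumed): predicted scores and intra-TAD (1) / inter-TAD (0) labels
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
intra_tad = [1, 1, 1, 0, 0, 0]
auc_value = auc(scores, intra_tad)
pcc_value = pearson([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```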
5. Extensive Comparison with State-of-the-art Models
Chrombus is compared with Epiphany, C. Origami, DynamicEdgeConv, GAT (Graph Attention Network), and GCN (Graph Convolutional Network), and its predictive power is assessed at three interaction ranges: short-range (0-1 Mb), mid-range (1-2 Mb), and long-range (above 2 Mb).
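A distance-stratified evaluation of this kind can be sketched as below. The three ranges follow the text; which range an interaction at exactly 1 Mb or 2 Mb falls into is an assumption, and the synthetic data (noise that grows with distance) exist only to make the example runnable.

```python
import numpy as np

# Interaction ranges in Mb; half-open intervals [lo, hi) are an assumption
RANGES_MB = {"short": (0.0, 1.0), "mid": (1.0, 2.0), "long": (2.0, np.inf)}

def stratified_pcc(dist_bp, pred, obs):
    """Pearson correlation of pred vs. obs within each distance range."""
    dist_mb = np.asarray(dist_bp, dtype=float) / 1e6
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    out = {}
    for name, (lo, hi) in RANGES_MB.items():
        mask = (dist_mb >= lo) & (dist_mb < hi)
        if mask.sum() < 2:
            out[name] = float("nan")  # too few pairs for a correlation
        else:
            out[name] = float(np.corrcoef(pred[mask], obs[mask])[0, 1])
    return out

rng = np.random.default_rng(0)
dist_bp = rng.uniform(0, 3e6, size=3000)
obs = rng.normal(size=3000)
# Toy predictions whose noise grows with genomic distance
pred = obs + rng.normal(size=3000) * 0.5 * (1 + dist_bp / 1e6)
result = stratified_pcc(dist_bp, pred, obs)
```

Reporting one correlation per range avoids the situation where abundant, strong short-range contacts dominate a single genome-wide score and mask weak long-range performance.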
Main Research Findings
- Excellent Model Fit: In cross-validation across all 22 chromosomes, Pearson correlation coefficients (PCC) ranged from 0.849 to 0.900 on the test sets and from 0.880 to 0.893 on the training sets, indicating strong generalizability. A random sample of 100,000 pairs yielded a PCC of 0.891 (95% CI: 0.889-0.892) between predictions and Hi-C signals.
- Superior Biological Unit Partitioning: CTCF-based segmentation outperformed traditional binning, enhancing both resolution and biological sensitivity.
- Breakthrough in Long-range Interaction Prediction: For the 1-2 Mb range, Chrombus achieved prediction correlations of 0.354-0.540; above 2 Mb, 0.243-0.582, surpassing competing methods (Epiphany and C. Origami reached approximately 0.24-0.48).
- Robust TAD and Functional Regulation Validation: The model consistently reconstructed known TAD structures and distinguished intra- from inter-TAD interactions with AUCs of 0.832 (HiCExplorer) and 0.861 (Arrowhead method). Predicted scores for eQTLs and enhancer-gene interactions were significantly higher than background and were strongly correlated with enrichment at known interaction loci.
- High Interpretability: Feature importance analysis indicated that DNA accessibility, CTCF binding, start/end position, H3K4me3, H3K27ac, and POLR2A were the most influential contributors, each displaying distinct dominance by interaction distance (e.g., DNase-I and H3K27ac for short-range, H3K4me3 for long-range interactions). Principal component clustering in embedding space revealed that different segment types were associated with distinct epigenetic characteristics and interaction strengths.
- Outstanding Model Generalization and Robustness: Models trained on a single cell line (e.g., GM12878) could accurately predict interaction patterns in other cell lines and even in mouse cells (e.g., CH12), with PCC of 0.8-0.85. In functional regulatory element prediction, the model stably distinguished cell-type-specific interactions.
- Optimization of Perception Range by Multihead Attention and Distance-weighted Strategies: Adjusting the number of attention heads and neighborhood windows significantly enhanced the model’s prediction of long-range interactions and effectively captured dissociation features at TAD boundaries.
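For the 100,000-pair sample reported under Excellent Model Fit, the source does not state how the 95% confidence interval was computed. One standard approach, shown here purely as an illustration, is the Fisher z-transform, which assumes independent pairs and approximately bivariate normal scores.

```python
import math

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval for a Pearson r from n pairs."""
    z = math.atanh(r)            # Fisher z-transform of the correlation
    se = 1.0 / math.sqrt(n - 3)  # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

lo, hi = fisher_ci(0.891, 100_000)
```

With this many pairs the interval is only a few thousandths wide, so the sampling uncertainty of the correlation itself is small compared with the chromosome-to-chromosome variation seen in cross-validation.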
Conclusions, Significance, and Applications
CHROMBUS-XMBD represents a significant advance in 3D genome prediction. By integrating six major epigenetic features (DNA accessibility, CTCF, RAD21, POLR2A, H3K4me3, H3K27ac) and combining graph convolution with self-attention and distance regularization, the method achieves high-quality interaction predictions at distances from 1 Mb to beyond 2 Mb.
This method stands out in several aspects:
- Addressing Experimental Data Scarcity: It provides virtual 3D interaction maps for epigenetic regulation, disease mechanism, GWAS signal interpretation, and other fields when samples are limited or experimental data is hard to obtain.
- Cross-platform and Cross-species Applicability: Supports prediction of chromosomal interactions from diverse sources, at different resolutions, and across species, offering new perspectives on genome structure evolution and development in mammals.
- Interpretability and Hypothesis Generation: The mapping between embedding space and features ensures the model is not just a black box, enabling reverse inference of key regulatory factors and guiding future experimental design and basic research.
- Promoting Automation and Intelligence in 3D Genome Analysis: Greatly lowers technical barriers, accelerates interdisciplinary integration, and enables automated large-scale dataset interpretation.
Research Highlights and Innovations
- CTCF-based, Biologically-driven Segmentation: For the first time, the graph structure construction matches the biological reality of chromatin folding.
- First to Break the 2Mb Barrier for Long-range Chromosomal Interaction Prediction: Far exceeds the application limits of previous algorithms.
- Innovative Combination of Multimodal Input, Multihead Attention, and Interval Sign Weighting: Substantially boosts long-range prediction and generalization while retaining network expressiveness.
- Comprehensive Six-cell-line Cross-species Evaluation: Establishes a reproducible, generalizable benchmark in the field.
- Strong Interpretability and Functional Traceability: Enables a natural transition from model output to molecular mechanism hypothesis.
Other Valuable Information
- Open-source Data and Code: All model code, training parameters, and training data for the six cell lines are openly available; see https://github.com/bioinfoheroes/chrombus-xmbd.
- High Scalability and Adaptability: The model can accommodate missing features or data noise through transfer learning, making it suitable for a variety of scenarios including medical health data and population genetics.
- Funding and Conflicts of Interest: The research team declared no conflicts of interest. The project received funding from the National Natural Science Foundation of China and Key Research and Development Programs, reflecting the R&D capability of Chinese foundational research teams at the intersection of 3D genomics and artificial intelligence.
- Promising Academic and Translational Prospects: Provides robust technical support for multidisciplinary fields such as 3D genomics, transcriptional regulation, and epigenetics; also lays a solid foundation for clinical translational applications like disease prediction and tailored drug development.
Summary
With its novel graph structural modeling and design philosophy compatible with complex biological partitioning, CHROMBUS-XMBD significantly improves prediction accuracy, distance coverage, and generalizability for 3D genome spatial interactions. This study not only offers a technical paradigm for 3D genomics research in the big data era but also injects strong momentum into cross-disciplinary innovation for precision medicine, disease susceptibility, gene regulation, and more.