Cox-SAGE: Enhancing Cox Proportional Hazards Model with Interpretable Graph Neural Networks for Cancer Prognosis
1. Research Background and Disciplinary Frontiers
Cancer prognosis analysis is a core research direction in medicine. With the widespread adoption of high-throughput sequencing technologies, researchers can examine the molecular biomarkers and clinical characteristics of cancer patients in greater depth, helping clinicians assess patients’ survival risk more accurately and formulate individualized treatment strategies. The traditional Cox proportional hazards model, a classical tool for survival analysis, is widely applied in cancer prognosis research because of its solid statistical foundation and adaptability.
However, with the introduction of deep learning (DL) and multi-omics data, researchers have come to recognize the limitations of the traditional Cox model in feature extraction and in modeling complex relationships. Many deep learning-based methods focus mainly on feature extraction or simply use fully connected layers for risk scoring, and they generally suffer from poor feature interpretability (the so-called “black box” problem). In addition, most existing methods do not fully exploit the similarity relationships among patients and so overlook regularities hidden within individual heterogeneity, which limits the clinical value and scientific interpretability of the models.
To address the above issues, pioneering explorations of Graph Neural Networks (GNNs) in cancer prognosis analysis have emerged in recent years. GNNs can integrate the complex relational structures among patients, endowing the prognostic models with structural information processing capabilities, and are naturally compatible with high-dimensional relational data such as biological networks and patient similarity networks. However, most existing GNN studies still suffer from limited feature interpretability, “black box” scoring mechanisms, and unclear risk factors, making it difficult to truly realize risk stratification tools that combine precision and interpretability for clinical application.
To address this bottleneck, the research team proposes Cox-SAGE, an interpretable GNN algorithm for prognostic analysis. Starting from multi-source, heterogeneous clinical information, the method constructs a patient similarity graph, incorporates graph convolution into the hazard function of the Cox model, and introduces a parameter interpretation mechanism together with two gene importance metrics. It is a step toward moving cancer survival analysis from a “black box” to a “white box”.
2. Paper Source and Research Team
The paper, entitled “Cox-SAGE: enhancing Cox proportional hazards model with interpretable graph neural networks for cancer prognosis,” is authored by Ruijun Mao, Li Wan, Minghao Zhou, Dongxi Li, and others, all affiliated with the College of Artificial Intelligence and the College of Computer Science and Technology at Taiyuan University of Technology, Taiyuan, Shanxi Province, China. It was published in 2025 in Briefings in Bioinformatics, a journal of Oxford University Press, and sits at the intersection of cancer survival analysis and artificial intelligence. The source code is openly available, with relevant data and reproduction scripts on GitHub (https://github.com/beeeginner/cox-sage).
3. Detailed Research Process
1. Overall Research Design
The Cox-SAGE workflow consists of three major modules: ① construction of the patient similarity graph and feature extraction; ② building and training the interpretable graph neural network prognostic model; ③ mining and analyzing prognosis-related genes. The authors use liver hepatocellular carcinoma (LIHC) as the main case study and also test systematically across seven large TCGA (The Cancer Genome Atlas) cohorts, including lung adenocarcinoma, colorectal cancer, etc.
1.1 Integration of Heterogeneous Clinical Information and Construction of Similarity Graph
The clinical data of the tumor cohorts cover variables such as age, gender, ethnicity, tumor stage, and histological subtype. Because these data mix ordinal, nominal, numerical, and binary attributes, the authors designed a mixed attribute distance measurement algorithm (Algorithm 1) that normalizes and weights the different feature types in a unified way and computes a multivariate distance/similarity for each pair of patients. High-similarity pairs are then selected using a preset threshold (derived statistically from the upper quartile and the interquartile range) and connected by edges, yielding an undirected graph of patient nodes (the Patients’ Similarity Graph).
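The construction described above can be sketched roughly as follows. This is a minimal illustration that substitutes an unweighted Gower-style distance for the paper's Algorithm 1, whose exact normalization and weights are not given here. The function names, the attribute-kind labels, and the `iqr_weight` parameter in the threshold `Q3 + iqr_weight * IQR` are all assumptions made for the example.

```python
import numpy as np

def mixed_distance(a, b, kinds, ranges):
    """Gower-style distance between two patients (illustrative, unweighted).

    kinds[i] is 'num', 'ord', 'nom' or 'bin'; ranges[i] is the observed range
    used to normalise numeric/ordinal attributes (ignored for the others).
    """
    parts = []
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind in ("num", "ord"):
            # Range-normalised absolute difference, in [0, 1].
            parts.append(abs(x - y) / rng if rng > 0 else 0.0)
        else:
            # Simple mismatch for nominal and binary attributes.
            parts.append(0.0 if x == y else 1.0)
    return float(np.mean(parts))

def similarity_graph(patients, kinds, iqr_weight=0.0):
    """Connect patient pairs whose similarity reaches Q3 + iqr_weight * IQR.

    The threshold form is a hypothetical stand-in for the paper's rule.
    """
    n = len(patients)
    ranges = []
    for j, kind in enumerate(kinds):
        if kind in ("num", "ord"):
            col = [p[j] for p in patients]
            ranges.append(max(col) - min(col))
        else:
            ranges.append(0.0)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = 1.0 - mixed_distance(
                patients[i], patients[j], kinds, ranges)
    pair_sims = sim[np.triu_indices(n, k=1)]
    q1, q3 = np.percentile(pair_sims, [25, 75])
    threshold = q3 + iqr_weight * (q3 - q1)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if sim[i, j] >= threshold]
    return edges, threshold

# Toy cohort: (age, sex, tumor stage, histological subtype).
patients = [(60, 1, 2, "A"), (62, 1, 2, "A"), (35, 0, 4, "B"), (37, 0, 4, "B")]
kinds = ("num", "bin", "ord", "nom")
edges, threshold = similarity_graph(patients, kinds)
```

On this toy cohort the two pairs of near-identical patients end up connected, while the cross-group pairs fall below the threshold.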
1.2 Gene Expression Feature Selection and Graph Embedding
Each patient node is further assigned transcriptome (RNA-seq) expression features, using log2-transformed raw counts. Only protein-coding genes are retained (19,938 in total), giving each sample a high-dimensional expression vector of about 20,000 dimensions. Missing values are strictly controlled in both the clinical and gene data: minor missingness is imputed with the mode or a random forest model, and samples with excessive missing values are removed.
1.3 Construction of the Cox-SAGE Graph Neural Network Prognostic Model
The authors adopt the GraphSAGE convolution operation (proposed by Hamilton et al.) as the backbone. The structure of each layer in the model consists of a weighted linear aggregation of self-features and neighborhood features of nodes. All mapping parameters in each layer are learnable weights and no activation functions are used, thereby ensuring a strictly linear output structure and maintaining the interpretability of the Cox model.
The multi-layer network is designed as follows:
- First layer: a linear mapping of each node’s own features plus a linear mapping of the mean of its neighbors’ features, with a bias term;
- Multilayer recursion: the output of each layer is fed to the next, which continues to propagate neighborhood information;
- Output and training: the risk score (the log-hazard term of the proportional hazards model) is produced by a final linear transformation, and the model is trained by minimizing the negative log partial likelihood, using the Adam optimizer with weight decay to prevent overfitting.
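The layer structure and loss described above can be sketched in NumPy as follows. This is a forward-pass illustration only, not the authors' implementation: the function names and toy values are assumptions, training with Adam and weight decay is omitted, and the loss assumes no tied event times.

```python
import numpy as np

def mean_neighbors(H, adj):
    """Row-wise mean of neighbour features; isolated nodes get zeros."""
    deg = adj.sum(axis=1, keepdims=True)
    return (adj @ H) / np.maximum(deg, 1.0)

def linear_sage_layer(H, adj, W_self, W_neigh, b):
    """GraphSAGE-style mean aggregation with no activation (purely linear)."""
    return H @ W_self + mean_neighbors(H, adj) @ W_neigh + b

def cox_neg_log_partial_likelihood(risk, time, event):
    """Negative log partial likelihood, averaged over events (no ties assumed)."""
    order = np.argsort(-time)            # descending time: prefix = risk set
    risk, event = risk[order], event[order]
    log_risk_set = np.logaddexp.accumulate(risk)  # log-sum-exp over each risk set
    return -np.sum((risk - log_risk_set) * event) / max(event.sum(), 1)

# Toy graph with edges 0-1 and 1-2, and two features per node.
adj = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
X = np.array([[1., 0.], [0., 1.], [1., 1.]])
W_self = np.array([[1.], [2.]])
W_neigh = np.array([[0.5], [0.5]])
risk = linear_sage_layer(X, adj, W_self, W_neigh, b=0.0).ravel()

time = np.array([5., 3., 1.])
event = np.array([1., 1., 1.])
loss = cox_neg_log_partial_likelihood(risk, time, event)
```

Because no nonlinearity is applied, stacking such layers still yields a risk score that is linear in the input expression values, which is the property the interpretability analysis relies on.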
1.4 Derivation of Interpretable Parameters and Design of Gene Hazard Metrics
To address the “black box” problem in deep models, the authors apply gradient analysis and the chain rule to the parameters of each layer and derive the direct effect of a change in any gene’s expression on the risk score. For the single-layer model, the output is a linear combination determined directly by the model weights α (self-features) and β (neighborhood features). For the multi-layer model, the output is still a linear combination, with coefficients obtained by chaining the parameter matrices of successive layers.
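As an illustration of why the output stays linear, the single-layer case and a two-layer case can be written out as below. The notation is an assumption made for this sketch rather than the paper's own: x_i is the expression vector of patient i, N(i) its neighborhood, M the row-normalized adjacency matrix (mean aggregation), A_1 and B_1 the first-layer self and neighbor weight matrices, α_2 and β_2 the second-layer weight vectors, and c a vector collecting the bias terms.

```latex
% Single layer: risk score of patient i and its gradients w.r.t. gene g
h_i = \alpha^\top x_i + \beta^\top \frac{1}{|N(i)|} \sum_{j \in N(i)} x_j + b,
\qquad
\frac{\partial h_i}{\partial x_{i,g}} = \alpha_g,
\qquad
\frac{\partial h_i}{\partial x_{j,g}} = \frac{\beta_g}{|N(i)|} \quad (j \in N(i)).

% Two layers, no activation: substitute the first layer into the second
H^{(1)} = X A_1 + M X B_1 + \mathbf{1} b_1^\top,
\qquad
h = H^{(1)} \alpha_2 + M H^{(1)} \beta_2 + b_2 \mathbf{1}
  = X \,(A_1 \alpha_2) + M X \,(B_1 \alpha_2 + A_1 \beta_2) + M^2 X \,(B_1 \beta_2) + c.
```

In this form the effective coefficient of a patient's own expression is A_1 α_2, while the coefficients for one-hop and two-hop neighborhood averages are B_1 α_2 + A_1 β_2 and B_1 β_2 respectively.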
Furthermore, the authors innovatively propose a dual-metric strategy for importance assessment:
- MHZ (Mean Hazard Ratio): simulates removing a gene and observes the overall increase in risk score, quantifying the relationship between low expression and high prognosis risk;
- RMHZ (Reciprocal of Mean Hazard Ratio): quantifies the risk-reducing effect of high expression.
By applying the above metrics across all samples and ranking risk, the method enables discovery of key prognostic genes from two complementary perspectives corresponding to different expression contexts.
1.5 Empirical Evaluation and Benchmark Experiments
The entire workflow is carried out on seven TCGA cancer cohorts (LIHC, LUAD, COAD, etc.), with Tables 1 and 2 detailing sample counts, survival outcomes, and clinical features for each cancer type. Training, validation, and test splits are strictly controlled throughout, and five-fold cross-validation with multiple random seeds is used for robustness. The authors also reproduce and benchmark mainstream competing methods (GraphSurv, LAGPROG, GGNN, AutoSurv, Cox-KAN, Cox-EN, Cox-AE), with the reproduction code available on GitHub. Harrell’s C-index, a widely accepted metric for survival models, is the primary evaluation metric.
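For readers unfamiliar with the metric, Harrell's C-index can be computed as in the following minimal sketch. It is a plain O(n²) illustration written for this summary, not the evaluation code used in the paper. A pair is treated as comparable when the patient with the shorter time had an observed event, and ties in predicted risk count as half.

```python
def harrell_c_index(time, event, risk):
    """Fraction of comparable pairs in which the higher predicted risk
    belongs to the patient with the shorter observed survival time."""
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy example: the third patient is censored.
time = [2, 4, 6, 8]
event = [1, 1, 0, 1]
risk = [0.9, 0.3, 0.5, 0.1]
c_index = harrell_c_index(time, event, risk)  # 4 of 5 comparable pairs agree
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking of patients by risk.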
2. Main Experimental Results and Data Interpretation
2.1 Model Performance Comparison Across Cohorts
Table 3 shows that Cox-SAGE outperforms or matches mainstream prognostic models across all cancer cohorts. Taking liver hepatocellular carcinoma (LIHC) as an example, the two-layer Cox-SAGE model achieves a C-index of 0.782, well above Cox-AE (0.563), Cox-KAN (0.627), and other methods. Multi-layer models (2 or 4 layers) also generally outperform the one-layer model, indicating that the deeper architecture brings performance gains.
2.2 Prognosis Risk Stratification and Survival Differences
Using the LIHC cohort as a representative example, the authors split patients at the median of the model output into high- and low-risk groups, then compare the groups using Kaplan-Meier survival curves and log-rank tests. The survival curves of the two risk groups are clearly separated, and all models reach statistical significance in the log-rank test (p < 0.005), confirming that the model outputs stratify patients effectively.
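The stratification step can be illustrated with a small, dependency-free sketch of a median split followed by a two-group log-rank test. The function names and toy data are assumptions made for this example, and the paper's own analysis may rely on a statistics package instead. The p-value uses the chi-square distribution with one degree of freedom, whose tail probability equals erfc(sqrt(chi2 / 2)).

```python
import math

def median_split(risk):
    """Label each patient True (high risk) if the score exceeds the median."""
    s = sorted(risk)
    n = len(s)
    median = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    return [r > median for r in risk]

def logrank_test(time, event, high):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    obs_minus_exp = 0.0
    variance = 0.0
    for t in sorted({x for x, e in zip(time, event) if e}):
        n = sum(1 for x in time if x >= t)                        # at risk
        n1 = sum(1 for x, g in zip(time, high) if x >= t and g)   # at risk, high
        d = sum(1 for x, e in zip(time, event) if x == t and e)   # events at t
        d1 = sum(1 for x, e, g in zip(time, event, high)
                 if x == t and e and g)                           # events, high
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of the high-group event count at time t.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("log-rank variance is zero; groups are not comparable")
    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Toy cohort: higher risk scores die earlier, all events observed.
time = [1, 2, 3, 4, 5, 6]
event = [1, 1, 1, 1, 1, 1]
risk = [6, 5, 4, 3, 2, 1]
high = median_split(risk)
chi2, p_value = logrank_test(time, event, high)
```

With only six patients the separation is already detectable, though real cohorts would of course include censored observations.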
2.3 Prognostic Gene Mining and Visual Analysis
In the LIHC cohort, the authors extract parameters from Cox-SAGE models of three depths (1, 2, and 4 layers), calculate MHZ and RMHZ for each gene under each model, and use the median as the threshold, keeping the intersection of genes that exceed it. This screens about 2,450 important genes out of 19,938 (2,456 low-expression/high-risk, 2,487 high-expression/high-risk). The authors also provide hazard contour plots showing how risk scores change with expression levels (e.g., for high-expression/high-risk genes such as CD69), which makes the model’s behavior easier to inspect.
Furthermore, the authors selected 20 representative genes most closely related to HCC prognosis (see Table 4). Literature review confirms 17 are closely related to the known pathogenesis of liver cancer, and 3 are known to be associated with other tumors. This not only provides new candidate genes for basic research but also lays the foundation for clinical translation and screening of potential new therapeutic targets.
3. Conclusions, Scientific and Application Value
The Cox-SAGE model addresses the “interpretability dilemma” of deep learning in cancer survival analysis, with contributions in model design, parameter derivation, and risk metric extraction. Beyond improving the accuracy and stability of survival analysis, its theoretical derivation allows key risk factors to be interpreted quantitatively, combining scientific interpretability with potential for clinical application.
The Cox-SAGE methodology is also applicable to multi-omics data, heterogeneous clinical indicators, and various cancer types, and it offers a useful reference for disease risk prediction and biomarker screening in other complex settings.
4. Research Highlights and Unique Innovations
- Construction of Patient Similarity Graphs: To handle real-world heterogeneous clinical data, the authors develop a mixed attribute distance algorithm, improving the network’s ability to capture individual differences.
- Deep, Interpretable GNN Architecture Design: Instead of a traditional “black box” neural network, the model adopts a fully linear structure with no activation function, so that the relationship between parameters and risk can be read off directly and the results are highly interpretable.
- Dual Gene Hazard Metrics MHZ / RMHZ: The two metrics provide complementary perspectives for evaluating prognostic gene importance, covering both low-expression/high-risk and high-expression/high-risk genes.
- Multi-Level Integration of Omics and Clinical Information: The method handles large-scale protein-coding gene expression data and supports evaluation across multiple cancer types and omics data.
- Open Source and Reproducibility: The authors have released data, code, and reproduction procedures, facilitating academic and industrial adoption and further development.
5. Other Valuable Content
- The study also meticulously compares various classical and cutting-edge models, using multiple random seeds and cross-validation to enhance robustness and statistical credibility of the results.
- Raw data and model parameters are made openly available across multiple platforms (GitHub, Kaggle, Zenodo), greatly facilitating data reuse and innovative extension by subsequent researchers.
- The research is supported by the Basic Research Program of Shanxi Province, reflecting strong disciplinary development in medical AI in the Shanxi region.
6. Closing Remarks
Cox-SAGE reflects the direction in which tumor risk stratification and personalized prognosis are heading in the “big data + AI” era. Its methodology and results advance survival analysis and offer an example of how interpretability can be built into deep learning models. The framework is expected to be useful in broader disease scenarios, in clinical practice, and in basic biomedical research.