Cancer Gene Identification through Integrating Causal Prompting Large Language Model with Omics Data–Driven Causal Inference

Cancer gene identification is a core challenge in the fields of basic cancer research and precision medicine. Recently, a research team from Jilin University and Zhejiang Sci-Tech University published an original study titled “Cancer gene identification through integrating causal prompting large language model with omics data–driven causal inference” in the journal Briefings in Bioinformatics. This article provides a comprehensive overview of the research background, academic innovations, methodological workflow, research conclusions, and the far-reaching significance of the paper.

1. Academic Research Background

1. The Need for Multi-Omics Cancer Gene Identification

Cancer is among the leading causes of death worldwide and is, at its core, a complex biological process involving interactions across multiple molecular levels (multi-omics). Genetic mutations, epigenetic alterations, and dysregulation of signaling pathways can all contribute to cancer development. Accurately identifying the true “driver” cancer genes is key to understanding tumor biology, discovering new drug targets, and advancing precision diagnosis and therapy; it is also one of the greatest challenges in bioinformatics.

2. Limitations and Bottlenecks of Traditional Methods

Currently, mainstream cancer gene identification methods fall into two main categories: (1) statistical and machine learning-based correlation analysis methods, and (2) more advanced deep learning techniques. Both have made important contributions, but both share a significant deficiency: they focus on statistical correlations and overlook real-world factors such as confounders and selection bias. As a result, they cannot distinguish causality from spurious association, which leads to redundant findings, weak interpretability, and limited generalization ability.

3. Causal Inference Methods and Their Challenges

To address confounding variables, a series of cancer gene identification methods based on causal inference have appeared in recent years. For instance, at the transcriptomic level, conditional independence tests and causal models have been used to explore direct causal links between genes and phenotypes. However, in high-dimensional data settings, causal structure identification still faces enormous computational complexity and feasibility challenges. Meanwhile, statistical methods for identifying driver mutations struggle to remove the effects of “hidden” confounders such as patients’ clinical features and oxidative stress.

4. Opportunities and Challenges for Large Language Models

Biomedical databases and literature have already accumulated abundant information on gene-cancer associations. Artificial intelligence-based large language models (LLMs) possess powerful text understanding and reasoning capabilities, making them a potentially valuable knowledge-driven tool for gene identification. However, LLMs are hindered by issues such as hallucination, outdated knowledge, shallow domain understanding, and “causal blindness,” making it difficult to achieve highly reliable causal identification based solely on text.

Therefore, how to combine the reasoning power of LLMs with causality-driven omics data analysis to form a highly credible, highly interpretable system for cancer gene identification remains an urgent academic challenge.

2. Source and Research Team

This study was conducted jointly by the School of Artificial Intelligence at Jilin University, the International Center of Future Science, the Engineering Research Center of Knowledge-Driven Human-Machine Intelligence at Jilin University, and the College of Life Science and Medicine at Zhejiang Sci-Tech University. The corresponding author is Dr. Huiyan Sun, with main contributors Haolong Zeng, Chaoyi Yin, Chunyang Chai, Yuezhu Wang, and Qi Dai. The paper was published in the 2025 issue of Briefings in Bioinformatics (Volume 26, Issue 2, bbaf113).

3. Detailed Methodological Workflow

1. Overall Research Framework and Innovations

The paper proposes, for the first time, the ICGI (Integrative Causal Gene Identification) platform. This system deeply integrates two mainstream intelligent techniques:

  • LLM-driven causal reasoning (module named CGI-GPT), in which “causal prompting” guides the large model to perform cancer gene causality determination and provide natural language explanations;
  • Data-driven local causal structure learning (module named DML-CGI), where Debiased Machine Learning (DML) algorithms are used to directly mine the causal relationships between genes and disease labels from transcriptomic data.

The framework integrates prior knowledge and data-driven causal discovery in a complementary way, balancing interpretability, accuracy, and innovation.

2. LLM-Based Causal Gene Identification Module (CGI-GPT)

a) Prompt Engineering and Chain-of-Thought Design

The authors designed a five-part prompt template for the LLM input: system instruction, domain insights, task description, solution guidance, and output indication. The template is combined with “gene context information” retrieved automatically from biological databases. Chain-of-thought prompting then guides the model to reason step by step about the causal connection between a given gene and cancer type and to output readable, structured causal explanations.
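The five-part template can be pictured as a small data structure that is rendered into a single prompt string. The sketch below is illustrative only: the class name CausalPrompt, the wording of every section, the position of the gene context block, and the example gene are assumptions made here, not the authors' actual prompt text.

```python
from dataclasses import dataclass


@dataclass
class CausalPrompt:
    """Holds the five sections named in the text (all wording is placeholder)."""

    system_instruction: str
    domain_insights: str
    task_description: str  # may contain {gene} and {cancer} placeholders
    solution_guidance: str
    output_indication: str

    def render(self, gene: str, cancer: str, gene_context: str) -> str:
        """Assemble the sections, plus retrieved gene context, into one prompt."""
        sections = [
            ("System instruction", self.system_instruction),
            ("Domain insights", self.domain_insights),
            # Assumption: retrieved context is placed before the task statement.
            ("Gene context", gene_context),
            ("Task description", self.task_description.format(gene=gene, cancer=cancer)),
            ("Solution guidance", self.solution_guidance),
            ("Output indication", self.output_indication),
        ]
        return "\n\n".join(f"[{title}]\n{body}" for title, body in sections)


# Hypothetical example values, written for this sketch only.
EXAMPLE = CausalPrompt(
    system_instruction="You are an expert in cancer genomics and causal reasoning.",
    domain_insights="Association alone does not establish causation; consider confounders.",
    task_description="Decide whether {gene} is a causal gene for {cancer}.",
    solution_guidance="Think step by step: gene function, pathway role, evidence, conclusion.",
    output_indication="Give a verdict (causal / not causal) and a short explanation.",
)

if __name__ == "__main__":
    print(EXAMPLE.render("TP53", "LUAD", "TP53 encodes the tumor suppressor protein p53."))
```

Keeping the sections as separate fields makes it easy to vary one of them (for example the solution guidance that carries the chain-of-thought steps) while holding the others fixed.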

b) Retrieval-Augmented Generation (RAG)

To keep the LLM from relying on outdated or hallucinated knowledge, the authors introduced a mechanism for automated retrieval from gene databases and synonym standardization, ensuring the model uses authoritative, bioinformatically consistent sources. The code and process have been made public on GitHub.
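The retrieval step can be sketched as two small functions: one maps a user-supplied name to a standard gene symbol, the other looks up a short record to place in the prompt. The alias table and records below are tiny in-memory stand-ins invented for this sketch; the function names are hypothetical, and a real pipeline would query an external gene database instead.

```python
from typing import Dict, Optional

# Hypothetical stand-ins for a gene database (assumption, not the authors' data).
ALIAS_TO_SYMBOL: Dict[str, str] = {"P53": "TP53", "HER2": "ERBB2", "NEU": "ERBB2"}
GENE_RECORDS: Dict[str, str] = {
    "TP53": "TP53 encodes the tumor suppressor protein p53.",
    "ERBB2": "ERBB2 encodes a receptor tyrosine kinase of the EGFR family.",
}


def normalize_symbol(name: str) -> Optional[str]:
    """Map a gene name or alias to its standard symbol, or None if unknown."""
    key = name.strip().upper()
    if key in GENE_RECORDS:
        return key
    return ALIAS_TO_SYMBOL.get(key)


def retrieve_context(name: str) -> str:
    """Return a context string for the prompt; never invent a record."""
    symbol = normalize_symbol(name)
    if symbol is None:
        return f"No database record found for '{name}'; do not guess its function."
    return f"{symbol}: {GENE_RECORDS[symbol]}"


if __name__ == "__main__":
    print(retrieve_context("her2"))
    print(retrieve_context("not-a-gene"))
```

Returning an explicit "no record" message for unknown names, rather than an empty string, gives the model a clear signal not to fill the gap from memory.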

3. Data-Driven Local Causal Structure Identification Module (DML-CGI)

Using transcriptomic data from six cancer types in The Cancer Genome Atlas (TCGA), the authors first build a statistical “association skeleton” between genes and disease labels, then use debiased machine learning (DML) to individually test the direct causal effect of each gene on the cancer phenotype. This effectively overcomes problems of “Markov equivalence classes” and “V-structure limitations” in traditional causal search algorithms, greatly improving reliability and efficiency in high-dimensional settings.
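The DML step can be illustrated with the standard "partialling-out" estimator for a partially linear model, y = theta * d + g(X) + noise, where d is the expression of the gene under test and X holds the other variables. The sketch below is a minimal version under stated assumptions: it uses simulated data rather than TCGA, ridge regression as the nuisance learner, a continuous outcome, and function names chosen here. It does not reproduce the authors' DML-CGI implementation or the skeleton-building step.

```python
import numpy as np


def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Fit ridge regression (with intercept) on one fold and predict on another."""
    mu_x, mu_y = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - mu_x
    beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ (y_train - mu_y))
    return mu_y + (X_test - mu_x) @ beta


def dml_effect(y, d, X, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of theta and its standard error."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_res, d_res = np.empty(n), np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Residualize outcome and gene on X using models fit on the other folds.
        y_res[test_idx] = y[test_idx] - ridge_fit_predict(X[train], y[train], X[test_idx])
        d_res[test_idx] = d[test_idx] - ridge_fit_predict(X[train], d[train], X[test_idx])
    theta = (d_res @ y_res) / (d_res @ d_res)
    psi = d_res * (y_res - theta * d_res)  # orthogonal score at the estimate
    se = np.sqrt(np.mean(psi ** 2) / n) / np.mean(d_res ** 2)
    return theta, se


def demo(n=2000, p=10, seed=42):
    """Simulate one gene with a direct effect (0.5) and one confounded bystander."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    w = np.full(p, 0.5)
    driver = X @ w + rng.normal(size=n)
    bystander = X @ w + rng.normal(size=n)  # shares confounders, no direct effect
    y = 0.5 * driver + X @ w + rng.normal(size=n)
    naive = np.cov(bystander, y)[0, 1] / np.var(bystander, ddof=1)
    return {
        "naive_bystander_slope": naive,
        "dml_driver": dml_effect(y, driver, X),
        "dml_bystander": dml_effect(y, bystander, X),
    }


if __name__ == "__main__":
    for name, value in demo().items():
        print(name, value)
```

In this simulation the bystander gene is strongly associated with the outcome in a naive regression, yet its DML estimate is close to zero once the shared confounders are partialled out, while the driver gene's estimate stays near its true effect. Cross-fitting (fitting nuisance models on folds that exclude the samples being residualized) is what keeps the estimate from being biased by overfitting in high-dimensional settings.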

4. Experimental Subjects and Samples

  • Transcriptome data source: six major TCGA cancer types, with expression profiles covering more than 20,000 human genes: lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), kidney renal clear cell carcinoma (KIRC), and liver hepatocellular carcinoma (LIHC);
  • Authoritative gene annotation: cancer gene lists from expert databases such as Malacards and COSMIC, used to compare and validate model results;
  • Experimental methods: multi-omics data analysis, LLM inference, cross-validation, functional enrichment analysis, etc.

4. Detailed Major Findings

1. Analysis of LLM Module Identification

  • CGI-GPT identifies substantially fewer cancer genes than are annotated in the Malacards database, but most of those it identifies are “core driver genes”;
  • Compared with seven classical and recent driver gene identification algorithms (including DriverML, MutSigCV, CEBP, etc.), CGI-GPT ranked first in precision, achieving hit rates up to 45% on some datasets, much higher than traditional methods such as MSEA and SCS;
  • The LLM outputs explanatory rationale for each cancer gene, offering potential for innovative discovery. For example, CGI-GPT identified previously undetected candidate genes such as RASSF1 and MDM2 in LUAD, and CD44 and UBE2C in BRCA;
  • TabPFN (a transformer-based tabular classifier pretrained on synthetic data drawn from a prior that includes structural causal models) was used to evaluate how well the identified genes distinguish tumor from normal samples. CGI-GPT genes achieved high balanced accuracy and weighted F1 values, and t-SNE dimensionality reduction also clearly separated the cohorts.
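For readers less familiar with the metrics named above, the sketch below computes precision (hit rate) of a predicted gene list against a gold-standard list, plus balanced accuracy and weighted F1 for a tumor-versus-normal classifier. The gene labels and predictions are toy values invented for this example, and the function names are chosen here; none of the numbers come from the paper.

```python
from collections import Counter
from typing import Hashable, Iterable, Sequence


def precision_vs_gold(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Fraction of predicted genes that also appear in the gold-standard list."""
    pred_set, gold_set = set(predicted), set(gold)
    return len(pred_set & gold_set) / len(pred_set)


def _recall(y_true: Sequence[Hashable], y_pred: Sequence[Hashable], label) -> float:
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return hits / sum(1 for t in y_true if t == label)


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class recall; robust to class imbalance."""
    labels = sorted(set(y_true))
    return sum(_recall(y_true, y_pred, c) for c in labels) / len(labels)


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += count * f1
    return total / len(y_true)


# Toy data (hypothetical): 1 = tumor, 0 = normal.
Y_TRUE = [1, 1, 1, 1, 1, 1, 0, 0]
Y_PRED = [1, 1, 1, 1, 1, 0, 0, 1]

if __name__ == "__main__":
    print(precision_vs_gold(["GENE_A", "GENE_B", "GENE_C", "GENE_X"],
                            ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]))
    print(balanced_accuracy(Y_TRUE, Y_PRED), weighted_f1(Y_TRUE, Y_PRED))
```

Balanced accuracy matters here because tumor samples typically far outnumber normal samples in such cohorts, so plain accuracy can look high even when the minority class is classified poorly.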

2. Functional Enrichment and Mechanism Elucidation

GO and KEGG pathway enrichment analyses on BRCA samples indicated that the genes identified by the LLM module are strongly enriched in cell cycle regulation, DNA damage response, PI3K-AKT signaling, miRNA regulation, and virus-associated pathways, supporting the biological plausibility of the findings.
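Enrichment of a gene list in a pathway is commonly assessed with an over-representation test based on the hypergeometric distribution. The sketch below shows that calculation; the article does not state which test or tool the authors used, so this is one common choice rather than their procedure, and the counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background: int, n_pathway: int,
                           n_selected: int, n_overlap: int) -> float:
    """P(overlap >= n_overlap) when n_selected genes are drawn without
    replacement from n_background genes, n_pathway of which are in the pathway."""
    upper = min(n_pathway, n_selected)
    favorable = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return favorable / comb(n_background, n_selected)


if __name__ == "__main__":
    # Hypothetical counts: 20,000 background genes, a 100-gene pathway,
    # 50 identified genes, 5 of which fall in the pathway.
    print(hypergeom_enrichment_p(20000, 100, 50, 5))
```

With these hypothetical counts only about 0.25 pathway genes would be expected by chance, so an overlap of 5 gives a very small p-value. In practice, p-values across many pathways are then corrected for multiple testing.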

3. DML-CGI Module Causal Gene Discovery

  • Compared with methods such as LCS-FS, ELCS, PCFRCIT, PSL, and CMB, DML-CGI performed excellently in identification count, accuracy, and computational efficiency;
  • Particularly in datasets such as BRCA and KIRC, DML-CGI achieved comparable or superior cancer sample discrimination using fewer genes than more complex structure-learning methods;
  • t-SNE analyses showed that genes identified by DML-CGI clearly distinguish cancer from normal samples.

4. Online Service Platform Deployment

The team developed an interactive online system based on Gradio (https://huggingface.co/spaces/icgi/icgi). Users simply input a gene and cancer type to obtain dual automated analysis from both LLM and causal inference modules, along with mechanism explanations, greatly facilitating researchers and clinical scientists.

5. Research Conclusions, Scientific and Application Value

1. Conclusions

This study established an integrated LLM + causal inference framework that significantly improves the accuracy, generalization, and interpretability of cancer gene identification, and it is the first to realize complementary validation through “automatic mechanism generation + data-driven causal mining.” In multi-omics scenarios, the ICGI system balances existing biomedical knowledge against the capacity for new discovery, indicating that LLMs can be efficiently coupled with omics-based causal inference as a tool for future research.

2. Scientific Value

  • Provides a general framework for deep integration of multi-omics data, text-based knowledge, and causal inference, substantially advancing the discovery of causal variables, mechanistic modeling, and functional annotation in complex biological systems.
  • Demonstrates, for the first time, the high-value application of chain-of-thought prompting and retrieval-augmented generation in bioinformatics and causal reasoning.

3. Application Value

  • The web platform greatly expedites the identification and validation of key genes for biomedical researchers, providing high-quality candidate gene lists for downstream functional experiments such as CRISPR/Cas9 gene editing and RNA interference, reducing experimental costs.
  • Lays a solid foundation for practical AI-based cancer precision diagnosis and drug target prediction.

6. Research Highlights and Features

  • Methodological Innovation: Proposes for the first time a cancer gene identification platform integrating LLM causal prompting and omics data causal inference, setting a new paradigm for the combination of bioinformatics AI and causal reasoning;
  • Interpretability and Generalization: The LLM module provides logic chain reasoning and natural language explanations, while the DML module ensures data-driven causal reliability, complementing each other’s weaknesses;
  • Strong Practicality: The web tool enables rapid integration and application, has a friendly interface, and all data and code are open for academic reproduction and extension;
  • Clear Scientific Significance: Newly discovered genes and mechanisms show good verifiability, guiding future functional and mechanistic research;
  • Broad Future Prospects: Provides both theoretical and practical foundation for LLMs in multi-omics causal reasoning, model optimization, and knowledge innovation.

7. Other Important Information

  • All data, algorithms, and code are open on GitHub (https://github.com/verylucky01/icgi);
  • All multi-omics samples included are from authoritative public databases such as TCGA, and identification results were fully benchmarked against expert gold standards;
  • The paper specifically notes current limitations of LLMs in knowledge freshness, uncertainty quantification, and execution of interventions, providing an essential perspective for future model and data integration optimization.

This study delivers a comprehensive, systematic academic paradigm and open tool for the deep integration of AI and causal inference in cancer gene identification, promoting a new direction in intelligent biomedical development.