AI-Enhanced Lung Cancer Prediction: A Hybrid Model's Precision Triumph

Background Introduction

Lung cancer, one of the most prevalent and deadly malignant tumors worldwide, remains a major challenge for modern healthcare. Published statistics show that the five-year survival rate for lung cancer patients is very low, and the disease consistently ranks among the top three causes of cancer death globally. Because early-stage lung cancer often produces few noticeable symptoms, patients are frequently diagnosed at a late stage, after the optimal window for treatment has passed. Early diagnosis is therefore the key to combating the disease. However, traditional clinical diagnostic methods, such as chest imaging and pathological examination, involve complex procedures and depend on advanced equipment and physician expertise, which makes timely, accurate, and broadly accessible early screening difficult to achieve.

In recent years, artificial intelligence (AI) has developed rapidly, especially in medical image analysis and medical text processing, and has brought substantial progress to cancer prediction and screening. Deep learning models are particularly prominent in natural language processing (NLP): they can analyze medical text, extract complex patient histories (personal, social, and family background), and mine diagnostic clues from large volumes of electronic medical records, improving the efficiency and accuracy of computer-aided diagnosis.

However, current AI and deep learning models for early lung cancer prediction still face multiple challenges, such as limited generalization ability, high parameter complexity, and insufficient interpretability. At the same time, research on customized AI models for medical text data is still inadequate. Against this backdrop, the authors conducted this study, aiming to design an efficient, robust, and interpretable AI model for early screening of lung cancer from clinical medical notes, providing new technical support for precision medicine.

Source of Paper and Author Information

This paper, titled “AI-Enhanced Lung Cancer Prediction: A Hybrid Model’s Precision Triumph,” was published in the IEEE Journal of Biomedical and Health Informatics, vol. 29, no. 9, in September 2025. The authors, Cyrille Yetuyetu Kesiku and Begonya Garcia-Zapirain, are affiliated with the Faculty of Engineering, Department of Computer Science, Electronics, and Telecommunications, University of Deusto, Spain. Supported by the Basque Government’s EVIDA research group, this work represents a key advancement in European medical AI research.

Research Procedures and Technical Solutions

1. Dataset Selection and Processing Procedures

a) Data Sources and Sample Size

The study utilized two major databases for experimentation:

  • MIMIC IV: Widely used in medical AI research, this US clinical database from Beth Israel Deaconess Medical Center covers patient records, medical notes, and disease diagnoses between 2008 and 2019, totaling over 60,000 patient visits.

    • Training set (70%): 26,807 medical text notes
    • Validation set (15%): 5,745 medical text notes
    • Test set (15%): 5,745 medical text notes
    • Total samples: 38,297 texts (19,147 lung cancer cases [class 1], 19,150 non-cancer cases [class 0])
  • YELP Review Polarity: An open review dataset extensively used to evaluate the generalization ability of text classification models, with hundreds of thousands of positive and negative review samples.

b) Data Preprocessing

Data preprocessing included:

  • Structured SQL queries against the MIMIC IV “notes” table, selecting text notes related to lung cancer and control diseases by ICD-9 diagnostic codes;
  • Document reconstruction per patient, extracting key sections such as medical history, social history, family history, and present illness;
  • Text cleaning to remove special characters, dates, and invalid information, ensuring data quality.
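The text-cleaning step can be illustrated with a minimal sketch. The regular expressions and the function name `clean_note` are assumptions made for illustration; the paper's exact cleaning rules are not given in this summary.

```python
import re

# Hypothetical patterns: the study's actual cleaning rules are not specified here.
DATE_RE = re.compile(r"\b\d{1,4}[-/]\d{1,2}[-/]\d{1,4}\b")  # e.g. 2015-03-12 or 3/12/15
NON_TEXT_RE = re.compile(r"[^a-z0-9\s]")                    # anything but letters, digits, spaces

def clean_note(text: str) -> str:
    """Lowercase a note, drop dates and special characters, collapse whitespace."""
    text = text.lower()
    text = DATE_RE.sub(" ", text)
    text = NON_TEXT_RE.sub(" ", text)
    return " ".join(text.split())

cleaned = clean_note("Pt seen 2015-03-12: 40 pack-year smoker!!")
print(cleaned)
```

Dates are removed before punctuation so that the separators inside a date are still present when the date pattern is matched.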

c) Dataset Partitioning

A stratified random sampling method was used to divide the MIMIC IV dataset into training, validation, and test sets so that each split preserves the balanced class distribution. In addition, stratified k-fold cross-validation (k=5) was applied so that the proportion of positive (lung cancer) and negative (non-cancer) samples stays consistent in each fold, giving a more reliable estimate of the model’s generalization and robustness.
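The partitioning described above can be sketched in plain Python as follows. This is an illustrative version under stated assumptions, not the authors' code: the helper names `stratified_split` and `stratified_folds`, the seed, and the toy labels are all hypothetical, and in practice a library routine would likely be used instead.

```python
import random
from collections import defaultdict

def _indices_by_class(labels):
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    return by_class

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in _indices_by_class(labels).values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # remainder goes to the test set
    return train, val, test

def stratified_folds(labels, k=5, seed=0):
    """Deal each class's shuffled indices round-robin into k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in _indices_by_class(labels).values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

labels = [1] * 100 + [0] * 100   # toy balanced labels, not the MIMIC IV data
train, val, test = stratified_split(labels)
folds = stratified_folds(labels, k=5)
```

Splitting within each class separately is what keeps the positive and negative proportions the same in every subset and fold.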

2. Model Architecture Innovations

This study introduces a novel hybrid deep learning model, CNN-BiLSTM-Attention, with the following architecture:

a) Embedding Layer

The model utilizes the skip-gram algorithm (a core method of word2vec) to map words in medical text into 100-dimensional dense vectors. Skip-gram is particularly effective for rare medical terms, optimizing the representation of infrequent words in the vector space and capturing semantic and syntactic features by modeling the co-occurrence probabilities of target and context words.

Mathematical form:

$$ p(w_c \mid w_t) = \frac{\exp(v'_{w_c} \cdot v_{w_t})}{\sum_{i=1}^{|V|} \exp(v'_i \cdot v_{w_t})} $$
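As a small worked example of the skip-gram softmax above, the sketch below computes the probability of a context word given a target word from two toy embedding tables. The function name, the 3-word vocabulary, and the 2-dimensional vectors are illustrative assumptions (the study uses 100-dimensional vectors).

```python
import math

def skipgram_probability(context_id, target_id, input_vecs, output_vecs):
    """p(w_c | w_t): softmax over dot products of each output vector with v_{w_t}."""
    v_t = input_vecs[target_id]
    scores = [sum(a * b for a, b in zip(v_out, v_t)) for v_out in output_vecs]
    max_score = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    return exps[context_id] / sum(exps)

# Toy tables: row i is the vector for word i.
input_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # v (target vectors)
output_vecs = [[0.5, 0.1], [0.2, 0.9], [0.3, 0.3]]   # v' (context vectors)

probs = [skipgram_probability(c, 0, input_vecs, output_vecs) for c in range(3)]
```

The denominator sums over the whole vocabulary, which is why practical word2vec training usually replaces it with an approximation such as negative sampling.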

b) Branch 1: 1D Convolutional Neural Network (CNN)

  • Configuration: 128 convolutional filters, window size of 5, ReLU activation
  • Function: Extracts local textual features by convolving the word vector sequence, capturing key phrases’ local representations
  • Followed by Global Max Pooling, which keeps the maximum value of each filter’s feature map, reducing dimensionality and helping to limit overfitting.

Mathematical expression:

$$ c_i = f(w \cdot x_{i:i+k-1} + b) $$
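The convolution and pooling steps can be traced by hand with one filter. The sketch below assumes ReLU for the activation f, as listed in the configuration, and uses a toy input with a window size of 2 rather than the study's window of 5; the function names are hypothetical.

```python
def conv1d_relu(x, w, b):
    """c_i = ReLU(w . x_{i:i+k-1} + b) for one filter sliding over a sequence of vectors."""
    k = len(w)                                   # window size
    features = []
    for i in range(len(x) - k + 1):
        window = x[i:i + k]
        total = sum(w_row[d] * x_row[d]
                    for w_row, x_row in zip(w, window)
                    for d in range(len(x_row)))
        features.append(max(0.0, total + b))
    return features

def global_max_pool(features):
    """Keep only the strongest response of the filter over the whole sequence."""
    return max(features)

# Toy input: 4 tokens with 2-dimensional embeddings, one filter spanning 2 tokens.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
w = [[1.0, -1.0], [0.5, 0.5]]
feature_map = conv1d_relu(x, w, b=0.0)   # [1.5, 0.0, 0.0]
pooled = global_max_pool(feature_map)    # 1.5
```

With 128 filters, the pooled output would be a 128-dimensional vector, one value per filter, regardless of the note's length.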

c) Branch 2: BiLSTM and Attention Mechanism

  • Two-layer BiLSTM (each with 64 units), modeling historical (forward) and future (backward) context in both directions of the sequence
  • Dropout regularization (ratio 0.2) to prevent overfitting
  • Attention layer assigns importance weights to each word, focusing on the most discriminative words or phrases
  • Produces a context-sensitive feature representation, enhancing semantic understanding.

Attention mechanism mathematical formula:

$$ \mathrm{Attention}(h_i) = \sum_j \alpha_{ij} h_j $$

Where $\alpha_{ij}$ are weights normalized by softmax.
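A minimal sketch of this weighted sum is shown below. The scoring function is an assumption: dot products between hidden states are used here for simplicity, whereas the paper's attention layer may compute its scores differently. The names `softmax` and `attend` and the toy hidden states are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(hidden_states, i):
    """Return (alpha_i, context_i), where context_i = sum_j alpha_ij * h_j."""
    h_i = hidden_states[i]
    # Assumption: dot-product scoring between h_i and every h_j.
    scores = [sum(a * b for a, b in zip(h_i, h_j)) for h_j in hidden_states]
    alphas = softmax(scores)
    context = [sum(alpha * h_j[d] for alpha, h_j in zip(alphas, hidden_states))
               for d in range(len(h_i))]
    return alphas, context

hidden_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]   # toy BiLSTM outputs
alphas, context = attend(hidden_states, 0)
```

Positions whose hidden states align most closely with the query state receive the largest weights, which is how the layer emphasizes the most discriminative words.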

d) Parallel Output Fusion and Fully Connected Layers (Dense layers)

  • Outputs from the CNN and BiLSTM branches are concatenated to generate composite features, which are input to a three-tier fully connected neural network (64, 32, 1 units; ReLU and Sigmoid activation)
  • Enables binary classification.
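The fusion step can be pictured as a concatenation followed by a small feed-forward pass. The sketch below uses random, untrained weights purely to show the shapes; the 128-dimensional BiLSTM branch output (2 directions x 64 units) and the placement of ReLU on the hidden layers with sigmoid on the output are assumptions, and all helper names are hypothetical.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(rng, n_in, n_out):
    """Random weights for illustration only; a trained model would learn these."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer, activation):
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

rng = random.Random(0)
cnn_features = [rng.random() for _ in range(128)]     # one value per filter after pooling
bilstm_features = [rng.random() for _ in range(128)]  # assumed branch output size

fused = cnn_features + bilstm_features                # concatenate the two branches
layers = [init_layer(rng, 256, 64), init_layer(rng, 64, 32), init_layer(rng, 32, 1)]

hidden = dense(fused, layers[0], relu)
hidden = dense(hidden, layers[1], relu)
probability = dense(hidden, layers[2], sigmoid)[0]    # score for class 1 (lung cancer)
```

A threshold on `probability` (commonly 0.5) would then turn the score into a binary label.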

e) Optimization and Parameter Settings

  • Adam optimizer (learning rate 0.001, beta_1=0.9, beta_2=0.999)
  • Batch size: 32, number of epochs: 10
  • Total number of parameters: only 12.5 million, significantly reducing model complexity
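To make the optimizer settings concrete, here is a sketch of a single Adam update for one scalar parameter using the listed learning rate and beta values. The epsilon value and the function name `adam_step` are assumptions; frameworks differ in their default epsilon.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (param, m, v)."""
    m = beta_1 * m + (1 - beta_1) * grad          # running mean of gradients
    v = beta_2 * v + (1 - beta_2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta_1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta_2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# First step (t=1) from param=1.0 with gradient 0.5.
param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate (about 0.001 here).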

3. Evaluation Metrics and Experimental Design

Several standard metrics were adopted:

  • Accuracy
  • Recall (sensitivity)
  • Precision
  • F1-score (combines precision and recall)
  • AUC-ROC (area under the receiver operating characteristic curve, measures classification ability)
  • Matthews correlation coefficient (MCC, suitable for evaluating imbalanced medical data)
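Most of the metrics listed above follow directly from the four cells of a binary confusion matrix. The sketch below computes them from toy counts that are not taken from the paper; AUC-ROC is left out because it requires the model's predicted scores rather than hard labels. The function name `binary_metrics` is hypothetical.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1, and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Toy counts for illustration only.
metrics = binary_metrics(tp=90, fp=10, fn=5, tn=95)
```

MCC uses all four cells, so unlike accuracy it stays informative when one class is much rarer than the other.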

Stratified 5-fold cross-validation was employed to ensure the results’ robustness and broad applicability.

Detailed Experimental Results

A. Results on MIMIC IV Test Set and Cross-Validation

On the core medical task of lung cancer detection, the model achieved the following results:

  • Accuracy: 98.1%
  • Precision, Recall, F1-score: all at 98.0%
  • AUC-ROC: 100%
  • MCC: 96.2%

Compared with BioBERT (110 million parameters; accuracy 98.0%; MCC 95.5%) and a classic LSTM (accuracy 97.0%; MCC 93.5%), the CNN-BiLSTM-Attention model achieved slightly higher accuracy and MCC while using roughly one-ninth of BioBERT’s parameters (12.5 million versus 110 million), which greatly improves deployability.

Five-fold cross-validation results were equally outstanding: average accuracy, recall, and F1-score all at 98.4%, AUC-ROC at 99.8%.

B. Generalization—YELP Review Polarity Dataset

When evaluated on a social review dataset, the model maintained strong performance:

  • Accuracy: 95.1%
  • Precision, Recall, F1-score: all approximately 95.1%
  • AUC-ROC: 99.0%
  • MCC: 90.3%

On the YELP dataset, the model achieved accuracy comparable to very large models such as KEN-BLOOM (over 531 million parameters), which indicates that it generalizes efficiently across text classification tasks and offers a favorable trade-off between performance and model size for real-world deployment.

Research Conclusions, Significance, and Application Value

1. Research Conclusions and Scientific Value

This study introduces a hybrid deep learning architecture for early lung cancer screening from medical text. It achieves strong performance in medical note classification and tumor detection and also shows advantages in task generalization and parameter efficiency. The model captures local features, long-range syntactic dependencies, and key information in clinical texts, and it outperforms traditional NLP techniques (such as SVM, naive Bayes, pure LSTM, and CNN) in accuracy, generalization ability, and practical feasibility.

2. Application Value

The model marks a major breakthrough in the development of AI-powered early screening tools for medicine:

  • Can be integrated into electronic medical record systems (EMR) for automated lung cancer risk screening
  • With a relatively small parameter count, it is suitable for use in primary care and remote health management settings with limited computational resources
  • Also possesses significant application potential in doctor-patient communication, clinical decision support, and medical big data research

Additionally, the model’s interpretability (explainable AI, XAI) helps physicians understand model decisions and increases clinical trust.

Research Highlights and Innovative Value

1. Architectural Innovation

For the first time, the model combines one-dimensional convolution filters in parallel with a two-layer bidirectional LSTM and an attention mechanism to extract multi-level information from medical texts, improving context capture and fine-grained feature extraction relative to mainstream NLP architectures.

2. Parameter Optimization and Efficiency Improvement

With only 12.5 million parameters, far fewer than standard Transformer models like BioBERT, the model achieves both high performance and high practicality, facilitating deployment in real-world healthcare institutions.

3. Interpretability and Feature Importance Analysis

Using SHAP (SHapley Additive exPlanations), the study reveals the contribution of key terms to model output. For example, text features such as “smoker,” “cancer,” “carcinoma,” “metastatic,” and “cell” are highly influential in lung cancer identification; visualizations (a word cloud and a SHAP plot) further help clinicians understand how the model discriminates between classes and bolster trust in the technique.
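The idea behind these attributions can be illustrated with exact Shapley values on a toy scorer. This is not the SHAP library and not the study's model: the weights, the interaction term, and the function names are hypothetical, and the brute-force enumeration below is only feasible for a handful of features (SHAP relies on approximations for real models).

```python
from itertools import permutations

def shapley_values(features, value_fn):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(present)
            present.add(f)
            totals[f] += value_fn(present) - before
    return {f: total / len(orderings) for f, total in totals.items()}

# Hypothetical toy scorer: additive term weights plus one interaction.
TOY_WEIGHTS = {"smoker": 0.3, "carcinoma": 0.5, "cough": 0.1}

def toy_score(present):
    score = sum(TOY_WEIGHTS[t] for t in present)
    if "smoker" in present and "carcinoma" in present:
        score += 0.2   # interaction, shared equally between the two terms
    return score

phi = shapley_values(list(TOY_WEIGHTS), toy_score)
```

The attributions sum to the difference between the score with all terms present and the score with none, which is the additivity property that makes such values convenient for ranking influential words.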

4. Far-reaching Generalization Ability

The model also achieved strong results on a non-medical dataset (YELP reviews), showcasing the hybrid architecture’s strong generalization and providing a template for cross-domain medical AI development.

Other Valuable Information

1. Data Ethics and Privacy Protection

The study strictly adheres to anonymization and ethical review standards, ensuring patient privacy and safety. The handling of sensitive medical records follows international protocols.

2. Deployability and Recommendations for Future Development

The authors recommend small-scale pilot testing and clinical feedback collection before official clinical deployment, coupled with improved protocols for data gathering and interpretation to ensure true patient benefit. Further, future research could explore multimodal data fusion (e.g., medical images, genetic data), expansion to multi-class classification, and anomaly detection, driving progress in medical AI.

Concluding Summary—Scientific and Practical Significance

The CNN-BiLSTM-Attention hybrid model presented in this study not only achieves outstanding performance in early lung cancer prediction, but also exhibits strong extensibility and application potential. Its concise and efficient architecture, solid theoretical foundation, and abundant experimental data offer a new paradigm for medical text classification and disease detection, while paving the way for deep integration of AI and precision medicine.

With the continuous accumulation of medical data and the optimization of AI algorithms, this research could help advance early diagnosis of lung cancer and other major diseases, making a positive contribution to global health.