Analysis of Learned Feature Similarities to Diagnostic Criteria in Deep Learning-Based 12-Lead ECG Classification

Interpretability of Deep Learning in Automated ECG Diagnosis: A Progress Overview Based on Explainable AI

1. Academic Background and Problem Statement

The electrocardiogram (ECG) has been an essential physiological signal for the clinical diagnosis of heart disease for over a century. In recent years, with the rapid development of Artificial Intelligence (AI) and Deep Neural Networks (DNNs), data-driven automatic diagnostic algorithms have achieved outstanding performance on ECG data, particularly in the detection of complex arrhythmias, significantly surpassing traditional approaches. Deep learning models learn and extract signal features automatically, greatly advancing automated ECG interpretation and assisted diagnostic systems.

However, the adoption of these black-box algorithms in clinical practice remains limited, and the lack of explainability is one of the most critical barriers. Although models provide clear classification decisions, medical practitioners find it difficult to understand the basis for those decisions, raising concerns that the models may rely on spurious correlations, signal noise, or instrumental errors, thereby compromising the safety and reliability of diagnosis. For example, if a model uses noise or clinically irrelevant signal features as diagnostic evidence, it can produce “Clever Hans”-like misjudgments (associations that appear accurate but are in fact incorrect). Improving the explainability of deep learning models and revealing the relationship between their implicit features and clinical criteria has therefore become a key topic in medical AI research.

In response, the research team introduces Explainable Artificial Intelligence (XAI) methods into automated ECG diagnosis, aiming to analyze the implicit features learned by a trained deep learning model for 12-lead ECG classification, to verify whether the model has acquired diagnostic criteria consistent with cardiology textbooks, and to propose a quantitative analysis workflow as a foundation for future medical AI applications.

2. Paper Source and Author Information

This study, entitled “Analysis of a Deep Learning Model for 12-Lead ECG Classification Reveals Learned Features Similar to Diagnostic Criteria”, was published in IEEE Journal of Biomedical and Health Informatics (Vol. 28, No. 4, April 2024, pp. 1848–1859). The first author is Theresa Bender (corresponding author), with co-authors Jacqueline M. Beinecke, Dagmar Krefting, Carolin Müller, Henning Dathe, Tim Seidler, Nicolai Spicher, and Anne-Christin Hauschild. The authors are affiliated with the Medical Informatics and Cardiology departments of University Medical Center Göttingen, Germany, reflecting close interdisciplinary collaboration.

3. Research Design and Technical Workflow

1. Overall Research Strategy

This study is based on a publicly available Residual Network (ResNet) deep learning model and raw ECG data from two large public databases (CPSC2018 and PTB-XL). XAI attribution methods are applied to analyze the features the model has learned in realistic diagnostic settings, and a quantitative evaluation and visualization workflow is designed to systematically uncover the model’s decision mechanisms.

a. Data Sources and Sample Selection

  • CPSC2018 Database: Collected from 11 hospitals in China, annotated by experts, comprising diverse abnormal records. The study selected 200 normal ECGs, 200 with atrial fibrillation (AF), and 200 with left bundle branch block (LBBB) for analysis.
  • PTB-XL Database: A German public dataset spanning a longer time frame, with patient cohorts and equipment types differing from CPSC2018, mainly used to validate results and test generalizability.

b. Data Processing and Modeling Workflow

  1. Preprocessing: All ECG signals were resampled to 400 Hz and trimmed or zero-padded to 4096 sample points, forming a standardized input matrix (n × 4096 × 12, with n denoting the number of records); a minimal preprocessing sketch follows this list.
  2. Model Inference: Each ECG record was fed into the pre-trained ResNet model for multi-class prediction of six ECG abnormalities, outputting probability scores for each abnormality (sigmoid activation).
  3. Explainability Analysis: The iNNvestigate toolkit was used to implement two main XAI methods:
    • Integrated Gradients (IG): Assigns an attribution score to each sample point by integrating gradients along a straight-line path from a baseline to the input (the standard formula is given after this list).
    • Layer-wise Relevance Propagation (LRP): Decomposes the output prediction score into relevance scores on input dimensions for more granular model interpretation.
  4. Three-layer Quantitative Analysis Workflow:
    • Overall Relevance Score Statistics: Summarizes relevance score distributions for each diagnostic category (normal, AF, LBBB) to analyze the model’s sensitivity to abnormal signals.
    • Lead-wise Relevance Score Statistics: Comparing relevance scores by lead to identify key leads the model focuses on in different diagnostic categories.
    • Beat-wise Temporal Relevance Analysis: The “average beat” approach segments each record into individual heartbeats, aligns them, and analyzes the model’s focus on each segment of the cardiac cycle (e.g., P wave, QRS complex, T wave), revealing how closely the model matches clinical diagnostic criteria (see the average-beat sketch after this list).
  5. Visualization Evaluation Workflow: Relevance scores are normalized to [-1, 1] and visualized as heatmaps, scatter plots, and similar formats, then presented to experts and clinicians for feedback and refinement.
  6. Experimental Comparison and Generalization Testing:
    • Comparison of IG against established LRP variants (ε-LRP, αβ-LRP, ω²-LRP);
    • Workflow reproduction using PTB-XL data to verify cross-dataset applicability.
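
As a concrete illustration of step 1 (and the inference in step 2), the sketch below standardizes raw records and runs the pre-trained model. It assumes NumPy and SciPy are available; the names (preprocess, records, model) are illustrative, not the authors’ code.

```python
import numpy as np
from scipy.signal import resample

TARGET_FS = 400      # Hz, target sampling rate from the paper
TARGET_LEN = 4096    # samples per lead after trimming / zero-padding

def preprocess(signal: np.ndarray, fs: int) -> np.ndarray:
    """Resample a (length, 12) ECG to 400 Hz, then trim or zero-pad to 4096 samples."""
    n_out = int(round(signal.shape[0] * TARGET_FS / fs))
    sig = resample(signal, n_out, axis=0)           # Fourier resampling of every lead
    out = np.zeros((TARGET_LEN, 12), dtype=np.float32)
    n = min(TARGET_LEN, sig.shape[0])
    out[:n] = sig[:n]                               # trim if too long, pad if too short
    return out

# records: list of (raw_signal, sampling_rate) pairs -> (n, 4096, 12) input tensor
# X = np.stack([preprocess(sig, fs) for sig, fs in records])
# probs = model.predict(X)   # per-record sigmoid scores for the six abnormalities
```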
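
For reference, the standard Integrated Gradients attribution for input feature x_i relative to a baseline x′ (the general formula of Sundararajan et al., not a paper-specific variant) is:

```latex
\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1
  \frac{\partial F\bigl(x' + \alpha \, (x - x')\bigr)}{\partial x_i} \, d\alpha
```

where F is the model’s output score for the class of interest, so each sample point in each lead receives its own attribution score.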
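
For the beat-wise analysis in step 4, a minimal sketch of the “average beat” idea is shown below, assuming single-lead signal and relevance arrays and using a crude R-peak detector from scipy.signal as a stand-in; the paper’s actual segmentation and alignment procedure may differ.

```python
import numpy as np
from scipy.signal import find_peaks

def average_beat(signal: np.ndarray, relevance: np.ndarray,
                 fs: int = 400, window_s: float = 0.6):
    """Align beats on detected R peaks and average signal and relevance sample-wise.

    signal, relevance: 1-D arrays for one lead of one record.
    window_s: length (in seconds) of the window centered on each R peak.
    """
    half = int(window_s * fs / 2)
    # crude R-peak detection: prominent maxima at physiological spacing (> 0.3 s)
    peaks, _ = find_peaks(signal, distance=int(0.3 * fs),
                          prominence=0.5 * float(signal.std()))
    beats_sig, beats_rel = [], []
    for p in peaks:
        if half <= p < len(signal) - half:          # keep only complete windows
            beats_sig.append(signal[p - half:p + half])
            beats_rel.append(relevance[p - half:p + half])
    # mean over aligned beats -> one "average beat" trace and its relevance profile
    return np.mean(beats_sig, axis=0), np.mean(beats_rel, axis=0)
```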

2. Major Technical Innovations and Original Methods

The main features of this research:

  • Innovatively proposes a “multi-level quantitative relevance analysis” workflow that systematically examines connections between model-learned features and actual diagnostic criteria, from the overall signal down to individual leads and beat cycles.
  • Integrates multiple XAI methods, scrutinizing the strengths and differences of various attribution algorithms for explaining medical decisions.
  • Provides comprehensive visualization solutions as practical tools for clinicians to rapidly interpret AI models.
  • Validates the commonality and robustness of the decision mechanisms across databases.

4. Main Experimental Results and Process Analysis

1. Overall Relevance Score Distribution

Analyses showed that the vast majority of ECG sample points had relevance scores close to zero, consistent with clinical expectations (baseline segments between waves typically carry no diagnostic information). The AF and LBBB groups had slightly wider relevance score distributions than normal ECGs, skewed towards positive values: LBBB showed far more relevance scores in the [0.0, 0.10] range than normal, while AF scores were more dispersed at both the positive and negative ends, indicating stronger model sensitivity and selectivity towards abnormal signals.

Single-record analyses showed that the mean relevance score (m_n) increased in tandem with the model’s predicted abnormality probability (c_n). Classification results correlated strongly with mean relevance scores, and misclassifications often fell near the decision threshold or had mean relevance close to zero, suggesting room for optimizing model thresholds. A small sketch of this analysis follows.
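
A minimal sketch of this single-record analysis, assuming R holds the per-record relevance maps and probs the model’s sigmoid outputs; the Pearson correlation here is one plausible way to quantify the reported m_n ~ c_n relationship, not necessarily the authors’ statistic.

```python
import numpy as np
from scipy.stats import pearsonr

def mean_relevance_vs_probability(R: np.ndarray, probs: np.ndarray):
    """Relate per-record mean relevance m_n to predicted probability c_n.

    R: relevance maps of shape (n, 4096, 12); probs: sigmoid scores of shape (n,).
    """
    m = R.reshape(R.shape[0], -1).mean(axis=1)   # m_n: mean over all samples and leads
    r, p = pearsonr(m, probs)                    # one plausible summary statistic
    return m, r, p
```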

2. Lead-wise Relevance Score Analysis

Compared across leads, abnormal records had significantly higher relevance scores than the normal group, especially in lead V1. In AF classification, V1 showed the most pronounced difference, indicating that the model had learned the clinical importance of V1 for AF diagnosis (e.g., high-frequency fibrillatory waves and the absence of P waves). In LBBB classification, left-sided leads (e.g., aVL, V5, V6) were prominent, consistent with clinical lead selection for LBBB. Wilcoxon rank-sum tests showed that relevance score distributions differed significantly across all leads.
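
A sketch of the lead-wise comparison, assuming R_abn and R_norm are relevance maps for the abnormal and normal groups; scipy.stats.ranksums implements the Wilcoxon rank-sum test mentioned above.

```python
import numpy as np
from scipy.stats import ranksums

LEADS = ["I", "II", "III", "aVR", "aVL", "aVF",
         "V1", "V2", "V3", "V4", "V5", "V6"]

def leadwise_ranksum(R_abn: np.ndarray, R_norm: np.ndarray):
    """Wilcoxon rank-sum test per lead between abnormal and normal relevance scores.

    R_abn, R_norm: relevance maps of shape (n_records, 4096, 12).
    """
    for i, lead in enumerate(LEADS):
        stat, p = ranksums(R_abn[:, :, i].ravel(), R_norm[:, :, i].ravel())
        print(f"{lead:>3}: statistic = {stat:+.2f}, p = {p:.2e}")
```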

3. Beat-wise Cycle Relevance Analysis

The “average beat” analysis revealed that for both normal and abnormal categories, the model allocated positive relevance scores mainly to QRS complexes, while the scores on P and T waves clearly reflected the model’s learning of diagnostic criteria:

  • In AF classification, the QRS complex, particularly the R peak, was the primary concentration area for relevance. In normal records, the P wave area showed high negative relevance, indicating that the model could identify the presence of a P wave as a “counter-evidence” for AF.
  • In LBBB classification, broad, irregular QRS complexes, the ST segment, and T wave polarity inversion were key: the T wave carried distinctly negative relevance in the normal group and strongly positive relevance in the abnormal group, underscoring the importance of abnormal wave morphology. Relevance scores concentrated on abnormal beats whose waveforms closely resembled clinically typical LBBB patterns.

4. Visualization and Expert Assessment

Normalized heatmap visualizations (a minimal plotting sketch follows this list) revealed to experts:

  • In LBBB classification, a focus on negative S waves in V1, prolonged ST segments, and broad R waves;
  • In AF classification, a focus on R waves and regions of missing P waves, with some relevance placed on suspected pseudo-P waves;
  • When samples contained signal artifacts (baseline drift, noise, lead-off), relevance scores tended to cluster on the artifacts and misclassification became more likely, confirming the model’s dependence on signal quality.
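
A minimal plotting sketch of such a visualization, normalizing relevance to [-1, 1] and coloring the trace accordingly; the colormap and layout are illustrative choices, not the paper’s exact figures.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_relevance(signal: np.ndarray, relevance: np.ndarray,
                   fs: int = 400, lead: str = "V1"):
    """Plot one lead with sample points colored by relevance, normalized to [-1, 1]."""
    rel = relevance / (np.abs(relevance).max() + 1e-12)       # normalize to [-1, 1]
    t = np.arange(len(signal)) / fs
    plt.figure(figsize=(10, 2.5))
    plt.plot(t, signal, color="lightgray", lw=0.8, zorder=1)  # trace for context
    sc = plt.scatter(t, signal, c=rel, cmap="coolwarm",
                     vmin=-1, vmax=1, s=4, zorder=2)
    plt.colorbar(sc, label="relevance")
    plt.xlabel("time (s)")
    plt.ylabel(f"lead {lead}")
    plt.tight_layout()
    plt.show()
```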

5. Database and Algorithm Generalization Analysis

Replicating the experiments on PTB-XL yielded highly consistent results, indicating strong cross-database generalizability. Relevance scores for LBBB remained highly concentrated on abnormal waveform regions; label granularity influenced the distributions, suggesting that textbook-style learning could be validated further on more fine-grained labels.

The choice of XAI method significantly influenced relevance score distributions. For example, ε-LRP and αβ-LRP focused more on R peaks, while ω²-LRP paid increased attention to non-R waves and artifacts. The IG method offered the best interpretability and focus, suggesting that attribution frameworks should be chosen flexibly for real clinical scenarios; a sketch of such a method comparison follows.
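
A sketch of such a method comparison using the iNNvestigate API; the analyzer keys below are those registered in common iNNvestigate versions, but exact names and the paper’s parameter settings may differ from what is shown here.

```python
import innvestigate  # XAI toolkit used in the paper (Keras/TensorFlow models)

# Analyzer keys as registered in common iNNvestigate versions (assumption).
METHODS = {
    "IG":      "integrated_gradients",
    "eps-LRP": "lrp.epsilon",
    "ab-LRP":  "lrp.alpha_2_beta_1",
    "w2-LRP":  "lrp.w_square",
}

def compare_attributions(model, X):
    """Compute one relevance map per attribution method for inputs X (n, 4096, 12)."""
    maps = {}
    for name, key in METHODS.items():
        analyzer = innvestigate.create_analyzer(key, model)
        maps[name] = analyzer.analyze(X)   # same shape as X: one score per sample point
    return maps
```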

5. Conclusions and Scientific Value

In summary, this study systematically demonstrates that a pre-trained deep learning model for automatic 12-lead ECG diagnosis can learn multiple diagnostic features consistent with clinical textbook criteria. For example, the model treated clear P waves as “counter-evidence” against AF, recognized broad, deformed QRS complexes and T wave direction as manifestations of LBBB, and weighted leads according to their diagnostic relevance. These findings strongly support the safety and reliability of AI-assisted diagnosis.

The multi-level quantitative relevance analysis and visualization methods proposed here can directly display the model’s decision logic to clinicians, helping them judge the plausibility of AI diagnoses and reduce erroneous decisions. This provides significant impetus for future clinical AI “auxiliary interpretation” tools and lays a solid foundation for the practical deployment of AI systems. The study also found that the model is prone to relevance score drift and misclassification under signal artifacts, suggesting the future development of signal quality detection and anomaly warning features based on relevance analysis.

6. Research Highlights and Innovative Contributions

  1. Multi-level Intuitive Explanation Workflow: Pioneering division of XAI analysis into overall, lead, and beat-cycle levels, greatly improving diagnostic transparency.
  2. Deep Integration of Clinical and AI Standards: Systematic validation of deep learning models’ spontaneous acquisition of key ECG diagnostic features and lead selection, enhancing trustworthiness of medical AI.
  3. Multi-algorithm Cross-validation: Comparative analysis of various XAI attribution methods, clarifying their respective strengths and providing theoretical basis for clinical application.
  4. Visualization Supporting Clinical Decision-making: Heatmaps, scatter plots and other visualization methods expand clinicians’ understanding of AI decisions, progressing toward “white-box” AI medicine.
  5. High Generalizability Across Databases: Consistent results across databases, indicating robustness to differences in equipment and patient populations.

7. Limitations and Future Prospects

  • Analyses based on Integrated Gradients (IG) have limited capacity for explaining time-related phenomena (such as RR interval changes from arrhythmias), with AF (a temporal abnormality) remaining less well interpreted, necessitating further integration of temporal attribution algorithms.
  • Use of public databases may entail selection bias; future work should incorporate real clinical data from emergency and inpatient sources for broader applicability.
  • Automated artifact detection and error correction functions have yet to be systematically developed; future research combining relevance score time-series analysis is promising for improving AI system robustness and safety.

In the future, the team plans to build interactive clinical AI interpretation tools on these findings, enabling visual review of model logic and an additional layer of safety in AI-augmented diagnosis, and accelerating the adoption of automated ECG diagnosis in clinical practice.

8. Other Valuable Information

All source code of this study is publicly available on GitLab (https://gitlab.gwdg.de/medinfpub/biosignal-processing-group/xai-ecg, commit #aed722d8), and complete PTB-XL analysis results and supplementary videos are provided with the supplementary material, facilitating reproduction and further research by academic peers.

9. Summary and Academic Significance

This research demonstrates the application prospects of explainable AI methods in automated ECG diagnosis, offering clinicians practical tools to open the AI “black box” and removing a main obstacle to the safe adoption of AI medical technology. The proposed multi-level analysis and visualization workflow significantly advances transparency in medical AI decision-making and holds substantial value for enhancing patient safety, reducing misdiagnosis risk, and improving clinical diagnostic efficiency.