Estimation and Conformity Evaluation of Multi-Class Counterfactual Explanations for Chronic Disease Prevention
1. Academic Background and Research Motivation
In recent years, Artificial Intelligence (AI) has made tremendous progress in healthcare. From initial uses in assisted diagnosis and risk prediction to the recommendation of personalized intervention plans, AI has become a critical tool for improving the quality and efficiency of medical services. However, AI in clinical practice still faces many challenges, one of the most prominent being model explainability and trustworthiness. When AI systems are employed as Clinical Decision Support Systems (CDSS), both clinicians and patients urgently want to understand how the AI reaches its conclusions and whether those conclusions align with established medical knowledge, rather than accepting black-box outputs. This lack of transparency limits the adoption of AI tools and erodes physician trust and acceptance, which in turn affects patient safety and health outcomes.
To address this shortcoming, Explainable AI (XAI) technologies have emerged. XAI seeks to retain AI's powerful capabilities while enhancing the interpretability of its decision-making process, making automated, data-driven decision pathways transparent, reliable, and acceptable to medical experts. Among XAI techniques, counterfactual explanations have attracted particular attention. Their core idea is to answer the question "how would the model output change if the input data changed?" In other words, they provide a "what-if" scenario that helps doctors understand the model's reasoning and potential intervention directions. In medical contexts, counterfactual explanations can be used to develop personalized risk intervention strategies for individual patients, revealing which changes in variables, such as blood pressure, glucose levels, weight, or other biomarkers, directly affect disease risk or diagnostic results.
Although counterfactual explanations fit clinical needs well in theory, their practical application and evaluation still face numerous challenges. For example, how can we ensure that a counterfactual explanation is both close enough to the original data to be feasible and representative enough of the target category to be useful? How can reliable, high-quality explanations be systematically quantified and filtered? How can counterfactual explanations be generated efficiently and controllably in complex multi-class settings, such as disease risk stratification? This study addresses these gaps by proposing new methods and applying them to personalized prevention of cardiovascular risk in patients with Chronic Obstructive Pulmonary Disease (COPD), aiming to develop more rigorous and trustworthy explainability mechanisms for clinical decision support systems.
2. Source of the Paper and Author Information
This research paper is titled "Estimation and Conformity Evaluation of Multi-Class Counterfactual Explanations for Chronic Disease Prevention." It was published in September 2025 in the IEEE Journal of Biomedical and Health Informatics. The author team spans multiple countries and research institutions, with core members including Marta Lenatti (corresponding author), Alberto Carlevaro, Aziz Guergachi, Karim Keshavjee, Maurizio Mongelli, and Alessia Paglialonga. Major research organizations include the Italian CNR-Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni, the University of Genoa, the Ted Rogers School of Management and Ted Rogers School of Information Technology Management (Canada), the Institute of Health Policy, Management and Evaluation at the University of Toronto, and York University. The project received funding from the European Union, the Italian Ministry of Universities and Research (MUR), and several national research programs and AI innovation ecosystems.
3. Detailed Research Process
1. Dataset Extraction and Preprocessing
Subjects and Sample Size:
The research team screened de-identified electronic health records from the Canadian Primary Care Sentinel Surveillance Network (CPCSSN) database, collected between 2000 and 2015 and covering patients over 20 years old diagnosed with COPD. After rigorous selection and data cleaning, the final dataset comprises 9,613 records with no missing values, down from the original 37,504 cases, reflecting strict quality control.
Feature Settings:
Each data entry includes primary biomarkers collected within six months before the COPD diagnosis date, including age at onset, sex assigned at birth, body mass index (BMI), systolic and diastolic blood pressure (SBP/DBP), fasting blood sugar (FBS), low-density lipoprotein cholesterol (LDL), high-density lipoprotein cholesterol (HDL), triglycerides (TG), total cholesterol (TOTCHOL), smoking history (current, ex-smoker, never smoked), and comorbid hypertension or diabetes (diagnosed within six months prior to COPD). The study specifically categorizes feature modifiability: modifiable (e.g., BMI, blood pressure), partially modifiable (e.g., smoking status), and non-modifiable (e.g., age, medical history).
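This three-way categorization lends itself to a simple machine-readable encoding that a counterfactual generator can consume. Below is a minimal sketch; the feature identifiers are illustrative, not taken from the study's code.

```python
# Hypothetical encoding of feature modifiability, used to constrain which
# variables a counterfactual search may change. Identifiers are illustrative.
MODIFIABILITY = {
    "BMI": "modifiable",
    "sBP": "modifiable",
    "dBP": "modifiable",
    "FBS": "modifiable",
    "LDL": "modifiable",
    "HDL": "modifiable",
    "TG": "modifiable",
    "TOTCHOL": "modifiable",
    "smoking_status": "partially_modifiable",  # e.g., current -> ex-smoker only
    "age_at_onset": "non_modifiable",
    "sex_at_birth": "non_modifiable",
    "hypertension": "non_modifiable",
    "diabetes": "non_modifiable",
}

# Only these features may be varied when generating counterfactuals.
MUTABLE_FEATURES = [f for f, m in MODIFIABILITY.items()
                    if m != "non_modifiable"]
```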
Output Variable Setting:
The Framingham Risk Score (FRS) was used to assess ten-year cardiovascular risk, divided per Canadian Cardiovascular Society guidelines into three categories: low risk (<10%, 3,944 cases), moderate risk (10%-19%, 3,274 cases), and high risk (≥20%, 2,395 cases). This output serves as the basis for counterfactual explanations and personalized intervention recommendations.
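The stratification rule is a simple threshold map and can be stated directly as code; a minimal sketch:

```python
def frs_risk_class(frs_percent: float) -> str:
    """Map a 10-year Framingham Risk Score (in percent) to the three
    risk strata used as the model's output classes."""
    if frs_percent < 10:
        return "low"       # 3,944 records
    elif frs_percent < 20:
        return "moderate"  # 3,274 records
    else:
        return "high"      # 2,395 records
```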
2. Multi-Class Classification Model Construction and Optimization
Main Algorithm:
The study uses Multi-Class Support Vector Data Description (MC-SVDD) as the primary classifier. The algorithm encloses each class's data within a minimal hypersphere in a kernel-induced feature space, making it well suited to outlier detection and multi-class discrimination. To deal with the inevitable classification errors in real-world medical data, the study introduces False Positive Rate (FPR) control in a one-vs-all scheme: for each class, a One-Class SVDD is retrained iteratively until its false positive rate falls below a predefined threshold (e.g., 0.1) or a maximum number of iterations (e.g., 1,000) is reached.
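A hedged sketch of that one-vs-all FPR-control loop follows, using scikit-learn's OneClassSVM as a stand-in for the One-Class SVDD (with an RBF kernel the two formulations are closely related). The nu-tightening rule is an assumption; the paper's exact update scheme is not restated here.

```python
# Sketch of FPR-controlled one-class training for a single risk class.
import numpy as np
from sklearn.svm import OneClassSVM

def fit_class_with_fpr_control(X_pos, X_neg, fpr_max=0.1, max_iter=1000):
    """Train a one-class model on this class's samples (X_pos), shrinking
    its boundary until the false positive rate measured on the other
    classes' samples (X_neg) drops below fpr_max or max_iter is reached."""
    nu = 0.01
    model = None
    for _ in range(max_iter):
        model = OneClassSVM(kernel="rbf", nu=nu, gamma="scale").fit(X_pos)
        # Negatives predicted as +1 (inside the boundary) are false positives.
        fpr = np.mean(model.predict(X_neg) == 1)
        if fpr <= fpr_max:
            break
        nu = min(nu * 1.2, 0.5)  # tighten the enclosing boundary (capped)
    return model
```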
Model Surrogacy and Validation:
Because some counterfactual algorithms (such as DICE) are incompatible with MC-SVDD, a surrogate Support Vector Machine (SVM) is trained to reproduce the input-output behavior of MC-SVDD, enabling cross-method comparison; agreement between the two models is validated with Cohen's kappa (0.89).
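The agreement check itself is a one-liner; a minimal sketch with placeholder predictions:

```python
# Consistency check between MC-SVDD and surrogate-SVM predictions on the
# same samples. The label arrays here are illustrative placeholders.
import numpy as np
from sklearn.metrics import cohen_kappa_score

y_mcsvdd = np.array([0, 1, 2, 2, 1, 0])  # placeholder MC-SVDD predictions
y_svm    = np.array([0, 1, 2, 1, 1, 0])  # placeholder surrogate predictions
kappa = cohen_kappa_score(y_mcsvdd, y_svm)  # paper reports 0.89
print(f"Cohen's kappa: {kappa:.2f}")
```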
Training and Testing:
The dataset is split into training and test sets at a 7:3 ratio, with max scaling used for normalization. MC-SVDD hyperparameters are determined via 3-fold cross-validation and grid search; the surrogate SVM is optimized in the same way. Both models achieve high precision and low false-negative rates on both sets. In particular, after the introduction of FPR control, the classifier prefers to abstain from uncertain classifications, which enhances clinical reliability.
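A minimal sketch of that protocol (70/30 split, max scaling, 3-fold grid search) for the surrogate SVM; the hyperparameter grid and the synthetic placeholder data are assumptions, not the paper's search space.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

# Placeholder data standing in for the 9,613-record feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))
y = rng.integers(0, 3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Max scaling followed by an SVM, tuned with 3-fold grid search.
pipe = make_pipeline(MaxAbsScaler(), SVC())
grid = GridSearchCV(pipe,
                    param_grid={"svc__C": [0.1, 1, 10],
                                "svc__gamma": ["scale", 0.1, 1]},
                    cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```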
3. Counterfactual Explanation Generation Algorithms
General Idea:
Using patients in the high cardiovascular risk category as starting points (682 of the 690 high-risk cases in the test set), the study attempts to generate two counterfactual explanations for each factual sample: one transitioning to moderate risk and one to low risk, each corresponding to a new combination of physiological parameters.
Method Comparison and Novel Algorithms:
Two mainstream counterfactual explanation generation strategies are adopted:
MUCH (Multi Counterfactuals via Halton Sampling): Relies on quasi-random Halton-sequence sampling within the target class region, optimizing for minimum distance to the factual sample. Constraints ensure the new sample lies just inside the target class boundary and away from other class boundaries. MUCH offers strong controllability, converges readily, and integrates naturally with MC-SVDD (a simplified sketch follows after this list).
DICE (Diverse Counterfactual Explanations): Uses a heuristic genetic algorithm to jointly optimize diversity and proximity, and supports mixed-type features. For a fair comparison with MUCH, DICE is configured to generate a single counterfactual explanation per factual sample and target class. Because of its heuristic search, DICE may settle in local optima and does not guarantee convergence in complex scenarios.
Both methods strictly limit the range of variable changes, especially for clinically irreversible features (e.g., a patient can never become a "never smoker"); counterfactual explanations allow only realistic adjustments (e.g., switching from current smoker to ex-smoker), and medically relevant thresholds (such as maximum BMI or cholesterol levels) are enforced, as in the sketches below.
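To make the two strategies concrete, here are two minimal sketches; both are simplifications under stated assumptions, not the published implementations. The first mimics the MUCH idea using SciPy's Halton sampler: draw quasi-random candidates within the feature bounds, keep those the classifier assigns to the target class, and return the candidate closest to the factual sample (the boundary-margin constraints of the full algorithm are omitted).

```python
# Simplified MUCH-style search: Halton sampling plus minimum-distance
# selection. `predict_fn` is any vectorized classifier (e.g., the MC-SVDD
# or its surrogate); `lower`/`upper` are per-feature bounds.
import numpy as np
from scipy.stats import qmc

def much_like_counterfactual(x_fact, predict_fn, target_class,
                             lower, upper, n_samples=4096, seed=0):
    sampler = qmc.Halton(d=len(x_fact), seed=seed)
    candidates = qmc.scale(sampler.random(n_samples), lower, upper)
    in_target = candidates[predict_fn(candidates) == target_class]
    if len(in_target) == 0:
        return None  # generation failed; this lowers "availability"
    dists = np.linalg.norm(in_target - x_fact, axis=1)
    return in_target[np.argmin(dists)]
```

The second sketch shows a constrained counterfactual query with the open-source dice-ml package, reflecting the restrictions just described. The data frame `train_df`, the surrogate classifier `svm_pipeline`, the one-row factual `factual_df`, and the column names and permitted ranges are illustrative assumptions, not values from the paper.

```python
import dice_ml

# Wrap the training data and the surrogate model for DiCE.
data = dice_ml.Data(dataframe=train_df,
                    continuous_features=["BMI", "sBP", "dBP", "HDL"],
                    outcome_name="risk_class")
model = dice_ml.Model(model=svm_pipeline, backend="sklearn")
explainer = dice_ml.Dice(data, model, method="genetic")

cf = explainer.generate_counterfactuals(
    factual_df,                      # one high-risk patient record
    total_CFs=1,                     # one explanation per target class
    desired_class=0,                 # e.g., the low-risk class
    features_to_vary=["BMI", "sBP", "dBP", "HDL", "smoking_status"],
    permitted_range={"BMI": [18.5, 40.0], "sBP": [90.0, 200.0]},
)
cf.visualize_as_dataframe(show_only_changes=True)
```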
4. Counterfactual Explanation Quality Evaluation and Conformity Assessment
Evaluation Metrics and Statistical Tests:
- Availability: Success rate of generation
- Discriminative Power: Accuracy in distinguishing explanation samples from original category samples
- Proximity: Distance from the original factual sample (closer is better)
- Sparsity: Average number of features changed
- Implausibility: Degree of deviation from the average of the target category (lower is better)
- Diversity: Inter-explanation variation
All indicators are tested for statistical significance using Wilcoxon signed-rank and Mann-Whitney U tests, with Bonferroni correction applied. (Minimal sketches of three of these metrics follow below.)
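For reference, here are minimal sketches of proximity, sparsity, and implausibility under their common definitions; the paper's exact formulations may normalize or weight features differently.

```python
import numpy as np

def proximity(x_fact, x_cf):
    """L2 distance between factual and counterfactual (lower is better)."""
    return np.linalg.norm(x_fact - x_cf)

def sparsity(x_fact, x_cf, tol=1e-9):
    """Number of features the counterfactual changes."""
    return int(np.sum(np.abs(x_fact - x_cf) > tol))

def implausibility(x_cf, X_target):
    """Distance of the counterfactual from the target-class centroid
    (lower means more typical of the target class)."""
    return np.linalg.norm(x_cf - X_target.mean(axis=0))
```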
Counterfactual Conformity Assessment:
A pioneering “Counterfactual Conformity” metric is introduced, inspired by Conformal Prediction (CP), to quantitatively assess explanation quality:
- Leverages a mixed distance metric (combining Hamming and cosine distances), considering the counterfactual’s proximity to the factual and its plausibility in relation to the target class center.
- A threshold ε (e.g., 0.1) determines whether each explanation meets the high-confidence standard. If a sample's explanations satisfy the threshold for all target classes, the result is a "fully conformal" counterfactual; if for only some, "partially conformal"; if for none, "non-conformal."
- Calibrates the scoring function on the test set, enabling explanation filtering and a quantitative reliability estimate for each explanation (see the sketch below).
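The following is a hedged sketch of a conformity-style score along the lines described: a mixed Hamming/cosine distance evaluated both between factual and counterfactual and between counterfactual and target-class centroid, compared against the threshold ε. The equal weighting of the two distance components and the use of the maximum over the two comparisons are assumptions; the paper calibrates its scoring function on the test set.

```python
# Conformity-style score on numpy arrays; cat_idx/num_idx index the
# categorical and numeric feature positions.
import numpy as np
from scipy.spatial.distance import cosine, hamming

def mixed_distance(a, b, cat_idx, num_idx):
    """Hamming on categorical features, cosine on numeric features."""
    d_cat = hamming(a[cat_idx], b[cat_idx]) if len(cat_idx) else 0.0
    d_num = cosine(a[num_idx], b[num_idx]) if len(num_idx) else 0.0
    return 0.5 * d_cat + 0.5 * d_num  # equal weighting assumed

def is_conformal(x_fact, x_cf, target_centroid, cat_idx, num_idx, eps=0.1):
    """True if the explanation is both close to the factual and plausible
    relative to the target-class center, per the threshold eps."""
    score = max(mixed_distance(x_fact, x_cf, cat_idx, num_idx),
                mixed_distance(x_cf, target_centroid, cat_idx, num_idx))
    return score <= eps  # conformal for this target class
```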
4. Analysis of Key Research Findings
1. Classifier Performance
- After FPR control, MC-SVDD achieves 85.6% accuracy on the training set, with 10% of points left unclassified (i.e., fewer misclassifications at the cost of more abstentions on uncertain cases). Per-class sensitivities reached 88.2% (low risk), 75.0% (moderate risk), and 95.9% (high risk). Test-set performance was slightly lower but remained acceptable.
- The SVM surrogate model aligns closely with MC-SVDD predictions, achieving 96.9%/92.6% accuracy on the training/test sets and a Cohen's kappa of 0.89.
2. Counterfactual Explanation Generation and Quality Comparison
- MUCH explanations have an average availability of 84.6%, while DICE reaches 98.2%. Both show high discriminative power (MUCH is better), with MUCH slightly superior in implausibility and diversity, and DICE better in proximity and sparsity.
- For transitions from high to moderate risk, the magnitude of suggested changes by MUCH and DICE differs, with statistically significant differences in some variables (such as systolic blood pressure, blood lipids).
- Counterfactual conformity assessment filters out unrealistic explanations; the remaining conformal explanations outperform both the unscreened and the non-conformal explanations on all metrics (proximity, implausibility, sparsity), and their suggested changes stay within clinically feasible ranges (e.g., BMI and blood pressure changes do not exceed realistic limits).
3. Personalized Risk Intervention Recommendations and Medical Significance
- High-conformity counterfactual explanations generated by MUCH and DICE have suggested changes (e.g., lowering systolic blood pressure, optimizing BMI, increasing HDL, or quitting smoking) that align with established medical knowledge and can assist clinicians in formulating specific, actionable personalized intervention plans.
- For patients with comorbidities (e.g., hypertension or diabetes), the magnitude of recommended changes is significantly greater (such as larger blood pressure reductions for hypertensive patients), indicating the model’s ability to capture the impact of actual health status on intervention targets.
5. Conclusions, Academic and Practical Value
This study demonstrates a complete counterfactual explanation pipeline for multi-class medical risk stratification, successfully validated in the context of cardiovascular risk prevention in COPD patients and encompassing data extraction, model training, explanation generation, and screening evaluation in a methodologically rigorous process. Key breakthroughs include:
- The creation of a counterfactual conformity evaluation standard, enabling CDSS to not only explain AI inference processes but also automatically select trustworthy and realistic personalized intervention recommendations.
- The integration of multi-class classification algorithms (MC-SVDD) and optimized generation methods (MUCH/DICE), which enhance both clinical applicability and explanatory diversity.
- The provision of clinically relevant personalized intervention advice, validated on a large-scale dataset, offering considerable practical value.
- The method is extendable to other chronic disease risk predictions, supporting clinical AI-assisted interventions in remote or real-time health management, and improving population health management efficiency.
6. Research Highlights and Future Prospects
- New Methods and Metrics: The application of MUCH counterfactual explanations and the conformity evaluation metric in real-world medical scenarios markedly improves explanation credibility and usability.
- Data Quality and Experimental Design: Supported by large-scale, high-quality health databases and strict variable standardization and realistic medical constraints, the results are robust and reliable.
- Flexibility and Portability: The model framework and explanation mechanism apply to various categories and diseases and are readily integrable into clinical CDSS systems.
- Future Directions: Plans include dynamic embedding of expert knowledge, optimization of metric thresholds, and cross-disease generalization, advancing the practical application of medical XAI from theory to implementation.
7. Other Valuable Information
- The research implementation code, parts of the data, and related tools have been open-sourced, facilitating academic and industry replication, validation, and advanced applications.
- The theories and processes proposed in the paper provide technological support for chronic disease management based on Electronic Health Records (EHR), intelligent preventive medicine, and robust AI modeling.
- The research team maintains international, multidisciplinary collaboration, exemplifying scientific practice in large-scale AI medical model development, evaluation, and deployment.
Through systematic technical innovation and rigorous scholarship, this study provides breakthrough tools and new ideas for AI-driven personalized chronic disease prevention, marking a step toward trustworthy, effective, and practical explainable AI in medicine.