Consensus Statement on the Credibility Assessment of Machine Learning Predictors
1. Background: Machine Learning in Medicine and the Challenge of Credibility
In recent years, the rapid development of Artificial Intelligence (AI) and Machine Learning (ML) technologies has profoundly transformed healthcare. In in silico medicine in particular, machine learning predictors have become vital tools for estimating physiological and pathological quantities that are difficult to measure directly, supporting tasks such as disease risk assessment and treatment response prediction. However, as machine learning exerts increasing influence on clinical decision-making, unprecedentedly high standards are being set for the credibility of its predictions. Put simply, ensuring that ML models are accurate and reliable in real-world medical applications has become a core scientific issue that academia and industry urgently need to resolve.
Unlike traditional biophysical models (also known as “first-principles models”), ML predictors are data-driven, have more of a “black-box” internal mechanism, and are deeply affected by the quality and representativeness of their training data, potentially hiding issues such as bias and overfitting. Moreover, ML models often make predictions based on statistical correlations within the data rather than causal knowledge, which further increases the risks of extrapolating to new scenarios. There is an urgent need for a cross-disciplinary, specialized theoretical and methodological framework to systematically and robustly evaluate the credibility of these ML predictors, so that they can be recognized by regulatory bodies (such as the FDA) and adopted in clinical practice.
2. Paper Source and Author Introduction
This paper, entitled “Consensus statement on the credibility assessment of machine learning predictors,” is published as a position article in the authoritative journal Briefings in Bioinformatics (2025, Volume 26, Issue 2, bbaf100). It is co-authored by Alessandra Aldieri, Thiranja Prasad Babarenda Gamage, Antonino Amedeo La Mattina, Axel Loewe, Francesco Pappalardo, and Marco Viceconti, scholars deeply engaged in in silico medicine, data science, clinical practice, and regulatory science from renowned institutions including Politecnico di Torino, the Auckland Bioengineering Institute, Huashan Hospital of Fudan University, Karlsruhe Institute of Technology, and the University of Catania. This consensus document represents the opinions of a wide spectrum of experts within the global In Silico World community of practice: more than 35 experts participated in building the consensus. The paper aims to establish both theoretical and operational standards for the credibility assessment of ML predictors, providing standardized guidance for academics, developers, and regulators.
3. Detailed Content and Main Viewpoints
This paper is not a single original experimental study, but rather a set of twelve theoretical and operational consensus statements on the credibility assessment of machine learning prediction models, formed through systematic discussion among domain experts; it provides a foundational framework and methodological advance for the entire field. Below is an in-depth analysis of its main contents and theoretical viewpoints.
1. Clarifying the Research Scope and Conceptual System
The paper first clarifies its core conceptual framework. The system of interest (SI) refers to an entity with spatiotemporal changes and complex interactions (e.g., the human body). The quantity of interest (QI) is usually hard to measure directly, necessitating inference based on other easily measurable correlated quantities (collectively termed ω).
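To make these definitions concrete, the sketch below (my own illustration, not taken from the paper) shows what an ML predictor means in this setting: a data-driven model that estimates a hard-to-measure QI from a vector of easily measurable correlated quantities ω. The synthetic data and the choice of scikit-learn's RandomForestRegressor are assumptions made purely for illustration.

```python
# Minimal illustrative sketch (not from the paper): an ML predictor that
# estimates a hard-to-measure quantity of interest (QI) from easily
# measurable correlated quantities (collectively, omega).
# All data here is synthetic; the model choice is an assumption.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# omega: easily measurable quantities (e.g., demographics, imaging-derived features)
n_samples, n_features = 500, 6
omega = rng.normal(size=(n_samples, n_features))

# qi: the quantity of interest, here synthesized as a noisy function of omega
qi = 2.0 * omega[:, 0] - 0.5 * omega[:, 1] ** 2 + rng.normal(scale=0.1, size=n_samples)

omega_train, omega_test, qi_train, qi_test = train_test_split(
    omega, qi, test_size=0.3, random_state=0
)

predictor = RandomForestRegressor(n_estimators=200, random_state=0)
predictor.fit(omega_train, qi_train)      # learn implicit (data-driven) knowledge
qi_pred = predictor.predict(omega_test)   # estimate the QI for unseen states of the SI

print("mean absolute error:", np.mean(np.abs(qi_pred - qi_test)))
```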
The paper adopts the Data–Information–Knowledge–Wisdom (DIKW) hierarchy, emphasizing:
- “Data”: the original recorded values obtained by observing the system, including quantitative and categorical data.
- “Information”: data annotated with metadata (e.g., who, when, where) that provides context.
- “Knowledge”: causal hypotheses established among pieces of information, which can be used to predict new results.
- “Wisdom”: knowledge that has withstood repeated attempts at falsification and is considered reliable enough to serve as a basis for decision-making.
This definitional system lays a robust logical foundation for the credibility framework constructed later.
2. Distinguishing Causal Knowledge Differences between ML and Biophysical Models
The paper emphasizes that the causal knowledge used to predict the QI comes in “explicit” and “implicit” forms:
- Explicit causal knowledge is verifiable deduction from physical, chemical, and biological scientific principles, for example, modeling bone fracture healing using finite element analysis.
- Implicit causal knowledge is embedded in large-scale observational data; it requires no explicit physical principles and relies on statistical or ML-detected correlations, which is the essence of ML models.
Because ML relies on implicit knowledge, its input variables are often merely “sufficient” rather than “necessary,” making omitted or redundant inputs (and hence underfitting or overfitting) more likely; this is an important focus for credibility assessment.
3. Defining Credibility and the Seven-Step Assessment Framework
Drawing on metrology, statistics, and engineering simulation practice, the paper defines “credibility” as the ability to keep the predictor's error within acceptable bounds under all possible input conditions. Since it is impractical to obtain true values for every system state, the authors propose a stepwise sampling and error-decomposition process to estimate credibility approximately, organized as a seven-step evaluation workflow (a code sketch of the error-quantification steps follows the list):
- Define the context of use and error threshold: clearly define the specific application scenario and set the maximum permissible prediction error (ε).
- Establish sources of true values: obtain “true values” of the QI and relevant input variables through reliable measurement chains, ensuring measurement accuracy exceeds the permissible error by at least an order of magnitude.
- Quantify the prediction error: design controlled experiments that collect inputs and corresponding true outputs under varying conditions to quantify the actual error distribution.
- Identify error sources: analyze the likely error origins for different predictor types, such as numerical error, aleatoric (measurement) uncertainty, and epistemic (knowledge-incompleteness) uncertainty.
- Decompose error sources: strive to break the total error down by cause, sometimes via special experiments that hold all variables fixed except the single source of interest.
- Examine error distributions: check whether the distribution of each error source conforms to theoretical expectations, e.g., whether measurement errors are normally distributed.
- Assess robustness and applicability: beyond routine scenarios, evaluate input extremes not covered by the training set, potential biases, and generalization ability.
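As a rough illustration of the error-quantification and distribution-checking steps above, the sketch below (my own, not the authors' code) compares predictor errors against an assumed maximum permissible error ε over a reference test set and checks whether the residuals look approximately normal. The threshold value and the use of the Shapiro–Wilk test are illustrative assumptions.

```python
# Illustrative sketch of the error-quantification steps (not the authors' code).
# qi_true are reference ("true") values from a trusted measurement chain;
# qi_pred are the predictor's outputs for the same system states.
# The permissible error epsilon is an assumed, context-of-use-specific value.
import numpy as np
from scipy import stats

def credibility_check(qi_true, qi_pred, epsilon):
    """Quantify prediction errors and compare them against the threshold epsilon."""
    errors = np.asarray(qi_pred) - np.asarray(qi_true)

    summary = {
        "mean_error": float(np.mean(errors)),
        "max_abs_error": float(np.max(np.abs(errors))),
        "fraction_within_epsilon": float(np.mean(np.abs(errors) <= epsilon)),
    }

    # Check whether the error distribution matches expectations
    # (here, an assumed normality check via the Shapiro-Wilk test).
    _, p_value = stats.shapiro(errors)
    summary["errors_look_normal"] = bool(p_value > 0.05)

    # Credibility in the paper's sense requires the error to stay below epsilon
    # across the whole context of use, not just on average.
    summary["within_threshold_everywhere"] = bool(np.max(np.abs(errors)) <= epsilon)
    return summary

# Example usage with synthetic values and an assumed epsilon
rng = np.random.default_rng(1)
qi_true = rng.normal(size=200)
qi_pred = qi_true + rng.normal(scale=0.05, size=200)
print(credibility_check(qi_true, qi_pred, epsilon=0.2))
```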
The paper compares how biophysical and ML models differ in these steps, especially regarding error source identification and robustness assessment, highlighting that ML predictors are at higher risk of missing key variables due to non-necessary input sets—one of the paper’s central concerns.
4. Proposing Measures for Bias Robustness and Safety
Given that ML models are prone to applicability issues or catastrophic errors under outlier or unrepresented input cases, the paper proposes two major strategies:
- Total Product Life Cycle (TPLC) Management: models should be continuously monitored after deployment, with the test dataset continually supplemented and expanded. The context of use should be extended only judiciously, with new data robustly supporting each expansion.
- Safety Layer Design: before each real-world prediction, the model checks whether the input data belongs to the distribution of the training/test set. If the input is out-of-distribution, the model either issues a warning or refuses to predict, falling back on trusted traditional methods if necessary (a minimal code sketch follows). Achieving this requires the training/test set to retain as many observable variables as possible, even if some are not used by the model itself.
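The following is a minimal sketch of such a safety layer (my own illustration, not the authors' implementation): it flags inputs whose values fall outside the ranges seen in the reference data, or whose Mahalanobis distance from the reference distribution is unusually large, and refuses to predict in that case. The specific distance threshold is an assumption.

```python
# Illustrative safety-layer sketch (not from the paper): before each prediction,
# check whether the input lies within the distribution of the training/test data.
import numpy as np

class SafetyLayer:
    def __init__(self, reference_data, distance_threshold=3.0):
        # reference_data: array of shape (n_samples, n_features) drawn from the
        # training/test set; distance_threshold is an assumed cut-off value.
        self.mean = reference_data.mean(axis=0)
        self.cov_inv = np.linalg.pinv(np.cov(reference_data, rowvar=False))
        self.min = reference_data.min(axis=0)
        self.max = reference_data.max(axis=0)
        self.distance_threshold = distance_threshold

    def is_in_distribution(self, x):
        """Return True if input x looks like the data the model was built on."""
        x = np.asarray(x, dtype=float)
        within_range = bool(np.all((x >= self.min) & (x <= self.max)))
        diff = x - self.mean
        mahalanobis = float(np.sqrt(diff @ self.cov_inv @ diff))
        return within_range and mahalanobis <= self.distance_threshold

    def guarded_predict(self, model, x):
        """Predict only for in-distribution inputs; otherwise warn and refuse."""
        if not self.is_in_distribution(x):
            raise ValueError("Input is out-of-distribution: refusing to predict; "
                             "fall back on a trusted traditional method.")
        return model.predict(np.asarray(x).reshape(1, -1))[0]

# Example usage (hypothetical names): layer = SafetyLayer(omega_train)
#                                     qi_hat = layer.guarded_predict(predictor, new_omega)
```

In practice, both the out-of-distribution test and its threshold would have to be chosen and validated for the specific context of use.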
5. Twelve Theoretical Declarations and Local Evidence
The core of the article is embodied in twelve statements, including key points such as:
- Definition of quantities, observation, and prediction relationships
- The role of the DIKW framework in knowledge hierarchies, with process examples (e.g., tumor growth prediction)
- The distinction between explicit and implicit causal knowledge, their respective use cases, and prospects for hybrid approaches (e.g., physics-informed ML, hybrid frameworks, sequential/parallel models; a minimal sequential-hybrid sketch follows this list)
- Principles for error decomposition and quantification of credibility
- ML-specific issues, such as overfitting, bias, missing inputs, lack of transparency (“black box” problems), data quality, and temporal dynamics
- Improvement strategies: TPLC, safety layers, comprehensive data collection, standardization, continual monitoring, and regulatory alignment
- Alignment and complementarity between the authors' framework and the latest regulatory guidance (e.g., from the FDA)
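To illustrate the sequential-hybrid idea mentioned above (my own sketch, not a method from the paper), the snippet below combines a simple explicit, first-principles estimate with an ML model trained on its residuals; the mechanistic formula, the synthetic data, and the choice of ridge regression are assumptions for demonstration only.

```python
# Illustrative sequential hybrid sketch (not from the paper): an explicit,
# first-principles estimate is corrected by an ML model trained on its residuals.
import numpy as np
from sklearn.linear_model import Ridge

def mechanistic_estimate(omega):
    # Placeholder explicit-knowledge model (assumed form, for illustration only)
    return 1.5 * omega[:, 0] - 0.8 * omega[:, 1]

rng = np.random.default_rng(2)
omega = rng.normal(size=(400, 4))
qi_true = (1.5 * omega[:, 0] - 0.8 * omega[:, 1]
           + 0.3 * np.sin(omega[:, 2])
           + rng.normal(scale=0.05, size=400))

# Sequential hybrid: the ML corrector learns only what the mechanistic model misses
residuals = qi_true - mechanistic_estimate(omega)
corrector = Ridge(alpha=1.0).fit(omega, residuals)

def hybrid_predict(omega_new):
    return mechanistic_estimate(omega_new) + corrector.predict(omega_new)

print(hybrid_predict(rng.normal(size=(5, 4))))
```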
6. Consensus Conclusions, Practical Recommendations, and Innovations
Key conclusions include:
- ML predictors, due to reliance on implicit knowledge, are more susceptible to biases and missing inputs, though they are invaluable for efficiently solving complex problems.
- Systematic error decomposition and stepwise assessment can effectively improve predictor credibility.
- Employing total product life cycle management and safety layer design greatly enhances robustness and applicability to diverse clinical scenarios and populations.
Eight authoritative recommendations:
1. Promote standardization and implementation of the seven-step credibility assessment process.
2. Encourage comprehensive and high-quality data collection to enable stricter model evaluation.
3. Advance validation, verification, and uncertainty quantification techniques adapted to ML predictors.
4. Emphasize model transparency and interpretability.
5. Strengthen communication with regulatory authorities to ensure compliance.
6. Increase interdisciplinary training, enhancing clinicians’ ML literacy.
7. Encourage cross-field collaboration to gather specialized strengths.
8. Stress continuous real-world monitoring and dynamic model updating.
7. Significance and Value of the Paper
This domain consensus statement systematically integrates expertise from in silico medicine, data science, clinical practice, and regulatory science. It not only answers the theoretical and practical question of how to scientifically evaluate the credibility of ML medical prediction models, but also proposes comprehensive operational standards, filling a gap in the academic community. Whereas prior literature focused on “interpretability” and “reliability,” the authors highlight “credibility” as an indispensable dimension of clinical assessment: it demands not merely models that are “mostly accurate,” but models whose errors stay within clinically acceptable thresholds across the entire context of use, thus laying the groundwork for the compliant and safe deployment of medical AI.
The statement echoes the latest guidelines from regulatory bodies such as the FDA and introduces a novel paradigm of bias robustness and safety layers—a practical roadmap for large-scale implementation of AI medical models in the future.
4. Additional Value-added Information
- This paper was supported by the EU’s H2020 “In Silico World” project (Grant No. 101016503).
- The authors declare no conflicts of interest; all data and recommendations are based on a multi-round consensus process.
- The references span multiple leading fields, including ML in medicine, model validation, interpretability, reliability, and hybrid modeling, providing a comprehensive and robust body of supporting literature.
5. Conclusion
This “Consensus Statement on the Credibility Assessment of Machine Learning Predictors,” jointly authored by leading international experts and built on broad community consensus, analyzes in depth the critical challenges faced by medical ML models and presents a systematic framework spanning model development, evaluation, clinical application, and regulatory approval. Its publication marks significant progress toward standardized evaluation methodologies in in silico medicine and medical AI, and serves as a milestone for responsible innovation and high-quality development of medical AI and the wider health industry.