End-to-End Prediction of Knee Osteoarthritis Progression with Multimodal Transformers
I. Academic Background
Knee osteoarthritis (KOA) is a chronic musculoskeletal disease that affects hundreds of millions of people worldwide. Due to gradual degeneration of articular cartilage and bone, KOA typically leads to chronic pain, joint stiffness, and functional impairment. Unfortunately, there is currently no effective cure, and the development of early interventions and disease-modifying drugs critically relies on accurate prediction of KOA progression. Therefore, forecasting KOA progression has become a key unsolved question in orthopedics and clinical medicine.
KOA progression is highly heterogeneous, with large differences in patient presentation and pathophysiology, making precise prediction extremely challenging. Clinical practice has traditionally relied on radiographic evaluation, especially Kellgren–Lawrence (KL) grading, to determine KOA severity. However, X-rays reflect only changes in bone and joint space and are virtually incapable of identifying early soft-tissue degeneration, such as subtle changes in cartilage, meniscus, and fat-pad microstructure. Magnetic resonance imaging (MRI) has greatly enhanced the detail of joint imaging: different sequence protocols reveal morphological (structural MRI) and compositional (e.g., T2-mapping) features, vastly expanding access to early disease pathology.
In practice, however, most MRI-based studies have small sample sizes, and biomarkers are usually extracted via image segmentation and conventional radiomics. Such "bottom-up" designs struggle to reveal the high-level relationships hidden among complex patterns. Moreover, owing to these methodological restrictions, the real synergistic value and fusion efficacy of multimodal imaging (such as X-ray combined with multimodal MRI) remain to be systematically validated.
In recent years, deep learning (DL) has enabled the analysis of large volumes of medical imaging data. In particular, multimodal, multi-sequence fusion networks and Transformer models can learn optimal predictive features end-to-end from raw data, providing new opportunities for individualized progression prediction and phenotype characterization in KOA.
II. Source and Author Information
This paper, entitled “End-to-end Prediction of Knee Osteoarthritis Progression with Multimodal Transformers,” was published in the IEEE Journal of Biomedical and Health Informatics (Vol. 29, No. 9, September 2025). The authors are Egor Panfilov, Simo Saarakkala, Miika T. Nieminen, and Aleksei Tiulpin, all from the Faculty of Medicine at the University of Oulu, as well as the Department of Diagnostic Radiology at Oulu University Hospital in Finland. This team is a leading group in the field of musculoskeletal imaging analysis and AI medical applications.
The study was supported by the Osteoarthritis Initiative (OAI), the Research Council of Finland, and the Infotech Institute of the University of Oulu. All data and model code have been made public, significantly facilitating reproducibility for future research.
III. Overall Research Workflow and Methodological Details
1. Study Design and Dataset Construction
This study leveraged the Osteoarthritis Initiative (OAI), a multicenter, prospective cohort database, to construct five independent sub-datasets across different time windows (12, 24, 36, 48, and 96 months). Each dataset uses baseline information as a starting point, labelling KL-grade progression within the follow-up period as progressor or non-progressor. The final sample sizes of the five datasets are 3967, 3735, 3585, 3448, and 2421 cases, respectively, with the proportion of progressors increasing with longer follow-up (27.7% at 96 months). The test set was drawn from a single acquisition site (Site D) to probe the model's robustness to domain shift, while the remaining data were split into training and validation folds via stratified 5-fold cross-validation, keeping the label distribution consistent across folds.
2. Clinical and Imaging Variables
Clinical variables include basic demographics (age, sex, BMI), prior knee injury/surgery history, symptom and function measures (WOMAC score), and baseline X-ray KL grade. Imaging data cover X-rays and multiple MRI sequences, including high-resolution 3D DESS (Dual-Echo Steady State), coronal intermediate-weighted TSE (Turbo Spin-Echo), and sagittal multi-echo T2-mapping (reflecting biochemical composition). DESS is mainly used for cartilage and meniscus morphology, TSE emphasizes structural assessment such as ligaments, bone marrow lesions, and synovial inflammation, while T2-map is sensitive to early compositional changes in cartilage.
3. Experimental Methods and Deep Learning Modeling
3.1 Clinical Baseline Models
Various combinations of clinical variables were used to construct logistic regression (LR) models as baselines. WOMAC, knee history, and KL grade were added stepwise. All models employed 5-fold cross-validation and were evaluated by area under the ROC curve (AUC) and average precision (AP).
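As an illustration only, this baseline protocol (logistic regression over clinical variables, stratified 5-fold cross-validation, AUC and AP) can be sketched as follows. The data here are synthetic stand-ins for the clinical variables, not OAI data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for age, sex/BMI, WOMAC, knee history, KL grade
n = 500
X = rng.normal(size=(n, 5))
# Outcome loosely driven by the last two "variables" plus noise
y = (X[:, 4] + 0.5 * X[:, 3] + rng.normal(scale=1.5, size=n) > 0.8).astype(int)

aucs, aps = [], []
for tr, va in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    p = clf.predict_proba(X[va])[:, 1]
    aucs.append(roc_auc_score(y[va], p))
    aps.append(average_precision_score(y[va], p))

mean_auc, mean_ap = float(np.mean(aucs)), float(np.mean(aps))
```

Stepwise variable addition, as in the paper's C1-C3 models, would simply repeat this loop over growing column subsets of `X`.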
3.2 Imaging Model Architectures
For different modalities, the study implemented separate models as follows:
- Single X-ray Image: Directly analyzed using a ResNeXt-50_32x4d CNN model.
- Single MRI Sequence: Used ResNet-50 as a feature extractor, followed by a Transformer module to aggregate slice features, leveraging pre-trained weights and capturing inter-slice spatial relationships.
- Multimodal Fusion Models: For two modalities (such as XR+MRI), independent CNN branches were configured, with feature vectors concatenated before Transformer-based cross-modal fusion. For three or four modalities, each MRI branch was equipped with a mid-level Transformer to embed its features into a common latent space before final aggregation in a Transformer. When clinical data were included, an additional shallow fully-connected branch was added. All CNNs were initialized with ImageNet weights; the remaining layers were randomly initialized.
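The per-sequence design described above (a 2-D CNN encoding each slice, a Transformer aggregating the slice embeddings) can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: a small convolutional stack stands in for the ResNet-50 backbone, and all dimensions are arbitrary:

```python
import torch
import torch.nn as nn

class SliceTransformer(nn.Module):
    """Sketch of a per-sequence MRI model: a 2-D CNN encodes each slice,
    and a Transformer aggregates slice embeddings via a CLS-token readout.
    The tiny conv stack below is a stand-in for a ResNet-50 backbone."""
    def __init__(self, dim=64, heads=4, layers=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=layers)
        self.head = nn.Linear(dim, 1)  # progressor vs. non-progressor logit

    def forward(self, x):                       # x: (batch, slices, H, W)
        b, s, h, w = x.shape
        feats = self.cnn(x.reshape(b * s, 1, h, w)).reshape(b, s, -1)
        tokens = torch.cat([self.cls.expand(b, -1, -1), feats], dim=1)
        return self.head(self.transformer(tokens)[:, 0])  # read out CLS token

model = SliceTransformer()
logits = model(torch.randn(2, 24, 64, 64))      # 2 scans, 24 slices each
```

In the multimodal case, each modality branch would produce such embeddings, which a further Transformer then fuses before the classification head.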
Training used the Adam optimizer, focal loss to counter class imbalance, and oversampling of the minority class, with standard learning-rate warm-up and decay. Models were trained on four NVIDIA A100 GPUs, with per-model training times ranging from 0.5 to 6.5 hours.
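Focal loss counters class imbalance by down-weighting easy, confidently classified examples. A common binary formulation is shown below; the `gamma` and `alpha` values are the usual defaults from Lin et al., not necessarily the settings used in this study:

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: easy examples are down-weighted by (1 - p_t)**gamma,
    so training focuses on hard cases such as the rare progressor class."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                                # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# An easy example (large correct logit) contributes far less than a hard one.
easy = binary_focal_loss(torch.tensor([4.0]), torch.tensor([1.0]))
hard = binary_focal_loss(torch.tensor([-1.0]), torch.tensor([1.0]))
```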
3.3 Evaluation and Statistical Analysis
All models were evaluated with AUC and AP on the cross-validation and test sets, using bootstrapping to obtain means and standard errors and permutation tests to assess the statistical significance of performance differences. In addition, for the multimodal fusion models, a feature-ablation procedure quantified the relative utilization rate (RUR) of each modality, i.e., its contribution to the model's predictive power.
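The bootstrapping step can be illustrated as follows, on synthetic predictions: resampling test cases with replacement and recomputing AUC each time yields a mean and a standard error (the permutation test for model comparison follows a similar resampling logic):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def bootstrap_auc(y_true, y_prob, n_boot=1000):
    """Resample test cases with replacement and recompute AUC each time,
    returning the bootstrap mean and standard error."""
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():   # need both classes present
            continue
        scores.append(roc_auc_score(y_true[idx], y_prob[idx]))
    return float(np.mean(scores)), float(np.std(scores))

# Synthetic scores where progressors (y=1) tend to receive higher values
y = rng.integers(0, 2, 300)
p = 0.4 * y + rng.normal(0.3, 0.2, 300)
mean_auc, se_auc = bootstrap_auc(y, p)
```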
4. Subgroup Analysis
To explore model performance in different clinical subpopulations, subjects were stratified by knee history into “no prior injury/surgery,” “prior injury without surgery,” and “prior surgery” groups, and further by baseline KL grading and symptom presence (WOMAC total score threshold of 10). Model AUC and AP were calculated in each subgroup to examine the heterogeneity of multimodal and unimodal model effectiveness across populations.
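The subgroup evaluation amounts to recomputing the metrics within each stratum. A sketch over a hypothetical test-set table follows; the column names and group labels are illustrative, not taken from the paper:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(1)

# Hypothetical test-set frame: one row per knee, with its knee-history stratum,
# the true progression label, and the model's predicted probability.
df = pd.DataFrame({
    "history": rng.choice(["no_injury", "injury_no_surgery", "surgery"], 600),
    "y_true": rng.integers(0, 2, 600),
    "y_prob": rng.random(600),
})

# Per-stratum (AUC, AP); further splits by KL grade or WOMAC threshold
# would just add columns to the groupby key.
subgroup_metrics = {
    name: (roc_auc_score(g["y_true"], g["y_prob"]),
           average_precision_score(g["y_true"], g["y_prob"]))
    for name, g in df.groupby("history")
}
```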
IV. Detailed Main Experimental Results
1. Clinical Baseline Model Results
In the 12-month window, stepwise addition of WOMAC and knee history increased both AUC and AP by 0.07; including KL grade further increased AP by 0.10, suggesting that imaging-derived information adds value to short-term progression prediction. In the 24-48-month windows, the added value of these clinical and imaging factors diminished; at 96 months, non-imaging variables plus KL grade again yielded significant gains, indicating that long-term progressors are comparatively easier to identify. The multivariate logistic regression model (C3) performed best and served as the baseline for subsequent analyses.
2. Unimodal Imaging Model Performance
X-ray models underperformed the baseline at 12 and 24 months but outperformed it from 36 months onwards, with significant AP improvements at 48-96 months. Structural MRI sequences (DESS/TSE) delivered higher AUC than both the baseline and X-ray models at 12 months, improved steadily in both metrics from 24 months on, and showed notable AUC gains at 24 and 96 months; T2-map (compositional MRI) performed comparably to X-ray. For long-term prognosis, all MRI models were superior to the clinical baselines and X-ray, highlighting their value in early disease detection.
3. Multimodal Fusion Model Performance
3.1 Fusion of MRI Protocols
Dual structural MRI fusion (DESS+TSE) mainly increased AUC, by 0.03 at 12 months (not statistically significant); adding the compositional sequence (T2-map) yielded limited gains, with a significant AP improvement only for the 36-month target. This suggests that multi-sequence MRI fusion may offer benefits, but the gains are marginal and period-specific.
3.2 Radiograph + MRI Fusion
Fusing X-ray with DESS increased AUC by 0.11 and 0.05 at 12 months relative to the respective single modalities, and slightly raised AP at 48 and 96 months. The tri-modal model (XR+DESS+T2-map) performed best, with overall scores of 0.70-0.76 (AUC) and 0.10-0.55 (AP), and was more stable than any single or dual modality. Incorporating clinical variables into the fusion model yielded no further gains and even slightly lowered AP at 12 months versus the non-imaging baseline, indicating that some readily progressing cases can already be identified from clinical data alone, while the additional value of multimodal imaging is most pronounced in complex, heterogeneous patients.
4. Subgroup and Utilization Rate Analysis
In the "no prior injury/surgery" group, model AUCs were moderate, with MRI and fusion models slightly favored, particularly for subjects with low baseline KL grades and present symptoms. For those with prior injury or surgery, performance improved markedly: MRI and fusion models clearly outperformed the clinical and X-ray models in both AUC and AP, suggesting that post-injury inflammation and degeneration processes can be sensitively captured by high-dimensional imaging features.
RUR analysis revealed that DESS MRI consistently contributed the majority of predictive power (>85% on average), with T2-map providing more supplementary value in short-term windows (peaking at 28%) and declining thereafter. Clinical variables and X-ray made very marginal contributions when added to the multimodal models. These findings further confirm that MRI, especially structural MRI, offers the largest information yield and stands atop the "information pyramid" for predicting KL-grade progression.
V. Overall Conclusions, Scientific and Application Value
1. Main Scientific Conclusions
This study presents an end-to-end multimodal deep learning prediction framework, systematically evaluating the practical gains of multimodal fusion for KOA progression prediction. The results challenge the intuition of “more modalities is better,” demonstrating that, for both long- and short-term KOA progression prediction, using only structural MRI achieves performance comparable to multimodal fusion. Only rare complex populations (such as those with prior injury/surgery) or difficult early-stage cases appear to benefit significantly from multimodal fusion.
Additionally, compositional MRI (such as T2-map) is somewhat valuable for identifying early degeneration within 12 months, but its supplementary information declines over longer follow-up, likely due to KL grading’s reliance on morphological changes. Clinical variables mainly provide information for short-term prediction, suggesting that in real-world screening, combining prior knee history and functional assessment can preliminarily identify high-risk individuals, limiting MRI use to a smaller subset.
2. Application and Translational Value
From an application perspective, this study’s findings are highly significant: Routine, low-cost X-ray screening combined with basic clinical assessment is sufficient for KOA risk screening in most populations. MRI can be prioritized for those with complex histories, difficult symptoms, new injuries, or in pilot drug trial cohorts. Targeted, tiered application of sequence or multimodal fusion can greatly improve medical screening efficiency and resource allocation.
The end-to-end DL scheme (CNN+Transformer), in contrast to traditional region-consensus and handcrafted feature radiomics pipelines, has the potential to comprehensively capture complex radiomic and spatial variance features for individualized KOA progression prediction. The open-source codebase and reproducible design will foster the evolution and practical translation of AI approaches for KOA, orthopedics, and chronic disease progression forecasting more broadly.
VI. Study Highlights and Innovations
- End-to-End Workflow Completion: Achieved, for the first time, standardized and open-source multimodal end-to-end prediction integrating X-ray, three MRI sequences, and clinical data.
- Large-Scale Samples, Multi-Time Window Validation: Leveraged the OAI database for large sample sizes and high-resolution subgroup analysis, enhancing the universality and persuasiveness of findings.
- Close Focus on Real Value of Multimodal Fusion: Quantified genuine contribution of each imaging type in fusion models via RURs, highlighting structural MRI’s dominance and challenging the field’s “multimodal is superior” dogma.
- Subgroup and Heterogeneity Recognition: Thoroughly evaluated model performance across different clinical backgrounds (post-op, post-injury, typical cases), emphasizing the necessity of population stratification in KOA research and modeling.
- Methodological Innovation: The combined CNN plus Transformer architecture connects local image features with global inter-sequence or cross-source dependencies, providing direction for future large-volume medical data fusion research.
VII. Other Points of Note
- All related code, data selection, preprocessing, model development, and evaluation processes are open-sourced on GitHub (https://github.com/imedslab/oaprogressionmmf).
- The discussion section anticipates future directions, including prediction of disease-progression trajectories, imaging domain adaptation, multi-center generalization, and more compact, streamlined AI models.
- Highlights the limitations of KL grading as an endpoint, advocating the integration of MRI quantitative scoring systems (such as MOAKS) or adaptive phenotype grouping into future predictive frameworks.
- Cautions that future multimodal DL models must attend to balanced branch optimization and weight learning, and explore architectures and decision mechanisms with greater clinical interpretability and practical utility.
VIII. Summary
By deploying multimodal deep learning, this study systematically evaluated the genuine contributions of X-ray, different MRI sequences, and their fusion in KOA progression prediction, providing stratified screening and clinical decision-making recommendations tailored to real-world needs. With robust large-scale validation, detailed subgroup analysis, and an open-source framework, this work lays a solid foundation and new direction for both AI-driven musculoskeletal disease progression forecasting and broader chronic disease applications.