Spatial-Aware Transformer-GRU Framework for Enhanced Glaucoma Diagnosis from 3D OCT Imaging
1. Academic Background—Innovative Diagnostic Tools Urgently Needed for Early Glaucoma Screening
Glaucoma is one of the leading causes of irreversible blindness worldwide. According to studies such as [31], glaucoma typically presents few noticeable symptoms in its early stages yet causes irreversible visual impairment, making early detection and intervention crucial. Currently, Optical Coherence Tomography (OCT), a non-invasive, high-resolution three-dimensional (3D) imaging technology, is playing an increasingly important role in ophthalmic diagnostics. OCT enables direct visualization of anatomical structural changes in the eye and helps physicians precisely evaluate key structures such as the Retinal Nerve Fiber Layer (RNFL) [13].
However, traditional OCT-assisted glaucoma diagnostic methods often rely on the analysis of two-dimensional (2D) B-scans, focusing on the central slice of the Optic Nerve Head (ONH). While this localized information helps detect structural damage, it inevitably overlooks the comprehensive spatial information contained in 3D OCT imaging, making it difficult to reveal the broad, progressive pathological features of glaucoma across the depth and regions of the retina [34]. Furthermore, changes such as RNFL thinning and alterations in the fundus structure present a complex spatial distribution, making manual layer-by-layer interpretation of OCT data time-consuming and prone to missed diagnoses.
To address these challenges, Artificial Intelligence (AI), and deep learning in particular, has become a key approach for automated glaucoma screening. Effectively integrating the entire 3D OCT volume, mining its latent spatial features, and improving the accuracy and reliability of automated diagnosis in real clinical settings remain active research problems. The authors of this paper tackle these problems directly, aiming to unlock the full value of 3D OCT data and resolve key bottlenecks in automated diagnostic workflows.
2. Paper Source and Author Information
This paper, titled “Spatial-Aware Transformer-GRU Framework for Enhanced Glaucoma Diagnosis from 3D OCT Imaging,” was published in the IEEE Journal of Biomedical and Health Informatics, Vol. 29, No. 9, September 2025 (DOI: 10.1109/jbhi.2025.3550394). The authors are Mona Ashtari-Majlan and David Masip (Senior Member, IEEE), both from the Department of Computer Science, Multimedia, and Telecommunications, Universitat Oberta de Catalunya (UOC), Spain. The research was supported by the Spanish Ministry of Science and Innovation (FEDER initiative, Grant PID2022-138721NB-I00).
3. Detailed Interpretation of Study Workflow
1. Overall Research Design and Approach
The study develops a deep learning framework that fully exploits the spatial information of the entire 3D OCT volume for automated, accurate glaucoma screening. The proposed model combines a Transformer-based feature extractor with a bidirectional Gated Recurrent Unit (GRU) sequence model, balancing per-slice feature extraction with modeling of global spatial dependencies across slices, so that the subtle structural damage associated with glaucoma is captured comprehensively.
The research workflow includes data preprocessing, feature extraction, sequence processing, model training and optimization, comparative experiments, and ablation studies.
a) Data Preprocessing
- Data Source: The study uses the public 3D OCT dataset of Maetschke et al. [21], containing 1110 OCT scans from 624 patients. The scans were acquired with a Cirrus SD-OCT scanner and are provided at a resolution of 64×64×128 voxels.
- Sample Organization: Only scans with signal strength ≥7 were included, resulting in 263 healthy controls and 847 diagnosed glaucoma cases (confirmed by two consecutive abnormal visual field results).
- Preprocessing Methods:
- All slices were normalized with the ImageNet mean and standard deviation so that their intensity distribution matches the input expected by the pretrained backbone.
- Image dimensions were adjusted to a standard size of 64×128×128, ensuring input consistency across samples (a minimal preprocessing sketch follows).
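The following is a minimal sketch, not the authors' released code, of the preprocessing described above: per-slice resizing and normalization with ImageNet statistics. The (depth, height, width) layout, the 3-channel replication for the pretrained backbone, and the helper name `preprocess_volume` are assumptions.

```python
import torch
import torch.nn.functional as F

IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406])
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225])

def preprocess_volume(volume: torch.Tensor) -> torch.Tensor:
    """volume: (d, h, w) OCT cube with intensities scaled to [0, 1]."""
    # Resize every slice to 128x128; the 64 depth slices are kept as-is.
    vol = F.interpolate(volume.unsqueeze(1), size=(128, 128),
                        mode="bilinear", align_corners=False)         # (d, 1, 128, 128)
    # Replicate the single grayscale channel to 3 channels so an
    # ImageNet-style pretrained backbone can consume it (an assumption).
    vol = vol.repeat(1, 3, 1, 1)                                      # (d, 3, 128, 128)
    # Normalize with ImageNet statistics, as described in the paper.
    vol = (vol - IMAGENET_MEAN.view(1, 3, 1, 1)) / IMAGENET_STD.view(1, 3, 1, 1)
    return vol
```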
b) Feature Extraction
- Methodological Innovation: The study adopts the pre-trained RetFound model [36] of Zhou et al. as the feature extraction network. RetFound is built on a ViT-large (Vision Transformer, large) backbone with 24 Transformer blocks and a 1024-dimensional embedding, trained by self-supervised learning on 1.6 million unlabeled retinal images.
- Implementation Details:
- The 3D OCT volume is divided into d (64) slices; each slice s_i is fed independently into the ViT-large model, which outputs a 1024-dimensional feature vector f_i.
- This per-slice feature extraction captures subtle structural differences at each depth and provides the inputs for the subsequent sequence integration (see the sketch below).
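A minimal sketch of this per-slice feature extraction, using timm's generic ViT-large as a stand-in for RetFound (the paper loads RetFound's self-supervised weights, which is not shown here); the 224×224 resize and the function name `extract_slice_features` are assumptions.

```python
import timm
import torch
import torch.nn.functional as F

# Stand-in backbone; the paper uses RetFound's ViT-large weights instead.
encoder = timm.create_model("vit_large_patch16_224", pretrained=True, num_classes=0)
encoder.eval()  # frozen feature extractor

@torch.no_grad()
def extract_slice_features(slices: torch.Tensor) -> torch.Tensor:
    """slices: (d, 3, H, W) preprocessed slices -> (d, 1024) feature vectors."""
    # Resize to the backbone's native 224x224 input (an assumption; the exact
    # input resolution used in the paper may differ).
    x = F.interpolate(slices, size=(224, 224), mode="bilinear", align_corners=False)
    return encoder(x)  # num_classes=0 returns the pooled 1024-dim embedding
```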
c) Sequence Processing
- Modeling Spatial Dependencies: To characterize the spatial correlations and sequential dependencies among slices in a 3D OCT sequence, the study employs two layers of bidirectional GRU.
- Network Workflow:
- First, all slice feature vectors {f_1, f_2, …, f_d} are fed to the GRU in slice order.
- Bidirectional processing captures forward (h_fw) and backward (h_bw) hidden states, so spatial changes across the retinal structures are modeled in both directions along the slice sequence.
- The forward and backward states are concatenated, passed through Dropout (to improve generalization) and Adaptive Max Pooling (AMP) over the slice axis to form a unified spatial representation, and a fully connected (FC) layer with Sigmoid activation outputs the probability of glaucoma versus normal.
- Loss Function Design: To overcome class imbalance, the Focal Loss function is applied, which increases attention on hard-to-classify samples while reducing the dominance of the majority class (a minimal sketch of the classification head and loss follows).
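The following is a minimal sketch, not the released implementation, of the sequence head and loss described above: two bidirectional GRU layers over the d slice features, Dropout, Adaptive Max Pooling over the slice axis, and an FC layer producing a single glaucoma logit, trained with binary Focal Loss. The hidden sizes (256/128), Dropout rate (0.3), and Focal Loss parameters (α=0.3, γ=2) follow the configuration reported below; the class name and the exact placement of pooling are assumptions.

```python
import torch
import torch.nn as nn

class SpatialSequenceHead(nn.Module):
    def __init__(self, feat_dim=1024, hidden1=256, hidden2=128, p_drop=0.3):
        super().__init__()
        self.gru1 = nn.GRU(feat_dim, hidden1, batch_first=True, bidirectional=True)
        self.gru2 = nn.GRU(2 * hidden1, hidden2, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(p_drop)
        self.pool = nn.AdaptiveMaxPool1d(1)           # pool over the slice axis
        self.fc = nn.Linear(2 * hidden2, 1)           # single glaucoma logit

    def forward(self, feats):                         # feats: (batch, d, 1024)
        h, _ = self.gru1(feats)                       # (batch, d, 2*hidden1), fw/bw concatenated
        h, _ = self.gru2(h)                           # (batch, d, 2*hidden2)
        h = self.dropout(h)
        h = self.pool(h.transpose(1, 2)).squeeze(-1)  # (batch, 2*hidden2)
        return self.fc(h).squeeze(-1)                 # raw logit; sigmoid applied in the loss

def focal_loss(logits, targets, alpha=0.3, gamma=2.0):
    """Binary focal loss: down-weights easy samples and emphasizes hard ones."""
    p = torch.sigmoid(logits)
    pt = torch.where(targets == 1, p, 1 - p)          # probability assigned to the true class
    at = torch.where(targets == 1,
                     torch.full_like(p, alpha),
                     torch.full_like(p, 1 - alpha))
    return (-at * (1 - pt) ** gamma * torch.log(pt.clamp_min(1e-8))).mean()
```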
d) Model Training and Hyperparameter Optimization
- Training Methods: The entire model is built on PyTorch 1.8.1, using the Adam optimizer. Training lasts for up to 100 epochs, with early stopping to prevent overfitting.
- Hyperparameter Exploration: The study experiments with various GRU hidden sizes and Dropout rates and systematically analyzes the Focal Loss parameters α and γ. The final configuration uses hidden sizes of 256 and 128 for the two GRU layers, a Dropout rate of 0.3, and α=0.3, γ=2.
- Validation Mode: Five-fold cross-validation is used to ensure robustness, with training/validation/test splits grouped by patient so that multiple scans from the same patient never leak across splits (a grouping sketch follows).
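A minimal sketch of the patient-grouped splitting described above, using scikit-learn's GroupKFold as a stand-in for the authors' splitting code; the function name and the way a validation subset is carved out are assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def patient_grouped_folds(labels: np.ndarray, patient_ids: np.ndarray, n_splits: int = 5):
    """Yield (train_idx, test_idx) pairs with no patient shared across folds."""
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in gkf.split(np.zeros(len(labels)), labels, groups=patient_ids):
        yield train_idx, test_idx

# Within each training fold, a patient-grouped validation subset can be split
# off the same way; the model is then trained with Adam for up to 100 epochs
# and stopped early on the validation loss, as described above.
```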
e) Comparative Experiments and Ablation Analysis
- Baseline Methods:
- 3D-CNN (Maetschke et al. [21]): Directly processes 3D OCT volumes, representing traditional convolutional neural network approaches.
- RetFound Extended Model: Uses only the RetFound ViT-large feature extractor to process 2D slices, followed by a two-layer FC classification.
- Ablation Experiment Ideas:
- Replace ViT-large with ResNet34 as the feature extractor to compare the impact of pre-training domain;
- Replace GRU with LSTM to analyze differences in sequence modeling approaches;
- Use a slice voting ensemble method, analyzing only the features of a few high-entropy slices to evaluate the necessity of spatial integration.
- Use t-SNE to visualize feature distributions, demonstrating the discriminative capabilities of different feature extraction and sequence modeling strategies.
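As an illustration of the t-SNE visualization step listed above, the following sketch projects per-scan feature vectors to 2D and colors them by diagnosis; scikit-learn's TSNE with illustrative settings is assumed, and this is not the authors' plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features: np.ndarray, labels: np.ndarray, out_path: str = "tsne.png"):
    """features: (n_scans, feat_dim); labels: 0 = normal, 1 = glaucoma."""
    emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(5, 5))
    for cls, name in [(0, "normal"), (1, "glaucoma")]:
        mask = labels == cls
        plt.scatter(emb[mask, 0], emb[mask, 1], s=8, label=name)
    plt.legend()
    plt.tight_layout()
    plt.savefig(out_path, dpi=200)
```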
2. Main Experimental Results
a) Core Model Performance
- Accuracy: Achieved 89.19%, significantly surpassing 3D-CNN (77.62%) and RetFound extended model (83.51%).
- F1 Score: 93.01%, reflecting balanced discrimination capability for both sample classes.
- AUC (Area Under the ROC Curve): 94.20%, highlighting outstanding differentiation between glaucoma and normal cases.
- MCC (Matthews Correlation Coefficient): 69.33%, a metric that remains informative under class imbalance.
- Sensitivity/Specificity: 91.83% and 79.67%, respectively, balancing detection rate and controlling misclassification.
- Stability: Results across the five cross-validation folds show low variance, indicating reliable performance (a metrics sketch follows).
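For reference, the following is an illustrative computation of the reported metrics with scikit-learn (not the authors' evaluation script); the 0.5 threshold and the function name are assumptions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, roc_auc_score,
                             matthews_corrcoef, confusion_matrix)

def evaluate(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> dict:
    """y_true: binary labels; y_prob: predicted glaucoma probabilities."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),
        "mcc": matthews_corrcoef(y_true, y_pred),
        "sensitivity": tp / (tp + fn),   # recall on the glaucoma class
        "specificity": tn / (tn + fp),
    }
```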
b) Ablation Analysis and Visualization
- ViT-large outperforms ResNet34: ResNet34 is pre-trained on general-purpose ImageNet data and is markedly less effective for glaucoma discrimination than the ViT-large model pretrained on retinal imaging (RetFound).
- GRU outperforms LSTM: Both can effectively process sequences, but GRU offers greater stability and parameter efficiency, making it more suitable for deep spatial modeling in this framework.
- Spatial integration is essential: The slice voting ensemble captures some local features, but its overall accuracy and robustness fall well short of the Transformer-GRU sequence integration framework.
- t-SNE Visualization: ViT-large features form more compact, better-separated clusters for glaucoma and normal cases, and the full Transformer-GRU feature space is the most discriminative, supporting automated screening in clinical use.
c) Component Contribution Exploration
- The ablation experiments clearly demonstrate the decisive impact of key components such as feature extraction (domain self-supervised pre-training), spatial integration (bidirectional sequence capture), and loss function (Focal Loss for imbalanced samples) on model performance improvements.
4. Conclusion and Value Analysis
1. Scientific Value
This research innovatively proposes a spatial-aware Transformer-GRU framework for automated glaucoma diagnosis using 3D OCT imaging. It significantly enhances the integration of local microscopic changes and global structural correlations, overcoming the limitations of traditional 2D/3D convolution methods. The deep fusion of OCT self-supervised pretrained ViT-large with sequential GRU enables effective extraction of complex lesion distribution patterns, establishing a new paradigm for AI-assisted ocular disease diagnosis.
2. Clinical and Application Value
- High accuracy in early screening: Systematic spatial information mining for subtle early lesions helps raise the detection rate of early glaucoma and reduce the risk of misdiagnosis or missed diagnosis.
- Automated intelligent decision support: Model outputs are probability distributions that can be directly integrated into clinical decision support platforms, helping physicians objectively and comprehensively assess disease severity.
- Strong generalization: Designed for realistically imbalanced clinical sample data and trained on large-scale OCT datasets, making it well-suited for real hospital scenarios.
- Open-source promotion: The open-source implementation (https://github.com/mona-ashtari/spatialoct-glaucoma) enables rapid replication and enhancements worldwide, accelerating AI adoption in ophthalmology.
3. Highlights of Methods and Workflow
- First large-scale application of a self-supervised pretrained ViT-large model to OCT images, capturing lesion patterns better than traditional convolutional networks.
- Innovative bidirectional GRU sequence modeling, fully capturing interactions across the retina's anterior/posterior and internal/external structures.
- Focal Loss effectively addresses the class imbalance common in medical imaging datasets, improving detection of rare cases in realistic scenarios.
- Comprehensive ablation analysis and multi-baseline comparisons clarify the contribution of each component, providing scientific evidence for subsequent studies.
4. Future Prospects and Recommendations
The authors suggest integrating multimodal data (e.g., visual field testing, patient demographics) to enrich the diagnostic evidence, exploring additional sequence-processing approaches and attention mechanisms to boost performance, and extending the framework to other ocular diseases (such as macular degeneration and diabetic retinopathy) and to broader medical imaging tasks in biomedical informatics.
For actual clinical deployment, it is recommended to conduct further multicenter, large-scale, and cross-regional clinical validation to strengthen model generalization and safety, making AI-assisted diagnosis truly beneficial for global ophthalmology patients.
5. Important References and Other Information
- This study cites a wide range of cutting-edge international research (see references at the end), covering glaucoma pathology, OCT imaging analysis, and deep learning methods, offering a broad perspective and rigorous logic.
- The dataset, algorithm, and source code are all open access, empowering both the research and clinical communities to advance AI capabilities in ocular diagnostics.
- The authors emphasize the need to consider diverse gender, ethnicity, and population characteristics when deploying algorithms, and call for the establishment of a robust foundation for diverse and inclusive medical AI applications.