Ensemble Learning Based on Matrix Completion Improves Microbe-Disease Association Prediction

Academic Background and Research Problem

Microorganisms are among the most widely distributed forms of life on Earth, inhabiting oceans, soil, and the human body. The human body contains approximately 350 trillion microbial cells, which are closely linked to human health and to the onset and progression of disease. In recent years, rapid advances in sequencing technology and bioinformatics have driven extensive research into how the composition and function of the human microbiome affect health. For example, alterations in the composition of the intestinal flora can affect immune function and disease development, and the gut microbiota has been shown to regulate hepatic metabolism, promoting metabolic disease by reducing energy expenditure and increasing fat deposition.

Although experimental biomedical studies have made great efforts to reveal microbe-disease associations, the number of experimentally verified disease-related microbes remains very limited. Traditional experimental methods are both time-consuming and costly, so efficient and accurate computational techniques are urgently needed to screen for potential microbe-disease associations. Such techniques would not only provide insights for disease diagnosis and drug development, but also promote the application of microbiome research in medicine.

At present, various bioinformatics methods have attempted to solve this problem, including graph theory-based random walk, bipartite local models (BLMs), matrix factorization/completion, machine learning, and deep learning approaches. Among them, graph-structured methods are susceptible to data sparsity and noise, which can lower accuracy, while machine learning methods face challenges in high-dimensional feature selection. In recent years, strategies integrating multi-source heterogeneous data have attracted much attention, but how to efficiently and robustly fuse such complex information remains a bottleneck in the research community.

Paper Source and Author Information

This paper, titled “Ensemble learning based on matrix completion improves microbe-disease association prediction,” is authored by Hailin Chen and Kuan Chen, both from the School of Information and Software Engineering, East China Jiaotong University. It was published in 2025 in the internationally renowned journal Briefings in Bioinformatics (Volume 26, Issue 2, bbaf075), and is available as an open access publication.

Research Workflow and Method Details

1. Data Preparation and Feature Fusion

The authors selected a publicly available benchmark dataset (from Wang L., et al., 2023) comprising 4,499 experimentally validated microbe-disease associations among 1,177 microbes and 134 diseases. Four types of similarity were then calculated for microbe-microbe pairs and four for disease-disease pairs:

  • Microbe similarity: Functional similarity (FS), Cosine similarity (COS_MS), Gaussian Interaction Profile similarity (GIP_MS), Sigmoid kernel similarity (SIG_MS)
  • Disease similarity: Semantic similarity (DS), Cosine similarity (COS_DS), Gaussian Interaction Profile similarity (GIP_DS), Sigmoid kernel similarity (SIG_DS)
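Two of the four similarity types can be illustrated from the association matrix alone. The sketch below assumes the commonly used definitions of the Gaussian interaction profile kernel (bandwidth normalised by the mean squared profile norm) and of cosine similarity. The paper's exact formulas and bandwidth setting are not given in this summary, and the function names are hypothetical.

```python
import numpy as np

def gip_similarity(profiles):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    Assumed form: K(i, j) = exp(-gamma * ||p_i - p_j||^2), with gamma set to
    1 / (mean squared row norm). This is an assumption, not the paper's code.
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    mean_sq = sq_norms.mean()
    gamma = 1.0 / mean_sq if mean_sq > 0 else 1.0
    # Squared Euclidean distance between every pair of rows.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def cosine_similarity(profiles):
    """Cosine similarity between the rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    norms = np.linalg.norm(profiles, axis=1)
    denom = np.outer(norms, norms)
    dots = profiles @ profiles.T
    # Rows without any known association get similarity 0 instead of NaN.
    return np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)

# Toy association matrix: 4 microbes (rows) x 3 diseases (columns).
toy_assoc = np.array([[1, 0, 1],
                      [1, 0, 1],
                      [0, 1, 0],
                      [0, 0, 0]], dtype=float)
gip_ms = gip_similarity(toy_assoc)     # microbe-microbe kernel
gip_ds = gip_similarity(toy_assoc.T)   # disease-disease kernel
cos_ms = cosine_similarity(toy_assoc)
```

Microbes 0 and 1 share the same association profile, so both kernels assign them the maximum similarity of 1. The microbe with no known association gets a cosine similarity of 0 to every other microbe.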

During data fusion, the four microbe similarities were averaged to produce the microbe similarity matrix (SM), and the four disease similarities were averaged to produce the disease similarity matrix (SD). The authors then combined these two fused similarity matrices with the microbe-disease association matrix to construct the overall fused matrix X used by the subsequent algorithms.
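The fusion step can be sketched as follows. The element-wise mean follows the description above. The block layout [[SM, A], [A^T, SD]] for X is an assumption borrowed from common practice in heterogeneous-network matrix completion and may differ from the authors' construction. All function and variable names are hypothetical.

```python
import numpy as np

def fuse_similarities(kernels):
    """Element-wise mean of equally shaped similarity matrices."""
    return np.mean(np.stack(kernels), axis=0)

def build_fused_matrix(sm, sd, assoc):
    """Combine SM, SD and the association matrix A into one square matrix.

    `assoc` has shape (n_microbes, n_diseases). The layout
    [[SM, A], [A^T, SD]] is an assumed convention.
    """
    n_m, n_d = assoc.shape
    if sm.shape != (n_m, n_m) or sd.shape != (n_d, n_d):
        raise ValueError("similarity matrices do not match the association matrix")
    return np.block([[sm, assoc], [assoc.T, sd]])

rng = np.random.default_rng(0)
assoc = (rng.random((5, 3)) < 0.3).astype(float)   # toy 5 microbes x 3 diseases
# Identity matrices stand in for FS, COS_MS, GIP_MS, SIG_MS and for the
# four disease similarities; real kernels would be computed from data.
microbe_kernels = [np.eye(5) for _ in range(4)]
disease_kernels = [np.eye(3) for _ in range(4)]
sm = fuse_similarities(microbe_kernels)
sd = fuse_similarities(disease_kernels)
x = build_fused_matrix(sm, sd, assoc)
```

Under this layout, the unknown entries to be completed sit in the off-diagonal association blocks, while the similarity blocks supply side information that shapes the low-rank structure.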

2. SABMDA: Ensemble Learning Matrix Completion Framework

This study introduces a novel ensemble learning framework SABMDA (Similarity and Adjacency Based Matrix completion for Disease-microbe Association), comprising two core modules:

a) Matrix Completion Based on Singular Value Thresholding (SVT)

The SVT algorithm is a classic matrix completion technique, originally used for the “Netflix problem” of predicting large-scale user-item preferences. SABMDA brings it to microbe-disease prediction by applying it to the fused matrix: a soft-threshold rule iteratively shrinks the singular values to obtain a low-rank reconstruction, which provides preliminary scores for the unknown associations. Key procedures include:

  • Iterative update of the scoring matrix X, producing new matrix Xi at each round
  • Utilizing Lagrange multipliers and the Uzawa algorithm to achieve constrained optimization
  • Normalizing the result using the sigmoid function to constrain all association scores within the [0,1] interval
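The steps above can be sketched with the standard SVT iteration, in which a soft-thresholded SVD alternates with a Uzawa-style update of the multiplier matrix. This is a minimal illustration under that assumption, not the authors' implementation. The default values mirror the configuration reported in the parameter analysis (τ=10, δk=0.1, n=500), and the function names are hypothetical.

```python
import numpy as np

def shrink(matrix, tau):
    """Soft-threshold the singular values of `matrix` by `tau`."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt

def svt_complete(m, observed, tau=10.0, delta=0.1, n_iter=500):
    """Standard SVT iteration.

    m        : matrix holding the observed values
    observed : boolean mask, True where an entry of `m` is known
    """
    m = np.asarray(m, dtype=float)
    y = np.zeros_like(m)            # multiplier matrix
    x = np.zeros_like(m)
    for _ in range(n_iter):
        x = shrink(y, tau)                   # low-rank estimate X_i
        y = y + delta * observed * (m - x)   # Uzawa step on observed entries
    return x

def sigmoid(x):
    """Map scores into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-x))

# Sanity check on a fully observed rank-1 matrix with a small threshold.
m = np.outer([1.0, 2.0, 3.0], [1.0, 1.0])
full_mask = np.ones_like(m, dtype=bool)
recovered = svt_complete(m, full_mask, tau=0.5, delta=1.0, n_iter=5)
scores = sigmoid(recovered)
```

With every entry observed and a unit step size, the iteration reproduces the rank-1 input after a few rounds. On a real, sparsely observed matrix the result depends strongly on τ, δk, and the number of iterations.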

b) Optimization via Bounded Nuclear Norm Regularization (BNNR)

For further robustness, SABMDA applies Bounded Nuclear Norm Regularization after SVT completion, adding bound constraints to the score matrix (all scores are forced to lie between 0 and 1) and tolerating unavoidable data noise. This step uses the Alternating Direction Method of Multipliers (ADMM) for efficient iteration, so that the optimized score matrix is both low-rank and consistent with the observed entries, which improves the reliability and generalizability of the predictions.
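A minimal ADMM sketch of this step is shown below. It assumes the usual BNNR formulation: minimise the nuclear norm of X plus (α/2) times the squared error on observed entries, subject to X = W and 0 ≤ W ≤ 1. How the SVT output is passed into this stage in SABMDA is not specified in this summary, so the example simply runs on a toy matrix. Function names, the stopping rule, and the iteration cap are hypothetical.

```python
import numpy as np

def shrink(matrix, tau):
    """Soft-threshold the singular values of `matrix` by `tau`."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt

def bnnr_complete(t, observed, alpha=1.0, beta=50.0, n_iter=200, tol=1e-4):
    """ADMM for bounded nuclear norm regularization (assumed formulation)."""
    t = np.asarray(t, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    x = np.where(observed, t, 0.0)   # low-rank variable
    w = x.copy()                     # bounded auxiliary variable
    y = np.zeros_like(x)             # Lagrange multiplier for X = W
    for _ in range(n_iter):
        # W-step: closed-form minimiser of the quadratic terms ...
        w = x + y / beta
        w[observed] = (alpha * t[observed] + beta * x[observed]
                       + y[observed]) / (alpha + beta)
        # ... followed by projection onto the bound constraints.
        w = np.clip(w, 0.0, 1.0)
        # X-step: singular value shrinkage with threshold 1 / beta.
        x_new = shrink(w - y / beta, 1.0 / beta)
        # Dual update.
        y = y + beta * (x_new - w)
        change = np.linalg.norm(x_new - x) / max(np.linalg.norm(x), 1e-12)
        x = x_new
        if change < tol:
            break
    return w   # returning W guarantees scores inside [0, 1]

rng = np.random.default_rng(0)
truth = np.outer(rng.random(6), rng.random(5))   # rank-1, entries in [0, 1]
mask = rng.random(truth.shape) < 0.7             # about 70% of entries observed
scores = bnnr_complete(truth, mask)
```

Because the fitting term is a penalty rather than a hard constraint, observed entries are matched only approximately. That is how the formulation tolerates noise, with α and β controlling the trade-off.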

3. Experimental Design and Evaluation Process

The study employs the following rigorous experimental splits and evaluation metrics:

  • 5-fold cross-validation, 10-fold cross-validation, and independent test (split by disease row at a ratio of 8:1:1) to comprehensively assess the model’s generalization ability.
  • Metrics evaluated include AUC (area under the ROC curve), AUPR (area under the PR curve), accuracy, precision, recall, and F1-score.
  • Systematic parameter sensitivity analysis to optimize threshold τ, step size δk, iteration count n, regularization parameter α, and penalty parameter β, ultimately determining the best configuration (τ=10, δk=0.1, n=500, α=1.0, β=50.0).
  • Ablation experiments: separately removing the SVT and BNNR submodules to verify the performance improvements gained from their combination.
  • Comparison with 7 state-of-the-art baseline methods, including: SGJMDA, DSAE_RF, AMHMDA, MHCLMDA, MNNMDA, LRLSHMDA, NTSHMDA.
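The evaluation protocol can be illustrated with a small, dependency-free sketch of fold assignment and of the two ranking metrics. It shows generic definitions only: how the paper samples negatives, breaks ties, or computes AUPR may differ, and all names are hypothetical.

```python
import random

def kfold_indices(n_items, k, seed=0):
    """Shuffle item indices and deal them into k folds of near-equal size."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative.

    Ties count as half a win. The pairwise loop is O(P * N), which is
    fine for a sketch but slow for large test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Average precision, a common step-wise estimate of AUPR.

    Tied scores keep their input order.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits

folds = kfold_indices(10, 5)
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)             # 3 of 4 positive-negative pairs ordered correctly
ap = average_precision(labels, scores)    # (1/1 + 2/3) / 2
```

In a cross-validation run, the known associations in each held-out fold would be masked before completion and then scored together with the candidate negatives.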

Main Research Results

1. Parameter Sensitivity and Optimization

Systematic parameter tuning showed that a low SVT threshold (τ=10) and a small step size (δk=0.1) yield optimal performance, with the best results achieved at 500 iterations. The regularization parameter α and penalty parameter β, set to 1.0 and 50.0 respectively, balance the low-rank constraint against the fitting error.

2. Ablation Experiment Results

Ablation experiments demonstrated that both SVT and BNNR modules are indispensable: using SVT or BNNR alone is insufficient to match the high accuracy achieved through the SABMDA ensemble approach. The dual matrix completion process incrementally fills the missing values in the original matrix, significantly enhancing the predictive ability of the completed matrix.

3. Performance Compared to Mainstream Methods

  • In 10-fold CV testing, SABMDA achieved an AUC of 0.9934 and an AUPR of 0.9930, outperforming all compared methods (for example, SGJMDA reached an AUC of 0.9495).
  • SABMDA also performed excellently in 5-fold CV and independent tests, leading in accuracy, recall, F1-score, and other comprehensive metrics, with statistical significance.
  • The model also demonstrated wide applicability to other public datasets (such as the miRNA-disease association dataset HMDD v3.2), achieving AUC = 0.9475 and AUPR = 0.9540.

4. Case Studies

Using diseases such as obesity and asthma as examples, the authors hid the known associations of the disease under study and then used SABMDA to predict a set of candidate microbes. The reported changes in abundance (increase or decrease) of these candidate microbes in patients with the relevant disease were checked against recent PubMed literature. For example, in obesity, candidates such as Haemophilus, Paraprevotella, and Akkermansia received empirical support; similarly, for asthma, evidence was found for candidates such as Bifidobacterium, Helicobacter pylori, and Faecalibacterium. For other cases such as Crohn’s disease, previously unknown microbe associations proposed by the model also provide clues for further experimental validation.

Research Conclusions and Significance

This paper proposes and systematically validates SABMDA, an ensemble learning strategy based on matrix completion that achieves state-of-the-art performance in microbe-disease association prediction. Its scientific value lies in:

  • The use of multi-source heterogeneous biomedical information to deeply integrate the complex relationships between diseases and microbes, achieving theoretical and methodological breakthroughs over traditional methods.
  • The developed two-stage matrix completion strategy not only improves robustness but also addresses the issue of conventional machine learning models being vulnerable to noise in large-scale missing data scenarios.
  • The approach's extensibility to fields such as disease diagnosis, drug development, and personalized microbiome medicine, building a bridge between basic science and translational medicine.

Research Highlights and Innovations

  1. Theoretical innovation: For the first time, SVT and BNNR matrix completion algorithms are applied in a multi-level ensemble in this field, effectively combining low-rank constraints, boundary constraints, and noise tolerance.
  2. Rigorous experimentation: Full-scale ablation analysis, use of multiple benchmark datasets, and cross-validation with various metrics ensure objectivity and reference value in the results.
  3. Cutting-edge data processing strategies: Multi-source heterogeneous feature engineering and scientifically rigorous feature fusion methods significantly enhance synergistic utility.
  4. Extensive industrial and application prospects: Code has been made publicly available (https://github.com/iamchenhailin/sabmda), facilitating rapid expansion, reproduction, and application in the research community.
  5. Significant biological implications: Reveals several novel potential microbe-disease associations, providing important reference points for further mechanistic and experimental studies.

Other Valuable Information

The authors declare no conflicts of interest; this research was funded by the Jiangxi Provincial Natural Science Foundation (No. 20242BAB25083). Both data and algorithms have been made publicly available to facilitate validation and extension by the global bioinformatics community. The article also notes that the current predictions concern “association” rather than “causation,” and that the actual pathogenic or protective mechanisms between microbes and diseases require further mechanistic experimental research, which points to a direction for future studies in the field.

This work marks a notable advance in microbe-disease association prediction, and its data integration strategy and algorithmic design also open new ground for the analysis of complex biological network data and for association inference.