Bias-Aware Training and Evaluation of Link Prediction Algorithms in Network Biology

Revealing “Rich Node” Bias in Link Prediction Algorithms, and Its Countermeasures: An Interpretation of “Bias-aware Training and Evaluation of Link Prediction Algorithms in Network Biology”

1. Academic Background and Motivation

Over the past decade, network biology has played an increasingly important role in uncovering the associations and functions of biomolecules. As large-scale mapping data such as protein–protein interactions (PPIs) and disease–gene relationships continue to expand, link prediction based on graph machine learning (i.e., inferring potential associations between nodes in a network) has become a core tool in bioinformatics and computational biology. Link prediction algorithms are widely applied to discover unknown biological associations, benefiting drug target discovery, research on disease mechanisms, prioritization of experimental candidates, and various other biomedical applications, and this utility has driven a steady stream of new algorithms.

However, the field has also exposed a serious blind spot in evaluation. Mainstream evaluation frameworks for link prediction either sample network edges uniformly at random or perform “across-time” evaluation using network snapshots from different time points. The author team shows that both standard schemes induce “rich node” bias (also called degree bias): nodes with high degree (i.e., nodes with many known connections, such as well-studied proteins) are excessively favored in both evaluation and the train-test pipeline. This bias not only skews the objectivity of algorithm comparisons but also narrows the room for discovering novel associations involving understudied or rare nodes, pushing the field into a Matthew-effect cycle (“the rich get richer”).

In response, the present study systematically clarifies and empirically demonstrates the concrete impact of this bias in current evaluation systems, proposes a comprehensive bias-aware framework (the “AWARE” principles) for training, evaluating, and comparing link prediction algorithms, and develops accompanying optimization algorithms, evaluation tools, and methodologies. Together, these advances promise a fairer, more innovative, and more diverse assessment mechanism for the field.

2. Publication Venue and Authors

This research was jointly completed by Serhan Yilmaz, Kaan Yorgancioglu, and Mehmet Koyutürk from the Department of Computer and Data Sciences at Case Western Reserve University; Serhan Yilmaz and Kaan Yorgancioglu are co-first authors. The study was published on June 10, 2025, in PNAS (Proceedings of the National Academy of Sciences), article e2416646122, and was edited by Professor Bonnie Berger of MIT.

3. Detailed Research Workflow

1. Comprehensive Survey and Experimental Design

Taking link prediction on protein–protein interaction networks as its running example, this study systematically reveals the systemic bias toward high-degree nodes in mainstream training and testing pipelines and proposes a series of highly actionable new methods. The main workflow comprises:

(1) Algorithm Bias Quantification and Benchmark Design

  • Bias Quantification: The “preferential attachment” model, which scores node pairs purely by the product of their degrees and thus has an extreme preference for rich nodes, is used as a reference. Each link prediction algorithm’s predictions are compared against this reference for overlap: the higher the overlap, the more the algorithm depends on node degree. Bias is summarized as a signed area-under-curve score (relative to an unbiased baseline), where positive values indicate a preference for high-degree nodes and negative values a preference for low-degree nodes (see the sketch after this list).
  • Selection and Adaptation of Benchmark Algorithms: Multiple typical classes of link prediction algorithms are selected (local scoring, network propagation, embedding learning, etc.), and within each class a high-bias/low-bias pair is designed as a control. For example: Common Neighbors (high bias) vs. Jaccard Index (low bias); DeepWalk with degree (high bias) vs. DeepWalk (low bias); L3 (high bias) vs. L3n (low bias).
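
To make the contrast concrete, here is a minimal Python sketch (not the paper’s implementation; the toy graph, candidate pairs, and the use of Spearman rank correlation with the preferential-attachment score as a bias proxy are illustrative assumptions) showing why Common Neighbors tracks node degree far more closely than the degree-normalized Jaccard index:

```python
import networkx as nx
import numpy as np
from scipy.stats import spearmanr

def score_pairs(G, pairs):
    """Score candidate node pairs with three classic link predictors."""
    cn, jac, pa = [], [], []
    for u, v in pairs:
        nu, nv = set(G[u]), set(G[v])
        inter, union = len(nu & nv), len(nu | nv)
        cn.append(inter)                              # Common Neighbors (high bias)
        jac.append(inter / union if union else 0.0)   # Jaccard (degree-normalized)
        pa.append(G.degree(u) * G.degree(v))          # preferential attachment
    return np.array(cn), np.array(jac), np.array(pa)

# Toy usage on a scale-free graph; candidates are unconnected pairs.
G = nx.barabasi_albert_graph(500, 3, seed=0)
pairs = list(nx.non_edges(G))[:5000]
cn, jac, pa = score_pairs(G, pairs)
rho_cn, _ = spearmanr(cn, pa)
rho_jac, _ = spearmanr(jac, pa)
print(f"rank corr. with PA: CN = {rho_cn:.2f}, Jaccard = {rho_jac:.2f}")
```

On such a graph the Common Neighbors score correlates strongly with preferential attachment, while Jaccard’s normalization by the neighborhood union weakens that dependence; this is the intuition behind the high-bias/low-bias pairings.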

(2) Analysis of Bias in Standard Evaluation Frameworks

  • Simulation in Two Standard Evaluation Scenarios:
    • Edge-uniform sampling: Uniform random sampling of 10% of network edges as the test set.
    • “Across-time” snapshot evaluation: Using an earlier network (e.g., 2020) as the training set and a later snapshot (e.g., 2022) as the test set.
  • Quantification of Multi-Dimensional Performance Metrics:
    • Two Precision-Recall (PR) curve-based metrics, AUPR and AUlogPR, separately quantify late-stage and early-stage prediction ability. AUPR reflects large-scale, late-stage performance; AUlogPR captures high-precision early-stage capability, matching practical prediction needs (see the sketch below).
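
The following hedged sketch illustrates the edge-uniform protocol together with both metrics. The AUlogPR definition used here (area under precision versus log recall, normalized by the log-recall range) is an assumption made for illustration and may differ from the paper’s exact formula:

```python
import numpy as np
import networkx as nx
from sklearn.metrics import precision_recall_curve, auc

rng = np.random.default_rng(0)
G = nx.barabasi_albert_graph(1000, 4, seed=0)

# Edge-uniform sampling: hide a uniform random 10% of edges as test positives.
edges = list(G.edges())
test_idx = rng.choice(len(edges), size=len(edges) // 10, replace=False)
test_pos = [edges[i] for i in test_idx]
G_train = G.copy()
G_train.remove_edges_from(test_pos)

# Negatives: an equal number of uniformly sampled non-edges.
non_edges = list(nx.non_edges(G))
neg_idx = rng.choice(len(non_edges), size=len(test_pos), replace=False)
test_neg = [non_edges[i] for i in neg_idx]

# Score every test pair on the training graph (preferential attachment here,
# purely as a stand-in for any link predictor under evaluation).
pairs = test_pos + test_neg
y_true = np.array([1] * len(test_pos) + [0] * len(test_neg))
y_score = np.array([G_train.degree(u) * G_train.degree(v) for u, v in pairs])

prec, rec, _ = precision_recall_curve(y_true, y_score)
prec, rec = prec[::-1], rec[::-1]        # reorder so recall is increasing
aupr = auc(rec, prec)                    # standard area under the PR curve

mask = rec > 0                           # log-recall axis needs recall > 0
log_rec = np.log(rec[mask])
aulogpr = auc(log_rec, prec[mask]) / (log_rec[-1] - log_rec[0])
print(f"AUPR = {aupr:.3f}, AUlogPR = {aulogpr:.3f}")
```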

(3) Dissecting the Bias Structure in Benchmark Datasets

  • Node Stratification and Edge Class Analysis: All nodes are divided by degree into poor, moderate, and rich strata. By analyzing the composition of edges among these node types, the “influence” of each node type on the evaluation metrics is calculated, showing that rich nodes dominate while poor nodes are marginalized.
  • KS Test for Separability: Using the preferential attachment score as a yardstick, the separability between positive samples (hidden true edges) and negative samples (unknown links) in the test set is measured with a two-sample Kolmogorov-Smirnov (KS) test. The results show empirically that node degree information alone almost fully distinguishes positives from negatives; the bias is therefore rooted in structural properties of the data (see the sketch below).
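
A minimal sketch of this check, reusing `G_train`, `test_pos`, and `test_neg` from the evaluation sketch above (the bucketless use of raw preferential-attachment scores is an illustrative choice):

```python
import numpy as np
from scipy.stats import ks_2samp

def degree_separability(G_train, positives, negatives):
    """KS statistic between preferential-attachment scores of the hidden
    true edges and the sampled negatives. Values near 1 mean node degree
    alone separates the two classes almost perfectly."""
    pa = lambda pairs: np.array(
        [G_train.degree(u) * G_train.degree(v) for u, v in pairs])
    stat, _ = ks_2samp(pa(positives), pa(negatives))
    return stat

print(f"KS separability: {degree_separability(G_train, test_pos, test_neg):.2f}")
```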

(4) Developing New Bias-aware Evaluation Frameworks

  • Node-uniform Weighted Metrics: An optimization algorithm is proposed that iteratively assigns a weight to each test edge so that all nodes exert equal influence on the evaluation metrics (a simplified sketch follows this list). Experiments show that this weighted evaluation improves fairness toward low-degree nodes and effectively mitigates the rich-node bias of traditional evaluation.
  • Stratified Evaluation: Algorithm performance is evaluated in layers by node type or edge type, thoroughly testing the ability of the new evaluation system to distinguish algorithm quality.
  • Five-metric Summary: A suite spanning five key dimensions: algorithmic rich-node bias, plus AUPR and AUlogPR under both conventional and node-uniform weighted evaluation, covering overall as well as stratified (rich/poor node) performance.
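
The paper’s exact optimization is not reproduced here; the following Sinkhorn-style multiplicative-update sketch merely captures the idea of iteratively reweighting test edges until every node carries approximately equal total weight:

```python
import numpy as np

def node_uniform_weights(edges, n_nodes, n_iter=200, tol=1e-8):
    """Iteratively reweight test edges so every node carries roughly equal
    total weight in the evaluation metric (Sinkhorn-style updates; the
    authors' optimization may differ in its exact update rule)."""
    edges = np.asarray(edges)
    w = np.ones(len(edges))
    for _ in range(n_iter):
        # total weight currently carried by each node (sum over incident edges)
        node_w = np.bincount(edges.ravel(), weights=np.repeat(w, 2),
                             minlength=n_nodes)
        target = node_w[node_w > 0].mean()
        # multiplicative correction: geometric mean of both endpoints' factors
        corr = np.sqrt(target / node_w[edges[:, 0]]) * \
               np.sqrt(target / node_w[edges[:, 1]])
        new_w = w * corr
        if np.max(np.abs(new_w - w)) < tol:
            w = new_w
            break
        w = new_w
    return w / w.sum()   # normalized weights to plug into weighted AUPR/AUlogPR
```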

(5) Developing Bias-aware Training Strategies

  • Degree-balanced Negative Sampling: A novel method generates negative training samples while balancing for node degree, reshaping the positive/negative sample structure during training so that algorithms do not over-rely on features of rich nodes (see the sketch after this list).
  • Empirical Analysis of Sampling Strategy Effect on Algorithms with Different Levels of Bias.
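
A hedged sketch of degree-balanced negative sampling; the log-spaced degree buckets and the rejection-sampling retry cap are illustrative choices rather than the authors’ exact procedure:

```python
import numpy as np

def degree_balanced_negatives(G, pos_edges, rng, n_buckets=10, max_tries=50):
    """For each positive edge (u, v), draw a non-adjacent pair whose
    endpoint degrees fall into the same (log-spaced) degree buckets,
    so negatives mirror the degree profile of the positives."""
    degs = dict(G.degree())
    max_deg = max(degs.values())
    bounds = np.logspace(0, np.log10(max_deg + 1), n_buckets + 1)
    bucket = {v: int(np.searchsorted(bounds, d, side="right"))
              for v, d in degs.items()}
    by_bucket = {}
    for v, b in bucket.items():
        by_bucket.setdefault(b, []).append(v)
    negatives = []
    for u, v in pos_edges:
        for _ in range(max_tries):   # rejection sampling with a retry cap
            a = rng.choice(by_bucket[bucket[u]])
            b = rng.choice(by_bucket[bucket[v]])
            if a != b and not G.has_edge(a, b):
                negatives.append((int(a), int(b)))
                break
    return negatives
```

Because negatives are matched to positives in degree, a model can no longer win simply by ranking high-degree pairs above low-degree ones.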

(6) Large-scale Expansion across Databases

  • Generalization across Databases & Time-points: The above strategies for bias analysis and weighted evaluation are applied to multiple PPI databases and subnetworks (BioGRID, STRING), including different evidence types (experimental, text mining, co-expression, etc.), systematically verifying the universality and robustness of the proposed approaches.

2. Datasets and Experimental Details

  • Primary Network Data Sources: Human protein–protein interaction networks from BioGRID and STRING, comprising tens of thousands of proteins and up to millions of interactions.
  • Algorithm Implementation and Evaluation Tools: The authors’ open-source toolkit “colipe” and an accompanying web app execute the entire evaluation and weighting pipeline and are publicly available.
  • All experiments use multiple controls (high/low-bias algorithms, conventional/weighted evaluation, different negative sampling strategies, etc.), with results presented through various metrics (AUPR, AUlogPR, KS statistics, etc.), PR curves, and stratified performance plots.

4. Detailed Research Results

(1) Quantitative Discovery of Rich Node Bias in Mainstream Evaluation Frameworks

  • In both edge-uniform sampling and across-time snapshot evaluations, rich nodes account for 70%-90% of the total influence on the metrics, while the numerically dominant poor nodes receive only 5%-8%. This means an algorithm that simply “learns” degree features can obtain extremely high scores, while its genuine ability to discover novel low-degree associations is severely underestimated.
  • KS test results show that, under standard evaluation, positive and negative samples can be separated by node degree alone with 60%-75% discriminability: rich nodes dominate nearly all true edges while negatives concentrate among low-degree node pairs, causing many genuinely predictable low-degree edges to be ignored.

(2) Bias Profiles of Different Algorithms

  • Common neighbors, L3, DeepWalk-with-degree, LINE, and other typical algorithms are fundamentally reliant on node degree, displaying strong rich node preference. Algorithms such as Jaccard, DeepWalk, L3n, and Von Neumann that use degree normalization exhibit dramatically reduced bias, with some weak preference for low-degree nodes.
  • High-bias algorithms excel in standard evaluation (for both overall and early-stage performance), but this advantage is based solely on node degree and does not reflect the ability to discover novel, structurally latent biological relationships.

(3) Corrective Effects of New Bias-aware Evaluation Systems

  • Under node-uniform weighting, poor nodes receive more than 40% of the evaluation weight. Algorithms that topped the standard evaluation (e.g., L3) lose their dominant position, while low-bias algorithms become the main outperformers, further demonstrating that the weighted system better reflects an algorithm’s true ability to discover novel or rare node pairs.
  • Stratified evaluations show that low-bias models such as Von Neumann perform consistently across all edge types, unaffected by rich-node domination, while L3-type algorithms excel only on rich-rich pairs and perform near-randomly on poor node pairs.
  • The five-metric summary (bias quantification, standard AUPR/AUlogPR, weighted AUPR/AUlogPR) offers a multidimensional view of overall performance, fairness, and early precision, making it easier for the community to judge algorithmic innovation and scientific value objectively.

(4) Bias-aware Sampling Strategies Improve Low-degree Node Prediction

  • Introducing degree-balanced sampling for training, especially for high-bias algorithms like LINE, not only significantly reduces rich node bias but also substantially improves prediction ability for low-degree nodes. This shows that bias-aware data sampling itself is a powerful tool for boosting algorithmic innovation.
  • For algorithms that are already low-bias, excessively degree-balanced sampling may result in reversed bias (over-favoring low-degree nodes to the detriment of overall performance). This highlights the need for dynamic and targeted sampling strategies, rather than a one-size-fits-all approach.

(5) Broad Generalization and Robustness Verification

  • Across datasets from different time snapshots and evidence types, a structural bias favoring rich nodes is consistently observed, while the “weighted evaluation + bias-aware sampling” framework universally and effectively reduces most evaluation biases. This provides a methodological guarantee for innovative evaluation across all network data types in the field.

5. Conclusions, Significance, and Study Highlights

This study, for the first time, uses a combined approach of systematic experiments, theory, and algorithm development to fundamentally reveal and correct the “rich node” bias in link prediction algorithms in biological networks. The scientific and practical value includes:

  • Scientific Significance: Establishes a bias-aware foundation for developing and assessing algorithms in network machine learning, preventing illusory progress and pushing innovation toward understudied nodes.
  • Practical and Applied Value: Improves the ability to capture novel, high-value biomolecules in downstream applications such as protein interaction prediction, disease gene discovery, and drug target screening, and encourages joint experimental and computational attention to rare proteins.
  • Study Highlights: Proposes and implements a brand-new evaluation and training workflow under the “AWARE” philosophy, combining methodological innovation, practical algorithm/tool usability, and domain universality. The self-developed evaluation algorithms and tools (such as colipe) are open to the entire field, improving reproducibility and community efficiency.

6. Further Content and the AWARE Principles

The study proposes five AWARE principles for bias-aware evaluation and algorithm development, providing the community with systematic operational recommendations:

  • Analyze bias in algorithms: Systematically assess the dependence of prediction algorithms on node degree;
  • Watch out for bias in benchmarking data: Quantitatively measure the extent to which evaluation datasets incentivize biased prediction; advocate for diverse evaluation scenarios;
  • Assess prediction performance on understudied entities: Emphasize performance on low-degree nodes, use stratified or weighted evaluation;
  • Review diverse metrics to draw conclusions: Compare algorithm performance multidimensionally to prevent misleading results from single metrics;
  • Engage in bias-aware training: Introduce effective sampling or normalization strategies in training to ensure both scientific value and innovation.

The accompanying open-source toolkit, multi-database datasets, and web tools will also support broad adoption and continued innovation throughout the community.

7. Summary

The work by Serhan Yilmaz et al. in PNAS represents a milestone for link prediction research in biological networks. Through quantitative analysis, methodological innovation, and tool development, the authors have, for the first time, clarified the structural bias jointly shaped by network data and evaluation systems, and have provided theoretical and practical guidelines for more fair, comprehensive, and innovative molecular discovery and network modeling. The research not only provides theoretical inspiration but also solid methodological support for algorithm development and practical applications. Looking ahead, with more open evaluations and algorithms prioritizing rare nodes, this field is poised for richer and more dynamic scientific exploration.