Regularly Updated Benchmark Sets for Statistically Correct Evaluations of AlphaFold Applications
Academic Background: Crossing into a New Era of Protein Structure Prediction
Protein structure determination has long stood as one of the central challenges in molecular biology and the life sciences. Traditional experimental methods such as X-ray crystallography, nuclear magnetic resonance (NMR), and cryo-electron microscopy have provided a solid foundation for studying protein three-dimensional structures. However, because of complex sample preparation, high time costs, and limited applicability to certain proteins, these methods cannot cover the entire proteome. Since DeepMind introduced AlphaFold2 (AF2) in 2020, protein structure prediction has undergone a revolutionary transformation: AlphaFold2 uses deep learning to deliver high-quality structure predictions for virtually all known protein sequences, greatly expanding structural coverage and profoundly influencing biomedicine, fundamental life science, and drug design.
Significantly, shortly after AlphaFold2's release its predicted-structure database was established and made publicly available, triggering a surge of secondary development and applications built on AF2 structures in the academic community. Thousands of papers have used AlphaFold2 structures to study protein stability, conformational heterogeneity, protein function, complex interfaces, domain definition, and intrinsically disordered regions, among other topics. Hot on its heels, the more advanced AlphaFold3 (AF3) goes further by predicting complex interactions such as protein–small molecule (ligand) and protein–nucleic acid interfaces, suggesting that AI-based protein structural research will continue to deepen and expand into an even broader biological application landscape.
However, this wave of technological advancement brings with it an issue that the academic community has severely underestimated or even overlooked—data leakage. Data leakage refers to the situation in machine learning evaluations where test-set samples are homologous to, or overlap excessively with, the training set, so that the resulting statistics are no longer valid and tend to overestimate model capability. As downstream applications built on AF2 have proliferated, the data-leakage checks originally recommended as standard practice have often been neglected. Maintaining a strict boundary between “training” and “test” sets and avoiding leakage of homologous structures is fundamental to the scientific validity of evaluation conclusions.
Article Source and Author Background
The paper “Regularly updated benchmark sets for statistically correct evaluations of AlphaFold applications” was authored by Laszlo Dobson (corresponding author), Gábor E. Tusnády, and Peter Tompa. The research team is affiliated with structural biology and bioinformatics institutions including the Institute of Molecular Life Sciences Research (Hungary), the Department of Bioinformatics at Semmelweis University (Hungary), the VIB-VUB Center for Structural Biology (Belgium), and the National Institute of Oncology (Hungary). The article was published in Volume 26, Issue 2 of Briefings in Bioinformatics in 2025 (DOI: 10.1093/bib/bbaf104) as a “Problem Solving Protocol” and released by Oxford University Press as open access under a Creative Commons Attribution Non-Commercial License.
Research Workflow: Building an Authoritative, Leak-Free Benchmark Dataset for AlphaFold Applications
1. Research Objective and Innovative Approach
The core goal of this study is to provide a regularly updated, authoritative benchmark dataset that ensures scientifically reliable statistical evaluations across the wide range of applications based on AlphaFold2 and AlphaFold3. To this end, the team tackled the critical yet often overlooked problem of homologous-structure data leakage in machine learning, designed a rigorous screening and filtering process, and developed a dedicated benchmark dataset called “beta.” The dataset covers structural and sequence data across multiple scenarios and is specifically designed to supply high-quality independent test sets for the various AlphaFold application contexts.
2. Construction Pipeline of the Beta Benchmark Dataset
(1) Database Acquisition and Temporal Thresholds
The research team first downloaded the then-current versions of the PDB (Protein Data Bank), the SwissProt component of UniProt, and the BioGRID protein interaction database on May 21, 2024. Following the different training and template cutoff dates of AlphaFold2/3, multiple time thresholds were set, including April 30, 2018; May 31, 2020; February 15, 2021; September 30, 2021; July 15, 2022; November 1, 2022; January 1, 2023; and January 1, 2024. Each threshold splits the data into “known” structures (released on or before the cutoff) and “newly resolved” structures (released afterward), ensuring that all test samples are “blind” items unseen during training.
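As a minimal illustration of this temporal split, the sketch below partitions entries by release date against one of the cutoffs named above; the input file and its column names are hypothetical stand-ins for the real PDB metadata.

```python
# Minimal sketch of the temporal split; the input file and its columns
# ("pdb_id", "release_date") are hypothetical stand-ins for the real metadata.
from datetime import date
import csv

# Cutoff dates tied to the AlphaFold training sets (see text); the remaining
# thresholds listed above can be added in the same way.
CUTOFFS = {
    "af2_training": date(2018, 4, 30),
    "af3_training": date(2021, 9, 30),
}

def split_by_cutoff(rows, cutoff):
    """Split entries into "known" (released on/before the cutoff) and "blind" (after)."""
    known, blind = [], []
    for row in rows:
        released = date.fromisoformat(row["release_date"])
        (known if released <= cutoff else blind).append(row["pdb_id"])
    return known, blind

with open("pdb_release_dates.csv") as fh:  # hypothetical metadata export
    rows = list(csv.DictReader(fh))

known, blind = split_by_cutoff(rows, CUTOFFS["af2_training"])
print(f"known: {len(known)}, candidate blind test structures: {len(blind)}")
```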
(2) Homology Screening and Filtering Algorithms
To minimize the leakage of homologous proteins and maximize the independence of the benchmark dataset, the author team employed the following multi-step screening methods (a code sketch of the hit filtering appears below):
- Sequence Homology Search: Using PSI-BLAST (E-value cutoff 0.0001, three iterations, up to 50,000 target sequences), new structures released after the cutoff date (the query set) were searched against previously known structures; any hit with an alignment longer than 10 amino acids and sequence identity above 20% was considered homologous.
- Structure Homology Search: Using the Foldseek tool (up to 50,000 target structures), alignments longer than 10 amino acids with a TM-score above 0.25 were considered structurally homologous and filtered out.
- SwissProt–Structure Database Crosscheck: Similarly, PSI-BLAST was applied to compare SwissProt protein sequences against the structure database, covering all available protein sequence resources.
Special refinement: for cutoff dates tied to the AlphaFold training sets (such as April 30, 2018 and September 30, 2021), the authors additionally excluded NMR structures, since AlphaFold did not use NMR data as direct training templates.
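Under the assumption that the PSI-BLAST and Foldseek searches were exported as tab-separated hit tables (BLAST+ `-outfmt 6` for PSI-BLAST; a Foldseek output whose first columns give query, target, alignment length, and TM-score), the hit-filtering logic might look roughly like the sketch below; file names and column layout are assumptions, not the authors' pipeline.

```python
# Sketch of the hit-filtering logic described above; not the authors' exact code.
# Assumptions: PSI-BLAST hits in BLAST+ tabular format (-outfmt 6), and a Foldseek
# TSV whose first four columns are query, target, alignment length, TM-score.

def leaky_queries_blast(path, min_len=10, min_identity=20.0):
    """Queries with at least one hit longer than 10 aa and above 20% identity."""
    leaky = set()
    with open(path) as fh:
        for line in fh:
            cols = line.rstrip("\n").split("\t")
            qid, pident, alnlen = cols[0], float(cols[2]), int(cols[3])
            if alnlen > min_len and pident > min_identity:
                leaky.add(qid)
    return leaky

def leaky_queries_foldseek(path, min_len=10, min_tm=0.25):
    """Queries with a structural hit longer than 10 aa and TM-score above 0.25."""
    leaky = set()
    with open(path) as fh:
        for line in fh:
            qid, _target, alnlen, tm = line.rstrip("\n").split("\t")[:4]
            if int(alnlen) > min_len and float(tm) > min_tm:
                leaky.add(qid)
    return leaky

# Chains released after the cutoff that survive both filters are candidates for
# the leak-free "beta" set (file names below are placeholders).
new_chains = set()  # in practice: all chain IDs released after the chosen cutoff
leaky = leaky_queries_blast("psiblast_hits.tsv") | leaky_queries_foldseek("foldseek_hits.tsv")
beta_candidates = new_chains - leaky
```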
(3) Protein Interaction and Structural State Assessment
The authors used the Voronota tool to automatically detect all interchain interactions in PDB structures (based on the first oligomeric state in the PDBe database), and used BioGRID (selecting only data marked as “direct interactions”) to supplement SwissProt protein interaction information, enabling robust data integration for downstream complex analyses.
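Voronota derives inter-chain contacts from a Voronoi tessellation of atomic balls; as a simplified illustration only, the sketch below instead flags residue pairs from different chains within a plain 5 Å distance cutoff (an assumed value) using Biopython — a stand-in for, not a reproduction of, the Voronota step.

```python
# Simplified stand-in for the interaction-detection step: a plain distance cutoff
# with Biopython, rather than Voronota's tessellation-based contacts.
from Bio.PDB import PDBParser, NeighborSearch

def interchain_contacts(pdb_path, cutoff=5.0):
    """Return residue pairs from different chains with any atoms within `cutoff` Å."""
    model = PDBParser(QUIET=True).get_structure("s", pdb_path)[0]  # first model only
    atoms = list(model.get_atoms())
    contacts = set()
    for res_a, res_b in NeighborSearch(atoms).search_all(cutoff, level="R"):
        chain_a, chain_b = res_a.get_parent().id, res_b.get_parent().id
        if chain_a != chain_b:  # keep only inter-chain pairs
            contacts.add(((chain_a, res_a.id[1]), (chain_b, res_b.id[1])))
    return contacts

print(len(interchain_contacts("example_assembly.pdb")))  # placeholder file name
```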
(4) Multi-database Integration and Generation of Beta Dataset
Based on the above stringent homology filtering process, the final “beta” dataset contains the following four categories:
- Monomeric PDB chains without any homologous proteins
- PDB chain pairs (interacting pairs) with no history of homology for either partner
- Full-length SwissProt protein sequences not covered by structural databases
- Interacting SwissProt protein pairs, both without any homologous historical records
The technical pipeline, database relationships, and data flows at each step are presented in detail in Figure 1. All code and datasets are freely available for download and further development at https://beta.pbrg.hu and https://github.com/brgenzim/beta.
3. Practical Case Study: Predicting Protein Intrinsically Disordered Regions (IDRs)
To validate the practical utility of the beta dataset and its value in eliminating data leakage, the authors systematically experimented with “using AlphaFold structural information to predict protein intrinsically disordered regions” as a case study.
(1) Definition of Disordered Regions and Data Integration
The team first aggregated all monomeric protein structures from the PDB and clustered the protein sequences at 40% sequence identity using the CD-HIT tool. A residue was defined as disordered if its side-chain coordinates were missing, a simple and consistent criterion that aligns with mainstream resources such as DisProt (the database of intrinsically disordered protein regions) and MobiDB. All disordered segments shorter than 10 amino acids were removed to prevent statistical bias.
Structure-to-sequence mapping was performed with the SIFTS (Structure Integration with Function, Taxonomy, and Sequences) resource, which maps UniProt identifiers to PDB chains and residue positions. Each analyzed residue was ultimately annotated with its ordered/disordered status, its pLDDT confidence score from the AF2 structure, and whether it belonged to the homologous or the beta (leak-free) subset.
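As an illustration of the per-residue annotation, the sketch below reads pLDDT values from an AlphaFold model in PDB format, where the score is stored in the B-factor column; the file name is a hypothetical AlphaFold DB entry, and the SIFTS-based ordered/disordered labelling is omitted.

```python
# Sketch of per-residue pLDDT extraction; AlphaFold models store pLDDT in the
# B-factor column of the PDB file. File name is a hypothetical AF-DB entry.
from Bio.PDB import PDBParser

def residue_plddt(model_path):
    """Map residue number -> pLDDT, read from the CA atom's B-factor field."""
    structure = PDBParser(QUIET=True).get_structure("af_model", model_path)
    scores = {}
    for residue in structure[0].get_residues():
        if "CA" in residue:  # skip heteroatoms/waters without a CA atom
            scores[residue.id[1]] = residue["CA"].get_bfactor()
    return scores

plddt = residue_plddt("AF-P12345-F1-model_v4.pdb")  # hypothetical AF-DB file name
# Each residue can then be labelled ordered/disordered from the SIFTS-mapped PDB
# evidence and tagged as belonging to the homologous or the beta subset.
```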
(2) Distribution of pLDDT (predicted Local Distance Difference Test) Scores and Evaluation of Predictive Power
pLDDT scores have previously been shown to be useful for predicting disordered protein regions. The authors compared the pLDDT distributions of all analyzed residues in “all structures” versus the “beta independent structures,” using the Kolmogorov-Smirnov (K-S) test to confirm that the difference is statistically significant. They then swept the pLDDT cutoff and identified the value giving the highest balanced accuracy. The results showed that both the optimal cutoff and the prediction accuracy shift markedly once homologous leakage is fully excluded.
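A minimal sketch of these two evaluations is given below, using placeholder arrays in place of the real residue annotations (pLDDT rescaled to a 0–1 range to match the cutoffs quoted later in the text).

```python
# Sketch only: placeholder arrays stand in for the real per-residue annotations.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)
plddt_all = rng.uniform(0.2, 1.0, 5000)       # pLDDT (0-1 scale), residues from "all structures"
plddt_beta = rng.uniform(0.2, 1.0, 1000)      # pLDDT of residues from the leak-free "beta" set
disordered = (plddt_beta < 0.7).astype(int)   # placeholder disorder labels (1 = disordered)

# 1. Compare the two pLDDT distributions with the Kolmogorov-Smirnov test.
stat, pvalue = ks_2samp(plddt_all, plddt_beta)

# 2. Sweep the pLDDT cutoff (residues below the cutoff are called disordered)
#    and keep the value with the highest balanced accuracy.
cutoffs = np.linspace(0.0, 1.0, 101)
best = max(cutoffs, key=lambda c: balanced_accuracy_score(disordered, (plddt_beta < c).astype(int)))
print(f"K-S p-value: {pvalue:.3g}, best pLDDT cutoff: {best:.2f}")
```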
(3) Data Scale, Experimental Logic, and Key Findings
After rigorous screening, the beta set contained only 1,062 disordered residues, far fewer than the full structure set. The team attributed this to recent PDB depositions being dominated by large complexes, leaving few new monomeric chains and hence sparse IDR residue samples. To limit sampling bias, 50% of the residues were randomly sampled five times and the standard error was calculated, yielding robust estimates of the pLDDT cutoff and accuracy.
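A sketch of this subsampling scheme is shown below, reusing the cutoff-sweep idea from the previous sketch; `plddt` and `disordered` are assumed to be per-residue NumPy arrays.

```python
# Sketch of the repeated 50% subsampling used to estimate uncertainty of the cutoff.
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def best_cutoff(plddt, disordered, cutoffs=np.linspace(0.0, 1.0, 101)):
    """pLDDT cutoff with the highest balanced accuracy (below cutoff = disordered)."""
    return max(cutoffs, key=lambda c: balanced_accuracy_score(disordered, (plddt < c).astype(int)))

def cutoff_with_standard_error(plddt, disordered, repeats=5, frac=0.5, seed=0):
    """Draw `frac` of residues at random `repeats` times; report mean and SE of the cutoff."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(repeats):
        idx = rng.choice(len(plddt), size=int(frac * len(plddt)), replace=False)
        estimates.append(best_cutoff(plddt[idx], disordered[idx]))
    estimates = np.asarray(estimates)
    return estimates.mean(), estimates.std(ddof=1) / np.sqrt(repeats)
```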
(4) Key Conclusions
The results showed that without eliminating homologous leakage the optimal IDR prediction cutoff was 0.89, whereas on the beta set it was only 0.69, and the overall prediction accuracy also declined. The team concluded that testing on strictly independent data gives a more realistic, and stricter, assessment of model performance, clearly demonstrating that data leakage systematically inflates the apparent capability of downstream applications.
Main Academic Conclusions and Significance
Alert to Data Leakage and the Establishment of New Standards
This study systematically examines and corrects the widely neglected issue of data leakage in the current AlphaFold ecosystem, proposing a complete, practical, and regularly updated independent benchmark dataset (beta) as a “gold standard” test repository for subsequent scientific and engineering projects that rely on AlphaFold structures. This not only safeguards the scientific integrity of statistical results, but also lays a solid foundation for downstream applications such as antigen epitope recognition, phase-separation region prediction, assessment of pathogenic mutation effects, and screening of short linear motif (SLiM)-mediated complexes.
Open Resources Promote Disciplinary Self-correction
The author team has released all datasets, filtering scripts, and detailed classification criteria online, open to community-driven iteration. Whether an AlphaFold application relies on the official database (AlphaFold DB), the ColabFold service, or a local deployment, users can select the beta data matching the relevant AF version and cutoff date, enabling flexible access and continuous updating. The beta concept can likewise be adopted for rigorous external, independent assessment of newer protein structure prediction algorithms beyond AlphaFold (such as Boltz-1 and ESMFold).
Guidance for Future Research and Applications
- Practical Significance for Scientific Assessment: When evaluating the performance of new methods and algorithms, an independently curated, leakage-free benchmark set is the fundamental guarantee of valid experimental conclusions. The team has set an important benchmark for the entire field of structural bioinformatics.
- Innovation in Application Paradigm: Through standardized evaluation workflow and open resource sharing, biologists and medical scientists without computational backgrounds can easily access high-quality benchmark data, empowering interdisciplinary research and innovation.
- Important Promotion of Community Self-discipline: The paper calls on the community to maintain rigorous scientific practice while appreciating revolutionary AI-driven achievements, refusing to relax data science standards amid technological progress. Data leakage must not become a breeding ground for statistical misjudgment.
Article Highlights and Unique Research Strengths
- Proposes a regularly updated, flexibly deployable independent benchmark dataset, establishing a new industry standard for machine learning applications in structural biology.
- Highly automated, rigorous homology filtering workflow (integrating PSI-BLAST, Foldseek, Voronota, and multiple version-specific time thresholds) ensures data independence.
- Concrete and vivid empirical case studies (e.g., prediction of intrinsically disordered regions) effectively highlight the tangible impact of “data leakage” on real-world evaluation metrics.
- Fully open data and code resources enable ongoing collaborative improvement and broad reuse by the community.
Other Valuable Information
- The research was funded by the National Research, Development, and Innovation Fund of Hungary and the Hungarian Ministry of Culture and Innovation.
- The team acknowledges Rita Pancsa and Zsofia E. Kalman for their contributions to manuscript discussion and website design.
- Supplementary data, code, and all additional materials are openly accessible online (e.g., at https://zenodo.org/records/14711867).
- The team commits to extending and refining the beta dataset in line with future AlphaFold model updates and database expansions, driving the continual evolution of standards.
Summary: Towards a “New Reference Point” in Protein Structure Bioinformatics
Amidst the wave of AI-powered protein structural research sparked by AlphaFold, Dobson and colleagues remind the academic community that only scientific rigor can ensure that new technologies truly benefit the biomedical frontier. The beta benchmark draws a firm boundary line for the evaluation of protein structure prediction applications and injects new momentum into community self-correction and standardization. For all deep learning-based structure prediction algorithms, selecting strictly leakage-free evaluation datasets will become an indispensable experimental step. This work is not only technically cutting-edge and methodologically meticulous, but also stands as a model of interdisciplinary integration.