Sul-BERTGRU: An Ensemble Deep Learning Method Integrating Information Entropy-Enhanced BERT and Directional Multi-GRU for S-Sulfhydration Sites Prediction
Background Introduction
Post-Translational Modifications (PTMs) are crucial mechanisms for regulating cellular activities, including gene transcription, DNA repair, and protein interactions. Cysteine, a rare amino acid, participates in various PTMs through its thiol group and plays a significant role in redox balance and signal transduction. S-Sulfhydration, one such PTM, is closely associated with the development and progression of cardiovascular and neurological diseases. However, the specific mechanisms of S-sulfhydration remain unclear, and identifying the modified sites is particularly challenging.
Traditional methods for identifying S-sulfhydration sites, such as the Biotin Conversion Method and the Maleimide Fluorescence Method, can locate sites experimentally but rely on chemical reagents and often lack specificity and sensitivity. In recent years, with the rapid development of deep learning, researchers have begun using these techniques to predict protein modification sites. However, research on S-sulfhydration site prediction is relatively scarce, and existing models such as PCysMod still fall short of practical application requirements.
To address these issues, a research team from Dalian Maritime University, Jiangnan University, and other institutions proposed Sul-BERTGRU, a novel deep learning framework that aims to improve the accuracy and efficiency of S-sulfhydration site prediction by integrating multi-directional Gated Recurrent Units (GRUs) and Information Entropy-Enhanced BERT (IE-BERT).
Source of the Paper
The study was jointly conducted by Xirun Wei, Qiao Ning, Kuiyang Che, Zhaowei Liu, Hui Li, and Shikai Guo from the School of Information Science and Technology at Dalian Maritime University, the School of Artificial Intelligence and Computer Science at Jiangnan University, and the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education at Jilin University, among others. The paper, titled Sul-BERTGRU: An Ensemble Deep Learning Method Integrating Information Entropy-Enhanced BERT and Directional Multi-GRU for S-Sulfhydration Sites Prediction, was published in the journal Bioinformatics on February 20, 2025.
Research Content
Research Process
The Sul-BERTGRU framework consists of four modules: the Data Processing Module, IE-BERT Module, Confidence Learning Module, and Directional Feature Extraction Module.
Data Processing Module: First, protein sequences are divided into left and right sub-sequences centered on cysteine. Each sample is a 31-residue window centered on a cysteine, with 15 residues on each side (positions -15 to +15). This yields positive samples (S-sulfhydration sites) and negative samples (non-S-sulfhydration sites). The dataset includes 2705 positive samples and 16697 negative samples; 20% of the data is reserved as an independent test set, and the remaining 80% is used for training and validation.
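The windowing step described above can be sketched in a few lines of Python. This is only an illustration: the function name, the padding character `X` for windows that run past the ends of the protein, and the choice to include the central cysteine in both the left and right sub-sequences are assumptions, not details taken from the paper.

```python
def extract_cysteine_windows(sequence, flank=15, pad="X"):
    """Return one (position, window, left, right) tuple per cysteine.

    Assumptions (not from the paper): sequence ends are padded with `pad`,
    and the central cysteine is kept in both sub-sequences.
    """
    padded = pad * flank + sequence + pad * flank
    samples = []
    for i, residue in enumerate(sequence):
        if residue != "C":
            continue
        # Residue i of the original sequence sits at index i + flank in
        # `padded`, so this slice is centered on the cysteine.
        window = padded[i:i + 2 * flank + 1]
        left = window[:flank + 1]   # upstream residues plus the cysteine
        right = window[flank:]      # the cysteine plus downstream residues
        samples.append((i, window, left, right))
    return samples


samples = extract_cysteine_windows("MKCAAC")
```

Each window has length 2 × 15 + 1 = 31, matching the window size stated above; labeling a window as positive or negative would then depend on the experimental annotation of its central cysteine.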
IE-BERT Module: This module uses Information Entropy-Enhanced BERT (IE-BERT) to preprocess protein sequences and extract initial features. The BERT model processes protein sequences through 12 Transformer encoder layers, with the outputs from each layer aggregated via information entropy weighting to enhance feature representation.
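The summary does not give the exact weighting formula, so the following is only one plausible reading of "information entropy weighting", written in plain Python. It assumes each layer's output has been flattened to a vector, turns the absolute activations into a probability distribution, and weights each layer in proportion to its Shannon entropy. The function names and the "higher entropy, higher weight" rule are assumptions made for illustration.

```python
import math


def shannon_entropy(values):
    """Entropy of the distribution obtained by normalising |values|."""
    magnitudes = [abs(v) for v in values]
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    probs = [m / total for m in magnitudes if m > 0]
    return -sum(p * math.log(p) for p in probs)


def entropy_weighted_sum(layer_outputs):
    """Fuse per-layer vectors using entropy-proportional weights.

    layer_outputs: list of equally long lists of floats, one per encoder
    layer (12 layers in the setting described above).
    Assumption: layers with higher entropy receive larger weights.
    """
    entropies = [shannon_entropy(layer) for layer in layer_outputs]
    total = sum(entropies)
    n_layers = len(layer_outputs)
    if total > 0:
        weights = [e / total for e in entropies]
    else:
        weights = [1.0 / n_layers] * n_layers  # fall back to a plain average
    dim = len(layer_outputs[0])
    fused = [
        sum(w * layer[j] for w, layer in zip(weights, layer_outputs))
        for j in range(dim)
    ]
    return weights, fused


# Two toy "layers": a uniform one (high entropy) and a one-hot one (zero entropy).
weights, fused = entropy_weighted_sum([[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, 0.0]])
```

In a real pipeline the per-layer vectors would come from the hidden states of a pretrained BERT encoder rather than from hand-written lists.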
Confidence Learning Module: Due to limitations in biological experiments, negative samples may contain mislabeled S-sulfhydration sites. To reduce the impact of such noisy data on model training, the researchers employed Confident Learning to remove potentially mislabeled samples from the negative dataset, ensuring the reliability of negative samples.
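The core idea of Confident Learning can be illustrated with a simplified binary version. The sketch below assumes that out-of-sample predicted probabilities for the positive class are already available (for example from cross-validation); it computes a per-class confidence threshold as the mean self-confidence of that class and flags negative-labeled samples that the model confidently places in the positive class. The function name and the toy inputs are hypothetical, and the procedure used in the paper may differ in its details.

```python
def flag_suspect_negatives(labels, pos_probs):
    """Return indices of negative-labeled samples that look mislabeled.

    labels:    list of 0/1 given labels (must contain both classes).
    pos_probs: out-of-sample predicted probability of class 1 per sample.
    """
    pos_conf = [p for y, p in zip(labels, pos_probs) if y == 1]
    neg_conf = [1.0 - p for y, p in zip(labels, pos_probs) if y == 0]
    # Per-class thresholds: average confidence in the given label.
    t_pos = sum(pos_conf) / len(pos_conf)
    t_neg = sum(neg_conf) / len(neg_conf)

    suspects = []
    for idx, (y, p) in enumerate(zip(labels, pos_probs)):
        if y != 0:
            continue
        confident_pos = p >= t_pos
        confident_neg = (1.0 - p) >= t_neg
        # If both thresholds are met, the more probable class wins.
        if confident_pos and (not confident_neg or p > 1.0 - p):
            suspects.append(idx)
    return suspects


# Toy example with made-up probabilities: sample 4 is labeled negative
# but scored like a positive, so it is flagged for removal.
labels = [1, 1, 0, 0, 0, 0]
pos_probs = [0.9, 0.7, 0.1, 0.2, 0.85, 0.3]
suspects = flag_suspect_negatives(labels, pos_probs)
```

Samples flagged this way would be dropped from the negative set before training the final model, which is the role the module plays in the framework described above.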
Directional Feature Extraction Module: This module uses a multi-directional GRU model to extract directional features from protein sequences. Considering the directionality of enzymatic reactions, protein sequences are divided into left, right, and full sequences, each processed by the GRU model. Subsequently, a Multi-Head Self-Attention mechanism and Convolutional Neural Network (CNN) are used to further analyze sequence features and capture local details that might otherwise be overlooked.
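To make the directional idea concrete, here is a dependency-free toy sketch: a tiny GRU with random, untrained weights encodes the left part, the right part, and the full window separately, and the three final hidden states are concatenated. Everything specific here is an assumption for illustration: one-hot residue vectors stand in for IE-BERT features, the right part is read toward the central cysteine, biases are omitted, and the attention and CNN stages are left out.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYX"  # 20 residues plus X for padding


def one_hot(residue):
    # Stand-in for a learned residue embedding (assumption).
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class TinyGRU:
    """Single-layer GRU with random weights; biases omitted for brevity."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.hidden_size = hidden_size
        # One input matrix (W) and one recurrent matrix (U) per gate:
        # z = update gate, r = reset gate, h = candidate state.
        self.W = {g: init(hidden_size, input_size) for g in "zrh"}
        self.U = {g: init(hidden_size, hidden_size) for g in "zrh"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b)
                for a, b in zip(matvec(self.W["h"], x), matvec(self.U["h"], reset_h))]
        # Convention used here: z close to 1 favours the candidate state.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def encode(self, vectors):
        h = [0.0] * self.hidden_size
        for x in vectors:
            h = self.step(x, h)
        return h


def directional_features(window, flank=15, hidden_size=4):
    parts = [
        window[:flank + 1],      # left part, ending at the cysteine
        window[flank:][::-1],    # right part, read toward the cysteine
        window,                  # full window
    ]
    features = []
    for seed, text in enumerate(parts):
        gru = TinyGRU(len(AMINO_ACIDS), hidden_size, seed=seed)
        features.extend(gru.encode([one_hot(res) for res in text]))
    return features


features = directional_features("A" * 15 + "C" + "G" * 15)
```

In practice the three encoders would be trained jointly with the downstream attention, CNN, and classification layers in a deep learning framework; the point of the sketch is only that each direction gets its own recurrent pass before the features are combined.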
Main Results
Sul-BERTGRU achieved the following results: Sensitivity (85.82%), Specificity (68.24%), Precision (74.80%), Accuracy (77.44%), Matthews Correlation Coefficient (MCC, 55.13%), and Area Under the Curve (AUC, 77.03%). Compared with the existing PCysMod model, Sul-BERTGRU performed better on most metrics, particularly sensitivity.
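For readers less familiar with these metrics, the threshold-based ones can all be derived from the four confusion-matrix counts using their standard definitions, as in the short Python sketch below. The counts in the example are made up purely for illustration and are not taken from the paper; AUC is omitted because it requires the full set of prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts.

    Assumes each class has at least one sample and at least one positive
    prediction was made, so no denominator below is zero (except the MCC
    denominator, which is guarded).
    """
    sensitivity = tp / (tp + fn)            # recall on positive sites
    specificity = tn / (tn + fp)            # recall on negative sites
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=8, fp=4, tn=6, fn=2)
```

MCC ranges from -1 to 1, so a value reported as 55.13% corresponds to 0.5513 on that scale.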
Conclusion and Significance
Sul-BERTGRU provides a novel deep learning framework for S-sulfhydration site prediction that outperforms the existing PCysMod model on most reported metrics. The innovation of this framework lies in its integration of Information Entropy-Enhanced BERT and multi-directional GRU, which better capture directional and local features of protein sequences. Additionally, the Confidence Learning Module reduces noise in the negative samples, further enhancing model performance.
This study not only holds significant scientific value but also provides new tools for understanding the role of S-sulfhydration in cardiovascular and neurological diseases. In the future, researchers plan to incorporate additional structural information to further improve feature extraction and prediction accuracy.
Research Highlights
- Information Entropy-Enhanced BERT: Enhances feature extraction efficiency and accuracy by aggregating outputs from BERT’s 12 encoder layers using information entropy weighting.
- Multi-Directional GRU Algorithm: Introduces a multi-directional GRU model to better capture the directional features of S-sulfhydration modifications.
- Confidence Learning Module: Improves model generalization by removing noisy data from negative samples using Confident Learning.
- Multi-Module Integrated Framework: The Sul-BERTGRU framework integrates multiple deep learning modules and outperforms the existing PCysMod model on most reported metrics.
Other Valuable Information
The source code and data from this study are publicly available on GitHub (https://github.com/severus0902/sul-bertgru/) for further research and application in academia and industry. Additionally, the researchers conducted Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses on S-sulfhydrated proteins, revealing that S-sulfhydration is closely related to various diseases (e.g., Parkinson’s disease, Alzheimer’s disease), providing new directions for future disease research.
Overall, the study deepens the understanding of S-sulfhydration mechanisms and provides a new technological tool for predicting protein modification sites, with broad application prospects.