Deep scSTAR: Leveraging Deep Learning for the Extraction and Enhancement of Phenotype-Associated Features from Single-Cell RNA Sequencing and Spatial Transcriptomics Data

In recent years, technologies such as single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics (ST) have greatly advanced the life sciences and clinical medicine. They have revealed cellular heterogeneity and brought new insights into major fields such as disease, development, and immunity. However, large-scale single-cell datasets carry strong technical noise, complex batch effects, and diverse, entangled biological signals, so accurately extracting and enhancing phenotype-associated features has become one of the key challenges. Many traditional methods focus on denoising and data integration but may also weaken or even lose critical phenotype-determining signals, limiting researchers’ understanding of disease mechanisms and intercellular interactions.

I. Research Background and Significance

Identifying phenotype-associated features at single-cell resolution is crucial for understanding disease progression, immune response, and tumor drug resistance. In cancer immunotherapy and personalized treatment, for example, accurately recognizing the cell subpopulations associated with immune dysfunction or drug resistance often determines whether a treatment strategy succeeds. Mainstream data processing and integration tools such as Harmony, scMerge, scMerge2, MNN, Seurat, and Liger mainly target batch correction and technical noise reduction, and have clear limitations in preserving and enhancing the biological heterogeneity that is closely tied to disease phenotype. Even recent methods such as HIDDEN, which propagates sample labels to the single-cell level through label transfer and dimensionality reduction-based prediction, improve the discrimination of relevant cell types but still fall short when dealing with complex features and large sample sizes.

To address these difficulties, the authors build on their earlier PLS-based scStar approach and present a new deep learning framework, “deep scStar (dscStar).” dscStar combines multi-step noise reduction with a supervised multi-task learning model, focusing on revealing and enhancing hidden signals in single-cell and spatial omics data that are closely associated with phenotype, and thereby helping to elucidate the mechanisms underlying the tumor microenvironment and treatment resistance.

II. Article Source and Authors’ Institutional Background

This paper, entitled “Deep scSTAR: leveraging deep learning for the extraction and enhancement of phenotype-associated features from single-cell RNA sequencing and spatial transcriptomics data,” was published by Oxford University Press in Briefings in Bioinformatics (Volume 26, Issue 3, bbaf160) in 2025. The authors include Lianchong Gao, Yujun Liu, Jiawei Zou, Fulan Deng, Zheqi Liu, Zhen Zhang, Xinran Zhao, Lei Chen, Henry H.Y. Tong, Yuan Ji, Huangying Le, Xin Zou, and Jie Hao, among others, from Chinese institutions such as Shanghai Jiao Tong University Center for Systems Biomedicine, Fudan University, Ninth People’s Hospital of Shanghai, Macao Polytechnic University, Zhongshan Hospital, and related medical and life sciences institutes. The author list reflects cross-disciplinary, multi-center collaboration, which gives the research a solid academic and data foundation.

III. In-Depth Analysis of Research Design and Workflow

1. Overall Workflow

The core aim of dscStar is to retain and enhance, as far as possible, the cell features and subpopulations related to a specific phenotype (e.g., clinical subtype, disease progression, treatment response) in large-scale single-cell data. Its workflow has three main steps, which progressively remove variation that does not come from the phenotype difference and then strengthen the target features with a deep learning model:

Step 1: Unchanged Cell Recognition

Using the SCCURE algorithm, the cells of the two groups being compared (defined by disease status, treatment strategy, expression of a specific gene, and so on) are clustered with a Gaussian Mixture Model (GMM). The number of clusters can be set automatically or manually, and the Kullback–Leibler (KL) divergence is used to screen for “unchanged cells”: stable subpopulations whose expression does not change significantly between phenotypes. These cells act as anchors for correcting batch effects and other non-target differences, laying the foundation for the subsequent noise reduction and feature extraction.
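The screening idea in this step can be illustrated with a small sketch. The article does not say how SCCURE pairs clusters across the two groups or where it sets the cutoff, so everything below is an assumption made for illustration: each GMM component is summarized by a per-gene mean and variance (diagonal covariance), each cluster in one group is matched to its closest cluster in the other group by symmetric KL divergence, and the hypothetical `threshold` parameter decides which pairs count as unchanged.

```python
import math


def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for two Gaussians with diagonal covariance."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def unchanged_clusters(comps_a, comps_b, threshold):
    """Match each cluster of group A to its closest cluster of group B.

    comps_a / comps_b: lists of (mean_vector, variance_vector) tuples,
    one per fitted mixture component. Returns (index_a, index_b, divergence)
    for every pair whose symmetric KL divergence is below `threshold`
    (a hypothetical cutoff, not a value taken from the paper).
    """
    kept = []
    for i, (mu_a, var_a) in enumerate(comps_a):
        best_j, best_d = None, math.inf
        for j, (mu_b, var_b) in enumerate(comps_b):
            d = (kl_diag_gauss(mu_a, var_a, mu_b, var_b)
                 + kl_diag_gauss(mu_b, var_b, mu_a, var_a))
            if d < best_d:
                best_j, best_d = j, d
        if best_d < threshold:
            kept.append((i, best_j, best_d))
    return kept
```

Cells assigned to the retained cluster pairs would then serve as the anchor ("unchanged") cells passed to the next step.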

Step 2: Noise Reduction with PLS-DA

A Partial Least Squares Discriminant Analysis (PLS-DA) model is built on the unchanged cells selected in Step 1 to remove random noise, batch effects, and biological variation unrelated to the phenotype, so that the remaining signal is better suited to the phenotype modeling in Step 3. Because aggressive noise reduction can erase subtle biological signals, the tool lets users skip this step when needed.
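One way to picture this step is as a PLS deflation. Since the anchor cells should not differ between phenotypes, any direction that separates the two groups within them can be treated as nuisance variation and subtracted from every cell. The article does not describe how dscStar actually subtracts the PLS-DA components, so the sketch below is an assumption: it fits a single PLS component (function and variable names are hypothetical) and removes it by standard deflation.

```python
def column_means(X):
    """Per-feature mean of a row-major matrix (list of lists)."""
    n = len(X)
    return [sum(row[j] for row in X) / n for j in range(len(X[0]))]


def fit_pls_component(X_anchor, y):
    """Fit one PLS component on anchor cells; y holds 0/1 group labels."""
    mu = column_means(X_anchor)
    y_bar = sum(y) / len(y)
    Xc = [[v - m for v, m in zip(row, mu)] for row in X_anchor]
    yc = [v - y_bar for v in y]
    n, d = len(Xc), len(mu)
    # Weight vector: direction of maximal covariance with the group label.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(d)]
    norm = sum(v * v for v in w) ** 0.5
    if norm == 0.0:
        raise ValueError("anchor cells show no group-associated direction")
    w = [v / norm for v in w]
    # Scores of the anchor cells and the matching loading vector.
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]
    tt = sum(v * v for v in t)
    p = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(d)]
    return mu, w, p


def remove_component(X, mu, w, p):
    """Subtract the fitted component from any cells, anchors or not."""
    cleaned = []
    for row in X:
        score = sum((v - m) * wj for v, m, wj in zip(row, mu, w))
        cleaned.append([v - score * pj for v, pj in zip(row, p)])
    return cleaned
```

In practice several components would usually be removed in sequence; one is shown here to keep the arithmetic visible.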

Step 3: Supervised Multi-task Learning

Starting from the noise-reduced data of the previous step, a deep denoising autoencoder (DAE) embeds the high-dimensional expression matrix into a low-dimensional latent space, and a multi-layer perceptron (MLP) predicts the phenotype label from that embedding. Three loss terms (reconstruction, classification, and orthogonality) are optimized jointly in a multi-task learning (MTL) model to enhance and refine the features most closely associated with the phenotype.
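A denoising autoencoder is trained to reconstruct the clean input from a deliberately corrupted copy. The article names binomial noise as the corruption used, but gives no rate, so the sketch below assumes the common masking form in which each entry is independently set to zero with a hypothetical probability `drop_prob`.

```python
import random


def binomial_corrupt(x, drop_prob, rng):
    """Zero each entry independently with probability `drop_prob`.

    Equivalent to multiplying the input by a Bernoulli(1 - drop_prob) mask.
    `drop_prob` is an illustrative parameter, not a value from the paper.
    """
    return [0.0 if rng.random() < drop_prob else v for v in x]


# During training the network sees the corrupted vector, while the
# reconstruction loss is computed against the clean one.
rng = random.Random(0)
clean = [2.0, 0.0, 5.0, 1.0, 3.0]
noisy = binomial_corrupt(clean, drop_prob=0.2, rng=rng)
```

Because the target stays clean, the encoder cannot simply copy its input and is pushed toward features that survive the dropped entries.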

2. Algorithm Innovation and Implementation Details

a) Deep Learning Model Components and Loss Functions

  • Encoder and Decoder: Both are multi-layer neural networks, with layers of 5120, 1024, and 512 neurons in the encoder and 512, 1024, and 5120 neurons in the decoder. They use the ELU activation function and a different dropout rate for each layer.
  • Noise Injection (Binomial Noise): Binomial noise is injected to make the model more robust.
  • Classifier (MLP): Maps the encoded 512-dimensional latent vector to binary phenotype labels.
  • Loss Terms: Reconstruction loss (MSE), classification loss (MSE), and orthogonality loss (the Frobenius norm of the difference between a Gram matrix and the identity matrix). The relative weights of these terms can be adjusted to achieve feature decorrelation and discriminative enhancement.
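The three loss terms listed above can be written out in a few lines. This is a sketch under stated assumptions, not the authors' implementation: the article does not say which matrix the Gram matrix is built from, so it is taken here to be the batch-averaged Gram matrix of the latent dimensions, and the weights `w_rec`, `w_cls`, and `w_orth` are hypothetical defaults.

```python
def mse(pred, target):
    """Mean squared error over a batch of equally sized vectors."""
    total = sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    return total / (len(pred) * len(pred[0]))


def orthogonality_loss(Z):
    """Frobenius norm of (Z^T Z / n - I) for a batch of latent vectors Z."""
    n, d = len(Z), len(Z[0])
    squared = 0.0
    for a in range(d):
        for b in range(d):
            gram_ab = sum(Z[i][a] * Z[i][b] for i in range(n)) / n
            squared += (gram_ab - (1.0 if a == b else 0.0)) ** 2
    return squared ** 0.5


def total_loss(x, x_hat, y, y_hat, Z, w_rec=1.0, w_cls=1.0, w_orth=0.1):
    """Weighted sum of reconstruction, classification, and orthogonality terms."""
    return (w_rec * mse(x_hat, x)
            + w_cls * mse(y_hat, y)
            + w_orth * orthogonality_loss(Z))
```

The orthogonality term is zero when the latent dimensions are uncorrelated and have unit scale, so raising its weight pushes the embedding toward decorrelated features, while raising the classification weight favors phenotype separation.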

b) Other Key Data Processing and Evaluation Pipelines

The surrounding analysis and evaluation pipeline comprises normalization, batch integration (Seurat-BBKNN/Harmony), highly variable gene selection, neighbor graph construction and clustering, dimensionality reduction (UMAP), gene set enrichment (GSVA), pseudotime analysis (scTour), spatial signal enhancement (MCP-counter, RCTD, SpaceXR), molecular interaction analysis (CellChat, NicheNet), correlation and survival analysis, and several evaluation metrics (ARI, ASW, F1-score). Together these make up the quantitative validation framework.
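Two of the evaluation metrics named above are easy to state exactly. The sketch below computes the adjusted Rand index (ARI) between true and predicted cluster labels, and an F1-score between a true and a predicted set of items. Applying F1 to sets of differentially expressed genes is an assumption about how the metric is used; the function names are illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two labelings of the same cells (1.0 = identical partitions)."""
    n = len(labels_true)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Degenerate case, e.g. both labelings put every cell in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def f1_score(true_set, predicted_set):
    """Harmonic mean of precision and recall for a predicted set of items."""
    tp = len(true_set & predicted_set)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_set)
    recall = tp / len(true_set)
    return 2 * precision * recall / (precision + recall)
```

ARI does not depend on how clusters are numbered, so a clustering that recovers the true groups under different label names still scores 1.0.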

IV. Main Research Results and Scientific Discoveries

The paper systematically tests dscStar, and validates the resulting discoveries, across several representative scenarios and complex datasets.

1. Performance Evaluation on Simulated Datasets

On highly controlled simulated datasets (varying cluster numbers, fold-change strengths, and noise environments), dscStar was compared with the original scStar, scMerge2, Harmony, and other tools. It accurately recognized and enhanced phenotype-associated cell subpopulations and differentially expressed genes (DEGs), achieving high ARI, ASW, and F1-score even under weak-signal conditions, and preserved heterogeneity while enhancing the signal.

2. Recognition of Rare Subpopulations and Fine-grained Transition Revelation

In a simulation that mixes naive B cells and memory B cells from real biological samples at a 95:5 ratio, dscStar accurately separates the very rare memory B subpopulation and further identifies an intermediate transitional state from memory B cell to plasmablast, whereas traditional tools merge these cells into one large cluster and lose the fine-grained typing. Pseudotime analysis indicates that the transition trajectory captured by dscStar is consistent with the underlying biological process.

3. Discovery of Key Drug-resistant Subpopulations in the Tumor Microenvironment

  • NSCLC Anti-PD-1 Immunotherapy: Analyzing 32,528 CD8+ T cells with dscStar revealed a terminally exhausted T cell (Tex) subpopulation with high expression of heat shock protein (HSP) and FKBP4 (HSP-related Tex). This subpopulation exhibited immune dysfunction and resistance to immune checkpoint blockade (ICB) and was strongly associated with poor prognosis. TCR clone tracing further showed that it is distinct from other exhausted T cells, and it may be a key target for overcoming resistance to therapy.
  • Experimental Validation in Other Tumors: Single-cell and bulk data for skin cutaneous melanoma (SKCM), basal cell carcinoma (BCC), and other cancers also showed a correlation between high HSP/FKBP4 expression and immune dysfunction, supporting the generality of the finding across cancer types.

4. Revealing Tumor-Immune Cell Interactions in Spatial Transcriptomics

Taking renal cell carcinoma (RCC) spatial transcriptomics as an example, analysis of the dscStar-enhanced signal located the spatial distributions of tumor cells, CD8+ T cells, tumor-associated macrophages (TAMs), and MSC-like (mesenchymal-like) tumor cells, and uncovered strong FN1/CD99 pathway interactions linking MSC-like tumor cells to immune suppression. This provides new clues about immune suppression and drug resistance mechanisms, and its biological and prognostic relevance was validated with independent data and analyses (such as TCGA, CellChat, and survival analysis).

5. Immune Barrier Mechanisms in Hepatocellular Carcinoma

Applied to multi-omics data for hepatocellular carcinoma (HCC), dscStar revealed that S100A12+ neutrophils (neu_c1) and cancer-associated fibroblasts (CAFs) form an immune barrier at tumor margins, with neu_c1 signals enriched only at tumor boundaries in ICB non-responders, suggesting a close association with treatment resistance. In-depth analysis of ligand-receptor interactions with NicheNet, EnrichR, and other tools indicated activation of ECM organization pathways, pointing to the combination of an immune barrier and a suppressive microenvironment as a deep-rooted challenge in HCC treatment.

6. Sensitivity for Detecting Fine Phenotype-responsive Subpopulations

On temporally resolved single-nucleus transcriptomics data from an LPC-induced demyelination mouse model, dscStar identified and enhanced the features of early-response endothelial cells (high expression of Lgals1 and S100a6), whereas conventional workflows produced only a single uniform cluster and failed to locate the key stress-responsive subpopulations. This demonstrates its sensitivity to weak phenotype responses.

V. Conclusions and Significance

This study demonstrates and validates the ability of dscStar to mine and enhance key signals in complex, weakly heterogeneous, high-dimensional single-cell and spatial omics data. Beyond its methodological contribution, it offers practical guidance for research on the tumor microenvironment, immune resistance, and disease phenotyping, and for clinical decision-making.

  • Scientific Significance: Reveals mechanisms of cell subpopulation interaction across multiple omics layers and spatiotemporal scales, filling blind spots left by traditional data analysis methods.
  • Application Value: Provides a data-processing and biomarker discovery tool for precision medicine, immunotherapy, and single-cell computational analysis, with openly available source code and standard workflows.

VI. Research Highlights and Innovative Features

  1. Deep learning combined with a multi-task mechanism, in which multiple loss functions together handle large-scale, highly complex biological signals.
  2. No need to predefine the number of subpopulations or features, offering both adaptability and interpretability.
  3. High sensitivity to rare and weakly phenotype-associated subpopulations, surpassing traditional clustering and batch integration algorithms in the reported comparisons.
  4. Enabling fine-grained interaction discovery in complex scenarios such as spatial omics and single-cell multi-omics.
  5. Open-source code and workflows for easy reproduction and dissemination.

VII. Limitations and Prospects

Although dscStar performs well, the authors acknowledge that its applicability to continuous or complex phenotypes needs further improvement: it generally makes binary (high/low) distinctions and depends on label quality. There is also room for extension and refinement in data balancing, rare subpopulation detection, application to “pan-omics” scenarios, and orthogonal experimental validation.

VIII. Epilogue

The paper “Deep scSTAR: leveraging deep learning for the extraction and enhancement of phenotype-associated features from single-cell RNA sequencing and spatial transcriptomics data” combines theory, algorithms, and applied demonstrations to advance the enhancement of phenotypic features in single-cell omics, providing a powerful research tool and a development paradigm for the wider field of biomedical big data.