Inferring Gene Regulatory Networks from Time-Series scRNA-Seq Data via Granger Causal Recurrent Autoencoders

1. Academic Background and Research Motivation

Single-cell RNA sequencing (scRNA-seq) has become a central technology in life sciences and medical research because it measures transcript levels at the resolution of individual cells, exposing subtle expression differences across large cell populations. It has advanced the study of cell differentiation, development, and disease onset. Inferring gene regulatory networks (GRNs) from scRNA-seq data, to reveal the regulatory relationships between transcription factors (TFs) and their target genes, is now a key problem in bioinformatics and systems biology.

However, scRNA-seq data are noisy and sparse and suffer from dropout events, all of which complicate analysis. Time-series scRNA-seq data add dynamic changes on top of this noise and sparsity, making computation and inference harder still. Most traditional GRN inference methods target static single-cell data and model time series poorly. The main open challenges are: how to integrate temporal information so that dynamic regulatory relationships between genes are captured, how to keep algorithms robust under high noise and sparsity, and how to reduce the high rate of false-positive regulatory inferences produced by correlation-based analyses alone.

Addressing these challenges, and supporting practical biological research and disease mechanism analysis, calls for new GRN inference methods that are more efficient and more robust.

2. Paper Source and Author Information

This paper, titled “Inferring gene regulatory networks from time-series scRNA-seq data via granger causal recurrent autoencoders”, was published in the 26th volume, 2nd issue (2025) of the journal Briefings in Bioinformatics, DOI: https://doi.org/10.1093/bib/bbaf089.

The author team includes Liang Chen, Madison Dautle, Ruoying Gao, Shaoqiang Zhang (corresponding author), and Yong Chen (corresponding author). The authors are from the College of Computer and Information Engineering, Tianjin Normal University (China), and the Department of Biological and Biomedical Sciences, Rowan University, USA. The team brings together expertise from computer science, information engineering, and biomedical sciences, with extensive experience in single-cell omics and algorithm development.

3. Detailed Research Process

This paper is an original methodological study. Its core contribution is an unsupervised method, “Granger”, that combines ideas from deep learning and causal inference to infer GRN structures automatically from time-series scRNA-seq data. The following sections detail the research design and experimental workflow.

1. Overall Method Design

The Granger method is based on unsupervised deep learning. Its core idea is to integrate “Granger causality testing” with a “recurrent variational autoencoder (VAE).” It combines several techniques, namely a recurrent VAE, Granger causality detection, tunable sparsity-inducing penalties, and a negative binomial loss function, chosen to suit the high noise and high sparsity of time-series scRNA-seq data.

Technical Workflow Breakdown:

  • Data Preprocessing and Pseudotime Inference
    Scanpy is used for quality filtering, normalization, log transformation, and highly variable gene selection from raw single-cell data. If actual time-point information is lacking, PAGA (Partition-based graph abstraction) is used to automatically perform pseudotime ordering of cells, providing input for downstream time-series modeling.
  • Time-Series Generation
    For m gene expression profiles, temporal expression series for each gene are generated based on pseudotime. For each gene g, the expression series across all cells is denoted as $x_g = (x_g^1, x_g^2, \ldots, x_g^t)$, where t is the number of time points.
  • Main Model Architecture: Integration of Recurrent VAE and Granger Causality
    The model consists of an encoder and a multi-head decoder. The encoder reduces multivariate time-series to a latent low-dimensional feature space, and each decoder head is responsible for reconstructing the expression series of a particular gene. An RNN—specifically a gated recurrent unit (GRU)—is used as the basic unit of both encoder and decoder. The core goal of the model is to infer the existence of causal regulation between each pair of genes (i.e., the target adjacency matrix $A$), essentially modeling a Granger causal directed graph.
  • Innovative Loss Function Design
    A negative binomial distribution is introduced to fit the expression distribution of scRNA-seq data, supplementing the reconstruction error and KL divergence terms, and an L1 sparsity penalty is added to approximate the actual sparse GRN structure. The adjacency matrix entries are optimized under both differentiable and non-differentiable terms (the L1 penalty being non-differentiable at zero), which further reduces overfitting.
  • Model Optimization and Training Strategy
    In the first stage, a mix of PGD (Proximal Gradient Descent) and SGD (Stochastic Gradient Descent) is used to optimize GRU weights and input layer parameters; in the second stage, sparsity is fixed and SGD is used for further fine-tuning. The overall framework is implemented in PyTorch and supports GPU acceleration.
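
The loss and optimization bullets above can be made concrete with a small sketch. The code below is illustrative only and is not taken from the authors' PyTorch implementation: the function names (`nb_neg_log_likelihood`, `soft_threshold`, `proximal_step`), the mean/inverse-dispersion parameterization of the negative binomial, and the example step size and penalty weight are all assumptions. It shows a negative binomial negative log-likelihood for one count, and the soft-thresholding operator that a proximal gradient step applies for an L1 penalty.

```python
import math


def nb_neg_log_likelihood(x, mu, theta):
    """Negative log-likelihood of count x under a negative binomial
    with mean mu > 0 and inverse dispersion theta > 0 (assumed parameterization)."""
    log_p = (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )
    return -log_p


def soft_threshold(w, thresh):
    """Proximal operator of thresh * |w|: shrink w toward zero,
    and set it to exactly zero when |w| <= thresh."""
    if w > thresh:
        return w - thresh
    if w < -thresh:
        return w + thresh
    return 0.0


def proximal_step(weights, grads, lr, lam):
    """One proximal gradient step: a gradient step on the smooth loss,
    followed by L1 shrinkage with threshold lr * lam."""
    return [soft_threshold(w - lr * g, lr * lam) for w, g in zip(weights, grads)]


# Hypothetical usage: with zero gradients, only the L1 shrinkage acts.
weights = proximal_step([0.5, -0.05], [0.0, 0.0], lr=0.1, lam=1.0)
print(weights)  # the small second weight is shrunk to exactly 0.0
```

Because soft-thresholding produces exact zeros, a proximal update can switch candidate regulatory links off entirely, which plain gradient descent on an L1 term generally does not do.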

2. Datasets and Evaluation System

The research team adopted multiple datasets and designed a rigorous benchmarking system:

  • Synthetic Datasets
    Six synthetic datasets provided by the BEELINE framework, covering linear, circular, bifurcation, convergence and other complex topologies, with different cell counts (from 100 to 5000) and 10 replicated samples, systematically simulating developmental time-course differentiation processes.
  • Real and Curated Datasets
    Four curated real biological datasets involving human embryonic stem cells, mouse dendritic cells, human hepatocytes, etc., some supporting evaluation with 50% and 70% dropout events.
  • Practical Application Case Study
    Whole-mouse brain regional data from the Allen Brain Atlas were selected, focusing on 1,055 hippocampus-related excitatory neurons, with GRN predictions empirically studied for five important TFs (E2F7, GBX1, SOX10, PROX1, ONECUT2).
  • Method Comparison
    Systematic comparison with eight mainstream unsupervised GRN inference tools including GRNBoost2, SINCERITIES, PIDC, PPCOR, SCODE, GENIE3, SINGE, and NORMI, covering different technical approaches such as correlation, information theory, regression, and causality.

Performance metrics included AUPRC (area under the precision-recall curve), AUROC (area under the receiver operating characteristic curve), AUPRC Ratio, and Early Precision Ratio (EPR), comprehensively accounting for positive/negative sample imbalance and early inference accuracy.
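
As a rough illustration of the ratio metrics, the sketch below computes an early precision ratio and an AUPRC ratio for a ranked edge list. It assumes BEELINE-style definitions, in which both quantities are divided by the expected precision of a random predictor (the network density), and it approximates AUPRC by average precision. The function names and the toy network are hypothetical.

```python
def early_precision_ratio(scores, truth, n_possible):
    """scores: dict mapping edge -> confidence; truth: set of true edges;
    n_possible: number of candidate edges.
    Early precision is the precision among the top-k edges, k = len(truth)."""
    k = len(truth)
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    early_precision = sum(edge in truth for edge in top_k) / k
    density = k / n_possible  # expected precision of a random predictor
    return early_precision / density


def average_precision(scores, truth):
    """Average precision over the ranked edge list (used here as an AUPRC estimate)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits, total = 0, 0.0
    for rank, edge in enumerate(ranked, start=1):
        if edge in truth:
            hits += 1
            total += hits / rank  # precision at each recalled true edge
    return total / len(truth)


def auprc_ratio(scores, truth, n_possible):
    """Average precision divided by the random-predictor baseline (network density)."""
    return average_precision(scores, truth) / (len(truth) / n_possible)


# Toy example: 3 genes, 6 possible directed edges without self-loops.
scores = {("A", "B"): 0.9, ("B", "C"): 0.8, ("A", "C"): 0.3, ("C", "A"): 0.1}
truth = {("A", "B"), ("A", "C")}
print(early_precision_ratio(scores, truth, n_possible=6))
print(auprc_ratio(scores, truth, n_possible=6))
```

A ratio above 1 means the ranking beats a random predictor, which makes scores comparable across networks of different density.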

3. Experiments and Key Results

(1) Model Loss Design and Hyperparameter Optimization

The study successively evaluated the impact of negative binomial loss ($\lambda_{NB}$), sparsity penalty ($\lambda_a$), and time lag parameter ($l$) on inference performance. Experimental results show:

  • Introducing an appropriately strong negative binomial loss (with $\lambda_{NB}=1$) significantly improves both AUPRC and AUROC, especially under high-dropout data scenarios;
  • The optimal sparsity parameter falls in the range of 0.2–0.4, preventing the network from becoming overly sparse or failing to converge;
  • The best time lag window length is related to sample size, with $l=200$–$300$ (for medium to large samples) delivering optimal performance;
  • Using two-layer GRUs significantly outperforms single-layer structures in capturing complex nonlinear dynamics.
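
One way to see why the best time lag depends on sample size is to build the lagged windows explicitly. The sketch below assumes the lag is the length of the history window used to predict the next value; the helper name `lagged_windows` is hypothetical and the code is not the authors' implementation.

```python
def lagged_windows(series, lag):
    """Pair each value x^t with the `lag` values that precede it.
    Returns a list of (history, target) tuples."""
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return [(series[t - lag:t], series[t]) for t in range(lag, len(series))]


# Hypothetical usage on a short expression series.
pairs = lagged_windows([1, 2, 3, 4, 5], lag=2)
print(pairs)  # [([1, 2], 3), ([2, 3], 4), ([3, 4], 5)]
```

A series of length t yields only t − lag training pairs, so a long lag on a small sample leaves few windows to learn from, whereas larger samples can afford longer histories.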

(2) Importance of Pseudotime Algorithms

Comparison among three mainstream pseudotime algorithms—SLINGSHOT, PAGA, and SCORPIUS—showed that pseudotimes obtained by PAGA and SLINGSHOT both greatly enhance GRN inference accuracy; performance with randomly shuffled pseudotime as a control group dropped significantly, demonstrating the vital importance of temporal information for dynamic network inference.
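
The shuffled-pseudotime control can be sketched in a few lines. The code below assumes expression values are stored per gene as one list entry per cell; the function names are hypothetical, and the pseudotime values would in practice come from a tool such as PAGA rather than being typed in by hand.

```python
import random


def order_by_pseudotime(expression, pseudotime):
    """expression: dict mapping gene -> list of values, one per cell.
    pseudotime: list with one value per cell.
    Returns each gene's series with cells sorted by increasing pseudotime."""
    order = sorted(range(len(pseudotime)), key=lambda c: pseudotime[c])
    return {gene: [values[c] for c in order] for gene, values in expression.items()}


def shuffled_control(pseudotime, seed=0):
    """Randomly permute pseudotime values across cells to destroy temporal order."""
    rng = random.Random(seed)
    shuffled = list(pseudotime)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical usage with three cells and two genes.
expression = {"g1": [5, 1, 3], "g2": [0, 2, 4]}
pseudotime = [0.9, 0.1, 0.5]
print(order_by_pseudotime(expression, pseudotime))  # {'g1': [1, 3, 5], 'g2': [2, 4, 0]}
control = order_by_pseudotime(expression, shuffled_control(pseudotime))
```

Running the same inference on `control` keeps every expression value but removes the ordering, so any drop in accuracy can be attributed to the loss of temporal information.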

(3) Comparison with Mainstream Methods

On all synthetic and real datasets, Granger achieved the highest or second-highest AUPRC/AUROC. It was especially strong under small sample size and high dropout (50%, 70%) scenarios, in which the comparison methods often performed poorly or were unusable. For real application datasets such as human embryonic stem cells, both AUPRC Ratio and EPR values were significantly higher compared to competitors. The model demonstrates excellent performance and strong robustness, and is especially suitable for practical high-noise biological data.

(4) Mouse Brain Cell Application and Biological Discoveries

The method successfully predicted the target genes regulated by five TFs in mouse brain excitatory neurons, revealing that related genes are enriched in key pathways such as nervous system development, cell-cell signaling, and growth factor secretion. Most regulatory relationships were supported by literature and ChIP-seq data (e.g., the PROX1 ChIP-seq binding signal at the LIMD1 promoter region co-localized with chromatin marks), and the network structure showed high connectivity due to multi-TF coregulation. Some predicted regulatory relationships did not display high co-expression, highlighting the algorithm’s ability to recognize implicit regulatory patterns and providing guidance for further experimental validation and in-depth disease mechanism characterization.

4. Research Conclusions and Significance

This paper proposed and empirically validated a brand-new algorithmic framework, Granger, which integrates causal inference and deep learning, capable of robustly, efficiently, and automatically inferring directed gene regulatory networks from time-series single-cell omics data. Its scientific significance lies in:

  • Innovative cutting-edge methods: Realizes causal modeling of dynamic transcriptional regulatory systems, significantly compensating for the lack of interpretability in methods based solely on linear correlation, resulting in stronger explanation and biological reasoning capabilities;
  • Technical breakthrough: Effectively solves the instability and false positive issues of network structure under extreme sparsity and noise in scRNA-seq, setting a new paradigm for time-series research and sparse data modeling;
  • Wide applicability: Unsupervised, label-free, does not depend on global prior TF-gene knowledge, and is broadly applicable even to unknown species or specific tissue types, greatly expanding the boundaries of GRN research applicability;
  • Biological value: Can not only recall known regulations but also discover entirely new regulatory relationships and cooperative networks, bringing new possibilities for disease target discovery, cell fate research, and other applications.

5. Research Highlights

  • Pioneering combination of Granger causality and recurrent autoencoders, capturing dynamic gene regulatory patterns driven by temporal information;
  • Novel negative binomial loss modeling with L1 sparsity penalty for dual optimization, effectively suppressing erroneous inference under high dropout and noise;
  • The method ranks first or second against mainstream methods across the benchmark datasets, performing well in both accuracy and robustness;
  • Predicted results from real biological data are corroborated by literature and multi-source evidence such as ChIP-seq, demonstrating strong biological interpretability.

6. Other Information

The authors declare that the code is open-source (https://github.com/shaoqiangzhang/granger) and that the datasets are publicly available and traceable. The paper also discusses potential extensions, such as integrating nonlinear causality measures, attention mechanisms, and multi-omics information, leaving ample room for further theoretical and methodological development in the field.

7. Summary

This study introduces a new methodology to the field of gene regulatory network inference, advancing the automation of dynamic single-cell omics research. The Granger method addresses the practical needs of sparse data and dynamic modeling, and it provides a solid tool for disease mechanism exploration, cell fate studies, and systems biology, laying a foundation for future basic and applied research in related fields.