Cross-species modeling of plant genomes at single-nucleotide resolution using a pretrained DNA language model
A Milestone in Cross-Species Modeling of Plant Genomes: Creation and Breakthrough Application of the PlantCaduceus DNA Language Model
I. Academic Background and Research Motivation
In the past two decades, with the rapid development of high-throughput sequencing technology, over 1,000 plant genomes have been published, and this number is expected to increase substantially in the future. However, annotating the functional elements of these vast genomes, understanding their expression regulation at the transcriptional and translational levels, and analyzing the effects of different genetic variations on organismal fitness and traits have long been the “bottleneck” challenges in plant genomics and crop improvement fields.
Compared to animals and humans, plant genomes have much more complex structures, characterized by massive genome sizes, extremely high proportions of repetitive sequences, and great diversity among species—even considerable variation within a single genus or species. As a result, deep learning (DL) models constructed based on a single species often only perform well within that species and struggle to generalize across species. This greatly limits the ability to annotate gene function and predict variant effects in newly sequenced plants—especially non-model species. Meanwhile, large-scale labeled data are extremely scarce in the plant field, making traditional supervised deep learning inapplicable to unlabeled taxa.
In recent years, inspired by the rise of self-supervised language models (LMs) in natural language processing (NLP), pre-trained models for biological sequences have been shown to possess powerful feature abstraction and generalization abilities. Protein language models (such as ESM) have made breakthroughs in areas like protein structure prediction and mutation effect identification, but they are limited to coding regions and cannot cover non-coding or regulatory elements. DNA language models, on the other hand, can in principle capture sequence information across the whole genome, including non-coding and regulatory regions.
However, DNA language models face major challenges in plant genomes: (1) the complex repetitive sequences can lead models to focus on meaningless sequence patterns and fail to learn biologically relevant language rules; (2) low conservation and high noise in non-coding regions easily introduce data biases during training; (3) the double-stranded structure of DNA requires full consideration of information symmetry between the forward and reverse complement (RC) strands.
Therefore, developing a mechanistically sound, feature-rich plant DNA language model that generalizes across species is a pressing need in plant genomics research.
II. Paper Source and Author Introduction
This study, titled “Cross-species modeling of plant genomes at single-nucleotide resolution using a pretrained DNA language model,” was collaboratively completed by scholars including Jingjing Zhai, Aaron Gokaslan, Yair Schiff, Ana Berthel, Zong-Yan Liu, Wei-Yun Lai, Zachary R. Miller, Armin Scheben, Michelle C. Stitzer, M. Cinta Romay, Edward S. Buckler, and Volodymyr Kuleshov. The authors are primarily from the Institute for Genomic Diversity at Cornell University, the Department of Computer Science, the Department of Plant Breeding and Genetics, and the USDA, supported by relevant NSF and NIH grants.
The paper was published on June 9, 2025, in PNAS (Proceedings of the National Academy of Sciences of the United States of America), a highly influential international journal. The full text, along with the pretraining data and model code, has been made openly available, embodying the spirit of open science.
III. Detailed Research Workflow
1. Research Subjects and Datasets
(1) Source and Processing of Pretraining Data
The project utilized the genomes of 16 angiosperm species, spanning the Poaceae and Brassicales lineages—across 160 million years of evolutionary history—including model and crop plants such as Arabidopsis, rice, maize, and wheat. These genomes, with their great diversity in size and repetitive sequence content, form an ideal basis for cross-species analysis.
Each genome was split into 512 bp windows and tokenized at the single-nucleotide level, so that every base is its own token. Unlike previous approaches that used entire genomes, PlantCaduceus adopted the GPN project’s strategy, downsampling and reweighting repetitive non-coding regions to enhance learning in functionally relevant regions and avoid the “hijacking” effect of repeats.
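The windowing and tokenization step can be sketched as follows. This is a minimal illustration rather than the project's pipeline: the token ids in VOCAB, the function names, the 0.5 repeat threshold, the keep probability, and the use of lowercase (soft-masked) bases as a proxy for repeats are all assumptions made for the example.

```python
import random

# Hypothetical vocabulary; the real tokenizer's ids are not given in the text.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
WINDOW = 512  # window length in bp, as stated above


def tokenize(seq):
    """Map each nucleotide to one integer token (unknown bases become N)."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]


def split_windows(seq, size=WINDOW, step=WINDOW):
    """Yield (start, window) pairs; a trailing partial window is dropped."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


def repeat_fraction(window):
    """Assumption: repeats are soft-masked, i.e. written in lowercase."""
    return sum(1 for base in window if base.islower()) / len(window)


def keep_window(window, rng, repeat_keep_prob=0.1):
    """Keep non-repeat windows always, repeat-dominated ones only occasionally."""
    if repeat_fraction(window) <= 0.5:
        return True
    return rng.random() < repeat_keep_prob


if __name__ == "__main__":
    genome = "ACGT" * 256 + "acgt" * 128  # toy sequence, last window is all repeat
    rng = random.Random(0)
    for start, window in split_windows(genome):
        print(start, round(repeat_fraction(window), 2), keep_window(window, rng))
```

Down-sampling by a keep probability is only one way to reduce the weight of repeats; reweighting the loss on repeat positions would be an alternative under the same assumptions.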
(2) Feature Testing and Downstream Evaluation Datasets
After unsupervised pretraining, all models were primarily evaluated on the following tasks to assess their generalization and ability to parse biological function:
- Four core gene annotation tasks (translation initiation site, TIS; translation termination site, TTS; splice donor and acceptor sites)
- Evolutionary conservation (via alignments among 34 Andropogoneae genomes for crops like sorghum)
- Zero-shot variant effect prediction (assessing the potential impact of mutations on gene function)
2. Research Process and Technical Implementation
(1) Innovative Architecture and Pretraining of the PlantCaduceus DNA Language Model
Model Architecture Innovations
This research adopted the Caduceus model based on the Mamba (Selective State Space Model, SSM) framework, further optimized for DNA characteristics, including:
- Support for 512 bp context windows, allowing dependencies across the whole window to be learned.
- RC-equivariant modeling of DNA double-stranded symmetry, building in strong priors to ensure equivalent handling of forward and reverse strands, preventing redundancy.
- Single-nucleotide tokenization, offering higher resolution than mainstream k-mer (e.g., 6-mer) methods and precisely capturing single-base mutation effects.
- Channel flipping and feature averaging to guarantee strict RC-equivariance of output embeddings.
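The strand symmetry described in this list can be illustrated with a small sketch. It assumes a stand-in embedding function (toy_embed, which is not the real network) and averages the forward pass with a re-aligned reverse-complement pass. This post-hoc averaging over two passes is a simpler technique than the in-network weight sharing and channel flipping of Caduceus, but it shows the same property: the resulting per-base features do not depend on which strand was read.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse-complement strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def toy_embed(seq):
    """Stand-in for a model: one 2-d vector per base (NOT the real network)."""
    table = {
        "A": (1.0, 0.0),
        "C": (0.0, 1.0),
        "G": (0.0, 2.0),
        "T": (3.0, 0.0),
        "N": (0.0, 0.0),
    }
    return [table[base] for base in seq]


def strand_averaged_embed(seq, embed=toy_embed):
    """Average forward features with RC features mapped back to forward coordinates."""
    forward = embed(seq)
    # Position i on the forward strand corresponds to position L-1-i on the RC strand.
    reverse = embed(reverse_complement(seq))[::-1]
    return [
        tuple((f + r) / 2 for f, r in zip(fvec, rvec))
        for fvec, rvec in zip(forward, reverse)
    ]


if __name__ == "__main__":
    seq = "ACGTTGCAAN"
    a = strand_averaged_embed(seq)
    b = strand_averaged_embed(reverse_complement(seq))[::-1]
    print(a == b)  # same features whichever strand is supplied
```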
Pretraining Strategy
- 15% random masking, following the BERT recipe: of the selected positions, 80% are replaced by a special mask token, 10% by a random token, and 10% are left unchanged.
- AdamW optimizer with a cosine-decay learning-rate schedule; the best model has 225M parameters and was trained for 25 days on 8 H100 GPUs.
- For each window, the model’s task is to predict the true base at the masked positions; all downstream tasks use embeddings extracted from the last hidden layer.
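The masking recipe above can be sketched in a few lines. The function name, the "[MASK]" placeholder string, and the convention of marking unscored positions with None are assumptions for illustration; only the 15% rate and the 80/10/10 split come from the text.

```python
import random

MASK = "[MASK]"  # placeholder name for the special mask token (assumption)
BASES = "ACGT"


def mask_window(seq, rng, mask_rate=0.15):
    """Apply BERT-style masking to one window.

    Returns (tokens, targets). targets[i] holds the true base at positions
    selected for prediction and None everywhere else.
    """
    tokens = list(seq)
    targets = [None] * len(tokens)
    for i, base in enumerate(seq):
        if rng.random() >= mask_rate:
            continue  # position not selected, not scored
        targets[i] = base
        roll = rng.random()
        if roll < 0.8:
            tokens[i] = MASK  # 80%: replace with the mask token
        elif roll < 0.9:
            tokens[i] = rng.choice(BASES)  # 10%: replace with a random base
        # remaining 10%: keep the original base but still predict it
    return tokens, targets


if __name__ == "__main__":
    rng = random.Random(0)
    tokens, targets = mask_window("ACGT" * 128, rng)
    n_selected = sum(t is not None for t in targets)
    print(n_selected, "of", len(tokens), "positions selected for prediction")
```

The training loss would then be computed only at positions where targets is not None.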
(2) Downstream Task Design and Model Evaluation
a. Cross-Species Gene Function Annotation Assessment
- Arabidopsis’s well-annotated TIS, TTS, and splice sites serve as training data. Only the model’s embeddings are used as features, on which XGBoost (nonlinear) and linear classifiers are trained for each task.
- The four core tasks are tested within the training species (Arabidopsis) and across other species (including maize, rice, and cotton, some of which were part of pretraining and some not), evaluating representation quality and generalization.
- Performance is compared against GPN, AgroNT (Transformer-based, 1B parameters), NT-v2 (animal model), and the classic supervised DanQ (CNN+LSTM) model.
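The "frozen embeddings plus a light classifier" setup can be sketched as a linear probe. The example below is a plain logistic regression written with the standard library so that it runs anywhere; it stands in for the linear classifier only, not for XGBoost, and the two-dimensional "embeddings" and species labels are invented toy values, not data from the study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(X, y, lr=0.5, epochs=200):
    """Fit logistic regression by batch gradient descent on fixed feature vectors."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Probability that feature vector x belongs to the positive class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy stand-ins for frozen embeddings (hypothetical values).
train_X = [(1.0, 0.2), (0.9, 0.1), (0.1, 0.8), (0.2, 1.0)]  # "Arabidopsis" sites
train_y = [1, 1, 0, 0]                                       # 1 = true site, 0 = decoy
test_X = [(0.8, 0.3), (0.3, 0.9)]                            # "maize" sites

if __name__ == "__main__":
    w, b = train_linear_probe(train_X, train_y)
    for x in test_X:
        print(x, round(predict_proba(w, b, x), 3))
```

The language model itself is never updated here; only the small classifier on top is trained, which is what makes cross-species testing a direct measure of how transferable the embeddings are.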
b. Cross-Species Evolutionary Conservation Prediction
- Based on alignments among 34 evolutionarily related sorghum and rice outgroup genomes, identity scores are used to label bases as conserved (≥34) or neutral (<15), sampling 277M sites for a highly imbalanced training set.
- Training is conducted on 9 chromosomes of sorghum, validated on chromosome 10, and tested for cross-species transfer to maize.
- After extracting embeddings, XGBoost classifiers are trained for binary classification, using AUROC and AUPRC as metrics.
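The labeling rule and the two metrics named above can be sketched as follows. The thresholds (conserved at 34 or more, neutral below 15) come from the text; the function names, the choice to exclude intermediate sites, and the use of the step-wise average-precision estimator for AUPRC are assumptions of this example.

```python
def label_site(identity, conserved_min=34, neutral_max=15):
    """Return 1 (conserved), 0 (neutral), or None for intermediate sites."""
    if identity >= conserved_min:
        return 1
    if identity < neutral_max:
        return 0
    return None  # assumed to be excluded from training


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Area under the precision-recall curve, average-precision estimator."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / hits


if __name__ == "__main__":
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.1]  # toy classifier outputs
    print(auroc(labels, scores), average_precision(labels, scores))
```

On imbalanced data AUPRC is usually the more informative of the two, because it is sensitive to how many false positives rank above the true conserved sites.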
c. Zero-Shot Mutation Effect Prediction—A New Approach to Identifying Pathogenic/Deleterious Variants
- Through in silico mutagenesis (genome-wide mutation simulation), the difference in log-likelihood between the reference and alternate alleles (zero-shot score) is used as the basis for evaluating mutation effects.
- The data cover more than 1 million real and simulated variants in maize, sorghum, and Arabidopsis, together with SNP sets from population sequencing.
- Performance is compared with mainstream MSA-based methods (PhyloP, PhastCons), as well as GPN and AgroNT scoring.
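The zero-shot score can be illustrated with a short sketch. It assumes a callable predict_masked(seq, pos) that returns the model's probability for each base when position pos is masked; toy_predictor below is a hard-coded placeholder for that callable, not a real model. The sign convention (log-probability of the alternate allele minus that of the reference, so that more negative means less expected) is also an assumption, since the text only states that the score is a log-likelihood difference.

```python
import math


def zero_shot_score(probs, ref, alt):
    """log P(alt) - log P(ref) at a masked site; negative = alt is less expected."""
    return math.log(probs[alt]) - math.log(probs[ref])


def in_silico_mutagenesis(seq, predict_masked):
    """Score every possible single-base substitution in seq."""
    results = []
    for pos, ref in enumerate(seq):
        probs = predict_masked(seq, pos)  # one masked prediction per position
        for alt in "ACGT":
            if alt != ref:
                results.append((pos, ref, alt, zero_shot_score(probs, ref, alt)))
    return results


def toy_predictor(seq, pos):
    """Placeholder model: strongly expects G at position 1, uniform elsewhere."""
    if pos == 1:
        return {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}
    return {base: 0.25 for base in "ACGT"}


if __name__ == "__main__":
    for pos, ref, alt, score in in_silico_mutagenesis("AGT", toy_predictor):
        print(pos, ref, alt, round(score, 3))
```

Because the score needs only the pretrained model's masked predictions, no labels, alignments, or per-species retraining enter the calculation.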
(3) Method/Model Comparisons and Ablation Experiments
- To ensure a fair comparison to GPN, a custom large-size GPN was retrained with matched parameters and steps, analyzing the contribution of including additional genomes and network scaling to model generalization.
- For AgroNT, whose very large parameter count makes pretraining on the Brassicales set impractical, LoRA fine-tuning was used to compensate for the weaker representations in its frozen embeddings.
- Multi-level analysis of performance differences between XGBoost and linear layers to test whether high-dimensional embeddings require complex models for full information extraction.
IV. Key Findings and Supporting Data
1. Generalization and Representation Power of the New Model PlantCaduceus
- On the four annotation tasks (TIS, TTS, splice donor/acceptor), PlantCaduceus—whether with frozen embeddings or minimal linear fine-tuning—outperforms or matches existing models in Arabidopsis (AUPRC mean >0.94).
- The key breakthrough is in cross-species tasks (e.g., maize, cotton): PlantCaduceus’s cross-species AUPRC only drops from 0.789 (in Arabidopsis) to 0.764, far ahead of GPN (0.509), AgroNT (0.106), and NT-v2, while the DanQ model nearly fails (AUPRC close to 0).
- Ablation tests show that increasing the number of pretraining species and model capacity both reinforce generalization, yet PlantCaduceus still outperforms reference models even when using the smallest configuration (20M parameters).
- Notably, PlantCaduceus architecture is clearly advantageous in parameter efficiency and RC-equivariant processing.
2. Cross-Species Transfer Power in Evolutionary Conservation Prediction
- Without the need for reference annotations, PlantCaduceus’s embedding outputs alone can achieve highly accurate evolutionary conservation prediction: In sorghum, AUROC=0.896, AUPRC=0.876; after transferring to maize, AUROC=0.829, AUPRC=0.797—both significantly exceeding peer models.
- Conservation prediction in non-coding regions even surpasses that in protein-coding regions, highlighting the model’s ability to capture regulatory elements and complex regions.
- Custom GPN and LoRA-tuned AgroNT can approach PlantCaduceus’s downstream performance, but still fall short of its best results.
3. Zero-Shot Model–Driven Pathogenic/Deleterious Mutation Screening
- By classifying simulated and real variants with the zero-shot score (the log-likelihood difference between alleles), PlantCaduceus is more sensitive than GPN, AgroNT, and established MSA-based methods (PhyloP, PhastCons) in capturing deleterious/rare alleles, yielding a threefold enrichment in rare variants.
- In external validation via Arabidopsis EMS allele screening, 15 out of 19 known phenotype mutations ranked in the PlantCaduceus top 1–10%, far ahead of other models, thus providing a new paradigm for identifying causal mutations or key breeding loci.
- At the GWAS signal around the su1 locus in sweet corn, PlantCaduceus pinpoints the sole causal variant, W578R, resolving a signal that high linkage disequilibrium (LD) otherwise makes hard to deconvolve.
V. Research Conclusions and Academic/Application Value
This study proposes, for the first time, a multi-species pre-trained DNA language model framework represented by PlantCaduceus, effectively overcoming technical obstacles such as plant genome diversity, repetitive sequence complexity, missing annotations, and the need for RC equivariance on double-stranded DNA. The model combines high accuracy (e.g., sequence annotation, regulatory prediction), high generalizability (cross-species transfer), high efficiency (reduced parameters and computational cost), and single-base resolution (such as zero-shot variant pathogenicity prediction). The research team has also fully open-sourced code, models, and data, providing a strong foundation and expandable platform for future projects like the “1000 Plant Genomes Project,” large-scale functional genomics studies in new species, precise crop breeding, and superior material screening.
Moreover, PlantCaduceus–driven “zero-shot mutation interpretation” opens an entirely new pathway for identifying pathogenic/important variants without expensive evolutionary conservation MSAs or model retraining per species, offering paradigm-shifting breakthroughs for genomic medicine, population genomics, and crop diversity improvement.
VI. Summary of Research Highlights and Innovations
- Model Mechanism Innovation: Uses the Caduceus architecture, built on the Mamba selective state space model (SSM), which outperformed the Transformer and CNN/LSTM baselines tested, and builds RC equivariance directly into the model.
- Complete Methodological Framework: From dataset processing, pretraining, downstream task design, ablation experiments, to extensive downstream model comparisons, ensuring rigorous and broadly applicable conclusions.
- Multidimensional Application Value: Suitable for academic research (e.g., gene function evolution, regulatory element decoding), industrial crop applications (rapid identification of desirable/deleterious variants), and driving forward bioinformatics methodology.
- Robust Openness and Accessibility: Full openness of code, models, and data, paving the way for secondary development and educational resources.
VII. Additional Remarks and Outlook
- Future Expansion Directions: Plans to include more diverse lineages in pretraining, such as gymnosperms, to further enhance model generalizability and breadth of application; consideration of enlarging the context window to thousands or tens of thousands of bp for tasks such as distal regulatory element prediction.
- Technical Details: Model configurations (e.g., 32 layers/225M parameters, 24 layers/40M parameters, etc.) facilitate flexible deployment in laboratories with differing resources.
- Breeding Practical Functions: The model can guide optimal parental material selection in molecular breeding, aid hybrid design, reduce deleterious mutation load, and improve crop yield and stress resistance.
As a new-generation plant DNA language model, PlantCaduceus not only advances fundamental biological research but also provides an innovative tool for downstream applications such as digital precision breeding and genomic medicine—a major breakthrough for cross-species genome annotation and functional analysis.