The Dependence of Amino Acid Backbone Conformation on the Translated Synonymous Codon Is Not Statistically Significant
Re-evaluating the Influence of Synonymous Codons on Protein Backbone Conformation: A Debate Revisited Under Rigorous Statistical Testing in Structural Biology
1. Academic Background and Research Motivation
In molecular and structural biology, the relationship between codons and protein structure has long been a focal point of research. The traditional view holds that the primary structure of a protein (i.e., its amino acid sequence) determines its spatial conformation (fold), and that the “degeneracy” of the genetic code allows the same amino acid to be encoded by multiple “synonymous codons.” Since the late 20th century, a growing body of literature has linked synonymous codon usage bias to a variety of biological processes, such as mRNA splicing, regulation of translation rates, and protein folding kinetics. These links not only enrich our understanding of the information that nucleotide sequences carry beyond the amino acid code, but also provide a broader theoretical foundation for protein design and genetic engineering.
In 2022, a study by A. A. Rosenberg et al. published in Nature Communications (cited throughout the paper reviewed here as Ref. 1) proposed a provocative idea: beyond affecting translation rate and protein folding kinetics, the synonymous codons used during translation may also directly influence the distribution of protein backbone dihedral angles (φ, ψ; the Ramachandran angles), with statistically significant differences reported particularly within some secondary structure elements (such as β-strands). If this assertion held, information about the final 3D structure of a protein would be partially encoded at the DNA sequence level rather than being dictated solely by the primary structure. This would have profound consequences for structural biology, protein engineering, and molecular evolution.
However, the claim soon attracted widespread skepticism, including concerns about the soundness of the statistical methods, the robustness of the data analysis, and false positives introduced by small-sample density estimation. The study that is the subject of this report, by Javier González-Delgado et al., reassesses the statistical foundation of that work and tests whether synonymous codons do in fact significantly affect the distribution of protein backbone dihedral angles.
2. Source and Author Information
This is an original research article published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS), with a publication date of June 13, 2025, and article number e2503264122.
The main authors include Javier González-Delgado, Pablo Mier, Pau Bernadó, Pierre Neuvial, and Juan Cortés, respectively affiliated with the following research institutions:
- Université de Rennes, Ensai, CNRS, CREST-UMR 9194, Rennes, France
- Andalusian Centre for Developmental Biology, Universidad Pablo de Olavide, Seville, Spain
- Centre de Biologie Structurale, Université de Montpellier, Montpellier, France
- Institut de Mathématiques de Toulouse, Université de Toulouse, Toulouse, France
- LAAS-CNRS, Université de Toulouse, Toulouse, France
The article was edited by the structural and computational biology authority Eugene Koonin (NIH, Bethesda, MD).
3. Research Process and Detailed Methods
3.1 Research Objectives and Core Problems
The primary aim of this study is to clarify whether synonymous codons significantly influence the distribution of backbone dihedral angles (Ramachandran plots) of amino acids in translated proteins, focusing in particular on Rosenberg et al.’s report that “significant differences exist within secondary-structure elements.” The authors argue that the earlier statistical approach has fundamental flaws, which makes a more rigorous and robust re-analysis of the relevant data necessary.
3.2 Overview of the Research Process
This research was conducted in the following main steps:
- Reproduction and Defect Analysis of the Original Method
- Design and Implementation of a More Rigorous Statistical Test
- Replicative Analysis Using Both Experimental Structural Data and AlphaFold Structural Databases
- Sensitivity and Robustness Tests, Including Controls for Neighboring Residues and Diverse Structural Classifications
- Comparative Analysis of Differences and Attribution of Main Sources of Deviation
3.2.1 Analysis and Simulation of the Original Statistical Method
The authors first replicated the statistical process used by Rosenberg et al.:
- For a particular amino acid, in the context of synonymous codons c and c’ and a certain type of secondary structure x, the distributions of dihedral angles (φ, ψ) were compared.
- Bootstrap resampling (B = 25 resamples) was applied, and on each resample a permutation test (K = 200 permutations) was performed to compare the two distributions.
- The resulting p-values were then combined across resamples (by averaging, as discussed in Section 4.1) to reach a final significance determination.
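The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the test statistic (a distance between mean embedded angles) is a hypothetical placeholder rather than the statistic used by Rosenberg et al., and the final aggregation is taken to be a plain average of the per-resample p-values, as Section 4.1 describes. All function names are invented for this sketch.

```python
import numpy as np

def embed(angles):
    """Map (phi, psi) pairs in radians, shape (n, 2), to points in R^4."""
    return np.concatenate([np.cos(angles), np.sin(angles)], axis=1)

def statistic(x, y):
    """Placeholder statistic (an assumption): distance between mean embeddings."""
    return np.linalg.norm(embed(x).mean(axis=0) - embed(y).mean(axis=0))

def permutation_pvalue(x, y, K, rng):
    """Permutation p-value with K random relabelings of the pooled sample."""
    observed = statistic(x, y)
    pooled = np.vstack([x, y])
    n = len(x)
    exceed = 0
    for _ in range(K):
        perm = rng.permutation(len(pooled))
        if statistic(pooled[perm[:n]], pooled[perm[n:]]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (K + 1)

def bootstrap_averaged_pvalue(x, y, B=25, K=200, seed=0):
    """Average of permutation p-values computed on B bootstrap resamples."""
    rng = np.random.default_rng(seed)
    pvals = []
    for _ in range(B):
        xb = x[rng.integers(0, len(x), len(x))]  # resample x with replacement
        yb = y[rng.integers(0, len(y), len(y))]  # resample y with replacement
        pvals.append(permutation_pvalue(xb, yb, K, rng))
    return float(np.mean(pvals))
```

Calling `bootstrap_averaged_pvalue(x, y)` on two arrays of shape (n, 2) and (m, 2) returns a single number in (0, 1]. Whether that number behaves like a valid p-value is exactly the question the authors raise.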
Through theoretical analysis and simulation, the authors found that the resulting p-values are not “super-uniform”; that is, under the null hypothesis their distribution does not satisfy the condition required of a valid p-value. This readily leads to errors in significance testing and invalidates multiple testing corrections (such as the Benjamini-Hochberg method) that rely on that condition.
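For reference, super-uniformity is the standard validity condition for a p-value. Writing p for the p-value and H_0 for the null hypothesis (notation chosen here for illustration, not taken from the paper), the condition reads:

```latex
\[
\mathbb{P}_{H_0}\left(p \le t\right) \le t
\qquad \text{for all } t \in [0, 1].
\]
```

Procedures such as Benjamini-Hochberg are derived under this condition, so a quantity that violates it cannot be assumed to deliver the stated error control.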
3.2.2 Design of the New Statistical Test
To avoid these limitations, the authors applied a two-sample goodness-of-fit test for probability distributions on the two-dimensional flat torus, based on the Wasserstein distance and recently published by their team. The test is non-parametric (it makes no assumptions about the form of the underlying distributions), which improves robustness and generality, especially in small-sample scenarios.
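To make the idea concrete, here is a minimal sketch of a Wasserstein-based two-sample comparison on the flat torus. It is not the authors' test: calibration by permutation, the restriction to equal sample sizes, and all function names are assumptions made for this illustration. Angles are assumed to be in radians within [-π, π). For equal-size samples with uniform weights, the optimal transport cost reduces to an optimal assignment, which is solved here with SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def torus_sq_dist(x, y):
    """Pairwise squared geodesic distances on the flat torus [-pi, pi)^2."""
    diff = np.abs(x[:, None, :] - y[None, :, :])
    diff = np.minimum(diff, 2 * np.pi - diff)  # shortest way around each circle
    return (diff ** 2).sum(axis=-1)

def wasserstein2_sq(x, y):
    """Squared 2-Wasserstein distance between two equal-size empirical samples."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    cost = torus_sq_dist(x, y)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return cost[rows, cols].mean()

def permutation_test(x, y, n_perm=200, seed=0):
    """Permutation p-value for H0: both samples share the same distribution."""
    rng = np.random.default_rng(seed)
    observed = wasserstein2_sq(x, y)
    pooled = np.vstack([x, y])
    n = len(x)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if wasserstein2_sq(pooled[idx[:n]], pooled[idx[n:]]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

The wrap-around in `torus_sq_dist` is the essential point: two angles close to π and -π are neighbors on the torus, which a plain Euclidean distance on (φ, ψ) would miss.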
3.2.3 Data Collection and Processing
- Data objects: Primarily derived from Rosenberg et al.’s original dataset (experimentally determined Escherichia coli protein structures), supplemented with high-confidence predicted structures (pLDDT > 90) from the AlphaFold Database.
- Sample filtering: Only amino acids uniquely mapped to synonymous codons were analyzed, and redundancy was removed by retaining only distinct UniProt IDs and sequence positions.
- Sample grouping: Divided according to secondary structure type (DSSP classification): β-strands (E), α-helices (H), and other structures (Others), with a strict minimum sample size per group (n, m ≥ 30).
3.2.4 Multiple Testing and Data Visualization
All pairwise combinations of synonymous codons were subjected to the above-mentioned non-parametric statistical test, and the Benjamini-Hochberg method was used to control the false discovery rate (FDR). Empirical cumulative distribution function (ECDF) visualizations were used to display the distribution of test p-values, intuitively reflecting changes in the rejection rate for hypothesis testing.
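As a reminder of how these two ingredients work, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure and of an ECDF evaluation. The function names and the default level `alpha=0.05` are choices made for this illustration, not details taken from the paper.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k_max = rank
    # Reject every hypothesis whose p-value is among the k_max smallest.
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected

def ecdf(pvalues, t):
    """Fraction of p-values that are less than or equal to t."""
    return sum(p <= t for p in pvalues) / len(pvalues)
```

Plotting `ecdf(pvalues, t)` against `t` gives the kind of curve described above: under the null hypothesis, valid p-values should stay on or below the diagonal, while a curve rising steeply near zero indicates many rejections.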
3.2.5 Sensitivity Analysis and External Validation
To ensure that observations were not biased by structural definition or neighboring residue effects, the team conducted:
- Re-analysis with different methods for defining Ramachandran regions
- Repeat tests controlling for neighboring amino acid identities
- Cross-validation between experimental/predicted structure databases
All analysis scripts and code have been made openly available (https://github.com/gonzalez-delgado/synco).
4. Detailed Analysis of Key Research Results
4.1 Re-simulation and Flaw Revelation of the Original Method
Through theoretical analysis and empirical simulations of Rosenberg et al.’s statistical test, the research team found:
- The “significance determination” obtained by averaging bootstrapped permutation test p-values produces an extremely conservative p-value distribution, failing to meet the basic requirements for a valid (super-uniform) statistical p-value.
- As a result, false discovery rate control in the multiple testing setting is no longer guaranteed, which can lead to both false negatives and false positives.
- In small-sample contexts, using fixed bandwidth kernel density estimation leads to severely distorted density fitting, resulting in a high rate of false positives.
Such methodological flaws could mean that previous claims that “codons significantly affect dihedral angle distributions” are overstated, or even entirely unreliable.
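A small simulation can illustrate why an average of p-values need not itself be a valid p-value. The sketch below makes a strong simplifying assumption: the B = 25 p-values are drawn as independent uniforms, whereas p-values computed on bootstrap resamples of the same data are dependent. It therefore does not reproduce the authors' analysis and only shows the general mechanism. Function names are invented for the illustration.

```python
import numpy as np

def averaged_null_pvalues(n_sim=20000, B=25, seed=0):
    """Simulate averages of B independent uniform p-values (a simplification)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(size=(n_sim, B)).mean(axis=1)

def ecdf_at(pvalues, t):
    """Fraction of simulated values that are less than or equal to t."""
    return float(np.mean(pvalues <= t))

if __name__ == "__main__":
    avg = averaged_null_pvalues()
    for t in (0.05, 0.5, 0.6):
        # Super-uniformity would require ecdf_at(avg, t) <= t for every t.
        print(t, ecdf_at(avg, t))
```

In this simplified setting the averages concentrate around 0.5, so the empirical distribution lies far below the diagonal at small thresholds (extremely conservative) but above it for thresholds beyond 0.5, which is why the averaged quantity is not super-uniform.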
4.2 Main Findings Under Rigorous Testing
Using the newly designed non-parametric Wasserstein test to reanalyze all data on various types of secondary structures under all synonymous codons, the results were as follows:
- β-strands (E): No statistically significant differences in φ/ψ distributions were observed between any pair of synonymous codons, contradicting Rosenberg et al.’s claim that 66% of synonymous codon pairs differ significantly.
- α-helices (H) and others: Results were consistent with the original paper, with no statistically significant differences detected.
- Multiple independent database validations: Conclusions were highly consistent regardless of whether experimental structures or high-confidence AlphaFold models were used.
- Sensitivity analysis robustness: The conclusion that synonymous codons do not significantly affect backbone dihedral angle distributions held without exception under different grouping criteria (Ramachandran regions) and when controlling for neighboring residues.
4.3 Tracing the Sources of Bias
Through systematic analysis, the team noted that nearly all of the comparisons reported as “significantly different” in Rosenberg et al.’s results involved very small samples. The combination of small samples and fixed-bandwidth kernel density estimation makes false positives highly likely, which seriously undermines the original findings.
4.4 Conclusions of the Research Team
Summing up all data analyses, statistical tests, and multi-database validations, the authors ultimately concluded:
Based on currently available data, there is no statistical support for the claim that “synonymous codons affect the distribution of protein backbone dihedral angles.” The main determinant of the spatial structure of proteins remains the amino acid sequence (primary structure), and differences at the synonymous codon level do not result in detectable geometric differences in the folded protein’s backbone for the same amino acid.
5. Scientific and Applied Value of the Study
5.1 Scientific Significance
The core significance of the study lies in:
- Upholding a foundational paradigm: By supporting the classical view that “structure depends on amino acid sequence rather than DNA coding detail,” the study reinforces the theoretical basis of protein engineering and systems biology.
- Enhancing statistical rigor: By revealing the “pitfalls of commonly used statistical tests in specific scenarios,” the study encourages more careful data analysis practice in structural biology and proteomics.
- Commitment to data reproducibility and open-source: The full disclosure of analytical workflows and data elevates academic transparency, providing a robust platform for peer review and follow-up in-depth research.
5.2 Applied Value
- Protein engineering/molecular design: As far as backbone dihedral geometry is concerned, these conclusions suggest that protein structure design can focus on primary sequence modifications, without concern over fine-scale conformational variation arising from synonymous codon usage.
- Molecular evolution research: Clarifies the boundary of synonymous mutations in protein stability/conformational control and helps rationally explain neutral mutation effects in evolutionary dynamics.
- Gene synthesis industry: Empirically removes excessive concern over micro-heterogeneity of structure during “codon optimization,” promoting the efficient development of synthetic biology.
5.3 Research Highlights and Innovations
- The first systematic refutation of the claim that synonymous codons directly influence protein backbone dihedral angles.
- Introduction and validation of the team’s own Wasserstein distance test for probability distributions on the two-dimensional flat torus, which performs well in small samples and for comparing periodic two-dimensional distributions.
- Cross-validation using multiple structural data sources (experimental structures and AlphaFold predictions), which strengthens the breadth and depth of the conclusions.
6. Other Valuable Information
- The analysis is currently limited mainly to a small number of Escherichia coli proteins with known experimental structures, and it assumes that the expressed coding sequence is identical to that of the native organism. The authors call for future studies to combine larger-scale structural databases and corresponding gene sequences to further enhance generalizability.
- All research materials, methods, and code are fully open to the public, facilitating rapid verification, extension, and methodological upgrades by the academic community.
- The appendix provides detailed information on the algorithms used, experimental structural classifications (e.g., DSSP), structural databases (AlphaFold), and other references, which are of high informational value to interested researchers.
7. Summary
This study addresses the debate within the field of protein structure prediction and design over “whether synonymous codons directly influence backbone geometry.” With rigorous statistical methodology, comprehensive analytical workflows, and high-quality empirical data, the researchers return to and reinforce the classical view that “the geometry of the protein backbone is controlled by the amino acid sequence.” This both strengthens the theoretical footing of structural biology and provides a sound basis for decision-making in genetic engineering and related molecular biology industries.