Basic overview
We quantified codon-usage and protein-composition indices across nuclear CDSs from 30 angiosperm species, with emphasis on Fabaceae and comparisons with Poaceae, Brassicaceae and other outgroups. Substantial interspecific variation was detected in GC content, codon adaptation and bias indices (CAI, CBI), Fop, mean protein length (L_aa), hydropathy (GRAVY) and aromaticity (Aromo) (Fig 1).
Fabaceae generally showed genome-wide GC values of ~0.44-0.45 (
e.g.,
Glycine max = 0.4407;
Cajanus cajan = 0.4545), comparable to or slightly higher than several non-legumes (
e.g.,
Solanum lycopersicum = 0.417) and within the range of
Arabidopsis thaliana (0.452) (
Smarda et al., 2014). At synonymous third positions, Fabaceae taxa such as
Medicago truncatula (GC3s = 0.4522) and
Pisum sativum (0.4419) exceeded
A. thaliana (0.4304) and, more prominently,
S. lycopersicum (0.3477), indicating lineage-specific divergence in third-position nucleotide composition and codon-ending preference.
CAI values in Fabaceae were concentrated at ~0.18-0.22 and were broadly comparable to those in major grasses (e.g.,
Zea mays = 0.221;
Oryza sativa = 0.218), suggesting only moderate among-lineage differences in codon “optimization,” with no uniformly dominant clade. CBI values were similar across several taxa (
e.g.,
M. sativa = 0.3819;
G. max = 0.3844;
A. thaliana = 0.3873), whereas S.
lycopersicum was lower (0.3701), indicating that codon-bias strength remains heterogeneous even within eudicots.
Fabaceae also tended to exhibit longer mean protein lengths (L_aa ~300–370 aa;
e.g.,
M. truncatula = 357;
G. max = 373), consistent with lineage-level differences in average CDS length. GRAVY values were uniformly negative, indicating an overall hydrophilic proteome tendency; for example,
M. truncatula (-0.2003) was less negative than
O. sativa (-0.264), consistent with a relatively more hydrophilic mean protein composition in the latter. Aromo values were moderate and relatively stable in Fabaceae (
e.g.,
G. max = 0.0847;
M. sativa = 0.0839) but lower in
Z. mays (0.0729), indicating lineage-dependent differences in amino-acid composition.
These results highlight lineage-specific divergence in base composition, particularly at third codon positions, and in codon preferences. This framework informs the study of mutation–selection dynamics and supports codon-aware gene design
(Parvathy et al., 2022).
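As an illustration of how the compositional indices above can be obtained from a single CDS, the following is a minimal sketch. It assumes the standard nuclear genetic code and the common convention that GC3s excludes Met, Trp and stop codons; the function names `gc_content` and `gc3s` are hypothetical and do not refer to the software used in this study.

```python
from itertools import product

# Standard genetic code in TCAG order (assumption: nuclear CDSs, translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def gc_content(cds):
    """Fraction of G and C over all bases of the sequence."""
    cds = cds.upper()
    return sum(base in "GC" for base in cds) / len(cds)


def gc3s(cds):
    """GC fraction at third positions of synonymous codons.

    Met (ATG), Trp (TGG), stop codons and codons with ambiguous
    bases are excluded, following the usual GC3s convention.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    synonymous = [c for c in codons if CODON_TABLE.get(c, "*") not in "MW*"]
    if not synonymous:
        raise ValueError("no synonymous codons in sequence")
    return sum(c[2] in "GC" for c in synonymous) / len(synonymous)


# Toy sequence for illustration only (not data from this study).
toy = "ATGGCTGCCTGGAAATAA"
print(round(gc_content(toy), 4), round(gc3s(toy), 4))
```

Species-level values such as those reported above would then be averages or pooled counts over all CDSs of a genome; which of the two was used is not specified here.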
ENc–GC3s analysis (ENC-plot)
In this study, we examined the relationship between the effective number of codons (ENc, also denoted Nc) and GC3s (GC content at third codon positions) using ENc-GC3s plots (Fig 2). Across most species, ENc displayed the characteristic dependence on GC3s expected under the Wright model, with genes approaching the theoretical expectation curve over a broad GC3s range. In general, genes with lower GC3s tended to exhibit higher ENc values, consistent with weaker apparent codon bias, whereas ENc decreased as GC3s increased, indicating progressively stronger skew in synonymous codon usage as GC-ending codons became more prevalent.
Within Fabaceae, species such as
Glycine max, Medicago sativa and
Cicer arietinum showed conspicuous clustering of genes toward lower ENc values within intermediate GC3s ranges, consistent with enhanced codon bias in GC3s-enriched genes. By contrast, several monocots (
e.g.,
Oryza sativa and
Zea mays) exhibited comparatively stable ENc distributions across GC3s, suggesting that the GC3s dependence of ENc is less pronounced in these taxa. Together, these ENc–GC3s patterns indicate that third-position base composition constitutes a major axis underlying codon bias across angiosperms, while the magnitude and distribution of deviation from mutation-only expectations vary among lineages, implying the potential contribution of additional forces beyond base-composition constraints (
Wright, 1990;
Parvathy et al., 2022).
Importantly, the ENc-GC3s framework provides a quantitative, genome-wide diagnostic to compare codon-usage architectures among plant lineages and to motivate subsequent mechanistic inference regarding the relative roles of compositional bias and selection. The lineage-specific differences observed here further underscore that codon bias evolution is not uniform across angiosperms and highlight Fabaceae as a clade in which codon-usage structure merits deeper evaluation in conjunction with expression- and translation-related evidence.
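The theoretical expectation curve referred to above follows Wright (1990), who gave the ENc expected under GC3s-driven compositional bias alone as ENc = 2 + s + 29/(s² + (1 − s)²), where s is GC3s. The sketch below evaluates that curve; the helper `enc_deviation`, which expresses how far a gene falls below the curve, is a commonly used summary but is an assumption here and is not taken from this study.

```python
def expected_enc(gc3s):
    """Expected ENc under compositional bias alone (Wright, 1990)."""
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must lie in [0, 1]")
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)


def enc_deviation(observed_enc, gc3s):
    """Relative shortfall of the observed ENc below the expected curve.

    Hypothetical helper: values near 0 mean the gene sits on the curve,
    larger positive values mean stronger bias than composition predicts.
    """
    expected = expected_enc(gc3s)
    return (expected - observed_enc) / expected


# The curve peaks at GC3s = 0.5 and falls toward both extremes.
for s in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(s, round(expected_enc(s), 2))
```

Genes lying well below this curve are the ones usually interpreted as being shaped by forces other than base composition.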
PR2 analysis
In this study, we investigated the relationship between PR2 bias at third codon positions (A3/(A3 + T3)) and GC content across species, as depicted in the scatter plot in Fig 3. Each data point reflects the codon usage bias of a sampled gene within a species, with the x-axis representing G3/(G3 + C3) and the y-axis representing A3/(A3 + T3). These plots reveal biases in codon selection and the distribution patterns of specific codons within the genome. Most species’ data points are concentrated in the 0.25 to 0.75 range, suggesting a consistent base distribution at third codon positions. Notably, in Fabaceae species, such as
Glycine max (Gmax) and
Medicago truncatula (Mtru), the scatter plots exhibit pronounced biases, indicating strong selective pressure favoring the use of adenine (A) at the third codon position. In contrast, non-legume species, such as
Arabidopsis thaliana (Atha) and
Solanum lycopersicum (Slyc), show more dispersed distributions, reflecting weaker codon selectivity and more random codon usage patterns.
Further analysis revealed that Fabaceae species, such as
Glycine max and
Cajanus cajan (Ccaj), generally have higher A3/(A3 + T3) values, indicating a stronger bias towards adenine at the third codon position. In contrast, non-legume species, including
Arabidopsis thaliana and
Solanum lycopersicum, exhibit lower frequencies of A3 usage, reflecting weaker codon selectivity. For monocots like
Oryza sativa (Osat) and
Zea mays (Zmay), their scatter plots show a significant relationship between GC content and PR2 bias. In regions of higher GC content, there is a stronger bias in the use of codons at third positions, while lower GC regions exhibit weaker bias. This suggests that higher GC content may be associated with stronger selective pressures, leading to more pronounced codon usage bias.
Overall, the relationship between PR2 bias and GC content highlights significant differences in codon selection and gene expression regulation across species. Fabaceae typically exhibits strong codon bias, while other plant lineages display broader, more diverse codon usage patterns. These differences are likely tied to the evolutionary adaptive pressures and gene expression regulatory mechanisms faced by each lineage, providing important theoretical insights for future plant genome research and crop improvement
(Yang et al., 2023; Parvathy et al., 2022).
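For reference, the two PR2 coordinates of a gene can be computed from its third-position base counts as G3/(G3 + C3) and A3/(A3 + T3). The sketch below is a simplified illustration that uses all third positions of a CDS; many PR2 analyses restrict the counts to fourfold-degenerate codon families, and which variant was applied here is not stated, so this choice is an assumption. The function name `pr2_coordinates` is hypothetical.

```python
def pr2_coordinates(cds):
    """Return (G3/(G3+C3), A3/(A3+T3)) for one coding sequence.

    Simplifying assumption: every complete codon contributes its third
    base. Returns None for a coordinate whose denominator is zero.
    """
    third = cds.upper()[2::3]  # third base of each complete codon
    a3, t3 = third.count("A"), third.count("T")
    g3, c3 = third.count("G"), third.count("C")
    gc_bias = g3 / (g3 + c3) if (g3 + c3) else None
    at_bias = a3 / (a3 + t3) if (a3 + t3) else None
    return gc_bias, at_bias


# Toy sequence for illustration only (not data from this study).
x, y = pr2_coordinates("ATGGCTGCCTGGAAATAA")
print(round(x, 3), round(y, 3))
```

A gene at (0.5, 0.5) obeys parity (A = T and G = C at third positions); displacement from that point along either axis indicates the direction of the bias.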
Correlation analysis
Fig 4 presents a Spearman correlation heatmap of six codon usage indices: GC, GC3s, Nc (effective number of codons), CAI (Codon Adaptation Index), CBI (Codon Bias Index) and Fop (Frequency of Optimal Codons). The correlation coefficients (rho values) range from -1 to 1, representing the strength and direction of the monotonic (rank-based) relationships between these indices. The depth of color reflects the strength of the correlation, with darker shades indicating stronger positive correlations and lighter shades indicating weaker or no correlation.
From the heatmap, it is evident that GC and GC3s exhibit a very strong positive correlation (rho = 0.82), suggesting a close relationship between the GC content at third codon positions and the overall GC content of the genome. This finding aligns with the synchronous variation of GC content during genome evolution
(Bowers et al., 2022). Additionally, strong positive correlations were observed between GC and both Fop (rho = 0.61) and CBI (rho = 0.65), indicating that species with higher GC content tend to exhibit stronger codon selectivity, which contributes to improved gene expression efficiency
(Hao et al., 2025).
The correlation between CAI and other indices showed a notable positive relationship with Fop (rho = 0.66) and CBI (rho = 0.55), while no significant correlation was found with Nc (rho = -0.11), suggesting that CAI primarily reflects the degree of codon optimization, whereas Nc is more closely related to broader genomic characteristics
(Kwon et al., 2016). Furthermore, GC3s also exhibited moderate positive correlations with Fop (rho = 0.55) and CBI (rho = 0.64), further supporting the idea that changes in GC3s are tightly linked to codon selection, particularly at third codon positions, where an increase in GC content typically accompanies codon usage optimization (
Ruden, 2025).
Overall, the heatmap summarizes the interrelationships among the codon usage indices: GC content and GC3s covary with the codon optimization indices (CAI, CBI and Fop), whereas Nc is only weakly associated with CAI. These findings are consistent with a link between codon selection and genomic compositional features and offer a basis for further studies on the mechanisms of gene expression regulation.
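To make the rank-based nature of the coefficient concrete, the sketch below computes Spearman's rho as the Pearson correlation of average ranks, using only the Python standard library. It is an illustrative re-implementation, not the routine used to produce Fig 4, and the input values are toy numbers rather than data from this study.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# A monotonic but non-linear relationship still yields rho close to 1.
print(spearman_rho([1, 2, 3, 4], [1, 8, 27, 64]))
```

Because only ranks enter the calculation, rho is insensitive to the scale differences among indices such as Nc and CAI.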
RSCU analysis
Fig 5 presents an RSCU (Relative Synonymous Codon Usage) heatmap of the species × codon matrix, revealing codon usage preferences across different species. RSCU values are scaled so that a value of 1 indicates that a codon is used as often as expected under equal usage of its synonymous codons. The color intensity in the heatmap represents the RSCU value of each codon in each species, with deep red indicating a strong preference for a specific codon and deep blue indicating that the codon is used less often than its synonymous alternatives. Hierarchical clustering was applied to the matrix, with the x-axis representing codons and the y-axis representing species, thereby illustrating the codon usage bias patterns across species.
From the heatmap, significant differences in codon selection were observed among different plant lineages (
e.g., Fabaceae, Poaceae and other monocots). Fabaceae species, such as
Glycine max and
Medicago truncatula, along with Poaceae species like
Zea mays and
Oryza sativa, exhibited consistent RSCU patterns for specific codons, indicating strong selection for the usage of certain codons within their genomes. For example,
Glycine max (Gmax) and
Medicago truncatula (Mtru) demonstrated higher RSCU values for codons associated with GC preference, suggesting these species have optimized the use of high-GC codons for gene expression. These codons may be associated with plant adaptability and environmental stress responses, particularly in enhancing gene expression efficiency and stability
(Parvathy et al., 2022).
Monocots such as
Oryza sativa (Osat) and
Zea mays (Zmay) also showed some optimization in codon usage, especially for codons in regions of their genomes with high GC content. These species exhibited different codon usage patterns compared to Fabaceae, likely reflecting their distinct evolutionary histories and genomic adaptations. The variation in codon usage patterns across different lineages further supports the notion that codon usage bias reflects differences in gene expression regulation, environmental adaptation and genomic optimization strategies
(Liu et al., 2004). Overall, the RSCU heatmap provides a visual representation of codon usage preferences across species, shedding light on the selection and optimization of codons in plant genomes. These findings offer valuable insights for understanding the mechanisms of codon selection in plants and provide a useful reference for plant genome improvement and crop optimization.
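As a reminder of how the values in such a heatmap are defined, RSCU for a codon is its observed count divided by the mean count of all codons encoding the same amino acid. The sketch below is a minimal illustration under the standard genetic code; the function name `rscu` is hypothetical, single-codon amino acids (Met, Trp) and stop codons are skipped, and the input is a toy sequence rather than data from this study.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in TCAG order (assumption: nuclear CDSs).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """RSCU per codon, pooled over a list of coding sequences."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    # Group sense codons by the amino acid they encode.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if len(codons) == 1 or total == 0:
            continue  # RSCU is uninformative or undefined here
        mean_count = total / len(codons)
        for c in codons:
            values[c] = counts[c] / mean_count
    return values


# Alanine used as GCT x3 and GCC x1: GCT is over-represented (RSCU = 3).
print(rscu(["GCTGCTGCTGCC"]))
```

Within each amino acid the RSCU values sum to the number of synonymous codons, so values above 1 mark preferred codons and values below 1 mark avoided ones.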
Fig 6 presents a principal component analysis (PCA) of species based on RSCU (Relative Synonymous Codon Usage) features. Using PCA, we projected the codon usage preferences of species into a two-dimensional space, where PC1 and PC2 explained 77.4% and 10.3% of the total variance, respectively. Each point in the plot represents a species, and the shapes and colors of the points indicate each species’ lineage. From the plot, it is apparent that species cluster into distinct groups based on codon usage preferences. Notably, along PC1, there is clear separation between Fabaceae species (
e.g.,
Glycine max and
Medicago truncatula) and Poaceae species (
e.g.,
Zea mays and
Oryza sativa), suggesting substantial differences in codon usage between these two groups. Fabaceae species predominantly cluster in the positive range of PC1, while Poaceae species are concentrated in the negative range, reflecting distinct preferences for codon selection at the genomic level. Along PC2, differences in codon usage between monocots and dicots are also evident, with species from the Brassicaceae family (
e.g.,
Arabidopsis thaliana,
Brassica oleracea) showing a different distribution pattern compared to monocots like
Oryza sativa and
Zea mays. This variation may be related to differences in gene expression optimization strategies between these plant groups
(Yang et al., 2023). Furthermore, the PCA revealed distinct differences between basal angiosperms (
e.g.,
Amborella trichopoda) and eudicots (
e.g.,
Populus trichocarpa), highlighting differences in their genomic codon usage patterns, closely tied to their respective evolutionary histories and adaptive traits. Overall, the RSCU PCA plot demonstrates significant differences in codon preferences across species, providing visual evidence to support the understanding of codon usage and gene expression optimization mechanisms between different plant lineages. These results offer critical insights into the gene expression regulation and adaptive evolution mechanisms across species
(Majeed et al., 2026).
Neutrality plot analysis
Fig 7 shows the relationship between GC12 and GC3 and was used to infer the relative contributions of mutational bias and selective constraints to genome-wide base composition across species. For each species, we fitted a linear regression between GC12 and GC3 and summarized the association using the slope, R² and p value. Here, the slope quantifies the extent to which variation at third codon positions is mirrored at the first and second positions, whereas R² reflects the goodness-of-fit and the p value evaluates statistical significance.
Across species, many regression fits exhibited significant positive associations, although the strength of the relationship varied substantially among lineages. In several Fabaceae species, GC12 and GC3 were positively associated; for example,
Glycine max showed a measurable relationship (R² = 0.285; p < 1×10⁻³⁰⁰), indicating coordinated variation in GC content across codon positions. Such coupling is consistent with non-random constraints on compositional architecture and has been interpreted as reflecting the combined action of mutation and selection shaping codon-position–specific base composition
(Wang et al., 2025). In contrast, Medicago truncatula displayed a much weaker relationship (R² = 0.002), suggesting that the linkage between GC3 and GC12 is minimal in this species. Similarly, taxa such as
Arabidopsis thaliana showed weak coupling (
e.g., R² = 0.005), implying that compositional variation at third positions is only weakly reflected at the more constrained first and second positions in these genomes.
In addition, several species exhibited comparatively strong GC12-GC3 coupling; for instance,
Zea mays showed a robust positive association (R² = 0.409), consistent with more pronounced genome-wide coordination of base composition across codon positions. Collectively, the cross-species variation in slope and R² values highlights substantial heterogeneity in codon-position compositional coupling among angiosperms. This neutrality-plot framework provides an interpretable quantitative summary of how base composition at third positions relates to that at the more functionally constrained first and second positions and thus offers a useful basis for comparing mutational and selective influences on compositional evolution among lineages
(Alemu et al., 2024; Glémin et al., 2014).
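The per-species fit described above can be sketched as an ordinary least-squares regression of GC12 on GC3 across genes. The example below returns the slope, intercept and R² using only the standard library; it omits the p value, the function name `neutrality_regression` is hypothetical, and the inputs are toy numbers rather than data from this study.

```python
def neutrality_regression(gc3, gc12):
    """Least-squares fit of GC12 (y) on GC3 (x); returns (slope, intercept, r2)."""
    n = len(gc3)
    if n != len(gc12) or n < 3:
        raise ValueError("need at least three paired observations")
    mx, my = sum(gc3) / n, sum(gc12) / n
    sxx = sum((x - mx) ** 2 for x in gc3)
    syy = sum((y - my) ** 2 for y in gc12)
    sxy = sum((x - mx) * (y - my) for x, y in zip(gc3, gc12))
    if sxx == 0 or syy == 0:
        raise ValueError("GC3 and GC12 must both vary across genes")
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r2


# Toy genes: GC12 changes by 0.1 per unit change in GC3.
slope, intercept, r2 = neutrality_regression(
    [0.2, 0.4, 0.6, 0.8], [0.41, 0.43, 0.45, 0.47]
)
print(round(slope, 3), round(intercept, 3), round(r2, 3))
```

In the usual reading of neutrality plots, a slope near 1 suggests that mutational bias acts similarly on all codon positions, whereas a slope near 0 suggests that first and second positions are held in place by selective constraint.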
RSCU-based clustering analysis
Fig 8 shows a species clustering tree constructed from RSCU (relative synonymous codon usage) profiles using the neighbor-joining (NJ) method. This tree groups species according to the similarity of their codon usage patterns, thereby providing an integrative view of how compositional constraints and putative selective effects shape codon preference architectures across taxa.
Amborella trichopoda (Atri) was used as an outgroup to root the tree, facilitating interpretation of lineage-level relationships.
The topology revealed clear lineage-associated structure, with major families (
e.g., Fabaceae, Poaceae and Brassicaceae) forming recognizable clusters, consistent with the presence of phylogenetic signals in codon usage patterns. For example, Fabaceae species such as
Glycine max and
Medicago truncatula clustered closely, indicating broadly similar codon usage profiles within the family. In contrast, Poaceae species (
e.g.,
Zea mays and
Sorghum bicolor) occupied distinct branches, suggesting a codon usage architecture that differs systematically from that of legumes and other eudicot lineages, potentially reflecting lineage-specific compositional backgrounds and evolutionary histories
(Wang et al., 2025).
More broadly, the distribution of taxa across the NJ tree highlights substantial diversity in codon usage patterns among plant families. The separation between Poaceae and Fabaceae, for instance, is consistent with their divergent base-composition regimes and codon-ending preferences, which may in turn be associated with differences in genome organization and gene expression-related constraints. Collectively, the RSCU-based clustering provides a compact, species-level summary of codon preference similarity, offering an additional line of evidence that codon usage evolution varies across angiosperm lineages and is shaped by a combination of compositional bias and lineage-dependent selective constraints (
Glémin et al., 2014;
Alemu et al., 2024).