Genome phylogeny and taxonomy were obtained from the Genome Taxonomy Database (GTDB), release R214. The reference tree (bac120_r214.tree) comprises 80,789 bacterial genomes placed on a phylogeny built from 120 concatenated marker proteins. Nitrogen cycle gene annotations (narG, napA, nirS, nirK, norB, nosZ, nrfA) were obtained from AnnoTree, which assigns gene presence/absence using hidden Markov model (HMM) searches against KEGG Orthology profiles.
Genomes were assigned to quality tiers based on CheckM completeness and contamination estimates:
Complete/High Quality: ≥95% completeness, ≤5% contamination
Near Complete: 90–95% completeness, ≤5% contamination
Medium Quality Draft: 70–90% completeness, ≤10% contamination
Fragmented: <70% completeness or >10% contamination
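The tier rules above can be sketched as a small function. This is an illustrative sketch, not the code used to build the dataset: the function name is hypothetical, and two behaviours are assumptions because the tier definitions leave them open. Genomes at exactly 95% or 90% completeness go to the higher tier, and genomes with ≥90% completeness but 5–10% contamination are treated as Medium Quality Draft.

```python
def quality_tier(completeness: float, contamination: float) -> str:
    """Assign a quality tier from CheckM completeness and contamination (percent)."""
    # Tiers are tested from strictest to loosest, so a shared boundary value
    # (95% or 90% completeness) lands in the higher tier. This is an assumption.
    if completeness >= 95 and contamination <= 5:
        return "Complete/High Quality"
    if completeness >= 90 and contamination <= 5:
        return "Near Complete"
    # Assumption: any genome with >=70% completeness and <=10% contamination
    # that missed the tiers above is a Medium Quality Draft.
    if completeness >= 70 and contamination <= 10:
        return "Medium Quality Draft"
    # Everything else has <70% completeness or >10% contamination.
    return "Fragmented"
```

For example, `quality_tier(92.0, 2.0)` gives "Near Complete", and `quality_tier(99.0, 12.0)` gives "Fragmented" because contamination exceeds 10%.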
The quality filter toggles allow inclusion or exclusion of each tier. Gene presence percentages and threshold calculations are recomputed in real time based on the selected quality tiers.
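A minimal sketch of the recomputation, assuming each genome is a record with a tier label and a set of annotated genes. The record layout and the function name are hypothetical; the actual client-side code is not shown here.

```python
def gene_presence_percent(genomes, selected_tiers, gene):
    """Percent of genomes in the selected tiers that carry `gene`.

    `genomes` is a list of dicts with hypothetical keys "tier" (str)
    and "genes" (set of gene names). Returns None if no genome passes
    the filter, so the caller can show "n/a" instead of dividing by zero.
    """
    kept = [g for g in genomes if g["tier"] in selected_tiers]
    if not kept:
        return None
    carriers = sum(1 for g in kept if gene in g["genes"])
    return 100.0 * carriers / len(kept)


# Toy data for illustration only; these are not real genomes.
genomes = [
    {"tier": "Complete/High Quality", "genes": {"nirK", "nosZ"}},
    {"tier": "Complete/High Quality", "genes": {"nirK"}},
    {"tier": "Fragmented", "genes": set()},
    {"tier": "Near Complete", "genes": {"nosZ"}},
]
```

Toggling a tier changes the denominator as well as the numerator, which is why the percentages shift when, for example, Fragmented genomes are excluded.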
AnnoTree reports a single norB annotation. We classified norB sequences as cytochrome c-oxidizing (cNor) or quinol-oxidizing (qNor) based on protein length. qNor is a fusion of NorB and NorC into a single polypeptide (~765 aa), whereas cNor NorB alone is ~455 aa. A histogram of all 11,399 norB sequence lengths revealed a bimodal distribution with a clear valley at ~635 aa, which was used as the classification cutoff. Only 83 sequences (0.7%) fell within ±30 aa of the cutoff, and 216 sequences shorter than 200 aa were flagged as fragments.
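The length-based rule can be written down directly. The sketch below uses the 635 aa cutoff, the ±30 aa margin, and the 200 aa fragment limit from the text; the function name, the returned field names, and the choice to call a sequence of exactly 635 aa qNor are assumptions made for illustration.

```python
CUTOFF_AA = 635      # valley of the bimodal length distribution
MARGIN_AA = 30       # window around the cutoff treated as borderline
FRAGMENT_AA = 200    # sequences shorter than this are flagged as fragments


def classify_norb(length_aa: int) -> dict:
    """Classify a norB sequence as cNor or qNor from its length in amino acids."""
    # Assumption: a sequence exactly at the cutoff is called qNor.
    label = "cNor" if length_aa < CUTOFF_AA else "qNor"
    return {
        "label": label,
        # Within +/-30 aa of the cutoff: the call is less certain.
        "borderline": abs(length_aa - CUTOFF_AA) <= MARGIN_AA,
        # Very short sequences are likely truncated gene calls.
        "fragment": length_aa < FRAGMENT_AA,
    }
```

A typical cNor NorB (about 455 aa) and a typical qNor (about 765 aa) both sit more than 100 aa from the cutoff, so neither is marked borderline.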
NosZ sequences were classified as Clade I (typical) or Clade II (atypical) by scoring all 7,062 sequences against two profile HMMs using HMMER v3.4: the FunGene nosZ profile (Clade I, 638 positions) and the FunGene nosZ_a2 profile (Clade II, 656 positions). Each sequence was assigned to whichever HMM yielded the higher full-sequence bit score. Score differences between the two models were typically hundreds of bits; only 8 sequences (0.1%) had a score difference below 50 bits, and all of these marginally favored Clade I.
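The assignment step can be sketched as follows, assuming the scores were written with the hmmsearch `--tblout` option, where the first whitespace-separated field is the target sequence name and the sixth is the full-sequence bit score. The function names are hypothetical, and resolving an exact tie in favour of Clade I is an assumption.

```python
def read_full_scores(tblout_lines):
    """Return {sequence name: full-sequence bit score} from --tblout lines."""
    scores = {}
    for line in tblout_lines:
        # Skip blank lines and the '#' comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        name, score = fields[0], float(fields[5])
        # Keep the best score if a sequence is listed more than once.
        scores[name] = max(score, scores.get(name, float("-inf")))
    return scores


def assign_clades(clade1_scores, clade2_scores, low_margin_bits=50.0):
    """Return {name: (clade, is_low_margin)} using the higher bit score."""
    calls = {}
    for name in set(clade1_scores) | set(clade2_scores):
        # A sequence missing from one table is treated as scoring -inf there.
        s1 = clade1_scores.get(name, float("-inf"))
        s2 = clade2_scores.get(name, float("-inf"))
        clade = "Clade I" if s1 >= s2 else "Clade II"  # tie -> Clade I (assumption)
        calls[name] = (clade, abs(s1 - s2) < low_margin_bits)
    return calls
```

The second element of each tuple marks calls where the two models scored within 50 bits of each other, corresponding to the 8 low-margin sequences reported above.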
For visualization, monophyletic clades sharing an identical gene presence/absence genotype were collapsed into a single visual node regardless of clade size. Remaining heterogeneous subtrees containing ≤80 tips were also collapsed, with gene presence determined by a majority-rule threshold (default 50%, adjustable via slider): a gene is shown if at least the threshold percentage of genomes in the collapsed clade carry it. The size of a collapsed node (dot) scales with log₂(number of genomes + 1). Search queries match against all genera present within a collapsed clade, not only the majority genus displayed in the label.
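The majority rule and the dot-size scaling can be illustrated with two short functions. This is a sketch under assumptions: the function names and the `scale` multiplier are hypothetical, each tip is represented as a set of gene names, and the tree traversal that decides which subtrees to collapse is not shown.

```python
import math


def collapsed_genes(tip_genotypes, threshold_pct=50.0):
    """Genes displayed for a collapsed clade under the majority-rule threshold.

    `tip_genotypes` is a list with one set of gene names per genome.
    A gene is kept if at least `threshold_pct` percent of genomes carry it.
    """
    n = len(tip_genotypes)
    if n == 0:
        return set()
    counts = {}
    for genes in tip_genotypes:
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {g for g, c in counts.items() if 100.0 * c / n >= threshold_pct}


def dot_size(n_genomes, scale=1.0):
    """Dot size proportional to log2(number of genomes + 1)."""
    return scale * math.log2(n_genomes + 1)
```

With four tips of which all carry nirK and one carries nosZ, the default 50% threshold displays only nirK; lowering the threshold to 25% displays both. The logarithmic scaling keeps very large clades from dominating the canvas: a clade of 7 genomes is drawn at three times the size of a single genome, not seven times.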
Visualization built with HTML5 Canvas by Claude Opus 4.6. All rendering is client-side; no server required.
Data processed by Aayushi Shah and Karna Gowda (2026).