terminus.ink
EXP-011

Cross-Model Replication: Surprisal Typology Clusters by Family in Both Qwen and Gemma 4

@eazevedo

Question

Does the finding that surprisal curves cluster by language family replicate across different model architectures (Qwen2.5-7B dense vs Gemma 4 E2B MoE)?

Setup

Two models compared on identical data: Qwen2.5-7B (7B dense, 4-bit) and Google Gemma 4 E2B (5.1B-parameter MoE with 2.3B active, 4-bit), both running on an RTX 3060 Ti (8 GB). 12 languages from 6 families: Germanic (EN, DE, NL), Romance (PT, ES, FR, IT), Slavic (RU, PL), Turkic (TR), Japonic (JA), Sinitic (ZH). Same Wikipedia sentences, same 20-bin normalized-position scheme, same clustering methodology (Ward linkage on correlation distances). Gemma 4 has a 262K-token vocabulary (vs Qwen's 152K), an MoE architecture, and different training data.
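
The curve-extraction and clustering pipeline described above can be sketched as follows. Toy random data stands in for the real per-language surprisal curves; the language list and array shapes are illustrative, not the actual measurements.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy stand-in for the real data: one mean surprisal curve per language,
# averaged into 20 normalized-position bins (rows = languages, cols = bins).
rng = np.random.default_rng(0)
langs = ["EN", "DE", "NL", "PT", "ES", "FR", "IT", "RU", "PL", "TR", "JA", "ZH"]
curves = rng.normal(loc=5.0, scale=0.5, size=(len(langs), 20))

# Correlation distance between curves, then Ward linkage on the
# condensed distance matrix, as in the setup above.
dist = pdist(curves, metric="correlation")
Z = linkage(dist, method="ward")

# Cut the dendrogram into 6 clusters (one per hypothesized family).
labels = fcluster(Z, t=6, criterion="maxclust")
print(dict(zip(langs, labels)))
```

With real curves, the question is whether the cut recovers the family groupings rather than an arbitrary partition.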

Results

Metric                    | Qwen2.5-7B   | Gemma 4 E2B
Within-family distance    | 0.0073       | 0.0073
Between-family distance   | 0.0110       | 0.0184
Family clustering ratio   | 1.51x        | 2.52x
Within-order distance     | 0.0108       | 0.0151
Between-order distance    | 0.0102       | 0.0185
Order clustering ratio    | 0.94x (none) | 1.22x (weak)
Mean surprisal EN         | 4.35 bits    | 4.98 bits
Mean surprisal TR         | 5.10 bits    | 5.50 bits
Mean surprisal ZH         | 5.07 bits    | 6.28 bits
Mean surprisal RU         | 3.12 bits    | 4.24 bits
Mean surprisal JA         | 4.46 bits    | 5.78 bits
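
The clustering ratios follow directly from the distances in the table (mean between-group distance divided by mean within-group distance); small deviations from the reported ratios reflect rounding of the underlying distances:

```python
# Family ratios: between-family / within-family distance, from the table.
qwen_family = 0.0110 / 0.0073   # ~1.51x
gemma_family = 0.0184 / 0.0073  # ~2.52x

# Word-order ratios: essentially no signal in Qwen, weak in Gemma.
qwen_order = 0.0102 / 0.0108    # ~0.94x
gemma_order = 0.0185 / 0.0151   # ~1.22x (rounding)

print(round(qwen_family, 2), round(gemma_family, 2),
      round(qwen_order, 2), round(gemma_order, 2))
```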

Key findings

  • Family clustering replicates across architectures, but with important caveats. Gemma 4 shows a 2.52x family ratio vs Qwen's 1.51x. However, permutation testing reveals z-scored family clustering is significant in Gemma (p=0.0006) but only marginal in Qwen (p=0.07). The derivative test (pure shape change between position bins) shows no family signal in either model (p>0.15). The clustering lives in coarse curve features (decay steepness, plateau level), which partially correlate with each model's familiarity with the language.
  • The within-family distance appears identical (0.0073) across both models — this is a rounding coincidence. Actual values: Qwen 0.007304, Gemma 0.007298. Related languages do produce similar curve shapes in both models, but the exact match at 4 decimal places is not meaningful.
  • Romance languages form the tightest cluster in both models. Spanish, French, Portuguese, and Italian are nearly indistinguishable in surprisal curve shape, despite completely different tokenizers and training data.
  • Turkish and Japanese remain outliers in both models. Both are SOV but never cluster together — confirming word order does not drive the grouping.
  • Gemma 4 shows higher overall surprisal than Qwen (by ~0.4-1.3 bits across the languages in the table), consistent with its smaller effective parameter count (2.3B active vs 7B). The relative ordering of languages is preserved (Spearman rho=0.860).
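
The permutation test behind the p-values above can be sketched like this. Toy curves again stand in for the real data, and the exact z-scoring and permutation count of the original analysis are assumptions; the structure of the test (shuffle family labels, recompute the ratio, compare to the observed value) is the point.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def clustering_ratio(D, families):
    """Mean between-family distance / mean within-family distance."""
    n = len(families)
    within, between = [], []
    for i in range(n):
        for j in range(i + 1, n):
            (within if families[i] == families[j] else between).append(D[i, j])
    return np.mean(between) / np.mean(within)

rng = np.random.default_rng(0)
curves = rng.normal(5.0, 0.5, size=(12, 20))  # toy surprisal curves
# z-score each curve (assumed preprocessing for the "z-scored" variant)
curves = (curves - curves.mean(axis=1, keepdims=True)) / curves.std(axis=1, keepdims=True)
D = squareform(pdist(curves, metric="correlation"))

families = (["Germanic"] * 3 + ["Romance"] * 4 + ["Slavic"] * 2
            + ["Turkic", "Japonic", "Sinitic"])
observed = clustering_ratio(D, families)

# Null distribution: shuffle family labels, recompute the ratio.
null = [clustering_ratio(D, list(rng.permutation(families))) for _ in range(2000)]
p = (1 + sum(r >= observed for r in null)) / (1 + len(null))
print(f"ratio={observed:.2f} p={p:.3f}")
```

On random toy curves the p-value is unremarkable by construction; on the real curves this procedure yields the p=0.0006 (Gemma) and p=0.07 (Qwen) figures.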

Lesson learned

Cross-model replication strengthens but does not fully validate interpretability claims. The family clustering replicates across Qwen and Gemma, but robustness testing reveals the signal lives in coarse curve features (how steeply surprisal decays, where it plateaus) rather than fine-grained information rhythm. These coarse features partially correlate with how much training data each model saw for each language. Qualification: z-scored family clustering is significant in Gemma (p=0.0006) but marginal in Qwen (p=0.07). The derivative test (pure shape change between bins) shows no family signal in either model (p>0.15). Without this confound analysis, the claim that the signal reflects 'genuine linguistic structure' would be overstated. Future work should control for training data volume directly.
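
The derivative test can be approximated by differencing consecutive position bins before computing distances, which strips out the coarse features (overall level, plateau) and leaves only local bin-to-bin shape change. A sketch under those assumptions, with toy data:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def family_ratio(curves, families):
    # Mean between-family / mean within-family correlation distance.
    D = squareform(pdist(curves, metric="correlation"))
    n = len(families)
    w, b = [], []
    for i in range(n):
        for j in range(i + 1, n):
            (w if families[i] == families[j] else b).append(D[i, j])
    return np.mean(b) / np.mean(w)

rng = np.random.default_rng(1)
curves = rng.normal(5.0, 0.5, size=(12, 20))  # toy surprisal curves
families = (["Germanic"] * 3 + ["Romance"] * 4 + ["Slavic"] * 2
            + ["Turkic", "Japonic", "Sinitic"])

# Differencing consecutive bins removes the coarse curve features, so any
# family ratio that survives must come from fine-grained shape alone.
raw = family_ratio(curves, families)
shape_only = family_ratio(np.diff(curves, axis=1), families)
print(f"raw={raw:.2f} derivative={shape_only:.2f}")
```

In both models the derivative version of the ratio loses significance (p>0.15), which is what localizes the replicated signal in the coarse features.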

Tools used

Claude Opus 4 for experiment design and analysis. Qwen2.5-7B (4-bit) and Google Gemma 4 E2B (4-bit) as models under study. scipy for clustering. Wikipedia via HuggingFace datasets.

Experiment chain