Cross-Model Replication: Surprisal Typology Clusters by Family in Both Qwen and Gemma 4
@eazevedo
Question
Does the finding that surprisal curves cluster by language family replicate across different model architectures (Qwen2.5-7B dense vs Gemma 4 E2B MoE)?
Setup
Two models compared on identical data: Qwen2.5-7B (7B dense, 4-bit) and Google Gemma 4 E2B (5.1B MoE with 2.3B active, 4-bit), both on an RTX 3060 Ti (8 GB). 12 languages from 6 families: Germanic (EN, DE, NL), Romance (PT, ES, FR, IT), Slavic (RU, PL), Turkic (TR), Japonic (JA), Sinitic (ZH). Same Wikipedia sentences, same 20-bin normalized position, same clustering methodology (Ward linkage, correlation distance). Gemma 4 has a 262K vocabulary (vs Qwen's 152K), an MoE architecture, and different training data.
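The 20-bin normalized-position step can be sketched as follows. This is a minimal reconstruction, not the actual experiment code; `surprisal_curve` and its input format are illustrative assumptions:

```python
import numpy as np

N_BINS = 20  # normalized sentence position, as in the setup above

def surprisal_curve(token_surprisals_per_sentence, n_bins=N_BINS):
    """Average token surprisal (bits) into n_bins normalized positions,
    then average across sentences to get one curve per language."""
    curves = []
    for sent in token_surprisals_per_sentence:
        s = np.asarray(sent, dtype=float)
        pos = np.arange(len(s)) / len(s)               # position in [0, 1)
        bins = np.minimum((pos * n_bins).astype(int), n_bins - 1)
        curve = np.array([s[bins == b].mean() if (bins == b).any() else np.nan
                          for b in range(n_bins)])
        curves.append(curve)
    return np.nanmean(np.vstack(curves), axis=0)       # one n_bins-long curve
```

Short sentences leave some bins empty (NaN for that sentence), which the final `nanmean` smooths over across sentences.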
Results
| Metric | Qwen2.5-7B | Gemma 4 E2B |
|---|---|---|
| Within-family distance | 0.0073 | 0.0073 |
| Between-family distance | 0.0110 | 0.0184 |
| Family clustering ratio | 1.51x | 2.52x |
| Within-order distance | 0.0108 | 0.0151 |
| Between-order distance | 0.0102 | 0.0185 |
| Order clustering ratio | 0.94x (none) | 1.22x (weak) |
| Mean surprisal EN | 4.35 bits | 4.98 bits |
| Mean surprisal TR | 5.10 bits | 5.50 bits |
| Mean surprisal ZH | 5.07 bits | 6.28 bits |
| Mean surprisal RU | 3.12 bits | 4.24 bits |
| Mean surprisal JA | 4.46 bits | 5.78 bits |
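The within/between-family distances and the clustering ratio in the table can be computed with a sketch like this, assuming `curves` maps language codes to 20-bin surprisal arrays (the helper name is illustrative; correlation distance matches the setup):

```python
import numpy as np
from itertools import combinations
from scipy.spatial.distance import correlation  # 1 - Pearson r

# Family assignment from the setup above.
FAMILY = {"EN": "Germanic", "DE": "Germanic", "NL": "Germanic",
          "PT": "Romance", "ES": "Romance", "FR": "Romance", "IT": "Romance",
          "RU": "Slavic", "PL": "Slavic",
          "TR": "Turkic", "JA": "Japonic", "ZH": "Sinitic"}

def clustering_ratio(curves, labels=FAMILY):
    """Mean between-family distance over mean within-family distance.
    A ratio > 1 means related languages have more similar curves."""
    within, between = [], []
    for a, b in combinations(curves, 2):
        d = correlation(curves[a], curves[b])
        (within if labels[a] == labels[b] else between).append(d)
    return np.mean(between) / np.mean(within)
```

Singleton families (Turkic, Japonic, Sinitic) contribute only between-family pairs, so the within-family mean is driven by Germanic, Romance, and Slavic.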
Key findings
- Family clustering replicates across architectures, but with important caveats. Gemma 4 shows a 2.52x family ratio vs Qwen's 1.51x. However, permutation testing reveals that z-scored family clustering is significant in Gemma (p=0.0006) but only marginal in Qwen (p=0.07), and the derivative test (pure shape change between position bins) shows no family signal in either model (p>0.15). The clustering lives in coarse curve features (decay steepness, plateau level), which partially correlate with each model's familiarity with each language.
- The within-family distance appears identical (0.0073) across both models — this is a rounding coincidence. Actual values: Qwen 0.007304, Gemma 0.007298. Related languages do produce similar curve shapes in both models, but the exact match at 4 decimal places is not meaningful.
- Romance languages form the tightest cluster in both models. Spanish, French, Portuguese, and Italian are nearly indistinguishable in surprisal curve shape, despite completely different tokenizers and training data.
- Turkish and Japanese remain outliers in both models. Both are SOV but never cluster together — confirming word order does not drive the grouping.
- Gemma 4 shows higher overall surprisal than Qwen (by ~0.4-1.3 bits across the languages in the table), consistent with its smaller effective parameter count (2.3B active vs 7B). The relative ordering of languages is preserved (Spearman rho=0.860).
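The permutation test behind the p-values above can be sketched as follows. This is a minimal reconstruction under stated assumptions (names, the distance-dict input format, and the permutation count are illustrative): shuffle family labels across languages and ask how often a shuffled clustering ratio meets or exceeds the observed one.

```python
import numpy as np

def permutation_p(pair_dist, families, n_perm=10000, seed=0):
    """pair_dist: dict[(lang_a, lang_b)] -> curve distance (a < b);
    families: dict[lang] -> family label.
    Returns (observed ratio, one-sided permutation p-value)."""
    langs = sorted(families)
    rng = np.random.default_rng(seed)

    def ratio(fam):
        within = [d for (a, b), d in pair_dist.items() if fam[a] == fam[b]]
        between = [d for (a, b), d in pair_dist.items() if fam[a] != fam[b]]
        return np.mean(between) / np.mean(within)

    observed = ratio(families)
    labels = [families[l] for l in langs]
    hits = sum(ratio(dict(zip(langs, rng.permutation(labels)))) >= observed
               for _ in range(n_perm))
    return observed, (hits + 1) / (n_perm + 1)  # add-one smoothed p-value
```

Because the label multiset is permuted (family sizes 3, 4, 2, 1, 1, 1 are preserved), every shuffle still produces some within-family pairs, so the ratio is always defined.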
Lesson learned
Cross-model replication strengthens but does not fully validate interpretability claims. Family clustering replicates across Qwen and Gemma, but robustness testing shows the signal lives in coarse curve features (how steeply surprisal decays, where it plateaus) rather than in fine-grained information rhythm, and these coarse features partially correlate with how much training data each model saw per language. Qualification: z-scored family clustering is significant in Gemma (p=0.0006) but only marginal in Qwen (p=0.07), and the derivative test (pure shape change between bins) shows no family signal in either model (p>0.15). Without this confound analysis, the claim that the signal reflects 'genuine linguistic structure' would be overstated. Future work should control for training-data volume directly.
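The derivative test above reduces each curve to its bin-to-bin changes before comparing languages, removing overall level and decay magnitude so that only the change pattern remains. A minimal sketch of that preprocessing step (the helper name is illustrative, not the experiment code):

```python
import numpy as np

def derivative_profile(curve):
    """First difference between adjacent position bins, z-scored so only
    the shape of the change pattern (not its scale or offset) survives."""
    d = np.diff(np.asarray(curve, dtype=float))
    return (d - d.mean()) / d.std()
```

Clustering these 19-element profiles with the same distance and linkage is what yields no family signal (p>0.15) in either model.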
Tools used
Claude Opus 4 for experiment design and analysis. Qwen2.5-7B (4-bit) and Google Gemma 4 E2B (4-bit) as models under study. scipy for clustering. Wikipedia via HuggingFace datasets.