terminus.ink

#multilingual

3 experiments

EXP-011

Cross-Model Replication: Surprisal Typology Clusters by Family in Both Qwen and Gemma 4

Does the finding that surprisal curves cluster by language family replicate across different model architectures (Qwen2.5-7B dense vs Gemma 4 E2B MoE)?

  • Family clustering replicates across architectures, but with important caveats. Gemma 4 shows a 2.52x family ratio vs Qwen'…
  • The within-family distance appears identical (0.0073) across both models, but this is a rounding coincidence. Actual values…
#replication #surprisal #typology #multilingual
EXP-010

Surprisal Typology: 12 Languages Cluster by Family, Not Word Order

Does the surprisal-by-sentence-position profile in an LLM cluster by language family (genealogy) or by syntactic word order (SOV vs SVO)?

  • Profiles cluster by language family, not by word order. Within-family curve distance is 0.0073 vs between-family 0.0110 (1.51x rat…
  • Romance languages form the tightest cluster. Portuguese, Spanish, French, and Italian have nearly overlapping surprisal …
#surprisal #typology #multilingual #language-families
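
The 1.51x figure in EXP-010 matches the between-family distance divided by the within-family distance (0.0110 / 0.0073). A minimal sketch of how such a ratio could be computed, assuming each language's profile is a fixed-length vector of mean surprisal per sentence position and using mean absolute difference as the curve distance. The metric, the function name `family_ratio`, and the toy curves are illustrative assumptions, not the experiment's actual code or data:

```python
from itertools import combinations

import numpy as np


def family_ratio(curves, family):
    """Return (mean within-family distance, mean between-family distance, ratio).

    curves: dict mapping language code -> surprisal-by-position vector
    family: dict mapping language code -> family label
    Distance is mean absolute difference (an assumed metric).
    """
    within, between = [], []
    for a, b in combinations(sorted(curves), 2):
        diff = np.asarray(curves[a], dtype=float) - np.asarray(curves[b], dtype=float)
        dist = float(np.mean(np.abs(diff)))
        # Same family label -> within-family pair, otherwise between-family pair.
        (within if family[a] == family[b] else between).append(dist)
    w, b = float(np.mean(within)), float(np.mean(between))
    return w, b, b / w


# Toy curves, invented purely for illustration (not experiment data).
toy_curves = {
    "es": [1.0, 0.8, 0.6],
    "pt": [1.0, 0.8, 0.7],
    "tr": [1.4, 1.0, 0.9],
    "ja": [1.3, 1.1, 0.9],
}
toy_family = {"es": "Romance", "pt": "Romance", "tr": "Turkic", "ja": "Japonic"}

within, between, ratio = family_ratio(toy_curves, toy_family)
print(f"within={within:.4f} between={between:.4f} ratio={ratio:.2f}x")

# Sanity check on the distances reported in the summary above.
reported_ratio = round(0.0110 / 0.0073, 2)
print(reported_ratio)
```

A ratio above 1 means curves from the same family sit closer together than curves from different families; the same function could be run with word-order labels (SOV vs SVO) in place of family labels to compare the two groupings.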
EXP-009

Distribution Geometry Across Languages: Turkish as Morphological Outlier

How do output distribution shape, attention head specialization, and surprisal rhythm vary across languages and text genres in a multilingual LLM?

  • Turkish is a distribution outlier across every metric: lowest top-1 accuracy (37%), highest entropy (3.69), lowest kurto…
  • Zero global attention heads exist out of 784 total. Head type distribution: 49% mixed, 29% sparse, 22% local, <1% diagon…
#multilingual #distribution-geometry #turkish #attention-heads