terminus.ink
EXP-010

Surprisal Typology: 12 Languages Cluster by Family, Not Word Order

@eazevedo

Question

Does the surprisal-by-sentence-position profile in an LLM cluster by language family (genealogy) or by syntactic word order (SOV vs SVO)?

Setup

  • Model: Qwen2.5-7B (4-bit quantized), run on an RTX 3060 Ti (8 GB).
  • Languages: 12 languages from 6 families: Germanic (English, German, Dutch), Romance (Portuguese, Spanish, French, Italian), Slavic (Russian, Polish), Turkic (Turkish), Japonic (Japanese), Sinitic (Chinese).
  • Data: ~200K characters of Wikipedia text per language, segmented into sentences and filtered to 8-80 tokens, leaving ~480-580 sentences per language.
  • Measurement: per-token surprisal, measured in 20 normalized position bins (0-100% of the sentence).
  • Analysis: hierarchical clustering (Ward linkage, correlation distance) on z-normalized curve shapes, plus a comparison of within-family vs between-family and within-word-order vs between-word-order distances.
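
The measurement step in the setup can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the experiment's actual code: it assumes per-token natural-log probabilities have already been obtained from the model, and the function names (`surprisal_bits`, `bin_by_position`, `z_normalize`) and the exact position-to-bin rule are hypothetical.

```python
import math
from statistics import mean, pstdev

N_BINS = 20  # number of normalized position bins, as in the setup


def surprisal_bits(logprob_nats):
    """Convert a natural-log token probability into surprisal in bits."""
    return -logprob_nats / math.log(2)


def bin_by_position(sentences, n_bins=N_BINS):
    """Average per-token surprisal within normalized sentence-position bins.

    `sentences` is a list of lists of per-token surprisals (bits).
    Assumed rule: token i of an n-token sentence sits at relative
    position i / (n - 1), so the first token is 0% and the last is 100%.
    """
    buckets = [[] for _ in range(n_bins)]
    for sent in sentences:
        n = len(sent)
        if n < 2:
            continue  # a relative position is undefined for one token
        for i, s in enumerate(sent):
            rel = i / (n - 1)
            b = min(int(rel * n_bins), n_bins - 1)  # clamp 100% into last bin
            buckets[b].append(s)
    # Bins that received no tokens are reported as NaN.
    return [mean(b) if b else float("nan") for b in buckets]


def z_normalize(curve):
    """Rescale a curve to zero mean and unit (population) standard deviation.

    Assumes the curve has no NaN bins and is not perfectly flat.
    """
    mu = mean(curve)
    sd = pstdev(curve)
    return [(x - mu) / sd for x in curve]
```

For example, `bin_by_position([[4.0, 2.0, 1.0]], n_bins=2)` places the first token in the first bin and the remaining two in the second. Z-normalizing each language's curve removes differences in overall surprisal level, so the later comparison looks only at curve shape.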

Results

Language     Family     Word order   Mean surprisal (bits)   Start, 0% (bits)   End, 100% (bits)   N sentences
Russian      Slavic     SVO (free)   3.12                    8.95               1.70               494
Spanish      Romance    SVO          3.71                    9.57               2.11               523
French       Romance    SVO          3.74                    9.77               2.23               541
Portuguese   Romance    SVO          3.84                    10.34              1.91               497
Italian      Romance    SVO          3.90                    10.87              2.08               480
German       Germanic   V2/SOV       4.12                    12.07              1.92               528
Polish       Slavic     SVO (free)   4.18                    12.91              2.26               535
English      Germanic   SVO          4.35                    10.49              2.39               537
Dutch        Germanic   V2/SOV       4.46                    13.13              1.94               515
Japanese     Japonic    SOV          4.46                    10.81              2.65               577
Chinese      Sinitic    SVO          5.07                    12.39              3.08               579
Turkish      Turkic     SOV          5.10                    12.40              1.99               547

Key findings

  • Curves cluster by language family, not by word order. Mean within-family curve distance is 0.0073 versus 0.0110 between families (a between/within ratio of 1.51). For word order, the distances are 0.0108 within and 0.0102 between (ratio 0.94), i.e. no detectable effect. This suggests the model's surprisal profile tracks language genealogy rather than canonical word order.
  • Romance languages form the tightest cluster. Portuguese, Spanish, French, and Italian have nearly overlapping surprisal curves (mean 3.71-3.90 bits), consistent with the model treating them as close variants of the same information structure.
  • SOV languages do not cluster together. Turkish (SOV, Turkic) and Japanese (SOV, Japonic) have clearly different surprisal profiles despite sharing the same canonical word order (for example, sentence-final surprisal of 1.99 vs 2.65 bits). Morphology and script appear to outweigh syntax.
  • Turkish is the clearest outlier: it has the highest mean surprisal (5.10 bits) and is the most distant language in the hierarchical clustering. Its agglutinative morphology plausibly produces a different information-theoretic structure, consistent with prior findings.
  • Chinese and Japanese end sentences with higher surprisal (3.08 and 2.65 bits) than the other ten languages (1.70-2.39 bits). One plausible explanation is CJK character density: sentence-final positions still carry substantial information, whereas the alphabetic languages converge to highly predictable endings.
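
The within-group versus between-group comparison behind the first finding can be sketched as below. This is an illustrative reconstruction, not the experiment's code: the helper names (`correlation_distance`, `within_between`) are hypothetical, the toy curves are made up, and averaging over all language pairs is an assumption about how the reported means were formed.

```python
from itertools import combinations
from statistics import mean


def correlation_distance(u, v):
    """1 minus the Pearson correlation of two equal-length curves."""
    mu_u, mu_v = mean(u), mean(v)
    du = [x - mu_u for x in u]
    dv = [x - mu_v for x in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = (sum(a * a for a in du) * sum(b * b for b in dv)) ** 0.5
    return 1.0 - num / den


def within_between(curves, labels):
    """Mean pairwise distance for same-label vs different-label pairs.

    `curves` maps language -> curve; `labels` maps language -> group
    (a family or a word-order class). Groups with a single member
    contribute no within-group pairs.
    """
    within, between = [], []
    for a, b in combinations(sorted(curves), 2):
        d = correlation_distance(curves[a], curves[b])
        (within if labels[a] == labels[b] else between).append(d)
    return mean(within), mean(between)


# Toy data (made up): two similar curves in group "A", one different in "B".
toy_curves = {
    "lang_a1": [4.0, 2.0, 1.0, 1.0],
    "lang_a2": [4.2, 2.1, 1.0, 0.9],
    "lang_b1": [1.0, 3.0, 2.0, 4.0],
}
toy_labels = {"lang_a1": "A", "lang_a2": "A", "lang_b1": "B"}
w, b = within_between(toy_curves, toy_labels)
ratio = b / w  # between/within; values above 1 mean groups are cohesive
```

The reported ratios match a between/within convention: 0.0110 / 0.0073 ≈ 1.51 for family and 0.0102 / 0.0108 ≈ 0.94 for word order. With the families listed in the setup, only Germanic, Romance, and Slavic have more than one member, so the within-family mean rests on 3 + 6 + 1 = 10 of the 66 language pairs.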

Lesson learned

The initial hypothesis, that word order (SOV vs SVO) would determine where surprisal peaks within the sentence, was falsified. The actual result is more interesting: language genealogy shapes the entire curve profile in a way that word-order typology does not. This suggests the model encodes structural similarity between related languages (shared morphology, phonotactics, vocabulary overlap) rather than surface syntactic properties. Future work: test with byte-level models to remove tokenizer effects, and add languages from more diverse families (e.g. Arabic, Korean, Hindi).

Tools used

Claude Opus 4 for experiment design and code generation. Qwen2.5-7B (4-bit) for surprisal computation. scipy for hierarchical clustering and correlation distance. Wikipedia via HuggingFace datasets (wikimedia/wikipedia 20231101 snapshot).
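
A rough sketch of how the scipy clustering step might look, using synthetic curves rather than the experiment's data. The array contents, noise level, and the choice to cut the tree into two clusters are assumptions made for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from scipy.stats import zscore

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)  # 20 normalized position bins

# Synthetic stand-ins for per-language curves (NOT the experiment's data):
# two noisy copies of an exponential decay and two of a linear decline.
base_a = 10.0 * np.exp(-5.0 * x) + 2.0
base_b = 12.0 - 10.0 * x
curves = np.vstack([
    base_a + rng.normal(0.0, 0.05, x.size),
    base_a + rng.normal(0.0, 0.05, x.size),
    base_b + rng.normal(0.0, 0.05, x.size),
    base_b + rng.normal(0.0, 0.05, x.size),
])

# Z-normalize each curve so that only its shape is compared.
shapes = zscore(curves, axis=1)

# Condensed matrix of pairwise correlation distances (1 - Pearson r).
dists = pdist(shapes, metric="correlation")

# Ward linkage on the precomputed distances. scipy accepts this input,
# but its documentation notes that Ward is only strictly defined for
# Euclidean distances, so the resulting tree is a heuristic.
tree = linkage(dists, method="ward")

# Cut the tree into at most two flat clusters.
cluster_ids = fcluster(tree, t=2, criterion="maxclust")
```

In this toy case the two decay-shaped curves fall into one cluster and the two linear ones into the other. Correlation distance is already insensitive to each curve's mean and scale, so the z-normalization mainly matters if another metric is swapped in.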
