EXP-010, 2026-04-08

Surprisal Typology: 12 Languages Cluster by Family, Not Word Order

#surprisal #typology #multilingual #language-families #word-order #clustering #qwen #information-theory #wikipedia

Question

Does the surprisal-by-sentence-position profile in an LLM cluster by language family (genealogy) or by syntactic word order (SOV vs SVO)?

Setup

Model: Qwen2.5-7B (4-bit quantized) on an RTX 3060 Ti (8 GB).
Languages: 12 languages from 6 families: Germanic (English, German, Dutch), Romance (Portuguese, Spanish, French, Italian), Slavic (Russian, Polish), Turkic (Turkish), Japonic (Japanese), Sinitic (Chinese).
Data: ~200K characters from Wikipedia per language, segmented into sentences and filtered to 8-80 tokens, leaving ~480-580 sentences per language.
Measure: per-token surprisal, averaged in 20 normalized position bins (0-100% of the sentence).
Analysis: hierarchical clustering (Ward linkage, correlation distance) on z-normalized curve shapes. Within-family vs between-family distances and within-word-order vs between-word-order distances were compared.
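
The binning and z-normalization steps can be sketched as below. This is a minimal illustration, not the code used in the experiment: the function names, the mapping of token index to bin (first token at 0%, last token at 100%), and the nats-to-bits helper are all assumptions. It expects per-token surprisals that have already been extracted from the model.

```python
import math

N_BINS = 20  # matches the 20 normalized position bins in the setup


def nats_to_bits(nll_nats):
    """Convert a negative log-likelihood from nats to bits."""
    return nll_nats / math.log(2)


def bin_index(token_pos, n_tokens, n_bins=N_BINS):
    """Map a 0-based token position to a bin over 0-100% of the sentence.

    Assumption: the first token sits at 0% and the last token at 100%.
    """
    if n_tokens == 1:
        return 0
    frac = token_pos / (n_tokens - 1)
    # frac == 1.0 would give n_bins, so clamp it into the last bin
    return min(int(frac * n_bins), n_bins - 1)


def position_profile(sentences, n_bins=N_BINS):
    """Mean surprisal (bits) per position bin.

    `sentences` is a list of lists of per-token surprisals in bits.
    Bins that receive no tokens are returned as NaN.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for sent in sentences:
        for i, s in enumerate(sent):
            b = bin_index(i, len(sent), n_bins)
            sums[b] += s
            counts[b] += 1
    return [sums[b] / counts[b] if counts[b] else float("nan")
            for b in range(n_bins)]


def z_normalize(curve):
    """Rescale a curve to mean 0 and standard deviation 1 (shape only)."""
    mean = sum(curve) / len(curve)
    std = math.sqrt(sum((x - mean) ** 2 for x in curve) / len(curve))
    if std == 0:
        raise ValueError("flat curve cannot be z-normalized")
    return [(x - mean) / std for x in curve]
```

Z-normalizing removes differences in overall surprisal level (for example Russian at 3.12 bits vs Turkish at 5.10 bits), so the later distance comparison reflects only the shape of each curve.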

Results

Language	Family	Word Order	Mean Surprisal (bits)	Start, 0% (bits)	End, 100% (bits)	N sentences
Russian	Slavic	SVO(free)	3.12	8.95	1.70	494
Spanish	Romance	SVO	3.71	9.57	2.11	523
French	Romance	SVO	3.74	9.77	2.23	541
Portuguese	Romance	SVO	3.84	10.34	1.91	497
Italian	Romance	SVO	3.90	10.87	2.08	480
German	Germanic	V2/SOV	4.12	12.07	1.92	528
Polish	Slavic	SVO(free)	4.18	12.91	2.26	535
English	Germanic	SVO	4.35	10.49	2.39	537
Dutch	Germanic	V2/SOV	4.46	13.13	1.94	515
Japanese	Japonic	SOV	4.46	10.81	2.65	577
Chinese	Sinitic	SVO	5.07	12.39	3.08	579
Turkish	Turkic	SOV	5.10	12.40	1.99	547

Key findings

Curves cluster by language family, not by word order. Mean within-family curve distance is 0.0073 vs 0.0110 between families (between/within ratio 1.51). Mean within-order distance is 0.0108 vs 0.0102 between orders (ratio 0.94, i.e. no separation). On this measure, the model's surprisal profiles track language genealogy rather than word order.
Romance languages form the tightest cluster. Portuguese, Spanish, French, and Italian have nearly overlapping surprisal curves and a narrow range of mean surprisal (3.71-3.90 bits), consistent with the model treating them as close variants of one information structure.
SOV languages do not cluster together. Turkish (SOV, Turkic) and Japanese (SOV, Japonic) have markedly different surprisal profiles despite sharing a canonical word order. Morphology and script are the likely drivers, although this experiment does not separate them from tokenizer effects.
Turkish is the overall outlier: it has the highest mean surprisal (5.10 bits, just above Chinese at 5.07) and is the most distant language in the hierarchical clustering. Its agglutinative morphology is the likely cause, in line with prior findings.
Chinese and Japanese end sentences with higher surprisal (3.08 and 2.65 bits) than the other ten languages (1.70-2.39 bits). A likely explanation is character density: each CJK character carries more information, so sentence-final positions remain informative, while the alphabetic languages converge to highly predictable endings.
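
The within-group vs between-group comparison behind the first finding can be sketched as follows. This is an illustration under assumptions, not the original analysis code: the helper names are hypothetical, and it assumes that "distance" means the mean pairwise correlation distance (1 minus Pearson r) over all language pairs that do or do not share a label.

```python
import math
from itertools import combinations


def correlation_distance(a, b):
    """1 - Pearson r between two equal-length curves."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(va * vb)


def within_between(curves, labels):
    """Mean pairwise distance for same-label and different-label pairs.

    curves: dict mapping language name -> surprisal curve
    labels: dict mapping language name -> group (family or word order)
    Returns (within_mean, between_mean, between_mean / within_mean).
    """
    within, between = [], []
    for x, y in combinations(sorted(curves), 2):
        d = correlation_distance(curves[x], curves[y])
        (within if labels[x] == labels[y] else between).append(d)
    w = sum(within) / len(within)
    b = sum(between) / len(between)
    return w, b, b / w


# Arithmetic check on the ratios reported above (between / within).
family_ratio = round(0.0110 / 0.0073, 2)  # 1.51
order_ratio = round(0.0102 / 0.0108, 2)   # 0.94
```

Running `within_between` once with family labels and once with word-order labels on the same curves gives the two ratios side by side. A ratio near 1 means the grouping explains nothing about curve shape; a ratio well above 1 means same-group curves are more alike.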

Lesson learned

The initial hypothesis, that word order (SOV vs SVO) would determine where surprisal peaks within the sentence, was falsified. The result is more interesting: language genealogy shapes the entire curve profile in a way that word-order typology does not. This suggests the model encodes similarity between related languages (shared morphology, phonotactics, vocabulary overlap) rather than surface syntactic properties, although shared subword vocabulary in the tokenizer may also contribute. Future work: test with byte-level models to remove tokenizer effects, and add languages from more families (e.g. Arabic, Korean, Hindi).
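
Short of switching to a byte-level model, one cheap partial control for tokenizer effects is to renormalize surprisal per UTF-8 byte or per character instead of per token. The sketch below is an assumption about how that could be done, not something run in this experiment, and the function names are hypothetical.

```python
def bits_per_byte(token_surprisals_bits, text):
    """Total sentence surprisal divided by its UTF-8 byte length."""
    n_bytes = len(text.encode("utf-8"))
    return sum(token_surprisals_bits) / n_bytes


def bits_per_char(token_surprisals_bits, text):
    """Total sentence surprisal divided by its length in code points."""
    return sum(token_surprisals_bits) / len(text)


# A CJK character takes 3 bytes in UTF-8, so the two normalizations
# differ sharply for Chinese and Japanese but agree for ASCII text.
latin = bits_per_byte([3.0, 5.0], "abcd")  # 8 bits over 4 bytes
cjk = bits_per_byte([6.0, 6.0], "日本")      # 12 bits over 6 bytes
```

This only changes the unit of measurement. It does not remove the tokenizer's influence on what the model predicts, so it complements rather than replaces the byte-level test proposed above.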

Tools used

Claude Opus 4 for experiment design and code generation. Qwen2.5-7B (4-bit) for surprisal computation. scipy for hierarchical clustering and correlation distance. Wikipedia via HuggingFace datasets (wikimedia/wikipedia 20231101 snapshot).
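
A minimal sketch of the scipy clustering step, assuming the curves are stacked into an array with one row per language. The function name and toy data are hypothetical, and the exact calls used in the experiment are not recorded here. One caveat from the scipy documentation: Ward linkage is only strictly defined for Euclidean distances, so applying it to correlation distances, as this sketch does, is a heuristic.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from scipy.stats import zscore


def cluster_curves(curves, n_clusters):
    """Cluster surprisal curves by shape.

    curves: array of shape (n_languages, n_bins)
    Returns one integer cluster label per language.
    """
    z = zscore(curves, axis=1)           # z-normalize each curve separately
    d = pdist(z, metric="correlation")   # condensed matrix of 1 - Pearson r
    tree = linkage(d, method="ward")     # Ward linkage on those distances
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy data: two falling curves and two rising curves.
toy = np.array([
    [9.0, 5.0, 3.0, 2.0],
    [10.0, 5.0, 3.0, 2.5],
    [2.0, 3.0, 5.0, 9.0],
    [2.5, 3.0, 5.0, 10.0],
])
toy_labels = cluster_curves(toy, n_clusters=2)
```

On the toy data the two falling curves share one label and the two rising curves share the other. The `tree` array can also be passed to `scipy.cluster.hierarchy.dendrogram` to see which language merges last.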