#typology

3 experiments

EXP-012 · 2026-04-13

Information Topology of Natural Language

How does mutual information between tokens decay with distance across typologically diverse languages, and does language structure (morphology, word order) shape the information topology?

Power-law MI decay holds in all 5 languages tested (R²>0.96). An exponential fit fails badly, with log-linear R² below -10. SSM expon…
Beta exponent descriptively splits by morphology: analytic (en/pt) 1.1-1.2 vs agglutinative (tr/fi/ar) 0.87-0.98. CIs ov…
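The power-law versus exponential comparison summarized above can be illustrated with a minimal sketch. It assumes both models are fit by ordinary least squares in log space, which may differ from the procedure the experiment used: an in-sample least-squares fit with an intercept cannot produce a negative R², so the reported values below -10 imply that the exponential parameters were estimated some other way and then scored in log space. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination of predictions y_hat against y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_power_law(d, mi):
    """Fit MI(d) = A * d**(-beta) as a straight line in log-log space."""
    x, y = np.log(d), np.log(mi)
    slope, intercept = np.polyfit(x, y, 1)
    return -slope, r_squared(y, slope * x + intercept)

def fit_exponential(d, mi):
    """Fit MI(d) = A * exp(-rate * d) as a straight line in log-linear space."""
    y = np.log(mi)
    slope, intercept = np.polyfit(d, y, 1)
    return -slope, r_squared(y, slope * d + intercept)

# Synthetic stand-in data (NOT the experiment's measurements):
# an exact power law with beta = 1.1 over token distances 1..200.
d = np.arange(1, 201, dtype=float)
mi = 0.5 * d ** -1.1

beta, r2_pow = fit_power_law(d, mi)
rate, r2_exp = fit_exponential(d, mi)
print(f"power law:   beta={beta:.3f}  R2={r2_pow:.4f}")
print(f"exponential: rate={rate:.4f}  R2={r2_exp:.4f}")
```

On data that really follows a power law, the log-log fit recovers the exponent and the log-linear exponential fit scores visibly worse, which is the qualitative pattern the summary describes.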

#information-theory #mutual-information #power-law #typology

EXP-011 · 2026-04-08

Cross-Model Replication: Surprisal Typology Clusters by Family in Both Qwen and Gemma 4

Does the finding that surprisal curves cluster by language family replicate across different model architectures (Qwen2.5-7B dense vs Gemma 4 E2B MoE)?

Family clustering replicates across architectures, but with important caveats. Gemma 4 shows 2.52x family ratio vs Qwen'…
The within-family distance appears identical (0.0073) across both models, but this is a rounding coincidence. Actual values…

#replication #surprisal #typology #multilingual

EXP-010 · 2026-04-08

Surprisal Typology: 12 Languages Cluster by Family, Not Word Order

Does the surprisal-by-sentence-position profile in an LLM cluster by language family (genealogy) or by syntactic word order (SOV vs SVO)?

Surprisal curves cluster by language family, not by word order. Within-family curve distance is 0.0073 vs between-family 0.0110 (1.51x rat…
Romance languages form the tightest cluster. Portuguese, Spanish, French, and Italian have nearly overlapping surprisal …
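A family ratio of the kind reported above can be computed as the mean pairwise curve distance between families divided by the mean pairwise distance within families. The sketch below is an assumption-laden illustration: the distance metric (root-mean-square difference between curves), the `family_ratio` helper, and the toy curves with placeholder labels are all hypothetical, since the summary does not state how the distances were measured.

```python
from itertools import combinations
import numpy as np

def family_ratio(curves, family):
    """Return (mean within-family distance, mean between-family distance, ratio).

    curves: dict mapping language label -> 1-D array (surprisal by position)
    family: dict mapping language label -> family label
    Distance is RMS difference between curves (an assumed metric).
    """
    within, between = [], []
    for a, b in combinations(sorted(curves), 2):
        dist = float(np.sqrt(np.mean((curves[a] - curves[b]) ** 2)))
        (within if family[a] == family[b] else between).append(dist)
    w, b = float(np.mean(within)), float(np.mean(between))
    return w, b, b / w

# Toy curves with placeholder labels (NOT the experiment's data).
curves = {
    "A1": np.array([1.0, 0.8, 0.6]),
    "A2": np.array([1.0, 0.8, 0.7]),
    "B1": np.array([1.2, 1.0, 0.9]),
    "B2": np.array([1.2, 1.1, 0.9]),
}
family = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}

within, between, ratio = family_ratio(curves, family)
print(f"within={within:.4f}  between={between:.4f}  ratio={ratio:.2f}x")
```

A ratio above 1 means curves from the same family sit closer together than curves from different families. The same function could be rerun with word-order labels in place of family labels to compare the two groupings.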

#surprisal #typology #multilingual #language-families