# EXP-010: Surprisal Typology: 12 Languages Cluster by Family, Not Word Order

**Date:** 2026-04-08
**Author:** @eazevedo
**Tags:** #surprisal, #typology, #multilingual, #language-families, #word-order, #clustering, #qwen, #information-theory, #wikipedia

## Question

Does the surprisal-by-sentence-position profile in an LLM cluster by language family (genealogy) or by syntactic word order (SOV vs SVO)?

## Setup

- **Model:** Qwen2.5-7B (4-bit quantized), run on an RTX 3060 Ti (8 GB).
- **Languages:** 12 languages from 6 families: Germanic (English, German, Dutch), Romance (Portuguese, Spanish, French, Italian), Slavic (Russian, Polish), Turkic (Turkish), Japonic (Japanese), Sinitic (Chinese).
- **Data:** ~200K characters of Wikipedia text per language, segmented into sentences and filtered to 8-80 tokens, leaving ~480-580 sentences per language.
- **Measurement:** per-token surprisal at 20 normalized position bins (0-100% of the sentence).
- **Clustering:** hierarchical clustering (Ward linkage, correlation distance) on z-normalized curve shapes.
- **Comparison:** within-family vs between-family distances, and within-word-order vs between-word-order distances.
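The position-binning step can be sketched roughly as follows. This is a minimal illustration, not the code used in the experiment: it assumes per-token surprisal has already been extracted from the model, and the helper names (`nats_to_bits`, `binned_profile`, `z_normalize`) as well as the exact bin-assignment rule are hypothetical choices made for the example.

```python
import numpy as np

N_BINS = 20  # number of normalized position bins, as in the setup


def nats_to_bits(logprobs_nats):
    """Convert natural-log token probabilities to surprisal in bits."""
    return -np.asarray(logprobs_nats, dtype=float) / np.log(2)


def binned_profile(sentences, n_bins=N_BINS):
    """Average per-token surprisal by relative sentence position.

    `sentences` is a list of 1-D sequences, one surprisal value per token.
    Assumption: token i of an n-token sentence sits at relative position
    i / (n - 1), so the first token lands in bin 0 and the last token in
    bin n_bins - 1.
    """
    sums = np.zeros(n_bins)
    counts = np.zeros(n_bins)
    for sent in sentences:
        s = np.asarray(sent, dtype=float)
        n = len(s)
        if n < 2:
            continue  # relative position is undefined for one-token sentences
        rel = np.arange(n) / (n - 1)
        idx = np.minimum((rel * n_bins).astype(int), n_bins - 1)
        np.add.at(sums, idx, s)      # accumulate surprisal per bin
        np.add.at(counts, idx, 1)    # count tokens per bin
    # Bins that received no tokens become NaN instead of raising.
    return np.divide(sums, counts, out=np.full(n_bins, np.nan), where=counts > 0)


def z_normalize(profile):
    """Remove level and scale so that only the curve shape remains."""
    p = np.asarray(profile, dtype=float)
    return (p - p.mean()) / p.std()
```

With sentences of 8-80 tokens, a single short sentence fills only some of the 20 bins, so a profile is only meaningful once it is averaged over many sentences. Z-normalizing each language's curve removes differences in mean surprisal, which is why a language with a high overall level can still have a curve shape close to that of a lower-surprisal language.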

## Results

| Language | Family | Word order | Mean surprisal (bits) | Start, 0% (bits) | End, 100% (bits) | N sentences |
| --- | --- | --- | --- | --- | --- | --- |
| Russian | Slavic | SVO (free) | 3.12 | 8.95 | 1.70 | 494 |
| Spanish | Romance | SVO | 3.71 | 9.57 | 2.11 | 523 |
| French | Romance | SVO | 3.74 | 9.77 | 2.23 | 541 |
| Portuguese | Romance | SVO | 3.84 | 10.34 | 1.91 | 497 |
| Italian | Romance | SVO | 3.90 | 10.87 | 2.08 | 480 |
| German | Germanic | V2/SOV | 4.12 | 12.07 | 1.92 | 528 |
| Polish | Slavic | SVO (free) | 4.18 | 12.91 | 2.26 | 535 |
| English | Germanic | SVO | 4.35 | 10.49 | 2.39 | 537 |
| Dutch | Germanic | V2/SOV | 4.46 | 13.13 | 1.94 | 515 |
| Japanese | Japonic | SOV | 4.46 | 10.81 | 2.65 | 577 |
| Chinese | Sinitic | SVO | 5.07 | 12.39 | 3.08 | 579 |
| Turkish | Turkic | SOV | 5.10 | 12.40 | 1.99 | 547 |

## Key Findings

- Curves cluster by language family, not by word order. Within-family curve distance is 0.0073 versus 0.0110 between families (between/within ratio 1.51x). For word order, within-order distance is 0.0108 versus 0.0102 between orders (0.94x, i.e. no separation). This is consistent with the model's positional surprisal profile tracking language genealogy rather than canonical word order.
- Romance languages form the tightest cluster. Portuguese, Spanish, French, and Italian have nearly overlapping surprisal curves and a narrow range of mean surprisal (3.71-3.90 bits), which suggests the model treats them as close variants of the same information structure.
- SOV languages do not cluster together. Turkish (SOV, Turkic) and Japanese (SOV, Japonic) have very different surprisal profiles despite sharing the same canonical word order. Morphology and script appear to matter more than word order here, although this setup does not separate those two factors.
- Turkish is the clearest outlier: it has the highest mean surprisal (5.10 bits) and is the most distant language in the hierarchical clustering. A plausible explanation is that agglutinative morphology produces a different information-theoretic structure, which is in line with prior findings.
- Chinese and Japanese end sentences with higher surprisal (3.08 and 2.65 bits) than the ten alphabetic-script languages (1.70-2.39 bits). A likely reason is CJK character density: sentence-final positions still carry substantial information, whereas the alphabetic-script languages converge to highly predictable endings.
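The within-group vs between-group comparison behind the first finding can be sketched as below. This is a hedged illustration on toy curves, not the experiment's code: the function names (`within_between`, `ward_tree`) are hypothetical, and averaging over all language pairs is an assumption about how the reported distances were aggregated. Note also that SciPy's Ward linkage is formally defined for Euclidean distances; passing a precomputed correlation-distance vector runs, but should be read as a heuristic.

```python
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform


def within_between(profiles, labels):
    """Mean correlation distance for same-label and different-label pairs.

    Returns (within, between, between / within). Groups with a single
    member contribute no within-group pairs.
    """
    d = squareform(pdist(np.asarray(profiles, dtype=float), metric="correlation"))
    within, between = [], []
    for i, j in combinations(range(len(labels)), 2):
        (within if labels[i] == labels[j] else between).append(d[i, j])
    w, b = float(np.mean(within)), float(np.mean(between))
    return w, b, b / w


def ward_tree(profiles):
    """Ward linkage on a condensed correlation-distance vector."""
    condensed = pdist(np.asarray(profiles, dtype=float), metric="correlation")
    return linkage(condensed, method="ward")


# Toy curves (not experiment data): two decaying shapes, two rising shapes.
x = np.linspace(0.0, 1.0, 20)
toy = np.vstack([np.exp(-5.0 * x), np.exp(-5.2 * x), x**2, x**2.1])
toy_labels = ["A", "A", "B", "B"]

w, b, ratio = within_between(toy, toy_labels)
clusters = fcluster(ward_tree(toy), t=2, criterion="maxclust")
```

With the family assignment listed in the setup, only 10 of the 66 language pairs are within-family (3 Germanic, 6 Romance, 1 Slavic), because Turkish, Japanese, and Chinese are each the sole member of their family. The within-family mean therefore rests on a small number of pairs. Correlation distance is already invariant to per-curve shifts and rescaling, so z-normalizing the curves beforehand does not change it.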

## Lesson Learned

The initial hypothesis, that word order (SOV vs SVO) would determine where surprisal peaks within the sentence, was falsified. The actual result is more interesting: language genealogy shapes the entire curve profile in a way that word-order typology does not. This suggests the model encodes deeper similarity between related languages (shared morphology, phonotactics, vocabulary overlap) rather than surface syntactic properties. Future work: repeat the test with byte-level models to remove tokenizer effects, and extend it to more families by adding languages such as Arabic, Korean, and Hindi.

## Tools Used

Claude Opus 4 for experiment design and code generation. Qwen2.5-7B (4-bit) for surprisal computation. scipy for hierarchical clustering and correlation distance. Wikipedia via HuggingFace datasets (wikimedia/wikipedia 20231101 snapshot).

---
Source: https://terminus.ink/e/2026-04-08-surprisal-typology-12-languages-cluster-by-family-not-word-order
