terminus.ink
EXP-009

Distribution Geometry Across Languages: Turkish as Morphological Outlier

@eazevedo

Question

How do output distribution shape, attention head specialization, and surprisal rhythm vary across languages and text genres in a multilingual LLM?

Setup

Model: Qwen2.5-7B (4-bit quantized) on an RTX 3060 Ti (8 GB). 10 text samples across 5 languages (English, Portuguese, German, Turkish, Chinese) and multiple genres (news, literary, technical, dialogue, poetry, code). Five analysis parts:

  • (A) Rank-frequency profiles — rank of the actual next token in the model's predicted distribution.
  • (B) Distribution curvature — kurtosis, skewness, and tail mass of the probability distributions.
  • (C) Contextual entropy rate — loss measured at context window sizes 1/2/4/8/16/32/64/full.
  • (D) Attention head specialization — classify all 784 heads (28 layers × 28 heads) by behavior: local, global, sparse, diagonal, or mixed.
  • (E) Surprisal rhythm — autocorrelation, spectral analysis, and burstiness of per-token surprisal sequences.
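The shape metrics in Part B can be computed directly from each position's next-token softmax output. A minimal sketch — the function name, the `top_k=10` default, and the synthetic peaked distribution are illustrative, not the experiment's exact code:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def distribution_shape(probs, top_k=10):
    """Shape metrics for one next-token probability distribution.

    probs: 1-D array over the vocabulary, summing to 1.
    Returns entropy (nats), kurtosis/skewness of the sorted
    rank-frequency profile, and tail mass outside the top-k tokens.
    """
    probs = np.asarray(probs, dtype=np.float64)
    p = np.sort(probs)[::-1]                  # rank-frequency profile
    nz = p[p > 0]
    return {
        "entropy": -np.sum(nz * np.log(nz)),
        "kurtosis": kurtosis(p),              # peakedness of the profile
        "skewness": skew(p),
        "tail_mass": p[top_k:].sum(),         # mass outside top-10
    }

# Sharply peaked distribution: low entropy, little tail mass,
# very high kurtosis (one dominant token among a flat tail)
vocab = 1000
peaked = np.full(vocab, 0.1 / (vocab - 1))
peaked[0] = 0.9
m = distribution_shape(peaked)
```

Averaging these per-position values over a text gives the per-sample numbers reported below (e.g. Turkish's high tail mass of 0.341 outside the top-10).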

Results

| Text | Top-1 Accuracy | Top-5 Accuracy | Mean Entropy | Autocorr lag-1 | Dominant Period |
|---|---|---|---|---|---|
| English code | 93.4% | 94.7% | 0.52 | +0.26 | 76.0 tokens |
| English poetry (Dylan Thomas) | 82.8% | 91.4% | 1.13 | +0.56 | 9.7 tokens |
| German news | 55.4% | 79.7% | 2.07 | -0.10 | 4.9 tokens |
| Portuguese news | 52.8% | 76.4% | 2.24 | -0.18 | 3.7 tokens |
| English dialogue | 51.5% | 77.3% | 2.52 | -0.07 | 16.5 tokens |
| English technical | 50.9% | 75.4% | 2.25 | +0.34 | 19.0 tokens |
| Chinese news | 41.7% | 64.6% | 2.70 | -0.16 | 3.0 tokens |
| English news | 38.0% | 62.0% | 2.45 | -0.04 | 3.3 tokens |
| Turkish news | 37.0% | 56.2% | 3.69 | +0.13 | 4.3 tokens |
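The two rhythm columns (lag-1 autocorrelation, dominant period) come from the per-token surprisal series of Part E. A sketch, assuming the dominant period is read off as the inverse frequency of the largest non-DC spectral peak — the helper name and the synthetic alternating series are illustrative:

```python
import numpy as np

def rhythm_stats(surprisal):
    """Lag-1 autocorrelation and dominant period of a surprisal series.

    surprisal: 1-D array of per-token surprisal values.
    Dominant period = 1 / frequency of the largest non-DC
    bin of the power spectrum, in tokens.
    """
    s = np.asarray(surprisal, dtype=np.float64)
    s = s - s.mean()
    ac1 = np.corrcoef(s[:-1], s[1:])[0, 1]    # lag-1 autocorrelation
    spectrum = np.abs(np.fft.rfft(s)) ** 2
    freqs = np.fft.rfftfreq(len(s))
    peak = np.argmax(spectrum[1:]) + 1        # skip the DC bin
    return ac1, 1.0 / freqs[peak]

# Alternating high/low surprisal -> strongly negative lag-1
# autocorrelation and a dominant period of 2 tokens
alternating = np.tile([3.0, 0.5], 64)
ac1, period = rhythm_stats(alternating)
```

The news rows above follow this alternating pattern in milder form (negative lag-1, short periods of 3-5 tokens), while code and technical text cluster in long predictable runs (positive lag-1, long periods).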

Key findings

  • Turkish is a distribution outlier across every metric: lowest top-1 accuracy (37%), highest entropy (3.69), lowest kurtosis (556 vs 610-728 for other languages), highest tail mass (0.341 outside top-10), and minimal benefit from extended context. Agglutinative morphology creates fundamentally different distribution geometry for subword-tokenized models.
  • Zero global attention heads exist out of 784 total. Head type distribution: 49% mixed, 29% sparse, 22% local, <1% diagonal, 0% global. 'Global reasoning' is not performed by any individual head — it emerges from layered composition of local and sparse operations.
  • Code and poetry are memorized, not modeled. Code (quicksort) drops from 8.8 loss at context=1 to 0.01 at full context. Poetry (Dylan Thomas) drops to 0.66 at context=32. Both hit near-zero surprisal, indicating verbatim retrieval from training data.
  • Information rhythm is genre-specific. Natural language (news, dialogue) shows negative surprisal autocorrelation at lag-1 to lag-2 (alternating high/low), while structured text (code, technical) shows strong positive autocorrelation (predictable clusters). Poetry has the strongest rhythmic signal (AC lag-1 = +0.56).
  • Context value plateaus at ~16 tokens for news text and ~32 for literary/dialogue across all languages tested. Beyond that, marginal improvement is minimal — consistent with the finding that most attention heads operate within a ~30-token range.
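The head taxonomy behind the 0%-global finding can be sketched as a heuristic classifier over each head's averaged attention matrix. The thresholds, decision order, and the ~30-token local window below are assumptions for illustration, not the experiment's exact criteria:

```python
import numpy as np

def classify_head(attn, local_window=30, conc_thresh=0.9):
    """Heuristic classification of one head's averaged attention matrix.

    attn: (seq, seq) lower-triangular attention weights, rows sum to 1.
    Categories mirror Part D: diagonal, sparse, local, global, mixed.
    """
    seq = attn.shape[0]
    rows, cols = np.tril_indices(seq)
    dist = rows - cols                          # query-key distance
    w = attn[rows, cols]
    mean_dist = np.sum(w * dist) / np.sum(w)    # attention-weighted span
    diag_mass = np.mean(np.diag(attn))          # mass on the diagonal
    top1_mass = attn.max(axis=1).mean()         # mass on one key per row
    if diag_mass > conc_thresh:
        return "diagonal"
    if top1_mass > conc_thresh:
        return "sparse"
    if mean_dist < local_window:
        return "local"
    if mean_dist > seq / 2:
        return "global"
    return "mixed"

# Example: a head that attends uniformly over the previous 5 tokens
seq = 64
local = np.zeros((seq, seq))
for i in range(seq):
    lo = max(0, i - 4)
    local[i, lo:i + 1] = 1.0 / (i - lo + 1)
```

Under this scheme a "global" label requires the attention-weighted span to exceed half the sequence — the condition no head in the 784 satisfied.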

Lesson learned

Agglutinative languages like Turkish expose fundamental limitations of subword tokenization. The model's distribution shape is qualitatively different for Turkish, not just quantitatively worse. Future work: test with byte-level models where morphological boundaries are not pre-committed by the tokenizer. Also, the 0% global heads finding suggests that benchmarks measuring 'long-range reasoning' may be testing emergent composition rather than any individual attention mechanism.

Tools used

Claude Opus 4 for experiment design and code generation. Qwen2.5-7B (4-bit) as the model under study. scipy for statistical analysis (kurtosis, skewness, entropy). numpy for spectral analysis (FFT).