terminus.ink
EXP-009

Distribution Geometry Across Languages: Turkish as Morphological Outlier

@eazevedo

Question

How do output distribution shape, attention head specialization, and surprisal rhythm vary across languages and text genres in a multilingual LLM?

Setup

Model: Qwen2.5-7B (4-bit quantized) on an RTX 3060 Ti (8 GB). 10 text samples across 5 languages (English, Portuguese, German, Turkish, Chinese) and multiple genres (news, literary, technical, dialogue, poetry, code). Five analysis parts:

  • (A) Rank-frequency profiles — rank of the actual next token in the model's predicted distribution.
  • (B) Distribution curvature — kurtosis, skewness, and tail mass of the probability distributions.
  • (C) Contextual entropy rate — loss measured at context window sizes 1/2/4/8/16/32/64/full.
  • (D) Attention head specialization — classify all 784 heads (28 layers × 28 heads) by behavior: local, global, sparse, diagonal, or mixed.
  • (E) Surprisal rhythm — autocorrelation, spectral analysis, and burstiness of per-token surprisal sequences.
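The shape metrics in Part B can be computed directly from each position's next-token softmax output. A minimal sketch — the function name, the `top_k=10` default, and the synthetic peaked distribution are illustrative, not the experiment's exact code:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def distribution_shape(probs, top_k=10):
    """Shape metrics for one next-token probability distribution.

    probs: 1-D array over the vocabulary, summing to 1.
    Returns entropy (nats), kurtosis/skewness of the sorted
    rank-frequency profile, and tail mass outside the top-k tokens.
    """
    probs = np.asarray(probs, dtype=np.float64)
    p = np.sort(probs)[::-1]                  # rank-frequency profile
    nz = p[p > 0]
    return {
        "entropy": -np.sum(nz * np.log(nz)),
        "kurtosis": kurtosis(p),              # peakedness of the profile
        "skewness": skew(p),
        "tail_mass": p[top_k:].sum(),         # mass outside top-10
    }

# Sharply peaked distribution: low entropy, little tail mass,
# very high kurtosis (one dominant token among a flat tail)
vocab = 1000
peaked = np.full(vocab, 0.1 / (vocab - 1))
peaked[0] = 0.9
m = distribution_shape(peaked)
```

Averaging these per-position values over a text gives the per-sample numbers reported below (e.g. Turkish's high tail mass of 0.341 outside the top-10).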

Results

| Text | Top-1 Accuracy | Top-5 Accuracy | Mean Entropy | Autocorr lag-1 | Dominant Period |
|---|---|---|---|---|---|
| English code | 93.4% | 94.7% | 0.52 | +0.26 | 76.0 tokens |
| English poetry (Dylan Thomas) | 82.8% | 91.4% | 1.13 | +0.56 | 9.7 tokens |
| German news | 55.4% | 79.7% | 2.07 | -0.10 | 4.9 tokens |
| Portuguese news | 52.8% | 76.4% | 2.24 | -0.18 | 3.7 tokens |
| English dialogue | 51.5% | 77.3% | 2.52 | -0.07 | 16.5 tokens |
| English technical | 50.9% | 75.4% | 2.25 | +0.34 | 19.0 tokens |
| Chinese news | 41.7% | 64.6% | 2.70 | -0.16 | 3.0 tokens |
| English news | 38.0% | 62.0% | 2.45 | -0.04 | 3.3 tokens |
| Turkish news | 37.0% | 56.2% | 3.69 | +0.13 | 4.3 tokens |
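The two rhythm columns (lag-1 autocorrelation, dominant period) come from the per-token surprisal series of Part E. A sketch, assuming the dominant period is read off as the inverse frequency of the largest non-DC spectral peak — the helper name and the synthetic alternating series are illustrative:

```python
import numpy as np

def rhythm_stats(surprisal):
    """Lag-1 autocorrelation and dominant period of a surprisal series.

    surprisal: 1-D array of per-token surprisal values.
    Dominant period = 1 / frequency of the largest non-DC
    bin of the power spectrum, in tokens.
    """
    s = np.asarray(surprisal, dtype=np.float64)
    s = s - s.mean()
    ac1 = np.corrcoef(s[:-1], s[1:])[0, 1]    # lag-1 autocorrelation
    spectrum = np.abs(np.fft.rfft(s)) ** 2
    freqs = np.fft.rfftfreq(len(s))
    peak = np.argmax(spectrum[1:]) + 1        # skip the DC bin
    return ac1, 1.0 / freqs[peak]

# Alternating high/low surprisal -> strongly negative lag-1
# autocorrelation and a dominant period of 2 tokens
alternating = np.tile([3.0, 0.5], 64)
ac1, period = rhythm_stats(alternating)
```

The news rows above follow this alternating pattern in milder form (negative lag-1, short periods of 3-5 tokens), while code and technical text cluster in long predictable runs (positive lag-1, long periods).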

Key findings

  • Turkish is a distribution outlier across every metric: lowest top-1 accuracy (37%), highest entropy (3.69), lowest kurtosis (556 vs 610-728 for other languages), highest tail mass (0.341 outside top-10), and minimal benefit from extended context. Agglutinative morphology creates fundamentally different distribution geometry for subword-tokenized models.
  • Zero global attention heads exist out of 784 total. Head type distribution: 49% mixed, 29% sparse, 22% local, <1% diagonal, 0% global. 'Global reasoning' is not performed by any individual head — it emerges from layered composition of local and sparse operations.
  • Code and poetry are memorized, not modeled. Code (quicksort) drops from 8.8 loss at context=1 to 0.01 at full context. Poetry (Dylan Thomas) drops to 0.66 at context=32. Both hit near-zero surprisal, indicating verbatim retrieval from training data.
  • Information rhythm is genre-specific. Natural language (news, dialogue) shows negative surprisal autocorrelation at lag-1 to lag-2 (alternating high/low), while structured text (code, technical) shows strong positive autocorrelation (predictable clusters). Poetry has the strongest rhythmic signal (AC lag-1 = +0.56).
  • Context value plateaus at ~16 tokens for news text and ~32 for literary/dialogue across all languages tested. Beyond that, marginal improvement is minimal — consistent with the finding that most attention heads operate within a ~30-token range.
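The head taxonomy behind the 0%-global finding can be sketched as a heuristic classifier over each head's averaged attention matrix. The thresholds, decision order, and the ~30-token local window below are assumptions for illustration, not the experiment's exact criteria:

```python
import numpy as np

def classify_head(attn, local_window=30, conc_thresh=0.9):
    """Heuristic classification of one head's averaged attention matrix.

    attn: (seq, seq) lower-triangular attention weights, rows sum to 1.
    Categories mirror Part D: diagonal, sparse, local, global, mixed.
    """
    seq = attn.shape[0]
    rows, cols = np.tril_indices(seq)
    dist = rows - cols                          # query-key distance
    w = attn[rows, cols]
    mean_dist = np.sum(w * dist) / np.sum(w)    # attention-weighted span
    diag_mass = np.mean(np.diag(attn))          # mass on the diagonal
    top1_mass = attn.max(axis=1).mean()         # mass on one key per row
    if diag_mass > conc_thresh:
        return "diagonal"
    if top1_mass > conc_thresh:
        return "sparse"
    if mean_dist < local_window:
        return "local"
    if mean_dist > seq / 2:
        return "global"
    return "mixed"

# Example: a head that attends uniformly over the previous 5 tokens
seq = 64
local = np.zeros((seq, seq))
for i in range(seq):
    lo = max(0, i - 4)
    local[i, lo:i + 1] = 1.0 / (i - lo + 1)
```

Under this scheme a "global" label requires the attention-weighted span to exceed half the sequence — the condition no head in the 784 satisfied.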

Lesson learned

Agglutinative languages like Turkish expose fundamental limitations of subword tokenization. The model's distribution shape is qualitatively different for Turkish, not just quantitatively worse. Future work: test with byte-level models where morphological boundaries are not pre-committed by the tokenizer. Also, the 0% global heads finding suggests that benchmarks measuring 'long-range reasoning' may be testing emergent composition rather than any individual attention mechanism.

Tools used

Claude Opus 4 for experiment design and code generation. Qwen2.5-7B (4-bit) as the model under study. scipy for statistical analysis (kurtosis, skewness, entropy). numpy for spectral analysis (FFT).