# EXP-009: Distribution Geometry Across Languages: Turkish as Morphological Outlier

**Date:** 2026-04-08
**Author:** @eazevedo
**Tags:** #multilingual, #distribution-geometry, #turkish, #attention-heads, #surprisal, #llm-internals, #qwen, #entropy, #morphology

## Question

How do output distribution shape, attention head specialization, and surprisal rhythm vary across languages and text genres in a multilingual LLM?

## Setup

Model: Qwen2.5-7B (4-bit quantized) on an RTX 3060 Ti (8 GB). 10 text samples across 5 languages (English, Portuguese, German, Turkish, Chinese) and multiple genres (news, literary, technical, dialogue, poetry, code). Five analysis parts:

- **(A) Rank-frequency profiles:** rank of the actual next token in the model's predicted distribution.
- **(B) Distribution curvature:** kurtosis, skewness, and tail mass of the next-token probability distributions.
- **(C) Contextual entropy rate:** loss measured at context window sizes 1/2/4/8/16/32/64/full.
- **(D) Attention head specialization:** all 784 heads (28 layers x 28 heads) classified by behavior as local, global, sparse, diagonal, or mixed.
- **(E) Surprisal rhythm:** autocorrelation, spectral analysis, and burstiness of per-token surprisal sequences.
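A minimal sketch of how Part B's shape metrics could be computed from a single next-token softmax distribution. The exact definitions used in the run are not recorded here, so treat the specific choices below (kurtosis and skewness over the probability values themselves, tail mass outside the top-10) as assumptions:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def distribution_shape(probs: np.ndarray, top_k: int = 10) -> dict:
    """Shape metrics for one next-token probability distribution.

    `probs` is a softmax output over the vocabulary (sums to 1).
    Kurtosis/skewness are computed over the probability values,
    which is one plausible reading of Part B, not necessarily the
    experiment's exact definition.
    """
    p = np.sort(probs)[::-1]                   # descending
    entropy = -np.sum(p * np.log2(p + 1e-12))  # in bits
    tail_mass = p[top_k:].sum()                # mass outside top-10
    return {
        "entropy": float(entropy),
        "kurtosis": float(kurtosis(p)),
        "skewness": float(skew(p)),
        "tail_mass": float(tail_mass),
    }

# Toy check: a sharply peaked distribution has low entropy
# and little mass outside the top-10.
vocab = 1000
logits = np.zeros(vocab)
logits[0] = 15.0
probs = np.exp(logits) / np.exp(logits).sum()
m = distribution_shape(probs)
print({k: round(v, 3) for k, v in m.items()})
```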

## Results

| Text | Top-1 Accuracy | Top-5 Accuracy | Mean Entropy | Autocorr lag-1 | Dominant Period |
| --- | --- | --- | --- | --- | --- |
| English code | 93.4% | 94.7% | 0.52 | +0.26 | 76.0 tokens |
| English poetry (Dylan Thomas) | 82.8% | 91.4% | 1.13 | +0.56 | 9.7 tokens |
| German news | 55.4% | 79.7% | 2.07 | -0.10 | 4.9 tokens |
| Portuguese news | 52.8% | 76.4% | 2.24 | -0.18 | 3.7 tokens |
| English dialogue | 51.5% | 77.3% | 2.52 | -0.07 | 16.5 tokens |
| English technical | 50.9% | 75.4% | 2.25 | +0.34 | 19.0 tokens |
| Chinese news | 41.7% | 64.6% | 2.70 | -0.16 | 3.0 tokens |
| English news | 38.0% | 62.0% | 2.45 | -0.04 | 3.3 tokens |
| Turkish news | 37.0% | 56.2% | 3.69 | +0.13 | 4.3 tokens |
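The two rhythm columns (lag-1 autocorrelation and dominant period) can be derived from a per-token surprisal sequence roughly as follows. This is a minimal sketch assuming Pearson autocorrelation and a peak-of-FFT period estimate, which may differ in detail from the experiment's implementation:

```python
import numpy as np

def rhythm_stats(surprisal: np.ndarray) -> tuple[float, float]:
    """Lag-1 autocorrelation and dominant period of a surprisal sequence.

    Autocorrelation is the Pearson correlation between the sequence
    and itself shifted by one token; the dominant period is taken
    from the largest FFT magnitude of the mean-centered sequence,
    excluding the DC component.
    """
    x = surprisal - surprisal.mean()
    ac1 = np.corrcoef(x[:-1], x[1:])[0, 1]
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x))
    k = 1 + np.argmax(spectrum[1:])  # skip the DC bin
    period = 1.0 / freqs[k]          # in tokens
    return float(ac1), float(period)

# Toy check: a strict high/low alternation has strongly negative
# lag-1 autocorrelation and a dominant period of 2 tokens.
alternating = np.array([1.0, 5.0] * 64)
ac1, period = rhythm_stats(alternating)
print(round(ac1, 2), round(period, 1))
```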

## Key Findings

- Turkish is a distribution outlier across every metric: lowest top-1 accuracy (37%), highest entropy (3.69), lowest kurtosis (556 vs 610-728 for other languages), highest tail mass (0.341 outside top-10), and minimal benefit from extended context. Agglutinative morphology creates fundamentally different distribution geometry for subword-tokenized models.
- No global attention heads: 0 of 784 total. Head-type distribution: 49% mixed, 29% sparse, 22% local, <1% diagonal, 0% global. 'Global reasoning' is not performed by any individual head; it emerges from layered composition of local and sparse operations.
- Code and poetry are memorized, not modeled. Loss on code (quicksort) drops from 8.8 at context=1 to 0.01 at full context; poetry (Dylan Thomas) drops to 0.66 at context=32. Both reach near-zero surprisal, indicating verbatim retrieval from training data.
- Information rhythm is genre-specific. Natural language (news, dialogue) shows negative surprisal autocorrelation at lag-1 to lag-2 (alternating high/low), while structured text (code, technical) shows strong positive autocorrelation (predictable clusters). Poetry has the strongest rhythmic signal (AC lag-1 = +0.56).
- Context value plateaus at ~16 tokens for news text and ~32 for literary/dialogue across all languages tested. Beyond that, marginal improvement is minimal — consistent with the finding that most attention heads operate within ~30 token range.
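The context-plateau measurement in the last point can be sketched as below. `toy_logprob` is a stand-in "model" invented for illustration (the experiment used Qwen2.5-7B); it is constructed so the loss curve shows the same shape, dropping once the window covers the predictable structure and plateauing beyond it:

```python
import numpy as np

def loss_at_context(logprob_fn, tokens, window):
    """Mean negative log-prob of each token given only the previous
    `window` tokens -- a sketch of Part C's contextual entropy rate.
    `logprob_fn(context, token)` stands in for the real model."""
    losses = [-logprob_fn(tokens[max(0, i - window):i], tokens[i])
              for i in range(1, len(tokens))]
    return float(np.mean(losses))

# Toy stand-in: a period-4 repeating sequence that the "model" can
# predict perfectly once the window covers one full cycle.
PERIOD = 4
seq = list(range(PERIOD)) * 8

def toy_logprob(ctx, tok):
    if len(ctx) >= PERIOD:
        return 0.0                      # log(1): fully determined
    return float(np.log(1.0 / PERIOD))  # uniform guess

for w in (1, 2, 4, 8):
    print(w, round(loss_at_context(toy_logprob, seq, w), 3))
```

The loss falls sharply at window 4 and is flat thereafter, the same plateau signature the experiment observed at ~16-32 tokens for natural text.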

## Lesson Learned

Agglutinative languages like Turkish expose fundamental limitations of subword tokenization. The model's distribution shape is qualitatively different for Turkish, not just quantitatively worse. Future work: test with byte-level models where morphological boundaries are not pre-committed by the tokenizer. Also, the 0% global heads finding suggests that benchmarks measuring 'long-range reasoning' may be testing emergent composition rather than any individual attention mechanism.
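The head taxonomy behind the 0%-global finding (local/global/sparse/diagonal/mixed) could be operationalized along these lines. The window size and thresholds here are illustrative guesses, not the experiment's actual criteria:

```python
import numpy as np

def classify_head(attn: np.ndarray,
                  local_window: int = 4,
                  share_threshold: float = 0.5) -> str:
    """Classify one head from its (seq_len x seq_len) attention matrix.

    A simplified sketch of Part D's taxonomy: 'diagonal' if most mass
    sits on the immediately adjacent token, 'sparse' if each query
    concentrates on very few keys, 'local' if mass stays within a
    small window, 'global' if it sits far away, else 'mixed'.
    """
    n = attn.shape[0]
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])

    diag_share = attn[dist == 1].sum() / n
    local_share = attn[dist <= local_window].sum() / n
    far_share = attn[dist > local_window].sum() / n
    # Sparsity: average number of keys holding 90% of each row's mass.
    sorted_rows = np.sort(attn, axis=1)[:, ::-1]
    cum = np.cumsum(sorted_rows, axis=1)
    keys_for_90 = (cum < 0.9).sum(axis=1).mean() + 1

    if diag_share > share_threshold:
        return "diagonal"
    if keys_for_90 <= 2:
        return "sparse"
    if local_share > share_threshold:
        return "local"
    if far_share > share_threshold:
        return "global"
    return "mixed"

# Toy check: a previous-token head is classified as 'diagonal'.
n = 16
prev_token = np.eye(n, k=-1)
prev_token[0, 0] = 1.0  # first token attends to itself
print(classify_head(prev_token))
```

Under these thresholds a uniform attention matrix comes out "global", since most of its mass lands outside the local window; the finding is that no real head in the model looked like that.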

## Tools Used

Claude Opus 4 for experiment design and code generation. Qwen2.5-7B (4-bit) as the model under study. scipy for statistical analysis (kurtosis, skewness, entropy). numpy for spectral analysis (FFT).

---
Source: https://terminus.ink/e/2026-04-08-distribution-geometry-across-languages-turkish-as-morphological-outlier
