# EXP-002: Byte-Level Mutual Information Decays as a Power Law Across 5 Languages

**Date:** 2026-04-07
**Author:** @eazevedo
**Tags:** #information-theory, #byte-level, #mutual-information, #power-law, #hurst-exponent, #cross-lingual, #ssm, #long-range-dependence

## Question

How does mutual information between bytes decay with distance in natural language, and is this structure universal across languages with different scripts and morphology?

## Setup

10 million bytes sampled from Wikipedia in each of 5 languages: English, Portuguese, German, Turkish, and Chinese. These span analytic (English), fusional (Portuguese, German), agglutinative (Turkish), and logographic (Chinese) morphological types, plus Latin, Latin-extended, and CJK scripts.

13-phase analysis pipeline, covering:

- n-gram conditional entropy (orders 1-8)
- mutual information decay curves with exponential and power-law fits
- byte transition matrix eigenvalues and mixing times
- power spectral density
- Hurst exponent via detrended fluctuation analysis (DFA)
- compression baselines (gzip, bzip2, lzma)
- windowed entropy stationarity
- pronoun-context MI at long range
- learned decay rates from a trained diagonal SSM checkpoint (5.5M params, BPB 0.88 on Portuguese Wikipedia)
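The conditional entropy phase can be sketched with a plug-in estimator over byte n-grams. This is a minimal reconstruction, not the pipeline's actual code; the function name and the toy input are mine.

```python
from collections import Counter
import math

def conditional_entropy(data: bytes, order: int) -> float:
    """Plug-in estimate of H(X_t | previous `order` bytes), in bits.
    Note: biased low at high orders on limited data."""
    ctx_counts: Counter = Counter()
    joint_counts: Counter = Counter()
    for i in range(order, len(data)):
        ctx = data[i - order:i]          # order-0 gives the empty context
        ctx_counts[ctx] += 1
        joint_counts[(ctx, data[i])] += 1
    n = len(data) - order
    h = 0.0
    for (ctx, _), c in joint_counts.items():
        # p(ctx, x) * -log2 p(x | ctx), summed over observed pairs
        h -= (c / n) * math.log2(c / ctx_counts[ctx])
    return h

sample = b"abracadabra " * 2000
h0 = conditional_entropy(sample, 0)  # unigram entropy
h2 = conditional_entropy(sample, 2)  # order-2 conditional entropy
```

As in the table, adding context orders drives the estimate down from the unigram baseline toward the residual uncertainty.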

MI decay fitting: for each language, I(X_t; X_{t+d}) was computed for distances d = 1 to 500 bytes, then fit to both an exponential model (I ~ exp(-d/tau)) and a power-law model (I ~ d^(-alpha)) via least-squares regression on the log-transformed data; the model with the higher R-squared was taken as the better description of the decay.
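The model comparison amounts to two linear regressions in log space. A minimal sketch, with a synthetic curve standing in for the measured MI values (the helper name and data are mine, not the original code):

```python
import numpy as np

def compare_decay_fits(d: np.ndarray, mi: np.ndarray) -> dict:
    """Fit I(d) ~ exp(-d/tau) and I(d) ~ d^-alpha by linear least
    squares on log I, returning parameters and R-squared for each."""
    log_mi = np.log(mi)

    def r2(pred: np.ndarray) -> float:
        ss_res = np.sum((log_mi - pred) ** 2)
        ss_tot = np.sum((log_mi - log_mi.mean()) ** 2)
        return 1.0 - ss_res / ss_tot

    exp_fit = np.polyfit(d, log_mi, 1)           # log I linear in d, slope -1/tau
    pl_fit = np.polyfit(np.log(d), log_mi, 1)    # log I linear in log d, slope -alpha
    return {
        "tau": -1.0 / exp_fit[0],
        "alpha": -pl_fit[0],
        "r2_exp": r2(np.polyval(exp_fit, d)),
        "r2_power": r2(np.polyval(pl_fit, np.log(d))),
    }

d = np.arange(1, 501, dtype=float)
mi = 0.5 * d ** -1.14   # synthetic curve shaped like the English result
result = compare_decay_fits(d, mi)
```

On a true power-law input the power-law fit recovers alpha exactly and dominates the exponential fit on R-squared, mirroring the ~0.6 gap reported below.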

## Results

| Language | Unigram entropy H0 (bits) | Order-8 conditional entropy H8 (bits) | MI decay fit | Power-law exponent alpha | Hurst exponent | Windowed entropy CV (%) | lzma BPB |
| --- | --- | --- | --- | --- | --- | --- | --- |
| English | 4.78 | 0.19 | POWER LAW | 1.14 | 0.675 | 2.00 | 2.40 |
| Portuguese | 4.93 | 0.64 | POWER LAW | 1.19 | 0.612 | 4.00 | 2.32 |
| German | 4.94 | 0.24 | POWER LAW | 1.14 | 0.649 | 1.68 | 2.46 |
| Turkish | 5.04 | 0.19 | POWER LAW | 1.10 | 0.649 | 1.89 | 2.34 |
| Chinese | 6.07 | -0.58 | POWER LAW | 1.24 | 0.738 | 2.03 | 2.74 |

(Conditional entropy cannot actually be negative; the Chinese H8 is best read as ~0, with the negative value reflecting finite-sample estimation bias at order 8, likely aggravated by 3-byte UTF-8 sequences enlarging the effective context space.)

## Key Findings

- Mutual information between bytes decays as a power law I(d) ~ d^(-alpha) in all 5 languages tested (0 out of 5 exponential). The power-law exponents cluster tightly between 1.10 and 1.24. R-squared advantage over exponential fit is large (~0.6). This means byte-level dependencies have a heavy tail — information persists at low amplitude across thousands of bytes.
- 82-96% of prediction gain comes from the first 8 bytes of context. Conditional entropy drops from ~5 bits (unigram) to ~0.2-0.6 bits (order 8). The remaining signal beyond 8 bytes is real but low-bandwidth. This implies most of the work in byte-level language modeling is local, and long-range context is a small but persistent tail.
- Hurst exponents range from 0.61 to 0.74 across all languages (all above 0.5), confirming persistent long-range correlations. Chinese is highest (0.738), likely inflated by 3-byte UTF-8 encoding creating deterministic byte-level cycles. This places natural language byte sequences in the persistent regime between uncorrelated white noise (H = 0.5) and 1/f 'pink' noise (H = 1.0).
- Pronoun-context mutual information at distances over 100 bytes is less than 0.2 bits in all languages tested. A single recurrent dimension with a decay rate of a ≈ 0.999 (half-life ~693 bytes) is theoretically sufficient to carry this signal. Content-based addressing (attention) is not required for pronoun resolution at byte level.
- A trained diagonal SSM (5.5M params, BPB 0.88) learns decay rates spanning half-lives from 0.5 bytes (fast local dimensions) to 700+ bytes (slow long-range dimensions), with 22% of dimensions having half-life over 100 bytes. This organically approximates the power-law structure without explicit multi-scale design — the model discovers the right timescale distribution from data.
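The half-life arithmetic in the last two bullets follows from solving a^t = 1/2. A quick check (helper names are mine, not from the original analysis):

```python
import math

def half_life(a: float) -> float:
    """Bytes until a recurrent dimension's state halves: solve a^t = 0.5."""
    return math.log(2) / -math.log(a)

def decay_for_half_life(t: float) -> float:
    """Inverse: the decay rate whose half-life is t bytes."""
    return 0.5 ** (1.0 / t)

hl = half_life(0.999)            # the pronoun-carrying dimension
a100 = decay_for_half_life(100)  # threshold for a "slow" (>100-byte) dimension
```

`half_life(0.999)` comes out to roughly 693 bytes, matching the figure quoted for the pronoun-resolution bullet.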

## Lesson Learned

Byte-level text has fractal-like multi-scale structure that is universal across scripts and morphological types. SSMs with fixed exponential decay are structurally mismatched to this power-law reality, but can compensate by learning a diverse spectrum of decay rates across dimensions. Initializing decay rates with logarithmic spacing (half-lives from 1 to 2000+ bytes) helps the model cover all relevant timescales from the start.
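The suggested initialization can be sketched as follows. This is a hypothetical helper illustrating the log-spacing idea, not the checkpoint's actual init code:

```python
import numpy as np

def log_spaced_decay_init(n_dims: int,
                          min_half_life: float = 1.0,
                          max_half_life: float = 2000.0) -> np.ndarray:
    """Per-dimension decay rates whose half-lives (in bytes) are
    logarithmically spaced from min_half_life to max_half_life."""
    half_lives = np.logspace(np.log10(min_half_life),
                             np.log10(max_half_life), n_dims)
    return 0.5 ** (1.0 / half_lives)   # a = 0.5 ** (1 / half_life)

a = log_spaced_decay_init(256)
```

Log spacing puts equal numbers of dimensions in each octave of timescale, so the model covers 1-byte to 2000+-byte dependencies from the first step rather than having to discover the slow dimensions during training.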

## Tools Used

Analysis code generated by Claude Opus (claude-opus-4-6) via Claude Code. SSM checkpoint trained on Portuguese Wikipedia using a diagonal SSM with Triton parallel scan kernel.
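For reference, the per-dimension recurrence that the parallel scan kernel computes is h_t = a * h_{t-1} + b * x_t. A sequential NumPy sketch of that recurrence (the input normalization `b = 1 - a` is my assumption, not taken from the checkpoint):

```python
import numpy as np

def diagonal_ssm_scan(a: np.ndarray, b: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Sequential reference for h_t = a * h_{t-1} + b * x_t per dimension.
    A parallel (associative) scan computes the same recurrence; this
    loop is the ground truth such a kernel must match."""
    h = np.zeros_like(a, dtype=float)
    out = np.empty((len(x), len(a)))
    for t, xt in enumerate(x):
        h = a * h + b * xt
        out[t] = h
    return out

a = np.array([0.5, 0.999])  # one fast dimension, one slow dimension
b = 1.0 - a                  # unit steady-state gain (assumed normalization)
ys = diagonal_ssm_scan(a, b, np.ones(2000))
```

With a constant input, each dimension approaches its steady state at a speed set by its decay rate: the fast dimension saturates within a few bytes, while the a = 0.999 dimension is still integrating after 2000 steps.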

---
Source: https://terminus.ink/e/2026-04-07-byte-level-mutual-information-decays-as-a-power-law-across-5-languages
