terminus.ink
EXP-002

Byte-Level Mutual Information Decays as a Power Law Across 5 Languages

@eazevedo

Question

How does mutual information between bytes decay with distance in natural language, and is this structure universal across languages with different scripts and morphology?

Setup

10 million bytes sampled from Wikipedia in each of 5 languages: English, Portuguese, German, Turkish, and Chinese. These span analytic (English), fusional (Portuguese, German), agglutinative (Turkish), and logographic (Chinese) morphological types, plus Latin, Latin-extended, and CJK scripts.

13-phase analysis pipeline, including:

  • n-gram conditional entropy (orders 1-8)
  • mutual information decay curves with exponential and power-law fits
  • byte transition matrix eigenvalues and mixing times
  • power spectral density
  • Hurst exponent via detrended fluctuation analysis (DFA)
  • compression baselines (gzip, bzip2, lzma)
  • windowed entropy stationarity
  • pronoun-context MI at long range
  • analysis of learned decay rates from a trained diagonal SSM checkpoint (5.5M params, BPB 0.88 on Portuguese Wikipedia)

MI decay fitting: for each language, I(X_t; X_{t+d}) was computed for distances d = 1 to 500 bytes, then fit to both exponential (I ~ exp(-d/tau)) and power-law (I ~ d^(-alpha)) models via least-squares regression on log-transformed data, and the R-squared values of the two fits were compared.
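A minimal sketch of the MI-decay fitting step, assuming a plug-in (histogram) estimator of I(X_t; X_{t+d}) over the 256 byte values and ordinary least squares on the log-transformed curve. Function names are illustrative, not the pipeline's actual code:

```python
import numpy as np

def mi_at_distance(data: np.ndarray, d: int) -> float:
    """Plug-in estimate of I(X_t; X_{t+d}) in bits from byte-pair counts."""
    x, y = data[:-d], data[d:]
    joint = np.zeros((256, 256))
    np.add.at(joint, (x, y), 1.0)              # joint histogram of byte pairs at lag d
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    return float(np.sum(joint[mask] *
                        np.log2(joint[mask] / (px[:, None] * py[None, :])[mask])))

def fit_decay(distances: np.ndarray, mi: np.ndarray):
    """Fit I ~ d^(-alpha) and I ~ exp(-d/tau) by least squares on log I.

    Returns (alpha, r2_power), (tau, r2_exp)."""
    log_i = np.log(mi)

    def _lstsq_r2(x):
        A = np.vstack([x, np.ones_like(x)]).T
        coef, *_ = np.linalg.lstsq(A, log_i, rcond=None)
        resid = log_i - A @ coef
        tss = (log_i - log_i.mean()) @ (log_i - log_i.mean())
        return coef[0], 1.0 - (resid @ resid) / tss

    slope_pow, r2_pow = _lstsq_r2(np.log(distances))        # log I = -alpha log d + c
    slope_exp, r2_exp = _lstsq_r2(distances.astype(float))  # log I = -d/tau + c
    return (-slope_pow, r2_pow), (-1.0 / slope_exp, r2_exp)
```

On a synthetic curve I(d) = d^(-1.2), the power-law branch recovers alpha = 1.2 with R-squared near 1 while the exponential branch scores lower; on real byte data, that R-squared gap is what separates the two models.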

Results

Language   | Unigram H0 (bits) | Cond. H8 (bits) | MI decay fit | Power-law alpha | Hurst | Entropy CV (%) | lzma BPB
-----------|-------------------|-----------------|--------------|-----------------|-------|----------------|---------
English    | 4.78              | 0.19            | power law    | 1.14            | 0.675 | 2.00           | 2.40
Portuguese | 4.93              | 0.64            | power law    | 1.19            | 0.612 | 4.00           | 2.32
German     | 4.94              | 0.24            | power law    | 1.14            | 0.649 | 1.68           | 2.46
Turkish    | 5.04              | 0.19            | power law    | 1.10            | 0.649 | 1.89           | 2.34
Chinese    | 6.07              | -0.58           | power law    | 1.24            | 0.738 | 2.03           | 2.74
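The H0 and H8 columns are n-gram conditional entropy estimates. A plug-in estimator can be sketched as follows (illustrative, not the pipeline's actual code; order 0 recovers the unigram entropy):

```python
import math
from collections import Counter

def conditional_entropy(data: bytes, order: int) -> float:
    """Plug-in estimate of H(X_t | previous `order` bytes), in bits per byte."""
    ctx_counts, pair_counts = Counter(), Counter()
    for i in range(order, len(data)):
        ctx = data[i - order:i]                 # empty context when order == 0
        ctx_counts[ctx] += 1
        pair_counts[(ctx, data[i])] += 1
    n = len(data) - order
    h = 0.0
    for (ctx, _), c in pair_counts.items():
        h -= (c / n) * math.log2(c / ctx_counts[ctx])   # -p(c,x) log2 p(x|c)
    return h
```

One standard caveat when reading the order-8 column: plug-in estimates are biased low at high orders on finite samples, because context counts thin out.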

Key findings

  • Mutual information between bytes decays as a power law I(d) ~ d^(-alpha) in all 5 languages tested (power law beat exponential in 5 of 5 fits). The exponents cluster tightly between 1.10 and 1.24, and the power-law fit's R-squared exceeds the exponential fit's by a large margin (~0.6). This means byte-level dependencies have a heavy tail: information persists at low amplitude across thousands of bytes.
  • 82-96% of prediction gain comes from the first 8 bytes of context. Conditional entropy drops from ~5 bits (unigram) to ~0.2-0.6 bits (order 8). The remaining signal beyond 8 bytes is real but low-bandwidth. This implies most of the work in byte-level language modeling is local, and long-range context is a small but persistent tail.
  • Hurst exponents range from 0.61 to 0.74 across all languages (all above 0.5), confirming persistent long-range correlations. Chinese is highest (0.738), likely inflated by 3-byte UTF-8 encoding creating deterministic byte-level cycles. This places natural language byte sequences in the 'pink noise' regime — between white noise (H=0.5) and Brownian motion (H=1.0).
  • Pronoun-context mutual information at distances over 100 bytes is less than 0.2 bits in all languages tested. A single recurrent dimension with a decay rate of a ≈ 0.999 (half-life ~693 bytes) is theoretically sufficient to carry this signal. Content-based addressing (attention) is not required for pronoun resolution at byte level.
  • A trained diagonal SSM (5.5M params, BPB 0.88) learns decay rates spanning half-lives from 0.5 bytes (fast local dimensions) to 700+ bytes (slow long-range dimensions), with 22% of dimensions having half-life over 100 bytes. This organically approximates the power-law structure without explicit multi-scale design — the model discovers the right timescale distribution from data.
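The decay-rate/half-life correspondence used in the last two findings follows from a^h = 0.5 for a geometric per-step decay a. A small helper (illustrative) checks the quoted numbers:

```python
import math

def half_life(a: float) -> float:
    """Steps h after which a per-step decay a halves the state: a**h = 0.5."""
    return math.log(0.5) / math.log(a)

def decay_for_half_life(h: float) -> float:
    """Inverse: per-step decay rate whose half-life is h steps."""
    return 0.5 ** (1.0 / h)
```

half_life(0.999) comes out at roughly 693 bytes, matching the pronoun-resolution estimate above.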

Lesson learned

Byte-level text has fractal-like multi-scale structure that is universal across scripts and morphological types. SSMs with fixed exponential decay are structurally mismatched to this power-law reality, but can compensate by learning a diverse spectrum of decay rates across dimensions. Initializing decay rates with logarithmic spacing (half-lives from 1 to 2000+ bytes) helps the model cover all relevant timescales from the start.
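A log-spaced initialization of this kind can be sketched as follows (the dimension count and half-life bounds are illustrative assumptions, matching the 1 to 2000+ byte range suggested above):

```python
import numpy as np

def init_decay_rates(n_dims: int, min_half_life: float = 1.0,
                     max_half_life: float = 2000.0) -> np.ndarray:
    """Per-dimension decay rates a_i with log-spaced half-lives.

    Each a_i satisfies a_i**h_i = 0.5 for its half-life h_i, so slow
    dimensions (long half-lives) get a_i close to 1."""
    half_lives = np.logspace(np.log10(min_half_life),
                             np.log10(max_half_life), n_dims)
    return 0.5 ** (1.0 / half_lives)
```

In a diagonal SSM these rates would parameterize the per-dimension recurrence x_t = a * x_{t-1} + b * u_t; in practice they are usually stored in an unconstrained form so training can adjust the spectrum away from the initialization.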

Tools used

Analysis code generated by Claude Opus (claude-opus-4-6) via Claude Code. SSM checkpoint trained on Portuguese Wikipedia using a diagonal SSM with Triton parallel scan kernel.