#information-theory

5 experiments

EXP-013 2026-04-14

Power-Law Kernel Initialization: Theory-Guided Init Gives 0.039 BPB Free at 10M Scale

Does initializing a diagonal SSM's output projection to match the theoretical power-law MI decay structure (beta~1.15) improve convergence speed or final BPB over uniform initialization?

Condition 3 (concentrated rates + PL C_proj) achieves val BPB 0.9739 vs baseline 1.0127, a 0.039 BPB improvement with ze…
Convergence is consistently faster: at step 2000, Condition 3 val BPB is 1.1029 vs baseline 1.1741 (0.071 gap). At step …

#ssm #initialization #power-law #information-theory
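A rough sketch of what a power-law readout for a diagonal SSM could look like. This is not the initialization used in EXP-013, whose details are not given in this listing. It assumes a kernel of the form K(d) = Σ c_i · a_i^d with a_i = exp(-rate_i), and uses the identity d^(-β) = (1/Γ(β)) ∫ λ^(β-1) e^(-λd) dλ discretized on a log-spaced grid of decay rates. The function names, `n_states`, `rate_min`, and `rate_max` are all hypothetical.

```python
import math

def power_law_readout(n_states, beta, rate_min=1e-4, rate_max=10.0):
    """Return (rates, weights) so that sum_i w_i * exp(-rate_i * d) ~ d**(-beta).

    Assumption: decay rates are log-spaced between rate_min and rate_max.
    With lambda = exp(u), d(lambda) = lambda du, so each weight is
    lambda**beta * du / Gamma(beta).
    """
    log_min, log_max = math.log(rate_min), math.log(rate_max)
    du = (log_max - log_min) / (n_states - 1)
    rates = [math.exp(log_min + i * du) for i in range(n_states)]
    weights = [r ** beta * du / math.gamma(beta) for r in rates]
    return rates, weights

def kernel(rates, weights, d):
    """Impulse response at lag d of a diagonal SSM with a_i = exp(-rate_i)."""
    return sum(w * math.exp(-r) ** d for r, w in zip(rates, weights))

# beta = 1.15 is the exponent quoted in the research question above.
rates, weights = power_law_readout(n_states=64, beta=1.15)

# Log-log slope of the kernel between lags 4 and 32; should be close to -beta.
slope = (math.log(kernel(rates, weights, 32)) - math.log(kernel(rates, weights, 4))) / (
    math.log(32) - math.log(4)
)
```

The approximation only holds for lags well inside the range 1/rate_max to 1/rate_min; outside it the kernel falls back to exponential decay.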

EXP-012 2026-04-13

Information Topology of Natural Language

How does mutual information between tokens decay with distance across typologically diverse languages, and does language structure (morphology, word order) shape the information topology?

Power-law MI decay universal (5/5 langs, R²>0.96). Exponential catastrophically fails in log-linear R² (<-10). SSM expon…
Beta exponent descriptively splits by morphology: analytic (en/pt) 1.1-1.2 vs agglutinative (tr/fi/ar) 0.87-0.98. CIs ov…

#information-theory #mutual-information #power-law #typology

EXP-011 2026-04-08

Cross-Model Replication: Surprisal Typology Clusters by Family in Both Qwen and Gemma 4

Does the finding that surprisal curves cluster by language family replicate across different model architectures (Qwen2.5-7B dense vs Gemma 4 E2B MoE)?

Family clustering replicates across architectures, but with important caveats. Gemma 4 shows 2.52x family ratio vs Qwen'…
The within-family distance appears identical (0.0073) across both models — this is a rounding coincidence. Actual values…

#replication #surprisal #typology #multilingual

EXP-010 2026-04-08

Surprisal Typology: 12 Languages Cluster by Family, Not Word Order

Does the surprisal-by-sentence-position profile in an LLM cluster by language family (genealogy) or by syntactic word order (SOV vs SVO)?

Language family clusters, word order doesn't. Within-family curve distance is 0.0073 vs between-family 0.0110 (1.51x rat…
Romance languages form the tightest cluster. Portuguese, Spanish, French, and Italian have nearly overlapping surprisal …

#surprisal #typology #multilingual #language-families
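The family ratio in EXP-010 is the between-family distance divided by the within-family distance (0.0110 / 0.0073 ≈ 1.51). A minimal sketch of that computation follows. The distance metric (mean absolute difference between curves), the function names, and the toy curves are assumptions for illustration; the listing does not state which metric the experiment used.

```python
from itertools import combinations

def curve_distance(a, b):
    """Mean absolute difference between two equal-length surprisal curves (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def family_ratio(curves, family):
    """Return (mean within-family distance, mean between-family distance, ratio)."""
    within, between = [], []
    for lang_a, lang_b in combinations(sorted(curves), 2):
        d = curve_distance(curves[lang_a], curves[lang_b])
        if family[lang_a] == family[lang_b]:
            within.append(d)
        else:
            between.append(d)
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return mean_within, mean_between, mean_between / mean_within

# Toy curves, invented for illustration only (not experimental data).
toy_curves = {
    "pt": [1.00, 0.80, 0.70],
    "es": [1.01, 0.81, 0.71],
    "de": [1.10, 0.90, 0.80],
    "nl": [1.11, 0.91, 0.81],
}
toy_family = {"pt": "Romance", "es": "Romance", "de": "Germanic", "nl": "Germanic"}

mean_within, mean_between, ratio = family_ratio(toy_curves, toy_family)

# Ratio implied by the two distances reported in the entry above.
reported_ratio = 0.0110 / 0.0073
```

A ratio above 1 means curves from the same family sit closer together than curves from different families. Swapping `toy_family` for a word-order labeling (SOV vs SVO) gives the comparison the research question asks about.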

EXP-002 2026-04-07

Byte-Level Mutual Information Decays as a Power Law Across 5 Languages

How does mutual information between bytes decay with distance in natural language, and is this structure universal across languages with different scripts and morphology?

Mutual information between bytes decays as a power law I(d) ~ d^(-alpha) in all 5 languages tested (0 out of 5 exponenti…
82-96% of prediction gain comes from the first 8 bytes of context. Conditional entropy drops from ~5 bits (unigram) to ~…

#information-theory #byte-level #mutual-information #power-law
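One common way to test I(d) ~ d^(-alpha) against an exponential alternative is to fit both as straight lines in log space and compare R². The sketch below does that on synthetic data. The fitting procedure, the function names, and the synthetic exponent of 1.0 are assumptions and are not taken from EXP-002.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot

def compare_decay_models(distances, mi):
    """Fit power-law and exponential decay to MI values, both in log(MI) space."""
    log_mi = [math.log(v) for v in mi]
    # Power law: log I = log c - alpha * log d  (straight line in log-log space)
    _, slope_pl, r2_pl = linear_fit([math.log(d) for d in distances], log_mi)
    # Exponential: log I = log c - lam * d      (straight line in log-linear space)
    _, slope_exp, r2_exp = linear_fit([float(d) for d in distances], log_mi)
    return {
        "alpha": -slope_pl,
        "r2_power_law": r2_pl,
        "lambda": -slope_exp,
        "r2_exponential": r2_exp,
    }

# Synthetic MI curve with an arbitrary exponent of 1.0 (not experimental data).
distances = [1, 2, 4, 8, 16, 32, 64, 128]
synthetic_mi = [0.5 * d ** -1.0 for d in distances]
result = compare_decay_models(distances, synthetic_mi)
```

Because each model gets its own least-squares line with an intercept, both R² values here lie between 0 and 1; the comparison is about which one is closer to 1. Real MI estimates are noisy and biased upward at large distances, so a bias correction would normally come before any fit.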