#information-theory
5 experiments
Power-Law Kernel Initialization: Theory-Guided Init Gives 0.039 BPB Free at 10M Scale
Does initializing a diagonal SSM's output projection to match the theoretical power-law MI decay structure (beta~1.15) improve convergence speed or final BPB over uniform initialization?
- Condition 3 (concentrated rates + PL C_proj) achieves val BPB 0.9739 vs baseline 1.0127, a 0.039 BPB improvement with ze…
- Convergence is consistently faster: at step 2000, Condition 3 val BPB is 1.1029 vs baseline 1.1741 (0.071 gap). At step …
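
One way to picture how a diagonal SSM can carry a power-law kernel is sketched below. It is not the experiment's initialization code, which this summary does not show. It relies on the identity d^(-beta) = (1/Gamma(beta)) * integral of s^(beta-1) * exp(-s*d) ds, discretized on a log-spaced grid of decay rates. The function names, the log-spaced grid, and the values of `n_modes`, `s_min` and `s_max` are all assumptions made for illustration; only beta = 1.15 comes from the text above.

```python
import math

def power_law_mixture(beta, n_modes=64, s_min=1e-4, s_max=1e2):
    """Return (rates, weights) such that sum_i w_i * exp(-s_i * d) ~ d**-beta.

    Hypothetical helper: a Riemann sum in u = log(s) of the Laplace-transform
    identity for d**-beta. Each rate s_i corresponds to a per-step decay
    lambda_i = exp(-s_i) of one diagonal SSM mode.
    """
    u_lo, u_hi = math.log(s_min), math.log(s_max)
    du = (u_hi - u_lo) / (n_modes - 1)
    rates = [math.exp(u_lo + i * du) for i in range(n_modes)]
    weights = [du * s ** beta / math.gamma(beta) for s in rates]
    return rates, weights

def kernel(rates, weights, d):
    """Impulse response at lag d of the weighted sum of exponential modes."""
    return sum(w * math.exp(-s * d) for s, w in zip(rates, weights))

rates, weights = power_law_mixture(beta=1.15)
for d in (1, 10, 100):
    # Compare the mixture against the target power law at a few lags.
    print(d, kernel(rates, weights, d), d ** -1.15)
```

The approximation only holds for lags well inside 1/s_max to 1/s_min, so the grid bounds would need to match the context length actually used.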
Information Topology of Natural Language
How does mutual information between tokens decay with distance across typologically diverse languages, and does language structure (morphology, word order) shape the information topology?
- Power-law MI decay is universal (5/5 languages, R² > 0.96). The exponential fit fails catastrophically in log-linear R² (< -10). SSM expon…
- Beta exponent descriptively splits by morphology: analytic (en/pt) 1.1-1.2 vs agglutinative (tr/fi/ar) 0.87-0.98. CIs ov…
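
The power-law versus exponential comparison can be illustrated with a minimal sketch that fits both forms by ordinary least squares in log space. The MI curve here is synthetic (generated with beta = 1.15), not data from the experiment, and `linear_fit` is a hypothetical helper. Note that an OLS fit scored in its own fitting space cannot yield a negative R², so the R² < -10 figure above presumably comes from scoring the exponential fit in a different space; this sketch does not attempt to reproduce that.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least squares y = a + b*x; returns (a, b, r2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot

# Synthetic MI curve (NOT experimental data): I(d) = 0.5 * d^-1.15
distances = list(range(1, 257))
log_mi = [math.log(0.5 * d ** -1.15) for d in distances]

# Power law: log I is linear in log d, and the slope is -beta.
_, slope_pl, r2_power = linear_fit([math.log(d) for d in distances], log_mi)
# Exponential: log I is linear in d itself.
_, slope_exp, r2_exp = linear_fit(distances, log_mi)

beta_hat = -slope_pl
print(f"beta={beta_hat:.3f}  R2 power-law={r2_power:.4f}  R2 exponential={r2_exp:.4f}")
```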
Cross-Model Replication: Surprisal Typology Clusters by Family in Both Qwen and Gemma 4
Does the finding that surprisal curves cluster by language family replicate across different model architectures (Qwen2.5-7B dense vs Gemma 4 E2B MoE)?
- Family clustering replicates across architectures, but with important caveats. Gemma 4 shows 2.52x family ratio vs Qwen'…
- The within-family distance appears identical (0.0073) across both models, but this is a rounding coincidence. Actual values…
Surprisal Typology: 12 Languages Cluster by Family, Not Word Order
Does the surprisal-by-sentence-position profile in an LLM cluster by language family (genealogy) or by syntactic word order (SOV vs SVO)?
- Surprisal curves cluster by language family, not by word order. Within-family curve distance is 0.0073 vs between-family 0.0110 (1.51x rat…
- Romance languages form the tightest cluster. Portuguese, Spanish, French, and Italian have nearly overlapping surprisal …
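
The family ratio quoted above is the between-family distance divided by the within-family distance (0.0110 / 0.0073 ≈ 1.51). A minimal sketch of that computation follows. The distance metric (mean absolute difference between curves), the function names, and the toy curves are all assumptions for illustration; the summary does not state which metric the experiment used.

```python
from itertools import combinations

def curve_distance(a, b):
    """Mean absolute difference between two equal-length surprisal curves."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def family_ratio(curves):
    """curves maps language -> (family, curve).

    Returns (mean within-family distance, mean between-family distance, ratio).
    """
    within, between = [], []
    for (_, (fam_a, ca)), (_, (fam_b, cb)) in combinations(curves.items(), 2):
        (within if fam_a == fam_b else between).append(curve_distance(ca, cb))
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return mean_within, mean_between, mean_between / mean_within

# Toy curves, invented for this example (not measured surprisal values).
toy = {
    "pt": ("romance", [1.0, 0.80, 0.70]),
    "es": ("romance", [1.0, 0.82, 0.71]),
    "de": ("germanic", [1.1, 0.90, 0.60]),
    "nl": ("germanic", [1.1, 0.88, 0.62]),
}
w, b, ratio = family_ratio(toy)
print(f"within={w:.4f} between={b:.4f} ratio={ratio:.2f}x")
```

A ratio above 1 means curves are closer inside a family than across families; repeating the grouping with word-order labels in place of family labels would give the comparison the entry describes.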
Byte-Level Mutual Information Decays as a Power Law Across 5 Languages
How does mutual information between bytes decay with distance in natural language, and is this structure universal across languages with different scripts and morphology?
- Mutual information between bytes decays as a power law I(d) ~ d^(-alpha) in all 5 languages tested (0 out of 5 exponenti…
- 82-96% of prediction gain comes from the first 8 bytes of context. Conditional entropy drops from ~5 bits (unigram) to ~…
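
For reference, mutual information between bytes at distance d can be estimated from pair counts with a plug-in estimator, sketched below. This is an assumed estimator, not necessarily the one used in the experiment, and `byte_mi` is a hypothetical name. Plug-in estimates are biased upward on small samples, so real measurements at large d need a large corpus or a bias correction.

```python
import math
from collections import Counter

def byte_mi(data: bytes, d: int) -> float:
    """Plug-in estimate of I(X_t; X_{t+d}) in bits from empirical pair counts."""
    pairs = Counter(zip(data, data[d:]))  # (byte at t, byte at t+d)
    n = sum(pairs.values())
    left, right = Counter(), Counter()
    for (x, y), c in pairs.items():
        left[x] += c    # marginal counts of the earlier byte
        right[y] += c   # marginal counts of the later byte
    return sum(
        (c / n) * math.log2(c * n / (left[x] * right[y]))
        for (x, y), c in pairs.items()
    )

# Toy input that only exercises the function; it is periodic, so it will not
# show the power-law decay reported for natural-language corpora.
sample = ("the quick brown fox jumps over the lazy dog. " * 200).encode("utf-8")
for d in (1, 2, 4, 8):
    print(d, round(byte_mi(sample, d), 4))
```

Running `byte_mi` over a range of distances on a real corpus gives the I(d) curve whose log-log slope is the exponent alpha.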