#power-law

3 experiments

EXP-013 2026-04-14

Power-Law Kernel Initialization: Theory-Guided Init Gives 0.039 BPB Free at 10M Scale

Does initializing a diagonal SSM's output projection to match the theoretical power-law MI decay structure (beta~1.15) improve convergence speed or final BPB over uniform initialization?

Condition 3 (concentrated rates + PL C_proj) achieves val BPB 0.9739 vs baseline 1.0127, a 0.039 BPB improvement with ze…
Convergence is consistently faster: at step 2000, Condition 3 val BPB is 1.1029 vs baseline 1.1741 (0.071 gap). At step …

#ssm #initialization #power-law #information-theory
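
One way to picture "initializing the output projection to match a power-law decay" is to fit output weights so that a sum of decaying exponentials approximates d^(-beta). The sketch below is only an illustration under stated assumptions: the state size, the log-spaced timescales, the lag range, and the names `tau`, `basis`, and `c_proj` are all hypothetical, and the experiment's actual procedure is not shown on this page.

```python
import numpy as np

# All names and sizes here are hypothetical; the experiment's real setup is not shown.
beta = 1.15        # exponent quoted in the experiment question
n_state = 16       # assumed number of diagonal SSM channels
max_lag = 512      # assumed range of lags to match

# Assumed log-spaced timescales; channel i decays as exp(-d / tau[i]).
tau = np.logspace(np.log10(0.5), np.log10(1000.0), n_state)
lags = np.arange(1, max_lag + 1)

# Basis matrix: column i holds the impulse response of channel i at each lag.
basis = np.exp(-lags[:, None] / tau[None, :])   # shape (max_lag, n_state)
target = lags.astype(float) ** (-beta)          # power-law kernel d^(-beta)

# Least-squares output weights so that basis @ c_proj approximates the target.
c_proj, *_ = np.linalg.lstsq(basis, target, rcond=None)

kernel = basis @ c_proj
max_abs_err = float(np.max(np.abs(kernel - target)))
print(f"max abs error over lags 1..{max_lag}: {max_abs_err:.2e}")
```

The design point is that a diagonal SSM can only produce sums of exponentials, so a power-law kernel has to be approximated by spreading timescales over several orders of magnitude and choosing the mixing weights.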

EXP-012 2026-04-13

Information Topology of Natural Language

How does mutual information between tokens decay with distance across typologically diverse languages, and does language structure (morphology, word order) shape the information topology?

Power-law MI decay is universal (5/5 languages, R²>0.96). The exponential fit fails catastrophically in log-linear R² (<-10). SSM expon…
Beta exponent descriptively splits by morphology: analytic (en/pt) 1.1-1.2 vs agglutinative (tr/fi/ar) 0.87-0.98. CIs ov…

#information-theory #mutual-information #power-law #typology
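
The power-law versus exponential comparison can be sketched as two straight-line fits scored by R² on log MI. This is a minimal illustration on synthetic data, not the experiment's pipeline: the noise level, lag range, and helper `r2` are assumptions. Because both fits here are done directly in log space, this sketch cannot produce the strongly negative R² reported above, which would require a fit made on a different scale than the one it is scored on.

```python
import numpy as np

def r2(y, y_hat):
    """Coefficient of determination of predictions y_hat against y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic MI curve (assumption): power law with exponent 1.15 plus log-normal noise.
rng = np.random.default_rng(0)
d = np.arange(1, 257, dtype=float)
mi = d ** (-1.15) * np.exp(rng.normal(0.0, 0.05, size=d.size))
log_mi = np.log(mi)

# Power law: log I(d) = a - beta * log d, a line in log-log coordinates.
slope_pl, icpt_pl = np.polyfit(np.log(d), log_mi, 1)
beta_hat = -slope_pl
r2_pl = r2(log_mi, icpt_pl + slope_pl * np.log(d))

# Exponential: log I(d) = a - d / tau, a line in semi-log coordinates.
slope_ex, icpt_ex = np.polyfit(d, log_mi, 1)
r2_ex = r2(log_mi, icpt_ex + slope_ex * d)

print(f"beta_hat={beta_hat:.3f}  R2 power-law={r2_pl:.3f}  R2 exponential={r2_ex:.3f}")
```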

EXP-002 2026-04-07

Byte-Level Mutual Information Decays as a Power Law Across 5 Languages

How does mutual information between bytes decay with distance in natural language, and is this structure universal across languages with different scripts and morphology?

Mutual information between bytes decays as a power law I(d) ~ d^(-alpha) in all 5 languages tested (0 out of 5 exponenti…
82-96% of prediction gain comes from the first 8 bytes of context. Conditional entropy drops from ~5 bits (unigram) to ~…

#information-theory #byte-level #mutual-information #power-law
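
For reference, mutual information between bytes at distance d can be estimated from pair counts. The following is a minimal plug-in estimator written for illustration; the function name `byte_mi` is hypothetical and the experiment's own estimator (including any bias correction) is not shown here. Plug-in estimates are biased upward on small samples, so long corpora are needed at large d.

```python
import numpy as np

def byte_mi(data: bytes, d: int) -> float:
    """Plug-in estimate of I(X_t; X_{t+d}) in bits for a byte sequence."""
    x = np.frombuffer(data, dtype=np.uint8).astype(np.intp)
    if d < 1 or d >= x.size:
        raise ValueError("d must satisfy 1 <= d < len(data)")

    # Count co-occurrences of (byte at t, byte at t + d).
    joint = np.zeros((256, 256), dtype=np.float64)
    np.add.at(joint, (x[:-d], x[d:]), 1.0)

    p = joint / joint.sum()                 # joint distribution
    px = p.sum(axis=1, keepdims=True)       # marginal of the earlier byte
    py = p.sum(axis=0, keepdims=True)       # marginal of the later byte

    # Sum p * log2(p / (px * py)) over observed pairs only.
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz])))

# Usage: an alternating sequence carries about 1 bit at distance 1.
print(byte_mi(b"ab" * 1000, 1))
```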