#byte-level
3 experiments
EXP-005
Residual Byte Patching: 3.5x Faster and 0.6 BPB Better — After Catching a Causality Bug in Learned Boundaries
Can a byte-level language model learn where to place patch boundaries, or is fixed-stride mean pooling with a byte-level residual connection sufficient?
- Fixed mean pooling + broadcast upsample + byte residual is a strict Pareto improvement over full byte resolution. Varian…
- Learned soft boundaries contained a critical causality bug. The Gaussian soft assignment matrix had non-zero weights acr…
#negative-result #byte-level #patching #ssm #causality-bug
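The causality bug named in the entry above can be illustrated in a few lines. This is a hedged sketch under assumed shapes, not the experiment's exact parameterization: `centers`, `sigma`, and `right_edge` are illustrative. A Gaussian soft assignment matrix `A[p, t]` gives each patch `p` nonzero weight on every byte `t`, including bytes past the patch boundary, so patch features leak future information unless explicitly masked.

```python
import numpy as np

T = 8                              # sequence length in bytes (illustrative)
centers = np.array([1.5, 5.5])     # assumed learned boundary centers
sigma = 1.0
pos = np.arange(T)

# A[p, t]: soft weight of byte t in patch p (rows normalized to sum to 1)
A = np.exp(-0.5 * ((pos[None, :] - centers[:, None]) / sigma) ** 2)
A /= A.sum(axis=1, keepdims=True)

# The leak: patch 0 (centered near byte 1.5) still reads byte 5.
assert A[0, 5] > 0

# One possible fix: zero all weights past each patch's right edge and
# renormalize, so a patch only pools bytes it has already seen.
right_edge = np.array([2, 7])      # assumed per-patch causal cutoffs
causal = np.where(pos[None, :] <= right_edge[:, None], A, 0.0)
causal /= causal.sum(axis=1, keepdims=True)
assert causal[0, 5] == 0.0
```

A Gaussian has unbounded support, so the leak is structural, not a tuning issue: without a mask, every patch sees every future byte with some weight.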
EXP-002
Byte-Level Mutual Information Decays as a Power Law Across 5 Languages
How does mutual information between bytes decay with distance in natural language, and is this structure universal across languages with different scripts and morphology?
- Mutual information between bytes decays as a power law I(d) ~ d^(-alpha) in all 5 languages tested (0 out of 5 exponenti…
- 82-96% of prediction gain comes from the first 8 bytes of context. Conditional entropy drops from ~5 bits (unigram) to ~…
#information-theory #byte-level #mutual-information #power-law
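The quantity in the entry above, I(d) ~ d^(-alpha), can be estimated with a plug-in estimator and a log-log fit. A minimal sketch with hypothetical helpers `byte_mi` and `fit_alpha`; the experiment's actual estimator, corpora, and bias handling are not shown here, and plug-in estimates are biased upward at large d on finite data.

```python
import numpy as np
from collections import Counter

def byte_mi(data: bytes, d: int) -> float:
    # Plug-in estimate of I(X_t; X_{t+d}) in bits between bytes at distance d.
    xs, ys = data[:-d], data[d:]
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * np.log2((c * n) / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def fit_alpha(data: bytes, ds=(1, 2, 4, 8)) -> float:
    # Least-squares slope on a log-log plot; alpha in I(d) ~ d^(-alpha).
    mis = np.array([byte_mi(data, d) for d in ds])
    slope, _ = np.polyfit(np.log(ds), np.log(mis), 1)
    return -slope

# Sanity check: a period-2 string is fully predictive at d=2, giving
# exactly 1 bit of mutual information.
assert abs(byte_mi(b"abab" * 2000, 2) - 1.0) < 1e-6
```

On real text one would compute `byte_mi` over a range of distances and check that the points fall on a straight line in log-log coordinates (power law) rather than a straight line in semi-log coordinates (exponential decay), which is the distinction the entry's 0-out-of-5 exponential result turns on.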
EXP-001
Byte-Level SSM Scales to 100M Params — 0.776 BPB on FineWeb with Zero Attention
Can a diagonal state-space model processing raw bytes (no tokenizer, no attention) scale from 2M to 100M parameters on English web text?
- BPB 0.776 on FineWeb English with zero attention. A 101M-param diagonal SSM processing raw bytes compresses diverse web …
- Zero architectural changes from 2M to 100M. Same P=2 mean pooling, same broadcast upsample with byte residual, same log-…
#ssm #byte-level #scaling #no-attention
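The P=2 mean pooling, broadcast upsample, and byte residual named in the entry above fit in a short shape-level sketch. This is a hedged illustration, not the experiment's implementation: the diagonal SSM backbone is omitted, and `inner` is a hypothetical stand-in for any patch-rate sequence model.

```python
import numpy as np

P = 2  # fixed patch stride

def patch_down(x):
    # (T, D) byte-rate features -> (T // P, D) patch-rate features,
    # by mean pooling non-overlapping windows of P bytes (T divisible by P).
    T, D = x.shape
    return x.reshape(T // P, P, D).mean(axis=1)

def broadcast_up(z, T):
    # (T // P, D) -> (T, D): repeat each patch vector P times.
    return np.repeat(z, P, axis=0)[:T]

def patched_block(x, inner):
    # inner: any (T // P, D) -> (T // P, D) sequence model (e.g. an SSM).
    # The byte-level residual adds the raw byte features back after upsampling.
    return x + broadcast_up(inner(patch_down(x)), len(x))

x = np.arange(8.0).reshape(4, 2)
y = patched_block(x, lambda z: z)  # identity stand-in for the inner model
assert y.shape == x.shape
```

The inner model runs at half the byte rate, which is where the speedup comes from, while the residual keeps byte-level detail available to the output head.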