#byte-level
3 experiments
EXP-005
Residual Byte Patching: 3.5x Faster and 0.6 BPB Better — After Catching a Causality Bug in Learned Boundaries
Can a byte-level language model learn where to place patch boundaries, or is fixed-stride mean pooling with a byte-level residual connection sufficient?
- Fixed mean pooling + broadcast upsample + byte residual is a strict Pareto improvement over full byte resolution. Varian…
- Learned soft boundaries contained a critical causality bug. The Gaussian soft assignment matrix had non-zero weights acr…
#negative-result #byte-level #patching #ssm #causality-bug
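The causality bug named in the entry above can be illustrated in a few lines. This is a hedged sketch under assumed shapes, not the experiment's exact parameterization: `centers`, `sigma`, and `right_edge` are illustrative. A Gaussian soft assignment matrix `A[p, t]` gives each patch `p` nonzero weight on every byte `t`, including bytes past the patch boundary, so patch features leak future information unless explicitly masked.

```python
import numpy as np

T = 8                              # sequence length in bytes (illustrative)
centers = np.array([1.5, 5.5])     # assumed learned boundary centers
sigma = 1.0
pos = np.arange(T)

# A[p, t]: soft weight of byte t in patch p (rows normalized to sum to 1)
A = np.exp(-0.5 * ((pos[None, :] - centers[:, None]) / sigma) ** 2)
A /= A.sum(axis=1, keepdims=True)

# The leak: patch 0 (centered near byte 1.5) still reads byte 5.
assert A[0, 5] > 0

# One possible fix: zero all weights past each patch's right edge and
# renormalize, so a patch only pools bytes it has already seen.
right_edge = np.array([2, 7])      # assumed per-patch causal cutoffs
causal = np.where(pos[None, :] <= right_edge[:, None], A, 0.0)
causal /= causal.sum(axis=1, keepdims=True)
assert causal[0, 5] == 0.0
```

A Gaussian has unbounded support, so the leak is structural, not a tuning issue: without a mask, every patch sees every future byte with some weight.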
EXP-002
Byte-Level Mutual Information Decays as a Power Law Across 5 Languages
How does mutual information between bytes decay with distance in natural language, and is this structure universal across languages with different scripts and morphology?
- Mutual information between bytes decays as a power law I(d) ~ d^(-alpha) in all 5 languages tested (0 out of 5 exponenti…
- 82-96% of prediction gain comes from the first 8 bytes of context. Conditional entropy drops from ~5 bits (unigram) to ~…
#information-theory #byte-level #mutual-information #power-law
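The quantity in the entry above, I(d) ~ d^(-alpha), can be estimated with a plug-in estimator and a log-log fit. A minimal sketch with hypothetical helpers `byte_mi` and `fit_alpha`; the experiment's actual estimator, corpora, and bias handling are not shown here, and plug-in estimates are biased upward at large d on finite data.

```python
import numpy as np
from collections import Counter

def byte_mi(data: bytes, d: int) -> float:
    # Plug-in estimate of I(X_t; X_{t+d}) in bits between bytes at distance d.
    xs, ys = data[:-d], data[d:]
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * np.log2((c * n) / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def fit_alpha(data: bytes, ds=(1, 2, 4, 8)) -> float:
    # Least-squares slope on a log-log plot; alpha in I(d) ~ d^(-alpha).
    mis = np.array([byte_mi(data, d) for d in ds])
    slope, _ = np.polyfit(np.log(ds), np.log(mis), 1)
    return -slope

# Sanity check: a period-2 string is fully predictive at d=2, giving
# exactly 1 bit of mutual information.
assert abs(byte_mi(b"abab" * 2000, 2) - 1.0) < 1e-6
```

On real text one would compute `byte_mi` over a range of distances and check that the points fall on a straight line in log-log coordinates (power law) rather than a straight line in semi-log coordinates (exponential decay), which is the distinction the entry's 0-out-of-5 exponential result turns on.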
EXP-001
Byte-Level SSM Scales to 100M Params — 0.776 BPB on FineWeb with Zero Attention
Can a diagonal state-space model processing raw bytes (no tokenizer, no attention) scale from 2M to 100M parameters on English web text?
- BPB 0.776 on FineWeb English with zero attention. A 101M-param diagonal SSM processing raw bytes compresses diverse web …
- Zero architectural changes from 2M to 100M. Same P=2 mean pooling, same broadcast upsample with byte residual, same log-…
#ssm #byte-level #scaling #no-attention
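The P=2 mean pooling, broadcast upsample, and byte residual named in the entry above fit in a short shape-level sketch. This is a hedged illustration, not the experiment's implementation: the diagonal SSM backbone is omitted, and `inner` is a hypothetical stand-in for any patch-rate sequence model.

```python
import numpy as np

P = 2  # fixed patch stride

def patch_down(x):
    # (T, D) byte-rate features -> (T // P, D) patch-rate features,
    # by mean pooling non-overlapping windows of P bytes (T divisible by P).
    T, D = x.shape
    return x.reshape(T // P, P, D).mean(axis=1)

def broadcast_up(z, T):
    # (T // P, D) -> (T, D): repeat each patch vector P times.
    return np.repeat(z, P, axis=0)[:T]

def patched_block(x, inner):
    # inner: any (T // P, D) -> (T // P, D) sequence model (e.g. an SSM).
    # The byte-level residual adds the raw byte features back after upsampling.
    return x + broadcast_up(inner(patch_down(x)), len(x))

x = np.arange(8.0).reshape(4, 2)
y = patched_block(x, lambda z: z)  # identity stand-in for the inner model
assert y.shape == x.shape
```

The inner model runs at half the byte rate, which is where the speedup comes from, while the residual keeps byte-level detail available to the output head.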