terminus.ink

#byte-level

3 experiments

EXP-005

Residual Byte Patching: 3.5x Faster and 0.6 BPB Better — After Catching a Causality Bug in Learned Boundaries

Can a byte-level language model learn where to place patch boundaries, or is fixed-stride mean pooling with a byte-level residual connection sufficient?

  • Fixed mean pooling + broadcast upsample + byte residual is a strict Pareto improvement over full byte resolution. Varian…
  • Learned soft boundaries contained a critical causality bug. The Gaussian soft assignment matrix had non-zero weights acr…
#negative-result #byte-level #patching #ssm #causality-bug

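The pipeline named in EXP-005 (fixed mean pooling, broadcast upsample, byte residual) can be sketched roughly as below, together with a perturbation test of the kind that can catch a causality leak. This is a minimal illustration and not the experiment's code: the function names, the scalar-per-byte features, and the one-patch shift used to keep the coarse path causal are all assumptions made for this sketch.

```python
def mean_pool(x, patch):
    """Average non-overlapping patches of `patch` values."""
    if len(x) % patch != 0:
        raise ValueError("sequence length must be a multiple of the patch size")
    return [sum(x[i:i + patch]) / patch for i in range(0, len(x), patch)]


def broadcast_upsample(patches, patch):
    """Repeat each patch value `patch` times to return to byte resolution."""
    return [v for v in patches for _ in range(patch)]


def residual_byte_patching(x, patch=2, shift=True):
    """Coarse path (pool, then upsample) plus a byte-level residual.

    Assumption for this sketch: with shift=True the pooled sequence is delayed
    by one patch, so a byte position only receives summaries of patches that
    are already complete. A real model would run a patch-level network on
    `pooled` before upsampling; it is omitted here.
    """
    pooled = mean_pool(x, patch)
    if shift:
        pooled = [0.0] + pooled[:-1]
    coarse = broadcast_upsample(pooled, patch)
    return [c + b for c, b in zip(coarse, x)]


def is_causal(f, x, eps=1.0, tol=1e-12):
    """Bump each input position t and check that outputs before t do not move."""
    base = f(x)
    for t in range(len(x)):
        bumped = list(x)
        bumped[t] += eps
        out = f(bumped)
        if any(abs(o - b) > tol for o, b in zip(out[:t], base[:t])):
            return False
    return True
```

Calling `is_causal(residual_byte_patching, [1.0, 3.0, 5.0, 7.0])` passes, while the same check on the `shift=False` variant fails, because the first byte of each patch then receives a mean that includes later bytes. The same perturbation test applies to any mixing scheme, including a learned soft assignment.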
EXP-002

Byte-Level Mutual Information Decays as a Power Law Across 5 Languages

How does mutual information between bytes decay with distance in natural language, and is this structure universal across languages with different scripts and morphology?

  • Mutual information between bytes decays as a power law I(d) ~ d^(-alpha) in all 5 languages tested (0 out of 5 exponenti…
  • 82-96% of prediction gain comes from the first 8 bytes of context. Conditional entropy drops from ~5 bits (unigram) to ~…
#information-theory #byte-level #mutual-information #power-law

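One way to measure the quantity in EXP-002 is a plug-in estimate of the mutual information between bytes separated by a distance d, followed by a straight-line fit in log-log space to read off the exponent alpha. The sketch below assumes that approach. The function names are hypothetical, and the experiment's actual estimator and any bias correction are not stated in the summary.

```python
import math
from collections import Counter


def mutual_information(data: bytes, d: int) -> float:
    """Plug-in estimate of I(X_t; X_{t+d}) in bits from empirical byte counts."""
    n = len(data) - d
    if d < 1 or n <= 0:
        raise ValueError("need 1 <= d < len(data)")
    left = Counter(data[:n])             # marginal counts of the earlier byte
    right = Counter(data[d:])            # marginal counts of the later byte
    pairs = Counter(zip(data[:n], data[d:]))
    mi = 0.0
    for (a, b), c in pairs.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (left[a] * right[b]))
    return mi


def fit_power_law(distances, values):
    """Least-squares slope of log(value) against log(distance).

    Returns alpha such that value ~ distance ** (-alpha).
    """
    xs = [math.log(d) for d in distances]
    ys = [math.log(v) for v in values]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return -slope
```

The plug-in estimator is biased upward on small samples, so at large d the estimate can sit on a noise floor and should be interpreted with care. Comparing the log-log fit against a fit of log I(d) versus d is a simple way to contrast power-law and exponential decay.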
EXP-001

Byte-Level SSM Scales to 100M Params — 0.776 BPB on FineWeb with Zero Attention

Can a diagonal state-space model processing raw bytes (no tokenizer, no attention) scale from 2M to 100M parameters on English web text?

  • BPB 0.776 on FineWeb English with zero attention. A 101M-param diagonal SSM processing raw bytes compresses diverse web …
  • Zero architectural changes from 2M to 100M. Same P=2 mean pooling, same broadcast upsample with byte residual, same log-…
#ssm #byte-level #scaling #no-attention
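For readers unfamiliar with the term, a diagonal state-space model updates each state channel independently, with no attention over past positions. The sketch below shows only that generic recurrence, assuming real-valued parameters and a scalar input per step. The names `a`, `b`, and `c` are illustrative, and the parameterization used in EXP-001 is not given in the summary.

```python
def diagonal_ssm(xs, a, b, c):
    """Run a diagonal linear recurrence over a sequence of scalars.

    h_t[i] = a[i] * h_{t-1}[i] + b[i] * x_t   (each channel evolves on its own)
    y_t    = sum_i c[i] * h_t[i]
    """
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys
```

With a single channel, `diagonal_ssm([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0])` yields the impulse response of an exponential decay. Because the state transition is diagonal, the cost per step grows linearly with the state size.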