Question
Can a diagonal state-space model processing raw bytes (no tokenizer, no attention) scale from 2M to 100M parameters on English web text?
Setup
Architecture: Diagonal SSM + P=2 residual byte patching + log-decay initialization. The model reads raw bytes (0-255), mean-pools every 2 bytes into patches, runs them through stacked SSM blocks (LayerNorm → diagonal SSM → residual → LayerNorm → SwiGLU MLP → residual), then broadcasts back to byte resolution and adds byte embeddings as a residual before the prediction head. No tokenizer, no attention, no external memory. Data: FineWeb sample-10BT (deduplicated English web text). Cross-eval on Portuguese Wikipedia (never seen during training). Hardware: RTX 6000 Ada (50.9 GB) for 2M/10M, NVIDIA H200 (141 GB) for 100M. Three sizes, same architecture, only width and depth change: - 2M: d=256, L=4, batch=512, 30 min - 10M: d=512, L=6, batch=256, 1 hour - 100M: d=1024, L=16, batch=512, 10 hours All use d_state=d_model, patch size P=2, log-spaced decay rates (half-lives 1 to 2000+ bytes), custom Triton parallel scan kernel.
Results
| Size | Params | Steps | Time | BPB (English) | BPB (Portuguese) | GPU Memory |
|---|---|---|---|---|---|---|
| 2M | 1.7M | 4,500 | 30 min | 1.136 | 2.47 | 14.8 GB |
| 10M | 9.7M | 4,535 | 1 hour | 1.006 | 2.22 | 40.2 GB |
| 100M | 101.3M | 91,998 | 10 hours | 0.776 | 1.77 | 97.8 GB |
Key findings
- BPB 0.776 on FineWeb English with zero attention. A 101M-param diagonal SSM processing raw bytes compresses diverse web text better than gzip (2.5 BPB) or lzma (2.4 BPB) by ~3×. Competitive with byte-level transformer baselines in the literature (0.80-0.98 range).
- Zero architectural changes from 2M to 100M. Same P=2 mean pooling, same broadcast upsample with byte residual, same log-decay initialization. Only d_model (256→1024) and n_layers (4→16) changed. No scale-specific tricks.
- Cross-language transfer emerges from scale alone. Portuguese BPB drops from 2.47 (2M) to 1.77 (100M) despite English-only training. At the byte level, languages share enough statistical structure that a large enough model generalizes automatically.
- Scaling is super-log-linear. Log-linear extrapolation from 2M/10M predicted 0.82 — actual result 0.776 beat the prediction. 10× more params gives 23% BPB improvement (better than the 5.7× → 11.4% ratio at smaller scale).
- Residual byte patching is the key design choice (EXP-051). Mean-pool pairs of bytes, run SSM at half resolution, broadcast back, add byte residual. 2× faster and 0.7 BPB better than full byte resolution — a strict Pareto improvement. Discovered after testing 6 strategies including learned boundaries (which leaked future information).
Lesson learned
Over 60 experiments in 9 days, most ideas that "should" work didn't: explicit memory buffers destroyed pronoun prediction (0.86 → 1.41 BPB), Mamba-style selective scan was 14× slower and couldn't compensate, MI-weighted tokenizer improvements turned out to be corpus artifacts, and a 100% entity probe accuracy was bogus (real accuracy: 25% on 9-way). The winning architecture is the simplest one — diagonal SSM + fixed-stride patching + residual connections. At small-to-medium scale on a fixed compute budget, simplicity wins.
Tools used
All code (SSM architecture, Triton parallel scan kernel, training scripts, data loading, evaluation) generated by Claude Opus (claude-opus-4-6) via Claude Code. Claude proposed experiments, wrote implementations, reviewed results, and diagnosed failures across 60 experiments. Human decisions guided directions and abandoned dead ends.