Residual Byte Patching: 3.5x Faster and 0.6 BPB Better — After Catching a Causality Bug in Learned Boundaries
@eazevedo
Question
Can a byte-level language model learn where to place patch boundaries, or is fixed-stride mean pooling with a byte-level residual connection sufficient?
Setup
Byte-level models are slow because they process every byte individually. Patching groups bytes into chunks, runs the model at reduced resolution, then maps back. We tested 6 patching strategies for a diagonal SSM, keeping everything else identical:
- A: fixed_mean — Fixed stride P=4, mean pool within windows, broadcast upsample, byte-level residual
- B: fixed_attn — Fixed stride P=4, learned attention pooling within windows
- C: soft_boundary — Learned boundaries via cumulative scores + Gaussian soft assignment (fully differentiable)
- D: entropy_boundary — Tiny byte predictor estimates per-position entropy, boundaries at high-entropy points
- E: no_patch — Full byte resolution baseline
- F: gru_local_dec — GRU pool + autoregressive local decoder (MegaByte-style)

Variants C and D use a small GRU as the boundary model with differentiable soft assignment: cumulative boundary scores are normalized via a Gaussian kernel, allowing SSM gradients to flow back through the assignment weights. Model: diagonal SSM, d=256, 4 layers, sequence length 1024. Data: Portuguese Wikipedia. Hardware: RTX 3060 Ti. Training: 20 minutes per variant.
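A minimal NumPy sketch of variant A's data path (shapes taken from the setup; function names are mine, and the SSM that would run between the two steps is omitted):

```python
import numpy as np

def fixed_mean_patch(x, P=4):
    """Variant A downsample: mean-pool byte embeddings in fixed windows of stride P."""
    B, L, D = x.shape                               # (batch, bytes, dim); L divisible by P
    return x.reshape(B, L // P, P, D).mean(axis=2)  # (B, L/P, D)

def broadcast_upsample(patches, P=4):
    """Variant A upsample: repeat each patch embedding P times back to byte resolution."""
    return np.repeat(patches, P, axis=1)            # (B, L, D)

x = np.random.randn(2, 1024, 256)   # byte embeddings
h = fixed_mean_patch(x)             # SSM input: 256 patches instead of 1024 bytes
y = broadcast_upsample(h) + x       # byte-level residual restores local detail
```

The residual on the last line is the load-bearing piece: the pooled path only has to carry coarse context, because the exact byte embeddings are re-added after upsampling.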
Results
| Variant | Params | Steps (20 min) | ms/step | Best BPB | Status |
|---|---|---|---|---|---|
| A: fixed_mean + residual | 1,711K | 21,630 | 55.6 | 1.132 | VALID — best |
| B: fixed_attn + residual | 1,728K | 20,875 | 57.6 | 1.165 | VALID |
| E: no_patch (baseline) | 1,711K | 6,039 | 200.0 | 1.770 | VALID |
| F: gru_local_dec (MegaByte) | 2,632K | 16,893 | 71.2 | 1.835 | VALID |
| C: soft_boundary | 1,773K | — | 74.0 | 0.090 | INVALID — causality bug |
| D: entropy_boundary | 1,790K | — | 81.0 | 0.100 | INVALID — causality bug |
Key findings
- Fixed mean pooling + broadcast upsample + byte residual is a strict Pareto improvement over full byte resolution. Variant A is 3.5× faster (55.6 ms vs 200 ms per step) AND achieves 0.64 BPB better quality (1.132 vs 1.770) at the same parameter count. The SSM operates on 256 patches instead of 1024 bytes, processing 3.5× more data in the same wall-clock time, which more than compensates for information lost in pooling.
- Learned soft boundaries contained a critical causality bug. The Gaussian soft assignment matrix had non-zero weights across the full sequence, creating a non-causal information channel: every byte contributed to every patch (downsample), the SSM processed causally between patches, then every byte read from every patch (upsample). Result: bytes could see future bytes through the downsample-SSM-upsample chain. This produced BPB 0.09 — impossibly good for a 1.8M param model. It was memorizing the future, not learning.
- With proper causal masking applied, learned boundaries plateau at BPB 3.41 — three times worse than the simple fixed-stride approach (1.132). The soft assignment via cumsum + Gaussian creates a pathological optimization landscape. Training showed a brief dip to 2.51 at step 1500, then permanently regressed. The approach is a dead end.
- Mean pooling is sufficient — learned attention within fixed windows adds nothing. Variant A (1.132) slightly beats variant B (1.165). The SSM is powerful enough to compensate for crude aggregation.
- The MegaByte-style autoregressive local decoder (variant F, 1.835) is worse than no patching at all (variant E, 1.770) despite 50% more parameters. The byte-level residual connection makes autoregressive decoders obsolete for this architecture — simply adding the original byte embeddings back after upsampling preserves all the local detail the decoder was trying to reconstruct.
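To make the causality leak in variants C and D concrete, here is a minimal NumPy sketch (illustrative names and toy shapes, not the actual implementation) showing that a dense Gaussian soft-assignment matrix lets byte 0 read the final byte even with no model in between:

```python
import numpy as np

# Dense Gaussian soft assignment: the weight of byte t on patch k decays with the
# distance between t's soft patch index c[t] and k, but never reaches zero.
L, K, sigma = 16, 4, 1.0
c = np.linspace(0, K - 1, L)        # stand-in for cumsum'd boundary scores
W = np.exp(-(c[:, None] - np.arange(K)[None, :]) ** 2 / (2 * sigma**2))  # (L, K), all > 0

down = W / W.sum(axis=0, keepdims=True)   # patch k = sum_t down[t, k] * x[t]
up   = W / W.sum(axis=1, keepdims=True)   # byte  t = sum_k up[t, k] * patch[k]

def roundtrip(x):
    # Downsample then upsample with NO processing in between:
    # any leak comes from the assignment matrix alone.
    return up @ (down.T @ x)

x  = np.zeros(L)
x2 = x.copy()
x2[-1] = 1.0                              # perturb only the final (future-most) byte
leak = roundtrip(x2) - roundtrip(x)       # leak[0] is nonzero: byte 0 saw the future
```

Because every entry of `W` is strictly positive, every patch reads every byte and every byte reads every patch, so even a perfectly causal SSM between the two projections cannot prevent the leak.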
Lesson learned
Dense soft assignment matrices that mix all positions are dangerous. Any architecture where a downsample-process-upsample chain uses non-sparse assignment creates subtle information leakage channels. The resulting metrics look impossibly good, which is itself the red flag: if a 1.8M-parameter model reaches BPB 0.09, something is fundamentally wrong. Verify causality directly, by perturbing future positions and checking that past outputs are unchanged, rather than trusting the loss curve. The fix was simple (asymmetric causal masks), but the fixed-stride approach it was trying to beat turned out to be better anyway.
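That check can be automated as a generic smoke test. A sketch with hypothetical names (`model_fn` stands in for any position-to-position map): rewrite everything after position t and assert that outputs up to t do not move.

```python
import numpy as np

def check_causality(model_fn, L=32, D=8, t=10, tol=1e-6, seed=0):
    """Smoke test: outputs at positions <= t must not change when inputs > t change."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((L, D))
    x2 = x.copy()
    x2[t + 1:] = rng.standard_normal((L - t - 1, D))   # rewrite the future
    y, y2 = model_fn(x), model_fn(x2)
    return float(np.max(np.abs(y[: t + 1] - y2[: t + 1]))) < tol

# A causal map (running mean over the past) passes; a full-sequence mean fails.
causal = lambda x: np.cumsum(x, axis=0) / np.arange(1, len(x) + 1)[:, None]
leaky  = lambda x: np.tile(x.mean(axis=0), (len(x), 1))
ok_causal, ok_leaky = check_causality(causal), check_causality(leaky)
```

Running this on a few interior positions t per architecture variant would have flagged C and D long before their BPB numbers did.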
Tools used
All patching implementations (fixed stride, learned soft boundaries, entropy boundaries, local decoder) generated by Claude Opus (claude-opus-4-6) via Claude Code. The causality bug was identified by Claude after observing the impossibly low BPB.