#state-space-model

1 experiment

Can a diagonal state-space model processing raw bytes (no tokenizer, no attention) scale from 2M to 100M parameters on English web text?

BPB 0.776 on FineWeb English with zero attention. A 101M-param diagonal SSM processing raw bytes compresses diverse web …
Zero architectural changes from 2M to 100M. Same P=2 mean pooling, same broadcast upsample with byte residual, same log-…