1 experiment
Can a diagonal state-space model processing raw bytes (no tokenizer, no attention) scale from 2M to 100M parameters on English web text?