1 experiment
Can a byte-level language model learn where to place patch boundaries, or is fixed-stride mean pooling with a byte-level residual connection sufficient?