1 experiment
Does initializing a diagonal SSM's output projection to match the theoretical power-law mutual-information (MI) decay (beta ≈ 1.15) improve convergence speed or final bits-per-byte (BPB) compared with uniform initialization?
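
One possible reading of the two initializations, as a minimal sketch rather than the experiment's actual code. It assumes a diagonal SSM with per-channel decay a_i in (0, 1), an input projection fixed to 1, and an impulse response K(tau) = sum_i c_i * a_i**tau, and it assumes the power-law target applies directly to that impulse response. The names `power_law_output_init` and `kernel`, the state size of 64, and the log-rate range are hypothetical. The output weights c come from discretizing the identity tau**(-beta) = (1 / Gamma(beta)) * integral of s**(beta - 1) * exp(-s * tau) ds over log-spaced rates s.

```python
import math
import numpy as np

def power_law_output_init(n_state=64, beta=1.15, log_rate_min=-18.0, log_rate_max=4.0):
    """Return (decay, c) with sum_i c[i] * decay[i]**tau ~= tau**(-beta).

    Hypothetical helper: the state size and log-rate range are illustrative
    assumptions, not values taken from the experiment.
    """
    # Log-spaced decay rates s_i; the per-step decay a_i = exp(-s_i) lies in (0, 1).
    u = np.linspace(log_rate_min, log_rate_max, n_state)
    rates = np.exp(u)
    decay = np.exp(-rates)
    # Riemann sum in u = log(s), where ds = s du, so each weight is s**beta * du.
    du = u[1] - u[0]
    c = rates**beta * du / math.gamma(beta)
    return decay, c

def kernel(decay, c, lags):
    """Impulse response K(tau) = sum_i c[i] * decay[i]**tau, assuming B = 1."""
    return (decay[None, :] ** lags[:, None]) @ c

beta = 1.15
decay, c_power = power_law_output_init(beta=beta)

# Baseline (assumed meaning of "uniform"): equal weight on every state channel.
c_uniform = np.full_like(c_power, 1.0 / c_power.size)

lags = np.arange(1, 257, dtype=np.float64)
target = lags ** (-beta)

k_power = kernel(decay, c_power, lags)
k_uniform = kernel(decay, c_uniform, lags)

# Worst-case relative deviation of each initial kernel from the power-law target.
rel_err_power = np.max(np.abs(k_power - target) / target)
rel_err_uniform = np.max(np.abs(k_uniform - target) / target)
print(f"power-law init: {rel_err_power:.2e}, uniform init: {rel_err_uniform:.2e}")
```

Both initializations share the same decay spectrum here, so any difference in convergence speed or final BPB would come from the output projection alone. Whether the decay rates should also be held fixed during training is a separate design choice that this sketch does not address.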