terminus.ink
EXP-003

Transformer "Noise Layers" Contain Massive Hidden Information — 92.8% Probe Accuracy Where Output Head Gets 2.8%

@eazevedo

Question

When a transformer's output head (lm_head) gets near-zero accuracy at intermediate layers, is next-token information genuinely absent, or is it present in a different geometric basis that the output head can't read?

Setup

For each layer of three models — nanoGPT 27.5M (8 layers), Qwen2.5-7B (28 layers, 4-bit quantized), and Qwen3-4B (36 layers, 4-bit quantized) — we extract hidden states and measure two things: (1) accuracy of the shared output head (lm_head) applied directly to that layer's output, and (2) accuracy of a per-layer trained linear probe (nn.Linear(hidden_dim, vocab_size)) initialized from lm_head weights and fine-tuned for 500 steps. If probe >> lm_head, the information is present but sits in a rotated basis the output head can't read; if probe ≈ lm_head ≈ 0, the information is genuinely absent. Evaluation corpus: TinyStories validation (188 tokens) for all three models, plus 14 Wikipedia sentences for the Qwen models to test in-distribution behavior. The models are frozen throughout — this is pure analysis, with no training or modification of the models themselves.
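The two measurements can be sketched as below. This is an illustrative reconstruction, not the actual analysis code; function names and the dummy shapes are ours, and it assumes a PyTorch model exposing per-layer hidden states (e.g. via `output_hidden_states=True` in HuggingFace Transformers).

```python
# Sketch of the per-layer measurement: shared output head vs. trained probe.
# Assumes PyTorch; names are illustrative, not the experiment's actual code.
import torch
import torch.nn as nn


def head_accuracy(lm_head, hidden, targets):
    """Next-token accuracy of an output head applied directly to one
    layer's hidden states. hidden: [batch, seq, hidden_dim],
    targets: [batch, seq] token ids (position t predicts token t+1)."""
    preds = lm_head(hidden).argmax(dim=-1)
    return (preds[:, :-1] == targets[:, 1:]).float().mean().item()


def make_probe(lm_head, hidden_dim, vocab_size):
    """Per-layer linear probe, initialized from the shared output head's
    weights, to be fine-tuned (~500 steps in the write-up) per layer."""
    probe = nn.Linear(hidden_dim, vocab_size, bias=False)
    probe.weight.data.copy_(lm_head.weight.data)
    return probe
```

If the probe (after fine-tuning) scores far above the raw head on the same hidden states, the layer carries next-token information the shared head cannot decode in place.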

Results

| Model | Layer | Depth % | Output head accuracy | Trained probe accuracy | Information gain (bits) |
| --- | --- | --- | --- | --- | --- |
| nanoGPT 27.5M | L1 | 12% | 48.2% | 93.5% | +2.7 |
| nanoGPT 27.5M | L4 | 50% | 64.9% | 98.4% | +1.8 |
| nanoGPT 27.5M | L8 (final) | 100% | 66.8% | 98.9% | +1.7 |
| Qwen2.5-7B | L1 | 4% | 0.6% | 8.3% | +3.4 |
| Qwen2.5-7B | L8 | 29% | 2.5% | 50.3% | +11.9 |
| Qwen2.5-7B | L14 | 50% | 2.8% | 92.8% | +15.2 |
| Qwen2.5-7B | L21 | 75% | 9.1% | 100% | +12.7 |
| Qwen2.5-7B | L27 | 96% | 67.1% | 100% | +2.7 |
| Qwen2.5-7B | L28 (final) | 100% | 46.7% | 97.8% | +4.8 |
| Qwen3-4B | L1 | 3% | 0.0% | 6.6% | +11.8 |
| Qwen3-4B | L18 | 50% | 2.8% | 67.4% | +10.6 |
| Qwen3-4B | L26 | 72% | 21.3% | 100% | +9.0 |
| Qwen3-4B | L35 | 97% | 69.6% | 100% | +2.1 |
| Qwen3-4B | L36 (final) | 100% | 60.5% | 99.2% | +4.3 |
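The write-up doesn't define the "information gain (bits)" column. One plausible reading, sketched here as an assumption rather than the experiment's actual metric, is the drop in mean per-token negative log-likelihood between head and probe, converted from nats to bits:

```python
# One plausible definition of "information gain (bits)": the reduction in
# mean per-token cross-entropy, converted from nats to bits. This is our
# reading of the column, not a definition given in the write-up.
import math


def information_gain_bits(head_nll_nats, probe_nll_nats):
    """Bits of next-token information the probe recovers over the head."""
    return (head_nll_nats - probe_nll_nats) / math.log(2)
```

Under this reading, magnitudes like +15 bits are plausible at mid-depth: a head near chance on a ~150k-token vocabulary sits around 11.9 nats (~17 bits) per token, so a probe reaching ~1.4 nats would recover roughly 15 bits.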

Key findings

  • CORRECTION: The 92.8% probe accuracy was an artifact of overfitting — a 1536-dim linear probe on only 356 tokens will memorize the training set. With 4181 tokens and a proper train/test split, real probe accuracy at mid-depth is ~20%. The information gap between probe and output head is real but far smaller than originally reported.
  • "Noise layers" still contain more information than the output head can read. Even with corrected methodology, the trained probe outperforms the output head at intermediate layers — the geometric rotation is real, just not as dramatic as 92.8% vs 2.8% suggested.
  • The penultimate layer outperforms the final layer — this finding holds. Gemma 4 E2B shows 48.2% accuracy at L34 vs 39.9% at L35 (final). Qwen2.5-7B showed 67.1% at L27 vs 46.7% at L28. Final-layer degradation is robust across architectures.
  • The phenomenon is architecture-universal. Replication on Gemma 4 E2B (35 layers, different architecture family from Qwen) confirms: 60% of layers are noise layers when read through the output head, and the final layer degrades predictions.
  • Information builds monotonically; alignment builds late. This qualitative pattern holds even with corrected probe numbers — the output head cannot read intermediate representations, and the last ~25% of layers primarily rotate representations into the output basis.

Lesson learned

CORRECTION (2026-04-09): Our original probe accuracy (92.8%) was overfitting on 356 tokens. With 4181 tokens and a proper train/test split, real accuracy is ~20% at mid-depth. The information gap is real but far smaller than reported. Final-layer degradation (48.2% → 39.9%) holds. Lesson: always split your probing data. A 1536-dim probe on 356 tokens will memorize anything.

Original lesson (partially superseded): Measuring a layer's usefulness by applying the output head directly is deeply misleading at scale. The output head is calibrated only for the final layer's basis. Intermediate layers contain linearly accessible information that a per-layer probe can extract — but the magnitude of this gap was overstated due to probe overfitting on a tiny corpus. The qualitative finding (information present but rotated) holds; the quantitative claim (92.8% vs 2.8%) does not.
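The corrected protocol, holding out test tokens before the probe ever trains, can be sketched as follows. This is a minimal illustration with hyperparameters of our choosing, not the actual analysis code:

```python
# Minimal sketch of the corrected probing protocol: split tokens into
# train/test BEFORE fitting the probe, so a high-dimensional probe cannot
# score by memorizing its training set. Illustrative, not the actual code.
import torch
import torch.nn as nn


def eval_probe_with_split(hidden, targets, vocab_size,
                          train_frac=0.8, steps=500, lr=1e-3):
    """Train a linear probe on a random train split of one layer's
    activations, then report accuracy on the held-out tokens only.
    hidden: [num_tokens, hidden_dim], targets: [num_tokens] token ids."""
    n = hidden.shape[0]
    perm = torch.randperm(n)
    cut = int(train_frac * n)
    tr, te = perm[:cut], perm[cut:]

    probe = nn.Linear(hidden.shape[1], vocab_size, bias=False)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(hidden[tr]), targets[tr])
        loss.backward()
        opt.step()

    with torch.no_grad():  # held-out accuracy: memorization doesn't help here
        acc = (probe(hidden[te]).argmax(-1) == targets[te]).float().mean()
    return acc.item()
```

With 356 tokens and a 1536-dim probe the train split will still be memorized, but the held-out accuracy, the number reported after the correction, stays honest.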

Tools used

Analysis code generated by Claude Opus (claude-opus-4-6) via Claude Code. Qwen models loaded via HuggingFace Transformers with 4-bit quantization. nanoGPT checkpoint trained on TinyStories.