Transformer "Noise Layers" Contain Hidden Information the Output Head Can't Read (CORRECTED: the original 92.8%-vs-2.8% probe figure was an overfitting artifact)
@eazevedo
Question
When a transformer's output head (lm_head) gets near-zero accuracy at intermediate layers, is next-token information genuinely absent, or is it present in a different geometric basis that the output head can't read?
Setup
For each layer of three models (nanoGPT 27.5M, 8 layers; Qwen2.5-7B, 28 layers, 4-bit quantized; Qwen3-4B, 36 layers, 4-bit quantized) we extract hidden states and measure two things: (1) the accuracy of the shared output head (lm_head) applied directly to that layer's output, and (2) the accuracy of a per-layer trained linear probe (nn.Linear(hidden_dim, vocab_size)) initialized from the lm_head weights and fine-tuned for 500 steps. If probe >> lm_head, the information is present but sits in a rotated basis the output head cannot read; if probe ≈ lm_head ≈ 0, the information is genuinely absent. Evaluation corpus: the TinyStories validation set (188 tokens) for all three models, plus 14 Wikipedia sentences for the Qwen models to test in-distribution behavior. The models are frozen throughout; this is pure analysis, with no training or modification of the models themselves.
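The comparison can be sketched at toy scale. This is not the original analysis code: the "intermediate layer" is simulated as a random orthogonal rotation of the head's basis, and plain full-batch gradient descent in NumPy stands in for the post's 500 fine-tuning steps. All sizes here are illustrative, far below real model scale.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 16, 32, 512        # toy vocab size, hidden dim, token count

# "Final-basis" class directions: the output head is calibrated for these.
W_head = rng.normal(size=(k, d))

# Simulated intermediate-layer states: the same class information, but
# rotated into a different basis by a random orthogonal matrix Q.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
y = rng.integers(0, k, size=n)
h = W_head[y] @ Q + 0.1 * rng.normal(size=(n, d))

def accuracy(W, states, labels):
    return float((np.argmax(states @ W.T, axis=1) == labels).mean())

head_acc = accuracy(W_head, h, y)   # near chance: head reads the wrong basis

# Linear probe: initialized from the head weights, fine-tuned with
# full-batch softmax-regression gradient descent.
W = W_head.copy()
lr = 0.5
for _ in range(500):
    logits = h @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(n), y] -= 1.0           # softmax cross-entropy gradient
    W -= lr * (p.T @ h) / n

probe_acc = accuracy(W, h, y)           # high: the probe recovers the rotation
print(f"output head: {head_acc:.2f}  trained probe: {probe_acc:.2f}")
```

Because the rotation is linear, a linear probe suffices to recover it; the gap between `head_acc` and `probe_acc` is the quantity the table below reports per layer.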
Results
| Model | Layer | Depth % | Output head accuracy | Trained probe accuracy | Information gain (bits) |
|---|---|---|---|---|---|
| nanoGPT 27.5M | L1 | 12% | 48.2% | 93.5% | +2.7 |
| nanoGPT 27.5M | L4 | 50% | 64.9% | 98.4% | +1.8 |
| nanoGPT 27.5M | L8 (final) | 100% | 66.8% | 98.9% | +1.7 |
| Qwen2.5-7B | L1 | 4% | 0.6% | 8.3% | +3.4 |
| Qwen2.5-7B | L8 | 29% | 2.5% | 50.3% | +11.9 |
| Qwen2.5-7B | L14 | 50% | 2.8% | 92.8% | +15.2 |
| Qwen2.5-7B | L21 | 75% | 9.1% | 100% | +12.7 |
| Qwen2.5-7B | L27 | 96% | 67.1% | 100% | +2.7 |
| Qwen2.5-7B | L28 (final) | 100% | 46.7% | 97.8% | +4.8 |
| Qwen3-4B | L1 | 3% | 0.0% | 6.6% | +11.8 |
| Qwen3-4B | L18 | 50% | 2.8% | 67.4% | +10.6 |
| Qwen3-4B | L26 | 72% | 21.3% | 100% | +9.0 |
| Qwen3-4B | L35 | 97% | 69.6% | 100% | +2.1 |
| Qwen3-4B | L36 (final) | 100% | 60.5% | 99.2% | +4.3 |
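The post does not define the "Information gain (bits)" column. One reading consistent with its magnitudes (up to ~15 bits, against a roughly 150k-entry vocabulary whose maximum entropy is about 17.2 bits) is the drop in per-token cross-entropy when the trained probe replaces the output head, converted from nats to bits. A minimal sketch under that assumption; the loss values below are hypothetical, chosen only to land near the table's L14 figure:

```python
import math

def info_gain_bits(ce_head_nats: float, ce_probe_nats: float) -> float:
    """Per-token cross-entropy reduction, converted from nats to bits."""
    return (ce_head_nats - ce_probe_nats) / math.log(2)

# Hypothetical per-token losses: head at 10.9 nats vs probe at 0.37 nats
# gives roughly the +15.2-bit gain reported for Qwen2.5-7B at L14.
print(round(info_gain_bits(10.9, 0.37), 1))
```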
Key findings
- CORRECTION: The 92.8% probe accuracy was an overfitting artifact: a 1536-dim linear probe trained on only 356 tokens will memorize its training set. With 4181 tokens and a proper train/test split, the real probe accuracy at mid-depth is roughly 20%. The information gap between the probe and the output head is real, but far smaller than originally reported.
- "Noise layers" still contain more information than the output head can read. Even with the corrected methodology, the trained probe outperforms the output head at intermediate layers; the geometric rotation is real, just not as dramatic as 92.8% vs 2.8% suggested.
- The penultimate layer outperforms the final layer, and this finding survives the correction. Gemma 4 E2B shows 48.2% accuracy at L34 vs 39.9% at L35 (final); Qwen2.5-7B showed 67.1% at L27 vs 46.7% at L28. Final-layer degradation is robust across architectures.
- The phenomenon is architecture-universal. Replication on Gemma 4 E2B (35 layers, a different architecture family from Qwen) confirms: 60% of layers are noise layers when read through the output head, and the final layer degrades predictions.
- Information builds monotonically; alignment builds late. This qualitative pattern holds even with the corrected probe numbers: the output head cannot read intermediate representations, and the last ~25% of layers primarily rotate representations into the output basis.
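The overfitting failure mode behind the correction is easy to reproduce: with far more probe dimensions than training tokens, a linear probe can memorize labels that carry no information at all. The sketch below uses the post's 1536-dim probe width but a toy 50-class vocabulary and purely random labels; the dimensions and optimizer settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 1536, 50                  # probe width from the post; toy vocab
n_train, n_test = 356, 1000      # 356 = the original, too-small probe corpus

# Features carry NO label information: labels are assigned at random.
X_tr, y_tr = rng.normal(size=(n_train, d)), rng.integers(0, k, n_train)
X_te, y_te = rng.normal(size=(n_test, d)), rng.integers(0, k, n_test)

# Plain softmax-regression gradient descent on the training split.
W = np.zeros((k, d))
for _ in range(500):
    logits = X_tr @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(n_train), y_tr] -= 1.0
    W -= 0.1 * (p.T @ X_tr) / n_train

train_acc = float((np.argmax(X_tr @ W.T, axis=1) == y_tr).mean())
test_acc = float((np.argmax(X_te @ W.T, axis=1) == y_te).mean())
print(f"train: {train_acc:.2f}  test: {test_acc:.2f}")
```

Because 356 points in 1536 dimensions are linearly separable under any labeling, train accuracy approaches 100% while held-out accuracy stays at chance (~2% here). This is exactly why an unsplit probe corpus inflates "probe accuracy", and why only the held-out number measures recoverable information.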
Lesson learned
CORRECTION (2026-04-09): Our original probe accuracy (92.8%) was the result of overfitting on 356 tokens. With 4181 tokens and a proper train/test split, the real accuracy is roughly 20% at mid-depth. The information gap is real but far smaller than reported; final-layer degradation (48.2% → 39.9%) holds. Lesson: always split your probing data. A 1536-dim probe on 356 tokens will memorize anything.

Original lesson (partially superseded): Measuring a layer's usefulness by applying the output head directly is deeply misleading at scale. The output head is calibrated only for the final layer's basis. Intermediate layers contain linearly accessible information that a per-layer probe can extract, but the magnitude of this gap was overstated because the probe overfit a tiny corpus. The qualitative finding (information present but rotated) holds; the quantitative claim (92.8% vs 2.8%) does not.
Tools used
Analysis code generated by Claude Opus (claude-opus-4-6) via Claude Code. Qwen models loaded via HuggingFace Transformers with 4-bit quantization. nanoGPT checkpoint trained on TinyStories.