Transformer "Noise Layers" Contain Hidden Information the Output Head Can't Read (CORRECTED: the original 92.8%-vs-2.8% probe figure was an overfitting artifact)
@eazevedo
Question
When a transformer's output head (lm_head) gets near-zero accuracy at intermediate layers, is next-token information genuinely absent, or is it present in a different geometric basis that the output head can't read?
Setup
For each layer of three models (nanoGPT 27.5M, 8 layers; Qwen2.5-7B, 28 layers, 4-bit quantized; Qwen3-4B, 36 layers, 4-bit quantized) we extract hidden states and measure two things: (1) the accuracy of the shared output head (lm_head) applied directly to that layer's output, and (2) the accuracy of a per-layer trained linear probe (nn.Linear(hidden_dim, vocab_size)) initialized from the lm_head weights and fine-tuned for 500 steps. If probe >> lm_head, the information is present but sits in a rotated basis the output head cannot read; if probe ≈ lm_head ≈ 0, the information is genuinely absent. Evaluation corpus: the TinyStories validation set (188 tokens) for all three models, plus 14 Wikipedia sentences for the Qwen models to test in-distribution behavior. The models are frozen throughout; this is pure analysis, with no training or modification of the models themselves.
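The comparison can be sketched at toy scale. This is not the original analysis code: the "intermediate layer" is simulated as a random orthogonal rotation of the head's basis, and plain full-batch gradient descent in NumPy stands in for the post's 500 fine-tuning steps. All sizes here are illustrative, far below real model scale.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 16, 32, 512        # toy vocab size, hidden dim, token count

# "Final-basis" class directions: the output head is calibrated for these.
W_head = rng.normal(size=(k, d))

# Simulated intermediate-layer states: the same class information, but
# rotated into a different basis by a random orthogonal matrix Q.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
y = rng.integers(0, k, size=n)
h = W_head[y] @ Q + 0.1 * rng.normal(size=(n, d))

def accuracy(W, states, labels):
    return float((np.argmax(states @ W.T, axis=1) == labels).mean())

head_acc = accuracy(W_head, h, y)   # near chance: head reads the wrong basis

# Linear probe: initialized from the head weights, fine-tuned with
# full-batch softmax-regression gradient descent.
W = W_head.copy()
lr = 0.5
for _ in range(500):
    logits = h @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(n), y] -= 1.0           # softmax cross-entropy gradient
    W -= lr * (p.T @ h) / n

probe_acc = accuracy(W, h, y)           # high: the probe recovers the rotation
print(f"output head: {head_acc:.2f}  trained probe: {probe_acc:.2f}")
```

Because the rotation is linear, a linear probe suffices to recover it; the gap between `head_acc` and `probe_acc` is the quantity the table below reports per layer.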
Results
| Model | Layer | Depth % | Output head accuracy | Trained probe accuracy | Information gain (bits) |
|---|---|---|---|---|---|
| nanoGPT 27.5M | L1 | 12% | 48.2% | 93.5% | +2.7 |
| nanoGPT 27.5M | L4 | 50% | 64.9% | 98.4% | +1.8 |
| nanoGPT 27.5M | L8 (final) | 100% | 66.8% | 98.9% | +1.7 |
| Qwen2.5-7B | L1 | 4% | 0.6% | 8.3% | +3.4 |
| Qwen2.5-7B | L8 | 29% | 2.5% | 50.3% | +11.9 |
| Qwen2.5-7B | L14 | 50% | 2.8% | 92.8% | +15.2 |
| Qwen2.5-7B | L21 | 75% | 9.1% | 100% | +12.7 |
| Qwen2.5-7B | L27 | 96% | 67.1% | 100% | +2.7 |
| Qwen2.5-7B | L28 (final) | 100% | 46.7% | 97.8% | +4.8 |
| Qwen3-4B | L1 | 3% | 0.0% | 6.6% | +11.8 |
| Qwen3-4B | L18 | 50% | 2.8% | 67.4% | +10.6 |
| Qwen3-4B | L26 | 72% | 21.3% | 100% | +9.0 |
| Qwen3-4B | L35 | 97% | 69.6% | 100% | +2.1 |
| Qwen3-4B | L36 (final) | 100% | 60.5% | 99.2% | +4.3 |
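The post does not define the "Information gain (bits)" column. One reading consistent with its magnitudes (up to ~15 bits, against a roughly 150k-entry vocabulary whose maximum entropy is about 17.2 bits) is the drop in per-token cross-entropy when the trained probe replaces the output head, converted from nats to bits. A minimal sketch under that assumption; the loss values below are hypothetical, chosen only to land near the table's L14 figure:

```python
import math

def info_gain_bits(ce_head_nats: float, ce_probe_nats: float) -> float:
    """Per-token cross-entropy reduction, converted from nats to bits."""
    return (ce_head_nats - ce_probe_nats) / math.log(2)

# Hypothetical per-token losses: head at 10.9 nats vs probe at 0.37 nats
# gives roughly the +15.2-bit gain reported for Qwen2.5-7B at L14.
print(round(info_gain_bits(10.9, 0.37), 1))
```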
Key findings
- CORRECTION: The 92.8% probe accuracy was an overfitting artifact: a 1536-dim linear probe trained on only 356 tokens will memorize its training set. With 4181 tokens and a proper train/test split, the real probe accuracy at mid-depth is roughly 20%. The information gap between the probe and the output head is real, but far smaller than originally reported.
- "Noise layers" still contain more information than the output head can read. Even with the corrected methodology, the trained probe outperforms the output head at intermediate layers; the geometric rotation is real, just not as dramatic as 92.8% vs 2.8% suggested.
- The penultimate layer outperforms the final layer, and this finding survives the correction. Gemma 4 E2B shows 48.2% accuracy at L34 vs 39.9% at L35 (final); Qwen2.5-7B showed 67.1% at L27 vs 46.7% at L28. Final-layer degradation is robust across architectures.
- The phenomenon is architecture-universal. Replication on Gemma 4 E2B (35 layers, a different architecture family from Qwen) confirms: 60% of layers are noise layers when read through the output head, and the final layer degrades predictions.
- Information builds monotonically; alignment builds late. This qualitative pattern holds even with the corrected probe numbers: the output head cannot read intermediate representations, and the last ~25% of layers primarily rotate representations into the output basis.
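The overfitting failure mode behind the correction is easy to reproduce: with far more probe dimensions than training tokens, a linear probe can memorize labels that carry no information at all. The sketch below uses the post's 1536-dim probe width but a toy 50-class vocabulary and purely random labels; the dimensions and optimizer settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 1536, 50                  # probe width from the post; toy vocab
n_train, n_test = 356, 1000      # 356 = the original, too-small probe corpus

# Features carry NO label information: labels are assigned at random.
X_tr, y_tr = rng.normal(size=(n_train, d)), rng.integers(0, k, n_train)
X_te, y_te = rng.normal(size=(n_test, d)), rng.integers(0, k, n_test)

# Plain softmax-regression gradient descent on the training split.
W = np.zeros((k, d))
for _ in range(500):
    logits = X_tr @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(n_train), y_tr] -= 1.0
    W -= 0.1 * (p.T @ X_tr) / n_train

train_acc = float((np.argmax(X_tr @ W.T, axis=1) == y_tr).mean())
test_acc = float((np.argmax(X_te @ W.T, axis=1) == y_te).mean())
print(f"train: {train_acc:.2f}  test: {test_acc:.2f}")
```

Because 356 points in 1536 dimensions are linearly separable under any labeling, train accuracy approaches 100% while held-out accuracy stays at chance (~2% here). This is exactly why an unsplit probe corpus inflates "probe accuracy", and why only the held-out number measures recoverable information.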
Lesson learned
CORRECTION (2026-04-09): Our original probe accuracy (92.8%) was the result of overfitting on 356 tokens. With 4181 tokens and a proper train/test split, the real accuracy is roughly 20% at mid-depth. The information gap is real but far smaller than reported; final-layer degradation (48.2% → 39.9%) holds. Lesson: always split your probing data. A 1536-dim probe on 356 tokens will memorize anything.

Original lesson (partially superseded): Measuring a layer's usefulness by applying the output head directly is deeply misleading at scale. The output head is calibrated only for the final layer's basis. Intermediate layers contain linearly accessible information that a per-layer probe can extract, but the magnitude of this gap was overstated because the probe overfit a tiny corpus. The qualitative finding (information present but rotated) holds; the quantitative claim (92.8% vs 2.8%) does not.
Tools used
Analysis code generated by Claude Opus (claude-opus-4-6) via Claude Code. Qwen models loaded via HuggingFace Transformers with 4-bit quantization. nanoGPT checkpoint trained on TinyStories.