Question
How does modifying the attention mask geometry at inference (sliding window, block-diagonal, foveal) affect a pre-trained transformer's performance, and is there a critical horizon size?
Setup
Model: Qwen2.5-7B (4-bit quantized, eager attention implementation) on RTX 3060 Ti (8 GB). Attention masks modified at inference only (no retraining). Conditions: full causal, sliding window (sizes 8/16/32/64/128), foveal (near dense + dilated far), block-diagonal (sizes 8/16/32). Tested on 4 text genres: narrative (93 tokens), technical (74 tokens), dialogue (88 tokens), philosophical (79 tokens). Additional analyses: figure-ground token identification from attention patterns, critical horizon sweep on 335-token combined passage, layer-wise attention distance/locality/entropy across all 28 layers, gestalt emergence on fragmented vs complete text pairs.
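Each geometry reduces to a boolean keep-mask over (query, key) positions. Below is a minimal sketch of the four mask constructions in PyTorch; function names are illustrative, the foveal dilation anchored to absolute key position (rather than distance from the query) is an assumption, and the exact injection point into Qwen2.5's eager attention is not shown:

```python
import torch

def causal(n: int) -> torch.Tensor:
    # Lower-triangular keep-mask: True = query i may attend to key j (j <= i).
    return torch.ones(n, n, dtype=torch.bool).tril()

def sliding_window(n: int, window: int) -> torch.Tensor:
    # Each query sees itself plus the (window - 1) preceding tokens.
    idx = torch.arange(n)
    return causal(n) & ((idx[:, None] - idx[None, :]) < window)

def block_diagonal(n: int, block: int) -> torch.Tensor:
    # Causal attention confined to fixed blocks; no edges cross a boundary.
    blocks = torch.arange(n) // block
    return causal(n) & (blocks[:, None] == blocks[None, :])

def foveal(n: int, near: int, stride: int) -> torch.Tensor:
    # Dense over the last `near` tokens, plus every `stride`-th key in the
    # far field (absolute-position anchoring is an assumption).
    idx = torch.arange(n)
    near_field = (idx[:, None] - idx[None, :]) < near
    far_field = (idx[None, :] % stride) == 0
    return causal(n) & (near_field | far_field)

def to_additive(keep: torch.Tensor, dtype=torch.float32) -> torch.Tensor:
    # 0 where attention is allowed, dtype-min where masked, the additive
    # float form most HF attention implementations expect.
    return torch.zeros_like(keep, dtype=dtype).masked_fill(~keep, torch.finfo(dtype).min)
```

In this scheme, `to_additive(sliding_window(n, 64))` would stand in for the default causal mask at inference.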
Results
| Attention Type | Narrative Loss | Technical Loss | Dialogue Loss | Philosophical Loss | Avg Loss (× Full Causal) |
|---|---|---|---|---|---|
| Full causal | 2.29 | 2.02 | 2.25 | 2.53 | 1.00x |
| Sliding window 8 | 4.15 | 3.36 | 3.53 | 3.81 | 1.64x |
| Sliding window 16 | 3.18 | 2.36 | 2.81 | 3.17 | 1.27x |
| Sliding window 32 | 2.74 | 2.45 | 2.89 | 3.14 | 1.23x |
| Sliding window 64 | 2.44 | 2.08 | 2.50 | 2.68 | 1.07x |
| Sliding window 128 | 2.29 | 2.02 | 2.25 | 2.53 | 1.00x |
| Foveal (8 near + every 4th far) | 2.72 | 2.56 | 2.78 | 3.15 | 1.24x |
| Foveal (16 near + every 8th far) | 2.79 | 2.50 | 2.92 | 2.95 | 1.23x |
| Block-diagonal 8 | 4.72 | 4.41 | 4.16 | 5.22 | 2.04x |
| Block-diagonal 16 | 3.76 | 3.59 | 3.46 | 4.08 | 1.64x |
| Block-diagonal 32 | 2.95 | 2.65 | 2.81 | 3.51 | 1.31x |

(Sliding window 128 is identical to full causal because every test passage is shorter than 128 tokens.)
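A teacher-forced scoring loop along these lines would reproduce the table, assuming the additive mask can be handed to the model as a 4D `(1, 1, n, n)` tensor (recent transformers versions accept this for eager attention; the exact harness used here is an assumption):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def masked_loss(model, input_ids: torch.Tensor, keep: torch.Tensor) -> float:
    # input_ids: (1, n); keep: (n, n) boolean keep-mask from the sketch above.
    # Assumes model.dtype is a float compute dtype even under 4-bit weights.
    n = input_ids.shape[1]
    additive = torch.zeros(1, 1, n, n, dtype=model.dtype, device=input_ids.device)
    additive.masked_fill_(~keep.to(input_ids.device), torch.finfo(model.dtype).min)
    logits = model(input_ids=input_ids, attention_mask=additive).logits
    # Teacher forcing: predict token t+1 from positions up to t.
    return F.cross_entropy(logits[0, :-1].float(), input_ids[0, 1:]).item()
```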
Key findings
- Block-diagonal attention (fragmented perception) is catastrophic: 2.04x baseline loss at block size 8, far worse than a sliding window of the same nominal size (1.64x). A narrow but continuous perceptual field beats a wider fragmented one: sliding window 16 (1.27x) even edges out block-diagonal 32 (1.31x).
- Critical horizon for 90% performance recovery: 64 tokens; for 95% recovery: 256 tokens. Beyond 64 tokens, additional context yields only marginal gains on natural language.
- Layers 4-5 show a 'perceptual snap': mean attention distance jumps abruptly from ~15 (local) to ~37 (near-maximum), with top-5 sparsity reaching 0.86. This is not a gradual transition but a discrete phase shift from local syntactic to distant semantic attention (see the measurement sketch after this list).
- Figure tokens (most attended) are semantic anchors: content words like 'old', 'lighthouse', 'boat', 'ships', 'keeper'. Ground tokens (least attended) are structural: 'the', 'a', 'into', sentence-final punctuation. This pattern strengthens in mid-layers.
- Foveal attention (near dense + dilated far) performs no better than a plain sliding window of equivalent total coverage (1.23-1.24x vs 1.23x for window 32), despite spending part of its budget on long-range keys. The model was never trained with dilated patterns, so this geometry confuses rather than helps.
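A sketch of the two analyses referenced above: per-layer mean attention distance and top-5 mass (the 'perceptual snap'), and per-token attention received (figure/ground). Head-averaging and the top-5 'sparsity' definition, summed probability on the five most-attended keys, are assumptions; `output_attentions=True` requires the eager attention implementation noted in the setup:

```python
import torch

@torch.no_grad()
def layer_stats(model, input_ids: torch.Tensor):
    # Per-layer (mean attention distance, top-5 mass) from attention weights.
    out = model(input_ids=input_ids, output_attentions=True)
    stats = []
    for attn in out.attentions:                      # (1, heads, n, n) per layer
        a = attn[0].mean(dim=0)                      # head-average -> (n, n)
        n = a.shape[-1]
        dist = (torch.arange(n)[:, None] - torch.arange(n)[None, :]).clamp(min=0)
        mean_dist = (a * dist.to(a)).sum(-1).mean()  # expected lookback per query
        top5 = a.topk(min(5, n), dim=-1).values.sum(-1).mean()
        stats.append((mean_dist.item(), top5.item()))
    return stats

@torch.no_grad()
def figure_ground(model, input_ids: torch.Tensor, layer: int, k: int = 5):
    # Figure = tokens receiving the most total attention (column sums of the
    # head-averaged matrix); ground = tokens receiving the least.
    a = model(input_ids=input_ids, output_attentions=True).attentions[layer][0].mean(0)
    received = a.sum(dim=0)
    return received.topk(k).indices, received.topk(k, largest=False).indices
```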
Lesson learned
Continuity of the attention field matters more than its total size. Block-diagonal attention with block_size=16 has the same nominal context budget as a sliding window of 16 but performs far worse (1.64x vs 1.27x) because it cannot attend across block boundaries. This has practical implications for efficient attention design: chunked/blocked attention schemes should overlap rather than partition, as in the sketch below.
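As a concrete contrast, an overlapped variant is a small change to the block-diagonal construction from the setup sketch; this illustrates the design point and was not one of the tested conditions:

```python
import torch

def overlapping_blocks(n: int, block: int, overlap: int) -> torch.Tensor:
    # Block attention where each query also sees the trailing `overlap`
    # tokens of the previous block, keeping context continuous across
    # boundaries. True = query may attend to key. Illustrative only.
    idx = torch.arange(n)
    blocks = idx // block
    same_block = blocks[:, None] == blocks[None, :]
    prev_tail = (blocks[:, None] == blocks[None, :] + 1) & \
                ((idx[None, :] % block) >= block - overlap)
    causal = idx[:, None] >= idx[None, :]
    return causal & (same_block | prev_tail)
```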
Tools used
Claude Opus 4 for experiment design and code generation. Qwen2.5-7B (4-bit, eager attention) as the model under study. Custom attention mask injection at inference time.