terminus.ink
EXP-008

Perceptual Geometry of Attention: Fragmented vs Continuous Fields (Merleau-Ponty)

@eazevedo

Question

How does modifying the attention mask geometry at inference (sliding window, block-diagonal, foveal) affect a pre-trained transformer's performance, and is there a critical horizon size?

Setup

  • Model: Qwen2.5-7B (4-bit quantized, eager attention implementation) on an RTX 3060 Ti (8 GB). Attention masks modified at inference only; no retraining.
  • Conditions: full causal; sliding window (sizes 8/16/32/64/128); foveal (near dense + dilated far); block-diagonal (sizes 8/16/32).
  • Test texts, four genres: narrative (93 tokens), technical (74 tokens), dialogue (88 tokens), philosophical (79 tokens).
  • Additional analyses: figure-ground token identification from attention patterns; critical horizon sweep on a 335-token combined passage; layer-wise attention distance, locality, and entropy across all 28 layers; gestalt emergence on fragmented vs complete text pairs.
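The four mask geometries can be written as boolean allow-matrices. This is a minimal plain-Python sketch, not the experiment's actual injection code; function names and the exact foveal dilation rule are illustrative:

```python
def causal_mask(n):
    # Full causal: query i may attend to every key j <= i.
    return [[j <= i for j in range(n)] for i in range(n)]

def sliding_window_mask(n, window):
    # Causal, but only the most recent `window` tokens are visible.
    return [[j <= i and i - j < window for j in range(n)] for i in range(n)]

def block_diagonal_mask(n, block):
    # Causal within fixed blocks; no attention across block boundaries.
    return [[j <= i and i // block == j // block for j in range(n)] for i in range(n)]

def foveal_mask(n, near, stride):
    # Dense attention to the last `near` tokens, plus every `stride`-th older token.
    return [[j <= i and (i - j < near or (i - j) % stride == 0)
             for j in range(n)] for i in range(n)]
```

Note that for equal parameter values the block-diagonal and sliding-window masks admit a comparable number of keys per query, which is what makes the loss comparison below a test of field continuity rather than field size.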

Results

Loss by genre (lower is better); Avg = average loss relative to full causal.

Attention type                     Narrative  Technical  Dialogue  Philosophical  Avg
Full causal                          2.29       2.02       2.25       2.53        1.00x
Sliding window 8                     4.15       3.36       3.53       3.81        1.64x
Sliding window 16                    3.18       2.36       2.81       3.17        1.27x
Sliding window 32                    2.74       2.45       2.89       3.14        1.23x
Sliding window 64                    2.44       2.08       2.50       2.68        1.07x
Sliding window 128                   2.29       2.02       2.25       2.53        1.00x
Foveal (8 near + every 4th far)      2.72       2.56       2.78       3.15        1.24x
Foveal (16 near + every 8th far)     2.79       2.50       2.92       2.95        1.23x
Block-diagonal 8                     4.72       4.41       4.16       5.22        2.04x
Block-diagonal 16                    3.76       3.59       3.46       4.08        1.64x
Block-diagonal 32                    2.95       2.65       2.81       3.51        1.31x

Key findings

  • Block-diagonal attention (fragmented perception) is catastrophic at 2.04x baseline loss — far worse than sliding window of equivalent total token coverage (1.64x). A narrow but continuous perceptual field outperforms a wide but fragmented one.
  • Critical horizon for 90% performance recovery: 64 tokens. For 95% recovery: 256 tokens. Beyond 64 tokens, marginal gains are minimal for natural language.
  • Layers 4-5 show a 'perceptual snap': attention distance jumps abruptly from ~15 (local) to ~37 (near-maximum) with top-5 sparsity reaching 0.86. This is not a gradual transition — it is a discrete phase shift from local syntactic to distant semantic attention.
  • Figure tokens (most attended) are semantic anchors: content words such as 'old', 'lighthouse', 'boat', 'ships', and 'keeper'. Ground tokens (least attended) are structural: 'the', 'a', 'into', and sentence-final punctuation. This pattern strengthens in mid-layers.
  • Foveal attention (near dense + dilated far) underperforms simple sliding windows of equivalent coverage. The model was not trained with dilated patterns, so this geometry confuses rather than helps.
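The locality statistics behind the 'perceptual snap' finding can be computed per layer from an attention matrix roughly as follows. This is a sketch; the experiment's exact definitions of attention distance and top-5 sparsity may differ:

```python
def mean_attention_distance(attn):
    # attn[i][j]: weight query i places on key j (rows sum to 1; causal, so j <= i).
    # Returns the attention-weighted mean distance |i - j|, averaged over queries.
    n = len(attn)
    return sum(sum(attn[i][j] * (i - j) for j in range(i + 1)) for i in range(n)) / n

def topk_mass(attn, k=5):
    # Average fraction of each query's attention mass held by its k largest weights.
    # Values near 1.0 indicate sparse, sharply-peaked attention.
    n = len(attn)
    return sum(sum(sorted(row, reverse=True)[:k]) for row in attn) / n
```

On this definition, a layer attending only to the current token scores distance 0, while one attending only to position 0 scores the mean query index; the reported jump from ~15 to ~37 sits near the ceiling for the passage lengths used here.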

Lesson learned

Continuity of the attention field matters more than its total size. Block-diagonal attention with block_size=16 sees the same number of tokens as a sliding window of 16, but performs far worse (1.64x vs 1.27x) because it cannot attend across block boundaries. This has practical implications for efficient attention design: chunked/blocked attention schemes should overlap rather than partition.
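The suggested overlap can be sketched as a mask in which each block also sees the tail of the preceding block, restoring continuity across boundaries. Parameter names here are my own, not from the experiment:

```python
def overlapping_block_mask(n, block, overlap):
    # Like block-diagonal attention, but each query additionally attends to
    # the last `overlap` tokens of the previous block, so no query sits at a
    # hard perceptual boundary with zero context behind it.
    def allowed(i, j):
        if j > i:
            return False  # causal
        bi, bj = i // block, j // block
        if bi == bj:
            return True   # within own block
        return bj == bi - 1 and j % block >= block - overlap
    return [[allowed(i, j) for j in range(n)] for i in range(n)]
```

With block=16 and even a small overlap, the first tokens of each block regain a continuous local window, which is exactly what the plain block-diagonal condition denies them.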

Tools used

Claude Opus 4 for experiment design and code generation. Qwen2.5-7B (4-bit, eager attention) as the model under study. Custom attention mask injection at inference time.
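Mask injection at inference typically amounts to converting a boolean allow-matrix into the additive bias that eager attention adds to pre-softmax scores. A minimal sketch of that conversion, with the actual hook into Qwen2.5-7B's forward pass omitted:

```python
def to_additive_bias(mask, neg=float("-inf")):
    # 0.0 where attention is allowed, -inf where it is masked out,
    # so softmax assigns exactly zero weight to masked positions.
    return [[0.0 if ok else neg for ok in row] for row in mask]
```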