Question
How does modifying the attention mask geometry at inference (sliding window, block-diagonal, foveal) affect a pre-trained transformer's performance, and is there a critical horizon size?
Setup
Model: Qwen2.5-7B (4-bit quantized, eager attention implementation) on RTX 3060 Ti (8 GB). Attention masks modified at inference only (no retraining). Conditions: full causal, sliding window (sizes 8/16/32/64/128), foveal (near dense + dilated far), block-diagonal (sizes 8/16/32). Tested on 4 text genres: narrative (93 tokens), technical (74 tokens), dialogue (88 tokens), philosophical (79 tokens). Additional analyses: figure-ground token identification from attention patterns, critical horizon sweep on 335-token combined passage, layer-wise attention distance/locality/entropy across all 28 layers, gestalt emergence on fragmented vs complete text pairs.
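Each geometry reduces to a boolean keep-mask over (query, key) positions. Below is a minimal sketch of the four mask constructions in PyTorch; function names are illustrative, the foveal dilation anchored to absolute key position (rather than distance from the query) is an assumption, and the exact injection point into Qwen2.5's eager attention is not shown:

```python
import torch

def causal(n: int) -> torch.Tensor:
    # Lower-triangular keep-mask: True = query i may attend to key j (j <= i).
    return torch.ones(n, n, dtype=torch.bool).tril()

def sliding_window(n: int, window: int) -> torch.Tensor:
    # Each query sees itself plus the (window - 1) preceding tokens.
    idx = torch.arange(n)
    return causal(n) & ((idx[:, None] - idx[None, :]) < window)

def block_diagonal(n: int, block: int) -> torch.Tensor:
    # Causal attention confined to fixed blocks; no edges cross a boundary.
    blocks = torch.arange(n) // block
    return causal(n) & (blocks[:, None] == blocks[None, :])

def foveal(n: int, near: int, stride: int) -> torch.Tensor:
    # Dense over the last `near` tokens, plus every `stride`-th key in the
    # far field (absolute-position anchoring is an assumption).
    idx = torch.arange(n)
    near_field = (idx[:, None] - idx[None, :]) < near
    far_field = (idx[None, :] % stride) == 0
    return causal(n) & (near_field | far_field)

def to_additive(keep: torch.Tensor, dtype=torch.float32) -> torch.Tensor:
    # 0 where attention is allowed, dtype-min where masked, the additive
    # float form most HF attention implementations expect.
    return torch.zeros_like(keep, dtype=dtype).masked_fill(~keep, torch.finfo(dtype).min)
```

In this scheme, `to_additive(sliding_window(n, 64))` would stand in for the default causal mask at inference.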
Results
| Attention Type | Narrative Loss | Technical Loss | Dialogue Loss | Philosophical Loss | Avg Loss (× Full Causal) |
|---|---|---|---|---|---|
| Full causal | 2.29 | 2.02 | 2.25 | 2.53 | 1.00x |
| Sliding window 8 | 4.15 | 3.36 | 3.53 | 3.81 | 1.64x |
| Sliding window 16 | 3.18 | 2.36 | 2.81 | 3.17 | 1.27x |
| Sliding window 32 | 2.74 | 2.45 | 2.89 | 3.14 | 1.23x |
| Sliding window 64 | 2.44 | 2.08 | 2.50 | 2.68 | 1.07x |
| Sliding window 128 | 2.29 | 2.02 | 2.25 | 2.53 | 1.00x |
| Foveal (8 near + every 4th far) | 2.72 | 2.56 | 2.78 | 3.15 | 1.24x |
| Foveal (16 near + every 8th far) | 2.79 | 2.50 | 2.92 | 2.95 | 1.23x |
| Block-diagonal 8 | 4.72 | 4.41 | 4.16 | 5.22 | 2.04x |
| Block-diagonal 16 | 3.76 | 3.59 | 3.46 | 4.08 | 1.64x |
| Block-diagonal 32 | 2.95 | 2.65 | 2.81 | 3.51 | 1.31x |

(Sliding window 128 is identical to full causal because every test passage is shorter than 128 tokens.)
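A teacher-forced scoring loop along these lines would reproduce the table, assuming the additive mask can be handed to the model as a 4D `(1, 1, n, n)` tensor (recent transformers versions accept this for eager attention; the exact harness used here is an assumption):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def masked_loss(model, input_ids: torch.Tensor, keep: torch.Tensor) -> float:
    # input_ids: (1, n); keep: (n, n) boolean keep-mask from the sketch above.
    # Assumes model.dtype is a float compute dtype even under 4-bit weights.
    n = input_ids.shape[1]
    additive = torch.zeros(1, 1, n, n, dtype=model.dtype, device=input_ids.device)
    additive.masked_fill_(~keep.to(input_ids.device), torch.finfo(model.dtype).min)
    logits = model(input_ids=input_ids, attention_mask=additive).logits
    # Teacher forcing: predict token t+1 from positions up to t.
    return F.cross_entropy(logits[0, :-1].float(), input_ids[0, 1:]).item()
```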
Key findings
- Block-diagonal attention (fragmented perception) is catastrophic: 2.04x baseline loss at block size 8, far worse than a sliding window of the same nominal size (1.64x). A narrow but continuous perceptual field beats a wider fragmented one: sliding window 16 (1.27x) even edges out block-diagonal 32 (1.31x).
- Critical horizon for 90% performance recovery: 64 tokens; for 95% recovery: 256 tokens. Beyond 64 tokens, additional context yields only marginal gains on natural language.
- Layers 4-5 show a 'perceptual snap': mean attention distance jumps abruptly from ~15 (local) to ~37 (near-maximum), with top-5 sparsity reaching 0.86. This is not a gradual transition but a discrete phase shift from local syntactic to distant semantic attention (see the measurement sketch after this list).
- Figure tokens (most attended) are semantic anchors: content words like 'old', 'lighthouse', 'boat', 'ships', 'keeper'. Ground tokens (least attended) are structural: 'the', 'a', 'into', sentence-final punctuation. This pattern strengthens in mid-layers.
- Foveal attention (near dense + dilated far) performs no better than a plain sliding window of equivalent total coverage (1.23-1.24x vs 1.23x for window 32), despite spending part of its budget on long-range keys. The model was never trained with dilated patterns, so this geometry confuses rather than helps.
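A sketch of the two analyses referenced above: per-layer mean attention distance and top-5 mass (the 'perceptual snap'), and per-token attention received (figure/ground). Head-averaging and the top-5 'sparsity' definition, summed probability on the five most-attended keys, are assumptions; `output_attentions=True` requires the eager attention implementation noted in the setup:

```python
import torch

@torch.no_grad()
def layer_stats(model, input_ids: torch.Tensor):
    # Per-layer (mean attention distance, top-5 mass) from attention weights.
    out = model(input_ids=input_ids, output_attentions=True)
    stats = []
    for attn in out.attentions:                      # (1, heads, n, n) per layer
        a = attn[0].mean(dim=0)                      # head-average -> (n, n)
        n = a.shape[-1]
        dist = (torch.arange(n)[:, None] - torch.arange(n)[None, :]).clamp(min=0)
        mean_dist = (a * dist.to(a)).sum(-1).mean()  # expected lookback per query
        top5 = a.topk(min(5, n), dim=-1).values.sum(-1).mean()
        stats.append((mean_dist.item(), top5.item()))
    return stats

@torch.no_grad()
def figure_ground(model, input_ids: torch.Tensor, layer: int, k: int = 5):
    # Figure = tokens receiving the most total attention (column sums of the
    # head-averaged matrix); ground = tokens receiving the least.
    a = model(input_ids=input_ids, output_attentions=True).attentions[layer][0].mean(0)
    received = a.sum(dim=0)
    return received.topk(k).indices, received.topk(k, largest=False).indices
```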
Lesson learned
Continuity of the attention field matters more than its total size. Block-diagonal attention with block_size=16 has the same nominal context budget as a sliding window of 16 but performs far worse (1.64x vs 1.27x) because it cannot attend across block boundaries. This has practical implications for efficient attention design: chunked/blocked attention schemes should overlap rather than partition, as in the sketch below.
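As a concrete contrast, an overlapped variant is a small change to the block-diagonal construction from the setup sketch; this illustrates the design point and was not one of the tested conditions:

```python
import torch

def overlapping_blocks(n: int, block: int, overlap: int) -> torch.Tensor:
    # Block attention where each query also sees the trailing `overlap`
    # tokens of the previous block, keeping context continuous across
    # boundaries. True = query may attend to key. Illustrative only.
    idx = torch.arange(n)
    blocks = idx // block
    same_block = blocks[:, None] == blocks[None, :]
    prev_tail = (blocks[:, None] == blocks[None, :] + 1) & \
                ((idx[None, :] % block) >= block - overlap)
    causal = idx[:, None] >= idx[None, :]
    return causal & (same_block | prev_tail)
```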
Tools used
Claude Opus 4 for experiment design and code generation. Qwen2.5-7B (4-bit, eager attention) as the model under study. Custom attention mask injection at inference time.