# EXP-008: Perceptual Geometry of Attention: Fragmented vs Continuous Fields (Merleau-Ponty)

**Date:** 2026-04-08
**Author:** @eazevedo
**Tags:** #attention, #transformer, #perceptual-geometry, #sliding-window, #llm-internals, #qwen, #philosophy, #attention-masking

## Question

How does modifying the attention mask geometry at inference (sliding window, block-diagonal, foveal) affect a pre-trained transformer's performance, and is there a critical horizon size?

## Setup

- **Model:** Qwen2.5-7B (4-bit quantized, eager attention implementation) on an RTX 3060 Ti (8 GB).
- **Intervention:** attention masks modified at inference only; no retraining.
- **Conditions:** full causal; sliding window (sizes 8/16/32/64/128); foveal (near dense + dilated far); block-diagonal (sizes 8/16/32).
- **Data:** 4 text genres: narrative (93 tokens), technical (74 tokens), dialogue (88 tokens), philosophical (79 tokens).
- **Additional analyses:** figure-ground token identification from attention patterns; critical horizon sweep on a 335-token combined passage; layer-wise attention distance/locality/entropy across all 28 layers; gestalt emergence on fragmented vs complete text pairs.
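The four mask geometries can be sketched as boolean visibility matrices (True = query row may attend to key column). This is a minimal NumPy reconstruction of the conditions as described above, not the exact experiment code; the function names and the foveal parameterization (`near` dense tokens plus every `stride`-th far token) are my own labels.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Full causal: token i attends to every j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n: int, window: int) -> np.ndarray:
    """Causal, but only the most recent `window` tokens stay visible."""
    m = causal_mask(n)
    for i in range(n):
        m[i, : max(0, i - window + 1)] = False
    return m

def block_diagonal_mask(n: int, block: int) -> np.ndarray:
    """Causal within fixed blocks; nothing crosses a block boundary."""
    m = causal_mask(n)
    for i in range(n):
        m[i, : (i // block) * block] = False
    return m

def foveal_mask(n: int, near: int, stride: int) -> np.ndarray:
    """Dense over the last `near` tokens, every `stride`-th token farther back."""
    m = sliding_window_mask(n, near)
    for i in range(n):
        far = np.arange(0, max(0, i - near + 1), stride)
        m[i, far] = True
    return m
```

Note how block-diagonal differs from sliding window even at equal size: a token just past a block boundary sees almost nothing, which is the fragmentation the experiment probes.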

## Results

| Attention Type | Narrative Loss | Technical Loss | Dialogue Loss | Philosophical Loss | Avg Relative to Full |
| --- | --- | --- | --- | --- | --- |
| Full causal | 2.29 | 2.02 | 2.25 | 2.53 | 1.00x |
| Sliding window 8 | 4.15 | 3.36 | 3.53 | 3.81 | 1.64x |
| Sliding window 16 | 3.18 | 2.36 | 2.81 | 3.17 | 1.27x |
| Sliding window 32 | 2.74 | 2.45 | 2.89 | 3.14 | 1.23x |
| Sliding window 64 | 2.44 | 2.08 | 2.50 | 2.68 | 1.07x |
| Sliding window 128 | 2.29 | 2.02 | 2.25 | 2.53 | 1.00x |
| Foveal (8 near + every 4th far) | 2.72 | 2.56 | 2.78 | 3.15 | 1.24x |
| Foveal (16 near + every 8th far) | 2.79 | 2.50 | 2.92 | 2.95 | 1.23x |
| Block-diagonal 8 | 4.72 | 4.41 | 4.16 | 5.22 | 2.04x |
| Block-diagonal 16 | 3.76 | 3.59 | 3.46 | 4.08 | 1.64x |
| Block-diagonal 32 | 2.95 | 2.65 | 2.81 | 3.51 | 1.31x |

## Key Findings

- Block-diagonal attention (fragmented perception) is catastrophic at 2.04x baseline loss — far worse than sliding window of equivalent total token coverage (1.64x). A narrow but continuous perceptual field outperforms a wide but fragmented one.
- Critical horizon (from the 335-token sweep): 64 tokens for 90% performance recovery, 256 tokens for 95%. Beyond 64 tokens, marginal gains are minimal for natural language.
- Layers 4-5 show a 'perceptual snap': attention distance jumps abruptly from ~15 (local) to ~37 (near-maximum) with top-5 sparsity reaching 0.86. This is not a gradual transition — it is a discrete phase shift from local syntactic to distant semantic attention.
- Figure tokens (most attended) are semantic anchors: content words such as 'old', 'lighthouse', 'boat', 'ships', 'keeper'. Ground tokens (least attended) are structural: 'the', 'a', 'into', sentence-final punctuation. This pattern strengthens in mid-layers.
- Foveal attention (near dense + dilated far) underperforms simple sliding windows of equivalent coverage. The model was not trained with dilated patterns, so this geometry confuses rather than helps.
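The layer-wise locality metrics behind the 'perceptual snap' finding (mean attention distance, entropy) can be computed directly from a layer's attention weights. A minimal sketch, assuming `attn` is a `(heads, seq, seq)` array of post-softmax probabilities; this is a generic reconstruction of the metrics, not the experiment's exact code:

```python
import numpy as np

def attention_distance(attn: np.ndarray) -> float:
    """Mean attended distance |i - j|, weighted by attention probability,
    averaged over heads and query positions."""
    h, n, _ = attn.shape
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])  # (n, n)
    return float((attn * dist).sum() / (h * n))

def attention_entropy(attn: np.ndarray) -> float:
    """Mean Shannon entropy (nats) of each query's attention distribution.
    Low entropy = sharply focused attention (high top-k sparsity)."""
    p = np.clip(attn, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())
```

A jump in `attention_distance` from ~15 to ~37 between adjacent layers, with entropy dropping, is the discrete local-to-semantic shift described above.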

## Lesson Learned

Continuity of the attention field matters more than its total size. Block-diagonal attention with block_size=16 has the same maximum attention span as a sliding window of 16, but performs far worse (1.64x vs 1.27x) because it cannot attend across block boundaries. This has practical implications for efficient attention design: chunked/blocked attention schemes should overlap rather than partition.
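The overlap-rather-than-partition suggestion can be sketched as a block mask where each block also sees the tail of the preceding block. This variant was not tested in EXP-008; it is a hypothetical illustration of the lesson:

```python
import numpy as np

def overlapped_block_mask(n: int, block: int, overlap: int) -> np.ndarray:
    """Causal block mask where each block also sees the last `overlap`
    tokens before its own start, so context survives block boundaries.
    (Hypothetical variant suggested by the lesson, not run in EXP-008.)"""
    m = np.tril(np.ones((n, n), dtype=bool))
    for i in range(n):
        start = (i // block) * block
        m[i, : max(0, start - overlap)] = False
    return m
```

With `block=4, overlap=2`, the token right after a boundary sees 3 tokens instead of 1, removing the "blind start" that makes strict partitioning catastrophic.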

## Tools Used

Claude Opus 4 for experiment design and code generation. Qwen2.5-7B (4-bit, eager attention) as the model under study. Custom attention mask injection at inference time.
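Mask injection at inference generally amounts to converting a boolean visibility mask into an additive bias (0 where visible, a large negative where masked) applied to attention logits before the softmax. A generic sketch of that conversion, not the exact hook used in this experiment:

```python
import numpy as np

def to_additive_bias(mask: np.ndarray, neg: float = -1e9) -> np.ndarray:
    """Boolean visibility mask (True = attend) -> additive logit bias."""
    return np.where(mask, 0.0, neg)

def masked_softmax(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Softmax over the last axis with masked positions suppressed."""
    z = logits + to_additive_bias(mask)
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

Because the bias is added before softmax, masked positions get effectively zero probability while visible positions renormalize among themselves; eager (non-fused) attention is needed so the custom mask is actually honored.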

---
Source: https://terminus.ink/e/2026-04-08-perceptual-geometry-of-attention-fragmented-vs-continuous-fields-merleau-ponty
