# qwen
7 experiments
## Cross-Model Replication: Surprisal Typology Clusters by Family in Both Qwen and Gemma 4
Does the finding that surprisal curves cluster by language family replicate across different model architectures (Qwen2.5-7B dense vs Gemma 4 E2B MoE)?
- Family clustering replicates across architectures, but with important caveats. Gemma 4 shows 2.52x family ratio vs Qwen'…
- The within-family distance appears identical (0.0073) across both models — this is a rounding coincidence. Actual values…
## Surprisal Typology: 12 Languages Cluster by Family, Not Word Order
Does the surprisal-by-sentence-position profile in an LLM cluster by language family (genealogy) or by syntactic word order (SOV vs SVO)?
- Language family clusters, word order doesn't. Within-family curve distance is 0.0073 vs between-family 0.0110 (1.51x rat…
- Romance languages form the tightest cluster. Portuguese, Spanish, French, and Italian have nearly overlapping surprisal …
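The family-vs-word-order comparison above reduces to a single number: the mean pairwise distance between per-position surprisal curves within a family versus across families. A minimal numpy sketch, with synthetic curves standing in for the experiment's data (the languages, curve shapes, and noise levels here are placeholders):

```python
import numpy as np
from itertools import combinations

def family_ratio(curves, families):
    """Between-family / within-family mean pairwise curve distance.

    curves:   dict lang -> 1D array of mean surprisal per sentence position
    families: dict lang -> family label
    A ratio > 1 means curves cluster by family.
    """
    within, between = [], []
    for a, b in combinations(curves, 2):
        d = np.mean(np.abs(curves[a] - curves[b]))  # mean absolute gap
        (within if families[a] == families[b] else between).append(d)
    return np.mean(between) / np.mean(within)

# Synthetic illustration: two "families" with distinct curve slopes.
rng = np.random.default_rng(0)
pos = np.linspace(0, 1, 20)
curves, families = {}, {}
for lang in ["pt", "es", "fr", "it"]:
    curves[lang] = 5 - 2 * pos + rng.normal(0, 0.05, 20)
    families[lang] = "romance"
for lang in ["de", "nl"]:
    curves[lang] = 5 - 1 * pos + rng.normal(0, 0.05, 20)
    families[lang] = "germanic"

print(round(family_ratio(curves, families), 2))  # well above 1: family clustering
```

The same ratio computed with word-order groupings in place of families would test the competing hypothesis.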
## Distribution Geometry Across Languages: Turkish as Morphological Outlier
How do output distribution shape, attention head specialization, and surprisal rhythm vary across languages and text genres in a multilingual LLM?
- Turkish is a distribution outlier across every metric: lowest top-1 accuracy (37%), highest entropy (3.69), lowest kurto…
- Zero global attention heads exist out of 784 total. Head type distribution: 49% mixed, 29% sparse, 22% local, <1% diagon…
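The distribution-shape metrics in this experiment (entropy, kurtosis, top-1 confidence) can all be read off a single next-token logits vector. A sketch, assuming kurtosis is taken over the probability vector itself, which is one plausible reading of the metric, not a confirmed detail of the experiment:

```python
import numpy as np

def distribution_stats(logits):
    """Shape metrics for one next-token distribution (logits: 1D array)."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()            # softmax
    entropy = -np.sum(p * np.log(p + 1e-12))   # in nats
    m, s = p.mean(), p.std() + 1e-12           # epsilon guards the flat case
    kurt = np.mean(((p - m) / s) ** 4) - 3     # excess kurtosis of the prob vector
    return entropy, kurt, p.max()              # p.max() ~ top-1 confidence

# Peaked vs flat toy distributions over a 10-token "vocabulary":
# a peaked distribution has low entropy and high kurtosis, a flat one the reverse.
peaked = np.array([8.0] + [0.0] * 9)
flat = np.zeros(10)
for name, lg in [("peaked", peaked), ("flat", flat)]:
    h, k, top1 = distribution_stats(lg)
    print(name, round(h, 2), round(k, 2), round(top1, 2))
```

On this reading, Turkish's profile (high entropy, low kurtosis, low top-1) corresponds to flatter, less committed next-token distributions.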
## Perceptual Geometry of Attention: Fragmented vs Continuous Fields (Merleau-Ponty)
How does modifying the attention mask geometry at inference (sliding window, block-diagonal, foveal) affect a pre-trained transformer's performance, and is there a critical horizon size?
- Block-diagonal attention (fragmented perception) is catastrophic at 2.04x baseline loss — far worse than sliding window …
- Critical horizon for 90% performance recovery: 64 tokens. For 95% recovery: 256 tokens. Beyond 64 tokens, marginal gains…
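Two of the mask geometries are straightforward to construct as boolean causal masks. A sketch (window and block sizes here are illustrative, not the experiment's settings) that makes the key structural difference visible: a sliding window always preserves the immediately preceding token, while a block boundary severs it:

```python
import numpy as np

def sliding_window_mask(n, w):
    """Causal mask: token i attends to the last w tokens (itself included)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < w)

def block_diagonal_mask(n, block):
    """Causal mask fragmented into independent blocks: no cross-block attention."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i // block == j // block)

m = sliding_window_mask(8, 3)
b = block_diagonal_mask(8, 4)
# Token 4 sees its neighbor token 3 under the sliding window,
# but not under the block-diagonal mask (they fall in different blocks).
print(m[4, 3], b[4, 3])
```

That severed local context is a plausible mechanism for why fragmented attention is so much worse than an equally small continuous window.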
## Shadow Distributions Reveal Pragmatic Meaning in Suppressed Tokens (Derrida)
Does the suppressed part of a language model's output distribution (the non-argmax tokens) carry pragmatic and social meaning that the chosen tokens don't?
- Euphemism and register shifts amplify maximally in the shadow (2.3-2.5x). 'Let go' vs 'fired' differ modestly on the sur…
- Irony amplifies 1.66x — the literal meaning persists in the shadow distribution even when the model outputs the ironic i…
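One way to operationalize "amplifies in the shadow" is to zero out the argmax token, renormalize, and compare how far two distributions diverge before and after. A sketch of that reading; the L1 (total-variation-style) distance is an assumption for illustration, not necessarily the experiment's metric:

```python
import numpy as np

def shadow(p):
    """Suppressed part of a distribution: drop the argmax token, renormalize."""
    q = p.copy()
    q[np.argmax(q)] = 0.0
    return q / q.sum()

def amplification(p1, p2):
    """Ratio of shadow divergence to surface divergence (L1 distance)."""
    surface = np.abs(p1 - p2).sum()
    shade = np.abs(shadow(p1) - shadow(p2)).sum()
    return shade / surface

# Two toy distributions sharing the same argmax but differing in the tail:
# on the surface they look similar; in the shadow the difference dominates.
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.7, 0.1, 0.2])
print(round(amplification(p1, p2), 2))  # ratio > 1: the difference lives in the shadow
```

An amplification above 1 is the signature described for euphemism and irony: the chosen token hides a contrast that the suppressed mass preserves.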
## Speech Act Classification from LLM Hidden States (Austin/Searle)
Can a pre-trained language model distinguish between speech act types (assertive, directive, commissive, expressive, declarative) in its hidden states?
- Part A (binary probe) is confounded: 100% accuracy at the embedding layer means it separates grammatical person ('I prom…
- 95% five-way speech act classification is genuine. The 5-way task forces the probe to distinguish WITHIN the same gramma…
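A linear probe of the kind described, with a held-out split so the reported accuracy means something, can be sketched in plain numpy. The "hidden states" below are synthetic class-conditional features standing in for frozen model activations; the dimensions and class count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim, n = 5, 16, 300
means = rng.normal(0, 1, (n_classes, dim))        # per-class structure
y = rng.integers(0, n_classes, n)
X = means[y] + rng.normal(0, 0.3, (n, dim))       # synthetic "hidden states"

Xtr, ytr, Xte, yte = X[:200], y[:200], X[200:], y[200:]

# Softmax-regression probe trained by gradient descent on frozen features.
W = np.zeros((dim, n_classes))
onehot = np.eye(n_classes)[ytr]
for _ in range(500):
    z = Xtr @ W
    p = np.exp(z - z.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    W -= 0.5 * Xtr.T @ (p - onehot) / len(ytr)

test_acc = (np.argmax(Xte @ W, 1) == yte).mean()
print(test_acc)  # held-out accuracy: high only if the signal is real
```

The held-out split is what separates a genuine 5-way result from the grammatical-person confound flagged in the binary probe: the probe only scores well on unseen examples if the feature it reads generalizes.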
## Transformer "Noise Layers" Contain Massive Hidden Information — 92.8% Probe Accuracy Where Output Head Gets 2.8%
When a transformer's output head (lm_head) gets near-zero accuracy at intermediate layers, is next-token information genuinely absent, or is it present in a different geometric basis that the output head can't read?
- CORRECTION: The 92.8% probe accuracy was an artifact of overfitting — a 1536-dim linear probe on only 356 tokens will me…
- "Noise layers" still contain more information than the output head can read. Even with corrected methodology, the traine…
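The overfitting failure mode flagged in the correction is easy to reproduce: when probe dimensionality far exceeds the number of tokens, a linear probe fits even random labels on pure noise. A sketch with illustrative dimensions, smaller than the 1536-dim / 356-token setup but in the same regime:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n, n_classes = 256, 80, 5        # dim >> n_samples, as in the flagged setup
X = rng.normal(0, 1, (n, dim))        # pure noise: no signal to find
y = rng.integers(0, n_classes, n)     # random labels

Xtr, Xte, ytr, yte = X[:60], X[60:], y[:60], y[60:]

# Least-squares linear probe onto one-hot targets. With more parameters
# than training points, the min-norm solution memorizes the labels exactly.
W, *_ = np.linalg.lstsq(Xtr, np.eye(n_classes)[ytr], rcond=None)
train_acc = (np.argmax(Xtr @ W, 1) == ytr).mean()
test_acc = (np.argmax(Xte @ W, 1) == yte).mean()
print(train_acc, test_acc)  # train near 1.0, held-out near chance (0.2)
```

Perfect training accuracy on noise is exactly why the original 92.8% figure needed a held-out control before it could be read as recovered information.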