# qwen
7 experiments
## Cross-Model Replication: Surprisal Typology Clusters by Family in Both Qwen and Gemma 4
Does the finding that surprisal curves cluster by language family replicate across different model architectures (Qwen2.5-7B dense vs Gemma 4 E2B MoE)?
- Family clustering replicates across architectures, but with important caveats. Gemma 4 shows 2.52x family ratio vs Qwen'…
- The within-family distance appears identical (0.0073) across both models — this is a rounding coincidence. Actual values…
## Surprisal Typology: 12 Languages Cluster by Family, Not Word Order
Does the surprisal-by-sentence-position profile in an LLM cluster by language family (genealogy) or by syntactic word order (SOV vs SVO)?
- Language family clusters, word order doesn't. Within-family curve distance is 0.0073 vs between-family 0.0110 (1.51x rat…
- Romance languages form the tightest cluster. Portuguese, Spanish, French, and Italian have nearly overlapping surprisal …
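The family-vs-word-order comparison above reduces to a single number: the mean pairwise distance between per-position surprisal curves within a family versus across families. A minimal numpy sketch, with synthetic curves standing in for the experiment's data (the languages, curve shapes, and noise levels here are placeholders):

```python
import numpy as np
from itertools import combinations

def family_ratio(curves, families):
    """Between-family / within-family mean pairwise curve distance.

    curves:   dict lang -> 1D array of mean surprisal per sentence position
    families: dict lang -> family label
    A ratio > 1 means curves cluster by family.
    """
    within, between = [], []
    for a, b in combinations(curves, 2):
        d = np.mean(np.abs(curves[a] - curves[b]))  # mean absolute gap
        (within if families[a] == families[b] else between).append(d)
    return np.mean(between) / np.mean(within)

# Synthetic illustration: two "families" with distinct curve slopes.
rng = np.random.default_rng(0)
pos = np.linspace(0, 1, 20)
curves, families = {}, {}
for lang in ["pt", "es", "fr", "it"]:
    curves[lang] = 5 - 2 * pos + rng.normal(0, 0.05, 20)
    families[lang] = "romance"
for lang in ["de", "nl"]:
    curves[lang] = 5 - 1 * pos + rng.normal(0, 0.05, 20)
    families[lang] = "germanic"

print(round(family_ratio(curves, families), 2))  # well above 1: family clustering
```

The same ratio computed with word-order groupings in place of families would test the competing hypothesis.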
## Distribution Geometry Across Languages: Turkish as Morphological Outlier
How do output distribution shape, attention head specialization, and surprisal rhythm vary across languages and text genres in a multilingual LLM?
- Turkish is a distribution outlier across every metric: lowest top-1 accuracy (37%), highest entropy (3.69), lowest kurto…
- Zero global attention heads exist out of 784 total. Head type distribution: 49% mixed, 29% sparse, 22% local, <1% diagon…
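The distribution-shape metrics in this experiment (entropy, kurtosis, top-1 confidence) can all be read off a single next-token logits vector. A sketch, assuming kurtosis is taken over the probability vector itself, which is one plausible reading of the metric, not a confirmed detail of the experiment:

```python
import numpy as np

def distribution_stats(logits):
    """Shape metrics for one next-token distribution (logits: 1D array)."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()            # softmax
    entropy = -np.sum(p * np.log(p + 1e-12))   # in nats
    m, s = p.mean(), p.std() + 1e-12           # epsilon guards the flat case
    kurt = np.mean(((p - m) / s) ** 4) - 3     # excess kurtosis of the prob vector
    return entropy, kurt, p.max()              # p.max() ~ top-1 confidence

# Peaked vs flat toy distributions over a 10-token "vocabulary":
# a peaked distribution has low entropy and high kurtosis, a flat one the reverse.
peaked = np.array([8.0] + [0.0] * 9)
flat = np.zeros(10)
for name, lg in [("peaked", peaked), ("flat", flat)]:
    h, k, top1 = distribution_stats(lg)
    print(name, round(h, 2), round(k, 2), round(top1, 2))
```

On this reading, Turkish's profile (high entropy, low kurtosis, low top-1) corresponds to flatter, less committed next-token distributions.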
## Perceptual Geometry of Attention: Fragmented vs Continuous Fields (Merleau-Ponty)
How does modifying the attention mask geometry at inference (sliding window, block-diagonal, foveal) affect a pre-trained transformer's performance, and is there a critical horizon size?
- Block-diagonal attention (fragmented perception) is catastrophic at 2.04x baseline loss — far worse than sliding window …
- Critical horizon for 90% performance recovery: 64 tokens. For 95% recovery: 256 tokens. Beyond 64 tokens, marginal gains…
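Two of the mask geometries are straightforward to construct as boolean causal masks. A sketch (window and block sizes here are illustrative, not the experiment's settings) that makes the key structural difference visible: a sliding window always preserves the immediately preceding token, while a block boundary severs it:

```python
import numpy as np

def sliding_window_mask(n, w):
    """Causal mask: token i attends to the last w tokens (itself included)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < w)

def block_diagonal_mask(n, block):
    """Causal mask fragmented into independent blocks: no cross-block attention."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i // block == j // block)

m = sliding_window_mask(8, 3)
b = block_diagonal_mask(8, 4)
# Token 4 sees its neighbor token 3 under the sliding window,
# but not under the block-diagonal mask (they fall in different blocks).
print(m[4, 3], b[4, 3])
```

That severed local context is a plausible mechanism for why fragmented attention is so much worse than an equally small continuous window.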
## Shadow Distributions Reveal Pragmatic Meaning in Suppressed Tokens (Derrida)
Does the suppressed part of a language model's output distribution (the non-argmax tokens) carry pragmatic and social meaning that the chosen tokens don't?
- Euphemism and register shifts amplify maximally in the shadow (2.3-2.5x). 'Let go' vs 'fired' differ modestly on the sur…
- Irony amplifies 1.66x — the literal meaning persists in the shadow distribution even when the model outputs the ironic i…
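One way to operationalize "amplifies in the shadow" is to zero out the argmax token, renormalize, and compare how far two distributions diverge before and after. A sketch of that reading; the L1 (total-variation-style) distance is an assumption for illustration, not necessarily the experiment's metric:

```python
import numpy as np

def shadow(p):
    """Suppressed part of a distribution: drop the argmax token, renormalize."""
    q = p.copy()
    q[np.argmax(q)] = 0.0
    return q / q.sum()

def amplification(p1, p2):
    """Ratio of shadow divergence to surface divergence (L1 distance)."""
    surface = np.abs(p1 - p2).sum()
    shade = np.abs(shadow(p1) - shadow(p2)).sum()
    return shade / surface

# Two toy distributions sharing the same argmax but differing in the tail:
# on the surface they look similar; in the shadow the difference dominates.
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.7, 0.1, 0.2])
print(round(amplification(p1, p2), 2))  # ratio > 1: the difference lives in the shadow
```

An amplification above 1 is the signature described for euphemism and irony: the chosen token hides a contrast that the suppressed mass preserves.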
## Speech Act Classification from LLM Hidden States (Austin/Searle)
Can a pre-trained language model distinguish between speech act types (assertive, directive, commissive, expressive, declarative) in its hidden states?
- Part A (binary probe) is confounded: 100% accuracy at the embedding layer means it separates grammatical person ('I prom…
- 95% five-way speech act classification is genuine. The 5-way task forces the probe to distinguish WITHIN the same gramma…
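A linear probe of the kind described, with a held-out split so the reported accuracy means something, can be sketched in plain numpy. The "hidden states" below are synthetic class-conditional features standing in for frozen model activations; the dimensions and class count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim, n = 5, 16, 300
means = rng.normal(0, 1, (n_classes, dim))        # per-class structure
y = rng.integers(0, n_classes, n)
X = means[y] + rng.normal(0, 0.3, (n, dim))       # synthetic "hidden states"

Xtr, ytr, Xte, yte = X[:200], y[:200], X[200:], y[200:]

# Softmax-regression probe trained by gradient descent on frozen features.
W = np.zeros((dim, n_classes))
onehot = np.eye(n_classes)[ytr]
for _ in range(500):
    z = Xtr @ W
    p = np.exp(z - z.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    W -= 0.5 * Xtr.T @ (p - onehot) / len(ytr)

test_acc = (np.argmax(Xte @ W, 1) == yte).mean()
print(test_acc)  # held-out accuracy: high only if the signal is real
```

The held-out split is what separates a genuine 5-way result from the grammatical-person confound flagged in the binary probe: the probe only scores well on unseen examples if the feature it reads generalizes.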
## Transformer "Noise Layers" Contain Massive Hidden Information — 92.8% Probe Accuracy Where Output Head Gets 2.8%
When a transformer's output head (lm_head) gets near-zero accuracy at intermediate layers, is next-token information genuinely absent, or is it present in a different geometric basis that the output head can't read?
- CORRECTION: The 92.8% probe accuracy was an artifact of overfitting — a 1536-dim linear probe on only 356 tokens will me…
- "Noise layers" still contain more information than the output head can read. Even with corrected methodology, the traine…
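The overfitting failure mode flagged in the correction is easy to reproduce: when probe dimensionality far exceeds the number of tokens, a linear probe fits even random labels on pure noise. A sketch with illustrative dimensions, smaller than the 1536-dim / 356-token setup but in the same regime:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n, n_classes = 256, 80, 5        # dim >> n_samples, as in the flagged setup
X = rng.normal(0, 1, (n, dim))        # pure noise: no signal to find
y = rng.integers(0, n_classes, n)     # random labels

Xtr, Xte, ytr, yte = X[:60], X[60:], y[:60], y[60:]

# Least-squares linear probe onto one-hot targets. With more parameters
# than training points, the min-norm solution memorizes the labels exactly.
W, *_ = np.linalg.lstsq(Xtr, np.eye(n_classes)[ytr], rcond=None)
train_acc = (np.argmax(Xtr @ W, 1) == ytr).mean()
test_acc = (np.argmax(Xte @ W, 1) == yte).mean()
print(train_acc, test_acc)  # train near 1.0, held-out near chance (0.2)
```

Perfect training accuracy on noise is exactly why the original 92.8% figure needed a held-out control before it could be read as recovered information.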