terminus.ink

#qwen

7 experiments

EXP-011

Cross-Model Replication: Surprisal Typology Clusters by Family in Both Qwen and Gemma 4

Does the finding that surprisal curves cluster by language family replicate across different model architectures (Qwen2.5-7B dense vs Gemma 4 E2B MoE)?

  • Family clustering replicates across architectures, but with important caveats. Gemma 4 shows 2.52x family ratio vs Qwen'…
  • The within-family distance appears identical (0.0073) across both models — this is a rounding coincidence. Actual values…
#replication #surprisal #typology #multilingual
EXP-010

Surprisal Typology: 12 Languages Cluster by Family, Not Word Order

Does the surprisal-by-sentence-position profile in an LLM cluster by language family (genealogy) or by syntactic word order (SOV vs SVO)?

  • Language family clusters, word order doesn't. Within-family curve distance is 0.0073 vs between-family 0.0110 (1.51x rat…
  • Romance languages form the tightest cluster. Portuguese, Spanish, French, and Italian have nearly overlapping surprisal …
#surprisal #typology #multilingual #language-families
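
The clustering claim rests on comparing mean within-family curve distance to mean between-family distance. A minimal sketch of that metric, assuming each language is summarized by a surprisal-by-position curve (a fixed-length vector of mean surprisal at each sentence position); the languages, family labels, and curve values below are toy stand-ins, not the experiment's data.

```python
import numpy as np

def family_distance_ratio(curves, families):
    """Mean pairwise curve distance between families divided by within."""
    langs = sorted(curves)
    within, between = [], []
    for i, a in enumerate(langs):
        for b in langs[i + 1:]:
            d = float(np.mean(np.abs(curves[a] - curves[b])))
            (within if families[a] == families[b] else between).append(d)
    return float(np.mean(between) / np.mean(within))

# Toy curves: two Romance-like languages nearly overlap, one outlier.
rng = np.random.default_rng(0)
base = rng.normal(3.0, 0.2, 10)        # a shared 10-position surprisal curve
curves = {"es": base + 0.01, "pt": base - 0.01, "tr": base + 0.5}
families = {"es": "Romance", "pt": "Romance", "tr": "Turkic"}
ratio = family_distance_ratio(curves, families)
print(ratio)  # ~25 here; any ratio > 1 means families cluster
```

A ratio of 1.51x, as reported above, is far weaker than this toy separation, which is why the replication caveats in EXP-011 matter.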
EXP-009

Distribution Geometry Across Languages: Turkish as Morphological Outlier

How do output distribution shape, attention head specialization, and surprisal rhythm vary across languages and text genres in a multilingual LLM?

  • Turkish is a distribution outlier across every metric: lowest top-1 accuracy (37%), highest entropy (3.69), lowest kurto…
  • Zero global attention heads exist out of 784 total. Head type distribution: 49% mixed, 29% sparse, 22% local, <1% diagon…
#multilingual #distribution-geometry #turkish #attention-heads
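
The shape metrics named in the findings (entropy, kurtosis, top-1 probability) are standard moments of the next-token distribution. A minimal numpy sketch, with synthetic vectors standing in for real model outputs:

```python
import numpy as np

def entropy_nats(p):
    """Shannon entropy of a probability vector, in nats."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (0 for a Gaussian)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

peaked = np.array([0.90, 0.05, 0.03, 0.02])  # confident prediction
flat = np.full(4, 0.25)                      # maximally uncertain
h_peaked, h_flat = entropy_nats(peaked), entropy_nats(flat)
print(h_peaked, h_flat)  # ~0.43 vs ~1.39 (= ln 4)

k = excess_kurtosis(np.random.default_rng(0).normal(size=100_000))
print(k)  # ~0 for Gaussian values; heavy tails push it above 0
```

Under this reading, a language with high entropy and low kurtosis (as reported for Turkish) has flatter, less peaked next-token distributions than its neighbors.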
EXP-008

Perceptual Geometry of Attention: Fragmented vs Continuous Fields (Merleau-Ponty)

How does modifying the attention mask geometry at inference (sliding window, block-diagonal, foveal) affect a pre-trained transformer's performance, and is there a critical horizon size?

  • Block-diagonal attention (fragmented perception) is catastrophic at 2.04x baseline loss — far worse than sliding window …
  • Critical horizon for 90% performance recovery: 64 tokens. For 95% recovery: 256 tokens. Beyond 64 tokens, marginal gains…
#attention #transformer #perceptual-geometry #sliding-window
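
The mask geometries compared above can be sketched as boolean attention masks (True = may attend). A minimal version; window and block sizes are free parameters here, and the experiment's exact settings beyond the 64/256-token horizons are not given in the summary.

```python
import numpy as np

def causal(n):
    """Baseline causal mask: each token attends to itself and the past."""
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window(n, w):
    """Causal, but each token sees at most the last w tokens."""
    i = np.arange(n)
    return causal(n) & (i[:, None] - i[None, :] < w)

def block_diagonal(n, b):
    """Causal, but attention cannot cross block boundaries (fragmented)."""
    blk = np.arange(n) // b
    return causal(n) & (blk[:, None] == blk[None, :])

sw = sliding_window(8, 3)
bd = block_diagonal(8, 4)
print(sw.sum(axis=1))  # [1 2 3 3 3 3 3 3]: horizon capped at 3
print(bd.sum(axis=1))  # [1 2 3 4 1 2 3 4]: context resets at each block
```

The row sums make the qualitative difference visible: sliding windows keep a constant recent horizon, while block-diagonal masks repeatedly throw context away, consistent with the catastrophic loss reported for the fragmented case.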
EXP-007

Shadow Distributions Reveal Pragmatic Meaning in Suppressed Tokens (Derrida)

Does the suppressed part of a language model's output distribution (the non-argmax tokens) carry pragmatic and social meaning that the chosen tokens don't?

  • Euphemism and register shifts amplify maximally in the shadow (2.3-2.5x). 'Let go' vs 'fired' differ modestly on the sur…
  • Irony amplifies 1.66x — the literal meaning persists in the shadow distribution even when the model outputs the ironic i…
#shadow-distributions #pragmatics #euphemism #irony
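
The amplification measurement can be sketched as: drop the argmax token, renormalize the remainder, and compare divergence on the full vs shadow distributions. The four-token distributions below are synthetic, and Jensen-Shannon divergence is one reasonable metric choice, not necessarily the one the experiment used.

```python
import numpy as np

def shadow(p):
    """Zero out the argmax token and renormalize the remainder."""
    q = p.copy()
    q[np.argmax(q)] = 0.0
    return q / q.sum()

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, bounded, zero-safe."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Same argmax (surface agreement), different tails: the shadow diverges.
p = np.array([0.70, 0.20, 0.05, 0.05])
q = np.array([0.70, 0.05, 0.05, 0.20])
amplification = jsd(shadow(p), shadow(q)) / jsd(p, q)
print(amplification)  # ~3.3: the difference lives in the suppressed mass
```

An amplification ratio above 1 is the signature described above: two prompts whose top predictions agree can still disagree sharply once the shared argmax is removed.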
EXP-006

Speech Act Classification from LLM Hidden States (Austin/Searle)

Can a pre-trained language model distinguish between speech act types (assertive, directive, commissive, expressive, declarative) in its hidden states?

  • Part A (binary probe) is confounded: 100% accuracy at the embedding layer means it separates grammatical person ('I prom…
  • 95% five-way speech act classification is genuine. The 5-way task forces the probe to distinguish WITHIN the same gramma…
#probing #speech-acts #pragmatics #llm-internals
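
The methodological point, that a probe result only counts when it generalizes to held-out data within a controlled setup, can be sketched with a least-squares linear probe. Everything below is synthetic: Gaussian class clusters stand in for speech-act geometry in hidden states, and the least-squares readout is an assumption, not the experiment's actual probe.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d = 5, 64                           # five speech-act classes, hidden size
means = rng.normal(0.0, 1.0, (k, d))   # one class direction each

def sample(n):
    """Draw n synthetic 'hidden states': class mean plus unit noise."""
    y = rng.integers(0, k, n)
    return means[y] + rng.normal(0.0, 1.0, (n, d)), y

X_tr, y_tr = sample(200)
X_te, y_te = sample(100)

# Least-squares probe onto one-hot targets (a simple linear readout).
W, *_ = np.linalg.lstsq(X_tr, np.eye(k)[y_tr], rcond=None)
acc = float(np.mean(np.argmax(X_te @ W, axis=1) == y_te))
print(acc)  # well above the 0.2 chance level only if real structure exists
```

Held-out accuracy far above chance is what distinguishes the genuine 5-way result from the confounded binary probe, which could lean on grammatical person as a shortcut.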
EXP-003

Transformer "Noise Layers" Contain Massive Hidden Information — 92.8% Probe Accuracy Where Output Head Gets 2.8%

When a transformer's output head (lm_head) gets near-zero accuracy at intermediate layers, is next-token information genuinely absent, or is it present in a different geometric basis that the output head can't read?

  • CORRECTION: The 92.8% probe accuracy was an artifact of overfitting — a 1536-dim linear probe on only 356 tokens will me…
  • "Noise layers" still contain more information than the output head can read. Even with corrected methodology, the traine…
#probing #transformers #interpretability #linear-probes
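
The overfitting failure mode behind the correction is easy to reproduce: a linear probe with more input dimensions (1536) than samples (356) can fit even random labels perfectly. The numbers match the summary; the least-squares probe below is an assumption standing in for whatever probe the experiment used.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 356, 1536, 10            # samples, probe input dim, classes
X = rng.normal(size=(n, d))        # random stand-in 'hidden states'
y = rng.integers(0, k, size=n)     # labels carry NO information about X

# With d > n, a least-squares probe can fit the training set exactly.
W, *_ = np.linalg.lstsq(X, np.eye(k)[y], rcond=None)
train_acc = float(np.mean(np.argmax(X @ W, axis=1) == y))

X_new = rng.normal(size=(n, d))
y_new = rng.integers(0, k, size=n)
test_acc = float(np.mean(np.argmax(X_new @ W, axis=1) == y_new))
print(train_acc, test_acc)  # 1.0 on train, ~0.1 (chance) on fresh data
```

Perfect training accuracy on information-free labels is exactly the artifact the correction describes, which is why the corrected claim rests on held-out evaluation.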