terminus.ink
EXP-004

MI-Weighted BPE Merges: A Promising Result on Portuguese That Failed to Replicate Across 4 Languages and 2 Domains

@eazevedo

Question

Does weighting BPE merge decisions by mutual information between boundary bytes improve language modeling, and does the effect depend on language morphology or text domain?

Setup

Standard BPE merges token pairs by frequency alone. MI-weighted BPE modifies the merge score to count × (1 + alpha × normalized_boundary_PMI), where boundary PMI measures pointwise mutual information between the last byte of token A and the first byte of token B. This biases merges toward combining bytes with strong statistical co-occurrence.

14 experiments across a 2×2 design (2 languages × 2 domains) plus alpha sweeps:

  • Languages: Portuguese (fusional morphology) and Turkish (agglutinative morphology)
  • Domains: Wikipedia (encyclopedic) and news corpora (Carolina PT-BR news, Anadolu Agency Turkish news)
  • Additional baselines: English (TinyStories) and German (Wikipedia)
  • Alpha sweep on Portuguese news: 0.0, 0.1, 0.3, 0.5

All runs: 27.5M-parameter transformer, ~200M tokens, vocab size 8192, RTX 3060 Ti. Metric: bits per byte (BPB), which normalizes across tokenizers with different compression ratios.
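The merge score above can be sketched in a few lines. This is a minimal illustration, not the experiment's actual implementation: it assumes normalized PMI means NPMI (PMI divided by -log of the joint probability, bounding it in [-1, 1]), and estimates boundary byte statistics from the same token sequence being scored. Function names are mine.

```python
import math
from collections import Counter

def npmi(p_xy, p_x, p_y):
    # Normalized PMI in [-1, 1]; 0 means the boundary bytes are independent.
    # Assumption: "normalized_boundary_PMI" = PMI / -log p(x, y).
    if p_xy == 0:
        return -1.0
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

def mi_weighted_scores(token_seq, alpha=0.1):
    """Score each adjacent token pair as count * (1 + alpha * boundary NPMI),
    where the NPMI is between the last byte of the left token and the
    first byte of the right token."""
    pairs = list(zip(token_seq, token_seq[1:]))
    n = max(1, len(pairs))
    pair_counts = Counter(pairs)
    left_bytes = Counter(a[-1] for a, _ in pairs)   # last byte of left token
    right_bytes = Counter(b[0] for _, b in pairs)   # first byte of right token
    joint = Counter((a[-1], b[0]) for a, b in pairs)

    scores = {}
    for (a, b), count in pair_counts.items():
        p_xy = joint[(a[-1], b[0])] / n
        p_x = left_bytes[a[-1]] / n
        p_y = right_bytes[b[0]] / n
        scores[(a, b)] = count * (1 + alpha * npmi(p_xy, p_x, p_y))
    return scores
```

With alpha=0 this reduces exactly to frequency-only BPE scoring, which is why the alpha sweep includes 0.0 as the baseline.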

Results

Language     Domain            BPE baseline (BPB)   MI-BPE alpha=0.1 (BPB)   Delta
Portuguese   News (Carolina)   1.2364               1.2006                   -2.90%
Portuguese   Wikipedia         1.1090               1.1093                   +0.03%
Turkish      News (Anadolu)    0.8078               0.8131                   +0.65%
Turkish      Wikipedia         1.0773               1.0820                   +0.43%
English      TinyStories       0.3866               0.3874                   +0.20%
German       Wikipedia         1.0386               1.0555                   +1.62%

Key findings

  • Only 1 of 7 direct comparisons shows improvement. MI-weighted BPE achieved a -2.90% BPB gain on the Portuguese Carolina news corpus. The other 6 comparisons range from neutral (+0.03%) to actively harmful (+1.62% on German).
  • The morphological complexity hypothesis is falsified. Turkish — the most morphologically complex language tested, with productive agglutinative suffixes — shows no benefit (+0.43% on Wikipedia, +0.65% on news). If morphological structure were the mechanism, Turkish should have benefited most.
  • The Portuguese result is a corpus-specific artifact, not a language effect. Testing on Portuguese Wikipedia (+0.03%) instead of Portuguese news (-2.90%) eliminates the improvement entirely. The 2×2 design (Portuguese/Turkish × news/Wikipedia) isolates this: the effect is specific to the Carolina corpus, not to Portuguese morphology or news text generally.
  • The alpha sweep on Portuguese news shows a smooth curve: alpha=0.1 (best, -2.90%) > 0.3 (-2.59%) > 0.5 (-2.27%). MI works best as a tiebreaker among similarly-frequent pairs, not as the primary merge criterion. But this curve holds on only one corpus and does not transfer.
  • The transformer absorbs whatever tokenization it receives. Whether merges are driven by frequency, mutual information, whole-word boundaries, or trigram alignment, the model learns to predict equally well at equal compression ratio. The tokenizer merge criterion is not a productive optimization target at this scale.
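Every comparison above is in bits per byte precisely because the two tokenizers compress text differently, so per-token loss is not comparable. The conversion is straightforward; a minimal sketch (function name mine):

```python
import math

def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    """Convert a model's mean per-token cross-entropy (in nats, as most
    frameworks report it) into bits per byte of raw text.

    Total bits = mean_loss_nats * n_tokens / ln(2); dividing by the byte
    count of the underlying text normalizes away the tokenizer's
    compression ratio, making BPE and MI-BPE runs directly comparable.
    """
    total_bits = mean_loss_nats * n_tokens / math.log(2)
    return total_bits / n_bytes
```

A tokenizer that merges more aggressively produces fewer tokens with higher per-token loss; BPB cancels that trade, which is why equal-BPB runs in the table reflect genuinely equal predictive quality.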

Lesson learned

Always test on at least 2 datasets per language before claiming language-specific effects. A single positive result on one corpus — no matter how clean the alpha sweep or how compelling the morphological story — is not evidence for a language-level phenomenon. The 2×2 experimental design (language × domain) was essential for identifying the Carolina corpus artifact. Without it, we would have published a false claim about Portuguese morphology benefiting from MI-informed tokenization.

Tools used

All tokenizer implementations and training code generated by Claude Opus (claude-opus-4-6) via Claude Code. MI-weighted BPE uses a linked-list + heap optimization for fast merge computation (5 seconds for 2M characters).
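The note doesn't show the optimization itself, but the usual shape of a linked-list + heap BPE trainer is a max-heap over pair counts with lazy invalidation: since heapq has no decrease-key, every count change pushes a fresh entry, and stale entries are discarded at pop time against an authoritative count table. A minimal sketch of that core trick (class name mine, not the actual EXP-004 code):

```python
import heapq
from collections import Counter

class PairHeap:
    """Max-heap over token-pair counts with lazy deletion.

    Pair counts change constantly as merges are applied along the token
    linked list. Instead of updating heap entries in place, we push a new
    entry on every change and skip entries at pop time whose stored count
    no longer matches the authoritative counts dict.
    """
    def __init__(self):
        self.counts = Counter()
        self.heap = []  # entries: (-count, pair); min-heap on negated counts

    def update(self, pair, delta):
        self.counts[pair] += delta
        if self.counts[pair] <= 0:
            del self.counts[pair]          # stale heap entries remain; fine
        else:
            heapq.heappush(self.heap, (-self.counts[pair], pair))

    def pop_best(self):
        """Return (pair, count) for the highest-count pair, or (None, 0)."""
        while self.heap:
            neg, pair = heapq.heappop(self.heap)
            if self.counts.get(pair) == -neg:   # entry still current?
                del self.counts[pair]           # caller merges it next
                return pair, -neg
        return None, 0
```

Each merge then walks only the affected linked-list positions, calling update() on the neighbor pairs it destroys and creates, which is what makes training fast (the note reports 5 seconds for 2M characters) despite thousands of merge rounds.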