# EXP-004: MI-Weighted BPE Merges: A Promising Result on Portuguese That Failed to Replicate Across 4 Languages and 2 Domains

**Date:** 2026-04-07
**Author:** @eazevedo
**Tags:** #tokenization, #bpe, #mutual-information, #cross-lingual, #negative-result, #replication, #methodology

## Question

Does weighting BPE merge decisions by mutual information between boundary bytes improve language modeling, and does the effect depend on language morphology or text domain?

## Setup

Standard BPE merges token pairs by frequency alone. MI-weighted BPE modifies the merge score to count × (1 + alpha × normalized_boundary_PMI), where boundary PMI is the pointwise mutual information between the last byte of token A and the first byte of token B. This biases merges toward combining tokens whose boundary bytes co-occur more strongly than chance.
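As a sketch of that scoring rule (the function names are mine, and the normalization is an assumed min-max scaling over candidate pairs; the post only says "normalized_boundary_PMI"):

```python
import math

def boundary_pmi(pair_counts, left_counts, right_counts, total):
    """PMI between the last byte of token A and the first byte of token B.

    pair_counts[(a, b)] : co-occurrence count of boundary byte pair (a, b)
    left_counts / right_counts : marginal counts of each boundary byte
    total : number of observed token boundaries
    """
    return {
        (a, b): math.log2((n_ab / total) /
                          ((left_counts[a] / total) * (right_counts[b] / total)))
        for (a, b), n_ab in pair_counts.items()
    }

def merge_score(count, pmi_value, pmi_min, pmi_max, alpha=0.1):
    """count * (1 + alpha * normalized PMI); min-max scaling assumed."""
    norm = 0.0 if pmi_max == pmi_min else (pmi_value - pmi_min) / (pmi_max - pmi_min)
    return count * (1 + alpha * norm)
```

With alpha=0.1 the PMI term can shift a pair's score by at most 10%, which is why it acts as a tiebreaker among similarly frequent pairs rather than overriding frequency.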

14 experiments: a 2×2 core design (2 languages × 2 domains), two additional single-corpus baselines, and an alpha sweep:
- Languages: Portuguese (fusional morphology) and Turkish (agglutinative morphology)
- Domains: Wikipedia (encyclopedic) and news corpora (Carolina PT-BR news, Anadolu Agency Turkish news)
- Additional baselines: English (TinyStories) and German (Wikipedia)
- Alpha sweep on Portuguese news: 0.0, 0.1, 0.3, 0.5

All runs: 27.5M parameter transformer, ~200M tokens, vocab size 8192, RTX 3060 Ti. Metric: bits per byte (BPB), which normalizes across tokenizers with different compression ratios.
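Since BPB is what makes the comparison fair, the conversion is worth spelling out: sum the model's token-level negative log-likelihood over the evaluation set, convert nats to bits, and divide by the raw byte count rather than the token count. A minimal sketch (function name is mine):

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Summed token NLL (in nats) over an eval set -> bits per byte.

    Dividing by bytes rather than tokens removes the tokenizer's
    compression ratio from the metric, so tokenizers that segment the
    same text into different numbers of tokens remain comparable.
    """
    return total_nll_nats / (math.log(2) * total_bytes)
```

A per-token metric like perplexity would instead reward tokenizers that simply produce fewer, longer tokens.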

## Results

| Language | Domain | BPE baseline (BPB) | MI-BPE alpha=0.1 (BPB) | Delta (%) |
| --- | --- | --- | --- | --- |
| Portuguese | News (Carolina) | 1.2364 | 1.2006 | -2.90% |
| Portuguese | Wikipedia | 1.1090 | 1.1093 | +0.03% |
| Turkish | News (Anadolu) | 0.8078 | 0.8131 | +0.65% |
| Turkish | Wikipedia | 1.0773 | 1.0820 | +0.43% |
| English | TinyStories | 0.3866 | 0.3874 | +0.20% |
| German | Wikipedia | 1.0386 | 1.0555 | +1.62% |

## Key Findings

- Only 1 of 7 direct comparisons shows improvement. MI-weighted BPE achieved a 2.90% BPB reduction on the Portuguese Carolina news corpus. The other 6 comparisons range from neutral (+0.03%) to actively harmful (+1.62% on German).
- The morphological complexity hypothesis is falsified. Turkish — the most morphologically complex language tested, with productive agglutinative suffixes — shows no benefit (+0.43% on Wikipedia, +0.65% on news). If morphological structure were the mechanism, Turkish should have benefited most.
- The Portuguese result is a corpus-specific artifact, not a language effect. Testing on Portuguese Wikipedia (+0.03%) instead of Portuguese news (-2.90%) eliminates the improvement entirely. The 2×2 design (Portuguese/Turkish × news/Wikipedia) isolates this: the effect is specific to the Carolina corpus, not to Portuguese morphology or news text generally.
- The alpha sweep on Portuguese news shows a smooth, monotonic curve: alpha=0.1 is best (-2.90%), followed by 0.3 (-2.59%) and 0.5 (-2.27%). MI works best as a tiebreaker among similarly frequent pairs, not as the primary merge criterion. But this curve holds on only one corpus and does not transfer.
- The transformer absorbs whatever tokenization it receives. Whether merges are driven by frequency, mutual information, whole-word boundaries, or trigram alignment, the model learns to predict equally well at equal compression ratio. The tokenizer merge criterion is not a productive optimization target at this scale.

## Lesson Learned

Always test on at least 2 datasets per language before claiming language-specific effects. A single positive result on one corpus, no matter how clean the alpha sweep or how compelling the morphological story, is not evidence for a language-level phenomenon. The 2×2 experimental design (language × domain) was essential for identifying the Carolina corpus artifact. Without it, we would have published a false claim about Portuguese morphology benefiting from MI-informed tokenization.

## Tools Used

All tokenizer implementations and training code generated by Claude Opus (claude-opus-4-6) via Claude Code. MI-weighted BPE uses a linked-list + heap optimization for fast merge computation (5 seconds for 2M characters).
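For reference, the linked-list + heap pattern looks roughly like this (an illustrative sketch of the general technique over string tokens, not the generated implementation; the MI weighting and word-boundary handling are omitted): a doubly linked list lets each merge update only its neighbours' pair counts, and the max-heap uses lazy invalidation, skipping popped entries whose stored count is stale.

```python
import heapq
from collections import defaultdict

def bpe_merges(seq, num_merges):
    """Greedy BPE over one string: linked list + lazily invalidated max-heap."""
    n = len(seq)
    toks = list(seq)               # toks[i] becomes None once position i is merged away
    prv = list(range(-1, n - 1))   # previous live position (-1 = none)
    nxt = list(range(1, n + 1))    # next live position (n = none)

    counts = defaultdict(int)      # pair -> current count
    occ = defaultdict(set)         # pair -> left positions where it occurs
    for i in range(n - 1):
        counts[(toks[i], toks[i + 1])] += 1
        occ[(toks[i], toks[i + 1])].add(i)

    heap = [(-c, p) for p, c in counts.items()]
    heapq.heapify(heap)
    merges = []

    def dec(pair, i):
        counts[pair] -= 1
        occ[pair].discard(i)

    def inc(pair, i):
        counts[pair] += 1
        occ[pair].add(i)
        heapq.heappush(heap, (-counts[pair], pair))  # older entries go stale

    while len(merges) < num_merges and heap:
        neg_c, pair = heapq.heappop(heap)
        if counts[pair] == 0 or counts[pair] != -neg_c:
            continue               # stale entry: count changed since it was pushed
        merges.append(pair)
        new = pair[0] + pair[1]
        for i in sorted(occ[pair]):
            j = nxt[i]
            if j >= n or toks[i] != pair[0] or toks[j] != pair[1]:
                continue           # clobbered by an overlapping merge (e.g. "aaa")
            left, right = prv[i], nxt[j]
            if left >= 0:
                dec((toks[left], toks[i]), left)
            dec(pair, i)
            if right < n:
                dec((toks[j], toks[right]), j)
            toks[i], toks[j] = new, None   # fuse position j into i
            nxt[i] = right
            if right < n:
                prv[right] = i
            if left >= 0:
                inc((toks[left], new), left)
            if right < n:
                inc((new, toks[right]), i)
        occ[pair].clear()
    return merges
```

Because merges touch only a constant neighbourhood and heap cleanup is deferred to pop time, training stays near O(n log n) instead of rescanning the corpus after every merge, which is consistent with the quoted 5 seconds for 2M characters.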

---
Source: https://terminus.ink/e/2026-04-07-mi-weighted-bpe-merges-a-promising-result-on-portuguese-that-failed-to-replicate
