Darija Subword Tokenizer Benchmark

In collaboration with
UM6P College of Computing

Overview

The first systematic subword tokenizer benchmark for Moroccan Darija, a low-resource dialect written concurrently in Arabic script and Arabizi (Latin script). We train and evaluate 40 tokenizer configurations spanning four algorithms, two architectures, and five vocabulary sizes (8K--110K) on 112,814 parallel sentence pairs from OiQ/daa-pairs.

Our tokenizers achieve 27--33% lower fertility than existing Darija tokenizers (DarijaBERT) at matching vocabulary sizes, and 40--50% lower fertility than MSA-trained tokenizers. All 40 configurations maintain ≥99% exact reconstruction.

Tokenizers

Architecture	Description	Algorithms	Vocab Sizes	Count
Shared	Single vocabulary trained on mixed Arabic + Arabizi corpus	BPE, Unigram, WordPiece, BBPE	8K, 16K, 32K, 80K, 110K	20
Concatenated	Separate per-script vocabularies (V/2 each) with ID shifting	BPE, Unigram, WordPiece, BBPE	8K, 16K, 32K, 80K, 110K	20

All tokenizers are released in both raw (HuggingFace tokenizers) and transformers-compatible formats. The 80K and 110K sizes match DarijaBERT's vocabulary sizes for direct comparison.
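To make the concatenated architecture concrete, the following is a minimal sketch of per-script routing with ID shifting. It is illustrative only: the routing rule in `is_arabic_script`, the function names, and the choice to offset Arabizi IDs by V/2 are assumptions made for this example, not the repository's implementation, and `toy_encode` is a hypothetical stand-in for a real sub-tokenizer.

```python
from typing import Callable, List


def is_arabic_script(text: str) -> bool:
    """Assumed routing rule: most letters fall in the Arabic Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return bool(letters) and arabic * 2 > len(letters)


def encode_concat(
    text: str,
    encode_ar: Callable[[str], List[int]],
    encode_az: Callable[[str], List[int]],
    half_vocab: int,
) -> List[int]:
    """Encode with the sub-tokenizer for the detected script.

    Arabic-script IDs stay in [0, half_vocab); Arabizi IDs are shifted
    into [half_vocab, 2 * half_vocab) so the two ranges never collide.
    """
    if is_arabic_script(text):
        return encode_ar(text)
    return [token_id + half_vocab for token_id in encode_az(text)]


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in sub-tokenizer: one small ID per word."""
    return [len(word) % 4 for word in text.split()]


if __name__ == "__main__":
    half = 16_000  # V/2 for a 32K concatenated vocabulary
    print(encode_concat("واش كاين شي جديد؟", toy_encode, toy_encode, half))
    print(encode_concat("wash kayn shi jdid?", toy_encode, toy_encode, half))
```

With real sub-tokenizers, `encode_ar` and `encode_az` would be the `encode` methods of the two tokenizers that make up one concatenated configuration.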

Quick Start

from transformers import AutoTokenizer

# Load a shared tokenizer
tok = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/shared_bpe_32000"
)

# Tokenize Arabic-script text
text_ar = "مابقاش كيعرف شنو يدير، بين القانون وبين وليداتو."
print(tok.encode(text_ar))

# Load a concatenated tokenizer (separate Arabic and Arabizi sub-tokenizers)
tok_ar = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/concat_bpe_32000_tokenizer_ar"
)
tok_az = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/concat_bpe_32000_tokenizer_az"
)

# Each sub-tokenizer encodes text in its own script
print(tok_ar.encode(text_ar))

text_az = "wash kayn shi jdid?"
print(tok_az.encode(text_az))

Key Results

Best Tokenizer per Vocabulary Size

Vocab	Configuration	Algorithm	Fertility ↓	Disparity ↓	Exact Match
8K	Shared	WordPiece	1.572	0.164	99.9%
16K	Shared	WordPiece	1.402	0.138	99.9%
32K	Shared	WordPiece	1.274	0.099	99.9%
80K	Shared	WordPiece	1.171	0.049	99.9%
110K	Concat	WordPiece	1.155	0.093	99.6%

Comparison with Existing Tokenizers

Tokenizer	Vocab	Fertility ↓	Disparity ↓	EM (Ar)	EM (Az)
Ours: concat WP 110K	110K	1.155	0.093	99.9%	99.6%
Ours: concat WP 80K	80K	1.183	0.090	99.9%	99.6%
Ours: concat BPE 32K	32K	1.307	0.084	99.9%	99.6%
DarijaBERT-ar	80K	1.761	0.410	13.7%	8.0%
DarijaBERT-az	110K	1.575	0.055	14.8%	8.0%
DarijaBERT-mix	160K	1.414	0.149	14.8%	8.0%
CaMeLBERT-MSA	30K	2.289	0.427	29.9%	38.9%
Aranizer-SP-86k	86K	1.918	0.368	99.8%	99.6%
Qwen2.5-Darija	152K	2.307	0.040	100.0%	100.0%

At matching vocabulary sizes, our 80K tokenizer achieves 33% lower fertility than DarijaBERT-ar (1.183 vs 1.761), and our 110K tokenizer achieves 27% lower fertility than DarijaBERT-az (1.155 vs 1.575). Even our 32K tokenizer (F = 1.307) outperforms DarijaBERT-az with a vocabulary roughly 3.4x smaller. DarijaBERT-mix, despite its 160K vocabulary (F = 1.414), also underperforms our 32K tokenizer: vocabulary size alone cannot compensate for a suboptimal training architecture.
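The percentages above are relative reductions in fertility. As a quick sanity check, the sketch below recomputes them from the fertility values in the comparison table. The helper name `relative_reduction` is introduced here for illustration and is not taken from `verify_arithmetic.py`.

```python
def relative_reduction(ours: float, baseline: float) -> float:
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


# (our fertility, baseline fertility), copied from the comparison table
comparisons = {
    "concat WP 80K vs DarijaBERT-ar": (1.183, 1.761),
    "concat WP 110K vs DarijaBERT-az": (1.155, 1.575),
    "concat BPE 32K vs DarijaBERT-az": (1.307, 1.575),
    "concat BPE 32K vs DarijaBERT-mix": (1.307, 1.414),
}

if __name__ == "__main__":
    for name, (ours, baseline) in comparisons.items():
        pct = 100 * relative_reduction(ours, baseline)
        print(f"{name}: {pct:.1f}% lower fertility")
    print(f"vocabulary ratio 110K / 32K: {110_000 / 32_000:.1f}x")
```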

Evaluation Metrics

Compression & Fairness

Metric	Definition	Direction
Fertility (F)	Tokens per word, averaged over test set	Lower is better
CPT	Grapheme clusters per token (Unicode-aware)	Higher is better
Disparity (ΔF)	Relative cross-script gap: |F_ar − F_az| / max(F_ar, F_az)	Lower is better
Exact Match	Fraction of texts that round-trip perfectly through encode/decode	Higher is better
Gini	Vocabulary usage inequality (0 = uniform, 1 = concentrated)	Lower is better
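The definitions in the table can be sketched in a few lines of Python. This is a simplified illustration rather than the evaluation code in this repository: it assumes words are whitespace-delimited, that fertility is pooled over the whole test set (total tokens over total words), and that Gini is computed over per-token usage counts. `toy_encode` and `toy_decode` are hypothetical stand-ins for a real tokenizer.

```python
from typing import Callable, List, Sequence


def fertility(texts: Sequence[str], encode: Callable[[str], list]) -> float:
    """Total tokens divided by total whitespace-delimited words."""
    tokens = sum(len(encode(t)) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return tokens / words


def disparity(f_ar: float, f_az: float) -> float:
    """Relative cross-script gap: |F_ar - F_az| / max(F_ar, F_az)."""
    return abs(f_ar - f_az) / max(f_ar, f_az)


def exact_match(texts: Sequence[str], encode, decode) -> float:
    """Fraction of texts that survive an encode/decode round trip unchanged."""
    return sum(decode(encode(t)) == t for t in texts) / len(texts)


def gini(counts: Sequence[int]) -> float:
    """Gini coefficient of usage counts (0 = uniform usage)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


def toy_encode(text: str) -> List[str]:
    """Hypothetical tokenizer: chunks of up to 3 characters per word."""
    pieces: List[str] = []
    for word in text.split(" "):
        chunks = [word[i:i + 3] for i in range(0, len(word), 3)] or [""]
        pieces.append("▁" + chunks[0])  # mark the start of each word
        pieces.extend(chunks[1:])
    return pieces


def toy_decode(pieces: List[str]) -> str:
    """Inverse of toy_encode: join pieces and turn word markers into spaces."""
    return "".join(pieces).replace("▁", " ")[1:]


if __name__ == "__main__":
    sample = ["wash kayn shi jdid?"]
    print("fertility:", fertility(sample, toy_encode))
    print("exact match:", exact_match(sample, toy_encode, toy_decode))
    print("disparity:", disparity(2.0, 1.0))
    print("gini:", gini([0, 0, 0, 4]))
```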

Overall fertility is word-count-weighted: F ≈ 0.65·F_ar + 0.35·F_az
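The weighting follows from pooling tokens and words across the two scripts. Writing $T_s$ and $W_s$ for the total tokens and words in script $s$ (notation introduced here for illustration), and assuming fertility is computed at corpus level:

```latex
F = \frac{T_{ar} + T_{az}}{W_{ar} + W_{az}}
  = \frac{W_{ar}}{W_{ar} + W_{az}} \cdot \frac{T_{ar}}{W_{ar}}
  + \frac{W_{az}}{W_{ar} + W_{az}} \cdot \frac{T_{az}}{W_{az}}
  = w_{ar} F_{ar} + w_{az} F_{az},
\qquad w_s = \frac{W_s}{W_{ar} + W_{az}}
```

Under this reading, the coefficients 0.65 and 0.35 would correspond to the approximate share of test-set words in Arabic script and Arabizi, respectively.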

Morphological Fidelity (Arabic-script only)

Metric	Definition	Direction
μ_e	Edit distance between tokenizer boundaries and Farasa morpheme boundaries	Lower is better
μ_c-F1	Whether words sharing morphemes also share tokens (KMeans + TF-IDF)	Higher is better
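One simple way to picture a boundary edit distance is to write both segmentations of a word as separator-joined strings and take their Levenshtein distance. The sketch below does exactly that; it is an assumption about how such a metric could be computed and may differ from the definition of μ_e used in `eval_morph_large.py`. The gold segmentation in the example is written by hand for illustration rather than produced by Farasa.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def boundary_edit_distance(
    token_pieces: List[str], gold_morphemes: List[str], sep: str = "|"
) -> int:
    """Edit distance between two segmentations of the same word.

    Because both segmentations spell the same word, the distance counts
    boundaries that would need to be inserted or removed.
    """
    return levenshtein(sep.join(token_pieces), sep.join(gold_morphemes))


if __name__ == "__main__":
    gold = ["و", "ال", "كتاب"]    # hand-written: conjunction + article + stem
    tokens = ["وال", "كتاب"]      # hypothetical tokenizer output
    print(boundary_edit_distance(tokens, gold))
```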

Statistical Rigor

All fertility and CPT values include bootstrap 95% confidence intervals (500 resamples). CI width is ≤ 0.006 for all configurations.
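A percentile bootstrap for fertility can be sketched as follows. The resampling unit (whole texts), the percentile method, and the function name are assumptions for illustration; `bootstrap_test_set.py` may differ in detail. The per-text counts in the demo are synthetic.

```python
import random
from typing import Sequence, Tuple


def bootstrap_fertility_ci(
    token_counts: Sequence[int],
    word_counts: Sequence[int],
    n_resamples: int = 500,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for corpus-level fertility.

    Texts are resampled with replacement; each resample's fertility is
    its total tokens divided by its total words.
    """
    rng = random.Random(seed)
    n = len(token_counts)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        tokens = sum(token_counts[i] for i in idx)
        words = sum(word_counts[i] for i in idx)
        stats.append(tokens / words)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


if __name__ == "__main__":
    # Synthetic per-text counts, for demonstration only
    demo_rng = random.Random(42)
    words = [demo_rng.randint(5, 15) for _ in range(2000)]
    tokens = [w + demo_rng.randint(0, 4) for w in words]
    lo, hi = bootstrap_fertility_ci(tokens, words)
    print(f"95% CI: [{lo:.3f}, {hi:.3f}], width {hi - lo:.3f}")
```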

Visualizations

Fertility by Algorithm and Vocabulary Size

Cross-Script Disparity

External Tokenizer Comparison

Key Findings

Darija-specific tokenization dramatically improves compression. Our tokenizers achieve 27--33% lower fertility than DarijaBERT at matching vocabulary sizes.
Cross-script fairness is achievable. Shared Unigram at 80K reaches ΔF = 0.015 (near-zero disparity) without concatenation. Concatenated architectures maintain ΔF ≤ 0.094 across all non-BBPE algorithms.
Vocabulary size saturates early. Moving from 32K to 110K lowers the best fertility only from 1.274 to 1.155, and the step from 80K to 110K accounts for just 0.016 of that gain. The training corpus exhausts merge candidates around 80K.
BBPE requires concatenation. Shared BBPE exhibits extreme cross-script disparity (ΔF = 0.219–0.243). Concatenation reduces this 4x.
MSA tokenizers transfer poorly to Darija. Our tokenizers achieve 40--50% lower fertility than MSA-trained tokenizers such as CaMeLBERT-MSA (F = 2.289), which allocate vocabulary to MSA patterns absent in Darija.

Methodology

Stage	Details
Dataset	`OiQ/daa-pairs` — 112,814 Moroccan Darija sentence triplets (Arabic, Arabizi, Mixed) from 12,695 unique sources
Split	80/10/10 train/validation/test with stratified sampling
Pre-tokenization	Metaspace (BPE/Unigram/WordPiece) or ByteLevel (BBPE)
Training	HuggingFace `tokenizers` library with matched pre-tokenizer/decoder pairs
Evaluation	14 metrics on 33,846 test texts (11,282 per script x 3 scripts)
Morphological	Farasa segmenter for gold-standard Arabic morpheme boundaries
Export	Raw + `transformers`-compatible via `PreTrainedTokenizerFast`
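The 80/10/10 stratified split can be approximated with the standard library alone. The sketch below is not the pipeline's own code: the stratification variable is not stated above, so the `stratum` field used in the demo is a hypothetical placeholder, as are the function name and the per-stratum rounding rule.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Tuple


def stratified_split(
    records: Sequence[dict],
    key: Callable[[dict], Hashable],
    fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 0,
) -> Tuple[List[dict], List[dict], List[dict]]:
    """Split records into train/validation/test, stratum by stratum."""
    rng = random.Random(seed)
    groups: Dict[Hashable, List[dict]] = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)

    train, val, test = [], [], []
    for stratum in sorted(groups):  # sorted for reproducibility
        items = groups[stratum]
        rng.shuffle(items)
        n_train = round(fractions[0] * len(items))
        n_val = round(fractions[1] * len(items))
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]  # remainder goes to test
    return train, val, test


if __name__ == "__main__":
    # Hypothetical records: "stratum" stands in for the real stratification key
    data = [{"id": i, "stratum": "a" if i % 2 == 0 else "b"} for i in range(100)]
    train, val, test = stratified_split(data, key=lambda r: r["stratum"])
    print(len(train), len(val), len(test))
```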

Code & Reproducibility

All scripts, evaluation code, and documentation are included in this repository. See code.md for a complete guide to every script and its outputs.

Repository Structure

daa-tokenizers/
├── README.md                        # This file
├── code.md                          # Script documentation
├── script.py                        # Master benchmark pipeline
├── eval_test_set.py                 # Test-set evaluation
├── eval_morph_large.py              # Morphological metrics (80K/110K)
├── bootstrap_test_set.py            # Bootstrap confidence intervals
├── eval_all_externals.py            # External tokenizer comparison
├── eval_codeswitch_and_new_baselines.py
├── eval_doda_independent.py         # DODa validation
├── regen_figures.py                 # Figure generation
├── verify_arithmetic.py             # Numeric claims verification
├── results/
│   ├── test_set_results.csv         # Primary results (40 tokenizers)
│   ├── external_comparison.csv      # External comparison
│   ├── morph_large_vocab_results.csv # Morphological metrics (80K/110K)
│   ├── bootstrap_ci_test_set.csv    # Bootstrap 95% CIs
│   ├── tokenizers/                  # 60 raw tokenizer JSONs
│   ├── transformers_tokenizers/     # Transformers-compatible exports
│   ├── corpora/                     # Train/test text splits
│   ├── morphology/                  # Farasa segmentations cache
│   └── plots/                       # All visualization PNGs
└── plots/                           # Paper figures

Citation

@misc{laamiri2026daa-tokenizers,
  title     = {Darija Subword Tokenizer Benchmark},
  author    = {Laamiri, Ouail and Berrada, Ismail and Belfadil, Anas},
  year      = {2026},
  url       = {https://huggingface.co/OiQ/daa-tokenizers},
  note      = {In collaboration with UM6P College of Computing}
}

License

MIT License — see LICENSE for details.

Acknowledgments

This work was developed in collaboration with the UM6P College of Computing, Mohammed VI Polytechnic University, Ben Guerir, Morocco. We thank the HuggingFace community for providing the infrastructure to host these resources.
