Darija Subword Tokenizer Benchmark
In collaboration with UM6P College of Computing
Overview
This is the first systematic subword tokenizer benchmark for Moroccan Darija, a low-resource dialect written concurrently in Arabic script and Arabizi (Latin script). We train and evaluate 40 tokenizer configurations spanning four algorithms, two architectures, and five vocabulary sizes (8K–110K) on 112,814 parallel sentence pairs from OiQ/daa-pairs.
Our tokenizers achieve 27–33% lower fertility than existing Darija tokenizers (DarijaBERT) at matching vocabulary sizes, and 40–50% lower fertility than MSA-trained tokenizers. All 40 configurations maintain ≥99% exact reconstruction.
Tokenizers
| Architecture | Description | Algorithms | Vocab Sizes | Count |
|---|---|---|---|---|
| Shared | Single vocabulary trained on mixed Arabic + Arabizi corpus | BPE, Unigram, WordPiece, BBPE | 8K, 16K, 32K, 80K, 110K | 20 |
| Concatenated | Separate per-script vocabularies (V/2 each) with ID shifting | BPE, Unigram, WordPiece, BBPE | 8K, 16K, 32K, 80K, 110K | 20 |
All tokenizers are released in both raw (HuggingFace tokenizers) and transformers-compatible formats. The 80K and 110K sizes match DarijaBERT's vocabulary sizes for direct comparison.
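The ID shifting used by the concatenated architecture can be pictured with a small sketch. The README states only that each script gets its own V/2 vocabulary and that IDs are shifted; the script-routing rule, the toy word-level encoders, and every name below are assumptions made for illustration, not the released implementation.

```python
from typing import Callable, List

Encoder = Callable[[str], List[int]]

def looks_arabic(text: str) -> bool:
    """Hypothetical routing rule: Arabic-block characters outnumber ASCII letters."""
    arabic = sum("\u0600" <= ch <= "\u06FF" for ch in text)
    latin = sum(ch.isascii() and ch.isalpha() for ch in text)
    return arabic > latin

def encode_concat(text: str, enc_ar: Encoder, enc_az: Encoder, half_vocab: int) -> List[int]:
    """Arabic-script IDs stay in [0, half_vocab); Arabizi IDs are shifted up by half_vocab."""
    if looks_arabic(text):
        return enc_ar(text)
    return [i + half_vocab for i in enc_az(text)]

# Toy word-level encoders standing in for the real per-script sub-tokenizers.
toy_ar = {"واش": 0, "كاين": 1}
toy_az = {"wash": 0, "kayn": 1}
enc_ar = lambda s: [toy_ar[w] for w in s.split()]
enc_az = lambda s: [toy_az[w] for w in s.split()]

HALF = 16_000  # V/2 for a 32K concatenated tokenizer
ids_ar = encode_concat("واش كاين", enc_ar, enc_az, HALF)
ids_az = encode_concat("wash kayn", enc_ar, enc_az, HALF)
print(ids_ar, ids_az)
```

Because the two ID ranges never overlap, a single embedding matrix of size V can serve both sub-tokenizers without collisions.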
Quick Start
from transformers import AutoTokenizer
# Load a shared tokenizer
tok = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/shared_bpe_32000",
)
# Tokenize Arabic-script text
text_ar = "مابقاش كيعرف شنو يدير، بين القانون وبين وليداتو."
print(tok.encode(text_ar))
# Load a concatenated tokenizer (separate Arabic and Arabizi sub-tokenizers)
tok_ar = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/concat_bpe_32000_tokenizer_ar",
)
tok_az = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/concat_bpe_32000_tokenizer_az",
)
text_az = "wash kayn shi jdid?"
print(tok_az.encode(text_az))
Key Results
Best Tokenizer per Vocabulary Size
| Vocab | Configuration | Algorithm | Fertility ↓ | Disparity ↓ | Exact Match |
|---|---|---|---|---|---|
| 8K | Shared | WordPiece | 1.572 | 0.164 | 99.9% |
| 16K | Shared | WordPiece | 1.402 | 0.138 | 99.9% |
| 32K | Shared | WordPiece | 1.274 | 0.099 | 99.9% |
| 80K | Shared | WordPiece | 1.171 | 0.049 | 99.9% |
| 110K | Concat | WordPiece | 1.155 | 0.093 | 99.6% |
Comparison with Existing Tokenizers
| Tokenizer | Vocab | Fertility ↓ | Disparity ↓ | EM (Ar) | EM (Az) |
|---|---|---|---|---|---|
| Ours: concat WP 110K | 110K | 1.155 | 0.093 | 99.9% | 99.6% |
| Ours: concat WP 80K | 80K | 1.183 | 0.090 | 99.9% | 99.6% |
| Ours: concat BPE 32K | 32K | 1.307 | 0.084 | 99.9% | 99.6% |
| DarijaBERT-ar | 80K | 1.761 | 0.410 | 13.7% | 8.0% |
| DarijaBERT-az | 110K | 1.575 | 0.055 | 14.8% | 8.0% |
| DarijaBERT-mix | 160K | 1.414 | 0.149 | 14.8% | 8.0% |
| CaMeLBERT-MSA | 30K | 2.289 | 0.427 | 29.9% | 38.9% |
| Aranizer-SP-86k | 86K | 1.918 | 0.368 | 99.8% | 99.6% |
| Qwen2.5-Darija | 152K | 2.307 | 0.040 | 100.0% | 100.0% |
At matching vocabulary sizes, our 80K tokenizer achieves 33% lower fertility than DarijaBERT-ar (1.183 vs. 1.761), and our 110K tokenizer achieves 27% lower fertility than DarijaBERT-az (1.155 vs. 1.575). Even our 32K tokenizer (F = 1.307) outperforms DarijaBERT-az despite using 3.4x fewer vocabulary slots. DarijaBERT-mix, despite its 160K vocabulary (F = 1.414), still underperforms our 32K tokenizer: vocabulary size alone cannot compensate for a suboptimal training architecture.
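The percentages in this paragraph follow from the fertility values in the comparison table. A minimal sketch of the arithmetic, where the helper name `relative_reduction` is ours and not part of the repository:

```python
def relative_reduction(ours: float, baseline: float) -> float:
    """Fractional fertility reduction of `ours` relative to `baseline`."""
    return (baseline - ours) / baseline

# Fertility values copied from the comparison table (ours, baseline).
pairs = {
    "concat WP 80K vs DarijaBERT-ar": (1.183, 1.761),
    "concat WP 110K vs DarijaBERT-az": (1.155, 1.575),
    "concat BPE 32K vs DarijaBERT-mix": (1.307, 1.414),
}
for name, (ours, base) in pairs.items():
    print(f"{name}: {relative_reduction(ours, base):.1%} lower fertility")

# Vocabulary ratio behind the "3.4x fewer slots" remark (110K vs 32K).
vocab_ratio = 110_000 / 32_000
print(f"vocabulary ratio: {vocab_ratio:.1f}x")
```

The first two ratios round to the 33% and 27% quoted in the text.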
Evaluation Metrics
Compression & Fairness
| Metric | Definition | Direction |
|---|---|---|
| Fertility (F) | Tokens per word, averaged over test set | Lower is better |
| CPT | Grapheme clusters per token (Unicode-aware) | Higher is better |
| Disparity (ΔF) | Relative cross-script gap: \|F_ar − F_az\| / max(F_ar, F_az) | Lower is better |
| Exact Match | Fraction of texts that round-trip perfectly through encode/decode | Higher is better |
| Gini | Vocabulary usage inequality (0 = uniform, 1 = concentrated) | Lower is better |
Overall fertility is word-count-weighted: F ≈ 0.65·F_ar + 0.35·F_az
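The compression and fairness definitions above can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's evaluation code: words are taken to be whitespace-delimited, fertility is pooled over all texts (which makes the overall value a word-count-weighted mix of the per-script values), and `toy_tokenize` is a made-up tokenizer that cuts each word into chunks of at most three characters.

```python
from typing import Callable, List, Sequence

def fertility(texts: Sequence[str], tokenize: Callable[[str], List[str]]) -> float:
    """Tokens per whitespace-delimited word, pooled over all texts."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

def disparity(f_ar: float, f_az: float) -> float:
    """Relative cross-script gap: |F_ar - F_az| / max(F_ar, F_az)."""
    return abs(f_ar - f_az) / max(f_ar, f_az)

def exact_match(texts: Sequence[str], encode: Callable, decode: Callable) -> float:
    """Fraction of texts that survive an encode/decode round trip unchanged."""
    return sum(decode(encode(t)) == t for t in texts) / len(texts)

def toy_tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: split every word into chunks of up to 3 characters."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

ar_texts = ["واش كاين"]
az_texts = ["wash kayn shi jdid"]

f_ar = fertility(ar_texts, toy_tokenize)
f_az = fertility(az_texts, toy_tokenize)
f_all = fertility(ar_texts + az_texts, toy_tokenize)

# Pooled fertility equals the per-script values weighted by each script's word share.
w_ar = 2 / 6  # 2 Arabic-script words out of 6 words in total
f_weighted = w_ar * f_ar + (1 - w_ar) * f_az

print(f_ar, f_az, disparity(f_ar, f_az), f_all, f_weighted)
print(exact_match(az_texts, list, "".join))  # character-level round trip
```

In the benchmark the weights 0.65 and 0.35 play the role of `w_ar` and `1 - w_ar` in this toy example.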
Morphological Fidelity (Arabic-script only)
| Metric | Definition | Direction |
|---|---|---|
| μe | Edit distance between tokenizer boundaries and Farasa morpheme boundaries | Lower is better |
| μc-F1 | Whether words sharing morphemes also share tokens (KMeans + TF-IDF) | Higher is better |
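One possible reading of the boundary edit distance is the Levenshtein distance between a word's two segmentations once each is written out with an explicit boundary marker. The sketch below follows that reading. The marker convention, the function names, and the hand-written gold segmentation are assumptions for illustration, and the benchmark's exact formulation may differ.

```python
from typing import List

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]

def boundary_distance(tokens: List[str], morphemes: List[str], marker: str = "+") -> int:
    """Edit distance between two segmentations of the same word."""
    return levenshtein(marker.join(tokens), marker.join(morphemes))

# Hand-written example (not actual Farasa output): "and with the book".
tokens = ["وب", "الكتاب"]               # hypothetical tokenizer output
morphemes = ["و", "ب", "ال", "كتاب"]    # hypothetical gold morphemes
d = boundary_distance(tokens, morphemes)
print(d)
```

Here the tokenizer recovers one of the three gold boundaries, so two marker insertions separate the segmentations.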
Statistical Rigor
All fertility and CPT values include bootstrap 95% confidence intervals (500 resamples). CI width is ≤ 0.006 for all configurations.
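A percentile bootstrap for fertility might look like the sketch below. It assumes the resampling unit is a test text and that fertility is recomputed as pooled tokens over pooled words on each resample. The function name, the seed, and the synthetic counts are ours, and `bootstrap_test_set.py` may be organized differently.

```python
import random
from typing import List, Tuple

def bootstrap_ci(token_counts: List[int], word_counts: List[int],
                 n_resamples: int = 500, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for pooled fertility, resampling texts with replacement."""
    rng = random.Random(seed)
    n = len(token_counts)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        tokens = sum(token_counts[i] for i in idx)
        words = sum(word_counts[i] for i in idx)
        stats.append(tokens / words)
    stats.sort()
    k = int(alpha / 2 * n_resamples)  # number of resamples trimmed from each tail
    return stats[k], stats[n_resamples - 1 - k]

# Synthetic per-text counts, for illustration only.
word_counts = [4 + i % 5 for i in range(200)]
token_counts = [w + i % 4 for i, w in enumerate(word_counts)]

lo, hi = bootstrap_ci(token_counts, word_counts)
print(f"95% CI: [{lo:.3f}, {hi:.3f}], width {hi - lo:.3f}")
```

The interval width shrinks roughly with the square root of the number of texts, which fits the narrow intervals reported for a test set of this size.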
Visualizations
Fertility by Algorithm and Vocabulary Size
Cross-Script Disparity
External Tokenizer Comparison
Key Findings
Darija-specific tokenization dramatically improves compression. Our tokenizers achieve 27–33% lower fertility than DarijaBERT at matching vocabulary sizes.
Cross-script fairness is achievable. Shared Unigram at 80K reaches ΔF = 0.015 (near-zero disparity) without concatenation. Concatenated architectures maintain ΔF ≤ 0.094 across all non-BBPE algorithms.
Vocabulary size saturates early. Moving from 32K to 110K lowers the best fertility only from 1.274 to 1.155, and the training corpus exhausts merge candidates around 80K.
BBPE requires concatenation. Shared BBPE exhibits extreme cross-script disparity (ΔF = 0.219–0.243). Concatenation reduces this 4x.
MSA tokenizers transfer poorly to Darija. Our tokenizers achieve 40–50% lower fertility than MSA-trained tokenizers, which allocate vocabulary to MSA patterns absent in Darija.
Methodology
| Stage | Details |
|---|---|
| Dataset | OiQ/daa-pairs: 112,814 Moroccan Darija sentence triplets (Arabic, Arabizi, Mixed) from 12,695 unique sources |
| Split | 80/10/10 train/validation/test with stratified sampling |
| Pre-tokenization | Metaspace (BPE/Unigram/WordPiece) or ByteLevel (BBPE) |
| Training | HuggingFace tokenizers library with matched pre-tokenizer/decoder pairs |
| Evaluation | 14 metrics on 33,846 test texts (11,282 per script x 3 scripts) |
| Morphological | Farasa segmenter for gold-standard Arabic morpheme boundaries |
| Export | Raw + transformers-compatible via PreTrainedTokenizerFast |
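The 80/10/10 stratified split in the table can be sketched as follows. The README does not say which attribute the split is stratified on, so the `key` function and the made-up length buckets are placeholders, and the function is an illustration, not the logic in `script.py`.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence, Tuple

def stratified_split(items: Sequence, key: Callable[[object], Hashable],
                     fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),
                     seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle within each stratum, then cut every stratum into train/val/test."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for item in items:
        by_stratum[key(item)].append(item)
    train, val, test = [], [], []
    for stratum in sorted(by_stratum):  # fixed order keeps the split reproducible
        group = by_stratum[stratum]
        rng.shuffle(group)
        n_train = round(fractions[0] * len(group))
        n_val = round(fractions[1] * len(group))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Toy data: 60 "short" and 40 "long" items, with the stratum stored alongside an ID.
items = [("short", i) for i in range(60)] + [("long", i) for i in range(40)]
train, val, test = stratified_split(items, key=lambda item: item[0])
print(len(train), len(val), len(test))
```

Cutting each stratum separately keeps its proportion roughly equal across the three splits, which is the point of stratifying.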
Code & Reproducibility
All scripts, evaluation code, and documentation are included in this repository. See code.md for a complete guide to every script and its outputs.
Repository Structure
daa-tokenizers/
├── README.md                             # This file
├── code.md                               # Script documentation
├── script.py                             # Master benchmark pipeline
├── eval_test_set.py                      # Test-set evaluation
├── eval_morph_large.py                   # Morphological metrics (80K/110K)
├── bootstrap_test_set.py                 # Bootstrap confidence intervals
├── eval_all_externals.py                 # External tokenizer comparison
├── eval_codeswitch_and_new_baselines.py
├── eval_doda_independent.py              # DODa validation
├── regen_figures.py                      # Figure generation
├── verify_arithmetic.py                  # Numeric claims verification
├── results/
│   ├── test_set_results.csv              # Primary results (40 tokenizers)
│   ├── external_comparison.csv           # External comparison
│   ├── morph_large_vocab_results.csv     # Morphological metrics (80K/110K)
│   ├── bootstrap_ci_test_set.csv         # Bootstrap 95% CIs
│   ├── tokenizers/                       # 60 raw tokenizer JSONs
│   ├── transformers_tokenizers/          # Transformers-compatible exports
│   ├── corpora/                          # Train/test text splits
│   ├── morphology/                       # Farasa segmentations cache
│   └── plots/                            # All visualization PNGs
└── plots/                                # Paper figures
Citation
@misc{laamiri2026daa-tokenizers,
title = {Darija Subword Tokenizer Benchmark},
author = {Laamiri, Ouail and Berrada, Ismail and Belfadil, Anas},
year = {2026},
url = {https://huggingface.co/OiQ/daa-tokenizers},
note = {In collaboration with UM6P College of Computing}
}
License
MIT License. See LICENSE for details.
Acknowledgments
This work was developed in collaboration with the UM6P College of Computing, Mohammed VI Polytechnic University, Ben Guerir, Morocco. We thank the HuggingFace community for providing the infrastructure to host these resources.


