Darija Subword Tokenizer Benchmark

In collaboration with
UM6P College of Computing


Overview

This is the first systematic subword tokenizer benchmark for Moroccan Darija, a low-resource dialect written in both Arabic script and Arabizi (Latin script). We train and evaluate 40 tokenizer configurations spanning four algorithms, two architectures, and five vocabulary sizes (8K–110K) on 112,814 parallel sentence pairs from OiQ/daa-pairs.

Our tokenizers achieve 27–33% lower fertility than existing Darija tokenizers (DarijaBERT) at matching vocabulary sizes, and 40–50% lower fertility than MSA-trained tokenizers. All 40 configurations maintain ≥99% exact reconstruction.


Tokenizers

| Architecture | Description | Algorithms | Vocab Sizes | Count |
|---|---|---|---|---|
| Shared | Single vocabulary trained on mixed Arabic + Arabizi corpus | BPE, Unigram, WordPiece, BBPE | 8K, 16K, 32K, 80K, 110K | 20 |
| Concatenated | Separate per-script vocabularies (V/2 each) with ID shifting | BPE, Unigram, WordPiece, BBPE | 8K, 16K, 32K, 80K, 110K | 20 |

All tokenizers are released in both raw (HuggingFace tokenizers) and transformers-compatible formats. The 80K and 110K sizes match DarijaBERT's vocabulary sizes for direct comparison.
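The ID shifting used by the concatenated architecture can be illustrated with a toy example. This is a minimal sketch, not the released implementation: the whole-word vocabularies, the `detect_script` heuristic, the per-word routing, and the rule that Arabizi IDs are offset by the size of the Arabic vocabulary are all assumptions made for illustration.

```python
# Toy sketch of a concatenated tokenizer: two per-script vocabularies share
# one ID space by shifting the second vocabulary's IDs.
# Assumptions (not taken from the released code): whole-word toy vocabularies,
# per-word script routing, and an offset equal to the Arabic vocabulary size.
from typing import List

VOCAB_AR = {"[UNK]": 0, "شنو": 1, "كاين": 2}
VOCAB_AZ = {"[UNK]": 0, "wash": 1, "kayn": 2, "jdid": 3}
OFFSET = len(VOCAB_AR)  # Arabizi IDs start right after the Arabic block


def detect_script(word: str) -> str:
    """Return 'ar' if the word contains any Arabic-block character, else 'az'."""
    return "ar" if any("\u0600" <= ch <= "\u06FF" for ch in word) else "az"


def encode(text: str) -> List[int]:
    """Route each whitespace-separated word to its script's vocabulary."""
    ids = []
    for word in text.split():
        if detect_script(word) == "ar":
            ids.append(VOCAB_AR.get(word, VOCAB_AR["[UNK]"]))
        else:
            ids.append(VOCAB_AZ.get(word, VOCAB_AZ["[UNK]"]) + OFFSET)
    return ids


def decode(ids: List[int]) -> str:
    """Invert the shift: IDs below OFFSET are Arabic, the rest are Arabizi."""
    inv_ar = {i: t for t, i in VOCAB_AR.items()}
    inv_az = {i: t for t, i in VOCAB_AZ.items()}
    return " ".join(inv_ar[i] if i < OFFSET else inv_az[i - OFFSET] for i in ids)


mixed_ids = encode("wash كاين jdid")
print(mixed_ids)
print(decode(mixed_ids))
```

Because the two ID ranges never overlap, a single embedding table of size V can serve both scripts while each script keeps its own merge rules.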


Quick Start

from transformers import AutoTokenizer

# Load a shared tokenizer
tok = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/shared_bpe_32000"
)

# Tokenize Arabic-script text
text_ar = "مابقاش كيعرف شنو يدير، بين القانون وبين وليداتو."
print(tok.encode(text_ar))

# Load a concatenated tokenizer (separate Arabic and Arabizi sub-tokenizers)
tok_ar = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/concat_bpe_32000_tokenizer_ar"
)
tok_az = AutoTokenizer.from_pretrained(
    "OiQ/daa-tokenizers",
    subfolder="transformers_tokenizers/concat_bpe_32000_tokenizer_az"
)

text_az = "wash kayn shi jdid?"
print(tok_az.encode(text_az))

Key Results

Best Tokenizer per Vocabulary Size

| Vocab | Configuration | Algorithm | Fertility ↓ | Disparity ↓ | Exact Match |
|---|---|---|---|---|---|
| 8K | Shared | WordPiece | 1.572 | 0.164 | 99.9% |
| 16K | Shared | WordPiece | 1.402 | 0.138 | 99.9% |
| 32K | Shared | WordPiece | 1.274 | 0.099 | 99.9% |
| 80K | Shared | WordPiece | 1.171 | 0.049 | 99.9% |
| 110K | Concat | WordPiece | 1.155 | 0.093 | 99.6% |

Comparison with Existing Tokenizers

| Tokenizer | Vocab | Fertility ↓ | Disparity ↓ | EM (Ar) | EM (Az) |
|---|---|---|---|---|---|
| Ours: concat WP 110K | 110K | 1.155 | 0.093 | 99.9% | 99.6% |
| Ours: concat WP 80K | 80K | 1.183 | 0.090 | 99.9% | 99.6% |
| Ours: concat BPE 32K | 32K | 1.307 | 0.084 | 99.9% | 99.6% |
| DarijaBERT-ar | 80K | 1.761 | 0.410 | 13.7% | 8.0% |
| DarijaBERT-az | 110K | 1.575 | 0.055 | 14.8% | 8.0% |
| DarijaBERT-mix | 160K | 1.414 | 0.149 | 14.8% | 8.0% |
| CaMeLBERT-MSA | 30K | 2.289 | 0.427 | 29.9% | 38.9% |
| Aranizer-SP-86k | 86K | 1.918 | 0.368 | 99.8% | 99.6% |
| Qwen2.5-Darija | 152K | 2.307 | 0.040 | 100.0% | 100.0% |

At matching vocabulary sizes, our 80K tokenizer achieves 33% lower fertility than DarijaBERT-ar (1.183 vs 1.761), and our 110K tokenizer achieves 27% lower fertility than DarijaBERT-az (1.155 vs 1.575). Even our 32K tokenizer outperforms DarijaBERT-az despite using 3.4× fewer vocabulary slots. DarijaBERT-mix, despite its much larger 160K vocabulary (F = 1.414), still underperforms our 32K tokenizer (F = 1.307): vocabulary size alone cannot compensate for a suboptimal training architecture.
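The percentages quoted here follow from the fertility values in the comparison table. The short check below recomputes them as a relative reduction, (baseline − ours) / baseline. The helper name `relative_reduction` is introduced only for this illustration.

```python
# Relative fertility reduction: (baseline - ours) / baseline.
# The fertility values are copied from the comparison table; the helper
# name is hypothetical and used only for this worked example.

def relative_reduction(ours: float, baseline: float) -> float:
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


pairs = {
    "concat WP 80K vs DarijaBERT-ar (80K)": (1.183, 1.761),
    "concat WP 110K vs DarijaBERT-az (110K)": (1.155, 1.575),
    "concat WP 110K vs CaMeLBERT-MSA (30K)": (1.155, 2.289),
}

for name, (ours, baseline) in pairs.items():
    print(f"{name}: {relative_reduction(ours, baseline):.1%} lower fertility")
```

Rounded to whole percentages, the three comparisons come out at about 33%, 27%, and 50%.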


Evaluation Metrics

Compression & Fairness

| Metric | Definition | Direction |
|---|---|---|
| Fertility (F) | Tokens per word, averaged over test set | Lower is better |
| CPT | Grapheme clusters per token (Unicode-aware) | Higher is better |
| Disparity (ΔF) | Relative cross-script gap: \|F_ar − F_az\| / max(F_ar, F_az) | Lower is better |
| Exact Match | Fraction of texts that round-trip perfectly through encode/decode | Higher is better |
| Gini | Vocabulary usage inequality (0 = uniform, 1 = concentrated) | Lower is better |

Overall fertility is word-count-weighted: F ≈ 0.65·F_ar + 0.35·F_az
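The compression and fairness metrics can be sketched on a toy tokenizer. This is an illustration under stated assumptions, not the benchmark's evaluation code: `toy_encode` and `toy_decode` are hypothetical stand-ins for a real tokenizer, words are taken to be whitespace-separated, and fertility is computed as total tokens over total words. The Mixed script is left out for brevity.

```python
# Sketch of fertility, cross-script disparity, and exact match.
# Assumptions: whitespace-delimited words, fertility = total tokens / total
# words, and a hypothetical toy tokenizer that cuts words into 3-char chunks.

def toy_encode(text):
    """Split each word into chunks of at most 3 characters."""
    tokens = []
    for word in text.split():
        marked = "▁" + word  # "▁" marks a word start so spaces can be restored
        tokens.extend(marked[i:i + 3] for i in range(0, len(marked), 3))
    return tokens


def toy_decode(tokens):
    """Join tokens and turn word-start markers back into spaces."""
    return "".join(tokens).replace("▁", " ").strip()


def fertility(texts):
    """Total tokens divided by total whitespace-separated words."""
    n_tokens = sum(len(toy_encode(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words


def disparity(f_ar, f_az):
    """Relative cross-script gap |F_ar - F_az| / max(F_ar, F_az)."""
    return abs(f_ar - f_az) / max(f_ar, f_az)


def exact_match(texts):
    """Fraction of texts that survive an encode/decode round trip."""
    return sum(toy_decode(toy_encode(t)) == t for t in texts) / len(texts)


texts_ar = ["شنو كاين"]
texts_az = ["wash kayn f dar"]

f_ar, f_az = fertility(texts_ar), fertility(texts_az)
n_ar = sum(len(t.split()) for t in texts_ar)
n_az = sum(len(t.split()) for t in texts_az)

# The overall fertility equals the per-script values weighted by word share.
overall = fertility(texts_ar + texts_az)
weighted = (n_ar * f_ar + n_az * f_az) / (n_ar + n_az)

print(f_ar, f_az, disparity(f_ar, f_az), overall, weighted)
print(exact_match(texts_ar + texts_az))
```

Under these assumptions the 0.65 and 0.35 coefficients in the formula above would correspond to each script's share of the words in the test set.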

Morphological Fidelity (Arabic-script only)

| Metric | Definition | Direction |
|---|---|---|
| μ_e | Edit distance between tokenizer boundaries and Farasa morpheme boundaries | Lower is better |
| μ_c-F1 | Whether words sharing morphemes also share tokens (KMeans + TF-IDF) | Higher is better |
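One simple way to compare token boundaries with morpheme boundaries is to count the boundary insertions and deletions needed to turn one segmentation into the other, which is the size of the symmetric difference of the two boundary sets. The sketch below uses that reading. It is an assumption about how such a distance could be computed, not the benchmark's exact definition of μ_e, and the gold segmentation is written by hand rather than produced by Farasa.

```python
# Boundary-level edit distance between two segmentations of the same word.
# Assumption: the distance counts boundary insertions and deletions only
# (symmetric difference of boundary offsets). The benchmark's exact
# definition may differ.

def boundaries(pieces):
    """Character offsets at which a segmentation splits the word."""
    offsets, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        offsets.add(pos)
    return offsets


def boundary_edit_distance(token_pieces, morpheme_pieces):
    """Number of boundaries to add or remove to match the morpheme split."""
    if "".join(token_pieces) != "".join(morpheme_pieces):
        raise ValueError("both segmentations must spell the same word")
    return len(boundaries(token_pieces) ^ boundaries(morpheme_pieces))


# Hand-written morpheme split of "وكتبها" (conjunction + verb + pronoun).
gold = ["و", "كتب", "ها"]

aligned = ["و", "كتب", "ها"]   # tokenizer output that follows the morphemes
misaligned = ["وك", "تبها"]    # tokenizer output that cuts across them

print(boundary_edit_distance(aligned, gold))
print(boundary_edit_distance(misaligned, gold))
```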

Statistical Rigor

All fertility and CPT values include bootstrap 95% confidence intervals (500 resamples). CI width is ≤ 0.006 for all configurations.
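A percentile bootstrap for fertility can be sketched as follows. The details are assumptions rather than a description of bootstrap_test_set.py: sentences are resampled with replacement, fertility is recomputed as total tokens over total words in each resample, and the interval is read off the 2.5th and 97.5th percentiles. The token and word counts are synthetic.

```python
# Percentile bootstrap for a 95% confidence interval on fertility.
# Assumptions: sentence-level resampling, fertility = total tokens / total
# words, percentile interval. The counts below are synthetic, not benchmark data.
import random


def bootstrap_fertility_ci(counts, n_resamples=500, alpha=0.05, seed=0):
    """counts is a list of (n_tokens, n_words) pairs, one per sentence."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(counts, k=len(counts))  # resample with replacement
        stats.append(sum(t for t, _ in sample) / sum(w for _, w in sample))
    stats.sort()
    lower = stats[int(alpha / 2 * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic per-sentence counts standing in for real tokenizer output.
data_rng = random.Random(42)
counts = []
for _ in range(2000):
    n_words = data_rng.randint(3, 15)
    n_tokens = n_words + data_rng.randint(0, n_words // 2)
    counts.append((n_tokens, n_words))

point = sum(t for t, _ in counts) / sum(w for _, w in counts)
lower, upper = bootstrap_fertility_ci(counts)
print(f"fertility = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```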


Visualizations

Fertility by Algorithm and Vocabulary Size

[Figure: Fertility Comparison]

Cross-Script Disparity

[Figure: Disparity Comparison]

External Tokenizer Comparison

[Figure: External Comparison]


Key Findings

  1. Darija-specific tokenization substantially improves compression. Our tokenizers achieve 27–33% lower fertility than DarijaBERT at matching vocabulary sizes.

  2. Cross-script fairness is achievable. Shared Unigram at 80K reaches ΔF = 0.015 (near-zero disparity) without concatenation. Concatenated architectures maintain ΔF ≤ 0.094 across all non-BBPE algorithms.

  3. Vocabulary size saturates early. Moving from 32K to 110K yields only marginal gains (best fertility 1.274 at 32K vs 1.155 at 110K). The training corpus exhausts merge candidates around 80K.

  4. BBPE requires concatenation. Shared BBPE exhibits extreme cross-script disparity (ΔF = 0.219–0.243). Concatenation reduces this roughly fourfold.

  5. MSA tokenizers transfer poorly to Darija. Our tokenizers achieve 40–50% lower fertility than MSA-trained tokenizers, which allocate vocabulary to MSA patterns absent in Darija.


Methodology

| Stage | Details |
|---|---|
| Dataset | OiQ/daa-pairs: 112,814 Moroccan Darija sentence triplets (Arabic, Arabizi, Mixed) from 12,695 unique sources |
| Split | 80/10/10 train/validation/test with stratified sampling |
| Pre-tokenization | Metaspace (BPE/Unigram/WordPiece) or ByteLevel (BBPE) |
| Training | HuggingFace tokenizers library with matched pre-tokenizer/decoder pairs |
| Evaluation | 14 metrics on 33,846 test texts (11,282 per script × 3 scripts) |
| Morphological | Farasa segmenter for gold-standard Arabic morpheme boundaries |
| Export | Raw + transformers-compatible via PreTrainedTokenizerFast |
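The training stage with a matched pre-tokenizer/decoder pair can be sketched with the HuggingFace tokenizers library. This is a minimal example, not the configuration used in script.py: the three-sentence corpus, the vocabulary size of 300, and the single `[UNK]` special token are placeholders chosen for illustration.

```python
# Minimal BPE training run with a matched Metaspace pre-tokenizer and decoder.
# Assumptions: tiny placeholder corpus, vocab_size=300, and a single [UNK]
# special token. The benchmark's actual settings are not reproduced here.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

corpus = [
    "wash kayn shi jdid?",
    "شنو كاين شي جديد؟",
    "ma b9ash kay3ref shnu ydir",
]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Using the same scheme for pre-tokenization and decoding lets decode()
# restore the spaces that Metaspace replaced during encoding.
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
tokenizer.decoder = decoders.Metaspace()

trainer = trainers.BpeTrainer(vocab_size=300, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("wash kayn shi jdid?")
print(encoding.tokens)
print(tokenizer.decode(encoding.ids))
```

For BBPE, the same pattern would presumably use `pre_tokenizers.ByteLevel` together with `decoders.ByteLevel` in place of the Metaspace pair.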

Code & Reproducibility

All scripts, evaluation code, and documentation are included in this repository. See code.md for a complete guide to every script and its outputs.

Repository Structure

daa-tokenizers/
├── README.md                        # This file
├── code.md                          # Script documentation
├── script.py                        # Master benchmark pipeline
├── eval_test_set.py                 # Test-set evaluation
├── eval_morph_large.py              # Morphological metrics (80K/110K)
├── bootstrap_test_set.py            # Bootstrap confidence intervals
├── eval_all_externals.py            # External tokenizer comparison
├── eval_codeswitch_and_new_baselines.py
├── eval_doda_independent.py         # DODa validation
├── regen_figures.py                 # Figure generation
├── verify_arithmetic.py             # Numeric claims verification
├── results/
│   ├── test_set_results.csv         # Primary results (40 tokenizers)
│   ├── external_comparison.csv      # External comparison
│   ├── morph_large_vocab_results.csv  # Morphological metrics (80K/110K)
│   ├── bootstrap_ci_test_set.csv    # Bootstrap 95% CIs
│   ├── tokenizers/                  # 60 raw tokenizer JSONs
│   ├── transformers_tokenizers/     # Transformers-compatible exports
│   ├── corpora/                     # Train/test text splits
│   ├── morphology/                  # Farasa segmentations cache
│   └── plots/                       # All visualization PNGs
└── plots/                           # Paper figures

Citation

@misc{laamiri2026daa-tokenizers,
  title     = {Darija Subword Tokenizer Benchmark},
  author    = {Laamiri, Ouail and Berrada, Ismail and Belfadil, Anas},
  year      = {2026},
  url       = {https://huggingface.co/OiQ/daa-tokenizers},
  note      = {In collaboration with UM6P College of Computing}
}

License

MIT License. See LICENSE for details.

Acknowledgments

This work was developed in collaboration with the UM6P College of Computing, Mohammed VI Polytechnic University, Ben Guerir, Morocco. We thank the HuggingFace community for providing the infrastructure to host these resources.
