Code Documentation: Darija Tokenizer Benchmark
This document describes every script, data file, and output artifact in the benchmark codebase.
Overview
The benchmark pipeline consists of four stages: training, evaluation, analysis, and reporting. Each stage is implemented by standalone Python scripts. The diagram below shows the data flow:
    OiQ/daa-pairs (dataset)
            |
            v
    +-------------------+      +----------------------+
    | script.py         |----->| results/tokenizers/  |  (60 raw JSON files)
    | (train 8K-32K)    |      | results/transformers |  (transformers format)
    +-------------------+      +----------------------+
            |                             ^
            v                             |
    +-------------------+      +----------------------+
    | train_large_vocab |----->| 80K + 110K configs   |
    | train_remaining   |      | (16 additional)      |
    +-------------------+      +----------------------+
            |
            v
    +-------------------------------------------------+
    | EVALUATION SCRIPTS                              |
    | |-- eval_test_set.py        -> test_set_results  |
    | |-- eval_new_and_append.py  -> append 80K/110K   |
    | |-- eval_missing.py         -> fill gaps         |
    | |-- eval_morph_large.py     -> morph 80K/110K    |
    | |-- bootstrap_test_set.py   -> 95% CIs           |
    | |-- eval_all_externals.py   -> external comp.    |
    | |-- eval_codeswitch_...     -> code-switching    |
    | |-- eval_doda_independent   -> DODa validation   |
    +-------------------------------------------------+
            |
            v
    +-------------------+      +----------------------+
    | regen_figures     |----->| figures/*.png        |
    | gen_report        |----->| benchmark_report.md  |
    | verify_arithmetic |----->| (stdout validation)  |
    +-------------------+      +----------------------+
Training Scripts
script.py: Master Benchmark Pipeline
Lines: ~2032 | Type: Runs at import (no __main__ guard)
The monolithic entry point. Loads the OiQ/daa-pairs dataset, trains 24 tokenizers (4 algorithms x 2 architectures x 3 vocab sizes: 8K, 16K, 32K), evaluates them on the full training+test corpus, computes morphological metrics via Farasa, generates bootstrap confidence intervals, and produces all plots and reports.
Key class: ProductionMetricsEvaluator, which implements script detection, tokenization, Gini coefficient, and all metric computations.
Inputs:
- OiQ/daa-pairs dataset (via huggingface_hub)
- BenchmarkConfig dataclass (vocab sizes, algorithms, hyperparameters)
Outputs:
- results/tokenizers/*.json: 48 raw tokenizer files (24 shared + 24 concat halves)
- results/transformers_tokenizers/: transformers-compatible exports
- results/tokenizer_results.csv / .json: full metrics with morphological data
- results/bootstrap_ci.csv: bootstrap CIs
- results/benchmark_report.md: auto-generated summary
- results/morphology/farasa_segmentations.json: cached Farasa segmentations (~99 MB)
- results/plots/*.png: all visualization figures
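The Gini coefficient listed among the ProductionMetricsEvaluator metrics measures how unevenly usage is spread across items. A minimal sketch follows, assuming it is computed over token frequency counts; the function name gini, the sample token IDs, and the exact formula are illustrative assumptions and are not taken from script.py.

```python
from collections import Counter

def gini(counts):
    """Gini coefficient of non-negative counts (0.0 means perfectly even usage)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Closed form on ascending values: G = 2*sum(i*x_i) / (n*sum(x)) - (n+1)/n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n

# Hypothetical token IDs from an encoded corpus (not benchmark data).
token_ids = [5, 5, 5, 7, 9, 9]
freqs = Counter(token_ids).values()  # counts 3, 1, 2
print(round(gini(freqs), 3))  # 0.222
```

A value near 0 would indicate that tokens are used about equally often, while a value near 1 would indicate that a few tokens dominate.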
train_large_vocab.py: Train 80K/110K Tokenizers
Lines: ~146
Trains additional tokenizers at 80K and 110K vocabulary sizes to match DarijaBERT-ar (80K) and DarijaBERT-az (110K) for fair head-to-head comparison.
Inputs: results/corpora/train_{ar,az}.txt
Outputs: results/tokenizers/{shared,concat_ar,concat_az}_{algo}_{80000,110000}.json
train_remaining.py: Train Remaining Tokenizers
Lines: ~134
Fills in the last missing tokenizer configurations (3 shared 110K + 6 concat 110K + 1 concat 80K BBPE) that train_large_vocab.py did not cover.
Inputs: results/corpora/train_{ar,az}.txt
Outputs: Remaining results/tokenizers/*.json files
retrain_missing_and_compare.py: Retrain + Full Re-evaluation
Lines: ~558
Retrains 4 missing concat 32K tokenizers, exports to transformers format, re-evaluates all 28 configs, fixes a WordPiece exact-match bug, and runs the external comparison pipeline.
Inputs: Training corpora, HF external models
Outputs: Updated tokenizer JSONs, external_comparison.csv, external comparison plot
Evaluation Scripts
eval_test_set.py: Test-Set Evaluation (Single Source of Truth)
Lines: ~227
Re-evaluates all tokenizers on the held-out test set only (11,282 sentences per script). This is the authoritative evaluation used in all paper tables and percentage claims.
Key function: normalize_decode(), which fixes Metaspace double-space artifacts in WordPiece decoders.
Inputs:
- results/tokenizers/*.json
- results/corpora/test_{ar,az,mi}.txt
Outputs: results/test_set_results.csv / .json (40 rows)
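The kind of cleanup attributed to normalize_decode() can be sketched as below. This assumes the function collapses runs of spaces and trims the ends before the exact-match comparison; the body shown here and the exact_match helper are hypothetical, not the script's actual code.

```python
import re

def normalize_decode(text: str) -> str:
    """Collapse runs of two or more spaces into one and strip the ends."""
    return re.sub(r" {2,}", " ", text).strip()

def exact_match(original: str, decoded: str) -> bool:
    """Hypothetical helper: compare a round trip after normalizing both sides."""
    return normalize_decode(original) == normalize_decode(decoded)

decoded = " salam  3likom"  # made-up decoder output with a double-space artifact
print(normalize_decode(decoded))             # salam 3likom
print(exact_match("salam 3likom", decoded))  # True
```

Normalizing both sides keeps the exact-match metric from penalizing a tokenizer for whitespace that only the decoder introduced.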
eval_new_and_append.py: Append 80K/110K Results
Lines: ~144
Evaluates the newly trained 80K/110K tokenizers and appends their rows to test_set_results.csv.
Inputs: test_set_results.csv, tokenizers, test corpora
Outputs: Updated test_set_results.csv (grows from 24 to 40 rows)
eval_missing.py: Fill Single Gap
Lines: ~124
Evaluates the one remaining missing tokenizer (concat_bbpe_55000) and merges it into the results CSV.
eval_morph_large.py: Morphological Metrics for 80K/110K
Lines: ~297
Computes morphological fidelity metrics (edit distance and consistency F1) for the 16 large-vocabulary tokenizers (80K, 110K) that were not covered by the original script.py morph evaluation. Uses the same Farasa cache and identical algorithms.
Inputs:
- results/morphology/farasa_segmentations.json
- results/corpora/test_ar.txt
- 16 tokenizer JSON files
Outputs: results/morph_large_vocab_results.csv (16 rows)
bootstrap_test_set.py: Bootstrap Confidence Intervals
Lines: ~163
Computes 95% bootstrap confidence intervals (500 resamples) for fertility and CPT on the test set.
Inputs: Tokenizers, test_{ar,az,mi}.txt
Outputs: results/bootstrap_ci_test_set.csv (24 rows for 8K-32K configs)
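A percentile bootstrap of the kind described here can be sketched with the standard library alone. The sketch rests on several assumptions: fertility is taken to be tokens per word, sentences are the resampling unit, and the seed is fixed. bootstrap_test_set.py may differ on any of these points, and the per-sentence counts below are made up.

```python
import random

def bootstrap_ci(token_counts, word_counts, n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for corpus-level fertility (total tokens / total words)."""
    rng = random.Random(seed)
    n = len(token_counts)
    stats = []
    for _ in range(n_resamples):
        # Resample sentence indices with replacement, keeping the sample size at n.
        idx = [rng.randrange(n) for _ in range(n)]
        tokens = sum(token_counts[i] for i in idx)
        words = sum(word_counts[i] for i in idx)
        stats.append(tokens / words)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Made-up per-sentence counts (each sentence must contain at least one word).
tokens_per_sentence = [12, 9, 15, 7, 11]
words_per_sentence = [8, 6, 9, 5, 7]
low, high = bootstrap_ci(tokens_per_sentence, words_per_sentence)
print(f"95% CI for fertility: [{low:.3f}, {high:.3f}]")
```

Resampling whole sentences rather than individual tokens keeps each sentence's token and word counts paired.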
External Comparison Scripts
eval_all_externals.py: Evaluate 9 External Tokenizers
Lines: ~281
Evaluates 9 external Arabic/Darija tokenizers from HuggingFace (CaMeLBERT-MSA, Asafaya-BERT, Aranizer-SP-86k, B2BERT, DarijaBERT-ar, DarijaBERT-az, Darija-Tokenizer, Translit-Darija, Qwen2.5-Darija) alongside our best 3 tokenizers.
Inputs: HF model repos (requires HF_TOKEN), test corpora
Outputs: results/external_comparison.csv / .json, comparison plot
compare_with_external.py: External Comparison (Earlier Version)
Lines: ~269
Earlier version of the external comparison, comparing our 8K/16K/32K tokenizers against 5 external models.
eval_and_compare.py: Combined Evaluation + Comparison
Lines: ~277
Combines internal evaluation and external comparison into a single pipeline run.
eval_codeswitch_and_new_baselines.py: Code-Switching Evaluation
Lines: ~373
Evaluates tokenizers on mixed-script (code-switched) texts as a separate category. Also adds the atlasia/darija_bpe_tokenizer baseline and evaluates all tokenizers on DODa.
Outputs: results/codeswitch_results.csv / .json
eval_doda_independent.py: DODa Independent Validation
Lines: ~196
Evaluates all tokenizers on atlasia/DODa (87K Arabizi dictionary entries), sampled to 10K entries. Serves as an independent, out-of-training-distribution validation.
Outputs: results/doda_independent_results.csv / .json
Utility Scripts
fix_tokenizer_decoders.py: Decoder Bug Fixer
Lines: ~152
Patches three decoder bugs in tokenizer JSON files:
- WordPiece double-space artifact (Metaspace decoder producing "  " instead of " ")
- NULL decoder in concat_bpe_16000 (missing Metaspace decoder)
- Missing WordPiece sub-decoder in concat_wordpiece_16000
Warning: Modifies tokenizer files in place. No backup is created.
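Since the script overwrites files, it may be worth copying the tokenizer directory before running it. A minimal sketch, assuming the files live under results/tokenizers/ as described elsewhere in this document; the backup_dir function and the backups/ location are hypothetical and not part of the codebase.

```python
import shutil
import time
from pathlib import Path

def backup_dir(src: str, backup_root: str = "backups") -> Path:
    """Copy `src` into a timestamped folder under `backup_root`; return the copy's path."""
    source = Path(src)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{source.name}-{stamp}"
    # copytree creates missing parent directories and fails if `target` already exists.
    shutil.copytree(source, target)
    return target

# Usage, run before fix_tokenizer_decoders.py (path assumed from this document):
#   backup_dir("results/tokenizers")
```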
gen_report.py: Report Generator
Lines: ~70
Generates a Markdown summary report (benchmark_report.md) from tokenizer_results.csv, including best-by-vocabulary tables and full results.
regen_figures.py: Figure Regenerator
Lines: ~313
Regenerates the 5 main paper figures from test_set_results.csv with larger fonts (14pt base) for readability. Reads from CSV and writes PNGs to the paper's figures/ directory.
Inputs: results/test_set_results.csv
Outputs: figures/fertility_overall_comparison_v2.png, fertility_overall_trends.png, fertility_disparity_comparison_v2.png, fertility_disparity_heatmap_v2.png, external_comparison.png
verify_arithmetic.py: Numeric Claims Verification
Lines: ~143
Validates every percentage claim in the paper and README against test_set_results.csv. Checks:
- Disparity formula: |F_ar - F_az| / max(F_ar, F_az)
- Overall fertility derivability from per-script values
- All percentage improvement claims (27-34%, 40-50%, etc.)
Outputs: Stdout validation report (no files written)
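The disparity check is easy to reproduce by hand. The sketch below applies the disparity formula above to made-up fertility values. The relative_improvement helper is an assumed definition of a "percentage improvement" (reduction relative to a baseline, with lower fertility treated as better), so it may not match what verify_arithmetic.py computes.

```python
def disparity(f_ar: float, f_az: float) -> float:
    """Relative fertility gap between the ar and az scripts: |F_ar - F_az| / max(F_ar, F_az)."""
    return abs(f_ar - f_az) / max(f_ar, f_az)

def relative_improvement(baseline: float, ours: float) -> float:
    """Assumed definition: percentage reduction of `ours` relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Made-up values, not results from the benchmark.
print(round(disparity(2.0, 1.5), 3))             # 0.25
print(round(relative_improvement(2.0, 1.4), 1))  # 30.0
```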
Result Files
Primary Data
| File | Rows | Description |
|---|---|---|
| test_set_results.csv | 40 | Single source of truth. Test-set metrics for all tokenizers: fertility, CPT, disparity, Gini, entropy, exact match. |
| tokenizer_results.csv | 24 | Full benchmark results incl. morphological metrics and per-script breakdown. Covers 8K-32K only. |
| morph_large_vocab_results.csv | 16 | Morphological metrics for 80K/110K tokenizers. |
| bootstrap_ci_test_set.csv | 24 | Bootstrap 95% CIs (500 resamples) for fertility and CPT. |
| external_comparison.csv | 12 | Our best 3 + 9 external tokenizers: fertility, CPT, disparity, exact match per script. |
| codeswitch_results.csv | 5 | Code-switching evaluation with mixed-script category. |
| doda_independent_results.csv | 12 | DODa independent validation (Arabizi dictionary). |
Supporting Data
| File / Directory | Description |
|---|---|
| corpora/ | Train/validation/test text splits: {train,val,test}_{ar,az,mi}.txt |
| morphology/farasa_segmentations.json | Cached Farasa morphological segmentations for Arabic texts (~99 MB) |
| tokenizers/ | Raw HuggingFace tokenizers JSON files (60 files: 24 shared + 36 concat halves) |
| transformers_tokenizers/ | Tokenizers exported for transformers library use |
| doda_sample_10k.txt | 10K-line Arabizi sample from DODa dataset |
| benchmark_report.md | Auto-generated Markdown summary report |
Reproduction Guide
Full Reproduction (from scratch)
    # 1. Train 8K-32K tokenizers + initial evaluation + morphology
    python script.py

    # 2. Train 80K + 110K tokenizers
    python train_large_vocab.py
    python train_remaining.py

    # 3. Evaluate on test set (appends 80K/110K to results)
    python eval_test_set.py
    python eval_new_and_append.py
    python eval_missing.py

    # 4. Compute morphological metrics for large vocabs
    python eval_morph_large.py

    # 5. Bootstrap confidence intervals
    python bootstrap_test_set.py

    # 6. External tokenizer comparison
    python eval_all_externals.py

    # 7. Code-switching + DODa validation
    python eval_codeswitch_and_new_baselines.py
    python eval_doda_independent.py

    # 8. Generate figures + reports
    python regen_figures.py
    python gen_report.py

    # 9. Verify all numeric claims
    python verify_arithmetic.py
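For unattended runs, the same sequence could be driven from Python and halted at the first failure. This is only a sketch: the PIPELINE list mirrors the commands above, while run_pipeline and its runner parameter are hypothetical conveniences that do not exist in the codebase.

```python
import subprocess
import sys

# Script order copied from the reproduction steps above.
PIPELINE = [
    "script.py",
    "train_large_vocab.py",
    "train_remaining.py",
    "eval_test_set.py",
    "eval_new_and_append.py",
    "eval_missing.py",
    "eval_morph_large.py",
    "bootstrap_test_set.py",
    "eval_all_externals.py",
    "eval_codeswitch_and_new_baselines.py",
    "eval_doda_independent.py",
    "regen_figures.py",
    "gen_report.py",
    "verify_arithmetic.py",
]

def run_pipeline(scripts=PIPELINE, runner=subprocess.run):
    """Run each script in order with the current interpreter; stop at the first failure."""
    completed = []
    for script in scripts:
        result = runner([sys.executable, script])
        if result.returncode != 0:
            raise RuntimeError(f"{script} exited with code {result.returncode}")
        completed.append(script)
    return completed

# Usage, from the repository root:
#   run_pipeline()
```

Stopping at the first non-zero exit code avoids evaluating against tokenizers that a failed training step never produced.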
Requirements
- Python 3.10+
- tokenizers, transformers, datasets (HuggingFace stack)
- scikit-learn (KMeans for morphological consistency)
- regex (Unicode grapheme segmentation)
- numpy, pandas, matplotlib, seaborn
- tqdm
- Farasa JAR (for morphological segmentation; pre-cached in morphology/)
- HF_TOKEN environment variable (for loading external models)
Key Design Decisions
- Monolithic script.py: The main pipeline runs at import level (no __main__ guard). This is intentional for checkpoint-based resumption: the script detects existing artifacts and skips completed stages.
- Duplicated helper functions: Functions like detect_script(), count_graphemes(), and normalize_decode() are copied across evaluation scripts rather than shared via import. This ensures each eval script is self-contained and runnable independently.
- Test-set-only evaluation: All paper numbers come from eval_test_set.py, not script.py's full-corpus evaluation. The test set (11,282 sentences per script) provides unbiased estimates.
- Concatenated architecture: Each concat config is stored as two JSON files (concat_ar_*.json + concat_az_*.json). The evaluator loads both and applies ID shifting at inference time.
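The ID shifting used by the concatenated architecture can be illustrated without the tokenizers library. The sketch assumes that text is routed to one half by script, that the first half keeps its IDs, and that the second half's IDs are offset by the first half's vocabulary size. The routing rule, the stub encoders, and all names are hypothetical.

```python
from typing import Callable, List

def encode_concat(
    text: str,
    is_first_script: Callable[[str], bool],
    encode_first: Callable[[str], List[int]],
    encode_second: Callable[[str], List[int]],
    first_vocab_size: int,
) -> List[int]:
    """Route text to one of two tokenizers; shift second-half IDs past the first vocab."""
    if is_first_script(text):
        return encode_first(text)
    return [token_id + first_vocab_size for token_id in encode_second(text)]

def has_arabic(text: str) -> bool:
    """Crude script check: any character in the basic Arabic Unicode block."""
    return any("\u0600" <= ch <= "\u06ff" for ch in text)

# Stub encoders standing in for the two JSON tokenizers; both return the same local IDs.
stub_ar = lambda text: [3, 17]
stub_az = lambda text: [3, 17]

print(encode_concat("سلام", has_arabic, stub_ar, stub_az, 16000))   # [3, 17]
print(encode_concat("salam", has_arabic, stub_ar, stub_az, 16000))  # [16003, 16017]
```

Under these assumptions, identical local IDs from the two halves land in disjoint ranges of a single combined vocabulary.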