# Code Documentation — Darija Tokenizer Benchmark

This document describes every script, data file, and output artifact in the benchmark codebase.

---

## Overview

The benchmark pipeline consists of four stages: **training**, **evaluation**, **analysis**, and **reporting**. Each stage is implemented by standalone Python scripts. The diagram below shows the data flow:

```
OiQ/daa-pairs (dataset)
         │
         ▼
┌──────────────────┐     ┌─────────────────────┐
│   script.py      │────▶│ results/tokenizers/ │  (60 raw JSON files)
│  (train 8K-32K)  │     │ results/transformers│  (transformers format)
└──────────────────┘     └─────────────────────┘
         │                          ▲
         ▼                          │
┌──────────────────┐     ┌─────────────────────┐
│train_large_vocab │────▶│ 80K + 110K configs  │
│train_remaining   │     │ (16 additional)     │
└──────────────────┘     └─────────────────────┘
         │
         ▼
┌──────────────────────────────────────────────┐
│          EVALUATION SCRIPTS                  │
│ ├── eval_test_set.py       → test_set_results│
│ ├── eval_new_and_append.py → append 80K/110K │
│ ├── eval_missing.py        → fill gaps       │
│ ├── eval_morph_large.py    → morph 80K/110K  │
│ ├── bootstrap_test_set.py  → 95% CIs         │
│ ├── eval_all_externals.py  → external comp.  │
│ ├── eval_codeswitch_...    → code-switching  │
│ └── eval_doda_independent  → DODa validation │
└──────────────────────────────────────────────┘
         │
         ▼
┌──────────────────┐     ┌─────────────────────┐
│ regen_figures    │────▶│ figures/*.png       │
│ gen_report       │────▶│ benchmark_report.md │
│ verify_arithmetic│────▶│ (stdout validation) │
└──────────────────┘     └─────────────────────┘
```

---

## Training Scripts

### `script.py` — Master Benchmark Pipeline

**Lines:** ~2032 | **Type:** Runs at import (no `__main__` guard)

The monolithic entry point. Loads the `OiQ/daa-pairs` dataset, trains 24 tokenizers (4 algorithms x 2 architectures x 3 vocab sizes: 8K, 16K, 32K), evaluates them on the full training+test corpus, computes morphological metrics via Farasa, generates bootstrap confidence intervals, and produces all plots and reports.

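The 24-configuration grid described above can be enumerated in a few lines. This is a minimal sketch, not code taken from `script.py`. The labels `shared` and `concat` and the algorithm names `bpe`, `bbpe`, and `wordpiece` follow the tokenizer file names used elsewhere in this document; `unigram` as the fourth algorithm and the `config_name` helper are assumptions made for illustration.

```python
from itertools import product

# Assumed labels: three algorithm names appear in tokenizer file names in this
# document; "unigram" as the fourth is a guess for illustration only.
ALGORITHMS = ["bpe", "bbpe", "wordpiece", "unigram"]
ARCHITECTURES = ["shared", "concat"]
VOCAB_SIZES = [8_000, 16_000, 32_000]


def config_name(architecture: str, algorithm: str, vocab_size: int) -> str:
    """Hypothetical naming helper mirroring names like `concat_bpe_16000`."""
    return f"{architecture}_{algorithm}_{vocab_size}"


def enumerate_configs() -> list[str]:
    """Return one name per (architecture, algorithm, vocab size) combination."""
    return [
        config_name(arch, algo, size)
        for arch, algo, size in product(ARCHITECTURES, ALGORITHMS, VOCAB_SIZES)
    ]


configs = enumerate_configs()
print(len(configs))  # 4 algorithms x 2 architectures x 3 vocab sizes
print(configs[:3])
```

Each `concat` entry in this grid corresponds to two tokenizer files on disk, so the number of JSON files is larger than the number of configurations.
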
**Key class:** `ProductionMetricsEvaluator` — implements script detection, tokenization, Gini coefficient, and all metric computations.

**Inputs:**
- `OiQ/daa-pairs` dataset (via `huggingface_hub`)
- `BenchmarkConfig` dataclass (vocab sizes, algorithms, hyperparameters)

**Outputs:**
- `results/tokenizers/*.json` — 48 raw tokenizer files (24 shared + 24 concat halves)
- `results/transformers_tokenizers/` — transformers-compatible exports
- `results/tokenizer_results.csv` / `.json` — full metrics with morphological data
- `results/bootstrap_ci.csv` — bootstrap CIs
- `results/benchmark_report.md` — auto-generated summary
- `results/morphology/farasa_segmentations.json` — cached Farasa segmentations (~99 MB)
- `results/plots/*.png` — all visualization figures

---

### `train_large_vocab.py` — Train 80K/110K Tokenizers

**Lines:** ~146

Trains additional tokenizers at 80K and 110K vocabulary sizes to match DarijaBERT-ar (80K) and DarijaBERT-az (110K) for fair head-to-head comparison.

**Inputs:** `results/corpora/train_{ar,az}.txt`

**Outputs:** `results/tokenizers/{shared,concat_ar,concat_az}_{algo}_{80000,110000}.json`

---

### `train_remaining.py` — Train Remaining Tokenizers

**Lines:** ~134

Fills in the last missing tokenizer configurations (3 shared 110K + 6 concat 110K + 1 concat 80K BBPE) that `train_large_vocab.py` did not cover.

**Inputs:** `results/corpora/train_{ar,az}.txt`

**Outputs:** Remaining `results/tokenizers/*.json` files

---

### `retrain_missing_and_compare.py` — Retrain + Full Re-evaluation

**Lines:** ~558

Retrains 4 missing concat 32K tokenizers, exports to transformers format, re-evaluates all 28 configs, fixes a WordPiece exact-match bug, and runs the external comparison pipeline.

**Inputs:** Training corpora, HF external models

**Outputs:** Updated tokenizer JSONs, `external_comparison.csv`, external comparison plot

---

## Evaluation Scripts

### `eval_test_set.py` — Test-Set Evaluation (Single Source of Truth)

**Lines:** ~227

Re-evaluates all tokenizers on the held-out **test set only** (11,282 sentences per script). This is the authoritative evaluation used in all paper tables and percentage claims.

**Key function:** `normalize_decode()` — fixes Metaspace double-space artifacts in WordPiece decoders.

**Inputs:**
- `results/tokenizers/*.json`
- `results/corpora/test_{ar,az,mi}.txt`

**Outputs:** `results/test_set_results.csv` / `.json` (40 rows)

---

### `eval_new_and_append.py` — Append 80K/110K Results

**Lines:** ~144

Evaluates the newly trained 80K/110K tokenizers and appends their rows to `test_set_results.csv`.

**Inputs:** `test_set_results.csv`, tokenizers, test corpora

**Outputs:** Updated `test_set_results.csv` (grows from 24 to 40 rows)

---

### `eval_missing.py` — Fill Single Gap

**Lines:** ~124

Evaluates the one remaining missing tokenizer (`concat_bbpe_55000`) and merges it into the results CSV.

---

### `eval_morph_large.py` — Morphological Metrics for 80K/110K

**Lines:** ~297

Computes morphological fidelity metrics (edit distance and consistency F1) for the 16 large-vocabulary tokenizers (80K, 110K) that were not covered by the original `script.py` morph evaluation. Uses the same Farasa cache and identical algorithms.

**Inputs:**
- `results/morphology/farasa_segmentations.json`
- `results/corpora/test_ar.txt`
- 16 tokenizer JSON files

**Outputs:** `results/morph_large_vocab_results.csv` (16 rows)

---

### `bootstrap_test_set.py` — Bootstrap Confidence Intervals

**Lines:** ~163

Computes 95% bootstrap confidence intervals (500 resamples) for fertility and CPT on the test set.

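A percentile bootstrap over sentences is one common way to obtain such intervals. The sketch below is not taken from `bootstrap_test_set.py`. It assumes fertility is the total token count divided by the total word count, resamples sentences with replacement 500 times, and reads off the 2.5th and 97.5th percentiles. The function name, its parameters, and the toy counts are hypothetical.

```python
import random


def bootstrap_fertility_ci(token_counts, word_counts, n_resamples=500,
                           alpha=0.05, seed=0):
    """Percentile bootstrap CI for corpus-level fertility (tokens per word).

    token_counts[i] and word_counts[i] describe sentence i. Assumption:
    fertility = sum(tokens) / sum(words); the benchmark may define it differently.
    """
    if len(token_counts) != len(word_counts) or not token_counts:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(token_counts)
    estimates = []
    for _ in range(n_resamples):
        # Resample sentence indices with replacement.
        idx = rng.choices(range(n), k=n)
        tokens = sum(token_counts[i] for i in idx)
        words = sum(word_counts[i] for i in idx)
        estimates.append(tokens / words)
    estimates.sort()
    # Percentile method: take the alpha/2 and 1 - alpha/2 order statistics.
    lo = estimates[int((alpha / 2) * (n_resamples - 1))]
    hi = estimates[int((1 - alpha / 2) * (n_resamples - 1))]
    point = sum(token_counts) / sum(word_counts)
    return point, lo, hi


# Toy per-sentence counts, invented for illustration.
toy_tokens = [12, 9, 15, 7, 11, 14, 8, 10]
toy_words = [8, 6, 9, 5, 7, 9, 6, 7]
point, lower, upper = bootstrap_fertility_ci(toy_tokens, toy_words)
print(f"fertility={point:.3f}  95% CI=[{lower:.3f}, {upper:.3f}]")
```

Fixing the seed makes the interval reproducible across runs, which matters when the reported bounds are later checked against a CSV.
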
**Inputs:** Tokenizers, `test_{ar,az,mi}.txt`

**Outputs:** `results/bootstrap_ci_test_set.csv` (24 rows for 8K-32K configs)

---

## External Comparison Scripts

### `eval_all_externals.py` — Evaluate 9 External Tokenizers

**Lines:** ~281

Evaluates 9 external Arabic/Darija tokenizers from HuggingFace (CaMeLBERT-MSA, Asafaya-BERT, Aranizer-SP-86k, B2BERT, DarijaBERT-ar, DarijaBERT-az, Darija-Tokenizer, Translit-Darija, Qwen2.5-Darija) alongside our best 3 tokenizers.

**Inputs:** HF model repos (requires `HF_TOKEN`), test corpora

**Outputs:** `results/external_comparison.csv` / `.json`, comparison plot

---

### `compare_with_external.py` — External Comparison (Earlier Version)

**Lines:** ~269

Earlier version of the external comparison, comparing our 8K/16K/32K tokenizers against 5 external models.

---

### `eval_and_compare.py` — Combined Evaluation + Comparison

**Lines:** ~277

Combines internal evaluation and external comparison into a single pipeline run.

---

### `eval_codeswitch_and_new_baselines.py` — Code-Switching Evaluation

**Lines:** ~373

Evaluates tokenizers on mixed-script (code-switched) texts as a separate category. Also adds the `atlasia/darija_bpe_tokenizer` baseline and evaluates all tokenizers on DODa.

**Outputs:** `results/codeswitch_results.csv` / `.json`

---

### `eval_doda_independent.py` — DODa Independent Validation

**Lines:** ~196

Evaluates all tokenizers on `atlasia/DODa` (87K Arabizi dictionary entries), sampled to 10K entries. Serves as an independent, out-of-training-distribution validation.

**Outputs:** `results/doda_independent_results.csv` / `.json`

---

## Utility Scripts

### `fix_tokenizer_decoders.py` — Decoder Bug Fixer

**Lines:** ~152

Patches three decoder bugs in tokenizer JSON files:

1. WordPiece double-space artifact (Metaspace decoder producing two spaces, `"  "`, instead of one, `" "`)
2. NULL decoder in `concat_bpe_16000` (missing Metaspace decoder)
3. Missing WordPiece sub-decoder in `concat_wordpiece_16000`

**Warning:** Modifies tokenizer files in place. No backup is created.

---

### `gen_report.py` — Report Generator

**Lines:** ~70

Generates a Markdown summary report (`benchmark_report.md`) from `tokenizer_results.csv`, including best-by-vocabulary tables and full results.

---

### `regen_figures.py` — Figure Regenerator

**Lines:** ~313

Regenerates the 5 main paper figures from `test_set_results.csv` with larger fonts (14pt base) for readability. Reads from CSV and writes PNGs to the paper's `figures/` directory.

**Inputs:** `results/test_set_results.csv`

**Outputs:** `figures/fertility_overall_comparison_v2.png`, `fertility_overall_trends.png`, `fertility_disparity_comparison_v2.png`, `fertility_disparity_heatmap_v2.png`, `external_comparison.png`

---

### `verify_arithmetic.py` — Numeric Claims Verification

**Lines:** ~143

Validates every percentage claim in the paper and README against `test_set_results.csv`. Checks:

- Disparity formula: `|F_ar - F_az| / max(F_ar, F_az)`
- Overall fertility derivability from per-script values
- All percentage improvement claims (27-34%, 40-50%, etc.)

**Outputs:** Stdout validation report (no files written)

---

## Result Files

### Primary Data

| File | Rows | Description |
|------|------|-------------|
| `test_set_results.csv` | 40 | **Single source of truth.** Test-set metrics for all tokenizers: fertility, CPT, disparity, Gini, entropy, exact match. |
| `tokenizer_results.csv` | 24 | Full benchmark results incl. morphological metrics and per-script breakdown. Covers 8K-32K only. |
| `morph_large_vocab_results.csv` | 16 | Morphological metrics for 80K/110K tokenizers. |
| `bootstrap_ci_test_set.csv` | 24 | Bootstrap 95% CIs (500 resamples) for fertility and CPT. |
| `external_comparison.csv` | 12 | Our best 3 + 9 external tokenizers: fertility, CPT, disparity, exact match per script. |
| `codeswitch_results.csv` | 5 | Code-switching evaluation with mixed-script category. |
| `doda_independent_results.csv` | 12 | DODa independent validation (Arabizi dictionary). |

### Supporting Data

| File / Directory | Description |
|------------------|-------------|
| `corpora/` | Train/validation/test text splits: `{train,val,test}_{ar,az,mi}.txt` |
| `morphology/farasa_segmentations.json` | Cached Farasa morphological segmentations for Arabic texts (~99 MB) |
| `tokenizers/` | Raw HuggingFace `tokenizers` JSON files (60 files: 24 shared + 36 concat halves) |
| `transformers_tokenizers/` | Tokenizers exported for `transformers` library use |
| `doda_sample_10k.txt` | 10K-line Arabizi sample from DODa dataset |
| `benchmark_report.md` | Auto-generated Markdown summary report |

---

## Reproduction Guide

### Full Reproduction (from scratch)

```bash
# 1. Train 8K-32K tokenizers + initial evaluation + morphology
python script.py

# 2. Train 80K + 110K tokenizers
python train_large_vocab.py
python train_remaining.py

# 3. Evaluate on test set (appends 80K/110K to results)
python eval_test_set.py
python eval_new_and_append.py
python eval_missing.py

# 4. Compute morphological metrics for large vocabs
python eval_morph_large.py

# 5. Bootstrap confidence intervals
python bootstrap_test_set.py

# 6. External tokenizer comparison
python eval_all_externals.py

# 7. Code-switching + DODa validation
python eval_codeswitch_and_new_baselines.py
python eval_doda_independent.py

# 8. Generate figures + reports
python regen_figures.py
python gen_report.py

# 9. Verify all numeric claims
python verify_arithmetic.py
```

### Requirements

- Python 3.10+
- `tokenizers`, `transformers`, `datasets` (HuggingFace stack)
- `scikit-learn` (KMeans for morphological consistency)
- `regex` (Unicode grapheme segmentation)
- `numpy`, `pandas`, `matplotlib`, `seaborn`
- `tqdm`
- Farasa JAR (for morphological segmentation; pre-cached in `morphology/`)
- `HF_TOKEN` environment variable (for loading external models)

---

## Key Design Decisions

1. **Monolithic `script.py`**: The main pipeline runs at import level (no `__main__` guard). This is intentional and supports checkpoint-based resumption: the script detects existing artifacts and skips completed stages.
2. **Duplicated helper functions**: Functions like `detect_script()`, `count_graphemes()`, and `normalize_decode()` are copied across evaluation scripts rather than shared via import. This keeps each eval script self-contained and runnable independently.
3. **Test-set-only evaluation**: All paper numbers come from `eval_test_set.py`, not `script.py`'s full-corpus evaluation. The test set (11,282 sentences per script) provides unbiased estimates.
4. **Concatenated architecture**: Each concat config is stored as two JSON files (`concat_ar_*.json` + `concat_az_*.json`). The evaluator loads both and applies ID shifting at inference time.
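
The ID shifting mentioned for the concatenated architecture can be pictured as follows. This is an illustrative sketch only, not the evaluator's code. It assumes the `az` half's token IDs are offset by the size of the `ar` half's vocabulary so that the two ID ranges never collide, and it routes each text with a simple Unicode-range check. The helper names (`is_arabic_script`, `encode_concat`) and the toy word-level vocabularies are hypothetical stand-ins for the two JSON halves.

```python
def is_arabic_script(text: str) -> bool:
    """Crude router (assumption): True if most letters are in the Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return bool(letters) and arabic * 2 > len(letters)


def encode_concat(text, ar_vocab, az_vocab):
    """Encode with one of two toy word-level vocabularies, shifting `az` IDs.

    IDs from `az_vocab` are offset by len(ar_vocab) so both halves share one
    non-overlapping ID space. Unknown words map to each half's "[UNK]" entry.
    """
    if is_arabic_script(text):
        vocab, offset = ar_vocab, 0
    else:
        vocab, offset = az_vocab, len(ar_vocab)
    return [vocab.get(word, vocab["[UNK]"]) + offset for word in text.split()]


# Toy vocabularies standing in for the two halves of a concat config.
ar_vocab = {"[UNK]": 0, "سلام": 1, "كيف": 2}
az_vocab = {"[UNK]": 0, "salam": 1, "kif": 2}

print(encode_concat("سلام كيف", ar_vocab, az_vocab))   # IDs stay in the ar range
print(encode_concat("salam kif", ar_vocab, az_vocab))  # IDs shifted by len(ar_vocab)
```

With an offset of this kind, a downstream embedding table sized to the sum of both vocabularies can serve either half without remapping.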