
Code Documentation: Darija Tokenizer Benchmark

This document describes every script, data file, and output artifact in the benchmark codebase.


Overview

The benchmark pipeline consists of four stages: training, evaluation, analysis, and reporting. Each stage is implemented by standalone Python scripts. The diagram below shows the data flow:

OiQ/daa-pairs (dataset)
        │
        ▼
┌──────────────────┐     ┌──────────────────────┐
│  script.py       │────▶│  results/tokenizers/ │  (60 raw JSON files)
│  (train 8K-32K)  │     │  results/transformers│  (transformers format)
└──────────────────┘     └──────────────────────┘
        │                          ▲
        ▼                          │
┌──────────────────┐     ┌──────────────────────┐
│train_large_vocab │────▶│  80K + 110K configs  │
│train_remaining   │     │  (16 additional)     │
└──────────────────┘     └──────────────────────┘
                               │
                               ▼
┌───────────────────────────────────────────────┐
│  EVALUATION SCRIPTS                           │
│  ├── eval_test_set.py       → test_set_results│
│  ├── eval_new_and_append.py → append 80K/110K │
│  ├── eval_missing.py        → fill gaps       │
│  ├── eval_morph_large.py    → morph 80K/110K  │
│  ├── bootstrap_test_set.py  → 95% CIs         │
│  ├── eval_all_externals.py  → external comp.  │
│  ├── eval_codeswitch_...    → code-switching  │
│  └── eval_doda_independent  → DODa validation │
└───────────────────────────────────────────────┘
        │
        ▼
┌──────────────────┐     ┌──────────────────────┐
│  regen_figures   │────▶│  figures/*.png       │
│  gen_report      │────▶│  benchmark_report.md │
│  verify_arithmetic│───▶│  (stdout validation) │
└──────────────────┘     └──────────────────────┘

Training Scripts

script.py: Master Benchmark Pipeline

Lines: ~2032 | Type: Runs at import (no __main__ guard)

The monolithic entry point. Loads the OiQ/daa-pairs dataset, trains 24 tokenizers (4 algorithms x 2 architectures x 3 vocab sizes: 8K, 16K, 32K), evaluates them on the full training+test corpus, computes morphological metrics via Farasa, generates bootstrap confidence intervals, and produces all plots and reports.

Key class: ProductionMetricsEvaluator, which implements script detection, tokenization, the Gini coefficient, and all metric computations.
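
The source does not show how ProductionMetricsEvaluator computes its Gini coefficient. As a rough illustration only, the sketch below applies the standard Gini formula to a list of token frequencies. The function name `gini` and the choice of token counts as input are assumptions, not the benchmark's actual code.

```python
def gini(counts):
    """Gini coefficient of non-negative values: 0 means perfectly even usage."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0  # convention chosen for this sketch (assumption)
    # Closed form over ascending ranks 1..n.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Illustrative counts only, not benchmark data.
print(gini([5, 5, 5, 5]))   # 0.0  (every token used equally often)
print(gini([0, 0, 0, 20]))  # 0.75 (one token takes all the mass)
```

Under this reading, a value closer to 0 would indicate that vocabulary entries are used more evenly.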

Inputs:

  • OiQ/daa-pairs dataset (via huggingface_hub)
  • BenchmarkConfig dataclass (vocab sizes, algorithms, hyperparameters)

Outputs:

  • results/tokenizers/*.json: 48 raw tokenizer files (24 shared + 24 concat halves)
  • results/transformers_tokenizers/: transformers-compatible exports
  • results/tokenizer_results.csv / .json: full metrics with morphological data
  • results/bootstrap_ci.csv: bootstrap CIs
  • results/benchmark_report.md: auto-generated summary
  • results/morphology/farasa_segmentations.json: cached Farasa segmentations (~99 MB)
  • results/plots/*.png: all visualization figures

train_large_vocab.py: Train 80K/110K Tokenizers

Lines: ~146

Trains additional tokenizers at 80K and 110K vocabulary sizes to match DarijaBERT-ar (80K) and DarijaBERT-az (110K) for fair head-to-head comparison.

Inputs: results/corpora/train_{ar,az}.txt

Outputs: results/tokenizers/{shared,concat_ar,concat_az}_{algo}_{80000,110000}.json


train_remaining.py: Train Remaining Tokenizers

Lines: ~134

Fills in the last missing tokenizer configurations (3 shared 110K + 6 concat 110K + 1 concat 80K BBPE) that train_large_vocab.py did not cover.

Inputs: results/corpora/train_{ar,az}.txt

Outputs: Remaining results/tokenizers/*.json files


retrain_missing_and_compare.py: Retrain + Full Re-evaluation

Lines: ~558

Retrains 4 missing concat 32K tokenizers, exports to transformers format, re-evaluates all 28 configs, fixes a WordPiece exact-match bug, and runs the external comparison pipeline.

Inputs: Training corpora, HF external models

Outputs: Updated tokenizer JSONs, external_comparison.csv, external comparison plot


Evaluation Scripts

eval_test_set.py: Test-Set Evaluation (Single Source of Truth)

Lines: ~227

Re-evaluates all tokenizers on the held-out test set only (11,282 sentences per script). This is the authoritative evaluation used in all paper tables and percentage claims.

Key function: normalize_decode(), which fixes Metaspace double-space artifacts in WordPiece decoders.
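
The body of normalize_decode() is not reproduced in this document. A minimal sketch of the kind of cleanup described, assuming it only needs to collapse repeated spaces before an exact-match comparison, might look like this. The name `normalize_decode_sketch` is hypothetical, and the real function may handle more cases.

```python
import re


def normalize_decode_sketch(decoded: str) -> str:
    """Collapse runs of spaces into one and trim the ends (assumed behavior)."""
    return re.sub(r" {2,}", " ", decoded).strip()


original = "kan ghadi l dar"
decoded = "kan  ghadi  l dar "  # simulated double-space decoder artifact
print(normalize_decode_sketch(decoded) == original)  # True
```

Without such a step, a round trip that differs from the input only in whitespace would be counted as an exact-match failure.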

Inputs:

  • results/tokenizers/*.json
  • results/corpora/test_{ar,az,mi}.txt

Outputs: results/test_set_results.csv / .json (40 rows)


eval_new_and_append.py: Append 80K/110K Results

Lines: ~144

Evaluates the newly trained 80K/110K tokenizers and appends their rows to test_set_results.csv.

Inputs: test_set_results.csv, tokenizers, test corpora

Outputs: Updated test_set_results.csv (grows from 24 to 40 rows)


eval_missing.py: Fill Single Gap

Lines: ~124

Evaluates the one remaining missing tokenizer (concat_bbpe_55000) and merges it into the results CSV.


eval_morph_large.py: Morphological Metrics for 80K/110K

Lines: ~297

Computes morphological fidelity metrics (edit distance and consistency F1) for the 16 large-vocabulary tokenizers (80K, 110K) that were not covered by the original script.py morph evaluation. Uses the same Farasa cache and identical algorithms.

Inputs:

  • results/morphology/farasa_segmentations.json
  • results/corpora/test_ar.txt
  • 16 tokenizer JSON files

Outputs: results/morph_large_vocab_results.csv (16 rows)


bootstrap_test_set.py: Bootstrap Confidence Intervals

Lines: ~163

Computes 95% bootstrap confidence intervals (500 resamples) for fertility and CPT on the test set.

Inputs: Tokenizers, test_{ar,az,mi}.txt

Outputs: results/bootstrap_ci_test_set.csv (24 rows for 8K-32K configs)
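
To make the procedure concrete, here is a hedged sketch of a percentile bootstrap over sentences with 500 resamples. It assumes fertility is computed as total tokens divided by total words, which this document does not state. The function name, the seed, and the sample data are illustrative and are not taken from bootstrap_test_set.py.

```python
import random


def bootstrap_ci(per_sentence, n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a ratio of sums.

    per_sentence: list of (numerator, denominator) pairs, for example
    (token_count, word_count) per sentence under the assumed fertility definition.
    """
    rng = random.Random(seed)
    n = len(per_sentence)
    stats = []
    for _ in range(n_resamples):
        # Resample sentences with replacement, then recompute the statistic.
        sample = [per_sentence[rng.randrange(n)] for _ in range(n)]
        numerator = sum(a for a, _ in sample)
        denominator = sum(b for _, b in sample)
        stats.append(numerator / denominator)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy (tokens, words) counts, not benchmark data.
counts = [(3, 2), (5, 3), (4, 4), (6, 3)] * 25
low, high = bootstrap_ci(counts)
print(f"95% CI for fertility: [{low:.3f}, {high:.3f}]")
```

Resampling whole sentences, rather than individual tokens, keeps each sentence's token and word counts paired.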


External Comparison Scripts

eval_all_externals.py: Evaluate 9 External Tokenizers

Lines: ~281

Evaluates 9 external Arabic/Darija tokenizers from HuggingFace (CaMeLBERT-MSA, Asafaya-BERT, Aranizer-SP-86k, B2BERT, DarijaBERT-ar, DarijaBERT-az, Darija-Tokenizer, Translit-Darija, Qwen2.5-Darija) alongside our best 3 tokenizers.

Inputs: HF model repos (requires HF_TOKEN), test corpora

Outputs: results/external_comparison.csv / .json, comparison plot


compare_with_external.py: External Comparison (Earlier Version)

Lines: ~269

Earlier version of the external comparison, comparing our 8K/16K/32K tokenizers against 5 external models.


eval_and_compare.py: Combined Evaluation + Comparison

Lines: ~277

Combines internal evaluation and external comparison into a single pipeline run.


eval_codeswitch_and_new_baselines.py: Code-Switching Evaluation

Lines: ~373

Evaluates tokenizers on mixed-script (code-switched) texts as a separate category. Also adds the atlasia/darija_bpe_tokenizer baseline and evaluates all tokenizers on DODa.

Outputs: results/codeswitch_results.csv / .json


eval_doda_independent.py: DODa Independent Validation

Lines: ~196

Evaluates all tokenizers on atlasia/DODa (87K Arabizi dictionary entries), sampled to 10K entries. Serves as an independent, out-of-training-distribution validation.

Outputs: results/doda_independent_results.csv / .json
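
The document does not say how the 10K entries are drawn. One reproducible way to do it, assuming a fixed seed and uniform sampling without replacement, is sketched below. The function name and the seed value are hypothetical.

```python
import random


def sample_entries(entries, k=10_000, seed=42):
    """Return k entries drawn uniformly without replacement (seed is an assumption)."""
    if len(entries) <= k:
        return list(entries)
    return random.Random(seed).sample(entries, k)


# Placeholder strings stand in for the roughly 87K dictionary entries.
entries = [f"entry_{i}" for i in range(87_000)]
subset = sample_entries(entries)
print(len(subset))  # 10000
```

Fixing the seed means repeated runs evaluate the same subset, so results stay comparable across tokenizers.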


Utility Scripts

fix_tokenizer_decoders.py: Decoder Bug Fixer

Lines: ~152

Patches three decoder bugs in tokenizer JSON files:

  1. WordPiece double-space artifact (Metaspace decoder producing two consecutive spaces instead of one)
  2. NULL decoder in concat_bpe_16000 (missing Metaspace decoder)
  3. Missing WordPiece sub-decoder in concat_wordpiece_16000

Warning: Modifies tokenizer files in place. No backup is created.
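
Because the fixer writes in place, it may be worth copying the tokenizer directory first. The helper below is a suggested precaution and not part of the codebase. The name `backup_tokenizers` and the example backup path are hypothetical.

```python
import shutil
from pathlib import Path


def backup_tokenizers(src_dir, backup_dir):
    """Copy every *.json file in src_dir into backup_dir; return the copied names."""
    src, dst = Path(src_dir), Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.glob("*.json")):
        shutil.copy2(path, dst / path.name)  # copy2 also preserves timestamps
        copied.append(path.name)
    return copied
```

For example, calling `backup_tokenizers("results/tokenizers", "results/tokenizers_backup")` before running fix_tokenizer_decoders.py would keep the unpatched files available for comparison.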


gen_report.py: Report Generator

Lines: ~70

Generates a Markdown summary report (benchmark_report.md) from tokenizer_results.csv, including best-by-vocabulary tables and full results.


regen_figures.py: Figure Regenerator

Lines: ~313

Regenerates the 5 main paper figures from test_set_results.csv with larger fonts (14pt base) for readability. Reads from CSV and writes PNGs to the paper's figures/ directory.

Inputs: results/test_set_results.csv

Outputs: figures/fertility_overall_comparison_v2.png, fertility_overall_trends.png, fertility_disparity_comparison_v2.png, fertility_disparity_heatmap_v2.png, external_comparison.png


verify_arithmetic.py: Numeric Claims Verification

Lines: ~143

Validates every percentage claim in the paper and README against test_set_results.csv. Checks:

  • Disparity formula: |F_ar - F_az| / max(F_ar, F_az)
  • Overall fertility derivability from per-script values
  • All percentage improvement claims (27-34%, 40-50%, etc.)

Outputs: Stdout validation report (no files written)
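
The disparity formula listed above can be written directly in Python. This is a small sketch, not an excerpt from verify_arithmetic.py. The function name and the zero-fertility convention are assumptions, and the inputs are the per-script fertilities F_ar and F_az.

```python
def fertility_disparity(f_ar: float, f_az: float) -> float:
    """|F_ar - F_az| / max(F_ar, F_az), a relative gap between 0 and 1."""
    largest = max(f_ar, f_az)
    if largest == 0:
        return 0.0  # both fertilities zero: treated as no disparity (assumption)
    return abs(f_ar - f_az) / largest


# Illustrative values only, not benchmark results.
print(fertility_disparity(2.0, 1.5))  # 0.25
```

Dividing by the larger of the two values makes the measure symmetric in its arguments and bounded by 1 for non-negative fertilities.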


Result Files

Primary Data

| File | Rows | Description |
|---|---|---|
| test_set_results.csv | 40 | Single source of truth. Test-set metrics for all tokenizers: fertility, CPT, disparity, Gini, entropy, exact match. |
| tokenizer_results.csv | 24 | Full benchmark results incl. morphological metrics and per-script breakdown. Covers 8K-32K only. |
| morph_large_vocab_results.csv | 16 | Morphological metrics for 80K/110K tokenizers. |
| bootstrap_ci_test_set.csv | 24 | Bootstrap 95% CIs (500 resamples) for fertility and CPT. |
| external_comparison.csv | 12 | Our best 3 + 9 external tokenizers: fertility, CPT, disparity, exact match per script. |
| codeswitch_results.csv | 5 | Code-switching evaluation with mixed-script category. |
| doda_independent_results.csv | 12 | DODa independent validation (Arabizi dictionary). |

Supporting Data

| File / Directory | Description |
|---|---|
| corpora/ | Train/validation/test text splits: {train,val,test}_{ar,az,mi}.txt |
| morphology/farasa_segmentations.json | Cached Farasa morphological segmentations for Arabic texts (~99 MB) |
| tokenizers/ | Raw HuggingFace tokenizers JSON files (60 files: 24 shared + 36 concat halves) |
| transformers_tokenizers/ | Tokenizers exported for transformers library use |
| doda_sample_10k.txt | 10K-line Arabizi sample from DODa dataset |
| benchmark_report.md | Auto-generated Markdown summary report |

Reproduction Guide

Full Reproduction (from scratch)

# 1. Train 8K-32K tokenizers + initial evaluation + morphology
python script.py

# 2. Train 80K + 110K tokenizers
python train_large_vocab.py
python train_remaining.py

# 3. Evaluate on test set (appends 80K/110K to results)
python eval_test_set.py
python eval_new_and_append.py
python eval_missing.py

# 4. Compute morphological metrics for large vocabs
python eval_morph_large.py

# 5. Bootstrap confidence intervals
python bootstrap_test_set.py

# 6. External tokenizer comparison
python eval_all_externals.py

# 7. Code-switching + DODa validation
python eval_codeswitch_and_new_baselines.py
python eval_doda_independent.py

# 8. Generate figures + reports
python regen_figures.py
python gen_report.py

# 9. Verify all numeric claims
python verify_arithmetic.py

Requirements

  • Python 3.10+
  • tokenizers, transformers, datasets (HuggingFace stack)
  • scikit-learn (KMeans for morphological consistency)
  • regex (Unicode grapheme segmentation)
  • numpy, pandas, matplotlib, seaborn
  • tqdm
  • Farasa JAR (for morphological segmentation; pre-cached in morphology/)
  • HF_TOKEN environment variable (for loading external models)

Key Design Decisions

  1. Monolithic script.py: The main pipeline runs at import level (no __main__ guard). This is intentional for checkpoint-based resumption: the script detects existing artifacts and skips completed stages.

  2. Duplicated helper functions: Functions like detect_script(), count_graphemes(), and normalize_decode() are copied across evaluation scripts rather than shared via import. This ensures each eval script is self-contained and runnable independently.

  3. Test-set-only evaluation: All paper numbers come from eval_test_set.py, not script.py's full-corpus evaluation. The test set (11,282 sentences per script) provides unbiased estimates.

  4. Concatenated architecture: Each concat config is stored as two JSON files (concat_ar_*.json + concat_az_*.json). The evaluator loads both and applies ID shifting at inference time.
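
The ID shifting in item 4 can be illustrated with a toy example. The sketch below assumes the AZ half's IDs are offset by the size of the AR vocabulary and that the caller already knows which script the text is in. Both are assumptions, since the document does not specify the shift direction or the routing rule. Whitespace splitting and the tiny dictionaries stand in for the real tokenizer halves.

```python
def encode_concat(text, script, ar_vocab, az_vocab):
    """Encode with the half matching `script`; AZ ids are shifted past the AR range."""
    if script == "ar":
        return [ar_vocab[tok] for tok in text.split()]
    offset = len(ar_vocab)  # assumed layout: AR ids first, then AZ ids
    return [az_vocab[tok] + offset for tok in text.split()]


# Toy vocabularies, not the benchmark's.
ar_vocab = {"سلام": 0, "لاباس": 1}
az_vocab = {"salam": 0, "labas": 1}

print(encode_concat("salam labas", "az", ar_vocab, az_vocab))  # [2, 3]
```

With real files, the two halves would presumably be loaded with `tokenizers.Tokenizer.from_file` and the offset taken from `get_vocab_size()`, so that IDs from the two halves never collide in the combined ID space.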