Code Documentation — Darija Tokenizer Benchmark

This document describes every script, data file, and output artifact in the benchmark codebase.

Overview

The benchmark pipeline consists of four stages: training, evaluation, analysis, and reporting. Each stage is implemented by standalone Python scripts. The diagram below shows the data flow:

OiQ/daa-pairs (dataset)
        │
        ▼
┌──────────────────┐     ┌─────────────────────┐
│  script.py       │────▶│  results/tokenizers/ │  (60 raw JSON files)
│  (train 8K-32K)  │     │  results/transformers│  (transformers format)
└──────────────────┘     └─────────────────────┘
        │                          ▲
        ▼                          │
┌──────────────────┐     ┌─────────────────────┐
│train_large_vocab │────▶│  80K + 110K configs  │
│train_remaining   │     │  (16 additional)     │
└──────────────────┘     └─────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────┐
│  EVALUATION SCRIPTS                           │
│  ├── eval_test_set.py       → test_set_results│
│  ├── eval_new_and_append.py → append 80K/110K │
│  ├── eval_missing.py        → fill gaps       │
│  ├── eval_morph_large.py    → morph 80K/110K  │
│  ├── bootstrap_test_set.py  → 95% CIs         │
│  ├── eval_all_externals.py  → external comp.  │
│  ├── eval_codeswitch_...    → code-switching  │
│  └── eval_doda_independent  → DODa validation │
└──────────────────────────────────────────────┘
        │
        ▼
┌──────────────────┐     ┌─────────────────────┐
│  regen_figures   │────▶│  figures/*.png       │
│  gen_report      │────▶│  benchmark_report.md │
│  verify_arithmetic│───▶│  (stdout validation) │
└──────────────────┘     └─────────────────────┘

Training Scripts

`script.py` — Master Benchmark Pipeline

Lines: ~2032 | Type: Runs at import (no __main__ guard)

The monolithic entry point. Loads the OiQ/daa-pairs dataset, trains 24 tokenizers (4 algorithms x 2 architectures x 3 vocab sizes: 8K, 16K, 32K), evaluates them on the full training+test corpus, computes morphological metrics via Farasa, generates bootstrap confidence intervals, and produces all plots and reports.
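The count of 24 follows from the Cartesian product of the three axes. The sketch below enumerates that grid; the algorithm labels are assumptions for illustration (only `bpe`, `bbpe`, and `wordpiece` appear in file names in this document, and `unigram` is a placeholder for the fourth), and `enumerate_configs` is a hypothetical helper, not a function from script.py.

```python
from itertools import product

# Assumed labels: "unigram" is a placeholder for the fourth algorithm,
# which this document does not name.
ALGORITHMS = ["bpe", "bbpe", "wordpiece", "unigram"]
ARCHITECTURES = ["shared", "concat"]
VOCAB_SIZES = [8_000, 16_000, 32_000]


def enumerate_configs(algorithms, architectures, vocab_sizes):
    """Return one (architecture, algorithm, vocab_size) tuple per configuration."""
    return list(product(architectures, algorithms, vocab_sizes))


configs = enumerate_configs(ALGORITHMS, ARCHITECTURES, VOCAB_SIZES)
print(len(configs))  # 24
```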

Key class: ProductionMetricsEvaluator — implements script detection, tokenization, Gini coefficient, and all metric computations.
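For readers unfamiliar with the Gini coefficient, here is a minimal sketch of the standard closed form over sorted values. Which distribution the evaluator applies it to (for example, token frequencies) is an assumption, and `gini` is a hypothetical stand-alone function rather than the method inside ProductionMetricsEvaluator.

```python
def gini(counts):
    """Gini coefficient of non-negative counts: 0 means equal, values near 1 mean concentrated."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Closed form on ascending values x_1..x_n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

A perfectly even distribution such as `[1, 1, 1, 1]` yields 0, while `[0, 0, 0, 4]` yields 0.75, the maximum for four items.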

Inputs:

OiQ/daa-pairs dataset (via huggingface_hub)
BenchmarkConfig dataclass (vocab sizes, algorithms, hyperparameters)

Outputs:

results/tokenizers/*.json — 48 raw tokenizer files (24 shared + 24 concat halves)
results/transformers_tokenizers/ — transformers-compatible exports
results/tokenizer_results.csv / .json — full metrics with morphological data
results/bootstrap_ci.csv — bootstrap CIs
results/benchmark_report.md — auto-generated summary
results/morphology/farasa_segmentations.json — cached Farasa segmentations (~99 MB)
results/plots/*.png — all visualization figures

`train_large_vocab.py` — Train 80K/110K Tokenizers

Lines: ~146

Trains additional tokenizers at 80K and 110K vocabulary sizes to match DarijaBERT-ar (80K) and DarijaBERT-az (110K) for fair head-to-head comparison.

Inputs: results/corpora/train_{ar,az}.txt

Outputs: results/tokenizers/{shared,concat_ar,concat_az}_{algo}_{80000,110000}.json

`train_remaining.py` — Train Remaining Tokenizers

Lines: ~134

Fills in the last missing tokenizer configurations (3 shared 110K + 6 concat 110K + 1 concat 80K BBPE) that train_large_vocab.py did not cover.

Inputs: results/corpora/train_{ar,az}.txt

Outputs: Remaining results/tokenizers/*.json files
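One way to see which configurations still need training is to expand the file-name pattern quoted for train_large_vocab.py and check which files are absent. This is a sketch under that assumption; `expected_files` and `missing_files` are hypothetical helpers, and some files in the repository may follow a different naming scheme.

```python
from pathlib import Path

# Prefixes taken from the output pattern {shared,concat_ar,concat_az}_{algo}_{size}.json
PREFIXES = ["shared", "concat_ar", "concat_az"]


def expected_files(algorithms, vocab_sizes):
    """Expand the naming pattern into a flat list of file names."""
    return [
        f"{prefix}_{algo}_{size}.json"
        for prefix in PREFIXES
        for algo in algorithms
        for size in vocab_sizes
    ]


def missing_files(tokenizer_dir, algorithms, vocab_sizes):
    """Return the expected file names that do not exist in tokenizer_dir."""
    directory = Path(tokenizer_dir)
    return [
        name
        for name in expected_files(algorithms, vocab_sizes)
        if not (directory / name).exists()
    ]
```

Calling `missing_files("results/tokenizers", ["bpe", "bbpe"], [80000, 110000])` would list the gaps for those two algorithms.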

`retrain_missing_and_compare.py` — Retrain + Full Re-evaluation

Lines: ~558

Retrains 4 missing concat 32K tokenizers, exports to transformers format, re-evaluates all 28 configs, fixes a WordPiece exact-match bug, and runs the external comparison pipeline.

Inputs: Training corpora, HF external models

Outputs: Updated tokenizer JSONs, external_comparison.csv, external comparison plot

Evaluation Scripts

`eval_test_set.py` — Test-Set Evaluation (Single Source of Truth)

Lines: ~227

Re-evaluates all tokenizers on the held-out test set only (11,282 sentences per script). This is the authoritative evaluation used in all paper tables and percentage claims.

Key function: normalize_decode() — fixes Metaspace double-space artifacts in WordPiece decoders.
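The kind of normalization described here can be sketched as a whitespace collapse applied to the decoded string before comparing it with the input. This is an illustrative stand-in; the real normalize_decode() may handle additional cases, and `normalize_decode_sketch` is a hypothetical name.

```python
import re


def normalize_decode_sketch(text):
    """Collapse runs of two or more spaces into one and trim both ends."""
    return re.sub(r" {2,}", " ", text).strip()
```

With this step, a decoded string that differs from the original only by doubled spaces would still count as an exact match.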

Inputs:

results/tokenizers/*.json
results/corpora/test_{ar,az,mi}.txt

Outputs: results/test_set_results.csv / .json (40 rows in the final file, including the 80K/110K rows appended by eval_new_and_append.py)

`eval_new_and_append.py` — Append 80K/110K Results

Lines: ~144

Evaluates the newly trained 80K/110K tokenizers and appends their rows to test_set_results.csv.

Inputs: test_set_results.csv, tokenizers, test corpora

Outputs: Updated test_set_results.csv (grows from 24 to 40 rows)

`eval_missing.py` — Fill Single Gap

Lines: ~124

Evaluates the one remaining missing tokenizer (concat_bbpe_55000) and merges it into the results CSV.

`eval_morph_large.py` — Morphological Metrics for 80K/110K

Lines: ~297

Computes morphological fidelity metrics (edit distance and consistency F1) for the 16 large-vocabulary tokenizers (80K, 110K) that were not covered by the original script.py morph evaluation. Uses the same Farasa cache and identical algorithms.

Inputs:

results/morphology/farasa_segmentations.json
results/corpora/test_ar.txt
16 tokenizer JSON files

Outputs: results/morph_large_vocab_results.csv (16 rows)

`bootstrap_test_set.py` — Bootstrap Confidence Intervals

Lines: ~163

Computes 95% bootstrap confidence intervals (500 resamples) for fertility and CPT on the test set.

Inputs: Tokenizers, test_{ar,az,mi}.txt

Outputs: results/bootstrap_ci_test_set.csv (24 rows for 8K-32K configs)
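A percentile bootstrap with 500 resamples can be sketched as below. The assumptions are loud ones: fertility is taken as total tokens divided by total whitespace words, resampling is done at the sentence level, and the interval uses the percentile method. The script may differ on any of these, and the function names are hypothetical.

```python
import random


def fertility(token_counts, word_counts):
    """Tokens per word over a set of sentences (assumed definition)."""
    return sum(token_counts) / sum(word_counts)


def bootstrap_ci(token_counts, word_counts, n_resamples=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for fertility, resampling sentences with replacement."""
    rng = random.Random(seed)
    n = len(token_counts)
    estimates = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(
            fertility([token_counts[i] for i in idx], [word_counts[i] for i in idx])
        )
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

The same routine would apply to CPT by swapping in per-sentence character and token counts.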

External Comparison Scripts

`eval_all_externals.py` — Evaluate 9 External Tokenizers

Lines: ~281

Evaluates 9 external Arabic/Darija tokenizers from HuggingFace (CaMeLBERT-MSA, Asafaya-BERT, Aranizer-SP-86k, B2BERT, DarijaBERT-ar, DarijaBERT-az, Darija-Tokenizer, Translit-Darija, Qwen2.5-Darija) alongside our best 3 tokenizers.

Inputs: HF model repos (requires HF_TOKEN), test corpora

Outputs: results/external_comparison.csv / .json, comparison plot

`compare_with_external.py` — External Comparison (Earlier Version)

Lines: ~269

Earlier version of the external comparison, comparing our 8K/16K/32K tokenizers against 5 external models.

`eval_and_compare.py` — Combined Evaluation + Comparison

Lines: ~277

Combines internal evaluation and external comparison into a single pipeline run.

`eval_codeswitch_and_new_baselines.py` — Code-Switching Evaluation

Lines: ~373

Evaluates tokenizers on mixed-script (code-switched) texts as a separate category. Also adds the atlasia/darija_bpe_tokenizer baseline and evaluates all tokenizers on DODa.

Outputs: results/codeswitch_results.csv / .json

`eval_doda_independent.py` — DODa Independent Validation

Lines: ~196

Evaluates all tokenizers on atlasia/DODa (87K Arabizi dictionary entries), sampled to 10K entries. Serves as an independent, out-of-training-distribution validation.

Outputs: results/doda_independent_results.csv / .json

Utility Scripts

`fix_tokenizer_decoders.py` — Decoder Bug Fixer

Lines: ~152

Patches three decoder bugs in tokenizer JSON files:

WordPiece double-space artifact (Metaspace decoder producing two consecutive spaces instead of one)
NULL decoder in concat_bpe_16000 (missing Metaspace decoder)
Missing WordPiece sub-decoder in concat_wordpiece_16000

Warning: Modifies tokenizer files in place. No backup is created.
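Because the fixer writes in place, a copy taken beforehand is the only way back to the original files. A minimal sketch of such a backup step follows; `backup_file` and the `.bak` suffix are hypothetical and not part of fix_tokenizer_decoders.py.

```python
import shutil
from pathlib import Path


def backup_file(path, suffix=".bak"):
    """Copy path to path + suffix unless that backup already exists; return the backup path."""
    source = Path(path)
    backup = source.with_name(source.name + suffix)
    if not backup.exists():
        # copy2 also preserves timestamps and permission bits
        shutil.copy2(source, backup)
    return backup
```

Running it over every file in results/tokenizers/ before the fixer keeps the first, unpatched version even if the fixer is run more than once.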

`gen_report.py` — Report Generator

Lines: ~70

Generates a Markdown summary report (benchmark_report.md) from tokenizer_results.csv, including best-by-vocabulary tables and full results.

`regen_figures.py` — Figure Regenerator

Lines: ~313

Regenerates the 5 main paper figures from test_set_results.csv with larger fonts (14pt base) for readability. Reads from CSV and writes PNGs to the paper's figures/ directory.

Inputs: results/test_set_results.csv

Outputs: figures/fertility_overall_comparison_v2.png, fertility_overall_trends.png, fertility_disparity_comparison_v2.png, fertility_disparity_heatmap_v2.png, external_comparison.png

`verify_arithmetic.py` — Numeric Claims Verification

Lines: ~143

Validates every percentage claim in the paper and README against test_set_results.csv. Checks:

Disparity formula: |F_ar - F_az| / max(F_ar, F_az)
Overall fertility derivability from per-script values
All percentage improvement claims (27-34%, 40-50%, etc.)

Outputs: Stdout validation report (no files written)
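The disparity formula listed above translates directly into code. The sketch below pairs it with a relative-improvement check; the improvement formula, the tolerance, and all function names are assumptions for illustration, and the numbers in the usage note are made up rather than benchmark results.

```python
def disparity(f_ar, f_az):
    """|F_ar - F_az| / max(F_ar, F_az), as given in the checks above."""
    return abs(f_ar - f_az) / max(f_ar, f_az)


def percent_improvement(baseline, ours):
    """Relative reduction of a lower-is-better metric, in percent (assumed definition)."""
    return 100.0 * (baseline - ours) / baseline


def claim_holds(value, low, high, tol=0.5):
    """True if value lies inside the claimed [low, high] range, with a small tolerance."""
    return low - tol <= value <= high + tol
```

For example, fertilities of 2.0 and 1.5 give a disparity of 0.25, and a drop from 3.0 to 2.1 is a 30% improvement, which would satisfy a "27-34%" claim.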

Result Files

Primary Data

| File | Rows | Description |
| --- | --- | --- |
| `test_set_results.csv` | 40 | Single source of truth. Test-set metrics for all tokenizers: fertility, CPT, disparity, Gini, entropy, exact match. |
| `tokenizer_results.csv` | 24 | Full benchmark results incl. morphological metrics and per-script breakdown. Covers 8K-32K only. |
| `morph_large_vocab_results.csv` | 16 | Morphological metrics for 80K/110K tokenizers. |
| `bootstrap_ci_test_set.csv` | 24 | Bootstrap 95% CIs (500 resamples) for fertility and CPT. |
| `external_comparison.csv` | 12 | Our best 3 + 9 external tokenizers: fertility, CPT, disparity, exact match per script. |
| `codeswitch_results.csv` | 5 | Code-switching evaluation with mixed-script category. |
| `doda_independent_results.csv` | 12 | DODa independent validation (Arabizi dictionary). |

Supporting Data

| File / Directory | Description |
| --- | --- |
| `corpora/` | Train/validation/test text splits: `{train,val,test}_{ar,az,mi}.txt` |
| `morphology/farasa_segmentations.json` | Cached Farasa morphological segmentations for Arabic texts (~99 MB) |
| `tokenizers/` | Raw HuggingFace `tokenizers` JSON files (60 files: 24 shared + 36 concat halves) |
| `transformers_tokenizers/` | Tokenizers exported for `transformers` library use |
| `doda_sample_10k.txt` | 10K-line Arabizi sample from DODa dataset |
| `benchmark_report.md` | Auto-generated Markdown summary report |

Reproduction Guide

Full Reproduction (from scratch)

# 1. Train 8K-32K tokenizers + initial evaluation + morphology
python script.py

# 2. Train 80K + 110K tokenizers
python train_large_vocab.py
python train_remaining.py

# 3. Evaluate on test set (appends 80K/110K to results)
python eval_test_set.py
python eval_new_and_append.py
python eval_missing.py

# 4. Compute morphological metrics for large vocabs
python eval_morph_large.py

# 5. Bootstrap confidence intervals
python bootstrap_test_set.py

# 6. External tokenizer comparison
python eval_all_externals.py

# 7. Code-switching + DODa validation
python eval_codeswitch_and_new_baselines.py
python eval_doda_independent.py

# 8. Generate figures + reports
python regen_figures.py
python gen_report.py

# 9. Verify all numeric claims
python verify_arithmetic.py

Requirements

Python 3.10+
tokenizers, transformers, datasets (HuggingFace stack)
scikit-learn (KMeans for morphological consistency)
regex (Unicode grapheme segmentation)
numpy, pandas, matplotlib, seaborn
tqdm
Farasa JAR (for morphological segmentation; pre-cached in morphology/)
HF_TOKEN environment variable (for loading external models)

Key Design Decisions

Monolithic script.py: The main pipeline runs at import level (no __main__ guard). This is intentional for checkpoint-based resumption — the script detects existing artifacts and skips completed stages.
Duplicated helper functions: Functions like detect_script(), count_graphemes(), and normalize_decode() are copied across evaluation scripts rather than shared via import. This ensures each eval script is self-contained and runnable independently.
Test-set-only evaluation: All paper numbers come from eval_test_set.py, not script.py's full-corpus evaluation. The test set (11,282 sentences per script) provides unbiased estimates.
Concatenated architecture: Each concat config is stored as two JSON files (concat_ar_*.json + concat_az_*.json). The evaluator loads both and applies ID shifting at inference time.
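The ID shifting mentioned for the concatenated architecture can be sketched as follows. The sketch assumes that each text is routed to one half by its detected script and that the Arabizi half is the one offset by the Arabic half's vocabulary size; the evaluator may route or shift differently, and all names here are hypothetical.

```python
def shift_ids(ids, offset):
    """Move token ids into the upper part of a combined id space."""
    return [token_id + offset for token_id in ids]


def encode_concat(text, script, encode_ar, encode_az, ar_vocab_size):
    """Encode with the half matching the script; offset the second half's ids.

    encode_ar and encode_az are callables returning a list of token ids,
    standing in for the two tokenizers loaded from the paired JSON files.
    """
    if script == "ar":
        return encode_ar(text)
    return shift_ids(encode_az(text), ar_vocab_size)
```

With two halves of 16,000 entries each, id 5 from the second half becomes 16,005 in the combined space, so ids from the two halves never collide.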
