VEXMLM: Vocabulary-Extended XLM-R for Ge'ez-Script African Languages

Paper Model License Languages

Expanding the Lexicon of Ge'ez Based African Languages: A Comparative Study of Amharic and Tigrinya
Hailay Kidu Teklehaymanot, Wolfgang Nejdl
Third Workshop on Language Models for Underserved Communities (LM4UC@IJCAI 2026)
OpenReview: Ii7NEmkvhB


Overview

Multilingual pre-trained language models (PLMs) face persistent challenges with low-resource languages that use non-Latin scripts, due to high out-of-vocabulary (OOV) rates and subword fragmentation stemming from Latin-centric tokenization.

VEXMLM is a vocabulary-extended variant of XLM-R specifically optimized for the Ge'ez-script languages Amharic and Tigrinya, and extended via task-specific fine-tuning to 19 low-resource African languages.

Key Contributions

  • Principled vocabulary expansion: A language-specific SentencePiece tokenizer trained on curated monolingual corpora augments XLM-R's vocabulary with 30,000 Ge'ez-derived subword tokens, initialized via mean initialization over the source embedding space.
  • Two-stage training strategy: Continued masked language modeling (MLM) pretraining followed by supervised fine-tuning on QA, NER, and sentiment analysis β€” enabling cross-lingual transfer to 17 additional low-resource languages.
  • Comprehensive evaluation: Intrinsic metrics (parity, fertility, compression, OOV rate) and extrinsic benchmarks (NER, SA, QA) across 19 African languages.

Results

Question Answering (QA)

Model Exact Match F1
XLM-R 0.66 0.78
Glot500 0.74 0.78
VEXMLM 0.87 0.90

Sentiment Analysis (SA)

Model Accuracy
XLM-R 0.77
Glot500 0.46
VEXMLM 0.80

NER β€” OOV Word Accuracy (11 African Languages)

Model OOV Word Accuracy
XLM-R 81.4%
VEXMLM 94.3%

Supported Languages

VEXMLM covers 19 low-resource African languages, with core optimization for:

Language Script ISO Code
Tigrinya Ge'ez tir
Amharic Ge'ez amh

Extended cross-lingual transfer to 17 additional African languages via fine-tuning.


Model Architecture

XLM-R (base)
    └── Vocabulary Expansion (+30,000 Ge'ez subword tokens)
            └── Mean embedding initialization
                    └── Stage 1: Continued MLM Pretraining
                            └── Stage 2: Task-Specific Fine-Tuning
                                    β”œβ”€β”€ Question Answering (QA)
                                    β”œβ”€β”€ Named Entity Recognition (NER)
                                    └── Sentiment Analysis (SA)
  • Base model: FacebookAI/xlm-roberta-base
  • Tokenizer: SentencePiece, trained on curated Amharic + Tigrinya monolingual corpora
  • Vocabulary extension: 30,000 new Ge'ez-derived subword tokens
  • Embedding initialization: Mean initialization over XLM-R source embedding space
  • Training framework: Hugging Face Transformers, PyTorch

Installation

pip install transformers torch sentencepiece

Usage

Load the Model

from transformers import AutoTokenizer, AutoModel

model_name = "Hailay/EXLMR"// We Change the name here but no difference in details 
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

Question Answering (Tigrinya)

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

model_name = "Hailay/EXLMR"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "αˆ˜αŠ• αŠ₯α‹© ፈጣαˆͺ?"          # "Who is the creator?"
context  = "αŠ£αˆαˆ‹αŠ½ ፈጣαˆͺ αŠ©αˆ‰ α‹“αˆˆαˆ αŠ₯ዩፒ"   # "God is the creator of the whole world."

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

start = torch.argmax(outputs.start_logits)
end   = torch.argmax(outputs.end_logits) + 1
answer = tokenizer.decode(inputs["input_ids"][0][start:end])
print(f"Answer: {answer}")

Named Entity Recognition (Amharic)

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "Hailay/VEXMLM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = "αŠ α‰Ά አα‰₯α‹­ αŠ αˆ…αˆ˜α‹΅ α‰ αŠ α‹²αˆ΅ αŠ α‰ α‰£ α‹­αŠ–αˆ«αˆ‰α’"   # "Mr. Abiy Ahmed lives in Addis Ababa."
print(ner(text))

Sentiment Analysis

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "Hailay/VEXMLM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
text = "αŠ₯α‹š αˆ˜αŒ½αˆ“α α‰₯αŒ£α‹•αˆš αŒ½α‰‘α‰• αŠ₯ዩፒ"   # "This book is very good."
print(classifier(text))

Repository Structure

VEXMLM/
β”œβ”€β”€ tokenizer/
β”‚   β”œβ”€β”€ train_sentencepiece.py       # SentencePiece tokenizer training
β”‚   └── geez_vocab_extension.py      # Vocabulary expansion & embedding init
β”œβ”€β”€ pretraining/
β”‚   └── run_mlm.py                   # Continued MLM pretraining script
β”œβ”€β”€ finetuning/
β”‚   β”œβ”€β”€ run_qa.py                    # QA fine-tuning
β”‚   β”œβ”€β”€ run_ner.py                   # NER fine-tuning
β”‚   └── run_sa.py                    # Sentiment analysis fine-tuning
β”œβ”€β”€ evaluation/
β”‚   β”œβ”€β”€ intrinsic_eval.py            # Parity, fertility, compression, OOV metrics
β”‚   └── extrinsic_eval.py            # NER, SA, QA benchmark evaluation
β”œβ”€β”€ data/
β”‚   └── README.md                    # Dataset sources and preprocessing notes
β”œβ”€β”€ configs/
β”‚   └── training_config.yaml         # Hyperparameters and training settings
β”œβ”€β”€ requirements.txt
└── README.md

Training Details

Parameter Value
Base model xlm-roberta-base
Vocabulary extension +30,000 Ge'ez subword tokens
Embedding init Mean initialization
Stage 1 Continued MLM pretraining
Stage 2 Task-specific fine-tuning
Tasks QA, NER, Sentiment Analysis
Languages (core) Amharic, Tigrinya
Languages (total) 19 African languages
Framework Hugging Face Transformers

Citation

If you use VEXMLM in your research, please cite:

@inproceedings{teklehaymanot2026vexmlm,
  title     = {Expanding the Lexicon of {G}e'ez Based {A}frican Languages:
               A Comparative Study of {A}mharic and {T}igrinya},
  author    = {Teklehaymanot, Hailay Kidu and Nejdl, Wolfgang},
  booktitle = {Proceedings of the Third Workshop on Language Models
               for Underserved Communities (LM4UC@IJCAI 2026)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=Ii7NEmkvhB}
}

Related Work & Acknowledgements

  • Base model: XLM-R (Conneau et al., 2020)
  • Tokenization baseline: Glot500 (ImaniGooghari et al., 2023)
  • This work was carried out at the L3S Research Center, Leibniz UniversitΓ€t Hannover, and the University of Zurich.

License

This project is licensed under the Apache 2.0 License.

Downloads last month
11
Safetensors
Model size
0.3B params
Tensor type
F32
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for Hailay/EXLMR

Finetuned
(957)
this model