KlusAI
Open research on privacy-preserving and efficient AI for European languages
What We're About
KlusAI builds open AI research and turns it into production systems. Our work runs along two tracks: a privacy and de-identification research program measuring real re-identification risk in European text, and our long-running work on synthetic data, efficient on-device models, and multilingual NLP, with deep Romanian-language expertise throughout.
The research hub (leaderboard, papers, roadmap, and blog) lives at research.klusai.com.
Research Themes:
- Privacy & De-identification: Re-identification risk as a first-class metric, not just detection accuracy
- Synthetic Data Generation: Large-scale training data without exposing sensitive data
- Efficient AI Systems: Compact models that run on consumer hardware (Mac-first, M3 Ultra)
- Multilingual NLP: With deep Romanian-language expertise
Privacy Research: EuroPriv-Bench
EuroPriv-Bench is the first unified pan-European de-identification benchmark with a neutral, open leaderboard. Its headline metric is re-identification risk, not detection-F1, over a GDPR-aligned entity taxonomy spanning legal and clinical text.
Headline finding: high detection-F1 ≠ re-identification protection. The best detector is not the best protector. An undetected Romanian CNP, for example, deterministically leaks date of birth, sex, and county, so type-accurate detection and actual privacy protection measure different things.
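To make the CNP example concrete, here is a minimal Python sketch of how a single missed identifier can be decoded. It assumes the publicly documented CNP layout (one sex/century digit, birth date as YYMMDD, a two-digit county code, a three-digit serial, and a control digit); it is not taken from the EuroPriv-Bench code, and the names `control_digit` and `decode_cnp` are hypothetical.

```python
from datetime import date

# Weights for the CNP control digit, applied to the first 12 digits.
WEIGHTS = (2, 7, 9, 1, 4, 6, 3, 5, 8, 2, 7, 9)

# Leading digit -> (sex, century). Values 7-9 mark residents and foreign
# citizens and carry no century, so this sketch rejects them.
SEX_CENTURY = {
    "1": ("M", 1900), "2": ("F", 1900),
    "3": ("M", 1800), "4": ("F", 1800),
    "5": ("M", 2000), "6": ("F", 2000),
}


def control_digit(first12: str) -> int:
    """Return the control digit for the first 12 digits of a CNP."""
    remainder = sum(int(d) * w for d, w in zip(first12, WEIGHTS)) % 11
    return 1 if remainder == 10 else remainder


def decode_cnp(cnp: str) -> dict:
    """Recover the attributes encoded in a 13-digit CNP string."""
    if len(cnp) != 13 or not (cnp.isascii() and cnp.isdigit()):
        raise ValueError("a CNP is exactly 13 digits")
    if control_digit(cnp[:12]) != int(cnp[12]):
        raise ValueError("control digit mismatch")
    if cnp[0] not in SEX_CENTURY:
        raise ValueError("leading digit not handled in this sketch")
    sex, century = SEX_CENTURY[cnp[0]]
    birth = date(century + int(cnp[1:3]), int(cnp[3:5]), int(cnp[5:7]))
    return {"sex": sex, "birth_date": birth, "county_code": cnp[7:9]}


# Synthetic example: the first 12 digits are made up, the last is computed.
synthetic = "185031512345"
print(decode_cnp(synthetic + str(control_digit(synthetic))))
```

No lookup table or model is involved: the attributes fall out of fixed string positions, which is why one undetected CNP can matter more than several missed low-risk entities.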
kp-deid-mdeberta-280m is the first KlusAI de-identification model on the public leaderboard and the best protector, with a measured 0% CNP leak rate on the contamination-free Romanian real-skeleton track. The result replicates zero-shot on the Polish PESEL track (now scored on the board: 0% PESEL leak, although the model has never seen Polish). The re-identification metric also extends to further decode-bearing IDs such as the Italian codice fiscale.
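As a rough illustration of what a leak rate measures, the sketch below counts how many gold identifiers still appear verbatim in the de-identified output. This is an assumed, simplified definition for illustration only; the benchmark's actual scoring is not described here, and `leak_rate` and its input format are hypothetical.

```python
from typing import Iterable, List, Tuple


def leak_rate(examples: Iterable[Tuple[List[str], str]]) -> float:
    """Fraction of gold identifiers that survive verbatim in the output.

    Each example is (gold_ids, redacted_text): the identifier strings
    annotated in the source document, and the text after de-identification.
    """
    total = 0
    leaked = 0
    for gold_ids, redacted_text in examples:
        for identifier in gold_ids:
            total += 1
            if identifier in redacted_text:  # identifier was not removed
                leaked += 1
    return leaked / total if total else 0.0


# Made-up documents with placeholder digit strings, not real identifiers.
examples = [
    (["0000000000001"], "Patient [ID] was admitted on [DATE]."),
    (["0000000000002"], "Patient 0000000000002 was admitted on [DATE]."),
]
print(leak_rate(examples))
```

Unlike detection-F1, this kind of measure ignores whether the detector assigned the right entity type and only asks whether the decodable string reached the output.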
Status: EuroPriv-Bench results are early (`dev` status) and contamination-controlled, with the Romanian and Polish real-skeleton tracks pending native-speaker validation. These are measured results, not yet peer-reviewed or citable.
Open datasets: ds-kp-general-ro-50k · ds-kp-general-en-50k · ds-kp-general-pl-50k (CC-BY)
Leaderboard · Paper · Roadmap
Flagship Project: TinyFabulist
TinyFabulist is our open research programme on large-scale synthetic narrative generation. It shows that small, efficient models can produce high-quality training data at scale, and it includes compact Romanian language models trained on that data.
| Release | Description | Size |
|---|---|---|
| TF1 | Synthetic English fables | ~3M stories |
| TF2 | English→Romanian literary translation framework + fine-tuned 12B model | 3M parallel pairs |
| TF3 | Compact Romanian LMs from scratch (tokenizers, pretraining, distillation) | down to 26M params |
Key principles:
- Scale: 9M+ synthetic training examples generated
- Efficiency: Compact models; on-device, Mac-first deployment
- Openness: Datasets, generation pipelines, and methodology shared publicly
TF1 (arXiv) · TF2 (arXiv) · Code (GitHub)
Featured Publication
Synthetic Data Generation Using Large Language Models: Advances in Text and Code
IEEE Access, 2025
Our comprehensive survey on generating training data with LLMs: reducing annotation costs, addressing data scarcity, and enabling fine-tuning without exposing sensitive data.
Read on IEEE Xplore · arXiv Preprint
What You'll Find Here
- Benchmarks: EuroPriv-Bench, pan-European de-identification with re-identification risk metrics
- Models: De-identification (`kp-deid-*`), compact Romanian LMs (`tf3-*`), and Romanian diacritics restoration
- Datasets: Privacy corpora (`ds-kp-general-*`) and large-scale synthetic narrative corpora (`ds-tf*`)
Work With Us
Beyond open research, we offer enterprise AI services:
| Service | Description |
|---|---|
| AI Strategy | Define your AI roadmap and implementation plan |
| Custom Development | Bespoke AI solutions tailored to your domain |
| Model Training | Fine-tuning and deploying models for your use case |
| MLOps & Infrastructure | Scalable pipelines and production deployment |
Need custom synthetic data, de-identification, or domain-specific models? We partner with organizations on applied research challenges.
Get in Touch
| Purpose | Contact |
|---|---|
| Research collaboration | research@klusai.com |
| Enterprise services | services@klusai.com |
| General inquiries | hello@klusai.com |
Technical questions? Open an issue on the relevant dataset or model repository.
Applied Research · AI Services · Ventures
klusai.com · research.klusai.com · GitHub · X