AI & ML interests

LLMs

Organization Card

KlusAI
Open research on privacy-preserving and efficient AI for European languages

Company site · Research hub · GitHub · X


What We're About

KlusAI builds open AI research and turns it into production systems. Our work runs along two tracks: a privacy and de-identification research programme that measures real re-identification risk in European text, and our long-running work on synthetic data, efficient on-device models, and multilingual NLP, with deep Romanian-language expertise throughout.

The research hub (leaderboard, papers, roadmap, and blog) lives at research.klusai.com.

Research Themes:

  • 🛡️ Privacy & De-identification: Re-identification risk as a first-class metric, not just detection accuracy
  • 🧬 Synthetic Data Generation: Large-scale training data without exposing sensitive data
  • ⚡ Efficient AI Systems: Compact models that run on consumer hardware (Mac-first, M3 Ultra)
  • 🌍 Multilingual NLP: With deep Romanian language expertise

🛡️ Privacy Research: EuroPriv-Bench

EuroPriv-Bench is the first unified pan-European de-identification benchmark with a neutral, open leaderboard. Its headline metric is re-identification risk, not detection-F1, measured over a GDPR-aligned entity taxonomy spanning legal and clinical text.

Headline finding: high detection-F1 ≠ re-identification protection. The best detector is not the best protector. An undetected Romanian CNP (Cod Numeric Personal, the national personal identification number), for example, deterministically leaks date of birth, sex, and county, so type-accurate detection and actual privacy protection measure different things.
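To make the "deterministic leak" concrete, here is a minimal sketch of decoding a CNP using the publicly documented 13-digit layout (sex/century digit, birth date, county code, serial, check digit). It is an illustration only, not EuroPriv-Bench code: the function names and the sample number are hypothetical, and the sample CNP is synthetic, built just to satisfy the check digit.

```python
from datetime import date

# Weights of the public CNP check-digit algorithm.
CNP_WEIGHTS = (2, 7, 9, 1, 4, 6, 3, 5, 8, 2, 7, 9)

# First digit -> (sex, century of birth). Digits 7-9 (residents and
# foreign citizens) carry no century and are not handled in this sketch.
SEX_CENTURY = {
    1: ("M", 1900), 2: ("F", 1900),
    3: ("M", 1800), 4: ("F", 1800),
    5: ("M", 2000), 6: ("F", 2000),
}


def cnp_check_digit(first12: str) -> int:
    """Weighted sum of the first 12 digits modulo 11; a remainder of 10 maps to 1."""
    total = sum(int(d) * w for d, w in zip(first12, CNP_WEIGHTS))
    remainder = total % 11
    return 1 if remainder == 10 else remainder


def decode_cnp(cnp: str) -> dict:
    """Recover sex, birth date, and county code from a valid CNP string."""
    if len(cnp) != 13 or not (cnp.isascii() and cnp.isdigit()):
        raise ValueError("a CNP has exactly 13 digits")
    if cnp_check_digit(cnp[:12]) != int(cnp[12]):
        raise ValueError("check digit mismatch")
    first = int(cnp[0])
    if first not in SEX_CENTURY:
        raise ValueError("first digit not handled in this sketch")
    sex, century = SEX_CENTURY[first]
    birth = date(century + int(cnp[1:3]), int(cnp[3:5]), int(cnp[5:7]))
    return {"sex": sex, "birth_date": birth, "county_code": cnp[7:9]}


# Synthetic example, not a real person's identifier.
SYNTHETIC_CNP = "1800101123450"
print(decode_cnp(SYNTHETIC_CNP))
```

No model or lookup service is involved: anyone holding the string recovers the attributes with a few lines of arithmetic, which is why a single missed CNP matters more than its share of the detection score suggests.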

kp-deid-mdeberta-280m is the first KlusAI de-identification model on the public leaderboard and the best protector, with a measured 0% CNP leak rate on the contamination-free Romanian real-skeleton track. The result replicates zero-shot on the Polish PESEL track, now scored on the board: 0% PESEL leak even though the model has never seen Polish. The re-identification metric also extends to further decode-bearing IDs such as the Italian codice fiscale.
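As a rough illustration of what a leak rate measures, the sketch below counts the share of documents whose gold identifier still appears verbatim in the de-identified output. This is an assumed, simplified definition for exposition; the card does not spell out how EuroPriv-Bench scores leaks. The `leak_rate` function, the record layout, and the sample texts are hypothetical, and the ID strings are synthetic placeholders.

```python
def leak_rate(records):
    """Fraction of records whose gold identifier survives in the output text.

    `records` is an iterable of (gold_identifier, deidentified_text) pairs.
    A plain substring match is used here; a real scorer would likely also
    normalise spacing and separators (an assumption, not shown).
    """
    records = list(records)
    if not records:
        raise ValueError("need at least one record")
    leaked = sum(1 for gold_id, text in records if gold_id in text)
    return leaked / len(records)


# Four synthetic documents; the third still contains its identifier.
records = [
    ("1800101123450", "Patient [NAME], CNP [ID], admitted on [DATE]."),
    ("2850315400012", "Patient [NAME], CNP [ID], admitted on [DATE]."),
    ("1920707123458", "Patient [NAME], CNP 1920707123458, admitted on [DATE]."),
    ("2751130400016", "Patient [NAME], CNP [ID], admitted on [DATE]."),
]

print(f"leak rate: {leak_rate(records):.0%}")  # leak rate: 25%
```

In the third record two of the three sensitive spans are masked, so a span-level detection score can stay high while the one identifier that decodes to personal attributes gets through.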

Status: EuroPriv-Bench results are early (dev status) and contamination-controlled, with the Romanian and Polish real-skeleton tracks pending native-speaker validation. These are measured results, not yet peer-reviewed or citable.

Open datasets: ds-kp-general-ro-50k · ds-kp-general-en-50k · ds-kp-general-pl-50k (CC-BY)

🔬 Leaderboard · 📄 Paper · 🗺️ Roadmap


🔬 Flagship Project: TinyFabulist

TinyFabulist is our open research programme on large-scale synthetic narrative generation. It shows that small, efficient models can produce high-quality training data at scale, and it includes compact Romanian language models trained on that data.

| Release | Description | Size |
| --- | --- | --- |
| TF1 | Synthetic English fables | ~3M stories |
| TF2 | English–Romanian literary translation framework + fine-tuned 12B model | 3M parallel pairs |
| TF3 | Compact Romanian LMs from scratch (tokenizers, pretraining, distillation) | down to 26M params |

Key principles:

  • 📊 Scale: 9M+ synthetic training examples generated
  • 🔧 Efficiency: Compact models; on-device, Mac-first deployment
  • 🔓 Openness: Datasets, generation pipelines, and methodology shared publicly

📄 TF1 (arXiv) · 📄 TF2 (arXiv) · 💻 Code (GitHub)


📄 Featured Publication

Synthetic Data Generation Using Large Language Models

Advances in Text and Code · IEEE Access, 2025

Our comprehensive survey on generating training data with LLMs: reducing annotation costs, addressing data scarcity, and enabling fine-tuning without exposing sensitive data.

📖 Read on IEEE Xplore · 📝 arXiv Preprint


📦 What You'll Find Here

  • Benchmarks: EuroPriv-Bench, a pan-European de-identification benchmark with re-identification risk metrics
  • Models: De-identification (kp-deid-*), compact Romanian LMs (tf3-*), and Romanian diacritics restoration
  • Datasets: Privacy corpora (ds-kp-general-*) and large-scale synthetic narrative corpora (ds-tf*)

🤝 Work With Us

Beyond open research, we offer enterprise AI services:

| Service | Description |
| --- | --- |
| AI Strategy | Define your AI roadmap and implementation plan |
| Custom Development | Bespoke AI solutions tailored to your domain |
| Model Training | Fine-tuning and deploying models for your use case |
| MLOps & Infrastructure | Scalable pipelines and production deployment |

Need custom synthetic data, de-identification, or domain-specific models? We partner with organizations on applied research challenges.


📫 Get in Touch

| Purpose | Contact |
| --- | --- |
| Research collaboration | research@klusai.com |
| Enterprise services | services@klusai.com |
| General inquiries | hello@klusai.com |

Technical questions? Open an issue on the relevant dataset or model repository.


Applied Research · AI Services · Ventures
klusai.com · research.klusai.com · GitHub · X