KlusAI Labs

company

https://www.klusai.com

klusai

KlusAI
Open research on privacy-preserving and efficient AI for European languages

🔍 What We're About

KlusAI conducts open AI research and turns it into production systems. Our work runs along two tracks: a privacy and de-identification research program that measures real re-identification risk in European text, and our long-running work on synthetic data, efficient on-device models, and multilingual NLP. Deep Romanian-language expertise runs through both.

The research hub — leaderboard, papers, roadmap, and blog — lives at research.klusai.com.

Research Themes:

🛡️ Privacy & De-identification — Re-identification risk as a first-class metric, not just detection accuracy
🧬 Synthetic Data Generation — Large-scale training data without exposing sensitive data
⚡ Efficient AI Systems — Compact models that run on consumer hardware (Mac-first, M3 Ultra)
🌍 Multilingual NLP — With deep Romanian language expertise

🛡️ Privacy Research: EuroPriv-Bench

EuroPriv-Bench is the first unified pan-European de-identification benchmark with a neutral, open leaderboard. Its headline metric is re-identification risk — not detection-F1 — over a GDPR-aligned entity taxonomy spanning legal and clinical text.

Headline finding: high detection-F1 ≠ re-identification protection. The best detector is not the best protector. An undetected Romanian CNP, for example, deterministically leaks date of birth, sex, and county — so type-accurate detection and actual privacy protection measure different things.
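To make the CNP example concrete, the sketch below decodes a CNP using the publicly documented layout (one digit for sex and century, six for the birth date, two for the county code, three for a serial, one check digit). It is an illustration only, not part of EuroPriv-Bench; the function names `cnp_check_digit` and `decode_cnp` are hypothetical, the sample number is synthetic, and first digits 7 to 9 (used for other categories of holders) are deliberately not decoded.

```python
from datetime import date

# First digit -> (century base, sex). Digits 7-9 are not handled in this sketch.
_CENTURY_SEX = {
    1: (1900, "M"), 2: (1900, "F"),
    3: (1800, "M"), 4: (1800, "F"),
    5: (2000, "M"), 6: (2000, "F"),
}
_WEIGHTS = "279146358279"  # check-digit weights for the first 12 digits


def cnp_check_digit(first12: str) -> int:
    """Weighted sum of the first 12 digits, mod 11; a remainder of 10 maps to 1."""
    total = sum(int(d) * int(w) for d, w in zip(first12, _WEIGHTS))
    remainder = total % 11
    return 1 if remainder == 10 else remainder


def decode_cnp(cnp: str) -> dict:
    """Recover birth date, sex, and county code from a 13-digit CNP."""
    if len(cnp) != 13 or not (cnp.isascii() and cnp.isdigit()):
        raise ValueError("a CNP has exactly 13 digits")
    if cnp_check_digit(cnp[:12]) != int(cnp[12]):
        raise ValueError("check digit mismatch")
    first = int(cnp[0])
    if first not in _CENTURY_SEX:
        raise ValueError("first digit not handled by this sketch")
    century, sex = _CENTURY_SEX[first]
    birth = date(century + int(cnp[1:3]), int(cnp[3:5]), int(cnp[5:7]))
    return {"birth_date": birth, "sex": sex, "county_code": cnp[7:9]}


# Synthetic example: the check digit is computed, not taken from a real person.
first12 = "180010112345"
synthetic_cnp = first12 + str(cnp_check_digit(first12))
print(decode_cnp(synthetic_cnp))
```

Nothing here needs a lookup table beyond the number itself, which is why a single missed CNP reveals far more than a single missed name.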

kp-deid-mdeberta-280m is the first KlusAI de-identification model on the public leaderboard and the best protector, with a measured 0% CNP leak rate on the contamination-free Romanian real-skeleton track. The result replicates zero-shot on the Polish PESEL track, which is now scored on the board: 0% PESEL leak even though the model has never seen Polish. The re-identification metric also extends to further decode-bearing IDs such as the Italian codice fiscale.
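As a rough illustration of what a leak rate could measure, the sketch below counts how many gold identifiers still appear verbatim in de-identified output. This is an assumed, simplified definition for illustration, not EuroPriv-Bench's actual scoring code; the function name `leak_rate`, the input shape, and the toy sentences are all hypothetical.

```python
def leak_rate(examples):
    """Fraction of gold identifiers that survive verbatim in the output.

    `examples` is an iterable of (deidentified_text, gold_identifiers) pairs.
    Assumption: an identifier counts as leaked if its exact string is still present.
    """
    total = 0
    leaked = 0
    for text, identifiers in examples:
        for identifier in identifiers:
            total += 1
            if identifier in text:
                leaked += 1
    return leaked / total if total else 0.0


# Toy data with a synthetic CNP: one output masks it, the other misses it.
toy = [
    ("Pacientul [NAME], CNP [ID], a fost internat.", ["1800101123450"]),
    ("Pacientul [NAME], CNP 1800101123450, a fost internat.", ["1800101123450"]),
]
print(leak_rate(toy))
```

In the second toy sentence the name is masked correctly, so a detection score over all entities would still look reasonable, while the one identifier that matters has leaked.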

Status: EuroPriv-Bench results are early (dev status) and contamination-controlled, with the Romanian and Polish real-skeleton tracks pending native-speaker validation. These are measured results, not yet peer-reviewed or citable.

Open datasets: ds-kp-general-ro-50k · ds-kp-general-en-50k · ds-kp-general-pl-50k (CC-BY)

🔬 Leaderboard · 📄 Paper · 🗺️ Roadmap

🔬 Flagship Project: TinyFabulist

TinyFabulist is our open research program on large-scale synthetic narrative generation. It shows that small, efficient models can produce high-quality training data at scale, and it includes compact Romanian language models trained on that data.

Release	Description	Scale
TF1	Synthetic English fables	~3M stories
TF2	English–Romanian literary translation framework + fine-tuned 12B model	3M parallel pairs
TF3	Compact Romanian LMs from scratch (tokenizers, pretraining, distillation)	down to 26M params

Key principles:

📊 Scale — 9M+ synthetic training examples generated
🔧 Efficiency — Compact models; on-device, Mac-first deployment
🔓 Openness — Datasets, generation pipelines, and methodology shared publicly

📄 TF1 (arXiv) · 📄 TF2 (arXiv) · 💻 Code (GitHub)

📄 Featured Publication

Synthetic Data Generation Using Large Language Models: Advances in Text and Code

IEEE Access, 2025

Our comprehensive survey on generating training data with LLMs: reducing annotation costs, addressing data scarcity, and enabling fine-tuning without exposing sensitive data.

📖 Read on IEEE Xplore · 📝 arXiv Preprint

📦 What You'll Find Here

Benchmarks — EuroPriv-Bench: pan-European de-identification with re-identification risk metrics
Models — De-identification (kp-deid-*), compact Romanian LMs (tf3-*), and Romanian diacritics restoration
Datasets — Privacy corpora (ds-kp-general-*) and large-scale synthetic narrative corpora (ds-tf*)

🤝 Work With Us

Beyond open research, we offer enterprise AI services:

Service	Description
AI Strategy	Define your AI roadmap and implementation plan
Custom Development	Bespoke AI solutions tailored to your domain
Model Training	Fine-tuning and deploying models for your use case
MLOps & Infrastructure	Scalable pipelines and production deployment