arxiv:2407.10995

LionGuard: Building a Contextualized Moderation Classifier to Tackle Localized Unsafe Content

Published on Jun 24, 2024

Abstract

LionGuard, a Singapore-specific moderation classifier for large language models, outperforms general moderation APIs on Singlish data, emphasizing the importance of localization in language safety.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

As large language models (LLMs) become increasingly prevalent in a wide variety of applications, concerns about the safety of their outputs have become more significant. Most efforts at safety-tuning or moderation today take a predominantly Western-centric view of safety, especially for toxic, hateful, or violent speech. In this paper, we describe LionGuard, a Singapore-contextualized moderation classifier that can serve as a guardrail against unsafe LLM outputs. When assessed on Singlish data, LionGuard outperforms existing widely used moderation APIs, which are not finetuned for the Singapore context, by 14% (binary) and up to 51% (multi-label). Our work highlights the benefits of localization for moderation classifiers and presents a practical and scalable approach for low-resource languages.



