arxiv:2606.21670

Improving Text-to-Music Generation with Human Preference Rewards

Published on Jun 19 · Submitted by Yonghyun Kim on Jun 23

Abstract

A text-to-music generation system uses reward conditioning, expert iteration, and preference tuning to improve audio quality while maintaining efficiency within a 120M-parameter model framework.

We describe our entry to the efficiency track of the Academic Text-to-Music (ATTM) Grand Challenge at ICME 2026. Beyond the challenge protocol's FAD-CLAP and CLAP score, we add a learned human-preference reward from TuneJury, a twin pairwise ranker trained over open music-preference datasets. The reward serves both as a training-time conditioning signal and as a sample-selection criterion. The pipeline combines five engineering decisions on a 120M-parameter FluxAudio-S backbone, four at training time and one at inference: (i) training-time reward conditioning that doubles as an inference-time CFG axis, (ii) a sweep over five score-conditioning architectures, where training and inference use different variants, (iii) expert iteration on the top decile, (iv) a short preference-tuning pass (CRPO) for audio-text alignment, and (v) inference post-processing via joint CFG, source separation, and loudness normalization. Per-stage decomposition on 100 Song Describer prompts shows training-time reward conditioning as a functional conditioning axis, expert iteration as the dominant contributor, the preference-tuning pass adding only noise-level gain, and the inference-time score scalar already saturated by the end of the chain.
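The selection step behind "expert iteration on the top decile" can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each generated clip already has a scalar reward (for example from a preference model such as TuneJury), and the function name `select_top_fraction` and the example scores are hypothetical.

```python
import numpy as np


def select_top_fraction(rewards, fraction=0.1):
    """Return indices of the highest-reward samples, best first.

    Hypothetical helper: `rewards` holds one scalar reward per generated
    clip, and `fraction=0.1` keeps the top decile.
    """
    rewards = np.asarray(rewards, dtype=float)
    if rewards.ndim != 1 or rewards.size == 0:
        raise ValueError("rewards must be a non-empty 1-D sequence")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one sample even for very small pools.
    k = max(1, int(round(fraction * rewards.size)))
    # Stable sort on negated rewards: descending order, ties keep input order.
    order = np.argsort(-rewards, kind="stable")
    return order[:k]


# Made-up reward scores for ten generated clips (illustrative only).
example_rewards = [0.1, 0.9, -0.3, 0.5, 0.2, 0.0, -1.0, 0.7, 0.3, 0.4]
kept = select_top_fraction(example_rewards, fraction=0.2)
print("indices kept for fine-tuning:", kept.tolist())
```

In an expert-iteration loop, the kept clips would then serve as fine-tuning data for the next round. How many rounds are run and how candidates are generated are not specified here.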

Community


Can a human-preference reward improve a small text-to-music model without new labels or scaling up?
Our ICME 2026 ATTM Grand Challenge (Efficiency Track) entry puts an open human-preference reward (TuneJury, trained on judgments from Music Arena, MusicPrefs, AIME, and SongEval) at the center of a 120M-parameter FluxAudio-S pipeline: training-time conditioning, expert-iteration ranking, and a short preference-tuning pass. Training takes ~40 GPU hours in total on a single RTX A5000, and the model generates 10 s clips in under a second.

Does it work?
Yes! Relative to the same 120M baseline, all three evaluation metrics improve:

  • TuneJury reward ↑ (Do people prefer it?): −0.39 → +0.53
  • FAD-CLAP ↓ (Does it sound like real music?): 0.60 → 0.42
  • CLAP score ↑ (Does it match the text prompt?): 0.23 → 0.29
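
To put the three changes on a common footing, the short sketch below computes the absolute difference for each metric and, where the baseline is positive, the relative change. The numbers are the ones listed above; the helper name `deltas` is hypothetical. No relative change is reported for the TuneJury reward because its baseline is negative, which makes a percentage hard to interpret.

```python
# Baseline and final values as reported in the list above.
metrics = {
    "TuneJury reward": (-0.39, 0.53),  # higher is better
    "FAD-CLAP": (0.60, 0.42),          # lower is better
    "CLAP score": (0.23, 0.29),        # higher is better
}


def deltas(before, after):
    """Return (absolute change, relative change in percent or None)."""
    absolute = round(after - before, 2)
    # A relative change is only meaningful for a positive baseline.
    relative = None if before <= 0 else round(100 * (after - before) / before, 1)
    return absolute, relative


summary = {name: deltas(*pair) for name, pair in metrics.items()}
for name, (absolute, relative) in summary.items():
    rel_text = "n/a" if relative is None else f"{relative:+.1f}%"
    print(f"{name}: {absolute:+.2f} absolute, {rel_text} relative")
```

Under this arithmetic the reward rises by 0.92, FAD-CLAP falls by 0.18 (a 30% reduction), and the CLAP score rises by 0.06 (roughly a 26% increase).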

