arxiv:2606.15134

Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

Published on Jun 13

Submitted by Shubhang Bhatnagar on Jun 17

University of Illinois at Urbana-Champaign

Abstract

The SAGA framework uses a frozen multimodal large language model to provide attribute-aware supervision for vision encoders through Group Relative Policy Optimization, improving zero-shot image retrieval performance.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the embedding apart or pulls it together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions on the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to tokens the MLLM attended to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, matching the deployment cost of a metric-learning baseline. On zero-shot image retrieval, SAGA improves Recall@1 by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves.
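For readers unfamiliar with the metric, the following is a minimal sketch of how Recall@1 is commonly computed for nearest-neighbour retrieval. It assumes cosine similarity, non-zero embedding vectors, and a protocol where each image queries all other images in the same set; the abstract does not state these details, so they are assumptions, and the names `cosine`, `recall_at_1`, and the toy data are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(embeddings, labels):
    """Fraction of queries whose nearest neighbour (excluding itself) shares their label.

    Assumption: every image is used as a query against all the others.
    """
    hits = 0
    n = len(embeddings)
    for i, query in enumerate(embeddings):
        # Index of the most similar embedding other than the query itself.
        nearest = max(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(query, embeddings[j]),
        )
        hits += labels[nearest] == labels[i]
    return hits / n


# Toy data (hypothetical): two tight clusters in 2-D.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["a", "a", "b", "b"]
score = recall_at_1(embeddings, labels)
```

In the zero-shot setting, the test classes are disjoint from the training classes, so a high score here reflects how well the embedding geometry transfers rather than memorised class boundaries.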

Community

shubhangb

Paper submitter 2 days ago

SAGA uses a frozen multimodal LLM as the reward model for training a retrieval vision encoder. Think RLVR, but aimed at the encoder's representation rather than LLM reasoning.

We show the MLLM an image pair, ask whether the two images are the same class or different, and reward correct verdicts with GRPO. Advantages cancel on the attributes the two images share and concentrate on the ones that differ, so one binary reward becomes dense attribute-level gradients on the encoder, with no attribute labels.
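As a rough illustration of the reward step described above, here is a minimal sketch of the group-relative advantage computation used in GRPO: several verdicts are sampled for the same image pair, each gets a binary reward, and rewards are normalised within the group. The group size, the verdict strings, the use of population standard deviation, and the `eps` constant are assumptions for illustration, not details from the paper. The sketch covers only the scalar normalisation; how these advantages propagate to the encoder's tokens is not shown.

```python
from statistics import mean, pstdev


def verdict_rewards(verdicts, same_class):
    """Binary reward: 1.0 if a sampled verdict matches the ground truth, else 0.0."""
    target = "same" if same_class else "different"
    return [1.0 if v == target else 0.0 for v in verdicts]


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one sampled group: (r - mean) / (std + eps).

    If every sample in the group gets the same reward, all advantages are zero
    and the group contributes no gradient signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled verdicts for one same-class pair.
verdicts = ["same", "different", "same", "same"]
rewards = verdict_rewards(verdicts, same_class=True)
advantages = group_relative_advantages(rewards)
```

Correct verdicts receive a positive advantage and the incorrect one a negative advantage, and the advantages sum to zero within the group, which is what makes the signal relative rather than absolute.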

The MLLM is dropped at inference, so there is zero deployment overhead. +3 to 6 points R@1 over SOTA on CUB, Cars, Aircraft, iNat-Aves.
Feedback welcome!