Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings
Abstract
The SAGA framework uses a frozen multimodal large language model to provide attribute-aware supervision for a vision encoder through Group Relative Policy Optimization, improving zero-shot image retrieval performance.
Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the embedding apart or pulls it together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions on the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to tokens the MLLM attended to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, matching the deployment cost of a metric-learning baseline. SAGA improves Recall@1 by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves on zero-shot image retrieval.
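For readers unfamiliar with the metric, Recall@1 for nearest-neighbour retrieval can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the leave-one-out protocol (every image queries all the others), the cosine similarity, the function names, and the toy embeddings are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(embeddings, labels):
    """Fraction of queries whose nearest other item shares the query's label.

    Assumed protocol: each item is used as a query against all other items.
    """
    hits = 0
    for i, query in enumerate(embeddings):
        candidates = (j for j in range(len(embeddings)) if j != i)
        nearest = max(candidates, key=lambda j: cosine(query, embeddings[j]))
        if labels[nearest] == labels[i]:
            hits += 1
    return hits / len(embeddings)


# Hypothetical toy data: two classes in a 2-D embedding space.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.6]]
toy_labels = ["a", "a", "b", "b", "b"]
score = recall_at_1(toy_embeddings, toy_labels)
```

In this toy set the last item sits closer to class "a" than to its own class, so it counts as a miss while the other four queries are hits.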
Community
SAGA uses a frozen multimodal LLM as the reward model for training a retrieval vision encoder. Think RLVR, but aimed at the encoder's representation rather than LLM reasoning.
We show the MLLM an image pair, ask same class or different, and reward correct verdicts with GRPO. Advantages cancel on the attributes the two images share and concentrate on the ones that differ, so one binary reward becomes dense attribute-level gradients on the encoder, with no attribute labels.
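As a rough illustration of the group-relative step, the sketch below standardises binary verdict rewards within one sampled group, which is the usual GRPO advantage normalisation. It is not SAGA's implementation: the function names, the group size, the `eps` constant, and the example verdicts are hypothetical, and how the advantages are propagated back to the encoder's tokens is not shown.

```python
from statistics import mean, pstdev


def verdict_reward(predicted_same, truly_same):
    """Binary reward: 1.0 for a correct same/different verdict, else 0.0."""
    return 1.0 if predicted_same == truly_same else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise rewards within one group (GRPO-style normalisation).

    eps is an assumed small constant that avoids division by zero
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled verdicts for a pair from different classes.
verdicts = [False, True, False, False]
rewards = [verdict_reward(v, truly_same=False) for v in verdicts]
advantages = group_relative_advantages(rewards)
```

Correct verdicts receive a positive advantage and the single wrong one a negative advantage; if every verdict in a group earns the same reward, all advantages are zero and that group contributes no gradient.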
The MLLM is dropped at inference, so zero deployment overhead. +3 to 6 R@1 over SOTA on CUB, Cars, Aircraft, iNat-Aves.
Feedback welcome!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation (2026)
- Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models (2026)
- SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models (2026)
- PERL: Parameter Efficient Reasoning in CLIP Latent Space (2026)
- Text-Guided Visual Representation Learning for Robust Multimodal E-Commerce Recommendation (2026)
- HyperVis: Continuous Latent Visual Relational Graphs on the Lorentz Hyperboloid for Compositional Reasoning (2026)
- TextTeacher: What Can Language Teach About Images? (2026)
