EvoQuality

1. Model Overview

  • Model Name: EvoQuality (Self-Evolving VLM for Image Quality Assessment)
  • Task: No-Reference Image Quality Assessment (NR-IQA), supporting both single-image quality scoring and pairwise quality comparison (ranking)
  • Core Idea: EvoQuality uses no human-annotated quality scores or distortion-type labels. It generates pseudo-ranking labels by majority voting over its own pairwise comparisons, converts them into an optimizable reward signal through GRPO, and repeats this cycle to self-evolve its quality perception capability
  • Paper: Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking (ICLR 2026, arXiv:2509.25787)

2. Model and Framework Details

  • Backbone Model (paper setting): Qwen2.5-VL-7B (used as the baseline policy)
  • Training Paradigm: Two-stage cycle that supports multi-round iteration (T=2 in the paper)
    • Offline Stage (Pseudo-label): Perform K comparisons on randomly sampled image pairs, then derive pseudo-preferences p*(xi, xj) via majority voting
    • Online Stage (RL): Convert pseudo-preferences into a fidelity reward and update the policy via Group Relative Policy Optimization (GRPO) (full fine-tuning of the VLM)
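
The offline stage can be illustrated with a minimal sketch of majority voting over K comparisons with random order swapping. The `compare` callable is a hypothetical stand-in for one sampled VLM response to c_compare; the function name, the `agreement` value, and the tie-breaking rule are assumptions made for illustration, not details taken from the paper.

```python
import random
from collections import Counter
from typing import Callable, Optional, Tuple


def majority_vote_preference(
    compare: Callable[[str, str], int],
    image_a: str,
    image_b: str,
    k: int = 32,
    rng: Optional[random.Random] = None,
) -> Tuple[int, float]:
    """Return (pseudo_label, agreement) for the pair (image_a, image_b).

    pseudo_label is 0 if image_a wins the vote and 1 if image_b wins.
    compare(first, second) is a hypothetical wrapper around one sampled
    model response; it returns 0 or 1, the index of the better image.
    """
    rng = rng or random.Random()
    votes: Counter = Counter()
    for _ in range(k):
        if rng.random() < 0.5:
            # Original order: the answer already refers to (a, b).
            votes[compare(image_a, image_b)] += 1
        else:
            # Swapped order: map the answer back to the (a, b) indexing.
            votes[1 - compare(image_b, image_a)] += 1
    # Assumption: ties are broken in favor of image_a.
    label = 0 if votes[0] >= votes[1] else 1
    agreement = votes[label] / k
    return label, agreement


# Usage with a toy comparator that always prefers "sharp.png".
def toy_compare(first: str, second: str) -> int:
    return 0 if first == "sharp.png" else 1


label, agreement = majority_vote_preference(
    toy_compare, "sharp.png", "blurry.png", k=8, rng=random.Random(0)
)
print(label, agreement)
```

Swapping the order inside the loop means a comparator that always answers "first image" splits its votes roughly evenly, so a purely positional preference does not produce a confident pseudo-label.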

3. Prompts

  • Offline Comparison c_compare:
    • <image><image> You are performing an image quality assessment task. Compare the two images and decide which one has better perceptual quality. Answer strictly with the index of the better image: 0 if the first image is better, or 1 if the second image is better.
  • Online Scoring c_score:
    • <image> You are doing the image quality assessment task. Here is the question: What is your overall rating on the quality of this picture? The rating should be a float between 1 and 5, rounded to two decimal places, with 1 representing very poor quality and 5 representing excellent quality.
  • Reasoning Suffix (for self-consistency sampling):
    • You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in boxed{}.

4. Training

  • Number of Iterations: T = 1 (the open-sourced model weights are the result of the first round of self-evolution)
  • Training Data: No additional synthetic distortion data and no extra annotated labels were added when producing the released weights
  • Offline Stage:
    • Sample K=32 responses per pair, then derive pseudo-labels via majority voting
    • Randomly swap image order to mitigate positional bias
  • Online Stage (GRPO):
    • Sample K=32 responses per training sample, using the c_score prompt
    • Optimizer: AdamW, initial learning rate 3e-7, with linear decay
    • KL coefficient: beta = 0.05
    • Resources (as reported in the paper): 8x NVIDIA A100, per-GPU batch size = 4, ~12 hours/epoch
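
The online stage can be sketched as follows, under stated assumptions. This card does not spell out how the fidelity reward is computed, so the sketch assumes the fidelity measure common in learning-to-rank, sqrt(p*·p) + sqrt((1-p*)(1-p)), together with a Thurstone-style preference probability built from sampled scores. The function names and the `gamma` stabilizer are hypothetical. The group-relative advantage is the usual GRPO normalization of rewards within one sampled group.

```python
import math
from statistics import mean, pstdev
from typing import List


def preference_probability(score_i: float, mean_j: float,
                           var_i: float, var_j: float,
                           gamma: float = 1e-3) -> float:
    """Assumed Thurstone-style P(image i is better than image j).

    gamma is a hypothetical stabilizer that keeps the denominator positive.
    """
    z = (score_i - mean_j) / math.sqrt(var_i + var_j + gamma)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF


def fidelity_reward(p_star: float, p: float) -> float:
    """Fidelity between pseudo-preference p_star and predicted preference p.

    Equals 1 when the two agree exactly and 0 when they are opposite certainties.
    """
    return math.sqrt(p_star * p) + math.sqrt((1.0 - p_star) * (1.0 - p))


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one sampled group, as in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: four sampled scores for image i, compared against one image j.
scores_i = [4.10, 3.90, 2.20, 4.00]
var_i = pstdev(scores_i) ** 2
p_star = 1.0  # pseudo-label: image i was voted better than image j
rewards = [
    fidelity_reward(p_star, preference_probability(s, mean_j=3.0, var_i=var_i, var_j=0.05))
    for s in scores_i
]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

In this toy group the response that scored image i at 2.20 contradicts the pseudo-label most strongly, so it receives the lowest reward and a negative advantage.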

5. Evaluation Metrics

  • Evaluation Setting: zero-shot (no training on the target test sets)
  • Metrics: PLCC, SRCC (consistency with human subjective quality)
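
For reference, PLCC (Pearson linear correlation) and SRCC (Spearman rank correlation) can be computed as in the sketch below, which uses only the standard library. The function names are illustrative. Some IQA protocols fit a nonlinear mapping to the predictions before computing PLCC; this sketch does not, and the card does not say which protocol the paper follows.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation coefficient (SRCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Usage: predicted scores vs. hypothetical mean opinion scores (MOS).
predicted = [1.0, 2.0, 3.0, 4.0]
mos = [1.0, 4.0, 9.0, 16.0]
print(round(pearson(predicted, mos), 4), round(spearman(predicted, mos), 4))
```

The toy data is monotone but not linear, so SRCC reaches 1 while PLCC stays below 1, which is why the two metrics are usually reported together.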

6. Main Results

  • Improvement over the Backbone Model (Qwen2.5-VL-7B): weighted average (WAVG.) over multiple benchmarks
    • PLCC: 0.615 -> 0.770 (+31.8%)
    • SRCC: 0.570 -> 0.726 (+33.7%)
  • Generalization: Achieves significant improvements across diverse distortion types and AI-generated content, matching or surpassing several supervised VLM-IQA approaches on multiple benchmarks (see the paper for detailed tables)

7. Intended Use and Usage Guidelines

  • Recommended Use
    • Research and evaluation: NR-IQA, cross-dataset generalization comparison, quality ranking/filtering, auxiliary signals for data cleaning
    • Pre-production assessment: as a perceptual quality proxy, but should be combined with business data and manual spot-check validation
  • Not Recommended Use
    • As the sole quality criterion for high-stakes decisions (content moderation, medical imaging diagnostic conclusions, legal evidence adjudication, etc.)
    • Treating model outputs as "absolute objective ground truth" (IQA is inherently subjective and depends on the preferences of the rating population)
  • Output Notes
    • The paper's prompts require outputs of the form <think>...</think> followed by boxed{score}. For integration, parse only the value inside boxed{}, and check how temperature and sampling strategy affect the consistency of the scores
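
A minimal parsing sketch for the output format described above, using only the standard library. The function name, the optional leading backslash, and the choice to reject out-of-range values instead of clamping them are assumptions made for illustration.

```python
import re
from typing import Optional

# Matches boxed{3.75} and also \boxed{3.75}; captures the numeric value.
_BOXED = re.compile(r"\\?boxed\{\s*([-+]?\d+(?:\.\d+)?)\s*\}")
_THINK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def parse_boxed_score(text: str, low: float = 1.0, high: float = 5.0) -> Optional[float]:
    """Extract the final score from a model response, or None if unusable."""
    # Drop the reasoning block so numbers mentioned while thinking are ignored.
    visible = _THINK.sub("", text)
    matches = _BOXED.findall(visible)
    if not matches:
        return None
    score = float(matches[-1])  # keep the last boxed value as the final answer
    if not (low <= score <= high):
        return None  # assumption: reject instead of clamping
    return score


# Usage
response = "<think>Slight blur, maybe boxed{2}? No, noise is low.</think> boxed{3.75}"
print(parse_boxed_score(response))
```

Returning None for malformed or out-of-range outputs lets the caller decide whether to resample or skip the image, which keeps parsing failures visible instead of silently turning them into scores.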

8. Limitations and Known Risks

  • Self-supervised Pseudo-label Bias: Pseudo-rankings are derived from the model's own votes, which may amplify the systematic preferences or blind spots of the backbone model
  • Domain Shift: May fail on images from specialized domains (e.g., medical, remote sensing, industrial inspection)
  • Subjectivity and Population Differences: Different cultural/aesthetic preferences and task objectives (aesthetics vs. clarity) can change the definition of "quality"
  • Prompt Sensitivity: Variations in prompts, sampling count K, and decoding strategies can affect self-consistency voting and final performance


9. Citation

@article{wen2025selfevolving,
  title={Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking},
  author={Wen, Wen and Zhi, Tianwu and Fan, Kanglong and Li, Yang and Peng, Xinge and Zhang, Yabin and Liao, Yiting and Li, Junlin and Zhang, Li},
  journal={arXiv preprint arXiv:2509.25787},
  year={2025}
}