CoVEBench: Can Video Editing Models Handle Complex Instructions? Paper • 2606.08415 • Published 5 days ago • 47
TVIR: Building Deep Research Agents Towards Text--Visual Interleaved Report Generation Paper • 2606.02320 • Published 11 days ago • 14
Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories Paper • 2606.02060 • Published 11 days ago • 54
MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills? Paper • 2606.01993 • Published 10 days ago • 14
OProver: A Unified Framework for Agentic Formal Theorem Proving Paper • 2605.17283 • Published 26 days ago • 31
Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution Paper • 2605.15301 • Published 29 days ago • 22
DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios Paper • 2604.25914 • Published Apr 28 • 41
WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models Paper • 2604.18224 • Published Apr 20 • 22
DR^{3}-Eval: Towards Realistic and Reproducible Deep Research Evaluation Paper • 2604.14683 • Published Apr 16 • 36
CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction Paper • 2603.00610 • Published Feb 28 • 35
AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations Paper • 2602.03828 • Published Feb 3 • 20