Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents? • Paper • 2607.01211 • Published 3 days ago • 6
Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs • Paper • 2606.32032 • Published 4 days ago • 21
Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity • Paper • 2607.00248 • Published 4 days ago • 23
SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions • Paper • 2606.30573 • Published 5 days ago • 5
Dockerless: Environment-Free Program Verifier for Coding Agents • Paper • 2606.28436 • Published 8 days ago • 101
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models • Paper • 2604.10866 • Published Apr 13 • 69
microsoft/GELab-Zero-4B-preview-Sico-Evolution • Image-Text-to-Text • 4B • Updated 3 days ago • 221 • 33
FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization • Paper • 2603.19835 • Published Mar 20 • 353
OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources • Paper • 2605.29250 • Published May 28 • 79
R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth? • Paper • 2510.08189 • Published Oct 9, 2025 • 29
LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models • Paper • 2510.15227 • Published Oct 17, 2025 • 4
VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions • Paper • 2605.27141 • Published May 26 • 20