LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction Collection • 3 items • Updated 6 days ago • 6
BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation Paper • 2604.09497 • Published Apr 10 • 29
When Does Reasoning Matter? A Controlled Study of Reasoning's Contribution to Model Performance Paper • 2509.22193 • Published Sep 26, 2025 • 38