Abstract
VideoRAG systems are extended to handle long egocentric videos with multi-modal retrieval across temporal granularities, addressing limitations in existing benchmarks and methods through a new benchmark and chunk-adaptive reranking approach.
Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps: existing benchmarks allow queries to be answered without the video, obscuring retrieval errors, and prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both by introducing V-RAGBench, a benchmark of ⟨query, evidence chunk, answer⟩ triplets that enables faithful, decoupled evaluation of retrieval and generation, and CARVE, a simple method that runs parallel retrievers across configurations and employs chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under its winning configuration selected during retrieval, yielding an interleaved evidence form where the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, with the chunks supplied to the generator interleaving multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.
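The chunk-level selection described in the abstract can be illustrated with a small Python sketch. This is not the paper's implementation: the names `Candidate`, `toy_score`, and `chunk_adaptive_rerank` are hypothetical, the configuration labels are made up, the text field stands in for real multi-modal content, and the token-overlap score is only a placeholder for whatever reranker CARVE actually uses. The candidates are also scored in a plain loop here rather than by retrievers running in parallel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    chunk_id: int
    config: str  # hypothetical modality/granularity label, e.g. "transcript"
    text: str    # stand-in for the chunk's content under this configuration


def toy_score(query: str, candidate: Candidate) -> float:
    """Placeholder relevance: fraction of query tokens present in the candidate."""
    q = set(query.lower().split())
    c = set(candidate.text.lower().split())
    return len(q & c) / len(q) if q else 0.0


def chunk_adaptive_rerank(query, candidates, top_k=2):
    """For each chunk keep its best-scoring configuration, then rank the chunks."""
    best = {}
    for cand in candidates:
        score = toy_score(query, cand)
        if cand.chunk_id not in best or score > best[cand.chunk_id][1]:
            best[cand.chunk_id] = (cand, score)
    ranked = sorted(best.values(), key=lambda pair: pair[1], reverse=True)
    return [(c.chunk_id, c.config, s) for c, s in ranked[:top_k]]


# Assumed toy data: each chunk is available under two configurations.
candidates = [
    Candidate(0, "transcript", "she says the keys are on the kitchen table"),
    Candidate(0, "frames", "person standing in kitchen"),
    Candidate(1, "transcript", "background music"),
    Candidate(1, "frames", "keys lying on a red table near the door"),
    Candidate(2, "transcript", "talking about the weather"),
    Candidate(2, "frames", "a park with trees"),
]

selected = chunk_adaptive_rerank("where are the keys on the table", candidates)
for chunk_id, config, score in selected:
    print(chunk_id, config, round(score, 2))
```

In this toy run, chunk 0 is kept under its transcript configuration and chunk 1 under its frames configuration, so the evidence passed downstream mixes configurations. A query-level method that fixes one configuration for all chunks could not produce that mix.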
Community
This is an interesting take on VideoRAG. I really like the point about how existing benchmarks often let models answer queries without even needing the video footage, which definitely makes it hard to tell if the retrieval is actually working.
I'm curious how CARVE handles the computational overhead of running those parallel retrievers for every chunk. Does this approach significantly slow down inference compared to the baselines?
I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/bf84ea63-24c4-40f0-a075-43f25157dfdc
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering (2026)
- Bridging Modalities, Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning (2026)
- TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation (2026)
- Retrieval from Within: An Intrinsic Capability of Attention-Based Models (2026)
- MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering (2026)
- DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding (2026)
- MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A (2026)
Get this paper in your agent:
hf papers read 2606.13141
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 0
Datasets citing this paper: 1
Spaces citing this paper: 0