arxiv:2606.06302

Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving

Published on Jun 15 · Submitted by Minsoo Kim on Jun 16

Abstract

Multi-turn large language model serving is memory-constrained because the key-value cache keeps growing, but a structured approach to non-uniform compression enables significant throughput improvements through static budget allocation and optimized memory management.

Multi-turn LLM serving accumulates dialogue history whose Key-Value (KV) cache grows with every turn and every user, quickly exceeding the model weights themselves and making memory, not compute, the binding constraint on throughput. Non-uniform KV compression, which allocates heterogeneous budgets across attention heads, preserves accuracy far better than uniform schemes, yet remains impractical: modern serving stacks assume identical KV lengths across heads, so heterogeneity traps freed memory as page fragmentation, spends up to 25% of prefill time reclaiming scattered pages, and skews GPU workloads that inflate decode latency by up to 1.7× or burn 15-20% of each decode step on re-planning. We observe that this heterogeneity need not be discovered at runtime: head-wise retention follows a two-level structural regularity (an input-invariant head ranking with narrowly bounded per-head ratios) that can be calibrated offline from as few as 50 samples. Building on this insight, we present Tangram, a serving framework that statically resolves what prior systems handle dynamically: Budget Reservation fixes each head's post-compression footprint at scheduling time, eliminating page reclamation; Ragged Paging clusters similar-budget heads into independent page tables, turning fragmentation into reclaimable memory; and Ahead-of-Time Load Balancing precomputes balanced GPU partitions with zero runtime planning. Implemented on vLLM, Tangram serves as a drop-in substrate for existing non-uniform compression methods, matching their accuracy while improving end-to-end throughput by up to 2.6× over the full-KV baseline. Our implementation is publicly available at https://github.com/aiha-lab/TANGRAM.
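
To make the Ragged Paging idea concrete, here is a minimal sketch of why grouping similar-budget heads can free pages. It is not Tangram's implementation: the function names, the equal-size grouping rule, and the per-head budgets are all hypothetical, and it assumes a single shared page table effectively pads every head to the longest head's KV length.

```python
import math

def pages_needed(tokens: int, page_size: int) -> int:
    """Number of fixed-size pages required to hold `tokens` KV entries."""
    return math.ceil(tokens / page_size)

def single_table_pages(budgets: list[int], page_size: int) -> int:
    """Assumption: one shared page table, so every head is padded
    to the length of the longest head."""
    return len(budgets) * pages_needed(max(budgets), page_size)

def cluster_heads(budgets: list[int], num_groups: int) -> list[list[int]]:
    """Hypothetical clustering rule: sort heads by budget and cut the
    sorted order into contiguous groups of (almost) equal size."""
    order = sorted(range(len(budgets)), key=lambda h: budgets[h])
    size = math.ceil(len(order) / num_groups)
    return [order[i:i + size] for i in range(0, len(order), size)]

def ragged_pages(budgets: list[int], page_size: int, num_groups: int) -> int:
    """Each group gets its own page table, so heads are only padded
    to the longest head inside their own group."""
    total = 0
    for group in cluster_heads(budgets, num_groups):
        longest = max(budgets[h] for h in group)
        total += len(group) * pages_needed(longest, page_size)
    return total

# Illustrative (made-up) per-head token budgets after compression.
budgets = [128, 160, 512, 2048, 96, 1024, 256, 4096]
shared = single_table_pages(budgets, page_size=16)
ragged = ragged_pages(budgets, page_size=16, num_groups=4)
print(f"shared table: {shared} pages, ragged tables: {ragged} pages")
```

With these made-up budgets, the shared table pads all eight heads to 4096 tokens, while four independent tables pad each pair of heads only to the larger budget in that pair. The gap between the two page counts is the memory that a single-table layout would strand as fragmentation.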

Community

Paper author Paper submitter

Modern serving stacks assume every head holds an identical KV length, so non-uniform compression has stayed a paper-only idea. Non-uniform KV cache compression preserves accuracy far better than uniform schemes in multi-turn scenarios: it gives the heads that actually carry long-range information the budget they need.

  • ✨ Tangram makes it practical for the first time: non-uniform KV cache compression running inside a real serving system (and uniform schemes work just as well).
  • 🔧 Built on vLLM as a drop-in substrate, Tangram supports a wide range of existing KV cache compression algorithms, non-uniform and uniform alike.
  • 📊 And we don't stop at accuracy: we validate real, measured end-to-end throughput gains, up to 2.6× over the full-KV baseline.

Neat paper. Dealing with memory bottlenecks for multi-turn serving is always a headache, and the idea of fixing head-wise retention offline to dodge the fragmentation mess seems like a smart way to make non-uniform compression actually viable in production.

Since you are statically resolving these budgets ahead of time, how robust is the performance if a specific prompt deviates significantly from the 50 calibration samples you used to set the ratios?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/6b552aaf-cc72-4776-a008-68e5696c8500
