SparseVLM

Training-free visual token pruning for Qwen2.5-VL. It scores visual tokens by how strongly the text tokens attend to them, prunes the low-scoring ones from the KV cache, and decodes with the smaller cache.

Based on SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference (ICML 2025).

Install

pip install sparsevlm

Requirements: Python 3.10+, PyTorch 2.1+, transformers 4.49+

Quick start

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from sparsevlm import sparsevlm_generate
from PIL import Image

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

image = Image.open("your_image.jpg")
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text",  "text": "Describe this image in detail."}
]}]
text   = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")

# count visual tokens for the first image: t * h * w patches, merged 2x2 into one token each
n_vis = int((inputs["image_grid_thw"][0].prod() / 4).item())

output = sparsevlm_generate(
    model, processor, inputs,
    n_vis=n_vis,
    keep_n_vis=n_vis // 4,   # keep 25% of visual tokens
    max_new_tokens=256,
)
print(processor.decode(output[0][1:], skip_special_tokens=True))
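
The `n_vis` line above counts tokens for the first image only. Below is a minimal sketch of a helper that sums over every row of `image_grid_thw`, assuming each row holds the patch grid (t, h, w) and that 2x2 patches are merged into one visual token, which is where the division by 4 comes from. The name `count_visual_tokens` is hypothetical and not part of `sparsevlm`.

```python
from math import prod

def count_visual_tokens(grid_thw, merge_size=2):
    """Count visual tokens given (t, h, w) patch grids, one row per image.

    Hypothetical helper, not part of sparsevlm. Assumes merge_size x merge_size
    patches are merged into a single visual token.
    """
    return sum(prod(int(v) for v in row) // (merge_size ** 2) for row in grid_thw)

# An 896x504 image with 14-pixel patches gives a 36 x 64 patch grid.
print(count_visual_tokens([(1, 36, 64)]))  # 576
```

With real inputs this could be called as `count_visual_tokens(inputs["image_grid_thw"].tolist())`. Whether `sparsevlm_generate` accepts more than one image per prompt is not stated here, so treat the multi-image sum as illustrative.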

Benchmark results

Measured on NVIDIA A100-SXM4-40GB, Qwen2.5-VL-7B-Instruct, bfloat16, SDPA attention.

Real photo — Fuji mountain + Milky Way (4928×2773px, 16320 visual tokens)

Config	Tokens kept	Time	Speedup	Output quality
Baseline	16320 (100%)	9738ms	1.00×	Identifies Fuji, Milky Way, snow cap, star colors
SparseVLM 50%	8192	9441ms	1.03×	Same quality
SparseVLM 25%	4080	9297ms	1.05×	All key details preserved
SparseVLM 10%	1632	9425ms	1.03×	Still correctly describes scene

Key result: the full 4K image (16,320 visual tokens) runs without OOM. Without SparseVLM's hook-based scoring, scoring this image requires materialising the full 16320 × 16320 attention matrix (about 15GB) and crashes. The scorer instead computes only the text→visual submatrix (35 × 16320, about 32MB instead of 15GB).
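
The two sizes quoted above can be reproduced with a few lines of arithmetic. This sketch assumes 28 attention heads (taken from the Qwen2.5-VL-7B config, not stated in this README), 2 bytes per bfloat16 element, and a single layer. The 35 text tokens are the figure given above.

```python
N_VIS = 16320        # visual tokens (from the benchmark above)
N_TEXT = 35          # text tokens (from the benchmark above)
N_HEADS = 28         # assumption: attention heads in Qwen2.5-VL-7B
BYTES_PER_ELEM = 2   # bfloat16

def attn_bytes(n_queries, n_keys):
    """Bytes needed to hold one layer's attention scores for all heads."""
    return n_queries * n_keys * N_HEADS * BYTES_PER_ELEM

full = attn_bytes(N_VIS, N_VIS)    # full visual x visual map
sub = attn_bytes(N_TEXT, N_VIS)    # text -> visual submatrix only

print(f"full map:  {full / 1e9:.1f} GB")   # 14.9 GB
print(f"submatrix: {sub / 1e6:.1f} MB")    # 32.0 MB
```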

Resized photo (896×504px, 576 visual tokens), batch=1

Tokens kept	Time	Speedup
576 (100%)	2167ms	1.00×
288 (50%)	1685ms	1.29×
144 (25%)	1565ms	1.39×
72 (12.5%)	1620ms	1.34×

When to expect larger speedup

Each decode step reads both the model weights and the KV cache, so speedup grows when the KV cache is large relative to the model weights:

Scenario	Expected speedup
Single image, short generation	~1.1–1.4×
Single image, 256+ output tokens	~1.5–2.5×
Batch=32, high-res images	~2–4×
Very long visual context (10K+ tokens)	~2–4×
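
To see why the single-image case gains little, it helps to estimate the KV cache size. This is a rough sketch assuming the Qwen2.5-VL-7B language model has 28 layers, 4 key/value heads, and a head dimension of 128 in bfloat16. These values are assumptions taken from the model config, not from this README.

```python
N_LAYERS = 28      # assumption: decoder layers
N_KV_HEADS = 4     # assumption: grouped-query KV heads
HEAD_DIM = 128     # assumption: per-head dimension
BYTES = 2          # bfloat16

def kv_cache_bytes(n_tokens, batch=1):
    """Bytes for keys + values across all layers."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES * n_tokens * batch

for label, n_tokens, batch in [
    ("16320 tokens, batch=1", 16320, 1),
    ("4080 tokens, batch=1", 4080, 1),
    ("16320 tokens, batch=32", 16320, 32),
]:
    print(f"{label}: {kv_cache_bytes(n_tokens, batch) / 1e9:.2f} GB")
```

Under these assumptions a single image's cache stays below 1 GB, which is small next to the bfloat16 weights of a 7B-class model (on the order of 15 GB), while a batch of 32 such images approaches 30 GB. That is the regime where pruning has more room to help.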

How it works

Token scoring (no extra parameters)

At decoder layer 2, a lightweight hook intercepts the attention projection and computes:

A_tv = Q_text @ K_visual^T   # only the text→visual submatrix
                              # 35 × 16320 instead of 16320 × 16320
score_i = sum over text tokens of attention to visual token i

Visual tokens with high scores are important to the text query. Low-score tokens are pruned from the KV cache before decoding starts.
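
The scoring rule above can be sketched in NumPy for a single attention head. This is an illustration, not the package's implementation: the function names are hypothetical, and applying the softmax over visual keys only is an assumption, since the README does not say how the scores are normalised.

```python
import numpy as np

def score_visual_tokens(q_text, k_visual):
    """Score each visual token by the total attention it receives from text.

    q_text:   [n_text, d] query vectors of the text tokens (single head)
    k_visual: [n_vis, d]  key vectors of the visual tokens (single head)
    returns:  [n_vis]     one score per visual token
    """
    d = q_text.shape[-1]
    logits = q_text @ k_visual.T / np.sqrt(d)              # [n_text, n_vis]
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)         # softmax over visual keys (assumption)
    return attn.sum(axis=0)                                # sum over text tokens

def top_k_indices(scores, k):
    """Indices of the k highest-scoring tokens, in original sequence order."""
    return np.sort(np.argsort(-scores, kind="stable")[:k])

# Toy example: 2 text queries that point at visual tokens 1 and 4.
k_visual = np.eye(6)
q_text = 8.0 * k_visual[[1, 4]]
scores = score_visual_tokens(q_text, k_visual)
print(top_k_indices(scores, k=2))  # [1 4]
```

Only the `[n_text, n_vis]` matrix is ever formed, which is the point of scoring from the text→visual submatrix.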

KV cache pruning

After scoring, the KV cache is sliced to keep only the top-K visual entries plus all text entries. The model then decodes with a smaller cache, so each decode step attends over fewer keys.

Prefill:  build KV cache for all 16320 visual tokens
Score:    rank each visual token by text attention (32MB op)
Prune:    keep top-K, drop the rest
Decode:   attend over K + N_text keys instead of 16320 + N_text
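
The prune step can be sketched as index selection along the sequence axis of one layer's keys and values. This is a minimal NumPy illustration, assuming the visual tokens occupy one contiguous span of the sequence. The function name `prune_kv` and the `[n_heads, seq_len, head_dim]` layout are assumptions, not the package's internals.

```python
import numpy as np

def prune_kv(keys, values, vis_start, n_vis, keep_vis):
    """Keep all text positions plus the selected visual positions.

    keys, values: [n_heads, seq_len, head_dim] for one layer
    vis_start:    index of the first visual token (span assumed contiguous)
    n_vis:        number of visual tokens
    keep_vis:     indices in [0, n_vis) of the visual tokens to keep
    """
    seq_len = keys.shape[1]
    before = np.arange(0, vis_start)                    # text before the image
    kept = vis_start + np.sort(np.asarray(keep_vis))    # surviving visual tokens
    after = np.arange(vis_start + n_vis, seq_len)       # text after the image
    keep = np.concatenate([before, kept, after])
    return keys[:, keep, :], values[:, keep, :], keep

# Toy cache: 4 heads, 3 text + 8 visual + 2 text tokens, head_dim 16.
keys = np.random.rand(4, 13, 16)
values = np.random.rand(4, 13, 16)
k2, v2, keep = prune_kv(keys, values, vis_start=3, n_vis=8, keep_vis=[6, 1])
print(k2.shape, keep.tolist())  # (4, 7, 16) [0, 1, 2, 4, 9, 11, 12]
```

In a real model the same `keep` indices would have to be applied to every layer's cache so that all layers agree on which positions survive.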

Position fix (`rope_deltas`)

After pruning, Qwen2.5-VL's cached position offset (rope_deltas) is adjusted so that newly decoded tokens get correct positional embeddings despite the shorter cache.
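
A minimal sketch of the idea, assuming the position of the next decoded token is computed as the cache length plus `rope_deltas`, which is how the Hugging Face implementation appears to behave. How `sparsevlm` performs the adjustment internally is not shown in this README, and the function name and numbers below are hypothetical.

```python
def adjusted_rope_delta(rope_delta, n_vis, keep_n_vis):
    """Shift the offset by the number of pruned tokens.

    Hypothetical helper. Assumes next_position = cache_len + rope_delta,
    so shrinking the cache by n_removed needs rope_delta to grow by n_removed.
    """
    n_removed = n_vis - keep_n_vis
    return rope_delta + n_removed

# Illustrative numbers only: 16320 visual + 35 text tokens, 25% kept.
n_vis, keep_n_vis, n_text = 16320, 4080, 35
rope_delta = -16000   # made-up value for illustration

pos_full = (n_vis + n_text) + rope_delta
pos_pruned = (keep_n_vis + n_text) + adjusted_rope_delta(rope_delta, n_vis, keep_n_vis)
print(pos_full == pos_pruned)  # True
```

Under that assumption, the first decoded token lands on the same position with the pruned cache as it would have with the full one.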

API

`sparsevlm_generate`

from sparsevlm import sparsevlm_generate

output = sparsevlm_generate(
    model,                  # Qwen2_5_VLForConditionalGeneration
    processor,              # AutoProcessor
    inputs,                 # dict from processor(...)
    n_vis,                  # total visual tokens in the sequence
    keep_n_vis,             # how many to keep (e.g. n_vis // 4 for 25%)
    max_new_tokens=256,     # generation length
    target_layer=2,         # which layer to score from (default 2)
    device="cuda",          # primary device
)
# returns: token ids [B, max_new_tokens]

`apply_sparsevlm` / `remove_hooks` (hook-based API)

from sparsevlm import apply_sparsevlm, reset_n_vis, remove_hooks

state = apply_sparsevlm(model, n_vis=256)   # register the hooks on the model
reset_n_vis(state, n_vis=256)               # call before each generate
output = model.generate(...)
remove_hooks(state)                         # detach the hooks when done

Model support

Model	Status
Qwen/Qwen2.5-VL-7B-Instruct	Tested
Qwen/Qwen2.5-VL-3B-Instruct	Should work
Qwen/Qwen2.5-VL-72B-Instruct	Should work
Qwen/Qwen2-VL-*	Legacy support

Limitations

Requires attn_implementation="eager" or "sdpa". Flash Attention 2 (separate package) is not required.
Speedup is modest (~1.1–1.4×) for single-image, short-generation use cases. The gain comes from long generations, high-resolution images, or batched serving.
Currently tested with Qwen2.5-VL. Other VLM families would need architecture-specific adaptation.

Citation

@inproceedings{zhang2024sparsevlm,
  title={SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference},
  author={Zhang, Yuan and Fan, Chun-Kai and Ma, Junpeng and Zheng, Wenzhao and
          Huang, Tao and Cheng, Kuan and Gudovskiy, Denis and Okuno, Tomoyuki and
          Nakata, Yohei and Keutzer, Kurt and Zhang, Shanghang},
  booktitle={ICML},
  year={2025}
}

Apache 2.0 license.

Paper: SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference (arXiv 2410.04417, published Oct 6, 2024)