LAPVQA — VQA (Frozen Off-the-shelf Encoders)

Description

Lightweight task heads for Visual Question Answering on MIMIC-Diff-VQA, trained on top of five frozen off-the-shelf vision encoders. Each `.pt` file contains only the task-head weights; the matching encoder must be loaded separately.

Architecture — `VQAHead`

vis_proj   : Linear(vis_dim → 512)
tok_emb    : Embedding(50257, 512)   # GPT-2 vocab, weight-tied with lm_head
pos_emb    : Embedding(150, 512)
decoder    : 6 × TransformerDecoderLayer (pre-norm, cross-attn to visual tokens)
lm_head    : Linear(512 → 50257, bias=False)
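
To make the layout above concrete, here is a minimal PyTorch sketch of a head with the same five components. It is an illustration, not the released `VQAHead`: the class name `VQAHeadSketch`, the head count (`nhead=8`), the feed-forward width (`dim_feedforward=2048`), the bias on `vis_proj`, and the way token and position embeddings are combined are all assumptions that the listing does not specify.

```python
import torch
import torch.nn as nn


class VQAHeadSketch(nn.Module):
    """Illustrative stand-in for VQAHead; NOT the released implementation."""

    def __init__(self, vis_dim, d_model=512, vocab_size=50257, max_len=150,
                 num_layers=6, nhead=8, dim_feedforward=2048):
        # nhead and dim_feedforward are assumed values, not taken from the card.
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model, nhead, dim_feedforward,
            batch_first=True,   # tensors are [batch, seq, dim]
            norm_first=True,    # pre-norm, as stated in the listing
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: the output projection shares the embedding matrix.
        self.lm_head.weight = self.tok_emb.weight

    def forward(self, vis_tokens, input_ids):
        # vis_tokens: [B, N, vis_dim], input_ids: [B, T] with T <= max_len
        memory = self.vis_proj(vis_tokens)                  # [B, N, d_model]
        T = input_ids.size(1)
        pos = torch.arange(T, device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(pos)     # [B, T, d_model]
        # Boolean causal mask: True marks positions a token may NOT attend to.
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=input_ids.device),
            diagonal=1,
        )
        x = self.decoder(x, memory, tgt_mask=causal)        # cross-attends to memory
        return self.lm_head(x)                              # [B, T, vocab_size]
```

Because `lm_head` is tied to `tok_emb`, only `vis_proj` changes size between encoders, which is why `vis_dim` is the one argument that has to match the chosen encoder.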

| File | Encoder | vis_dim |
|---|---|---|
| `clip-vit-l14_best.pt` | CLIP ViT-L/14 | 1024 |
| `siglip_best.pt` | SigLIP ViT-SO400M-14-384 | 1152 |
| `florence2_best.pt` | Florence-2 | 1024 |
| `coca_best.pt` | CoCa | 768 |
| `owlv2_best.pt` | OWLv2 | 1024 |
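
Since `vis_dim` has to match the checkpoint, one option is to keep the table as a small lookup and derive the value from the file name. The sketch below is only a convenience; the names `CHECKPOINTS` and `vis_dim_for` are hypothetical and not part of the `lapvqa` package.

```python
from pathlib import PurePath

# Checkpoint file -> (encoder name, vis_dim), copied from the table above.
CHECKPOINTS = {
    "clip-vit-l14_best.pt": ("CLIP ViT-L/14", 1024),
    "siglip_best.pt": ("SigLIP ViT-SO400M-14-384", 1152),
    "florence2_best.pt": ("Florence-2", 1024),
    "coca_best.pt": ("CoCa", 768),
    "owlv2_best.pt": ("OWLv2", 1024),
}


def vis_dim_for(checkpoint_path: str) -> int:
    """Return the vis_dim listed for a checkpoint, ignoring any directory prefix."""
    name = PurePath(checkpoint_path).name
    if name not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint file: {name}")
    return CHECKPOINTS[name][1]
```

With this helper, `VQAHead(vis_dim=vis_dim_for(path))` avoids hard-coding the dimension next to the file name.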

Results (test set, overall)

| Encoder | BLEU-1 | BLEU-4 | ROUGE-L | RadGraph-s |
|---|---|---|---|---|
| CLIP ViT-L/14 | 0.602 | 0.243 | 0.725 | 0.222 |
| SigLIP | 0.586 | 0.253 | 0.717 | 0.214 |
| Florence-2 | 0.575 | 0.207 | 0.700 | 0.217 |
| CoCa | 0.532 | 0.173 | 0.642 | 0.170 |
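
For a quick comparison, the four listed rows can be loaded into a dictionary and the leading encoder picked per metric. This is a small sketch over the numbers in the table only; `RESULTS` and `best_per_metric` are hypothetical names, and higher is assumed to be better for all four metrics.

```python
# Test-set scores copied from the table above (four encoders are listed).
RESULTS = {
    "CLIP ViT-L/14": {"BLEU-1": 0.602, "BLEU-4": 0.243, "ROUGE-L": 0.725, "RadGraph-s": 0.222},
    "SigLIP":        {"BLEU-1": 0.586, "BLEU-4": 0.253, "ROUGE-L": 0.717, "RadGraph-s": 0.214},
    "Florence-2":    {"BLEU-1": 0.575, "BLEU-4": 0.207, "ROUGE-L": 0.700, "RadGraph-s": 0.217},
    "CoCa":          {"BLEU-1": 0.532, "BLEU-4": 0.173, "ROUGE-L": 0.642, "RadGraph-s": 0.170},
}


def best_per_metric(results):
    """Map each metric to the encoder with the highest score on it."""
    metrics = next(iter(results.values())).keys()
    return {m: max(results, key=lambda enc: results[enc][m]) for m in metrics}


for metric, encoder in best_per_metric(RESULTS).items():
    print(f"{metric}: {encoder} ({RESULTS[encoder][metric]:.3f})")
```

On these numbers CLIP ViT-L/14 leads on BLEU-1, ROUGE-L, and RadGraph-s, while SigLIP leads on BLEU-4.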

Loading

import torch
import tiktoken
from lapvqa.vqa.model import VQAHead

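# The checkpoint is a plain state dict, so weights_only=True is sufficient
# and avoids unpickling arbitrary objects.
ckpt = torch.load("clip-vit-l14_best.pt", map_location="cpu", weights_only=True)
head = VQAHead(vis_dim=1024)  # vis_dim must match the encoder (1024 for CLIP ViT-L/14)
head.load_state_dict(ckpt)
head.eval()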

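# Inputs the caller must supply (they are not created in this snippet):
#   vis_tokens: [B, N, vis_dim]  patch tokens from the frozen encoder
#   prompt_ids: [B, Q]           tokenised question (GPT-2 tokeniser)
enc = tiktoken.get_encoding("gpt2")
# GPT-2 has a single special token, <|endoftext|>, used here as both BOS and EOS.
bos_id = eos_id = enc.eot_token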

answers = head.generate(
    vis_tokens  = vis_tokens,
    prompt_ids  = prompt_ids,
    bos_id      = bos_id,
    eos_id      = eos_id,
    max_new_tokens = 64,
)
decoded = [enc.decode(ids) for ids in answers]

