LAPVQA — Differential VQA (Frozen Off-the-shelf Encoders)

Description

Task heads for Differential VQA: given a prior (reference) and a current chest X-ray, answer questions about the radiological changes between them. One head was trained on MIMIC-Diff-VQA for each of five frozen encoders. Each .pt file is a plain state dict for `DiffVQAHead`.

Architecture — `DiffVQAHead`

vis_proj   : Linear(vis_dim → 512)   # shared for both images
frame_emb  : Embedding(2, 512)       # 0=reference, 1=current
memory     : [ref_proj + frame_emb(0) ; curr_proj + frame_emb(1)]  → [B, 2N, 512]
tok_emb    : Embedding(50257, 512)
pos_emb    : Embedding(200, 512)
decoder    : 6 × TransformerDecoderLayer (pre-norm)
lm_head    : Linear(512 → 50257, bias=False)
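
The shapes implied by the layer list can be checked with a little bookkeeping. The sketch below is pure Python and does not touch the real `DiffVQAHead`. The helper names `memory_shape` and `fits_position_table` are hypothetical, the patch count of 257 is only an illustrative value, and the position check assumes that the question tokens, one BOS token, and the generated tokens all share the 200-row `pos_emb` table, which this card does not state.

```python
D_MODEL = 512        # hidden width used by every layer in the list above
MAX_POSITIONS = 200  # number of rows in pos_emb


def memory_shape(batch_size, n_patches, d_model=D_MODEL):
    """Shape of the decoder memory.

    Reference and current patch tokens are projected to d_model and
    concatenated along the sequence axis, giving 2 * n_patches tokens.
    """
    return (batch_size, 2 * n_patches, d_model)


def fits_position_table(prompt_len, max_new_tokens, n_special=1,
                        max_positions=MAX_POSITIONS):
    """Check that the text sequence stays within pos_emb.

    Assumption (not confirmed by the card): prompt tokens, n_special
    extra tokens such as BOS, and generated tokens share one position table.
    """
    return prompt_len + n_special + max_new_tokens <= max_positions


print(memory_shape(4, 257))          # (4, 514, 512)
print(fits_position_table(71, 128))  # True: 71 + 1 + 128 = 200
print(fits_position_table(72, 128))  # False: 72 + 1 + 128 = 201
```

If that assumption holds, a question longer than about 71 tokens would need a smaller `max_new_tokens` than the 128 used in the loading example.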

| File | Encoder | vis_dim |
|---|---|---|
| `clip-vit-l14_best.pt` | CLIP ViT-L/14 | 1024 |
| `coca_best.pt` | CoCa | 768 |
| `florence2_best.pt` | Florence-2 | 1024 |
| `siglip_best.pt` | SigLIP | 1152 |
| `owlv2_best.pt` | OWLv2 | 1024 |
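
Because `vis_dim` has to match the checkpoint, it can help to keep the file-to-width mapping in one place. The following is a minimal sketch built only from the values listed above; the names `VIS_DIM_BY_CHECKPOINT` and `vis_dim_for` are hypothetical and not part of the `lapvqa` package.

```python
from pathlib import PurePath

# Checkpoint file name -> encoder feature width, copied from the table above.
VIS_DIM_BY_CHECKPOINT = {
    "clip-vit-l14_best.pt": 1024,
    "coca_best.pt": 768,
    "florence2_best.pt": 1024,
    "siglip_best.pt": 1152,
    "owlv2_best.pt": 1024,
}


def vis_dim_for(checkpoint_path):
    """Return the vis_dim for a checkpoint path, looked up by file name."""
    name = PurePath(checkpoint_path).name
    if name not in VIS_DIM_BY_CHECKPOINT:
        raise ValueError(f"unknown checkpoint: {name}")
    return VIS_DIM_BY_CHECKPOINT[name]


print(vis_dim_for("weights/coca_best.pt"))  # 768
```

The returned value could then be passed as `DiffVQAHead(vis_dim=...)` instead of hard-coding the width per encoder.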

Results (test set)

| Encoder | BLEU-1 | BLEU-4 | ROUGE-1 | RadGraph-s |
|---|---|---|---|---|
| CLIP ViT-L/14 | 0.184 | 0.128 | 0.336 | 0.322 |
| CoCa | 0.196 | 0.138 | 0.320 | 0.317 |
| Florence-2 | 0.191 | 0.138 | 0.319 | 0.318 |
| SigLIP | 0.186 | 0.131 | 0.322 | 0.313 |

Loading

import torch
import tiktoken
from lapvqa.diffvqa.model import DiffVQAHead

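# Each checkpoint is a plain state dict, so weights_only=True is sufficient.
ckpt = torch.load("coca_best.pt", map_location="cpu", weights_only=True)
head = DiffVQAHead(vis_dim=768)   # must match the encoder's vis_dim (768 for CoCa)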
head.load_state_dict(ckpt)
head.eval()

enc = tiktoken.get_encoding("gpt2")
bos_id = eos_id = enc.eot_token

# curr_vis, ref_vis: [B, N, vis_dim], patch tokens from the frozen encoder
# question_ids:      [B, Q], GPT-2 token ids of the question
answers = head.generate(
    curr_vis    = curr_vis,
    ref_vis     = ref_vis,
    prompt_ids  = question_ids,   # [B, Q]
    bos_id      = bos_id,
    eos_id      = eos_id,
    max_new_tokens = 128,
)
decoded = [enc.decode(ids) for ids in answers]
