LAPVQA β€” Differential VQA (Frozen Off-the-shelf Encoders)

Part of the LAPVQA collection.

Description

Task heads for Differential VQA: given a prior and a current chest X-ray, answer questions about radiological changes. Trained on MIMIC-Diff-VQA with five frozen encoders. Each .pt file is a plain state dict of DiffVQAHead.

Architecture β€” DiffVQAHead

vis_proj   : Linear(vis_dim β†’ 512)   # shared for both images
frame_emb  : Embedding(2, 512)       # 0=reference, 1=current
memory     : [ref_proj + frame_emb(0) ; curr_proj + frame_emb(1)]  β†’ [B, 2N, 512]
tok_emb    : Embedding(50257, 512)
pos_emb    : Embedding(200, 512)
decoder    : 6 Γ— TransformerDecoderLayer (pre-norm)
lm_head    : Linear(512 β†’ 50257, bias=False)
File Encoder vis_dim
clip-vit-l14_best.pt CLIP ViT-L/14 1024
coca_best.pt CoCa 768
florence2_best.pt Florence-2 1024
siglip_best.pt SigLIP 1152
owlv2_best.pt OWLv2 1024

Results (test set)

Encoder BLEU-1 BLEU-4 ROUGE-1 RadGraph-s
CLIP ViT-L/14 0.184 0.128 0.336 0.322
CoCa 0.196 0.138 0.320 0.317
Florence-2 0.191 0.138 0.319 0.318
SigLIP 0.186 0.131 0.322 0.313

Loading

import torch
import tiktoken
from lapvqa.diffvqa.model import DiffVQAHead

ckpt = torch.load("coca_best.pt", map_location="cpu")
head = DiffVQAHead(vis_dim=768)   # adjust vis_dim per encoder
head.load_state_dict(ckpt)
head.eval()

enc = tiktoken.get_encoding("gpt2")
bos_id = eos_id = enc.eot_token

# curr_vis, ref_vis: [B, N, vis_dim] β€” patch tokens from the frozen encoder
answers = head.generate(
    curr_vis    = curr_vis,
    ref_vis     = ref_vis,
    prompt_ids  = question_ids,   # [B, Q]
    bos_id      = bos_id,
    eos_id      = eos_id,
    max_new_tokens = 128,
)
decoded = [enc.decode(ids) for ids in answers]
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Collection including dmusingu/lapvqa-diffvqa