LAPVQA — Phrase Grounding

Description

This repository provides TransVG-style phrase grounding heads trained on MIMIC-CXR. Given a chest X-ray and a text phrase, each head predicts the bounding box of the described abnormality. Each checkpoint is a dict with the keys {state_dict, vis_dim, txt_dim, d_model, num_layers, encoder, epoch, val_miou, val_acc50}.
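Because each checkpoint is a plain dict, a quick key check before building a head can catch a truncated or mismatched file. The sketch below is only an illustration: the helper name `missing_checkpoint_keys` is hypothetical and not part of the `lapvqa` package, and the key list is copied from the description above.

```python
# Keys listed in the checkpoint description above.
EXPECTED_KEYS = (
    "state_dict", "vis_dim", "txt_dim", "d_model", "num_layers",
    "encoder", "epoch", "val_miou", "val_acc50",
)


def missing_checkpoint_keys(ckpt):
    """Return the expected keys that are absent from a loaded checkpoint dict.

    Hypothetical helper for illustration; not part of the lapvqa package.
    """
    return [key for key in EXPECTED_KEYS if key not in ckpt]


# Stand-in dict for demonstration; a real one would come from torch.load(...).
example = {"state_dict": {}, "vis_dim": 1024, "txt_dim": 768}
print(missing_checkpoint_keys(example))
```

An empty list means every documented key is present.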

Architecture — `VisualGroundingHead`

vis_proj   : Linear(vis_dim → 256)
txt_proj   : Linear(txt_dim → 256)
reg_token  : Parameter [1, 1, 256]
sequence   : [REG | vis_tokens | txt_token]
transformer: 3 × TransformerEncoderLayer (self-attn, pre-norm)
box_head   : MLP(256 → 256 → 4)   # sigmoid → (cx,cy,w,h) ∈ [0,1]
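The layout above can be sketched in PyTorch roughly as follows. This is an illustrative reconstruction, not the `lapvqa` implementation: the class name `GroundingHeadSketch`, the number of attention heads, the feed-forward width, the ReLU in the box MLP, the zero-initialised REG token, and the absence of positional embeddings are all assumptions. Its parameter names may not match the released `state_dict`, so it should not be expected to load the checkpoints.

```python
import torch
import torch.nn as nn


class GroundingHeadSketch(nn.Module):
    """Illustrative reconstruction of the layout above; NOT the lapvqa code."""

    def __init__(self, vis_dim, txt_dim, d_model=256, num_layers=3, nhead=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,                  # assumption: head count is not documented
            dim_feedforward=4 * d_model,  # assumption: FFN width is not documented
            batch_first=True,
            norm_first=True,              # pre-norm, as stated in the layout
        )
        self.transformer = nn.TransformerEncoder(
            layer, num_layers=num_layers, enable_nested_tensor=False
        )
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 4)
        )

    def forward(self, vis_tokens, txt_vec):
        batch = vis_tokens.size(0)
        reg = self.reg_token.expand(batch, -1, -1)   # [B, 1, d_model]
        vis = self.vis_proj(vis_tokens)              # [B, HW, d_model]
        txt = self.txt_proj(txt_vec).unsqueeze(1)    # [B, 1, d_model]
        seq = torch.cat([reg, vis, txt], dim=1)      # [REG | vis_tokens | txt_token]
        out = self.transformer(seq)
        # Read the box from the REG position and squash to [0, 1].
        return torch.sigmoid(self.box_head(out[:, 0]))


head = GroundingHeadSketch(vis_dim=1024, txt_dim=768).eval()
with torch.no_grad():
    boxes = head(torch.randn(2, 196, 1024), torch.randn(2, 768))
print(boxes.shape)  # torch.Size([2, 4])
```

Reading the prediction from a dedicated REG token, rather than pooling over patches, lets self-attention decide which visual and text tokens contribute to the box.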

Results (MIMIC-CXR test set)

Zero-shot: mIoU ≈ 0.082–0.089 across all encoders.

Fine-tuned (MAE-ViT-L/16): mIoU 0.320, Acc@0.25 0.569, Pointing Acc 0.593.
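The three reported metrics could be computed along the following lines. This is a sketch under stated assumptions, not the evaluation script used for the numbers above: it assumes boxes in normalized (cx, cy, w, h) form, counts a prediction as correct for Acc@0.25 when IoU ≥ 0.25, and treats a pointing hit as the predicted box center falling inside the ground-truth box. All function names are hypothetical.

```python
def cxcywh_to_xyxy(box):
    """Convert (cx, cy, w, h) to corner form (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(pred, gt):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    px1, py1, px2, py2 = cxcywh_to_xyxy(pred)
    gx1, gy1, gx2, gy2 = cxcywh_to_xyxy(gt)
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0


def pointing_hit(pred, gt):
    """Assumed definition: the predicted center lies inside the ground-truth box."""
    x1, y1, x2, y2 = cxcywh_to_xyxy(gt)
    return x1 <= pred[0] <= x2 and y1 <= pred[1] <= y2


def summarize(preds, gts, thresh=0.25):
    """Return mIoU, Acc@thresh, and pointing accuracy over paired boxes."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    hits = [pointing_hit(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    return {
        "miou": sum(ious) / n,
        "acc": sum(v >= thresh for v in ious) / n,
        "pointing_acc": sum(hits) / n,
    }


# Toy data: one perfect prediction and one complete miss.
preds = [(0.5, 0.5, 0.4, 0.4), (0.2, 0.2, 0.2, 0.2)]
gts = [(0.5, 0.5, 0.4, 0.4), (0.8, 0.8, 0.2, 0.2)]
metrics = summarize(preds, gts)
print(metrics)
```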

File	Encoder	vis_dim	txt_dim
`clip-vit-l14.pt`	CLIP ViT-L/14	1024	768
`siglip.pt`	SigLIP	1152	1152
`florence2.pt`	Florence-2	1024	768
`coca.pt`	CoCa	768	768
`owlv2.pt`	OWLv2	1024	768
`mae-vit-l16.pt`	MAE ViT-L/16	1024	768

Loading

import torch
from lapvqa.pg.heads import VisualGroundingHead

ckpt = torch.load("mae-vit-l16.pt", map_location="cpu")
head = VisualGroundingHead(
    vis_dim    = ckpt["vis_dim"],
    txt_dim    = ckpt["txt_dim"],
    d_model    = ckpt["d_model"],
    num_layers = ckpt["num_layers"],
)
head.load_state_dict(ckpt["state_dict"])
head.eval()

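# Placeholder inputs for illustration only; in practice both come from the frozen encoder.
# The batch size (1) and token count (196) are arbitrary.
# vis_tokens: [B, HW, vis_dim], spatial patch tokens
# txt_vec:    [B, txt_dim], pooled text representation
vis_tokens = torch.randn(1, 196, ckpt["vis_dim"])
txt_vec = torch.randn(1, ckpt["txt_dim"])

with torch.no_grad():
    pred_boxes = head(vis_tokens, txt_vec)  # [B, 4] (cx,cy,w,h) in [0,1]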
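The head returns normalized (cx, cy, w, h) boxes, so drawing or cropping usually needs pixel corner coordinates. A minimal conversion sketch follows; the helper name `boxes_to_pixels` is hypothetical, and it assumes the boxes are normalized against the same image whose width and height are passed in. If the image was resized or padded before encoding, that transform would need to be undone separately.

```python
import torch


def boxes_to_pixels(boxes, width, height):
    """Map [B, 4] normalized (cx, cy, w, h) to [B, 4] pixel (x1, y1, x2, y2).

    Hypothetical helper for illustration; results are clamped to the image bounds.
    """
    cx, cy, w, h = boxes.unbind(dim=-1)
    corners = torch.stack(
        [
            (cx - w / 2) * width,
            (cy - h / 2) * height,
            (cx + w / 2) * width,
            (cy + h / 2) * height,
        ],
        dim=-1,
    )
    upper = corners.new_tensor([width, height, width, height])
    return torch.minimum(corners.clamp(min=0), upper)


# A box centered in a 1024x512 image, half as wide and a quarter as tall.
demo = torch.tensor([[0.5, 0.5, 0.5, 0.25]])
pixel_boxes = boxes_to_pixels(demo, width=1024, height=512)
print(pixel_boxes)  # tensor([[256., 192., 768., 320.]])
```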
