Part of the LAPVQA collection (14 items): chest X-ray models comprising pre-trained encoders and task heads for VQA, DiffVQA, RRG, detection, and grounding on MIMIC-CXR.

DETR-style detection heads for 14-class chest abnormality detection on VinDr-CXR,
trained on top of six frozen vision encoders.

Each checkpoint is a dict with the keys `state_dict`, `vis_dim`, `d_model`, `num_queries`, `num_enc`, `num_dec`, `encoder`, `epoch`, `val_map40`, and `val_map50`.
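
Before building the head, it can help to confirm that a loaded checkpoint carries the fields listed above. The following is a minimal sketch that assumes the dict has already been loaded (for example with `torch.load`). The helper name `describe_checkpoint` is hypothetical and not part of `lapvqa`, and the values in `example` are placeholders rather than real checkpoint contents.

```python
REQUIRED_KEYS = (
    "state_dict", "vis_dim", "d_model", "num_queries",
    "num_enc", "num_dec", "encoder", "epoch", "val_map40", "val_map50",
)


def describe_checkpoint(ckpt: dict) -> str:
    """Return a one-line summary, or raise KeyError if a documented field is missing."""
    missing = [k for k in REQUIRED_KEYS if k not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing fields: {missing}")
    return (
        f"{ckpt['encoder']}: vis_dim={ckpt['vis_dim']}, d_model={ckpt['d_model']}, "
        f"queries={ckpt['num_queries']}, enc/dec={ckpt['num_enc']}/{ckpt['num_dec']}, "
        f"epoch={ckpt['epoch']}"
    )


# Stand-in dict with placeholder values; a real one would come from torch.load(...).
example = {
    "state_dict": {}, "vis_dim": 1024, "d_model": 256, "num_queries": 20,
    "num_enc": 2, "num_dec": 3, "encoder": "owlv2", "epoch": 0,
    "val_map40": 0.0, "val_map50": 0.0,
}
print(describe_checkpoint(example))
```

Failing early on a missing key gives a clearer error than a `KeyError` raised halfway through constructing the head.
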

    DetectionHead
      vis_proj       : Linear(vis_dim → 256)
      encoder        : 2 × TransformerEncoderLayer (self-attn, pre-norm)
      object_queries : Parameter [1, 20, 256]
      decoder        : 3 × TransformerDecoderLayer (cross-attn to encoder output)
      class_head     : Linear(256 → 15)        # 14 classes + background
      box_head       : MLP(256 → 256 → 4)      # (cx,cy,w,h) ∈ [0,1]

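
The layout above can be sketched in plain PyTorch. This is an illustration of the listed shapes, not the `lapvqa` implementation: the class name `DetectionHeadSketch`, the head count `nhead=8`, the ReLU inside the box MLP, the sigmoid used to keep boxes in [0,1], the post-norm decoder default, and the output keys `logits` and `boxes` are all assumptions.

```python
import torch
from torch import nn


class DetectionHeadSketch(nn.Module):
    """Illustrative DETR-style head matching the shapes listed above."""

    def __init__(self, vis_dim, d_model=256, num_queries=20,
                 num_enc_layers=2, num_dec_layers=3, num_classes=14, nhead=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model, nhead, batch_first=True, norm_first=True)  # pre-norm
        self.encoder = nn.TransformerEncoder(
            enc_layer, num_enc_layers, enable_nested_tensor=False)
        # Learned queries, shared across the batch: [1, Q, d_model].
        self.object_queries = nn.Parameter(torch.randn(1, num_queries, d_model))
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_dec_layers)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for background
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 4))

    def forward(self, vis_tokens):
        memory = self.encoder(self.vis_proj(vis_tokens))       # [B, HW, d_model]
        queries = self.object_queries.expand(vis_tokens.size(0), -1, -1)
        hidden = self.decoder(queries, memory)                  # [B, Q, d_model]
        return {
            "logits": self.class_head(hidden),                  # [B, Q, 15]
            "boxes": self.box_head(hidden).sigmoid(),           # (cx,cy,w,h) in [0,1]
        }


sketch = DetectionHeadSketch(vis_dim=1024).eval()
with torch.no_grad():
    out = sketch(torch.randn(2, 49, 1024))  # 2 images, 49 patch tokens each
print(out["logits"].shape, out["boxes"].shape)
```

Weights from the released checkpoints are not expected to load into this sketch, since parameter names and internal details may differ from the real `DetectionHead`.
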
| Encoder | mAP@0.4 (test) |
|---|---|
| OWLv2 | 0.048 |
| SigLIP | ~0.045 |
| CLIP ViT-L/14 | ~0.040 |
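
mAP@0.4 counts a prediction as a match when its IoU with a same-class ground-truth box reaches 0.4; whether the boundary itself is inclusive depends on the evaluator. A minimal sketch of that overlap test for xyxy boxes follows. The function name `box_iou` and the sample boxes are illustrative only.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


pred = (100, 100, 200, 200)
gt = (140, 100, 240, 200)   # shifted 40 px to the right of the prediction
iou = box_iou(pred, gt)

# Intersection is 60 x 100 = 6000 and union is 14000, so IoU is about 0.43:
# a match at the 0.4 threshold but not at 0.5.
print(round(iou, 3), iou >= 0.4, iou >= 0.5)
```

The same pair of boxes can therefore count toward `val_map40` while being a miss for `val_map50`.
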

| File | Encoder | vis_dim |
|---|---|---|
| clip-vit-l14.pt | CLIP ViT-L/14 | 1024 |
| siglip.pt | SigLIP | 1152 |
| florence2.pt | Florence-2 | 1024 |
| coca.pt | CoCa | 768 |
| owlv2.pt | OWLv2 | 1024 |
| mae-vit-l16.pt | MAE ViT-L/16 | 1024 |

    import torch
    from lapvqa.ad.heads import DetectionHead, predict

    # Load the checkpoint on CPU; it holds the head's hyperparameters and weights.
    ckpt = torch.load("owlv2.pt", map_location="cpu")

    # Rebuild the head from the stored hyperparameters, then restore its weights.
    head = DetectionHead(
        vis_dim=ckpt["vis_dim"],
        d_model=ckpt["d_model"],
        num_queries=ckpt["num_queries"],
        num_enc_layers=ckpt["num_enc"],
        num_dec_layers=ckpt["num_dec"],
    )
    head.load_state_dict(ckpt["state_dict"])
    head.eval()

    with torch.no_grad():
        # vis_tokens: [B, HW, vis_dim], spatial patch tokens from the frozen encoder
        # that matches the checkpoint (OWLv2 here); compute them before this step.
        outputs = head(vis_tokens)
        detections = predict(outputs, score_threshold=0.1, nms_iou=0.5)

    # detections[i]: {'boxes': [K,4] xyxy, 'labels': [K], 'scores': [K]}
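
To make the `score_threshold` and `nms_iou` arguments concrete, here is a rough pure-Python sketch of the kind of post-processing they suggest for a single image: drop low-scoring queries, convert (cx,cy,w,h) to xyxy, then apply greedy NMS. It is not the `predict` implementation. The function names, the per-class (rather than class-agnostic) suppression, and the assumption that each query already has one label and one score are all illustrative.

```python
def cxcywh_to_xyxy(box):
    """Convert a centre-format box (cx, cy, w, h) to corner format."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a, b):
    """IoU of two xyxy boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def postprocess(boxes_cxcywh, labels, scores, score_threshold=0.1, nms_iou=0.5):
    """Threshold, convert to xyxy, then greedy per-class NMS for one image."""
    cands = [
        (s, l, cxcywh_to_xyxy(b))
        for b, l, s in zip(boxes_cxcywh, labels, scores)
        if s >= score_threshold
    ]
    cands.sort(key=lambda c: c[0], reverse=True)  # highest score first
    kept = []
    for s, l, b in cands:
        # Keep a box unless a higher-scoring kept box of the same class overlaps it
        # by more than nms_iou.
        if all(kl != l or iou(b, kb) <= nms_iou for _, kl, kb in kept):
            kept.append((s, l, b))
    return {
        "boxes": [b for _, _, b in kept],
        "labels": [l for _, l, _ in kept],
        "scores": [s for s, _, _ in kept],
    }


result = postprocess(
    boxes_cxcywh=[(0.50, 0.50, 0.20, 0.20), (0.52, 0.50, 0.20, 0.20),
                  (0.20, 0.80, 0.10, 0.10), (0.70, 0.30, 0.10, 0.10)],
    labels=[3, 3, 7, 3],
    scores=[0.90, 0.60, 0.05, 0.40],
)
print(result["labels"], result["scores"])
```

In this toy input the third query falls below the score threshold and the second is suppressed by the first, which it overlaps heavily, so two detections of class 3 remain.
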