LAPVQA
Collection
Chest X-ray models: pre-trained encoders and task heads for VQA, DiffVQA, RRG, detection, and grounding on MIMIC-CXR. โข 14 items โข Updated
Part of the LAPVQA collection.
A ViT-L/14 encoder + 6-layer causal decoder trained from scratch on MIMIC-CXR
to generate full radiology reports from chest X-ray images.
Unlike the contrastive pretrain variants, the generative objective forces the encoder
to retain fine-grained spatial information sufficient for region-level text generation.
The encoder weights (encoder_final.pt) serve as the strongest feature extractor
in the LAPVQA downstream tasks.
| Component | Detail |
|---|---|
| Vision backbone | ViT-L/14, 24-layer, 1024-dim, 16-head, patch 14, 384 px |
| Captioning decoder | 6-layer causal transformer, 512-dim, GPT-2 vocab (50 257) |
| Loss | Cross-entropy over report tokens |
| Training data | MIMIC-CXR (physionet.org/content/mimic-cxr) |
| Dataset | Mean AUC |
|---|---|
| NIH CXR-14 (14-class) | 0.686 |
| CheXpert-5 (5-class) | 0.808 |
The captioning-pretrained encoder matches or exceeds the contrastive variants on both classification benchmarks, and is the best-performing encoder on DiffVQA when used downstream.
| File | Description |
|---|---|
encoder_final.pt |
Vision encoder weights (used as frozen feature extractor downstream) |
model_best.pt |
Full encoder + decoder at best validation loss |
import torch
from lapvqa.pretrain.model import CaptioningModel
ckpt = torch.load("model_best.pt", map_location="cpu")
model = CaptioningModel()
model.load_state_dict(ckpt)
model.eval()
# To use only the encoder as a feature extractor:
enc_weights = torch.load("encoder_final.pt", map_location="cpu")
model.vision_encoder.load_state_dict(enc_weights)
# vis_tokens = model.vision_encoder(images) # [B, 256, 1024]
If you use these weights please cite MIMIC-CXR:
@article{johnson2019mimic,
title = {MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports},
author = {Johnson, Alistair EW and others},
journal = {Scientific data},
volume = {6}, pages = {317}, year = {2019}
}