LAPVQA: VQA (Native / End-to-end)

Part of the LAPVQA collection.

Description

VQA task heads trained with end-to-end fine-tuning (encoder and head trained jointly). They provide a baseline for comparison with the frozen-encoder variant, lapvqa-vqa. Each .pt file is a plain state dict of VQAHead.

File                   Encoder                      vis_dim
clip-vit-l14_best.pt   CLIP ViT-L/14 (fine-tuned)   1024
siglip_best.pt         SigLIP (fine-tuned)          1152
florence2_best.pt      Florence-2 (fine-tuned)      1024
coca_best.pt           CoCa (fine-tuned)            768
mae-vit-l16_best.pt    MAE ViT-L/16 (fine-tuned)    1024

Loading

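import torch
from lapvqa.vqa.model import VQAHead

# Visual feature dimension expected by the head for each encoder.
VIS_DIMS = {
    "clip-vit-l14": 1024, "siglip": 1152,
    "florence2": 1024, "coca": 768, "mae-vit-l16": 1024,
}
encoder = "siglip"
# Each checkpoint is a plain state dict, so it loads straight into the head.
ckpt = torch.load(f"{encoder}_best.pt", map_location="cpu")
# vis_dim must match the encoder the checkpoint was trained with.
head = VQAHead(vis_dim=VIS_DIMS[encoder])
head.load_state_dict(ckpt)
head.eval()  # inference mode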
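
Because the checkpoint file name and the vis_dim value both depend on the encoder key, a small lookup helper can catch a mistyped key before torch.load is called. The following is a minimal sketch, not part of the lapvqa package: the function name checkpoint_info and its root parameter are hypothetical, and it assumes the files follow the "<encoder>_best.pt" naming listed above.

```python
from pathlib import Path

# Encoder key -> visual feature dimension, copied from the table above.
VIS_DIMS = {
    "clip-vit-l14": 1024, "siglip": 1152,
    "florence2": 1024, "coca": 768, "mae-vit-l16": 1024,
}


def checkpoint_info(encoder: str, root: str = ".") -> tuple[Path, int]:
    """Return (checkpoint path, vis_dim) for a known encoder key.

    Hypothetical helper: assumes checkpoints are stored under `root`
    and named "<encoder>_best.pt".
    """
    if encoder not in VIS_DIMS:
        raise KeyError(
            f"unknown encoder {encoder!r}; expected one of {sorted(VIS_DIMS)}"
        )
    return Path(root) / f"{encoder}_best.pt", VIS_DIMS[encoder]


path, vis_dim = checkpoint_info("siglip")
```

The returned path can be passed to torch.load and the returned vis_dim to VQAHead, so the two values cannot drift apart.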