Instructions to use mlboydaisuke/ColModernVBERT-CoreAI with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- ColPali
How to use mlboydaisuke/ColModernVBERT-CoreAI with ColPali:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- Notebooks
- Google Colab
- Kaggle
ColModernVBERT — Core AI
The zoo's first visual document retriever and first late-interaction (ColBERT / MaxSim)
multi-vector model, running as static .aimodel graphs on Apple Silicon (Mac GPU / iPhone).
A Core AI port of ModernVBERT/colmodernvbert
(MIT) — a compact 250M visual document retriever: a ModernBERT-150M bidirectional text
encoder + SigLIP2 vision encoder (pixel-shuffle ×4) with a custom_text_proj head that
emits a per-token L2-normalized 128-d multi-vector. Retrieval is late interaction: you
encode a text query and a page image into token-level vectors and score them with MaxSim
(score = Σ_q max_d ⟨E_q, E_d⟩). No OCR — the page is matched as a picture, so tables, charts
and complex layouts are first-class.
This completes the on-device RAG trifecta alongside the text Qwen3-Embedding (text→text dense) and Qwen3-Reranker (cross-encoder): embed → rerank → visual-retrieval, all on device.
Two encoders (two graphs)
| graph | input | output | fp16 size |
|---|---|---|---|
| query | input_ids [1,32] i32, attention_mask [1,32] i32 |
query_embeddings [1,32,128] |
298 MB |
| doc | pixel_values [1,1,3,512,512], pixel_attention_mask [1,1,512,512] i32 |
doc_embeddings [1,89,128] |
407 MB |
Both are single bidirectional forwards — no KV cache, no generation. The per-token L2-norm and
the attention_mask masking are baked in-graph; MaxSim runs on the host (a tiny matmul +
max + sum). Each bundle directory holds one *.aimodel plus a tokenizer/ folder.
- query: right-pad the tokenized query to the 32-token grid (queries are short; ModernBERT's sliding-window(128) sees the full sequence → full attention). Slice to the real token count before MaxSim.
- doc: a single 512×512 tile ("global image") layout — the text template (CLS + image
markers + 64
<image>placeholders + SEP) is baked as a graph constant, so the only runtime inputs are the pixels. Preprocess the page like Idefics3: resize so the longest edge ≤ 512, pad to 512×512, rescale ×1/255, normalize with mean/std = 0.5, and build thepixel_attention_mask(1 for real pixels, 0 for padding).
Single-tile v1. This release ships the single 512px global-image document path: lightweight, iPhone-friendly, and accurate on typical pages. The model's full high-resolution mode (split a page into multiple 512px tiles + the global image, 800+ doc tokens) is a planned follow-up for dense small-print documents.
Repo layout
query/ colmodernvbert-query_float16_s32_static.aimodel + tokenizer/ (298 MB, fp16 — iPhone)
doc/ colmodernvbert-doc_float16_s89_static.aimodel (407 MB, fp16 — iPhone)
fp32/query/ colmodernvbert-query_float32_s32_static.aimodel + tokenizer/ (595 MB — Mac)
fp32/doc/ colmodernvbert-doc_float32_s89_static.aimodel (813 MB — Mac)
README.md · reference_query.json · reference_doc.json · test_doc.png
Each query/ and doc/ directory is a complete bundle root (one .aimodel, plus tokenizer/
on the query side). fp16 ships for iPhone (~705 MB for both encoders); fp32 is for Mac / max
precision.
On-device (CoreAIKit)
import CoreAIKitEmbeddings
// Downloads query/ + doc/ (fp16) from this repo, or uses a sideloaded copy if present.
let retriever = try await VisualDocumentRetriever() // .colModernVBERTQuery / .colModernVBERTDoc
// Encode a page as tiles (reliable spatial grounding), rank queries, and locate the match.
let page = try await retriever.encodeTiled(page: cgImage, rows: 6, cols: 4)
let q = try await retriever.encode(query: "total revenue in the third quarter")
let score = retriever.score(query: q, tiledPage: page) // MaxSim, page ranking
let rect = retriever.bestTile(query: q, tiledPage: page) // normalized region to highlight
See Examples/DocSearch
for a full iPhone demo (bundled + imported documents, query → ranked pages → highlighted region).
Parity (Core AI engine vs. PyTorch reference, M4 Max GPU)
Per-token cosine of the 128-d multi-vectors against the colpali_engine PyTorch model:
| encoder | float32 | float16 |
|---|---|---|
| query | min/mean 1.000000 | min 0.999997 / mean 0.999999 |
| doc | min/mean 1.000000 | min 0.999994 / mean 0.999998 |
End-to-end retrieval: the host MaxSim reproduces processor.score exactly (max |Δ| = 0.0000),
the engine ranking matches the PyTorch ranking on every clear-margin query, and the single-tile
engine retrieves the intended page 3/3 on a rendered-text corpus.
License
MIT, inherited from ModernVBERT/colmodernvbert.
See the upstream model and paper ModernVBERT: Towards Smaller Visual Document Retrievers
(arXiv:2510.01149).
Model tree for mlboydaisuke/ColModernVBERT-CoreAI
Base model
ModernVBERT/colmodernvbert