ColModernVBERT — Core AI

The zoo's first visual document retriever and first late-interaction (ColBERT / MaxSim) multi-vector model, running as static .aimodel graphs on Apple Silicon (Mac GPU / iPhone). It is a Core AI port of ModernVBERT/colmodernvbert (MIT), a compact 250M-parameter visual document retriever that pairs a ModernBERT-150M bidirectional text encoder with a SigLIP2 vision encoder (pixel-shuffle ×4) and a custom_text_proj head that emits one L2-normalized 128-d vector per token. Retrieval is late interaction: you encode a text query and a page image into token-level vectors and score them with MaxSim (score = Σ_q max_d ⟨E_q, E_d⟩, i.e. each query token keeps its best-matching document token and the maxima are summed). No OCR is involved: the page is matched as a picture, so tables, charts and complex layouts are first-class.
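
The MaxSim formula above can be sketched in a few lines of host-side Swift. This is a minimal illustration, not CoreAIKit's implementation: the names `MultiVector`, `dot` and `maxSim` are made up for the example, the vectors are assumed to be L2-normalized already, and the query is assumed to be sliced to its real token count before scoring.

```swift
import Foundation

/// One embedding per token; every vector is assumed to be L2-normalized already.
typealias MultiVector = [[Float]]

/// Dot product of two equal-length vectors.
func dot(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "dimension mismatch")
    var sum: Float = 0
    for i in a.indices { sum += a[i] * b[i] }
    return sum
}

/// MaxSim: for each query token take the best-matching document token, then sum.
func maxSim(query: MultiVector, doc: MultiVector) -> Float {
    precondition(!doc.isEmpty, "document has no tokens")
    var score: Float = 0
    for q in query {
        var best = -Float.infinity
        for d in doc { best = max(best, dot(q, d)) }
        score += best
    }
    return score
}

// Toy example with 2-d vectors instead of 128-d.
let query: MultiVector = [[1, 0], [0, 1]]
let doc: MultiVector = [[1, 0], [0.6, 0.8]]
let score = maxSim(query: query, doc: doc)   // best matches are 1.0 and 0.8
```

With real outputs, `query` would hold up to 32 vectors and `doc` 89 vectors of 128 floats each, so a plain double loop like this stays cheap.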

This completes the on-device RAG trifecta alongside the text Qwen3-Embedding (text→text dense) and Qwen3-Reranker (cross-encoder): embed → rerank → visual-retrieval, all on device.

Two encoders (two graphs)

| graph | input | output | fp16 size |
|---|---|---|---|
| query | input_ids [1,32] i32, attention_mask [1,32] i32 | query_embeddings [1,32,128] | 298 MB |
| doc | pixel_values [1,1,3,512,512], pixel_attention_mask [1,1,512,512] i32 | doc_embeddings [1,89,128] | 407 MB |

Both are single bidirectional forward passes: no KV cache, no generation. The per-token L2 normalization and the attention_mask masking are baked into the graph; MaxSim runs on the host (a tiny matmul + max + sum). Each bundle directory holds one *.aimodel; the query bundle also carries a tokenizer/ folder.

  • query: right-pad the tokenized query to the 32-token grid. Queries are short, and at 32 tokens ModernBERT's 128-token sliding window covers the whole sequence, so its local attention is equivalent to full attention. Slice the output to the real token count before MaxSim.
  • doc: a single 512×512 tile ("global image") layout — the text template (CLS + image markers + 64 <image> placeholders + SEP) is baked as a graph constant, so the only runtime inputs are the pixels. Preprocess the page like Idefics3: resize so the longest edge ≤ 512, pad to 512×512, rescale ×1/255, normalize with mean/std = 0.5, and build the pixel_attention_mask (1 for real pixels, 0 for padding).
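
The input preparation described in the two bullets can be sketched as follows. Treat it as a rough outline under stated assumptions rather than the library's code: all function names are illustrative, `padID` is a placeholder for the pad token id from the tokenizer config, pages already smaller than 512 px are left unscaled, and the resized page is assumed to sit at the top-left of the 512×512 canvas.

```swift
import Foundation

/// Right-pad token ids to the static grid and build the matching attention mask.
/// `padID` is a placeholder: take the real pad token id from the tokenizer config.
func padQuery(_ ids: [Int32], to length: Int = 32, padID: Int32)
    -> (inputIDs: [Int32], attentionMask: [Int32], realCount: Int) {
    precondition(ids.count <= length, "query longer than the static grid")
    let padCount = length - ids.count
    let inputIDs = ids + Array(repeating: padID, count: padCount)
    let mask = Array(repeating: Int32(1), count: ids.count)
        + Array(repeating: Int32(0), count: padCount)
    return (inputIDs, mask, ids.count)
}

/// Page size after scaling so the longest edge is at most `maxEdge` (aspect ratio kept).
/// Assumption: smaller pages are not upscaled.
func fittedSize(width: Int, height: Int, maxEdge: Int = 512) -> (width: Int, height: Int) {
    let longest = max(width, height)
    guard longest > maxEdge else { return (width, height) }
    let scale = Double(maxEdge) / Double(longest)
    return (max(1, Int((Double(width) * scale).rounded())),
            max(1, Int((Double(height) * scale).rounded())))
}

/// Rescale an 8-bit channel value by 1/255, then normalize with mean = std = 0.5.
func normalize(_ value: UInt8) -> Float {
    (Float(value) / 255 - 0.5) / 0.5
}

/// Row-major canvas mask: 1 over the resized page, 0 over padding.
/// Assumption: the page is placed at the top-left corner of the canvas.
func pixelAttentionMask(contentWidth: Int, contentHeight: Int, canvas: Int = 512) -> [Int32] {
    var mask = [Int32](repeating: 0, count: canvas * canvas)
    for y in 0..<min(contentHeight, canvas) {
        for x in 0..<min(contentWidth, canvas) { mask[y * canvas + x] = 1 }
    }
    return mask
}

let padded = padQuery([101, 7, 42], padID: 0)      // hypothetical token ids
let size = fittedSize(width: 1700, height: 2200)   // e.g. a portrait page scan
let mask = pixelAttentionMask(contentWidth: size.width, contentHeight: size.height)
```

`padded.realCount` is the value to use when slicing query_embeddings before MaxSim, so padded positions never contribute to the score.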

Single-tile v1. This release ships the single 512px global-image document path: lightweight, iPhone-friendly, and accurate on typical pages. The model's full high-resolution mode (split a page into multiple 512px tiles + the global image, 800+ doc tokens) is a planned follow-up for dense small-print documents.

Repo layout

query/   colmodernvbert-query_float16_s32_static.aimodel + tokenizer/   (298 MB, fp16 — iPhone)
doc/     colmodernvbert-doc_float16_s89_static.aimodel                  (407 MB, fp16 — iPhone)
fp32/query/  colmodernvbert-query_float32_s32_static.aimodel + tokenizer/  (595 MB — Mac)
fp32/doc/    colmodernvbert-doc_float32_s89_static.aimodel                 (813 MB — Mac)
README.md · reference_query.json · reference_doc.json · test_doc.png

Each query/ and doc/ directory is a complete bundle root (one .aimodel, plus tokenizer/ on the query side). fp16 ships for iPhone (~705 MB for both encoders); fp32 is for Mac / max precision.

On-device (CoreAIKit)

```swift
import CoreAIKitEmbeddings

// Downloads query/ + doc/ (fp16) from this repo, or uses a sideloaded copy if present.
let retriever = try await VisualDocumentRetriever()   // .colModernVBERTQuery / .colModernVBERTDoc

// Encode a page as tiles (reliable spatial grounding), rank queries, and locate the match.
let page = try await retriever.encodeTiled(page: cgImage, rows: 6, cols: 4)
let q = try await retriever.encode(query: "total revenue in the third quarter")
let score = retriever.score(query: q, tiledPage: page)     // MaxSim, page ranking
let rect  = retriever.bestTile(query: q, tiledPage: page)  // normalized region to highlight
```

See Examples/DocSearch for a full iPhone demo (bundled + imported documents, query → ranked pages → highlighted region).
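
To make the "normalized region" idea concrete, here is one way a best-scoring tile could be mapped back to a rectangle on the page. It is only a sketch and says nothing about how `bestTile` is actually implemented: `NormalizedRect`, `tileRect` and `bestTileIndex` are hypothetical names, tiles are assumed to be numbered row-major, and coordinates are assumed to run from 0 to 1 with the origin at the top-left.

```swift
import Foundation

/// A rectangle in normalized page coordinates (0...1, origin at the top-left).
struct NormalizedRect: Equatable {
    var x, y, width, height: Double
}

/// Rectangle covered by tile `index` in a row-major rows × cols grid.
func tileRect(index: Int, rows: Int, cols: Int) -> NormalizedRect {
    precondition(rows > 0 && cols > 0, "grid must be non-empty")
    precondition((0..<(rows * cols)).contains(index), "tile index out of range")
    let row = index / cols
    let col = index % cols
    return NormalizedRect(x: Double(col) / Double(cols),
                          y: Double(row) / Double(rows),
                          width: 1 / Double(cols),
                          height: 1 / Double(rows))
}

/// Index of the highest-scoring tile, or nil when there are no scores.
func bestTileIndex(scores: [Float]) -> Int? {
    scores.indices.max(by: { scores[$0] < scores[$1] })
}

// Hypothetical per-tile MaxSim scores for a 6 × 4 grid (24 tiles).
var tileScores = [Float](repeating: 0.1, count: 24)
tileScores[9] = 0.9                                   // row 2, column 1
let best = bestTileIndex(scores: tileScores)!
let region = tileRect(index: best, rows: 6, cols: 4)
```

Multiplying `region` by the displayed page size gives the pixel rectangle to highlight.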

Parity (Core AI engine vs. PyTorch reference, M4 Max GPU)

Per-token cosine of the 128-d multi-vectors against the colpali_engine PyTorch model:

| encoder | float32 | float16 |
|---|---|---|
| query | min 1.000000 / mean 1.000000 | min 0.999997 / mean 0.999999 |
| doc | min 1.000000 / mean 1.000000 | min 0.999994 / mean 0.999998 |
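
For readers who want to reproduce this kind of check, the per-token cosine statistic can be computed roughly as below. This is a sketch, not the script used for the numbers above: `cosine` and `parity` are illustrative names, and both inputs are assumed to be token-aligned arrays of equal shape.

```swift
import Foundation

/// Cosine similarity between two equal-length vectors.
func cosine(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "dimension mismatch")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (normA.squareRoot() * normB.squareRoot())
}

/// Minimum and mean per-token cosine between engine output and reference output.
func parity(engine: [[Float]], reference: [[Float]]) -> (min: Float, mean: Float) {
    precondition(engine.count == reference.count && !engine.isEmpty, "token count mismatch")
    let sims = zip(engine, reference).map { cosine($0, $1) }
    return (sims.min()!, sims.reduce(0, +) / Float(sims.count))
}

// Toy 2-token, 2-d example; real outputs are [tokens, 128].
let engine: [[Float]] = [[1, 0], [0, 1]]
let reference: [[Float]] = [[1, 0], [0.6, 0.8]]
let result = parity(engine: engine, reference: reference)
```

In the toy data the second token deliberately disagrees, so `result.min` drops well below 1; matching encoders should give values like those in the table.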

End-to-end retrieval: the host MaxSim reproduces processor.score exactly (max |Δ| = 0.0000), the engine ranking matches the PyTorch ranking on every clear-margin query, and the single-tile engine retrieves the intended page 3/3 on a rendered-text corpus.

License

MIT, inherited from ModernVBERT/colmodernvbert. See the upstream model and paper ModernVBERT: Towards Smaller Visual Document Retrievers (arXiv:2510.01149).
