ManniX PRO

ManniX-ITA


https://github.com/mann1x

mann1x




Posts 14

Post


🚀 opencoti-llamafile 0.10.3-c6 — multi-stream everywhere

New cut of the opencoti single-file inference engine (llamafile 0.10.3 / llama.cpp + 94 additive patches). One zero-dependency APE executable for Linux, Windows, macOS & BSD.

What's new vs c5:

⚡ Windows GPU, one file. New win-gpu .exe with the CUDA DLL embedded — download & run, no side-load. Plus a universal binary carrying every GPU payload. All CUDA payloads are nvcc-compressed (zero measured load/throughput cost) to fit Windows' 4 GiB image cap; five artifacts now ship per release, incl. aarch64 with embedded sbsa CUDA for DGX Spark (GB10).

🧩 Flat multi-stream KV. Split-KV multi-slot serving was 30–60× slower than unified (per-layer KV reassembly every step). Rebuilt as flat stream-contiguous layouts with zero-copy views — sliding-window layers and the pinned-host spill tail included: window + --parallel 2 went 1.2 → 25.3 tok/s aggregate, token-identical.
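As a rough illustration of the layout idea (not the engine's actual code), the sketch below keeps one flat buffer in which each stream's cells sit contiguously and hands out zero-copy views with Python's `memoryview`. The sizes and the `stream_view` helper are hypothetical names chosen for the example.

```python
# Illustrative only: a flat, stream-contiguous buffer with zero-copy views.
# N_STREAMS, CELLS_PER_STREAM, CELL_BYTES and stream_view are hypothetical.
N_STREAMS = 2
CELLS_PER_STREAM = 4
CELL_BYTES = 8

flat = bytearray(N_STREAMS * CELLS_PER_STREAM * CELL_BYTES)

def stream_view(buf, stream):
    """Return a view of one stream's contiguous region; no bytes are copied."""
    span = CELLS_PER_STREAM * CELL_BYTES
    return memoryview(buf)[stream * span:(stream + 1) * span]

view1 = stream_view(flat, 1)
view1[0] = 0xAB  # writing through the view mutates the flat buffer in place

print(flat[CELLS_PER_STREAM * CELL_BYTES])  # first byte of stream 1's region
print(view1.obj is flat)                    # the view still refers to the same buffer
```

Because every stream occupies one contiguous range, a per-stream view is just an offset and a length, so nothing has to be reassembled on each step.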

🔓 PolyKV no longer requires --kv-unified. Pools snapshot & share prefixes on split KV too — the c5 limitation is gone. /capacity is now stream-aware (kv_streams_* fields).

🤝 MTP × multi-stream. The assistant-MTP "force unified" guard is retired: the draft context mirrors the target's streams, split vs unified token-identical. Validated 7/7 Gemma-4 draft pairs on the shipped bytes, 1.39–1.97× decode.

⏱️ Deterministic GPU sharing. Feedback pacing is replaced by weighted time-division on the wall clock: exact ratios by construction, zero solo tax, and cross-process on Windows too. Peers are now keyed by PCI bus, which makes sharing multi-GPU safe.
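A minimal sketch of what weighted time-division on the wall clock could look like, assuming each peer derives its slice of a fixed period purely from the shared weights. The period length and function names are hypothetical, and this is not the engine's scheduler.

```python
# Illustrative sketch of weighted time-division; not the engine's scheduler.
# PERIOD_MS and the function names are hypothetical.
PERIOD_MS = 30

def slot_bounds(weights):
    """Split one period into consecutive slices proportional to each weight."""
    total = sum(weights)
    bounds, start = [], 0.0
    for w in weights:
        end = start + PERIOD_MS * w / total
        bounds.append((start, end))
        start = end
    return bounds

def owner_at(now_ms, weights):
    """Return the index of the peer whose slice contains this wall-clock instant."""
    phase = now_ms % PERIOD_MS
    for i, (start, end) in enumerate(slot_bounds(weights)):
        if start <= phase < end:
            return i
    return len(weights) - 1  # guard against float rounding at the period edge

weights = [2, 1]
owned = [0, 0]
for t in range(3000):  # 100 full periods, sampled once per millisecond
    owned[owner_at(t, weights)] += 1
print(owned)
```

In this toy model the 2:1 ratio falls out of the slice lengths rather than from feedback, and a lone peer (`weights = [1]`) owns the whole period, which is one way to read "zero solo tax".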

🚛 r2 same-day re-cut: concurrent sessions with quantized spilled KV tails were collapsing to ~⅙ PCIe bandwidth; per-block bulk staging restores it — dual q4_0 4.3 → 19.2 tok/s aggregate (within 5% of solo), token-identical.

🔗 Repo (binaries, DSOs, full patch series, docs): ManniX-ITA/opencoti-llamafile

Post


# opencoti-llamafile 0.10.3-c5 — settled admission for multi-agent serving

New cut of the opencoti single-file inference engine (llamafile 0.10.3 / llama.cpp + 87 additive patches). Zero-dependency APE: one executable for Linux, Windows, macOS & BSD.

What's new vs c4:

**PolyKV fan-out: pool from a live session.** `POST /polykv/pools` gains `from_session`/`from_slot`: the shared prefix is snapshotted server-side from the session's cached KV, with no tokens resent, token-exact. Ephemeral pools auto-release when orchestrators die mid-round.
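A hedged sketch of how a client might build such a request with Python's standard library. Only the endpoint path and the `from_session` field name come from the notes above; the base URL, the JSON body shape, and the session id value are assumptions for illustration.

```python
import json
import urllib.request

# Assumptions: the base URL, the JSON body shape and the session id value are
# hypothetical. Only the endpoint path and the from_session field name come
# from the release notes.
BASE_URL = "http://127.0.0.1:8080"

def build_pool_request(session_id):
    """Build (but do not send) a request that asks for a pool from a live session."""
    body = json.dumps({"from_session": session_id}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/polykv/pools",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_pool_request("example-session")
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen` against a running server; the response format is not described in the post, so it is left out here.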

**PolyKV P7: settled admission.** Spawning agents faster than the tps signal settles was oversubscribing pools. Now: a per-pool settle window paces admits just enough for a reliable reading; warming sessions no longer bias the mean; the post-admit forecast uses the measured per-admit drop; idle gaps (agents mid-tool-call) no longer read as free capacity; `guarantee_min_sessions` means a new/nested pool always gets its first agent, so capacity checks can never deadlock an orchestrator; the enforced gate applies to new sessions only, with per-request `overcommit`. Benchmark (multi-agent courier, floor 15 tok/s): time-under-floor −52%, deep sub-floor −83%, delivery p50 −43%, 100% task score.
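The admission rules above can be pictured with a toy gate. This is a sketch under stated assumptions, not the engine's implementation: the `PoolGate` class, its fields, and the numbers are hypothetical, and only the ideas (settle window, forecast from the measured per-admit drop, guaranteed first session) come from the notes.

```python
# Illustrative sketch of a settle-window admission gate; not the engine's code.
# The class, its fields and the numbers are hypothetical.
class PoolGate:
    def __init__(self, floor_tps, settle_s, guarantee_min_sessions=1):
        self.floor_tps = floor_tps
        self.settle_s = settle_s
        self.guarantee_min = guarantee_min_sessions
        self.sessions = 0
        self.last_admit = None

    def try_admit(self, now, mean_tps, per_admit_drop):
        """Admit a session if the pool has settled and the forecast stays above the floor."""
        if self.sessions < self.guarantee_min:
            return self._admit(now)  # the first agent is always admitted
        if now - self.last_admit < self.settle_s:
            return False  # the tps reading has not settled since the last admit
        if mean_tps - per_admit_drop < self.floor_tps:
            return False  # forecast after admission would fall below the floor
        return self._admit(now)

    def _admit(self, now):
        self.sessions += 1
        self.last_admit = now
        return True

gate = PoolGate(floor_tps=15.0, settle_s=2.0)
decisions = [
    gate.try_admit(0.0, mean_tps=0.0, per_admit_drop=0.0),    # guaranteed first session
    gate.try_admit(0.5, mean_tps=40.0, per_admit_drop=10.0),  # still inside the settle window
    gate.try_admit(3.0, mean_tps=40.0, per_admit_drop=10.0),  # settled, forecast 30 is above 15
    gate.try_admit(6.0, mean_tps=20.0, per_admit_drop=10.0),  # forecast 10 is below 15
]
print(decisions)
```

The guaranteed first admit is what keeps a capacity check from blocking an orchestrator that has no agents yet.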

**Zero-conf GPU sharing.** Instances on one GPU discover each other over shared memory (no ports, no config) and split compute by `--gpu-share-weight`. Measured (3090): weights 2:1 → 71.7/36.2 tok/s; holds at `--parallel 4` and under MTP. Idle peers cost nothing (solo = full speed), crashes age out in 3 s; `GET /gpu/peers` shows live shares + busy %.
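For a quick sanity check of the quoted figures, the arithmetic below compares the measured throughput split with the configured 2:1 weights. It uses only the numbers stated above; the variable names are illustrative.

```python
# Check how closely the measured throughput follows the configured 2:1 weights.
weights = (2, 1)
measured_tps = (71.7, 36.2)  # figures quoted in the post for a 3090

target_share = [w / sum(weights) for w in weights]
actual_share = [t / sum(measured_tps) for t in measured_tps]
ratio = measured_tps[0] / measured_tps[1]

for want, got in zip(target_share, actual_share):
    print(f"target {want:.3f}  measured {got:.3f}")
print(f"measured ratio {ratio:.2f}:1")
```

The measured split lands within about half a percentage point of the 2/3 and 1/3 targets.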

From c5 every release ships per-platform side-load DSOs: `dso/<ver>/` with Linux x86_64 + sbsa `.so` and a Windows `.dll`.

ManniX-ITA/opencoti-llamafile


Collections 5


models 41

datasets 1

ManniX-ITA/osync-code

Viewer • Updated Jan 12 • 1 • 18