Overview — Composer 2.5 Replication Framework (5-minute read)
Current through ADR-014 (2026-06). For
the front-door pitch see README.md; for the honest gap list see
BACKLOG.md; for the clause-by-clause vision audit see
docs/VISION_VALIDATION.md.
What it is
An open, methodology-first replication of Cursor's Composer 2.5
recipe — the post-training pipeline that turned a Kimi-K2.5 MoE base into a strong agentic
coder — generalized so it runs on any HuggingFace causal LM with a chat template (Qwen,
Llama, Mistral, DeepSeek, Phi, Gemma families). It ships as an installable Python package
(pip install -e . → composer_replication) plus a research corpus (ADRs, deep-dives,
recipes). Encoder-decoder models, base models without chat templates, and VLMs are out of
scope for v0.
This repo is the methodology repo ("the paper of the project"). Trained-variant model
repos and trace datasets are split out per docs/HF_REPO_LAYOUT.md.
The three channels — with honest provenance
The framework composes a single training loss out of three additive channels. Two replicate Cursor's published recipe; the third is the framework's own research addition. Getting this provenance right is the whole point — see ADR-014.
| # | Channel | What it is | Provenance |
|---|---|---|---|
| 1 | Base policy optimization | RL on verifiable rewards (RLVR). Default Dr.GRPO, now a selectable menu (make_po_config(objective=…) over {grpo, dr_grpo, bnpo, dapo, gspo, cispo}, per ADR-014). | ✅ Genuine replication. Composer 2's report (arXiv:2603.24477) resolves the base objective as Dr.GRPO. |
| 2 | SDPO self-distillation | Composer's "targeted RL with textual feedback": insert a hint into the context → use that hint-conditioned forward pass as a self-teacher → on-policy KL pulls the student toward it at the error turn. Published as SDPO/OPSD (arXiv:2601.20802 / 2601.18734, MIT code). | ✅ Genuine replication. This is Composer 2.5's headline trick; Cursor cites the SDPO/OPSD papers in the blog's footnote 1. |
| 3 | Trace-replay-DPO | Replay each step of a frozen agentic trace with N external teachers; turn teacher (dis)agreement into DPO preference pairs. A deliberate β-gated washout probe in the A0→A4 channel ladder (ADR-013). | ⚠️ The framework's OWN additive research channel — NOT part of Cursor's recipe. Composer's primary sources contain no DPO, no preference pairs, no reward models, no multiple teachers. Stacks on top of the genuine replication; it does not define it. |
Read this before citing the framework. Any statement of the form "Composer does trace-replay-DPO" or "the replication target includes channel 3" is wrong. Cursor's recipe = channels 1 + 2. Channel 3 is our addition, and the docs are careful to say so.
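To make channel 2 in the table above more concrete, here is a minimal sketch of a masked divergence between a student distribution and a hint-conditioned self-teacher distribution. It assumes the divergence is a Jensen-Shannon divergence (suggested only by the `sdpo_jsd` term name) and that an error-turn mask selects the positions that count. The function names, the plain-list logits, and the mean reduction are hypothetical; the framework's actual implementation (temperature, stop-gradient on the teacher, reduction) is not shown in this overview.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def masked_sdpo_jsd(student_logits, teacher_logits, mask):
    """Mean JSD over positions where mask is truthy (assumed: the error turn).

    student_logits: per-position logits WITHOUT the hint in context.
    teacher_logits: per-position logits from the same model WITH the hint.
    Returns 0.0 when the mask selects no positions.
    """
    total, count = 0.0, 0
    for s, t, keep in zip(student_logits, teacher_logits, mask):
        if keep:
            total += jsd(softmax(s), softmax(t))
            count += 1
    return total / count if count else 0.0
```

In this sketch an all-zero mask returns exactly 0.0, so the channel contributes nothing to the loss when no positions are selected.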
The full loss (verification-harness form) is total = lm_ce + α·sdpo_jsd + β·trace_replay_dpo;
production uses ComposerReplicationTrainer._compute_loss (a real trl.GRPOTrainer subclass),
where channel 1 is real GRPO rather than the LM-CE stub. See
docs/USER_GUIDE.md and docs/COMPOSER_RECIPE_MAPPING.md.
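The verification-harness formula above can be sketched in a few lines of plain Python. This is an illustration of the arithmetic only: `compose_total` and `channel_contributions` are hypothetical names, scalars stand in for tensors, and it is not the framework's `compose_loss` or `ComposerReplicationTrainer._compute_loss`.

```python
def compose_total(lm_ce, sdpo_jsd, trace_replay_dpo, alpha, beta):
    """total = lm_ce + alpha * sdpo_jsd + beta * trace_replay_dpo."""
    return lm_ce + alpha * sdpo_jsd + beta * trace_replay_dpo


def channel_contributions(lm_ce, sdpo_jsd, trace_replay_dpo, alpha, beta):
    """Per-channel weighted terms, useful for logging which channel dominates."""
    return {
        "channel_1_lm_ce": lm_ce,
        "channel_2_sdpo": alpha * sdpo_jsd,
        "channel_3_replay_dpo": beta * trace_replay_dpo,
    }


# Illustrative numbers only (not measured values).
with_dpo = compose_total(0.50, 0.20, 0.30, alpha=0.5, beta=0.1)
without_dpo = compose_total(0.50, 0.20, 0.30, alpha=0.5, beta=0.0)
```

Setting `beta=0.0` removes channel 3 entirely, leaving only the two channels that replicate Cursor's recipe. The sum is additive, so the gap between `with_dpo` and `without_dpo` is exactly the weighted channel-3 term.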
What's proven
- CPU SDPO-fires. On real Qwen2.5-0.5B-Instruct, the SDPO channel demonstrably fires (`sdpo_jsd > 0`) and SDPO-on vs SDPO-off totals differ; the "is the loss decrease just memorization?" critique is closed (Spike 006-strict).
- Real GPU run. Qwen2.5-0.5B in bf16 on a local 5090 (sm_120): 50 steps, loss 0.7354 → 0.00034, 5.31 GB peak VRAM (Spike 002a-mini).
- A1 8B-ladder Modal run. The GRPO-only arm (A1) of the LMA channel ladder has a real Modal runner and has been run with `dr_grpo`.
- GSM8K GRPO. The `examples/gsm8k_grpo*` end-to-end examples exercise the production trainer on a real reasoning benchmark.
- Economic feasibility of channel 3. 150 real OpenRouter calls, $0.98/trace mean, 0 errors (Spike 001).
- Installable + tested. `pip install -e .` works; 266 passing / 62 skipped (measured 2026-06-09; canonical count + why skips vary by env: `docs/V1_V8_COVERAGE.md`).
What's gapped (honest, NOT closed)
- Docker / TorchForge substrate E2E is hardware-blocked — the test exists and skips cleanly, but there is no local multi-GPU rig to run the orchestrator layer end-to-end.
- The full 8B LMA channel ladder (A2–A4) is not yet runnable. Only A1 (GRPO-only) has a real Modal runner. A2 (SDPO) / A3 (replay-DPO) / A4 (combined) are scaffold + plan-builder only — running them on a real 8B checkpoint additionally needs a real error-trace SDPO dataset, a replay-DPO preference corpus, and an A100 entrypoint that don't exist yet. The real 8B run is additionally user-budget-gated.
- The empirical question — does the method actually beat plain GRPO at scale? — is the GPU-budget-gated v0.1 work (Spikes 002b/003/004) and remains open by design.
See BACKLOG.md for the live gap list, the "Foot-guns worth knowing on day one" section just below for the day-one gotchas (branch sync, strip_thinking, k1/k3, compose_loss-is-harness), and docs/TROUBLESHOOTING.md for install/runtime failure modes.
Foot-guns worth knowing on day one
- Branch sync (resolved 2026-06-09). `main` is canonical and kept in sync with `master`, so a fresh Hub clone of `main` installs the complete tree. If you ever `ImportError` on `make_dr_grpo_config`, your clone is stale (`git fetch && git checkout main`). Historically `main` lagged `master`; that's fixed as long as both stay synced.
- `strip_thinking` × SDPO. On real agent traces, SDPO requires `strip_thinking=False`: ~67% of error-recovery turns are pure thinking, so stripping them yields empty SDPO masks.
- KL estimator delta. TRL uses the k3 estimator; Composer's report describes k1. This is a documented, intentional delta; the framework does not silently claim k1 parity.
- `compose_loss` is the verification harness, not production. Its channel-1 is an LM-CE stub, not real GRPO. Production training is `ComposerReplicationTrainer`.
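For readers unfamiliar with the k1/k3 naming in the KL-estimator foot-gun above, the sketch below shows the two per-sample estimators as they are commonly defined for samples drawn from the current policy. It assumes the estimated quantity is KL(policy || reference). The function names and the toy distributions are hypothetical, and nothing here is taken from the framework's or Composer's code.

```python
import math


def k1(policy_logp, ref_logp):
    """k1: log pi(x) - log pi_ref(x) for x ~ pi. Unbiased, but can be negative."""
    return policy_logp - ref_logp


def k3(policy_logp, ref_logp):
    """k3: (r - 1) - log r with r = pi_ref(x) / pi(x). Unbiased and non-negative."""
    log_r = ref_logp - policy_logp
    return math.exp(log_r) - 1.0 - log_r


def exact_kl(policy, ref):
    """Exact KL(policy || ref) over a shared discrete support."""
    return sum(q * math.log(q / p) for q, p in zip(policy, ref) if q > 0.0)


def expectation_under_policy(estimator, policy, ref):
    """Exact expectation of a per-sample estimator when x ~ policy."""
    return sum(
        q * estimator(math.log(q), math.log(p)) for q, p in zip(policy, ref)
    )


# Toy 3-token distributions (illustrative only).
policy = [0.5, 0.3, 0.2]
ref = [0.4, 0.4, 0.2]

per_sample_k1 = [k1(math.log(q), math.log(p)) for q, p in zip(policy, ref)]
per_sample_k3 = [k3(math.log(q), math.log(p)) for q, p in zip(policy, ref)]
```

Both estimators have the same expectation under the policy, namely the exact KL. They differ per sample: k1 goes negative wherever the reference assigns more probability than the policy, while k3 never does. That changes the per-token penalty and its variance, which is why the k1/k3 difference is treated as a delta worth documenting.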
Where to go next
| You want to… | Read |
|---|---|
| Pitch / status / roadmap | README.md |
| Run it end-to-end | docs/USER_GUIDE.md |
| Wire the loss into TRL / VeRL / PRIME-RL / DiLoCo / Monarch | docs/INTEGRATION_RECIPES.md |
| Exact kwargs / signatures | docs/API_REFERENCE.md |
| Why each design decision | docs/adrs/README.md |
| How Cursor's recipe maps to our components | docs/COMPOSER_RECIPE_MAPPING.md |
| Honest gaps / open work | BACKLOG.md, docs/VISION_VALIDATION.md |
| Fix a broken install / run | docs/TROUBLESHOOTING.md |