# Backlog: Composer 2.5 Replication Framework

Updated 2026-05-29 to reflect shipped waves (ingestion, diloco, packaging, datagen+RL, ADR-011/012/013, cross-family review).

## Active items / Honest Gaps

### Framework/Docker substrate E2E (Hardware-blocked)
- We lack the local multi-node GPU environment needed to run the true 8-node DiLoCo + Docker/TorchForge orchestrator E2E tests. Coverage is currently limited to unit-level and single-node pseudo-gradient checks.

### Real 8B LMA run (User-budget-gated)
- The framework is proven on Qwen-0.5B and 1.5B (GSM8K/SDPO math traces).
- The ultimate goal (a full Llama-3-8B LMA run with α/β ablation over 10k SWE-bench traces) requires a multi-GPU Modal drop plus a significant compute budget.

## Modal-gated (if budget allows after gap-closers)

### Spike 002a-mini: Real GPU smoke (Phase 10)
**Closes**: the "did we ever run gradients on GPU" ambiguity. Everything currently runs CPU-only.
- Goal: dispatch a 30-minute A10G smoke run on Modal that runs Qwen2.5-0.5B-Instruct natively on GPU.

## Shipped (Past-Skeleton)

### Spike 006: Real HF model smoke (Wave 7)
**Closes**: V8 ("any HF model"). When this spike was scoped, only a mock 4-layer toy LM ran through `composer_total_loss`.

**Goal**: prove the 3-channel loss (`grpo + α·sdpo_kl + β·trace_replay_dpo`) survives a real `transformers` model + tokenizer with finite gradients and a decreasing loss across N steps.

**Acceptance**:
1. `AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")` loads on CPU.
2. The real tokenizer's `apply_chat_template` produces `input_ids` whose shape flows through `composer_total_loss(model, batch)` without mock shapes.
3. 5 backward steps run on CPU without `nan` / `inf` / shape mismatch.
4. Loss shows a non-increasing trend across the 5 steps (strict step-to-step monotonicity is not required; allow noise).
5. New tests added under `spikes/006-real-hf-model-smoke/tests/` pass alongside the existing 38 tests.

**Estimate**: half a day, CPU only.
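
Acceptance criteria 3 and 4 can be checked mechanically from the recorded per-step losses. The following is a minimal sketch, not the repository's test code: `compose_total`, `all_finite`, `trend_slope`, and `smoke_passes` are hypothetical helper names, the scalar composition only mirrors the formula in the goal above, and a non-positive least-squares slope is just one assumed way to read "trend; allow noise".

```python
import math
from typing import Sequence


def compose_total(grpo: float, sdpo_kl: float, trace_replay_dpo: float,
                  alpha: float, beta: float) -> float:
    """Scalar form of grpo + alpha * sdpo_kl + beta * trace_replay_dpo."""
    return grpo + alpha * sdpo_kl + beta * trace_replay_dpo


def all_finite(losses: Sequence[float]) -> bool:
    """True when no recorded loss is nan or inf."""
    return all(math.isfinite(x) for x in losses)


def trend_slope(losses: Sequence[float]) -> float:
    """Least-squares slope of loss against step index (0, 1, 2, ...)."""
    n = len(losses)
    if n < 2:
        raise ValueError("need at least two steps to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(losses) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(losses))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def smoke_passes(losses: Sequence[float]) -> bool:
    """Assumed reading of criteria 3-4: finite values and a non-increasing trend."""
    return all_finite(losses) and trend_slope(losses) <= 0.0


if __name__ == "__main__":
    # Illustrative numbers only; these are not measured results.
    example = [2.31, 2.05, 2.09, 1.88, 1.80]
    print(smoke_passes(example))
```

A slope-based check tolerates a single noisy step (such as the third value in the example) while still failing a run whose loss drifts upward. With `alpha=0` and `beta=0`, `compose_total` reduces to the GRPO term alone.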

### Spike 007: Real trace ingestion (Wave 8)
**Closes**: V5 ("real LLM-application traces"). Spike 001 used 50 hand-crafted states, but the brief said "real traces."

**Goal**: pick ONE real agent-session log format with a stable, public schema, write a `TraceIngester` that converts it to our `TraceExample` dataclass, and run it end-to-end through the data collator plus a trimmed cost-floor measurement on 5 real states.

**Acceptance**:
1. ADR-002 picks the trace source (Claude Code JSONL / Cline / OpenHands / Aider / SWE-bench Lite trajectories).
2. `TraceIngester.ingest(path: Path) -> Iterator[TraceExample]` is implemented and has unit tests with a fixture log file.
3. End-to-end smoke: real trace → ingester → collator → 1-step `composer_total_loss` runs without error.
4. Cost-floor measurement: 5 real states × 3 teachers, with the p95 latency + cost report appended to `spikes/007-*/verdict.md`.

**Estimate**: 1 day + ~$2 OpenRouter.
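
The ingester signature in criterion 2 can be illustrated with a minimal sketch for a JSONL log. The `TraceExample` fields (`session_id`, `role`, `content`) and the one-object-per-line layout are assumptions made for this example; the real dataclass and the schema chosen in ADR-002 may look different.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass
class TraceExample:
    # Hypothetical fields for illustration; the project's dataclass may differ.
    session_id: str
    role: str
    content: str


class TraceIngester:
    """Reads one JSON object per line and yields TraceExample records."""

    def ingest(self, path: Path) -> Iterator[TraceExample]:
        with path.open(encoding="utf-8") as handle:
            for line_no, line in enumerate(handle, start=1):
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines in the log
                try:
                    record = json.loads(line)
                except json.JSONDecodeError as exc:
                    # Report the offending line so fixture bugs are easy to find.
                    raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
                yield TraceExample(
                    session_id=str(record["session_id"]),
                    role=str(record["role"]),
                    content=str(record["content"]),
                )
```

Because `ingest` is a generator, a large session log is streamed record by record instead of being loaded whole, which keeps the downstream collator step memory-bounded. A unit test can write a two-line fixture file to a temporary directory and assert on `list(TraceIngester().ingest(path))`.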

### Spike 008: Streaming DiLoCo smoke (Wave 9)
**Closes**: V2 (DiLoCo "deferred to v0.2", a drift from the original brief).

**Goal**: bolt outer-loop pseudo-gradient sync onto the loss composition test using two `nn.Module` replicas on the same node. No real distributed training (CPU multiprocessing or single-process).

**Acceptance**:
1. ADR-003 picks the DiLoCo variant (vanilla DiLoCo from arXiv:2311.08105 / Streaming DiLoCo from PrimeIntellect / Async-DiLoCo).
2. `outer_optimizer.py` implements pseudo-gradient = (θ_local − θ_initial) and a Nesterov-momentum outer step.
3. Smoke test: 2 replicas × 4 inner steps × 2 outer rounds on the toy model from Spike 005; both replicas converge toward the same solution within tolerance.
4. The 38 existing tests still pass (no regression).

**Estimate**: 2 days, CPU.
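
The outer loop in criteria 2 and 3 can be sketched in plain Python on a toy quadratic, without `torch`. This is an illustration under stated assumptions, not the contents of `outer_optimizer.py`: the names `inner_sgd`, `OuterNesterov`, and `run_rounds` are hypothetical, the learning rates and momentum are illustrative, and the sketch negates the averaged delta (θ_local − θ_initial) before the outer step so that a descent-style update moves the shared parameters toward the replicas.

```python
from typing import List

Vector = List[float]


def inner_sgd(theta: Vector, target: Vector, lr: float, steps: int) -> Vector:
    """Plain SGD on the toy loss 0.5 * ||theta - target||^2."""
    theta = list(theta)
    for _ in range(steps):
        theta = [t - lr * (t - g) for t, g in zip(theta, target)]
    return theta


class OuterNesterov:
    """Descent step with Nesterov momentum on the shared parameters."""

    def __init__(self, lr: float, momentum: float, dim: int) -> None:
        self.lr = lr
        self.momentum = momentum
        self.buf = [0.0] * dim

    def step(self, theta: Vector, grad: Vector) -> Vector:
        # Momentum buffer update, then the Nesterov look-ahead correction.
        self.buf = [self.momentum * b + g for b, g in zip(self.buf, grad)]
        update = [g + self.momentum * b for g, b in zip(grad, self.buf)]
        return [t - self.lr * u for t, u in zip(theta, update)]


def run_rounds(theta: Vector, targets: List[Vector], rounds: int = 2,
               inner_steps: int = 4, inner_lr: float = 0.1,
               outer_lr: float = 0.7, momentum: float = 0.9) -> Vector:
    """Each target stands in for one replica's data shard (an assumption)."""
    outer = OuterNesterov(outer_lr, momentum, dim=len(theta))
    for _ in range(rounds):
        # Every replica starts the round from the shared parameters.
        locals_ = [inner_sgd(theta, tgt, inner_lr, inner_steps) for tgt in targets]
        # delta = theta_local - theta_initial, averaged over replicas.
        delta = [sum(loc[i] - theta[i] for loc in locals_) / len(locals_)
                 for i in range(len(theta))]
        # Negate so the descent step moves toward the replicas' solutions.
        grad = [-d for d in delta]
        theta = outer.step(theta, grad)
    return theta
```

Calling `run_rounds([0.0, 0.0], [[1.0, -2.0], [3.0, 0.0]])` runs the 2 replicas × 4 inner steps × 2 outer rounds shape from criterion 3. Since both replicas restart each round from the same shared parameters, convergence can be asserted by checking that the distance from the shared parameters to the mean of the two targets shrinks.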

### Wave 10: Packaging
**Closes**: V4 ("skeleton not framework").

**Goal**: turn the assemblage of spike directories into an installable Python package with a clear quickstart.

**Acceptance**:
1. `pyproject.toml` at repo root, package name `composer_replication`.
2. `composer_replication/` dir with `__init__.py` re-exporting `composer_total_loss`, `OPSDLoss`, `TeacherReplayBuffer`, `compose_loss`, `TraceIngester`, etc.
3. `examples/qwen_05b_quickstart/` with an end-to-end script that loads the model, runs 10 training steps, and prints the loss curve.
4. README quickstart updated to `pip install -e .` + `python examples/qwen_05b_quickstart/run.py`.
5. `pip install -e .` succeeds and the quickstart runs end-to-end on CPU.

### Post-Skeleton Waves (Datagen, Alignment, Quality)
- **Trace Ingestion**: Shipped (`composer_replication/ingestion/`).
- **DiLoCo**: Shipped (`composer_replication/diloco/` outer-loop pseudo-gradient optimizer).
- **Packaging**: Shipped (`pip install -e .` succeeds).
- **ADR-008/009/010 (Datagen, Layered Hints, Dr.GRPO+SDPO)**: Shipped, examples documented.
- **Cross-Family Architectural Review**: Shipped (`docs/reviews/cross-family-adr-008-009-010-2026-05-29/`).
- **Alignment / V&V Closure**: ADR-011 (SDPO alignment indices), ADR-012 (close review findings), ADR-013 (LMA integration channel-ladder) shipped.
- **Test Suites**: 266 passed / 62 skipped (measured 2026-06-09; canonical count + env-variance note in `docs/V1_V8_COVERAGE.md`).
- **Real Examples**: `examples/gsm8k_grpo/`, `examples/sdpo_with_real_traces_production/`.

## Deferred (post-loop, GPU-gated)
- Spike 002a/002b: full trace collection on A100 ($30–50)
- Spike 003: DPO-pair signal density study
- Spike 004: A/B on SWE-bench Lite with α=0/β=0 vs α>0/β>0
- Publication wave: author identity, thumbnail, X tags, post sequence

## Process notes
- Acceptance criteria are explicit and binary. Don't claim "done" unless every box is ticked.
- Each spike has its own `spikes/00N-name/` directory with a `verdict.md` that records the acceptance results and the delta from the estimate.
- Re-audit BACKLOG.md at the end of each wave; archive completed items with their final SHAs.