Backlog — Composer 2.5 Replication Framework

Updated 2026-05-29 to reflect shipped waves (ingestion, diloco, packaging, datagen+RL, ADR-011/012/013, cross-family review).

Active items / Honest Gaps

Framework/Docker substrate E2E (Hardware-blocked)

We lack the local multi-node GPU environment needed to run the true 8-node DiLoCo + Docker/TorchForge orchestrator E2E tests. Coverage is currently limited to unit-level and single-node pseudo-gradient checks.

Real 8B LMA run (User-budget-gated)

The framework is proven on Qwen-0.5B and 1.5B (GSM8K/SDPO math traces).
The ultimate goal (Llama-3-8B full LMA run with α/β ablation over 10k SWE-bench traces) requires a multi-GPU Modal drop + significant compute budget.

Modal-gated (if budget allows after gap-closers)

Spike 002a-mini — Real GPU smoke (Phase 10)

Closes: the "did we ever run gradients on GPU" ambiguity — currently everything is CPU-only.

Goal: dispatch a 30-min A10G smoke on Modal that runs Qwen2.5-0.5B-Instruct natively on GPU.

Shipped (Past-Skeleton)

Spike 006 — Real HF model smoke (Wave 7)

Closes: V8 ("any HF model"). We currently run only a mock 4-layer toy LM through composer_total_loss.

Goal: prove the 3-channel loss (grpo + α·sdpo_kl + β·trace_replay_dpo) survives a real transformers model + tokenizer with finite gradients and a decreasing loss across N steps.
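
A minimal sketch of the weighted sum named in the goal, assuming each channel has already been reduced to a scalar. The function name compose_three_channel and the default weights are hypothetical; this is not the signature of composer_total_loss, which takes (model, batch).

```python
import math


def compose_three_channel(
    grpo: float,
    sdpo_kl: float,
    trace_replay_dpo: float,
    alpha: float = 0.1,  # hypothetical default weight for the SDPO KL channel
    beta: float = 0.1,   # hypothetical default weight for the trace-replay DPO channel
) -> float:
    """Return grpo + alpha * sdpo_kl + beta * trace_replay_dpo, rejecting nan/inf."""
    total = grpo + alpha * sdpo_kl + beta * trace_replay_dpo
    if not math.isfinite(total):
        raise ValueError(f"non-finite total loss: {total}")
    return total
```

Setting alpha=0 and beta=0 reduces the total to the GRPO channel alone, which is the baseline arm of an α/β ablation.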

Acceptance:

AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct") loads on CPU.
Real tokenizer apply_chat_template produces input_ids shape that flows through composer_total_loss(model, batch) without mock shapes.
5 backward steps run on CPU without nan / inf / shape mismatch.
Loss trends downward across the 5 steps (strict step-to-step monotonicity is not required; allow noise).
New tests added under spikes/006-real-hf-model-smoke/tests/ pass alongside existing 38.
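
One possible way to make the "finite" and "decreasing trend" criteria above machine-checkable, as a sketch. The helper name and the half-versus-half comparison are assumptions, not the check the spike's tests actually use.

```python
import math
from statistics import mean
from typing import Sequence


def check_loss_trajectory(losses: Sequence[float]) -> bool:
    """True if every loss is finite and the late-step mean is below the early-step mean.

    Comparing the means of the first and last halves tolerates step-to-step
    noise while still requiring an overall downward trend.
    """
    if len(losses) < 2:
        raise ValueError("need at least two steps to judge a trend")
    if not all(math.isfinite(x) for x in losses):
        return False
    half = len(losses) // 2
    return mean(losses[-half:]) < mean(losses[:half])
```

With 5 recorded steps this compares steps 1-2 against steps 4-5 and ignores the middle step.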

Estimate: half a day, CPU only.

Spike 007 — Real trace ingestion (Wave 8)

Closes: V5 ("real LLM-application traces"). Spike 001 used 50 hand-crafted states, whereas the brief called for "real traces."

Goal: pick ONE real agent-session log format with a stable, public schema and write a TraceIngester that converts it to our TraceExample dataclass. Then run it end-to-end through the data collator, plus a trimmed cost-floor measurement on 5 real states.
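
A rough sketch of what such an ingester could look like for a JSONL log. The TraceExample fields (prompt, response) and the record keys ("role", "content") are assumptions for illustration; the real dataclass and the schema chosen in ADR-002 may differ.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, Optional


@dataclass
class TraceExample:
    # Hypothetical fields; the project's real dataclass may carry more.
    prompt: str
    response: str


class TraceIngester:
    """Pairs each user message with the next assistant message in a JSONL log."""

    def ingest(self, path: Path) -> Iterator[TraceExample]:
        pending_prompt: Optional[str] = None
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines in the log
                record = json.loads(line)
                role = record.get("role")
                content = record.get("content", "")
                if role == "user":
                    pending_prompt = content
                elif role == "assistant" and pending_prompt is not None:
                    yield TraceExample(prompt=pending_prompt, response=content)
                    pending_prompt = None  # assistant turns without a prompt are skipped
```

Returning a generator keeps memory flat on long session logs, which matches the Iterator return type used in the acceptance criteria.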

Acceptance:

ADR-002 picks the trace source (Claude Code JSONL / Cline / OpenHands / Aider / SWE-Bench-Lite trajectories).
TraceIngester.ingest(path: Path) -> Iterator[TraceExample] is implemented + has unit tests with a fixture log file.
End-to-end smoke: real trace → ingester → collator → 1-step composer_total_loss runs without error.
Cost-floor measurement: 5 real states × 3 teachers, p95 latency + cost report appended to spikes/007-*/verdict.md.
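
The p95 latency + cost line for the verdict could be produced along these lines. This is a sketch: the nearest-rank percentile definition, the function names, and the report format are all assumptions.

```python
from typing import Sequence


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile: the value at 1-based rank ceil(0.95 * n)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), avoids float rounding
    return ordered[rank - 1]


def summarize(latencies_s: Sequence[float], costs_usd: Sequence[float]) -> str:
    """One report line covering sample count, p95 latency, and total cost."""
    return (
        f"n={len(latencies_s)} "
        f"p95_latency_s={p95(latencies_s):.2f} "
        f"total_cost_usd={sum(costs_usd):.4f}"
    )
```

With 5 states × 3 teachers there are only 15 samples, so the nearest-rank p95 is simply the slowest call; the report is best read as a worst-case figure at this sample size.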

Estimate: 1 day + ~$2 OpenRouter.

Spike 008 — Streaming DiLoCo smoke (Wave 9)

Closes: V2 (DiLoCo "deferred to v0.2" — drift from original brief).

Goal: bolt outer-loop pseudo-gradient sync onto the loss composition test using two nn.Module replicas on the same node. No real distributed training (CPU multiprocessing or single-process).

Acceptance:

ADR-003 picks the DiLoCo variant (vanilla DiLoCo from arXiv:2311.08105 / Streaming DiLoCo from Google DeepMind / Async-DiLoCo).
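outer_optimizer.py implements pseudo-gradient = (θ_initial − θ_local), averaged across replicas, followed by a Nesterov-momentum outer step. This is the sign convention of arXiv:2311.08105, under which a descent-style outer update moves the shared parameters toward the replicas' local solutions.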
Smoke test: 2 replicas × 4 inner steps × 2 outer rounds on the toy model from Spike 005; both replicas converge toward the same solution within tolerance.
38 existing tests still pass (no regression).
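
The outer loop described above can be sketched in pure Python on a toy quadratic. Assumptions: the pseudo-gradient uses the DiLoCo paper's sign convention (shared starting point minus local result, so a descent-style step moves toward the replicas' solutions), the Nesterov update follows the common buf = μ·buf + Δ form, and the class name, toy targets, and hyperparameters are illustrative only, not the contents of outer_optimizer.py.

```python
from typing import List

Vector = List[float]


def inner_sgd(theta: Vector, target: Vector, steps: int, lr: float) -> Vector:
    """Run `steps` plain SGD steps on the toy loss 0.5 * ||theta - target||^2."""
    local = list(theta)
    for _ in range(steps):
        local = [p - lr * (p - t) for p, t in zip(local, target)]
    return local


class NesterovOuter:
    """Outer optimizer: SGD with Nesterov momentum on the averaged pseudo-gradient."""

    def __init__(self, dim: int, lr: float = 0.7, momentum: float = 0.9):
        self.lr = lr
        self.momentum = momentum
        self.buf = [0.0] * dim

    def step(self, theta: Vector, replicas: List[Vector]) -> Vector:
        n = len(replicas)
        # Pseudo-gradient: shared starting point minus each replica's result, averaged.
        delta = [sum(theta[j] - r[j] for r in replicas) / n for j in range(len(theta))]
        self.buf = [self.momentum * b + d for b, d in zip(self.buf, delta)]
        update = [d + self.momentum * b for d, b in zip(delta, self.buf)]
        return [p - self.lr * u for p, u in zip(theta, update)]


def run(outer_rounds: int = 2, inner_steps: int = 4) -> Vector:
    theta = [0.0, 0.0]
    targets = [[1.0, 3.0], [3.0, 1.0]]  # each replica sees a different toy objective
    outer = NesterovOuter(dim=len(theta))
    for _ in range(outer_rounds):
        # Every replica restarts from the shared parameters at the start of a round.
        replicas = [inner_sgd(theta, t, inner_steps, lr=0.1) for t in targets]
        theta = outer.step(theta, replicas)
    return theta
```

In this toy setup the shared parameters move from [0, 0] toward the mean of the two targets, [2, 2]. Because each replica is reset to the shared parameters every round, "both replicas converge toward the same solution" reduces to the shared parameters approaching that consensus point.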

Estimate: 2 days, CPU.

Wave 10 — Packaging

Closes: V4 ("skeleton not framework").

Goal: turn the assemblage of spike directories into an installable Python package with a clear quickstart.

Acceptance:

pyproject.toml at repo root, package name composer_replication.
composer_replication/ dir with __init__.py re-exporting composer_total_loss, OPSDLoss, TeacherReplayBuffer, compose_loss, TraceIngester, etc.
examples/qwen_05b_quickstart/ with an end-to-end script that loads the model, runs 10 training steps, and prints the loss curve.
README quickstart updated to pip install -e . + python examples/qwen_05b_quickstart/run.py.
pip install -e . succeeds and quickstart runs end-to-end on CPU.
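
A small post-install check for the re-export criterion could look like the sketch below. The helper name missing_exports is hypothetical; it relies only on importlib from the standard library.

```python
import importlib
from typing import Iterable, List


def missing_exports(module_name: str, names: Iterable[str]) -> List[str]:
    """Return the names that `module_name` does not expose as top-level attributes."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]
```

After pip install -e ., calling missing_exports("composer_replication", ["composer_total_loss", "OPSDLoss", "TeacherReplayBuffer", "compose_loss", "TraceIngester"]) would be expected to return an empty list if the re-exports listed above are in place.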

Post-Skeleton Waves (Datagen, Alignment, Quality)

Trace Ingestion: Shipped (composer_replication/ingestion/).
DiLoCo: Shipped (composer_replication/diloco/ outer-loop pseudo optimizer).
Packaging: Shipped (pip install -e . succeeds).
ADR-008/009/010 (Datagen, Layered Hints, Dr.GRPO+SDPO): Shipped, examples documented.
Cross-Family Architectural Review: Shipped (docs/reviews/cross-family-adr-008-009-010-2026-05-29/).
Alignment / V&V Closure: ADR-011 (SDPO alignment indices), ADR-012 (close review findings), ADR-013 (LMA integration channel-ladder) shipped.
Test Suites: 266 passed / 62 skipped (measured 2026-06-09; canonical count + env-variance note in docs/V1_V8_COVERAGE.md).
Real Examples: examples/gsm8k_grpo/, examples/sdpo_with_real_traces_production/.

Deferred (post-loop, GPU-gated)

Spike 002a/002b — full trace collection on A100 ($30–50)
Spike 003 — DPO-pair signal density study
Spike 004 — A/B SWE-bench-lite with α=0/β=0 vs α>0/β>0
Publication wave — author identity, thumbnail, X tags, post sequence

Process notes

Acceptance criteria are explicit and binary. Don't claim "done" unless every box ticks.
Each spike has its own spikes/00N-name/ dir + verdict.md recording acceptance + delta from estimate.
Re-audit BACKLOG.md at end of each wave; archive completed items with their final SHAs.