{
  "total_findings": 17,
  "applied": [
    {"critic": "dialectic", "severity": "high", "section": "5", "what_changed": "Added a clause acknowledging the repo's own ADR-013 counter-position: the same SDPO channel can AMPLIFY a distortion when teacher is same-family and the hint adds no independent info, so the 'stabilizer' claim holds only when privileged-information conditioning carries genuine new signal (the per-turn JSD gate)."},
    {"critic": "dialectic", "severity": "high", "section": "5", "what_changed": "Softened the categorical dichotomy 'every collapse story requires a proxy' to 'most collapse stories...' and added that a true execution oracle can still collapse if positives reinforce accidental passes (DeepSWE compact-filtering motivation [43])."},
    {"critic": "dialectic", "severity": "medium", "section": "7", "what_changed": "Added a symmetry caveat at the burden-shift sentence: the same domain-transfer discount applies to the anti-side pillars ([11] is VLM/VQA, [27] is MCQA), so the SWE-specific P0-P6 ablation is the actual decider."},
    {"critic": "dialectic", "severity": "medium", "section": "2", "what_changed": "Added a one-clause on-domain SWE citation (16,991 SWE-agent trajectories; agents revert to internalized workflows; misaligned plan hurts more than no plan) alongside [11], as direct support for selective alignment-gated structure. New source [55]."},
    {"critic": "dialectic", "severity": "medium", "section": "2", "what_changed": "Split the predictive-causal-gap statistics: mean causal fidelity 0.49 across 2,695 networks (only 2.5% exceed 0.70) vs the high-dimension N=100 ~1e-8 'causally blind' extreme at 92% lower prediction error — no longer conflating corpus mean with worst-case dimension."},
    {"critic": "dialectic", "severity": "low", "section": "4", "what_changed": "MERGED with the width §4 finding (same anchor). Added the EvilGenie balancing clause: an LLM judge proved highly effective at flagging unambiguous hacks, so the held-out eval is load-bearing as a drift tripwire (proxy-minus-realeval gain) and an offline LLM-judge monitor is admissible for flagging but never as the training reward (safeguard #1)."},
    {"critic": "depth", "severity": "high", "section": "10", "what_changed": "Reframed the cost anchor: both $0.98 (N=3) and $64 (8-teacher x 1000-step) are FLAT O(N*T) figures; dropped 'branching tree' from the $64 clause; noted a true tree is O(N^D), strictly worse, and that combinatorial blow-up (not the $0.98-to-$64 gap) is what makes gating mandatory. Trimmed the now-redundant 'Divergence-gating is therefore mandatory' follow-on."},
    {"critic": "depth", "severity": "medium", "section": "9", "what_changed": "Grounded the LOC estimate: replaced '~150 LOC' with 'a few hundred LOC, comparable to the existing 390-LOC ModalSpawnExecutor', removing the optimistic precise figure."},
    {"critic": "depth", "severity": "low", "section": "6", "what_changed": "Corrected the reuse/build table: the repo's reserved slot is K8sExecutor (here specialized to EKS); EKSExecutor is the proposed concrete implementation, not a repo-named slot. Also updated the LOC column to 'a few hundred LOC each'."},
    {"critic": "depth", "severity": "low", "section": "7", "what_changed": "Separated the two credit mechanisms: shared-parent differencing is a group-relative/leave-one-out baseline (Tree-GRPO [44]); the hindsight-conditioned variant is CCA [33], which the executed-sibling structure approximates non-parametrically — no longer run together as one mechanism."},
    {"critic": "depth", "severity": "medium", "section": "2", "what_changed": "Named the concrete ADR-011 mechanism at the 'no new kernel' clause: placeholder-system-message length-match keeps student_response_idx == teacher_response_idx so the JSD compares the right tokens."},
    {"critic": "width", "severity": "high", "section": "7", "what_changed": "Fixed the process-vs-outcome citation mismatch: changed '(Let's Verify, Uesato) [19][27]' to '(Let's Verify [49]; Uesato [50] — process feedback cuts reasoning error 14.0%->3.4% at final-answer parity)'. New sources [49] Let's Verify (arXiv:2305.20050) and [50] Uesato (arXiv:2211.14275)."},
    {"critic": "width", "severity": "high", "section": "1", "what_changed": "Tagged the §1 'SWE-Search expands nodes with one policy' mention with [51] and added a Pushback-3 (§7) clause noting SWE-Search lifts SWE pass-rate ~23% relative at TEST time without extra training, so the tree must justify folding search into TRAINING. New source [51] SWE-Search (arXiv:2410.20285)."},
    {"critic": "width", "severity": "medium", "section": "3", "what_changed": "Tagged the §1 Symphony mention with [52] and added a §3 Pushback-2 sentence: Symphony is the pro-heterogeneity counter-result (single-agent MCTS gives insufficient branch diversity; heterogeneous LM pool improves rollout diversity/exploration), making the ablation a genuine two-sided question. New source [52] Symphony (arXiv:2601.22623)."},
    {"critic": "width", "severity": "medium", "section": "2", "what_changed": "MERGED with the instruction §2 Chain-of-World finding (same target sentence, cited [53] once). Appended a clause to the MuZero/Dreamer value-equivalent sentence: the latent-motion line carries the same discipline into 2026 (factorize dynamics into a compact latent, predict the consequential terminal state, not the full frame). New source [53] Chain of World (arXiv:2603.03195)."},
    {"critic": "width", "severity": "low", "section": "8", "what_changed": "Added a clause at the DeepSWE Kubernetes-rollout sentence citing the SWE-rebench infrastructure as production evidence that thousands-per-hour distributed SWE-task execution is an established pattern. New source [54] Behind SWE-rebench (nebius.com)."},
    {"critic": "instruction", "severity": "low", "section": "2", "what_changed": "MERGED with the width §2 Chain-of-World finding — applied once at the MuZero/Dreamer value-equivalent-latent sentence with citation [53]. (See the width §2 entry.)"}
  ],
  "skipped": [],
  "conflicts": [
    {"anchor": "with held-out tests giving only minimal detection improvement", "critics": ["dialectic", "width"], "resolution": "Merged into one §4 clause covering both the LLM-judge-is-effective finding (width) and the drift-tripwire / safeguard-1-admissibility framing (dialectic). Single Edit, [30] reused."},
    {"anchor": "MuZero and Dreamer ... value-equivalent latent / never reconstruct the full state", "critics": ["width", "instruction"], "resolution": "Both name the same MuZero/Dreamer value-equivalent sentence and the same Chain-of-World addition. Applied once with citation [53], per the patcher instruction to cite [53] a single time."}
  ],
  "orchestrator_escalated": [],
  "sources_added": [
    "[49] Let's Verify Step by Step — arXiv:2305.20050",
    "[50] Uesato process- vs outcome-based feedback — arXiv:2211.14275",
    "[51] SWE-Search — arXiv:2410.20285",
    "[52] SYMPHONY — arXiv:2601.22623",
    "[53] Chain of World — arXiv:2603.03195",
    "[54] Behind SWE-rebench — nebius.com",
    "[55] Plan Compliance in Autonomous Programming Agents — arXiv:2604.12147"
  ],
  "numbering_note": "Width was allotted [49]-[54]; width finding 1 adds TWO entries (Let's Verify + Uesato), so its allocation lands as [49]-[54] across all width findings. The dialectic §2 on-domain SWE disconfirmer (2604.12147) needed its own entry and was appended as [55] to avoid colliding with width's [49]. Existing [1]-[48] untouched and not renumbered."
}