Baladithya Balamurugan

Wave 1: fix 8 failing tests + unblock Docker E2E + dep/doc debt

c11cf49 2 days ago

	{
	"total_findings": 17,
	"applied": [
	{"critic": "dialectic", "severity": "high", "section": "5", "what_changed": "Added a clause acknowledging the repo's own ADR-013 counter-position: the same SDPO channel can AMPLIFY a distortion when teacher is same-family and the hint adds no independent info, so the 'stabilizer' claim holds only when privileged-information conditioning carries genuine new signal (the per-turn JSD gate)."},
	{"critic": "dialectic", "severity": "high", "section": "5", "what_changed": "Softened the categorical dichotomy 'every collapse story requires a proxy' to 'most collapse stories...' and added that a true execution oracle can still collapse if positives reinforce accidental passes (DeepSWE compact-filtering motivation [43])."},
	{"critic": "dialectic", "severity": "medium", "section": "7", "what_changed": "Added a symmetry caveat at the burden-shift sentence: the same domain-transfer discount applies to the anti-side pillars ([11] is VLM/VQA, [27] is MCQA), so the SWE-specific P0-P6 ablation is the actual decider."},
	{"critic": "dialectic", "severity": "medium", "section": "2", "what_changed": "Added a one-clause on-domain SWE citation (16,991 SWE-agent trajectories; agents revert to internalized workflows; misaligned plan hurts more than no plan) alongside [11], as direct support for selective alignment-gated structure. New source [55]."},
	{"critic": "dialectic", "severity": "medium", "section": "2", "what_changed": "Split the predictive-causal-gap statistics: mean causal fidelity 0.49 across 2,695 networks (only 2.5% exceed 0.70) vs the high-dimension N=100 ~1e-8 'causally blind' extreme at 92% lower prediction error — no longer conflating corpus mean with worst-case dimension."},
	{"critic": "dialectic", "severity": "low", "section": "4", "what_changed": "MERGED with the width §4 finding (same anchor). Added the EvilGenie balancing clause: an LLM judge proved highly effective at flagging unambiguous hacks, so the held-out eval is load-bearing as a drift tripwire (proxy-minus-realeval gain) and an offline LLM-judge monitor is admissible for flagging but never as the training reward (safeguard #1)."},
	{"critic": "depth", "severity": "high", "section": "10", "what_changed": "Reframed the cost anchor: both $0.98 (N=3) and $64 (8-teacher x 1000-step) are FLAT O(N*T) figures; dropped 'branching tree' from the $64 clause; noted a true tree is O(N^D), strictly worse, and that combinatorial blow-up (not the $0.98-to-$64 gap) is what makes gating mandatory. Trimmed the now-redundant 'Divergence-gating is therefore mandatory' follow-on."},
	{"critic": "depth", "severity": "medium", "section": "9", "what_changed": "Grounded the LOC estimate: replaced '~150 LOC' with 'a few hundred LOC, comparable to the existing 390-LOC ModalSpawnExecutor', removing the optimistic precise figure."},
	{"critic": "depth", "severity": "low", "section": "6", "what_changed": "Corrected the reuse/build table: the repo's reserved slot is K8sExecutor (here specialized to EKS); EKSExecutor is the proposed concrete implementation, not a repo-named slot. Also updated the LOC column to 'a few hundred LOC each'."},
	{"critic": "depth", "severity": "low", "section": "7", "what_changed": "Separated the two credit mechanisms: shared-parent differencing is a group-relative/leave-one-out baseline (Tree-GRPO [44]); the hindsight-conditioned variant is CCA [33], which the executed-sibling structure approximates non-parametrically — no longer run together as one mechanism."},
	{"critic": "depth", "severity": "medium", "section": "2", "what_changed": "Named the concrete ADR-011 mechanism at the 'no new kernel' clause: placeholder-system-message length-match keeps student_response_idx == teacher_response_idx so the JSD compares the right tokens."},
	{"critic": "width", "severity": "high", "section": "7", "what_changed": "Fixed the process-vs-outcome citation mismatch: changed '(Let's Verify, Uesato) [19][27]' to '(Let's Verify [49]; Uesato [50] — process feedback cuts reasoning error 14.0%->3.4% at final-answer parity)'. New sources [49] Let's Verify (arXiv:2305.20050) and [50] Uesato (arXiv:2211.14275)."},
	{"critic": "width", "severity": "high", "section": "1", "what_changed": "Tagged the §1 'SWE-Search expands nodes with one policy' mention with [51] and added a Pushback-3 (§7) clause noting SWE-Search lifts SWE pass-rate ~23% relative at TEST time without extra training, so the tree must justify folding search into TRAINING. New source [51] SWE-Search (arXiv:2410.20285)."},
	{"critic": "width", "severity": "medium", "section": "3", "what_changed": "Tagged the §1 Symphony mention with [52] and added a §3 Pushback-2 sentence: Symphony is the pro-heterogeneity counter-result (single-agent MCTS gives insufficient branch diversity; heterogeneous LM pool improves rollout diversity/exploration), making the ablation a genuine two-sided question. New source [52] Symphony (arXiv:2601.22623)."},
	{"critic": "width", "severity": "medium", "section": "2", "what_changed": "MERGED with the instruction §2 Chain-of-World finding (same target sentence, cited [53] once). Appended a clause to the MuZero/Dreamer value-equivalent sentence: the latent-motion line carries the same discipline into 2026 (factorize dynamics into a compact latent, predict the consequential terminal state, not the full frame). New source [53] Chain of World (arXiv:2603.03195)."},
	{"critic": "width", "severity": "low", "section": "8", "what_changed": "Added a clause at the DeepSWE Kubernetes-rollout sentence citing the SWE-rebench infrastructure as production evidence that thousands-per-hour distributed SWE-task execution is an established pattern. New source [54] Behind SWE-rebench (nebius.com)."},
	{"critic": "instruction", "severity": "low", "section": "2", "what_changed": "MERGED with the width §2 Chain-of-World finding — applied once at the MuZero/Dreamer value-equivalent-latent sentence with citation [53]. (See the width §2 entry.)"}
	],
	"skipped": [],
	"conflicts": [
	{"anchor": "with held-out tests giving only minimal detection improvement", "critics": ["dialectic", "width"], "resolution": "Merged into one §4 clause covering both the LLM-judge-is-effective finding (width) and the drift-tripwire / safeguard-1-admissibility framing (dialectic). Single Edit, [30] reused."},
	{"anchor": "MuZero and Dreamer ... value-equivalent latent / never reconstruct the full state", "critics": ["width", "instruction"], "resolution": "Both name the same MuZero/Dreamer value-equivalent sentence and the same Chain-of-World addition. Applied once with citation [53], per the patcher instruction to cite [53] a single time."}
	],
	"orchestrator_escalated": [],
	"sources_added": [
	"[49] Let's Verify Step by Step — arXiv:2305.20050",
	"[50] Uesato process- vs outcome-based feedback — arXiv:2211.14275",
	"[51] SWE-Search — arXiv:2410.20285",
	"[52] SYMPHONY — arXiv:2601.22623",
	"[53] Chain of World — arXiv:2603.03195",
	"[54] Behind SWE-rebench — nebius.com",
	"[55] Plan Compliance in Autonomous Programming Agents — arXiv:2604.12147"
	],
	"numbering_note": "Width was allotted [49]-[54]. Width finding 1 adds TWO entries (Let's Verify + Uesato), so the width findings together consume the full [49]-[54] range. The dialectic §2 on-domain SWE disconfirmer (2604.12147) needed its own entry and was appended as [55] to avoid colliding with width's [49]. Existing [1]-[48] are untouched and not renumbered."
	}