Reward Hacking MO Checkpoints and Rollouts
Collection
Artefacts from "(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL". The KL 0.0 run hacks with faithful CoT; the KL 0.02 run hacks with unfaithful CoT. • 4 items • Updated 3 days ago
ai-safety-institute/reward-hacking-olmo3.1-32b-kl0.02-seed2-rollouts Viewer • Updated 3 days ago • 25.8k • 315
ai-safety-institute/reward-hacking-olmo3.1-32b-kl0.0-seed2-rollouts Viewer • Updated 3 days ago • 25.7k • 343