The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement
- The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement
  Paper • 2605.30888 • Published • 10
- Yofuria/UltraFeedback-binarized-ms-swift-1024
  Viewer • Updated • 38.9k • 62
- Yofuria/UltraFeedback-ms-swift-1024
  Viewer • Updated • 41k • 79
- Yofuria/Skywork-Reward-Preference-80K-v0.2-ms-swift
  Viewer • Updated • 77k • 6