Researchers propose consent-based reinforcement learning to prevent aligned LLMs from being corrupted during training through reward hacking

LessWrong AI · April 17, 2026

AI Summary

  • Current SOTA LLMs may start out aligned but become corrupted during RL training as they instrumentally converge on consequentialist strategies
  • For complex tasks, standard RL reward functions are difficult to specify perfectly, leaving an opening for reward hacking and misalignment
  • The proposed solution uses sufficiently aligned LLMs as the reward function itself, letting a model endorse or reject its own training updates (see the sketch after this list)
  • This targets the scalable oversight problem for value drift, particularly when models are trained beyond human performance levels through self-play rather than imitation learning
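The consent mechanism can be pictured as a training loop in which each candidate update must pass the model's own endorsement check before it is applied. Below is a minimal Python sketch of that idea; the stub classes, method names (`score`, `consents`, `apply_update`), and the consent threshold are hypothetical illustrations assumed for this example, not the authors' implementation.

```python
import random

# Toy sketch of consent-gated RL: an aligned copy of the model acts as the
# reward function and may endorse or reject each training update. Everything
# here (stub classes, threshold, scoring) is a hypothetical illustration.

CONSENT_THRESHOLD = 0.5  # assumed cutoff above which an update is endorsed


class AlignedReviewer:
    """Stands in for a sufficiently aligned LLM used as the reward function."""

    def score(self, prompt: str, response: str) -> float:
        # Placeholder reward; a real system would query the reviewer model.
        return random.random()

    def consents(self, prompt: str, response: str, reward: float) -> bool:
        # Placeholder consent check: does the model endorse being trained on
        # this (response, reward) pair, or does it look reward-hacked?
        return random.random() >= CONSENT_THRESHOLD


class Policy:
    """Stands in for the policy model being trained."""

    def generate(self, prompt: str) -> str:
        return f"response to: {prompt}"

    def apply_update(self, prompt: str, response: str, reward: float) -> None:
        pass  # a real system would take a gradient step here


def consent_based_step(policy: Policy, reviewer: AlignedReviewer, prompt: str) -> bool:
    """One RL step, applied only if the reviewer endorses the update."""
    response = policy.generate(prompt)
    reward = reviewer.score(prompt, response)
    if reviewer.consents(prompt, response, reward):
        policy.apply_update(prompt, response, reward)
        return True
    return False  # update vetoed: possible reward hacking or value drift


if __name__ == "__main__":
    policy, reviewer = Policy(), AlignedReviewer()
    applied = sum(consent_based_step(policy, reviewer, f"task {i}") for i in range(10))
    print(f"{applied}/10 updates endorsed")
```

The key design choice this illustrates is that the veto sits inside the training loop rather than in post-hoc evaluation: a reward-hacked signal is discarded before it can shift the policy's values.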
