reward-lens: open-source library ports mechanistic interpretability tools to reward models, with validation on production models revealing linear attribution does not predict causal effects

arXiv cs.LG · April 30, 2026

AI Summary

  • A new open-source library called reward-lens adapts mechanistic interpretability techniques (logit lens, direct logit attribution, activation patching, sparse autoencoders) from generative LLMs to reward models, the networks used in RLHF that output a scalar score rather than text.
  • The library is organized around the reward head's weight vector as the central interpretability axis and provides Reward Lens, component attribution, three-mode activation patching, a reward-hacking probe suite, TopK SAE feature attribution, cross-model comparison, and five theory-grounded extensions; a minimal sketch of the reward-head projection idea follows this list. It supports Llama, Mistral, Gemma-2, and ArmoRM multi-objective heads.
  • Validation on two production reward models across ~695 RewardBench pairs found that linear attribution does not predict causal patching effects (mean Spearman ρ = −0.256 on Skywork, −0.027 on ArmoRM); the second sketch below illustrates this comparison. The framework treats the disagreement as a property to expose rather than a bug.
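
As a rough illustration of how these pieces fit together, the toy PyTorch sketch below applies the logit-lens idea to a scalar reward head: intermediate residual-stream states are read off through a final norm and the reward head's weight vector, and each layer's contribution is scored along that same direction. The tensor shapes, the final_norm helper, and the random stand-in activations are illustrative assumptions, not the reward-lens API.

```python
# Minimal sketch of the "reward lens" idea: project intermediate residual-stream
# states through the scalar reward head instead of an unembedding matrix.
# All names and shapes here are illustrative, not the reward-lens API.
import torch

d_model, n_layers = 64, 8
torch.manual_seed(0)

# Stand-ins for cached activations at the final token position:
# residual_after[l] = residual stream after layer l
residual_after = torch.randn(n_layers, d_model)
# component_out[l] = the contribution layer l's blocks added to the residual stream
component_out = torch.randn(n_layers, d_model)

# Scalar reward head (assumed): reward = w_head . final_norm(residual_final) + b
w_head = torch.randn(d_model)
b_head = torch.tensor(0.1)

def final_norm(x: torch.Tensor) -> torch.Tensor:
    # RMSNorm-style normalization, assumed for a Llama/Mistral-family backbone.
    return x / x.norm(dim=-1, keepdim=True) * (x.shape[-1] ** 0.5)

# Reward Lens analog of the logit lens: read off an intermediate reward estimate
# at every layer by applying the final norm and reward head early.
per_layer_reward = final_norm(residual_after) @ w_head + b_head

# Linear attribution analog of direct logit attribution: score each component's
# residual contribution along the reward head direction (normalization ignored).
linear_attribution = component_out @ w_head

print("per-layer reward readings:", [round(float(r), 2) for r in per_layer_reward])
print("linear attribution per layer:", [round(float(a), 2) for a in linear_attribution])
```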
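
The reported disagreement can be made concrete with a second toy sketch: compute a linear attribution score per layer, separately patch each layer's contribution with one from a counterfactual run, and correlate the two rankings with Spearman ρ. The reward_from_contribs toy model and the chosen/rejected tensors are invented for illustration and have no relation to Skywork, ArmoRM, or RewardBench.

```python
# Hedged sketch of the attribution-vs-patching comparison: does a linear
# attribution score per layer predict the causal effect of patching that layer's
# contribution with one from a counterfactual run? Toy model, not the paper's setup.
import torch
from scipy.stats import spearmanr

d_model, n_layers = 64, 8
torch.manual_seed(1)
w_head = torch.randn(d_model)

def reward_from_contribs(contribs: torch.Tensor) -> torch.Tensor:
    # Toy reward model with a nonlinearity between the summed residual
    # contributions and the reward head, so the two measures can disagree.
    return torch.tanh(contribs.sum(dim=0)) @ w_head

# Per-layer residual contributions for a "chosen" and a "rejected" completion.
chosen = torch.randn(n_layers, d_model)
rejected = torch.randn(n_layers, d_model)

base_reward = reward_from_contribs(chosen)

# Linear attribution: each layer's contribution projected onto the reward head.
linear_attr = (chosen @ w_head).tolist()

# Activation patching: replace one layer's contribution with the rejected run's
# and record how much the reward moves (a causal, not purely linear, measurement).
patch_effect = []
for layer in range(n_layers):
    patched = chosen.clone()
    patched[layer] = rejected[layer]
    patch_effect.append((reward_from_contribs(patched) - base_reward).item())

rho, _ = spearmanr(linear_attr, patch_effect)
print(f"Spearman rho between linear attribution and patching effect: {rho:.3f}")
```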
