reward-lens: open-source library ports mechanistic interpretability tools to reward models, with validation on production models revealing linear attribution does not predict causal effects

arXiv cs.LG · April 30, 2026

AI Summary

  • A new open-source library called reward-lens adapts mechanistic interpretability techniques (logit lens, direct logit attribution, activation patching, sparse autoencoders) from generative LLMs to reward models, the networks used in RLHF that output a scalar score rather than text.
  • The library is organized around the reward head's weight vector as the central interpretability axis and provides Reward Lens, component attribution, three-mode activation patching, a reward-hacking probe suite, TopK SAE feature attribution, cross-model comparison, and five theory-grounded extensions; a minimal sketch of the reward-head projection idea follows this list. It supports Llama, Mistral, Gemma-2, and ArmoRM multi-objective heads.
  • Validation on two production reward models across ~695 RewardBench pairs found that linear attribution does not predict causal patching effects (mean Spearman ρ = −0.256 on Skywork, −0.027 on ArmoRM); the second sketch below illustrates this comparison. The framework treats the disagreement as a property to expose rather than a bug.
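
As a rough illustration of how these pieces fit together, the toy PyTorch sketch below applies the logit-lens idea to a scalar reward head: intermediate residual-stream states are read off through a final norm and the reward head's weight vector, and each layer's contribution is scored along that same direction. The tensor shapes, the final_norm helper, and the random stand-in activations are illustrative assumptions, not the reward-lens API.

```python
# Minimal sketch of the "reward lens" idea: project intermediate residual-stream
# states through the scalar reward head instead of an unembedding matrix.
# All names and shapes here are illustrative, not the reward-lens API.
import torch

d_model, n_layers = 64, 8
torch.manual_seed(0)

# Stand-ins for cached activations at the final token position:
# residual_after[l] = residual stream after layer l
residual_after = torch.randn(n_layers, d_model)
# component_out[l] = the contribution layer l's blocks added to the residual stream
component_out = torch.randn(n_layers, d_model)

# Scalar reward head (assumed): reward = w_head . final_norm(residual_final) + b
w_head = torch.randn(d_model)
b_head = torch.tensor(0.1)

def final_norm(x: torch.Tensor) -> torch.Tensor:
    # RMSNorm-style normalization, assumed for a Llama/Mistral-family backbone.
    return x / x.norm(dim=-1, keepdim=True) * (x.shape[-1] ** 0.5)

# Reward Lens analog of the logit lens: read off an intermediate reward estimate
# at every layer by applying the final norm and reward head early.
per_layer_reward = final_norm(residual_after) @ w_head + b_head

# Linear attribution analog of direct logit attribution: score each component's
# residual contribution along the reward head direction (normalization ignored).
linear_attribution = component_out @ w_head

print("per-layer reward readings:", [round(float(r), 2) for r in per_layer_reward])
print("linear attribution per layer:", [round(float(a), 2) for a in linear_attribution])
```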
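
The reported disagreement can be made concrete with a second toy sketch: compute a linear attribution score per layer, separately patch each layer's contribution with one from a counterfactual run, and correlate the two rankings with Spearman ρ. The reward_from_contribs toy model and the chosen/rejected tensors are invented for illustration and have no relation to Skywork, ArmoRM, or RewardBench.

```python
# Hedged sketch of the attribution-vs-patching comparison: does a linear
# attribution score per layer predict the causal effect of patching that layer's
# contribution with one from a counterfactual run? Toy model, not the paper's setup.
import torch
from scipy.stats import spearmanr

d_model, n_layers = 64, 8
torch.manual_seed(1)
w_head = torch.randn(d_model)

def reward_from_contribs(contribs: torch.Tensor) -> torch.Tensor:
    # Toy reward model with a nonlinearity between the summed residual
    # contributions and the reward head, so the two measures can disagree.
    return torch.tanh(contribs.sum(dim=0)) @ w_head

# Per-layer residual contributions for a "chosen" and a "rejected" completion.
chosen = torch.randn(n_layers, d_model)
rejected = torch.randn(n_layers, d_model)

base_reward = reward_from_contribs(chosen)

# Linear attribution: each layer's contribution projected onto the reward head.
linear_attr = (chosen @ w_head).tolist()

# Activation patching: replace one layer's contribution with the rejected run's
# and record how much the reward moves (a causal, not purely linear, measurement).
patch_effect = []
for layer in range(n_layers):
    patched = chosen.clone()
    patched[layer] = rejected[layer]
    patch_effect.append((reward_from_contribs(patched) - base_reward).item())

rho, _ = spearmanr(linear_attr, patch_effect)
print(f"Spearman rho between linear attribution and patching effect: {rho:.3f}")
```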
