reward-lens: open-source library ports mechanistic interpretability tools to reward models, with validation on production models revealing linear attribution does not predict causal effects
arXiv cs.LG · April 30, 2026
AI Summary
•A new open-source library called reward-lens adapts mechanistic interpretability techniques — logit lens, direct logit attribution, activation patching, sparse autoencoders — from generative LLMs to reward models: networks trained on preference data for RLHF that output a scalar score rather than text.
•The library organizes around the reward head's weight vector as the central interpretability axis and provides Reward Lens, component attribution, three-mode activation patching, a reward-hacking probe suite, TopK SAE feature attribution, cross-model comparison, and five theory-grounded extensions. It supports Llama, Mistral, Gemma-2, and ArmoRM multi-objective heads.
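The central idea — reading intermediate residual-stream states through the reward head's weight vector, logit-lens style — can be sketched roughly as below. All names, shapes, and data here are illustrative stand-ins, not the library's actual API; a real implementation would also apply the model's final layer norm before the projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: per-layer residual-stream states at one sequence's final
# token, plus a reward head that maps the last hidden state to a scalar.
n_layers, d_model = 8, 16
hidden_states = rng.normal(size=(n_layers, d_model))  # [layer, d_model]
w_reward = rng.normal(size=d_model)                   # reward head weight vector
b_reward = 0.1

def reward_lens(states, w, b):
    """Project every layer's hidden state through the reward head,
    yielding a per-layer trajectory of the implied reward score."""
    return states @ w + b

trajectory = reward_lens(hidden_states, w_reward, b_reward)
print(trajectory.shape)  # one scalar reward reading per layer
```

The same weight vector serves as the axis for component attribution: a component's contribution to the reward is its dot product with that vector, which is what makes the reward head the library's organizing object.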
•Validation on two production reward models across ~695 RewardBench pairs found that linear attribution does not predict causal patching effects (mean Spearman ρ = −0.256 on Skywork, −0.027 on ArmoRM), a disagreement the framework treats as a property to expose rather than a bug.
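The headline statistic — a mean per-pair Spearman ρ between linear attribution scores and causal patching effects — can be sketched as follows. Everything below is synthetic: the arrays, the component granularity, and the per-pair aggregation are assumptions for illustration, not the library's API or the paper's data.

```python
import numpy as np

def spearman(x, y):
    # Spearman rho = Pearson correlation of the ranks (no-ties case).
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

rng = np.random.default_rng(1)
n_pairs, n_components = 695, 12   # ~695 RewardBench pairs; component count assumed

# Synthetic stand-ins: per-component linear attribution scores vs. the causal
# effect of patching that component, for each preference pair.
attributions = rng.normal(size=(n_pairs, n_components))
patch_effects = rng.normal(size=(n_pairs, n_components))

rhos = [spearman(attributions[i], patch_effects[i]) for i in range(n_pairs)]
mean_rho = float(np.mean(rhos))
print(f"mean Spearman rho over {n_pairs} pairs: {mean_rho:.3f}")
```

A mean ρ near zero or negative, as reported, means ranking components by linear attribution tells you little about which ones actually move the reward when patched — the disagreement the library is built to surface.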