
Researchers demonstrate input embeddings can neutralize safety-flagged responses in aligned language models

arXiv cs.CL · April 30, 2026

AI Summary

  • A method optimizes input word embeddings at a sub-lexical level to minimize the semantic harmfulness of aligned model responses, using zeroth-order gradient estimation against a black-box text-moderation API followed by gradient descent on the embeddings.
  • The approach works on aligned models, which produce an imbalanced bimodal refuse-or-comply output distribution, rather than the smooth distributions of open-ended text-completion models previously tested.
  • Experiments show the method can neutralize every safety-flagged response on standard safety benchmarks.
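The loop described above can be sketched generically: estimate a gradient of a black-box score from function evaluations alone, then descend on the embeddings. The sketch below uses the standard two-point Gaussian-smoothing estimator; it is not the paper's actual implementation, and `harmfulness_score` is a toy quadratic stand-in for running the aligned model on the embeddings and scoring its output with a moderation API.

```python
import numpy as np

def harmfulness_score(embeddings: np.ndarray) -> float:
    # Toy stand-in for the black-box pipeline: in the paper's setting this
    # would generate text from the embeddings and query a moderation API.
    target = np.full_like(embeddings, 0.1)
    return float(np.mean((embeddings - target) ** 2))

def zeroth_order_grad(f, x, sigma=1e-2, n_samples=64, rng=None):
    """Two-point gradient estimate of f at x using only function values."""
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)          # random probe direction
        delta = f(x + sigma * u) - f(x - sigma * u)
        grad += delta / (2.0 * sigma) * u          # directional-derivative term
    return grad / n_samples

def optimize_embeddings(x0, steps=300, lr=0.05, rng=None):
    """Plain gradient descent on the embeddings using the estimator above."""
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    for _ in range(steps):
        x -= lr * zeroth_order_grad(harmfulness_score, x, rng=rng)
    return x

rng = np.random.default_rng(1)
emb = rng.standard_normal((4, 8))  # 4 tokens, 8-dim embeddings (toy sizes)
opt = optimize_embeddings(emb, rng=rng)
```

After optimization, `harmfulness_score(opt)` is lower than `harmfulness_score(emb)`: the estimator only needs scalar scores from the black box, which is why the approach works even when the moderation model exposes no gradients.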
