
Researchers demonstrate input embeddings can neutralize safety-flagged responses in aligned language models

arXiv cs.CL · April 30, 2026

AI Summary

  • A method optimizes input word embeddings at a sub-lexical level to minimize the semantic harmfulness of aligned model responses, using zeroth-order gradient estimation against a black-box text-moderation API followed by gradient descent on the embeddings.
  • The approach works on aligned models, which produce an imbalanced bimodal refuse-or-comply output distribution, rather than the smooth distributions of open-ended text-completion models previously tested.
  • Experiments show the method can neutralize every safety-flagged response on standard safety benchmarks.
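The loop described above can be sketched generically: estimate a gradient of a black-box score from function evaluations alone, then descend on the embeddings. The sketch below uses the standard two-point Gaussian-smoothing estimator; it is not the paper's actual implementation, and `harmfulness_score` is a toy quadratic stand-in for running the aligned model on the embeddings and scoring its output with a moderation API.

```python
import numpy as np

def harmfulness_score(embeddings: np.ndarray) -> float:
    # Toy stand-in for the black-box pipeline: in the paper's setting this
    # would generate text from the embeddings and query a moderation API.
    target = np.full_like(embeddings, 0.1)
    return float(np.mean((embeddings - target) ** 2))

def zeroth_order_grad(f, x, sigma=1e-2, n_samples=64, rng=None):
    """Two-point gradient estimate of f at x using only function values."""
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)          # random probe direction
        delta = f(x + sigma * u) - f(x - sigma * u)
        grad += delta / (2.0 * sigma) * u          # directional-derivative term
    return grad / n_samples

def optimize_embeddings(x0, steps=300, lr=0.05, rng=None):
    """Plain gradient descent on the embeddings using the estimator above."""
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    for _ in range(steps):
        x -= lr * zeroth_order_grad(harmfulness_score, x, rng=rng)
    return x

rng = np.random.default_rng(1)
emb = rng.standard_normal((4, 8))  # 4 tokens, 8-dim embeddings (toy sizes)
opt = optimize_embeddings(emb, rng=rng)
```

After optimization, `harmfulness_score(opt)` is lower than `harmfulness_score(emb)`: the estimator only needs scalar scores from the black box, which is why the approach works even when the moderation model exposes no gradients.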
