Researchers demonstrate input embeddings can neutralize safety-flagged responses in aligned language models

arXiv cs.CL · April 30, 2026

AI Summary

  • The method optimizes input word embeddings at a sub-lexical level to minimize the semantic harmfulness of an aligned model's responses: harmfulness gradients are estimated zeroth-order through a black-box text-moderation API, then applied via gradient descent on the embeddings.
  • The approach works on aligned models, whose outputs follow an imbalanced, bimodal refuse-or-comply distribution, unlike the smooth output distributions of the open-ended text-completion models tested in prior work.
  • Experiments show the method can neutralize every safety-flagged response on standard safety benchmarks.
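The optimization loop the summary describes can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `harmfulness_score` is a stand-in quadratic for the black-box moderation API, the embedding is a small random vector rather than a real model's input embeddings, and the sample count, smoothing scale, and learning rate are arbitrary. It shows the core mechanic only: estimate the gradient of a black-box score by probing it along random directions (a two-point zeroth-order estimator), then run gradient descent on the embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

def harmfulness_score(embedding):
    # Stand-in for the black-box text-moderation API: any function
    # that maps an embedding to a scalar score. Here, a toy quadratic.
    return float(np.sum(embedding ** 2))

def estimate_gradient(score_fn, emb, sigma=0.01, n_samples=32):
    # Two-point zeroth-order gradient estimate: query the black-box
    # score at emb +/- sigma*u along random Gaussian directions u.
    grad = np.zeros_like(emb)
    for _ in range(n_samples):
        u = rng.standard_normal(emb.shape)
        delta = (score_fn(emb + sigma * u) - score_fn(emb - sigma * u)) / (2 * sigma)
        grad += delta * u
    return grad / n_samples

def optimize_embedding(emb, steps=200, lr=0.05):
    # Plain gradient descent on the input embedding, driving the
    # estimated harmfulness score down.
    for _ in range(steps):
        emb = emb - lr * estimate_gradient(harmfulness_score, emb)
    return emb

emb0 = rng.standard_normal(8)
emb_opt = optimize_embedding(emb0)
print(harmfulness_score(emb0), harmfulness_score(emb_opt))
```

The final score comes out far below the initial one; in the real setting each query would be a moderation-API call, so the sample count per step is the dominant cost.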
