Researchers demonstrate that optimized input embeddings can neutralize safety-flagged responses in aligned language models
arXiv cs.CL · April 30, 2026
AI Summary
•A method optimizes input word embeddings at the sub-lexical level to minimize the semantic harmfulness of an aligned model's responses: it estimates gradients via zeroth-order queries to a black-box text-moderation API, then runs gradient descent on the embeddings.
•The approach works on aligned models, which produce an imbalanced, bimodal refuse-or-comply output distribution, rather than the smooth output distributions of the open-ended text-completion models tested in prior work.
•Experiments show the method can neutralize every safety-flagged response on standard safety benchmarks.
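The core loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `harmfulness_score` is a hypothetical stand-in for the black-box moderation API (here a toy function so the sketch runs end to end), and the perturbation scale, sample count, and learning rate are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

def harmfulness_score(emb):
    # Hypothetical stand-in for the black-box text-moderation API:
    # maps an embedding matrix to a scalar harmfulness score in (0, 1).
    return float(1.0 / (1.0 + np.exp(-emb.sum())))

def zeroth_order_grad(score_fn, emb, sigma=1e-2, n_samples=16):
    # Zeroth-order gradient estimate: probe the black box with random
    # Gaussian perturbations and average symmetric finite differences.
    grad = np.zeros_like(emb)
    for _ in range(n_samples):
        u = rng.standard_normal(emb.shape)
        delta = score_fn(emb + sigma * u) - score_fn(emb - sigma * u)
        grad += (delta / (2.0 * sigma)) * u
    return grad / n_samples

def optimize_embeddings(emb, score_fn, lr=0.5, steps=50):
    # Gradient descent on the input embeddings themselves (sub-lexical:
    # the optimized vectors need not correspond to any vocabulary token).
    for _ in range(steps):
        emb = emb - lr * zeroth_order_grad(score_fn, emb)
    return emb

emb = rng.standard_normal((4, 8))  # e.g. 4 input tokens, 8-dim embeddings
before = harmfulness_score(emb)
after = harmfulness_score(optimize_embeddings(emb, harmfulness_score))
```

Because the gradient is estimated purely from score queries, the moderation API can remain a closed black box; only the embedding layer of the target model needs to be accessible.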