Adversarial Poetry Can Jailbreak LLMs — Even in a Single Prompt
The paper shows that rewriting harmful or restricted requests as poetry (“adversarial poetry”) can reliably bypass safety mechanisms in large language models (LLMs). Across 25 state-of-the-art models, hand-crafted poetic prompts achieved an average “attack success rate” of 62%, with some models exceeding 90% success.
Even when standard harmful prompts from a broad safety benchmark were converted automatically into verse, the poetic versions produced up to 18× higher success rates than their prose originals — showing this vulnerability isn’t limited to a few handcrafted examples.
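For concreteness, here is a minimal sketch of how an attack-success-rate comparison of this kind could be computed. The names used (query_model, judge_harmful, the prompt lists) are hypothetical placeholders, not the paper's actual evaluation harness.

```python
# Sketch of an "attack success rate" (ASR) comparison between prose prompts
# and their poetic rewrites. query_model and judge_harmful are assumed
# placeholders for the model under test and a harmfulness judge.

def query_model(prompt: str) -> str:
    """Placeholder: send a prompt to the model under test, return its reply."""
    raise NotImplementedError

def judge_harmful(response: str) -> bool:
    """Placeholder: return True if the reply complies with the harmful request."""
    raise NotImplementedError

def attack_success_rate(prompts: list[str]) -> float:
    """Fraction of prompts whose replies are judged as successful attacks."""
    successes = sum(judge_harmful(query_model(p)) for p in prompts)
    return successes / len(prompts)

# Hypothetical usage: compare the same requests in prose vs. verse form.
# prose_prompts  = [...]   # original benchmark prompts
# poetic_prompts = [...]   # the same requests rewritten as poetry
# ratio = attack_success_rate(poetic_prompts) / attack_success_rate(prose_prompts)
```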
The authors argue that poetic style — its metaphors, rhythm, and unconventional structure — alone suffices to evade guardrails across many risk domains (cyber-offense, harmful manipulation, dangerous instructions, etc.), exposing a systemic weakness in current alignment and safety-evaluation practices.