Adversarial Poetry Can Jailbreak LLMs — Even in a Single Prompt

The paper shows that rewriting harmful or restricted requests as poetry ("adversarial poetry") can reliably bypass safety mechanisms in large language models (LLMs). Across 25 state-of-the-art models, hand-crafted poetic prompts achieved an average attack success rate (ASR) of 62%, with some models exceeding 90%.

Even when standard harmful prompts from a broad safety benchmark were converted into verse automatically, the poetic versions produced success rates up to 18× higher than their prose originals, showing that the vulnerability is not limited to a few handcrafted examples.
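
To make the quantitative comparison concrete, here is a minimal sketch of how an attack success rate and the verse-vs-prose improvement factor could be computed from judged model outputs. The judging step itself (deciding whether a response is harmful content or a refusal) follows the paper's evaluation protocol; the function names and sample data below are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch (not the authors' code): computing attack success rate (ASR)
# and the verse-vs-prose improvement factor from already-judged outputs.
# Each judgement is assumed to be a boolean: True if the model produced the
# restricted content, False if it refused or deflected.

def attack_success_rate(judgements: list[bool]) -> float:
    """Fraction of prompts for which the attack succeeded."""
    return sum(judgements) / len(judgements) if judgements else 0.0

# Hypothetical judged results for the same requests phrased as prose and as verse.
prose_judgements = [False, False, True, False, False, False, False, False]
verse_judgements = [True, True, True, False, True, False, True, False]

prose_asr = attack_success_rate(prose_judgements)
verse_asr = attack_success_rate(verse_judgements)

# The "N x higher" figure reported per model is simply this ratio.
improvement = verse_asr / prose_asr if prose_asr > 0 else float("inf")
print(f"prose ASR={prose_asr:.2f}, verse ASR={verse_asr:.2f}, improvement={improvement:.1f}x")
```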

The authors argue that poetic style alone (its metaphors, rhythm, and unconventional structure) suffices to evade guardrails across many risk domains, including cyber-offense, harmful manipulation, and dangerous instructions, exposing a systemic weakness in current alignment and safety-evaluation practices.

https://arxiv.org/html/2511.15304v1
