AI Agents May Always Fall for Prompt Injections
This academic paper from arXiv (May 17, 2026) by Abdelnabi and Bagdasarian argues that prompt injection, the most critical vulnerability in deployed AI agents, may be impossible to fully prevent. The authors challenge the prevailing defense paradigm of data-instruction separation, showing that current injection classifiers perform at near-chance levels (AUROC 0.43–0.59) when attacks operate through contextual manipulation rather than explicit injection vocabulary. They recast prompt injection through the lens of Contextual Integrity (CI), a privacy theory that judges information flow compliance with contextual norms defined by five parameters: sender, receiver, subject, information type, and transmission principle. Using this framework, they demonstrate three classes of failures: (1) attacks that corrupt parameter inference (e.g., fabricating user quotes or prior approvals) achieving 96.7% success against an email assistant, (2) norm grounding failures where agents execute out-of-scope requests 29.9–36.2% of the time without interaction history, and (3) flow separation failures where agents collapse authorization across simultaneous information flows in up to 65% of cases. The authors present an impossibility argument: an adversary can always construct a context where a blocked flow appears legitimate, or a defender who tightens norms will block genuinely legitimate flows. They conclude that current safety training may degrade both security and utility, and advocate for CI-grounded red-teaming and layered architectures that verify claims against ground truth. Experiments cover frontier models including GPT-5.4, Claude Sonnet 4-6, Gemini 3-Pro, and Meta SecAlign.
Comments
Post a Comment