Prompt Injection as Role Confusion

This ICML 2026 paper by Ye, Cui, and Hadfield-Menell presents a theory that prompt injection attacks succeed because LLMs perceive roles (like user, assistant, tool, think) through insecure surface features like writing style rather than through the secure structural tags themselves. Using role probes to measure internal token perceptions, the authors demonstrate that sounding like a privileged role (e.g., mimicking reasoning style) overrides the actual tag, enabling attacks like CoT Forgery where fake reasoning in a user prompt achieves a ~60% success rate, while simply destyling the text drops success to 10%. They argue that roles are a hacked-together format trick that became critical cognitive and security infrastructure, and that unless models achieve genuine role perception, defense will remain a whack-a-mole game, opening the door to subtler threats like subconscious steering of LLM states for commercial or adversarial purposes. 

https://role-confusion.github.io/

Comments

Popular posts from this blog

Prompt Engineering Demands Rigorous Evaluation

SecObserve: Simplified Vulnerability and License Management for CI/CD Pipelines

OWASP ZAP 2.16.0 Introduces Key Updates and Enhancements