Prompt Injection as Role Confusion
This ICML 2026 paper by Ye, Cui, and Hadfield-Menell presents a theory that prompt injection attacks succeed because LLMs perceive roles (like user, assistant, tool, think) through insecure surface features like writing style rather than through the secure structural tags themselves. Using role probes to measure internal token perceptions, the authors demonstrate that sounding like a privileged role (e.g., mimicking reasoning style) overrides the actual tag, enabling attacks like CoT Forgery where fake reasoning in a user prompt achieves a ~60% success rate, while simply destyling the text drops success to 10%. They argue that roles are a hacked-together format trick that became critical cognitive and security infrastructure, and that unless models achieve genuine role perception, defense will remain a whack-a-mole game, opening the door to subtler threats like subconscious steering of LLM states for commercial or adversarial purposes.
Comments
Post a Comment