Un-Jailbreakable AI Doesn't Exist, But Open, Neural-Symbolic AI Gets Closest

Perfectly "un-jailbreakable" AI models don't exist, and aiming for one is unrealistic. The best way to get close is neural-symbolic AI combined with open-source models, not closed proprietary systems.


The real threat isn't simple prompt injection but "capability-elicitation attacks," in which an AI that is only following instructions is gradually coaxed, over hundreds of prompts, into producing something dangerous.


The solution: a "generate, then verify" pipeline. Let the neural model generate outputs, but quarantine risky ones and pass them through a symbolic verification layer (formal analyzers, sandboxes, logic engines) that rigorously judges what the output actually does. This is more reliable than asking another LLM to check the first one's work.
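
As a rough illustration of the "generate, then verify" idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration and not taken from the linked post: the deny-lists, the `looks_risky` triage heuristic, and the function names are all hypothetical. Python's `ast` module stands in for the much heavier formal analyzers and sandboxes a real symbolic layer would use.

```python
import ast
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical deny-lists; a real verifier would rely on far richer analysis.
FORBIDDEN_IMPORTS = {"socket", "subprocess", "ctypes"}
FORBIDDEN_CALLS = {"eval", "exec", "system", "popen"}
RISK_MARKERS = ("import ", "(", "exec", "eval")


@dataclass
class Verdict:
    allowed: bool
    reasons: list


def looks_risky(output: str) -> bool:
    """Cheap triage: quarantine anything that looks like executable code."""
    return any(marker in output for marker in RISK_MARKERS)


def symbolic_verify(source: str) -> Verdict:
    """Judge what the code does by inspecting its syntax tree, not its wording."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        # Fail closed: quarantined output that cannot be analyzed is blocked.
        return Verdict(False, [f"unparseable: {exc.msg}"])
    reasons = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in FORBIDDEN_IMPORTS:
                    reasons.append(f"imports {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in FORBIDDEN_IMPORTS:
                reasons.append(f"imports from {node.module}")
        elif isinstance(node, ast.Call):
            func = node.func
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name in FORBIDDEN_CALLS:
                reasons.append(f"calls {name}")
    return Verdict(not reasons, reasons)


def generate_then_verify(generate: Callable[[str], str], prompt: str):
    """Run the generator, then release its output only if it passes verification."""
    output = generate(prompt)
    if not looks_risky(output):
        return output, Verdict(True, [])
    verdict = symbolic_verify(output)
    released: Optional[str] = output if verdict.allowed else None
    return released, verdict
```

Calling `generate_then_verify(model, prompt)` with any function that maps a prompt to text returns the released output (or `None` if blocked) together with the reasons. The design choice worth noting is that the verifier is deterministic and fails closed, so rephrasing the prompt does not change its verdict on the same artifact.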


Why openness helps: An open ecosystem can field a diverse ensemble of specialist verifiers, far more than any single company could build on its own. Independent verifiers with different blind spots make the system harder to game, because an attack has to slip past all of them at once. While bad actors can still download open models, gating the main public deployment shapes most usage toward legitimate defenders, who then outpace attackers.
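
The ensemble argument can be sketched in a few lines of Python. This is only a toy under stated assumptions: the three verifiers below are hypothetical string checks invented for illustration, and the unanimity rule is one possible way to combine them, not something the linked post specifies.

```python
from typing import Callable, Dict, List, Tuple

Verifier = Callable[[str], bool]  # returns True when the artifact passes


# Hypothetical specialist verifiers, each with a different blind spot.
def no_shell_calls(artifact: str) -> bool:
    return "os.system" not in artifact and "subprocess" not in artifact


def no_network_access(artifact: str) -> bool:
    return "socket" not in artifact and "http://" not in artifact


def within_size_limit(artifact: str) -> bool:
    return len(artifact) < 10_000


VERIFIERS: Dict[str, Verifier] = {
    "shell": no_shell_calls,
    "network": no_network_access,
    "size": within_size_limit,
}


def ensemble_verdict(artifact: str, verifiers: Dict[str, Verifier]) -> Tuple[bool, List[str]]:
    """Release only on unanimous approval; report which verifiers objected."""
    objections = [name for name, check in verifiers.items() if not check(artifact)]
    return (not objections, objections)
```

For example, `ensemble_verdict("import socket", VERIFIERS)` is rejected by the network verifier even though the shell and size verifiers pass it. Requiring unanimity means an attacker has to find an artifact in the intersection of every verifier's blind spots, which shrinks as independently built verifiers are added.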


Caveat: This works well for defense-symmetric domains like cybersecurity. For offense-dominant, irreversible threats like biowarfare, no architecture makes this easy.


Bottom line: For most practical threats, openness plus neural-symbolic AI is safer than closed, centralized models.

https://bengoertzel.substack.com/p/un-jailbreakable-ai-models-arent
