ExploitBench – Real exploitation is a ladder

ExploitBench is a benchmark from Carnegie Mellon University that measures how far AI agents climb the exploitation ladder: reaching the vulnerable code (T5, coverage), triggering the bug (T4, reproduction), building target-specific primitives (T3), building generic arbitrary read/write primitives (T2), and achieving full arbitrary code execution (T1). The first benchmark, v8-bench, targets V8 (the JavaScript and WebAssembly engine inside Chrome, Edge, Node.js, and Cloudflare Workers) with the V8 security sandbox enabled, and tests agents against 41 CVEs. Grading is deterministic, with no LLM-as-judge. AutoNudge is a component that automatically reminds stalled models to continue working, with no human in the loop. The benchmark also includes an MCP server integration for Claude Code.

As of May 18, 2026, the leaderboard shows Claude Mythos Preview (with and without AutoNudge) at mean capability scores of 9.90/16 and 9.55/16, and GPT-5.5 (Codex) at 5.51. Mythos Preview reached Tier 1 (full arbitrary code execution) on 21 of 41 CVEs (51%), while GPT-5.5 reached Tier 1 on 2 CVEs. Claude Opus 4.7 with AutoNudge escaped the V8 sandbox to reach Tier 2 on one CVE. The cheapest full ACE run cost $14 (GPT-5.5/Codex on one CVE), while a typical full ACE run for Mythos Preview costs around $220 (range $72-$360).

The project is by Seunghyun Lee and Prof. David Brumley.
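To make the ladder concrete, here is a minimal Python sketch of the five tiers and of how a Tier 1 rate such as "21 of 41" works out. This is not ExploitBench's own code or data model: the `Tier` enum, the `tier1_rate` helper, and the per-CVE result list are all hypothetical, and the non-T1 entries are placeholders chosen only to fill out 41 CVEs.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Ladder rungs as described in the post; a lower number means deeper exploitation."""
    T1_CODE_EXECUTION = 1      # full arbitrary code execution
    T2_GENERIC_READ_WRITE = 2  # generic arbitrary read/write primitives
    T3_TARGET_PRIMITIVES = 3   # target-specific primitives
    T4_REPRODUCTION = 4        # bug triggered
    T5_COVERAGE = 5            # vulnerable code reached


def tier1_rate(best_tiers):
    """Return the fraction of CVEs whose best reached tier is T1."""
    hits = sum(1 for tier in best_tiers if tier == Tier.T1_CODE_EXECUTION)
    return hits / len(best_tiers)


# Hypothetical outcomes: 21 CVEs at T1 (the reported figure) plus 20 placeholders.
results = [Tier.T1_CODE_EXECUTION] * 21 + [Tier.T3_TARGET_PRIMITIVES] * 20

print(f"Tier 1 on {tier1_rate(results):.0%} of {len(results)} CVEs")
```

Modelling the tiers as an ordered enum keeps comparisons such as "at least T2" a simple integer check, which suits a deterministic grader.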

https://exploitbench.ai/
