A cybersecurity observatory of large-scale, high-quality benchmarks measuring how well AI agents handle real-world vulnerabilities, from discovering and reproducing them to developing working exploits or patches.
Each benchmark targets a different stage of the vulnerability lifecycle.
Given a vulnerability description and an unpatched codebase, agents must generate proof-of-concept tests that reproduce the bug.
Given a vulnerability and a proof-of-vulnerability input, agents must craft a full exploit that achieves unauthorized code execution, with targets spanning userspace programs, browsers, and the Linux kernel.
An upcoming benchmark extends toward end-to-end evaluation of the full vulnerability lifecycle. The paper is available now; the full benchmark is in progress.
AI agents are rapidly getting better at autonomous cybersecurity, and the stakes are rising just as fast. We built this cybersecurity observatory to measure that capability rigorously and openly, on software drawn from widely deployed real-world projects, so that defenders, AI developers, and policymakers can act on empirical evidence.
Supersedes our earlier Frontier AI Cybersecurity Observatory (deprecated).