AI is evolving at an unprecedented pace, making it increasingly difficult to anticipate its societal impacts and risks. Recent benchmarks show that AI agents can already take on real-world cybersecurity tasks, including discovering and exploiting zero-day vulnerabilities. In cybersecurity, AI plays a dual role, strengthening both offensive and defensive capabilities.
To keep pace with these developments, we built this observatory to continuously and openly track AI's cybersecurity capabilities across the stages of attack and defense, so that developers, researchers, and policymakers can stay up to date.
Have suggestions to improve the observatory? We are actively gathering feedback from the community and would greatly value your input. Please share your suggestions here.
Each benchmark targets a different stage of the vulnerability lifecycle.
Given a vulnerability description and an unpatched codebase, agents must generate proof-of-concept tests that reproduce the bug.
Given a vulnerability and a proof-of-vulnerability input, agents must craft a full exploit that achieves unauthorized code execution, with targets spanning userspace programs, browsers, and the Linux kernel.
This benchmark extends toward end-to-end evaluation of the full vulnerability lifecycle. The paper is available now; the full benchmark is in progress.