CyberGym is a family of large-scale, high-quality benchmarks that measure how well AI agents handle real vulnerabilities, from discovering and reproducing them to developing working exploits or patches.
Each benchmark targets a different stage of the vulnerability lifecycle.
Given a vulnerability description and an unpatched codebase, agents must generate proof-of-concept tests that reproduce the bug.
Given a vulnerability and a proof-of-vulnerability input, agents must craft a full exploit that achieves unauthorized code execution in userspace, browser, or Linux kernel targets.
A further benchmark extends the series toward end-to-end evaluation of the full vulnerability lifecycle. The paper is available now; the full benchmark is in progress.
AI agents are rapidly getting better at autonomous cybersecurity, and the stakes are rising fast. We built the CyberGym series to measure that capability rigorously and openly, on real-world software drawn from widely deployed projects, so that defenders, AI developers, and policymakers can act on concrete evidence.