How capable are AI agents at real-world cybersecurity?

CyberGym is a family of large-scale, high-quality benchmarks that measure how well AI agents handle real vulnerabilities, from discovering and reproducing them to developing working exploits or patches.

The benchmarks

Each benchmark targets a different stage of the vulnerability lifecycle.

Why we built this

AI agents are rapidly getting better at autonomous cybersecurity tasks, and the stakes are rising with them. We built the CyberGym series to measure that capability rigorously and openly, on real-world software drawn from widely deployed projects, so that defenders, AI developers, and policymakers can act on evidence.