7 Shocking Cases Where AI Agents Create Exploits

The ExploitGym Benchmark: A New Reality for Cybersecurity

For years, security researchers have debated whether artificial intelligence can do more than find software bugs: can it turn those flaws into working exploits that actually execute code? New research from a coalition of universities and AI labs suggests the answer is yes. The ExploitGym benchmark shows that frontier AI agents can weaponize vulnerabilities. This is not a theoretical exercise. According to the data, AI agents create exploits that work against real software, in some cases even with defenses such as ASLR and the V8 sandbox enabled.

The benchmark, built by computer scientists from UC Berkeley, the Max Planck Institute for Security and Privacy, UC Santa Barbara, Arizona State University, Anthropic, OpenAI, and Google, consists of 898 real vulnerabilities drawn from applications, Google’s V8 JavaScript engine, and the Linux kernel. Each agent is given a vulnerability and a proof-of-concept input that triggers it. The agent then has two hours to produce a working exploit capable of arbitrary code execution. The results are sobering.

The following seven findings from the ExploitGym paper demonstrate that the era of autonomous exploit generation has arrived. Each case highlights a different dimension of the capability.

1. Mythos Preview Exploited 157 Vulnerabilities in Two Hours

Anthropic’s Mythos Preview, paired with the Claude Code agent, achieved the highest raw count: 157 test instances exploited within the two-hour window, made up of 107 userspace vulnerabilities, 38 V8 bugs, and 12 kernel flaws. No other model tested exploited as many. The spread across target types suggests that the capability is not confined to one class of software. For any organization that treats a known vulnerability as low-risk because no public exploit exists, this is a wake-up call. The agent did not just find bugs; it turned them into working attacks within hours.
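The breakdown above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that assumes the 157 successes are counted against the full 898-instance benchmark; the variable and function names are illustrative and do not come from the paper.

```python
# Figures quoted in the article; names here are illustrative only.
BENCHMARK_SIZE = 898
mythos_by_category = {"userspace": 107, "v8": 38, "kernel": 12}


def total_exploited(by_category):
    """Sum the per-category success counts."""
    return sum(by_category.values())


def success_rate(successes, total):
    """Fraction of benchmark instances that were exploited."""
    return successes / total


mythos_total = total_exploited(mythos_by_category)
mythos_rate = success_rate(mythos_total, BENCHMARK_SIZE)
print(f"{mythos_total} of {BENCHMARK_SIZE} ({mythos_rate:.1%})")
```

Under that assumption, the three categories add up to the reported total, and the overall success rate comes out a little under one in five.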

2. GPT-5.5 Managed 120 Successful Exploits, Including Kernel-Level Attacks

OpenAI’s GPT-5.5, running through the Codex CLI agent, secured 120 successful exploits. Notably, 22 of those were kernel exploits, the highest kernel count among all models tested. Kernel vulnerabilities are notoriously difficult to weaponize because they require deep understanding of operating system internals. GPT-5.5’s ability to produce 22 kernel exploits indicates a sophisticated grasp of low-level memory management and privilege escalation. The total cost for these 120 runs was $22.99, a fraction of what a human penetration tester would charge for a single kernel exploit.
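The per-exploit cost implied by these figures is easy to derive. The sketch below assumes the $22.99 total covers exactly the 120 successful exploits, which is how the article phrases it; the variable names are illustrative.

```python
# Figures quoted in the article for GPT-5.5 with the Codex CLI agent.
total_cost_usd = 22.99
successful_exploits = 120
kernel_exploits = 22

# Average spend per successful exploit, and the share that hit the kernel.
cost_per_exploit = total_cost_usd / successful_exploits
kernel_share = kernel_exploits / successful_exploits

print(f"${cost_per_exploit:.2f} per exploit, {kernel_share:.1%} kernel")
```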

3. Exploits Bypassed ASLR and V8 Sandbox Defenses

Standard security mitigations did not stop the agents. The researchers reported that a meaningful number of exploits still worked even when Address Space Layout Randomization (ASLR) was enabled. Similarly, the V8 sandbox, designed to isolate JavaScript engine exploits, was bypassed in several cases. This is particularly alarming because ASLR and sandboxing are considered foundational defenses in modern operating systems and browsers. If AI agents can routinely sidestep these protections, then the security community must rethink the effectiveness of current mitigation strategies.

4. Agents Discovered and Exploited Different Vulnerabilities Than Intended

One of the most surprising findings was that agents sometimes went off-script. Rather than building on the provided proof-of-concept input, they occasionally found a different vulnerability in the same software and exploited that one, still achieving arbitrary code execution. This behavior suggests that the models are not just following instructions but are actively probing the target environment. For defenders, this means that even if a specific vulnerability is patched, an AI agent might find an adjacent weakness to exploit.

5. In CTF Exercises, Agents Often Solved Challenges Without Using the Intended Bug

Capture-the-flag (CTF) environments simulate real-world hacking challenges in which agents must find and retrieve hidden flags. Mythos Preview succeeded in 226 CTF exercises but used the intended bug in only 157 of them. GPT-5.5 captured 210 flags but used the intended bug in only 120 cases. In other words, both agents frequently solved the challenge through alternative means. They discovered side channels, misconfigurations, or other unintended paths to the flag. This demonstrates that frontier models possess a kind of creative problem-solving ability that goes beyond rote exploitation. They can adapt when the expected path fails.
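To make the gap concrete, the share of flags captured through the intended bug can be computed from the numbers above. This is a small illustrative calculation; the dictionary layout and function name are assumptions, not taken from the paper.

```python
# (flags captured, flags captured via the intended bug), as quoted above.
ctf_results = {
    "Mythos Preview": (226, 157),
    "GPT-5.5": (210, 120),
}


def intended_share(captured, via_intended):
    """Fraction of captured flags that used the intended bug."""
    return via_intended / captured


for model, (captured, via_intended) in ctf_results.items():
    unintended = captured - via_intended
    share = intended_share(captured, via_intended)
    print(f"{model}: {share:.1%} intended, {unintended} via other paths")
```

On these figures, roughly three in ten of Mythos Preview's flags and about four in ten of GPT-5.5's came from a path the challenge authors did not intend.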

6. Older Models Like Opus 4.6 and Gemini 3.1 Pro Also Produced Working Exploits

It is not just the newest models that pose a threat. Claude Opus 4.6, released in February 2025, managed 15 successful exploits. Gemini 3.1 Pro from Google achieved 12. While these numbers are modest compared to Mythos Preview, they still represent functional exploits against real-world vulnerabilities. The fact that models from just a few months ago can weaponize bugs means that the capability is not exclusive to cutting-edge systems. As open-source models improve, this ability will likely become available to a wider range of actors. The barrier to entry for autonomous cyberattacks is lowering rapidly.

7. Different AI Models Found Different Sets of Exploits

The researchers observed that there was only partial overlap in the exploits discovered by different models. Each model tended to find a unique subset of vulnerabilities that it could weaponize. This suggests that relying on a single AI agent for defensive testing could leave blind spots. An organization that only uses one model to scan for exploitable bugs might miss the ones that another model would find. The paper recommends using a diverse set of models for security assessments. This finding also implies that malicious actors could combine multiple AI agents to cover more ground.
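The partial-overlap point can be illustrated with plain set operations. The model names and vulnerability IDs below are hypothetical placeholders; the real per-model result sets are in the paper and are not reproduced here.

```python
# Hypothetical IDs for illustration only; not results from ExploitGym.
exploited_by_model = {
    "model_a": {"vuln-01", "vuln-02", "vuln-03", "vuln-04"},
    "model_b": {"vuln-03", "vuln-04", "vuln-05"},
    "model_c": {"vuln-04", "vuln-06"},
}


def combined_coverage(results):
    """Everything at least one model managed to exploit."""
    return set().union(*results.values())


def unique_to_each(results):
    """For each model, the findings that no other model produced."""
    unique = {}
    for name, found in results.items():
        others = set().union(
            *(s for other, s in results.items() if other != name)
        )
        unique[name] = found - others
    return unique


coverage = combined_coverage(exploited_by_model)
unique = unique_to_each(exploited_by_model)
print(len(coverage), {name: sorted(ids) for name, ids in unique.items()})
```

In this toy data, no single model covers more than four of the six findings, which is the kind of blind spot a single-model assessment would leave.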

How AI Agents Create Exploits: The Technical Process

To understand why these results are so startling, it helps to look at how the agents work. Each agent consists of a language model paired with a command-line interface (CLI) that can execute code, read files, and interact with the target environment. The agent receives a vulnerability report and a proof-of-concept input. It then attempts to craft an exploit that achieves arbitrary code execution. The agent can iterate by writing code, compiling it, testing it, and debugging failures. This process mirrors what a human security researcher does, but at machine speed.
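The iterate-until-it-works cycle described above can be sketched as a generic loop with a time budget. This is not the ExploitGym harness: `run_agent_loop`, `propose`, and `evaluate` are hypothetical names, and the toy stand-ins below only show the control flow of trying something, reading the feedback, and trying again.

```python
import time
from dataclasses import dataclass


@dataclass
class Attempt:
    candidate: str
    feedback: str
    succeeded: bool


def run_agent_loop(propose, evaluate, budget_seconds, clock=time.monotonic):
    """Repeat propose -> evaluate until success or the time budget runs out.

    propose(history) returns the next candidate, informed by past attempts.
    evaluate(candidate) returns (succeeded, feedback).
    """
    history = []
    deadline = clock() + budget_seconds
    while clock() < deadline:
        candidate = propose(history)
        succeeded, feedback = evaluate(candidate)
        history.append(Attempt(candidate, feedback, succeeded))
        if succeeded:
            break
    return history


# Toy stand-ins: the third attempt passes the (fake) check.
def toy_propose(history):
    return f"attempt-{len(history) + 1}"


def toy_evaluate(candidate):
    succeeded = candidate == "attempt-3"
    return succeeded, "ok" if succeeded else "failed check"


history = run_agent_loop(toy_propose, toy_evaluate, budget_seconds=5.0)
print([(a.candidate, a.succeeded) for a in history])
```

In the benchmark as described, the budget would be two hours and the evaluation step would be an actual test against the target; here both are reduced to placeholders.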

The ExploitGym benchmark measures success by whether the agent’s exploit actually runs arbitrary code on the target. It does not accept partial results. The agents had to produce fully functional exploits to count as a success. The fact that Mythos Preview achieved 157 such successes shows that the models can handle the entire exploitation pipeline, from understanding the vulnerability to writing reliable shellcode.

The Role of Safety Guardrails

The tests were conducted with security guardrails disabled. When default safety filters were enabled, GPT-5.5 refused to participate 88.2% of the time. This indicates that the models themselves can be steered away from harmful behavior if properly configured. However, the researchers point out that safeguards of that sort have limits. Malicious actors can remove or bypass them. Moreover, the models are being sold to government partners who may have different intentions. The tension between selling capable AI and ensuring it is not misused remains unresolved.

The Broader Implications When AI Agents Create Exploits

The paper’s authors state plainly: autonomous exploit development by frontier AI agents is no longer a hypothetical capability. They already exploit a non-trivial fraction of real-world vulnerabilities. This changes the threat landscape in several fundamental ways.

Scale of attacks. Human exploit developers are rare and expensive. AI agents can work around the clock, testing thousands of vulnerabilities at minimal cost. The cost per successful exploit for GPT-5.5 was roughly $0.19 ($22.99 spread across 120 successes). That price point makes automated exploitation accessible to small groups or even individuals.

Speed of weaponization. The two-hour time limit in ExploitGym is arbitrary. In practice, agents could be given longer windows or multiple attempts. The speed at which they produce working exploits means that zero-day vulnerabilities could be weaponized within hours of discovery, leaving defenders little time to react.

Diversity of targets. The 898 vulnerabilities in ExploitGym cover applications, browsers, and kernels. The agents succeeded across all categories. No software stack is immune. Open-source libraries, which are used by millions of applications, are particularly exposed because their source code is public and may appear in model training data.

What Can Organizations Do to Defend Against AI-Generated Exploits?

While the news is alarming, there are practical steps that security teams can take. These defenses are not theoretical; they are based on the patterns observed in ExploitGym.

Adopt Multi-Model Vulnerability Scanning

Since different AI models find different exploits, organizations should use a diverse set of agents for their own red-teaming. Running both Mythos-class and GPT-class models against internal systems can reveal blind spots that a single model would miss. The cost of running such scans is negligible compared to the cost of a breach.

Strengthen Runtime Defenses Beyond ASLR and Sandboxing

The fact that AI agents bypassed ASLR and the V8 sandbox means these defenses are no longer sufficient on their own. Organizations should implement additional layers such as Control Flow Integrity (CFI), stack canaries, and hardware-based memory tagging. Runtime application self-protection (RASP) tools can detect exploit attempts by monitoring abnormal behavior.
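Before layering on further mitigations, it can help to confirm that the baseline ones are actually switched on. The sketch below reads the Linux ASLR setting from `/proc/sys/kernel/randomize_va_space`, where 0 means disabled, 1 means partial randomization, and 2 means full randomization. The helper names are illustrative, and the check applies only to Linux hosts.

```python
from pathlib import Path

# Linux sysctl that controls address space layout randomization.
ASLR_SYSCTL = Path("/proc/sys/kernel/randomize_va_space")

ASLR_MODES = {
    0: "disabled",
    1: "partial (stack, mmap, vDSO)",
    2: "full (adds heap/brk)",
}


def describe_aslr(raw_value):
    """Map the raw sysctl text (e.g. '2\\n') to a readable description."""
    mode = int(raw_value.strip())
    return ASLR_MODES.get(mode, f"unknown mode {mode}")


def read_aslr_mode(path=ASLR_SYSCTL):
    """Return the host's ASLR description, or None if it cannot be read."""
    try:
        return describe_aslr(path.read_text())
    except OSError:
        return None


print(read_aslr_mode())
```

A result other than the full mode is worth investigating before relying on any of the additional layers listed above.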

Invest in Automated Patching and Vulnerability Remediation

If AI can weaponize a vulnerability within hours, then the window for patching has shrunk dramatically. Manual patch management cycles of weeks or months are no longer acceptable. Organizations need automated patch deployment pipelines that can push fixes within hours of a disclosure. This requires robust CI/CD security practices and a culture of rapid response.
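One way to make the shrinking patch window measurable is to track how long each disclosed vulnerability stays unpatched and flag anything past a target. The sketch below is an assumption-laden illustration: the 24-hour target, the record layout, and the `VULN-*` identifiers are all hypothetical.

```python
from datetime import datetime, timedelta


def hours_exposed(disclosed_at, patched_at, now):
    """Hours between disclosure and patch (or now, if still unpatched)."""
    end = patched_at if patched_at is not None else now
    return (end - disclosed_at) / timedelta(hours=1)


def breaches_window(records, now, max_hours):
    """IDs whose exposure exceeded the target window."""
    return [
        vuln_id
        for vuln_id, disclosed_at, patched_at in records
        if hours_exposed(disclosed_at, patched_at, now) > max_hours
    ]


# Hypothetical records: (id, disclosed, patched or None if still open).
records = [
    ("VULN-A", datetime(2025, 1, 1, 9), datetime(2025, 1, 1, 12)),
    ("VULN-B", datetime(2025, 1, 1, 9), datetime(2025, 1, 8, 9)),
    ("VULN-C", datetime(2025, 1, 1, 9), None),
]
now = datetime(2025, 1, 2, 15)

late = breaches_window(records, now, max_hours=24)
print(late)
```

Here the week-long patch cycle and the still-open issue both exceed the target, while the three-hour fix does not.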

Use AI for Defensive Purposes

The same technology that powers offensive agents can be turned to defense. AI-driven fuzzing, static analysis, and exploit detection can help find vulnerabilities before they are weaponized. The key is to integrate defensive AI into the software development lifecycle. Tools that analyze code changes for potential exploitability can catch bugs early.
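As a point of reference for what automated bug-finding looks like at its simplest, here is a minimal random fuzzer run against a toy parser. It is plain random fuzzing, not the AI-driven kind discussed above, and both `parse_record` (with its deliberate bounds-check bug) and `fuzz` are hypothetical examples written for this sketch.

```python
import random


def parse_record(data):
    """Toy parser with a deliberate bug: it trusts the length byte."""
    if not data:
        raise ValueError("empty input")
    length = data[0]
    payload = data[1:1 + length]
    checksum = data[1 + length]  # bug: no bounds check before indexing
    if checksum != sum(payload) % 256:
        raise ValueError("bad checksum")
    return payload


def fuzz(target, iterations, seed=0, max_len=8):
    """Feed random byte strings to target and collect unexpected failures."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(iterations):
        size = rng.randrange(max_len + 1)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except ValueError:
            continue  # expected rejection of malformed input
        except Exception as exc:  # anything else counts as a finding
            crashes.append((data, type(exc).__name__))
    return crashes


crashes = fuzz(parse_record, iterations=200)
print(len(crashes), "unexpected failures")
```

Running a harness like this on every code change is one low-cost way to catch this class of bug early; AI-assisted tools aim to reach deeper states than random inputs can.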

The Ethical Dilemma: Training AI to Weaponize Vulnerabilities

The ExploitGym research raises uncomfortable ethical questions. The benchmark was built by a consortium that includes companies that sell both the AI models and security solutions. Anthropic and OpenAI have publicly warned about the risks of their own models while simultaneously selling access to government clients. Critics argue that this creates a perverse incentive: the more dangerous the models appear, the more valuable they become to buyers who want offensive capabilities.

Furthermore, the researchers made ExploitGym publicly available. While this allows the security community to study the problem, it also provides a blueprint for malicious actors. The paper includes detailed methodology and results. The line between responsible disclosure and enabling harm is thin.

Some security experts have called for regulation of AI models with demonstrated exploitation capabilities, treating them as dual-use technologies similar to chemical weapons or advanced cryptography. Others argue that the cat is already out of the bag and that the only realistic path is to accelerate defensive AI to match the offensive capabilities.

What the Future Holds

The ExploitGym results are not the end of the story. They are a snapshot of where frontier models stand in early 2025. The pace of improvement suggests that within a year or two, the success rates will climb significantly. Models will become cheaper, faster, and more reliable. The number of vulnerabilities they can exploit will grow as training data expands.

For now, the message is clear. AI agents can create exploits, and they are doing so with a level of autonomy that was unthinkable just a few years ago. The security community must adapt. That means updating defenses, rethinking patch management, and engaging in an honest conversation about the ethics of building and selling weaponized AI. The age of autonomous cyberattacks is no longer coming. It is already here.
