Open-source benchmark EVMbench tests how well AI agents handle smart contract exploits
Summary
EVMbench is an open-source benchmark from OpenAI and Paradigm that tests AI agents on detecting, patching, and exploiting real smart contract vulnerabilities. It uses 120 curated flaws to provide automated, repeatable evaluations of AI security analysis capabilities.
OpenAI and Paradigm launch smart contract security benchmark
OpenAI and crypto venture firm Paradigm have released EVMbench, an open-source benchmark designed to test how well AI agents can handle smart contract security tasks. The tool focuses on real-world vulnerability patterns drawn from audited codebases and contest reports.
It arrives as smart contract exploits continue to drain funds from blockchain projects. Ethereum Virtual Machine (EVM) contracts are deployed permanently, execute autonomously, and often control large asset pools, which creates a persistent security challenge.
How the benchmark tests AI agents
EVMbench evaluates agent performance using three distinct task types: detection, patching, and exploitation. The goal is to provide a repeatable evaluation for AI models that claim to support contract auditing or automated security analysis.
In detect mode, the model reviews smart contract repositories and attempts to identify previously documented vulnerabilities. Scoring is based on recall—whether the agent successfully flags the known, ground-truth issues from reference audits.
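For intuition, a minimal sketch of recall-based scoring might look like the following; the function and finding IDs are illustrative assumptions, not EVMbench's actual harness.

```python
# Hypothetical sketch of recall scoring for detect mode.
# Ground-truth IDs come from the reference audit; flagged IDs come from
# the agent's report. Names and data shapes are illustrative only.

def detect_recall(ground_truth: set[str], flagged: set[str]) -> float:
    """Fraction of known, documented issues the agent successfully flagged."""
    if not ground_truth:
        return 1.0  # nothing to find
    found = ground_truth & flagged
    return len(found) / len(ground_truth)

# Example: the audit documented three issues; the agent flagged two of them.
# Extra findings do not raise or lower recall.
score = detect_recall(
    ground_truth={"H-01-reentrancy", "H-02-oracle", "M-01-rounding"},
    flagged={"H-01-reentrancy", "M-01-rounding", "extra-finding"},
)
print(score)  # ~0.67
```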
In patch mode, the model must modify contract code to remove a vulnerability without breaking expected functionality. Grading checks both that the exploit is eliminated and that original tests still pass.
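That two-part check could be sketched roughly as below; the commands and function are placeholders assumed for illustration, since the article does not describe the harness internals.

```python
# Hypothetical sketch of patch-mode grading: a patch passes only if
# (a) the exploit reproduction no longer succeeds and (b) the project's
# original test suite still passes. Commands are placeholders.
import subprocess

def grade_patch(repo_dir: str, exploit_cmd: list[str], test_cmd: list[str]) -> bool:
    # The exploit reproduction should now fail (non-zero exit) on patched code.
    exploit_still_works = subprocess.run(exploit_cmd, cwd=repo_dir).returncode == 0
    # The pre-existing functional tests should still pass (zero exit).
    tests_pass = subprocess.run(test_cmd, cwd=repo_dir).returncode == 0
    return (not exploit_still_works) and tests_pass

# Example usage with made-up commands:
# ok = grade_patch("patched-repo", ["python", "exploit.py"], ["forge", "test"])
```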
In exploit mode, the benchmark gives the model a sandboxed blockchain environment and asks it to execute an exploit against a vulnerable contract. Success is measured by verifying on-chain state changes like drained balances.
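A balance check of that kind could be written with web3.py against a local development node; the RPC URL, address, and drained-balance criterion here are assumptions for illustration rather than the benchmark's actual grader.

```python
# Hypothetical sketch of an exploit-mode success check: compare the target
# contract's balance before and after the agent's exploit runs on a local
# sandboxed node. The RPC URL and address are placeholders.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))  # local sandboxed EVM node

VULNERABLE_CONTRACT = "0x0000000000000000000000000000000000000001"  # placeholder
balance_before = w3.eth.get_balance(VULNERABLE_CONTRACT)

# ... the agent's exploit transactions would be replayed here ...

balance_after = w3.eth.get_balance(VULNERABLE_CONTRACT)
exploit_succeeded = balance_after < balance_before  # funds were drained
print("exploit succeeded:", exploit_succeeded)
```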
The dataset and testing environment
The EVMbench dataset is built from 120 curated vulnerabilities across 40 audits. Most cases come from open audit competitions, with additional scenarios sourced from Paradigm’s internal Tempo audit process.
Each case includes the vulnerable contract code and supporting infrastructure needed to recreate the scenario. The benchmark uses containerized environments and automated scoring so results can be reproduced across different machines.
- Tasks reflect realistic development conditions, which increases their complexity.
- Agents must reason about contract interactions and state changes.
- Exploit evaluation uses deterministic replay in a controlled local EVM instance.
The grading harness verifies success based on contract balances and state transitions, allowing for automatic evaluation without subjective review. This framework supports repeatable execution and comparisons between models over time.
Initial results show uneven performance
Benchmark results show AI performance varies significantly across the three task types. OpenAI reported that exploit tasks remain difficult for many systems, even when models can identify vulnerabilities at a surface level.
In a blog post, OpenAI noted, “Smart contracts routinely secure $100B+ in open-source crypto assets,” highlighting the scale of funds exposed to contract bugs.
Paradigm, however, highlighted rapid improvement. “When we started working on this project, top models were only able to exploit less than 20% of the critical, fund-draining Code4rena bugs,” said Alpin Yukseloglu, a partner at Paradigm. “Today, GPT-5.3-Codex exploits over 70%.”
The benchmark also reveals that patching remains a major weakness. Fixing contract vulnerabilities requires preserving correct behavior across edge cases, which often involves understanding deeper design assumptions in the code.
An open tool for ongoing research
EVMbench is available for free on GitHub, including benchmark tasks, harness tooling, and documentation. The goal is to allow researchers and security teams to test AI models consistently as agent capabilities continue to evolve.