Open-source benchmark EVMbench tests how well AI agents handle smart contract exploits
Summary
EVMbench is an open-source benchmark from OpenAI and Paradigm that tests AI agents on detecting, patching, and exploiting real smart contract vulnerabilities. It uses 120 curated flaws to provide automated, repeatable evaluations of AI security analysis capabilities.
OpenAI and Paradigm launch smart contract security benchmark
OpenAI and crypto venture firm Paradigm have released EVMbench, an open-source benchmark designed to test how well AI agents can handle smart contract security tasks. The tool focuses on real-world vulnerability patterns drawn from audited codebases and contest reports.
It arrives as smart contract exploits continue to drain funds from blockchain projects. Ethereum Virtual Machine (EVM) contracts are deployed permanently, execute autonomously, and often control large asset pools, which creates a persistent security challenge.
How the benchmark tests AI agents
EVMbench evaluates agent performance using three distinct task types: detection, patching, and exploitation. The goal is to provide a repeatable evaluation for AI models that claim to support contract auditing or automated security analysis.
In detect mode, the model reviews smart contract repositories and attempts to identify previously documented vulnerabilities. Scoring is based on recall—whether the agent successfully flags the known, ground-truth issues from reference audits.
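For intuition, a minimal sketch of recall-based scoring might look like the following; the function and finding IDs are illustrative assumptions, not EVMbench's actual harness.

```python
# Hypothetical sketch of recall scoring for detect mode.
# Ground-truth IDs come from the reference audit; flagged IDs come from
# the agent's report. Names and data shapes are illustrative only.

def detect_recall(ground_truth: set[str], flagged: set[str]) -> float:
    """Fraction of known, documented issues the agent successfully flagged."""
    if not ground_truth:
        return 1.0  # nothing to find
    found = ground_truth & flagged
    return len(found) / len(ground_truth)

# Example: the audit documented three issues; the agent flagged two of them.
# Extra findings do not raise or lower recall.
score = detect_recall(
    ground_truth={"H-01-reentrancy", "H-02-oracle", "M-01-rounding"},
    flagged={"H-01-reentrancy", "M-01-rounding", "extra-finding"},
)
print(score)  # ~0.67
```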
In patch mode, the model must modify contract code to remove a vulnerability without breaking expected functionality. Grading checks both that the exploit is eliminated and that original tests still pass.
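That two-part check could be sketched roughly as below; the commands and function are placeholders assumed for illustration, since the article does not describe the harness internals.

```python
# Hypothetical sketch of patch-mode grading: a patch passes only if
# (a) the exploit reproduction no longer succeeds and (b) the project's
# original test suite still passes. Commands are placeholders.
import subprocess

def grade_patch(repo_dir: str, exploit_cmd: list[str], test_cmd: list[str]) -> bool:
    # The exploit reproduction should now fail (non-zero exit) on patched code.
    exploit_still_works = subprocess.run(exploit_cmd, cwd=repo_dir).returncode == 0
    # The pre-existing functional tests should still pass (zero exit).
    tests_pass = subprocess.run(test_cmd, cwd=repo_dir).returncode == 0
    return (not exploit_still_works) and tests_pass

# Example usage with made-up commands:
# ok = grade_patch("patched-repo", ["python", "exploit.py"], ["forge", "test"])
```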
In exploit mode, the benchmark gives the model a sandboxed blockchain environment and asks it to execute an exploit against a vulnerable contract. Success is measured by verifying on-chain state changes like drained balances.
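A balance check of that kind could be written with web3.py against a local development node; the RPC URL, address, and drained-balance criterion here are assumptions for illustration rather than the benchmark's actual grader.

```python
# Hypothetical sketch of an exploit-mode success check: compare the target
# contract's balance before and after the agent's exploit runs on a local
# sandboxed node. The RPC URL and address are placeholders.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))  # local sandboxed EVM node

VULNERABLE_CONTRACT = "0x0000000000000000000000000000000000000001"  # placeholder
balance_before = w3.eth.get_balance(VULNERABLE_CONTRACT)

# ... the agent's exploit transactions would be replayed here ...

balance_after = w3.eth.get_balance(VULNERABLE_CONTRACT)
exploit_succeeded = balance_after < balance_before  # funds were drained
print("exploit succeeded:", exploit_succeeded)
```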
The dataset and testing environment
The EVMbench dataset is built from 120 curated vulnerabilities across 40 audits. Most cases come from open audit competitions, with additional scenarios sourced from Paradigm’s internal Tempo audit process.
Each case includes the vulnerable contract code and supporting infrastructure needed to recreate the scenario. The benchmark uses containerized environments and automated scoring so results can be reproduced across different machines.
- Tasks reflect realistic development conditions, which increases their complexity.
- Agents must reason about contract interactions and state changes.
- Exploit evaluation uses deterministic replay in a controlled local EVM instance.
The grading harness verifies success based on contract balances and state transitions, allowing for automatic evaluation without subjective review. This framework supports repeatable execution and comparisons between models over time.
Initial results show uneven performance
Benchmark results show AI performance varies significantly across the three task types. OpenAI reported that exploit tasks remain difficult for many systems, even when models can identify vulnerabilities at a surface level.
In a blog post, OpenAI noted, “Smart contracts routinely secure $100B+ in open-source crypto assets,” highlighting the scale of funds exposed to contract bugs.
Paradigm, however, highlighted rapid improvement. “When we started working on this project, top models were only able to exploit less than 20% of the critical, fund-draining Code4rena bugs,” said Alpin Yukseloglu, a partner at Paradigm. “Today, GPT-5.3-Codex exploits over 70%.”
The benchmark also reveals that patching remains a major weakness. Fixing contract vulnerabilities requires preserving correct behavior across edge cases, which often involves understanding deeper design assumptions in the code.
An open tool for ongoing research
EVMbench is available for free on GitHub, including benchmark tasks, harness tooling, and documentation. The goal is to allow researchers and security teams to test AI models consistently as agent capabilities continue to evolve.