Developer traces 3,177 API calls to compare 4 AI coding tools' context usage
Summary
A developer built Context Lens to trace LLM context usage. Testing four AI coding tools on the same bug fix revealed vastly different token consumption and strategies: Claude Opus was surgical (23K tokens), Codex efficient (29-47K), Claude Sonnet balanced (42-44K), and Gemini profligate (up to 350K tokens). The tools show no deliberate context management; efficiency stems from investigation strategy, not optimization.
Claude Opus fixed a bug using 23,000 tokens. Gemini used 350,000
A developer built a tool called Context Lens to trace what large language models put in their context windows. They tested four AI coding assistants on the same bug fix in an Express.js repository. All four tools successfully fixed the bug and passed 1,246 tests.
But the amount of data, measured in tokens, that each model consumed to do it varied enormously. The experiment revealed starkly different strategies and costs for the same outcome.
Token usage varies wildly between models
The developer ran each tool multiple times to check for consistency. Claude Opus was remarkably stable, using between 23,000 and 35,000 tokens per run.
Claude Sonnet typically used 42,000 to 44,000 tokens, with one outlier at nearly 70,000. Codex showed moderate variance, ranging from 29,300 to 47,200 tokens.
Gemini was the extreme outlier. Its runs consumed 179,000, 244,000, and 350,000 tokens. The developer notes Gemini has a larger context window and cheaper per-token pricing, but the variance and sheer volume were notable.
How the context window breaks down
Context Lens categorizes what fills the context window each turn. The composition for each tool at its peak usage was completely different.
- Claude Opus: Nearly 70% of its window (16.4K tokens) was tool definitions, re-sent every turn. Only 1.5K tokens were actual tool results from reading code.
- Claude Sonnet: Had a similar 18.4K-token "Claude tax" for tool definitions (43%). It also read broadly, with 16.9K tokens (40%) for tool results, including a 15.5K-token read of a full test file.
- Gemini: Had zero tool definition overhead. Instead, 172K tokens (96% of its context) were tool results from aggressive file reading. One single tool call dumped 118.5K tokens of a file's git history.
- Codex: Used only 2K tokens (6%) for tool definitions. The majority (72%) was targeted tool results from commands like ripgrep and sed.
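For a concrete picture of what that categorization involves, here is a minimal sketch of a per-turn breakdown in TypeScript. The message shape and the rough four-characters-per-token estimate are assumptions for illustration, not Context Lens's actual internals.

```typescript
// Illustrative sketch of a per-turn context breakdown, in the spirit of
// Context Lens. The Message shape and the 4-characters-per-token estimate
// are assumptions, not the tool's actual data model.
type Message =
  | { kind: "system"; text: string }
  | { kind: "tool_definition"; name: string; schema: string }
  | { kind: "tool_result"; tool: string; text: string }
  | { kind: "assistant"; text: string }
  | { kind: "user"; text: string };

// Rough token estimate: about four characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Sum estimated tokens per category for one turn's worth of messages.
function breakdown(turn: Message[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const msg of turn) {
    const text =
      msg.kind === "tool_definition" ? msg.name + msg.schema : msg.text;
    totals[msg.kind] = (totals[msg.kind] ?? 0) + estimateTokens(text);
  }
  return totals; // e.g. { tool_definition: 16400, tool_result: 1500, ... }
}
```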
Four different strategies for the same fix
The message log from Context Lens shows how each model approached the task step-by-step.
Claude Opus took a surgical, history-first approach. It interpreted "this was working before" as a cue to check git history. It found the recent commit that broke the function, viewed the diff, read just 20 lines of source code to confirm, applied the fix, and ran tests. It completed the task in six tool calls over 47 seconds.
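As an illustration of what that history-first strategy looks like in practice, the sketch below maps it onto ordinary git commands; the file path is a placeholder, not the actual Express.js source involved in the bug.

```typescript
import { execSync } from "node:child_process";

// Illustrative only: a history-first investigation of a regression,
// expressed as shell commands. "lib/response.js" is a placeholder path.
const run = (cmd: string): string => execSync(cmd, { encoding: "utf8" });

// "This was working before" -> list recent commits touching the file.
console.log(run("git log --oneline -n 5 -- lib/response.js"));

// View the diff of the suspect commit to see exactly what broke.
console.log(run("git show HEAD~1 -- lib/response.js"));

// After applying the one-line fix, confirm nothing else regressed.
console.log(run("npm test"));
```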
Claude Sonnet was more methodical. It first read the entire 15.5K-token test file to understand the expected behavior. It then read the source code and used git show to inspect changes. This bottom-up approach took more tokens but was thorough.
Codex acted like a Unix hacker. It ignored git entirely, using shell commands like rg to search and sed to extract specific line ranges. It applied the fix via a unified diff patch. This was the fastest method, completing in 34 seconds, and was highly predictable.
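A rough sketch of that search-and-extract style follows, again with placeholder patterns, paths, and line ranges rather than the exact commands from the trace.

```typescript
import { execSync } from "node:child_process";

// Illustrative only: targeted reads instead of whole-file dumps.
// The search pattern, path, and line range are placeholders.
const run = (cmd: string): string => execSync(cmd, { encoding: "utf8" });

// Locate the relevant function without reading the whole file.
console.log(run("rg -n 'sendStatus' lib/"));

// Pull in only the lines around the match.
console.log(run("sed -n '880,920p' lib/response.js"));

// The fix itself is then applied as a unified diff, e.g.:
//   git apply fix.patch
console.log(run("npm test"));
```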
Gemini employed a brute-force, test-driven development style. After grepping and reading files, it made a critical error: one tool call dumped 118.5K tokens of a file's full git commit history. It then modified the test file to add a new assertion, ran tests to confirm failure, applied the fix, confirmed the pass, and reverted its test change—a logical but expensive process.
Context management is not a priority
The experiment suggests efficiency differences come from investigation strategy, not deliberate context window management. No tool proactively truncated results or cleared context. Caching helps reduce costs but doesn't prevent "context rot" from accumulating unnecessary data.
"The efficiency differences come entirely from investigation strategy, not from any deliberate attempt to manage the context window," the developer writes. The tools are currently in a race to be the "best," not the most efficient.
Preliminary tests on repositories with different git histories show strategies change when information sources are removed. Opus becomes less efficient without git history, while Codex barely notices the difference.
For developers interested in monitoring their own usage, Context Lens is available via npm (npm install -g context-lens). It provides real-time composition breakdowns and flags oversized tool results.