Developer traces 3,177 API calls to compare 4 AI coding tools' context usage
Summary
A developer built Context Lens to trace LLM context usage. Testing four AI coding tools on the same bug fix revealed vastly different token consumption and strategies: Claude Opus was surgical (23K tokens), Codex efficient (29-47K), Claude Sonnet balanced (42-44K), and Gemini profligate (up to 350K tokens). The tools show no deliberate context management; efficiency stems from investigation strategy, not optimization.
Claude Opus fixed a bug using 23,000 tokens. Gemini used 350,000
A developer built a tool called Context Lens to trace what large language models put in their context windows. They tested four AI coding assistants on the same bug fix in an Express.js repository. All four tools successfully fixed the bug and passed 1,246 tests.
But the amount of data, measured in tokens, that each model consumed to do it varied enormously. The experiment revealed starkly different strategies and costs for the same outcome.
Token usage varies wildly between models
The developer ran each tool multiple times to check for consistency. Claude Opus was remarkably stable, using between 23,000 and 35,000 tokens per run.
Claude Sonnet typically used 42,000 to 44,000 tokens, with one outlier at nearly 70,000. Codex showed moderate variance, ranging from 29,300 to 47,200 tokens.
Gemini was the extreme outlier. Its runs consumed 179,000, 244,000, and 350,000 tokens. The developer notes Gemini has a larger context window and cheaper per-token pricing, but the variance and sheer volume were notable.
How the context window breaks down
Context Lens categorizes what fills the context window each turn. The composition for each tool at its peak usage was completely different.
- Claude Opus: Nearly 70% of its window (16.4K tokens) was tool definitions, re-sent every turn. Only 1.5K tokens were actual tool results from reading code.
- Claude Sonnet: Had a similar 18.4K-token "Claude tax" for tool definitions (43%). It also read broadly, with 16.9K tokens (40%) for tool results, including a 15.5K-token read of a full test file.
- Gemini: Had zero tool definition overhead. Instead, 172K tokens (96% of its context) were tool results from aggressive file reading. One single tool call dumped 118.5K tokens of a file's git history.
- Codex: Used only 2K tokens (6%) for tool definitions. The majority (72%) was targeted tool results from commands like ripgrep and sed.
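For a concrete picture of what that categorization involves, here is a minimal sketch of a per-turn breakdown in TypeScript. The message shape and the rough four-characters-per-token estimate are assumptions for illustration, not Context Lens's actual internals.

```typescript
// Illustrative sketch of a per-turn context breakdown, in the spirit of
// Context Lens. The Message shape and the 4-characters-per-token estimate
// are assumptions, not the tool's actual data model.
type Message =
  | { kind: "system"; text: string }
  | { kind: "tool_definition"; name: string; schema: string }
  | { kind: "tool_result"; tool: string; text: string }
  | { kind: "assistant"; text: string }
  | { kind: "user"; text: string };

// Rough token estimate: about four characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Sum estimated tokens per category for one turn's worth of messages.
function breakdown(turn: Message[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const msg of turn) {
    const text =
      msg.kind === "tool_definition" ? msg.name + msg.schema : msg.text;
    totals[msg.kind] = (totals[msg.kind] ?? 0) + estimateTokens(text);
  }
  return totals; // e.g. { tool_definition: 16400, tool_result: 1500, ... }
}
```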
Four different strategies for the same fix
The message log from Context Lens shows how each model approached the task step-by-step.
Claude Opus took a surgical, history-first approach. It interpreted "this was working before" as a cue to check git history. It found the recent commit that broke the function, viewed the diff, read just 20 lines of source code to confirm, applied the fix, and ran tests. It completed the task in six tool calls over 47 seconds.
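As an illustration of what that history-first strategy looks like in practice, the sketch below maps it onto ordinary git commands; the file path is a placeholder, not the actual Express.js source involved in the bug.

```typescript
import { execSync } from "node:child_process";

// Illustrative only: a history-first investigation of a regression,
// expressed as shell commands. "lib/response.js" is a placeholder path.
const run = (cmd: string): string => execSync(cmd, { encoding: "utf8" });

// "This was working before" -> list recent commits touching the file.
console.log(run("git log --oneline -n 5 -- lib/response.js"));

// View the diff of the suspect commit to see exactly what broke.
console.log(run("git show HEAD~1 -- lib/response.js"));

// After applying the one-line fix, confirm nothing else regressed.
console.log(run("npm test"));
```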
Claude Sonnet was more methodical. It first read the entire 15.5K-token test file to understand the expected behavior. It then read the source code and used git show to inspect changes. This bottom-up approach took more tokens but was thorough.
Codex acted like a Unix hacker. It ignored git entirely, using shell commands like rg to search and sed to extract specific line ranges. It applied the fix via a unified diff patch. This was the fastest method, completing in 34 seconds, and was highly predictable.
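A rough sketch of that search-and-extract style follows, again with placeholder patterns, paths, and line ranges rather than the exact commands from the trace.

```typescript
import { execSync } from "node:child_process";

// Illustrative only: targeted reads instead of whole-file dumps.
// The search pattern, path, and line range are placeholders.
const run = (cmd: string): string => execSync(cmd, { encoding: "utf8" });

// Locate the relevant function without reading the whole file.
console.log(run("rg -n 'sendStatus' lib/"));

// Pull in only the lines around the match.
console.log(run("sed -n '880,920p' lib/response.js"));

// The fix itself is then applied as a unified diff, e.g.:
//   git apply fix.patch
console.log(run("npm test"));
```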
Gemini employed a brute-force, test-driven development style. After grepping and reading files, it made a critical error: one tool call dumped 118.5K tokens of a file's full git commit history. It then modified the test file to add a new assertion, ran tests to confirm failure, applied the fix, confirmed the pass, and reverted its test change—a logical but expensive process.
Context management is not a priority
The experiment suggests efficiency differences come from investigation strategy, not deliberate context window management. No tool proactively truncated results or cleared context. Caching helps reduce costs but doesn't prevent "context rot" from accumulating unnecessary data.
"The efficiency differences come entirely from investigation strategy, not from any deliberate attempt to manage the context window," the developer writes. The tools are currently in a race to be the "best," not the most efficient.
Preliminary tests on repositories with different git histories show strategies change when information sources are removed. Opus becomes less efficient without git history, while Codex barely notices the difference.
For developers interested in monitoring their own usage, Context Lens is available via npm (npm install -g context-lens). It provides real-time composition breakdowns and flags oversized tool results.