How your LLM is silently hallucinating company revenue
Summary
LLMs generate SQL queries that often run successfully but are semantically wrong, leading to dangerous data errors. Providing context via tools like MCP, AGENTS.md, or Agent Skills helps reduce these silent failures.
Large language models are generating dangerously wrong database queries
LLMs are accelerating engineering work, but they are creating a specific and dangerous failure mode when used with databases. The problem is that a syntactically correct SQL query can execute successfully while being semantically wrong, returning bad data that looks legitimate.
When an LLM generates a faulty React component, the error is usually visible—a broken layout or a misplaced button. A faulty database query, however, can run without error and return thousands of rows of incorrect data. This makes the mistakes opaque and difficult for a human to quickly validate.
Most broken queries still return data
Analysis of over 50,000 production queries revealed that most "broken" queries execute successfully. A user might ask for "revenue by product category," and the LLM might pull from a "revenue" column in the wrong table.
The query runs, numbers appear, and business decisions are made on silently incorrect metrics. LLMs often commit to the first seemingly correct path and fail to explore alternatives, a form of tunnel vision.
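To make the failure concrete, here is a minimal sketch in SQL. The schema is hypothetical: assume a stale `revenue` snapshot table sits alongside the `order_items` table that actually holds recognized revenue.

```sql
-- Hypothetical schema for illustration: a legacy "revenue" snapshot table
-- coexists with the "order_items" table that holds the real figures.

-- What the LLM plausibly generates: executes cleanly, sums the wrong table.
SELECT p.category,
       SUM(r.revenue) AS total_revenue
FROM revenue r
JOIN products p ON p.id = r.product_id
GROUP BY p.category;

-- What the analyst actually meant: recognized revenue from order items.
SELECT p.category,
       SUM(oi.unit_price * oi.quantity) AS total_revenue
FROM order_items oi
JOIN products p ON p.id = oi.product_id
GROUP BY p.category;
```

Both queries execute and return a plausible-looking number per category; only the second answers the question that was asked.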
Why databases are uniquely vulnerable
Three key characteristics make database work especially prone to these silent LLM failures.
First, SQL dialects diverge in ways LLMs don't anticipate. While the basics are standard, most databases adapt SQL into their own dialect with subtle or dramatic syntax differences, rendering much of the model's general SQL training data useless.
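As a rough illustration, the same "monthly revenue" aggregation is written differently in three common dialects (the table and column names here are hypothetical):

```sql
-- PostgreSQL
SELECT date_trunc('month', created_at) AS month, SUM(amount) AS revenue
FROM orders GROUP BY 1;

-- MySQL: no date_trunc; a common workaround formats the date instead
SELECT DATE_FORMAT(created_at, '%Y-%m-01') AS month, SUM(amount) AS revenue
FROM orders GROUP BY 1;

-- ClickHouse
SELECT toStartOfMonth(created_at) AS month, SUM(amount) AS revenue
FROM orders GROUP BY month;
```

A model trained mostly on PostgreSQL-flavored examples will happily emit `date_trunc` against an engine that has never heard of it.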
Second, real-world schemas are messy. Column names like "amount" can have ambiguous meanings, and legacy tables clutter the environment. LLMs frequently make confident, wrong guesses, inventing columns that don't exist or joining on incorrect keys.
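A hypothetical sketch of the kind of schema that trips models up; the names and inline notes are invented for illustration:

```sql
-- Three columns named "amount", none of which is interchangeable.
CREATE TABLE orders (
    id     BIGINT PRIMARY KEY,
    amount BIGINT          -- gross order total, in cents, before refunds
);

CREATE TABLE payments (
    id       BIGINT PRIMARY KEY,
    order_id BIGINT,
    amount   DECIMAL(12,2) -- net amount actually captured, in dollars
);

CREATE TABLE orders_backup_2019 (
    id     BIGINT,
    amount BIGINT          -- stale snapshot; should never be queried
);
```

An LLM that sums the wrong `amount`, or reaches for the backup table, produces a result that looks every bit as valid as the right one.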
Finally, human communication is ambiguous. Requests like "show me Q1 FY2026" or metrics involving "a billion" have different meanings depending on regional or business context. LLMs lack the specific knowledge to know which definition applies.
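For instance, "Q1 FY2026" resolves to two different date ranges depending on the fiscal calendar. The sketch below assumes a hypothetical orders table and, for the second reading, a fiscal year starting July 1:

```sql
-- Reading 1: calendar quarter, January through March 2026.
SELECT SUM(amount) AS q1_revenue
FROM orders
WHERE created_at >= '2026-01-01' AND created_at < '2026-04-01';

-- Reading 2: fiscal year starting 2025-07-01, so Q1 FY2026 is July-September 2025.
SELECT SUM(amount) AS q1_revenue
FROM orders
WHERE created_at >= '2025-07-01' AND created_at < '2025-10-01';
```

Nothing in the request itself tells the model which range the business means.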
New tools are providing crucial context
The core failure stems from using general LLMs with no context about a specific environment. The model knows abstract SQL but nothing about your schema, dialect, or business logic. A wave of new tools aims to solve this.
- Model Context Protocol (MCP): Anthropic's protocol standardizes connections between LLMs and external tools like databases, allowing for direct query execution and schema introspection.
- AGENTS.md: OpenAI's convention uses a single Markdown file to provide domain-specific context that travels with a codebase, though it can lead to context bloat (a hypothetical fragment is sketched after this list).
- Agent Skills: Anthropic's modular system breaks knowledge into separate files an agent can load on-demand, preventing unused info from clogging the context window.
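To give a sense of the file-based approach, here is a hypothetical AGENTS.md fragment for a database-heavy codebase; every convention in it is invented for illustration:

```markdown
## Database conventions (hypothetical example)
- We run ClickHouse; use its dialect (e.g. toStartOfMonth), not PostgreSQL syntax.
- Revenue questions must use order_items; the legacy revenue table is a stale snapshot.
- orders.amount is stored in cents; divide by 100 before reporting dollars.
- The fiscal year starts July 1, so "Q1 FY2026" means 2025-07-01 through 2025-09-30.
```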
Databases themselves also offer a native solution. Systems like ClickHouse, PostgreSQL, and MySQL have long supported the COMMENT syntax to store metadata on tables and columns, providing built-in semantic context.
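Reusing the hypothetical schema sketched earlier, the example below shows how comments attach that semantic context to the objects themselves; the comment text is invented, but the syntax is each engine's standard form:

```sql
-- PostgreSQL: comments are separate statements on existing objects.
COMMENT ON COLUMN orders.amount IS 'Gross order total in cents, before refunds';
COMMENT ON COLUMN payments.amount IS 'Net captured amount in dollars';
COMMENT ON TABLE orders_backup_2019 IS 'Stale 2019 snapshot; do not use for reporting';

-- MySQL: the comment is part of the column definition.
ALTER TABLE payments MODIFY amount DECIMAL(12,2) COMMENT 'Net captured amount in dollars';

-- ClickHouse: columns can be commented after creation.
ALTER TABLE orders COMMENT COLUMN amount 'Gross order total in cents, before refunds';
```

Because the comments live in the catalog, anything that introspects the schema can read them back alongside column names and types.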
The battle of the context methods
It's not yet clear which context-providing method will dominate. A recent evaluation by Vercel compared AGENTS.md to Agent Skills.
AGENTS.md achieved a 100% pass rate in their tests, while Agent Skills maxed out at 79%. Crucially, in 56% of test cases, agents with access to skills never invoked them, even when the documentation was available and relevant.
This suggests that while Agent Skills are more efficient with context, their reliance on the LLM correctly choosing to use a skill is a major weakness. As LLMs improve at tool-calling and context windows grow larger, the balance may shift. For now, the safest bet may be to embed context directly within the database objects themselves.
