Don't Trust the Salt: AI Summarization, Multilingual Safety, and LLM Guardrails
Summary
AI summaries can be dangerously biased, especially in multilingual contexts. Hidden instructions can steer outputs to hide human rights abuses or give unsafe advice, as shown in refugee aid tests. Evaluation must lead to better safeguards.
AI summaries can be dangerously biased, new research shows
Large language models can produce wildly different summaries of the same document based on hidden instructions, according to new research from Roya Pakzad, a researcher at the nonprofit Taraaz. In a demonstration, the same model generated three distinct summaries of a UN human rights report on Iran.
Using the model's default settings, the summary cited a "dramatic rise in executions." But when guided by customized English- and Farsi-language policies, the framing shifted to emphasize government efforts and "protecting citizens through law enforcement." Pakzad calls this technique "Bilingual Shadow Reasoning."
"The core point is: reasoning can be tacitly steered by the policies given to an LLM, especially in multilingual contexts," Pakzad wrote. The Farsi-language policy she used mirrored the Iranian government's own framing of its human rights record.
Summarization is a high-stakes blind spot
Pakzad, who built multilingual AI evaluation tools at the Mozilla Foundation, says summarization is a particular risk. She found it's easier to steer a model's output in summarization tasks than in question-and-answer formats.
This matters because organizations use these tools in critical areas such as generating executive reports and summarizing political debates. A 2025 paper cited by Pakzad found that LLM summaries altered the sentiment of the source text 26.5% of the time, and that consumers were 32% more likely to purchase a product after reading an LLM summary of a review than after reading the original review.
"Many closed-source wrappers built on top of major LLMs... can embed these hidden instructions as invisible policy directives," Pakzad warns. This can facilitate censorship, manipulate sentiment, or reframe historical events while users assume the tools are neutral.
Multilingual AI safety is dangerously inconsistent
Pakzad's recent work focuses on the weak safeguards for non-English languages. She built the Multilingual AI Safety Evaluation Lab, an open-source platform to benchmark inconsistencies across languages.
In a case study with Respond Crisis Translation, her team tested models on refugee and asylum scenarios across four language pairs. The results showed significant quality drops for languages like Kurdish and Pashto compared to English.
- Human evaluators scored non-English actionability at 2.92 out of 5, compared to 3.86 for English.
- Factuality scores dropped from 3.55 for English to 2.87 for non-English languages.
- The LLM-as-a-Judge method inflated scores and under-reported disparities flagged by humans.
Models gave dangerously naive advice in non-English languages. In one scenario, Gemini refused to suggest herbal remedies for serious medical symptoms in English but provided them in other languages.
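A disparity check of this kind boils down to a few lines: average per-language human ratings, compare them with an LLM judge's ratings, and flag languages where the judge is more generous than people. The scores below are illustrative stand-ins, not the study's data.

```python
# Illustrative sketch of a cross-language score-disparity check on a 1-5 scale.
# The ratings are made up for demonstration; they are not the Lab's raw data.
from statistics import mean

human_scores = {  # (language, criterion) -> list of 1-5 human ratings
    ("english", "actionability"): [4, 4, 3.5],
    ("pashto", "actionability"): [3, 2.5, 3],
    ("english", "factuality"): [3.5, 4, 3],
    ("pashto", "factuality"): [3, 2.5, 3],
}
judge_scores = {  # same items rated by an LLM-as-a-Judge
    ("english", "actionability"): [4, 4, 4],
    ("pashto", "actionability"): [4, 3.5, 4],   # judge rates non-English generously
    ("english", "factuality"): [4, 4, 3.5],
    ("pashto", "factuality"): [3.5, 3.5, 4],
}

for key in human_scores:
    h, j = mean(human_scores[key]), mean(judge_scores[key])
    flag = "  <- possible judge inflation" if j - h > 0.5 else ""
    print(f"{key[0]:>8} {key[1]:<14} human={h:.2f} judge={j:.2f}{flag}")
```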
Guardrails themselves are flawed across languages
Pakzad's team took their findings a step further, designing a pipeline to turn evaluation insights into custom guardrails. They tested three guardrail tools using 60 asylum-seeker scenarios.
The results confirmed the evaluation work: the safety tools themselves performed inconsistently across languages. One guardrail, Glider, produced score discrepancies of 36–53% depending solely on whether its policy was written in English or Farsi.
The guardrails fabricated terms more often in Farsi, made biased assumptions about asylum seekers' nationalities, and expressed false confidence in their own factual accuracy. "The gap identified in the Lab's evaluations persists all the way through to the safety tools themselves," Pakzad concluded.
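The underlying consistency test is straightforward: score the same model responses against the same policy written in two languages and measure how far the results diverge. The scorer below is a hypothetical stand-in (here it returns random values so the sketch runs), not Glider's actual interface or the team's pipeline.

```python
# Sketch of a cross-language guardrail consistency check. `score_with_guardrail`
# is a hypothetical placeholder, not the real API of Glider or any other tool.
import random
from statistics import mean

def score_with_guardrail(response: str, policy: str) -> float:
    """Placeholder 0-1 compliance score; a real check would call the guardrail model."""
    return random.random()

def policy_language_gap(responses, policy_en: str, policy_fa: str) -> float:
    """Mean relative discrepancy between English-policy and Farsi-policy scores."""
    gaps = []
    for r in responses:
        en = score_with_guardrail(r, policy_en)
        fa = score_with_guardrail(r, policy_fa)
        gaps.append(abs(en - fa) / max(en, fa, 1e-9))
    return mean(gaps)

scenarios = [f"model response to asylum-seeker scenario {i}" for i in range(60)]
gap = policy_language_gap(scenarios, "Policy text in English ...", "Farsi translation of the same policy ...")
print(f"Mean cross-language score discrepancy: {gap:.0%}")  # Pakzad reports 36-53% for Glider
```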
2026 must be the year of action, not just evaluation
While many predict 2026 as the "year of AI evaluation," Pakzad argues the focus must shift to building better safeguards. She warns that evaluation risks becoming an overload of data without a clear "so what."
Her work this year will expand the Multilingual AI Safety Evaluation Lab to include voice-based evaluation and integrate the evaluation-to-guardrail pipeline for continuous assessment. She is also securing funding to expand the humanitarian case studies into new domains such as gender-based violence and reproductive health.
"2026 should be the year evaluation flows into custom safeguard and guardrail design," Pakzad wrote. "That’s where I’ll be focusing my work this year."