MLOps Community

An Agent Shouldn't Trust Everything it Reads

An Agent Shouldn't Trust Everything it Reads
# AI Agents
# AI Safety & Security
# PwC

Why retrieved content becomes a security surface for Agents

May 26, 2026
Steve Kearns
Steve Kearns
An Agent Shouldn't Trust Everything it Reads
The attack pattern is subtle because it looks like ordinary content. Someone plants instructions inside content an AI assistant may eventually read - a shared document, a support ticket, a knowledge base page, a pull request description, or a webpage retrieved during a tool call. The user does not run a command or approve a workflow. Later, they ask the assistant a routine question. The assistant pulls in the planted content, treats the embedded instructions as if they came from the user, and acts on them.
Security researchers have demonstrated versions of this pattern across agentic systems: indirect prompt injection through content the assistant is allowed to read, followed by tool use the user did not explicitly request. In one environment that might mean exposing private context. In another, it might mean modifying records, calling a tool, or triggering a workflow the user never intended. The details vary, but the pattern keeps reappearing because it is structurally hard to defend against.
This is the part of the production agent conversation that does not fit neatly into the usual security posture, and it came up in a recent MLOps Community podcast with Pramod Krishnan from PwC. It is worth lingering on because the architectural implication is unintuitive. Once agents can act, ordinary content becomes part of the control surface.
For a chatbot, content stays as content. For an agent, the boundary is less clear. Any content the agent reads - emails, documents, tickets, calendar entries, retrieval results - can contain text that looks like instruction. Any tool the agent can call extends what those instructions can do. Retrieval is not permission to execute, but by default it effectively is, because the model doesn't reliably distinguish "user asked me to do this" from "I read some text that said do this."
The same pattern shows up in engineering agents. A coding assistant with access to the wrong environment can make changes the user did not intend, especially when development, staging, and production boundaries are unclear. The fix is not a better prompt by itself - it is environment separation, planning-only modes, dry runs, scoped credentials, and an approval gate between the agent and live systems.
File-system agents create a similar class of risk. A task as ordinary as reorganising folders becomes dangerous when the agent misreads command output, assumes a step succeeded, and keeps acting on that false assumption. The fix is not a smarter model. It is validation between steps, bounded file-system permissions, and execution environments where destructive operations require confirmation.
The pattern across these examples is not that the models are unsafe. It is that the systems around the models were built on assumptions that do not hold once the model can act. Tickets, documents, comments, knowledge base pages, business records - these were designed to be read, not interpreted as instructions. Nobody wrote input sanitisation for them, because nobody needed to. The same holds at different levels for credentials, file-system operations, and other surfaces agents now interact with. The security posture was built for a world in which those things moved information around, not a world where agents might act on the contents of that information.
The architectural response, which Pramod emphasised in the episode, is the separation of content from action.
When an agent reads a document, the text inside that document should not be allowed to become instructions the agent then follows. The same applies to retrieved context from a RAG pipeline, to webpage content pulled in via a browsing tool, and to any other input the agent ingests. The trust model that made sense for chatbots - treat inputs as data, treat user prompts as instructions - stops working once the agent can act, because any retrieved content can smuggle in new instructions, and the model has no reliable way to separate smuggled instructions from the user's original request.
Which leads to the less fashionable half of the agent conversation, which is that agent tools - MCP servers, plugins, skills, internal tool wrappers - need to be treated as executable dependencies rather than harmless extensions. A tool the agent can call is code the system now runs. That means allowlisting, scanning, egress control, permission review, and ongoing monitoring before the tool reaches a production agent. A high GitHub star count is not a security review. An MCP server that has been running on someone's laptop for three months is not a validated dependency.
The control that gets skipped most in the rush to ship is behaviour monitoring, which is mildly embarrassing because it is also one of the cheapest. Most agent frameworks and observability stacks can produce useful traces by default. What's usually missing is someone actively reviewing those traces. Not for the mean latency or aggregate token cost - those sit on a dashboard and mostly take care of themselves. The things worth watching are the individual trace that took twenty tool calls to finish a task that normally takes three, the output that passed the usability check but contained a URL the agent shouldn't have generated, the retrieved document with a suspicious block of text that reads more like instructions than content. Those signals are usually visible if someone looks. They usually are not looked at until something has gone wrong.
None of this is new in security terms. Application security has used versions of least privilege, input validation, and behavioural monitoring for decades. The shift is that the attack surface now includes the contents of business records, shared documents, support tickets, pull request descriptions, web pages, and anything else the agent can retrieve. Anywhere the agent reads from an untrusted or semi-trusted source, an attacker may be able to write.
For a team that already has agents in production, the useful exercise is not to pick the tightest possible set of controls and apply them to everything. It is to work out which systems the agent touches, what the failure mode looks like if the agent gets manipulated into doing the wrong thing via retrieved content, and whether there is any meaningful separation between where the agent reads and what the agent can do. For many production agents today, that separation barely exists, because the systems around the agent were built on the assumption that inputs are passive.


Dive in

Related

Blog
“It worked when I prompted it” or the challenges of building an LLM Product
By Soham Chatterjee • May 1st, 2023 Views 312
Video
It Worked When I Prompted It
By Soham Chatterjee • Jun 28th, 2023 Views 410
Blog
Engineering An AI Agent To Navigate Large-scale Event Data - Part 2
By Ayesha Imran • Apr 7th, 2026 Views 103
Blog
“It worked when I prompted it” or the challenges of building an LLM Product
By Soham Chatterjee • May 1st, 2023 Views 312
Blog
Engineering An AI Agent To Navigate Large-scale Event Data - Part 2
By Ayesha Imran • Apr 7th, 2026 Views 103
Video
It Worked When I Prompted It
By Soham Chatterjee • Jun 28th, 2023 Views 410
Code of Conduct
Your Privacy Choices