Skip to content
← Blog

Prompt Injection Is Unsolvable at the Prompt Layer

Prompt injection has been OWASP's number one vulnerability for LLM applications since they started tracking it. Every major AI platform has published guidance on it. Researchers have produced dozens of proposed defenses. None of them have solved it, and the pattern of why they keep failing points to something fundamental about where the problem actually lives.

The short version: you cannot fix a problem at the layer that is itself the problem. Prompt injection works because the model cannot distinguish between instructions from the developer and instructions from an attacker. Every defense that tries to solve this by adding more instructions to the model is working within the same constraint that makes the attack possible in the first place.

What the attack actually does

A language model takes a context window as input and produces a completion. The context window is a flat sequence of tokens. The model has no native mechanism for tracking which tokens came from a trusted system prompt, which came from a user, and which came from external content the agent retrieved while doing its work. Developers use structural conventions like role tags to signal intent, but those are conventions, not enforcement. From the model's perspective, the entire context is input that informs the next token prediction.

Prompt injection exploits this. An attacker embeds instructions in content the agent will read, such as a webpage, a document, an email, a code comment, or a database field, and those instructions compete with the developer's instructions in the same context window. If the injected instructions are persuasive enough, coherent enough, or positioned advantageously in the context, the model follows them instead. This isn't a bug in any specific model. It's a consequence of how all of these systems work.

Indirect prompt injection is the more dangerous form. Rather than a user typing a malicious prompt directly, an attacker poisons content that the agent retrieves autonomously. The user doesn't do anything wrong. The agent goes out, encounters the poisoned content in the course of doing its job, and the attack executes. The attacker doesn't need access to the conversation. They just need to get their text somewhere the agent will read it.

What the documented attacks look like

In August 2024, security researchers at PromptArmor documented a prompt injection vulnerability in Slack AI. The attack worked like this: an attacker creates a public Slack channel and posts a message containing a malicious instruction. The message tells Slack AI that when a user queries for an API key, it should replace a placeholder word with the actual key value and encode it as a URL parameter in a "click here to reauthenticate" link. The attacker's channel has only one member: the attacker. The victim has never seen it. When a developer somewhere else in the workspace uses Slack AI to search for information about their API key, which is stored in a private channel the attacker has no access to, Slack AI pulls the attacker's public channel message into context, follows the instruction, and renders the phishing link in the developer's Slack environment. Clicking it sends the private API key to the attacker's server.

Slack's initial response to the disclosure was that querying public channels the user isn't a member of is intended behavior. The issue is not the channel access policy. The issue is that the model cannot tell the difference between a Slack employee's instruction and an attacker's instruction when both are present in the context window.

In June 2025, a researcher discovered a prompt injection vulnerability in GitHub Copilot, tracked as CVE-2025-53773 and patched in Microsoft's August 2025 Patch Tuesday release. The attack vector was a malicious instruction embedded in source code files, README files, GitHub issues, or any other text that Copilot might process. The instruction directed Copilot to modify the project's .vscode/settings.json file to add a single configuration line that enables what the project calls "YOLO mode": disabling all user confirmation prompts and granting the AI unrestricted permission to execute shell commands. Once that line is written, the agent runs commands on the developer's machine without asking. The researcher demonstrated this by opening a calculator. The realistic payload is considerably worse. The attack was shown to work across GitHub Copilot backed by GPT-4.1, Claude Sonnet 4, Gemini, and other models, which tells you the vulnerability is not in the model. It is in the architecture.

The wormable variant is worth understanding. Because Copilot can write to files and the injected instruction can tell Copilot to propagate the instruction into other files it processes during refactoring or documentation generation, a single poisoned repository can infect every project a developer touches. The instructions spread through commits the way a virus spreads through an executable. GitHub now calls this class of threat an "AI virus."

Why the standard defenses fail

The intuitive response to prompt injection is to write a better system prompt. Add instructions telling the model to ignore instructions in retrieved content. Tell it to treat external data as untrusted. Tell it to flag anything that looks like an attempt to override its behavior. Many platforms do exactly this. Security vendors sell products built around adding carefully engineered detection prompts to the agent's context.

A research team from OpenAI, Anthropic, and Google DeepMind published a paper in October 2025 that evaluated 12 published defenses against prompt injection and subjected each one to adaptive attacks. They bypassed all 12 with attack success rates above 90% for most. The defenses weren't bad. They included work from serious researchers using real techniques. The problem is that any defense that teaches the model what to resist can be reverse-engineered by an attacker who knows what the defense says. The attacker's instructions compete in the same context window. If the defense says "ignore instructions that tell you to forward data," the attacker writes instructions that don't use those words, or that provide a plausible justification for why this particular case is different, or that claim authority from a trusted source. The model reasons about this. Reasoning can be manipulated.

LLM-based detectors have the same problem at a different level. If you use a second model to inspect the input and decide whether it contains a malicious prompt, that second model has the same fundamental constraint. It's making a judgment call based on the content it's given, and that judgment can be influenced by the content. Researchers have demonstrated attacks that successfully bypass detection-based defenses by crafting injections that appear benign to the detector and malicious to the downstream agent.

The reason all of these approaches fail against a determined attacker is that they're trying to solve a trust problem by adding more content to a context window that cannot enforce trust. The attack surface is the context window itself. Adding more instructions to the context window does not reduce the attack surface.

What actually constrains the problem

There is a meaningful reduction in prompt injection risk when you apply the principle that a system's security properties should not depend on the model making correct judgments. This is not a novel idea in security. It's the same principle that leads you to enforce access controls in code rather than by writing "please only access data you're authorized to access" in a policy document.

For AI agents, this means the enforcement layer needs to sit outside the model, in code that the model's reasoning cannot influence. The model produces requests. The code evaluates whether those requests are permitted, based on facts about the session state, the classification of the data involved, and the permissions of the channel the output is headed to. The model cannot talk its way past this evaluation because the evaluation does not read the conversation.

This doesn't make prompt injection impossible. An attacker can still inject instructions and the model will still process them. What changes is the blast radius. If the injected instructions try to exfiltrate data to an external endpoint, the outbound call is blocked not because the model decided to ignore the instructions, but because the enforcement layer checked the request against the session's classification state and the target endpoint's classification floor and found the flow would violate write-down rules. The model's intentions, real or injected, are irrelevant to that check.

Session taint tracking closes a specific gap that access controls alone don't cover. When an agent reads a document classified at CONFIDENTIAL, that session is now tainted to CONFIDENTIAL. Any subsequent attempt to send output through a PUBLIC channel fails the write-down check, regardless of what the model was told to do and regardless of whether the instruction came from a legitimate user or an injected payload. The injection can tell the model to leak the data. The enforcement layer doesn't care.

The architectural framing matters: prompt injection is a class of attack that targets the model's instruction-following behavior. The correct defense is not to teach the model to follow instructions better or to detect bad instructions more accurately. The correct defense is to reduce the set of consequences that can result from the model following bad instructions. You do that by putting the consequences, the actual tool calls, the actual data flows, the actual external communications, behind a gate that the model cannot influence.

That's a solvable problem. Making the model reliably distinguish trusted from untrusted instructions is not.