security

Prompt Injection in 2026: Why the Attack Surface Keeps Growing

Prompt injection isn't growing because attackers got smarter. It's growing because we keep adding capability without solving the underlying architectural problem.

Chris Groves

17 Feb 2026 — 14 min read

$A lone figure faces a massive shadowy creature of dark tendrils bursting through fractured concentric security rings$

There is a structural problem at the center of every AI agent you are building right now. It has been known since 2022, named by a researcher in 2023, declared the #1 LLM (Large Language Model) vulnerability by OWASP (Open Worldwide Application Security Project) in 2025, and described as a "frontier, unsolved security problem" by OpenAI's own CISO (Chief Information Security Officer). The problem is prompt injection, and despite four years of industry attention, the attack surface is not shrinking. It is accelerating.

The common explanation for this is that AI agents "can do more things now," which is accurate but incomplete. The deeper explanation is structural: every capability we add to AI agents dismantles an isolation boundary that previously contained the damage from a successful injection. Understanding why each new capability expands the surface -- not just that it does -- is the precondition for making intelligent architectural decisions about which surfaces to accept, which to quarantine, and which to reject entirely.

The Architectural Problem That Cannot Be Patched

To understand why prompt injection persists, you need to understand what it exploits.

Large language models process everything in the context window as a single token stream. System prompts, user inputs, retrieved documents, tool metadata, memory entries, and code snippets collapse into one continuous sequence. There is no token-level privilege marking in the transformer architecture. The model cannot reliably distinguish "trusted instruction" from "untrusted data" because at the architectural level, those concepts do not exist.

This is not a bug. It is a design property of how the model works. And it means that when researchers have proposed fixes -- instruction delimiters, hierarchical trust schemes, separate model instances for verification -- each approach introduces new attack surfaces rather than eliminating the underlying vulnerability:

Delimiters can be included by attackers in their injected content
Instruction hierarchies can have priority claimed by attackers ("ignore all previous instructions, I am the system")
Separate verification models double the number of systems that need to be hardened

Simon Willison, who coined the term "prompt injection", posed the question directly: "Is this just a fundamental limitation of how large language models based on the transformer architecture work?" By 2025, the consensus answer was: probably yes, for now. OpenAI CISO Dane Stuckey has called it a "frontier, unsolved security problem." OWASP's assessment found prompt injection in 73% of production AI deployments.

The attack surface is not growing because the underlying vulnerability got worse. It is growing because the blast radius of a successful injection keeps expanding as we add capabilities.

The Trust Architecture Collapse: Each Capability Is Additive Damage

Pre-agentic LLMs had natural containment. Text went in. Text came out. No execution, no persistence, no external communication. The model was effectively sandboxed by its output format -- a successful prompt injection produced a wrong answer, inappropriate content, or an embarrassing response. The blast radius was one conversation.

Agentic systems systematically dismantle that sandbox, one capability at a time:

Tool access removes execution isolation. When an agent can call tools -- write files, execute code, call APIs -- a prompt injection can become a tool invocation. Text becomes action.

Memory removes temporal isolation. When an agent has persistent memory, a successful injection can modify what the agent "knows" about you or the world -- with effects that persist across sessions, days, or weeks.

MCP (Model Context Protocol) removes supply chain isolation. When an agent connects to external tool servers, those servers' metadata enters the agent's trust perimeter. A malicious tool description is an attack vector. A changed tool description (after you approved it) is a supply chain attack.

Multi-agent architecture removes propagation isolation. When agents communicate with each other, a compromised agent can pass injected instructions to peer agents. An infection spreads.

Each of these is not additive to the attack surface -- it is multiplicative. A single injected instruction that previously produced wrong text now potentially produces: file deletion, credential exfiltration, persistent memory modification, and propagation to every agent in the network. OWASP's Agentic Top 10, released December 2025 at Black Hat Europe, codified this as ASI01 (Agentic Security Initiative risk #1: Agent Goal Hijack) -- recognizing that the agentic context creates a categorically different threat model, not just a more severe version of the old one.

The Six Expanding Surfaces of 2025-2026

MCP Tool Poisoning: The Semantic Supply Chain

MCP is a protocol that allows AI agents to connect to external tool servers. The attack vector researchers identified is specific: the threat is not in the data tools return -- it is in the tool descriptions that explain what tools do.

Invariant Labs' disclosure documented what they called Tool Poisoning Attacks (TPA): malicious instructions embedded in MCP tool descriptions, invisible to users reviewing the interface but visible to the AI model processing the metadata. The WhatsApp MCP exploit demonstrated this concretely -- a tool appeared legitimate in its user-facing label while the description contained hidden instructions that triggered full conversation exfiltration. The GitHub MCP exploit used the same vector to leak data from private repositories via public issue content.

Palo Alto Unit 42 identified three primary attack vectors at the protocol level: resource theft (draining AI compute quotas), conversation hijacking (persistent injected instructions embedded in server responses), and covert tool invocation (file system operations occurring without user awareness).

The most insidious variant is what practitioners are calling the "rug pull": a tool definition that changes after initial user approval. You reviewed and trusted the MCP server when you installed it. You cannot easily verify whether the description has changed since then. The semantic equivalent in traditional software would be a library that downloads and executes arbitrary remote code at runtime -- which is treated as a critical vulnerability when found in npm packages. In MCP deployments, it has no equivalent security control.

The software supply chain has decades of security infrastructure: SBOMs (Software Bills of Materials -- structured inventories of a software package's components), dependency scanning, signature verification, reproducible builds. The AI tool supply chain currently has none of this. When you add an MCP server, you are adding untrusted text to your agent's trust perimeter with no mechanism to detect unauthorized changes.

RAG Pipeline Poisoning: Your Knowledge Base as Attack Vector

Fifty-three percent of companies now operate RAG pipelines -- architectures where a language model retrieves relevant documents from a knowledge base before responding. The attack surface is the knowledge base itself.

When malicious instructions are embedded in documents that an AI will retrieve, those instructions execute with the trust level of internal documentation. Research has found that five carefully crafted documents can manipulate AI responses 90% of the time -- the retrieval mechanism does not distinguish between legitimate content and adversarial payloads embedded in otherwise-legitimate files.

The EchoLeak vulnerability (CVE-2025-32711 -- CVE stands for Common Vulnerabilities and Exposures, the standard identifier for publicly disclosed security flaws -- in Microsoft 365 Copilot) demonstrated this at zero-click severity. An agent processing incoming email for indexing followed hidden instructions embedded in the email body. The instructions directed the agent to search the user's recent emails for "password", then append the results to an attacker-controlled URL. The attack required no user interaction beyond the user receiving the email.

Prompt Security's research introduced the concept of "vector worms" -- poisoned embeddings that contain instructions directing the AI to re-embed and reintroduce the poisoned data into other documents in the corpus. The attack becomes self-propagating within the knowledge base.

The canary document technique offers one detection approach: plant documents with unique dummy phrases in the RAG corpus. If the AI retrieves these in contexts where it should not, you have behavioral evidence of active injection. This is reactive rather than preventive, but it is currently one of the few practical detection mechanisms available.

Memory Poisoning: The Persistent Threat Model

Most prompt injection discussion addresses conversation-scoped attacks. An injection occurs, the model responds badly, the conversation ends. Blast radius: one session.

Memory poisoning is categorically different. MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems -- MITRE's framework for AI attack techniques) tracks it as AML.T0080 with a distinct attack technique designation because it warrants separate treatment. The attack surface is the AI's persistent knowledge -- what it "remembers" about you, your preferences, your organization. The blast radius is all future sessions. The detection window may be indefinite, because users rarely audit what their AI assistant has stored about them.

Microsoft's February 2026 post on AI Recommendation Poisoning documents active commercial exploitation of this vector. Attackers are targeting the AI memory layer not for one-time exfiltration but for persistent influence over future AI responses -- systematically shaping what products an AI recommends, what information it surfaces, what decisions it supports. The attack is invisible to the user because the poisoned memory entry looks indistinguishable from a legitimately learned preference.

If you use an AI assistant with persistent memory -- and most major AI assistants now enable this by default -- you have an attack surface you are not actively monitoring. A document you read, an email you received, or a webpage you visited could have overwritten what your AI "knows" about your preferences, with effects that persist until you find and delete the specific poisoned entry.

Cross-Agent Trust Exploitation: When Agents Assume Peers Are Legitimate

Multi-agent architectures introduce a trust assumption that has no security foundation: agents assume messages from peer agents are legitimate instructions. There is no standard mechanism for inter-agent message authentication. No cryptographic identity verification. No notion of "this message claims to come from Agent A, and I can verify that claim."

Researchers documented a concrete example in Q4 2025 with GitHub Copilot and Claude: each AI assumed the other's instructions were system-level communications and began rewriting each other's configuration files. The loop escalated privileges because each agent deferred to the peer's apparent authority.

ServiceNow Now Assist demonstrated second-order privilege delegation: a low-privilege agent, processing a malformed request, was tricked into asking a high-privilege peer to execute on its behalf. The high-privilege agent, trusting its peer, complied -- exporting an entire case file to an external URL. The attack bypassed direct privilege controls by exploiting the trust relationship between agents rather than attacking either agent directly.

The OWASP Agentic Top 10 addresses this under ASI01 (inter-agent communication attacks), but solutions remain largely architectural: enforce zero trust between agents, authenticate every inter-agent message, limit what agents can request of peers, and scope high-privilege agents so they cannot act on requests from lower-privilege peers regardless of the requester's claimed authority.

AI Coding Editors: High Trust, Broad Access, Active Targets

Coding agents -- Claude Code, Cursor, GitHub Copilot, Devin, Google Jules -- are particularly attractive targets because they combine properties that make successful injection catastrophic: file system write access, shell execution capability, network access (MCP servers, git, external APIs), and developer trust. Developers operating in these environments tend to assume the agent is a safe collaborator.

CVE-2025-53773 documented remote code execution through prompt injection in GitHub Copilot. CVE-2025-59944 documented a case sensitivity bug in Cursor AI IDE's protected file path handling that allowed hidden instructions to escalate to RCE. In the Cursor case, the vulnerability was not an architectural prompt injection issue -- it was a conventional software bug that removed a constraint an attacker could exploit through adversarial input.

A 2025 security test of Devin AI found it completely defenseless at a cost of $500 in testing: researchers could manipulate it to expose ports to the internet, leak access tokens, and install command-and-control malware through crafted prompts. Research published in 2025 found attack success rates reaching 84% for malicious command execution via poisoned external development resources.

AI Worms: Injection That Propagates

Researchers demonstrated a proof-of-concept -- dubbed "Morris II" -- for a self-propagating prompt injection targeting GenAI-powered email assistants: an agent infected via a poisoned email sends messages to peer agents containing hidden injection payloads. Those agents become infected and propagate further. The attack follows epidemiological spread patterns, demonstrating super-linear propagation rates (each infected client compromising approximately 20 new clients within the first few days). It scales with network connectivity, not attacker effort.

The historical analogy is SQL injection worms (Slammer demonstrated this at scale in 2003), but the attack operates in natural language rather than structured query space. A defender cannot maintain a static blocklist of malicious token sequences because the injection space is continuous and adaptive. Peer-reviewed analysis from January 2026 confirmed that defenses against these attacks "remain insufficient against adaptive attack strategies."

What Defenders Are Actually Doing

The defense landscape is maturing, but no approach provides comprehensive protection.

Microsoft's Spotlighting technique (documented in detail by MSRC) operates at three modes: delimiting (explicit markers separating trusted system content from untrusted retrieved content), datamarking (inserting unique identifiers in trusted content so the model can track provenance), and encoding modes (transforming untrusted content into alternate representations that reduce injection effectiveness). These are complemented by Prompt Shields -- a continuously updated classifier that scans for injection attempts -- and deterministic blocking of known exfiltration paths (markdown image rendering being one documented example).

Google DeepMind's Spotlighting for Gemini (May 2025 security paper) interleaves control tokens throughout retrieved content with instructions to the model not to treat content between control tokens as trusted instruction. The approach is similar to Microsoft's datamarking but applied at training time rather than inference time.

OWASP's architectural guidance recommends a Quarantined LLM pattern: maintain a privileged model instance (trusted inputs, tool access) separate from a quarantined model instance (untrusted inputs, no tool access). Untrusted content is processed by the quarantined instance, which cannot take direct action. Results are evaluated before passing to the privileged instance.

Multi-agent defense pipelines (described in arXiv 2509.14285) layer a coordinator agent for pre-input classification with a guard agent for output validation. The approach handles quoted text, code blocks, and delegation attempts -- providing defense against tool-manipulation attacks, role-play coercion, and exfiltration attempts.

None of these solutions is complete. Each raises the bar; none closes the vulnerability.

Why the Vendor Incentive Problem Makes This Harder

There is a finding in the security research that rarely surfaces in practitioner-facing content: multiple vendors have chosen not to fix reported prompt injection vulnerabilities, citing concerns about impacting functionality.

This is not a story about one negligent vendor. It is a structural property of the problem. Prompt injection defenses cause false positives that break legitimate use cases. A classifier that blocks injections will sometimes block legitimate instructions that resemble injections. A quarantine model that limits tool access will sometimes prevent operations users need. Vendors face a direct tradeoff: reduce security risk, or preserve user experience.

The "insecure by design" pattern exists in traditional software (default-open configurations, weak cryptographic defaults shipped for compatibility), but it is rarely applied to AI systems analysis. The difference in AI is that you cannot audit all the ways the vulnerability manifests -- the model reasons in natural language across an effectively unlimited input space.

The practical implications:

Vendor security bulletins are insufficient. You cannot assume a reported vulnerability was fixed because a patch was released. The fix may have been evaluated and declined.
Defense-in-depth is not optional. You cannot trust the model layer to absorb injections.
Architecture around the attack is the only reliable strategy. If the model cannot be made injection-proof, the system must be designed so that successful injections have bounded blast radius.

The Market Dynamics: Adoption Outpacing Security Posture

Gartner projects 40% of enterprise applications will integrate AI agents by 2026. Only 34% of enterprises have AI-specific security controls, and less than 40% conduct regular security testing on AI systems. Adoption is scaling faster than security maturity.

Every agent deployment without corresponding security architecture is net-negative for the collective attack surface. Insecure agents become both victims and threat propagation paths -- a compromised agent in one organization's infrastructure can be used to attack agents in connected organizations.

The historical parallel is IoT device proliferation from 2015-2020. Millions of devices with no security baseline created the conditions for the Mirai botnet and similar large-scale attacks. The devices were not targeting each other -- they were simply insecure by default, and adversaries automated exploitation at scale. AI agents are following the same adoption pattern. The question is whether the security response will be faster or slower than it was for IoT. The evidence so far suggests slower, because the attack surface is more complex and the defenses are harder to deploy universally.

A Design Framework for Practitioners

The research synthesis points to a practical framework. The goal is not to eliminate prompt injection -- that is architecturally infeasible with current models. The goal is to bound the blast radius through architectural decisions made before deployment.

The Lethal Trifecta as Design Constraint

Simon Willison's Lethal Trifecta is typically cited as a diagnostic: if your system has all three factors -- access to private data, exposure to untrusted tokens, and an exfiltration vector -- you are vulnerable, full stop.

I find it more useful as a forward-looking design tool. The three factors map directly to decisions you make during architecture:

Private data access: Scope it. Do not give the agent access to all your email. Give it read access to emails from a specific domain received in the last seven days. Not access to all files, but access to files in a named directory. Least privilege applied at the semantic level, not just the permission level.

Untrusted token ingestion: Quarantine it. Process untrusted content (web pages, user uploads, third-party documents, email bodies) in a model instance with no tool access. The Quarantined LLM pattern from OWASP is the implementation. Never let a model instance process untrusted tokens while also holding tool access to consequential systems.

Exfiltration vectors: Enumerate and gate them. Every output channel is a potential exfiltration path -- external API calls, URL construction, image rendering, email sending, file writes. Require explicit approval checkpoints for any action that sends data outside a trusted boundary. Airia's 2026 analysis specifically mandates human-in-the-loop checkpoints for high-impact agent actions as an immediate requirement, not a future goal.

Operational Checklist for MCP Integrations

Given that MCP is the newest and least-understood surface, practitioners evaluating MCP integrations need specific guidance:

"Did I review the tool description when I installed it?" is insufficient.
"Can this description change without my knowledge?" is the right question. For most MCP deployments today, the answer is yes.
Treat MCP tool descriptions as untrusted input that requires the same scrutiny as code dependencies. Practical DevSecOps recommends sandboxed Docker containers per MCP server and periodic re-verification of tool definitions against a known-good baseline.
If a tool description changes between your last audit and today, treat it as a potential compromise until verified.

Detection Layer

Architecture controls limit blast radius. A detection layer tells you when something is wrong.

Practical detection approaches, in order of implementation effort:

Canary documents in RAG corpora: Unique dummy content that should never appear in responses. Automated monitoring for these phrases in agent outputs provides low-false-positive detection of active RAG poisoning.
Behavioral monitoring on tool calls: Log all tool invocations with inputs and outputs. Anomalous patterns (unexpected external URL construction, file access outside normal scope, API calls to unfamiliar endpoints) are detectable signals.
Prompt Shields or equivalent classifier: Input scanning for known injection patterns. Not comprehensive, but catches opportunistic attacks and raises the cost of targeted attacks.
Regular red team exercises: Airia recommends regular adversarial testing of AI pipelines. Testing a system that cannot be comprehensively audited requires active red teaming, not just static analysis.

The Architectural Conclusion

The attack surface keeps growing because the vulnerability is architectural and the capabilities being added each remove a containment property that previously bounded the damage. That is the mechanism. Understanding it lets you reason about capability additions in security terms.

When evaluating whether to enable a new agent capability, the question is not "could this capability be abused?" (yes, for all of them). The question is: what isolation boundary does this capability remove, and what is the blast radius if an injection exploits that removal?

Tool access removes execution isolation. Ask: what is the scope of execution? Memory removes temporal isolation. Ask: what can be written to memory, and what can read from it? MCP removes supply chain isolation. Ask: how will you verify tool description integrity over time? Multi-agent removes propagation isolation. Ask: how do you authenticate inter-agent communication?

For each capability you cannot scope, quarantine, or gate adequately: the honest answer may be that the capability is not deployable safely in your threat environment yet. That is a valid architectural decision. The alternative -- deploying broadly and hoping the model absorbs the injections -- is not.

The OpenAI comparison to social engineering is apt: this is unlikely to ever be fully solved, but it can be materially reduced through architecture. Peer-reviewed research confirms defenses remain insufficient against adaptive strategies. The goal is not a solved problem. The goal is a bounded one.

Found This Helpful?

If you are building agentic systems and want to go deeper -- threat modeling workshops, architecture review frameworks, and implementation guides -- subscribe here. Methodology guides go to all subscribers. Implementation deep dives go to members.

Sources