Tool Poisoning and MCP Security: When Your Agent's Toolbox Is the Weapon

MCP tool descriptions are instructions, not metadata. Here's how attackers exploit that — and what the benchmark data actually shows about model safety alignment.

Chris Groves

25 Feb 2026 — 10 min read

Your AI agent just added two numbers. It also silently exfiltrated your SSH private key.

That is not a hypothetical. Invariant Labs demonstrated it in April 2025 against production MCP clients including Cursor and Claude Desktop. A tool named "Add Numbers" ran the math correctly. Hidden inside its description — the metadata the AI reads when deciding how to use a tool — was an instruction: before any file operation, read ~/.cursor/mcp.json and pass its contents as a side parameter. The math worked. The config file traveled silently to the attacker's server.

This is tool poisoning. It is the attack vector that security researchers spent 2025 mapping in detail, and the one that organizations building with the Model Context Protocol (MCP) need to understand before deploying agents into anything that matters.

What Makes This Different from Prompt Injection

The prompt injection attack is familiar at this point: adversarial text appears in user input and redirects the model's behavior. OWASP lists it as the #1 LLM risk for 2025. NIST called indirect prompt injection "generative AI's greatest security flaw."

Tool poisoning operates at a different layer of the stack.

When an MCP (Model Context Protocol) server registers a tool with an agent, it provides two things: a tool name visible to the user, and a tool description visible to the AI model. The description is often the full behavioral specification. The agent reads it, interprets it, and acts on it. Users rarely see the full description text.

The attack surface is that asymmetry. The attacker does not need to manipulate user input. They control the tool description at registration time.

Dimension	Prompt Injection	Tool Poisoning
Target	User inputs	Tool metadata
Delivery	Runtime	Registration phase
Persistence	Session-level	Across all agent sessions
Blast radius	Single user	All users of that tool

Tool poisoning is a supply chain attack. It persists. It does not require user interaction after installation. It affects everyone using a given tool, not just the individual who triggered a particular conversation.

The Four Attack Patterns You Need to Know

Hidden Instructions in Tool Descriptions

The "Add Numbers" demo above is the basic form. Attackers embed directives using patterns that look like legitimate configuration: <IMPORTANT> tags, instructional prose mixed with functional description, or Base64-encoded payloads. The model is trained to follow instructions. It does.

Confirmed exfiltrated data in demonstrated attacks includes:

SSH private keys (~/.ssh/id_rsa)
MCP server credential files
OAuth tokens and workflow platform access tokens
AWS credentials and Git tokens
.env files containing API keys

Rug Pull Attacks

MCP clients require initial approval when a tool is installed. The base protocol has no mechanism to notify users when tool descriptions change after that approval.

Think of it like a trusted npm package that updates itself to include a backdoor after you've already vetted it. The attack sequence:

Attacker publishes a legitimate, useful tool
User approves it — trust established
Days or weeks later, the server silently modifies the tool description to include malicious instructions
The agent, operating under the original approval, follows the new instructions

Invariant Labs demonstrated this with a WhatsApp MCP server: a "fact of the day" tool was approved, then post-approval the server modified it to shadow the trusted whatsapp-mcp server, redirect all WhatsApp messages to an attacker's phone number, and pad exfiltrated data with whitespace so it would not appear in the agent's visible UI output. Months of personal and business communications exfiltrated silently.

Cross-Server Tool Shadowing

MCP's value proposition is composability — connect a calendar server, a file server, an email server, let your agent orchestrate across all of them. This composability is also the attack surface.

When multiple MCP servers are connected to the same agent client, all tool descriptions are loaded into a single shared agent context with no trust hierarchy. A malicious server can pre-condition the agent's behavior before trusted tools are invoked.

Acuvity documented the "confused deputy" pattern: a malicious weather server — with zero banking capabilities — hijacks a trusted banking tool by injecting instructions into shared agent context that modify how the banking tool is used. The user approves the trusted server's action. The agent's behavior was already shadowed.

The risk is multiplicative, not additive. A malicious server in a five-server agent configuration does not just have access to its own tool set — it potentially has influence over all five. MCP security audits must evaluate server combinations, not each server in isolation.

Supply Chain Compromise

According to Astrix Security's 2025 MCP security research, 5.5% of MCP servers in the wild exhibit tool poisoning indicators, and 33% permit unrestricted network access.

Elastic Security Labs found that 43% of tested MCP implementations contained command injection flaws and 30% permitted unrestricted URL fetching. Some malicious packages are designed with developer credential harvesting as their primary function:

Enumerate project directories on first use
Index .env* files, SSH keys, cloud credentials, API tokens, certificates
Base64-encode the harvest
POST it as legitimate-looking API traffic to a controlled endpoint

Developer sees normal tool behavior. Credential harvest completes in the background.

The Benchmark Data: Safety Alignment Doesn't Help

In August 2025, researchers published MCPTox, the first systematic benchmark for tool poisoning attacks. The methodology: 45 live MCP servers, 353 authentic tools, 20 LLM agents, 1,312 malicious test cases.

The findings are uncomfortable:

Average attack success rate: 36.5% across all tested agents
Highest success rate: 72.8% (o1-mini, OpenAI's reasoning model)
Claude-3.7-Sonnet's refusal rate: under 3% — meaning it followed malicious tool instructions over 97% of the time when attacked

The counterintuitive finding is the one that matters most for anyone building with frontier models: more capable models are more vulnerable to tool poisoning.

The mechanism is direct. Tool poisoning exploits instruction-following capability. A more powerful model is better at following complex, nuanced instructions — including complex, nuanced malicious instructions embedded in tool descriptions. Chain-of-thought (CoT) reasoning, which makes models more capable at multi-step tasks, also makes them more capable at executing multi-step malicious toolchains. MCPTox found that activating CoT reasoning increased attack success rates by up to 27.8 percentage points.

Safety alignment, which took years to build into frontier models, offers almost no defense because tool poisoning does not ask the model to do something that looks dangerous. It instructs the model to read a file before doing a benign operation. The model complies because it is following instructions, exactly as trained.

A follow-up framework called MCP-ITP achieved up to 84.2% attack success rate while suppressing the malicious tool detection rate to as low as 0.3%.

EchoLeak: What a Production Exploit Looks Like

Research benchmarks establish the attack surface. CVE-2025-32711 (CVSS 9.3, where CVSS is the Common Vulnerability Scoring System on a 0–10 scale), known as EchoLeak, demonstrates what the threat looks like at enterprise scale.

EchoLeak was a confirmed zero-click exploit against Microsoft 365 Copilot — a production system used by millions of enterprise users. No proof-of-concept, no toy environment.

The attack chain:

Attacker sends crafted email to target's Outlook inbox with hidden prompt injection payload
User does not open or interact with the email
When user asks Copilot any business question, Copilot's Retrieval-Augmented Generation (RAG) engine ingests the poisoned email alongside legitimate business context
Injected instructions direct Copilot to exfiltrate sensitive documents via Microsoft Teams — a domain allow-listed in Content Security Policy
Reference-style Markdown links encode the stolen data in image request URLs
Data leaves the tenant without a single user click

Microsoft patched it server-side in May 2025. The fix was architectural: restrict Copilot from processing external emails without DLP (Data Loss Prevention) tagging, implement explicit trust boundaries between the RAG context layers.

The lesson from EchoLeak is that the patch did not involve improving the model. It involved changing what the model could see. Defense requires architectural decisions about where trust boundaries sit, what data gets mixed in the context window, and which outbound domains are reachable.

The 2025 Incident Timeline

AuthZed published a comprehensive timeline of MCP security breaches from April through December 2025. The pattern of recurring incident types tells the story more clearly than any single event:

April: WhatsApp MCP rug pull — chat history and contacts exfiltrated via tool shadowing
May: GitHub MCP — prompt injection via public issue exposed private repos and financial records
June: Anthropic MCP Inspector — unauthenticated RCE (Remote Code Execution) on localhost exposed filesystem, API keys, and environment secrets
July: mcp-remote (CVE-2025-6514, 437K+ downloads) — OS command injection via unsanitized input exposed API keys, cloud credentials, SSH keys, and Git repositories
August: Anthropic Filesystem MCP — sandbox escape via symlink bypass reached host filesystem credentials
September: Postmark MCP package — supply chain BCC (Blind Carbon Copy) injection exfiltrated email communications and internal memos
October: Smithery hosting — path traversal in Docker configuration exposed over 3,000 applications via compromised API token

Five distinct vulnerability patterns, all recurring. This is not isolated researcher-found edge cases. It is a systematic security maturity gap in a protocol that achieved rapid, widespread adoption.

Why Existing Software Security Doesn't Transfer

The Language Boundary Problem

In traditional software, security boundaries are enforced by code: type systems, access control lists, authentication middleware. A string in a data field cannot become an executable instruction simply by appearing there.

MCP changes this. The agent reads tool descriptions — natural language — and constructs its behavior based on that text. The tool description is the instruction. There is no structural separation between "data the agent processes" and "instructions the agent follows."

CrowdStrike's analysis of agentic toolchain attacks frames this precisely: the security boundary in AI agents is written in natural language, not in code types. Any text appearing in the agent's context window can, under the right conditions, redefine its behavior.

This is not a bug that gets patched. It is structural to how language models work. The research framing from arXiv's 2025 agentic supply chain paper identifies the feedback loop: poisoned tool output re-enters the agent's context as input for the next decision. The agent becomes simultaneously a producer and consumer of tainted data across sessions.

No Trust Hierarchy in Shared Context

MCP's current architecture loads all connected servers' tool descriptions into a single shared agent context. There is no trust tier distinguishing the user-vetted trusted server from the third-party tool installed last week. A description from any connected server can influence behavior toward any other connected server.

Defenses That Work (and the One That Does Not)

What does not work: better model safety training. MCPTox is definitive on this. The attack does not target the model's ethical reasoning. It targets the model's instruction-following capability, which is the core function that makes it useful.

What does work:

Tool description transparency. Before approving any MCP tool, display the full description to the user — not just the tool name. The gap between what users see and what the AI reads is the attack surface. Closing the information gap does not eliminate the risk, but it makes rug pull attacks harder to execute invisibly.

Version pinning and integrity verification. Generate a cryptographic hash of each tool description at approval time. Verify at each invocation. Alert on any change. This is the standard supply chain defense applied to a new supply chain artifact: treat MCP tool descriptions the same way mature software teams treat package hashes.

ETDI (Enhanced Tool Definition Interface) proposes this at the protocol level: OAuth-based cryptographic attestations, immutable versioning, mandatory re-approval for any version or scope change. It is not a novel security invention — it is applying existing software supply chain practices to MCP.

Least privilege architecture. A customer service agent has no reason to have filesystem access. A research agent has no reason to have email sending capability. Every connected capability that is not required for the agent's specific task is an unnecessary attack surface. Audit agent tool configurations the same way you audit user IAM (Identity and Access Management) permissions.

Sandboxing. Run MCP clients and servers in Docker containers with restricted network access. When tool poisoning succeeds — and the benchmark data suggests it will, at meaningful rates — sandboxing contains the blast radius. Credential leakage to the local filesystem requires the malicious tool to reach the local filesystem.

Human-in-the-loop for high-privilege operations. File writes, email sends, external API calls with sensitive parameters — require explicit user confirmation before execution. "Always allow" should not be configured for operations with irreversible consequences.

MCP-Scan. Invariant Labs released MCP-Scan in April 2025, a security scanner for MCP servers that detects tool poisoning indicators before deployment. MindGuard achieves 94%–99% detection precision for poisoned tool invocations at under one second processing time.

The Organizational Readiness Gap

Help Net Security reported in February 2026 that only 29% of organizations reported being prepared to secure agentic AI deployments. Tool misuse and privilege escalation led reported incidents. Memory poisoning and supply chain attacks carried disproportionate severity scores.

Organizations are deploying agents with database write access, email capabilities, and filesystem access into business-critical workflows — while operating without the security foundations that would be baseline requirements for any human with equivalent access. A human employee with read access to all company email, write access to customer databases, and the ability to make external API calls on company infrastructure would have extensive access controls, audit logging, and privilege reviews. The equivalent agentic deployment often has none of these.

The parallel to early cloud adoption is direct. Between 2012 and 2016, organizations moved workloads to cloud infrastructure before cloud security practices were established. The result was years of misconfiguration debt — storage buckets open to the public, overprivileged service accounts, no audit trails. Agentic AI is compressing that same adoption-to-security-maturity gap into months instead of years, with the added complication that the attack surface includes every document, email, and web page the agent reads as a potential injection vector.

The defense posture for agentic deployments needs to match the risk profile. Agents are non-human identities with significant access. Apply zero-trust principles to their access the same way you would to human identities: scoped credentials, just-in-time access grants, conditional access policies, and regular privilege reviews.

Have you audited your MCP tool descriptions yet? Where are you seeing the biggest gaps in agentic security posture?

Tool Poisoning and MCP Security: When Your Agent's Toolbox Is the Weapon

Chris Groves

What Makes This Different from Prompt Injection

The Four Attack Patterns You Need to Know

Hidden Instructions in Tool Descriptions

Rug Pull Attacks

Cross-Server Tool Shadowing

Supply Chain Compromise

The Benchmark Data: Safety Alignment Doesn't Help

EchoLeak: What a Production Exploit Looks Like

The 2025 Incident Timeline

Why Existing Software Security Doesn't Transfer

The Language Boundary Problem

No Trust Hierarchy in Shared Context

Defenses That Work (and the One That Does Not)

The Organizational Readiness Gap

Attack Research and Demonstrations

Benchmark and Academic Research

Production Incidents and CVEs

Industry Research and Standards

Read more

From Pattern Scanner to Security Researcher: The Code Review Upgrade

When AGENTS.md Backfires: What a New Study Says About Context Files and Coding Agents

The AI Investment Reckoning: No Profits, No AGI, No Plan

Christian Zionism vs. What the Bible Actually Says