AI Red Teaming: Beyond Prompt Injection

Prompt injection is just the beginning. As AI systems evolve from chatbots to autonomous agents, the attack surface expands into territory traditional security teams have never defended.

AI Token Costs vs Employee Efficiency

If you are still treating AI security as a prompt injection problem, you are already behind. The conversation has moved.

Two years ago, the security community was obsessed with jailbreak prompts—the creative ways to trick ChatGPT into ignoring its safety guidelines. And yes, that matters. But the researchers who actually study this space have been telling us for a while now: prompt injection is one chapter in a much longer book.

The real shift happening right now is not about better jailbreaks. It is about three converging forces that are fundamentally changing what we need to secure:

  1. Attacks are now automated — adversarial suffixes that work across multiple models simultaneously, generated algorithmically rather than hand-crafted
  2. Agents are the new attack surface — autonomous systems that can take real actions, call real tools, and access real data
  3. The threat model has no boundary — when your LLM can call APIs, execute code, or manipulate external systems, the difference between "input manipulation" and "system compromise" disappears

Let me walk through each of these shifts, what the research actually shows, and what you can do about it today.

The Automation Revolution in Attacks

The seminal moment came in mid-2023 when researchers at Carnegie Mellon published their work on universal adversarial suffixes. Their paper, "Universal and Transferable Adversarial Attacks on Aligned Language Models," demonstrated something uncomfortable: you could automatically generate a string of characters that, when appended to any harmful request, would cause the model to comply — and that string would work across models from different vendors.

The implications were immediate and disturbing. What had been a cat-and-mouse game of patching specific prompts became a systematic vulnerability. As the Zou et al. paper framed it: "it is unclear whether such behavior can ever be fully patched."

That admission matters. The authors drew an explicit parallel to computer vision's adversarial example problem — a class of vulnerabilities that researchers have been fighting for over a decade without finding a fundamental solution. If LLM security faces the same structural challenge, then we are not looking at a problem we can solve with better filtering. We are looking at an operational necessity.

This is exactly what the follow-up research has confirmed. Liu et al. built the first systematic framework for classifying prompt injection attacks in their USENIX Security 2024 paper, evaluating five attack types and ten defense mechanisms across ten LLMs and seven different tasks. Their key finding: the combined attack strategy — blending multiple injection techniques — outperforms any single approach.

The cat-and-mouse game has accelerated. The question is no longer "can we find a perfect defense" but "how do we operate defensively when we know some attacks will succeed?"

Beyond Chatbots: The Agent Problem

This is where things get interesting for security professionals who have been paying attention to the agent frameworks proliferating across the industry.

Traditional LLM security focuses on what goes in and what comes out. Prompt injection, jailbreaks, output filtering — these are all about controlling the conversation. But agents are different. An agent is not having a conversation; it is taking actions.

When an LLM-powered agent can:

  • Execute code on your infrastructure
  • Call APIs that modify data
  • Access filesystems or databases
  • Orchestrate multi-step workflows across systems

...the security implications expand beyond anything the prompt injection literature covers. This is system compromise through natural language, and it changes the threat model entirely.

The research community has started naming these vectors. Tool poisoning involves injecting malicious instructions into tool definitions that the agent trusts. Chain manipulation interferes with the agent's workflow orchestration. Privilege escalation occurs when agent actions exceed what was intended. And MCP (Model Context Protocol) vulnerabilities expose security gaps in the interfaces agents use to connect to external systems.

Tencent's AI-Infra-Guard platform explicitly addresses these gaps, scanning not just for jailbreaks but for agent skill vulnerabilities and MCP exposures. Agentic Radar, another open-source tool, focuses specifically on LLM agentic workflows. The agentic_security project builds red teaming kits for these multi-step systems.

If you are deploying agents in production — whether using LangChain, AutoGen, CrewAI, or custom frameworks — you need to think about these attack surfaces, and honestly, you probably are not.

What Actually Works in 2026

Here is the uncomfortable truth: there is no patch. The research points toward defense in depth, not a single solution.

The layered approaches that are emerging in the industry look like this:

Input filtering catches known attack patterns before they reach the model. Tools like llm-guard provide this capability, though they are racing against an adversarial environment that generates novel attacks constantly.

Programmable guardrails — like NVIDIA's NeMo Guardrails — let you define explicit rules about what your system can and cannot do, regardless of what the prompt contains.

Continuous red teaming is where the rubber meets the road. The shift from point-in-time assessments to automated, continuous testing is not optional. When attacks can be generated algorithmically, your defenses need to be tested the same way. The open-source garak scanner from NVIDIA, FuzzyAI from CyberArk, and the EasyJailbreak framework all automate vulnerability detection.

The OWASP Top 10 for Large Language Model Applications provides a useful framework for thinking about these layers systematically, though it is still maturing as an industry standard.

The Gap That Keeps This Interesting

Here is what keeps me up at night: most security teams do not have the specialized skills to conduct AI red teaming effectively. The tools are emerging, but the expertise to use them well is scarce.

This is not unique to AI — every new technology domain faces the same gap. But the stakes are higher here because the attack surface is larger and the failure modes are less understood. A traditional penetration test covers a bounded system. An AI red team engagement needs to account for prompt injection, jailbreaks, output manipulation, tool poisoning, and emergent agent behaviors.

The industry is responding. Platforms like AI-Infra-Guard aim to make red teaming accessible. Educational resources like the Learn Prompt Hacking course are building skills. But we are still early in this curve.

Where This Leaves You

If you are responsible for AI security in your organization, here is the honest assessment:

  • Accept that some attacks will succeed. Build detection and response capabilities, not just prevention.
  • Extend your security thinking beyond the chat interface. If you are deploying agents, you have a new attack surface.
  • Adopt continuous red teaming. Point-in-time assessments cannot keep pace with automated attacks.
  • Monitor the threat landscape actively. The tools and techniques evolve weekly.

The fundamental question is no longer "are our AI systems secure?" It is "how do we operate knowing they will be attacked in ways we have not yet imagined?"

That is the uncomfortable reality that the research makes clear. The question is whether we build our defenses to match it.


Want to put this into practice? Lurkers get methodology guides. Contributors get implementation deep dives.

Sources

Academic Research

Open Source Tools

Industry Standards

Educational Resources

Datasets