Prompt Injection - The AI Agent Attack Surface
If your AI agent reads external content, whoever controls that content controls your agent. Explore the structural reality of indirect prompt injection.
Disclaimer
This article is intended for informational purposes and reflects the state of published research and industry practice as of early 2026. It is not professional security advice. Your specific environment, threat model, and regulatory obligations will shape how these principles apply to your situation.
For Security Leaders
Any AI agent that reads external content can be instructed through that content. This is not a vendor flaw - it is a structural property of how language models work. There is no patch. The attack surface is everything your agents read.
What this means for your organization:
Capable models are more exposed, not less. Models that follow legitimate instructions reliably also follow injected ones reliably.
No access to your systems is required. An attacker sends an email or commits code to a repository your agent reads. The agent delivers the result.
Vendor detection benchmarks overstate protection. Published bypass rates run below 5%. Against an attacker who knows your product, the same tools show bypass rates above 85%.
What to tell your teams:
Map every agent deployment against what external content it reads. That list is your attack surface.
Apply least-privilege to tool access. File write plus shell execution plus repository push access enables self-propagating attacks across your codebase.
Require human confirmation before agents take irreversible actions.
When evaluating detection products, ask for adaptive-attacker results. If a vendor cannot provide them, treat the product as friction, not a control.
The Content Your Agent Reads Is the Attack
On June 11, 2025, NIST logged CVE-2025-32711 against Microsoft 365 Copilot. The description is three lines: “AI command injection in M365 Copilot allows an unauthorized attacker to disclose information over a network.” Microsoft scored it 9.3 (Critical) under the Common Vulnerability Scoring System (CVSS). NIST scored it 7.5 (High). That gap is a scoring dispute I will return to. What matters first is the attack condition listed in the record: no credentials required, no access to Microsoft’s systems, no user interaction. The attacker sent an email.
The email contained an embedded injection payload. Copilot, operating as an agent that reads and processes email, retrieved the message, parsed the content, and followed the instruction the attacker had placed inside it. The result was data exfiltration from the victim’s M365 environment. Pavan Reddy and Aditya Sanjay Gujral from Aim Security documented this at the Association for the Advancement of Artificial Intelligence (AAAI) Fall Symposium 2025 and named it EchoLeak (Reddy and Gujral, AAAI Fall Symposium 2025, arXiv:2509.10540). They described it as the first publicly documented zero-click prompt injection exploit against a production large language model (LLM) system.
The CVSS gap is worth understanding. Microsoft’s 9.3 score accounts for Scope Change: the vulnerability crosses from the email environment into the agent’s privilege context, reaching data the attacker had no direct access to. NIST’s 7.5 score does not apply that modifier. The CVSS framework was built for software vulnerabilities that do not have the concept of a trust boundary enforced by natural language instruction sets. Both scoring bodies applied their framework correctly. The frameworks were designed for a different attack class.
The attack class itself is not ambiguous. If your agent reads external content, whoever controls that content can issue it instructions. That is not an implementation flaw in Microsoft’s code, in the way a buffer overflow is a flaw in a specific function. It is a structural property of how language models process context. The attack surface is not your infrastructure. It is everything your agent reads.
Why there is no boundary
To understand why this is structural rather than patchable, you need to understand what happens at the level below the application.
A language model does not receive a system prompt, a user message, and a retrieved document as three separate inputs with separately tracked trust levels. It receives a single flat sequence of text fragments, called tokens, that represent all of those things concatenated together. The mechanism that determines what the model focuses on when generating each word of its response, what researchers call the self-attention mechanism, computes relationships between every token in the sequence and every other token simultaneously. There is no step in that process that records where a given token came from or what authority level it should carry.
This is not an oversight in the design. It is the design. A transformer model is a general-purpose sequence processor, and that generality is what makes it capable of integrating complex context from multiple sources into a coherent response. The same property that allows the model to synthesize information from a retrieved document is what allows an instruction embedded in that document to function as an instruction.
Researchers at the Network and Distributed System Security (NDSS) Symposium 2026 published a detection framework called Rennervate (Zhong et al., NDSS 2026, arXiv:2512.08417) that exploits an architectural inverse of this property: injected instructions produce measurably different attention weight patterns compared to legitimate context, and Rennervate uses those patterns to detect and sanitize injection payloads. The fact that detection is possible via attention weights confirms that the model’s processing does differ statistically between injected and legitimate content. It does not confirm that a boundary exists. Detection is not prevention. Rennervate flags that something was injected; it cannot stop the model from processing it before the flag is raised. Sanitization is the step that makes this outcome-changing rather than friction-only: after detection, Rennervate removes the payload from the context before the model produces its response and acts on it. The distinction is between what the model processes at the token level and what it acts on at the output level.
The practical implication: any delimiter you insert between retrieved content and your system prompt, whether “DOCUMENT:” or XML tags or triple backticks, is a textual marker parsed by the same token-level sequence processor as everything else. The model treats it as one more token. It is not enforced by the attention computation. A crafted payload that mimics system prompt formatting, or explicitly claims authority over the session, is processed alongside the legitimate system prompt with no architectural mechanism to prefer one over the other.
The prior article in this series (Adversarial Inputs at Inference Time) established that alignment training concentrates safety behavior into a single direction in the numerical states the model holds at each processing step, what the prior article calls its activation space: one learnable axis out of billions along which “processing a harmful prompt” diverges from “processing a harmless one.” Suppress that axis and refusal disappears entirely. The context-window trust collapse described here is the same architectural condition expressed at the input level rather than the parameter level. The model enforces no security invariants during the forward pass: not “this content is harmful” and not “this content is not from a trusted source.” Attacks on alignment geometry and attacks on the context-window trust boundary reach different entry points by exploiting the same underlying property: the forward pass (the computation the model runs each time it processes an input) is a uniform computation over whatever tokens arrive, and learned behavior is the only enforcement mechanism available.
The retrieval pipeline
Retrieval-augmented generation (RAG) is the technique where an agent fetches external documents to answer questions that fall outside its training data. The typical flow: a user submits a query; the system converts that query into a mathematical fingerprint called an embedding; a vector database returns the documents whose embeddings are most similar to the query’s; those documents are inserted into the model’s context window alongside the user’s query and the developer’s system prompt; the model generates a response.
The injection point is step three: context assembly. Retrieved content, system prompt, and user query are concatenated into a single string and passed to the model. At that point, the property described above takes over. The retrieved content is tokens. The system prompt is tokens. The model processes them together with no hard boundary.
The attacker’s leverage is positioning: control over any document the agent retrieves is control over part of the context window. Kai Greshake, Sahar Abdelnabi, and colleagues at CISPA Helmholtz Center and the Max Planck Institute for Security and Privacy formalized this in 2023 (Greshake et al., Association for Computing Machinery (ACM) AISec 2023, arXiv:2302.12173) and demonstrated it against Bing’s GPT-4-powered chat interface by placing injection payloads in publicly indexed web pages. The agent retrieved them in response to normal user queries. No access to Bing’s systems was required.
The InjecAgent benchmark, a standardized evaluation suite for indirect injection against tool-integrated agents published at the Association for Computational Linguistics (ACL) 2024 conference (Zhan et al., ACL 2024 Findings, arXiv:2403.02691), quantified this across 1,054 test cases spanning 17 categories of user tools and 62 categories of attacker tools. A successful attack in InjecAgent means the injected instruction caused the agent to take an attacker-specified action rather than completing the user’s original task. Against GPT-4 using the ReAct prompting approach, a standard method that has the model reason through tool-use sequences step by step before acting, baseline injection attacks succeeded 24% of the time under non-adaptive conditions, meaning the attacker did not modify their approach based on knowledge of the defense. When attackers reinforced payloads with explicit instruction-override framing, success rates nearly doubled on the same model.
The protocol layer
The Model Context Protocol (MCP) is the specification developed by Anthropic for connecting AI clients, such as Claude Desktop or Cursor, to external tool servers. When a client connects to an MCP server, the server sends a manifest of available tools: each tool’s name, description, and parameter schema. The client passes this manifest to the language model so the model can decide which tools to invoke and when.
Tool poisoning is what happens when a malicious MCP server, or a legitimate server whose tool metadata has been compromised, embeds injection instructions in those manifest fields. This attack is mechanically distinct from RAG pipeline injection in one important way: it fires at connection time, not query time. A poisoned MCP server establishes influence over the agent’s behavior from the moment the session begins, without requiring the attacker’s payload to rank highly in a semantic search or appear in a document the user happens to request.
Huang et al. evaluated seven major MCP clients for static validation of incoming tool manifests and found significant security gaps across most of them: insufficient validation of metadata content, limited visibility into parameter fields before they reach the model (Huang et al., arXiv:2603.22489, March 2026). The official MCP Security Best Practices specification (modelcontextprotocol.io) documents three distinct protocol-layer attack vectors. The confused deputy attack exploits the combination of static Open Authorization (OAuth) client IDs and dynamic client registration: an MCP proxy that authenticates to a third-party API using a shared client identifier can be manipulated by a malicious registration to redirect authorization codes to an attacker-controlled endpoint, bypassing user consent because the consent cookie from a prior legitimate session is still valid. The token passthrough attack involves MCP servers accepting tokens not explicitly issued to them, enabling a compromised server to impersonate a legitimate client against downstream APIs. Session hijack prompt injection queues a malicious payload into a session store, which the legitimate client then retrieves as part of its normal processing loop.
The MCP specification explicitly prohibits the first two patterns as mandatory requirements. The third is documented as a risk requiring mitigation at the implementation level. The specification naming prompt injection as a first-class protocol-layer threat is notable: it is one of the first major AI protocol documents to treat injection not as an application-level concern but as a transport-level one.
The empirical record
The series opening introduced AIShellJack and the Month of AI Bugs at the level of headline figures. The data behind both is worth more precision than a survey allows.
AIShellJack (Liu et al., arXiv:2509.22040, September 2025) is an automated attack framework deploying 314 payloads across 70 techniques from the MITRE ATT&CK framework (Adversarial Tactics, Techniques, and Common Knowledge), the industry-standard taxonomy for adversary behavior. A successful payload in AIShellJack is one that causes the target agent to execute an action corresponding to the MITRE technique under test: dropping a credential-harvesting file for Credential Access, establishing an outbound connection for Command and Control. Testing targeted Cursor and GitHub Copilot in agent mode across five development scenarios.
Success rates by MITRE tactic:
Tactic Success Rate
Initial Access 93.3%
Discovery 91.1%
Impact 83.0%
Collection 77.0%
Privilege Escalation 71.5%
Credential Access 68.2%
Defense Evasion 67.6%
Persistence 62.7%
Exfiltration 55.6%Cursor configured with auto-approve for tool calls, meaning no human confirmation step before the agent acts, reached 83.4% overall attack success rate on the TypeScript development scenario, with Command and Control techniques reaching 100%. GitHub Copilot, which retains a confirmation step in default configuration, showed lower but still substantial vulnerability at 41-52% overall attack success rate. The confirmation step is not a defense; it is friction. It disappears in any deployment where auto-approve is enabled.
AgentDojo, a dynamic evaluation suite for agent security published at the Neural Information Processing Systems conference (NeurIPS) 2024 as a Spotlight paper, top 4% of submissions to the Datasets and Benchmarks track (Debenedetti et al., NeurIPS 2024, arXiv:2406.13352), covers 97 realistic tasks and 629 security test cases across email management, e-banking, and travel-booking scenarios. Its finding on model capability and exploitability is counterintuitive. A successful attack in AgentDojo means the injected goal was completed instead of the user’s original task.
Model Benign Task Success Attack Success (no defense)
Claude 3.5 Sonnet 78.22% 33.86%
GPT-4o 69.00% 47.69%
Claude 3 Opus 66.61% 11.29%
GPT-4 Turbo 63.43% 28.62%
Llama 3 70B 34.50% 20.03%GPT-4o succeeds on the most legitimate tasks and shows the highest exploitability. Claude 3 Opus sits in the middle of the capability range but shows the lowest attack success among frontier models. The relationship is not strictly linear, but the general pattern is clear: models that follow instructions most reliably also follow injected instructions most reliably. Capability and exploitability share a root.
Rehberger’s Month of AI Bugs 2025 logged 29 disclosures across more than 13 named platforms: ChatGPT, ChatGPT Codex, Claude Code, GitHub Copilot, Google Jules, Amazon Q Developer, Devin, OpenHands, Windsurf, Cursor, Manus, AWS Kiro, and Cline among others. Two received formal Common Vulnerabilities and Exposures (CVE) assignments. CVE-2025-55284 (Claude Code, CVSS 7.5) covered a command allowlist bypass that allowed injected instructions to exfiltrate file contents via DNS using pre-authorized commands like ping. CVE-2025-53773 (GitHub Copilot with Visual Studio 2022, CVSS 7.8) covered command injection enabling local code execution. The remaining 27 disclosures have no CVE, because no CVE taxonomy exists for prompt injection as a distinct vulnerability class. The classification infrastructure has not caught up with the attack surface.
Four vulnerability patterns appeared across unrelated platforms. DNS-based exfiltration through pre-approved commands. Invisible Unicode injection, using directional control characters that display as normal text to a human reviewer while carrying hidden instructions to the model. Persistent payload injection via configuration files that survive session resets, achieving ongoing access without requiring the attacker to be present in each new session. Remote code execution through prompt injection into development resources the agent processes without sanitization. The same four patterns appearing across platforms with entirely independent codebases and separate security organizations is the empirical argument for a structural cause. If implementation choices determined exploitability, you would see variation. The record shows near-universal presence.
The same structural pattern appeared independently outside the coding-editor ecosystem. Aim Security’s PipeLeak demonstrated that a public lead form payload could hijack a Salesforce Agentforce agent with no authentication required. Noma Labs disclosed ForcedLeak (CVSS 9.4) through a similar vector in the same platform; Salesforce patched it by enforcing Trusted URL allowlists.
AgentHopper, the self-propagating variant Rehberger documented (Rehberger, embracethered.com, 2025), illustrates what happens when an agent has three standard capabilities at once: write access to files, shell command execution, and Git repository access with push authorization. The attack compromises an agent through a standard injection, directs it to scan for accessible Git repositories, injects those repositories with a payload targeting multiple coding-agent configuration formats, commits the change, and pushes to the upstream remote using the developer’s authenticated credentials. When the next developer pulls that repository, their agent encounters the payload on first use. The propagation mechanism is not a network exploit. It is a redirect of an authorized capability toward an unauthorized end. The agent was authorized to push code. AgentHopper uses that authorization.
What defenses change and what they don’t
The defense picture is better than the attack success rates suggest and worse than most vendor comparisons indicate. The gap between those two descriptions is a measurement problem: most published defense evaluations use non-adaptive attackers, meaning attackers who do not modify their approach based on knowledge of which defense is deployed. Adaptive attackers do. The two conditions produce materially different results, and they often come from different papers requiring separate citations.
The Maloyan and Namiot systematic review (Maloyan and Namiot, arXiv:2601.17548, January 2026), a meta-analysis of 78 studies published between 2021 and 2026, reports the following for commercial secondary LLM-based injection detectors. Non-adaptive bypass rate is the published product figure; adaptive bypass rate is measured against attackers who know the defense is in place and craft payloads to evade it. A successful bypass means the injection payload passed the detector and the agent completed the attacker’s goal rather than the user’s task:
Detector Non-adaptive bypass Adaptive bypass
Protect AI <5% 93%
PromptGuard <3% 91%
PIGuard <5% 89%
Model Armor <10% 78%
TaskTracker <8% 85%The non-adaptive figures are what appear in most vendor documentation. The adaptive figures are what a threat actor who knows which product you have deployed will achieve. The gap averages 80 percentage points. If you are evaluating injection detection products, the first question to ask is which condition the published evaluation used. Under adaptive attack conditions, every commercial detector in this table adds friction rather than changes the outcome: the attacker’s cost increases, but the structural exploitability does not close.
This measurement gap is not unique to injection defenses. The prior article in this series documents the same pattern across 12 published jailbreak defenses evaluated by Carlini, Nasr, Hayes, Shumailov, and Tramèr (Carlini et al., arXiv:2510.09023, October 2025): claimed attack success rates near zero, adaptive attack success rates above 90% for most and 100% for two. The convergence across two completely different attack surfaces, jailbreaking at the inference layer and injection at the retrieval layer, is not a coincidence about specific defenses. It is a finding about how AI security evaluations have been conducted: against attackers who have not read the paper they are trying to defeat.
AgentDojo provides the clearest side-by-side comparison of architectural approaches under consistent conditions. The following results are all from non-adaptive attacker conditions on the same benchmark. A successful attack means the attacker’s goal was completed rather than the user’s task:
All results below: non-adaptive attacker conditions, GPT-4o.
Defense Benign utility Attack success
No defense 69.0% 47.69%
Data delimiters 72.7% 41.65%
Prompt repetition 85.5% 27.82%
Injection detector 41.5% 7.95%
Tool-call filter 73.1% 6.84%The tool-call filter is the standout result: it checks whether each proposed tool invocation matches the user’s original intent before the agent acts, reducing attack success to 6.84% with minimal utility cost. The injection detector reaches comparable attack success reduction (7.95%) but drops benign task completion to 41.5%, roughly half the undefended baseline. If you are choosing between these two in a deployment, that utility cost is significant. For deployments that cannot modify model training, the tool-call filter is the most effective option that changes the outcome without significant utility cost.
Progent, a privilege control framework from UC Berkeley (Shi et al., arXiv:2504.11703, April 2025), implements fine-grained policies over which tools an agent can invoke and under what conditions, expressed in a domain-specific policy language. On AgentDojo, Progent reduces attack success from 41.2% to 2.2%. On two other evaluation suites (AgentSafeBench and AgentPoison), it reaches 0%.
The 2.2% residual is not a rounding artifact. It represents attacks where the attacker’s goal falls within the agent’s authorized capability scope. An agent authorized to send emails can be directed by an injection to send them to an attacker-specified recipient rather than the intended one. Privilege control prevents unauthorized tool invocation. It does not prevent an agent from being directed to use authorized tools in unauthorized ways. Specifying those constraints fully requires anticipating every possible misuse of every authorized tool, and that specification burden grows with what the agent is allowed to do.
Three approaches change the outcome at the architectural level rather than adding friction at the input level. SecAlign teaches the model to prefer complying with its original system prompt over following injected content. The approach is implemented via preference-optimization, a technique that adjusts the model’s parameters by rewarding preferred outputs over unpreferred ones during training. On AgentDojo, it reduces attack success from 96% to approximately 2% (Maloyan and Namiot, arXiv:2601.17548). The OpenAI instruction hierarchy (Wallace et al., International Conference on Learning Representations (ICLR) 2025, arXiv:2404.13208) trains models to treat system prompt instructions as higher priority than retrieved content or tool outputs. Applied to GPT-3.5, the paper reports robustness improvements including against attack types not seen during training. A specific residual attack success rate is not reported in the published abstract; the paper describes the improvement as “drastic.” That is qualitative. You cannot use it for deployment planning. Rennervate (Zhong et al., NDSS 2026, arXiv:2512.08417) outperforms 15 prior defense methods across five models and six datasets, with demonstrated transfer to unseen attack types. Specific residual rates are likewise not reported in the abstract.
The firewall approach documented by Bhagwatkar et al. (arXiv:2510.05244, October 2025) places two model-based filters at the agent-tool interface: one that checks inputs to tools before they execute, and one that sanitizes tool outputs before they re-enter the context window. The paper reports “perfect security” across four public benchmarks, then immediately identifies significant integrity problems in those same benchmarks: implementation errors, flawed success metrics, and attack designs that are too weak to stress the defense. A three-stage adaptive attack strategy the authors developed themselves substantially degrades the claimed performance. The “perfect security” result is bounded by benchmark quality.
What stays open
After every defense in the current Tier 1 research record is applied:
The best-measured single defense under non-adaptive conditions (tool-call filter) leaves 6.84% attack success on AgentDojo. Privilege control (Progent) leaves 2.2% on AgentDojo and 0% on two other benchmarks, but the residual 2.2% is a category that privilege control cannot close by design: attacks within authorized scope. Architectural training approaches (SecAlign, instruction hierarchy) report the most promising residual figures, but the evaluation conditions and adaptive attack performance are not uniformly published.
Under adaptive attack conditions, the picture is worse. The Maloyan and Namiot meta-analysis reports that all 12 evaluated commercial detector products were bypassed at rates exceeding 90% when attackers specifically targeted the deployed defense. The Bhagwatkar et al. firewall paper demonstrates that adaptive attack design substantially degrades claimed benchmark results. Adaptive attack evaluations for privilege control approaches are not yet as extensively benchmarked.
The Open Web Application Security Project (OWASP) Top 10 for LLM Applications 2025 (OWASP Gen AI Security Project, genai.owasp.org) ranks prompt injection as the top threat to LLM applications and states directly that “it is unclear if there are fool-proof methods of prevention” given the stochastic nature of generative AI. That is an accurate characterization of the current evidence, not a hedge.
The CVSS scoring gap on EchoLeak that I opened with is an accurate index of where the industry stands. NIST scored 7.5 because their framework was not built to account for trust boundaries enforced by natural language. Microsoft scored 9.3 because they looked at what the attack actually did: an email, sent to a user, reading the victim’s confidential data and transmitting it to an attacker-controlled endpoint, with no user action required beyond receiving the message. Both scores are correctly derived from their respective frameworks. Neither framework was built for this.
The article on adversarial inputs at inference time showed that alignment is a one-dimensional brake in a billion-dimensional parameter space, and suppressing it eliminates refusal entirely. This article shows that context assembly is a zero-trust pipeline, and any content that enters it is processed as instructions. The structural fix for both is the same thing that does not yet exist: an architecture that enforces security invariants during the forward pass rather than learning them over token distributions.
The next article in this series takes up the question that follows from both: what happens when the attacker is not a human placing content in a retrieval corpus, but a reasoning model that plans and executes its own attack sequences autonomously against other AI systems.
Peace. Stay curious! End of transmission.
Fact-Check Appendix
Statement: CVE-2025-32711 scored CVSS 9.3 by Microsoft, 7.5 by NIST | Source: NIST National Vulnerability Database (NVD) | https://nvd.nist.gov/vuln/detail/CVE-2025-32711
Statement: EchoLeak described as first publicly documented zero-click prompt injection exploit in a production LLM system | Source: Reddy and Gujral, AAAI Fall Symposium 2025 | https://arxiv.org/abs/2509.10540
Statement: InjecAgent covers 1,054 test cases | Source: Zhan et al., ACL 2024 Findings | https://arxiv.org/abs/2403.02691
Statement: InjecAgent covers 17 categories of user tools and 62 categories of attacker tools | Source: Zhan et al., ACL 2024 Findings | https://arxiv.org/abs/2403.02691
Statement: GPT-4 (ReAct) baseline injection success 24% under non-adaptive conditions | Source: Zhan et al., ACL 2024 Findings | https://arxiv.org/abs/2403.02691
Statement: Success rates nearly doubled with reinforced payloads | Source: Zhan et al., ACL 2024 Findings | https://arxiv.org/abs/2403.02691
Statement: AIShellJack deploys 314 payloads covering 70 MITRE ATT&CK techniques | Source: Liu et al., arXiv:2509.22040 | https://arxiv.org/abs/2509.22040
Statement: Initial Access success rate 93.3% | Source: Liu et al., arXiv:2509.22040 | https://arxiv.org/abs/2509.22040
Statement: Discovery success rate 91.1% | Source: Liu et al., arXiv:2509.22040 | https://arxiv.org/abs/2509.22040
Statement: Impact success rate 83.0% | Source: Liu et al., arXiv:2509.22040 | https://arxiv.org/abs/2509.22040
Statement: Collection success rate 77.0% | Source: Liu et al., arXiv:2509.22040 | https://arxiv.org/abs/2509.22040
Statement: Privilege Escalation success rate 71.5% | Source: Liu et al., arXiv:2509.22040 | https://arxiv.org/abs/2509.22040
Statement: Credential Access success rate 68.2% | Source: Liu et al., arXiv:2509.22040 | https://arxiv.org/abs/2509.22040
Statement: Defense Evasion success rate 67.6% | Source: Liu et al., arXiv:2509.22040 | https://arxiv.org/abs/2509.22040
Statement: Persistence success rate 62.7% | Source: Liu et al., arXiv:2509.22040 | https://arxiv.org/abs/2509.22040
Statement: Exfiltration success rate 55.6% | Source: Liu et al., arXiv:2509.22040 | https://arxiv.org/abs/2509.22040
Statement: Cursor (auto-approve) 83.4% overall attack success rate on TypeScript scenario | Source: Liu et al., arXiv:2509.22040 | https://arxiv.org/abs/2509.22040
Statement: Cursor Command and Control techniques 100% success | Source: Liu et al., arXiv:2509.22040 | https://arxiv.org/abs/2509.22040
Statement: GitHub Copilot 41-52% overall attack success rate | Source: Liu et al., arXiv:2509.22040 | https://arxiv.org/abs/2509.22040
Statement: AgentDojo covers 97 realistic tasks and 629 security test cases | Source: Debenedetti et al., NeurIPS 2024 | https://arxiv.org/abs/2406.13352
Statement: AgentDojo accepted as NeurIPS 2024 Spotlight, top 4% of Datasets and Benchmarks submissions | Source: NeurIPS 2024 proceedings | https://papers.nips.cc/paper_files/paper/2024/hash/97091a5177d8dc64b1da8bf3e1f6fb54-Abstract-Datasets_and_Benchmarks_Track.html
Statement: Claude 3.5 Sonnet benign task success 78.22%, attack success 33.86% | Source: Debenedetti et al., NeurIPS 2024 | https://arxiv.org/abs/2406.13352
Statement: GPT-4o benign task success 69.00%, attack success 47.69% | Source: Debenedetti et al., NeurIPS 2024 | https://arxiv.org/abs/2406.13352
Statement: Claude 3 Opus benign task success 66.61%, attack success 11.29% | Source: Debenedetti et al., NeurIPS 2024 | https://arxiv.org/abs/2406.13352
Statement: GPT-4 Turbo benign task success 63.43%, attack success 28.62% | Source: Debenedetti et al., NeurIPS 2024 | https://arxiv.org/abs/2406.13352
Statement: Llama 3 70B benign task success 34.50%, attack success 20.03% | Source: Debenedetti et al., NeurIPS 2024 | https://arxiv.org/abs/2406.13352
Statement: Month of AI Bugs 2025: 29 disclosures across more than 13 named platforms | Source: Rehberger, embracethered.com | https://embracethered.com/blog/posts/2025/wrapping-up-month-of-ai-bugs/
Statement: CVE-2025-55284 (Claude Code), CVSS 7.5 | Source: NIST NVD | https://nvd.nist.gov/vuln/detail/CVE-2025-55284
Statement: CVE-2025-53773 (GitHub Copilot / Visual Studio 2022), CVSS 7.8 | Source: NIST NVD | https://nvd.nist.gov/vuln/detail/CVE-2025-53773
Statement: AgentDojo no defense: 47.69% attack success, 69.0% benign utility (GPT-4o) | Source: Debenedetti et al., NeurIPS 2024 | https://arxiv.org/abs/2406.13352
Statement: AgentDojo data delimiters: 41.65% attack success, 72.7% benign utility | Source: Debenedetti et al., NeurIPS 2024 | https://arxiv.org/abs/2406.13352
Statement: AgentDojo prompt repetition: 27.82% attack success, 85.5% benign utility | Source: Debenedetti et al., NeurIPS 2024 | https://arxiv.org/abs/2406.13352
Statement: AgentDojo injection detector: 7.95% attack success, 41.5% benign utility | Source: Debenedetti et al., NeurIPS 2024 | https://arxiv.org/abs/2406.13352
Statement: AgentDojo tool-call filter: 6.84% attack success, 73.1% benign utility | Source: Debenedetti et al., NeurIPS 2024 | https://arxiv.org/abs/2406.13352
Statement: Progent reduces AgentDojo attack success from 41.2% to 2.2% | Source: Shi et al., arXiv:2504.11703 | https://arxiv.org/abs/2504.11703
Statement: Progent reaches 0% on AgentSafeBench and AgentPoison | Source: Shi et al., arXiv:2504.11703 | https://arxiv.org/abs/2504.11703
Statement: Protect AI non-adaptive bypass <5%, adaptive bypass 93% | Source: Maloyan and Namiot, arXiv:2601.17548 | https://arxiv.org/abs/2601.17548
Statement: PromptGuard non-adaptive bypass <3%, adaptive bypass 91% | Source: Maloyan and Namiot, arXiv:2601.17548 | https://arxiv.org/abs/2601.17548
Statement: PIGuard non-adaptive bypass <5%, adaptive bypass 89% | Source: Maloyan and Namiot, arXiv:2601.17548 | https://arxiv.org/abs/2601.17548
Statement: Model Armor non-adaptive bypass <10%, adaptive bypass 78% | Source: Maloyan and Namiot, arXiv:2601.17548 | https://arxiv.org/abs/2601.17548
Statement: TaskTracker non-adaptive bypass <8%, adaptive bypass 85% | Source: Maloyan and Namiot, arXiv:2601.17548 | https://arxiv.org/abs/2601.17548
Statement: SecAlign reduces attack success from 96% to approximately 2% | Source: Maloyan and Namiot, arXiv:2601.17548 | https://arxiv.org/abs/2601.17548
Statement: Rennervate outperforms 15 prior defense methods across five models and six datasets | Source: Zhong et al., NDSS 2026 | https://arxiv.org/abs/2512.08417
Statement: Instruction hierarchy applied to GPT-3.5 improves robustness including against unseen attack types | Source: Wallace et al., ICLR 2025 | https://arxiv.org/abs/2404.13208
Statement: MCP threat modeling evaluated seven major MCP clients | Source: Huang et al., arXiv:2603.22489 | https://arxiv.org/abs/2603.22489
Statement: PipeLeak: public lead form payload hijacked a Salesforce Agentforce agent with no authentication required | Source: Aim Security research disclosure, 2025 | (pre-publication: confirm source URL)
Statement: ForcedLeak CVSS 9.4, similar injection vector in Salesforce Agentforce, patched via Trusted URL allowlists | Source: Noma Labs security disclosure, 2025 | (pre-publication: confirm source URL)
Statement: All 12 evaluated commercial detector products bypassed at rates exceeding 90% under adaptive attack conditions | Source: Maloyan and Namiot, arXiv:2601.17548 | https://arxiv.org/abs/2601.17548 | (pre-publication: verify total product count against source. Body text says 12, table shows 5. Confirm whether table is a representative subset or the full set.)
Statement: OWASP ranks prompt injection as LLM01:2025 and states “it is unclear if there are fool-proof methods of prevention” | Source: OWASP Gen AI Security Project, 2025 | https://genai.owasp.org/llmrisk/llm01-prompt-injection/
Top 5 Sources
1. Debenedetti et al., “AgentDojo,” NeurIPS 2024 Datasets and Benchmarks Track (Spotlight)
arXiv:2406.13352 | ETH Zurich | https://arxiv.org/abs/2406.13352
Authoritative because: Peer-reviewed at NeurIPS as a Spotlight paper (top 4% of submissions). The most controlled side-by-side comparison of attack success rates by model and defense type in the literature. Publicly released benchmark enables independent replication.
2. Greshake, Abdelnabi et al., “Not What You’ve Signed Up For,” ACM AISec 2023
arXiv:2302.12173 | CISPA Helmholtz Center and MPI-SP | https://arxiv.org/abs/2302.12173
Authoritative because: Foundational peer-reviewed paper that named and formalized indirect prompt injection as an attack class, demonstrated it against production systems, and established the taxonomy the field has built on since.
3. Maloyan and Namiot, Systematization of Knowledge on Prompt Injection, 2026
arXiv:2601.17548 | https://arxiv.org/abs/2601.17548
Authoritative because: Meta-analysis of 78 studies from 2021 to 2026 with explicit methodology. Provides the only side-by-side comparison of adaptive vs. non-adaptive defense performance across commercial products. The 80-percentage-point average gap between published and adaptive-attack figures is its most significant finding.
4. Liu et al., “Your AI, My Shell” (AIShellJack), September 2025
arXiv:2509.22040 | https://arxiv.org/abs/2509.22040
Authoritative because: The most comprehensive empirical taxonomy of injection attacks against AI coding editors to date. 314 payloads, 70 MITRE ATT&CK techniques, and tactic-level success rate breakdown across two production platforms provide a structured attack surface map not available elsewhere.
5. Shi et al., “Progent: Programmable Privilege Control for LLM Agents,” April 2025
arXiv:2504.11703 | UC Berkeley | https://arxiv.org/abs/2504.11703
Authoritative because: Named authors at UC Berkeley with evaluation across three independent benchmarks. The 41.2% to 2.2% result on AgentDojo is the most specific published figure for privilege-control defense effectiveness, and the 2.2% residual identifies a structural limitation (authorized-scope attacks) that no privilege-control approach resolves.




