AI File Upload Security Advisory: Untrusted File Pipeline for LLM Context
A comprehensive advisory on securing file upload pipelines for LLM applications. Covers parser exploits, prompt injection vectors, and defense-in-depth architectures based on CVE-2025-66516 and OWASP Agentic Top 10 2026.
Executive Summary
The integration of file upload functionality into LLM applications has created a significant attack surface that the security community is only beginning to understand. When users upload documents, images, or data files to AI systems, those files pass through parsing pipelines before entering the model's context window. This flow, which I call the untrusted file pipeline, introduces risks that traditional web application security does not adequately address.
In 2025 and early 2026, we have seen critical vulnerabilities emerge in document parsing libraries, new variants of indirect prompt injection attacks, and the maturation of the OWASP Top 10 for Agentic Applications which directly addresses these concerns. The untrusted file pipeline represents a convergence point where classic vulnerabilities like XML External Entity injection combine with LLM-specific attack patterns like context poisoning.
This advisory provides a comprehensive analysis of the current threat landscape, examines real-world vulnerabilities through the lens of CVE-2025-66516, and outlines recommended architectures for organizations building or operating LLM applications that process user-uploaded content.
Summary Details
The core problem stems from how LLMs process context. When a document enters the model context window, the distinction between user data and user instructions becomes blurred. The model treats everything as part of a single prompt, which means hidden content in files can influence model behavior in ways that users do not expect or intend.
Attackers exploit this through several vectors. First, parser exploits can compromise the infrastructure processing files before content even reaches the LLM. Second, indirect prompt injection embeds malicious instructions in file content that the model interprets as directives. Third, in agentic systems, poisoned context can corrupt long-term memory and tool use behaviors.
The stakes are significant. A successful attack can result in data exfiltration, unauthorized tool access, goal hijacking in autonomous agents, or infrastructure compromise through vulnerable parsing libraries. The distributed nature of modern AI applications, where multiple services handle different stages of document processing, complicates the security picture further.
Section 1: Threat Landscape (Current as of 2026)
The threat landscape for LLM file uploads has evolved substantially over the past eighteen months. What began as theoretical discussions about prompt injection has matured into documented attack techniques with real-world impact.
Parser Exploit Proliferation
Document parsing libraries have become a favored target for attackers because they occupy a trusted position in the processing pipeline. Libraries like Apache Tika, PDF.js, and various Office document parsers process complex structured formats containing XML, embedded scripts, and metadata. When these parsers encounter maliciously crafted files, vulnerabilities can be triggered before any sanitization occurs.
The frequency of parser-related CVEs has increased notably. Beyond CVE-2025-66516, we have observed vulnerabilities in popular PDF processing libraries, image metadata parsers, and archive extraction tools. Each represents a potential vector for infrastructure compromise or data exfiltration.
Indirect Prompt Injection Maturity
The technique of embedding malicious instructions in file content has moved from proof-of-concept to operational attack. Researchers have documented success rates where poisoned documents manipulate RAG system outputs approximately 90 percent of the time when summarized or queried.
The attack surface is broad. Hidden text in PDFs using white-on-white fonts or tiny font sizes evades human review but gets extracted during parsing. Metadata fields in Office documents can contain instructions that survive document conversion. Even image alt text and CSV file comments have been demonstrated as viable injection vectors.
Agentic System Risks
The emergence of agentic AI systems has amplified these risks. Agents that plan, use tools, and act autonomously treat ingested data as authoritative in ways that traditional LLM applications do not. When an agent retrieves context from a RAG system or processes uploaded files, embedded instructions can influence goal prioritization, tool selection, and action sequencing.
The OWASP Top 10 for Agentic Applications 2026 specifically addresses these concerns. Entries for Agent Goal Hijack, Tool Misuse, and Memory and Context Poisoning directly map to untrusted file pipeline risks.
Section 2: Architecture Analysis - Current State and Gaps
Most production LLM applications follow a similar pattern for file handling. A user uploads a file through a web interface or API. The file passes through validation checks and virus scanning. It is then stored temporarily while a parser extracts content. The extracted text is chunked and either stored in a vector database for retrieval or fed directly to the LLM with the user prompt.
This architecture has several security gaps that I have observed repeatedly in assessments.
Trust Boundary Confusion
The most fundamental issue is the lack of explicit trust boundaries between file content and model context. In most implementations, parsed content is concatenated with system prompts and user queries without any structural differentiation. The model has no reliable way to distinguish between instructions intended by the developer and content embedded in uploaded files.
Some implementations attempt to address this through delimiter-based approaches. Files might be wrapped in special tokens or structured formats that theoretically signal untrusted content. However, research has demonstrated that these measures are circumventable. Models can be coaxed into ignoring delimiter markers through carefully crafted prompts.
Parsing Layer Isolation Gaps
The parsing layer often operates with more privilege than necessary. Parser services may have network access, filesystem read permissions beyond their immediate working directory, or the ability to execute system commands through library features. When parser vulnerabilities are exploited, attackers can pivot to broader infrastructure compromise.
Containerization helps but is not a complete solution. Standard Docker containers share the host kernel, and container escape vulnerabilities have been documented. More robust isolation using microVMs or WebAssembly runtimes adds meaningful security but increases operational complexity.
Monitoring Gaps
File processing pipelines often lack the monitoring necessary to detect injection attempts or parsing exploits. Standard application logging may capture the fact that a file was uploaded and processed, but not the specific content extracted or how the model interpreted it. This makes incident investigation difficult and allows successful attacks to persist undetected.
Section 3: Parser Exploit Risk - CVE-2025-66516 Case Study
CVE-2025-66516 provides an instructive case study in how parser vulnerabilities create LLM pipeline risks. This vulnerability in Apache Tika demonstrates the attack chain from parser exploit to potential model context compromise.
Vulnerability Details
Apache Tika is widely used for content extraction from documents. It handles formats including PDF, Microsoft Office, OpenDocument, and many others. The vulnerability is an XML External Entity injection affecting versions 1.13 through 3.2.1 of tika-core, versions 2.0.0 through 3.2.1 of the PDF module, and versions 1.13 through 1.28.5 of the parsers module.
The root cause lies in unsafe handling of XML external entities during parsing of documents containing XFA structures. An attacker can craft a PDF containing a malicious XFA file that, when processed by Tika, causes the parser to read local server files, make unauthorized network requests, or trigger denial of service conditions.
CVSS scores ranging from 8.4 to 10.0 reflect the critical severity. The vulnerability is remotely exploitable without authentication, making it particularly dangerous in any internet-facing document processing service.
LLM Pipeline Impact
For LLM applications, CVE-2025-66516 creates several risk scenarios. In the most severe case, an attacker could craft a document that, when processed by a vulnerable Tika instance, exfiltrates sensitive files from the server. These files might contain API keys, configuration data, or previously processed document content. The exfiltrated data could then be used to compromise the broader system or poison subsequent LLM interactions.
Even without file exfiltration, the SSRF capability allows attackers to probe internal network services. This reconnaissance can reveal the architecture of backend systems, identify additional attack surfaces, or interact with internal APIs in unauthorized ways.
Mitigation Requirements
The primary mitigation is upgrading to Tika version 3.2.2 or later, which disables unsafe XML entity resolution. Organizations using Tika in LLM pipelines should audit their dependencies immediately and apply patches. Detection can be enhanced by monitoring parser logs for XXE-related errors or unusual file access patterns.
For defense-in-depth, the parsing layer should operate in an isolated environment with minimal privileges. Network access should be restricted, and file system access should be limited to immediate working directories. Any sensitive files that the parser could access represent potential exfiltration targets.
Section 4: Prompt Injection - Current State of the Art
Prompt injection through files represents a distinct attack category from parser exploits. While parser vulnerabilities compromise the infrastructure, prompt injection manipulates model behavior without requiring any infrastructure-level compromise.
Attack Mechanics
Indirect prompt injection works by embedding instructions in content that the model treats as authoritative. Unlike direct prompt injection where an attacker controls the user prompt directly, indirect injection hides payloads in files that users upload believing them to be benign documents.
The attack is particularly effective because LLMs process context sequentially without inherent awareness of which content is more trusted. When a document instructs the model to ignore previous instructions or perform actions that benefit the attacker, the model may comply because the instruction appears to come from an authoritative source within the conversation context.
Effectiveness in RAG Systems
Research on retrieval-augmented generation systems demonstrates high success rates for injection attacks. When documents are indexed and retrieved as context for queries, poisoned content influences responses even when users ask completely unrelated questions. The retrieval mechanism selects documents based on relevance to the query, but injection payloads can be crafted to manipulate relevance rankings or trigger under specific query patterns.
In demonstrations, researchers achieved manipulation of RAG outputs approximately 90 percent of the time using documents with embedded injection payloads. This rate suggests that any production RAG system processing untrusted documents faces a significant and practical attack vector.
Agentic System Escalation
In agentic contexts, prompt injection through files can achieve goals beyond data manipulation. Agents that maintain long-term memory, use tools to interact with external systems, or delegate tasks to sub-agents are vulnerable to having their behavior redirected by injected content.
An agent processing a poisoned document might update its memory with false information that persists across sessions. It might be instructed to use tools in unintended ways, such as exfiltrating data through URL parameters or modifying files based on injected instructions. The autonomous nature of agents means that injected goals may be pursued without human confirmation.
Section 5: Sandboxed Document Parsing
Sandboxed parsing isolates the document processing stage from the rest of the application and infrastructure. This isolation limits the blast radius of parser exploits and prevents them from escalating to infrastructure compromise.
Isolation Technologies
Several technologies provide meaningful isolation for parsing operations. Docker containers with restrictive seccomp profiles and AppArmor or SELinux policies can limit system calls and filesystem access. Firecracker microVMs offer stronger isolation by providing a minimal kernel and preventing container escape. WebAssembly runtimes like Pyodide can execute code in a sandboxed environment with no filesystem or network access by default.
The choice of technology involves tradeoffs between security strength, operational complexity, and performance. Containers offer the best performance and simplest operations but share the host kernel. MicroVMs provide stronger isolation at the cost of higher resource overhead. WebAssembly runtimes are typically limited to specific language runtimes but offer fine-grained capability control.
Parsing Service Architecture
The recommended architecture treats parsing as a separate microservice with explicit security boundaries. Parsers receive files through a defined interface, process them in isolated environments, and return only the extracted content. They have no direct access to databases, message queues, or internal APIs. Network access is restricted to necessary destinations, and all access is logged.
Content sanitization should occur after parsing but before extracted text enters the model context. This includes stripping hidden content like PDF comments, removing metadata that could contain injection payloads, and normalizing formatting that could obscure content boundaries.
Output Validation
Parsing output should be validated before it is used downstream. Validation checks include verifying that extracted content matches expected patterns, detecting anomalous content like large blocks of base64 or encoded data, and flagging content that triggers injection detection patterns. High-confidence injection detections should result in rejection and logging for security review.
Section 6: Recommended Architectures
Effective security for untrusted file pipelines requires defense-in-depth across multiple layers. No single control is sufficient, but properly layered defenses can reduce risk to acceptable levels.
Upload Layer Security
File uploads should begin with strict validation. Allowed file types should be limited to those actually needed by the application, with type validation based on magic bytes rather than client-provided metadata. File size limits prevent denial of service through oversized uploads. Filenames should be sanitized and renamed to prevent path traversal attacks.
Virus and malware scanning should occur immediately after upload. Cloud-native applications can leverage services like Amazon GuardDuty Malware Protection for S3, which automatically scans uploaded objects and tags findings for automated response. The scanning layer should quarantine or reject malicious files before they reach parsing infrastructure.
Parsing Layer Hardening
Parsing should occur in isolated environments with minimal privilege. The parsing service should have no access to sensitive data stores, should not be able to make outbound network connections except to defined destinations, and should operate with read-only filesystem access to designated working directories.
Dependency management is critical. Parser libraries should be kept current with security patches, and parsing infrastructure should be included in vulnerability management programs. The Apache Tika vulnerability demonstrates how widely-used parsing libraries can contain severe flaws that affect countless applications.
Trust Boundary Implementation
Trust boundaries between file content and model context require structural support, not just policy. Delimiter-based approaches can provide some differentiation, but should not be relied upon as the sole security control. More robust approaches include content structured into well-defined schemas that separate data from instructions, privilege levels where file-derived content operates with fewer capabilities than system prompts, and explicit labeling where content provenance is tracked and used in model instructions.
Human-in-the-loop controls are appropriate for high-risk operations. When file content triggers extraction of sensitive data, triggers injection detection patterns, or requests actions that modify system state, human confirmation adds a meaningful security layer.
Agentic System Safeguards
Agentic systems require additional safeguards beyond those needed for traditional LLM applications. Agents should operate with least privilege, having access only to tools and data necessary for their immediate goals. Tool access should require explicit capability grants, and dangerous operations like file modification or network requests should require elevated confirmation.
Memory and context should be treated as potentially contaminated. Agents that maintain long-term state should verify context accuracy before acting on retrieved information. Session boundaries should limit the persistence of file-derived influences, and periodic context resets can prevent accumulation of poisoned content.
Section 7: Prioritized Recommendations
Based on the threat landscape and architectural analysis, I recommend a phased approach to securing untrusted file pipelines.
Immediate Actions (Within One Week)
First, audit parser dependencies for known vulnerabilities. Identify all document parsing libraries in use, check them against vulnerability databases, and apply patches for any known issues. Apache Tika users should verify they are on version 3.2.2 or later.
Second, implement strict upload validation. Validate file types using magic bytes, enforce size limits, and sanitize filenames. Remove any parsing services that process files without isolation.
Third, enable malware scanning on upload buckets or temporary storage locations. Configure automated responses to quarantine or delete malicious files.
Near-Term Actions (Within One Sprint)
First, implement parsing isolation. Deploy parsers in Docker containers with restrictive security profiles, or migrate to isolated parsing services. Ensure parsers have no network access beyond defined destinations and cannot access sensitive files.
Second, add content sanitization after parsing. Strip metadata, remove hidden content, and normalize formatting. Implement injection detection patterns and reject or flag content that triggers detection.
Third, implement trust boundaries between file content and model context. Add structural separations like delimiter tokens, schema-based content organization, or privilege levels for file-derived context.
Next Sprint Actions
First, add monitoring and alerting for file processing anomalies. Log all parsing operations, track extraction statistics, and alert on unexpected content patterns or error rates.
Second, implement agentic safeguards for systems that use autonomous agents. Add least-privilege tool access, verify context before high-risk actions, and implement session boundaries for long-running agentic processes.
Third, conduct adversarial testing of file processing pipelines. Use crafted documents with injection payloads, malformed files designed to trigger parser vulnerabilities, and realistic attack scenarios to validate defenses.
Section 8: Monitoring and Incident Response
Effective monitoring enables detection of injection attempts and parser exploit attempts that evade preventive controls. Incident response procedures ensure that detected issues are addressed quickly and thoroughly.
Monitoring Targets
File processing pipelines should generate logs for every stage of processing. Upload logs should capture file metadata, validation results, and scanning outcomes. Parsing logs should record extraction statistics, any errors encountered, and processing duration. Model interaction logs should track what content entered the context and what responses were generated.
Anomaly detection should identify unexpected patterns in these logs. Sudden increases in parsing errors might indicate attack attempts. Unusual extraction ratios where files produce far more or less content than expected could signal malformed malicious files. Model responses that reference file content inappropriately might indicate successful injection.
Alerting Thresholds
Alerting should balance sensitivity with operational noise. Critical alerts should trigger immediately for detected malware, confirmed injection attempts, and parser errors that might indicate exploitation attempts. Warning-level alerts can flag suspicious patterns that warrant investigation but do not require immediate response.
Integration with security information and event management systems enables correlation of file processing events with other security signals. A parser error occurring simultaneously with unusual network traffic from the parsing service might indicate an active exploit.
Incident Response Procedures
When incidents are detected, response should follow defined procedures. Malicious files should be quarantined and preserved for forensic analysis. Parser services should be isolated to prevent potential spread. Root cause analysis should determine whether a vulnerability was exploited and what data might have been affected.
Post-incident review should assess whether monitoring detected the issue promptly, whether preventive controls functioned as designed, and whether response procedures were effective. Findings should inform improvements to detection, prevention, and response capabilities.
Section 9: Answered Questions
Several common questions arise when organizations assess their untrusted file pipeline risks.
Can we rely on AI models to detect prompt injection in files?
AI models can assist with injection detection but should not be the sole control. Models can be fooled by sophisticated injection techniques, and their detection capability varies based on prompt framing and model training. Use models as one layer in a defense-in-depth strategy, not as a primary detection mechanism.
Do we need to restrict all file types?
Restricting file types to those actually needed reduces attack surface. Permit only file types that the application requires for its core functionality. For most LLM document processing use cases, this means permitting common document formats like PDF, Microsoft Office documents, and plain text. Avoid permitting executable types, archive formats unless needed, or legacy formats with complex parsing requirements.
How often should we patch parser dependencies?
Parser libraries should be included in regular vulnerability management programs with patching cycles appropriate to risk. Critical vulnerabilities like CVE-2025-66516 should be patched immediately upon availability. For lower-severity issues, patching within standard update cycles is acceptable.
Is sandboxed parsing necessary if we trust our document sources?
Trust in document sources should not eliminate parsing isolation. Even trusted sources can inadvertently upload malicious files, and supply chain compromises of document generation tools could result in malicious content reaching processing pipelines. Isolation provides protection against unknown vulnerabilities in addition to intentional attacks.
Section 10: Agentic Pipeline Considerations (OWASP Agentic Top 10 - 2026)
The OWASP Top 10 for Agentic Applications 2026 addresses risks specific to autonomous AI systems. Several entries are directly relevant to untrusted file pipeline security.
ASI01: Agent Goal Hijack
This entry covers attacks where malicious external content manipulates agent objectives. File uploads directly enable this attack when agents process document content without adequate sanitization. Defenses include treating all external content as untrusted, implementing goal verification mechanisms, and requiring human confirmation for actions that significantly impact agent behavior or system state.
ASI02: Tool Misuse and Exploitation
Agents that use tools to interact with external systems can be manipulated through poisoned context. An attacker who successfully injects content into a file could instruct an agent to misuse tools in ways that benefit the attacker. Least-privilege tool access, explicit capability grants, and monitoring of tool usage patterns mitigate this risk.
ASI06: Memory and Context Poisoning
Long-term agent memory and accumulated context are vulnerable to contamination from poisoned inputs. When agents process files and incorporate content into memory, injection payloads can persist and influence future behavior. Session boundaries, context verification, and periodic memory resets reduce the impact of memory poisoning.
Supply Chain Considerations
Agentic systems often rely on third-party parsing libraries and tool integrations. These dependencies represent supply chain risks that map to entries in the OWASP Agentic Top 10. Vendor security assessments, dependency scanning, and isolation of third-party components should be part of the security program for agentic systems.
Appendix A: Framework Reference Summary
The following frameworks and standards inform the recommendations in this advisory.
OWASP Resources: The OWASP Top 10 for Agentic Applications 2026 provides the definitive catalog of risks in autonomous AI systems. The LLM Prompt Injection Prevention Cheat Sheet offers specific mitigation techniques for injection attacks.
CVE Database: The National Vulnerability Database provides detailed information on parser vulnerabilities including CVE-2025-66516. Subscribing to vendor security lists ensures timely awareness of new vulnerabilities affecting parsing infrastructure.
AWS Security Services: Organizations using AWS benefit from GuardDuty Malware Protection for S3 for automated malware scanning and Amazon Macie for sensitive data discovery.
Research: Academic research on prompt injection, including studies available on arXiv, provides insight into attack mechanics and effectiveness. Security vendor research from organizations like Hidden Layer and Lakera offers practical analysis of real-world attack techniques.
Sources
- NVD CVE-2025-66516
- Apache Tika Security Advisory
- Picus Security CVE-2025-66516 Analysis
- OWASP Top 10 for Agentic Applications 2026
- OWASP LLM Prompt Injection Prevention Cheat Sheet
- arXiv: Prompt Injection Attacks
- Hidden Layer: Prompt Injection Attacks on LLMs
- Lakera: Indirect Prompt Injection
- AWS Security Blog: Securing RAG Ingestion
- OWASP File Upload Cheat Sheet