prompt injection

Prompt injection is an attack where adversarial text is designed to steer a model or model-powered app into ignoring its original instructions and performing unintended actions, such as leaking secrets, executing unsafe steps, or following attacker-supplied goals.

Variants include direct prompt injection, where malicious instructions are entered through the model interface, and indirect prompt injection, where instructions are hidden in retrieved or linked content that the system ingests during workflows like RAG or tool use. Prompt injection exploits the lack of a strict boundary between instructions and data.

Mitigations should adopt a defense-in-depth approach, including:

Input and output filtering and sanitization
Isolating and clearly delineating system and user instructions
Enforcing least-privilege access for tool integrations and sandboxes
Implementing allow or deny lists for tool use
Verifying the provenance or trustworthiness of external content
Hardening retrieval pipelines
Monitoring and conducting adversarial testing to detect residual risks

By Leodanis Pozo Ramos • Updated Nov. 3, 2025

AI Coding Glossary Share Feedback