prompt injection

Prompt injection is an attack where adversarial text is designed to steer a model or model-powered app into ignoring its original instructions and performing unintended actions, such as leaking secrets, executing unsafe steps, or following attacker-supplied goals.

Variants include direct prompt injection, where malicious instructions are entered through the model interface, and indirect prompt injection, where instructions are hidden in retrieved or linked content that the system ingests during workflows like RAG or tool use. Prompt injection exploits the lack of a strict boundary between instructions and data.

Mitigations should adopt a defense-in-depth approach, including:

  • Input and output filtering and sanitization
  • Isolating and clearly delineating system and user instructions
  • Enforcing least-privilege access for tool integrations and sandboxes
  • Implementing allow or deny lists for tool use
  • Verifying the provenance or trustworthiness of external content
  • Hardening retrieval pipelines
  • Monitoring and conducting adversarial testing to detect residual risks

By Leodanis Pozo Ramos • Updated Nov. 3, 2025