Companies are deploying LLMs into customer-facing systems, internal workflows, and autonomous agents. Most of them haven't accounted for the fact that these models can be manipulated through the very input they're designed to accept.
That manipulation has a name: prompt injection. And it's one of the most immediate security risks in production AI systems today.
Why This Matters Now
By this point, most organizations understand that AI is a force multiplier. Chatbots handle customer support. Internal copilots summarize documents and draft communications. Autonomous agents query databases and take actions on behalf of users.
The ROI is real. But so is the attack surface.
Every system that accepts natural language input from a user—or retrieves content from an external source—is potentially vulnerable to prompt injection. And unlike traditional software vulnerabilities, prompt injection doesn't require exploiting a bug in the code. It exploits the model's core behavior: its drive to follow instructions.
How Prompt Injection Works
The clearest way to understand prompt injection is through a familiar concept: social engineering.
Imagine the task an LLM performs was handled by a human employee. A customer calls in, and the employee assists them—that's the intended workflow. Now imagine a malicious caller who doesn't want help. They want to manipulate the employee into handing over another customer's data, bypassing an approval process, or executing an action they shouldn't have access to.
A trained human would recognize the manipulation and shut it down. An LLM operates differently. Modern models do have safety layers and alignment guardrails—but those guardrails can be bypassed. The model's fundamental architecture is optimized to generate helpful, relevant responses to whatever input it receives. A carefully engineered prompt can exploit that optimization.
This is direct prompt injection: an attacker crafts specific input designed to override the model's instructions, bypass its safety constraints, or extract information the system wasn't intended to reveal.
Indirect Prompt Injection
There's a second variant that's arguably more dangerous because it's harder to detect.
In indirect prompt injection, the attacker never interacts with the model directly. Instead, they embed malicious instructions in a third-party source—a website, a document, a repository, an email—that the model will eventually retrieve and process.
When the model queries that source as part of its normal workflow, it reads the embedded instructions and executes them. The attack payload was planted in advance, waiting for the model to find it.
This is particularly relevant for agentic systems—AI that browses the web, processes uploaded documents, or pulls data from external APIs. The model treats retrieved content as context, and malicious instructions hidden in that context can redirect the model's behavior without the user or the system operator knowing.
What This Looks Like in Practice
Scenario 1: Healthcare chatbot
A medical practice deploys a patient-facing chatbot that allows patients to ask about their health data, lab results, and medications. The chatbot has access to the patient database to retrieve relevant records.
A patient with access to this chatbot crafts a prompt that manipulates the model into returning another patient's records—or altering how results are displayed. The chatbot complies, not because the database lacks access controls, but because the model was given permissions broad enough to query across records, and the prompt bypassed the application-layer logic meant to scope those queries to the authenticated user.
This isn't hypothetical. It's the predictable result of giving a model access to sensitive data without adversarial testing.
Scenario 2: Autonomous agent with external access
An organization deploys an AI agent that retrieves information from external sources and executes tasks based on what it finds—summarizing research, updating records, triggering workflows.
An attacker creates a repository or webpage containing hidden prompt injection buried in the content. When the agent visits the page as part of a routine task, it processes the malicious instructions alongside the legitimate content. The agent executes the attacker's instructions: exfiltrating data, modifying internal records, or pivoting to other connected systems.
The operator sees normal agent behavior. The compromise happens silently, inside the model's inference process.
What Organizations Should Be Doing
Prompt injection isn't solved with a single control. It requires layered defenses across the entire deployment.
Foundational security still applies
Access controls, least-privilege permissions, network segmentation, and infrastructure hardening are non-negotiable. The hardware and software stack the AI system runs on must be secured using the same rigor applied to any other production system. An LLM with overly broad database permissions turns a prompt injection from an annoyance into a breach.
Input and output filtering
Security controls that inspect and constrain what goes into the model (input validation, prompt hardening) and what comes out of the model (output filtering, response boundary enforcement) are critical. These controls sit outside the model itself—they're part of the application architecture wrapping the LLM.
System prompt design
How the model is instructed matters. System prompts that clearly define the model's role, scope, and boundaries—and that are tested against adversarial attempts to override them—reduce the attack surface. But system prompts alone are not a security control. They can be extracted, ignored, or overridden by a sufficiently crafted injection.
Adversarial testing
The only way to know whether a deployment is vulnerable is to test it the way an attacker would. Red teaming an LLM deployment means systematically probing for prompt injection, system prompt extraction, guardrail bypass, and data leakage—before an attacker does it in production.
Monitoring at the inference layer
Traditional SIEM and application monitoring don't observe what happens inside the model's reasoning process. Monitoring needs to extend into the inference layer—tracking what the model receives, how it processes instructions, and what it produces—to detect prompt injection attempts in real time.
The Core Risk
The tools companies are deploying are powerful. That's exactly why they're vulnerable.
An LLM that can query databases, process documents, and take actions on behalf of users is capable of enormous value—and enormous damage if manipulated. The difference between those outcomes is whether the deployment has been tested against adversarial input before it reaches production.
Most haven't been.
Understanding the attack surface is a prerequisite to defending it. Prompt injection doesn't require a novel exploit — it requires a model with insufficient boundaries, access it shouldn't have, and input that was never tested against an adversary.
For teams beginning that work, Five Ways LLMs Leak Their System Prompts covers how adversaries exploit the boundary between system instructions and user input. How to Structure a System Prompt addresses what a more defensible starting point looks like — and why that structure alone isn't enough.