Prompt Injection vs. Jailbreaking: They’re Not the Same Threat, and the Confusion Costs You
Walk into almost any discussion of LLM security and you will hear “prompt injection” and “jailbreaking” used as synonyms. They are not synonyms. They are different threats, with different attackers, different targets, and — the part that actually matters — different defences. Conflating them produces a predictable failure: you deploy a control against one and believe you have addressed the other, when you have not.
This is a short, clarifying piece, because the distinction is simple once stated and the confusion is expensive once internalised.
The core distinction
Jailbreaking targets the model’s policy. The attacker is the user, talking directly to the model, trying to make it do something its safety training says it should not — produce disallowed content, ignore its guidelines. The attacker and the user are the same person. The thing being subverted is the model’s alignment / safety behaviour.
Prompt injection targets the application’s control flow. The attacker is usually not the user — they are a third party who has placed hostile instructions into content the model will process (a web page the model reads, a document it summarises, an email it parses, data returned from a tool). The model encounters those instructions mixed into its input and may follow them as if they came from the legitimate operator. The thing being subverted is the application’s intended behaviour, by smuggling instructions through data.
One sentence each:
- Jailbreaking: the user manipulates the model into violating its policy.
- Prompt injection: a third party manipulates the application by hiding instructions in data the model consumes.
These are genuinely different threats. The attacker is different (user vs. third party). The target is different (model policy vs. application control flow). And therefore the defence is different.
Why the conflation produces wrong defences
This is the part that costs you. Because the two get treated as one thing, defences against one are mistaken for defences against both.
Safety training reduces jailbreak success. It does little for prompt injection. A model better aligned against producing disallowed content is harder to jailbreak. But prompt injection is not primarily about disallowed content — it is about the model following smuggled instructions embedded in data. A perfectly “safe” model that still cannot distinguish operator instructions from instructions hidden in the data it processes is wide open to injection. Investing only in model-level safety and believing you have covered injection is the classic error.
Input filtering helps with some jailbreaks. It is brittle against injection. Filtering known jailbreak phrasings catches some direct manipulation. But injection arrives inside legitimate-looking data — a web page, a document — where the hostile instruction is mixed with content the application genuinely needs to process. You cannot filter out “instructions in data” without filtering out the data. The architectural problem (the model cannot reliably separate instructions from data) is not solved by a phrase filter.
The architectural defences for injection do little for jailbreaking. Conversely, the real defence against prompt injection is architectural: do not let model output (potentially injection-influenced) reach an actuator without a constrained, human-or-policy-gated check; treat all model-processed content as untrusted; separate the privilege of the operator’s instructions from the content being processed. None of that stops a user from jailbreaking the model directly — it is a different problem with a different fix.
So a team that conflates them tends to over-invest in one defence and leave the other surface open, while believing the surface is closed. That belief is the actual cost.
Different attackers, different threat models
The attacker distinction reshapes the whole threat model:
Jailbreaking’s attacker is the user. This matters because the user is supposed to be talking to the model. You cannot architecturally separate them — the user’s input is the legitimate input. Defence is necessarily about the model’s own behaviour (alignment, refusal robustness) plus consequence-limiting (what can the jailbroken model actually do?). For many deployments, the honest stance is: assume a determined user can jailbreak the model, and ensure that a jailbroken model cannot cause real harm because it has no dangerous capability or actuator behind it.
Prompt injection’s attacker is a third party in the data path. This matters because it means the threat scales with what content your application lets the model process and what the model can do with the results. An LLM app that reads external web pages and can call tools has a large injection surface; one that only processes the user’s own typed input has almost none. The threat model is about the data path and the actuator path, not about the user’s intentions.
This is why the same deployment can be high-risk for one and low-risk for the other. A personal chatbot with no tools and no external data ingestion is jailbreakable but barely injectable. An agentic system that browses, reads documents, and calls tools is highly injectable regardless of how well-aligned the underlying model is.
The practical consequence
Once separated, the defensive priorities become clear and different:
For jailbreaking, ask: what can a jailbroken model actually do? If the answer is “produce text a determined user could find elsewhere anyway,” the risk is low and model-level safety plus reasonable monitoring suffices. If the answer is “access tools, data, or actuators,” then jailbreaking becomes a path to real harm and you must limit consequence — least agency, gated actions — not just harden refusals.
For prompt injection, ask: what untrusted content does the model process, and what can it do with the result? Shrink the data path (limit what external content the model ingests), treat all of it as untrusted, and above all sever the path from injection-influenceable output to any actuator without a constrained check. The model proposing an action is fine; the model’s injection-influenced output triggering an action without a gate is the failure.
The unifying principle that survives the distinction: the model is manipulable through both its user and its data, so the security of the system is determined by what stands between the model’s output and anything irreversible. Jailbreaking and injection are two doors to the same room — the manipulated model — and the defence that matters most is making sure that room does not have a lever connected to anything you cannot afford to have pulled.
The takeaway
Stop using the terms interchangeably, because the conflation hides a real gap. Jailbreaking is a user subverting the model’s policy; prompt injection is a third party subverting the application’s control flow through hostile data. They have different attackers, different surfaces, and different fixes — safety training for one, architectural separation for the other. A control against either is not a control against both.
The clean test: if the attacker is the user and the target is the model’s rules, it is jailbreaking. If the attacker is hiding instructions in data the model processes, it is injection. Name which one you are defending against before you choose the defence — because the most expensive mistake in LLM security is solving one of these and believing you solved both.
An independent piece by johlem.net — IT security, Luxembourg. LLM-security threat modeling, also documented at cyberramen.com.