Multi-Turn LLM Attacks: A SOC Analyst’s Mental Model
Most writing on LLM jailbreaks falls into one of two buckets: hand-wavy “prompt injection is dangerous” posts, or dense academic papers that never connect to operational security practice. There is a gap in the middle — and it is exactly the gap a detection engineer is equipped to fill.
The thesis of this post is simple: the hardest class of LLM attack maps onto a problem the SIEM world has already worked through. The generational shift from signature-based AV to behavioral, correlation-driven detection is the same shift LLM safety is going through right now, from per-request refusal to trajectory-level analysis. Once you see the mapping, the whole threat model becomes legible using mental models you already own.
This is a defensive threat-modeling piece. It will not hand you a working bypass. It will give you a way to reason about multi-turn LLM risk, and the defensive architecture that answers it — which is the more useful artifact anyway.
1. The Decomposition Problem
Safety training largely teaches a model to recognise a harmful-looking request and refuse it. The refusal triggers on the request as a whole: the model evaluates “is this thing being asked for dangerous?” and declines.
This is signature-based detection. And signature-based detection has a known, structural weakness: it is evadable by fragmentation.
The attack — call it decomposition — splits a prohibited goal into subtasks where each piece looks benign on its own, and no single piece trips the “this is harmful” recognition. The model answers each piece. The requester reassembles the parts off-platform. The model never sees the harmful whole.
Two failure modes make this work:
- Context blindness. A subtask arriving without its surrounding goal carries no signal that it contributes to harm. Out of context, step 4 of something dangerous can read as routine domain work.
- Refusal is request-level, not capability-level. The model is trained to refuse requests for harm, not to refuse providing components that could combine into harm. Most components are dual-use: the same fact serves legitimate and illegitimate ends. You cannot refuse all dual-use knowledge without making the model useless, so refusal keys on apparent intent. Decomposition hides intent.
If you have ever watched an attacker pace their actions to stay under a “5 failed logins = alert” rule, you already understand this attack. The retail version of the same idea: the harm lives in the basket the customer assembles at home, not in any single transaction the clerk approved.
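The pacing evasion can be made concrete with a minimal sketch. The function name `windowed_alerts`, the 60-second window, and the sample timestamps are assumptions chosen for illustration, not a rule taken from any particular SIEM.

```python
from collections import deque
from typing import Deque, Iterable, List


def windowed_alerts(timestamps: Iterable[float],
                    threshold: int = 5,
                    window_s: float = 60.0) -> List[float]:
    """Return each timestamp at which `threshold` or more events
    sit inside the trailing `window_s` seconds."""
    recent: Deque[float] = deque()
    alerts: List[float] = []
    for t in sorted(timestamps):
        recent.append(t)
        # Drop events that have aged out of the trailing window.
        while t - recent[0] > window_s:
            recent.popleft()
        if len(recent) >= threshold:
            alerts.append(t)
    return alerts


burst = [0, 5, 10, 15, 20]              # five failures in 20 seconds
paced = [20.0 * i for i in range(8)]    # eight failures, one every 20 seconds

print(windowed_alerts(burst))   # fires once the fifth event lands
print(windowed_alerts(paced))   # never fires: at most four events per window
```

The paced sequence contains more events than the burst, yet the counting rule stays silent. That is the same property decomposition exploits at the request level.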
2. The Two Answers — and Why the Architecture Is Split
There are two real defenses, and the interesting part is why there are two. They fail in opposite places, and each exists to cover the other’s blind spot.
Trajectory-level filtering — behavioural detection for conversations
Instead of judging each message in isolation, this evaluates the accumulated state of the conversation — where the sequence is heading.
- Cumulative context evaluation. Each turn is assessed against the full prior conversation. Subtask #4 is no longer “explain how X relates to Y” — it is that question given that the previous three turns assembled the surrounding pieces. The harmful whole becomes visible.
- Intent reconstruction. Project where the requests are converging. A sequence of individually-benign parts that only makes sense as assembly toward one prohibited outcome is itself the signal. The pattern is the evidence.
- Structure over sum. The robust version is not an additive counter (borderline request += 1 until a threshold). An additive counter is exactly what dilution (padding the conversation with benign filler turns) beats. The robust version is coherence detection: do these particular turns fit together into a recognisable assembly? Three turns that form a coherent progression fire because of their relationship to each other, not because their weights summed past a number.
This is behavioural / UEBA detection. A weak SIEM rule counts events and is evaded by pacing under the count. A good behavioural rule recognises the shape — failed logins → success → privilege enumeration → lateral movement — and fires on the correlation structure regardless of timing or surrounding noise. Trajectory filtering aims to be the second kind.
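The difference between the two kinds of rule can be sketched in a few lines. Everything named here is hypothetical: the stage labels in `STAGES`, the per-turn weights, and the assumption that some upstream classifier has already tagged each turn.

```python
from typing import Sequence

# Hypothetical stage labels; assume an upstream classifier assigns one per turn.
STAGES = ("acquire", "prepare", "combine")


def additive_score(weights: Sequence[float], window: int = 5) -> float:
    """Counting rule: sum the risk weights of the most recent turns."""
    return sum(weights[-window:])


def matches_progression(labels: Sequence[str],
                        stages: Sequence[str] = STAGES) -> bool:
    """Structural rule: True if `stages` appear in order as a subsequence,
    however many unrelated turns sit between them."""
    remaining = iter(labels)
    # Each `any` consumes the iterator up to the next matching stage.
    return all(any(label == stage for label in remaining) for stage in stages)


filler = ["other"] * 5
labels = ["acquire", *filler, "prepare", *filler, "combine"]
weights = [0.0 if label == "other" else 1.0 for label in labels]

print(additive_score(weights) >= 2.0)   # counter stays under a threshold of 2.0
print(matches_progression(labels))      # ordered shape is still recognised
```

Five filler turns between stages are enough to keep the windowed sum at 1.0, while the subsequence check is indifferent to spacing. A production version would need fuzzy stage matching rather than exact labels; this only illustrates why structure survives dilution and a sum does not.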
Capability-level filtering — a hard control on crown-jewel assets
For a narrow set of catastrophic domains, the system refuses to provide meaningful uplift toward the dangerous capability — regardless of how the request is framed, decomposed, or justified.
- Binary, not weighted. A single subtask constituting meaningful uplift is refused on its own. No accumulation, no threshold to creep under. Dilution is irrelevant because there is no counter to game.
- Framing-independent by design. “For a novel,” “for defense,” “I’m a researcher,” “it’s just step 4” — none of these move the decision. This is deliberate: framing is exactly what decomposition manipulates. Refusing to let framing move the call neutralises the attack for those domains.
- Component-level recognition. The dangerous pieces are recognised as dangerous on their own, so they do not pass just because they look like an isolated benign step.
This is an IOC/threat-intel hard match — and a hard policy control on a crown-jewel asset. It is binary, context-blind, and narrow, on purpose.
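As a rough sketch of what "binary, context-blind, and narrow" means in code, the gate below takes no history and no score, and never reads the framing field. The identifiers in `GATED_COMPONENTS` are placeholders, and the `components` label is assumed to come from an upstream recogniser that is out of scope here.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Placeholder identifiers only; a real gate would rely on a trained recogniser.
GATED_COMPONENTS: FrozenSet[str] = frozenset(
    {"gated-component-1", "gated-component-2"}
)


@dataclass(frozen=True)
class Request:
    framing: str                 # claimed purpose; the gate never reads it
    components: FrozenSet[str]   # what an answer would provide (assumed upstream label)


def capability_gate(request: Request) -> bool:
    """Return True when the request must be refused.
    Stateless and binary: no history, no score, no framing input."""
    return not GATED_COMPONENTS.isdisjoint(request.components)


as_fiction = Request("for a novel", frozenset({"gated-component-1"}))
as_research = Request("I'm a researcher", frozenset({"gated-component-1"}))
routine = Request("sysadmin question", frozenset({"routine-topic"}))

print(capability_gate(as_fiction), capability_gate(as_research), capability_gate(routine))
```

Because `framing` is not an input to the decision, there is nothing for a reframed request to manipulate. The hard part in practice is the recogniser that produces `components`, which this sketch deliberately leaves abstract.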
Why both must exist
Every evasion that routes around one runs into the other. Decomposition hides intent across turns → trajectory filtering reassembles it. Fiction-framing and stateless splitting route around trajectory filtering → they hit capability gating, which is built to be framing-independent and stateless precisely so it survives those cases. The split is not redundancy. It is defense-in-depth where each layer is engineered to fail where the other holds.
3. The Mapping
This is the centrepiece. If you build detections for a living, this table is the whole post:
| NextGen SIEM / UEBA | LLM trajectory filter |
|---|---|
| Raw events | Individual turns |
| Normalisation / parsing | Embedding + intent extraction per turn |
| Atomic IOC (one bad hash) | Request-level refusal (one bad ask) |
| Correlation rules (ordered) | Progression / sequence matching |
| Kill-chain / ATT&CK sequence | Decomposition across turns |
| Behavioural analytics (UEBA) | Trajectory risk scoring |
| Entity risk score (per user) | Per-conversation → per-entity risk state |
| Incident timeline reconstruction | On-target-subset recomposition + reclassify |
| Beaconing / LOLBin detection | Noise-structure & bimodality features |
| Low-and-slow / living-off-the-land | Context dilution, benign-looking subtasks |
| Lateral movement across hosts | Cross-session / cross-account splitting |
| Threat-intel IOC match | Capability-gate component recognition |
| Alert fatigue / false positives | Over-refusal on legitimate technical work |
| Tuning the ROC curve | ASR vs. over-refusal tradeoff |
The design philosophy is identical across both columns: cheap atomic checks catch the known and obvious (signatures / capability gates), expensive structural checks catch the novel-but-observable (correlation / trajectory scoring), and the residual gap is the low-and-slow cross-boundary adversary in both cases.
4. What the Structural Features Actually Look Like
If trajectory filtering is behavioural detection, then a conversation is a feature space. You are not classifying messages — you are extracting features over the sequence and scoring the shape. Four families, all of which a detection engineer will recognise:
Semantic-trajectory features — “where is this heading.” Embed each turn and ask whether the turns project toward a coherent target or scatter. A decomposition attack has a hidden centroid: the on-target subtasks cluster in semantic space even when spaced out, while dilution turns are off-centroid noise that increases variance without moving the cluster. The detectable shape is a tight sub-cluster of on-target turns hiding inside a diffuse cloud of noise — the direct analog of resolving a lateral-movement graph out of background traffic. Periodically, take the on-target subset, concatenate, and run the harm classifier on the reconstruction rather than on individual messages. That is the off-platform recomposition the attacker assumed was safe, done on-platform.
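A minimal sketch of the sub-cluster and recomposition idea, under loud assumptions: the embeddings are toy 2-D vectors, the 0.9 similarity cut-off and minimum cluster size are arbitrary, and `classifier` stands in for whatever harm classifier a real system would call. The neighbour-counting heuristic is a stand-in for a proper clustering method.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def on_target_subset(embeddings: Sequence[Vector],
                     min_sim: float = 0.9) -> List[int]:
    """Indices of the densest group: the turn with the most close
    neighbours, together with those neighbours."""
    best: List[int] = []
    for anchor in embeddings:
        group = [j for j, other in enumerate(embeddings)
                 if cosine(anchor, other) >= min_sim]
        if len(group) > len(best):
            best = group
    return best


def reclassify(turns: Sequence[str],
               embeddings: Sequence[Vector],
               classifier: Callable[[str], bool],
               min_sim: float = 0.9,
               min_size: int = 3) -> bool:
    """Concatenate the on-target turns and classify the reconstruction,
    rather than classifying each message alone."""
    subset = on_target_subset(embeddings, min_sim)
    if len(subset) < min_size:
        return False
    return bool(classifier(" ".join(turns[i] for i in subset)))
```

Called with interleaved on-target and filler turns, `on_target_subset` returns only the indices of the tight cluster, and `reclassify` hands the classifier the reassembled text. The point is the ordering of operations: select by structure first, then judge the whole.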
Structural / sequence features — “does this fit a known shape.” Does the sequence walk a recognisable assembly order (acquire → prepare → combine → deploy)? That is an ATT&CK sequence rule: ordered stages fire on the chain regardless of dwell time. Build a dependency DAG of what each turn establishes and presupposes — benign multi-topic chats are a forest of disconnected little trees; an assembly is one deep connected component draining toward a single sink.
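The dependency-DAG feature can be sketched as follows. The representation is an assumption: each turn is reduced to two sets of hypothetical concept labels, what it establishes and what it presupposes, and extracting those labels is left to an unspecified upstream step.

```python
from typing import Dict, List, Set, Tuple

Turn = Tuple[Set[str], Set[str]]  # (establishes, presupposes)


def dependency_features(turns: List[Turn]) -> Dict[str, int]:
    """Component count, largest component size, and longest dependency
    chain for a conversation, treating turn order as topological order."""
    n = len(turns)
    # parents[j] lists earlier turns that established something j presupposes.
    parents: List[List[int]] = [
        [i for i in range(j) if turns[i][0] & turns[j][1]] for j in range(n)
    ]

    # Longest chain ending at each turn.
    depth = [1] * n
    for j in range(n):
        for i in parents[j]:
            depth[j] = max(depth[j], depth[i] + 1)

    # Connected components via union-find with path halving.
    root = list(range(n))

    def find(x: int) -> int:
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for j in range(n):
        for i in parents[j]:
            root[find(j)] = find(i)

    sizes: Dict[int, int] = {}
    for x in range(n):
        r = find(x)
        sizes[r] = sizes.get(r, 0) + 1

    return {
        "components": len(sizes),
        "largest_component": max(sizes.values(), default=0),
        "max_depth": max(depth, default=0),
    }
```

A benign multi-topic chat should show many components and shallow depth; an assembly should show one component whose size and depth approach the number of turns. Thresholds on those numbers are a tuning question this sketch does not answer.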
Behavioural / meta features — “is the shape of interaction weird.” A conversation that is 95% disconnected filler bracketing a few precise technical spikes is bimodal, and bimodality is itself anomalous — legitimate technical conversations are topically cohesive. This is beaconing detection: the regularity of the hiding behaviour gives it away even when each individual element is boring. Re-asking a blocked thing with rotating framings is brute-forcing the boundary, and brute force has a shape (think password spraying). Evasion-marker density — framing-shift language, encoding requests, “it’s just fiction” disclaimers — is individually innocent, collectively a fingerprint.
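One candidate statistic for the bimodality feature is Sarle's bimodality coefficient, (skewness² + 1) / kurtosis, shown here in its uncorrected population form. The per-turn "specificity" scores are an assumption (some upstream scorer rating how technically precise each turn is), and the choice of this particular statistic is mine, not something the architecture above prescribes.

```python
from typing import Sequence


def bimodality_coefficient(scores: Sequence[float]) -> float:
    """Sarle's bimodality coefficient, population form:
    (skewness**2 + 1) / kurtosis, using non-excess kurtosis.
    Returns 0.0 for constant or empty input."""
    n = len(scores)
    if n == 0:
        return 0.0
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n
    if m2 == 0:
        return 0.0
    m3 = sum((s - mean) ** 3 for s in scores) / n
    m4 = sum((s - mean) ** 4 for s in scores) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return (skewness ** 2 + 1) / kurtosis


# Hypothetical specificity scores: mostly filler with one precise spike,
# versus a topically cohesive technical conversation.
spiky = [0.05] * 19 + [0.95]
cohesive = [0.5, 0.55, 0.6, 0.45, 0.5, 0.52, 0.48, 0.58, 0.42, 0.5]

print(bimodality_coefficient(spiky) > 5 / 9)
print(bimodality_coefficient(cohesive) > 5 / 9)
```

Values above roughly 5/9, the coefficient of a uniform distribution, are conventionally read as a hint of bimodality. The statistic also rises for heavily skewed unimodal data, so it is better treated as one feature among several than as a verdict.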
State / entity features — “across the boundary.” This is where it tracks closest to UEBA and where the hardest gap lives. Per-conversation state catches within-session assembly. Cross-session evasion is answered only by entity-level state — risk tied to an account across conversations, the UEBA baseline-per-user move — with the same privacy and cost tradeoffs, and the same defeat condition (switch identity).
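A minimal sketch of entity-level state, assuming hypothetical stage labels and an `account_id` that is stable across sessions. It keeps the ordered label history per account and runs the same kind of ordered-stage check over the combined history instead of over a single conversation.

```python
from collections import defaultdict
from typing import DefaultDict, List, Sequence

# Hypothetical stage labels; assume an upstream classifier assigns them.
STAGES = ("acquire", "prepare", "combine")


class EntityState:
    """Cross-session label history keyed by account."""

    def __init__(self) -> None:
        self._history: DefaultDict[str, List[str]] = defaultdict(list)

    def record_session(self, account_id: str, labels: Sequence[str]) -> None:
        """Append one finished conversation's labels to the account history."""
        self._history[account_id].extend(labels)

    def progression_seen(self, account_id: str,
                         stages: Sequence[str] = STAGES) -> bool:
        """True if `stages` occur in order anywhere in the account's history."""
        remaining = iter(self._history.get(account_id, []))
        return all(any(label == s for label in remaining) for s in stages)


state = EntityState()
for session in (["acquire", "other"], ["other", "prepare"], ["combine"]):
    state.record_session("account-1", session)

print(state.progression_seen("account-1"))   # assembled across three sessions
print(state.progression_seen("account-2"))   # a fresh identity starts with no history
```

The second call shows the defeat condition directly: the state is only as good as the identity it is keyed on. Retention limits and privacy constraints on `_history` are real design costs that the sketch ignores.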
5. The Honest Gaps
A practitioner writeup names what is unsolved. Marketing does not. Here is where the architecture genuinely leaks:
Fiction / humour / legitimate framing → false positives. Outside the gated domains, fiction and humour genuinely are mostly low-risk, and a decent filter weights them that way. Inside the gated domains, framing is deliberately ignored — a working dangerous artifact is equally actionable whether it is labelled “chapter 7” or not. The cost is real: this refuses some legitimate dark fiction and edgy security writing at the boundary. This is not solved; it is managed. It is your false-positive problem — a detection tuned tight enough to catch the real threat will page you on legitimate admin activity that matches the pattern. You do not eliminate it; you tune the ROC curve and accept residual noise.
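The tuning tradeoff can be sketched as a threshold sweep. ASR is read here as attack success rate (attack conversations that score below the refusal threshold and get through), and the scores are invented purely to show the mechanics, not measurements of any system.

```python
from typing import List, Sequence, Tuple


def tradeoff(attack_scores: Sequence[float],
             benign_scores: Sequence[float],
             thresholds: Sequence[float]) -> List[Tuple[float, float, float]]:
    """For each threshold return (threshold, ASR, over-refusal rate).
    A conversation is refused when its risk score is >= threshold."""
    rows = []
    for th in thresholds:
        asr = sum(s < th for s in attack_scores) / len(attack_scores)
        over_refusal = sum(s >= th for s in benign_scores) / len(benign_scores)
        rows.append((th, asr, over_refusal))
    return rows


# Illustrative, made-up risk scores.
attack = [0.9, 0.8, 0.6, 0.4]
benign = [0.1, 0.2, 0.5, 0.7]

for th, asr, over in tradeoff(attack, benign, [0.3, 0.65]):
    print(f"threshold={th:.2f}  ASR={asr:.2f}  over-refusal={over:.2f}")
```

Lowering the threshold drives ASR toward zero and over-refusal up; raising it does the reverse. No threshold in this toy data gets both to zero, which is the managed-not-solved situation described above.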
Stateless / cross-session splitting. If each subtask arrives in a fresh context with no link to the others, per-conversation trajectory state has nothing to correlate. This is the strongest evasion in the space. What survives: capability gating (stateless by design — this is why it exists as a separate layer) and per-turn evaluation. What gets through: subtasks that are both genuinely benign in isolation and outside the gated domains, recombined off-platform. That is a real, open hole. The only counter is entity-level state, which is partial and identity-swappable.
Context as a double-edged tool — frame-poisoning. Here is the subtle one. If coherent progression toward harm raises suspicion, then established context also lowers it — context disambiguates “how do I kill a process” toward the benign reading after ten turns of sysadmin work, and it should. But that symmetry is exploitable: deliberately constructing a benign-looking history to manufacture exonerating context, then leaning on it to wave through the one request that matters. The inverse of dilution. This is bounded — and bounded precisely because capability gating is context-blind by design. Manufactured legitimate history moves the grey zone, not the hard gates. Claimed identity (“I’m a pentester”) is unauthenticated and weighted as such. Worst-case payoff from a poisoned frame is a borderline dual-use answer that was probably fine anyway — not unlocking the catastrophic set.
The symmetry, in one table:
| | Raises suspicion | Lowers suspicion |
|---|---|---|
| Honest use | genuine harmful trajectory (rare) | coherent legitimate work-frame |
| Adversarial use | (attacker avoids this) | frame-poisoning (manufactured context) |
The single principle covering the whole table: context adjusts the grey zone in both directions, but the hard gates do not move — because context is attacker-controllable, and the gates are the layer that assumes exactly that.
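That principle can be sketched as an ordering of checks. The function name `decide`, the 0.5 threshold, and the `max_shift` cap on how far context may move a score are all assumptions made for illustration.

```python
def decide(gated: bool,
           turn_risk: float,
           context_adjustment: float,
           threshold: float = 0.5,
           max_shift: float = 0.3) -> str:
    """Layered decision: the hard gate is checked first and never sees
    context; context only shifts grey-zone scores, within a bounded range."""
    if gated:
        return "refuse"  # context is not consulted on this path
    # Clamp the adjustment so manufactured history has bounded payoff.
    shift = max(-max_shift, min(max_shift, context_adjustment))
    return "refuse" if turn_risk + shift >= threshold else "answer"


print(decide(gated=True, turn_risk=0.1, context_adjustment=-10.0))   # gate holds
print(decide(gated=False, turn_risk=0.6, context_adjustment=-0.2))   # benign frame lowers
print(decide(gated=False, turn_risk=0.6, context_adjustment=0.2))    # suspicious frame raises
```

A poisoned frame can at most push `context_adjustment` to its lower clamp, so its best outcome is flipping a borderline answer. The gated branch returns before the adjustment is ever read.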
6. The Part That Makes the LLM Defender Different
Two asymmetries against the SOC analogy, both worth internalising:
The LLM defender is worse off on telemetry. A SOC correlates across endpoint, network, identity, DNS — weak signal on one source confirmed by strong signal on another. The trajectory filter has one source: the conversation text. It is detection with only process-creation logs and no network tap. That single-source constraint is why the structural features carry so much weight — there is no second channel to corroborate against.
The LLM defender is better off on collection. The adversary’s only actuator is that same text channel — fully instrumented, perfectly logged, no EDR blind spots, no encrypted traffic, no log gaps. Every action is, by construction, completely observed. The whole game is interpretation of complete data, never collection of incomplete data. That is an unusual luxury from a SOC seat.
7. The Synthesis
Hold the whole thing this way:
- Request-level refusal is signature-based AV — catches the known and cheap, evaded by fragmentation.
- Trajectory filtering is behavioural / UEBA detection — catches the novel-but-observable, evaded by low-and-slow cross-boundary operation.
- Capability gating is a hard policy control on a crown-jewel asset — narrow, context-blind, framing-proof, the backstop where the probabilistic layers leak.
Defense-in-depth uses all three for the same reason your stack does, and the residual gap is identical in both worlds: the low-and-slow, cross-boundary, well-diluted adversary who stays under every per-window threshold. That is an unsolved problem in SIEM detection and an unsolved problem in LLM safety, for the same structural reason — finite correlation state across a boundary the attacker can choose to operate across.
The point of this piece is not that LLM safety is solved. It is that you already own the mental model for reasoning about it. The kill-chain thinking, the atomic-vs-behavioural intuition, the ROC-curve tradeoff between catching the threat and drowning in false positives — it all transfers. LLM security is not a new discipline you have to learn from scratch. It is detection engineering with one telemetry source and perfect logging.
An independent piece by johlem.net — IT security, Luxembourg. Part of the ongoing LLM-security work also documented at cyberramen.com.