blog.johlem.net

Behavioural Detection Without Drowning in False Positives

Every detection engineer eventually confronts the same wall. Signature and atomic detection — one event, one verdict — is cheap, fast, and evadable by anyone who fragments their actions below your threshold. The answer is behavioural detection: score the sequence, the shape, the kill-chain, not the individual event. But behavioural detection has its own failure mode, and it is the one that quietly kills SOCs: it threatens to drown you in false positives.

The entire craft lives in that tradeoff. This is a practitioner’s view of how to tune it deliberately rather than discovering it the hard way at 3am.

Why atomic detection fails

Atomic detection asks one question per event: is this event bad? It works for the known and the obvious — a known-bad hash, a signature match, a single unambiguous indicator. And it has a structural weakness that any competent adversary exploits: the harm often does not live in any single event.

A “5 failed logins = alert” rule is atomic: the attacker paces to four. A rule on a single suspicious command is atomic: the attacker splits the objective across commands that are each individually unremarkable. Atomic detection is the detection-engineering equivalent of signature-based AV, and it loses to fragmentation for the same reason AV loses to a novel binary: it can only see what it can match, one thing at a time.
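A minimal sketch of that pacing evasion, assuming a sliding-window counting rule. The function name, the 300-second window and the timestamps are illustrative assumptions, not taken from any particular SIEM.

```python
from collections import deque

def threshold_rule(timestamps, limit=5, window=300):
    """Counting rule: alert if `limit` failed logins fall within `window` seconds."""
    recent = deque()
    for ts in timestamps:
        recent.append(ts)
        # Drop events that have aged out of the sliding window.
        while recent and ts - recent[0] >= window:
            recent.popleft()
        if len(recent) >= limit:
            return True
    return False

# Noisy attacker: 5 failures in under a minute.
burst = [0, 10, 20, 30, 40]
# Paced attacker: 4 failures, wait for the window to expire, repeat.
paced = [0, 10, 20, 30, 400, 410, 420, 430, 800, 810, 820, 830]

burst_alert = threshold_rule(burst)   # True
paced_alert = threshold_rule(paced)   # False, despite 12 failed logins
print(burst_alert, paced_alert)
```

The paced attacker makes more than twice as many attempts as the noisy one and never trips the rule, because the rule only ever sees one window at a time.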

The harm lives in the basket, not the transaction. Atomic detection only ever sees transactions.

What behavioural detection actually does

Behavioural detection changes the question from “is this event bad?” to “does this sequence form a recognisable hostile shape?” It scores the relationship between events, not the events in isolation.

The canonical example is the kill-chain: failed logins → a success → privilege enumeration → lateral movement. Each step, alone, is something that happens thousands of times a day for benign reasons. In that ordered relationship, the steps form an attack. A behavioural rule fires on the correlation structure, and crucially it fires regardless of timing or surrounding noise, because it is matching a pattern, not counting hits.

This is the critical distinction that determines whether your behavioural detection is any good: does the rule match a pattern across events, or does it merely count hits?

If your “behavioural” detections are really counting rules, you have not actually escaped the atomic trap. You have just raised the count the attacker has to pace under.
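As a rough illustration of matching a shape rather than counting, the sketch below tracks, per account, how far along an ordered chain the account has progressed. The event names, the `(account, action)` tuple format and the `matches_chain` helper are hypothetical, chosen only to mirror the kill-chain example above.

```python
KILL_CHAIN = ["login_failure", "login_success", "priv_enum", "lateral_move"]

def matches_chain(events, chain=KILL_CHAIN):
    """Return accounts whose events contain `chain` as an ordered subsequence."""
    progress = {}  # account -> index of the next chain step to look for
    hits = set()
    for account, action in events:
        step = progress.get(account, 0)
        if step < len(chain) and action == chain[step]:
            progress[account] = step + 1
            if progress[account] == len(chain):
                hits.add(account)
    return hits

events = [
    ("alice", "login_success"),
    ("mallory", "login_failure"),
    ("alice", "file_read"),
    ("mallory", "login_success"),
    ("bob", "login_failure"),
    ("mallory", "file_read"),      # unrelated noise between chain steps
    ("mallory", "priv_enum"),
    ("bob", "login_success"),
    ("alice", "priv_enum"),        # enumeration alone, no preceding failure
    ("mallory", "lateral_move"),
]

flagged = matches_chain(events)
print(flagged)
```

Only the account that walks the whole chain in order is flagged. Interleaved noise and the other accounts' partial matches do not matter, and there is no count for the attacker to pace under. The sketch has no time bound at all, which mirrors the "regardless of timing" claim but also means state grows without limit; a production rule would presumably need some expiry.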

The drowning problem

Here is the catch, and it is unavoidable: the more behavioural and sensitive your detection, the more it fires on legitimate activity that happens to match the hostile shape. A sysadmin doing maintenance can walk a sequence that pattern-matches to lateral movement. A penetration test, a migration, an incident-response action — all generate sequences that look like the thing you are detecting, because they are operationally similar.

This is not a bug you can fix. It is the fundamental tension: detection sensitive enough to catch the real threat will fire on legitimate activity that resembles it. Push sensitivity up and you catch more attacks and more false positives. Push it down and you reduce noise and miss real attacks. There is no setting that gives you all the true positives and none of the false ones, because the legitimate and the malicious genuinely overlap in feature space.

A SOC that ignores this tunes for maximum catch-rate, drowns in alerts, develops alert fatigue, and then misses the real thing inside the noise — the worst of both worlds, arrived at by good intentions.

The ROC curve is the whole job

The honest framing: you are choosing a point on a ROC curve, trading false-positive rate against true-positive rate. Every detection sits somewhere on that curve, and the engineering is deliberately choosing where, per detection, based on what each error costs.
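To make the curve concrete, here is a small sketch that computes ROC points for one detection from labelled validation scores. The scores and labels are invented for illustration, and `roc_points` is a hypothetical helper, not a library function.

```python
def roc_points(scored, thresholds):
    """scored: list of (score, is_malicious). Returns [(threshold, fpr, tpr)]."""
    positives = sum(1 for _, bad in scored if bad)
    negatives = len(scored) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, bad in scored if bad and s >= t)
        fp = sum(1 for s, bad in scored if not bad and s >= t)
        points.append((t, fp / negatives, tp / positives))
    return points

# Hypothetical labelled sequences: 4 malicious, 8 legitimate-but-similar.
scored = [
    (0.95, True), (0.80, True), (0.65, True), (0.40, True),
    (0.70, False), (0.50, False), (0.45, False), (0.30, False),
    (0.20, False), (0.15, False), (0.10, False), (0.05, False),
]

points = roc_points(scored, [0.9, 0.6, 0.35])
for threshold, fpr, tpr in points:
    print(f"threshold={threshold:.2f}  FPR={fpr:.3f}  TPR={tpr:.2f}")
```

Lowering the threshold from 0.9 to 0.35 raises the true-positive rate from 0.25 to 1.0 on this toy data, and the false-positive rate from 0 to 0.375. Each threshold is one point on the curve; none of them is free.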

The discipline that follows:

Tune per detection, not globally. The right sensitivity for ransomware-encryption behaviour (high cost to miss, accept more false positives) is not the right sensitivity for a low-severity policy violation (low cost to miss, low tolerance for noise). A single global sensitivity is a guarantee that half your detections are mistuned.

Cost both error types explicitly. What does a missed detection cost here? What does a false positive cost — analyst time, alert fatigue, the credibility of the alert source? Write it down. The tuning decision is a cost decision, and unexamined cost assumptions produce unexamined tuning.
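One way to write the costs down is to turn them into numbers and let them pick the threshold. The sketch below is an assumption-laden illustration: the cost figures are arbitrary units, and the same toy score distribution is reused for both detections purely to isolate the effect of the costs.

```python
def total_cost(scored, threshold, cost_miss, cost_fp):
    """Cost of running at `threshold`: missed attacks plus false alerts."""
    misses = sum(1 for s, bad in scored if bad and s < threshold)
    false_pos = sum(1 for s, bad in scored if not bad and s >= threshold)
    return misses * cost_miss + false_pos * cost_fp

def best_threshold(scored, candidates, cost_miss, cost_fp):
    """Pick the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(scored, t, cost_miss, cost_fp))

# Hypothetical labelled scores: (score, is_malicious).
scored = [
    (0.95, True), (0.80, True), (0.65, True), (0.40, True),
    (0.70, False), (0.50, False), (0.45, False), (0.30, False),
    (0.20, False), (0.15, False), (0.10, False), (0.05, False),
]
candidates = [0.9, 0.6, 0.35]

# Ransomware-encryption behaviour: a miss is catastrophic, a false alert is cheap.
ransomware_t = best_threshold(scored, candidates, cost_miss=1000, cost_fp=1)
# Low-severity policy violation: a miss is cheap, analyst time is not.
policy_t = best_threshold(scored, candidates, cost_miss=1, cost_fp=5)

print(ransomware_t, policy_t)
```

With identical score distributions, the two cost profiles land on opposite ends of the candidate range (0.35 and 0.9 here), which is the argument against a single global sensitivity.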

Add context to move the curve, not just slide along it. Sliding along the ROC curve trades one error for the other. Enriching the detection — adding context that distinguishes the legitimate-but-similar from the malicious — moves the whole curve outward, giving you better true-positive rate at the same false-positive rate. This is where the real wins are: not “more sensitive” but “more contextual.” Is this lateral-movement-shaped sequence coming from a known maintenance window? An authorised admin? A change-ticket? Context is what separates the two overlapping populations.
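A toy comparison of a raw rule and a context-enriched rule at the same threshold. The `Match` fields, the threshold and the data are hypothetical; the point is only to show the false-positive rate dropping while the true-positive rate holds.

```python
from dataclasses import dataclass

@dataclass
class Match:
    score: float
    malicious: bool         # ground-truth label, available only in validation
    authorised_admin: bool  # context: account is on the admin allow-list
    change_ticket: bool     # context: an open change ticket covers the activity

THRESHOLD = 0.6

def raw_rule(m):
    return m.score >= THRESHOLD

def enriched_rule(m):
    # Suppress only when BOTH context signals explain the activity.
    explained = m.authorised_admin and m.change_ticket
    return m.score >= THRESHOLD and not explained

def rates(matches, fires):
    """Return (true-positive rate, false-positive rate) for a rule."""
    pos = [m for m in matches if m.malicious]
    neg = [m for m in matches if not m.malicious]
    tpr = sum(fires(m) for m in pos) / len(pos)
    fpr = sum(fires(m) for m in neg) / len(neg)
    return tpr, fpr

matches = [
    Match(0.90, True, False, False),
    Match(0.80, True, True, False),    # compromised admin account, no ticket
    Match(0.70, True, False, False),
    Match(0.50, True, False, False),
    Match(0.85, False, True, True),    # planned maintenance
    Match(0.75, False, True, True),    # planned migration
    Match(0.65, False, False, False),  # benign but unexplained
    Match(0.30, False, True, False),
]

raw_tpr, raw_fpr = rates(matches, raw_rule)
enriched_tpr, enriched_fpr = rates(matches, enriched_rule)
print(raw_tpr, raw_fpr, enriched_tpr, enriched_fpr)
```

On this data the true-positive rate stays at 0.75 while the false-positive rate falls from 0.75 to 0.25, with no change to the threshold. Requiring both signals is a deliberate choice in the sketch: an admin account on its own does not suppress the alert.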

Measure consistency, not just point performance. Behavioural detection on probabilistic signals can flap — firing on minor variations, missing on others. A detection that catches the threat 70% of the time it appears is a different (worse) thing than one that catches it 95% of the time, even if both “work” in a demo. Validation has to measure how consistently a detection fires across variations of the technique, not just whether it fired once.
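A small sketch of measuring consistency: run the same technique several ways and report the fraction of variations the detection catches. The two detectors and the variation names are hypothetical, written only to contrast a brittle rule with a tolerant one.

```python
CHAIN = ["login_failure", "login_success", "priv_enum", "lateral_move"]

def strict_detect(events):
    """Brittle: requires the chain as strictly adjacent events."""
    n = len(CHAIN)
    return any(events[i:i + n] == CHAIN for i in range(len(events) - n + 1))

def subsequence_detect(events):
    """Tolerant: requires the chain in order, allowing other events in between."""
    it = iter(events)
    return all(step in it for step in CHAIN)

def catch_rate(detect, variations):
    """Fraction of technique variations on which `detect` fires, plus per-variation results."""
    results = {name: detect(events) for name, events in variations.items()}
    return sum(results.values()) / len(results), results

variations = {
    "direct": list(CHAIN),
    "noise_between": ["login_failure", "login_success", "file_read",
                      "priv_enum", "lateral_move"],
    "repeated_failures": ["login_failure", "login_failure", "login_success",
                          "priv_enum", "lateral_move"],
    "slow_recon": ["login_failure", "login_success", "priv_enum",
                   "file_read", "priv_enum", "lateral_move"],
}

strict_rate, strict_results = catch_rate(strict_detect, variations)
tolerant_rate, tolerant_results = catch_rate(subsequence_detect, variations)
print(strict_rate, tolerant_rate)
```

Both detectors fire on the "direct" variation, so both would pass a single demo. Across the four variations the strict one catches half and the tolerant one catches all of them, which is the difference a one-shot test cannot show.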

The validation loop

You cannot tune what you do not measure, and you cannot measure detection quality from production alerts alone (you do not know what you missed). The loop that makes behavioural tuning honest:

  1. Define the technique the detection targets, in terms of the shape it should catch.
  2. Generate variations — purple-team it, run the technique multiple ways (the Atomic Red Team philosophy). The variations are how you find the flapping.
  3. Measure both metrics: catch-rate across the variations (true-positive consistency) and false-positive rate against a corpus of legitimate-but-similar activity.
  4. Move the curve with context before sliding along it with sensitivity.
  5. Re-validate on change — detections decay as environments shift; a detection that was well-tuned a year ago may be drowning or blind now.
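Steps 3 to 5 of the loop can be sketched as a quality gate that is re-run whenever the detection or the environment changes. Everything here is an assumption for illustration: the sample format, the two rule versions, and the gate limits of 0.9 and 0.1.

```python
CHAIN = ["login_failure", "login_success", "priv_enum", "lateral_move"]

def chain_in_order(events):
    it = iter(events)
    return all(step in it for step in CHAIN)

def detect_v1(sample):
    """Shape only."""
    return chain_in_order(sample["events"])

def detect_v2(sample):
    """Shape plus context: an open change ticket explains the activity."""
    return chain_in_order(sample["events"]) and not sample["change_ticket"]

def evaluate(detect, attack_variations, benign_corpus):
    """Return (catch rate across variations, false-positive rate on benign corpus)."""
    catch = sum(detect(s) for s in attack_variations) / len(attack_variations)
    fp = sum(detect(s) for s in benign_corpus) / len(benign_corpus)
    return catch, fp

def passes_gate(detect, attack_variations, benign_corpus, min_catch=0.9, max_fp=0.1):
    catch, fp = evaluate(detect, attack_variations, benign_corpus)
    return catch >= min_catch and fp <= max_fp

attack_variations = [
    {"events": list(CHAIN), "change_ticket": False},
    {"events": ["login_failure", "login_success", "file_read",
                "priv_enum", "lateral_move"], "change_ticket": False},
    {"events": ["login_failure", "login_failure", "login_success",
                "priv_enum", "lateral_move"], "change_ticket": False},
]
benign_corpus = [
    {"events": ["login_success", "file_read"], "change_ticket": False},
    {"events": list(CHAIN), "change_ticket": True},  # mistyped password, then maintenance
    {"events": ["login_success", "priv_enum"], "change_ticket": False},
    {"events": ["login_failure", "login_success", "file_read"], "change_ticket": False},
]

v1_ok = passes_gate(detect_v1, attack_variations, benign_corpus)
v2_ok = passes_gate(detect_v2, attack_variations, benign_corpus)
print(v1_ok, v2_ok)
```

The shape-only version catches every variation but fails the gate on false positives; the context-aware version passes both limits. Kept under version control next to the rule, a check like this would flag a detection that has drifted into drowning or blindness.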

This is detection-as-code thinking: a detection is an artifact with a measurable quality, version-controlled, tested against known inputs, and re-validated when things change. The alternative — tuning by reacting to whatever fired last night — is how SOCs end up either deaf or drowning.

The takeaway

Behavioural detection is not optional; atomic detection loses to any adversary who fragments. But behavioural detection is not free — it buys resistance to fragmentation at the price of false positives, and that price is paid in analyst attention, the scarcest resource in any SOC.

The craft is refusing to treat that tradeoff as accidental. Choose your point on the ROC curve per detection, cost both errors deliberately, move the curve outward with context rather than just sliding along it with sensitivity, and validate for consistency rather than for a single passing demo. Do that and you get detection that catches the chain without burying the analyst who has to read the alerts.

Drowning is not the price of behavioural detection. Drowning is the price of behavioural detection tuned by accident.


An independent piece by johlem.net — IT security, Luxembourg. SOC detection engineering for regulated finance.