Detections Are Code: Version Control, Validation, and the Purple-Team Loop

May 27, 2026 · 10 min read

detection-as-code
purple-team
detection-engineering
ci-cd
atomic-red-team
soc

Most SOCs treat detections as configuration: someone wrote a rule, it lives in the SIEM, it fires sometimes, and nobody is entirely sure when it last actually worked. That is detection-as-folklore. The alternative, detection-as-code, treats every detection as a software artifact: versioned, reviewed, tested against known inputs, and continuously re-validated. It separates a detection library you can trust because you measured it from one you trust only because it has not obviously failed yet.

This post is about making that shift, and the purple-team validation loop that gives it teeth.

The folklore problem

Consider the typical state of a detection in a mature-but-not-disciplined SOC:

It was written by someone who may no longer be on the team.
The reasoning behind its thresholds is undocumented.
It has never been deliberately tested against the technique it supposedly catches.
Nobody knows if it still works after the last log-source change.
“It hasn’t paged us falsely lately” is the only evidence it is healthy — which is equally consistent with it being broken.

Every one of those is a property no software team would accept in production code. Yet detections are production code — they are the logic that decides whether you see an attack. Treating them as untracked configuration is treating your primary security control as folklore.

What detection-as-code actually means

The practices transfer directly from software engineering, and each one fixes a specific folklore failure:

Version control. Detections live in a repository, with history. You can see who changed what, when, and — if commit discipline is good — why. A detection’s evolution is auditable. This alone is a compliance asset: under accountability-focused regulation, “here is the detection, its history, and its last change rationale” is evidence that “we have an EDR” can never be.

Code review. A new or changed detection is reviewed before it goes live. A second engineer checks the logic, the thresholds, the assumptions. Detections written and deployed by one person, unreviewed, are exactly where silent coverage gaps and noisy false-positive generators come from.

Testing against known inputs. This is the heart of it. A detection has a specification — the technique it should catch — and you test it against inputs that exercise that technique. A detection that has never been run against the thing it claims to detect is an assertion, not a control.
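As a minimal sketch of what testing against known inputs can look like, the example below models a detection as a plain predicate over an event and checks it against samples that should and should not fire. Everything here is an assumption made for illustration: the `Detection` class, the `image` and `command_line` field names, the deliberately naive matching logic, and the sample events. A real rule would live in your SIEM's own query language.

```python
from dataclasses import dataclass
from typing import Callable

Event = dict[str, str]  # hypothetical flat event shape, for illustration only


@dataclass(frozen=True)
class Detection:
    name: str
    technique: str  # ATT&CK technique ID the rule claims to cover
    matches: Callable[[Event], bool]


# Illustrative list of encoded-command flags; a real rule needs a fuller set.
ENCODED_FLAGS = {"-e", "-ec", "-enc", "-encodedcommand"}


def encoded_powershell(event: Event) -> bool:
    """Naive example logic: PowerShell started with an encoded-command flag."""
    image = event.get("image", "").lower()
    tokens = event.get("command_line", "").lower().split()
    return image.endswith("powershell.exe") and any(t in ENCODED_FLAGS for t in tokens)


RULE = Detection("encoded-powershell", "T1059.001", encoded_powershell)

PS = r"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe"
SHOULD_FIRE = [
    {"image": PS, "command_line": "powershell.exe -NoP -enc SQBFAFgA"},
    {"image": PS, "command_line": "powershell.exe -EncodedCommand SQBFAFgA"},
]
SHOULD_NOT_FIRE = [
    {"image": PS, "command_line": "powershell.exe -File backup.ps1"},
    {"image": r"C:\Windows\System32\cmd.exe", "command_line": "cmd.exe /c dir"},
]


def run_spec(detection: Detection, should_fire: list[Event],
             should_not_fire: list[Event]) -> list[str]:
    """Return human-readable failures; an empty list means the spec holds."""
    failures = []
    for event in should_fire:
        if not detection.matches(event):
            failures.append(f"{detection.name}: missed {event['command_line']!r}")
    for event in should_not_fire:
        if detection.matches(event):
            failures.append(f"{detection.name}: false positive on {event['command_line']!r}")
    return failures
```

The point of the shape is that the known inputs sit next to the rule and fail loudly: any entry returned by `run_spec` is a concrete event the detection got wrong.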

CI/CD discipline. Changes flow through a pipeline: syntax validation, test execution, deployment. A detection that breaks a test does not ship. The same machinery that keeps application code from regressing keeps detection coverage from silently degrading.
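One way to picture the first pipeline stage is a small validation gate that refuses any rule missing required metadata or test cases. This is a sketch under assumptions: the rule is represented as a dictionary, and the field names (`id`, `title`, `technique`, `query`, `tests`) are a hypothetical schema, not a standard one.

```python
import re

# Hypothetical rule schema for illustration; adapt to your own rule format.
REQUIRED_FIELDS = ("id", "title", "technique", "query", "tests")
TECHNIQUE_ID = re.compile(r"^T\d{4}(\.\d{3})?$")  # e.g. T1059 or T1059.001


def validate_rule(rule: dict) -> list[str]:
    """Return a list of validation errors; empty means the rule may proceed."""
    errors = [f"missing field: {field}" for field in REQUIRED_FIELDS if field not in rule]
    technique = rule.get("technique", "")
    if technique and not TECHNIQUE_ID.match(technique):
        errors.append(f"malformed technique id: {technique}")
    if "tests" in rule and not rule["tests"]:
        errors.append("rule has no test cases")
    return errors


def gate(rules: list[dict]) -> int:
    """Return a process exit code: 0 if every rule validates, 1 otherwise."""
    failed = False
    for rule in rules:
        for error in validate_rule(rule):
            print(f"{rule.get('id', '<no id>')}: {error}")
            failed = True
    return 1 if failed else 0
```

A pipeline step would pass the return value of `gate` to `sys.exit`, so a rule without tests blocks the merge the same way a failing unit test would.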

Coverage as a tracked artifact. Your detection library maps to a framework (ATT&CK is the standard), and the coverage map is itself versioned and visible. A gap is a tracked item, not a surprise discovered during an incident.
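A coverage map can be as simple as a function over data you already keep in the repository. The sketch below assumes three hypothetical inputs: the set of technique IDs you have decided you must cover, a mapping from rule ID to the technique it targets, and the set of rule IDs that passed their most recent validation run.

```python
def coverage_gaps(required: set[str], detections: dict[str, str],
                  validated: set[str]) -> dict[str, list[str]]:
    """Split required techniques into those with no rule and those with only unvalidated rules.

    required:   technique IDs the team has committed to covering
    detections: rule ID -> technique ID the rule targets
    validated:  rule IDs that passed their latest validation run
    """
    covered = set(detections.values())
    validated_techniques = {detections[rule] for rule in validated if rule in detections}
    return {
        "uncovered": sorted(required - covered),
        "unvalidated": sorted((required & covered) - validated_techniques),
    }
```

Committing the output alongside the rules makes a gap show up in a diff: a technique moving into `uncovered` or `unvalidated` becomes a reviewable change rather than something found during an incident.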

The purple-team loop is the test harness

Software tests need inputs that exercise the code. Detection tests need adversary behaviour that exercises the detection. That is what purple teaming provides — and reframed this way, purple teaming stops being an occasional event and becomes your continuous test harness.

The loop:

Specify what each detection targets — the technique, in terms of the behavioural shape it should catch.
Generate the behaviour — execute the technique in a controlled way. The Atomic Red Team philosophy is exactly this: small, repeatable, atomic executions of specific techniques, used as test inputs.
Observe whether the detection fired, and how cleanly.
Measure two things, not one: did it catch the technique (a true positive), and did it catch it consistently across variations of the technique (no flapping)? A detection that fires on one variant and misses three gives only an illusion of coverage.
Feed results back — a miss is a detection bug, a flap is a robustness bug, a false-positive storm is a tuning bug. All three become tracked work, not vibes.
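The feedback step of the loop above can be sketched as a small triage function that turns raw run results into the three bug classes. The inputs are assumptions for illustration: a list of booleans recording whether the detection fired on each variation of the technique, and a count of alerts it raised on a benign baseline.

```python
from enum import Enum


class Finding(Enum):
    MISS = "detection bug"    # never fired on the technique
    FLAP = "robustness bug"   # fired on some variations, missed others
    NOISY = "tuning bug"      # too many alerts on benign activity


def triage(fired: list[bool], benign_alerts: int,
           max_benign_alerts: int = 0) -> list[Finding]:
    """Classify one detection's validation run into trackable findings."""
    if not fired:
        raise ValueError("no executions recorded; nothing was validated")
    findings = []
    if not any(fired):
        findings.append(Finding.MISS)
    elif not all(fired):
        findings.append(Finding.FLAP)
    if benign_alerts > max_benign_alerts:
        findings.append(Finding.NOISY)
    return findings
```

An empty result means the detection fired on every variation and stayed quiet on the baseline. Anything else maps directly to a ticket type, which is what keeps the results from turning into impressions.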

The shift in mindset: a red-team execution is not a one-off proof that you can be breached. It is a test case. Every technique you can execute in a controlled way is a regression test for the detection that should catch it. Run them continuously and your detection library has a test suite.

Why consistency, not just catch-rate

A subtlety that separates real validation from theatre: detections on noisy, probabilistic telemetry can flap — catch a technique one run, miss it the next, depending on minor variations in how it was executed. A single passing purple-team test proves the detection can fire. It does not prove the detection reliably fires.

This is why the validation loop must run variations of each technique and measure catch-rate across them. A detection that catches a technique 70% of the time has a 30% blind spot, and a single demo is more likely to hide that blind spot than to reveal it, because the demo itself passes 70% of the time. Treating that detection as “validated” because it passed once is exactly the false confidence detection-as-code is supposed to eliminate. Measure the distribution, not the point.
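The arithmetic behind this is short enough to write down. The sketch below assumes each execution is an independent trial with the same catch rate, which is a simplification: real variations are not identically distributed, so treat the numbers as intuition rather than a model of your environment.

```python
def chance_all_runs_pass(catch_rate: float, runs: int) -> float:
    """Probability that every one of `runs` independent executions is caught."""
    if not 0.0 <= catch_rate <= 1.0 or runs < 1:
        raise ValueError("catch_rate must be in [0, 1] and runs must be >= 1")
    return catch_rate ** runs


# How often a 70% detection looks flawless, by number of runs.
for runs in (1, 5, 10):
    print(runs, round(chance_all_runs_pass(0.7, runs), 3))
```

Under that assumption, a detection with a 70% catch rate passes a single demo 70% of the time, but passes ten runs in a row less than 3% of the time. More runs make the blind spot progressively harder to miss.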

This maps to a deeper truth: you are validating a probabilistic control, so your validation must itself be statistical. One run is an anecdote. Many runs across variations are a measurement.
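One hedged way to make "statistical" concrete is to report a confidence interval on the catch rate instead of a single pass or fail. The sketch below uses the Wilson score interval, which is a choice made for this example and not something the loop prescribes. It also assumes runs are independent trials, which is only approximately true in practice.

```python
import math


def wilson_interval(caught: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a catch rate; z = 1.96 gives roughly 95% confidence."""
    if runs <= 0 or not 0 <= caught <= runs:
        raise ValueError("need runs > 0 and 0 <= caught <= runs")
    p = caught / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)
```

With one passing run, `wilson_interval(1, 1)` has a lower bound of roughly 0.21, which is the anecdote expressed as a number. With 28 catches in 40 runs the interval narrows to roughly 0.55 to 0.82, which is something you can track over time and set a threshold against.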

What this buys you

The payoff is threefold, and each lands somewhere a SOC actually feels pain:

Trust grounded in measurement. You know your detections work because you tested them, repeatedly, against the things they target. “Are we covered for X?” has a real answer backed by validation runs, not a hopeful one.

Coverage that does not silently rot. Environments change — new log sources, schema drift, infrastructure shifts. A detection well-tuned last year may be blind now. Continuous validation catches the rot before an incident does.
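Schema drift in particular lends itself to a cheap automated check between full validation runs. The sketch below assumes each detection declares the event fields it reads and that you can pull a sample of recent events from the log source. Both the field names and the event shape are hypothetical.

```python
def missing_fields(required_fields: set[str], recent_events: list[dict]) -> set[str]:
    """Return the fields a detection reads that no recent event populates.

    An empty sample returns every required field, which is itself a signal:
    the log source may have gone silent.
    """
    seen: set[str] = set()
    for event in recent_events:
        seen.update(key for key, value in event.items() if value not in (None, ""))
    return required_fields - seen


# Hypothetical drift: the source renamed "image" to "process.executable".
sample = [{"process.executable": "powershell.exe", "command_line": "powershell.exe -enc AAAA"}]
print(missing_fields({"image", "command_line"}, sample))
```

A non-empty result does not prove the detection is blind, but it is a strong hint that it is evaluating a field that no longer arrives, and it costs far less than waiting for the next technique execution to find out.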

Evidence as a byproduct. Versioned detections, review history, validation runs — this is precisely the evidence accountability-focused regulation (NIS2’s posture, DORA’s demonstrate-your-controls expectation) asks for. Build the engineering discipline and the compliance evidence is generated automatically. You do not write the binder; the pipeline does.

The takeaway

A detection you cannot version, review, test, and re-validate is a liability you happen to trust. Detection-as-code, with a purple-team loop as its test harness, converts your detection library from folklore — a pile of rules someone wrote that probably still work — into engineering: artifacts with measured quality, tracked coverage, and evidence trails.

The reframe that unlocks it: a red-team technique is a test case, and purple teaming is continuous integration for your detections. Once you see adversary emulation that way, the whole software-engineering toolkit (version control, review, testing, CI/CD) applies directly to the thing that decides whether you see attacks at all.

Your detections are the most important code you run and never call code. Start.

An independent piece by johlem.net — IT security, Luxembourg. Detection engineering and purple-team validation for regulated finance.