The DGX Spark and the Case for Local LLM Inference in Security Work
NVIDIA’s GB10 Grace Blackwell desktops arrived with a seductive pitch: a “supercomputer on your desk,” 128GB of unified memory, a petaFLOP of FP4 performance, in a box the size of a thick paperback. The NVIDIA DGX Spark and its OEM siblings — Lenovo’s ThinkStation PGX, Dell’s Pro Max AI, Asus Ascent GX10, Acer Veriton GN100 — are the same silicon under different branding. For anyone doing security work who wants to run LLMs locally, on data that cannot leave their boundary, this class of machine is the most interesting hardware to appear in a while.
But the marketing and the reality diverge in an important way, and the honest version is more useful than the hype. This is what the hardware actually is, what it is genuinely good for in security work, and where it falls down.
What the hardware actually is
The common core across all the GB10 boxes is the same. The Lenovo ThinkStation PGX is built around the NVIDIA GB10 Grace Blackwell Superchip — the same silicon inside the NVIDIA DGX Spark; the PGX is one of several OEM variants of the same reference design, alongside the Dell Pro Max AI, Asus Ascent GX10, and Acer Veriton GN100. The hardware is practically identical across all of them: same chip, same memory, same form factor — what differs is the branding, the support structure, and potentially the software that ships on the drive.
The chip itself: the GB10 combines a 20-core Arm CPU with a Blackwell-generation GPU connected via NVLink-C2C, flanked by 128GB of LPDDR5X memory. The CPU is split into 10 high-performance Cortex-X925 cores and 10 efficiency Cortex-A725 cores in a big.LITTLE arrangement. The headline numbers: 1 PFLOP of FP4 AI performance, 128GB of coherent unified system memory, a ConnectX-7 NIC, 4TB of self-encrypting NVMe storage, in a 150mm-square chassis, at $4,699 for the Founders Edition.
The defining architectural feature is the unified memory. The 128GB of coherent unified memory is shared seamlessly between CPU and GPU, allowing the machine to load and run large models directly without the overhead of system-to-VRAM data transfers. That single pool is what lets a desktop box hold models that would otherwise need datacenter GPUs — models up to 200 billion parameters locally, or up to 405 billion across two units connected over the ConnectX networking. The Lenovo PGX and the others run NVIDIA’s DGX OS, built on Ubuntu, with CUDA, cuDNN, TensorRT, and AI Workbench pre-installed — an aarch64 platform, which matters for tooling compatibility.
The honest limitation, stated up front
Here is the thing the marketing glides past, and it has to come first because it shapes everything: this is not a fast inference box for large models. The constraint is memory bandwidth.
The unified LPDDR5X memory offers up to 273 GB/s, shared across CPU and GPU — and this is the machine’s main downside. For LLM token generation (the decode phase), memory bandwidth, not compute, is the bottleneck — you have to read the entire model’s weights from memory for every token generated, and 273 GB/s caps how fast that can happen. The consequence is concrete: on GPT-OSS 20B the Spark achieved about 49.7 tokens/sec decode, whereas an RTX 5090 delivered around 205 — roughly 4× faster — confirming the unified LPDDR5X bandwidth as the limiting factor. For larger models it gets worse: a dense 70B model sits at the theoretical floor of single-digit tokens per second for the ~35GB read at 273 GB/s — genuinely hard, and probably not the right tool unless you really need it.
By comparison, Apple’s M-series with >800 GB/s of bandwidth delivers faster token generation for many LLM workloads despite lacking FP4 hardware, and a multi-GPU rig with consumer cards beats the Spark on raw inference price-to-performance. So if your goal is maximum tokens per second on a chat model, this is not the machine, and saying otherwise would be dishonest.
What the GB10 boxes are genuinely for: developers who want to prototype or build smaller models on a single machine — local development, fine-tuning, experimentation, and running appropriately-sized models, before deploying to bigger infrastructure. It is a development-and-fine-tuning workstation that happens to inference, not an inference server. Hold that framing and the rest makes sense.
So why does this matter for security work?
If it is bandwidth-limited, why care? Because for security work specifically, the value of local inference was never primarily speed — it is the control boundary, and that is exactly what 128GB-on-your-desk delivers regardless of token rate.
The data that security work touches is the most sensitive data there is: incident details, log and detection content that reveals your environment and your gaps, vulnerability findings that map where you are weak, and — in a consulting context — client data under contractual or regulatory residency constraints. Sending any of that to a hosted inference API means it crossed a boundary into infrastructure you do not control, and now you must reason about a provider’s retention, handling, and jurisdiction.
A GB10 desktop changes the answer to “where did the data go” to nowhere — it stayed on hardware I control. For a regulated financial entity under DORA’s third-party and data-residency pressures, that is a categorically cleaner answer than a data-processing agreement. And the unified-memory capacity is what makes it viable — you can run a genuinely useful model (a capable mid-size model, well-quantized) entirely locally, where before “local” meant either tiny models or a rack of GPUs. The box does not have to be the fastest inference platform to be the right one when the binding constraint is “this data must not leave.”
The bandwidth limit even matters less than you’d think for several security workloads, because many of them are not latency-sensitive in the way a chat UI is. Batch log triage, overnight analysis runs, fine-tuning a model on your own detection data, building and testing tooling — these tolerate 50 tokens/sec fine, because the value is the local processing of sensitive data, not interactive speed. The workloads where the Spark’s weakness bites hardest (fast interactive chat on huge models) are often not the security workloads where local inference matters most.
Where the GB10 boxes fit, concretely
For security and consulting work, the realistic role:
Sensitive-data inference at non-interactive speed. Running a capable model over incident data, logs, findings, or client data — where the control boundary is the point and 50 tokens/sec is acceptable because the task is batch or near-batch. This is the core use case and it is a strong one.
Fine-tuning on proprietary security data. The 128GB and the architecture make it a fine-tuning box more than an inference box — adapting a model to your own detection content, threat intel, or domain language, on data that never leaves. For building a private, domain-tuned assistant on sensitive material, this is exactly the niche.
Tooling development and experimentation. A local laboratory for building LLM-powered security tooling, where unlimited local runs without metering or external logging is a real advantage, and the aarch64 + CUDA stack matches deploy targets.
Availability-independent workflows. Security tooling that must run without an external API dependency in its critical path — including, potentially, during an incident when you may not want external services in the loop.
Where it does not fit: high-throughput interactive inference on the largest models, or anything where tokens-per-second on a 70B-plus model is the binding requirement. For that, the honest answer is different hardware.
The Lenovo-vs-Spark question, briefly
Since the silicon is identical across the OEM variants, the choice between the DGX Spark Founders Edition and the Lenovo ThinkStation PGX (or the Dell/Asus/Acer equivalents) is not about performance — it is about the wrapper. NVIDIA made the GB10 platform available to partners with wiggle room for customization — Dell, Acer, Asus, Gigabyte, HP, Lenovo, and MSI created GB10 boxes with small variations in power, cooling, storage, cosmetics, and remote management.
So decide on the non-silicon factors: enterprise support and warranty (where Lenovo’s ThinkStation channel and business support may matter for a consulting practice that needs a support relationship), remote management features, thermals and acoustics, storage options, and procurement fit. For a regulated-finance consultant, the OEM’s support structure and the ability to procure through a business channel can outweigh the Founders Edition’s branding — the compute is the same either way.
The takeaway
The GB10 desktops — DGX Spark, Lenovo PGX, and their twins — are genuinely interesting for security work, but for the right reason. Not because they are fast inference boxes (they are bandwidth-limited, and a 70B model crawls), but because 128GB of unified memory on hardware you control makes capable local inference viable on the most sensitive data there is — and for security and consulting work under data-residency pressure, the control boundary is worth more than tokens per second.
The reframe to carry: buy a GB10 box for the control boundary and the fine-tuning capability, not for inference speed — and match it to non-interactive, sensitive-data workloads where “the data never left” matters more than “the answer came back instantly.” Walk in expecting a desktop H100 and you will be disappointed. Walk in needing a local, sovereign, fine-tuning-capable platform for sensitive security data, and it is close to purpose-built for the job. Just choose the OEM wrapper — Lenovo’s support channel or NVIDIA’s Founders Edition — on the factors that actually differ, because the chip inside them does not.
An independent piece by johlem.net — IT security consulting, Luxembourg. Self-hosted AI and data-sovereign infrastructure for regulated work. Related: local LLMs for security work, and self-hosting a security stack in regulated finance.