Local inference headroom is the new infrastructure constraint. The question isn't whether to run models locally — it's whether your hardware has enough unified memory to run the models that matter. 128GB is the threshold where frontier-class open models fit without severe quality compromise. Below that, you're making tradeoffs. Above it, you have a platform.
The hardware choice is also a stack choice. Managed inference with someone else's routing layer, or the full stack — your isolation model, your orchestration, your observability. The specs converge. The control doesn't.
The price point (~€2,300 for the x86 option today) isn't consumer yet. This is first-generation AI-native hardware — the capability arrived before the mass-market pricing did. That gap closes. It always does. But the architecture decisions you make in year one compound, and running your own inference stack changes how you think about what agents can do.
I've been running agents on my own hardware for over a year. Here's the architecture and the decision framework I settled on.
The hardware decision
128GB of unified memory is where the interesting models fit.
Below that threshold: Llama 3.3 70B quantized, Gemma 4 26B MoE, the standard workhorse tier. Solid for most tasks. Not the ceiling of what's useful.
At 128GB, three additional models become viable without severe quality compromise:
- Gemma 4 31B Dense (FP16) — Google, released April 2026. ~62GB for weights. Natively multimodal: text, image, and OCR. 256K context. Fits in 128GB with room for moderate context windows.
- OpenAI GPT-OSS-120B (Q4_K_M) — 117B total, ~5B active per token (MoE). ~60–66GB quantized. Frontier-class quality across most benchmarks. Available via NVIDIA NIM and Hugging Face.
- NVIDIA Nemotron 3 Super 120B (Q4_K_M) — NVIDIA's own architecture, not a Llama fine-tune. Hybrid Mamba-2/Transformer LatentMoE, 12B active per token, 1M token native context. ~60–80GB quantized. Leads the open model class at release for agentic tasks. Community MLX build targets unified-memory devices specifically.
Both 120B models run locally on 128GB hardware. Both are frontier-class quality. That's new territory.
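The arithmetic behind those footprints is simple enough to sketch. A back-of-envelope estimator, with illustrative bit-widths and a rough context/OS overhead allowance rather than measured numbers (Q4_K_M averages roughly 4.5 effective bits per weight):

```python
def quantized_weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Rough in-memory size of quantized weights, in GB (1e9 bytes).
    params_b is the parameter count in billions."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9


def fits_in_budget(params_b: float, bits_per_weight: float,
                   budget_gb: float = 128, overhead_gb: float = 20) -> bool:
    """Do the weights leave room for KV cache + OS within a unified-memory budget?
    overhead_gb is a crude allowance, not a KV-cache calculation."""
    return quantized_weights_gb(params_b, bits_per_weight) + overhead_gb <= budget_gb


# GPT-OSS-120B (117B params) at ~4.5 effective bits per weight:
print(round(quantized_weights_gb(117, 4.5), 1))  # ≈ 65.8 GB
```

The same formula explains why the FP16 dense model lands near 62GB (31B × 16 bits / 8) and why a 120B model at FP16 simply doesn't fit.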
The two machines at 128GB
GMKtec EVO X2 (Ryzen AI Max+ 395, 128GB LPDDR5X-8000) — ~€2,300 (street price, April 2026). x86, Linux-native. Radeon 8060S (RDNA 3.5, 40 CUs) handles inference via ROCm/llama.cpp. Real benchmarks from independent reviews: Llama 3.3 70B Q6_K at ~3.7 tok/s GPU-only, Qwen3-32B at ~10.5 tok/s.
Mac Studio M5 Max (128GB, unreleased — expected WWDC June 2026) — ~€2,500+ (estimate based on M4 Mac Studio pricing). 614 GB/s memory bandwidth, macOS, MLX inference framework, Apple Intelligence routing layer on top. Better memory bandwidth. Significantly higher cost. Apple controls the stack.
The Mac Mini, whether M4 Pro or the upcoming M5 Pro, tops out at 64GB. Different tier, different conversation.
The stack tradeoff
Memory bandwidth favors Apple significantly. Price favors x86 significantly. Full stack access favors x86 absolutely.
For agentic infrastructure — agents running in isolated containers, orchestrating work, calling tools at the kernel level — Linux gives you what macOS doesn't: the full stack. You choose how agents are isolated. You choose what they can see. You choose when to run local inference and when to hit an API. Nothing is managed for you. Nothing is decided for you.
Apple is building toward a hybrid inference model — on-device, local, cloud — with Siri as the consumer front end. That's the managed experience. If you want the platform underneath that experience, x86 Linux is where you build it.
Buy now or wait
The capability is here; the consumer pricing isn't. ~€2,300 for the EVO X2 is an early-adopter premium — you're paying for hardware that arrived ahead of its own price curve. NVIDIA's DGX Spark sits in the same memory tier at $3,000+. Beyond the Ryzen AI Max platform itself, nothing else from AMD or Intel competes at this memory tier yet.
That changes. The 2025–2026 generation is the proof point that 128GB unified memory is viable in a consumer form factor. 2027+ is when a second wave of devices — from more manufacturers, at lower margins — will bring this tier into reach for a broader audience. If you're building agent infrastructure for your own work now, the EVO X2 is the clearest path. If you're waiting for accessible pricing, that wait is probably 12–18 months, not 3–5 years.
M5 Mac Mini is expected at WWDC June 2026. Memory ceiling for the Pro tier is projected to stay at 64GB. If Apple bumps a Mac Mini config to 96GB or 128GB, the specs comparison changes. The stack tradeoff doesn't.
Here's the architecture that runs on top of it.
The orchestrator isn't a manager — it's a thick thread. You talk to one agent. That agent decides whether the task needs help, spawns containers for sub-agents if it does, collects results when they're done. Each spawned agent gets its own Docker container — isolated filesystem, no host access, destroyed on completion. The orchestrator is the only thing with a persistent identity.
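What "spawns containers for sub-agents" means in practice is a `docker run` invocation with the isolation flags set deliberately. A minimal sketch — the image name, mount path, and limits are illustrative, not my exact configuration:

```python
import shlex


def spawn_cmd(image: str, task_id: str, workdir: str,
              mem_limit: str = "8g", cpus: float = 2.0) -> list[str]:
    """Build the `docker run` argv for a throwaway sub-agent container.

    --rm destroys the container on exit; the only filesystem the agent
    sees is its own scratch volume, and the cgroup limits mean a runaway
    agent gets killed instead of taking the host down with it.
    """
    return [
        "docker", "run", "--rm",
        "--name", f"agent-{task_id}",
        "--memory", mem_limit,      # cgroup limit: OOM kills the agent, not the host
        "--cpus", str(cpus),
        "--network", "none",        # no network unless the task explicitly needs it
        "-v", f"{workdir}:/work",   # the only host path the agent can see
        image,
    ]


cmd = spawn_cmd("agent-runtime:py312", "a1b2", "/srv/agents/a1b2")
print(shlex.join(cmd))
```

The orchestrator builds one of these per sub-agent, waits on it, and collects whatever lands in the scratch volume.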
Why Docker and not just... running things
I keep getting asked why containers instead of just running agents as processes. This is the architectural decision that underpins everything else, so it's worth unpacking properly.
Agents that can execute code have access to whatever the host process has access to. Your SSH keys. Your browser cookies. That .env file with API keys you forgot to rotate. One hallucinated rm -rf away from a very bad afternoon. Containers draw a hard line. The agent sees its own filesystem and nothing else.
But the security argument is the obvious one. The deeper reason is about failure isolation. When an agent goes sideways — and they do, regularly — you want the blast radius contained. A bare-process agent that eats all available memory takes down everything else on the machine. A containerized agent that eats all available memory hits its cgroup limit and gets killed. Everything else keeps running. This matters more than it sounds when you're running agents you depend on for actual work.
Dependencies compound the argument. One agent needs Node 18 and Puppeteer. Another needs Python 3.12 with specific ML libraries. A third needs a full LaTeX installation. Running these as bare processes means dependency conflicts, version mismatches, and debugging sessions that have nothing to do with the actual work. Containers make each agent's dependency tree someone else's problem — specifically, the Dockerfile's problem.
There's also the lifecycle question. Bare processes linger. They leave state behind. They accumulate artifacts. Containers are born, do their work, and die cleanly. No zombie processes. No orphaned temp files. No state leaking between runs. The orchestrator builds containers on demand, mounts the right volumes, sets environment variables, and tears them down when the work is done. From the agent's perspective it's just running. From mine, it's safely boxed.
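The born-work-die lifecycle fits naturally in a context manager: creation on entry, guaranteed teardown on exit, even when the task throws. A sketch — the `run` parameter is injectable so the teardown logic can be exercised without a Docker daemon, and the image and task names are placeholders:

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def agent_container(image: str, task_id: str, run=subprocess.run):
    """Create -> work -> destroy. Nothing outlives the `with` block."""
    name = f"agent-{task_id}"
    # Start the agent detached; the orchestrator talks to it by name.
    run(["docker", "run", "-d", "--name", name, image], check=True)
    try:
        yield name
    finally:
        # Force-remove even if the task failed: no zombies, no leaked state.
        run(["docker", "rm", "-f", name], check=False)
```

From the caller's side it reads as `with agent_container("agent-runtime:py312", "a1b2") as name: ...`, and the cleanup is unconditional.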
The people building "multi-agent" setups without isolation are really just making multiple API calls wearing a trench coat. No lifecycle management. State bleeding everywhere. The container boundary changes that equation entirely. The harness architecture — not the model — determines whether agents at this layer are reliable or just impressive.
The dashboard problem: observability as the missing layer
Here's something I've learned from running agents on my own projects: the scariest moment is when an agent has been running for forty-five minutes and you have no idea what it's doing. Is it stuck in a loop? Did it burn through your API budget? Is it quietly rewriting files it shouldn't touch?
This isn't a monitoring problem in the traditional DevOps sense. Traditional monitoring asks "is the process alive?" Agent observability asks "is the process doing what I intended?" Those are fundamentally different questions, and the tooling for the second one barely exists.
What I need isn't pretty graphs. It's operational awareness. Active agents and their current state. Context window usage — how close to the limit. API costs accumulating in real time. Logs from each execution. Files consumed and produced. The difference between "my agent is running" and "my agent is on its third attempt at a task that should have taken one, and it's consumed 200K tokens doing it." That second thing happens more than people admit.
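As data, that operational awareness is a handful of counters per agent that the dashboard polls. A sketch, with assumed per-million-token prices (set them to zero for local inference) and an illustrative "third attempt, 200K tokens" alarm threshold:

```python
from dataclasses import dataclass


@dataclass
class AgentTelemetry:
    """Per-agent counters the dashboard polls. Prices are illustrative."""
    context_limit: int = 200_000
    input_tokens: int = 0
    output_tokens: int = 0
    attempts: int = 0
    price_in: float = 3.0    # assumed USD per million input tokens
    price_out: float = 15.0  # assumed USD per million output tokens

    def record(self, tokens_in: int, tokens_out: int) -> None:
        """Called after each task attempt completes (or is retried)."""
        self.input_tokens += tokens_in
        self.output_tokens += tokens_out
        self.attempts += 1

    @property
    def cost_usd(self) -> float:
        return (self.input_tokens * self.price_in
                + self.output_tokens * self.price_out) / 1e6

    @property
    def context_pct(self) -> float:
        return 100 * (self.input_tokens + self.output_tokens) / self.context_limit

    def alarming(self) -> bool:
        """The 'third attempt, 200K tokens burned' signal described above."""
        return (self.attempts >= 3
                and self.input_tokens + self.output_tokens >= 200_000)
```

None of this is sophisticated. The point is that the counters exist at all, per agent, queryable mid-run.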
The scarier version is silent drift. An agent doesn't crash, doesn't error — it just quietly shifts from doing what you asked to doing something related but wrong. A refactoring task that starts optimizing for readability instead of performance. A research task that goes deep on one tangent and never comes back. Without observability, you don't catch this until the work is done and pointed in the wrong direction.
Most "multi-agent" demos have zero observability. The agent runs, produces output, and you evaluate the result. For demos, fine. For agents you depend on daily, that's flying blind.
Tailscale as the glue
The networking piece was the part I expected to be painful. Port forwarding, dynamic DNS, firewall rules, SSL certificates for a home server — all the reasons homelabs stay local. Tailscale collapses that entire problem.
Your server joins a tailnet. You access it from any device you've authorized. No exposed ports. No public IP. The orchestrator API and dashboard listen only on the Tailscale interface. This is a design choice, not just convenience — it means the agent infrastructure has zero attack surface from the public internet. The only way in is through an authorized device on the tailnet.
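The "listen only on the Tailscale interface" rule is enforceable because Tailscale assigns addresses from the 100.64.0.0/10 CGNAT range. A sketch — `tailscale ip -4` is the real CLI call; the validation and the commented bind line are illustrative:

```python
import ipaddress
import subprocess

# The CGNAT range Tailscale assigns tailnet addresses from.
TAILNET = ipaddress.ip_network("100.64.0.0/10")


def is_tailnet_addr(addr: str) -> bool:
    """True if addr falls inside the Tailscale address range."""
    return ipaddress.ip_address(addr) in TAILNET


def tailscale_ip() -> str:
    """Ask the local tailscaled for this machine's tailnet IPv4."""
    out = subprocess.run(["tailscale", "ip", "-4"],
                         capture_output=True, text=True, check=True)
    addr = out.stdout.strip()
    if not is_tailnet_addr(addr):
        raise RuntimeError(f"unexpected address from tailscale: {addr}")
    return addr


# Bind the orchestrator API to the tailnet interface only, never 0.0.0.0, e.g.:
# uvicorn.run(app, host=tailscale_ip(), port=8080)
```

Refusing to bind on anything outside that range is a cheap guardrail against a config typo quietly exposing the API on all interfaces.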
Practically, this means kicking off a research task from my phone while I'm out, checking progress from my laptop later, reviewing results at home. The infrastructure follows me without me having to think about network topology. I've been running other services on Tailscale for a while and the reliability has been solid enough that I trust it for this.
The alternative — exposing an agent orchestration API to the public internet — should make anyone uncomfortable. These are systems that execute arbitrary code, manage files, and make API calls. The correct number of open ports for that is zero.
What I'm running
The architecture above is what I've settled on. On the EVO X2 128GB, running Fedora.
The orchestration layer this post originally described as a goal is now built. Agents spin up, do work, die cleanly. Costs are tracked per session. State persists across restarts. Observability makes the difference between "my agent ran for an hour" and "my agent ran for an hour — here's exactly what it did, what it touched, and what it cost."
What's running locally:
- Gemma 4 31B Dense — daily driver for most tasks
- Nemotron 3 Super 120B Q4 — heavy reasoning and long-context work
- Llama 3.3 70B Q6_K — fast fallback when latency matters more than quality
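The routing between those three is a small function, not a framework. A sketch of the split described above — the thresholds and model identifiers are illustrative, not tuned values:

```python
def pick_model(task: str, ctx_tokens: int, latency_critical: bool = False) -> str:
    """Route a task to one of the three local models by role.

    Order matters: the latency override wins, then long-context /
    heavy-reasoning work goes to the big MoE, everything else to the
    daily driver.
    """
    if latency_critical:
        return "llama-3.3-70b-q6k"         # fast fallback
    if ctx_tokens > 200_000 or task in {"reasoning", "planning"}:
        return "nemotron-3-super-120b-q4"  # heavy reasoning, long context
    return "gemma-4-31b"                   # daily driver
```

Most of the value is that the policy lives in one auditable place instead of being scattered across agent prompts.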
The hardware question is answered. The next question — what software stack to run on it, how to isolate agents at the kernel level, how to structure the orchestration layer — is what determines whether your agent infrastructure is reliable or just impressive. I've written about the operating system layer separately.
Sources
- GMKtec — EVO X2 product page
- ServeTheHome — GMKtec EVO X2 review
- CraftRigs — EVO X2 LLM benchmarks
- nishtahir.com — EVO X2 inference benchmarks
- NVIDIA — Nemotron 3 Super 120B announcement
- OpenAI — GPT-OSS announcement
- Google — Gemma 4 announcement
- Apple — Mac mini tech specs