Agent Harness Taxonomy — Six Architectures for AI Agents

April 7, 2026

The harness architecture determines the behavior envelope. Not the model — the scaffolding around it. Pick the wrong harness category and a better model won't save you. Pick the right one and you've solved problems the model can't solve for itself.

What is an agent harness? The runtime infrastructure that turns a language model into an agent — the orchestration loop, tool surfaces, memory systems, permission pipeline, and execution environment that govern what the model can do, when, and how. The model provides reasoning. The harness provides everything else. Anthropic's April 2026 launch of Claude Managed Agents — out-of-the-box harness infrastructure for enterprise deployments — is the clearest signal yet that the harness layer is where the industry is competing.

I catalogued six distinct architectures from production systems and open-source codebases: the Anthropic knowledge-work plugins, Stripe's Minions system, Claude Code's internals from the recent npm source leak, LangChain's DeepAgents, NousResearch's Hermes, and pi. Each category makes a different bet about what "the harness" is for, but they differ on a single variable: what can change after the harness ships. The first five are variations on a shared assumption. Category 6 breaks it.

The foundational assumption nobody names

The model layer already proved it isn't the moat, yet every discussion of agent capability treats the harness as fixed infrastructure. You build the scaffolding, then the model runs inside it. The scaffolding might be sophisticated — session management, tool execution, compaction, memory consolidation, multi-agent coordination — but it's treated as the stable part: the thing you build once and iterate the model or prompt against.

That assumption is so basic it's invisible. It's also what Category 6 breaks.

But I'm getting ahead of myself.

Category 1: Configured — domain-specific, plugin-extended

The lowest-friction entry point for vertical deployment: take a general-purpose harness and configure it with domain-specific tool surfaces.

Anthropic's knowledge-work-plugins is the clearest current example. The legal plugin installs on top of Claude Code and adds slash-command workflows — /review-contract, /triage-nda, /vendor-check, /brief, /respond — as composable skills backed by a local configuration file. The orchestration loop, permission pipeline, and tool execution engine are unchanged. What changes is the tool surface exposed to the model: it now speaks legal workflows, not generic tasks.
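The whole category can be sketched in a few lines. Everything below (the Plugin and Harness classes, the command table) is hypothetical, not Anthropic's actual plugin format, but it shows the shape: the core never changes, and installing a plugin only extends the dispatch table.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Plugin:
    """A Category 1 plugin: a new tool surface, no new orchestration."""
    name: str
    commands: Dict[str, Callable[[str], str]] = field(default_factory=dict)


class Harness:
    """Fixed core. Plugins can only extend the command table."""

    def __init__(self) -> None:
        self._commands: Dict[str, Callable[[str], str]] = {}

    def install(self, plugin: Plugin) -> None:
        # Installing a plugin changes the tool surface, nothing else.
        self._commands.update(plugin.commands)

    def dispatch(self, line: str) -> str:
        cmd, _, arg = line.partition(" ")
        handler = self._commands.get(cmd)
        if handler is None:
            return f"unknown command: {cmd}"
        return handler(arg)


legal = Plugin(
    name="legal",
    commands={"/review-contract": lambda arg: f"reviewing {arg}"},
)
harness = Harness()
harness.install(legal)
print(harness.dispatch("/review-contract vendor-msa.pdf"))  # reviewing vendor-msa.pdf
```

The constraint in the text is visible in the code: a plugin can add entries to the command table, but it cannot touch `dispatch`, the loop that calls it, or anything upstream.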

This category is underestimated because it looks like product configuration, not architecture. But the architectural bet is real: you're accepting the ceiling of the underlying harness in exchange for deployment speed and maintainability. The legal team gets agent-assisted contract review without building a session orchestrator. The constraint is that anything the underlying harness can't do, the plugin can't do either.

The right mental model: configured harnesses are to specialized harnesses what SaaS is to custom software. Faster to deploy, harder to differentiate, dependent on the platform.

What can change: the plugin configuration, the skill bodies, the tool surface. What stays fixed: the underlying orchestration, the permission model, the execution engine.

Category 2: Specialized — fixed phases, custom orchestration

The architecture most people mean when they say "production agent pipeline": custom orchestration built from scratch, with a fixed phase sequence and hard completion criteria.

Stripe's Minions is the clearest implementation at scale. Each Minion executes a "blueprint" — a deterministic sequence of nodes, some fixed logic, some agentic — against a precisely assembled context payload. The completion criterion is the test suite: three million tests, at most two CI rounds. You don't exit code generation until tests pass. The harness enforces this. The model doesn't decide when it's done; the phase controller does.

The session orchestrator in this architecture isn't a turn loop — it's a phase controller. The difference matters. A turn loop runs while there's work to do. A phase controller enforces a specific sequence with explicit gates. "Reliability comes from the harness, not the model" — this is what that actually means in practice. You don't need a smarter model; you need a completion criterion with teeth.
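A minimal sketch of the phase-controller idea, with stand-in names (`run_blueprint` and `tests_pass` are mine, not Stripe's): the gate decides completion, not the model, and the CI budget is enforced by the loop itself.

```python
from typing import Callable, Dict, List

State = Dict[str, object]


def run_blueprint(phases: List[Callable[[State], State]],
                  tests_pass: Callable[[State], bool],
                  max_ci_rounds: int = 2) -> State:
    """Phase controller sketch: fixed sequence, hard completion gate.

    The model never decides when it is done; the gate does, and the
    CI budget caps how many times the generation phase may retry.
    """
    state: State = {"round": 0}
    for phase in phases:           # fixed sequence, no skipping
        state = phase(state)
    # Completion criterion with teeth: loop until tests pass,
    # but never spend more than max_ci_rounds attempts.
    while not tests_pass(state):
        state["round"] += 1
        if state["round"] >= max_ci_rounds:
            raise RuntimeError("blueprint failed: CI budget exhausted")
        state = phases[-1](state)  # re-run only the generation phase
    return state
```

A turn loop would keep going as long as there is work; this controller instead walks a fixed sequence and blocks at the gate, which is the distinction the paragraph above draws.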

What this architecture does well: predictability, auditability, client-deliverable alignment. You know exactly what the output will look like before you run it. What it sacrifices: flexibility. The phases are fixed. When the task doesn't fit the blueprint, the blueprint breaks.

What can change: nothing, by design. What stays fixed: the phase sequence, the completion criteria, the tool surfaces.

Category 3: General-purpose — flexible, serving the median user

Claude Code and pi (open-source) both live here. These systems are built for the median user across a wide range of tasks. The session orchestrator owns the full turn loop: receive input, assemble context from memory, call the API, execute tools, handle compaction, fire background extraction, loop until done. The tool surface is broad. The prompt assembly is modular. Everything is designed to compose, not to specialize.
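The turn loop described above can be sketched with stub helpers. Everything here (`token_count`, `compact`, the message shapes) is an assumption for illustration, not Claude Code's or pi's actual internals.

```python
def token_count(messages):
    # Crude proxy: roughly four characters per token.
    return sum(len(m["content"]) for m in messages) // 4


def compact(messages):
    # Stub compaction: replace older turns with a summary, keep the tail.
    return [{"role": "system", "content": "summary of earlier turns"}] + messages[-2:]


def run_turn(user_input, model, tools, context_limit=100_000):
    """General-purpose turn loop sketch: assemble context, call the model,
    execute requested tools, feed results back, loop until no tool calls."""
    messages = [{"role": "user", "content": user_input}]
    while True:
        if token_count(messages) > context_limit:
            messages = compact(messages)
        reply = model(messages)
        calls = reply.get("tool_calls", [])
        if not calls:                 # no tool requests: the turn is done
            return reply["content"]
        for call in calls:            # execute tools, append results
            result = tools[call["name"]](**call["args"])
            messages.append({"role": "tool", "content": str(result)})


# Usage with a fake model that requests one tool call, then answers.
state = {"asked": False}

def fake_model(messages):
    if not state["asked"]:
        state["asked"] = True
        return {"tool_calls": [{"name": "add", "args": {"a": 2, "b": 3}}]}
    return {"content": "result: " + messages[-1]["content"], "tool_calls": []}

print(run_turn("add 2 and 3", fake_model, {"add": lambda a, b: a + b}))  # result: 5
```

The production versions of each stub are where the complexity lives: real compaction, real context assembly from memory, real error recovery around the tool calls.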

The leaked Claude Code source is the most complete picture available of what a production-grade general-purpose harness looks like at scale. Eleven hundred lines in the turn loop alone. Forty-plus tools. A six-layer permission pipeline. Three tiers of memory. Background forked agents sharing the parent's prompt cache. Session state that survives crashes and restarts. LangChain's DeepAgents reaches similar scope from a different foundation — their harness engineering post covers the production complexity this category requires.

The design constraints are interesting: you can't assume what task is coming in, so you can't hard-code phases. You can assume the user wants things to work reliably, so you invest heavily in error recovery, compaction, and session persistence. You can assume they'll return after being away, so you build the away summary pattern.

This architecture generalizes well. It's also the one most people try to build from scratch, usually without realizing how much production complexity they're skipping. Anthropic's building effective agents post surfaces the patterns; the source leak made the full implementation visible.

What can change: the task, the user, the context. What stays fixed: the harness.

Category 4: Autonomous — event-triggered, continuous

OpenClaw is the clearest example here. The agent doesn't wait for user input. It subscribes to a trigger taxonomy: file system changes, incoming API calls, scheduled cron, external webhooks, state transitions in watched resources. When a trigger fires, the agent wakes, processes, acts, and returns to waiting.

The Claude Code source has this too, under the KAIROS flag — an always-on daemon that runs a tick-based loop with a blocking budget per tick, enabling background monitoring and consolidation without explicit user invocation.

What this architecture changes is the relationship between the agent and time. Categories 1 through 3 are reactive — they run when invoked. Category 4 is proactive — it runs continuously, or on a schedule, or in response to external events. The trigger is the new primitive, which means the design complexity shifts to trigger management: what events exist, how they're classified, how priority is assigned when multiple triggers fire simultaneously, how the agent avoids acting on the same event twice. These aren't model problems. A more capable model doesn't improve trigger deduplication; better harness design does.
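Trigger deduplication, the harness problem named above, reduces to something like this sketch. The `TriggerDeduper` class and its fingerprint scheme are hypothetical, but the mechanism is the standard one: a seen-set keyed by event fingerprint, evicted on a time window.

```python
import time
from collections import OrderedDict
from typing import Optional


class TriggerDeduper:
    """Suppress duplicate trigger firings within a time window,
    so the agent never acts on the same event twice."""

    def __init__(self, window_s: float = 60.0) -> None:
        self.window_s = window_s
        self._seen = OrderedDict()  # (source, event_id) -> timestamp, insertion-ordered

    def should_fire(self, source: str, event_id: str,
                    now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Evict fingerprints older than the window (oldest first).
        while self._seen and next(iter(self._seen.values())) < now - self.window_s:
            self._seen.popitem(last=False)
        key = (source, event_id)
        if key in self._seen:
            return False            # duplicate within the window: suppress
        self._seen[key] = now
        return True


dedupe = TriggerDeduper(window_s=60.0)
print(dedupe.should_fire("fs", "config.yaml:modified", now=0.0))    # True
print(dedupe.should_fire("fs", "config.yaml:modified", now=10.0))   # False
print(dedupe.should_fire("fs", "config.yaml:modified", now=120.0))  # True
```

This is exactly the kind of logic a better model can't supply: it has to run before the model is ever invoked.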

What can change: the events in the environment. What stays fixed: the trigger taxonomy, the harness structure.

Category 5: Self-improving — optimizing from outcomes

Hermes from NousResearch uses optimization loops to improve skills and prompts from measured outcomes. Every skill invocation gets measured. Did the task succeed? How many tokens did it use? What was the quality score? Those measurements feed back into the optimizer, which rewrites the skill's when_to_use field, adjusts the body, changes the model override in the frontmatter.

The skills architecture is the right primitive for this. A skill is a markdown file with YAML frontmatter — name, description, when to use, allowed tools, model override. Small, composable, independently modifiable. When the optimizer improves a skill, it's changing a text file, not rewriting core infrastructure.
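A skill file of this shape, and the optimizer's edit primitive, might look like the sketch below. The `summarize-pr` skill and the `update_frontmatter` helper are illustrative assumptions, not Hermes's actual format; the point is that the edit is a text operation on one field.

```python
SKILL = """\
---
name: summarize-pr
description: Summarize a pull request diff
when_to_use: when the user pastes a diff or links a PR
allowed_tools: [read_file, web_fetch]
model: haiku
---
Read the diff, group changes by file, produce a three-bullet summary.
"""


def update_frontmatter(skill_text: str, field: str, new_value: str) -> str:
    """The optimizer's edit primitive: rewrite one YAML frontmatter field.

    Improving a skill is a text edit, not an infrastructure change.
    """
    _, _, rest = skill_text.partition("---\n")   # drop the opening fence
    front, _, body = rest.partition("---\n")     # split frontmatter from body
    lines = [
        f"{field}: {new_value}" if line.startswith(field + ":") else line
        for line in front.splitlines()
    ]
    return "---\n" + "\n".join(lines) + "\n---\n" + body


improved = update_frontmatter(
    SKILL, "when_to_use",
    "when the user pastes a diff, links a PR, or asks what changed",
)
```

Because the unit of change is this small, the optimizer can try a rewrite, measure it, and revert it without touching anything else in the system.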

Hermes also adds a fourth memory layer that the standard architecture doesn't include: episodic recall. Full-text search across all past sessions, not just what made it into structured memory. The standard three-tier model captures what was worth remembering. The episodic layer captures everything, raw, searchable. The optimizer can ask "show me all sessions where skill X was invoked and the task failed" and actually get an answer.
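The episodic layer is straightforward to sketch with SQLite's built-in full-text index. The schema and the session rows here are assumptions for illustration; Hermes's actual storage isn't specified in this form.

```python
import sqlite3

# Episodic recall sketch: raw session records in a full-text index,
# so the optimizer can query past failures directly.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE VIRTUAL TABLE sessions USING fts5(session_id, skill, outcome, transcript)"
)
db.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?, ?)",
    [
        ("s1", "review-contract", "success", "reviewed the vendor MSA, flagged clause 7"),
        ("s2", "review-contract", "failure", "timed out fetching the contract text"),
        ("s3", "triage-nda", "success", "standard mutual NDA, no changes needed"),
    ],
)

# "Show me all sessions where skill X was invoked and the task failed."
rows = db.execute(
    "SELECT session_id FROM sessions WHERE sessions MATCH ? AND outcome = 'failure'",
    ('skill:"review-contract"',),
).fetchall()
print(rows)  # [('s2',)]
```

Structured memory answers "what did we decide?"; this layer answers "what actually happened, every time?", which is the question the optimizer needs.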

This is where optimization stops being a human activity and starts being the harness's job.

What can change: skills, prompts, tool configurations — the configuration layer. What stays fixed: the harness code, the benchmark, the measurement infrastructure.

Category 6: Self-modifying — code evolution

This is the gap in every other framework I've reviewed.

The architecture: a MetaAgent runs the primary agent against a benchmark suite, measures failures by type and frequency, writes a targeted rewrite of the agent code, runs the rewrite through a sandboxed eval loop, and either commits the improved version or discards it and tries a different rewrite strategy. The evolution archive is a git history with benchmark scores attached to each commit.
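The loop can be sketched end to end. Here `rewrite` and `benchmark` are stubs standing in for the MetaAgent's rewrite strategy and the sandboxed benchmark runner; in a real system the benchmark runner must live outside the code being modified.

```python
from typing import Callable, List, Tuple


def meta_loop(agent_code: str,
              rewrite: Callable[[str, list], str],
              benchmark: Callable[[str], int],
              generations: int = 5) -> List[Tuple[str, int]]:
    """MetaAgent sketch: propose a rewrite, score it in a sandboxed eval,
    commit only on improvement. The archive plays the role of the
    git-history-with-scores described above."""
    archive = [(agent_code, benchmark(agent_code))]   # evolution archive
    for _ in range(generations):
        best_code, best_score = archive[-1]
        candidate = rewrite(best_code, archive)       # targeted rewrite
        score = benchmark(candidate)                  # sandboxed eval
        if score > best_score:
            archive.append((candidate, score))        # commit the improvement
        # else: discard and let the next iteration try another strategy
    return archive


# Toy demonstration: each rewrite appends a feature the benchmark rewards.
history = meta_loop(
    "v0",
    rewrite=lambda code, archive: code + "+",
    benchmark=lambda code: code.count("+"),
)
print([score for _, score in history])  # [0, 1, 2, 3, 4, 5]
```

The toy benchmark makes the objective-specification problem concrete: the loop will reliably maximize whatever `benchmark` measures, whether or not that is what you wanted.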

The difference from Category 5 is categorical, not incremental. Category 5 optimizes configuration — prompts, skill bodies, memory. Category 6 optimizes the harness code itself. The MetaAgent isn't adjusting a YAML field. It's rewriting the session orchestrator.

Three things this requires that none of the other categories need:

An eval harness that's outside the self-modifying code. If the MetaAgent can rewrite the benchmark runner, the improvement signal is meaningless. The sandbox is the trust boundary. The benchmarks have to be fixed, external, adversarially stable.

A sandboxed execution environment where the rewrite runs before being promoted. Not kernel isolation to prevent the agent from doing harm — isolation to prevent a bad rewrite from corrupting production. The purpose is different.

An evolution archive. The memory system for the self-modification loop. Every version, every benchmark score, every delta. Without this, the MetaAgent can only react to the current failure, not the failure pattern across attempts.

What this breaks is the foundational assumption. The harness is no longer fixed. It evolves through operation. The stable part — the thing you build once and run the model against — is now the benchmark suite and the sandboxed eval loop. Everything else is subject to modification.

What can change: the harness code itself. What stays fixed: the eval harness, the sandbox, the benchmark.

What this means in practice

The taxonomy is useful because it clarifies what you're actually deciding when you design an agent system. You're not just picking tools and models. You're choosing a category — a set of bets about what the harness is for and what can change.

Category 1 is the fastest path to vertical deployment. The configured harness trades differentiation for speed. When the underlying platform is good enough and the domain just needs a different tool surface, this is the right call. Most enterprise AI deployments that claim to be "custom" are Category 1.

Category 2 optimizes for predictability and client-deliverable alignment. When the task is well-defined and the output format is fixed, the phase controller with hard completion criteria is the right choice. Stripe's Minions isn't overengineered — it's correctly engineered for a predictable task at scale.

Categories 3 and 4 optimize for generality and autonomy respectively: the median user, or the continuous process.

Category 5 is where the harness starts participating in its own improvement. The optimization loop is explicit, measurable, and bounded to the configuration layer. This is achievable today with existing tooling.

Category 6 is where something qualitatively different happens. The line between the system building the agent and the agent building itself starts to blur. The MetaAgent doesn't know it's rewriting infrastructure — it's executing a task that happens to produce modified code. But the effect is a system participating in its own evolution.

The interesting work right now is in benchmark specification: what does it mean to define "better" precisely enough that the system evolves in a direction you can predict and verify? The problem isn't getting self-modifying systems to improve — that part works. The problem is objective specification. A system that optimizes for the wrong objective will evolve away from what you wanted, reliably and efficiently. You don't get a warning. You get a system that scores well on your benchmark and fails at what you actually needed.

Hermes approaches this conservatively on the code side — optimize configuration, measure carefully, keep the harness stable. Category 6 inverts this: optimize the code, enforce stability through the eval harness. Neither has a complete answer for objective specification. That's the open problem. The category that solves it cleanly will define the next generation of agent infrastructure.

The harness isn't just scaffolding. In Category 6, the harness is the product — and it's a product with agency over its own development.
