Anthropic Agent SDK — Why the System Prompt Is the Product

January 15, 2026 · 7 min read

Before the Agent SDK, I was writing the same loop over and over. Call the model. Parse the response. Check if it wants to call a tool. Call the tool. Feed the result back. Handle errors. Repeat. Every agent project started with a few hundred lines of orchestration boilerplate that had nothing to do with what the agent was actually supposed to do.
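That hand-rolled loop looked something like this. This is a simplified sketch with stand-ins everywhere: callModel, the tool registry, and the message shapes are placeholders I've invented for illustration, not any real SDK's API.

```typescript
// A stripped-down version of the orchestration loop the SDK replaces.
// `callModel` and the tool registry are stand-ins, not a real API.

type ToolCall = { name: string; args: Record<string, unknown> };
type ModelReply =
  | { type: "text"; text: string }
  | { type: "tool_call"; call: ToolCall };
type Message = { role: "user" | "assistant" | "tool"; content: string };

async function runAgentLoop(
  userMessage: string,
  callModel: (history: Message[]) => Promise<ModelReply>,
  tools: Record<string, (args: Record<string, unknown>) => Promise<string>>,
  maxSteps = 10,
): Promise<string> {
  const history: Message[] = [{ role: "user", content: userMessage }];

  for (let step = 0; step < maxSteps; step++) {
    const reply = await callModel(history);

    // Plain text means the model is done: return the answer.
    if (reply.type === "text") return reply.text;

    // Otherwise dispatch the requested tool and feed the result back.
    const tool = tools[reply.call.name];
    const result = tool
      ? await tool(reply.call.args)
      : `Error: unknown tool "${reply.call.name}"`;
    history.push({ role: "assistant", content: JSON.stringify(reply.call) });
    history.push({ role: "tool", content: result });
  }
  throw new Error("Agent exceeded max steps without finishing");
}
```

Every project needed some version of this, plus retries, timeouts, and streaming — none of it specific to the agent's actual job.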

The SDK collapses all of that into a single function call. That sounds like a small thing. It isn't.

Here's something I didn't expect: I spend more time on system prompts than on code. The agent's behavior — what it does unprompted, how it handles ambiguity, when it asks for clarification vs. when it acts — is almost entirely determined by the system prompt and the tool descriptions. The TypeScript is just plumbing.

The SDK gives you options for how to handle this. You can write a fully custom system prompt, or you can use a preset and append your domain-specific instructions. I use the append approach — you get the SDK's built-in safety and error handling for free, and add your own behavioral logic on top. But "on top" can be 3,000 words of instructions about how the agent should behave in specific scenarios, how it should sequence tool calls, what it should check before acting.
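The preset-plus-append pattern looks roughly like this. The option shape below (`type: "preset"` with an `append` field) mirrors the SDK's systemPrompt setting as I understand it — treat it as an assumption and check it against the current SDK reference; the domain instructions themselves are an invented example.

```typescript
// The preset-plus-append pattern: keep the SDK's built-in behavior and
// layer domain instructions on top. The option shape is assumed from the
// SDK docs; verify before relying on it.

const domainInstructions = `
You manage service configurations. Before any write:
1. Fetch the current config and confirm the target service exists.
2. Never remove a key you did not read in this session.
3. If a request is ambiguous about which service it targets, ask.
`.trim();

// Passed as `options.systemPrompt` in the SDK's query() call (assumed shape):
const systemPrompt = {
  type: "preset" as const,     // use the SDK's built-in prompt...
  preset: "claude_code",
  append: domainInstructions,  // ...and add your behavioral layer on top
};
```

In a real project the `append` string is the 3,000-word document described above, usually loaded from a file and versioned alongside the code.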

This is where the work is. Not in the SDK configuration. Not in the tool implementations. In the long, iterative process of watching the agent do something wrong, figuring out which instruction was ambiguous, rewriting it, and testing again. It's closer to training than programming. You're shaping behavior through language, and the feedback loop is slow because you have to run the agent through realistic scenarios to see if your changes actually improved things.

If you've never spent forty-five minutes rewording a single paragraph of a system prompt because the model keeps interpreting "check before acting" as "ask the user for permission every time" — you haven't done this work yet. The precision required is closer to legal drafting than to programming. Every word carries weight because the model takes you literally in ways that are both its greatest strength and its most frustrating quality.

Tool design is where the actual work happens

I've written more about this in Tool Design Is a Communication Problem — the insight that tool descriptions matter more than tool implementations holds across every SDK project I've built.

There's a phase in every SDK project where you have the agent configured, the system prompt is decent, and the model is responding. Everything looks good. Then you watch it try to accomplish something real and it falls apart — not because the model is bad, but because your tools are wrong.

The model reads tool descriptions to decide when to use them. It reads parameter descriptions to figure out what arguments to pass. If your descriptions are vague, the model will use the tool at wrong times or with wrong arguments. If your schema is too permissive, you'll get garbage inputs that technically validate.

I learned this the hard way with service management tools. The first version had a generic update_service tool that accepted any JSON blob. The model could technically use it — but it would pass malformed data, skip required fields, use the wrong key names. When I broke it down into specific tools — get_service, update_config, add_config, remove_config — each with tight schemas and precise descriptions, the same model on the same prompt started producing correct operations almost every time.
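Two of the replacement tools, sketched out. The `{ name, description, input_schema }` shape follows the standard Anthropic tool-use format; the field names inside the schemas are illustrative, not the real service API.

```typescript
// Granular replacements for the one-size-fits-all update_service tool.
// Tight schemas + descriptions that tell the model *when* to call each one.

const serviceTools = [
  {
    name: "get_service",
    description:
      "Fetch the full current configuration for one service. " +
      "Call this before any update so you know the existing keys.",
    input_schema: {
      type: "object",
      properties: {
        service_id: {
          type: "string",
          description: "Exact service identifier, e.g. 'billing-api'",
        },
      },
      required: ["service_id"],
      additionalProperties: false,
    },
  },
  {
    name: "update_config",
    description:
      "Change the value of a single existing config key. Fails if the key " +
      "does not exist -- use add_config to create new keys.",
    input_schema: {
      type: "object",
      properties: {
        service_id: { type: "string" },
        key: { type: "string", description: "An existing config key" },
        value: { type: "string" },
      },
      required: ["service_id", "key", "value"],
      additionalProperties: false,
    },
  },
  // add_config and remove_config follow the same pattern: one operation
  // per tool, required fields only, additionalProperties: false.
];
```

Note what the descriptions are doing: encoding sequencing ("call this before any update") and boundaries ("use add_config to create new keys") that a generic blob-accepting tool left the model to guess.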

The insight isn't about syntax. It's about granularity and description quality. A well-designed tool with a clear description makes a mediocre model effective. A poorly designed tool makes even the best model unreliable. When I switched from Opus to Haiku for some sub-agent tasks — to save on cost and latency — the tight tool schemas meant Haiku could handle the work fine. The tools were doing the heavy lifting, not the model.

This is the part nobody talks about in SDK tutorials. They show you how to define a tool. They don't tell you that you'll redesign your tools five times before the agent behaves correctly, and that each redesign teaches you more about what the model actually needs to make good decisions.

Debugging distributions, not functions

When an agent makes a bad decision, figuring out why is harder than debugging traditional code. There's no stack trace for reasoning. The model read your system prompt, looked at the conversation history, considered the available tools, and decided to do X instead of Y. Why?

You can enable extended thinking and read the model's internal reasoning. Sometimes that helps. Sometimes the thinking block says something perfectly reasonable and the action is still wrong. The gap between "I can see what the agent did" and "I understand why it did it" is wide. Traditional debugging is deterministic: same input, same output, same bug. Agent debugging is probabilistic. The same prompt might produce different tool sequences on different runs. You're debugging distributions, not functions.
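In practice, "debugging a distribution" means running the same scenario repeatedly and tallying which tool sequences come out. A minimal sketch — runScenario is a stand-in for whatever executes your agent and returns the ordered tool names it used:

```typescript
// Run the same scenario N times and count distinct tool-call sequences.
// A bug in a distribution shows up as a wrong sequence at some frequency,
// not as a reproducible failure.

async function tallySequences(
  runScenario: () => Promise<string[]>,
  runs: number,
): Promise<Map<string, number>> {
  const counts = new Map<string, number>();
  for (let i = 0; i < runs; i++) {
    const key = (await runScenario()).join(" -> ");
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

A prompt change "works" when the bad sequence's share of the tally drops, not when one run looks right.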

I've settled on a pattern: inject system reminders every N tool calls to keep the agent grounded in its current objective. It's a hack. It works better than it should. But the fact that "periodically remind the agent what it's supposed to be doing" is a viable debugging strategy tells you something about the maturity of the tooling.
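The reminder hack itself is small. Everything here is my own convention — the function, the cadence, and the tag format; as far as I know the SDK has no built-in for this:

```typescript
// Every N tool calls, produce a note restating the current objective,
// to be appended to the conversation before the next model call.
// Names and tag format are my own convention, not an SDK feature.

function maybeReminder(
  toolCallCount: number,
  objective: string,
  every = 5,
): string | null {
  if (toolCallCount === 0 || toolCallCount % every !== 0) return null;
  return (
    `<system-reminder>You are ${toolCallCount} tool calls into this task. ` +
    `Current objective: ${objective}. If your recent calls are not ` +
    `advancing it, stop and re-plan.</system-reminder>`
  );
}
```

The cadence matters: too frequent and the reminders drown out the actual conversation, too sparse and the agent drifts before the next one lands.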

The context loss problem in delegation

The SDK supports delegating work to sub-agents — specialized agents with their own system prompts and tools. On paper, this is the architecture for complex systems. In practice, the sub-agent doesn't have the main agent's conversation history. You're starting a fresh context every time you delegate, and you have to pack everything the sub-agent needs into the delegation prompt.

You end up writing increasingly detailed delegation prompts that basically reconstruct the context the sub-agent doesn't have. At some point you're spending more effort on context packaging than you saved by delegating. Where sub-agents work well: isolated, well-defined tasks where the output format is rigid and the input context is small enough to fit in a single prompt. For anything requiring ongoing awareness of user intent, the handoff loses too much.
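The context packaging ends up looking like this. The builder and its fields are illustrative — my own shape for what has to survive the handoff, not an SDK API:

```typescript
// Because the sub-agent starts with a fresh context, everything it needs
// must be serialized into the delegation prompt. Fields are illustrative.

interface DelegationContext {
  task: string;            // the isolated job for the sub-agent
  userIntent: string;      // why the user asked, in one or two sentences
  relevantFacts: string[]; // decisions and constraints from the main conversation
  outputFormat: string;    // rigid format so the result is machine-usable
}

function buildDelegationPrompt(ctx: DelegationContext): string {
  return [
    `Task: ${ctx.task}`,
    `User intent: ${ctx.userIntent}`,
    `Established facts (treat as ground truth):`,
    ...ctx.relevantFacts.map((f) => `- ${f}`),
    `Respond ONLY in this format: ${ctx.outputFormat}`,
  ].join("\n");
}
```

The relevantFacts list is where the effort goes — every delegation forces you to decide which parts of a long conversation the sub-agent can safely live without.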

What's actually changing

The shift from raw API calls to the SDK reflects something real about where agent development is going. The low-level orchestration — the sampling loop, the tool dispatch, the error recovery — is becoming infrastructure. You don't write your own HTTP server for a web app anymore. Eventually you won't write your own agent loop either.

But the higher-level problems are unsolved. Context management across long-running sessions is still primitive — the failure modes of long-running agents are well-understood, but the solutions are still manual. Multi-agent coordination still relies on stuffing context into delegation prompts. Memory — real persistent memory across sessions — is something everyone needs and nobody has a clean solution for.

The SDK handles the plumbing well. Better than I would. But the plumbing was never the hard part. The hard part is designing the system prompt, the tool set, and the interaction patterns that make an agent actually useful for a specific domain. That work is still entirely on you, and no SDK is going to change that.

The interesting question is what happens when the tooling catches up to the ambition. Right now, building agent systems feels like web development in 2005 — the fundamentals work, the frameworks are emerging, but you're still fighting the infrastructure more than you'd like. The gap between "demo agent that does something cool" and "production agent that does something reliable" is enormous, and most of the effort lives in that gap.
