Anthropic Agent SDK — Why the System Prompt Is the Product

January 15, 2026 · 7 min read

Before the Agent SDK, I was writing the same loop over and over. Call the model. Parse the response. Check if it wants to call a tool. Call the tool. Feed the result back. Handle errors. Repeat. Every agent project started with a few hundred lines of orchestration boilerplate that had nothing to do with what the agent was actually supposed to do.
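That hand-rolled loop looked something like this. This is a simplified sketch with stand-ins everywhere: callModel, the tool registry, and the message shapes are placeholders I've invented for illustration, not any real SDK's API.

```typescript
// A stripped-down version of the orchestration loop the SDK replaces.
// `callModel` and the tool registry are stand-ins, not a real API.

type ToolCall = { name: string; args: Record<string, unknown> };
type ModelReply =
  | { type: "text"; text: string }
  | { type: "tool_call"; call: ToolCall };
type Message = { role: "user" | "assistant" | "tool"; content: string };

async function runAgentLoop(
  userMessage: string,
  callModel: (history: Message[]) => Promise<ModelReply>,
  tools: Record<string, (args: Record<string, unknown>) => Promise<string>>,
  maxSteps = 10,
): Promise<string> {
  const history: Message[] = [{ role: "user", content: userMessage }];

  for (let step = 0; step < maxSteps; step++) {
    const reply = await callModel(history);

    // Plain text means the model is done: return the answer.
    if (reply.type === "text") return reply.text;

    // Otherwise dispatch the requested tool and feed the result back.
    const tool = tools[reply.call.name];
    const result = tool
      ? await tool(reply.call.args)
      : `Error: unknown tool "${reply.call.name}"`;
    history.push({ role: "assistant", content: JSON.stringify(reply.call) });
    history.push({ role: "tool", content: result });
  }
  throw new Error("Agent exceeded max steps without finishing");
}
```

Every project needed some version of this, plus retries, timeouts, and streaming — none of it specific to the agent's actual job.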

The SDK collapses all of that into a single function call. That sounds like a small thing. It isn't.

Here's something I didn't expect: I spend more time on system prompts than on code. The agent's behavior — what it does unprompted, how it handles ambiguity, when it asks for clarification vs. when it acts — is almost entirely determined by the system prompt and the tool descriptions. The TypeScript is just plumbing.

The SDK gives you options for how to handle this. You can write a fully custom system prompt, or you can use a preset and append your domain-specific instructions. I use the append approach — you get the SDK's built-in safety and error handling for free, and add your own behavioral logic on top. But "on top" can be 3,000 words of instructions about how the agent should behave in specific scenarios, how it should sequence tool calls, what it should check before acting.
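The preset-plus-append pattern looks roughly like this. The option shape below (`type: "preset"` with an `append` field) mirrors the SDK's systemPrompt setting as I understand it — treat it as an assumption and check it against the current SDK reference; the domain instructions themselves are an invented example.

```typescript
// The preset-plus-append pattern: keep the SDK's built-in behavior and
// layer domain instructions on top. The option shape is assumed from the
// SDK docs; verify before relying on it.

const domainInstructions = `
You manage service configurations. Before any write:
1. Fetch the current config and confirm the target service exists.
2. Never remove a key you did not read in this session.
3. If a request is ambiguous about which service it targets, ask.
`.trim();

// Passed as `options.systemPrompt` in the SDK's query() call (assumed shape):
const systemPrompt = {
  type: "preset" as const,     // use the SDK's built-in prompt...
  preset: "claude_code",
  append: domainInstructions,  // ...and add your behavioral layer on top
};
```

In a real project the `append` string is the 3,000-word document described above, usually loaded from a file and versioned alongside the code.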

This is where the work is. Not in the SDK configuration. Not in the tool implementations. In the long, iterative process of watching the agent do something wrong, figuring out which instruction was ambiguous, rewriting it, and testing again. It's closer to training than programming. You're shaping behavior through language, and the feedback loop is slow because you have to run the agent through realistic scenarios to see if your changes actually improved things.

If you've never spent forty-five minutes rewording a single paragraph of a system prompt because the model keeps interpreting "check before acting" as "ask the user for permission every time" — you haven't done this work yet. The precision required is closer to legal drafting than to programming. Every word carries weight because the model takes you literally in ways that are both its greatest strength and its most frustrating quality.

Tool design is where the actual work happens

I've written more about this in Tool Design Is a Communication Problem — the insight that tool descriptions matter more than tool implementations holds across every SDK project I've built.

There's a phase in every SDK project where you have the agent configured, the system prompt is decent, and the model is responding. Everything looks good. Then you watch it try to accomplish something real and it falls apart — not because the model is bad, but because your tools are wrong.

The model reads tool descriptions to decide when to use them. It reads parameter descriptions to figure out what arguments to pass. If your descriptions are vague, the model will use the tool at wrong times or with wrong arguments. If your schema is too permissive, you'll get garbage inputs that technically validate.

I learned this the hard way with service management tools. The first version had a generic update_service tool that accepted any JSON blob. The model could technically use it — but it would pass malformed data, skip required fields, use the wrong key names. When I broke it down into specific tools — get_service, update_config, add_config, remove_config — each with tight schemas and precise descriptions, the same model on the same prompt started producing correct operations almost every time.
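Two of the replacement tools, sketched out. The `{ name, description, input_schema }` shape follows the standard Anthropic tool-use format; the field names inside the schemas are illustrative, not the real service API.

```typescript
// Granular replacements for the one-size-fits-all update_service tool.
// Tight schemas + descriptions that tell the model *when* to call each one.

const serviceTools = [
  {
    name: "get_service",
    description:
      "Fetch the full current configuration for one service. " +
      "Call this before any update so you know the existing keys.",
    input_schema: {
      type: "object",
      properties: {
        service_id: {
          type: "string",
          description: "Exact service identifier, e.g. 'billing-api'",
        },
      },
      required: ["service_id"],
      additionalProperties: false,
    },
  },
  {
    name: "update_config",
    description:
      "Change the value of a single existing config key. Fails if the key " +
      "does not exist -- use add_config to create new keys.",
    input_schema: {
      type: "object",
      properties: {
        service_id: { type: "string" },
        key: { type: "string", description: "An existing config key" },
        value: { type: "string" },
      },
      required: ["service_id", "key", "value"],
      additionalProperties: false,
    },
  },
  // add_config and remove_config follow the same pattern: one operation
  // per tool, required fields only, additionalProperties: false.
];
```

Note what the descriptions are doing: encoding sequencing ("call this before any update") and boundaries ("use add_config to create new keys") that a generic blob-accepting tool left the model to guess.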

The insight isn't about syntax. It's about granularity and description quality. A well-designed tool with a clear description makes a mediocre model effective. A poorly designed tool makes even the best model unreliable. When I switched from Opus to Haiku for some sub-agent tasks — to save on cost and latency — the tight tool schemas meant Haiku could handle the work fine. The tools were doing the heavy lifting, not the model.

This is the part nobody talks about in SDK tutorials. They show you how to define a tool. They don't tell you that you'll redesign your tools five times before the agent behaves correctly, and that each redesign teaches you more about what the model actually needs to make good decisions.

Debugging distributions, not functions

When an agent makes a bad decision, figuring out why is harder than debugging traditional code. There's no stack trace for reasoning. The model read your system prompt, looked at the conversation history, considered the available tools, and decided to do X instead of Y. Why?

You can enable extended thinking and read the model's internal reasoning. Sometimes that helps. Sometimes the thinking block says something perfectly reasonable and the action is still wrong. The gap between "I can see what the agent did" and "I understand why it did it" is wide. Traditional debugging is deterministic: same input, same output, same bug. Agent debugging is probabilistic. The same prompt might produce different tool sequences on different runs. You're debugging distributions, not functions.
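In practice, "debugging a distribution" means running the same scenario repeatedly and tallying which tool sequences come out. A minimal sketch — runScenario is a stand-in for whatever executes your agent and returns the ordered tool names it used:

```typescript
// Run the same scenario N times and count distinct tool-call sequences.
// A bug in a distribution shows up as a wrong sequence at some frequency,
// not as a reproducible failure.

async function tallySequences(
  runScenario: () => Promise<string[]>,
  runs: number,
): Promise<Map<string, number>> {
  const counts = new Map<string, number>();
  for (let i = 0; i < runs; i++) {
    const key = (await runScenario()).join(" -> ");
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

A prompt change "works" when the bad sequence's share of the tally drops, not when one run looks right.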

I've settled on a pattern: inject system reminders every N tool calls to keep the agent grounded in its current objective. It's a hack. It works better than it should. But the fact that "periodically remind the agent what it's supposed to be doing" is a viable debugging strategy tells you something about the maturity of the tooling.
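The reminder hack itself is small. Everything here is my own convention — the function, the cadence, and the tag format; as far as I know the SDK has no built-in for this:

```typescript
// Every N tool calls, produce a note restating the current objective,
// to be appended to the conversation before the next model call.
// Names and tag format are my own convention, not an SDK feature.

function maybeReminder(
  toolCallCount: number,
  objective: string,
  every = 5,
): string | null {
  if (toolCallCount === 0 || toolCallCount % every !== 0) return null;
  return (
    `<system-reminder>You are ${toolCallCount} tool calls into this task. ` +
    `Current objective: ${objective}. If your recent calls are not ` +
    `advancing it, stop and re-plan.</system-reminder>`
  );
}
```

The cadence matters: too frequent and the reminders drown out the actual conversation, too sparse and the agent drifts before the next one lands.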

The context loss problem in delegation

The SDK supports delegating work to sub-agents — specialized agents with their own system prompts and tools. On paper, this is the architecture for complex systems. In practice, the sub-agent doesn't have the main agent's conversation history. You're starting a fresh context every time you delegate, and you have to pack everything the sub-agent needs into the delegation prompt.

You end up writing increasingly detailed delegation prompts that basically reconstruct the context the sub-agent doesn't have. At some point you're spending more effort on context packaging than you saved by delegating. Where sub-agents work well: isolated, well-defined tasks where the output format is rigid and the input context is small enough to fit in a single prompt. For anything requiring ongoing awareness of user intent, the handoff loses too much.
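The context packaging ends up looking like this. The builder and its fields are illustrative — my own shape for what has to survive the handoff, not an SDK API:

```typescript
// Because the sub-agent starts with a fresh context, everything it needs
// must be serialized into the delegation prompt. Fields are illustrative.

interface DelegationContext {
  task: string;            // the isolated job for the sub-agent
  userIntent: string;      // why the user asked, in one or two sentences
  relevantFacts: string[]; // decisions and constraints from the main conversation
  outputFormat: string;    // rigid format so the result is machine-usable
}

function buildDelegationPrompt(ctx: DelegationContext): string {
  return [
    `Task: ${ctx.task}`,
    `User intent: ${ctx.userIntent}`,
    `Established facts (treat as ground truth):`,
    ...ctx.relevantFacts.map((f) => `- ${f}`),
    `Respond ONLY in this format: ${ctx.outputFormat}`,
  ].join("\n");
}
```

The relevantFacts list is where the effort goes — every delegation forces you to decide which parts of a long conversation the sub-agent can safely live without.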

What's actually changing

The shift from raw API calls to the SDK reflects something real about where agent development is going. The low-level orchestration — the sampling loop, the tool dispatch, the error recovery — is becoming infrastructure. You don't write your own HTTP server for a web app anymore. Eventually you won't write your own agent loop either.

But the higher-level problems are unsolved. Context management across long-running sessions is still primitive — the failure modes of long-running agents are well-understood, but the solutions are still manual. Multi-agent coordination still relies on stuffing context into delegation prompts. Memory — real persistent memory across sessions — is something everyone needs and nobody has a clean solution for.

The SDK handles the plumbing well. Better than I would. But the plumbing was never the hard part. The hard part is designing the system prompt, the tool set, and the interaction patterns that make an agent actually useful for a specific domain. That work is still entirely on you, and no SDK is going to change that.

The interesting question is what happens when the tooling catches up to the ambition. Right now, building agent systems feels like web development in 2005 — the fundamentals work, the frameworks are emerging, but you're still fighting the infrastructure more than you'd like. The gap between "demo agent that does something cool" and "production agent that does something reliable" is enormous, and most of the effort lives in that gap.
