What Actually Breaks When Agents Run Long

January 7, 2026 · 6 min read

Here's a thing that happens: you give Claude Code a complex task — refactor a module, build out a feature across multiple files, restructure a data pipeline. Forty minutes in, the agent is doing great work. An hour in, it starts repeating itself. Ninety minutes in, it's trying a fix it already tried and abandoned. Two hours in, it crashes, and you realize nothing was saved.

I've had this happen enough times to develop opinions about it.

The failure modes are predictable

Long-running agents don't fail in mysterious ways. They fail in about five ways, and once you've seen each one, you recognize them immediately.

Context window overflow. The model has a finite context window. Every tool call, every file read, every response — it all accumulates. Eventually the conversation gets so long that the model starts losing track of earlier decisions. It doesn't announce this. It just quietly forgets that it already considered and rejected an approach, then tries it again.

Circular reasoning. Related but distinct. The model gets stuck in a loop — try approach A, it fails, try approach B, it fails, go back to A. I've watched Claude Code attempt the same three fixes in rotation for twenty minutes. The model isn't stupid. It just doesn't have a good mechanism for tracking "I already tried this and here's specifically why it didn't work" across a long conversation.

Silent drift. This is the worst one. The agent doesn't crash. It doesn't error out. It just slowly stops doing the thing you asked it to do. Maybe it was supposed to refactor a module for performance and it starts refactoring for readability instead. Maybe it was adding a feature and it gets pulled into fixing tangentially related bugs. You come back after an hour and the work is technically competent but pointed in the wrong direction.

Resource bleed. API calls cost money. A runaway agent burning through Opus-tier tokens for two hours on a task that should have taken thirty minutes — that adds up. I've had sessions where the cost of a single agent run exceeded what I'd normally spend in a week, because I wasn't watching and the agent was churning through retries.

Hard crashes with no recovery. The API times out. Your internet drops. The process gets killed. Three hours of work, gone. This is the most obvious failure mode and also the most preventable, which makes it the most frustrating when it happens.

The real problem is drift, not crashes

Most tutorials about agent reliability focus on crash recovery. That's important, but it's the easy problem. If your agent crashes, you know it crashed. You can build retry logic, implement checkpointing, add timeout handlers. These are solved problems in distributed systems.

Drift is harder because there's no error signal. The agent is running fine. It's producing output. The output is plausible. It's just not what you wanted. And by the time you notice — if you notice — you've burned through context window, tokens, and time.

I don't have a clean solution for this. Nobody does. The best I've managed is a combination of aggressive logging and periodic check-ins — basically, structuring long tasks so the agent reports back every N steps with a summary of what it's done and what it plans to do next. This gives me a chance to catch drift early. It's manual. It's annoying. It works better than the alternative, which is coming back to a finished task that's wrong.
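Something like this, where run_agent_steps is a hypothetical stand-in for whatever loop you're actually driving and the report fields are just the ones I care about:

# Sketch of a check-in loop: run the agent in bounded chunks and pause
# for a human look at the summary. run_agent_steps() and the report
# format are placeholders, not a real API.

CHECK_IN_EVERY = 10  # steps between summaries


def run_agent_steps(task: str, max_steps: int) -> dict:
    """Placeholder for your agent loop; returns a progress report."""
    return {"done": [], "next": [], "finished": False}


def run_with_check_ins(task: str) -> None:
    while True:
        report = run_agent_steps(task, max_steps=CHECK_IN_EVERY)
        print("Completed:", report["done"])
        print("Planned next:", report["next"])
        if report["finished"]:
            break
        # The pause is the point: a human glance here catches drift far
        # earlier than reading a finished-but-wrong result.
        if input("Still on track? [y/n] ").strip().lower() != "y":
            break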

What actually helps

I haven't built a production-grade harness system with orchestration layers and dependency graphs and all the components you see in most tutorials. But I've lost enough work to long-running failures that I know what matters.

Checkpointing is non-negotiable. If an agent can't resume from where it left off after a crash, you're gambling your time against the reliability of your internet connection and the Anthropic API's uptime. The simplest version of this is just saving state every N steps. Not elegant. Effective.
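The simplest version really is this small. A sketch, assuming your agent state fits in a dict; the file name and fields are illustrative:

# Minimal checkpoint: dump state to disk every N steps, reload on start.
import json
from pathlib import Path

CHECKPOINT = Path("agent_checkpoint.json")


def save_checkpoint(state: dict) -> None:
    # Write to a temp file, then replace, so a crash mid-write
    # can't corrupt the last good checkpoint.
    tmp = CHECKPOINT.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(CHECKPOINT)


def load_checkpoint() -> dict:
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"completed": [], "in_progress": None, "decisions": [], "cost_usd": 0.0}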

Cost caps before you start. Set a budget. Not a vague sense of "I'll keep an eye on it" — an actual number. "This task should cost no more than $X. If it hits that, stop." I learned this one the expensive way.
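In code, a budget is just a number you check after every call. The per-token prices below are placeholders rather than real pricing, and BudgetExceeded is my own name for the stop signal:

# Hard cost cap: track spend as you go and refuse to continue past it.
MAX_COST_USD = 5.00
PRICE_PER_INPUT_TOKEN = 15 / 1_000_000   # placeholder, check current pricing
PRICE_PER_OUTPUT_TOKEN = 75 / 1_000_000  # placeholder, check current pricing


class BudgetExceeded(RuntimeError):
    pass


class CostTracker:
    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.spent_usd += (input_tokens * PRICE_PER_INPUT_TOKEN
                           + output_tokens * PRICE_PER_OUTPUT_TOKEN)
        if self.spent_usd >= self.limit_usd:
            # Stop, save state, and let a human decide whether to continue.
            raise BudgetExceeded(f"hit ${self.spent_usd:.2f} of ${self.limit_usd:.2f}")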

Smaller tasks beat longer sessions. The most reliable harness pattern I've found isn't a pattern at all — it's decomposition. Instead of one four-hour agent session, run four one-hour sessions with clear handoff points. Each session starts with a summary of what's been done and what's next. This sidesteps context window overflow, reduces the blast radius of crashes, and gives you natural checkpoints for catching drift.
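A sketch of the handoff, with run_session standing in for whatever actually drives a single agent session:

# Decomposition with handoffs: each session gets only the previous
# session's summary, not the full transcript. run_session() is a
# hypothetical stand-in for your actual agent invocation.

def run_session(subtask: str, handoff: str) -> str:
    """Placeholder: runs one bounded session, returns a summary of
    what was done and what remains."""
    return f"Summary for: {subtask}"


def run_decomposed(subtasks: list[str]) -> str:
    handoff = "Nothing done yet."
    for subtask in subtasks:
        # Fresh context each time: only the handoff summary carries over,
        # which keeps each session well clear of the context window limit.
        handoff = run_session(subtask, handoff)
    return handoff


final = run_decomposed([
    "Map the module and list the functions to refactor",
    "Refactor the hot path",
    "Update call sites and tests",
    "Run the test suite and write up remaining issues",
])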

Log everything, read the logs. Sounds obvious. Most people don't do it. When an agent runs for an hour and produces a wrong result, the logs are the only way to figure out where it went off track. Without them, you're just guessing.
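The bar is low. An append-only JSONL file with one record per step is enough to reconstruct where a run went sideways; the fields here are just the ones I'd want to see:

# Append-only step log: one JSON record per agent step.
import json
import time
from pathlib import Path

LOG = Path("agent_run.jsonl")


def log_step(step: int, tool: str, summary: str, cost_usd: float) -> None:
    record = {
        "ts": time.time(),
        "step": step,
        "tool": tool,
        "summary": summary,    # one line: what the agent did and why
        "cost_usd": cost_usd,  # running total, so spikes are visible
    }
    with LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")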

Here's what a minimal checkpoint approach looks like — conceptual, not a library:

# Conceptual checkpoint pattern — not production code

every N steps:
    save current state to disk:
        - what's been completed
        - what's in progress
        - key decisions made (and why)
        - current cost/token usage

on resume:
    load latest checkpoint
    summarize state for the model
    continue from where it stopped

on cost limit:
    save state
    stop
    notify

That's it. That's the core of what you actually need. Everything else — retry strategies, dependency graphs, health monitoring, anomaly detection — is optimization on top of this foundation. Some of it matters for production systems. Most of it doesn't matter until you have the basics working.

The trade-off nobody talks about

There's a tension between harness complexity and harness reliability. Every layer of infrastructure you add to make your agent more resilient is another layer that can break. I've seen agent systems where the monitoring and recovery code was more complex than the agent itself — and the monitoring code had bugs that caused more failures than it prevented.

The simplest harness is often the best one. Save state. Set limits. Break big tasks into smaller ones. Check the output periodically. It's not sophisticated. It doesn't make for impressive architecture diagrams. It works.

Where this is going

The tooling for long-running agents is still primitive. We're at the "write your own checkpointing" stage, which is roughly where web development was in the "write your own HTTP server" era. Someone will build the framework that makes this boring and reliable. Probably multiple someones, and most of their attempts will be wrong in interesting ways before someone gets it right.

The interesting open question is whether the right abstraction is a harness that wraps the agent, or whether the models themselves need to get better at long-running tasks — better at tracking their own state, recognizing when they're going in circles, knowing when to stop and ask. Both will probably happen. I'm more optimistic about the model improvements, because every harness I've built has felt like a workaround for things the model should just handle.

Until then, save your state and set your cost limits. The boring infrastructure is what makes the interesting work possible.
