Context Engineering in Production — What the Frameworks Don't Tell You

January 12, 2026 · Updated April 9, 2026 · 11 min read · Essay

Most practitioners learn context engineering once. They read the Anthropic post, restructure their system prompt, add a CLAUDE.md, and consider it done. That's the design-time half of the problem. It's the easier half.

The harder half is runtime context engineering — keeping a running agent oriented as tool outputs accumulate, side paths multiply, and the original instructions get progressively diluted by everything that's happened since. That's where agents actually fail. Not on the first call. After thirty minutes of execution, when the context window is half-full of intermediate results and the model has quietly lost the thread of what it was supposed to do.

This post covers both halves — what good design-time context engineering looks like in practice, and the runtime failure modes that don't show up in any framework post.

Context engineering vs. prompt engineering

One sentence: prompt engineering is about what you say. Context engineering is about what you include, how you structure it, and in what order.

The distinction matters because the failure modes are different. A bad prompt produces a bad response you can see immediately and fix. Bad context engineering produces agents that work fine for thirty seconds and degrade quietly over thirty minutes. You don't catch it until you've burned time and tokens and the output is subtly, expensively wrong.

What goes into a context window

Anthropic's framework for context engineering identifies four operations: Write (inject information directly), Select (choose what to include from available information), Compress (summarize or truncate to reduce token volume), and Isolate (scope agents to specific context subsets via subagents or separate sessions). That's the right taxonomy. Here's how it works in practice.

Write is the highest-leverage lever and the most abused. System prompts, CLAUDE.md files, tool descriptions — these are things you write into context deliberately. They're also the most influential surface: model behavior is shaped more by what appears early at the system level than by what gets injected mid-conversation. The implication is straightforward and frequently ignored — write the things that govern behavior globally at the top, every time, not buried in turn 12 of a conversational thread.

Select is where most practitioners under-invest. The default impulse is to include everything that might be relevant: the full codebase, the complete document, the entire conversation history. The actual skill runs the opposite direction: identifying the minimum sufficient context for the current step. "Minimum sufficient" is the operative phrase — not minimal, you want everything the model actually needs, but nothing it doesn't. The difference between pointing an agent at an entire src/ directory versus the specific file containing the bug is often the difference between a mediocre result and a correct one. Same model. Same instructions. Different context structure.

Compress becomes critical in long-running tasks. Tool outputs accumulate. Previous agent turns pile up. The context window fills. The decision is: what can be summarized, what can be truncated, what still needs to be present verbatim. Compressing the wrong things — summarizing a function signature the model needs verbatim, truncating an error message it needs to trace — produces quiet failures that are hard to diagnose. Compressing the right things (intermediate reasoning, status updates, completed sub-steps) keeps the window manageable without losing the information that matters.
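That verbatim-vs-summary decision can be made mechanically. A minimal sketch, assuming each history entry carries a kind tag; the entry shape and kind names are illustrative, not from any particular framework:

```python
# Sketch of a compression pass: keep error messages, signatures, and the task
# statement verbatim, collapse everything else into one-line summaries.
# The entry format and kind names are illustrative assumptions.

KEEP_VERBATIM = ("error", "signature", "task")  # kinds that must survive exactly

def compress_history(entries: list[dict], max_summary_len: int = 80) -> list[dict]:
    """Each entry is {'kind': str, 'text': str}. Returns a compacted history."""
    compressed = []
    for entry in entries:
        if entry["kind"] in KEEP_VERBATIM:
            compressed.append(entry)  # exact text the model may need to trace
        else:
            # First line, truncated: enough to record that the step happened
            summary = entry["text"].splitlines()[0][:max_summary_len]
            compressed.append({"kind": f"{entry['kind']}-summary", "text": summary})
    return compressed
```

In a real loop the summarization would be model-backed rather than a truncation, but the classification step — deciding which kinds are never compressed — is the part that prevents the quiet failures described above.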

Isolate is the architectural move. Multi-agent systems where an orchestrator spawns subagents for specific tasks are a form of context engineering. Each subagent gets a scoped context tailored to its specific job rather than the full accumulated state of a long-running session. The orchestrator holds global state. The subagent sees only what it needs. This isn't just an efficiency optimization — it's how you prevent context contamination between tasks in the same session.
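The scoping step itself is small. A sketch of how an orchestrator might build a subagent's context from its own global state; the function and key names are hypothetical:

```python
# Sketch of context isolation: the orchestrator holds global state but hands
# each subagent only the slice relevant to its task. Names are illustrative.

def scoped_context(global_state: dict, task: str, needed_keys: list[str]) -> dict:
    """Build a subagent context from the minimum sufficient slice of state."""
    return {
        "task": task,
        # Only the requested keys cross the boundary; everything else stays
        # with the orchestrator, so tasks can't contaminate each other.
        "state": {k: global_state[k] for k in needed_keys if k in global_state},
    }
```

The discipline is in choosing `needed_keys` per task rather than defaulting to passing everything through.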

The three context engineering failure modes

Design-time context engineering prevents one category of failures. A different category shows up at runtime, in agents that run for minutes or hours rather than single-turn calls. Three failure patterns show up consistently enough to be worth naming.

Context rot is the slow drift. The model starts with a clear task. Tool outputs come in. Previous responses accumulate. Each new piece of information is individually relevant but collectively they dilute the original signal. Early in the session the model knows exactly what to do. An hour later, it's working on a subtly different version of the task — still coherent, still plausible, but no longer quite what you asked for. Context rot doesn't produce an error. It produces a result that's wrong in ways that take real effort to diagnose, because the output looks reasonable on the surface.

The defense: re-anchor the task. At key checkpoints in a long-running agent loop, re-inject the core objective. Not as a conversational reminder — as a structural element that resets the hierarchy. The task statement goes back at the top, before the accumulated context, where it carries system-level weight again.

Context distraction is the attention problem. Too much relevant-adjacent information competes with the actually-relevant information for the model's attention. I built an agent that needed to modify a specific function in a large codebase. My first version gave it the entire file plus the surrounding module for context. It kept touching the wrong places — not because it misunderstood the task, but because adjacent code was drawing its attention toward related concerns. When I scoped the context to the specific function and its direct dependencies only, the precision improved immediately. Same model. Same instructions. The narrower context produced more targeted behavior.

Context poisoning is the most dangerous failure mode. A hallucination, an incorrect tool output, or a bad intermediate result makes it into the accumulated context and gets treated as ground truth by subsequent steps. The model trusts its own prior outputs. If step three produced something wrong, and that wrong output is now sitting in context as an established fact, steps four through twenty will build on the error. Each subsequent step makes it harder to trace back to where things went wrong, and the compounding makes the final output confidently incorrect in ways that are expensive to unwind.

The defense against context poisoning is verification checkpoints — moments in the agent loop where you validate intermediate results against an external source rather than trusting the context state. This is part of why human-in-the-loop patterns for long-running agent sessions exist: not because you distrust the model, but because you don't want a single wrong intermediate result to propagate through fifty subsequent steps.
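A verification checkpoint can be as simple as a gate that an intermediate result must pass before it enters the accumulated context. A minimal sketch, where the checker is any external validation (a test run, a schema check, a human approval); all names are illustrative:

```python
# Sketch of a verification checkpoint: validate an intermediate result against
# an external check before it becomes "established fact" in context.
# `checker` stands in for any external validator; names are illustrative.

def checkpoint(result: str, checker) -> str:
    """Gate a result on an external check instead of trusting context state."""
    if checker(result):
        return result
    # Fail loudly rather than silently admitting a poisoned result,
    # so the loop can retry, escalate, or ask a human.
    raise ValueError(f"verification failed: {result!r}")
```

The important property is that the check consults something outside the context window, so a wrong step-three output gets caught at step three instead of at step twenty.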

CLAUDE.md as a genre of context engineering

The CLAUDE.md file — or its equivalent in any agent framework — is design-time context engineering's most important artifact. Most people write one badly because they treat it as a brain dump.

The failure mode: include everything that might possibly be relevant. Project history, team preferences, stylistic opinions, the full directory structure, a glossary, architectural decisions from two years ago. The file grows until it contains everything and signals nothing. The model reads it and extracts roughly as much signal as it would from a poorly organized wiki — which is to say, not much.

The craft version treats the CLAUDE.md as a genre with specific constraints.

Write for the model's actual questions. A model opening a new session in your repository needs to answer four things: what is this project, what are the relevant commands, what conventions matter, what should I never do. Those four questions should be explicitly answerable from the first screen of your CLAUDE.md. Everything else is secondary. If you can't answer those questions from the opening section, the file is doing too little work.
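A first screen that answers those four questions might look like the following skeleton. The project, commands, and rules here are entirely illustrative, not a prescribed format:

```markdown
# CLAUDE.md

## What this project is
Payment reconciliation service. Python 3.12, FastAPI, Postgres.

## Commands
- `make test` — run the test suite
- `make lint` — ruff + mypy
- `make dev` — local server on :8000

## Conventions
- Money amounts are integer cents, never floats.
- New endpoints need a test in `tests/api/` before merge.

## Never do
- Never commit directly to `main`.
- Never edit migration files that have already shipped.
```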

Structure over completeness. Hierarchical structure — clear headers, explicit sections, labeled constraints — tells the model what's important and how things relate to each other. A flat list of facts is harder to weight. The model attends better to organized information because the structure is itself a signal about priority. The same content in a flat list versus structured sections produces meaningfully different behavior.

Keep it current. A CLAUDE.md that was accurate six months ago and hasn't been updated since is actively harmful — it fills the context window with stale information the model will treat as current fact. Stale context is worse than no context, because missing context produces a question, while stale context produces a confident wrong answer. Treat it like documentation: it has to stay in sync with the actual state of the project.

Tool descriptions are context engineering

This is the most underwritten surface in the problem. Every tool you give an agent comes with a description. That description is context. It shapes when the model reaches for the tool, how it uses it, and what it expects back.

Compare two descriptions for a codebase search tool:

Version A: "Searches the codebase."

Version B: "Searches file contents using ripgrep regex patterns. Use for finding function definitions, tracing variable usage, or locating specific strings across the project. Returns matching file paths and line numbers. Prefer this over reading entire files when looking for specific code locations."

The model's tool selection behavior differs between these. Version A gets used hesitantly, often when the model isn't sure whether to search or read directly. Version B gets used precisely — the model knows what this tool is for, when to prefer it over alternatives, and what the output looks like. The decision about whether to reach for this tool becomes answerable from the description alone.

When I write tool descriptions for agent projects, I include three elements: what the tool does mechanically, when to prefer it over alternatives, and what the output format looks like. That structure gives the model everything it needs to make a good selection decision. Missing any of the three produces worse selection behavior — not dramatically, but consistently, and that consistency compounds across hundreds of tool calls.
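Version B, expressed as a tool definition, looks something like this. The schema shape follows common function-calling conventions but isn't tied to any specific API:

```python
# Sketch of a tool definition whose description carries all three elements:
# mechanics, when to prefer it, and output shape. The schema layout follows
# common function-calling conventions; field names are illustrative.

search_tool = {
    "name": "search_code",
    "description": (
        "Searches file contents using ripgrep regex patterns. "   # mechanics
        "Use for finding function definitions, tracing variable usage, "
        "or locating specific strings; prefer this over reading entire "
        "files when looking for specific code locations. "        # when to prefer
        "Returns matching file paths and line numbers."           # output format
    ),
    "input_schema": {
        "type": "object",
        "properties": {"pattern": {"type": "string", "description": "Regex pattern"}},
        "required": ["pattern"],
    },
}
```

The description field is where the selection behavior lives; the schema only constrains the arguments once the model has already decided to call.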

The same engineering principles that apply everywhere in context engineering apply here: clarity, hierarchy, signal density, no noise.

Runtime context management

For single-turn interactions, design-time context engineering is usually enough. For agents running extended sessions — research tasks, long coding jobs, multi-step workflows — you need to manage context actively at runtime, not just configure it at the start.

Rolling context windows. Rather than letting full conversation history accumulate unbounded, maintain a rolling window of the N most recent turns plus the original task statement pinned at the top. Turns beyond the window get compressed into a summary that preserves decisions and outcomes but discards the intermediate reasoning that led there. The model stays oriented to recent state without carrying the full history of a session that started an hour ago.
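The mechanics fit in a few lines. A sketch assuming a model-backed summarizer is available; the summarizer here is a stand-in, and all names are illustrative:

```python
# Sketch of a rolling context window: pin the task at the top, keep the N most
# recent turns verbatim, fold older turns into a running summary.
# `summarize` stands in for a model-backed summarizer; names are illustrative.

def build_context(task: str, turns: list[str], window: int, summarize) -> list[str]:
    recent = turns[-window:]
    older = turns[:-window] if len(turns) > window else []
    context = [f"TASK: {task}"]  # pinned at the top, every time
    if older:
        # Decisions and outcomes survive; intermediate reasoning is discarded
        context.append(f"SUMMARY: {summarize(older)}")
    context.extend(recent)
    return context
```

In practice the summary would be regenerated incrementally rather than recomputed from scratch each turn, but the invariant is the same: task first, summary second, recent turns last.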

Structured state objects. Instead of letting state accumulate implicitly in conversational history, maintain an explicit state object that gets updated and injected into context at each step. The state captures: current objective, completed steps, pending steps, decisions made, constraints encountered. This gives the model a dense, structured source of truth rather than requiring it to reconstruct what's happened by reading back through conversational turns.
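A minimal sketch of such a state object, with the five fields from above; the field names and rendering format are illustrative:

```python
# Sketch of an explicit state object injected into context at each step,
# instead of letting state live implicitly in conversation history.
# Field names and the rendered format are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class AgentState:
    objective: str
    completed: list[str] = field(default_factory=list)
    pending: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Dense, structured block to inject into context each step."""
        return "\n".join([
            f"OBJECTIVE: {self.objective}",
            f"DONE: {'; '.join(self.completed) or 'none'}",
            f"PENDING: {'; '.join(self.pending) or 'none'}",
            f"DECISIONS: {'; '.join(self.decisions) or 'none'}",
            f"CONSTRAINTS: {'; '.join(self.constraints) or 'none'}",
        ])
```

The rendered block replaces pages of conversational back-and-forth with a few lines the model can weight directly.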

Task re-anchoring. At defined checkpoints — after every N tool calls, or at the completion of each major sub-task — re-inject the core objective at the top of the context. This counteracts context rot before it compounds. The original task statement, appearing again at the system level, resets the hierarchy and reorients the model to the actual goal rather than the accumulated detail of how it's been pursuing it.
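The checkpoint logic is a small wrapper around the context list. A sketch, assuming context is a list of strings with the task rendered as a pinned line; all names are illustrative:

```python
# Sketch of checkpoint-based re-anchoring: every N tool calls, move the
# original task statement back to the top of context so it regains
# system-level weight. Names and the context format are illustrative.

def maybe_reanchor(context: list[str], task: str, call_count: int, every: int) -> list[str]:
    anchor = f"TASK: {task}"
    if call_count and call_count % every == 0:
        # Task goes back above the accumulated detail; any drifted copy is removed
        return [anchor] + [c for c in context if c != anchor]
    return context
```

Between checkpoints the context is left alone; the reset only happens at the defined interval, which keeps the intervention cheap and predictable.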

None of these are complex engineering. They're disciplined choices about what information the model carries forward and in what form — which is exactly what context engineering is at every other layer too.

The skill underneath the term

"Prompt engineering" was the first name for this. "Context engineering" is the current one. The vocabulary will keep evolving. What doesn't change is the underlying problem: a model's output is a function of its input, and structuring that input well — right information, right detail level, right hierarchy, right moment — is a skill that compounds.

The design-time half is learnable in an afternoon. A better CLAUDE.md, more precise tool descriptions, a system prompt that puts governing instructions where they carry weight. That work pays off immediately and keeps paying off because you only do it once per project.

The runtime half takes longer to develop. It requires building intuition for how context degrades in motion — how rot accumulates, where distraction kicks in, which intermediate results are likely to poison the state downstream. That intuition comes from running agents through real work and paying attention to where they go wrong.

The gap between practitioners who have that intuition and those who don't is measurable. Not in benchmarks — in whether the agents they build keep working reliably after the demo, or start degrading the moment the task gets complex. Context engineering is the skill that makes the difference. The name it goes by doesn't matter.
