What separates production AI agents from demos?
Demo agents are fluent. They sound smart in three-prompt walkthroughs. They get a starring role in pitch decks. And then they hit production traffic and fall apart in a way the team can’t debug, because there’s only one knob to turn: the system prompt.
The reason production agents survive and demo agents don’t isn’t cleverness in the prompt. It’s a different category of work. The industry has started calling it context engineering, and it’s the missing layer between “a prompt” and “an agent that handles real users.”
The literature behind it
Four threads worth following:
- “The Rise and Potential of Large Language Model Based Agents: A Survey” — Xi et al., 2023. The reference survey paper. Frames the agent as brain (LLM) + perception (context) + action (tools). Most production breakdowns are perception failures, not brain failures.
- “Lost in the Middle: How Language Models Use Long Contexts” — Liu et al., Stanford, 2023. Showed empirically that models pay less attention to information in the middle of long contexts than at the start or end. Practical consequence: where you put a fact in the prompt matters as much as whether you put it in.
- Anthropic’s “Building Effective Agents” — December 2024 engineering essay. Distinguishes workflows (deterministic chains) from agents (LLM-driven loops) and argues most production deployments should be the former. The simpler-than-it-sounds insight: most “agents” in production are actually workflows with one LLM call.
- The “context engineering” coinage — popularized across practitioner essays in 2024–2025 (Cognition’s “Don’t Build Multi-Agents”, Andrej Karpathy’s social posts, the LangChain team’s blog). The rough thesis: prompt engineering optimizes the static instructions; context engineering optimizes what the model sees on each turn.
The shift is from “a great prompt” to “the right information delivered at the right moment.”
The seven pieces of production agent context
From running this in practice, the production agent has roughly seven layers the demo agent doesn’t:
1. Dynamic variables
The system prompt has {user_name}, {project_root}, {recent_orders}, {today}. Each variable has a resolver that fires before each turn: pull from environment, read from file, query the database, call an MCP tool. The model gets a fresh, context-specific prompt every dispatch.
This is the most common production-agent pattern. A static system prompt that says “you are a helpful assistant” can’t scale to 10,000 users; a templated prompt with per-user variables can.
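A minimal sketch of the resolver pattern, assuming a hypothetical `RESOLVERS` table mapping each template variable to a callable that fires before every turn:

```python
from datetime import date

# Hypothetical resolvers: each fires fresh on every dispatch, so the
# model never sees stale values. Real ones would hit env vars, files,
# a database, or an MCP tool.
RESOLVERS = {
    "user_name": lambda ctx: ctx["user"]["name"],
    "today": lambda ctx: date.today().isoformat(),
    "recent_orders": lambda ctx: ", ".join(ctx["orders"][-3:]),
}

def render_system_prompt(template: str, ctx: dict) -> str:
    """Resolve every {variable} in the template against its resolver."""
    values = {name: fn(ctx) for name, fn in RESOLVERS.items()}
    return template.format(**values)

prompt = render_system_prompt(
    "You help {user_name}. Today is {today}. Recent orders: {recent_orders}.",
    {"user": {"name": "Ada"}, "orders": ["A-1", "A-2", "A-3", "A-4"]},
)
```

The point is the indirection: the prompt text is static and diffable, while the values are re-resolved on every dispatch.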
2. Pre-call context hooks
Before each user message reaches the model, run a small set of hooks that inject relevant context. Read the project’s CHANGELOG.md, query the user’s open tickets, check whether they’re on a paid plan. The hooks output context that’s prepended to the user message in a <context> block.
This is what most teams reinvent as their “RAG layer.” RAG (retrieval-augmented generation) is one specific kind of context hook (vector search over a corpus). Context hooks generalize the pattern: pull anything, format anything, prepend.
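The hook pattern can be sketched as a list of plain callables whose outputs are joined into a `<context>` block; the two hooks here are hypothetical stand-ins for a file read and a database query:

```python
# Hypothetical pre-call hooks: each returns a labelled snippet of
# context relevant to the incoming message.
def changelog_hook(user_msg: str) -> str:
    return "CHANGELOG: v2.1 released"   # stand-in for reading CHANGELOG.md

def plan_hook(user_msg: str) -> str:
    return "PLAN: paid tier"            # stand-in for a billing-DB query

HOOKS = [changelog_hook, plan_hook]

def with_context(user_msg: str) -> str:
    """Prepend every hook's output to the user message in a <context> block."""
    snippets = [hook(user_msg) for hook in HOOKS]
    block = "<context>\n" + "\n".join(snippets) + "\n</context>"
    return f"{block}\n\n{user_msg}"
```

A vector-search hook slots into the same list, which is why RAG is one instance of this pattern rather than the pattern itself.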
3. Memory / conversation summarization
Long conversations exceed token budgets and degrade quality. Production agents define a memory policy: at message N, summarize the older portion of the thread, keep the last K turns verbatim. The summarizer is usually a smaller, cheaper model.
Without this, a 30-turn conversation is either truncated arbitrarily (losing context) or shoved into an absurdly large prompt (expensive and slow). The Stanford Lost in the Middle finding above is why this matters: even when the model technically has the tokens to read the whole history, it doesn’t pay attention evenly.
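The N/K policy above can be sketched directly; the `summarize` function here is a placeholder for the call to the smaller, cheaper model:

```python
SUMMARIZE_AFTER = 10   # N: trigger summarization past this many messages
KEEP_VERBATIM = 4      # K: recent turns the model sees word-for-word

def summarize(messages: list[str]) -> str:
    # Stand-in for a call to a small, cheap summarizer model.
    return f"[summary of {len(messages)} earlier messages]"

def apply_memory_policy(history: list[str]) -> list[str]:
    """Collapse the older portion of the thread; keep the last K verbatim."""
    if len(history) <= SUMMARIZE_AFTER:
        return history
    older, recent = history[:-KEEP_VERBATIM], history[-KEEP_VERBATIM:]
    return [summarize(older)] + recent
```

A 30-turn thread thus collapses to one summary message plus the last four turns, which also keeps the important material at the edges of the context rather than the middle.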
4. Per-task model selection
Inside one agent, different sub-tasks need different models. Router (cheap, fast). Summarizer (cheap, fast). Response (capable). Evaluator (intermediate). Treating “the model” as a single setting conflates these and either burns money on tasks that don’t need it or sacrifices quality on tasks that do.
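One way to express this is a slot table keyed by sub-task; the model names here are illustrative placeholders, not real model IDs:

```python
# Hypothetical per-task model slots: one setting per sub-task instead
# of a single global "the model" setting.
MODEL_SLOTS = {
    "router":     "small-fast",
    "summarizer": "small-fast",
    "response":   "large-capable",
    "evaluator":  "mid-tier",
}

def model_for(task: str) -> str:
    # Unknown tasks fall back to the capable response model.
    return MODEL_SLOTS.get(task, MODEL_SLOTS["response"])
```

The table makes the cost/quality tradeoff explicit and auditable instead of implicit in one global setting.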
5. Evaluators
An automated way to score “was this answer good?” Heuristic (substring contains, regex, response-length, tool-was-called) or LLM-as-judge (smaller model rates 0–1 with reasoning). Run on every dispatch or in a scheduled batch. Without this, “agent quality” is a vibe.
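A sketch of the heuristic side, assuming evaluators are predicates scored 0–1 and averaged; an LLM-as-judge evaluator would slot into the same list:

```python
# Hypothetical heuristic evaluators: each maps a response to a 0-1 score.
def contains(substr: str):
    return lambda resp: 1.0 if substr.lower() in resp.lower() else 0.0

def max_length(n: int):
    return lambda resp: 1.0 if len(resp) <= n else 0.0

EVALUATORS = [contains("refund"), max_length(500)]

def score(response: str) -> float:
    """Average all evaluator scores for one dispatch."""
    return sum(ev(response) for ev in EVALUATORS) / len(EVALUATORS)
```

Run on every dispatch, this turns “agent quality” into a number you can chart, alert on, and regress against.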
6. Specialization (multi-agent)
One mega-agent with thirty tools is brittle. Three specialist agents with eight tools each, plus a router that picks between them, is robust. The Anthropic essay above describes routing between focused specialists as a core pattern (the Cognition essay argues the other side of this tradeoff, which is worth reading before splitting).
The cost of specialization is more moving parts. The benefit is each specialist gets a focused system prompt, a focused tool set, and an evaluator tuned to its specific job. Quality is dramatically easier to reason about per-specialist than per-mega-agent.
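The routing layer can be sketched with a keyword matcher standing in for the cheap LLM router call; the specialist names and keywords are hypothetical:

```python
# Hypothetical specialist registry: in production the router would be
# a small, fast model call rather than keyword matching.
SPECIALISTS = {
    "billing": ["invoice", "refund", "charge"],
    "support": ["error", "crash", "bug"],
    "sales":   ["pricing", "plan", "upgrade"],
}

def route(message: str) -> str:
    """Pick the specialist whose keywords match; default to support."""
    msg = message.lower()
    for agent, keywords in SPECIALISTS.items():
        if any(k in msg for k in keywords):
            return agent
    return "support"
```

Each name on the left then gets its own system prompt, tool set, and evaluator, which is what makes per-specialist quality tractable.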
7. Observability
Every layer above produces emergent behavior; you cannot reason about an agent in production without traces. The trace has to capture the resolved variables, the hooks fired, the routing decision, the model called, the tools invoked, the response, the evaluator score. That’s the unit of debugging.
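The trace fields listed above can be captured as one structured record per dispatch; this is a minimal sketch, with all field values hypothetical:

```python
import json
import time

def make_trace(**fields) -> str:
    """Serialize one dispatch's full context as a JSON trace record."""
    record = {"ts": time.time(), **fields}
    return json.dumps(record)

trace = make_trace(
    variables={"user_name": "Ada"},      # resolved dynamic variables
    hooks_fired=["changelog", "plan"],   # which pre-call hooks ran
    routed_to="billing",                 # routing decision
    model="large-capable",               # model actually called
    tools=["lookup_order"],              # tools invoked during the turn
    evaluator_score=0.9,                 # post-hoc quality score
)
```

One record like this per dispatch, shipped to whatever log store you already have, is the unit of debugging the section describes.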
How people build this today
Three patterns, in roughly increasing structure:
Hand-rolled in code. Most teams start here. The agent lives in a Python file. Variables are f-strings. Hooks are function calls. Memory is a list slice. It works for one agent. It becomes a maintenance burden when the team has six.
Framework-based. LangChain, LlamaIndex, Pydantic-AI, Agno, Letta, OpenAI Agents SDK. Each provides primitives for variables, tools, memory, and orchestration. Each has its own opinions and its own learning curve. Useful at the cost of vendor lock-in to that framework’s abstractions.
Configuration-driven. Agents defined declaratively (YAML or a UI), with each layer expressed separately: variables here, hooks there, memory policy in another section, role-models in another. The benefit is that non-engineers can edit prompts without touching code, and the agent definition becomes diffable infrastructure.
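A sketch of what a declarative definition might look like, expressed here as a Python structure with one section per layer (the section names and resolver syntax are hypothetical, not any particular tool's schema):

```python
# Hypothetical declarative agent definition: each layer lives in its
# own section, so the whole agent is diffable and reviewable.
AGENT_DEF = {
    "name": "support-agent",
    "variables": {"user_name": "env:USER_NAME", "today": "builtin:date"},
    "hooks": ["read_changelog", "query_open_tickets"],
    "memory": {"summarize_after": 10, "keep_verbatim": 4},
    "models": {"router": "small-fast", "response": "large-capable"},
    "evaluators": ["contains:refund", "max_length:500"],
}

REQUIRED = {"name", "variables", "hooks", "memory", "models", "evaluators"}

def validate(defn: dict) -> list[str]:
    """Return the missing sections, if any, before instantiating the agent."""
    return sorted(REQUIRED - defn.keys())
```

The validation step is what turns the definition into infrastructure: a broken agent config fails at load time, not in front of a user.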
Tools that lean into context engineering as a first-class concept:
- LangChain / LangGraph — the established framework. Strong on chain abstractions; opinionated.
- Pydantic-AI — type-safe agent framework with structured outputs.
- Letta (formerly MemGPT) — specifically about long-running agent memory.
- Cognition’s Devin / Cursor agents — closed source, but their public posts describe the architecture.
- OpenAI Agents SDK — vendor-specific, structured.
- ATO — free, open-source desktop GUI; variables, hooks, memory policies, per-task model slots, and evaluators are first-class tabs on every agent. Multi-runtime native.
The summary, plainly
The system prompt is one input to a production agent. The other inputs — resolved variables, hook outputs, summarized conversation history, the right model for the sub-task — are at least as important and far more likely to be the cause of production failures.
Teams that ship demos optimize the prompt. Teams that ship production-grade agents optimize the entire context the model sees on each turn. That’s the practical meaning of “context engineering,” and it’s the layer most current tooling treats as an afterthought.
— Beatriz Nigri