2026-05-10 · AI Engineering · 8 min read

What’s missing when traditional monitoring meets AI agents

Traditional APM was built around a question with a clear answer: did the request succeed? 200, 500, latency, error rate. For ten years of production engineering, the taxonomy was “up” vs “down.”

AI agents broke that taxonomy. An agent can return 200, in 800ms, with cleanly structured JSON, and still produce an answer that’s confidently, fluently wrong. Datadog has nothing to say about that. Neither does Sentry. Neither does any of the dashboards your platform team built last year.

What you need is a new layer of observability that doesn’t exist in most production systems yet, because the questions are different.

The questions agent observability has to answer

Five questions traditional APM doesn’t cover:

  1. Was the answer actually correct? Not “did the model respond” — was it right?
  2. What did the agent touch? Files, databases, external APIs — the side effects, not just the input/output.
  3. Which agent did what, when two are running in the same workspace?
  4. What’s in flight right now, and how do I stop it?
  5. What changed about this agent’s configuration recently?

Each one has a different shape from “is the API up.”

1. Was the answer correct?

The hardest question, because it requires evaluating the response — not just observing that it happened.

Two viable approaches:

Heuristic evaluators. Substring match, regex, response-length-within-range, “tool was called when expected.” Cheap, deterministic, narrow. Good for known-shape responses (“the answer must contain a tracking number”). Bad for open-ended answers.
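
A sketch of what heuristic evaluators look like in practice; the specific checks (a tracking-number regex, a length band) are illustrative, not from any particular library:

```python
import re

# Each check takes a response string and returns a 0/1 score.
# The rules here are examples; real ones depend on your response contract.

def contains_tracking_number(response: str) -> float:
    """Pass if the response contains something shaped like a tracking number."""
    return 1.0 if re.search(r"\b[A-Z0-9]{10,20}\b", response) else 0.0

def length_within_range(response: str, lo: int = 20, hi: int = 2000) -> float:
    """Pass if the response length falls inside the expected band."""
    return 1.0 if lo <= len(response) <= hi else 0.0

def heuristic_score(response: str) -> float:
    """Aggregate: the fraction of checks that passed, as a per-trace score."""
    checks = [contains_tracking_number(response), length_within_range(response)]
    return sum(checks) / len(checks)
```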

LLM-as-judge. A smaller model reads the prompt and the response, scores 0–1 with reasoning. More expensive, more flexible. Run it as a scheduled batch over recent traces, not synchronously on every dispatch — cost gets out of hand fast otherwise.
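
A minimal sketch of that batch pass, using the OpenAI Python SDK as an example judge client. The model name, the rubric wording, and the assumption that traces arrive as dicts with prompt/response fields are all placeholders, not any specific product’s API:

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {prompt}
Answer: {response}
Return JSON: {{"score": <0.0 to 1.0>, "reasoning": "<one sentence>"}}"""

def judge_trace(prompt: str, response: str) -> dict:
    """Score one trace with a small judge model. Model name is illustrative."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(prompt=prompt, response=response)}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)

def score_batch(traces: list[dict]) -> None:
    """Scheduled batch over recent traces -- off the dispatch path, so judge
    cost is bounded by how often the batch runs, not by production traffic."""
    for trace in traces:
        verdict = judge_trace(trace["prompt"], trace["response"])
        trace["quality_score"] = verdict["score"]        # persist to the trace store
        trace["judge_reasoning"] = verdict["reasoning"]
```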

The output of either approach: a per-trace quality score that you can aggregate, slice, and alert on. Without this score, “agent quality” is a vibe.

2. What did the agent touch?

Agents have side effects. They write files. They call external APIs. They modify databases. The trace of “here’s the prompt and here’s the response” is only half the picture.

Two approaches that work:

Filesystem mtime diff. Snapshot the project tree before and after the dispatch. The diff is the list of files touched. Works across every runtime because it’s OS-level, not stream-parsing. Doesn’t require the runtime to cooperate or emit structured events.
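
A minimal sketch of the snapshot-and-diff; the `run_dispatch` placeholder in the usage comment stands in for whatever launches your agent:

```python
import os

def snapshot(root: str) -> dict[str, float]:
    """Map every file under root to its mtime."""
    mtimes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtimes[path] = os.stat(path).st_mtime
            except OSError:
                pass  # file vanished between walk and stat
    return mtimes

def files_touched(before: dict[str, float], after: dict[str, float]) -> list[str]:
    """Files that are new or whose mtime changed, plus files that were deleted."""
    touched = [p for p, mtime in after.items() if before.get(p) != mtime]
    touched += [p for p in before if p not in after]
    return sorted(touched)

# Usage around a dispatch:
# before = snapshot(project_root)
# run_dispatch(...)   # your agent runtime, whatever it is
# touched = files_touched(before, snapshot(project_root))
```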

Tool-call instrumentation. If your agent calls tools (MCP, function calling), the call list is a structured artifact. Capture it. The chain “prompt → reasoning → tool calls → final response” is the actual unit of agent behavior, not the prompt-and-response pair.

The combined view: per dispatch, the prompt, the response, the tools called (with arguments and results), and the files touched. That’s the trace shape AI agents need.
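
One way to pin that shape down as a data structure; the field names are illustrative, not any tool’s schema:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str        # e.g. "search_orders"
    arguments: dict  # what the agent passed in
    result: str      # what the tool returned

@dataclass
class AgentTrace:
    dispatch_id: str
    agent_slug: str
    prompt: str
    response: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    files_touched: list[str] = field(default_factory=list)
    quality_score: float | None = None  # filled in later by the eval batch
```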

3. Which agent did what, when two share a workspace?

This is the multi-agent problem. You’re running a writer agent and a reviewer agent in the same project. Both touch files. Git blame says “the user did it” because both agents commit as the user.

Honest attribution requires capturing per-dispatch file lists (per the previous section) and detecting when two dispatches overlap on the same files. When they overlap, the right answer isn’t to fake an attribution — it’s to flag the overlap as ambiguous and surface the peer agents that contributed.
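
A sketch of that overlap check, assuming each dispatch record carries the agent slug and file list from the previous section:

```python
from collections import defaultdict

def attribute_files(dispatches: list[dict]) -> dict[str, dict]:
    """Map each file to a single owning agent, or flag it as ambiguous.

    Each dispatch is a dict carrying "agent_slug" and "files_touched",
    matching the trace shape sketched earlier.
    """
    touched_by: dict[str, set] = defaultdict(set)
    for d in dispatches:
        for path in d["files_touched"]:
            touched_by[path].add(d["agent_slug"])

    attribution = {}
    for path, agents in touched_by.items():
        if len(agents) == 1:
            attribution[path] = {"agent": next(iter(agents)), "ambiguous": False}
        else:
            # Don't fake a winner: surface every peer that touched the file.
            attribution[path] = {"agents": sorted(agents), "ambiguous": True}
    return attribution
```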

This is the “truth over false confidence” principle: a system that tells you it doesn’t know is more useful than one that fabricates an answer.

4. What’s in flight, and how do I stop it?

Agents can hang. Models can take forever. A bad prompt can cause the model to recurse on tool calls. You need a live registry: every active dispatch with agent slug, runtime, workspace, elapsed time, and a kill button.

This is the “ops layer” pattern: the analogue of kubectl get pods for agent dispatches. Without it, your only recourse against a runaway agent is to hunt down the right terminal buffer or kill the whole process tree, both of which are slow and error-prone.
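
A minimal in-process sketch of that registry; a real one would live behind an API so a dashboard can read it, and every name here is a placeholder:

```python
import os
import signal
import time
from dataclasses import dataclass

@dataclass
class ActiveDispatch:
    dispatch_id: str
    agent_slug: str
    runtime: str      # which agent runtime launched it
    workspace: str
    pid: int          # process to signal if the dispatch needs killing
    started_at: float

REGISTRY: dict[str, ActiveDispatch] = {}

def list_active() -> list[tuple[str, str, float]]:
    """The kubectl-get-pods view: dispatch id, agent slug, elapsed seconds."""
    now = time.time()
    return [(d.dispatch_id, d.agent_slug, now - d.started_at)
            for d in REGISTRY.values()]

def kill_dispatch(dispatch_id: str) -> None:
    """The kill button: signal the dispatch's process and drop the entry."""
    d = REGISTRY.pop(dispatch_id)
    os.kill(d.pid, signal.SIGTERM)
```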

The bar for production: “is anything stuck?” should be a one-click answer, not a forensic investigation.

5. What changed recently?

When quality drops, the first question is “what changed?” In code, git answers this in seconds. In agent configurations, most teams have no answer because the configuration isn’t versioned.

Every model swap, every system-prompt edit, every role-models tweak, every memory-policy adjustment should generate a logged event — agent slug, field, old value, new value, who, when. Append-only.
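
A sketch of that event log as append-only JSONL; the schema is exactly the fields above, and the file path is up to you:

```python
import json
import time

def log_config_change(log_path: str, agent_slug: str, field: str,
                      old_value, new_value, who: str) -> None:
    """Append one immutable event per configuration edit."""
    event = {
        "ts": time.time(),
        "agent": agent_slug,
        "field": field,   # e.g. "model", "system_prompt", "memory_policy"
        "old": old_value,
        "new": new_value,
        "who": who,
    }
    with open(log_path, "a") as f:  # append-only by construction
        f.write(json.dumps(event) + "\n")
```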

Then quality regressions become joins: show me config changes in the last 7 days where eval score dropped >15pp afterward. The cause becomes a query, not a hunt.
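
That join, sketched in plain Python over the event log above and the per-trace scores from section 1; the thresholds and row shapes are assumptions:

```python
import time

def regressions(events: list[dict], scores: list[dict],
                window_days: float = 7, drop_pp: float = 15) -> list[dict]:
    """Config changes in the window whose agent's mean eval score
    dropped more than drop_pp percentage points afterward.

    events: rows from the config log above.
    scores: per-trace rows like {"agent": ..., "ts": ..., "score": 0.0-1.0}.
    """
    cutoff = time.time() - window_days * 86400
    flagged = []
    for event in (e for e in events if e["ts"] >= cutoff):
        before = [s["score"] for s in scores
                  if s["agent"] == event["agent"] and s["ts"] < event["ts"]]
        after = [s["score"] for s in scores
                 if s["agent"] == event["agent"] and s["ts"] >= event["ts"]]
        if before and after:
            delta_pp = (sum(after) / len(after) - sum(before) / len(before)) * 100
            if delta_pp < -drop_pp:
                flagged.append({**event, "score_drop_pp": round(-delta_pp, 1)})
    return flagged
```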

The unifying shape: agent traces aren’t API traces

An API trace is “request, response, latency, status code.” An agent trace has to carry:

  - the prompt and the final response
  - every tool call in between, with arguments and results
  - the files and external systems the dispatch touched
  - a quality score from the eval pass
  - which agent ran it, and whether attribution was ambiguous
  - the configuration the agent was running at the time, and what changed recently

That’s a much richer object than an HTTP request log. Treating it as one is the root cause of most production AI debugging pain.

The teams that handle this well treat agent traces as the central object — everything else (live runs, regressions, replay, cost) is a query against the trace store. The teams that don’t, eventually rebuild their entire AI ops layer once they hit the second silent regression.

Tools that exist for this

The ecosystem in 2026 has reasonable starting points. They cluster into two camps and most production teams end up using one from each.

Production-stack observability (SDK-based, drop into your existing app): Langfuse and its peers.

Developer-workflow observability (what you dispatched while building, testing, comparing runtimes): ATO (agentictool.ai).

These two camps do different jobs. The SDK tools log what your user-facing agent did with each end user once it’s deployed. ATO covers the multi-runtime build/test loop — dispatches you ran, replays you triggered, regressions you caught after a prompt or model change. Teams running real production agents typically run both: Langfuse (or one of its peers) for the production conversation log, ATO for the developer side of the same agent. They’re complementary, not alternatives.


— Beatriz Nigri