2026-05-10 · LLM Evaluation · 9 min read

Why does my AI agent get worse without anyone changing it?

You shipped an agent. It worked. A few weeks pass. Now it’s subtly worse, and the team can’t agree on whether it’s actually worse or you’re imagining things. The vendor pushed a model update. Someone tweaked the system prompt. Your test set is the same six prompts you wrote on day one. Quality is everyone’s responsibility, which means it’s no one’s.

This is one of the most-asked questions in production AI right now: what causes silent quality regressions, and how do I catch them before users do?

What the literature actually says about model drift

The phenomenon has names and academic support. A few entry points worth reading:

- Chen, Zaharia & Zou, “How Is ChatGPT’s Behavior Changing over Time?” (2023): measured behavior shifts between the March and June 2023 snapshots of GPT-3.5 and GPT-4 on fixed task suites, with large swings on several tasks.
- Gama et al., “A Survey on Concept Drift Adaptation” (ACM Computing Surveys, 2014): the classical ML literature on the same failure mode in deployed systems.

The combined picture: even if you change nothing, the underlying model can shift, and even small shifts, concentrated on the prompts weighted toward your edge cases, can produce visible quality regressions.

The five things that actually cause regressions

From running this in production, the failure modes fall into a small number of buckets:

1. Vendor pushed a new model behind your alias. The most common cause when nothing on your side changed. Mitigation: pin to a versioned model identifier (a pinning sketch follows this list). Anthropic’s scheme is good. OpenAI’s is messier; you have to actively choose dated snapshots, not gpt-4o.

2. You upgraded. The conscious version. New model is better on the benchmarks, worse on the prompts your users actually send. This is the silent-regression special: the new model still returns 200, still produces fluent text, still passes the three test prompts you remember. The 14% of conversations where it’s worse are invisible until aggregated.

3. Prompt drift. Someone — you, a teammate, a junior engineer with prod access — edited the system prompt to fix one specific case. The fix worked for that case and broke three others. Without versioning, the change is invisible to everyone except the person who made it, and they forgot.

4. Context drift. Your prompt is the same. Your model is the same. But the variables you inject into it (file contents, database query results, MCP tool output) have changed shape. The schema added a column. The file got longer. The tool started returning verbose error messages. The model starts hallucinating because the context shape stopped matching its training.

5. Tool drift. Your agent calls an external API. The API’s response shape changes — usually adds a field, sometimes deprecates one. Your agent doesn’t crash, it just stops handling cases that involve that field. Hard to attribute because it looks like “the model got dumber.”
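To make the pinning mitigation from item 1 concrete, a minimal sketch. The model IDs below are real examples of the dated/versioned form at the time of writing, but check your vendor’s current list before copying:

```python
# Pin versioned identifiers, not floating aliases. An alias can silently
# start pointing at a new model; a dated snapshot cannot.
PINNED_MODELS = {
    "anthropic": "claude-3-5-sonnet-20241022",  # versioned snapshot
    "openai": "gpt-4o-2024-08-06",              # dated snapshot, not "gpt-4o"
}
```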

What to monitor

Five signals, in order of priority:

1. Eval score, sliced by configuration

Define a per-agent evaluator. Heuristic (regex / substring / response-length / tool-was-called) is cheap and deterministic. LLM-as-judge (a smaller, cheaper model rating 0–1 with reasoning) is more expensive and more flexible. Either way, you need a numeric score per dispatch.
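A minimal heuristic-evaluator sketch, assuming a simple Dispatch record; the specific checks, weights, and the expected tool name are placeholders for whatever your agent is supposed to do:

```python
from dataclasses import dataclass, field

@dataclass
class Dispatch:
    prompt: str
    response: str
    tool_calls: list[str] = field(default_factory=list)

def heuristic_eval(d: Dispatch) -> float:
    """Cheap, deterministic 0-1 score per dispatch."""
    score = 0.0
    if len(d.response) > 50:            # not a truncated stub
        score += 0.4
    if "as an AI" not in d.response:    # didn't deflect
        score += 0.3
    if "search_docs" in d.tool_calls:   # the expected tool was called
        score += 0.3
    return score
```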

Then slice the score by configuration: “eval score on Sonnet 4.6 across the last 1000 traces” vs “eval score on Sonnet 4.7 across the same window after the upgrade.” A drop of more than 15 percentage points is a regression. Smaller is noise; larger is alarming.
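The slicing itself can be a few lines over the trace store. A sketch, assuming each trace is a dict carrying model and eval_score:

```python
import statistics
from collections import defaultdict

def eval_score_by_model(traces, window=1000):
    """Mean eval score per model version over the last `window` traces."""
    buckets = defaultdict(list)
    for t in traces[-window:]:
        buckets[t["model"]].append(t["eval_score"])
    return {model: statistics.mean(s) for model, s in buckets.items()}

# before = eval_score_by_model(pre_upgrade_traces)["sonnet-4.6"]
# after  = eval_score_by_model(post_upgrade_traces)["sonnet-4.7"]
# if before - after > 0.15:  # >15pp drop: treat as a regression
#     alert()
```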

2. Configuration changes as a first-class event

Every model swap, system-prompt rewrite, or role-model mapping tweak should generate a logged event: agent slug, field changed, old value, new value, who, when. Append-only.

When eval score drops, the question becomes “what config events happened in the last 24 hours, and which one preceded the drop?” Without the ledger, you’re guessing.
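A minimal append-only ledger sketch; JSONL on disk here, though the same shape works as a database table:

```python
import json
import time

LEDGER = "config_ledger.jsonl"  # append-only; never rewritten

def log_config_event(agent_slug, field, old, new, who):
    """Record one config change as a JSON line."""
    event = {
        "ts": time.time(),
        "agent": agent_slug,
        "field": field,   # e.g. "model" or "system_prompt"
        "old": old,
        "new": new,
        "who": who,
    }
    with open(LEDGER, "a") as f:
        f.write(json.dumps(event) + "\n")

def events_in_last_24h():
    """The question to ask when the eval score drops."""
    cutoff = time.time() - 24 * 3600
    with open(LEDGER) as f:
        return [e for e in map(json.loads, f) if e["ts"] >= cutoff]
```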

3. Failing examples, not just aggregate stats

“Eval score dropped 17pp” is a number. It’s not actionable. The list of post-change traces where the agent failed (prompt, response, error, timestamp) is. Read three of those and you usually understand the regression in five minutes.
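A sketch of pulling those failing traces, assuming the same trace-dict shape as above plus ts and error fields:

```python
def failing_examples(traces, since_ts, threshold=0.5, limit=3):
    """Post-change traces where the agent failed, newest first."""
    failed = [
        t for t in traces
        if t["ts"] >= since_ts and t["eval_score"] < threshold
    ]
    failed.sort(key=lambda t: t["ts"], reverse=True)
    return [
        {k: t.get(k) for k in ("prompt", "response", "error", "ts")}
        for t in failed[:limit]
    ]
```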

4. Cost-per-call, paired with quality

Cost without quality context is misleading. Quality without cost context is misleading. Track cost-per-OK-call (cost divided by successful dispatches). A model that’s 2× more expensive but answers correctly twice as often might be the same dollar-per-good-answer.
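The arithmetic, with illustrative numbers for the 2× claim:

```python
def cost_per_ok_call(total_cost_usd, ok_dispatches):
    """Dollars per successful dispatch; infinite if nothing succeeded."""
    return total_cost_usd / ok_dispatches if ok_dispatches else float("inf")

# Model A: $0.01/call, 40% correct -> 0.01 / 0.40 = $0.025 per good answer
# Model B: $0.02/call, 80% correct -> 0.02 / 0.80 = $0.025 per good answer
# Twice the price, twice the accuracy, same dollars per good answer.
```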

5. Latency p95

Average latency lies. The user who waited eight seconds remembers; the user who waited 1.5 seconds doesn’t. p95 catches the bad days.
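A nearest-rank p95, so the tail can’t be averaged away:

```python
import math

def p95(latencies_ms):
    """Nearest-rank 95th percentile of a list of latencies."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest-rank
    return ordered[rank - 1]
```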

How people actually do this today

Three ways, increasing in automation:

The manual way. A spreadsheet. Each row is a prompt from production. Columns are model versions. You paste the response from each model into each cell, then read them all. Works for small teams with twenty test cases. Doesn’t scale, doesn’t catch regressions in real time, but it’s honest and beats nothing.

The custom-pipeline way. Most production teams build something in-house: a script that reads the trace store, runs an LLM-as-judge against samples, drops scores into a dashboard. The pipeline is usually held together with shell scripts, the dashboard is usually three Grafana panels, the “regression detector” is “an engineer who looks at the dashboard on Mondays.” This works until the engineer changes jobs.

Off-the-shelf observability + eval platforms. The ecosystem in 2026 splits cleanly into two groups, and most production teams end up running one from each.

Production-stack regression detection (SDK / proxy logging every user interaction, evals running over the production trace store): Langfuse, Phoenix, and their peers.

Developer-workflow regression detection (config-change-driven, multi-runtime, focused on what you dispatched while building): ATO.

The two groups answer different questions. The SDK / proxy tools tell you “something got worse for end users yesterday” based on production traffic. ATO tells you “quality changed after this specific config edit” and lets you replay the failing examples against alternative runtimes before pushing the fix. Teams running real production agents typically run both: Langfuse or Phoenix on the production stack, ATO on the developer side. The views are complementary: the SDK tools catch regressions that only appear in real user traffic; ATO catches them before they ship.

The pattern all of them share: trace every dispatch, define evaluators, run them against the trace store, surface deltas. Differences are mostly vendor lock-in, self-hosting story, and how nicely they handle multi-runtime workflows.

The thresholds to start with, regardless of tool: success-rate drop ≥10pp, eval-score drop ≥15pp, p95-latency increase ≥50%, cost-per-call increase ≥25%. Inside those bands is noise; beyond them is signal.
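As a starting point, those thresholds as code; the metric names are assumptions, map them to whatever your trace store actually exposes:

```python
def check_regressions(before, after):
    """Compare two metric snapshots (dicts) against the starting thresholds."""
    alerts = []
    if before["success_rate"] - after["success_rate"] >= 0.10:
        alerts.append("success rate dropped >=10pp")
    if before["eval_score"] - after["eval_score"] >= 0.15:
        alerts.append("eval score dropped >=15pp")
    if after["p95_ms"] >= 1.5 * before["p95_ms"]:
        alerts.append("p95 latency up >=50%")
    if after["cost_per_call"] >= 1.25 * before["cost_per_call"]:
        alerts.append("cost per call up >=25%")
    return alerts
```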


— Beatriz Nigri