2026-05-10 · LLM Evaluation · 9 min read

How can we actually understand whether a new LLM is better?

Anthropic ships a new Claude. OpenAI ships a new GPT. Google ships a new Gemini. The release post has charts. The leaderboard updates. Twitter argues about whether it’s a real jump or a benchmark game.

If you’re shipping an agent on top of one of these models, the practical question is more specific: does the new version make my agent better or worse? Aggregate benchmarks won’t answer that, but they’re where most of the public conversation lives. Here’s how to read them, and what you actually have to do beyond them.

What vendors actually publish

When a frontier lab releases a new model, the release page typically includes scores on a roughly stable set of benchmarks. Most of them have been around long enough that the names (MMLU and HumanEval among them) get used as shorthand.

These numbers are real. They compare models on a fixed test set, with consistent prompting, and the differences track capability gains in a way you can’t fake forever.

But every benchmark has known limitations: contamination (the test set leaked into training), narrow scope (HumanEval is one specific kind of code), and prompt sensitivity (small changes to system prompts move scores meaningfully). A model that scores 92 on HumanEval and 88 on MMLU might still be worse for your specific agent than one that scores 89 and 91. The benchmarks measure capability classes; you measure the intersection of those classes with your prompt, your tools, and your users.

So benchmarks are necessary context, not a final answer.

The personal phase: you’ll feel the difference

While you’re developing an agent for your own use, evaluation is mostly intuitive. You ask the agent to do something. The output is good or it isn’t. You re-prompt or you don’t. Over a week of dogfooding, you build a felt sense of what the model does well and where it breaks.

This is genuinely informative. The team building the agent knows the use cases more deeply than any benchmark, and a sharp engineer running 30 prompts a day across two models will form an accurate gut judgment about which one is better for their workload.

The trap: this only scales to the prompts you happen to think of. The prompts you forget to test — weird inputs, sequential turns where context drifted, edge cases — are exactly where production agents fail. So “it feels good” is a strong signal for personal-use tools and a weak signal for anything else.

The user-facing phase: you have to test

Once your agent talks to people who aren’t you, the calculus changes.

Real users send prompts you didn’t plan for. A customer-support agent that handles “reset my password” will eventually get “my dog ate my password reset email,” and the model’s response to that prompt matters even though no one wrote a test for it. The long tail of real traffic is where silent regressions live, and you don’t see them by intuition because you’re not the one reading the responses anymore.

So you need explicit evaluation. The basic loop:

  1. Sample real production traces. The last 100 conversations, or the last 14 conversations that failed. Both are useful samples for different reasons — the failures show you the regression surface, the random sample shows you whether quality drifted on the median.
  2. Read them. Yes, by hand, at least the first time. Reading 30 of your agent’s actual responses is the single highest-leverage thing you can do for product quality. Every time you do it you find something the metrics didn’t catch.
  3. Define an evaluator. Once you know what “good” looks like, encode it (see the sketch after this list). Two flavors:
    • Heuristic — substring contains, regex matches, response length within range, expected tool was called. Cheap, deterministic, narrow.
    • LLM-as-judge — a smaller model reads the prompt and the response and scores it 0–1 with reasoning. More expensive, more flexible. The judge can be a different vendor than the agent — that’s good, it reduces self-grading bias.
  4. Run the evaluator on the before-set and the after-set. Same prompts, two models. The difference in average score is your real signal. If the two sides’ confidence intervals overlap, you don’t have enough samples yet.
  5. Drill into the deltas. Aggregate scores tell you something moved. The actionable artifact is the list of prompts where the new model did worse. Read those. Often the regression isn’t random — the new model is biased toward longer responses, or it stopped using a particular tool, or it’s misinterpreting a piece of system-prompt instruction.
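Steps 3 through 5 are small enough to fit in one script. Here is a minimal sketch in Python, with the assumptions flagged: the JSONL file names, the trace schema (prompt, response, tools_called, expected_tool), and the judge model ID are all illustrative, and the OpenAI SDK appears only as a convenient judge client; swap in whatever you already have.

```python
import json
import random
import statistics

from openai import OpenAI

client = OpenAI()  # judge client; any vendor SDK works here

def load_traces(path: str) -> list[dict]:
    # Step 1: each line is one trace, e.g.
    # {"prompt": "...", "response": "...", "tools_called": ["search"], "expected_tool": "search"}
    with open(path) as f:
        return [json.loads(line) for line in f]

# Step 3a: heuristic evaluator -- cheap, deterministic, narrow.
def heuristic_eval(trace: dict) -> float:
    response = trace["response"]
    checks = [
        50 <= len(response) <= 2000,  # not empty, not rambling
        trace.get("expected_tool") is None
        or trace["expected_tool"] in trace.get("tools_called", []),
    ]
    return sum(checks) / len(checks)

# Step 3b: LLM-as-judge -- a smaller model reads prompt + response and scores it 0-1.
JUDGE_PROMPT = (
    "You are grading an AI assistant's reply.\n\n"
    "User prompt:\n{prompt}\n\nAssistant reply:\n{response}\n\n"
    "Score the reply from 0 to 1 for correctness and helpfulness. "
    "Answer with only the number."
)

def judge_eval(trace: dict, judge_model: str = "gpt-4o-mini") -> float:  # placeholder model ID
    out = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            prompt=trace["prompt"], response=trace["response"])}],
    )
    try:
        return max(0.0, min(1.0, float(out.choices[0].message.content.strip())))
    except ValueError:
        return 0.0  # an unparseable judgement counts as a failure

# Step 4: score both sets; overlapping confidence intervals mean "collect more samples".
def bootstrap_ci(scores: list[float], n: int = 2000) -> tuple[float, float]:
    means = sorted(
        statistics.mean(random.choices(scores, k=len(scores))) for _ in range(n)
    )
    return means[int(0.025 * n)], means[int(0.975 * n)]

before = load_traces("traces_old_model.jsonl")  # same prompts replayed on both models
after = load_traces("traces_new_model.jsonl")
s_before = [judge_eval(t) for t in before]
s_after = [judge_eval(t) for t in after]

(lo_b, hi_b), (lo_a, hi_a) = bootstrap_ci(s_before), bootstrap_ci(s_after)
print(f"old: {statistics.mean(s_before):.2f}  [{lo_b:.2f}, {hi_b:.2f}]")
print(f"new: {statistics.mean(s_after):.2f}  [{lo_a:.2f}, {hi_a:.2f}]")
if lo_a <= hi_b and lo_b <= hi_a:
    print("intervals overlap: not enough samples to call it yet")

# Step 5: the actionable artifact -- every prompt where the new model did worse.
for trace, old_score, new_score in zip(before, s_before, s_after):
    if new_score < old_score:
        print("regressed:", trace["prompt"][:80])
```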

The whole loop should take an hour the first time and ten minutes every subsequent time. That’s the bar to make this a habit instead of a project.

Tools that exist for this

You don’t have to build it from scratch. The ecosystem in 2026 splits into two groups, and most teams shipping production agents end up running one from each.

Production-stack A/B testing (SDK / proxy logging real user traffic, evals running over the production trace store): Langfuse and Braintrust are the examples in this group.

Developer-workflow A/B testing (replay past prompts against alternative runtimes during build/test, before any production rollout): ATO is the example in this group.

The two groups answer different versions of the A/B question. The SDK / proxy tools tell you which prompt or model variant did better against real user traffic across a meaningful sample. ATO lets you replay any single past prompt against a different runtime before rolling anything out — the “would Codex have done better on this specific failed example?” question. Teams running real production agents typically use both: Langfuse or Braintrust for population-level production A/Bs, ATO for per-example multi-runtime exploration during development. The two are complementary; together you can test before you ship and measure after.
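Stripped of tools and agent scaffolding, the per-example version of that question is two SDK calls. A minimal sketch using the vendor Python SDKs; the model IDs are placeholders, the trace schema is illustrative, and a faithful replay of an agent run would also need the original tools and conversation state:

```python
from anthropic import Anthropic
from openai import OpenAI

# One failed example pulled from your trace store (schema is illustrative).
failed = {
    "system": "You are the support agent for ...",
    "prompt": "my dog ate my password reset email",
}

a = Anthropic().messages.create(
    model="claude-sonnet-4-5",  # placeholder model ID
    max_tokens=1024,
    system=failed["system"],
    messages=[{"role": "user", "content": failed["prompt"]}],
)

b = OpenAI().chat.completions.create(
    model="gpt-5",  # placeholder model ID
    messages=[
        {"role": "system", "content": failed["system"]},
        {"role": "user", "content": failed["prompt"]},
    ],
)

# Read both side by side -- the per-example question is answered by reading, not by a metric.
print("--- runtime A ---")
print(a.content[0].text)
print("--- runtime B ---")
print(b.choices[0].message.content)
```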

None of them solve the whole problem. The pattern they share: trace every dispatch, define evaluators, run them against the trace store, surface the deltas. That pattern is the right one. The differences are mostly about vendor lock-in, self-hosting, and how nicely they handle multi-runtime workflows where the agent uses Claude for one stage and Codex for another.
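Concretely, “trace every dispatch” means persisting a record roughly like this for every model call. A minimal sketch; the field names are illustrative, not any particular tool’s schema:

```python
from dataclasses import dataclass, field

@dataclass
class DispatchTrace:
    """One model call, captured with enough context to re-evaluate it later."""
    conversation_id: str
    timestamp: str                  # ISO 8601
    model: str                      # which model/version actually served the call
    system_prompt_version: str      # hash or tag, so a score change is attributable to prompt vs. model
    user_prompt: str
    response: str
    tools_called: list[str] = field(default_factory=list)
    latency_ms: int = 0
    eval_scores: dict[str, float] = field(default_factory=dict)  # filled in later by evaluators
```

Everything in the loop above runs off records like these, whichever tool ends up storing them.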

The summary, plainly

Vendor benchmarks tell you a model is broadly more capable. Your intuition while developing tells you it’s good for the work you do day-to-day. Neither tells you whether it’s better for the actual prompts your actual users send. For that, you have to instrument: capture traces, define evaluators, replay before-and-after on real samples, and read enough responses by hand to keep your evaluators honest.

This is more work than reading a release blog post. It’s also the difference between knowing your agent improved and hoping your agent improved. The teams that do it consistently catch silent regressions in days; the teams that don’t catch them when users complain.


— Beatriz Nigri