2026-05-10 · LLM Cost · 8 min read

Why does my AI agent bill keep growing?

Every team building on top of LLMs eventually has the same Slack message from their CFO: “Our AI bill went up 40% this month and I don’t understand why.” The honest answer is that nobody on the team understands either. The dashboards are usage counts. The bill is tokens times price. Tokens, prices, and usage all moved — you can’t cleanly attribute the growth to any one of them.

This is the cost-attribution problem in production AI. The information needed to answer it is technically present in your traces, but most teams don’t collect it and most platforms don’t aggregate it the way a finance team would.

What LLM pricing actually looks like

Frontier-model pricing is published per million tokens. As of early 2025 (these change every few months — check vendor pages for current numbers):

Anthropic Claude: Sonnet 4.x ~$3 input / $15 output per 1M tokens. Opus 4.x ~$15 / $75. Haiku ~$1 / $5.
OpenAI GPT: GPT-4o ~$2.50 / $10. GPT-4o-mini ~$0.15 / $0.60. GPT-4.1 ~$2 / $8.
Google Gemini: 2.0-flash ~$0.10 / $0.40. 1.5-pro ~$1.25 / $5. 1.5-flash ~$0.075 / $0.30.
Open weights / fast inference (Groq, Together, Fireworks): generally 5–20× cheaper than equivalent frontier models, depending on the model class.

Two things to notice: output tokens are 4–5× the price of input tokens across nearly every provider, and the spread between the cheapest and most-capable model in a single vendor’s lineup is 50–100×. That spread is where cost optimization actually lives.

Why the bill grows faster than usage

Five reasons, in roughly the order they bite teams:

1. Context windows grow without anyone noticing. You add a new system-prompt section. You add a new variable. You inject a longer file. Each turn now sends 3,000 more input tokens than last month. The user sends the same volume; you pay more.

2. Conversations grow longer. Multi-turn agents accumulate history. By turn 8 you might be re-sending 15,000 tokens of prior conversation as the prompt. The model is doing the same work; the input is much bigger.

3. Output tokens explode when prompts encourage them. A prompt that ends “explain in detail” or “walk me through your reasoning” can 3× the response length. Output is 4–5× more expensive than input. Subtle prompt edits become large bill changes.

4. Tool calls fan out invisibly. An agent calls a tool, gets a response, calls another tool. Each round trip is a full prompt with conversation history. A single user message can trigger five model calls under the hood.

5. You upgraded the model. The new release blog said “same price.” But the new model writes longer responses on average, or uses more reasoning tokens, or burns more context interpreting your system prompt. Same per-token price, more tokens per call.

None of these are obvious from a usage chart. Usage went up 30%. Bill went up 60%. The gap is the part nobody’s instrumenting.

What to actually monitor

Cost per call, not just total spend

Total spend tells you you have a problem. Cost-per-call tells you where it is. Track average cost per dispatch per agent per model. When that line goes up without traffic going up, you have prompt drift, context drift, or output-bloat — and you can find which one by reading three traces.

Cost per OK call, not just cost per call

An agent that costs $0.03 per call but only succeeds 60% of the time is really costing $0.05 per useful answer. An agent that costs $0.04 per call and succeeds 95% of the time is costing $0.042. The cheaper-per-call agent is more expensive in practice. Cost-per-OK is the metric that matters.

Input vs output token split

If output is 4× the cost of input, doubling output doubles your bill way faster than doubling input. Track input tokens and output tokens separately. When the output ratio jumps after a prompt change, that’s the change that’s expensive.

Per-task model selection

Inside a single agent, different sub-tasks have different cost profiles:

Routing classification — cheap and fast (Haiku, GPT-4o-mini, Gemini Flash)
Summarizing conversation history — cheap and fast
Final user-facing response — the capable model
LLM-as-judge evaluator — intermediate

Most teams use one model for all four because that’s the default. Splitting these into role-specific slots typically cuts bills 30–60% with no measurable quality drop on the parts that matter.

Per-conversation cost tail

p95 conversation cost is more useful than average. The 5% of conversations that turn into 30-message epics are eating disproportionate budget. Find them; fix the prompt loop or the conversation summarizer that should have compacted them.

How people optimize today

Three approaches, in increasing automation:

Spreadsheet attribution. Pull token counts from vendor invoices. Multiply by per-model rate. Slice by agent. Tedious, monthly, retrospective. Better than nothing.

Custom telemetry. Most production teams write a logging wrapper around their LLM calls that captures prompt tokens, response tokens, model, agent slug. Pipe to a dashboard. Now you have per-call cost. The wrapper grows organically; eventually you reinvent observability.

Off-the-shelf platforms with cost tracking. The ecosystem splits into two groups, and most teams end up running one from each.

Production-stack cost logging (SDK / proxy in front of your live app):

Helicone — logging proxy that attributes cost per request out of the box. Drop-in.
Langfuse — open-source observability with cost tracking; self-hostable.
OpenMeter / Vellum — usage metering platforms with LLM-aware aggregations.
Vendor consoles — OpenAI, Anthropic, and Google all have their own usage dashboards. Single-vendor, no agent-level slicing.

Developer-workflow cost visibility (what you dispatched while building, what the replay would cost on another runtime, where to swap models):

ATO — free, open-source desktop GUI; estimates cost per dispatch (token count × published per-million pricing for known models, with an “est.” badge), surfaces cost recommendations when you have multi-runtime data on the same agent. Multi-runtime native.

The two groups solve different halves of the cost-visibility problem. The SDK / proxy tools tell you how much your deployed agent is costing per user interaction in production. ATO tells you how much each dispatch costs during development, and how much the same prompt would have cost on a different runtime if you replayed it. Teams running real production agents typically run both: a Langfuse or Helicone in front of the production stack for end-user cost attribution, and ATO on the developer side for replay/model-swap exploration. They’re complementary; together you can see cost from prompt-design through to production traffic.

What they all share: cost becomes visible per call instead of summed at month-end. Visibility per call is the only way to catch the four-line prompt edit that doubled your output tokens.

The summary, plainly

LLM bills feel arbitrary because most teams measure them only at the end. They’re not arbitrary; they’re input-tokens times price plus output-tokens times price, multiplied by call volume. Each of those four factors moves independently. If you instrument them, the bill becomes legible. If you don’t, you’re going to keep being surprised every month.

The first cost optimization isn’t switching models. It’s instrumenting cost per call so you can see what’s actually happening. Once that’s in place, the optimizations are usually obvious: split per-task model slots, compact long conversation histories, tighten prompts that encourage long output, audit context-injection variables for size creep.

— Beatriz Nigri