Blog

Essays

On multi-LLM development, AI agent observability, regression detection, cost optimization, and context engineering. Mostly written while debugging production agents and noticing patterns. By the team building ATO.

2026-05-13 · Multi-LLM · 8 min read

I rebuilt Karpathy’s LLM Council with tool calls and an audit trail

Karpathy shipped llm-council last November, saw it reach 18.7k stars, then explicitly walked away (“not going to support it”). My rebuild keeps the same primitive in a different shape: multi-provider auth (not OpenRouter-locked), tool calls so the LLMs verify claims in your repo, persistent specialist agents, and an audit trail that lands as markdown in your PR.

2026-05-10 · Multi-LLM · 9 min read

Why is multi-LLM development still so painful in 2026?

Different models genuinely lead at different tasks. The benchmark data shows it clearly. So why do most teams still pick one and use it for everything? A look at the SWE-bench / MMLU / HumanEval differences, the tooling gap, and what real multi-LLM workflows look like.

2026-05-10 · LLM Evaluation · 9 min read

How can we actually understand whether a new LLM is better?

Vendors publish benchmark scores. Those scores tell you something, but not everything you need. A walk through MMLU, HumanEval, GSM8K, your own intuition while developing, and what to do once you have real users in front of your agent.

2026-05-10 · LLM Evaluation · 9 min read

Why does my AI agent get worse without anyone changing it?

Models change. Prompts change. Context changes. Quality drifts and you find out from a user complaint two weeks later. The Stanford temporal-drift study, the five causes of silent regressions, what to actually monitor, and the thresholds that matter.

2026-05-10 · AI Engineering · 8 min read

What’s missing when traditional monitoring meets AI agents

Traditional APM tells you the API succeeded. AI agent observability has to tell you whether the answer was actually right, which agent touched which file, and how to stop the one that’s stuck. The five questions APM doesn’t answer.

2026-05-10 · LLM Cost · 8 min read

Why does my AI agent bill keep growing?

LLM costs feel arbitrary because most teams don’t actually instrument them. A walk through the published per-million pricing, why bills grow faster than usage, and what to monitor before the CFO asks.

2026-05-10 · AI Architecture · 9 min read

What separates production AI agents from demos?

A demo agent has one prompt. A production agent has variables, context hooks, memory policies, evaluators, and per-task models. The shift from prompt engineering to context engineering, with the literature behind it (Lost in the Middle, Anthropic’s Building Effective Agents).

2026-05-10 · Product / Release · 6 min read

What we shipped: ATO becomes the ops layer for multi-runtime agents

Twelve releases in two days. Replay any prompt against a different runtime, catch regressions before users do, cut cost without losing quality. The v2.1 release notes for ATO.
