Essays
On multi-LLM development, AI agent observability, regression detection, cost optimization, and context engineering. Mostly written while debugging production agents and noticing patterns. By the team building ATO.
-
I rebuilt Karpathy’s LLM Council with tool calls and an audit trail
Karpathy shipped llm-council last November, watched it hit 18.7k stars, then explicitly walked away (“not going to support it”). This rebuild keeps the same primitive but changes the shape: multi-provider auth (not OpenRouter-locked), tool calls so the LLMs can verify claims against your repo, persistent specialist agents, and an audit trail that lands as markdown in your PR.
-
Why is multi-LLM development still so painful in 2026?
Different models genuinely lead at different tasks. The benchmark data shows it clearly. So why do most teams still pick one and use it for everything? A look at the SWE-bench / MMLU / HumanEval differences, the tooling gap, and what real multi-LLM workflows look like.
-
How can we actually tell whether a new LLM is better?
Vendors publish benchmark scores. Those scores tell you something, but not everything you need. A walk through MMLU, HumanEval, GSM8K, the intuition you build while developing, and what to do once you have real users in front of your agent.
-
Why does my AI agent get worse without anyone changing it?
Models change. Prompts change. Context changes. Quality drifts and you find out from a user complaint two weeks later. The Stanford temporal-drift study, the five causes of silent regressions, what to actually monitor, and the thresholds that matter.
-
What’s missing when traditional monitoring meets AI agents?
Traditional APM tells you the API succeeded. AI agent observability has to tell you whether the answer was actually right, which agent touched which file, and how to stop the one that’s stuck. The five questions APM doesn’t answer.
-
Why does my AI agent bill keep growing?
LLM costs feel arbitrary because most teams don’t actually instrument them. A walk through the published per-million-token pricing, why bills grow faster than usage, and what to monitor before the CFO asks.
-
What separates production AI agents from demos?
A demo agent has one prompt. A production agent has variables, context hooks, memory policies, evaluators, and per-task models. The shift from prompt engineering to context engineering, with the literature behind it (Lost in the Middle, Anthropic’s Building Effective Agents).
-
What we shipped: ATO becomes the ops layer for multi-runtime agents
Twelve releases in two days. Replay any prompt against a different runtime, catch regressions before users do, cut cost without losing quality. The v2.1 release notes for ATO.