Essays
On multi-LLM development, AI agent observability, regression detection, cost optimization, and context engineering. Mostly written while debugging production agents and noticing patterns. By the team building ATO.
-
We used ATO to test ATO — Part 8: the open-box model router
Benchmark the models YOU have keys for on pinned LiveCodeBench slices — code-executing grader, vendor-cited contamination cutoffs, Wilson CIs, reproducibility hashes — then route on receipts. The first real battle found three bugs in our own tooling (gpt-5 had never worked through ATO), ended in a three-way 83.3% tie the router refused to overclaim, and a cascade orchestration benchmark returned an honest negative with the roadmap inside it. Everything re-runnable; that’s the point.
-
v2.18 — Your war room, now in any browser, tethered to your laptop
Read-only Team Workspaces on the web. Browser ↔ Desktop tether with X25519 + AEAD — the cloud relay never sees plaintext. Create teams + invite teammates from the cloud.
ato war-rooms sweep auto-closes idle reviews from cron. ato subagent log brings Claude Code’s Task tool into the same receipt ledger. Plus a sign-in redesign and a 3-step Onboarding for non-technical users.
-
We used ATO to test ATO — Part 7: gemini-flash beat sonnet on the same 5 prompts at 7-13× lower cost
Same 5 security-review prompts, n=30 per cell, two models. Gemini-2.5-flash scored 1.000 on every real security prompt vs. 0.467-0.900 for claude-sonnet-4-6, at 7-13× lower per-call cost. Plus: two prompts where both models scored 0.000, which says more about the rubric than the agent. This is the kind of cross-model Pareto finding ATO’s methodology runner exists to surface.
-
We used ATO to test ATO — Part 6: the methodology runner ships, real-data validated
v2.10 PR-2/3/3.1/4 shipped this morning. We ran the runner against the n=150 receipts from Part 5 plus a fresh n=13 LLM-judge eval — total incremental spend $0.07. Regex rubric found a prompt where every advisory-mode dispatch missed the rubric. LLM-judge produced differentiated 0.05..0.85 scores. The dual cost accounting (your LLM bill vs. ours) is the headline number.
-
We used ATO to test ATO — Part 5: closing the sample-size gap, falsifying our own n=1 claim
Parts 1-4 admitted the n=1 sample size was below industry baseline. Part 5 is what happened when we took the critique seriously and re-ran at n=150 across 5 prompts × 3 conditions. One Part 4 claim got falsified by the real data. That’s why you run real evals.
-
We used ATO to test ATO — Parts 2–4: closing the regression, extending coverage, refusing-with-options
PR-2 closes the claude false negative from Part 1. PR-3 routes API providers through the tool-call loop when grounding is on. PR-4 ships refuse-with-options for the parserless runtimes. Each PR scored against the same bench, with the receipts that survived. Sample size still n=1 — Part 5 fixes that.
-
We used ATO to test ATO — Part 1: foundation, the cold control, the regression we found
The v2.9 grounded-mode series opens with the cold-control test: claude refuses honestly when src/auth.ts doesn’t exist, while gemini hallucinates 5454 chars of generic SQL-injection lecture. PR-1 ships the schema, along with a regression hidden inside it.
-
I rebuilt Karpathy’s LLM Council with tool calls and an audit trail
Karpathy shipped llm-council last November — 18.7k stars, then explicitly walked away (“not going to support it”). Same primitive, different shape: multi-provider auth (not OpenRouter-locked), tool calls so the LLMs verify claims in your repo, persistent specialist agents, and an audit trail that lands as markdown in your PR.
-
Why is multi-LLM development still so painful in 2026?
Different models genuinely lead at different tasks. The benchmark data shows it clearly. So why do most teams still pick one and use it for everything? A look at the SWE-bench / MMLU / HumanEval differences, the tooling gap, and what real multi-LLM workflows look like.
-
How can we actually understand whether a new LLM is better?
Vendors publish benchmark scores. Those scores tell you something, but not everything you need. A walk through MMLU, HumanEval, GSM8K, your own intuition while developing, and what to do once you have real users in front of your agent.
-
Why does my AI agent get worse without anyone changing it?
Models change. Prompts change. Context changes. Quality drifts and you find out from a user complaint two weeks later. The Stanford temporal-drift study, the five causes of silent regressions, what to actually monitor, and the thresholds that matter.
-
What’s missing when traditional monitoring meets AI agents
Traditional APM tells you the API succeeded. AI agent observability has to tell you whether the answer was actually right, which agent touched which file, and how to stop the one that’s stuck. The five questions APM doesn’t answer.
-
Why does my AI agent bill keep growing?
LLM costs feel arbitrary because most teams don’t actually instrument them. A walk through the published per-million pricing, why bills grow faster than usage, and what to monitor before the CFO asks.
-
What separates production AI agents from demos?
A demo agent has one prompt. A production agent has variables, context hooks, memory policies, evaluators, and per-task models. The shift from prompt engineering to context engineering, with the literature behind it (Lost in the Middle, Anthropic’s Building Effective Agents).
-
What we shipped: ATO becomes the ops layer for multi-runtime agents
Twelve releases in two days. Replay any prompt against a different runtime, catch regressions before users do, cut cost without losing quality. The v2.1 release notes for ATO.