2026-07-03 · Build log · Part 8 of the series · 12 min read

We used ATO to test ATO — Part 8. We turned ATO into an open-box model router, and it immediately caught our own bugs.

Build log — series.

Part 1 — v2.9 PR-1: grounded-mode foundation.
Parts 2–4 — v2.9 PR-2/3/4.
Part 5 — n=150 scaled re-run; one Part 4 claim falsified.
Part 6 — v2.10 methodology runner ships.
Part 7 — cross-model n=30: gemini-flash beat sonnet at 7-13× lower cost.
Part 8 (you are here) — the open-box router: bench with your keys, route with receipts.

On June 22nd, Sakana AI launched Fugu Ultra: a hosted router that takes your tokens, silently fans them across a multi-agent orchestration, and returns an answer with vendor-reported benchmark numbers attached — 93.2 on LiveCodeBench, per their own blog, reproducible by nobody outside their infrastructure. Maybe those numbers are real. That’s exactly the problem: maybe. We think routing is too important to be a black box, so we built the opposite into ATO’s MIT repo: benchmark the models you already have keys for, on a pinned public dataset, with a grader that executes code instead of asking a judge model to vibe-check it — then let the router pick, showing every number, hash, and warning behind the decision.

Honesty about the timeline, since this post is about honesty: the router layer itself came together in a focused sprint, but only because months of ATO plumbing already existed under it — the multi-runtime dispatch layer with per-call receipts, the pricing registry, war-room reviews, the methodology runner’s statistics (Parts 1–7 of this series). Roughly 70% was standing infrastructure; the last 30% — the verifiable grader, the contamination registry, the receipt hashes, the router — is what this post is about. That last mile being fast is itself the argument for receipts-first architecture.

The contract — every score carries the means to refute it

A benchmark number without provenance is marketing. So every ato bench run scorecard records:

Pass/fail by execution. The model’s code runs in a sandbox (network off, writes confined, home unreadable) against the suite’s real test cases — LiveCodeBench’s stdin problems and the LeetCode-style functional ones, called through the same driver conventions as the official harness, down to sys.setrecursionlimit(50000). No LLM judge anywhere in the scoring path.
A contamination-clean headline. Each task’s release date is checked against the model’s vendor-stated training cutoff — we researched and cited 32 models’ cutoffs from vendor pages (ato bench cutoffs prints the table with sources), and models whose vendors publish no cutoff are honestly unknown, never guessed. Tasks that predate a model’s training data don’t get to inflate its headline.
Wilson confidence intervals, because 10/12 is not “83.3%”, it’s “somewhere between 55% and 95%, probably”.
Three hashes — dataset, harness, environment — that define what “the same benchmark” means. Two scores are only comparable when the hashes match. Re-run with matching hashes and an overlapping interval: you’ve reproduced it. The router enforces that mechanically and discloses what it skipped.
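
To make the interval and the contamination rule concrete, here is a minimal Python sketch. It is not ATO's code: the function names are hypothetical, and treating a task released exactly on the cutoff day as not clean is our assumption. The Wilson formula itself is the standard one.

```python
import math
from datetime import date
from typing import Optional


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def is_clean(task_released: date, training_cutoff: Optional[date]) -> Optional[bool]:
    """True when the task was released after the model's stated cutoff.

    None means the vendor publishes no cutoff: the status is unknown, not guessed.
    Counting "released on the cutoff day" as not clean is an assumption.
    """
    if training_cutoff is None:
        return None
    return task_released > training_cutoff


low, high = wilson_interval(10, 12)
print(f"10/12 -> {low:.0%} to {high:.0%}")  # 10/12 -> 55% to 95%
```

The same function reproduces the intervals in the n=49 table further down, for example 39/49 gives roughly 66.4% to 88.5%.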

The dataset itself is fetched and pinned, never vendored: you pull your LiveCodeBench slice from the source at a specific revision, the file’s identity is hashed into every receipt, and the corpus never touches our MIT repo.
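
A rough sketch of what "hashed into every receipt" and "comparable only when the hashes match" could look like. The `HashTriple` type and the `comparable` and `reproduced` helpers are hypothetical names for illustration, and SHA-256 is an assumed choice of hash; only the three-hash idea comes from the contract above.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a pinned dataset file in chunks so large slices never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


@dataclass(frozen=True)
class HashTriple:
    """Hypothetical receipt field: what 'the same benchmark' means."""
    dataset: str
    harness: str
    env: str


def comparable(a: HashTriple, b: HashTriple) -> bool:
    """Two scores may be compared only when all three hashes match."""
    return a == b


def reproduced(a: HashTriple, ci_a: tuple[float, float],
               b: HashTriple, ci_b: tuple[float, float]) -> bool:
    """Matching hashes plus overlapping confidence intervals counts as a reproduction."""
    overlap = ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
    return comparable(a, b) and overlap
```

Keeping the triple as a frozen dataclass means equality is field-by-field, so a changed harness or environment makes two scorecards incomparable without any extra logic.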

Then we benchmarked four models with real money, and the tooling confessed

The first real battle — gemini-3-flash-preview, gpt-5, claude-sonnet-4-6, claude-fable-5, twelve LiveCodeBench problems from February–March 2025 — found three bugs before it found a winner. (We then re-ran everything at n=49; that’s the next section, and the numbers move — which is the lesson.) This is what dogfooding a receipts system looks like:

gpt-5 had never worked through ATO. Ever. Our request bodies hardcoded max_tokens; OpenAI’s current models reject it outright in favor of max_completion_tokens. Every gpt-5 dispatch since the model launched had been a silent HTTP 400. The fix touched six body sites across the CLI and desktop — our three-model review war-room kept finding sites we’d missed.
Frontier reasoning models have quietly killed pinned-temperature benchmarking. Both gpt-5 (“only the default (1) value is supported”) and claude-fable-5 (“temperature is deprecated for this model”) refuse temperature=0. The field’s favorite reproducibility knob is gone at the top end. Our receipts now record sampling honestly: pinned when the provider honors it, an explicit provider-default marker when it doesn’t — and the two hash differently, because they are different measurements.
A response that hits the output cap mid-code-block loses its closing fence — and our extractor was grading the markdown wreckage as Python, mis-scoring truncations as compile failures. We caught this on an earlier January-2025 slice used while hardening the harness: one AtCoder problem (abc387_f) flipped from a false fail to a legitimate 43/43 pass after the fix. The model had been right all along; the harness was wrong. If your harness can’t confess that, it isn’t a measurement instrument.
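
The third bug is the easiest to illustrate. Below is a minimal sketch of a fence-tolerant extractor, assuming the model wraps its answer in a Markdown code fence; `extract_code` is a hypothetical name, not ATO's function. The point is that a missing closing fence is reported as truncation and the code is still handed to the grader, rather than the Markdown wreckage being compiled as Python.

```python
import re

TICKS = "`" * 3  # a Markdown code fence, built indirectly to keep this listing readable
OPEN_FENCE = re.compile(r"^" + re.escape(TICKS) + r"[\w+-]*[ \t]*$", re.MULTILINE)
CLOSE_FENCE = re.compile(r"^" + re.escape(TICKS) + r"[ \t]*$", re.MULTILINE)


def extract_code(response: str) -> tuple[str, bool]:
    """Return (code, truncated) for the first fenced block in a model response."""
    opened = OPEN_FENCE.search(response)
    if opened is None:
        # No fence at all: assume the whole response is code.
        return response.strip(), False
    body = response[opened.end() + 1:]  # skip the newline after the opening fence
    closed = CLOSE_FENCE.search(body)
    if closed is None:
        # Output cap hit mid-block: keep what we have and flag it.
        return body.rstrip(), True
    return body[:closed.start()].rstrip(), False


cut_off = TICKS + "python\ndef f():\n    return 1"
code, truncated = extract_code(cut_off)
print(truncated)  # True
```

Whether a truncated answer should be graded at all is a policy choice; returning the flag alongside the code lets a receipt record the truncation either way.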

Every fix shipped the same day, reviewed by a three-seat multi-LLM war-room (Claude, Codex, Gemini — each of whom caught something the others missed), with regression tests pinning the behavior. The receipts of those reviews are in the repo history too.

The verdict — a small-sample tie dissolves, and the paired test outruns our own router

At n=12, three models tied at 83.3% and the router refused to rank them — correctly, as it turned out, because the tie was an artifact. We quadrupled the evidence (49 problems, February–April 2025, same pinned revision) before publishing. Provider-default sampling, one shared hash triple:

model	n=12 said	n=49 says	95% CI	$/task	speed
claude-fable-5	83.3%	85.7% (30/35*)	[70.6–93.7%]	$0.152†	25s/task
gpt-5 · contamination-clean	83.3%	79.6% (39/49)	[66.4–88.5%]	$0.111‡	93s/task
gemini-3-flash-preview · contamination-clean	83.3%	73.5% (36/49)	[59.7–83.8%]	$0.040	50s/task
claude-sonnet-4-6	66.7%	69.4% (34/49)	[55.5–80.5%]	$0.028†	19s/task

† Corrected after we audited our receipts against the provider consoles: the fourth confession. Anthropic’s frontier models bill thinking tokens in addition to usage.output_tokens (the opposite of OpenAI, whose completion_tokens already includes reasoning), and our parser was reading only the visible half: fable’s billed output ran ~1.57× what we recorded on this suite, sonnet’s ~1.31×. Our pricing registry was also missing fable’s $10/$50 row entirely (its receipts honestly said “$? (pricing missing)” rather than faking $0, so the design held). Both were fixed the same day, capture is now billing-true, and the $/task cells above are console-reconciled. Pass rates and speeds are unaffected.

‡ gpt-5’s cost mismatch was resolved by the console’s spend split: input bills at exactly the documented $1.25/M, but output billed $7.158 for the same 352k tokens the API reported to us, an effective $20.32/M, roughly 2× the still-published $10/M list price (model id gpt-5-2025-08-07, standard tier, no caching). Our capture was correct; the price page is not what the meter charges. The cell above uses the empirical rate, because receipts should record what you pay, not what the docs promise. That is, once again, rather the point of this whole exercise.

*fable’s leg was cut at task 36 by a billing cap on our own account; 35 valid results survive (still above our ≥30 doctrine bar) and the receipts show exactly where the leg stopped. Its cost was recomputed from the receipt’s own token counts × Anthropic’s list price — the model isn’t in our pricing registry yet, and an unpriced $0 receipt would have won the cost tie-break dishonestly. We caught our own router almost doing exactly that.

Look at the gemini row: 83.3% at n=12 became 73.5% at n=49. The three-way tie was a small-sample artifact. Every closed-router marketing table you’ve ever seen is one lucky draw away from the same mirage; you just never get to find out, because you can’t re-run their numbers.

Two verdicts now exist, and their disagreement is the most instructive thing we shipped:

The router (v1) still says claude-sonnet-4-6. Its tie logic compares unpaired confidence intervals; all four still overlap, so it refuses to rank on accuracy and picks the cheapest at $0.028/task. Conservative, defensible, and printed with its reasoning.
The paired test says fable-5 is real. Because every model answered the same problems, McNemar’s exact test on shared tasks is sharper than interval overlap — and fable never lost a single discordant pair to anyone: 6–0 over sonnet (p=0.031, the only significant pairwise result), 4–0 over gpt-5 (p=0.125), 3–0 over gemini (p=0.250). On the 35 shared tasks, fable’s pass set strictly contained every other model’s.
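
The paired p-values above are easy to check by hand. This is a small sketch of the two-sided exact McNemar test computed from discordant pair counts only; the function name is ours, and it is a standalone re-derivation rather than ATO's methodology runner.

```python
from math import comb


def mcnemar_exact(wins: int, losses: int) -> float:
    """Two-sided exact McNemar p-value.

    wins / losses are the discordant pairs: tasks one model passed and the
    other failed. Under the null, each discordant pair is a fair coin flip.
    """
    n = wins + losses
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


for rival, wins, losses in [("sonnet", 6, 0), ("gpt-5", 4, 0), ("gemini", 3, 0)]:
    print(f"fable vs {rival}: {wins}-{losses}, p={mcnemar_exact(wins, losses):.3f}")
# fable vs sonnet: 6-0, p=0.031
# fable vs gpt-5: 4-0, p=0.125
# fable vs gemini: 3-0, p=0.250
```

A sweep of k discordant pairs gives p = 2 × 0.5^k, which is why six clean wins is the first sweep to cross 0.05 and three or four cannot, however lopsided they look.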

Our own doctrine mandates paired tests for same-task comparisons; our router v1 doesn’t run them yet. So the honest summary is: fable-5 is the strongest model in this evidence and our router is currently too conservative to say so. Route v2 gets paired comparisons, and this post is the receipt for why. If your task justifies roughly 5.4× sonnet’s price ($0.152 vs. $0.028 per task in the table above), fable is the pick; if you’re optimizing spend, sonnet remains a rational default the paired test hasn’t dethroned on value.
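
For contrast with the paired test, here is one plausible reading of the v1 tie logic as a sketch. The real router's rules are not spelled out in this post, so the `Score` type, the "overlaps the leader" definition of a tie, and `route_v1` itself are assumptions; the numbers are the n=49 table rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    pass_rate: float
    ci_low: float
    ci_high: float
    cost_per_task: float


def intervals_overlap(a: Score, b: Score) -> bool:
    return a.ci_low <= b.ci_high and b.ci_low <= a.ci_high


def route_v1(scores: list[Score]) -> tuple[Score, str]:
    """Assumed v1 rule: models whose interval overlaps the leader's are tied on
    accuracy, and the cheapest tied model wins. Returns the pick and its reason."""
    leader = max(scores, key=lambda s: s.pass_rate)
    tied = [s for s in scores if intervals_overlap(leader, s)]
    if len(tied) == 1:
        return leader, "leader's interval is separated from all others"
    pick = min(tied, key=lambda s: s.cost_per_task)
    return pick, f"{len(tied)} models overlap the leader; cheapest wins"


table = [
    Score("claude-fable-5", 0.857, 0.706, 0.937, 0.152),
    Score("gpt-5", 0.796, 0.664, 0.885, 0.111),
    Score("gemini-3-flash-preview", 0.735, 0.597, 0.838, 0.040),
    Score("claude-sonnet-4-6", 0.694, 0.555, 0.805, 0.028),
]
pick, reason = route_v1(table)
print(pick.model, "|", reason)
# claude-sonnet-4-6 | 4 models overlap the leader; cheapest wins
```

Unpaired intervals throw away the fact that every model saw the same problems, which is exactly the information the paired test uses.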

(Fun footnote in the receipts: five problems — LiveCodeBench question ids 3750, 3759, 3760, 3763, 3784 — beat all four models. When four frontier models agree something is hard, that’s signal too. Those five are the first targets for the coordination experiments below.)

Coordination, measured instead of mystified

Fugu Ultra’s pitch is that a learned multi-agent orchestration beats single models. Fine — that’s an empirical claim, so we made it benchable. ato bench run --cascade "claude-sonnet-4-6->gemini-3-flash-preview" runs a transparent coordination: the cheap model answers first, its code executes against the problem’s public tests (the same information a human contestant has — a verifiable signal, not a judge model’s opinion), and only failures escalate to the expensive model. The whole recipe is scored as a virtual model in the same receipt store, both dispatches billed, every escalation decision auditable per task.

First result: the cascade scored 66.7% at $0.045/task — the same accuracy as sonnet-alone at 2.8× the cost, so the router correctly refused to prefer it. The receipts show precisely why: LiveCodeBench’s 1–3 public tests per problem are a weak oracle, and three answers that passed public tests died on the hidden ones. With a perfect signal the same cascade hits 10/12 at less than gemini’s price. That’s a negative result with an engineering roadmap inside it — richer selection signals and vote-style recipes are next — and we published it anyway, because an evidence system you only trust when it flatters you isn’t one.
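
The cascade recipe is small enough to sketch in a few lines. This is an illustration under stated assumptions, not the `--cascade` implementation: models are stand-in callables, `passes_public` stands in for sandboxed execution against the public tests, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Model = Callable[[str], str]         # problem statement -> candidate code
PublicCheck = Callable[[str], bool]  # candidate code -> passes the public tests?


@dataclass(frozen=True)
class CascadeResult:
    code: str
    answered_by: str
    escalated: bool
    dispatches: int  # every dispatch is billed, including a discarded first try


def run_cascade(problem: str,
                cheap: tuple[str, Model],
                expensive: tuple[str, Model],
                passes_public: PublicCheck) -> CascadeResult:
    """Cheap model first; escalate only when its code fails the public tests."""
    cheap_name, cheap_model = cheap
    first = cheap_model(problem)
    if passes_public(first):
        return CascadeResult(first, cheap_name, escalated=False, dispatches=1)
    strong_name, strong_model = expensive
    return CascadeResult(strong_model(problem), strong_name, escalated=True, dispatches=2)


# Stub models for demonstration only.
result = run_cascade(
    "add two ints",
    cheap=("cheap-stub", lambda p: "def add(a, b): return a - b"),
    expensive=("strong-stub", lambda p: "def add(a, b): return a + b"),
    passes_public=lambda code: "a + b" in code,
)
print(result.answered_by, result.escalated, result.dispatches)
# strong-stub True 2
```

The weak-oracle failure described above lives entirely in `passes_public`: when the public tests pass but the hidden ones would not, the first branch returns early and the escalation never happens.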

What you get free vs. what ATO Pro adds

Free. All of it, because none of it is automation: ato bench run on any pinned public dataset with your keys (stdin + functional grading, the contamination registry, the hashes), ato bench cutoffs, --cascade coordination benchmarks with your own recipes, and ato route explain — the transparent advisory over your local receipts. It never dispatches, never reads a hosted leaderboard, and with zero local evidence it tells you to run ato bench — not to trust ours.

ATO Pro. Acting on routes: ato route apply (shipped this week) takes the free advisory — which remains the decision authority — confirms with you, dispatches the task through the routed model, and stamps the full routing provenance on the result. Continuous route watch is the registered next gate. The free advisory is always the brain; Pro pulls the trigger you were already shown. Learn about Pro →

Honest disclosure

n=49 is honest, not final. The paired test separates fable from sonnet (p=0.031); the other pairwise comparisons remain open at this sample size, and one model’s leg is n=35 (truncation disclosed above). These are the 2026-07-03 numbers for these revisions on this slice — a measured sounding, not a leaderboard.
Mixed comparison bases. The Claude rows are scored on all tasks; the gemini/gpt-5 rows carry the contamination-clean tag. The router prints a mixed-bases warning whenever this happens rather than silently comparing.
One receipt was hand-corrected. claude-fable-5 wasn’t in our pricing registry yet, so its receipt recorded $0 cost; we recomputed it from the receipt’s own token counts × Anthropic’s list price ($10/$50 per MTok) and said so in the file. The registry row ships next.
One snapshot. These are the 2026-07-02/03 numbers for these model revisions on these slices (12 and 49 problems). Models move; re-run before deciding anything expensive. The whole point is that you can.

Real data behind this post: two four-model battles (n=12 and n=49) plus one cascade run — 244 graded dispatches against pinned LiveCodeBench slices (HF revision 0fe84c39), ~$25 all-in on the author’s own API keys (console-reconciled after the cost audit above), fired 2026-07-02/03. Receipts in ~/.ato/bench/receipts with dataset/harness/env hashes on every scorecard. The Fugu Ultra figures cited are Sakana’s own vendor-reported numbers, independently unverified — which is rather the point.