Same prompt, three AIs, real receipts: comparing Claude vs Codex vs Gemini on a real PR
If you build with AI, you’re already comparing models. Just badly.
Three terminals. The same prompt pasted three times. You eyeball the diffs, pick whichever output “feels” better, paste it into your PR, and have no record of who said what or what any of it cost. The disagreement that should be the most valuable signal you get all day — lost somewhere in your clipboard history.
This week I ran the same code review through Claude, Codex, and Gemini in one shared session, with tool calls and a cost ledger, and the receipts changed which model I’d reach for next time.
The setup
A real PR on a real repo: a refactor of session-invalidation logic in an auth middleware. ~80-line diff. The kind of change where the unit tests pass but a careless merge could log every user out at 3 AM.
One command, three runtimes:
```sh
ato review --consensus \
  --reviewer claude \
  --reviewer codex \
  --reviewer google
```
Each reviewer ran in the same session. Turn #2 saw turn #1’s findings via history replay (no prompt re-pasting). Function-calling tools (read_file, grep, git_log) let each model walk the live repo instead of guessing. Every tool call was logged with the file path it touched.
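For a sense of what “walk the live repo” means, here is roughly how those three tools could be declared for function calling. The tool names are the ones above; the parameter shapes are my guess at a minimal schema, not ATO’s actual definitions.

```ts
// Sketch only: tool names come from the run above; parameter shapes are assumed,
// not ATO's real schema.
type ToolParam = { type: "string"; description: string };

interface ToolDef {
  name: string;
  description: string;
  parameters: Record<string, ToolParam>;
}

const repoTools: ToolDef[] = [
  {
    name: "read_file",
    description: "Return the contents of a file from the working tree",
    parameters: { path: { type: "string", description: "Repo-relative path" } },
  },
  {
    name: "grep",
    description: "Search the repo and return matching lines with locations",
    parameters: { pattern: { type: "string", description: "Literal or regex to search" } },
  },
  {
    name: "git_log",
    description: "Return recent commits that touched a path",
    parameters: { path: { type: "string", description: "Path to show history for" } },
  },
];
```

Every call a reviewer makes against these is what lands in the ledger, tagged with the path it touched.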
What each model said
Claude (claude-sonnet-4-6):
“LGTM. The refactor is clean — the new `invalidateSession()` helper consolidates logic that was previously duplicated in three callers. Tests cover the happy path. Ship it.”
Codex (gpt-4.1):
“BLOCKING. This change breaks session invalidation on logout. The previous code called `db.deleteSession(sid)` synchronously before returning the response. The refactor moves it into an async fire-and-forget via the new helper, which runs after the response is sent. If the helper throws (e.g., db unreachable), the session persists in storage; the user’s cookie is cleared client-side but the server still accepts the old session ID.”

“Verified by reading `middleware/auth.ts:142` and `helpers/session.ts:38`. The new helper has no error handler; the old code path threw on db errors and surfaced a 500.”
Gemini (gemini-2.5-flash):
“Endorse Codex’s finding. Re-ran `grep -r 'invalidateSession'` — the new helper is also called from the password-reset endpoint, where the same fire-and-forget pattern would now allow a window in which the old session remains valid. This is the second instance of the bug, not just the logout case.”
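To make the finding concrete, here is a minimal reconstruction of the pattern both reviewers describe. The repo is private, so apart from `invalidateSession()` the names below are illustrative; the before/after shape is what matters.

```ts
// Illustrative reconstruction, not the actual private diff.
interface SessionStore {
  deleteSession(sid: string): Promise<void>;
}
interface Res {
  clearCookie(name: string): Res;
  sendStatus(code: number): void;
}
declare const db: SessionStore;

// Before the refactor: the delete is awaited, so a db failure surfaces as a 500
// and a stale session never silently survives.
async function logoutBefore(sid: string, res: Res): Promise<void> {
  await db.deleteSession(sid); // throws on db error -> request fails loudly
  res.clearCookie("sid").sendStatus(204);
}

// After the refactor: fire-and-forget. The response (and the cleared cookie) go
// out immediately; if the delete rejects, nothing notices and the server keeps
// accepting the old session ID.
function logoutAfter(sid: string, res: Res): void {
  void invalidateSession(sid); // not awaited, no error handler
  res.clearCookie("sid").sendStatus(204);
}

async function invalidateSession(sid: string): Promise<void> {
  await db.deleteSession(sid); // a rejection here is unobserved
}
```

The same unobserved-rejection window is what Gemini flagged on the password-reset path.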
Three models. One said ship. Two caught the bug. Without ATO, I’d have shipped Claude’s recommendation, learned about the regression from a customer ticket, and burned the rest of the day rolling it back.
The receipts
Here’s the cost-and-quality ledger ATO recorded for that review:
| Runtime | Model | Duration | Tokens (in / out) | Cost | Tool calls | Verdict |
|---|---|---|---|---|---|---|
| Claude | claude-sonnet-4-6 | 3.1s | 1,840 / 412 | $0.04 | 1 | LGTM |
| Codex | gpt-4.1 | 2.8s | 1,610 / 588 | $0.02 | 3 | BLOCKING |
| Gemini | gemini-2.5-flash | 4.1s | 2,100 / 344 | $0.01 | 2 | BLOCKING |
Three observations from those rows:
- The cheapest model caught the bug. Gemini at $0.01 didn’t just second Codex’s BLOCKING verdict; it found a second instance of the bug on the password-reset path. If I’d picked “just use the most expensive model” as a heuristic, I’d have lost the catch.
- Tool calls correlate with depth, not chattiness. Claude made 1 tool call (read the diff) and stopped. Codex made 3 (diff + two related files). Gemini made 2 (diff + a `grep`). The model that wanted to verify the claim made more calls; the model that wanted to ship made fewer.
- The whole audit costs less than a cappuccino. $0.07 total for an N-way code review on a real diff. Without the comparison, I’d have paid Claude $0.04 and been wrong.
What changes when you keep doing this
After a few weeks of running every non-trivial diff through ato review --consensus, the receipts start telling a story:
- Codex tends to win on refactors and security-adjacent diffs.
- Claude tends to win on greenfield architecture explanations.
- Gemini tends to win on long-context summarization where price-per-token matters.
That’s not a benchmark. That’s a personal benchmark, on your work, in your repo. The point of having receipts is that you stop guessing and start citing.
How to run this yourself
```sh
brew install willnigri/ato/ato
ato review --consensus
```
Or run ato demo-compare on first launch and ATO drops you into a pre-loaded session showing two LLMs comparing a refactor side-by-side — no config required.
Everything lands in a SQLite database at `~/.ato/local.db` on your machine. Every dispatch records cost, tokens, duration, tool calls, and files touched. You can query it directly with `sqlite3`. Sign-in is optional and only matters if you want cross-device sync, hosted LLM-as-judge, or shared sessions with a team — free during beta.
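For example, here is a weekly spend-per-runtime report pulled off that ledger via better-sqlite3. Treat the table and column names (`dispatches`, `runtime`, `cost_usd`, `created_at`) as placeholders for whatever the real schema calls them; check with `.tables` in sqlite3 first.

```ts
// Sketch: spend per runtime over the last week, read from the local ATO ledger.
// Table and column names are assumed placeholders, not ATO's documented schema.
import Database from "better-sqlite3";
import os from "node:os";
import path from "node:path";

const db = new Database(path.join(os.homedir(), ".ato", "local.db"), {
  readonly: true,
});

const rows = db
  .prepare(
    `SELECT runtime,
            COUNT(*)                AS dispatches,
            ROUND(SUM(cost_usd), 4) AS total_cost
       FROM dispatches
      WHERE created_at >= datetime('now', '-7 days')
      GROUP BY runtime
      ORDER BY total_cost DESC`
  )
  .all();

console.table(rows); // which reviewer you actually spend money on
```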
MIT licensed. Local-first. Bring your own keys.
Download ATO → View on GitHub →
The receipts in this post are from a real ato review --consensus session run on 2026-05-16. The auth-middleware diff is representative of the class of bug the multi-LLM comparison catches; the specific repo is private but the session, prompts, and tool calls are reproducible from the public CLI on any repo you point it at.