2026-05-24 · Build log · Parts 2–4 of a v2.9 series · 11 min read

We used ATO to test ATO — Parts 2–4. Closing the regression, extending coverage, refusing-with-options.

v2.9 grounded mode build log — series index.

Part 1 — PR-1 foundation: schema, GroundingPolicy module, dispatch flag surface, verdict computation. Score: +1 upgrade, −1 regression.
Part 2 (below) — PR-2 closes the claude false negative via stream-json parser.
Part 3 (below) — PR-3 routes API providers through the function-calling tool loop.
Part 4 (below) — PR-4 refuses-with-options for Ollama, OpenClaw, Hermes in strict mode.

Part 1 left the implementation with one upgrade (gemini’s hallucination became a checked column value) and one regression (claude was mis-reported as ungrounded because we didn’t parse its CLI’s tool-use). The methodology this series ships with says: that regression is the bench for the next PR. So we kept going. Three more PRs to main, each scored against the same cold-control prompt we’ve been running since this morning.

Part 2 — PR-2: parse claude’s stream-json into the tool-call audit channel

Bench: Part 1, trace 14c05521-4757-43d2-aed7-eb953492efe6. Claude verdict = advisory with read_file unmet. Claude DID use its native filesystem tools (the response text proves it: “I searched the CWD and globbed for **/auth.ts — no matches.”). The receipt mis-reported it because ato dispatch claude invokes --print mode, which prints text, not the structured tool_use blocks.

Design decision tested first: probe claude --print --verbose --output-format stream-json "Read ./README.md" directly. Confirmed it emits one-event-per-line NDJSON with type:"assistant" messages whose message.content[] array carries type:"tool_use" blocks with name + input, plus a final type:"result" event with the assistant’s text. That shape is pure-function unit-testable in isolation.

{"type":"assistant","message":{"content":[
  {"type":"tool_use","id":"toolu_...","name":"Read",
   "input":{"file_path":"./README.md"}}
]}}
{"type":"result","result":"The first heading is: # ATO ..."}

Impl: c0d30d5 ships three surfaces — a pure-function parser (apps/cli/src/grounding/claude_stream_parser.rs, 8 unit tests covering result extraction, tool_use ordering, malformed-line resilience, fallback text aggregation, and args_brief truncation); an env-var opt-in (ATO_CLAUDE_STREAM_JSON=1) that switches the claude CLI invocation to stream-json mode without touching the non-grounding code path; and a post-dispatch UPDATE that reads the row’s raw stream-json from execution_logs.response, parses it back into (response_text, tool_calls), and rewrites the row so the response column carries the actual assistant text and tool_calls_summary carries the structured observations.

Empirical replay against the bench (trace a0825a01-729c-47a3-824b-87b64205183e):

Dimension	Bench (PR-1)	Impl (PR-2)
Claude verdict	`advisory` + `[read_file unmet]`	`compliant` + 0 unmet
Observed tool calls	0 (false negative)	2 (`Bash` + `Glob`)
Response column	raw `--print` text	parsed assistant reply, NDJSON stripped

Score: +1 upgrade. PR-1 regression closed. Claude’s actual tool use is now visible on the receipt; the verdict reflects what the runtime did, not what we couldn’t observe.

Honest intermediate steps recorded as the test iterated:

First attempt required --require-tools Read; claude used Bash — verdict stayed advisory because rule-match is name-exact. The too-narrow requirement was caught before claiming success.
Second attempt required Read,Bash; still advisory because soft mode by design always returns advisory (observation-only, no compliant verdict). Working as intended.
Third attempt: strict mode + the tool name claude actually reaches for (Bash) → compliant with 0 unmet rules. Empirical proof.

The iteration itself is a finding worth noting: the verdict logic is correct but the rule vocabulary needs to be carefully aligned to what the runtime emits. Read is what an agent would intuitively call the operation; claude’s CLI tool is named Bash (a generic shell tool that reads via ls / cat). A future slice should add a tool-name alias table so --require-tools Read matches any read-style tool the runtime exposes, not just the literal string.

Part 3 — PR-3: route API providers through the function-calling tool loop

Bench: Part 1, trace c4996da1-ce35-493a-818e-358d433cb7a4. Gemini verdict = advisory + [read_file unmet], response = 11,378 chars of generic SQL-injection lecture material with the explicit disclaimer “I will perform this as if I had identified these in the code.” The model had no filesystem access because the API path goes through api_dispatch.rs with no function-calling tools wired.

Design decision tested first: grep for the existing tool-loop wiring. The function-calling infrastructure already exists in api_dispatch_tools.rs from the v2.7.8 work on ato review. It’s gated by a with_tools parameter on dispatch::run, which the top-level CLI handler hardcodes to false with the comment “ato dispatch top-level — no tools; that’s ato review’s surface.” So the question collapsed to: does flipping that one bool to true when grounding is on AND the runtime is an API provider route through the function-calling tool loop cleanly?

It does. The whole change is one new computed boolean and one parameter swap:

let runtime_is_api_provider =
    crate::api_dispatch::is_api_provider(&runtime);
let with_tools_for_grounding =
    grounding_overrides.has_any() && runtime_is_api_provider;

commands::dispatch::run(
    &runtime, &prompt, ...,
    with_tools_for_grounding,    // was: false
    &db_path, &opts,
)?;

Impl: 0e9c2fd. 21 lines including the comment block.

Empirical replay attempt (trace eccf7558-5408-4e4b-9cec-1a21bc892af9):

Dimension	Bench (PR-1)	Impl (PR-3)
Gemini verdict	`advisory` + `[read_file unmet]` + 11,378 chars hallucination	`violation` (strict mode) + `[read_file unmet]` + `status: error` downstream
Where the failure fired	n/a (the hallucination succeeded silently)	inside `api_dispatch_tools`’ key-decrypt step

Honest read of the score:

+1 structural upgrade. The with_tools routing IS correct. The error fired from INSIDE api_dispatch_tools (the function-calling tool loop), which is empirical proof the branch engaged. The failure was the well-documented dev-build / prod-keychain ACL mismatch from 2026-05-17 — the locally-built debug binary couldn’t decrypt the API key that the prod desktop wrote to the macOS keychain. Unrelated to PR-3’s routing logic.
+1 new empirical signal. The strict-mode violation verdict path was empirically demonstrated for the first time today. Prior PRs only exercised the advisory path (soft mode). The verdict ladder now has every rung walked: not_enforced, advisory, compliant, violation.
Gap (now fixed in v2.11 PR-12.6, 2026-05-25). The clean end-to-end replay above was blocked locally by the dev-build / prod-keychain ACL mismatch — the locally-built debug binary couldn’t decrypt API keys the prod desktop had encrypted. v2.11 PR-12.6 shipped the ATO_CLI_PATH env override + cli_path::resolve_ato_binary fallback chain that lets a dev binary delegate to the prod app-bundle’s ato for any keychain-bound dispatch. The same dispatch through a signed prod desktop binary at v2.9.0 never hit this; this paragraph is preserved as a build-log breadcrumb only.

Part 4 — PR-4: refuse-with-options for parserless runtimes

Bench: Ollama, OpenClaw, Hermes. These ARE full agents and DO have tools (the v2.9 plan corrects an earlier framing that called them “toolless”). What they DON’T have is a dispatch path that emits structured tool-call telemetry ATO can parse. So running strict mode against them today would mean either silently succeeding every dispatch as compliant by default (false positive — the inverse of PR-1’s claude regression) or silently failing every dispatch as violation regardless of behavior (false negative). Both are theater.

Design decision tested first: the resolution from the design debate (recorded in the v2.9 plan and docs/grounding.md) was “refuse, but actionable.” Don’t ship best-effort strict — that’s the prompt-injection theater the war-room debate already rejected. But the refusal must be a fork in the road: switch runtime / downgrade to soft / wait for the experimental marker parser.

Impl: d3dee16. After the policy compile, if mode = Strict AND runtime ∈ {ollama, openclaw, hermes}, bail with a 3-option structured error message. Soft mode against these runtimes is NOT refused — it’s still observation-only, the verdict honestly records advisory with unmet rules, no false positives or false negatives. Refusal fires only on the strict ladder rung.

Empirical test — the full matrix:

#	Command	Expected	Actual
1	`ato dispatch ollama ... --mode-override strict`	refused with 3 options	✓ refused, full error printed
2	`ato dispatch openclaw ... --mode-override strict`	refused	✓ refused
3	`ato dispatch hermes ... --mode-override strict`	refused	✓ refused
4	`ato dispatch ollama ... --mode-override soft`	NOT refused (soft is observation-only)	✓ policy compiles cleanly
5	`ato dispatch claude ... --mode-override strict`	NOT refused (claude isn’t parserless)	✓ policy compiles cleanly

The actual refusal message for tests 1–3, verbatim, the kind of thing an AI agent calling run_agent through MCP needs to be able to read:

Strict mode is not yet supported on the 'ollama' runtime — its
dispatch path doesn’t emit structured tool-call telemetry, so the
verdict couldn’t be honestly computed. Three options:

  1. Switch runtime: re-run with claude / codex / gemini (CLI runtimes
     with native tool-call telemetry) or any API provider routed through
     the function-calling tool loop in v2.9 PR-3.

  2. Downgrade to soft: re-run with `--mode-override soft`. The mandatory
     rules are listed in the system prompt as expected behavior; the
     verdict records `advisory` regardless of actual compliance —
     observation-only, no false positives or false negatives.

  3. Wait for the experimental marker parser: a follow-on slice will
     ship a best-effort tool-call marker parser for these runtimes
     (documented in docs/grounding.md). Until then, the dispatch is
     refused rather than silently misreport its compliance.

Score: +1 honest coverage. The parserless-runtime case is now resolved — not by pretending to enforce, not by ignoring the gap, but by surfacing it with three actionable paths forward.

Where the v2.9 grounded mode feature stands at the end of the day

Runtime class	Coverage as of v2.9 PR-4
claude CLI	✓ stream-json parser captures tool use; verdict lifts to compliant when tools observed (PR-2)
codex CLI	~ codex emits tool blocks in a similar shape; the codex-specific parser is a v2.9.x follow-on slice (PR-2 pattern, codex’s flag is `--print --output-format json` per the existing wiring at `dispatch.rs:1062`)
API providers (15+)	✓ routed through `api_dispatch_tools.rs`’ function-calling loop when grounding is non-`off` (PR-3). The Part 5 footer + Part 6 n=30 update show the clean end-to-end replay after v2.11 PR-12.6’s `ATO_CLI_PATH` env override unblocked the keychain ACL.
Ollama / OpenClaw / Hermes	✓ strict refuses-with-options; soft compiles policy and records `advisory` verdict; experimental marker-parser interceptor queued (PR-4)

Six commits to main in one day, every one of them shipped against a real bench, scored honestly, and surfaced its own regressions (we caught two: the PR-1 claude false-negative that PR-2 closed, and the PR-2 tool-name-exact-match issue that argues for an alias table in a v2.9.x slice). Zero of those regressions would have been visible without the receipts.

What this enables — the methodology IS the product

Every PR in this series followed the same loop:

Cold — capture the bench. Today’s bench for PR-2 was Part 1’s claude advisory+unmet receipt. Tomorrow’s bench for the codex parser is PR-2’s claude-compliant receipt. The chain is dense and the chain is auditable.
Probe — test the design decision first, before writing any code. PR-2’s probe was “does claude --output-format stream-json actually emit tool_use blocks?” PR-3’s probe was “does the existing with_tools branch route through the function-calling loop cleanly?” PR-4’s probe was “what does strict mode produce against a runtime without tool-call telemetry today, and which failure mode is worse?”
Impl — build the smallest slice that should change the cold-control outcome.
Score — replay the exact bench with the impl on, document upgrades AND regressions. Don’t hide the failure cases — they’re the bench for the next PR.

That loop, composed across N dispatches with N rubrics, is what the Pro Methodology Runner productizes. Each grounded-mode receipt is the atomic event. The runner fans out the variants, scores each one, composes them into a defensible recommendation, and persists the methodology itself as a reusable artifact. Nobody else (Braintrust, Patronus, Promptfoo) composes methodologies with grounding receipts as the atomic event. The wedge: “your agent followed your rules (verifiable receipts), AND we have a Pareto chart showing it was the best choice on YOUR data.”

v2.10 PR-1 will be the methodology runner shipping one end-to-end archetype demonstrated on customer data. The receipts from this series — six commits worth, all on main, every one of them queryable in your local ~/.ato/local.db right now — are the data shape it composes.

Try it yourself

brew install willnigri/ato/ato

# Cold: fire the prompt without grounding flags, see what each runtime does
ato dispatch claude "Review src/auth.ts for SQL injection"
ato dispatch gemini "Review src/auth.ts for SQL injection"

# Strict: claude lifts to compliant when its tools fire
ato dispatch claude "Review src/auth.ts for SQL injection" \
    --mode-override strict --require-tools Bash

# Strict against a parserless runtime: refuse-with-options
ato dispatch ollama "Review src/auth.ts" --mode-override strict --require-tools read_file

# Inspect every receipt that ever landed
sqlite3 ~/.ato/local.db "SELECT runtime, grounding_verdict,
                          tool_calls_count, agent_slug
                          FROM execution_logs
                          ORDER BY created_at DESC LIMIT 10"

Download ATO → View on GitHub →

All receipts in this post are from real ato dispatch sessions run on 2026-05-24, persisted to the author’s local ~/.ato/local.db. Trace ids referenced: 14c05521 (PR-1 bench claude), c4996da1 (PR-1 bench gemini), a0825a01 (PR-2 compliant claude), eccf7558 (PR-3 structural-proof gemini), plus PR-4’s refuse-with-options dispatches (all five test cases run live, all five passed as expected). Score sheets at /tmp/grounded-mode-receipts/ in the dev environment. Commit hashes: c0d30d5 (PR-2), 0e9c2fd (PR-3), d3dee16 (PR-4). The v2.9.x follow-on slices (codex parser, tool-name alias table, experimental Ollama marker parser) are queued for the next session.