2026-05-10 · Product · v2.1

What we shipped: ATO becomes the ops layer for multi-runtime agents

TL;DR. Two days. Sixteen releases (v2.0.1 → v2.1.11). The v2.1 story we’d been promising — “multi-runtime differentiated observability” — became something you can actually use. Replay any prompt against a different runtime. Catch regressions before users do. Cut cost without losing quality. Find any conversation, agent, group, or schedule with one keystroke.


The headline: Replay

You shipped a Claude agent to production. Last week your eval score dropped from 0.91 to 0.74. You suspect the model swap. You want to know: would Codex have answered correctly on those failing prompts?

ATO can now answer that. From any cloud trace, click Replay → pick a target runtime → ATO re-dispatches the original prompt against your chosen model, captures the response, and shows source vs replay side-by-side with duration + estimated cost delta.

No prompt re-typing. No “guess what I asked.” Just a real A/B against the actual conversation.

Caveat we’re being honest about: replay reads prompts from your local execution log, so it works for traces dispatched on this machine. Cross-device replay needs server-side credential vault + PII review — that’s the next release, not this one.
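Mechanically, replay is simple to reason about: fetch the original prompt from the local log, dispatch it to the target runtime, diff the results. Here is a minimal Python sketch under assumed names (`local_log`, `dispatch`, and the trace dict shape are illustrative, not ATO's actual API):

```python
from dataclasses import dataclass

@dataclass
class ReplayResult:
    source_text: str
    replay_text: str
    duration_delta_ms: int   # replay minus source
    est_cost_delta: float    # replay minus source, USD

def replay_trace(trace, target_runtime, local_log, dispatch):
    """Replay a trace's original prompt against a different runtime.

    `local_log` maps trace IDs to original prompts (replay only works for
    traces dispatched on this machine); `dispatch(runtime, prompt)` returns
    (text, duration_ms, est_cost). All names here are hypothetical.
    """
    prompt = local_log.get(trace["id"])
    if prompt is None:
        raise LookupError("prompt not in local execution log; "
                          "cross-device replay unsupported")
    text, duration_ms, est_cost = dispatch(target_runtime, prompt)
    return ReplayResult(
        source_text=trace["response"],
        replay_text=text,
        duration_delta_ms=duration_ms - trace["duration_ms"],
        est_cost_delta=est_cost - trace["est_cost"],
    )
```

The caveat above falls directly out of the first step: if the prompt never landed in the local log, there is nothing to re-dispatch.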

Regression detection that earns the word “detection”

Every config change to your agent — model swap, role-models edit, system prompt rewrite — gets recorded automatically. The Insights → Regressions panel joins those changes against trace stats and surfaces the ones with meaningful deltas.

Each row drills into the actual failing examples (post-change traces with their prompts and errors). And every agent’s detail page now carries a banner above the tab nav when it has an active regression — you don’t have to remember to check.
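The core of the join is a before/after comparison per recorded change. A minimal Python illustration, with illustrative field names and thresholds (ATO's actual significance rules aren't shown here):

```python
def detect_regressions(config_changes, traces, min_drop=0.05, min_samples=5):
    """Flag config changes followed by a meaningful eval-score drop.

    `config_changes` is a list of (agent, changed_at); `traces` is a list
    of dicts with agent, ts, eval_score. Field names and thresholds are
    hypothetical, chosen for illustration.
    """
    flagged = []
    for agent, changed_at in config_changes:
        before = [t["eval_score"] for t in traces
                  if t["agent"] == agent and t["ts"] < changed_at]
        after = [t["eval_score"] for t in traces
                 if t["agent"] == agent and t["ts"] >= changed_at]
        # Skip changes without enough traces on both sides to mean anything.
        if len(before) < min_samples or len(after) < min_samples:
            continue
        delta = sum(after) / len(after) - sum(before) / len(before)
        if delta <= -min_drop:
            flagged.append((agent, changed_at, round(delta, 3)))
    return flagged
```

The 0.91 → 0.74 drop from the opening example would clear any reasonable threshold and surface as a row here.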

Cost recommendations that show their work

We’re not going to claim “Haiku is cheaper” if you’ve never run on Haiku. That’s just guessing.

What ATO will do: when you have actual historical data on the same agent across multiple runtimes, the new cost recommendations panel surfaces concrete swaps where the alternative is meaningfully cheaper at preserved quality. Quality guards: ≥30% cheaper, ok-rate within 10pp, eval-score within 5pp.

@code-writer · claude → codex · −59% per call · projected $1.01/mo at this volume

Sorted by absolute monthly impact. Renders nothing if no rec qualifies — better to show you nothing than fake confidence.
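The guards translate directly into a filter over per-runtime stats. A hedged sketch, assuming hypothetical field names (`cost_per_call`, `ok_rate`, `eval_score`, `monthly_calls` are illustrative, not ATO's schema):

```python
def qualifying_swaps(stats):
    """Apply the quality guards from the post: a swap qualifies only if
    the alternative is >=30% cheaper, ok-rate within 10pp, and eval-score
    within 5pp. `stats` maps (agent, runtime) -> per-runtime aggregates.
    """
    recs = []
    for agent in {a for a, _ in stats}:
        runtimes = [(r, s) for (a, r), s in stats.items() if a == agent]
        for cur_rt, cur in runtimes:
            for alt_rt, alt in runtimes:
                if alt_rt == cur_rt:
                    continue
                cheaper = 1 - alt["cost_per_call"] / cur["cost_per_call"]
                if (cheaper >= 0.30
                        and cur["ok_rate"] - alt["ok_rate"] <= 0.10
                        and cur["eval_score"] - alt["eval_score"] <= 0.05):
                    monthly = ((cur["cost_per_call"] - alt["cost_per_call"])
                               * cur["monthly_calls"])
                    recs.append((agent, cur_rt, alt_rt,
                                 round(cheaper, 2), round(monthly, 2)))
    # Sorted by absolute monthly impact; empty list if nothing qualifies.
    return sorted(recs, key=lambda r: -abs(r[4]))
```

Note the filter only ever compares runtimes that both have real history for the same agent, which is exactly the "no guessing" stance above.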

And every estimate now has a real number behind it. Desktop dispatches estimate cost at trace time (token count × per-million pricing for known models, with an explicit “est.” badge so you know it’s not from a billing receipt). The “—” we used to show in Compare and Replay is gone.
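Trace-time estimation is just token count times per-million pricing. A sketch with an illustrative price table (these numbers are placeholders, not ATO's actual table, and real per-million rates change):

```python
# Illustrative (input, output) USD prices per 1M tokens. Placeholder values.
PRICE_PER_MTOK = {
    "claude-haiku": (0.80, 4.00),
    "claude-sonnet": (3.00, 15.00),
}

def estimate_cost(model, input_tokens, output_tokens):
    """Trace-time estimate: token counts x per-million pricing.
    Returns None for unknown models rather than inventing a number."""
    prices = PRICE_PER_MTOK.get(model)
    if prices is None:
        return None
    cin, cout = prices
    return input_tokens / 1e6 * cin + output_tokens / 1e6 * cout
```

Returning None for unknown models is the same "nothing over fake confidence" choice as the recommendations panel; known models get the number plus the "est." badge.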

Pipelines + Compare get their own homes

Multi-stage dispatches (sequential groups, routed groups, anything that fans out across runtimes) now have a dedicated Pipelines sub-tab. One row per pipeline, click into the per-stage flow with handoff arrows + per-stage timing + files touched.

Trace comparison gets the same treatment: a kind-agnostic Compare sub-tab. We extracted this from the External tab’s drill-down because it didn’t make sense to require that an agent be deployed as a bundle just to compare its runs. Now any agent with ≥2 cloud traces is comparable.

Automations becomes the canonical canvas

Runs → Automations used to be a flow chart for skills with ## Step headers. We turned it into the canvas for everything that runs without you in the loop: routed groups, sequential pipelines, scheduled cron jobs, agent hooks, skill flows. All on one canvas, color-coded by source, with live status decorated from your trace history. Click any node → “View runs” → jumps straight to that agent’s trace explorer.

This is what “ops layer for multi-runtime agents” looks like from a single view.

⌘K spans the workspace

Hit ⌘K from anywhere. Search agents, groups, schedules, secrets, MCPs, projects — and now your chat history too (matches against thread titles AND message bodies, with snippets). Plus a Quick Actions list that jumps directly to any Insights sub-tab in one keystroke.

Found yourself wondering “where did I ask about that thing last week”? Type two characters of it. Done.
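Conceptually, the chat-history matcher is a scan over titles and bodies with a snippet window around the first hit. A minimal sketch, assuming an illustrative thread shape (not ATO's actual index):

```python
def search_threads(threads, query, context=20):
    """Match `query` against thread titles AND message bodies, returning
    one snippet per matching thread. `threads` is a list of dicts with
    'title' and 'messages' keys; this shape is hypothetical.
    """
    q = query.lower()
    hits = []
    for t in threads:
        fields = [("title", t["title"])] + [("body", m) for m in t["messages"]]
        for field, text in fields:
            i = text.lower().find(q)
            if i >= 0:
                start = max(0, i - context)
                snippet = text[start:i + len(q) + context]
                hits.append({"thread": t["title"], "field": field,
                             "snippet": snippet})
                break  # first hit wins; one snippet per thread
    return hits
```

A real palette would rank and fuzzy-match rather than substring-scan, but the title-AND-body corpus plus snippet extraction is the shape of the feature.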

Smaller wins worth mentioning

Bulk kill on Live runs. Auto-logout on stale auth, with cloud /auth/refresh 500→401 hardening. A Compare crash fix (PG numeric strings are now coerced). And a wider replay link window, so slow group stages link to their traces correctly.

What’s next

Two big chunks on the roadmap:

  1. Cloud-scheduled batch replay — replay your last 14 regression failures against an alternative model overnight, see which ones recover. Needs a data-residency review pass before any prompts live server-side; planning that next.
  2. Embed-side analytics — for users who deploy bundles, page-where-the-chat-lives metrics. Time to first message, drop-off rate per turn, escalation keyword clusters. Held until we have enough customer-deployed bundles to give the panel real signal.

Plus the polish bucket: HALO integration, real-time team collaboration, mobile companion (read-only), retroactive cloud-trace ↔ local-prompt back-fill so older traces become replayable.


Release map

Tag · What
v2.0.1 · Pipelines + Compare sub-tabs, auth-mirror, RFC3339 datetime fix, History persistence wired
v2.0.2 · Deep regression detection (eval-score deltas + failing-example drill-down + AgentDetail banner)
v2.0.3 · Cost recommendations
v2.0.4 · Build fix
v2.1.0 · Replay infrastructure (interactive)
v2.1.1 · Wired trace upload for chat-pane single-agent + no-agent paths
v2.1.2 · Persisted streaming dispatches into execution_logs
v2.1.3 · Replay source response from local fallback
v2.1.4 · Estimated cost capture for desktop dispatches + replay panel
v2.1.5 · Automations canvas polish (multi-source canvas + click-through to Insights)
v2.1.6 · ⌘K palette spans full workspace (groups, schedules, secrets, quick actions)
v2.1.7 · Chat thread search corpus in ⌘K
v2.1.8 · Bulk kill on Live runs
v2.1.9 · Coerce PG numeric strings (Compare crash fix)
v2.1.10 · Auto-logout on stale auth + cloud /auth/refresh 500→401 hardening
v2.1.11 · Widen replay link window so slow group stages link correctly

Try ATO v2.1

Every feature free with a sign-up — no payment, no credit card, just an email.

Download for Mac / Win / Linux →

— Beatriz Nigri