What we shipped: ATO becomes the ops layer for multi-runtime agents
TL;DR. Two days. Sixteen releases (v2.0.1 → v2.1.11). The v2.1 story we’d been promising — “multi-runtime differentiated observability” — became something you can actually use. Replay any prompt against a different runtime. Catch regressions before users do. Cut cost without losing quality. Find any conversation, agent, group, or schedule with one keystroke.
The headline: Replay
You shipped a Claude agent to production. Last week your eval score dropped from 0.91 to 0.74. You suspect the model swap. You want to know: would Codex have answered correctly on those failing prompts?
ATO can now answer that. From any cloud trace, click Replay → pick a target runtime → ATO re-dispatches the original prompt against your chosen model, captures the response, and shows source vs replay side-by-side with duration + estimated cost delta.
No prompt re-typing. No “guess what I asked.” Just a real A/B against the actual conversation.
Caveat we’re being honest about: replay reads prompts from your local execution log, so it works for traces dispatched on this machine. Cross-device replay needs server-side credential vault + PII review — that’s the next release, not this one.
Regression detection that earns the word “detection”
Every config change to your agent — model swap, role-models edit, system prompt rewrite — gets recorded automatically. The Insights → Regressions panel joins those changes against trace stats and surfaces the ones with meaningful deltas:
- success rate drop ≥10pp
- p95 latency spike ≥50%
- cost per call up ≥25%
- eval score drop ≥15pp (new) — catches the case where the model still runs but the answers got worse, which a pure ok-rate diff misses entirely
Each row drills into the actual failing examples (post-change traces with their prompts and errors). And every agent’s detail page now carries a banner above the tab nav when it has an active regression — you don’t have to remember to check.
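The four thresholds above can be sketched as a simple rule check. This is an illustrative sketch only — `TraceStats` and `detectRegressions` are hypothetical names, not ATO internals:

```typescript
// Hypothetical sketch of the regression rules listed above.
interface TraceStats {
  successRate: number;  // 0..1
  p95LatencyMs: number;
  costPerCall: number;  // USD
  evalScore: number;    // 0..1
}

function detectRegressions(before: TraceStats, after: TraceStats): string[] {
  const flags: string[] = [];
  if (before.successRate - after.successRate >= 0.10) flags.push("success-rate drop ≥10pp");
  if (after.p95LatencyMs >= before.p95LatencyMs * 1.5) flags.push("p95 latency spike ≥50%");
  if (after.costPerCall >= before.costPerCall * 1.25) flags.push("cost per call up ≥25%");
  if (before.evalScore - after.evalScore >= 0.15) flags.push("eval-score drop ≥15pp");
  return flags;
}
```

Note the last rule is the one a pure ok-rate diff can’t catch: every call still “succeeds,” but the eval score falls off a cliff.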
Cost recommendations that show their work
We’re not going to claim “Haiku is cheaper” if you’ve never run on Haiku. That’s just guessing.
What ATO will do: when you have actual historical data on the same agent across multiple runtimes, the new cost recommendations panel surfaces concrete swaps where the alternative is meaningfully cheaper at preserved quality. Quality guards: ≥30% cheaper, ok-rate within 10pp, eval-score within 5pp.
> `@code-writer · claude → codex · −59% per call · projected $1.01/mo at this volume`
Sorted by absolute monthly impact. Renders nothing if no rec qualifies — better to show you nothing than fake confidence.
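The three quality guards reduce to one predicate. A minimal sketch, assuming per-runtime stats aggregated from historical traces — `RuntimeStats` and `qualifiesForSwap` are illustrative names, not ATO’s actual API:

```typescript
// Hypothetical guard check behind a cost recommendation.
interface RuntimeStats {
  costPerCall: number; // USD, from actual historical traces on this agent
  okRate: number;      // 0..1
  evalScore: number;   // 0..1
}

function qualifiesForSwap(current: RuntimeStats, alternative: RuntimeStats): boolean {
  const cheaperEnough = alternative.costPerCall <= current.costPerCall * 0.70; // ≥30% cheaper
  const okRateHeld = current.okRate - alternative.okRate <= 0.10;              // ok-rate within 10pp
  const evalHeld = current.evalScore - alternative.evalScore <= 0.05;          // eval-score within 5pp
  return cheaperEnough && okRateHeld && evalHeld;
}
```

If no pair passes all three, the panel renders nothing — that’s the “no fake confidence” rule in code form.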
And every estimate now has a real number behind it. Desktop dispatches estimate cost at trace time (token count × per-million pricing for known models, with an explicit “est.” badge so you know it’s not from a billing receipt). The “—” we used to show in Compare and Replay is gone.
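The trace-time estimate is just token count × per-million pricing against a table of known models. A sketch under that assumption — the price values here are placeholders, not ATO’s actual pricing data:

```typescript
// Placeholder price table; real entries come from known-model pricing data.
const PRICE_PER_MILLION_TOKENS: Record<string, { input: number; output: number }> = {
  "example-model": { input: 3.0, output: 15.0 }, // USD per 1M tokens (illustrative)
};

// Returns an estimated USD cost, or null for unknown models —
// better to keep showing nothing than to guess a price.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number | null {
  const price = PRICE_PER_MILLION_TOKENS[model];
  if (!price) return null;
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}
```

The explicit “est.” badge matters: this number is computed, not read off a billing receipt.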
Pipelines + Compare get their own homes
Multi-stage dispatches (sequential groups, routed groups, anything that fans out across runtimes) now have a dedicated Pipelines sub-tab. One row per pipeline, click into the per-stage flow with handoff arrows + per-stage timing + files touched.
Trace comparison gets the same treatment: a kind-agnostic Compare sub-tab. We had to extract this from the External tab’s drill-down because it didn’t make sense to require an agent be deployed-as-bundle to compare its runs. Now any agent with ≥2 cloud traces is comparable.
Automations becomes the canonical canvas
Runs → Automations used to be a flow chart for skills with `## Step` headers. We turned it into the canvas for everything that runs without you in the loop: routed groups, sequential pipelines, scheduled cron jobs, agent hooks, skill flows. All on one canvas, color-coded by source, with live status decorated from your trace history. Click any node → “View runs” → jumps straight to that agent’s trace explorer.
This is what “ops layer for multi-runtime agents” looks like from a single view.
⌘K spans the workspace
Hit ⌘K from anywhere. Search agents, groups, schedules, secrets, MCPs, projects — and now your chat history too (matches against thread titles AND message bodies, with snippets). Plus a Quick Actions list that jumps directly to any Insights sub-tab in one keystroke.
Found yourself wondering “where did I ask about that thing last week”? Type two characters of it. Done.
Smaller wins worth mentioning
- Bulk kill on Live runs — N stuck dispatches gone in one click.
- Auto-logout on stale auth — when your token dies, you see “Sign in for Pro” instead of every Pro panel mysteriously empty. Visible signal, recoverable. (We also fixed the cloud-side `/auth/refresh` 500 that was the real cause — bcrypt was throwing on a malformed legacy hash row, now wrapped per-row.)
- History tab finally populates — every dispatch lands in Runs → History. It had been silently broken since v1.5; now wired through `prompt_agent_inner` so cron / MCP / UI dispatches all converge.
- Compare modal stops crashing on cost values — PostgreSQL `DECIMAL(10,6)` columns serialize as strings (`"0.014200"`) through node-postgres, and `.toFixed()` on a string crashes. Latent bug; only fired now that v2.1.4 actually populates cost. Fixed with a defensive `asNumber()` coerce at every render path.
- Replay link window widened — group stages that take longer than 10 seconds (Codex pipelines especially) used to silent-fail their replay link. Window is now `[-30s, +5min]` from `started_at`; slow stages get linked correctly.
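The Compare-crash fix is worth a sketch: node-postgres hands back `DECIMAL` columns as strings to avoid float precision loss, so every render path has to coerce before formatting. The helper below is a minimal version of that idea — the exact shape of ATO’s `asNumber()` may differ:

```typescript
// Defensive coercion: node-postgres serializes DECIMAL columns as strings
// ("0.014200"), and calling .toFixed() on a string throws.
function asNumber(value: unknown): number | null {
  if (typeof value === "number") return Number.isFinite(value) ? value : null;
  if (typeof value === "string") {
    const n = Number.parseFloat(value);
    return Number.isNaN(n) ? null : n;
  }
  return null;
}

// Render path: never format a raw cost value directly.
function formatCost(raw: unknown): string {
  const n = asNumber(raw);
  return n === null ? "—" : `$${n.toFixed(4)}`;
}
```

The `null` branch is what keeps “no data” rendering as “—” instead of crashing the modal.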
What’s next
Two big chunks on the roadmap:
- Cloud-scheduled batch replay — replay your last 14 regression failures against an alternative model overnight, see which ones recover. Needs a data-residency review pass before any prompts live server-side; planning that next.
- Embed-side analytics — for users who deploy bundles, page-where-the-chat-lives metrics. Time to first message, drop-off rate per turn, escalation keyword clusters. Held until we have enough customer-deployed bundles to give the panel real signal.
Plus the polish bucket: HALO integration, real-time team collaboration, mobile companion (read-only), retroactive cloud-trace ↔ local-prompt back-fill so older traces become replayable.
Release map
| Tag | What |
|---|---|
| v2.0.1 | Pipelines + Compare sub-tabs, auth-mirror, RFC3339 datetime fix, History persistence wired |
| v2.0.2 | Deep regression detection (eval-score deltas + failing-example drill-down + AgentDetail banner) |
| v2.0.3 | Cost recommendations |
| v2.0.4 | Build fix |
| v2.1.0 | Replay infrastructure (interactive) |
| v2.1.1 | Wired trace upload for chat-pane single-agent + no-agent paths |
| v2.1.2 | Persisted streaming dispatches into execution_logs |
| v2.1.3 | Replay source response from local fallback |
| v2.1.4 | Estimated cost capture for desktop dispatches + replay panel |
| v2.1.5 | Automations canvas polish (multi-source canvas + click-through to Insights) |
| v2.1.6 | ⌘K palette spans full workspace (groups, schedules, secrets, quick actions) |
| v2.1.7 | Chat thread search corpus in ⌘K |
| v2.1.8 | Bulk kill on Live runs |
| v2.1.9 | Coerce PG numeric strings (Compare crash fix) |
| v2.1.10 | Auto-logout on stale auth + cloud /auth/refresh 500→401 hardening |
| v2.1.11 | Widen replay link window so slow group stages link correctly |
— Beatriz Nigri