2026-05-14 · Multi-LLM · 10 min read

Karpathy’s premise, gstack’s skills, and the two-layer flow for decisions with LLMs

In November, Andrej Karpathy shipped a small repo called llm-council. The idea: instead of asking one LLM a hard question, you ask a council of them. Each model answers independently, they review each other’s answers anonymously, and a final “Chairman LLM” compiles the answer.

18,695 stars. 3,616 forks. README note: “I’m not going to support it in any way, it’s provided here as is for other people’s inspiration.”

Karpathy was making a specific claim with this repo. Not “here’s a tool you should use” — he wouldn’t have walked away if it were that. The claim is about the shape of LLM cognition: a council of LLMs, arguing against each other in a shared context, produces more reliable answers than any single LLM in the council. This post is about what that premise actually buys you, where it falls short on its own, and why gstack’s skill pattern is the second half of the same loop.

Why the council works at all

The straightforward read is that LLMs in a council “check each other’s work,” like reviewers on a paper. That’s true but understates the mechanism.

Different frontier models are trained on different corpora, with different RLHF signals, different safety filters, different post-training recipes. The result is that their error distributions diverge. Where Claude is confident, Gemini might hedge. Where MiniMax over-indexes on the recent context, Claude might over-index on the system prompt. These aren’t random differences; they’re structural, and they show up consistently across many tasks.

When a single LLM answers a hard question, you get one error distribution: whatever that model is biased toward. When three LLMs answer the same question in a shared session, you get three. The set of errors all three would make is much smaller than the set any one of them makes alone. The errors don’t cancel out perfectly, but each one is harder to sustain in front of two other models that would have erred differently.
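A rough way to put numbers on that intuition, under the strong and admittedly unrealistic assumption that the models err independently. The 20% per-model error rate below is an illustrative figure, not a measurement, and the function names are ours.

```python
from itertools import product

def unanimous_error(p_errors):
    """Probability that every model is wrong at once (independence assumed)."""
    prob = 1.0
    for p in p_errors:
        prob *= p
    return prob

def majority_error(p_errors):
    """Probability that more than half of the models are wrong (independence assumed)."""
    n = len(p_errors)
    total = 0.0
    # Enumerate every right/wrong combination; True means "this model is wrong".
    for outcome in product([True, False], repeat=n):
        prob = 1.0
        for wrong, p in zip(outcome, p_errors):
            prob *= p if wrong else (1.0 - p)
        if sum(outcome) > n / 2:
            total += prob
    return total

rates = [0.2, 0.2, 0.2]  # illustrative per-model error rates
print(round(unanimous_error(rates), 3))  # 0.008
print(round(majority_error(rates), 3))   # 0.104
```

Real models share training data, so their errors are correlated and the true figures will be worse than these. The sketch only shows the direction of the effect, not its size.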

This is also why council members need to see each other’s answers. Karpathy’s repo runs the LLMs in three sequential stages: first opinions independently, then anonymized cross-review, then a final synthesizer. The cross-review stage is where the catching happens: one LLM looks at another’s answer and either flags issues, refines the framing, or pushes back. Without that step, you have three independent answers and no convergence; you have a fork in the road, not a decision.
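The three stages can be sketched in a few lines. This is a minimal illustration, not the code in Karpathy’s repo: the Model type, run_council, the prompt wording, and the stub models are all hypothetical stand-ins for real API clients.

```python
import random
from typing import Callable, Dict, List

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out

def run_council(question: str, members: Dict[str, Model], chairman: Model,
                seed: int = 0) -> str:
    # Stage 1: every member answers independently.
    answers = {name: ask(question) for name, ask in members.items()}

    # Stage 2: each member reviews the others' answers with names stripped.
    rng = random.Random(seed)
    reviews: List[str] = []
    for name, ask in members.items():
        others = [text for other, text in answers.items() if other != name]
        rng.shuffle(others)  # ordering should not hint at who wrote what
        labelled = "\n".join(
            f"Response {chr(65 + i)}: {text}" for i, text in enumerate(others)
        )
        reviews.append(ask(f"Question: {question}\nReview these answers:\n{labelled}"))

    # Stage 3: the chairman sees all answers and reviews and compiles one reply.
    material = "\n".join(list(answers.values()) + reviews)
    return chairman(f"Question: {question}\nMaterial:\n{material}\nWrite the final answer.")

def stub(tag: str) -> Model:
    # Stand-in for a real model client; returns canned text.
    def ask(prompt: str) -> str:
        return f"review-{tag}" if "Review these answers" in prompt else f"answer-{tag}"
    return ask

seen: List[str] = []  # records what the chairman was shown

def chairman(prompt: str) -> str:
    seen.append(prompt)
    return "final"

result = run_council("Which sentence?", {"x": stub("1"), "y": stub("2"), "z": stub("3")}, chairman)
```

The design point is stage 2: reviewers never see their own answer or any author names, so the review has to be about the content.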

The premise is sound. We’ve been using a version of this pattern for two months and the difference compared to single-LLM consultation is consistent — not magic, but consistent.

What the council doesn’t catch

The premise is sound. The pattern in its plain form is not enough.

Run a council debate for five rounds and you’ll see two failure modes show up reliably.

Polite-agreement bias. LLMs are trained on conversational data where humans signal agreement at the end of long debates. Typically around round 4 or 5, and sometimes much earlier, models start signaling closure even when one of them still has a substantive objection. They soften the objection into a caveat, the caveat into a footnote, the footnote into nothing. This pull toward closure is hard to escape from inside the conversation.

Synthesis bias. The model that proposes the closing answer is, by the structure of its turn, in synthesis mode. It’s asking “how do I tie this all together?” The other models then evaluate the closing answer against the conversation. They check whether it captures what was said. They do not check whether it’s better than the baseline that existed before the conversation started, because that baseline isn’t in the conversation context. Synthesis mode is structurally bad at A/B comparison.

Both biases are real. We hit them both today.

The case study, briefly

We were consolidating five drifting positioning sentences in our own product docs into one mission statement. Founder bias would have picked whichever sentence we wrote last, so we ran the council pattern instead: five rounds, Claude as anchor, MiniMax as designated contrarian, Gemini as synthesizer. Total cost: $0.058. Tool calls into our own repo to verify claims. Audit trail in local SQLite.

Round 1 hit polite-agreement bias on the first bridge — Gemini conceded Claude’s position cleanly and the bridge loop fired [CONSENSUS]. We forced Round 2 by dispatching to MiniMax with an explicit steelman of the contrarian view, which broke it open.
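One way to make that manual override automatic is a guard that ignores an early consensus signal. The sketch below is an assumption about how such a guard could look, not ATO’s implementation: Turn, may_close, and the min_rounds threshold of 3 are hypothetical, and only the [CONSENSUS] marker comes from our session.

```python
from dataclasses import dataclass
from typing import List

CONSENSUS_TAG = "[CONSENSUS]"  # marker from the session; the handling here is assumed

@dataclass
class Turn:
    round_no: int
    speaker: str
    text: str

def may_close(turns: List[Turn], contrarian: str, min_rounds: int = 3) -> bool:
    """Accept a consensus signal only after enough rounds have run
    and the designated contrarian has had at least one turn."""
    if not turns or CONSENSUS_TAG not in turns[-1].text:
        return False
    rounds_run = max(t.round_no for t in turns)
    contrarian_spoke = any(t.speaker == contrarian for t in turns)
    return rounds_run >= min_rounds and contrarian_spoke

# Round 1 agreement with no contrarian turn: the guard refuses to close.
early = [
    Turn(1, "claude", "Position stated."),
    Turn(1, "gemini", "Agreed. [CONSENSUS]"),
]
# After the contrarian has spoken and three rounds have run, closing is allowed.
later = early + [
    Turn(2, "minimax", "Steelman of the opposing view."),
    Turn(3, "claude", "Revised position. [CONSENSUS]"),
]
early_ok = may_close(early, contrarian="minimax")
later_ok = may_close(later, contrarian="minimax")
```

A guard like this only delays closure. It does nothing about whether the eventual consensus is better than the starting point.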

Rounds 3 and 4 produced real disagreement, real concession, and a falsifier rule for the live debate (a specific metric that would settle whichever question the debate couldn’t resolve on its own). Round 5 closed with a tight synthesis memo from Claude, including a proposed new mission sentence. MiniMax and Gemini both signed NO VETO on it.

By every reasonable signal, the council had done its job. Three independent LLMs from three vendors had converged on the same sentence with explicit non-veto from both non-anchors. We wrote it into the strategy doc. Aligned six other files. Closed the task.

The sentence was wrong. We just didn’t know it yet.

gstack’s skills, and why they’re the second layer

Before pushing, we ran /plan-ceo-review — a skill from gstack, the tooling layer Garry Tan’s team has been shipping. A skill in this sense is a structured persona: a prompt template with mandatory sections, embedded checklists, decision rules. You load it on top of any model and the model assumes that posture for the duration of the task.

The key insight took us a while to land: a skill is not a different LLM. It’s the same LLM in a different task-posture. Same body, different hat.

That distinction matters because both the polite-agreement bias and the synthesis bias are task-posture phenomena, not model phenomena. They’re a consequence of how the conversation has been framed and how the model interprets its current role. Loading a different persona breaks both biases without needing a different model.

The /plan-ceo-review skill has eleven sections of structured review. One of them — called “0C-bis: Implementation Alternatives” — is mandatory: before any final decision lands, the reviewer must produce 2-3 distinct alternative approaches with explicit pros, cons, and a recommendation comparing them. It’s a checklist item the council debate never enforced on itself.
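The shape of that gate is easy to express as a check. This is our own illustration of the rule as described above, not gstack’s code: Alternative, check_alternatives, and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Alternative:
    label: str
    summary: str
    pros: List[str] = field(default_factory=list)
    cons: List[str] = field(default_factory=list)

def check_alternatives(alts: List[Alternative], recommendation: str) -> List[str]:
    """Return the checklist failures; an empty list means the gate passes."""
    problems = []
    if not 2 <= len(alts) <= 3:
        problems.append(f"expected 2-3 alternatives, got {len(alts)}")
    for alt in alts:
        if not alt.pros or not alt.cons:
            problems.append(f"{alt.label}: needs at least one pro and one con")
    if recommendation not in {alt.label for alt in alts}:
        problems.append("recommendation must name one of the alternatives")
    return problems

# A review that only restates the proposal fails the gate.
lonely = [Alternative("A", "ship the proposal as written", pros=["already agreed"])]
failures = check_alternatives(lonely, recommendation="A")

# A review with two weighed options and a pick passes.
weighed = [
    Alternative("A", "ship the proposal", pros=["already agreed"], cons=["untested"]),
    Alternative("B", "keep the baseline", pros=["known quantity"], cons=["no change"]),
]
passes = check_alternatives(weighed, recommendation="B")
```

The check is deliberately dumb. Its value is that the reviewer cannot sign off without putting the existing option on the table next to the new one.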

When we ran the CEO review on our session output, the skill produced:

Approach A — the synthesized sentence the war room had just committed to.
Approach B — roll back to the prior sentence, keep the layered framework.
Approach C — defer the sentence entirely, ship the falsifier instrumentation first, let data pick in 30 days.

And then a recommendation: B. Reason given: the new sentence (Approach A) was more abstract than the sentence it was replacing. “Platform” is more abstract than “decision cockpit.” “Multi-LLM work” is more abstract than “structured, tool-verified sessions.” When a rewrite is thinner than the original, the rewrite is bad.

That observation is obvious in retrospect. The council never made it because the council’s task was “produce a consolidated mission,” not “compare the proposed mission to the existing sentence on a concreteness axis.” The skill’s checklist forced the comparison that the debate’s prompt had not.

We rolled back. The framework, SWOT, scope cuts, falsifier rule, and pricing recommendation all earned their keep — those were the war room’s real artifacts. The replacement sentence didn’t. We kept the prior mission sentence and the framework on top.

The two layers, and why each catches what the other misses

The pattern that came out of this:

Layer 1 — the council. Multiple LLMs as themselves, in a shared session, with tool access into your real repo. Catches the errors of individual LLM cognition because their error distributions diverge.
Layer 2 — the specialist skill. One LLM, loaded with a structured review persona that runs a checklist the council didn’t enforce on itself. Catches the errors of the council itself — the polite-agreement bias and the synthesis bias that the conversational format produces.
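Wired together, the two layers look roughly like this. It is a sketch of the flow only: Council, Review, Decision, and decide are hypothetical names, and the stub review below has a hard-coded verdict purely to show the plumbing.

```python
from typing import Callable, NamedTuple, Tuple

Council = Callable[[str], str]                   # question -> converged proposal
Review = Callable[[str, str], Tuple[bool, str]]  # (proposal, baseline) -> (keep proposal?, reason)

class Decision(NamedTuple):
    chosen: str
    rolled_back: bool
    reason: str

def decide(question: str, baseline: str, council: Council, review: Review) -> Decision:
    # Layer 1: the council debates and returns its converged proposal.
    proposal = council(question)
    # Layer 2: the review is handed the pre-debate baseline explicitly,
    # because that baseline is not part of the council's conversation context.
    keep, reason = review(proposal, baseline)
    return Decision(proposal if keep else baseline, not keep, reason)

# Stubs standing in for a real debate and a real skill-wrapped review.
def stub_council(question: str) -> str:
    return "new sentence"

def stub_review(proposal: str, baseline: str) -> Tuple[bool, str]:
    return (False, "proposal is more abstract than the baseline")

outcome = decide("Which mission sentence?", "old sentence", stub_council, stub_review)
```

The only structural requirement is that the baseline travels around the council rather than through it, so the second layer can compare against something the debate never saw.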

These aren’t redundant. They attack different failure modes. The council’s job is to surface disagreement between models. The skill’s job is to apply external structure to a conclusion the conversation has already reached. You need both because the failure modes are structurally different.

Each layer is independently cheap. A five-round multi-LLM debate cost us under ten cents. A skill-wrapped review is a single dispatch — another few cents at most. The audit trail of both lives in local SQLite. The whole flow runs on a laptop in a couple of hours.
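For a sense of how little an audit trail needs, here is a minimal sketch using Python’s built-in sqlite3 module. The table layout, column names, and helper functions are assumptions made for illustration, not ATO’s actual schema, and the costs are made-up values.

```python
import sqlite3

# In-memory for the example; pass a file path instead to keep the log on disk.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE turns (
        session_id TEXT    NOT NULL,
        round_no   INTEGER NOT NULL,
        model      TEXT    NOT NULL,
        cost_usd   REAL    NOT NULL,
        content    TEXT    NOT NULL
    )
""")

def log_turn(session_id, round_no, model, cost_usd, content):
    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.execute(
            "INSERT INTO turns VALUES (?, ?, ?, ?, ?)",
            (session_id, round_no, model, cost_usd, content),
        )

def session_cost(session_id):
    # COALESCE makes an unknown session report 0 instead of NULL.
    row = conn.execute(
        "SELECT COALESCE(SUM(cost_usd), 0) FROM turns WHERE session_id = ?",
        (session_id,),
    ).fetchone()
    return row[0]

log_turn("demo", 1, "model-a", 0.01, "first opinion")
log_turn("demo", 1, "model-b", 0.02, "cross-review")
total = session_cost("demo")
```

With every turn keyed by session id, both the transcript and the running cost of a decision are one query away.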

What we’re building, and why

We started ATO because this kind of structured multi-LLM work is annoying to do otherwise. You can run llm-council, but it’s locked to OpenRouter: if you already pay for Claude Max and Codex CLI, you pay a third time to use them through llm-council’s gateway. You can write the loop yourself, but then you don’t have a session abstraction, tool calls, or an audit trail. You can use LangChain, but it has gone stale.

ATO is the part that makes Layer 1 cheap: sticky multi-LLM sessions, tool calls into your repo, bridging between runtimes by @<runtime> mentions, audit trail in local SQLite, no cloud, MIT, bring your own keys or your existing CLI subscriptions. The pattern is Karpathy’s; the maintained implementation is ours.

gstack is the part that makes Layer 2 cheap: a set of skills (/plan-ceo-review, /plan-eng-review, /plan-design-review, /office-hours, others) that load as personas inside whatever harness you’re running — Claude Code, Cursor agent mode, Codex. The pattern is Garry’s.

The combination is the loop we use. Today’s decision is one example of it running in anger.

Building ATO in public, using ATO

We’ve been making every strategic decision about ATO inside ATO itself. The positioning war room two weeks ago was inside ATO, recorded as a live demo. Today’s goal-consolidation session was inside ATO. Every config decision, every pricing-tier debate, every scope-cut argument has the session id and the cost in the audit log.

This isn’t a marketing posture. It’s a forcing function. If the product doesn’t hold up under our own use, we’d find out before any user did. If the multi-LLM debate pattern is real cognitive leverage, our own decisions should be visibly better with it than without it. If the gstack-skill layer catches things the debate misses, our own strategy doc should reflect those catches.

So far, the last two have held. Today’s case is the cleanest example we’ve had: the council produced a wrong answer, the skill caught it, the wrong answer didn’t ship. Session id 2581c741-88cb-4cf1-b81a-7a942d7e557f lives in our local SQLite, retrievable any time, full transcript intact.

Karpathy showed the primitive. The maintained version is in ATO. The second layer is in gstack. The whole stack is open source, local-first, and yours to break, fork, or improve.