Why is multi-LLM development still so painful in 2026?
If you spend your day talking to LLMs, you have already noticed: models are not interchangeable. Claude rewrites a function differently than GPT-4. Gemini handles long context better than either. Codex is sharper at security review. The honest take is that the “best” model is task-specific, and on any non-trivial workload you should be using more than one.
And yet most teams pick a single LLM and route everything through it. Not because they don’t know better, but because the operational cost of switching models per task — remembering which CLI to open, copy-pasting outputs between tabs, losing context on a runtime swap — usually outweighs the marginal quality gain. Multi-LLM is a strategy problem with a tooling root cause.
The benchmarks say models really are different
Look at any released model card and you can see this directly. A few representative numbers from late-2024 / 2025 releases (these shift on every release; check the latest):
- SWE-bench Verified (real GitHub issues solved): Claude 3.5 Sonnet: ~49%. GPT-4o: ~33%. Gemini 1.5 Pro: ~19%. Source: Anthropic SWE-bench Verified scoreboard, 2024.
- GSM8K (grade-school math): GPT-4o: ~95.8%. Claude 3.5 Sonnet: ~96.4%. Gemini 1.5 Pro: ~91.7%. Source: vendor model cards.
- HumanEval (function completion, pass@1): GPT-4o: ~90.2%. Claude 3.5 Sonnet: ~92.0%. Gemini 1.5 Pro: ~84.1%. Source: vendor releases.
- MMLU-Pro (broad knowledge, harder than MMLU): Claude 3.5 Sonnet ~78%, GPT-4o ~72.6%, Gemini 1.5 Pro ~75.8%. Source: respective tech reports.
- Long context: Gemini 1.5 Pro’s 1M-token window is meaningfully larger than Claude’s 200k or GPT-4o’s 128k. The Needle-in-a-Haystack retrieval scores are also higher for Gemini at the long end. Source: Google DeepMind 1.5 tech report, 2024.
- Tool use: BFCL (Berkeley Function-Calling Leaderboard) consistently shows different leaders depending on the call shape (multi-turn, parallel, irrelevance detection). No single model dominates.
The signal across all of these: models lead in different categories. There is no “best” model. There is the model that does your specific task a little better, and the differences are large enough to matter when you ship to real users.
Why teams pick one anyway
Three structural reasons, each of them honest:
1. Each runtime is its own ecosystem. Claude Code lives in Anthropic’s tools. Codex lives in OpenAI’s. Each has its own CLI, its own session model, its own way of handling context. Switching means starting from zero — new conversation, no shared history, often re-explaining the project to the new model.
2. Vendor lock-in is good business. Anthropic and OpenAI both invest heavily in their own developer tooling. Smooth handoffs between vendors would undercut that lock-in, so neither is going to build them. The community fills the gap with shell scripts that work for a single power user but never feel like real infrastructure.
3. The conceptual model isn’t standardized. “Multi-agent” is a buzzword with no canonical definition. Different frameworks invent different abstractions: chains, graphs, swarms, supervisors. None of them have caught on as the way to describe how a writer agent and a reviewer agent share context across runtimes.
The result: the smartest team in the world picks Claude or GPT for everything because the alternative is operationally impossible.
Core problems and what works
Problem: switching runtimes loses the conversation
You’re mid-thread with Claude. The next question would be better answered by Codex’s strength — long-form code review, say. Switching means typing the prompt fresh into a new tab, re-attaching your project context, and hoping you remember the relevant pieces of the previous conversation.
What works: own your conversation thread separately from any runtime. Treat the runtime as a stateless dispatcher. On every turn, stitch the relevant history into the prompt yourself before calling the runtime. The runtime sees a fresh prompt with the relevant conversation folded in; you keep the canonical thread. A mid-thread runtime swap becomes a one-line change to which API you call.
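A minimal sketch of that shape in Python. The Thread, Turn, and ask names are illustrative, and the runtime clients are plain callables rather than real vendor SDK calls:

```python
# A minimal sketch of "own the thread, treat runtimes as stateless dispatchers".
# Thread/Turn/ask are illustrative names; wiring the callables to actual vendor
# SDKs is left out.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Turn:
    role: str                    # "user" or "assistant"
    content: str
    runtime: str | None = None   # which runtime produced this turn, if any

@dataclass
class Thread:
    turns: list[Turn] = field(default_factory=list)

    def stitch(self, new_prompt: str, window: int = 6) -> str:
        """Fold the most recent turns into a fresh, self-contained prompt."""
        history = "\n".join(f"{t.role}: {t.content}" for t in self.turns[-window:])
        return f"Conversation so far:\n{history}\n\nuser: {new_prompt}"

def ask(thread: Thread, prompt: str, runtime_name: str,
        runtimes: dict[str, Callable[[str], str]]) -> str:
    """The runtime sees one stitched prompt; the canonical thread stays ours."""
    reply = runtimes[runtime_name](thread.stitch(prompt))
    thread.turns.append(Turn("user", prompt))
    thread.turns.append(Turn("assistant", reply, runtime=runtime_name))
    return reply

# Swapping runtimes mid-thread is a one-argument change:
# ask(thread, "Now review that diff for injection risks", "codex", runtimes)
```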
Problem: chained agents (writer → reviewer) require manual handoff
The pattern is clean — one agent drafts, another reviews. The execution is awful: copy stage 1’s output into stage 2’s prompt, lose track of which version was reviewed, can’t answer “how long did the whole pipeline take?” because each stage is in a different terminal.
What works: represent the pipeline as a first-class object. A “sequential group” with N children, each on its own runtime. Firing the group runs the chain end to end, the trace is correlated by a shared parent_run_id, and the total duration / cost / handoff points are queryable. The user types one prompt; the framework handles the orchestration.
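A sketch of what "first-class" could mean here. SequentialGroup, Stage, StageRun, and parent_run_id are illustrative names, not an existing framework's API:

```python
# A writer -> reviewer chain as a first-class sequential group (illustrative).
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Stage:
    name: str                    # "writer", "reviewer", ...
    runtime: str                 # which runtime this stage is bound to
    call: Callable[[str], str]   # wraps that runtime's API

@dataclass
class StageRun:
    parent_run_id: str
    stage: str
    runtime: str
    output: str
    seconds: float

@dataclass
class SequentialGroup:
    stages: list[Stage]
    trace: list[StageRun] = field(default_factory=list)

    def fire(self, prompt: str) -> str:
        parent_run_id = str(uuid.uuid4())     # correlates every stage in one trace
        current = prompt
        for stage in self.stages:
            start = time.monotonic()
            current = stage.call(current)     # stage N's output feeds stage N+1
            self.trace.append(StageRun(parent_run_id, stage.name, stage.runtime,
                                       current, time.monotonic() - start))
        return current

# pipeline = SequentialGroup([Stage("writer", "claude", draft_fn),
#                             Stage("reviewer", "codex", review_fn)])
# pipeline.fire("Add rate limiting to the upload endpoint")
# sum(r.seconds for r in pipeline.trace)   # "how long did the whole pipeline take?"
```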
Problem: routing to specialists is hand-coded
Customer support agents that route to specialist sub-agents (billing / technical / general) are a strong pattern. But most implementations hard-code the routing logic in if-statements or rely on a frontier model to do classification — which is wasteful, since classification is a cheap-model task.
What works: make the router a first-class agent with its own (cheap, fast) model. Children are specialist agents with their own (slower, more capable) models. Rules-first routing for the easy cases, LLM-fallback for the ambiguous ones. The pattern is well-described in the “router + specialist” literature; the question is whether your tooling supports it natively or requires you to build it.
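A rough sketch of rules-first routing with a cheap-model fallback. The rule set, the classify_with_cheap_model callable, and the specialist labels are placeholders:

```python
# Rules-first routing, LLM fallback on a cheap model (all names are placeholders).
import re
from typing import Callable

RULES = {
    "billing":   re.compile(r"\b(invoice|refund|charge|subscription)\b", re.I),
    "technical": re.compile(r"\b(error|crash|timeout|stack trace|bug)\b", re.I),
}

def route(message: str,
          specialists: dict[str, Callable[[str], str]],
          classify_with_cheap_model: Callable[[str], str]) -> str:
    # Easy, unambiguous cases never touch a model at all.
    for label, pattern in RULES.items():
        if pattern.search(message):
            return specialists[label](message)
    # Ambiguous cases go to a cheap, fast classifier, never to the
    # expensive response model, since classification is a cheap-model task.
    label = classify_with_cheap_model(message)
    return specialists.get(label, specialists["general"])(message)
```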
Problem: per-task model selection is invisible
Inside a single agent, different sub-tasks have different requirements. Routing classification: cheap and fast. Summarization of conversation history: cheap and fast. Final user-facing response: expensive and capable. Evaluator that grades the response: somewhere in between.
What works: agents should expose role-specific model slots: router, summarizer, response, evaluator. Default everything to inherit from response; let teams override per role. Most current frameworks make this awkward because they treat “the model” as a single setting, conflating capabilities that are actually distinct.
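One possible shape, assuming a simple slots object where unset roles inherit from response (model names are placeholders):

```python
# Role-specific model slots with inherit-from-response defaults (illustrative).
from dataclasses import dataclass

@dataclass
class ModelSlots:
    response: str                   # the one slot every agent must set
    router: str | None = None       # unset slots inherit from `response`
    summarizer: str | None = None
    evaluator: str | None = None

    def resolve(self, role: str) -> str:
        return getattr(self, role) or self.response

# Override only the cheap roles; the user-facing answer stays on the capable model.
slots = ModelSlots(response="big-capable-model",
                   router="small-fast-model",
                   summarizer="small-fast-model")
assert slots.resolve("router") == "small-fast-model"
assert slots.resolve("evaluator") == "big-capable-model"   # inherited
```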
Problem: there’s no standardized cross-runtime tool layer
MCP (the Model Context Protocol) is the closest thing the industry has. Any client speaking MCP can call any server speaking MCP — that’s genuinely good for tools. But MCP is a tool protocol, not a workflow protocol. It tells you how Claude can invoke your custom Python function. It doesn’t tell you how Claude hands an in-flight conversation to Codex.
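For contrast, the tool side really is that short. A minimal MCP tool server sketch using the Python SDK's FastMCP helper; the exact import path and decorator shape may differ between SDK versions:

```python
# A custom Python function exposed as an MCP tool (SDK details may vary by version).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("repo-tools")

@mcp.tool()
def count_todos(path: str) -> int:
    """Count TODO markers in a file."""
    with open(path, encoding="utf-8") as f:
        return sum(line.count("TODO") for line in f)

if __name__ == "__main__":
    mcp.run()   # speaks MCP over stdio; any MCP client can call count_todos
```

Nothing in that server knows which client is calling it, which is the part MCP solved. It says nothing about handing the surrounding conversation to a different runtime.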
What works (open question): the workflow layer is the part still being figured out. The right shape is probably runtime-agnostic dispatchers + canonical thread store + first-class orchestration primitives. LangChain tried; the community partially moved on. The next attempt will probably look more like Unix pipes — small, composable, opinionated — and less like an SDK.
The summary, plainly
Multi-LLM is the right strategy. Models genuinely lead in different categories, and the gap between “best model for everything” and “best model per task” is large enough to matter when you ship. The reason most teams haven’t adopted it is operational: copy-paste between three terminals scales to zero people. Solve the orchestration layer — conversation thread you own, runtime-agnostic dispatch, sequential / routed groups as first-class objects — and the pain goes away. Until someone standardizes that layer, every team has to assemble it themselves.
— Beatriz Nigri