Advanced · 18 min

Dynamic Model Selection

Route agent turns to cheaper models when the task is simple and to more powerful models when it is complex. The @wrap_model_call hook intercepts every LLM request, so middleware can swap the model based on conversation state, user tier, or cost targets. This article starts with whether you should route at all, walks through real cost math, covers the five ways routing silently fails in production, and ends with a 30-day rollout runbook.

Quick Reference

→ @wrap_model_call intercepts the model request before it reaches the LLM; use request.override(model=...) to swap
→ Default model in create_agent is the fallback; middleware upgrades selectively, not the other way around
→ Route by message count, tool result size, user plan, token budget, or any signal in state/context
→ Cost math first: routing only saves money when >50% of traffic can downgrade AND the cheap model handles those turns correctly
→ Structured output mismatch is the #1 silent failure; verify the cheaper model can produce the same schema before routing to it
→ Provider fallback (try/except in wrap_model_call) is not free; different providers have different tool calling behavior
→ Log every routing decision with the signal values that triggered it; you need this for eval and drift detection
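
The last point, logging each routing decision together with its signal values, might look like the minimal sketch below. The `RoutingDecision` record, its field names, and the `model_router` logger name are hypothetical choices for illustration, not part of any framework API; the signals mirror the ones listed above.

```python
import json
import logging
from dataclasses import asdict, dataclass

# Hypothetical logger name; use whatever your logging setup expects.
logger = logging.getLogger("model_router")


@dataclass(frozen=True)
class RoutingDecision:
    """One routing decision plus the signal values that triggered it (illustrative fields)."""
    selected_model: str
    reason: str
    message_count: int
    largest_tool_result_chars: int
    user_plan: str


def log_routing_decision(decision: RoutingDecision) -> str:
    """Serialize the decision as one JSON line, emit it, and return the line."""
    line = json.dumps(asdict(decision), sort_keys=True)
    logger.info(line)
    return line
```

One JSON line per decision keeps the log easy to aggregate later, for example to count how often each reason fires or to spot a threshold that has drifted.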

Should I Use Dynamic Model Selection?

Dynamic model selection adds complexity: you now have multiple models to test, routing logic to maintain, and thresholds to calibrate. The only reason to take that on is a real reduction in cost, and that requires a traffic mix in which a substantial fraction of turns are genuinely simple. Before building the middleware, answer these four questions.

A router only pays off when none of these simpler exits apply

Question	If yes	If no
Is your monthly LLM spend above ~$1,000/mo?	Routing overhead (code + eval + monitoring) is worth it	Single model — routing won't recover the implementation cost
Do >40% of your turns look genuinely simple (short messages, no large tool results)?	A cheap model can handle them — routing saves real money	All your traffic is complex — routing to a cheaper model risks quality on most turns
Does your cheaper model support all the tools and schemas your agent uses?	Safe to route to it	You'll hit structured output mismatches or tool call failures — fix this first
Do you have an eval harness that can measure quality per turn?	You can verify routing doesn't hurt quality	Build the eval before routing — otherwise you won't know when it breaks

When NOT to route

Don't route if: (1) Your agent relies on multi-turn reasoning where the cheap model's shorter context or weaker reasoning corrupts a chain across turns. (2) Your agent uses structured output and the cheap model hasn't been tested against your exact schemas. (3) Your p99 latency SLA is tight — routing adds code-path overhead even when no model swap occurs.

What Will It Actually Save?

Let's compute the savings for a real workload instead of asserting a percentage. Scenario: 100,000 turns/month, 800 avg input tokens, 200 avg output tokens per turn.
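
As a rough sketch of that arithmetic, the snippet below prices the scenario under two models. The turn count and token averages come from the scenario above; the per-million-token prices and the 50% downgrade fraction are placeholder assumptions chosen for round numbers, not any provider's actual pricing, so substitute your own rates.

```python
def monthly_cost(turns: int, input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Monthly cost in dollars, given average tokens per turn and prices per million tokens."""
    total_in = turns * input_tokens
    total_out = turns * output_tokens
    return (total_in * input_price_per_m + total_out * output_price_per_m) / 1_000_000


def blended_cost(strong_cost: float, cheap_cost: float, downgrade_fraction: float) -> float:
    """Cost when `downgrade_fraction` of turns go to the cheap model and the rest stay on the strong one."""
    return downgrade_fraction * cheap_cost + (1 - downgrade_fraction) * strong_cost


# Scenario from the text: 100,000 turns/month, 800 input and 200 output tokens per turn.
TURNS, IN_TOK, OUT_TOK = 100_000, 800, 200

# ASSUMED placeholder prices in dollars per million tokens (not real vendor pricing).
strong = monthly_cost(TURNS, IN_TOK, OUT_TOK, input_price_per_m=3.00, output_price_per_m=15.00)
cheap = monthly_cost(TURNS, IN_TOK, OUT_TOK, input_price_per_m=0.25, output_price_per_m=1.25)

# ASSUMED downgrade fraction: half of all turns are simple enough for the cheap model.
routed = blended_cost(strong, cheap, downgrade_fraction=0.5)

print(f"strong only: ${strong:.2f}")        # strong only: $540.00
print(f"cheap only:  ${cheap:.2f}")         # cheap only:  $45.00
print(f"50% routed:  ${routed:.2f}")        # 50% routed:  $292.50
print(f"saved:       ${strong - routed:.2f}")  # saved:       $247.50
```

Under these assumed prices the saving scales almost linearly with the downgrade fraction, which is why the fraction of genuinely simple turns matters more than the exact price gap.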

The @wrap_model_call Pattern

@wrap_model_call is the innermost middleware hook — it wraps the actual LLM API call. Every request passes through it, and you can inspect the full request state before deciding which model to use. The pattern is: examine signals in request.state and request.runtime.context, call request.override(model=selected_model), then pass to handler.
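
The examine, override, hand-off sequence can be sketched without the framework installed. In the snippet below, `FakeRequest` is a hypothetical stand-in that mimics only the two features named above (a `state` mapping and `override(model=...)`); the model names and the message threshold are placeholders. In a real agent the routing function would carry the @wrap_model_call decorator and receive the framework's own request object, and it could also read request.runtime.context, which this sketch omits.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Callable


@dataclass(frozen=True)
class FakeRequest:
    """Hypothetical stand-in for the framework's model request (illustration only)."""
    model: str
    state: dict[str, Any] = field(default_factory=dict)

    def override(self, **changes: Any) -> "FakeRequest":
        # Return a modified copy instead of mutating the original request.
        return replace(self, **changes)


CHEAP_MODEL = "cheap-model"    # placeholder: the agent's default model
STRONG_MODEL = "strong-model"  # placeholder: the upgrade target
MESSAGE_THRESHOLD = 10         # assumed cutoff; calibrate against your own evals


def route_by_message_count(request: FakeRequest,
                           handler: Callable[[FakeRequest], Any]) -> Any:
    """Upgrade to the strong model only when the conversation has grown long."""
    message_count = len(request.state.get("messages", []))
    if message_count > MESSAGE_THRESHOLD:
        request = request.override(model=STRONG_MODEL)
    # Otherwise the request keeps the default (cheap) model untouched.
    return handler(request)
```

The sketch keeps the cheap model as the default and upgrades selectively, which matches the Quick Reference guidance that the default model is the fallback.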
