Route, Don’t Guess

Most organizations are still picking models by vibe. That was fine when there were two options and a demo. It fails when your stack juggles a dozen language models, prices swing every quarter, and a single workflow hops from extraction to math to tool use in a few seconds.

A Google Research paper, UniRoute: Universal Model Routing for Efficient LLM Inference, is the catalyst for fixing this. In plain terms, UniRoute builds a compact “task map” by clustering real prompts, fingerprints each model by its error rate on those clusters, then routes each new request to the lowest-cost model that clears the quality bar for its cluster. No fragile, vendor-specific router retraining required, even when the model pool changes. The authors also show a clear excess-risk bound and strong results, including routing across more than 30 unseen LLMs in experiments. That makes the policy portable and auditable, which is exactly what leaders need when quality, cost, and governance all show up in the same meeting.

There is another shift arriving fast. NVIDIA Research argues in Small Language Models are the Future of Agentic AI that many agentic tasks are narrow, repetitive, and format-bound, and that SLMs are sufficiently powerful, operationally better, and more economical for a large share of real workloads. They call for heterogeneous systems that default to SLMs and invoke larger models sparingly when needed, and they even sketch a conversion algorithm to move agents from LLMs to SLMs. The economic and operational implications are large as adoption of agents accelerates.

This piece connects the two ideas. UniRoute gives you the brain that chooses the right model per task. The SLM position gives you the supply side that makes that choice far cheaper, faster, and easier to govern.

Why this moment matters

Three pressures converged this year.

  1. Model proliferation. New releases, silent weight updates, longer contexts, tool APIs, code interpreters. Your “default” in January is not your default in August.
  2. Workload diversity. A single product now handles a claims PDF, a support transcript, a table extraction, a chain-of-thought (CoT) math step, a code patch, and a policy-constrained email. These are different tasks with different failure modes.
  3. Budget and risk. Executives expect measurable savings and consistent quality. Regulators and internal risk teams expect an audit trail that makes sense.

Routing turns this chaos into a system. You represent tasks by clusters. You represent models by per-cluster error vectors. At runtime you select the least-cost model that meets the expected quality for that cluster, and you log why. UniRoute shows this approach holds up even as the model pool changes, thanks to a fingerprint computed from a small, representative prompt set.

On the supply side, the SLM position paper offers the right economic lens for agentic workloads. Many agent invocations are narrow and repetitive, and an SLM can be tuned to the exact format and tool calls the agent expects. The result, in their words, is a future where SLMs are the default for agentic AI, with larger generalists invoked selectively.

What UniRoute actually does in plain English

  • The problem. Model sprawl, diverse workloads, rising inference bills. Static routers tied to a fixed vendor lineup break when the pool changes.
  • The task map. Cluster a small, representative set of your real prompts into K groups.
  • Model fingerprints. For each candidate model, compute average error per cluster on that set. That K-dimensional error vector is the model’s capability fingerprint.
  • Cost-aware routing. At inference time, embed the new prompt, assign it to a cluster, then select the lowest-cost model whose expected error for that cluster clears the bar.
  • Why it generalizes. UniRoute’s cluster-based and learned-map variants estimate an optimal routing rule, with an excess-risk bound and strong results on public benchmarks, including routing among unseen LLMs.

Figure 1: UniRoute Cluster-Based Router

Jitkrittum et al. (2025). Universal Model Routing for Efficient LLM Inference. arXiv:2502.08773. Figure 1. https://arxiv.org/html/2502.08773v2#S5.SS1
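
To make that selection rule concrete, here is a minimal sketch in Python. The models, clusters, error rates, and prices are all invented for illustration; only the decision logic follows the cluster-based rule described above.

```python
import numpy as np

# Invented fingerprints: per-cluster error rate for 3 models over K = 4 clusters.
# Columns: extraction, math, policy rewrite, code (illustrative cluster names).
ERROR = np.array([
    [0.02, 0.30, 0.05, 0.25],  # tuned SLM: strong on extraction, weak on math
    [0.02, 0.08, 0.03, 0.10],  # mid-size generalist
    [0.01, 0.04, 0.02, 0.05],  # frontier model
])
COST = np.array([0.1, 1.0, 8.0])  # relative cost per call (illustrative)
QUALITY_BAR = 0.10                # maximum acceptable expected error

def route(cluster: int) -> int:
    """Return the cheapest model whose expected error clears the bar."""
    ok = np.flatnonzero(ERROR[:, cluster] <= QUALITY_BAR)
    if ok.size == 0:              # nothing clears the bar: fall back to best quality
        return int(ERROR[:, cluster].argmin())
    return int(ok[COST[ok].argmin()])

print(route(0))  # 0: the SLM wins the extraction cluster
print(route(1))  # 1: math needs the mid-size generalist
```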

That is a direct fit for the cost-to-cognition operating model I wrote about in “Blueprint for Building AI-Native Engineering Teams”: spend only the cognition you need for the outcome you want and escalate only when the task demands it.

Section 1 — Model choice is now a portfolio decision

A few years ago, we argued about parameter counts and leaderboards. Today, selection looks more like asset allocation. You balance risk, return, liquidity, and time. The assets are models. The regimes are workloads. The macro risks are regulatory and reputational.

Five realities shape the landscape.

1) Churn is structural. New models, longer contexts, new tool interfaces, better code interpreters. Static choices decay. UniRoute’s approach of representing each LLM via an error vector over prompt clusters puts your policy on stable ground even as the vendor lineup changes.

2) Workloads split into task manifolds. An intake email with three entities is not a calculator-assisted algebra step. A clause rewrite under a privacy policy is not a live coding edit with unit tests. Clustered task maps match reality.

3) Open-weight specialists matter. Tuned SLMs often match the quality required for large volumes of low-risk tasks, at a fraction of the cost, and can be verified and constrained more tightly. This is the central claim of the SLM position paper for agentic use: narrow subtasks, repetitive formats, lower latency, lower cost, adequate accuracy.

4) Governance moves to build time. You need explainable choices, consistent behavior, and an audit trail. UniRoute’s per-cluster selection logic provides a story with numbers.

5) Agents amplify the stakes. Agents plan, call tools, browse, write code, and trigger other agents. They need to “shop for cognition” per step. That demands a routing layer and a policy surface you can defend. In “Cognitive Browsing”, I called this shift to a mesh of collaborating specialists the natural evolution of the stack.

Section 2 — “Pick a great model” is not a strategy

Three defaults fail at scale:

  1. One strong default for everything. You pay premium rates for commodity outcomes and still miss corner cases.
  2. Let each team choose from a menu. Choices drift. Bills spike. No single narrative connects spend, quality, and policy.
  3. Train a black-box router. It works until the model pool changes. Then you retrain, re-label, and re-validate.

A better frame treats routing as portable policy:

  • Describe tasks by clusters.
  • Describe models by per-cluster error vectors.
  • Select the least-cost model that clears the bar for that cluster, with whitelists for sensitive flows.
  • When the pool changes, compute the new model’s vector and keep moving.

UniRoute gives you the math and evidence that this policy is stable and competitive.

The SLM position then tightens the economics. If most of your agent invocations are narrow, repetitive, and format-constrained, you gain even more by defaulting to SLM specialists and reserving big generalists for the few steps that truly demand open-ended reasoning.

Section 3 — The architecture you actually need

Building on UniRoute’s core idea, here’s a design that meets enterprise and public sector constraints, with clear guardrails and an audit trail.

3.1 Build the task map

Curate 500–2,000 representative prompts across your real traffic. Label cheaply with programmatic checks where possible. Embed each prompt and run K-Means with K between 16 and 64. Give each cluster a human-readable name and freeze the centroids. You now have a task map that reflects your work.

(This mirrors the paper’s “representative prompt set, clustered into K groups” foundation.)
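
Here is a minimal sketch of that step, assuming scikit-learn is available. TF-IDF stands in for a production embedding model, and the prompts are invented; what matters is the shape of the pipeline: embed, cluster, name, freeze.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Stand-in for your curated 500-2,000 representative prompts.
prompts = [
    "Extract the claimant name and policy number from this PDF",
    "Solve for x: 3x + 7 = 22",
    "Rewrite this clause to comply with our privacy policy",
    "Write a unit test for the parse_invoice function",
] * 50  # repeated only so KMeans has enough points to run

# In production, swap TF-IDF for your embedding model of choice.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(prompts)

K = 4  # the article suggests 16-64 for real traffic; 4 keeps the toy readable
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)

# Freeze the centroids: this array plus the vectorizer IS your task map.
centroids = kmeans.cluster_centers_
print(kmeans.predict(vectorizer.transform(["Compute 15% of 240"])))
```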

3.2 Fingerprint each model

Run each candidate model across the labeled set. Compute per-cluster error and pair it with cost and latency. Store those vectors in a registry. This is the capability fingerprint the router needs.

(Again, directly aligned with UniRoute’s “represent an LLM via a prediction error vector” step.)
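
A minimal sketch of the fingerprinting step follows. `call_model` and `is_correct` are stubs standing in for your inference call and your programmatic checker; the output is the K-dimensional error vector that becomes the model's fingerprint.

```python
import numpy as np

def call_model(model_name, prompt):
    return "stub answer"          # placeholder: your real inference call

def is_correct(answer, label):
    return answer == label        # placeholder: exact match, schema check, etc.

def fingerprint(model_name, prompts, labels, cluster_ids, K):
    """Per-cluster mean error for one model: its K-dimensional fingerprint."""
    err, cnt = np.zeros(K), np.zeros(K)
    for prompt, label, c in zip(prompts, labels, cluster_ids):
        err[c] += float(not is_correct(call_model(model_name, prompt), label))
        cnt[c] += 1
    return err / np.maximum(cnt, 1)   # guard against empty clusters

# Store the result alongside cost and latency in your model registry.
```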

3.3 Route by cost to cognition

At inference:

  1. Embed the prompt, assign it to the nearest cluster.
  2. Select the model minimizing expected_error(cluster, model) + λ × cost(model), subject to latency caps.
  3. Enforce whitelists for sensitive clusters and record the rationale.

The simpler this policy is, the easier it is to govern. UniRoute supplies the theory that your simple policy approximates the optimal rule and generalizes to new LLMs.
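
A sketch of that selection step, with an invented registry. The λ trade-off, the latency cap, and the whitelist check all live in one function, which is what keeps the policy small enough to audit.

```python
# Invented registry entries; model names, errors, and prices are illustrative.
REGISTRY = {
    "slm-extract":    {"err": [0.02, 0.30], "cost": 0.1, "p50_ms": 200},
    "mid-generalist": {"err": [0.02, 0.08], "cost": 1.0, "p50_ms": 600},
    "frontier":       {"err": [0.01, 0.04], "cost": 8.0, "p50_ms": 1500},
}
WHITELIST = {1: {"mid-generalist", "frontier"}}  # cluster 1 is a sensitive flow

def select(cluster: int, lam: float = 0.05, latency_cap_ms: int = 1000) -> str:
    """Minimize expected_error + λ × cost over whitelisted, fast-enough models."""
    allowed = WHITELIST.get(cluster, set(REGISTRY))
    scores = {
        name: m["err"][cluster] + lam * m["cost"]
        for name, m in REGISTRY.items()
        if name in allowed and m["p50_ms"] <= latency_cap_ms
    }
    return min(scores, key=scores.get)

print(select(0))  # slm-extract: cheapest model that still scores best here
print(select(1))  # mid-generalist: whitelisted and within the latency cap
```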

3.4 Guardrails, verifiers, fallbacks

  • Pre-filters: PII, PHI, and unsafe class detection.
  • Verifiers: Structure, policy, or math checks for high-risk clusters.
  • Dual inference: Run two models and reconcile for a small fraction of sensitive flows.
  • Escalation: If a verifier fails or confidence is low, escalate to a stronger model or a human-in-the-loop queue.

These hooks integrate cleanly with the agent mesh patterns I outlined in “Cognitive Browsing”: planner, solver, and critic roles with zero-trust boundaries and explicit capability tokens.
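
A sketch of the escalation path, with stubbed model calls and queue. The `verify` argument is whatever structure, policy, or math check you attach to the cluster; the pattern is simply verify, escalate, then hand off.

```python
def call_model(model, prompt):
    return f"{model} answer"            # placeholder: your real inference call

def enqueue_for_human_review(prompt):
    print("queued for human review")    # placeholder: your HITL queue

ESCALATION_LADDER = ["slm-extract", "mid-generalist", "frontier"]  # illustrative

def answer_with_guardrails(prompt, cluster, verify):
    """Try cheaper models first; escalate on verifier failure; end with a human."""
    for model in ESCALATION_LADDER:
        output = call_model(model, prompt)
        if verify(output, cluster):     # structure, policy, or math check
            return {"model": model, "output": output, "human": False}
    enqueue_for_human_review(prompt)
    return {"model": None, "output": None, "human": True}
```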

3.5 Observability and drift

Build two dashboards:

  • Cognition Spend: Dollars and latency by cluster. Savings versus a single-model baseline.
  • Quality Coverage: Pass/fail by cluster and model over time. Alerts for distribution shift and model behavior drift.

Feed hard cases back into the map. Promote new sub-clusters when they become common. This continuous improvement loop is exactly how the SLM position paper recommends evolving an SLM-first stack.
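
A sketch of the Cognition Spend rollup, assuming the router writes one decision log per request. The schema is invented; a groupby over real logs gives you both dashboards.

```python
import pandas as pd

# One row per routed request; field names are illustrative.
logs = pd.DataFrame([
    {"cluster": "extraction", "model": "slm-extract",    "cost": 0.10, "latency_ms": 210, "passed": True},
    {"cluster": "extraction", "model": "slm-extract",    "cost": 0.10, "latency_ms": 190, "passed": True},
    {"cluster": "math",       "model": "mid-generalist", "cost": 1.00, "latency_ms": 640, "passed": False},
])

spend = logs.groupby("cluster").agg(
    requests=("model", "size"),
    dollars=("cost", "sum"),
    p50_latency_ms=("latency_ms", "median"),
    pass_rate=("passed", "mean"),
)
print(spend)  # savings vs. a single-model baseline is one subtraction away
```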

3.6 Tooling and interfaces

Keep it boring:

  • A reliable embedding model for the map.
  • Simple K-Means clustering.
  • A small registry with cluster centroids, model vectors, and policy thresholds.
  • A router API: route(prompt, context) → {model, rationale}.

In my Blueprint piece, I called this a brokerage layer—a service that picks the right model by task complexity, budget, and compliance tier. That’s still the move.
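
Here is what that API can look like as a sketch. `assign_cluster` and `select_model` are stubs for the pieces built in 3.1-3.3; the point is that every decision returns a rationale you can log.

```python
from dataclasses import dataclass

def assign_cluster(prompt: str) -> str:
    return "extraction"                  # stub: nearest frozen centroid lookup

def select_model(cluster: str, context: dict):
    return "slm-extract", 0.02           # stub: the cost-aware policy from 3.3

@dataclass
class RouteDecision:
    model: str
    cluster: str
    expected_error: float
    rationale: str                       # logged verbatim for auditors

def route(prompt: str, context: dict) -> RouteDecision:
    cluster = assign_cluster(prompt)
    model, err = select_model(cluster, context)
    return RouteDecision(model, cluster, err,
                         f"cluster={cluster} expected_error={err:.3f} "
                         f"policy=min_cost_over_quality_bar")
```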

Section 4 — What leaders get, immediately

4.1 Enterprises: predictable savings, fewer fires

CIOs and CDOs get a forecastable way to drop unit costs while holding quality. Procurement improves too: evaluate a new vendor by computing its vectors on your representative set and see exactly where it fits. The story becomes clusters and volumes.

4.2 Federal and public sector: auditability and policy fit

Mandates around privacy, fairness, accessibility, and security are non-negotiable. Routing lets you enforce policy per cluster, confine sensitive flows to approved models, and log every decision. The audit trail becomes straightforward. In “Business to Agents,” I emphasized sector-specific governance and air-gapped or local inference where required; the router is the control plane that makes those rules executable.

4.3 Startups: speed and capital efficiency

A routing layer is a moat. You deliver stable outcomes at lower cost while the supply side changes weekly. You can add SLM specialists where volumes justify the investment, guided by the same cluster map. Investors understand “cost-to-serve by cluster” and “coverage growth.” The numbers will tell your story.

4.4 Sector snapshots

  • Healthcare. Route PHI flows to models inside your hardened boundary, add dual inference for discharge instructions, route coding tasks to tuned SLMs.
  • Financial services. Keep KYC/AML clusters on strict whitelists. Route reconciliations to tool-capable models.
  • Mobility and logistics. Use small, fast models for routing updates. Escalate to stronger models for incident narratives.
  • Public benefits. Apply policy lints on eligibility determinations. Route translation to a low-cost specialist when thresholds are met.

Each example follows one pattern: define the map, route by policy, learn from the failures, and update the map.

Section 5 — From the UniRoute paper to a working router, fast

If you want the shortest path from UniRoute to production, follow these steps first, then iterate where the data points you.

Step 1: Build the representative set. Pull anonymized prompts from recent traffic. Curate 500–2,000 diverse examples across your use cases. Label with programmatic checks where possible.

Step 2: Create the task map. Embed, cluster with K-Means, choose K between 16 and 64, name clusters, and freeze centroids.

Step 3: Fingerprint models. Batch each candidate model across the labeled set. Compute per-cluster error, median latency, and cost. Store vectors in a registry. (This is UniRoute’s central move: represent each LLM via a prediction-error vector over clusters.)

Step 4: Define the policy. Minimize expected_error + λ × cost, with latency caps. Enforce whitelists for sensitive clusters. Log every decision and rationale. UniRoute provides the theoretical footing that a simple cluster policy approximates the optimal rule.

Step 5: Wire guardrails. Pre-filters, verifiers, dual inference for a small share of sensitive flows, escalation on failure. These are table stakes in agent meshes and zero-trust designs.

Step 6: Observe and improve. Stand up Cognition Spend and Quality Coverage dashboards. Promote new sub-clusters as they emerge and retrain the lightweight cluster map if drift appears. The SLM position paper frames this as a continuous improvement loop for SLM-first systems.

Step 7: Train the organization. Teach cluster names and policies. Give risk and product teams read access to dashboards. Treat the router like a small internal product.

Now, layer in the SLM horizon: the NVIDIA team lays out a migration path from generalist LLMs to SLM specialists, including secure logging, data curation, clustering, selecting and fine-tuning SLMs, and iterating the router alongside them. Use your router’s logs to discover where SLM specialists will save the most money, then swap them in under the same policy.

Section 6 — Routing meets the SLM era

This is where the two threads—portable routing and SLM-first systems—reinforce each other.

Agents that shop for cognition. Agents will not only call tools; they will also choose which intelligence to rent at each step. A router with a clean API becomes the buyer on their behalf, picking the cheapest supplier that satisfies the spec and keeping receipts. In prior work, I described this as the shift from monoliths to meshes, with explicit planner-solver-critic roles and clear memory layers. Routing and SLM specialists make those designs economical.

Self-improving stacks. Your decision logs surface hard cases by cluster. That gives you labeled fodder for the next SLM fine-tune, which drops straight into the pool under the same policy. The SLM paper frames this as an ongoing loop; your router is the engine that powers it.

Hardware and locality. As more inference shifts closer to data, routing will also weigh where to run. In earlier guidance for defense and critical infrastructure, I underscored local inference and air-gapping for certain decisions. SLM specialists make that feasible, and the router encodes the rule.

Interface-centric AI. The stack is moving toward stable, predictable interfaces while the supply side changes beneath them. A routing layer plus SLM specialists gives you that stability with better unit economics. In the Blueprint series, I called this the brokerage layer that separates what the user sees from where cognition is purchased.

Industry trajectory. Adoption is accelerating, and the capital mismatch between centralized LLM infrastructure and the actual market size for LLM APIs is pushing teams to hunt for better economics. An SLM-first portfolio, governed by a portable routing policy, is a credible answer.

Narrative examples: what this feels like on the ground

Support triage at scale. A consumer services company runs millions of support interactions a month. Most tickets are CRUD. Some involve policy nuance. With routing, CRUD clusters move to a tuned SLM, policy-heavy clusters stay on a stricter model with a verifier, and anomalies escalate. Spend drops, and escalations become more focused.

Public benefits letters. A federal agency generates benefit determinations at volume. Sensitive “policy-constrained rewriting” routes to an approved model inside a secure boundary with verification, while appointment reminders and instructions route to a lower-cost specialist. Audits become straightforward.

Developer productivity inside a VPC. An enterprise routes small refactors and test scaffolds to a tuned SLM running locally, and escalates multi-file refactors to a reasoning-capable model. Disagreements found by periodic dual inference generate data for the next SLM specialist.

Objections you will hear and how to respond

“We will standardize on a strong default.” That sounds simple. It burns budget and hides risk. A routing policy plus a few SLM specialists pays for itself quickly and improves control.

“This is brittle if traffic shifts.” That is why you cluster and observe. Promote new sub-clusters as they emerge. Update your vectors and thresholds. UniRoute’s method was designed to remain stable as pools change.

“We don’t want to manage lots of models.” Start with three or four. Add only where the savings or coverage justify it. The router shows you where a specialist will pay back.

“Governance won’t allow it.” Routing is easier to govern. It produces a reasoned record for every choice, enforces policies per cluster, and allows whitelists and verifiers for sensitive flows.

Closing the loop with some of my past articles

We are leaving the era where model choice is a once-a-year bet. We are entering the era where selection is a live, portable policy grounded in your work, your risks, and your budget.

UniRoute shows how to encode that policy so it survives model churn, using clusters of real prompts and per-cluster error vectors to route to the lowest-cost model that clears your quality bar, even when the pool changes.

SLM-first architectures make that policy pay. Many agentic tasks are narrow, repetitive, and format-bound. SLMs handle those with lower latency, lower cost, and tighter alignment. Keep larger generalists in reserve. Build the loop that upgrades specialists where volume and risk justify it.

If you opened your logs tomorrow, could you explain clearly, and in numbers, why each request went to the model it did? If yes, you are routing. If not, now is the time.

Drop a comment with how you are tackling model sprawl and whether you are experimenting with SLM specialists.

 
