Your app calls "smart". pLLM picks the model.
One route slug, many providers. Real-time latency-aware selection, silent failover, and self-healing circuit breakers — so outages, spikes, and cost tiers are a config change, not a deploy.
What happens when a request hits "smart".
Which strategy fits your traffic?
Each route picks a strategy. Strategies run at request time using real-time metrics, not static config.
| Strategy | How it picks | State | Best for |
|---|---|---|---|
| Least Latency (`least-latency`) | Fastest p95 across healthy nodes | Distributed EMA via Redis | Latency-sensitive apps · chat UIs · real-time agents |
| Weighted RR (`weighted-round-robin`) | Smooth proportional rotation | In-memory counters | Capacity-based distribution · multi-deployment setups |
| Priority (`priority`) | Highest-priority healthy backend | Static ordering | Cost tiers · preferred provider · failover chains |
| Random (`random`) | Uniform random across healthy backends | Stateless | All-equal providers · stateless gateway nodes |
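To make the least-latency strategy concrete, here is a minimal sketch of latency-aware selection using a per-backend exponential moving average. The `Backend` class and `pick_least_latency` function are illustrative stand-ins, not pLLM's internals (which track a distributed EMA in Redis per the table above):

```python
import random

class Backend:
    def __init__(self, name, alpha=0.2):
        self.name = name
        self.alpha = alpha     # EMA smoothing factor (assumed value)
        self.ema_ms = None     # no latency observations yet
        self.healthy = True

    def observe(self, latency_ms):
        # Update the moving average after each completed request.
        if self.ema_ms is None:
            self.ema_ms = latency_ms
        else:
            self.ema_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ema_ms

def pick_least_latency(backends):
    # Choose the healthy backend with the lowest latency estimate.
    # Backends with no data yet are tried first so they get measured.
    candidates = [b for b in backends if b.healthy]
    fresh = [b for b in candidates if b.ema_ms is None]
    if fresh:
        return random.choice(fresh)
    return min(candidates, key=lambda b: b.ema_ms)

backends = [Backend("gpt-5"), Backend("claude-4.6"), Backend("gemini-2.5-pro")]
backends[0].observe(420)
backends[1].observe(180)
backends[2].observe(950)
print(pick_least_latency(backends).name)  # claude-4.6 has the lowest EMA
```

Because selection runs at request time against live metrics, a backend that slows down loses traffic on the very next request, with no config change.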
Two steps. Admin API + standard SDK.
1. Define a route
admin API · no restart

```
# A route named "smart" — your app just calls model: "smart".
# pLLM picks the best backend automatically.
POST /api/admin/routes
{
  "name": "Smart",
  "slug": "smart",
  "strategy": "least-latency",
  "models": [
    { "model_name": "gpt-5", "weight": 60, "priority": 100 },
    { "model_name": "claude-4.6", "weight": 30, "priority": 80 },
    { "model_name": "gemini-2.5-pro", "weight": 10, "priority": 60 }
  ],
  "fallback_models": ["gpt-4o-mini", "claude-haiku"]
}
```

2. Use it in your app
OpenAI SDK · no change

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://pllm.company.com/v1",
    api_key="sk-..."
)

# Call the route slug — not a specific model.
# pLLM picks the best backend in real time.
response = client.chat.completions.create(
    model="smart",  # pLLM route
    messages=[{"role": "user", "content": "Analyze this data"}],
    stream=True,
)

# If gpt-5 is slow → routes to claude-4.6
# If claude is down → circuit opens, fails over to gemini-2.5-pro
# If all primaries fail → fallback chain (gpt-4o-mini, claude-haiku)
# Your app never knows. Zero code changes.
```

When things break, your app doesn't.
Three escalating layers of failover, plus a self-healing circuit breaker on every provider.
Three layers, in order.
Instance retry
If an instance fails, pLLM tries another instance of the same model with 1.5× increasing timeouts.
Model failover
If all instances of a model fail, the route's strategy picks the next model in its list.
Fallback chain
If every model in the route is exhausted, pLLM walks the fallback_models chain as a last resort.
Each retry uses 1.5× increasing timeout. Up to 10 failover hops with loop detection.
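The escalating behavior above can be sketched in a few lines. This is a simplified model, not pLLM's implementation: `try_backend` and the flat backend chain are hypothetical stand-ins, and real routing re-runs the strategy between hops. The 1.5× timeout growth, 10-hop cap, and loop detection mirror the text:

```python
class Exhausted(Exception):
    pass

def call_with_failover(chain, try_backend, base_timeout=10.0,
                       max_hops=10, growth=1.5):
    # chain: ordered backend names (route models first, then fallback_models).
    # Each failed hop grows the next timeout by 1.5x; the visited set
    # provides loop detection if a chain revisits a backend.
    visited = set()
    timeout = base_timeout
    for hop, backend in enumerate(chain):
        if hop >= max_hops:
            break                       # cap at 10 failover hops
        if backend in visited:
            continue                    # loop detection
        visited.add(backend)
        try:
            return try_backend(backend, timeout)
        except Exception:
            timeout *= growth           # 1.5x increasing timeout per retry
    raise Exhausted("all primaries and fallbacks failed")

# Usage: the first two backends fail, the third succeeds.
def flaky(backend, timeout):
    if backend in ("gpt-5", "claude-4.6"):
        raise RuntimeError(f"{backend} down")
    return f"{backend} answered (timeout={timeout:.1f}s)"

result = call_with_failover(
    ["gpt-5", "claude-4.6", "gemini-2.5-pro", "gpt-4o-mini"], flaky)
print(result)  # gemini-2.5-pro answered (timeout=22.5s)
```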
Self-healing, no paging.
CLOSED · HEALTHY
All traffic flows. Failure counter active.
OPEN · UNHEALTHY
Traffic blocked. Provider pulled from rotation. 30s cooldown.
HALF-OPEN · TESTING
One probe request. Success → closed. Failure → open.
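A minimal sketch of this three-state breaker. The class and method names are illustrative, and the failure threshold is an assumed value; only the three states and the 30s cooldown come from the text:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.cooldown_s = cooldown_s                # 30s per the text
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def allow(self):
        # closed: all traffic flows. open: blocked until the cooldown
        # elapses, then exactly one probe is let through (half-open).
        if self.state == "closed":
            return True
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "half-open"
                return True             # this call is the single probe
            return False
        return False                    # half-open: probe already in flight

    def record(self, success):
        if success:
            self.state = "closed"       # probe succeeded: self-heal
            self.failures = 0
            return
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"         # pull provider from rotation
            self.opened_at = self.clock()
            self.failures = 0

# Walk the states with a controllable clock.
now = [0.0]
cb = CircuitBreaker(failure_threshold=2, clock=lambda: now[0])
cb.record(False); cb.record(False)      # threshold hit → open
blocked = cb.allow()                    # False: still cooling down
now[0] += 30.0
probed = cb.allow()                     # True: half-open, one probe allowed
cb.record(True)                         # probe succeeded → closed
print(cb.state)  # closed
```

One breaker per provider means a single flapping backend is quarantined and re-admitted on its own schedule, without paging anyone.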