Your app calls "smart". pLLM picks the model.
One route slug, many providers. Real-time latency-aware selection, silent failover, and self-healing circuit breakers — so outages, spikes, and cost tiers are a config change, not a deploy.
What happens when a request hits "smart".
Which strategy fits your traffic?
Each route picks a strategy. Strategies run at request time using real-time metrics, not static config.
| Strategy | How it picks | State | Best for |
|---|---|---|---|
| Least Latency (`least-latency`) | Fastest p95 across healthy nodes | Distributed EMA via Redis | Latency-sensitive apps · chat UIs · real-time agents |
| Weighted RR (`weighted-round-robin`) | Smooth proportional rotation | In-memory counters | Capacity-based distribution · multi-deployment setups |
| Priority (`priority`) | Highest-priority healthy backend | Static ordering | Cost tiers · preferred provider · failover chains |
| Random (`random`) | Uniform random across healthy backends | Stateless | All-equal providers · stateless gateway nodes |
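To make the least-latency strategy concrete, here is a minimal sketch of latency-aware selection using a per-backend exponential moving average. The `Backend` class and `pick_least_latency` function are illustrative stand-ins, not pLLM's internals (which track a distributed EMA in Redis per the table above):

```python
import random

class Backend:
    def __init__(self, name, alpha=0.2):
        self.name = name
        self.alpha = alpha     # EMA smoothing factor (assumed value)
        self.ema_ms = None     # no latency observations yet
        self.healthy = True

    def observe(self, latency_ms):
        # Update the moving average after each completed request.
        if self.ema_ms is None:
            self.ema_ms = latency_ms
        else:
            self.ema_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ema_ms

def pick_least_latency(backends):
    # Choose the healthy backend with the lowest latency estimate.
    # Backends with no data yet are tried first so they get measured.
    candidates = [b for b in backends if b.healthy]
    fresh = [b for b in candidates if b.ema_ms is None]
    if fresh:
        return random.choice(fresh)
    return min(candidates, key=lambda b: b.ema_ms)

backends = [Backend("gpt-5"), Backend("claude-4.6"), Backend("gemini-2.5-pro")]
backends[0].observe(420)
backends[1].observe(180)
backends[2].observe(950)
print(pick_least_latency(backends).name)  # claude-4.6 has the lowest EMA
```

Because selection runs at request time against live metrics, a backend that slows down loses traffic on the very next request, with no config change.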
Two steps. Admin API + standard SDK.
1. Define a route
admin API · no restart

```
# A route named "smart" — your app just calls model: "smart".
# pLLM picks the best backend automatically.
POST /api/admin/routes
{
  "name": "Smart",
  "slug": "smart",
  "strategy": "least-latency",
  "models": [
    { "model_name": "gpt-5", "weight": 60, "priority": 100 },
    { "model_name": "claude-4.6", "weight": 30, "priority": 80 },
    { "model_name": "gemini-2.5-pro", "weight": 10, "priority": 60 }
  ],
  "fallback_models": ["gpt-4o-mini", "claude-haiku"]
}
```

2. Use it in your app
OpenAI SDK · no change

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://pllm.company.com/v1",
    api_key="sk-..."
)

# Call the route slug — not a specific model.
# pLLM picks the best backend in real time.
response = client.chat.completions.create(
    model="smart",  # pLLM route
    messages=[{"role": "user", "content": "Analyze this data"}],
    stream=True,
)

# If gpt-5 is slow → routes to claude-4.6
# If claude is down → circuit opens, fails over to gemini-2.5-pro
# If all primaries fail → fallback chain (gpt-4o-mini, claude-haiku)
# Your app never knows. Zero code changes.
```

When things break, your app doesn't.
Three escalating layers of failover, plus a self-healing circuit breaker on every provider.
Three layers, in order.
Instance retry
If an instance fails, pLLM tries another instance of the same model with 1.5× increasing timeouts.
Model failover
If all instances of a model fail, the route's strategy picks the next model in its list.
Fallback chain
If every model in the route is exhausted, pLLM walks the fallback_models chain as a last resort.
Each retry uses 1.5× increasing timeout. Up to 10 failover hops with loop detection.
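The escalating behavior above can be sketched in a few lines. This is a simplified model, not pLLM's implementation: `try_backend` and the flat backend chain are hypothetical stand-ins, and real routing re-runs the strategy between hops. The 1.5× timeout growth, 10-hop cap, and loop detection mirror the text:

```python
class Exhausted(Exception):
    pass

def call_with_failover(chain, try_backend, base_timeout=10.0,
                       max_hops=10, growth=1.5):
    # chain: ordered backend names (route models first, then fallback_models).
    # Each failed hop grows the next timeout by 1.5x; the visited set
    # provides loop detection if a chain revisits a backend.
    visited = set()
    timeout = base_timeout
    for hop, backend in enumerate(chain):
        if hop >= max_hops:
            break                       # cap at 10 failover hops
        if backend in visited:
            continue                    # loop detection
        visited.add(backend)
        try:
            return try_backend(backend, timeout)
        except Exception:
            timeout *= growth           # 1.5x increasing timeout per retry
    raise Exhausted("all primaries and fallbacks failed")

# Usage: the first two backends fail, the third succeeds.
def flaky(backend, timeout):
    if backend in ("gpt-5", "claude-4.6"):
        raise RuntimeError(f"{backend} down")
    return f"{backend} answered (timeout={timeout:.1f}s)"

result = call_with_failover(
    ["gpt-5", "claude-4.6", "gemini-2.5-pro", "gpt-4o-mini"], flaky)
print(result)  # gemini-2.5-pro answered (timeout=22.5s)
```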
Self-healing, no paging.
CLOSED · HEALTHY
All traffic flows. Failure counter active.
OPEN · UNHEALTHY
Traffic blocked. Provider pulled from rotation. 30s cooldown.
HALF-OPEN · TESTING
One probe request. Success → closed. Failure → open.
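A minimal sketch of this three-state breaker. The class and method names are illustrative, and the failure threshold is an assumed value; only the three states and the 30s cooldown come from the text:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.cooldown_s = cooldown_s                # 30s per the text
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def allow(self):
        # closed: all traffic flows. open: blocked until the cooldown
        # elapses, then exactly one probe is let through (half-open).
        if self.state == "closed":
            return True
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "half-open"
                return True             # this call is the single probe
            return False
        return False                    # half-open: probe already in flight

    def record(self, success):
        if success:
            self.state = "closed"       # probe succeeded: self-heal
            self.failures = 0
            return
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"         # pull provider from rotation
            self.opened_at = self.clock()
            self.failures = 0

# Walk the states with a controllable clock.
now = [0.0]
cb = CircuitBreaker(failure_threshold=2, clock=lambda: now[0])
cb.record(False); cb.record(False)      # threshold hit → open
blocked = cb.allow()                    # False: still cooling down
now[0] += 30.0
probed = cb.allow()                     # True: half-open, one probe allowed
cb.record(True)                         # probe succeeded → closed
print(cb.state)  # closed
```

One breaker per provider means a single flapping backend is quarantined and re-admitted on its own schedule, without paging anyone.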