pLLM
High-Performance LLM Gateway
Drop-in OpenAI replacement built in Go. Handle thousands of concurrent requests with adaptive routing, multi-provider support, and enterprise-grade reliability.
# Drop-in OpenAI replacement — just change the base URL
$ curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $PLLM_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Hi"}]}'
{
  "provider": "openai",
  "model": "gpt-4o",
  "latency_ms": 142,
  "route": "least-latency"
}
Enterprise-Grade Features
Built from the ground up for production workloads with performance, reliability, and developer experience in mind.
100% OpenAI Compatible
Drop-in replacement for OpenAI API. No code changes needed - just update your base URL and you're ready to go.
Multi-Provider Support
Support for OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Vertex AI, Llama, and Cohere with unified interface.
Adaptive Routing
Intelligent request routing with automatic failover, circuit breakers, and health-based load balancing.
High Performance
Built in Go for maximum performance. Handle thousands of concurrent requests with minimal latency overhead.
Enterprise Security
JWT authentication, RBAC, audit logging, and comprehensive monitoring with Prometheus metrics.
Cost Optimization
Budget management, intelligent caching, and multi-key load balancing to minimize API costs.
Technical Excellence
Deep technical capabilities designed for mission-critical production environments.
Performance
- Sub-millisecond routing overhead
- Thousands of concurrent connections
- Native compilation with Go
- Efficient memory management
Reliability
- Circuit breaker protection
- Automatic health monitoring
- Graceful degradation
- Zero-downtime deployments
Scalability
- Horizontal scaling ready
- Kubernetes native
- Redis-backed caching
- Distributed rate limiting
Observability
- Prometheus metrics
- Grafana dashboards
- Distributed tracing
- Comprehensive logging
Performance Advantage
See how pLLM compares to typical interpreted gateway solutions.
| Metric | pLLM (Go) | Typical Gateway | Advantage |
|---|---|---|---|
| Concurrent Connections | Thousands | Limited | Superior |
| Memory Usage | 50-80MB | 150-300MB+ | 3-6x Less |
| Startup Time | <100ms | 2-5s | 20-50x Faster |
| CPU Efficiency | All cores | GIL limited | True Parallel |
Your App Says "smart". pLLM Does the Rest.
Define named routes that map to multiple LLM providers with configurable strategies. Your application calls a single route slug — pLLM selects the best backend in real-time, handles failures automatically, and falls back gracefully. No code changes. Ever.
Two-Level Selection
Routes solve which model to use. The system then picks the best instance of that model.
1. Define a Route
Via Admin API or config — no restart needed
# Define a route called "smart" — your app just calls model: "smart"
# pLLM picks the best backend automatically
POST /api/admin/routes
{
  "name": "Smart Route",
  "slug": "smart",
  "strategy": "least-latency",
  "models": [
    { "model_name": "gpt-4o", "weight": 60, "priority": 100 },
    { "model_name": "claude-sonnet-4", "weight": 30, "priority": 80 },
    { "model_name": "gemini-2.5-pro", "weight": 10, "priority": 60 }
  ],
  "fallback_models": ["gpt-4o-mini", "claude-haiku"]
}
2. Use It in Your App
Standard OpenAI SDK — the route slug is the model name
from openai import OpenAI

client = OpenAI(
    base_url="https://pllm.company.com/v1",
    api_key="pk-team-abc123"
)

# Your app calls the route slug — not a specific model
# pLLM selects the best backend in real-time
response = client.chat.completions.create(
    model="smart",  # ← This is a route, not a model
    messages=[{"role": "user", "content": "Analyze this data"}],
    stream=True
)

# If GPT-4o is slow → routes to Claude
# If Claude is down → circuit breaker opens, fails over to Gemini
# If all primary models fail → falls back to gpt-4o-mini
# Your app never knows. Zero code changes.
Four Routing Strategies
Each route chooses a strategy. The strategy runs at request time to pick the best model — using real-time metrics, not static config.
Least Latency
least-latency Routes every request to the fastest responding provider in real-time. Uses distributed latency tracking across all gateway nodes via Redis.
Exponential moving average (90/10) ensures responsive adaptation without oscillation
Weighted Round-Robin
weighted-round-robin Distributes requests proportionally based on model weights using smooth interleaving — the same algorithm nginx uses.
Smooth WRR prevents bursts: a 60/30/10 split interleaves evenly, not in blocks (see the sketch after this list)
Priority
priority Always routes to the highest-priority healthy provider. Lower-priority models only receive traffic during failover.
Use with fallback chains to build cost tiers: premium → standard → economy
Random
random Uniform random selection across all healthy providers. Simple, stateless, no coordination needed between gateway nodes.
Zero overhead — no counters, no Redis lookups, no state to synchronize
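To make the latency and weighting strategies above concrete, here is a minimal Python sketch of the 90/10 exponential moving average used by least-latency and the nginx-style smooth weighted round-robin interleaving. It is an illustration only, not pLLM's actual Go implementation, and it keeps state in memory rather than sharing it across nodes via Redis.

# Illustrative sketch only; pLLM implements these strategies in Go,
# with latency state shared across gateway nodes via Redis.

class LatencyTracker:
    """90/10 exponential moving average of observed latency per model."""
    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha            # weight given to the newest sample
        self.avg: dict[str, float] = {}

    def record(self, model: str, latency_ms: float) -> None:
        prev = self.avg.get(model, latency_ms)
        self.avg[model] = (1 - self.alpha) * prev + self.alpha * latency_ms

    def fastest(self, candidates: list[str]) -> str:
        return min(candidates, key=lambda m: self.avg.get(m, float("inf")))


def smooth_wrr(weights: dict[str, int]):
    """Nginx-style smooth weighted round-robin: spreads picks evenly."""
    current = {m: 0 for m in weights}
    total = sum(weights.values())
    while True:
        for m, w in weights.items():
            current[m] += w
        best = max(current, key=current.get)  # highest running score wins
        current[best] -= total
        yield best


picker = smooth_wrr({"gpt-4o": 60, "claude-sonnet-4": 30, "gemini-2.5-pro": 10})
print([next(picker) for _ in range(10)])
# gpt-4o is chosen 6 times, claude-sonnet-4 3 times, gemini-2.5-pro once,
# and the picks are interleaved rather than grouped in blocks.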
Three-Layer Failover
When a provider fails, pLLM doesn't just retry — it escalates through three layers of resilience before ever returning an error to your app.
Instance Retry
If an instance fails, pLLM tries another instance of the same model with increasing timeouts (1.5x multiplier).
Model Failover
If all instances of a model fail, the route strategy selects the next model in the route's model list.
Fallback Chain
If every model in the route is exhausted, pLLM walks through the fallback_models chain as a last resort.
Each retry uses 1.5x increased timeout to account for provider recovery. Up to 10 failover hops with loop detection.
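Put together, the escalation can be pictured with a short Python sketch. Everything below is a simplified stand-in: the helper functions, exceptions, and argument shapes are hypothetical, and the real logic lives in the Go gateway.

TIMEOUT_MULTIPLIER = 1.5   # from the retry policy described above
MAX_HOPS = 10              # failover hop cap with loop detection


class ProviderError(Exception):
    pass


class AllProvidersFailed(Exception):
    pass


def instances_of(model: str) -> list[str]:
    # Stand-in: in pLLM the instances come from the configured deployments.
    return [f"{model}#1", f"{model}#2"]


def call_instance(instance: str, request: dict, timeout: float) -> dict:
    # Stand-in for the actual provider call.
    raise ProviderError(f"{instance} unavailable")


def complete_with_failover(models: list[str], fallback_models: list[str],
                           request: dict, base_timeout: float = 30.0) -> dict:
    tried, hops = set(), 0
    # Layers 2 and 3: walk the route's primary models, then its fallback chain.
    for model in models + fallback_models:
        if model in tried or hops >= MAX_HOPS:
            continue                              # loop detection / hop cap
        tried.add(model)
        timeout = base_timeout
        # Layer 1: retry another instance of the same model first.
        for instance in instances_of(model):
            hops += 1
            try:
                return call_instance(instance, request, timeout=timeout)
            except ProviderError:
                timeout *= TIMEOUT_MULTIPLIER     # 1.5x longer on each retry
    raise AllProvidersFailed("all models and fallbacks exhausted")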
Self-Healing Circuit Breaker
Unhealthy providers are automatically removed from rotation and tested for recovery — no manual intervention needed.
CLOSED — Healthy
Normal operation. All traffic flows through. Failure counter active.
OPEN — Unhealthy
Traffic blocked. Provider removed from rotation. 30s cooldown starts.
HALF-OPEN — Testing
Single test request allowed. Success → closed. Failure → back to open.
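The state machine behind this is small. Here is a minimal Python sketch for illustration: the 30-second cooldown comes from the description above, the failure threshold of 5 is an assumed value, and pLLM's real breaker is implemented in Go.

import time


class CircuitBreaker:
    """CLOSED -> OPEN -> HALF-OPEN breaker, as described above (sketch only)."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.cooldown_s:
                self.state = "half-open"   # cooldown over: allow a test request
                return True
            return False                   # still open: provider stays out of rotation
        return True                        # closed or half-open: let traffic through

    def record_success(self) -> None:
        self.failures = 0
        self.state = "closed"              # test request succeeded: back to normal

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"            # remove provider from rotation
            self.opened_at = time.monotonic()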
Why This Changes Everything for Your App
Zero Retry Logic
Stop writing try/catch/retry in every LLM call. pLLM handles retries, failover, and fallbacks at the gateway level. Your app code stays clean.
Swap Providers at Runtime
Move from OpenAI to Anthropic — or add Gemini to the mix — via the admin API. No deploy, no restart, no code changes in any app.
Cost Tiers Without Code
Route "fast" to cheap models, "smart" to premium ones. Use weighted distribution to send 80% to budget providers and 20% to premium for A/B testing.
A/B Test Models
Use weighted round-robin to gradually shift traffic. Route 10% to a new model, monitor latency and quality through the stats API, then scale up (see the sketch after this list).
Survive Provider Outages
When OpenAI goes down, your app doesn't. Circuit breakers detect the failure, traffic shifts to Anthropic or Azure, and self-heals when OpenAI recovers.
Route-Level Analytics
Every route tracks per-model traffic distribution, latency, token usage, and cost. See exactly where your budget goes with the stats endpoint.
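As a concrete example of the A/B pattern above, the route API shown earlier can express a 90/10 split. The Python sketch below is illustrative: the gateway URL and the Authorization header used for the admin API are assumptions, so adjust them to your deployment.

import requests

# Send 90% of "smart" traffic to the incumbent model and 10% to a candidate
# using weighted round-robin. The base URL and auth header are assumptions.
PLLM_ADMIN = "https://pllm.company.com/api/admin"

route = {
    "name": "Smart Route (A/B)",
    "slug": "smart",
    "strategy": "weighted-round-robin",
    "models": [
        {"model_name": "gpt-4o", "weight": 90, "priority": 100},
        {"model_name": "claude-sonnet-4", "weight": 10, "priority": 100},
    ],
    "fallback_models": ["gpt-4o-mini"],
}

resp = requests.post(
    f"{PLLM_ADMIN}/routes",
    json=route,
    headers={"Authorization": "Bearer sk-master-key"},  # assumed auth scheme
    timeout=10,
)
resp.raise_for_status()
# Shift the weights (e.g. 75/25, then 50/50) as latency and quality hold up.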
Protect Every Request. Secure Every Response.
Guardrails are pluggable security filters that validate, mask, and monitor content flowing through your LLM gateway. Run them before the LLM call, after, in parallel, or at logging time — each with its own purpose and behavior.
Request Lifecycle
Guardrails plug into four stages of the request pipeline. Each stage has different behavior — choose what makes sense for your use case.
Logging-only guardrails run separately before data reaches storage. They mask sensitive content in logs and analytics without affecting the live request/response flow.
Four Execution Modes
Each guardrail can run in one or more modes. The same guardrail — like PII detection — can mask data pre-call, scan responses post-call, and redact logs simultaneously.
Pre-Call
pre_call Intercepts and validates requests before they reach the LLM. Mask PII, block prompt injections, enforce compliance policies — all before a single token is sent.
User sends credit card number → pLLM masks it as [CREDIT_CARD] → LLM never sees the original
Post-Call
post_call Scans LLM responses before they reach your users. Detect hallucinated PII, toxic content, or compliance violations in model output.
LLM generates a fake SSN in output → pLLM detects and flags it → logged for review
During-Call
during_call Runs guardrails in parallel with the LLM call for zero-latency monitoring. Ideal for background security checks that don't need to block the response.
While LLM processes the request → pLLM runs threat detection in background → alerts on anomalies
Logging-Only
logging_only Masks sensitive data before it's stored in logs or analytics. Ensures PII never reaches your database, audit trails, or monitoring systems.
Request contains email → stored in logs as [EMAIL] → analytics stay PII-free
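The four modes sit at fixed points in the request pipeline. The sketch below is purely conceptual, with illustrative function names rather than pLLM internals, to show where each mode runs relative to the LLM call.

import asyncio
from copy import deepcopy

# Conceptual sketch of where each execution mode runs; all names are illustrative.

async def run_guardrails(mode: str, payload: dict) -> dict:
    print(f"running {mode} guardrails")
    return payload                       # a real guardrail may mask or block content


async def call_llm(request: dict) -> dict:
    return {"choices": [{"message": {"content": "..."}}]}


def store_logs(payload: dict) -> None:
    pass                                 # stand-in for the logging/analytics sink


async def handle(request: dict) -> dict:
    request = await run_guardrails("pre_call", request)          # mask PII, block injections
    background = asyncio.create_task(
        run_guardrails("during_call", deepcopy(request))          # parallel, non-blocking
    )
    response = await call_llm(request)
    response = await run_guardrails("post_call", response)        # scan the model output
    store_logs(await run_guardrails("logging_only", deepcopy(request)))  # redact before storage
    await background
    return response


asyncio.run(handle({"model": "smart", "messages": []}))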
Guardrails Marketplace
Choose from pre-built guardrails for common security and compliance needs. Each integrates with a leading provider and plugs into any execution mode.
Presidio PII Detection
by Microsoft Presidio
Detect and mask personally identifiable information including names, emails, phone numbers, credit cards, and SSNs using Microsoft's open-source PII engine.
Lakera Guard
by Lakera AI
Protect against prompt injections, jailbreaks, and other LLM-specific attacks. Real-time threat detection purpose-built for AI security.
OpenAI Moderation
by OpenAI
Leverage OpenAI's moderation API to classify content across categories like hate speech, violence, self-harm, and sexual content.
Aporia Guardrails
by Aporia
Enterprise ML security and compliance platform. Monitor model behavior, detect hallucinations, and enforce organizational policies.
Build Your Own Guardrail
A custom connector interface is on the roadmap — bring your own guardrail provider by implementing a simple HTTP interface. Plug in internal compliance tools, custom ML models, or any third-party service.
Simple YAML Configuration
Enable guardrails with a few lines in your config — no code changes needed
guardrails:
  enabled: true
  guardrails:
    # Pre-call: Mask PII before sending to LLM
    - guardrail_name: "presidio-pii-detection"
      provider: "presidio"
      mode: ["pre_call", "logging_only"]
      enabled: true
      config:
        analyzer_url: "http://presidio-analyzer:3000"
        anonymizer_url: "http://presidio-anonymizer:3000"
        entities:
          - PERSON
          - EMAIL_ADDRESS
          - PHONE_NUMBER
          - CREDIT_CARD
          - SSN
        threshold: 0.7
        mask_pii: true
        language: "en"

    # Post-call: Detect PII leaked in responses
    - guardrail_name: "presidio-pii-response-scan"
      provider: "presidio"
      mode: ["post_call"]
      enabled: true
      config:
        threshold: 0.8
Enterprise Authentication
Seamless integration with your existing identity infrastructure through OAuth/OIDC support powered by Dex.
Zero Configuration
Connect to your existing identity providers without complex setup. Dex handles the OAuth/OIDC protocols while pLLM manages authorization.
Enterprise Security
Industry-standard OAuth 2.0 and OpenID Connect protocols with support for SAML, LDAP, and multi-factor authentication.
Supported Identity Providers
Connect with popular identity providers and enterprise systems such as Google, Microsoft, LDAP, and Active Directory, plus many more through standard OAuth 2.0, OIDC, SAML, and LDAP protocols.
Simple Configuration
Get started with OAuth/OIDC in minutes with a simple YAML configuration
auth:
  dex:
    issuer: https://dex.yourcompany.com
    connectors:
      - type: oidc
        name: Google
        config:
          issuer: https://accounts.google.com
          clientID: your-google-client-id
          clientSecret: your-google-client-secret
      - type: ldap
        name: Corporate Directory
        config:
          host: ldap.yourcompany.com:636
          insecureNoSSL: false
          bindDN: cn=admin,dc=company,dc=com
System Architecture
Enterprise-grade architecture designed for high availability, scalability, and performance
Client Layer
Applications & Services
- Web Applications: React, Vue, Angular
- Mobile Apps: iOS, Android, React Native
- Backend Services: Node.js, Python, Go
- AI Platforms: LangChain, AutoGPT
pLLM Gateway
Intelligent Routing Engine
- Core Gateway: High-performance Go runtime
- Router: Chi HTTP Router
- Auth: JWT & RBAC
- Cache: Redis Layer
- Monitor: Metrics & Logs
Provider Layer
LLM Service Providers
| Provider | Status | Uptime |
|---|---|---|
| OpenAI | healthy | 99.9% |
| Anthropic | healthy | 99.9% |
| Azure OpenAI | degraded | 85.2% |
| AWS Bedrock | healthy | 99.9% |
| Google Vertex | healthy | 99.9% |
| Llama | failed | 0% |
Data Flow & Features
Real-time monitoring and intelligent routing
Circuit Breaker
Automatic failover protection
Health Checks
Continuous monitoring
Rate Limiting
Traffic control & quotas
Analytics
Performance insights
Live Performance Metrics
- 1000+ Requests/sec
- <1ms Latency
- 99.9% Uptime
- 65MB Memory
Intelligent Load Balancing
Choose from six load-balancing strategies optimized for different use cases and workload patterns.
Round Robin
Even distribution across all providers
Least Busy
Routes to least loaded provider
Weighted
Custom weight distribution
Priority
Prefers high-priority providers
Latency-Based
Routes to fastest responding provider
Usage-Based
Respects rate limits and quotas
Production-Ready Stack
Built with battle-tested technologies and modern best practices for enterprise reliability.
Chi Router
Lightning-fast HTTP routing and middleware
PostgreSQL
Reliable data persistence with GORM ORM
Redis
High-speed caching and rate limiting
Prometheus
Enterprise monitoring and metrics
Adaptive Request Flow
Interactive visualization of intelligent routing with automatic failover and circuit breaker protection.
Control Everything from One Place
A full management UI for configuring routes, onboarding providers, managing API keys and teams, tuning guardrails, and monitoring your gateway — no CLI required.
Visual Route Builder
Design route graphs with drag-and-drop. Set model weights, fallback chains, and routing strategies visually — no YAML editing required.
Real-Time Monitoring
Live dashboards tracking request volume, latency percentiles, error rates, and token usage across all providers in real time.
Provider Onboarding
Add new LLM providers in seconds. Enter credentials, test connectivity, and start routing — all from the UI.
API Key Management
Generate, rotate, and revoke API keys with granular permissions. Set rate limits and expiry per key.
Team Management
Organize users into teams with role-based access control. Assign routes, quotas, and permissions per team.
Guardrail Configuration
Enable and tune guardrails per route. Configure PII detection thresholds, content policies, and execution modes visually.
Usage Analytics
Detailed breakdowns of token usage, cost per provider, and request patterns. Export reports for billing and audits.
Configuration Management
Edit gateway configuration live. Changes apply instantly without restarts or redeployments.
Try the Dashboard
Start pLLM and open the dashboard at http://localhost:8080/dashboard.
Performance Benchmarks
Real-world performance data showing why Go-based pLLM outperforms interpreted gateway solutions.
Benchmark highlights: requests per second, P99 latency, memory efficiency, and cold start time.
Performance Comparison
pLLM vs a typical interpreted gateway, compared across concurrent connections, memory usage, startup time, and response time.
Load Testing Results
Stress tested with 10,000 concurrent users making chat completion requests: zero failed requests under normal conditions, with latency measured as gateway overhead only (excluding LLM processing) and memory measured at its peak during the 10K concurrent connections.
Enterprise Scalability
Built-in scalability features that make pLLM ideal for high-volume production workloads.
True Parallelism
No GIL limitations - utilize all CPU cores effectively
Memory Efficient
Native compilation with optimized memory management
Instant Scaling
Sub-100ms startup enables aggressive auto-scaling
Network Optimized
Efficient connection pooling and keep-alive management
Enterprise Performance Scaling
For massive performance and ultra-low latency, the bottleneck is often the LLM providers themselves, not the gateway. To achieve true enterprise scale:
- Multiple LLM Deployments: Deploy several instances of the same model (e.g., 5-10 GPT-4 Azure OpenAI deployments)
- Multi-Provider Redundancy: Use multiple AWS Bedrock accounts, Azure regions, or provider accounts
- Geographic Distribution: Deploy models across regions for latency optimization
Why This Matters: A single LLM deployment typically handles 60-100 RPM. For 10,000+ concurrent users, you need multiple deployments of the same model to prevent provider-side bottlenecks. pLLM's adaptive routing automatically distributes load across all deployments.
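One way to express this with the route API shown earlier is to spread a single logical model across several deployments. The Python sketch below is heavily hedged: the deployment-specific model names are hypothetical, the admin URL and Authorization header are assumptions, and the exact way individual deployments are registered as model instances is not covered on this page.

import requests

# Hypothetical: three Azure deployments of the same model, balanced by least
# latency, with another provider as the fallback. Names, URL, and the auth
# header are placeholders.
route = {
    "name": "GPT-4o Scaled",
    "slug": "gpt-4o-scaled",
    "strategy": "least-latency",
    "models": [
        {"model_name": "gpt-4o-azure-eastus", "weight": 34, "priority": 100},
        {"model_name": "gpt-4o-azure-westeu", "weight": 33, "priority": 100},
        {"model_name": "gpt-4o-azure-sweden", "weight": 33, "priority": 100},
    ],
    "fallback_models": ["claude-sonnet-4"],
}

requests.post(
    "https://pllm.company.com/api/admin/routes",
    json=route,
    headers={"Authorization": "Bearer sk-master-key"},  # assumed auth scheme
    timeout=10,
).raise_for_status()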
Get Started in Minutes
Choose your deployment method and get pLLM running in your environment quickly.
Deployment Options
Kubernetes with Helm
Production-ready deployment with auto-scaling
# Add the Helm repository
helm repo add pllm https://andreimerfu.github.io/pllm
helm repo update
# Install with your configuration
helm install pllm pllm/pllm \
  --set pllm.secrets.jwtSecret="your-jwt-secret" \
  --set pllm.secrets.masterKey="sk-master-key" \
  --set pllm.secrets.openaiApiKey="sk-your-openai-key"
# Check status
kubectl get pods -l app.kubernetes.io/name=pllm
Docker Compose
Perfect for development and testing
# Clone and setup
git clone https://github.com/andreimerfu/pllm.git
cd pllm
# Configure environment
cp .env.example .env
echo "OPENAI_API_KEY=sk-your-key-here" >> .env
# Launch pLLM
docker compose up -d
# Test it works
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello!"}]}'
Binary Installation
Lightweight deployment for simple setups
# Download latest release
wget https://github.com/andreimerfu/pllm/releases/latest/download/pllm-linux-amd64
# Make executable
chmod +x pllm-linux-amd64
# Set environment variables
export OPENAI_API_KEY=sk-your-key-here
export JWT_SECRET=your-jwt-secret
export MASTER_KEY=sk-master-key
# Run pLLM
./pllm-linux-amd64 server
Drop-in Integration
pLLM is 100% OpenAI compatible. Just change your base URL and you're ready to go.
Python
from openai import OpenAI
# Just change the base_url - that's it!
client = OpenAI(
    api_key="your-api-key",
    base_url="http://localhost:8080/v1"  # ← Point to pLLM
)

# Use exactly like OpenAI
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)
Node.js
import OpenAI from 'openai';
const openai = new OpenAI({
  apiKey: 'your-api-key',
  baseURL: 'http://localhost:8080/v1' // ← Point to pLLM
});

const completion = await openai.chat.completions.create({
  model: "gpt-3.5-turbo",
  messages: [{role: "user", content: "Hello!"}]
});
cURL
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-api-key" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
Enterprise Support
While pLLM is open source and free, we offer professional support and custom solutions for enterprises with mission-critical requirements.
Community
Perfect for developers and small teams getting started with pLLM.
What's included:
- GitHub Issues & Discussions
- Documentation and guides
- Best effort response time
- Open source under MIT license
Limitations:
- No SLA guarantees
- Community-driven support
- No priority bug fixes
Professional
For growing businesses that need reliable support and faster issue resolution.
What's included:
- Priority email support
- Guaranteed 24-hour response time
- Deployment assistance
- Configuration review
- Priority bug fixes
- Access to beta features
Limitations:
- Business hours support only
- Email-based communication
Enterprise
Comprehensive support for mission-critical deployments with custom requirements.
What's included:
- Dedicated support engineer
- Custom SLA (down to 2-hour response)
- Phone & video call support
- Custom feature development
- Architecture consulting
- On-site deployment assistance
- Training and workshops
- Priority feature requests
Ready for Enterprise Deployment?
Contact our team for custom integrations, dedicated support, and enterprise-grade deployment assistance.
Professional Consultation
Architecture review, deployment planning, and best practices guidance from our core team.
Custom Development
Tailored features, custom integrations, and specialized deployment configurations for your use case.
Dedicated Support
Priority support channels, SLA guarantees, and direct access to our engineering team.
Schedule a Consultation
Book a 30-minute call to discuss your requirements and explore how pLLM can fit your enterprise needs.
Enterprise Inquiry
Submit a detailed inquiry for custom features, deployment assistance, or partnership opportunities.
Email Us Directly
Prefer email? Reach out directly for enterprise inquiries, procurement, or partnership discussions.
Frequently Asked Questions
Everything you need to know about pLLM, from technical details to enterprise support options.
Is pLLM free to use commercially?
Yes, pLLM is completely free and open source under the MIT license. This means you can use it in commercial applications, modify the code, and deploy it anywhere without licensing fees. The only costs you'll incur are your infrastructure expenses (servers, cloud resources) and API costs from the LLM providers themselves (OpenAI, Anthropic, etc.).
How is pLLM different from Python-based gateway solutions?
pLLM is built in Go for superior performance and lower resource usage compared to Python-based solutions. Key advantages include: sub-millisecond routing overhead, native compilation for better performance, 3-6x lower memory usage, 20-50x faster startup times, and true parallel processing without GIL limitations. Plus, it's 100% OpenAI API compatible, requiring zero code changes to integrate.
Which LLM providers does pLLM support?
pLLM supports all major LLM providers including OpenAI (GPT-3.5, GPT-4, GPT-4 Turbo), Anthropic Claude, Azure OpenAI, AWS Bedrock, Google Vertex AI, Groq, and Cohere. The unified API interface means you can switch between providers or use multiple providers simultaneously with intelligent routing and automatic failover.
Do you offer enterprise support?
Yes, we provide comprehensive enterprise support including dedicated support engineers, custom SLA agreements (down to 2-hour response times), priority bug fixes, custom feature development, architecture consulting, on-site deployment assistance, and training workshops. Enterprise support is available through custom pricing based on your specific requirements.
Can pLLM run on-premise or in air-gapped environments?
Absolutely. pLLM is designed for flexible deployment scenarios including on-premise installations, air-gapped environments, and hybrid cloud setups. We provide Kubernetes manifests, Docker containers, and can assist with custom deployment configurations. The gateway can run entirely within your infrastructure while connecting to external LLM APIs or internal models.
What security features does pLLM include?
pLLM includes comprehensive security features: JWT-based authentication, Role-Based Access Control (RBAC), audit logging for compliance, OAuth/OIDC integration through Dex (supporting Google, Microsoft, LDAP, Active Directory), API key management, rate limiting, and request monitoring. All communications use TLS encryption, and we support enterprise identity providers.
How does pLLM perform under load?
pLLM is optimized for high-performance scenarios: thousands of concurrent connections, sub-millisecond routing overhead, efficient memory usage (50-80MB typical), fast startup times (<100ms), and intelligent caching to reduce API costs. The Go-based architecture provides significant performance advantages over interpreted language solutions.
Is there documentation and community support?
Yes! We have comprehensive documentation, GitHub discussions for community support, and regular updates on our roadmap. The open-source community actively contributes features and bug fixes. For enterprise customers, we provide dedicated documentation, training materials, and direct access to our engineering team.
Product Roadmap
Exciting features coming soon to make pLLM even more powerful and enterprise-ready.
Key Rotation & Secret Management
Automated key rotation and integration with external secret managers for enhanced security.
Advanced Guardrails
Pluggable content guardrails with pre-call, post-call, during-call, and logging-only modes. Marketplace with Presidio PII detection and more.
Enhanced Audit & Logging
Comprehensive audit trails with retention policies and compliance reporting.
Shape Our Roadmap
Have a feature request or want to influence our development priorities? We'd love to hear from you.