pLLM
High-Performance LLM Gateway
Drop-in OpenAI replacement built in Go. Handle thousands of concurrent requests with adaptive routing, multi-provider support, and enterprise-grade reliability.
# Drop-in OpenAI replacement — just change the base URL
$ curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $PLLM_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Hi"}]}'
{
  "provider": "openai",
  "model": "gpt-4o",
  "latency_ms": 142,
  "route": "least-latency"
}
Enterprise-Grade Features
Built from the ground up for production workloads with performance, reliability, and developer experience in mind.
100% OpenAI Compatible
Drop-in replacement for OpenAI API. No code changes needed - just update your base URL and you're ready to go.
Multi-Provider Support
Support for OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Vertex AI, Llama, and Cohere with unified interface.
Adaptive Routing
Intelligent request routing with automatic failover, circuit breakers, and health-based load balancing.
High Performance
Built in Go for maximum performance. Handle thousands of concurrent requests with minimal latency overhead.
Enterprise Security
JWT authentication, RBAC, audit logging, and comprehensive monitoring with Prometheus metrics.
Cost Optimization
Budget management, intelligent caching, and multi-key load balancing to minimize API costs.
Technical Excellence
Deep technical capabilities designed for mission-critical production environments.
Performance
- Sub-millisecond routing overhead
- Thousands of concurrent connections
- Native compilation with Go
- Efficient memory management
Reliability
- Circuit breaker protection
- Automatic health monitoring
- Graceful degradation
- Zero-downtime deployments
Scalability
- Horizontal scaling ready
- Kubernetes native
- Redis-backed caching
- Distributed rate limiting
Observability
- Prometheus metrics
- Grafana dashboards
- Distributed tracing
- Comprehensive logging
Performance Advantage
See how pLLM compares to typical interpreted gateway solutions.
| Metric | pLLM (Go) | Typical Gateway | Advantage |
|---|---|---|---|
| Concurrent Connections | Thousands | Limited | Superior |
| Memory Usage | 50-80MB | 150-300MB+ | 3-6x Less |
| Startup Time | <100ms | 2-5s | 20-50x Faster |
| CPU Efficiency | All cores | GIL limited | True Parallel |
Your App Says "smart". pLLM Does the Rest.
Define named routes that map to multiple LLM providers with configurable strategies. Your application calls a single route slug — pLLM selects the best backend in real-time, handles failures automatically, and falls back gracefully. No code changes. Ever.
Two-Level Selection
Routes solve which model to use. The system then picks the best instance of that model.
1. Define a Route
Via Admin API or config — no restart needed
# Define a route called "smart" — your app just calls model: "smart"
# pLLM picks the best backend automatically
POST /api/admin/routes
{
  "name": "Smart Route",
  "slug": "smart",
  "strategy": "least-latency",
  "models": [
    { "model_name": "gpt-4o", "weight": 60, "priority": 100 },
    { "model_name": "claude-sonnet-4", "weight": 30, "priority": 80 },
    { "model_name": "gemini-2.5-pro", "weight": 10, "priority": 60 }
  ],
  "fallback_models": ["gpt-4o-mini", "claude-haiku"]
}
2. Use It in Your App
Standard OpenAI SDK — the route slug is the model name
from openai import OpenAI

client = OpenAI(
    base_url="https://pllm.company.com/v1",
    api_key="pk-team-abc123"
)

# Your app calls the route slug — not a specific model
# pLLM selects the best backend in real-time
response = client.chat.completions.create(
    model="smart",  # ← This is a route, not a model
    messages=[{"role": "user", "content": "Analyze this data"}],
    stream=True
)

# If GPT-4o is slow → routes to Claude
# If Claude is down → circuit breaker opens, fails over to Gemini
# If all primary models fail → falls back to gpt-4o-mini
# Your app never knows. Zero code changes.
Four Routing Strategies
Each route chooses a strategy. The strategy runs at request time to pick the best model — using real-time metrics, not static config.
Least Latency
least-latency Routes every request to the fastest responding provider in real-time. Uses distributed latency tracking across all gateway nodes via Redis.
Exponential moving average (90/10) ensures responsive adaptation without oscillation
Weighted Round-Robin
weighted-round-robin Distributes requests proportionally based on model weights using smooth interleaving — the same algorithm nginx uses.
Smooth WRR prevents bursts: a 60/30/10 split interleaves evenly, not in blocks (see the sketch after this list)
Priority
priority Always routes to the highest-priority healthy provider. Lower-priority models only receive traffic during failover.
Use with fallback chains to build cost tiers: premium → standard → economy
Random
random Uniform random selection across all healthy providers. Simple, stateless, no coordination needed between gateway nodes.
Zero overhead — no counters, no Redis lookups, no state to synchronize
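To make the latency and weighting strategies above concrete, here is a minimal Python sketch of the 90/10 exponential moving average used by least-latency and the nginx-style smooth weighted round-robin interleaving. It is an illustration only, not pLLM's actual Go implementation, and it keeps state in memory rather than sharing it across nodes via Redis.

# Illustrative sketch only; pLLM implements these strategies in Go,
# with latency state shared across gateway nodes via Redis.

class LatencyTracker:
    """90/10 exponential moving average of observed latency per model."""
    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha            # weight given to the newest sample
        self.avg: dict[str, float] = {}

    def record(self, model: str, latency_ms: float) -> None:
        prev = self.avg.get(model, latency_ms)
        self.avg[model] = (1 - self.alpha) * prev + self.alpha * latency_ms

    def fastest(self, candidates: list[str]) -> str:
        return min(candidates, key=lambda m: self.avg.get(m, float("inf")))


def smooth_wrr(weights: dict[str, int]):
    """Nginx-style smooth weighted round-robin: spreads picks evenly."""
    current = {m: 0 for m in weights}
    total = sum(weights.values())
    while True:
        for m, w in weights.items():
            current[m] += w
        best = max(current, key=current.get)  # highest running score wins
        current[best] -= total
        yield best


picker = smooth_wrr({"gpt-4o": 60, "claude-sonnet-4": 30, "gemini-2.5-pro": 10})
print([next(picker) for _ in range(10)])
# gpt-4o is chosen 6 times, claude-sonnet-4 3 times, gemini-2.5-pro once,
# and the picks are interleaved rather than grouped in blocks.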
Three-Layer Failover
When a provider fails, pLLM doesn't just retry — it escalates through three layers of resilience before ever returning an error to your app.
Instance Retry
If an instance fails, pLLM tries another instance of the same model with increasing timeouts (1.5x multiplier).
Model Failover
If all instances of a model fail, the route strategy selects the next model in the route's model list.
Fallback Chain
If every model in the route is exhausted, pLLM walks through the fallback_models chain as a last resort.
Each retry uses 1.5x increased timeout to account for provider recovery. Up to 10 failover hops with loop detection.
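Put together, the escalation can be pictured with a short Python sketch. Everything below is a simplified stand-in: the helper functions, exceptions, and argument shapes are hypothetical, and the real logic lives in the Go gateway.

TIMEOUT_MULTIPLIER = 1.5   # from the retry policy described above
MAX_HOPS = 10              # failover hop cap with loop detection


class ProviderError(Exception):
    pass


class AllProvidersFailed(Exception):
    pass


def instances_of(model: str) -> list[str]:
    # Stand-in: in pLLM the instances come from the configured deployments.
    return [f"{model}#1", f"{model}#2"]


def call_instance(instance: str, request: dict, timeout: float) -> dict:
    # Stand-in for the actual provider call.
    raise ProviderError(f"{instance} unavailable")


def complete_with_failover(models: list[str], fallback_models: list[str],
                           request: dict, base_timeout: float = 30.0) -> dict:
    tried, hops = set(), 0
    # Layers 2 and 3: walk the route's primary models, then its fallback chain.
    for model in models + fallback_models:
        if model in tried or hops >= MAX_HOPS:
            continue                              # loop detection / hop cap
        tried.add(model)
        timeout = base_timeout
        # Layer 1: retry another instance of the same model first.
        for instance in instances_of(model):
            hops += 1
            try:
                return call_instance(instance, request, timeout=timeout)
            except ProviderError:
                timeout *= TIMEOUT_MULTIPLIER     # 1.5x longer on each retry
    raise AllProvidersFailed("all models and fallbacks exhausted")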
Self-Healing Circuit Breaker
Unhealthy providers are automatically removed from rotation and tested for recovery — no manual intervention needed.
CLOSED — Healthy
Normal operation. All traffic flows through. Failure counter active.
OPEN — Unhealthy
Traffic blocked. Provider removed from rotation. 30s cooldown starts.
HALF-OPEN — Testing
Single test request allowed. Success → closed. Failure → back to open.
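The state machine behind this is small. Here is a minimal Python sketch for illustration: the 30-second cooldown comes from the description above, the failure threshold of 5 is an assumed value, and pLLM's real breaker is implemented in Go.

import time


class CircuitBreaker:
    """CLOSED -> OPEN -> HALF-OPEN breaker, as described above (sketch only)."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.cooldown_s:
                self.state = "half-open"   # cooldown over: allow a test request
                return True
            return False                   # still open: provider stays out of rotation
        return True                        # closed or half-open: let traffic through

    def record_success(self) -> None:
        self.failures = 0
        self.state = "closed"              # test request succeeded: back to normal

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"            # remove provider from rotation
            self.opened_at = time.monotonic()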
Why This Changes Everything for Your App
Zero Retry Logic
Stop writing try/catch/retry in every LLM call. pLLM handles retries, failover, and fallbacks at the gateway level. Your app code stays clean.
Swap Providers at Runtime
Move from OpenAI to Anthropic — or add Gemini to the mix — via the admin API. No deploy, no restart, no code changes in any app.
Cost Tiers Without Code
Route "fast" to cheap models, "smart" to premium ones. Use weighted distribution to send 80% to budget providers and 20% to premium for A/B testing.
A/B Test Models
Use weighted round-robin to gradually shift traffic. Route 10% to a new model, monitor latency and quality through the stats API, then scale up (see the sketch after this list).
Survive Provider Outages
When OpenAI goes down, your app doesn't. Circuit breakers detect the failure, traffic shifts to Anthropic or Azure, and self-heals when OpenAI recovers.
Route-Level Analytics
Every route tracks per-model traffic distribution, latency, token usage, and cost. See exactly where your budget goes with the stats endpoint.
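As a concrete example of the A/B pattern above, the route API shown earlier can express a 90/10 split. The Python sketch below is illustrative: the gateway URL and the Authorization header used for the admin API are assumptions, so adjust them to your deployment.

import requests

# Send 90% of "smart" traffic to the incumbent model and 10% to a candidate
# using weighted round-robin. The base URL and auth header are assumptions.
PLLM_ADMIN = "https://pllm.company.com/api/admin"

route = {
    "name": "Smart Route (A/B)",
    "slug": "smart",
    "strategy": "weighted-round-robin",
    "models": [
        {"model_name": "gpt-4o", "weight": 90, "priority": 100},
        {"model_name": "claude-sonnet-4", "weight": 10, "priority": 100},
    ],
    "fallback_models": ["gpt-4o-mini"],
}

resp = requests.post(
    f"{PLLM_ADMIN}/routes",
    json=route,
    headers={"Authorization": "Bearer sk-master-key"},  # assumed auth scheme
    timeout=10,
)
resp.raise_for_status()
# Shift the weights (e.g. 75/25, then 50/50) as latency and quality hold up.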
Protect Every Request. Secure Every Response.
Guardrails are pluggable security filters that validate, mask, and monitor content flowing through your LLM gateway. Run them before the LLM call, after, in parallel, or at logging time — each with its own purpose and behavior.
Request Lifecycle
Guardrails plug into four stages of the request pipeline. Each stage has different behavior — choose what makes sense for your use case.
Logging-only guardrails run separately before data reaches storage. They mask sensitive content in logs and analytics without affecting the live request/response flow.
Four Execution Modes
Each guardrail can run in one or more modes. The same guardrail — like PII detection — can mask data pre-call, scan responses post-call, and redact logs simultaneously.
Pre-Call
pre_call Intercepts and validates requests before they reach the LLM. Mask PII, block prompt injections, enforce compliance policies — all before a single token is sent.
User sends credit card number → pLLM masks it as [CREDIT_CARD] → LLM never sees the original
Post-Call
post_call Scans LLM responses before they reach your users. Detect hallucinated PII, toxic content, or compliance violations in model output.
LLM generates a fake SSN in output → pLLM detects and flags it → logged for review
During-Call
during_call Runs guardrails in parallel with the LLM call for zero-latency monitoring. Ideal for background security checks that don't need to block the response.
While LLM processes the request → pLLM runs threat detection in background → alerts on anomalies
Logging-Only
logging_only Masks sensitive data before it's stored in logs or analytics. Ensures PII never reaches your database, audit trails, or monitoring systems.
Request contains email → stored in logs as [EMAIL] → analytics stay PII-free
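The four modes sit at fixed points in the request pipeline. The sketch below is purely conceptual, with illustrative function names rather than pLLM internals, to show where each mode runs relative to the LLM call.

import asyncio
from copy import deepcopy

# Conceptual sketch of where each execution mode runs; all names are illustrative.

async def run_guardrails(mode: str, payload: dict) -> dict:
    print(f"running {mode} guardrails")
    return payload                       # a real guardrail may mask or block content


async def call_llm(request: dict) -> dict:
    return {"choices": [{"message": {"content": "..."}}]}


def store_logs(payload: dict) -> None:
    pass                                 # stand-in for the logging/analytics sink


async def handle(request: dict) -> dict:
    request = await run_guardrails("pre_call", request)          # mask PII, block injections
    background = asyncio.create_task(
        run_guardrails("during_call", deepcopy(request))          # parallel, non-blocking
    )
    response = await call_llm(request)
    response = await run_guardrails("post_call", response)        # scan the model output
    store_logs(await run_guardrails("logging_only", deepcopy(request)))  # redact before storage
    await background
    return response


asyncio.run(handle({"model": "smart", "messages": []}))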
Guardrails Marketplace
Choose from pre-built guardrails for common security and compliance needs. Each integrates with a leading provider and plugs into any execution mode.
Presidio PII Detection
by Microsoft Presidio
Detect and mask personally identifiable information including names, emails, phone numbers, credit cards, and SSNs using Microsoft's open-source PII engine.
Lakera Guard
by Lakera AI
Protect against prompt injections, jailbreaks, and other LLM-specific attacks. Real-time threat detection purpose-built for AI security.
OpenAI Moderation
by OpenAI
Leverage OpenAI's moderation API to classify content across categories like hate speech, violence, self-harm, and sexual content.
Aporia Guardrails
by Aporia
Enterprise ML security and compliance platform. Monitor model behavior, detect hallucinations, and enforce organizational policies.
Build Your Own Guardrail
A custom connector interface is on the roadmap — bring your own guardrail provider by implementing a simple HTTP interface. Plug in internal compliance tools, custom ML models, or any third-party service.
Simple YAML Configuration
Enable guardrails with a few lines in your config — no code changes needed
guardrails:
  enabled: true
  guardrails:
    # Pre-call: Mask PII before sending to LLM
    - guardrail_name: "presidio-pii-detection"
      provider: "presidio"
      mode: ["pre_call", "logging_only"]
      enabled: true
      config:
        analyzer_url: "http://presidio-analyzer:3000"
        anonymizer_url: "http://presidio-anonymizer:3000"
        entities:
          - PERSON
          - EMAIL_ADDRESS
          - PHONE_NUMBER
          - CREDIT_CARD
          - SSN
        threshold: 0.7
        mask_pii: true
        language: "en"

    # Post-call: Detect PII leaked in responses
    - guardrail_name: "presidio-pii-response-scan"
      provider: "presidio"
      mode: ["post_call"]
      enabled: true
      config:
        threshold: 0.8
Enterprise Authentication
Seamless integration with your existing identity infrastructure through OAuth/OIDC support powered by Dex.
Zero Configuration
Connect to your existing identity providers without complex setup. Dex handles the OAuth/OIDC protocols while pLLM manages authorization.
Enterprise Security
Industry-standard OAuth 2.0 and OpenID Connect protocols with support for SAML, LDAP, and multi-factor authentication.
Supported Identity Providers
Connect with popular identity providers and enterprise systems such as Google, Microsoft, LDAP, and Active Directory, plus many more through standard OAuth 2.0, OIDC, SAML, and LDAP protocols.
Simple Configuration
Get started with OAuth/OIDC in minutes with a simple YAML configuration
auth:
  dex:
    issuer: https://dex.yourcompany.com
    connectors:
      - type: oidc
        name: Google
        config:
          issuer: https://accounts.google.com
          clientID: your-google-client-id
          clientSecret: your-google-client-secret
      - type: ldap
        name: Corporate Directory
        config:
          host: ldap.yourcompany.com:636
          insecureNoSSL: false
          bindDN: cn=admin,dc=company,dc=com
System Architecture
Enterprise-grade architecture designed for high availability, scalability, and performance
Client Layer
Applications & Services
- Web Applications: React, Vue, Angular
- Mobile Apps: iOS, Android, React Native
- Backend Services: Node.js, Python, Go
- AI Platforms: LangChain, AutoGPT
pLLM Gateway
Intelligent Routing Engine
- Core Gateway: High-performance Go runtime
- Router: Chi HTTP Router
- Auth: JWT & RBAC
- Cache: Redis Layer
- Monitor: Metrics & Logs
Provider Layer
LLM Service Providers
| Provider | Status | Uptime |
|---|---|---|
| OpenAI | healthy | 99.9% |
| Anthropic | healthy | 99.9% |
| Azure OpenAI | degraded | 85.2% |
| AWS Bedrock | healthy | 99.9% |
| Google Vertex | healthy | 99.9% |
| Llama | failed | 0% |
Data Flow & Features
Real-time monitoring and intelligent routing
Circuit Breaker
Automatic failover protection
Health Checks
Continuous monitoring
Rate Limiting
Traffic control & quotas
Analytics
Performance insights
Live Performance Metrics
- 1000+ Requests/sec
- <1ms Latency
- 99.9% Uptime
- 65MB Memory
Intelligent Load Balancing
Choose from six load-balancing strategies optimized for different use cases and workload patterns.
Round Robin
Even distribution across all providers
Least Busy
Routes to least loaded provider
Weighted
Custom weight distribution
Priority
Prefers high-priority providers
Latency-Based
Routes to fastest responding provider
Usage-Based
Respects rate limits and quotas
Production-Ready Stack
Built with battle-tested technologies and modern best practices for enterprise reliability.
Chi Router
Lightning-fast HTTP routing and middleware
PostgreSQL
Reliable data persistence with GORM ORM
Redis
High-speed caching and rate limiting
Prometheus
Enterprise monitoring and metrics
Adaptive Request Flow
Interactive visualization of intelligent routing with automatic failover and circuit breaker protection.
Control Everything from One Place
A full management UI for configuring routes, onboarding providers, managing API keys and teams, tuning guardrails, and monitoring your gateway — no CLI required.
Visual Route Builder
Design route graphs with drag-and-drop. Set model weights, fallback chains, and routing strategies visually — no YAML editing required.
Real-Time Monitoring
Live dashboards tracking request volume, latency percentiles, error rates, and token usage across all providers in real time.
Provider Onboarding
Add new LLM providers in seconds. Enter credentials, test connectivity, and start routing — all from the UI.
API Key Management
Generate, rotate, and revoke API keys with granular permissions. Set rate limits and expiry per key.
Team Management
Organize users into teams with role-based access control. Assign routes, quotas, and permissions per team.
Guardrail Configuration
Enable and tune guardrails per route. Configure PII detection thresholds, content policies, and execution modes visually.
Usage Analytics
Detailed breakdowns of token usage, cost per provider, and request patterns. Export reports for billing and audits.
Configuration Management
Edit gateway configuration live. Changes apply instantly without restarts or redeployments.
Try the Dashboard
Start pLLM and open the dashboard at http://localhost:8080/dashboard.
Performance Benchmarks
Real-world performance data showing why Go-based pLLM outperforms interpreted gateway solutions.
Benchmark highlights: requests per second, P99 latency, memory efficiency, and cold start time.
Performance Comparison
pLLM vs a typical interpreted gateway, compared across concurrent connections, memory usage, startup time, and response time.
Load Testing Results
Stress tested with 10,000 concurrent users making chat completion requests: zero failed requests under normal conditions, with latency measured as gateway overhead only (excluding LLM processing) and memory measured at its peak during the 10K concurrent connections.
Enterprise Scalability
Built-in scalability features that make pLLM ideal for high-volume production workloads.
True Parallelism
No GIL limitations - utilize all CPU cores effectively
Memory Efficient
Native compilation with optimized memory management
Instant Scaling
Sub-100ms startup enables aggressive auto-scaling
Network Optimized
Efficient connection pooling and keep-alive management
Enterprise Performance Scaling
For massive performance and ultra-low latency, the bottleneck is often the LLM providers themselves, not the gateway. To achieve true enterprise scale:
- Multiple LLM Deployments: Deploy several instances of the same model (e.g., 5-10 GPT-4 Azure OpenAI deployments)
- Multi-Provider Redundancy: Use multiple AWS Bedrock accounts, Azure regions, or provider accounts
- Geographic Distribution: Deploy models across regions for latency optimization
Why This Matters: A single LLM deployment typically handles 60-100 RPM. For 10,000+ concurrent users, you need multiple deployments of the same model to prevent provider-side bottlenecks. pLLM's adaptive routing automatically distributes load across all deployments.
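One way to express this with the route API shown earlier is to spread a single logical model across several deployments. The Python sketch below is heavily hedged: the deployment-specific model names are hypothetical, the admin URL and Authorization header are assumptions, and the exact way individual deployments are registered as model instances is not covered on this page.

import requests

# Hypothetical: three Azure deployments of the same model, balanced by least
# latency, with another provider as the fallback. Names, URL, and the auth
# header are placeholders.
route = {
    "name": "GPT-4o Scaled",
    "slug": "gpt-4o-scaled",
    "strategy": "least-latency",
    "models": [
        {"model_name": "gpt-4o-azure-eastus", "weight": 34, "priority": 100},
        {"model_name": "gpt-4o-azure-westeu", "weight": 33, "priority": 100},
        {"model_name": "gpt-4o-azure-sweden", "weight": 33, "priority": 100},
    ],
    "fallback_models": ["claude-sonnet-4"],
}

requests.post(
    "https://pllm.company.com/api/admin/routes",
    json=route,
    headers={"Authorization": "Bearer sk-master-key"},  # assumed auth scheme
    timeout=10,
).raise_for_status()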
Get Started in Minutes
Choose your deployment method and get pLLM running in your environment quickly.
Deployment Options
Kubernetes with Helm
Production-ready deployment with auto-scaling
# Add the Helm repository
helm repo add pllm https://andreimerfu.github.io/pllm
helm repo update
# Install with your configuration
helm install pllm pllm/pllm \
  --set pllm.secrets.jwtSecret="your-jwt-secret" \
  --set pllm.secrets.masterKey="sk-master-key" \
  --set pllm.secrets.openaiApiKey="sk-your-openai-key"
# Check status
kubectl get pods -l app.kubernetes.io/name=pllm
Docker Compose
Perfect for development and testing
# Clone and setup
git clone https://github.com/andreimerfu/pllm.git
cd pllm
# Configure environment
cp .env.example .env
echo "OPENAI_API_KEY=sk-your-key-here" >> .env
# Launch pLLM
docker compose up -d
# Test it works
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello!"}]}'
Binary Installation
Lightweight deployment for simple setups
# Download latest release
wget https://github.com/andreimerfu/pllm/releases/latest/download/pllm-linux-amd64
# Make executable
chmod +x pllm-linux-amd64
# Set environment variables
export OPENAI_API_KEY=sk-your-key-here
export JWT_SECRET=your-jwt-secret
export MASTER_KEY=sk-master-key
# Run pLLM
./pllm-linux-amd64 server
Drop-in Integration
pLLM is 100% OpenAI compatible. Just change your base URL and you're ready to go.
Python
from openai import OpenAI
# Just change the base_url - that's it!
client = OpenAI(
    api_key="your-api-key",
    base_url="http://localhost:8080/v1"  # ← Point to pLLM
)

# Use exactly like OpenAI
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}]
)
Node.js
import OpenAI from 'openai';
const openai = new OpenAI({
  apiKey: 'your-api-key',
  baseURL: 'http://localhost:8080/v1' // ← Point to pLLM
});

const completion = await openai.chat.completions.create({
  model: "gpt-3.5-turbo",
  messages: [{role: "user", content: "Hello!"}]
});
cURL
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-api-key" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
Enterprise Support
While pLLM is open source and free, we offer professional support and custom solutions for enterprises with mission-critical requirements.
Community
Perfect for developers and small teams getting started with pLLM.
What's included:
- GitHub Issues & Discussions
- Documentation and guides
- Best effort response time
- Open source under MIT license
Limitations:
- No SLA guarantees
- Community-driven support
- No priority bug fixes
Professional
For growing businesses that need reliable support and faster issue resolution.
What's included:
- Priority email support
- Guaranteed 24-hour response time
- Deployment assistance
- Configuration review
- Priority bug fixes
- Access to beta features
Limitations:
- Business hours support only
- Email-based communication
Enterprise
Comprehensive support for mission-critical deployments with custom requirements.
What's included:
- Dedicated support engineer
- Custom SLA (down to 2-hour response)
- Phone & video call support
- Custom feature development
- Architecture consulting
- On-site deployment assistance
- Training and workshops
- Priority feature requests
Ready for Enterprise Deployment?
Contact our team for custom integrations, dedicated support, and enterprise-grade deployment assistance.
Professional Consultation
Architecture review, deployment planning, and best practices guidance from our core team.
Custom Development
Tailored features, custom integrations, and specialized deployment configurations for your use case.
Dedicated Support
Priority support channels, SLA guarantees, and direct access to our engineering team.
Schedule a Consultation
Book a 30-minute call to discuss your requirements and explore how pLLM can fit your enterprise needs.
Enterprise Inquiry
Submit a detailed inquiry for custom features, deployment assistance, or partnership opportunities.
Email Us Directly
Prefer email? Reach out directly for enterprise inquiries, procurement, or partnership discussions.
Frequently Asked Questions
Everything you need to know about pLLM, from technical details to enterprise support options.
Is pLLM free to use commercially?
Yes, pLLM is completely free and open source under the MIT license. This means you can use it in commercial applications, modify the code, and deploy it anywhere without licensing fees. The only costs you'll incur are your infrastructure expenses (servers, cloud resources) and API costs from the LLM providers themselves (OpenAI, Anthropic, etc.).
How is pLLM different from Python-based gateway solutions?
pLLM is built in Go for superior performance and lower resource usage compared to Python-based solutions. Key advantages include: sub-millisecond routing overhead, native compilation for better performance, 3-6x lower memory usage, 20-50x faster startup times, and true parallel processing without GIL limitations. Plus, it's 100% OpenAI API compatible, requiring zero code changes to integrate.
Which LLM providers does pLLM support?
pLLM supports all major LLM providers including OpenAI (GPT-3.5, GPT-4, GPT-4 Turbo), Anthropic Claude, Azure OpenAI, AWS Bedrock, Google Vertex AI, Groq, and Cohere. The unified API interface means you can switch between providers or use multiple providers simultaneously with intelligent routing and automatic failover.
Do you offer enterprise support?
Yes, we provide comprehensive enterprise support including dedicated support engineers, custom SLA agreements (down to 2-hour response times), priority bug fixes, custom feature development, architecture consulting, on-site deployment assistance, and training workshops. Enterprise support is available through custom pricing based on your specific requirements.
Can pLLM run on-premise or in air-gapped environments?
Absolutely. pLLM is designed for flexible deployment scenarios including on-premise installations, air-gapped environments, and hybrid cloud setups. We provide Kubernetes manifests, Docker containers, and can assist with custom deployment configurations. The gateway can run entirely within your infrastructure while connecting to external LLM APIs or internal models.
What security features does pLLM include?
pLLM includes comprehensive security features: JWT-based authentication, Role-Based Access Control (RBAC), audit logging for compliance, OAuth/OIDC integration through Dex (supporting Google, Microsoft, LDAP, Active Directory), API key management, rate limiting, and request monitoring. All communications use TLS encryption, and we support enterprise identity providers.
How does pLLM perform under load?
pLLM is optimized for high-performance scenarios: thousands of concurrent connections, sub-millisecond routing overhead, efficient memory usage (50-80MB typical), fast startup times (<100ms), and intelligent caching to reduce API costs. The Go-based architecture provides significant performance advantages over interpreted language solutions.
Is there documentation and community support?
Yes! We have comprehensive documentation, GitHub discussions for community support, and regular updates on our roadmap. The open-source community actively contributes features and bug fixes. For enterprise customers, we provide dedicated documentation, training materials, and direct access to our engineering team.
Product Roadmap
Exciting features coming soon to make pLLM even more powerful and enterprise-ready.
Key Rotation & Secret Management
Automated key rotation and integration with external secret managers for enhanced security.
Advanced Guardrails
Pluggable content guardrails with pre-call, post-call, during-call, and logging-only modes. Marketplace with Presidio PII detection and more.
Enhanced Audit & Logging
Comprehensive audit trails with retention policies and compliance reporting.
Shape Our Roadmap
Have a feature request or want to influence our development priorities? We'd love to hear from you.