# System Overview
CORTEX is an enterprise-grade, self-hosted LLM inference gateway and model management platform. It provides OpenAI-compatible APIs fronting vLLM and llama.cpp engines with health-aware routing, access control, metering, and administration.
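Because the gateway exposes OpenAI-compatible endpoints, existing OpenAI SDKs can be pointed at it directly. A minimal sketch with the official Python client, assuming a gateway at http://localhost:8000/v1, a CORTEX-issued API key, and a model served as llama-3-8b-instruct (all placeholder values):

```python
from openai import OpenAI

# Point the standard OpenAI client at the CORTEX gateway.
# Base URL, API key, and model name are placeholders for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="ck-your-cortex-key")

resp = client.chat.completions.create(
    model="llama-3-8b-instruct",  # the model's served_name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```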
## Architecture Diagram
```mermaid
graph TD
    subgraph Gateway[FastAPI Gateway]
        Router[OpenAI Routes] --> Auth[API Key Auth]
        ChatRouter[Chat Routes] --> SessionAuth[Session Auth]
        Router --> RL[Rate Limit / Concurrency]
        Router --> Choose[URL Selection]
    end

    subgraph Models[Model Containers]
        vLLM1[vLLM Model 1]
        vLLM2[vLLM Model 2]
        LLAMA[llama.cpp Model]
    end

    Client[Client / SDK] -->|HTTP: /v1/*| Gateway
    Admin[Admin UI] -->|HTTP: /admin/*| Gateway
    Chat[Chat Playground] -->|HTTP: /v1/chat/*| Gateway

    Choose --> Models
    Gateway --> DB[(Postgres)]
    Gateway --> Redis[(Redis)]
    Gateway --> Prom[Prometheus]

    Health[Health Poller] --> Models
    DockerMgr[Docker Manager] --> Models
```
## Key Capabilities

### Inference Gateway
- OpenAI-compatible API: /v1/chat/completions, /v1/completions, /v1/embeddings
- Multi-engine support: vLLM (GPU) and llama.cpp (CPU/GPU)
- Health-aware routing: Prefers healthy engines, round-robin load balancing
- Streaming support: Server-sent events with TTFT metrics
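A rough sketch of consuming the streaming endpoint and measuring time-to-first-token on the client side; the gateway URL, API key, and served model name are placeholders:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="ck-your-cortex-key")

start = time.perf_counter()
ttft = None

stream = client.chat.completions.create(
    model="llama-3-8b-instruct",  # served_name of a running model
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:          # some chunks (e.g. usage) carry no delta
        continue
    delta = chunk.choices[0].delta.content or ""
    if delta and ttft is None:
        ttft = time.perf_counter() - start   # time to first token
    print(delta, end="", flush=True)

print(f"\nTTFT: {ttft:.3f}s" if ttft is not None else "\nNo tokens received")
```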
### Chat Playground
- Interactive testing: Web UI for testing running models
- Real-time metrics: Tokens/second, time-to-first-token, context usage
- Server-side persistence: User-scoped chat history stored in database
- Cross-device access: Access chat history from any machine
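As an illustration of user-scoped, server-side chat history (not CORTEX's actual schema; table and column names are assumptions), a minimal SQLAlchemy model might look like:

```python
from datetime import datetime
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class ChatMessage(Base):
    # Hypothetical table; shown only to illustrate user-scoped persistence.
    __tablename__ = "chat_messages"

    id: Mapped[int] = mapped_column(primary_key=True)
    user_id: Mapped[int] = mapped_column(index=True)    # scopes history to a user
    session_id: Mapped[str] = mapped_column(index=True) # groups messages into a chat
    role: Mapped[str]                                   # "user" or "assistant"
    content: Mapped[str]
    created_at: Mapped[datetime] = mapped_column(default=datetime.utcnow)
```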
### Model Management
- Lifecycle management: Create, configure, start, stop, delete models
- Pre-start validation: VRAM estimation, configuration dry-run
- Startup diagnostics: Actionable error fixes for common failures
- Model states: stopped → starting → loading → running (or failed)
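The lifecycle above can be viewed as a small state machine. The sketch below encodes the documented happy path plus failure; any transitions beyond those listed (for example, restarting after a failure) are assumptions, not CORTEX's actual rules:

```python
from enum import Enum

class ModelState(str, Enum):
    STOPPED = "stopped"
    STARTING = "starting"
    LOADING = "loading"
    RUNNING = "running"
    FAILED = "failed"

# Documented happy path: stopped -> starting -> loading -> running (or failed).
# Transitions back to STOPPED/STARTING are assumptions for illustration.
TRANSITIONS = {
    ModelState.STOPPED: {ModelState.STARTING},
    ModelState.STARTING: {ModelState.LOADING, ModelState.FAILED},
    ModelState.LOADING: {ModelState.RUNNING, ModelState.FAILED},
    ModelState.RUNNING: {ModelState.STOPPED, ModelState.FAILED},
    ModelState.FAILED: {ModelState.STOPPED, ModelState.STARTING},
}

def transition(current: ModelState, target: ModelState) -> ModelState:
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```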
### Security & Access Control
- Multi-tenant: Organizations, users, API keys with scoped permissions
- Rate limiting: Per-key limits with Redis backend
- IP allowlisting: Restrict API access by IP range
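For the per-key rate limiting above, one illustrative approach (not CORTEX's actual implementation) is a fixed-window counter in Redis applied as a FastAPI dependency; the limit, Redis hostname, and header handling are assumptions:

```python
import time
from fastapi import Depends, FastAPI, Header, HTTPException
from redis import Redis

app = FastAPI()
redis = Redis(host="redis", port=6379)  # hostname/port assumed

REQUESTS_PER_MINUTE = 60                # illustrative per-key limit

def rate_limit(authorization: str = Header(...)) -> None:
    api_key = authorization.removeprefix("Bearer ").strip()
    window = int(time.time() // 60)
    bucket = f"rl:{api_key}:{window}"
    count = redis.incr(bucket)          # count requests in the current window
    if count == 1:
        redis.expire(bucket, 60)        # window expires after one minute
    if count > REQUESTS_PER_MINUTE:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")

@app.post("/v1/chat/completions", dependencies=[Depends(rate_limit)])
async def chat_completions(body: dict) -> dict:
    ...  # proxy to the selected model container
    return {}
```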
### Observability
- Prometheus metrics: Gateway, per-model, and system metrics
- Usage analytics: Token counts, latency percentiles, TTFT
- System monitoring: CPU, memory, disk, GPU utilization
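A sketch of how such metrics could be exposed with prometheus_client; the metric and label names here are assumptions rather than the gateway's real metric names:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names; CORTEX's actual metrics may differ.
REQUESTS = Counter("gateway_requests_total", "Inference requests", ["model", "endpoint"])
TOKENS = Counter("gateway_tokens_total", "Tokens processed", ["model", "kind"])
TTFT = Histogram("gateway_ttft_seconds", "Time to first token", ["model"])
LATENCY = Histogram("gateway_request_seconds", "End-to-end request latency", ["model"])

def record_usage(model: str, prompt_tokens: int, completion_tokens: int,
                 ttft_s: float, total_s: float) -> None:
    REQUESTS.labels(model=model, endpoint="/v1/chat/completions").inc()
    TOKENS.labels(model=model, kind="prompt").inc(prompt_tokens)
    TOKENS.labels(model=model, kind="completion").inc(completion_tokens)
    TTFT.labels(model=model).observe(ttft_s)
    LATENCY.labels(model=model).observe(total_s)

if __name__ == "__main__":
    start_http_server(9090)  # expose /metrics for Prometheus to scrape
```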
### Operational Features
- Offline/air-gapped deployment: Pre-cached Docker images
- Recipe system: Save and share model configurations
- Configuration validation: Quantization, GPU selection checks
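As a rough illustration of what a shareable, validated model recipe could look like (field names and allowed values are assumptions, not CORTEX's recipe format):

```python
from pydantic import BaseModel, field_validator

class ModelRecipe(BaseModel):
    # Hypothetical recipe shape for saving and sharing a model configuration.
    name: str
    engine: str                      # "vllm" or "llamacpp"
    model_id: str                    # e.g. a Hugging Face repo id
    quantization: str | None = None
    gpu_ids: list[int] = []

    @field_validator("engine")
    @classmethod
    def check_engine(cls, v: str) -> str:
        if v not in {"vllm", "llamacpp"}:
            raise ValueError("engine must be 'vllm' or 'llamacpp'")
        return v

recipe = ModelRecipe(name="llama3-chat", engine="vllm",
                     model_id="meta-llama/Meta-Llama-3-8B-Instruct", gpu_ids=[0])
print(recipe.model_dump_json(indent=2))  # shareable JSON recipe
```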
## Data Flow
1. Client Request → Gateway (FastAPI)
2. Auth/Rate Limit Check
3. Model Registry Lookup (by served_name)
4. Route to Container (vllm-model-{id} or llamacpp-model-{id})
5. Response → Client (with usage metrics)
6. Usage record → Database
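Steps 3 and 4 can be sketched as a lookup from served_name to a container URL; the registry shape and container port are assumptions beyond the vllm-model-{id} / llamacpp-model-{id} naming convention above:

```python
import httpx

# Hypothetical in-memory registry: served_name -> engine, model id, state.
MODEL_REGISTRY = {
    "llama-3-8b-instruct": {"engine": "vllm", "id": 7, "state": "running"},
}

def container_url(served_name: str) -> str:
    entry = MODEL_REGISTRY.get(served_name)
    if entry is None or entry["state"] != "running":
        raise LookupError(f"model '{served_name}' is not available")
    prefix = "vllm-model" if entry["engine"] == "vllm" else "llamacpp-model"
    return f"http://{prefix}-{entry['id']}:8000"  # container port assumed

async def forward_chat(payload: dict) -> dict:
    # Route the request to the selected container and return its response.
    url = container_url(payload["model"])
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.post(f"{url}/v1/chat/completions", json=payload)
        r.raise_for_status()
        return r.json()
```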
## Technology Stack
| Component | Technology |
|---|---|
| Gateway | FastAPI (Python 3.11+) |
| Frontend | Next.js 14 (React) |
| Database | PostgreSQL 16 |
| Cache | Redis 7 |
| Metrics | Prometheus |
| vLLM Engine | vllm/vllm-openai container |
| llama.cpp Engine | ggml-org/llama.cpp container |
See Backend Architecture and Frontend Architecture for details.