# CORTEX
Enterprise-grade self-hosted LLM inference gateway and model management platform
CORTEX is an OpenAI-compatible gateway and admin UI for running vLLM and llama.cpp inference engines on your own infrastructure. It provides secure access control, health-aware routing, usage metering, and a modern admin interface.
## Quick Start

```bash
make quick-start
# Access at: http://YOUR_IP:3001/login (admin/admin)
```
That's it! Cortex auto-detects your IP, configures CORS, and creates the admin user.
## Key Features

### Inference Gateway
- OpenAI-compatible endpoints: `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`
- Multi-engine support: vLLM (GPU) and llama.cpp (CPU/GPU)
- Health checks, circuit breaking, retries, and smart routing
- Streaming responses with time-to-first-token metrics
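Because the gateway speaks the OpenAI wire format, any OpenAI-style client works by pointing its base URL at Cortex. A minimal sketch of the request body (the host, port, and model name below are illustrative placeholders, not Cortex defaults):

```python
import json

# Request body for POST {BASE_URL}/chat/completions, identical to OpenAI's schema.
# The address and model name are hypothetical; use your deployment's values.
BASE_URL = "http://YOUR_IP:8000/v1"      # placeholder gateway address
payload = {
    "model": "my-vllm-model",            # a model registered in Cortex
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "stream": True,                      # stream tokens; Cortex reports TTFT
}
body = json.dumps(payload)
print(body)
```

POST this body with your Cortex API key in an `Authorization: Bearer ...` header, exactly as you would with the OpenAI API.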
### Chat Playground
- Interactive web UI for testing running models
- Real-time streaming with performance metrics (tok/s, TTFT)
- Server-side chat persistence (user-scoped, cross-device)
- Context window tracking and visualization
### Enterprise Security
- Multi-tenant access control with organizations, users, and API keys
- IP allowlisting and rate limiting
- Scoped permissions per model or organization
- Audit logging and usage tracking
### Observability
- Prometheus metrics integration
- Per-model inference metrics (requests, tokens, latency)
- GPU utilization and memory monitoring
- System Monitor dashboard with real-time metrics
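Metrics are exposed in the standard Prometheus text format, so any scraper or a few lines of code can read them. A sketch of pulling values out of a scrape (the metric names here are hypothetical stand-ins, not Cortex's documented metric names):

```python
# Parse Prometheus text-format samples into {series: value}.
# The metric names below are illustrative placeholders.
sample = """\
# HELP cortex_requests_total Total inference requests per model.
# TYPE cortex_requests_total counter
cortex_requests_total{model="llama-3-8b"} 1042
cortex_ttft_seconds{model="llama-3-8b",quantile="0.5"} 0.182
"""

def parse_metrics(text):
    """Return {series_with_labels: float_value}, skipping comment lines."""
    values = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        values[series] = float(value)
    return values

metrics = parse_metrics(sample)
print(metrics)
```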
### Model Management
- Pre-start VRAM estimation and validation
- Startup diagnostics with actionable error fixes
- Model lifecycle management (start, stop, configure)
- Recipe system for configuration templates
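Pre-start VRAM estimation comes down to arithmetic like the following. This is an illustrative back-of-the-envelope formula, not the exact heuristic Cortex applies:

```python
def estimate_vram_gb(params_billion, bytes_per_param, overhead_frac=0.2):
    """Rough VRAM needed to load model weights, plus a fractional
    overhead for KV cache, activations, and runtime context.
    Illustrative only; real estimators account for context length."""
    weights_gb = params_billion * bytes_per_param  # e.g. 1B params * 2 B = ~2 GB
    return weights_gb * (1 + overhead_frac)

# A 7B model in fp16 (2 bytes/param) with 20% overhead:
print(round(estimate_vram_gb(7, 2), 1))  # -> 16.8
```

Dropping to a 4-bit quantization (~0.5 bytes/param) cuts the same model to roughly a quarter of that, which is why the quantization indicators below matter when sizing hardware.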
### GGUF Support
- Smart engine guidance with automatic recommendations
- GGUF validation, metadata extraction, multi-part support
- Quantization quality indicators (Q4_K_M, Q8_0, etc.)
- Speculative decoding for llama.cpp
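GGUF validation starts with the file header: per the GGUF specification, every file opens with the ASCII magic bytes `GGUF` followed by a little-endian uint32 version. A minimal sketch of that first check (this is the format's documented layout, not Cortex's actual validator):

```python
import struct

GGUF_MAGIC = b"GGUF"

def check_gguf_header(data):
    """Return the GGUF version if the magic matches, else None.
    Reads only the first 8 bytes: 4-byte magic + uint32 LE version."""
    if len(data) < 8 or data[:4] != GGUF_MAGIC:
        return None
    (version,) = struct.unpack_from("<I", data, 4)
    return version

# A fabricated 8-byte header claiming GGUF version 3:
header = GGUF_MAGIC + struct.pack("<I", 3)
print(check_gguf_header(header))  # -> 3
```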
### Deployment & Migration
- Offline/air-gapped deployment for restricted networks
- Full system export (Docker images, database, configs)
- Model import with dry-run preview
- Database backup and restore via UI or API
- Job management with progress tracking
## Documentation Guide

### For Administrators
| I want to... | Read this |
|---|---|
| Get started quickly | Quickstart (Docker) |
| Understand all commands | Makefile Guide |
| Configure my deployment | Configuration |
| Set up for production | Admin Setup Guide |
| Test models interactively | Chat Playground |
| Backup my data | Backup & Restore |
| Deploy offline | Offline Deployment |
### For Developers
| I want to... | Read this |
|---|---|
| Understand the architecture | System Overview |
| Work on the backend | Backend Architecture |
| Work on the frontend | Frontend Architecture |
| Contribute code | How to Contribute |
| Follow coding standards | Coding Standards |
### For API Users
| I want to... | Read this |
|---|---|
| Make API calls | OpenAI-Compatible API |
| Use admin endpoints | Admin API |
| Configure models | Model Management |
### Model Guides
| Engine/Format | Documentation |
|---|---|
| vLLM | vLLM Guide |
| llama.cpp | llama.cpp Guide |
| GGUF files | GGUF Format |
| Multi-part GGUF | Multi-Part GGUF |
| Engine selection | Engine Comparison |
| Download models | HuggingFace Download |
## Architecture Overview

```text
┌──────────────────────────────────────────────────────────────┐
│                     Client Applications                      │
│              (curl, Python SDK, Web Apps, etc.)              │
└──────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│                        CORTEX Gateway                        │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ • OpenAI-compatible API     • Auth & Rate Limiting     │  │
│  │ • Health-aware routing      • Usage metering           │  │
│  │ • Circuit breaking          • Model registry           │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘
                              │
              ┌───────────────┼───────────────┐
              ▼               ▼               ▼
      ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
      │ vLLM Model 1 │ │ vLLM Model 2 │ │  llama.cpp   │
      │    (GPU)     │ │    (GPU)     │ │  (CPU/GPU)   │
      └──────────────┘ └──────────────┘ └──────────────┘
```
## Essential Commands

```bash
# Startup
make quick-start       # First-time setup
make up                # Start services
make down              # Stop services
make restart           # Restart services

# Information
make ip                # Show access URLs
make status            # Container status
make health            # Service health
make logs              # View logs

# Database
make db-backup         # Backup database
make db-restore        # Restore database

# Offline deployment
make prepare-offline   # Package for transfer
make load-offline      # Load on target
make verify-offline    # Verify images
```

Run `make help` for all available commands.
## Directory Structure

```text
docs/
├── getting-started/   # Setup and configuration guides
├── features/          # Feature documentation (Chat Playground, etc.)
├── api/               # API reference
├── models/            # Model engine documentation
├── operations/        # Operations and maintenance
├── architecture/      # System design docs
├── security/          # Security documentation
├── contributing/      # Contribution guidelines
└── analysis/          # Implementation analysis
```
## License

Copyright © 2024-2026 Aulendur Labs. See LICENSE.txt and NOTICE.txt.