CORTEX

Enterprise-grade self-hosted LLM inference gateway and model management platform

CORTEX is an OpenAI-compatible gateway and admin UI for running vLLM and llama.cpp inference engines on your own infrastructure. It provides secure access control, health-aware routing, usage metering, and a modern admin interface.


πŸš€ Quick Start

make quick-start
# Access at: http://YOUR_IP:3001/login (admin/admin)

That's it! Cortex auto-detects your IP, configures CORS, and creates the admin user.


Key Features

Inference Gateway

  • OpenAI-compatible endpoints: /v1/chat/completions, /v1/completions, /v1/embeddings
  • Multi-engine support: vLLM (GPU) and llama.cpp (CPU/GPU)
  • Health checks, circuit breaking, retries, and smart routing
  • Streaming responses with time-to-first-token metrics
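Because the endpoints are OpenAI-compatible, any OpenAI-style client works. A minimal sketch from the shell, where the gateway URL, API key, and model name are placeholders for your own deployment:

```shell
# Build an OpenAI-style chat request (all values below are placeholders).
CORTEX_URL="http://YOUR_IP:3001"     # gateway base URL, e.g. from `make ip`
CORTEX_KEY="sk-your-api-key"         # an API key created in the admin UI
PAYLOAD='{"model":"my-model","messages":[{"role":"user","content":"Hello!"}],"stream":false}'

# Uncomment to send it against a running gateway:
# curl -s "$CORTEX_URL/v1/chat/completions" \
#   -H "Authorization: Bearer $CORTEX_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"

# Sanity-check that the request body is valid JSON before sending:
echo "$PAYLOAD" | python3 -c 'import json,sys; json.load(sys.stdin); print("payload OK")'
```

Set "stream": true in the payload (and pass curl -N) to receive tokens as server-sent events instead of a single response.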

Chat Playground

  • Interactive web UI for testing running models
  • Real-time streaming with performance metrics (tok/s, TTFT)
  • Server-side chat persistence (user-scoped, cross-device)
  • Context window tracking and visualization

Enterprise Security

  • Multi-tenant access control with organizations, users, and API keys
  • IP allowlisting and rate limiting
  • Scoped permissions per model or organization
  • Audit logging and usage tracking

Observability

  • Prometheus metrics integration
  • Per-model inference metrics (requests, tokens, latency)
  • GPU utilization and memory monitoring
  • System Monitor dashboard with real-time metrics
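Since the gateway exports Prometheus metrics, a scrape can be smoke-tested from the shell. A sketch only: the /metrics path is an assumption here, so check your deployment's configuration for the exact route:

```shell
CORTEX_URL="http://YOUR_IP:3001"     # gateway base URL (placeholder)

# Against a running gateway, peek at the exported series
# (uncomment once CORTEX_URL points at a live instance):
# curl -s "$CORTEX_URL/metrics" | grep -v '^#' | head

# A Prometheus scrape target for this endpoint would use host:port + path:
echo "scrape target: ${CORTEX_URL#http://}/metrics"
```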

Model Management

  • Pre-start VRAM estimation and validation
  • Startup diagnostics with actionable error fixes
  • Model lifecycle management (start, stop, configure)
  • Recipe system for configuration templates

GGUF Support

  • Smart engine guidance with automatic recommendations
  • GGUF validation, metadata extraction, multi-part support
  • Quantization quality indicators (Q4_K_M, Q8_0, etc.)
  • Speculative decoding for llama.cpp
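One piece of GGUF validation can even be done by hand: every GGUF file begins with the four ASCII magic bytes "GGUF", followed by a little-endian version number. A sketch using a stand-in file created on the spot (a real check would read your actual model file):

```shell
# Create a stand-in file whose first bytes mimic a GGUF v3 header.
printf 'GGUF\x03\x00\x00\x00' > /tmp/example.gguf

# Pre-flight check: read the first four bytes and compare against the magic.
if [ "$(head -c 4 /tmp/example.gguf)" = "GGUF" ]; then
  echo "looks like a GGUF file"
else
  echo "not a GGUF file"
fi
```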

Deployment & Migration

  • Offline/air-gapped deployment for restricted networks
  • Full system export (Docker images, database, configs)
  • Model import with dry-run preview
  • Database backup and restore via UI or API
  • Job management with progress tracking

Documentation Guide

For Administrators

I want to...                  Read this
Get started quickly           Quickstart (Docker)
Understand all commands       Makefile Guide
Configure my deployment       Configuration
Set up for production         Admin Setup Guide
Test models interactively     Chat Playground
Backup my data                Backup & Restore
Deploy offline                Offline Deployment

For Developers

I want to...                     Read this
Understand the architecture      System Overview
Work on the backend              Backend Architecture
Work on the frontend             Frontend Architecture
Contribute code                  How to Contribute
Follow coding standards          Coding Standards

For API Users

I want to...             Read this
Make API calls           OpenAI-Compatible API
Use admin endpoints      Admin API
Configure models         Model Management
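As a sketch of calling the embeddings endpoint from the table above (the URL, key, and model name are placeholders for your deployment):

```shell
CORTEX_URL="http://YOUR_IP:3001"     # gateway base URL (placeholder)
CORTEX_KEY="sk-your-api-key"         # an API key created in the admin UI
PAYLOAD='{"model":"my-embedding-model","input":"CORTEX inference gateway"}'

# Uncomment to run against a live gateway:
# curl -s "$CORTEX_URL/v1/embeddings" \
#   -H "Authorization: Bearer $CORTEX_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"

# Validate the request body locally:
echo "$PAYLOAD" | python3 -c 'import json,sys; json.load(sys.stdin); print("embeddings payload OK")'
```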

Model Guides

Engine/Format        Documentation
vLLM                 vLLM Guide
llama.cpp            llama.cpp Guide
GGUF files           GGUF Format
Multi-part GGUF      Multi-Part GGUF
Engine selection     Engine Comparison
Download models      HuggingFace Download

Architecture Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      Client Applications                     β”‚
β”‚            (curl, Python SDK, Web Apps, etc.)               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     CORTEX Gateway                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚  β”‚ β€’ OpenAI-compatible API  β€’ Auth & Rate Limiting         β”‚β”‚
β”‚  β”‚ β€’ Health-aware routing   β€’ Usage metering               β”‚β”‚
β”‚  β”‚ β€’ Circuit breaking       β€’ Model registry               β”‚β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            β–Ό                 β–Ό                 β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚ vLLM Model 1 β”‚ β”‚ vLLM Model 2 β”‚ β”‚llama.cpp Mod β”‚
    β”‚   (GPU)      β”‚ β”‚   (GPU)      β”‚ β”‚   (CPU/GPU)  β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Essential Commands

# Startup
make quick-start     # First-time setup
make up              # Start services
make down            # Stop services
make restart         # Restart services

# Information
make ip              # Show access URLs
make status          # Container status
make health          # Service health
make logs            # View logs

# Database
make db-backup       # Backup database
make db-restore      # Restore database

# Offline deployment
make prepare-offline # Package for transfer
make load-offline    # Load on target
make verify-offline  # Verify images

Run make help for all available commands.


Directory Structure

docs/
β”œβ”€β”€ getting-started/     # Setup and configuration guides
β”œβ”€β”€ features/            # Feature documentation (Chat Playground, etc.)
β”œβ”€β”€ api/                 # API reference
β”œβ”€β”€ models/              # Model engine documentation
β”œβ”€β”€ operations/          # Operations and maintenance
β”œβ”€β”€ architecture/        # System design docs
β”œβ”€β”€ security/            # Security documentation
β”œβ”€β”€ contributing/        # Contribution guidelines
└── analysis/            # Implementation analysis

License

Copyright Β© 2024-2026 Aulendur Labs. See LICENSE.txt and NOTICE.txt.