Model Management¶

Concepts¶

Stored model records in DB (name, served name, task, flags)
Managed containers named vllm-model-{id} or llamacpp-model-{id}
Registry maps served name → URL and task for routing
Model files are never deleted by Cortex - only database records are removed
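
As a rough illustration of the naming and routing concepts above, the following sketch builds container names and keeps an in-memory registry. The `Route` class, the `register`/`unregister` helpers, the example URL, and the task name are hypothetical; how Cortex actually stores the registry is not described here.

```python
from dataclasses import dataclass

# Engine prefixes taken from the container naming scheme described above.
ENGINE_PREFIX = {"vllm": "vllm-model", "llamacpp": "llamacpp-model"}


def container_name(engine: str, model_id: int) -> str:
    """Build the managed container name for a model record."""
    return f"{ENGINE_PREFIX[engine]}-{model_id}"


@dataclass(frozen=True)
class Route:
    url: str   # where the container serves requests (hypothetical value)
    task: str  # hypothetical task name, e.g. "generate"


# Hypothetical in-memory registry: served name -> route.
registry: dict[str, Route] = {}


def register(served_name: str, url: str, task: str) -> None:
    registry[served_name] = Route(url=url, task=task)


def unregister(served_name: str) -> None:
    # Only the routing entry is removed; nothing on disk is touched.
    registry.pop(served_name, None)
```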

Lifecycle¶

Create → Start → Apply updates (stop/start) → Stop → Archive/Delete (DB only)

State Machine¶

┌─────────┐                    ┌──────────┐                    ┌─────────┐
│ stopped │ ─── Start Click ──→│ starting │ ─── Container Up ──→│ loading │
└─────────┘                    └──────────┘                    └─────────┘
     ↑                                                               │
     │                                                               ↓
┌──────────┐                                                   ┌─────────┐
│ stopping │ ←────────────── Stop Click ──────────────────────│ running │
└──────────┘                                                   └─────────┘
     ↓                                │                              │
     │                                ↓                              │
┌─────────┐                      ┌────────┐                          │
│ stopped │                      │ failed │ ←── Error at any stage ──┘
└─────────┘                      └────────┘

States:

| State | Description |
|-------|-------------|
| stopped | No container running, ready to start |
| starting | Container creation initiated |
| loading | Container running, model loading into GPU memory |
| running | Model ready for inference requests |
| stopping | Graceful shutdown in progress |
| failed | Error occurred, check logs for diagnostics |
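
The diagram and table above can be read as a small transition table. The sketch below is one way to encode it, not Cortex's implementation: the event names (`start`, `container_up`, `model_ready`, `stop`, `container_gone`, `error`) are hypothetical labels, and treating "Error at any stage" as covering every in-flight state is an assumption. The diagram shows no exit from `failed`, so the sketch defines none.

```python
# Transitions read off the state diagram. Event names are hypothetical labels.
TRANSITIONS = {
    ("stopped", "start"): "starting",
    ("starting", "container_up"): "loading",
    ("loading", "model_ready"): "running",
    ("running", "stop"): "stopping",
    ("stopping", "container_gone"): "stopped",
}

# Assumption: "Error at any stage" applies to every in-flight state.
ERROR_SOURCES = {"starting", "loading", "running", "stopping"}


def next_state(state: str, event: str) -> str:
    """Return the state that follows `state` when `event` occurs."""
    if event == "error" and state in ERROR_SOURCES:
        return "failed"
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {event!r}") from None
```

Rejecting unknown transitions with an exception, rather than silently staying in the same state, makes it easier to notice events that arrive out of order.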

UI Behavior:

- Polling automatically updates state every few seconds
- Toast notifications appear on state transitions
- "Start" button triggers dry-run validation first

File Safety Guarantee¶

CRITICAL: Cortex never deletes model files from /var/cortex/models

When you delete a model from Cortex:

- ✅ Database record is removed
- ✅ Container is stopped
- ✅ Model is unregistered from routing
- ✅ Files remain on disk untouched

This protects manually-placed offline models, which are often:

- Transferred via USB drives in air-gapped environments
- Large files (10-240GB) taking hours to transfer
- Impossible to re-download in classified/restricted networks

To free disk space: Administrators must manually delete files from the filesystem:

# List models directory
ls -lh /var/cortex/models/

# Manually delete unwanted folders
rm -rf /var/cortex/models/old-model-folder
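
Before deleting anything by hand, it can help to see which folders take the most space. The following is a small standalone sketch, not part of Cortex; the `folder_sizes` helper is hypothetical and only reads the filesystem.

```python
import os
from pathlib import Path


def folder_sizes(base_dir: str) -> list[tuple[str, int]]:
    """Return (folder name, total bytes) for each direct subfolder, largest first."""
    results = []
    for entry in Path(base_dir).iterdir():
        if not entry.is_dir():
            continue
        total = 0
        for root, _dirs, files in os.walk(entry):
            for name in files:
                path = os.path.join(root, name)
                if not os.path.islink(path):  # skip symlinked files
                    total += os.path.getsize(path)
        results.append((entry.name, total))
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, size in folder_sizes("/var/cortex/models"):
        print(f"{size / 1e9:8.1f} GB  {name}")
```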

Base directory helpers¶

GET/PUT /admin/models/base-dir to read or set the host-visible models directory
GET /admin/models/local-folders and GET /admin/models/inspect-folder to assist offline model selection

Model Preparation¶

📖 HuggingFace Models: See docs/models/huggingface-model-download.md for complete guide on downloading HF models
📖 GGUF Models: See docs/models/gguf-format.md for GGUF format guide and docs/models/llamaCPP.md for llama.cpp configuration
📖 vLLM Models: See docs/models/vllm.md for vLLM-specific configuration

Smart Engine Guidance¶

Cortex automatically analyzes model folders and recommends an engine and format based on the files it finds.

How It Works¶

When you browse to a model folder in offline mode, Cortex:

Scans for file types: GGUF, SafeTensors, PyTorch
Analyzes GGUF files: Detects quantization, multi-part splits, validates headers
Extracts metadata: Architecture, context length, layer count
Computes recommendations: Based on file availability and engine compatibility
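
The first scanning step can be sketched as a simple classification by file extension. This is an illustration under stated assumptions rather than Cortex's scanner: the `scan_folder` helper is hypothetical, the PyTorch extension list is a guess, only the top level of the folder is examined, and multi-part detection relies on the `-00001-of-00003.gguf` naming pattern that llama.cpp's split tool produces.

```python
import re
from pathlib import Path

# Split GGUF files from llama.cpp's gguf-split tool are named
# "<name>-00001-of-00003.gguf".
SPLIT_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$", re.IGNORECASE)

# Assumption: these extensions stand in for PyTorch checkpoint files.
PYTORCH_EXTS = {".bin", ".pt", ".pth"}


def scan_folder(folder: str) -> dict:
    """Group the files in `folder` by format and flag multi-part GGUF sets."""
    summary = {"gguf": [], "safetensors": [], "pytorch": [], "multipart_gguf": False}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        if ext == ".gguf":
            summary["gguf"].append(path.name)
            if SPLIT_RE.search(path.name):
                summary["multipart_gguf"] = True
        elif ext == ".safetensors":
            summary["safetensors"].append(path.name)
        elif ext in PYTORCH_EXTS:
            summary["pytorch"].append(path.name)
    return summary
```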

Engine Recommendation Matrix¶

Scenario	SafeTensors	GGUF Type	Recommended Engine	Reason
Both available	✅	Single	vLLM + SafeTensors	Best performance
Both available	✅	Multi-part	vLLM + SafeTensors	vLLM can't load multi-part GGUF
GGUF only	❌	Single	llama.cpp	Native GGUF support
GGUF only	❌	Multi-part	llama.cpp	Only engine with multi-part support
SafeTensors only	✅	❌	vLLM	Native format
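
The matrix above translates directly into a lookup function. The sketch below mirrors the five rows; the `recommend` name and the string values used for `gguf_type` are hypothetical choices made for this example.

```python
from typing import Optional


def recommend(has_safetensors: bool, gguf_type: Optional[str]) -> tuple[str, str]:
    """Map folder contents to (recommended engine, reason).

    gguf_type is None (no GGUF files), "single", or "multi-part".
    """
    if has_safetensors and gguf_type == "multi-part":
        return "vLLM + SafeTensors", "vLLM can't load multi-part GGUF"
    if has_safetensors and gguf_type == "single":
        return "vLLM + SafeTensors", "Best performance"
    if has_safetensors:
        return "vLLM", "Native format"
    if gguf_type == "multi-part":
        return "llama.cpp", "Only engine with multi-part support"
    if gguf_type == "single":
        return "llama.cpp", "Native GGUF support"
    raise ValueError("no SafeTensors or GGUF files found")
```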

Guidance UI Components¶

Engine Guidance Banner: Appears in the model form when recommendations apply:

⚠️ Warning: Multi-part GGUF with vLLM selected (incompatible)
💡 Tip: SafeTensors available with GGUF selected
✅ Recommended: Suggested engine/format combination

One-Click Actions:

- "Switch to SafeTensors": Changes format selection
- "Switch to llama.cpp": Changes engine selection

GGUF Validation¶

Cortex validates GGUF files during folder inspection:

Check	What It Detects
Magic bytes	Invalid/corrupt files
Version	Unsupported GGUF versions
Header integrity	Truncated downloads
Legacy format	Old GGML files

Validation Status:

- ✅ Valid: All files passed checks
- ⚠️ Warning: Minor issues detected
- ❌ Invalid: Corrupt or incomplete files
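
A header check along the lines of the table above might look like the following sketch. It is not Cortex's validator: the `check_gguf_header` helper and its return strings are hypothetical, the accepted version set is an assumption, and how each result maps to the Valid/Warning/Invalid statuses is left open. It reads only the fixed 24-byte header used by GGUF versions 2 and 3, so it cannot tell whether the tensor data that follows is complete.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Magic values of the pre-GGUF llama.cpp formats (ggml, ggmf, ggjt),
# stored on disk as little-endian uint32.
LEGACY_MAGICS = {0x67676D6C, 0x67676D66, 0x67676A74}
SUPPORTED_VERSIONS = {2, 3}  # assumption: the set Cortex accepts is not documented here
HEADER_SIZE = 24  # magic (4) + version (4) + tensor count (8) + metadata KV count (8)


def check_gguf_header(path: str) -> str:
    """Return "ok" or a short code naming the first failed check."""
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    if len(header) < 4:
        return "truncated"
    if header[:4] != GGUF_MAGIC:
        (magic,) = struct.unpack("<I", header[:4])
        return "legacy_ggml" if magic in LEGACY_MAGICS else "bad_magic"
    if len(header) < 8:
        return "truncated"
    (version,) = struct.unpack("<I", header[4:8])
    if version not in SUPPORTED_VERSIONS:
        return "unsupported_version"
    if len(header) < HEADER_SIZE:
        return "truncated"
    return "ok"
```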

GGUF Metadata Extraction¶

For valid GGUF files, Cortex extracts and displays:

Metadata	Example	Description
Architecture	`llama`	Model architecture type
Context Length	`32K`	Maximum context window
Layers	`32`	Number of transformer layers
Hidden Size	`4096`	Embedding dimension
Attention Heads	`32/8`	Q heads / KV heads (GQA)
Vocab Size	`128K`	Vocabulary size

Architecture Compatibility¶

Cortex shows compatibility badges for each detected architecture:

Status	vLLM	llama.cpp	Meaning
✓ Green	Full	Full	Both engines fully support
◐ Yellow	Partial	Full	Some vLLM limitations
⚡ Orange	Experimental	Full	Experimental vLLM support
✗ Red	None	Full	llama.cpp only

Quantization Quality Indicators¶

When selecting GGUF quantization levels, Cortex shows:

Quality bars (1-5 stars): Output quality rating
Speed bars (1-5 stars): Inference speed rating
Bits per weight: Compression level
Description: What the quantization is best for

See the GGUF Format Guide (docs/models/gguf-format.md) for detailed quantization information.

Logs¶

GET /admin/models/{id}/logs returns recent container logs (for debugging)
GET /admin/models/{id}/logs?diagnose=true returns logs with startup diagnostics
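
For scripted debugging, the two log endpoints can be called from Python. This is a minimal sketch using only the standard library; the `logs_url` and `fetch_logs` helpers are hypothetical, the base URL is whatever address your Cortex gateway listens on, and authentication is not handled because it is not described in this section.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen


def logs_url(base_url: str, model_id: int, diagnose: bool = False) -> str:
    """Build the logs URL, adding ?diagnose=true when requested."""
    url = urljoin(base_url.rstrip("/") + "/", f"admin/models/{model_id}/logs")
    if diagnose:
        url += "?" + urlencode({"diagnose": "true"})
    return url


def fetch_logs(base_url: str, model_id: int, diagnose: bool = False,
               timeout: float = 10.0) -> str:
    """Fetch the raw response body as text (no authentication is sent)."""
    with urlopen(logs_url(base_url, model_id, diagnose), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```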

Dry Run & Pre-Start Validation¶

The dry-run endpoint validates configuration before starting:

POST /admin/models/{id}/dry-run returns:
The vLLM or llama.cpp command that would be executed
VRAM estimation and warnings
Configuration validation results
Quantization compatibility checks

Frontend Integration: When clicking "Start" in the UI, Cortex automatically runs a dry-run first. If warnings are detected (e.g., VRAM concerns, quantization mismatches), the user is prompted to confirm before proceeding.
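
The confirm-before-start flow described above can be sketched as follows. This is an illustration, not the frontend's code: the three callables are injected so the logic stays independent of any HTTP client, and the `warnings` field in the dry-run result is a hypothetical name, since the response schema is not given here.

```python
from typing import Callable


def start_with_dry_run(
    model_id: int,
    dry_run: Callable[[int], dict],
    start: Callable[[int], None],
    confirm: Callable[[list], bool],
) -> bool:
    """Run the dry-run, ask for confirmation on warnings, then start.

    Returns True if the model was started, False if the user declined.
    """
    result = dry_run(model_id)
    warnings = result.get("warnings", [])  # hypothetical response field
    if warnings and not confirm(warnings):
        return False  # user declined; nothing is started
    start(model_id)
    return True
```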

Per-Model Metrics¶

Running models expose metrics via the System Monitor page:

- Requests running/waiting/swapped: Current queue status
- Prompt/generation tokens: Throughput metrics
- KV cache utilization: Memory efficiency
- GPU cache usage: VRAM allocation

Access via: Admin UI → System Monitor → Active Models section
API endpoint: GET /admin/models/metrics