vLLM Engine Guide¶
Overview¶
vLLM is Cortex's primary inference engine for serving transformer-based models from Hugging Face. It provides state-of-the-art throughput and memory efficiency through techniques such as PagedAttention.
When to use vLLM:
- ✅ Hugging Face Transformers models (SafeTensors, PyTorch)
- ✅ Architectures: Llama, Mistral, Qwen, Phi, GPT-NeoX, Falcon, and 100+ others
- ✅ Need maximum throughput and efficiency
- ✅ Multi-GPU tensor parallelism required
- ✅ Standard model architectures with HF support
When NOT to use vLLM:
- ❌ Custom/experimental architectures (e.g., Harmony/GPT-OSS)
- ❌ GGUF-only quantized models
- ❌ Models requiring trust_remote_code with unsupported architectures
- ❌ CPU-only inference (llama.cpp is better)
Core Technologies¶
1. PagedAttention¶
vLLM's breakthrough innovation that enables efficient KV cache management:
Traditional Attention Problem:
- KV cache allocated contiguously for the maximum sequence length
- Most of the reserved tokens go unused → wasted VRAM
- 60-80% of KV cache memory wasted on padding and fragmentation
PagedAttention Solution:
- KV cache split into fixed-size blocks (typically 16 tokens)
- Blocks allocated on demand (like OS virtual memory pages)
- Near-zero waste, 2-4x higher throughput
- Enables longer contexts and larger batch sizes
Cortex Implementation:
# Configured via --block-size flag
# Default: 16 tokens per block
# Adjustable via Model Form: 1, 8, 16, 32
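To make the savings concrete, here is a small back-of-the-envelope sketch comparing contiguous allocation at the maximum sequence length with on-demand 16-token blocks. The layer count, KV-head count, and head dimension are assumed Llama-3-8B-class values, not measurements from a deployment.

```python
# Back-of-the-envelope KV cache sizing: contiguous vs. paged (illustrative only).
# Assumed Llama-3-8B-class dims: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2
BLOCK_SIZE = 16  # tokens per PagedAttention block

def kv_bytes(tokens: int) -> int:
    # 2x accounts for keys + values
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * tokens

max_len, actual_len = 8192, 700                 # reserved vs. actually used tokens
contiguous = kv_bytes(max_len)                  # preallocated for the worst case
blocks_needed = -(-actual_len // BLOCK_SIZE)    # ceil division
paged = kv_bytes(blocks_needed * BLOCK_SIZE)    # only whole blocks actually in use

print(f"contiguous: {contiguous / 1e6:.0f} MB, paged: {paged / 1e6:.0f} MB")
```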
2. Continuous Batching¶
Processes requests as they arrive, without waiting for a batch to fill:
Benefits:
- Lower latency (no waiting for a batch)
- Higher throughput (always working)
- Better GPU utilization
Cortex Controls:
- max_num_seqs: Maximum concurrent sequences (default: 256)
- max_num_batched_tokens: Total tokens per batch (default: 2048)
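The toy loop below shows how these two caps interact during scheduling: requests are admitted until either the sequence cap or the token budget is hit. It is a simplification for intuition only, not vLLM's actual scheduler, which also handles preemption, chunked prefill, and KV block availability.

```python
# Toy continuous-batching step: admit requests until either cap is reached.
# Illustrative only; not vLLM's real scheduler.
MAX_NUM_SEQS = 256
MAX_NUM_BATCHED_TOKENS = 2048

def schedule_step(waiting: list[int]) -> list[int]:
    """waiting: prompt lengths of queued requests; returns the admitted batch."""
    batch, token_budget = [], MAX_NUM_BATCHED_TOKENS
    for prompt_tokens in waiting:
        if len(batch) >= MAX_NUM_SEQS or prompt_tokens > token_budget:
            break
        batch.append(prompt_tokens)
        token_budget -= prompt_tokens
    return batch

print(schedule_step([512, 900, 400, 800]))  # -> [512, 900, 400]; 800 would exceed 2048
```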
3. Tensor Parallelism (TP)¶
Splits model weights/computation across multiple GPUs:
How it works:
Single GPU: [Model] → GPU 0 (OOM if model too large)
TP=2: [Model]
├─ Half → GPU 0
└─ Half → GPU 1
TP=4: [Model]
├─ Quarter → GPU 0
├─ Quarter → GPU 1
├─ Quarter → GPU 2
└─ Quarter → GPU 3
Use Cases:
- Model too large for a single GPU
- Need more KV cache space
- Want faster inference (up to TP=4-8; diminishing returns after that)
Cortex Configuration:
- Set via the TP Size slider in the Model Form
- Must be ≤ number of available GPUs
- Example: Llama 3.3 70B runs across 4x L40S GPUs with TP=4
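As a quick sanity check before picking a TP size, divide the weight footprint by the number of GPUs. The snippet below does this for a 70B model at two assumed precisions; it ignores KV cache and runtime overhead, which the Resource Calculator accounts for.

```python
# Rough per-GPU weight footprint under tensor parallelism (illustrative only).
def weights_per_gpu_gb(params_b: float, bytes_per_param: float, tp: int) -> float:
    return params_b * 1e9 * bytes_per_param / tp / 1e9

# 70B model split across 4 GPUs: ~35 GB/GPU at bf16, ~17.5 GB/GPU at 8-bit,
# before KV cache and runtime overhead.
print(f"bf16: {weights_per_gpu_gb(70, 2, 4):.1f} GB, 8-bit: {weights_per_gpu_gb(70, 1, 4):.1f} GB")
```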
4. Pipeline Parallelism (PP)¶
Splits model layers across devices/nodes:
When to use:
- Extremely large models (70B+, 175B+)
- Model doesn't fit even with TP
- Multi-node deployments
How it works:
PP=2:
Node 1: Layers 1-40 (GPU 0-3 with TP=4)
Node 2: Layers 41-80 (GPU 4-7 with TP=4)
Cortex Support:
- Available in Advanced settings
- pipeline_parallel_size is configurable
- Adds inter-stage latency (use sparingly)
Supported Quantization¶
Runtime Quantization (vLLM built-in):¶
| Method | Size Reduction | Quality | VRAM Savings | Use Case |
|---|---|---|---|---|
| FP16/BF16 | Baseline | Best | 0% | Default |
| FP8 | 2x | Very Good | ~50% | Newer GPUs (H100, L40S) |
| INT8 | 2x | Good | ~50% | Wider GPU support |
| AWQ (4-bit) | 4x | Good | ~75% | Pre-quantized models |
| GPTQ (4-bit) | 4x | Good | ~75% | Pre-quantized models |
Notes:
- AWQ/GPTQ require pre-quantized model checkpoints
- FP8 requires Ada Lovelace or Hopper architecture
- INT8 works on most NVIDIA GPUs
KV Cache Quantization:¶
KV cache quantization is separate from weight quantization and reduces the memory used by the KV cache:
# Default: Same as model dtype (FP16/BF16)
kv_cache_dtype: "auto"
# FP8 variants (50% KV cache reduction):
kv_cache_dtype: "fp8" # Generic FP8
kv_cache_dtype: "fp8_e4m3" # 4-bit exponent, 3-bit mantissa
kv_cache_dtype: "fp8_e5m2" # 5-bit exponent, 2-bit mantissa
Recommendation: Use fp8 for the KV cache on L40S GPUs (minimal quality loss).
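To see what the 50% figure means at batch scale, the sketch below estimates total KV cache for 32 concurrent 8K-token sequences at fp16 versus fp8. The model dimensions are assumed Llama-3-8B-class values.

```python
# Estimated KV cache for a batch, fp16 vs fp8 (illustrative, assumed Llama-3-8B-class dims).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

def kv_cache_gb(num_seqs: int, ctx_len: int, bytes_per_elem: int) -> float:
    # 2x accounts for keys + values
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem * num_seqs * ctx_len / 1e9

print(f"fp16: {kv_cache_gb(32, 8192, 2):.1f} GB")  # ~34 GB
print(f"fp8:  {kv_cache_gb(32, 8192, 1):.1f} GB")  # ~17 GB
```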
Cortex vLLM Implementation¶
Container Architecture¶
Host Machine
├─ Docker Network: cortex_default
└─ vLLM Container: vllm-model-{id}
├─ Image: vllm/vllm-openai:latest
├─ Port: 8000 (internal) → Ephemeral host port
├─ Network: Service-to-service via container name
├─ Volumes:
│ ├─ /models (RO) → Host models directory
│ └─ /root/.cache/huggingface → HF cache
├─ Environment:
│ ├─ CUDA_VISIBLE_DEVICES=all
│ ├─ NCCL_* (multi-GPU config)
│ └─ HF_HUB_TOKEN (for gated models)
└─ Resources:
├─ GPU: All available (via DeviceRequest)
├─ SHM: 2GB
└─ IPC: host mode
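The layout above can be approximated with the Docker SDK for Python, as sketched below. This is not the actual docker_manager.py implementation; the model id, paths, and token value are placeholders.

```python
# Approximate launch of a vLLM container matching the layout above (sketch only).
import docker

client = docker.from_env()
MODEL_ID = 3  # hypothetical Cortex model id

container = client.containers.run(
    image="vllm/vllm-openai:latest",
    name=f"vllm-model-{MODEL_ID}",
    command=["--model", "/models/llama-3-8b-instruct", "--port", "8000"],
    network="cortex_default",
    ports={"8000/tcp": None},  # publish 8000 to an ephemeral host port
    volumes={
        "/var/cortex/models": {"bind": "/models", "mode": "ro"},
        "hf-cache": {"bind": "/root/.cache/huggingface", "mode": "rw"},
    },
    environment={"HF_HUB_TOKEN": "<token-for-gated-models>"},
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    shm_size="2g",
    ipc_mode="host",
    detach=True,
)
print(container.name, container.status)
```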
Startup Command Example¶
For a Llama 3 8B model with TP=2:
# Container command (built by _build_command()):
--model meta-llama/Meta-Llama-3-8B-Instruct
--host 0.0.0.0
--port 8000
--served-model-name llama-3-8b-instruct
--dtype auto
--tensor-parallel-size 2
--gpu-memory-utilization 0.9
--max-model-len 8192
--max-num-batched-tokens 2048
--kv-cache-dtype auto
--block-size 16
--enforce-eager
--api-key dev-internal-token
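Once the server is up, it exposes the OpenAI-compatible API. A minimal request against this example deployment could look like the following, reusing the container name and API key used elsewhere in this guide.

```python
# Minimal chat completion against the example vLLM container above (sketch).
import requests

resp = requests.post(
    "http://vllm-model-3:8000/v1/chat/completions",  # service-to-service container name
    headers={"Authorization": "Bearer dev-internal-token"},
    json={
        "model": "llama-3-8b-instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```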
Model Sources¶
Online Mode (HuggingFace):
repo_id: "meta-llama/Meta-Llama-3-8B-Instruct"
local_path: None
# vLLM downloads to HF cache:
# /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct
Offline Mode (Local Files):
repo_id: None
local_path: "llama-3-8b-instruct" # Relative to CORTEX_MODELS_DIR
# vLLM reads from:
# /models/llama-3-8b-instruct/
GGUF Support (Experimental):
local_path: "model-folder/model.Q8_0.gguf" # Single-file GGUF only
tokenizer: "meta-llama/Meta-Llama-3-8B" # HF tokenizer repo
hf_config_path: "/models/model-folder" # Optional config path
# Limitations:
# - Single-file GGUF only (merge multi-part first)
# - Less optimized than HF checkpoints
# - Experimental support, may have issues
Performance Tuning¶
Memory Optimization¶
1. GPU Memory Utilization (0.05 - 0.98):
gpu_memory_utilization: 0.9 # Default
# Lower (0.7-0.8): More headroom, fewer OOM crashes
# Higher (0.92-0.95): Max KV cache, more sequences
2. KV Cache dtype (50% memory savings):
kv_cache_dtype: "fp8" # Halves KV cache VRAM
# Quality impact: Minimal on most models
# Best for: Long contexts, many concurrent requests
3. Block Size (Memory granularity):
block_size: 16 # Default, balanced
# block_size=8: Less fragmentation, tighter VRAM
# block_size=32: Less overhead, needs more VRAM
4. CPU Offload (Last resort):
cpu_offload_gb: 4 # Offload 4GB per GPU to CPU RAM
# Pros: Fits larger models
# Cons: Significantly slower, requires fast interconnect
5. Swap Space (KV cache spillover):
swap_space_gb: 16 # Allow 16GB CPU RAM for KV cache
# Use when: Long contexts, tight VRAM
# Impact: Latency increases, but prevents OOM
Throughput Optimization¶
1. Max Sequences (Concurrency):
max_num_seqs: 256 # Default
# Higher: More concurrent requests, more VRAM
# Lower: Less memory pressure, lower throughput
# Sweet spot: 128-512 depending on model size
2. Max Batched Tokens:
max_num_batched_tokens: 2048 # Default
# Higher: More throughput, more VRAM, higher latency per batch
# Lower: Less VRAM, lower throughput
# Recommendation: 1024-4096
3. Prefix Caching:
enable_prefix_caching: True
# Speeds up: Repeated system prompts, RAG contexts
# Overhead: Small memory cost, hash computation
4. Chunked Prefill:
enable_chunked_prefill: True
# Improves: Long prompt throughput
# By: Processing prefill in chunks
5. CUDA Graphs:
# Only when enforce_eager=False
cuda_graph_sizes: "2048,4096,8192"
# Pre-captures kernels for common sequence lengths
# Reduces overhead, improves throughput
Supported Architectures¶
vLLM supports 100+ model architectures. Common ones in Cortex deployments:
| Family | Example Models | Notes |
|---|---|---|
| Llama | Llama 2, Llama 3, Llama 3.1, 3.2, 3.3 | Excellent support, all sizes |
| Mistral | Mistral 7B, Mixtral 8x7B, 8x22B | Full MoE support |
| Qwen | Qwen 1.5, 2, 2.5, QwQ | Vision models supported |
| Phi | Phi-2, Phi-3, Phi-3.5 | Small, efficient |
| DeepSeek | DeepSeek V2, V3, R1 | MLA attention, MTP |
| Gemma | Gemma 2B, 7B, 27B | Google models |
NOT Supported:
- ❌ Harmony architecture (GPT-OSS 20B/120B) - use llama.cpp
- ❌ Custom architectures without HF integration
- ❌ Models requiring unreleased HF transformers features
Multi-GPU Configuration¶
Topology Considerations¶
4x L40S Setup (typical Cortex deployment):
# For 70B model:
tensor_parallel_size: 4 # Split across all 4 GPUs
gpu_memory_utilization: 0.92
max_model_len: 8192
swap_space_gb: 16
# Memory per GPU (assumes 8-bit quantized weights; bf16 weights would need ~35GB per GPU):
# Weights: ~18GB (70B ÷ 4 GPUs at ~1 byte/param)
# KV cache: ~8GB (depends on context, sequences)
# Overhead: ~3GB
# Total: ~29GB per GPU (fits in 48GB L40S VRAM)
NCCL Settings (Multi-GPU Communication):
Cortex automatically configures:
NCCL_P2P_DISABLE=1 # Disable peer-to-peer (safer default)
NCCL_IB_DISABLE=1 # Disable InfiniBand (not present)
NCCL_SHM_DISABLE=0 # Allow shared memory (faster)
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True # Reduce fragmentation
Multi-Node Deployment¶
For models >175B or distributed load:
# Node 1 (Head):
tensor_parallel_size: 4
pipeline_parallel_size: 2
# Total: 8 GPUs used
# Start with Ray cluster:
# ray start --head
# vllm serve --tensor-parallel-size 4 --pipeline-parallel-size 2
Cortex Support:
- Single-node: ✅ Fully supported
- Multi-node: ⚠️ Requires manual Ray setup (not yet in Cortex UI)
Deployment Modes in Cortex¶
Online Mode (HuggingFace Download)¶
Use case: Model on HuggingFace Hub, network available
Model Form:
├─ Mode: Online
├─ Repo ID: meta-llama/Meta-Llama-3-8B-Instruct
└─ HF Token: (optional, for gated models)
Cortex does:
1. Mounts HF cache volume
2. Sets HF_HUB_TOKEN environment variable
3. vLLM downloads model on first start
4. Cached for subsequent starts
Benefits:
- Easy setup: just specify the repo ID
- Automatic updates when the model is revised
- Shared cache across models
Requirements:
- Network access to Hugging Face
- Sufficient disk space for model downloads
- HF token for gated models (Llama 3, etc.)
Offline Mode (Local Files)¶
Use case: Air-gapped, pre-downloaded models
Model Form:
├─ Mode: Offline
├─ Local Path: llama-3-8b-instruct
└─ Base Dir: /var/cortex/models
Cortex does:
1. Mounts models directory read-only
2. Sets HF_HUB_OFFLINE=1
3. vLLM loads from /models/llama-3-8b-instruct
Directory Structure:
/var/cortex/models/llama-3-8b-instruct/
├─ model-00001-of-00004.safetensors
├─ model-00002-of-00004.safetensors
├─ model-00003-of-00004.safetensors
├─ model-00004-of-00004.safetensors
├─ config.json
├─ tokenizer.json
├─ tokenizer_config.json
└─ special_tokens_map.json
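Before deploying, it can be worth confirming the local checkpoint directory is complete. The small check below (stdlib only, paths taken from the layout above) flags the most common omissions; the exact set of required files varies by model.

```python
# Quick sanity check that an offline checkpoint directory looks complete (sketch).
from pathlib import Path

def check_checkpoint(model_dir: str) -> list[str]:
    required = ["config.json", "tokenizer_config.json"]
    d = Path(model_dir)
    missing = [f for f in required if not (d / f).exists()]
    # Weights may be sharded SafeTensors or PyTorch .bin files
    if not list(d.glob("*.safetensors")) and not list(d.glob("*.bin")):
        missing.append("model weights (*.safetensors or *.bin)")
    return missing

missing = check_checkpoint("/var/cortex/models/llama-3-8b-instruct")
print("OK" if not missing else f"Missing: {missing}")
```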
Advanced Features¶
Prefix Caching¶
Caches common prompt prefixes across requests:
Use case: RAG with repeated system prompts, few-shot examples
Example:
Request 1: [System: "You are a helpful assistant"] + [User: "Question 1"]
Request 2: [System: "You are a helpful assistant"] + [User: "Question 2"]
↑ This prefix is cached! ↑
# 30-50% faster for requests sharing prefixes
Configuration:
enable_prefix_caching: True
prefix_caching_hash_algo: "sha256" # Reproducible cross-language
Speculative Decoding¶
Draft model generates tokens, main model verifies:
Benefits: 2-3x speedup for long outputs
Status in Cortex: Planned, not yet implemented in UI
Embeddings¶
vLLM supports embedding models:
task: "embed"
# Auto-detects max sequence length from model config
# Uses --task embed flag
# Routes to /v1/embeddings endpoint
Supported models:
- BERT variants
- Sentence Transformers
- E5, BGE, UAE families
- Custom embedding architectures
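Calling an embedding deployment mirrors the chat example earlier but targets /v1/embeddings. The container name and served model name below are hypothetical placeholders.

```python
# Minimal embeddings request against a vLLM container started with --task embed (sketch).
import requests

resp = requests.post(
    "http://vllm-model-7:8000/v1/embeddings",  # hypothetical embedding deployment
    headers={"Authorization": "Bearer dev-internal-token"},
    json={"model": "bge-large-en", "input": ["What is PagedAttention?"]},
    timeout=30,
)
print(len(resp.json()["data"][0]["embedding"]))  # embedding dimensionality
```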
Troubleshooting¶
Common Issues¶
1. OOM (Out of Memory)
Error: "CUDA out of memory"
Solutions:
1. Lower gpu_memory_utilization (0.9 → 0.8)
2. Reduce max_model_len
3. Enable kv_cache_dtype="fp8"
4. Increase tensor_parallel_size
5. Use swap_space_gb or cpu_offload_gb
2. Unsupported Architecture
Error: "Model architecture 'harmony' is not supported"
This is the GPT-OSS issue!
Solutions:
- Use llama.cpp instead (supports any GGUF)
- Wait for vLLM to add architecture support
- Implement custom architecture (advanced)
3. Chat Template Missing
Error: "Chat template not found"
Happens with: Models without chat template in tokenizer_config.json
Cortex handles this:
- Fallback to /v1/completions endpoint
- Converts messages to plain prompt
- Returns normalized chat.completion response
4. Tokenizer Issues
Error: "Tokenizer not found for GGUF"
For GGUF in vLLM:
- Provide tokenizer HF repo: "meta-llama/Meta-Llama-3-8B"
- Or hf_config_path to local tokenizer.json
Performance Benchmarks (Typical)¶
Llama 3 8B on Single L40S:¶
Configuration:
- dtype: bfloat16
- max_model_len: 8192
- gpu_memory_utilization: 0.9
Results:
- Throughput: ~50-70 tokens/sec/request
- Concurrent requests: 30-40 (8K context)
- TTFT (time to first token): 50-100ms
- Latency (128 tokens): ~2-3 seconds
Llama 3 70B on 4x L40S (TP=4):¶
Configuration:
- dtype: bfloat16
- tensor_parallel_size: 4
- max_model_len: 8192
- kv_cache_dtype: fp8
Results:
- Throughput: ~20-30 tokens/sec/request
- Concurrent requests: 10-15 (8K context)
- TTFT: 150-300ms
- Latency (128 tokens): ~5-8 seconds
Resource Calculator¶
Cortex includes a built-in calculator (Models page → Resource Calculator):
Inputs:
- Model parameters (7B, 70B, etc.)
- Hidden size, number of layers
- Target context length
- Concurrent sequences
- TP size, quantization

Outputs:
- Per-GPU memory estimate
- Fits/doesn't-fit analysis
- Auto-tuning suggestions
- Downloadable report

Auto-fit Feature: automatically adjusts settings to fit available VRAM:
1. Enables KV FP8
2. Tries quantization (INT8, then AWQ)
3. Increases TP size
4. Reduces context/sequences
5. Suggests CPU offload/swap if needed
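The core of the estimate can be reproduced by hand: per-GPU weights plus KV cache plus a fixed overhead, compared against the usable VRAM budget. The sketch below mirrors that logic with assumed constants and simplified math; the built-in calculator accounts for more factors.

```python
# Simplified fits/doesn't-fit estimate in the spirit of the Resource Calculator (sketch).
def fits(params_b: float, layers: int, kv_heads: int, head_dim: int,
         ctx_len: int, num_seqs: int, tp: int,
         weight_bytes: float = 2, kv_bytes: float = 2,
         gpu_vram_gb: float = 48, util: float = 0.9, overhead_gb: float = 3) -> bool:
    weights = params_b * 1e9 * weight_bytes / tp / 1e9          # weight shard per GPU
    kv = 2 * layers * kv_heads * head_dim * kv_bytes * ctx_len * num_seqs / tp / 1e9
    needed = weights + kv + overhead_gb
    budget = gpu_vram_gb * util                                  # gpu_memory_utilization cap
    print(f"per-GPU: {needed:.1f} GB needed vs {budget:.1f} GB budget")
    return needed <= budget

# Assumed Llama-3-70B-class dims: 80 layers, 8 KV heads, head_dim 128, fp8 KV cache
fits(70, 80, 8, 128, ctx_len=8192, num_seqs=8, tp=4, kv_bytes=1)
```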
Best Practices¶
Development:¶
- ✅ Start with enforce_eager=True (easier debugging)
- ✅ Use a small context (4K-8K) initially
- ✅ Monitor logs for warnings
- ✅ Test with a single request first
Production:¶
- ✅ Disable enforce_eager to enable CUDA graphs
- ✅ Set optimal max_num_seqs for the workload
- ✅ Enable prefix caching for RAG
- ✅ Use FP8 KV cache on supported GPUs
- ✅ Monitor metrics via Prometheus
Multi-GPU:¶
- ✅ Use TP for models that don't fit single GPU
- ✅ Keep TP ≤ 8 (diminishing returns after)
- ✅ Verify NCCL settings for your network
- ✅ Test with synthetic load before production
Integration with Cortex Gateway¶
Model Registry¶
When vLLM container starts:
# Cortex registers model endpoint:
register_model_endpoint(
served_name="llama-3-8b-instruct",
url="http://vllm-model-3:8000",
task="generate"
)
# Gateway routes requests:
POST /v1/chat/completions {"model": "llama-3-8b-instruct"}
→ Proxied to: http://vllm-model-3:8000/v1/chat/completions
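A minimal version of this registry-and-proxy flow might look like the sketch below. The function bodies are illustrative only; the real gateway adds authentication, streaming, and error handling.

```python
# Minimal model-registry + proxy sketch (illustrative, not the Cortex gateway code).
import requests

MODEL_REGISTRY: dict[str, str] = {}

def register_model_endpoint(served_name: str, url: str, task: str) -> None:
    MODEL_REGISTRY[served_name] = url

def proxy_chat_completion(payload: dict) -> dict:
    upstream = MODEL_REGISTRY[payload["model"]]  # e.g. http://vllm-model-3:8000
    resp = requests.post(f"{upstream}/v1/chat/completions", json=payload, timeout=120)
    return resp.json()

register_model_endpoint("llama-3-8b-instruct", "http://vllm-model-3:8000", task="generate")
print(proxy_chat_completion({
    "model": "llama-3-8b-instruct",
    "messages": [{"role": "user", "content": "ping"}],
}))
```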
Health Monitoring¶
Cortex polls vLLM health every 15 seconds:
GET http://vllm-model-3:8000/health
Response: {"status": "ok"}
# Also discovers models:
GET http://vllm-model-3:8000/v1/models
Response: {"data": [{"id": "llama-3-8b-instruct"}]}
Metrics¶
vLLM exposes Prometheus metrics on port 8000:
# Tokens processed:
vllm:prompt_tokens_total
vllm:generation_tokens_total
# Performance:
vllm:time_to_first_token_seconds
vllm:time_per_output_token_seconds
# Resource usage:
vllm:gpu_cache_usage_perc
vllm:cpu_cache_usage_perc
Cortex gateway aggregates and re-exposes these.
Migration from Other Engines¶
From Text Generation Inference (TGI):¶
vLLM is generally faster and more memory-efficient:
TGI → vLLM:
- Similar API, minimal code changes
- Better throughput (1.5-3x)
- PagedAttention on top of continuous batching
- Easier multi-GPU setup
From llama.cpp:¶
When to migrate to vLLM:
- ✅ Model has an HF checkpoint (not just GGUF)
- ✅ Architecture is supported
- ✅ Need maximum throughput
- ✅ Have sufficient VRAM
When to stay on llama.cpp:
- ✅ GGUF-only model
- ✅ Custom architecture
- ✅ CPU inference priority
- ✅ Simpler deployment
Limitations¶
What vLLM Can't Do (Use llama.cpp instead):¶
- Custom Architectures:
  - Harmony (GPT-OSS 20B/120B)
  - Unreleased experimental models
  - Models with trust_remote_code issues
- GGUF-Focused Workflows:
  - Multi-file GGUF (must merge first)
  - Heavily quantized models (Q4_K_M, Q2_K, etc.)
  - LoRA adapters with GGUF
- CPU Inference:
  - vLLM can run on CPU, but very slowly
  - llama.cpp is much better for CPU workloads
References¶
- Official Docs: https://docs.vllm.ai/
- GitHub: https://github.com/vllm-project/vllm
- Docker Hub: https://hub.docker.com/r/vllm/vllm-openai
- Cortex Implementation: backend/src/docker_manager.py
- Model Form: frontend/src/components/models/ModelForm.tsx
For GPT-OSS 120B and other Harmony architecture models, see: llamaCPP.md (in this directory)
For engine comparison and decision matrix, see: engine-comparison.md (in this directory)