vLLM vs llama.cpp: Engine Comparison for Cortex¶
Date: October 4, 2025
Purpose: Guide administrators in choosing the right engine for their models
Executive Summary¶
Cortex supports dual inference engines:
- vLLM - Primary engine for standard HuggingFace Transformers models
- llama.cpp - Secondary engine for GPT-OSS 120B and GGUF-only models
Why both? OpenAI's GPT-OSS 120B uses the Harmony architecture, which vLLM doesn't support. llama.cpp was added specifically to serve these models while maintaining Cortex's unified admin experience.
Quick Decision Matrix¶
Choose vLLM When:¶
✅ Model on HuggingFace with Transformers checkpoint
✅ Architecture: Llama, Mistral, Qwen, Phi, Gemma, etc.
✅ Need maximum throughput (PagedAttention)
✅ High concurrency (50+ simultaneous requests)
✅ Pure GPU inference with sufficient VRAM
Examples: Llama 3 8B/70B, Mistral 7B, Qwen 2.5, Phi-3
Choose llama.cpp When:¶
✅ GPT-OSS 20B/120B (Harmony architecture) ← Primary use case
✅ GGUF-only model (no HF checkpoint)
✅ Custom/experimental architecture
✅ CPU+GPU hybrid inference needed
✅ Aggressive quantization (Q4_K_M, Q5_K_M)
Examples: GPT-OSS 120B, community GGUF quantizations
Technical Comparison¶
Architecture Support¶
| Aspect | vLLM | llama.cpp |
|---|---|---|
| Supported Models | 100+ HF architectures | Any model with GGUF |
| Llama family | ✅ Excellent | ✅ Excellent |
| Mistral family | ✅ Excellent | ✅ Excellent |
| GPT-OSS (Harmony) | ❌ Not supported | ✅ Works! |
| Custom architectures | ⚠️ Requires HF integration | ✅ Works if GGUF exists |
| trust_remote_code | ⚠️ Limited support | ✅ N/A (loads from GGUF) |
Winner for GPT-OSS: llama.cpp (only option)
Winner for standard models: vLLM (better performance)
Performance¶
| Metric | vLLM | llama.cpp | Notes |
|---|---|---|---|
| Throughput (tokens/sec) | 50-70 (single GPU, 7B) | 30-50 | vLLM 1.5-2x faster |
| Latency (TTFT) | 50-100ms | 100-300ms | vLLM faster startup |
| Memory efficiency | Excellent (PagedAttention) | Good | vLLM 2-4x better KV cache |
| Concurrency | High (100+ requests) | Low (1-4 requests) | vLLM scales better |
| CPU performance | Very slow | Optimized | llama.cpp 10x+ faster on CPU |
Performance winner: vLLM (when architecture supported)
CPU winner: llama.cpp
Quantization¶
| Method | vLLM | llama.cpp |
|---|---|---|
| FP16/BF16 | ✅ Native | ✅ Via Q8_0+ |
| FP8 | ✅ Runtime | ❌ Not directly |
| INT8 | ✅ Runtime | ✅ Via Q8_0 |
| AWQ/GPTQ (4-bit) | ✅ Pre-quantized checkpoints | ❌ Different format |
| Q8_0 | ⚠️ Via experimental GGUF | ✅ Native |
| Q6_K, Q5_K, Q4_K | ❌ Not supported | ✅ Native |
| Q3_K, Q2_K | ❌ Not supported | ✅ Native (not recommended) |
Quantization winner: llama.cpp (more options, aggressive compression)
Quality winner: vLLM FP8/INT8 (runtime optimization)
Deployment¶
| Aspect | vLLM | llama.cpp |
|---|---|---|
| Container image | vllm/vllm-openai:latest | cortex/llamacpp-server:latest (custom) |
| Startup time | Fast (10-30s for 7B) | Moderate (30-60s for 120B) |
| Model format | HF Transformers (GGUF experimental) | GGUF only |
| Online download | ✅ From HuggingFace | ❌ Local files only |
| Offline mode | ✅ Local HF format | ✅ Local GGUF |
| Multi-file models | ✅ SafeTensors shards | ⚠️ GGUF must be single-file |
Deployment winner: Tie (different use cases)
Memory Management Strategies¶
vLLM Approach (PagedAttention)¶
Traditional KV cache:
Request with 100 tokens, 8K max:
- Allocates: 8K × hidden_size × 2 bytes
- Uses: 100 × hidden_size × 2 bytes
- Waste: 7900 tokens worth of VRAM ❌
vLLM PagedAttention:
- Allocates blocks on-demand (16 tokens each)
- Request needs: 100 tokens → 7 blocks
- Waste: Almost zero ✓
Result: 2-4x more requests fit in same VRAM
Best for: High concurrency, variable-length requests
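The difference is easy to quantify. Below is a minimal sketch of the arithmetic, assuming a single simplified `bytes_per_token` figure (vLLM's real accounting is per layer and per head, so the absolute numbers are illustrative only):

```python
# Illustrative sketch: pre-allocated KV cache vs. PagedAttention-style block allocation.
# bytes_per_token is a simplified stand-in for (2 * layers * heads * head_dim * dtype_size).

def preallocated_kv_bytes(max_len: int, bytes_per_token: int) -> int:
    # Traditional approach: reserve the full max sequence length up front.
    return max_len * bytes_per_token

def paged_kv_bytes(used_tokens: int, bytes_per_token: int, block_tokens: int = 16) -> int:
    # Paged approach: allocate 16-token blocks on demand as the sequence grows.
    blocks = -(-used_tokens // block_tokens)  # ceiling division
    return blocks * block_tokens * bytes_per_token

bytes_per_token = 512 * 1024  # assumed ~0.5 MB/token for a 7B-class model
full = preallocated_kv_bytes(8192, bytes_per_token)
paged = paged_kv_bytes(100, bytes_per_token)
print(f"pre-allocated: {full / 2**30:.2f} GiB, paged: {paged / 2**30:.3f} GiB")
# For this 100-token request the pre-allocated cache reserves roughly 70x more memory.
```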
llama.cpp Approach (Layer Offloading)¶
Model: 120B parameters, 120 layers
Available VRAM: 184GB (4x 46GB GPUs)
Need: ~120GB for weights + ~20GB for KV/overhead
llama.cpp strategy:
1. Load as many layers as fit in VRAM
2. Offload overflow to CPU RAM
3. Hybrid execution (mostly GPU, some CPU)
Example (if VRAM runs short, e.g. with a longer context or fewer GPUs):
-ngl 999: Tries to offload all 120 layers to GPU
Result: Layers 0-110 fit on GPU; layers 111-119 spill to CPU
Performance: Mostly fast (GPU), slight slowdown on the CPU layers
Best for: Models that don't quite fit in VRAM
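A rough sketch of the fit calculation llama.cpp effectively makes at load time. This assumes uniformly sized layers and a hypothetical weight budget of ~111 GB left after reserving VRAM for KV cache and scratch buffers; real layers vary in size, so treat it as arithmetic, not a planner:

```python
def layers_on_gpu(total_layers: int, weights_gb: float, vram_budget_gb: float) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = weights_gb / total_layers
    return min(total_layers, int(vram_budget_gb // per_layer_gb))

# Example: 120 layers, ~120 GB of Q8_0 weights, hypothetical ~111 GB weight budget.
gpu_layers = layers_on_gpu(total_layers=120, weights_gb=120.0, vram_budget_gb=111.0)
print(f"GPU layers: {gpu_layers}, CPU layers: {120 - gpu_layers}")
# -> GPU layers: 111, CPU layers: 9 (matching the layers 0-110 / 111-119 split above)
```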
Real-World Use Case: GPT-OSS 120B¶
The Challenge¶
Model: GPT-OSS 120B Abliterated
Size: 240GB (BF16 weights)
Architecture: Harmony (custom)
Available VRAM: 184GB (4x L40S)
Problem: The model doesn't fit, and vLLM doesn't support the architecture!
vLLM Attempt (Fails)¶
vllm serve huihui-ai/Huihui-gpt-oss-120b-BF16-abliterated \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.92
Error: Architecture 'harmony' is not supported by vLLM ❌
# Even if the architecture were supported:
# 240GB weights / 4 GPUs = 60GB per GPU
# 60GB > 46GB available → OOM ❌
llama.cpp Solution (Works!)¶
# Step 1: Quantize to Q8_0
# Size: 240GB → 120GB (2x compression, near-lossless)
# Step 2: Serve with llama.cpp
llama-server \
-m gpt-oss-120b.Q8_0.gguf \
-ngl 999 \
--tensor-split 0.25,0.25,0.25,0.25 \
-c 8192 \
-b 512
# Memory usage:
# 120GB weights / 4 GPUs = 30GB per GPU
# 30GB + KV cache + overhead ≈ 35GB per GPU
# 35GB < 46GB available → Fits! ✓
# Performance:
# ~8-15 tokens/sec (acceptable for 120B)
# Serves successfully ✓
Cortex makes this easy:
1. Admin adds GPT-OSS 120B via Model Form
2. Selects engine: llama.cpp
3. Selects GGUF: Q8_0 file
4. Configures: ngl=999, tensor_split=0.25,0.25,0.25,0.25
5. Clicks "Start"
6. Model serves successfully! ✓
Hybrid Deployment Strategy¶
Recommended Cortex Setup¶
vLLM models (most workloads):
Llama 3 8B (chat):
Engine: vLLM
TP: 1
Throughput: ~60 tok/sec
Concurrency: 40+
Llama 3 70B (chat):
Engine: vLLM
TP: 4
Throughput: ~25 tok/sec
Concurrency: 10-15
Mistral 7B (chat):
Engine: vLLM
TP: 1
Throughput: ~65 tok/sec
llama.cpp models (special cases):
GPT-OSS 120B (Harmony):
Engine: llama.cpp
Quantization: Q8_0
GPU Layers: 999
Tensor Split: 0.25,0.25,0.25,0.25
Throughput: ~10 tok/sec
Concurrency: 1-2
Reason: vLLM doesn't support Harmony architecture
Result:
- 90% of requests → vLLM (fast, efficient)
- 10% of requests → llama.cpp (GPT-OSS specific)
- Gateway routes transparently (see the sketch below)
- Users don't notice which engine
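The routing itself is simple name-based dispatch. The sketch below shows the idea only; it is not the actual Cortex gateway code, and the registry contents, container hostnames, and internal ports are hypothetical. Both engines speak the OpenAI API, so the request body is forwarded unchanged:

```python
# Minimal sketch of name-based routing (hypothetical registry and hostnames).
import requests

MODEL_REGISTRY = {
    "llama-3-8b-instruct": "http://vllm-model-1:8000",      # vLLM container
    "gpt-oss-120b": "http://llamacpp-model-2:8080",         # llama.cpp container
}

def route_chat_completion(payload: dict) -> dict:
    backend = MODEL_REGISTRY[payload["model"]]  # pick backend by served model name
    resp = requests.post(f"{backend}/v1/chat/completions", json=payload, timeout=300)
    resp.raise_for_status()
    return resp.json()
```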
Administrative Workflow¶
Creating a vLLM Model¶
1. Models page → "Add Model"
2. Engine Type: vLLM (Transformers/SafeTensors)
3. Mode: Online or Offline
4. Repo ID / Local Path: meta-llama/Llama-3-8B-Instruct
5. Configure: TP size, dtype, memory settings
6. Click "Create" → "Start"
7. vLLM container spins up
8. Model serves at http://cortex-ip:8084/v1/chat/completions
Creating a llama.cpp Model¶
1. Models page → "Add Model"
2. Engine Type: llama.cpp (GGUF)
3. Mode: Offline (required for llama.cpp)
4. Local Path: Browse to GGUF file
5. Configure: GPU layers, tensor split, context
6. Click "Create" → "Start"
7. llama.cpp container spins up
8. Model serves at same endpoint (gateway routes by name)
User Experience: Identical! Engine choice is transparent.
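From the client's side, both workflows end at the same place. A hedged example against the gateway endpoint from step 8 (model names are placeholders for whatever served_model_name the admin configured):

```python
# The same client code works for a vLLM-backed and a llama.cpp-backed model;
# only the "model" field changes. Model names here are placeholders.
import requests

def ask(model_name: str, prompt: str) -> str:
    resp = requests.post(
        "http://cortex-ip:8084/v1/chat/completions",
        json={
            "model": model_name,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 128,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(ask("llama-3-8b-instruct", "Summarize PagedAttention in one sentence."))  # vLLM
print(ask("gpt-oss-120b", "Summarize PagedAttention in one sentence."))         # llama.cpp
```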
Cost/Benefit Analysis¶
vLLM¶
Costs:
- Requires HF checkpoint (larger download)
- Limited to supported architectures
- More complex internals

Benefits:
- 2-3x better throughput
- 2-4x better memory efficiency
- High concurrency support
- Active development, frequent updates
- Native HF ecosystem integration
ROI: Excellent for standard models
llama.cpp¶
Costs:
- Slower throughput (50-60% of vLLM)
- Lower concurrency (1-4 vs 50+)
- Manual quantization workflow
- Custom Docker image maintenance

Benefits:
- Supports any architecture (critical for GPT-OSS)
- Aggressive quantization (4x+ compression)
- CPU+GPU hybrid (flexible deployment)
- Single-file GGUF (easy distribution)
- Works when vLLM can't
ROI: Essential for unsupported models, high for GPT-OSS use case
Resource Utilization¶
GPU VRAM Comparison (70B model)¶
vLLM (BF16, TP=4):
Weights: 140GB / 4 = 35GB per GPU
KV cache (8K, 32 seqs): ~12GB per GPU
Overhead: ~3GB per GPU
Total: ~50GB per GPU
Fits: Only on 48GB GPUs if the KV-cache budget is trimmed (lower gpu_memory_utilization or fewer sequences); OOM on 40GB
llama.cpp (Q8_0, TP=4):
Weights (quantized): 70GB / 4 = 17.5GB per GPU
KV cache (8K, 4 seqs): ~3GB per GPU
Overhead: ~2GB per GPU
Total: ~22.5GB per GPU
Fits: Comfortably on 24GB+ GPUs
Winner for tight VRAM: llama.cpp (quantization helps)
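The arithmetic behind both columns, as a quick back-of-the-envelope helper. The KV-cache and overhead figures are the rough estimates quoted above, not measurements:

```python
def per_gpu_vram_gb(weights_gb: float, gpus: int, kv_cache_gb: float, overhead_gb: float) -> float:
    """Rough per-GPU VRAM estimate: sharded weights plus per-GPU KV cache and overhead."""
    return weights_gb / gpus + kv_cache_gb + overhead_gb

# 70B in BF16 under vLLM vs. Q8_0 under llama.cpp, split across 4 GPUs (figures from the text).
print(per_gpu_vram_gb(140, 4, kv_cache_gb=12, overhead_gb=3))  # -> 50.0 GB per GPU
print(per_gpu_vram_gb(70, 4, kv_cache_gb=3, overhead_gb=2))    # -> 22.5 GB per GPU
```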
CPU Utilization¶
vLLM:
CPU usage: Low (mostly GPU-bound)
Prefill: GPU compute
Decode: GPU compute
Best on: Pure GPU servers
llama.cpp:
CPU usage: Moderate to high
Prefill: CPU assists GPU
Decode: Mostly GPU if -ngl high
Best on: Hybrid CPU+GPU servers
Winner for CPU utilization: llama.cpp (actually uses CPUs)
Development and Maintenance¶
Code Complexity in Cortex¶
vLLM Integration:
# docker_manager.py: _build_command()
# ~180 lines of command building
# Handles: TP, PP, quantization, cache, prefill, graphs
# Complexity: Medium-High
# Maintenance:
# - Follow vLLM releases
# - Update flag compatibility
# - Test new architectures
llama.cpp Integration:
# docker_manager.py: _build_llamacpp_command()
# ~45 lines of command building
# Handles: ngl, tensor split, RoPE, NUMA
# Complexity: Low-Medium
# Maintenance:
# - Build custom Docker image
# - Update for new llama.cpp releases
# - GGUF compatibility tracking
Maintenance burden: llama.cpp slightly higher (custom image)
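For orientation, a heavily abridged sketch of what a builder like `_build_llamacpp_command()` does. This is not the actual Cortex code; the field names follow the schema listed later in this document, the model-path attribute is assumed, and only common llama-server flags are shown:

```python
def build_llamacpp_command(m) -> list[str]:
    """Abridged sketch: turn stored model settings into llama-server arguments.
    Not the real Cortex implementation; m.local_path is an assumed field."""
    cmd = ["llama-server", "-m", m.local_path, "--host", "0.0.0.0", "--port", "8080"]
    if m.ngl is not None:
        cmd += ["-ngl", str(m.ngl)]                # layers offloaded to GPU
    if m.tensor_split:
        cmd += ["--tensor-split", m.tensor_split]  # e.g. "0.25,0.25,0.25,0.25"
    if m.context_size:
        cmd += ["-c", str(m.context_size)]
    if m.batch_size:
        cmd += ["-b", str(m.batch_size)]
    if m.rope_freq_base:
        cmd += ["--rope-freq-base", str(m.rope_freq_base)]
    if m.mlock:
        cmd += ["--mlock"]                         # pin weights in RAM
    return cmd
```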
Testing Requirements¶
vLLM:
- Test per HF model architecture
- Verify quantization compatibility
- Multi-GPU NCCL configuration
- PagedAttention block sizes

llama.cpp:
- Test GGUF file resolution
- Verify quantization quality
- Multi-GPU tensor split
- CPU+GPU hybrid behavior
Testing winner: Similar effort, different focus
Performance Benchmarks¶
Single Request Latency¶
Llama 3 8B (Same model, different engines):
vLLM (BF16, single GPU):
- TTFT: 60ms
- Throughput: 55 tok/sec
- 128 tokens: ~2.3s total
llama.cpp (Q8_0, single GPU):
- TTFT: 150ms
- Throughput: 35 tok/sec
- 128 tokens: ~3.8s total
Winner: vLLM (1.6x faster)
Concurrent Requests¶
vLLM (Llama 3 8B, 8K context, L40S):
Concurrent requests: 40
Total throughput: ~800 tok/sec across all
Per-request: ~20 tok/sec
VRAM: 38GB / 46GB
Scales well to 40+ requests ✓
llama.cpp (Llama 3 8B, 8K context, L40S):
Concurrent requests: 2-3
Total throughput: ~60 tok/sec
Per-request: ~20-30 tok/sec
VRAM: 12GB / 46GB
Doesn't scale well beyond 4 requests ⚠️
Concurrency winner: vLLM (dramatically better)
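A rough way to reproduce aggregate-throughput numbers like these (illustrative only; endpoint and model name are placeholders, and a real benchmark should also record TTFT and per-request latency):

```python
# Simple aggregate-throughput probe against the OpenAI-compatible endpoint.
import concurrent.futures
import time
import requests

def one_request(model: str) -> int:
    body = {"model": model,
            "messages": [{"role": "user", "content": "Write a haiku."}],
            "max_tokens": 128}
    r = requests.post("http://cortex-ip:8084/v1/chat/completions", json=body, timeout=600)
    r.raise_for_status()
    return r.json()["usage"]["completion_tokens"]

def aggregate_throughput(model: str, concurrency: int) -> float:
    start = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        tokens = sum(pool.map(lambda _: one_request(model), range(concurrency)))
    return tokens / (time.time() - start)  # total tokens/sec across all requests

print(aggregate_throughput("llama-3-8b-instruct", concurrency=40))
```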
Operational Differences¶
Startup and Warmup¶
vLLM:
Container start → Model download (if online) → Weight loading
→ CUDA initialization → PagedAttention setup → Health: UP
Time: 15-45 seconds (depending on model size)
llama.cpp:
Container start → GGUF mmap/load → CUDA initialization
→ Layer allocation → Server ready → Health: UP
Time: 30-90 seconds (depending on model size)
Startup winner: vLLM (slightly faster)
Resource Cleanup¶
vLLM:
Stop container:
- Graceful shutdown (5s timeout)
- CUDA memory freed automatically
- Container removed
Clean: Fast
llama.cpp:
Stop container:
- Graceful shutdown (10s timeout)
- Larger timeout for CPU offload cleanup
- Container removed
Clean: Slower but thorough
API Compatibility¶
OpenAI Endpoints¶
| Endpoint | vLLM | llama.cpp | Notes |
|---|---|---|---|
| /v1/chat/completions | ✅ Full | ✅ Full | Both compatible |
| /v1/completions | ✅ Full | ✅ Full | Both compatible |
| /v1/embeddings | ✅ Full | ⚠️ If model supports | vLLM better |
| /v1/models | ✅ Yes | ✅ Yes | Both support |
| /health | ✅ Native | ⚠️ Via /v1/models | vLLM has dedicated endpoint |
| /metrics | ✅ Prometheus | ⚠️ Optional | vLLM built-in |
API winner: vLLM (slightly more complete)
Request Parameters¶
Both support OpenAI standard:
{
"model": "model-name",
"messages": [...],
"temperature": 0.7,
"max_tokens": 128,
"top_p": 0.95,
"stream": true/false
}
llama.cpp extras:
{
"repeat_penalty": 1.1,
"top_k": 40,
"mirostat": 2,
"mirostat_tau": 5.0
}
vLLM extras:
{
"presence_penalty": 0.5,
"frequency_penalty": 0.5,
"logit_bias": {...}
}
Cortex Implementation Details¶
Engine Selection Logic¶
In Model Form:
<select value={engineType}>
<option value="vllm">vLLM (Transformers/SafeTensors)</option>
<option value="llamacpp">llama.cpp (GGUF)</option>
</select>
// Conditional rendering:
{engineType === 'vllm' && <VllmFields />}
{engineType === 'llamacpp' && <LlamaCppFields />}
In docker_manager.py:
def start_container_for_model(m: Model, hf_token=None):
engine_type = getattr(m, 'engine_type', 'vllm')
if engine_type == 'llamacpp':
return start_llamacpp_container_for_model(m)
else:
return start_vllm_container_for_model(m, hf_token)
Container Naming¶
vLLM: vllm-model-{id}
llama.cpp: llamacpp-model-{id}
# Allows both engines to coexist
# No name conflicts
# Easy to identify in docker ps
Database Schema¶
Shared fields (both engines):
id, name, served_model_name, task, state, port, container_name
vLLM-specific:
repo_id, dtype, tp_size, gpu_memory_utilization, max_model_len,
kv_cache_dtype, quantization, block_size, swap_space_gb, enforce_eager,
enable_prefix_caching, enable_chunked_prefill, cuda_graph_sizes
llama.cpp-specific:
ngl, tensor_split, batch_size, threads, context_size,
rope_freq_base, rope_freq_scale, flash_attention, mlock, no_mmap,
numa_policy, split_mode
Discriminator:
engine_type: 'vllm' | 'llamacpp'
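A condensed sketch of how this discriminator pattern could look as an ORM model. This is illustrative only, shows a handful of the fields listed above, and may differ from the actual Cortex schema in names and types:

```python
# SQLAlchemy-style sketch of the engine_type discriminator (not the real Cortex model).
from sqlalchemy import Column, Integer, String, Float
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Model(Base):
    __tablename__ = "models"

    # Shared fields (both engines)
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    served_model_name = Column(String, nullable=False)
    engine_type = Column(String, default="vllm")  # discriminator: 'vllm' | 'llamacpp'

    # vLLM-specific (left NULL when engine_type == 'llamacpp')
    repo_id = Column(String, nullable=True)
    tp_size = Column(Integer, nullable=True)
    gpu_memory_utilization = Column(Float, nullable=True)

    # llama.cpp-specific (left NULL when engine_type == 'vllm')
    ngl = Column(Integer, nullable=True)
    tensor_split = Column(String, nullable=True)
    context_size = Column(Integer, nullable=True)
```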
Decision Tree for Administrators¶
Choose Engine for New Model
│
├─ Is it GPT-OSS 20B/120B (Harmony)?
│ └─ YES → llama.cpp (only option)
│
├─ Is it GGUF-only (no HF checkpoint)?
│ └─ YES → llama.cpp (native format)
│
├─ Is architecture in HF Transformers?
│ ├─ NO → llama.cpp (flexible)
│ └─ YES → Continue...
│
├─ Need high concurrency (20+ simultaneous)?
│ └─ YES → vLLM (better batching)
│
├─ Have sufficient VRAM for FP16/BF16?
│ ├─ NO → llama.cpp (better quantization)
│ └─ YES → vLLM (best performance)
│
└─ Default → vLLM (primary engine)
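The same tree expressed as a small helper, which may be easier to embed in admin scripts (a hypothetical function; the inputs simply mirror the questions above):

```python
def choose_engine(is_harmony: bool, gguf_only: bool, in_hf_transformers: bool,
                  need_high_concurrency: bool, fits_fp16_vram: bool) -> str:
    """Mirror of the decision tree above. Returns 'llamacpp' or 'vllm'."""
    if is_harmony or gguf_only or not in_hf_transformers:
        return "llamacpp"   # architecture or format forces llama.cpp
    if need_high_concurrency:
        return "vllm"       # batching and PagedAttention win
    if not fits_fp16_vram:
        return "llamacpp"   # aggressive GGUF quantization / CPU offload
    return "vllm"           # default: primary engine

print(choose_engine(True, False, False, False, False))   # GPT-OSS 120B -> 'llamacpp'
print(choose_engine(False, False, True, True, True))     # Llama 3 8B   -> 'vllm'
```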
Migration Paths¶
vLLM → llama.cpp¶
When to migrate:
- vLLM doesn't support your model
- Need aggressive quantization (Q4_K_M)
- VRAM constraints require CPU offload
Steps:
1. Export/download GGUF version of model
2. Quantize if needed: llama-quantize
3. Place in /var/cortex/models/
4. Create new model in Cortex
5. Engine: llama.cpp
6. Local Path: your-model.Q8_0.gguf
7. Configure llama.cpp settings
8. Start and test
llama.cpp → vLLM¶
When to migrate:
- HF checkpoint becomes available
- Architecture gets vLLM support
- Need better throughput/concurrency
Steps:
1. Verify HF checkpoint exists
2. Test with vLLM locally (docker run)
3. Confirm architecture supported
4. Create vLLM model in Cortex
5. Mode: Online (HF repo) or Offline (local)
6. Configure vLLM settings
7. Start and benchmark
8. Compare vs llama.cpp
9. Archive old llama.cpp model if satisfied
Cost Analysis¶
Compute Costs (per 1M tokens)¶
Assumptions: 4x L40S GPUs ($2/GPU/hour typical cloud pricing)
vLLM (Llama 3 70B):
Throughput: 25 tok/sec
1M tokens: 40,000 seconds = 11.1 hours
Cost: 11.1 hours × 4 GPUs × $2 = $88.80
llama.cpp (GPT-OSS 120B Q8_0):
Throughput: 10 tok/sec
1M tokens: 100,000 seconds = 27.8 hours
Cost: 27.8 hours × 4 GPUs × $2 = $222.40
But: GPT-OSS 120B is much more capable than Llama 3 70B
Trade-off: Pay 2.5x more for significantly better model
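The same arithmetic as a reusable helper. The throughput and $2/GPU/hour figures above are assumptions about typical cloud pricing, not measurements from a specific provider (computing the hours exactly gives ~$88.89 and ~$222.22; the figures above round the hours first):

```python
def cost_per_million_tokens(tokens_per_sec: float, gpus: int, usd_per_gpu_hour: float) -> float:
    """Serving cost for 1M generated tokens at a steady decode rate."""
    hours = 1_000_000 / tokens_per_sec / 3600
    return hours * gpus * usd_per_gpu_hour

print(round(cost_per_million_tokens(25, 4, 2.0), 2))   # vLLM, Llama 3 70B      -> 88.89
print(round(cost_per_million_tokens(10, 4, 2.0), 2))   # llama.cpp, GPT-OSS 120B -> 222.22
```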
Development Costs¶
vLLM:
- Setup: Easy (standard Docker image)
- Tuning: Moderate (many parameters)
- Maintenance: Low (stable releases)

llama.cpp:
- Setup: Moderate (custom Docker image)
- Tuning: Easy (fewer parameters)
- Maintenance: Moderate (rebuild image for updates)
Future Roadmap¶
vLLM Evolution¶
Expected improvements:
- More architecture support (maybe Harmony eventually?)
- Better FP8 quantization
- Multi-node simplification
- Speculative decoding maturity

Cortex will track:
- Update when new architectures are added
- Test GPT-OSS compatibility periodically
- Could migrate GPT-OSS to vLLM if supported
llama.cpp Evolution¶
Expected improvements:
- Better multi-GPU scaling
- More quantization schemes
- Faster inference kernels
- LoRA adapter improvements

Cortex enhancements:
- LoRA adapter UI
- In-browser quantization
- Multi-model serving per container
Conclusion¶
The Dual-Engine Strategy Works¶
Why Cortex needs both:
- vLLM = Performance powerhouse for 95% of models
- llama.cpp = Compatibility savior for the other 5%
- Together = Complete solution
Specifically for GPT-OSS 120B:
- vLLM: ❌ Doesn't support Harmony architecture
- llama.cpp: ✅ Works perfectly with Q8_0 GGUF
- Conclusion: llama.cpp integration was essential

User perspective:
- Transparent engine selection
- Same API endpoints
- Same admin UI
- "Just works" regardless of engine

Admin perspective:
- Choose engine based on model
- Configure parameters in unified form
- Monitor both engines in System Monitor
- Manage lifecycle identically
Key Takeaways¶
- vLLM for standard HF models - unbeatable performance
- llama.cpp for GPT-OSS 120B - only option that works
- Both coexist in Cortex - best of both worlds
- Gateway abstracts differences - users don't care
- Admin UI treats both equally - unified experience
Result: Cortex can serve ANY model - mainstream (vLLM) or exotic (llama.cpp)! 🎉
Quick Reference¶
Decision Guide:¶
Model to serve: ?
├─ GPT-OSS 120B? → llama.cpp
├─ GGUF only? → llama.cpp
├─ HF Transformers? → vLLM (unless custom architecture)
├─ Need max throughput? → vLLM
├─ Need aggressive quantization? → llama.cpp
└─ Default → vLLM
Performance Expectations:¶
vLLM (standard model):
- Throughput: High (50-70 tok/sec single request)
- Concurrency: Excellent (40+ requests)
- Memory: Efficient (PagedAttention)
llama.cpp (GPT-OSS 120B):
- Throughput: Moderate (8-15 tok/sec)
- Concurrency: Limited (1-2 requests)
- Memory: Flexible (CPU offload)
For detailed guides:
- vLLM: See vllm.md (in this directory)
- llama.cpp: See llamaCPP.md (in this directory)
- Model Management: See model-management.md (in this directory)
- Engine Research: See engine-research.md (in this directory)