llama.cpp Engine Guide

Overview

llama.cpp is Cortex's secondary inference engine, specifically added to support models that vLLM cannot handle. It provides CPU+GPU hybrid inference with GGUF quantized models.

When to use llama.cpp:
- ✅ GPT-OSS 20B/120B models (Harmony architecture) ← Primary reason llama.cpp was added to Cortex
- ✅ GGUF quantized models (Q4_K_M, Q8_0, etc.)
- ✅ Custom/experimental architectures unsupported by vLLM
- ✅ CPU+GPU hybrid inference (offload layers to GPU)
- ✅ Tight VRAM constraints requiring aggressive quantization

When to use vLLM instead:
- ✅ Model has a HuggingFace Transformers checkpoint
- ✅ Standard architecture (Llama, Mistral, Qwen, etc.)
- ✅ Need maximum throughput
- ✅ Pure GPU inference with sufficient VRAM


Why llama.cpp Was Added to Cortex

The GPT-OSS Problem

OpenAI released GPT-OSS models (20B and 120B) built on the Harmony architecture:

Model: huihui-ai/Huihui-gpt-oss-120b-BF16-abliterated
Architecture: Harmony (custom, not in HF Transformers)
Available formats:
  - Safetensors (BF16) - 240GB weights
  - GGUF (Q8_0) - ~120GB quantized
  - GGUF (Q4_K_M) - ~60GB quantized

vLLM Problem:

vllm serve huihui-ai/Huihui-gpt-oss-120b-BF16-abliterated

Error: "Architecture 'harmony' is not supported"
# vLLM doesn't have Harmony in its model registry
# Would need upstream PR to add support

llama.cpp Solution:

llama-server -m gpt-oss-120b.Q8_0.gguf -ngl 999

# Works! llama.cpp loads any GGUF regardless of architecture
# 120B model serves successfully with quantization

Result: Cortex gained llama.cpp engine specifically to serve GPT-OSS models while maintaining the unified admin UX.


Core Technologies

1. GGUF Format

GGUF (GPT-Generated Unified Format) - llama.cpp's native format:

Advantages:
- Single-file distribution (easy to share)
- Embedded metadata (architecture, quantization, RoPE config)
- Optimized for CPU inference
- Supports any model architecture
- Multiple quantization levels in one file

Structure:

gpt-oss-120b.Q8_0.gguf  (119GB)
├─ Header (GGUF version, tensor count)
├─ Metadata (architecture, parameters, rope, etc.)
├─ Tensor info (names, shapes, types)
└─ Tensor data (quantized weights)
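
To make this layout concrete, here is a minimal Python sketch (not part of Cortex) that reads just the fixed GGUF header fields (magic, version, tensor count, metadata key/value count) from a local file; the path in the usage comment is illustrative:

import struct

def read_gguf_header(path: str) -> dict:
    """Read the fixed GGUF header: magic, version, tensor count, metadata KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"{path} is not a GGUF file (magic={magic!r})")
        # Little-endian: uint32 version, then uint64 tensor_count and uint64 metadata_kv_count
        (version,) = struct.unpack("<I", f.read(4))
        tensor_count, metadata_kv_count = struct.unpack("<QQ", f.read(16))
    return {"version": version, "tensor_count": tensor_count, "metadata_kv_count": metadata_kv_count}

# Example (hypothetical path):
# print(read_gguf_header("/var/cortex/models/gpt-oss-120b.Q8_0.gguf"))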

Cortex Support:
- GGUF models placed in: /var/cortex/models/
- Single-file or merged multi-part GGUF
- llama-server loads the file directly

2. Quantization Levels

llama.cpp supports aggressive quantization for VRAM savings:

Quantization | Bits per weight | Size (120B) | Quality    | Use case
F16          | 16              | ~240GB      | Perfect    | Baseline (too large)
Q8_0         | 8               | ~120GB      | Excellent  | Recommended for Cortex
Q6_K         | 6               | ~90GB       | Very good  | Good balance
Q5_K_M       | 5-6 mixed       | ~75GB       | Good       | Tighter VRAM
Q4_K_M       | 4-5 mixed       | ~60GB       | Acceptable | Maximum compression
Q3_K_M       | 3-4 mixed       | ~45GB       | Degraded   | Experimental
Q2_K         | 2-3 mixed       | ~30GB       | Poor       | Avoid

Cortex Recommendation for GPT-OSS 120B:
- Use Q8_0 (best quality/size tradeoff)
- Fits across 4x L40S GPUs (46GB each = 184GB total)
- Near-lossless quantization
- Production-ready quality
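
The sizes in the table follow from simple arithmetic: parameter count times bits per weight, divided by 8. A tiny illustrative Python helper (our own, not a Cortex function; it ignores per-block scales and metadata overhead):

def approx_weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    # params * bits / 8 bytes, expressed in GB (nominal bits; scales/metadata ignored)
    return params_billions * bits_per_weight / 8

print(approx_weight_size_gb(120, 8))   # 120.0 -> ~120GB for Q8_0
print(approx_weight_size_gb(120, 4))   # 60.0  -> ~60GB for a nominal 4-bit quant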

3. CPU+GPU Hybrid Inference

llama.cpp's killer feature: intelligent layer offloading

Example: 120B model on 4x L40S GPUs:

# Q8_0 quantized = ~120GB weights
# 4x GPUs = ~184GB VRAM available

llama-server \
  -m gpt-oss-120b.Q8_0.gguf \
  -ngl 999 \
  --tensor-split 0.25,0.25,0.25,0.25 \
  -c 8192 \
  -b 512

# -ngl 999: offload all possible layers to GPU
# --tensor-split: split evenly across 4 GPUs
# -c: context window; -b: batch size

# Result:
# - Most layers on GPUs (fast)
# - Spillover to CPU RAM if needed (transparent)
# - Slower than pure GPU, but it works!

Layer Offload (-ngl):

ngl=0:     All layers on CPU (slow, ~2-5 tok/s)
ngl=40:    40 layers on GPU, rest on CPU (mixed, ~10-20 tok/s)
ngl=999:   All layers on GPU (fast, ~30-50 tok/s)

Cortex Default: ngl=999 (offload everything possible)

4. Tensor Split (Multi-GPU VRAM Distribution)

Distributes model across GPUs:

# 4x GPUs, equal split:
--tensor-split 0.25,0.25,0.25,0.25

# 4x GPUs, unequal (if GPU 0 has less VRAM):
--tensor-split 0.15,0.28,0.28,0.29

# 2x GPUs:
--tensor-split 0.5,0.5

Cortex Configuration:
- Set in Model Form → llama.cpp section
- Default: equal split across available GPUs
- Adjustable for heterogeneous GPU setups


Cortex llama.cpp Implementation

Container Architecture

Host Machine
├─ Docker Network: cortex_default
└─ llama.cpp Container: llamacpp-model-{id}
   ├─ Image: cortex/llamacpp-server:latest (custom-built)
   ├─ Port: 8000 (internal) → Ephemeral host port
   ├─ Network: Service-to-service via container name
   ├─ Volumes:
   │  └─ /models (RO) → Host models directory
   ├─ Environment:
   │  ├─ CUDA_VISIBLE_DEVICES=all
   │  └─ NVIDIA_DRIVER_CAPABILITIES=compute,utility
   └─ Resources:
      ├─ GPU: All available (via DeviceRequest)
      ├─ SHM: 8GB (larger than vLLM for CPU offload)
      └─ IPC: host mode
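
The settings in this tree map directly onto the Docker SDK. Below is a hedged sketch of how such a container could be created with docker-py; the values mirror the tree above, the command is abbreviated, and the exact call is illustrative (the real logic lives in backend/src/docker_manager.py):

import docker
from docker.types import DeviceRequest

client = docker.from_env()

container = client.containers.run(
    image="cortex/llamacpp-server:latest",
    name="llamacpp-model-1",
    command=["-m", "/models/gpt-oss-120b.Q8_0.gguf", "--host", "0.0.0.0", "--port", "8000"],
    detach=True,
    network="cortex_default",
    volumes={"/var/cortex/models": {"bind": "/models", "mode": "ro"}},  # read-only models mount
    ports={"8000/tcp": None},            # publish to an ephemeral host port
    shm_size="8g",                       # larger than vLLM to support CPU offload
    ipc_mode="host",
    environment={"CUDA_VISIBLE_DEVICES": "all", "NVIDIA_DRIVER_CAPABILITIES": "compute,utility"},
    device_requests=[DeviceRequest(count=-1, capabilities=[["gpu"]])],  # all GPUs
)
# No restart policy is set, matching Cortex's "models stay stopped until started" behavior.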

Startup Command Example

For GPT-OSS 120B Q8_0:

# Container command (built by _build_llamacpp_command()):
llama-server
-m /models/huihui-ai/Huihui-gpt-oss-120b-BF16-abliterated/Q8_0-GGUF/gpt-oss-120b.Q8_0.gguf
--host 0.0.0.0
--port 8000
-c 8192                           # Context size
-ngl 999                          # GPU layers (all)
-b 512                            # Batch size
-t 32                             # CPU threads
--tensor-split 0.25,0.25,0.25,0.25  # 4-GPU split
--flash-attn on                   # Flash attention
--mlock                           # Lock model in RAM
--no-mmap                         # Disable memory mapping
--numa isolate                    # NUMA policy
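
A hedged sketch of what a helper like _build_llamacpp_command() might look like; the config keys below are illustrative, and the actual implementation lives in backend/src/docker_manager.py:

def build_llamacpp_command(cfg: dict) -> list:
    """Translate Model Form settings into llama-server arguments (illustrative)."""
    cmd = [
        "llama-server",
        "-m", "/models/" + cfg["gguf_path"],
        "--host", "0.0.0.0",
        "--port", "8000",
        "-c", str(cfg.get("context_size", 8192)),
        "-ngl", str(cfg.get("gpu_layers", 999)),
        "-b", str(cfg.get("batch_size", 512)),
        "-t", str(cfg.get("cpu_threads", 32)),
    ]
    if cfg.get("tensor_split"):
        cmd += ["--tensor-split", cfg["tensor_split"]]
    if cfg.get("flash_attn", True):
        cmd += ["--flash-attn", "on"]
    if cfg.get("mlock", True):
        cmd.append("--mlock")
    if cfg.get("no_mmap", True):
        cmd.append("--no-mmap")
    cmd += ["--numa", cfg.get("numa_policy", "isolate")]
    return cmd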

GGUF File Resolution

Cortex intelligently resolves GGUF files:

# If local_path is a .gguf file:
local_path: "model.Q8_0.gguf"
→ Uses: /models/model.Q8_0.gguf

# If local_path is a directory:
local_path: "model-folder"
→ Scans the folder for .gguf files
→ Uses the first file found, or the specified one

# Special case for GPT-OSS:
local_path: "huihui-ai/Huihui-gpt-oss-120b-BF16-abliterated"
→ Uses: /models/.../Q8_0-GGUF/gpt-oss-120b.Q8_0.gguf
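
A simplified Python sketch of that resolution logic; the function name and the "first match wins" fallback are illustrative, not Cortex's exact behavior:

from pathlib import Path

MODELS_ROOT = Path("/models")

def resolve_gguf(local_path: str) -> Path:
    """Resolve a Model Form local_path to a concrete .gguf file (illustrative)."""
    candidate = MODELS_ROOT / local_path
    if candidate.suffix == ".gguf":
        return candidate                       # local_path already names a file
    ggufs = sorted(candidate.rglob("*.gguf"))  # directory: scan for GGUF files
    if not ggufs:
        raise FileNotFoundError(f"No .gguf files under {candidate}")
    return ggufs[0]                            # first found (or a specified file)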

Configuration Parameters

Core Parameters

Context Size (-c):

-c 8192    # Default: 8192 tokens

# Larger context = more VRAM/RAM
# Recommendations:
# - 4096: Conservative, fast
# - 8192: Balanced (Cortex default)
# - 16384: Large contexts, needs VRAM
# - 32768+: Requires significant resources

GPU Layers (-ngl):

-ngl 999   # Default: Offload all layers

# Manual tuning:
# -ngl 0:   Pure CPU (very slow)
# -ngl 40:  40 layers on GPU, rest CPU
# -ngl 999: Automatic (all that fit)

# Cortex default: 999 (let llama.cpp decide)

Batch Size (-b):

-b 512     # Default: 512

# Affects prompt processing speed:
# Higher: Faster prefill, more VRAM
# Lower: Slower prefill, less VRAM
# Range: 128-2048

CPU Threads (-t):

-t 32      # Default: 32 threads

# Typically set to (CPU cores - 2)
# For a 64-core system: -t 62
# For a 16-core system: -t 14

Performance Flags

Flash Attention (--flash-attn):

--flash-attn on    # Default: on

# Faster attention computation
# Minimal quality impact
# Keep enabled unless debugging

Memory Lock (--mlock):

--mlock    # Default: enabled in Cortex

# Locks model in RAM (prevents swapping to disk)
# Ensures consistent performance
# Requires sufficient RAM

Memory Mapping (--no-mmap):

--no-mmap  # Default: enabled in Cortex

# Loads model into RAM instead of memory-mapping file
# Faster inference (no page faults)
# Requires 2x model size in RAM temporarily

NUMA Policy (--numa):

--numa isolate    # Default: isolate

# Options:
# - isolate: Bind to single NUMA node (best latency)
# - distribute: Spread across nodes (better throughput)
# - none: No NUMA pinning

RoPE Scaling

For extending context beyond training length:

--rope-freq-base 10000     # Default from model
--rope-freq-scale 1.0      # Default (no scaling)

# Example: Extend 4K model to 8K context:
--rope-freq-scale 0.5      # Compresses RoPE

# Cortex: Set in Model Form → Advanced llama.cpp section
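
The scale is just the ratio of the trained context length to the target context length; a one-line helper (our own, for illustration):

def rope_freq_scale(trained_ctx: int, target_ctx: int) -> float:
    """Linear RoPE scaling factor for extending context."""
    return trained_ctx / target_ctx

print(rope_freq_scale(4096, 8192))   # 0.5, matching the example above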

OpenAI API Compatibility

llama-server provides OpenAI-compatible endpoints:

Endpoints Available:

POST /v1/chat/completions    ✓ Chat interface
POST /v1/completions         ✓ Text completion
POST /v1/embeddings          ✓ Embeddings (if model supports)
GET  /v1/models              ✓ List served models
GET  /health                 ✓ Basic liveness check (Cortex polls /v1/models instead)
GET  /metrics                ✓ Prometheus metrics

Differences from OpenAI API:

1. No API key enforcement by default:

# Cortex wraps llama-server with gateway auth
# Internal communication: No auth needed
# External access: Gateway enforces API keys

2. Extended parameters:

{
  "model": "gpt-oss-120b",
  "messages": [...],
  // Standard OpenAI params work
  "temperature": 0.7,
  "max_tokens": 256,
  // llama.cpp extras:
  "repeat_penalty": 1.1,
  "top_k": 40,
  "mirostat": 2
}

3. Streaming format:

Compatible with OpenAI's Server-Sent Events (SSE)
Works with OpenAI SDKs without modification
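
Because the server speaks the OpenAI protocol, the official OpenAI Python SDK can be pointed at it (or at the Cortex gateway, which adds auth); llama.cpp-specific sampling parameters go through extra_body. The base URL and model name below are examples:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=256,
    # llama.cpp extras are forwarded as additional JSON fields:
    extra_body={"repeat_penalty": 1.1, "top_k": 40, "mirostat": 2},
)
print(resp.choices[0].message.content)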


Multi-GPU Deployment

Tensor Split Strategy

llama.cpp uses layer-based sharding across GPUs:

How it works:

120-layer model, 4 GPUs, equal split:

GPU 0: Layers 0-29   (30 layers, ~30GB)
GPU 1: Layers 30-59  (30 layers, ~30GB)
GPU 2: Layers 60-89  (30 layers, ~30GB)
GPU 3: Layers 90-119 (30 layers, ~30GB)

Total: 120GB weights fit across 184GB VRAM ✓

Unequal splits (if GPUs have different VRAM):

# GPU 0 has 24GB, rest have 48GB each:
--tensor-split 0.15,0.28,0.28,0.29

# Calculation:
# Total ratio: 0.15 + 0.28 + 0.28 + 0.29 = 1.0
# GPU 0 gets 15% of model
# GPU 1-3 get 28-29% each
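
A small Python sketch of that calculation: each GPU's share is its VRAM divided by the total (the VRAM figures are examples; the 0.15/0.28/0.29 split above rounds these values slightly):

def tensor_split(vram_gb: list) -> str:
    """Return --tensor-split ratios proportional to each GPU's VRAM."""
    total = sum(vram_gb)
    return ",".join(f"{v / total:.2f}" for v in vram_gb)

print(tensor_split([48, 48, 48, 48]))   # 0.25,0.25,0.25,0.25
print(tensor_split([24, 48, 48, 48]))   # 0.14,0.29,0.29,0.29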

Performance Characteristics

120B Q8_0 on 4x L40S GPUs:

Configuration:
-ngl 999
--tensor-split 0.25,0.25,0.25,0.25
-c 8192
-b 512
-t 32

Expected Performance:
- Throughput: ~8-15 tokens/sec (single request)
- TTFT: 2-4 seconds (depending on prompt length)
- Context: Up to 8192 tokens
- Concurrency: 1-2 simultaneous requests

Comparison to vLLM (if it worked):
- vLLM would be 2-3x faster
- But vLLM doesn't support Harmony architecture!
- llama.cpp is the only option for GPT-OSS

GGUF Quantization Explained

Quantization Methods

llama.cpp supports many quantization schemes:

Recommended quantizations:

Q8_0:    8-bit, per-block scale
         - Nearly lossless
         - 2x compression vs FP16
         - Cortex recommendation for production

Q6_K:    6-bit mixed precision
         - Good quality
         - 2.7x compression

Q5_K_M:  5-6 bit mixed
         - Balanced
         - 3.2x compression

Q4_K_M:  4-5 bit mixed
         - High compression
         - 4x reduction
         - Acceptable quality

Q3_K_M:  3-4 bit mixed
         - Extreme compression
         - 5.3x reduction
         - Noticeable degradation

Q2_K:    2-3 bit
         - Maximum compression
         - Avoid for production

Legacy quants (Avoid):

Q4_0, Q4_1, Q5_0, Q5_1
# Superseded by K-quants
# Use Q*_K_M variants instead

Preparing GGUF for Cortex

Option 1: Use pre-quantized from HuggingFace:

# Many models have GGUF variants:
# https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF

# Download to /var/cortex/models/qwen-2.5-7b/
# Select .gguf file in Cortex Model Form

Option 2: Quantize yourself:

# 1. Convert HF model to GGUF F16:
python convert_hf_to_gguf.py /path/to/hf/model

# 2. Quantize to Q8_0:
llama-quantize model-f16.gguf model-Q8_0.gguf Q8_0

# 3. Place in Cortex models directory:
mv model-Q8_0.gguf /var/cortex/models/

Option 3: Merge multi-part GGUF (for GPT-OSS):

# GPT-OSS 120B ships as 9-part GGUF
# Merge into single file:

llama-gguf-split --merge \
  Q8_0-GGUF-00001-of-00009.gguf \
  gpt-oss-120b.Q8_0.gguf

# Result: Single 119GB file ready for llama-server

Cortex Configuration

Model Form - llama.cpp Section

Engine Selection:

Engine Type: llama.cpp (GGUF)

Required Fields:

Mode: Offline (llama.cpp requires local files)
Local Path: huihui-ai/Huihui-gpt-oss-120b-BF16-abliterated/gpt-oss-120b.Q8_0.gguf
Name: GPT-OSS 120B Abliterated
Served Name: gpt-oss-120b-abliterated

llama.cpp Specific Settings:

GPU Layers (ngl): 999                          # Offload all
Tensor Split: 0.25,0.25,0.25,0.25             # 4-GPU equal
Batch Size: 512                                # Prefill batch
CPU Threads: 32                                # Background processing
Context Size: 8192                             # Max context
Flash Attention: ✓ On                          # Performance
Memory Lock (mlock): ✓ On                      # Prevent swapping
Disable Memory Mapping: ✓ On                   # Load to RAM
NUMA Policy: isolate                           # Latency optimization

Optional RoPE (for context extension):

RoPE Frequency Base: (leave default)
RoPE Frequency Scale: (leave default unless extending context)


Performance Tuning

Memory Optimization

For tight VRAM (e.g., <100GB total):

  1. Use more aggressive quantization:

    Q8_0 → Q6_K → Q5_K_M → Q4_K_M
    

  2. Reduce context:

    -c 8192 → -c 4096
    

  3. Lower batch size:

    -b 512 → -b 256
    

  4. Reduce GPU layers:

    -ngl 999 → -ngl 80  # Offload fewer layers
    

Throughput Optimization

For maximum tokens/sec:

  1. Increase batch size:

    -b 512 → -b 1024 or -b 2048
    # Higher batch = faster prefill
    # Needs more VRAM
    

  2. Optimize CPU threads:

    -t 32 → -t (num_cores - 2)
    # Match hardware topology
    

  3. Enable flash attention:

    --flash-attn on  # Keep enabled
    

  4. Use mlock:

    --mlock  # Prevent swapping
    # Ensures consistent performance
    

Quality vs Speed

Quality Priority (Accuracy over speed):

Quantization: Q8_0
Context: 8192+
Batch: 256-512
Result: Slower but higher quality

Speed Priority (Throughput over quality):

Quantization: Q5_K_M or Q4_K_M
Context: 4096
Batch: 1024-2048
GPU Layers: -ngl 999
Result: Faster, some quality loss


OpenAI API Integration

Chat Completions

llama-server exposes standard endpoint:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7,
    "max_tokens": 128
  }'

Cortex wraps this:

User → Cortex Gateway (port 8084)
     → llama.cpp container (port 8000)
     → Response → User

# Gateway provides:
# - API key authentication
# - Usage tracking
# - Metrics collection
# - Circuit breaking

Streaming

curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-120b",
    "messages": [...],
    "stream": true
  }'

# Server-Sent Events (SSE) format
# Compatible with OpenAI SDKs

Health Checks

Endpoints for Monitoring

Health (used by Cortex):

GET /v1/models

# Cortex polls /v1/models as its health check: it confirms the server is up
# and that the served model is registered
# Returns: {"data": [{"id": "gpt-oss-120b"}]}
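
A minimal sketch of such a probe using the requests library; the URL and the "non-empty data list means healthy" rule are illustrative, and Cortex's actual poller may differ:

import requests

def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Treat a successful /v1/models response with data as 'up and registered'."""
    try:
        r = requests.get(f"{base_url}/v1/models", timeout=timeout)
        r.raise_for_status()
        return bool(r.json().get("data"))
    except requests.RequestException:
        return False

print(is_healthy("http://llamacpp-model-1:8000"))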

Metrics (Prometheus):

GET /metrics

# Available if --endpoint-metrics enabled
# Provides: Request counts, latencies, etc.
# Cortex can scrape these (future enhancement)


Troubleshooting

Common Issues

1. Model Won't Load

Error: "unable to load model"

Checks:
1. Verify GGUF file path is correct
2. Check file permissions (readable)
3. Ensure enough RAM for model
4. Check GGUF is valid (not corrupted)

Fix:
# Test GGUF integrity:
llama-cli -m model.gguf -p "test" -n 1

2. OOM on GPU

Error: "CUDA error: out of memory"

Solutions:
1. Reduce -ngl (offload fewer layers):
   -ngl 999 → -ngl 60

2. Use more aggressive quantization:
   Q8_0 → Q6_K or Q5_K_M

3. Reduce context:
   -c 8192 → -c 4096

4. Adjust tensor split for unequal GPUs

3. Slow Performance

Issue: <5 tokens/sec on GPUs

Checks:
1. Verify layers on GPU: -ngl should be high
2. Check GPU utilization: nvidia-smi
3. Ensure --flash-attn is on
4. Check --mlock is set
5. Verify batch size adequate (-b 512+)

If layers on CPU:
- Increase -ngl
- Check VRAM usage with nvidia-smi
- May need more aggressive quantization

4. Container Restart Loop

Container keeps restarting:

Check logs:
docker logs llamacpp-model-1

Common causes:
- GGUF file not found at specified path
- Invalid GGUF format
- Out of memory (CPU or GPU)
- Unsupported quantization for architecture

Fix: 
- Review Cortex model logs in UI
- Verify file path in Model Form

Differences from vLLM

When llama.cpp is Better:

Feature              | vLLM                   | llama.cpp      | Winner
Architecture support | HF only                | Any GGUF       | llama.cpp ✓
Quantization         | FP16/FP8/INT8/AWQ/GPTQ | Q2-Q8 K-quants | llama.cpp ✓
CPU inference        | Very slow              | Optimized      | llama.cpp ✓
CPU+GPU hybrid       | No                     | Yes            | llama.cpp ✓
Single-file deploy   | No                     | Yes (GGUF)     | llama.cpp ✓

When vLLM is Better:

Feature             | vLLM                       | llama.cpp      | Winner
Throughput          | Excellent (PagedAttention) | Good           | vLLM ✓
Memory efficiency   | Best (PagedAttention)      | Good           | vLLM ✓
Continuous batching | Yes                        | Limited        | vLLM ✓
Concurrency         | Excellent (100+ requests)  | Limited (1-4)  | vLLM ✓
HF ecosystem        | Native                     | Via conversion | vLLM ✓

Cortex Strategy:

Standard Models (Llama, Mistral, Qwen):
→ Use vLLM (better performance)

GPT-OSS 120B / Harmony Architecture:
→ Use llama.cpp (only option)

GGUF-only models:
→ Use llama.cpp (native format)

Mixed deployments:
→ Both engines side-by-side
→ Gateway routes by model registry

Production Deployment

Container Lifecycle

Start (via Cortex UI):

1. Admin clicks "Start"
2. Cortex builds the llama-server command
3. Docker pulls cortex/llamacpp-server:latest (if not already cached)
4. Docker creates the container: llamacpp-model-{id}
5. llama-server loads the GGUF model
6. Health check polls /v1/models
7. State: stopped → starting → running
8. Model registered in the gateway

Stop (via Cortex UI):

1. Admin clicks "Stop"
2. Container stops (graceful shutdown, 10s timeout)
3. Container removed
4. State: running → stopped
5. Model unregistered from gateway

No auto-restart (after Cortex restart):

# Restart policy: "no"
# Models stay stopped until admin clicks Start
# Prevents broken auto-start issues

Health Monitoring

Cortex polls llama.cpp health every 15 seconds:

GET http://llamacpp-model-1:8000/v1/models

Success: {"data": [{"id": "gpt-oss-120b"}]}
→ Health: UP, model registered in the model registry

Failure: connection refused / timeout
→ Health: DOWN, circuit breaker may open

Logs and Debugging

View logs via Cortex UI:

Models page → Select model → Logs button
Shows: llama-server output, loading progress, errors

Common log patterns:

[Loading model] - Model weights loading
[KV cache] - VRAM/RAM allocation
[CUDA] - GPU initialization  
[Server] - HTTP server ready
[Inference] - Per-request processing


Resource Requirements

For GPT-OSS 120B Q8_0:

Minimum (Cortex tested configuration):

GPUs: 4x NVIDIA L40S (46GB VRAM each)
Total VRAM: 184GB
RAM: 64GB+
Disk: 150GB (model + overhead)
CPU: 32+ cores recommended

Why 4x L40S works:

Model weights (Q8_0): ~120GB
KV cache (8K context): ~8GB
Overhead: ~3GB per GPU (~12GB)
Total: ~140GB across 4 GPUs
Available: 184GB
Headroom: ~44GB ✓

Alternative configurations:

2x H100 (80GB each): 160GB → Tight but possible
3x A100 (80GB each): 240GB → Comfortable
8x 3090 (24GB each): 192GB → Works with tuning

For Smaller Models:

Llama 2 13B Q4_K_M (fits single GPU):

Model: ~7GB
KV cache: ~2GB (4K context)
Total: ~9GB
Fits: Any GPU with 12GB+ (RTX 3060, 4060 Ti, etc.)


Best Practices for llama.cpp in Cortex

1. Quantization Selection

Production (Quality Priority):
→ Use Q8_0 (near-lossless)

Balanced (Quality + Size):
→ Use Q6_K or Q5_K_M

Maximum Compression:
→ Use Q4_K_M (acceptable trade-off)
→ Avoid Q3_K or Q2_K (too much degradation)

2. Context Window

Conservative (Fast, Reliable):
-c 4096

Balanced (Cortex Default):
-c 8192

Large Context (Needs Resources):
-c 16384 or higher

3. GPU Layer Offloading

Default (Let llama.cpp decide):
-ngl 999

Manual Tuning (if needed):
# Check VRAM usage: nvidia-smi
# If OOM, reduce layers:
-ngl 80  # Adjust based on available VRAM

4. Multi-GPU Split

Equal GPUs:
--tensor-split 0.25,0.25,0.25,0.25

Unequal GPUs (e.g., 3x 48GB + 1x 24GB):
--tensor-split 0.29,0.28,0.28,0.15
# Give less to the smaller GPU (ratios roughly proportional to VRAM, summing to 1.0)

Comparison: llama.cpp vs vLLM in Cortex

Use llama.cpp When:

  1. Model architecture unsupported by vLLM
     - GPT-OSS 120B (Harmony) ✓
     - Experimental/custom architectures
     - Models requiring trust_remote_code that vLLM rejects

  2. GGUF is the only available format
     - Community quantizations
     - Pre-converted models
     - Air-gapped environments with GGUFs

  3. CPU+GPU hybrid needed
     - Limited VRAM (can offload layers to RAM)
     - Heterogeneous GPU setups

  4. Simpler deployment
     - Single GGUF file
     - No tokenizer issues
     - Works "out of the box"

Use vLLM When:

  1. Architecture is supported
     - Llama, Mistral, Qwen, Phi, etc.
     - Standard HF Transformers models

  2. Need maximum performance
     - PagedAttention efficiency
     - Continuous batching
     - High concurrency (50+ simultaneous requests)

  3. Pure GPU inference
     - Sufficient VRAM available
     - Want best throughput

  4. Online model serving
     - Download from HF on startup
     - Automatic model updates

Integration with Cortex Gateway

Model Registry

When llama.cpp container starts:

# Cortex registers endpoint:
register_model_endpoint(
    served_name="gpt-oss-120b-abliterated",
    url="http://llamacpp-model-1:8000",
    task="generate"
)

# Gateway routes requests:
POST /v1/chat/completions {"model": "gpt-oss-120b-abliterated"}
→ Proxied to: http://llamacpp-model-1:8000/v1/chat/completions
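
A hedged sketch of how such routing could work: an in-memory registry of served names to backend URLs, plus an httpx forward. This is illustrative only, not Cortex's gateway code:

import httpx

# served_name -> backend base URL (populated by register_model_endpoint)
MODEL_REGISTRY = {
    "gpt-oss-120b-abliterated": "http://llamacpp-model-1:8000",
}

async def route_chat_completion(payload: dict) -> dict:
    """Forward an OpenAI-style request to whichever backend serves payload['model']."""
    base_url = MODEL_REGISTRY[payload["model"]]
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(f"{base_url}/v1/chat/completions", json=payload)
        resp.raise_for_status()
        return resp.json()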

Monitoring

Health poller checks every 15 seconds:

GET http://llamacpp-model-1:8000/v1/models
Response: {"data": [{"id": "gpt-oss-120b-abliterated"}]}
Status: UP

Circuit breaker opens after 5 consecutive failures (30s cooldown).
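
A hedged sketch of that pattern, using the threshold and cooldown quoted above (the class is illustrative, not Cortex's implementation):

import time

class CircuitBreaker:
    """Open after N consecutive failures; allow a retry after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None   # monotonic timestamp when the breaker opened

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let a probe request through (half-open)
        return time.monotonic() - self.opened_at >= self.cooldown_s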

Usage Tracking

Cortex tracks per-request:

- Prompt tokens (estimated if llama-server doesn't report)
- Completion tokens
- Latency
- Status code
- Model name: gpt-oss-120b-abliterated
- Task: generate


Migration Guide

From Standalone llama.cpp to Cortex

Before (manual llama-server):

llama-server \
  -m /models/model.gguf \
  -ngl 999 \
  --tensor-split 0.25,0.25,0.25,0.25 \
  -c 8192 \
  --host 0.0.0.0 --port 8080

After (Cortex managed):

1. Add model via Cortex UI
2. Configure parameters in Model Form
3. Click "Start"
4. Cortex creates container automatically
5. Monitor via System Monitor
6. Track usage via Usage page

Benefits:
- Health monitoring
- Usage metering
- API key auth
- Multi-user access
- Web UI management


Future Enhancements

Planned for Cortex:

  1. LoRA Adapter Support
     - llama.cpp supports dynamic LoRA loading
     - Cortex could expose adapter management

  2. RoPE Extension UI
     - Slider for context extension
     - Auto-calculate rope_freq_scale

  3. Quantization Conversion
     - In-UI quantization from F16 to Q8_0
     - Progress tracking

  4. Multi-Model Serving
     - Load multiple GGUFs in a single container
     - Dynamic model switching

  5. Grammar/Constrained Generation
     - llama.cpp supports GBNF grammars
     - Enforce JSON, code, and other structured output

References

  • Official Repo: https://github.com/ggml-org/llama.cpp
  • Documentation: https://github.com/ggml-org/llama.cpp/tree/master/docs
  • Docker Images: ghcr.io/ggml-org/llama.cpp:server-cuda
  • Cortex Implementation: backend/src/docker_manager.py (lines 182-325)
  • GGUF Spec: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md
  • GPT-OSS Models: https://huggingface.co/collections/openai/gpt-oss-67723e53c50e1ec6424f71c4

Conclusion

llama.cpp in Cortex serves a critical role:

- Enables GPT-OSS 120B - The primary reason it was added
- Fills vLLM gaps - Custom architectures, GGUF-only models
- Production-ready - Stable, reliable, well-tested
- Unified UX - Same admin interface as vLLM
- Complementary - Works alongside vLLM, not replacing it

Together, vLLM + llama.cpp provide comprehensive model serving for Cortex users. 🚀


For vLLM (standard models), see: vllm.md (in this directory)
For choosing between engines: See engine-comparison.md (in this directory) - comprehensive decision matrix and comparison
For research background: See engine-research.md (in this directory)