
GGUF Format Guide

GGUF (GPT-Generated Unified Format) is the native file format for llama.cpp models. Cortex provides comprehensive GGUF support with smart detection, validation, and metadata extraction.


What is GGUF?

GGUF is a binary format designed for efficient storage and loading of quantized language models. It replaced the older GGML format and offers:

  • Self-contained: Model weights, tokenizer info, and metadata in a single file
  • Memory-mapped loading: Fast startup via OS-level memory mapping (sketched after this list)
  • Quantization support: Native support for various precision levels
  • Architecture-agnostic: Can store any model architecture
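
The memory-mapping point above is easy to see outside of any engine. A minimal Python sketch (the file path is hypothetical):

import mmap

# Memory-map a GGUF file: the OS pages data in on demand, so opening a
# multi-gigabyte model does not copy the whole file into RAM up front.
with open("/var/cortex/models/my-model/model-Q4_K_M.gguf", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(mm[:4])  # b'GGUF' -- the magic bytes at the start of every GGUF file
    mm.close()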

GGUF vs SafeTensors

| Aspect | GGUF | SafeTensors |
|---|---|---|
| Primary Engine | llama.cpp | vLLM |
| Quantization | Built-in (Q4_K_M, Q8_0, etc.) | External (AWQ, GPTQ) |
| File Count | Single file or split parts | Multiple shards |
| Metadata | Embedded | Separate config.json |
| Performance | Optimized for llama.cpp | Native HF Transformers |

Quantization Levels

Understanding Quantization

Quantization reduces the numeric precision of model weights to save memory and disk space: fewer bits per weight mean a smaller file and lower VRAM use, but potentially lower output quality.

Cortex Quality Indicators

When selecting a GGUF file in Cortex, you'll see quality and speed indicators:

| Quantization | Bits | Quality | Speed | VRAM Savings | Recommended For |
|---|---|---|---|---|---|
| F16 | 16 | ★★★★★ | ★★☆☆☆ | 0% | Maximum quality, plenty of VRAM |
| Q8_0 | 8 | ★★★★★ | ★★★☆☆ | 50% | Production use, near-lossless |
| Q6_K | 6.5 | ★★★★☆ | ★★★☆☆ | 59% | High quality with good savings |
| Q5_K_M | 5.5 | ★★★★☆ | ★★★★☆ | 66% | Balanced quality/speed |
| Q5_K_S | 5.5 | ★★★★☆ | ★★★★☆ | 66% | Slightly smaller Q5 variant |
| Q4_K_M | 4.5 | ★★★☆☆ | ★★★★★ | 72% | Good balance, tight VRAM |
| Q4_K_S | 4.5 | ★★★☆☆ | ★★★★★ | 72% | Smaller Q4 variant |
| Q4_0 | 4 | ★★★☆☆ | ★★★★★ | 75% | Maximum compression |
| Q3_K_M | 3.5 | ★★☆☆☆ | ★★★★★ | 78% | Aggressive compression |
| Q2_K | 2.5 | ★☆☆☆☆ | ★★★★★ | 84% | Extreme compression (quality loss) |
| IQ4_XS | 4.25 | ★★★★☆ | ★★★★☆ | 73% | imatrix-optimized Q4 |
| IQ3_M | 3.4 | ★★★☆☆ | ★★★★★ | 79% | imatrix-optimized Q3 |
| IQ2_M | 2.7 | ★★☆☆☆ | ★★★★★ | 83% | imatrix-optimized Q2 |
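
A quick way to relate the Bits column above to real memory numbers is to estimate the size of the weights alone. A rough sketch (bits per weight are the nominal values from the table; real files also contain metadata, and runtime VRAM additionally needs KV cache and activations):

def approx_weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

# Example: an 8B-parameter model at a few levels from the table above.
for name, bpw in [("F16", 16), ("Q8_0", 8), ("Q4_K_M", 4.5)]:
    print(f"{name:7s} ~{approx_weight_gib(8, bpw):.1f} GiB")  # ~14.9, ~7.5, ~4.2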

Recommendations by Use Case

Production / Quality-Critical:
  • Use Q8_0: near-lossless, 50% VRAM savings
  • Use Q6_K: excellent quality, ~60% savings

Balanced / General Use:
  • Use Q5_K_M: good quality, significant savings
  • Use Q4_K_M: acceptable quality, maximum efficiency

Constrained VRAM:
  • Use Q4_K_M or Q4_K_S: best quality at 4-bit
  • Avoid Q3 and below unless necessary


Multi-Part GGUF Files

Large models are often split into multiple files due to platform limits (e.g., HuggingFace's 50GB file size limit).

Naming Convention

model-name-00001-of-00009.gguf
model-name-00002-of-00009.gguf
...
model-name-00009-of-00009.gguf
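
All parts follow the same pattern, so grouping them programmatically is straightforward. A minimal sketch of that kind of detection (not Cortex's actual code; the folder path is hypothetical):

import re
from pathlib import Path

# Matches the split-file convention shown above, e.g. "name-00002-of-00009.gguf".
SPLIT_RE = re.compile(r"^(?P<base>.+)-(?P<part>\d{5})-of-(?P<total>\d{5})\.gguf$")

def find_split_parts(folder: str) -> dict[str, list[Path]]:
    """Group multi-part GGUF files in a folder by their base model name."""
    groups: dict[str, list[Path]] = {}
    for path in sorted(Path(folder).glob("*.gguf")):
        match = SPLIT_RE.match(path.name)
        if match:
            groups.setdefault(match.group("base"), []).append(path)
    return groups

for base, parts in find_split_parts("/var/cortex/models/my-model").items():
    print(base, f"-> {len(parts)} part(s)")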

Cortex Handling

llama.cpp (Recommended):
  • ✅ Native split-file support
  • ✅ Automatic detection of all parts
  • ✅ No merge required: just point to any part and llama.cpp loads all of them automatically

vLLM:
  • ❌ Does NOT support multi-part GGUF
  • ⚠️ Must merge into a single file first, OR
  • ✅ Use SafeTensors instead (if available), OR
  • ✅ Switch to the llama.cpp engine

Smart Engine Guidance

When Cortex detects multi-part GGUF with vLLM selected, it provides guidance:

  1. Switch to SafeTensors (if available) - Best vLLM performance
  2. Switch to llama.cpp - Native multi-part support
  3. Merge files manually - Last resort option

See Multi-Part GGUF Documentation for detailed handling.


GGUF Metadata

Cortex extracts and displays metadata from GGUF file headers:

Displayed Information

| Field | Description | Example |
|---|---|---|
| Architecture | Model architecture | llama, qwen2, mistral |
| Context Length | Maximum tokens | 32K, 128K |
| Layers | Transformer blocks | 32, 80 |
| Hidden Size | Embedding dimension | 4096, 8192 |
| Attention Heads | Q/KV head counts | 32/8 (GQA) |
| Vocab Size | Vocabulary tokens | 128K, 152K |
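
If you want to inspect this metadata yourself, the gguf Python package (published from llama.cpp's gguf-py) can read the header without loading any weights. A minimal sketch, assuming the package is installed and using a hypothetical path; note that most keys are architecture-prefixed (e.g. llama.context_length):

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("/var/cortex/models/my-model/model-Q8_0.gguf")

# Metadata lives in the header as key/value pairs; list a few interesting keys.
for key in reader.fields:
    if key.endswith((".architecture", ".context_length", ".block_count", ".embedding_length")):
        print(key)

print("tensor count:", len(reader.tensors))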

Architecture Compatibility

Cortex shows compatibility badges for each architecture:

| Architecture | vLLM | llama.cpp | Notes |
|---|---|---|---|
| llama | ✓ Full | ✓ Full | Best supported |
| mistral | ✓ Full | ✓ Full | Best supported |
| qwen2 | ✓ Full | ✓ Full | Best supported |
| gemma | ✓ Full | ✓ Full | |
| phi3 | ◐ Partial | ✓ Full | Some vLLM limitations |
| mamba | ✗ None | ✓ Full | llama.cpp only |
| rwkv | ✗ None | ◐ Partial | Limited support |
| harmony | ✗ None | ✓ Full | GPT-OSS models |

GGUF Validation

Cortex performs automatic validation when inspecting folders; a minimal standalone version of the main checks is sketched after the list below:

What's Validated

  1. Magic Bytes: Confirms file starts with GGUF (0x47475546)
  2. Version Check: Supports GGUF v2 and v3
  3. Header Integrity: Validates tensor and KV pair counts
  4. File Size: Detects truncated downloads
  5. Legacy Detection: Identifies old GGML format files
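
The first three checks are easy to reproduce outside Cortex. A minimal standalone sketch (the path is hypothetical; per the GGUF spec, the header after the magic bytes is a little-endian uint32 version followed by uint64 tensor and key/value counts):

import struct

def check_gguf_header(path: str) -> None:
    """Bare-bones version of checks 1-3 above."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic bytes: {magic!r})")
        # uint32 version, uint64 tensor count, uint64 metadata key/value count
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
        if version not in (2, 3):
            raise ValueError(f"unsupported GGUF version: {version}")
        print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata keys")

check_gguf_header("/var/cortex/models/my-model/model-Q8_0.gguf")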

Validation Status

| Status | Meaning | Action |
|---|---|---|
| ✅ Valid | File passed all checks | Ready to use |
| ⚠️ Warning | Minor issues detected | Review details |
| ❌ Invalid | File is corrupt/incomplete | Re-download |

Common Issues

"Invalid magic bytes":
  • File is not a GGUF file
  • May be an incomplete download
  • May be a different format (GGML, SafeTensors)

"Unsupported version":
  • Very old GGUF format
  • Update llama.cpp or convert the file

"File appears truncated":
  • Download interrupted
  • Re-download the file


Using GGUF in Cortex

With llama.cpp (Recommended)

  1. Add Model → Select llama.cpp engine
  2. Mode: Offline
  3. Browse to your GGUF file or folder
  4. Cortex auto-detects:
     • Available quantization levels
     • Multi-part files
     • Model metadata
  5. Select desired quantization
  6. Configure llama.cpp settings
  7. Start model

With vLLM (Limited Support)

vLLM's GGUF support is experimental:

  1. Add Model → Select vLLM engine
  2. Mode: Offline
  3. Browse to single-file GGUF
  4. Configure GGUF Weight Format (auto/gguf/ggml)
  5. Provide tokenizer (required - from HF repo or local)
  6. Start model

vLLM GGUF Limitations:
  • Single-file only (no multi-part)
  • Requires an external tokenizer
  • Performance is lower than with SafeTensors
  • Limited architecture support
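
Outside of Cortex, the equivalent single-file GGUF load through vLLM's Python API looks roughly like the sketch below. The model path and tokenizer repo are examples, and this assumes a vLLM build with (experimental) GGUF support:

from vllm import LLM, SamplingParams

# Single-file GGUF with an external tokenizer, since vLLM does not use the
# tokenizer embedded in the GGUF file. Paths and repo names are examples.
llm = LLM(
    model="/var/cortex/models/my-model/model-Q4_K_M.gguf",
    tokenizer="meta-llama/Llama-3.1-8B-Instruct",
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)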


Tokenizer Requirements

llama.cpp

The tokenizer is embedded in the GGUF file, so no external tokenizer is needed.

vLLM with GGUF

Requires external tokenizer:

Option 1: HuggingFace Repo

Tokenizer: meta-llama/Llama-3.1-8B-Instruct

Option 2: Local Path

Tokenizer: /models/my-model/tokenizer

Tokenizer Suggestions

Cortex suggests tokenizers based on model name:

| Model Pattern | Suggested Tokenizer |
|---|---|
| llama | meta-llama/Llama-3.1-8B-Instruct |
| mistral | mistralai/Mistral-7B-Instruct-v0.3 |
| qwen | Qwen/Qwen2.5-7B-Instruct |
| phi | microsoft/Phi-3-mini-4k-instruct |
| gemma | google/gemma-2-9b-it |
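
The mapping behind these suggestions can be thought of as simple substring matching on the model name. An illustrative sketch only (not Cortex's actual implementation):

SUGGESTED_TOKENIZERS = {
    "llama": "meta-llama/Llama-3.1-8B-Instruct",
    "mistral": "mistralai/Mistral-7B-Instruct-v0.3",
    "qwen": "Qwen/Qwen2.5-7B-Instruct",
    "phi": "microsoft/Phi-3-mini-4k-instruct",
    "gemma": "google/gemma-2-9b-it",
}

def suggest_tokenizer(model_name: str) -> str | None:
    """Return a suggested tokenizer repo based on the model file/folder name."""
    name = model_name.lower()
    for pattern, repo in SUGGESTED_TOKENIZERS.items():
        if pattern in name:
            return repo
    return None

print(suggest_tokenizer("My-Qwen2.5-14B-Q4_K_M.gguf"))  # Qwen/Qwen2.5-7B-Instruct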

Speculative Decoding (llama.cpp)

Use a smaller "draft" model to accelerate inference:

Setup

  1. Place draft model GGUF alongside main model
  2. In llama.cpp configuration, expand Speculative Decoding
  3. Enter draft model path
  4. Configure draft tokens (default: 16)
  5. Set acceptance probability (default: 0.5)

Example Directory

/var/cortex/models/
└── my-model/
    ├── main-model-Q8_0.gguf      # Main model (24B)
    └── draft-model-Q8_0.gguf     # Draft model (0.5B)

Benefits

  • 1.5-2x throughput improvement (see the rough estimate sketched after this list)
  • Same output quality
  • Best with matching architectures
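
As a rough sanity check on those numbers, the standard speculative decoding analysis says that with K drafted tokens and a per-token acceptance rate p (assumed independent), each target-model pass yields about (1 - p^(K+1)) / (1 - p) tokens. The sketch below ignores the draft model's own cost, so it is optimistic, and p here is an empirical acceptance rate rather than the acceptance-probability setting above:

def expected_tokens_per_target_pass(k: int, p: float) -> float:
    """Expected tokens produced per target-model forward pass."""
    return (1 - p ** (k + 1)) / (1 - p)

for p in (0.5, 0.7, 0.9):
    print(f"K=16, acceptance={p}: ~{expected_tokens_per_target_pass(16, p):.1f} tokens/pass")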

See llama.cpp Guide for details.


Best Practices

Choosing Quantization

  1. Start with Q8_0 for initial testing
  2. Drop to Q5_K_M if VRAM constrained
  3. Use Q4_K_M for production with tight resources
  4. Avoid Q3 and below unless absolutely necessary

Engine Selection

  1. Use llama.cpp for GGUF-only models
  2. Use vLLM with SafeTensors when available
  3. Check architecture compatibility before choosing

File Organization

/var/cortex/models/
├── model-name/
│   ├── Q4_K_M/
│   │   └── model-Q4_K_M.gguf
│   ├── Q8_0/
│   │   └── model-Q8_0.gguf
│   └── config.json (optional)

Troubleshooting

"Model fails to load"

  1. Check validation status in Cortex
  2. Verify complete download (check file sizes)
  3. Ensure architecture is supported
  4. Check container logs for specific errors

"Poor generation quality"

  1. Try a higher-precision quantization (e.g., Q8_0 instead of Q4_K_M)
  2. Verify correct model selected
  3. Check tokenizer configuration (vLLM)

"Out of memory"

  1. Use a lower-precision quantization (e.g., Q4_K_M instead of Q8_0)
  2. Reduce context size
  3. Enable KV cache quantization