Offline Deployment Guide¶
Overview¶
Cortex supports fully offline/air-gapped deployments for environments without internet access, such as:
- Classified networks (DoD, government)
- Air-gapped research facilities
- Restricted corporate networks
- High-security data centers
- Remote locations with no connectivity
This guide explains how to prepare, transfer, and deploy Cortex in completely offline environments.
Using the Admin UI "Deployment" page (recommended for migrations)¶
If you already have a running Cortex instance (online or staging) and want to migrate it to an offline machine, you can use the Admin UI:
- Admin UI → Deployment: Generates an "offline migration package" directory on disk that can include:
- Docker images (gateway/frontend + engines + infra)
- Database dump (pg_dump)
- Manifests for models/config (with secrets redacted)
- Optional large archives (model weights and HF cache)
This is intended for the workflow: online/staging instance → validate models/config → export package → transfer → import on offline instance.
How Offline Operation Works¶
What Needs Internet Access?¶
During Normal Deployment (online mode): 1. Docker base images - vLLM and llama.cpp container images (~8-12 GB each) 2. Infrastructure images - Python, Node.js, PostgreSQL, Redis, Prometheus (~1-2 GB total) 3. Model weights - Only if downloading from HuggingFace (can be pre-transferred)
What Cortex Does Differently Offline:
- Uses pre-cached Docker images (loaded from tar files)
- Serves models from local /var/cortex/models directory
- Never attempts to pull images from internet registries
- Validates GGUF tokenizer availability before model start
- Provides startup diagnostics with actionable fixes
- Fails gracefully with helpful error messages
Current Limitation (Addressed in This Guide)¶
Before offline support:
User starts model → Cortex checks for image → Not found → docker pull → ❌ FAILS (no internet)
With offline support:
User starts model → Cortex checks for image → Found in cache → ✓ Starts immediately
Preparation Steps¶
Prerequisites¶
- Online machine with Docker and internet access
- Offline machine where Cortex will run
- Transfer method: USB drive, secure file transfer, physical media, sneakernet
- Storage: ~20-25 GB free space for Docker images
Step 1: Prepare Offline Package (Online Machine)¶
On a machine with internet access:
# Clone Cortex repository
git clone https://github.com/AulendurForge/Cortex.git
cd Cortex
# Download and package all required Docker images
make prepare-offline
What this does:
1. Downloads vLLM inference engine image (~8-12 GB)
2. Downloads llama.cpp inference engine image (~3-5 GB)
3. Downloads all infrastructure images (Python, Node, Postgres, Redis, Prometheus)
4. Downloads monitoring images (node-exporter, dcgm-exporter, cadvisor)
5. Saves all images as tar files in cortex-offline-images/ directory
6. Creates manifest.json with package metadata
Output:
cortex-offline-images/
├── vllm-openai-v0.6.3.tar (8-12 GB)
├── llamacpp-server-cuda.tar (3-5 GB)
├── python-3.11-slim.tar (150 MB)
├── node-18-alpine.tar (180 MB)
├── postgres-16.tar (350 MB)
├── redis-7.tar (110 MB)
├── prometheus-v2.47.0.tar (220 MB)
├── node-exporter-v1.6.1.tar (25 MB)
├── dcgm-exporter.tar (450 MB)
├── cadvisor-v0.47.0.tar (250 MB)
├── registry-2.tar (25 MB)
├── manifest.json (metadata)
└── README.txt (instructions)
Total package size: ~14-19 GB (uncompressed)
Optional: Compress Package¶
# Compress for faster transfer (reduces to ~8-12 GB)
tar -czf cortex-offline-package.tar.gz cortex-offline-images/
# Or use better compression (slower but smaller)
tar -cJf cortex-offline-package.tar.xz cortex-offline-images/
Step 2: Transfer Package (Offline Machine)¶
Transfer the following to your offline machine:
Required Files:¶
- Cortex/ - Complete repository (code, configs, scripts)
- cortex-offline-images/ - Docker image package
- Model files - Any models for
/var/cortex/models(if using offline models)
Transfer Methods:¶
USB Drive (most common):
# On online machine
cp -r Cortex /media/usb/
cp -r cortex-offline-images /media/usb/
# On offline machine
cp -r /media/usb/Cortex ~/
cp -r /media/usb/cortex-offline-images ~/Cortex/
Secure File Transfer (if temporary network available):
# Using SCP
scp -r cortex-offline-images/ user@offline-machine:/home/user/Cortex/
# Using rsync (resume support)
rsync -avP cortex-offline-images/ user@offline-machine:/home/user/Cortex/cortex-offline-images/
Physical Media (maximum security): - Burn to DVD/Blu-ray - External SSD/HDD - Network-isolated file transfer appliance
Step 3: Load Docker Images (Offline Machine)¶
On your offline machine:
cd Cortex
# Load all images from package
make load-offline
What this does:
1. Verifies checksums of all tar files (if checksums.sha256 exists)
2. Finds all .tar files in cortex-offline-images/
3. Loads each image into Docker's local cache
4. Post-load verification: Confirms expected images exist in Docker
5. Reports critical vs optional missing images
6. Shows summary of cached images
Post-load verification automatically reads expected images from the manifest and verifies they exist after loading. Critical images (vLLM, llama.cpp, postgres, redis) will cause the script to exit with an error if missing.
Expected output:
Loading vllm-openai-v0.6.3.tar (10.2 GB)... ✓
Loading llamacpp-server-cuda.tar (4.1 GB)... ✓
Loading python-3.11-slim.tar (148 MB)... ✓
...
Successfully loaded: 11
Failed: 0
✓ All images loaded successfully!
Step 4: Verify Offline Readiness¶
# Check that all required images are cached
make verify-offline
Expected output (when ready):
Checking required images...
vllm/vllm-openai:v0.6.3 ✓ (10.2 GB)
ghcr.io/ggml-org/llama.cpp:server-cuda ✓ (4.1 GB)
python:3.11-slim ✓ (148 MB)
node:18-alpine ✓ (182 MB)
postgres:16 ✓ (357 MB)
redis:7 ✓ (117 MB)
prom/prometheus:v2.47.0 ✓ (224 MB)
✓ READY FOR OFFLINE OPERATION
All required images are cached locally.
You can deploy Cortex in a fully offline environment.
Step 5: Configure Offline Mode¶
Enable offline mode to prevent any internet access attempts:
# Add to backend/.env
echo "OFFLINE_MODE=True" >> backend/.env
# Or edit backend/.env directly:
nano backend/.env
Add these lines:
# Offline/Air-gapped deployment
OFFLINE_MODE=True # Prevent internet access for image pulls
OFFLINE_MODE_AUTO_DETECT=False # Disable auto-detection (we know we're offline)
REQUIRE_IMAGE_PRECACHE=True # Strict: only use cached images
Step 6: Deploy Cortex¶
# Start all services
make quick-start
# Or step-by-step:
make up
make bootstrap-default
Verification:
# Check services are running
make status
# Check health
make health
# View Docker image cache status (via API)
curl http://localhost:8084/admin/system/docker-images | jq .
Offline Deployment Checklist¶
Preparation (Online Machine)¶
- Clone Cortex repository
- Run
make prepare-offline - Verify package created (cortex-offline-images/)
- Check package size (~15-20 GB)
- Transfer model files to USB/media (if using offline models)
- Document versions in manifest.json
Transfer¶
- Copy Cortex/ repository to transfer media
- Copy cortex-offline-images/ to transfer media
- Copy model files to transfer media (optional)
- Verify checksums/integrity (optional but recommended)
Deployment (Offline Machine)¶
- Transfer all files to offline machine
- Run
make load-offline - Run
make verify-offline(should show all ✓) - Configure offline mode in backend/.env
- Place models in /var/cortex/models (if using)
- Run
make quick-start - Verify services running:
make status - Test model deployment via web UI
Troubleshooting¶
"Image not found" Error When Starting Model¶
Symptom:
Error: Docker image 'vllm/vllm-openai:<tag>' is not available locally.
System is in OFFLINE MODE - cannot download from internet.
Solution:
# Check what's cached
docker images
# If vLLM image missing, load it
docker load -i cortex-offline-images/vllm-openai-v0.6.3.tar
# Verify
make verify-offline
Network Detection Issues¶
Symptom: Cortex tries to pull images even in offline environment
Solution:
# Force offline mode
echo "OFFLINE_MODE=True" >> backend/.env
echo "OFFLINE_MODE_AUTO_DETECT=False" >> backend/.env
# Restart
make restart
Partial Image Loading¶
Symptom: Some images failed to load
Solution:
# Load failed images individually
docker load -i cortex-offline-images/<failed-image>.tar
# Check for corruption
md5sum cortex-offline-images/*.tar
# Re-download if corrupted
Version Mismatch¶
Symptom: Image versions don't match configuration
Solution:
# Check manifest
cat cortex-offline-images/manifest.json
# Update config to match cached versions
nano backend/.env
# Set VLLM_IMAGE and LLAMACPP_IMAGE to match manifest versions
Version Management¶
Central Version Configuration¶
Cortex uses a central version file to ensure consistency across all scripts:
# scripts/versions.env - Single source of truth
CORTEX_VLLM_VERSION="${CORTEX_VLLM_VERSION:-latest}"
CORTEX_LLAMACPP_TAG="${CORTEX_LLAMACPP_TAG:-server-cuda}"
All scripts (prepare-offline-deployment.sh, verify-offline-images.sh) automatically source this file.
Pinning Versions for Production¶
For production/offline deployments, pin specific versions:
# Edit scripts/versions.env
CORTEX_VLLM_VERSION="v0.8.0"
CORTEX_LLAMACPP_TAG="server-cuda"
# Or set environment variables
export CORTEX_VLLM_VERSION="v0.8.0"
make prepare-offline
Version Override Hierarchy¶
- Environment variables (highest priority):
VLLM_VERSION,LLAMACPP_TAG - Cortex versions:
CORTEX_VLLM_VERSION,CORTEX_LLAMACPP_TAG - scripts/versions.env defaults
- Hardcoded fallbacks (lowest priority):
"latest","server-cuda"
Keeping Versions Synchronized¶
When updating versions:
- Update
scripts/versions.env - Update
backend/.env(if set explicitly) - Run
make prepare-offlineto cache new images - Run
make verify-offlineto confirm - Update
backend/src/config.pydefault if needed
Advanced: Local Docker Registry¶
For organizations deploying Cortex on multiple offline machines:
Setup Local Registry¶
# Start local registry
docker run -d -p 5000:5000 --restart=always \
--name local-registry \
-v /var/lib/docker-registry:/var/lib/registry \
registry:2
# Tag images for local registry
docker tag vllm/vllm-openai:v0.6.3 localhost:5000/vllm-openai:v0.6.3
docker tag ghcr.io/ggml-org/llama.cpp:server-cuda localhost:5000/llamacpp:server-cuda
# Push to local registry
docker push localhost:5000/vllm-openai:v0.6.3
docker push localhost:5000/llamacpp:server-cuda
Configure Cortex to Use Local Registry¶
# backend/.env
VLLM_IMAGE=localhost:5000/vllm-openai:v0.6.3
LLAMACPP_IMAGE=localhost:5000/llamacpp:server-cuda
OFFLINE_MODE=False # Can leave offline mode off since registry is local
Benefits: - Distribute images to multiple machines easily - Centralized image management - Standard Docker workflow
Requirements: - Registry container running on offline network - All machines can reach registry IP:5000 - Storage for registry data (~20+ GB)
Updating Offline Deployment¶
When to Update¶
- Security patches released for vLLM or llama.cpp
- New features in engine versions
- Infrastructure updates (Python, Postgres, etc.)
Update Process¶
# On internet-connected machine
cd Cortex
git pull # Get latest Cortex code
# Check for version updates in backend/src/config.py
cat backend/src/config.py | grep "_IMAGE"
# Prepare new offline package with updated versions
make prepare-offline
# Transfer updated package to offline machine
# On offline machine:
make load-offline
make verify-offline
# Update configuration if versions changed
nano backend/.env
# Restart with new images
make restart
Security Considerations¶
Benefits of Offline Operation¶
✅ No internet exposure - Attack surface reduced
✅ Version control - No silent updates from registries
✅ Supply chain security - Images verified before transfer
✅ Compliance - Meets air-gap requirements (ITAR, EAR)
✅ Audit trail - All images tracked in manifest
Best Practices¶
- Verify checksums after transfer (automatic)
Cortex now automatically generates and verifies SHA256 checksums for all exported files.
# Checksums are automatically generated during export
# Located at: {output_dir}/checksums.sha256
# During load, checksums are automatically verified:
make load-offline
# Output: "Verifying file checksums..."
# db/cortex.sql ✓
# manifest.json ✓
# manifests/models.json ✓
# To skip verification (not recommended):
VERIFY_CHECKSUMS=false make load-offline
# Manual verification:
cd cortex-offline-images
sha256sum -c checksums.sha256
-
Pin specific versions (don't use :latest)
# In backend/src/config.py VLLM_IMAGE = "vllm/vllm-openai:v0.6.3" # Not :latest -
Document image provenance
- Record where images were downloaded from
- Note download date and verification steps
-
Maintain change log for updates
-
Test before deployment
- Test offline package on non-production system first
- Verify all functionality works
- Document any issues found
File Size Reference¶
Docker Images¶
| Image | Purpose | Approximate Size |
|---|---|---|
| vllm/vllm-openai | vLLM inference engine | 8-12 GB |
| llama.cpp | llama.cpp inference engine | 3-5 GB |
| python:3.11-slim | Backend runtime | 150 MB |
| node:18-alpine | Frontend runtime | 180 MB |
| postgres:16 | Database | 350 MB |
| redis:7 | Cache | 110 MB |
| prometheus | Metrics | 220 MB |
| node-exporter | Host metrics | 25 MB |
| dcgm-exporter | GPU metrics | 450 MB |
| cadvisor | Container metrics | 250 MB |
| Total | ~14-19 GB |
Compression Ratios¶
- Uncompressed: 14-19 GB
- gzip (.tar.gz): ~8-12 GB (40-50% reduction)
- xz (.tar.xz): ~6-9 GB (50-60% reduction, slower)
Model Files¶
Model files are separate from Docker images:
| Model | Quantization | Approximate Size |
|---|---|---|
| Llama-3-8B | FP16 | ~16 GB |
| Llama-3-70B | FP16 | ~140 GB |
| GPT-OSS-120B | Q8_0 | ~120 GB |
| Mistral-7B | Q4_K_M | ~4 GB |
Common Scenarios¶
Scenario 1: Initial Offline Deployment¶
Situation: First-time deployment in air-gapped environment
Steps: 1. Prepare offline package (online) 2. Transfer everything 3. Load images 4. Configure offline mode 5. Deploy
Timeline: 1-2 hours (plus transfer time)
Scenario 2: Adding New Model (Offline)¶
Situation: Want to add a new model to existing offline Cortex
Steps:
# Model files already in /var/cortex/models
# Docker images already cached
# Just configure via web UI:
1. Login to Cortex UI (http://<host-ip>:3001)
2. Navigate to Models page
3. Click "Add Model"
4. Select "Offline" mode
5. Choose model folder
6. Configure and Start
No internet required - uses existing cached images!
Scenario 3: Updating to Newer vLLM Version¶
Situation: New vLLM version released, want to update
Steps:
# On online machine
export VLLM_VERSION=v0.7.0 # New version
make prepare-offline
# Transfer cortex-offline-images/ to offline machine
# On offline machine
make load-offline
make verify-offline
# Update config
echo "VLLM_IMAGE=vllm/vllm-openai:v0.7.0" >> backend/.env
# Restart
make restart
API Reference¶
Check Image Cache Status¶
# Get Docker image cache status
curl http://localhost:8084/admin/system/docker-images \
-H "Cookie: cortex_session=admin" \
| jq .
Response:
{
"offline_mode": true,
"offline_ready": true,
"engines": {
"vllm": {
"image": "vllm/vllm-openai:v0.6.3",
"cached": true,
"size_mb": 10240.5,
"created": "2024-09-15T10:30:00Z"
},
"llamacpp": {
"image": "ghcr.io/ggml-org/llama.cpp:server-cuda",
"cached": true,
"size_mb": 4096.2,
"created": "2024-09-10T08:15:00Z"
}
},
"infrastructure": [...],
"summary": {
"critical_images_cached": true,
"total_images_checked": 11,
"cached_count": 11
},
"recommendations": [
"✓ All critical images cached. System is ready for offline operation."
]
}
Deployment Job Management¶
Cortex maintains job history for deployment operations:
# Get current/latest job status
curl http://localhost:8084/admin/deployment/status -b /tmp/cookies.txt | jq .
# List recent jobs (last 20)
curl http://localhost:8084/admin/deployment/jobs -b /tmp/cookies.txt | jq .
# Get specific job by ID
curl http://localhost:8084/admin/deployment/jobs/{job_id} -b /tmp/cookies.txt | jq .
# Cancel a running job
curl -X DELETE http://localhost:8084/admin/deployment/jobs/{job_id} -b /tmp/cookies.txt | jq .
Job Types: export, model_export, db_restore
Job Statuses: pending, running, completed, failed, cancelled
Disk Space Estimation¶
Before starting a large export, estimate the required space:
curl -X POST http://localhost:8084/admin/deployment/estimate-size \
-H "Content-Type: application/json" \
-b /tmp/cookies.txt \
-d '{
"output_dir": "/var/cortex/exports",
"include_images": true,
"include_db": true,
"tar_models": true,
"tar_hf_cache": false
}' | jq .
Response:
{
"estimated_bytes": 45000000000,
"estimated_formatted": "41.9 GB",
"breakdown": {
"docker_images": "6.0 GB",
"database": "10.0 MB",
"models": "35.8 GB"
},
"disk_space": {
"sufficient": true,
"available_bytes": 800000000000,
"available_formatted": "745.1 GB",
"required_bytes": 54000000000,
"required_formatted": "50.3 GB",
"safety_margin": 1.2
}
}
HF Token Handling on Import¶
When importing models that originally had HuggingFace tokens configured:
- Tokens are redacted during export for security
- Import warnings indicate if a token was present
- After import, configure the token in Admin → Models → Edit
Import dry-run response:
{
"dry_run": true,
"warnings": [
"⚠️ HF_TOKEN REQUIRED: This model had an HF token configured which was redacted for security. After import, go to Admin → Models → Edit and add your Hugging Face token."
],
"can_import": true
}
Configuration Reference¶
backend/.env (Offline Mode)¶
# Offline/Air-gapped Configuration
OFFLINE_MODE=True # Prevent all internet access for images
OFFLINE_MODE_AUTO_DETECT=False # Disable auto-detection
REQUIRE_IMAGE_PRECACHE=True # Strict: only cached images allowed
IMAGE_PULL_TIMEOUT=600 # Not used in offline mode
# Docker Images (must match cached versions)
VLLM_IMAGE=vllm/vllm-openai:v0.6.3
LLAMACPP_IMAGE=ghcr.io/ggml-org/llama.cpp:server-cuda
# Model storage (offline models)
CORTEX_MODELS_DIR=/var/cortex/models
CORTEX_MODELS_DIR_HOST=/var/cortex/models
backend/.env (Online with Fallback)¶
# Online mode with offline fallback
OFFLINE_MODE=False # Allow image pulls
OFFLINE_MODE_AUTO_DETECT=True # Auto-detect if network unavailable
REQUIRE_IMAGE_PRECACHE=False # Pull if needed
# Docker Images
VLLM_IMAGE=vllm/vllm-openai:v0.6.3
LLAMACPP_IMAGE=ghcr.io/ggml-org/llama.cpp:server-cuda
Compliance & Auditing¶
For Regulated Environments¶
Image Provenance Documentation:
Create a provenance record for audit trails:
# Generate image manifest with hashes
docker images --format "{{.Repository}}:{{.Tag}}" | while read img; do
echo "Image: $img"
docker image inspect $img --format ' ID: {{.Id}}'
docker image inspect $img --format ' Created: {{.Created}}'
docker image inspect $img --format ' Size: {{.Size}}'
echo ""
done > image-provenance.txt
Security Scanning (perform on online machine before transfer):
# Scan images for vulnerabilities
docker scan vllm/vllm-openai:v0.6.3 > vllm-scan-report.txt
docker scan ghcr.io/ggml-org/llama.cpp:server-cuda > llamacpp-scan-report.txt
# Include scan reports in transfer package
Change Control:
Maintain records of: - Image versions deployed - Transfer date and method - Verification results (checksums) - Approval signatures - Deployment notes
Performance Considerations¶
Offline vs Online Comparison¶
| Metric | Online Mode | Offline Mode |
|---|---|---|
| First model start | 10-20 min (image pull + start) | 30-60 sec (start only) |
| Subsequent starts | 30-60 sec (cached) | 30-60 sec (cached) |
| Deployment reliability | Depends on network | 100% reliable |
| Storage required | Minimal (~2 GB cache) | ~20-25 GB (full cache) |
Conclusion: Offline mode is actually faster and more reliable after initial setup!
Disaster Recovery¶
Backing Up Image Cache¶
# Backup Docker image cache
mkdir -p backups/docker-images
docker save $(docker images --format "{{.Repository}}:{{.Tag}}") \
-o backups/docker-images/all-images-backup.tar
# Or backup specific images
docker save vllm/vllm-openai:v0.6.3 \
-o backups/docker-images/vllm-v0.6.3-backup.tar
Restoring from Backup¶
# Restore all images
docker load -i backups/docker-images/all-images-backup.tar
# Or restore specific image
docker load -i backups/docker-images/vllm-v0.6.3-backup.tar
Database Backup and Restore¶
Cortex provides built-in database backup and restore capabilities for migration and disaster recovery.
Exporting Database (via API)¶
# Full deployment export (includes database dump)
curl -X POST http://localhost:8084/admin/deployment/export \
-H "Cookie: cortex_session=admin" \
-H "Content-Type: application/json" \
-d '{
"output_dir": "/var/cortex/exports",
"include_db": true,
"include_configs": true,
"include_models_manifest": true
}'
This creates a db/cortex.sql file in the output directory containing a full PostgreSQL dump.
Restoring Database (via API)¶
# Restore database from dump file
curl -X POST http://localhost:8084/admin/deployment/restore-database \
-H "Cookie: cortex_session=admin" \
-H "Content-Type: application/json" \
-d '{
"output_dir": "/var/cortex/exports",
"backup_first": true,
"drop_existing": true
}'
Parameters:
- output_dir: Directory containing the db/cortex.sql dump file
- backup_first: Create a safety backup before restore (default: true)
- drop_existing: Drop existing tables before restore for a clean slate (default: false)
⚠️ Warning: Database restore with drop_existing: true will permanently delete all existing data. Always enable backup_first unless you're certain.
Restore via Admin UI¶
- Navigate to Admin → Deployment
- In the Database Restore section (red card):
- Set the import directory path
- Optionally change the dump file path (default:
db/cortex.sql) - Click Restore Database
- Monitor progress in the deployment status panel
Safety Backups¶
When backup_first: true, Cortex automatically creates a backup at:
{output_dir}/db/pre_restore_backup/cortex_backup_{timestamp}.sql
To restore from a safety backup:
curl -X POST http://localhost:8084/admin/deployment/restore-database \
-H "Cookie: cortex_session=admin" \
-H "Content-Type: application/json" \
-d '{
"output_dir": "/var/cortex/exports",
"db_file": "db/pre_restore_backup/cortex_backup_1234567890.sql",
"backup_first": false,
"drop_existing": true
}'
Manual Database Operations¶
For advanced scenarios, you can use PostgreSQL tools directly:
# Connect to database container
docker exec -it cortex-postgres-1 psql -U cortex -d cortex
# Manual backup
docker exec cortex-postgres-1 pg_dump -U cortex cortex > backup.sql
# Manual restore (drop existing first)
docker exec -i cortex-postgres-1 psql -U cortex -c "DROP SCHEMA public CASCADE; CREATE SCHEMA public;"
docker exec -i cortex-postgres-1 psql -U cortex cortex < backup.sql
PostgreSQL 16 Compatibility Notes¶
Cortex uses PostgreSQL 16, which includes \restrict and \unrestrict commands in pg_dump output. These are session-based security features that can cause issues when restoring to a different PostgreSQL instance.
Automatic handling: The restore-database API automatically strips these commands, ensuring compatibility across different PostgreSQL 16 instances.
Manual restore: If restoring manually, strip these commands first:
# Strip \restrict commands for cross-instance compatibility
sed -i '/^\\restrict/d;/^\\unrestrict/d' backup.sql
# Then restore
docker exec -i cortex-postgres-1 psql -U cortex cortex < backup.sql
Advanced: Image Layer Deduplication¶
Understanding Docker Layers¶
Docker images are built from layers. Each layer represents a set of filesystem changes (adding files, installing packages, etc.). When you have multiple images that share a common base, they share those base layers.
How Cortex images share layers:
┌─────────────────────────────────────────────────────────────────┐
│ vLLM Image (~8-12 GB) │
├─────────────────────────────────────────────────────────────────┤
│ vLLM application layer (~3-4 GB) │
├─────────────────────────────────────────────────────────────────┤
│ Python dependencies layer (~1-2 GB) │
├─────────────────────────────────────────────────────────────────┤
│ CUDA toolkit layer (~3-4 GB) ◄── SHARED │
├─────────────────────────────────────────────────────────────────┤
│ Ubuntu base layer (~200 MB) ◄── SHARED │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ llama.cpp Image (~3-5 GB) │
├─────────────────────────────────────────────────────────────────┤
│ llama.cpp application layer (~1-2 GB) │
├─────────────────────────────────────────────────────────────────┤
│ CUDA toolkit layer (~3-4 GB) ◄── SHARED │
├─────────────────────────────────────────────────────────────────┤
│ Ubuntu base layer (~200 MB) ◄── SHARED │
└─────────────────────────────────────────────────────────────────┘
Both vLLM and llama.cpp images typically share: - Ubuntu/Debian base layer (~200 MB) - CUDA toolkit layer (~3-4 GB) - Common system libraries
Current Export Behavior¶
When using make prepare-offline, each image is saved independently:
docker save vllm/vllm-openai:latest -o vllm-openai-latest.tar
docker save ghcr.io/ggml-org/llama.cpp:server-cuda -o llamacpp-server-cuda.tar
Result: Shared layers are duplicated across tar files.
| File | Size | Notes |
|---|---|---|
| vllm-openai-latest.tar | ~10 GB | Contains full CUDA stack |
| llamacpp-server-cuda.tar | ~4 GB | Contains CUDA stack again (duplicate) |
| Total | ~14 GB | ~3-4 GB duplicated |
Optimized Export: Combined Save¶
Docker's save command can export multiple images to a single tar file, automatically deduplicating shared layers:
# Optimized: save both images to one tar file
docker save vllm/vllm-openai:latest ghcr.io/ggml-org/llama.cpp:server-cuda \
-o cortex-engines-combined.tar
Result: Shared layers are stored once.
| File | Size | Savings |
|---|---|---|
| cortex-engines-combined.tar | ~11 GB | ~3 GB saved |
Estimated Space Savings¶
| Image Combination | Independent | Combined | Savings |
|---|---|---|---|
| vLLM + llama.cpp | ~14 GB | ~11 GB | ~20-25% |
| vLLM + llama.cpp + infra | ~16 GB | ~12 GB | ~25% |
| Multiple vLLM versions (v0.6.3, v0.7.0) | ~20 GB | ~14 GB | ~30% |
Note: Actual savings depend on how images were built. Images from the same maintainer or built on similar base images will share more layers.
Using Combined Saves¶
Manual Combined Export¶
# Navigate to offline images directory
cd cortex-offline-images
# Export engine images together (largest savings)
docker save \
vllm/vllm-openai:latest \
ghcr.io/ggml-org/llama.cpp:server-cuda \
-o engines-combined.tar
# Export infrastructure images together
docker save \
postgres:16 \
redis:7 \
prom/prometheus:v2.47.0 \
prom/node-exporter:v1.6.1 \
-o infrastructure-combined.tar
# Generate checksums
sha256sum engines-combined.tar infrastructure-combined.tar > checksums.sha256
Loading Combined Images¶
# Combined tar files load the same way
docker load -i engines-combined.tar
docker load -i infrastructure-combined.tar
# All images from the combined tar are now available
docker images | grep -E "vllm|llama"
Trade-offs¶
| Approach | Pros | Cons |
|---|---|---|
| Independent saves (default) | - Simpler to manage - Can update one image without re-exporting others - Easier to debug |
- Larger total size - More transfer time |
| Combined saves | - Smaller total size - Faster transfer - Less disk space needed |
- Must re-export all images to update one - Harder to identify issues with specific images |
Recommendations¶
Use independent saves (default) when: - You have plenty of storage and bandwidth - You frequently update individual images - You want simpler troubleshooting - You're transferring via fast networks or local drives
Use combined saves when: - Storage or bandwidth is constrained - You're doing a full migration (all images at once) - Transfer is over slow networks or physical media - You're creating a "golden" offline package for multiple deployments
Verifying Layer Sharing¶
To check if your images share layers:
# List layers in each image
docker image inspect vllm/vllm-openai:latest --format '{{.RootFS.Layers}}' | tr ',' '\n' | wc -l
docker image inspect ghcr.io/ggml-org/llama.cpp:server-cuda --format '{{.RootFS.Layers}}' | tr ',' '\n' | wc -l
# Find common layers
comm -12 \
<(docker image inspect vllm/vllm-openai:latest --format '{{.RootFS.Layers}}' | tr ' ' '\n' | sort) \
<(docker image inspect ghcr.io/ggml-org/llama.cpp:server-cuda --format '{{.RootFS.Layers}}' | tr ' ' '\n' | sort)
Script for Optimized Export¶
For frequent deployments, you can use this helper script:
#!/bin/bash
# scripts/prepare-offline-optimized.sh
OUTPUT_DIR="${1:-cortex-offline-images}"
mkdir -p "$OUTPUT_DIR"
echo "=== Exporting Engine Images (combined) ==="
docker save \
vllm/vllm-openai:${CORTEX_VLLM_VERSION:-latest} \
ghcr.io/ggml-org/llama.cpp:${CORTEX_LLAMACPP_TAG:-server-cuda} \
-o "$OUTPUT_DIR/engines-combined.tar"
echo "=== Exporting Infrastructure Images (combined) ==="
docker save \
postgres:16 \
redis:7 \
prom/prometheus:v2.47.0 \
-o "$OUTPUT_DIR/infrastructure-combined.tar"
echo "=== Generating checksums ==="
cd "$OUTPUT_DIR"
sha256sum *.tar > checksums.sha256
echo "=== Done ==="
du -sh *.tar
Multi-Machine Deployment¶
Using Local Registry (Recommended)¶
For deploying Cortex on multiple machines in an offline network:
Setup (on one offline machine):
# Load registry image
docker load -i cortex-offline-images/registry-2.tar
# Start local registry
docker run -d -p 5000:5000 --restart=always \
--name local-registry \
-v /var/lib/docker-registry:/var/lib/registry \
registry:2
# Populate registry with all images
for tar in cortex-offline-images/*.tar; do
docker load -i "$tar"
done
# Tag and push to local registry
docker tag vllm/vllm-openai:v0.6.3 <registry-ip>:5000/vllm-openai:v0.6.3
docker push <registry-ip>:5000/vllm-openai:v0.6.3
docker tag ghcr.io/ggml-org/llama.cpp:server-cuda <registry-ip>:5000/llamacpp:server-cuda
docker push <registry-ip>:5000/llamacpp:server-cuda
On other machines:
# Configure to use local registry
echo "VLLM_IMAGE=<registry-ip>:5000/vllm-openai:v0.6.3" >> backend/.env
echo "LLAMACPP_IMAGE=<registry-ip>:5000/llamacpp:server-cuda" >> backend/.env
echo "OFFLINE_MODE=False" >> backend/.env # Can use online mode with local registry
# Start Cortex
make quick-start
FAQ¶
Q: Can I use Cortex completely offline?¶
A: Yes! After loading the Docker image package, Cortex can operate completely offline. You only need internet to: 1. Download the initial offline package (once) 2. Update to newer versions (periodically)
Q: How much storage do I need?¶
A: Minimum ~25-30 GB: - Docker images: ~15-20 GB - Model files: Varies (e.g., 120 GB for GPT-OSS 120B) - Database: 1-5 GB (grows with usage) - Logs: 1-2 GB
Q: Can I delete tar files after loading?¶
A: Yes, but keep them for disaster recovery:
# After successful load and verification
mkdir -p backups/offline-package-YYYYMMDD
mv cortex-offline-images/ backups/offline-package-YYYYMMDD/
Q: What if model containers can't communicate with the gateway?¶
A: Model containers must be on the same Docker network as the gateway (typically cortex_default). Cortex automatically:
1. Checks if the network exists before starting model containers
2. Creates the network if missing (with a warning)
3. Falls back to Docker's default bridge network if creation fails
If you're having connectivity issues:
# Verify network exists
docker network ls | grep cortex_default
# If missing, restart Cortex (this creates the network)
make restart
# Manually check network connectivity
docker exec -it cortex-gateway-1 curl -s http://<model-container-name>:8000/health
Q: What if I need a different vLLM version?¶
A: Update the package preparation:
# On online machine
export VLLM_VERSION=v0.7.0
make prepare-offline
# Transfer and load as before
Q: Do model files need to be included in offline package?¶
A: No - model files are separate. Transfer them independently to /var/cortex/models:
# Example: Transfer GPT-OSS 120B
cp -r /online-storage/gpt-oss-120b/ /var/cortex/models/
Q: Can I mix offline and online modes?¶
A: Yes! Use OFFLINE_MODE_AUTO_DETECT=True:
- When online: Pulls images automatically
- When offline: Uses cached images only
- Best of both worlds for intermittent connectivity
Support & Resources¶
Documentation¶
OFFLINE_CAPABILITY_ANALYSIS.md- Technical analysisREADME.md- Quick start guideMakefile- Runmake helpfor commands
Scripts¶
scripts/prepare-offline-deployment.sh- Create offline packagescripts/load-offline-deployment.sh- Load packagescripts/verify-offline-images.sh- Verify readiness
Commands¶
make prepare-offline # Prepare package (online machine)
make load-offline # Load package (offline machine)
make verify-offline # Verify readiness
make offline-status # Check current status
Conclusion¶
Cortex's offline deployment capability enables:
✅ True air-gapped operation - No internet dependency
✅ Fast deployments - No download wait times
✅ Predictable behavior - Pinned versions
✅ High security - No external dependencies
✅ Compliance ready - Meets strict regulatory requirements
The offline workflow is production-ready and tested for environments requiring maximum isolation and security.