## Admin API

Admin endpoints require a dev cookie session (or, in the future, production auth) and are served under `/admin`.
### Keys

- `GET /admin/keys` — list API keys (filters: `org_id`, `user_id`, `q`, `sort`)
- `POST /admin/keys` — create a new key (the token is returned only once)
- `DELETE /admin/keys/{id}` — revoke a key
Example:

```bash
curl -X POST "$GATEWAY/admin/keys" \
  -H 'Content-Type: application/json' \
  -d '{"scopes":"chat,completions,embeddings"}'
```
### Organizations

- `GET /admin/orgs`, `POST /admin/orgs`, `PATCH /admin/orgs/{id}`, `DELETE /admin/orgs/{id}`
- `GET /admin/orgs/lookup` — for select inputs
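A minimal sketch of creating an organization. The `name` field in the payload is an assumption, and the dev session cookie is assumed to be saved in `cookies.txt`:

```bash
# Create an org, then fetch the lightweight list used by select inputs.
# The "name" field is illustrative -- confirm against the OpenAPI spec.
curl -s -b cookies.txt -X POST "$GATEWAY/admin/orgs" \
  -H 'Content-Type: application/json' \
  -d '{"name":"acme"}'

curl -s -b cookies.txt "$GATEWAY/admin/orgs/lookup"
```

The `/admin/users` endpoints below follow the same CRUD pattern.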
### Users

- `GET /admin/users`, `POST /admin/users`, `PATCH /admin/users/{id}`, `DELETE /admin/users/{id}`
- `GET /admin/users/lookup`
### Models

- `GET /admin/models` — list stored models
- `POST /admin/models` — create a new model
- `PATCH /admin/models/{id}` — update configuration
- `POST /admin/models/{id}/start` — start the model container
- `POST /admin/models/{id}/stop` — stop the model container
- `POST /admin/models/{id}/apply` — apply configuration changes
- `POST /admin/models/{id}/dry-run` — validate config and preview the command (also returns a VRAM estimate)
- `POST /admin/models/{id}/test` — test model inference
- `GET /admin/models/{id}/readiness` — check model readiness status
- `GET /admin/models/{id}/logs` — recent container logs
- `GET /admin/models/{id}/logs?diagnose=true` — logs with startup diagnostics
- `DELETE /admin/models/{id}` — delete a model (database entry only; files are preserved)
- Registry: `GET`/`POST`/`DELETE` `/admin/models/registry` — manage the model routing registry
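A hedged sketch of the typical lifecycle, assuming a dev session cookie in `cookies.txt` and a model with id `1`:

```bash
# Validate the configuration and preview the launch command first.
curl -s -b cookies.txt -X POST "$GATEWAY/admin/models/1/dry-run" | jq '.warnings'

# Start the container, then check readiness.
curl -s -b cookies.txt -X POST "$GATEWAY/admin/models/1/start"
curl -s -b cookies.txt "$GATEWAY/admin/models/1/readiness"

# If startup stalls, pull recent logs with diagnostics enabled.
curl -s -b cookies.txt "$GATEWAY/admin/models/1/logs?diagnose=true"
```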
### Model States

Models transition through these states: `stopped` → `starting` → `loading` → `running`.

On error a model enters the `failed` state (check the logs for diagnostics).
### Dry-Run Response

The dry-run endpoint returns:

```json
{
"command": ["vllm", "serve", "--model", "/models/..."],
"warnings": [
{"severity": "warning", "category": "vram", "title": "VRAM Warning", "message": "..."}
],
"vram_estimate_gb": 4.5
}
```
### Usage

- `GET /admin/usage` — recent requests (filters, pagination)
- `GET /admin/usage/aggregate` — totals by model
- `GET /admin/usage/series` — time series
- `GET /admin/usage/latency` — p50/p95 latency
- `GET /admin/usage/ttft` — streaming time to first token (TTFT)
- `GET /admin/usage/export` — CSV export
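For example (assuming a dev session cookie in `cookies.txt`):

```bash
# Aggregate usage by model, then download the raw data as CSV.
curl -s -b cookies.txt "$GATEWAY/admin/usage/aggregate" | jq .
curl -s -b cookies.txt "$GATEWAY/admin/usage/export" -o usage.csv
```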
### System Monitoring

- `GET /admin/system/summary` — CPU/memory/disk/GPU summary (psutil-based)
- `GET /admin/system/throughput` — tokens/sec, RPS, and latency metrics (Prometheus-based)
- `GET /admin/system/gpus` — per-GPU metrics (DCGM or NVML)
- `GET /admin/system/host/summary` — real-time host metrics (Prometheus node-exporter with psutil fallback)
- `GET /admin/system/host/trends` — time-series host metrics (CPU, memory, disk, network)
- `GET /admin/system/capabilities` — environment detection (OS, container, WSL, monitoring providers)
- `GET /admin/models/metrics` — per-model vLLM inference metrics (requests, tokens, latency, cache)
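A quick health-check sketch (dev session cookie in `cookies.txt` assumed):

```bash
# Host summary plus inference throughput in one pass.
curl -s -b cookies.txt "$GATEWAY/admin/system/summary" | jq .
curl -s -b cookies.txt "$GATEWAY/admin/system/throughput" | jq .
```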
### Upstreams Health

- `GET /admin/upstreams` — health snapshots and the model registry
- `POST /admin/upstreams/refresh-health` — trigger on-demand health checks
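For example, to force a fresh check before reading snapshots (cookie jar assumed):

```bash
# Trigger on-demand health checks, then read the updated snapshots.
curl -s -b cookies.txt -X POST "$GATEWAY/admin/upstreams/refresh-health"
curl -s -b cookies.txt "$GATEWAY/admin/upstreams" | jq .
```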
## Chat Playground API

These endpoints power the Chat Playground UI. They use session cookie authentication (`require_user_session`), not API key authentication.
### Running Models

- `GET /v1/models/running` — list healthy running models for chat selection
- `GET /v1/models/{model_name}/constraints` — get model context limits and defaults
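A small sketch using the model name from the examples below (cookie jar assumed):

```bash
# List selectable models, then fetch constraints for one of them.
curl -s -b cookies.txt "$GATEWAY/v1/models/running" | jq '.[].served_model_name'
curl -s -b cookies.txt "$GATEWAY/v1/models/Qwen-2-7B-Instruct/constraints" | jq .
```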
### Chat Sessions

- `GET /v1/chat/sessions` — list the user's chat sessions (newest first)
- `POST /v1/chat/sessions` — create a new chat session
- `GET /v1/chat/sessions/{id}` — get a session with all of its messages
- `POST /v1/chat/sessions/{id}/messages` — add a message to a session
- `DELETE /v1/chat/sessions/{id}` — delete a chat session
- `DELETE /v1/chat/sessions` — clear all of the user's chat sessions
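A minimal end-to-end sketch. The payloads mirror the request schemas below; extracting `id` from the create response is an assumption based on the session response shape:

```bash
# Create a session, then append a user message to it.
SESSION_ID=$(curl -s -b cookies.txt -X POST "$GATEWAY/v1/chat/sessions" \
  -H 'Content-Type: application/json' \
  -d '{"model_name":"Qwen-2-7B-Instruct","engine_type":"vllm"}' \
  | jq -r '.id')   # assumes the create response echoes the session id

curl -s -b cookies.txt -X POST "$GATEWAY/v1/chat/sessions/$SESSION_ID/messages" \
  -H 'Content-Type: application/json' \
  -d '{"role":"user","content":"What is Python?"}'
```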
### Running Model Response

```json
[
{
"served_model_name": "Qwen-2-7B-Instruct",
"task": "generate",
"engine_type": "vllm",
"state": "running"
}
]
```
### Model Constraints Response

```json
{
"served_model_name": "Qwen-2-7B-Instruct",
"engine_type": "vllm",
"task": "generate",
"context_size": null,
"max_model_len": 32768,
"max_tokens_default": 512,
"request_defaults": null,
"supports_streaming": true,
"supports_system_prompt": true
}
```
### Chat Session Response

```json
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"title": "What is Python?",
"model_name": "Qwen-2-7B-Instruct",
"engine_type": "vllm",
"constraints": { "max_model_len": 32768 },
"messages": [
{
"id": 1,
"role": "user",
"content": "What is Python?",
"metrics": null,
"timestamp": 1704672000000
},
{
"id": 2,
"role": "assistant",
"content": "Python is a high-level programming language...",
"metrics": { "tokens_per_second": 32.5, "ttft_ms": 145 },
"timestamp": 1704672005000
}
],
"created_at": 1704672000000,
"updated_at": 1704672005000
}
```
### Create Session Request

```json
{
"model_name": "Qwen-2-7B-Instruct",
"engine_type": "vllm",
"constraints": { "max_model_len": 32768 }
}
```
### Add Message Request

```json
{
"role": "user",
"content": "What is Python?",
"metrics": { "tokens_per_second": 32.5 }
}
```
## Model Discovery & Inspection

- `GET /admin/models/base-dir` — get the current models base directory
- `PUT /admin/models/base-dir` — set the models base directory
- `GET /admin/models/local-folders` — list local model directories
- `GET /admin/models/inspect-folder` — inspect a folder for GGUF files and metadata
- `GET /admin/models/hf-config` — fetch a HuggingFace model configuration
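A discovery sketch. The `path` query parameter on `inspect-folder` is a hypothetical name; confirm it in the OpenAPI spec:

```bash
# Read the base directory and list candidate model folders.
curl -s -b cookies.txt "$GATEWAY/admin/models/base-dir"
curl -s -b cookies.txt "$GATEWAY/admin/models/local-folders" | jq .

# Inspect one folder; "path" is a hypothetical parameter name.
curl -s -b cookies.txt -G "$GATEWAY/admin/models/inspect-folder" \
  --data-urlencode "path=Qwen-2-7B-Instruct" \
  | jq '.engine_recommendation'
```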
### Inspect Folder Response (Enhanced)

The `/admin/models/inspect-folder` endpoint returns a comprehensive analysis:

```json
{
"has_gguf": true,
"has_safetensors": true,
"gguf_groups": [
{
"quant_type": "Q8_0",
"files": ["model-Q8_0.gguf"],
"total_size_mb": 12800,
"is_multipart": false,
"status": "ready",
"metadata": {
"architecture": "llama",
"context_length": 32768,
"embedding_length": 4096,
"block_count": 32,
"attention_head_count": 32,
"attention_head_count_kv": 8,
"vocab_size": 128256,
"file_type": "Q8_0"
}
}
],
"safetensor_info": {
"architecture": "LlamaForCausalLM",
"model_type": "llama",
"total_size_mb": 15000,
"file_count": 4,
"context_length": 32768,
"num_layers": 32,
"hidden_size": 4096,
"num_attention_heads": 32,
"vocab_size": 128256
},
"engine_recommendation": {
"recommended_engine": "vllm",
"recommended_format": "safetensors",
"reason": "SafeTensors available - vLLM recommended for best performance",
"has_multipart_gguf": false,
"has_safetensors": true,
"has_gguf": true
},
"gguf_validation": {
"total_files": 1,
"valid_files": 1,
"invalid_files": 0,
"errors": []
}
}
```
### GPU Metrics Response (Enhanced)

The `/admin/system/gpus` endpoint includes Flash Attention compatibility:

```json
[
{
"index": 0,
"name": "NVIDIA GeForce RTX 4090",
"mem_total_mb": 24576,
"mem_used_mb": 8192,
"compute_capability": "8.9",
"architecture": "Ada Lovelace",
"flash_attention_supported": true
}
]
```
| Field | Description |
|---|---|
| `compute_capability` | CUDA compute capability (e.g., `"8.9"`) |
| `architecture` | GPU architecture name (Ampere, Ada, Hopper) |
| `flash_attention_supported` | Whether Flash Attention 2 is supported (SM 80+) |
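For instance, to list GPUs that report Flash Attention 2 support (fields as documented above):

```bash
# Names of GPUs reporting Flash Attention 2 support.
curl -s -b cookies.txt "$GATEWAY/admin/system/gpus" \
  | jq '[.[] | select(.flash_attention_supported)] | map(.name)'
```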
### Model Fields (llama.cpp Speculative Decoding)

Models with `engine_type: llamacpp` support speculative decoding:

| Field | Type | Description |
|---|---|---|
| `draft_model_path` | string | Path to the draft model GGUF inside the container |
| `draft_n` | integer | Number of tokens to draft (default: 16) |
| `draft_p_min` | float | Minimum acceptance probability (default: 0.5) |
### Model Fields (vLLM GGUF)

Models with `engine_type: vllm` using GGUF files:

| Field | Type | Description |
|---|---|---|
| `gguf_weight_format` | string | GGUF format type: `auto`, `gguf`, or `ggml` |
## Deployment & Migration

Endpoints for exporting and importing Cortex configurations and the database.
### Export Operations

- `POST /admin/deployment/export` — start a full deployment export
- `POST /admin/deployment/export-model/{id}` — export a single model
- `POST /admin/deployment/estimate-size` — estimate export size and check disk space
### Import Operations

- `GET /admin/deployment/model-manifests?output_dir=...` — list available model manifests
- `POST /admin/deployment/import-model` — import a model from a manifest (supports `dry_run`)
### Database Operations

- `POST /admin/deployment/check-database-dump` — check whether a dump file exists
- `POST /admin/deployment/restore-database` — restore the database from a dump
### Job Management

- `GET /admin/deployment/status` — current job status
- `GET /admin/deployment/jobs` — list job history
- `GET /admin/deployment/jobs/{id}` — get a specific job
- `DELETE /admin/deployment/jobs/{id}` — cancel a running job
### Export Request

```json
{
"output_dir": "/var/cortex/exports",
"include_images": true,
"include_db": true,
"include_configs": true,
"include_models_manifest": true,
"tar_models": false,
"tar_hf_cache": false,
"allow_pull_images": true
}
```
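A sketch of starting an export with this body and then checking progress (cookie jar assumed):

```bash
# Kick off a full export, then peek at the running job.
curl -s -b cookies.txt -X POST "$GATEWAY/admin/deployment/export" \
  -H 'Content-Type: application/json' \
  -d '{"output_dir":"/var/cortex/exports","include_images":true,"include_db":true,"include_configs":true,"include_models_manifest":true,"tar_models":false,"tar_hf_cache":false,"allow_pull_images":true}'

curl -s -b cookies.txt "$GATEWAY/admin/deployment/status" | jq '{status, step, progress}'
```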
### Import Model Request

```json
{
"output_dir": "/var/cortex/exports",
"manifest_file": "model-1.json",
"conflict_strategy": "rename",
"dry_run": true
}
```
| Parameter | Type | Description |
|---|---|---|
| `conflict_strategy` | string | `error` or `rename` (adds an `-IMPORTED` suffix) |
| `dry_run` | boolean | Preview the import without making changes |
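A sketch of previewing an import (cookie jar assumed; the manifest name comes from the listing call):

```bash
# List available manifests, then dry-run an import without changing anything.
curl -s -b cookies.txt \
  "$GATEWAY/admin/deployment/model-manifests?output_dir=/var/cortex/exports" | jq .

curl -s -b cookies.txt -X POST "$GATEWAY/admin/deployment/import-model" \
  -H 'Content-Type: application/json' \
  -d '{"output_dir":"/var/cortex/exports","manifest_file":"model-1.json","conflict_strategy":"rename","dry_run":true}'
```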
### Database Restore Request

```json
{
"output_dir": "/var/cortex/exports",
"backup_first": true,
"drop_existing": false
}
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `backup_first` | boolean | `true` | Create a safety backup before the restore |
| `drop_existing` | boolean | `false` | Drop all tables before the restore |
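A restore sketch. The request body for the dump check is an assumption (mirroring `output_dir`), so verify it against the OpenAPI spec:

```bash
# Confirm the dump exists, then restore with a safety backup first.
# The check-database-dump body shown here is an assumption.
curl -s -b cookies.txt -X POST "$GATEWAY/admin/deployment/check-database-dump" \
  -H 'Content-Type: application/json' \
  -d '{"output_dir":"/var/cortex/exports"}'

curl -s -b cookies.txt -X POST "$GATEWAY/admin/deployment/restore-database" \
  -H 'Content-Type: application/json' \
  -d '{"output_dir":"/var/cortex/exports","backup_first":true,"drop_existing":false}'
```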
### Job Status Response

```json
{
"id": "deploy-1234567890",
"status": "running",
"job_type": "export",
"step": "Exporting Docker images",
"progress": 0.45,
"started_at": 1736483400.0,
"estimated_size_bytes": 6452936704,
"bytes_written": 2903821516,
"eta_seconds": 120
}
```
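A minimal polling loop over the documented `status` field:

```bash
# Poll the current deployment job until it leaves the "running" state.
while true; do
  STATUS=$(curl -s -b cookies.txt "$GATEWAY/admin/deployment/status" | jq -r '.status')
  echo "job status: $STATUS"
  [ "$STATUS" != "running" ] && break
  sleep 5
done
```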
### Size Estimation Response

```json
{
"estimated_bytes": 6452936704,
"estimated_formatted": "6.0 GB",
"breakdown": {
"docker_images": "6.0 GB",
"database": "10.0 MB"
},
"disk_space": {
"sufficient": true,
"available_bytes": 868923961344,
"available_formatted": "809.2 GB",
"required_formatted": "7.2 GB",
"safety_margin": 1.2
}
}
```
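To gate an export on available disk space (the estimate body is assumed to mirror the export request):

```bash
# Check whether disk space is sufficient before starting the export.
# The request body is an assumption; confirm against the OpenAPI spec.
curl -s -b cookies.txt -X POST "$GATEWAY/admin/deployment/estimate-size" \
  -H 'Content-Type: application/json' \
  -d '{"output_dir":"/var/cortex/exports","include_images":true,"include_db":true}' \
  | jq '.disk_space.sufficient'
```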
Refer to the OpenAPI spec for complete request/response schemas.