Observability
CORTEX exposes Prometheus metrics and, optionally, OpenTelemetry traces.
Prometheus metrics
The gateway metrics are defined in backend/src/metrics.py:
- gateway_requests_total{route,status}: request count per route and status
- gateway_request_latency_seconds{route}: request latency in seconds (histogram)
- gateway_upstream_latency_seconds{path}: upstream call latency in seconds
- gateway_upstream_latency_by_upstream_seconds{path,base_url}: upstream call latency in seconds, broken down by upstream
- gateway_stream_ttft_seconds{path}: time to first token (TTFT) for streaming responses, in seconds
- gateway_upstream_selected_total{path,base_url}: number of times each upstream was selected
- gateway_key_auth_allowed_total{reason} / gateway_key_auth_blocked_total{reason}: API key auth decisions (allowed or blocked), by reason
- gateway_upstream_health{base_url}: upstream health as reported by the health poller (gauge)
vLLM exporters and node/DCGM exporters can be scraped for GPU and host metrics.
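As a rough illustration of how the gateway counters can be consumed outside Prometheus, the sketch below parses a scrape in the Prometheus text exposition format and derives a per-route 5xx error rate from gateway_requests_total. The sample payload, its label values, and the helper names `parse_samples` and `error_rate_by_route` are assumptions made for this example; they are not taken from backend/src/metrics.py.

```python
import re
from collections import defaultdict

# Hypothetical scrape output. Routes, statuses, and counts are made up.
SAMPLE = """\
# HELP gateway_requests_total Request counts
# TYPE gateway_requests_total counter
gateway_requests_total{route="/v1/chat/completions",status="200"} 90
gateway_requests_total{route="/v1/chat/completions",status="500"} 10
gateway_requests_total{route="/v1/embeddings",status="200"} 50
"""

# name{labels} value [timestamp]
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def error_rate_by_route(samples):
    """Fraction of requests per route whose status label starts with '5'."""
    total = defaultdict(float)
    errors = defaultdict(float)
    for name, labels, value in samples:
        if name != "gateway_requests_total":
            continue
        route = labels.get("route", "")
        total[route] += value
        if labels.get("status", "").startswith("5"):
            errors[route] += value
    return {route: errors[route] / total[route] for route in total if total[route] > 0}


rates = error_rate_by_route(parse_samples(SAMPLE))
```

Because the counters are cumulative, this gives a lifetime error rate. For a rate over a time window, take the difference between two scrapes before dividing.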
Per-Model vLLM Metrics
The gateway aggregates metrics from running vLLM containers:
Endpoint: GET /admin/models/metrics
Available Metrics:
| Metric | Description |
|--------|-------------|
| num_requests_running | Active inference requests |
| num_requests_waiting | Queued requests |
| num_requests_swapped | Requests swapped to CPU |
| prompt_tokens_total | Total input tokens processed |
| completion_tokens_total | Total output tokens generated |
| time_to_first_token_seconds_sum/count | Sum and count of time-to-first-token (TTFT) observations; mean TTFT is sum / count |
| gpu_cache_usage_perc | KV cache memory utilization |
Frontend Access: Admin UI → System Monitor → "🤖 Active Models" accordion section
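The TTFT entries in the table are a cumulative sum and count, so a latency figure has to be derived from them. The sketch below shows one way to do that, assuming the per-model metrics are available as a flat dictionary keyed by the metric names in the table. The response shape of GET /admin/models/metrics is not documented here, so that dictionary layout and the helper names are hypothetical.

```python
def mean_ttft_seconds(metrics):
    """Lifetime mean TTFT in seconds, or None if nothing has been observed."""
    count = metrics.get("time_to_first_token_seconds_count", 0)
    if not count:
        return None
    return metrics["time_to_first_token_seconds_sum"] / count


def mean_ttft_between(previous, current):
    """Mean TTFT over the interval between two polls of the same model."""
    delta_count = (
        current["time_to_first_token_seconds_count"]
        - previous["time_to_first_token_seconds_count"]
    )
    if delta_count <= 0:
        return None  # no new requests, or the counters were reset
    delta_sum = (
        current["time_to_first_token_seconds_sum"]
        - previous["time_to_first_token_seconds_sum"]
    )
    return delta_sum / delta_count


# Hypothetical snapshots for one model, taken at two different times.
earlier = {
    "time_to_first_token_seconds_sum": 12.5,
    "time_to_first_token_seconds_count": 50,
}
later = {
    "time_to_first_token_seconds_sum": 20.0,
    "time_to_first_token_seconds_count": 60,
}

lifetime_mean = mean_ttft_seconds(earlier)
recent_mean = mean_ttft_between(earlier, later)
```

The interval form is usually the more useful one for monitoring, since a lifetime mean changes very slowly once a model has served many requests.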
Dashboards
- Provide Grafana dashboards for gateway KPIs (latency, error rate, upstream selection, TTFT) and for system metrics.
- The System Monitor page provides real-time views of host, GPU, and model metrics.
Tracing (optional)
- Enable OpenTelemetry (OTel) through environment variables. When tracing is configured, spans are propagated through FastAPI and httpx.
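To make the propagation step concrete, the sketch below shows what carrying a trace from an incoming request to an outgoing upstream call amounts to, assuming the W3C Trace Context `traceparent` header is the propagation format (the OpenTelemetry default). In a real deployment the OpenTelemetry instrumentation for FastAPI and httpx would do this automatically. The helper names here are hypothetical and only illustrate the header handling.

```python
import re
import secrets

# traceparent: version-traceid-parentid-flags, all lowercase hex
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Parse a traceparent header value; return None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming):
    """Build the header for an outgoing call that continues the same trace."""
    context = parse_traceparent(incoming)
    if context is None:
        return None
    span_id = secrets.token_hex(8)  # new 16-hex-character span ID
    flags = "01" if context["sampled"] else "00"
    return f"00-{context['trace_id']}-{span_id}-{flags}"


# Example header value taken from the W3C Trace Context specification.
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

The trace ID stays the same across the hop while the span ID changes, which is what lets a tracing backend stitch the gateway span and the upstream span into one trace.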