Chat Playground¶

The Chat Playground is an interactive interface for testing and evaluating running inference models directly within Cortex. It allows users to quickly validate model performance, measure generation speed, and experiment with different prompts without writing any code.

Overview¶

Key Features:

Model Selection: Choose from any currently running inference model
Streaming Responses: Real-time token streaming over Server-Sent Events (SSE)
Performance Metrics: Tokens per second (tok/s), time to first token (TTFT), total tokens
Context Tracking: Visual indicator of context window usage
Server-Side Persistence: Chats are stored in the database (user-scoped)
Cross-Device Access: Access your chat history from any machine
Model Locking: Prevents mid-conversation model changes to ensure consistency

Accessing the Chat Playground¶

Navigate to Chat → Playground in the Cortex Admin UI sidebar. The Chat section appears between the "Platform" and "Administration" sections.

Admin UI: http://YOUR_IP:3001/chat

User Interface¶

Layout¶

The Chat Playground has a two-column layout:

┌─────────────────────────────────────────────────────────────────┐
│  Chat Playground                                                │
├────────────────┬────────────────────────────────────────────────┤
│                │                                                │
│  Chat History  │  Model Selector: [Select Model ▼]              │
│  ────────────  │  ─────────────────────────────────────────     │
│  + New Chat    │  Performance: 32.5 tok/s | TTFT: 145ms         │
│                │  ─────────────────────────────────────────     │
│  Chat 1        │                                                │
│  Chat 2        │  User: What is Python?                         │
│  Chat 3        │                                                │
│                │  Assistant: Python is a high-level...          │
│                │                                                │
│  [Clear All]   │  ─────────────────────────────────────────     │
│                │  [Type a message...               ] [Send]     │
│                │  Context: 1,234 / 32,768 tokens                │
└────────────────┴────────────────────────────────────────────────┘

Components¶

Component	Description
Chat Sidebar	Lists chat history, create new chats, delete sessions
Model Selector	Dropdown of running models, locked once conversation starts
Performance Metrics	Real-time generation speed and latency
Message List	Conversation display with markdown rendering
Chat Input	Message input with context usage indicator

Using the Chat Playground¶

Starting a New Chat¶

Click + New Chat in the sidebar (or navigate to /chat)
Select a running model from the dropdown
Type your message and press Enter or click Send

Model Selection

Only models that are currently running and healthy appear in the dropdown. If no models are available, you'll see "No models running". In that case, start a model from the Models page first.

Model Locking¶

Once you send a message, the model selection becomes locked for that conversation:

✅ Why locked: Different models have different context windows, tokenizers, and capabilities. Mixing models mid-conversation could cause errors or unexpected behavior.
✅ How to switch: Start a new chat to use a different model
✅ Visual indicator: A warning banner appears when the model is locked
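The locking rule can be pictured as a small state check. The sketch below is illustrative only: `ChatState`, `isModelLocked`, and `selectModel` are hypothetical names and are not taken from `useChat.ts`. It assumes the lock engages as soon as the conversation holds at least one message.

```typescript
// Hypothetical chat state; field names are assumptions for illustration.
interface ChatState {
  modelName: string | null;
  messages: { role: "user" | "assistant" | "system"; content: string }[];
}

// The model counts as locked once the conversation contains a message.
function isModelLocked(state: ChatState): boolean {
  return state.messages.length > 0;
}

// Returns a new state with the model changed, or the same state if locked.
function selectModel(state: ChatState, modelName: string): ChatState {
  if (isModelLocked(state)) {
    return state; // switching models requires starting a new chat
  }
  return { ...state, modelName };
}
```

Returning the unchanged state instead of throwing keeps the selector simple; the UI can show the warning banner whenever `isModelLocked` is true.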

Performance Metrics¶

During streaming responses, real-time metrics are displayed:

Metric	Description
tok/s	Tokens generated per second
TTFT	Time to first token (latency)
Tokens	Total tokens generated in current response
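As a rough illustration of how these metrics relate to each other, the sketch below derives them from timestamps recorded while a response streams. The type and field names are hypothetical, and measuring tok/s over the generation phase only (excluding TTFT) is an assumption, not a statement about how Cortex computes it.

```typescript
// Timestamps are in milliseconds, e.g. taken from performance.now().
interface StreamTimings {
  requestStartMs: number; // when the request was sent
  firstTokenMs: number;   // when the first token arrived
  lastTokenMs: number;    // when the most recent token arrived
  tokenCount: number;     // tokens generated so far
}

interface StreamMetrics {
  ttftMs: number;
  tokensPerSecond: number;
  tokens: number;
}

function computeMetrics(t: StreamTimings): StreamMetrics {
  const ttftMs = t.firstTokenMs - t.requestStartMs;
  // Assumption: speed is measured over the generation phase, excluding TTFT.
  const generationSec = (t.lastTokenMs - t.firstTokenMs) / 1000;
  const tokensPerSecond = generationSec > 0 ? t.tokenCount / generationSec : 0;
  return { ttftMs, tokensPerSecond, tokens: t.tokenCount };
}
```

For example, 65 tokens arriving over 2 seconds after a first token at 145 ms gives 32.5 tok/s with a TTFT of 145 ms.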

Context Window Tracking¶

The input area shows an estimate of context usage:

Context: 2,048 / 32,768 tokens

Uses a heuristic of ~4 characters per token
Helps avoid context overflow errors
Updates in real-time as you type and receive responses
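A minimal sketch of the ~4 characters per token heuristic follows. The function names are hypothetical and rounding up is an assumption; real tokenizers will give different counts, which is why the indicator is only an estimate.

```typescript
const CHARS_PER_TOKEN = 4; // heuristic from the text; real tokenizers vary

// Estimate the token count of a string (rounding up is an assumption).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Estimate usage for the whole conversation plus the unsent draft.
function estimateContextUsage(
  messages: { content: string }[],
  draft: string,
  contextLimit: number,
): { used: number; limit: number; fraction: number } {
  const used =
    messages.reduce((sum, m) => sum + estimateTokens(m.content), 0) +
    estimateTokens(draft);
  return { used, limit: contextLimit, fraction: used / contextLimit };
}
```

With 8,000 characters of history and a 192-character draft against a 32,768-token limit, this yields 2,048 estimated tokens, matching the indicator shown above.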

Chat Persistence¶

Server-Side Storage¶

Chats are stored in the database (not browser localStorage), providing:

User Isolation: Each user only sees their own chats
Cross-Device Access: Access your chat history from any machine
Persistence: Chats survive browser cache clears
Admin Visibility: Usage is logged for admin monitoring

Database Schema¶

Two tables store chat data:

chat_sessions
├── id (UUID)
├── user_id (FK → users)
├── title (auto-generated from first message)
├── model_name
├── engine_type
├── constraints_json
├── created_at
└── updated_at

chat_messages
├── id (auto-increment)
├── session_id (FK → chat_sessions, CASCADE delete)
├── role ('user', 'assistant', 'system')
├── content
├── metrics_json (tokens/sec, TTFT, etc.)
└── created_at

Chat History¶

Sorted newest-to-oldest
Title auto-generated from first user message
Click to restore previous conversations
Delete individual chats or clear all
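The ordering and title behavior can be sketched as below. This is an illustration, not the Cortex implementation: sorting by `updated_at` rather than `created_at`, the 50-character cap, and the "New Chat" fallback are all assumptions.

```typescript
interface ChatSessionSummary {
  id: string;
  title: string;
  updated_at: string; // ISO 8601 timestamp
}

// Newest first. Sorting by updated_at (not created_at) is an assumption.
function sortNewestFirst(sessions: ChatSessionSummary[]): ChatSessionSummary[] {
  return sessions
    .slice() // copy so the caller's array is not mutated
    .sort((a, b) => Date.parse(b.updated_at) - Date.parse(a.updated_at));
}

// Derive a title from the first user message; the length cap is hypothetical.
function titleFromFirstMessage(content: string, maxLength = 50): string {
  const singleLine = content.trim().replace(/\s+/g, " ");
  if (singleLine.length === 0) return "New Chat";
  if (singleLine.length <= maxLength) return singleLine;
  return singleLine.slice(0, maxLength - 3).replace(/\s+$/, "") + "...";
}
```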

API Endpoints¶

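The Chat Playground uses dedicated API endpoints for model discovery and session persistence; only response streaming goes through the OpenAI-compatible /v1/chat/completions endpoint (see Streaming Implementation). The dedicated endpoints are: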

Model Information¶

Endpoint	Method	Description
`/v1/models/running`	GET	List healthy running models
`/v1/models/{name}/constraints`	GET	Get model context limits and defaults

Chat Sessions¶

Endpoint	Method	Description
`/v1/chat/sessions`	GET	List user's chat sessions
`/v1/chat/sessions`	POST	Create new chat session
`/v1/chat/sessions/{id}`	GET	Get session with messages
`/v1/chat/sessions/{id}/messages`	POST	Add message to session
`/v1/chat/sessions/{id}`	DELETE	Delete chat session
`/v1/chat/sessions`	DELETE	Clear all user's sessions

Authentication

These endpoints use session cookie authentication (require_user_session), not API key authentication. They're designed for the Admin UI, not external API access.
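Because these endpoints rely on the session cookie, a browser client has to make sure the cookie is sent with each request. The sketch below shows one way to build such requests for the GET and DELETE routes listed above. `buildSessionRequest`, `callSessionApi`, and `FetchLike` are hypothetical names; using `credentials: "include"` assumes the UI and gateway may be on different origins, and the response bodies are treated as opaque because their shape is not documented here.

```typescript
type SessionMethod = "GET" | "DELETE";

interface SessionRequest {
  url: string;
  init: { method: SessionMethod; credentials: "include" };
}

// Minimal fetch-like signature so the sketch does not depend on DOM typings.
type FetchLike = (
  url: string,
  init: SessionRequest["init"],
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Omitting sessionId targets the collection (list all / clear all).
function buildSessionRequest(
  method: SessionMethod,
  sessionId?: string,
  baseUrl = "",
): SessionRequest {
  const path =
    sessionId === undefined
      ? "/v1/chat/sessions"
      : `/v1/chat/sessions/${encodeURIComponent(sessionId)}`;
  // credentials: "include" asks the browser to send the session cookie.
  return { url: baseUrl + path, init: { method, credentials: "include" } };
}

async function callSessionApi(
  fetchImpl: FetchLike,
  request: SessionRequest,
): Promise<unknown> {
  const res = await fetchImpl(request.url, request.init);
  if (!res.ok) {
    throw new Error(`${request.init.method} ${request.url} failed: HTTP ${res.status}`);
  }
  // Only GET is assumed to return a JSON body.
  return request.init.method === "GET" ? res.json() : undefined;
}
```

In a browser, `fetchImpl` can be a thin wrapper such as `(url, init) => fetch(url, init)`. Passing it in as a parameter keeps the request-building logic testable without a network.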

Model Constraints Response¶

{
  "served_model_name": "Qwen-2-7B-Instruct",
  "engine_type": "vllm",
  "task": "generate",
  "context_size": null,
  "max_model_len": 32768,
  "max_tokens_default": 512,
  "request_defaults": null,
  "supports_streaming": true,
  "supports_system_prompt": true
}
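One way a client might consume this response is sketched below. The interface mirrors the example above, but which fields can be null beyond those shown is an assumption, as is the rule that whichever of `max_model_len` and `context_size` is set gives the context limit. The helper names are hypothetical.

```typescript
interface ModelConstraints {
  served_model_name: string;
  engine_type: string;
  task: string;
  context_size: number | null;
  max_model_len: number | null; // nullability here is an assumption
  max_tokens_default: number | null;
  request_defaults: Record<string, unknown> | null;
  supports_streaming: boolean;
  supports_system_prompt: boolean;
}

// Assumption: whichever of max_model_len / context_size is set is the limit.
function contextLimit(c: ModelConstraints): number | null {
  return c.max_model_len ?? c.context_size;
}

// Drop system messages when the model does not support a system prompt.
function prepareMessages<T extends { role: string }>(
  c: ModelConstraints,
  messages: T[],
): T[] {
  return c.supports_system_prompt
    ? messages
    : messages.filter((m) => m.role !== "system");
}
```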

Technical Architecture¶

Frontend Components¶

frontend/src/
├── app/(admin)/chat/
│   └── page.tsx           # Main chat page
├── components/chat/
│   ├── index.ts           # Barrel exports
│   ├── ChatInput.tsx      # Message input with context display
│   ├── ChatSidebar.tsx    # Session list and management
│   ├── MessageList.tsx    # Conversation display
│   ├── MessageContent.tsx # Markdown/code rendering
│   ├── ModelSelector.tsx  # Running model dropdown
│   └── PerformanceMetrics.tsx # Real-time metrics display
├── hooks/
│   └── useChat.ts         # Chat state management hook
└── lib/
    ├── chat-client.ts     # Streaming client & model APIs
    └── chat-api.ts        # Session persistence APIs

Backend Routes¶

backend/src/routes/chat.py
├── GET  /v1/models/running           # List healthy models
├── GET  /v1/models/{name}/constraints # Get model limits
├── GET  /v1/chat/sessions            # List sessions
├── POST /v1/chat/sessions            # Create session
├── GET  /v1/chat/sessions/{id}       # Get session
├── POST /v1/chat/sessions/{id}/messages # Add message
├── DELETE /v1/chat/sessions/{id}     # Delete session
└── DELETE /v1/chat/sessions          # Clear all sessions

Streaming Implementation¶

The chat uses Server-Sent Events (SSE) via the standard OpenAI-compatible /v1/chat/completions endpoint:

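// Frontend streaming pattern (chat-client.ts)
// Sketch: yields each parsed SSE chunk until the stream ends or "[DONE]" arrives.
async function* streamChat(model, messages, options) {
  const response = await fetch('/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model,
      messages,
      stream: true,
      stream_options: { include_usage: true },
    }),
  });
  if (!response.ok || !response.body) throw new Error(`Chat request failed: HTTP ${response.status}`);

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) return;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep any partial line for the next read
    for (const line of lines) {
      const data = line.startsWith('data:') ? line.slice(5).trim() : '';
      if (data === '[DONE]') return;
      if (data) yield JSON.parse(data);
    }
  }
}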
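Once the chunks are parsed, they still need to be folded into the displayed message. The sketch below assumes the OpenAI-compatible chunk shape, where text arrives in `choices[0].delta.content` and, with `include_usage` enabled, a final chunk carries `usage` alongside an empty `choices` array. `StreamResult` and `applyChunk` are hypothetical names.

```typescript
// Subset of the OpenAI-compatible streaming chunk shape used here.
interface StreamChunk {
  choices: { delta?: { content?: string | null } }[];
  usage?: {
    prompt_tokens: number;
    completion_tokens: number;
    total_tokens: number;
  } | null;
}

interface StreamResult {
  text: string;
  usage: StreamChunk["usage"];
}

// Fold one chunk into the running result without mutating the input.
function applyChunk(result: StreamResult, chunk: StreamChunk): StreamResult {
  // The usage-only chunk has no choices, so the delta defaults to "".
  const delta = chunk.choices[0]?.delta?.content ?? "";
  return {
    text: result.text + delta,
    usage: chunk.usage ?? result.usage,
  };
}
```

Keeping `applyChunk` pure makes it easy to use as a reducer in a state hook and to test without a live stream.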

Usage Logging¶

Chat interactions are logged to the Usage database, similar to API requests:

User and organization IDs
Prompt and completion token counts
Latency measurements
Together, these give admins visibility into internal Cortex usage

Access usage data via:

- Admin UI → Usage page
- API: GET /admin/usage

Health Check Integration¶

The Chat Playground relies on the backend health checking system:

Health Poller (health.py): Background task polls model endpoints every HEALTH_POLL_SEC (default: 15s)
Health State: Results cached in HEALTH_STATE dictionary
Health TTL: Data valid for HEALTH_CHECK_TTL_SEC (default: 20s)
Model Selector: Fetches /v1/models/running, which filters by health status
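The interaction between the poll interval and the TTL can be illustrated with a small freshness check. This is a TypeScript sketch of the logic only, not the code in `health.py`; the entry shape, function names, and the inclusive comparison are assumptions.

```typescript
interface HealthEntry {
  healthy: boolean;
  checkedAtSec: number; // time of the last poll, in seconds
}

// A cached result counts only while it is no older than the TTL.
function isFresh(entry: HealthEntry, nowSec: number, ttlSec: number): boolean {
  return nowSec - entry.checkedAtSec <= ttlSec;
}

// Names of models that are healthy and whose health data is still fresh.
function runningModels(
  state: Record<string, HealthEntry>,
  nowSec: number,
  ttlSec = 20,
): string[] {
  return Object.keys(state).filter((name) => {
    const entry = state[name];
    return entry !== undefined && entry.healthy && isFresh(entry, nowSec, ttlSec);
  });
}
```

With a 15 s poll interval, a result can be up to 15 s old just before the next poll. A 20 s TTL still accepts it, whereas a TTL shorter than the poll interval would make healthy models drop out of the list between polls.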

No Models Showing?

If the Chat Playground shows "No models running" but models are healthy:

Wait 15-30 seconds for the health check to refresh
Check that HEALTH_CHECK_TTL_SEC is greater than HEALTH_POLL_SEC in the config
Verify model is actually running: make status
Check gateway logs: make logs-gateway

Troubleshooting¶

"No models running" when models are healthy¶

Cause: Health check timing mismatch or stale data

Solution:

1. Ensure HEALTH_CHECK_TTL_SEC (20s) > HEALTH_POLL_SEC (15s)
2. Wait for the next health check cycle
3. Refresh the page

Chat not persisting¶

Cause: Backend routes not deployed

Solution:

# Rebuild and restart the gateway
make down
make quick-start

Model selector shows wrong models¶

Cause: Model registry out of sync with health state