Scaling & Reliability

Pools and selection

  • Configure the generation and embeddings pools through environment variables.
  • choose_url prefers endpoints marked healthy within the health TTL and round-robins among pool members.
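
The selection rule above can be sketched roughly as follows. This is a minimal illustration, not the gateway's actual code: the Pool class, the mark_healthy helper, the health_ttl default, and the fallback to the full pool when no member is fresh are all assumptions made for the example. Only the name choose_url comes from this page.

```python
import time
from typing import Dict, List, Optional


class Pool:
    """Hypothetical endpoint pool; names and defaults are illustrative only."""

    def __init__(self, urls: List[str], health_ttl: float = 30.0) -> None:
        if not urls:
            raise ValueError("pool needs at least one URL")
        self.urls = list(urls)
        self.health_ttl = health_ttl
        # url -> monotonic timestamp of the last successful health check
        self._last_ok: Dict[str, float] = {}
        self._cursor = 0

    def mark_healthy(self, url: str, now: Optional[float] = None) -> None:
        """Record a successful health check for one member."""
        self._last_ok[url] = time.monotonic() if now is None else now

    def choose_url(self, now: Optional[float] = None) -> str:
        """Round-robin over members whose health record is within the TTL."""
        now = time.monotonic() if now is None else now
        fresh = [
            u for u in self.urls
            if now - self._last_ok.get(u, float("-inf")) <= self.health_ttl
        ]
        # Assumption: if nothing is fresh, fall back to the whole pool
        # rather than failing the request outright.
        candidates = fresh or self.urls
        url = candidates[self._cursor % len(candidates)]
        self._cursor += 1
        return url
```

Passing now explicitly keeps the sketch deterministic in tests; in normal use the monotonic clock is read on each call.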

Circuit breaker

  • Disabled by default. When enabled, the breaker opens after N failures and waits out a cooldown period before retrying.
  • The health poller resets the breaker on a successful check.
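
One way to picture the breaker behavior described above is the sketch below. The class name, parameter names, and default values are hypothetical, and whether the real breaker counts consecutive or total failures is not stated here; this version simply counts failures since the last success.

```python
import time
from typing import Optional


class CircuitBreaker:
    """Illustrative breaker; disabled unless enabled=True is passed."""

    def __init__(self, enabled: bool = False, failure_threshold: int = 5,
                 cooldown: float = 30.0) -> None:
        self.enabled = enabled
        self.failure_threshold = failure_threshold  # the "N" in the text
        self.cooldown = cooldown                    # seconds to stay open
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True if a request may be sent to the endpoint."""
        if not self.enabled or self._opened_at is None:
            return True
        now = time.monotonic() if now is None else now
        # Once the cooldown has elapsed, let a trial request through.
        return now - self._opened_at >= self.cooldown

    def record_failure(self, now: Optional[float] = None) -> None:
        if not self.enabled:
            return
        now = time.monotonic() if now is None else now
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = now  # open, or re-open after a failed trial

    def record_success(self) -> None:
        """What a health poller would call after a successful check."""
        self._failures = 0
        self._opened_at = None
```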

Retries and timeouts

  • Non-streaming requests retry transient httpx errors with a short backoff.
  • Streaming requests are not retried, by design; their concurrency is capped.
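
A retry wrapper for non-streaming calls might look like the sketch below. The helper name call_with_retries, the attempt count, and the exponential backoff with jitter are assumptions; the page only says that transient errors are retried with a small backoff. The helper is kept library-agnostic so that it runs without httpx installed.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying only the listed exception types."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Small exponential backoff plus jitter (illustrative values).
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

With httpx, one would presumably wrap the request in a zero-argument callable and pass retry_on=(httpx.TransportError,), which covers connection and timeout errors. A streamed response should not go through this helper, since part of the body may already have been delivered to the client when the failure occurs.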

Rate limiting and concurrency

  • Enable rate limiting with Redis; configure the RPS, burst, and sliding-window settings.
  • Limit concurrent streaming requests per identifier to protect the engines.
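
To make the RPS, burst, and per-identifier cap concrete, here is an in-memory sketch. It is a stand-in only: the gateway keeps this state in Redis so that replicas share it, and it uses a sliding window, whereas this example uses a plain token bucket for brevity. TokenBucket, StreamLimiter, and their parameters are hypothetical names.

```python
import time
from collections import defaultdict
from typing import DefaultDict, Optional


class TokenBucket:
    """Allows `rps` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rps: float, burst: int) -> None:
        self.rps = rps
        self.burst = burst
        self._tokens = float(burst)
        self._updated: Optional[float] = None

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if self._updated is not None:
            # Refill in proportion to elapsed time, capped at the burst size.
            refill = (now - self._updated) * self.rps
            self._tokens = min(float(self.burst), self._tokens + refill)
        self._updated = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


class StreamLimiter:
    """Caps concurrent streaming requests per identifier (e.g. an API key)."""

    def __init__(self, max_streams: int) -> None:
        self.max_streams = max_streams
        self._active: DefaultDict[str, int] = defaultdict(int)

    def try_acquire(self, identifier: str) -> bool:
        if self._active[identifier] >= self.max_streams:
            return False
        self._active[identifier] += 1
        return True

    def release(self, identifier: str) -> None:
        # Call when the stream ends, including on error or client disconnect.
        if self._active[identifier] > 0:
            self._active[identifier] -= 1
```

Neither class is thread-safe or shared across processes; a multi-replica deployment would need the counters in a shared store, which is the role Redis plays here.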

Horizontal scaling

  • Scale out gateway replicas; make sure they all share the same Redis and Postgres.
  • Scale out vLLM engines; add them to the pools or manage them via the model registry.

Observability

  • Track latencies, time to first token (TTFT), error rates, and selection distribution.
  • Alert on health TTL expiration and on how long breakers stay open.