How I Instrumented vLLM on Kubernetes: The Dashboards, Queries, and SLOs

A practical observability setup for LLM inference on KServe — and the one-line misconfiguration it caught.

LLM serving breaks the assumptions behind ordinary service dashboards. A single "request latency" number is nearly meaningless when one request streams seven hundred tokens over ten seconds: the user's experience splits into time to first token (did it feel responsive?) and time per output token (did it stream smoothly?), and the engine's health splits into queue admission, KV cache pressure, and batch scheduling — none of which standard HTTP metrics see. This post walks through the observability stack I run for vLLM on Kubernetes: the exporters, the PromQL, the dashboard layout, and the SLOs. At the end, the proof that it earns its keep: a real incident where these dashboards first misled me, then caught a one-line misconfiguration that was costing 35x on first-token latency.

1. The Stack

The environment is a single-GPU EKS worker node (an AWS EC2 G6e instance), with KServe managing the model server. The model is Gemma 4 26B quantized to NVFP4, served by vLLM 0.20.1 as a KServe InferenceService.

Figure 1: Platform context around the vLLM/KServe observability setup. The benchmark path focuses on the Gemma predictor, vLLM metrics, DCGM GPU metrics, Prometheus, and Grafana.

For the benchmark path, two metric sources feed Prometheus:

vLLM's own /metrics endpoint — the engine-level truth: latencies, queue state, KV cache, scheduler counters. KServe makes scraping declarative; the ServingRuntime just annotates itself:

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8080"

The NVIDIA DCGM exporter — the hardware view: GPU utilization, VRAM, power, temperature. Useful, and — as you'll see in Section 5 — dangerously easy to over-trust.

Load for the numbers in this post comes from a small traffic generator simulating three concurrent chat sessions, each sending a prompt, waiting for the complete ~700-token response, and immediately sending its next turn. All manifests, the dashboard JSON, and the generator are in the repo: https://github.com/squadrhino/vllm-kserve-observability.
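For readers who want the shape of that load without opening the repo, here is a minimal closed-loop sketch in Python. It is not the generator from the repo: `run_load`, `chat_session`, and the injected `send` coroutine are hypothetical names, and `send` stands in for whatever client call returns once the full response has arrived.

```python
import asyncio
import time
from typing import Awaitable, Callable

# Hypothetical signature: send(session_id, turn) completes only after the
# whole response has been received.
SendFn = Callable[[int, int], Awaitable[None]]


async def chat_session(session_id: int, turns: int, send: SendFn,
                       latencies: list[float]) -> None:
    """One chat session: send a turn, wait for the full reply, send the next."""
    for turn in range(turns):
        started = time.perf_counter()
        await send(session_id, turn)
        latencies.append(time.perf_counter() - started)


async def run_load(send: SendFn, sessions: int = 3, turns: int = 5) -> list[float]:
    """Run `sessions` closed-loop sessions concurrently; return per-request latency."""
    latencies: list[float] = []
    await asyncio.gather(
        *(chat_session(i, turns, send, latencies) for i in range(sessions))
    )
    return latencies


async def fake_send(session_id: int, turn: int) -> None:
    """Stand-in for a real client call; sleeps instead of calling a model."""
    await asyncio.sleep(0.01)


if __name__ == "__main__":
    results = asyncio.run(run_load(fake_send, sessions=3, turns=2))
    print(f"{len(results)} requests completed")
```

Because each session waits for its own response before sending again, in-flight requests never exceed the session count, which keeps the queueing in Section 5 easy to reason about.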

2. The Metrics That Matter

Four groups, in the order I check them:

User-experience latency. vllm:time_to_first_token_seconds (TTFT — responsiveness), vllm:time_per_output_token_seconds (TPOT/ITL — streaming smoothness), vllm:e2e_request_latency_seconds (the whole request). These are histograms; always read them as percentiles, never averages.

Admission. vllm:request_queue_time_seconds and the gauges vllm:num_requests_running / vllm:num_requests_waiting. This group answers the single most diagnostic question in LLM serving: is the engine slow, or is it full? High TTFT with high queue time is an admission problem; high TTFT with empty queues is a prefill problem. They have different fixes.

Engine capacity. vllm:gpu_cache_usage_perc (KV cache occupancy) and vllm:num_preemptions_total (sequences evicted under memory pressure). Preemptions are the metric that actually fires when KV cache runs out — if this counter is flat at zero, memory is not your problem, whatever other panels imply.

Throughput and hardware. rate(vllm:generation_tokens_total[5m]) and rate(vllm:prompt_tokens_total[5m]) for delivered work; vllm:request_success_total by finished_reason to see whether completions end naturally (stop) or hit the max_tokens cap (length); DCGM's DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_POWER_USAGE for the physical card.
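The "slow or full?" question from the admission group can be written down as a tiny decision rule. This is only a sketch: `diagnose_ttft` is a hypothetical helper, the 2.5 s default mirrors the TTFT target in Section 4, and the `queue_share` cutoff is an assumption chosen for illustration, not a vLLM setting.

```python
def diagnose_ttft(ttft_p95_s: float, queue_p95_s: float, waiting: int,
                  ttft_target_s: float = 2.5, queue_share: float = 0.5) -> str:
    """Classify a TTFT reading as 'ok', 'admission' (full) or 'prefill' (slow)."""
    if ttft_p95_s <= ttft_target_s:
        return "ok"
    # Requests parked in the waiting gauge, or queue time dominating TTFT,
    # mean the scheduler is not admitting work fast enough.
    if waiting > 0 or queue_p95_s >= queue_share * ttft_p95_s:
        return "admission"
    # A slow first token with an empty queue points at prefill itself.
    return "prefill"


if __name__ == "__main__":
    # Readings from the incident in Section 5.
    print(diagnose_ttft(7.25, 9.51, waiting=1))  # admission
```

The value of spelling it out is the order of the checks: queue state is consulted before anything about compute or memory.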

3. The Dashboards

I run two Grafana dashboards built from those groups. (The PromQL below uses vLLM's standard metric names — diff it against the exported JSON in the repo for the exact panel definitions.)

The SLO overview is a single stat row designed for a five-second read: request rate, TTFT p95, TPOT p95, E2E p95, output tokens/s, KV cache %, GPU utilization, GPU memory. The workhorse query shape, here for TTFT p95:

histogram_quantile(0.95,
  sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))
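It helps to know what histogram_quantile actually computes, because the answer is an interpolation inside a bucket, not an observed latency. The Python sketch below is a simplified reimplementation of that linear interpolation (it skips some of Prometheus's edge-case handling), and the bucket counts are made-up illustration data, not measurements from this post.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket, as Prometheus histograms do.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0 or not math.isinf(buckets[-1][0]):
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies past the last finite bound: report that bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical cumulative counts: 90 of 100 requests finished within 2.5 s.
    sample = [(1.0, 50.0), (2.5, 90.0), (5.0, 100.0), (math.inf, 100.0)]
    print(histogram_quantile(0.95, sample))
```

With this sample the estimate lands halfway through the 2.5 s to 5 s bucket, wherever the real requests fell inside it. Percentile panels are only as precise as the bucket edges around them, which is one reason two p95 panels built from different histograms need not line up exactly.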

The capacity dashboard goes one level deeper: a latency phase breakdown (queue / prefill / decode stacked, so you can see where each second of E2E lives), running-vs-waiting requests plotted against the engine's configured limit, KV cache utilization, preemptions/sec, and prefix cache hit rate.

Two design rules I now treat as non-negotiable. First, always plot a utilization metric next to its limit — running requests against max-num-seqs, KV usage against cache size — because a number without its ceiling invites misreading. Second, derive the limit from the live engine, not from a constant typed into the dashboard. Section 5 shows exactly how the second rule earned its place.

4. The SLOs

Set targets before incidents, or every number looks negotiable during one. Mine, for interactive chat on this hardware: TTFT p95 ≤ 2.5 s (past a couple of seconds, users re-send), TPOT p95 ≤ 50 ms (faster than reading speed, with margin), zero sustained queue wait at nominal concurrency, and preemptions ≈ 0 in steady state. The TTFT SLO renders as an attainment panel: the percentage of requests under threshold, computed as the ratio of the le="2.5" bucket to the total count:

sum(rate(vllm:time_to_first_token_seconds_bucket{le="2.5"}[5m]))
/
sum(rate(vllm:time_to_first_token_seconds_count[5m]))
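The same ratio in Python, as a sketch over hypothetical bucket counts. One detail it makes explicit: the threshold has to be an actual bucket boundary of the histogram, because an `le` matcher selects a series by its exact label and does not interpolate. `slo_attainment` is an illustrative helper, not part of any library.

```python
import math


def slo_attainment(cumulative: dict[float, float], threshold_s: float) -> float:
    """Fraction of requests at or under `threshold_s`.

    `cumulative` maps each bucket upper bound (including math.inf) to its
    cumulative count over the window; the +Inf bucket equals the total count.
    """
    if threshold_s not in cumulative:
        raise ValueError(f"{threshold_s} is not a bucket boundary")
    total = cumulative[math.inf]
    if total == 0:
        return math.nan  # no traffic in the window
    return cumulative[threshold_s] / total


if __name__ == "__main__":
    # Hypothetical counts, not data from this post.
    window = {1.0: 60.0, 2.5: 90.0, 5.0: 98.0, math.inf: 100.0}
    print(f"{slo_attainment(window, 2.5):.0%} of requests met the 2.5 s target")
```

If the target ever moves to a value that is not a bucket edge, the PromQL version returns no data rather than an approximation, so it is worth checking the histogram's buckets before changing the SLO.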

A note on what I deliberately don't SLO: GPU utilization. It's a diagnostic, not a target — and the incident below is why.

5. What This Setup Caught

Here is the stack paying for itself — though not before testing me.

Under the three-session load, the dashboards showed a contradiction: TTFT p95 at 7.25 seconds (worst requests waited sixteen), while TPOT sat at a healthy 9.589 ms. The engine was fast; nobody could reach it. Queue p95 read 9.51 seconds with a request permanently parked in the waiting gauge. Meanwhile GPU utilization glowed at 96% — and that big yellow number anchored me to the wrong theory. I assumed compute saturation; my first analysis even blamed KV cache pressure. My own capacity panels had already refuted both: KV cache at 2%, preemptions flat at zero. Slow admission plus fast execution is a scheduler problem, and the dashboards had been saying so from the first scrape. I read them in the wrong order.

The cause was one line in the InferenceService, a leftover from a previous deployment: --max-num-seqs=2. Months earlier this cluster ran a 30B model that OOM-crashed under load, and capping the scheduler at two concurrent sequences stopped the crashes. When I swapped to the smaller 26B NVFP4 model, the flag silently survived — scheduler tuning lives in the manifest, not the model, so nothing forced a review. Three chat sessions against two batch slots meant the third request waited a full generation cycle (~10 s) before prefill could even start. And vLLM had been printing the refutation at every boot:

GPU KV cache size: 71,439 tokens
Maximum concurrency for 8,192 tokens per request: 8.72x

Capacity for nearly nine worst-case sequences. Throttled to two. The verdict sat in the startup log for months while the dashboards glowed all day.
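The "third request waits a full generation cycle" arithmetic can be reproduced with a toy model. This is a deliberately idealized sketch: `simulate_queue_waits` is a hypothetical function, every generation is assumed to take a fixed 10 s regardless of batch size, and admission is strict first-in, first-out. Real decode time shifts with batch size, so treat the output as an illustration of the mechanism rather than a prediction.

```python
import heapq


def simulate_queue_waits(sessions: int = 3, slots: int = 2,
                         gen_seconds: float = 10.0, turns: int = 2) -> list[float]:
    """Closed-loop sessions against a fixed number of scheduler slots.

    Returns the queue wait of each request, in admission order.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    waiting = [(0.0, s) for s in range(sessions)]  # FIFO of (submit_time, session)
    running: list[tuple[float, int]] = []          # min-heap of (finish_time, session)
    remaining = {s: turns for s in range(sessions)}
    waits: list[float] = []
    now = 0.0
    while waiting or running:
        # Admit queued requests while a slot is free.
        while waiting and len(running) < slots:
            submitted, session = waiting.pop(0)
            waits.append(now - submitted)
            heapq.heappush(running, (now + gen_seconds, session))
        # Jump to the next completion; that session immediately sends its next turn.
        now, session = heapq.heappop(running)
        remaining[session] -= 1
        if remaining[session] > 0:
            waiting.append((now, session))
    return waits


if __name__ == "__main__":
    print(simulate_queue_waits(slots=2))   # [0.0, 0.0, 10.0, 0.0, 10.0, 0.0]
    print(simulate_queue_waits(slots=16))  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

With two slots, one request in every three waits out an entire generation before it is admitted; with enough slots for all sessions, the wait disappears.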

The fix was to set the flag to 16 and redeploy; same model, same traffic:

| Metric | Before (--max-num-seqs=2) | After (--max-num-seqs=16) | What it says |
| --- | --- | --- | --- |
| TTFT p95 | 7.25 s | 203.87 ms | ~35x; first token now interactive |
| Queue p95 | 9.51 s | 285 ms | Admission bottleneck gone |
| E2E p95 | 14.50 s | 9.75 s | Now ≈ pure decode (700 tok × ~14 ms) |
| TPOT p95 | 9.589 ms | 15.710 ms | The honest cost of a real batch |
| Output throughput | 227.8 tok/s | 308.4 tok/s | ~35% more delivered work |
| KV cache util | 2.00% | 3.18% | Memory was never the constraint |
| Preemptions | 0 | 0 | The original OOM fear, refuted by counter |
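The headline ratios in the right-hand column are plain arithmetic on the before and after values, reproduced here in Python so they can be checked. The helper names are illustrative; the inputs are the figures from the table.

```python
def speedup(before: float, after: float) -> float:
    """How many times smaller `after` is than `before`."""
    return before / after


def relative_gain(before: float, after: float) -> float:
    """Fractional increase from `before` to `after`."""
    return after / before - 1.0


if __name__ == "__main__":
    # TTFT p95: 7.25 s before, 203.87 ms after.
    print(f"TTFT p95 speedup: {speedup(7.25, 0.20387):.1f}x")           # 35.6x
    # Output throughput: 227.8 tok/s before, 308.4 tok/s after.
    print(f"Throughput gain:  {relative_gain(227.8, 308.4):.1%}")       # 35.4%
    # Decode-only estimate: ~700 tokens at ~14 ms each.
    print(f"Decode estimate:  {700 * 0.014:.1f} s vs 9.75 s E2E p95")   # 9.8 s
```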

One panel misbehaved in the after-run, and it's the best argument in this post for instrumentation discipline: Engine Saturation read 150% — an impossible number — because the panel divided live running requests (3) by a max_num_seqs reference (2) recorded from the dead deployment instead of queried from the live engine. A panel comparing live traffic against a dead configuration's limit is not observability; it is a memorial. That's where the "derive limits from the live engine" rule in Section 3 comes from — I learned it by violating it.
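One way to keep an impossible reading from passing silently is to treat any saturation above 100% as evidence that the reference limit is stale. The Python below is a sketch of that guard with hypothetical names (`engine_saturation`, `StaleLimitError`); it relies only on the fact that the scheduler cannot run more sequences than its configured limit.

```python
class StaleLimitError(ValueError):
    """Raised when live traffic exceeds the limit the panel believes in."""


def engine_saturation(running: int, max_num_seqs: int) -> float:
    """Return running / limit, refusing to report a ratio above 1."""
    if max_num_seqs <= 0:
        raise ValueError("max_num_seqs must be positive")
    ratio = running / max_num_seqs
    if ratio > 1.0:
        # The engine cannot exceed its own limit, so the limit must be wrong.
        raise StaleLimitError(
            f"{running} running vs limit {max_num_seqs}: reference limit is stale"
        )
    return ratio


if __name__ == "__main__":
    print(f"{engine_saturation(3, 16):.1%}")  # 18.8%
    try:
        engine_saturation(3, 2)  # the after-run panel's inputs
    except StaleLimitError as err:
        print(f"refusing to plot: {err}")
```

A guard like this does not replace reading the limit from the live engine, but it turns a silent 150% into a visible error.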

And GPU utilization? 96% in both runs, while delivered throughput rose 35%. Utilization tells you the GPU is busy. It does not tell you the GPU is the bottleneck. That gap is why it's a diagnostic on my boards and never an SLO.

6. What Changes at Production Scale

Everything above ran as a single-node EKS benchmark, but the same failure mode carries over to a fleet, where it is quieter and more expensive. In a multi-replica EKS deployment, a stale --max-num-seqs does not just cost one user sixteen seconds: it silently caps the throughput of every replica behind the autoscaler, and the system "fixes" the problem by scaling out, converting a one-line configuration error into a GPU bill. The startup-log check becomes an admission gate: CI should parse the engine's reported maximum concurrency and fail the deploy if the configured limit diverges from it beyond a defined tolerance. Scheduler flags should be treated as model-coupled: re-derived on every model change, not inherited through the manifest. Dashboard reference limits must be scraped from the running engine, never typed in at deploy time, because config drift between the fleet and the dashboards that watch it is exactly the failure that no one is paged for.
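A rough sketch of what such a gate could look like in Python. Assumptions are worth stating loudly: the regular expression matches the log line exactly as quoted in Section 5 and may need adjusting for other vLLM versions, `limit_is_plausible` and `reported_max_concurrency` are hypothetical names, and the 0.5x to 2x tolerance band is an arbitrary illustration of a "defined tolerance", not a recommendation.

```python
import re

# Matches the startup line quoted earlier, e.g.
# "Maximum concurrency for 8,192 tokens per request: 8.72x"
CONCURRENCY_RE = re.compile(
    r"Maximum concurrency for ([\d,]+) tokens per request: ([\d.]+)x"
)


def reported_max_concurrency(startup_log: str) -> float:
    """Extract the engine's reported worst-case concurrency from its log."""
    match = CONCURRENCY_RE.search(startup_log)
    if match is None:
        raise ValueError("no 'Maximum concurrency' line found in startup log")
    return float(match.group(2))


def limit_is_plausible(startup_log: str, configured_max_num_seqs: int,
                       min_ratio: float = 0.5, max_ratio: float = 2.0) -> bool:
    """True if configured / reported falls inside the (assumed) tolerance band."""
    reported = reported_max_concurrency(startup_log)
    ratio = configured_max_num_seqs / reported
    return min_ratio <= ratio <= max_ratio


if __name__ == "__main__":
    log = (
        "GPU KV cache size: 71,439 tokens\n"
        "Maximum concurrency for 8,192 tokens per request: 8.72x\n"
    )
    for configured in (2, 16):
        verdict = "pass" if limit_is_plausible(log, configured) else "FAIL"
        print(f"--max-num-seqs={configured}: {verdict}")
```

In a pipeline, a failing verdict would translate into a non-zero exit code so the deploy stops before the stale flag reaches a replica.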


Manifests, dashboard JSON, and the traffic generator: https://github.com/squadrhino/vllm-kserve-observability. If you run vLLM anywhere, grep your startup logs for "Maximum concurrency" — thirty seconds, and it may be the cheapest capacity review you ever do.