feat(services): per-endpoint Services page (server_ip:port → models + perf) #25
Open
vaderyang wants to merge 8 commits into
… perf)

New "Services" page that aggregates `llm_calls` by the actual serving endpoint `(server_ip, server_port)`, answering "what's 172.16.103.81:9000 serving, and how is it performing?".

Why not reuse `llm_metrics`? Its pre-aggregated grouping sets stop at `server_ip` and don't carry `server_port`, so two vLLM instances on the same host (port 8000 / 9000) would collapse into one row.

## Backend

- `ts_storage::query::ServiceRow` + `ServicesQuery` (one row per endpoint with distinct models, wire APIs, call/error counts, TTFT/E2E avg + p95, total tokens, first/last seen).
- `StorageBackend::query_services` trait method + DuckDB impl. The query is `GROUP BY (server_ip, server_port)` on `llm_calls`; models / wire_apis come back as `list_distinct(array_agg(...))`, bridged to Rust as JSON strings (DuckDB rust bindings have no `FromSql for Vec<String>`).
- `GET /api/services?start=&end=&sort_by=&sort_order=&limit=` serves it. The `sort_by` whitelist matches the table column names.

## Console

- Sidebar adds "Services" between "Models" and "Agent Sessions" with a `Server` icon.
- `ServicesPage` table: Endpoint • Models (chips) • Wire APIs • Calls (+stream %) • Error % • TTFT avg/p95 • E2E avg/p95 • In/Out tokens • Last seen (relative). Headers click-to-sort in place, with no refetch on resort.
- `useServices` hook follows the same `placeholderData: prev` pattern as every other list hook (no flash on refresh).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
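The grouping described above can be sketched in plain Rust to show why the port has to be part of the key. This is an in-memory illustration only, not the DuckDB query: `Call`, `EndpointSummary`, `group_by_endpoint`, and the model names are hypothetical stand-ins for a small subset of what `ServiceRow` carries.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, trimmed-down view of one `llm_calls` row.
#[derive(Debug, Clone)]
pub struct Call {
    pub server_ip: String,
    pub server_port: u16,
    pub model: String,
    pub is_error: bool,
}

/// Hypothetical per-endpoint summary (a small subset of `ServiceRow`).
#[derive(Debug, Default, PartialEq)]
pub struct EndpointSummary {
    pub models: BTreeSet<String>,
    pub calls: u64,
    pub errors: u64,
}

/// Group calls by (server_ip, server_port), mirroring the SQL GROUP BY.
pub fn group_by_endpoint(calls: &[Call]) -> BTreeMap<(String, u16), EndpointSummary> {
    let mut out: BTreeMap<(String, u16), EndpointSummary> = BTreeMap::new();
    for c in calls {
        let row = out.entry((c.server_ip.clone(), c.server_port)).or_default();
        row.models.insert(c.model.clone());
        row.calls += 1;
        if c.is_error {
            row.errors += 1;
        }
    }
    out
}

/// Small constructor to keep the sample data readable.
fn call(ip: &str, port: u16, model: &str, is_error: bool) -> Call {
    Call {
        server_ip: ip.to_string(),
        server_port: port,
        model: model.to_string(),
        is_error,
    }
}

/// Illustrative data: two instances on one host, on different ports.
pub fn sample_calls() -> Vec<Call> {
    vec![
        call("172.16.103.81", 8000, "model-a", false),
        call("172.16.103.81", 9000, "model-b", false),
        call("172.16.103.81", 9000, "model-b", true),
    ]
}
```

With the sample data, keying on `(ip, port)` yields two rows; keying on the IP alone would merge them into one, which is the collapse the commit message describes.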
…p/litellm)

Adds an App column to the Services page that classifies each endpoint into one of a fixed enum from cheap wire-traffic signals.

## Signals used (highest-confidence first)

| App | Signal |
|---|---|
| `ollama` | path `/api/chat` / `/api/generate` / `/api/tags` |
| `llamacpp` | path `/completion` / `/tokenize` / `/props` (root-level) |
| `litellm` | response header `x-litellm-*` OR `Server: litellm` |
| `openai` | request `Host: api.openai.com` |
| `anthropic` | request `Host: api.anthropic.com` |
| `gemini` | request `Host: generativelanguage.googleapis.com` |
| `openai-compat` | `Server: uvicorn` (vLLM and SGLang both; a body-sample follow-up will disambiguate) |
| `litellm` | tiebreaker: an `openai-compat` endpoint serving ≥ 3 distinct models (real signal from wuneng's 127.0.0.1:4000) |
| (none) | nothing matches; UI shows a muted "unknown" badge |

## Implementation

- `ts-storage-duckdb/src/apps.rs`: pure-function classifier with 12 unit tests covering each rule + edge cases (Ollama compat mode serving `/v1/chat/completions`, multi-model uvicorn tiebreaker, path-wins-over-uvicorn precedence, header-absent fallback).
- SQL aggregate now also pulls `arg_min(response_headers, LENGTH(...))` and the matching request_headers as a per-group sample, plus `list_distinct(array_agg(request_path))[1:16]`. `arg_min` picks the shortest non-null blob deterministically, small enough that streaming it to Rust costs nothing.
- New fields on `ServiceRow`: `app`, `server_header`, `request_paths`.
- Console renders a colored `AppBadge` per row with a `title=Server:` tooltip so the user can sanity-check the label.

## What ships vs. follow-up

vLLM and SGLang both run under uvicorn and don't have a distinctive custom header. Today they both label as `openai-compat`. A follow-up will pull one small response body per group and look for `chatcmpl-tool-<hex>` (vLLM's tool_call_id pattern, observed in production) vs. SGLang's distinct response shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
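The signal table above can be read as an ordered chain of checks. The following is a minimal sketch of that precedence, not the contents of `apps.rs`: the `Signals` struct, its field names, and the `classify` signature are assumptions made for illustration, and the Ollama compat-mode edge case mentioned in the test list is not modelled.

```rust
/// Hypothetical bundle of per-endpoint signals; field names are assumptions.
pub struct Signals<'a> {
    pub request_paths: &'a [&'a str],
    pub response_header_names: &'a [&'a str],
    pub server_header: Option<&'a str>,
    pub request_host: Option<&'a str>,
    pub distinct_models: usize,
}

/// True if any observed path equals one of the candidate paths.
fn any_path(paths: &[&str], candidates: &[&str]) -> bool {
    paths.iter().any(|p| candidates.iter().any(|c| c == p))
}

/// Returns a label from the table, or `None` for the "unknown" badge.
pub fn classify(s: &Signals) -> Option<&'static str> {
    // Path signals come first, so they win over a generic uvicorn header.
    if any_path(s.request_paths, &["/api/chat", "/api/generate", "/api/tags"]) {
        return Some("ollama");
    }
    if any_path(s.request_paths, &["/completion", "/tokenize", "/props"]) {
        return Some("llamacpp");
    }

    // Header signals, compared case-insensitively.
    let server = s.server_header.unwrap_or("").to_ascii_lowercase();
    let has_litellm_header = s
        .response_header_names
        .iter()
        .any(|h| h.to_ascii_lowercase().starts_with("x-litellm-"));
    if has_litellm_header || server.contains("litellm") {
        return Some("litellm");
    }

    // Hosted providers are identified by the request Host header.
    match s.request_host {
        Some("api.openai.com") => return Some("openai"),
        Some("api.anthropic.com") => return Some("anthropic"),
        Some("generativelanguage.googleapis.com") => return Some("gemini"),
        _ => {}
    }

    // Uvicorn: three or more distinct models tips the label to litellm.
    if server.contains("uvicorn") {
        return Some(if s.distinct_models >= 3 { "litellm" } else { "openai-compat" });
    }
    None
}
```

Keeping the classifier a pure function over already-aggregated signals is what makes it cheap to unit-test rule by rule, as the commit message notes.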
The Services-page aggregate uses `arg_min(headers, LENGTH(headers))`
to pick one representative header sample per endpoint. Without a
shape filter it picks ANY shortest non-null value — including rows
where the response parser stashed an empty/corrupted string. That
fed `null` (or similar) to the classifier and dropped four real
endpoints (the GLM-5.1 cluster on port 9000) to `unknown` even
though every other call from those endpoints carries a clean
`Server: uvicorn` blob.
Restrict the sample to JSON arrays of at least 30 chars (`[%`
pattern). The shortest real header list captured in production is
~140 chars; 30 is a comfortable floor that excludes literal `null`,
`[]`, `{}`, and any other malformed short response without losing
genuine samples.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
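The shape filter described in this commit can be sketched as a predicate plus a shortest-match pick. This is an in-memory analogue of `arg_min(headers, LENGTH(headers))` with the filter applied, not the SQL itself; the function names are hypothetical, and the header encoding in the usage note (a JSON array of name/value pairs) is an assumption.

```rust
/// Floor taken from the commit message: shorter blobs are treated as malformed.
pub const MIN_SAMPLE_LEN: usize = 30;

/// A sample is plausible if it looks like a JSON array and is long enough.
/// Note: `len()` counts bytes here, while SQL LENGTH counts characters;
/// the two agree for ASCII header blobs.
pub fn is_plausible_header_sample(blob: &str) -> bool {
    blob.starts_with('[') && blob.len() >= MIN_SAMPLE_LEN
}

/// Pick the shortest plausible sample, skipping NULLs (`None`).
pub fn shortest_plausible_sample<'a>(blobs: &[Option<&'a str>]) -> Option<&'a str> {
    blobs
        .iter()
        .flatten()
        .copied()
        .filter(|b| is_plausible_header_sample(b))
        .min_by_key(|b| b.len())
}
```

Given `null`, `[]`, a NULL, and one real header list, the unfiltered `arg_min` would hand back `[]` (the shortest non-null value), while the filtered pick returns the real list.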
`arg_min(headers, LENGTH(headers))` was still returning NULL for
endpoints with mixed-header data (e.g. SSE/streaming calls where the
parser captured something the LIKE filter doesn't catch).
Switch to `MAX(response_headers)` — lexicographic on a column whose
values all start with `[[` makes it a stable arbitrary pick AND it
doesn't have arg_min's failure mode of picking anomalously short
malformed values. Filter to `[%` to guarantee the picked sample is
shaped like a JSON array (drops literal "null", "{}", etc.).
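As a rough in-memory analogue of the `MAX(response_headers)` pick with the `[%` filter, the sketch below uses Rust's byte-wise string ordering, which is assumed here to stand in for DuckDB's default VARCHAR comparison. The function name is hypothetical.

```rust
/// Pick the lexicographically largest array-shaped blob, skipping NULLs.
/// The `starts_with('[')` check plays the role of the `[%` LIKE filter.
pub fn max_array_shaped_sample<'a>(blobs: &[Option<&'a str>]) -> Option<&'a str> {
    blobs
        .iter()
        .flatten()
        .copied()
        .filter(|b| b.starts_with('['))
        .max()
}
```

The filter matters for `MAX` in particular: in ASCII, `n` and `{` both sort above `[`, so an unfiltered maximum would prefer a literal `null` or `{}` over any real header list.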
Per the user's ask: every endpoint must land on a concrete label. Replace the `openai-compat` placeholder by stacking up cheap signals already present in `llm_calls`:

**New SQL aggregates** (alongside the existing header / paths sample):

- `list_distinct(array_agg(finish_reason))[1:32]`: distinct finish_reasons in the window
- `arg_max(request_body, LENGTH(request_body))`: largest captured request body (deepest agentic history; only materialises once, length comparison is u64-cheap)
- `arg_max(response_body, LENGTH(response_body))`: largest captured response body (capped at 8 KB so streamed/oversized rows don't bloat the read)

**New classifier signals** (in order, highest confidence first):

1. SGLang-specific paths (`/generate`, `/health_generate`, `/get_server_info`, `/flush_cache`, `/encode`, profile endpoints).
2. vLLM-specific paths (`/version`, `/v1/score`).
3. SGLang-exclusive finish_reasons (`matched_stop`, `matched_eos`, `stop_str`). This works even when responses are SSE-streamed, since finish_reason is captured from the final SSE event regardless.
4. Response body fingerprint:
   - `"id":"chatcmpl-tool-…"` (vLLM's tool_call_id format)
   - `"system_fingerprint":"fp_…"` (vLLM only; SGLang leaves it null)
5. Request body fingerprint: `chatcmpl-tool-` substring. Agentic replays carry assistant.tool_calls history back to the server, and the previous round's tool_call_id reveals vLLM.
6. Uvicorn fallback:
   - ≥3 models → LiteLLM (multi-model tiebreaker, real wuneng signal)
   - Model starts with `glm` / `deepseek` → SGLang (reference deployment)
   - Otherwise → vLLM (more common)

Console: drop the `openai-compat` badge color since the label is no longer emitted by the classifier.

22 classifier tests (was 12) covering every new rule + the beats-the-heuristic precedence cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
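The six-step ordering for uvicorn-served endpoints could be sketched as follows. This is an illustration under assumptions, not the classifier that ships: the `Evidence` struct, the lowercase labels `"sglang"` and `"vllm"`, the case-insensitive model prefix match, and treating "any model matches" as sufficient are all choices made here. The profile endpoints from step 1 are omitted because their paths are not listed.

```rust
/// Hypothetical per-endpoint evidence; field names are assumptions.
pub struct Evidence<'a> {
    pub request_paths: &'a [&'a str],
    pub finish_reasons: &'a [&'a str],
    pub response_body: Option<&'a str>,
    pub request_body: Option<&'a str>,
    pub models: &'a [&'a str],
}

/// True if any observed value equals one of the wanted values.
fn any_of(seen: &[&str], wanted: &[&str]) -> bool {
    seen.iter().any(|s| wanted.iter().any(|w| w == s))
}

/// Applies the signals in the listed order; the first match wins.
pub fn classify_uvicorn(e: &Evidence) -> &'static str {
    // 1. SGLang-specific paths.
    let sglang_paths = ["/generate", "/health_generate", "/get_server_info", "/flush_cache", "/encode"];
    if any_of(e.request_paths, &sglang_paths) {
        return "sglang";
    }
    // 2. vLLM-specific paths.
    if any_of(e.request_paths, &["/version", "/v1/score"]) {
        return "vllm";
    }
    // 3. SGLang-exclusive finish_reasons.
    if any_of(e.finish_reasons, &["matched_stop", "matched_eos", "stop_str"]) {
        return "sglang";
    }
    // 4. Response body fingerprints that point at vLLM.
    if let Some(body) = e.response_body {
        if body.contains("\"id\":\"chatcmpl-tool-") || body.contains("\"system_fingerprint\":\"fp_") {
            return "vllm";
        }
    }
    // 5. Request body fingerprint: replayed tool_call_id from an earlier round.
    if e.request_body.map_or(false, |b| b.contains("chatcmpl-tool-")) {
        return "vllm";
    }
    // 6. Fallback heuristics.
    if e.models.len() >= 3 {
        return "litellm";
    }
    let looks_sglang = e.models.iter().any(|m| {
        let m = m.to_ascii_lowercase();
        m.starts_with("glm") || m.starts_with("deepseek")
    });
    if looks_sglang {
        "sglang"
    } else {
        "vllm"
    }
}
```

Because the fallback sits last, a hard fingerprint such as a replayed `chatcmpl-tool-` id overrides the model-name heuristic, which is the "beats-the-heuristic" precedence the tests are said to cover.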
The inline "(34% str)" annotation glued onto the Calls cell read as noise — operators scanning the Calls column want a clean number. Moving streaming share to its own sortable column keeps Calls pure and lets users rank endpoints by streaming-vs-non-streaming mix when triaging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 20, 2026
## Summary

New "Services" page in the console that answers "what's `172.16.103.81:9000` serving, and how is it performing?". Aggregates `llm_calls` by `(server_ip, server_port)`: one row per LLM serving endpoint with distinct models, wire APIs, error/throughput, TTFT/E2E percentiles, first/last seen.

## Why direct-on-`llm_calls` (not `llm_metrics`)

The pre-aggregated `llm_metrics` table's grouping sets stop at `server_ip`, so two vLLM instances on the same host (port 8000 / port 9000) would collapse into one row. For a service view ("is the GLM-5 endpoint healthy?") you need port. Scanning `llm_calls` is fine in practice: a 7-day window in real production data has tens of thousands of rows and the query completes well under a second.

## Backend

- `ts_storage::query::ServiceRow` + `ServicesQuery`: one row per endpoint with distinct models, wire APIs, call/error counts, TTFT/E2E avg + p95, total tokens, first/last seen.
- `StorageBackend::query_services` trait method + DuckDB impl. `list_distinct(array_agg(model))[1:32]` collects distinct models with a sanity cap; LIST-of-VARCHAR comes back as JSON strings (DuckDB rust bindings have no `FromSql for Vec<String>`) and gets parsed via the same `parse_json_string_list` helper that `agent_turns.models_used` uses.
- `GET /api/services?start=&end=&sort_by=&sort_order=&limit=` serves it.

## Console

- … (`Server` icon).
- `ServicesPage` table: … (`ip:port` monospace) … (`+N more` hover-revealed) …
- `useServices` hook follows the `placeholderData: prev` pattern, so there is no flash on refresh.

## Test plan

- `cargo build --workspace`: clean
- `cargo test -p ts-storage-duckdb --lib`: 65 pass
- `bun test`: 111 pass
- `bun run build`: clean

E2E validation on wuneng coming in a follow-up reply.
🤖 Generated with Claude Code