feat(services): per-endpoint Services page (server_ip:port → models + perf) by vaderyang · Pull Request #25 · Netis/TokenScope

vaderyang · 2026-05-20T03:25:06Z

Summary

New "Services" page in the console that answers "what's 172.16.103.81:9000 serving, and how is it performing?". Aggregates llm_calls by (server_ip, server_port) — one row per LLM serving endpoint with distinct models, wire APIs, error/throughput, TTFT/E2E percentiles, first/last seen.

Why direct-on-`llm_calls` (not `llm_metrics`)

The pre-aggregated llm_metrics table's grouping sets stop at server_ip — two vLLM instances on the same host (port 8000 / port 9000) would collapse into one row. For a service view ("is the GLM-5 endpoint healthy?") you need port. Scanning llm_calls is fine in practice: a 7-day window in real production data has tens of thousands of rows and the query completes well under a second.
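The collapse described above can be illustrated with a minimal in-memory sketch. The `Call` struct, the helper names, and the model names are hypothetical stand-ins, not the actual `llm_calls` schema or TokenScope code; the point is only how the grouping key changes the number of rows.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, trimmed-down stand-in for one `llm_calls` row.
struct Call {
    server_ip: &'static str,
    server_port: u16,
    model: &'static str,
}

/// Group by `server_ip` only, which is where the `llm_metrics` grouping sets stop.
fn models_by_ip(calls: &[Call]) -> BTreeMap<&'static str, BTreeSet<&'static str>> {
    let mut rows: BTreeMap<&'static str, BTreeSet<&'static str>> = BTreeMap::new();
    for c in calls {
        rows.entry(c.server_ip).or_default().insert(c.model);
    }
    rows
}

/// Group by `(server_ip, server_port)`, the key the Services page uses.
fn models_by_endpoint(calls: &[Call]) -> BTreeMap<(&'static str, u16), BTreeSet<&'static str>> {
    let mut rows: BTreeMap<(&'static str, u16), BTreeSet<&'static str>> = BTreeMap::new();
    for c in calls {
        rows.entry((c.server_ip, c.server_port)).or_default().insert(c.model);
    }
    rows
}

/// Two instances on one host; model names are placeholders.
fn sample_calls() -> Vec<Call> {
    vec![
        Call { server_ip: "172.16.103.81", server_port: 8000, model: "model-a" },
        Call { server_ip: "172.16.103.81", server_port: 9000, model: "model-b" },
        Call { server_ip: "172.16.103.81", server_port: 9000, model: "model-b" },
    ]
}
```

Grouping the sample by IP yields one row that mixes both models, while grouping by endpoint yields two rows with one model each.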

Backend

ts_storage::query::ServiceRow + ServicesQuery — one row per endpoint with distinct models, wire APIs, call/error counts, TTFT/E2E avg + p95, total tokens, first/last seen.
StorageBackend::query_services trait method + DuckDB impl. list_distinct(array_agg(model))[1:32] collects distinct models with a sanity cap; LIST-of-VARCHAR comes back as JSON strings (DuckDB rust bindings have no FromSql for Vec<String>) and gets parsed via the same parse_json_string_list helper that agent_turns.models_used uses.
GET /api/services?start=&end=&sort_by=&sort_order=&limit= serves it.
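The commit message for this change notes that `sort_by` is checked against a whitelist matching the table column names. A minimal sketch of that kind of check follows; the column names in `SORTABLE`, the default column, and the default direction are assumptions for illustration, not the actual handler.

```rust
/// Hypothetical whitelist; the real column names are not shown in this PR text.
const SORTABLE: &[&str] = &["calls", "error_rate", "ttft_p95_ms", "e2e_p95_ms", "last_seen"];

/// Build an ORDER BY clause from untrusted query parameters.
/// Anything not on the whitelist falls back to a default column, so
/// user input is never interpolated into the SQL text.
fn order_by_clause(sort_by: Option<&str>, sort_order: Option<&str>) -> String {
    let column = sort_by
        .filter(|c| SORTABLE.iter().any(|allowed| *allowed == *c))
        .unwrap_or("calls"); // assumed default
    let direction = match sort_order {
        Some(o) if o.eq_ignore_ascii_case("asc") => "ASC",
        _ => "DESC", // assumed default
    };
    format!("ORDER BY {} {}", column, direction)
}
```

For example, `order_by_clause(Some("last_seen"), Some("asc"))` produces `ORDER BY last_seen ASC`, and an unknown or hostile `sort_by` value falls back to the default column.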

Console

Sidebar adds Services entry between Models and Agent Sessions (Lucide Server icon).
ServicesPage table:
- Endpoint (ip:port monospace)
- Models (chips, max 4 inline, +N more hover-revealed)
- Wire APIs
- Calls (with stream %)
- Error %
- TTFT avg / p95
- E2E avg / p95
- In/Out tokens
- Last seen (relative)
Headers click-to-sort in-place — no refetch on resort.
useServices hook follows the placeholderData: prev pattern — no flash on refresh.

Test plan

cargo build --workspace clean
cargo test -p ts-storage-duckdb --lib — 65 pass
bun test — 111 pass
bun run build — clean

E2E validation on wuneng coming in a follow-up reply.

🤖 Generated with Claude Code

… perf)

New "Services" page that aggregates llm_calls by the actual serving endpoint (server_ip, server_port), answering "what's 172.16.103.81:9000 serving, and how is it performing?".

Why not reuse `llm_metrics`? Its pre-aggregated grouping sets stop at `server_ip` and don't carry server_port, so two vLLM instances on the same host (port 8000 / 9000) would collapse into one row.

## Backend

- `ts_storage::query::ServiceRow` + `ServicesQuery` (one row per endpoint with distinct models, wire APIs, call/error counts, TTFT/E2E avg + p95, total tokens, first/last seen).
- `StorageBackend::query_services` trait method + DuckDB impl. Query is `GROUP BY (server_ip, server_port)` on `llm_calls`; models / wire_apis come back as `list_distinct(array_agg(...))`, bridged to Rust as JSON strings (DuckDB rust bindings have no `FromSql for Vec<String>`).
- `GET /api/services?start=&end=&sort_by=&sort_order=&limit=` serves it. `sort_by` whitelist matches the table column names.

## Console

- Sidebar adds "Services" between "Models" and "Agent Sessions" with a `Server` icon.
- `ServicesPage` table: Endpoint • Models (chips) • Wire APIs • Calls (+stream %) • Error % • TTFT avg/p95 • E2E avg/p95 • In/Out tokens • Last seen (relative). Headers click-to-sort in-place, with no refetch on resort.
- `useServices` hook follows the same `placeholderData: prev` pattern as every other list hook (no flash on refresh).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…p/litellm)

Adds an App column to the Services page that classifies each endpoint into one of a fixed enum from cheap wire-traffic signals.

## Signals used (highest-confidence first)

| App | Signal |
|-----|--------|
| `ollama` | path `/api/chat` / `/api/generate` / `/api/tags` |
| `llamacpp` | path `/completion` / `/tokenize` / `/props` (root-level) |
| `litellm` | response header `x-litellm-*` OR `Server: litellm` |
| `openai` | request `Host: api.openai.com` |
| `anthropic` | request `Host: api.anthropic.com` |
| `gemini` | request `Host: generativelanguage.googleapis.com` |
| `openai-compat` | `Server: uvicorn` (vLLM and SGLang both; a body-sample follow-up will disambiguate) |
| `litellm` | tiebreaker: an `openai-compat` endpoint serving ≥ 3 distinct models (real signal from wuneng's 127.0.0.1:4000) |
| (none) | nothing matches; UI shows muted "unknown" badge |

## Implementation

- `ts-storage-duckdb/src/apps.rs`: pure-function classifier with 12 unit tests covering each rule + edge cases (Ollama compat mode serving `/v1/chat/completions`, multi-model uvicorn tiebreaker, path-wins-over-uvicorn precedence, header-absent fallback).
- SQL aggregate now also pulls `arg_min(response_headers, LENGTH(...))` and the matching request_headers as a per-group sample, plus `list_distinct(array_agg(request_path))[1:16]`. `arg_min` picks the shortest non-null blob deterministically, small enough that streaming it to Rust costs nothing.
- New fields on `ServiceRow`: `app`, `server_header`, `request_paths`.
- Console renders a colored `AppBadge` per row with a `title=Server:` tooltip so the user can sanity-check the label.

## What ships vs. follow-up

vLLM and SGLang both run under uvicorn and don't have a distinctive custom header. Today they both label as `openai-compat`. A follow-up will pull one small response body per group and look for `chatcmpl-tool-<hex>` (vLLM's tool_call_id pattern, observed in production) vs. SGLang's distinct response shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
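The signal table in the commit above can be sketched as a pure function evaluated top to bottom. This is an illustration under stated assumptions, not the contents of `apps.rs`: the `App` enum, the `Signals` struct, and the helper names are hypothetical, header names are assumed to match case-insensitively, and the edge cases the commit mentions (such as Ollama compat mode) are not modelled.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum App { Ollama, LlamaCpp, LiteLlm, OpenAi, Anthropic, Gemini, OpenAiCompat }

/// Hypothetical per-endpoint sample; field names are illustrative only.
struct Signals<'a> {
    request_paths: &'a [&'a str],
    request_host: Option<&'a str>,
    response_headers: &'a [(&'a str, &'a str)],
    distinct_models: usize,
}

/// True if any observed path exactly equals one of the wanted paths.
fn any_path(paths: &[&str], wanted: &[&str]) -> bool {
    paths.iter().any(|p| wanted.iter().any(|w| *w == *p))
}

/// Case-insensitive header lookup (assumed matching rule).
fn header<'a>(headers: &[(&'a str, &'a str)], name: &str) -> Option<&'a str> {
    headers.iter().find(|(n, _)| n.eq_ignore_ascii_case(name)).map(|(_, v)| *v)
}

/// Apply the rules in table order; `None` maps to the "unknown" badge.
fn classify(s: &Signals<'_>) -> Option<App> {
    if any_path(s.request_paths, &["/api/chat", "/api/generate", "/api/tags"]) {
        return Some(App::Ollama);
    }
    if any_path(s.request_paths, &["/completion", "/tokenize", "/props"]) {
        return Some(App::LlamaCpp);
    }
    let server = header(s.response_headers, "server");
    let has_litellm_header = s
        .response_headers
        .iter()
        .any(|(name, _)| name.to_ascii_lowercase().starts_with("x-litellm-"));
    if has_litellm_header || server.map_or(false, |v| v.eq_ignore_ascii_case("litellm")) {
        return Some(App::LiteLlm);
    }
    match s.request_host {
        Some("api.openai.com") => return Some(App::OpenAi),
        Some("api.anthropic.com") => return Some(App::Anthropic),
        Some("generativelanguage.googleapis.com") => return Some(App::Gemini),
        _ => {}
    }
    if server.map_or(false, |v| v.eq_ignore_ascii_case("uvicorn")) {
        // Tiebreaker from the table: a uvicorn endpoint serving >= 3 models is LiteLLM.
        return Some(if s.distinct_models >= 3 { App::LiteLlm } else { App::OpenAiCompat });
    }
    None
}
```

Keeping the classifier free of I/O is what makes the rule-per-test style described in the commit practical: each rule and each precedence case reduces to constructing a small `Signals` value and asserting on the label.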

The Services-page aggregate uses `arg_min(headers, LENGTH(headers))` to pick one representative header sample per endpoint. Without a shape filter it picks ANY shortest non-null value, including rows where the response parser stashed an empty/corrupted string. That fed `null` (or similar) to the classifier and dropped four real endpoints (the GLM-5.1 cluster on port 9000) to `unknown` even though every other call from those endpoints carries a clean `Server: uvicorn` blob.

Restrict the sample to JSON arrays of at least 30 chars (`[%` pattern). The shortest real header list captured in production is ~140 chars; 30 is a comfortable floor that excludes literal `null`, `[]`, `{}`, and any other malformed short response without losing genuine samples.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
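The shape filter described in this commit can be mirrored in Rust to show which values it admits. This is a sketch of the predicate only, not the SQL that ships; the function names and the constant are hypothetical, and the header blob in the usage note is an invented example of the assumed JSON-array-of-pairs shape.

```rust
/// Floor taken from the commit message: samples shorter than this are rejected.
const MIN_HEADER_SAMPLE_LEN: usize = 30;

/// Keep only blobs that start like a JSON array and meet the length floor.
fn is_usable_header_sample(blob: &str) -> bool {
    blob.starts_with('[') && blob.chars().count() >= MIN_HEADER_SAMPLE_LEN
}

/// Emulates "shortest non-null value that passes the filter" for one endpoint.
/// `None` entries stand in for SQL NULLs.
fn shortest_usable_sample<'a>(blobs: &[Option<&'a str>]) -> Option<&'a str> {
    blobs
        .iter()
        .flatten()
        .copied()
        .filter(|b| is_usable_header_sample(b))
        .min_by_key(|b| b.chars().count())
}
```

With the filter in place, a group containing `null`, `{}`, a SQL NULL, and one real header list resolves to the real header list, where an unfiltered shortest-value pick would have returned `{}`.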

`arg_min(headers, LENGTH(headers))` was still returning NULL for endpoints with mixed-header data (e.g. SSE/streaming calls where the parser captured something the LIKE filter doesn't catch). Switch to `MAX(response_headers)` — lexicographic on a column whose values all start with `[[` makes it a stable arbitrary pick AND it doesn't have arg_min's failure mode of picking anomalously short malformed values. Filter to `[%` to guarantee the picked sample is shaped like a JSON array (drops literal "null", "{}", etc.).
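A small sketch of the filtered `MAX` pick, for comparison with the earlier shortest-value approach. It assumes byte-wise lexicographic string ordering, which is what Rust's `str` comparison provides; whether that matches the database's collation exactly is an assumption, and the function name is hypothetical.

```rust
/// Emulates `MAX(response_headers)` restricted to values starting with `[`.
/// `None` entries stand in for SQL NULLs.
fn max_array_shaped_sample<'a>(blobs: &[Option<&'a str>]) -> Option<&'a str> {
    blobs
        .iter()
        .flatten()
        .copied()
        .filter(|b| b.starts_with('['))
        .max()
}
```

The filter matters for `MAX` too: under byte-wise ordering both `null` and `{}` sort after any string beginning with `[`, so an unfiltered maximum would prefer exactly the malformed values the commit wants to exclude.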

Per the user's ask: every endpoint must land on a concrete label. Replace the `openai-compat` placeholder by stacking up cheap signals already present in `llm_calls`:

**New SQL aggregates** (alongside the existing header / paths sample):

- `list_distinct(array_agg(finish_reason))[1:32]`: distinct finish_reasons in the window
- `arg_max(request_body, LENGTH(request_body))`: largest captured request body (deepest agentic history; only materialises once, length comparison is u64-cheap)
- `arg_max(response_body, LENGTH(response_body))`: largest captured response body (capped at 8 KB so streamed/oversized rows don't bloat the read)

**New classifier signals** (in order, highest confidence first):

1. SGLang-specific paths (`/generate`, `/health_generate`, `/get_server_info`, `/flush_cache`, `/encode`, profile endpoints).
2. vLLM-specific paths (`/version`, `/v1/score`).
3. SGLang-exclusive finish_reasons (`matched_stop`, `matched_eos`, `stop_str`). This works even when responses are SSE-streamed, since finish_reason is captured from the final SSE event regardless.
4. Response body fingerprint:
   - `"id":"chatcmpl-tool-…"` (vLLM's tool_call_id format)
   - `"system_fingerprint":"fp_…"` (vLLM only; SGLang leaves it null)
5. Request body fingerprint: `chatcmpl-tool-` substring. Agentic replays carry assistant.tool_calls history back to the server, and the previous round's tool_call_id reveals vLLM.
6. Uvicorn fallback:
   - ≥3 models → LiteLLM (multi-model tiebreaker, real wuneng signal)
   - Model starts with `glm` / `deepseek` → SGLang (reference deployment)
   - Otherwise → vLLM (more common)

Console: drop the `openai-compat` badge color since the label is no longer emitted by the classifier.

22 classifier tests (was 12) covering every new rule + the beats-the-heuristic precedence cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
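The six new signals in this commit can be sketched as an ordered rule chain. The sketch covers only these new rules and is not the shipped classifier: `Engine`, `EndpointSample`, and the helper names are hypothetical, the unnamed "profile endpoints" are omitted, body fingerprints assume compact JSON without whitespace, model-prefix matching is assumed case-insensitive, and `models` is assumed to hold distinct names.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Engine { SgLang, Vllm, LiteLlm }

/// Hypothetical per-endpoint sample built from the SQL aggregates.
#[derive(Default)]
struct EndpointSample<'a> {
    request_paths: &'a [&'a str],
    finish_reasons: &'a [&'a str],
    response_body: Option<&'a str>,
    request_body: Option<&'a str>,
    models: &'a [&'a str],
    server_is_uvicorn: bool,
}

/// True if any observed value exactly equals one of the needles.
fn contains_any(haystack: &[&str], needles: &[&str]) -> bool {
    haystack.iter().any(|h| needles.iter().any(|n| *n == *h))
}

fn classify_engine(s: &EndpointSample<'_>) -> Option<Engine> {
    // 1. SGLang-specific paths.
    let sglang_paths = ["/generate", "/health_generate", "/get_server_info", "/flush_cache", "/encode"];
    if contains_any(s.request_paths, &sglang_paths) {
        return Some(Engine::SgLang);
    }
    // 2. vLLM-specific paths.
    if contains_any(s.request_paths, &["/version", "/v1/score"]) {
        return Some(Engine::Vllm);
    }
    // 3. SGLang-exclusive finish_reasons.
    if contains_any(s.finish_reasons, &["matched_stop", "matched_eos", "stop_str"]) {
        return Some(Engine::SgLang);
    }
    // 4. Response body fingerprints.
    if let Some(body) = s.response_body {
        if body.contains(r#""id":"chatcmpl-tool-"#) || body.contains(r#""system_fingerprint":"fp_"#) {
            return Some(Engine::Vllm);
        }
    }
    // 5. Request body fingerprint from replayed tool_call ids.
    if s.request_body.map_or(false, |b| b.contains("chatcmpl-tool-")) {
        return Some(Engine::Vllm);
    }
    // 6. Uvicorn fallback heuristics; non-uvicorn endpoints stay unlabeled here.
    if !s.server_is_uvicorn {
        return None;
    }
    if s.models.len() >= 3 {
        return Some(Engine::LiteLlm);
    }
    let sglang_family = s.models.iter().any(|m| {
        let m = m.to_ascii_lowercase();
        m.starts_with("glm") || m.starts_with("deepseek")
    });
    Some(if sglang_family { Engine::SgLang } else { Engine::Vllm })
}
```

Ordering carries the precedence: a uvicorn endpoint serving a `glm` model would fall to SGLang under rule 6, but a `chatcmpl-tool-` id in its request history is caught earlier by rule 5 and labels it vLLM, which is the kind of beats-the-heuristic case the commit's tests describe.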

The inline "(34% str)" annotation glued onto the Calls cell read as noise; operators scanning the Calls column want a clean number. Moving streaming share to its own sortable column keeps Calls pure and lets users rank endpoints by streaming-vs-non-streaming mix when triaging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Vader Yang and others added 6 commits May 20, 2026 11:24

This was referenced May 20, 2026

feat(services): Path view + Overview agent charts (deploy roll-up) #27

Open

feat(ci): headless PR review agent (phase 1) #28

Merged

vaderyang added 2 commits May 20, 2026 17:03

Merge branch 'main' into feat/services-page

5960ad1

Merge branch 'main' into feat/services-page

11f6fd5

vaderyang wants to merge 8 commits into main from feat/services-page
