feat(services): per-endpoint Services page (server_ip:port → models + perf)#25

Open: vaderyang wants to merge 8 commits into main from feat/services-page

@vaderyang
Collaborator

Summary

New "Services" page in the console that answers "what's 172.16.103.81:9000 serving, and how is it performing?". Aggregates llm_calls by (server_ip, server_port) — one row per LLM serving endpoint with distinct models, wire APIs, error/throughput, TTFT/E2E percentiles, first/last seen.

Why direct-on-llm_calls (not llm_metrics)

The pre-aggregated llm_metrics table's grouping sets stop at server_ip, so two vLLM instances on the same host (port 8000 / port 9000) would collapse into one row. For a service view ("is the GLM-5 endpoint healthy?") you need the port. Scanning llm_calls is fine in practice: a 7-day window in real production data has tens of thousands of rows and the query completes in well under a second.
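To make the collapse concrete, here is a minimal in-memory sketch of the two groupings. The `Call` and `EndpointStats` types and both function names are hypothetical stand-ins, not the types in this PR; the real aggregation runs as SQL over `llm_calls`.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical, trimmed-down stand-in for one `llm_calls` row.
#[derive(Debug, Clone)]
pub struct Call {
    pub server_ip: String,
    pub server_port: u16,
    pub model: String,
    pub is_error: bool,
}

/// Illustrative rollup; the real `ServiceRow` carries many more fields.
#[derive(Debug, Default, PartialEq)]
pub struct EndpointStats {
    pub calls: u64,
    pub errors: u64,
    pub models: BTreeSet<String>,
}

/// Group by (server_ip, server_port): one row per serving endpoint.
pub fn by_endpoint(calls: &[Call]) -> BTreeMap<(String, u16), EndpointStats> {
    let mut out: BTreeMap<(String, u16), EndpointStats> = BTreeMap::new();
    for c in calls {
        let row = out.entry((c.server_ip.clone(), c.server_port)).or_default();
        row.calls += 1;
        row.errors += u64::from(c.is_error);
        row.models.insert(c.model.clone());
    }
    out
}

/// Group by server_ip only: instances on different ports of one host merge.
pub fn by_host(calls: &[Call]) -> BTreeMap<String, EndpointStats> {
    let mut out: BTreeMap<String, EndpointStats> = BTreeMap::new();
    for c in calls {
        let row = out.entry(c.server_ip.clone()).or_default();
        row.calls += 1;
        row.errors += u64::from(c.is_error);
        row.models.insert(c.model.clone());
    }
    out
}
```

With calls to ports 8000 and 9000 on the same IP, `by_endpoint` yields two rows while `by_host` yields one row whose model set mixes both instances.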

Backend

  • ts_storage::query::ServiceRow + ServicesQuery — one row per endpoint with distinct models, wire APIs, call/error counts, TTFT/E2E avg + p95, total tokens, first/last seen.
  • StorageBackend::query_services trait method + DuckDB impl. list_distinct(array_agg(model))[1:32] collects distinct models with a sanity cap of 32; the LIST-of-VARCHAR result comes back as JSON strings (the DuckDB Rust bindings have no FromSql for Vec<String>) and is parsed via the same parse_json_string_list helper that agent_turns.models_used uses.
  • GET /api/services?start=&end=&sort_by=&sort_order=&limit= serves it.
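For readers unfamiliar with the JSON-string bridge, the sketch below shows one way a flat JSON array of strings could be turned into a `Vec<String>` using only the standard library. It is not the repository's `parse_json_string_list`; the name `parse_string_list_sketch` is hypothetical, `\uXXXX` escapes are deliberately rejected to keep it short, and a real implementation may well delegate to a JSON library instead.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Skip any whitespace between JSON tokens.
fn skip_ws(chars: &mut Peekable<Chars<'_>>) {
    while matches!(chars.peek(), Some(c) if c.is_whitespace()) {
        chars.next();
    }
}

/// Read the remainder of a JSON string; the opening quote is already consumed.
fn read_string(chars: &mut Peekable<Chars<'_>>) -> Option<String> {
    let mut s = String::new();
    loop {
        match chars.next()? {
            '"' => return Some(s),
            '\\' => s.push(match chars.next()? {
                '"' => '"',
                '\\' => '\\',
                '/' => '/',
                'n' => '\n',
                't' => '\t',
                'r' => '\r',
                _ => return None, // \uXXXX and others: out of scope here
            }),
            c => s.push(c),
        }
    }
}

/// Parse text such as `["glm-5","qwen3"]`; None for anything that is not a
/// flat array of strings (for example `null`, `{}`, or a trailing comma).
pub fn parse_string_list_sketch(raw: &str) -> Option<Vec<String>> {
    let mut chars = raw.trim().chars().peekable();
    if chars.next()? != '[' {
        return None;
    }
    let mut out = Vec::new();
    loop {
        skip_ws(&mut chars);
        match chars.next()? {
            ']' if out.is_empty() => break, // empty list
            '"' => out.push(read_string(&mut chars)?),
            _ => return None,
        }
        skip_ws(&mut chars);
        match chars.next()? {
            ',' => continue,
            ']' => break,
            _ => return None,
        }
    }
    // Anything after the closing bracket makes the input invalid.
    if chars.next().is_some() {
        return None;
    }
    Some(out)
}
```

Returning `None` rather than an empty list on malformed input lets the caller decide whether a bad sample should be ignored or surfaced.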

Console

  • Sidebar adds Services entry between Models and Agent Sessions (Lucide Server icon).
  • ServicesPage table:
    • Endpoint (ip:port monospace)
    • Models (chips, max 4 inline, +N more hover-revealed)
    • Wire APIs
    • Calls (with stream %)
    • Error %
    • TTFT avg / p95
    • E2E avg / p95
    • In/Out tokens
    • Last seen (relative)
  • Headers click-to-sort in-place — no refetch on resort.
  • useServices hook follows the placeholderData: prev pattern — no flash on refresh.

Test plan

  • cargo build --workspace clean
  • cargo test -p ts-storage-duckdb --lib — 65 pass
  • bun test — 111 pass
  • bun run build — clean

E2E validation on wuneng coming in a follow-up reply.

🤖 Generated with Claude Code

Vader Yang and others added 6 commits May 20, 2026 11:24
… perf)

New "Services" page that aggregates llm_calls by the actual serving
endpoint (server_ip, server_port) — answering "what's
172.16.103.81:9000 serving, and how is it performing?".

Why not reuse `llm_metrics`? Its pre-aggregated grouping sets stop
at `server_ip` and don't carry server_port — two vLLM instances on
the same host (port 8000 / 9000) would collapse into one row.

## Backend

- `ts_storage::query::ServiceRow` + `ServicesQuery` (one row per
  endpoint with distinct models, wire APIs, call/error counts,
  TTFT/E2E avg + p95, total tokens, first/last seen).
- `StorageBackend::query_services` trait method + DuckDB impl.
  Query is `GROUP BY (server_ip, server_port)` on `llm_calls`;
  models / wire_apis come back as `list_distinct(array_agg(...))`,
  bridged to Rust as JSON strings (DuckDB rust bindings have no
  `FromSql for Vec<String>`).
- `GET /api/services?start=&end=&sort_by=&sort_order=&limit=`
  serves it. `sort_by` whitelist matches the table column names.
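A whitelist of this kind might look like the sketch below. The column names in `SORTABLE`, the default of `calls DESC`, and the function name `order_by_clause` are all assumptions for illustration; the point is only that user input selects from fixed strings and is never interpolated into SQL.

```rust
/// Hypothetical whitelist; the real column names live in the DuckDB impl.
const SORTABLE: &[&str] = &["calls", "errors", "ttft_p95_ms", "e2e_p95_ms", "last_seen"];

/// Build an ORDER BY clause from untrusted query parameters.
/// Unknown columns and directions fall back to a default (assumed here to be
/// `calls DESC`) instead of reaching the query text.
pub fn order_by_clause(sort_by: Option<&str>, sort_order: Option<&str>) -> String {
    let column = sort_by
        .and_then(|s| SORTABLE.iter().copied().find(|c| *c == s))
        .unwrap_or("calls");
    let direction = match sort_order {
        Some(o) if o.eq_ignore_ascii_case("asc") => "ASC",
        _ => "DESC",
    };
    format!("ORDER BY {column} {direction}")
}
```

Because the returned column is always one of the static strings in `SORTABLE`, a value such as `calls; DROP TABLE llm_calls` simply falls through to the default.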

## Console

- Sidebar adds "Services" between "Models" and "Agent Sessions"
  with a `Server` icon.
- `ServicesPage` table: Endpoint • Models (chips) • Wire APIs •
  Calls (+stream %) • Error % • TTFT avg/p95 • E2E avg/p95 •
  In/Out tokens • Last seen (relative). Headers click-to-sort
  in-place — no refetch on resort.
- `useServices` hook follows the same `placeholderData: prev`
  pattern as every other list hook (no flash on refresh).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…p/litellm)

Adds an App column to the Services page that classifies each
endpoint into one of a fixed enum from cheap wire-traffic signals.

## Signals used (highest-confidence first)

| App             | Signal                                                                                                          |
|-----------------|-----------------------------------------------------------------------------------------------------------------|
| `ollama`        | path `/api/chat` / `/api/generate` / `/api/tags`                                                                |
| `llamacpp`      | path `/completion` / `/tokenize` / `/props` (root-level)                                                        |
| `litellm`       | response header `x-litellm-*` OR `Server: litellm`                                                              |
| `openai`        | request `Host: api.openai.com`                                                                                  |
| `anthropic`     | request `Host: api.anthropic.com`                                                                               |
| `gemini`        | request `Host: generativelanguage.googleapis.com`                                                               |
| `openai-compat` | `Server: uvicorn` (vLLM and SGLang both; a body-sample follow-up will disambiguate)                             |
| `litellm`       | tiebreaker: an `openai-compat` endpoint serving ≥ 3 distinct models (real signal from wuneng's 127.0.0.1:4000) |
| (none)          | nothing matches; UI shows muted "unknown" badge                                                                 |

## Implementation

- `ts-storage-duckdb/src/apps.rs` — pure-function classifier with 12
  unit tests covering each rule + edge cases (Ollama compat mode
  serving `/v1/chat/completions`, multi-model uvicorn tiebreaker,
  path-wins-over-uvicorn precedence, header-absent fallback).
- SQL aggregate now also pulls `arg_min(response_headers, LENGTH(...))`
  and the matching request_headers as a per-group sample plus
  `list_distinct(array_agg(request_path))[1:16]`. `arg_min` picks
  the shortest non-null blob deterministically — small enough that
  streaming it to Rust costs nothing.
- New fields on `ServiceRow`: `app`, `server_header`, `request_paths`.
- Console renders a colored `AppBadge` per row with a `title=Server:`
  tooltip so the user can sanity-check the label.
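The signal table above can be read as a first-match-wins chain. The sketch below is one possible shape for such a pure-function classifier, not the code in `apps.rs`: the `Evidence` struct, its field names, the helper names, and the choice of case-insensitive substring matching on the `Server` header are all assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum App { Ollama, LlamaCpp, LiteLlm, OpenAi, Anthropic, Gemini, OpenAiCompat }

/// Hypothetical per-endpoint evidence; field names are illustrative.
pub struct Evidence<'a> {
    pub request_paths: &'a [&'a str],
    pub response_headers: &'a [(&'a str, &'a str)],
    pub request_host: Option<&'a str>,
    pub distinct_models: usize,
}

/// True if any observed path equals one of the wanted paths.
fn any_path(paths: &[&str], wanted: &[&str]) -> bool {
    paths.iter().any(|p| wanted.iter().any(|w| *w == *p))
}

/// Case-insensitive header lookup.
fn header<'a>(headers: &[(&'a str, &'a str)], name: &str) -> Option<&'a str> {
    headers.iter().find(|(k, _)| k.eq_ignore_ascii_case(name)).map(|(_, v)| *v)
}

/// Returns None when nothing matches (the UI's "unknown" badge).
pub fn classify(e: &Evidence<'_>) -> Option<App> {
    // Native paths are the highest-confidence signal and win over uvicorn.
    if any_path(e.request_paths, &["/api/chat", "/api/generate", "/api/tags"]) {
        return Some(App::Ollama);
    }
    if any_path(e.request_paths, &["/completion", "/tokenize", "/props"]) {
        return Some(App::LlamaCpp);
    }
    // LiteLLM announces itself in response headers.
    let server = header(e.response_headers, "server").unwrap_or("").to_ascii_lowercase();
    let litellm_header = e
        .response_headers
        .iter()
        .any(|(k, _)| k.to_ascii_lowercase().starts_with("x-litellm-"));
    if litellm_header || server.contains("litellm") {
        return Some(App::LiteLlm);
    }
    // Hosted providers are identified by the request Host header.
    match e.request_host {
        Some("api.openai.com") => return Some(App::OpenAi),
        Some("api.anthropic.com") => return Some(App::Anthropic),
        Some("generativelanguage.googleapis.com") => return Some(App::Gemini),
        _ => {}
    }
    // uvicorn: OpenAI-compatible, unless it fronts three or more models.
    if server.contains("uvicorn") {
        return Some(if e.distinct_models >= 3 { App::LiteLlm } else { App::OpenAiCompat });
    }
    None
}
```

Keeping the classifier free of I/O is what makes the per-rule unit tests cheap: each test only has to construct an `Evidence` value and assert on the label.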

## What ships vs. follow-up

vLLM and SGLang both run under uvicorn and don't have a distinctive
custom header. Today they both label as `openai-compat`. A follow-up
will pull one small response body per group and look for
`chatcmpl-tool-<hex>` (vLLM's tool_call_id pattern, observed in
production) vs. SGLang's distinct response shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Services-page aggregate uses `arg_min(headers, LENGTH(headers))`
to pick one representative header sample per endpoint. Without a
shape filter it picks ANY shortest non-null value — including rows
where the response parser stashed an empty/corrupted string. That
fed `null` (or similar) to the classifier and dropped four real
endpoints (the GLM-5.1 cluster on port 9000) to `unknown` even
though every other call from those endpoints carries a clean
`Server: uvicorn` blob.

Restrict the sample to JSON arrays of at least 30 chars (`[%`
pattern). The shortest real header list captured in production is
~140 chars; 30 is a comfortable floor that excludes literal `null`,
`[]`, `{}`, and any other malformed short response without losing
genuine samples.
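The effect of the shape filter can be illustrated in memory. The sketch below emulates "shortest value that starts with `[` and is at least 30 characters long" over a group's samples; the function names are hypothetical, and it compares byte lengths, which only approximates SQL `LENGTH` for non-ASCII text.

```rust
/// The 30-character floor described above.
const MIN_SAMPLE_LEN: usize = 30;

/// True when the value is shaped like a serialized JSON array of headers.
pub fn looks_like_header_list(raw: &str) -> bool {
    raw.starts_with('[') && raw.len() >= MIN_SAMPLE_LEN
}

/// Emulates `arg_min(headers, LENGTH(headers))` restricted to well-shaped
/// rows: the shortest sample that passes the filter, or None if none does.
pub fn shortest_valid_sample<'a>(samples: &[Option<&'a str>]) -> Option<&'a str> {
    samples
        .iter()
        .flatten() // drop NULLs
        .copied()
        .filter(|s| looks_like_header_list(s))
        .min_by_key(|s| s.len())
}
```

Without the filter, a group containing a literal `null` or `[]` would return that value as its sample, which is the failure that pushed the port-9000 endpoints to `unknown`.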

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`arg_min(headers, LENGTH(headers))` was still returning NULL for
endpoints with mixed-header data (e.g. SSE/streaming calls where the
parser captured something the LIKE filter doesn't catch).

Switch to `MAX(response_headers)` — lexicographic on a column whose
values all start with `[[` makes it a stable arbitrary pick AND it
doesn't have arg_min's failure mode of picking anomalously short
malformed values. Filter to `[%` to guarantee the picked sample is
shaped like a JSON array (drops literal "null", "{}", etc.).
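
As a rough in-memory analogue of `MAX(response_headers)` with the `[%` filter, the sketch below takes the lexicographically greatest sample that starts with `[`. The function name is hypothetical, and Rust's byte-wise `&str` ordering is assumed to be close enough to the database's VARCHAR ordering for illustration.

```rust
/// Emulates `MAX(response_headers)` over rows matching `LIKE '[%'`.
/// The result does not depend on row order, and short malformed values
/// such as `null` or `{}` never qualify.
pub fn max_valid_sample<'a>(samples: &[Option<&'a str>]) -> Option<&'a str> {
    samples
        .iter()
        .flatten() // drop NULLs
        .copied()
        .filter(|s| s.starts_with('['))
        .max()
}
```

The filter matters for correctness, not just tidiness: in byte order `n` and `{` both sort after `[`, so an unfiltered maximum would prefer `null` or `{}` over any real header array.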

Per the user's ask: every endpoint must land on a concrete label.
Replace the `openai-compat` placeholder by stacking up cheap signals
already present in `llm_calls`:

**New SQL aggregates** (alongside the existing header / paths sample):
- `list_distinct(array_agg(finish_reason))[1:32]`        — distinct
  finish_reasons in the window
- `arg_max(request_body, LENGTH(request_body))`           — largest
  captured request body (deepest agentic history; only materialises
  once, length comparison is u64-cheap)
- `arg_max(response_body, LENGTH(response_body))`         — largest
  captured response body (capped at 8 KB so streamed/oversized rows
  don't bloat the read)

**New classifier signals** (in order, highest confidence first):

1. SGLang-specific paths (`/generate`, `/health_generate`,
   `/get_server_info`, `/flush_cache`, `/encode`, profile endpoints).
2. vLLM-specific paths (`/version`, `/v1/score`).
3. SGLang-exclusive finish_reasons (`matched_stop`, `matched_eos`,
   `stop_str`) — works even when responses are SSE-streamed, since
   finish_reason is captured from the final SSE event regardless.
4. Response body fingerprint:
   - `"id":"chatcmpl-tool-…"` (vLLM's tool_call_id format)
   - `"system_fingerprint":"fp_…"` (vLLM only; SGLang leaves it null)
5. Request body fingerprint: `chatcmpl-tool-` substring — agentic
   replays carry assistant.tool_calls history back to the server,
   and the previous round's tool_call_id reveals vLLM.
6. Uvicorn fallback:
   - ≥3 models → LiteLLM (multi-model tiebreaker, real wuneng signal)
   - Model starts with `glm` / `deepseek` → SGLang (reference deployment)
   - Otherwise → vLLM (more common)
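
The ordered signals above amount to a first-match-wins chain for endpoints already known to sit behind uvicorn. The sketch below is one way to express it, not the PR's classifier: the `UvicornEvidence` struct and all names are hypothetical, the profile endpoints are omitted, body fingerprints use exact substring matching (so JSON with extra whitespace would be missed), and "model starts with" is interpreted as "any model, case-insensitively".

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Engine { SgLang, Vllm, LiteLlm }

/// Hypothetical evidence for one uvicorn-fronted endpoint.
#[derive(Debug, Default)]
pub struct UvicornEvidence<'a> {
    pub request_paths: &'a [&'a str],
    pub finish_reasons: &'a [&'a str],
    pub response_body: Option<&'a str>,
    pub request_body: Option<&'a str>,
    pub models: &'a [&'a str],
}

/// True if any observed value equals one of the wanted values.
fn any_of(have: &[&str], wanted: &[&str]) -> bool {
    have.iter().any(|h| wanted.iter().any(|w| *w == *h))
}

pub fn classify_uvicorn(e: &UvicornEvidence<'_>) -> Engine {
    // 1. SGLang-specific paths.
    let sglang_paths = ["/generate", "/health_generate", "/get_server_info", "/flush_cache", "/encode"];
    if any_of(e.request_paths, &sglang_paths) {
        return Engine::SgLang;
    }
    // 2. vLLM-specific paths.
    if any_of(e.request_paths, &["/version", "/v1/score"]) {
        return Engine::Vllm;
    }
    // 3. SGLang-exclusive finish reasons.
    if any_of(e.finish_reasons, &["matched_stop", "matched_eos", "stop_str"]) {
        return Engine::SgLang;
    }
    // 4. Response body fingerprints.
    let resp = e.response_body.unwrap_or("");
    if resp.contains("\"id\":\"chatcmpl-tool-") || resp.contains("\"system_fingerprint\":\"fp_") {
        return Engine::Vllm;
    }
    // 5. Request body fingerprint from replayed tool-call history.
    if e.request_body.map_or(false, |b| b.contains("chatcmpl-tool-")) {
        return Engine::Vllm;
    }
    // 6. Fallback heuristics, lowest confidence.
    if e.models.len() >= 3 {
        return Engine::LiteLlm;
    }
    let sglang_family = e.models.iter().any(|m| {
        let m = m.to_ascii_lowercase();
        m.starts_with("glm") || m.starts_with("deepseek")
    });
    if sglang_family { Engine::SgLang } else { Engine::Vllm }
}
```

Ordering the checks from direct evidence down to heuristics means a `chatcmpl-tool-` fingerprint overrides the model-name guess, which is the kind of precedence case the added tests are described as covering.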

Console: drop the `openai-compat` badge color since the label is no
longer emitted by the classifier.

22 classifier tests (was 12) covering every new rule + the
beats-the-heuristic precedence cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The inline "(34% str)" annotation glued onto the Calls cell read as
noise — operators scanning the Calls column want a clean number. Moving
streaming share to its own sortable column keeps Calls pure and lets
users rank endpoints by streaming-vs-non-streaming mix when triaging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>