* ci(security): add public-repo PII gate caller

Blocks PRs that leak customer/partner names or secrets in title/body/commits. Calls the reusable gate in tracebloc/.github. Inactive until the org PII_DENYLIST secret is set (warns, doesn't block, until then).

* chore(schema): sync ingest.v1.json from data-ingestors master

The vendored copy at internal/schema/ingest.v1.json had drifted from upstream, failing the `scripts/sync-schema.sh --check` CI gate on every PR. Upstream replaced `instance_segmentation` with `token_classification` across the category enums, updated the texts/resolution descriptions ([width, height] order + PIL note), and added the token_classification `texts` and masked_language_modeling no-`label` conditional rules.

Regenerated via `scripts/sync-schema.sh`; `--check` is now clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: shujaat hasan <shujaathasan@shujaats-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Adds a per-ref concurrency group (cancels superseded PR runs only; push/tag/schedule never cancelled) and timeout-minutes to every job, so stale PR pushes stop wasting runner time and hung steps (kind boot, cosign) can't run to the 6h default. No change to job behavior.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
chore(schema): re-sync vendored ingest.v1.json from data-ingestors master
…manager (#259) (#78)

`tracebloc dataset rm` dropped the table but failed to delete the dataset's staging files on the shared PVC:

    rm: cannot remove '/data/shared/.tracebloc-staging/<t>/labels.csv': Permission denied

Root cause: the staging files are written by the CLI's ephemeral stage pod as uid 65532 (+ fsGroup 65532), but the teardown exec'd `rm` inside the long-lived jobs-manager pod, which runs as a different non-root uid with no shared fsGroup. A non-65532 uid cannot delete 65532-owned files in a non-group-writable dir, so the rm hit EACCES and left orphans. The "re-run to clean up" advice was a dead end: the same permission error every time.

Fix: run the teardown `rm` from a short-lived pod that mirrors the stage pod's identity (uid 65532 + fsGroup 65532, shared PVC mounted), reusing the existing BuildStagePodSpec / CreateStagePod / WaitForStagePodReady / DeleteStagePod machinery. That pod OWNS the staging files it deletes, so it works by ownership on hostPath (where fsGroup is a no-op, kubernetes/kubernetes#138411) and CSI alike. Fully fixes tabular datasets (no sidecar files) on every volume type.

Also:
- Teardown now takes an injectable Executor (matching push.Stage), enabling a regression test that pins "rm runs in a uid-65532 stage pod, not jobs-manager".
- dataset_rm: drop the misleading "re-run completes the cleanup" claim; the table DROP is idempotent, and if file removal keeps failing, point to node-side cleanup.

Refs #259. The image/sidecar case (ingestor's /data/shared/<table> written as uid 65534) on hostPath still needs the documented complement (ingestor fsGroup + group-writable DEST_PATH in client-runtime/data-ingestors).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>
#89)

* feat(#88): tracebloc cluster doctor - live-cluster health checks (WS3)

Adds `tracebloc cluster doctor`, a read-only health sweep of a running tracebloc client cluster that prints ✔/⚠/✖ per check with a remedy, so a customer can diagnose "why isn't my experiment running?" without tracebloc shelling into their cluster (epic client-runtime#116, WS3).

Sibling of `cluster info` (which the code's own comment anticipated); reuses its kubeconfig/context/namespace flags + cluster.Load / NewClientset / DiscoverParentRelease, the ui.Printer status vocabulary, and exitError.

Lean MVP, 6 checks:
- cluster reachable (parent client release discovered)
- pod health (crash-loops / long-Pending; local complement to #117)
- dataset volume (shared PVC Bound, via cluster.DiscoverSharedPVC)
- proxy configuration (in-cluster requests/egress proxy wiring)
- backend egress (host-side, proxy-aware probe; in-cluster probe = follow-up)
- Service Bus egress (requests-proxy readiness; the experiments-queue broker)

internal/doctor is a standalone package with injectable network probes, 82% covered via client-go's fake clientset. Every check is independent and best-effort (one failure never hides the others); the worst status sets the exit code (0 ok/warn, 2 failures, 3 kubeconfig).

Out of scope (already shipped / follow-up): support-bundle ships as the installer's `--diagnose`; node-resources-vs-job-request and image-pullability are the broader cut.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#88): don't flag Succeeded/recovered pods as crash-looping (Bugbot)

podCrashLooping flagged any pod with RestartCount>=3, including Succeeded job pods that retried before completing and Running pods that recovered after past restarts, producing a false ✖ when nothing is actually unhealthy. Guard terminal phases (Succeeded/Failed) and require the container to not be currently running, mirroring the controller's recovered-container fix (client-runtime#117).

Adds regression tests for both false-positive cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#88): detect init-container crash-loops + nil-release prefixed deploys (Bugbot)

Two Medium Bugbot findings on the previous commit:
- podCrashLooping ignored InitContainerStatuses, so an init container stuck in CrashLoopBackOff read as a Pending warning instead of a failure even though the pod cannot start. It now checks init + app containers, and detects only active CrashLoopBackOff, dropping the RestartCount heuristic entirely, since that was the source of the earlier Succeeded/recovered-pod false positives.
- requestsProxyNames/jobsManagerNames only probed unprefixed names when the release was nil (e.g. DiscoverParentRelease errored on multiple releases), falsely reporting missing wiring even though <release>-requests-proxy exists. Added findDeployment: exact-name Get, then a namespace List + name-suffix fallback that resolves the prefixed name without knowing the release.

Adds regression tests: init-crash-loop, nil-release-finds-prefixed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs/polish(#88): address Arturo's post-approval review nits

- Tunables comment: they're conservative package consts, not vars; point at Options (like HTTPProbe) for any future runtime tuning.
- checkProxy WARN: note that a REQUESTS_PROXY_URL set via a configMap/secret ref reads as empty here (jobsManagerEnv reads only literal env).
- httpProbe: a successful connection means reachable; discard the body-close error rather than reporting it as "unreachable".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#88): suffix fallback must not pick across multiple releases (Bugbot)

findDeployment's suffix fallback (added for the nil-release case) picked the first suffix-matching deployment, so in a namespace running multiple parent releases, jobsManagerEnv and checkRequestsProxy could resolve to different releases in a single run, presenting mixed data as fact. Resolve the fallback only when exactly one deployment carries the suffix; with more than one (the multi-release case DiscoverParentRelease already refuses to disambiguate) return nil and let the check report can't-determine. The single-release nil-discovery case still resolves.

Adds TestCheckRequestsProxy_NilReleaseAmbiguous.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#88): tie deployment lookup to the discovered release (Bugbot)

When a release was discovered, findDeployment's fallbacks could still match a DIFFERENT release's component (or a stray bare one), so the Service Bus check went green on the wrong requests-proxy while the discovered release's was missing.

findDeployment now takes the release directly. When it's known, it accepts only "<release>-<suffix>" or a bare "<suffix>" whose app.kubernetes.io/instance label ties it to that release; never another release's, never an unattributable bare one. The release-unknown path keeps the exactly-one-suffix-match rule (returns nil on >1, so checks report can't-determine rather than guess). Folds the jobsManagerNames/requestsProxyNames candidate builders into findDeployment.

Adds tests: other-release-ignored, bare-name-tied-by-label.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
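The final findDeployment lookup rule described above can be sketched without a cluster. This is a minimal illustration only: the `deployment` struct is a hypothetical stand-in for the Kubernetes Deployment fields involved, and treating a bare name like any other suffix match in the release-unknown path is an assumption of this sketch, not something the commits confirm.

```go
package main

import (
	"fmt"
	"strings"
)

// deployment is a hypothetical stand-in for the few Deployment fields the
// lookup needs; the real code reads these from the Kubernetes API.
type deployment struct {
	Name     string
	Instance string // value of the app.kubernetes.io/instance label, "" if unset
}

// findDeploymentSketch resolves the deployment for a component suffix such as
// "requests-proxy". It returns nil whenever the answer would be a guess.
func findDeploymentSketch(deps []deployment, release, suffix string) *deployment {
	if release != "" {
		// Release known: accept "<release>-<suffix>" first.
		for i := range deps {
			if deps[i].Name == release+"-"+suffix {
				return &deps[i]
			}
		}
		// Otherwise accept a bare "<suffix>" only when its instance label
		// ties it to this release.
		for i := range deps {
			if deps[i].Name == suffix && deps[i].Instance == release {
				return &deps[i]
			}
		}
		return nil
	}
	// Release unknown: resolve only when exactly one deployment matches.
	var match *deployment
	for i := range deps {
		if deps[i].Name == suffix || strings.HasSuffix(deps[i].Name, "-"+suffix) {
			if match != nil {
				return nil // more than one candidate: can't determine
			}
			match = &deps[i]
		}
	}
	return match
}

func main() {
	deps := []deployment{{Name: "a-requests-proxy"}, {Name: "b-requests-proxy"}}
	fmt.Println(findDeploymentSketch(deps, "", "requests-proxy") == nil)  // ambiguous without a release
	fmt.Println(findDeploymentSketch(deps, "a", "requests-proxy") != nil) // resolved by release prefix
}
```

Returning nil in the ambiguous case lets the caller report "can't determine" instead of showing data from whichever release happened to be listed first.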
…patch (#74) (#80)

The CLI enumerated task categories in four hand-maintained places that had drifted: the `--category` help listed 5 of 9 (#74), the push accept-gate hand-listed the supported set twice, the interactive picker kept its own list, and internal/push/category.go held four separate family maps.

Consolidate into one CategorySpec registry (internal/push/category.go): each category's family, label, regression-class flag, and CLI-support status lives in one ordered table. The family predicates (IsImage/IsTabular/IsText/IsRegressionClass), the `--category` help, the gate's "Supported:" lists, and the interactive picker now all derive from it, so the enumerations can't drift apart again. The help now lists all 9 supported categories.

Behaviour-preserving: the gate accepts/rejects exactly the same set (IsCLISupported == the prior nine-category condition); only the help text and the now-registry-derived error messages change. semantic_/ instance_segmentation stay known-but-unsupported, each with a per-category UnsupportedNote.

Adds a registry parity + predicate-derivation test (the anti-drift guard). First phase of the CLI ingestion consolidation epic (backend#828).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Shujaat Hasan <shujaat@tracebloc.io>
…up) (#91)

* feat(#90): cluster doctor - node-fit + image-pull checks (WS3 follow-up)

Two read-only checks added to `tracebloc cluster doctor` (follow-up to #89):
- Node capacity: parses the resource requests jobs-manager stamps on spawned training jobs (RESOURCE_REQUESTS / GPU_REQUESTS env) and checks at least one Ready node can fit them: the "Pending forever, no node big enough" class. GPU is soft: a hard ✖ only on cpu/mem, and a ⚠ when a GPU is requested but no node exposes it (jobs-manager has a GPU->CPU fallback).
- Image pull secret: when jobs-manager references a registry pull secret, verifies it exists and is a well-formed dockerconfigjson so private-image pulls don't ImagePullBackOff.

Both read-only/best-effort, tested with client-go's fake clientset. The in-cluster egress probe (the third deferred check on #90) is intentionally a separate PR; it needs a port-forward/exec mechanism, not this read-only pattern.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#90): node-fit must require cpu+mem+GPU on ONE node (Bugbot)

checkNodeFit set cpuMemFits and gpuFits independently, so they could come from different nodes, reporting OK even when no single node had cpu+memory+GPU together (a GPU job would then stay Pending). It now evaluates each node as a whole: cpuMemFits (any node) drives the hard fail; fullFits (one node with cpu+mem AND the GPU) drives the ok/warn split.

Adds regression tests for the cross-node and single-node-fits cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
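The per-node evaluation in the fix above can be illustrated with plain structs. This is a sketch under assumptions: the `node` and `resources` types are hypothetical, and comparing against a node's allocatable capacity (rather than its currently free capacity) is a guess about what the real check measures.

```go
package main

import "fmt"

// resources is a hypothetical, simplified resource vector.
type resources struct {
	CPUMilli int64
	MemBytes int64
	GPUs     int64
}

// node is a hypothetical stand-in for a Kubernetes node's readiness and
// allocatable capacity.
type node struct {
	Name  string
	Ready bool
	Alloc resources
}

// nodeFit evaluates each Ready node as a whole. cpuMemFits is true when any
// node fits cpu+memory; fullFits only when ONE node fits cpu, memory and GPU.
func nodeFit(nodes []node, req resources) (cpuMemFits, fullFits bool) {
	for _, n := range nodes {
		if !n.Ready {
			continue
		}
		if n.Alloc.CPUMilli < req.CPUMilli || n.Alloc.MemBytes < req.MemBytes {
			continue
		}
		cpuMemFits = true
		if n.Alloc.GPUs >= req.GPUs {
			fullFits = true
		}
	}
	return cpuMemFits, fullFits
}

// status maps the two flags to the doctor's fail / warn / ok split.
func status(cpuMemFits, fullFits bool) string {
	switch {
	case !cpuMemFits:
		return "fail"
	case !fullFits:
		return "warn"
	default:
		return "ok"
	}
}

func main() {
	const gib = int64(1) << 30
	nodes := []node{
		{Name: "cpu-big", Ready: true, Alloc: resources{CPUMilli: 16000, MemBytes: 64 * gib}},
		{Name: "gpu-small", Ready: true, Alloc: resources{CPUMilli: 2000, MemBytes: 4 * gib, GPUs: 1}},
	}
	req := resources{CPUMilli: 4000, MemBytes: 8 * gib, GPUs: 1}
	// The cross-node case: cpu/mem fit on one node, the GPU lives on another.
	fmt.Println(status(nodeFit(nodes, req))) // prints "warn"
}
```

With independent flags the same input would have reported ok, which is the bug the commit describes.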
…d client (cli#83) (#85)

* feat(cli): auth scaffold - login/logout/auth status + config + backend client (cli#83)

RFC-0001 (backend#830) Phase-1 CLI side, scaffolded ahead of the backend device-grant so it activates the moment backend#835 ships.

- internal/config: ~/.tracebloc config store (0600, atomic write): backend env + user token + active client. Fully functional + unit-tested.
- internal/api: backend REST client. CLIENT_ENV -> {dev,stg,prod} base URL (matches the installer's _backend_url); proxy + CA aware (honors HTTP(S)_PROXY / NO_PROXY + the system cert pool, for corporate-proxy networks); the RFC 8628 device-flow methods (RequestDeviceCode + PollToken with the authorization_pending / slow_down / expired_token / access_denied states). Unit-tested via httptest.
- internal/cli: `tracebloc login` (device flow: show URL + code, poll, store the token), `logout`, `auth status`. `client create/list/use` are stubbed (cli#84; they need the user token from login + provisioning backend#836).

login calls /device/code + /device/token, which land in backend#835; until then it reports that the backend doesn't support browser sign-in yet.

Builds, gofmt-clean, unit-tested (config round-trip + 0600 mode; api URL map + device-flow poll states); `tracebloc --help` lists the new verbs.

Part of cli#83 / backend#830 (end-to-end login activates with backend#835).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(cli): authenticate with Bearer + verify token on login (cli#83) (#86)

* feat(cli): authenticate with Bearer + verify token on login (cli#83)

Completes `tracebloc login` against the now-built device-grant endpoints (backend#846). The token the flow issues is a ClientAccessToken, which the backend authenticates as `Authorization: Bearer` (ClientAccessTokenAuthentication, backend#835), not the legacy DRF `Token` scheme the client was sending on authenticated requests, which would have failed to authenticate a logged-in token.

- internal/api: authenticated requests now send `Bearer <token>` (was `Token`); add get() + WhoAmI() (GET /userinfo/) to confirm the token + fetch the account.
- login now verifies the freshly-issued token (best-effort) and stores/shows the account ("Signed in as you@co.com"); a failed lookup never fails a valid sign-in.
- tests: WhoAmI sends Bearer + parses the identity; a 401 surfaces as an APIError.

The device-flow contract (paths/fields/error codes) was already aligned with backend#846: verified, unchanged. Stacked on cli#85 (auth scaffold). go build/vet/test green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(cli): cover the login command end-to-end + add test seams (cli#83)

internal/api was unit-tested, but the login / logout / auth status COMMANDS weren't. Adds auth_test.go driving the full device-flow command against an httptest backend whose shapes match backend#846, so it also guards the CLI<->backend contract that the Token->Bearer fix corrected:

- login: device_code -> authorization_pending -> token -> WhoAmI(Bearer) -> "Signed in as ..." with config persisted; the 404 "unsupported backend" gate (asserts no token is stored); access_denied.
- logout clears the token; auth status (signed-in + not-signed-in).

Two unexported test seams in auth.go, newAPIClient (point at an httptest server) and pollAfter (fire the poll immediately), since the flow otherwise makes real HTTP calls on a timer. go build / vet / test green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(cli): client create / list / use commands (#84)

RFC-0001 P1. Adds the `client` subtree the login flow hands off to: provision a tracebloc client for this machine, list the account's clients, and attach this machine to an existing one.

- internal/slug: Go port of RFC-0001 Appendix B (backend common/utils/slug.py): DNS-1123 slugify (NFKD via x/text) + collision suffix + empty-slug guard, kept in lock-step with the backend that validates the result.
- internal/api: CreateClient / ListClients / ListClientAdmins against /edge-device/, Bearer-authed (backend#836).
- internal/cli/client.go: create (--name / --location / --yes), list (ls), use <id>. The derived namespace is shown for confirmation; location is required (never silent-empty); a 403 surfaces the ask-an-admin path (backend#836); the generated machine credential is printed once.

Location auto-detect (cloud-metadata / GeoIP suggested default) is a fast-follow; this PR takes --location or prompts for it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(cli): paginate client list + gather-then-review + parity/interactive tests

Self-review follow-up on the client commands:
- api: ListClients now follows DRF `next` to the end (was page-1 only), so `list`, `use <id>`, and create-time collision detection see every client in the account, not just the first page.
- cli: `create` gathers name + location first, then shows one review + a single confirm (was confirm-mid-flow); matches the dataset-push interactive flow.
- tests: committed slug golden-parity test (24 pairs verified byte-identical against the Python slugify_dns1123, incl. NFKD ligatures/fractions/roman/fullwidth); interactive create + cancel via the prompter seam; paginated list; collision-suffix end-to-end.
- slug: doc-note the redundant dash-collapse (mirrors slug.py) and the ""/None fallback divergence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t check) (#103)

The CI "Schema drift check" (scripts/sync-schema.sh --check) was failing on develop itself and therefore every PR: the vendored internal/schema/ingest.v1.json had drifted from data-ingestors master, which added the `causal_language_modeling` task category (data-ingestors#805: the enum + a self-supervised "requires texts, not label" conditional).

- Re-vendored the schema (sync-schema.sh) → --check now passes.
- Registered the new category in internal/push/category.go as recognize-but-not-yet-CLI-supported (CLISupported:false + UnsupportedNote): the CLI's discover/build for its raw-.txt / prompt\tcompletion `texts` layout isn't implemented, so push cleanly reports it as pending rather than leaving a schema<->registry gap (the cli#74 drift class the registry exists to prevent). Updated the parity test.

Scope: schema re-vendor + the registry recognition only. Full CLI push support for causal_language_modeling (discover/build) is a follow-up feature, not this PR.

go build/vet/test ./... green; drift check green.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-create + 409 (#84) (#102)

* feat(cli): client create reads the cluster anchor - idempotent get-or-create + 409 (#84)

RFC-0001 §7.2 / backend#883: `client create` now reads the cluster's kube-system UID and sends it as cluster_id, so the backend does get-or-create keyed on it.

- Reads the anchor via a new cluster.ClusterID (kube-system namespace UID) behind --kubeconfig/--context flags. Best-effort + never-silent: if the cluster isn't reachable it provisions WITHOUT an anchor (a plain mint) and says so.
- api.CreateClient returns adopted (HTTP 200) vs minted (201): an idempotent re-run on the same cluster adopts the existing client (no new credential printed) instead of duplicating; a 409 → a clear "registered to a different account" (cluster_conflict).
- Adds api.BackfillClusterID (PATCH /edge-device/<id>/) for the adopt-backfill path (the installer #838 orchestrates the full R7 flow).

Scope: anchor + idempotency only. never-show (writing the credential into the cluster secret) and the R7 in-cluster TB_CLIENT_ID backfill orchestration land with the installer reorder (#838); the mint-time credential print stays as the interim.

Tests: cluster.clusterIDFrom (fake clientset); api CreateClient mint/adopt/409 + BackfillClusterID; cli create anchor-mint / adopt-idempotent / 409 / no-cluster-warns. go build/vet/test ./... green (Go 1.26).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(cli): bound the anchor read + make adopt save-failure non-fatal

Two review fixes folded into the create-anchor work (#84):
- cluster.ClusterID: cap the best-effort kube-system read with an 8s rest.Config timeout. A kubeconfig pointing at an unreachable API server would otherwise hang the GET for the OS TCP timeout; now `client create` degrades to a non-anchored mint promptly instead of stalling before the review prompt.
- cli client create: on an idempotent adopt, print the result before saving the active-client pointer and treat a save failure as a hint (mirroring the mint path), so a config-save error can't bury the "adopted it" message or the recovery hint.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>
… the installer (#84) (#104)

* feat(cli): client create --credential-file - write the machine credential for the installer (#84)

Adds `tracebloc client create --credential-file PATH`: instead of printing the minted credential, write it to PATH (mode 0600) as a sourceable env file the installer reorder (#838) consumes, so the secret never hits the terminal (RFC §9 "secure by invisibility" / never-show, deferred here from cli#102).

- Mint (201): writes TRACEBLOC_CLIENT_ID + TRACEBLOC_CLIENT_PASSWORD + TB_NAMESPACE (0600) and suppresses the stdout credential print. Write failure is fatal (the credential is the only copy).
- Adopt (200): writes TRACEBLOC_CLIENT_ID + TB_NAMESPACE + TRACEBLOC_CLIENT_ADOPTED=1 (no password; the existing one stands, write-only on the backend); the installer reconciles the existing release rather than expecting a fresh credential.
- Without the flag: behaviour unchanged (the interim credential print).

Unblocks the #838 installer reorder (login -> create -> feed the chart). Tests cover mint (0600 + sourceable + never-printed) and adopt (id+ns+marker, no password). go build/vet/test ./... green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(cli): force 0600 on credential file via temp+rename (#84)

writeClientCredential used os.WriteFile(path, ..., 0o600), but WriteFile only applies its perm bits when it *creates* the file; over a pre-existing target it truncates and writes WITHOUT changing the mode. So a stale file, or one an attacker pre-creates world-readable, at --credential-file would receive the minted password (the only copy) at its old, possibly 0644 mode, silently breaking the flag's own 0600/never-show contract (RFC §9). Verified: a 0644 target stays 0644 after the write.

Write to a 0600 temp file in the target dir and atomically rename over the path instead. CreateTemp is 0600 by construction, so the guarantee holds unconditionally; rename is atomic (no half-written credential) and the final write never follows a symlink planted at the target.

Tests:
- preexisting-perms: a 0644 target ends up 0600 (locks in the fix).
- write-fail-fatal: an unwritable target surfaces an error, never a silent drop (the credential is the only copy).
- mint never-show: also assert the password VALUE is absent from stdout, not just the literal "password"/"Machine credential" strings.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>
* fix(api): error on unparseable pagination next, don't silently truncate (#106)

ListClients followed DRF `next` via nextPath, which returned "" for BOTH an empty link (end of pages) and an unparseable one, so a non-empty `next` the server sends that url.Parse rejects silently ended the loop, and ListClients returned only the pages seen so far with a nil error. list / `use` / namespace-collision checks would then miss clients with no signal.

nextPath now returns (string, error): "" + nil for an empty link, an error for a non-empty link that won't parse. Trigger is unlikely (DRF emits well-formed URLs, url.Parse is lenient), but the failure mode (silent partial list) is the wrong one for a correctness-sensitive call.

Tests: pagination still followed end-to-end (page 1 → 2 → done); an unparseable next link is now a hard error, not a truncation.

Bugbot: 8dadb5c2-804a-48ed-bc81-eb14e6317be1 (v0.4.0 RC, #107)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(cli): route all known-but-unsupported categories to the pending note (#106)

The dataset-push category gate only special-cased known-but-unsupported *image* categories (`case push.IsImage`), so a registry-known non-image category that isn't CLI-supported yet, `causal_language_modeling` (FamilyText, CLISupported:false, with a real UnsupportedNote), fell to the default branch and was reported as "isn't a recognized task category". It IS recognized; it's pending support.

Swap the gate to `case push.IsKnown`: supported categories are already caught by the prior case, so IsKnown here means known-but-unsupported (image or text), all routed through the registry's per-category pending-support note. The default branch is left for genuinely unknown/typo'd categories. Message-only (exit code was already 2).

Test: causal_language_modeling now gets the pending-support note, not the unrecognized-category message. (The existing exit-2 test didn't assert the message, which is how this slipped through.)

Bugbot: 16f5b945-5d67-4201-8bc6-1f6baf633672 (v0.4.0 RC, #107)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
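The (string, error) contract for the pagination link in the first commit above can be illustrated as follows. The name `nextPathSketch`, the example host, and returning the path plus query via `RequestURI` are assumptions of this sketch; the commit only specifies the empty-versus-unparseable distinction.

```go
package main

import (
	"fmt"
	"net/url"
)

// nextPathSketch turns a DRF-style `next` link into a request path.
// An empty link means "no more pages"; an unparseable one is an error,
// so the caller can never mistake a broken link for the end of the list.
func nextPathSketch(next string) (string, error) {
	if next == "" {
		return "", nil
	}
	u, err := url.Parse(next)
	if err != nil {
		return "", fmt.Errorf("unparseable pagination link %q: %w", next, err)
	}
	// Assumption: the client re-issues the request against its own base URL,
	// so only the path and query are kept.
	return u.RequestURI(), nil
}

func main() {
	for _, link := range []string{
		"",
		"https://api.example.com/edge-device/?page=2", // hypothetical host
		"http://[::1", // malformed: unterminated IPv6 literal
	} {
		p, err := nextPathSketch(link)
		fmt.Printf("next=%q path=%q err=%v\n", link, p, err)
	}
}
```

A pagination loop built on this stops on `"", nil` and aborts on a non-nil error, which is the distinction the original single-value return could not express.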
#109)

* fix(cli): clear active_client_id on logout (#106)

logout cleared the token and email but left active_client_id in ~/.tracebloc/config.json. Since that pointer is account-scoped, a later `login` as a different user inherited the stale id: `auth status` and `client list` would surface the previous account's active client until the user ran `client create`/`client use` again. Clear it alongside the token/email so logout fully drops local session state.

Bugbot: "Stale active client after logout" (Medium, v0.4.0 RC, #107)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(push): add token_classification to the registry + pin schema parity (#106)

The re-vendored ingest.v1.json (#103) accepts token_classification, but the category registry didn't list it, so `dataset push --category=token_classification` hit the "isn't a recognized task category" path despite being schema-valid (the same misrouting just fixed for causal_language_modeling). Added it as a known, not-yet-CLI-supported FamilyText category with a pending-support note (the safe default: it was never pushable, so this only improves the message; flip to supported when the texts/token-label staging lands).

Root-cause guard: the existing registry tests only pinned the registry against a hand-written list, which stayed self-consistent while drifting from the schema. Added TestRegistryCoversSchemaCategories, which parses the embedded schema's category enum and asserts every entry is registry-known, so any future schema-only category is caught here, not in the next review pass. (The reverse, a registry-only known-unsupported alias like instance_segmentation, is allowed: it's gated out before schema validation.)

Bugbot: "Missing token_classification registry entry" (Medium, v0.4.0 RC, #107)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 3881746.
…es (#106) (#110)

* fix(cli): back off device-flow poll by 5s on slow_down (RFC 8628) (#106)

On `slow_down`, runLogin bumped the poll interval by 1s (`interval++`). RFC 8628 §3.5 requires increasing it by 5s for that and all subsequent polls, so the CLI kept polling too aggressively after the server asked it to back off.

Test captures the durations handed to the pollAfter seam: post-slow_down wait is now 10s (5+5), not 6s.

Bugbot: "Device flow slow_down interval" (Medium, v0.4.0 RC, #107)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(doctor): resolve CLIENT_ENV case-insensitively in backendHost (#106)

backendHost switched on CLIENT_ENV case-sensitively, but the API client (api.ResolveEnv/BaseURL) lowercases env values. A non-lowercase CLIENT_ENV on the edge box (e.g. "DEV") fell through to the prod default, so `cluster doctor` probed api.tracebloc.io even when the cluster targeted dev/stg. Normalize with ToLower+TrimSpace before the switch.

Test extends TestBackendHost with "DEV"/"Stg"/" dev " cases.

Bugbot: "Doctor backend env casing" (Low, v0.4.0 RC, #107)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(api): guard the bare-array list decode to the first page (#106)

ListClients attempted the unpaginated bare-array decode on every iteration. A bare array is only valid as the sole response (a paginated chain is a {next,results} object on every page), so a stray bare body mid-chain could silently end the loop and drop earlier pages. Guard the bare decode to pageNum 0.

Test: a bare-array response still returns the full list.

Found in the proactive RC review (low; latent, since DRF doesn't mix shapes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(cli): exclude this cluster's own client from create collision check (#106)

`client create` built the namespace-collision set from ALL clients, including the one already anchored to this cluster. On an idempotent re-run with the same --name, that bumped the derived slug (lab-one → lab-one-2) and showed it in the review, but the backend adopts on cluster_id and returns the original namespace, so the review contradicted the actual outcome. Skip the client whose cluster_id matches this cluster's anchor.

Test: a re-run review no longer shows a bumped namespace.

Found in the proactive RC review (low; cosmetic, since backend state was correct).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
LukasWodka approved these changes Jun 24, 2026

Promotes `develop` → `main` to cut v0.4.0. Tracking: #106. `develop` is a strict superset of `main` (main-ahead commits are prior promotion merges only); CI on develop HEAD (722d2cf) is green.

What ships (12 commits since v0.3.1)
`login`/`logout`/`auth status` (#85), `client create`/`list`/`use` (#92), cluster-anchor idempotency + 409 (#102), `client create --credential-file` (#104).

Interim state (additive / opt-in, not blockers)
`client create` still prints the credential by default; `--credential-file` is the building block for the installer reorder (#838, not yet built). Location auto-detect (#93) + RFC (#55) still draft.

After merge
Tag `v0.4.0` on `main` → `release.yml` builds multiarch + cosign-signs + publishes.

🤖 Generated with Claude Code
Note
Medium Risk
Large feature surface (auth tokens, credential files, backend provisioning) plus behavioral changes to dataset rm teardown; well covered by tests but touches security-sensitive paths and in-cluster destructive ops.
Overview
v0.4.0 is a develop→main promotion that ships RFC-0001-style browser auth and machine provisioning, a cluster health command, and several correctness fixes around datasets and CI.
Auth & clients: New `login`/`logout`/`auth status` use OAuth device flow (RFC 8628) against the public backend API, persisting tokens in `~/.tracebloc` (0600). `client create`, `list`, and `use` provision edge clients with optional cluster-id anchoring (kube-system UID) for idempotent re-runs, `--credential-file` for installer integration, and clearer handling of 403/409. Supporting packages add `internal/api`, `internal/config`, `internal/slug` (Python parity), and `cluster.ClusterID`.

Cluster doctor: `tracebloc cluster doctor` runs read-only checks (release discovery, pods, PVC, proxy wiring, backend egress from the CLI host, requests-proxy, node fit, image pull secrets) with ✔/⚠/✖ output and remedies.

Dataset / push: Task categories move to a single CategorySpec registry synced with `ingest.v1.json` (new categories, `target_size` width×height docs). `dataset push` gates on registry support with accurate messages for known-but-unsupported categories. `dataset rm` teardown deletes PVC files via a short-lived stage-identity pod (uid/fsGroup 65532) instead of exec into jobs-manager, fixing permission failures on staged files (#259).

CI: Workflows gain concurrency (cancel in-flight PR runs) and job timeouts; a public PII gate caller workflow is added.
Reviewed by Cursor Bugbot for commit 818db1c. Bugbot is set up for automated code reviews on this repo.