release: promote develop → main (v0.4.0 — auth + client provisioning, cluster doctor) (#106)#107

Merged: saadqbal merged 15 commits into main from develop on Jun 24, 2026

saadqbal (Collaborator) commented Jun 24, 2026

Promotes develop → main to cut v0.4.0. Tracking: #106.

develop is a strict superset of main (main-ahead commits are prior promotion merges only); CI on develop HEAD (722d2cf) is green.

What ships (12 commits since v0.3.1)

Interim state (additive / opt-in — not blockers)

client create still prints the credential by default; --credential-file is the building block for the installer reorder (#838, not yet built). Location auto-detect (#93) + RFC (#55) still draft.

After merge

Tag v0.4.0 on main → release.yml builds multiarch + cosign-signs + publishes.

🤖 Generated with Claude Code


Note

Medium Risk
Large feature surface (auth tokens, credential files, backend provisioning) plus behavioral changes to dataset rm teardown; well covered by tests but touches security-sensitive paths and in-cluster destructive ops.

Overview
v0.4.0 is a develop→main promotion that ships RFC-0001-style browser auth and machine provisioning, a cluster health command, and several correctness fixes around datasets and CI.

Auth & clients: New login / logout / auth status use OAuth device flow (RFC 8628) against the public backend API, persisting tokens in ~/.tracebloc (0600). client create, list, and use provision edge clients with optional cluster-id anchoring (kube-system UID) for idempotent re-runs, --credential-file for installer integration, and clearer handling of 403/409. Supporting packages add internal/api, internal/config, internal/slug (Python parity), and cluster.ClusterID.

Cluster doctor: tracebloc cluster doctor runs read-only checks (release discovery, pods, PVC, proxy wiring, backend egress from the CLI host, requests-proxy, node fit, image pull secrets) with ✔/⚠/✖ output and remedies.

Dataset / push: Task categories move to a single CategorySpec registry synced with ingest.v1.json (new categories, target_size width×height docs). dataset push gates on registry support with accurate messages for known-but-unsupported categories. dataset rm teardown deletes PVC files via a short-lived stage-identity pod (uid/fsGroup 65532) instead of exec into jobs-manager, fixing permission failures on staged files (#259).

CI: Workflows gain concurrency (cancel in-flight PR runs) and job timeouts; a public PII gate caller workflow is added.

Reviewed by Cursor Bugbot for commit 818db1c. Bugbot is set up for automated code reviews on this repo.

LukasWodka and others added 12 commits June 17, 2026 18:34
* ci(security): add public-repo PII gate caller

Blocks PRs that leak customer/partner names or secrets in title/body/commits.
Calls the reusable gate in tracebloc/.github. Inactive until the org
PII_DENYLIST secret is set (warns, doesn't block, until then).

* chore(schema): sync ingest.v1.json from data-ingestors master

The vendored copy at internal/schema/ingest.v1.json had drifted from
upstream, failing the `scripts/sync-schema.sh --check` CI gate on every
PR. Upstream replaced `instance_segmentation` with `token_classification`
across the category enums, updated the texts/resolution descriptions
([width, height] order + PIL note), and added the token_classification
`texts` and masked_language_modeling no-`label` conditional rules.

Regenerated via `scripts/sync-schema.sh`; `--check` is now clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: shujaat hasan <shujaathasan@shujaats-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Adds a per-ref concurrency group (cancels superseded PR runs only;
push/tag/schedule never cancelled) and timeout-minutes to every job, so
stale PR pushes stop wasting runner time and hung steps (kind boot, cosign)
can't run to the 6h default. No change to job behavior.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
chore(schema): re-sync vendored ingest.v1.json from data-ingestors master
…manager (#259) (#78)

`tracebloc dataset rm` dropped the table but failed to delete the dataset's
staging files on the shared PVC:

  rm: cannot remove '/data/shared/.tracebloc-staging/<t>/labels.csv': Permission denied

Root cause: the staging files are written by the CLI's ephemeral stage pod as
uid 65532 (+ fsGroup 65532), but the teardown exec'd `rm` inside the long-lived
jobs-manager pod, which runs as a different non-root uid with no shared fsGroup.
A non-65532 uid cannot delete 65532-owned files in a non-group-writable dir, so
the rm hit EACCES and left orphans. The "re-run to clean up" advice was a dead
end — the same permission error every time.

Fix: run the teardown `rm` from a short-lived pod that mirrors the stage pod's
identity (uid 65532 + fsGroup 65532, shared PVC mounted), reusing the existing
BuildStagePodSpec / CreateStagePod / WaitForStagePodReady / DeleteStagePod
machinery. That pod OWNS the staging files it deletes, so it works by ownership
on hostPath (where fsGroup is a no-op, kubernetes/kubernetes#138411) and CSI
alike. Fully fixes tabular datasets (no sidecar files) on every volume type.

Also:
- Teardown now takes an injectable Executor (matching push.Stage), enabling a
  regression test that pins "rm runs in a uid-65532 stage pod, not jobs-manager".
- dataset_rm: drop the misleading "re-run completes the cleanup" claim; the table
  DROP is idempotent, and if file removal keeps failing, point to node-side cleanup.

Refs #259. The image/sidecar case (ingestor's /data/shared/<table> written as
uid 65534) on hostPath still needs the documented complement (ingestor fsGroup +
group-writable DEST_PATH in client-runtime/data-ingestors).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>
#89)

* feat(#88): tracebloc cluster doctor — live-cluster health checks (WS3)

Adds `tracebloc cluster doctor`, a read-only health sweep of a running
tracebloc client cluster that prints ✔/⚠/✖ per check with a remedy — so a
customer can diagnose "why isn't my experiment running?" without tracebloc
shelling into their cluster (epic client-runtime#116, WS3).

Sibling of `cluster info` (which the code's own comment anticipated); reuses
its kubeconfig/context/namespace flags + cluster.Load / NewClientset /
DiscoverParentRelease, the ui.Printer status vocabulary, and exitError.

Lean MVP — 6 checks:
- cluster reachable      (parent client release discovered)
- pod health            (crash-loops / long-Pending — local complement to #117)
- dataset volume        (shared PVC Bound, via cluster.DiscoverSharedPVC)
- proxy configuration   (in-cluster requests/egress proxy wiring)
- backend egress        (host-side, proxy-aware probe; in-cluster probe = follow-up)
- Service Bus egress    (requests-proxy readiness — the experiments-queue broker)

internal/doctor is a standalone package with injectable network probes,
82% covered via client-go's fake clientset. Every check is independent and
best-effort (one failure never hides the others); the worst status sets the
exit code (0 ok/warn, 2 failures, 3 kubeconfig).
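
The "worst status sets the exit code" rule can be sketched as below. This is a minimal illustration, not the code in internal/doctor: the `status`, `result`, and `exitCode` names are hypothetical, and the kubeconfig exit code (3) is deliberately left out because the message does not say where it is raised.

```go
package main

import "fmt"

// status is ordered so that a larger value is a worse outcome.
type status int

const (
	statusOK status = iota
	statusWarn
	statusFail
)

// result is a hypothetical stand-in for one doctor check's outcome.
type result struct {
	name   string
	status status
}

// exitCode maps the worst check status to a process exit code:
// 0 for ok or warn, 2 when any check failed. Every result is inspected,
// so one failure never hides the others.
func exitCode(results []result) int {
	worst := statusOK
	for _, r := range results {
		if r.status > worst {
			worst = r.status
		}
	}
	if worst == statusFail {
		return 2
	}
	return 0
}

func main() {
	results := []result{
		{"cluster reachable", statusOK},
		{"pod health", statusFail},
		{"dataset volume", statusWarn},
	}
	fmt.Println(exitCode(results))
}
```

Keeping the statuses ordered means the aggregation is a single max over independent checks, which matches the best-effort design described above.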

Out of scope (already shipped / follow-up): support-bundle ships as the
installer's `--diagnose`; node-resources-vs-job-request and image-pullability
are the broader cut.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#88): don't flag Succeeded/recovered pods as crash-looping (Bugbot)

podCrashLooping flagged any pod with RestartCount>=3 — including Succeeded
job pods that retried before completing, and Running pods that recovered
after past restarts — producing a false ✖ when nothing is actually unhealthy.

Guard terminal phases (Succeeded/Failed) and require the container to not be
currently running, mirroring the controller's recovered-container fix
(client-runtime#117). Adds regression tests for both false-positive cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#88): detect init-container crash-loops + nil-release prefixed deploys (Bugbot)

Two Medium Bugbot findings on the previous commit:

- podCrashLooping ignored InitContainerStatuses, so an init container stuck in
  CrashLoopBackOff read as a Pending warning instead of a failure even though
  the pod cannot start. It now checks init + app containers, and detects only
  active CrashLoopBackOff — dropping the RestartCount heuristic entirely, since
  that was the source of the earlier Succeeded/recovered-pod false positives.

- requestsProxyNames/jobsManagerNames only probed unprefixed names when the
  release was nil (e.g. DiscoverParentRelease errored on multiple releases),
  falsely reporting missing wiring even though <release>-requests-proxy exists.
  Added findDeployment: exact-name Get, then a namespace List + name-suffix
  fallback that resolves the prefixed name without knowing the release.
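
The first finding's rule (init and app containers, active CrashLoopBackOff only, no RestartCount heuristic) can be sketched as follows. The `pod` and `containerStatus` types here are simplified stand-ins for the Kubernetes API types, introduced only for illustration; the terminal-phase guard is carried over from the previous commit on the assumption that it was kept.

```go
package main

import "fmt"

// containerStatus is a simplified stand-in for the Kubernetes container
// status; waitingReason is empty when the container is not waiting.
type containerStatus struct {
	waitingReason string
}

// pod is a simplified stand-in for the pod status fields the check reads.
type pod struct {
	phase          string
	initContainers []containerStatus
	containers     []containerStatus
}

// podCrashLooping reports whether any init or app container is currently
// waiting in CrashLoopBackOff. Restart counts are ignored on purpose:
// a recovered or completed pod may have restarted many times.
func podCrashLooping(p pod) bool {
	if p.phase == "Succeeded" || p.phase == "Failed" {
		return false // terminal phases are never "actively" crash-looping
	}
	for _, group := range [][]containerStatus{p.initContainers, p.containers} {
		for _, c := range group {
			if c.waitingReason == "CrashLoopBackOff" {
				return true
			}
		}
	}
	return false
}

func main() {
	stuckInit := pod{
		phase:          "Pending",
		initContainers: []containerStatus{{waitingReason: "CrashLoopBackOff"}},
	}
	fmt.Println(podCrashLooping(stuckInit))
}
```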

Adds regression tests: init-crash-loop, nil-release-finds-prefixed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs/polish(#88): address Arturo's post-approval review nits

- Tunables comment: they're conservative package consts, not vars; point at
  Options (like HTTPProbe) for any future runtime tuning.
- checkProxy WARN: note that a REQUESTS_PROXY_URL set via a configMap/secret
  ref reads as empty here (jobsManagerEnv reads only literal env).
- httpProbe: a successful connection means reachable — discard the body-close
  error rather than reporting it as "unreachable".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#88): suffix fallback must not pick across multiple releases (Bugbot)

findDeployment's suffix fallback (added for the nil-release case) picked the
first suffix-matching deployment, so in a namespace running multiple parent
releases, jobsManagerEnv and checkRequestsProxy could resolve to different
releases in a single run — presenting mixed data as fact.

Resolve the fallback only when exactly one deployment carries the suffix;
with more than one (the multi-release case DiscoverParentRelease already
refuses to disambiguate) return nil and let the check report can't-determine.
The single-release nil-discovery case still resolves.

Adds TestCheckRequestsProxy_NilReleaseAmbiguous.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#88): tie deployment lookup to the discovered release (Bugbot)

When a release was discovered, findDeployment's fallbacks could still match a
DIFFERENT release's component (or a stray bare one), so the Service Bus check
went green on the wrong requests-proxy while the discovered release's was
missing.

findDeployment now takes the release directly. When it's known, it accepts only
"<release>-<suffix>" or a bare "<suffix>" whose app.kubernetes.io/instance label
ties it to that release — never another release's, never an unattributable bare
one. The release-unknown path keeps the exactly-one-suffix-match rule (returns
nil on >1, so checks report can't-determine rather than guess).
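
A rough sketch of the two lookup paths described above, using a hypothetical `deployment` struct in place of the client-go types (the `instance` field stands for the app.kubernetes.io/instance label). The real findDeployment talks to the API server; this only models the matching rule.

```go
package main

import (
	"fmt"
	"strings"
)

// deployment is a simplified stand-in: name plus the value of the
// app.kubernetes.io/instance label.
type deployment struct {
	name     string
	instance string
}

// findDeployment returns the deployment for suffix, or nil when it cannot
// be determined without guessing.
func findDeployment(deps []deployment, release, suffix string) *deployment {
	if release != "" {
		// Release known: accept only "<release>-<suffix>", or a bare
		// "<suffix>" whose instance label ties it to this release.
		for i := range deps {
			d := &deps[i]
			if d.name == release+"-"+suffix {
				return d
			}
			if d.name == suffix && d.instance == release {
				return d
			}
		}
		return nil
	}
	// Release unknown: resolve only when exactly one deployment matches.
	var match *deployment
	for i := range deps {
		d := &deps[i]
		if d.name == suffix || strings.HasSuffix(d.name, "-"+suffix) {
			if match != nil {
				return nil // ambiguous: more than one candidate
			}
			match = d
		}
	}
	return match
}

func main() {
	deps := []deployment{
		{name: "alpha-requests-proxy", instance: "alpha"},
		{name: "beta-requests-proxy", instance: "beta"},
	}
	fmt.Println(findDeployment(deps, "alpha", "requests-proxy").name)
	fmt.Println(findDeployment(deps, "", "requests-proxy") == nil)
}
```

Returning nil in the ambiguous case lets the caller report "can't determine" instead of presenting one release's state as another's.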

Folds the jobsManagerNames/requestsProxyNames candidate builders into
findDeployment. Adds tests: other-release-ignored, bare-name-tied-by-label.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…patch (#74) (#80)

The CLI enumerated task categories in four hand-maintained places that had
drifted: the `--category` help listed 5 of 9 (#74), the push accept-gate
hand-listed the supported set twice, the interactive picker kept its own
list, and internal/push/category.go held four separate family maps.

Consolidate into one CategorySpec registry (internal/push/category.go):
each category's family, label, regression-class flag, and CLI-support
status lives in one ordered table. The family predicates
(IsImage/IsTabular/IsText/IsRegressionClass), the `--category` help, the
gate's "Supported:" lists, and the interactive picker now all derive from
it, so the enumerations can't drift apart again. The help now lists all 9
supported categories.
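
To illustrate the single-table idea, here is a minimal sketch of a registry from which predicates and help text are derived. Field names follow the ones mentioned in this PR (CLISupported, UnsupportedNote), but the struct layout, the `lookup` and `supportedHelp` helpers, and the two `example_*` category names are assumptions made for the example, not the contents of internal/push/category.go.

```go
package main

import (
	"fmt"
	"strings"
)

type Family int

const (
	FamilyImage Family = iota
	FamilyTabular
	FamilyText
)

// CategorySpec holds everything the CLI needs to know about one category.
type CategorySpec struct {
	Name            string
	Family          Family
	CLISupported    bool
	UnsupportedNote string
}

// registry is the one ordered table; the example_* entries are hypothetical.
var registry = []CategorySpec{
	{Name: "example_tabular_task", Family: FamilyTabular, CLISupported: true},
	{Name: "example_image_task", Family: FamilyImage, CLISupported: true},
	{Name: "instance_segmentation", Family: FamilyImage, UnsupportedNote: "pending CLI support"},
}

func lookup(name string) (CategorySpec, bool) {
	for _, s := range registry {
		if s.Name == name {
			return s, true
		}
	}
	return CategorySpec{}, false
}

// IsCLISupported and IsImage are derived from the table, never hand-listed.
func IsCLISupported(name string) bool {
	s, ok := lookup(name)
	return ok && s.CLISupported
}

func IsImage(name string) bool {
	s, ok := lookup(name)
	return ok && s.Family == FamilyImage
}

// supportedHelp builds the help text's category list from the same table.
func supportedHelp() string {
	var names []string
	for _, s := range registry {
		if s.CLISupported {
			names = append(names, s.Name)
		}
	}
	return strings.Join(names, ", ")
}

func main() {
	fmt.Println("Supported:", supportedHelp())
	fmt.Println(IsImage("instance_segmentation"), IsCLISupported("instance_segmentation"))
}
```

Because every consumer reads the same slice, adding a category is a one-line change and the help text cannot fall behind the gate.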

Behaviour-preserving: the gate accepts/rejects exactly the same set
(IsCLISupported == the prior nine-category condition); only the help text
and the now-registry-derived error messages change. semantic_/
instance_segmentation stay known-but-unsupported, each with a per-category
UnsupportedNote.

Adds a registry parity + predicate-derivation test (the anti-drift guard).
First phase of the CLI ingestion consolidation epic (backend#828).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Shujaat Hasan <shujaat@tracebloc.io>
…up) (#91)

* feat(#90): cluster doctor — node-fit + image-pull checks (WS3 follow-up)

Two read-only checks added to `tracebloc cluster doctor` (follow-up to #89):

- Node capacity: parses the resource requests jobs-manager stamps on spawned
  training jobs (RESOURCE_REQUESTS / GPU_REQUESTS env) and checks at least one
  Ready node can fit them — the "Pending forever, no node big enough" class.
  GPU is soft: a hard ✖ only on cpu/mem, and a ⚠ when a GPU is requested but no
  node exposes it (jobs-manager has a GPU->CPU fallback).
- Image pull secret: when jobs-manager references a registry pull secret,
  verifies it exists and is a well-formed dockerconfigjson so private-image
  pulls don't ImagePullBackOff.

Both read-only/best-effort, tested with client-go's fake clientset. The
in-cluster egress probe (the third deferred check on #90) is intentionally a
separate PR — it needs a port-forward/exec mechanism, not this read-only pattern.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#90): node-fit must require cpu+mem+GPU on ONE node (Bugbot)

checkNodeFit set cpuMemFits and gpuFits independently, so they could come from
different nodes — reporting OK even when no single node had cpu+memory+GPU
together (a GPU job would then stay Pending). It now evaluates each node as a
whole: cpuMemFits (any node) drives the hard fail; fullFits (one node with
cpu+mem AND the GPU) drives the ok/warn split. Adds regression tests for the
cross-node and single-node-fits cases.
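
The per-node evaluation can be sketched like this. The `node` and `request` structs and the string results are hypothetical simplifications; the actual check reads Kubernetes node allocatable resources and the jobs-manager env.

```go
package main

import "fmt"

// node and request are simplified stand-ins for node allocatable
// resources and a training job's resource requests.
type node struct {
	cpuMilli int64
	memMiB   int64
	gpus     int64
}

type request struct {
	cpuMilli int64
	memMiB   int64
	gpus     int64
}

// checkNodeFit evaluates each node as a whole:
//   - "fail" when no node fits cpu+memory,
//   - "warn" when some node fits cpu+memory but none also has the GPUs,
//   - "ok" when a single node fits cpu, memory, and GPUs together.
func checkNodeFit(nodes []node, req request) string {
	cpuMemFits, fullFits := false, false
	for _, n := range nodes {
		if n.cpuMilli >= req.cpuMilli && n.memMiB >= req.memMiB {
			cpuMemFits = true
			if n.gpus >= req.gpus {
				fullFits = true
			}
		}
	}
	switch {
	case !cpuMemFits:
		return "fail"
	case !fullFits:
		return "warn"
	default:
		return "ok"
	}
}

func main() {
	// Cross-node case: the big node has no GPU, the GPU node is too small.
	nodes := []node{
		{cpuMilli: 8000, memMiB: 32768, gpus: 0},
		{cpuMilli: 1000, memMiB: 2048, gpus: 1},
	}
	fmt.Println(checkNodeFit(nodes, request{cpuMilli: 4000, memMiB: 16384, gpus: 1}))
}
```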

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d client (cli#83) (#85)

* feat(cli): auth scaffold — login/logout/auth status + config + backend client (cli#83)

RFC-0001 (backend#830) Phase-1 CLI side, scaffolded ahead of the backend
device-grant so it activates the moment backend#835 ships.

- internal/config: ~/.tracebloc config store (0600, atomic write) — backend
  env + user token + active client. Fully functional + unit-tested.
- internal/api: backend REST client. CLIENT_ENV -> {dev,stg,prod} base URL
  (matches the installer's _backend_url); proxy + CA aware (honors
  HTTP(S)_PROXY / NO_PROXY + the system cert pool, for corporate-proxy
  networks); the RFC 8628 device-flow methods (RequestDeviceCode + PollToken
  with the authorization_pending / slow_down / expired_token / access_denied
  states). Unit-tested via httptest.
- internal/cli: `tracebloc login` (device flow — show URL + code, poll, store
  the token), `logout`, `auth status`. `client create/list/use` are stubbed
  (cli#84 — they need the user token from login + provisioning backend#836).

login calls /device/code + /device/token, which land in backend#835; until
then it reports that the backend doesn't support browser sign-in yet. Builds,
gofmt-clean, unit-tested (config round-trip + 0600 mode; api URL map +
device-flow poll states); `tracebloc --help` lists the new verbs.

Part of cli#83 / backend#830 (end-to-end login activates with backend#835).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(cli): authenticate with Bearer + verify token on login (cli#83) (#86)

* feat(cli): authenticate with Bearer + verify token on login (cli#83)

Completes `tracebloc login` against the now-built device-grant endpoints
(backend#846). The token the flow issues is a ClientAccessToken, which the
backend authenticates as `Authorization: Bearer` (ClientAccessTokenAuthentication,
backend#835) — not the legacy DRF `Token` scheme the client was sending on
authenticated requests, which would have failed to authenticate a logged-in token.

- internal/api: authenticated requests now send `Bearer <token>` (was `Token`);
  add get() + WhoAmI() (GET /userinfo/) to confirm the token + fetch the account.
- login now verifies the freshly-issued token (best-effort) and stores/shows the
  account ("Signed in as you@co.com"); a failed lookup never fails a valid sign-in.
- tests: WhoAmI sends Bearer + parses the identity; a 401 surfaces as an APIError.

The device-flow contract (paths/fields/error codes) was already aligned with
backend#846 — verified, unchanged. Stacked on cli#85 (auth scaffold).
go build/vet/test green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(cli): cover the login command end-to-end + add test seams (cli#83)

internal/api was unit-tested, but the login / logout / auth status COMMANDS
weren't. Adds auth_test.go driving the full device-flow command against an
httptest backend whose shapes match backend#846 — so it also guards the
CLI<->backend contract that the Token->Bearer fix corrected:

- login: device_code -> authorization_pending -> token -> WhoAmI(Bearer) ->
  "Signed in as ..." with config persisted; the 404 "unsupported backend" gate
  (asserts no token is stored); access_denied.
- logout clears the token; auth status (signed-in + not-signed-in).

Two unexported test seams in auth.go — newAPIClient (point at an httptest
server) and pollAfter (fire the poll immediately) — since the flow otherwise
makes real HTTP calls on a timer. go build / vet / test green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(cli): client create / list / use commands (#84)

RFC-0001 P1. Adds the `client` subtree the login flow hands off to:
provision a tracebloc client for this machine, list the account's clients,
and attach this machine to an existing one.

- internal/slug: Go port of RFC-0001 Appendix B (backend common/utils/slug.py)
  — DNS-1123 slugify (NFKD via x/text) + collision suffix + empty-slug guard,
  kept in lock-step with the backend that validates the result.
- internal/api: CreateClient / ListClients / ListClientAdmins against
  /edge-device/, Bearer-authed (backend#836).
- internal/cli/client.go: create (--name / --location / --yes), list (ls),
  use <id>. The derived namespace is shown for confirmation; location is
  required (never silent-empty); a 403 surfaces the ask-an-admin path
  (backend#836); the generated machine credential is printed once.

Location auto-detect (cloud-metadata / GeoIP suggested default) is a
fast-follow — this PR takes --location or prompts for it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(cli): paginate client list + gather-then-review + parity/interactive tests

Self-review follow-up on the client commands:

- api: ListClients now follows DRF `next` to the end (was page-1 only), so
  `list`, `use <id>`, and create-time collision detection see every client in
  the account, not just the first page.
- cli: `create` gathers name + location first, then shows one review + a single
  confirm (was confirm-mid-flow) — matches the dataset-push interactive flow.
- tests: committed slug golden-parity test (24 pairs verified byte-identical
  against the Python slugify_dns1123, incl. NFKD ligatures/fractions/roman/
  fullwidth); interactive create + cancel via the prompter seam; paginated
  list; collision-suffix end-to-end.
- slug: doc-note the redundant dash-collapse (mirrors slug.py) and the
  ""/None fallback divergence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t check) (#103)

The CI "Schema drift check" (scripts/sync-schema.sh --check) was failing on develop
itself and therefore every PR — the vendored internal/schema/ingest.v1.json had
drifted from data-ingestors master, which added the `causal_language_modeling` task
category (data-ingestors#805: the enum + a self-supervised "requires texts, not
label" conditional).

- Re-vendored the schema (sync-schema.sh) → --check now passes.
- Registered the new category in internal/push/category.go as recognize-but-not-yet-
  CLI-supported (CLISupported:false + UnsupportedNote): the CLI's discover/build for
  its raw-.txt / prompt\tcompletion `texts` layout isn't implemented, so push cleanly
  reports it as pending rather than leaving a schema<->registry gap (the cli#74 drift
  class the registry exists to prevent). Updated the parity test.

Scope: schema re-vendor + the registry recognition only. Full CLI push support for
causal_language_modeling (discover/build) is a follow-up feature, not this PR.

go build/vet/test ./... green; drift check green.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-create + 409 (#84) (#102)

* feat(cli): client create reads the cluster anchor — idempotent get-or-create + 409 (#84)

RFC-0001 §7.2 / backend#883: `client create` now reads the cluster's kube-system UID
and sends it as cluster_id, so the backend does get-or-create keyed on it.

- Reads the anchor via a new cluster.ClusterID (kube-system namespace UID) behind
  --kubeconfig/--context flags. Best-effort + never-silent: if the cluster isn't
  reachable it provisions WITHOUT an anchor (a plain mint) and says so.
- api.CreateClient returns adopted (HTTP 200) vs minted (201): an idempotent re-run
  on the same cluster adopts the existing client (no new credential printed) instead
  of duplicating; a 409 → a clear "registered to a different account" (cluster_conflict).
- Adds api.BackfillClusterID (PATCH /edge-device/<id>/) for the adopt-backfill path
  (the installer #838 orchestrates the full R7 flow).

Scope: anchor + idempotency only. never-show (writing the credential into the cluster
secret) and the R7 in-cluster TB_CLIENT_ID backfill orchestration land with the
installer reorder (#838); the mint-time credential print stays as the interim.

Tests: cluster.clusterIDFrom (fake clientset); api CreateClient mint/adopt/409 +
BackfillClusterID; cli create anchor-mint / adopt-idempotent / 409 / no-cluster-warns.
go build/vet/test ./... green (Go 1.26).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(cli): bound the anchor read + make adopt save-failure non-fatal

Two review fixes folded into the create-anchor work (#84):

- cluster.ClusterID: cap the best-effort kube-system read with an 8s
  rest.Config timeout. A kubeconfig pointing at an unreachable API
  server would otherwise hang the GET for the OS TCP timeout; now
  `client create` degrades to a non-anchored mint promptly instead of
  stalling before the review prompt.
- cli client create: on an idempotent adopt, print the result before
  saving the active-client pointer and treat a save failure as a hint
  (mirroring the mint path), so a config-save error can't bury the
  "adopted it" message or the recovery hint.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>
… the installer (#84) (#104)

* feat(cli): client create --credential-file — write the machine credential for the installer (#84)

Adds `tracebloc client create --credential-file PATH`: instead of printing the
minted credential, write it to PATH (mode 0600) as a sourceable env file the
installer reorder (#838) consumes — the secret never hits the terminal (RFC §9
"secure by invisibility" / never-show, deferred here from cli#102).

- Mint (201): writes TRACEBLOC_CLIENT_ID + TRACEBLOC_CLIENT_PASSWORD + TB_NAMESPACE
  (0600) and suppresses the stdout credential print. Write failure is fatal (the
  credential is the only copy).
- Adopt (200): writes TRACEBLOC_CLIENT_ID + TB_NAMESPACE + TRACEBLOC_CLIENT_ADOPTED=1
  (no password — the existing one stands, write-only on the backend); the installer
  reconciles the existing release rather than expecting a fresh credential.
- Without the flag: behaviour unchanged (the interim credential print).

Unblocks the #838 installer reorder (login -> create -> feed the chart). Tests cover
mint (0600 + sourceable + never-printed) and adopt (id+ns+marker, no password).
go build/vet/test ./... green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(cli): force 0600 on credential file via temp+rename (#84)

writeClientCredential used os.WriteFile(path, ..., 0o600), but WriteFile
only applies its perm bits when it *creates* the file — over a pre-existing
target it truncates and writes WITHOUT changing the mode. So a stale file, or
one an attacker pre-creates world-readable, at --credential-file would receive
the minted password (the only copy) at its old, possibly 0644 mode — silently
breaking the flag's own 0600/never-show contract (RFC §9). Verified: a 0644
target stays 0644 after the write.

Write to a 0600 temp file in the target dir and atomically rename over the
path instead. CreateTemp is 0600 by construction, so the guarantee holds
unconditionally; rename is atomic (no half-written credential) and the final
write never follows a symlink planted at the target.
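
A minimal sketch of the temp-file-plus-rename pattern, using only the standard library. `writeSecretFile` and `overwriteDemo` are hypothetical names for this example (the PR's function is writeClientCredential, whose exact body is not shown here), and the demo assumes a Unix-like filesystem where permission bits apply.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeSecretFile writes data to path so that the result is mode 0600 even
// if a file with looser permissions already exists at path.
func writeSecretFile(path string, data []byte) error {
	// os.CreateTemp creates the file with mode 0600 in the target directory,
	// so the rename below stays on one filesystem.
	tmp, err := os.CreateTemp(filepath.Dir(path), ".cred-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName) // harmless once the rename has succeeded

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// Rename replaces the directory entry at path; it does not write
	// through an existing file or symlink there.
	return os.Rename(tmpName, path)
}

// overwriteDemo pre-creates a 0644 file, overwrites it through
// writeSecretFile, and returns the resulting permission bits.
func overwriteDemo() (os.FileMode, error) {
	dir, err := os.MkdirTemp("", "cred-demo-")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "client.env")
	if err := os.WriteFile(path, []byte("stale\n"), 0o644); err != nil {
		return 0, err
	}
	if err := os.Chmod(path, 0o644); err != nil {
		return 0, err
	}
	if err := writeSecretFile(path, []byte("TRACEBLOC_CLIENT_ID=example\n")); err != nil {
		return 0, err
	}
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return info.Mode().Perm(), nil
}

func main() {
	mode, err := overwriteDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%04o\n", uint32(mode))
}
```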

Tests:
- preexisting-perms: a 0644 target ends up 0600 (locks in the fix).
- write-fail-fatal: an unwritable target surfaces an error, never a silent
  drop (the credential is the only copy).
- mint never-show: also assert the password VALUE is absent from stdout, not
  just the literal "password"/"Machine credential" strings.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>
Comment thread internal/api/client.go Outdated
Comment thread internal/cli/dataset.go
* fix(api): error on unparseable pagination next, don't silently truncate (#106)

ListClients followed DRF `next` via nextPath, which returned "" for BOTH an
empty link (end of pages) and an unparseable one — so a non-empty `next` the
server sends that url.Parse rejects silently ended the loop, and ListClients
returned only the pages seen so far with a nil error. list / `use` /
namespace-collision checks would then miss clients with no signal.

nextPath now returns (string, error): "" + nil for an empty link, an error for
a non-empty link that won't parse. Trigger is unlikely (DRF emits well-formed
URLs, url.Parse is lenient), but the failure mode — silent partial list — is
the wrong one for a correctness-sensitive call.
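
The changed contract can be sketched as below. This is an assumption-laden illustration: the PR only states the return signature, so the choice to return the path plus query via `RequestURI` and the error wording are guesses.

```go
package main

import (
	"fmt"
	"net/url"
)

// nextPath turns a DRF-style `next` link into a request path.
// An empty link means "no more pages" ("" and nil); a non-empty link that
// does not parse is an error, never a silent end of pagination.
func nextPath(next string) (string, error) {
	if next == "" {
		return "", nil
	}
	u, err := url.Parse(next)
	if err != nil {
		return "", fmt.Errorf("unparseable pagination link %q: %w", next, err)
	}
	return u.RequestURI(), nil
}

func main() {
	for _, link := range []string{
		"",
		"https://api.example.test/edge-device/?page=2", // hypothetical URL
		"http://[::1", // malformed host: url.Parse rejects it
	} {
		p, err := nextPath(link)
		fmt.Printf("%q -> %q, err=%v\n", link, p, err)
	}
}
```

The caller's loop can then stop on `""` with a nil error and abort on any error, so a partial list is never returned as if it were complete.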

Tests: pagination still followed end-to-end (page 1 → 2 → done); an
unparseable next link is now a hard error, not a truncation.

Bugbot: 8dadb5c2-804a-48ed-bc81-eb14e6317be1 (v0.4.0 RC, #107)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(cli): route all known-but-unsupported categories to the pending note (#106)

The dataset-push category gate only special-cased known-but-unsupported *image*
categories (`case push.IsImage`), so a registry-known non-image category that
isn't CLI-supported yet — `causal_language_modeling` (FamilyText,
CLISupported:false, with a real UnsupportedNote) — fell to the default branch
and was reported as "isn't a recognized task category". It IS recognized; it's
pending support.

Swap the gate to `case push.IsKnown`: supported categories are already caught
by the prior case, so IsKnown here means known-but-unsupported (image or text),
all routed through the registry's per-category pending-support note. The default
branch is left for genuinely unknown/typo'd categories. Message-only (exit code
was already 2).
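
The three-way gate reads roughly like the sketch below. The `spec` type, the `gate` function, the placeholder supported category, and the note text are hypothetical; only the case ordering, the exit code 2, and the quoted "isn't a recognized task category" wording come from this PR.

```go
package main

import "fmt"

// spec is a pared-down registry entry for this illustration.
type spec struct {
	cliSupported    bool
	unsupportedNote string
}

// known is a tiny hypothetical registry.
var known = map[string]spec{
	"example_supported_task":   {cliSupported: true},
	"causal_language_modeling": {unsupportedNote: "recognized, but CLI push support is pending"},
}

// gate returns the message and exit code for a requested category.
func gate(category string) (string, int) {
	s, isKnown := known[category]
	switch {
	case isKnown && s.cliSupported:
		return "", 0
	case isKnown:
		// Supported categories were caught above, so "known" here means
		// known-but-unsupported, whatever its family.
		return category + ": " + s.unsupportedNote, 2
	default:
		return category + " isn't a recognized task category", 2
	}
}

func main() {
	for _, c := range []string{"example_supported_task", "causal_language_modeling", "causal_langauge_modeling"} {
		msg, code := gate(c)
		fmt.Printf("%d %q\n", code, msg)
	}
}
```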

Test: causal_language_modeling now gets the pending-support note, not the
unrecognized-category message. (The existing exit-2 test didn't assert the
message, which is how this slipped through.)

Bugbot: 16f5b945-5d67-4201-8bc6-1f6baf633672 (v0.4.0 RC, #107)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread internal/cli/auth.go
Comment thread internal/push/category.go
#109)

* fix(cli): clear active_client_id on logout (#106)

logout cleared the token and email but left active_client_id in
~/.tracebloc/config.json. Since that pointer is account-scoped, a later
`login` as a different user inherited the stale id — `auth status` and
`client list` would surface the previous account's active client until the
user ran `client create`/`client use` again.

Clear it alongside the token/email so logout fully drops local session state.

Bugbot: "Stale active client after logout" (Medium, v0.4.0 RC, #107)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(push): add token_classification to the registry + pin schema parity (#106)

The re-vendored ingest.v1.json (#103) accepts token_classification, but the
category registry didn't list it — so `dataset push --category=token_classification`
hit the "isn't a recognized task category" path despite being schema-valid (the
same misrouting just fixed for causal_language_modeling). Added it as a known,
not-yet-CLI-supported FamilyText category with a pending-support note (the safe
default — it was never pushable, so this only improves the message; flip to
supported when the texts/token-label staging lands).

Root-cause guard: the existing registry tests only pinned the registry against
a hand-written list, which stayed self-consistent while drifting from the
schema. Added TestRegistryCoversSchemaCategories — parses the embedded schema's
category enum and asserts every entry is registry-known — so any future
schema-only category is caught here, not in the next review pass. (The reverse,
a registry-only known-unsupported alias like instance_segmentation, is allowed:
it's gated out before schema validation.)

Bugbot: "Missing token_classification registry entry" (Medium, v0.4.0 RC, #107)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 3881746.

Comment thread internal/cli/auth.go Outdated
Comment thread internal/doctor/doctor.go
…es (#106) (#110)

* fix(cli): back off device-flow poll by 5s on slow_down (RFC 8628) (#106)

On `slow_down`, runLogin bumped the poll interval by 1s (`interval++`). RFC 8628
§3.5 requires increasing it by 5s for that and all subsequent polls, so the CLI
kept polling too aggressively after the server asked it to back off.

Test captures the durations handed to the pollAfter seam: post-slow_down wait is
now 10s (5+5), not 6s.
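
The interval rule from RFC 8628 §3.5 can be sketched as a pure function. `nextInterval` is a hypothetical helper for illustration; runLogin's real structure is not shown in this PR.

```go
package main

import (
	"fmt"
	"time"
)

// nextInterval returns the wait before the next token poll, given the
// current interval and the error code of the last poll response.
// slow_down adds 5 seconds, and the increase sticks for later polls;
// authorization_pending leaves the interval unchanged.
func nextInterval(current time.Duration, errCode string) time.Duration {
	if errCode == "slow_down" {
		return current + 5*time.Second
	}
	return current
}

func main() {
	interval := 5 * time.Second
	for _, code := range []string{"authorization_pending", "slow_down", "authorization_pending"} {
		interval = nextInterval(interval, code)
		fmt.Println(code, "->", interval)
	}
}
```

Starting from a 5s interval, one slow_down yields 10s, which is the value the test above asserts.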

Bugbot: "Device flow slow_down interval" (Medium, v0.4.0 RC, #107)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(doctor): resolve CLIENT_ENV case-insensitively in backendHost (#106)

backendHost switched on CLIENT_ENV case-sensitively, but the API client
(api.ResolveEnv/BaseURL) lowercases env values. A non-lowercase CLIENT_ENV on
the edge box (e.g. "DEV") fell through to the prod default, so `cluster doctor`
probed api.tracebloc.io even when the cluster targeted dev/stg. Normalize with
ToLower+TrimSpace before the switch.
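
A minimal sketch of the normalization. Only the prod host (api.tracebloc.io) appears in this PR; the dev and stg hostnames below are placeholders invented for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// backendHost maps CLIENT_ENV to a backend host, ignoring case and
// surrounding whitespace. Unknown or empty values fall back to prod.
func backendHost(clientEnv string) string {
	switch strings.ToLower(strings.TrimSpace(clientEnv)) {
	case "dev":
		return "dev-api.example.test" // placeholder, not the real dev host
	case "stg":
		return "stg-api.example.test" // placeholder, not the real stg host
	default:
		return "api.tracebloc.io"
	}
}

func main() {
	for _, env := range []string{"DEV", "Stg", " dev ", "", "prod"} {
		fmt.Printf("%q -> %s\n", env, backendHost(env))
	}
}
```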

Test extends TestBackendHost with "DEV"/"Stg"/" dev " cases.

Bugbot: "Doctor backend env casing" (Low, v0.4.0 RC, #107)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(api): guard the bare-array list decode to the first page (#106)

ListClients attempted the unpaginated bare-array decode on every iteration. A
bare array is only valid as the sole response (a paginated chain is a
{next,results} object on every page), so a stray bare body mid-chain could
silently end the loop and drop earlier pages. Guard the bare decode to pageNum 0.
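
The guard can be sketched as a per-page decoder. The `client` and `page` shapes (an `id` field, `next` and `results` keys) are assumptions modeled on DRF's default pagination envelope, and `decodePage` is a hypothetical helper rather than the function in internal/api.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// client and page are assumed response shapes for this illustration.
type client struct {
	ID int `json:"id"`
}

type page struct {
	Next    string   `json:"next"`
	Results []client `json:"results"`
}

// decodePage decodes one list response. The paginated object form is tried
// first; the bare-array form is accepted only for the first page, since a
// bare array can only be the sole response.
func decodePage(body []byte, pageNum int) ([]client, string, error) {
	var p page
	if err := json.Unmarshal(body, &p); err == nil {
		return p.Results, p.Next, nil
	}
	if pageNum == 0 {
		var bare []client
		if err := json.Unmarshal(body, &bare); err == nil {
			return bare, "", nil
		}
	}
	return nil, "", fmt.Errorf("page %d: unexpected list response shape", pageNum)
}

func main() {
	items, next, err := decodePage([]byte(`[{"id":1},{"id":2}]`), 0)
	fmt.Println(len(items), next == "", err)

	_, _, err = decodePage([]byte(`[{"id":3}]`), 1)
	fmt.Println(err != nil)
}
```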

Test: a bare-array response still returns the full list.

Found in the proactive RC review (low; latent — DRF doesn't mix shapes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(cli): exclude this cluster's own client from create collision check (#106)

`client create` built the namespace-collision set from ALL clients, including the
one already anchored to this cluster. On an idempotent re-run with the same
--name, that bumped the derived slug (lab-one → lab-one-2) and showed it in the
review — but the backend adopts on cluster_id and returns the original namespace,
so the review contradicted the actual outcome. Skip the client whose cluster_id
matches this cluster's anchor.
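
A sketch of the exclusion, with hypothetical `client`, `takenNamespaces`, and `uniqueSlug` names. The `-2` suffix style follows the lab-one → lab-one-2 example above; the real collision logic lives in internal/slug and may differ in detail.

```go
package main

import "fmt"

// client is a simplified stand-in for an account's provisioned client.
type client struct {
	namespace string
	clusterID string
}

// takenNamespaces collects namespaces that a new client must avoid,
// skipping the client already anchored to this cluster (the one a re-run
// would adopt rather than duplicate).
func takenNamespaces(clients []client, anchor string) map[string]bool {
	taken := map[string]bool{}
	for _, c := range clients {
		if anchor != "" && c.clusterID == anchor {
			continue
		}
		taken[c.namespace] = true
	}
	return taken
}

// uniqueSlug appends -2, -3, ... until the slug is free.
func uniqueSlug(base string, taken map[string]bool) string {
	if !taken[base] {
		return base
	}
	for i := 2; ; i++ {
		s := fmt.Sprintf("%s-%d", base, i)
		if !taken[s] {
			return s
		}
	}
}

func main() {
	clients := []client{{namespace: "lab-one", clusterID: "uid-123"}}
	// Re-run on the same cluster: the review shows the original namespace.
	fmt.Println(uniqueSlug("lab-one", takenNamespaces(clients, "uid-123")))
	// A different cluster: the name collides and gets a suffix.
	fmt.Println(uniqueSlug("lab-one", takenNamespaces(clients, "uid-999")))
}
```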

Test: a re-run review no longer shows a bumped namespace.

Found in the proactive RC review (low; cosmetic — backend state was correct).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
saadqbal merged commit 57805b2 into main on Jun 24, 2026
28 of 29 checks passed