test(jobs-manager): lock spawned-job RESOURCE_REQUESTS/LIMITS default at 8Gi#260
Merged
Conversation
… at 8Gi Add helm-unittest assertions pinning the rendered per-spawned-job RESOURCE_REQUESTS / RESOURCE_LIMITS env to "cpu=2,memory=8Gi" on both containers (api + pods-monitor), plus an operator-override case. The chart is the single effective source of truth for these values (it always injects them), so this guards against silent drift between the chart and client-runtime's jobs_manager.py fallback — the drift reconciled in tracebloc/backend#745. Refs tracebloc/backend#745 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
|
👋 Heads-up — Code review queue is at 19 / 8 Above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.) |
shujaatTracebloc
approved these changes
Jun 16, 2026
aptracebloc
approved these changes
Jun 16, 2026
This was referenced Jun 23, 2026
saadqbal
added a commit
that referenced
this pull request
Jun 23, 2026
… backend-reachability test, installer hardening) (#273) * Merge pull request #260 from tracebloc/fix/job-resource-defaults-745 test(jobs-manager): lock spawned-job RESOURCE_REQUESTS/LIMITS default at 8Gi * test(e2e-proxy): exercise application-pod egress through the proxy (Charité setup) (#264) * test(e2e-proxy): exercise application-pod egress through the proxy The squid harness proved NODE egress (image pulls) but stopped before any application pod — so it never caught client-runtime#119, where the spawned ingestion Job carried no proxy env and dialled the backend directly. Add a section that runs a pod WITH the ingestion-style proxy env (must traverse the squid to reach the backend) and a pod WITHOUT it (must bypass it / go direct), asserting both against the squid access log. Models the Charité proxy-only setup at the application layer; pairs with the behavioural unit tests on client-runtime#119. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test(e2e-proxy): rework app-pod egress to an in-cluster squid Service Running the first version on a real k3d cluster surfaced that a POD cannot resolve host.k3d.internal (it is a node-level alias for image pulls, not pod DNS), so the proxied probe failed with `curl (5) Could not resolve proxy`. Rework: stand up an in-cluster squid Deployment+Service the test pods reach by Service DNS (also a closer model of a real corporate proxy reachable by name), with a readiness probe gating rollout on squid actually listening (fixes the probe-before-bind race seen in the first attempt). A pod WITH the ingestion proxy env must reach the backend through the squid; a pod WITHOUT it must bypass it. Auth survival stays covered by the host-squid sections (1-3). bash -n + shellcheck + embedded-YAML parse all clean; Service-DNS resolution verified locally. Full proxied-curl run is exercised by the e2e-proxy CI job. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test(e2e-proxy): make app-pod egress assertion deterministic (single pod, curl -v) §4 now uses ONE pod carrying the ingestion-style proxy env that makes two calls to the same backend: WITH the env it must tunnel via the in-cluster squid (a CONNECT tunnel); with the env unset it must dial direct. Proof is taken client-side from `curl -v` (the CONNECT-tunnel lines), not by reading squid's access.log — that file is buffered by the log daemon and came back empty when read right after the probe, producing false failures. Also set BOTH proxy-env cases: curl honours the lower-case `https_proxy` for HTTPS and the upper-case alone is not reliably picked up, so the probe must emit both — exactly as the real ingestion env does. A single pod with a single log also removes the multi-pod scheduling / log-flush races that made the earlier two-pod form flaky. Validated end-to-end on k3d: A (proxy env) -> "Establish HTTP proxy tunnel to api.tracebloc.io:443" + "CONNECT tunnel established, response 200" + 200 OK B (env unset) -> direct connect to the backend IP, no proxy tunnel, 200 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * ci(security): add public-repo PII gate caller (#263) Blocks PRs that leak customer/partner names or secrets in title/body/commits. Calls the reusable gate in tracebloc/.github. Inactive until the org PII_DENYLIST secret is set (warns, doesn't block, until then). * ci: add concurrency cancellation + job timeouts to chart CI (#265) helm-ci.yaml and installer-tests.yaml are the repo's most expensive workflows (a real k3d cluster, a 9-distro docker-in-docker matrix, Windows Pester) but had no concurrency control, so a PR re-push left stale runs burning to completion. Add a per-ref concurrency group that cancels superseded PR runs only (push/schedule runs are never cancelled), matching the pattern already used in client-runtime's tests.yml. Add timeout-minutes to every job so a hung k3d/squid/distro step can't run to the 6h default. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * feat(installer): fail fast when HOST_DATA_DIR is on a network filesystem (#261) Detect NFS/CIFS/SMB for HOST_DATA_DIR in preflight (bash + PowerShell) and fail fast with an actionable message instead of a cryptic MySQL CrashLoopBackOff ~20 min into install: MySQL/InnoDB corrupts on network storage and the chart root chown init-container is blocked by NFS root_squash. - preflight.sh: _pf_fstype reader (findmnt, then GNU stat, then df+mount; portable incl. macOS) + _pf_storage_type wired into run_preflight. Allowlists network fstypes so local FSes including overlay/tmpfs (CI) pass. - install-k8s.ps1: Get-PfFsType (UNC / network drive) + Test-Preflight check. - TRACEBLOC_ALLOW_NETWORK_FS=1 overrides (mirrors TRACEBLOC_ALLOW_ARM64). - Tests: 10 bats cases + Pester cases (network -> fail, override, undetermined, Windows-only Get-PfFsType reader). Part 1 of 3 for tracebloc/backend#743. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(installer,chart): place datasets on a network mount while MySQL stays local (#262) * feat(installer): fail fast when HOST_DATA_DIR is on a network filesystem Detect NFS/CIFS/SMB for HOST_DATA_DIR in preflight (bash + PowerShell) and fail fast with an actionable message instead of a cryptic MySQL CrashLoopBackOff ~20 min into install: MySQL/InnoDB corrupts on network storage and the chart root chown init-container is blocked by NFS root_squash. - preflight.sh: _pf_fstype reader (findmnt, then GNU stat, then df+mount; portable incl. macOS) + _pf_storage_type wired into run_preflight. Allowlists network fstypes so local FSes including overlay/tmpfs (CI) pass. - install-k8s.ps1: Get-PfFsType (UNC / network drive) + Test-Preflight check. - TRACEBLOC_ALLOW_NETWORK_FS=1 overrides (mirrors TRACEBLOC_ALLOW_ARM64). - Tests: 10 bats cases + Pester cases (network -> fail, override, undetermined, Windows-only Get-PfFsType reader). Part 1 of 3 for tracebloc/backend#743. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(installer,chart): place datasets on a network mount while MySQL stays local Storage split for VMs whose real storage is an NFS/CIFS mount (backend#743): the database must stay on local disk (InnoDB over NFS is unsafe) but the large dataset volume can live on the network mount. Chart: - Parameterize the dataset PV hostPath base via hostPath.datasetPath (helper tracebloc.clientDataHostPath). Default /tracebloc keeps it byte-identical; mysql + logs PV paths are unchanged. values.yaml + schema + nil-guard for --reuse-values upgrades. Installer (bash + PowerShell): - New HOST_DATASET_DIR: validated (must exist + be writable; MAY live outside $HOME unlike HOST_DATA_DIR; system paths barred), bind-mounted into k3d at a distinct /tracebloc-data path; the dataset dir is created there while mysql + logs stay local. When set, the generated values set hostPath.datasetPath=/tracebloc-data and (Linux) pass HOST_UID/HOST_GID env to jobs-manager so spawned ingestion pods write the host-owned NFS export as the owning uid. Preflight notes the dataset dir is exempt from the network-FS block. Tests: new shared_images_pvc_test.yaml + mysql/logs split-only guards (helm-unittest, 259 pass); HOST_DATASET_DIR validation, second-mount, dir-split and values-generation cases (bats). Docs: INSTALL.md checklist + SECURITY.md 5.4. Part 2/3 of backend#743. The end-to-end NFS write path also needs the client-runtime ingestor-uid change (separate PR) so jobs-manager reads HOST_UID. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(#262): fail fast when an existing cluster lacks the dataset bind mount (Bugbot) The HOST_DATASET_DIR -> /tracebloc-data bind mount is baked into the k3d nodes at create time (_create_new_cluster / the PS1 equivalent). k3d cannot add a mount to a RUNNING cluster, but install-client-helm.sh still wrote `datasetPath: /tracebloc-data` into the generated values whenever HOST_DATASET_DIR was merely set — so an existing-cluster re-run pointed the chart's dataset PV at ephemeral in-node storage, silently putting datasets on disposable storage instead of the network export (lost on a restart). Add _check_existing_cluster_dataset_mount (cluster.sh) + the PowerShell equivalent, mirroring the existing _check_existing_cluster_proxy/bind drift checks: on an existing cluster with HOST_DATASET_DIR set, inspect the server node for the /tracebloc-data mount and FAIL FAST with the recreate remedy if it is absent — rather than installing a quietly misrouted dataset volume. Fail-fast (not warn) because this is silent data loss, consistent with the network-FS fail-fast guard. Values generation needs no change: the install now stops in Step 2, before helm runs. +4 bats (cluster.bats): unset -> no-op, mount present -> pass, mount ABSENT -> fail fast, inspect fails -> no-op. bash + shellcheck clean; pwsh parses the .ps1; full cluster suite 27/27. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(installer): non-interactive credentials + softer connect copy (#834) (#266) RFC-0001 Phase 0 (no backend dependency): - Accept TRACEBLOC_CLIENT_ID / TRACEBLOC_CLIENT_PASSWORD so CI / automation / golden images can provision without typing the secret inline. Verified the same way as the prompt (verify_credentials); a bad credential fails the install (no re-prompt in non-interactive mode). The interactive prompt path is unchanged — just wrapped in the else branch. - Soften the "to connect this machine you NEED a tracebloc client / create one" copy to "already have one? enter it (or set the env vars) / need one? create it" — so we stop framing client-creation as a mandatory pre-step (browser sign-in lands in Phase 1). Tests: two new bats cases (env path -> non-interactive write + helm with no prompt; rejected env creds -> error, no helm). The interactive flow tests (re-prompt / inactive / unverified / defaults / max-attempts / one-client guard) still pass unchanged. NOTE: bats #16 (_extract_yaml_value single-quote '' un-escape) fails locally on macOS bash — pre-existing and untouched by this PR (the diff doesn't go near that function); flagging for confirmation against CI. Part of backend#830. Closes #834. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com> * test(e2e-proxy): deflake §4 app-egress — hermetic target + non-silent diagnostics (#269) §4 ("APPLICATION-pod egress through a proxy", client-runtime#119) was a flaky required check ("E2E auth-proxy (squid)") that intermittently red-X'd develop (~1 in 4; e.g. run 27765964135) and randomly blocked unrelated PRs. Two causes: 1. Silent failure. Under `set -euo pipefail` the diagnostic `grep | sed` lines ran before the real assertion; an empty section made grep exit 1 → pipefail → set -e killed the script with NO output (CI showed only "pod/egress-app created" then "exit code 1"). Append `|| true` so the diagnostics are non-fatal and the assertion fires with its reason. Same footgun fixed in §3. 2. External-network dependency (the real flake). §4 curled the real https://api.tracebloc.io/ through the in-cluster squid, depending on the runner's internet to a production host at test time. Make it hermetic: target a reserved-TLD stand-in host (backend.tracebloc-e2e.test) aliased via hostAliases on both the squid and app pods to the cluster's own kube-apiserver ClusterIP — a guaranteed in-cluster HTTPS:443 listener. The CONNECT tunnel now terminates in-cluster with zero external I/O, preserving the #119 intent (WITH proxy env → CONNECT tunnel via squid; env unset → direct dial). Validated: 3/3 deterministic local passes; both calls hit 10.43.0.1 in-cluster (no api.tracebloc.io reachout). bash -n + shellcheck --severity=error clean. Closes #268 Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * feat(cli#90): in-cluster backend-reachability helm test (WS3) (#270) Adds `egress-reachability-check`, a `helm test` Job that verifies a normal (non-training) pod in the namespace can reach the tracebloc backend API — the egress dependency that gates everything (the cluster authenticates to the backend to obtain its Service Bus credentials, so no backend egress => silent Pending). The required-egress complement to egress-enforcement-check (which verifies the opposite: that training pods are locked out). The probe is deliberately NOT training-labelled (so the lockdown netpol never selects it — it keeps the jobs-manager/requests-proxy egress class) and honours tracebloc.proxyEnv, so it tests the real path. The verdict keys on curl's exit code (TCP reachability), not HTTP status. Run via `helm test <release>`; gated by egressReachabilityCheck.enabled (default true; disable on truly air-gapped clusters). As a test hook it never runs during install/upgrade. Service Bus is intentionally not probed here: its host is fetched post-auth from the backend (static nowhere in the chart) and its egress is brokered by the requests-proxy, whose readiness `tracebloc cluster doctor` already checks. helm-unittest: 6 tests (render / disable / test-hook annotation / not-training- labelled / CLIENT_ENV-driven host / proxy-inherited). Full chart suite green. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(chart): bump client 1.7.1 → 1.8.0 (version + appVersion) (#272) Release v1.8.0 — promotes the 9 commits merged to develop since v1.7.1. Minor bump: #262 (datasets on a network mount while MySQL stays local) is a real PVC-placement change at install time, not inert. Keeps version/appVersion in lockstep so the app.kubernetes.io/version label matches the release. Refs #271 Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> * fix(egress-reachability-check): fail closed on TLS/transport errors, not just connect failures (#274) The backend-reachability helm test (#270) keyed success on a denylist of curl exit codes (5/6/7/28) and passed on everything else. But with no --fail, curl exits 0 for any HTTP response, so the only non-zero codes are transport/TLS failures — yet a TLS handshake/cert error (e.g. 35/51/60, typically a proxy intercepting TLS with a CA the cluster doesn't trust) printed "backend reachable" and exited 0: a false pass that defeats the check's purpose. Pass ONLY on curl exit 0 (a full TCP+TLS+HTTP round trip proves the backend is reachable and usable); fail closed on everything else, with a dedicated TLS/cert bucket and a generic catch-all. Companion helm-unittest asserts nothing on the script body; full suite 265/265. Follow-up to #270; flagged by Cursor Bugbot on the v1.8.0 sync PR (#273). Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: lukasWuttke <54042461+LukasWodka@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds helm-unittest assertions pinning the rendered per-spawned-training-job
RESOURCE_REQUESTS/RESOURCE_LIMITSenv tocpu=2,memory=8Gion both containers that receive it (api+pods-monitor), plus an operator-override case.Why
The chart is the single effective source of truth for these values —
jobs-manager-deployment.yamlalways injects them with a templated"cpu=2,memory=8Gi"fallback, andclient-runtime'sjobs_manager.pyonly falls back to its own default when they're absent. Those two had silently drifted (the chart said 8Gi; the code's dead-code default was ~202Mi request / 20G limit). This guard renders the template and asserts the value so the drift can't recur unnoticed — the "render-and-assert test" called for in tracebloc/backend#745.The two
containsblocks per container also guard against one of the template's two RESOURCE_* blocks being edited without the other.Test
Companion
The actual reconciliation (code fallback → 8Gi) lives in the runtime: tracebloc/client-runtime#111.
Refs tracebloc/backend#745
🤖 Generated with Claude Code
Note
Low Risk
Test-only change; no Helm template or runtime behavior is modified in this PR.
Overview
Adds helm-unittest coverage in
jobs_manager_test.yamlso the jobs-manager chart cannot silently drift from the intended per-spawned-training-job resource env vars.New cases render
jobs-manager-deployment.yamland assert thatRESOURCE_REQUESTSandRESOURCE_LIMITSdefault tocpu=2,memory=8Gion both the api (containers[0]) and pods-monitor (containers[1]) containers, each emitted once. A second test confirms operators can override those values viavalues.env.RESOURCE_REQUESTS/RESOURCE_LIMITS.This is a contract test for tracebloc/backend#745: the chart is the effective source of truth for what
client-runtimesees when spawning training Jobs (companion runtime change is elsewhere).Reviewed by Cursor Bugbot for commit 431fe9d. Bugbot is set up for automated code reviews on this repo. Configure here.