Skip to content

Sync develop → main for v1.8.0 chart release (network-mount datasets, backend-reachability test, installer hardening)#273

Merged
saadqbal merged 11 commits into
mainfrom
sync/develop-to-main-v1.8.0
Jun 23, 2026
Merged

Sync develop → main for v1.8.0 chart release (network-mount datasets, backend-reachability test, installer hardening)#273
saadqbal merged 11 commits into
mainfrom
sync/develop-to-main-v1.8.0

Conversation

@saadqbal

@saadqbal saadqbal commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Release chart v1.8.0 to main. Promotes the 9 commits merged to develop since v1.7.1 (2026-06-15) plus the version bump (#272).

Contents (main 1.7.1 → develop 1.8.0)

Chart (renders on the fleet):

Installer (host-side; new installs only, not auto-upgrade):

Repo-only (CI / security / e2e tests — not shipped in the chart):

Release type

Minor — adds a new capability/values key (hostPath.datasetPath) and a new test hook. Inert on auto-upgrade: existing clusters render byte-identical (helper defaults datasetPath to /tracebloc, nil-guarded for --reuse-values); #270 is a helm test hook that never runs on upgrade. Operational change lands only on fresh installs that pass HOST_DATASET_DIR.

After merge

Publish GitHub Release v1.8.0 tagged on main → release workflow packages the chart to gh-pages → fleet auto-upgrades at :23.

Closes #271.

🤖 Generated with Claude Code


Note

Medium Risk
Minor chart bump is mostly inert on auto-upgrade, but v1.8.0 still changes fleet templates and install-time storage/proxy behavior; misconfigured HOST_DATASET_DIR or dataset PV paths could misroute data on fresh installs.

Overview
Promotes chart v1.8.0 and the develop-side installer/CI work bundled in this sync to main.

Chart (fleet-facing): Adds hostPath.datasetPath so only the dataset (shared-images) PV base can move off local disk (default /tracebloc keeps renders byte-identical on upgrade/--reuse-values); mysql and logs hostPaths stay local. Adds an egressReachabilityCheck helm test Job (non-training pod → tracebloc API, proxy-aware) that does not run on install/upgrade. Helm unit tests cover dataset PV wiring, mysql/logs isolation, spawned-job RESOURCE_REQUESTS/RESOURCE_LIMITS defaults (#745), and the new reachability hook.

Installers: HOST_DATASET_DIR bind-mounts /tracebloc-data, emits datasetPath + HOST_UID/HOST_GID for NFS ingestion writes, and refuses reusing a k3d cluster missing that mount. Preflight blocks HOST_DATA_DIR on network filesystems (override TRACEBLOC_ALLOW_NETWORK_FS); TRACEBLOC_CLIENT_ID / TRACEBLOC_CLIENT_PASSWORD enable non-interactive Helm install.

CI / repo: PR concurrency cancel and job timeouts on heavy workflows; new public PII gate caller; e2e-proxy extended to assert application-pod proxy routing via an in-cluster hermetic backend stand-in (deflake vs real api.tracebloc.io).

Reviewed by Cursor Bugbot for commit 001ef47. Bugbot is set up for automated code reviews on this repo. Configure here.

saadqbal and others added 10 commits June 16, 2026 14:22
test(jobs-manager): lock spawned-job RESOURCE_REQUESTS/LIMITS default at 8Gi
…harité setup) (#264)

* test(e2e-proxy): exercise application-pod egress through the proxy

The squid harness proved NODE egress (image pulls) but stopped before any
application pod — so it never caught client-runtime#119, where the spawned
ingestion Job carried no proxy env and dialled the backend directly. Add a
section that runs a pod WITH the ingestion-style proxy env (must traverse the
squid to reach the backend) and a pod WITHOUT it (must bypass it / go direct),
asserting both against the squid access log.

Models the Charité proxy-only setup at the application layer; pairs with the
behavioural unit tests on client-runtime#119.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(e2e-proxy): rework app-pod egress to an in-cluster squid Service

Running the first version on a real k3d cluster surfaced that a POD cannot
resolve host.k3d.internal (it is a node-level alias for image pulls, not pod
DNS), so the proxied probe failed with `curl (5) Could not resolve proxy`.

Rework: stand up an in-cluster squid Deployment+Service the test pods reach by
Service DNS (also a closer model of a real corporate proxy reachable by name),
with a readiness probe gating rollout on squid actually listening (fixes the
probe-before-bind race seen in the first attempt). A pod WITH the ingestion
proxy env must reach the backend through the squid; a pod WITHOUT it must bypass
it. Auth survival stays covered by the host-squid sections (1-3).

bash -n + shellcheck + embedded-YAML parse all clean; Service-DNS resolution
verified locally. Full proxied-curl run is exercised by the e2e-proxy CI job.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* test(e2e-proxy): make app-pod egress assertion deterministic (single pod, curl -v)

§4 now uses ONE pod carrying the ingestion-style proxy env that makes two
calls to the same backend: WITH the env it must tunnel via the in-cluster
squid (a CONNECT tunnel); with the env unset it must dial direct. Proof is
taken client-side from `curl -v` (the CONNECT-tunnel lines), not by reading
squid's access.log — that file is buffered by the log daemon and came back
empty when read right after the probe, producing false failures.

Also set BOTH proxy-env cases: curl honours the lower-case `https_proxy`
for HTTPS and the upper-case alone is not reliably picked up, so the probe
must emit both — exactly as the real ingestion env does. A single pod with
a single log also removes the multi-pod scheduling / log-flush races that
made the earlier two-pod form flaky.

Validated end-to-end on k3d:
  A (proxy env)  -> "Establish HTTP proxy tunnel to api.tracebloc.io:443"
                    + "CONNECT tunnel established, response 200" + 200 OK
  B (env unset)  -> direct connect to the backend IP, no proxy tunnel, 200

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Blocks PRs that leak customer/partner names or secrets in title/body/commits.
Calls the reusable gate in tracebloc/.github. Inactive until the org
PII_DENYLIST secret is set (warns, doesn't block, until then).
helm-ci.yaml and installer-tests.yaml are the repo's most expensive
workflows (a real k3d cluster, a 9-distro docker-in-docker matrix, Windows
Pester) but had no concurrency control, so a PR re-push left stale runs
burning to completion. Add a per-ref concurrency group that cancels
superseded PR runs only (push/schedule runs are never cancelled), matching
the pattern already used in client-runtime's tests.yml. Add timeout-minutes
to every job so a hung k3d/squid/distro step can't run to the 6h default.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…tem (#261)

Detect NFS/CIFS/SMB for HOST_DATA_DIR in preflight (bash + PowerShell) and
fail fast with an actionable message instead of a cryptic MySQL
CrashLoopBackOff ~20 min into install: MySQL/InnoDB corrupts on network
storage and the chart root chown init-container is blocked by NFS root_squash.

- preflight.sh: _pf_fstype reader (findmnt, then GNU stat, then df+mount;
  portable incl. macOS) + _pf_storage_type wired into run_preflight. Allowlists
  network fstypes so local FSes including overlay/tmpfs (CI) pass.
- install-k8s.ps1: Get-PfFsType (UNC / network drive) + Test-Preflight check.
- TRACEBLOC_ALLOW_NETWORK_FS=1 overrides (mirrors TRACEBLOC_ALLOW_ARM64).
- Tests: 10 bats cases + Pester cases (network -> fail, override, undetermined,
  Windows-only Get-PfFsType reader).

Part 1 of 3 for tracebloc/backend#743.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…stays local (#262)

* feat(installer): fail fast when HOST_DATA_DIR is on a network filesystem

Detect NFS/CIFS/SMB for HOST_DATA_DIR in preflight (bash + PowerShell) and
fail fast with an actionable message instead of a cryptic MySQL
CrashLoopBackOff ~20 min into install: MySQL/InnoDB corrupts on network
storage and the chart root chown init-container is blocked by NFS root_squash.

- preflight.sh: _pf_fstype reader (findmnt, then GNU stat, then df+mount;
  portable incl. macOS) + _pf_storage_type wired into run_preflight. Allowlists
  network fstypes so local FSes including overlay/tmpfs (CI) pass.
- install-k8s.ps1: Get-PfFsType (UNC / network drive) + Test-Preflight check.
- TRACEBLOC_ALLOW_NETWORK_FS=1 overrides (mirrors TRACEBLOC_ALLOW_ARM64).
- Tests: 10 bats cases + Pester cases (network -> fail, override, undetermined,
  Windows-only Get-PfFsType reader).

Part 1 of 3 for tracebloc/backend#743.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(installer,chart): place datasets on a network mount while MySQL stays local

Storage split for VMs whose real storage is an NFS/CIFS mount (backend#743):
the database must stay on local disk (InnoDB over NFS is unsafe) but the large
dataset volume can live on the network mount.

Chart:
- Parameterize the dataset PV hostPath base via hostPath.datasetPath (helper
  tracebloc.clientDataHostPath). Default /tracebloc keeps it byte-identical;
  mysql + logs PV paths are unchanged. values.yaml + schema + nil-guard for
  --reuse-values upgrades.

Installer (bash + PowerShell):
- New HOST_DATASET_DIR: validated (must exist + be writable; MAY live outside
  $HOME unlike HOST_DATA_DIR; system paths barred), bind-mounted into k3d at a
  distinct /tracebloc-data path; the dataset dir is created there while mysql +
  logs stay local. When set, the generated values set
  hostPath.datasetPath=/tracebloc-data and (Linux) pass HOST_UID/HOST_GID env to
  jobs-manager so spawned ingestion pods write the host-owned NFS export as the
  owning uid. Preflight notes the dataset dir is exempt from the network-FS block.

Tests: new shared_images_pvc_test.yaml + mysql/logs split-only guards
(helm-unittest, 259 pass); HOST_DATASET_DIR validation, second-mount, dir-split
and values-generation cases (bats). Docs: INSTALL.md checklist + SECURITY.md 5.4.

Part 2/3 of backend#743. The end-to-end NFS write path also needs the
client-runtime ingestor-uid change (separate PR) so jobs-manager reads HOST_UID.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#262): fail fast when an existing cluster lacks the dataset bind mount (Bugbot)

The HOST_DATASET_DIR -> /tracebloc-data bind mount is baked into the k3d nodes at
create time (_create_new_cluster / the PS1 equivalent). k3d cannot add a mount to a
RUNNING cluster, but install-client-helm.sh still wrote `datasetPath: /tracebloc-data`
into the generated values whenever HOST_DATASET_DIR was merely set — so an
existing-cluster re-run pointed the chart's dataset PV at ephemeral in-node storage,
silently putting datasets on disposable storage instead of the network export (lost
on a restart).

Add _check_existing_cluster_dataset_mount (cluster.sh) + the PowerShell equivalent,
mirroring the existing _check_existing_cluster_proxy/bind drift checks: on an existing
cluster with HOST_DATASET_DIR set, inspect the server node for the /tracebloc-data
mount and FAIL FAST with the recreate remedy if it is absent — rather than installing
a quietly misrouted dataset volume. Fail-fast (not warn) because this is silent data
loss, consistent with the network-FS fail-fast guard. Values generation needs no
change: the install now stops in Step 2, before helm runs.

+4 bats (cluster.bats): unset -> no-op, mount present -> pass, mount ABSENT -> fail
fast, inspect fails -> no-op. bash + shellcheck clean; pwsh parses the .ps1; full
cluster suite 27/27.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…834) (#266)

RFC-0001 Phase 0 (no backend dependency):

- Accept TRACEBLOC_CLIENT_ID / TRACEBLOC_CLIENT_PASSWORD so CI / automation /
  golden images can provision without typing the secret inline. Verified the
  same way as the prompt (verify_credentials); a bad credential fails the
  install (no re-prompt in non-interactive mode). The interactive prompt path
  is unchanged — just wrapped in the else branch.
- Soften the "to connect this machine you NEED a tracebloc client / create
  one" copy to "already have one? enter it (or set the env vars) / need one?
  create it" — so we stop framing client-creation as a mandatory pre-step
  (browser sign-in lands in Phase 1).

Tests: two new bats cases (env path -> non-interactive write + helm with no
prompt; rejected env creds -> error, no helm). The interactive flow tests
(re-prompt / inactive / unverified / defaults / max-attempts / one-client
guard) still pass unchanged.

NOTE: bats #16 (_extract_yaml_value single-quote '' un-escape) fails locally
on macOS bash — pre-existing and untouched by this PR (the diff doesn't go
near that function); flagging for confirmation against CI.

Part of backend#830. Closes #834.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Asad Iqbal <asad.dsoft@gmail.com>
… diagnostics (#269)

§4 ("APPLICATION-pod egress through a proxy", client-runtime#119) was a flaky
required check ("E2E auth-proxy (squid)") that intermittently red-X'd develop
(~1 in 4; e.g. run 27765964135) and randomly blocked unrelated PRs. Two causes:

1. Silent failure. Under `set -euo pipefail` the diagnostic `grep | sed` lines
   ran before the real assertion; an empty section made grep exit 1 → pipefail
   → set -e killed the script with NO output (CI showed only "pod/egress-app
   created" then "exit code 1"). Append `|| true` so the diagnostics are
   non-fatal and the assertion fires with its reason. Same footgun fixed in §3.

2. External-network dependency (the real flake). §4 curled the real
   https://api.tracebloc.io/ through the in-cluster squid, depending on the
   runner's internet to a production host at test time. Make it hermetic: target
   a reserved-TLD stand-in host (backend.tracebloc-e2e.test) aliased via
   hostAliases on both the squid and app pods to the cluster's own kube-apiserver
   ClusterIP — a guaranteed in-cluster HTTPS:443 listener. The CONNECT tunnel now
   terminates in-cluster with zero external I/O, preserving the #119 intent
   (WITH proxy env → CONNECT tunnel via squid; env unset → direct dial).

Validated: 3/3 deterministic local passes; both calls hit 10.43.0.1 in-cluster
(no api.tracebloc.io reachout). bash -n + shellcheck --severity=error clean.

Closes #268

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Adds `egress-reachability-check`, a `helm test` Job that verifies a normal
(non-training) pod in the namespace can reach the tracebloc backend API — the
egress dependency that gates everything (the cluster authenticates to the
backend to obtain its Service Bus credentials, so no backend egress => silent
Pending). The required-egress complement to egress-enforcement-check (which
verifies the opposite: that training pods are locked out).

The probe is deliberately NOT training-labelled (so the lockdown netpol never
selects it — it keeps the jobs-manager/requests-proxy egress class) and honours
tracebloc.proxyEnv, so it tests the real path. The verdict keys on curl's exit
code (TCP reachability), not HTTP status. Run via `helm test <release>`; gated
by egressReachabilityCheck.enabled (default true; disable on truly air-gapped
clusters). As a test hook it never runs during install/upgrade.

Service Bus is intentionally not probed here: its host is fetched post-auth
from the backend (static nowhere in the chart) and its egress is brokered by
the requests-proxy, whose readiness `tracebloc cluster doctor` already checks.

helm-unittest: 6 tests (render / disable / test-hook annotation / not-training-
labelled / CLIENT_ENV-driven host / proxy-inherited). Full chart suite green.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Release v1.8.0 — promotes the 9 commits merged to develop since v1.7.1.
Minor bump: #262 (datasets on a network mount while MySQL stays local)
is a real PVC-placement change at install time, not inert. Keeps
version/appVersion in lockstep so the app.kubernetes.io/version label
matches the release.

Refs #271

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
@saadqbal saadqbal self-assigned this Jun 23, 2026

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 2652bb7. Configure here.

Comment thread client/templates/egress-reachability-check.yaml Outdated
…not just connect failures (#274)

The backend-reachability helm test (#270) keyed success on a denylist of
curl exit codes (5/6/7/28) and passed on everything else. But with no
--fail, curl exits 0 for any HTTP response, so the only non-zero codes are
transport/TLS failures — yet a TLS handshake/cert error (e.g. 35/51/60,
typically a proxy intercepting TLS with a CA the cluster doesn't trust)
printed "backend reachable" and exited 0: a false pass that defeats the
check's purpose.

Pass ONLY on curl exit 0 (a full TCP+TLS+HTTP round trip proves the
backend is reachable and usable); fail closed on everything else, with a
dedicated TLS/cert bucket and a generic catch-all. Companion helm-unittest
asserts nothing on the script body; full suite 265/265.

Follow-up to #270; flagged by Cursor Bugbot on the v1.8.0 sync PR (#273).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
@saadqbal saadqbal merged commit 73dd8b8 into main Jun 23, 2026
52 of 53 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Release client chart v1.8.0

3 participants