Skip to content

feat: in-cluster backend-reachability helm test (WS3 / cli#90)#270

Merged
saadqbal merged 1 commit into
developfrom
feat/cli90-egress-reachability-check
Jun 22, 2026
Merged

feat: in-cluster backend-reachability helm test (WS3 / cli#90)#270
saadqbal merged 1 commit into
developfrom
feat/cli90-egress-reachability-check

Conversation

@saadqbal

@saadqbal saadqbal commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

What

Adds egress-reachability-check — a helm test Job that verifies a normal (non-training) pod in the namespace can reach the tracebloc backend API from inside the cluster.

This is the in-cluster egress probe tracked as the last item of tracebloc/cli#90 (WS3, epic client-runtime#116). Run it with helm test <release>.

Why it lives here (and not in tracebloc cluster doctor)

Scoping cli#90 surfaced two things:

  • The egress-proxy is Squid (no health endpoint to query), and the SB connection string is fetched post-auth from the backend — static nowhere in the chart. So a real in-cluster probe can't be a read-only client-go check; it has to run a pod inside the cluster.
  • The codebase's existing pattern for exactly that is a helm test Job (egress-enforcement-check). So this is its required-egress sibling — and tracebloc cluster doctor stays read-only.

doctor's host-side Backend egress check probes from the CLI machine; this probes from inside the cluster, on the real egress path.

Design

  • A helm test hook Job (curl image, PSA-restricted, ttlSecondsAfterFinished), mirroring egress-enforcement-check — but the inverse verdict: reachable = PASS.
  • The probe pod is deliberately not tracebloc.io/workload: training-labelled, so the lockdown NetworkPolicy never selects it — it keeps the jobs-manager / requests-proxy egress class (the pods that actually reach the backend).
  • Honours tracebloc.proxyEnv, so it tests the real path (through the corporate proxy when configured).
  • Verdict keys on curl's exit code (TCP/DNS reachability), not HTTP status: 6 DNS / 7 refused / 28 timeout → FAIL with specific guidance; anything else → reachable.
  • Backend host derives from CLIENT_ENV (dev/stg/prod). Gated by egressReachabilityCheck.enabled (default true, nil-safe via dig for --reuse-values; set false on truly air-gapped clusters). As a test hook it never runs during install/upgrade.

Out of scope

Service Bus isn't probed directly — its host is fetched post-auth (not static) and its egress is brokered by the requests-proxy, whose readiness tracebloc cluster doctor already checks. Backend reachability is the prerequisite for both (the cluster authenticates to the backend to get the SB creds).

Testing

helm unittest ./client6 new tests (renders / disabled / test-hook annotation / not-training-labelled / CLIENT_ENV-driven host / proxy-inherited) and the full chart suite green (25 suites, 264 tests).

🤖 Generated with Claude Code


Note

Low Risk
Diagnostic-only helm test hook with no change to install/upgrade or workload runtime unless operators run the test.

Overview
Adds egress-reachability-check, a helm test Job that curls the tracebloc backend from a non-training pod (same egress path as jobs-manager / requests-proxy, including corporate proxy via tracebloc.proxyEnv). Pass means TCP/TLS to the API host succeeds; failure messages map curl exit codes (DNS, refused, timeout, proxy resolution).

egressReachabilityCheck.enabled defaults to true in values.yaml (disable for air-gapped installs). The hook runs only on helm test, not on install/upgrade. Six helm unittest cases cover rendering, the flag, test-hook annotation, absence of the training workload label, CLIENT_ENV host selection, and proxy env inheritance.

Reviewed by Cursor Bugbot for commit 26ed27a. Bugbot is set up for automated code reviews on this repo. Configure here.

Adds `egress-reachability-check`, a `helm test` Job that verifies a normal
(non-training) pod in the namespace can reach the tracebloc backend API — the
egress dependency that gates everything (the cluster authenticates to the
backend to obtain its Service Bus credentials, so no backend egress => silent
Pending). The required-egress complement to egress-enforcement-check (which
verifies the opposite: that training pods are locked out).

The probe is deliberately NOT training-labelled (so the lockdown netpol never
selects it — it keeps the jobs-manager/requests-proxy egress class) and honours
tracebloc.proxyEnv, so it tests the real path. The verdict keys on curl's exit
code (TCP reachability), not HTTP status. Run via `helm test <release>`; gated
by egressReachabilityCheck.enabled (default true; disable on truly air-gapped
clusters). As a test hook it never runs during install/upgrade.

Service Bus is intentionally not probed here: its host is fetched post-auth
from the backend (static nowhere in the chart) and its egress is brokered by
the requests-proxy, whose readiness `tracebloc cluster doctor` already checks.

helm-unittest: 6 tests (render / disable / test-hook annotation / not-training-
labelled / CLIENT_ENV-driven host / proxy-inherited). Full chart suite green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@saadqbal saadqbal merged commit cafe574 into develop Jun 22, 2026
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants