
test(e2e-proxy): deflake §4 app-egress — hermetic target + non-silent diagnostics#269

Merged
saadqbal merged 1 commit into develop from fix/e2e-proxy-app-egress-flake
Jun 19, 2026
Conversation

@saadqbal saadqbal commented Jun 18, 2026

Contributor

What & why

scripts/tests/e2e-proxy.sh §4 — the egress-app application-pod egress test (client-runtime#119) — is the flaky required check "E2E auth-proxy (squid)". It fails intermittently on develop itself (~1 in 4 recent runs; e.g. run 27765964135 failed while neighbouring runs passed), so it randomly blocks unrelated PRs. This is a flaky-required-check fix, branched off develop (it's develop's test, not specific to any feature branch).

Two distinct root causes, both fixed:

1. Silent failure masked the real cause

Under set -euo pipefail, the diagnostic grep | sed lines ran before the real assertion. When the captured curl section was empty (the failure case), grep exited 1 → pipefail failed the pipeline → set -e killed the script with no output. The CI log showed only pod/egress-app created then ##[error]Process completed with exit code 1 — the informative error "App pod WITH the ingestion proxy env did NOT tunnel…" never ran.

Fix: the two diagnostic grep | sed lines now end in || true, so they're non-fatal and the real assertions fire and report the actual reason. (The identical footgun one section up — §3's squid-access-log preview — is fixed the same way.)
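The footgun and its fix can be reproduced without a cluster. The following is a minimal sketch, not the code from scripts/tests/e2e-proxy.sh: the function names `preview_lines` and `assert_tunnelled` and the error text are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# preview_lines LOG PATTERN LABEL (hypothetical helper)
# Prints matching lines prefixed with LABEL. When nothing matches, grep exits 1
# and pipefail fails the pipeline; the trailing `|| true` keeps that from
# tripping `set -e`, so the caller carries on to the real assertion.
preview_lines() {
  local log=$1 pattern=$2 label=$3
  printf '%s\n' "$log" | grep -- "$pattern" | sed "s/^/${label}: /" || true
}

# assert_tunnelled LOG (hypothetical helper)
# The real assertion: returns non-zero with a readable reason on stderr.
assert_tunnelled() {
  local log=$1
  if ! printf '%s\n' "$log" | grep -q 'CONNECT tunnel established'; then
    echo "ERROR: captured curl output shows no CONNECT tunnel" >&2
    return 1
  fi
}
```

Without the `|| true`, calling `preview_lines ""` as a plain statement would end the script before `assert_tunnelled` could print its reason, which matches the silent exit described above.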

2. External-network dependency (the actual flake)

§4's egress-app pod curled the real https://api.tracebloc.io/ through a freshly-deployed in-cluster squid. That depends on the in-cluster squid having working egress to a production host at the exact moment the test runs (DNS/latency/transient unavailability + pod-startup timing) — inherently flaky on CI runners. The failing run exited ~13s after creating the pod.

Fix: make it hermetic. Sections A and B now target a reserved-TLD stand-in host, backend.tracebloc-e2e.test (RFC 6761 .test, guaranteed never to resolve publicly), mapped via hostAliases (which adds /etc/hosts entries) on both the squid pod and the app pod to the cluster's own kube-apiserver ClusterIP, an always-up in-cluster HTTPS:443 listener. The squid's CONNECT tunnel terminates against a real in-cluster TLS endpoint, so the test never leaves the cluster.

The client-runtime#119 intent is intact:

  • WITH the ingestion proxy env → curl opens a CONNECT tunnel through the squid to the backend host.
  • env unset → the same call dials direct.

(-k: the apiserver presents the cluster-CA cert, untrusted here — we assert proxy routing, not TLS trust, so verification is skipped and both calls complete to a real 401.)
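As a rough sketch of the hostAliases approach, the function below renders an app-pod manifest through an unquoted heredoc. It is an illustration under assumptions, not the manifest from the script: the function name `render_egress_app`, the container name, the `curlimages/curl` image and the `sleep` command are all hypothetical. Only the pod name egress-app, the stand-in host and the hostAliases mechanism come from this PR.

```shell
#!/usr/bin/env bash
set -eu

BACKEND_HOST="backend.tracebloc-e2e.test"

# render_egress_app IP (hypothetical helper)
# Prints a Pod manifest whose /etc/hosts maps BACKEND_HOST to IP.
render_egress_app() {
  APISERVER_IP=$1
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: egress-app
spec:
  restartPolicy: Never
  hostAliases:
    - ip: "${APISERVER_IP}"
      hostnames:
        - "${BACKEND_HOST}"
  containers:
    - name: app                # assumed container name
      image: curlimages/curl   # assumed image, any image with curl would do
      command: ["sleep", "3600"]
EOF
}

# Against a live cluster the rendered manifest could be applied like this:
#   ip=$(kubectl get svc kubernetes -n default -o jsonpath='{.spec.clusterIP}')
#   render_egress_app "$ip" | kubectl apply -f -
#   kubectl exec egress-app -- curl -skv "https://${BACKEND_HOST}/"
```

The same hostAliases stanza would also be needed on the squid pod, since squid is the process that resolves the CONNECT target when the call is proxied.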

Validation

3/3 deterministic local passes on a real k3d cluster (arm64). Representative §4 output:

A WITH proxy env:  * Establish HTTP proxy tunnel to backend.tracebloc-e2e.test:443
A WITH proxy env:  * CONNECT tunnel established, response 200
A WITH proxy env:  < HTTP/2 401
B env unset:       *   Trying 10.43.0.1:443...
B env unset:       < HTTP/2 401
✔ App-pod egress verified: WITH the ingestion proxy env the backend call tunnelled
  through the in-cluster squid; with it unset the same call dialled direct.

Both calls hit 10.43.0.1 (the in-cluster apiserver); neither reaches out to api.tracebloc.io.
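The routing check on that output could look roughly like the sketch below. The helper names are hypothetical and the matched strings are taken from the sample output above; curl's verbose wording can differ between versions, so the patterns are assumptions rather than a stable interface.

```shell
#!/usr/bin/env bash
set -eu

# used_proxy_tunnel LOG (hypothetical helper)
# Succeeds if curl -v output shows an established CONNECT tunnel.
used_proxy_tunnel() {
  printf '%s\n' "$1" | grep -q 'CONNECT tunnel established'
}

# dialled_direct LOG (hypothetical helper)
# Succeeds if curl attempted a connection and no CONNECT tunnel appears.
dialled_direct() {
  printf '%s\n' "$1" | grep -q 'Trying ' || return 1
  ! used_proxy_tunnel "$1"
}

# Sample lines copied from the representative output in this PR.
log_a='* Establish HTTP proxy tunnel to backend.tracebloc-e2e.test:443
* CONNECT tunnel established, response 200
< HTTP/2 401'
log_b='*   Trying 10.43.0.1:443...
< HTTP/2 401'

if used_proxy_tunnel "$log_a" && dialled_direct "$log_b"; then
  echo "egress routing OK"
else
  echo "egress routing FAILED" >&2
fi
```

Checking for the tunnel line rather than the HTTP status keeps the assertion about routing only: both calls end in the same 401, so the status alone cannot tell the two paths apart.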

  • bash -n scripts/tests/e2e-proxy.sh
  • shellcheck --severity=error --shell=bash (the CI gate) ✓ — and --severity=warning clean too.

Closes #268


Note

Low Risk
Test-only changes to an E2E shell script; no production runtime, auth, or deployment logic.

Overview
Stabilizes the E2E auth-proxy check in scripts/tests/e2e-proxy.sh by fixing two failure modes in §3–§4.

Diagnostics no longer abort the script: preview grep | sed pipelines (squid access log and curl log sections) now end with || true, so under set -euo pipefail an empty match cannot exit before the real error assertions run.

§4 app-pod egress is hermetic: instead of curling production api.tracebloc.io through the in-cluster squid (CI internet flake), both the squid deployment and egress-app pod use hostAliases to map backend.tracebloc-e2e.test to the cluster kube-apiserver ClusterIP. Manifests are applied via unquoted heredocs so ${APISERVER_IP} / ${BACKEND_HOST} substitute; curl uses -k to assert proxy routing (CONNECT tunnel vs direct dial), not TLS trust. Assertions and pass messaging were updated for the stand-in host.
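The heredoc detail matters because quoting the delimiter turns expansion off. A small illustration, using the variable name from the summary above with an assumed value:

```shell
#!/usr/bin/env bash
set -eu

BACKEND_HOST="backend.tracebloc-e2e.test"

# Unquoted delimiter: the shell expands ${BACKEND_HOST} in the body.
expanded=$(cat <<EOF
host: ${BACKEND_HOST}
EOF
)

# Quoted delimiter: the body is passed through literally, unexpanded.
literal=$(cat <<'EOF'
host: ${BACKEND_HOST}
EOF
)

printf '%s\n' "$expanded" "$literal"
```

With a quoted delimiter the manifest would contain the literal text `${BACKEND_HOST}`, so the hostAliases entry would not name the stand-in host.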

Reviewed by Cursor Bugbot for commit ca66e5c.

… diagnostics

§4 ("APPLICATION-pod egress through a proxy", client-runtime#119) was a flaky
required check ("E2E auth-proxy (squid)") that intermittently red-X'd develop
(~1 in 4; e.g. run 27765964135) and randomly blocked unrelated PRs. Two causes:

1. Silent failure. Under `set -euo pipefail` the diagnostic `grep | sed` lines
   ran before the real assertion; an empty section made grep exit 1 → pipefail
   → set -e killed the script with NO output (CI showed only "pod/egress-app
   created" then "exit code 1"). Append `|| true` so the diagnostics are
   non-fatal and the assertion fires with its reason. Same footgun fixed in §3.

2. External-network dependency (the real flake). §4 curled the real
   https://api.tracebloc.io/ through the in-cluster squid, depending on the
   runner's internet to a production host at test time. Make it hermetic: target
   a reserved-TLD stand-in host (backend.tracebloc-e2e.test) aliased via
   hostAliases on both the squid and app pods to the cluster's own kube-apiserver
   ClusterIP — a guaranteed in-cluster HTTPS:443 listener. The CONNECT tunnel now
   terminates in-cluster with zero external I/O, preserving the #119 intent
   (WITH proxy env → CONNECT tunnel via squid; env unset → direct dial).

Validated: 3/3 deterministic local passes; both calls hit 10.43.0.1 in-cluster
(no api.tracebloc.io reachout). bash -n + shellcheck --severity=error clean.

Closes #268

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@saadqbal saadqbal self-assigned this Jun 18, 2026
@saadqbal saadqbal merged commit 23364ad into develop Jun 19, 2026
27 checks passed