test(e2e-proxy): exercise application-pod egress through the proxy (Charité setup) by LukasWodka · Pull Request #264 · tracebloc/client

LukasWodka · 2026-06-17T14:09:59Z

Draft — needs a CI run to validate (I can't run k3d/squid/docker locally).

What

Extends the e2e-proxy.sh squid harness ("the Charité/hospital archetype") to cover application-pod egress, not just node image pulls. After the cluster is up behind the authenticated squid, it:

runs a pod with the ingestion-style proxy env (HTTP(S)_PROXY = squid, cluster-safe NO_PROXY) → asserts its backend CONNECT api.tracebloc.io appears in the squid access log (authenticated);
runs a pod without proxy env → asserts it does not appear (it dialled direct).
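
A minimal sketch of the access-log assertion described above, assuming squid writes its default native log format (time, elapsed, client, code/status, bytes, method, URL, user, hierarchy, type). The function name and the sample lines are hypothetical and written by hand for illustration; they are not taken from the harness or from a real run.

```python
from typing import Dict, Iterable, List


def proxied_connects(log_lines: Iterable[str], host: str) -> List[Dict[str, str]]:
    """Return authenticated CONNECT entries for `host` from squid native-format lines."""
    hits = []
    for line in log_lines:
        fields = line.split()
        # Native format has 10 whitespace-separated fields; skip anything else.
        if len(fields) < 10:
            continue
        method, url, user = fields[5], fields[6], fields[7]
        if method != "CONNECT":
            continue
        # CONNECT targets are logged as host:port.
        target_host = url.rsplit(":", 1)[0]
        # A user of "-" means the request carried no accepted credentials.
        if target_host == host and user != "-":
            hits.append({"client": fields[2], "status": fields[3], "user": user})
    return hits


# Hypothetical log lines (illustrative only, not captured output).
SAMPLE_LOG = [
    "1781705399.123    210 10.42.0.7 TCP_TUNNEL/200 5638 CONNECT "
    "api.tracebloc.io:443 probe HIER_DIRECT/203.0.113.10 -",
    "1781705400.456      3 10.42.0.7 TCP_DENIED/407 4021 CONNECT "
    "api.tracebloc.io:443 - HIER_NONE/- text/html",
]
```

With the proxy env set, the assertion is that `proxied_connects(...)` is non-empty for the backend host; without it, that the list is empty.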

Why

The harness proved NODE egress but stopped before any application pod, so it never caught client-runtime#119 — the spawned ingestion Job carried no proxy env and dialled the backend directly (Charité: [Errno 111] Connection refused). This is the layer the fix lives at.

Why draft

bash -n + shellcheck pass, but I can't run k3d/squid/docker in my environment, so the runtime behaviour is unverified locally. The installer-tests.yaml e2e-proxy job is the validator. The one assumption to confirm there: a pod can reach the host's squid via host.k3d.internal:3128 (k3d publishes host.k3d.internal into CoreDNS NodeHosts, so it should — but CI confirms). The behavioural routing contract itself is already verified by the unit tests on client-runtime#119.

Note

In a real proxy-only network (Charité) the no-proxy pod's direct dial is refused; the test cluster's nodes have direct egress, so this asserts the absence of a proxied CONNECT instead. A true "direct refused" would need egress-blocking (k3d's flannel doesn't enforce NetworkPolicy) — tracked as a follow-up if we want it.

The squid harness proved NODE egress (image pulls) but stopped before any application pod, so it never caught client-runtime#119, where the spawned ingestion Job carried no proxy env and dialled the backend directly.

Add a section that runs a pod WITH the ingestion-style proxy env (must traverse the squid to reach the backend) and a pod WITHOUT it (must bypass it / go direct), asserting both against the squid access log. Models the Charité proxy-only setup at the application layer; pairs with the behavioural unit tests on client-runtime#119.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

LukasWodka · 2026-06-17T14:11:13Z

👋 Heads-up — Code review queue is at 32 / 30

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

.github#57 — fix(fr-gate): pass items at or beyond the required stage · author: @aptracebloc · no reviewer assigned
.github#60 — feat(security): public-repo PII gate (block customer names + secrets in PRs) · author: @LukasWodka · no reviewer assigned
averaging-service#117 — refactor(averaging): per-framework weight-handling seam (WS-B B1) · author: @divyasinghds · no reviewer assigned
averaging-service#118 — feat(averaging): trainable-only keyed-dict weight format (WS-C docs: surface standalone installer in README and INSTALL.md #104) · author: @divyasinghds · no reviewer assigned
averaging-service#119 — fix(averaging): exact federated GaussianNB merge — usable + correct (WS-B B2, chore(auto-upgrade): run cronjob hourly at :23 #112) · author: @divyasinghds · no reviewer assigned
backend#815 — chore(deps): bump cryptography from 47.0.0 to 48.0.1 · author: @dependabot · no reviewer assigned
backend#825 — feat(#763): distribute authoritative feature_columns to edges (phase 1) · author: @aptracebloc · reviewer: @saadqbal
backend#829 — feat(#817): failure surfaces a reason end-to-end + stuck-run reaper (WS6 I3/I4/I5/I8) · author: @saadqbal · no reviewer assigned
cli#78 — fix(dataset rm): delete staging files from a uid-65532 pod, not jobs-manager (bug: dataset rm cannot delete staging files — ingestor (uid 65534) vs jobs-manager uid mismatch, no shared fsGroup #259) · author: @LukasWodka · no reviewer assigned
cli#79 — chore(schema): re-sync vendored ingest.v1.json from data-ingestors master · author: @LukasWodka · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

LukasWodka · 2026-06-17T14:20:31Z

Ran this against a real k3d cluster — found a bug, do not merge as-is.

The node image-pull section (§3) is fine. The new app-pod egress section (§4) is broken: a pod cannot resolve host.k3d.internal — that alias is injected for k3d nodes (image pulls), not into pod DNS. The proxied probe fails with curl: (5) Could not resolve proxy: host.k3d.internal, so the "WITH proxy env" assertion never passes. (The no-proxy/direct probe behaves correctly.)

Fix direction: point the app-pod test at a pod-reachable proxy — an in-cluster squid Deployment+Service (reachable via Service DNS, and a closer model of a real corporate proxy reachable by name) rather than the host squid. Locally the Service DNS resolves; finalising the squid pod's serving config.
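
To illustrate the fix direction, here is a small sketch of how a proxy URL could be derived from a Kubernetes Service DNS name, plus a pre-flight check that the proxy hostname resolves from wherever the probe runs. The Service name `squid`, the namespace `e2e-proxy`, and all function names are assumptions for illustration; the harness may use different ones.

```python
import socket
from urllib.parse import urlsplit


def service_proxy_url(service: str, namespace: str, port: int = 3128,
                      cluster_domain: str = "cluster.local") -> str:
    """Build a proxy URL from the standard <service>.<namespace>.svc.<domain> form."""
    return f"http://{service}.{namespace}.svc.{cluster_domain}:{port}"


def proxy_host(proxy_url: str) -> str:
    """Extract the hostname a client would have to resolve to use this proxy."""
    return urlsplit(proxy_url).hostname


def can_resolve(hostname: str) -> bool:
    """Return True if the local resolver can resolve `hostname`."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False
```

Run from inside a pod, `can_resolve(proxy_host(url))` would fail early with a clear signal for a node-only alias such as host.k3d.internal, instead of surfacing later as `curl: (5) Could not resolve proxy`.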

Keeping this draft until it runs green. Note the routing contract (emitted env → backend via proxy, in-cluster bypassed, no-proxy → direct) is already verified by the behavioural unit tests on client-runtime#123 — that's the coverage to gate the #122 merge on; this harness E2E is defence-in-depth.

Running the first version on a real k3d cluster surfaced that a POD cannot resolve host.k3d.internal (it is a node-level alias for image pulls, not pod DNS), so the proxied probe failed with `curl (5) Could not resolve proxy`.

Rework: stand up an in-cluster squid Deployment+Service the test pods reach by Service DNS (also a closer model of a real corporate proxy reachable by name), with a readiness probe gating rollout on squid actually listening (fixes the probe-before-bind race seen in the first attempt). A pod WITH the ingestion proxy env must reach the backend through the squid; a pod WITHOUT it must bypass it. Auth survival stays covered by the host-squid sections (1-3).

bash -n + shellcheck + embedded-YAML parse all clean; Service-DNS resolution verified locally. Full proxied-curl run is exercised by the e2e-proxy CI job.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

LukasWodka · 2026-06-17T14:33:13Z

Reworked + out of draft. Running it against a real k3d cluster (thanks for the docker) is what flagged the original bug: a pod can't resolve host.k3d.internal (node-only alias), so the proxied probe failed with curl (5) Could not resolve proxy.

Fix (latest commit): the app-pod section now stands up an in-cluster squid Deployment+Service that the test pods reach by Service DNS — pod-reachable, and a closer model of a real corporate proxy reachable by name. A readiness probe gates rollout on squid actually listening, which fixes the "connect refused after 1ms" race the first in-cluster attempt hit. Auth-survival stays covered by §1-3's host squid; this section is purely about proxy-env routing.
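
The readiness probe in the PR is a Kubernetes probe on the squid Deployment. The sketch below is not that probe; it only illustrates the same gating idea in client code, assuming a plain TCP connect is an adequate signal that squid has bound its port. The function name and defaults are hypothetical.

```python
import socket
import time


def wait_until_listening(host: str, port: int, timeout: float = 30.0,
                         interval: float = 0.5) -> bool:
    """Poll until a TCP connect to host:port succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something has bound and is accepting.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused, unreachable, or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Gating the probe on a check like this avoids the situation described above, where the first request arrives before the proxy is listening and is refused almost immediately.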

Verified locally: bash -n + shellcheck clean; embedded k8s YAML parses (3 docs); pod→Service-DNS resolution confirmed on k3d. The one piece I couldn't run end-to-end locally is the full proxied curl (the heavy in-cluster deploy was declined mid-session) — that's exercised by the e2e-proxy CI job, which is the right env anyway (Linux runner, installer's create_cluster()). @saadqbal — over to the CI run; flag me if the proxied-probe step needs a tweak.

…pod, curl -v)

§4 now uses ONE pod carrying the ingestion-style proxy env that makes two calls to the same backend: WITH the env it must tunnel via the in-cluster squid (a CONNECT tunnel); with the env unset it must dial direct.

Proof is taken client-side from `curl -v` (the CONNECT-tunnel lines), not by reading squid's access.log: that file is buffered by the log daemon and came back empty when read right after the probe, producing false failures.

Also set BOTH proxy-env cases: curl honours the lower-case `https_proxy` for HTTPS and the upper-case alone is not reliably picked up, so the probe must emit both, exactly as the real ingestion env does.

A single pod with a single log also removes the multi-pod scheduling / log-flush races that made the earlier two-pod form flaky.

Validated end-to-end on k3d:
A (proxy env) -> "Establish HTTP proxy tunnel to api.tracebloc.io:443" + "CONNECT tunnel established, response 200" + 200 OK
B (env unset) -> direct connect to the backend IP, no proxy tunnel, 200

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

LukasWodka · 2026-06-17T15:00:45Z

✅ Validated green end-to-end on k3d

Stood up the in-cluster squid + the §4 probe against a live k3d cluster and proved both directions. Pushed in 7281f32.

applog length=5638
A WITH proxy env:  * Establish HTTP proxy tunnel to api.tracebloc.io:443
A WITH proxy env:  < HTTP/1.1 200 Connection established
A WITH proxy env:  * CONNECT tunnel established, response 200
A WITH proxy env:  < HTTP/1.1 200 OK
B env unset:       *   Trying 35.156.244.148:443...     ← direct to the backend IP, no proxy
B env unset:       < HTTP/1.1 200 OK
════ RESULT: GREEN — §4 verbatim logic passes end-to-end ════

A (pod carries the ingestion-style proxy env) → backend reached through the in-cluster squid (a real CONNECT tunnel). B (same pod, proxy env unset) → the same call dials the backend's IP directly. That is exactly the #119 property: ingestion-style backend egress is proxied when the env is present, and only then.
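
The client-side check can be sketched as a small classifier over `curl -v` output. This is an illustration under the assumption that the three verbose-line shapes quoted in the log above are the signals of interest; the regexes and the function name are hypothetical, not the harness's actual logic. The sample strings reuse the lines from the validated run, with the "A"/"B" prefixes removed.

```python
import re

# Line shapes taken from the curl -v excerpts quoted in this PR.
TUNNEL_OPEN = re.compile(r"^\*\s+Establish HTTP proxy tunnel to (\S+)")
TUNNEL_OK = re.compile(r"^\*\s+CONNECT tunnel established, response 200")
TRYING = re.compile(r"^\*\s+Trying (\S+?)\.\.\.")


def classify(verbose_output: str) -> str:
    """Classify one curl -v transcript as 'proxied', 'direct', or 'unknown'."""
    lines = [line.strip() for line in verbose_output.splitlines()]
    opened = any(TUNNEL_OPEN.match(line) for line in lines)
    established = any(TUNNEL_OK.match(line) for line in lines)
    if opened and established:
        return "proxied"
    # "Trying" also appears when curl dials a proxy, so it only counts as
    # direct when no tunnel line is present at all.
    if not opened and not established and any(TRYING.match(line) for line in lines):
        return "direct"
    return "unknown"


RUN_A = (
    "* Establish HTTP proxy tunnel to api.tracebloc.io:443\n"
    "< HTTP/1.1 200 Connection established\n"
    "* CONNECT tunnel established, response 200\n"
    "< HTTP/1.1 200 OK\n"
)
RUN_B = (
    "*   Trying 35.156.244.148:443...\n"
    "< HTTP/1.1 200 OK\n"
)
```

Here `classify(RUN_A)` yields `"proxied"` and `classify(RUN_B)` yields `"direct"`; a half-open tunnel (opened but never established) falls through to `"unknown"` so it cannot be mistaken for either outcome.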

Two substantive changes from the first cut

Assert client-side via curl -v, not by reading squid's access.log. That file is buffered by squid's log daemon and came back empty when read immediately after the probe — a false failure. The curl -v CONNECT-tunnel lines are deterministic and need no flush.
Set BOTH proxy-env cases. curl honours the lower-case https_proxy for HTTPS and does not reliably pick up the upper-case HTTPS_PROXY alone, so a probe that sets only upper-case silently dials direct and the test lies. The real ingestion env (Asad's _ingestor_proxy_env in #122, "Reduce dependency on values.yaml file for requests proxy") emits both cases, so the probe now does too.
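
A minimal sketch of emitting both cases, assuming the env is built as a plain name-to-value mapping. This is not the real `_ingestor_proxy_env`; the function name, proxy URL, and NO_PROXY value below are hypothetical. For context, curl's documentation lists `http_proxy` as lower-case only, which is one more reason not to rely on a single case.

```python
from typing import Dict


def both_case_proxy_env(proxy_url: str, no_proxy: str) -> Dict[str, str]:
    """Return proxy variables in upper and lower case so case-sensitive clients agree."""
    env: Dict[str, str] = {}
    for name, value in (
        ("HTTP_PROXY", proxy_url),
        ("HTTPS_PROXY", proxy_url),
        ("NO_PROXY", no_proxy),
    ):
        env[name] = value          # upper-case variant
        env[name.lower()] = value  # lower-case variant, same value
    return env


# Hypothetical values for illustration only.
EXAMPLE_ENV = both_case_proxy_env(
    "http://squid.e2e-proxy.svc.cluster.local:3128",
    "localhost,127.0.0.1,.svc,.cluster.local",
)
```

Deriving both variants from one source value keeps them from drifting apart, so a client that reads only one case still sees the same proxy.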

Design note

Collapsed the earlier two-pod form into one pod making two calls (with env / env-unset). One pod + one log removes the multi-pod scheduling and log-flush races, so the assertion is deterministic in CI.

Note on the "direct" leg: on a real proxy-only network (Charité) the env-unset dial would be refused with [Errno 111] — that refusal was the bug. The k3d node has direct egress, so here we assert the absence of a proxied CONNECT instead, which is the same signal.

LukasWodka assigned saadqbal Jun 17, 2026

LukasWodka marked this pull request as ready for review June 17, 2026 14:33

saadqbal approved these changes Jun 17, 2026

saadqbal merged commit c934828 into develop Jun 17, 2026
20 checks passed

saadqbal merged 3 commits into develop from test/e2e-proxy-app-egress
