test(e2e-proxy): exercise application-pod egress through the proxy (Charité setup)#264

Merged
saadqbal merged 3 commits into develop from test/e2e-proxy-app-egress
Jun 17, 2026

@LukasWodka

Contributor

Draft — needs a CI run to validate (I can't run k3d/squid/docker locally).

What

Extends the e2e-proxy.sh squid harness ("the Charité/hospital archetype") to cover application-pod egress, not just node image pulls. After the cluster is up behind the authenticated squid, it:

  • runs a pod with the ingestion-style proxy env (HTTP(S)_PROXY = squid, cluster-safe NO_PROXY) → asserts its backend CONNECT api.tracebloc.io appears in the squid access log (authenticated);
  • runs a pod without proxy env → asserts it does not appear (it dialled direct).
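
The access-log assertion described in the two bullets above could look roughly like the sketch below. It is not the harness's actual code: the function name `saw_authenticated_connect` is hypothetical, and it assumes squid's default native log format, where field 6 is the method, field 7 the URL, and field 8 the authenticated user (`-` when anonymous).

```shell
# Hypothetical helper: succeeds if the squid access log records an
# authenticated CONNECT to the given host.
# Assumes squid's default native log format:
#   time elapsed client code/status bytes method URL user hierarchy type
saw_authenticated_connect() {
  log_file="$1"
  host="$2"
  awk -v host="$host" '
    # field 6 = method, field 7 = host:port, field 8 = user ("-" if none)
    $6 == "CONNECT" && index($7, host ":") == 1 && $8 != "-" { found = 1 }
    END { exit found ? 0 : 1 }
  ' "$log_file"
}

# Usage (log path is an assumption):
#   saw_authenticated_connect /var/log/squid/access.log api.tracebloc.io \
#     || echo "no authenticated CONNECT seen"
```

The proxied pod should make this check succeed and the no-proxy pod should leave it failing for that pod's request.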

Why

The harness proved NODE egress but stopped before any application pod, so it never caught client-runtime#119 — the spawned ingestion Job carried no proxy env and dialled the backend directly (Charité: [Errno 111] Connection refused). This is the layer the fix lives at.

Why draft

bash -n + shellcheck pass, but I can't run k3d/squid/docker in my environment, so the runtime behaviour is unverified locally. The installer-tests.yaml e2e-proxy job is the validator. The one assumption to confirm there: a pod can reach the host's squid via host.k3d.internal:3128 (k3d publishes host.k3d.internal into CoreDNS NodeHosts, so it should — but CI confirms). The behavioural routing contract itself is already verified by the unit tests on client-runtime#119.

Note

In a real proxy-only network (Charité) the no-proxy pod's direct dial is refused; the test cluster's nodes have direct egress, so this asserts the absence of a proxied CONNECT instead. A true "direct refused" would need egress-blocking (k3d's flannel doesn't enforce NetworkPolicy) — tracked as a follow-up if we want it.

The squid harness proved NODE egress (image pulls) but stopped before any
application pod — so it never caught client-runtime#119, where the spawned
ingestion Job carried no proxy env and dialled the backend directly. Add a
section that runs a pod WITH the ingestion-style proxy env (must traverse the
squid to reach the backend) and a pod WITHOUT it (must bypass it / go direct),
asserting both against the squid access log.

Models the Charité proxy-only setup at the application layer; pairs with the
behavioural unit tests on client-runtime#119.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@LukasWodka

Contributor Author

👋 Heads-up — Code review queue is at 32 / 30

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@LukasWodka

Contributor Author

Ran this against a real k3d cluster — found a bug, do not merge as-is.

The node image-pull section (§3) is fine. The new app-pod egress section (§4) is broken: a pod cannot resolve host.k3d.internal — that alias is injected for k3d nodes (image pulls), not into pod DNS. The proxied probe fails with curl: (5) Could not resolve proxy: host.k3d.internal, so the "WITH proxy env" assertion never passes. (The no-proxy/direct probe behaves correctly.)

Fix direction: point the app-pod test at a pod-reachable proxy — an in-cluster squid Deployment+Service (reachable via Service DNS, and a closer model of a real corporate proxy reachable by name) rather than the host squid. Locally the Service DNS resolves; finalising the squid pod's serving config.

Keeping this draft until it runs green. Note the routing contract (emitted env → backend via proxy, in-cluster bypassed, no-proxy → direct) is already verified by the behavioural unit tests on client-runtime#123 — that's the coverage to gate the #122 merge on; this harness E2E is defence-in-depth.

Running the first version on a real k3d cluster surfaced that a POD cannot
resolve host.k3d.internal (it is a node-level alias for image pulls, not pod
DNS), so the proxied probe failed with `curl: (5) Could not resolve proxy`.

Rework: stand up an in-cluster squid Deployment+Service the test pods reach by
Service DNS (also a closer model of a real corporate proxy reachable by name),
with a readiness probe gating rollout on squid actually listening (fixes the
probe-before-bind race seen in the first attempt). A pod WITH the ingestion
proxy env must reach the backend through the squid; a pod WITHOUT it must bypass
it. Auth survival stays covered by the host-squid sections (1-3).

bash -n + shellcheck + embedded-YAML parse all clean; Service-DNS resolution
verified locally. Full proxied-curl run is exercised by the e2e-proxy CI job.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
LukasWodka marked this pull request as ready for review June 17, 2026 14:33
@LukasWodka

Contributor Author

Reworked + out of draft. Running it against a real k3d cluster (thanks for the docker) is what flagged the original bug: a pod can't resolve host.k3d.internal (node-only alias), so the proxied probe failed with curl: (5) Could not resolve proxy.

Fix (latest commit): the app-pod section now stands up an in-cluster squid Deployment+Service that the test pods reach by Service DNS — pod-reachable, and a closer model of a real corporate proxy reachable by name. A readiness probe gates rollout on squid actually listening, which fixes the "connect refused after 1ms" race the first in-cluster attempt hit. Auth-survival stays covered by §1-3's host squid; this section is purely about proxy-env routing.
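
For illustration, an in-cluster squid with a readiness probe might be declared along the lines below. This is a sketch, not the manifest from the commit: the `squid-proxy` name, the `ubuntu/squid` image, and the probe timing are all assumptions, and squid's serving config (squid.conf) is left out.

```shell
# Hypothetical manifest generator: prints a squid Deployment + Service.
# Names, image, and probe timing are illustrative assumptions.
squid_manifest() {
  cat <<'YAML'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: squid-proxy
spec:
  replicas: 1
  selector:
    matchLabels: {app: squid-proxy}
  template:
    metadata:
      labels: {app: squid-proxy}
    spec:
      containers:
        - name: squid
          image: ubuntu/squid:latest
          ports:
            - containerPort: 3128
          # The pod only becomes Ready once squid accepts TCP connections,
          # so a rollout wait cannot finish before squid has bound its port.
          readinessProbe:
            tcpSocket: {port: 3128}
            periodSeconds: 2
---
apiVersion: v1
kind: Service
metadata:
  name: squid-proxy
spec:
  selector: {app: squid-proxy}
  ports:
    - port: 3128
      targetPort: 3128
YAML
}

# Usage (requires a cluster):
#   squid_manifest | kubectl apply -f -
#   kubectl rollout status deployment/squid-proxy --timeout=120s
```

With a Service in front, test pods in the same namespace can address the proxy by name, e.g. `http://squid-proxy:3128`.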

Verified locally: bash -n + shellcheck clean; embedded k8s YAML parses (3 docs); pod→Service-DNS resolution confirmed on k3d. The one piece I couldn't run end-to-end locally is the full proxied curl (the heavy in-cluster deploy was declined mid-session) — that's exercised by the e2e-proxy CI job, which is the right env anyway (Linux runner, installer's create_cluster()). @saadqbal — over to the CI run; flag me if the proxied-probe step needs a tweak.

@LukasWodka

Contributor Author

👋 Heads-up — Code review queue is at 33 / 30

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

…pod, curl -v)

§4 now uses ONE pod carrying the ingestion-style proxy env that makes two
calls to the same backend: WITH the env it must tunnel via the in-cluster
squid (a CONNECT tunnel); with the env unset it must dial direct. Proof is
taken client-side from `curl -v` (the CONNECT-tunnel lines), not by reading
squid's access.log — that file is buffered by the log daemon and came back
empty when read right after the probe, producing false failures.

Also set BOTH proxy-env cases: curl honours the lower-case `https_proxy`
for HTTPS and the upper-case alone is not reliably picked up, so the probe
must emit both — exactly as the real ingestion env does. A single pod with
a single log also removes the multi-pod scheduling / log-flush races that
made the earlier two-pod form flaky.

Validated end-to-end on k3d:
  A (proxy env)  -> "Establish HTTP proxy tunnel to api.tracebloc.io:443"
                    + "CONNECT tunnel established, response 200" + 200 OK
  B (env unset)  -> direct connect to the backend IP, no proxy tunnel, 200

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@LukasWodka

Contributor Author

✅ Validated green end-to-end on k3d

Stood up the in-cluster squid + the §4 probe against a live k3d cluster and proved both directions. Pushed in 7281f32.

applog length=5638
A WITH proxy env:  * Establish HTTP proxy tunnel to api.tracebloc.io:443
A WITH proxy env:  < HTTP/1.1 200 Connection established
A WITH proxy env:  * CONNECT tunnel established, response 200
A WITH proxy env:  < HTTP/1.1 200 OK
B env unset:       *   Trying 35.156.244.148:443...     ← direct to the backend IP, no proxy
B env unset:       < HTTP/1.1 200 OK
════ RESULT: GREEN — §4 verbatim logic passes end-to-end ════

A (pod carries the ingestion-style proxy env) → backend reached through the in-cluster squid (a real CONNECT tunnel). B (same pod, proxy env unset) → the same call dials the backend's IP directly. That is exactly the #119 property: ingestion-style backend egress is proxied when the env is present, and only then.
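
A minimal sketch of how the client-side check could classify each leg from `curl -v` output. The function name `classify_curl_route` is hypothetical, and the match string is taken from the log lines quoted above; other curl versions may word the tunnel message differently.

```shell
# Hypothetical classifier: reads `curl -v` output on stdin and prints
# "proxied" if a CONNECT tunnel was established, otherwise "direct".
# The match string is copied from the log above and may vary by curl version.
classify_curl_route() {
  if grep -q 'CONNECT tunnel established, response 200'; then
    echo proxied
  else
    echo direct
  fi
}

# Usage:
#   curl -sv -o /dev/null https://api.tracebloc.io 2>&1 | classify_curl_route
```

Leg A is expected to classify as `proxied` and leg B as `direct`.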

Two substantive changes from the first cut

  1. Assert client-side via curl -v, not by reading squid's access.log. That file is buffered by squid's log daemon and came back empty when read immediately after the probe — a false failure. The curl -v CONNECT-tunnel lines are deterministic and need no flush.
  2. Set BOTH proxy-env cases. curl honours the lower-case https_proxy for HTTPS and does not reliably pick up the upper-case HTTPS_PROXY alone, so a probe that sets only upper-case silently dials direct and the test lies. The real ingestion env (Asad's _ingestor_proxy_env in #122, "Reduce dependency on values.yaml file for requests proxy") emits both cases, so the probe now does too.
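
As a rough illustration of the both-cases point, a probe could generate its proxy env like this. `emit_proxy_env` is a hypothetical helper, not the real `_ingestor_proxy_env`, and the proxy URL and NO_PROXY list in the usage line are assumptions.

```shell
# Hypothetical helper: prints KEY=VALUE lines for the proxy env in both
# upper and lower case, so clients that read only one spelling still see it.
emit_proxy_env() {
  proxy_url="$1"
  no_proxy_list="$2"
  for name in HTTP_PROXY HTTPS_PROXY; do
    lower="$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')"
    printf '%s=%s\n%s=%s\n' "$name" "$proxy_url" "$lower" "$proxy_url"
  done
  printf 'NO_PROXY=%s\nno_proxy=%s\n' "$no_proxy_list" "$no_proxy_list"
}

# Usage (values are illustrative):
#   emit_proxy_env http://squid-proxy:3128 .svc,.cluster.local
```

Emitting six variables instead of three is cheap, and it removes the dependence on which spelling a given client reads.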

Design note

Collapsed the earlier two-pod form into one pod making two calls (with env / env-unset). One pod + one log removes the multi-pod scheduling and log-flush races, so the assertion is deterministic in CI.
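
The one-pod, two-call shape can be sketched as below, assuming the pod already carries the proxy env. `probe_both_routes` and the section markers are hypothetical; the second call runs in a subshell so that unsetting the variables does not leak into the rest of the script.

```shell
# Hypothetical probe: calls the same URL twice from one shell,
# first with the inherited proxy env, then with it unset.
probe_both_routes() {
  url="$1"
  echo "== A: with proxy env =="
  curl -sv -o /dev/null "$url" 2>&1
  echo "== B: proxy env unset =="
  (
    # The subshell keeps the unset local to leg B.
    unset HTTP_PROXY HTTPS_PROXY http_proxy https_proxy
    curl -sv -o /dev/null "$url" 2>&1
  )
}

# Usage (inside the probe pod):
#   probe_both_routes https://api.tracebloc.io
```

Both legs write to the same stream, so a single `kubectl logs` read covers the whole assertion.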

Note on the "direct" leg: on a real proxy-only network (Charité) the env-unset dial would be refused with [Errno 111] — that refusal was the bug. The k3d node has direct egress, so here we assert the absence of a proxied CONNECT instead, which is the same signal.

saadqbal merged commit c934828 into develop Jun 17, 2026
20 checks passed

2 participants