Context
Follow-up #2 from #186. On the berlin-team arm64 install, jobs-manager stayed pinned to the amd64-only ingestor baseline (sha256:a0861ea9…) even though the image-refresh CronJob ran and :0.3 had become a multi-arch index (sha256:d361fa77…). #185 (greenfield baseline → v0.3.2) and #187 (CI multi-arch guard) handled the chart side; this issue is the image-refresh mechanism itself.
Root cause
image-refresh resolves the ingestor digest via get_latest_digest(… ghcr.io). The logic is correct — verified live, it returns the multi-arch index digest d361fa77. But when the resolve returns empty (the cluster can't reach ghcr.io / the ghcr token endpoint — egress/proxy/firewall; on-prem installs often allowlist docker.io but not ghcr.io), the script hits:
if [ -z "$latest_ingestor" ]; then
log " WARN: could not resolve latest digest (rate-limited or transient); skipping this tick"
and silently skips every tick — never reaching the registry-drift branch that would set the new digest. jobs-manager stays on the install-time baseline indefinitely. jobs-manager + pods-monitor pull from docker.io and refresh fine, so the CronJob looks healthy and nothing surfaces the ghcr failure. A persistent failure is indistinguishable from a transient one.
Ruled out: auto-upgrade does not reset the env (skips when already-latest; --reset-then-reuse-values, no --force); the resolver parses multi-arch indexes correctly; chart 1.4.3 includes ingestor refresh (landed v1.4.1/#158).
Proposed change
Make a persistent ghcr resolve failure loud instead of silent:
- Count consecutive ingestor-resolve failures in a deployment annotation; clear on success.
- Below threshold → keep tolerating (WARN + skip), as today.
- At/above
imageRefresh.ingestorResolveFailureThreshold (default 3 ≈ 45 min at the 15-min schedule) → log an actionable ERROR (check ghcr.io egress / token endpoint), write a last-error annotation, and exit non-zero so the Job fails visibly in kubectl get cronjob / monitoring — the same surfacing idiom the script already uses for stuck rollouts.
Acceptance criteria
Related
#186 (parent) · #185 · #187 · #158 / #159 (image-refresh) · backend#723 (sibling arm64).
Context
Follow-up #2 from #186. On the
berlin-teamarm64 install, jobs-manager stayed pinned to the amd64-only ingestor baseline (sha256:a0861ea9…) even though the image-refresh CronJob ran and:0.3had become a multi-arch index (sha256:d361fa77…). #185 (greenfield baseline → v0.3.2) and #187 (CI multi-arch guard) handled the chart side; this issue is the image-refresh mechanism itself.Root cause
image-refresh resolves the ingestor digest via
get_latest_digest(… ghcr.io). The logic is correct — verified live, it returns the multi-arch index digestd361fa77. But when the resolve returns empty (the cluster can't reachghcr.io/ the ghcr token endpoint — egress/proxy/firewall; on-prem installs often allowlistdocker.iobut notghcr.io), the script hits:and silently skips every tick — never reaching the registry-drift branch that would set the new digest. jobs-manager stays on the install-time baseline indefinitely. jobs-manager + pods-monitor pull from
docker.ioand refresh fine, so the CronJob looks healthy and nothing surfaces the ghcr failure. A persistent failure is indistinguishable from a transient one.Ruled out: auto-upgrade does not reset the env (skips when already-latest;
--reset-then-reuse-values, no--force); the resolver parses multi-arch indexes correctly; chart 1.4.3 includes ingestor refresh (landed v1.4.1/#158).Proposed change
Make a persistent ghcr resolve failure loud instead of silent:
imageRefresh.ingestorResolveFailureThreshold(default 3 ≈ 45 min at the 15-min schedule) → log an actionable ERROR (checkghcr.ioegress / token endpoint), write alast-errorannotation, and exit non-zero so the Job fails visibly inkubectl get cronjob/ monitoring — the same surfacing idiom the script already uses for stuck rollouts.Acceptance criteria
--reuse-values), schema-validated.helm unittestcoverage for the new env wiring;helm lintclean.Related
#186 (parent) · #185 · #187 · #158 / #159 (image-refresh) · backend#723 (sibling arm64).