Skip to content

image-refresh: surface persistent ingestor (ghcr) digest-resolve failures instead of silently skipping #190

Description

@saadqbal

Context

Follow-up #2 from #186. On the berlin-team arm64 install, jobs-manager stayed pinned to the amd64-only ingestor baseline (sha256:a0861ea9…) even though the image-refresh CronJob ran and :0.3 had become a multi-arch index (sha256:d361fa77…). #185 (greenfield baseline → v0.3.2) and #187 (CI multi-arch guard) handled the chart side; this issue is the image-refresh mechanism itself.

Root cause

image-refresh resolves the ingestor digest via get_latest_digest(… ghcr.io). The logic is correct — verified live, it returns the multi-arch index digest d361fa77. But when the resolve returns empty (the cluster can't reach ghcr.io / the ghcr token endpoint — egress/proxy/firewall; on-prem installs often allowlist docker.io but not ghcr.io), the script hits:

if [ -z "$latest_ingestor" ]; then
  log "  WARN: could not resolve latest digest (rate-limited or transient); skipping this tick"

and silently skips every tick — never reaching the registry-drift branch that would set the new digest. jobs-manager stays on the install-time baseline indefinitely. jobs-manager + pods-monitor pull from docker.io and refresh fine, so the CronJob looks healthy and nothing surfaces the ghcr failure. A persistent failure is indistinguishable from a transient one.

Ruled out: auto-upgrade does not reset the env (skips when already-latest; --reset-then-reuse-values, no --force); the resolver parses multi-arch indexes correctly; chart 1.4.3 includes ingestor refresh (landed v1.4.1/#158).

Proposed change

Make a persistent ghcr resolve failure loud instead of silent:

  • Count consecutive ingestor-resolve failures in a deployment annotation; clear on success.
  • Below threshold → keep tolerating (WARN + skip), as today.
  • At/above imageRefresh.ingestorResolveFailureThreshold (default 3 ≈ 45 min at the 15-min schedule) → log an actionable ERROR (check ghcr.io egress / token endpoint), write a last-error annotation, and exit non-zero so the Job fails visibly in kubectl get cronjob / monitoring — the same surfacing idiom the script already uses for stuck rollouts.

Acceptance criteria

  • Persistent ghcr resolve failure escalates to a failed Job after the threshold; transient blips still tolerated.
  • Failure state visible on the jobs-manager deployment (annotation), cleared on recovery.
  • Threshold configurable via values (nil-guarded for --reuse-values), schema-validated.
  • helm unittest coverage for the new env wiring; helm lint clean.

Related

#186 (parent) · #185 · #187 · #158 / #159 (image-refresh) · backend#723 (sibling arm64).

Metadata

Metadata

Assignees

Labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions