image-refresh: surface persistent ingestor (ghcr) digest-resolve failures instead of silently skipping

## Context
Follow-up #2 from #186. On the `berlin-team` arm64 install, jobs-manager stayed pinned to the amd64-only ingestor baseline (`sha256:a0861ea9…`) even though the image-refresh CronJob ran and `:0.3` had become a multi-arch index (`sha256:d361fa77…`). #185 (greenfield baseline → v0.3.2) and #187 (CI multi-arch guard) handled the chart side; **this issue is the image-refresh mechanism itself.**

## Root cause
image-refresh resolves the ingestor digest via `get_latest_digest(… ghcr.io)`. The logic is correct — verified live, it returns the multi-arch **index** digest `d361fa77`. But when the resolve returns empty (the cluster can't reach `ghcr.io` / the ghcr token endpoint — egress/proxy/firewall; on-prem installs often allowlist `docker.io` but not `ghcr.io`), the script hits:

```sh
if [ -z "$latest_ingestor" ]; then
  log "  WARN: could not resolve latest digest (rate-limited or transient); skipping this tick"
```

and **silently skips every tick** — never reaching the registry-drift branch that would set the new digest. jobs-manager stays on the install-time baseline indefinitely. jobs-manager + pods-monitor pull from `docker.io` and refresh fine, so the CronJob looks healthy and nothing surfaces the ghcr failure. A **persistent** failure is indistinguishable from a **transient** one.

Ruled out: auto-upgrade does *not* reset the env (skips when already-latest; `--reset-then-reuse-values`, no `--force`); the resolver parses multi-arch indexes correctly; chart 1.4.3 includes ingestor refresh (landed v1.4.1/#158).

## Proposed change
Make a *persistent* ghcr resolve failure loud instead of silent:
- Count consecutive ingestor-resolve failures in a deployment annotation; clear on success.
- Below threshold → keep tolerating (WARN + skip), as today.
- At/above `imageRefresh.ingestorResolveFailureThreshold` (default **3** ≈ 45 min at the 15-min schedule) → log an actionable ERROR (check `ghcr.io` egress / token endpoint), write a `last-error` annotation, and **exit non-zero** so the Job fails visibly in `kubectl get cronjob` / monitoring — the same surfacing idiom the script already uses for stuck rollouts.

## Acceptance criteria
- [ ] Persistent ghcr resolve failure escalates to a failed Job after the threshold; transient blips still tolerated.
- [ ] Failure state visible on the jobs-manager deployment (annotation), cleared on recovery.
- [ ] Threshold configurable via values (nil-guarded for `--reuse-values`), schema-validated.
- [ ] `helm unittest` coverage for the new env wiring; `helm lint` clean.

## Related
#186 (parent) · #185 · #187 · #158 / #159 (image-refresh) · backend#723 (sibling arm64).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

image-refresh: surface persistent ingestor (ghcr) digest-resolve failures instead of silently skipping #190

Context

Root cause

Proposed change

Acceptance criteria

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

image-refresh: surface persistent ingestor (ghcr) digest-resolve failures instead of silently skipping #190

Description

Context

Root cause

Proposed change

Acceptance criteria

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions