feat(observability): scope sync logs to the mapping; minimal watch surface by bryantbiggs · Pull Request #89 · clowdhaus/ocync

bryantbiggs · 2026-05-06T12:29:18Z

Summary

Sync's log surface is now mapping-scoped, dry-run owns the verbose pre-flight detail, and watch mode is event-driven: a steady-state pod logs O(state transitions), not O(cycles).

Three layers, separated:

Surface	Purpose
`--dry-run`	Verbose pre-flight: per-mapping pipeline, kept tags, drop attribution with samples, system-exclude hints, source-tag composition split (image vs OCI 1.1 referrer artifact)
`sync` (one-off)	Per-mapping post-event INFO/WARN with `src -> dst` context and outcome counts; per-image transfer events from the engine; one-line cycle tail rolling totals
`watch`	Same per-mapping line + cycle tail, deduplicated across cycles -- emits only on transition (transfer / failure / count change). Idle-heartbeat INFO every hour confirms the process is healthy when nothing has changed

Cosign-heavy sources (cgr.dev/chainguard/*) used to spam thousands of "tag is not parseable as a version" warns per repo per cycle. Those sha256-<hex>(.sig|.sbom|.att) referrer fallback tags are now silently skipped before the version parser sees them.

Failure -> recovery transitions get an explicit [recovered] marker on the per-mapping line so an operator scanning logs sees the recovery without correlating WARN -> INFO across cycles.

Before / After

`--dry-run` against `cgr.dev/chainguard/{nginx,python}` + `docker.io/library/{nginx,alpine}`

Before (truncated; the actual run emitted ~14k unparseable-tag WARNs from the cosign fallback tags alone):

WARN tag is not parseable as a version, dropping tag="sha256-000e710230b9....att"
WARN tag is not parseable as a version, dropping tag="sha256-000e710230b9....sig"
... (14287 more from chainguard/nginx)
WARN tag is not parseable as a version, dropping tag="sha256-00087e89b16b....att"
... (22634 more from chainguard/python)
WARN credential helper exited with error helper=docker-credential-desktop registry="cgr.dev" status=exit status: 1
WARN credential helper exited with error helper=docker-credential-desktop registry="registry-1.docker.io" status=exit status: 1
WARN tags.semver is set without tags.latest: every tag matching the version range will sync...
... (5x repeated per mapping)
WARN artifacts.enabled is false: signatures and SBOMs will be stripped from synced images
... (5x repeated per mapping)
WARN no tags matched after filtering, skipping mapping from=chainguard/nginx
WARN no tags matched after filtering, skipping mapping from=chainguard/python
INFO library/nginx -> mirror/dh-nginx: 5 of 1133 tags  =>  [ecr]
INFO library/alpine -> mirror/dh-alpine: 5 of 214 tags  =>  [ecr]
dry-run: library/nginx -> mirror/dh-nginx  =>  [ecr]
  source candidates: 1133

  pipeline:
    semver ">=1.0.0"             1133 -> 1015    (-118)
    sort semver desc, latest 5   1015 -> 5       (-1010)

  kept (5):
    1.30
    ...

  dropped 1128:
    1010  over latest=5 limit         ...
     118  by semver ">=1.0.0"         ...

After:

WARN chainguard/nginx: 0 of 14289 source tags (2 image tags, 14287 referrer artifacts) matched filter (semver >=1.0.0, sort semver, latest=5); skipping. Source: [latest, latest-dev] source_total=14289 image_count=2 artifact_count=14287 filter="semver >=1.0.0, sort semver, latest=5"
WARN chainguard/python: 0 of 22642 source tags (2 image tags, 22640 referrer artifacts) matched filter (semver >=1.0.0, sort semver, latest=5); skipping. Source: [latest, latest-dev] ...
dry-run: library/nginx -> mirror/dh-nginx [ecr]
  source tags: 1133

  filter:
    semver ">=1.0.0"             1133 -> 1015    (-118)
    sort semver desc             1015 -> 1015
    keep latest 5                1015 -> 5       (-1010)

  kept (5):
    1.30
    1.30.0
    1.30-trixie-perl
    1.30.0-trixie-perl
    1.30-trixie-otel

  dropped (1128):
    1010  over latest=5 limit         1.30.0-trixie-otel, 1.30-trixie, ...
     118  by semver ">=1.0.0"         alpine, alpine-otel, ...

dry-run: library/alpine -> mirror/dh-alpine [ecr]
  source tags: 214

  filter:
    semver ">=1.0.0"              214 -> 211     (-3)
    exclude (user + system)       211 -> 210     (-1)
    sort semver desc              210 -> 210
    keep latest 5                 210 -> 5       (-205)

  kept (5):
    20260127
    ...

  dropped (209):
     205  over latest=5 limit         20240606, ...
       3  by semver ">=1.0.0"         3.17.0_rc1, edge, latest
       1  by system-exclude           3.15.0-rc.4
          hint: to keep prereleases, list patterns under include: (globs supported)

`sync` (one-off) against `docker.io/library/alpine` -> `ttl.sh`

Cold transfer:

INFO image synced source_repo=library/alpine source_tag=3.21.4 target_repo=ocync-test/alpine-1778080090 target_tag=3.21.4 bytes=28524801
INFO library/alpine -> ocync-test/alpine-1778080090: synced 1 (28.5 MB) synced=1 skipped=0 failed=0 bytes=28524801 recovered=false
INFO summary: 1 mappings | 1 synced, 0 skipped, 0 failed | 28.5 MB in 1m 27s

Re-run (steady state):

INFO library/alpine -> ocync-test/alpine-1778080090: skipped 1 synced=0 skipped=1 failed=0 bytes=0 recovered=false
INFO summary: 1 mappings | 0 synced, 1 skipped, 0 failed | 0 B in 1s

`watch` over 3 cycles in steady state, all-skip

Before (one mapping; scaled linearly with mapping count):

INFO starting watch mode interval_secs=5
INFO health server listening port=18080
INFO skipping -- digest matches at target (cache hit) source_repo=library/alpine source_tag=3.21.4 ...
sync complete: 0 synced, 1 skipped, 0 failed | blobs: 0 transferred, 0 skipped, 0 mounted | 0 B in 1s | discovery: 1 cached, 0 pulled
INFO sync cycle complete exit_code=Success
INFO skipping -- digest matches at target (cache hit) source_repo=library/alpine source_tag=3.21.4 ...
sync complete: 0 synced, 1 skipped, 0 failed | ...
INFO sync cycle complete exit_code=Success
INFO skipping -- digest matches at target (cache hit) source_repo=library/alpine source_tag=3.21.4 ...
sync complete: 0 synced, 1 skipped, 0 failed | ...
INFO sync cycle complete exit_code=Success
INFO received SIGTERM, initiating graceful shutdown
INFO shutdown signal received, draining in-flight transfers in_flight=0
INFO sync cycle complete exit_code=Success
INFO shutdown signal received, exiting watch mode

After:

INFO starting watch mode interval_secs=5
INFO health server listening port=18084
INFO library/alpine -> ocync-test/alpine-1778080090: skipped 1 synced=0 skipped=1 failed=0 bytes=0 recovered=false
INFO summary: 1 mappings | 0 synced, 1 skipped, 0 failed | 0 B in 0.9s
INFO received SIGTERM, initiating graceful shutdown
INFO shutdown signal received, draining in-flight transfers in_flight=0
INFO shutdown signal received, exiting watch mode

A 50-mapping watch in steady state was emitting roughly (per_tag_skips + summary + cycle_complete) lines per cycle -- thousands per hour. It now emits one line per mapping plus a one-line cycle tail on the first cycle, then stays silent until something transitions. After an hour of silence, an INFO watch alive: ... heartbeat fires so log scrapers have a recent anchor; /healthz remains the kubelet probe.

Behavioural changes worth a CHANGELOG entry

INFO plan-line format changed: pre-flight 5 of 1133 tags => [ecr] removed entirely; replaced by post-event per-mapping library/alpine -> mirror/alpine: synced 1 (28.5 MB) (multi-target mappings retain the [ecr-prod, ghcr-mirror] bracket).
Stdout aggregate sync complete: ... removed. Replaced by an INFO summary: N mappings | ... cycle tail emitted via tracing alongside the per-mapping lines.
Dry-run section labels: source candidates -> source tags, pipeline -> filter, dropped N -> dropped (N), sort/latest now two stages.
WARN no-tags-matched message now carries source count split (image vs referrer artifact), filter rationale, and image-tag samples.
Per-tag engine skip events demoted to debug; "image synced" and lifecycle events stay at INFO.
Watch mode: one no-tags WARN per failure-state transition (not per cycle), one per-mapping outcome line per state transition, one "filter now matches" recovery INFO when a previously-failing mapping resolves, one [recovered] suffix on the per-mapping line when failure clears, and an idle-heartbeat INFO at one-hour cadence when nothing has transitioned.

Test plan

cargo fmt --all -- --check -- clean
cargo clippy --workspace --all-targets --locked -- -D warnings -- clean
cargo test --workspace --locked -- 1334 passed, 1 ignored
cargo deny check -- advisories ok, bans ok, licenses ok, sources ok
npm run --prefix docs build -- 28 pages built
Live --dry-run against cgr.dev/chainguard/{nginx,python} + docker.io/library/{nginx,alpine} -- output above
Live sync cold + steady against docker.io/library/alpine -> ttl.sh -- output above
Live watch over 3 cycles, all-skip steady state -- output above
CI runs the full FIPS-default workspace gate

Sync runs over multi-mapping configs (especially against cosign-heavy sources like cgr.dev) emit dozens of WARN log lines per cycle, none of which describe a real problem. Watch-mode multiplies this per cycle. Changes: - Demote noisy per-mapping config warnings to debug: - validate_artifacts emits per mapping when artifacts.enabled is false - validate_tags emits per mapping when tags.semver lacks tags.latest - Demote per-tag "tag is not parseable as a version" warn to debug; the same drop is already attributed via FilterReport's drop reasons. - Skip OCI 1.1 referrer fallback tags (sha256-<hex>(.sig|.sbom|.att)) before semver parsing so cosign-heavy repos do not spam the log channel. Helper is_referrers_fallback_tag is pub for reuse in CLI. - Demote spawn / wait / non-zero-exit cred helper warns to debug; keep the 30s timeout warn (real operational signal). Caller already swallows the Err for graceful fallback to anonymous auth. Restructure the per-mapping plan line and dry-run output: - INFO summary now reads "5 tags match for sync out of 1133 possible" with brackets for targets; no more dual-arrow ambiguity. - Suppress the INFO summary in --dry-run mode (the pretty-print is the answer). - Replace the "no tags matched after filtering" warn with one carrying source count split (image vs referrer artifact), filter rationale, and image-tag samples; structured tracing fields included. - Dry-run section labels: "source candidates" -> "source tags", "pipeline:" -> "filter:", "dropped N:" -> "dropped (N):"; sort and latest now render as separate stages; system-exclude hint formatted as a hint: line. Watch-mode dedup: - WatchLogState (HashSet keyed by mapping.from) carries log state across cycles. Emits one WARN per failure-state transition; emits a recovery INFO when a previously-failing mapping resolves; prunes entries when the mapping is removed from config. Banner: - Switch README hero from docs/public/ecr-banner.svg to a rendered PNG (docs/public/ecr-banner.png). GitHub's SVG renderer mangled the bold-gradient title; the PNG renders deterministically. Verification: - 1346 unit + integration tests pass (workspace, --locked, non-fips) - cargo clippy --workspace --all-targets --locked: clean - cargo fmt --all -- --check: clean - cargo deny check: advisories ok, bans ok, licenses ok, sources ok - npm run --prefix docs build: 28 pages, 831 ms - Live dry-run against cgr.dev/chainguard/{nginx,python} and docker.io/library/{nginx,alpine}: chainguard mappings now emit one legible WARN each (instead of ~14k unparseable-tag warnings); Docker Hub mappings show the new INFO summary and clean dry-run pipeline.

The "X tags match for sync out of Y possible" line was added with the log-cleanup PR but turns out to be redundant in real-sync and watch modes -- the engine already emits an INFO per image at the moment of transfer ("image synced", with byte counts). The summary repeats the config back at the operator before any work is done. `--dry-run` is the right surface for this preview; the dedicated pretty-print there already covers it. Real-sync and watch now go straight from mapping resolution to engine work, with no per-mapping "about to" announcement to clutter watch-mode logs cycle after cycle. Drops `format_plan_line` and `log_resolved_mappings` plus their tests. `describe_filter` is retained -- it still drives the no-tags WARN's filter rationale.

In watch-mode steady-state most tags hit the skip path -- digest match at target, immutable tag at all targets, target tag is immutable. Each was firing INFO per (tag, cycle), producing N x cycles lines for zero new work. The aggregate count is already on the "sync complete" summary line. Demote the per-tag skip events (engine.rs L865, L1433, L1577, L1766) and the artifact-discovery 404 / fallback / referrers-API edge cases (L2871, L2880, L2894) to debug. Verbose mode (-v) brings them back. Kept at INFO: actual transfer events ("image synced" with byte count), leader election, multi-arch platform filtering, rate-limit recovery, and shutdown -- all real-signal lifecycle events, not per-tag chatter. Net effect: a 50-mapping x 5-tag watch cycle in steady state now emits one summary line per cycle (was 250 INFO lines).

In watch mode the per-cycle "sync complete: ..." line was repeating verbatim every cycle in steady state, paired with an "INFO sync cycle complete" wrapper that conveyed the same fact. For an idle pod cycling every 60s, the log stream was N pairs of identical lines per hour. - Add DedupingWatchProgress, a ProgressReporter wrapper that compares the new SyncStats to the prior cycle's and only forwards run_completed when they differ. First cycle always emits; identical follow-ups go silent. Wraps the user's reporter so per-image events still pass through. - Stats comparison ignores SyncReport.duration on purpose -- wall-clock drift cycle-to-cycle in steady state would otherwise re-emit every time. Test pins the contract. - Drop "INFO sync cycle complete" from the watch loop. Cycle completion is conveyed by the dedup-aware "sync complete:" line on stdout (when stats change) and by /healthz (always). The INFO added nothing. - Derive Clone, PartialEq, Eq on SyncStats so the wrapper can compare without writing a digest helper. Net effect: a steady-state watch loop emits log volume proportional to state transitions, not cycle count. Sync-mode (one-off) is unchanged -- the user's progress reporter is passed through verbatim.

The aggregate stdout summary ("sync complete: 0 synced, 250 skipped...") named no mapping. For a 50-mapping watch in steady state, the operator got identical aggregate counts every cycle with zero context about what was tracked or where it was going. The user's exact complaint: "no mention/reference to the image being synced - pure unknown noise". Replace it with a per-mapping post-event INFO/WARN line that names the specific mapping: library/alpine -> ocync-test/alpine-1h [ttl]: synced 1 (28.5 MB) library/alpine -> ocync-test/alpine-1h [ttl]: skipped 1 library/alpine -> ocync-test/alpine-1h [ttl]: failed 1 <- WARN Implementation: - Drop TextProgress::run_completed stdout aggregate (was the contextless noise) and the `suppress_summary` plumbing it required. JSON output still owns stdout in --json mode via write_output, unchanged. - Drop DedupingWatchProgress -- replaced by per-mapping dedup on the more-informative per-mapping line. - After engine.run returns, group ImageResults by mapping (source_repo + target_repo prefix) into MappingOutcome { synced, skipped, failed, bytes }. Emit one INFO per mapping; WARN when any failed > 0. - Watch mode plumbs through WatchLogState.last_outcomes (HashMap) and suppresses follow-up cycles where the outcome is identical to the prior cycle. retain_active prunes stale entries on config edit. Format: zero counts are omitted ("synced 1" not "synced 1, skipped 0, failed 0"); bytes clause appears only when > 0; fields are also attached as structured tracing fields (from, to, synced, skipped, failed, bytes) for log aggregators.

Three follow-ups from the post-merge audit, all watch-mode UX: - Cycle tail: emit a `summary: N mappings | X synced, Y skipped, Z failed | bytes in duration` INFO/WARN at the end of each cycle that emitted anything. Sync mode: always. Watch mode: gated on cycle_emit_count > 0. Restores cycle-level rollup the per-mapping refactor removed; named context comes from the per-mapping lines above it. - Recovery marker: when a mapping transitions from `failed > 0` to `failed == 0`, append ` [recovered]` to the per-mapping line so an operator scanning logs sees the recovery without correlating WARN -> INFO across cycles. WatchLogState.observe_mapping_outcome now returns the prior outcome alongside `changed`, so the caller can detect the transition. - Idle heartbeat: after one hour of silence, emit a "watch alive: no state changes; sync loop healthy" INFO with idle_secs. WatchLogState exposes `begin_cycle()` / `cycle_emit_count()` to let watch.rs gate the heartbeat on per-cycle activity. /healthz remains the kubelet probe; this gives log scrapers a recent anchor on idle pods. All three are dedup-aware: a steady-state pod still goes silent for hours between heartbeats and only emits when something genuinely transitions.

Apply audit findings from a follow-up review: - `MappingOutcome` now derives `Copy` (32B of u64s); `last_outcomes` HashMap stores by value, no per-observation clones on the unchanged path. - `observe_mapping_outcome` returns `Option<bool>` (None=suppress, Some(recovered)=emit) instead of an `OutcomeObservation` struct that the single caller had to destructure. The struct's `prior` field existed only to recompute `prior.failed > 0 && new.failed == 0`, which the method now does itself. - Use `HashMap::entry()` to avoid the get-then-insert double-hash + key allocation on the changed path. - Drop `WatchLogState::note_emit`. The cycle tail no longer bumps the counter -- it's gated on per-mapping activity, which already drives the counter via the `observe_*` methods. `emit_cycle_tail` becomes a pure formatter; the watch-mode suppression check moved to the call site. - Add `recovered` field to the WARN arm of `emit_mapping_outcomes` so log aggregators see a uniform field set across info/warn (a partial recovery from `failed=2 -> failed=1` still warns; the structured field now exists to be filtered on regardless). - `format_bytes`/`format_duration` now imported at the top, matching `progress.rs` style. - Trim test redundancy: `cycle_emit_counter_resets_and_increments` collapsed into `dedupes_identical_mapping_outcomes` and the recovery test (the counter increment is asserted there). Drops one test, no loss of coverage. No behavioral change to log output. Net -32 LOC.

…et mappings The `[ttl]` / `[ecr]` bracket only carries information when a mapping fans out to multiple destinations (e.g. `[ecr-prod, ghcr-mirror]`). On single-target mappings -- the common case -- the destination is already in the `from -> to` arrow, so the bracket is noise. Emit the bracket only when `target_names.len() > 1`. Structured tracing fields (`from`, `to`, `synced`, ...) are unchanged; this is a human-readable formatting change only. Single-target now reads: library/alpine -> mirror/alpine: skipped 1 Multi-target unchanged: library/alpine -> mirror/alpine [ecr-prod, ghcr-mirror]: synced 1

The text formatter tails each tracing event with `key=value` pairs for every structured field on the call. When those fields restate what the message string already says, the line ends up repeating itself. Three call sites were doing this: - per-mapping outcome (`emit_mapping_outcomes`): had `from=<source> to=<target>` after the message's `<source> -> <target>` arrow. Drop both -- the arrow is the canonical render. Counts and `recovered` stay (they carry zero values the terse message elides and a flag scrapers can filter on). - no-tags WARN (`emit_no_tags_warn`): had `from=<from>` after the Display impl already led with the same value. Drop. Counts and `filter` stay. - cycle tail (`emit_cycle_tail`): had `mappings/synced/skipped/failed/ bytes` fields verbatim duplicating the message's "N mappings | X synced, Y skipped, Z failed | bytes in time". Drop all -- the message is fully self-contained. JSON tracing-format consumers parse the message string (or use `--json` for the structured `SyncReport`) instead of these fields. No behavioral change beyond what the operator sees in text logs.

bryantbiggs added 5 commits May 6, 2026 07:28

bryantbiggs changed the title ~~feat(observability): clean up sync log surface~~ feat(observability): scope sync logs to the mapping; minimal watch surface May 6, 2026

bryantbiggs added 4 commits May 6, 2026 09:15

bryantbiggs merged commit 10f5433 into main May 6, 2026
16 checks passed

bryantbiggs deleted the feat/sync-log-ux-cleanup branch May 6, 2026 15:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(observability): scope sync logs to the mapping; minimal watch surface#89

feat(observability): scope sync logs to the mapping; minimal watch surface#89
bryantbiggs merged 9 commits into
mainfrom
feat/sync-log-ux-cleanup

bryantbiggs commented May 6, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bryantbiggs commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Before / After

--dry-run against cgr.dev/chainguard/{nginx,python} + docker.io/library/{nginx,alpine}

sync (one-off) against docker.io/library/alpine -> ttl.sh

watch over 3 cycles in steady state, all-skip

Behavioural changes worth a CHANGELOG entry

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bryantbiggs commented May 6, 2026 •

edited

Loading

`--dry-run` against `cgr.dev/chainguard/{nginx,python}` + `docker.io/library/{nginx,alpine}`

`sync` (one-off) against `docker.io/library/alpine` -> `ttl.sh`

`watch` over 3 cycles in steady state, all-skip