Skip to content

feat(observability): scope sync logs to the mapping; minimal watch surface#89

Merged
bryantbiggs merged 9 commits into
mainfrom
feat/sync-log-ux-cleanup
May 6, 2026
Merged

feat(observability): scope sync logs to the mapping; minimal watch surface#89
bryantbiggs merged 9 commits into
mainfrom
feat/sync-log-ux-cleanup

Conversation

@bryantbiggs

@bryantbiggs bryantbiggs commented May 6, 2026

Copy link
Copy Markdown
Member

Summary

Sync's log surface is now mapping-scoped, dry-run owns the verbose pre-flight detail, and watch mode is event-driven: a steady-state pod logs O(state transitions), not O(cycles).

Three layers, separated:

Surface Purpose
--dry-run Verbose pre-flight: per-mapping pipeline, kept tags, drop attribution with samples, system-exclude hints, source-tag composition split (image vs OCI 1.1 referrer artifact)
sync (one-off) Per-mapping post-event INFO/WARN with src -> dst context and outcome counts; per-image transfer events from the engine; one-line cycle tail rolling totals
watch Same per-mapping line + cycle tail, deduplicated across cycles -- emits only on transition (transfer / failure / count change). Idle-heartbeat INFO every hour confirms the process is healthy when nothing has changed

Cosign-heavy sources (cgr.dev/chainguard/*) used to spam thousands of "tag is not parseable as a version" warns per repo per cycle. Those sha256-<hex>(.sig|.sbom|.att) referrer fallback tags are now silently skipped before the version parser sees them.

Failure -> recovery transitions get an explicit [recovered] marker on the per-mapping line so an operator scanning logs sees the recovery without correlating WARN -> INFO across cycles.

Before / After

--dry-run against cgr.dev/chainguard/{nginx,python} + docker.io/library/{nginx,alpine}

Before (truncated; the actual run emitted ~14k unparseable-tag WARNs from the cosign fallback tags alone):

WARN tag is not parseable as a version, dropping tag="sha256-000e710230b9....att"
WARN tag is not parseable as a version, dropping tag="sha256-000e710230b9....sig"
... (14287 more from chainguard/nginx)
WARN tag is not parseable as a version, dropping tag="sha256-00087e89b16b....att"
... (22634 more from chainguard/python)
WARN credential helper exited with error helper=docker-credential-desktop registry="cgr.dev" status=exit status: 1
WARN credential helper exited with error helper=docker-credential-desktop registry="registry-1.docker.io" status=exit status: 1
WARN tags.semver is set without tags.latest: every tag matching the version range will sync...
... (5x repeated per mapping)
WARN artifacts.enabled is false: signatures and SBOMs will be stripped from synced images
... (5x repeated per mapping)
WARN no tags matched after filtering, skipping mapping from=chainguard/nginx
WARN no tags matched after filtering, skipping mapping from=chainguard/python
INFO library/nginx -> mirror/dh-nginx: 5 of 1133 tags  =>  [ecr]
INFO library/alpine -> mirror/dh-alpine: 5 of 214 tags  =>  [ecr]
dry-run: library/nginx -> mirror/dh-nginx  =>  [ecr]
  source candidates: 1133

  pipeline:
    semver ">=1.0.0"             1133 -> 1015    (-118)
    sort semver desc, latest 5   1015 -> 5       (-1010)

  kept (5):
    1.30
    ...

  dropped 1128:
    1010  over latest=5 limit         ...
     118  by semver ">=1.0.0"         ...

After:

WARN chainguard/nginx: 0 of 14289 source tags (2 image tags, 14287 referrer artifacts) matched filter (semver >=1.0.0, sort semver, latest=5); skipping. Source: [latest, latest-dev] source_total=14289 image_count=2 artifact_count=14287 filter="semver >=1.0.0, sort semver, latest=5"
WARN chainguard/python: 0 of 22642 source tags (2 image tags, 22640 referrer artifacts) matched filter (semver >=1.0.0, sort semver, latest=5); skipping. Source: [latest, latest-dev] ...
dry-run: library/nginx -> mirror/dh-nginx [ecr]
  source tags: 1133

  filter:
    semver ">=1.0.0"             1133 -> 1015    (-118)
    sort semver desc             1015 -> 1015
    keep latest 5                1015 -> 5       (-1010)

  kept (5):
    1.30
    1.30.0
    1.30-trixie-perl
    1.30.0-trixie-perl
    1.30-trixie-otel

  dropped (1128):
    1010  over latest=5 limit         1.30.0-trixie-otel, 1.30-trixie, ...
     118  by semver ">=1.0.0"         alpine, alpine-otel, ...

dry-run: library/alpine -> mirror/dh-alpine [ecr]
  source tags: 214

  filter:
    semver ">=1.0.0"              214 -> 211     (-3)
    exclude (user + system)       211 -> 210     (-1)
    sort semver desc              210 -> 210
    keep latest 5                 210 -> 5       (-205)

  kept (5):
    20260127
    ...

  dropped (209):
     205  over latest=5 limit         20240606, ...
       3  by semver ">=1.0.0"         3.17.0_rc1, edge, latest
       1  by system-exclude           3.15.0-rc.4
          hint: to keep prereleases, list patterns under include: (globs supported)

sync (one-off) against docker.io/library/alpine -> ttl.sh

Cold transfer:

INFO image synced source_repo=library/alpine source_tag=3.21.4 target_repo=ocync-test/alpine-1778080090 target_tag=3.21.4 bytes=28524801
INFO library/alpine -> ocync-test/alpine-1778080090: synced 1 (28.5 MB) synced=1 skipped=0 failed=0 bytes=28524801 recovered=false
INFO summary: 1 mappings | 1 synced, 0 skipped, 0 failed | 28.5 MB in 1m 27s

Re-run (steady state):

INFO library/alpine -> ocync-test/alpine-1778080090: skipped 1 synced=0 skipped=1 failed=0 bytes=0 recovered=false
INFO summary: 1 mappings | 0 synced, 1 skipped, 0 failed | 0 B in 1s

watch over 3 cycles in steady state, all-skip

Before (one mapping; scaled linearly with mapping count):

INFO starting watch mode interval_secs=5
INFO health server listening port=18080
INFO skipping -- digest matches at target (cache hit) source_repo=library/alpine source_tag=3.21.4 ...
sync complete: 0 synced, 1 skipped, 0 failed | blobs: 0 transferred, 0 skipped, 0 mounted | 0 B in 1s | discovery: 1 cached, 0 pulled
INFO sync cycle complete exit_code=Success
INFO skipping -- digest matches at target (cache hit) source_repo=library/alpine source_tag=3.21.4 ...
sync complete: 0 synced, 1 skipped, 0 failed | ...
INFO sync cycle complete exit_code=Success
INFO skipping -- digest matches at target (cache hit) source_repo=library/alpine source_tag=3.21.4 ...
sync complete: 0 synced, 1 skipped, 0 failed | ...
INFO sync cycle complete exit_code=Success
INFO received SIGTERM, initiating graceful shutdown
INFO shutdown signal received, draining in-flight transfers in_flight=0
INFO sync cycle complete exit_code=Success
INFO shutdown signal received, exiting watch mode

After:

INFO starting watch mode interval_secs=5
INFO health server listening port=18084
INFO library/alpine -> ocync-test/alpine-1778080090: skipped 1 synced=0 skipped=1 failed=0 bytes=0 recovered=false
INFO summary: 1 mappings | 0 synced, 1 skipped, 0 failed | 0 B in 0.9s
INFO received SIGTERM, initiating graceful shutdown
INFO shutdown signal received, draining in-flight transfers in_flight=0
INFO shutdown signal received, exiting watch mode

A 50-mapping watch in steady state was emitting roughly (per_tag_skips + summary + cycle_complete) lines per cycle -- thousands per hour. It now emits one line per mapping plus a one-line cycle tail on the first cycle, then stays silent until something transitions. After an hour of silence, an INFO watch alive: ... heartbeat fires so log scrapers have a recent anchor; /healthz remains the kubelet probe.

Behavioural changes worth a CHANGELOG entry

  • INFO plan-line format changed: pre-flight 5 of 1133 tags => [ecr] removed entirely; replaced by post-event per-mapping library/alpine -> mirror/alpine: synced 1 (28.5 MB) (multi-target mappings retain the [ecr-prod, ghcr-mirror] bracket).
  • Stdout aggregate sync complete: ... removed. Replaced by an INFO summary: N mappings | ... cycle tail emitted via tracing alongside the per-mapping lines.
  • Dry-run section labels: source candidates -> source tags, pipeline -> filter, dropped N -> dropped (N), sort/latest now two stages.
  • WARN no-tags-matched message now carries source count split (image vs referrer artifact), filter rationale, and image-tag samples.
  • Per-tag engine skip events demoted to debug; "image synced" and lifecycle events stay at INFO.
  • Watch mode: one no-tags WARN per failure-state transition (not per cycle), one per-mapping outcome line per state transition, one "filter now matches" recovery INFO when a previously-failing mapping resolves, one [recovered] suffix on the per-mapping line when failure clears, and an idle-heartbeat INFO at one-hour cadence when nothing has transitioned.

Test plan

  • cargo fmt --all -- --check -- clean
  • cargo clippy --workspace --all-targets --locked -- -D warnings -- clean
  • cargo test --workspace --locked -- 1334 passed, 1 ignored
  • cargo deny check -- advisories ok, bans ok, licenses ok, sources ok
  • npm run --prefix docs build -- 28 pages built
  • Live --dry-run against cgr.dev/chainguard/{nginx,python} + docker.io/library/{nginx,alpine} -- output above
  • Live sync cold + steady against docker.io/library/alpine -> ttl.sh -- output above
  • Live watch over 3 cycles, all-skip steady state -- output above
  • CI runs the full FIPS-default workspace gate

Sync runs over multi-mapping configs (especially against cosign-heavy
sources like cgr.dev) emit dozens of WARN log lines per cycle, none of
which describe a real problem. Watch-mode multiplies this per cycle.

Changes:

- Demote noisy per-mapping config warnings to debug:
  - validate_artifacts emits per mapping when artifacts.enabled is false
  - validate_tags emits per mapping when tags.semver lacks tags.latest
- Demote per-tag "tag is not parseable as a version" warn to debug; the
  same drop is already attributed via FilterReport's drop reasons.
- Skip OCI 1.1 referrer fallback tags (sha256-<hex>(.sig|.sbom|.att))
  before semver parsing so cosign-heavy repos do not spam the log
  channel. Helper is_referrers_fallback_tag is pub for reuse in CLI.
- Demote spawn / wait / non-zero-exit cred helper warns to debug; keep
  the 30s timeout warn (real operational signal). Caller already
  swallows the Err for graceful fallback to anonymous auth.

Restructure the per-mapping plan line and dry-run output:

- INFO summary now reads "5 tags match for sync out of 1133 possible"
  with brackets for targets; no more dual-arrow ambiguity.
- Suppress the INFO summary in --dry-run mode (the pretty-print is the
  answer).
- Replace the "no tags matched after filtering" warn with one carrying
  source count split (image vs referrer artifact), filter rationale,
  and image-tag samples; structured tracing fields included.
- Dry-run section labels: "source candidates" -> "source tags",
  "pipeline:" -> "filter:", "dropped N:" -> "dropped (N):"; sort and
  latest now render as separate stages; system-exclude hint formatted
  as a hint: line.

Watch-mode dedup:

- WatchLogState (HashSet keyed by mapping.from) carries log state
  across cycles. Emits one WARN per failure-state transition; emits a
  recovery INFO when a previously-failing mapping resolves; prunes
  entries when the mapping is removed from config.

Banner:

- Switch README hero from docs/public/ecr-banner.svg to a rendered PNG
  (docs/public/ecr-banner.png). GitHub's SVG renderer mangled the
  bold-gradient title; the PNG renders deterministically.

Verification:

- 1346 unit + integration tests pass (workspace, --locked, non-fips)
- cargo clippy --workspace --all-targets --locked: clean
- cargo fmt --all -- --check: clean
- cargo deny check: advisories ok, bans ok, licenses ok, sources ok
- npm run --prefix docs build: 28 pages, 831 ms
- Live dry-run against cgr.dev/chainguard/{nginx,python} and
  docker.io/library/{nginx,alpine}: chainguard mappings now emit one
  legible WARN each (instead of ~14k unparseable-tag warnings); Docker
  Hub mappings show the new INFO summary and clean dry-run pipeline.
The "X tags match for sync out of Y possible" line was added with the
log-cleanup PR but turns out to be redundant in real-sync and watch
modes -- the engine already emits an INFO per image at the moment of
transfer ("image synced", with byte counts). The summary repeats the
config back at the operator before any work is done.

`--dry-run` is the right surface for this preview; the dedicated
pretty-print there already covers it. Real-sync and watch now go
straight from mapping resolution to engine work, with no per-mapping
"about to" announcement to clutter watch-mode logs cycle after cycle.

Drops `format_plan_line` and `log_resolved_mappings` plus their
tests. `describe_filter` is retained -- it still drives the no-tags
WARN's filter rationale.
In watch-mode steady-state most tags hit the skip path -- digest match
at target, immutable tag at all targets, target tag is immutable.
Each was firing INFO per (tag, cycle), producing N x cycles lines for
zero new work. The aggregate count is already on the "sync complete"
summary line.

Demote the per-tag skip events (engine.rs L865, L1433, L1577, L1766)
and the artifact-discovery 404 / fallback / referrers-API edge cases
(L2871, L2880, L2894) to debug. Verbose mode (-v) brings them back.

Kept at INFO: actual transfer events ("image synced" with byte count),
leader election, multi-arch platform filtering, rate-limit recovery,
and shutdown -- all real-signal lifecycle events, not per-tag chatter.

Net effect: a 50-mapping x 5-tag watch cycle in steady state now emits
one summary line per cycle (was 250 INFO lines).
In watch mode the per-cycle "sync complete: ..." line was repeating
verbatim every cycle in steady state, paired with an "INFO sync cycle
complete" wrapper that conveyed the same fact. For an idle pod cycling
every 60s, the log stream was N pairs of identical lines per hour.

- Add DedupingWatchProgress, a ProgressReporter wrapper that compares
  the new SyncStats to the prior cycle's and only forwards run_completed
  when they differ. First cycle always emits; identical follow-ups go
  silent. Wraps the user's reporter so per-image events still pass
  through.
- Stats comparison ignores SyncReport.duration on purpose -- wall-clock
  drift cycle-to-cycle in steady state would otherwise re-emit every
  time. Test pins the contract.
- Drop "INFO sync cycle complete" from the watch loop. Cycle completion
  is conveyed by the dedup-aware "sync complete:" line on stdout (when
  stats change) and by /healthz (always). The INFO added nothing.
- Derive Clone, PartialEq, Eq on SyncStats so the wrapper can compare
  without writing a digest helper.

Net effect: a steady-state watch loop emits log volume proportional to
state transitions, not cycle count. Sync-mode (one-off) is unchanged
-- the user's progress reporter is passed through verbatim.
The aggregate stdout summary ("sync complete: 0 synced, 250 skipped...")
named no mapping. For a 50-mapping watch in steady state, the operator
got identical aggregate counts every cycle with zero context about what
was tracked or where it was going. The user's exact complaint: "no
mention/reference to the image being synced - pure unknown noise".

Replace it with a per-mapping post-event INFO/WARN line that names the
specific mapping:

    library/alpine -> ocync-test/alpine-1h [ttl]: synced 1 (28.5 MB)
    library/alpine -> ocync-test/alpine-1h [ttl]: skipped 1
    library/alpine -> ocync-test/alpine-1h [ttl]: failed 1     <- WARN

Implementation:

- Drop TextProgress::run_completed stdout aggregate (was the contextless
  noise) and the `suppress_summary` plumbing it required. JSON output
  still owns stdout in --json mode via write_output, unchanged.
- Drop DedupingWatchProgress -- replaced by per-mapping dedup on the
  more-informative per-mapping line.
- After engine.run returns, group ImageResults by mapping (source_repo +
  target_repo prefix) into MappingOutcome { synced, skipped, failed,
  bytes }. Emit one INFO per mapping; WARN when any failed > 0.
- Watch mode plumbs through WatchLogState.last_outcomes (HashMap) and
  suppresses follow-up cycles where the outcome is identical to the
  prior cycle. retain_active prunes stale entries on config edit.

Format: zero counts are omitted ("synced 1" not "synced 1, skipped 0,
failed 0"); bytes clause appears only when > 0; fields are also
attached as structured tracing fields (from, to, synced, skipped,
failed, bytes) for log aggregators.
@bryantbiggs bryantbiggs changed the title feat(observability): clean up sync log surface feat(observability): scope sync logs to the mapping; minimal watch surface May 6, 2026
Three follow-ups from the post-merge audit, all watch-mode UX:

- Cycle tail: emit a `summary: N mappings | X synced, Y skipped, Z failed
  | bytes in duration` INFO/WARN at the end of each cycle that emitted
  anything. Sync mode: always. Watch mode: gated on cycle_emit_count > 0.
  Restores cycle-level rollup the per-mapping refactor removed; named
  context comes from the per-mapping lines above it.

- Recovery marker: when a mapping transitions from `failed > 0` to
  `failed == 0`, append ` [recovered]` to the per-mapping line so an
  operator scanning logs sees the recovery without correlating WARN ->
  INFO across cycles. WatchLogState.observe_mapping_outcome now returns
  the prior outcome alongside `changed`, so the caller can detect the
  transition.

- Idle heartbeat: after one hour of silence, emit a "watch alive: no
  state changes; sync loop healthy" INFO with idle_secs. WatchLogState
  exposes `begin_cycle()` / `cycle_emit_count()` to let watch.rs gate
  the heartbeat on per-cycle activity. /healthz remains the kubelet
  probe; this gives log scrapers a recent anchor on idle pods.

All three are dedup-aware: a steady-state pod still goes silent for
hours between heartbeats and only emits when something genuinely
transitions.
Apply audit findings from a follow-up review:

- `MappingOutcome` now derives `Copy` (32B of u64s); `last_outcomes`
  HashMap stores by value, no per-observation clones on the unchanged
  path.
- `observe_mapping_outcome` returns `Option<bool>` (None=suppress,
  Some(recovered)=emit) instead of an `OutcomeObservation` struct that
  the single caller had to destructure. The struct's `prior` field
  existed only to recompute `prior.failed > 0 && new.failed == 0`,
  which the method now does itself.
- Use `HashMap::entry()` to avoid the get-then-insert double-hash + key
  allocation on the changed path.
- Drop `WatchLogState::note_emit`. The cycle tail no longer bumps the
  counter -- it's gated on per-mapping activity, which already drives
  the counter via the `observe_*` methods. `emit_cycle_tail` becomes a
  pure formatter; the watch-mode suppression check moved to the call
  site.
- Add `recovered` field to the WARN arm of `emit_mapping_outcomes` so
  log aggregators see a uniform field set across info/warn (a partial
  recovery from `failed=2 -> failed=1` still warns; the structured
  field now exists to be filtered on regardless).
- `format_bytes`/`format_duration` now imported at the top, matching
  `progress.rs` style.
- Trim test redundancy: `cycle_emit_counter_resets_and_increments`
  collapsed into `dedupes_identical_mapping_outcomes` and the recovery
  test (the counter increment is asserted there). Drops one test, no
  loss of coverage.

No behavioral change to log output. Net -32 LOC.
…et mappings

The `[ttl]` / `[ecr]` bracket only carries information when a mapping
fans out to multiple destinations (e.g. `[ecr-prod, ghcr-mirror]`). On
single-target mappings -- the common case -- the destination is already
in the `from -> to` arrow, so the bracket is noise.

Emit the bracket only when `target_names.len() > 1`. Structured tracing
fields (`from`, `to`, `synced`, ...) are unchanged; this is a
human-readable formatting change only.

Single-target now reads:
  library/alpine -> mirror/alpine: skipped 1
Multi-target unchanged:
  library/alpine -> mirror/alpine [ecr-prod, ghcr-mirror]: synced 1
The text formatter tails each tracing event with `key=value` pairs for
every structured field on the call. When those fields restate what the
message string already says, the line ends up repeating itself.

Three call sites were doing this:

- per-mapping outcome (`emit_mapping_outcomes`): had
  `from=<source> to=<target>` after the message's `<source> -> <target>`
  arrow. Drop both -- the arrow is the canonical render. Counts and
  `recovered` stay (they carry zero values the terse message elides
  and a flag scrapers can filter on).
- no-tags WARN (`emit_no_tags_warn`): had `from=<from>` after the
  Display impl already led with the same value. Drop. Counts and
  `filter` stay.
- cycle tail (`emit_cycle_tail`): had `mappings/synced/skipped/failed/
  bytes` fields verbatim duplicating the message's
  "N mappings | X synced, Y skipped, Z failed | bytes in time". Drop
  all -- the message is fully self-contained.

JSON tracing-format consumers parse the message string (or use
`--json` for the structured `SyncReport`) instead of these fields. No
behavioral change beyond what the operator sees in text logs.
@bryantbiggs bryantbiggs merged commit 10f5433 into main May 6, 2026
16 checks passed
@bryantbiggs bryantbiggs deleted the feat/sync-log-ux-cleanup branch May 6, 2026 15:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant