feat(observability): scope sync logs to the mapping; minimal watch surface#89
Merged
Conversation
Sync runs over multi-mapping configs (especially against cosign-heavy
sources like cgr.dev) emit dozens of WARN log lines per cycle, none of
which describe a real problem. Watch-mode multiplies this per cycle.
Changes:
- Demote noisy per-mapping config warnings to debug:
- validate_artifacts emits per mapping when artifacts.enabled is false
- validate_tags emits per mapping when tags.semver lacks tags.latest
- Demote per-tag "tag is not parseable as a version" warn to debug; the
same drop is already attributed via FilterReport's drop reasons.
- Skip OCI 1.1 referrer fallback tags (sha256-<hex>(.sig|.sbom|.att))
before semver parsing so cosign-heavy repos do not spam the log
channel. Helper is_referrers_fallback_tag is pub for reuse in CLI.
- Demote spawn / wait / non-zero-exit cred helper warns to debug; keep
the 30s timeout warn (real operational signal). Caller already
swallows the Err for graceful fallback to anonymous auth.
Restructure the per-mapping plan line and dry-run output:
- INFO summary now reads "5 tags match for sync out of 1133 possible"
with brackets for targets; no more dual-arrow ambiguity.
- Suppress the INFO summary in --dry-run mode (the pretty-print is the
answer).
- Replace the "no tags matched after filtering" warn with one carrying
source count split (image vs referrer artifact), filter rationale,
and image-tag samples; structured tracing fields included.
- Dry-run section labels: "source candidates" -> "source tags",
"pipeline:" -> "filter:", "dropped N:" -> "dropped (N):"; sort and
latest now render as separate stages; system-exclude hint formatted
as a hint: line.
Watch-mode dedup:
- WatchLogState (HashSet keyed by mapping.from) carries log state
across cycles. Emits one WARN per failure-state transition; emits a
recovery INFO when a previously-failing mapping resolves; prunes
entries when the mapping is removed from config.
Banner:
- Switch README hero from docs/public/ecr-banner.svg to a rendered PNG
(docs/public/ecr-banner.png). GitHub's SVG renderer mangled the
bold-gradient title; the PNG renders deterministically.
Verification:
- 1346 unit + integration tests pass (workspace, --locked, non-fips)
- cargo clippy --workspace --all-targets --locked: clean
- cargo fmt --all -- --check: clean
- cargo deny check: advisories ok, bans ok, licenses ok, sources ok
- npm run --prefix docs build: 28 pages, 831 ms
- Live dry-run against cgr.dev/chainguard/{nginx,python} and
docker.io/library/{nginx,alpine}: chainguard mappings now emit one
legible WARN each (instead of ~14k unparseable-tag warnings); Docker
Hub mappings show the new INFO summary and clean dry-run pipeline.
The "X tags match for sync out of Y possible" line was added with the
log-cleanup PR but turns out to be redundant in real-sync and watch
modes -- the engine already emits an INFO per image at the moment of
transfer ("image synced", with byte counts). The summary repeats the
config back at the operator before any work is done.
`--dry-run` is the right surface for this preview; the dedicated
pretty-print there already covers it. Real-sync and watch now go
straight from mapping resolution to engine work, with no per-mapping
"about to" announcement to clutter watch-mode logs cycle after cycle.
Drops `format_plan_line` and `log_resolved_mappings` plus their
tests. `describe_filter` is retained -- it still drives the no-tags
WARN's filter rationale.
In watch-mode steady-state most tags hit the skip path -- digest match
at target, immutable tag at all targets, target tag is immutable.
Each was firing INFO per (tag, cycle), producing N x cycles lines for
zero new work. The aggregate count is already on the "sync complete"
summary line.
Demote the per-tag skip events (engine.rs L865, L1433, L1577, L1766)
and the artifact-discovery 404 / fallback / referrers-API edge cases
(L2871, L2880, L2894) to debug. Verbose mode (-v) brings them back.
Kept at INFO: actual transfer events ("image synced" with byte count),
leader election, multi-arch platform filtering, rate-limit recovery,
and shutdown -- all real-signal lifecycle events, not per-tag chatter.
Net effect: a 50-mapping x 5-tag watch cycle in steady state now emits
one summary line per cycle (was 250 INFO lines).
In watch mode the per-cycle "sync complete: ..." line was repeating verbatim every cycle in steady state, paired with an "INFO sync cycle complete" wrapper that conveyed the same fact. For an idle pod cycling every 60s, the log stream was N pairs of identical lines per hour. - Add DedupingWatchProgress, a ProgressReporter wrapper that compares the new SyncStats to the prior cycle's and only forwards run_completed when they differ. First cycle always emits; identical follow-ups go silent. Wraps the user's reporter so per-image events still pass through. - Stats comparison ignores SyncReport.duration on purpose -- wall-clock drift cycle-to-cycle in steady state would otherwise re-emit every time. Test pins the contract. - Drop "INFO sync cycle complete" from the watch loop. Cycle completion is conveyed by the dedup-aware "sync complete:" line on stdout (when stats change) and by /healthz (always). The INFO added nothing. - Derive Clone, PartialEq, Eq on SyncStats so the wrapper can compare without writing a digest helper. Net effect: a steady-state watch loop emits log volume proportional to state transitions, not cycle count. Sync-mode (one-off) is unchanged -- the user's progress reporter is passed through verbatim.
The aggregate stdout summary ("sync complete: 0 synced, 250 skipped...")
named no mapping. For a 50-mapping watch in steady state, the operator
got identical aggregate counts every cycle with zero context about what
was tracked or where it was going. The user's exact complaint: "no
mention/reference to the image being synced - pure unknown noise".
Replace it with a per-mapping post-event INFO/WARN line that names the
specific mapping:
library/alpine -> ocync-test/alpine-1h [ttl]: synced 1 (28.5 MB)
library/alpine -> ocync-test/alpine-1h [ttl]: skipped 1
library/alpine -> ocync-test/alpine-1h [ttl]: failed 1 <- WARN
Implementation:
- Drop TextProgress::run_completed stdout aggregate (was the contextless
noise) and the `suppress_summary` plumbing it required. JSON output
still owns stdout in --json mode via write_output, unchanged.
- Drop DedupingWatchProgress -- replaced by per-mapping dedup on the
more-informative per-mapping line.
- After engine.run returns, group ImageResults by mapping (source_repo +
target_repo prefix) into MappingOutcome { synced, skipped, failed,
bytes }. Emit one INFO per mapping; WARN when any failed > 0.
- Watch mode plumbs through WatchLogState.last_outcomes (HashMap) and
suppresses follow-up cycles where the outcome is identical to the
prior cycle. retain_active prunes stale entries on config edit.
Format: zero counts are omitted ("synced 1" not "synced 1, skipped 0,
failed 0"); bytes clause appears only when > 0; fields are also
attached as structured tracing fields (from, to, synced, skipped,
failed, bytes) for log aggregators.
Three follow-ups from the post-merge audit, all watch-mode UX: - Cycle tail: emit a `summary: N mappings | X synced, Y skipped, Z failed | bytes in duration` INFO/WARN at the end of each cycle that emitted anything. Sync mode: always. Watch mode: gated on cycle_emit_count > 0. Restores cycle-level rollup the per-mapping refactor removed; named context comes from the per-mapping lines above it. - Recovery marker: when a mapping transitions from `failed > 0` to `failed == 0`, append ` [recovered]` to the per-mapping line so an operator scanning logs sees the recovery without correlating WARN -> INFO across cycles. WatchLogState.observe_mapping_outcome now returns the prior outcome alongside `changed`, so the caller can detect the transition. - Idle heartbeat: after one hour of silence, emit a "watch alive: no state changes; sync loop healthy" INFO with idle_secs. WatchLogState exposes `begin_cycle()` / `cycle_emit_count()` to let watch.rs gate the heartbeat on per-cycle activity. /healthz remains the kubelet probe; this gives log scrapers a recent anchor on idle pods. All three are dedup-aware: a steady-state pod still goes silent for hours between heartbeats and only emits when something genuinely transitions.
Apply audit findings from a follow-up review: - `MappingOutcome` now derives `Copy` (32B of u64s); `last_outcomes` HashMap stores by value, no per-observation clones on the unchanged path. - `observe_mapping_outcome` returns `Option<bool>` (None=suppress, Some(recovered)=emit) instead of an `OutcomeObservation` struct that the single caller had to destructure. The struct's `prior` field existed only to recompute `prior.failed > 0 && new.failed == 0`, which the method now does itself. - Use `HashMap::entry()` to avoid the get-then-insert double-hash + key allocation on the changed path. - Drop `WatchLogState::note_emit`. The cycle tail no longer bumps the counter -- it's gated on per-mapping activity, which already drives the counter via the `observe_*` methods. `emit_cycle_tail` becomes a pure formatter; the watch-mode suppression check moved to the call site. - Add `recovered` field to the WARN arm of `emit_mapping_outcomes` so log aggregators see a uniform field set across info/warn (a partial recovery from `failed=2 -> failed=1` still warns; the structured field now exists to be filtered on regardless). - `format_bytes`/`format_duration` now imported at the top, matching `progress.rs` style. - Trim test redundancy: `cycle_emit_counter_resets_and_increments` collapsed into `dedupes_identical_mapping_outcomes` and the recovery test (the counter increment is asserted there). Drops one test, no loss of coverage. No behavioral change to log output. Net -32 LOC.
…et mappings The `[ttl]` / `[ecr]` bracket only carries information when a mapping fans out to multiple destinations (e.g. `[ecr-prod, ghcr-mirror]`). On single-target mappings -- the common case -- the destination is already in the `from -> to` arrow, so the bracket is noise. Emit the bracket only when `target_names.len() > 1`. Structured tracing fields (`from`, `to`, `synced`, ...) are unchanged; this is a human-readable formatting change only. Single-target now reads: library/alpine -> mirror/alpine: skipped 1 Multi-target unchanged: library/alpine -> mirror/alpine [ecr-prod, ghcr-mirror]: synced 1
The text formatter tails each tracing event with `key=value` pairs for every structured field on the call. When those fields restate what the message string already says, the line ends up repeating itself. Three call sites were doing this: - per-mapping outcome (`emit_mapping_outcomes`): had `from=<source> to=<target>` after the message's `<source> -> <target>` arrow. Drop both -- the arrow is the canonical render. Counts and `recovered` stay (they carry zero values the terse message elides and a flag scrapers can filter on). - no-tags WARN (`emit_no_tags_warn`): had `from=<from>` after the Display impl already led with the same value. Drop. Counts and `filter` stay. - cycle tail (`emit_cycle_tail`): had `mappings/synced/skipped/failed/ bytes` fields verbatim duplicating the message's "N mappings | X synced, Y skipped, Z failed | bytes in time". Drop all -- the message is fully self-contained. JSON tracing-format consumers parse the message string (or use `--json` for the structured `SyncReport`) instead of these fields. No behavioral change beyond what the operator sees in text logs.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Sync's log surface is now mapping-scoped, dry-run owns the verbose pre-flight detail, and watch mode is event-driven: a steady-state pod logs O(state transitions), not O(cycles).
Three layers, separated:
--dry-runsync(one-off)src -> dstcontext and outcome counts; per-image transfer events from the engine; one-line cycle tail rolling totalswatchCosign-heavy sources (
cgr.dev/chainguard/*) used to spam thousands of "tag is not parseable as a version" warns per repo per cycle. Thosesha256-<hex>(.sig|.sbom|.att)referrer fallback tags are now silently skipped before the version parser sees them.Failure -> recovery transitions get an explicit
[recovered]marker on the per-mapping line so an operator scanning logs sees the recovery without correlating WARN -> INFO across cycles.Before / After
--dry-runagainstcgr.dev/chainguard/{nginx,python}+docker.io/library/{nginx,alpine}Before (truncated; the actual run emitted ~14k unparseable-tag WARNs from the cosign fallback tags alone):
After:
sync(one-off) againstdocker.io/library/alpine->ttl.shCold transfer:
Re-run (steady state):
watchover 3 cycles in steady state, all-skipBefore (one mapping; scaled linearly with mapping count):
After:
A 50-mapping watch in steady state was emitting roughly
(per_tag_skips + summary + cycle_complete)lines per cycle -- thousands per hour. It now emits one line per mapping plus a one-line cycle tail on the first cycle, then stays silent until something transitions. After an hour of silence, anINFO watch alive: ...heartbeat fires so log scrapers have a recent anchor; /healthz remains the kubelet probe.Behavioural changes worth a CHANGELOG entry
5 of 1133 tags => [ecr]removed entirely; replaced by post-event per-mappinglibrary/alpine -> mirror/alpine: synced 1 (28.5 MB)(multi-target mappings retain the[ecr-prod, ghcr-mirror]bracket).sync complete: ...removed. Replaced by an INFOsummary: N mappings | ...cycle tail emitted via tracing alongside the per-mapping lines.source candidates->source tags,pipeline->filter,dropped N->dropped (N), sort/latest now two stages.[recovered]suffix on the per-mapping line when failure clears, and an idle-heartbeat INFO at one-hour cadence when nothing has transitioned.Test plan
cargo fmt --all -- --check-- cleancargo clippy --workspace --all-targets --locked -- -D warnings-- cleancargo test --workspace --locked-- 1334 passed, 1 ignoredcargo deny check-- advisories ok, bans ok, licenses ok, sources oknpm run --prefix docs build-- 28 pages built--dry-runagainstcgr.dev/chainguard/{nginx,python}+docker.io/library/{nginx,alpine}-- output abovesynccold + steady againstdocker.io/library/alpine->ttl.sh-- output abovewatchover 3 cycles, all-skip steady state -- output above