fix: high-concurrency ECR hardening -- streams, memory, commit-gate#104
Merged
Conversation
Syncs N small images (default 200 x 2 layers x 16KB) between two local
registry:2 testcontainers so a profiler can sample SyncEngine::run on
a representative "many small images" workload. The test prints
PROFILE BEGIN/END markers to scope the analysis window in the UI.
Intended use:
samply record -o /tmp/sync.profile -- \
cargo test --release --package ocync-sync --test perf_profile -- \
--ignored --nocapture --exact profile_small_images
Tunable via OCYNC_PROFILE_{IMAGES,LAYERS,BYTES,WORKERS} env vars.
registry:2 is HTTP-only; the harness under-reports TLS work, and that
limitation is documented in the file header.
[profile.release] strips symbols, which produces raw-address traces.
[profile.profiling] inherits release codegen (LTO, codegen-units=1)
but preserves debug info so samply can symbolicate against the binary
at view time.
Use via:
cargo {test,build} --profile profiling
profile_small_images_tls runs the same workload against registry:2 with its built-in TLS terminator, using a self-signed cert generated at test start. Surfaces rustls handshake + AEAD cost that the HTTP-only baseline under-reports. Adds RegistryClientBuilder::allow_invalid_certs (#[doc(hidden)]) so the harness can talk to the self-signed registry; same test-only convention as the existing test_http_client() helper. OCYNC_PROFILE_BYTES already lets the operator dial blob size to amplify SHA-256 cost on top of either transport.
The harness now uses blob_push_stream for corpus population so the profile reflects production. Without this, populate-time monolithic SHA dominated the trace and the actual streaming hot path was masked. Adds RegistryClientBuilder::force_http1 (#[doc(hidden)], same convention as allow_invalid_certs). Exposed in the harness via OCYNC_PROFILE_HTTP1=1. Lets us A/B test whether HTTP/2 ALPN is the cause of a stall observed at TLS + 50 workers + 5 MB blobs. A/B result against registry:2 (10 images x 2 layers x 5 MB, 50 workers): - TLS + h2 default hangs >60s - TLS + h2 default, 5 workers 248ms - TLS + h1 forced, 50 workers 293ms - HTTP/1.1 (no TLS) baseline 2.3s Same workload, same registry, same code; the only variable that turns "forever" into "fast" is whether the connection multiplexes over h2 or uses one TCP per request over h1.
Wires the existing RegistryClientBuilder::force_http1 (already added behind #[doc(hidden)]) through the CLI's build_registry_client path when OCYNC_FORCE_HTTP1=1 is set. Used to A/B the HTTP/2 multiplexing stall observed at high max_concurrent_transfers against TLS registries. Not a supported production toggle. A/B results against ECR us-east-2 (10 mappings, 50 workers): - h2 (default) 5min cut: 2/10 synced, 333 MB, 8 in-flight abandoned - h1 forced 2m12s: 0/10 synced, all failed BLOB_UPLOAD_UNKNOWN These are different failure modes. The h2 stall reproduces against registry:2 in the harness and is cleanly fixed by h1 there. ECR adds its own failure mode under h1 + 50 workers (eventual-consistency race between blob PUT 201 and manifest commit, or a separate ocync bug at high concurrency). Not a clean single-cause answer; documenting the diagnostic so the experiment is reproducible.
Hyper's default starts each h2 stream at a 64 KB receive window and grows it only on WINDOW_UPDATE frames. Under high stream concurrency (e.g. max_concurrent_transfers of 50 streaming large blobs over a single TCP connection), this starves throughput. Adaptive window sizes the receive window dynamically based on observed throughput. Off by default in reqwest; on by default in RegistryClientBuilder. Measured against ECR us-east-2 (10 mappings, 50 workers, ~2.5 GB corpus): - without adaptive_window: 66 MB/min, drain abandons many in-flight - with adaptive_window: 160 MB/min (matches a 5-worker baseline) Adds doc-hidden RegistryClientBuilder::http2_adaptive_window for A/B testing; honoured by OCYNC_H2_ADAPTIVE_WINDOW=0 (CLI) and OCYNC_PROFILE_H2_ADAPTIVE=0 (perf harness) so the regression can be reproduced. Known limitation: adaptive_window does not by itself fix a separate correctness issue against ECR at high concurrency where blob PUT 201 responses come back but the layers are not visible at manifest-push time. That race is documented next to is_blob_upload_unknown.
Adds is_blob_upload_unknown classifier that recognises the OCI error code in a 404 RegistryError. Wired into the engine's with_retry path so manifest pushes (and blob transfers) retry on the transient state where a registry has acknowledged a blob PUT but not yet promoted the layer into the manifest-validation index. Backoff comes from the existing RetryConfig. Provides defense-in-depth for registries that exhibit short eventual-consistency windows between blob commit and manifest validation. Insufficient on its own to clear the ECR + high concurrency case (the consistency window there extends beyond practical backoff budgets and HEAD against the blob digest reports "exists" while manifest validation still rejects) -- the function comment documents what's known.
Adds three variants exercising the production-shape scenario behind the ECR investigation: 10 mappings to one target at max_concurrent=50, mostly disjoint blobs, with and without a batch checker. Each test asserts the report's per-blob stats reconcile against the actual PUT and mount-POST counts on the mock target. A regression that short- circuits a blob push via false-Skipped or false-Mounted would trip the count assertion, not hide as silent stats drift. The honest mock confirms the engine, leader-follower coordination, and batch_checker integration are sound under local concurrency. The remaining ECR-only failure therefore lives in ECR-specific paths -- AWS SDK under concurrent first-use, AIMD epoch handling, or HTTP-layer state against real endpoints -- not in the engine.
The copy subcommand was hardcoding batch_checker=None and head_first=false on the ResolvedMapping it builds, even when --config supplied registry settings. For ECR destinations this forced per-blob HEAD checks through ECR's lying HEAD endpoint (false 200 under concurrency), defeating the BatchCheckLayerAvailability optimization that sync already uses. The destination now goes through the same ECR detection as sync (explicit auth_type: ecr OR detect_provider_kind match), constructing a BatchChecker ::from_hostname when applicable and honouring aws_profile. head_first now reads from the source registry's config -- defaulting to false matches the sync command. Single-image copy to ECR now issues one batch API call instead of N HEADs, and reuses the head-before-pull optimization for synced targets.
The h2-only stall at high concurrency was per-connection stream exhaustion, not stream-multiplexing flow control. reqwest opens one h2 connection per origin and multiplexes streams over it; major container registries advertise SETTINGS_MAX_CONCURRENT_STREAMS in 100-128 (probed: registry-1.docker.io 128, ghcr.io 100, cgr.dev 100, us-docker.pkg.dev 100, public.ecr.aws 128, quay.io 128, mcr.microsoft.com 100). With the previous defaults of max_concurrent_transfers=50 and BLOB_CONCURRENCY=6 per image, peak in-flight blob streams could reach 50 * 6 = 300, three times the worst-case stream budget. The connection's stream queue saturates and new streams park indefinitely; h1 worked only because each request opens its own TCP connection. Two changes: 1. RegistryClient gains an Arc<Semaphore> sized at 64 by default (DEFAULT_STREAMING_BLOB_CONCURRENCY), acquired at the start of blob_push_stream and blob_pull. blob_pull returns a PermitStream that keeps the permit alive for the lifetime of the response body so the release point lines up with stream-close on the wire, not with the `await?` that hands the stream back. RegistryClientBuilder exposes `streaming_blob_concurrency(n)` for per-registry tuning. 2. DEFAULT_MAX_CONCURRENT_TRANSFERS drops from 50 to 10 in the engine. The 50 default predated any measurement of registry stream limits; the new value is derived as cap (64) / BLOB_CONCURRENCY (6) so the typical workload fits the stream budget without thrashing on the semaphore. Skopeo's --parallel-jobs default is 6 for comparison. Mount-heavy or skip-heavy workloads barely use streaming blobs and tolerate much higher image concurrency; users can raise the value in global config. Also adds OCYNC_PROFILE_STREAM_CAP to the perf profile harness so the cap can be A/B-tested without touching builder code. Verified locally against `registry:2` over TLS via `profile_small_images_tls`: default settings (200 images x 10 workers, h2) complete in ~1.4s, matching h1 throughput. Pre-fix, the same load at workers=50 hung past 21 minutes on h2 while finishing in 6s on h1.
Three registry quirks worked around or verified:
1. ACR chunked PATCH fallback. Azure Container Registry rejects streaming
PUT bodies above ~20 MB. blob_push_stream now dispatches to
blob_push_stream_acr for *.azurecr.io hosts, which buffers the stream,
verifies the digest, then uploads via 16 MB PATCH chunks with the
OCI Content-Range format (`{start}-{end}`, NOT RFC 7233) followed by a
PUT to finalize. Each PATCH response carries a fresh Location header
used as the next upload URL. Tests cover both the single-chunk and
multi-chunk paths.
2. ECR S3 redirect handling. ECR private serves blob bytes via a 307 to
a presigned S3 URL. Verified that reqwest's default redirect policy
follows up to 10 hops and that the Authorization header is stripped
on cross-host redirects (S3's signature lives in the query string;
forwarding registry credentials to a third party would leak them).
Two new tests pin both behaviors -- same-host redirect-follow and
cross-host auth-strip.
3. GHCR private-repo auth dispatch audit. No code change. Existing
coverage is comprehensive:
- auth_dispatch_tests.rs proves the right provider instantiates for
every (auth_type, hostname) combination, including ghcr.io.
- docker.rs has end-to-end tests for the Bearer flow with creds, the
anonymous fallback, token caching, invalidate, and per-scope
concurrent exchange.
The Chainguard write-up describing a "GHCR returns public-tier
responses for private repos" failure mode was a GHCR server-side
blob-metadata info leak (since fixed), not a client-side auth bug.
Memory `reference_oci_chunked_upload` previously noted ACR's fallback as
"not yet implemented" -- now implemented.
…ication Cleanup pass surfaced by an xhigh review of perf/profile-harness and a follow-up Phase 1 / Phase 2 sweep. Twelve fix-now items applied; six deferred items filed as #98-#103. CLI / engine alignment: - `default_max_concurrent_transfers()` (config.rs) was still 50 while the engine default dropped to 10 in a981494; every user with a YAML config containing a `global:` block silently bypassed the h2 fix. Pin both to `ocync_sync::engine::DEFAULT_MAX_CONCURRENT_TRANSFERS`. - copy.rs and synchronize.rs ECR detection ORed explicit auth_type with hostname auto-detection, so a non-Ecr auth_type could not suppress `BatchChecker::from_hostname` for an ECR-shaped hostname. Switch to a match so explicit non-Ecr is a hard opt-out. Blob upload path harden + dedupe (crates/ocync-distribution/src/blob.rs): - Extract `initiate_blob_upload` helper; the POST + expect_status + extract_location sequence had been duplicated at four call sites. - Move the streaming-blob permit acquisition into each path that owns it: the default streaming PUT acquires before the PUT body, and GHCR / GAR / ACR fallbacks each acquire on entry. Previously a single outer permit was held across `buffer_stream` even though only the source-side stream is active during buffering. - `extract_location` now enforces same-origin against the registry base_url at every upload-chain step (POST initiate, each PATCH, finalize PUT). A compromised proxy returning a cross-host Location would otherwise leak the registry bearer to an attacker host. - `buffer_stream`'s `Vec::with_capacity(capacity_hint)` is now capped at 16 MiB (`MAX_BUFFER_PREALLOC`). The hint comes from manifest-declared size and is attacker-controllable. - ACR fallback documents the zero-byte-blob path (loop skipped, finalize PUT carries the empty-blob digest) and rename `blob_push_stream_gar_fallback` -> `blob_push_stream_gar` for consistency with `_ghcr` / `_acr` siblings. - Tests: tighten the small-blob Content-Range matcher to an exact `0-{len-1}` (was a permissive regex that would accept off-by-ones), add zero-byte and cross-host-rejection coverage, delete the `owned_data_stream` test helper now that `data_stream` returns `+ 'static + use<>`. Docs (the `docs ship with feature commit` rule): - configuration.md: default 50 -> 10; expand the `max_concurrent_transfers` row with the derivation. - design/engine.md: Level-1 image semaphore default 50 -> 10; add derivation. - crates/ocync-distribution/CLAUDE.md: ACR fallback "not yet implemented" -> implemented; describe streaming_blob_sem. - crates/ocync-sync/src/engine.rs (`with_retry` doc): now mentions the ECR `BLOB_UPLOAD_UNKNOWN` retry branch. - crates/ocync-sync/tests/perf_profile.rs: tunables list now includes `OCYNC_PROFILE_H2_ADAPTIVE` and `OCYNC_PROFILE_STREAM_CAP`; note workers=50 is a deliberate harness overdrive vs. the 10 engine default. - sync_disjoint_high_concurrency.rs: stale "Production default: 50" comment updated. Filed for follow-up: #98 (PermitStream pin-project + size_hint), #99 (is_blob_upload_unknown JSON parse), #100 (head_first semantics for copy), #101 (RegistryClientBuilder input-validation consistency), #102 (diagnostic env vars -> hidden CLI flags), #103 (OCYNC_FORCE_HTTP1/OCYNC_PROFILE_HTTP1 name unification).
The shiploop pass had filed six items as follow-up issues; on review each was introduced by code added to perf/profile-harness in this same PR, so the rules push them back into this branch rather than carrying them as deferred work. #98 -- PermitStream: switch to pin-project-lite, forward size_hint Rewrite the wrapper with `pin_project!`. Drops the `S: Unpin` bound (a future change in reqwest's stream impl would otherwise silently bypass the wrapper) and forwards `Stream::size_hint` so consumers that pre-allocate (e.g. `reqwest::Body::wrap_stream` setting Content-Length, or a downstream `collect::<Vec<_>>()`) see the same hint as the raw `bytes_stream()`. New workspace dep `pin-project-lite`. #99 -- Retry classifier: parse OCI error JSON, not substring `is_blob_upload_unknown` now parses the body as `{errors:[{code,...}]}` and matches `errors[].code == "BLOB_UPLOAD_UNKNOWN"` exactly. A free-text mention of the string outside `errors[].code` is no longer treated as retryable; a body that fails to parse is also rejected. New unit tests cover both regressions. #100 -- head_first semantics for copy: documented Kept the field read in copy (single-shot copy benefits the same way: skip source GET when the destination already matches the source HEAD digest). Updated `RegistryConfig::head_first` doc to call out the cross-subcommand applicability and added an inline comment in copy.rs explaining the savings. #101 -- RegistryClientBuilder: consistent zero-clamp `max_concurrent(0)` previously produced a `Semaphore::new(0)` that deadlocked the first acquire; `streaming_blob_concurrency(0)` clamped to 1. Both now clamp to 1 and the docstring calls out the behaviour. #102 -- Diagnostic env vars -> hidden CLI flags Added `--force-http1` and `--no-h2-adaptive-window` as hidden global flags on `Cli`. `main()` installs them into a process-wide `OnceLock<ClientDiag>` read by `build_registry_client`. The previous env-var reads (`OCYNC_FORCE_HTTP1`, `OCYNC_H2_ADAPTIVE_WINDOW`) are gone -- one fewer pattern violating the "no env-based behaviour switches" rule. Hidden because both knobs degrade throughput. #103 -- OCYNC_FORCE_HTTP1 / OCYNC_PROFILE_HTTP1 drift Resolved by #102 in the production binary: there are no `OCYNC_FORCE_*` env vars left in the CLI. The perf_profile test crate keeps its `OCYNC_PROFILE_*` knobs (test-only by convention, no overlap with production behaviour).
…meout
acr.md previously said the ACR chunked-PATCH fallback was "not yet
implemented" -- caught by the audit prompted by the question of whether
ACR is uniquely capped on the OCI streaming-PUT path (it is). Updated
the entry to describe the actual fallback (16 MiB chunks, OCI
{start}-{end} Content-Range, same-origin check on every PATCH Location).
ghcr.md now notes the 10-minute server-side upload timeout per layer
(community-documented in GHCR discussion #77429). GHCR's single-PATCH
fallback sends the whole blob in one request, so a multi-GB layer on a
slow link can be cut off mid-body. with_retry already reclassifies the
resulting connection reset as transport-retryable and starts a fresh
upload session, so the existing retry path covers the case -- but
operators should know why a sufficiently large+slow combination can
still fail and how to widen the budget. The 10 GB layer cap is also
listed for completeness.
No code change. Tag-list pagination is already implemented in
crates/ocync-distribution/src/tags.rs (Link rel="next" with a 10,000-
page guard); my earlier audit missed it.
Two structural fixes for high-concurrency ECR / large-layer workloads
surfaced by the perf/profile-harness critical review:
Byte budget for buffered uploads
GHCR/GAR/ACR fallbacks buffer the entire blob in memory; the prior
count-only cap let 64 concurrent 1 GB layers (CUDA/ML images) pin
~64 GB resident. New buffered_blob_bytes (default 512 MB) caps
cross-call bytes via a tokio Semaphore; each fallback acquires
min(known_size, budget) permits before buffer_stream via a new
acquire_buffered_upload_permits helper.
ECR consistency gate for manifest commit
ECR's manifest validator lags blob PUT-201 by hundreds of ms at
high concurrency, surfacing as BLOB_UPLOAD_UNKNOWN on manifest PUT
and exhausting the retry budget. New
BatchBlobChecker::wait_for_blobs_available polls
BatchCheckLayerAvailability with exp backoff (200ms -> 5s cap)
before manifest PUT, gated on RetryConfig::manifest_commit_wait
(30s prod, 0 in fast_retry). Wired into push_manifests AND
discover_and_sync_artifacts.
Artifact pipeline parity
discover_and_sync_artifacts now uses batch-check pre-population +
BLOB_CONCURRENCY-capped parallel blob processing, matching the main
image flow, with the same manifest_commit_wait gate before artifact
manifest PUT.
Diagnostic CLI flags removed
--force-http1 and --no-h2-adaptive-window dropped from production
CLI. The #[doc(hidden)] pub builder methods are retained for the
perf-profile test harness and documented in client.rs +
ocync-distribution/CLAUDE.md so future audits don't flag them as
dead.
Idiomatic cleanups
- Provider dispatch in blob_push_stream: if-chain to exhaustive match
(new ProviderKind variants fail to compile until handled)
- fastrand replaces hand-rolled RandomState jitter (retry.rs)
- should_retry_transport keeps is_request/is_body/is_decode only;
is_connect/is_timeout are subsets of is_request on the async-hyper
path. New test pins the timeout-vs-is_request equivalence.
- Option<&Rc<dyn>> -> Option<&dyn> for pass-through trait params
- HashSet collect dedup replaces filter-and-insert pattern
- buffer_stream doc-vs-code drift fixed
- std::cmp::min nesting -> chained .min() across 5 sites
1399 tests passing; cargo deny / fmt / clippy -D warnings clean.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
End-to-end hardening for high-concurrency syncs to ECR. The original failure mode was "max_concurrent_transfers=50 stalls indefinitely against ECR"; the investigation converged on five compounding mechanisms.
HTTP/2 stream budget
Registries advertise
SETTINGS_MAX_CONCURRENT_STREAMSin the 100-128 range (probed seven). Prior defaultmax_concurrent_transfers=50 * BLOB_CONCURRENCY=6 = 300peak streams overshot the budget on every registry and deadlocked the h2 connection. Added a per-RegistryClientstreaming-blob semaphore (default cap 64), loweredDEFAULT_MAX_CONCURRENT_TRANSFERSto 10, and enabled HTTP/2 adaptive flow-control window sizing by default.Per-target byte budget for buffered fallback uploads
GHCR / GAR / ACR fallbacks buffer the entire blob in memory because the target rejects streaming bodies. Prior count-only cap let 64 concurrent 1 GB layers (CUDA/ML images) pin ~64 GB resident. New
RegistryClient::buffered_blob_bytes(default 512 MB) caps cross-call bytes via atokio::sync::Semaphore; fallback paths acquiremin(known_size, budget)permits beforebuffer_streamvia the newacquire_buffered_upload_permitshelper.ECR consistency gate before manifest commit
ECR's manifest validator lags blob
PUT-201by hundreds of ms at high concurrency, surfacing asBLOB_UPLOAD_UNKNOWNon manifest PUT and exhausting the retry budget. NewBatchBlobChecker::wait_for_blobs_availablepollsBatchCheckLayerAvailabilitywith exponential backoff (200 ms initial, 5 s cap, deadline fromRetryConfig::manifest_commit_wait, prod default 30 s,Duration::ZEROinfast_retry). Wired into bothpush_manifestsanddiscover_and_sync_artifacts.Upload protocol fallbacks
ACR chunked-PATCH fallback under ACR's ~20 MB streaming-PUT limit, GHCR single-PATCH (the multi-PATCH path overwrites previous chunks), GAR monolithic PUT (no chunked support). All upload-chain
Locationheaders same-origin-validated to prevent cross-host bearer-token forwarding. Manifest-drivenVec::with_capacitypre-allocation capped at 16 MiB. Provider dispatch inblob_push_streamconverted from if-chain to exhaustive match so new variants fail to compile until handled.Artifact pipeline parity
discover_and_sync_artifactsnow matches the main image pipeline: batch-check pre-population,BLOB_CONCURRENCY-capped parallel blob processing viaFuturesUnordered+Semaphore(6), and the samewait_for_blobs_availablecommit-gate before the artifact manifest PUT.Retry, config, and cleanup
BLOB_UPLOAD_UNKNOWNinstead of substring-matching.is_request() || is_body() || is_decode()(the droppedis_connect()andis_timeout()are subsets ofis_request()on the async-hyper path; new test pins the equivalence by binding a silentTcpListenerand asserting a 50 ms timeout classifies viais_request).fastrandreplaces a hand-rolledRandomState-based jitter.copyandsyncboth honorauth_type=ecrand treat explicit non-Ecrauth_typeas a hard opt-out from auto-detection.batch_checkerandhead_firstare wired through from config incopy(previously hardcodedNone/false).--force-http1and--no-h2-adaptive-windowremoved from the production binary. The underlying#[doc(hidden)] pubbuilder methods are retained for the perf-profile test harness and documented inclient.rs+crates/ocync-distribution/CLAUDE.mdso future audits don't flag them as dead code.Option<&Rc<dyn>>->Option<&dyn>for pass-through trait params;HashSet-collect dedup replaces filter-and-insert;std::cmp::minnesting -> chained.min()across 5 sites;buffer_streamdoc-vs-code drift fixed.Engine-level h2 stall reproducer + regression tests live in
crates/ocync-sync/tests/sync_disjoint_high_concurrency.rs.Test plan
cargo fmt --all -- --checkcargo clippy --workspace --all-targets --locked -- -D warningscargo test --workspace --locked(1399 passed, 3 ignored)cargo deny check(advisories, bans, licenses, sources all ok)registry:2: 200 images x 10 workers (new default) completes in ~1.4s on h2; pre-fix the same workload hung past 21 minutes at workers=50.max_concurrent_transfers=10) to confirmBLOB_UPLOAD_UNKNOWNis gone end-to-end with the new commit-gate.*.azurecr.ioto validate chunked PATCH against the actual 20 MB body limit.buffered_blob_bytesbounds resident memory below the configured cap.