refactor(cold): permit-attached reads, dispatcher/writer split, 5s operation deadline #57
Conversation
Moves semaphore permit acquisition to `ColdStorageHandle` so permits travel with read requests into the channel. The task runner splits into two concurrent subtasks:

- **Dispatcher**: pulls `PermittedReadRequest`s and spawns handlers, wrapping each in a per-request deadline (default 5s).
- **Writer**: consumes writes sequentially. Drain-before-write uses `Semaphore::acquire_many_owned(64)`, now wrapped in a cancel-select so shutdown cannot hang on a stuck reader.

The semaphore is now the single backpressure mechanism. The read channel is sized to match the permit count, so `try_send` from a caller holding a permit is guaranteed to have capacity. New `ColdStorageError::Timeout` is returned to callers whose handler exceeds the deadline; dropping the handler future releases its permit back to the pool, so a stuck backend call self-heals.

Tests (`crates/cold/tests/concurrency.rs`) add a `GatedBackend` helper and four new regression cases: fairness under saturation, cancel during reader backpressure, cancel during write drain, and operation-deadline permit release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
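The permit-attached shape can be sketched with std-library stand-ins — a mutex-guarded counter in place of `tokio::sync::Semaphore`, `std::sync::mpsc::sync_channel` in place of the tokio channel; all type and function names here are illustrative, not the crate's actual API:

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};

/// Illustrative permit pool: a mutex-guarded counter standing in for
/// `tokio::sync::Semaphore`.
struct Pool {
    available: Mutex<usize>,
}

/// RAII permit: dropping it (e.g. when a handler future is dropped)
/// returns capacity to the pool.
struct Permit {
    pool: Arc<Pool>,
}

impl Drop for Permit {
    fn drop(&mut self) {
        *self.pool.available.lock().unwrap() += 1;
    }
}

fn try_acquire(pool: &Arc<Pool>) -> Option<Permit> {
    let mut n = pool.available.lock().unwrap();
    if *n == 0 {
        return None;
    }
    *n -= 1;
    Some(Permit { pool: Arc::clone(pool) })
}

/// The message carries its permit, mirroring `PermittedReadRequest`.
struct ReadRequest {
    block: u64,
    _permit: Permit,
}

fn main() {
    const PERMITS: usize = 4;
    let pool = Arc::new(Pool { available: Mutex::new(PERMITS) });
    // Channel capacity == permit count, so a sender holding a permit
    // can never observe a full channel.
    let (tx, rx) = sync_channel::<ReadRequest>(PERMITS);

    for block in 0..PERMITS as u64 {
        let permit = try_acquire(&pool).expect("permit available");
        // Cannot fail with `Full` while the sender holds a permit.
        tx.try_send(ReadRequest { block, _permit: permit }).unwrap();
    }
    assert!(try_acquire(&pool).is_none()); // pool exhausted

    // Dropping a received request releases its permit back to the pool.
    let req = rx.recv().unwrap();
    assert_eq!(req.block, 0);
    drop(req);
    assert!(try_acquire(&pool).is_some());
}
```

Because acquisition happens before send and the channel is exactly as deep as the permit count, send-side backpressure and read concurrency collapse into the single semaphore, which is the invariant the commit message describes.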
`UnifiedStorage::append_blocks` dispatches to cold asynchronously. With the cold task's dispatcher and writer now running on separate subtasks, there is no biased ordering between a fire-and-forget write and a subsequent read — the tests that assumed one were relying on an implementation detail that production code already polls around (see `components/crates/node-tests/src/context.rs`). Add a `wait_for_cold_height` helper matching the production pattern and use it in the two tests that issued a read immediately after `append_blocks`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
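The polling pattern such a helper follows can be sketched as a synchronous analog — the real helper is presumably async, and the function and parameter names here are hypothetical:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for `wait_for_cold_height`: poll a height getter
/// until it reaches `target` or the deadline expires. The production helper
/// is async; this sketch uses blocking sleep to stay self-contained.
fn wait_for_height(
    mut current_height: impl FnMut() -> u64,
    target: u64,
    timeout: Duration,
    poll_every: Duration,
) -> Result<(), &'static str> {
    let deadline = Instant::now() + timeout;
    loop {
        if current_height() >= target {
            return Ok(());
        }
        if Instant::now() >= deadline {
            return Err("timed out waiting for cold height");
        }
        sleep(poll_every);
    }
}

fn main() {
    // Simulate a cold store that catches up after a few polls.
    let mut h = 0u64;
    let res = wait_for_height(
        || { h += 1; h }, // height advances on each poll
        3,
        Duration::from_millis(500),
        Duration::from_millis(1),
    );
    assert!(res.is_ok());
}
```

Tests that read immediately after a fire-and-forget write call this before the read, instead of assuming a write/read ordering the split task runner no longer provides.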
[Claude Code] @rswanson tagging you for visibility — this stacks on top of #56 and would replace its diff if it lands. @Fraser999 @Evalir requesting review.
[Claude Code] Second-pass review surfaced a deadlock hazard that I'd like to flag before this merges. Critical:
Fraser999 left a comment
As well as @Evalir's point, I think this PR doesn't address the issues/suggestions in #56 (comment).
[Claude Code] Superseded by #58, which takes the full-rethink approach (Option C from the review-response brainstorming) rather than the piecewise fixes here. Closing in favor of the unified handle architecture tracked by ENG-2198. Key differences in #58:
* refactor(cold): ColdStorageWrite takes `&self`; all backends updated in lockstep
* refactor(cold): unify handle around `Arc<Inner>`; remove channels and dispatcher
* fix(cold): stream permit acquired in handle; streams do not hold a read permit
* feat(cold): drain barrier moves to handle write path
* feat(cold): shutdown coordinator closes semaphores on cancel
* refactor(cold-mdbx): spawn_blocking reads, block_in_place writes, in-body iterator deadline

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cold-sql): mandatory statement_timeout; read_timeout and write_timeout builders
* feat(cold): metrics and tracing spans across all operations

  Adds a `metrics` module under `crates/cold/src/metrics.rs` with const metric names, help strings, a `LazyLock` describe block, and `pub(crate)` helper functions for recording:

  - `cold.reads_in_flight`, `cold.writes_in_flight`, `cold.streams_active` (gauges)
  - `cold.op_duration_us` (histogram, labeled by op)
  - `cold.permit_wait_us` (histogram, labeled by sem: read/write/drain/stream)
  - `cold.op_errors_total` (counter, labeled by op and error kind)
  - `cold.stream_lifetime_ms` (histogram)

  Wires the helpers into every `ColdStorage<B>` handle method: `spawn_read` and `spawn_write` time permit acquisition, bump in-flight, measure op duration, record errors, and dec in-flight after the backend call. Cache hits in `get_header`/`get_transaction`/`get_receipt` record op duration only (no permit wait, no in-flight). `stream_logs` instruments stream permit wait and records stream lifetime + gauge in the spawned producer. Adds `ColdStorageError::kind()` for the error metric label.
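The bump-in-flight / dec-in-flight pairing around each backend call is the classic RAII-guard shape. A minimal sketch, with an `AtomicI64` standing in for the real metrics gauge and the guard name illustrative:

```rust
use std::panic;
use std::sync::atomic::{AtomicI64, Ordering};

/// Sketch of an in-flight RAII guard: increments a gauge on construction
/// and decrements on drop, so the decrement runs even if the guarded
/// body panics. An `AtomicI64` stands in for the real metrics gauge.
struct InFlightGuard<'a> {
    gauge: &'a AtomicI64,
}

impl<'a> InFlightGuard<'a> {
    fn new(gauge: &'a AtomicI64) -> Self {
        gauge.fetch_add(1, Ordering::Relaxed);
        InFlightGuard { gauge }
    }
}

impl Drop for InFlightGuard<'_> {
    fn drop(&mut self) {
        self.gauge.fetch_sub(1, Ordering::Relaxed);
    }
}

static READS_IN_FLIGHT: AtomicI64 = AtomicI64::new(0);

fn main() {
    {
        let _g = InFlightGuard::new(&READS_IN_FLIGHT);
        assert_eq!(READS_IN_FLIGHT.load(Ordering::Relaxed), 1);
    } // guard drops here
    assert_eq!(READS_IN_FLIGHT.load(Ordering::Relaxed), 0);

    // The decrement survives a panic in the guarded body.
    panic::set_hook(Box::new(|_| {})); // silence the panic message
    let _ = panic::catch_unwind(|| {
        let _g = InFlightGuard::new(&READS_IN_FLIGHT);
        panic!("backend blew up");
    });
    assert_eq!(READS_IN_FLIGHT.load(Ordering::Relaxed), 0);
}
```

A manual `fetch_add`/`fetch_sub` pair would leak an increment on any early return or panic between the two calls; tying the decrement to `Drop` removes that failure mode, which is why a later commit in this list moves the gauges onto exactly this kind of guard.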
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(cold): trait impl guide documents mandatory timeouts
* test(cold): concurrency suite covers new architecture
* fix(cold): shutdown coordinator holds `Weak<Inner>`, not `Arc`

  The coordinator task previously moved `Arc<Inner<B>>` into its body and awaited the user's cancel token. If callers dropped all `ColdStorage` clones without firing cancel, `Inner` (and the backend's file/DB handles) stayed pinned until process exit. Switch the coordinator to `Weak<Inner>`, and put a `DropGuard` on `Inner` that fires a child cancel token. Shutdown now fires on either user-side cancel OR `Inner` drop; in the drop case `upgrade()` returns `None` and the coordinator exits without pinning anything.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold-mdbx): preserve TooManyLogs via From impl, not backend wrapper

  `ColdStorageError::backend` unconditionally wraps as `Backend(Box<_>)`, which hid `MdbxColdError::TooManyLogs` behind the generic backend variant and broke the conformance suite's max_logs assertion. The `From<MdbxColdError> for ColdStorageError` impl already translates `TooManyLogs` correctly and wraps the rest. Route all spawn_blocking result conversions through `::from` so the translation runs.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold): address review on permits, gauges, errors, cache

  - `stream_logs` resolves `to` (and the `get_latest_block` fallback) before acquiring `stream_sem`, so a stuck backend no longer pins all 8 permits across setup I/O.
  - In-flight gauges are now maintained by an `InFlightGuard` RAII wrapper so the decrement survives a panic in the spawned body; previously a panic left `cold.reads_in_flight` / `writes_in_flight` / `streams_active` drifting up and poisoning the Prometheus alert signal.
  - Promote timeout to a first-class `ColdStorageError::DeadlineExceeded` variant.
    MDBX `Timeout` now routes through it (not `Backend`), and downstream callers can match without downcasting. Fixes stale `Backpressure` references in the cold and storage READMEs and the signet-storage skill doc.
  - `ColdCache` switches from `tokio::sync::Mutex` to `parking_lot::Mutex`. The cache only ever holds the lock across synchronous LRU ops, so the async mutex's yield-on-lock was pure overhead.
  - `MemColdBackend` now explicitly documents its exemption from the trait's mandatory-timeout contract.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold-mdbx): spawn_blocking writes, per-item deadlines, docs

  - Writes (`append_block`, `append_blocks`, `truncate_above`, `drain_above`) now use `tokio::task::spawn_blocking` instead of `block_in_place`. `block_in_place` panics on a current_thread runtime, so any consumer wiring `MdbxColdBackend` into a single-threaded Tokio would hit the first write and crash. Added a `writes_work_on_current_thread_runtime` regression test.
  - Overrun WARN fires only on successful writes. A failed write that took > 2 s already surfaces `Backend(...)` to the caller; a spurious advisory-write-timeout WARN on the error path would poison any SLO alert built on that signal.
  - Iterator reads gained inner-loop deadline checks: per-receipt in `get_logs_inner`, per-event in `collect_signet_events_in_range`, per-receipt + per-log in `produce_log_stream_blocking`. A block with many matching logs (or a slow stream consumer) can no longer run unbounded past the configured deadline.
  - `MdbxColdError::Timeout` now maps to `ColdStorageError::DeadlineExceeded` (new variant) instead of `Backend`. Updated the existing timeout test to match on the variant directly.
  - Documented the point-lookup timeout exemption: MDBX page I/O on cold pages can stall arbitrarily, and the handle does not wrap point lookups in a `tokio::time::timeout`, so a stuck lookup ties up a spawn_blocking worker AND a `read_sem` permit. Callers that need fail-fast behavior should wrap at the call site.
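The inner-loop deadline checks can be illustrated with a self-contained sketch. Names are hypothetical: an enum stands in for `MdbxColdError::Timeout`, and plain integers stand in for receipt logs.

```rust
use std::time::{Duration, Instant};

/// Hypothetical error standing in for `MdbxColdError::Timeout`.
#[derive(Debug, PartialEq)]
enum ScanError {
    Timeout,
}

/// Sketch of an iterator read with a per-item deadline check: instead of
/// bounding only the whole call, the deadline is consulted on every item,
/// so one receipt with thousands of matching logs cannot run unbounded.
fn collect_matching(
    logs: impl IntoIterator<Item = u64>,
    matches: impl Fn(u64) -> bool,
    deadline: Instant,
) -> Result<Vec<u64>, ScanError> {
    let mut out = Vec::new();
    for log in logs {
        // Per-log check: bail out as soon as the deadline passes.
        if Instant::now() >= deadline {
            return Err(ScanError::Timeout);
        }
        if matches(log) {
            out.push(log);
        }
    }
    Ok(out)
}

fn main() {
    let generous = Instant::now() + Duration::from_secs(5);
    let ok = collect_matching(0..100, |l| l % 7 == 0, generous);
    assert_eq!(ok.unwrap().len(), 15); // 0, 7, ..., 98

    let expired = Instant::now() - Duration::from_millis(1);
    let err = collect_matching(0..100, |l| l % 7 == 0, expired);
    assert_eq!(err, Err(ScanError::Timeout));
}
```

The cost is one `Instant::now()` per item; the benefit is that the deadline binds even when a single item's inner loop dominates the scan, which is the failure shape the commit describes.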
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(storage): align unified::drain_above doc with silent-swallow impl

  The PR #58 doc rewrite advertised a `Cold` error path, but the impl collapses every cold error into `Vec::new()`. Update the doc + comment to admit the silent-swallow behaviour. ENG-2210 tracks the propagation decision.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold-sql): map PG statement_timeout to DeadlineExceeded

  SQLSTATE 57014 (query_canceled, emitted on `statement_timeout` expiry) was wrapped as `Sqlx(...)` and surfaced to the handle as `Backend(...)`, breaking symmetry with the MDBX backend (which routes its `Timeout` to `ColdStorageError::DeadlineExceeded`). The metric `cold.op_errors_total{error="backend"}` therefore conflated "query too slow" with "backend down".

  `From<sqlx::Error> for SqlColdError` now detects 57014 and produces a dedicated `Timeout` variant; `From<SqlColdError> for ColdStorageError` maps it to `DeadlineExceeded`. The configured deadline is not threaded to this conversion boundary, so the surfaced duration is `ZERO`; threading the real value is a separate refactor (left for a follow-up once the call sites are confirmed to need it).

  The `pg_statement_timeout` test is rewritten to match on the typed variant rather than a substring of the error message — a future refactor that drops 57014 detection now fails the test instead of silently passing.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cold): hoist write SLO and stream-setup timeout into the handle

  Adds two accessors to the `ColdStorageBackend` trait:

      fn read_timeout(&self) -> Option<Duration> { None }
      fn write_timeout(&self) -> Option<Duration> { None }

  Wired through `MdbxColdBackend`, `SqlColdBackend`, and `EitherCold`. `MemColdBackend` returns `None` (already-documented test exemption). Two behaviour changes use these: 1.
  The advisory write-SLO WARN moves from the MDBX backend (`warn_on_overrun` per-method) to `ColdStorage::spawn_write`. Timing is now captured before `write_sem` acquisition, so the elapsed value covers the queue wait, the read drain, and the commit end-to-end. The failure shape that wedged production at #56 — slow readers gating writes — now surfaces as a write-SLO violation rather than as a sub-threshold backend timing.

  2. `stream_logs`'s setup `get_latest_block` is wrapped in `tokio::time::timeout(backend.read_timeout(), ...)`. Without this, a stuck point lookup (cold MDBX page) or a saturated PG pool parking on `acquire_timeout` could pin N concurrent setup callers indefinitely with no permit cap. The setup read still bypasses `read_sem` and the drain barrier by design.

  Also drops the now-unused `tracing` dep from `signet-cold-mdbx` and updates the type docs to point at the handle's new WARN path.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold): reject zero timeouts; log JoinError panics; misc nits

  Builders for `read_timeout` / `write_timeout` on both the MDBX and SQL backends and connectors now panic on zero. Postgres treats `statement_timeout = 0` as "no timeout", so a caller passing `Duration::ZERO` (or computing one from a config that defaults to zero) would silently disable the trait-level mandatory-timeout contract. MDBX accepts the same assert for symmetry — zero there is a useless config rather than a silent disable, but the trait says non-zero and the assert keeps the surface honest.

  `spawn_read` / `spawn_write` now log spawned-task `JoinError`s before mapping to `TaskTerminated`. A backend panic was previously indistinguishable from graceful shutdown for the on-call: panics fire ERROR with the panic message, cancellations fire DEBUG. The error variant still collapses to `TaskTerminated` per design.

  `MdbxColdBackend::get_logs_inner` now checks the deadline inside the inner per-log loop, mirroring the streaming path.
  Previously a single receipt with thousands of matching logs would iterate unchecked past the configured `read_timeout`. The two `std::time::Instant::now()` sites in `produce_log_stream_blocking` are also folded into the already-imported `Instant`.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(cold): stream_logs setup fails fast on hung get_latest_block

  Pins the new setup-timeout behaviour against regression. The test parks `GatedBackend::get_latest_block` indefinitely and asserts that `stream_logs` (with no `to_block` on the filter, forcing the "resolve to=latest" path) returns `DeadlineExceeded` within the configured 50 ms `read_timeout` rather than hanging. Adds `GatedBackend::with_read_timeout` so tests can advertise a custom read timeout to the handle.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
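The SQLSTATE routing in the cold-sql commit above can be sketched with stand-in types. The real code converts from `sqlx::Error`; everything here — the enums, the `classify_sqlstate` helper, the field shapes — is illustrative, not the crate's actual API:

```rust
use std::time::Duration;

/// Hypothetical stand-ins for the backend and handle error types.
#[derive(Debug, PartialEq)]
enum SqlColdError {
    Timeout,
    Other(String),
}

#[derive(Debug, PartialEq)]
enum ColdStorageError {
    DeadlineExceeded(Duration),
    Backend(String),
}

/// Sketch of the SQLSTATE routing: Postgres reports statement_timeout
/// expiry as SQLSTATE 57014 (query_canceled); detect it before falling
/// back to the generic wrapper.
fn classify_sqlstate(code: Option<&str>, message: &str) -> SqlColdError {
    match code {
        Some("57014") => SqlColdError::Timeout,
        _ => SqlColdError::Other(message.to_string()),
    }
}

impl From<SqlColdError> for ColdStorageError {
    fn from(e: SqlColdError) -> Self {
        match e {
            // The configured deadline is not threaded this far, so the
            // surfaced duration is zero (as noted in the commit message).
            SqlColdError::Timeout => ColdStorageError::DeadlineExceeded(Duration::ZERO),
            SqlColdError::Other(m) => ColdStorageError::Backend(m),
        }
    }
}

fn main() {
    let e = classify_sqlstate(Some("57014"), "canceling statement due to statement timeout");
    assert_eq!(e, SqlColdError::Timeout);
    assert_eq!(
        ColdStorageError::from(e),
        ColdStorageError::DeadlineExceeded(Duration::ZERO)
    );

    let other = classify_sqlstate(Some("53300"), "too many connections");
    assert!(matches!(ColdStorageError::from(other), ColdStorageError::Backend(_)));
}
```

Matching on the SQLSTATE code rather than a message substring is what makes the typed-variant test in that commit robust: error text is locale- and version-dependent, the five-character code is part of the SQL standard.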
Summary
Extends PR #56's semaphore-based fix with a broader refactor that addresses three concerns surfaced in review:
`acquire_many_owned(64)`, and wedges shutdown — same class of bug as the original, just a narrower failure envelope.

Design
See `docs/superpowers/specs/2026-04-16-cold-read-write-permit-refactor-design.md` on the design discussion thread (PR #56 comment).

**Permit-attached messages**
`ColdStorageHandle` acquires a semaphore permit before sending; the permit travels in `PermittedReadRequest` and is released when the spawned handler's future drops. One semaphore is now the only backpressure mechanism; the read channel is sized to match the permit count, so `try_send` on the handle side is infallible (modulo shutdown).

**Split task runner**
`run_dispatcher` pulls `PermittedReadRequest`s and spawns handlers. `run_writer` consumes writes sequentially, drains via `acquire_many_owned(64)` wrapped in a cancel-select, then executes the write. The dispatcher runs continuously, so permits attached to queued messages never strand during a drain.

**Per-request deadline**
`ColdStorageTask::with_read_deadline(Duration)` (default 5s) wraps each non-stream handler in `tokio::time::timeout`. On expiry the caller receives `ColdStorageError::Timeout`, a WARN is emitted with the operation variant, and the permit returns to the pool.

Tests
`crates/cold/tests/concurrency.rs` expanded with a `GatedBackend` helper that blocks every read call on a test-controlled semaphore:

- `reads_above_concurrency_cap_do_not_deadlock` (carried over)
- `write_after_saturating_reads_makes_progress` (carried over)
- `fairness_write_serves_before_later_readers` (new) — verifies tokio FIFO fairness keeps the writer ahead of later readers
- `cancel_during_reader_backpressure_shuts_down` (new)
- `cancel_during_write_drain_shuts_down` (new) — would fail without the cancel-select on the writer's drain
- `operation_deadline_releases_permit` (new) — verifies `Timeout` is returned and the permit rejoins the pool

Behavioral note
`UnifiedStorage::append_blocks` dispatches to cold asynchronously. With dispatcher and writer now on separate subtasks, there is no biased ordering between a fire-and-forget write and a subsequent read. Production code at `components/crates/node-tests/src/context.rs:380-393` already polls for cold to catch up; two in-repo unit tests (`append_and_read_back`, `drain_above_empty_when_at_tip`) were implicitly relying on the old biased-select ordering and are updated to use the same polling pattern.

Test plan
- `cargo test -p signet-cold` (conformance + 6 concurrency tests)
- `cargo test --workspace`
- `cargo +nightly fmt -- --check`
- `cargo clippy --workspace --all-targets --all-features -- -D warnings`
- `cargo clippy --workspace --all-targets --no-default-features -- -D warnings`
- `RUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-deps`
- `signet-cold` + `signet-storage`: bump `init4tech/node-components`, rebuild `signet-sidecar:latest`, redeploy to devmainnet, confirm no backpressure-induced crashes over a full day.

🤖 Generated with Claude Code