
refactor(cold): permit-attached reads, dispatcher/writer split, 5s operation deadline #57

Closed
prestwich wants to merge 2 commits into swanny/fix-cold-reader-deadlock from prestwich/cold-permit-refactor

Conversation

@prestwich
Member

Summary

Extends PR #56's semaphore-based fix with a broader refactor that addresses three concerns surfaced in review:

  1. Write-drain is not cancellable. A single stuck reader holds its permit, blocks acquire_many_owned(64), and wedges shutdown — same class of bug as the original, just a narrower failure envelope.
  2. Two backpressure mechanisms (256-slot mpsc + 64-permit semaphore) overlap without a clear division of responsibility. The channel buffer is 4× the in-flight cap.
  3. No handler-side operation deadline. At the 12s write cadence, a single stuck reader extends every drain and eventually re-fills the write mpsc — regenerating the exact production failure that PR #56 (fix(cold): replace read-arm TaskTracker backpressure with Semaphore) fixes.

Design

See docs/superpowers/specs/2026-04-16-cold-read-write-permit-refactor-design.md on the design discussion thread (PR #56 comment).

Permit-attached messages. ColdStorageHandle acquires a semaphore permit before sending; the permit travels in PermittedReadRequest and is released when the spawned handler's future drops. One semaphore is now the only backpressure mechanism; the read channel is sized to match permit count so try_send on the handle side is infallible (modulo shutdown).

Split task runner. run_dispatcher pulls PermittedReadRequests and spawns handlers. run_writer consumes writes sequentially, drains via acquire_many_owned(64) wrapped in a cancel-select, then executes the write. Dispatcher runs continuously so permits attached to queued messages never strand during drain.

Per-request deadline. ColdStorageTask::with_read_deadline(Duration) (default 5s) wraps each non-stream handler in tokio::time::timeout. On expiry the caller receives ColdStorageError::Timeout, a WARN is emitted with the operation variant, and the permit returns to the pool.

Tests

crates/cold/tests/concurrency.rs expanded with a GatedBackend helper that blocks every read call on a test-controlled semaphore:

  • reads_above_concurrency_cap_do_not_deadlock (carried over)
  • write_after_saturating_reads_makes_progress (carried over)
  • fairness_write_serves_before_later_readers (new) — verifies tokio FIFO fairness keeps the writer ahead of later readers
  • cancel_during_reader_backpressure_shuts_down (new)
  • cancel_during_write_drain_shuts_down (new) — would fail without the cancel-select on the writer's drain
  • operation_deadline_releases_permit (new) — verifies Timeout is returned and the permit rejoins the pool

Behavioral note

UnifiedStorage::append_blocks dispatches to cold asynchronously. With dispatcher and writer now on separate subtasks, there is no biased ordering between a fire-and-forget write and a subsequent read. Production code at components/crates/node-tests/src/context.rs:380-393 already polls for cold to catch up; two in-repo unit tests (append_and_read_back, drain_above_empty_when_at_tip) were implicitly relying on the old biased-select ordering and are updated to use the same polling pattern.

Test plan

  • cargo test -p signet-cold (conformance + 6 concurrency tests)
  • cargo test --workspace
  • cargo +nightly fmt -- --check
  • cargo clippy --workspace --all-targets --all-features -- -D warnings
  • cargo clippy --workspace --all-targets --no-default-features -- -D warnings
  • RUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-deps
  • Reviewer spot-check: cut a patch release of signet-cold + signet-storage, bump init4tech/node-components, rebuild signet-sidecar:latest, redeploy to dev mainnet, confirm no backpressure-induced crashes over a full day.

🤖 Generated with Claude Code

prestwich and others added 2 commits April 17, 2026 00:09
Moves semaphore permit acquisition to `ColdStorageHandle` so permits
travel with read requests into the channel. The task runner splits
into two concurrent subtasks:

- **Dispatcher**: pulls `PermittedReadRequest`s and spawns handlers,
  wrapping each in a per-request deadline (default 5s).
- **Writer**: consumes writes sequentially. Drain-before-write uses
  `Semaphore::acquire_many_owned(64)`, now wrapped in a cancel-select
  so shutdown cannot hang on a stuck reader.

The semaphore is now the single backpressure mechanism. The read
channel is sized to match permit count, so `try_send` from a caller
holding a permit is guaranteed to have capacity.

New `ColdStorageError::Timeout` is returned to callers whose handler
exceeds the deadline; dropping the handler future releases its permit
back to the pool, so a stuck backend call self-heals.

Tests (`crates/cold/tests/concurrency.rs`) add a `GatedBackend`
helper and four new regression cases: fairness under saturation,
cancel during reader backpressure, cancel during write drain, and
operation-deadline permit release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`UnifiedStorage::append_blocks` dispatches to cold asynchronously.
With the cold task's dispatcher and writer now running on separate
subtasks, there is no biased ordering between a fire-and-forget
write and a subsequent read — the tests that assumed one were
relying on an implementation detail that production code already
polls around (see `components/crates/node-tests/src/context.rs`).

Add a `wait_for_cold_height` helper matching the production pattern
and use it in the two tests that issued a read immediately after
`append_blocks`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@prestwich prestwich requested review from Evalir and Fraser999 April 17, 2026 04:15
@prestwich
Member Author

[Claude Code]

@rswanson tagging you for visibility — this stacks on top of #56 and would replace its diff if it lands.

@Fraser999 @Evalir requesting review.

@Evalir
Member

Evalir commented Apr 17, 2026

[Claude Code]

Second-pass review surfaced a deadlock hazard that I'd like to flag before this merges.

Critical: StreamLogs setup holds the outer read-semaphore permit across unbounded awaits

The exemption at runner.rs:236-242 skips run_with_deadline for the entire handle_stream_logs call, not just the spawned producer. The dispatcher-spawned task still holds its outer permit for the full duration, and inside handle_stream_logs there are two awaits with no bound visible to the writer's drain:

  1. stream_semaphore.acquire_owned().await (runner.rs:264-269) — blocks when all 8 stream slots are held; upper bound is the existing stream's remaining deadline, up to max_stream_deadline = 60s.
  2. self.read_backend.get_latest_block().await (runner.rs:275) — unbounded backend call, no deadline applied.

Throughout both, acquire_many_owned(64) on the writer side is waiting on that permit.

Scenario A — stream saturation. 8 long-running streams in flight; a 9th stream_logs call parks in handle_stream_logs for up to 60s while holding its outer permit. Writer drain blocks for the same window. At the 12s write cadence, writes stack in the 256-slot mpsc; not a permanent wedge, but enough to cause visible backpressure under a steady state of 8 saturated streams.

Scenario B — backend hang on get_latest_block. A single stream_logs call with to_block = None hits a stuck pool connection. No deadline applies. Outer permit held indefinitely. Writer drain never completes. Write mpsc fills at 12s cadence → Backpressure errors → the original production crash, verbatim. The deadline doc comment explicitly sizes the 5s deadline "well below the 12s write cadence at which a stuck reader would otherwise repressurize the write mpsc" — this path is the one that can exceed it.

Scenario B is the sharper concern, especially with the SQL backend index paths from #54 now live.

Fix options (cheapest first)

  1. Wrap handle_stream_logs in tokio::time::timeout(read_deadline, …). One-line bound on the setup phase. The producer runs inside stream_tracker.spawn, so it's already decoupled from the outer permit and untouched by this change.
  2. Timeout only the inner get_latest_block inside handle_stream_logs. Addresses (B) only. Marginal win over (1).
  3. Don't take an outer permit for StreamLogs at all. Streams are explicitly exempt from the drain-before-write invariant, so consuming the drain barrier for stream setup is incidental and harmful. Structural fix: either detect StreamLogs in the dispatcher and drop the permit before dispatching, or split the handle-side path so stream_logs uses stream_semaphore alone.

(1) is the lowest-risk ship-blocker fix. (3) is the architecturally correct answer if a follow-up refactor is acceptable.

Secondary: shutdown time bounded by the longest active stream

run() awaits task_tracker.wait() before closing stream_tracker. A StreamLogs handler parked in handle_stream_logs at cancellation time blocks shutdown until it returns naturally — up to 60s. The option-1 fix above also bounds this.

Re-verified minor items from the previous pass

  • Handle-side acquire_owned after shutdown still waits for in-flight readers to release before reporting TaskTerminated. Calling read_semaphore.close() at the end of run() would make the failure immediate.
  • Backend cancel-safety on deadline expiry — still worth soak-validating on the SQL backend.
  • Permit-attached-message invariant, cancel-select on acquire_many_owned, and FIFO drain fairness all hold on re-trace.

The permit/drain machinery itself is well-designed — this is specifically about the StreamLogs path not having been updated for the new invariant.

Contributor

@Fraser999 Fraser999 left a comment


As well as @Evalir's point, I think this PR doesn't address the issues/suggestions in #56 (comment).

@prestwich
Member Author

[Claude Code]

Superseded by #58, which takes the full-rethink approach (Option C from the review-response brainstorming) rather than the piecewise fixes here. Closing in favor of the unified handle architecture tracked by ENG-2198.

Key differences in #58:

  • Single ColdStorage<B: ColdStorageBackend> handle over Arc<Inner<B>> — no more channels, dispatcher, or writer subtask.
  • All three semaphores (read_sem, write_sem, stream_sem) acquired in the handle; permits held for the real duration of the backend call.
  • Streams acquire only stream_sem (no read permit).
  • Backend timeouts are mandatory in the trait contract (500ms/2s defaults, builder-configurable).
  • Timeout, Backpressure, Cancelled error variants removed.

@prestwich prestwich closed this Apr 21, 2026
prestwich added a commit that referenced this pull request May 4, 2026
* refactor(cold): ColdStorageWrite takes &self; all backends updated in lockstep

* refactor(cold): unify handle around Arc<Inner>; remove channels and dispatcher

* fix(cold): stream permit acquired in handle; streams do not hold a read permit

* feat(cold): drain barrier moves to handle write path

* feat(cold): shutdown coordinator closes semaphores on cancel

* refactor(cold-mdbx): spawn_blocking reads, block_in_place writes, in-body iterator deadline

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cold-sql): mandatory statement_timeout; read_timeout and write_timeout builders

* feat(cold): metrics and tracing spans across all operations

Adds a `metrics` module under `crates/cold/src/metrics.rs` with const
metric names, help strings, a `LazyLock` describe block, and
`pub(crate)` helper functions for recording:

- `cold.reads_in_flight`, `cold.writes_in_flight`, `cold.streams_active`
  (gauges)
- `cold.op_duration_us` (histogram, labeled by op)
- `cold.permit_wait_us` (histogram, labeled by sem: read/write/drain/stream)
- `cold.op_errors_total` (counter, labeled by op and error kind)
- `cold.stream_lifetime_ms` (histogram)

Wires the helpers into every `ColdStorage<B>` handle method:
`spawn_read` and `spawn_write` time permit acquisition, bump in-flight,
measure op duration, record errors, and dec in-flight after the backend
call. Cache hits in `get_header`/`get_transaction`/`get_receipt` record
op duration only (no permit wait, no in-flight). `stream_logs`
instruments stream permit wait and records stream lifetime + gauge in
the spawned producer.

Adds `ColdStorageError::kind()` for the error metric label.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(cold): trait impl guide documents mandatory timeouts

* test(cold): concurrency suite covers new architecture

* fix(cold): shutdown coordinator holds Weak<Inner>, not Arc

The coordinator task previously moved Arc<Inner<B>> into its body and
awaited the user's cancel token. If callers dropped all ColdStorage
clones without firing cancel, Inner (and the backend's file/DB handles)
stayed pinned until process exit.

Switch the coordinator to Weak<Inner>, and put a DropGuard on Inner that
fires a child cancel token. shutdown now fires on either user-side
cancel OR Inner drop; in the drop case upgrade() returns None and the
coordinator exits without pinning anything.
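The Weak-upgrade pattern described here is plain `std` behaviour; a minimal sketch, with `Inner` stubbed down to a single field:

```rust
use std::sync::{Arc, Weak};

/// Stub of the shared state; the real Inner owns semaphores and the backend.
struct Inner {
    name: &'static str,
}

fn main() {
    let inner = Arc::new(Inner { name: "cold" });

    // The coordinator task holds only a Weak, so it cannot pin Inner alive.
    let coordinator_view: Weak<Inner> = Arc::downgrade(&inner);
    assert_eq!(coordinator_view.upgrade().unwrap().name, "cold");

    // All ColdStorage clones dropped without firing cancel ...
    drop(inner);

    // ... upgrade() now returns None and the coordinator simply exits,
    // instead of keeping file/DB handles open until process exit.
    assert!(coordinator_view.upgrade().is_none());
    println!("coordinator observed Inner drop via Weak");
}
```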

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold-mdbx): preserve TooManyLogs via From impl, not backend wrapper

ColdStorageError::backend unconditionally wraps as Backend(Box<_>),
which hid MdbxColdError::TooManyLogs behind the generic backend variant
and broke the conformance suite's max_logs assertion. The
From<MdbxColdError> for ColdStorageError impl already translates
TooManyLogs correctly and wraps the rest. Route all spawn_blocking
result conversions through ::from so the translation runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold): address review on permits, gauges, errors, cache

- stream_logs resolves `to` (and get_latest_block fallback) before
  acquiring stream_sem, so a stuck backend no longer pins all 8 permits
  across setup I/O.
- In-flight gauges are now maintained by an InFlightGuard RAII wrapper
  so the decrement survives a panic in the spawned body; previously a
  panic left cold.reads_in_flight / writes_in_flight / streams_active
  drifting up and poisoning the Prometheus alert signal.
- Promote timeout to a first-class ColdStorageError::DeadlineExceeded
  variant. MDBX Timeout now routes through it (not Backend), and
  downstream callers can match without downcasting. Fixes stale
  Backpressure references in the cold and storage READMEs and the
  signet-storage skill doc.
- ColdCache switches from tokio::sync::Mutex to parking_lot::Mutex.
  The cache only ever holds the lock across synchronous LRU ops, so
  the async mutex's yield-on-lock was pure overhead.
- MemColdBackend now explicitly documents its exemption from the
  trait's mandatory-timeout contract.
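The gauge-guard fix above can be sketched with std atomics — a toy version where a static stands in for the real metrics recorder:

```rust
use std::sync::atomic::{AtomicI64, Ordering};

static READS_IN_FLIGHT: AtomicI64 = AtomicI64::new(0);

/// RAII sketch of InFlightGuard: increment on creation, decrement on
/// drop, so the gauge is restored even if the guarded body panics.
struct InFlightGuard;

impl InFlightGuard {
    fn new() -> Self {
        READS_IN_FLIGHT.fetch_add(1, Ordering::Relaxed);
        InFlightGuard
    }
}

impl Drop for InFlightGuard {
    fn drop(&mut self) {
        READS_IN_FLIGHT.fetch_sub(1, Ordering::Relaxed);
    }
}

fn main() {
    // Silence the default panic message for the deliberate panic below.
    std::panic::set_hook(Box::new(|_| {}));

    let result = std::panic::catch_unwind(|| {
        let _guard = InFlightGuard::new();
        panic!("backend panicked mid-read");
    });
    assert!(result.is_err());

    // Without the guard, the pre-fix code would leave the gauge at 1 forever.
    assert_eq!(READS_IN_FLIGHT.load(Ordering::Relaxed), 0);
    println!("gauge survived a panic without drifting");
}
```

The key property is that the decrement lives in `Drop`, which runs during unwinding; a plain increment/decrement pair around the body does not survive a panic between the two calls.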

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold-mdbx): spawn_blocking writes, per-item deadlines, docs

- Writes (append_block, append_blocks, truncate_above, drain_above)
  now use tokio::task::spawn_blocking instead of block_in_place.
  block_in_place panics on a current_thread runtime, so any consumer
  wiring MdbxColdBackend into a single-threaded Tokio would hit the
  first write and crash. Added writes_work_on_current_thread_runtime
  regression test.
- Overrun WARN fires only on successful writes. A failed write that
  took > 2 s already surfaces Backend(...) to the caller; a spurious
  advisory-write-timeout WARN on the error path would poison any SLO
  alert built on that signal.
- Iterator reads gained inner-loop deadline checks: per-receipt in
  get_logs_inner, per-event in collect_signet_events_in_range,
  per-receipt + per-log in produce_log_stream_blocking. A block with
  many matching logs (or a slow stream consumer) can no longer run
  unbounded past the configured deadline.
- MdbxColdError::Timeout now maps to ColdStorageError::DeadlineExceeded
  (new variant) instead of Backend. Updated the existing timeout test
  to match on the variant directly.
- Documented the point-lookup timeout exemption: MDBX page I/O on
  cold pages can stall arbitrarily, and the handle does not wrap point
  lookups in a tokio::time::timeout, so a stuck lookup ties up a
  spawn_blocking worker AND a read_sem permit. Callers that need
  fail-fast behavior should wrap at the call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(storage): align unified::drain_above doc with silent-swallow impl

The PR-#58 doc rewrite advertised a `Cold` error path, but the impl
collapses every cold error into `Vec::new()`. Update the doc + comment
to admit silent-swallow behaviour. ENG-2210 tracks the propagation
decision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold-sql): map PG statement_timeout to DeadlineExceeded

SQLSTATE 57014 (query_canceled, emitted on `statement_timeout` expiry)
was wrapped as `Sqlx(...)` and surfaced to the handle as `Backend(...)`,
breaking symmetry with the MDBX backend (which routes its `Timeout` to
`ColdStorageError::DeadlineExceeded`). The metric
`cold.op_errors_total{error="backend"}` therefore conflated "query too
slow" with "backend down".

`From<sqlx::Error> for SqlColdError` now detects 57014 and produces a
dedicated `Timeout` variant; `From<SqlColdError> for ColdStorageError`
maps it to `DeadlineExceeded`. The configured deadline is not threaded
to this conversion boundary so the surfaced duration is `ZERO`;
threading the real value is a separate refactor (left for a follow-up
once the call sites are confirmed to need it).

The `pg_statement_timeout` test is rewritten to match on the typed
variant rather than a substring of the error message — a future
refactor that drops 57014 detection now fails the test instead of
silently passing.
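The two-hop mapping can be sketched without pulling in sqlx, modelling the SQLSTATE as a plain string; both enums are stubbed down to the variants discussed here, and `sql_error_from_sqlstate` is a hypothetical stand-in for the real `From<sqlx::Error>` impl:

```rust
/// Stubs of the two error layers; the real enums carry more variants.
#[derive(Debug, PartialEq)]
enum SqlColdError {
    Timeout,
    Other(String),
}

#[derive(Debug, PartialEq)]
enum ColdStorageError {
    DeadlineExceeded,
    Backend(String),
}

/// First hop: SQLSTATE 57014 (query_canceled, raised when
/// statement_timeout expires) becomes a dedicated Timeout variant.
fn sql_error_from_sqlstate(code: &str, msg: &str) -> SqlColdError {
    if code == "57014" {
        SqlColdError::Timeout
    } else {
        SqlColdError::Other(msg.to_string())
    }
}

/// Second hop: the typed Timeout maps to DeadlineExceeded at the handle
/// boundary, keeping symmetry with the MDBX backend.
impl From<SqlColdError> for ColdStorageError {
    fn from(e: SqlColdError) -> Self {
        match e {
            SqlColdError::Timeout => ColdStorageError::DeadlineExceeded,
            SqlColdError::Other(m) => ColdStorageError::Backend(m),
        }
    }
}

fn main() {
    let slow = sql_error_from_sqlstate("57014", "canceling statement due to statement timeout");
    assert_eq!(ColdStorageError::from(slow), ColdStorageError::DeadlineExceeded);

    let down = sql_error_from_sqlstate("08006", "connection failure");
    assert_eq!(
        ColdStorageError::from(down),
        ColdStorageError::Backend("connection failure".to_string())
    );
    println!("query-too-slow and backend-down no longer conflated");
}
```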

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cold): hoist write SLO and stream-setup timeout into the handle

Adds two accessors to the `ColdStorageBackend` trait:

    fn read_timeout(&self) -> Option<Duration> { None }
    fn write_timeout(&self) -> Option<Duration> { None }

Wired through `MdbxColdBackend`, `SqlColdBackend`, and `EitherCold`.
`MemColdBackend` returns `None` (already-documented test exemption).

Two behaviour changes use these:

1. The advisory write-SLO WARN moves from the MDBX backend
   (`warn_on_overrun` per-method) to `ColdStorage::spawn_write`. Timing
   is now captured before `write_sem` acquisition, so the elapsed
   value covers the queue wait, the read drain, and the commit
   end-to-end. The failure shape that wedged production at #56 — slow
   readers gating writes — now surfaces as a write-SLO violation
   rather than as a sub-threshold backend timing.

2. `stream_logs`'s setup `get_latest_block` is wrapped in
   `tokio::time::timeout(backend.read_timeout(), ...)`. Without this,
   a stuck point lookup (cold MDBX page) or a saturated PG pool
   parking on `acquire_timeout` could pin N concurrent setup callers
   indefinitely with no permit cap. The setup read still bypasses
   `read_sem` and the drain barrier by design.

Also drops the now-unused `tracing` dep from `signet-cold-mdbx` and
updates the type docs to point at the handle's new WARN path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold): reject zero timeouts; log JoinError panics; misc nits

Builders for `read_timeout` / `write_timeout` on both the MDBX and SQL
backends and connectors now panic on zero. Postgres treats
`statement_timeout = 0` as "no timeout", so a caller passing
`Duration::ZERO` (or computing one from a config that defaults to
zero) would silently disable the trait-level mandatory-timeout
contract. MDBX accepts the same assert for symmetry — zero there is a
useless config rather than a silent disable, but the trait says
non-zero and the assert keeps the surface honest.

`spawn_read` / `spawn_write` now log spawned-task `JoinError`s before
mapping to `TaskTerminated`. A backend panic was previously
indistinguishable from graceful shutdown for the on-call: panics fire
ERROR with the panic message, cancellations fire DEBUG. The error
variant still collapses to `TaskTerminated` per design.

`MdbxColdBackend::get_logs_inner` now checks the deadline inside the
inner per-log loop, mirroring the streaming path. Previously a single
receipt with thousands of matching logs would iterate unchecked past
the configured `read_timeout`. The two `std::time::Instant::now()`
sites in `produce_log_stream_blocking` are also folded into the
already-imported `Instant`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(cold): stream_logs setup fails fast on hung get_latest_block

Pins the new setup-timeout behaviour against regression. The test
parks `GatedBackend::get_latest_block` indefinitely and asserts that
`stream_logs` (with no `to_block` on the filter, forcing the
"resolve to=latest" path) returns `DeadlineExceeded` within the
configured 50 ms `read_timeout` rather than hanging.

Adds `GatedBackend::with_read_timeout` so tests can advertise a
custom read timeout to the handle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
