
refactor(cold): permit-attached reads, dispatcher/writer split, 5s operation deadline #57

Closed
prestwich wants to merge 2 commits into swanny/fix-cold-reader-deadlock from prestwich/cold-permit-refactor

Conversation

@prestwich
Member

Summary

Extends PR #56's semaphore-based fix with a broader refactor that addresses three concerns surfaced in review:

  1. Write-drain is not cancellable. A single stuck reader holds its permit, blocks acquire_many_owned(64), and wedges shutdown — same class of bug as the original, just a narrower failure envelope.
  2. Two backpressure mechanisms (256-slot mpsc + 64-permit semaphore) overlap without a clear division of responsibility. The channel buffer is 4× the in-flight cap.
  3. No handler-side operation deadline. At the 12s write cadence, a single stuck reader extends every drain and eventually re-fills the write mpsc — regenerating the exact production failure that PR #56 (fix(cold): replace read-arm TaskTracker backpressure with Semaphore) fixes.

Design

See docs/superpowers/specs/2026-04-16-cold-read-write-permit-refactor-design.md on the design discussion thread (PR #56 comment).

Permit-attached messages. ColdStorageHandle acquires a semaphore permit before sending; the permit travels in PermittedReadRequest and is released when the spawned handler's future drops. One semaphore is now the only backpressure mechanism; the read channel is sized to match permit count so try_send on the handle side is infallible (modulo shutdown).

Split task runner. run_dispatcher pulls PermittedReadRequests and spawns handlers. run_writer consumes writes sequentially, drains via acquire_many_owned(64) wrapped in a cancel-select, then executes the write. Dispatcher runs continuously so permits attached to queued messages never strand during drain.

Per-request deadline. ColdStorageTask::with_read_deadline(Duration) (default 5s) wraps each non-stream handler in tokio::time::timeout. On expiry the caller receives ColdStorageError::Timeout, a WARN is emitted with the operation variant, and the permit returns to the pool.

Tests

crates/cold/tests/concurrency.rs expanded with a GatedBackend helper that blocks every read call on a test-controlled semaphore:

  • reads_above_concurrency_cap_do_not_deadlock (carried over)
  • write_after_saturating_reads_makes_progress (carried over)
  • fairness_write_serves_before_later_readers (new) — verifies tokio FIFO fairness keeps the writer ahead of later readers
  • cancel_during_reader_backpressure_shuts_down (new)
  • cancel_during_write_drain_shuts_down (new) — would fail without the cancel-select on the writer's drain
  • operation_deadline_releases_permit (new) — verifies Timeout is returned and the permit rejoins the pool

Behavioral note

UnifiedStorage::append_blocks dispatches to cold asynchronously. With dispatcher and writer now on separate subtasks, there is no biased ordering between a fire-and-forget write and a subsequent read. Production code at components/crates/node-tests/src/context.rs:380-393 already polls for cold to catch up; two in-repo unit tests (append_and_read_back, drain_above_empty_when_at_tip) were implicitly relying on the old biased-select ordering and are updated to use the same polling pattern.

Test plan

  • cargo test -p signet-cold (conformance + 6 concurrency tests)
  • cargo test --workspace
  • cargo +nightly fmt -- --check
  • cargo clippy --workspace --all-targets --all-features -- -D warnings
  • cargo clippy --workspace --all-targets --no-default-features -- -D warnings
  • RUSTDOCFLAGS="-D warnings" cargo doc --workspace --no-deps
  • Reviewer spot-check: cut a patch release of signet-cold + signet-storage, bump init4tech/node-components, rebuild signet-sidecar:latest, redeploy to dev mainnet, confirm no backpressure-induced crashes over a full day.

🤖 Generated with Claude Code

prestwich and others added 2 commits April 17, 2026 00:09
Moves semaphore permit acquisition to `ColdStorageHandle` so permits
travel with read requests into the channel. The task runner splits
into two concurrent subtasks:

- **Dispatcher**: pulls `PermittedReadRequest`s and spawns handlers,
  wrapping each in a per-request deadline (default 5s).
- **Writer**: consumes writes sequentially. Drain-before-write uses
  `Semaphore::acquire_many_owned(64)`, now wrapped in a cancel-select
  so shutdown cannot hang on a stuck reader.

The semaphore is now the single backpressure mechanism. The read
channel is sized to match permit count, so `try_send` from a caller
holding a permit is guaranteed to have capacity.

New `ColdStorageError::Timeout` is returned to callers whose handler
exceeds the deadline; dropping the handler future releases its permit
back to the pool, so a stuck backend call self-heals.

Tests (`crates/cold/tests/concurrency.rs`) add a `GatedBackend`
helper and four new regression cases: fairness under saturation,
cancel during reader backpressure, cancel during write drain, and
operation-deadline permit release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`UnifiedStorage::append_blocks` dispatches to cold asynchronously.
With the cold task's dispatcher and writer now running on separate
subtasks, there is no biased ordering between a fire-and-forget
write and a subsequent read — the tests that assumed one were
relying on an implementation detail that production code already
polls around (see `components/crates/node-tests/src/context.rs`).

Add a `wait_for_cold_height` helper matching the production pattern
and use it in the two tests that issued a read immediately after
`append_blocks`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@prestwich prestwich requested review from Evalir and Fraser999 April 17, 2026 04:15
@prestwich
Member Author

[Claude Code]

@rswanson tagging you for visibility — this stacks on top of #56 and would replace its diff if it lands.

@Fraser999 @Evalir requesting review.

@Evalir
Member

Evalir commented Apr 17, 2026

[Claude Code]

Second-pass review surfaced a deadlock hazard that I'd like to flag before this merges.

Critical: StreamLogs setup holds the outer read-semaphore permit across unbounded awaits

The exemption at runner.rs:236-242 skips run_with_deadline for the entire handle_stream_logs call, not just the spawned producer. The dispatcher-spawned task still holds its outer permit for the full duration, and inside handle_stream_logs there are two awaits with no bound visible to the writer's drain:

  1. stream_semaphore.acquire_owned().await (runner.rs:264-269) — blocks when all 8 stream slots are held; upper bound is the existing stream's remaining deadline, up to max_stream_deadline = 60s.
  2. self.read_backend.get_latest_block().await (runner.rs:275) — unbounded backend call, no deadline applied.

Throughout both, acquire_many_owned(64) on the writer side is waiting on that permit.

Scenario A — stream saturation. 8 long-running streams in flight; a 9th stream_logs call parks in handle_stream_logs for up to 60s while holding its outer permit. Writer drain blocks for the same window. At the 12s write cadence, writes stack in the 256-slot mpsc; not a permanent wedge, but enough to cause visible backpressure under a steady state of 8 saturated streams.

Scenario B — backend hang on get_latest_block. A single stream_logs call with to_block = None hits a stuck pool connection. No deadline applies. Outer permit held indefinitely. Writer drain never completes. Write mpsc fills at 12s cadence → Backpressure errors → the original production crash, verbatim. The deadline doc comment explicitly sizes the 5s deadline "well below the 12s write cadence at which a stuck reader would otherwise repressurize the write mpsc" — this path is the one that can exceed it.

Scenario B is the sharper concern, especially with the SQL backend index paths from #54 now live.

Fix options (cheapest first)

  1. Wrap handle_stream_logs in tokio::time::timeout(read_deadline, …). One-line bound on the setup phase. The producer runs inside stream_tracker.spawn, so it's already decoupled from the outer permit and untouched by this change.
  2. Timeout only the inner get_latest_block inside handle_stream_logs. Addresses (B) only. Marginal win over (1).
  3. Don't take an outer permit for StreamLogs at all. Streams are explicitly exempt from the drain-before-write invariant, so consuming the drain barrier for stream setup is incidental and harmful. Structural fix: either detect StreamLogs in the dispatcher and drop the permit before dispatching, or split the handle-side path so stream_logs uses stream_semaphore alone.

(1) is the lowest-risk ship-blocker fix. (3) is the architecturally correct answer if a follow-up refactor is acceptable.

Secondary: shutdown time bounded by the longest active stream

run() awaits task_tracker.wait() before closing stream_tracker. A StreamLogs handler parked in handle_stream_logs at cancellation time blocks shutdown until it returns naturally — up to 60s. The option-1 fix above also bounds this.

Re-verified minor items from the previous pass

  • Handle-side acquire_owned after shutdown still waits for in-flight readers to release before reporting TaskTerminated. Calling read_semaphore.close() at the end of run() would make the failure immediate.
  • Backend cancel-safety on deadline expiry — still worth soak-validating on the SQL backend.
  • Permit-attached-message invariant, cancel-select on acquire_many_owned, and FIFO drain fairness all hold on re-trace.

The permit/drain machinery itself is well-designed — this is specifically about the StreamLogs path not having been updated for the new invariant.

Contributor

@Fraser999 Fraser999 left a comment


As well as @Evalir's point, I think this PR doesn't address the issues/suggestions in #56 (comment).

@prestwich
Member Author

[Claude Code]

Superseded by #58, which takes the full-rethink approach (Option C from the review-response brainstorming) rather than the piecewise fixes here. Closing in favor of the unified handle architecture tracked by ENG-2198.

Key differences in #58:

  • Single ColdStorage<B: ColdStorageBackend> handle over Arc<Inner<B>> — no more channels, dispatcher, or writer subtask.
  • All three semaphores (read_sem, write_sem, stream_sem) acquired in the handle; permits held for the real duration of the backend call.
  • Streams acquire only stream_sem (no read permit).
  • Backend timeouts are mandatory in the trait contract (500ms/2s defaults, builder-configurable).
  • Timeout, Backpressure, Cancelled error variants removed.

@prestwich prestwich closed this Apr 21, 2026
prestwich added a commit that referenced this pull request May 4, 2026
* refactor(cold): ColdStorageWrite takes &self; all backends updated in lockstep

* refactor(cold): unify handle around Arc<Inner>; remove channels and dispatcher

* fix(cold): stream permit acquired in handle; streams do not hold a read permit

* feat(cold): drain barrier moves to handle write path

* feat(cold): shutdown coordinator closes semaphores on cancel

* refactor(cold-mdbx): spawn_blocking reads, block_in_place writes, in-body iterator deadline

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cold-sql): mandatory statement_timeout; read_timeout and write_timeout builders

* feat(cold): metrics and tracing spans across all operations

Adds a `metrics` module under `crates/cold/src/metrics.rs` with const
metric names, help strings, a `LazyLock` describe block, and
`pub(crate)` helper functions for recording:

- `cold.reads_in_flight`, `cold.writes_in_flight`, `cold.streams_active`
  (gauges)
- `cold.op_duration_us` (histogram, labeled by op)
- `cold.permit_wait_us` (histogram, labeled by sem: read/write/drain/stream)
- `cold.op_errors_total` (counter, labeled by op and error kind)
- `cold.stream_lifetime_ms` (histogram)

Wires the helpers into every `ColdStorage<B>` handle method:
`spawn_read` and `spawn_write` time permit acquisition, bump in-flight,
measure op duration, record errors, and dec in-flight after the backend
call. Cache hits in `get_header`/`get_transaction`/`get_receipt` record
op duration only (no permit wait, no in-flight). `stream_logs`
instruments stream permit wait and records stream lifetime + gauge in
the spawned producer.

Adds `ColdStorageError::kind()` for the error metric label.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(cold): trait impl guide documents mandatory timeouts

* test(cold): concurrency suite covers new architecture

* fix(cold): shutdown coordinator holds Weak<Inner>, not Arc

The coordinator task previously moved Arc<Inner<B>> into its body and
awaited the user's cancel token. If callers dropped all ColdStorage
clones without firing cancel, Inner (and the backend's file/DB handles)
stayed pinned until process exit.

Switch the coordinator to Weak<Inner>, and put a DropGuard on Inner that
fires a child cancel token. shutdown now fires on either user-side
cancel OR Inner drop; in the drop case upgrade() returns None and the
coordinator exits without pinning anything.
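The Weak-upgrade pattern described here is plain `std` behaviour; a minimal sketch, with `Inner` stubbed down to a single field:

```rust
use std::sync::{Arc, Weak};

/// Stub of the shared state; the real Inner owns semaphores and the backend.
struct Inner {
    name: &'static str,
}

fn main() {
    let inner = Arc::new(Inner { name: "cold" });

    // The coordinator task holds only a Weak, so it cannot pin Inner alive.
    let coordinator_view: Weak<Inner> = Arc::downgrade(&inner);
    assert_eq!(coordinator_view.upgrade().unwrap().name, "cold");

    // All ColdStorage clones dropped without firing cancel ...
    drop(inner);

    // ... upgrade() now returns None and the coordinator simply exits,
    // instead of keeping file/DB handles open until process exit.
    assert!(coordinator_view.upgrade().is_none());
    println!("coordinator observed Inner drop via Weak");
}
```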

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold-mdbx): preserve TooManyLogs via From impl, not backend wrapper

ColdStorageError::backend unconditionally wraps as Backend(Box<_>),
which hid MdbxColdError::TooManyLogs behind the generic backend variant
and broke the conformance suite's max_logs assertion. The
From<MdbxColdError> for ColdStorageError impl already translates
TooManyLogs correctly and wraps the rest. Route all spawn_blocking
result conversions through ::from so the translation runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold): address review on permits, gauges, errors, cache

- stream_logs resolves `to` (and get_latest_block fallback) before
  acquiring stream_sem, so a stuck backend no longer pins all 8 permits
  across setup I/O.
- In-flight gauges are now maintained by an InFlightGuard RAII wrapper
  so the decrement survives a panic in the spawned body; previously a
  panic left cold.reads_in_flight / writes_in_flight / streams_active
  drifting up and poisoning the Prometheus alert signal.
- Promote timeout to a first-class ColdStorageError::DeadlineExceeded
  variant. MDBX Timeout now routes through it (not Backend), and
  downstream callers can match without downcasting. Fixes stale
  Backpressure references in the cold and storage READMEs and the
  signet-storage skill doc.
- ColdCache switches from tokio::sync::Mutex to parking_lot::Mutex.
  The cache only ever holds the lock across synchronous LRU ops, so
  the async mutex's yield-on-lock was pure overhead.
- MemColdBackend now explicitly documents its exemption from the
  trait's mandatory-timeout contract.
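The gauge-guard fix above can be sketched with std atomics — a toy version where a static stands in for the real metrics recorder:

```rust
use std::sync::atomic::{AtomicI64, Ordering};

static READS_IN_FLIGHT: AtomicI64 = AtomicI64::new(0);

/// RAII sketch of InFlightGuard: increment on creation, decrement on
/// drop, so the gauge is restored even if the guarded body panics.
struct InFlightGuard;

impl InFlightGuard {
    fn new() -> Self {
        READS_IN_FLIGHT.fetch_add(1, Ordering::Relaxed);
        InFlightGuard
    }
}

impl Drop for InFlightGuard {
    fn drop(&mut self) {
        READS_IN_FLIGHT.fetch_sub(1, Ordering::Relaxed);
    }
}

fn main() {
    // Silence the default panic message for the deliberate panic below.
    std::panic::set_hook(Box::new(|_| {}));

    let result = std::panic::catch_unwind(|| {
        let _guard = InFlightGuard::new();
        panic!("backend panicked mid-read");
    });
    assert!(result.is_err());

    // Without the guard, the pre-fix code would leave the gauge at 1 forever.
    assert_eq!(READS_IN_FLIGHT.load(Ordering::Relaxed), 0);
    println!("gauge survived a panic without drifting");
}
```

The key property is that the decrement lives in `Drop`, which runs during unwinding; a plain increment/decrement pair around the body does not survive a panic between the two calls.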

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold-mdbx): spawn_blocking writes, per-item deadlines, docs

- Writes (append_block, append_blocks, truncate_above, drain_above)
  now use tokio::task::spawn_blocking instead of block_in_place.
  block_in_place panics on a current_thread runtime, so any consumer
  wiring MdbxColdBackend into a single-threaded Tokio would hit the
  first write and crash. Added writes_work_on_current_thread_runtime
  regression test.
- Overrun WARN fires only on successful writes. A failed write that
  took > 2 s already surfaces Backend(...) to the caller; a spurious
  advisory-write-timeout WARN on the error path would poison any SLO
  alert built on that signal.
- Iterator reads gained inner-loop deadline checks: per-receipt in
  get_logs_inner, per-event in collect_signet_events_in_range,
  per-receipt + per-log in produce_log_stream_blocking. A block with
  many matching logs (or a slow stream consumer) can no longer run
  unbounded past the configured deadline.
- MdbxColdError::Timeout now maps to ColdStorageError::DeadlineExceeded
  (new variant) instead of Backend. Updated the existing timeout test
  to match on the variant directly.
- Documented the point-lookup timeout exemption: MDBX page I/O on
  cold pages can stall arbitrarily, and the handle does not wrap point
  lookups in a tokio::time::timeout, so a stuck lookup ties up a
  spawn_blocking worker AND a read_sem permit. Callers that need
  fail-fast behavior should wrap at the call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(storage): align unified::drain_above doc with silent-swallow impl

The PR-#58 doc rewrite advertised a `Cold` error path, but the impl
collapses every cold error into `Vec::new()`. Update the doc + comment
to admit silent-swallow behaviour. ENG-2210 tracks the propagation
decision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold-sql): map PG statement_timeout to DeadlineExceeded

SQLSTATE 57014 (query_canceled, emitted on `statement_timeout` expiry)
was wrapped as `Sqlx(...)` and surfaced to the handle as `Backend(...)`,
breaking symmetry with the MDBX backend (which routes its `Timeout` to
`ColdStorageError::DeadlineExceeded`). The metric
`cold.op_errors_total{error="backend"}` therefore conflated "query too
slow" with "backend down".

`From<sqlx::Error> for SqlColdError` now detects 57014 and produces a
dedicated `Timeout` variant; `From<SqlColdError> for ColdStorageError`
maps it to `DeadlineExceeded`. The configured deadline is not threaded
to this conversion boundary so the surfaced duration is `ZERO`;
threading the real value is a separate refactor (left for a follow-up
once the call sites are confirmed to need it).

The `pg_statement_timeout` test is rewritten to match on the typed
variant rather than a substring of the error message — a future
refactor that drops 57014 detection now fails the test instead of
silently passing.
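The two-hop mapping can be sketched without pulling in sqlx, modelling the SQLSTATE as a plain string; both enums are stubbed down to the variants discussed here, and `sql_error_from_sqlstate` is a hypothetical stand-in for the real `From<sqlx::Error>` impl:

```rust
/// Stubs of the two error layers; the real enums carry more variants.
#[derive(Debug, PartialEq)]
enum SqlColdError {
    Timeout,
    Other(String),
}

#[derive(Debug, PartialEq)]
enum ColdStorageError {
    DeadlineExceeded,
    Backend(String),
}

/// First hop: SQLSTATE 57014 (query_canceled, raised when
/// statement_timeout expires) becomes a dedicated Timeout variant.
fn sql_error_from_sqlstate(code: &str, msg: &str) -> SqlColdError {
    if code == "57014" {
        SqlColdError::Timeout
    } else {
        SqlColdError::Other(msg.to_string())
    }
}

/// Second hop: the typed Timeout maps to DeadlineExceeded at the handle
/// boundary, keeping symmetry with the MDBX backend.
impl From<SqlColdError> for ColdStorageError {
    fn from(e: SqlColdError) -> Self {
        match e {
            SqlColdError::Timeout => ColdStorageError::DeadlineExceeded,
            SqlColdError::Other(m) => ColdStorageError::Backend(m),
        }
    }
}

fn main() {
    let slow = sql_error_from_sqlstate("57014", "canceling statement due to statement timeout");
    assert_eq!(ColdStorageError::from(slow), ColdStorageError::DeadlineExceeded);

    let down = sql_error_from_sqlstate("08006", "connection failure");
    assert_eq!(
        ColdStorageError::from(down),
        ColdStorageError::Backend("connection failure".to_string())
    );
    println!("query-too-slow and backend-down no longer conflated");
}
```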

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(cold): hoist write SLO and stream-setup timeout into the handle

Adds two accessors to the `ColdStorageBackend` trait:

    fn read_timeout(&self) -> Option<Duration> { None }
    fn write_timeout(&self) -> Option<Duration> { None }

Wired through `MdbxColdBackend`, `SqlColdBackend`, and `EitherCold`.
`MemColdBackend` returns `None` (already-documented test exemption).

Two behaviour changes use these:

1. The advisory write-SLO WARN moves from the MDBX backend
   (`warn_on_overrun` per-method) to `ColdStorage::spawn_write`. Timing
   is now captured before `write_sem` acquisition, so the elapsed
   value covers the queue wait, the read drain, and the commit
   end-to-end. The failure shape that wedged production at #56 — slow
   readers gating writes — now surfaces as a write-SLO violation
   rather than as a sub-threshold backend timing.

2. `stream_logs`'s setup `get_latest_block` is wrapped in
   `tokio::time::timeout(backend.read_timeout(), ...)`. Without this,
   a stuck point lookup (cold MDBX page) or a saturated PG pool
   parking on `acquire_timeout` could pin N concurrent setup callers
   indefinitely with no permit cap. The setup read still bypasses
   `read_sem` and the drain barrier by design.

Also drops the now-unused `tracing` dep from `signet-cold-mdbx` and
updates the type docs to point at the handle's new WARN path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cold): reject zero timeouts; log JoinError panics; misc nits

Builders for `read_timeout` / `write_timeout` on both the MDBX and SQL
backends and connectors now panic on zero. Postgres treats
`statement_timeout = 0` as "no timeout", so a caller passing
`Duration::ZERO` (or computing one from a config that defaults to
zero) would silently disable the trait-level mandatory-timeout
contract. MDBX accepts the same assert for symmetry — zero there is a
useless config rather than a silent disable, but the trait says
non-zero and the assert keeps the surface honest.

`spawn_read` / `spawn_write` now log spawned-task `JoinError`s before
mapping to `TaskTerminated`. A backend panic was previously
indistinguishable from graceful shutdown for the on-call: panics fire
ERROR with the panic message, cancellations fire DEBUG. The error
variant still collapses to `TaskTerminated` per design.

`MdbxColdBackend::get_logs_inner` now checks the deadline inside the
inner per-log loop, mirroring the streaming path. Previously a single
receipt with thousands of matching logs would iterate unchecked past
the configured `read_timeout`. The two `std::time::Instant::now()`
sites in `produce_log_stream_blocking` are also folded into the
already-imported `Instant`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(cold): stream_logs setup fails fast on hung get_latest_block

Pins the new setup-timeout behaviour against regression. The test
parks `GatedBackend::get_latest_block` indefinitely and asserts that
`stream_logs` (with no `to_block` on the filter, forcing the
"resolve to=latest" path) returns `DeadlineExceeded` within the
configured 50 ms `read_timeout` rather than hanging.

Adds `GatedBackend::with_read_timeout` so tests can advertise a
custom read timeout to the handle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
