cold: unified handle architecture (supersedes #57) #58
…body iterator deadline Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a `metrics` module under `crates/cold/src/metrics.rs` with const metric names, help strings, a `LazyLock` describe block, and `pub(crate)` helper functions for recording:

- `cold.reads_in_flight`, `cold.writes_in_flight`, `cold.streams_active` (gauges)
- `cold.op_duration_us` (histogram, labeled by op)
- `cold.permit_wait_us` (histogram, labeled by sem: read/write/drain/stream)
- `cold.op_errors_total` (counter, labeled by op and error kind)
- `cold.stream_lifetime_ms` (histogram)

Wires the helpers into every `ColdStorage<B>` handle method: `spawn_read` and `spawn_write` time permit acquisition, bump in-flight, measure op duration, record errors, and dec in-flight after the backend call. Cache hits in `get_header`/`get_transaction`/`get_receipt` record op duration only (no permit wait, no in-flight). `stream_logs` instruments stream permit wait and records stream lifetime + gauge in the spawned producer. Adds `ColdStorageError::kind()` for the error metric label.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Claude Code] Code review

Went deep on the concurrency model. One real regression worth addressing, plus a resource-lifetime nit. Found 2 issues:
Separately, CI was red on

🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎.
The coordinator task previously moved `Arc<Inner<B>>` into its body and awaited the user's cancel token. If callers dropped all `ColdStorage` clones without firing cancel, `Inner` (and the backend's file/DB handles) stayed pinned until process exit. Switch the coordinator to `Weak<Inner>`, and put a `DropGuard` on `Inner` that fires a child cancel token. Shutdown now fires on either user-side cancel OR `Inner` drop; in the drop case `upgrade()` returns `None` and the coordinator exits without pinning anything.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
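A minimal sketch of that wiring, assuming tokio-util's `CancellationToken`/`DropGuard`; `Inner`, `new_handle`, and the shutdown body are illustrative placeholders, not the crate's real definitions:

```rust
use std::sync::{Arc, Weak};
use tokio_util::sync::{CancellationToken, DropGuard};

struct Inner {
    // Cancelled automatically when the last Arc<Inner> is dropped.
    _drop_guard: DropGuard,
}

fn new_handle(user_cancel: CancellationToken) -> Arc<Inner> {
    // Child token: fires on user-side cancel (parent) OR Inner drop (guard).
    let shutdown = user_cancel.child_token();
    let inner = Arc::new(Inner { _drop_guard: shutdown.clone().drop_guard() });

    let weak: Weak<Inner> = Arc::downgrade(&inner); // never pins Inner
    tokio::spawn(async move {
        shutdown.cancelled().await;
        if let Some(_inner) = weak.upgrade() {
            // User-side cancel: run graceful shutdown against `_inner`.
        }
        // Drop case: upgrade() returned None; exit without pinning anything.
    });
    inner
}
```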
`ColdStorageError::backend` unconditionally wraps as `Backend(Box<_>)`, which hid `MdbxColdError::TooManyLogs` behind the generic backend variant and broke the conformance suite's max_logs assertion. The `From<MdbxColdError> for ColdStorageError` impl already translates `TooManyLogs` correctly and wraps the rest. Route all `spawn_blocking` result conversions through `::from` so the translation runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
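The shape of the fix, sketched; `get_logs_blocking` and the surrounding types are stand-ins for the real call sites:

```rust
async fn spawn_read_logs(
    backend: Arc<MdbxColdBackend>,
    filter: Filter,
) -> Result<Vec<Log>, ColdStorageError> {
    let res = tokio::task::spawn_blocking(move || backend.get_logs_blocking(&filter))
        .await
        .map_err(|_| ColdStorageError::TaskTerminated)?;
    // Before: res.map_err(ColdStorageError::backend), which buried
    // TooManyLogs inside Backend(Box<_>).
    // After: the From impl translates TooManyLogs to its own variant
    // and wraps only the remainder as Backend.
    res.map_err(ColdStorageError::from)
}
```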
Evalir
left a comment
[Claude Code] Deep-read focused on invariants, panics, and perf regressions. Below are the actionable items I found, ordered roughly by severity.
1. block_in_place in MDBX writes panics on a current_thread runtime
crates/cold-mdbx/src/backend.rs:882,897,912,929 — every MDBX write (append_block, append_blocks, truncate_above, drain_above) now wraps its inner call in tokio::task::block_in_place. That API panics on a single-threaded runtime. Pre-PR, this work ran in the dispatcher task and never hit that requirement.
Reads in the same file correctly use spawn_blocking — it is only the writes. The new tests added for this path all pin #[tokio::test(flavor = "multi_thread", worker_threads = 2)], which hides the issue in CI. Any consumer of MdbxColdBackend on a current_thread runtime will panic on the first write.
Fix: either use spawn_blocking like the read path does, or loudly document the multi-threaded runtime requirement on MdbxColdBackend / ColdStorage::new.
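For reference, the spawn_blocking shape (sketch only; `Env`, `Block`, and `append_block_inner` stand in for the backend's actual internals):

```rust
async fn append_block(env: Arc<Env>, block: Block) -> Result<(), MdbxColdError> {
    // Safe on current_thread runtimes, unlike block_in_place.
    tokio::task::spawn_blocking(move || append_block_inner(&env, block))
        .await
        .expect("blocking write task panicked") // real code should map the JoinError
}
```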
2. Cache-hit readers bypass the drain barrier → return just-truncated blocks mid-write
crates/cold/src/handle.rs:243-248, 294-300, 367-373 — get_header, get_transaction, get_receipt check self.inner.cache before acquiring a read_sem permit. spawn_write invalidates the cache after the backend mutation inside the same spawned task:
self.spawn_write("truncate_above", move |inner| async move {
let result = inner.backend.truncate_above(block).await; // tens of ms for MDBX
if result.is_ok() {
inner.cache.lock().await.invalidate_above(block);
}
result
})Ordering: drain barrier acquired → backend commits the truncate → cache invalidated. Between the backend commit and the cache invalidation, a cache-hit reader can still return a header/tx/receipt for a block the backend has already deleted. The cache_consistent_through_truncate test only checks the post-commit state — it does not exercise this window.
Fix: invalidate the cache before calling backend.truncate_above / backend.drain_above. The drain barrier already guarantees no spawn_read task is racing to re-populate, so pre-invalidation closes the window.
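i.e., a sketch of the reorder (names match the snippet above):

```rust
self.spawn_write("truncate_above", move |inner| async move {
    // Invalidate first: the drain barrier holds all 64 read permits, so
    // no reader can repopulate between this and the backend commit. A
    // failed commit just leaves harmless cold-cache misses.
    inner.cache.lock().await.invalidate_above(block);
    inner.backend.truncate_above(block).await
})
```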
3. stream_logs setup holds stream_sem across an un-timeouted backend call
crates/cold/src/handle.rs:519-544 — the permit is acquired, then self.inner.backend.get_latest_block().await? runs before the task is spawned. The backend call has no timeout:
- SQL: `begin_read` → `pool.begin()` can park up to sqlx's `acquire_timeout` (default 30 s) if the pool is saturated.
- MDBX: `get_latest_block` is a point lookup which skips the deadline check entirely (see issue 5).
If 8 callers land here with a slow / hung backend, stream_sem is fully held and no new streams can start. The permit only exists to cap in-flight producer tasks; holding it across setup I/O is exactly the bug the PR #57 review (item 2) was meant to avoid.
Fix: either wrap the setup I/O in `tokio::time::timeout(effective_deadline, …)`, or acquire the `stream_sem` permit after resolving `to`.
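The timeout option, sketched; `effective_deadline` is whatever deadline `stream_logs` already computes, and the error mapping is whichever timeout variant the handle settles on (a later commit adds `DeadlineExceeded`):

```rust
let latest = tokio::time::timeout(
    effective_deadline,
    self.inner.backend.get_latest_block(),
)
.await
// Outer Err = deadline elapsed; the inner ? propagates backend errors.
.map_err(|_| ColdStorageError::DeadlineExceeded(effective_deadline))??;
```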
4. MDBX iterator timeouts are only checked between blocks, never within
crates/cold-mdbx/src/backend.rs:658, 483-487, 517-519, and the same pattern in produce_log_stream_blocking:125-131. get_logs_inner / collect_signet_events_in_range / collect_zenith_headers_in_range / the streaming producer check Instant::now() > deadline at the top of each block iteration — never inside the per-receipt or per-log loops. A single block with thousands of matching logs runs far past the 500 ms default before the next check.
In the streaming path it is worse: the deadline-error branch at block boundaries is the only place the deadline ever trips, and the inner blocking_send to a slow receiver can stall indefinitely per log.
Fix: add a bounded periodic check in the inner loops (every N logs/receipts), at minimum inside produce_log_stream_blocking's inner log loop so a slow consumer plus a big block cannot stall a spawn_blocking worker past the deadline.
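One cheap shape for that (sketch; `receipt_logs`, `deadline`, and `read_timeout` are assumed in scope, and N = 256 is illustrative):

```rust
let mut since_check = 0u32;
for log in receipt_logs {
    // Amortize the clock read: check the deadline every 256 logs rather
    // than per item, bounding overrun without a syscall per log.
    since_check += 1;
    if since_check == 256 {
        since_check = 0;
        if Instant::now() > deadline {
            return Err(MdbxColdError::Timeout(read_timeout));
        }
    }
    // ... match the filter and emit/send the log ...
}
```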
5. MDBX point lookups have no timeout enforcement at all
crates/cold-mdbx/src/backend.rs:700-704, 726-730, 750-754, 795-799, 861-875 — get_header, get_transaction, get_receipt, get_zenith_header, get_latest_block all spawn_blocking their inner call with no deadline and no timeout wrapper. The trait doc exempts point lookups, but MDBX point reads can still block on disk I/O for arbitrary durations (cold page fetch, adjacent writer doing a page split, etc.).
A stuck point lookup holds a spawn_blocking worker and a read_sem permit. Enough stuck lookups ⇒ read_sem exhausted, and the handle cannot fail fast because nothing resolves.
Fix: wrap each spawn_blocking in tokio::time::timeout(self.read_timeout, …). The blocking thread still runs to completion (unavoidable), but the handle releases its permit promptly and returns an error to the caller.
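Sketch of that wrapper (`get_header_inner` and `env` stand in for the real internals):

```rust
let env = self.env.clone();
let lookup = tokio::task::spawn_blocking(move || get_header_inner(&env, number));
match tokio::time::timeout(self.read_timeout, lookup).await {
    // Completed in time: unwrap the JoinError layer, keep the inner Result.
    Ok(joined) => joined.expect("blocking point lookup panicked"),
    // Deadline hit: the blocking thread keeps running, but the caller's
    // permit is released and it gets a typed timeout error now.
    Err(_elapsed) => Err(MdbxColdError::Timeout(self.read_timeout)),
}
```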
6. In-flight gauges leak on panic
crates/cold/src/handle.rs:154-168, 211-226, 550-560 — inc_in_flight runs before the spawned body, dec_in_flight runs at the end. If anything between them panics (f(inner).await, the backend call, record_op_duration, the cache .lock()), the decrement is skipped and cold.reads_in_flight / cold.writes_in_flight / cold.streams_active drifts up permanently.
Semaphore permits are Drop-based and do release correctly; only the gauges leak. These are the gauges the PR pitches Prometheus alerts against, so drift poisons the signal.
Fix: wrap the in-flight bookkeeping in a small Drop guard struct.
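E.g., a sketch against the `metrics` crate's handle API (the gauge names are the PR's):

```rust
struct InFlightGuard {
    gauge: &'static str,
}

impl InFlightGuard {
    fn new(gauge: &'static str) -> Self {
        metrics::gauge!(gauge).increment(1.0);
        Self { gauge }
    }
}

impl Drop for InFlightGuard {
    fn drop(&mut self) {
        // Runs on normal return AND on unwind, so a panic in the spawned
        // body can no longer leave the gauge drifted up.
        metrics::gauge!(self.gauge).decrement(1.0);
    }
}

// In the spawned task body:
// let _in_flight = InFlightGuard::new("cold.reads_in_flight");
```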
7. Write-timeout WARN fires on error paths
crates/cold-mdbx/src/backend.rs:878-940 — pattern:
```rust
let result = tokio::task::block_in_place(|| …);
let elapsed = start.elapsed();
if elapsed > threshold {
    tracing::warn!(…, "mdbx … exceeded advisory write timeout");
}
Ok(result?)
```

If the write fails (disk full, corruption) after > 2 s, the caller sees ColdStorageError::Backend(…) and a misleading advisory-write-timeout WARN. Noisy for alerting, since the summary pitches this WARN as an SLO signal.
Fix: only warn on Ok.
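i.e. (sketch; `commit` stands in for the MDBX write):

```rust
let result = tokio::task::block_in_place(|| commit(block));
let elapsed = start.elapsed();
// Advisory SLO signal only: a failed slow write already surfaces
// Backend(..) to the caller, so don't double-report it here.
if result.is_ok() && elapsed > threshold {
    tracing::warn!(?elapsed, ?threshold, "mdbx write exceeded advisory write timeout");
}
Ok(result?)
```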
Notes — not bugs, but worth an eye before merge
- `tokio::sync::Mutex<ColdCache>` (handle.rs:56) — the cache operations hold the lock only across a synchronous LRU op. A `std::sync::Mutex` / `parking_lot::Mutex` would avoid the async-runtime yield on every cache access. Minor perf.
- Removed error variants (`Timeout`, `Backpressure`, `Cancelled`) collapse into `Backend` for MDBX timeouts — downstream callers that used to match on `Timeout` now have to string-match. Worth confirming nothing does.
- `MemColdBackend` silently violates the new "Timeouts (mandatory)" contract. The trait doc says exempt backends "may document their exemption explicitly" — `MemColdBackend`'s type doc does not.
wrt #2, this is intentional. returning slightly stale data > blocking
- stream_logs resolves `to` (and the get_latest_block fallback) before acquiring stream_sem, so a stuck backend no longer pins all 8 permits across setup I/O.
- In-flight gauges are now maintained by an InFlightGuard RAII wrapper so the decrement survives a panic in the spawned body; previously a panic left cold.reads_in_flight / writes_in_flight / streams_active drifting up and poisoning the Prometheus alert signal.
- Promote timeout to a first-class ColdStorageError::DeadlineExceeded variant. MDBX Timeout now routes through it (not Backend), and downstream callers can match without downcasting. Fixes stale Backpressure references in the cold and storage READMEs and the signet-storage skill doc.
- ColdCache switches from tokio::sync::Mutex to parking_lot::Mutex. The cache only ever holds the lock across synchronous LRU ops, so the async mutex's yield-on-lock was pure overhead.
- MemColdBackend now explicitly documents its exemption from the trait's mandatory-timeout contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Writes (append_block, append_blocks, truncate_above, drain_above) now use tokio::task::spawn_blocking instead of block_in_place. block_in_place panics on a current_thread runtime, so any consumer wiring MdbxColdBackend into a single-threaded Tokio runtime would hit the first write and crash. Added a writes_work_on_current_thread_runtime regression test.
- The overrun WARN fires only on successful writes. A failed write that took > 2 s already surfaces Backend(...) to the caller; a spurious advisory-write-timeout WARN on the error path would poison any SLO alert built on that signal.
- Iterator reads gained inner-loop deadline checks: per-receipt in get_logs_inner, per-event in collect_signet_events_in_range, per-receipt + per-log in produce_log_stream_blocking. A block with many matching logs (or a slow stream consumer) can no longer run unbounded past the configured deadline.
- MdbxColdError::Timeout now maps to ColdStorageError::DeadlineExceeded (new variant) instead of Backend. Updated the existing timeout test to match on the variant directly.
- Documented the point-lookup timeout exemption: MDBX page I/O on cold pages can stall arbitrarily, and the handle does not wrap point lookups in a tokio::time::timeout, so a stuck lookup ties up a spawn_blocking worker AND a read_sem permit. Callers that need fail-fast behavior should wrap at the call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Claude Code] Thanks @Evalir — solid review. Pushed two commits addressing everything except #2 (already-considered; see below).
Verified against the pre-push hook suite:
```rust
for result in iter {
    // Per-receipt deadline check bounds iteration cost across
    // blocks with many receipts.
    if std::time::Instant::now() > deadline {
```
nit: `std::time::Instant` could be just `Instant` (same thing in a few places in this file).
```rust
let threshold = self.write_timeout;
let this = self.clone();
let start = Instant::now();
```
[Claude Code]
Capturing start here means the SLO measurement only covers spawn_blocking queueing + the MDBX commit. It excludes the handle-side wait on write_sem and the drain barrier, which is precisely the failure shape that wedged production at #56 (slow readers blocking writes). A 3 s drain followed by a 100 ms commit measures 100 ms here, well under the 2 s threshold, and the WARN never fires.
Same applies to append_blocks, truncate_above, and drain_above.
Suggest hoisting the SLO timing into ColdStorage::spawn_write so it captures start before write_sem.acquire_owned, then reading the configured threshold via a new ColdStorageBackend::write_timeout(&self) -> Option<Duration> accessor (defaulting to None so MemColdBackend opts out cleanly). The backend method then becomes a thin spawn_blocking wrapper with no timing logic, and warn_on_overrun can be deleted.
Trade-off worth calling out in the WARN message or commit notes: an e2e measurement will surface reader-side contention as write SLO violations. That is arguably exactly the signal you want given #56, but it is a deliberate change in what the alert means.
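A sketch of that hoist (free-standing for illustration; the `timed_write` name, the `expect` error handling, and the `write_timeout()` accessor wiring are all assumptions):

```rust
use std::{future::Future, sync::Arc, time::{Duration, Instant}};
use tokio::sync::Semaphore;

async fn timed_write<T, E>(
    write_sem: Arc<Semaphore>,
    read_sem: Arc<Semaphore>,
    threshold: Option<Duration>, // from the proposed write_timeout() accessor
    op: &'static str,
    work: impl Future<Output = Result<T, E>>,
) -> Result<T, E> {
    let start = Instant::now(); // before ANY waiting, unlike the backend-side timer
    let _write = write_sem.acquire_owned().await.expect("semaphore closed");
    let _drain = read_sem.acquire_many_owned(64).await.expect("semaphore closed");
    let result = work.await;
    if let Some(t) = threshold {
        let elapsed = start.elapsed();
        // e2e: queue wait + drain + commit. Reader contention now shows
        // up as a write-SLO violation, per the trade-off noted above.
        if result.is_ok() && elapsed > t {
            tracing::warn!(op, ?elapsed, "write exceeded advisory SLO");
        }
    }
    result
}
```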
```rust
assert!(
    msg.contains("canceling") || msg.contains("57014") || msg.contains("timeout"),
    "expected statement_timeout error, got: {err}"
);
```
[Claude Code]
String-matching here hides that PG statement_timeout (SQLSTATE 57014) reaches the caller as Backend(...), not DeadlineExceeded. MDBX timeouts route to DeadlineExceeded (crates/cold-mdbx/src/error.rs:33); PG doesn't, because From<SqlColdError> for ColdStorageError (crates/cold-sql/src/error.rs:15-19) blindly wraps every sqlx error as Backend. Result: callers can't match on timeout without string-matching, and cold.op_errors_total{error=backend} collapses "query too slow" with "backend down".
Suggest: add SqlColdError::Timeout(Duration), detect 57014 at the SQL boundary (carrying the configured read_timeout/write_timeout), map to DeadlineExceeded(d). Then this assertion becomes assert!(matches!(err, SqlColdError::Timeout(_))) and a future refactor that drops timeout detection actually fails the test.
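Sketch of the boundary detection (error shape assumed via thiserror; 57014 is PG's query_canceled SQLSTATE):

```rust
use std::time::Duration;

#[derive(Debug, thiserror::Error)]
pub enum SqlColdError {
    #[error("statement timed out after {0:?}")]
    Timeout(Duration),
    #[error(transparent)]
    Sqlx(sqlx::Error),
}

fn classify(e: sqlx::Error, configured: Duration) -> SqlColdError {
    if let sqlx::Error::Database(ref db) = e {
        // PG raises SQLSTATE 57014 (query_canceled) when statement_timeout fires.
        if db.code().as_deref() == Some("57014") {
            return SqlColdError::Timeout(configured);
        }
    }
    SqlColdError::Sqlx(e)
}
```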
```rust
) -> Result<sqlx::Transaction<'a, sqlx::Any>, SqlColdError> {
    if self.is_postgres {
        let ms = i64::try_from(d.as_millis()).unwrap_or(i64::MAX);
        sqlx::query(&format!("SET LOCAL statement_timeout = {ms}")).execute(&mut *tx).await?;
```
[Claude Code]
In PG, SET LOCAL statement_timeout = 0 means "no timeout". The builders with_read_timeout / with_write_timeout (line 154, 165 here, plus the SqlConnector mirrors) accept any Duration including Duration::ZERO, so a caller passing zero (or computing one from a config that defaults to zero) silently disables the trait-level "Timeouts (mandatory)" contract at crates/cold/src/traits.rs:267-296.
Suggest either rejecting zero / sub-millisecond values at the builder (runtime check on first use, or non-const builder returning Result), or documenting the disable behaviour explicitly on with_read_timeout / with_write_timeout. Either way the trait-level "MUST enforce" wording should be reconciled with what the SQL backend actually allows.
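The reject-at-builder option, sketched:

```rust
pub fn with_read_timeout(mut self, d: Duration) -> Self {
    // Sub-millisecond values floor to 0 in `SET LOCAL statement_timeout`,
    // which PG reads as "no timeout"; reject rather than silently
    // disabling the trait's mandatory-timeout contract.
    assert!(d >= Duration::from_millis(1), "read_timeout must be >= 1ms");
    self.read_timeout = d;
    self
}
```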
```rust
if Instant::now() > deadline {
    return Err(MdbxColdError::Timeout(read_timeout));
}
```
[Claude Code]
This per-receipt check doesn't bound the inner per-log loop. A single receipt with thousands of matching logs (large mints, batched events) iterates here unchecked. The streaming path (produce_log_stream_blocking:159-194) checks per-log; the non-streaming path inconsistently doesn't.
```rust
///
/// - [`Hot`]: Hot storage read or unwind failed.
/// - [`Cold`]: Hot storage unwound but cold truncate dispatch failed.
/// - [`Cold`]: Hot storage unwound but cold drain failed.
```
[Claude Code]
This claim is wrong as written. The implementation at line 212 below uses self.cold.drain_above(block).await.unwrap_or_default(), which silently turns every cold error (TaskTerminated, Backend, DeadlineExceeded, etc.) into Vec::new(). StorageError::Cold is never returned from this method.
Pre-existing behaviour, but the doc rewrite in this PR added the claim, so the gap between docs and impl is new. Suggest either:
- Restore real error propagation (return `Cold` on at least non-`TaskTerminated` failures).
- Keep the silent-swallow behaviour and update the doc + the "best-effort - failure = normal cold lag" comment to admit it. Currently that comment conflates "cold returned an error" with "cold has no data above" - two different states that operators need to distinguish.
```rust
        .in_current_span(),
)
.await
.map_err(|_| ColdStorageError::TaskTerminated)?
```
[Claude Code]
This JoinError only fires on panic in normal operation - tracked tasks aren't dropped by the runtime, so the only way to get one here is the spawned body panicking inside f(inner).await (i.e. backend code). Mapping it to TaskTerminated tells the caller "shutdown", which on-call will read as "graceful exit", not "backend panicked". The original error is also swallowed silently - no log of the panic message or location.
Same pattern at line 231 (spawn_write).
Suggest at minimum logging the JoinError at ERROR before mapping. Optionally introduce a ColdStorageError::BackendPanic variant so panics are distinguishable from semaphore-close shutdown - the two failure modes have very different ops responses (panic = bug to chase, TaskTerminated = expected shutdown signal).
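Sketch of the minimum version (the `op` label matches the handle methods' stable field; `BackendPanic` is left out, as it's the optional extension):

```rust
async fn join_logged<T>(
    op: &'static str,
    task: tokio::task::JoinHandle<Result<T, ColdStorageError>>,
) -> Result<T, ColdStorageError> {
    task.await.map_err(|join_err| {
        // A JoinError here means the spawned body panicked; log it so a
        // backend bug is distinguishable from graceful shutdown.
        tracing::error!(op, %join_err, "cold storage spawned task panicked");
        ColdStorageError::TaskTerminated
    })?
}
```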
```rust
// in-flight write.
let to = match filter.get_to_block() {
    Some(to) => to,
    None => match self.inner.backend.get_latest_block().await? {
```
[Claude Code]
This setup get_latest_block runs before any permit acquisition (deliberate, per the comment above) AND without caller-side timeout protection. With MDBX, point lookups have no in-body deadline so it can stall on disk I/O indefinitely. With PG, pool.begin() inside begin_read can park for acquire_timeout (default 30 s) on a saturated pool. Either way, N concurrent stream_logs callers hit the backend simultaneously, with no cap on concurrency or duration.
tests/stream_isolation.rs:36-43 uses a bounded to_block, so this code path isn't regression-tested. With a GatedBackend that gates point lookups, you could pin a setup-stall scenario.
Suggest one of:
- Wrap this call in
tokio::time::timeout(self.inner.setup_timeout, ...)so it can fail fast (with the standard caveat that the future drops but backend work continues - the same trade-off the rest of the design accepts). - Move
stream_sem.acquire_owned()to before this call. That re-introduces the original concern (a stuck setup pins stream permits), but bounds the call bystream_semrather than leaving it completely uncapped.
[Claude Code]
Stale doc: the signatures below all take &self, and the backend is no longer "exclusively owned by the task runner" - the handle wraps it in Arc<Inner> and shares it across spawned tasks.
The PR-#58 doc rewrite advertised a `Cold` error path, but the impl collapses every cold error into `Vec::new()`. Update the doc + comment to admit silent-swallow behaviour. ENG-2210 tracks the propagation decision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SQLSTATE 57014 (query_canceled, emitted on `statement_timeout` expiry)
was wrapped as `Sqlx(...)` and surfaced to the handle as `Backend(...)`,
breaking symmetry with the MDBX backend (which routes its `Timeout` to
`ColdStorageError::DeadlineExceeded`). The metric
`cold.op_errors_total{error="backend"}` therefore conflated "query too
slow" with "backend down".
`From<sqlx::Error> for SqlColdError` now detects 57014 and produces a
dedicated `Timeout` variant; `From<SqlColdError> for ColdStorageError`
maps it to `DeadlineExceeded`. The configured deadline is not threaded
to this conversion boundary so the surfaced duration is `ZERO`;
threading the real value is a separate refactor (left for a follow-up
once the call sites are confirmed to need it).
The `pg_statement_timeout` test is rewritten to match on the typed
variant rather than a substring of the error message — a future
refactor that drops 57014 detection now fails the test instead of
silently passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two accessors to the `ColdStorageBackend` trait:
fn read_timeout(&self) -> Option<Duration> { None }
fn write_timeout(&self) -> Option<Duration> { None }
Wired through `MdbxColdBackend`, `SqlColdBackend`, and `EitherCold`.
`MemColdBackend` returns `None` (already-documented test exemption).
Two behaviour changes use these:
1. The advisory write-SLO WARN moves from the MDBX backend
(`warn_on_overrun` per-method) to `ColdStorage::spawn_write`. Timing
is now captured before `write_sem` acquisition, so the elapsed
value covers the queue wait, the read drain, and the commit
end-to-end. The failure shape that wedged production at #56 — slow
readers gating writes — now surfaces as a write-SLO violation
rather than as a sub-threshold backend timing.
2. `stream_logs`'s setup `get_latest_block` is wrapped in
`tokio::time::timeout(backend.read_timeout(), ...)`. Without this,
a stuck point lookup (cold MDBX page) or a saturated PG pool
parking on `acquire_timeout` could pin N concurrent setup callers
indefinitely with no permit cap. The setup read still bypasses
`read_sem` and the drain barrier by design.
Also drops the now-unused `tracing` dep from `signet-cold-mdbx` and
updates the type docs to point at the handle's new WARN path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builders for `read_timeout` / `write_timeout` on both the MDBX and SQL backends and connectors now panic on zero. Postgres treats `statement_timeout = 0` as "no timeout", so a caller passing `Duration::ZERO` (or computing one from a config that defaults to zero) would silently disable the trait-level mandatory-timeout contract. MDBX accepts the same assert for symmetry — zero there is a useless config rather than a silent disable, but the trait says non-zero and the assert keeps the surface honest.

`spawn_read` / `spawn_write` now log spawned-task `JoinError`s before mapping to `TaskTerminated`. A backend panic was previously indistinguishable from graceful shutdown for the on-call: panics fire ERROR with the panic message, cancellations fire DEBUG. The error variant still collapses to `TaskTerminated` per design.

`MdbxColdBackend::get_logs_inner` now checks the deadline inside the inner per-log loop, mirroring the streaming path. Previously a single receipt with thousands of matching logs would iterate unchecked past the configured `read_timeout`. The two `std::time::Instant::now()` sites in `produce_log_stream_blocking` are also folded into the already-imported `Instant`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Claude Code] Thanks @Fraser999 — pushed four commits addressing all nine inline comments. Summary:
Pre-push hook clean:
Pins the new setup-timeout behaviour against regression. The test parks `GatedBackend::get_latest_block` indefinitely and asserts that `stream_logs` (with no `to_block` on the filter, forcing the "resolve to=latest" path) returns `DeadlineExceeded` within the configured 50 ms `read_timeout` rather than hanging. Adds `GatedBackend::with_read_timeout` so tests can advertise a custom read timeout to the handle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
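For orientation, roughly the shape of that test (hedged sketch; `GatedBackend`'s constructor, the filter type, and the exact `DeadlineExceeded` payload are assumptions):

```rust
#[tokio::test]
async fn stream_setup_times_out_instead_of_hanging() {
    // Gate never opens: get_latest_block parks forever.
    let backend = GatedBackend::new().with_read_timeout(Duration::from_millis(50));
    let storage = ColdStorage::new(backend);

    // No `to_block` on the filter forces the resolve-to-latest setup read.
    let filter = Filter::default();
    let err = storage.stream_logs(&filter).await.unwrap_err();
    assert!(matches!(err, ColdStorageError::DeadlineExceeded(_)));
}
```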
Fraser999
left a comment
LGTM now. Would be nice to ticket threading the timeout value through to the actual error, since hard-coding ZERO isn't ideal, but this would be an easy standalone follow-up :)
Summary
Collapse cold storage from channels+dispatcher+writer-task into a single `ColdStorage<B: ColdStorageBackend>` handle over `Arc<Inner<B>>`. All concurrency primitives (`read_sem` (64), `write_sem` (1), `stream_sem` (8), `TaskTracker`, `CancellationToken`) live on `Inner`; all backend work spawns into the shared tracker; spawned tasks hold permits for the real duration of the backend call.

Supersedes #57. Tracks Linear ENG-2198.
Fixes three PR #57 review issues:

- `StreamLogs` setup held the outer read permit across unbounded awaits — streams now acquire only `stream_sem`; setup reads are isolated from the read pool by design (a stream requesting "latest" should observe latest at setup).
- `timeout` did not cancel backend work so permits released early — timeouts are now mandatory in the `ColdStorageBackend` trait and enforced at the backend layer, so permits honestly reflect in-flight work.

Shape:
- `ColdStorage<B>` is clone-cheap (one Arc refcount bump); replaces `ColdStorageHandle` + `ColdStorageReadHandle` + `ColdStorageTask`.
- `ColdStorageWrite` now takes `&self` across all backends (Mem / MDBX / SQL updated in lockstep).
- Writes: `write_sem` then the drain barrier (`read_sem.acquire_many_owned(64)`); destructive writes invalidate the cache inside the spawned body while holding the drain.
- Streams: `stream_sem`; producer runs in the shared tracker.
- Shutdown: blocked and future callers get `TaskTerminated`.

Backend timeouts (mandatory in trait contract):
- MDBX: `tokio::task::spawn_blocking` for reads, `tokio::task::block_in_place` for writes, in-body deadline check between per-block iterations. Post-commit WARN on overrun (advisory — MDBX commits are uninterruptible).
- SQL: `SET LOCAL statement_timeout = <ms>` at the start of every Postgres transaction; SQLite skips (no equivalent).
- `with_read_timeout` / `with_write_timeout` builders on backend and connector. Defaults 500 ms / 2 s.

Observability:
- `crates/cold/src/metrics.rs` following the ajj pattern.
- Gauges: `cold.reads_in_flight`, `cold.writes_in_flight`, `cold.streams_active`.
- Histograms: `cold.op_duration_us`, `cold.permit_wait_us`, `cold.stream_lifetime_ms`.
- Counter: `cold.op_errors_total`.
- `#[tracing::instrument]` on every handle method with a stable `op` field; spans propagate into spawned tasks via `.in_current_span()`.

Error variants removed:
`Timeout`, `Backpressure`, `Cancelled`. Backend timeouts now map to `Backend`. `TaskTerminated` remains and is returned from every handle method when the semaphore is closed.

Rollout: 10 commits, one per phase, each building + testing clean in isolation:
Spec: `docs/superpowers/specs/2026-04-20-cold-unified-architecture-design.md` (local — `docs/` is gitignored).
Test plan
🤖 Generated with Claude Code