feat(intervention): isolated GPU-worker execution backend for model.trace() by khaiwang · Pull Request #676 · ndif-team/nnsight

khaiwang · 2026-06-20T07:23:55Z

Run a trace's user interventions in a spawned, GPU-enabled worker process so
footguns in intervention code (infinite loops, OOM allocations, device-side
asserts, host-object pokes) are contained to the worker while the model server
keeps serving. Results are bit-identical to in-process execution.

The six-event Mediator protocol (VALUE/SWAP/SKIP/BARRIER/END/EXCEPTION) is left
unchanged; isolation is an outer harness that spawns the worker and routes the
existing protocol over a CUDA-IPC bounce-buffer channel (tensors stay on the
GPU, ~0.6 ms/hook, size-independent) instead of a shared Python frame.

Two shared-memory assumptions of the in-process path become explicit harness
steps:

- host-side hook registration: the worker has no real module, so on the first
  event for a requester the host registers the matching one-shot hook on the
  real module (resolved from the requester string, for the specific step).
- worker->host saves transmission: .save()'d values live in the worker frame +
  Globals.saves; the worker bundles them into the END event and the host
  injects them into the real user frame.
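
The saves-transmission step can be pictured with a toy model. This is a minimal sketch, not the nnsight code: a plain dict stands in for the worker frame, a set of object ids stands in for Globals.saves, the string values on the enum are assumed, and `bundle_saves` / `inject_saves` are hypothetical names.

```python
from enum import Enum


class Events(Enum):
    # The six protocol events named above; the string values are assumed.
    VALUE = "value"
    SWAP = "swap"
    SKIP = "skip"
    BARRIER = "barrier"
    END = "end"
    EXCEPTION = "exception"


def bundle_saves(worker_frame, saved_ids):
    """Worker side: keep only the locals whose object was marked by .save()."""
    saves = {k: v for k, v in worker_frame.items() if id(v) in saved_ids}
    return (Events.END, saves)


def inject_saves(message, user_frame):
    """Host side: copy the bundled saves into the real user frame."""
    event, saves = message
    if event is Events.END:
        user_frame.update(saves)
    return user_frame


hidden = [0.1, 0.2]   # stands in for a saved activation
scratch = [9.9]       # a local that was never saved
worker_frame = {"hidden": hidden, "scratch": scratch}
message = bundle_saves(worker_frame, saved_ids={id(hidden)})
user_frame = inject_saves(message, {})
```

Only `hidden` reaches `user_frame`; the unsaved local stays in the worker.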

New sources:

- transport.py: CUDA-IPC codec + host/worker channels (clone-on-receive,
  per-wait timeout, host->worker live-meta piggyback, worker->host push field,
  cuda.synchronize ordering guard).
- isolation.py: isolate_mediators() context, spawn_isolated_worker, _worker_main,
  on-demand host hook registration, worker interleaver stub + dummy-module map,
  barrier/variable-store wiring, transmissible-exception degrade.
- _sandbox.py: seccomp lock_down for fs/net/exec containment.

Seam edits route the protocol through the channel when isolation is on:
interleaver.py (isolated start branch, on-demand registration in handle, saves
injection at END, host-side barrier counting, _iso/cancel teardown), hooks.py
(per-step iteration param on output_hook/input_hook), tracer.py (isolated
Barrier branch).

Covered, each bit-identical and independently reviewed: read / swap / .save() /
multi-invoke / skip / exception / timeout / seccomp lockdown; multi-token
iteration (iter[N], iter[:], per-step swap); cross-invoke barrier + variable
sharing; non-standard-named models. Not yet built: tracer.cache() (returns an
empty CacheDict under isolation), backward/grad (autograd graph is host-side),
warm worker pool. See docs/developing/mediator-gpu-trace-integration.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Measure the cost a warm worker pool would amortize: under isolation each model.trace() spawns a fresh GPU worker. On gpt2 (A100) an isolated trace is ~4.5 s vs ~12 ms in-process (~370x). Decomposed bring-up ~4.2 s = cold import torch (1.3 s) + import nnsight (2.3 s) + CUDA context init (0.4 s) + warmup; host-side mediator serialization is only ~3 ms. The tax is essentially model-independent (weights are not shipped) — a flat per-request cost. - perf_spawn_cost.py: decomposed synthetic bring-up + real isolated-vs-inprocess. - perf_spawn_split.py: splits the spawn slice into host serialize vs start(). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
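
The quoted figures can be cross-checked with a few lines of arithmetic. The inputs below are the rounded numbers from the message above; the warmup value is derived by subtraction and is an inference, not a separately reported measurement.

```python
# Figures quoted in the commit message, in seconds.
isolated_trace = 4.5
in_process_trace = 0.012
bring_up = 4.2
import_torch, import_nnsight, cuda_init = 1.3, 2.3, 0.4

# Ratio of isolated to in-process latency (quoted as ~370x after rounding).
slowdown = isolated_trace / in_process_trace

# The three named components, and what is left over for warmup.
named_parts = import_torch + import_nnsight + cuda_init
warmup = bring_up - named_parts  # derived, not measured directly

print(f"slowdown about {slowdown:.0f}x")
print(f"named parts {named_parts:.1f} s, implied warmup about {warmup:.1f} s")
```

With these rounded inputs the ratio comes out near 375, consistent with the quoted ~370x, and the three named components account for about 4.0 s of the 4.2 s bring-up.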

Amortize the ~4.5 s per-request spawn cost of isolated execution (cold import torch + import nnsight + CUDA context init, measured model-independent). A worker is now generic rather than mediator-bound: _pool_worker_main warms CUDA/imports/mount once, sends a one-time "ready" ack, then loops serving ("job", payload, extras, opts) messages — deserializing a fresh mediator against fresh dummies per job (only the ~3 ms payload changes per request). The CUDA context, kernels, bounce buffer, and channel persist across jobs. This unifies the cold and pooled paths; the worker always loops, the host decides recycle-vs-kill. Host side: a process-global thread-safe _WorkerPool persists across traces. acquire_isolated_worker pulls an idle worker (or lazily grows to the pool_size cap, or a cold one-shot worker past the cap so a trace never blocks), ships the job, and re-points the channel's meta_provider/on_push at this mediator. Mediator.cancel calls release_isolated_worker. Recycle-safety: only a cleanly-ended worker is reused. handle_end_event sets _iso.clean when an END is consumed; release recycles iff clean & poolable & alive & not dirty. A worker drained mid-protocol with a Cancelation (pipe unbalanced), a timeout/death (spinning, not idle), or a cold one-shot worker is retired and the pool re-warms lazily. Recycle resets the host channel (CudaIpcHostChannel.reset) + per-job hook-registration state; the worker rebuilds its interleaver/dummies and clears Globals.saves per job, so no cross-trace state leaks. Opt-in: isolate_mediators(..., pool_size=N) routes through the pool (pool_size=0 is the unchanged cold path); warm_worker_pool(N) pre-warms at startup, shutdown_worker_pool() tears down. Pool sizing is a GPU-memory budget: each warm worker costs ~0.55 GiB GPU per GPU touched (model-weight-independent, not reduced by MPS), ceiling = batch size. 
Verified (test_isolated_pool.py, gpt2/A100): reuse bit-identical (max|Δ|=0) at ~21x faster once warm (4.57 s -> 0.22 s) with PIDs reused; 3-invoke trace draws 3 distinct workers; hung worker retired + pool re-warms; non-standard-named model works. Cold path stays bit-identical across read/swap/save/multi/exception/hang/multitoken/cross-invoke/barrier/nonstd. See docs §14. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
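
The acquire/release rules described above can be sketched without any GPU. This is an illustrative model only: `Worker` and `WorkerPool` are hypothetical stand-ins for the real worker handle and _WorkerPool, workers are plain records rather than processes, and the liveness handling added in the later hardening commit is left out.

```python
import threading
from dataclasses import dataclass


@dataclass
class Worker:
    pid: int
    poolable: bool = True   # False for a cold one-shot worker
    alive: bool = True
    clean: bool = False     # set once an END was consumed
    dirty: bool = False     # set when drained mid-protocol


class WorkerPool:
    def __init__(self, pool_size):
        self.pool_size = pool_size
        self._lock = threading.Lock()
        self._idle = []
        self._live = 0          # pooled workers that exist (idle or busy)
        self._next_pid = 1000   # fake pid counter for the sketch

    def _spawn(self, poolable):
        self._next_pid += 1
        return Worker(pid=self._next_pid, poolable=poolable)

    def acquire(self):
        with self._lock:
            if self._idle:                      # reuse a warm worker
                worker = self._idle.pop()
                worker.clean = False
                return worker
            if self._live < self.pool_size:     # grow lazily up to the cap
                self._live += 1
                return self._spawn(poolable=True)
            return self._spawn(poolable=False)  # past the cap: cold one-shot

    def release(self, worker):
        with self._lock:
            recycle = (worker.clean and worker.poolable
                       and worker.alive and not worker.dirty)
            if recycle:
                self._idle.append(worker)
            elif worker.poolable:
                self._live -= 1                 # retired; re-warm on demand
            return recycle


pool = WorkerPool(pool_size=1)
first = pool.acquire()
overflow = pool.acquire()          # cap reached, so this one is cold
first.clean = True                 # its trace reached END
recycled = pool.release(first)
reused = pool.acquire()
```

Because a trace past the cap gets a cold worker instead of waiting, `acquire` never blocks; the cost is that the overflow worker is thrown away on release.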

An adversarial review (no Critical issues: the cross-request data invariant
holds, with per-job fresh interleaver/dummies/Globals.saves and the host
channel reset before re-bind) surfaced robustness gaps, now fixed:

- Dead idle worker: acquire skipped the liveness check, so a worker that died
  while idle (OOM-killed by a neighbor, crash) was handed out -> broken-pipe
  trace failure AND the dead worker was never forgotten (permanent cap
  erosion). acquire now skips/forgets dead idle workers and re-spawns;
  acquire_isolated_worker retries once through the pool if send_job hits a
  dead worker.
- Multi-device aliasing (silent corruption): the pool's device was frozen at
  first warm, so a second model on another GPU drew a worker whose bounce
  buffer lived on the first GPU -> cross-device copy. The pool is now keyed
  per (device, arena_bytes, gpu_mem_fraction, lockdown) signature.
- Exception re-warm tax: clean was set only on END, so a user-exception worker
  (alive, pipe balanced) was retired -> every erroring trace paid a ~4 s
  re-warm. handle_exception_event now marks the isolated worker clean so it is
  recycled (cancel's dirty check still retires a mid-protocol worker).
- First-event hang-containment: a recycled worker's first event used the cold
  180 s startup_timeout; it now uses timeout + a deserialize margin, since
  spawn/warm completes before the "ready" ack.
- Over-provision: the grow slot is reserved under the lock so concurrent
  acquires can't exceed the cap.
- Resource cleanup: close() now closes the pipe fd + drops the GPU buffer; a
  _shutting_down flag stops a shutdown/release race from orphaning a worker.

Tests: test_isolated_pool.py gains dead-idle, exception-recycle, and (2-GPU)
multi-device cases; all 7 pass bit-identical. Cold path (pool_size=0) still
bit-identical across trace/acceptance(names,multi,exc,hang)/cross-invoke. Doc
§14 updated with the hardening notes + lockdown cold-vs-pooled divergence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

tracer.cache() registers persistent hooks (mediator_idx=inf) that fill a .save()'d CacheDict during the forward; in the worker those hooks landed on the dummy modules and never fired, so the user got an empty cache. Now: - Worker cache() (isolated): ship the spec (token, module-paths, device, dtype, detach, include_output, include_inputs, rename, alias) via a new Events.CACHE request instead of registering dummy hooks; return a token-tagged placeholder CacheDict the user binds + .save()s. - Host handle_cache_event: resolve paths to the real envoys, register the real cache_output/input_hook into a host Cache keyed by token (Mediator._iso_caches), set_user_cache, ack. Hooks live on the host mediator and are dropped at teardown by remove_hooks, like in-process. - handle_cache_event acks + returns True, so the host loop processes CACHE then END consecutively at Mediator.start (before the forward). handle_end_event then swaps the host CacheDict reference in for the worker's empty placeholder (matched by token); the forward fills that same object in-place, so the user's variable IS the forward-filled host cache. No separate post-forward injection step. The substitution is gated on _iso_caches, so non-cache traces are untouched. Verified (test_isolated_cache.py, gpt2/A100): single module, multi-module, and include_inputs=True all bit-identical (max|Δ|=0, keys match in-process). Full isolated regression (trace/acceptance/multitoken) and cold path unchanged. The test now derives cache keys from envoy.path instead of a hardcoded prefix. See docs §15. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
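
The token-matched substitution can be shown in miniature. A rough sketch under stated assumptions: the `CacheDict` here is a toy dict subclass, the handlers take a plain dict in place of Mediator._iso_caches, and hook registration is replaced by a direct write that stands in for the forward filling the cache.

```python
class CacheDict(dict):
    """Toy stand-in for the cache container; `token` tags a placeholder."""

    def __init__(self, token=None):
        super().__init__()
        self.token = token


def handle_cache_event(host_caches, token):
    # Host: create the real cache that forward hooks will fill, keyed by token.
    host_caches[token] = CacheDict(token)
    return True  # ack


def handle_end_event(host_caches, saves):
    # Host: swap any token-tagged placeholder for the host cache of that token.
    return {
        name: host_caches.get(getattr(value, "token", None), value)
        for name, value in saves.items()
    }


host_caches = {}
handle_cache_event(host_caches, "tok-1")

# The worker's END bundle carries an empty placeholder plus an ordinary value.
worker_saves = {"cache": CacheDict("tok-1"), "n": 3}
user_frame = handle_end_event(host_caches, worker_saves)

# The forward runs after END is processed and fills the same object in place.
host_caches["tok-1"]["transformer.h.0"] = "activation"
```

Since the user's variable and the host cache are one object, whatever the forward writes later is visible through `user_frame["cache"]` with no second injection step.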

The backward case hooked `.grad` on a GPT2 block's `.output[0]`: an
off-the-backward-path index into the block's tuple output, whose grad hook
never fires (a usage gotcha, confirmed via a manual register_hook). So its
in-process control ALSO errored, making the test useless as a gap
demonstration.

Switch to `model.transformer.ln_f.output`, a tensor-output module ON the
autograd path, so the in-process control is valid. Backward now succeeds
in-process and fails cleanly under isolation, which is the gap the test exists
to characterize (host-only autograd graph, detached worker clones,
id(tensor)-keyed grads). requires_grad_(True) turned out to be a red herring
(ln_f works with or without it); the discriminator is on-path tensor-output vs
off-path tuple-element view.

Verified: backward in-process=ok, isolated=fails-cleanly; cache=bit-identical
(no longer a gap).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… split with tensor.backward() now works in an isolated GPU worker for the read-then-backward case: the worker tags delivered activation clones with requester-string provenance and computes dL/d(clone) on its local tape as seeds; a new Events.BACKWARD ships them to the host, which continues torch.autograd.grad on the real graph over the retained on-graph activations and returns gradients keyed by provenance path. The backward block's .grad reads are served from that dict; .grad on user-derived tensors and .grad assignment raise clear errors. Verified bit-identical (max|delta|=0) on gpt2 ln_f.output and on a renamed model (final_norm/output_projection). Scalar loss only; swaps, batched traces, and multi-token backward remain unsupported (documented in the integration doc's new backward section). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…graph Characterized multi-token (generate + iter) backward with an in-process control first: generate() runs the forward without gradient tracking, so the first .grad read fails in-process ("cannot register a hook on a tensor that doesn't require gradient") — multi-token backward is unsupported on both paths and no silent-wrong is possible (there is no graph at all; the earlier per-step retention-overwrite concern is moot). The isolated path failed at the same user line but blamed the wrong cause ("off the backward path from the loss"). The host now signals the no-graph case — handle_backward_event returns a marker when no retained activation requires grad — and the worker's .grad error names the grad-less forward and points at model.trace(). Characterization script kept as a regression test asserting that message; single-pass backward unaffected (max|delta|=0). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…o WorkerMediator _run_one_job built the worker mediator by monkeypatching the deserialized instance (end/exception closures, a request wrapper when backward is active) alongside a module-global backward-context dict with its own reset choreography. The job mediator is now adopted into a WorkerMediator(Mediator) subclass via __class__ swap: the closures become method overrides (end ships Globals.saves-filtered locals on END, exception degrades to a picklable form, request tags delivered clones with requester provenance when the trace differentiates), the meta/push piggyback callbacks become methods bound to the channel, and the backward context collapses to instance attributes plus a single current-mediator pointer read by worker_backward_context(). _run_one_job is now deserialize -> adopt -> wire -> run. Behavior-preserving: full isolated suite (trace, acceptance, multi-token iteration, cross-invoke, warm pool, cache, backward, multi-token-backward characterization, renamed-model) all pass; in-process regression 51 passed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…d_job The backward detection (".backward(" in the intervention source) ran independently on the host (_wire_host_channel) and in the worker (_run_one_job), and _build_job recomputed the cross_invoker gate that Mediator.start had already decided. Both decisions now happen once in _build_job and ride worker_opts: the host reads backward_active when wiring the channel (gating real-activation retention), the worker reads it at adopt time (gating delivered-clone tagging), and cross_invoker reuses the mediator's already-set value. This is now the single place to tighten the substring detection (it can false-positive in comments, costing only needless tagging). Isolated trace/backward/cross-invoke/multi-token tests all pass. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…y map Per-job worker-handle state was reset twice: at release (reset_for_release) and again at the next acquire (_wire_host_channel). Acquire now owns the authoritative reset (it runs unconditionally for both pooled and cold workers); release keeps only reference-dropping so an idle worker doesn't pin the last trace's hook set, path map, or — via channel.reset() — the meta/push callbacks closing over its interleaver. The {path: envoy} resolution map, previously built ad-hoc in two places (cached by host-side hook registration, rebuilt from scratch on every CACHE event), is now one lazy helper cached per job on the worker handle. Pool recycle, cache, and trace isolated tests all pass. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…e gap test

- A frozen IsoOptions dataclass replaces the four hand-copied option dicts
  (_STATE fields, _base_opts(), _WorkerPool._key(), warm_worker_pool's
  rebuild). pool_key lives on it, making the warm-time
  (device/arena/mem-fraction/lockdown) vs per-job (timeout) split explicit;
  the phantom never-set "startup_timeout" option becomes the
  _WARM_STARTUP_TIMEOUT constant.
- The CACHE event spec crosses the wire as a keyword dict instead of a 9-field
  positional tuple, so adding a cache option can't silently shift fields.
- The gap-characterization test is retired: both gaps it proved are closed and
  its assertions duplicate test_isolated_cache.py / test_isolated_backward.py
  (weaker, in the cache case).
- The doc's duplicate feature-map and support-matrix tables fold into one
  table carrying mechanism + status.
- Doc records a PRE-EXISTING break found while re-running the full suite:
  lockdown has been broken since the warm-pool unification (the worker locks
  down before its first job-recv, and unpickling the job's tokenizer extras
  needs a new transformers submodule import that seccomp blocks). Reproduced
  on the pre-refactor commit 8d09195; needs a separate fix decision.

Warm-pool suite passes after the test helper moved to IsoOptions.pool_key
(reuse/concurrent/retire/dead-idle/exception-recycle/renamed-model, plus
trace/cache/backward and the rest of the isolated suite earlier in the stack);
in-process regression 51 passed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
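
The warm-time vs per-job split can be illustrated with a small frozen dataclass. The field names follow the options named in these commits; the default values, and exposing pool_key as a property, are assumptions made for the sketch rather than the actual definition.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class IsoOptions:
    # Warm-time options: they decide which warm workers are interchangeable.
    device: str = "cuda:0"          # default values here are placeholders
    arena_bytes: int = 1 << 28
    gpu_mem_fraction: float = 0.1
    lockdown: bool = False
    # Per-job option: traces with different timeouts can share a worker.
    timeout: float = 30.0

    @property
    def pool_key(self):
        # Only warm-time fields take part in the key.
        return (self.device, self.arena_bytes,
                self.gpu_mem_fraction, self.lockdown)


short = IsoOptions(timeout=5.0)
long_running = IsoOptions(timeout=600.0)
other_gpu = IsoOptions(device="cuda:1")
```

Here `short` and `long_running` share a key and so could draw from the same warm workers, while `other_gpu` gets its own; freezing the dataclass keeps a key from changing after a worker was warmed under it.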

…ncel

After an isolated backward trace ended, the host mediator kept its references
to every retained on-graph activation, pinning those tensors and the autograd
graph behind them until the mediator was GC'd. cancel() already drops the
mediator's other ephemeral state (history, iteration tracker, worker handle);
now it also clears the retention map. Safe because every BACKWARD event
precedes the END/exception that triggers cancel.

Found by a four-angle cleanup pass over the backward + refactor stack; the
other findings were judged false positives or already-documented accepted
costs. Backward + multi-token-backward isolated tests and the in-process
regression (51 passed) stay green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

__setstate__ rebuilds the transient isolation fields (_iso, _isolated_worker, _iso_backward, _iso_grad_reals) but missed _iso_caches, so a deserialized mediator (the NDIF/vLLM server path constructs mediators via __setstate__) running under isolate_mediators() would hit AttributeError in handle_cache_event the first time a trace used tracer.cache(). Latent locally (host mediators come from __init__); found by the high-effort review pass. Isolated cache test and in-process regression (51 passed) green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ions Process isolation contains footguns by running each intervention in a spawned GPU worker — but that worker holds a weightless path-only mirror of the model, so the interp majority (logit lens, steering, ablation, activation patching, attribution) cannot run isolated at all: they read the host model's real weights (F.linear(x, head.weight)) and call its final-norm / unembed modules. The fast lane is the tier where the real weights live. Adds a third execution tier under isolate_mediators(). A fail-closed, default-deny static classifier (fastlane.py) walks the EFFECTIVE code of each mediator — the trace body plus every user closure it calls, resolved through the frame / function globals / closure cells (the harness wraps real compute in build()/capture() closures, so a walk of the with-block alone would see only an opaque call). Verdicts: FAST (only whitelisted ops / host-model access / nnsight primitives -> run in-process at full speed and full model access), ISOLATE (anything unconfirmable -> the existing GPU worker), REJECT (an introspection escape -> raise). The conservative default is ISOLATE; the gate is a footgun selector, not a malice boundary, so it is cordoned to trust="local" provenance and a CONFIG.APP.FAST_LANE flag. Default behavior is preserved: isolation off never consults the gate; isolation on now fast-lanes the confirmed-safe majority and isolates the rest. A best-effort wall-clock watchdog restores loop-containment for the one footgun the static walk cannot bound (a huge bounded range); its injected FastLaneTimeout rides the intervention body's existing try/except, so the host re-raises it cleanly. The classifier's closure-aware backward detection also replaces the old `.backward(` source substring (blind to a backward hidden in a closure) for the isolated job's grad-retention flag. 
Deferred (documented): the process-global sys.addaudithook backstop (its leaked-flag failure mode can abort the model's own forward — net-negative under a static default-deny gate); the five declarative tracer primitives (unembed/steer/patch/ablate/capture) that would let weight-reading cells also run on the isolated tier via host event handlers (a cache-shaped build). Verified: classifier units 17/17 (logit-lens/steering/patching/attribution shapes + renamed structures classify FAST; imports/while/unresolved-call/ open ISOLATE; introspection REJECT; flag detection). Fast-lane e2e 6/6 on gpt2 + a renamed model: weight-reading lens bit-identical on the fast lane (max|Δ|=0) AND raises under forced isolation; in-place steering bit-identical; footgun routes off the fast lane, host survives; introspection rejected; runaway loop killed by the watchdog, host survives. Existing isolated WORKER path (pinned with fast_lane=False) 9/9 still bit-identical; in-process core 51 passed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
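
The default-deny shape of such a gate can be sketched with the standard `ast` module. This is a much-reduced illustration, not fastlane.py: it inspects one source string only (no closure, globals, or frame resolution), and the whitelists `ALLOWED_NODES`, `ALLOWED_CALLS`, `ALLOWED_METHODS`, and `INTROSPECTION` are invented for the example. It assumes Python 3.9 or later.

```python
import ast
from enum import Enum


class Verdict(Enum):
    FAST = "fast"
    ISOLATE = "isolate"
    REJECT = "reject"


# Hypothetical whitelists; anything not listed is treated as unconfirmable.
ALLOWED_NODES = (
    ast.Module, ast.Expr, ast.Assign, ast.Name, ast.Load, ast.Store,
    ast.Constant, ast.BinOp, ast.Add, ast.Sub, ast.Mult, ast.Div,
    ast.Call, ast.Attribute, ast.keyword,
)
ALLOWED_CALLS = {"abs", "len", "min", "max", "sum"}
ALLOWED_METHODS = {"mean", "sum", "save"}
INTROSPECTION = {"__class__", "__globals__", "__subclasses__", "__code__"}


def _call_allowed(node):
    func = node.func
    if isinstance(func, ast.Name):
        return func.id in ALLOWED_CALLS
    if isinstance(func, ast.Attribute):
        return func.attr in ALLOWED_METHODS
    return False


def classify(source):
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return Verdict.ISOLATE              # fail closed on unparsable code
    verdict = Verdict.FAST
    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute) and node.attr in INTROSPECTION:
            return Verdict.REJECT           # introspection escape wins outright
        if not isinstance(node, ALLOWED_NODES):
            verdict = Verdict.ISOLATE       # imports, while loops, and so on
        elif isinstance(node, ast.Call) and not _call_allowed(node):
            verdict = Verdict.ISOLATE       # call that cannot be confirmed
    return verdict
```

The sketch keeps scanning after it sees something unconfirmable so that a later introspection escape still yields REJECT rather than ISOLATE.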

… note

docs/developing/fast-lane.md: why isolation could not run the weight-reading
interp majority (weightless worker), the three-tier design, the classifier
rules + threat-model contract, the watchdog, the prior art it borrows from
(Cloudflare Workers / RestrictedPython / SES / fx+JAX / gVisor / Firecracker /
the pysandbox negative result), the designed part-2 declarative primitives
(next increment), deferred items, and the verification matrix. Cross-linked
from the integration doc's support matrix, which now notes the
FAST/ISOLATE/REJECT tiering.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The weight-reading interp readout — F.linear(norm(residual), head.weight), done by every logit-lens / steering-direction / attribution-metric cell — cannot run in the isolated worker because its dummy modules are weightless. tracer.unembed closes that on the isolated tier without putting weights in the worker: the worker ships the residual VALUE plus the norm/head module PATHS via a new Events.UNEMBED request; the host's handle_unembed_event resolves the real envoys, runs the real norm + unembed on the real weights, and ships back only the logits (bounce-buffer round trip, clone-on-receive). Weights never cross the boundary — so this neither binds the generic warm worker to a model nor places host weight memory in the less-trusted worker (the two costs that ruled out shipping/sharing weights). In-process / on the fast lane it just runs the real modules directly. Shaped exactly like Events.CACHE / handle_cache_event. This is the first of the part-2 declarative primitives; it also means a deployment that forces pure isolation (fast_lane=False, e.g. for OOM containment) can still run weight-reading workloads if they are written with tracer.unembed. Verified (test_isolated_unembed.py, all under forced isolation, gpt2 + renamed model): single-layer / 3-layer-interleaved / formulation="module" / norm=None / renamed-model readouts isolated-vs-in-process max|Δ|=0; tracer.unembed == the manual F.linear it replaces. Isolated trace/cache, fast-lane e2e, in-process core (31), and classifier units (17) unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… isolated tier Add tracer.steer(envoy, direction, alpha), the next part-2 declarative primitive after tracer.unembed. It adds alpha*direction to a module's output residual via a *replacement* boundary write (assign envoy.output), which routes through the eproperty setter -> Events.SWAP and ships the steered value back on either tier. Steering touches no host weights — only the delivered activation — so unlike unembed it needs no host round-trip and no isolated/in-process branch: the same method is correct in-process, on the fast lane, and in the isolated worker. The point is the replacement swap. The hand-written additive form is in-place (block.output[:, -1, :] += direction); under isolation that mutates only the worker's delivered clone, no SWAP fires, the host's real activation is untouched, and the steering silently no-ops. tracer.steer makes it cross the boundary by construction. Tuple outputs (attention modules) are replaced whole, steering element [0] and carrying the tail (incl. a None) through pack_cuda untouched. The classifier already treats tracer.steer as a trusted nnsight primitive (its __module__ is nnsight.*), so no gate change is needed. Verified (test_isolated_steer.py, gpt2 + a renamed model), all max|Δ|=0: steering a block, an attention tuple output, and three blocks at once are isolated-vs-in-process bit-identical and propagate through later layers; tracer.steer equals the manual whole-tuple replacement; and the crux — under forced isolation the in-place form leaves the downstream residual at the unsteered baseline (silent no-op) while tracer.steer takes effect and matches the in-process result. Classifier units still 17/17. Doc: docs/developing/fast-lane.md §6 (steer marked built + subsection), §7, §8. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
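
Why the in-place form silently no-ops under isolation can be seen in a toy model. In this sketch plain Python lists stand in for tensors, and `Host`, `steer_in_place`, and `steer_by_replacement` are hypothetical names; the only property borrowed from the description above is that the worker gets a clone and that assignment, not mutation, produces a SWAP.

```python
import copy


class Host:
    def __init__(self, activation):
        self.activation = activation

    def deliver(self):
        # The worker receives a clone, never the host's real object.
        return copy.deepcopy(self.activation)

    def handle_swap(self, new_value):
        # SWAP event: the worker's replacement becomes the real activation.
        self.activation = new_value


def steer_in_place(host, direction):
    clone = host.deliver()
    for i, d in enumerate(direction):
        clone[i] += d            # mutates the clone only; no event is sent


def steer_by_replacement(host, direction, alpha=1.0):
    clone = host.deliver()
    steered = [x + alpha * d for x, d in zip(clone, direction)]
    host.handle_swap(steered)    # the assignment crosses the boundary


host = Host([1.0, 2.0])
steer_in_place(host, [10.0, 10.0])
after_in_place = list(host.activation)
steer_by_replacement(host, [10.0, 10.0], alpha=0.5)
after_replacement = list(host.activation)
```

After the in-place call the host still holds the unsteered values; only the replacement changes what later layers would see.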

…del coverage Add a `preimport=` option to isolate_mediators()/warm_worker_pool() that loads modules at worker warm time, BEFORE seccomp lockdown freezes new file opens (import == open()). This brings user-facing import capability under lockdown to parity with an in-process module whitelist without weakening containment: the model's own kernels (incl. triton) run host-side and are unaffected. - thread `preimport` through _STATE -> _base_opts -> the pool key (now per device/arena/mem/lockdown/preimport signature) and consume it in _pool_worker_main warmup, before lock_down(). - test_isolated_triton_model.py: a @triton.jit-kernel model traced under isolation+lockdown is bit-identical to in-process (host compiles triton while the worker is fully locked down); worker-side triton in the intervention is blocked. Requires GPU + triton. - docs §16: the triton deployment motivation, the strictly-better-than-upstream module-restriction comparison, and the verified timeout-directionality and cold-vs-pool lockdown-ordering facts. Fix the stale docstring claiming the cold path deserializes before lockdown (the unified worker locks down before both). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
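
The ordering constraint (import first, lock down second) can be demonstrated without seccomp. The following is a toy: `_DenyNewImports` is a hypothetical import hook that imitates the effect of blocked file opens, `warm_worker` is not the real _pool_worker_main, and `late_user_module` is a made-up module name.

```python
import importlib
import sys


class _DenyNewImports:
    """Toy lockdown: refuse any module that is not already loaded."""

    def find_spec(self, name, path=None, target=None):
        raise ImportError(f"lockdown: cannot import {name}")


def warm_worker(preimport, lock_down):
    """Import the allowed modules, then apply the lockdown callable."""
    loaded = []
    for name in preimport:
        importlib.import_module(name)   # must run before lock_down()
        loaded.append(name)
    lock_down()
    return loaded


def demo():
    deny = _DenyNewImports()
    loaded = warm_worker(["colorsys"], lambda: sys.meta_path.insert(0, deny))
    try:
        importlib.import_module("colorsys")              # cached: still works
        try:
            importlib.import_module("late_user_module")  # new import: refused
            message = "imported"
        except ImportError as err:
            message = str(err)
    finally:
        sys.meta_path.remove(deny)                       # undo the toy lockdown
    return loaded, message
```

A module loaded during warmup is served from `sys.modules` afterwards, which is why preimporting restores import capability without loosening the lockdown itself.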

The host read worker frames with mp.Pipe.recv() (pickle), and the worker runs
UNTRUSTED user code, so a crafted __reduce__ gadget over the control plane was
a host-side RCE that would bypass every other isolation layer (seccomp,
namespaces, row-bounding). The design hardened the inbound user payload
(unpickled inside the worker) but not the outbound results. Close it:

- worker->host is now tensor-free (tensors already ride the GPU buffer /
  safetensors) and the small remaining structure is decoded with a RESTRICTED
  unpickler (transport._RestrictedUnpickler / _safe_loads): find_class allows
  ONLY torch dtype/device and refuses every other class/function. find_class
  resolves a global before the REDUCE that would call it, so a gadget is
  refused before it can execute.
- the event rides as its string .value (no enum class) and exceptions as a
  (type-name, message) sentinel (no class), so the allowlist stays {torch
  dtype, device}. host->worker stays normal pickle (host-authored, trusted).
- this also fixes a real correctness gap a prior hand-rolled JSON codec had:
  the Events.CACHE spec carries torch.dtype/torch.device, which the tagged
  codec rejected; pickle handles all plain nested structures natively (no
  per-type enumeration), and anything un-allowlisted fails loud at decode with
  the class name.
- capability narrowing: .save() of an arbitrary object / numpy / framework
  type (e.g. ModelOutput) is no longer transmittable from a worker; save a
  tensor instead.

test_isolated_codec_security.py: fidelity for VALUE/SWAP/END/CACHE(dtype,device)/
EXCEPTION/push, and a genuine __reduce__ gadget refused at decode without
executing. CPU is enough (needs torch). Legacy AF_UNIX socket channels are
unused by isolate_mediators and still plain-unpickle (noted in the module
header).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
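
The find_class allowlist follows the pattern the standard library documents for restricting globals. A minimal sketch with the stdlib only: the names are illustrative rather than those in transport.py, and `collections.OrderedDict` stands in for the torch dtype/device allowlist so the example runs without torch.

```python
import collections
import io
import pickle

# Stand-in allowlist; the real one is described as torch dtype/device only.
ALLOWED = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called when the stream names a global, before any REDUCE can call it.
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"refused global {module}.{name}")


def safe_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Gadget:
    def __reduce__(self):
        # On a plain pickle.loads this would call print(...) on the host.
        return (print, ("gadget executed",))


plain = safe_loads(pickle.dumps({"event": "end", "saves": [1, 2.5, "x"]}))
allowed = safe_loads(pickle.dumps(collections.OrderedDict(a=1)))

try:
    safe_loads(pickle.dumps(Gadget()))
    refusal = None
except pickle.UnpicklingError as err:
    refusal = str(err)
```

Plain nested containers need no globals and decode untouched, while the gadget is stopped at the lookup of `builtins.print`, so its payload never runs.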

Capture the security analysis behind the GPU-worker backend: the attacker model, the asset list (host integrity / fs / net / cross-tenant host+GPU memory / DoS / deser-RCE), the R0-R4 configuration ladder with the cost coupling (closing a deeper threat forces a slower data path; the cliff is R2->R3, i.e. leaving the GPU), which sandbox controls are compatible vs incompatible with the shared-GPU CUDA-IPC design, the co-batch tenant-isolation invariant (the empty-invoke full-batch hole; the Batcher/Interleaver = tenant boundary), and the worker->host restricted-unpickler fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…pickler codec + warm-time preimport

Integrate the parallel isolation security/compat work (8f88986, f72a397, 151e5a5) with the local backward/fast-lane/unembed/steer line.

What came in from the remote:

- Worker->host restricted-unpickler codec (transport._RestrictedUnpickler / _safe_loads): worker-authored frames never plain-pickle.loads on the trusted host; find_class allows only torch dtype/device, the event rides as its string .value, exceptions as a (type-name, message) sentinel. Closes a worker->host RCE. transport.py auto-merged clean.
- Warm-time preimport: load deployment-allowed modules before seccomp lockdown freezes new file opens (import == open()). This is the mitigation for the documented lockdown break and gives import-parity with the in-process whitelist.
- mediator-threat-models.md (R0-R4 ladder + cost coupling + co-batch tenant isolation + the codec fix) and the Triton-model deployment motivation.
- Triton-model + codec-security prototype tests.

Conflict resolution (isolation.py, integration doc):

- The remote added preimport to the OLD flat _STATE dict; this line had already refactored to the frozen IsoOptions dataclass. Kept IsoOptions and folded preimport into it (new field + added to pool_key, since the preimport set is warm-time and defines pool interchangeability). Dropped the remote's _base_opts / _WorkerPool._key in favor of _STATE["opts"] / IsoOptions.pool_key.
- Worker bootstrap runs the preimport loop off worker_iso_opts.preimport.
- Adopted the remote's corrected lockdown-timing wording (cold and pooled share the unified worker, which locks down before any job deserializes; the cold-vs-pool difference is recycle-vs-retire, not lockdown timing), replacing this line's now-inaccurate "cold deserializes before lockdown" claim.
- Both sides added a "§16"; kept §16 = backward read-path (already referenced by the committed §11 title and support matrix), renumbered the Triton section to §17, and fixed the two Triton matrix rows + the threat-models doc's two external §16 references. Merged the two support-matrix versions (kept the fast-lane/3-tier matrix, added the Triton rows + an unembed/steer row).

Also fixed a bug in the pulled codec-security test (never run per the doc's own open-items list): the non-allowlisted probe class was function-local, so pickle could not even ENCODE the frame (Can't get local object), masking the decode-time refusal it meant to test. Moved it to module scope.

Verified post-merge (hf-serve, A100): classifier 17/17; codec security PASS (fidelity incl. CACHE dtype/device + gadget/non-allowlisted refusal); and isolated trace (read/swap), unembed (UNEMBED frame through the new codec), steer (SWAP through the codec), and acceptance (names/multi/exception-sentinel/hang) all bit-identical max|Δ|=0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
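
For readers unfamiliar with the pattern, the restricted-unpickler idea and the test bug described in this commit can be sketched with the standard library alone. This is an illustrative sketch, not the project's transport._RestrictedUnpickler: the allowlist below is a hypothetical pair of builtins (the commit's allowlist is torch dtype/device), and `safe_loads`, `Gadget`, `make_local_probe`, `decode_is_refused`, and `encode_fails` are invented names.

```python
import io
import pickle

# Hypothetical allowlist for illustration only; the commit's find_class
# allows only torch dtype/device.
_ALLOWED = {("builtins", "complex"), ("builtins", "frozenset")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Every global the pickle stream asks for passes through here.
        if (module, name) in _ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global {module}.{name} is not allowed")


def safe_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Gadget:
    """Module-scope probe: it encodes fine and must be refused at decode."""

    def __reduce__(self):
        return (print, ("gadget ran",))


def make_local_probe():
    # A function-local class cannot be pickled at all, so a test built on
    # it fails at ENCODE and never exercises the decode-time refusal.
    class LocalProbe:
        pass

    return LocalProbe()


def decode_is_refused(obj) -> bool:
    data = pickle.dumps(obj)  # encoding must succeed for the test to mean anything
    try:
        safe_loads(data)
    except pickle.UnpicklingError:
        return True
    return False


def encode_fails(obj) -> bool:
    try:
        pickle.dumps(obj)
    except (AttributeError, pickle.PicklingError):
        return True
    return False
```

With these helpers, `decode_is_refused(Gadget())` exercises the refusal path, while `encode_fails(make_local_probe())` reproduces the masked failure the commit fixed by moving the probe class to module scope.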

…lgebra codec

The worker->host (untrusted) direction no longer runs pickle's VM on worker bytes. Instead the boundary transmits a CLOSED VALUE ALGEBRA (never live objects), so there is no opcode that can call a function in the first place:

    Value = None | bool | int | float | str | bytes
          | list | tuple | dict | set of Value
          | torch.dtype | torch.device
          | Array(...)   (torch tensors AND numpy arrays, OUT-OF-BAND)

This makes the boundary value-semantic, and makes "safe + correct" a property of the type rather than a bet on a restricted unpickler:

- SAFE by construction: _codec_loads is pure data assembly (no globals, no find_class, no REDUCE), plus a size cap + bounds-checked reads for decode-bomb DoS. The previous restricted unpickler is removed entirely.
- FAITHFUL: pack_cuda's Array leaf is generalized from torch.is_tensor to also cover numpy.ndarray (bridged through torch, re-materialized host-side as an ndarray), so numpy `.save()`s now cross; they were silently refused before.
- HONEST contract: a value outside the algebra (custom object / framework type) is rejected at the WORKER, at ENCODE, with a clear BoundaryValueError naming it at its source, not an encode-ok / decode-refuse split.

tracer.cache() shipped a live CacheDict placeholder, which is not a value; the worker now ships its token as a `{_ISO_CACHE_TAG: token}` marker (same shape as the EXCEPTION sentinel) and the host swaps in its forward-filled cache by token. (This also fixes cache under isolation, which the merged restricted unpickler had broken: it refused CacheDict the same way.)

host->worker stays plain pickle (host-authored, trusted).

Verified (nnsight-tf: py3.11 / torch 2.11 / transformers 5.12):

- codec unit (test_isolated_codec_security.py): fidelity over the algebra incl. numpy + dtype/device; a __reduce__ gadget and a custom object rejected at encode before any __reduce__ runs; malformed/oversized/unknown-tag raise cleanly.
- full isolated GPU suite bit-identical max|Δ|=0: trace, unembed, steer, cache, backward, multitoken-iter, cross-invoke, pool, acceptance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
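
To make "pure data assembly" concrete, here is a minimal sketch of a closed value-algebra codec over a subset of the algebra (None, bool, int, float, str, bytes, list, tuple, dict). It is not the project's transport._codec_dumps / _codec_loads: the wire format, the one-byte tags, `MAX_FRAME`, and `MAX_DEPTH` are assumptions made for illustration, and the set, torch dtype/device, and out-of-band Array leaves are omitted. Only the name BoundaryValueError is taken from the commit message.

```python
import struct

MAX_FRAME = 1 << 20   # hypothetical size cap on one frame
MAX_DEPTH = 64        # hypothetical nesting cap


class BoundaryValueError(TypeError):
    """Raised at encode time for a value outside the algebra."""


def _u32(n):
    return struct.pack("<I", n)


def _enc(v, out):
    t = type(v)  # exact types only: subclasses are outside the algebra
    if v is None:
        out += b"N"
    elif t is bool:
        out += b"T" if v else b"F"
    elif t is int:
        raw = v.to_bytes((v.bit_length() + 8) // 8, "little", signed=True)
        out += b"I" + _u32(len(raw)) + raw
    elif t is float:
        out += b"D" + struct.pack("<d", v)
    elif t is str:
        raw = v.encode("utf-8")
        out += b"S" + _u32(len(raw)) + raw
    elif t is bytes:
        out += b"B" + _u32(len(v)) + v
    elif t is list or t is tuple:
        out += (b"L" if t is list else b"U") + _u32(len(v))
        for item in v:
            _enc(item, out)
    elif t is dict:
        out += b"M" + _u32(len(v))
        for key, val in v.items():
            _enc(key, out)
            _enc(val, out)
    else:
        # Rejected at the source; no __reduce__ or other hook is ever called.
        raise BoundaryValueError(f"{t.__qualname__} is outside the value algebra")


def codec_dumps(value) -> bytes:
    out = bytearray()
    _enc(value, out)
    return bytes(out)


def _take(data, pos, n):
    # Bounds-checked read: never slices past the end of the frame.
    if pos + n > len(data):
        raise ValueError("truncated frame")
    return data[pos:pos + n], pos + n


def _dec(data, pos, depth):
    if depth > MAX_DEPTH:
        raise ValueError("nesting too deep")
    tag, pos = _take(data, pos, 1)
    if tag == b"N":
        return None, pos
    if tag in (b"T", b"F"):
        return tag == b"T", pos
    if tag == b"D":
        raw, pos = _take(data, pos, 8)
        return struct.unpack("<d", raw)[0], pos
    if tag not in (b"I", b"S", b"B", b"L", b"U", b"M"):
        raise ValueError(f"unknown tag {tag!r}")
    raw, pos = _take(data, pos, 4)
    n = struct.unpack("<I", raw)[0]
    if tag in (b"I", b"S", b"B"):
        body, pos = _take(data, pos, n)
        if tag == b"I":
            return int.from_bytes(body, "little", signed=True), pos
        return (body.decode("utf-8") if tag == b"S" else body), pos
    items = []
    for _ in range(n * 2 if tag == b"M" else n):
        item, pos = _dec(data, pos, depth + 1)
        items.append(item)
    if tag == b"L":
        return items, pos
    if tag == b"U":
        return tuple(items), pos
    try:
        return dict(zip(items[0::2], items[1::2])), pos
    except TypeError:
        raise ValueError("unhashable dict key") from None


def codec_loads(data: bytes):
    if len(data) > MAX_FRAME:
        raise ValueError("frame exceeds size cap")
    value, pos = _dec(data, 0, 0)
    if pos != len(data):
        raise ValueError("trailing bytes")
    return value
```

The decoder only ever constructs builtin values from tags it recognizes, so there is no code path through which worker bytes can name or call a function. A declared element count larger than the frame cannot cause a large allocation either, because each element read is bounds-checked and fails on the first missing byte.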

…mitives for the isolated tier

The two remaining single-write part-2 primitives, structural twins of tracer.steer: both are replacement boundary writes that touch no host weights, so they ride the existing Events.SWAP with no new event, no host handler, and no isolated/in-process branch. Done in place under isolation, each silently no-ops (the worker mutates its delivered clone, no SWAP fires); the replacement swap makes them cross the boundary by construction.

- tracer.patch(envoy, value): transplant a precomputed value into a module's output residual (activation patching / resampling). Cast to the residual's dtype/device so a value precomputed on CPU (the isolation case) transplants cleanly. Whole-tuple replacement (element [0]).
- tracer.ablate(envoy, mode="zero"): zero/mean knockout. mode="mean" is the self-contained within-sequence mean; reference-distribution (dataset) mean ablation is a precomputed value transplanted via tracer.patch. It is not derivable from a single forward, so it is kept distinct to avoid silent wrong-mean semantics. Unknown mode raises ValueError.

Verified bit-identical (max|Δ|=0) under forced isolation vs in-process on gpt2 + a renamed model: test_isolated_patch.py 6/6, test_isolated_ablate.py 7/7, including the crux that the in-place form is a silent no-op under isolation while the primitive takes effect. No regression in steer/unembed/trace/acceptance/cache/cross_invoke/backward.

Docs: docs/developing/fast-lane.md §6 (built), new subsections, §7/§8 updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014ZUUF44B2tfuKBNFhDFedR
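
The difference between an in-place edit and a replacement write across the boundary can be shown with a toy model. This is a sketch under stated assumptions, not nnsight code: `Host`, `deliver`, `swap`, and this standalone `ablate` are hypothetical stand-ins, plain Python lists stand in for tensors, and `copy.deepcopy` stands in for clone-on-receive.

```python
import copy


class Host:
    """Stand-in for the model server: it owns the real activation."""

    def __init__(self, activation):
        self.activation = activation

    def deliver(self):
        # The worker receives a clone and never aliases host memory.
        return copy.deepcopy(self.activation)

    def swap(self, replacement):
        # Stand-in for handling a SWAP event: install the replacement value.
        self.activation = copy.deepcopy(replacement)


def ablate(values, mode="zero"):
    """Return a replacement value; mode='mean' is the within-sequence mean."""
    if mode == "zero":
        return [0.0 for _ in values]
    if mode == "mean":
        mean = sum(values) / len(values)
        return [mean for _ in values]
    raise ValueError(f"unknown ablation mode: {mode!r}")


host = Host([1.0, 2.0, 3.0])

clone = host.deliver()
clone[:] = [0.0, 0.0, 0.0]            # in-place edit: only the worker's clone changes
after_inplace = list(host.activation)  # host value is untouched

host.swap(ablate(host.deliver(), mode="mean"))  # replacement write crosses the boundary
after_swap = list(host.activation)
```

Here `after_inplace` is still `[1.0, 2.0, 3.0]`, which is the silent no-op the commit describes, while `after_swap` is `[2.0, 2.0, 2.0]` because the replacement was shipped back and installed.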

… .carry() primitive

model.session() carries values across its inner traces in-process (each inner trace pushes its locals up to the session frame; the session's exit-push surfaces only saves). Under isolation each inner trace runs in a worker that shipped only its .save()'d locals home, so the two-hop push was broken two ways:

- a SAVED value used cross-trace was written into the session frame but its host id was never re-registered in Globals.saves, so the session's exit-push dropped it (UnboundLocalError);
- a NON-saved value used cross-trace was never shipped at all (NameError).

The realized form of the last part-2 primitive (the run<->run handoff originally specced as tracer.capture, which collided with the existing Tracer.capture(frame) AST method):

- Saved-case fix: when the isolated END target is a nested/session frame, the host writes the worker's values into it AND re-registers the saved values' host ids in Globals.saves, so the session's exit-push keeps them. Root (single-trace) writeback is unchanged. Makes the documented `hs = x.save()` -> use `hs` session pattern work under isolation.
- .carry() (universal value method, like .save(); plus nnsight.carry(x)): hand a value to a later trace in the session WITHOUT surfacing it as an output. The worker end() now ships saved-union-carried locals as (values, saved_names); the host writes all to the session frame (next trace sees them) but registers only the saved ones, so carried values drop at session exit, which is exactly in-process non-saved semantics, made explicit. With no .carry() in use the payload equals the prior saved-only one, so the single-trace path is unchanged. .carry() is portable: harmless in-process, load-bearing under isolation.

Root cause confirmed by host-side instrumentation (saved value reaches the session frame but host Globals.saves stays empty -> session root-push drops it).

Verified (nnsight-tf, GPU7, gpt2 + renamed model): test_isolated_session_handoff.py 6/6 all max|Δ|=0 (saved + carried handoff isolated==in-process; nnsight.carry==method; carried value not surfaced to caller while saved is; .carry() in-process==isolated). No regression across the isolated suite (trace/cache/backward/multitoken-iter/cross-invoke/acceptance/steer/patch/ablate/unembed/pool) and in-process core test_lm.py 75/75.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014ZUUF44B2tfuKBNFhDFedR
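
The saved-versus-carried bookkeeping described above can be modeled in a few lines. This is a toy model under assumptions, not the interleaver code: `worker_end`, `Session`, `on_end`, and `exit_push` are hypothetical names, a plain dict stands in for the session frame, and a set of object ids stands in for Globals.saves.

```python
def worker_end(frame_locals, saved_names, carried_names):
    """Worker side: ship saved-union-carried locals plus the saved names."""
    shipped = set(saved_names) | set(carried_names)
    values = {k: v for k, v in frame_locals.items() if k in shipped}
    return values, set(saved_names)


class Session:
    def __init__(self):
        self.frame = {}      # session frame: visible to later inner traces
        self.saves = set()   # ids of saved values (stand-in for Globals.saves)

    def on_end(self, values, saved_names):
        for name, value in values.items():
            self.frame[name] = value          # every shipped value is written
            if name in saved_names:
                self.saves.add(id(value))     # only saved ones are registered

    def exit_push(self):
        # Only registered values surface to the caller at session exit.
        return {k: v for k, v in self.frame.items() if id(v) in self.saves}


session = Session()
values, saved = worker_end(
    {"hs": [1.0, 2.0], "tmp": [9.0], "scratch": [0.0]},
    saved_names={"hs"},
    carried_names={"tmp"},
)
session.on_end(values, saved)

visible_to_next_trace = sorted(session.frame)
surfaced = session.exit_push()
```

In this model `visible_to_next_trace` is `['hs', 'tmp']` and `surfaced` is `{'hs': [1.0, 2.0]}`: the carried value reaches the next trace but drops at session exit, and the unmarked local never leaves the worker. The example uses distinct list objects because registration by `id()` would conflate small interned values.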

…ialize

The isolated worker installed its seccomp filter before the job loop (warm time), so the very first conn.recv(), which unpickles the host's job message with standard pickle, triggered a lazy transformers submodule import (transformers loads modeling submodules only at unpickle time, so preimport=("transformers",) did NOT help) → open() → EPERM under seccomp. EPERM is an OSError, which the recv loop's `except (EOFError, OSError): break` swallowed → os._exit(0) → the host saw a pipe EOF and reported "worker died during execution." This broke every lockdown=True trace (root cause confirmed by worker-side instrumentation: death at conn.recv, importing transformers/models/gpt2/__init__.py).

The job message and mediator payload are host-authored, TRUSTED data; only the user intervention code is untrusted. So lockdown belongs after deserialization, before user code, exactly what _sandbox.lock_down's own docstring already stated.

Move lock_down() out of _pool_worker_main's warm section into _run_one_job, installed once (guarded by a worker global) after the first job's payload is deserialized and before its intervention runs:

- the first conn.recv runs unlocked, so a fresh worker's first job needs no preimport=;
- one-way + once, so a warm pool's later jobs run under the first job's lockdown: a homogeneous model needs nothing (already imported), a different model needs preimport=;
- cold (pool_size=0) and pooled share the path, so both deserialize their first job first.

Containment is unchanged: user-code open/socket/exec under lockdown are still blocked and now surface as a clean NNsightException (shipped via the EXCEPTION path) rather than a silent death.

Verified (nnsight-tf, GPU7, gpt2): test_isolated_lockdown_safety.py 4/4: read under lockdown max|Δ|=0, fs/net blocked, and a NEW warm-pool case (3 traces on one pooled worker under lockdown, all bit-identical).

No regression: trace/cache/pool/session_handoff pass (the lockdown=False default path is untouched).

Docs: mediator-gpu-trace-integration.md support matrix + lockdown-ordering notes updated (break -> fixed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014ZUUF44B2tfuKBNFhDFedR
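
The ordering argument (deserialize first, lock down once, then run user code) can be simulated without seccomp. The sketch below is a toy model, not the worker implementation: `lock_down`, `guarded_open`, `deserialize`, and `run_one_job` are simplified stand-ins, a boolean replaces the seccomp filter, and a set replaces the import cache.

```python
_locked = False      # worker-global guard: lockdown is one-way, installed once
_imported = set()    # stand-in for the interpreter's import cache


def lock_down():
    """Stand-in for _sandbox.lock_down; the real one installs a seccomp filter."""
    global _locked
    _locked = True


def guarded_open(path):
    """Stand-in for open(): fails with EPERM once the worker is locked down."""
    if _locked:
        raise PermissionError(1, "Operation not permitted", path)
    return f"<contents of {path}>"


def deserialize(job):
    # Trusted, host-authored payload. Unpickling may trigger a lazy import,
    # which needs an open() the first time a module is seen.
    module = job["needs_module"]
    if module not in _imported:
        guarded_open(module)
        _imported.add(module)
    return job["intervention"]


def run_one_job(job):
    intervention = deserialize(job)   # the first job deserializes unlocked
    if not _locked:
        lock_down()                   # after deserialization, before user code
    try:
        return ("END", intervention())
    except OSError as exc:
        # Shipped back as an exception event instead of killing the worker.
        return ("EXCEPTION", type(exc).__name__)


first = run_one_job({"needs_module": "model_a", "intervention": lambda: 42})
second = run_one_job({
    "needs_module": "model_a",
    "intervention": lambda: guarded_open("/etc/hostname"),
})
```

In this model the first job returns `("END", 42)` with no preimport, and the second job's file access surfaces as `("EXCEPTION", "PermissionError")`. A later job that names a module the worker has never imported would fail inside `deserialize`, which corresponds to the "a different model needs preimport=" case. The original bug is also visible here: `PermissionError` is a subclass of `OSError`, so a broad `except (EOFError, OSError)` around the receive loop hides it.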

…es, session handoff, backward state

Bring docs/developing/mediator-gpu-trace-integration.md §8 (and the §10 cross-trace note) up to date with the features landed this session (the detail already lived in fast-lane.md §6):

- support matrix: the part-2 primitive row now lists all of unembed/steer/patch/ablate; a new row documents session cross-trace handoff (.save() used in a later trace, and .carry()/nnsight.carry).
- backward row: reflects the current state. Read-path is bit-identical; multi-token backward is a clean-fail (in-process doesn't support it either); grad-through-a-swap cleanly errors (the swapped value is a host-side leaf, severing the host graph at the seam), which is the next backward increment.
- §10 cross-trace note: the per-job reset clears Globals.shared too, and clarifies that the no-leak property is about UNRELATED traces; intentional in-session handoff is a separate supported path.

Docs-only; no code change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014ZUUF44B2tfuKBNFhDFedR

… backward seam

An isolated SWAP installs the worker-computed value on the host as a detached leaf (clone-on-receive strips grad_fn), severing the host autograd graph at the swap point. So a downstream loss differentiated w.r.t. an UPSTREAM activation dead-ended at the swap ("no gradient available ... off the backward path"), while in-process gradients flow through swaps. The read-path backward split the chain rule once at the worker→host seam; a swap adds a second seam that splits it the other way (host downstream → worker swap tape → host pre-swap).

Fix: iterate the existing Events.BACKWARD exchange to a fixpoint over swap seams.

- Host (interleaver.py): handle_swap_event, under _iso_backward, makes the swap leaf requires_grad_(True) and retains it (_iso_grad_swaps) so the downstream forward tracks it and it is a backward target; handle_backward_event adds swap leaves to its targets and returns dL/d(swap leaf) under a reserved key (kept separate from reals so a read-then-swapped module sharing one requester path doesn't collide).
- Worker (isolation.py): WorkerMediator.swap keeps the worker-tape swap value (with grad_fn); reset alongside the other _bwd state.
- Worker backward (backwards.py): loop over send seeds, receive dL/d(swap leaf), backprop it through the swap tape to dL/d(delivered clone), re-seed the pre-swap graph, repeat; a read reached both directly and through a swap SUMS its gradient across rounds. With no swaps the loop is exactly the prior single exchange.

Verified (nnsight-tf, GPU7, gpt2 + renamed): test_isolated_grad_through_swap.py 5/5 all max|Δ|=0: grad through h*2, h+vec, tracer.steer, TWO chained swaps (loop fixpoint), and the renamed model, isolated == in-process. No regression across the isolated suite 13/13 (read-path backward, multi-token-backward clean-fail parity, trace/steer/patch/ablate swaps without backward, multitoken-iter/cross-invoke/session-handoff/cache/lockdown/acceptance/pool).

Docs: mediator-gpu-trace-integration.md §16 + support matrix (grad-through-swap DONE).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014ZUUF44B2tfuKBNFhDFedR
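
The seam-stitching idea is the chain rule applied segment by segment, with one gradient handed across each host/worker boundary. The scalar sketch below illustrates that for two chained swaps; it is a hand-written toy, not the Events.BACKWARD exchange. The `segments` table, `forward`, `stitched_backward`, and `finite_difference` are hypothetical names, and each segment carries its own derivative instead of using autograd.

```python
# Each segment is (owner, f, df). Host graph pieces and worker swap tapes
# alternate; a gradient crosses a seam whenever the owner changes.
segments = [
    ("host",   lambda x: 3.0 * x, lambda x: 3.0),       # upstream activation
    ("worker", lambda h: 2.0 * h, lambda h: 2.0),       # first swap: h * 2
    ("host",   lambda s: s * s,   lambda s: 2.0 * s),   # host layers between swaps
    ("worker", lambda h: h + 1.0, lambda h: 1.0),       # second swap: h + const
    ("host",   lambda s: 0.5 * s, lambda s: 0.5),       # downstream loss
]


def forward(x):
    """Run all segments, recording the input each one saw."""
    inputs = []
    for _, f, _ in segments:
        inputs.append(x)
        x = f(x)
    return x, inputs


def stitched_backward(x):
    """Backprop one segment at a time, counting worker rounds (swap seams)."""
    _, inputs = forward(x)
    grad = 1.0          # dL/dL seed
    worker_rounds = 0
    for (owner, _, df), seen in zip(reversed(segments), reversed(inputs)):
        grad *= df(seen)            # local chain-rule step on this side
        if owner == "worker":
            worker_rounds += 1      # one exchange per swap seam
    return grad, worker_rounds


def finite_difference(x, eps=1e-6):
    return (forward(x + eps)[0] - forward(x - eps)[0]) / (2 * eps)


grad, rounds = stitched_backward(1.5)
```

For this chain the loss is 18x² + 0.5, so the stitched gradient at x = 1.5 is 54.0 after two worker rounds, matching the finite-difference check. With no worker segments the loop degenerates to a single host-side backward, mirroring the commit's note that the no-swap case is the prior single exchange.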

…downgrade two overclaims

A deep audit found doc-vs-code drift across the isolation docs. Code is correct; the docs lagged.

Doc-only changes (verified against transport.py / isolation.py / _sandbox.py / fastlane.py):

- threat-models §7 + line 11: the worker->host fix is the shipped closed value-algebra codec (transport._codec_dumps / _codec_loads), not a restricted unpickler. The documented _RestrictedUnpickler / _safe_loads do not exist (replaced in 043196c). Rewrote the Fix paragraph to the codec (closed algebra, pure data assembly, no find_class/REDUCE/pickle VM, size cap + bounds checks, BoundaryValueError at encode); deleted the "restricted-unpickler vs hand-rolled codec" subsection and replaced it with why a closed codec is stronger (no opcode can call anything, so the restricted-unpickler bypass class, e.g. CVE-2025-32434, cannot exist). Legacy-socket remediation pointer _safe_loads -> _codec_loads.
- integration §14: pool_key is the 5-tuple (device, arena_bytes, gpu_mem_fraction, lockdown, preimport), not 4 (preimport was missing).
- integration §3/§7/§9: back-patch renamed symbols ensure_provider -> ensure_isolated_provider, spawn_isolated_worker -> _spawn_worker, _worker_main -> _pool_worker_main (kept the one "previously _worker_main, now _pool_worker_main" history line intact).
- integration: clarify set_per_process_memory_fraction caps the allocator pool (the 20 GB footgun), distinct from the ~0.55 GiB CUDA-context cost it does not reduce.
- mediator-isolation-sandbox.md + gpu-sandbox.md: SUPERSEDED / pre-integration banners pointing to the authoritative integration + threat-model docs; dropped topk from the op list and noted "capture" shipped as .carry().

Posture claims downgraded to match the shipped footgun-containment model:

- threat-models §4/§5: the shipped seccomp is a default-ALLOW denylist of 7 fs/net/exec syscalls (plus GPU mem-fraction), i.e. R1 + footgun containment. Full R2 (allowlist-default seccomp + ptrace/clone/fork + namespaces + cgroups) is designed, not built; R2's "determined adversary" closure is the roadmap target, matching gpu-sandbox.md.
- fast-lane §7: the static gate rejects import/open/exec/socket AST nodes, but the fast lane runs in-process and allowlists numpy/torch calls, which is a footgun selector, not an adversarial boundary. Safety rests on the trust="local" cordon (unreachable for non-author code), made explicit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014ZUUF44B2tfuKBNFhDFedR
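
The pool-key correction is easy to picture as a frozen dataclass whose key enumerates every warm-time option. The sketch below is an assumption-laden illustration, not the project's IsoOptions: the five field names come from the commit message, but the types, the default values (other than lockdown defaulting to off), and the choice to expose `pool_key` as a property are guesses made for the example.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional, Tuple


@dataclass(frozen=True)
class IsoOptions:
    # Field names follow the commit message; types and defaults are assumed.
    device: str = "cuda:0"
    arena_bytes: int = 64 * 1024 * 1024
    gpu_mem_fraction: Optional[float] = None
    lockdown: bool = False
    preimport: Tuple[str, ...] = ()

    @property
    def pool_key(self):
        # Workers are interchangeable only if every warm-time option matches,
        # which is why preimport has to be part of the key.
        return (
            self.device,
            self.arena_bytes,
            self.gpu_mem_fraction,
            self.lockdown,
            self.preimport,
        )


plain = IsoOptions(lockdown=True)
with_preimport = IsoOptions(lockdown=True, preimport=("transformers",))

# A hypothetical pool registry keyed by pool_key keeps the two apart.
pools = {plain.pool_key: "pool-A", with_preimport.pool_key: "pool-B"}
```

With a 4-tuple key that omitted preimport, `plain` and `with_preimport` would have mapped to the same pool, and a job could have been handed a locked-down worker that never loaded the modules it needs.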

…ulti-invoke framing

Investigating the "batched backward" open item showed it splits two ways, and checking the in-process baseline first reframed it:

- Batched backward via LIST input (model.trace([A, B, ...])) already works under isolation, bit-identical. It is one mediator over the padded batch (batch_group=None, no per-invoke narrowing), so the worker's delivered clone and the host's retained real are both full-batch and shapes match; the read-path and grad-through-swap seam-stitch run unchanged on a (batch, seq, hidden) tensor. Added test_isolated_batched_backward.py (2-row, 3-row, upstream-block, batched grad-through-swap, renamed): all isolated-vs-in-process max|Δ|=0.
- Backward inside MULTIPLE tracer.invoke(...) contexts raises MissedProviderError IN-PROCESS too (the .grad provider is never registered across invoke contexts), so it is a core nnsight limitation, not an isolation gap. The prior doc framing ("cryptic shape mismatch / needs narrowed retention") predated checking the in-process baseline; same category as multi-token backward (parity, not a gap).

Doc-only + new test; no code change. mediator-gpu-trace-integration.md §16 + support-matrix backward row corrected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014ZUUF44B2tfuKBNFhDFedR

khaiwang and others added 30 commits June 7, 2026 21:37

feat(intervention): isolated GPU-worker execution backend for model.trace()#676

khaiwang wants to merge 30 commits into main from worktree-mediator-sandbox

khaiwang commented Jun 20, 2026
