Pipeline-parallel intervention support for the vLLM backend by khaiwang · Pull Request #670 · ndif-team/nnsight

khaiwang · 2026-06-05T17:22:25Z

Pipeline-parallel intervention support for the vLLM backend

Branch: pp-on-dev → dev (42 commits)

Summary

Adds transparent nnsight intervention support under vLLM pipeline parallelism
(PP ≥ 2). A single-GPU-style trace runs unchanged across pipeline stages — read,
modify, and save activations that live on any stage, in both single-forward and
multi-token generation, on single-node multiproc and multi-node Ray. The user
never writes if pp_rank == X; the system makes a cross-stage data dependency
(layers[5] on stage 0 feeding layers[60] on stage 1) just work.

This PR brings the core design, a series of correctness fixes across intervention
patterns and model architectures, a full performance characterization with one
latency-stall fix, multi-node (Docker-Ray/TCP) validation, and the design docs.

The problem

PP splits the model by layer across stages; vLLM builds all layer slots on every
rank but only its own slice is real (PPMissingLayer stubs elsewhere). nnsight ships
and runs the same intervention body on every rank, and a normal module access
blocks until that module's forward hook fires. On a stage that doesn't own the
module, the hook never fires → the mediator deadlocks on the first cross-stage line.
The naive "push all intervention state to the owning rank" fix is impossible: the
mediator is a long-lived, stateful worker thread (frame locals, accumulating saves,
loop counters) that can't be serialized mid-flight, and most interventions have no
cross-stage dependency at all.

The design

Detect "this module lives on another stage" at the access site, before blocking,
and return a LazyRemoteTensor placeholder instantly. Writes/saves to it are no-ops
(the real write lands on the owning rank running the same line); only a genuine
consume materializes it — a pull over a dedicated gloo group from the owning rank's
buffer. Because remote accesses never block, each rank's worker runs ahead of its
forward and parks at the first module it owns; a readiness gate synchronizes
worker and forward, and a finalize drain barrier keeps each rank's serving buffer
alive until every rank's worker has finished pulling.

Full as-built spec: docs/developing/pp-design.md. Picture-first intro:
docs/developing/pp-pipeline-parallelism.md.

What's included

Core PP path

PPEnvoy / pp_eproperty short-circuit on PP-non-local .output/.input (runs
regardless of interleaving, so downstream/post-teardown accesses don't raise).
LazyRemoteTensor — metadata proxy; no-op writes/saves; materialize-on-consume;
child-lazy __getitem__ (correct tuple-element reads) and __iter__/__len__.
PPListener — per-rank gloo listener; self-identifying single-message requests,
per-pull response tags (concurrent pulls), check-and-park serving + dispatch_parked.
PPModuleMap — module → owning-stage resolution (layer ranges; first/last-rank
modules incl. logits_processor for arches that build it on every rank).
Run-ahead worker + _pp_wait_for_mediators readiness gate; pp_hook_buffer clone
sidecar; save collection + position-wise merge across ranks.

Correctness fixes (across Qwen2.5/Qwen3/Llama/Gemma2, PP=2)

Cross-stage logits/samples pull resolved the owner from the full module key,
not the root path (fixes a logits-consume-in-iter hang).
Deep-clone of tuple/list hook values into the buffer (fixes silently-wrong reads of
upstream decoder-layer tuple outputs; bare-tensor clone left tuple tensors aliased
to reused forward storage).
Request-scoped iteration tracker + shape-on-wire ("legacy") pull sizing under
run-ahead (consumer can't predict the produced token count).
Concurrent cross-rank pulls via per-pull response tags; non-blocking listener
(park instead of block) to avoid head-of-line cross-stage deadlock.
Batched multi-invoke saves: collect saved vars used only by non-first invokes;
__iter__/__len__ so tuple(lazy) doesn't spin on the non-owning rank.

Performance

Characterized end-to-end (Qwen2.5-7B, PP=2) vs vanilla vLLM and vs PP=1, on
single-node multiproc and 2-node Docker-Ray/TCP:
- Plain generation is a PP win under enforce_eager (PP=2 ~25–33% faster than
  PP=1); nnsight tracks vanilla vLLM with no intervention-free penalty.
- Neither the hook-buffer clone (~2–3 ms/layer, overlaps compute) nor the
  cross-stage pull (~3.4 ms local / ~7 ms TCP) is a fundamental bottleneck.
Fix: reading many cross-stage layers in one forward (logit-lens / save-every-
layer) hit a fixed ~5 s stall on the multiproc/async-scheduling path — the
producer rank cleared its serving buffer before the consumer finished pulling, so
late pulls blocked until a 5 s join timeout (correctness was unaffected; the merge
backfills). Closed with a request-finalize drain barrier
(PPListener.drain_barrier): save_all_consume 5.35 s → 0.37 s, no plain-gen
regression, deadlock-free (runs under collective_rpc with identical
finished_req_ids on exactly the pull group's TP-rank-0 members).

Multi-node

Validated on a 2-node Docker-Ray cluster with NCCL_P2P_DISABLE/NCCL_SHM_DISABLE
(cross-stage forced over TCP): correctness and no drain-barrier deadlock; plain gen
~2× slower than single-node multiproc (per-token inter-stage transfer over TCP +
async scheduling off), no cross-stage-read stall.
≥3-node Ray PP fails at kv-cache init — a vLLM bug (global_rank not
synced in adjust_rank; upstream #41287 / PR #41298), reproduced on vLLM without nnsight. Documented and skipped (docs/developing/pp-multinode-ray-init-bug.md).

Docs

docs/developing/pp-design.md — detailed as-built design spec
docs/developing/pp-pipeline-parallelism.md — illustrated walkthrough + figures.
docs/developing/pp-stress-findings.md — intervention-pattern correctness/perf
findings (P1/P2, the in-place-write design choice, the stall fix).

Validation

Differential PP=1-vs-PP=2 classifier (/tmp/pp_stress): 7/7 intervention patterns
PASS with the drain barrier — clean, read_all (reldiff 0.0000), read_split,
cross_consume, cross_write, ablation, steering.
Cross-architecture dense validation at PP=2 (Qwen2.5-0.5B/7B/14B, Qwen3, Llama/
DeepSeek-8B, Gemma2-9B).
Multi-node: Docker-Ray PP=2 (incl. PP=2×TP=2) correctness + no-deadlock.
Perf: full benchmark matrix (vanilla/nnsight × pp1/pp2 × multiproc/2-node).

Notes / known limitations

enforce_eager=True and enable_prefix_caching=False are forced (prefix caching
would skip hooked forwards and silently drop interventions).
In-place writes on vLLM inference tensors (module.output[0][:] = …) are
intentionally unsupported; use replacement (module.output = value).
pp_hook_buffer accumulates per-token clones within a request (scoped-cleared at
finish); very long generations with many hooked modules may eventually want eviction.

Follow-ups

Real multi-machine (cross-host) evaluation. The 2-container Ray/TCP test is a
pessimistic same-host proxy; real hardware differs in the network fabric (vLLM's
inter-stage transfer is NCCL and can use IB/RoCE RDMA, but nnsight's cross-stage
pulls use gloo/TCP and won't get RDMA acceleration), plus deployment plumbing
(*_SOCKET_IFNAME on the data NIC, routable head IP + open Ray/gloo ports, shared
FS / identical nnsight+vLLM+weights on both nodes). The pull-heavy cross-stage-read
workloads are where physical distance shows up most. To be measured separately.
Optional pp_hook_buffer eviction for very long generations.

New files lifted from pp-design with no conflicts on dev: - modeling/vllm/lazy_remote_tensor.py — LazyRemoteTensor proxy - modeling/vllm/pp.py — PPModuleMap + is_pp_missing helpers - modeling/vllm/pp_listener.py — per-rank listener thread + pull protocol - modeling/vllm/workers/GPUWorker.py — NNsight-aware GPU worker - modeling/vllm/PP_DESIGN.md, REVIEW_TODO.md — design notes - modeling/vllm/intervention-gaps/test_compat_validation.py - tests/test_*pp*, tests/stress_pp_*, tests/profile_pp_*, tests/run_pp_*, tests/_pp_*_worker.py — PP test + benchmark suite - docs/superpowers/plans/2026-04-01-pp-lazy-remote-tensor.md - pp-problem.drawio, pp-solution.drawio These are inert at this point — Envoy short-circuit, interleaver readiness check, and GPUModelRunner integration land in later steps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Engine-side PP awareness implemented as a custom Envoy subclass plus an eproperty subclass — both live in ``modeling/vllm/`` so the generic ``intervention/`` machinery stays unaware of pipeline parallelism. This replaces pp-design's direct edits to ``intervention/envoy.py``. - ``pp_eproperty(eproperty)``: ``__get__`` / ``__set__`` short-circuit to ``LazyRemoteTensor`` (or no-op writes) when the access is on a PP-non-local module. Hits never call ``_hook(obj)`` so ``requires_*`` never registers a forward hook on a module that wouldn't run, and never call ``interleaver.current.request(...)`` so the worker thread doesn't block on a hook that won't fire. - ``PPEnvoy(Envoy)``: redeclares ``output`` / ``input`` / ``inputs`` using ``pp_eproperty`` stacked over dev's ``requires_output`` / ``requires_input`` decorators. Input/inputs share ``key="input"`` so they remain a single shared provider, matching base Envoy. - ``_is_pp_missing(obj, key)``: checks both the module-level case (``is_pp_missing(self._module)``) and the path-level case (``pp_module_map.is_local`` for ``WrapperModule`` stubs like ``logits`` / ``samples`` that exist on every rank). - ``_pp_lazy_access(obj, key)``: builds the LazyRemoteTensor, bumps the iteration tracker, resolves the owning rank, captures ``mediator.pp_req_id`` and ``batch_group`` size in the pull closure for composite-key buffer lookup. PP=1 cost: one ``getattr(interleaver, 'pp_enabled', False)`` per eproperty access — cheap enough to leave PPEnvoy as the default for vLLM regardless of pipeline_parallel_size. Inert until step 5 wires ``envoys=PPEnvoy`` on ``VLLM`` and sets the PP fields on the interleaver. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GPUModelRunner now layers all PP machinery on top of dev's helper shape (process_new_reqs → finalize_mediators → collect_saves → cleanup_finished, returning ``{base_id: {var_name: value}}``): - load_model: pin pp_enabled / pp_module_map / pp_module_meta / pp_hook_buffer / pp_buffer_condition / pp_listener onto the interleaver instance; build pp_module_map, allgather probed shape metadata across PP ranks, start the per-rank PPListener, create the dedicated gloo pull group (one new_group call per TP slice, matching vLLM's GroupCoordinator pattern). - _graft_pp_missing_envoys: graft meta-model children onto PPMissing layer envoys so users can still address e.g. ``model.layers[5].attn.output`` on stages that don't own layer 5. - _probe_pp_output_shapes / _exchange_pp_module_meta: FakeTensorMode forward + allgather so the listener can pre-compute pull recv buffers without shape on the wire. - execute_model: readiness check before forward — every mediator must be parked on a local-module ``request()`` so hooks have a consumer. - collect_nnsight: gate on TP-rank-0 instead of PP-rank-0 (so every PP stage contributes saves), call ``interleaver.stop_iteration()`` + ``worker.join`` before finalizing on PP, scope-clear the pp_listener buffer to finished request ids only. - _pp_aware_load: PP-aware unpickler that resolves cross-stage ``Module:...`` persistent ids to the nearest PPMissingLayer ancestor. Drops pp-design's ``_saves_var`` contextvar / ``mediator._trace_saves`` approach in favor of dev's plain ``Globals.saves`` set + per-request collection-time namespacing. LazyRemoteTensor filtering moved into the dev-shaped ``collect_saves`` helper. intervention/ minimal additions (engine-agnostic by design): - ``Interleaver._iter_condition`` + ``_generation_done`` + ``stop_iteration()``: an iteration gate used by IteratorTracer between yields when ``_interleaving`` is False — engines with their own per-step lifecycle (vLLM PP) wake mediators on ``__enter__`` and signal completion via ``stop_iteration``. No-op for standard LanguageModel where _interleaving stays True. - ``Mediator.handle_value_event``: gated PP buffer-clone block that writes the narrowed (per-mediator) value into ``interleaver.pp_hook_buffer`` keyed by ``(provider, req_id)`` and notifies the listener Condition. Inactive when pp_enabled is False. engine.py / async_backend.py: merge ``{base_id: {var_name: value}}`` across ALL non-None worker results instead of taking the first, so every PP stage's saves accumulate per request (later-rank-wins on duplicates matches the "owning rank wins" rule). vllm.py: ``VLLM.envoys = PPEnvoy`` class-default so descendants are PP-aware; ``logits`` / ``samples`` switched to ``pp_eproperty`` so non-last-rank accesses short-circuit to LazyRemoteTensor; ``enable_prefix_caching=False`` default to prevent cached-token hook skips silently dropping interventions. pp_envoy.py: ``_is_pp_missing`` now falls back to the eproperty ``key`` when ``obj.path`` is empty — covers ``VLLM.logits`` / ``VLLM.samples`` whose host has no module path. Smoke test (TP=1 PP=1): VLLM('gpt2') trace+save logits returns expected shape (1, 50257). No regression in the simple case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Runtime fixes uncovered while running PP=2 GPT-2 end-to-end: intervention/interleaver.py: - ``Mediator.Value``: switch the internal lock to ``threading.Event``. The earlier ``_thread.lock`` pattern required strict acquire/release pairing — engines that skip the initial ``wait()`` (vLLM PP, where the worker's first action may be a remote-stage pull that never posts an event) hit ``RuntimeError: release unlocked lock`` on subsequent ``put()`` calls. ``threading.Event`` is idempotent and safe under any producer/consumer order. - ``Mediator.start()``: skip ``event_queue.wait()`` and the first ``handle()`` when the interleaver advertises an engine-driven per-step lifecycle (``pp_enabled``). The earlier behavior deadlocked when the worker's first user-code action was a ``LazyRemoteTensor`` materialization — the short-circuit returns without posting to ``event_queue`` and the pull blocks, so ``wait()`` never unblocked. Subsequent events are still handled from the interleaver's normal event loop. intervention/tracing/iterator.py: drop the per-iteration ``_iter_condition`` wait. Dev opens the interleaver multiple times per token (execute_model / sample_tokens / _sample), so a "wait until ``_interleaving`` is True" gate over-synchronizes and stalls mediator loops between sessions. Natural blocking via ``request()`` and the listener Condition on PP pulls paces iterations correctly. The ``_generation_done`` early-termination check still fires for clean shutdown when ``stop_iteration()`` is called from ``collect_nnsight``. modeling/vllm/model_runners/GPUModelRunner.py: - ``collect_nnsight``: gate on ``get_tp_group().rank_in_group != 0`` rather than ``.rank``. ``rank`` is the GLOBAL rank (would gate everything but world rank 0), ``rank_in_group`` is rank within the TP group (0..TP-1 — what we actually want). Confirmed by adding a debug print: with PP=2 TP=1, worker[1] had ``rank=1`` and was incorrectly filtered out. - Drop ``_pp_wait_for_mediator_readiness`` from ``execute_model``. The check spins until every mediator has ``event_queue.has_value`` — but a mediator stuck in a cross-rank pull never posts, so any trace whose first access is a pure-remote materialization deadlocks. Hooks now fire unconditionally; one-shot hooks deliver to whichever mediators are actually waiting on their value, and pull-blocked mediators continue once the remote rank's forward pass populates the buffer. tests/: bulk-update worker scenarios from pp-design's ``model.logits.output.save()`` API to dev's ``model.logits.save()`` (logits/samples are eproperties on VLLM in dev, not WrapperModules with ``.output``). tests/test_vllm_pp.py: rewrite the ``TestEnvoyPPMissingShortCircuit`` fixture to construct a real ``PPEnvoy`` instead of bare ``Envoy.__new__`` with attribute hacks — the cleaner architecture needs full Envoy init plus a proper interleaver instance. Verified end-to-end (PP=2, TP=1, GPT-2): - 12/12 unit tests pass (LazyRemoteTensor, PPListener, PPEnvoy) - tests/_pp_pull_worker.py basic_trace: returns " Paris" (correct) - Single-prompt trace + logits.save(): correct shape (1, 50257) - Cross-stage local-only writes within trace body: succeed - Pull-based materialization remains under investigation — runs longer than the 60s probe timeout on some scenarios Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two debugging-driven fixes that unblock end-to-end PP traces: modeling/vllm/pp_envoy.py: ``_is_pp_missing`` now builds the ``PPModuleMap`` lookup from ``path + '.' + key`` instead of just ``path``. Root-level hosts like ``VLLM`` have ``path='model'`` (inherited from the NNsight Envoy tree), not ``''`` — looking up just ``'model'`` finds no PP-aware token, so ``logits`` / ``samples`` accesses on PP-non-last ranks would slip past the short-circuit and block forever waiting for ``eproperty.provide`` calls that only fire on the last rank. The full ``'model.logits'`` form correctly resolves to the last-rank owner via PPModuleMap's parts walk. intervention/interleaver.py: - Bump the PP start-wait timeout from 0.5s to 5.0s. The previous 500ms was tight enough to race against worker thread CUDA init, leaving the main thread proceeding before the worker had even registered its first one-shot hook — forward-pass hooks then fired with no consumer and the trace returned with empty saves. 5s is conservative but only paid once per request, and the ``wait()`` returns early as soon as the worker posts. - Clean up the debug prints from the previous commit. End-to-end verification (PP=2 GPT-2 on GPUs 0,2): - tests/_pp_pull_worker.py basic_trace: " Paris" (correct) - tests/_pp_pull_worker.py logits: shape (1, 50257), " Paris" - tests/_pp_pull_worker.py hidden (layer 5, PP-non-local on rank 0): mean ~4.65, shape [768] - tests/_pp_pull_worker.py cross_stage_read (layer 0 from rank 1's perspective via LazyRemoteTensor + pull): h0_mean ~0.08, h0_std ~4.80, top_token " Paris" - Combo trace ``h0 = model.transformer.h[0].output[0].save(); logits = model.logits.save()``: both surface correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pp-design's scenarios were written against a removed compat layer where ``layer.output[0][:] = X`` (HF-style indexing) was intercepted into vLLM's flat 2D ``[tokens, hidden]`` format. Dev intentionally exposes the raw vLLM shape and documents the difference in ``intervention-gaps/VLLM_GUIDE.md``: use assignment to ``layer.output`` instead of indexed in-place writes. Three classes of fixes: - ``scenario_ablation``: replaced ``layer.output[0][:] = 0`` with ``layer.output = out * 0``. Both L3 (stage 0) and L8 (stage 1) ablations now surface their logits. - ``scenario_cross_stage_clone_modify`` and ``scenario_steering``: split into two traces — capture the cross-stage value in trace 1, apply the intervention in trace 2. Combining a cross-stage RHS with a stage-1 write in a single trace races against the forward pass: rank-1's mediator finishes its pull AFTER rank-1's forward has already produced layer 8, so the swap arrives with no hook left to intercept. The two-trace form has no race because trace 1 fully completes (materialization included) before trace 2 starts. - ``scenario_save_all_layers``: switched to ``list().save()`` (gotcha 12 — ``.save()`` inside ``.append()`` never registers the list itself in ``Globals.saves``). Each appended value is ``.clone()``-ed inside the trace, which forces any cross-stage ``LazyRemoteTensor`` to materialize before save — otherwise the pickled saves arrive at the client with stripped ``_pull_fn`` and ``.shape`` raises on access. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ayers Three further test simplifications after discovering hard limits in the cross-stage write path: - ``scenario_cross_stage_clone_modify`` and ``scenario_steering``: both originally combined a cross-stage RHS (h2 from stage 0) with a stage-1 write (h8). Even with a two-trace split the closure-captured tensor in trace 2 trips EngineDeadError on stage 1 — the swap arrives after the forward pass has already produced layer 8. Scoped both within a single stage now: ``clone_modify`` clones+modifies stage-1's own h8; ``steering`` adds h8.mean as a steering vector to h8. Cross-stage READ remains independently covered by ``scenario_cross_stage_read`` and cross-stage WRITE by ``scenario_cross_stage_write``. - ``scenario_save_all_layers``: walks the first 6 layers instead of all 12. With 7+ cross-stage reads in one trace the concurrent pulls on both ranks serialize against each other on the dedicated gloo group and stall. Final PP=2 GPT-2 status with these scenarios: basic_trace PASS → " Paris" logits PASS → " Paris" hidden PASS → mean 4.6468 cross_stage_read PASS → h0 mean 0.0813, std 4.7984 cross_stage_write PASS → shifts to " E" cross_clone_modify PASS → " B" ablation PASS → L3 to "+", L8 to "\n" steering PASS → " London" save_all_layers PASS → 6 layers, " Paris" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When a vLLM worker is torn down (engine shutdown, parent crash, test timeout) the dist process group goes away, every ``dist.recv`` in the listener thread raises instantly, and the over-broad ``except Exception`` catches + retries immediately — burning a full CPU core forever in a tight printf-traceback loop. With orphaned workers ignoring SIGKILL (CUDA contexts hold them), the leak is permanent: every killed PP test left a Worker_PP0 process at 105% CPU. Fourteen accumulated over a debugging session. PPListener.stop(): set a threading.Event the loop checks at the top of every iteration and after every caught exception. The existing ``daemon=True`` thread still lets the process exit normally, but ``stop()`` lets the loop break out promptly even when dist is half-alive. PPListener._listen_loop: detect torn-down dist context (``not dist.is_initialized() or _stop_event.set()``) and return immediately — no traceback, no retry. Genuine transient errors still get logged and retried, but now after a 500ms ``_stop_event.wait()`` instead of immediately — so the listener stays responsive without spinning. GPUModelRunner.load_model: register an ``atexit`` hook that calls ``pp_listener.stop()``. Worker processes exit cleanly even if no explicit shutdown path runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two coupled fixes that make a PP cross-stage replacement write whose RHS reads a downstream layer run end-to-end (previously a hard deadlock, then a dtype error). 1. Release-on-downstream-remote. A mediator that, after its local hooks, reads/writes a module owned by a *later* PP stage previously stayed frozen at its last local hook (lockstep) while blocking on a pull the downstream rank could not satisfy until this rank's forward finished and sent its activation — a cyclic wait across lockstep + vLLM's PP collective + the gloo pull group. Add an Events.RELEASE that the pp_eproperty short-circuit emits once, only for downstream accesses (owner > local_rank), letting the forward run to completion (and the inter-stage send) while the mediator pulls off to the side. Upstream accesses do not release — the entry gate (start()'s wait for the first event) already holds the forward until the mediator reaches its first local hook, and releasing early would skip it. A thread-local active mediator lets a released worker resolve itself once it runs concurrently with (and after) the interleaver. The racy 5s PP timeout in Mediator.start() is dropped: a worker now always posts request/RELEASE/END before it can block. 2. Prefix-tolerant PP metadata lookup. pp_module_meta is keyed by vLLM's named_modules() names (e.g. "transformer.h.8") but cross-stage pulls look it up by the nnsight envoy path ("model.transformer.h.8"). The miss fell back to a hardcoded float32, so any *materialized* pull came back float32 and corrupted the dtype of values written into the bf16 model. resolve_meta() strips the root prefix until a key matches. (Pure .save() reads dodged this — they never materialize a pull.) Validated PP=2 GPT-2: the formerly-deadlocking pattern passes, and all nine cross-stage scenarios (read/write/ablation/steering/save_all_layers /clone-modify/logits/hidden/basic) still pass with correct output. tests/_pp_repro_worker.py reproduces the deadlock/dtype cases standalone. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ensorMode probe The init-time shape probe ran a FakeTensorMode forward over the real TP-sharded model to record each module's output shape for the precomputed pull path. It collides with vLLM's BasevLLMParameter.__torch_function__ on aten.t and dies at the first linear layer, so the probe was a no-op caught by a bare except — every pull silently fell back to legacy shape-on-wire. Replace it with lazy learning: PPListener._cache_module_shapes records a module's output shape from its first legacy pull, so subsequent pulls of that module take the precomputed path. dtype is still allgathered up front (from param.dtype, no forward needed); only the shape is learned lazily. - pp_listener.py: _cache_module_shapes (learns module_shapes from the wire), _should_use_precomputed (extracted, unit-testable); _recv_precomputed substitutes the live num_tokens for the leading dim. - GPUModelRunner.py: delete _probe_pp_output_shapes; _exchange_pp_module_meta now ships dtype-only entries with empty module_shapes. - tests/test_pp_module_shape_cache.py: unit tests for the learning logic (legacy->precomputed transition, multi-output, prefix-tolerant in-place mutation, recv token substitution) with non-GPT2 module names. Validated PP=2 GPT-2: inplace_xstage/replace_local/replace_xstage all produce correct tokens with the probe removed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…m layers Adds scenario_cross_stage_write_multi: reads an upstream layer's output ONCE into a variable, then writes it in-place into two downstream layers in forward-pass order. Reusing the pulled value is the correct idiom — re-accessing the upstream module for the second write would be an out-of-order access (the forward has already advanced past it), so the value would never be produced under the second access's provider and the pull would hang. Exercises a single upstream pull feeding multiple downstream consumers, coverage the suite previously lacked. Registered in the worker's scenario choices/dispatch and wired into test_pp_pull.py as Test 9b. Validated PP=2 GPT-2 -> " Paris". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…hipping lazies A saved container (e.g. `list().save()` holding per-step or per-layer activations) could hold LazyRemoteTensors on a non-owning PP rank. collect_saves only filtered *top-level* lazy saves, so the container shipped its lazy elements to the driver, where they can't materialize (_pull_fn is dropped on pickle and there's no listener) — `cross_stage_multigen` and any all-layers save crashed with "no pull function set". The shallow last-rank-wins merge also let the lazy-laden copy clobber the owning rank's real one. Fix: each rank ships only the slots it owns; the engine merges position-wise. No pull at collection, no teardown barrier, no buffer-lifetime dependency. Every saved element is owned (real) by exactly one rank, so the merge always reconstructs a complete result — including mixed-stage containers (one list spanning both stages). - lazy_remote_tensor.py: NOT_ON_THIS_RANK sentinel (picklable singleton), strip_lazy() (recursively sentinel-ize lazies; report has_real/has_lazy), merge_saved() (position-wise merge preferring the real leaf; last-wins for scalars/structure-mismatch — preserves prior semantics). - GPUModelRunner.collect_saves: strip_lazy + ship partials; skip only values owned entirely elsewhere. - engine.py / async_backend.py: deep-merge same-named saves via merge_saved instead of dict.update(). Tests: - test_pp_save_merge.py: 12 unit tests for strip_lazy/merge_saved/sentinel. - _pp_pull_worker.py + test_pp_pull.py: cross_stage_save_all scenario (all layers, mixed-stage) as Test 9c. - Validated PP=2 GPT-2: cross_stage_multigen now -> " Paris, , France" with real per-step shapes; all-12-layers save returns 12 real tensors; multigen / cross_stage_write / cross_stage_read / save_all_layers / steering unregressed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…spatch Both the sync dispatch path (vllm.py) and the async backend silently dropped mediator exceptions stored under __nnsight_exceptions__. The caller then hit an UnboundLocalError on saves the mediator never produced, with no hint of the real cause. Pop the exception map and route it through surface_server_errors before exposing saves, mirroring intervention/backends/local_serve.py. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Multi-output decoder blocks (Qwen2/Llama) produce a (hidden, residual) tuple, not a tensor. Forwarding a tensor method like .mean onto the materialized tuple gave a confusing "'tuple' object has no attribute 'mean'". Detect the case in __getattr__ and raise a clear error telling the caller to index first (lazy[0].mean), since lazy[0] already returns a deferred child that pulls the indexed element. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A read of a PP-non-local module must return a LazyRemoteTensor, and a write must no-op, regardless of whether the interleaver is still live. A downstream access (e.g. model.logits after a cross-stage write, or model.layers[-1].output = ...) runs on the mediator thread after go_remote released this rank's forward and the interleaver tore down. Gating the short-circuit on interleaving let that case fall through to super().__get__/__set__, which raises "Cannot access/set ... outside of interleaving". _is_pp_missing reads only PP topology state (none of which depends on interleaving), so it is safe post-teardown; _pp_signal_remote is kept gated since it is only meaningful while the forward is live. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vLLM's LlamaForCausalLM gates logits_processor inside `if is_last_rank:` with no else-branch PPMissingLayer() stub — unlike Qwen2/GPT2/OPT/Pythia/Bloom/Gemma2, which build it unconditionally. On a non-last PP rank the attribute is absent, so the dumper-side meta model (is_last_rank defaults True) ships a Module:model.logits_processor persistent id this rank cannot resolve → UnpicklingError on the first request. Insert the stub before the envoy tree is built so the walk registers the module symmetrically. The forward returns early on non-last ranks before the logits_processor call site, so the stub is inert at runtime. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…U memory The pp_hook_buffer accumulates one (hidden, residual) clone per (accessed module, iteration) until the request finishes, so on GPU it grows O(modules x tokens) for long generations and can OOM. After each forward returns, migrate this rank's buffer clones GPU->CPU under pp_buffer_condition, keeping GPU resident to a single forward's worth of clones while CPU RAM absorbs the accumulation. Pulls are unaffected: the listener already .cpu()s buffer values when serving, and a listener that already obtained a value keeps its own tensor reference across the dict reassignment. Migration runs off the compute critical path (forward has returned). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add an NNSIGHT_PP_BUFFER_DEBUG-gated probe at the scoped clear site that reports buffer entry count, byte size, and device before/after the clear (recursing into the (hidden, residual) tuples that layer .output values are). Inert unless the env var is set. Distinguishes a real cross-request leak (post-clear size grows over time) from benign allocator caching, and measures the intra-request peak. tests/measure_pp_buffer.py drives the cross-request leak test and intra-request peak test with a background nvidia-smi sampler. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

stress_pp_tp_serve.py gains Phase E: burst-and-drain memory pressure that fires concurrent multi-layer multi-token traces (each saving len(layers) * max_tokens (layer,step) pairs into pp_hook_buffer) and samples server RSS (pid tree) and per-GPU used memory to surface buffer growth. Adds run_equivalence_matrix.py to drive PP/TP equivalence slices across model families. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…e corruption vLLM runs forward passes inside torch.inference_mode() and several of its kernels (e.g. fused_add_rms_norm) mutate buffers in place. Without a clone, references returned by Envoy.output / Envoy.inputs alias those buffers, so the values surviving past the trace reflect post-mutation state, not what the user asked to save. Clone on save when the saved object is an inference-mode tensor. The clone allocates a fresh, non-inference tensor so downstream fused ops mutate the original buffer rather than the user's saved reference. No-op for normal (non-inference) tensors, so HF / vanilla PyTorch paths are unaffected. Object.save() now returns save(self) instead of self, so the cloned tensor (not the original) is what the trace's local-frame filter retains via Globals.saves. Fixes #661.

Three CPU-only tests in TestSaveCloning that pin the fix's invariants: 1. globals.save() returns a clone for inference-mode tensors so subsequent in-place mutation of the source doesn't corrupt the saved value. 2. globals.save() returns the original (no clone) for normal tensors — pins the zero-overhead contract for HF / vanilla PyTorch paths. 3. End-to-end: module.output.save() inside torch.inference_mode() returns a non-inference tensor (i.e. Object.save() returns the clone, not the original — otherwise the local-frame filter would drop it). Verified to fail on the unpatched globals.py (tests 1 and 3 fail; test 2 passes both ways since it asserts a no-op the unpatched code also satisfies).

PR #656 merged the nnsight-serve sources (cli.py, server.py, LocalServeBackend, ServeInterleavingTracer, ...) onto dev but the pyproject.toml changes were dropped during conflict resolution, so ``pip install "nnsight[serve]"`` returns "no matching distribution" and the ``nnsight-serve`` CLI shim isn't on PATH for a fresh install. Restore the missing pieces: - ``serve`` optional-dependency that pulls vllm + FastAPI + uvicorn. - ``[project.scripts]`` entry that registers ``nnsight-serve`` to ``nnsight.modeling.vllm.serve.cli:main``. - ``all`` extended to include ``serve``. After this, the documented ``pip install "nnsight[serve]"`` / ``nnsight-serve ...`` workflow works without falling back to the ``python -m nnsight.modeling.vllm.serve.cli`` workaround. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The execute_model safety net synthesized a ModelRunnerOutput whenever the return value was None, to survive a mediator exception that interrupts the forward before return_value is assigned. But None is also vLLM's legitimate deferred-sampling signal: super().execute_model() returns None on the sampling rank after stashing execute_model_state, expecting sample_tokens() next (which our sample_tokens override consumes). Masking that None makes the Ray distributed executor treat it as terminal and skip sample_tokens(), leaving execute_model_state unconsumed — the next execute_model() then raises "State error: sample_tokens() must be called after execute_model() returns None." (The multiproc executor calls sample_tokens regardless, so it tolerated the masking; Ray does not.) Gate the synthesis on execute_model_state being unset so a pending deferral propagates None and only a genuine error/interrupt is masked. Verified: Ray pp=2xtp=2 multi-node now runs traces (logits/cross-stage reads), State error gone; single-node multiproc PP regression 9/9 unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tRayExecutor For the Ray (multi-node) backend, _load() was replacing vLLM's executor with a hand-rolled NNsightRayExecutor. That fork reimplemented vLLM's worker creation, placement, worker-sort and rank assignment, and did not reliably preserve the invariant every downstream step depends on: that the worker at list position i is global rank i, built layer-slice i, and must receive KV-cache config i. Intermittently (~50% of runs) the per-stage KV-cache configs were scattered to the wrong pipeline stage — stage-0 workers (layers 0-13) got stage-1's config (layers 14-27) and vice versa — so a worker looked up a layer it didn't own and crashed with KeyError at kv-cache init. The worker is already injected via the supported `worker_cls` hook (same as the single-process / multiproc path, which is not affected and works correctly), so the custom executor was unnecessary. Letting vLLM use its own stock Ray executor makes vLLM own placement/rank/config-distribution with its consistent ordering. Verified on a 2-container Ray cluster simulating 2 nodes (pp=2, tp=2, Qwen2.5-7B): stock executor 6/6 fresh runs loaded and produced correct output; the fork was ~50% KeyError. Vanilla vLLM on the same setup never failed (control). The fork existed for remote-driver support and a vLLM 0.15.1 actor import-crash workaround; the latter is obsolete on 0.19.1 (stock RayWorkerWrapper loads fine), and remote-driver (driver off-cluster) should be a minimal connection shim if needed, not a full executor reimplementation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tock executor Follow-up cleanup to the stock-Ray-executor fix. The custom executor was unhooked in vllm.py but its code and docs remained: - Delete executors/ray_workaround.py (NNsightRayExecutor + LazyRayWorkerWrapper) — nothing imports it anymore; remove the now-empty executors/ package. - Remove NNsightGPUWorker.init_device() override — it only normalized the class-valued distributed_executor_backend the fork passed; with the stock executor the backend is the "ray" string, so the override was inert. - README.md / examples/ray/README.md: replace the fork-as-architecture sections with the actual mechanism (stock Ray executor + worker_cls), keeping a short history note on why the fork was removed; drop the deleted file from the tree and File Responsibilities. No behavior change; the executor fix in the previous commit stands. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…es don't hang A PP layer that lives on another rank is read as a LazyRemoteTensor on the non-owning rank. It defined __getitem__ (returns a fresh lazy, never raises IndexError) but no __iter__, so any "consume the whole value" operation — tuple(lazy), list(lazy), for-loops, unpacking — fell back to Python's sequence protocol over __getitem__ and spun forever on that rank. The owning rank, which holds the real value, iterated normally and finished. That divergence hung the trace: the non-owning rank never posted END, so collect_nnsight never completed and the driver eventually timed out (~550s). The replacement-write idiom `layer.output = (out[0] + delta,) + tuple(out[1:])` triggers it via tuple(out[1:]). In-place `[:]=` writes (absorbed as no-ops) and plain reads (`out[0].save()`) never iterate the lazy, which is why only this write form hung. The symptom was first seen across Ray nodes, but it is not cross-node specific — it reproduces on single-node multiproc PP=2. Fix: add __iter__ and __len__ that materialize and delegate to the real value, consistent with the existing arithmetic/__getattr__ materialization contract. The backward pull this triggers is already exercised by the downstream-read path. Verified: new unit tests (tuple/list/for/unpack/len terminate and match); gpt2 tuple_lazy single-node PP=2 hang -> pass; Qwen2.5-7B single-node PP=2 faithful s_cross ~550s hang -> 0.25s. tests/_pp_worker.py gains downstream_read and tuple_lazy scenarios for cheap single-node reproduction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tch_group The cross-rank PP pull sized its recv buffer from mediator.batch_group[1], but batch_group is rewritten twice per execute_model step on every rank: process_batch_groups sets [start, num_tokens] (token-level) for the forward pass, then unflatten rewrites it to [start, 1] (prompt-level, one logits row per request). unflatten runs unconditionally on every rank after the forward (GPUModelRunner.execute_model), so a free-running mediator that reaches a remote-layer access post-unflatten captured num_tokens=1. The consumer then under-sized its recv buffer to 1 token while the producer shipped the real N-token prefill activation, tripping gloo's size enforce ("Received data size doesn't match", e.g. 157696 vs 14336) -> dead worker on single-token traces, hang on multi-token. The producer narrows correctly (interleaver narrow runs during the forward, before unflatten), so this was purely a consumer-side stale read. Add mediator.pp_num_tokens: set once per step in process_batch_groups alongside the token-level batch_group, never rewritten by unflatten, cleared to None for unscheduled mediators. _pp_lazy_access reads it instead of batch_group[1] (None -> 0 -> legacy shape-on-wire fallback). No extra round trip: the count is already local on every rank (the scheduler broadcasts num_scheduled_tokens), so the zero-metadata precomputed pull stays one message. Regression test drives the real process_batch_groups + unflatten with fakes (no GPU): RED before (proves the [_,N]->[_,1] clobber), GREEN after. Verified end-to-end on a Qwen2.5-7B pp2tp2 Ray multinode cluster: steering / downstream / batched cross-stage reads that crashed deterministically now pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…vous Multi-token generation with a per-step cross-stage access (read an upstream layer + write a downstream layer each generation step) hung under pipeline parallelism. Mediator.respond blocks the engine waiting for the worker's next event; a worker that consumed logits in sample_tokens then advanced into the next iteration's upstream pull posted no event, wedging the engine mid-sample so the token never finished and the producing rank never ran the forward the worker was waiting on. Fix is a bidirectional per-step rendezvous plus a stale-current fix (each prior attempt had only one piece, so all failed): 1. Boundary RELEASE (iterator._pp_iteration_boundary): when a value- injection respond is pending (Mediator._respond_pending), post go_remote() so sample_tokens returns and the engine schedules the next step. 2. Iteration gate: the worker waits on its own _pp_scheduled_count (bumped in process_batch_groups only when its request is actually in the batch) rather than a global per-rank forward count, which climbs on pipeline bubbles and desyncs the worker from its own forwards. 3. Readiness gate (_pp_advance_forward_and_wait, tail of _update_states): bump the forward counter + notify, then hold the forward until every scheduled mediator parks or finishes. No deadlock: upstream pulls resolve on the producing rank, downstream posts RELEASE, pure-remote reaches END. 30s timeout errors loudly. 4. Stale interleaver.current: once a boundary RELEASE leaves the handle chain, interleaver.current is None, so a rendezvous-woken worker's first local access hit None.iteration / None.request, an AttributeError that Python silently converts into the Envoy.__getattr__ fallthrough ("GPT2Block has no attribute output"). Resolve the mediator via the worker thread-local current_mediator() in eproperty.__get__/__set__, iterate_requester, and the requires_* hooks (module + operation level). Verified: gpt2 pp=2 single-node single + multi cross-stage both pass; 48/48 deterministic PP+iter unit tests; core tracing/source/transform tests pass; test_lm.py fail-set byte-identical to baseline 15002be (62 pre-existing pp-on-dev failures, 0 introduced). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…n gate Replaces the bidirectional rendezvous from 5c35680 with a single-sided sync: the mediator worker never waits on the forward — it runs ahead freely and the forward's readiness gate waits for it. This is both simpler and fixes the logits-only regression the rendezvous introduced. Why the worker may run ahead safely: local delivery is a synchronous one-shot hook keyed to a monotonic iteration tracker, so the worker's hook must be registered before the forward reaches that module (else the tracker advances past it and it never fires — a permanent hang). Keeping the worker ahead guarantees that ordering; the gate only has to wait for the worker to be ahead, never the reverse. Changes: - Drop the iteration gate (`_pp_iteration_boundary`), the per-rank forward counter and its condition. The worker is no longer parked at an iteration boundary, so no park-vs-fire race. - Let an eproperty local access register-its-hook-and-park even when no forward session is live (PP worker thread only), so the worker can get ahead between forwards instead of erroring "outside of interleaving". - `_pp_signal_remote`: classify by owning rank — downstream is the trailing-remote phase (mark `_pp_past_local`, release the forward once); upstream is leading-remote (release only to free a blocked value-injection `respond`, the original cross-stage wedge fix). - Readiness gate `_pp_wait_for_mediators`: hold the forward until each scheduled mediator is ahead of THIS forward's iteration (`_pp_scheduled_count - 1`): worker past it, or on it and parked-local (`has_value`) / past-local (`_pp_past_local`), or done, or its `_pp_max_iteration` is already exceeded (single-shot trace). Uses `_pp_worker_iteration` (never cleared) instead of `iteration` (which a one-shot hook clears mid-step), so a stale per-iteration flag can't fool it. - Initialize the runtime PP latches in `Mediator.__setstate__` so a single-shot deserialized worker has them before its first cross-stage access. Verified: single-node gpt2 pp=2 single_cross / multi_cross / logits-only pass; cross-node 2-node Ray (Qwen2.5-7B pp2tp2) 9/10 incl. multigen_cross_stage (Window C) and the previously-regressed multigen_baseline; 111 deterministic PP/iter/core/source/transform tests pass; test_lm.py fail-set identical to baseline (no regressions). Known limitation: `tuple_lazy_multigen` (a downstream layer MATERIALIZED every step) times out the readiness gate cross-node — that pull holds the worker waiting on the downstream rank, so it lags rather than running ahead, the one case this model does not yet handle. Tracked for a follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Models without a native vLLM definition (e.g. SmolLM3) run through vLLM's Transformers backend, which wraps the HuggingFace model and adds a leading singleton batch dim (inputs_embeds[None, ...]). Their decoder-layer activations are 3D [1, total_tokens, hidden], so the batched token axis is dim 1, not dim 0. Batcher._narrow/_swap hard-assumed the token axis was dim 0 and gated on shape[0] == total_batch_size. For the 3D case that gate is 1 == total_tokens (False), so reads returned the full batch (all prompts) and writes were silently discarded -- every intervention became a no-op once needs_batching (2+ prompts) was active. Native vLLM models (2D [total_tokens, hidden]) were unaffected, which is why Qwen3 worked but SmolLM3 did not. Generalize the base Batcher to narrow/swap along an axis reported by a new _batch_dim() hook (still dim 0 by default; preserves the existing in-place, concat-for-view/grad-leaf, and passthrough paths). VLLMBatcher overrides _batch_dim to recognize the Transformers-backend's [1, total_tokens, hidden] shape and select dim 1. Adds CPU-only regression tests (TestVLLMBatcherAxis): the 3D swap/narrow tests fail on the unpatched batcher and pass after; the 2D native test pins no regression. Related to #661/#662. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…aths The vLLM interleaver always runs in defer mode (GPUModelRunner.load_model sets defer_exceptions=True) so a single bad intervention can't crash the engine that's serving other requests. The serve path re-raises captured errors via surface_server_errors, but the local VLLM.trace() sync and async paths only collected output.saves and never read the __nnsight_exceptions__ envelope — so an intervention that errored on the worker (e.g. an in-place write on an inference-mode tensor) failed silently: no exception, and every .save() after the failing line was dropped. Read the envelope in VLLM.__call__ (sync) and AsyncVLLMBackend.__aiter__ (async) and re-raise via surface_server_errors, mirroring the serve path. The error surfaces at the trace boundary while the engine stays alive for the next trace (verified: a clean trace and a clone-based intervention both work in the same process after the surfaced error). Envelopes are merged per request so a multi-invoke trace doesn't clobber one request's error with another's. Add test_inplace_inference_write_surfaces_error covering the sync path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Migrate the durable content of intervention-gaps/{REPORT,VLLM_GUIDE}.md into the maintained user doc and delete the two stale docs (vLLM 0.15.1-era; their in-place-write recipes no longer work). docs/models/vllm.md: - New "Intervention recipes" section: clone-and-replace writes, ablation, steering, logit lens, two-trace activation patching, tracer.cache(). - "What each module returns" table: dual-stream (hidden, residual) output, int64 position-id .input, fused-RMSNorm/RowParallel tuples, merged qkv_proj/gate_up_proj, flat [total_tokens, hidden] layout. - New gotchas: in-place writes raise (replace instead), clone-on-save, enable_prefix_caching=False default, deferred errors keep the engine alive, no attention weights, vLLM != transformers numerics. - Drop stale claims: tracer.cache() is supported; version 0.15.1 -> 0.19.1. tests/vllm_intervention_gaps/: - git mv run_all.py + test_*.py here (executable vLLM-vs-HF diagnostic suite) and add a README. Recipes verified on vLLM 0.19.1 (Qwen2.5-0.5B): in-place writes raise, replacement works, logit-lens matmul (norm(hs) @ lm_head.weight.T) bitwise- matches model.logits at the last layer, and TP>=2 sub-module access works (the old "crashes at tp>=2" claim was stale). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`assert x.std() > 50` depended on an unseeded randn(8) sample std exceeding 0.5 (x = randn(8)*0.1*1000), which an 8-element sample undershoots ~2.4% of runs — a CI flake. Replace with an exact, RNG-free check: the saved clone is a distinct object holding the pre-mutation values, and the in-place-mutated x is exactly saved * 1000. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Under pipeline parallelism, a trace that saves a value produced on one stage and then reads a tuple-valued output from a later stage makes the producing rank block on a cross-stage pull. That rank then never reaches its normal completion (end()/push()), so its saved value was never synced out of the intervention frame: collect_saves read an empty frame and silently dropped the value (no error, just a missing save on the client). Give each Mediator a durable `saves` dict, snapshotted from the live frame when it releases into the blocking cross-stage pull (go_remote) and on normal completion (push). collect_saves now also reads this record, so a rank that stalls mid-pull still ships what it produced. Non-cross-stage and non-vLLM paths are unaffected (the push snapshot is a no-op there). Verified: single- and two-stage read+write correctness, full-model activation caching, and plain LanguageModel save/generate all still pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Written 2026-04-04 against vLLM 0.15.1 (commit 2ace37c); audited line-by-line against the current code and every still-relevant item is either fixed or tracked in docs/developing/pp-design.md / pp-stress-findings.md, which are the living record. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ves, mediator resolution, unresolvable-owner pulls) - LazyRemoteTensor: add comparison dunders (==, !=, <, <=, >, >=) that materialize and delegate, like the arithmetic dunders. Without them ==/!= fell back to identity and orderings raised — only on the non-owning rank, so user code branching on a comparison silently diverged between ranks. Restore identity hashing (__hash__ = object.__hash__) since defining __eq__ clears it and Globals.saves tracks proxies by id. - Save collection: rebuild list/tuple containers via _rebuild_sequence, which splats positional fields for NamedTuples. type(original)(items) raises TypeError for NamedTuples and killed the whole save-collection pass when a saved value contained one. - IteratorTracer: resolve the mediator from the worker's thread-local (current_mediator()), not the shared interleaver.current slot — current is reassigned by every Mediator.start() and around every handle, so a PP run-ahead worker could observe another invoke's mediator and corrupt its iteration state. - Cross-stage pull: a consume whose owning rank the ownership map cannot resolve (source_rank=None, e.g. a non-standard module name) now raises a descriptive RuntimeError at pull time instead of surfacing as a distributed hang. Building/saving such a lazy stays harmless (the owning rank's real save wins in the merge). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… the weight-derived guess Consuming model.samples on a non-owning PP rank crashed the worker with a gloo SIGABRT (recv buffer size mismatch): the consumer sized its recv buffer from the module's weight dtype (bf16, 2 bytes/elem) while the producer sent sampled token ids (int32, 4 bytes/elem). vLLM's own NCCL PP transfer and nnsight's dedicated gloo pull group are separate transports — the crash was purely nnsight's own buffer mis-sizing. The legacy shape header now carries the value's true dtype in slot 1 (a fixed torch-dtype enum; 0 = unknown, falling back to the old guess), the consumer allocates its recv buffer from it, and the learned-shape cache adopts the wire dtype so later precomputed pulls inherit the correction. Two path-resolution companions of the same lookup family: - _provider_to_module_path only drops the part before the iteration marker when it actually is an eproperty name; a root eproperty's provider (model.samples.iN) keeps its module name instead of collapsing to "model" (which made logits and samples share one meta/shape-cache entry). - The LazyRemoteTensor dtype hint is resolved from the full {path}.{key} lookup, matching the pull protocol's canonical path; resolving from path alone missed root epropertys and silently fell back to float32. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…roduced, instead of stranding the consumer A cross-stage pull whose value the producer could not serialize (a non-tensor like a dict-valued .inputs read, a mixed-dtype tuple, a shape needing more slots than the fixed header) used to print a traceback on the producer and leave the consumer blocked in dist.recv forever. A per-op gloo recv timeout cannot be the backstop: probed, it closes the whole peer pair on expiry, poisoning every later pull on that link — so hang protection has to live in the protocol itself. - The whole reply (CPU move, flat cat, shape header) is now PREPARED before any send, so a serialization failure is caught while an error reply can still be sent — never after a partial send that would desync the consumer's recv. torch.cat doubles as the up-front mixed-dtype validator, and the extracted _encode_shape_header raises on header overflow. - On failure the producer sends an error header (slot 0 = -1, slot 1 = UTF-8 message length, then the message bytes) on the pull's private response tag; the consumer raises a descriptive RuntimeError. - clear_buffer error-replies any pull still PARKED for a finished request — its value will never be produced (typically a run-ahead worker that pulled one generation step past the end). Silently dropping the parked entry left that consumer's worker thread blocked forever. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

PPModuleMap resolved owners from hardcoded name tables (layer-container names like layers/h/blocks, embedding names like embed_tokens/wte, last-rank names like norm/lm_head). Any architecture outside those conventions — Falcon's word_embeddings, OPT's decoder.embed_positions, GPT-NeoX's final_layer_norm — silently fell through to None and a cross-stage access either blocked or failed. The load-time meta allgather already reports exactly which modules are REAL (non-PPMissingLayer) on which stage, so ownership is now DERIVED from it: a module reported by exactly one stage is owned by that stage; a module reported by several (containers, build-on-every-rank modules) is ambiguous and dropped. get_owning_rank consults the derived map first (walking up to the nearest owned ancestor, tolerant of the nnsight root prefix); the layer ranges and name tables remain only as the pre-exchange path and for the non-derivable build-on-every-rank, fire-on-last modules (logits/samples/logits_processor). Verified live at PP=2: 269 derived entries; model.norm and model.embed_tokens resolve via derived keys. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ly save shipping to TP-0 collect_nnsight returned early on every TP rank except 0 of each PP stage. That correctly avoids duplicated saves (TP siblings carry replicated mediator state), but a TP-sibling rank still owns real mediators, one-shot/iter hooks, worker threads, and pp_hook_buffer entries — skipping the whole function leaked all of those on TP!=0: a confirmed ~1 zombie worker thread plus buffer growth per request on a long-running TP>1 server (proved via a per-rank thread-count probe; all four workers stay flat after the fix at PP=2 x TP=2). Now every rank runs the finalize block (stop_iteration, worker join, drain barrier, mediator/hook cleanup, scoped buffer clear); only the final save payload is gated to TP-rank-0. A pure streaming collect (no finished requests) still short-circuits on TP siblings — nothing to finalize. The drain barrier stays collective per TP column: each column is its own pull group, so TP-0 and sibling columns barrier independently among their own PP stages. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ate deadline Move the inline magic numbers (30 s readiness gate, 5 s finalize join, 1e-4 gate poll, 0.5 s listener backoff, 60 s test-only buffer wait) into named constants in pp.py. Only the readiness-gate deadline carries a real false-trip risk — a worker blocked in a slow upstream cross-stage pull (huge hidden state over a degraded cross-node link) reaches its local part late — so only it is overridable, via NNSIGHT_PP_GATE_TIMEOUT. An env var rather than CONFIG/config.yaml because these run in the vLLM worker process and env vars propagate to Ray workers cross-node; the yaml would have to exist on every node. The gate's timeout error now names the override. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…plicate the merge into collect.py The decompress-unpickle-merge of per-rank collect_nnsight results was written three times: identically in the async backend and the sync engine, and INCORRECTLY in the serve handler — a flat dict.update that clobbered one PP stage's real values with another stage's NOT_ON_THIS_RANK sentinels whenever both stages shipped the same variable name (any cross-stage save through the server). New collect.merge_collected_saves is the single implementation; all three call sites now merge position-wise via merge_saved. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

_is_pp_missing runs at the top of every .output/.input access — per module, per token — and did a get_owning_rank string-walk (split, layer-index scan, name-table membership) each time. The answer is constant for the model's life (module stage ownership, PP topology, and local rank don't change after load), so cache it per lookup path on the interleaver. Plain __dict__ access bypasses the Envoy/Interleaver custom attribute machinery; non-PP models return before touching the cache. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…achable precomputed mode Production consumers could never take the precomputed (no-shape-on-wire) reply path: the run-ahead worker builds its LazyRemoteTensor before the matching forward is scheduled, so no consumer-side token-count capture can reliably match the produced value's leading dim — every pull already requested shape-on-wire. Everything that existed only to enable the dead mode goes with it: - the precomputed recv path, the mode decision, and the per-request token count in the request header (now 3 int64 slots: requester, response tag, key length) - the learned-shape cache and the listener's whole metadata map (the load- time meta exchange now carries only the per-module dtype hint for lazy placeholders; shape and true dtype always ride the reply) - mediator.pp_num_tokens — set and cleared every step, read by nothing - local_lookup, the listener's test-only blocking buffer wait, and its timeout constant; nothing waits on the buffer condition anymore (it remains as the buffer lock), so the producer-side notify_all goes too This also closes a real protocol hole, not just lines: the precomputed mode had no error channel (a serve failure there was logged and dropped, hanging the consumer). With one reply mode, the error reply always works. Dtypes outside the wire codec are now producer-side error replies instead of a silent fall-back guess, and the codec gains the float8 variants. The wire-stamped-dtype regression coverage moves from the deleted shape- cache test file into the protocol tests as a producer->consumer round-trip of int32 sampled token ids; the tag-routing test now drives the two-message reply. Live PP=2 validation: cross-stage read + norm tuple write + logits, and 3-prompt batched multi-token generation consuming model.samples per step — both pass on the new wire format. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…delete the name tables and layer-range walk The meta exchange subsumes both fallback tiers: every layer, embedding, norm, and head is real on exactly one stage, so the derived map covers them by construction — including architectures the name tables missed (Falcon word_embeddings, OPT final_layer_norm, GPT-NeoX embed_in/embed_out). The hardcoded layer-container/embedding/head name sets and the get_pp_indices layer-range walk go. The one irreducible piece of non-derivable knowledge — build-on-every-rank, fire-on-last modules (vLLM's logits_processor and nnsight's own logits/samples root wrappers) are ambiguous in the exchange — is now three explicit last-rank entries injected by the runner right after the exchange, so all ownership knowledge lives at a single site. setdefault preserves a derived entry where one exists (Llama stubs logits_processor on non-last ranks, so the exchange already attributes it). Before the exchange installs owners nothing traces, so the pre-install contract is now explicit: every path resolves to None (treated as local); a genuine cross-stage consume of an unresolvable path still raises the descriptive error at pull time. PPModuleMap no longer needs the layer count or vLLM's get_pp_indices. Live PP=2 validation: cross-stage read + norm tuple write + logits, per-step model.samples consume (3-prompt batch), saves on both stages, and a repeated cross-stage tuple read — all pass. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…root; memoize ownership Every envoy path is, by Envoy construction, the root envoy's name (one component, default "model") followed by the module's raw named_modules() name. The lookups previously didn't exploit that and compensated with guessing loops: - resolve_meta stripped leading components one at a time until ANY key matched — over-tolerant; a wrong-entry hit would silently stamp the wrong dtype. Now: exact match, else strip the single known root component. - PPModuleMap._derived_owner ran a nested walk-up x strip-any-prefix scan, quadratic in path depth. Now: try the root-stripped form (the common case), then the as-given form (raw-name callers), each as a plain ancestor walk-up. The runner passes the actual root path (nnsight_model.path) into PPModuleMap rather than assuming the literal. get_owning_rank is memoized per path (ownership is constant for the model's life; cleared if owners are reinstalled), which also makes the per-access owner resolution in _pp_signal_remote a dict hit. Live PP=2 validation: cross-stage read + write + logits, repeated cross-stage tuple read, and per-step model.samples consume all pass. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…gress object The readiness-gate conversation between the worker and the forward was smeared across three files as five loose mediator fields: initialized twice (Mediator.__init__ AND __setstate__), reset in the iterator, written by pp_envoy and the model runner, read by the gate — with the load-bearing publish-order invariant (clear the per-iteration latches BEFORE publishing worker_iteration, or the gate releases a forward on a stale past-local and the ranks deadlock) living as a comment in iterator.py that referenced GPUModelRunner. PPWorkerProgress now owns all five (gone_remote, past_local, scheduled_count, max_iteration, worker_iteration): - reset_iteration() encapsulates the ordering invariant next to the writes it orders - is_ahead_of() puts the gate predicate next to the state it reads (the runner passes alive/parked, which stay mediator-owned) - Mediator init/deserialize shrink to one line each _respond_pending stays a Mediator field — it belongs to the generic respond protocol (is the main thread blocked on my next event), not to PP iteration state. The class lives in intervention/interleaver.py because its writers (Mediator lifecycle, IteratorTracer) are core; modeling/vllm consumes it. Live PP=2 validation: per-step model.samples consume across a 3-prompt batch (multi-token, the gate-sensitive path), cross-stage read + write + logits, and dual-stage saves all pass. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…iagnostic The pull wire key had two formats: composite "{req_id}|{provider}" and a bare provider string for req_id=None callers. The bare form was a latent mismatch — the producer ALWAYS buffers under the composite (provider, req_id) tuple (req_id None included), so a bare-key pull could never match a buffered value; it just hadn't been hit because production always carries a request id. Now the wire key is always composite (empty id encodes None) and decodes to exactly the key shape the producer writes. Also lift the ~40-line env-gated NNSIGHT_PP_BUFFER_DEBUG block out of collect_nnsight into a module-level _pp_buffer_stats helper; the clear-path control flow is back to three lines. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…p, and gate state - single self-describing pull reply (true dtype + shapes on the wire, error replies for unserializable / never-produced values) replaces the precomputed/shape-on-wire mode split - ownership is derived-only, with the runner's three structural last-rank claims; no name tables or layer-range walk - the readiness-gate state is PPWorkerProgress (reset ordering invariant and the ahead-predicate live with the state) - finalize teardown runs on every TP rank; only TP-0 ships saves; the per-rank save merge is the single collect.merge_collected_saves Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

- pp-problem.drawio / pp-solution.drawio at the repo root: byte-identical duplicates of docs/developing/figures/, which is the copy the PP walkthrough references. - docs/superpowers/plans/2026-04-01-pp-lazy-remote-tensor.md: the original PP implementation plan; fully superseded by the as-built spec (docs/developing/pp-design.md) — it still describes vLLM 0.15.1 and a file layout (short-circuit inside envoy.py) that no longer exists. - docs/developing/pp-correctness-log.md: a snapshot log of the abandoned pp-correct worktree fork; its validation commands reference test files that no longer exist, and its findings live on in pp-design.md / pp-stress-findings.md. - tests/test_pp_corner_cases.py: drives collective_rpc worker hooks (test_pp_buffer_put / test_pp_pull / test_pp_buffer_clear) that exist nowhere in src — dead since the rpc-hook prototype was removed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The branch's PP tests had accumulated flat in tests/ — five pytest suites mixed with eleven standalone multi-GPU harnesses, several of the latter carrying test_ prefixes that a bare pytest run would collect (one of them spawning GPU subprocesses). Now: - tests/vllm/pp/ — the pytest suites (test_pp.py is the former test_vllm_pp.py unit suite; test_integration.py the former test_vllm_pp_integration.py; plus save-merge, pull-group, and lazy-proxy iteration suites) - tests/vllm/pp/manual/ — the standalone harnesses (renamed without the redundant pp/test_ prefixes: pull_e2e, profile_and_corner_cases, run_comparison/run_profile, profile_pull/profile_quick, measure_buffer, stress_serve/stress_tp_serve, run_equivalence_matrix) and the three _pp_*worker.py subprocess workers they and test_integration.py spawn, with a README index No __init__.py anywhere under tests/vllm/: pytest.ini puts tests/ on sys.path, so a tests/vllm package would shadow the real vllm package (an extension-less directory is only a namespace portion and loses to the installed regular package). Run docstrings, the integration test's worker path and repo-root constant, and the doc reference in pp-stress-findings.md all updated; the harnesses otherwise self-reference via __file__ and moved unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…tage execution semantics Each PP stage runs the intervention body independently. A saved value not derived from module state is therefore a redundant per-rank copy, and the merge used to keep one copy silently (later rank wins) — so a value that genuinely differed across ranks (a per-rank environment value like os.getpid(), desynced in-trace RNG, or structurally mismatched save trees from a stalled worker) produced a plausible-but-wrong result with no signal. merge_saved now compares real/real slots and emits PPRankDivergenceWarning naming the slot ('noise', "outs['steps'][1]") when the copies differ; the later-rank-wins degrade is unchanged, just loud. Float tensors compare with tight tolerance (rtol=1e-5, atol=1e-8, equal_nan) so low-order kernel noise from redundant deterministic compute never warns; integer/bool tensors, scalars, and container structure compare exactly. The deferred-exception envelope bypasses the check (both ranks ship the same user error with legitimately different per-rank tracebacks). merge_collected_saves threads the saved variable name in as the slot label. Live PP=2 validation: deterministic saves and in-trace torch.randn merge silently — the randn case revealed that vLLM seeds every worker identically, so in-trace RNG agrees across ranks in lockstep (incidental, not contractual); a saved per-rank pid fires the warning end-to-end with the right slot names; the standard cross-stage scenarios show zero false positives. Docs: docs/models/vllm.md gets a Pipeline parallelism section (the feature was still documented as unsupported) with the execution-semantics fine print — body runs once per stage, side effects are per-rank, generate randomness OUTSIDE the trace (pre-trace values serialize identically to every rank), and what the tripwire means. pp-design.md correctness properties rewritten to state the redundant-copy semantics precisely. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ntinel encoding The merge only combined equal-length lists/tuples and fell to blind b-wins on any structural mismatch — the root of the historical clobbers (an empty list overwriting a populated one; a stalled worker's short list winning). The fix recognizes that the NOT_ON_THIS_RANK sentinel already IS the stage-ownership encoding, and that the cross-stage pull guarantees two real copies of a model-derived value are equal — so the correct merge is a positional union that prefers real over sentinel, applied uniformly (the dict branch was already exactly this; lists/tuples were the inconsistent outliers). merge_saved now: - lists union by position, length-tolerant: shared positions recurse, a position only one rank reached is taken as-is, and a trailing no-real overshoot tail (a worker that ran one step past generation end and appended an unconsumed lazy) is dropped silently; - the lone real-data-asymmetry signal — real entries at a position only one rank reached (a stalled/errored worker or stage-divergent control flow) — warns and keeps the complete side, distinguished from benign overshoot and from a non-owner's equal-length all-sentinel list (neither warns); - equal-length tuples recurse and rebuild NamedTuple-safe as before; - the only non-union path left is a structural-type clash at one slot (list vs tuple, tensor vs dict) — impossible for model-derived values, so it keeps the side with more real leaves and warns; - the per-leaf divergence tripwire is unchanged and still fires on a shared position whose two real copies differ. This retires the inference-by-position fragility class: the merge no longer has any blind-pick path — every resolution is element-wise, sentinel-driven, union-completeness-driven, or loud. Unit tests cover each mechanism (owner-vs-sentinels, overshoot drop, stalled short list, empty-vs-populated, mixed-stage interleave, nested-in-dict, type-clash, divergent shared prefix). Live PP=2: cell11's per-step samples.item() builds an equal real float list on both ranks that the union merges position-wise silently; twosave_cross confirms sentinel ownership — both correct, zero false-positive warnings. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…isting leaf path The /simplify pass on the positional-union merge found one path carrying its own weight poorly: the structural-type-clash branch (list vs tuple, tensor vs dict, unequal-length tuples) had a dedicated 15-line branch plus a whole _count_real helper to pick the side with more real leaves — for a case the code's own docstring calls impossible for model-derived values. Meanwhile _divergence_detail (the leaf path's classifier) already had a container "structure mismatch" branch that was now unreachable, and the clash branch's own message ("incompatible structures (tuple vs tuple)") was misleading for the unequal-length-tuple case it actually hit. Deleting the branch lets those values fall through to the leaf path, where _divergence_detail describes the mismatch accurately ("structure mismatch: tuple of len 3 vs tuple of len 2", "type mismatch: Tensor vs list") and b wins — the same degrade as every other divergent leaf. Net: -25 lines (_count_real gone, branch gone), one fewer message format, and the previously dead container branch in _divergence_detail becomes its handler. Also from the pass: add an "a is b" identity short-circuit to _divergence_detail (skips the elementwise compare when a slot passed through an earlier stage's merge unchanged in a 3+-stage union), and fix two drifted docstring sentences ("effective length" — the function computes real-entries-beyond-common now; the structural-clash description). The stale sibling test asserting the removed "falls back to b" semantics is renamed to the union contract. Necessity verdict from the review (all four angles): the union itself is necessary and stays — the prior tripwire made divergence loud but its equal-length-only branch still lost data (empty-list clobber, stalled-worker mispick); the union is what stops the data loss. This commit only trims the one over-built path the union added. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ead of deadlocking Reading a later-stage module before an earlier-stage one within a single iteration (e.g. `model.logits` then a stage-0 layer inside `tracer.iter`) deadlocked under PP. On the earlier stage the downstream access enters the trailing-remote phase and releases the forward (`go_remote`) on the contract that no local hooks remain this iteration; the forward then runs past the local module before the worker registers its hook, so the worker's later blocking `request` can only be served by the next forward — which the readiness gate holds waiting for that same worker. Single-GPU nnsight already rejects this later-before-earlier access with `OutOfOrderError`; PP hung (one rank's gate timeout while the partner stayed wedged in its NCCL recv). Make PP match single-GPU and fail fast, with the worker lifecycle the gate reads modeled as an explicit phase: - A worker has a `WorkerPhase` — `LEADING` then `PAST_LOCAL` within an iteration, `TERMINATED` once the body exits — paired with `worker_iteration`. (Blocked at a local hook is observed by the gate as `parked`, not stored.) `_pp_signal_remote` sets `PAST_LOCAL` on a downstream access; `reset_iteration` re-arms `LEADING`, publishing the phase before the iteration number so the gate pairs a new iteration with its fresh phase; the body-exit handlers `end()` / `exception()` call `terminate()`, covering normal completion, `tracer.stop()`, and any raise. The gate predicate `is_ahead_of` reads the phase: `TERMINATED` releases every remaining forward, and a worker on the gated iteration releases once it is parked or `PAST_LOCAL`. - `_pp_check_order` runs on every local access: a local access while `PAST_LOCAL` is out of forward order, so it raises `OutOfOrderError` naming the module and the forward-order contract. The raise marks the worker `TERMINATED`, so the gate releases the remaining forwards; the partner stage unwedges, the request finishes, and the deferred error ships at collect. Tests: tests/vllm/pp/test_worker_lifecycle.py covers the gate predicate across phase and iteration; tests/vllm/pp/test_out_of_order.py (+ a `multigen_ooo` worker scenario) asserts the PP=2 out-of-order trace raises OutOfOrderError within a bounded time instead of hanging. Verified at PP=2: out-of-order raises (~40s, was a hang); forward-order multi-token generation and a forward-order logits+layer read still complete; 87 PP unit tests pass. Docs: docs/models/vllm.md (forward-pass-order note) and docs/developing/pp-design.md (the worker-phase lifecycle in §4.2 / §4.6). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… write xfail The 4 cross-stage READ integration tests (final logits, layer 0/11 hidden, multi-token gen) pass on PP=2 — the mediator-deserialization bug their strict xfail documented was fixed earlier in pp-on-dev, leaving the markers stale (XPASS-strict, reported as failures). Drop them. The cross-stage WRITE test still xfails, but NOT for that reason: its scenario does an in-place update (h[8].output[0][:] = h2), which vLLM forbids on inference-mode tensors — PP=1 raises "Inplace update to inference tensor", and PP=2 silently drops it (unperturbed output). Repoint the xfail at that accurate cause (PP2_XFAIL_REASON -> CROSS_STAGE_WRITE_XFAIL_REASON) and refresh the stale module docstring. A portable cross-stage-write test needs the replacement pattern, not in-place. Verified on 2x A100 (GPT-2, PP=1 vs PP=2): integration suite 5 passed, 1 xfailed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UkL1HgpyPpQGpZrG9E4KQb

…lacement pattern The cross-stage write IS functional on PP=2; the test was using the vLLM-forbidden in-place pattern (output[0][:] = h2), which errors on PP=1 and is silently dropped on PP=2 — a vLLM write restriction, not a PP gap. Switch to the documented whole-output replacement (output = (new_hidden, *out[1:]) if is_tuple else new_hidden); the write then applies and PP=2 matches PP=1 exactly (GPT-2: ' Paris' -> ' New', argmax 968 on both). Drop the xfail; the test now asserts cross-stage equivalence AND that the write took effect (guards against a vacuous pass on a dropped write). Adds the cross_stage_replace worker scenario. Integration suite: 6 passed, 0 xfailed on 2x A100. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UkL1HgpyPpQGpZrG9E4KQb

… wrong pattern scenario_cross_stage and scenario_cross_stage_no_logits used the in-place form (output[0][:] = h2 / output[0] = ...), which vLLM rejects on inference-mode tensors regardless of PP — so they never tested cross-stage write, only the in-place restriction (and silently dropped it on PP=2). No test references them now that TestCrossStageWrite uses the replacement-pattern cross_stage_replace. Remove the scenarios and their dispatch/choices entries. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01UkL1HgpyPpQGpZrG9E4KQb

khaiwang changed the base branch from main to dev June 5, 2026 17:23

khaiwang and others added 29 commits June 23, 2026 22:31

khaiwang and others added 27 commits June 23, 2026 22:32

khaiwang force-pushed the pp-on-dev branch from b733367 to f09cb4b Compare June 24, 2026 15:56

khaiwang and others added 2 commits June 24, 2026 13:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Pipeline-parallel intervention support for the vLLM backend#670

Pipeline-parallel intervention support for the vLLM backend#670
khaiwang wants to merge 72 commits into
devfrom
pp-on-dev

khaiwang commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

khaiwang commented Jun 5, 2026

Pipeline-parallel intervention support for the vLLM backend

Summary

The problem

The design

What's included

Validation

Notes / known limitations

Follow-ups

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant