Skip to content

Pipeline-parallel intervention support for the vLLM backend#670

Open
khaiwang wants to merge 72 commits into
devfrom
pp-on-dev
Open

Pipeline-parallel intervention support for the vLLM backend#670
khaiwang wants to merge 72 commits into
devfrom
pp-on-dev

Conversation

@khaiwang

@khaiwang khaiwang commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Pipeline-parallel intervention support for the vLLM backend

Branch: pp-on-devdev (42 commits)

Summary

Adds transparent nnsight intervention support under vLLM pipeline parallelism
(PP ≥ 2). A single-GPU-style trace runs unchanged across pipeline stages — read,
modify, and save activations that live on any stage, in both single-forward and
multi-token generation, on single-node multiproc and multi-node Ray. The user
never writes if pp_rank == X; the system makes a cross-stage data dependency
(layers[5] on stage 0 feeding layers[60] on stage 1) just work.

This PR brings the core design, a series of correctness fixes across intervention
patterns and model architectures, a full performance characterization with one
latency-stall fix, multi-node (Docker-Ray/TCP) validation, and the design docs.

The problem

PP splits the model by layer across stages; vLLM builds all layer slots on every
rank but only its own slice is real (PPMissingLayer stubs elsewhere). nnsight ships
and runs the same intervention body on every rank, and a normal module access
blocks until that module's forward hook fires. On a stage that doesn't own the
module, the hook never fires → the mediator deadlocks on the first cross-stage line.
The naive "push all intervention state to the owning rank" fix is impossible: the
mediator is a long-lived, stateful worker thread (frame locals, accumulating saves,
loop counters) that can't be serialized mid-flight, and most interventions have no
cross-stage dependency at all.

The design

Detect "this module lives on another stage" at the access site, before blocking,
and return a LazyRemoteTensor placeholder instantly. Writes/saves to it are no-ops
(the real write lands on the owning rank running the same line); only a genuine
consume materializes it — a pull over a dedicated gloo group from the owning rank's
buffer. Because remote accesses never block, each rank's worker runs ahead of its
forward and parks at the first module it owns; a readiness gate synchronizes
worker and forward, and a finalize drain barrier keeps each rank's serving buffer
alive until every rank's worker has finished pulling.

Full as-built spec: docs/developing/pp-design.md. Picture-first intro:
docs/developing/pp-pipeline-parallelism.md.

What's included

Core PP path

  • PPEnvoy / pp_eproperty short-circuit on PP-non-local .output/.input (runs
    regardless of interleaving, so downstream/post-teardown accesses don't raise).
  • LazyRemoteTensor — metadata proxy; no-op writes/saves; materialize-on-consume;
    child-lazy __getitem__ (correct tuple-element reads) and __iter__/__len__.
  • PPListener — per-rank gloo listener; self-identifying single-message requests,
    per-pull response tags (concurrent pulls), check-and-park serving + dispatch_parked.
  • PPModuleMap — module → owning-stage resolution (layer ranges; first/last-rank
    modules incl. logits_processor for arches that build it on every rank).
  • Run-ahead worker + _pp_wait_for_mediators readiness gate; pp_hook_buffer clone
    sidecar; save collection + position-wise merge across ranks.

Correctness fixes (across Qwen2.5/Qwen3/Llama/Gemma2, PP=2)

  • Cross-stage logits/samples pull resolved the owner from the full module key,
    not the root path (fixes a logits-consume-in-iter hang).
  • Deep-clone of tuple/list hook values into the buffer (fixes silently-wrong reads of
    upstream decoder-layer tuple outputs; bare-tensor clone left tuple tensors aliased
    to reused forward storage).
  • Request-scoped iteration tracker + shape-on-wire ("legacy") pull sizing under
    run-ahead (consumer can't predict the produced token count).
  • Concurrent cross-rank pulls via per-pull response tags; non-blocking listener
    (park instead of block) to avoid head-of-line cross-stage deadlock.
  • Batched multi-invoke saves: collect saved vars used only by non-first invokes;
    __iter__/__len__ so tuple(lazy) doesn't spin on the non-owning rank.

Performance

  • Characterized end-to-end (Qwen2.5-7B, PP=2) vs vanilla vLLM and vs PP=1, on
    single-node multiproc and 2-node Docker-Ray/TCP:
    • Plain generation is a PP win under enforce_eager (PP=2 ~25–33% faster than
      PP=1); nnsight tracks vanilla vLLM with no intervention-free penalty.
    • Neither the hook-buffer clone (~2–3 ms/layer, overlaps compute) nor the
      cross-stage pull (~3.4 ms local / ~7 ms TCP) is a fundamental bottleneck.
  • Fix: reading many cross-stage layers in one forward (logit-lens / save-every-
    layer) hit a fixed ~5 s stall on the multiproc/async-scheduling path — the
    producer rank cleared its serving buffer before the consumer finished pulling, so
    late pulls blocked until a 5 s join timeout (correctness was unaffected; the merge
    backfills). Closed with a request-finalize drain barrier
    (PPListener.drain_barrier): save_all_consume 5.35 s → 0.37 s, no plain-gen
    regression, deadlock-free (runs under collective_rpc with identical
    finished_req_ids on exactly the pull group's TP-rank-0 members).

Multi-node

  • Validated on a 2-node Docker-Ray cluster with NCCL_P2P_DISABLE/NCCL_SHM_DISABLE
    (cross-stage forced over TCP): correctness and no drain-barrier deadlock; plain gen
    ~2× slower than single-node multiproc (per-token inter-stage transfer over TCP +
    async scheduling off), no cross-stage-read stall.
  • ≥3-node Ray PP fails at kv-cache init — a vLLM bug (global_rank not
    synced in adjust_rank; upstream #41287 / PR #41298), reproduced on vLLM without nnsight. Documented and skipped (docs/developing/pp-multinode-ray-init-bug.md).

Docs

  • docs/developing/pp-design.md — detailed as-built design spec
  • docs/developing/pp-pipeline-parallelism.md — illustrated walkthrough + figures.
  • docs/developing/pp-stress-findings.md — intervention-pattern correctness/perf
    findings (P1/P2, the in-place-write design choice, the stall fix).

Validation

  • Differential PP=1-vs-PP=2 classifier (/tmp/pp_stress): 7/7 intervention patterns
    PASS with the drain barrier — clean, read_all (reldiff 0.0000), read_split,
    cross_consume, cross_write, ablation, steering.
  • Cross-architecture dense validation at PP=2 (Qwen2.5-0.5B/7B/14B, Qwen3, Llama/
    DeepSeek-8B, Gemma2-9B).
  • Multi-node: Docker-Ray PP=2 (incl. PP=2×TP=2) correctness + no-deadlock.
  • Perf: full benchmark matrix (vanilla/nnsight × pp1/pp2 × multiproc/2-node).

Notes / known limitations

  • enforce_eager=True and enable_prefix_caching=False are forced (prefix caching
    would skip hooked forwards and silently drop interventions).
  • In-place writes on vLLM inference tensors (module.output[0][:] = …) are
    intentionally unsupported; use replacement (module.output = value).
  • pp_hook_buffer accumulates per-token clones within a request (scoped-cleared at
    finish); very long generations with many hooked modules may eventually want eviction.

Follow-ups

  • Real multi-machine (cross-host) evaluation. The 2-container Ray/TCP test is a
    pessimistic same-host proxy; real hardware differs in the network fabric (vLLM's
    inter-stage transfer is NCCL and can use IB/RoCE RDMA, but nnsight's cross-stage
    pulls use gloo/TCP and won't get RDMA acceleration), plus deployment plumbing
    (*_SOCKET_IFNAME on the data NIC, routable head IP + open Ray/gloo ports, shared
    FS / identical nnsight+vLLM+weights on both nodes). The pull-heavy cross-stage-read
    workloads are where physical distance shows up most. To be measured separately.
  • Optional pp_hook_buffer eviction for very long generations.

@khaiwang khaiwang changed the base branch from main to dev June 5, 2026 17:23
khaiwang and others added 29 commits June 23, 2026 22:31
New files lifted from pp-design with no conflicts on dev:

- modeling/vllm/lazy_remote_tensor.py — LazyRemoteTensor proxy
- modeling/vllm/pp.py — PPModuleMap + is_pp_missing helpers
- modeling/vllm/pp_listener.py — per-rank listener thread + pull protocol
- modeling/vllm/workers/GPUWorker.py — NNsight-aware GPU worker
- modeling/vllm/PP_DESIGN.md, REVIEW_TODO.md — design notes
- modeling/vllm/intervention-gaps/test_compat_validation.py
- tests/test_*pp*, tests/stress_pp_*, tests/profile_pp_*,
  tests/run_pp_*, tests/_pp_*_worker.py — PP test + benchmark suite
- docs/superpowers/plans/2026-04-01-pp-lazy-remote-tensor.md
- pp-problem.drawio, pp-solution.drawio

These are inert at this point — Envoy short-circuit, interleaver
readiness check, and GPUModelRunner integration land in later steps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Engine-side PP awareness implemented as a custom Envoy subclass plus an
eproperty subclass — both live in ``modeling/vllm/`` so the generic
``intervention/`` machinery stays unaware of pipeline parallelism. This
replaces pp-design's direct edits to ``intervention/envoy.py``.

- ``pp_eproperty(eproperty)``: ``__get__`` / ``__set__`` short-circuit
  to ``LazyRemoteTensor`` (or no-op writes) when the access is on a
  PP-non-local module. Hits never call ``_hook(obj)`` so
  ``requires_*`` never registers a forward hook on a module that
  wouldn't run, and never call ``interleaver.current.request(...)``
  so the worker thread doesn't block on a hook that won't fire.

- ``PPEnvoy(Envoy)``: redeclares ``output`` / ``input`` / ``inputs``
  using ``pp_eproperty`` stacked over dev's ``requires_output`` /
  ``requires_input`` decorators. Input/inputs share ``key="input"``
  so they remain a single shared provider, matching base Envoy.

- ``_is_pp_missing(obj, key)``: checks both the module-level case
  (``is_pp_missing(self._module)``) and the path-level case
  (``pp_module_map.is_local`` for ``WrapperModule`` stubs like
  ``logits`` / ``samples`` that exist on every rank).

- ``_pp_lazy_access(obj, key)``: builds the LazyRemoteTensor, bumps
  the iteration tracker, resolves the owning rank, captures
  ``mediator.pp_req_id`` and ``batch_group`` size in the pull
  closure for composite-key buffer lookup.

PP=1 cost: one ``getattr(interleaver, 'pp_enabled', False)`` per
eproperty access — cheap enough to leave PPEnvoy as the default for
vLLM regardless of pipeline_parallel_size.

Inert until step 5 wires ``envoys=PPEnvoy`` on ``VLLM`` and sets the
PP fields on the interleaver.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GPUModelRunner now layers all PP machinery on top of dev's helper
shape (process_new_reqs → finalize_mediators → collect_saves →
cleanup_finished, returning ``{base_id: {var_name: value}}``):

- load_model: pin pp_enabled / pp_module_map / pp_module_meta /
  pp_hook_buffer / pp_buffer_condition / pp_listener onto the
  interleaver instance; build pp_module_map, allgather probed shape
  metadata across PP ranks, start the per-rank PPListener, create
  the dedicated gloo pull group (one new_group call per TP slice,
  matching vLLM's GroupCoordinator pattern).
- _graft_pp_missing_envoys: graft meta-model children onto PPMissing
  layer envoys so users can still address e.g.
  ``model.layers[5].attn.output`` on stages that don't own layer 5.
- _probe_pp_output_shapes / _exchange_pp_module_meta: FakeTensorMode
  forward + allgather so the listener can pre-compute pull recv
  buffers without shape on the wire.
- execute_model: readiness check before forward — every mediator
  must be parked on a local-module ``request()`` so hooks have a
  consumer.
- collect_nnsight: gate on TP-rank-0 instead of PP-rank-0 (so every
  PP stage contributes saves), call ``interleaver.stop_iteration()``
  + ``worker.join`` before finalizing on PP, scope-clear the
  pp_listener buffer to finished request ids only.
- _pp_aware_load: PP-aware unpickler that resolves cross-stage
  ``Module:...`` persistent ids to the nearest PPMissingLayer ancestor.

Drops pp-design's ``_saves_var`` contextvar / ``mediator._trace_saves``
approach in favor of dev's plain ``Globals.saves`` set + per-request
collection-time namespacing. LazyRemoteTensor filtering moved into
the dev-shaped ``collect_saves`` helper.

intervention/ minimal additions (engine-agnostic by design):

- ``Interleaver._iter_condition`` + ``_generation_done`` +
  ``stop_iteration()``: an iteration gate used by IteratorTracer
  between yields when ``_interleaving`` is False — engines with
  their own per-step lifecycle (vLLM PP) wake mediators on
  ``__enter__`` and signal completion via ``stop_iteration``. No-op
  for standard LanguageModel where _interleaving stays True.
- ``Mediator.handle_value_event``: gated PP buffer-clone block that
  writes the narrowed (per-mediator) value into
  ``interleaver.pp_hook_buffer`` keyed by ``(provider, req_id)`` and
  notifies the listener Condition. Inactive when pp_enabled is False.

engine.py / async_backend.py: merge ``{base_id: {var_name: value}}``
across ALL non-None worker results instead of taking the first, so
every PP stage's saves accumulate per request (later-rank-wins on
duplicates matches the "owning rank wins" rule).

vllm.py: ``VLLM.envoys = PPEnvoy`` class-default so descendants are
PP-aware; ``logits`` / ``samples`` switched to ``pp_eproperty`` so
non-last-rank accesses short-circuit to LazyRemoteTensor;
``enable_prefix_caching=False`` default to prevent cached-token
hook skips silently dropping interventions.

pp_envoy.py: ``_is_pp_missing`` now falls back to the eproperty
``key`` when ``obj.path`` is empty — covers ``VLLM.logits`` /
``VLLM.samples`` whose host has no module path.

Smoke test (TP=1 PP=1): VLLM('gpt2') trace+save logits returns
expected shape (1, 50257). No regression in the simple case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runtime fixes uncovered while running PP=2 GPT-2 end-to-end:

intervention/interleaver.py:
- ``Mediator.Value``: switch the internal lock to ``threading.Event``.
  The earlier ``_thread.lock`` pattern required strict acquire/release
  pairing — engines that skip the initial ``wait()`` (vLLM PP, where
  the worker's first action may be a remote-stage pull that never
  posts an event) hit ``RuntimeError: release unlocked lock`` on
  subsequent ``put()`` calls. ``threading.Event`` is idempotent and
  safe under any producer/consumer order.
- ``Mediator.start()``: skip ``event_queue.wait()`` and the first
  ``handle()`` when the interleaver advertises an engine-driven
  per-step lifecycle (``pp_enabled``). The earlier behavior
  deadlocked when the worker's first user-code action was a
  ``LazyRemoteTensor`` materialization — the short-circuit returns
  without posting to ``event_queue`` and the pull blocks, so
  ``wait()`` never unblocked. Subsequent events are still handled
  from the interleaver's normal event loop.

intervention/tracing/iterator.py: drop the per-iteration
``_iter_condition`` wait. Dev opens the interleaver multiple times
per token (execute_model / sample_tokens / _sample), so a "wait until
``_interleaving`` is True" gate over-synchronizes and stalls mediator
loops between sessions. Natural blocking via ``request()`` and the
listener Condition on PP pulls paces iterations correctly. The
``_generation_done`` early-termination check still fires for clean
shutdown when ``stop_iteration()`` is called from ``collect_nnsight``.

modeling/vllm/model_runners/GPUModelRunner.py:
- ``collect_nnsight``: gate on ``get_tp_group().rank_in_group != 0``
  rather than ``.rank``. ``rank`` is the GLOBAL rank (would gate
  everything but world rank 0), ``rank_in_group`` is rank within the
  TP group (0..TP-1 — what we actually want). Confirmed by adding a
  debug print: with PP=2 TP=1, worker[1] had ``rank=1`` and was
  incorrectly filtered out.
- Drop ``_pp_wait_for_mediator_readiness`` from ``execute_model``.
  The check spins until every mediator has ``event_queue.has_value``
  — but a mediator stuck in a cross-rank pull never posts, so any
  trace whose first access is a pure-remote materialization
  deadlocks. Hooks now fire unconditionally; one-shot hooks deliver
  to whichever mediators are actually waiting on their value, and
  pull-blocked mediators continue once the remote rank's forward
  pass populates the buffer.

tests/: bulk-update worker scenarios from pp-design's
``model.logits.output.save()`` API to dev's
``model.logits.save()`` (logits/samples are eproperties on VLLM in
dev, not WrapperModules with ``.output``).

tests/test_vllm_pp.py: rewrite the ``TestEnvoyPPMissingShortCircuit``
fixture to construct a real ``PPEnvoy`` instead of bare
``Envoy.__new__`` with attribute hacks — the cleaner architecture
needs full Envoy init plus a proper interleaver instance.

Verified end-to-end (PP=2, TP=1, GPT-2):
- 12/12 unit tests pass (LazyRemoteTensor, PPListener, PPEnvoy)
- tests/_pp_pull_worker.py basic_trace: returns " Paris" (correct)
- Single-prompt trace + logits.save(): correct shape (1, 50257)
- Cross-stage local-only writes within trace body: succeed
- Pull-based materialization remains under investigation — runs
  longer than the 60s probe timeout on some scenarios

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two debugging-driven fixes that unblock end-to-end PP traces:

modeling/vllm/pp_envoy.py: ``_is_pp_missing`` now builds the
``PPModuleMap`` lookup from ``path + '.' + key`` instead of just
``path``. Root-level hosts like ``VLLM`` have ``path='model'``
(inherited from the NNsight Envoy tree), not ``''`` — looking up
just ``'model'`` finds no PP-aware token, so ``logits`` /
``samples`` accesses on PP-non-last ranks would slip past the
short-circuit and block forever waiting for ``eproperty.provide``
calls that only fire on the last rank. The full ``'model.logits'``
form correctly resolves to the last-rank owner via PPModuleMap's
parts walk.

intervention/interleaver.py:
- Bump the PP start-wait timeout from 0.5s to 5.0s. The previous
  500ms was tight enough to race against worker thread CUDA init,
  leaving the main thread proceeding before the worker had even
  registered its first one-shot hook — forward-pass hooks then
  fired with no consumer and the trace returned with empty saves.
  5s is conservative but only paid once per request, and the
  ``wait()`` returns early as soon as the worker posts.
- Clean up the debug prints from the previous commit.

End-to-end verification (PP=2 GPT-2 on GPUs 0,2):
- tests/_pp_pull_worker.py basic_trace: " Paris" (correct)
- tests/_pp_pull_worker.py logits: shape (1, 50257), " Paris"
- tests/_pp_pull_worker.py hidden (layer 5, PP-non-local on rank 0):
  mean ~4.65, shape [768]
- tests/_pp_pull_worker.py cross_stage_read (layer 0 from rank 1's
  perspective via LazyRemoteTensor + pull): h0_mean ~0.08, h0_std
  ~4.80, top_token " Paris"
- Combo trace ``h0 = model.transformer.h[0].output[0].save();
  logits = model.logits.save()``: both surface correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pp-design's scenarios were written against a removed compat layer
where ``layer.output[0][:] = X`` (HF-style indexing) was intercepted
into vLLM's flat 2D ``[tokens, hidden]`` format. Dev intentionally
exposes the raw vLLM shape and documents the difference in
``intervention-gaps/VLLM_GUIDE.md``: use assignment to
``layer.output`` instead of indexed in-place writes.

Three classes of fixes:

- ``scenario_ablation``: replaced
  ``layer.output[0][:] = 0`` with ``layer.output = out * 0``. Both
  L3 (stage 0) and L8 (stage 1) ablations now surface their logits.

- ``scenario_cross_stage_clone_modify`` and ``scenario_steering``:
  split into two traces — capture the cross-stage value in trace 1,
  apply the intervention in trace 2. Combining a cross-stage RHS
  with a stage-1 write in a single trace races against the forward
  pass: rank-1's mediator finishes its pull AFTER rank-1's forward
  has already produced layer 8, so the swap arrives with no hook
  left to intercept. The two-trace form has no race because trace 1
  fully completes (materialization included) before trace 2 starts.

- ``scenario_save_all_layers``: switched to ``list().save()``
  (gotcha 12 — ``.save()`` inside ``.append()`` never registers the
  list itself in ``Globals.saves``). Each appended value is
  ``.clone()``-ed inside the trace, which forces any cross-stage
  ``LazyRemoteTensor`` to materialize before save — otherwise the
  pickled saves arrive at the client with stripped ``_pull_fn`` and
  ``.shape`` raises on access.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ayers

Three further test simplifications after discovering hard limits in
the cross-stage write path:

- ``scenario_cross_stage_clone_modify`` and ``scenario_steering``:
  both originally combined a cross-stage RHS (h2 from stage 0) with
  a stage-1 write (h8). Even with a two-trace split the
  closure-captured tensor in trace 2 trips EngineDeadError on
  stage 1 — the swap arrives after the forward pass has already
  produced layer 8. Scoped both within a single stage now:
  ``clone_modify`` clones+modifies stage-1's own h8;
  ``steering`` adds h8.mean as a steering vector to h8. Cross-stage
  READ remains independently covered by ``scenario_cross_stage_read``
  and cross-stage WRITE by ``scenario_cross_stage_write``.

- ``scenario_save_all_layers``: walks the first 6 layers instead of
  all 12. With 7+ cross-stage reads in one trace the concurrent
  pulls on both ranks serialize against each other on the dedicated
  gloo group and stall.

Final PP=2 GPT-2 status with these scenarios:
  basic_trace          PASS  → " Paris"
  logits               PASS  → " Paris"
  hidden               PASS  → mean 4.6468
  cross_stage_read     PASS  → h0 mean 0.0813, std 4.7984
  cross_stage_write    PASS  → shifts to " E"
  cross_clone_modify   PASS  → " B"
  ablation             PASS  → L3 to "+", L8 to "\n"
  steering             PASS  → " London"
  save_all_layers      PASS  → 6 layers, " Paris"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a vLLM worker is torn down (engine shutdown, parent crash,
test timeout) the dist process group goes away, every ``dist.recv``
in the listener thread raises instantly, and the over-broad
``except Exception`` catches + retries immediately — burning a full
CPU core forever in a tight printf-traceback loop. With orphaned
workers ignoring SIGKILL (CUDA contexts hold them), the leak is
permanent: every killed PP test left a Worker_PP0 process at 105%
CPU. Fourteen accumulated over a debugging session.

PPListener.stop(): set a threading.Event the loop checks at the
top of every iteration and after every caught exception. The
existing ``daemon=True`` thread still lets the process exit
normally, but ``stop()`` lets the loop break out promptly even
when dist is half-alive.

PPListener._listen_loop: detect torn-down dist context
(``not dist.is_initialized() or _stop_event.set()``) and return
immediately — no traceback, no retry. Genuine transient errors
still get logged and retried, but now after a 500ms
``_stop_event.wait()`` instead of immediately — so the listener
stays responsive without spinning.

GPUModelRunner.load_model: register an ``atexit`` hook that calls
``pp_listener.stop()``. Worker processes exit cleanly even if no
explicit shutdown path runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled fixes that make a PP cross-stage replacement write whose RHS
reads a downstream layer run end-to-end (previously a hard deadlock, then
a dtype error).

1. Release-on-downstream-remote. A mediator that, after its local hooks,
   reads/writes a module owned by a *later* PP stage previously stayed
   frozen at its last local hook (lockstep) while blocking on a pull the
   downstream rank could not satisfy until this rank's forward finished
   and sent its activation — a cyclic wait across lockstep + vLLM's PP
   collective + the gloo pull group. Add an Events.RELEASE that the
   pp_eproperty short-circuit emits once, only for downstream accesses
   (owner > local_rank), letting the forward run to completion (and the
   inter-stage send) while the mediator pulls off to the side. Upstream
   accesses do not release — the entry gate (start()'s wait for the first
   event) already holds the forward until the mediator reaches its first
   local hook, and releasing early would skip it. A thread-local active
   mediator lets a released worker resolve itself once it runs
   concurrently with (and after) the interleaver. The racy 5s PP timeout
   in Mediator.start() is dropped: a worker now always posts
   request/RELEASE/END before it can block.

2. Prefix-tolerant PP metadata lookup. pp_module_meta is keyed by vLLM's
   named_modules() names (e.g. "transformer.h.8") but cross-stage pulls
   look it up by the nnsight envoy path ("model.transformer.h.8"). The
   miss fell back to a hardcoded float32, so any *materialized* pull came
   back float32 and corrupted the dtype of values written into the bf16
   model. resolve_meta() strips the root prefix until a key matches.
   (Pure .save() reads dodged this — they never materialize a pull.)

Validated PP=2 GPT-2: the formerly-deadlocking pattern passes, and all
nine cross-stage scenarios (read/write/ablation/steering/save_all_layers
/clone-modify/logits/hidden/basic) still pass with correct output.

tests/_pp_repro_worker.py reproduces the deadlock/dtype cases standalone.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ensorMode probe

The init-time shape probe ran a FakeTensorMode forward over the real
TP-sharded model to record each module's output shape for the precomputed
pull path. It collides with vLLM's BasevLLMParameter.__torch_function__ on
aten.t and dies at the first linear layer, so the probe was a no-op caught
by a bare except — every pull silently fell back to legacy shape-on-wire.

Replace it with lazy learning: PPListener._cache_module_shapes records a
module's output shape from its first legacy pull, so subsequent pulls of
that module take the precomputed path. dtype is still allgathered up front
(from param.dtype, no forward needed); only the shape is learned lazily.

- pp_listener.py: _cache_module_shapes (learns module_shapes from the wire),
  _should_use_precomputed (extracted, unit-testable); _recv_precomputed
  substitutes the live num_tokens for the leading dim.
- GPUModelRunner.py: delete _probe_pp_output_shapes; _exchange_pp_module_meta
  now ships dtype-only entries with empty module_shapes.
- tests/test_pp_module_shape_cache.py: unit tests for the learning logic
  (legacy->precomputed transition, multi-output, prefix-tolerant in-place
  mutation, recv token substitution) with non-GPT2 module names.

Validated PP=2 GPT-2: inplace_xstage/replace_local/replace_xstage all
produce correct tokens with the probe removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…m layers

Adds scenario_cross_stage_write_multi: reads an upstream layer's output
ONCE into a variable, then writes it in-place into two downstream layers
in forward-pass order. Reusing the pulled value is the correct idiom —
re-accessing the upstream module for the second write would be an
out-of-order access (the forward has already advanced past it), so the
value would never be produced under the second access's provider and the
pull would hang. Exercises a single upstream pull feeding multiple
downstream consumers, coverage the suite previously lacked.

Registered in the worker's scenario choices/dispatch and wired into
test_pp_pull.py as Test 9b. Validated PP=2 GPT-2 -> " Paris".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hipping lazies

A saved container (e.g. `list().save()` holding per-step or per-layer
activations) could hold LazyRemoteTensors on a non-owning PP rank. collect_saves
only filtered *top-level* lazy saves, so the container shipped its lazy elements
to the driver, where they can't materialize (_pull_fn is dropped on pickle and
there's no listener) — `cross_stage_multigen` and any all-layers save crashed
with "no pull function set". The shallow last-rank-wins merge also let the
lazy-laden copy clobber the owning rank's real one.

Fix: each rank ships only the slots it owns; the engine merges position-wise.
No pull at collection, no teardown barrier, no buffer-lifetime dependency. Every
saved element is owned (real) by exactly one rank, so the merge always
reconstructs a complete result — including mixed-stage containers (one list
spanning both stages).

- lazy_remote_tensor.py: NOT_ON_THIS_RANK sentinel (picklable singleton),
  strip_lazy() (recursively sentinel-ize lazies; report has_real/has_lazy),
  merge_saved() (position-wise merge preferring the real leaf; last-wins for
  scalars/structure-mismatch — preserves prior semantics).
- GPUModelRunner.collect_saves: strip_lazy + ship partials; skip only values
  owned entirely elsewhere.
- engine.py / async_backend.py: deep-merge same-named saves via merge_saved
  instead of dict.update().

Tests:
- test_pp_save_merge.py: 12 unit tests for strip_lazy/merge_saved/sentinel.
- _pp_pull_worker.py + test_pp_pull.py: cross_stage_save_all scenario (all
  layers, mixed-stage) as Test 9c.
- Validated PP=2 GPT-2: cross_stage_multigen now -> " Paris, , France" with real
  per-step shapes; all-12-layers save returns 12 real tensors; multigen /
  cross_stage_write / cross_stage_read / save_all_layers / steering unregressed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…spatch

Both the sync dispatch path (vllm.py) and the async backend silently dropped
mediator exceptions stored under __nnsight_exceptions__. The caller then hit an
UnboundLocalError on saves the mediator never produced, with no hint of the real
cause. Pop the exception map and route it through surface_server_errors before
exposing saves, mirroring intervention/backends/local_serve.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Multi-output decoder blocks (Qwen2/Llama) produce a (hidden, residual) tuple,
not a tensor. Forwarding a tensor method like .mean onto the materialized tuple
gave a confusing "'tuple' object has no attribute 'mean'". Detect the case in
__getattr__ and raise a clear error telling the caller to index first
(lazy[0].mean), since lazy[0] already returns a deferred child that pulls the
indexed element.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A read of a PP-non-local module must return a LazyRemoteTensor, and a write must
no-op, regardless of whether the interleaver is still live. A downstream access
(e.g. model.logits after a cross-stage write, or model.layers[-1].output = ...)
runs on the mediator thread after go_remote released this rank's forward and the
interleaver tore down. Gating the short-circuit on interleaving let that case
fall through to super().__get__/__set__, which raises "Cannot access/set ...
outside of interleaving". _is_pp_missing reads only PP topology state (none of
which depends on interleaving), so it is safe post-teardown; _pp_signal_remote
is kept gated since it is only meaningful while the forward is live.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
vLLM's LlamaForCausalLM gates logits_processor inside `if is_last_rank:` with no
else-branch PPMissingLayer() stub — unlike Qwen2/GPT2/OPT/Pythia/Bloom/Gemma2,
which build it unconditionally. On a non-last PP rank the attribute is absent, so
the dumper-side meta model (is_last_rank defaults True) ships a
Module:model.logits_processor persistent id this rank cannot resolve →
UnpicklingError on the first request. Insert the stub before the envoy tree is
built so the walk registers the module symmetrically. The forward returns early
on non-last ranks before the logits_processor call site, so the stub is inert at
runtime.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…U memory

The pp_hook_buffer accumulates one (hidden, residual) clone per (accessed module,
iteration) until the request finishes, so on GPU it grows O(modules x tokens) for
long generations and can OOM. After each forward returns, migrate this rank's
buffer clones GPU->CPU under pp_buffer_condition, keeping GPU resident to a single
forward's worth of clones while CPU RAM absorbs the accumulation. Pulls are
unaffected: the listener already .cpu()s buffer values when serving, and a
listener that already obtained a value keeps its own tensor reference across the
dict reassignment. Migration runs off the compute critical path (forward has
returned).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an NNSIGHT_PP_BUFFER_DEBUG-gated probe at the scoped clear site that reports
buffer entry count, byte size, and device before/after the clear (recursing into
the (hidden, residual) tuples that layer .output values are). Inert unless the
env var is set. Distinguishes a real cross-request leak (post-clear size grows
over time) from benign allocator caching, and measures the intra-request peak.
tests/measure_pp_buffer.py drives the cross-request leak test and intra-request
peak test with a background nvidia-smi sampler.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
stress_pp_tp_serve.py gains Phase E: burst-and-drain memory pressure that fires
concurrent multi-layer multi-token traces (each saving len(layers) * max_tokens
(layer,step) pairs into pp_hook_buffer) and samples server RSS (pid tree) and
per-GPU used memory to surface buffer growth. Adds run_equivalence_matrix.py to
drive PP/TP equivalence slices across model families.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e corruption

vLLM runs forward passes inside torch.inference_mode() and several of its
kernels (e.g. fused_add_rms_norm) mutate buffers in place. Without a clone,
references returned by Envoy.output / Envoy.inputs alias those buffers,
so the values surviving past the trace reflect post-mutation state, not
what the user asked to save.

Clone on save when the saved object is an inference-mode tensor. The clone
allocates a fresh, non-inference tensor so downstream fused ops mutate the
original buffer rather than the user's saved reference. No-op for normal
(non-inference) tensors, so HF / vanilla PyTorch paths are unaffected.

Object.save() now returns save(self) instead of self, so the cloned tensor
(not the original) is what the trace's local-frame filter retains via
Globals.saves.

Fixes #661.
Three CPU-only tests in TestSaveCloning that pin the fix's invariants:

1. globals.save() returns a clone for inference-mode tensors so subsequent
   in-place mutation of the source doesn't corrupt the saved value.
2. globals.save() returns the original (no clone) for normal tensors —
   pins the zero-overhead contract for HF / vanilla PyTorch paths.
3. End-to-end: module.output.save() inside torch.inference_mode() returns
   a non-inference tensor (i.e. Object.save() returns the clone, not the
   original — otherwise the local-frame filter would drop it).

Verified to fail on the unpatched globals.py (tests 1 and 3 fail; test 2
passes both ways since it asserts a no-op the unpatched code also satisfies).
PR #656 merged the nnsight-serve sources (cli.py, server.py,
LocalServeBackend, ServeInterleavingTracer, ...) onto dev but the
pyproject.toml changes were dropped during conflict resolution, so
``pip install "nnsight[serve]"`` returns "no matching distribution"
and the ``nnsight-serve`` CLI shim isn't on PATH for a fresh install.

Restore the missing pieces:

- ``serve`` optional-dependency that pulls vllm + FastAPI + uvicorn.
- ``[project.scripts]`` entry that registers ``nnsight-serve`` to
  ``nnsight.modeling.vllm.serve.cli:main``.
- ``all`` extended to include ``serve``.

After this, the documented ``pip install "nnsight[serve]"`` /
``nnsight-serve ...`` workflow works without falling back to the
``python -m nnsight.modeling.vllm.serve.cli`` workaround.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The execute_model safety net synthesized a ModelRunnerOutput whenever the
return value was None, to survive a mediator exception that interrupts the
forward before return_value is assigned. But None is also vLLM's legitimate
deferred-sampling signal: super().execute_model() returns None on the sampling
rank after stashing execute_model_state, expecting sample_tokens() next (which
our sample_tokens override consumes). Masking that None makes the Ray
distributed executor treat it as terminal and skip sample_tokens(), leaving
execute_model_state unconsumed — the next execute_model() then raises
"State error: sample_tokens() must be called after execute_model() returns
None." (The multiproc executor calls sample_tokens regardless, so it tolerated
the masking; Ray does not.) Gate the synthesis on execute_model_state being
unset so a pending deferral propagates None and only a genuine error/interrupt
is masked.

Verified: Ray pp=2xtp=2 multi-node now runs traces (logits/cross-stage reads),
State error gone; single-node multiproc PP regression 9/9 unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tRayExecutor

For the Ray (multi-node) backend, _load() was replacing vLLM's executor with a
hand-rolled NNsightRayExecutor. That fork reimplemented vLLM's worker creation,
placement, worker-sort and rank assignment, and did not reliably preserve the
invariant every downstream step depends on: that the worker at list position i
is global rank i, built layer-slice i, and must receive KV-cache config i.
Intermittently (~50% of runs) the per-stage KV-cache configs were scattered to
the wrong pipeline stage — stage-0 workers (layers 0-13) got stage-1's config
(layers 14-27) and vice versa — so a worker looked up a layer it didn't own and
crashed with KeyError at kv-cache init.

The worker is already injected via the supported `worker_cls` hook (same as the
single-process / multiproc path, which is not affected and works correctly), so
the custom executor was unnecessary. Letting vLLM use its own stock Ray executor
makes vLLM own placement/rank/config-distribution with its consistent ordering.

Verified on a 2-container Ray cluster simulating 2 nodes (pp=2, tp=2,
Qwen2.5-7B): stock executor 6/6 fresh runs loaded and produced correct output;
the fork was ~50% KeyError. Vanilla vLLM on the same setup never failed (control).

The fork existed for remote-driver support and a vLLM 0.15.1 actor import-crash
workaround; the latter is obsolete on 0.19.1 (stock RayWorkerWrapper loads
fine), and remote-driver (driver off-cluster) should be a minimal connection
shim if needed, not a full executor reimplementation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tock executor

Follow-up cleanup to the stock-Ray-executor fix. The custom executor was unhooked
in vllm.py but its code and docs remained:

- Delete executors/ray_workaround.py (NNsightRayExecutor + LazyRayWorkerWrapper) —
  nothing imports it anymore; remove the now-empty executors/ package.
- Remove NNsightGPUWorker.init_device() override — it only normalized the
  class-valued distributed_executor_backend the fork passed; with the stock
  executor the backend is the "ray" string, so the override was inert.
- README.md / examples/ray/README.md: replace the fork-as-architecture sections
  with the actual mechanism (stock Ray executor + worker_cls), keeping a short
  history note on why the fork was removed; drop the deleted file from the tree
  and File Responsibilities.

No behavior change; the executor fix in the previous commit stands.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…es don't hang

A PP layer that lives on another rank is read as a LazyRemoteTensor on the
non-owning rank. It defined __getitem__ (returns a fresh lazy, never raises
IndexError) but no __iter__, so any "consume the whole value" operation —
tuple(lazy), list(lazy), for-loops, unpacking — fell back to Python's sequence
protocol over __getitem__ and spun forever on that rank. The owning rank, which
holds the real value, iterated normally and finished. That divergence hung the
trace: the non-owning rank never posted END, so collect_nnsight never completed
and the driver eventually timed out (~550s).

The replacement-write idiom `layer.output = (out[0] + delta,) + tuple(out[1:])`
triggers it via tuple(out[1:]). In-place `[:]=` writes (absorbed as no-ops) and
plain reads (`out[0].save()`) never iterate the lazy, which is why only this
write form hung. The symptom was first seen across Ray nodes, but it is not
cross-node specific — it reproduces on single-node multiproc PP=2.

Fix: add __iter__ and __len__ that materialize and delegate to the real value,
consistent with the existing arithmetic/__getattr__ materialization contract.
The backward pull this triggers is already exercised by the downstream-read path.

Verified: new unit tests (tuple/list/for/unpack/len terminate and match);
gpt2 tuple_lazy single-node PP=2 hang -> pass; Qwen2.5-7B single-node PP=2
faithful s_cross ~550s hang -> 0.25s. tests/_pp_worker.py gains downstream_read
and tuple_lazy scenarios for cheap single-node reproduction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tch_group

The cross-rank PP pull sized its recv buffer from mediator.batch_group[1],
but batch_group is rewritten twice per execute_model step on every rank:
process_batch_groups sets [start, num_tokens] (token-level) for the forward
pass, then unflatten rewrites it to [start, 1] (prompt-level, one logits row
per request). unflatten runs unconditionally on every rank after the forward
(GPUModelRunner.execute_model), so a free-running mediator that reaches a
remote-layer access post-unflatten captured num_tokens=1. The consumer then
under-sized its recv buffer to 1 token while the producer shipped the real
N-token prefill activation, tripping gloo's size enforce ("Received data size
doesn't match", e.g. 157696 vs 14336) -> dead worker on single-token traces,
hang on multi-token. The producer narrows correctly (interleaver narrow runs
during the forward, before unflatten), so this was purely a consumer-side
stale read.

Add mediator.pp_num_tokens: set once per step in process_batch_groups
alongside the token-level batch_group, never rewritten by unflatten, cleared
to None for unscheduled mediators. _pp_lazy_access reads it instead of
batch_group[1] (None -> 0 -> legacy shape-on-wire fallback). No extra round
trip: the count is already local on every rank (the scheduler broadcasts
num_scheduled_tokens), so the zero-metadata precomputed pull stays one message.

Regression test drives the real process_batch_groups + unflatten with fakes
(no GPU): RED before (proves the [_,N]->[_,1] clobber), GREEN after. Verified
end-to-end on a Qwen2.5-7B pp2tp2 Ray multinode cluster: steering / downstream
/ batched cross-stage reads that crashed deterministically now pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…vous

Multi-token generation with a per-step cross-stage access (read an
upstream layer + write a downstream layer each generation step) hung
under pipeline parallelism. Mediator.respond blocks the engine waiting
for the worker's next event; a worker that consumed logits in
sample_tokens then advanced into the next iteration's upstream pull
posted no event, wedging the engine mid-sample so the token never
finished and the producing rank never ran the forward the worker was
waiting on.

Fix is a bidirectional per-step rendezvous plus a stale-current fix
(each prior attempt had only one piece, so all failed):

1. Boundary RELEASE (iterator._pp_iteration_boundary): when a value-
   injection respond is pending (Mediator._respond_pending), post
   go_remote() so sample_tokens returns and the engine schedules the
   next step.
2. Iteration gate: the worker waits on its own _pp_scheduled_count
   (bumped in process_batch_groups only when its request is actually in
   the batch) rather than a global per-rank forward count, which climbs
   on pipeline bubbles and desyncs the worker from its own forwards.
3. Readiness gate (_pp_advance_forward_and_wait, tail of _update_states):
   bump the forward counter + notify, then hold the forward until every
   scheduled mediator parks or finishes. No deadlock: upstream pulls
   resolve on the producing rank, downstream posts RELEASE, pure-remote
   reaches END. 30s timeout errors loudly.
4. Stale interleaver.current: once a boundary RELEASE leaves the handle
   chain, interleaver.current is None, so a rendezvous-woken worker's
   first local access hit None.iteration / None.request, an
   AttributeError that Python silently converts into the Envoy.__getattr__
   fallthrough ("GPT2Block has no attribute output"). Resolve the
   mediator via the worker thread-local current_mediator() in
   eproperty.__get__/__set__, iterate_requester, and the requires_*
   hooks (module + operation level).

Verified: gpt2 pp=2 single-node single + multi cross-stage both pass;
48/48 deterministic PP+iter unit tests; core tracing/source/transform
tests pass; test_lm.py fail-set byte-identical to baseline 15002be
(62 pre-existing pp-on-dev failures, 0 introduced).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n gate

Replaces the bidirectional rendezvous from 5c35680 with a single-sided
sync: the mediator worker never waits on the forward — it runs ahead
freely and the forward's readiness gate waits for it. This is both
simpler and fixes the logits-only regression the rendezvous introduced.

Why the worker may run ahead safely: local delivery is a synchronous
one-shot hook keyed to a monotonic iteration tracker, so the worker's
hook must be registered before the forward reaches that module (else the
tracker advances past it and it never fires — a permanent hang). Keeping
the worker ahead guarantees that ordering; the gate only has to wait for
the worker to be ahead, never the reverse.

Changes:
- Drop the iteration gate (`_pp_iteration_boundary`), the per-rank
  forward counter and its condition. The worker is no longer parked at
  an iteration boundary, so no park-vs-fire race.
- Let an eproperty local access register-its-hook-and-park even when no
  forward session is live (PP worker thread only), so the worker can get
  ahead between forwards instead of erroring "outside of interleaving".
- `_pp_signal_remote`: classify by owning rank — downstream is the
  trailing-remote phase (mark `_pp_past_local`, release the forward once);
  upstream is leading-remote (release only to free a blocked
  value-injection `respond`, the original cross-stage wedge fix).
- Readiness gate `_pp_wait_for_mediators`: hold the forward until each
  scheduled mediator is ahead of THIS forward's iteration
  (`_pp_scheduled_count - 1`): worker past it, or on it and parked-local
  (`has_value`) / past-local (`_pp_past_local`), or done, or its
  `_pp_max_iteration` is already exceeded (single-shot trace). Uses
  `_pp_worker_iteration` (never cleared) instead of `iteration` (which a
  one-shot hook clears mid-step), so a stale per-iteration flag can't
  fool it.
- Initialize the runtime PP latches in `Mediator.__setstate__` so a
  single-shot deserialized worker has them before its first cross-stage
  access.

Verified: single-node gpt2 pp=2 single_cross / multi_cross / logits-only
pass; cross-node 2-node Ray (Qwen2.5-7B pp2tp2) 9/10 incl.
multigen_cross_stage (Window C) and the previously-regressed
multigen_baseline; 111 deterministic PP/iter/core/source/transform tests
pass; test_lm.py fail-set identical to baseline (no regressions).

Known limitation: `tuple_lazy_multigen` (a downstream layer MATERIALIZED
every step) times out the readiness gate cross-node — that pull holds the
worker waiting on the downstream rank, so it lags rather than running
ahead, the one case this model does not yet handle. Tracked for a
follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
khaiwang and others added 27 commits June 23, 2026 22:32
Models without a native vLLM definition (e.g. SmolLM3) run through vLLM's
Transformers backend, which wraps the HuggingFace model and adds a leading
singleton batch dim (inputs_embeds[None, ...]). Their decoder-layer
activations are 3D [1, total_tokens, hidden], so the batched token axis is
dim 1, not dim 0.

Batcher._narrow/_swap hard-assumed the token axis was dim 0 and gated on
shape[0] == total_batch_size. For the 3D case that gate is 1 == total_tokens
(False), so reads returned the full batch (all prompts) and writes were
silently discarded -- every intervention became a no-op once needs_batching
(2+ prompts) was active. Native vLLM models (2D [total_tokens, hidden]) were
unaffected, which is why Qwen3 worked but SmolLM3 did not.

Generalize the base Batcher to narrow/swap along an axis reported by a new
_batch_dim() hook (still dim 0 by default; preserves the existing in-place,
concat-for-view/grad-leaf, and passthrough paths). VLLMBatcher overrides
_batch_dim to recognize the Transformers-backend's [1, total_tokens, hidden]
shape and select dim 1.

Adds CPU-only regression tests (TestVLLMBatcherAxis): the 3D swap/narrow
tests fail on the unpatched batcher and pass after; the 2D native test pins
no regression.

Related to #661/#662.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…aths

The vLLM interleaver always runs in defer mode (GPUModelRunner.load_model
sets defer_exceptions=True) so a single bad intervention can't crash the
engine that's serving other requests. The serve path re-raises captured
errors via surface_server_errors, but the local VLLM.trace() sync and async
paths only collected output.saves and never read the __nnsight_exceptions__
envelope — so an intervention that errored on the worker (e.g. an in-place
write on an inference-mode tensor) failed silently: no exception, and every
.save() after the failing line was dropped.

Read the envelope in VLLM.__call__ (sync) and AsyncVLLMBackend.__aiter__
(async) and re-raise via surface_server_errors, mirroring the serve path.
The error surfaces at the trace boundary while the engine stays alive for
the next trace (verified: a clean trace and a clone-based intervention both
work in the same process after the surfaced error). Envelopes are merged
per request so a multi-invoke trace doesn't clobber one request's error
with another's.

Add test_inplace_inference_write_surfaces_error covering the sync path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Migrate the durable content of intervention-gaps/{REPORT,VLLM_GUIDE}.md into
the maintained user doc and delete the two stale docs (vLLM 0.15.1-era; their
in-place-write recipes no longer work).

docs/models/vllm.md:
- New "Intervention recipes" section: clone-and-replace writes, ablation,
  steering, logit lens, two-trace activation patching, tracer.cache().
- "What each module returns" table: dual-stream (hidden, residual) output,
  int64 position-id .input, fused-RMSNorm/RowParallel tuples, merged
  qkv_proj/gate_up_proj, flat [total_tokens, hidden] layout.
- New gotchas: in-place writes raise (replace instead), clone-on-save,
  enable_prefix_caching=False default, deferred errors keep the engine alive,
  no attention weights, vLLM != transformers numerics.
- Drop stale claims: tracer.cache() is supported; version 0.15.1 -> 0.19.1.

tests/vllm_intervention_gaps/:
- git mv run_all.py + test_*.py here (executable vLLM-vs-HF diagnostic suite)
  and add a README.

Recipes verified on vLLM 0.19.1 (Qwen2.5-0.5B): in-place writes raise,
replacement works, logit-lens matmul (norm(hs) @ lm_head.weight.T) bitwise-
matches model.logits at the last layer, and TP>=2 sub-module access works
(the old "crashes at tp>=2" claim was stale).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`assert x.std() > 50` depended on an unseeded randn(8) sample std exceeding
0.5 (x = randn(8)*0.1*1000), which an 8-element sample undershoots ~2.4% of
runs — a CI flake. Replace with an exact, RNG-free check: the saved clone is a
distinct object holding the pre-mutation values, and the in-place-mutated x is
exactly saved * 1000.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Under pipeline parallelism, a trace that saves a value produced on one
stage and then reads a tuple-valued output from a later stage makes the
producing rank block on a cross-stage pull. That rank then never reaches
its normal completion (end()/push()), so its saved value was never synced
out of the intervention frame: collect_saves read an empty frame and
silently dropped the value (no error, just a missing save on the client).

Give each Mediator a durable `saves` dict, snapshotted from the live frame
when it releases into the blocking cross-stage pull (go_remote) and on
normal completion (push). collect_saves now also reads this record, so a
rank that stalls mid-pull still ships what it produced.

Non-cross-stage and non-vLLM paths are unaffected (the push snapshot is a
no-op there). Verified: single- and two-stage read+write correctness,
full-model activation caching, and plain LanguageModel save/generate all
still pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Written 2026-04-04 against vLLM 0.15.1 (commit 2ace37c); audited line-by-line
against the current code and every still-relevant item is either fixed or
tracked in docs/developing/pp-design.md / pp-stress-findings.md, which are the
living record.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ves, mediator resolution, unresolvable-owner pulls)

- LazyRemoteTensor: add comparison dunders (==, !=, <, <=, >, >=) that
  materialize and delegate, like the arithmetic dunders. Without them ==/!=
  fell back to identity and orderings raised — only on the non-owning rank,
  so user code branching on a comparison silently diverged between ranks.
  Restore identity hashing (__hash__ = object.__hash__) since defining __eq__
  clears it and Globals.saves tracks proxies by id.

- Save collection: rebuild list/tuple containers via _rebuild_sequence, which
  splats positional fields for NamedTuples. type(original)(items) raises
  TypeError for NamedTuples and killed the whole save-collection pass when a
  saved value contained one.

- IteratorTracer: resolve the mediator from the worker's thread-local
  (current_mediator()), not the shared interleaver.current slot — current is
  reassigned by every Mediator.start() and around every handle, so a PP
  run-ahead worker could observe another invoke's mediator and corrupt its
  iteration state.

- Cross-stage pull: a consume whose owning rank the ownership map cannot
  resolve (source_rank=None, e.g. a non-standard module name) now raises a
  descriptive RuntimeError at pull time instead of surfacing as a distributed
  hang. Building/saving such a lazy stays harmless (the owning rank's real
  save wins in the merge).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… the weight-derived guess

Consuming model.samples on a non-owning PP rank crashed the worker with a
gloo SIGABRT (recv buffer size mismatch): the consumer sized its recv buffer
from the module's weight dtype (bf16, 2 bytes/elem) while the producer sent
sampled token ids (int32, 4 bytes/elem). vLLM's own NCCL PP transfer and
nnsight's dedicated gloo pull group are separate transports — the crash was
purely nnsight's own buffer mis-sizing.

The legacy shape header now carries the value's true dtype in slot 1 (a fixed
torch-dtype enum; 0 = unknown, falling back to the old guess), the consumer
allocates its recv buffer from it, and the learned-shape cache adopts the wire
dtype so later precomputed pulls inherit the correction.

Two path-resolution companions of the same lookup family:
- _provider_to_module_path only drops the part before the iteration marker
  when it actually is an eproperty name; a root eproperty's provider
  (model.samples.iN) keeps its module name instead of collapsing to "model"
  (which made logits and samples share one meta/shape-cache entry).
- The LazyRemoteTensor dtype hint is resolved from the full {path}.{key}
  lookup, matching the pull protocol's canonical path; resolving from path
  alone missed root epropertys and silently fell back to float32.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…roduced, instead of stranding the consumer

A cross-stage pull whose value the producer could not serialize (a non-tensor
like a dict-valued .inputs read, a mixed-dtype tuple, a shape needing more
slots than the fixed header) used to print a traceback on the producer and
leave the consumer blocked in dist.recv forever. A per-op gloo recv timeout
cannot be the backstop: probed, it closes the whole peer pair on expiry,
poisoning every later pull on that link — so hang protection has to live in
the protocol itself.

- The whole reply (CPU move, flat cat, shape header) is now PREPARED before
  any send, so a serialization failure is caught while an error reply can
  still be sent — never after a partial send that would desync the consumer's
  recv. torch.cat doubles as the up-front mixed-dtype validator, and the
  extracted _encode_shape_header raises on header overflow.
- On failure the producer sends an error header (slot 0 = -1, slot 1 = UTF-8
  message length, then the message bytes) on the pull's private response tag;
  the consumer raises a descriptive RuntimeError.
- clear_buffer error-replies any pull still PARKED for a finished request —
  its value will never be produced (typically a run-ahead worker that pulled
  one generation step past the end). Silently dropping the parked entry left
  that consumer's worker thread blocked forever.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
PPModuleMap resolved owners from hardcoded name tables (layer-container names
like layers/h/blocks, embedding names like embed_tokens/wte, last-rank names
like norm/lm_head). Any architecture outside those conventions — Falcon's
word_embeddings, OPT's decoder.embed_positions, GPT-NeoX's final_layer_norm —
silently fell through to None and a cross-stage access either blocked or
failed.

The load-time meta allgather already reports exactly which modules are REAL
(non-PPMissingLayer) on which stage, so ownership is now DERIVED from it:
a module reported by exactly one stage is owned by that stage; a module
reported by several (containers, build-on-every-rank modules) is ambiguous
and dropped. get_owning_rank consults the derived map first (walking up to
the nearest owned ancestor, tolerant of the nnsight root prefix); the layer
ranges and name tables remain only as the pre-exchange path and for the
non-derivable build-on-every-rank, fire-on-last modules
(logits/samples/logits_processor).

Verified live at PP=2: 269 derived entries; model.norm and model.embed_tokens
resolve via derived keys.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ly save shipping to TP-0

collect_nnsight returned early on every TP rank except 0 of each PP stage.
That correctly avoids duplicated saves (TP siblings carry replicated mediator
state), but a TP-sibling rank still owns real mediators, one-shot/iter hooks,
worker threads, and pp_hook_buffer entries — skipping the whole function
leaked all of those on TP!=0: a confirmed ~1 zombie worker thread plus buffer
growth per request on a long-running TP>1 server (proved via a per-rank
thread-count probe; all four workers stay flat after the fix at PP=2 x TP=2).

Now every rank runs the finalize block (stop_iteration, worker join, drain
barrier, mediator/hook cleanup, scoped buffer clear); only the final save
payload is gated to TP-rank-0. A pure streaming collect (no finished
requests) still short-circuits on TP siblings — nothing to finalize. The
drain barrier stays collective per TP column: each column is its own pull
group, so TP-0 and sibling columns barrier independently among their own PP
stages.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ate deadline

Move the inline magic numbers (30 s readiness gate, 5 s finalize join, 1e-4
gate poll, 0.5 s listener backoff, 60 s test-only buffer wait) into named
constants in pp.py.

Only the readiness-gate deadline carries a real false-trip risk — a worker
blocked in a slow upstream cross-stage pull (huge hidden state over a degraded
cross-node link) reaches its local part late — so only it is overridable, via
NNSIGHT_PP_GATE_TIMEOUT. An env var rather than CONFIG/config.yaml because
these run in the vLLM worker process and env vars propagate to Ray workers
cross-node; the yaml would have to exist on every node. The gate's timeout
error now names the override.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…plicate the merge into collect.py

The decompress-unpickle-merge of per-rank collect_nnsight results was written
three times: identically in the async backend and the sync engine, and
INCORRECTLY in the serve handler — a flat dict.update that clobbered one PP
stage's real values with another stage's NOT_ON_THIS_RANK sentinels whenever
both stages shipped the same variable name (any cross-stage save through the
server). New collect.merge_collected_saves is the single implementation; all
three call sites now merge position-wise via merge_saved.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
_is_pp_missing runs at the top of every .output/.input access — per module,
per token — and did a get_owning_rank string-walk (split, layer-index scan,
name-table membership) each time. The answer is constant for the model's
life (module stage ownership, PP topology, and local rank don't change after
load), so cache it per lookup path on the interleaver. Plain __dict__ access
bypasses the Envoy/Interleaver custom attribute machinery; non-PP models
return before touching the cache.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…achable precomputed mode

Production consumers could never take the precomputed (no-shape-on-wire)
reply path: the run-ahead worker builds its LazyRemoteTensor before the
matching forward is scheduled, so no consumer-side token-count capture can
reliably match the produced value's leading dim — every pull already
requested shape-on-wire. Everything that existed only to enable the dead
mode goes with it:

- the precomputed recv path, the mode decision, and the per-request token
  count in the request header (now 3 int64 slots: requester, response tag,
  key length)
- the learned-shape cache and the listener's whole metadata map (the load-
  time meta exchange now carries only the per-module dtype hint for lazy
  placeholders; shape and true dtype always ride the reply)
- mediator.pp_num_tokens — set and cleared every step, read by nothing
- local_lookup, the listener's test-only blocking buffer wait, and its
  timeout constant; nothing waits on the buffer condition anymore (it
  remains as the buffer lock), so the producer-side notify_all goes too

This also closes a real protocol hole, not just lines: the precomputed mode
had no error channel (a serve failure there was logged and dropped, hanging
the consumer). With one reply mode, the error reply always works. Dtypes
outside the wire codec are now producer-side error replies instead of a
silent fall-back guess, and the codec gains the float8 variants.

The wire-stamped-dtype regression coverage moves from the deleted shape-
cache test file into the protocol tests as a producer->consumer round-trip
of int32 sampled token ids; the tag-routing test now drives the two-message
reply. Live PP=2 validation: cross-stage read + norm tuple write + logits,
and 3-prompt batched multi-token generation consuming model.samples per
step — both pass on the new wire format.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…delete the name tables and layer-range walk

The meta exchange subsumes both fallback tiers: every layer, embedding,
norm, and head is real on exactly one stage, so the derived map covers them
by construction — including architectures the name tables missed (Falcon
word_embeddings, OPT final_layer_norm, GPT-NeoX embed_in/embed_out). The
hardcoded layer-container/embedding/head name sets and the get_pp_indices
layer-range walk go.

The one irreducible piece of non-derivable knowledge — build-on-every-rank,
fire-on-last modules (vLLM's logits_processor and nnsight's own
logits/samples root wrappers) are ambiguous in the exchange — is now three
explicit last-rank entries injected by the runner right after the exchange,
so all ownership knowledge lives at a single site. setdefault preserves a
derived entry where one exists (Llama stubs logits_processor on non-last
ranks, so the exchange already attributes it).

Before the exchange installs owners nothing traces, so the pre-install
contract is now explicit: every path resolves to None (treated as local);
a genuine cross-stage consume of an unresolvable path still raises the
descriptive error at pull time. PPModuleMap no longer needs the layer count
or vLLM's get_pp_indices.

Live PP=2 validation: cross-stage read + norm tuple write + logits, per-step
model.samples consume (3-prompt batch), saves on both stages, and a repeated
cross-stage tuple read — all pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…root; memoize ownership

Every envoy path is, by Envoy construction, the root envoy's name (one
component, default "model") followed by the module's raw named_modules()
name. The lookups previously didn't exploit that and compensated with
guessing loops:

- resolve_meta stripped leading components one at a time until ANY key
  matched — over-tolerant; a wrong-entry hit would silently stamp the wrong
  dtype. Now: exact match, else strip the single known root component.
- PPModuleMap._derived_owner ran a nested walk-up x strip-any-prefix scan,
  quadratic in path depth. Now: try the root-stripped form (the common
  case), then the as-given form (raw-name callers), each as a plain
  ancestor walk-up.

The runner passes the actual root path (nnsight_model.path) into
PPModuleMap rather than assuming the literal. get_owning_rank is memoized
per path (ownership is constant for the model's life; cleared if owners are
reinstalled), which also makes the per-access owner resolution in
_pp_signal_remote a dict hit.

Live PP=2 validation: cross-stage read + write + logits, repeated
cross-stage tuple read, and per-step model.samples consume all pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…gress object

The readiness-gate conversation between the worker and the forward was
smeared across three files as five loose mediator fields: initialized twice
(Mediator.__init__ AND __setstate__), reset in the iterator, written by
pp_envoy and the model runner, read by the gate — with the load-bearing
publish-order invariant (clear the per-iteration latches BEFORE publishing
worker_iteration, or the gate releases a forward on a stale past-local and
the ranks deadlock) living as a comment in iterator.py that referenced
GPUModelRunner.

PPWorkerProgress now owns all five (gone_remote, past_local,
scheduled_count, max_iteration, worker_iteration):
- reset_iteration() encapsulates the ordering invariant next to the writes
  it orders
- is_ahead_of() puts the gate predicate next to the state it reads (the
  runner passes alive/parked, which stay mediator-owned)
- Mediator init/deserialize shrink to one line each

_respond_pending stays a Mediator field — it belongs to the generic
respond protocol (is the main thread blocked on my next event), not to
PP iteration state. The class lives in intervention/interleaver.py because
its writers (Mediator lifecycle, IteratorTracer) are core; modeling/vllm
consumes it.

Live PP=2 validation: per-step model.samples consume across a 3-prompt
batch (multi-token, the gate-sensitive path), cross-stage read + write +
logits, and dual-stage saves all pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…iagnostic

The pull wire key had two formats: composite "{req_id}|{provider}" and a
bare provider string for req_id=None callers. The bare form was a latent
mismatch — the producer ALWAYS buffers under the composite (provider,
req_id) tuple (req_id None included), so a bare-key pull could never match
a buffered value; it just hadn't been hit because production always carries
a request id. Now the wire key is always composite (empty id encodes None)
and decodes to exactly the key shape the producer writes.

Also lift the ~40-line env-gated NNSIGHT_PP_BUFFER_DEBUG block out of
collect_nnsight into a module-level _pp_buffer_stats helper; the clear-path
control flow is back to three lines.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…p, and gate state

- single self-describing pull reply (true dtype + shapes on the wire,
  error replies for unserializable / never-produced values) replaces the
  precomputed/shape-on-wire mode split
- ownership is derived-only, with the runner's three structural last-rank
  claims; no name tables or layer-range walk
- the readiness-gate state is PPWorkerProgress (reset ordering invariant
  and the ahead-predicate live with the state)
- finalize teardown runs on every TP rank; only TP-0 ships saves; the
  per-rank save merge is the single collect.merge_collected_saves

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- pp-problem.drawio / pp-solution.drawio at the repo root: byte-identical
  duplicates of docs/developing/figures/, which is the copy the PP
  walkthrough references.
- docs/superpowers/plans/2026-04-01-pp-lazy-remote-tensor.md: the original
  PP implementation plan; fully superseded by the as-built spec
  (docs/developing/pp-design.md) — it still describes vLLM 0.15.1 and a
  file layout (short-circuit inside envoy.py) that no longer exists.
- docs/developing/pp-correctness-log.md: a snapshot log of the abandoned
  pp-correct worktree fork; its validation commands reference test files
  that no longer exist, and its findings live on in pp-design.md /
  pp-stress-findings.md.
- tests/test_pp_corner_cases.py: drives collective_rpc worker hooks
  (test_pp_buffer_put / test_pp_pull / test_pp_buffer_clear) that exist
  nowhere in src — dead since the rpc-hook prototype was removed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The branch's PP tests had accumulated flat in tests/ — five pytest suites
mixed with eleven standalone multi-GPU harnesses, several of the latter
carrying test_ prefixes that a bare pytest run would collect (one of them
spawning GPU subprocesses).

Now:
- tests/vllm/pp/        — the pytest suites (test_pp.py is the former
  test_vllm_pp.py unit suite; test_integration.py the former
  test_vllm_pp_integration.py; plus save-merge, pull-group, and lazy-proxy
  iteration suites)
- tests/vllm/pp/manual/ — the standalone harnesses (renamed without the
  redundant pp/test_ prefixes: pull_e2e, profile_and_corner_cases,
  run_comparison/run_profile, profile_pull/profile_quick, measure_buffer,
  stress_serve/stress_tp_serve, run_equivalence_matrix) and the three
  _pp_*worker.py subprocess workers they and test_integration.py spawn,
  with a README index

No __init__.py anywhere under tests/vllm/: pytest.ini puts tests/ on
sys.path, so a tests/vllm package would shadow the real vllm package
(an extension-less directory is only a namespace portion and loses to the
installed regular package). Run docstrings, the integration test's worker
path and repo-root constant, and the doc reference in pp-stress-findings.md
all updated; the harnesses otherwise self-reference via __file__ and moved
unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tage execution semantics

Each PP stage runs the intervention body independently. A saved value not
derived from module state is therefore a redundant per-rank copy, and the
merge used to keep one copy silently (later rank wins) — so a value that
genuinely differed across ranks (a per-rank environment value like
os.getpid(), desynced in-trace RNG, or structurally mismatched save trees
from a stalled worker) produced a plausible-but-wrong result with no signal.

merge_saved now compares real/real slots and emits PPRankDivergenceWarning
naming the slot ('noise', "outs['steps'][1]") when the copies differ; the
later-rank-wins degrade is unchanged, just loud. Float tensors compare with
tight tolerance (rtol=1e-5, atol=1e-8, equal_nan) so low-order kernel noise
from redundant deterministic compute never warns; integer/bool tensors,
scalars, and container structure compare exactly. The deferred-exception
envelope bypasses the check (both ranks ship the same user error with
legitimately different per-rank tracebacks). merge_collected_saves threads
the saved variable name in as the slot label.

Live PP=2 validation: deterministic saves and in-trace torch.randn merge
silently — the randn case revealed that vLLM seeds every worker identically,
so in-trace RNG agrees across ranks in lockstep (incidental, not
contractual); a saved per-rank pid fires the warning end-to-end with the
right slot names; the standard cross-stage scenarios show zero false
positives.

Docs: docs/models/vllm.md gets a Pipeline parallelism section (the feature
was still documented as unsupported) with the execution-semantics fine
print — body runs once per stage, side effects are per-rank, generate
randomness OUTSIDE the trace (pre-trace values serialize identically to
every rank), and what the tripwire means. pp-design.md correctness
properties rewritten to state the redundant-copy semantics precisely.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ntinel encoding

The merge only combined equal-length lists/tuples and fell to blind b-wins on
any structural mismatch — the root of the historical clobbers (an empty list
overwriting a populated one; a stalled worker's short list winning). The fix
recognizes that the NOT_ON_THIS_RANK sentinel already IS the stage-ownership
encoding, and that the cross-stage pull guarantees two real copies of a
model-derived value are equal — so the correct merge is a positional union
that prefers real over sentinel, applied uniformly (the dict branch was
already exactly this; lists/tuples were the inconsistent outliers).

merge_saved now:
- lists union by position, length-tolerant: shared positions recurse, a
  position only one rank reached is taken as-is, and a trailing no-real
  overshoot tail (a worker that ran one step past generation end and appended
  an unconsumed lazy) is dropped silently;
- the lone real-data-asymmetry signal — real entries at a position only one
  rank reached (a stalled/errored worker or stage-divergent control flow) —
  warns and keeps the complete side, distinguished from benign overshoot and
  from a non-owner's equal-length all-sentinel list (neither warns);
- equal-length tuples recurse and rebuild NamedTuple-safe as before;
- the only non-union path left is a structural-type clash at one slot (list
  vs tuple, tensor vs dict) — impossible for model-derived values, so it keeps
  the side with more real leaves and warns;
- the per-leaf divergence tripwire is unchanged and still fires on a shared
  position whose two real copies differ.

This retires the inference-by-position fragility class: the merge no longer
has any blind-pick path — every resolution is element-wise, sentinel-driven,
union-completeness-driven, or loud.

Unit tests cover each mechanism (owner-vs-sentinels, overshoot drop, stalled
short list, empty-vs-populated, mixed-stage interleave, nested-in-dict,
type-clash, divergent shared prefix). Live PP=2: cell11's per-step
samples.item() builds an equal real float list on both ranks that the union
merges position-wise silently; twosave_cross confirms sentinel ownership —
both correct, zero false-positive warnings.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…isting leaf path

The /simplify pass on the positional-union merge found one path carrying its
own weight poorly: the structural-type-clash branch (list vs tuple, tensor vs
dict, unequal-length tuples) had a dedicated 15-line branch plus a whole
_count_real helper to pick the side with more real leaves — for a case the
code's own docstring calls impossible for model-derived values. Meanwhile
_divergence_detail (the leaf path's classifier) already had a container
"structure mismatch" branch that was now unreachable, and the clash branch's
own message ("incompatible structures (tuple vs tuple)") was misleading for
the unequal-length-tuple case it actually hit.

Deleting the branch lets those values fall through to the leaf path, where
_divergence_detail describes the mismatch accurately ("structure mismatch:
tuple of len 3 vs tuple of len 2", "type mismatch: Tensor vs list") and b
wins — the same degrade as every other divergent leaf. Net: -25 lines
(_count_real gone, branch gone), one fewer message format, and the previously
dead container branch in _divergence_detail becomes its handler.

Also from the pass: add an "a is b" identity short-circuit to
_divergence_detail (skips the elementwise compare when a slot passed through
an earlier stage's merge unchanged in a 3+-stage union), and fix two drifted
docstring sentences ("effective length" — the function computes
real-entries-beyond-common now; the structural-clash description). The stale
sibling test asserting the removed "falls back to b" semantics is renamed to
the union contract.

Necessity verdict from the review (all four angles): the union itself is
necessary and stays — the prior tripwire made divergence loud but its
equal-length-only branch still lost data (empty-list clobber, stalled-worker
mispick); the union is what stops the data loss. This commit only trims the
one over-built path the union added.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ead of deadlocking

Reading a later-stage module before an earlier-stage one within a single
iteration (e.g. `model.logits` then a stage-0 layer inside `tracer.iter`)
deadlocked under PP. On the earlier stage the downstream access enters the
trailing-remote phase and releases the forward (`go_remote`) on the contract
that no local hooks remain this iteration; the forward then runs past the local
module before the worker registers its hook, so the worker's later blocking
`request` can only be served by the next forward — which the readiness gate
holds waiting for that same worker. Single-GPU nnsight already rejects this
later-before-earlier access with `OutOfOrderError`; PP hung (one rank's gate
timeout while the partner stayed wedged in its NCCL recv).

Make PP match single-GPU and fail fast, with the worker lifecycle the gate
reads modeled as an explicit phase:

- A worker has a `WorkerPhase` — `LEADING` then `PAST_LOCAL` within an
  iteration, `TERMINATED` once the body exits — paired with `worker_iteration`.
  (Blocked at a local hook is observed by the gate as `parked`, not stored.)
  `_pp_signal_remote` sets `PAST_LOCAL` on a downstream access; `reset_iteration`
  re-arms `LEADING`, publishing the phase before the iteration number so the
  gate pairs a new iteration with its fresh phase; the body-exit handlers
  `end()` / `exception()` call `terminate()`, covering normal completion,
  `tracer.stop()`, and any raise. The gate predicate `is_ahead_of` reads the
  phase: `TERMINATED` releases every remaining forward, and a worker on the
  gated iteration releases once it is parked or `PAST_LOCAL`.

- `_pp_check_order` runs on every local access: a local access while
  `PAST_LOCAL` is out of forward order, so it raises `OutOfOrderError` naming
  the module and the forward-order contract. The raise marks the worker
  `TERMINATED`, so the gate releases the remaining forwards; the partner stage
  unwedges, the request finishes, and the deferred error ships at collect.

Tests: tests/vllm/pp/test_worker_lifecycle.py covers the gate predicate across
phase and iteration; tests/vllm/pp/test_out_of_order.py (+ a `multigen_ooo`
worker scenario) asserts the PP=2 out-of-order trace raises OutOfOrderError
within a bounded time instead of hanging. Verified at PP=2: out-of-order raises
(~40s, was a hang); forward-order multi-token generation and a forward-order
logits+layer read still complete; 87 PP unit tests pass.

Docs: docs/models/vllm.md (forward-pass-order note) and
docs/developing/pp-design.md (the worker-phase lifecycle in §4.2 / §4.6).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… write xfail

The 4 cross-stage READ integration tests (final logits, layer 0/11 hidden,
multi-token gen) pass on PP=2 — the mediator-deserialization bug their strict
xfail documented was fixed earlier in pp-on-dev, leaving the markers stale
(XPASS-strict, reported as failures). Drop them.

The cross-stage WRITE test still xfails, but NOT for that reason: its scenario
does an in-place update (h[8].output[0][:] = h2), which vLLM forbids on
inference-mode tensors — PP=1 raises "Inplace update to inference tensor", and
PP=2 silently drops it (unperturbed output). Repoint the xfail at that accurate
cause (PP2_XFAIL_REASON -> CROSS_STAGE_WRITE_XFAIL_REASON) and refresh the stale
module docstring. A portable cross-stage-write test needs the replacement
pattern, not in-place.

Verified on 2x A100 (GPT-2, PP=1 vs PP=2): integration suite 5 passed, 1 xfailed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UkL1HgpyPpQGpZrG9E4KQb
khaiwang and others added 2 commits June 24, 2026 13:11
…lacement pattern

The cross-stage write IS functional on PP=2; the test was using the vLLM-forbidden
in-place pattern (output[0][:] = h2), which errors on PP=1 and is silently dropped
on PP=2 — a vLLM write restriction, not a PP gap. Switch to the documented
whole-output replacement (output = (new_hidden, *out[1:]) if is_tuple else new_hidden);
the write then applies and PP=2 matches PP=1 exactly (GPT-2: ' Paris' -> ' New',
argmax 968 on both). Drop the xfail; the test now asserts cross-stage equivalence
AND that the write took effect (guards against a vacuous pass on a dropped write).

Adds the cross_stage_replace worker scenario. Integration suite: 6 passed, 0 xfailed
on 2x A100.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UkL1HgpyPpQGpZrG9E4KQb
… wrong pattern

scenario_cross_stage and scenario_cross_stage_no_logits used the in-place form
(output[0][:] = h2 / output[0] = ...), which vLLM rejects on inference-mode
tensors regardless of PP — so they never tested cross-stage write, only the
in-place restriction (and silently dropped it on PP=2). No test references them
now that TestCrossStageWrite uses the replacement-pattern cross_stage_replace.
Remove the scenarios and their dispatch/choices entries.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UkL1HgpyPpQGpZrG9E4KQb
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant