vllm: follow-ups from the v0.7.0 merges by khaiwang · Pull Request #662 · ndif-team/nnsight

khaiwang · 2026-05-07T03:11:33Z

Follow-ups from the vLLM-integration merges that landed on dev for v0.7.0. This PR bundles correctness fixes, one default change, a packaging fix, and docs updates. Each change is small and independent of the others.

| # | Change | Kind | Commits |
|---|--------|------|---------|
| 1 | Clone inference-mode tensors on `.save()` (fixes #661) | correctness | `a40a5f2`, `fe7c072`, `cc56513` |
| 2 | Narrow/swap on the token axis for Transformers-backend models | correctness | `0bc57e3` |
| 3 | Surface deferred intervention errors on local sync/async paths | correctness | `5ef4466` |
| 4 | Submit every invoke in async traces, not just the first | correctness | `18f0b41` |
| 5 | Disable vLLM prefix caching by default | default | `ee2022d` |
| 6 | Restore `nnsight-serve` install machinery | packaging | `bf2949d` |
| 7 | Sync vLLM integration README to the v0.7.0 async path | docs | `c8bd79e` |
| 8 | Fold `intervention-gaps/` into `docs/models/vllm.md`; move probes to `tests/` | docs | `fc6c10b` |

1. Clone inference-mode tensors on `.save()` (fixes #661)

vLLM runs forward passes inside torch.inference_mode(), and several of its kernels (e.g. fused_add_rms_norm) mutate buffers in place. Without a clone, references returned by Envoy.output / Envoy.inputs alias those buffers, so values surviving past the trace reflect post-mutation state, not what the user asked to save.

intervention/tracing/globals.save() now clones when the saved object is an inference-mode tensor. The clone allocates a fresh, non-inference tensor so downstream fused ops mutate the original buffer rather than the user's saved reference. No-op for normal tensors — HF / vanilla PyTorch paths are unaffected.

Object.save() returns save(self) instead of self, so the cloned tensor (not the original) is what the trace's local-frame filter retains via Globals.saves.
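The aliasing problem and the clone-on-save fix can be reproduced with plain PyTorch. The sketch below is only an illustration: `save` here is a hypothetical stand-in written for this example, not the code in intervention/tracing/globals.py, and the in-place `mul_` stands in for a fused kernel that mutates its buffer.

```python
import torch


def save(obj):
    """Hypothetical stand-in for the patched save(); not nnsight's actual code."""
    if isinstance(obj, torch.Tensor) and obj.is_inference():
        # Step out of inference mode for the copy so the result is a
        # normal (non-inference) tensor with its own storage.
        with torch.inference_mode(False):
            return obj.clone()
    # Normal tensors are returned untouched (no extra allocation).
    return obj


with torch.inference_mode():
    hidden = torch.ones(4)   # stands in for a module output buffer
    aliased = hidden         # what a save without a clone would keep
    saved = save(hidden)     # what the clone-on-save path keeps
    hidden.mul_(1000)        # stands in for an in-place fused kernel

normal = torch.ones(4)

print(aliased)               # reflects the post-mutation state
print(saved)                 # still holds the pre-mutation values
print(saved.is_inference())  # False: the clone is a normal tensor
print(save(normal) is normal)
```

The reference kept without a clone tracks the mutated buffer, while the cloned value is unaffected, which is the behavior the regression tests below pin.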

Verification (SmolLM2-135M on vLLM 0.19.1, one A100): max reference-vs-clone diff across saved residual / attention / MLP tensors dropped from 4064.23 → 0.00.

Regression coverage in TestSaveCloning pins the invariants:

1. Inference-mode tensor save returns a clone (mutation of source doesn't corrupt save).
2. Normal tensor save returns the original (no-overhead contract for HF path).
3. End-to-end: module.output.save() inside torch.inference_mode() returns a non-inference tensor.

cc56513 makes the end-to-end inference-mode test deterministic: the old assert x.std() > 50 depended on an unseeded 8-element randn sample and flaked ~2.4% of runs; it's replaced with an exact, RNG-free check (the saved clone holds the pre-mutation values; the in-place-mutated source is exactly saved * 1000).

Commits: a40a5f2, fe7c072, cc56513.

2. Narrow/swap on the token axis for Transformers-backend models

Models without a native vLLM definition (e.g. SmolLM3) run through vLLM's Transformers backend, which wraps the HuggingFace model and adds a leading singleton batch dim (inputs_embeds[None, ...]). Their decoder-layer activations are 3D [1, total_tokens, hidden], so the batched token axis is dim 1, not dim 0.

Batcher._narrow / _swap hard-assumed the token axis was dim 0 and gated on shape[0] == total_batch_size. For the 3D case that gate is 1 == total_tokens (False), so reads returned the full batch (all prompts) and writes were silently discarded — every intervention became a no-op once batching was active (2+ prompts). Native vLLM models (2D [total_tokens, hidden]) were unaffected, which is why Qwen3 worked but SmolLM3 did not.

The base Batcher now narrows/swaps along an axis reported by a new _batch_dim() hook (still dim 0 by default; preserves the existing in-place, concat-for-view/grad-leaf, and passthrough paths). VLLMBatcher overrides _batch_dim to recognize the Transformers-backend [1, total_tokens, hidden] shape and select dim 1. CPU-only regression tests (TestVLLMBatcherAxis): the 3D swap/narrow tests fail on the unpatched batcher and pass after; the 2D native test pins no regression.
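The axis selection described above can be sketched without any tensor library by working on shapes alone. The helper names `pick_token_axis` and `narrow_index` are hypothetical and exist only for this illustration; the real hook is `_batch_dim()`, whose exact signature is not shown in this PR description.

```python
from typing import Tuple


def pick_token_axis(shape: Tuple[int, ...], total_tokens: int) -> int:
    """Return the axis that carries the flattened token dimension.

    Hypothetical helper: mirrors the idea behind _batch_dim(), not its code.
    """
    # Transformers-backend layout: [1, total_tokens, hidden]
    if len(shape) == 3 and shape[0] == 1 and shape[1] == total_tokens:
        return 1
    # Native vLLM layout: [total_tokens, hidden]
    return 0


def narrow_index(shape: Tuple[int, ...], dim: int, start: int, length: int):
    """Build an index tuple selecting [start, start + length) along `dim`."""
    return tuple(
        slice(start, start + length) if axis == dim else slice(None)
        for axis in range(len(shape))
    )


# Two prompts of 3 and 4 tokens are flattened into 7 tokens in total.
total_tokens = 7
native_shape = (7, 16)           # 2D, token axis is dim 0
transformers_shape = (1, 7, 16)  # 3D, token axis is dim 1

native_dim = pick_token_axis(native_shape, total_tokens)
wrapped_dim = pick_token_axis(transformers_shape, total_tokens)

# Index for the second prompt (tokens 3..6) in each layout.
native_idx = narrow_index(native_shape, native_dim, 3, 4)
wrapped_idx = narrow_index(transformers_shape, wrapped_dim, 3, 4)
print(native_dim, wrapped_dim)
```

With a hard-coded dim 0, the 3D case would slice the singleton axis instead of the token axis, which matches the no-op behavior described for SmolLM3.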

Commit: 0bc57e3.

3. Surface deferred intervention errors on local sync/async paths

The vLLM interleaver always runs in defer mode (GPUModelRunner.load_model sets defer_exceptions=True) so a single bad intervention can't crash the engine that's serving other requests. The serve path re-raised captured errors via surface_server_errors, but the local VLLM.trace() sync and async paths only collected output.saves and never read the __nnsight_exceptions__ envelope — so an intervention that errored on the worker (e.g. an in-place write on an inference-mode tensor) failed silently: no exception, and every .save() after the failing line was dropped.

VLLM.__call__ (sync) and AsyncVLLMBackend.__aiter__ (async) now read the envelope and re-raise via surface_server_errors, mirroring the serve path. The error surfaces at the trace boundary while the engine stays alive for the next trace (verified: a clean trace and a clone-based intervention both work in the same process after the surfaced error). Envelopes are merged per request so a multi-invoke trace doesn't clobber one request's error with another's. Adds test_inplace_inference_write_surfaces_error covering the sync path.
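The defer-then-surface pattern can be sketched in a few lines. Everything here is an assumption made for illustration: the envelope is modeled as a dict from request id to a list of error messages, and `merge_envelopes`, `surface_errors`, and `InterventionError` are hypothetical names, not the actual `surface_server_errors` implementation or the real `__nnsight_exceptions__` layout.

```python
from typing import Dict, Iterable, List


class InterventionError(RuntimeError):
    """Hypothetical error type raised at the trace boundary."""


def merge_envelopes(envelopes: Iterable[Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Merge per-request envelopes so one request never overwrites another."""
    merged: Dict[str, List[str]] = {}
    for envelope in envelopes:
        for request_id, errors in envelope.items():
            merged.setdefault(request_id, []).extend(errors)
    return merged


def surface_errors(merged: Dict[str, List[str]]) -> None:
    """Re-raise deferred errors once the engine has finished the trace."""
    details = [
        f"{request_id}: {message}"
        for request_id, messages in sorted(merged.items())
        for message in messages
    ]
    if details:
        raise InterventionError("; ".join(details))


# The worker captured errors instead of raising, so the engine kept running.
merged = merge_envelopes([
    {"req-0": ["in-place write on inference tensor"]},
    {"req-1": []},
    {"req-0": ["later save dropped"]},
])

try:
    surface_errors(merged)
    surfaced = None
except InterventionError as exc:
    surfaced = str(exc)
print(surfaced)
```

Raising only after the merge keeps the engine process alive while still giving the caller an exception at the point where the trace exits.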

Commit: 5ef4466.

4. Submit every invoke in async traces, not just the first

AsyncVLLMBackend.__call__ serialized every invoke's mediator but then submitted only prompts[0] / params[0] under a single request_id, so a multi-invoke async trace ran only the first prompt: invokes past the first never reached the engine, their per-invoke saves came back empty, and trace-shared saves were never collected (the worker's received_count never reached expected_count). Every prior async test used a single prompt, so this code path had no coverage.

The backend now submits one request per invoke (mirroring the fan-out in serve/server.py) and merges the per-request generators into a single stream via an asyncio queue, collecting saves per finished request. The single-invoke streaming path is behavior-preserving, and the deferred-error surfacing from #3 is reused via a shared _attach_saves helper.
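Merging several per-request async generators into one stream through an asyncio queue can be sketched as follows. This is a generic pattern under stated assumptions, not the backend's code: `merge_streams` and `fake_request` are hypothetical names, and the fake generators stand in for whatever the engine yields per request.

```python
import asyncio
from typing import AsyncIterator, List, Tuple

_DONE = object()  # sentinel marking the end of one stream


async def merge_streams(streams: List[AsyncIterator[str]]) -> AsyncIterator[Tuple[int, str]]:
    """Yield (stream_index, item) pairs from all streams as they arrive."""
    queue: asyncio.Queue = asyncio.Queue()

    async def pump(index: int, stream: AsyncIterator[str]) -> None:
        try:
            async for item in stream:
                await queue.put((index, item))
        finally:
            # Always signal completion, even if the stream raised.
            await queue.put((index, _DONE))

    tasks = [asyncio.create_task(pump(i, s)) for i, s in enumerate(streams)]
    remaining = len(tasks)
    try:
        while remaining:
            index, item = await queue.get()
            if item is _DONE:
                remaining -= 1
            else:
                yield index, item
    finally:
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def fake_request(name: str, steps: int) -> AsyncIterator[str]:
    """Stand-in for one request's output generator."""
    for step in range(steps):
        await asyncio.sleep(0)
        yield f"{name}:{step}"


async def collect() -> List[Tuple[int, str]]:
    streams = [fake_request("a", 2), fake_request("b", 3)]
    return [pair async for pair in merge_streams(streams)]


results = asyncio.run(collect())
print(len(results))
```

Items from different requests may interleave, but each request's own items keep their order, so per-request saves can be collected when that request's final output arrives.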

Adds test_async_multi_invoke_runs_all_invokes — the async counterpart of test_shared_list_across_invokes: a two-invoke async trace must produce two finished requests and collect the trace-shared list from both invokes.

Verification (gpt2, mode="async", vLLM 0.19.1, one A100): the new test fails on the old code (Expected 2 finished requests, got 1) and passes after; the full TestAsyncEngine suite is 6/6 green (the 5 pre-existing single-invoke async tests unaffected).

Commit: 18f0b41.

5. Disable vLLM prefix caching by default

vLLM's prefix caching reuses KV values from previously-seen sequences. When the next request shares a prefix, those tokens skip the forward pass — hooks don't fire and interventions on those tokens are silently skipped, with no error.

VLLM(...) now defaults to enable_prefix_caching=False so interventions consistently see every token. Users who explicitly opt in (e.g. for throughput on workloads that don't need to hook prefill tokens) can still pass enable_prefix_caching=True. Matches what docs/models/vllm.md documents as the integration's default.
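Why a cached prefix hides tokens from hooks can be shown with a toy model. `ToyEngine` is a hypothetical class written for this illustration and is far simpler than vLLM's real prefix cache, which works at KV-block granularity; it only tracks which token positions would go through a forward pass.

```python
from typing import List, Set, Tuple


class ToyEngine:
    """Toy model of prefix caching; not vLLM's implementation."""

    def __init__(self, enable_prefix_caching: bool) -> None:
        self.enable_prefix_caching = enable_prefix_caching
        self._cached_prefixes: Set[Tuple[int, ...]] = set()

    def run(self, tokens: List[int]) -> List[int]:
        """Return the positions whose forward pass (and hooks) actually ran."""
        start = 0
        if self.enable_prefix_caching:
            # Find the longest prefix already seen by an earlier request.
            for end in range(len(tokens), 0, -1):
                if tuple(tokens[:end]) in self._cached_prefixes:
                    start = end
                    break
        # Remember every prefix of this request for later requests.
        for end in range(1, len(tokens) + 1):
            self._cached_prefixes.add(tuple(tokens[:end]))
        return list(range(start, len(tokens)))


cached = ToyEngine(enable_prefix_caching=True)
first_cached = cached.run([1, 2, 3, 4])
second_cached = cached.run([1, 2, 3, 9])   # shares the prefix [1, 2, 3]

uncached = ToyEngine(enable_prefix_caching=False)
first_uncached = uncached.run([1, 2, 3, 4])
second_uncached = uncached.run([1, 2, 3, 9])

print(second_cached, second_uncached)
```

In the toy, the second request with caching enabled only computes the last position, so an intervention targeting positions 0 to 2 would never run; with caching disabled every position is computed on every request.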

Commit: ee2022d.

6. Restore `nnsight-serve` install machinery

PR #656 merged the nnsight-serve sources (cli.py, server.py, LocalServeBackend, ServeInterleavingTracer, …) onto dev, but the pyproject.toml changes were dropped during conflict resolution. Result: pip install "nnsight[serve]" returns "no matching distribution" and the nnsight-serve CLI shim isn't on PATH for a fresh install.

Restored:

- `serve` optional-dependency that pulls in vllm, FastAPI, and uvicorn.
- `[project.scripts]` entry registering `nnsight-serve` → `nnsight.modeling.vllm.serve.cli:main`.
- `all` extra extended to include `serve`.

After this, the documented pip install "nnsight[serve]" / nnsight-serve … workflow works without the python -m nnsight.modeling.vllm.serve.cli workaround.
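One way to check whether the console script is registered in the current environment is to query the installed entry points with the standard library. This is a generic sketch, not part of the PR; `console_script_target` is a hypothetical helper name, and the function simply returns None when nothing by that name is installed.

```python
from importlib.metadata import entry_points
from typing import Optional


def console_script_target(name: str) -> Optional[str]:
    """Return the 'module:function' target of a console script, or None."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ selection API
        scripts = eps.select(group="console_scripts")
    else:
        # Older Pythons return a plain dict keyed by group name
        scripts = eps.get("console_scripts", [])
    for entry_point in scripts:
        if entry_point.name == name:
            return entry_point.value
    return None


# None means the script is not registered in this environment.
print(console_script_target("nnsight-serve"))
```

After a fresh install with the `serve` extra, the lookup should report the `nnsight.modeling.vllm.serve.cli:main` target named in the restored `[project.scripts]` entry.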

Commit: bf2949d.

7. Sync `src/nnsight/modeling/vllm/README.md` to the v0.7.0 async path

Two earlier refactors landed without README updates; the drift has been live through v0.7.0:

- d124cc5 (2026-03-12, "refactor vLLM input processing") eliminated AsyncInterleavingTracer entirely. AsyncVLLMBackend now calls _setup_interleaver() directly; the async path uses the default RemoteInterleavingTracer.
- bb61efa (2026-03-28, "refactor async backend") collapsed the dual-call __call__(tracer) / __call__() pattern into a single required __call__(self, tracer) that submits to AsyncLLM.generate() immediately on trace exit. _stream() was removed; iteration moved to __aiter__/__await__. tracer.backend() (with parens) now raises TypeError; the correct iteration form is async for output in tracer.backend (no parens).

This commit syncs the README accordingly:

- Drop the async_tracer.py file listing and all AsyncInterleavingTracer references.
- Rewrite the AsyncVLLMBackend description (file responsibilities + Key Classes) to enumerate the current __call__/__aiter__/__await__ surface.
- Redraw the async architecture diagram: VLLM.trace() injects only the backend → default tracer applies → __call__(tracer) submits and parks the generator → __aiter__ streams.
- Replace 5 _stream() mentions with __aiter__().
- Replace 4 tracer.backend() parens-form mentions with tracer.backend.
- Replace the "Why AsyncInterleavingTracer Bypasses RemoteableMixin" section with "How VLLM.trace() Routes the Async Path", showing the current setdefault-based routing (RemoteableMixin.trace() never hard-coded tracer_cls; the previous prose was also wrong on that point).
- Usage example: drop parens; wrap in async def main() under an if __name__ == "__main__": guard (AsyncLLM uses multiprocessing spawn); note that output.saves is only set on output.finished.

Verified the corrected usage example runs end-to-end on this branch (gpt2, mode="async"): 8 RequestOutputs streamed, finished=True on the last, output.saves == {'logits': Tensor[1, 50257]}.

Note: the README's AsyncVLLMBackend surface section is accurate as of bb61efa; the multi-invoke fan-out added in #4 (18f0b41) does not change the public __call__/__aiter__/__await__ shape it describes.

Commit: c8bd79e.

8. Fold `intervention-gaps/` into `docs/models/vllm.md`; move probes to `tests/`

Migrate the durable content of intervention-gaps/{REPORT,VLLM_GUIDE}.md into the maintained user doc and delete the two stale docs (vLLM 0.15.1-era; their in-place-write recipes no longer work).

docs/models/vllm.md:

- New "Intervention recipes" section: clone-and-replace writes, ablation, steering, logit lens, two-trace activation patching, tracer.cache().
- "What each module returns" table: dual-stream (hidden, residual) output, int64 position-id .input, fused-RMSNorm/RowParallel tuples, merged qkv_proj/gate_up_proj, flat [total_tokens, hidden] layout.
- New gotchas: in-place writes raise (replace instead), clone-on-save, enable_prefix_caching=False default, deferred errors keep the engine alive, no attention weights, vLLM ≠ transformers numerics.
- Drop stale claims: tracer.cache() is supported; version 0.15.1 → 0.19.1.

tests/vllm_intervention_gaps/:

git mv run_all.py + test_*.py here (executable vLLM-vs-HF diagnostic suite) and add a README.

Recipes verified on vLLM 0.19.1 (Qwen2.5-0.5B): in-place writes raise, replacement works, logit-lens matmul (norm(hs) @ lm_head.weight.T) bitwise-matches model.logits at the last layer, and TP≥2 sub-module access works (the old "crashes at tp≥2" claim was stale).

Commit: fc6c10b.

…e corruption

vLLM runs forward passes inside torch.inference_mode() and several of its kernels (e.g. fused_add_rms_norm) mutate buffers in place. Without a clone, references returned by Envoy.output / Envoy.inputs alias those buffers, so the values surviving past the trace reflect post-mutation state, not what the user asked to save.

Clone on save when the saved object is an inference-mode tensor. The clone allocates a fresh, non-inference tensor so downstream fused ops mutate the original buffer rather than the user's saved reference. No-op for normal (non-inference) tensors, so HF / vanilla PyTorch paths are unaffected.

Object.save() now returns save(self) instead of self, so the cloned tensor (not the original) is what the trace's local-frame filter retains via Globals.saves.

Fixes #661.

Three CPU-only tests in TestSaveCloning that pin the fix's invariants:

1. globals.save() returns a clone for inference-mode tensors so subsequent in-place mutation of the source doesn't corrupt the saved value.
2. globals.save() returns the original (no clone) for normal tensors; this pins the zero-overhead contract for HF / vanilla PyTorch paths.
3. End-to-end: module.output.save() inside torch.inference_mode() returns a non-inference tensor (i.e. Object.save() returns the clone, not the original; otherwise the local-frame filter would drop it).

Verified to fail on the unpatched globals.py (tests 1 and 3 fail; test 2 passes both ways since it asserts a no-op the unpatched code also satisfies).

vLLM's prefix caching reuses KV values from previously-seen sequences. When the next request shares a prefix, those tokens skip the forward pass — hooks don't fire and interventions on those tokens are silently skipped, with no error. Default ``enable_prefix_caching=False`` so interventions consistently see every token. Users who explicitly opt in (e.g. for throughput on workloads that don't need to hook prefill tokens) can still pass ``enable_prefix_caching=True``. This matches what ``intervention-gaps/VLLM_GUIDE.md`` already documents as the integration's default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #656 merged the nnsight-serve sources (cli.py, server.py, LocalServeBackend, ServeInterleavingTracer, ...) onto dev but the pyproject.toml changes were dropped during conflict resolution, so ``pip install "nnsight[serve]"`` returns "no matching distribution" and the ``nnsight-serve`` CLI shim isn't on PATH for a fresh install.

Restore the missing pieces:

- ``serve`` optional-dependency that pulls vllm + FastAPI + uvicorn.
- ``[project.scripts]`` entry that registers ``nnsight-serve`` to ``nnsight.modeling.vllm.serve.cli:main``.
- ``all`` extended to include ``serve``.

After this, the documented ``pip install "nnsight[serve]"`` / ``nnsight-serve ...`` workflow works without falling back to the ``python -m nnsight.modeling.vllm.serve.cli`` workaround.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two unrelated refactors on dev left the async-mode portions of src/nnsight/modeling/vllm/README.md describing an API that no longer exists. Both landed without a README update; the drift has been live through v0.7.0:

d124cc5 (2026-03-12) "refactor vLLM input processing: consolidate sync/async paths and clean up"
Eliminated AsyncInterleavingTracer entirely. AsyncVLLMBackend now calls _setup_interleaver() directly and serializes mediators itself; the async path uses the default RemoteInterleavingTracer.

bb61efa (2026-03-28) "Upgrade vLLM compat, refactor async backend, add sample_tokens hook"
Collapsed the dual-call __call__(tracer)/__call__() pattern into a single required __call__(self, tracer) that submits the request to AsyncLLM.generate() immediately on trace exit. _stream() was removed; iteration moved to __aiter__/__await__. Calling tracer.backend() (with parens) now raises TypeError; the user iterates tracer.backend (no parens) instead.

This commit syncs the README accordingly:

- File listing: drop async_tracer.py; rewrite the async_backend.py one-liner.
- File responsibilities: drop the AsyncInterleavingTracer entry; rewrite the AsyncVLLMBackend entry to describe the on-exit submission and __aiter__ streaming.
- Key Classes: remove the AsyncInterleavingTracer subsection; rewrite AsyncVLLMBackend to enumerate __call__/__aiter__/__await__.
- Architecture diagram: redraw to show VLLM.trace() injecting only the backend, default tracer applying, __call__(tracer) submitting and parking the generator, __aiter__ streaming.
- Replace 5 _stream() mentions with __aiter__() across the execution flow, sync-vs-async table, and Streaming Saves section.
- Replace 4 tracer.backend() parens-form mentions with tracer.backend.
- Replace the "Why AsyncInterleavingTracer Bypasses RemoteableMixin" section with "How VLLM.trace() Routes the Async Path", showing the current setdefault-based routing (RemoteableMixin.trace() never hard-coded tracer_cls; the previous prose was wrong on that point too).
- Usage example: drop the parens; wrap in async def main() under an if __name__ == "__main__" guard (AsyncLLM uses multiprocessing spawn); note that output.saves is only set on output.finished.

Verified by running the corrected usage example end-to-end on this branch (gpt2, mode="async"): 8 RequestOutputs streamed, finished=True on the last, output.saves = {'logits': Tensor[1, 50257]}.

Models without a native vLLM definition (e.g. SmolLM3) run through vLLM's Transformers backend, which wraps the HuggingFace model and adds a leading singleton batch dim (inputs_embeds[None, ...]). Their decoder-layer activations are 3D [1, total_tokens, hidden], so the batched token axis is dim 1, not dim 0. Batcher._narrow/_swap hard-assumed the token axis was dim 0 and gated on shape[0] == total_batch_size. For the 3D case that gate is 1 == total_tokens (False), so reads returned the full batch (all prompts) and writes were silently discarded -- every intervention became a no-op once needs_batching (2+ prompts) was active. Native vLLM models (2D [total_tokens, hidden]) were unaffected, which is why Qwen3 worked but SmolLM3 did not. Generalize the base Batcher to narrow/swap along an axis reported by a new _batch_dim() hook (still dim 0 by default; preserves the existing in-place, concat-for-view/grad-leaf, and passthrough paths). VLLMBatcher overrides _batch_dim to recognize the Transformers-backend's [1, total_tokens, hidden] shape and select dim 1. Adds CPU-only regression tests (TestVLLMBatcherAxis): the 3D swap/narrow tests fail on the unpatched batcher and pass after; the 2D native test pins no regression. Related to #661/#662. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…aths

The vLLM interleaver always runs in defer mode (GPUModelRunner.load_model sets defer_exceptions=True) so a single bad intervention can't crash the engine that's serving other requests. The serve path re-raises captured errors via surface_server_errors, but the local VLLM.trace() sync and async paths only collected output.saves and never read the __nnsight_exceptions__ envelope, so an intervention that errored on the worker (e.g. an in-place write on an inference-mode tensor) failed silently: no exception, and every .save() after the failing line was dropped.

Read the envelope in VLLM.__call__ (sync) and AsyncVLLMBackend.__aiter__ (async) and re-raise via surface_server_errors, mirroring the serve path. The error surfaces at the trace boundary while the engine stays alive for the next trace (verified: a clean trace and a clone-based intervention both work in the same process after the surfaced error). Envelopes are merged per request so a multi-invoke trace doesn't clobber one request's error with another's.

Add test_inplace_inference_write_surfaces_error covering the sync path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Migrate the durable content of intervention-gaps/{REPORT,VLLM_GUIDE}.md into the maintained user doc and delete the two stale docs (vLLM 0.15.1-era; their in-place-write recipes no longer work).

docs/models/vllm.md:

- New "Intervention recipes" section: clone-and-replace writes, ablation, steering, logit lens, two-trace activation patching, tracer.cache().
- "What each module returns" table: dual-stream (hidden, residual) output, int64 position-id .input, fused-RMSNorm/RowParallel tuples, merged qkv_proj/gate_up_proj, flat [total_tokens, hidden] layout.
- New gotchas: in-place writes raise (replace instead), clone-on-save, enable_prefix_caching=False default, deferred errors keep the engine alive, no attention weights, vLLM != transformers numerics.
- Drop stale claims: tracer.cache() is supported; version 0.15.1 -> 0.19.1.

tests/vllm_intervention_gaps/:

- git mv run_all.py + test_*.py here (executable vLLM-vs-HF diagnostic suite) and add a README.

Recipes verified on vLLM 0.19.1 (Qwen2.5-0.5B): in-place writes raise, replacement works, logit-lens matmul (norm(hs) @ lm_head.weight.T) bitwise-matches model.logits at the last layer, and TP>=2 sub-module access works (the old "crashes at tp>=2" claim was stale).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`assert x.std() > 50` depended on an unseeded randn(8) sample std exceeding 0.5 (x = randn(8)*0.1*1000), which an 8-element sample undershoots ~2.4% of runs — a CI flake. Replace with an exact, RNG-free check: the saved clone is a distinct object holding the pre-mutation values, and the in-place-mutated x is exactly saved * 1000. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AsyncVLLMBackend.__call__ serialized all invokes' mediators but then submitted only prompts[0]/params[0] under a single request_id, so a multi-invoke async trace ran only the first prompt: invokes past the first never reached the engine, their per-invoke saves came back empty, and trace-shared saves were never collected (the worker's received_count never reached expected_count). Submit one request per invoke (mirroring the serve path in serve/server.py) and merge the per-request generators into a single stream via an asyncio queue, collecting saves per finished request. The single-invoke streaming path is behavior-preserving; the deferred-error surfacing is reused via a shared _attach_saves helper. Add test_async_multi_invoke_runs_all_invokes: a two-invoke async trace must produce two finished requests and collect the trace-shared list from both invokes. This is the async counterpart of test_shared_list_across_invokes; every prior async test used a single prompt, so this code path had no coverage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


khaiwang added 2 commits May 6, 2026 22:59

khaiwang requested a review from JadenFiotto-Kaufman May 7, 2026 18:36

khaiwang and others added 3 commits May 21, 2026 19:19

khaiwang changed the title ~~fix(vllm): clone inference-mode tensors on .save() to prevent in-plac…~~ vllm: follow-ups from the v0.7.0 merges May 21, 2026

khaiwang and others added 5 commits May 27, 2026 17:15

khaiwang wants to merge 10 commits into dev from zikai/vllm-clone-on-save
