vllm: follow-ups from the v0.7.0 merges #662
Open
khaiwang wants to merge 10 commits into
…e corruption vLLM runs forward passes inside torch.inference_mode() and several of its kernels (e.g. fused_add_rms_norm) mutate buffers in place. Without a clone, references returned by Envoy.output / Envoy.inputs alias those buffers, so the values surviving past the trace reflect post-mutation state, not what the user asked to save. Clone on save when the saved object is an inference-mode tensor. The clone allocates a fresh, non-inference tensor so downstream fused ops mutate the original buffer rather than the user's saved reference. No-op for normal (non-inference) tensors, so HF / vanilla PyTorch paths are unaffected. Object.save() now returns save(self) instead of self, so the cloned tensor (not the original) is what the trace's local-frame filter retains via Globals.saves. Fixes #661.
Three CPU-only tests in TestSaveCloning that pin the fix's invariants:
1. globals.save() returns a clone for inference-mode tensors so subsequent in-place mutation of the source doesn't corrupt the saved value.
2. globals.save() returns the original (no clone) for normal tensors; this pins the zero-overhead contract for HF / vanilla PyTorch paths.
3. End-to-end: module.output.save() inside torch.inference_mode() returns a non-inference tensor (i.e. Object.save() returns the clone, not the original; otherwise the local-frame filter would drop it).
Verified to fail on the unpatched globals.py (tests 1 and 3 fail; test 2 passes both ways since it asserts a no-op the unpatched code also satisfies).
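The aliasing problem and the clone-on-save behaviour described above can be reproduced on CPU with plain PyTorch. The following is only a minimal sketch: save_value is a hypothetical stand-in, not nnsight's globals.save(), and stepping out of inference mode for the clone is an assumption about how a non-inference copy could be produced.

```python
import torch


def save_value(value):
    """Hypothetical clone-on-save helper (illustration only, not nnsight's code)."""
    if isinstance(value, torch.Tensor) and value.is_inference():
        # Assumption: cloning with inference mode disabled yields a normal tensor.
        with torch.inference_mode(False):
            return value.clone()
    return value  # normal tensors pass through untouched (no copy)


with torch.inference_mode():
    buffer = torch.arange(4, dtype=torch.float32)  # allocated as an inference tensor
    alias = buffer               # what keeping a plain reference would retain
    saved = save_value(buffer)   # what clone-on-save retains
    buffer.mul_(1000)            # stand-in for an in-place fused kernel

normal = torch.ones(2)           # a regular tensor, as on HF / vanilla PyTorch paths
```

After the block runs, alias reflects the post-mutation buffer while saved still holds the values from the moment of the save, and save_value(normal) returns the very same object.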
vLLM's prefix caching reuses KV values from previously-seen sequences. When the next request shares a prefix, those tokens skip the forward pass — hooks don't fire and interventions on those tokens are silently skipped, with no error. Default ``enable_prefix_caching=False`` so interventions consistently see every token. Users who explicitly opt in (e.g. for throughput on workloads that don't need to hook prefill tokens) can still pass ``enable_prefix_caching=True``. This matches what ``intervention-gaps/VLLM_GUIDE.md`` already documents as the integration's default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #656 merged the nnsight-serve sources (cli.py, server.py, LocalServeBackend, ServeInterleavingTracer, ...) onto dev but the pyproject.toml changes were dropped during conflict resolution, so ``pip install "nnsight[serve]"`` returns "no matching distribution" and the ``nnsight-serve`` CLI shim isn't on PATH for a fresh install.
Restore the missing pieces:
- ``serve`` optional-dependency that pulls vllm + FastAPI + uvicorn.
- ``[project.scripts]`` entry that registers ``nnsight-serve`` to ``nnsight.modeling.vllm.serve.cli:main``.
- ``all`` extended to include ``serve``.
After this, the documented ``pip install "nnsight[serve]"`` / ``nnsight-serve ...`` workflow works without falling back to the ``python -m nnsight.modeling.vllm.serve.cli`` workaround.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
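One way to confirm that a fresh install actually registered the CLI shim is to look it up in the installed package metadata. This is a small sketch using only the standard library; find_console_script is a hypothetical helper name, and the script name and target are the ones quoted in the commit message above.

```python
from importlib.metadata import entry_points


def find_console_script(name):
    """Return the 'module:function' target of a console script, or None if absent."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selectable entry-point collection.
        scripts = eps.select(group="console_scripts")
    else:
        # Older Pythons return a plain dict keyed by group name.
        scripts = eps.get("console_scripts", [])
    for ep in scripts:
        if ep.name == name:
            return ep.value
    return None


# None means the [project.scripts] entry did not make it into the install.
target = find_console_script("nnsight-serve")
```

With the restored pyproject.toml, target would be expected to equal nnsight.modeling.vllm.serve.cli:main; on an environment without the package it is simply None.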
Two unrelated refactors on dev left the async-mode portions of src/nnsight/modeling/vllm/README.md describing an API that no longer exists. Both landed without a README update; the drift has been live through v0.7.0:
- d124cc5 (2026-03-12) "refactor vLLM input processing: consolidate sync/async paths and clean up": Eliminated AsyncInterleavingTracer entirely. AsyncVLLMBackend now calls _setup_interleaver() directly and serializes mediators itself; the async path uses the default RemoteInterleavingTracer.
- bb61efa (2026-03-28) "Upgrade vLLM compat, refactor async backend, add sample_tokens hook": Collapsed the dual-call __call__(tracer)/__call__() pattern into a single required __call__(self, tracer) that submits the request to AsyncLLM.generate() immediately on trace exit. _stream() was removed; iteration moved to __aiter__/__await__. Calling tracer.backend() (with parens) now raises TypeError; the user iterates tracer.backend (no parens) instead.
This commit syncs the README accordingly:
- File listing: drop async_tracer.py; rewrite the async_backend.py one-liner.
- File responsibilities: drop the AsyncInterleavingTracer entry; rewrite the AsyncVLLMBackend entry to describe the on-exit submission and __aiter__ streaming.
- Key Classes: remove the AsyncInterleavingTracer subsection; rewrite AsyncVLLMBackend to enumerate __call__/__aiter__/__await__.
- Architecture diagram: redraw to show VLLM.trace() injecting only the backend, default tracer applying, __call__(tracer) submitting and parking the generator, __aiter__ streaming.
- Replace 5 _stream() mentions with __aiter__() across the execution flow, sync-vs-async table, and Streaming Saves section.
- Replace 4 tracer.backend() parens-form mentions with tracer.backend.
- Replace the "Why AsyncInterleavingTracer Bypasses RemoteableMixin" section with "How VLLM.trace() Routes the Async Path", showing the current setdefault-based routing (RemoteableMixin.trace() never hard-coded tracer_cls; the previous prose was wrong on that point too).
- Usage example: drop the parens; wrap in async def main() under an if __name__ == "__main__" guard (AsyncLLM uses multiprocessing spawn); note that output.saves is only set on output.finished.
Verified by running the corrected usage example end-to-end on this branch (gpt2, mode="async"): 8 RequestOutputs streamed, finished=True on the last, output.saves = {'logits': Tensor[1, 50257]}.
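The calling convention the README now describes (a required __call__(tracer) that submits and parks a generator, with streaming through __aiter__ and a final result through __await__) can be illustrated with a toy class. This is only a sketch of the pattern in plain asyncio: ToyAsyncBackend and its fake generator are hypothetical and do not reproduce AsyncVLLMBackend.

```python
import asyncio


class ToyAsyncBackend:
    """Hypothetical backend mimicking the submit-then-iterate shape."""

    def __init__(self):
        self._generator = None

    def __call__(self, tracer):
        # 'tracer' is required: calling the backend with no argument raises TypeError.
        # Submission happens here; the resulting generator is parked for later.
        self._generator = self._generate(tracer)

    async def _generate(self, tracer):
        # Stand-in for an engine's streaming output: three steps, last one finished.
        for step in range(3):
            await asyncio.sleep(0)
            yield {"step": step, "finished": step == 2}

    def __aiter__(self):
        # 'async for out in backend' streams from the parked generator.
        return self._generator

    def __await__(self):
        # 'await backend' drains the stream and returns the final output.
        async def drain():
            last = None
            async for out in self._generator:
                last = out
            return last

        return drain().__await__()


async def stream_all():
    backend = ToyAsyncBackend()
    backend("tracer-placeholder")
    return [out async for out in backend]


outputs = asyncio.run(stream_all())
```

Iterating the object itself (no parentheses) is what streams; calling it again without an argument fails with TypeError, which is the same shape of failure the commit message describes for tracer.backend().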
Models without a native vLLM definition (e.g. SmolLM3) run through vLLM's Transformers backend, which wraps the HuggingFace model and adds a leading singleton batch dim (inputs_embeds[None, ...]). Their decoder-layer activations are 3D [1, total_tokens, hidden], so the batched token axis is dim 1, not dim 0. Batcher._narrow/_swap hard-assumed the token axis was dim 0 and gated on shape[0] == total_batch_size. For the 3D case that gate is 1 == total_tokens (False), so reads returned the full batch (all prompts) and writes were silently discarded -- every intervention became a no-op once needs_batching (2+ prompts) was active. Native vLLM models (2D [total_tokens, hidden]) were unaffected, which is why Qwen3 worked but SmolLM3 did not. Generalize the base Batcher to narrow/swap along an axis reported by a new _batch_dim() hook (still dim 0 by default; preserves the existing in-place, concat-for-view/grad-leaf, and passthrough paths). VLLMBatcher overrides _batch_dim to recognize the Transformers-backend's [1, total_tokens, hidden] shape and select dim 1. Adds CPU-only regression tests (TestVLLMBatcherAxis): the 3D swap/narrow tests fail on the unpatched batcher and pass after; the 2D native test pins no regression. Related to #661/#662. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
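The axis selection can be sketched on shapes alone. The helper names below (batch_dim, narrow_index) are hypothetical and only mirror the idea of the _batch_dim() hook; the real Batcher logic may differ in detail.

```python
def batch_dim(shape, total_tokens):
    """Return the axis that carries the batched tokens, or None if none matches."""
    if len(shape) >= 1 and shape[0] == total_tokens:
        return 0  # native vLLM layout: [total_tokens, hidden]
    if len(shape) == 3 and shape[0] == 1 and shape[1] == total_tokens:
        return 1  # Transformers-backend layout: [1, total_tokens, hidden]
    return None   # not a batched activation: leave it alone


def narrow_index(shape, total_tokens, start, size):
    """Build an index tuple selecting tokens [start, start + size) on the token axis."""
    dim = batch_dim(shape, total_tokens)
    if dim is None:
        return None
    # Leading axes are taken whole; only the token axis is sliced.
    return (slice(None),) * dim + (slice(start, start + size),)


native = narrow_index((10, 64), total_tokens=10, start=3, size=4)
wrapped = narrow_index((1, 10, 64), total_tokens=10, start=3, size=4)
```

The resulting tuples can index a tensor directly (tensor[wrapped]), so the same read and write paths work for both layouts. A dim-0-only gate corresponds to dropping the second branch, which is exactly the case where the 3D shape falls through to passthrough.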
…aths The vLLM interleaver always runs in defer mode (GPUModelRunner.load_model sets defer_exceptions=True) so a single bad intervention can't crash the engine that's serving other requests. The serve path re-raises captured errors via surface_server_errors, but the local VLLM.trace() sync and async paths only collected output.saves and never read the __nnsight_exceptions__ envelope — so an intervention that errored on the worker (e.g. an in-place write on an inference-mode tensor) failed silently: no exception, and every .save() after the failing line was dropped. Read the envelope in VLLM.__call__ (sync) and AsyncVLLMBackend.__aiter__ (async) and re-raise via surface_server_errors, mirroring the serve path. The error surfaces at the trace boundary while the engine stays alive for the next trace (verified: a clean trace and a clone-based intervention both work in the same process after the surfaced error). Envelopes are merged per request so a multi-invoke trace doesn't clobber one request's error with another's. Add test_inplace_inference_write_surfaces_error covering the sync path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
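The defer-then-surface flow can be sketched without vLLM. Everything below is a hypothetical illustration: run_interventions, merge_envelopes and surface_errors are made-up names and do not reproduce the real __nnsight_exceptions__ envelope format or the surface_server_errors signature.

```python
def run_interventions(request_id, steps):
    """Worker side: capture the first error instead of raising (defer mode)."""
    saves, errors = {}, []
    for name, fn in steps:
        try:
            saves[name] = fn()
        except Exception as exc:
            errors.append(f"{type(exc).__name__}: {exc}")
            break  # saves after the failing line never happen
    return {"request_id": request_id, "saves": saves, "errors": errors}


def merge_envelopes(outputs):
    """Group captured errors by request so one request cannot clobber another."""
    merged = {}
    for out in outputs:
        if out["errors"]:
            merged.setdefault(out["request_id"], []).extend(out["errors"])
    return merged


def surface_errors(outputs):
    """Caller side: re-raise at the trace boundary if any request failed."""
    merged = merge_envelopes(outputs)
    if merged:
        details = "; ".join(
            f"{rid}: {', '.join(errs)}" for rid, errs in sorted(merged.items())
        )
        raise RuntimeError(f"deferred intervention errors: {details}")


def bad_write():
    raise RuntimeError("in-place write on an inference-mode tensor")


results = [
    run_interventions("req-0", [("a", lambda: 1), ("b", bad_write), ("c", lambda: 3)]),
    run_interventions("req-1", [("a", lambda: 10)]),
]
```

Calling surface_errors(results) raises once, at the boundary, naming req-0. Skipping that call reproduces the silent failure: req-0 simply comes back with a as its only save.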
Migrate the durable content of intervention-gaps/{REPORT,VLLM_GUIDE}.md into
the maintained user doc and delete the two stale docs (vLLM 0.15.1-era; their
in-place-write recipes no longer work).
docs/models/vllm.md:
- New "Intervention recipes" section: clone-and-replace writes, ablation,
steering, logit lens, two-trace activation patching, tracer.cache().
- "What each module returns" table: dual-stream (hidden, residual) output,
int64 position-id .input, fused-RMSNorm/RowParallel tuples, merged
qkv_proj/gate_up_proj, flat [total_tokens, hidden] layout.
- New gotchas: in-place writes raise (replace instead), clone-on-save,
enable_prefix_caching=False default, deferred errors keep the engine alive,
no attention weights, vLLM != transformers numerics.
- Drop stale claims: tracer.cache() is supported; version 0.15.1 -> 0.19.1.
tests/vllm_intervention_gaps/:
- git mv run_all.py + test_*.py here (executable vLLM-vs-HF diagnostic suite)
and add a README.
Recipes verified on vLLM 0.19.1 (Qwen2.5-0.5B): in-place writes raise,
replacement works, logit-lens matmul (norm(hs) @ lm_head.weight.T) bitwise-
matches model.logits at the last layer, and TP>=2 sub-module access works
(the old "crashes at tp>=2" claim was stale).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`assert x.std() > 50` depended on an unseeded randn(8) sample std exceeding 0.5 (x = randn(8)*0.1*1000), which an 8-element sample undershoots ~2.4% of runs — a CI flake. Replace with an exact, RNG-free check: the saved clone is a distinct object holding the pre-mutation values, and the in-place-mutated x is exactly saved * 1000. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AsyncVLLMBackend.__call__ serialized all invokes' mediators but then submitted only prompts[0]/params[0] under a single request_id, so a multi-invoke async trace ran only the first prompt: invokes past the first never reached the engine, their per-invoke saves came back empty, and trace-shared saves were never collected (the worker's received_count never reached expected_count). Submit one request per invoke (mirroring the serve path in serve/server.py) and merge the per-request generators into a single stream via an asyncio queue, collecting saves per finished request. The single-invoke streaming path is behavior-preserving; the deferred-error surfacing is reused via a shared _attach_saves helper. Add test_async_multi_invoke_runs_all_invokes: a two-invoke async trace must produce two finished requests and collect the trace-shared list from both invokes. This is the async counterpart of test_shared_list_across_invokes; every prior async test used a single prompt, so this code path had no coverage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
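Merging several per-request async generators into one stream through an asyncio queue can be sketched as follows. The names (merge_streams, fake_request) and the dict-shaped outputs are hypothetical; this shows the general fan-in technique, not the backend's actual code.

```python
import asyncio

_DONE = object()  # sentinel marking the end of one stream


async def merge_streams(streams):
    """Yield items from all async generators as they arrive, in a single stream."""
    queue = asyncio.Queue()

    async def pump(stream):
        try:
            async for item in stream:
                await queue.put(item)
        finally:
            await queue.put(_DONE)  # always signal completion, even on error

    tasks = [asyncio.create_task(pump(s)) for s in streams]
    remaining = len(tasks)
    try:
        while remaining:
            item = await queue.get()
            if item is _DONE:
                remaining -= 1
            else:
                yield item
    finally:
        for task in tasks:
            task.cancel()  # no-op for tasks that already finished


async def fake_request(request_id, n_steps):
    # Stand-in for one engine request that streams n_steps outputs.
    for step in range(n_steps):
        await asyncio.sleep(0)
        yield {"request_id": request_id, "finished": step == n_steps - 1}


async def collect():
    streams = [fake_request("invoke-0", 2), fake_request("invoke-1", 3)]
    return [out async for out in merge_streams(streams)]


outputs = asyncio.run(collect())
finished = sorted(o["request_id"] for o in outputs if o["finished"])
```

Each invoke contributes exactly one finished output, which is the property the new two-invoke test checks. A production version would also need to propagate exceptions raised inside pump, which this sketch leaves out.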
khaiwang added a commit that referenced this pull request on Jun 24, 2026: the Transformers-backend token-axis fix (commit message identical to the one above).
Follow-ups from the vLLM-integration merges that landed on dev for v0.7.0. This PR merges a mix of correctness fixes, one default change, packaging, and docs. Each change is small and independent of the others.

Topic (commits):
1. Clone inference-mode tensors on .save() (fixes #661): a40a5f2, fe7c072, cc56513
2. Narrow/swap on the token axis for Transformers-backend models: 0bc57e3
3. Surface deferred intervention errors on local sync/async paths: 5ef4466
4. Submit every invoke in async traces, not just the first: 18f0b41
5. Disable vLLM prefix caching by default: ee2022d
6. Restore nnsight-serve install machinery: bf2949d
7. Sync src/nnsight/modeling/vllm/README.md to the v0.7.0 async path: c8bd79e
8. Fold intervention-gaps/ into docs/models/vllm.md; move probes to tests/: fc6c10b

1. Clone inference-mode tensors on .save() (fixes #661)
vLLM runs forward passes inside torch.inference_mode(), and several of its kernels (e.g. fused_add_rms_norm) mutate buffers in place. Without a clone, references returned by Envoy.output / Envoy.inputs alias those buffers, so values surviving past the trace reflect post-mutation state, not what the user asked to save.
intervention/tracing/globals.save() now clones when the saved object is an inference-mode tensor. The clone allocates a fresh, non-inference tensor so downstream fused ops mutate the original buffer rather than the user's saved reference. It is a no-op for normal tensors, so HF / vanilla PyTorch paths are unaffected. Object.save() returns save(self) instead of self, so the cloned tensor (not the original) is what the trace's local-frame filter retains via Globals.saves.
Verification (SmolLM2-135M on vLLM 0.19.1, one A100): max reference-vs-clone diff across saved residual / attention / MLP tensors dropped from 4064.23 to 0.00.
Regression coverage in TestSaveCloning pins the invariants:
- globals.save() returns a clone for inference-mode tensors.
- globals.save() returns the original (no clone) for normal tensors.
- module.output.save() inside torch.inference_mode() returns a non-inference tensor.
cc56513 makes the end-to-end inference-mode test deterministic: the old assert x.std() > 50 depended on an unseeded 8-element randn sample and flaked ~2.4% of runs; it's replaced with an exact, RNG-free check (the saved clone holds the pre-mutation values; the in-place-mutated source is exactly saved * 1000).
Commits: a40a5f2, fe7c072, cc56513.

2. Narrow/swap on the token axis for Transformers-backend models
Models without a native vLLM definition (e.g. SmolLM3) run through vLLM's Transformers backend, which wraps the HuggingFace model and adds a leading singleton batch dim (inputs_embeds[None, ...]). Their decoder-layer activations are 3D [1, total_tokens, hidden], so the batched token axis is dim 1, not dim 0.
Batcher._narrow/_swap hard-assumed the token axis was dim 0 and gated on shape[0] == total_batch_size. For the 3D case that gate is 1 == total_tokens (False), so reads returned the full batch (all prompts) and writes were silently discarded: every intervention became a no-op once batching was active (2+ prompts). Native vLLM models (2D [total_tokens, hidden]) were unaffected, which is why Qwen3 worked but SmolLM3 did not.
The base Batcher now narrows/swaps along an axis reported by a new _batch_dim() hook (still dim 0 by default; preserves the existing in-place, concat-for-view/grad-leaf, and passthrough paths). VLLMBatcher overrides _batch_dim to recognize the Transformers-backend [1, total_tokens, hidden] shape and select dim 1. CPU-only regression tests (TestVLLMBatcherAxis): the 3D swap/narrow tests fail on the unpatched batcher and pass after; the 2D native test pins no regression.
Commit: 0bc57e3.

3. Surface deferred intervention errors on local sync/async paths
The vLLM interleaver always runs in defer mode (GPUModelRunner.load_model sets defer_exceptions=True) so a single bad intervention can't crash the engine that's serving other requests. The serve path re-raises captured errors via surface_server_errors, but the local VLLM.trace() sync and async paths only collected output.saves and never read the __nnsight_exceptions__ envelope. As a result, an intervention that errored on the worker (e.g. an in-place write on an inference-mode tensor) failed silently: no exception, and every .save() after the failing line was dropped.
VLLM.__call__ (sync) and AsyncVLLMBackend.__aiter__ (async) now read the envelope and re-raise via surface_server_errors, mirroring the serve path. The error surfaces at the trace boundary while the engine stays alive for the next trace (verified: a clean trace and a clone-based intervention both work in the same process after the surfaced error). Envelopes are merged per request so a multi-invoke trace doesn't clobber one request's error with another's. Adds test_inplace_inference_write_surfaces_error covering the sync path.
Commit: 5ef4466.

4. Submit every invoke in async traces, not just the first
AsyncVLLMBackend.__call__ serialized every invoke's mediator but then submitted only prompts[0]/params[0] under a single request_id, so a multi-invoke async trace ran only the first prompt: invokes past the first never reached the engine, their per-invoke saves came back empty, and trace-shared saves were never collected (the worker's received_count never reached expected_count). Every prior async test used a single prompt, so this code path had no coverage.
The backend now submits one request per invoke (mirroring the fan-out in serve/server.py) and merges the per-request generators into a single stream via an asyncio queue, collecting saves per finished request. The single-invoke streaming path is behavior-preserving, and the deferred-error surfacing from #3 is reused via a shared _attach_saves helper.
Adds test_async_multi_invoke_runs_all_invokes, the async counterpart of test_shared_list_across_invokes: a two-invoke async trace must produce two finished requests and collect the trace-shared list from both invokes.
Verification (gpt2, mode="async", vLLM 0.19.1, one A100): the new test fails on the old code (Expected 2 finished requests, got 1) and passes after; the full TestAsyncEngine suite is 6/6 green (the 5 pre-existing single-invoke async tests unaffected).
Commit: 18f0b41.

5. Disable vLLM prefix caching by default
vLLM's prefix caching reuses KV values from previously-seen sequences. When the next request shares a prefix, those tokens skip the forward pass: hooks don't fire and interventions on those tokens are silently skipped, with no error.
VLLM(...) now defaults to enable_prefix_caching=False so interventions consistently see every token. Users who explicitly opt in (e.g. for throughput on workloads that don't need to hook prefill tokens) can still pass enable_prefix_caching=True. This matches what docs/models/vllm.md documents as the integration's default.
Commit: ee2022d.

6. Restore nnsight-serve install machinery
PR #656 merged the nnsight-serve sources (cli.py, server.py, LocalServeBackend, ServeInterleavingTracer, …) onto dev, but the pyproject.toml changes were dropped during conflict resolution. Result: pip install "nnsight[serve]" returns "no matching distribution" and the nnsight-serve CLI shim isn't on PATH for a fresh install.
Restored:
- serve optional-dependency that pulls vllm + FastAPI + uvicorn.
- [project.scripts] entry registering nnsight-serve → nnsight.modeling.vllm.serve.cli:main.
- all extended to include serve.
After this, the documented pip install "nnsight[serve]" / nnsight-serve … workflow works without the python -m nnsight.modeling.vllm.serve.cli workaround.
Commit: bf2949d.

7. Sync src/nnsight/modeling/vllm/README.md to the v0.7.0 async path
Two earlier refactors landed without README updates; the drift has been live through v0.7.0:
- d124cc5 (2026-03-12, "refactor vLLM input processing") eliminated AsyncInterleavingTracer entirely. AsyncVLLMBackend now calls _setup_interleaver() directly; the async path uses the default RemoteInterleavingTracer.
- bb61efa (2026-03-28, "refactor async backend") collapsed the dual-call __call__(tracer)/__call__() pattern into a single required __call__(self, tracer) that submits to AsyncLLM.generate() immediately on trace exit. _stream() was removed; iteration moved to __aiter__/__await__. tracer.backend() (with parens) now raises TypeError; the correct iteration form is async for output in tracer.backend (no parens).
This commit syncs the README accordingly:
- Drop the async_tracer.py file listing and all AsyncInterleavingTracer references.
- Rewrite the AsyncVLLMBackend description (file responsibilities + Key Classes) to enumerate the current __call__/__aiter__/__await__ surface.
- Redraw the architecture diagram: VLLM.trace() injects only the backend → default tracer applies → __call__(tracer) submits and parks the generator → __aiter__ streams.
- Replace 5 _stream() mentions with __aiter__().
- Replace 4 tracer.backend() parens-form mentions with tracer.backend.
- Replace the "Why AsyncInterleavingTracer Bypasses RemoteableMixin" section with "How VLLM.trace() Routes the Async Path", showing current setdefault-based routing (RemoteableMixin.trace() never hard-coded tracer_cls; the previous prose was also wrong on that point).
- Usage example: drop the parens; wrap in async def main() under an if __name__ == "__main__": guard (AsyncLLM uses multiprocessing spawn); note that output.saves is only set on output.finished.
Verified the corrected usage example runs end-to-end on this branch (gpt2, mode="async"): 8 RequestOutputs streamed, finished=True on the last, output.saves == {'logits': Tensor[1, 50257]}.
Commit: c8bd79e.

8. Fold intervention-gaps/ into docs/models/vllm.md; move probes to tests/
Migrate the durable content of intervention-gaps/{REPORT,VLLM_GUIDE}.md into the maintained user doc and delete the two stale docs (vLLM 0.15.1-era; their in-place-write recipes no longer work).
docs/models/vllm.md:
- New "Intervention recipes" section: clone-and-replace writes, ablation, steering, logit lens, two-trace activation patching, tracer.cache().
- "What each module returns" table: dual-stream (hidden, residual) output, int64 position-id .input, fused-RMSNorm/RowParallel tuples, merged qkv_proj/gate_up_proj, flat [total_tokens, hidden] layout.
- New gotchas: in-place writes raise (replace instead), clone-on-save, enable_prefix_caching=False default, deferred errors keep the engine alive, no attention weights, vLLM ≠ transformers numerics.
- Drop stale claims: tracer.cache() is supported; version 0.15.1 → 0.19.1.
tests/vllm_intervention_gaps/:
- git mv run_all.py + test_*.py here (executable vLLM-vs-HF diagnostic suite) and add a README.
Recipes verified on vLLM 0.19.1 (Qwen2.5-0.5B): in-place writes raise, replacement works, logit-lens matmul (norm(hs) @ lm_head.weight.T) bitwise-matches model.logits at the last layer, and TP≥2 sub-module access works (the old "crashes at tp≥2" claim was stale).
Commit: fc6c10b.