vllm: follow-ups from the v0.7.0 merges #662

Open
khaiwang wants to merge 10 commits into dev from zikai/vllm-clone-on-save

khaiwang commented May 7, 2026

Follow-ups from the vLLM-integration merges that landed on dev for v0.7.0. This PR bundles correctness fixes, one default change, a packaging fix, and docs updates. Each change is small and independent of the others.

| # | Change | Kind | Commits |
|---|--------|------|---------|
| 1 | Clone inference-mode tensors on .save() (fixes #661) | correctness | a40a5f2, fe7c072, cc56513 |
| 2 | Narrow/swap on the token axis for Transformers-backend models | correctness | 0bc57e3 |
| 3 | Surface deferred intervention errors on local sync/async paths | correctness | 5ef4466 |
| 4 | Submit every invoke in async traces, not just the first | correctness | 18f0b41 |
| 5 | Disable vLLM prefix caching by default | default | ee2022d |
| 6 | Restore nnsight-serve install machinery | packaging | bf2949d |
| 7 | Sync vLLM integration README to the v0.7.0 async path | docs | c8bd79e |
| 8 | Fold intervention-gaps/ into docs/models/vllm.md; move probes to tests/ | docs | fc6c10b |

1. Clone inference-mode tensors on .save() (fixes #661)

vLLM runs forward passes inside torch.inference_mode(), and several of its kernels (e.g. fused_add_rms_norm) mutate buffers in place. Without a clone, references returned by Envoy.output / Envoy.inputs alias those buffers, so values surviving past the trace reflect post-mutation state, not what the user asked to save.

intervention/tracing/globals.save() now clones when the saved object is an inference-mode tensor. The clone allocates a fresh, non-inference tensor so downstream fused ops mutate the original buffer rather than the user's saved reference. No-op for normal tensors — HF / vanilla PyTorch paths are unaffected.

Object.save() returns save(self) instead of self, so the cloned tensor (not the original) is what the trace's local-frame filter retains via Globals.saves.
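The aliasing problem and the clone-on-save rule can be illustrated without a GPU. The sketch below is only an analogy: `FakeTensor` is a hypothetical stand-in (not a torch or nnsight class), and this `save` is a simplified model of the behavior described above, not the actual code in globals.py.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor; not a torch API."""
    data: list
    inference: bool = False

    def is_inference(self) -> bool:
        return self.inference

    def clone(self) -> "FakeTensor":
        # Fresh storage, and the copy is no longer flagged as inference-mode.
        return FakeTensor(list(self.data), inference=False)

    def mul_(self, k: float) -> "FakeTensor":
        # In-place mutation, like a fused kernel reusing its buffer.
        for i, v in enumerate(self.data):
            self.data[i] = v * k
        return self


def save(obj):
    """Clone only inference-mode tensors; return everything else unchanged."""
    if isinstance(obj, FakeTensor) and obj.is_inference():
        return obj.clone()
    return obj


buffer = FakeTensor([1.0, 2.0, 3.0], inference=True)
alias = buffer          # what an un-cloned save would keep
saved = save(buffer)    # what a clone-on-save keeps
buffer.mul_(1000)       # a later in-place mutation of the shared buffer

normal = FakeTensor([4.0], inference=False)
same = save(normal)     # normal tensors pass through untouched
```

After the mutation, `alias` reflects the post-mutation values while `saved` still holds what was requested, and `same is normal` shows the pass-through path for non-inference tensors.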

Verification (SmolLM2-135M on vLLM 0.19.1, one A100): max reference-vs-clone diff across saved residual / attention / MLP tensors dropped from 4064.23 to 0.00.

Regression coverage in TestSaveCloning pins the invariants:

  • Inference-mode tensor save returns a clone (mutation of source doesn't corrupt save).
  • Normal tensor save returns the original (no-overhead contract for HF path).
  • End-to-end: module.output.save() inside torch.inference_mode() returns a non-inference tensor.

cc56513 makes the end-to-end inference-mode test deterministic: the old assert x.std() > 50 depended on an unseeded 8-element randn sample and flaked ~2.4% of runs; it's replaced with an exact, RNG-free check (the saved clone holds the pre-mutation values; the in-place-mutated source is exactly saved * 1000).

Commits: a40a5f2, fe7c072, cc56513.

2. Narrow/swap on the token axis for Transformers-backend models

Models without a native vLLM definition (e.g. SmolLM3) run through vLLM's Transformers backend, which wraps the HuggingFace model and adds a leading singleton batch dim (inputs_embeds[None, ...]). Their decoder-layer activations are 3D [1, total_tokens, hidden], so the batched token axis is dim 1, not dim 0.

Batcher._narrow / _swap hard-assumed the token axis was dim 0 and gated on shape[0] == total_batch_size. For the 3D case that gate is 1 == total_tokens (False), so reads returned the full batch (all prompts) and writes were silently discarded — every intervention became a no-op once batching was active (2+ prompts). Native vLLM models (2D [total_tokens, hidden]) were unaffected, which is why Qwen3 worked but SmolLM3 did not.

The base Batcher now narrows/swaps along an axis reported by a new _batch_dim() hook (still dim 0 by default; preserves the existing in-place, concat-for-view/grad-leaf, and passthrough paths). VLLMBatcher overrides _batch_dim to recognize the Transformers-backend [1, total_tokens, hidden] shape and select dim 1. CPU-only regression tests (TestVLLMBatcherAxis): the 3D swap/narrow tests fail on the unpatched batcher and pass after; the 2D native test pins no regression.
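The shape gate and the axis choice can be sketched on plain shape tuples and nested lists. The function names below (`old_gate`, `batch_dim`, `narrow`) are hypothetical and simplified; they only mirror the logic described above, not the real `Batcher` or `VLLMBatcher` code.

```python
def old_gate(shape, total_tokens):
    # Pre-fix behaviour as described: only dim 0 is ever compared.
    return shape[0] == total_tokens


def batch_dim(shape, total_tokens):
    """Return the axis holding the flattened tokens, or None to pass through."""
    if len(shape) == 3 and shape[0] == 1 and shape[1] == total_tokens:
        return 1  # Transformers backend: [1, total_tokens, hidden]
    if shape[0] == total_tokens:
        return 0  # native vLLM: [total_tokens, hidden]
    return None


def narrow(rows, dim, start, length):
    # rows is a nested list; this sketch handles only dims 0 and 1.
    if dim == 0:
        return rows[start:start + length]
    return [inner[start:start + length] for inner in rows]


total_tokens, hidden = 5, 2  # prompt A: tokens 0-1, prompt B: tokens 2-4
native = [[t] * hidden for t in range(total_tokens)]  # shape (5, 2)
wrapped = [native]                                    # shape (1, 5, 2)

gate_3d = old_gate((1, total_tokens, hidden), total_tokens)
dim_3d = batch_dim((1, total_tokens, hidden), total_tokens)
dim_2d = batch_dim((total_tokens, hidden), total_tokens)

prompt_b_native = narrow(native, dim_2d, 2, 3)
prompt_b_wrapped = narrow(wrapped, dim_3d, 2, 3)
```

With the old gate the 3D case evaluates to `False`, which is the passthrough that made interventions no-ops; with the axis hook both layouts yield the same three token rows for prompt B.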

Commit: 0bc57e3.

3. Surface deferred intervention errors on local sync/async paths

The vLLM interleaver always runs in defer mode (GPUModelRunner.load_model sets defer_exceptions=True) so a single bad intervention can't crash the engine that's serving other requests. The serve path re-raised captured errors via surface_server_errors, but the local VLLM.trace() sync and async paths only collected output.saves and never read the __nnsight_exceptions__ envelope — so an intervention that errored on the worker (e.g. an in-place write on an inference-mode tensor) failed silently: no exception, and every .save() after the failing line was dropped.

VLLM.__call__ (sync) and AsyncVLLMBackend.__aiter__ (async) now read the envelope and re-raise via surface_server_errors, mirroring the serve path. The error surfaces at the trace boundary while the engine stays alive for the next trace (verified: a clean trace and a clone-based intervention both work in the same process after the surfaced error). Envelopes are merged per request so a multi-invoke trace doesn't clobber one request's error with another's. Adds test_inplace_inference_write_surfaces_error covering the sync path.
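The defer-then-surface pattern can be sketched in plain Python. The envelope key comes from the description above, but its layout and the helper names `run_deferred` and `collect` are assumptions made for illustration; the real `surface_server_errors` may behave differently.

```python
EXC_KEY = "__nnsight_exceptions__"  # envelope name from the PR; layout is assumed


def run_deferred(request_id, intervention):
    """Worker side: never raise; record the failure next to the saves."""
    saves = {}
    try:
        intervention(saves)
    except Exception as exc:
        saves[EXC_KEY] = {request_id: f"{type(exc).__name__}: {exc}"}
    return saves


def collect(outputs):
    """Client side: merge envelopes per request, then raise at the boundary."""
    merged, saves = {}, {}
    for out in outputs:
        out = dict(out)
        merged.update(out.pop(EXC_KEY, {}))  # keyed by request, so no clobbering
        saves.update(out)
    if merged:
        details = "; ".join(f"{rid}: {msg}" for rid, msg in sorted(merged.items()))
        raise RuntimeError(details)
    return saves


def good(saves):
    saves["logits"] = [0.1, 0.9]


def bad(saves):
    saves["before"] = 1
    raise ValueError("in-place write on inference tensor")


outputs = [run_deferred("req-0", good), run_deferred("req-1", bad)]
try:
    collect(outputs)
    surfaced = None
except RuntimeError as err:
    surfaced = str(err)

# The "engine" is untouched by the failure, so a later clean trace still works.
clean = collect([run_deferred("req-2", good)])
```

The worker-side function keeps running after a bad intervention, and the failure only becomes an exception when the caller collects results.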

Commit: 5ef4466.

4. Submit every invoke in async traces, not just the first

AsyncVLLMBackend.__call__ serialized every invoke's mediator but then submitted only prompts[0] / params[0] under a single request_id, so a multi-invoke async trace ran only the first prompt: invokes past the first never reached the engine, their per-invoke saves came back empty, and trace-shared saves were never collected (the worker's received_count never reached expected_count). Every prior async test used a single prompt, so this code path had no coverage.

The backend now submits one request per invoke (mirroring the fan-out in serve/server.py) and merges the per-request generators into a single stream via an asyncio queue, collecting saves per finished request. The single-invoke streaming path is behavior-preserving, and the deferred-error surfacing from change 3 is reused via a shared _attach_saves helper.
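Merging several per-request streams through one asyncio queue can look roughly like the following. This is a generic sketch, not the backend's code: `fake_generate` is a hypothetical stand-in for one engine request, and `merged_stream` is an illustrative name.

```python
import asyncio

_DONE = object()  # sentinel marking the end of one request's stream


async def fake_generate(prompt, steps):
    """Hypothetical stand-in for one engine request's output stream."""
    for i in range(steps):
        await asyncio.sleep(0)
        yield {"prompt": prompt, "step": i, "finished": i == steps - 1}


async def merged_stream(requests):
    """Fan out one generator per invoke and merge them through a queue."""
    queue = asyncio.Queue()

    async def pump(gen):
        try:
            async for item in gen:
                await queue.put(item)
        finally:
            await queue.put(_DONE)

    tasks = [asyncio.create_task(pump(fake_generate(p, n))) for p, n in requests]
    remaining = len(tasks)
    while remaining:
        item = await queue.get()
        if item is _DONE:
            remaining -= 1
        else:
            yield item
    await asyncio.gather(*tasks)  # re-raise any error from a pump task


async def main():
    return [out async for out in merged_stream([("a", 2), ("b", 3)])]


outputs = asyncio.run(main())
finished = sorted(o["prompt"] for o in outputs if o["finished"])
```

Each invoke gets its own producer task, so a two-invoke trace yields two finished outputs instead of one. The interleaving order of items from different requests is not guaranteed.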

Adds test_async_multi_invoke_runs_all_invokes — the async counterpart of test_shared_list_across_invokes: a two-invoke async trace must produce two finished requests and collect the trace-shared list from both invokes.

Verification (gpt2, mode="async", vLLM 0.19.1, one A100): the new test fails on the old code (Expected 2 finished requests, got 1) and passes after; the full TestAsyncEngine suite is 6/6 green (the 5 pre-existing single-invoke async tests unaffected).

Commit: 18f0b41.

5. Disable vLLM prefix caching by default

vLLM's prefix caching reuses KV values from previously-seen sequences. When the next request shares a prefix, those tokens skip the forward pass — hooks don't fire and interventions on those tokens are silently skipped, with no error.

VLLM(...) now defaults to enable_prefix_caching=False so interventions consistently see every token. Users who explicitly opt in (e.g. for throughput on workloads that don't need to hook prefill tokens) can still pass enable_prefix_caching=True. Matches what docs/models/vllm.md documents as the integration's default.
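A toy model makes the skipped-hook effect concrete. The sketch below is a deliberate simplification (real vLLM caches KV in blocks and has its own reuse rules); `run_request` and its arguments are hypothetical names, and "hooked positions" simply means token positions that go through the forward pass.

```python
def run_request(tokens, seen_prefixes, enable_prefix_caching):
    """Return the token positions that go through the forward pass.

    Toy model: positions covered by a previously-seen prefix are served from
    the cache, so hooks never fire for them.
    """
    cached = 0
    if enable_prefix_caching:
        while cached < len(tokens) and tuple(tokens[: cached + 1]) in seen_prefixes:
            cached += 1
    # Record every prefix of this request for later requests.
    for end in range(1, len(tokens) + 1):
        seen_prefixes.add(tuple(tokens[:end]))
    return list(range(cached, len(tokens)))


seen = set()
first = run_request([11, 12, 13, 14], seen, enable_prefix_caching=True)
second = run_request([11, 12, 13, 99], seen, enable_prefix_caching=True)
uncached = run_request([11, 12, 13, 99], set(), enable_prefix_caching=False)
```

In this toy, the second request shares a three-token prefix with the first, so only its last position is recomputed and an intervention targeting positions 0-2 would silently do nothing. With caching off, every position is visible.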

Commit: ee2022d.

6. Restore nnsight-serve install machinery

PR #656 merged the nnsight-serve sources (cli.py, server.py, LocalServeBackend, ServeInterleavingTracer, …) onto dev, but the pyproject.toml changes were dropped during conflict resolution. Result: pip install "nnsight[serve]" returns "no matching distribution" and the nnsight-serve CLI shim isn't on PATH for a fresh install.

Restored:

  • serve optional-dependency that pulls vllm + FastAPI + uvicorn.
  • [project.scripts] entry registering nnsight-serve → nnsight.modeling.vllm.serve.cli:main.
  • all extended to include serve.

After this, the documented pip install "nnsight[serve]" / nnsight-serve … workflow works without the python -m nnsight.modeling.vllm.serve.cli workaround.

Commit: bf2949d.

7. Sync src/nnsight/modeling/vllm/README.md to the v0.7.0 async path

Two earlier refactors landed without README updates; the drift has been live through v0.7.0:

  • d124cc5 (2026-03-12, "refactor vLLM input processing") eliminated AsyncInterleavingTracer entirely. AsyncVLLMBackend now calls _setup_interleaver() directly; the async path uses the default RemoteInterleavingTracer.
  • bb61efa (2026-03-28, "refactor async backend") collapsed the dual-call __call__(tracer) / __call__() pattern into a single required __call__(self, tracer) that submits to AsyncLLM.generate() immediately on trace exit. _stream() was removed; iteration moved to __aiter__/__await__. tracer.backend() (with parens) now raises TypeError — the correct iteration form is async for output in tracer.backend (no parens).

This commit syncs the README accordingly:

  • Drop async_tracer.py file listing and all AsyncInterleavingTracer references.
  • Rewrite the AsyncVLLMBackend description (file responsibilities + Key Classes) to enumerate the current __call__/__aiter__/__await__ surface.
  • Redraw the async architecture diagram: VLLM.trace() injects only the backend → default tracer applies → __call__(tracer) submits and parks the generator → __aiter__ streams.
  • Replace 5 _stream() mentions with __aiter__().
  • Replace 4 tracer.backend() parens-form mentions with tracer.backend.
  • Replace the "Why AsyncInterleavingTracer Bypasses RemoteableMixin" section with "How VLLM.trace() Routes the Async Path", showing current setdefault-based routing (RemoteableMixin.trace() never hard-coded tracer_cls — the previous prose was also wrong on that point).
  • Usage example: drop parens; wrap in async def main() under an if __name__ == "__main__": guard (AsyncLLM uses multiprocessing spawn); note that output.saves is only set on output.finished.
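The call-versus-iterate distinction in the items above can be shown with a small mock. `SketchBackend` is a hypothetical stand-in that only mirrors the `__call__`/`__aiter__`/`__await__` shape described here; it is not the README's usage example and does not touch vLLM. In real use, the README's advice to wrap the entry point in an `if __name__ == "__main__":` guard still applies.

```python
import asyncio


class SketchBackend:
    """Hypothetical stand-in mirroring the __call__/__aiter__/__await__ surface."""

    def __init__(self):
        self._generator = None

    def __call__(self, tracer):
        # Required tracer argument: "submit" and park the generator.
        self._generator = self._generate()

    async def _generate(self):
        for step in range(3):
            await asyncio.sleep(0)
            yield {"step": step, "finished": step == 2}

    def __aiter__(self):
        # An async generator object is its own async iterator.
        return self._generator

    def __await__(self):
        async def last():
            final = None
            async for out in self._generator:
                final = out
            return final

        return last().__await__()


async def main():
    backend = SketchBackend()
    backend(tracer=object())
    streamed = [out async for out in backend]  # iterate the object, no parens

    try:
        backend()  # parens form without a tracer
        raised = False
    except TypeError:
        raised = True

    awaited_backend = SketchBackend()
    awaited_backend(tracer=object())
    final = await awaited_backend  # await returns only the last output
    return streamed, raised, final


streamed, raised, final = asyncio.run(main())
```

Because `__call__` requires a tracer, calling the backend with empty parentheses fails with `TypeError`, while iterating or awaiting the object itself consumes the parked generator.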

Verified the corrected usage example runs end-to-end on this branch (gpt2, mode="async"): 8 RequestOutputs streamed, finished=True on the last, output.saves == {'logits': Tensor[1, 50257]}.

Note: the README's AsyncVLLMBackend surface section is accurate as of bb61efa; the multi-invoke fan-out added in change 4 (18f0b41) does not change the public __call__/__aiter__/__await__ shape it describes.

Commit: c8bd79e.

8. Fold intervention-gaps/ into docs/models/vllm.md; move probes to tests/

Migrate the durable content of intervention-gaps/{REPORT,VLLM_GUIDE}.md into the maintained user doc and delete the two stale docs (vLLM 0.15.1-era; their in-place-write recipes no longer work).

docs/models/vllm.md:

  • New "Intervention recipes" section: clone-and-replace writes, ablation, steering, logit lens, two-trace activation patching, tracer.cache().
  • "What each module returns" table: dual-stream (hidden, residual) output, int64 position-id .input, fused-RMSNorm/RowParallel tuples, merged qkv_proj/gate_up_proj, flat [total_tokens, hidden] layout.
  • New gotchas: in-place writes raise (replace instead), clone-on-save, enable_prefix_caching=False default, deferred errors keep the engine alive, no attention weights, vLLM ≠ transformers numerics.
  • Fix stale claims: tracer.cache() is now documented as supported; version updated 0.15.1 → 0.19.1.

tests/vllm_intervention_gaps/:

  • git mv run_all.py + test_*.py here (executable vLLM-vs-HF diagnostic suite) and add a README.

Recipes verified on vLLM 0.19.1 (Qwen2.5-0.5B): in-place writes raise, replacement works, logit-lens matmul (norm(hs) @ lm_head.weight.T) bitwise-matches model.logits at the last layer, and TP≥2 sub-module access works (the old "crashes at tp≥2" claim was stale).

Commit: fc6c10b.

khaiwang added 2 commits May 6, 2026 22:59
…e corruption

vLLM runs forward passes inside torch.inference_mode() and several of its
kernels (e.g. fused_add_rms_norm) mutate buffers in place. Without a clone,
references returned by Envoy.output / Envoy.inputs alias those buffers,
so the values surviving past the trace reflect post-mutation state, not
what the user asked to save.

Clone on save when the saved object is an inference-mode tensor. The clone
allocates a fresh, non-inference tensor so downstream fused ops mutate the
original buffer rather than the user's saved reference. No-op for normal
(non-inference) tensors, so HF / vanilla PyTorch paths are unaffected.

Object.save() now returns save(self) instead of self, so the cloned tensor
(not the original) is what the trace's local-frame filter retains via
Globals.saves.

Fixes #661.

Three CPU-only tests in TestSaveCloning that pin the fix's invariants:

1. globals.save() returns a clone for inference-mode tensors so subsequent
   in-place mutation of the source doesn't corrupt the saved value.
2. globals.save() returns the original (no clone) for normal tensors —
   pins the zero-overhead contract for HF / vanilla PyTorch paths.
3. End-to-end: module.output.save() inside torch.inference_mode() returns
   a non-inference tensor (i.e. Object.save() returns the clone, not the
   original — otherwise the local-frame filter would drop it).

Verified to fail on the unpatched globals.py (tests 1 and 3 fail; test 2
passes both ways since it asserts a no-op the unpatched code also satisfies).

khaiwang and others added 3 commits May 21, 2026 19:19
vLLM's prefix caching reuses KV values from previously-seen sequences.
When the next request shares a prefix, those tokens skip the forward
pass — hooks don't fire and interventions on those tokens are silently
skipped, with no error.

Default ``enable_prefix_caching=False`` so interventions consistently
see every token. Users who explicitly opt in (e.g. for throughput on
workloads that don't need to hook prefill tokens) can still pass
``enable_prefix_caching=True``.

This matches what ``intervention-gaps/VLLM_GUIDE.md`` already documents
as the integration's default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #656 merged the nnsight-serve sources (cli.py, server.py,
LocalServeBackend, ServeInterleavingTracer, ...) onto dev but the
pyproject.toml changes were dropped during conflict resolution, so
``pip install "nnsight[serve]"`` returns "no matching distribution"
and the ``nnsight-serve`` CLI shim isn't on PATH for a fresh install.

Restore the missing pieces:

- ``serve`` optional-dependency that pulls vllm + FastAPI + uvicorn.
- ``[project.scripts]`` entry that registers ``nnsight-serve`` to
  ``nnsight.modeling.vllm.serve.cli:main``.
- ``all`` extended to include ``serve``.

After this, the documented ``pip install "nnsight[serve]"`` /
``nnsight-serve ...`` workflow works without falling back to the
``python -m nnsight.modeling.vllm.serve.cli`` workaround.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two unrelated refactors on dev left the async-mode portions of
src/nnsight/modeling/vllm/README.md describing an API that no longer
exists. Both landed without a README update; the drift has been live
through v0.7.0:

  d124cc5 (2026-03-12) "refactor vLLM input processing: consolidate
    sync/async paths and clean up"
      Eliminated AsyncInterleavingTracer entirely. AsyncVLLMBackend now
      calls _setup_interleaver() directly and serializes mediators
      itself; the async path uses the default RemoteInterleavingTracer.

  bb61efa (2026-03-28) "Upgrade vLLM compat, refactor async backend,
    add sample_tokens hook"
      Collapsed the dual-call __call__(tracer)/__call__() pattern into
      a single required __call__(self, tracer) that submits the request
      to AsyncLLM.generate() immediately on trace exit. _stream() was
      removed; iteration moved to __aiter__/__await__. Calling
      tracer.backend() (with parens) now raises TypeError; the user
      iterates tracer.backend (no parens) instead.

This commit syncs the README accordingly:

- File listing: drop async_tracer.py; rewrite the async_backend.py
  one-liner.
- File responsibilities: drop the AsyncInterleavingTracer entry;
  rewrite the AsyncVLLMBackend entry to describe the on-exit submission
  and __aiter__ streaming.
- Key Classes: remove the AsyncInterleavingTracer subsection; rewrite
  AsyncVLLMBackend to enumerate __call__/__aiter__/__await__.
- Architecture diagram: redraw to show VLLM.trace() injecting only the
  backend, default tracer applying, __call__(tracer) submitting and
  parking the generator, __aiter__ streaming.
- Replace 5 _stream() mentions with __aiter__() across the execution
  flow, sync-vs-async table, and Streaming Saves section.
- Replace 4 tracer.backend() parens-form mentions with tracer.backend.
- Replace the "Why AsyncInterleavingTracer Bypasses RemoteableMixin"
  section with "How VLLM.trace() Routes the Async Path", showing the
  current setdefault-based routing (RemoteableMixin.trace() never
  hard-coded tracer_cls; the previous prose was wrong on that point
  too).
- Usage example: drop the parens; wrap in async def main() under an
  if __name__ == "__main__" guard (AsyncLLM uses multiprocessing
  spawn); note that output.saves is only set on output.finished.

Verified by running the corrected usage example end-to-end on this
branch (gpt2, mode="async"): 8 RequestOutputs streamed, finished=True
on the last, output.saves = {'logits': Tensor[1, 50257]}.

khaiwang changed the title from "fix(vllm): clone inference-mode tensors on .save() to prevent in-plac…" to "vllm: follow-ups from the v0.7.0 merges" on May 21, 2026

khaiwang and others added 5 commits May 27, 2026 17:15
Models without a native vLLM definition (e.g. SmolLM3) run through vLLM's
Transformers backend, which wraps the HuggingFace model and adds a leading
singleton batch dim (inputs_embeds[None, ...]). Their decoder-layer
activations are 3D [1, total_tokens, hidden], so the batched token axis is
dim 1, not dim 0.

Batcher._narrow/_swap hard-assumed the token axis was dim 0 and gated on
shape[0] == total_batch_size. For the 3D case that gate is 1 == total_tokens
(False), so reads returned the full batch (all prompts) and writes were
silently discarded -- every intervention became a no-op once needs_batching
(2+ prompts) was active. Native vLLM models (2D [total_tokens, hidden]) were
unaffected, which is why Qwen3 worked but SmolLM3 did not.

Generalize the base Batcher to narrow/swap along an axis reported by a new
_batch_dim() hook (still dim 0 by default; preserves the existing in-place,
concat-for-view/grad-leaf, and passthrough paths). VLLMBatcher overrides
_batch_dim to recognize the Transformers-backend's [1, total_tokens, hidden]
shape and select dim 1.

Adds CPU-only regression tests (TestVLLMBatcherAxis): the 3D swap/narrow
tests fail on the unpatched batcher and pass after; the 2D native test pins
no regression.

Related to #661/#662.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…aths

The vLLM interleaver always runs in defer mode (GPUModelRunner.load_model
sets defer_exceptions=True) so a single bad intervention can't crash the
engine that's serving other requests. The serve path re-raises captured
errors via surface_server_errors, but the local VLLM.trace() sync and async
paths only collected output.saves and never read the __nnsight_exceptions__
envelope — so an intervention that errored on the worker (e.g. an in-place
write on an inference-mode tensor) failed silently: no exception, and every
.save() after the failing line was dropped.

Read the envelope in VLLM.__call__ (sync) and AsyncVLLMBackend.__aiter__
(async) and re-raise via surface_server_errors, mirroring the serve path.
The error surfaces at the trace boundary while the engine stays alive for
the next trace (verified: a clean trace and a clone-based intervention both
work in the same process after the surfaced error). Envelopes are merged
per request so a multi-invoke trace doesn't clobber one request's error
with another's.

Add test_inplace_inference_write_surfaces_error covering the sync path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Migrate the durable content of intervention-gaps/{REPORT,VLLM_GUIDE}.md into
the maintained user doc and delete the two stale docs (vLLM 0.15.1-era; their
in-place-write recipes no longer work).

docs/models/vllm.md:
- New "Intervention recipes" section: clone-and-replace writes, ablation,
  steering, logit lens, two-trace activation patching, tracer.cache().
- "What each module returns" table: dual-stream (hidden, residual) output,
  int64 position-id .input, fused-RMSNorm/RowParallel tuples, merged
  qkv_proj/gate_up_proj, flat [total_tokens, hidden] layout.
- New gotchas: in-place writes raise (replace instead), clone-on-save,
  enable_prefix_caching=False default, deferred errors keep the engine alive,
  no attention weights, vLLM != transformers numerics.
- Drop stale claims: tracer.cache() is supported; version 0.15.1 -> 0.19.1.

tests/vllm_intervention_gaps/:
- git mv run_all.py + test_*.py here (executable vLLM-vs-HF diagnostic suite)
  and add a README.

Recipes verified on vLLM 0.19.1 (Qwen2.5-0.5B): in-place writes raise,
replacement works, logit-lens matmul (norm(hs) @ lm_head.weight.T) bitwise-
matches model.logits at the last layer, and TP>=2 sub-module access works
(the old "crashes at tp>=2" claim was stale).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`assert x.std() > 50` depended on an unseeded randn(8) sample std exceeding
0.5 (x = randn(8)*0.1*1000), which an 8-element sample undershoots ~2.4% of
runs — a CI flake. Replace with an exact, RNG-free check: the saved clone is a
distinct object holding the pre-mutation values, and the in-place-mutated x is
exactly saved * 1000.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

AsyncVLLMBackend.__call__ serialized all invokes' mediators but then
submitted only prompts[0]/params[0] under a single request_id, so a
multi-invoke async trace ran only the first prompt: invokes past the
first never reached the engine, their per-invoke saves came back empty,
and trace-shared saves were never collected (the worker's received_count
never reached expected_count).

Submit one request per invoke (mirroring the serve path in serve/server.py)
and merge the per-request generators into a single stream via an asyncio
queue, collecting saves per finished request. The single-invoke streaming
path is behavior-preserving; the deferred-error surfacing is reused via a
shared _attach_saves helper.

Add test_async_multi_invoke_runs_all_invokes: a two-invoke async trace must
produce two finished requests and collect the trace-shared list from both
invokes. This is the async counterpart of test_shared_list_across_invokes;
every prior async test used a single prompt, so this code path had no
coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

khaiwang added a commit that referenced this pull request Jun 24, 2026