Gather tensor-parallel sharded parameters on read in a trace #677
Open
khaiwang wants to merge 2 commits into
…red across invokes)

barrier() is broken on the vLLM path (reproduces at tp1/pp1, non-PP): each invoke is serialized into its own globals, so each gets a private copy of the Barrier with its own participants set. The count never reaches n, both invokes take the no-op send(BARRIER, None) branch, the workers block at the barrier and are abandoned, and all post-barrier code is silently dropped.

Diagnosed (instrumentation reverted); fix is a design choice (interleaver-owned barrier registry keyed by a serialization-stable id, preferred; or graft the Barrier into canonical globals).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Under tensor parallelism vLLM shards parameters across ranks (lm_head/embeddings are vocab-sharded; attention/MLP projections are output- or input-sharded). nnsight handed intervention code the local shard, so reading a parameter inside a trace -- e.g. `lm_head.weight[token_id]` to build a steering direction -- returned the wrong row on a rank that does not own that token, silently diverging from single-GPU.

This is the parameter analogue of the existing activation gather (VLLMBatcher gathers RowParallelLinear/ColumnParallelLinear I/O). Parameter reads now route through the batcher: Envoy.__getattr__ delegates a tensor attribute to interleaver.batcher.gather_param while interleaving (so the collective fires on every rank); the base Batcher returns it unchanged (non-vLLM and tp=1 untouched), and VLLMBatcher all-gathers the shard to its full logical shape. The sharded dim comes from the module class (RowParallelLinear -> input dim; ColumnParallelLinear/VocabParallelEmbedding -> output/vocab dim), because vLLM sets BOTH output_dim and input_dim on every linear weight (they label the dims, not which is sharded); vocab padding is stripped to org_vocab_size.

Verified on Qwen2.5-0.5B tp=2: lm_head, qkv_proj, gate_up_proj, o_proj and down_proj all gather to the full tp=1 shape with matching norms; a steering write reading lm_head[token_id] goes from divergent (maxabs 26, wrong global token in the top-5) to equivalent. Regression tests in tests/test_tp_param_gather.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015y5Sy9vzzc9YJZXCtewSdQ
Problem
Under tensor parallelism, vLLM shards parameters across ranks: lm_head/embeddings are vocab-sharded, and attention/MLP projections are output- or input-sharded. nnsight handed intervention code the local shard, so reading a parameter inside a trace returned the wrong values on a rank that doesn't own them.

Concretely, a steering cell builds its direction from lm_head.weight[token_id]. Measured on Qwen2.5-0.5B under tp=2: on the rank that doesn't own the token, that index returned a different vocab row (lm_head.weight.shape[0] was 75968, half the vocab, in the trace), so the steered output silently diverged from single-GPU (maxabs 26, with the wrong global token surfacing in the top-5).

Fix
The parameter analogue of the existing activation gather (VLLMBatcher already gathers RowParallelLinear/ColumnParallelLinear I/O):

- Envoy.__getattr__ routes a tensor attribute read through interleaver.batcher.gather_param only while interleaving, so the collective all_gather fires on every rank. tp=1 and non-vLLM are untouched because the base Batcher.gather_param is the identity.
- VLLMBatcher.gather_param all-gathers the shard to its full logical shape. The sharded dim comes from the module class (RowParallelLinear → input dim; ColumnParallelLinear/VocabParallelEmbedding → output/vocab dim) because vLLM sets both output_dim and input_dim on every linear weight (they label the dims, not which is sharded). Vocab padding is stripped to org_vocab_size.

Read-only; tp=1 is byte-identical.
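The routing described above can be illustrated with a small single-process toy. This is only a sketch under stated assumptions, not the code in this PR: nested Python lists stand in for tensors, a list of per-rank shards stands in for the process group, and ToyModule, ToyBatcher, ToyTPBatcher, ToyEnvoy and the SHARD_DIM table are hypothetical names invented for the illustration. It does not call nnsight, vLLM, or torch.distributed, and it assumes vocab padding sits at the tail of the gathered vocab.

```python
# Toy, single-process model of "gather sharded parameters on read".
# All names below are hypothetical stand-ins, not nnsight or vLLM APIs.

# Sharded dim per module class, following the rule stated in the PR text
# for an [out, in] weight: row-parallel shards the input dim (1),
# column-parallel and vocab-parallel shard the output/vocab dim (0).
SHARD_DIM = {
    "RowParallelLinear": 1,
    "ColumnParallelLinear": 0,
    "VocabParallelEmbedding": 0,
}


class ToyModule:
    """Holds this rank's shard plus (toy only) every rank's shard."""

    def __init__(self, kind, shards, rank, org_vocab_size=None):
        self.kind = kind
        self.all_shards = shards        # replaces a real all_gather
        self.weight = shards[rank]      # what a rank sees locally
        self.org_vocab_size = org_vocab_size


class ToyBatcher:
    """Base behaviour: gather_param is the identity."""

    def gather_param(self, module, value):
        return value


class ToyTPBatcher(ToyBatcher):
    """Tensor-parallel behaviour: concatenate shards along the sharded dim."""

    def gather_param(self, module, value):
        dim = SHARD_DIM.get(module.kind)
        if dim is None:
            return value                # unsharded module: nothing to do
        shards = module.all_shards
        if dim == 0:
            full = [row for shard in shards for row in shard]
        else:
            full = [sum((shard[i] for shard in shards), [])
                    for i in range(len(value))]
        if module.org_vocab_size is not None:
            full = full[:module.org_vocab_size]   # strip vocab padding rows
        return full


class ToyEnvoy:
    """Routes list-valued ("tensor") attribute reads through the batcher."""

    def __init__(self, module, batcher, interleaving):
        self._module = module
        self._batcher = batcher
        self._interleaving = interleaving

    def __getattr__(self, name):
        # Only reached for names not set in __init__, e.g. "weight".
        value = getattr(self._module, name)
        if self._interleaving and isinstance(value, list):
            return self._batcher.gather_param(self._module, value)
        return value


# Vocab of 6 real rows padded to 8, hidden size 2; row i holds [i, i].
padded = [[float(i), float(i)] for i in range(8)]
vocab_shards = [padded[:4], padded[4:]]            # tp=2
lm_head_rank1 = ToyModule("VocabParallelEmbedding", vocab_shards,
                          rank=1, org_vocab_size=6)

token_id = 1
outside = ToyEnvoy(lm_head_rank1, ToyTPBatcher(), interleaving=False)
inside = ToyEnvoy(lm_head_rank1, ToyTPBatcher(), interleaving=True)
identity = ToyEnvoy(lm_head_rank1, ToyBatcher(), interleaving=True)

wrong_row = outside.weight[token_id]   # local index 1 on rank 1 = global row 5
right_row = inside.weight[token_id]    # global row 1 after the gather

# Row-parallel [out=2, in=4] weight split along the input dim.
row_shards = [[[1, 2], [5, 6]], [[3, 4], [7, 8]]]
o_proj_rank0 = ToyModule("RowParallelLinear", row_shards, rank=0)
o_proj_full = ToyEnvoy(o_proj_rank0, ToyTPBatcher(), interleaving=True).weight
```

In this toy, the read outside interleaving returns the local shard's row (the failure mode from the Problem section), the read while interleaving returns the row for the global token id, and the base batcher hands back the local shard object unchanged. A real implementation would replace `all_shards` with a collective that every rank must enter, which is why the routing has to happen on every rank's attribute read.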
Verification (Qwen2.5-0.5B, tp=2)
- lm_head, qkv_proj, gate_up_proj, o_proj, and down_proj all gather to the full tp=1 shape with matching Frobenius norms.
- A steering write reading lm_head.weight[token_id] goes from divergent (maxabs 26) to equivalent.
- Regression tests: tests/test_tp_param_gather.py (full-vocab lm_head, upper-shard token row, row+column parallel weights); the pre-existing tests/test_tp_stream_fix.py still passes. Run with pytest tests/test_tp_param_gather.py --tp 2.

🤖 Generated with Claude Code
https://claude.ai/code/session_015y5Sy9vzzc9YJZXCtewSdQ