fix(modeling): allow GPU-less meta-load of remote models that hard-im… by khaiwang · Pull Request #674 · ndif-team/nnsight

khaiwang · 2026-06-09T19:05:12Z

Allow GPU-less meta-load of remote models that hard-import CUDA kernels

Problem

Some HuggingFace models ship custom remote modeling code (trust_remote_code) that imports CUDA-only kernel packages at module-import time. For example, nvidia/NVIDIA-Nemotron-3-* (Nemotron-H) does, at the top of modeling_nemotron_h.py:

from mamba_ssm.ops.triton.layernorm_gated import rmsnorm_fn  # unguarded -> hard ImportError

mamba_ssm / causal_conv1d can't be installed or imported without a CUDA toolchain. Since nnsight builds a meta model client-side via AutoModelForCausalLM.from_config(..., trust_remote_code=True), a GPU-less nnsight/NDIF client hit ImportError: mamba-ssm is required ... before any meta tensor was created — so it couldn't construct the module tree, build an intervention graph, or run remote=True at all. This blocks the core "weak client, remote GPU" use case.

Fix

A meta model never runs a forward on the client (the forward happens on the GPU host), so these kernels are only imported, never called.

meta_kernel_shim() (new modeling/_kernel_shim.py): wraps the meta from_config(...) and registers inert stand-ins for the kernel packages in sys.modules — only when they aren't really installed — then removes them. Each stub member raises if called, so a dispatched/real run can never silently use a stub instead of the real kernel. No-op on a GPU host.
Consistent trust_remote_code=True across _load_config, _load_meta, and _load. Previously only the meta from_config() forced it, so the client tree could be built from a different implementation than the one dispatched/served, and _load_config used a native config that can't parse newer remote-only fields (the cause of KeyError '-' on newer Nemotron-H hybrid_override_pattern block types).
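The shim described above can be sketched roughly as follows. This is a minimal illustration, not the contents of modeling/_kernel_shim.py: the stub table `STUBS`, the helpers `_is_installed` and `_raising_stub`, and the error message are all assumptions made for the example. Only the behaviour (register stand-ins when the package is absent, raise on call, remove on exit) comes from the description.

```python
import contextlib
import importlib.util
import sys
import types

# Hypothetical stub table: module name -> members the remote code imports from it.
STUBS = {
    "mamba_ssm": (),
    "mamba_ssm.ops": (),
    "mamba_ssm.ops.triton": (),
    "mamba_ssm.ops.triton.layernorm_gated": ("rmsnorm_fn",),
}


def _is_installed(root):
    """Return True if a top-level package can be located by the import system."""
    try:
        return importlib.util.find_spec(root) is not None
    except (ImportError, ValueError):
        return False


def _raising_stub(qualname):
    """Build a placeholder that fails loudly if anything ever calls it."""
    def stub(*args, **kwargs):
        raise RuntimeError(f"{qualname} is a meta-load stub, not a real kernel")
    return stub


@contextlib.contextmanager
def meta_kernel_shim(stubs=STUBS):
    # Decide which packages are really installed before touching sys.modules.
    real_roots = {r for r in {n.split(".")[0] for n in stubs} if _is_installed(r)}
    added = []
    try:
        for name, members in stubs.items():
            if name.split(".")[0] in real_roots or name in sys.modules:
                continue  # real package present: leave it alone
            module = types.ModuleType(name)
            module.__version__ = "0.0.0"  # parseable version, as in the follow-up commit
            for member in members:
                setattr(module, member, _raising_stub(f"{name}.{member}"))
            sys.modules[name] = module
            added.append(name)
        yield
    finally:
        # Remove only what this context manager registered.
        for name in added:
            sys.modules.pop(name, None)
```

In use, the meta construction would sit inside the context manager, for example `with meta_kernel_shim(): model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)`. On a host where the kernel packages are installed, nothing is registered and the block behaves as a no-op.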

 src/nnsight/modeling/_kernel_shim.py | 136 +++++++  (new)
 src/nnsight/modeling/transformers.py |  26 ++-

Verification

GPU-less meta-load now succeeds for Nemotron-3 4B and 120B (was: hard ImportError); gpt2 and other models unchanged.
Tree equivalence: the meta module tree built without kernels is byte-identical (same module/param/buffer name→shape→dtype signature) to the tree built with real kernels — so a client's interventions map exactly onto the served model.
End-to-end (real weights, Nemotron-3 4B on GPU): an nnsight trace produces logits byte-identical to plain transformers (argmax " Paris"); the fixed path is byte-identical to the prior CUDA path on baseline and on two interventions (ablation, ×0.5 scaling).
No regression: gpt2 load is byte-identical with and without the change.

Notes

A complementary upstream fix would guard the rmsnorm_fn import in NVIDIA's remote code (the adjacent kernel imports are already guarded by is_mamba_2_ssm_available()); this PR makes nnsight resilient without depending on that.
The 120B forward was not tested (247 GB); the 4B uses the identical load/dispatch code path. The 4B forward ran the model's reference SSM path because the causal_conv1d CUDA kernel was unavailable on the test host (prebuilt wheel needs glibc 2.32, host has 2.31) — unrelated to this change, which does not touch the forward.
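The upstream guard mentioned in the first note could look roughly like the sketch below. It is an assumption about the shape of such a fix, not NVIDIA's code: the `None` fallback and the `require_rmsnorm_fn` helper are hypothetical names introduced here.

```python
# Guard the kernel import so that merely importing the modeling file
# does not require a CUDA toolchain.
try:
    from mamba_ssm.ops.triton.layernorm_gated import rmsnorm_fn
except ImportError:
    rmsnorm_fn = None  # kernel package not available on this machine


def require_rmsnorm_fn():
    """Hypothetical helper: return the kernel, or fail at call time if it is missing."""
    if rmsnorm_fn is None:
        raise ImportError("mamba-ssm is required to run the fused RMSNorm kernel")
    return rmsnorm_fn
```

With this shape the failure moves from import time to the first forward pass that needs the kernel, which a meta-only client never reaches.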

🤖 Generated with Claude Code

…port CUDA kernels

Remote modeling code for some HuggingFace models (e.g. Nemotron-H / nvidia/NVIDIA-Nemotron-3-*) imports CUDA-only kernel packages such as mamba_ssm and causal_conv1d at module-import time. Those packages cannot be installed or imported without a CUDA toolchain, so a GPU-less nnsight/NDIF client could not even construct the meta model, and therefore could not build an intervention graph or run remotely.

A meta model never executes a forward pass on the client (the forward runs on the GPU host), so these kernels are only imported, never called.

Add meta_kernel_shim(), a context manager that registers inert stand-ins for the kernel packages around the meta from_config() -- only when they are not really installed, and removed immediately after. Each stub member raises if called, so a dispatched/real run can never silently use a stub instead of the real kernel.

Also default trust_remote_code=True consistently across _load_config, _load_meta and _load so the config class, the meta intervention tree, and the dispatched/served model are all the same implementation. Previously only the meta from_config() forced trust_remote_code, which could build the client tree from a different implementation than the one dispatched, and left _load_config using a native config that cannot parse newer remote-only config fields (e.g. Nemotron-H hybrid_override_pattern block types).

Verified: GPU-less meta-load now succeeds for Nemotron-3 4B and 120B; the meta tree built without kernels is byte-identical (same module/param signature) to the tree built with kernels; a real 4B load+trace matches plain-transformers logits and is byte-identical between this path and the prior CUDA path; interventions behave identically; gpt2 load/trace is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… need

This shim is a local workaround for remote modeling code that unguardedly imports CUDA kernels at module level (Nemotron-H's rmsnorm_fn import); the real fix belongs upstream. Drop the speculative generality accordingly:

- Remove the parent.child attribute wiring: the known files only use the `from a.b.c import x` form, which resolves via the IMPORT_FROM sys.modules fallback without it. A comment marks where to re-add wiring if a file ever uses dotted attribute access (`import mamba_ssm; mamba_ssm.ops...`).
- Trim the stub table to the one unconditional import. The sibling kernel imports sit behind transformers availability guards that answer False when the package isn't installed, so their stubs were never consumed.
- Fix a latent crash present since the original shim: on a CUDA-visible machine without the kernels, is_mamba_2_ssm_available() finds the stub via find_spec, finds no pip metadata, falls back to __version__, and crashed on version.parse("N/A"). Stamp __version__ = "0.0.0" on the stub so every version gate parses and correctly answers "not available".

Verified GPU-less (Nemotron-3 4B + 120B + gpt2 meta-load, sys.modules clean after) and CUDA-visible without kernels (previously crashed, now loads).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
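The point about dropping the parent.child wiring can be checked with a small experiment. The sketch below uses hypothetical module names (`demo_kernels...`) rather than the real kernel packages, and is not taken from the shim itself: it registers bare modules in sys.modules without setting any child attribute on the parent, then tries both import styles.

```python
import sys
import types

# Hypothetical module names, chosen so they cannot clash with a real package.
NAMES = ["demo_kernels", "demo_kernels.ops", "demo_kernels.ops.fused"]


def register_unwired_stubs():
    """Put bare modules in sys.modules without setting parent.child attributes."""
    for name in NAMES:
        sys.modules[name] = types.ModuleType(name)
    sys.modules["demo_kernels.ops.fused"].rmsnorm_fn = lambda *args, **kwargs: None


def unregister_stubs():
    """Remove the demonstration modules again."""
    for name in NAMES:
        sys.modules.pop(name, None)


register_unwired_stubs()
try:
    # The `from a.b.c import x` form resolves the dotted name from sys.modules.
    from demo_kernels.ops.fused import rmsnorm_fn
    from_import_ok = callable(rmsnorm_fn)

    # Dotted attribute access relies on the wiring that was left out.
    import demo_kernels
    dotted_access_ok = hasattr(demo_kernels, "ops")
finally:
    unregister_stubs()
```

Here the `from ... import` form succeeds while `demo_kernels.ops` is not reachable as an attribute, which is the case the commit message says would need the wiring re-added.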

khaiwang and others added 2 commits June 9, 2026 14:42

khaiwang closed this Jun 19, 2026

khaiwang deleted the fix/meta-kernel-shim branch June 19, 2026 22:25

khaiwang wants to merge 2 commits into dev from fix/meta-kernel-shim
