fix(modeling): allow GPU-less meta-load of remote models that hard-im…#674

Closed
khaiwang wants to merge 2 commits into dev from fix/meta-kernel-shim

@khaiwang khaiwang commented Jun 9, 2026
Contributor

Allow GPU-less meta-load of remote models that hard-import CUDA kernels

Problem

Some HuggingFace models ship custom remote modeling code (trust_remote_code) that imports CUDA-only kernel packages at module-import time. For example, nvidia/NVIDIA-Nemotron-3-* (Nemotron-H) does, at the top of modeling_nemotron_h.py:

from mamba_ssm.ops.triton.layernorm_gated import rmsnorm_fn  # unguarded -> hard ImportError

mamba_ssm / causal_conv1d can't be installed or imported without a CUDA toolchain. nnsight builds a meta model client-side via AutoModelForCausalLM.from_config(..., trust_remote_code=True), so a GPU-less nnsight/NDIF client hit ImportError: mamba-ssm is required ... before any meta tensor was created. As a result, it couldn't construct the module tree, build an intervention graph, or run remote=True at all. This blocks the core "weak client, remote GPU" use case.

Fix

A meta model never runs a forward on the client (the forward happens on the GPU host), so these kernels are only imported, never called.

  • meta_kernel_shim() (new modeling/_kernel_shim.py): a context manager that wraps the meta from_config(...) call. It registers inert stand-ins for the kernel packages in sys.modules, only when they aren't really installed, and removes them immediately afterwards. Each stub member raises if called, so a dispatched/real run can never silently use a stub instead of the real kernel. No-op on a GPU host.
  • Consistent trust_remote_code=True across _load_config, _load_meta, and _load. Previously only the meta from_config() forced it, so the client tree could be built from a different implementation than the one dispatched/served, and _load_config used a native config that can't parse newer remote-only fields (the cause of KeyError '-' on newer Nemotron-H hybrid_override_pattern block types).
 src/nnsight/modeling/_kernel_shim.py | 136 +++++++  (new)
 src/nnsight/modeling/transformers.py |  26 ++-
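
The shim idea described above can be sketched roughly as follows. This is an illustration only, not the contents of the actual `_kernel_shim.py` (which this page does not show). The names `meta_kernel_shim_sketch`, `stub_table`, and the `zz_fake_kernel` package are hypothetical.

```python
import contextlib
import importlib.util
import sys
import types


def _is_installed(root):
    """Return True if a top-level package can really be found."""
    try:
        return importlib.util.find_spec(root) is not None
    except (ImportError, ValueError):
        return False


def _make_stub(name, members):
    """Build an inert module whose listed members raise when called."""
    mod = types.ModuleType(name)
    mod.__path__ = []  # mark it as a package so submodule imports are allowed
    for member in members:
        def _raise(*args, _member=member, **kwargs):
            raise RuntimeError(
                f"{name}.{_member} is a meta-load stub; install the real kernel"
            )
        setattr(mod, member, _raise)
    return mod


@contextlib.contextmanager
def meta_kernel_shim_sketch(stub_table):
    """Register stubs for missing packages, then remove them on exit.

    stub_table (hypothetical) maps a dotted module name to the member
    names that remote code imports from it.
    """
    roots = {name.split(".")[0] for name in stub_table}
    # Decide up front which packages are real, before any stub is inserted.
    real = {r for r in roots if r in sys.modules or _is_installed(r)}
    installed = []
    for name, members in stub_table.items():
        if name.split(".")[0] in real:
            continue  # never shadow a genuinely installed package
        sys.modules[name] = _make_stub(name, members)
        installed.append(name)
    try:
        yield
    finally:
        for name in installed:
            sys.modules.pop(name, None)


# Usage with a made-up package name that is assumed not to be installed.
table = {"zz_fake_kernel": (), "zz_fake_kernel.ops": ("fused_fn",)}
with meta_kernel_shim_sketch(table):
    from zz_fake_kernel.ops import fused_fn  # import succeeds via the stub
```

Calling `fused_fn()` raises `RuntimeError`, which mirrors the "stub members raise if called" guarantee, and a table entry for a package that is really installed is skipped entirely.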

Verification

  • GPU-less meta-load now succeeds for Nemotron-3 4B and 120B (was: hard ImportError); gpt2 and other models unchanged.
  • Tree equivalence: the meta module tree built without kernels has the same signature (module/param/buffer name→shape→dtype) as the tree built with real kernels, so a client's interventions map exactly onto the served model.
  • End-to-end (real weights, Nemotron-3 4B on GPU): an nnsight trace produces logits byte-identical to plain transformers (argmax " Paris"); the fixed path is byte-identical to the prior CUDA path on baseline and on two interventions (ablation, ×0.5 scaling).
  • No regression: gpt2 load is byte-identical with and without the change.
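
One way to compute the kind of name→shape→dtype signature used for the tree-equivalence check is sketched below. It is an assumption about how such a comparison could be done, not the script used for this PR. `tree_signature` and `_FakeModel` are hypothetical names; in practice the argument would be a `torch.nn.Module`, whose `named_modules()`, `named_parameters()`, and `named_buffers()` methods the stand-in imitates so the example runs without torch.

```python
from types import SimpleNamespace


def tree_signature(model):
    """Map every module, parameter, and buffer name to a comparable value."""
    sig = {}
    for name, module in model.named_modules():
        sig[("module", name)] = type(module).__name__
    for kind, named in (
        ("param", model.named_parameters()),
        ("buffer", model.named_buffers()),
    ):
        for name, tensor in named:
            sig[(kind, name)] = (tuple(tensor.shape), str(tensor.dtype))
    return sig


class _FakeModel:
    """Tiny stand-in exposing the three iterators used above."""

    def __init__(self, params):
        self._params = params

    def named_modules(self):
        return [("", self)]

    def named_parameters(self):
        return list(self._params.items())

    def named_buffers(self):
        return []


with_kernels = _FakeModel({"w": SimpleNamespace(shape=(4, 8), dtype="float32")})
without_kernels = _FakeModel({"w": SimpleNamespace(shape=(4, 8), dtype="float32")})
different = _FakeModel({"w": SimpleNamespace(shape=(4, 16), dtype="float32")})

same_tree = tree_signature(with_kernels) == tree_signature(without_kernels)
```

Two trees are treated as equivalent when their signature dictionaries compare equal; any renamed module or changed shape or dtype makes the comparison fail.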

Notes

  • A complementary upstream fix would guard the rmsnorm_fn import in NVIDIA's remote code (the adjacent kernel imports are already guarded by is_mamba_2_ssm_available()); this PR makes nnsight resilient without depending on that.
  • The 120B forward was not tested (247 GB); the 4B uses the identical load/dispatch code path. The 4B forward ran the model's reference SSM path because the causal_conv1d CUDA kernel was unavailable on the test host (prebuilt wheel needs glibc 2.32, host has 2.31) — unrelated to this change, which does not touch the forward.
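
For illustration, the complementary upstream guard mentioned in the first note could look something like the sketch below. It uses a plain try/except rather than the `is_mamba_2_ssm_available()` check that the adjacent imports use, and the wrapper name `gated_rmsnorm` is hypothetical. It is not NVIDIA's code.

```python
# Guarded import: a missing kernel package no longer fails at import time.
try:
    from mamba_ssm.ops.triton.layernorm_gated import rmsnorm_fn
except ImportError:
    rmsnorm_fn = None


def gated_rmsnorm(*args, **kwargs):
    """Hypothetical wrapper that defers the failure to call time."""
    if rmsnorm_fn is None:
        raise ImportError("mamba-ssm is required to run this layer")
    return rmsnorm_fn(*args, **kwargs)
```

With a guard like this, a meta load that only imports the module would succeed, and the error would surface only if the kernel were actually called.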

🤖 Generated with Claude Code

khaiwang and others added 2 commits June 9, 2026 14:42
…port CUDA kernels

Remote modeling code for some HuggingFace models (e.g. Nemotron-H /
nvidia/NVIDIA-Nemotron-3-*) imports CUDA-only kernel packages such as mamba_ssm
and causal_conv1d at module-import time. Those packages cannot be installed or
imported without a CUDA toolchain, so a GPU-less nnsight/NDIF client could not
even construct the meta model, and therefore could not build an intervention
graph or run remotely.

A meta model never executes a forward pass on the client (the forward runs on
the GPU host), so these kernels are only imported, never called. Add
meta_kernel_shim(), a context manager that registers inert stand-ins for the
kernel packages around the meta from_config() -- only when they are not really
installed, and removed immediately after. Each stub member raises if called, so
a dispatched/real run can never silently use a stub instead of the real kernel.

Also default trust_remote_code=True consistently across _load_config,
_load_meta and _load so the config class, the meta intervention tree, and the
dispatched/served model are all the same implementation. Previously only the
meta from_config() forced trust_remote_code, which could build the client tree
from a different implementation than the one dispatched, and left _load_config
using a native config that cannot parse newer remote-only config fields (e.g.
Nemotron-H hybrid_override_pattern block types).

Verified: GPU-less meta-load now succeeds for Nemotron-3 4B and 120B; the meta
tree built without kernels is byte-identical (same module/param signature) to
the tree built with kernels; a real 4B load+trace matches plain-transformers
logits and is byte-identical between this path and the prior CUDA path;
interventions behave identically; gpt2 load/trace is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… need

This shim is a local workaround for remote modeling code that unguardedly
imports CUDA kernels at module level (Nemotron-H's rmsnorm_fn import); the
real fix belongs upstream. Drop the speculative generality accordingly:

- Remove the parent.child attribute wiring: the known files only use the
  `from a.b.c import x` form, which resolves via the IMPORT_FROM sys.modules
  fallback without it. A comment marks where to re-add wiring if a file ever
  uses dotted attribute access (`import mamba_ssm; mamba_ssm.ops...`).
- Trim the stub table to the one unconditional import. The sibling kernel
  imports sit behind transformers availability guards that answer False when
  the package isn't installed, so their stubs were never consumed.
- Fix a latent crash present since the original shim: on a CUDA-visible
  machine without the kernels, is_mamba_2_ssm_available() found the stub via
  find_spec, found no pip metadata, fell back to __version__, and crashed
  on version.parse("N/A"). Stamp __version__ = "0.0.0" on the stub so every
  version gate parses and correctly answers "not available".
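
The difference between the two import forms in the first bullet can be reproduced with stdlib-only stubs. The sketch below uses a made-up package name, `zz_demo_pkg`, and registers modules in sys.modules without setting any parent.child attributes. It is a demonstration of CPython's import behaviour, not code from the shim.

```python
import sys
import types

names = ["zz_demo_pkg", "zz_demo_pkg.ops", "zz_demo_pkg.ops.norm"]
for n in names:
    stub = types.ModuleType(n)
    stub.__path__ = []  # treat every level as a package
    sys.modules[n] = stub  # note: no setattr(parent, child, ...) wiring
sys.modules["zz_demo_pkg.ops.norm"].fused_norm = lambda *a, **k: None

try:
    # `from a.b.c import x`: the leaf module is found in sys.modules directly.
    from zz_demo_pkg.ops.norm import fused_norm
    # `from a.b import c`: attribute lookup fails, then sys.modules is consulted.
    from zz_demo_pkg.ops import norm

    # Dotted attribute access needs the wiring that was removed.
    import zz_demo_pkg
    try:
        zz_demo_pkg.ops
        dotted_access_works = True
    except AttributeError:
        dotted_access_works = False
finally:
    for n in names:
        sys.modules.pop(n, None)
```

Both `from ... import` forms succeed, while `zz_demo_pkg.ops` raises `AttributeError`, which is the case the commit's comment flags as needing the wiring re-added.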

Verified GPU-less (Nemotron-3 4B + 120B + gpt2 meta-load, sys.modules clean
after) and CUDA-visible without kernels (previously crashed, now loads).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@khaiwang khaiwang closed this Jun 19, 2026
@khaiwang khaiwang deleted the fix/meta-kernel-shim branch June 19, 2026 22:25