fix(modeling): allow GPU-less meta-load of remote models that hard-im…#674
Closed
khaiwang wants to merge 2 commits into
…port CUDA kernels

Remote modeling code for some HuggingFace models (e.g. Nemotron-H /
nvidia/NVIDIA-Nemotron-3-*) imports CUDA-only kernel packages such as
mamba_ssm and causal_conv1d at module-import time. Those packages cannot be
installed or imported without a CUDA toolchain, so a GPU-less nnsight/NDIF
client could not even construct the meta model, and therefore could not
build an intervention graph or run remotely.

A meta model never executes a forward pass on the client (the forward runs
on the GPU host), so these kernels are only imported, never called. Add
meta_kernel_shim(), a context manager that registers inert stand-ins for the
kernel packages around the meta from_config() -- only when they are not
really installed, and removed immediately after. Each stub member raises if
called, so a dispatched/real run can never silently use a stub instead of
the real kernel.

Also default trust_remote_code=True consistently across _load_config,
_load_meta and _load so the config class, the meta intervention tree, and
the dispatched/served model are all the same implementation. Previously only
the meta from_config() forced trust_remote_code, which could build the
client tree from a different implementation than the one dispatched, and
left _load_config using a native config that cannot parse newer remote-only
config fields (e.g. Nemotron-H hybrid_override_pattern block types).

Verified: GPU-less meta-load now succeeds for Nemotron-3 4B and 120B; the
meta tree built without kernels is byte-identical (same module/param
signature) to the tree built with kernels; a real 4B load+trace matches
plain-transformers logits and is byte-identical between this path and the
prior CUDA path; interventions behave identically; gpt2 load/trace is
unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… need
This shim is a local workaround for remote modeling code that unguardedly
imports CUDA kernels at module level (Nemotron-H's rmsnorm_fn import); the
real fix belongs upstream. Drop the speculative generality accordingly:
- Remove the parent.child attribute wiring: the known files only use the
`from a.b.c import x` form, which resolves via the IMPORT_FROM sys.modules
fallback without it. A comment marks where to re-add wiring if a file ever
uses dotted attribute access (`import mamba_ssm; mamba_ssm.ops...`).
- Trim the stub table to the one unconditional import. The sibling kernel
imports sit behind transformers availability guards that answer False when
the package isn't installed, so their stubs were never consumed.
- Fix a latent crash present since the original shim: on a CUDA-visible
machine without the kernels, is_mamba_2_ssm_available() finds the stub via
find_spec, finds no pip metadata, falls back to __version__, and crashed
on version.parse("N/A"). Stamp __version__ = "0.0.0" on the stub so every
version gate parses and correctly answers "not available".
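The difference between the two import forms in the first bullet can be checked with plain stub modules. The sketch below uses an invented package name (`hypo_pkg`) and only illustrates the observable behavior; it makes no claim about which part of the import machinery resolves each form.

```python
import sys
import types

# Register bare stub modules with NO parent.child attribute wiring.
names = ["hypo_pkg", "hypo_pkg.ops", "hypo_pkg.ops.norm"]
for name in names:
    sys.modules[name] = types.ModuleType(name)
sys.modules["hypo_pkg.ops.norm"].rmsnorm_fn = lambda *args, **kwargs: None

# Form 1: `from a.b.c import x` only needs the sys.modules entry for a.b.c.
from hypo_pkg.ops.norm import rmsnorm_fn
from_import_ok = callable(rmsnorm_fn)

# Form 2: dotted attribute access needs the parent to expose the child.
import hypo_pkg
try:
    hypo_pkg.ops
    dotted_ok = True
except AttributeError:
    dotted_ok = False  # no wiring, so the attribute lookup fails

# Clean up so the example leaves sys.modules untouched.
for name in names:
    sys.modules.pop(name, None)
```

Without attribute wiring, the `from ... import` form works and the dotted form fails, which is why wiring would only need to return if a modeling file switched to dotted access.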
Verified GPU-less (Nemotron-3 4B + 120B + gpt2 meta-load, sys.modules clean
after) and CUDA-visible without kernels (previously crashed, now loads).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
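The version-gate crash described in the third bullet can be reproduced with the standard library alone. The sketch below is not the transformers implementation: `gate`, `install_stub`, and the package name are hypothetical, and `parse_release` is a toy stand-in for a real version parser that, like one, rejects the string "N/A".

```python
import importlib
import importlib.machinery
import importlib.metadata
import importlib.util
import sys
import types

NAME = "hypothetical_cuda_kernel_pkg"  # invented name, assumed not installed


def parse_release(text):
    """Toy version parser: accepts dotted integers only, raises ValueError otherwise."""
    return tuple(int(part) for part in text.split("."))


def gate(min_release=(2, 0, 0)):
    """Mimic the chain: find_spec -> pip metadata -> __version__ -> parse."""
    if importlib.util.find_spec(NAME) is None:
        return False
    try:
        found = importlib.metadata.version(NAME)
    except importlib.metadata.PackageNotFoundError:
        # No pip metadata for a stub, so fall back to the module attribute.
        found = getattr(importlib.import_module(NAME), "__version__", "N/A")
    return parse_release(found) >= min_release


def install_stub(version_string):
    """Register a stub that find_spec can see; optionally stamp __version__."""
    stub = types.ModuleType(NAME)
    stub.__spec__ = importlib.machinery.ModuleSpec(NAME, loader=None)
    if version_string is not None:
        stub.__version__ = version_string
    sys.modules[NAME] = stub


install_stub(None)
try:
    gate()
    unstamped = "no error"
except ValueError:
    unstamped = "parse error"  # "N/A" cannot be parsed

install_stub("0.0.0")
stamped = gate()  # parses cleanly and reports "not available"

sys.modules.pop(NAME, None)
```

With the stamp in place the gate parses the version and answers False instead of raising, which matches the behavior the commit message describes.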
Allow GPU-less meta-load of remote models that hard-import CUDA kernels
Problem
Some HuggingFace models ship custom remote modeling code (`trust_remote_code`) that imports CUDA-only kernel packages at module-import time. For example, `nvidia/NVIDIA-Nemotron-3-*` (Nemotron-H) does so at the top of `modeling_nemotron_h.py`.

`mamba_ssm` / `causal_conv1d` can't be installed or imported without a CUDA toolchain. Since nnsight builds a meta model client-side via `AutoModelForCausalLM.from_config(..., trust_remote_code=True)`, a GPU-less nnsight/NDIF client hit `ImportError: mamba-ssm is required ...` before any meta tensor was created, so it couldn't construct the module tree, build an intervention graph, or run `remote=True` at all. This blocks the core "weak client, remote GPU" use case.

Fix
A meta model never runs a forward on the client (the forward happens on the GPU host), so these kernels are only imported, never called.
- `meta_kernel_shim()` (new `modeling/_kernel_shim.py`): wraps the meta `from_config(...)` and registers inert stand-ins for the kernel packages in `sys.modules` (only when they aren't really installed), then removes them. Each stub member raises if called, so a dispatched/real run can never silently use a stub instead of the real kernel. No-op on a GPU host.
- Default `trust_remote_code=True` consistently across `_load_config`, `_load_meta`, and `_load`. Previously only the meta `from_config()` forced it, so the client tree could be built from a different implementation than the one dispatched/served, and `_load_config` used a native config that can't parse newer remote-only fields (the cause of `KeyError '-'` on newer Nemotron-H `hybrid_override_pattern` block types).

Verification
- … `ImportError`); gpt2 and other models unchanged.
- … `transformers` (argmax " Paris"); the fixed path is byte-identical to the prior CUDA path on baseline and on two interventions (ablation, ×0.5 scaling).

Notes
- … `rmsnorm_fn` import in NVIDIA's remote code (the adjacent kernel imports are already guarded by `is_mamba_2_ssm_available()`); this PR makes nnsight resilient without depending on that.
- … `causal_conv1d` CUDA kernel was unavailable on the test host (prebuilt wheel needs glibc 2.32, host has 2.31); this is unrelated to this change, which does not touch the forward.

🤖 Generated with Claude Code