Skip to content

Add GB10 (sm_121 / DGX Spark) support to low-latency-llama#11

Open
chauhang wants to merge 1 commit into
HazyResearch:mainfrom
chauhang:gb10-sm121-support
Open

Add GB10 (sm_121 / DGX Spark) support to low-latency-llama#11
chauhang wants to merge 1 commit into
HazyResearch:mainfrom
chauhang:gb10-sm121-support

Conversation

@chauhang

@chauhang chauhang commented Jun 21, 2026

Copy link
Copy Markdown

Summary

This PR makes the low-latency Llama megakernel run on GB10 / NVIDIA DGX Spark (sm_121, consumer-class Blackwell):

  1. Ports the kernel to GB10's tighter limits — 99 KB shared memory (5 pages of 16 KB, not H100's 13) and no thread-block-cluster engine;
  2. Builds against a GB10-enabled ThunderKittens (companion PR Enable warpgroup-MMA on sm_121 (GB10): WGMMA shim + SM121 build target ThunderKittens#204);
  3. Adds a torch.compile baseline to generate.py for a fair production comparison.

On Llama-3.2-1B batch-1 decode, the megakernel is verified correct (the repo's diff_test) and faster than PyTorch:

Mode Tokens/sec vs eager
torch (eager) 64.4 1.0×
torch.compile (reduce-overhead + cudagraphs) 72.1 1.12×
mk (megakernel) 95.9 1.49×

mk is 1.49× over eager and 1.33× over compiled+cudagraphed PyTorch, saturating GB10's memory bandwidth (see Performance below). All changes are gated under #ifdef KITTENS_SM120, so H100/B200 builds are byte-identical.

What's in this PR (all #ifdef KITTENS_SM120)

GB10 forces two adjustments — (a) re-tile the megakernel's shared-memory page layout from 13 → 5 pages, and (b) disable a cluster-only codegen path the chip can't execute — plus build/config plumbing and the benchmark baseline. Everything else is unchanged.

  • Removed __cluster_dims__ for SM120 (include/megakernel.cuh). A size-1 cluster attribute still makes nvcc emit thread-block-cluster addressing (S2UR SR_CgaCtaId + cluster-scoped mbarriers) for all shared-memory barrier waits — an illegal instruction on sm_121 (consumer Blackwell has no cluster engine). The cluster syncs were already guarded by if (CLUSTER_BLOCKS > 1), so size-1 clusters are unaffected.
  • 5-page layout port (demos/low-latency-llama/matvec_pipeline.cuh). The matvec ops were built for 13 pages (INPUT_PIPELINE_STAGES(3) × STAGE_PAGES(4) + 1 activation), which overran GB10's 5-entry page table (OOB). Fix: INPUT_PIPELINE_STAGES 3 → 1 (4 weight + 1 activation = exactly 5; outputs live in scratch, so OUTPUT_PIPELINE_STAGES stays 3), and release_lid() (which returns the order an op's shared-memory pages are recycled into for the next op) → identity ret_order[5] = {0,1,2,3,4} (the permutation only sets recycle order — actual page readiness is still gated by the page_finished semaphores).
  • Attention release_lid fix (attention_partial.cu, attention_reduction.cu). These ops have their own release_lid with ret_order[13]; the >5 return values corrupted the next op's 5-entry pid_order, which surfaced as an OOB read in o_proj (the op after attention). Both fixed to identity [5].
  • include/config.cuh — SM120 overrides: MAX_SHARED_MEMORY = 99 KB, NUM_PAGES == 5 assert, GB10_SM_COUNT = 48. llama.cuh takes the SM120 SM-count branch.
  • generate.py — adds a mode=compile baseline (torch.compile reduce-overhead + cudagraphs) — the fair production baseline to compare the megakernel against.

Dependency — needs a GB10-enabled ThunderKittens

The megakernel builds against ThunderKittens (TK). Megakernels currently pins TK as a submodule on an older branch (bvm-single-ctrl-pre-new-warps) that predates GB10/SM120 support, so it won't build for GB10 as-is.

The TK-side GB10 support this PR relies on — TK's sm_121 build target plus a warpgroup-MMA shim that lets H100-style kernels run on consumer Blackwell — is a companion PR: HazyResearch/ThunderKittens#204 (HazyResearch/ThunderKittens#204).

To build or test this PR, set THUNDERKITTENS_ROOT to a TK checkout that has #204's changes (e.g. chauhang/ThunderKittens@gb10-wgmma-shim) rather than the pinned submodule. No Megakernels code changes are needed for that — the megakernel is source-compatible with the newer TK and compiles unchanged. The new GPU=GB10 Makefile target handles the build flags.

Validation (real GB10, CUDA 13.0, sm_121a, Llama-3.2-1B batch-1)

Correctness is established the way the repo's own harness does — diff_test.py runs the identical instruction list through the Python-VM reference (pyvm) and the megakernel (mk) on byte-identical inputs (seeded; hidden states + K/V cache copied from the reference into the kernel), then compares every per-op intermediate tensor and the final logits — max absolute diff and mean symmetric relative diff (rdiff = 2|a−b| / (|a|+|b|)) — against the bf16 noise band.

  • diff_test, all 16 layers passes at bf16 tolerance: attn intermediates exact (0.0), k/v cache adiff ~0.03, logits max adiff 0.125 / mean rdiff 0.044 — accumulated bf16 rounding, not port-induced (the 1-stage change alters when weights load, not the arithmetic).
  • Error attribution: the largest per-op rdiff is silu_out (the MLP matvec — unchanged CUDA-core code), not the attention MMA-shim or the page-layout port — so neither change introduced numerical error.

Performance — measured on real GB10 (batch-1 decode)

Batch-1 decode is memory-bandwidth bound (reads all 2.47 GB of bf16 weights/token), so the right denominator is achievable bandwidth, not the spec: a microbenchmark tops out at 236 GB/s (86% of the 273 GB/s LPDDR5X spec, which is not attainable on GB10). Against that, mk saturates the memory system:

Mode Tokens/sec Effective BW % of achievable (236 GB/s)
torch (eager) 64.4 159 GB/s 67%
torch.compile (reduce-overhead + cudagraphs) 72.1 178 GB/s 75%
mk (megakernel) 95.9 ~237 GB/s ~100% (saturated)

Four independent tools corroborate the picture:

Tool What it measured Result
benchmark throughput mk 95.9 tok/s → 1.49× eager, 1.33× compile
microbench achievable LPDDR5X read BW 236 GB/s (86% of the 273 spec — the spec is unreachable)
derived (kernel duration) mk effective BW 2.47 GB ÷ 10.52 ms = 235 GB/s ≈ 99% of achievable
nsys mk share of GPU time 96.5% — one persistent kernel, 10.52 ms each
ncu SM (compute) throughput 46% → not compute-bound

The throughput-derived (95.9 tok/s × 2.47 GB ≈ 237 GB/s) and duration-derived (2.47 GB ÷ 10.52 ms = 235 GB/s) numbers bracket the 236 GB/s achievable ceiling. ncu confirms it's bandwidth-bound, not compute-bound, and nsys confirms the fusion: a single persistent kernel does 96.5% of all GPU work, with only µs-scale torch glue (embedding, KV-cache concat, SiLU, elementwise) around it. (Bandwidth is read directly from the microbench and cross-checked by the duration derivation — ncu's DRAM counters and nsys's BW path don't report on GB10's unified memory.)

Build & run

# 1. The only thing you set: a ThunderKittens checkout that has GB10 support (HazyResearch/ThunderKittens#204)
export THUNDERKITTENS_ROOT=/path/to/thunderkittens-with-gb10

# 2. Build the megakernel for GB10 (the GPU=GB10 target sets the arch flags for you)
export MEGAKERNELS_ROOT=$(pwd)
cd demos/low-latency-llama && make GPU=GB10 PYTHON_VERSION=3.12          # -> mk_llama*.so

# 3. Correctness (all 16 layers) + benchmark vs PyTorch
python megakernels/scripts/diff_test.py layer_limit=16
python megakernels/scripts/generate.py mode=mk      prompt="The capital of France is" ntok=128
python megakernels/scripts/generate.py mode=torch   prompt="The capital of France is" ntok=128
python megakernels/scripts/generate.py mode=compile prompt="The capital of France is" ntok=128

Limitations / follow-ups

  • 1B latency demo only. The compiled megakernel bakes in 1B dims. The throughput config is not changed.

Co-created with Claude Code

@chauhang chauhang marked this pull request as ready for review June 21, 2026 19:57
@chauhang chauhang marked this pull request as draft June 21, 2026 21:20
@chauhang chauhang marked this pull request as ready for review June 22, 2026 00:29
@chauhang chauhang marked this pull request as draft June 22, 2026 20:21
@chauhang chauhang marked this pull request as ready for review June 23, 2026 22:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant