Add GB10 (sm_121 / DGX Spark) support to low-latency-llama by chauhang · Pull Request #11 · HazyResearch/Megakernels

chauhang · 2026-06-21T19:48:38Z

Summary

This PR makes the low-latency Llama megakernel run on GB10 / NVIDIA DGX Spark (sm_121, consumer-class Blackwell):

Ports the kernel to GB10's tighter limits — 99 KB shared memory (5 pages of 16 KB, not H100's 13) and no thread-block-cluster engine;
Builds against a GB10-enabled ThunderKittens (companion PR Enable warpgroup-MMA on sm_121 (GB10): WGMMA shim + SM121 build target ThunderKittens#204);
Adds a torch.compile baseline to generate.py for a fair production comparison.

On Llama-3.2-1B batch-1 decode, the megakernel is verified correct (the repo's diff_test) and faster than PyTorch:

Mode	Tokens/sec	vs eager
`torch` (eager)	64.4	1.0×
`torch.compile` (reduce-overhead + cudagraphs)	72.1	1.12×
`mk` (megakernel)	95.9	1.49×

mk is 1.49× over eager and 1.33× over compiled+cudagraphed PyTorch, saturating GB10's memory bandwidth (see Performance below). All changes are gated under #ifdef KITTENS_SM120, so H100/B200 builds are byte-identical.

What's in this PR (all `#ifdef KITTENS_SM120`)

GB10 forces two adjustments — (a) re-tile the megakernel's shared-memory page layout from 13 → 5 pages, and (b) disable a cluster-only codegen path the chip can't execute — plus build/config plumbing and the benchmark baseline. Everything else is unchanged.

Removed __cluster_dims__ for SM120 (include/megakernel.cuh). A size-1 cluster attribute still makes nvcc emit thread-block-cluster addressing (S2UR SR_CgaCtaId + cluster-scoped mbarriers) for all shared-memory barrier waits — an illegal instruction on sm_121 (consumer Blackwell has no cluster engine). The cluster syncs were already guarded by if (CLUSTER_BLOCKS > 1), so size-1 clusters are unaffected.
5-page layout port (demos/low-latency-llama/matvec_pipeline.cuh). The matvec ops were built for 13 pages (INPUT_PIPELINE_STAGES(3) × STAGE_PAGES(4) + 1 activation), which overran GB10's 5-entry page table (OOB). Fix: INPUT_PIPELINE_STAGES 3 → 1 (4 weight + 1 activation = exactly 5; outputs live in scratch, so OUTPUT_PIPELINE_STAGES stays 3), and release_lid() (which returns the order an op's shared-memory pages are recycled into for the next op) → identity ret_order[5] = {0,1,2,3,4} (the permutation only sets recycle order — actual page readiness is still gated by the page_finished semaphores).
Attention release_lid fix (attention_partial.cu, attention_reduction.cu). These ops have their own release_lid with ret_order[13]; the >5 return values corrupted the next op's 5-entry pid_order, which surfaced as an OOB read in o_proj (the op after attention). Both fixed to identity [5].
include/config.cuh — SM120 overrides: MAX_SHARED_MEMORY = 99 KB, NUM_PAGES == 5 assert, GB10_SM_COUNT = 48. llama.cuh takes the SM120 SM-count branch.
generate.py — adds a mode=compile baseline (torch.compile reduce-overhead + cudagraphs) — the fair production baseline to compare the megakernel against.

Dependency — needs a GB10-enabled ThunderKittens

The megakernel builds against ThunderKittens (TK). Megakernels currently pins TK as a submodule on an older branch (bvm-single-ctrl-pre-new-warps) that predates GB10/SM120 support, so it won't build for GB10 as-is.

The TK-side GB10 support this PR relies on — TK's sm_121 build target plus a warpgroup-MMA shim that lets H100-style kernels run on consumer Blackwell — is a companion PR: HazyResearch/ThunderKittens#204 (HazyResearch/ThunderKittens#204).

To build or test this PR, set THUNDERKITTENS_ROOT to a TK checkout that has #204's changes (e.g. chauhang/ThunderKittens@gb10-wgmma-shim) rather than the pinned submodule. No Megakernels code changes are needed for that — the megakernel is source-compatible with the newer TK and compiles unchanged. The new GPU=GB10 Makefile target handles the build flags.

Validation (real GB10, CUDA 13.0, `sm_121a`, Llama-3.2-1B batch-1)

Correctness is established the way the repo's own harness does — diff_test.py runs the identical instruction list through the Python-VM reference (pyvm) and the megakernel (mk) on byte-identical inputs (seeded; hidden states + K/V cache copied from the reference into the kernel), then compares every per-op intermediate tensor and the final logits — max absolute diff and mean symmetric relative diff (rdiff = 2|a−b| / (|a|+|b|)) — against the bf16 noise band.

diff_test, all 16 layers passes at bf16 tolerance: attn intermediates exact (0.0), k/v cache adiff ~0.03, logits max adiff 0.125 / mean rdiff 0.044 — accumulated bf16 rounding, not port-induced (the 1-stage change alters when weights load, not the arithmetic).
Error attribution: the largest per-op rdiff is silu_out (the MLP matvec — unchanged CUDA-core code), not the attention MMA-shim or the page-layout port — so neither change introduced numerical error.

Performance — measured on real GB10 (batch-1 decode)

Batch-1 decode is memory-bandwidth bound (reads all 2.47 GB of bf16 weights/token), so the right denominator is achievable bandwidth, not the spec: a microbenchmark tops out at 236 GB/s (86% of the 273 GB/s LPDDR5X spec, which is not attainable on GB10). Against that, mk saturates the memory system:

Mode	Tokens/sec	Effective BW	% of achievable (236 GB/s)
`torch` (eager)	64.4	159 GB/s	67%
`torch.compile` (reduce-overhead + cudagraphs)	72.1	178 GB/s	75%
`mk` (megakernel)	95.9	~237 GB/s	~100% (saturated)

Four independent tools corroborate the picture:

Tool	What it measured	Result
benchmark	throughput	`mk` 95.9 tok/s → 1.49× eager, 1.33× compile
microbench	achievable LPDDR5X read BW	236 GB/s (86% of the 273 spec — the spec is unreachable)
derived (kernel duration)	`mk` effective BW	2.47 GB ÷ 10.52 ms = 235 GB/s ≈ 99% of achievable
nsys	`mk` share of GPU time	96.5% — one persistent kernel, 10.52 ms each
ncu	SM (compute) throughput	46% → not compute-bound

The throughput-derived (95.9 tok/s × 2.47 GB ≈ 237 GB/s) and duration-derived (2.47 GB ÷ 10.52 ms = 235 GB/s) numbers bracket the 236 GB/s achievable ceiling. ncu confirms it's bandwidth-bound, not compute-bound, and nsys confirms the fusion: a single persistent kernel does 96.5% of all GPU work, with only µs-scale torch glue (embedding, KV-cache concat, SiLU, elementwise) around it. (Bandwidth is read directly from the microbench and cross-checked by the duration derivation — ncu's DRAM counters and nsys's BW path don't report on GB10's unified memory.)

Build & run

# 1. The only thing you set: a ThunderKittens checkout that has GB10 support (HazyResearch/ThunderKittens#204)
export THUNDERKITTENS_ROOT=/path/to/thunderkittens-with-gb10

# 2. Build the megakernel for GB10 (the GPU=GB10 target sets the arch flags for you)
export MEGAKERNELS_ROOT=$(pwd)
cd demos/low-latency-llama && make GPU=GB10 PYTHON_VERSION=3.12          # -> mk_llama*.so

# 3. Correctness (all 16 layers) + benchmark vs PyTorch
python megakernels/scripts/diff_test.py layer_limit=16
python megakernels/scripts/generate.py mode=mk      prompt="The capital of France is" ntok=128
python megakernels/scripts/generate.py mode=torch   prompt="The capital of France is" ntok=128
python megakernels/scripts/generate.py mode=compile prompt="The capital of France is" ntok=128

Limitations / follow-ups

1B latency demo only. The compiled megakernel bakes in 1B dims. The throughput config is not changed.

Co-created with Claude Code

Add GB10 (sm_121 / DGX Spark) support to low-latency-llama

6ec8abe

chauhang mentioned this pull request Jun 21, 2026

Add SM120/SM121 (GB10 / DGX Spark) support: WGMMA shim + kernel ports HazyResearch/ThunderKittens#203

Closed

chauhang marked this pull request as ready for review June 21, 2026 19:57

chauhang marked this pull request as draft June 21, 2026 21:20

chauhang marked this pull request as ready for review June 22, 2026 00:29

chauhang marked this pull request as draft June 22, 2026 20:21

chauhang mentioned this pull request Jun 23, 2026

Enable warpgroup-MMA on sm_121 (GB10): WGMMA shim + SM121 build target HazyResearch/ThunderKittens#204

Open

chauhang marked this pull request as ready for review June 23, 2026 22:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add GB10 (sm_121 / DGX Spark) support to low-latency-llama#11

Add GB10 (sm_121 / DGX Spark) support to low-latency-llama#11
chauhang wants to merge 1 commit into
HazyResearch:mainfrom
chauhang:gb10-sm121-support

chauhang commented Jun 21, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

chauhang commented Jun 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in this PR (all #ifdef KITTENS_SM120)

Dependency — needs a GB10-enabled ThunderKittens

Validation (real GB10, CUDA 13.0, sm_121a, Llama-3.2-1B batch-1)

Performance — measured on real GB10 (batch-1 decode)

Build & run

Limitations / follow-ups

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

chauhang commented Jun 21, 2026 •

edited

Loading

What's in this PR (all `#ifdef KITTENS_SM120`)

Validation (real GB10, CUDA 13.0, `sm_121a`, Llama-3.2-1B batch-1)