Skip to content

Fix GroupedLinear FP8 calibration loop#3101

Open
fallintoplace wants to merge 2 commits into
NVIDIA:mainfrom
fallintoplace:fix/grouped-linear-calibration
Open

Fix GroupedLinear FP8 calibration loop#3101
fallintoplace wants to merge 2 commits into
NVIDIA:mainfrom
fallintoplace:fix/grouped-linear-calibration

Conversation

@fallintoplace

Copy link
Copy Markdown

Summary

  • remove the unused outer calibration loop in GroupedLinear
  • calibrate each input and weight quantizer once per GEMM
  • avoid repeating calibration num_gemms times for every GEMM

Validation

  • ran python3 -m py_compile transformer_engine/pytorch/module/grouped_linear.py
  • ran git diff --check -- transformer_engine/pytorch/module/grouped_linear.py

@github-actions github-actions Bot added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label Jun 6, 2026
@fallintoplace fallintoplace marked this pull request as ready for review June 6, 2026 14:40
@fallintoplace fallintoplace requested a review from ksivaman as a code owner June 6, 2026 14:40
Signed-off-by: Minh Vu <vuhoangminh97@gmail.com>
@fallintoplace fallintoplace force-pushed the fix/grouped-linear-calibration branch from 506ef2d to 08c86d7 Compare June 6, 2026 14:41
@greptile-apps

greptile-apps Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a bug in _GroupedLinear.forward where FP8 calibration was redundantly applied num_gemms × num_gemms times instead of once per GEMM. The original code had an outer for i in range(num_gemms) loop whose variable was immediately shadowed by two identical inner loops, making the outer loop a no-op wrapper that caused each calibration to repeat num_gemms times for every GEMM.

  • Root cause removed: the outer for i loop (with its shadowed variable) and the two separate inner loops are collapsed into a single correct loop that calls input_quantizers[i].calibrate(inputmats[i]) and weight_quantizers[i].calibrate(weights[i]) exactly once each.
  • Scope: change is confined to the non-grouped-tensor forward path under fp8_calibration=True; the grouped-tensor path is unaffected.

Confidence Score: 5/5

Safe to merge — the change is a minimal, well-scoped removal of a redundant nested loop in the FP8 calibration path with no behavioural side effects beyond fixing the over-calibration.

The diff is six lines touching a single code path that only runs during FP8 calibration. The original loop structure was unambiguously buggy (outer loop variable shadowed by two inner loops), and the replacement correctly calls each quantizer exactly once. The surrounding GEMM code, the grad bookkeeping, and every other code path are untouched.

No files require special attention.

Important Files Changed

Filename Overview
transformer_engine/pytorch/module/grouped_linear.py Removes redundant nested calibration loops; single loop now calibrates each input and weight quantizer exactly once per GEMM — straightforward and correct fix.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[forward called with fp8_calibration=True] --> B[Build inputmats via torch.split]
    B --> C[Build weights_fp8]
    C --> D[general_grouped_gemm]
    D --> E{fp8_calibration?}
    E -- Yes --> F["for i in range(num_gemms)"]
    F --> G["input_quantizers[i].calibrate(inputmats[i])"]
    G --> H["weight_quantizers[i].calibrate(weights[i])"]
    H --> F
    F -- done --> I[cpu_offloading / grad bookkeeping]
    E -- No --> I
Loading

Reviews (2): Last reviewed commit: "Merge branch 'main' into fix/grouped-lin..." | Re-trigger Greptile

@vthumbe1503

Copy link
Copy Markdown
Collaborator

/te-ci pytorch

@vthumbe1503 vthumbe1503 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution PRs from external contributor outside the core maintainers, representing community-driven work.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants