fix blockwise FP8 scaled_mm scale layout in Float8BlockwiseLinear#4229

Open
iamzainhuda wants to merge 1 commit into main from func-scaled-mm
Conversation

@iamzainhuda
Contributor

@iamzainhuda iamzainhuda commented Apr 2, 2026

Summary

  • replace the non-Triton blockwise FP8 matmul path with functional scaled_mm semantics for forward, grad_x, and grad_weight
  • fix the BlockWise128x128 RHS scale layout by padding the K-block dimension to a multiple of 4, as required by cuBLASLt
  • fix the BlockWise1x128 RHS scale orientation in grad_weight by transposing the activation scales before the matmul
  • use aten._scaled_mm_v2 only under torch.compile(fullgraph=True) so the compiler can trace the op without graph-breaking on the Python F.scaled_mm wrapper

What was going wrong?

The issue was not just that we were calling torch._scaled_mm directly: this path in our blockwise FP8 linear code also did not match the cuBLASLt scale-layout contract for blockwise scaling.
In particular:

  • grad_x = grad_output @ weight uses RHS BlockWise128x128 scales
  • cuBLASLt requires those scales to be K-major, with the K-block dimension padded to a multiple of 4
  • our code was passing the unpadded layout, which caused incorrect behavior in the scaled-mm backend
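The padding requirement above can be sketched as shape arithmetic. A minimal illustration, not the PR's actual code: the helper name and the exact (n_blocks, padded_k_blocks) ordering are assumptions, but the pad-to-multiple-of-4 step is the fix described here.

```python
import math

def blockwise_128x128_rhs_scale_shape(k: int, n: int) -> tuple:
    """Hypothetical sketch of the K-major BlockWise128x128 RHS scale shape,
    with the K-block dimension padded to a multiple of 4 as cuBLASLt requires."""
    k_blocks = math.ceil(k / 128)                  # one scale per 128-wide tile along K
    n_blocks = math.ceil(n / 128)                  # ... and per 128-wide tile along N
    padded_k_blocks = math.ceil(k_blocks / 4) * 4  # the fix: pad K-blocks to a multiple of 4
    return (n_blocks, padded_k_blocks)             # K-major: the K-block dim varies fastest

# e.g. a K=256, N=384 operand: 2 K-blocks pad up to 4, giving a (3, 4) scale grid
```

Passing the unpadded (n_blocks, k_blocks) grid is exactly the layout mismatch described above.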

There was a second layout bug in grad_weight = grad_output^T @ x:

  • the RHS uses BlockWise1x128 scaling
  • the quantized activation scales had the right values but the wrong orientation for the RHS scaled-mm call
  • transposing those scales fixes the contract mismatch
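Shape-wise, the orientation fix amounts to a plain transpose of the scale grid. A minimal illustration with nested lists (the real code would transpose a tensor; the (M, ceil(K/128)) shape is an assumption based on the 1x128 block size):

```python
def transpose_scales(scales):
    """Swap the orientation of a 2D scale grid. For BlockWise1x128 activation
    scales of shape (M, ceil(K/128)), this yields (ceil(K/128), M) -- the
    orientation assumed here for the RHS of the grad_weight scaled-mm call."""
    return [list(col) for col in zip(*scales)]

# a 3x2 grid of scales becomes 2x3; every scale value is preserved
grid = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```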

Also to note:

  • eager mode uses torch.nn.functional.scaled_mm
  • compile mode calls aten._scaled_mm_v2 directly because, in this torch build, Dynamo fullgraph cannot trace through the Python F.scaled_mm wrapper even though it lowers to the same underlying op
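The eager/compile split above could be sketched as a tiny dispatch shim. The function and the explicit flag are hypothetical; the real code would presumably branch on something like torch.compiler.is_compiling() rather than a caller-supplied boolean.

```python
def pick_scaled_mm_op(compiling: bool) -> str:
    """Hypothetical dispatch sketch: under torch.compile(fullgraph=True), call
    aten._scaled_mm_v2 directly so Dynamo can trace it; in eager mode, use the
    Python torch.nn.functional.scaled_mm wrapper, which lowers to the same op."""
    if compiling:
        return "aten._scaled_mm_v2"          # traceable directly; avoids a graph break
    return "torch.nn.functional.scaled_mm"   # eager-friendly Python wrapper
```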

Testing

 pytest -q test/prototype/blockwise_fp8_training/test_blockwise_linear.py

cc @slayton58 @drisspg as part of #4209

@pytorch-bot

pytorch-bot bot commented Apr 2, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4229

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 551f3c7 with merge base b1ddd15:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 2, 2026
@iamzainhuda iamzainhuda marked this pull request as draft April 2, 2026 22:45
@iamzainhuda iamzainhuda added the module: training quantize_ api training flow label Apr 8, 2026
@iamzainhuda iamzainhuda changed the title [draft] use F.scaled_mm instead of torch._scaled_mm for blockwise linear GEMM fix blockwise FP8 scaled_mm scale layout in Float8BlockwiseLinear Apr 8, 2026
@iamzainhuda iamzainhuda marked this pull request as ready for review April 8, 2026 13:21
