
[nvfp4_training] Add Triton kernel for global amax of columnwise RHT (SM90+)#4247

Open
rdspring1 wants to merge 1 commit into pytorch:main from rdspring1:triton_rht_amax

Conversation

@rdspring1 rdspring1 commented Apr 7, 2026

Summary

  • Adds triton_rht_amax (hadamard_amax_triton.py): a persistent, warp-specialized Triton kernel that applies the Randomized Hadamard Transform (RHT) to the input and reduces to a scalar global absolute maximum, without materializing the full post-RHT tensor
  • This is a prerequisite building block for triton_rht_quantize_row_col: the global amax determines the per-tensor decode scale (global_amax / (FP8_E4M3_MAX × FP4_E2M1_MAX)) used in two-level NVFP4 quantization
  • Adds _compute_pid and related RHT matrix helpers to hadamard_utils.py
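The two-level scale computation the amax feeds into can be sketched in plain Python. The constants are the standard format maxima (fp8 e4m3 max = 448, fp4 e2m1 max = 6); the helper name is illustrative, not the PR's actual API:

```python
# Sketch of the per-tensor decode-scale computation described above.
# FP8_E4M3_MAX and FP4_E2M1_MAX are the standard format maxima; the
# function name is a hypothetical stand-in for the PR's internals.
FP8_E4M3_MAX = 448.0  # largest representable float8 e4m3 value
FP4_E2M1_MAX = 6.0    # largest representable float4 e2m1 value

def per_tensor_decode_scale(global_amax: float) -> float:
    """Per-tensor scale so (blockwise fp8 scale) x (fp4 value) spans amax."""
    return global_amax / (FP8_E4M3_MAX * FP4_E2M1_MAX)

# Example: a tensor whose post-RHT absolute max is 2688.0 gets scale 1.0.
assert per_tensor_decode_scale(2688.0) == 1.0
```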

Key design choices

  • Persistent grid: kernel launches with NUM_SMS CTAs and each CTA iterates over all tiles, amortizing launch overhead
  • Warp specialization: producer warps issue TMA loads; consumer warps run wgmma for the RHT matrix multiply — matches SM90+ warp-specialized pipeline
  • No output buffer: per-CTA cumulative max is reduced with one atomic_max per CTA, avoiding a full (N, M) intermediate allocation
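The scheduling described above can be emulated on the host to see why no (N, M) intermediate is needed: a fixed number of program instances stride over all tiles, each keeps a running max, and each contributes exactly one atomic-max-style update at the end. This is illustrative Python, not the Triton kernel itself:

```python
# Host-side emulation of the persistent-grid + one-atomic-max-per-CTA
# pattern. num_sms plays the role of NUM_SMS; each "CTA" iterates tiles
# with stride num_sms, so every tile is visited exactly once.
def persistent_amax(tiles: list[list[float]], num_sms: int = 4) -> float:
    global_max = float("-inf")          # stands in for the fp32 scratch scalar
    for pid in range(num_sms):          # one loop body per CTA
        cta_max = float("-inf")
        for tile_id in range(pid, len(tiles), num_sms):  # persistent loop
            cta_max = max(cta_max, max(abs(v) for v in tiles[tile_id]))
        global_max = max(global_max, cta_max)  # one "atomic_max" per CTA
    return global_max

tiles = [[1.0, -3.5], [2.0, 0.5], [-7.25, 4.0], [0.1, 0.2], [6.0, -1.0]]
assert persistent_amax(tiles) == 7.25
```

The result is independent of the grid size, which is what lets the launch use NUM_SMS CTAs regardless of the tile count.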

Test plan

  • pytest test/prototype/mx_formats/test_hadamard_amax_triton.py -v
  • Covered transitively by test_hadamard_quantize_row_col_triton.py — incorrect amax would break scale computation and fail SQNR/bitwise tests


pytorch-bot bot commented Apr 7, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4247

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 7, 2026
@rdspring1 rdspring1 changed the title [nvfp4_training] Add Triton kernel for fused RHT + global amax (SM90+) [nvfp4_training] Add Triton kernel for global amax of columnwise RHT (SM90+) Apr 7, 2026
@rdspring1 rdspring1 marked this pull request as ready for review April 7, 2026 05:22


def get_wgrad_sign_vector(device) -> torch.Tensor:
"""Hard-coded random signs for Hadamard transform."""
Wait, why is this hard-coded vs. generated?

Author:
Converted get_wgrad_sign_vector to generate random sign vector.

[1, 1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, 1, 1, -1, -1],
[1, -1, -1, 1, -1, 1, 1, -1, -1, 1, 1, -1, 1, -1, -1, 1],
],
dtype=torch.float32,

Is there a reason this needs to be 32b? Won't we save BW later if we store this natively in something lower (4b for instance), along with avoiding casts (See here)

Author:

Added a dtype argument with a torch.bfloat16 default.



@functools.lru_cache(maxsize=None)
def get_rht_matrix(with_random_sign_mask: bool, device) -> torch.Tensor:
Should have an argument for the hadamard dimension (kwarg-only perhaps, default=16?). Should probably also have an optional dtype argument (see notes above).

Contributor:

We could add the kwarg with default 16 and for now just assert == 16, wdyt? Only support 16 for now, build out the prototype quickly, and support generating other RHT matrix sizes later?


Yep, that's perfectly fine - I'm concerned with getting the API as correct as possible here

@rdspring1 (Author) commented Apr 8, 2026:

Changed get_rht_matrix to:

def get_rht_matrix(
    sign_vector: tuple[int, ...] | None,
    device,
    dtype: torch.dtype = torch.bfloat16,
    hadamard_dimension: int = 16,
) -> torch.Tensor:
  • If sign_vector is None, it calls get_wgrad_sign_vector. Otherwise the sign_vector tuple is converted to a torch.Tensor. It is a tuple so that it is hashable for lru_cache.

tl.atomic_max(global_max_ptr, tile_max.to(tl.float32))


def triton_rht_amax(

No option for non-global amax domain

Author:

Added scaling_type: F.ScalingType = F.ScalingType.TensorWise. The function throws a ValueError if it is anything other than F.ScalingType.TensorWise.
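A sketch of that guard, using a stand-in enum (the real F.ScalingType lives in the PR's module; names here are illustrative):

```python
from enum import Enum, auto

class ScalingType(Enum):        # hypothetical stand-in for F.ScalingType
    TensorWise = auto()
    RowWise = auto()

def triton_rht_amax_stub(scaling_type: ScalingType = ScalingType.TensorWise):
    # Only the global (tensor-wise) amax domain is supported for now.
    if scaling_type is not ScalingType.TensorWise:
        raise ValueError(f"unsupported scaling_type: {scaling_type}")
    return "ok"

assert triton_rht_amax_stub() == "ok"
try:
    triton_rht_amax_stub(ScalingType.RowWise)
except ValueError:
    pass
```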

num_warps=cfg.NUM_WARPS,
)

best = get_best_config(cache_key, HADAMARD_CONFIGS, benchmark_fn)

Is there a reason you're re-implementing autotune?

Author:

Removed.


# Reference: same deterministic matrix (lru_cached, hard-coded sign vector)
B = get_rht_matrix(with_random_sign_mask=True, device="cuda")
ref_amax = (A.t().reshape(N * M // 16, 16) @ B).to(torch.bfloat16).abs().max().float()

This is (as written) doing A (bf16) @ B (fp32) - intended?

Author:

get_rht_matrix ends with return rht_matrix.to(dtype=torch.bfloat16), so B is always bf16.
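The reference computation quoted above (reshape the transposed input into blocks of the Hadamard dimension, multiply by the RHT matrix, take the absolute max) can be sketched in plain Python. This uses an unnormalized Sylvester Hadamard matrix of dimension 4 for brevity (the PR uses dimension 16 with a sign mask folded in), and the helper names are illustrative:

```python
def sylvester_hadamard(n: int) -> list[list[int]]:
    """Unnormalized Hadamard matrix of size n (n a power of two)."""
    h = [[1]]
    while len(h) < n:
        # H_{2n} = [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def rht_amax_reference(a_t_flat: list[float], had_dim: int = 4) -> float:
    """abs-max of (blocks of had_dim) @ H without keeping the product."""
    H = sylvester_hadamard(had_dim)
    amax = 0.0
    for i in range(0, len(a_t_flat), had_dim):
        block = a_t_flat[i:i + had_dim]
        for col in range(had_dim):
            v = sum(block[k] * H[k][col] for k in range(had_dim))
            amax = max(amax, abs(v))
    return amax

# [1,1,1,1] @ H_4 = [4,0,0,0], so the amax is 4.
assert rht_amax_reference([1.0, 1.0, 1.0, 1.0]) == 4.0
```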

signs = get_wgrad_sign_vector(device=device)
else:
signs = torch.ones(1, dtype=torch.float32, device=device)
sign_matrix = signs * torch.eye(

We should also be able to specify different sign vectors if desired.

Author:

Added sign_vector argument.
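The signs * torch.eye(...) expression quoted above builds diag(signs), and left-multiplying the Hadamard matrix by diag(signs) flips whole rows. A plain-Python sketch of that equivalence (helper name is illustrative):

```python
def apply_sign_mask(signs: list[int], H: list[list[int]]) -> list[list[int]]:
    """diag(signs) @ H: row i of H scaled by signs[i]."""
    n = len(H)
    return [[signs[i] * H[i][j] for j in range(n)] for i in range(n)]

H = [[1, 1], [1, -1]]
# Flipping the sign of row 1 only:
assert apply_sign_mask([1, -1], H) == [[1, 1], [-1, 1]]
```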

@danielvegamyhre (Contributor) left a comment:

When adding new kernels, could you add microbenchmarks in ao/benchmarks/prototype? Feel free to start an nvfp4_training directory in there.

return times[len(times) // 2]


def get_best_config(
Contributor:

why is this necessary? triton autotuner automatically caches the best config for the given keys.

Author:

Removed. I replaced the host-side TMA descriptors with in-kernel ones so the kernel is compatible with the @triton.autotune decorator.

_autotune_cache: dict[tuple, KernelConfig] = {}


def do_bench(fn: Callable, warmup_iters: int = 3, bench_iters: int = 10) -> float:
Contributor:

can you use our existing benchmark util for this instead of defining a new one:

def benchmark_cuda_function_in_microseconds(f, *args, **kwargs):

Author:

Added a microbenchmark for the Triton kernel.
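The median-of-timings pattern from the removed do_bench helper (warmup, then return times[len(times) // 2]) can be sketched with wall-clock timing. This is a CPU stand-in for illustration only; real kernel benchmarks should use CUDA-event timing such as benchmark_cuda_function_in_microseconds:

```python
import time

def bench_median_us(fn, warmup_iters: int = 3, bench_iters: int = 10) -> float:
    """Median wall-clock time of fn() in microseconds.

    CPU stand-in for the CUDA-event timing a real kernel benchmark uses;
    the warmup/median structure mirrors the removed do_bench helper.
    """
    for _ in range(warmup_iters):
        fn()                               # warm caches / JIT before timing
    times = []
    for _ in range(bench_iters):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1e6)
    times.sort()
    return times[len(times) // 2]          # median is robust to outliers

elapsed = bench_median_us(lambda: sum(range(1000)))
assert elapsed > 0.0
```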

@danielvegamyhre danielvegamyhre added module: training quantize_ api training flow nvfp4 labels Apr 7, 2026