
Adding V-only Hadamard support to low precision attention API #4249

Open

howardzhang-cv wants to merge 1 commit into gh/howardzhang-cv/40/base from gh/howardzhang-cv/40/head

Conversation

Contributor

@howardzhang-cv howardzhang-cv commented Apr 7, 2026

Stack from ghstack (oldest at bottom):

Summary

  • Added V-only Hadamard support to the low precision attention API, passed through from apply_low_precision_attention to select the Hadamard-fused kernel
  • Added new Triton functions to the kernel files (triton_hadamard_qkv_quantization.py and triton_hadamard_rope_qkv_quantization.py) for fused V-only Hadamard and QKV quantization (with RoPE fusion as well)
  • Added a V-only Hadamard option to the benchmarks
  • Added additional V-only Hadamard tests
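To illustrate the idea behind the fused kernel, here is a minimal pure-PyTorch sketch (not the PR's Triton implementation; function names and the per-tensor int8 scheme are assumptions for illustration) of rotating only V with a normalized Hadamard transform before quantization. The rotation is orthonormal, so the attention output can be un-rotated afterwards, while the rotated values quantize with less outlier-driven error:

```python
import torch

def hadamard_matrix(n: int) -> torch.Tensor:
    # Sylvester construction: n must be a power of two.
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H

def hadamard_v_then_quantize(v: torch.Tensor, num_bits: int = 8):
    # Hypothetical helper: rotate V along the head dimension with an
    # orthonormal Hadamard transform, then symmetric per-tensor quantize.
    # The rotation spreads outliers across channels, tightening the range.
    d = v.shape[-1]
    H = hadamard_matrix(d) / d ** 0.5          # orthonormal rotation
    v_rot = v @ H
    qmax = 2 ** (num_bits - 1) - 1
    scale = v_rot.abs().amax() / qmax
    v_q = torch.clamp(torch.round(v_rot / scale), -qmax - 1, qmax).to(torch.int8)
    return v_q, scale, H

# Since H is orthonormal, the attention output (P @ V_rot) can be
# un-rotated by multiplying with H.T after the matmul.
```

In the actual kernels this rotation and the quantization are fused into one Triton pass (optionally together with RoPE), avoiding an extra round-trip through memory.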

Results

Single Attention Layer

[image: single attention layer benchmark] Results show the speedup for each option (1.29x for QKV Hadamard, 1.32x for V-only, 1.38x for no Hadamard). Notably, the mid-sequence-length cases sped up much more with V-only (1.02x for QKV vs. 1.15x for V-only).

LLaMA3 Prefill

[image: LLaMA3 prefill results] Perplexity goes from 7.54 to 7.57 with QKV Hadamard; with V-only Hadamard it goes from 7.54 to 7.61, very close to the 7.62 without Hadamard. The speedup improves from 1.15x with QKV Hadamard to 1.19x with V-only. V-only Hadamard seems to help less for LLM workloads.

Flux.1-schnell 2048x2048

[image: Flux.1-schnell 2048x2048 results] LPIPS is about the same as with QKV Hadamard, and much better than the no-Hadamard option (0.44 LPIPS). Speedup is slightly better too, from 0.96x with QKV to 1.0x with V-only (speedups scale with sequence length, so larger image sizes or video generation will see better numbers). V-only Hadamard is therefore the better option for image and video generation.

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Apr 7, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4249

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f5bf649 with merge base 2a8fa55:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 7, 2026
howardzhang-cv added a commit that referenced this pull request Apr 7, 2026
@howardzhang-cv howardzhang-cv added the module: inference quantize_ api inference flow label Apr 7, 2026
Contributor

@jerryzh168 jerryzh168 left a comment


LGTM
