
Commit facc675

Mark Saroufim and yf225 authored
Add Helion Kernel Challenge competition (#105)
* Add Helion Kernel Challenge competition with 9 problems

  New competition inspired by Helion kernel ideas covering attention mechanisms, sampling strategies, quantization, and sequence modeling operators from production LLM architectures.

  Problems:
  - GQA: Causal Grouped Query Attention (Llama 3 style)
  - MLA: Multi-Head Latent Attention decode (DeepSeek-V2/V3)
  - KDA: Kimi Delta Attention (linear attention + delta rule)
  - Causal Conv1d: Causal depthwise 1D convolution (Mamba)
  - FP8 Quant: Per-token-group FP8 E4M3 quantization
  - INT8 Quant: Per-token INT8 symmetric quantization
  - Min-P: Adaptive probability threshold sampling
  - Top-K: Top-k sampling via binary search
  - Top-P: Nucleus sampling via binary search

  Deadline: March 14, 2026 midnight PST

* remove unused kernels

* add gated deltanet kernels

* clean up mentions of old kernels

* Fix eval.py test case parser to support underscored keys and booleans

  The regex only matched [a-zA-Z]+ for keys, which broke parameters like group_size, hidden_dim, num_tokens, and use_initial_state. Also adds true/false boolean value parsing.

---------

Co-authored-by: Will Feng <yfeng.us@gmail.com>
1 parent 8d59691 commit facc675
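The eval.py parser fix described in the last commit-message bullet can be sketched as follows. The exact test-case string format and the function name `parse_case` are assumptions for illustration; the relevant change is matching `\w+` for keys (so underscored names like `group_size` parse) and recognizing `true`/`false` values:

```python
import re

def parse_case(spec: str) -> dict:
    """Parse a 'key: value; key2: value2' test-case string (hypothetical format)."""
    out = {}
    # \w+ also matches underscores, so keys like group_size or use_initial_state work;
    # the old [a-zA-Z]+ pattern silently dropped them
    for key, value in re.findall(r"(\w+):\s*([^;]+)", spec):
        value = value.strip()
        if value.lower() in ("true", "false"):
            # boolean parsing added by the fix
            out[key] = value.lower() == "true"
        else:
            try:
                out[key] = int(value)
            except ValueError:
                out[key] = value
    return out
```

Keys parse in order, integers and booleans are converted, and anything else stays a string.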

24 files changed

Lines changed: 1390 additions & 0 deletions

File tree

problems/helion.yaml

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
name: Helion Kernel Challenge
deadline: "2026-03-14"
description: "GPU kernel challenges inspired by Helion kernel ideas — convolution, quantization, and gated deltanet operators from production LLM architectures."
problems:
  - directory: helion/causal_conv1d_py
    name: causal_conv1d
    deadline: "2026-03-14 00:00"
    gpus:
      - NVIDIA
  - directory: helion/fp8_quant_py
    name: fp8_quant
    deadline: "2026-03-14 00:00"
    gpus:
      - NVIDIA
  - directory: helion/gated_deltanet_chunk_fwd_h_py
    name: gated_deltanet_chunk_fwd_h
    deadline: "2026-03-14 00:00"
    gpus:
      - NVIDIA
  - directory: helion/gated_deltanet_chunk_fwd_o_py
    name: gated_deltanet_chunk_fwd_o
    deadline: "2026-03-14 00:00"
    gpus:
      - NVIDIA
  - directory: helion/gated_deltanet_recompute_w_u_py
    name: gated_deltanet_recompute_w_u
    deadline: "2026-03-14 00:00"
    gpus:
      - NVIDIA
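Once a config like the one above is parsed (e.g. with PyYAML), each problem entry can be sanity-checked for the required keys and deadline format. A minimal sketch; `validate_problem` is a hypothetical helper, not part of the repo:

```python
from datetime import datetime

def validate_problem(entry: dict) -> None:
    # each problem entry needs a directory, name, deadline, and gpu list
    for key in ("directory", "name", "deadline", "gpus"):
        assert key in entry, f"missing key: {key}"
    # per-problem deadlines use a 'YYYY-MM-DD HH:MM' format in this config
    datetime.strptime(entry["deadline"], "%Y-%m-%d %H:%M")
    assert entry["gpus"], "at least one GPU vendor required"

# example entry mirroring the config above
validate_problem({
    "directory": "helion/causal_conv1d_py",
    "name": "causal_conv1d",
    "deadline": "2026-03-14 00:00",
    "gpus": ["NVIDIA"],
})
```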
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
import torch
import torch.nn.functional as F
from task import input_t, output_t
from utils import make_match_reference, DeterministicContext


def generate_input(B: int, D: int, S: int, W: int, seed: int) -> input_t:
    gen = torch.Generator(device="cuda")
    gen.manual_seed(seed)
    x = torch.randn(B, D, S, dtype=torch.float32, device="cuda", generator=gen).contiguous()
    weight = torch.randn(D, W, dtype=torch.float32, device="cuda", generator=gen).contiguous()
    bias = torch.randn(D, dtype=torch.float32, device="cuda", generator=gen).contiguous()
    return x, weight, bias


def ref_kernel(data: input_t) -> output_t:
    with DeterministicContext():
        x, weight, bias = data
        B, D, S = x.shape
        W = weight.shape[1]

        # Causal (left) padding
        x_padded = F.pad(x, (W - 1, 0))

        # Depthwise conv1d (groups=D)
        output = F.conv1d(
            x_padded,
            weight.unsqueeze(1),  # [D, 1, W]
            bias=bias,
            groups=D,
        )
        return output


check_implementation = make_match_reference(ref_kernel, rtol=1e-4, atol=1e-4)
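The pad-then-conv reference above is equivalent to the direct per-element definition of a causal depthwise convolution. A pure-Python sketch of that definition (nested lists instead of CUDA tensors, so it runs anywhere; `causal_conv1d_naive` is illustrative only, not part of the repo):

```python
def causal_conv1d_naive(x, weight, bias):
    """x: [B][D][S] nested lists, weight: [D][W], bias: [D].

    Computes out[b][d][t] = bias[d] + sum_k weight[d][k] * x[b][d][t - W + 1 + k],
    with out-of-range inputs treated as zero (causal left padding).
    """
    B, D, S = len(x), len(x[0]), len(x[0][0])
    W = len(weight[0])
    out = [[[0.0] * S for _ in range(D)] for _ in range(B)]
    for b in range(B):
        for d in range(D):  # each channel convolved independently (depthwise)
            for t in range(S):
                acc = bias[d]
                for k in range(W):
                    src = t - W + 1 + k
                    if src >= 0:  # positions before t=0 contribute zero
                        acc += weight[d][k] * x[b][d][src]
                out[b][d][t] = acc
    return out
```

For example, with a single channel, x = [1, 2, 3], weight = [1, 1], and zero bias, the output is [1, 3, 5]: each position sums itself and its left neighbor.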
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
from task import input_t, output_t


def custom_kernel(data: input_t) -> output_t:
    import torch
    import torch.nn.functional as F

    x, weight, bias = data
    W = weight.shape[1]
    D = x.shape[1]

    x_padded = F.pad(x, (W - 1, 0))
    output = F.conv1d(x_padded, weight.unsqueeze(1), bias=bias, groups=D)
    return output
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
from typing import TypedDict, TypeVar
import torch

input_t = TypeVar("input_t", bound=tuple[torch.Tensor, torch.Tensor, torch.Tensor])
output_t = TypeVar("output_t", bound=torch.Tensor)


class TestSpec(TypedDict):
    B: int
    D: int
    S: int
    W: int
    seed: int
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
files:
  - {"name": "submission.py", "source": "@SUBMISSION@"}
  - {"name": "task.py", "source": "task.py"}
  - {"name": "utils.py", "source": "../utils.py"}
  - {"name": "reference.py", "source": "reference.py"}
  - {"name": "eval.py", "source": "../eval.py"}

lang: "py"

description: |
  Implement a causal depthwise 1D convolution kernel.

  This is a core component of Mamba/Mamba-2 architectures. Each channel is
  convolved independently (depthwise) with causal (left) zero-padding so that
  output[t] depends only on input[t-W+1:t+1].

  For each batch b, channel d, and time t:
    out[b, d, t] = bias[d] + sum_{k=0}^{W-1} weight[d, k] * x[b, d, t - W + 1 + k]
  where out-of-bounds values are treated as zero.

  Input: tuple(x, weight, bias) where:
  - x: torch.Tensor of shape [B, D, S] (float32)
  - weight: torch.Tensor of shape [D, W] (float32)
  - bias: torch.Tensor of shape [D] (float32)

  Output: torch.Tensor of shape [B, D, S] (float32)

config:
  main: "eval.py"

templates:
  Python: "../template.py"

tests:
  - {"B": 1, "D": 64, "S": 64, "W": 4, "seed": 4242}
  - {"B": 2, "D": 128, "S": 128, "W": 4, "seed": 5236}
  - {"B": 1, "D": 256, "S": 256, "W": 3, "seed": 1001}
  - {"B": 1, "D": 128, "S": 64, "W": 8, "seed": 5531}
  - {"B": 4, "D": 64, "S": 128, "W": 4, "seed": 9173}

benchmarks:
  - {"B": 1, "D": 768, "S": 512, "W": 4, "seed": 31232}
  - {"B": 1, "D": 768, "S": 2048, "W": 4, "seed": 4052}
  - {"B": 1, "D": 1536, "S": 2048, "W": 4, "seed": 2146}
  - {"B": 1, "D": 2560, "S": 2048, "W": 4, "seed": 3129}
  - {"B": 1, "D": 2560, "S": 4096, "W": 4, "seed": 54352}

test_timeout: 180
benchmark_timeout: 180
ranked_timeout: 420
ranking_by: "geom"
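The `ranking_by: "geom"` setting presumably ranks submissions by the geometric mean across the benchmark cases, which keeps one outlier benchmark from dominating the score. A minimal sketch of that aggregation; `geom_mean` is a hypothetical helper, not the competition's actual scoring code:

```python
import math

def geom_mean(times):
    # geometric mean of per-benchmark timings: exp of the mean of the logs
    assert times and all(t > 0 for t in times), "timings must be positive"
    return math.exp(sum(math.log(t) for t in times) / len(times))
```

Unlike an arithmetic mean, halving the time on any one benchmark improves this score by the same factor regardless of that benchmark's absolute scale.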

0 commit comments