kernel-engineering

A repo for learning kernel engineering and GPU programming.

Setup

make setup

Notebook	Description
Control Divergence	Explores warp divergence in GPU kernels: what happens when threads within a warp take different branches, how it serializes execution, and benchmarks the performance cost.
TF32 Precision & Performance	Demonstrates TensorFloat32 (TF32) on Ampere and newer GPUs. Compares matmul precision across TF32, FP32, FP16, and FP64, shows that TF32 has FP16's precision (10 mantissa bits) but FP32's range (8 exponent bits), and benchmarks the throughput speedup.
GPU Memory Hierarchy & Data Movement	Walks through the GPU memory hierarchy of registers (RMEM), shared memory (SMEM), and global memory (GMEM) using CuTe DSL. Covers scalar and vectorized copies between GMEM and RMEM, GMEM-to-SMEM copies via `cp.async`, commit groups, copy atoms, and PTX analysis of each path.
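
The "FP16's precision but FP32's range" claim in the TF32 row can be checked on the CPU without a GPU. The following is a minimal sketch, not code from the notebooks: it emulates TF32 by rounding an FP32 bit pattern down to 10 mantissa bits. The helper names (`round_to_tf32`, `round_to_fp16`, `fits_fp16`) are hypothetical, and round-to-nearest-even is an assumption here; the rounding actually applied on a GPU may differ.

```python
import math
import struct


def round_to_tf32(x: float) -> float:
    """Emulate TF32 by keeping 10 of FP32's 23 mantissa bits.

    Assumption: round-to-nearest-even. Real hardware may round differently.
    """
    if not math.isfinite(x):
        return x
    # Reinterpret the value as a 32-bit IEEE 754 single-precision pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - 10  # number of low mantissa bits TF32 does not keep
    lsb = (bits >> drop) & 1  # lowest kept bit, used to break ties to even
    bits += (1 << (drop - 1)) - 1 + lsb
    bits &= ~((1 << drop) - 1) & 0xFFFFFFFF  # clear the dropped bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def round_to_fp16(x: float) -> float:
    """Round to IEEE 754 half precision using struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fits_fp16(x: float) -> bool:
    """Return True if x is within FP16's finite range."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


if __name__ == "__main__":
    # Same precision: both formats keep 10 mantissa bits near 0.1.
    print(round_to_tf32(0.1), round_to_fp16(0.1))
    # Different range: 1e38 survives in TF32 but overflows FP16.
    print(round_to_tf32(1e38), fits_fp16(1e38))
```

For values in FP16's normal range the two roundings agree, while large magnitudes such as `1e38` remain finite only in the TF32 emulation, because TF32 keeps FP32's 8 exponent bits.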
