---
layout: docs
title: Troubleshooting Guide
description: Solutions to common issues when using LLM-Speed
lang: en
---

# Troubleshooting Guide

Solutions to common issues when using LLM-Speed.


## Table of Contents

- [Installation Issues](#installation-issues)
- [Runtime Errors](#runtime-errors)
- [Performance Issues](#performance-issues)
- [Numerical Issues](#numerical-issues)
- [Getting Help](#getting-help)


## Installation Issues

### Baseline Environment Not Prepared

**Symptom:**

```
ImportError / ModuleNotFoundError during test collection
```

**Reason:** Validation started before the documented local Python environment was prepared.

**Fix:**

```bash
python3 -m venv .venv
. .venv/bin/activate
pip install -U pip setuptools wheel
pip install -r requirements.txt pytest hypothesis ruff pre-commit
```
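
To confirm the environment is ready, a collection-only pass is a quick sanity check (assuming `pytest` is the project's test runner, as installed by the fix above):

```bash
# Collect tests without running them; import errors surface immediately
python -m pytest --collect-only -q
```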

### CUDA Not Found

**Error:**

```
RuntimeError: CUDA not available. Please check your CUDA installation.
```

**Solutions:**

1. Verify the CUDA installation:

   ```bash
   nvcc --version
   nvidia-smi
   ```

2. Check PyTorch CUDA support:

   ```python
   import torch
   print(f"PyTorch version: {torch.__version__}")
   print(f"CUDA available: {torch.cuda.is_available()}")
   print(f"CUDA version: {torch.version.cuda}")
   ```

3. Reinstall PyTorch with the matching CUDA version:

   ```bash
   # For CUDA 11.8
   pip install torch --index-url https://download.pytorch.org/whl/cu118

   # For CUDA 12.1
   pip install torch --index-url https://download.pytorch.org/whl/cu121
   ```

### Build Errors

**Error:**

```
error: command 'gcc' failed with exit status 1
```

**Solutions:**

1. Check the GCC version:

   ```bash
   gcc --version  # Need GCC 9.0+
   ```

2. Set CUDA architecture flags (a helper for finding the right value follows this list):

   ```bash
   # For a specific GPU architecture
   CUDA_ARCHS="80" pip install -e .  # A100

   # For multiple architectures
   CUDA_ARCHS="75;80;86" pip install -e .
   ```

3. Common fixes:

   ```bash
   # Clear the build cache
   rm -rf build/
   rm -rf *.egg-info

   # Rebuild with verbose output
   pip install -e . --verbose
   ```
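
To find the right `CUDA_ARCHS` value for your GPU, the compute capability can be queried through PyTorch; this is a small helper sketch, not part of the build system:

```python
import torch

# Compute capability (major, minor) maps directly to the arch flag,
# e.g. (8, 0) on an A100 -> CUDA_ARCHS="80"
major, minor = torch.cuda.get_device_capability()
print(f'CUDA_ARCHS="{major}{minor}"')
```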

### Import Errors

**Error:**

```
ImportError: No module named 'cuda_llm_ops'
```

**Solutions:**

1. Verify the installation:

   ```bash
   pip list | grep cuda
   python -c "import cuda_llm_ops; print(cuda_llm_ops.__version__)"
   ```

2. Check the Python path:

   ```python
   import sys
   print(sys.path)
   ```

3. Reinstall:

   ```bash
   pip uninstall cuda_llm_ops
   pip install -e .
   ```

## Runtime Errors

### CUDA Out of Memory

**Error:**

```
RuntimeError: CUDA out of memory. Tried to allocate X GB
```

**Solutions:**

1. Use FlashAttention (O(N) memory) instead of naive attention:

   ```python
   # Bad - may OOM for long sequences
   from cuda_llm_ops import naive_attention
   output = naive_attention(q, k, v)  # O(N²) memory

   # Good - memory efficient
   from cuda_llm_ops import flash_attention
   output = flash_attention(q, k, v)  # O(N) memory
   ```

2. Reduce the batch size or sequence length (see the micro-batching sketch after this list):

   ```python
   # Check memory before the operation
   print(torch.cuda.memory_summary())

   # Try a smaller batch
   batch_size = 2  # Instead of 8
   ```

3. Clear the cache:

   ```python
   torch.cuda.empty_cache()
   ```

4. Use mixed precision:

   ```python
   # Use FP16 instead of FP32
   q = q.half()
   ```
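
If a single call still does not fit, a workaround is to split the batch into micro-batches and concatenate the results. This is a minimal sketch rather than a library feature; it assumes attention is independent across batch elements, which holds for the shapes used in this guide:

```python
import torch
from cuda_llm_ops import flash_attention

def chunked_attention(q, k, v, chunk_size=2):
    """Run attention over batch chunks to cap peak memory."""
    outputs = []
    for i in range(0, q.size(0), chunk_size):
        s = slice(i, i + chunk_size)
        outputs.append(flash_attention(q[s], k[s], v[s]))
    return torch.cat(outputs, dim=0)
```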

### Shared Memory Limit

**Error:**

```
RuntimeError: naive_attention: seq_len=4096 requires 65556 bytes shared memory,
but device max is 49152 bytes.
```

**Solution:** Use `flash_attention` or `tiled_attention` for long sequences:

```python
if seq_len > 2048:
    output = flash_attention(q, k, v)  # No shared memory limit
else:
    output = tiled_attention(q, k, v)
```

### Tensor Shape Mismatch

**Error:**

```
RuntimeError: K and V must have same shape
```

**Solution:** Ensure Q, K, and V have identical shapes:

```python
print(f"Q shape: {q.shape}")
print(f"K shape: {k.shape}")
print(f"V shape: {v.shape}")

# All should be: [batch, heads, seq_len, head_dim]
assert q.shape == k.shape == v.shape
```

### Wrong Device

**Error:**

```
RuntimeError: Q must be on CUDA device
```

**Solution:** Move tensors to the GPU:

```python
# Check the device
print(f"Q device: {q.device}")

# Move to CUDA if needed
q = q.cuda()
# Or place on the GPU at creation
q = torch.randn(..., device='cuda', dtype=torch.float16)
```

### Non-Contiguous Tensors

**Error:**

```
RuntimeError: Q must be contiguous
```

**Solution:**

```python
# Make contiguous
q = q.contiguous()

# Or combine with transpose
def safe_transpose(tensor, dim0, dim1):
    """Transpose and make contiguous."""
    return tensor.transpose(dim0, dim1).contiguous()
```

### Unsupported Data Type

**Error:**

```
RuntimeError: Only float32 and float16 are supported
```

**Solution:**

```python
# Convert to a supported dtype
q = q.half()   # FP16
# or
q = q.float()  # FP32
```

### Wrong Dimensions

**Error:**

```
RuntimeError: Q must be 4D tensor [batch, heads, seq_len, head_dim]
```

**Solution:**

```python
# Expected: [batch, heads, seq_len, head_dim]
print(f"Current shape: {q.shape}")
print(f"Dimensions: {q.dim()}")

# Reshape if needed (reshape, unlike view, also handles
# non-contiguous tensors)
q = q.reshape(batch, heads, seq_len, head_dim)
```
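
A common source of this error is a projection output in `[batch, seq_len, hidden]` layout. A sketch of the usual conversion (variable names are illustrative, not library API):

```python
# From [batch, seq_len, heads * head_dim] to [batch, heads, seq_len, head_dim]
batch, seq_len, hidden = q.shape
head_dim = hidden // heads
q = (
    q.reshape(batch, seq_len, heads, head_dim)
     .transpose(1, 2)   # -> [batch, heads, seq_len, head_dim]
     .contiguous()      # the kernels also require contiguous input
)
```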

### INT8 Tensor Core Not Available

**Error:**

```
RuntimeError: INT8 Tensor Core requires Turing+ architecture (SM 7.2+)
```

**Solution:** Check the GPU compute capability and fall back if needed:

```python
import torch
capability = torch.cuda.get_device_capability()
print(f"Compute capability: {capability}")

if capability[0] > 7 or (capability[0] == 7 and capability[1] >= 2):
    # Turing or better
    from cuda_llm_ops import tensor_core_gemm_int8
    c = tensor_core_gemm_int8(a_int8, b_int8)
else:
    # Fall back to FP16
    from cuda_llm_ops import tensor_core_gemm
    c = tensor_core_gemm(a_fp16, b_fp16)
```

## Performance Issues

### Slow Execution

**Symptoms:**

- Operations take much longer than expected
- GPU utilization is low in `nvidia-smi`

**Solutions:**

1. Use the optimal kernel for the sequence length:

   ```python
   seq_len = q.size(2)

   if seq_len >= 512:
       output = flash_attention(q, k, v)  # Best for long sequences
   elif seq_len >= 128:
       output = tiled_attention(q, k, v)  # Good for medium sequences
   else:
       output = naive_attention(q, k, v)  # Okay for short sequences
   ```

2. Check alignment:

   ```python
   def check_alignment(M, N, K):
       for dim, name in [(M, 'M'), (N, 'N'), (K, 'K')]:
           if dim % 16 != 0:
               print(f"Warning: {name}={dim} not aligned to 16")

   check_alignment(1024, 512, 1024)
   ```

3. Warm up before timing (see the timing sketch after this list):

   ```python
   # The GPU needs warmup for consistent timing
   for _ in range(10):
       _ = flash_attention(q, k, v)
   torch.cuda.synchronize()
   ```
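
For the measurement itself, CUDA events avoid the pitfalls of wall-clock timing around asynchronous kernel launches. A minimal sketch, reusing `q`, `k`, `v` and `flash_attention` from the examples above:

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(100):
    _ = flash_attention(q, k, v)
end.record()
torch.cuda.synchronize()  # wait until both events have completed
print(f"avg latency: {start.elapsed_time(end) / 100:.3f} ms")
```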

### Low GPU Utilization

**Symptoms:**

- `nvidia-smi` shows utilization < 50%
- CPU bottleneck

**Solutions:**

1. Increase the batch size:

   ```python
   # Too small
   batch = 1
   q = torch.randn(1, heads, seq_len, head_dim, device='cuda')

   # Better
   batch = 8
   q = torch.randn(8, heads, seq_len, head_dim, device='cuda')
   ```

2. Remove CPU-GPU synchronization (see the transfer sketch after this list):

   ```python
   # Bad - forces the CPU to wait
   result = flash_attention(q, k, v)
   torch.cuda.synchronize()
   print(result.cpu())

   # Better - batch operations, synchronize once
   results = []
   for _ in range(100):
       results.append(flash_attention(q, k, v))
   torch.cuda.synchronize()
   ```
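
If host-to-device copies are the bottleneck, pinned (page-locked) memory with non-blocking transfers lets copies overlap with queued GPU work. This uses standard PyTorch facilities, not a library feature:

```python
import torch

# Pinned host memory enables asynchronous H2D copies
host_batch = torch.randn(8, 16, 1024, 64, dtype=torch.float16).pin_memory()

# non_blocking=True queues the copy and returns immediately
q = host_batch.to('cuda', non_blocking=True)
```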

### Tensor Core Not Used

**Symptoms:**

- Performance significantly below cuBLAS
- Nsight Compute shows no Tensor Core usage

**Solutions:**

1. Use the Tensor Core variant:

   ```python
   # Uses regular CUDA cores
   c = gemm(a, b)

   # Uses Tensor Cores
   c = tensor_core_gemm(a, b)
   ```

2. Ensure FP16 input:

   ```python
   # Must be FP16 for Tensor Cores
   c = tensor_core_gemm(a.half(), b.half())
   ```

3. Check alignment (a padding sketch follows this list):

   ```python
   # All dimensions should be multiples of 16
   M, K, N = 1024, 512, 1024  # Good: all divisible by 16
   ```
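
When the shapes are not multiples of 16, one option is to zero-pad up to the next multiple and crop the result, which leaves the product unchanged. A generic sketch, assuming `tensor_core_gemm` computes a plain `a @ b`:

```python
import torch.nn.functional as F
from cuda_llm_ops import tensor_core_gemm

def padded_gemm(a, b, align=16):
    """Pad (M, K) x (K, N) up to multiples of `align`, then crop."""
    M, K = a.shape
    _, N = b.shape
    pad = lambda x: (-x) % align  # elements needed to reach the next multiple
    # F.pad order: (last-dim left, last-dim right, first-dim top, first-dim bottom)
    a_p = F.pad(a, (0, pad(K), 0, pad(M)))
    b_p = F.pad(b, (0, pad(N), 0, pad(K)))
    return tensor_core_gemm(a_p, b_p)[:M, :N]
```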

## Numerical Issues

### Precision Loss in FP16

**Symptoms:**

- Results differ significantly from the FP32 reference
- NaN or Inf values

**Solutions:**

1. Use Tensor Core GEMM for FP32 accumulation:

   ```python
   # FP16 computation with FP32 accumulation
   c = tensor_core_gemm(a_fp16, b_fp16)  # Returns FP32
   ```

2. Scale values to stay in FP16 range:

   ```python
   # FP16 has a limited range (max finite value: 65504).
   # The attention output is linear in V, so scaling V down and the
   # output back up is exact. (Scaling Q instead would change the
   # softmax weights and cannot be undone by rescaling the output.)
   scale = 1.0 / 256.0
   output = flash_attention(q, k, v * scale)
   output = output / scale
   ```

3. Use gradient scaling (for training):

   ```python
   from torch.cuda.amp import GradScaler

   scaler = GradScaler()
   with torch.cuda.amp.autocast():
       output = flash_attention(q, k, v)
   scaler.scale(loss).backward()  # loss computed from output
   ```

### Output Does Not Match PyTorch

**Symptoms:**

- Custom kernel output differs from `torch.nn.functional.scaled_dot_product_attention`

**Solutions:**

1. Check tolerances:

   ```python
   torch.testing.assert_close(
       output_custom,
       output_torch,
       rtol=1e-3,  # Relative tolerance
       atol=1e-3,  # Absolute tolerance
   )
   ```

2. Expect small differences (see the comparison sketch below):

   ```python
   # FP16 has ~3-4 decimal digits of precision.
   # Small differences between implementations are normal.
   ```
## Getting Help

### Diagnostic Script

Run this to collect system information:

```python
#!/usr/bin/env python3
import sys
import torch
import cuda_llm_ops

print("=" * 60)
print("System Information")
print("=" * 60)
print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"cuda_llm_ops version: {cuda_llm_ops.__version__}")

if torch.cuda.is_available():
    print(f"Device count: {torch.cuda.device_count()}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
    print(f"Compute capability: {torch.cuda.get_device_capability()}")

print("=" * 60)
print("Quick Test")
print("=" * 60)

try:
    q = torch.randn(2, 4, 64, 32, device='cuda', dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    output = cuda_llm_ops.flash_attention(q, k, v)
    print("✓ FlashAttention test passed")
except Exception as e:
    print(f"✗ FlashAttention test failed: {e}")

try:
    a = torch.randn(512, 512, device='cuda', dtype=torch.float16)
    b = torch.randn(512, 512, device='cuda', dtype=torch.float16)
    c = cuda_llm_ops.gemm(a, b)
    print("✓ GEMM test passed")
except Exception as e:
    print(f"✗ GEMM test failed: {e}")
```

### Submit an Issue

When reporting issues, please include:

  1. System information (from script above)
  2. Minimal reproduction code
  3. Expected vs actual behavior
  4. Full error message with stack trace

Example issue template:

````markdown
## Environment
- GPU: NVIDIA A100
- CUDA: 12.1
- Python: 3.10
- PyTorch: 2.1.0
- cuda_llm_ops: 0.3.0

## Issue Description
FlashAttention fails with OOM on 8K sequence length

## Reproduction Code
```python
import torch
from cuda_llm_ops import flash_attention

q = torch.randn(2, 16, 8192, 64, device='cuda', dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
output = flash_attention(q, k, v)  # OOM here
```

## Error Message
```
RuntimeError: CUDA out of memory. Tried to allocate 4.00 GB
```
````

### Resources

- **GitHub Issues**: https://github.com/LessUp/llm-speed/issues
- **Documentation**: https://lessup.github.io/llm-speed/
- **Discussions**: https://github.com/LessUp/llm-speed/discussions

---

[← Back to Documentation](../)