Bug: CUDA error when batch size exceeds 65535 #115

@jobs-git

Description

Summary

This bug seems to occur regardless of the length of the series. Interestingly, the error is always triggered once the batch dimension exceeds 65535, which matches the maximum y- or z-dimension of a CUDA launch grid in every CUDA version (and the x-dimension limit on older versions); see: https://en.wikipedia.org/wiki/CUDA#Technical_specifications

The error does not occur with plain PyTorch when a similar size is used.
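To make the suspected limit concrete, here is a small sketch of a validity check for a kernel launch that maps one batch element to grid.y. The constants come from the CUDA specification table linked above; the helper function is hypothetical and is not taken from ptwt's actual launch code.

```python
# Hypothetical illustration of CUDA grid-dimension limits, per
# https://en.wikipedia.org/wiki/CUDA#Technical_specifications
MAX_GRID_X = 2**31 - 1   # x-dimension (compute capability >= 3.0)
MAX_GRID_YZ = 65535      # y- and z-dimensions (all CUDA versions)

def launch_is_valid(grid_x, grid_y=1, grid_z=1):
    """Return True if a (grid_x, grid_y, grid_z) launch fits the limits."""
    return grid_x <= MAX_GRID_X and grid_y <= MAX_GRID_YZ and grid_z <= MAX_GRID_YZ

# A kernel that maps one batch element to grid.y fails past 65535:
assert launch_is_valid(grid_x=32, grid_y=65535)
assert not launch_is_valid(grid_x=32, grid_y=65536)
```

This would explain why the failure depends only on the batch dimension and not on the series length.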

Error

CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Reproducible Code

Produces the error:

import numpy as np
import torch
import ptwt
import pywt

limit = 65535 + 1

for i in range(10):
    B, T = limit, 32
    wavelet = pywt.Wavelet('db4')
    level = 4
    series_batch = torch.randn(B, T, dtype=torch.float32, device="cuda")
    coeffs = ptwt.wavedec(series_batch, wavelet, level=level)
    reconstructed = ptwt.waverec(coeffs, wavelet)
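Until this is fixed, a possible workaround is to process the batch in chunks of at most 65535 rows, so that each underlying kernel launch stays within the grid limit. This is only a sketch under the assumption that the batch dimension alone triggers the limit; `transform` below is a stand-in for `ptwt.wavedec`/`ptwt.waverec`, and on real tensors one would use `torch.split` and `torch.cat` instead of Python lists.

```python
# Hypothetical workaround sketch: apply a transform to a batch in
# chunks of at most 65535 rows, then stitch the results back together.
CHUNK = 65535

def chunked_apply(transform, batch):
    """Apply `transform` to `batch` (a list of rows) CHUNK rows at a time."""
    out = []
    for start in range(0, len(batch), CHUNK):
        out.extend(transform(batch[start:start + CHUNK]))
    return out

rows = list(range(65536))                      # one row past the limit
result = chunked_apply(lambda chunk: [r * 2 for r in chunk], rows)
assert len(result) == 65536
assert result[-1] == 65535 * 2
```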

Works with plain PyTorch:

import torch
import torch.nn as nn

limit = 65535 + 1

B, T = limit, 32
in_channels = T
out_channels = T
kernel_size = 5
padding = kernel_size // 2 

conv = nn.Conv1d(in_channels=in_channels, out_channels=out_channels,
                 kernel_size=kernel_size, padding=padding).cuda()

for i in range(10):
    series_batch = torch.randn(B, T, dtype=torch.float32, device="cuda")
    # Reshape (B, T) -> (1, T, B) so Conv1d treats the large batch
    # dimension as the sequence length
    series_batch = series_batch.transpose(0, 1).unsqueeze(0)
    convolved = conv(series_batch)
    reconstructed = convolved.squeeze(0).transpose(0, 1)

Works with plain PyTorch using four stacked convolutions (simulating four decomposition levels):

import torch
import torch.nn as nn

limit = 65535 + 1
B, T = limit, 32
in_channels = T
out_channels = T
kernel_size = 5
padding = kernel_size // 2

# Create four convolution layers to simulate four levels
convs = nn.ModuleList([
    nn.Conv1d(in_channels=in_channels, out_channels=out_channels,
              kernel_size=kernel_size, padding=padding).cuda()
    for _ in range(4)
])

for i in range(10):
    series_batch = torch.randn(B, T, dtype=torch.float32, device="cuda")
    series_batch = series_batch.transpose(0, 1).unsqueeze(0)

    # Convolve and halve the length at each of the four levels
    for conv in convs:
        series_batch = conv(series_batch)
        series_batch = nn.functional.avg_pool1d(series_batch, kernel_size=2, stride=2, padding=0)

    # Upsample back to the original length B
    reconstructed = nn.functional.interpolate(series_batch, size=B, mode='linear', align_corners=False)
    reconstructed = reconstructed.squeeze(0).transpose(0, 1)
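For reference, the length bookkeeping in the four-level loop above can be checked with pure arithmetic (no GPU needed); the value 65536 here is just `limit` from the snippet.

```python
# Each avg_pool1d with kernel_size=2, stride=2 halves the sequence
# length (floor division), so four levels divide it by 2**4.
B = 65536
length = B
for level in range(4):
    length = length // 2   # one avg_pool1d(kernel_size=2, stride=2)
assert length == 4096      # 65536 // 2**4; interpolate() restores B
```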

Labels: invalid (This doesn't seem right)