Commit 78746a4

donettom-1 authored and alexdeucher committed
drm/amdkfd: Fix queue preemption/eviction failures by aligning control stack size to GPU page size
The control stack size is calculated based on the number of CUs and waves, and is then aligned to PAGE_SIZE. When the resulting control stack size is aligned to 64 KB, GPU hangs and queue preemption failures are observed while running RCCL unit tests on systems with more than two GPUs.

    amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008
    amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
    amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4
    amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008
    amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues
    amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues

This issue is observed on both 4 KB and 64 KB system page-size configurations.

This patch fixes the issue by aligning the control stack size to AMDGPU_GPU_PAGE_SIZE instead of PAGE_SIZE, so the control stack size will not be 64 KB on systems with a 64 KB page size and queue preemption works correctly.

Additionally, in the current code, wg_data_size is aligned to PAGE_SIZE, which can waste memory if the system page size is large. In this patch, wg_data_size is aligned to AMDGPU_GPU_PAGE_SIZE. The cwsr_size, calculated from wg_data_size and the control stack size, is aligned to PAGE_SIZE.

Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a3e1443)
1 parent daf470b commit 78746a4

1 file changed

Lines changed: 4 additions & 3 deletions

drivers/gpu/drm/amd/amdkfd/kfd_queue.c

@@ -492,10 +492,11 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)
 	cu_num = props->simd_count / props->simd_per_cu / NUM_XCC(dev->gpu->xcc_mask);
 	wave_num = get_num_waves(props, gfxv, cu_num);

-	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props), PAGE_SIZE);
+	wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props),
+				AMDGPU_GPU_PAGE_SIZE);
 	ctl_stack_size = wave_num * CNTL_STACK_BYTES_PER_WAVE(gfxv) + 8;
 	ctl_stack_size = ALIGN(SIZEOF_HSA_USER_CONTEXT_SAVE_AREA_HEADER + ctl_stack_size,
-			       PAGE_SIZE);
+			       AMDGPU_GPU_PAGE_SIZE);

 	if ((gfxv / 10000 * 10000) == 100000) {
 		/* HW design limits control stack size to 0x7000.
@@ -507,7 +508,7 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev)

 	props->ctl_stack_size = ctl_stack_size;
 	props->debug_memory_size = ALIGN(wave_num * DEBUGGER_BYTES_PER_WAVE, DEBUGGER_BYTES_ALIGN);
-	props->cwsr_size = ctl_stack_size + wg_data_size;
+	props->cwsr_size = ALIGN(ctl_stack_size + wg_data_size, PAGE_SIZE);

 	if (gfxv == 80002)	/* GFX_VERSION_TONGA */
 		props->eop_buffer_size = 0x8000;
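
To make the effect of the alignment change concrete, the following is a minimal userspace sketch of the size arithmetic before and after this patch. It is not the driver code: `align_up`, `struct cwsr_sizes`, `sizes_before` and `sizes_after` are hypothetical names, the raw byte counts passed in are made-up inputs standing in for the per-CU and per-wave products, and the GFX10 control stack clamp is omitted. It assumes AMDGPU_GPU_PAGE_SIZE is 4 KB.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: mirrors AMDGPU_GPU_PAGE_SIZE, taken here to be 4 KB. */
#define GPU_PAGE_SIZE 4096u

/* Hypothetical container for the three sizes the patch touches. */
struct cwsr_sizes {
	uint32_t ctl_stack_size;
	uint32_t wg_data_size;
	uint32_t cwsr_size;
};

/* Round x up to a multiple of a; a must be a power of two, as with the kernel's ALIGN(). */
static inline uint32_t align_up(uint32_t x, uint32_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* Before the patch: both parts are aligned to the system page size. */
static inline struct cwsr_sizes sizes_before(uint32_t raw_ctl, uint32_t raw_wg,
					     uint32_t sys_page)
{
	struct cwsr_sizes s;

	s.wg_data_size = align_up(raw_wg, sys_page);
	s.ctl_stack_size = align_up(raw_ctl, sys_page);
	s.cwsr_size = s.ctl_stack_size + s.wg_data_size;
	return s;
}

/* After the patch: parts use the GPU page size, only the total uses the system page size. */
static inline struct cwsr_sizes sizes_after(uint32_t raw_ctl, uint32_t raw_wg,
					    uint32_t sys_page)
{
	struct cwsr_sizes s;

	s.wg_data_size = align_up(raw_wg, GPU_PAGE_SIZE);
	s.ctl_stack_size = align_up(raw_ctl, GPU_PAGE_SIZE);
	s.cwsr_size = align_up(s.ctl_stack_size + s.wg_data_size, sys_page);
	return s;
}
```

With illustrative inputs of 20000 bytes of raw control stack and 300000 bytes of raw workgroup data on a 64 KB page system, `sizes_before` yields a 65536-byte control stack and a 393216-byte save area, while `sizes_after` yields a 20480-byte control stack and a 327680-byte save area. On a 4 KB page system the two functions produce identical results, since both alignments are then 4096.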
