ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain (ggml-org#20536)

moonshadow-25 · claude · web-flow · commit 8b7d340b6ffb · 2026-03-15T17:23:58.000+01:00
* ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain On AMD APU/iGPU devices (unified memory architecture), hipMemAdviseSetCoarseGrain returns hipErrorInvalidValue because the hint is not applicable to UMA systems. The previous CUDA_CHECK() call treated this as a fatal error, causing crashes on APU systems such as AMD Strix Halo (gfx1151). Fix: treat hipMemAdviseSetCoarseGrain as an optional performance hint - call it without error checking and clear any resulting error with hipGetLastError(). Also add pre-allocation debug logging (GGML_LOG_DEBUG) to help diagnose memory issues on APU systems, and store totalGlobalMem in device info. Context: AMD APUs on Windows are affected by a ROCm runtime bug that limits hipMallocManaged to ~64GB regardless of available system RAM. A fix has been submitted upstream: ROCm/rocm-systems#4077 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * ggml/hip: remove unrelated changes, keep only hipMemAdviseSetCoarseGrain fix --------- Co-authored-by: moonshadow-25 <moonshadow-25@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -124,7 +124,10 @@ static cudaError_t ggml_cuda_device_malloc(void ** ptr, size_t size, int device)
         err = cudaMallocManaged(ptr, size);
 #if defined(GGML_USE_HIP)
         if (err == hipSuccess) {
-            CUDA_CHECK(cudaMemAdvise(*ptr, size, hipMemAdviseSetCoarseGrain, device));
+            // hipMemAdviseSetCoarseGrain is an optional performance hint;
+            // ignore errors (e.g. hipErrorInvalidValue on some APU/iGPU configs).
+            cudaMemAdvise(*ptr, size, hipMemAdviseSetCoarseGrain, device);
+            (void)hipGetLastError(); // clear any error
         }
 
         // fall back to cudaMalloc if not supported (e.g. on Windows)