Add memory stats to profiling #29058
…ry collection
- Add bytes_requested_in_use field to AllocatorStats (tracks actual user-requested bytes excluding arena padding/fragmentation)
- Update BFCArena, CUDA plugin arena, and test plugin arena to track bytes_requested_in_use symmetrically in alloc/free paths
- Rename profiling event fields for clarity:
  - mem_bytes_requested_in_use, mem_requested_in_use_delta (actual user bytes)
  - mem_bytes_in_use, mem_in_use_delta (including padding)
  - mem_arena_held, mem_arena_held_delta (total device memory held)
- Change KernelScope to use IAllocator::GetStats() instead of AsArena()->GetStats() so plugin EPs with GetStats support also report memory stats
- Implement GetStats() override in IAllocatorWrappingOrtAllocator to bridge plugin EP allocator stats via OrtKeyValuePairs
- Update CudaMempoolArena (in-tree and plugin) to report bytes_requested_in_use
- Update allocator_adapters.cc serialization/deserialization for RequestedInUse
- Add per-field comments to plugin AllocatorStats copies
Pull request overview
This PR extends ONNX Runtime’s allocator statistics and profiling output by adding a new bytes_requested_in_use metric (bytes requested by user code, excluding internal padding/fragmentation) and plumbing it through core allocators, CUDA allocators/plugins, EP adapter layers, and tests. It also adds per-kernel memory stats to sequential executor profiling events and a small MSVC build tweak for CUDA provider targets.
Changes:
- Add `bytes_requested_in_use` to `AllocatorStats` across core/framework, CUDA plugin interfaces, EP adapter, and example plugin EP, including KVP/string serialization.
- Update arena/mempool allocator implementations to correctly maintain `bytes_requested_in_use`, and emit new memory profiling args per kernel in `sequential_executor.cc`.
- Add/adjust unit tests to validate requested-vs-actual in-use accounting for arena and mempool allocators; add `/bigobj` for MSVC CXX compilation in CUDA provider CMake.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| onnxruntime/test/shared_lib/test_inference.cc | Adds a profiling JSON test to assert memory-stat args are present in profiling events. |
| onnxruntime/test/providers/cuda/plugin/cuda_plugin_arena_test.cc | Adds CUDA plugin arena test validating RequestedInUse tracking vs InUse rounding. |
| onnxruntime/test/providers/cuda/cuda_mempool_arena_test.cc | Extends mempool arena test to validate requested==in_use for mempool allocators. |
| onnxruntime/test/framework/bfc_arena_test.cc | Adds BFC arena tests validating requested-vs-in-use behavior and reserve behavior. |
| onnxruntime/test/autoep/library/example_plugin_ep/ep_arena.cc | Updates example plugin EP arena stats accounting to track requested bytes. |
| onnxruntime/test/autoep/library/example_plugin_ep/ep_allocator.h | Extends example plugin EP AllocatorStats struct and its KVP/string output. |
| onnxruntime/core/session/allocator_adapters.cc | Adds parsing/serialization support for RequestedInUse in allocator adapters. |
| onnxruntime/core/providers/cuda/plugin/cuda_mempool_allocator_plugin.cc | Reports bytes_requested_in_use for CUDA mempool plugin allocator stats. |
| onnxruntime/core/providers/cuda/plugin/cuda_arena.cc | Tracks requested bytes in CUDA plugin arena allocation/free paths. |
| onnxruntime/core/providers/cuda/plugin/cuda_allocator_plugin.h | Extends CUDA plugin AllocatorStats and its KVP/string serialization. |
| onnxruntime/core/providers/cuda/cuda_mempool_arena.cc | Reports requested bytes (== in use) for CUDA mempool arena stats. |
| onnxruntime/core/framework/sequential_executor.cc | Samples allocator stats before/after kernel execution and emits new profiling args (including requested bytes). |
| onnxruntime/core/framework/bfc_arena.cc | Tracks requested bytes in core BFC arena allocation/reserve/free paths. |
| onnxruntime/core/framework/allocator_stats.h | Adds bytes_requested_in_use to the core AllocatorStats struct and debug string output. |
| include/onnxruntime/ep/adapter/allocator.h | Implements GetStats() for IAllocatorWrappingOrtAllocator by parsing KVPs, including requested bytes. |
| cmake/onnxruntime_providers_cuda.cmake | Adds /bigobj for MSVC CXX compilation of CUDA provider targets. |
tianleiwu left a comment
Reviewed the latest head (efa95bb). The change is clean and the new `bytes_requested_in_use` accounting is correct and complete:
- All four BFC `bytes_in_use` mutation sites (`Reserve`, `SplitFreeChunkFromBin`, reserved-chunk `Free`, `FreeAndMaybeCoalesce`) have matching `bytes_requested_in_use` updates, and `chunk->requested_size` is set on allocation, so alloc/free stay symmetric and the counter returns to 0. The CUDA and EP plugin arenas follow the same balanced pattern.
- Mempool allocators correctly report `requested == in_use` because `in_use_bytes_` already tracks the user-requested size.
- Inserting the field mid-struct is ABI-safe (stats cross the C boundary as `OrtKeyValuePairs` strings, not by struct layout).
- `KernelScope` sampling is correctly gated behind profiling, and `has_meaningful_stats_` guarantees a non-null allocator in the after-block.
- `/bigobj` for CXX is correctly inside the `if(MSVC)` block.
All earlier review threads are resolved and the fixes are incorporated in this head. One minor, non-blocking maintainability suggestion is left inline.
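
The symmetric accounting described in this review can be illustrated with a minimal sketch. This is not the BFC arena code: `ToyArena`, `ToyStats`, the integer handles, and the 256-byte rounding are assumptions made up for illustration. The only point it shows is that storing the requested size on the chunk at allocation time lets the free path subtract exactly what the alloc path added, so both counters return to 0.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical stand-in for the two AllocatorStats fields discussed here.
struct ToyStats {
  std::size_t bytes_in_use = 0;            // includes rounding/padding
  std::size_t bytes_requested_in_use = 0;  // what callers actually asked for
};

// Toy arena that rounds each request up to a 256-byte multiple (assumed granularity).
class ToyArena {
 private:
  struct Chunk {
    std::size_t size;            // rounded size charged to bytes_in_use
    std::size_t requested_size;  // recorded at allocation time
  };

 public:
  static constexpr std::size_t kAlignment = 256;

  // Returns an opaque handle instead of real memory; enough to show the accounting.
  int Alloc(std::size_t requested) {
    const std::size_t rounded = (requested + kAlignment - 1) / kAlignment * kAlignment;
    const int handle = next_handle_++;
    chunks_[handle] = Chunk{rounded, requested};
    stats_.bytes_in_use += rounded;
    stats_.bytes_requested_in_use += requested;
    return handle;
  }

  // Subtracts exactly what Alloc added for this chunk.
  void Free(int handle) {
    auto it = chunks_.find(handle);
    assert(it != chunks_.end() && "unknown handle or double free");
    stats_.bytes_in_use -= it->second.size;
    stats_.bytes_requested_in_use -= it->second.requested_size;
    chunks_.erase(it);
  }

  const ToyStats& GetStats() const { return stats_; }

 private:
  std::unordered_map<int, Chunk> chunks_;
  ToyStats stats_;
  int next_handle_ = 0;
};
```

In this sketch, allocating 100 and 300 bytes leaves `bytes_requested_in_use` at 400 while `bytes_in_use` is 768 (256 + 512); freeing both handles brings both counters back to 0.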
…ParseStringWithClassicLocale and Ort::UnownedAllocator
…h version/null guard
tianleiwu left a comment
Reviewed the latest head (1913e5e). I found two profiling-stat correctness gaps that look worth addressing before relying on the new memory fields across providers. I did not repost the existing first-use has_meaningful_stats_ thread; that concern already has a current-head discussion.
…ool total_allocated_bytes
- JS WebGpuAllocator: mirror bytes_requested_in_use alongside bytes_in_use
- WebNN TensorAllocator: same (no padding, so requested == actual)
- CUDA mempool: query cudaMemPoolAttrReservedMemCurrent for total_allocated_bytes instead of using the monotonically-increasing cumulative counter
tianleiwu left a comment
Re-reviewed at head 56d29b5f. The two concerns I previously raised are addressed in follow-up commits and their threads are now resolved:
- Non-arena allocators (JS `WebGpuAllocator`, WebNN `WebNNTensorAllocator`) now mirror `bytes_requested_in_use` with `bytes_in_use`, so profiling no longer emits `0` for them.
- `CudaMempoolArena::GetStats` now sources `total_allocated_bytes` from `cudaMemPoolAttrReservedMemCurrent` instead of the monotonic counter, keeping `mem_arena_held` semantics consistent.
The arena accounting is symmetric (requested_size is set at alloc time and subtracted on free) and the per-kernel sampling is fully gated behind profiling, so there is no hot-path cost when profiling is off. One low-priority robustness note left inline. Looks good overall.
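
The gated before/after sampling mentioned in this review can be sketched as follows. The types `ToyStats`, `ToyAllocator`, and `ToyKernelScope` are hypothetical and do not reproduce `sequential_executor.cc`; only the `mem_*` arg names come from the commit message above. Reporting the after-kernel value under the non-delta keys is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical mirror of the three stats the profiling args are derived from.
struct ToyStats {
  std::int64_t bytes_in_use = 0;
  std::int64_t bytes_requested_in_use = 0;
  std::int64_t total_allocated_bytes = 0;
};

// Hypothetical allocator; it simply hands back a copy of its counters.
struct ToyAllocator {
  ToyStats stats;
  void GetStats(ToyStats* out) const { *out = stats; }
};

using ProfilingArgs = std::map<std::string, std::string>;

class ToyKernelScope {
 public:
  ToyKernelScope(const ToyAllocator* allocator, bool profiling_enabled, ProfilingArgs& args)
      : allocator_(allocator), args_(args) {
    // Sample only when profiling is on and an allocator exists,
    // so the non-profiling path does no extra work.
    if (profiling_enabled && allocator_ != nullptr) {
      allocator_->GetStats(&before_);
      has_stats_ = true;
    }
  }

  ~ToyKernelScope() {
    if (!has_stats_) return;         // nothing was sampled before the kernel
    assert(allocator_ != nullptr);   // implied by has_stats_
    ToyStats after;
    allocator_->GetStats(&after);
    Emit("mem_bytes_requested_in_use", "mem_requested_in_use_delta",
         before_.bytes_requested_in_use, after.bytes_requested_in_use);
    Emit("mem_bytes_in_use", "mem_in_use_delta",
         before_.bytes_in_use, after.bytes_in_use);
    Emit("mem_arena_held", "mem_arena_held_delta",
         before_.total_allocated_bytes, after.total_allocated_bytes);
  }

 private:
  // Writes the after-kernel value and the change across the kernel.
  void Emit(const char* value_key, const char* delta_key,
            std::int64_t before, std::int64_t after) {
    args_[value_key] = std::to_string(after);
    args_[delta_key] = std::to_string(after - before);
  }

  const ToyAllocator* allocator_;
  ProfilingArgs& args_;
  ToyStats before_;
  bool has_stats_ = false;
};
```

Constructing the scope with `profiling_enabled == false` leaves the args map untouched, which is the property the review calls out as "no hot-path cost when profiling is off".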
… stats, guard destructor GetStats
- JS WebGpuAllocator: mirror bytes_requested_in_use alongside bytes_in_use
- WebNN TensorAllocator: same (no padding, so requested == actual)
- CUDA mempool: query cudaMemPoolAttrReservedMemCurrent for total_allocated_bytes instead of using the monotonically-increasing cumulative counter
- KernelScope destructor: wrap after-kernel GetStats in try/catch since plugin EP allocators can throw through Ort::ThrowOnError and destructors are noexcept
…ding
- In KernelScope destructor, catch only GetStats failures via ORT_TRY/ORT_CATCH and skip mem_* args when stats retrieval fails
- Keep other destructor code unchanged
- Align AllocatorStats comment wording in plugin headers to use 'padding'
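
The destructor guard described in this commit can be sketched roughly as below. It uses plain `try`/`catch` rather than the `ORT_TRY`/`ORT_CATCH` macros, and `GuardedScope`, `StatsFn`, and the `op_name` arg are hypothetical names invented for the example. The idea shown is that only the stats query is wrapped, and the `mem_*` arg is skipped when the query fails while the rest of the destructor still runs.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

struct ToyStats {
  std::int64_t bytes_in_use;
};

// Hypothetical stats query that may throw, standing in for a plugin EP allocator.
using StatsFn = std::function<ToyStats()>;
using ProfilingArgs = std::map<std::string, std::string>;

class GuardedScope {
 public:
  GuardedScope(StatsFn get_stats, ProfilingArgs& args)
      : get_stats_(std::move(get_stats)), args_(args) {
    assert(get_stats_ && "a stats callback is required in this sketch");
  }

  // Destructors are implicitly noexcept, so an escaping exception would terminate.
  ~GuardedScope() {
    args_["op_name"] = "ToyKernel";  // unrelated arg, emitted regardless

    ToyStats after{};
    bool have_stats = false;
    try {
      after = get_stats_();  // the only call that is guarded
      have_stats = true;
    } catch (...) {
      // Stats retrieval failed: fall through and skip the mem_* arg.
    }

    if (have_stats) {
      args_["mem_bytes_in_use"] = std::to_string(after.bytes_in_use);
    }
  }

 private:
  StatsFn get_stats_;
  ProfilingArgs& args_;
};
```

Keeping the `try` block this narrow means a failing allocator costs only the memory args for that event, not the event itself.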
This pull request introduces enhanced memory profiling capabilities by adding a new metric, `bytes_requested_in_use`, to allocator statistics throughout the ONNX Runtime codebase. This metric tracks the memory actually requested by user code, excluding internal fragmentation and padding, and is now reported alongside existing memory usage statistics. The changes span core framework allocators, CUDA providers, plugin interfaces, and kernel execution profiling.
Allocator statistics improvements:
- Added a new field, `bytes_requested_in_use`, to the `AllocatorStats` struct in multiple locations (core, CUDA, plugin, and test). It tracks the number of bytes actually requested by user code, as distinct from total bytes in use, which may include internal padding. The field is initialized, serialized, and included in string/key-value representations. [1] [2] [3] [4] [5] [6]
- Updated arena allocator implementations in both the core and CUDA providers to increment and decrement `bytes_requested_in_use` during allocation, reservation, splitting, and freeing of memory chunks. [1] [2] [3] [4] [5] [6] [7] [8]
CUDA and plugin support:
- CUDA mempool allocators (in-tree and plugin) report `bytes_requested_in_use` (equal to `bytes_in_use`, since there is no padding in mempool allocators), ensuring consistent reporting across all allocator types. [1] [2]
Adapter and API changes:
- Allocator adapters parse and serialize the `RequestedInUse` field in key-value pairs, enabling plugins and external allocators to participate in the enhanced memory profiling. [1] [2] [3] [4]
Kernel execution memory profiling:
- `KernelScope` in the sequential executor samples and emits both `bytes_in_use` and `bytes_requested_in_use` before and after kernel execution, providing more granular memory profiling in event logs. [1] [2] [3]
Build system minor fix:
- Added the `/bigobj` compiler flag for C++ targets in the CUDA provider CMake file to avoid object file size limits on MSVC.
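
The key-value bridging of stats can be illustrated with a small round-trip sketch. A `std::map` stands in for `OrtKeyValuePairs`, and `ToyStats`, `ToKeyValuePairs`, `FromKeyValuePairs`, and `ParseInt64` are hypothetical helpers, not the adapter code. The key names `InUse` and `RequestedInUse` follow the names used in this PR; treating a missing or malformed key as 0 is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <locale>
#include <map>
#include <sstream>
#include <string>

struct ToyStats {
  std::int64_t bytes_in_use = 0;
  std::int64_t bytes_requested_in_use = 0;
};

// Stand-in for OrtKeyValuePairs: string keys mapped to string values.
using ToyKvp = std::map<std::string, std::string>;

ToyKvp ToKeyValuePairs(const ToyStats& stats) {
  ToyKvp kvp;
  kvp["InUse"] = std::to_string(stats.bytes_in_use);
  kvp["RequestedInUse"] = std::to_string(stats.bytes_requested_in_use);
  return kvp;
}

// Parses a whole string as a decimal integer using the classic "C" locale,
// so the result does not depend on the process-wide locale.
bool ParseInt64(const std::string& text, std::int64_t& out) {
  std::istringstream stream(text);
  stream.imbue(std::locale::classic());
  std::int64_t value = 0;
  stream >> value;
  if (stream.fail() || !stream.eof()) return false;  // reject empty or trailing junk
  out = value;
  return true;
}

ToyStats FromKeyValuePairs(const ToyKvp& kvp) {
  ToyStats stats;
  // A missing or unparsable key leaves the field at its default of 0,
  // e.g. when the producer predates the RequestedInUse key.
  auto it = kvp.find("InUse");
  if (it != kvp.end()) ParseInt64(it->second, stats.bytes_in_use);
  it = kvp.find("RequestedInUse");
  if (it != kvp.end()) ParseInt64(it->second, stats.bytes_requested_in_use);
  return stats;
}
```

Because the stats cross the boundary as strings keyed by name, adding a key does not depend on struct layout on either side; a consumer that does not know the key ignores it, and a producer that omits it yields the default.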