Add memory stats to profiling #29058

Open
yuslepukhin wants to merge 11 commits into main from yuslepukhin/memory_profiling

@yuslepukhin


This pull request introduces enhanced memory profiling capabilities by adding a new metric, bytes_requested_in_use, to allocator statistics throughout the ONNX Runtime codebase. This metric tracks the memory actually requested by user code, excluding internal fragmentation and padding, and is now reported alongside existing memory usage statistics. The changes span core framework allocators, CUDA providers, plugin interfaces, and kernel execution profiling.

Allocator statistics improvements:

  • Added a new field, bytes_requested_in_use, to the AllocatorStats struct in multiple locations (core, CUDA, plugin, and test), which tracks the number of bytes actually requested by user code, distinct from total bytes in use that may include internal padding. This field is now initialized, serialized, and included in string/key-value representations. [1] [2] [3] [4] [5] [6]

  • Updated arena allocator implementations in both the core and CUDA providers to increment and decrement bytes_requested_in_use appropriately during allocation, reservation, splitting, and freeing of memory chunks. [1] [2] [3] [4] [5] [6] [7] [8]
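The difference between the two counters can be illustrated with a toy arena. This is a minimal sketch, not the BFC arena code: `ToyStats`, `ToyArena`, the integer handles, and the 256-byte rounding granularity are all assumptions made for illustration. It only shows the bookkeeping pattern described above, where each chunk remembers its requested size so that free subtracts exactly what allocation added.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical, trimmed-down stand-in for AllocatorStats; only the two
// counters discussed here are modeled.
struct ToyStats {
  std::size_t bytes_in_use = 0;            // includes rounding/padding
  std::size_t bytes_requested_in_use = 0;  // what callers actually asked for
};

// Toy arena that rounds every request up to a 256-byte multiple (assumed
// granularity). It hands out integer handles instead of real memory so the
// bookkeeping stays in focus.
class ToyArena {
 public:
  int Alloc(std::size_t requested) {
    const std::size_t rounded = RoundUp(requested);
    const int handle = next_handle_++;
    chunks_[handle] = Chunk{rounded, requested};
    stats_.bytes_in_use += rounded;
    stats_.bytes_requested_in_use += requested;
    return handle;
  }

  void Free(int handle) {
    auto it = chunks_.find(handle);
    if (it == chunks_.end()) return;  // unknown handle: nothing to undo
    assert(stats_.bytes_in_use >= it->second.size);
    assert(stats_.bytes_requested_in_use >= it->second.requested_size);
    // Subtract exactly what Alloc added, so both counters return to zero.
    stats_.bytes_in_use -= it->second.size;
    stats_.bytes_requested_in_use -= it->second.requested_size;
    chunks_.erase(it);
  }

  const ToyStats& GetStats() const { return stats_; }

 private:
  struct Chunk {
    std::size_t size;            // rounded size held by the chunk
    std::size_t requested_size;  // size the caller asked for
  };

  static std::size_t RoundUp(std::size_t n) {
    constexpr std::size_t kGranularity = 256;
    return (n + kGranularity - 1) / kGranularity * kGranularity;
  }

  std::unordered_map<int, Chunk> chunks_;
  ToyStats stats_;
  int next_handle_ = 0;
};
```

With this sketch, allocating 100 and 300 bytes leaves `bytes_requested_in_use` at 400 while `bytes_in_use` is 768 (256 + 512); freeing both handles brings both counters back to 0.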

CUDA and plugin support:

  • Modified CUDA mempool allocators and plugins to report bytes_requested_in_use (equal to bytes_in_use since there is no padding in mempool allocators), ensuring consistent reporting across all allocator types. [1] [2]

Adapter and API changes:

  • Updated the allocator adapter logic to parse, propagate, and serialize the new RequestedInUse field in key-value pairs, enabling plugins and external allocators to participate in the enhanced memory profiling. [1] [2] [3] [4]
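A rough sketch of the key-value round trip follows. It is not the code in allocator_adapters.cc: the `std::map` stands in for `OrtKeyValuePairs`, and `ToyStats`, `ToKeyValuePairs`, `FromKeyValuePairs`, and `ParseInt64` are hypothetical names. The key names `InUse` and `RequestedInUse` are the ones mentioned in this PR. Parsing uses the classic locale so the result does not depend on the process locale, and a missing or malformed key leaves the field at its default.

```cpp
#include <cassert>
#include <cstdint>
#include <locale>
#include <map>
#include <sstream>
#include <string>

// Hypothetical two-field stand-in for AllocatorStats.
struct ToyStats {
  std::int64_t bytes_in_use = 0;
  std::int64_t bytes_requested_in_use = 0;
};

// Stand-in for OrtKeyValuePairs: plain string keys and string values.
using KeyValuePairs = std::map<std::string, std::string>;

KeyValuePairs ToKeyValuePairs(const ToyStats& s) {
  assert(s.bytes_requested_in_use <= s.bytes_in_use);  // padding only adds bytes
  return {{"InUse", std::to_string(s.bytes_in_use)},
          {"RequestedInUse", std::to_string(s.bytes_requested_in_use)}};
}

// Locale-independent integer parse. Returns false unless the whole string
// is a valid integer (so "12x" is rejected).
bool ParseInt64(const std::string& text, std::int64_t& out) {
  std::istringstream in(text);
  in.imbue(std::locale::classic());
  std::int64_t value = 0;
  if (!(in >> value)) return false;
  if (in.peek() != std::char_traits<char>::eof()) return false;
  out = value;
  return true;
}

ToyStats FromKeyValuePairs(const KeyValuePairs& kvps) {
  ToyStats s;
  auto read = [&kvps](const char* key, std::int64_t& field) {
    auto it = kvps.find(key);
    if (it == kvps.end()) return;     // older producer: keep the default 0
    std::int64_t value = 0;
    if (ParseInt64(it->second, value)) field = value;
  };
  read("InUse", s.bytes_in_use);
  read("RequestedInUse", s.bytes_requested_in_use);
  return s;
}
```

Because only strings cross the boundary, a producer that does not know about `RequestedInUse` simply omits the key, and the consumer in this sketch reports 0 for it.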

Kernel execution memory profiling:

  • Enhanced the KernelScope in the sequential executor to sample and emit both bytes_in_use and bytes_requested_in_use before and after kernel execution, providing more granular memory profiling in event logs. [1] [2] [3]
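The before/after sampling can be pictured as an RAII scope. The sketch below is an assumption-laden illustration rather than the code in sequential_executor.cc: `ToyAllocator`, `ToyKernelScope`, and the `EventArgs` map are hypothetical, while the `mem_*` argument names follow the ones listed in the commit message of this PR. Sampling is skipped entirely when profiling is off.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

struct ToyStats {
  std::int64_t bytes_in_use = 0;
  std::int64_t bytes_requested_in_use = 0;
};

// Hypothetical allocator; stands in for anything exposing GetStats().
struct ToyAllocator {
  ToyStats stats;
  void GetStats(ToyStats* out) const { *out = stats; }
};

using EventArgs = std::map<std::string, std::string>;

// Samples stats on construction and emits totals plus deltas on destruction.
class ToyKernelScope {
 public:
  ToyKernelScope(const ToyAllocator* alloc, bool profiling_enabled, EventArgs& args)
      : alloc_(alloc),
        enabled_(profiling_enabled && alloc != nullptr),
        args_(args) {
    if (enabled_) alloc_->GetStats(&before_);
  }

  ~ToyKernelScope() {
    if (!enabled_) return;  // no sampling cost when profiling is off
    assert(alloc_ != nullptr);
    ToyStats after;
    alloc_->GetStats(&after);
    args_["mem_bytes_in_use"] = std::to_string(after.bytes_in_use);
    args_["mem_in_use_delta"] =
        std::to_string(after.bytes_in_use - before_.bytes_in_use);
    args_["mem_bytes_requested_in_use"] =
        std::to_string(after.bytes_requested_in_use);
    args_["mem_requested_in_use_delta"] =
        std::to_string(after.bytes_requested_in_use - before_.bytes_requested_in_use);
  }

 private:
  const ToyAllocator* alloc_;
  bool enabled_;
  EventArgs& args_;
  ToyStats before_;
};
```

Wrapping a kernel call in such a scope yields one set of `mem_*` arguments per kernel event; a negative delta means the kernel released more than it allocated.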

Build system minor fix:

  • Added the /bigobj compiler flag for C++ targets in the CUDA provider CMake file so that large translation units do not exceed MSVC's default limit on the number of sections in an object file.

…ry collection

- Add bytes_requested_in_use field to AllocatorStats (tracks actual user-requested
  bytes excluding arena padding/fragmentation)
- Update BFCArena, CUDA plugin arena, and test plugin arena to track
  bytes_requested_in_use symmetrically in alloc/free paths
- Rename profiling event fields for clarity:
  mem_bytes_requested_in_use, mem_requested_in_use_delta (actual user bytes)
  mem_bytes_in_use, mem_in_use_delta (including padding)
  mem_arena_held, mem_arena_held_delta (total device memory held)
- Change KernelScope to use IAllocator::GetStats() instead of AsArena()->GetStats()
  so plugin EPs with GetStats support also report memory stats
- Implement GetStats() override in IAllocatorWrappingOrtAllocator to bridge
  plugin EP allocator stats via OrtKeyValuePairs
- Update CudaMempoolArena (in-tree and plugin) to report bytes_requested_in_use
- Update allocator_adapters.cc serialization/deserialization for RequestedInUse
- Add per-field comments to plugin AllocatorStats copies

@github-actions github-actions Bot left a comment


You can commit the suggested changes from lintrunner.

Comment thread onnxruntime/test/framework/bfc_arena_test.cc Outdated

Copilot AI left a comment


Pull request overview

This PR extends ONNX Runtime’s allocator statistics and profiling output by adding a new bytes_requested_in_use metric (bytes requested by user code, excluding internal padding/fragmentation) and plumbing it through core allocators, CUDA allocators/plugins, EP adapter layers, and tests. It also adds per-kernel memory stats to sequential executor profiling events and a small MSVC build tweak for CUDA provider targets.

Changes:

  • Add bytes_requested_in_use to AllocatorStats across core/framework, CUDA plugin interfaces, EP adapter, and example plugin EP, including KVP/string serialization.
  • Update arena/mempool allocator implementations to correctly maintain bytes_requested_in_use, and emit new memory profiling args per kernel in sequential_executor.cc.
  • Add/adjust unit tests to validate requested-vs-actual in-use accounting for arena and mempool allocators; add /bigobj for MSVC CXX compilation in CUDA provider CMake.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/shared_lib/test_inference.cc | Adds a profiling JSON test to assert memory-stat args are present in profiling events. |
| onnxruntime/test/providers/cuda/plugin/cuda_plugin_arena_test.cc | Adds CUDA plugin arena test validating RequestedInUse tracking vs InUse rounding. |
| onnxruntime/test/providers/cuda/cuda_mempool_arena_test.cc | Extends mempool arena test to validate requested==in_use for mempool allocators. |
| onnxruntime/test/framework/bfc_arena_test.cc | Adds BFC arena tests validating requested-vs-in-use behavior and reserve behavior. |
| onnxruntime/test/autoep/library/example_plugin_ep/ep_arena.cc | Updates example plugin EP arena stats accounting to track requested bytes. |
| onnxruntime/test/autoep/library/example_plugin_ep/ep_allocator.h | Extends example plugin EP AllocatorStats struct and its KVP/string output. |
| onnxruntime/core/session/allocator_adapters.cc | Adds parsing/serialization support for RequestedInUse in allocator adapters. |
| onnxruntime/core/providers/cuda/plugin/cuda_mempool_allocator_plugin.cc | Reports bytes_requested_in_use for CUDA mempool plugin allocator stats. |
| onnxruntime/core/providers/cuda/plugin/cuda_arena.cc | Tracks requested bytes in CUDA plugin arena allocation/free paths. |
| onnxruntime/core/providers/cuda/plugin/cuda_allocator_plugin.h | Extends CUDA plugin AllocatorStats and its KVP/string serialization. |
| onnxruntime/core/providers/cuda/cuda_mempool_arena.cc | Reports requested bytes (== in use) for CUDA mempool arena stats. |
| onnxruntime/core/framework/sequential_executor.cc | Samples allocator stats before/after kernel execution and emits new profiling args (including requested bytes). |
| onnxruntime/core/framework/bfc_arena.cc | Tracks requested bytes in core BFC arena allocation/reserve/free paths. |
| onnxruntime/core/framework/allocator_stats.h | Adds bytes_requested_in_use to the core AllocatorStats struct and debug string output. |
| include/onnxruntime/ep/adapter/allocator.h | Implements GetStats() for IAllocatorWrappingOrtAllocator by parsing KVPs, including requested bytes. |
| cmake/onnxruntime_providers_cuda.cmake | Adds /bigobj for MSVC CXX compilation of CUDA provider targets. |

Comment thread onnxruntime/test/shared_lib/test_inference.cc Outdated
Comment thread onnxruntime/test/shared_lib/test_inference.cc Outdated
Comment thread onnxruntime/test/shared_lib/test_inference.cc Outdated
Comment thread include/onnxruntime/ep/adapter/allocator.h Outdated

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

Comment thread onnxruntime/test/shared_lib/test_inference.cc
Comment thread include/onnxruntime/ep/adapter/allocator.h Outdated
Comment thread onnxruntime/test/shared_lib/test_inference.cc Outdated
Comment thread onnxruntime/core/framework/sequential_executor.cc

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Comment thread onnxruntime/test/shared_lib/test_inference.cc Outdated
Comment thread include/onnxruntime/ep/adapter/allocator.h Outdated

@tianleiwu tianleiwu left a comment


Reviewed the latest head (efa95bb). The change is clean and the new bytes_requested_in_use accounting is correct and complete:

  • All four BFC bytes_in_use mutation sites (Reserve, SplitFreeChunkFromBin, reserved-chunk Free, FreeAndMaybeCoalesce) have matching bytes_requested_in_use updates, and chunk->requested_size is set on allocation, so alloc/free stay symmetric and the counter returns to 0. The CUDA and EP plugin arenas follow the same balanced pattern.
  • Mempool allocators correctly report requested == in_use because in_use_bytes_ already tracks the user-requested size.
  • Inserting the field mid-struct is ABI-safe (stats cross the C boundary as OrtKeyValuePairs strings, not by struct layout).
  • KernelScope sampling is correctly gated behind profiling, and has_meaningful_stats_ guarantees a non-null allocator in the after-block.
  • /bigobj for CXX is correctly inside the if(MSVC) block.

All earlier review threads are resolved and the fixes are incorporated in this head. One minor, non-blocking maintainability suggestion is left inline.

Comment thread include/onnxruntime/ep/adapter/allocator.h
…ParseStringWithClassicLocale and Ort::UnownedAllocator
@yuslepukhin yuslepukhin requested a review from tianleiwu June 16, 2026 01:04
Comment thread onnxruntime/core/framework/allocator_stats.h Outdated
Comment thread include/onnxruntime/ep/adapter/allocator.h Outdated

@tianleiwu tianleiwu left a comment


Reviewed the latest head (1913e5e). I found two profiling-stat correctness gaps that look worth addressing before relying on the new memory fields across providers. I did not repost the existing first-use has_meaningful_stats_ thread; that concern already has a current-head discussion.

Comment thread onnxruntime/core/framework/allocator_stats.h
Comment thread onnxruntime/core/providers/cuda/cuda_mempool_arena.cc Outdated
…ool total_allocated_bytes

- JS WebGpuAllocator: mirror bytes_requested_in_use alongside bytes_in_use
- WebNN TensorAllocator: same (no padding, so requested == actual)
- CUDA mempool: query cudaMemPoolAttrReservedMemCurrent for total_allocated_bytes
  instead of using the monotonically-increasing cumulative counter
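Why the source of `total_allocated_bytes` matters for per-kernel deltas can be seen in a small model. This sketch does not call the CUDA runtime; `ToyPool` and `Delta` are hypothetical names, and the two counters merely mimic a cumulative allocation counter versus a "currently reserved" reading such as the one the PR takes from `cudaMemPoolAttrReservedMemCurrent`.

```cpp
#include <cassert>
#include <cstdint>

// Toy pool with two ways of answering "how much memory is held?".
struct ToyPool {
  std::uint64_t cumulative_allocated = 0;  // only ever grows
  std::uint64_t currently_reserved = 0;    // grows and shrinks

  void Reserve(std::uint64_t n) {
    cumulative_allocated += n;
    currently_reserved += n;
  }

  void Release(std::uint64_t n) {
    assert(n <= currently_reserved);
    currently_reserved -= n;  // the cumulative counter is left untouched
  }
};

// Signed difference between two samples, as a profiler would report it.
std::int64_t Delta(std::uint64_t before, std::uint64_t after) {
  return static_cast<std::int64_t>(after) - static_cast<std::int64_t>(before);
}
```

If a kernel releases 1024 bytes and reserves 512, the delta of the cumulative counter is +512 even though the pool now holds 512 bytes less; the delta of the current reading is -512, which is what a "memory held" field is expected to show.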

@tianleiwu tianleiwu left a comment


Re-reviewed at head 56d29b5f. The two concerns I previously raised are addressed in follow-up commits and their threads are now resolved:

  • Non-arena allocators (JS WebGpuAllocator, WebNN WebNNTensorAllocator) now mirror bytes_requested_in_use with bytes_in_use, so profiling no longer emits 0 for them.
  • CudaMempoolArena::GetStats now sources total_allocated_bytes from cudaMemPoolAttrReservedMemCurrent instead of the monotonic counter, keeping mem_arena_held semantics consistent.

The arena accounting is symmetric (requested_size is set at alloc time and subtracted on free) and the per-kernel sampling is fully gated behind profiling, so there is no hot-path cost when profiling is off. One low-priority robustness note left inline. Looks good overall.

Comment thread onnxruntime/core/framework/sequential_executor.cc Outdated
tianleiwu previously approved these changes Jun 16, 2026
… stats, guard destructor GetStats

- JS WebGpuAllocator: mirror bytes_requested_in_use alongside bytes_in_use
- WebNN TensorAllocator: same (no padding, so requested == actual)
- CUDA mempool: query cudaMemPoolAttrReservedMemCurrent for total_allocated_bytes
  instead of using the monotonically-increasing cumulative counter
- KernelScope destructor: wrap after-kernel GetStats in try/catch since plugin EP
  allocators can throw through Ort::ThrowOnError and destructors are noexcept
Comment thread onnxruntime/core/framework/sequential_executor.cc Outdated
Comment thread onnxruntime/test/autoep/library/example_plugin_ep/ep_allocator.h Outdated
Comment thread onnxruntime/core/providers/cuda/plugin/cuda_allocator_plugin.h Outdated
…ding

- In KernelScope destructor, catch only GetStats failures via ORT_TRY/ORT_CATCH
  and skip mem_* args when stats retrieval fails
- Keep other destructor code unchanged
- Align AllocatorStats comment wording in plugin headers to use 'padding'
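The destructor guard described in this commit can be sketched as follows. It uses plain `try`/`catch` rather than the `ORT_TRY`/`ORT_CATCH` macros, and `ThrowingAllocator`, `GuardedScope`, and the `kernel_done` argument are hypothetical names for illustration. Only the stats call is wrapped, so a failure there drops the `mem_*` argument but leaves the rest of the destructor's work intact and never lets an exception escape.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical allocator whose stats query may throw, as a plugin-backed
// allocator might when the underlying call reports an error.
struct ThrowingAllocator {
  bool fail = false;
  long long bytes_in_use = 0;

  long long GetBytesInUse() const {
    if (fail) throw std::runtime_error("stats unavailable");
    return bytes_in_use;
  }
};

using EventArgs = std::map<std::string, std::string>;

class GuardedScope {
 public:
  GuardedScope(const ThrowingAllocator* alloc, EventArgs& args)
      : alloc_(alloc), args_(args) {
    assert(alloc_ != nullptr);
  }

  // Destructors are implicitly noexcept, so a throwing stats query must be
  // contained here.
  ~GuardedScope() {
    long long after = 0;
    bool have_stats = true;
    try {
      after = alloc_->GetBytesInUse();
    } catch (const std::exception&) {
      have_stats = false;  // skip the mem_* args, keep everything else
    }
    if (have_stats) {
      args_["mem_bytes_in_use"] = std::to_string(after);
    }
    args_["kernel_done"] = "1";  // unrelated destructor work still runs
  }

 private:
  const ThrowingAllocator* alloc_;
  EventArgs& args_;
};
```

Keeping the `try` block narrow means an unrelated bug elsewhere in the destructor is not silently swallowed by the same handler.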
Comment thread onnxruntime/core/framework/sequential_executor.cc
4 participants