Add memory stats to profiling by yuslepukhin · Pull Request #29058 · microsoft/onnxruntime

yuslepukhin · 2026-06-15T18:18:24Z

This pull request introduces enhanced memory profiling capabilities by adding a new metric, bytes_requested_in_use, to allocator statistics throughout the ONNX Runtime codebase. This metric tracks the memory actually requested by user code, excluding internal fragmentation and padding, and is now reported alongside existing memory usage statistics. The changes span core framework allocators, CUDA providers, plugin interfaces, and kernel execution profiling.

Allocator statistics improvements:

Added a new field, bytes_requested_in_use, to the AllocatorStats struct in multiple locations (core, CUDA, plugin, and test). It tracks the number of bytes actually requested by user code, as distinct from the total bytes in use, which may include internal padding. The field is initialized, serialized, and included in string/key-value representations.
Updated arena allocator implementations in both the core and CUDA providers to increment and decrement bytes_requested_in_use appropriately during allocation, reservation, splitting, and freeing of memory chunks.
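
The symmetric accounting described above can be illustrated with a toy arena. This is a minimal sketch, not the BFC arena itself: `ToyArena`, `ToyAllocatorStats`, the integer handles, and the 256-byte rounding granularity are all assumptions made for illustration. The point it shows is that each chunk remembers its requested size at allocation time, so the free path can subtract exactly what the alloc path added.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Simplified stand-in for AllocatorStats; only the two counters discussed here.
struct ToyAllocatorStats {
  int64_t bytes_in_use = 0;            // includes rounding/padding
  int64_t bytes_requested_in_use = 0;  // what callers actually asked for
};

// Hypothetical arena that rounds every request up to a 256-byte multiple.
class ToyArena {
 public:
  static constexpr size_t kAlignment = 256;  // assumed granularity

  // Returns an opaque handle instead of real memory to keep the sketch small.
  int Alloc(size_t requested) {
    const size_t rounded = (requested + kAlignment - 1) / kAlignment * kAlignment;
    const int handle = next_handle_++;
    chunks_[handle] = Chunk{rounded, requested};  // remember both sizes
    stats_.bytes_in_use += static_cast<int64_t>(rounded);
    stats_.bytes_requested_in_use += static_cast<int64_t>(requested);
    return handle;
  }

  void Free(int handle) {
    auto it = chunks_.find(handle);
    assert(it != chunks_.end() && "double free or unknown handle");
    // Subtract exactly what Alloc added, using the sizes stored on the chunk.
    stats_.bytes_in_use -= static_cast<int64_t>(it->second.size);
    stats_.bytes_requested_in_use -= static_cast<int64_t>(it->second.requested_size);
    chunks_.erase(it);
  }

  const ToyAllocatorStats& GetStats() const { return stats_; }

 private:
  struct Chunk {
    size_t size;            // rounded size actually carved out
    size_t requested_size;  // size the caller asked for
  };
  std::unordered_map<int, Chunk> chunks_;
  ToyAllocatorStats stats_;
  int next_handle_ = 0;
};
```

With this model, a 100-byte request shows up as 256 bytes in use and 100 bytes requested, and both counters return to zero once every chunk is freed.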

CUDA and plugin support:

Modified CUDA mempool allocators and plugins to report bytes_requested_in_use (equal to bytes_in_use since there is no padding in mempool allocators), ensuring consistent reporting across all allocator types.

Adapter and API changes:

Updated the allocator adapter logic to parse, propagate, and serialize the new RequestedInUse field in key-value pairs, enabling plugins and external allocators to participate in the enhanced memory profiling.
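
To make the key-value round trip concrete, here is a hedged sketch of serializing stats to string pairs and parsing them back strictly. It does not use the real OrtKeyValuePairs API: `KeyValuePairs` is a plain `std::map`, and `ToyStats`, `ToKeyValuePairs`, `ParseInt64`, and `FromKeyValuePairs` are hypothetical names. Only the key strings `InUse` and `RequestedInUse` are taken from this PR. Parsing uses `std::from_chars`, which is locale-independent and makes it easy to reject trailing garbage.

```cpp
#include <charconv>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

struct ToyStats {
  int64_t bytes_in_use = 0;
  int64_t bytes_requested_in_use = 0;
};

// Stand-in for a string-to-string key-value container.
using KeyValuePairs = std::map<std::string, std::string>;

KeyValuePairs ToKeyValuePairs(const ToyStats& s) {
  return {{"InUse", std::to_string(s.bytes_in_use)},
          {"RequestedInUse", std::to_string(s.bytes_requested_in_use)}};
}

// Strict parse: the whole string must be a base-10 integer that fits in int64_t.
std::optional<int64_t> ParseInt64(std::string_view text) {
  int64_t value = 0;
  const char* first = text.data();
  const char* last = first + text.size();
  const auto [ptr, ec] = std::from_chars(first, last, value);
  if (ec != std::errc{} || ptr != last) return std::nullopt;
  return value;
}

// Missing keys keep their default of 0; a malformed value fails the whole
// parse and leaves `out` untouched.
bool FromKeyValuePairs(const KeyValuePairs& kvps, ToyStats& out) {
  struct Field {
    const char* key;
    int64_t ToyStats::* member;
  };
  const Field fields[] = {{"InUse", &ToyStats::bytes_in_use},
                          {"RequestedInUse", &ToyStats::bytes_requested_in_use}};
  ToyStats parsed;
  for (const Field& f : fields) {
    const auto it = kvps.find(f.key);
    if (it == kvps.end()) continue;
    const std::optional<int64_t> v = ParseInt64(it->second);
    if (!v) return false;
    parsed.*(f.member) = *v;
  }
  out = parsed;
  return true;
}
```

Because the stats travel as strings keyed by name, a producer that does not know about `RequestedInUse` simply omits the key and the consumer sees 0 for it.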

Kernel execution memory profiling:

Enhanced the KernelScope in the sequential executor to sample and emit both bytes_in_use and bytes_requested_in_use before and after kernel execution, providing more granular memory profiling in event logs.

Build system minor fix:

Added the /bigobj compiler flag for C++ targets in the CUDA provider CMake file to prevent object file size limitations on MSVC.

…ry collection

- Add bytes_requested_in_use field to AllocatorStats (tracks actual user-requested bytes excluding arena padding/fragmentation)
- Update BFCArena, CUDA plugin arena, and test plugin arena to track bytes_requested_in_use symmetrically in alloc/free paths
- Rename profiling event fields for clarity:
  - mem_bytes_requested_in_use, mem_requested_in_use_delta (actual user bytes)
  - mem_bytes_in_use, mem_in_use_delta (including padding)
  - mem_arena_held, mem_arena_held_delta (total device memory held)
- Change KernelScope to use IAllocator::GetStats() instead of AsArena()->GetStats() so plugin EPs with GetStats support also report memory stats
- Implement GetStats() override in IAllocatorWrappingOrtAllocator to bridge plugin EP allocator stats via OrtKeyValuePairs
- Update CudaMempoolArena (in-tree and plugin) to report bytes_requested_in_use
- Update allocator_adapters.cc serialization/deserialization for RequestedInUse
- Add per-field comments to plugin AllocatorStats copies
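
The before/after sampling that produces these event fields can be sketched as an RAII scope. This is an illustration under assumptions, not the code in sequential_executor.cc: `ToyKernelScope`, `ToyStats`, `EventArgs`, and the `std::function` stats callback are hypothetical, and mapping `mem_arena_held` to `total_allocated_bytes` is inferred from the review discussion rather than confirmed. Only the six `mem_*` field names come from the commit message.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

struct ToyStats {
  int64_t bytes_in_use = 0;
  int64_t bytes_requested_in_use = 0;
  int64_t total_allocated_bytes = 0;  // assumed source of "arena held"
};

// Stand-in for the args attached to one profiling event.
using EventArgs = std::map<std::string, std::string>;

// Samples stats when the kernel starts and again when the scope ends,
// then records absolute values and deltas. Does nothing when profiling is off.
class ToyKernelScope {
 public:
  ToyKernelScope(bool profiling_enabled, std::function<ToyStats()> get_stats, EventArgs& args)
      : enabled_(profiling_enabled), get_stats_(std::move(get_stats)), args_(args) {
    if (enabled_) before_ = get_stats_();
  }

  // The sketch assumes get_stats_ does not throw; a real destructor would
  // need to guard against that.
  ~ToyKernelScope() {
    if (!enabled_) return;
    const ToyStats after = get_stats_();
    Emit("mem_bytes_requested_in_use", after.bytes_requested_in_use);
    Emit("mem_requested_in_use_delta", after.bytes_requested_in_use - before_.bytes_requested_in_use);
    Emit("mem_bytes_in_use", after.bytes_in_use);
    Emit("mem_in_use_delta", after.bytes_in_use - before_.bytes_in_use);
    Emit("mem_arena_held", after.total_allocated_bytes);
    Emit("mem_arena_held_delta", after.total_allocated_bytes - before_.total_allocated_bytes);
  }

  ToyKernelScope(const ToyKernelScope&) = delete;
  ToyKernelScope& operator=(const ToyKernelScope&) = delete;

 private:
  void Emit(const char* key, int64_t value) { args_[key] = std::to_string(value); }

  bool enabled_;
  std::function<ToyStats()> get_stats_;
  EventArgs& args_;
  ToyStats before_;
};
```

Gating both samples on the profiling flag keeps the stats query off the hot path when profiling is disabled, which matches the behavior the reviewers describe.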

github-actions

You can commit the suggested changes from lintrunner.

Copilot

Pull request overview

This PR extends ONNX Runtime’s allocator statistics and profiling output by adding a new bytes_requested_in_use metric (bytes requested by user code, excluding internal padding/fragmentation) and plumbing it through core allocators, CUDA allocators/plugins, EP adapter layers, and tests. It also adds per-kernel memory stats to sequential executor profiling events and a small MSVC build tweak for CUDA provider targets.

Changes:

Add bytes_requested_in_use to AllocatorStats across core/framework, CUDA plugin interfaces, EP adapter, and example plugin EP, including KVP/string serialization.
Update arena/mempool allocator implementations to correctly maintain bytes_requested_in_use, and emit new memory profiling args per kernel in sequential_executor.cc.
Add/adjust unit tests to validate requested-vs-actual in-use accounting for arena and mempool allocators; add /bigobj for MSVC CXX compilation in CUDA provider CMake.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/shared_lib/test_inference.cc | Adds a profiling JSON test to assert memory-stat args are present in profiling events. |
| onnxruntime/test/providers/cuda/plugin/cuda_plugin_arena_test.cc | Adds CUDA plugin arena test validating `RequestedInUse` tracking vs `InUse` rounding. |
| onnxruntime/test/providers/cuda/cuda_mempool_arena_test.cc | Extends mempool arena test to validate requested==in_use for mempool allocators. |
| onnxruntime/test/framework/bfc_arena_test.cc | Adds BFC arena tests validating requested-vs-in-use behavior and reserve behavior. |
| onnxruntime/test/autoep/library/example_plugin_ep/ep_arena.cc | Updates example plugin EP arena stats accounting to track requested bytes. |
| onnxruntime/test/autoep/library/example_plugin_ep/ep_allocator.h | Extends example plugin EP `AllocatorStats` struct and its KVP/string output. |
| onnxruntime/core/session/allocator_adapters.cc | Adds parsing/serialization support for `RequestedInUse` in allocator adapters. |
| onnxruntime/core/providers/cuda/plugin/cuda_mempool_allocator_plugin.cc | Reports `bytes_requested_in_use` for CUDA mempool plugin allocator stats. |
| onnxruntime/core/providers/cuda/plugin/cuda_arena.cc | Tracks requested bytes in CUDA plugin arena allocation/free paths. |
| onnxruntime/core/providers/cuda/plugin/cuda_allocator_plugin.h | Extends CUDA plugin `AllocatorStats` and its KVP/string serialization. |
| onnxruntime/core/providers/cuda/cuda_mempool_arena.cc | Reports requested bytes (== in use) for CUDA mempool arena stats. |
| onnxruntime/core/framework/sequential_executor.cc | Samples allocator stats before/after kernel execution and emits new profiling args (including requested bytes). |
| onnxruntime/core/framework/bfc_arena.cc | Tracks requested bytes in core BFC arena allocation/reserve/free paths. |
| onnxruntime/core/framework/allocator_stats.h | Adds `bytes_requested_in_use` to the core `AllocatorStats` struct and debug string output. |
| include/onnxruntime/ep/adapter/allocator.h | Implements `GetStats()` for `IAllocatorWrappingOrtAllocator` by parsing KVPs, including requested bytes. |
| cmake/onnxruntime_providers_cuda.cmake | Adds `/bigobj` for MSVC CXX compilation of CUDA provider targets. |

…missing include

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

…s, improve comments

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

…uilds

tianleiwu

Reviewed the latest head (efa95bb). The change is clean and the new bytes_requested_in_use accounting is correct and complete:

All four BFC bytes_in_use mutation sites (Reserve, SplitFreeChunkFromBin, reserved-chunk Free, FreeAndMaybeCoalesce) have matching bytes_requested_in_use updates, and chunk->requested_size is set on allocation, so alloc/free stay symmetric and the counter returns to 0. The CUDA and EP plugin arenas follow the same balanced pattern.
Mempool allocators correctly report requested == in_use because in_use_bytes_ already tracks the user-requested size.
Inserting the field mid-struct is ABI-safe (stats cross the C boundary as OrtKeyValuePairs strings, not by struct layout).
KernelScope sampling is correctly gated behind profiling, and has_meaningful_stats_ guarantees a non-null allocator in the after-block.
/bigobj for CXX is correctly inside the if(MSVC) block.

All earlier review threads are resolved and the fixes are incorporated in this head. One minor, non-blocking maintainability suggestion is left inline.

…ParseStringWithClassicLocale and Ort::UnownedAllocator

…h version/null guard

tianleiwu

Reviewed the latest head (1913e5e). I found two profiling-stat correctness gaps that look worth addressing before relying on the new memory fields across providers. I did not repost the existing first-use has_meaningful_stats_ thread; that concern already has a current-head discussion.

…ool total_allocated_bytes

- JS WebGpuAllocator: mirror bytes_requested_in_use alongside bytes_in_use
- WebNN TensorAllocator: same (no padding, so requested == actual)
- CUDA mempool: query cudaMemPoolAttrReservedMemCurrent for total_allocated_bytes instead of using the monotonically-increasing cumulative counter

tianleiwu

Re-reviewed at head 56d29b5f. The two concerns I previously raised are addressed in follow-up commits and their threads are now resolved:

Non-arena allocators (JS WebGpuAllocator, WebNN WebNNTensorAllocator) now mirror bytes_requested_in_use with bytes_in_use, so profiling no longer emits 0 for them.
CudaMempoolArena::GetStats now sources total_allocated_bytes from cudaMemPoolAttrReservedMemCurrent instead of the monotonic counter, keeping mem_arena_held semantics consistent.
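
Why the choice of counter matters for `mem_arena_held` can be shown with a small model. This is a sketch under assumptions: `ToyMempool`, `ToyStats`, and `Trim` are hypothetical and no CUDA call is made. In the real allocator the PR text says the held value comes from cudaMemPoolAttrReservedMemCurrent; here a plain counter stands in for that query. The model also reports requested == in use, since a pool without padding has nothing to add on top of the caller's size.

```cpp
#include <cassert>
#include <cstdint>

struct ToyStats {
  int64_t bytes_in_use = 0;
  int64_t bytes_requested_in_use = 0;
  int64_t total_allocated_bytes = 0;  // memory currently held by the pool
};

// Hypothetical pool model: no padding, grows on demand, shrinks only on Trim().
class ToyMempool {
 public:
  void Alloc(int64_t bytes) {
    in_use_bytes_ += bytes;
    cumulative_allocated_ += bytes;  // only ever increases
    if (in_use_bytes_ > reserved_current_) reserved_current_ = in_use_bytes_;
  }

  void Free(int64_t bytes) {
    assert(bytes <= in_use_bytes_);
    in_use_bytes_ -= bytes;
  }

  // Returns unused reserved memory to the system.
  void Trim() { reserved_current_ = in_use_bytes_; }

  // The misleading candidate for "held": it never goes down.
  int64_t CumulativeAllocated() const { return cumulative_allocated_; }

  ToyStats GetStats() const {
    ToyStats s;
    s.bytes_in_use = in_use_bytes_;
    s.bytes_requested_in_use = in_use_bytes_;     // no padding, so identical
    s.total_allocated_bytes = reserved_current_;  // current, not cumulative
    return s;
  }

 private:
  int64_t in_use_bytes_ = 0;
  int64_t reserved_current_ = 0;
  int64_t cumulative_allocated_ = 0;
};
```

After allocating 300 bytes, freeing them, and trimming, the current-reservation value drops back to zero while the cumulative counter still reads 300, so a held-memory delta computed from the cumulative counter could never be negative.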

The arena accounting is symmetric (requested_size is set at alloc time and subtracted on free) and the per-kernel sampling is fully gated behind profiling, so there is no hot-path cost when profiling is off. One low-priority robustness note left inline. Looks good overall.

… stats, guard destructor GetStats

- JS WebGpuAllocator: mirror bytes_requested_in_use alongside bytes_in_use
- WebNN TensorAllocator: same (no padding, so requested == actual)
- CUDA mempool: query cudaMemPoolAttrReservedMemCurrent for total_allocated_bytes instead of using the monotonically-increasing cumulative counter
- KernelScope destructor: wrap after-kernel GetStats in try/catch since plugin EP allocators can throw through Ort::ThrowOnError and destructors are noexcept

…ding

- In KernelScope destructor, catch only GetStats failures via ORT_TRY/ORT_CATCH and skip mem_* args when stats retrieval fails
- Keep other destructor code unchanged
- Align AllocatorStats comment wording in plugin headers to use 'padding'
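
The destructor guard described in this commit message can be sketched in standard C++. This is an assumption-laden illustration: `GuardedScope`, `ToyStats`, `EventArgs`, and the `kernel_done` arg are hypothetical, and plain `try`/`catch` stands in for the ORT_TRY/ORT_CATCH macros. The idea shown is that only the stats query is wrapped, so a failing allocator costs the `mem_*` args and nothing else.

```cpp
#include <cstdint>
#include <exception>
#include <functional>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

struct ToyStats {
  int64_t bytes_in_use = 0;
};

using EventArgs = std::map<std::string, std::string>;

class GuardedScope {
 public:
  // The constructor may throw freely; only the destructor needs the guard.
  GuardedScope(std::function<ToyStats()> get_stats, EventArgs& args)
      : get_stats_(std::move(get_stats)), args_(args), before_(get_stats_()) {}

  ~GuardedScope() {
    // Unrelated destructor work proceeds regardless of stats availability.
    args_["kernel_done"] = "1";  // hypothetical arg, for illustration only

    const std::optional<ToyStats> after = TryGetStats();
    if (!after) return;  // stats retrieval failed: skip the mem_* args
    args_["mem_bytes_in_use"] = std::to_string(after->bytes_in_use);
    args_["mem_in_use_delta"] = std::to_string(after->bytes_in_use - before_.bytes_in_use);
  }

  GuardedScope(const GuardedScope&) = delete;
  GuardedScope& operator=(const GuardedScope&) = delete;

 private:
  // Narrow guard: only the call that is known to be able to throw is wrapped.
  std::optional<ToyStats> TryGetStats() noexcept {
    try {
      return get_stats_();
    } catch (const std::exception&) {
      return std::nullopt;
    }
  }

  std::function<ToyStats()> get_stats_;
  EventArgs& args_;
  ToyStats before_;
};
```

Destructors are implicitly `noexcept`, so an exception escaping one calls `std::terminate`; returning an empty optional turns that into a missing profiling field instead.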

yuslepukhin added 2 commits June 12, 2026 12:16

Add bytes_requested_in_use tests and fix CUDA provider /bigobj

57ebd8a

yuslepukhin requested a review from Copilot June 15, 2026 18:18

Copilot started reviewing on behalf of yuslepukhin June 15, 2026 18:19

github-actions Bot reviewed Jun 15, 2026

Comment thread onnxruntime/test/framework/bfc_arena_test.cc Outdated

Copilot AI reviewed Jun 15, 2026

Comment thread onnxruntime/test/shared_lib/test_inference.cc Outdated

Comment thread onnxruntime/test/shared_lib/test_inference.cc Outdated

Comment thread onnxruntime/test/shared_lib/test_inference.cc Outdated

Comment thread include/onnxruntime/ep/adapter/allocator.h Outdated

Address PR review comments: fix key names, add noexcept cleanup, add …

58f2668

…missing include

yuslepukhin requested a review from Copilot June 15, 2026 18:50

Copilot started reviewing on behalf of yuslepukhin June 15, 2026 18:51

Copilot AI reviewed Jun 15, 2026

Comment thread onnxruntime/test/shared_lib/test_inference.cc

Comment thread include/onnxruntime/ep/adapter/allocator.h Outdated

Comment thread onnxruntime/test/shared_lib/test_inference.cc Outdated

Comment thread onnxruntime/core/framework/sequential_executor.cc

yuslepukhin added 2 commits June 15, 2026 12:25

Address PR review: strtoll validation, add requested_in_use assertion…

2a59563

…s, improve comments

Fix profiling_memory_stats test for builds without arena allocator

8ab1ccb

yuslepukhin requested a review from Copilot June 15, 2026 23:37

Copilot started reviewing on behalf of yuslepukhin June 15, 2026 23:38

Copilot AI reviewed Jun 15, 2026

Comment thread onnxruntime/test/shared_lib/test_inference.cc Outdated

Comment thread include/onnxruntime/ep/adapter/allocator.h Outdated

Address review: tighten KVP parsing and use GTEST_SKIP for no-arena b…

efa95bb

…uilds

tianleiwu reviewed Jun 16, 2026

Comment thread include/onnxruntime/ep/adapter/allocator.h

Unify KVP stats parsing: add AllocatorStats::SetFromKeyValue, use Try…

8c08ce2

…ParseStringWithClassicLocale and Ort::UnownedAllocator

yuslepukhin requested a review from tianleiwu June 16, 2026 01:04

edgchen1 reviewed Jun 16, 2026

Comment thread onnxruntime/core/framework/allocator_stats.h Outdated

edgchen1 reviewed Jun 16, 2026

Comment thread include/onnxruntime/ep/adapter/allocator.h Outdated

Address review: consistent comment terminology, replace catch-all wit…

1913e5e

…h version/null guard

tianleiwu reviewed Jun 16, 2026

Comment thread onnxruntime/core/framework/allocator_stats.h

Comment thread onnxruntime/core/providers/cuda/cuda_mempool_arena.cc Outdated

yuslepukhin requested review from edgchen1 and tianleiwu June 16, 2026 21:49

tianleiwu reviewed Jun 16, 2026

Comment thread onnxruntime/core/framework/sequential_executor.cc Outdated

tianleiwu previously approved these changes Jun 16, 2026

yuslepukhin dismissed tianleiwu’s stale review via 35288a2 June 16, 2026 23:20

edgchen1 reviewed Jun 17, 2026

Comment thread onnxruntime/core/framework/sequential_executor.cc Outdated

edgchen1 reviewed Jun 17, 2026

Comment thread onnxruntime/test/autoep/library/example_plugin_ep/ep_allocator.h Outdated

Comment thread onnxruntime/core/providers/cuda/plugin/cuda_allocator_plugin.h Outdated

tianleiwu approved these changes Jun 17, 2026

edgchen1 reviewed Jun 17, 2026

Comment thread onnxruntime/core/framework/sequential_executor.cc

edgchen1 approved these changes Jun 17, 2026

4 participants