perf: gpu memory optimization by omitting the bottom rows of the merkle tree#2840
perf: gpu memory optimization by omitting the bottom rows of the merkle tree#2840GunaDD wants to merge 4 commits into
Conversation
|
Claude finished @GunaDD's task in 4m 47s —— View job Code Review: GPU Memory Optimization for Merkle Tree
SummaryThis PR reduces GPU memory consumption by omitting the bottom 3 levels of large merkle subtrees from GPU buffers. Instead of storing the full The optimization is well-structured: a I verified the correctness of the index arithmetic, the guard ordering in Issues1. Unused
|
This comment has been minimized.
This comment has been minimized.
296adcd to
fd69b88
Compare
c279370 to
c70a343
Compare
This comment has been minimized.
This comment has been minimized.
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. |
This comment has been minimized.
This comment has been minimized.
|
reth benchmark: https://github.com/axiom-crypto/openvm-eth/actions/runs/26959247863 proving time increased by 0.77s but the generate mem proving ctxs went down from 3.84 G to 2.09 G and set initial memory went down from 3.63 G to 1.88 G |
This comment has been minimized.
This comment has been minimized.
c0b1165 to
1886d1b
Compare
This comment has been minimized.
This comment has been minimized.
shuklaayush
left a comment
There was a problem hiding this comment.
dropped some comments. i'll leave it to @Golovanov399 and/or @gaxiom to review the cuda changes in detail
| inline constexpr size_t BLOCKS_PER_LEAF = DIGEST_WIDTH / BLOCK_FE_WIDTH; | ||
|
|
||
| // Number of bottom merkle levels omitted from large GPU subtree buffers. | ||
| inline constexpr size_t OMITTED_BOTTOM_LEVELS = 3; |
There was a problem hiding this comment.
i feel like this is conceptually similar to paging so let's call this MERKLE_PAGE_BITS instead. see how metered execution also does paging and this old doc which has some diagrams
There was a problem hiding this comment.
hmmm what does PAGE mean there actually
There was a problem hiding this comment.
essentially what you're doing is saying that instead of storing the commitment to individual memory addresses, you'll only store a commitment to a collection of memory addresses (a page)
this is similar to how virtual memory is divided into pages (not exactly but close enough)
| /// Returns the bounds [start, end) of the layer at the given depth. | ||
| /// These bounds correspond to the indices of the layer in the buffer. | ||
| /// depth: 0 = root, 1 = root's children, ..., height-1 = leaves | ||
| pub fn layer_bounds(&self, depth: usize) -> (usize, usize) { |
There was a problem hiding this comment.
we can probably just remove this
There was a problem hiding this comment.
should we do it though ?
There was a problem hiding this comment.
yeah, let's just get rid of it
2995a72 to
e582d73
Compare
|
Should it be rebased? |
1886d1b to
4c9cea5
Compare
This comment has been minimized.
This comment has been minimized.
52efe21 to
4c9cea5
Compare
This comment has been minimized.
This comment has been minimized.
Golovanov399
left a comment
There was a problem hiding this comment.
I left some comments, in principle I understand if not redundantly recomputing stuff twice for close addresses is out of the scope of this task, but I think it's worth doing in long term and if it's not in this PR then I ask you to make a corresponding ticket for the future
|
|
||
| // cells is a 2-to-1 compression buffer | ||
| Fp cells[CELLS]; | ||
| for (size_t width = num_leaves / 2; width > 0; width /= 2) { |
There was a problem hiding this comment.
You either don't need 1 << OMITTED_BOTTOM_LEVELS in layer (you only use half of this memory, yet it probably affects register slippage or something), or you may want to parallelize computation on each of the omitted levels (probably not worth it). Or maybe you want to delegate this all to the same SM and do some device synchronize and whatnot, but it can be postponed
| auto const old_right_digest = layer_value_on_height( | ||
| subtree_layer, | ||
| digest_t old_right_digest; | ||
| load_virtual_node( |
There was a problem hiding this comment.
Noting that you may save some recomputing on the lowest levels here by somehow reusing what you need from old_left_digest for old_right_digest
| /// Shared handle to the initial-memory buffer (`d_data`) captured in [`Self::build_async`]. | ||
| /// `None` for empty/dummy subtrees that have no backing buffer. | ||
| /// | ||
| /// Holding an `Arc` makes the subtree a co-owner of the buffer, so the host cannot free it | ||
| /// while the subtree is alive. Under the `OmitBottomLevels` layout the omitted bottom levels | ||
| /// are never materialized into `buf`; they are recomputed on demand from this buffer during | ||
| /// [`MemoryMerkleTree::update_with_touched_blocks`] (see `recompute_omitted_node` in | ||
| /// `merkle_tree.cu`), so the buffer must stay alive until that update completes. | ||
| /// | ||
| /// NOTE: this only fixes host-side ownership. The buffer is also consumed by GPU kernels | ||
| /// enqueued on the stream, so it must additionally outlive those kernels — that ordering is | ||
| /// guaranteed by the `stream.synchronize()` in [`MemoryMerkleTree::drop_subtrees`], not by the | ||
| /// `Arc`. Drop the subtrees (which releases these handles) only after that sync. | ||
| initial_data: Option<Arc<DeviceBuffer<u8>>>, |
There was a problem hiding this comment.
comments like this are too verbose and borderline slop. see if you can trim them down
dac2c10 to
db32ec4
Compare
814c620 to
0e6d6a6
Compare
Note: cells_used metrics omitted because CUDA tracegen does not expose unpadded trace heights. Commit: 0e6d6a6 |
Motivation: A memory optimization of the GPU merkle tree is needed (once we go to 2^32 address space size) because currently how the MemoryMerkleTree works is that if we have N bytes then we will have N/16 leaves (and N/8 nodes in total) and each leaf is stored as 8 field elements. This means we need 4N bytes of VRAM to store the memory merkle tree in the GPU. With N = 2^32 (once we go to 2^32 AS size), 4N is roughy 17 GB which will OOM the GPU (our current limit is ~15 G).
Optimization idea of this PR: Don't store the last OMITTED_BOTTOM_LEVELS levels of the MemoryMerkleSubTree in the GPU memory and only computing it when needed for address spaces with large sizes (AS2 and deferral AS). This is done by recomputing (i.e. re-computing the poseidon2 hash) from the raw memory. This saves the GPU memory required for the Merkle tree itself by a factor of 8.
Reth benchmark results: https://github.com/axiom-crypto/openvm-eth/actions/runs/26959247863
The summary of the result is proving time increased by 0.77s but the generate mem proving ctxs went down from 3.84 G to 2.09 G and set initial memory went down from 3.63 G to 1.88 G. Note that it didn't went down by a factor of 8 because there are other things that uses the memory. The other things here include the boundarychipgpu and initial_memory buffer (see the MemoryInventoryGPU struct for details).
Closes INT-8079