feat(v2.1-rv64): switch to u16 limbs in deferral airs#2808
feat(v2.1-rv64): switch to u16 limbs in deferral airs#2808shuklaayush wants to merge 615 commits into
Conversation
## Summary - Removed `openvm/` prefix from source paths across 8 documentation files (21 occurrences) - Updated stale `v2-proof-system` repository reference in `docs/crates/recursion/README.md` to reflect the current openvm repo - Updated `stark-backend` path in `docs/crates/recursion/verifier-mapping.md` to link to the [stark-backend GitHub repo](https://github.com/openvm-org/stark-backend) Fixes issues identified in #2553 (comment) ## Test plan - [x] Verified no remaining `openvm/crates/recursion/src/` prefixes exist - [x] Verified no remaining `v2-proof-system` references exist - [x] Verified no remaining `stark-backend/crates/` local path references exist 🤖 Generated with [Claude Code](https://claude.ai/code) Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: Jonathan Wang <jonathanpwang@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This resolves INT-6814
This resolves INT-6828. Now `Eq3bAir` propagates the `row_idx` by the correct amount for each AIR. Note that there is no need for this air to know `air_idx` or anything other than the number of interactions and the `n_lift` for this air, so no new columns are added (but we now do the interaction with proof shape air on the last row of the air instead of the first one).
…2563) ## Summary - Add a `changes` detection job using `dorny/paths-filter@v3` to skip the `lint-cuda` job when no CUDA-related files (`*.cu`, `*.cuh`, `**/cuda/**`, `**/cuda*.rs`, etc.) are modified - Request at least 8 CPUs on the GPU runner (`/cpu=8`) for faster builds - The workflow file and CUDA cache action are also included as triggers so changes to CI itself still run the CUDA lint ## Test plan - [ ] Open a PR that doesn't touch any CUDA files and verify `lint-cuda` is skipped - [ ] Open a PR that touches a `.cu` or `.cuh` file and verify `lint-cuda` runs --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
~~- we need `cargo-openvm` from `develop-v2.0.0-rc.1` to run the reth benchmark for this branch since it adds new opcodes~~ resolves int-6949
## Summary - Adds `scripts/pre-push.sh`, a lightweight bash script that runs formatting, clippy, and tests locally on only the crates changed compared to a target branch (default: `develop-v2.0.0-beta`) - Auto-detects NVIDIA GPU via `nvidia-smi` and enables `--features cuda,touchemall` when present; CPU-only otherwise - Uses `cargo nextest` with `--cargo-profile=fast` when available (falls back to `cargo test`), and applies the `heavy` nextest profile for integration test crates ## Details The script follows three steps: 1. **`cargo +nightly fmt --all -- --check`** — full workspace format check (fast, nightly required per repo config) 2. **`cargo clippy`** — per changed crate, with appropriate feature flags and `-D warnings` 3. **`cargo nextest run`** — per changed crate, with `--cargo-profile=fast` and feature detection Crate detection walks changed file paths upward to find the nearest `Cargo.toml` with a `[package]` section. Benchmark and guest program crates are skipped (they require nightly `rust-src`). ## Usage ```bash ./scripts/pre-push.sh # compare against develop-v2.0.0-beta ./scripts/pre-push.sh main # compare against main ./scripts/pre-push.sh origin/my-feature # any valid git ref ``` ## Test plan - [ ] Run on a branch with Rust changes and verify only affected crates are checked - [ ] Run on a branch with no Rust changes (e.g., docs-only) and verify early exit - [ ] Run on a machine with GPU and verify `cuda,touchemall` features appear in output - [ ] Run on a machine without GPU and verify only `parallel` feature is used - [ ] Verify formatting failures are caught and reported --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: Jonathan Wang <jonathanpwang@users.noreply.github.com>
This comment was marked as outdated.
This comment was marked as outdated.
eef7099 to
5d9fd3b
Compare
|
Claude finished @shuklaayush's task in 2m 21s —— View job Code Review: PR #2808 — Switch to u16 limbs in deferral AIRs
This is a well-structured PR that consistently migrates deferral columns from byte cells to u16 cells while doubling the output sponge absorption rate. The changes are mechanically clean and internally consistent across Rust AIR, tracegen, CUDA, and the verify-stark subcircuit. I have a few observations: Findings1. In Let me re-read: pub(crate) fn combine_output_cells<T>(
output_commit: impl IntoIterator<Item = T>,
output_len_cells: [T; OUTPUT_LEN_NUM_U16S], // 4 elements
) -> [T; OUTPUT_TOTAL_NUM_U16S]But the callers pass 2. Sponge init-row range checks are correctly exempted (Positive)
3.
state[1] = F::from_usize(output_ref.len());If the output length ever exceeded 4. Missing range check on In This is likely intentional: 5. Verify-stark
6. RootCircuit refactored from The 7. CUDA and Rust column layout consistency The CUDA 8. Guest test expected outputs updated for 16-byte alignment (Correct) All guest program expected outputs ( 9. Minor:
SummaryThe PR is well-executed with consistent changes across the Rust AIR, tracegen, CUDA tracegen, and the verify-stark subcircuit. The key architectural change (u16 cells + doubled sponge rate) is cleanly factored through shared constants in Verdict: Looks good to merge. The observations above are minor and don't block. |
This comment has been minimized.
This comment has been minimized.
stephenh-axiom-xyz
left a comment
There was a problem hiding this comment.
Seems to be a lot of places that still use u8 instead of u16, I've pointed some out in comments
| &self, | ||
| builder: &mut AB, | ||
| x: &[AB::Var], | ||
| x: [AB::Expr; F_NUM_U16S], |
There was a problem hiding this comment.
nit: Is it possible to use a slice of Into<AB::Expr> or smth
| output_commit_bus, | ||
| def_idx: self.def_idx, | ||
| }; | ||
| let output_range_air = RangeCheckerAir::<U16_BITS> { |
There was a problem hiding this comment.
Given you converted the RISC-V chips to do 16-bit limb addition/multiplication, I think you could change the range checker in recursion to be 16-bits instead of adding it here?
There was a problem hiding this comment.
hmm yeah, i was trying to not touch recursion but i guess maybe just bumping the recursion range checker to be 16-bits isn't that big of a deal
| // The trace filler packs byte pairs into the u16-shaped columns. | ||
| #[repr(C)] | ||
| #[derive(Clone, Copy, Debug)] | ||
| pub struct DeferralCallReadsBytes<F> { |
There was a problem hiding this comment.
Why not initialize the DeferralCallReads<u16, F> during execution?
| @@ -50,8 +49,8 @@ use crate::{ | |||
| #[repr(C)] | |||
| #[derive(AlignedBorrow, StructReflection, Clone, Copy, Debug)] | |||
| pub struct DeferralCallReads<B, F> { | |||
There was a problem hiding this comment.
nit: B stands for bytes
| .fold(idx as u8, |acc, byte| acc.wrapping_add(*byte)); | ||
| let num_chunks = (seed as usize % 4) + 1; | ||
| let len = num_chunks * DIGEST_SIZE; | ||
| let len = num_chunks * SPONGE_BYTES_PER_ROW; |
There was a problem hiding this comment.
Why is the length changed here? The input should be in u16s now no?
|
|
||
| // First row sponge input is [deferral_idx, output_len, 0, ...]. | ||
| // Later rows sponge and write the next DIGEST_SIZE output bytes. | ||
| // Later rows sponge and write the next SPONGE_BYTES_PER_ROW output bytes. |
There was a problem hiding this comment.
Why aren't we measuring this in u16s?
| @@ -78,7 +81,10 @@ fn hash_output_raw<F: VmField>( | |||
| deferral_idx: u32, | |||
| output_ref: &[u8], | |||
There was a problem hiding this comment.
Same comment, why not u16?
Add comprehensive test coverage for decoder crate. `cargo tarpaulin` reports: ```bash || crates/toolchain/decoder/src/instruction_formats.rs: 48/48 +0.00% || crates/toolchain/decoder/src/process_instruction.rs: 148/150 +0.00% ``` The remaining 2 uncovered lines are unreachable match arms (`_ => None`) for a fully exhaustive 3-bit funct3 match in `process_instruction.rs`. towards INT-6467
There was a problem hiding this comment.
was this accidentally included ?
| debug_assert!(num_user_pvs >= DIGEST_SIZE); | ||
| debug_assert!(num_user_pvs.is_multiple_of(DIGEST_SIZE)); | ||
| debug_assert!((num_user_pvs / DIGEST_SIZE).is_power_of_two()); | ||
| assert_user_pvs_shape(num_user_pvs); | ||
| let encoder = Encoder::new(num_user_pvs / DIGEST_SIZE, MAX_ENCODER_DEGREE, true); |
There was a problem hiding this comment.
I guess this is unrelated to the u16 memory bus change
| memory_dimensions: MemoryDimensions, | ||
| num_user_pvs: usize, | ||
| ) -> Self { | ||
| assert_user_pvs_shape(num_user_pvs); |
There was a problem hiding this comment.
why was this assertion added now ? was this always needed or is this a new requirement
5d9fd3b to
d29250f
Compare
|
Claude finished @shuklaayush's task in 2m 49s —— View job Code Review: PR #2808 — Switch to u16 limbs in deferral AIRs
This is a clean, well-structured PR that consistently migrates deferral columns from byte cells to u16 cells and doubles the output sponge absorption rate. I reviewed all 40 changed files across Rust AIR, tracegen, CUDA, verify-stark, and continuations. Below are my findings, ordered by significance. Findings1. Missing per-cell u16 range checks on In The tracegen in 2. The new byte-shaped record types ( 3.
4. Sponge init-row range-check exemption is correct (Positive)
5. state[1] = F::from_usize(output_ref.len());For BabyBear, values >= 2^31 would silently reduce mod P. The 6.
7.
8. CUDA / Rust constant and column struct consistency (Correct)
9. The 10.
11. Adapter
Addressing reviewer comments@stephenh-axiom-xyz's comment about u8 vs u16: Several places (record types, @stephenh-axiom-xyz's comment about @stephenh-axiom-xyz's comment about recursion range checker: @shuklaayush noted this could be bumped to 16-bit. The current PR adds a dedicated @GunaDD's comment about @GunaDD's comment about SummaryThe PR is well-executed with consistent changes across Rust AIR, tracegen, CUDA, verify-stark, and continuations. The migration from byte cells to u16 cells is mechanically clean, and the doubled sponge absorption rate is correctly threaded through all layers. The Verdict: Looks good to merge. The only actionable suggestion is the minor defensive |
# Switch memory bus blocks to u16 cells
PR 1 of the memory-bus-u16 split. Atomic flip of the memory-bus block
from `8 u8 cells` to `4 u16 cells`. The bus interaction is now
fixed-shape (`BLOCK_FE_WIDTH = 4`), and byte-shaped chips pack their u8
columns into bus cells via a single helper.
## Why
`MemoryBus` interactions previously carried 8 field-element-encoded u8s
per message. RV64 cell-typed memory (registers, RV memory, public
values) is naturally u16-aligned, and packing two bytes per bus cell
halves the bus payload width without losing constraint power. This PR
makes the bus shape a single hard constant (`BLOCK_FE_WIDTH`) so the
bridge API and downstream chips no longer need a `const N` generic for
it.
Follow-up PRs will use the wider cells to narrow trace columns, prune
redundant range checks, reshape SHA-2 / Keccak state, and revisit
deferral.
## What changes
### Memory layout constants
Files: `crates/vm/src/arch/config.rs`,
`crates/vm/cuda/include/system/memory/params.cuh`.
- Single source of truth: `U16_CELL_SIZE = size_of::<u16>()` (intrinsic)
drives `BUS_PTR_SCALE`; `BLOCK_FE_WIDTH = 4` is the design choice;
`MEMORY_BLOCK_BYTES`, `BUS_BLOCK_STRIDE`, `BUS_LEAF_STRIDE`,
`DIGEST_WIDTH` all derive.
- Cross-constant equality asserts removed where the derivation makes
them true by construction.
- One divisibility guard kept on
`DIGEST_WIDTH.is_multiple_of(BLOCK_FE_WIDTH)`.
- CUDA-side `params.cuh` mirrors the constants and includes ASCII
diagrams for the u16-celled AS layout (RV64) and the F-celled AS layout
(`DEFERRAL_AS`), showing how `byte_ptr`, `cell_idx`, and `bus_ptr`
relate.
### Layout reference
Terminology:
- **Cell** — one storage word in an address space (`u16` for RV64 ASes,
`F` for `DEFERRAL_AS`).
- **Block** — one memory-bus message: `BLOCK_FE_WIDTH` cells =
`MEMORY_BLOCK_BYTES` bytes.
- **Digest** — output of one Poseidon2 compression, `DIGEST_WIDTH`
cells; also one merkle leaf.
- **Leaf** — one merkle-tree leaf = one Poseidon2 half = `DIGEST_WIDTH`
cells = `BLOCKS_PER_LEAF` blocks.
<details>
<summary><b>u16-celled AS layout</b> (<code>RV64_REGISTER_AS</code>,
<code>RV64_MEMORY_AS</code>, <code>PUBLIC_VALUES_AS</code>)</summary>
One merkle leaf = 16 bytes = 8 u16 cells = 2 bus blocks:
```text
byte_ptr: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
u8 storage: │b0 │b1 │b2 │b3 │b4 │b5 │b6 │b7 │b8 │b9 │b10│b11│b12│b13│b14│b15│
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
╰──────╯╰──────╯╰──────╯╰──────╯╰──────╯╰──────╯╰──────╯╰──────╯ u16 cells (LE)
cell_idx: 0 1 2 3 4 5 6 7
bus_ptr: 0 2 4 6 8 10 12 14
╰─────────── block 0 ──────────╯╰─────────── block 1 ──────────╯
╰────────────────── one merkle leaf / digest ──────────────────╯
```
- `byte_ptr = U16_CELL_SIZE * cell_idx` (coincides with `bus_ptr` for
u16 ASes)
- `bus_ptr = BUS_PTR_SCALE * cell_idx`
- Block stride on bus: `BUS_BLOCK_STRIDE = 8 = BUS_PTR_SCALE *
BLOCK_FE_WIDTH`
- Leaf stride on bus: `BUS_LEAF_STRIDE = 16 = BUS_PTR_SCALE *
DIGEST_WIDTH`
</details>
<details>
<summary><b>F-celled AS layout</b> (<code>DEFERRAL_AS</code>)</summary>
Each cell holds one `F` element (`size_of::<F>()` bytes on host; 4 for
BabyBear). The bus addresses cells directly, so cell / block / leaf
counts and `bus_ptr` / `cell_idx` mappings match the u16 AS.
```text
byte_ptr: 0 4 8 12 16 20 24 28
┌───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐
F storage: │ F0 │ F1 │ F2 │ F3 │ F4 │ F5 │ F6 │ F7 │
└───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
cell_idx: 0 1 2 3 4 5 6 7
bus_ptr: 0 2 4 6 8 10 12 14
╰─────────── block 0 ──────────╯╰─────────── block 1 ──────────╯
╰────────────────── one merkle leaf / digest ──────────────────╯
```
- `byte_ptr = size_of::<F>() * cell_idx` (host-memory stride; ≠
`bus_ptr`)
- `bus_ptr = BUS_PTR_SCALE * cell_idx` (same formula as u16 ASes)
- `BUS_BLOCK_STRIDE` and `BUS_LEAF_STRIDE` unchanged — the bus addresses
cells, not bytes.
</details>
### Address-space cell types
- `RV64_REGISTER_AS`, `RV64_MEMORY_AS`, `PUBLIC_VALUES_AS` switch to
u16-celled.
- `DEFERRAL_AS` stays F-celled (moves to u16 in a follow-up).
### Persistent memory and merkle
- Boundary AIR, merkle tree, merkle public-values, and CUDA inventory
rewritten for 4-cell-per-block u16 leaves (`BLOCKS_PER_LEAF =
DIGEST_WIDTH / BLOCK_FE_WIDTH = 2`).
- `assert_public_values_shape::<DIGEST_WIDTH>(num_public_values_bytes)`
and `SystemConfig::public_values_cell_count()` formalize the
public-values byte-count → cell-count contract at construction and at
every read site.
### Offline-checker bridge
File: `crates/vm/src/system/memory/offline_checker/bridge.rs`.
- `MemoryBridge::read` / `write` drop the `const N` generic; bus payload
is always `[T; BLOCK_FE_WIDTH]`.
- Unified `MemoryWriteAuxInput` lets `write` accept either an
`&MemoryWriteAuxCols<V, BLOCK_FE_WIDTH>` (standard case) or
`from_prev_data_exprs(&base_aux, prev_data_expr)` for chips that
materialize prev_data in their own columns (Keccak, SHA-2, LoadStore).
- Added `pack_u8_block` (AB::Expr) and `pack_u8_block_value` (F-value
form) so byte-shaped chips and testing helpers produce bus messages
identical to what u16 chips emit natively.
- Removed `read_or_immediate` / `MemoryReadOrImmediateOperation` (an
RV32 pattern superseded by `RV64_IMM_AS`).
### TracingMemory / GuestMemory byte view
- `read_bytes` / `write_bytes` (cell-type-aware, BLOCK-aligned) for raw
byte access against u16-celled storage.
- Deferral-specific helpers (`timed_read_deferral`,
`tracing_read_deferral`, …) for the F-celled AS.
### Chip call-sites
~50 sites across riscv, riscv-adapters, keccak256, sha2, deferral.
- Every `memory_bridge.read` / `.write` now hands the bridge a
`BLOCK_FE_WIDTH`-shaped payload.
- Byte-shaped chips wrap their `[T; 8]` u8 columns with `pack_u8_block`.
- Register-read paths still zero-extend the low 32 bits to 8 bytes, then
pack.
- Per-AS metered leaf-bit shift in
`execution_mode/metered/memory_ctx.rs` (`U16_AS_LEAF_PTR_BITS = 4` for
byte-pointer ASes, `CELL_AS_LEAF_PTR_BITS = 3` for cell-pointer ASes),
so RV64 byte pointers produce the right leaf label.
### Algebra / ECC adapter cleanup
No-op semantic; finishes upstream removal.
- Dropped phantom `BLOCK_SIZE` const generic from
`ModularAir`/`ModularChip`, `Fp2Air`/`Fp2Chip`,
`WeierstrassAir`/`WeierstrassChip` and their `get_*_air` / `get_*_chip`
helpers.
- The generic became unused after `Rv64VecHeapAdapter*` dropped its
`BLOCK_SIZE` parameter; keeping it was a phantom that broke inference at
call sites.
### Testing infrastructure
- `MemoryTester` / `DeviceMemoryTester` accept either `N =
BLOCK_FE_WIDTH` (cell index) or `N = MEMORY_BLOCK_BYTES` (byte pointer
in a u16 AS) and route accordingly; the byte path packs the result so
the dummy chip's bus message matches a byte-shaped chip's exactly.
- `pack_u8_block_value` is shared between the testing helper and the
CUDA boundary-chip test that previously hand-rolled the same packing.
### CUDA
- `params.cuh` constants + static_asserts kept in sync.
- Kernel memory-aux fill paths updated for the new block shape.
## Out of scope (follow-up PRs)
- Trace column narrowing now that u16-celled storage matches column
width.
- Redundant range-check removal on byte-shaped chips.
- Dead bitwise-lookup cleanup.
- SHA-2 / Keccak state reshaping to natively-u16 layouts.
- Moving `DEFERRAL_AS` to u16.
- Deferral commit / output-hash changes.
## Migration notes
- `MemoryBridge::read::<N>` / `::write::<N>` calls: drop `<N>`; pass a
`BLOCK_FE_WIDTH`-shaped payload. Byte-shaped data goes through
`pack_u8_block` (constraint side) and `pack_u8_block_value` (witness
side).
- `read_or_immediate` callers: switch to `RV64_IMM_AS`-based addressing.
- `MemoryWriteAuxCols<V, N>` with `N != BLOCK_FE_WIDTH`: use
`MemoryWriteAuxInput::from_prev_data_exprs(&base_aux, prev_data)` to
supply prev_data inline.
- `ModularAir`, `ModularChip`, `Fp2Air`, `Fp2Chip`, `WeierstrassAir`,
`WeierstrassChip` now take one fewer const generic (`BLOCK_SIZE`
dropped).
Resolves INT-7829
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: Jonathan Wang <jonathanpwang@users.noreply.github.com>
PR 3 of the memory-bus-u16 split, stacked on **#2794** (disjoint from PR 2). Shrinks SHA-2 and Keccak chip trace columns that are only consumed as packed u16 cells through the memory bus. ## Why #2794 still stores SHA-2 main-chip state / message bytes and Keccak state / pointer columns byte-shaped, packing them at memory-bus call sites via `pack_u8_block`. Because these columns are never inspected byte-by-byte by the AIR (the constraints operate on packed u16 values, or on byte-XOR buffers that intentionally stay byte-shaped), they can be stored u16-shaped directly — halving the affected trace columns and removing one `pack_u8_block` per row. This is the second of the four cell-narrowing PRs (after PR 2 for RV64 control-flow) and is disjoint from it: the only files in this PR live under `extensions/sha2`, `extensions/keccak256`, and `crates/circuits/sha2-air`, plus one helper addition in `extensions/riscv/circuit/src/adapters/mod.rs`. ## What changes ### SHA-2 section Files: `extensions/sha2/circuit/src/sha2_chips/{main_chip,block_hasher_chip}/`, `crates/circuits/sha2-air/src/{air,columns,trace}.rs`, `extensions/sha2/circuit/cuda/`. Main chip: - `prev_state`, `new_state` are u16 cells. - `message_bytes` reshapes to `message_u16s` (byte pairs packed at the call site). - Pointer limbs `dst_ptr_limbs`, `state_ptr_limbs`, `input_ptr_limbs` reshape to u16 where the AIR only consumes them as packed pointer cells; the high u16 cell gets the scaled 16-bit range check. Block hasher: - `prev_hash`, `final_hash` are u16 cells. `sha2_bus`: - `MESSAGE_1` / `MESSAGE_2` payloads are packed to u16 on both sender and receiver. Stays byte-shaped (8-bit bounded by construction): - `carry_a`, `carry_e`, `carry_or_buffer` in `sha2-air`. ### Keccak section Files: `extensions/keccak256/circuit/src/{keccakf_op,xorin}/`, `extensions/keccak256/circuit/cuda/`. `keccakf_op`: - `preimage`, `postimage` reshape to u16 cells. - `buffer_ptr_limbs` reshape to u16; pointer composition uses the scaled 16-bit bound. - `keccakf_state_bus` payload between `keccakf_op` and `keccakf_perm` is packed to u16. `xorin`: - Pointer limbs u16-shaped (same treatment as `keccakf_op`). - Byte-XOR buffers (`preimage_buffer_bytes`, `input_bytes`, `postimage_buffer_bytes`) stay byte-shaped — those columns still carry byte-level XOR data and the AIR inspects them directly. CUDA tracegen mirrors the new shapes; `keccakf_op` drops the `bitwise_lookup` parameter from its kernel and ABI signatures now that pointer bytes no longer need byte-level checks. The call chip still wires `bitwise_lu`. ### Adapter helper (cross-cut) File: `extensions/riscv/circuit/src/adapters/mod.rs`. - Add `expand_to_rv64_block`: zero-pads `N` u16 limbs to one RV64 bus block (`BLOCK_FE_WIDTH` cells). Used by `keccakf_op` and `xorin` for pointer reads against u16-celled `RV64_REGISTER_AS`. - `BLOCK_FE_WIDTH` imported into the adapters module to support the new helper. The existing `expand_to_rv64_register` (8 u8 limbs, byte-shaped) is kept for byte-shaped chips and the legacy paths. ## Migration notes - Chips passing SHA-2 main-chip state through the bus should hand the bus u16 cells directly (no `pack_u8_block`). - Pointer reads against u16-celled `RV64_REGISTER_AS` from chips storing only the low 32 bits of the register should use `expand_to_rv64_block` (BLOCK_FE_WIDTH cells, zero-padded) instead of `expand_to_rv64_register` (8 u8s). - Keccak pointer composition should use the scaled 16-bit bound on the high u16 cell; the previous byte-pair range check on `bitwise_lookup_bus` is removed for `keccakf_op`. resolves int-7832, int-7833 --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 2 of the memory-bus-u16 split. This migrates selected RV64 chips from byte-shaped register/memory columns to u16-shaped columns after #2794 made the memory bus operate on fixed `BLOCK_FE_WIDTH = 4` u16 cells. The scope is intentionally split into two classes: - For equality/comparison-style chips, `hintstore`, and the new u16 adapters, the main change is the limb shape: `u8` limbs become `u16` limbs, byte-pair bitwise range checks become per-limb `VariableRangeChecker` checks, and the core logic remains essentially the same. - For control-flow instructions that previously used byte-level decompositions for PC/result arithmetic (`auipc`, `jalr`, `jal_lui`), the constraints were reworked more substantially because the old carry structure was tied to 8-bit limbs. Subsequent PRs build on the u16 adapter variants and bigint limb constants introduced here. ## Why #2794 made the memory bus fixed-shape over packed u16 cells. Some chips still materialized 8 byte columns per RV64 register value, packed them into u16 bus cells at the memory interaction, and carried `BitwiseOperationLookup` only to prove those byte limbs were canonical. For chips that do not inspect byte-level structure, storing u16 limbs directly halves the affected register-value trace width and removes those byte-pair lookup dependencies. The bus sees the native u16 payload, while the AIR uses `VariableRangeChecker` to prove each materialized u16 limb is bounded. For PC/result arithmetic, the old byte AIRs also avoided field-modulus aliasing by carrying through small limbs. The u16 rewrites preserve that property with fewer, wider limbs; they do not replace 32-bit arithmetic with one unconstrained field equality. ## What Changes ### Mostly Mechanical Limb-Shape Migrations These chips keep the same core argument, but change the limb type/width: - `branch_eq`: compares u16 limbs directly. Equality inverse logic is unchanged. - `branch_lt`: signed/unsigned comparison logic is unchanged, with `LIMB_BITS = 16` and u16 MSB handling. - `less_than`: same comparison migration, wired through the u16 ALU adapter. - `hintstore`: stores `mem_ptr`, `rem_words`, and hint data as u16-shaped columns. `rem_words` fits in one u16 cell, and pointer bounds are checked through the high u16 limb. - Bigint `BranchEqual256`, `BranchLessThan256`, and `LessThan256`: inherit the u16 shape with `INT256_NUM_U16_LIMBS = 16`. The new u16 adapters follow the same pattern: register pointer values and heap/register payloads are represented as u16 cells directly, with range checks moving from byte-pair lookup to `VariableRangeChecker`. Byte-shaped bigint chips stay byte-shaped: - `BaseAlu256`, `Mul256`, `Shift256`, and `Sll256` still use `INT256_NUM_U8_LIMBS = 32` because their AIRs rely on byte-level boolean, carry, or shift structure. ### Constraint Rewrites These chips needed more than a limb-size substitution: - `auipc`: the old AIR decomposed the PC and immediate into byte limbs and carried the addition byte-by-byte. The new AIR uses u16 limbs instead: it derives/range-checks `pc_low` from `from_pc` and a witnessed `pc_high`, then constrains `rd = from_pc + (imm << 8)` with low-u16 and high-u16 carry bits. This keeps each arithmetic relation below the field modulus while dropping the extra byte-limb PC columns. - `jalr`: the old AIR used low-32-bit byte limbs for `rs1`, `to_pc`, and `rd = pc + 4`. The new AIR keeps u16 limbs, derives the low rd limb from `from_pc`, range-checks the high rd limb against `PC_BITS`, and carries `rs1 + sign_extend(imm)` over two u16 limbs while separately enforcing the cleared target low bit. - `jal_lui`: JAL derives the low rd limb from `from_pc + 4` and the witnessed high u16 limb, then range-checks the high limb against `PC_BITS`. LUI uses a u16-specific split because `imm << 12` crosses a u16 boundary: `rd[0] = imm_low_4 << 12` and `imm = imm_low_4 + (rd[1] << 4)`. The old byte-limb form would still be possible, but the u16 form uses fewer witness columns/range checks while preserving limb-by-limb integer soundness in the BabyBear field. ### New u16 Adapter Variants - `Rv64BaseAluU16Adapter`: reads/writes one `[u16; BLOCK_FE_WIDTH]` register block directly and handles immediates as u16-low-limb plus sign bit. - `Rv64VecHeapU16Adapter`: u16-cell heap reads/writes with u16-shaped register pointer values. - `Rv64VecHeapBranchU16Adapter`: read-only branch variant for u16-shaped heap operands. - `VecToFlatAluAdapter` gains a `BLOCK_VALUE_WIDTH` parameter so both byte-shaped and u16-shaped heap adapters can share the flattening logic. The legacy byte-shaped adapters remain for chips that still need byte-level constraints. ### Wiring - RISC-V and bigint extension wiring now threads `VariableRangeChecker` into the u16-shaped chips. - `BitwiseOperationLookup` remains wired only for byte-shaped chips that still need byte-level constraints. - Rv64Io no longer creates a bitwise lookup solely for `hintstore`. ### CUDA Parity CUDA tracegen mirrors the Rust changes: - migrated core records use `uint16_t` arrays where applicable; - kernels use `VariableRangeChecker.add_count(limb, LIMB_BITS)` for u16 limbs; - bitwise lookup parameters are removed from migrated kernels; - bigint CUDA wiring separates byte-shaped chips from u16-shaped chips. ### Helper Cleanup - `INT256_NUM_U8_LIMBS` and `INT256_NUM_U16_LIMBS` make bigint limb width explicit. - Shared RV64 helpers cover byte/u16 conversion, low-32-bit pointer splitting, pointer-bound range-check values, and zero-extension to one u16 memory block. - Byte-view memory helpers use explicit `read_bytes` / `write_bytes` where the access is intentionally byte-addressed. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rs (#2805) PR 4 of the memory-bus-u16 split, stacked on **PR 2**. Shrinks adapter trace columns that only reconstruct packed memory-bus payloads, and migrates the algebra modular `is_eq` chip to u16 limbs. This PR sits on PR 2 (not #2794) because the PR 2 squash extended the shared vec-heap u16 adapter family and dropped the `INT256_NUM_U8_LIMBS` from `riscv/circuit::adapters`; PR 4 builds on those changes. ## Why #2794 already reshaped these adapters' memory-aux columns and bus payloads to packed u16 blocks, but the **trace columns** that reconstruct those payloads were still byte-shaped: - `alu_w` / `mul_w` stored `rs1_high` / `rs2_high` as 4 u8 limbs even though the bus consumes them as 2 u16 cells. - `loadstore` stored `rs1_data` as 8 u8 limbs even though pointer arithmetic only needs the low 32 bits as 2 u16 cells. - The byte-shaped vec-heap adapters (`vec_heap`, `vec_heap_branch`, `eq_mod`) reconstructed register pointers through byte limbs and bitwise byte-pair range checks. All of these can store the column as u16 cells directly and replace the high-byte range check with a scaled 16-bit range check on the high u16 cell, removing the `BitwiseOperationLookup` traffic. Separately, algebra's modular `is_eq` chip operates on packed u16 operands and only needs limb-level range checks at `LIMB_BITS = 16`, so it can migrate to the new u16 adapter introduced in PR 2. ## What changes ### RISC-V W-adapters and loadstore Files: `extensions/riscv/circuit/src/adapters/{alu_w,mul_w,loadstore,mod}.rs`, `extensions/riscv/circuit/src/loadstore/{execution,tests}.rs`, `extensions/riscv/circuit/src/{base_alu_w,divrem_w,mul_w,shift_w}/tests.rs`, `extensions/riscv/circuit/cuda/include/riscv/adapters/{alu_w,mul_w,loadstore}.cuh`. - `alu_w` / `mul_w`: `rs1_high`, `rs2_high` are packed as two u16 cells; bus reads zero-extend the lower 32 bits and pull u16-shaped upper cells directly from the trace. - `loadstore`: `rs1_data` shrinks to `RS1_DATA_U16S = 2` u16 cells (the low 32 bits of the rs1 register). Bus reads zero-extend the upper half on the wire. - STORE-to-`DEFERRAL_AS` is explicitly unsupported in this adapter and the now-dead `timed_write_deferral` path is removed — #2794 already dropped that opcode, and the remaining call would have instantiated `TracingMemory::write::<F, 8>`, which violates `BLOCK_FE_WIDTH`. The F-celled path needs a separate two-message bus shape if it is ever reintroduced. - Shared helpers handle u8-pair packing, low-32-bit u16 composition, high-u16 pointer scaling, and 16-bit immediate sign-extension. - W-suffix tests (`base_alu_w`, `divrem_w`, `mul_w`, `shift_w`) updated to prank u16 cells where the column is migrated. ### Shared heap-data adapters Files: `extensions/riscv-adapters/src/{vec_heap,vec_heap_branch,eq_mod,eq_mod_u16,lib}.rs`, `extensions/riscv-adapters/cuda/include/riscv-adapters/{constants,vec_heap,vec_heap_branch}.cuh`. - `vec_heap`, `vec_heap_branch`, and `eq_mod` materialize register pointer values (`rs_val` / `rd_val`) as u16 cells. - Pointer bounds use `VariableRangeChecker` on the high u16 cell instead of byte-pair `BitwiseOperationLookup` send_range. - Adds the byte-shaped vec-heap adapter `READ_SIZE` / `WRITE_SIZE` const generics so byte-shaped 256-bit chips (`BaseAlu256`, `Mul256`, `Shift256`) and u16-shaped 256-bit chips (`LessThan256`) can share the same wrapper without losing the block-size information needed for `MemoryWriteAuxCols` and pointer arithmetic. - Adds `Rv64IsEqualModU16Adapter*` for the new u16-shaped modular `is_eq`. - CUDA mirrors the new column shapes and packing helpers. ### Algebra modular `is_eq` Files: `extensions/algebra/circuit/src/modular_chip/{is_eq,addsub,muldiv,mod,tests}.rs`, `extensions/algebra/circuit/src/extension/{modular,hybrid,fp2}.rs`, `extensions/algebra/circuit/src/{execution,fields,preflight,lib}.rs`, `extensions/algebra/circuit/src/fp2_chip/{addsub,mod,muldiv,tests}.rs`. - `LIMB_BITS` 8 → 16 where the chip's limbs are only consumed as packed u16 values. - Wires through the new `Rv64IsEqualModU16` adapter variant. - Core uses `VariableRangeChecker` for `b_lt_diff - 1` and `c_lt_diff - 1` instead of byte-pair bitwise checks. - CPU, hybrid, and CUDA registration paths all use `Rv64IsEqualModU16Adapter*`. - Negative tests that previously pranked `[F; 4]` byte arrays now prank `[F; 2]` u16 cells where the column is migrated; positive coverage preserved. - Mod-builder generated AIRs are untouched. `is_eq` executor memory reads were already adjusted in #2794 for `MEMORY_BLOCK_BYTES`; PR 4 does not rework that API. ### Cascade through dependents Files: `extensions/bigint/circuit/src/{lib,extension/mod,tests}.rs`, `extensions/bigint/circuit/cuda/src/bigint.cu`, `extensions/ecc/circuit/src/weierstrass_chip/{add_ne,double}/mod.rs`. - Bigint type aliases and ecc weierstrass call sites pick up the new `READ_SIZE` / `WRITE_SIZE` generic positions on `Rv64VecHeapAdapter*`. Byte-shaped 256-bit chips pass `MEMORY_BLOCK_BYTES`; u16-shaped chips pass `BLOCK_FE_WIDTH`. - Bigint pulls `INT256_NUM_U8_LIMBS` from the bigint crate root (per PR 2's reviewer feedback) instead of `riscv/circuit::adapters`. ## Migration notes - `Rv64VecHeapAdapter*` now takes 5 const generics (`NUM_READS, BLOCKS_PER_READ, BLOCKS_PER_WRITE, READ_SIZE, WRITE_SIZE`). Byte-shaped 256-bit chips: pass `MEMORY_BLOCK_BYTES` for both sizes. u16-shaped chips: use `Rv64VecHeapU16Adapter`. - `alu_w` / `mul_w` adapter callsites that materialized `rs_high` bytes should pull u16-shaped upper cells from the new column slot. - `loadstore` `rs1_data` columns shrink from 8 u8s to 2 u16 cells; trace-fillers that previously wrote 8 byte cells should write 2 u16 cells (low 32 bits) and let the AIR zero-extend on the bus. - `is_eq` modular chip callers should switch from `Rv64IsEqualModAdapter*` to `Rv64IsEqualModU16Adapter*` and pass `LANE_SIZE = BLOCK_FE_WIDTH` (= 4 u16 cells per heap-read block). - STORE-to-`DEFERRAL_AS` is no longer routed through `Rv64LoadStoreAdapter`; the F-celled deferral-store path was already removed in #2794 and any reintroduction needs a dedicated two-message bus shape. resolves int-7830, int-7831
72d313f to
e6803a8
Compare
PR 5 of the memory-bus-u16 split, stacked on PR 1. Two independent parts. Part 1 — internal u16 reshape, OutputKey byte layout unchanged: - Call core: input_commit, output_commit, output_len trace columns reshape to u16 cells. - Output core: output_commit, output_len reshape to u16 cells. - Canonicity sub-AIR over u16 cells. - Deferral rd_val / rs_val register-pointer columns become u16-shaped. - CPU + CUDA tracegen updates. - Reuses PR1's byte-memory and field-memory chunk helpers; no new generic split_memory_ops-style code is introduced. Part 2 — output commitment-format change (breaks vkey compatibility): - SPONGE_BYTES_PER_ROW = 2 * DIGEST_SIZE = 16; sponge_inputs cells hold u16 values on data rows and receive 16-bit range checks. Init row stays `[deferral_idx, output_len, 0, ...]` and is exempt from u16 checks. - Two memory-bus writes per output row instead of one. - def_fn::hash_output_raw packs byte pairs into each sponge cell. - Bitwise-lookup wiring removed from the output chip only — it no longer needs byte-level range checks. The call chip still uses bitwise. - CUDA + tests updated; expected output commits regenerated. Output length contract: - output_len remains a byte count. - Raw deferral outputs must be a multiple of 16 bytes; non-16-byte aligned raw outputs are invalid (no implicit padding). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d29250f to
d457878
Compare
|
Claude finished @shuklaayush's task in 3m 33s —— View job Code Review: PR #2808 — Switch to u16 limbs in deferral AIRs
Reviewed all 35 changed files across Rust AIR, tracegen, CUDA, verify-stark, continuations, riscv-adapters, and the algebra Findings1.
2. Missing per-cell u16 range checks on
3.
The execution side ( 4. The latest commit now uses 5. Sponge init-row range check exemption is correct (Positive)
6. Verify-stark
7. Both 8. CUDA and Rust constant + struct consistency (Correct)
9.
10.
11. Guest test outputs correctly updated for 16-byte alignment All guest programs ( Addressing reviewer comments
SummaryThe PR is well-executed. The u16 cell migration is consistent across all layers, the doubled sponge rate is correctly factored through shared constants, and the Verdict: Looks good to merge. |
Note: cells_used metrics omitted because CUDA tracegen does not expose unpadded trace heights. Commit: d457878 |
2995a72 to
e582d73
Compare
dac2c10 to
db32ec4
Compare
PR 5 of the memory-bus-u16 split, stacked on #2794. Shrinks deferral trace columns whose constraints only operate on packed u16 values, and doubles the sponge absorption rate of the deferral output chip from one digest per row to two.
Why
#2794 still stores deferral commit / output-key columns byte-shaped while packing the memory-bus payloads. Shrinking those columns to u16 cells halves the affected trace width and lets the canonicity sub-AIR use per-u16-cell 16-bit range checks instead of byte-pair
BitwiseOperationLookupchecks.Separately, the deferral output chip absorbs only
DIGEST_SIZE = 8bytes per Poseidon row even though the sponge state can hold2 * DIGEST_SIZE = 16bytes. Doubling the absorption rate halves the number of output rows for the same byte stream.What changes
Trace columns to u16 cells
Files:
extensions/deferral/circuit/src/{call,output,canonicity}/,extensions/deferral/circuit/src/utils.rs,extensions/deferral/circuit/cuda/.input_commit,output_commit,output_lenare u16 cells.output_commit,output_lenare u16 cells.rd_val/rs_valregister-pointer columns are u16-shaped.split_byte_memory_ops/split_f_memory_opsandbyte_memory_op_chunk/f_memory_op_chunkhelpers; no newsplit_memory_ops-style helper is introduced.Sponge absorption rate
Files:
extensions/deferral/circuit/src/output/{air,trace,execution,tests}.rs,extensions/deferral/circuit/src/def_fn.rs,extensions/deferral/circuit/src/extension/mod.rs,extensions/deferral/circuit/cuda/src/output.cu.SPONGE_BYTES_PER_ROW = 2 * DIGEST_SIZE = 16(wasDIGEST_SIZE = 8).sponge_inputscells hold u16 values on data rows, each carrying two output bytes; data-rowsponge_inputsreceive 16-bit range checks. The init row remains[deferral_idx, output_len, 0, ...]and is exempt.def_fn::hash_output_rawpacks byte pairs into each sponge cell before absorbing.Output length contract
output_lenremains a byte count.Migration notes
bitwise_lushould drop the argument.resolves int-7834