Snapshot hardening: skip snapshotted I/O on load + gate decode on schema version by patrickleet · Pull Request #81 · hops-ops/distributed

patrickleet · 2026-06-11T05:02:50Z

Two tightly-coupled snapshot hardening changes. Snapshots remain a transparent optimization: the same public API and the same handler/aggregate types work with or without snapshots; with_snapshots(n) configures behavior on the same AggregateRepository type.

Task 1 — Snapshot hydration was paying full I/O (perf)

AggregateRepository::get always called get_stream, which fetched the entire event history (ORDER BY sequence ASC, no lower bound). The snapshot path then filtered already-snapshotted events after fetch+decode. So a fresh snapshot over a 10k-event stream still read and decoded 10k rows — snapshots saved replay CPU but not the dominant I/O/decode cost. Loads also did two sequential round trips in the wrong order (events, then snapshot).

Fix:

New GetStream::get_stream_tail(identity, after_version) loads only WHERE sequence > $after_version. Default impl delegates to get_stream (always correct, just unoptimized); Postgres and SQLite override it with a bounded query; the queued wrapper forwards it under the same per-stream lock as get_stream.
The single-aggregate get hot path now reads the snapshot first, then fetches only the tail. On a cache miss (no/stale snapshot) it falls back to a full load.
Removed the O(n) entity_stream_version max-scan; entity.version() is the true stream version in both load shapes.

Entity-history-in-memory implications (checked)

Entity previously derived version, the next event sequence, and the new_events() slice from events.len(), assuming events is the complete history. A snapshot-hydrated entity now holds only the tail, which would have produced colliding sequence numbers and an out-of-bounds new_events() slice on the next commit. Handled by adding a transient prefix_version field (count of events folded into the snapshot and not held in memory) so:

version == prefix_version + events.len() (true stream position),
next event sequence == prefix_version + events.len() + 1 (contiguous, non-colliding),
new_events() slices relative to prefix_version,
optimistic-concurrency expected_version (= committed_version) stays correct.

prefix_version is #[serde(skip)] / default 0, so durable serialization and full-history entities are unchanged.

Task 2 — Snapshot decode was ungated (fix, corruption hole)

snapshot_version was validated non-zero but always written as DEFAULT_SNAPSHOT_VERSION = 1 and never compared on load. A decode error already falls back to replay safely, but bitcode is positional/non-self-describing: a layout-compatible change (reordered same-typed fields, repurposed field) decodes successfully into the wrong state, then new events commit atop the corruption.

Fix:

Added Snapshottable::SNAPSHOT_VERSION: u64 = 1 (default 1, so existing impls are unaffected).
#[derive(Snapshot)] accepts #[snapshot(version = N)] and emits the const only when set; rejects zero and non-integer values with helpful errors consistent with the file's existing style. Unit tests added.
The record now stores A::SNAPSHOT_VERSION; on load a mismatch is a cache miss → full replay, never a decode of incompatible bytes.

Public/internal semantics unified

The public hydrate_from_snapshot previously turned Cache errors into hard RepositoryError::Replay, while the internal load path degraded gracefully. The public path now carries the same cache-miss → replay fallback (it delegates to the shared hydrate_with_optional_snapshot), so both paths behave identically. Its rustdoc notes that the supplied entity must carry full history for the fallback to replay.

`snapshot_type` decision

snapshot_type is std::any::type_name::<A::Snapshot>(), which is explicitly unstable across compiler versions — gating on it would cause spurious cache misses after a toolchain bump. It is a pub field of the persisted SnapshotRecord envelope (present in both backend schemas), so removing it would be a public-API + schema break, out of scope for this hardening. Decision: keep it, document it as diagnostic-only, and explicitly do not gate on it. Schema compatibility is enforced solely by SNAPSHOT_VERSION.

Testing

New sqlite-backed suite tests/snapshot_sqlite_hardening:
- snapshot-hydrated load reads only the tail (pre-snapshot rows deleted from the store; load still returns correct state and the in-memory entity holds only the tail rows);
- a stale-schema snapshot (payload from a different/older struct shape, mismatched snapshot_version) falls back to full replay with correct final state — never silently wrong;
- snapshot-hydrated state == full-replay state.
Entity unit tests for tail sequencing (load_tail_from_history, digest continues from true version, empty-tail load).
Derive unit tests for #[snapshot(version = N)] (emit / inherit default / reject zero / reject non-integer).
Updated the existing public-API snapshot tests to the unified graceful-degradation contract.

Verified locally (no docker): cargo fmt --all --check; cargo build (default), --features sqlite, --features postgres; cargo test (default), cargo test --features sqlite, cargo test -p distributed_macros. Postgres-backed tests env-gate on DATABASE_URL and skip without it; the postgres test target compiles.

🤖 Generated with Claude Code

…ma version Snapshot hydration previously fetched the ENTIRE event stream and only then filtered out events already covered by the snapshot, so a fresh snapshot over a long stream still paid full I/O and decode cost — snapshots saved replay CPU but not the dominant I/O. And the schema-version field was written but never checked: bitcode is positional, so a layout-compatible change to a Snapshot struct would decode SUCCESSFULLY into wrong state and then commit new events atop corruption. Task 1 — skip already-snapshotted events (perf): - Add GetStream::get_stream_tail(identity, after_version) (default delegates to get_stream; Postgres/SQLite override with `WHERE sequence > $version`; Queued forwards under its lock). The single-aggregate `get` hot path now reads the snapshot FIRST, then fetches only the tail. - Entity tracks a transient `prefix_version` (events folded into a snapshot and not held in memory) so version/committed_version/new_events and event sequencing stay correct when only a tail is loaded. Add Entity::load_tail_from_history. - Remove the O(n) entity_stream_version max-scan; entity.version() is the true stream version in both load shapes. Task 2 — schema-version gate (fix): - Add Snapshottable::SNAPSHOT_VERSION (default 1) and #[snapshot(version = N)] in the derive (rejects zero / non-integer with helpful errors; emits the const only when set so existing impls are unaffected). - Write A::SNAPSHOT_VERSION into each record; on load a mismatch is a CACHE MISS -> full replay, never a decode of incompatible bytes. - Unify semantics: the public hydrate_from_snapshot now degrades gracefully (cache miss -> replay) like the internal path, instead of turning cache misses into hard Replay errors. - snapshot_type (std::any::type_name) is documented diagnostic-only and is NOT gated on (type_name is unstable across compilers; gating would cause spurious misses). Schema compatibility is enforced solely by SNAPSHOT_VERSION. Tests: sqlite-backed proofs that a snapshot-hydrated load reads only the tail (pre-snapshot rows deleted, load still correct), that a stale-schema snapshot falls back to replay with correct state, and that snapshot-hydrated state equals full-replay state; plus entity tail-sequencing unit tests and a derive version-attribute test. Implements [[tasks/snapshot-hydration-skip-replayed-events]] and [[tasks/snapshot-schema-version-gating]] Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

coderabbitai · 2026-06-11T05:03:02Z

Warning

Review limit reached

@patrickleet, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 13 minutes and 25 seconds. Learn how PR review limits work.

Your organization has reached its usage spending cap. Adjust your spending cap in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7100e6f-8301-4574-a187-6a55f73bad12

📥 Commits

Reviewing files that changed from the base of the PR and between da838d7 and ba0ffbe.

📒 Files selected for processing (21)

distributed_macros/src/lib.rs
distributed_macros/src/snapshot.rs
migrations/postgres/0001_initial.sql
migrations/sqlite/0001_initial.sql
src/aggregate/repository.rs
src/entity/entity.rs
src/postgres_repo/mod.rs
src/queued_repo/repository.rs
src/repository/traits.rs
src/snapshot/in_memory.rs
src/snapshot/repository.rs
src/snapshot/snapshottable.rs
src/snapshot/store.rs
src/sqlite_repo/mod.rs
tests/persistent_repository_conformance/scenario.rs
tests/postgres_repository/main.rs
tests/repository_api/main.rs
tests/snapshot_sqlite_hardening/main.rs
tests/snapshots/main.rs
tests/sqlite_repository/main.rs
tests/upcasting/main.rs

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch hardening/snapshot-hardening

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

snapshot_type was a write-only `type_name` field: written into every SnapshotRecord and persisted, but never read to make a decision. Since `type_name` is explicitly unstable across compiler versions, it could never have been a safe compatibility gate anyway. SNAPSHOT_VERSION (Snapshottable::SNAPSHOT_VERSION, written into each record and compared on load) is the sole compatibility gate for cached snapshots: a version mismatch is treated as a cache miss and the aggregate is rebuilt by full replay. With that in place, snapshot_type is pure vestigial scaffolding. This removes it end to end: the SnapshotRecord field, the constructor parameter, the empty-string validation, the now-unused DEFAULT_SNAPSHOT_VERSION const, the snapshot_type_name helper, the sqlite/postgres SELECT/INSERT/upsert/bind/row-read paths, the migration columns and their CHECK constraints, and all test call sites and assertions. The SNAPSHOT_VERSION doc comments are reworded to state the bump-on-layout-change intent positively rather than framing the default as backwards-compat for existing implementations. Pre-release cleanup: there is no released version or persisted data to stay compatible with. Refines [[tasks/snapshot-schema-version-gating]]

patrickleet merged commit 683a0ea into main Jun 11, 2026
7 checks passed

patrickleet mentioned this pull request Jun 11, 2026

perf: reduce commit/hydrate hot-path round trips and clones #87

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Snapshot hardening: skip snapshotted I/O on load + gate decode on schema version#81

Snapshot hardening: skip snapshotted I/O on load + gate decode on schema version#81
patrickleet merged 2 commits into
mainfrom
hardening/snapshot-hardening

patrickleet commented Jun 11, 2026

Uh oh!

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading

Review limit reached

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

patrickleet commented Jun 11, 2026

Task 1 — Snapshot hydration was paying full I/O (perf)

Entity-history-in-memory implications (checked)

Task 2 — Snapshot decode was ungated (fix, corruption hole)

Public/internal semantics unified

snapshot_type decision

Testing

Uh oh!

coderabbitai Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`snapshot_type` decision

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading