Skip to content

Snapshot hardening: skip snapshotted I/O on load + gate decode on schema version#81

Merged
patrickleet merged 2 commits into
mainfrom
hardening/snapshot-hardening
Jun 11, 2026
Merged

Snapshot hardening: skip snapshotted I/O on load + gate decode on schema version#81
patrickleet merged 2 commits into
mainfrom
hardening/snapshot-hardening

Conversation

@patrickleet

Copy link
Copy Markdown
Collaborator

Two tightly-coupled snapshot hardening changes. Snapshots remain a transparent optimization: the same public API and the same handler/aggregate types work with or without snapshots; with_snapshots(n) configures behavior on the same AggregateRepository type.

Task 1 — Snapshot hydration was paying full I/O (perf)

AggregateRepository::get always called get_stream, which fetched the entire event history (ORDER BY sequence ASC, no lower bound). The snapshot path then filtered already-snapshotted events after fetch+decode. So a fresh snapshot over a 10k-event stream still read and decoded 10k rows — snapshots saved replay CPU but not the dominant I/O/decode cost. Loads also did two sequential round trips in the wrong order (events, then snapshot).

Fix:

  • New GetStream::get_stream_tail(identity, after_version) loads only WHERE sequence > $after_version. Default impl delegates to get_stream (always correct, just unoptimized); Postgres and SQLite override it with a bounded query; the queued wrapper forwards it under the same per-stream lock as get_stream.
  • The single-aggregate get hot path now reads the snapshot first, then fetches only the tail. On a cache miss (no/stale snapshot) it falls back to a full load.
  • Removed the O(n) entity_stream_version max-scan; entity.version() is the true stream version in both load shapes.

Entity-history-in-memory implications (checked)

Entity previously derived version, the next event sequence, and the new_events() slice from events.len(), assuming events is the complete history. A snapshot-hydrated entity now holds only the tail, which would have produced colliding sequence numbers and an out-of-bounds new_events() slice on the next commit. Handled by adding a transient prefix_version field (count of events folded into the snapshot and not held in memory) so:

  • version == prefix_version + events.len() (true stream position),
  • next event sequence == prefix_version + events.len() + 1 (contiguous, non-colliding),
  • new_events() slices relative to prefix_version,
  • optimistic-concurrency expected_version (= committed_version) stays correct.

prefix_version is #[serde(skip)] / default 0, so durable serialization and full-history entities are unchanged.

Task 2 — Snapshot decode was ungated (fix, corruption hole)

snapshot_version was validated non-zero but always written as DEFAULT_SNAPSHOT_VERSION = 1 and never compared on load. A decode error already falls back to replay safely, but bitcode is positional/non-self-describing: a layout-compatible change (reordered same-typed fields, repurposed field) decodes successfully into the wrong state, then new events commit atop the corruption.

Fix:

  • Added Snapshottable::SNAPSHOT_VERSION: u64 = 1 (default 1, so existing impls are unaffected).
  • #[derive(Snapshot)] accepts #[snapshot(version = N)] and emits the const only when set; rejects zero and non-integer values with helpful errors consistent with the file's existing style. Unit tests added.
  • The record now stores A::SNAPSHOT_VERSION; on load a mismatch is a cache miss → full replay, never a decode of incompatible bytes.

Public/internal semantics unified

The public hydrate_from_snapshot previously turned Cache errors into hard RepositoryError::Replay, while the internal load path degraded gracefully. The public path now carries the same cache-miss → replay fallback (it delegates to the shared hydrate_with_optional_snapshot), so both paths behave identically. Its rustdoc notes that the supplied entity must carry full history for the fallback to replay.

snapshot_type decision

snapshot_type is std::any::type_name::<A::Snapshot>(), which is explicitly unstable across compiler versions — gating on it would cause spurious cache misses after a toolchain bump. It is a pub field of the persisted SnapshotRecord envelope (present in both backend schemas), so removing it would be a public-API + schema break, out of scope for this hardening. Decision: keep it, document it as diagnostic-only, and explicitly do not gate on it. Schema compatibility is enforced solely by SNAPSHOT_VERSION.

Testing

  • New sqlite-backed suite tests/snapshot_sqlite_hardening:
    • snapshot-hydrated load reads only the tail (pre-snapshot rows deleted from the store; load still returns correct state and the in-memory entity holds only the tail rows);
    • a stale-schema snapshot (payload from a different/older struct shape, mismatched snapshot_version) falls back to full replay with correct final state — never silently wrong;
    • snapshot-hydrated state == full-replay state.
  • Entity unit tests for tail sequencing (load_tail_from_history, digest continues from true version, empty-tail load).
  • Derive unit tests for #[snapshot(version = N)] (emit / inherit default / reject zero / reject non-integer).
  • Updated the existing public-API snapshot tests to the unified graceful-degradation contract.

Verified locally (no docker): cargo fmt --all --check; cargo build (default), --features sqlite, --features postgres; cargo test (default), cargo test --features sqlite, cargo test -p distributed_macros. Postgres-backed tests env-gate on DATABASE_URL and skip without it; the postgres test target compiles.

🤖 Generated with Claude Code

…ma version

Snapshot hydration previously fetched the ENTIRE event stream and only then
filtered out events already covered by the snapshot, so a fresh snapshot over a
long stream still paid full I/O and decode cost — snapshots saved replay CPU but
not the dominant I/O. And the schema-version field was written but never checked:
bitcode is positional, so a layout-compatible change to a Snapshot struct would
decode SUCCESSFULLY into wrong state and then commit new events atop corruption.

Task 1 — skip already-snapshotted events (perf):
- Add GetStream::get_stream_tail(identity, after_version) (default delegates to
  get_stream; Postgres/SQLite override with `WHERE sequence > $version`; Queued
  forwards under its lock). The single-aggregate `get` hot path now reads the
  snapshot FIRST, then fetches only the tail.
- Entity tracks a transient `prefix_version` (events folded into a snapshot and
  not held in memory) so version/committed_version/new_events and event
  sequencing stay correct when only a tail is loaded. Add
  Entity::load_tail_from_history.
- Remove the O(n) entity_stream_version max-scan; entity.version() is the true
  stream version in both load shapes.

Task 2 — schema-version gate (fix):
- Add Snapshottable::SNAPSHOT_VERSION (default 1) and #[snapshot(version = N)] in
  the derive (rejects zero / non-integer with helpful errors; emits the const
  only when set so existing impls are unaffected).
- Write A::SNAPSHOT_VERSION into each record; on load a mismatch is a CACHE MISS
  -> full replay, never a decode of incompatible bytes.
- Unify semantics: the public hydrate_from_snapshot now degrades gracefully
  (cache miss -> replay) like the internal path, instead of turning cache misses
  into hard Replay errors.
- snapshot_type (std::any::type_name) is documented diagnostic-only and is NOT
  gated on (type_name is unstable across compilers; gating would cause spurious
  misses). Schema compatibility is enforced solely by SNAPSHOT_VERSION.

Tests: sqlite-backed proofs that a snapshot-hydrated load reads only the tail
(pre-snapshot rows deleted, load still correct), that a stale-schema snapshot
falls back to replay with correct state, and that snapshot-hydrated state equals
full-replay state; plus entity tail-sequencing unit tests and a derive
version-attribute test.

Implements [[tasks/snapshot-hydration-skip-replayed-events]] and [[tasks/snapshot-schema-version-gating]]

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@patrickleet, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 13 minutes and 25 seconds. Learn how PR review limits work.

Your organization has reached its usage spending cap. Adjust your spending cap in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e7100e6f-8301-4574-a187-6a55f73bad12

📥 Commits

Reviewing files that changed from the base of the PR and between da838d7 and ba0ffbe.

📒 Files selected for processing (21)
  • distributed_macros/src/lib.rs
  • distributed_macros/src/snapshot.rs
  • migrations/postgres/0001_initial.sql
  • migrations/sqlite/0001_initial.sql
  • src/aggregate/repository.rs
  • src/entity/entity.rs
  • src/postgres_repo/mod.rs
  • src/queued_repo/repository.rs
  • src/repository/traits.rs
  • src/snapshot/in_memory.rs
  • src/snapshot/repository.rs
  • src/snapshot/snapshottable.rs
  • src/snapshot/store.rs
  • src/sqlite_repo/mod.rs
  • tests/persistent_repository_conformance/scenario.rs
  • tests/postgres_repository/main.rs
  • tests/repository_api/main.rs
  • tests/snapshot_sqlite_hardening/main.rs
  • tests/snapshots/main.rs
  • tests/sqlite_repository/main.rs
  • tests/upcasting/main.rs
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch hardening/snapshot-hardening

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

snapshot_type was a write-only `type_name` field: written into every
SnapshotRecord and persisted, but never read to make a decision. Since
`type_name` is explicitly unstable across compiler versions, it could
never have been a safe compatibility gate anyway.

SNAPSHOT_VERSION (Snapshottable::SNAPSHOT_VERSION, written into each
record and compared on load) is the sole compatibility gate for cached
snapshots: a version mismatch is treated as a cache miss and the
aggregate is rebuilt by full replay. With that in place, snapshot_type
is pure vestigial scaffolding.

This removes it end to end: the SnapshotRecord field, the constructor
parameter, the empty-string validation, the now-unused
DEFAULT_SNAPSHOT_VERSION const, the snapshot_type_name helper, the
sqlite/postgres SELECT/INSERT/upsert/bind/row-read paths, the migration
columns and their CHECK constraints, and all test call sites and
assertions. The SNAPSHOT_VERSION doc comments are reworded to state the
bump-on-layout-change intent positively rather than framing the default
as backwards-compat for existing implementations.

Pre-release cleanup: there is no released version or persisted data to
stay compatible with.

Refines [[tasks/snapshot-schema-version-gating]]
@patrickleet patrickleet merged commit 683a0ea into main Jun 11, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant