Snapshot hardening: skip snapshotted I/O on load + gate decode on schema version#81
Conversation
…ma version Snapshot hydration previously fetched the ENTIRE event stream and only then filtered out events already covered by the snapshot, so a fresh snapshot over a long stream still paid full I/O and decode cost — snapshots saved replay CPU but not the dominant I/O. And the schema-version field was written but never checked: bitcode is positional, so a layout-compatible change to a Snapshot struct would decode SUCCESSFULLY into wrong state and then commit new events atop corruption. Task 1 — skip already-snapshotted events (perf): - Add GetStream::get_stream_tail(identity, after_version) (default delegates to get_stream; Postgres/SQLite override with `WHERE sequence > $version`; Queued forwards under its lock). The single-aggregate `get` hot path now reads the snapshot FIRST, then fetches only the tail. - Entity tracks a transient `prefix_version` (events folded into a snapshot and not held in memory) so version/committed_version/new_events and event sequencing stay correct when only a tail is loaded. Add Entity::load_tail_from_history. - Remove the O(n) entity_stream_version max-scan; entity.version() is the true stream version in both load shapes. Task 2 — schema-version gate (fix): - Add Snapshottable::SNAPSHOT_VERSION (default 1) and #[snapshot(version = N)] in the derive (rejects zero / non-integer with helpful errors; emits the const only when set so existing impls are unaffected). - Write A::SNAPSHOT_VERSION into each record; on load a mismatch is a CACHE MISS -> full replay, never a decode of incompatible bytes. - Unify semantics: the public hydrate_from_snapshot now degrades gracefully (cache miss -> replay) like the internal path, instead of turning cache misses into hard Replay errors. - snapshot_type (std::any::type_name) is documented diagnostic-only and is NOT gated on (type_name is unstable across compilers; gating would cause spurious misses). Schema compatibility is enforced solely by SNAPSHOT_VERSION. Tests: sqlite-backed proofs that a snapshot-hydrated load reads only the tail (pre-snapshot rows deleted, load still correct), that a stale-schema snapshot falls back to replay with correct state, and that snapshot-hydrated state equals full-replay state; plus entity tail-sequencing unit tests and a derive version-attribute test. Implements [[tasks/snapshot-hydration-skip-replayed-events]] and [[tasks/snapshot-schema-version-gating]] Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 13 minutes and 25 seconds. Learn how PR review limits work. Your organization has reached its usage spending cap. Adjust your spending cap in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (21)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
snapshot_type was a write-only `type_name` field: written into every SnapshotRecord and persisted, but never read to make a decision. Since `type_name` is explicitly unstable across compiler versions, it could never have been a safe compatibility gate anyway. SNAPSHOT_VERSION (Snapshottable::SNAPSHOT_VERSION, written into each record and compared on load) is the sole compatibility gate for cached snapshots: a version mismatch is treated as a cache miss and the aggregate is rebuilt by full replay. With that in place, snapshot_type is pure vestigial scaffolding. This removes it end to end: the SnapshotRecord field, the constructor parameter, the empty-string validation, the now-unused DEFAULT_SNAPSHOT_VERSION const, the snapshot_type_name helper, the sqlite/postgres SELECT/INSERT/upsert/bind/row-read paths, the migration columns and their CHECK constraints, and all test call sites and assertions. The SNAPSHOT_VERSION doc comments are reworded to state the bump-on-layout-change intent positively rather than framing the default as backwards-compat for existing implementations. Pre-release cleanup: there is no released version or persisted data to stay compatible with. Refines [[tasks/snapshot-schema-version-gating]]
Two tightly-coupled snapshot hardening changes. Snapshots remain a transparent optimization: the same public API and the same handler/aggregate types work with or without snapshots;
with_snapshots(n)configures behavior on the sameAggregateRepositorytype.Task 1 — Snapshot hydration was paying full I/O (perf)
AggregateRepository::getalways calledget_stream, which fetched the entire event history (ORDER BY sequence ASC, no lower bound). The snapshot path then filtered already-snapshotted events after fetch+decode. So a fresh snapshot over a 10k-event stream still read and decoded 10k rows — snapshots saved replay CPU but not the dominant I/O/decode cost. Loads also did two sequential round trips in the wrong order (events, then snapshot).Fix:
GetStream::get_stream_tail(identity, after_version)loads onlyWHERE sequence > $after_version. Default impl delegates toget_stream(always correct, just unoptimized); Postgres and SQLite override it with a bounded query; the queued wrapper forwards it under the same per-stream lock asget_stream.gethot path now reads the snapshot first, then fetches only the tail. On a cache miss (no/stale snapshot) it falls back to a full load.entity_stream_versionmax-scan;entity.version()is the true stream version in both load shapes.Entity-history-in-memory implications (checked)
Entitypreviously derivedversion, the next eventsequence, and thenew_events()slice fromevents.len(), assumingeventsis the complete history. A snapshot-hydrated entity now holds only the tail, which would have produced colliding sequence numbers and an out-of-boundsnew_events()slice on the next commit. Handled by adding a transientprefix_versionfield (count of events folded into the snapshot and not held in memory) so:version == prefix_version + events.len()(true stream position),sequence == prefix_version + events.len() + 1(contiguous, non-colliding),new_events()slices relative toprefix_version,expected_version(=committed_version) stays correct.prefix_versionis#[serde(skip)]/ default 0, so durable serialization and full-history entities are unchanged.Task 2 — Snapshot decode was ungated (fix, corruption hole)
snapshot_versionwas validated non-zero but always written asDEFAULT_SNAPSHOT_VERSION = 1and never compared on load. A decode error already falls back to replay safely, but bitcode is positional/non-self-describing: a layout-compatible change (reordered same-typed fields, repurposed field) decodes successfully into the wrong state, then new events commit atop the corruption.Fix:
Snapshottable::SNAPSHOT_VERSION: u64 = 1(default 1, so existing impls are unaffected).#[derive(Snapshot)]accepts#[snapshot(version = N)]and emits the const only when set; rejects zero and non-integer values with helpful errors consistent with the file's existing style. Unit tests added.A::SNAPSHOT_VERSION; on load a mismatch is a cache miss → full replay, never a decode of incompatible bytes.Public/internal semantics unified
The public
hydrate_from_snapshotpreviously turnedCacheerrors into hardRepositoryError::Replay, while the internal load path degraded gracefully. The public path now carries the same cache-miss → replay fallback (it delegates to the sharedhydrate_with_optional_snapshot), so both paths behave identically. Its rustdoc notes that the supplied entity must carry full history for the fallback to replay.snapshot_typedecisionsnapshot_typeisstd::any::type_name::<A::Snapshot>(), which is explicitly unstable across compiler versions — gating on it would cause spurious cache misses after a toolchain bump. It is apubfield of the persistedSnapshotRecordenvelope (present in both backend schemas), so removing it would be a public-API + schema break, out of scope for this hardening. Decision: keep it, document it as diagnostic-only, and explicitly do not gate on it. Schema compatibility is enforced solely bySNAPSHOT_VERSION.Testing
tests/snapshot_sqlite_hardening:snapshot_version) falls back to full replay with correct final state — never silently wrong;load_tail_from_history, digest continues from true version, empty-tail load).#[snapshot(version = N)](emit / inherit default / reject zero / reject non-integer).Verified locally (no docker):
cargo fmt --all --check;cargo build(default),--features sqlite,--features postgres;cargo test(default),cargo test --features sqlite,cargo test -p distributed_macros. Postgres-backed tests env-gate onDATABASE_URLand skip without it; the postgres test target compiles.🤖 Generated with Claude Code