test: red-team concurrency & failure-injection conformance tests#85
Conversation
Pin guarantees the publish-on-commit + snapshot work introduced that were previously only covered by unit tests against fakes. All tests exercise EXISTING behavior; none required a production change, and all pass against current code (no #[ignore]d tests). Tests added: - racing_commits_one_wins_one_conflicts (scenario.rs, all 3 backends): N tasks load the same aggregate at v0 and commit behind a barrier; asserts exactly 1 success, N-1 ConcurrentWrite, and a gap/duplicate-free final stream — the unique-PK backstop behind the version-check TOCTOU window under real concurrency. - expired_outbox_lease_is_reclaimed_by_second_worker (outbox.rs, all 3 backends): worker A claims a short lease and "crashes"; after expiry worker B reclaims and completes; A's late complete/release are fenced (InvalidState). Exercises the SQL claimed_until expiry predicate end-to-end. - publish_failure_after_commit_retains_outbox_row_until_delivered (outbox.rs, all 3 backends): dispatcher with a publisher that fails twice then succeeds; row goes Pending after each failure (attempts incrementing), then Published, with exactly one bus delivery. - read_model_failure_mid_plan_rolls_back_events_and_outbox (sqlite_repository + postgres_repository): a commit with events + outbox + a two-mutation read-model plan whose second mutation hits a real DB constraint (SQLite trigger / Postgres CHECK); asserts events, outbox row, and the first mutation are all absent (whole-tx rollback). - unknown_event_type_in_stream_fails_hydration_with_clear_error + repo-path variant (upcasting): a stream with an unregistered event_name fails hydration/get with a RepositoryError::Replay naming the event. - queued_repo_n_writers_commit_in_fifo_order (queued_repo): 10 writers contend for one aggregate; asserts final version == N+1 and event order equals the lock-grant order. - replay determinism proptest (new tests/replay_property/): proptest over arbitrary command sequences — reload reproduces in-memory state; snapshot hydration at any frequency equals full replay. Adds proptest as a dev-dependency. Postgres-backed variants env-gate on DATABASE_URL and skip locally; SQLite/HashMap variants run without docker. Bugs discovered: none. Every test passes against current code. Implements [[tasks/red-team-conformance-tests]] Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 24 minutes and 15 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (12)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
A read-model CHECK-constraint violation is a deterministic, non-retryable
fault. With the reworked error taxonomy it now surfaces as the new
`RepositoryError::Storage { retryable: false, .. }` variant rather than the
old `RepositoryError::Model(String)`, so re-running the identical commit is
never retried forever.
PR #85 added `read_model_failure_mid_plan_rolls_back_events_and_outbox` to
both the sqlite and postgres repository suites asserting the old `Model`
error. Update both assertions to expect the permanent `Storage` variant
(and `!err.is_retryable()`), matching the corrected classification and the
idiom already used by this PR's own backend-storage tests. The rollback
invariant the test actually guards — events, outbox row, and the first
read-model mutation all absent after the failed commit — is unchanged.
Refines [[tasks/repository-error-taxonomy]]
* refactor!: give RepositoryError and LockError a retryability signal
repository_storage_error mapped every sqlx failure — connection refused,
pool timeout, SQLITE_BUSY, and constraint violations alike — to
RepositoryError::Model(String). RepositoryError had no storage shape and no
source(), and From<ReadModelError>/From<EventRecordError> collapsed to
strings. Downstream, HandlerError::transport_error_kind classified every
Repository(_) as Retryable, so a deterministic model error redelivered
forever while a transient DB outage was indistinguishable from it.
Add a Storage { operation, retryable, source } variant to RepositoryError
plus a kind() -> RetryClass method, mirroring TransportError. RetryClass
lives in the lock module (the lowest layer both errors reach) so repository
does not depend on the bus. repository_storage_error now classifies sqlx
errors via is_sqlx_transient (connection/pool/timeout and SQLITE_BUSY ->
retryable; constraint/decode -> permanent) and preserves the source. The
From impls map to Storage, keeping the source and the transient/permanent
distinction. ConcurrentWrite and NotFound stay retryable (race resolved by
a later attempt); Model/Replay/invalid-identity/InvalidState are permanent.
HandlerError::transport_error_kind now defers to RepositoryError::kind()
instead of bucketing all Repository(_) as retryable.
LockError gains a Busy variant and the same kind()/is_retryable() signal:
Poisoned is permanent (broken invariant); contention, lease loss, and busy
are retryable. The existing AcquireFailed/ReleaseFailed lease mappings were
already retryable under this classification.
Storage carries a boxed source, so RepositoryError drops Clone/PartialEq/Eq
(they were test-only). The handful of assert_eq! comparisons became
matches! on the variant. Both enums are now #[non_exhaustive] so future
variants are non-breaking.
BREAKING CHANGE: RepositoryError no longer derives Clone/PartialEq/Eq, both
RepositoryError and LockError are #[non_exhaustive], and From<ReadModelError>
/From<EventRecordError> now produce RepositoryError::Storage instead of
::Model.
Implements [[tasks/repository-error-taxonomy]]
* test(knative): use a genuinely retryable error for the 503 redeliver case
The `retryable_failure_returns_503` ingress test handler returned
`RepositoryError::Model("transient")`. After the error-taxonomy rework,
`Model` is — correctly — a deterministic fault classified Permanent, so
the ingress mapped it to 422 (do-not-retry), failing the assertion.
This was a stale expectation asserting the old redeliver-forever bug, not
a misclassification: a `Model` error is deterministic and re-running the
identical message cannot change its outcome, so it MUST be permanent (the
`deterministic_repository_errors_are_permanent` unit test pins this).
The fix is in the test, not the classification logic. The handler now
returns `RepositoryError::retryable_storage("load stream", ...)` — a
transient storage outage (connection refused) that genuinely may succeed
on redelivery — which is what a "temporarily failed" handler should
signal. The 503 redeliver semantics the test names are now honestly
exercised, and the deterministic-permanent path is covered by
`order.rejected` -> 422.
Refines [[tasks/repository-error-taxonomy]]
* test: expect permanent Storage error for read-model constraint violation
A read-model CHECK-constraint violation is a deterministic, non-retryable
fault. With the reworked error taxonomy it now surfaces as the new
`RepositoryError::Storage { retryable: false, .. }` variant rather than the
old `RepositoryError::Model(String)`, so re-running the identical commit is
never retried forever.
PR #85 added `read_model_failure_mid_plan_rolls_back_events_and_outbox` to
both the sqlite and postgres repository suites asserting the old `Model`
error. Update both assertions to expect the permanent `Storage` variant
(and `!err.is_retryable()`), matching the corrected classification and the
idiom already used by this PR's own backend-storage tests. The rollback
invariant the test actually guards — events, outbox row, and the first
read-model mutation all absent after the failed commit — is unchanged.
Refines [[tasks/repository-error-taxonomy]]
* fix(repo): classify Postgres deadlock/serialization SQLSTATEs as retryable
40001 (serialization_failure) and 40P01 (deadlock_detected) are transient
write-race outcomes that should be retried, not handed to the failure policy.
is_sqlx_transient previously only matched pool/timeout/IO and SQLite busy, so
these fell through as permanent. Mirrors the existing 23505 unique-violation
detection (no feature gate: SQLite never carries these SQLSTATEs).
Addresses CodeRabbit review on PR #83.
Refines [[tasks/repository-error-taxonomy]]
Summary
Adds red-team / concurrency / failure-injection conformance tests that pin guarantees the recent publish-on-commit + snapshot work introduced but which were previously only covered by unit tests against fakes. All tests exercise existing behavior. They prefer SQLite (run locally without docker) and wire Postgres variants through the existing trampolines (Postgres tests env-gate on
DATABASE_URLand skip locally).No production
src/code was modified.proptestis added as a dev-dependency.Bugs discovered
None. Every test passes against current code. No test is
#[ignore]d.Tests added (per test)
racing_commits_one_wins_one_conflictspersistent_repository_conformance/scenario.rstokio::sync::Barrier; asserts exactly 1 success, N-1ConcurrentWrite, and a gap/duplicate-free final stream (sequences[1,2], version 2). Exercises the unique-PK backstop behind the version-check TOCTOU window under real concurrency (existingconcurrent_writes_detectedraces only sequentially).expired_outbox_lease_is_reclaimed_by_second_workerpersistent_repository_conformance/outbox.rscomplete/releaseare both fenced (InvalidState). Exercises the SQLclaimed_until <= nowexpiry predicate end-to-end (only the in-memory store proved this before).publish_failure_after_commit_retains_outbox_row_until_deliveredpersistent_repository_conformance/outbox.rsOutboxDispatcherwith a publisher that fails twice then succeeds; row returns toPendingafter each failure (attempts 1→2), thenPublished; bus saw exactly one delivery.read_model_failure_mid_plan_rolls_back_events_and_outboxsqlite_repository/main.rs+postgres_repository/main.rsCHECK); asserts the event, the outbox row, and the first (already-applied) mutation are all absent (whole-tx rollback).unknown_event_type_in_stream_fails_hydration_with_clear_error(+ a repo-path variant)upcasting/main.rsevent_namefailshydrate/getwithRepositoryError::Replaywhose message names the offending event.queued_repo_n_writers_commit_in_fifo_orderqueued_repo/main.rs== N+1and that the committed event order equals the recorded lock-grant order.reload_reproduces_in_memory_state,snapshot_hydration_matches_full_replay)tests/replay_property/Ledgeraggregate: reload reproduces in-memory state; snapshot hydration at any frequency equals a full replay (snapshots are a transparent optimization).How the races/failures are actually created
tokio::sync::Barrier+tokio::spawnon a multi-thread runtime to make writers contend, not sleeps.OutboxDispatcherone pass at a time against a flakyAsyncMessagePublisher.Verification
cargo fmt --all --check— cleancargo build --features sqliteand--features postgres(tests) — compilecargo test --features sqlite— all pass; Postgres-backed targets report 0 tests / skip withoutDATABASE_URL🤖 Generated with Claude Code