Skip to content

test: red-team concurrency & failure-injection conformance tests#85

Merged
patrickleet merged 1 commit into
mainfrom
hardening/red-team-conformance-tests
Jun 11, 2026
Merged

test: red-team concurrency & failure-injection conformance tests#85
patrickleet merged 1 commit into
mainfrom
hardening/red-team-conformance-tests

Conversation

@patrickleet

Copy link
Copy Markdown
Collaborator

Summary

Adds red-team / concurrency / failure-injection conformance tests that pin guarantees the recent publish-on-commit + snapshot work introduced but which were previously only covered by unit tests against fakes. All tests exercise existing behavior. They prefer SQLite (run locally without docker) and wire Postgres variants through the existing trampolines (Postgres tests env-gate on DATABASE_URL and skip locally).

No production src/ code was modified. proptest is added as a dev-dependency.

Bugs discovered

None. Every test passes against current code. No test is #[ignore]d.

Tests added (per test)

# Test Location Backends What it pins
1 racing_commits_one_wins_one_conflicts persistent_repository_conformance/scenario.rs hashmap, sqlite, postgres N tasks load the same aggregate at v0 and commit behind a tokio::sync::Barrier; asserts exactly 1 success, N-1 ConcurrentWrite, and a gap/duplicate-free final stream (sequences [1,2], version 2). Exercises the unique-PK backstop behind the version-check TOCTOU window under real concurrency (existing concurrent_writes_detected races only sequentially).
2 expired_outbox_lease_is_reclaimed_by_second_worker persistent_repository_conformance/outbox.rs hashmap, sqlite, postgres Worker A claims a short lease and never settles (simulated crash); while live, B cannot steal it; after expiry B reclaims (attempts bumped) and completes; A's late complete/release are both fenced (InvalidState). Exercises the SQL claimed_until <= now expiry predicate end-to-end (only the in-memory store proved this before).
3 publish_failure_after_commit_retains_outbox_row_until_delivered persistent_repository_conformance/outbox.rs hashmap, sqlite, postgres Commit aggregate+outbox, then run OutboxDispatcher with a publisher that fails twice then succeeds; row returns to Pending after each failure (attempts 1→2), then Published; bus saw exactly one delivery.
4 read_model_failure_mid_plan_rolls_back_events_and_outbox sqlite_repository/main.rs + postgres_repository/main.rs sqlite, postgres A commit with events + outbox + a two-mutation read-model plan whose second mutation violates a real DB constraint (SQLite BEFORE-INSERT trigger / Postgres CHECK); asserts the event, the outbox row, and the first (already-applied) mutation are all absent (whole-tx rollback).
5 unknown_event_type_in_stream_fails_hydration_with_clear_error (+ a repo-path variant) upcasting/main.rs n/a (hashmap) A stream carrying an unregistered event_name fails hydrate/get with RepositoryError::Replay whose message names the offending event.
6 queued_repo_n_writers_commit_in_fifo_order queued_repo/main.rs hashmap (queued) 10 writers contend for one aggregate behind the async lock; asserts final version == N+1 and that the committed event order equals the recorded lock-grant order.
7 replay-determinism proptest (reload_reproduces_in_memory_state, snapshot_hydration_matches_full_replay) new tests/replay_property/ hashmap proptest over arbitrary command sequences for a sample Ledger aggregate: reload reproduces in-memory state; snapshot hydration at any frequency equals a full replay (snapshots are a transparent optimization).

How the races/failures are actually created

Verification

  • cargo fmt --all --check — clean
  • cargo build --features sqlite and --features postgres (tests) — compile
  • cargo test --features sqlite — all pass; Postgres-backed targets report 0 tests / skip without DATABASE_URL
  • Concurrency/timing tests run repeatedly without flakiness

🤖 Generated with Claude Code

Pin guarantees the publish-on-commit + snapshot work introduced that were
previously only covered by unit tests against fakes. All tests exercise
EXISTING behavior; none required a production change, and all pass against
current code (no #[ignore]d tests).

Tests added:
- racing_commits_one_wins_one_conflicts (scenario.rs, all 3 backends):
  N tasks load the same aggregate at v0 and commit behind a barrier; asserts
  exactly 1 success, N-1 ConcurrentWrite, and a gap/duplicate-free final
  stream — the unique-PK backstop behind the version-check TOCTOU window under
  real concurrency.
- expired_outbox_lease_is_reclaimed_by_second_worker (outbox.rs, all 3
  backends): worker A claims a short lease and "crashes"; after expiry worker B
  reclaims and completes; A's late complete/release are fenced (InvalidState).
  Exercises the SQL claimed_until expiry predicate end-to-end.
- publish_failure_after_commit_retains_outbox_row_until_delivered (outbox.rs,
  all 3 backends): dispatcher with a publisher that fails twice then succeeds;
  row goes Pending after each failure (attempts incrementing), then Published,
  with exactly one bus delivery.
- read_model_failure_mid_plan_rolls_back_events_and_outbox
  (sqlite_repository + postgres_repository): a commit with events + outbox + a
  two-mutation read-model plan whose second mutation hits a real DB constraint
  (SQLite trigger / Postgres CHECK); asserts events, outbox row, and the first
  mutation are all absent (whole-tx rollback).
- unknown_event_type_in_stream_fails_hydration_with_clear_error +
  repo-path variant (upcasting): a stream with an unregistered event_name fails
  hydration/get with a RepositoryError::Replay naming the event.
- queued_repo_n_writers_commit_in_fifo_order (queued_repo): 10 writers contend
  for one aggregate; asserts final version == N+1 and event order equals the
  lock-grant order.
- replay determinism proptest (new tests/replay_property/): proptest over
  arbitrary command sequences — reload reproduces in-memory state; snapshot
  hydration at any frequency equals full replay.

Adds proptest as a dev-dependency. Postgres-backed variants env-gate on
DATABASE_URL and skip locally; SQLite/HashMap variants run without docker.

Bugs discovered: none. Every test passes against current code.

Implements [[tasks/red-team-conformance-tests]]

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@patrickleet, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 24 minutes and 15 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ffc60b83-d3ee-4872-bbfe-6f693efb4f34

📥 Commits

Reviewing files that changed from the base of the PR and between 55c2674 and 7446170.

📒 Files selected for processing (12)
  • Cargo.toml
  • tests/hashmap_repository_conformance/main.rs
  • tests/persistent_repository_conformance/outbox.rs
  • tests/persistent_repository_conformance/scenario.rs
  • tests/postgres_repository/main.rs
  • tests/postgres_repository_conformance/main.rs
  • tests/queued_repo/main.rs
  • tests/replay_property/aggregate.rs
  • tests/replay_property/main.rs
  • tests/sqlite_repository/main.rs
  • tests/sqlite_repository_conformance/main.rs
  • tests/upcasting/main.rs
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch hardening/red-team-conformance-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@patrickleet patrickleet merged commit 6d4b697 into main Jun 11, 2026
7 checks passed
patrickleet added a commit that referenced this pull request Jun 11, 2026
A read-model CHECK-constraint violation is a deterministic, non-retryable
fault. With the reworked error taxonomy it now surfaces as the new
`RepositoryError::Storage { retryable: false, .. }` variant rather than the
old `RepositoryError::Model(String)`, so re-running the identical commit is
never retried forever.

PR #85 added `read_model_failure_mid_plan_rolls_back_events_and_outbox` to
both the sqlite and postgres repository suites asserting the old `Model`
error. Update both assertions to expect the permanent `Storage` variant
(and `!err.is_retryable()`), matching the corrected classification and the
idiom already used by this PR's own backend-storage tests. The rollback
invariant the test actually guards — events, outbox row, and the first
read-model mutation all absent after the failed commit — is unchanged.

Refines [[tasks/repository-error-taxonomy]]
patrickleet added a commit that referenced this pull request Jun 11, 2026
* refactor!: give RepositoryError and LockError a retryability signal

repository_storage_error mapped every sqlx failure — connection refused,
pool timeout, SQLITE_BUSY, and constraint violations alike — to
RepositoryError::Model(String). RepositoryError had no storage shape and no
source(), and From<ReadModelError>/From<EventRecordError> collapsed to
strings. Downstream, HandlerError::transport_error_kind classified every
Repository(_) as Retryable, so a deterministic model error redelivered
forever while a transient DB outage was indistinguishable from it.

Add a Storage { operation, retryable, source } variant to RepositoryError
plus a kind() -> RetryClass method, mirroring TransportError. RetryClass
lives in the lock module (the lowest layer both errors reach) so repository
does not depend on the bus. repository_storage_error now classifies sqlx
errors via is_sqlx_transient (connection/pool/timeout and SQLITE_BUSY ->
retryable; constraint/decode -> permanent) and preserves the source. The
From impls map to Storage, keeping the source and the transient/permanent
distinction. ConcurrentWrite and NotFound stay retryable (race resolved by
a later attempt); Model/Replay/invalid-identity/InvalidState are permanent.

HandlerError::transport_error_kind now defers to RepositoryError::kind()
instead of bucketing all Repository(_) as retryable.

LockError gains a Busy variant and the same kind()/is_retryable() signal:
Poisoned is permanent (broken invariant); contention, lease loss, and busy
are retryable. The existing AcquireFailed/ReleaseFailed lease mappings were
already retryable under this classification.

Storage carries a boxed source, so RepositoryError drops Clone/PartialEq/Eq
(they were test-only). The handful of assert_eq! comparisons became
matches! on the variant. Both enums are now #[non_exhaustive] so future
variants are non-breaking.

BREAKING CHANGE: RepositoryError no longer derives Clone/PartialEq/Eq, both
RepositoryError and LockError are #[non_exhaustive], and From<ReadModelError>
/From<EventRecordError> now produce RepositoryError::Storage instead of
::Model.

Implements [[tasks/repository-error-taxonomy]]

* test(knative): use a genuinely retryable error for the 503 redeliver case

The `retryable_failure_returns_503` ingress test handler returned
`RepositoryError::Model("transient")`. After the error-taxonomy rework,
`Model` is — correctly — a deterministic fault classified Permanent, so
the ingress mapped it to 422 (do-not-retry), failing the assertion.

This was a stale expectation asserting the old redeliver-forever bug, not
a misclassification: a `Model` error is deterministic and re-running the
identical message cannot change its outcome, so it MUST be permanent (the
`deterministic_repository_errors_are_permanent` unit test pins this).

The fix is in the test, not the classification logic. The handler now
returns `RepositoryError::retryable_storage("load stream", ...)` — a
transient storage outage (connection refused) that genuinely may succeed
on redelivery — which is what a "temporarily failed" handler should
signal. The 503 redeliver semantics the test names are now honestly
exercised, and the deterministic-permanent path is covered by
`order.rejected` -> 422.

Refines [[tasks/repository-error-taxonomy]]

* test: expect permanent Storage error for read-model constraint violation

A read-model CHECK-constraint violation is a deterministic, non-retryable
fault. With the reworked error taxonomy it now surfaces as the new
`RepositoryError::Storage { retryable: false, .. }` variant rather than the
old `RepositoryError::Model(String)`, so re-running the identical commit is
never retried forever.

PR #85 added `read_model_failure_mid_plan_rolls_back_events_and_outbox` to
both the sqlite and postgres repository suites asserting the old `Model`
error. Update both assertions to expect the permanent `Storage` variant
(and `!err.is_retryable()`), matching the corrected classification and the
idiom already used by this PR's own backend-storage tests. The rollback
invariant the test actually guards — events, outbox row, and the first
read-model mutation all absent after the failed commit — is unchanged.

Refines [[tasks/repository-error-taxonomy]]

* fix(repo): classify Postgres deadlock/serialization SQLSTATEs as retryable

40001 (serialization_failure) and 40P01 (deadlock_detected) are transient
write-race outcomes that should be retried, not handed to the failure policy.
is_sqlx_transient previously only matched pool/timeout/IO and SQLite busy, so
these fell through as permanent. Mirrors the existing 23505 unique-violation
detection (no feature gate: SQLite never carries these SQLSTATEs).

Addresses CodeRabbit review on PR #83.

Refines [[tasks/repository-error-taxonomy]]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant