Skip to content

refactor!: give RepositoryError and LockError a retryability signal#83

Merged
patrickleet merged 5 commits into
mainfrom
hardening/repository-error-taxonomy
Jun 11, 2026
Merged

refactor!: give RepositoryError and LockError a retryability signal#83
patrickleet merged 5 commits into
mainfrom
hardening/repository-error-taxonomy

Conversation

@patrickleet

@patrickleet patrickleet commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

The redeliver-forever problem

repository_storage_error (src/sqlx_repo/mod.rs) mapped every sqlx::Error — connection refused, pool timeout, SQLITE_BUSY, and constraint/decode violations — to RepositoryError::Model(format!(...)). RepositoryError had no storage shape and no source(), and From<ReadModelError>/From<EventRecordError> likewise collapsed to opaque strings.

Downstream, HandlerError::transport_error_kind (src/microsvc/error.rs) classified all Repository(_) as Retryable. The result: a deterministic model error redelivers forever (the runner can never make progress), while a transient DB outage is indistinguishable from it. The contract that was missing is exactly the one TransportError already provides on the bus side: an explicit retryable/permanent classification with a preserved source.

Design chosen: (a) + the method, not either/or

I took approach (a) — a Storage { operation, retryable, source } variant — and added a kind(&self) -> RetryClass method. The reason both are needed: the classification is decided in repository_storage_error using the existing sqlx detection helpers, so that signal must be captured at construction time (a pure kind() method cannot recover whether a Model(String) came from a timeout vs a constraint). The variant carries the captured retryable bool and the source; kind() then gives every caller one classification point.

  • repository_storage_error now classifies via a new is_sqlx_transient: PoolTimedOut/PoolClosed/Io and SQLite busy/locked (reusing the existing is_sqlite_busy) -> retryable; everything else (constraint violations, malformed-row decode) -> permanent. The sqlx::Error is preserved as the boxed source.
  • RetryClass lives in src/lock/error.rs — the lowest layer both LockError and RepositoryError reach — so the repository layer does not take a dependency on the bus.
  • RepositoryError::kind(): ConcurrentWrite and NotFound stay retryable (an optimistic-concurrency race or out-of-order delivery is resolved by a later attempt — preserves the prior conflict-as-retry handling); Model, Replay, invalid identity/receipt, InvalidState, duplicate-in-batch, LockPoisoned are permanent.
  • From<ReadModelError>/From<EventRecordError> map to Storage, preserving source and the transient/permanent distinction instead of stringifying. (ReadModelError::Storage is itself a stringified backend error with no recoverable signal without touching read_model; it is conservatively classified permanent — safe, since permanent surfaces to the failure policy rather than looping.)

HandlerError classification split

transport_error_kind now defers HandlerError::Repository(err) to err.kind() instead of one retryable bucket. A deterministic model/read-model error is now permanent; a connection/pool/busy outage is retryable.

Clone/Eq decision

The Storage variant carries source: Option<Box<dyn Error + Send + Sync>>, which is not Clone/Eq. Those derives on RepositoryError were test-only (no production .clone() or ==; HandlerError derives only Debug). Carrying a real source() chain is the genuine improvement and matches TransportError, so I dropped Clone, PartialEq, Eq and converted the handful of assert_eq!(err, RepositoryError::...) comparisons to matches! on the variant.

LockError alignment

LockError gains a Busy variant and the same kind()/is_retryable() signal: Poisoned is permanent (a broken invariant a retry cannot clear); AcquireFailed, ReleaseFailed, Expired, Busy, Other are retryable. The existing sqlx lease AcquireFailed/ReleaseFailed mappings are already correctly retryable under this classification, and SQLite busy contention in the acquire loop already retries rather than surfacing — so storage-busy and lock-busy are now consistently retryable.

Breaking change

RepositoryError no longer derives Clone/PartialEq/Eq; both RepositoryError and LockError are now #[non_exhaustive]; From<ReadModelError>/From<EventRecordError> now yield RepositoryError::Storage instead of ::Model.

Testing

  • cargo fmt --all
  • cargo build (default), --features sqlite, --features postgres, --features http — all clean.
  • cargo test --features sqlite — all suites pass (DB-env tests skip without env, no docker).
  • cargo test --features postgres --no-run / --features http --no-run — compile clean.
  • New src/microsvc/error.rs tests: deterministic_repository_errors_are_permanent (a model/replay/permanent-storage error is NOT retried) and transient_repository_storage_errors_are_retryable (a connection error and a concurrency conflict ARE retryable). New src/lock/error.rs tests cover the lock classification.

Call sites touched (all forced by dropping PartialEq, plus the lock/mod.rs export): src/commit_builder/mod.rs, src/hashmap_repo/repository.rs, src/outbox/commit.rs, src/outbox_worker/store.rs, src/snapshot/repository.rs, tests/{repository_api,postgres_repository,sqlite_repository,persistent_repository_conformance/scenario}.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved error classification so transient failures are retried and permanent failures are not, reducing spurious retries and improving reliability.
  • New Features

    • Enhanced error diagnostics: storage/lock errors now surface clearer retry semantics and include underlying source details for easier troubleshooting.
  • Tests

    • Updated tests to assert new error shapes and retry behavior across repositories and workers.

repository_storage_error mapped every sqlx failure — connection refused,
pool timeout, SQLITE_BUSY, and constraint violations alike — to
RepositoryError::Model(String). RepositoryError had no storage shape and no
source(), and From<ReadModelError>/From<EventRecordError> collapsed to
strings. Downstream, HandlerError::transport_error_kind classified every
Repository(_) as Retryable, so a deterministic model error redelivered
forever while a transient DB outage was indistinguishable from it.

Add a Storage { operation, retryable, source } variant to RepositoryError
plus a kind() -> RetryClass method, mirroring TransportError. RetryClass
lives in the lock module (the lowest layer both errors reach) so repository
does not depend on the bus. repository_storage_error now classifies sqlx
errors via is_sqlx_transient (connection/pool/timeout and SQLITE_BUSY ->
retryable; constraint/decode -> permanent) and preserves the source. The
From impls map to Storage, keeping the source and the transient/permanent
distinction. ConcurrentWrite and NotFound stay retryable (race resolved by
a later attempt); Model/Replay/invalid-identity/InvalidState are permanent.

HandlerError::transport_error_kind now defers to RepositoryError::kind()
instead of bucketing all Repository(_) as retryable.

LockError gains a Busy variant and the same kind()/is_retryable() signal:
Poisoned is permanent (broken invariant); contention, lease loss, and busy
are retryable. The existing AcquireFailed/ReleaseFailed lease mappings were
already retryable under this classification.

Storage carries a boxed source, so RepositoryError drops Clone/PartialEq/Eq
(they were test-only). The handful of assert_eq! comparisons became
matches! on the variant. Both enums are now #[non_exhaustive] so future
variants are non-breaking.

BREAKING CHANGE: RepositoryError no longer derives Clone/PartialEq/Eq, both
RepositoryError and LockError are #[non_exhaustive], and From<ReadModelError>
/From<EventRecordError> now produce RepositoryError::Storage instead of
::Model.

Implements [[tasks/repository-error-taxonomy]]
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fee9a9ae-35bb-4e0e-a339-ef2b3542c062

📥 Commits

Reviewing files that changed from the base of the PR and between 0adf641 and deb6369.

📒 Files selected for processing (7)
  • src/microsvc/error.rs
  • src/snapshot/repository.rs
  • src/sqlx_repo/mod.rs
  • tests/persistent_repository_conformance/scenario.rs
  • tests/postgres_repository/main.rs
  • tests/repository_api/main.rs
  • tests/sqlite_repository/main.rs
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/snapshot/repository.rs
  • src/sqlx_repo/mod.rs
  • tests/persistent_repository_conformance/scenario.rs
  • src/microsvc/error.rs

📝 Walkthrough

Walkthrough

This PR introduces a unified retry classification system across error types. It defines a RetryClass enum to mark errors as retryable or permanent, applies it to LockError and RepositoryError, detects transient database errors, and updates handlers and tests to use this classification for intelligent retry semantics.

Changes

Error Retry Classification System

Layer / File(s) Summary
RetryClass foundation and lock error classification
src/lock/error.rs, src/lock/mod.rs
RetryClass enum (Retryable, Permanent) with helper methods is introduced. LockError is marked #[non_exhaustive], gains a new Busy(String) variant, and implements kind() and is_retryable() methods that classify contention/lease-loss as retryable and poisoned as permanent. Methods are exported via mod.rs.
Repository error Storage variant and classification
src/repository/error.rs
RepositoryError adds a new Storage variant carrying operation name, retryability flag, and optional boxed source error. Enum is marked #[non_exhaustive]. Public constructor helpers retryable_storage() and permanent_storage() are added. Methods kind() and is_retryable() classify Storage by its retryability flag and other variants by deterministic rules: ConcurrentWrite is retryable, Model/Replay/NotFound are permanent. Display and Error trait implementations are updated to render storage errors with source details.
Error source preservation and transient detection
src/repository/error.rs, src/sqlx_repo/mod.rs
From<ReadModelError> now maps to RepositoryError::Storage and derives retryability only from underlying lock errors. From<EventRecordError> maps to permanent RepositoryError::Storage while preserving the original codec/serialization error as source. New is_sqlx_transient() helper classifies sqlx::Error: pool/connection/IO/timeout errors are transient, SQLite busy/locked conditions are transient (when sqlite feature enabled), all others are permanent. repository_storage_error() uses transient classification to set the retryable flag when creating RepositoryError::Storage.
Handler integration and test assertion updates
src/microsvc/error.rs, tests/commit_builder/mod.rs, tests/outbox/commit.rs, tests/snapshot/repository.rs, tests/outbox_worker/store.rs, tests/hashmap_repo/repository.rs, tests/persistent_repository_conformance/scenario.rs, tests/postgres_repository/main.rs, tests/repository_api/main.rs, tests/sqlite_repository/main.rs, tests/knative_cloudevents/main.rs
HandlerError::transport_error_kind() is reworked to defer repository error retryability to RepositoryError::kind() instead of treating all repository errors as permanent. Tests are updated to use matches! patterns instead of exact equality and to expect RepositoryError::Storage for storage failures; codec errors and transient storage failures are validated via error message inspection and is_retryable() checks. The knative integration test is updated to use retryable_storage() with ConnectionRefused for transient failures.

Sequence Diagram(s)

sequenceDiagram
  participant Handler
  participant Repository as Repository Layer
  participant Database as Database/Backend
  participant TransportError as Transport Error Kind
  
  Handler->>Repository: execute operation
  Repository->>Database: query/write
  Database-->>Repository: error (connection, lock, codec, etc.)
  Repository->>Repository: classify via kind()
  Repository-->>Handler: RepositoryError with RetryClass
  Handler->>TransportError: map via transport_error_kind()
  TransportError-->>Handler: Retryable or Permanent
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐇 A rabbit hops where errors used to hide,
Retryable whispers, permanent strides,
Storage keeps its source by its side,
Locks now tell if they’ll bide or collide,
Tests nod, the system learns to retry with pride.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main refactoring: introducing retryability signals to RepositoryError and LockError. It directly reflects the core objective of distinguishing transient from deterministic failures.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch hardening/repository-error-taxonomy

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

…case

The `retryable_failure_returns_503` ingress test handler returned
`RepositoryError::Model("transient")`. After the error-taxonomy rework,
`Model` is — correctly — a deterministic fault classified Permanent, so
the ingress mapped it to 422 (do-not-retry), failing the assertion.

This was a stale expectation asserting the old redeliver-forever bug, not
a misclassification: a `Model` error is deterministic and re-running the
identical message cannot change its outcome, so it MUST be permanent (the
`deterministic_repository_errors_are_permanent` unit test pins this).

The fix is in the test, not the classification logic. The handler now
returns `RepositoryError::retryable_storage("load stream", ...)` — a
transient storage outage (connection refused) that genuinely may succeed
on redelivery — which is what a "temporarily failed" handler should
signal. The 503 redeliver semantics the test names are now honestly
exercised, and the deterministic-permanent path is covered by
`order.rejected` -> 422.

Refines [[tasks/repository-error-taxonomy]]

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/microsvc/error.rs (1)

109-120: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't route EventRecordError through the retryable Other bucket.

transport_error_kind() now makes HandlerError::Other(_) retryable, but impl From<EventRecordError> for HandlerError still wraps every EventRecordError as Other (Lines 64-67). That means deterministic event-record failures can still be redelivered forever, which undercuts the new permanent-vs-retryable split. Give EventRecordError its own permanent classification or map it to a non-retryable handler variant instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/microsvc/error.rs` around lines 109 - 120, transport_error_kind()
currently treats HandlerError::Other(_) as Retryable, but impl
From<EventRecordError> for HandlerError maps all EventRecordError into Other,
causing deterministic event-record failures to be retried indefinitely; fix by
giving EventRecordError a permanent classification—either update
transport_error_kind() to match on a new HandlerError variant that represents
EventRecordError as TransportErrorKind::Permanent (e.g.,
HandlerError::EventRecord(_)) or change impl From<EventRecordError> for
HandlerError to map EventRecordError into an existing non-retryable variant
(e.g., UnknownCommand/Rejected/Unauthorized or a new EventRecord variant) so
EventRecordError is classified as permanent rather than routed through
HandlerError::Other.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/sqlx_repo/mod.rs`:
- Around line 256-281: is_sqlx_transient currently only treats pool/timeouts/Io
and SQLite busy as transient; add PostgreSQL serialization/deadlock SQLSTATE
handling so Postgres errors 40001 and 40P01 are classified retryable. Update
is_sqlx_transient to also match sqlx::Error::Database and, under cfg(feature =
"postgres"), downcast the database error to sqlx::postgres::PgDatabaseError (or
use its sqlstate/code) and return true when err.code() == "40001" or err.code()
== "40P01"; keep existing checks (PoolTimedOut, PoolClosed, Io(_),
is_sqlite_busy) intact. Ensure you only compile Postgres-specific logic behind
the "postgres" feature flag and reference the function name is_sqlx_transient
and helper is_sqlite_busy when making the change.

---

Outside diff comments:
In `@src/microsvc/error.rs`:
- Around line 109-120: transport_error_kind() currently treats
HandlerError::Other(_) as Retryable, but impl From<EventRecordError> for
HandlerError maps all EventRecordError into Other, causing deterministic
event-record failures to be retried indefinitely; fix by giving EventRecordError
a permanent classification—either update transport_error_kind() to match on a
new HandlerError variant that represents EventRecordError as
TransportErrorKind::Permanent (e.g., HandlerError::EventRecord(_)) or change
impl From<EventRecordError> for HandlerError to map EventRecordError into an
existing non-retryable variant (e.g., UnknownCommand/Rejected/Unauthorized or a
new EventRecord variant) so EventRecordError is classified as permanent rather
than routed through HandlerError::Other.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3dd61a12-b2f9-4ba4-9586-1b1a84ffe850

📥 Commits

Reviewing files that changed from the base of the PR and between 55c2674 and 0adf641.

📒 Files selected for processing (15)
  • src/commit_builder/mod.rs
  • src/hashmap_repo/repository.rs
  • src/lock/error.rs
  • src/lock/mod.rs
  • src/microsvc/error.rs
  • src/outbox/commit.rs
  • src/outbox_worker/store.rs
  • src/repository/error.rs
  • src/snapshot/repository.rs
  • src/sqlx_repo/mod.rs
  • tests/knative_cloudevents/main.rs
  • tests/persistent_repository_conformance/scenario.rs
  • tests/postgres_repository/main.rs
  • tests/repository_api/main.rs
  • tests/sqlite_repository/main.rs

Comment thread src/sqlx_repo/mod.rs
A read-model CHECK-constraint violation is a deterministic, non-retryable
fault. With the reworked error taxonomy it now surfaces as the new
`RepositoryError::Storage { retryable: false, .. }` variant rather than the
old `RepositoryError::Model(String)`, so re-running the identical commit is
never retried forever.

PR #85 added `read_model_failure_mid_plan_rolls_back_events_and_outbox` to
both the sqlite and postgres repository suites asserting the old `Model`
error. Update both assertions to expect the permanent `Storage` variant
(and `!err.is_retryable()`), matching the corrected classification and the
idiom already used by this PR's own backend-storage tests. The rollback
invariant the test actually guards — events, outbox row, and the first
read-model mutation all absent after the failed commit — is unchanged.

Refines [[tasks/repository-error-taxonomy]]
…yable

40001 (serialization_failure) and 40P01 (deadlock_detected) are transient
write-race outcomes that should be retried, not handed to the failure policy.
is_sqlx_transient previously only matched pool/timeout/IO and SQLite busy, so
these fell through as permanent. Mirrors the existing 23505 unique-violation
detection (no feature gate: SQLite never carries these SQLSTATEs).

Addresses CodeRabbit review on PR #83.

Refines [[tasks/repository-error-taxonomy]]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant