refactor!: give RepositoryError and LockError a retryability signal#83
Conversation
repository_storage_error mapped every sqlx failure — connection refused,
pool timeout, SQLITE_BUSY, and constraint violations alike — to
RepositoryError::Model(String). RepositoryError had no storage shape and no
source(), and From<ReadModelError>/From<EventRecordError> collapsed to
strings. Downstream, HandlerError::transport_error_kind classified every
Repository(_) as Retryable, so a deterministic model error redelivered
forever while a transient DB outage was indistinguishable from it.
Add a Storage { operation, retryable, source } variant to RepositoryError
plus a kind() -> RetryClass method, mirroring TransportError. RetryClass
lives in the lock module (the lowest layer both errors reach) so repository
does not depend on the bus. repository_storage_error now classifies sqlx
errors via is_sqlx_transient (connection/pool/timeout and SQLITE_BUSY ->
retryable; constraint/decode -> permanent) and preserves the source. The
From impls map to Storage, keeping the source and the transient/permanent
distinction. ConcurrentWrite and NotFound stay retryable (race resolved by
a later attempt); Model/Replay/invalid-identity/InvalidState are permanent.
HandlerError::transport_error_kind now defers to RepositoryError::kind()
instead of bucketing all Repository(_) as retryable.
LockError gains a Busy variant and the same kind()/is_retryable() signal:
Poisoned is permanent (broken invariant); contention, lease loss, and busy
are retryable. The existing AcquireFailed/ReleaseFailed lease mappings were
already retryable under this classification.
Storage carries a boxed source, so RepositoryError drops Clone/PartialEq/Eq
(they were test-only). The handful of assert_eq! comparisons became
matches! on the variant. Both enums are now #[non_exhaustive] so future
variants are non-breaking.
BREAKING CHANGE: RepositoryError no longer derives Clone/PartialEq/Eq, both
RepositoryError and LockError are #[non_exhaustive], and From<ReadModelError>
/From<EventRecordError> now produce RepositoryError::Storage instead of
::Model.
Implements [[tasks/repository-error-taxonomy]]
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (7)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughThis PR introduces a unified retry classification system across error types. It defines a ChangesError Retry Classification System
Sequence Diagram(s)sequenceDiagram
participant Handler
participant Repository as Repository Layer
participant Database as Database/Backend
participant TransportError as Transport Error Kind
Handler->>Repository: execute operation
Repository->>Database: query/write
Database-->>Repository: error (connection, lock, codec, etc.)
Repository->>Repository: classify via kind()
Repository-->>Handler: RepositoryError with RetryClass
Handler->>TransportError: map via transport_error_kind()
TransportError-->>Handler: Retryable or Permanent
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…case
The `retryable_failure_returns_503` ingress test handler returned
`RepositoryError::Model("transient")`. After the error-taxonomy rework,
`Model` is — correctly — a deterministic fault classified Permanent, so
the ingress mapped it to 422 (do-not-retry), failing the assertion.
This was a stale expectation asserting the old redeliver-forever bug, not
a misclassification: a `Model` error is deterministic and re-running the
identical message cannot change its outcome, so it MUST be permanent (the
`deterministic_repository_errors_are_permanent` unit test pins this).
The fix is in the test, not the classification logic. The handler now
returns `RepositoryError::retryable_storage("load stream", ...)` — a
transient storage outage (connection refused) that genuinely may succeed
on redelivery — which is what a "temporarily failed" handler should
signal. The 503 redeliver semantics the test names are now honestly
exercised, and the deterministic-permanent path is covered by
`order.rejected` -> 422.
Refines [[tasks/repository-error-taxonomy]]
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/microsvc/error.rs (1)
109-120:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't route
EventRecordErrorthrough the retryableOtherbucket.
transport_error_kind()now makesHandlerError::Other(_)retryable, butimpl From<EventRecordError> for HandlerErrorstill wraps everyEventRecordErrorasOther(Lines 64-67). That means deterministic event-record failures can still be redelivered forever, which undercuts the new permanent-vs-retryable split. GiveEventRecordErrorits own permanent classification or map it to a non-retryable handler variant instead.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/microsvc/error.rs` around lines 109 - 120, transport_error_kind() currently treats HandlerError::Other(_) as Retryable, but impl From<EventRecordError> for HandlerError maps all EventRecordError into Other, causing deterministic event-record failures to be retried indefinitely; fix by giving EventRecordError a permanent classification—either update transport_error_kind() to match on a new HandlerError variant that represents EventRecordError as TransportErrorKind::Permanent (e.g., HandlerError::EventRecord(_)) or change impl From<EventRecordError> for HandlerError to map EventRecordError into an existing non-retryable variant (e.g., UnknownCommand/Rejected/Unauthorized or a new EventRecord variant) so EventRecordError is classified as permanent rather than routed through HandlerError::Other.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/sqlx_repo/mod.rs`:
- Around line 256-281: is_sqlx_transient currently only treats pool/timeouts/Io
and SQLite busy as transient; add PostgreSQL serialization/deadlock SQLSTATE
handling so Postgres errors 40001 and 40P01 are classified retryable. Update
is_sqlx_transient to also match sqlx::Error::Database and, under cfg(feature =
"postgres"), downcast the database error to sqlx::postgres::PgDatabaseError (or
use its sqlstate/code) and return true when err.code() == "40001" or err.code()
== "40P01"; keep existing checks (PoolTimedOut, PoolClosed, Io(_),
is_sqlite_busy) intact. Ensure you only compile Postgres-specific logic behind
the "postgres" feature flag and reference the function name is_sqlx_transient
and helper is_sqlite_busy when making the change.
---
Outside diff comments:
In `@src/microsvc/error.rs`:
- Around line 109-120: transport_error_kind() currently treats
HandlerError::Other(_) as Retryable, but impl From<EventRecordError> for
HandlerError maps all EventRecordError into Other, causing deterministic
event-record failures to be retried indefinitely; fix by giving EventRecordError
a permanent classification—either update transport_error_kind() to match on a
new HandlerError variant that represents EventRecordError as
TransportErrorKind::Permanent (e.g., HandlerError::EventRecord(_)) or change
impl From<EventRecordError> for HandlerError to map EventRecordError into an
existing non-retryable variant (e.g., UnknownCommand/Rejected/Unauthorized or a
new EventRecord variant) so EventRecordError is classified as permanent rather
than routed through HandlerError::Other.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3dd61a12-b2f9-4ba4-9586-1b1a84ffe850
📒 Files selected for processing (15)
src/commit_builder/mod.rssrc/hashmap_repo/repository.rssrc/lock/error.rssrc/lock/mod.rssrc/microsvc/error.rssrc/outbox/commit.rssrc/outbox_worker/store.rssrc/repository/error.rssrc/snapshot/repository.rssrc/sqlx_repo/mod.rstests/knative_cloudevents/main.rstests/persistent_repository_conformance/scenario.rstests/postgres_repository/main.rstests/repository_api/main.rstests/sqlite_repository/main.rs
A read-model CHECK-constraint violation is a deterministic, non-retryable
fault. With the reworked error taxonomy it now surfaces as the new
`RepositoryError::Storage { retryable: false, .. }` variant rather than the
old `RepositoryError::Model(String)`, so re-running the identical commit is
never retried forever.
PR #85 added `read_model_failure_mid_plan_rolls_back_events_and_outbox` to
both the sqlite and postgres repository suites asserting the old `Model`
error. Update both assertions to expect the permanent `Storage` variant
(and `!err.is_retryable()`), matching the corrected classification and the
idiom already used by this PR's own backend-storage tests. The rollback
invariant the test actually guards — events, outbox row, and the first
read-model mutation all absent after the failed commit — is unchanged.
Refines [[tasks/repository-error-taxonomy]]
…yable 40001 (serialization_failure) and 40P01 (deadlock_detected) are transient write-race outcomes that should be retried, not handed to the failure policy. is_sqlx_transient previously only matched pool/timeout/IO and SQLite busy, so these fell through as permanent. Mirrors the existing 23505 unique-violation detection (no feature gate: SQLite never carries these SQLSTATEs). Addresses CodeRabbit review on PR #83. Refines [[tasks/repository-error-taxonomy]]
The redeliver-forever problem
repository_storage_error(src/sqlx_repo/mod.rs) mapped everysqlx::Error— connection refused, pool timeout,SQLITE_BUSY, and constraint/decode violations — toRepositoryError::Model(format!(...)).RepositoryErrorhad no storage shape and nosource(), andFrom<ReadModelError>/From<EventRecordError>likewise collapsed to opaque strings.Downstream,
HandlerError::transport_error_kind(src/microsvc/error.rs) classified allRepository(_)asRetryable. The result: a deterministic model error redelivers forever (the runner can never make progress), while a transient DB outage is indistinguishable from it. The contract that was missing is exactly the oneTransportErroralready provides on the bus side: an explicit retryable/permanent classification with a preserved source.Design chosen: (a) + the method, not either/or
I took approach (a) — a
Storage { operation, retryable, source }variant — and added akind(&self) -> RetryClassmethod. The reason both are needed: the classification is decided inrepository_storage_errorusing the existing sqlx detection helpers, so that signal must be captured at construction time (a purekind()method cannot recover whether aModel(String)came from a timeout vs a constraint). The variant carries the capturedretryablebool and the source;kind()then gives every caller one classification point.repository_storage_errornow classifies via a newis_sqlx_transient:PoolTimedOut/PoolClosed/Ioand SQLite busy/locked (reusing the existingis_sqlite_busy) -> retryable; everything else (constraint violations, malformed-row decode) -> permanent. Thesqlx::Erroris preserved as the boxed source.RetryClasslives insrc/lock/error.rs— the lowest layer bothLockErrorandRepositoryErrorreach — so the repository layer does not take a dependency on the bus.RepositoryError::kind():ConcurrentWriteandNotFoundstay retryable (an optimistic-concurrency race or out-of-order delivery is resolved by a later attempt — preserves the prior conflict-as-retry handling);Model,Replay, invalid identity/receipt,InvalidState, duplicate-in-batch,LockPoisonedare permanent.From<ReadModelError>/From<EventRecordError>map toStorage, preserving source and the transient/permanent distinction instead of stringifying. (ReadModelError::Storageis itself a stringified backend error with no recoverable signal without touchingread_model; it is conservatively classified permanent — safe, since permanent surfaces to the failure policy rather than looping.)HandlerError classification split
transport_error_kindnow defersHandlerError::Repository(err)toerr.kind()instead of one retryable bucket. A deterministic model/read-model error is now permanent; a connection/pool/busy outage is retryable.Clone/Eq decision
The
Storagevariant carriessource: Option<Box<dyn Error + Send + Sync>>, which is notClone/Eq. Those derives onRepositoryErrorwere test-only (no production.clone()or==;HandlerErrorderives onlyDebug). Carrying a realsource()chain is the genuine improvement and matchesTransportError, so I droppedClone,PartialEq,Eqand converted the handful ofassert_eq!(err, RepositoryError::...)comparisons tomatches!on the variant.LockError alignment
LockErrorgains aBusyvariant and the samekind()/is_retryable()signal:Poisonedis permanent (a broken invariant a retry cannot clear);AcquireFailed,ReleaseFailed,Expired,Busy,Otherare retryable. The existing sqlx leaseAcquireFailed/ReleaseFailedmappings are already correctly retryable under this classification, and SQLite busy contention in the acquire loop already retries rather than surfacing — so storage-busy and lock-busy are now consistently retryable.Breaking change
RepositoryErrorno longer derivesClone/PartialEq/Eq; bothRepositoryErrorandLockErrorare now#[non_exhaustive];From<ReadModelError>/From<EventRecordError>now yieldRepositoryError::Storageinstead of::Model.Testing
cargo fmt --allcargo build(default),--features sqlite,--features postgres,--features http— all clean.cargo test --features sqlite— all suites pass (DB-env tests skip without env, no docker).cargo test --features postgres --no-run/--features http --no-run— compile clean.src/microsvc/error.rstests:deterministic_repository_errors_are_permanent(a model/replay/permanent-storage error is NOT retried) andtransient_repository_storage_errors_are_retryable(a connection error and a concurrency conflict ARE retryable). Newsrc/lock/error.rstests cover the lock classification.Call sites touched (all forced by dropping
PartialEq, plus thelock/mod.rsexport):src/commit_builder/mod.rs,src/hashmap_repo/repository.rs,src/outbox/commit.rs,src/outbox_worker/store.rs,src/snapshot/repository.rs,tests/{repository_api,postgres_repository,sqlite_repository,persistent_repository_conformance/scenario}.🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
New Features
Tests