Skip to content

fix(cow-api): forward orderbook ApiError to HostError.data (COW-1075)#48

Open
brunota20 wants to merge 1 commit into
feat/resolve-app-data-cow-1074from
feat/forward-orderbook-error-cow-1075
Open

fix(cow-api): forward orderbook ApiError to HostError.data (COW-1075)#48
brunota20 wants to merge 1 commit into
feat/resolve-app-data-cow-1074from
feat/forward-orderbook-error-cow-1075

Conversation

@brunota20

Copy link
Copy Markdown
Collaborator

Summary

Engine-side fix for COW-1075. The WIT cow-api::submit-order adapter was constructing HostError with data: None on the CowApiError::Orderbook(cowprotocol::Error::OrderbookApi { status, api }) branch, dropping the orderbook's structured ApiError envelope on the floor. The guest's shepherd_sdk::cow::classify_api_error therefore always saw None, fell back to its safe-default TryNextBlock, and looped forever on permanent rejections like DuplicatedOrder, InvalidSignature, or ExcessiveValidTo.

Root cause vs. the COW-1075 issue title

The issue title ("classify_api_error should map DuplicatedOrder to Drop") points at the SDK classifier, but the classifier is already correct - it handles OrderPostErrorKind::Unknown(_) via the Some(_) => Drop arm. The data never reached it. The fix lives in the host adapter, not the SDK.

Evidence (COW-1064 dry-run report §6.3): the stop-loss strategy logs retry on next block (0): orderbook error (DuplicatedOrder): ... - the (0) is HostError.code, and (DuplicatedOrder) comes from cowprotocol::Error::OrderbookApi's Display impl on the message field. So the parsed ApiError reached message but not data.

Change

  • Extracted error projection into orderbook_to_host_error(err: cowprotocol::Error) -> HostError in crates/nexum-engine/src/host/impls/cow_api.rs.
  • For cowprotocol::Error::OrderbookApi { status, api }: serialise api as JSON into HostError.data, set code = status as i32 so guests can disambiguate 4xx vs 5xx.
  • For other cowprotocol::Error variants (transport, serde, unexpected-status): leave data: None and let the guest classifier apply its TryNextBlock safe default - that's the intended contract for non-envelope failures.

Tests

Layer Test What it asserts
WIT adapter (unit, host::impls::cow_api::tests) orderbook_api_error_is_forwarded_in_data OrderbookApi { status: 400, api: { error_type: "DuplicatedOrder", ... } } produces HostError { code: 400, data: Some(json), kind: Denied } and data round-trips through serde_json::from_str::<ApiError> with the original fields.
WIT adapter (unit) orderbook_api_error_preserves_optional_data_field The optional inner ApiError.data payload survives the projection (min_fee in this case).
WIT adapter (unit) non_envelope_cowprotocol_error_leaves_data_none UnexpectedStatus (and by extension the other non-envelope variants) leaves HostError.data = None, code = 0.
Backend (wiremock, host::cow_orderbook::tests) submit_order_propagates_orderbook_envelope Mock orderbook responds 400 with {"errorType":"DuplicatedOrder","description":"order already exists"}; OrderBookPool::submit_order_json surfaces CowApiError::Orderbook(cowprotocol::Error::OrderbookApi { status: 400, api: ... }). Pins the contract this PR depends on.

Workspace test count: 187 host tests passing, no regressions. cargo clippy --all-targets -- -D warnings and cargo fmt --check both clean.

Why this matters for COW-1031 (7-day soak)

Before this PR: any non-retriable orderbook rejection causes infinite block-by-block re-submission until the watch is manually cleared. The soak would have produced a flat shepherd_cow_api_submit_total{outcome="err"} curve climbing forever on the stop-loss / ethflow modules, which is the opposite of "zero manual intervention".

After this PR: Drop is reached as designed, the dropped:{uid} marker is written, and the curve plateaus.

Stack

Stacked on feat/resolve-app-data-cow-1074 (PR #47).

Linear

Closes COW-1075. The issue description will be updated post-merge to reflect the root cause shift (host-side, not SDK-side).

This PR was drafted with AI assistance (Claude Code, Opus 4.7).

…COW-1075)

`OrderBookPool::submit_order_json` returns `CowApiError::Orderbook(cowprotocol::Error::OrderbookApi { status, api })` for any 4xx with a typed `{"errorType": "...", ...}` body (see `cowprotocol::transport::HttpResponse::into_status_error`). The WIT adapter was dropping `api` on the floor (`data: None`), so the guest's `shepherd_sdk::cow::classify_api_error` always saw `None` and fell back to its safe-default `TryNextBlock`. Permanent rejections like `DuplicatedOrder`, `InvalidSignature`, or `ExcessiveValidTo` therefore looped forever, masquerading as transient failures.

Root cause of the stop-loss infinite-retry behaviour observed in the 2026-06-18 COW-1064 dry run (e2e-report-2026-06-18.md §6.3): 76 retries of an already-submitted order in 170 blocks because the host never let the guest see what the orderbook actually said.

Fix is in the WIT adapter (`crates/nexum-engine/src/host/impls/cow_api.rs`), not the SDK classifier. The classifier already handles `Unknown(_)` -> `Drop` correctly via its `Some(_) => Drop` branch; it just needed the envelope to dispatch on. Extracted the projection into a testable `orderbook_to_host_error` helper that:

- serialises `ApiError` into `HostError.data` as JSON when the variant is `OrderbookApi { status, api }` (the only variant carrying a structured payload),
- sets `code` to the HTTP status so guests can disambiguate 4xx vs 5xx,
- leaves `data: None` for other `cowprotocol::Error` variants (transport, serde, unexpected-status) since they have no envelope and `TryNextBlock` is the correct safe default for them.

Tests:
- `orderbook_to_host_error` unit tests cover the envelope-forward, the optional inner `data` round-trip, and the non-envelope `UnexpectedStatus` branch (3 cases).
- New wiremock integration test `submit_order_propagates_orderbook_envelope` confirms a 400 with `errorType: "DuplicatedOrder"` surfaces the `OrderbookApi` variant end-to-end through `OrderBookPool::submit_order_json`.

All 13 cow-api-adjacent tests pass; workspace tests untouched.
@linear-code

linear-code Bot commented Jun 19, 2026

Copy link
Copy Markdown

COW-1075

brunota20 added a commit that referenced this pull request Jun 23, 2026
…redesign, COW-1076)

The original BLEU-833 design lifted the on-chain GPv2OrderData into an
OrderCreation body and POSTed it to /api/v1/orders with the EthFlow
contract as the EIP-1271 owner. Empirical evidence collected during a
2026-06-22 Sepolia soak shows that path cannot succeed:

  - The 3 OrderPlacement events the watcher observed during the run
    (creationDate 18:43, 18:53, 21:19 UTC, UIDs ending in
    ba3cb449...EAdec|ffffffff) are *fulfilled* on the orderbook with
    validTo = u32::MAX — the same shape the orderbook rejects when
    shepherd POSTs it.
  - The orderbook entries carry onchainUser, onchainOrderData and
    ethflowData.userValidTo (a real unix timestamp), and these fields
    do not exist on the OrderCreation request body in
    cowprotocol-orderbook. They are populated by the orderbook backend
    itself, which indexes the OrderPlacement event server-side and
    creates the order with dual-validTo bookkeeping.

So /api/v1/orders is structurally the wrong endpoint for on-chain
EthFlow placements: the orderbook's native indexer is the one that
creates the order entry, and the public POST endpoint applies the
generic validTo cap and rejects with ExcessiveValidTo every time.
Continuing to submit produces only garbage outcomes per real EthFlow
trade (misleading dropped/backoff markers and a misleading
shepherd_cow_api_submit_total{outcome="err"} increment).

This commit replaces the submit path with the contract the orderbook
actually offers: shepherd decodes the OrderPlacement event, computes
the canonical UID, GETs /api/v1/orders/{uid}, and records the outcome:

  - 200 → write observed:{uid}, log Info ("orderbook indexed").
  - 404 → log Info ("not yet indexed; will recheck on re-delivery")
    and do not write the marker, so a future log re-delivery or a
    block-tick poll (out of scope here) can re-verify.
  - other err → log Warn with the host's code and message.

Idempotency stays event-driven: once observed:{uid} is written, log
re-delivery is a no-op. Block-tick-driven retries for stuck indexer
lag are intentionally out of scope and tracked separately.

Test surface: the existing BLEU-832 decode tests are preserved verbatim
and the GPv2OrderData fixture is reused with the canonical EthFlow
markers (OrderKind::SELL, SellTokenSource::ERC20, BuyTokenDestination::ERC20).
Two new tests cover UID computation invariants: the owner suffix pins
to the EthFlow contract and the validTo suffix pins to the on-chain
order's validTo. A third asserts compute_uid returns None on unsupported
chains so callers fall back to a Warn log rather than panic.

This invalidates the design that PRs #10 (BLEU-833), #47 (COW-1074),
#48 (COW-1075) and #49 (COW-1076) were all patching. After review,
those PRs should be closed or substantially rewritten — none of the
appData resolution / error-envelope forwarding / log-downgrade work is
needed once the broken endpoint is no longer called.

AI assistance disclosure: drafted by Claude (Opus 4.7, 1M context)
following the soak inspection that surfaced the divergence.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant