feat(ethflow-watcher): downgrade ExcessiveValidTo drops to Info (COW-1076)#49
Conversation
…1076)
EthFlow on-chain orders use `validTo = u32::MAX` by design (see `cowprotocol::eth_flow`). The Sepolia orderbook's max-validTo cap rejects this shape with `errorType = "ExcessiveValidTo"`, and after the COW-1075 host fix the strategy already classifies it correctly as Drop. The remaining gap was operator ergonomics: every EthFlow placement on Sepolia produced a Warn-level "ethflow dropped" line, which would dominate a 7-day soak dashboard with non-anomalous traffic.
Change: in `apply_submit_retry`'s Drop arm, peek at the decoded ApiError. If the orderbook's `errorType == "ExcessiveValidTo"`, log at Info instead of Warn. All other Drop reasons (InvalidSignature, WrongOwner, etc.) keep Warn so real anomalies still page the operator. Dispatch (write `dropped:{uid}`, clear stale `backoff:{uid}`) is unchanged.
Why not gate on more (e.g. inspect the order's validTo field): the strategy already filters logs to EthFlow contract addresses; ExcessiveValidTo from the orderbook for an EthFlow placement is unambiguously the documented constraint. Keeping the gate narrow avoids accidentally suppressing other-cause Warns.
Tests (3 new in `modules/ethflow-watcher/src/strategy.rs`):
- `submit_excessive_valid_to_logs_at_info_not_warn`: end-to-end through `on_logs`; confirms exactly one drop line at Info level and zero Warn drops for this case.
- `submit_other_permanent_error_still_logs_at_warn`: regression guard - InvalidSignature stays at Warn.
- `submit_drop_without_envelope_keeps_warn_level`: predicate-level unit test confirming `is_expected_excessive_valid_to` returns false when `HostError.data` is None (e.g. transport failure).
Docs: added "Known upstream constraints on Sepolia" section to `docs/operations/e2e-testnet-runbook.md` documenting this gap, the post-fix operator-visible behaviour, the Prometheus signal (`shepherd_cow_api_submit_total{outcome=\"err\"}` grows by the EthFlow placement count then stops), and a pointer to COW-1076 for the upstream-confirmation status.
Soak impact: the COW-1031 7-day run on Sepolia will now show ExcessiveValidTo drops as Info-level traffic. The soak's "0 unexpected errors" acceptance bar is preserved because Warn-level drops only fire on real anomalies.
All 17 ethflow-watcher tests pass (+3 new); workspace tests untouched. clippy + fmt clean. AI assistance disclosure: drafted by Claude (Opus 4.7, 1M context).
…redesign, COW-1076)
The original BLEU-833 design lifted the on-chain GPv2OrderData into an
OrderCreation body and POSTed it to /api/v1/orders with the EthFlow
contract as the EIP-1271 owner. Empirical evidence collected during a
2026-06-22 Sepolia soak shows that path cannot succeed:
- The 3 OrderPlacement events the watcher observed during the run
(creationDate 18:43, 18:53, 21:19 UTC, UIDs ending in
ba3cb449...EAdec|ffffffff) are *fulfilled* on the orderbook with
validTo = u32::MAX — the same shape the orderbook rejects when
shepherd POSTs it.
- The orderbook entries carry onchainUser, onchainOrderData and
ethflowData.userValidTo (a real unix timestamp), and these fields
do not exist on the OrderCreation request body in
cowprotocol-orderbook. They are populated by the orderbook backend
itself, which indexes the OrderPlacement event server-side and
creates the order with dual-validTo bookkeeping.
So /api/v1/orders is structurally the wrong endpoint for on-chain
EthFlow placements: the orderbook's native indexer is the one that
creates the order entry, and the public POST endpoint applies the
generic validTo cap and rejects with ExcessiveValidTo every time.
Continuing to submit produces only garbage outcomes per real EthFlow
trade (misleading dropped/backoff markers and a misleading
shepherd_cow_api_submit_total{outcome="err"} increment).
This commit replaces the submit path with the contract the orderbook
actually offers: shepherd decodes the OrderPlacement event, computes
the canonical UID, GETs /api/v1/orders/{uid}, and records the outcome:
- 200 → write observed:{uid}, log Info ("orderbook indexed").
- 404 → log Info ("not yet indexed; will recheck on re-delivery")
and do not write the marker, so a future log re-delivery or a
block-tick poll (out of scope here) can re-verify.
- other err → log Warn with the host's code and message.
Idempotency stays event-driven: once observed:{uid} is written, log
re-delivery is a no-op. Block-tick-driven retries for stuck indexer
lag are intentionally out of scope and tracked separately.
Test surface: the existing BLEU-832 decode tests are preserved verbatim
and the GPv2OrderData fixture is reused with the canonical EthFlow
markers (OrderKind::SELL, SellTokenSource::ERC20, BuyTokenDestination::ERC20).
Two new tests cover UID computation invariants: the owner suffix pins
to the EthFlow contract and the validTo suffix pins to the on-chain
order's validTo. A third asserts compute_uid returns None on unsupported
chains so callers fall back to a Warn log rather than panic.
This invalidates the design that PRs #10 (BLEU-833), #47 (COW-1074),
#48 (COW-1075) and #49 (COW-1076) were all patching. After review,
those PRs should be closed or substantially rewritten — none of the
appData resolution / error-envelope forwarding / log-downgrade work is
needed once the broken endpoint is no longer called.
AI assistance disclosure: drafted by Claude (Opus 4.7, 1M context)
following the soak inspection that surfaced the divergence.
|
Closing as superseded by the observe-not-submit redesign in PR #62 (M2 layer) + PR #63 (BLEU-855 / M3-M4 layer). Background: a 2026-06-22 Sepolia soak observation showed that the 3 EthFlow placements logged here as The COW-1076 reframing comment captures the full evidence: https://linear.app/bleu-builders/issue/COW-1076 No code from this PR is retained — the entire |
Summary
EthFlow on-chain orders carry
validTo = u32::MAXby design (cowprotocol::eth_flowdocuments this as the canonical CoW-side shape on every chain - cancellation is operator-controlled via the EthFlow contract, not orderbook-time-bounded). The Sepolia orderbook's max-validTo cap rejects this shape witherrorType = "ExcessiveValidTo". After PR #48 the strategy already classifies it asRetryAction::Drop(no retry storm); the remaining gap was operator ergonomics: every EthFlow placement on Sepolia produced a Warn-level "ethflow dropped" line, which would dominate a 7-day soak dashboard with non-anomalous traffic.Change
In
apply_submit_retry'sDroparm, peek at the decodedApiError. If the orderbook'serrorType == "ExcessiveValidTo", log atInfoinstead ofWarn. Every otherDropreason (InvalidSignature,WrongOwner,InvalidAppData, ...) keepsWarnso real anomalies still page the operator. Dispatch (dropped:{uid}marker write, stalebackoff:clear) is unchanged.Scope kept narrow on purpose:
ExcessiveValidTohere is unambiguously the documented constraint.validTofield directly - that would be defence-in-depth duplicating the orderbook's own check, and it would make the gate brittle if the orderbook ever loosens its cap.Tests
modules/ethflow-watcher/src/strategy.rssubmit_excessive_valid_to_logs_at_info_not_warnon_logs: exactly one drop line at Info level, zero Warn drops,dropped:{uid}written, nobackoff:{uid}left over.submit_other_permanent_error_still_logs_at_warnInvalidSignaturekeeps Warn.submit_drop_without_envelope_keeps_warn_levelis_expected_excessive_valid_toreturnsfalsewhenHostError.dataisNone(e.g. transport failure).All 17 ethflow-watcher tests pass (+3 new); workspace tests untouched.
cargo clippy --all-targets -- -D warningsclean.cargo fmt --checkclean for the touched file (one pre-existing em-dash on line 181 is from a previous PR; not in scope here).Docs
docs/operations/e2e-testnet-runbook.mdgains a "5.5 Known upstream constraints on Sepolia" section documenting:validTo = u32::MAX).ExcessiveValidTo).dropped:marker per UID).shepherd_cow_api_submit_total{outcome="err"}grows by the EthFlow placement count then plateaus).Why this matters for COW-1031 (7-day soak)
Before this PR: every EthFlow placement on Sepolia produced a Warn-level
ethflow droppedline - on a soak window with active EthFlow traffic, the operator dashboard would surface dozens of "drops" per day that all map to the same documented gap.After this PR: those become Info-level lines, the Warn channel stays reserved for real anomalies (InvalidSignature, AppData mismatches, transport failures), and the soak's "0 unexpected errors" acceptance bar (COW-1064 §7) reads cleanly.
Stack
Stacked on
feat/forward-orderbook-error-cow-1075(PR #48) → stacked onfeat/resolve-app-data-cow-1074(PR #47).Upstream follow-up (out of scope for this PR)
Open question: does mainnet's orderbook also reject
validTo = u32::MAXfor EthFlow orders? If yes, this is a protocol-wide constraint and the EthFlow contract documentation needs revisiting at the cowprotocol level. If no, Sepolia services config drift, file a services issue. Tracked in COW-1076.Linear
Moves COW-1076 toward closure on the strategy + docs side. Upstream confirmation stays as the open thread on that issue.
AI assistance disclosure: drafted by Claude (Opus 4.7, 1M context).