
Fix Durable Object ordering when mixing RPC and fetch calls #6562

Open
threepointone wants to merge 1 commit into main from fix/do-rpc-fetch-ordering

Conversation

@threepointone
Contributor

Summary

Fixes #6561.

When a caller fires interleaved stub.rpc() and stub.fetch() calls on the same Durable Object stub without awaiting, all fetch calls were processed before all RPC calls, regardless of send order. This violated the expected E-order guarantee.

// Caller sends: rpc-0, fetch-1, rpc-2, fetch-3, rpc-4, fetch-5, ...
//
// Before fix:   fetch-1, fetch-3, fetch-5, ..., rpc-0, rpc-2, rpc-4, ...
// After fix:    rpc-0, fetch-1, rpc-2, fetch-3, rpc-4, fetch-5, ...

Same-type ordering (pure RPC or pure fetch) was already correct. This fix addresses the cross-type ordering.

Root Cause

stub.fetch() and stub.rpc() both create a WorkerInterface via ActorChannel::startRequest(), but they call different methods on it — request() for fetch, customEvent() for RPC. These two paths reach the DO's InputGate through a very different number of async hops:

fetch path (~1 hop to InputGate):

WorkerEntrypoint::request()
  → context.run()
    → InputGate::wait()      ← queued here

RPC path (~4+ hops to InputGate):

WorkerEntrypoint::customEvent()
  → JsRpcSessionCustomEvent::run()
    → delivered()
    → Cap'n Proto session setup + capability fulfillment
       → JsRpcTargetBase::call()
         → co_await kj::yield()
         → ctx.run()
           → InputGate::wait()   ← queued here

Since all operations originate from the same synchronous JS execution, fetch calls (fewer hops) always enqueued at the InputGate before any RPC calls (more hops).
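The effect can be reproduced with a small toy model (plain Node.js, no workerd APIs; the names, hop counts, and queue are illustrative only): each operation must complete its number of async hops before it can join the FIFO queue that stands in for InputGate::wait().

```javascript
// Toy model of the bug: a hand-rolled event loop. Each operation needs
// `hops` turns before it can join the FIFO queue standing in for
// InputGate::wait(). Hop counts mirror the fetch (~1) vs RPC (~4) paths.
function simulate(ops) {
  const gateQueue = [];                  // the "InputGate" FIFO
  let tasks = ops.map((op) => ({ ...op }));
  while (tasks.length > 0) {
    const next = [];
    for (const t of tasks) {
      t.hops -= 1;                       // one async hop per turn
      if (t.hops === 0) gateQueue.push(t.name);
      else next.push(t);
    }
    tasks = next;
  }
  return gateQueue;
}

// Caller sends rpc-0, fetch-1, rpc-2, ... in one synchronous burst.
const sends = [];
for (let i = 0; i < 6; i++) {
  const isRpc = i % 2 === 0;
  sends.push({ name: isRpc ? `rpc-${i}` : `fetch-${i}`, hops: isRpc ? 4 : 1 });
}
console.log(simulate(sends).join(', '));
// → fetch-1, fetch-3, fetch-5, rpc-0, rpc-2, rpc-4
```

Every fetch clears its single hop on the first turn, so all fetches enqueue before any RPC finishes its session setup — exactly the "before fix" ordering above.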

Fix

Eagerly acquire the InputGate position in JsRpcSessionCustomEvent::run(), right after delivered() — the same level where WorkerEntrypoint::request() acquires it via context.run(). The lock is stored on EntrypointJsRpcTarget and consumed by the first ctx.run() call via the existing IoContext::run(func, Maybe<InputGate::Lock>) overload (the same pattern used by ensureConstructedImpl).

Key properties:

  • The lock is consumed on first use — it is NOT held for the RPC session lifetime
  • The kj::yield() in JsRpcTargetBase::call() still runs (InputGate position is already reserved, so ExternalPusher ordering is preserved)
  • Non-actor RPC is unaffected (guarded by KJ_IF_SOME(a, ioctx.getActor()))
  • TransientJsRpcTarget (within-session stubs) is unaffected — only EntrypointJsRpcTarget uses the pre-acquired lock
  • The ServerTopLevelMembrane ensures exactly one call per getClientForOneCall() session, so the lock always matches the single call
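The fix can be modeled in the same toy framework (again plain Node.js with illustrative names, not workerd code): RPC operations reserve their gate position eagerly at send time and only consume it once session setup finishes, mirroring the consumed-on-first-use lock.

```javascript
// Toy model of the fix: every operation reserves a one-shot slot in the
// "InputGate" FIFO at send time (hop 0), then marks it ready when its
// async hops complete. Delivery drains slots in reservation (send) order.
function simulateFixed(ops) {
  const gateQueue = ops.map((op) => ({ name: op.name, ready: false }));

  let tasks = ops.map((op) => ({ ...op }));
  while (tasks.length > 0) {
    const next = [];
    for (const t of tasks) {
      t.hops -= 1;
      if (t.hops === 0) {
        gateQueue.find((s) => s.name === t.name).ready = true;
      } else next.push(t);
    }
    tasks = next;
  }
  return gateQueue.filter((s) => s.ready).map((s) => s.name);
}

const sends = [];
for (let i = 0; i < 6; i++) {
  const isRpc = i % 2 === 0;
  sends.push({ name: isRpc ? `rpc-${i}` : `fetch-${i}`, hops: isRpc ? 4 : 1 });
}
console.log(simulateFixed(sends).join(', '));
// → rpc-0, fetch-1, rpc-2, fetch-3, rpc-4, fetch-5
```

Because the position is fixed at send time, the extra hops on the RPC path no longer affect delivery order; they only delay when the already-reserved slot becomes ready.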

Performance

The InputGate lock is acquired a few event loop turns earlier than before — during the Cap'n Proto session setup and kj::yield() period (microseconds). ExternalPusher calls (pushByteStream, pushAbortSignal) don't need the InputGate, so they are unaffected. Other events wait one extra turn at most. For single-caller latency, the gate is free and wait() returns immediately.

Changes

  • worker-rpc.c++: 27 lines — early InputGate acquisition in JsRpcSessionCustomEvent::run(), lock threading through JsRpcTargetBase to ctx.run()
  • js-rpc-test.js: New OrderingActor DO class + 3 test cases (mixedRpcFetchOrdering, pureRpcOrdering, pureFetchOrdering)
  • js-rpc-test.wd-test: Register OrderingActor binding and namespace

Test plan

  • mixedRpcFetchOrdering — interleaves 20 RPC and fetch calls, asserts send-order delivery (the primary bug)
  • pureRpcOrdering — 20 fire-and-forget RPC calls preserve order (regression)
  • pureFetchOrdering — 20 fire-and-forget fetch calls preserve order (regression)
  • Existing eOrderTest passes (ExternalPusher + e-order regression)
  • Existing receiveStubOverRpc passes (service stub RPC ordering regression)
  • Existing namedActorBinding passes (DO RPC regression)
  • Full js-rpc-test@ suite passes locally

Reproduction

Repro repo: https://github.com/threepointone/test-do-rpc-fetch-ordering
Live demo: curl https://test-ordering.threepointone.workers.dev/test-all?n=20


Root cause analysis, fix design, and implementation developed in collaboration with Cursor.

Made with Cursor

@ask-bonk
Contributor

ask-bonk bot commented Apr 11, 2026

LGTM


When a caller fires interleaved stub.rpc() and stub.fetch() calls on
the same DO stub without awaiting, all fetch calls were processed before
all RPC calls, regardless of send order. This violated the expected
E-order guarantee for actors.

Root cause: fetch reaches the InputGate in ~1 async hop (via
WorkerEntrypoint::request() -> context.run() -> InputGate::wait()),
while RPC takes ~4+ hops (customEvent -> JsRpcSessionCustomEvent::run()
-> Cap'n Proto session setup -> capability fulfillment -> pipelined
call dispatch -> JsRpcTargetBase::call() -> kj::yield() -> ctx.run()
-> InputGate::wait()). Since all operations originate from the same
synchronous JS execution, fetch calls always reached the FIFO queue
before any RPC calls.

Fix: eagerly acquire the InputGate position in
JsRpcSessionCustomEvent::run() — at the same level where
WorkerEntrypoint::request() acquires it — then thread the lock through
to the first ctx.run() call via the existing
IoContext::run(func, Maybe<InputGate::Lock>) overload. The lock is
consumed by the first RPC method invocation and is NOT held for the
session lifetime. The kj::yield() in JsRpcTargetBase::call() and
ExternalPusher ordering are unaffected.

Fixes #6561

Made-with: Cursor
@threepointone threepointone force-pushed the fix/do-rpc-fetch-ordering branch from 42bf758 to 3cad02c on April 11, 2026 at 11:53
@kentonv
Member

kentonv commented Apr 15, 2026

This is a clever solution and I mostly like it.

One possible issue is, technically, this would allow a client (of the capnp RPC interface) to lock up a DO by sending the initial call to open an RPC session, and then failing to send the actual call. It basically gives clients the ability to take a lock on a DO, which obviously they shouldn't have.

This is an internal RPC interface, though, not exposed to malicious clients. So we don't really have to worry about this being used in some sort of a DoS attack. But I suppose it's technically possible that due to a bug or a poorly-timed network hiccup, the "call" message may be significantly delayed after the "open RPC session" message.

Ugh I can't quite convince myself that this is fine, nor can I think of an easy way to make it safe.

A much deeper, more thorough solution would be to do what I've always wanted to do and make the KJ event loop ordering fully depth-first. I think that would just solve this problem inherently. But, that's a pretty delicate change with impact across the whole runtime that should be strictly good in theory, but may have unintended consequences (stuff unintentionally depending on the current ordering)... hmm.


Meanwhile, though, I wonder how you're using this.

I'm not sure you can actually safely rely on this in practice, due to hibernation. If you are relying on an initial call to initialize some in-memory state, and need that in-memory state to be initialized for the second call... there's no guarantee that the DO doesn't hibernate in between. True, if you make these calls in rapid succession, it's extremely unlikely that hibernation would happen, but it's always theoretically possible if a packet gets lost somewhere resulting in a delay, or if the runtime suddenly decides it needs to shut the DO down for an update or whatnot.

The safe thing to do is to have the first call return an RpcTarget, and the second call is pipelined on that. That always guarantees the two calls land on the same instance (or, rarely, the second call fails).

Could that approach work for your use case? I personally have been using this pattern a fair amount and feel like it turns out nicely.
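A plain-JavaScript analogue of the pattern kentonv describes (in real Workers code the session class would extend RpcTarget from "cloudflare:workers" and the second call would be pipelined on the stub returned by the first; the class and method names here are illustrative):

```javascript
// Plain-JS sketch of "first call returns an RpcTarget, second call is
// pipelined on it". A plain class stands in for RpcTarget; ordinary
// method calls stand in for RPC.
class Session {
  constructor(state) { this.state = state; }
  // The second call lands on this object, hence necessarily on the same
  // instance that ran the initialization in open().
  async increment() { return ++this.state.counter; }
}

class Actor {
  constructor() { this.memory = null; }
  // First call: initialize in-memory state, hand back a session object.
  async open() {
    this.memory = { counter: 0 };
    return new Session(this.memory);
  }
}

async function main() {
  const actor = new Actor();
  // With capability pipelining the client would not even await open()
  // before calling increment(); we await here only for simplicity.
  const session = await actor.open();
  return session.increment();
}

main().then((n) => console.log(n)); // prints 1
```

If the instance goes away between the two calls, the pipelined call fails rather than silently landing on a freshly initialized instance, which is the safety property hibernation would otherwise break.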


Development

Successfully merging this pull request may close these issues.

Durable Object ordering is not preserved when mixing RPC and fetch calls on the same stub