Skip to content

Fix race in WithWaitForLeave that flaked TestPartialStateJoin/Outgoing_device_list_updates#878

Open
erikjohnston wants to merge 4 commits into
mainfrom
erikj/fix_flakey_fed_device_lists
Open

Fix race in WithWaitForLeave that flaked TestPartialStateJoin/Outgoing_device_list_updates#878
erikjohnston wants to merge 4 commits into
mainfrom
erikj/fix_flakey_fed_device_lists

Conversation

@erikjohnston

@erikjohnston erikjohnston commented Jun 8, 2026

Copy link
Copy Markdown
Member

🤖 Generated with Claude Code

Background

TestPartialStateJoin/Outgoing_device_list_updates (in tests/msc3902) flakes in downstream CI (seen in Synapse), failing during test cleanup with:

federation_room_join_partial_state_test.go:77: Received unexpected PDU: {…"membership":"leave"…}

The "unexpected" PDU is the joining user's own leave, sent legitimately as the test tears down — so this is a race in the test helper, not a homeserver bug.

Root cause

WithWaitForLeave registers a PDU handler whose callback pushes matching leaves onto leaveChannel, then decides whether to block by peeking at room.CurrentState:

removePDUHandler := s.AddPDUHandler(func(e) bool { /* push leaveChannel; return true */ })
defer removePDUHandler()

leaveAction()

if room.CurrentState("m.room.member", userID) == "leave" {  // "already seen" shortcut
    // return without consuming the channel
} else {
    <-leaveChannel // …or 1s timeout
}

When a transaction arrives, HandleTransactionRequests calls room.AddEvent (which updates CurrentState) before invoking the PDU callback. So this interleaving is possible:

  1. fed goroutine: room.AddEvent(leave)CurrentState(user) == "leave", callback not yet run
  2. test goroutine: shortcut sees "leave" → returns → defer removePDUHandler() deletes the handler
  3. fed goroutine: PDU callback runs → no handler matches → t.Errorf("Received unexpected PDU")

The helper used two different signals for the same event — the callback channel and CurrentState — and tore its handler down based on the one that fires first. It's flaky because it only fails when the scheduler switches between
AddEvent and the callback (more likely under CI load).

Fix

Wait on the channel fed by the handler instead of polling room.CurrentState, so the handler stays registered until its own callback has run.

Every caller passes a joined user and an action that performs the leave, so the leave always arrives fresh via that callback — there is no legitimate "already left" case. Make that contract explicit instead of papering over it:

  • Snapshot membership before the action and fail (t.Errorf) if the user has already left, rather than silently treating it as "already seen". The action still runs in that case, so any cleanup it performs (e.g.
    partialStateJoinResult.Destroy) is preserved.
  • For the normal case, wait on leaveChannel.
  • Fail on timeout — the lenient CurrentState fallback is gone, since neither an already-left user nor a timeout should happen in correct usage.

This is contained to the msc3902 helper. The HandleTransactionRequests ordering (callback after room.AddEvent) is left intact, as other tests rely on it — e.g. federation_redaction_test.go reads serverRoom.Timeline
immediately after a callback-driven waiter.

@erikjohnston erikjohnston force-pushed the erikj/fix_flakey_fed_device_lists branch from 27a2bb1 to 59d4118 Compare June 8, 2026 18:40
@erikjohnston erikjohnston marked this pull request as ready for review June 8, 2026 20:11
@erikjohnston erikjohnston requested review from a team as code owners June 8, 2026 20:11
…ave-handler race

`WithWaitForLeave` registered a PDU handler (whose callback feeds
`leaveChannel`) but then decided whether to wait by peeking at
`room.CurrentState`. Because an incoming PDU is added to the room
(`HandleTransactionRequests` calls `room.AddEvent`, updating
`CurrentState`) *before* the PDU callback runs, the test goroutine could
observe the updated state, take the "already seen" shortcut and return,
firing the deferred `removePDUHandler()` in the window before the
callback ran. The now-handler-less callback then flagged the (expected)
leave as an unexpected PDU, failing the test.

Wait on the channel fed by the handler instead of polling
`room.CurrentState`, so the handler stays registered until its own
callback has run.

Every caller passes a joined user and an action that performs the leave, so the
leave always arrives fresh via that callback. Make this contract
explicit: snapshot membership before the action and fail if the user has
already left (rather than silently treating it as "already seen"), and
fail on timeout, since neither should happen in correct usage. The
action still runs in the already-left case so any cleanup it performs is
preserved.

The `HandleTransactionRequests` ordering (callback after `room.AddEvent`) is left
unchanged, as other tests rely on it (e.g.
`federation_redaction_test.go` reads `serverRoom.Timeline` immediately
after a callback-driven waiter).
@erikjohnston erikjohnston force-pushed the erikj/fix_flakey_fed_device_lists branch from 59d4118 to b744d2d Compare June 9, 2026 08:43
Comment thread tests/msc3902/federation_room_join_partial_state_test.go
Comment thread tests/msc3902/federation_room_join_partial_state_test.go Outdated
delete(s.eduHandlers, eduHandlerKey)
}
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, can you paste the full test run flake?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/element-hq/synapse/actions/runs/27136501582/job/80090514774?pr=19832

📦 github.com/matrix-org/complement/tests/msc3902
2026/06/08 12:22:24 config: &{BaseImageURI:localhost/complement-synapse DebugLoggingEnabled:false AlwaysPrintServerLogs:false EnvVarsPropagatePrefix:PASS_ SpawnHSTimeout:30s ContainerCPUCores:0 ContainerMemoryBytes:0 KeepBlueprints:[] HostMounts:[] BaseImageURIs:map[] PackageNamespace:msc3902 CACertificate:0xc000964c08 CAPrivateKey:0xc000804380 BestEffort:false HostnameRunningComplement:host.docker.internal EnableDirtyRuns:true HSPortBindingIP:127.0.0.1 PostTestScript:}
❌ TestPartialStateJoin (1m56.99s)
❌ TestPartialStateJoin/Outgoing_device_list_updates (19.95s)
❌ TestPartialStateJoin/Outgoing_device_list_updates/Device_list_updates_reach_incorrectly_kicked_servers_once_partial_state_join_completes (4.2s)
      client.go:860: [CSAPI] POST hs1/_matrix/client/v3/register => 200 OK (15.620036ms)
      client.go:860: [CSAPI] GET hs1/_matrix/client/v3/capabilities => 200 OK (3.801544ms)
      server.go:210: Creating room !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563 with version 10
      federation_room_join_partial_state_test.go:4499: Registered state_ids handler for event $mN1voF4gDGPdTSp3EUwOmCDlOFqlzgXW_KUR-AUXZ8U
      federation_room_join_partial_state_test.go:4540: Registered /state handler for event $mN1voF4gDGPdTSp3EUwOmCDlOFqlzgXW_KUR-AUXZ8U
  2026/06/08 12:23:47 Received send-join of event $kEyw4F8XinpYV9_VahFs5JkXlLv0JIgGb0MkavOKyjE
      client.go:860: [CSAPI] POST hs1/_matrix/client/v3/join/!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563 => 200 OK (80.730602ms)
      federation_room_join_partial_state_test.go:4393: /join request completed
      federation_room_join_partial_state_test.go:4478: Incoming state_ids request for event [$mN1voF4gDGPdTSp3EUwOmCDlOFqlzgXW_KUR-AUXZ8U] in room !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563
  2026/06/08 12:23:48 Received send-join of event $awFsIhzHfpfMVd_m_ei2qJfZt-YtGhCjhTYrYJzq9U0
      federation_room_join_partial_state_test.go:2405: Server.MustJoinRoom joined room ID !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563
      client.go:860: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (49.114711ms)
      client.go:860: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (9.370922ms)
      federation_room_join_partial_state_test.go:2415: Alice successfully received event $awFsIhzHfpfMVd_m_ei2qJfZt-YtGhCjhTYrYJzq9U0 via /sync
      client.go:860: [CSAPI] PUT hs1/_matrix/client/v3/devices/GPHZZFZGEP => 200 OK (4.413737ms)
      federation_room_join_partial_state_test.go:2418: @user-35-t26alice:hs1 sent device list update.
      federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 1","device_id":"GPHZZFZGEP","prev_id":[],"stream_id":54,"user_id":"@user-35-t26alice:hs1"}
      federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 1","device_id":"GPHZZFZGEP","prev_id":[],"stream_id":54,"user_id":"@user-35-t26alice:hs1"}
      federation_room_join_partial_state_test.go:2421: @charlie, @derek and @elsie received device list update.
      client.go:860: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (6.652265ms)
      client.go:860: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (48.310036ms)
      federation_room_join_partial_state_test.go:2439: Alice successfully received event $SxL9gU1hmed77R46CGYKccn6aDKuZBFGFcKN2eBkKRc via /sync
      client.go:860: [CSAPI] PUT hs1/_matrix/client/v3/devices/GPHZZFZGEP => 200 OK (5.036587ms)
      federation_room_join_partial_state_test.go:2497: @user-35-t26alice:hs1 sent device list update.
      federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 2","device_id":"GPHZZFZGEP","prev_id":[54],"stream_id":56,"user_id":"@user-35-t26alice:hs1"}
      federation_room_join_partial_state_test.go:2500: @charlie and @derek received device list update.
      federation_room_join_partial_state_test.go:4485: Replying to /state_ids request for event [$mN1voF4gDGPdTSp3EUwOmCDlOFqlzgXW_KUR-AUXZ8U]
      client.go:860: [CSAPI] GET hs1/_matrix/client/v3/rooms/!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563/members => 200 OK (69.344577ms)
      federation_room_join_partial_state_test.go:2513: @user-35-t26alice:hs1's partial state join to !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563 completed.
      federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 2","device_id":"GPHZZFZGEP","prev_id":[54],"stream_id":57,"user_id":"@user-35-t26alice:hs1"}
      federation_room_join_partial_state_test.go:2517: @elsie received missed device list update.
      client.go:860: [CSAPI] PUT hs1/_matrix/client/v3/devices/GPHZZFZGEP => 200 OK (5.751307ms)
      federation_room_join_partial_state_test.go:2520: @user-35-t26alice:hs1 sent device list update.
      federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"GPHZZFZGEP","prev_id":[56],"stream_id":59,"user_id":"@user-35-t26alice:hs1"}
      federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"GPHZZFZGEP","prev_id":[57],"stream_id":59,"user_id":"@user-35-t26alice:hs1"}
      federation_room_join_partial_state_test.go:2523: @charlie, @derek and @elsie received device list update.
      federation_room_join_partial_state_test.go:4416: Cleaning up after test...
      client.go:860: [CSAPI] GET hs1/_matrix/client/v3/rooms/!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563/members => 200 OK (3.937128ms)
      federation_room_join_partial_state_test.go:4418: @user-35-t26alice:hs1's partial state join to !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563 completed.
      client.go:860: [CSAPI] POST hs1/_matrix/client/v3/rooms/!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563/leave => 200 OK (25.974991ms)
      federation_room_join_partial_state_test.go:156: host.docker.internal:45563 saw @user-35-t26alice:hs1 leave test room !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563.
      federation_room_join_partial_state_test.go:152: host.docker.internal:33565 has already seen @user-35-t26alice:hs1 leave test room !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563.
      federation_room_join_partial_state_test.go:77: Received unexpected PDU: {"auth_events":["$613ehdee8-7NYiEo9wLU_E89o8bDbIvOvNdMfB8c_aw","$kEyw4F8XinpYV9_VahFs5JkXlLv0JIgGb0MkavOKyjE","$jnQmT8p0NUxGgUChevtHR2Qu027MlckWRQLfTsnaRXs"],"content":{"membership":"leave"},"depth":11,"hashes":{"sha256":"oaWwqeddMuOgqMSQs/IDHmDxHcxwNdjTTVJrhcHk0JM"},"origin_server_ts":1780921429379,"prev_events":["$SxL9gU1hmed77R46CGYKccn6aDKuZBFGFcKN2eBkKRc"],"room_id":"!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563","sender":"@user-35-t26alice:hs1","signatures":{"hs1":{"ed25519:a_zXSM":"y5zkhzV66kRDjAI1A/BUHRHnz7Ww1dki3S3o3nza3tBOnNOBA2E/C7Lf7WdOfiPTR9VLstNNufapfqaDYWhECA"}},"state_key":"@user-35-t26alice:hs1","type":"m.room.member"}

Co-authored-by: Eric Eastwood <erice@element.io>
We need to always run the `leaveAction`, as the rest of the code assumes
it will always run. We also need to handle the case the user isn't in
the room anymore, as that is a valid situation too.
@erikjohnston

Copy link
Copy Markdown
Member Author

Right, this is now cleaned up a bit. The problem was that a late assumption I made, that the user in question was always joined to the room, isn't always true.

The existing checks also try and infer via various mechanisms when we expect the server to receive the leave PDU. This boils down to checking both a) the user is still in the room, and b) the remote server in question is in the room. I found it easier to reason about to check these two conditions explicitly.

@erikjohnston erikjohnston requested a review from a team June 10, 2026 15:42
Comment on lines +164 to +170
if !s.isInRoom(room) {
// The homeserver only federates the leave to servers that are in the
// room. If we aren't, no leave PDU is coming to us, so don't block until
// the timeout.
t.Logf("%s is not in test room %s; not waiting for %s to leave.", s.ServerName(), room.RoomID, userID)
return
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a test programming error. Why are we trying to accommodate this?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see, this is done because WithWaitForLeave is used as part of a defer psjResult.Destroy(t) pattern to clean up resources after each test. This includes when a test errors out any time (before they even join the room).

Overall, looks fishy. See #878 (comment)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, it's a generic cleanup function to ensure that all of the tests users have left, no matter what the state of the users and servers are in? It's a fairly typical pattern

Comment on lines +157 to 162
if !userInRoom {
// The user had already left, so the action produced no new leave and
// none is coming: don't wait.
t.Logf("%s is not joined to test room %s; not waiting for them to leave.", userID, room.RoomID)
return
}

@MadLittleMods MadLittleMods Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a test programming error. Why are we trying to accommodate this?

Other related discussion: #878 (comment)

// room, before performing the `leaveAction` (to avoid races). If they are
// in the room we need to wait until the server sees the leave, but if they
// aren't no leave event will be sent out.
userInRoom := userIsJoinedTo(t, user, room.RoomID)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Complement, the way we typically do this is using user.MustSyncUntil(t, client.SyncReq{Since: aliceSince}, client.SyncJoinedTo(user.UserID, room.RoomID))

And since we seem to only use this for the negative case and return, I'm guessing that using the typical flow and failing the test is also fine (related to the points below about programming errors)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Complement, the way we typically do this is using user.MustSyncUntil(t, client.SyncReq{Since: aliceSince}, client.SyncJoinedTo(user.UserID, room.RoomID))

That only tests whether a new join has happened, not whether the user is joined at all

Comment thread tests/msc3902/federation_room_join_partial_state_test.go Outdated
// wait is skipped when `user` had already left the room (per their own
// homeserver, so the action produces no new leave) or when this server isn't in
// the room (so the leave won't be federated to us).
func (s *server) WithWaitForLeave(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole concept of WithWaitForLeave might be the wrong solution to the problem. It was introduced in #626

Many of the faster joins test flakes are due to the homeserver under test failing to contact Complement homeservers after they have been torn down. When this happens, subsequent tests can fail if they use a Complement homeserver that happens to have the same hostname:port as one which the homeserver under test has previously marked as offline.

Make the homeserver under test leave rooms at the end of each test, so that it stops trying to contact the Complement homeservers in the room.

The dumb simple solution would be to instead to create new homeserver deployments for each test to make them isolated from each other. It looks like we even do that for some tests.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, sure we could. I don't think that cleaning up after a test is an unreasonable way of doing things though.

I'm hesitant to just go and use new deployments, as the tests are already really slow. The 7 tests take ~18s to run locally, and isolating them further is just going to make that worse

Co-authored-by: Eric Eastwood <erice@element.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants