Fix race in WithWaitForLeave that flaked TestPartialStateJoin/Outgoing_device_list_updates#878
Fix race in WithWaitForLeave that flaked TestPartialStateJoin/Outgoing_device_list_updates#878erikjohnston wants to merge 4 commits into
WithWaitForLeave that flaked TestPartialStateJoin/Outgoing_device_list_updates#878Conversation
27a2bb1 to
59d4118
Compare
…ave-handler race `WithWaitForLeave` registered a PDU handler (whose callback feeds `leaveChannel`) but then decided whether to wait by peeking at `room.CurrentState`. Because an incoming PDU is added to the room (`HandleTransactionRequests` calls `room.AddEvent`, updating `CurrentState`) *before* the PDU callback runs, the test goroutine could observe the updated state, take the "already seen" shortcut and return, firing the deferred `removePDUHandler()` in the window before the callback ran. The now-handler-less callback then flagged the (expected) leave as an unexpected PDU, failing the test. Wait on the channel fed by the handler instead of polling `room.CurrentState`, so the handler stays registered until its own callback has run. Every caller passes a joined user and an action that performs the leave, so the leave always arrives fresh via that callback. Make this contract explicit: snapshot membership before the action and fail if the user has already left (rather than silently treating it as "already seen"), and fail on timeout, since neither should happen in correct usage. The action still runs in the already-left case so any cleanup it performs is preserved. The `HandleTransactionRequests` ordering (callback after `room.AddEvent`) is left unchanged, as other tests rely on it (e.g. `federation_redaction_test.go` reads `serverRoom.Timeline` immediately after a callback-driven waiter).
59d4118 to
b744d2d
Compare
| delete(s.eduHandlers, eduHandlerKey) | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
For reference, can you paste the full test run flake?
There was a problem hiding this comment.
https://github.com/element-hq/synapse/actions/runs/27136501582/job/80090514774?pr=19832
📦 github.com/matrix-org/complement/tests/msc3902
2026/06/08 12:22:24 config: &{BaseImageURI:localhost/complement-synapse DebugLoggingEnabled:false AlwaysPrintServerLogs:false EnvVarsPropagatePrefix:PASS_ SpawnHSTimeout:30s ContainerCPUCores:0 ContainerMemoryBytes:0 KeepBlueprints:[] HostMounts:[] BaseImageURIs:map[] PackageNamespace:msc3902 CACertificate:0xc000964c08 CAPrivateKey:0xc000804380 BestEffort:false HostnameRunningComplement:host.docker.internal EnableDirtyRuns:true HSPortBindingIP:127.0.0.1 PostTestScript:}
❌ TestPartialStateJoin (1m56.99s)
❌ TestPartialStateJoin/Outgoing_device_list_updates (19.95s)
❌ TestPartialStateJoin/Outgoing_device_list_updates/Device_list_updates_reach_incorrectly_kicked_servers_once_partial_state_join_completes (4.2s)
client.go:860: [CSAPI] POST hs1/_matrix/client/v3/register => 200 OK (15.620036ms)
client.go:860: [CSAPI] GET hs1/_matrix/client/v3/capabilities => 200 OK (3.801544ms)
server.go:210: Creating room !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563 with version 10
federation_room_join_partial_state_test.go:4499: Registered state_ids handler for event $mN1voF4gDGPdTSp3EUwOmCDlOFqlzgXW_KUR-AUXZ8U
federation_room_join_partial_state_test.go:4540: Registered /state handler for event $mN1voF4gDGPdTSp3EUwOmCDlOFqlzgXW_KUR-AUXZ8U
2026/06/08 12:23:47 Received send-join of event $kEyw4F8XinpYV9_VahFs5JkXlLv0JIgGb0MkavOKyjE
client.go:860: [CSAPI] POST hs1/_matrix/client/v3/join/!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563 => 200 OK (80.730602ms)
federation_room_join_partial_state_test.go:4393: /join request completed
federation_room_join_partial_state_test.go:4478: Incoming state_ids request for event [$mN1voF4gDGPdTSp3EUwOmCDlOFqlzgXW_KUR-AUXZ8U] in room !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563
2026/06/08 12:23:48 Received send-join of event $awFsIhzHfpfMVd_m_ei2qJfZt-YtGhCjhTYrYJzq9U0
federation_room_join_partial_state_test.go:2405: Server.MustJoinRoom joined room ID !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563
client.go:860: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (49.114711ms)
client.go:860: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (9.370922ms)
federation_room_join_partial_state_test.go:2415: Alice successfully received event $awFsIhzHfpfMVd_m_ei2qJfZt-YtGhCjhTYrYJzq9U0 via /sync
client.go:860: [CSAPI] PUT hs1/_matrix/client/v3/devices/GPHZZFZGEP => 200 OK (4.413737ms)
federation_room_join_partial_state_test.go:2418: @user-35-t26alice:hs1 sent device list update.
federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 1","device_id":"GPHZZFZGEP","prev_id":[],"stream_id":54,"user_id":"@user-35-t26alice:hs1"}
federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 1","device_id":"GPHZZFZGEP","prev_id":[],"stream_id":54,"user_id":"@user-35-t26alice:hs1"}
federation_room_join_partial_state_test.go:2421: @charlie, @derek and @elsie received device list update.
client.go:860: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (6.652265ms)
client.go:860: [CSAPI] GET hs1/_matrix/client/v3/sync => 200 OK (48.310036ms)
federation_room_join_partial_state_test.go:2439: Alice successfully received event $SxL9gU1hmed77R46CGYKccn6aDKuZBFGFcKN2eBkKRc via /sync
client.go:860: [CSAPI] PUT hs1/_matrix/client/v3/devices/GPHZZFZGEP => 200 OK (5.036587ms)
federation_room_join_partial_state_test.go:2497: @user-35-t26alice:hs1 sent device list update.
federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 2","device_id":"GPHZZFZGEP","prev_id":[54],"stream_id":56,"user_id":"@user-35-t26alice:hs1"}
federation_room_join_partial_state_test.go:2500: @charlie and @derek received device list update.
federation_room_join_partial_state_test.go:4485: Replying to /state_ids request for event [$mN1voF4gDGPdTSp3EUwOmCDlOFqlzgXW_KUR-AUXZ8U]
client.go:860: [CSAPI] GET hs1/_matrix/client/v3/rooms/!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563/members => 200 OK (69.344577ms)
federation_room_join_partial_state_test.go:2513: @user-35-t26alice:hs1's partial state join to !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563 completed.
federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 2","device_id":"GPHZZFZGEP","prev_id":[54],"stream_id":57,"user_id":"@user-35-t26alice:hs1"}
federation_room_join_partial_state_test.go:2517: @elsie received missed device list update.
client.go:860: [CSAPI] PUT hs1/_matrix/client/v3/devices/GPHZZFZGEP => 200 OK (5.751307ms)
federation_room_join_partial_state_test.go:2520: @user-35-t26alice:hs1 sent device list update.
federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"GPHZZFZGEP","prev_id":[56],"stream_id":59,"user_id":"@user-35-t26alice:hs1"}
federation_room_join_partial_state_test.go:2156: Complement server received m.device_list_update: {"device_display_name":"A new device name 3","device_id":"GPHZZFZGEP","prev_id":[57],"stream_id":59,"user_id":"@user-35-t26alice:hs1"}
federation_room_join_partial_state_test.go:2523: @charlie, @derek and @elsie received device list update.
federation_room_join_partial_state_test.go:4416: Cleaning up after test...
client.go:860: [CSAPI] GET hs1/_matrix/client/v3/rooms/!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563/members => 200 OK (3.937128ms)
federation_room_join_partial_state_test.go:4418: @user-35-t26alice:hs1's partial state join to !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563 completed.
client.go:860: [CSAPI] POST hs1/_matrix/client/v3/rooms/!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563/leave => 200 OK (25.974991ms)
federation_room_join_partial_state_test.go:156: host.docker.internal:45563 saw @user-35-t26alice:hs1 leave test room !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563.
federation_room_join_partial_state_test.go:152: host.docker.internal:33565 has already seen @user-35-t26alice:hs1 leave test room !0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563.
federation_room_join_partial_state_test.go:77: Received unexpected PDU: {"auth_events":["$613ehdee8-7NYiEo9wLU_E89o8bDbIvOvNdMfB8c_aw","$kEyw4F8XinpYV9_VahFs5JkXlLv0JIgGb0MkavOKyjE","$jnQmT8p0NUxGgUChevtHR2Qu027MlckWRQLfTsnaRXs"],"content":{"membership":"leave"},"depth":11,"hashes":{"sha256":"oaWwqeddMuOgqMSQs/IDHmDxHcxwNdjTTVJrhcHk0JM"},"origin_server_ts":1780921429379,"prev_events":["$SxL9gU1hmed77R46CGYKccn6aDKuZBFGFcKN2eBkKRc"],"room_id":"!0-hvmfClz7gk1ghOQTi5:host.docker.internal:45563","sender":"@user-35-t26alice:hs1","signatures":{"hs1":{"ed25519:a_zXSM":"y5zkhzV66kRDjAI1A/BUHRHnz7Ww1dki3S3o3nza3tBOnNOBA2E/C7Lf7WdOfiPTR9VLstNNufapfqaDYWhECA"}},"state_key":"@user-35-t26alice:hs1","type":"m.room.member"}
Co-authored-by: Eric Eastwood <erice@element.io>
We need to always run the `leaveAction`, as the rest of the code assumes it will always run. We also need to handle the case the user isn't in the room anymore, as that is a valid situation too.
|
Right, this is now cleaned up a bit. The problem was that a late assumption I made, that the user in question was always joined to the room, isn't always true. The existing checks also try and infer via various mechanisms when we expect the server to receive the leave PDU. This boils down to checking both a) the user is still in the room, and b) the remote server in question is in the room. I found it easier to reason about to check these two conditions explicitly. |
| if !s.isInRoom(room) { | ||
| // The homeserver only federates the leave to servers that are in the | ||
| // room. If we aren't, no leave PDU is coming to us, so don't block until | ||
| // the timeout. | ||
| t.Logf("%s is not in test room %s; not waiting for %s to leave.", s.ServerName(), room.RoomID, userID) | ||
| return | ||
| } |
There was a problem hiding this comment.
This seems like a test programming error. Why are we trying to accommodate this?
There was a problem hiding this comment.
As far as I can see, this is done because WithWaitForLeave is used as part of a defer psjResult.Destroy(t) pattern to clean up resources after each test. This includes when a test errors out any time (before they even join the room).
Overall, looks fishy. See #878 (comment)
There was a problem hiding this comment.
I mean, it's a generic cleanup function to ensure that all of the tests users have left, no matter what the state of the users and servers are in? It's a fairly typical pattern
| if !userInRoom { | ||
| // The user had already left, so the action produced no new leave and | ||
| // none is coming: don't wait. | ||
| t.Logf("%s is not joined to test room %s; not waiting for them to leave.", userID, room.RoomID) | ||
| return | ||
| } |
There was a problem hiding this comment.
This seems like a test programming error. Why are we trying to accommodate this?
Other related discussion: #878 (comment)
| // room, before performing the `leaveAction` (to avoid races). If they are | ||
| // in the room we need to wait until the server sees the leave, but if they | ||
| // aren't no leave event will be sent out. | ||
| userInRoom := userIsJoinedTo(t, user, room.RoomID) |
There was a problem hiding this comment.
In Complement, the way we typically do this is using user.MustSyncUntil(t, client.SyncReq{Since: aliceSince}, client.SyncJoinedTo(user.UserID, room.RoomID))
And since we seem to only use this for the negative case and return, I'm guessing that using the typical flow and failing the test is also fine (related to the points below about programming errors)
There was a problem hiding this comment.
In Complement, the way we typically do this is using
user.MustSyncUntil(t, client.SyncReq{Since: aliceSince}, client.SyncJoinedTo(user.UserID, room.RoomID))
That only tests whether a new join has happened, not whether the user is joined at all
| // wait is skipped when `user` had already left the room (per their own | ||
| // homeserver, so the action produces no new leave) or when this server isn't in | ||
| // the room (so the leave won't be federated to us). | ||
| func (s *server) WithWaitForLeave( |
There was a problem hiding this comment.
The whole concept of WithWaitForLeave might be the wrong solution to the problem. It was introduced in #626
Many of the faster joins test flakes are due to the homeserver under test failing to contact Complement homeservers after they have been torn down. When this happens, subsequent tests can fail if they use a Complement homeserver that happens to have the same hostname:port as one which the homeserver under test has previously marked as offline.
Make the homeserver under test leave rooms at the end of each test, so that it stops trying to contact the Complement homeservers in the room.
The dumb simple solution would be to instead to create new homeserver deployments for each test to make them isolated from each other. It looks like we even do that for some tests.
There was a problem hiding this comment.
I mean, sure we could. I don't think that cleaning up after a test is an unreasonable way of doing things though.
I'm hesitant to just go and use new deployments, as the tests are already really slow. The 7 tests take ~18s to run locally, and isolating them further is just going to make that worse
Co-authored-by: Eric Eastwood <erice@element.io>
🤖 Generated with Claude Code
Background
TestPartialStateJoin/Outgoing_device_list_updates(intests/msc3902) flakes in downstream CI (seen in Synapse), failing during test cleanup with:The "unexpected" PDU is the joining user's own
leave, sent legitimately as the test tears down — so this is a race in the test helper, not a homeserver bug.Root cause
WithWaitForLeaveregisters a PDU handler whose callback pushes matching leaves ontoleaveChannel, then decides whether to block by peeking atroom.CurrentState:When a transaction arrives,
HandleTransactionRequestscallsroom.AddEvent(which updatesCurrentState) before invoking the PDU callback. So this interleaving is possible:room.AddEvent(leave)→CurrentState(user) == "leave", callback not yet run"leave"→ returns →defer removePDUHandler()deletes the handlert.Errorf("Received unexpected PDU")The helper used two different signals for the same event — the callback channel and
CurrentState— and tore its handler down based on the one that fires first. It's flaky because it only fails when the scheduler switches betweenAddEventand the callback (more likely under CI load).Fix
Wait on the channel fed by the handler instead of polling
room.CurrentState, so the handler stays registered until its own callback has run.Every caller passes a joined user and an action that performs the leave, so the leave always arrives fresh via that callback — there is no legitimate "already left" case. Make that contract explicit instead of papering over it:
t.Errorf) if the user has already left, rather than silently treating it as "already seen". The action still runs in that case, so any cleanup it performs (e.g.partialStateJoinResult.Destroy) is preserved.leaveChannel.CurrentStatefallback is gone, since neither an already-left user nor a timeout should happen in correct usage.This is contained to the
msc3902helper. TheHandleTransactionRequestsordering (callback afterroom.AddEvent) is left intact, as other tests rely on it — e.g.federation_redaction_test.goreadsserverRoom.Timelineimmediately after a callback-driven waiter.