-
Notifications
You must be signed in to change notification settings - Fork 69
Fix race in WithWaitForLeave that flaked TestPartialStateJoin/Outgoing_device_list_updates
#878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
b744d2d
b48b4eb
a503068
d9da09a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -123,10 +123,15 @@ func (s *server) AddEDUHandler(eduHandler func(gomatrixserverlib.EDU) bool) func | |
| } | ||
| } | ||
|
|
||
| // WithWaitForLeave runs the given action and waits for the user to leave the room. | ||
| // WithWaitForLeave runs the given action and, when the resulting leave is | ||
| // expected to reach this server, waits for it. `leaveAction` is always run; the | ||
| // wait is skipped when `user` had already left the room (per their own | ||
| // homeserver, so the action produces no new leave) or when this server isn't in | ||
| // the room (so the leave won't be federated to us). | ||
| func (s *server) WithWaitForLeave( | ||
| t *testing.T, room *federation.ServerRoom, userID string, leaveAction func(), | ||
| t *testing.T, room *federation.ServerRoom, user *client.CSAPI, leaveAction func(), | ||
| ) { | ||
| userID := user.UserID | ||
| leaveChannel := make(chan gomatrixserverlib.PDU, 10) | ||
| removePDUHandler := s.AddPDUHandler( | ||
| func(e gomatrixserverlib.PDU) bool { | ||
|
|
@@ -141,24 +146,70 @@ func (s *server) WithWaitForLeave( | |
| ) | ||
| defer removePDUHandler() | ||
|
|
||
| // We need to check if the user (on their homeserver) thinks they're in the | ||
| // room, before performing the `leaveAction` (to avoid races). If they are | ||
| // in the room we need to wait until the server sees the leave, but if they | ||
| // aren't no leave event will be sent out. | ||
| userInRoom := userIsJoinedTo(t, user, room.RoomID) | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In Complement, the way we typically do this is using And since we seem to only use this for the negative case and return, I'm guessing that using the typical flow and failing the test is also fine (related to the points below about programming errors)
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
That only tests whether a new join has happened, not whether the user is joined at all |
||
|
|
||
| leaveAction() | ||
|
|
||
| memberEvent := room.CurrentState("m.room.member", userID) | ||
| membership := "" | ||
| if memberEvent != nil { | ||
| membership, _ = memberEvent.Membership() | ||
| if !userInRoom { | ||
| // The user had already left, so the action produced no new leave and | ||
| // none is coming: don't wait. | ||
| t.Logf("%s is not joined to test room %s; not waiting for them to leave.", userID, room.RoomID) | ||
| return | ||
| } | ||
|
Comment on lines
+157
to
162
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems like a test programming error. Why are we trying to accommodate this? Other related discussion: #878 (comment) |
||
| if membership == "leave" { | ||
| t.Logf("%s has already seen %s leave test room %s.", s.ServerName(), userID, room.RoomID) | ||
| } else { | ||
| select { | ||
| case <-leaveChannel: | ||
| t.Logf("%s saw %s leave test room %s.", s.ServerName(), userID, room.RoomID) | ||
| break | ||
| case <-time.After(1 * time.Second): | ||
| t.Errorf("%s timed out waiting for %s to leave test room %s.", s.ServerName(), userID, room.RoomID) | ||
|
|
||
| if !s.isInRoom(room) { | ||
| // The homeserver only federates the leave to servers that are in the | ||
| // room. If we aren't, no leave PDU is coming to us, so don't block until | ||
| // the timeout. | ||
| t.Logf("%s is not in test room %s; not waiting for %s to leave.", s.ServerName(), room.RoomID, userID) | ||
| return | ||
| } | ||
|
Comment on lines
+164
to
+170
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems like a test programming error. Why are we trying to accommodate this?
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As far as I can see, this is done because Overall, looks fishy. See #878 (comment)
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I mean, it's a generic cleanup function to ensure that all of the tests users have left, no matter what the state of the users and servers are in? It's a fairly typical pattern |
||
|
|
||
| // Otherwise the action triggered the leave, which arrives as a PDU our | ||
| // handler matches. Wait on its channel rather than polling | ||
| // `room.CurrentState`: the room's current state is updated (by | ||
| // `room.AddEvent`) *before* the PDU callback runs, so returning on a | ||
| // `CurrentState` check could deregister our handler in the window before the | ||
| // callback fires, making the (expected) leave look unexpected to | ||
| // `HandleTransactionRequests`. This returns as soon as the leave arrives; the | ||
| // timeout is only a ceiling for declaring failure. | ||
| select { | ||
| case <-leaveChannel: | ||
| t.Logf("%s saw %s leave test room %s.", s.ServerName(), userID, room.RoomID) | ||
| case <-time.After(1 * time.Second): | ||
| t.Errorf("%s timed out waiting for %s to leave test room %s.", s.ServerName(), userID, room.RoomID) | ||
| } | ||
| } | ||
|
|
||
|
erikjohnston marked this conversation as resolved.
|
||
| // isInRoom reports whether this Complement server has a joined user in the room, | ||
| // according to its own `ServerRoom` view. The server reliably tracks its own | ||
| // users' membership (it created their join/leave events), so this answers "will | ||
| // the homeserver federate events in this room to us?". | ||
| func (s *server) isInRoom(room *federation.ServerRoom) bool { | ||
| for _, serverInRoom := range room.ServersInRoom() { | ||
| if serverInRoom == s.ServerName() { | ||
| return true | ||
| } | ||
| } | ||
| return false | ||
| } | ||
|
|
||
| // userIsJoinedTo reports whether the user is currently joined to the room, | ||
| // according to the user's own homeserver. | ||
| func userIsJoinedTo(t *testing.T, user *client.CSAPI, roomID string) bool { | ||
| t.Helper() | ||
| res := user.MustDo(t, "GET", []string{"_matrix", "client", "v3", "joined_rooms"}) | ||
| joinedRooms := gjson.ParseBytes(client.ParseJSON(t, res)).Get("joined_rooms") | ||
| for _, joinedRoom := range joinedRooms.Array() { | ||
| if joinedRoom.Str == roomID { | ||
| return true | ||
| } | ||
| } | ||
| return false | ||
| } | ||
|
|
||
| // Wait for the server to receive the event with given event ID. | ||
|
|
@@ -2249,7 +2300,7 @@ func TestPartialStateJoin(t *testing.T) { | |
|
|
||
| // @t23alice:hs1 joins the room. | ||
| psjResult := beginPartialStateJoin(t, server1, room, alice) | ||
| defer server2.WithWaitForLeave(t, server2Room, alice.UserID, func() { psjResult.Destroy(t) }) | ||
| defer server2.WithWaitForLeave(t, server2Room, alice, func() { psjResult.Destroy(t) }) | ||
|
|
||
| // Both homeservers should receive device list updates. | ||
| renameDevice(t, alice, "A new device name 1") | ||
|
|
@@ -2296,7 +2347,7 @@ func TestPartialStateJoin(t *testing.T) { | |
| ) | ||
| // NB: We register the `psjResult.Destroy()` cleanup twice. This is alright because it | ||
| // is idempotent. Here we wait for server 2 to observe the leave too. | ||
| defer server2.WithWaitForLeave(t, server2Room, alice.UserID, func() { psjResult.Destroy(t) }) | ||
| defer server2.WithWaitForLeave(t, server2Room, alice, func() { psjResult.Destroy(t) }) | ||
| joinEvent := room.CurrentState("m.room.member", server2.UserID("elsie")) | ||
| server1.MustSendTransaction(t, deployment, deployment.GetFullyQualifiedHomeserverName(t, "hs1"), []json.RawMessage{joinEvent.JSON()}, nil) | ||
| awaitEventViaSync(t, alice, room.RoomID, joinEvent.EventID(), "") | ||
|
|
@@ -2473,7 +2524,7 @@ func TestPartialStateJoin(t *testing.T) { | |
| syncToken = awaitEventViaSync(t, alice, partialStateRoom.RoomID, leaveEvent.EventID(), syncToken) | ||
|
|
||
| leaveSharedRoom = func() { | ||
| server2.WithWaitForLeave(t, server2Room, alice.UserID, func() { | ||
| server2.WithWaitForLeave(t, server2Room, alice, func() { | ||
| alice.MustLeaveRoom(t, roomID) | ||
| }) | ||
| } | ||
|
|
@@ -2533,7 +2584,7 @@ func TestPartialStateJoin(t *testing.T) { | |
| // @t26alice:hs1 joins the room, followed by @elsie:server2. | ||
| // @elsie:server2 is kicked with an invalid event. | ||
| syncToken, server2Room, psjResult := setupIncorrectlyAcceptedKick(t, deployment, alice, server1, server2, deviceListUpdateChannel1, deviceListUpdateChannel2, room) | ||
| defer server2.WithWaitForLeave(t, server2Room, alice.UserID, func() { psjResult.Destroy(t) }) | ||
| defer server2.WithWaitForLeave(t, server2Room, alice, func() { psjResult.Destroy(t) }) | ||
|
|
||
| // @t26alice:hs1 sends out a device list update which is missed by @elsie:server2. | ||
| // @elsie:server2 must receive missed device list updates once the partial state join finishes. | ||
|
|
@@ -2593,7 +2644,7 @@ func TestPartialStateJoin(t *testing.T) { | |
| federation.WithPartialState(), | ||
| ) | ||
| psjResult := beginPartialStateJoin(t, server1, room, alice) | ||
| defer server2.WithWaitForLeave(t, server2Room, alice.UserID, func() { psjResult.Destroy(t) }) | ||
| defer server2.WithWaitForLeave(t, server2Room, alice, func() { psjResult.Destroy(t) }) | ||
|
|
||
| // @t28alice:hs1 sends out a device list update which is missed by @elsie:server2. | ||
| // @elsie:server2 must receive missed device list updates once the partial state join finishes. | ||
|
|
@@ -2996,7 +3047,7 @@ func TestPartialStateJoin(t *testing.T) { | |
| }, | ||
| client.SyncJoinedTo(server.UserID("charlie"), otherRoomID), | ||
| ) | ||
| defer server.WithWaitForLeave(t, otherRoom, alice.UserID, func() { alice.MustLeaveRoom(t, otherRoomID) }) | ||
| defer server.WithWaitForLeave(t, otherRoom, alice, func() { alice.MustLeaveRoom(t, otherRoomID) }) | ||
|
|
||
| // Depending on the homeserver implementation, @t31alice:hs1 must have been told that either: | ||
| // * charlie updated their device list, or | ||
|
|
@@ -4422,7 +4473,7 @@ func (psj *partialStateJoinResult) Destroy(t *testing.T) { | |
| psj.Server.WithWaitForLeave( | ||
| t, | ||
| psj.ServerRoom, | ||
| psj.User.UserID, | ||
| psj.User, | ||
| func() { psj.User.MustLeaveRoom(t, psj.ServerRoom.RoomID) }, | ||
| ) | ||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole concept of
WithWaitForLeavemight be the wrong solution to the problem. It was introduced in #626The dumb simple solution would be to instead to create new homeserver deployments for each test to make them isolated from each other. It looks like we even do that for some tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, sure we could. I don't think that cleaning up after a test is an unreasonable way of doing things though.
I'm hesitant to just go and use new deployments, as the tests are already really slow. The 7 tests take ~18s to run locally, and isolating them further is just going to make that worse