Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
95 changes: 73 additions & 22 deletions tests/msc3902/federation_room_join_partial_state_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -123,10 +123,15 @@ func (s *server) AddEDUHandler(eduHandler func(gomatrixserverlib.EDU) bool) func
}
}

// WithWaitForLeave runs the given action and waits for the user to leave the room.
// WithWaitForLeave runs the given action and, when the resulting leave is
// expected to reach this server, waits for it. `leaveAction` is always run; the
// wait is skipped when `user` had already left the room (per their own
// homeserver, so the action produces no new leave) or when this server isn't in
// the room (so the leave won't be federated to us).
func (s *server) WithWaitForLeave(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole concept of WithWaitForLeave might be the wrong solution to the problem. It was introduced in #626

Many of the faster joins test flakes are due to the homeserver under test failing to contact Complement homeservers after they have been torn down. When this happens, subsequent tests can fail if they use a Complement homeserver that happens to have the same hostname:port as one which the homeserver under test has previously marked as offline.

Make the homeserver under test leave rooms at the end of each test, so that it stops trying to contact the Complement homeservers in the room.

The dumb simple solution would be to instead to create new homeserver deployments for each test to make them isolated from each other. It looks like we even do that for some tests.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, sure we could. I don't think that cleaning up after a test is an unreasonable way of doing things though.

I'm hesitant to just go and use new deployments, as the tests are already really slow. The 7 tests take ~18s to run locally, and isolating them further is just going to make that worse

t *testing.T, room *federation.ServerRoom, userID string, leaveAction func(),
t *testing.T, room *federation.ServerRoom, user *client.CSAPI, leaveAction func(),
) {
userID := user.UserID
leaveChannel := make(chan gomatrixserverlib.PDU, 10)
removePDUHandler := s.AddPDUHandler(
func(e gomatrixserverlib.PDU) bool {
Expand All @@ -141,24 +146,70 @@ func (s *server) WithWaitForLeave(
)
defer removePDUHandler()

// We need to check if the user (on their homeserver) thinks they're in the
// room, before performing the `leaveAction` (to avoid races). If they are
// in the room we need to wait until the server sees the leave, but if they
// aren't no leave event will be sent out.
userInRoom := userIsJoinedTo(t, user, room.RoomID)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Complement, the way we typically do this is using user.MustSyncUntil(t, client.SyncReq{Since: aliceSince}, client.SyncJoinedTo(user.UserID, room.RoomID))

And since we seem to only use this for the negative case and return, I'm guessing that using the typical flow and failing the test is also fine (related to the points below about programming errors)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Complement, the way we typically do this is using user.MustSyncUntil(t, client.SyncReq{Since: aliceSince}, client.SyncJoinedTo(user.UserID, room.RoomID))

That only tests whether a new join has happened, not whether the user is joined at all


leaveAction()

memberEvent := room.CurrentState("m.room.member", userID)
membership := ""
if memberEvent != nil {
membership, _ = memberEvent.Membership()
if !userInRoom {
// The user had already left, so the action produced no new leave and
// none is coming: don't wait.
t.Logf("%s is not joined to test room %s; not waiting for them to leave.", userID, room.RoomID)
return
}
Comment on lines +157 to 162

@MadLittleMods MadLittleMods Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a test programming error. Why are we trying to accommodate this?

Other related discussion: #878 (comment)

if membership == "leave" {
t.Logf("%s has already seen %s leave test room %s.", s.ServerName(), userID, room.RoomID)
} else {
select {
case <-leaveChannel:
t.Logf("%s saw %s leave test room %s.", s.ServerName(), userID, room.RoomID)
break
case <-time.After(1 * time.Second):
t.Errorf("%s timed out waiting for %s to leave test room %s.", s.ServerName(), userID, room.RoomID)

if !s.isInRoom(room) {
// The homeserver only federates the leave to servers that are in the
// room. If we aren't, no leave PDU is coming to us, so don't block until
// the timeout.
t.Logf("%s is not in test room %s; not waiting for %s to leave.", s.ServerName(), room.RoomID, userID)
return
}
Comment on lines +164 to +170

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a test programming error. Why are we trying to accommodate this?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see, this is done because WithWaitForLeave is used as part of a defer psjResult.Destroy(t) pattern to clean up resources after each test. This includes when a test errors out any time (before they even join the room).

Overall, looks fishy. See #878 (comment)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, it's a generic cleanup function to ensure that all of the tests users have left, no matter what the state of the users and servers are in? It's a fairly typical pattern


// Otherwise the action triggered the leave, which arrives as a PDU our
// handler matches. Wait on its channel rather than polling
// `room.CurrentState`: the room's current state is updated (by
// `room.AddEvent`) *before* the PDU callback runs, so returning on a
// `CurrentState` check could deregister our handler in the window before the
// callback fires, making the (expected) leave look unexpected to
// `HandleTransactionRequests`. This returns as soon as the leave arrives; the
// timeout is only a ceiling for declaring failure.
select {
case <-leaveChannel:
t.Logf("%s saw %s leave test room %s.", s.ServerName(), userID, room.RoomID)
case <-time.After(1 * time.Second):
t.Errorf("%s timed out waiting for %s to leave test room %s.", s.ServerName(), userID, room.RoomID)
}
}

Comment thread
erikjohnston marked this conversation as resolved.
// isInRoom reports whether this Complement server has a joined user in the room,
// according to its own `ServerRoom` view. The server reliably tracks its own
// users' membership (it created their join/leave events), so this answers "will
// the homeserver federate events in this room to us?".
func (s *server) isInRoom(room *federation.ServerRoom) bool {
for _, serverInRoom := range room.ServersInRoom() {
if serverInRoom == s.ServerName() {
return true
}
}
return false
}

// userIsJoinedTo reports whether the user is currently joined to the room,
// according to the user's own homeserver.
func userIsJoinedTo(t *testing.T, user *client.CSAPI, roomID string) bool {
t.Helper()
res := user.MustDo(t, "GET", []string{"_matrix", "client", "v3", "joined_rooms"})
joinedRooms := gjson.ParseBytes(client.ParseJSON(t, res)).Get("joined_rooms")
for _, joinedRoom := range joinedRooms.Array() {
if joinedRoom.Str == roomID {
return true
}
}
return false
}

// Wait for the server to receive the event with given event ID.
Expand Down Expand Up @@ -2249,7 +2300,7 @@ func TestPartialStateJoin(t *testing.T) {

// @t23alice:hs1 joins the room.
psjResult := beginPartialStateJoin(t, server1, room, alice)
defer server2.WithWaitForLeave(t, server2Room, alice.UserID, func() { psjResult.Destroy(t) })
defer server2.WithWaitForLeave(t, server2Room, alice, func() { psjResult.Destroy(t) })

// Both homeservers should receive device list updates.
renameDevice(t, alice, "A new device name 1")
Expand Down Expand Up @@ -2296,7 +2347,7 @@ func TestPartialStateJoin(t *testing.T) {
)
// NB: We register the `psjResult.Destroy()` cleanup twice. This is alright because it
// is idempotent. Here we wait for server 2 to observe the leave too.
defer server2.WithWaitForLeave(t, server2Room, alice.UserID, func() { psjResult.Destroy(t) })
defer server2.WithWaitForLeave(t, server2Room, alice, func() { psjResult.Destroy(t) })
joinEvent := room.CurrentState("m.room.member", server2.UserID("elsie"))
server1.MustSendTransaction(t, deployment, deployment.GetFullyQualifiedHomeserverName(t, "hs1"), []json.RawMessage{joinEvent.JSON()}, nil)
awaitEventViaSync(t, alice, room.RoomID, joinEvent.EventID(), "")
Expand Down Expand Up @@ -2473,7 +2524,7 @@ func TestPartialStateJoin(t *testing.T) {
syncToken = awaitEventViaSync(t, alice, partialStateRoom.RoomID, leaveEvent.EventID(), syncToken)

leaveSharedRoom = func() {
server2.WithWaitForLeave(t, server2Room, alice.UserID, func() {
server2.WithWaitForLeave(t, server2Room, alice, func() {
alice.MustLeaveRoom(t, roomID)
})
}
Expand Down Expand Up @@ -2533,7 +2584,7 @@ func TestPartialStateJoin(t *testing.T) {
// @t26alice:hs1 joins the room, followed by @elsie:server2.
// @elsie:server2 is kicked with an invalid event.
syncToken, server2Room, psjResult := setupIncorrectlyAcceptedKick(t, deployment, alice, server1, server2, deviceListUpdateChannel1, deviceListUpdateChannel2, room)
defer server2.WithWaitForLeave(t, server2Room, alice.UserID, func() { psjResult.Destroy(t) })
defer server2.WithWaitForLeave(t, server2Room, alice, func() { psjResult.Destroy(t) })

// @t26alice:hs1 sends out a device list update which is missed by @elsie:server2.
// @elsie:server2 must receive missed device list updates once the partial state join finishes.
Expand Down Expand Up @@ -2593,7 +2644,7 @@ func TestPartialStateJoin(t *testing.T) {
federation.WithPartialState(),
)
psjResult := beginPartialStateJoin(t, server1, room, alice)
defer server2.WithWaitForLeave(t, server2Room, alice.UserID, func() { psjResult.Destroy(t) })
defer server2.WithWaitForLeave(t, server2Room, alice, func() { psjResult.Destroy(t) })

// @t28alice:hs1 sends out a device list update which is missed by @elsie:server2.
// @elsie:server2 must receive missed device list updates once the partial state join finishes.
Expand Down Expand Up @@ -2996,7 +3047,7 @@ func TestPartialStateJoin(t *testing.T) {
},
client.SyncJoinedTo(server.UserID("charlie"), otherRoomID),
)
defer server.WithWaitForLeave(t, otherRoom, alice.UserID, func() { alice.MustLeaveRoom(t, otherRoomID) })
defer server.WithWaitForLeave(t, otherRoom, alice, func() { alice.MustLeaveRoom(t, otherRoomID) })

// Depending on the homeserver implementation, @t31alice:hs1 must have been told that either:
// * charlie updated their device list, or
Expand Down Expand Up @@ -4422,7 +4473,7 @@ func (psj *partialStateJoinResult) Destroy(t *testing.T) {
psj.Server.WithWaitForLeave(
t,
psj.ServerRoom,
psj.User.UserID,
psj.User,
func() { psj.User.MustLeaveRoom(t, psj.ServerRoom.RoomID) },
)
}
Expand Down
Loading