I fixed every single bug in `wm` #1334

yshui · 2024-09-07T08:27:20Z

.... or at least the ones AFL++/libFuzzer can find. Some of these are very tricky, and my brain hurts.

This all came from @Monsterovich reporting an assertion failure in wm. And I thought, the wm code is nicely isolated now, maybe I can fuzz it, see if there are more bugs... and boy are there more 😮‍💨

Unfortunately some of these changes are big, so I think this warrants an rc4.

codecov · 2024-09-07T08:28:13Z

Codecov Report

Attention: Patch coverage is 88.44622% with 29 lines in your changes missing coverage. Please review.

Project coverage is 53.77%. Comparing base (c6fc045) to head (570381f).
Report is 23 commits behind head on next.

Files with missing lines	Patch %	Lines
src/wm/wm.c	84.25%	17 Missing ⚠️
src/x.c	94.84%	5 Missing ⚠️
src/picom.c	63.63%	4 Missing ⚠️
src/event.c	83.33%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             next    #1334      +/-   ##
==========================================
- Coverage   53.90%   53.77%   -0.13%     
==========================================
  Files          70       70              
  Lines       15165    15195      +30     
==========================================
- Hits         8174     8171       -3     
- Misses       6991     7024      +33

Files with missing lines	Coverage Δ
src/atom.c	`50.00% <ø> (ø)`
src/region.h	`56.25% <100.00%> (+1.04%)`	⬆️
src/wm/tree.c	`88.94% <100.00%> (-1.50%)`	⬇️
src/wm/win.c	`75.12% <100.00%> (-0.23%)`	⬇️
src/wm/win.h	`93.05% <ø> (ø)`
src/wm/wm.h	`100.00% <ø> (ø)`
src/wm/wm_internal.h	`100.00% <ø> (ø)`
src/x.h	`89.18% <100.00%> (+6.83%)`	⬆️
src/event.c	`66.82% <83.33%> (+1.82%)`	⬆️
src/picom.c	`61.18% <63.63%> (-3.23%)`	⬇️
... and 2 more

... and 4 files with indirect coverage changes

To make it possible to stub them out. These are the only requests wm.c makes to the X server, we will be able to test the wm code on its own if we can provide mocked version of these functions. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

If a new window reuses the ID of a previously destroyed window, we would get confused and will not send a query tree request for this new window, thus misses all of its children. This is because we saved destroyed window as orphans, and we assume all orphans are either imported or in the process of being imported, which is not true for destroyed windows. With this commit we no longer save destroyed windows. We did that to gracefully handle a window being destroyed with a query tree request still in-flight. Now we handle it the more obvious way - we mark any in-flight query trees for omission when a window is destroyed. (Now, event if we don't save explicitly destroyed windows anymore, some orphans might get destroyed and have their ID reused without us knowing. This is because we don't get DestroyNotify for orphans. To catch this case, we pessimistically re-query orphans that don't have children.) Even if the window ID is reused after by a newly created window and the query tree actually succeeds with information from the new window, we can't use that result anyway, because at the point when the query tree is answered, we wouldn't have set up the event mask for the new window yet. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

Before, if we got a reparent event for a previously unknown window, we will ignore that event because we expect this window will later be found in a query tree reply. This is not always true. If a previously unknown window is reparented to a window that is already fully imported, because we won't send query tree for a fully imported window, this new will just never be imported. We will now explicitly initiate the import process in this case. This also applies to new window reusing the ID of a previously destroyed window in our orphan tree. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

`node` can legitimately be NULL inside wm_handle_get_wm_state_reply. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

Consider this scenario: 1. Window created with ID A, which generates a CreateNotify for A. 2. Window with ID A is destroyed. 3. Another window is created, reusing ID A. 4. We receive the CreateNotify for A. At step 4, we will be querying a different window than what we think we were querying. It might have a different parent from the one in the CreateNotify, and we should also ignore the destroy event in (2), as it wasn't for the window we are querying. To fix that, when we initiate the import process for a window, we no longer attach it to its declared parent, instead we orphan it. We also ignore any destroy event for it while the query tree is in-flight. We let the query tree reply event decide: 1. what's the window's parent, 2. whether the window still exists. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

Turns out inferring a window's life cycle based on DestroyNotify sent by its parent is incredibly hard. Especially if the window is at any point reparented to a window we haven't imported yet, because we won't be receiving any DestroyNotify for it. And we also can't reliably know if such reparenting has happened. So we now set the StructureNotifyMask, which will give us DestroyNotify sent from the window itself. With this the tracking window life cycle from the point when the event mask is set becomes easy, and we always know if a window is destroyed or not. Setting StructureNotifyMask also means we will get copies of Circulate/Configure/Map/Unmap/ReparentNotify from the window itself, we filter this so existing code doesn't have to change. Only for DestroyNotify we favor the one sent from the window itself. Also, setting event masks on window MUST be done synchronously. We use to just send out a ChangeWindowAttributes request and ignore its result. But it is actually possible, though extremely unlikely, for the ChangeWindowAttributes to fail but the QueryTree later to succeed, if the window is destroyed and another created (during the time between we send these two requests) with the same ID. In that case we would thought we have set up the event mask, but actually didn't. Which would be detrimental. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

Well, we tried to use query tree replies to attach windows to correct parents. But that's not good enough. Because the parent can also actively acquire children via reparent events. The result is windows fighting each other for control of their parent pointer. And this is also bad because in we don't know the stacking order of a window from its query tree reply. The conclusion is, a window must be fully responsible for updating its own children list. Because of ABA problem caused by window ID reuse, it's possible for a window to be attached to a wrong parent temporarily. This also means receiving a DestroyNotify from a window X about one of its children with ID Y, doesn't mean the window with ID Y is actually destroyed, if Y is currently attached to the wrong parent. From that DestroyNotify, the only thing we can deduce is that X lost a child with ID Y. Which means events must be handled in two steps: If we get a DestroyNotify from window X about its child Y, we disconnect Y from X. If we get a DestroyNotify from window Y for itself, only then can we really destroy Y. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

We could be getting reparent events for windows we already found out to be non-existent by trying to set its event mask. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

It is possible for a window in the process of being queried to have an incomplete children list, in which case wm_disconnect might see child/parent mismatches. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

Orphan windows were once imported, so should have their event masks set, which means we should be getting DestroyNotify when they are destroyed. That should be enough to guarantee that orphans will be destroyed eventually, we don't need to reap them actively. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

The first thing we do when we get a new toplevel, is sending a GetWindowAttributes request, and based on its reply, we may allocate a managed window struct for it. Currently we find which tree node to attach this managed window struct by its X window ID only, but we should use the full treeid. When a new toplevel is added, we queue a creation tree change for it which contains a pointer to the tree node. If the tree node is destroyed before the creation tree change is dequeued, we must "cancel" that creation tree change, because otherwise the pointer in it would be dangling. This also means there will be no kill tree change for it either. As a consequence, we have no way to tell anybody about its zombie node if we create one. So we just don't. If we don't create a zombie node for a destroyed node, there will be nowhere to put the managed window struct attached to it if there is one. Which means there cannot be a one. Summing it all up - a toplevel cannot have a managed window struct, if its creation tree change was never dequeued. You can imagine the creation tree change is handing out a "key" to the tree node, and this tree node can only have managed window attached to it via this key. However, if we are getting the tree node via its X window ID only, we bypass this "key" mechanism, essentially we will find tree node with the same X ID even if we don't have the key for it yet. This happens if the window was destroyed, then another window took its ID while we were waiting for the GetWindowAttributes reply, we shouldn't start managing it. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

If a node is removed from toplevel then comes back, it will generate a new creation tree change. Because of the assumption explained in the previous commit, we can only attach managed window struct to it using the "key" from the creation tree change. But the node's key hasn't changed! Since it was toplevel, it will have a previous, older creation tree change which also handed out this key, which can be used to attach managed window to it, thus breaks the assumption. This means every time a node becomes a toplevel, it must have a different key. We bump its gen number when it's removed from toplevel to ensure that. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

We care about whether setting the event masks succeeded, we can't deduced that from the query tree request because of race conditions and window ID reuse. So we have to check if the request for setting the event masks itself succeeded or not. But `xcb_request_check` is out of order w.r.t. to other X events/replies. So essentially we are "looking into the future" with it, sensing that a window no longer exists before the usual events and replies we received from X would have told us so. This breaks assumptions we based many of our assertions on. For example, in many places when we can't find a window in our tree, we assume we must have an incomplete tree (i.e. based on the events/replies, there are windows on X's side we haven't queried). But it's actually possible that we looked into the future and saw it no longer exists. We have updated the async request mechanism so it can support requests that don't expect replies. This commit uses that to make setting event masks async, so it no longer uses `xcb_request_check` and is no longer out of order. Also updated how query tree requests are handled to make it work together with this change. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

If a toplevel loses toplevel status, its "client window" will no longer be updated. If the client window is then destroyed, it becomes a dangling pointer. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

Previously we would just ignore these updates. But `wm_tree_find_client` does treat a toplevel with WM_STATE set as its own client window. So to keep things consistent, client window refresh is still needed in this case. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

This is already accounted for by wm_tree_find_toplevel_for, but in wm_tree_set_wm_state, there is a bit of copy-and-pasted code that didn't. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

Since we know have split reparenting into 2 steps, the first of which will create an orphaned window, we now can have orphaned windows when there are no query trees in-flight. Update wm_is_consistent condition to account for that. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

Pass tree node instead of window ID. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

See comments for proof why it's not reachable. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

yshui force-pushed the wm-fixes branch 3 times, most recently from 8858de1 to b384889 Compare September 7, 2024 08:48

yshui added 22 commits September 7, 2024 10:59

x: wrap some libxcb functions

ed619f3

To make it possible to stub them out. These are the only requests wm.c makes to the X server, we will be able to test the wm code on its own if we can provide mocked version of these functions. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

wm: remove a BUG_ON based on an incorrect assumption

9e3b790

`node` can legitimately be NULL inside wm_handle_get_wm_state_reply. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

wm: remove an incorrect assertion

7738c5b

We could be getting reparent events for windows we already found out to be non-existent by trying to set its event mask. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

wm: fix an incorrect BUG_ON

5912359

It is possible for a window in the process of being queried to have an incomplete children list, in which case wm_disconnect might see child/parent mismatches. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

x: simplify x_poll_for_event

e045cd0

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

x: support async requests with no reply

2692fe7

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

wm: remove unnecessary x_flush

1a4948d

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

wm: update client window when a window loses/gains toplevel status

28281bf

If a toplevel loses toplevel status, its "client window" will no longer be updated. If the client window is then destroyed, it becomes a dangling pointer. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

wm: don't report children of the orphan root as toplevel

f8828eb

This is already accounted for by wm_tree_find_toplevel_for, but in wm_tree_set_wm_state, there is a bit of copy-and-pasted code that didn't. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

wm: avoid repeated unnecessary wm_tree_find

cd5b136

Pass tree node instead of window ID. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

wm: remove an unreachable branch

570381f

See comments for proof why it's not reachable. Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>

yshui force-pushed the wm-fixes branch from b384889 to 570381f Compare September 7, 2024 09:59

yshui merged commit e3581fc into next Sep 7, 2024
22 checks passed

yshui deleted the wm-fixes branch September 7, 2024 22:56

yshui restored the wm-fixes branch September 7, 2024 22:56

yshui deleted the wm-fixes branch September 7, 2024 23:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

I fixed every single bug in `wm` #1334

I fixed every single bug in `wm` #1334

yshui commented Sep 7, 2024 •

edited

Loading

codecov bot commented Sep 7, 2024 •

edited

Loading

I fixed every single bug in wm #1334

I fixed every single bug in wm #1334

Conversation

yshui commented Sep 7, 2024 • edited Loading

codecov bot commented Sep 7, 2024 • edited Loading

Codecov Report

I fixed every single bug in `wm` #1334

I fixed every single bug in `wm` #1334

yshui commented Sep 7, 2024 •

edited

Loading

codecov bot commented Sep 7, 2024 •

edited

Loading