After explaining how to Debug Hybrid Graphics issues on Linux, here is the story of four graphics bugs that I had in GNOME and Firefox on my Fedora 30 between May 2018 and September 2019: bugs in gnome-shell, Gtk, Firefox and Xwayland.
In May 2018, six months after I got my Lenovo P50 laptop, gnome-shell was "sometimes" freezing for between 1 and 5 seconds. It was annoying because keystrokes were repeated during a freeze, writing "helloooooooooooooooooooooo" instead of "hello" for example.
My colleagues led me to #fedora-desktop of the GIMP IRC server where I met my colleague Jonas Ådahl (jadahl) who almost immediately identified my issue! Extract of the IRC chat:
Ten minutes after I asked my question, Jonas asked the right question: Do you have a hybrid gpu system?
I was able to work around the issue by connecting my laptop to my TV using the HDMI port:
15:22 < jadahl> for example, IIRC if you have a monitor connected to the HDMI, the issue will go away since the secondary GPU is always awake anyway
...
15:31 < vstinner> jadahl: i plugged a HDMI cable to my TV and it seems like the issue is gone
15:31 < vstinner> jadahl: impressive
When an external monitor is used (like a TV plugged into the HDMI port), my NVIDIA GPU is always active, which works around the bug I had in gnome-shell.
Jonas provided me an RPM package for Fedora including his work-in-progress fix: Upload HW cursor sprite on-demand. I confirmed that this change fixed my bug. His mutter change has since been merged upstream.
Firefox crash when selecting text
In March 2019, Firefox with Wayland crashed on wl_abort() when selecting more than 4000 characters in a <textarea>. I found the bug in Gmail when selecting the whole email text to remove it. Pressing CTRL + A or Right-click + Select All crashed the whole Firefox process!
I reported the bug to Firefox: Firefox with Wayland crash on wl_abort() when selecting more than 4000 characters in a <textarea>.
Running gdb on Firefox caused me some trouble since it's a very large binary with many libraries. I also read the Wayland protocol specifications. I managed to analyze the bug, so I reported it to Gtk as well: On Wayland, notify_surrounding_text() crash on wl_abort() if text is longer than 4000 bytes.
According to gdb, wl_proxy_marshal_array_constructor_versioned() calls wl_abort() because the buffer is too short. It seems like wl_buffer_put() fails with E2BIG.
I quickly found out that my Gtk bug had already been fixed 3 months earlier by Carlos Garnacho (imwayland: Respect maximum length of 4000 Bytes on strings being sent) and that the fix is part of gtk-3.24.3 ("wayland: Respect length limits in text protocol" says "Overview of Changes in GTK+ 3.24.3").
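To illustrate the idea behind that fix, here is a minimal Python sketch of trimming the text around the cursor so that it never exceeds 4000 bytes once encoded as UTF-8. This is not the Gtk code (which is written in C): the function name, its parameters and the "grow a window around the cursor" strategy are my own assumptions, and only the 4000-byte limit comes from the fix title.

```python
MAX_BYTES = 4000  # limit quoted in the title of the Gtk fix


def truncate_around_cursor(text, cursor, max_bytes=MAX_BYTES):
    """Hypothetical helper: return (snippet, new_cursor).

    The snippet is a slice of `text` around the character index `cursor`
    whose UTF-8 encoding is at most `max_bytes` long. Whole characters
    are added one at a time, so a multi-byte character is never cut.
    """
    if not 0 <= cursor <= len(text):
        raise ValueError("cursor is outside the text")

    start = end = cursor
    used = 0
    grew = True
    while grew:
        grew = False
        # Try to take one more character on the left of the window.
        if start > 0:
            cost = len(text[start - 1].encode("utf-8"))
            if used + cost <= max_bytes:
                start -= 1
                used += cost
                grew = True
        # Try to take one more character on the right of the window.
        if end < len(text):
            cost = len(text[end].encode("utf-8"))
            if used + cost <= max_bytes:
                end += 1
                used += cost
                grew = True
    return text[start:end], cursor - start


if __name__ == "__main__":
    snippet, pos = truncate_around_cursor("a" * 10000, 5000)
    print(len(snippet.encode("utf-8")), pos)
```

A short text is returned unchanged, while a long one is cut to a window centered on the cursor, which is the part an input method cares about most.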
I requested to upgrade Gtk in Fedora. But it was not possible since the newer version changed the theme. I was asked to cherry-pick the fix and that's what I did: imwayland: Respect maximum length of 4000 Bytes on strings.
My PR was merged and a new package was built. I tested it and confirmed that it fixed the crash: FEDORA-2019-d67ec97b0b. Soon, the package was pushed to the public Fedora package repository.
That's the cool part about open source: if you have the skills to hack the code, you can fix an annoying bug that affects you!
Firefox: [Wayland] Window partially or not updated when switching between two tabs
Analyze the bug
In September 2019, after a large system upgrade (install 6 packages, upgrade 234 packages, remove 5 packages), Firefox started to not update the window content sometimes when I switched from one tab to another. Example:
It took me a few hours to analyze the bug well enough to produce a useful bug report.
I followed the advice in Fedora's guide, How to debug Firefox problems.
First, I tried to understand which GPU driver was used. I ended up blacklisting the nouveau driver in the Linux kernel, to ensure that Firefox was using my Intel IGP. I still reproduced the bug.
I disabled all Firefox extensions: bug reproduced.
Then I created a new Firefox profile and started Firefox in safe mode: bug reproduced.
I tested the latest Firefox binary from mozilla.org (Firefox 69.0): bug reproduced.
Finally, I tested Firefox Nightly from mozilla.org (Firefox 71.0a1): bug reproduced.
Ok, it was enough data to produce an interesting bug report. I reported [Wayland] Window partially or not updated when switching between two tabs to Firefox.
Identify the regression using Fedora packages
Then I looked at /var/log/dnf.log and I tried to identify which package update could explain the regression.
I downgraded gtk3-3.24.11-1.fc30.x86_64 to gtk3.x86_64 3.24.10-1.fc30: bug reproduced.
I rebooted into the oldest available Linux kernel, version 5.2.8-200.fc30.x86_64: bug reproduced. I checked the journalctl logs to see which Linux version I was running when the bug was first seen: Linux 5.2.9-200.fc30.x86_64.
I don't know why, but downgrading Firefox was only my 3rd test.
I downgraded firefox-69.0-2.fc30.x86_64 to firefox-68.0.2-1.fc30.x86_64: the bug is gone! Ok, so the regression comes from the Firefox package, and it was introduced between package versions 68.0.2-1.fc30 and 69.0-2.fc30.
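The package names above follow the usual RPM name-version-release.arch layout. As a rough sketch of how the list of suspects can be narrowed down, the Python snippet below splits such names and reports which packages changed between two package lists. The helper names (split_nvra, changed_packages) are hypothetical, the two lists only contain the packages mentioned in this article, and the code assumes names without an epoch.

```python
def split_nvra(package):
    """Split 'name-version-release.arch' into its four parts.

    Assumption: there is no epoch in the string. The name itself may
    contain dashes, so the string is split from the right.
    """
    rest, arch = package.rsplit(".", 1)
    name, version, release = rest.rsplit("-", 2)
    return name, version, release, arch


def changed_packages(before, after):
    """Map each package name to (old, new) when its version changed."""
    old = {split_nvra(pkg)[0]: pkg for pkg in before}
    changes = {}
    for pkg in after:
        name = split_nvra(pkg)[0]
        if name in old and old[name] != pkg:
            changes[name] = (old[name], pkg)
    return changes


if __name__ == "__main__":
    before = [
        "gtk3-3.24.10-1.fc30.x86_64",
        "firefox-68.0.2-1.fc30.x86_64",
    ]
    after = [
        "gtk3-3.24.11-1.fc30.x86_64",
        "firefox-69.0-2.fc30.x86_64",
    ]
    for name, (old_pkg, new_pkg) in changed_packages(before, after).items():
        print(f"{name}: {old_pkg} -> {new_pkg}")
```

Each changed package is then a candidate to downgrade and retest, one at a time, which is what I did by hand.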
On IRC, I met my colleague Martin Stránský, who packages Firefox for Fedora. He told me that he was aware of my bug and might have a fix for it. Great!
Only 9 days later, Martin Stránský's fix had been merged in Firefox upstream and released in Firefox Nightly, and a new package had been shipped in Fedora 30! Thanks Martin for your efficiency!
The final Firefox change is quite large and intrusive: [Wayland] Fix rendering glitches on wayland
Xwayland crash in xwl_glamor_gbm_create_pixmap()
In September 2019, while I was debugging the previous Firefox bug, I started my IRC client hexchat. Suddenly, Xwayland crashed, which closed my whole GNOME session! I was testing various GPU configurations to analyze the Firefox bug.
ABRT managed to rebuild a usable traceback and identified an existing bug report. It added my comment to the report [abrt] xorg-x11-server-Xwayland: OsLookupColor(): Segmentation fault at address 0x28.
On July 26, 2019 (1 month before I got the bug), Olivier Fourdan added an interesting comment:
glamor_get_modifiers+0x767 is xwl_glamor_gbm_create_pixmap() so this is the same as bug 1729925 fixed upstream with xwayland: Do not free a NULL GBM bo.
So in fact, my bug had already been fixed by Olivier Fourdan in Xwayland upstream, but the fix had not landed in Fedora yet.