System freezes from time to time (while using QEMU VMs)

So... it looks like it has to do with the zen kernel.
With the LTS kernel, there have been no problems so far.
Does anyone know how to set the LTS kernel as the default one?

You should edit /etc/default/grub and change GRUB_DEFAULT=0 (the first menu entry) to GRUB_DEFAULT=1 (the second entry) or GRUB_DEFAULT="1>2" (the third entry inside the submenu that sits in second position), etc.
Then run update-grub.
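
For reference, the edit described above could be scripted roughly like this. It is a minimal sketch, assuming GNU sed and a file that already contains a GRUB_DEFAULT= line. `set_grub_default` is a made-up helper name, and "1>2" is only an example value until the real menu order has been checked.

```shell
# Hypothetical helper: set GRUB_DEFAULT in a GRUB defaults file.
# $1: new value, e.g. 1 or "1>2"; $2: file to edit (default: /etc/default/grub)
set_grub_default() {
    value=$1
    file=${2:-/etc/default/grub}
    # Keep a backup next to the original before touching it.
    cp "$file" "$file.bak" || return 1
    # Replace the whole GRUB_DEFAULT line; the quotes matter for values like 1>2.
    sed -i "s|^GRUB_DEFAULT=.*|GRUB_DEFAULT=\"$value\"|" "$file"
}

# Example (as root), followed by regenerating the menu as described above:
#   set_grub_default "1>2" && update-grub
```

The helper only rewrites the one line and leaves a .bak copy, so a wrong guess at the index is easy to undo.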

Is entry '1' definitely the entry for the LTS kernel, or are you guessing?

Can I see the order somewhere, or do I have to restart and check it? :slight_smile:
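
The order can usually be read from the generated grub.cfg without rebooting. Below is a rough sketch that prints each entry with the index GRUB_DEFAULT expects. It assumes the layout grub-mkconfig normally writes (titles in single quotes, submenu entries indented, an unindented closing brace ending the submenu). `list_grub_entries` is a made-up name, and the default path is an assumption for Arch-based installs.

```shell
# Hypothetical helper: print GRUB menu entries with the index GRUB_DEFAULT expects.
# $1: path to grub.cfg (default: /boot/grub/grub.cfg; reading it may need root)
list_grub_entries() {
    cfg=${1:-/boot/grub/grub.cfg}
    awk -F"'" '
        BEGIN { top = -1 }
        # A top-level submenu takes one index; its children are numbered "top>child".
        /^submenu / { top++; in_sub = 1; child = -1; print top "\t[submenu] " $2; next }
        # An unindented closing brace ends the submenu.
        in_sub && /^}/ { in_sub = 0; next }
        /^[[:space:]]*menuentry / {
            if (in_sub) { child++; print top ">" child "\t" $2 }
            else        { top++;   print top "\t" $2 }
        }
    ' "$cfg"
}

# Usage (as root, since grub.cfg is often not world-readable):
#   list_grub_entries
```

The left column is what would go into GRUB_DEFAULT. Hand-written entries with double-quoted titles or nested submenus are not handled by this sketch.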

Just guessing!
You need to check the GRUB menu.
If you want to cut it short, uninstall zen and leave only LTS.
Reinstalling it in the future would only take a few minutes anyway.

I experience the exact same problem you describe: slowness for a few seconds, then a complete freeze. Haven't noticed any errors in dmesg, the journal or systemd at boot (although I don't know how to see them after a crash). The errors began after a reboot which I believe included a new kernel (the login screen background changed to the blue-purplish one). It might have been a week since my last reboot though, as I usually suspend, so I can't know for sure which update might have caused it.
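
On looking at logs after a crash: with a persistent journal, journalctl can show the boot that ended in the freeze. A small sketch follows. The two function names are made up, and after a hard freeze the last few messages may never have reached the disk, so an empty result does not prove there was no error.

```shell
# Hypothetical wrappers around journalctl for the boot that ended in a freeze.
# Both assume a persistent journal (/var/log/journal exists), so older boots are kept.

# Error-level (and worse) messages from the previous boot.
previous_boot_errors() {
    journalctl --boot=-1 --priority=err --no-pager "$@"
}

# Kernel messages only (the dmesg of the previous boot), last 200 lines.
previous_boot_kernel_tail() {
    journalctl --boot=-1 --dmesg --no-pager --lines=200 "$@"
}

# Usage:
#   journalctl --list-boots      # which boots the journal still has
#   previous_boot_errors
#   previous_boot_kernel_tail
```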

I also use docker, but the crashes occur despite not using it at the time (apart from running the service). Downgrading kernel (I'd rather not ditch zen if I can help it tbh), changing storage driver for docker, or even switching to podman are things I'll try next.

Do some searching on past forum posts regarding docker and btrfs.

Just to follow up on my issue, which seems to have gone away.

  • I downgraded to kernel 5.14.6-zen1-1 (from 5.14.7)
  • I did a btrfs balance
  • I switched minikube to the docker driver instead of kvm2/libvirt
  • I changed the docker storage driver to overlay2
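
For the last item, the storage driver is normally selected with the "storage-driver" key in Docker's daemon.json. Here is a minimal sketch of writing that file. `write_storage_driver_config` is a made-up name and /etc/docker/daemon.json is the usual default location, not something stated in this thread. Images and containers created under the old driver are generally not visible under the new one.

```shell
# Hypothetical helper: write a minimal Docker daemon config selecting a storage driver.
# $1: driver name (e.g. overlay2); $2: config path (default: /etc/docker/daemon.json)
# Refuses to overwrite an existing file, since that could drop other settings.
write_storage_driver_config() {
    driver=$1
    conf=${2:-/etc/docker/daemon.json}
    if [ -e "$conf" ]; then
        echo "$conf already exists; merge the storage-driver key by hand" >&2
        return 1
    fi
    printf '{\n  "storage-driver": "%s"\n}\n' "$driver" > "$conf"
}

# Usage (as root), then restart the daemon and check which driver is active:
#   write_storage_driver_config overlay2
#   systemctl restart docker
#   docker info --format '{{.Driver}}'
```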

Hopefully I can hop on 5.14.8 again.

Edit: Nope. Crashed about an hour into 5.14.8. At least that pinpoints the problem, I guess. It would be great to know what in >5.14.6 causes the crash. Any troubleshooting tips would be appreciated.

Edit #2: 5.14.9-zen2 seems stable again!

Here's my inxi for reference:

inxi
System:    Kernel: 5.14.8-zen1-1-zen x86_64 bits: 64 compiler: gcc v: 11.1.0 
           parameters: BOOT_IMAGE=/@/boot/vmlinuz-linux-zen 
           root=UUID=69c29b79-da5a-4df5-a53b-917ce7295884 rw rootflags=subvol=@ quiet splash 
           rd.udev.log_priority=3 vt.global_cursor_default=0 systemd.unified_cgroup_hierarchy=1 
           loglevel=3 mem_sleep_default=deep 
           Desktop: i3 4.19.1 info: i3bar vt: 7 dm: LightDM 1.30.0 Distro: Garuda Linux 
           base: Arch Linux 
Machine:   Type: Desktop System: Gigabyte product: X570 AORUS ULTRA v: -CF serial: <filter> 
           Mobo: Gigabyte model: X570 AORUS ULTRA serial: <filter> UEFI: American Megatrends LLC. 
           v: F33i date: 04/23/2021 
CPU:       Info: 12-Core model: AMD Ryzen 9 5900X bits: 64 type: MT MCP arch: Zen 3 
           family: 19 (25) model-id: 21 (33) stepping: 0 microcode: A201009 cache: L2: 6 MiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 177607 
           Speed: 3598 MHz min/max: 2200/3700 MHz boost: enabled Core speeds (MHz): 1: 3598 
           2: 3901 3: 3593 4: 3595 5: 3595 6: 3593 7: 3593 8: 3596 9: 3597 10: 3598 11: 3599 
           12: 3595 13: 3590 14: 3601 15: 3593 16: 3589 17: 3594 18: 3592 19: 3605 20: 3697 
           21: 3599 22: 3597 23: 3599 24: 3602 
           Vulnerabilities: Type: itlb_multihit status: Not affected 
           Type: l1tf status: Not affected 
           Type: mds status: Not affected 
           Type: meltdown status: Not affected 
           Type: spec_store_bypass 
           mitigation: Speculative Store Bypass disabled via prctl and seccomp 
           Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer sanitization 
           Type: spectre_v2 mitigation: Full AMD retpoline, IBPB: conditional, IBRS_FW, STIBP: 
           always-on, RSB filling 
           Type: srbds status: Not affected 
           Type: tsx_async_abort status: Not affected 
Graphics:  Device-1: NVIDIA GA104 [GeForce RTX 3070] vendor: Micro-Star MSI driver: nvidia 
           v: 470.74 alternate: nouveau,nvidia_drm bus-ID: 08:00.0 chip-ID: 10de:2484 
           class-ID: 0300 
           Device-2: Logitech Webcam C930e type: USB driver: snd-usb-audio,uvcvideo 
           bus-ID: 3-6.3:4 chip-ID: 046d:0843 class-ID: 0102 serial: <filter> 
           Display: x11 server: X.Org 1.20.13 compositor: picom v: git-dac85 driver: 
           loaded: nvidia display-ID: :0 screens: 1 
           Screen-1: 0 s-res: 2560x1440 s-dpi: 108 s-size: 602x342mm (23.7x13.5") 
           s-diag: 692mm (27.3") 
           Monitor-1: DP-0 res: 2560x1440 dpi: 109 size: 598x336mm (23.5x13.2") diag: 686mm (27") 
           OpenGL: renderer: NVIDIA GeForce RTX 3070/PCIe/SSE2 v: 4.6.0 NVIDIA 470.74 
           direct render: Yes 
Audio:     Device-1: NVIDIA GA104 High Definition Audio vendor: Micro-Star MSI 
           driver: snd_hda_intel v: kernel bus-ID: 08:00.1 chip-ID: 10de:228b class-ID: 0403 
           Device-2: AMD Starship/Matisse HD Audio vendor: Gigabyte driver: snd_hda_intel 
           v: kernel bus-ID: 0a:00.4 chip-ID: 1022:1487 class-ID: 0403 
           Device-3: Logitech Webcam C930e type: USB driver: snd-usb-audio,uvcvideo 
           bus-ID: 3-6.3:4 chip-ID: 046d:0843 class-ID: 0102 serial: <filter> 
           Sound Server-1: ALSA v: k5.14.8-zen1-1-zen running: yes 
           Sound Server-2: sndio v: N/A running: no 
           Sound Server-3: JACK v: 1.9.19 running: no 
           Sound Server-4: PulseAudio v: 15.0 running: yes 
           Sound Server-5: PipeWire v: 0.3.37 running: yes 
Network:   Device-1: Intel Wi-Fi 6 AX200 driver: iwlwifi v: kernel bus-ID: 03:00.0 
           chip-ID: 8086:2723 class-ID: 0280 
           IF: wlp3s0 state: down mac: <filter> 
           Device-2: Intel I211 Gigabit Network vendor: Gigabyte driver: igb v: kernel port: f000 
           bus-ID: 04:00.0 chip-ID: 8086:1539 class-ID: 0200 
           IF: enp4s0 state: up speed: 1000 Mbps duplex: full mac: <filter> 
           IF-ID-1: br-a75c55ff1e63 state: down mac: <filter> 
           IF-ID-2: docker0 state: down mac: <filter> 
           IF-ID-3: virbr0 state: down mac: <filter> 
Bluetooth: Device-1: Intel AX200 Bluetooth type: USB driver: btusb v: 0.8 bus-ID: 3-5:2 
           chip-ID: 8087:0029 class-ID: e001 
           Report: bt-adapter ID: hci0 rfk-id: 0 state: down bt-service: enabled,running 
           rfk-block: hardware: no software: yes address: <filter> 
Drives:    Local Storage: total: 1.48 TiB used: 268.68 GiB (17.7%) 
           SMART Message: Required tool smartctl not installed. Check --recommends 
           ID-1: /dev/sda maj-min: 8:0 vendor: Intel model: SSDSC2CT120A3 size: 111.79 GiB 
           block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s type: SSD serial: <filter> 
           rev: 300i scheme: GPT 
           ID-2: /dev/sdb maj-min: 8:16 vendor: Western Digital model: WD10EARS-00Y5B1 
           size: 931.51 GiB block-size: physical: 512 B logical: 512 B speed: 3.0 Gb/s type: N/A 
           serial: <filter> rev: 0A80 scheme: MBR 
           ID-3: /dev/sdc maj-min: 8:32 vendor: Samsung model: SSD 850 PRO 512GB size: 476.94 GiB 
           block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s type: SSD serial: <filter> 
           rev: 2B6Q scheme: MBR 
Partition: ID-1: / raw-size: 78.12 GiB size: 78.12 GiB (100.00%) used: 30.86 GiB (39.5%) fs: btrfs 
           dev: /dev/sda2 maj-min: 8:2 
           ID-2: /boot/efi raw-size: 1000 MiB size: 998 MiB (99.80%) used: 560 KiB (0.1%) fs: vfat 
           dev: /dev/sda1 maj-min: 8:1 
           ID-3: /home raw-size: 476.93 GiB size: 476.93 GiB (100.00%) used: 237.82 GiB (49.9%) 
           fs: btrfs dev: /dev/sdc1 maj-min: 8:33 
           ID-4: /var/log raw-size: 78.12 GiB size: 78.12 GiB (100.00%) used: 30.86 GiB (39.5%) 
           fs: btrfs dev: /dev/sda2 maj-min: 8:2 
           ID-5: /var/tmp raw-size: 78.12 GiB size: 78.12 GiB (100.00%) used: 30.86 GiB (39.5%) 
           fs: btrfs dev: /dev/sda2 maj-min: 8:2 
Swap:      Kernel: swappiness: 133 (default 60) cache-pressure: 100 (default) 
           ID-1: swap-1 type: zram size: 31.29 GiB used: 0 KiB (0.0%) priority: 100 
           dev: /dev/zram0 
Sensors:   System Temperatures: cpu: 39.8 C mobo: 16.8 C gpu: nvidia temp: 33 C 
           Fan Speeds (RPM): N/A gpu: nvidia fan: 0% 
Info:      Processes: 505 Uptime: 8m wakeups: 0 Memory: 31.29 GiB used: 2.74 GiB (8.8%) 
           Init: systemd v: 249 tool: systemctl Compilers: gcc: 11.1.0 Packages: pacman: 1598 
           lib: 461 Client: Unknown Client: garuda-assistant inxi: 3.3.06 

Just to update this topic: I also have this same freezing issue on GNOME, but I don't have docker installed. Using the latest zen kernel (also the latest tkg-bmq), NVIDIA proprietary drivers, and btrfs. It might have started with the latest Garuda version.

I can't run inxi right now because I'm on Windows to test my system, but so far, no freezes for 3+ hours.

Edit: People have downgraded to the linux-lts kernel to fix the issue; I will try later and update tonight (EST).

I'm not sure why people refer to using the LTS kernel as a downgrade. IMO everyone should have the LTS kernel installed in case of a system-breaking update to the zen (or other) kernels. Unless you have brand-new hardware, you likely do not require all the most recent kernel developments.

There's absolutely nothing wrong with running the LTS kernel if your hardware is a little older. Only those with the newest of hardware truly require the newest kernels.

Perpetuating this attitude is not very helpful. Often the easiest solution to many issues is simply switching kernels. Unfortunately, because of attitudes like yours, many will not accept this as a temporary solution. Some people seem to think running an LTS kernel is akin to being forced to live in a ghetto. This is utter nonsense; there's nothing wrong with the LTS kernel.

It becomes more than a little frustrating as a forum assistant when people will just not accept running an LTS kernel temporarily because of this attitude amongst kernel snobs.

If you're referring to my post, then I apologize. By downgrade, I was simply referring to the lower version number (e.g. 5.10 vs 5.14). I will change my wording in the future.

Also, I would love to use 5.10 LTS but my system refuses to boot with it after selecting it in grub. Only a black screen pops up with a flashing cursor that looks like an underscore. Do you know why this is the case?

inxi -Fxxxi
System:    Host: rk-garuda Kernel: 5.14.8-zen1-1-zen x86_64 bits: 64 compiler: gcc v: 11.1.0 Desktop: GNOME 40.5 
           tk: GTK 3.24.30 wm: gnome-shell dm: GDM 40.1 Distro: Garuda Linux base: Arch Linux 
Machine:   Type: Desktop Mobo: ASRock model: AB350M Pro4 serial: <superuser required> UEFI: American Megatrends v: P5.50 
           date: 12/20/2018 
CPU:       Info: 6-Core model: AMD Ryzen 5 2600 bits: 64 type: MT MCP arch: Zen+ rev: 2 cache: L2: 3 MiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 81436 
           Speed: 3793 MHz min/max: 1550/3400 MHz boost: enabled Core speeds (MHz): 1: 3793 2: 2848 3: 3889 4: 3861 5: 2795 
           6: 3256 7: 2744 8: 3063 9: 3676 10: 2242 11: 2803 12: 2396 
Graphics:  Device-1: NVIDIA GM200 [GeForce GTX 980 Ti] vendor: Gigabyte driver: nvidia v: 470.74 bus-ID: 23:00.0 
           chip-ID: 10de:17c8 class-ID: 0300 
           Display: x11 server: X.Org 1.20.13 compositor: gnome-shell driver: loaded: nvidia resolution: 1: 1920x1080 
           2: 2560x1440 s-dpi: 96 
           OpenGL: renderer: NVIDIA GeForce GTX 980 Ti/PCIe/SSE2 v: 4.6.0 NVIDIA 470.74 direct render: Yes 
Audio:     Device-1: NVIDIA GM200 High Definition Audio vendor: Gigabyte driver: snd_hda_intel v: kernel bus-ID: 23:00.1 
           chip-ID: 10de:0fb0 class-ID: 0403 
           Device-2: Advanced Micro Devices [AMD] Family 17h HD Audio vendor: ASRock driver: snd_hda_intel v: kernel 
           bus-ID: 25:00.3 chip-ID: 1022:1457 class-ID: 0403 
           Device-3: GYROCOM C&C Fiio E10 type: USB driver: hid-generic,snd-usb-audio,usbhid bus-ID: 3-1:2 chip-ID: 1852:7022 
           class-ID: 0102 
           Sound Server-1: ALSA v: k5.14.8-zen1-1-zen running: yes 
           Sound Server-2: JACK v: 1.9.19 running: no 
           Sound Server-3: PulseAudio v: 15.0 running: yes 
           Sound Server-4: PipeWire v: 0.3.38 running: yes 
Network:   Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet vendor: ASRock driver: r8169 v: kernel port: f000 
           bus-ID: 1f:00.0 chip-ID: 10ec:8168 class-ID: 0200 
           IF: enp31s0 state: up speed: 1000 Mbps duplex: full mac: 70:85:c2:4c:77:13 
           IP v4: 192.168.0.228/24 type: dynamic noprefixroute scope: global broadcast: 192.168.0.255 
           IP v4: 192.168.0.2/24 type: secondary noprefixroute scope: global broadcast: 192.168.0.255 
           WAN IP: 72.53.237.7 
Bluetooth: Device-1: Cambridge Silicon Radio Bluetooth Dongle (HCI mode) type: USB driver: btusb v: 0.8 bus-ID: 1-6:2 
           chip-ID: 0a12:0001 class-ID: e001 
           Report: bt-adapter ID: hci0 rfk-id: 0 state: up address: 00:15:83:F9:E1:F1 
Drives:    Local Storage: total: 3.44 TiB used: 1.38 TiB (40.0%) 
           ID-1: /dev/sda vendor: Seagate model: ST2000DM008-2FR102 size: 1.82 TiB speed: 6.0 Gb/s type: HDD rpm: 7200 
           serial: ZFL2EQH4 rev: 0001 
           ID-2: /dev/sdb vendor: Crucial model: CT275MX300SSD1 size: 256.17 GiB speed: 6.0 Gb/s type: SSD 
           serial: 16441481D460 rev: R031 scheme: GPT 
           ID-3: /dev/sdc vendor: Crucial model: CT500MX500SSD4 size: 465.76 GiB speed: 6.0 Gb/s type: SSD 
           serial: 1902E1E201F5 rev: 023 scheme: GPT 
           ID-4: /dev/sdd vendor: Seagate model: ST1000DM010-2EP102 size: 931.51 GiB speed: 6.0 Gb/s type: HDD rpm: 7200 
           serial: Z9AGLJH0 rev: CC43 scheme: GPT 
           ID-5: /dev/sde type: USB vendor: SanDisk model: Ultra size: 7.45 GiB type: N/A serial: 20052845530A4800EF8D 
           rev: 1.20 scheme: GPT 
Partition: ID-1: / size: 48.83 GiB used: 37.52 GiB (76.8%) fs: btrfs dev: /dev/sdc1 
           ID-2: /boot/efi size: 599.8 MiB used: 580 KiB (0.1%) fs: vfat dev: /dev/sdc3 
           ID-3: /home size: 416.35 GiB used: 99.79 GiB (24.0%) fs: btrfs dev: /dev/sdc2 
           ID-4: /var/log size: 48.83 GiB used: 37.52 GiB (76.8%) fs: btrfs dev: /dev/sdc1 
           ID-5: /var/tmp size: 48.83 GiB used: 37.52 GiB (76.8%) fs: btrfs dev: /dev/sdc1 
Swap:      ID-1: swap-1 type: zram size: 31.28 GiB used: 2 MiB (0.0%) priority: 100 dev: /dev/zram0 
Sensors:   System Temperatures: cpu: 43.4 C mobo: N/A gpu: nvidia temp: 34 C 
           Fan Speeds (RPM): N/A gpu: nvidia fan: 36% 
Info:      Processes: 410 Uptime: 13m wakeups: 0 Memory: 31.28 GiB used: 4.38 GiB (14.0%) Init: systemd v: 249 Compilers: 
           gcc: 11.1.0 Packages: pacman: 1625 Shell: fish v: 3.3.1 running-in: gjs inxi: 3.3.06 

Do you have the linux-lts-headers installed?

Yes, but only linux-zen and tkg-bmq work. I've even run the grubup command.

Edit: Even though it was already installed, reinstalling linux-lts-headers worked :smile:
Now it's just a waiting game to see if any more freezes occur...

 ╭─rk@rk in ~ via  v16.10.0 
 ╰─λ pamac search headers | grepi 'installed'
xorgproto                  [Installed] 2021.5-1                     extra 
libcups                    [Installed] 1:2.3.3op2-3                 extra 
boost                      [Installed] 1.76.0-1                     extra 
acl                        [Installed] 2.3.1-1                      core 
vulkan-headers             [Installed] 1:1.2.191-1                  extra 
linux-zen-headers          [Installed] 5.14.8.zen1-1                extra 
linux-zen-g14-headers      [Installed] 5.13.6.zen1-1                chaotic-aur 
linux-tkg-bmq-headers      [Installed] 5.14.8-203                   chaotic-aur 
linux-lts-headers          [Installed] 5.10.69-1                    core 
linux-api-headers          [Installed] 5.12.3-1                     core 
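
A quicker way to compare kernels against their headers than reading the pamac output is to ask pacman directly. The sketch below lists installed kernels whose -headers package is missing. `kernels_missing_headers` is a made-up name and the default package list is only an assumption. In the case above the headers were already installed and a reinstall is what helped, so this would only catch the simpler situation of a package that is missing outright.

```shell
# Hypothetical helper: list installed kernels whose matching -headers package is missing.
# Arguments: kernel package names to check (default: linux linux-lts linux-zen).
kernels_missing_headers() {
    [ "$#" -gt 0 ] || set -- linux linux-lts linux-zen
    for k in "$@"; do
        # pacman -Qq exits non-zero when the package is not installed.
        if pacman -Qq "$k" >/dev/null 2>&1 &&
           ! pacman -Qq "$k-headers" >/dev/null 2>&1; then
            echo "$k"
        fi
    done
}

# Usage:
#   kernels_missing_headers
#   kernels_missing_headers linux-tkg-bmq
# Reinstalling, as done above:
#   sudo pacman -S linux-lts linux-lts-headers
```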

If you’re referring to MY post, there’s nothing snobby about it, just nooby :slight_smile: I’ve imagined zen being top of the line and I want my system top of the line. In no small part due to Garuda’s front-page selling point on zen:

"A faster, more-responsive Linux kernel optimized for desktop, multimedia and gaming.
Result of a collaborative effort of kernel hackers to provide the best Linux kernel possible for everyday systems."

I mean who doesn’t want that? =)

Noobyness aside, would you advise staying up to date on linux-lts rather than downgrading zen versions like I’m doing now when facing crashes like this?

Each user's hardware is different; test different kernels to find which works best for you.

Grrr. After the last update of my LTS kernel, I now have the same problem with my virtualization.

I really love bleeding edge, but sometimes I miss my good old Debian stable.

Does anyone have a clue?

There are updates coming to linux and linux-zen that fix an issue with BFQ. linux-zen hit this issue (which caused kernel panics); linux was less susceptible.

Update to 5.14.9-arch2 or 5.14.9-zen2 and see if the issue persists.

https://lore.kernel.org/lkml/1624640454.149631.1632987871186@office.mailbox.org/T/

You can thank our @anon34128669 for this detective work. :hugs:
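
To check whether a machine is already on one of those releases, the running version can be compared against 5.14.9. This is a rough sketch: `kernel_at_least` is a made-up helper, it relies on sort -V from GNU coreutils, and it only compares the upstream part of the version, so it cannot tell zen1 from zen2 or arch1 from arch2.

```shell
# Hypothetical helper: succeed if a kernel release is at least a given upstream version.
# $1: wanted version, e.g. 5.14.9; $2: release string (default: output of uname -r)
# Only the part before the first "-" is compared, so zen1 vs zen2 is not checked.
kernel_at_least() {
    want=$1
    have=${2:-$(uname -r)}
    have=${have%%-*}
    # sort -V orders version strings; the smaller one comes first.
    lowest=$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

# Usage:
#   uname -r
#   kernel_at_least 5.14.9 && echo "running 5.14.9 or newer"
```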

Sorry for the late reply, I was on holiday for a week.

I really would like to thank him for this work, but to be honest, I do not know how to proceed.

All my new kernels (linux, linux-zen, linux-lts) have the same problem.
Could you tell me how to install the kernel you mentioned by hand?

My current list (pacman -Ss linux | grep 5.14) shows only the zen1 and arch1 kernels, which I have already installed.

Sad to see others have the same problem, and that they could not solve it either.
(from another Garuda user)

I just restored my system to its state of 29/09/2021, a day on which it had installed an update.
My kernel is now on 5.10.68~ and everything works again; on the broken system it was on 5.14.10~.

So it is like you said, and I thank @anon34128669 for his detective work, BUT what should I do now with this knowledge? Never update :=) ?

I'm not sure if you're being facetious, but apparently the issue you're having is different to what I thought it was.

If linux-lts (5.10) works, then there's some sort of regression in 5.14. If 5.10 releases newer than 5.10.68 also don't work correctly, then that regression was backported.

If there's nothing in your log files that points in any direction then the only way forwards is to perform a kernel bisection to find the commit that introduces the issue, then report that to the kernel developers.

If 5.10.68 works but 5.10.69 does not then that also narrows the range of commits significantly and will make the bisection much quicker.

https://wiki.archlinux.org/index.php/Bisecting_bugs_with_Git
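
In practice a bisection looks roughly like the sketch below. `start_bisect` is a made-up wrapper around git bisect, and the tags in the usage comments are simply the two versions named above; whether they really are the good and bad ends has to be confirmed by testing first. Building and booting each candidate kernel is the slow part and is covered by the wiki page.

```shell
# Hypothetical helper: start a bisection in the current git checkout.
# $1: last known good revision; $2: first known bad revision
start_bisect() {
    good=$1
    bad=$2
    # "git bisect start <bad> <good>" marks both ends and checks out a midpoint.
    git bisect start "$bad" "$good"
}

# Manual workflow in a clone of the stable kernel tree:
#   start_bisect v5.10.68 v5.10.69
#   ... build and boot the checked-out kernel, try to reproduce the problem ...
#   git bisect good     # if the problem does not occur
#   git bisect bad      # if it does; repeat until git names the first bad commit
#   git bisect reset    # return to the original checkout when done
```

Each round halves the remaining range, which is why a narrow range such as one point release keeps the number of test builds small.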

Hello Jonathan,
thank you for your answer.
I am not sure if I understand you correctly.
First of all, I don't mean this as a joke, of course.
What I am stating is quite simple.
At some point my virtual machines stopped running, with the error message described above.
I then switched to various other kernels and whenever they got an update, my machines stopped running again, always with the same error messages.
So I looked at the days of the updates in Timeshift, chose a specific date, and restored the system to that point.
Now I see it running again and share the corresponding kernel state, nothing more, nothing less.
I have no idea what a 'kernel bisecting' is and I don't think I want to do that either.
It seems like @atkatana here from the Garuda forum has the same problem as me. Unlike him, though, I'm not going to change distributions because of it.
https://forum.garudalinux.org/t/kvm-problem-to-start-virtual-machines/13192/20

Unfortunately, I have never reported a 'bug' and I fear mail-back-and-forth, which I don't have the time or inclination to do.
Is it not possible to 'freeze' the state of my kernel until the problem is fixed and still get updates?
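
One mechanism that exists for this on Arch-based systems is pacman's IgnorePkg setting, which holds named packages back while everything else updates. Here is a minimal sketch of adding to it. `hold_packages` is a made-up helper, GNU sed is assumed, and which packages to hold depends on which kernel you boot. Holding a kernel back is a partial upgrade, so it is best treated as temporary; modules built against the kernel (for example the NVIDIA driver) may need attention.

```shell
# Hypothetical helper: add packages to IgnorePkg in a pacman.conf-style file.
# $1: config file; remaining arguments: package names to hold back
hold_packages() {
    conf=$1
    shift
    pkgs="$*"
    if grep -q '^IgnorePkg[[:space:]]*=' "$conf"; then
        # An active IgnorePkg line exists: append to it.
        sed -i "s|^IgnorePkg[[:space:]]*=.*|& $pkgs|" "$conf"
    elif grep -q '^#IgnorePkg[[:space:]]*=' "$conf"; then
        # Only the commented-out template exists: replace it.
        sed -i "s|^#IgnorePkg[[:space:]]*=.*|IgnorePkg = $pkgs|" "$conf"
    else
        echo "no IgnorePkg line found in $conf" >&2
        return 1
    fi
}

# Usage (as root); afterwards "pacman -Syu" skips these packages with a warning:
#   hold_packages /etc/pacman.conf linux-lts linux-lts-headers
```

Removing the names from the IgnorePkg line again lets the kernel update normally once a fixed release is out.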