Maybe someone can help me a bit with having the amdgpu-pro drivers as the default as much as possible, so that my system won't crash if I forget progl somewhere... :-/
I am on a quest to get my Vega graphics card running stably in Garuda Linux... It is a heavy task, as it seems that Vega is not very well supported (and maybe not sold very often? I don't know).
I had it running for a few days, but sadly, after an update, it was back to crashing again... Now I am close to getting it working again.
I am writing this down in the hope of getting some help with my last step, and also as a reference for myself and other people (there seem to be quite a lot of people with these issues).
So here is what I have found out so far:
The Mesa drivers crash the card so badly that it requires a power cycle to work again. In the best case, the screens go black and the GPU fans spin at 100%. The system is still responding: you can access it via SSH and do stuff (like extracting logs from dmesg). But the displays won't come back, and everything that tries to (re)start or stop the graphics system hangs indefinitely. The best course of action is to type "reboot" in the SSH terminal, wait a few seconds for the disks to unmount, and then hold the power button for more than 5 seconds to power-cycle the whole system.
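Since the machine stays reachable over SSH, the kernel log can be saved before rebooting. Here is a minimal Python sketch that runs dmesg and keeps only the GPU-related lines; the keyword list ("amdgpu", "drm") is my own assumption about what is relevant, not an official filter.

```python
import subprocess
import sys
from typing import List, Sequence

# Lower-case substrings that usually mark GPU-related kernel messages.
# This keyword list is an assumption; adjust it to what your log shows.
KEYWORDS = ("amdgpu", "drm")


def filter_gpu_lines(log_text: str, keywords: Sequence[str] = KEYWORDS) -> List[str]:
    """Keep only log lines that contain one of the (lower-case) keywords."""
    return [line for line in log_text.splitlines()
            if any(k in line.lower() for k in keywords)]


def read_dmesg() -> str:
    """Return the output of 'dmesg', or '' if it cannot be read."""
    try:
        result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"could not read dmesg: {exc}", file=sys.stderr)
        return ""
    return result.stdout


if __name__ == "__main__":
    for line in filter_gpu_lines(read_dmesg()):
        print(line)
```

Run it as root over SSH and redirect the output to a file, since dmesg may be restricted to root. After the power cycle, "journalctl -k -b -1" should also show the kernel messages of the crashed boot, provided the journal is persistent.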
Using the system "as is" seems to be stable with all drivers... but the following will crash it:
- OpenCL: for example, enabling OpenCL in LibreOffice is a 100% sure crash on my system.
- Some games (not all of them, and it does not matter whether they are native or run via Wine):
  - Native example: "For The King". Go into the lore store and browse a bit; it will usually crash within a few seconds.
  - Wine example: "WitchIt". It crashes right after the intro video, at the screen that says "Press a button", the moment you press a button.
For issue number 1, the fix is quite easy: get rid of opencl-mesa and install opencl-amd instead. After that, OpenCL works stably; LibreOffice can be used with OpenCL, and BOINC or Folding@home with GPU tasks won't crash the system. Yay.
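To double-check that only the AMD OpenCL implementation is left after the swap, you can look at the vendor files the OpenCL ICD loader reads. A small Python sketch, assuming the usual /etc/OpenCL/vendors directory and assuming Mesa's entry can be recognized by "mesa" in its file name or library path (a heuristic, not a guarantee):

```python
from pathlib import Path
from typing import Dict, List

# Directory the OpenCL ICD loader scans for vendor entries (assumed default).
VENDORS_DIR = Path("/etc/OpenCL/vendors")


def list_icds(vendors_dir: Path = VENDORS_DIR) -> Dict[str, str]:
    """Map each *.icd file name to the library path written inside it."""
    if not vendors_dir.is_dir():
        return {}
    return {icd.name: icd.read_text().strip()
            for icd in sorted(vendors_dir.glob("*.icd"))}


def mesa_icds(icds: Dict[str, str]) -> List[str]:
    """Heuristic: entries whose file name or library mentions 'mesa'."""
    return [name for name, lib in icds.items()
            if "mesa" in (name + " " + lib).lower()]


if __name__ == "__main__":
    icds = list_icds()
    if not icds:
        print(f"no .icd files found in {VENDORS_DIR}")
    for name, lib in icds.items():
        print(f"{name}: {lib}")
    leftovers = mesa_icds(icds)
    if leftovers:
        print("Mesa OpenCL still registered:", ", ".join(leftovers))
```

Running "pacman -Qo" on each listed .icd file tells you which package installed it, and the clinfo tool lists the OpenCL platforms that are actually usable.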
Issue number 2 is a different beast, though. I found that the card also runs stably with the proprietary AMD drivers (which is not much of a surprise, because it is stable in Windows, too). But there seems to be no bulletproof way to make sure the system loads the proprietary driver, and SDDM also does not (always?) start with it...
My current solution is to run the games that crash with progl in front of them. For Steam, you can add that as a launch option, i.e. enter:
progl %command%
Voilà, WitchIt no longer crashes the system. (I still have to test whether that works with gamemoderun too, and how best to combine the two.)
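The inxi output below shows "v: 4.6 Mesa 21.3.6", so Mesa is what runs by default. One way to check whether progl really switches a program to the pro driver is to compare "glxinfo -B" with "progl glxinfo -B". This Python sketch assumes glxinfo (from mesa-utils) is installed and that the pro driver's OpenGL version string does not contain the word "Mesa"; I have not verified the latter.

```python
import subprocess
from typing import List, Optional


def opengl_version_line(glxinfo_output: str) -> Optional[str]:
    """Return the 'OpenGL version string' line from glxinfo output, if any."""
    for line in glxinfo_output.splitlines():
        line = line.strip()
        if line.startswith("OpenGL version string:"):
            return line
    return None


def uses_mesa(glxinfo_output: str) -> bool:
    """True if the compatibility-profile version string mentions Mesa."""
    line = opengl_version_line(glxinfo_output)
    return line is not None and "Mesa" in line


def probe(prefix: List[str]) -> str:
    """Run 'glxinfo -B', optionally behind a wrapper such as ['progl']."""
    result = subprocess.run(prefix + ["glxinfo", "-B"],
                            capture_output=True, text=True, timeout=10)
    return result.stdout


if __name__ == "__main__":
    for prefix in ([], ["progl"]):
        label = " ".join(prefix + ["glxinfo", "-B"])
        try:
            output = probe(prefix)
        except (OSError, subprocess.TimeoutExpired) as exc:
            print(f"{label}: could not run ({exc})")
            continue
        line = opengl_version_line(output)
        if line is None:
            print(f"{label}: no OpenGL version line (no display?)")
        else:
            driver = "Mesa" if uses_mesa(output) else "not Mesa"
            print(f"{label}: {driver} | {line}")
```

If both runs report Mesa, progl is not taking effect for that program.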
Maybe someone can help me a bit with having the pro drivers as the default as much as possible, so that my system won't crash if I forget progl somewhere... :-/
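One idea for making the pro driver the default for games: instead of remembering progl per game, use a small wrapper that refuses to start the game at all if progl is missing, and adds gamemoderun when it is installed. This is a hypothetical sketch; the wrapper order (gamemoderun outside, progl inside) is an untested assumption.

```python
import os
import shutil
import sys
from typing import Callable, List, Optional

# Hypothetical policy: progl is mandatory, gamemoderun is optional.
REQUIRED = ["progl"]
OPTIONAL = ["gamemoderun"]


def build_command(cmd: List[str],
                  which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return the wrapped command line; raise if a required wrapper is missing."""
    missing = [w for w in REQUIRED if which(w) is None]
    if missing:
        raise FileNotFoundError("required wrapper not found: " + ", ".join(missing))
    extras = [w for w in OPTIONAL if which(w) is not None]
    # Assumed (untested) order: gamemoderun outermost, progl right before the game.
    return extras + REQUIRED + cmd


def main(game_cmd: List[str]) -> None:
    try:
        argv = build_command(game_cmd)
    except FileNotFoundError as exc:
        # Refuse to start rather than fall back to Mesa, which is what crashes.
        sys.exit(f"not starting without the pro driver: {exc}")
    os.execvp(argv[0], argv)  # replaces this process; does not return


# Steam passes the game command as arguments via %command%.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

In Steam, the launch option would then be something like "python3 /path/to/pro-wrap.py %command%" (path and file name are made up). Failing loudly seemed better than silently falling back to Mesa, since that fallback is exactly what takes the machine down.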
Before posting, I tried all of this with the current (220131) Garuda Dr460nized Gaming ISO image as a live system. Both issues happen there, too.
I found a lot of issues reported on the web about this or similar crashes... From what I understand, the messages from dmesg do not really help. If anybody can point me to a place where I could file a bug and provide more logs, so that this can be fixed in the open-source drivers, I'm certainly willing to try (but I lack a lot of knowledge for that, for example which tools to use and so on...).
For me, the amdgpu.dpm=0 kernel parameter (which is explained on the Arch Wiki AMDGPU page) and a lot of other parameter combinations I found on the web did not help at all. They did nothing for the OpenCL issue; for the gaming issue, some of them seem to delay the crash, but it still crashes randomly...
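To rule out that a kernel parameter was simply not applied, it helps to check what the running kernel actually received. A minimal Python sketch that parses /proc/cmdline; as an extra check it also reads /sys/module/amdgpu/parameters/dpm, assuming that file exists on your kernel and holds the value the amdgpu module is using.

```python
from pathlib import Path
from typing import Dict


def parse_cmdline(text: str) -> Dict[str, str]:
    """Split a kernel command line into {name: value}.

    Flags without '=' (e.g. "rw", "splash") map to an empty string.
    Quoted values containing spaces are not handled.
    """
    params: Dict[str, str] = {}
    for token in text.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


if __name__ == "__main__":
    cmdline = Path("/proc/cmdline")
    if cmdline.exists():
        params = parse_cmdline(cmdline.read_text())
        print("amdgpu.dpm on the command line:", params.get("amdgpu.dpm", "<not set>"))
    # Assumption: amdgpu exposes its 'dpm' module parameter here when loaded.
    sysfs = Path("/sys/module/amdgpu/parameters/dpm")
    try:
        print("amdgpu dpm value in use:", sysfs.read_text().strip())
    except OSError:
        print("amdgpu dpm parameter not readable at", sysfs)
```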
Thanks for reading and thanks for any comments / recommendations...
Here is my inxi:
System:
  Kernel: 5.15.23-2-lts x86_64 bits: 64 compiler: gcc v: 11.2.0
    parameters: BOOT_IMAGE=/@/boot/vmlinuz-linux-lts root=UUID=03da7a37-4ea3-4808-b057-6f1ef916effa rw [email protected] splash rd.udev.log_priority=3 vt.global_cursor_default=0 systemd.unified_cgroup_hierarchy=1 resume=UUID=6b5e9134-5814-43fa-a3ec-627a454e7d9c loglevel=3 amdgpu.dpm=0
  Desktop: KDE Plasma 5.24.1 tk: Qt 5.15.2 info: latte-dock wm: kwin_x11 vt: 1 dm: SDDM
  Distro: Garuda Linux base: Arch Linux
Machine:
  Type: Desktop Mobo: Micro-Star model: MPG X570 GAMING PRO CARBON WIFI (MS-7B93) v: 1.0 serial: <superuser required>
  UEFI: American Megatrends LLC. v: 1.E0 date: 12/17/2021
CPU:
  Info: model: AMD Ryzen 9 5950X bits: 64 type: MT MCP arch: Zen 3 family: 0x19 (25) model-id: 0x21 (33) stepping: 0 microcode: 0xA201016
  Topology: cpus: 1x cores: 16 tpc: 2 threads: 32 smt: enabled cache: L1: 1024 KiB desc: d-16x32 KiB; i-16x32 KiB L2: 8 MiB desc: 16x512 KiB L3: 64 MiB desc: 2x32 MiB
  Speed (MHz): avg: 3661 high: 4449 min/max: 2200/5083 boost: enabled scaling: driver: acpi-cpufreq governor: performance
    cores: 1: 3597 2: 3677 3: 3597 4: 3595 5: 4449 6: 3659 7: 3616 8: 3607 9: 4028 10: 3597 11: 3603 12: 3610 13: 3598 14: 3599 15: 3601 16: 3630 17: 3794 18: 3600 19: 3676 20: 3598 21: 3753 22: 3597 23: 3639 24: 3624 25: 3658 26: 3598 27: 3590 28: 3599 29: 3594 30: 3598 31: 3598 32: 3598
    bogomips: 217592
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
  Vulnerabilities:
    Type: itlb_multihit status: Not affected
    Type: l1tf status: Not affected
    Type: mds status: Not affected
    Type: meltdown status: Not affected
    Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via prctl and seccomp
    Type: spectre_v1 mitigation: usercopy/swapgs barriers and __user pointer sanitization
    Type: spectre_v2 mitigation: Full AMD retpoline, IBPB: conditional, IBRS_FW, STIBP: always-on, RSB filling
    Type: srbds status: Not affected
    Type: tsx_async_abort status: Not affected
Graphics:
  Device-1: AMD Vega 10 XL/XT [Radeon RX Vega 56/64] vendor: ASUSTeK driver: amdgpu v: kernel bus-ID: 2f:00.0 chip-ID: 1002:687f class-ID: 0300
  Device-2: ARC Camera type: USB driver: snd-usb-audio,uvcvideo bus-ID: 1-5.2:7 chip-ID: 05a3:9331 class-ID: 0102 serial: <filter>
  Device-3: ET13R type: USB driver: snd-usb-audio,uvcvideo bus-ID: 5-4:5 chip-ID: 1e4f:1301 class-ID: 0102 serial: <filter>
  Display: x11 server: X.Org 184.108.40.206 compositor: kwin_x11 driver: loaded: amdgpu,ati unloaded: modesetting alternate: fbdev,vesa display-ID: :0 screens: 1
  Screen-1: 0 s-res: 4480x1440 s-dpi: 96 s-size: 1185x381mm (46.7x15.0") s-diag: 1245mm (49")
  Monitor-1: DisplayPort-0 res: 2560x1440 dpi: 109 size: 597x336mm (23.5x13.2") diag: 685mm (27")
  Monitor-2: DisplayPort-1 res: 1920x1080 hz: 60 dpi: 82 size: 598x336mm (23.5x13.2") diag: 686mm (27")
  OpenGL: renderer: AMD Radeon RX Vega (VEGA10 DRM 3.42.0 5.15.23-2-lts LLVM 13.0.1) v: 4.6 Mesa 21.3.6 direct render: Yes
Audio:
  Device-1: AMD Vega 10 HDMI Audio [Radeon Vega 56/64] driver: snd_hda_intel v: kernel bus-ID: 2f:00.1 chip-ID: 1002:aaf8 class-ID: 0403
  Device-2: AMD Starship/Matisse HD Audio vendor: Micro-Star MSI driver: snd_hda_intel v: kernel bus-ID: 31:00.4 chip-ID: 1022:1487 class-ID: 0403
  Device-3: Razer USA Nari Ultimate type: USB driver: hid-generic,snd-usb-audio,usbhid bus-ID: 1-1:2 chip-ID: 1532:051a class-ID: 0300
  Device-4: ARC Camera type: USB driver: snd-usb-audio,uvcvideo bus-ID: 1-5.2:7 chip-ID: 05a3:9331 class-ID: 0102 serial: <filter>
  Device-5: ET13R type: USB driver: snd-usb-audio,uvcvideo bus-ID: 5-4:5 chip-ID: 1e4f:1301 class-ID: 0102 serial: <filter>
  Sound Server-1: ALSA v: k5.15.23-2-lts running: yes
  Sound Server-2: PulseAudio v: 15.0 running: no
  Sound Server-3: PipeWire v: 0.3.45 running: yes
Network:
  Device-1: Intel I211 Gigabit Network vendor: Micro-Star MSI driver: igb v: kernel port: d000 bus-ID: 26:00.0 chip-ID: 8086:1539 class-ID: 0200
    IF: enp38s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
  IF-ID-1: docker0 state: down mac: <filter>
  IF-ID-2: virbr0 state: down mac: <filter>
Bluetooth:
  Device-1: Intel AX200 Bluetooth type: USB driver: btusb v: 0.8 bus-ID: 1-4:4 chip-ID: 8087:0029 class-ID: e001
  Report: bt-adapter note: tool can't run ID: hci0 rfk-id: 0 state: down bt-service: disabled rfk-block: hardware: no software: no address: N/A
Drives:
  Local Storage: total: 7.02 TiB used: 2.52 TiB (35.9%)
  SMART Message: Unable to run smartctl. Root privileges required.
  ID-1: /dev/nvme0n1 maj-min: 259:7 vendor: Samsung model: SSD 970 EVO Plus 2TB size: 1.82 TiB block-size: physical: 512 B logical: 512 B speed: 31.6 Gb/s lanes: 4 type: SSD serial: <filter> rev: 2B2QEXM7 temp: 45.9 C scheme: GPT
  ID-2: /dev/nvme1n1 maj-min: 259:0 vendor: Samsung model: SSD 980 PRO 2TB size: 1.82 TiB block-size: physical: 512 B logical: 512 B speed: 63.2 Gb/s lanes: 4 type: SSD serial: <filter> rev: 3B2QGXA7 temp: 45.9 C scheme: GPT
  ID-3: /dev/nvme2n1 maj-min: 259:3 vendor: Samsung model: SSD 970 EVO Plus 2TB size: 1.82 TiB block-size: physical: 512 B logical: 512 B speed: 31.6 Gb/s lanes: 4 type: SSD serial: <filter> rev: 3B2QEXM7 temp: 35.9 C scheme: GPT
  ID-4: /dev/sda maj-min: 8:0 vendor: Mushkin model: MKNSSDRE1TB size: 931.51 GiB block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s type: SSD serial: <filter> rev: 7C scheme: GPT
  ID-5: /dev/sdb maj-min: 8:16 vendor: Mushkin model: MKNSSDCR480GB size: 447.13 GiB block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s type: SSD serial: <filter> rev: BBF0 scheme: MBR
  ID-6: /dev/sdc maj-min: 8:32 vendor: OCZ model: AGILITY3 size: 223.57 GiB block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s type: SSD serial: <filter> rev: 2.15 scheme: MBR
Partition:
  ID-1: / raw-size: 1.75 TiB size: 3.57 TiB (203.87%) used: 1.29 TiB (36.1%) fs: btrfs dev: /dev/nvme2n1p2 maj-min: 259:5
  ID-2: /boot/efi raw-size: 300 MiB size: 299.4 MiB (99.80%) used: 25.8 MiB (8.6%) fs: vfat dev: /dev/nvme2n1p1 maj-min: 259:4
  ID-3: /home raw-size: 1.75 TiB size: 3.57 TiB (203.87%) used: 1.29 TiB (36.1%) fs: btrfs dev: /dev/nvme2n1p2 maj-min: 259:5
  ID-4: /var/log raw-size: 1.75 TiB size: 3.57 TiB (203.87%) used: 1.29 TiB (36.1%) fs: btrfs dev: /dev/nvme2n1p2 maj-min: 259:5
  ID-5: /var/tmp raw-size: 1.75 TiB size: 3.57 TiB (203.87%) used: 1.29 TiB (36.1%) fs: btrfs dev: /dev/nvme2n1p2 maj-min: 259:5
Swap:
  Kernel: swappiness: 133 (default 60) cache-pressure: 100 (default)
  ID-1: swap-1 type: partition size: 69.06 GiB used: 0 KiB (0.0%) priority: -2 dev: /dev/nvme2n1p3 maj-min: 259:6
  ID-2: swap-2 type: zram size: 62.79 GiB used: 226.5 MiB (0.4%) priority: 100 dev: /dev/zram0
Sensors:
  System Temperatures: cpu: 37.0 C mobo: 38.0 C
  Fan Speeds (RPM): fan-1: 0 fan-2: 1284 fan-3: 0 fan-4: 740 fan-5: 0 fan-6: 0 fan-7: 1525
Info:
  Processes: 594 Uptime: 34m wakeups: 0 Memory: 62.79 GiB used: 7.81 GiB (12.4%) Init: systemd v: 250 tool: systemctl Compilers: gcc: 11.2.0 clang: 13.0.1 Packages: 2177 pacman: 2163 lib: 593 flatpak: 6 snap: 8 Shell: fish v: 3.3.1 default: Bash v: 5.1.16 running-in: konsole inxi: 3.3.12
Garuda (2.5.4-2):
  System install date: 2022-01-09
  Last full system update: 2022-02-16
  Is partially upgraded: No
  Relevant software: NetworkManager
  Windows dual boot: Probably (Run as root to verify)
  Snapshots: Snapper
  Failed units: dev-binderfs.mount anbox-container-manager.service foldingathome.service