AMD Vega crashing with open source / mesa drivers

Hi.

TL;DR:
Maybe someone can help me make the amdgpu pro drivers the default as far as possible, so that my system won't crash if I forget progl somewhere... :-/


Full story:

I am on a quest to get my Vega graphics card to run stably in Garuda Linux... it is a heavy task, as it seems that Vega is not very well supported (and not sold very often? Don't know).

I had it running for a few days, but sadly after an update it was back to crashing again... now I am close to getting it working again...

I am writing this down in the hope of getting some help with my last step, and also as a reference for myself and other people (there are quite a lot with these issues, it seems).

So what I found out so far:
The mesa drivers crash the card so badly that it requires a power cycle to work again. In the best case the screens go black and the GPU fans run at 100%. The system is still responding; you can access it via SSH and do stuff (like extracting logs from dmesg). But the displays won't come back. Everything that tries to (re)start or stop the graphics system will hang indefinitely. The best course of action is to type "reboot" in the SSH terminal, wait a few seconds for the disks to unmount and so on, and then hold the power button for more than 5 seconds to power cycle the whole system.
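
For anyone who wants to grab logs in that state, here is a minimal sketch of what I mean. The function name and the grep pattern are my own choice, not an official tool; it just keeps the kernel log lines that mention amdgpu or drm so they can be attached to a bug report.

```shell
#!/bin/sh
# Keep only kernel log lines that mention the GPU driver stack.
# The pattern is an assumption; widen it if your crash uses other keywords.
filter_gpu_log() {
    grep -iE 'amdgpu|\[drm'
}

# Usage over SSH from a second machine ("user@crashed-box" is a placeholder;
# dmesg may need sudo depending on the system's settings):
#   ssh user@crashed-box dmesg | filter_gpu_log > gpu-crash.log
```

Saving the output on the second machine matters, because the crashed box gets power cycled afterwards.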

Using the system "as is" is stable with all drivers, it seems... but the following things crash it:

  1. Using opencl. For example, a 100% sure crash on my system is enabling opencl in libreoffice.
  2. Some games (not all of them, but some, and it does not matter if native or via wine).
    1. Example for native is "for the king": go into the lore store and browse a bit, and it will usually crash within a few seconds.
    2. Example for wine is "WitchIt": it will crash after the intro video where it says "Press a button", the second you press a button.

For issue number 1 the fix is quite easy. Get rid of opencl-mesa and install opencl-amd instead. After that opencl works stably, libreoffice can be used with opencl, and boinc or folding at home with GPU tasks won't crash the system. Yay. :slight_smile:

Issue number 2 is a different beast, though. I found that the card runs stable with the proprietary AMD Drivers, too (which is not much of a surprise, because it is stable in Windows, too). But there seems to be no bulletproof way to make sure the system loads the proprietary driver and also sddm does not (always?) load with them...

My current solution is to run the games that would otherwise crash with progl in front of them. For steam you can add that as a launch option, i.e. type progl %command% -> voila, WitchIt does not crash the system. (I still have to try whether that works with gamemoderun, too, and how to best combine those.)
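
A small sketch of how the progl prefix could be wrapped so the same launch line also works where progl is missing. This does not make the pro driver the system default; run_with_progl and the PROGL override variable are names I made up for illustration.

```shell
#!/bin/sh
# Run a command through progl when it is installed, otherwise run it directly.
# PROGL is a hypothetical override variable (defaults to "progl").
run_with_progl() {
    launcher="${PROGL:-progl}"
    if command -v "$launcher" >/dev/null 2>&1; then
        "$launcher" "$@"
    else
        "$@"
    fi
}

# Usage from a terminal or a launcher script:
#   run_with_progl some-game --some-option
```

The point of the fallback branch is only that a launcher script using it does not break on a system, or after an update, where progl is not on the PATH.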

Maybe someone can help me make the pro drivers the default as far as possible, so that my system won't crash if I forget progl somewhere... :-/

Before posting I tried all of this with the current (220131) Garuda Dr460nized Gaming ISO image in a live system. Both issues happen there, too.

I found a lot of issues opened on the web about this and similar crashes... the messages from dmesg do not really help, from what I understood... if anybody can point me somewhere where I could file a bug and deliver more logs so that this could be solved in the open source drivers, I'm more than willing to try (but I lack a lot of knowledge for that, for example which tools to use and so on... :frowning: ).
For me the amdgpu.dpm=0 kernel parameter (which is explained on the Arch Wiki AMDGPU page), and a lot of other parameter combinations I found on the web, did not help at all. They did nothing for the opencl issue; for the gaming issue some of them seem to delay the crash, but it still crashes randomly...
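
When testing such parameters it is worth confirming that the running kernel really received them. A minimal sketch, assuming a POSIX shell; has_kernel_param is a helper name I made up, and it only does an exact word match on the command line string.

```shell
#!/bin/sh
# Succeed if the given parameter appears as a whole word
# in a kernel command line string.
has_kernel_param() {
    param=$1
    cmdline=$2
    # Unquoted on purpose: split the command line on whitespace.
    for word in $cmdline; do
        if [ "$word" = "$param" ]; then
            return 0
        fi
    done
    return 1
}

# Usage on a running system:
#   has_kernel_param amdgpu.dpm=0 "$(cat /proc/cmdline)" && echo "dpm is disabled"
```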

Thanks for reading and thanks for any comments / recommendations...

Here is my inxi:

System:
Kernel: 5.15.23-2-lts x86_64 bits: 64 compiler: gcc v: 11.2.0
parameters: BOOT_IMAGE=/@/boot/vmlinuz-linux-lts
root=UUID=03da7a37-4ea3-4808-b057-6f1ef916effa rw [email protected]
splash rd.udev.log_priority=3 vt.global_cursor_default=0
systemd.unified_cgroup_hierarchy=1
resume=UUID=6b5e9134-5814-43fa-a3ec-627a454e7d9c loglevel=3 amdgpu.dpm=0
Desktop: KDE Plasma 5.24.1 tk: Qt 5.15.2 info: latte-dock wm: kwin_x11
vt: 1 dm: SDDM Distro: Garuda Linux base: Arch Linux
Machine:
Type: Desktop Mobo: Micro-Star
model: MPG X570 GAMING PRO CARBON WIFI (MS-7B93) v: 1.0
serial: <superuser required> UEFI: American Megatrends LLC. v: 1.E0
date: 12/17/2021
CPU:
Info: model: AMD Ryzen 9 5950X bits: 64 type: MT MCP arch: Zen 3
family: 0x19 (25) model-id: 0x21 (33) stepping: 0 microcode: 0xA201016
Topology: cpus: 1x cores: 16 tpc: 2 threads: 32 smt: enabled cache:
L1: 1024 KiB desc: d-16x32 KiB; i-16x32 KiB L2: 8 MiB desc: 16x512 KiB
L3: 64 MiB desc: 2x32 MiB
Speed (MHz): avg: 3661 high: 4449 min/max: 2200/5083 boost: enabled
scaling: driver: acpi-cpufreq governor: performance cores: 1: 3597 2: 3677
3: 3597 4: 3595 5: 4449 6: 3659 7: 3616 8: 3607 9: 4028 10: 3597 11: 3603
12: 3610 13: 3598 14: 3599 15: 3601 16: 3630 17: 3794 18: 3600 19: 3676
20: 3598 21: 3753 22: 3597 23: 3639 24: 3624 25: 3658 26: 3598 27: 3590
28: 3599 29: 3594 30: 3598 31: 3598 32: 3598 bogomips: 217592
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Vulnerabilities:
Type: itlb_multihit status: Not affected
Type: l1tf status: Not affected
Type: mds status: Not affected
Type: meltdown status: Not affected
Type: spec_store_bypass
mitigation: Speculative Store Bypass disabled via prctl and seccomp
Type: spectre_v1
mitigation: usercopy/swapgs barriers and __user pointer sanitization
Type: spectre_v2 mitigation: Full AMD retpoline, IBPB: conditional,
IBRS_FW, STIBP: always-on, RSB filling
Type: srbds status: Not affected
Type: tsx_async_abort status: Not affected
Graphics:
Device-1: AMD Vega 10 XL/XT [Radeon RX Vega 56/64] vendor: ASUSTeK
driver: amdgpu v: kernel bus-ID: 2f:00.0 chip-ID: 1002:687f class-ID: 0300
Device-2: ARC Camera type: USB driver: snd-usb-audio,uvcvideo
bus-ID: 1-5.2:7 chip-ID: 05a3:9331 class-ID: 0102 serial: <filter>
Device-3: ET13R type: USB driver: snd-usb-audio,uvcvideo bus-ID: 5-4:5
chip-ID: 1e4f:1301 class-ID: 0102 serial: <filter>
Display: x11 server: X.Org 1.21.1.3 compositor: kwin_x11 driver:
loaded: amdgpu,ati unloaded: modesetting alternate: fbdev,vesa
display-ID: :0 screens: 1
Screen-1: 0 s-res: 4480x1440 s-dpi: 96 s-size: 1185x381mm (46.7x15.0")
s-diag: 1245mm (49")
Monitor-1: DisplayPort-0 res: 2560x1440 dpi: 109
size: 597x336mm (23.5x13.2") diag: 685mm (27")
Monitor-2: DisplayPort-1 res: 1920x1080 hz: 60 dpi: 82
size: 598x336mm (23.5x13.2") diag: 686mm (27")
OpenGL:
renderer: AMD Radeon RX Vega (VEGA10 DRM 3.42.0 5.15.23-2-lts LLVM 13.0.1)
v: 4.6 Mesa 21.3.6 direct render: Yes
Audio:
Device-1: AMD Vega 10 HDMI Audio [Radeon Vega 56/64] driver: snd_hda_intel
v: kernel bus-ID: 2f:00.1 chip-ID: 1002:aaf8 class-ID: 0403
Device-2: AMD Starship/Matisse HD Audio vendor: Micro-Star MSI
driver: snd_hda_intel v: kernel bus-ID: 31:00.4 chip-ID: 1022:1487
class-ID: 0403
Device-3: Razer USA Nari Ultimate type: USB
driver: hid-generic,snd-usb-audio,usbhid bus-ID: 1-1:2 chip-ID: 1532:051a
class-ID: 0300
Device-4: ARC Camera type: USB driver: snd-usb-audio,uvcvideo
bus-ID: 1-5.2:7 chip-ID: 05a3:9331 class-ID: 0102 serial: <filter>
Device-5: ET13R type: USB driver: snd-usb-audio,uvcvideo bus-ID: 5-4:5
chip-ID: 1e4f:1301 class-ID: 0102 serial: <filter>
Sound Server-1: ALSA v: k5.15.23-2-lts running: yes
Sound Server-2: PulseAudio v: 15.0 running: no
Sound Server-3: PipeWire v: 0.3.45 running: yes
Network:
Device-1: Intel I211 Gigabit Network vendor: Micro-Star MSI driver: igb
v: kernel port: d000 bus-ID: 26:00.0 chip-ID: 8086:1539 class-ID: 0200
IF: enp38s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
IF-ID-1: docker0 state: down mac: <filter>
IF-ID-2: virbr0 state: down mac: <filter>
Bluetooth:
Device-1: Intel AX200 Bluetooth type: USB driver: btusb v: 0.8
bus-ID: 1-4:4 chip-ID: 8087:0029 class-ID: e001
Report: bt-adapter note: tool can't run ID: hci0 rfk-id: 0 state: down
bt-service: disabled rfk-block: hardware: no software: no address: N/A
Drives:
Local Storage: total: 7.02 TiB used: 2.52 TiB (35.9%)
SMART Message: Unable to run smartctl. Root privileges required.
ID-1: /dev/nvme0n1 maj-min: 259:7 vendor: Samsung
model: SSD 970 EVO Plus 2TB size: 1.82 TiB block-size: physical: 512 B
logical: 512 B speed: 31.6 Gb/s lanes: 4 type: SSD serial: <filter>
rev: 2B2QEXM7 temp: 45.9 C scheme: GPT
ID-2: /dev/nvme1n1 maj-min: 259:0 vendor: Samsung model: SSD 980 PRO 2TB
size: 1.82 TiB block-size: physical: 512 B logical: 512 B speed: 63.2 Gb/s
lanes: 4 type: SSD serial: <filter> rev: 3B2QGXA7 temp: 45.9 C
scheme: GPT
ID-3: /dev/nvme2n1 maj-min: 259:3 vendor: Samsung
model: SSD 970 EVO Plus 2TB size: 1.82 TiB block-size: physical: 512 B
logical: 512 B speed: 31.6 Gb/s lanes: 4 type: SSD serial: <filter>
rev: 3B2QEXM7 temp: 35.9 C scheme: GPT
ID-4: /dev/sda maj-min: 8:0 vendor: Mushkin model: MKNSSDRE1TB
size: 931.51 GiB block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s
type: SSD serial: <filter> rev: 7C scheme: GPT
ID-5: /dev/sdb maj-min: 8:16 vendor: Mushkin model: MKNSSDCR480GB
size: 447.13 GiB block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s
type: SSD serial: <filter> rev: BBF0 scheme: MBR
ID-6: /dev/sdc maj-min: 8:32 vendor: OCZ model: AGILITY3 size: 223.57 GiB
block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s type: SSD
serial: <filter> rev: 2.15 scheme: MBR
Partition:
ID-1: / raw-size: 1.75 TiB size: 3.57 TiB (203.87%) used: 1.29 TiB (36.1%)
fs: btrfs dev: /dev/nvme2n1p2 maj-min: 259:5
ID-2: /boot/efi raw-size: 300 MiB size: 299.4 MiB (99.80%)
used: 25.8 MiB (8.6%) fs: vfat dev: /dev/nvme2n1p1 maj-min: 259:4
ID-3: /home raw-size: 1.75 TiB size: 3.57 TiB (203.87%)
used: 1.29 TiB (36.1%) fs: btrfs dev: /dev/nvme2n1p2 maj-min: 259:5
ID-4: /var/log raw-size: 1.75 TiB size: 3.57 TiB (203.87%)
used: 1.29 TiB (36.1%) fs: btrfs dev: /dev/nvme2n1p2 maj-min: 259:5
ID-5: /var/tmp raw-size: 1.75 TiB size: 3.57 TiB (203.87%)
used: 1.29 TiB (36.1%) fs: btrfs dev: /dev/nvme2n1p2 maj-min: 259:5
Swap:
Kernel: swappiness: 133 (default 60) cache-pressure: 100 (default)
ID-1: swap-1 type: partition size: 69.06 GiB used: 0 KiB (0.0%)
priority: -2 dev: /dev/nvme2n1p3 maj-min: 259:6
ID-2: swap-2 type: zram size: 62.79 GiB used: 226.5 MiB (0.4%)
priority: 100 dev: /dev/zram0
Sensors:
System Temperatures: cpu: 37.0 C mobo: 38.0 C
Fan Speeds (RPM): fan-1: 0 fan-2: 1284 fan-3: 0 fan-4: 740 fan-5: 0
fan-6: 0 fan-7: 1525
Info:
Processes: 594 Uptime: 34m wakeups: 0 Memory: 62.79 GiB
used: 7.81 GiB (12.4%) Init: systemd v: 250 tool: systemctl Compilers:
gcc: 11.2.0 clang: 13.0.1 Packages: 2177 pacman: 2163 lib: 593 flatpak: 6
snap: 8 Shell: fish v: 3.3.1 default: Bash v: 5.1.16 running-in: konsole
inxi: 3.3.12
Garuda (2.5.4-2):
System install date:     2022-01-09
Last full system update: 2022-02-16
Is partially upgraded:   No
Relevant software:       NetworkManager
Windows dual boot:       Probably (Run as root to verify)
Snapshots:               Snapper
Failed units:            dev-binderfs.mount anbox-container-manager.service foldingathome.service

This is NOT the AMDGPU Integrated Vega 10 grafx, correct? It is a stand alone video card, correct?

Yes, sorry, I forgot to mention that. This is a discrete Vega 64 GPU (there is a Ryzen CPU in the system, but it is not an APU, only a CPU). It seems the integrated GPUs run more smoothly with the driver...

It's a vega LTS kernel thing. Off and on LTS kernel would cause all kinds of issues with my vega card. Try the zen kernel.

sudo pacman -S linux-zen linux-zen-headers

Then update grub if you have to. Not sure if garuda uses OS prober.
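
After the reboot it is worth checking which kernel actually booted. A rough sketch; kernel_flavor is a made-up helper, and the "-zen" suffix is an assumption about how the Arch zen kernel names its release (the "-lts" suffix matches the inxi output above).

```shell
#!/bin/sh
# Classify a kernel release string (as printed by `uname -r`) by its suffix.
kernel_flavor() {
    case $1 in
        *-lts) echo lts ;;
        *-zen) echo zen ;;
        *)     echo other ;;
    esac
}

# Regenerate the GRUB menu if needed, reboot, then check:
#   sudo grub-mkconfig -o /boot/grub/grub.cfg
#   kernel_flavor "$(uname -r)"
```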

Also do not use amdvlk.

Use:

https://archlinux.org/packages/extra/x86_64/vulkan-radeon/
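
To see which Vulkan drivers are actually installed side by side, you can list the ICD manifest files. A small sketch; list_vulkan_icds is a made-up helper, and /usr/share/vulkan/icd.d is assumed to be where the packages put their manifests.

```shell
#!/bin/sh
# List Vulkan ICD manifest files in a directory
# (default: the usual system-wide location, an assumption).
list_vulkan_icds() {
    dir=${1:-/usr/share/vulkan/icd.d}
    for f in "$dir"/*.json; do
        # Skip the unexpanded pattern when the directory has no manifests.
        if [ -e "$f" ]; then
            basename "$f"
        fi
    done
    return 0
}

# Usage:
#   list_vulkan_icds
```

If both drivers show up, the Vulkan loader's VK_ICD_FILENAMES environment variable can point a single program at one of the listed manifests instead of uninstalling the other driver.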

Well I have an integrated Vega 8 on another machine and what I can say is when the system is running there is no issue at all, you are correct. I do have major issues with suspend to RAM feature. Almost 2 months into troubleshooting and I cannot make this stable, it fails mostly on the wake-up end but in some situations on the suspend end too. The frustrating part is that with a Garuda ISO from Jan 2021 all works fine, but somewhere along 2021 things went south for this.

Enough about my issue, this is your thread. :smiley: I just wanted to share my experience with the AMD APU to get a perspective. Let's see how your topic goes and gets fixed, maybe some of it could apply for the APU as well. This is my 1st attempt with AMD GPUs so I know almost nothing. :frowning:

I tested with LTS, Zen and Mainline; it did not make a difference for me. I usually use the Zen kernel as my daily driver.

I could not find a difference between amdvlk and vulkan-radeon regarding stability. Why do you suggest to use vulkan-radeon?

Here's what I can tell you. The system I ran the vega 56 on is all AMD, with an 1800x cpu. I used both drivers but experienced way fewer issues with games using vulkan-radeon than using amdvlk.

As for steam this is the only package I use for that. it has always worked best for me (current system is a 5800x and rx 6700xt) Arch Linux - steam 1.0.0.74-1 (x86_64)

So you blacklist nouveau driver, uninstall AMDGPU driver and install vulkan-radeon?

Did you try other distros, including debian-based?

If you are talking to me no? Not sure why I would do that? I simply install steam and then the prompted driver choice as well as lib32 library.

Ok gotcha, I was under the impression it needed to be done independent from steam.

No, not yet.

Ok, I'll keep that in mind, thanks. I have tried it for a few minutes now and vulkan-radeon seems to work well.

Hm.. since today's updates the crashes are back, and now also on the desktop... :frowning:
