Kernel Locks up Due to AMDGPU

Hi All

This is an update to my previous post regarding Garuda hanging. It appears that this is related to the driver for my video card (or the kernel). I am not sure of the actual issue, but from my quick testing I do know that this works with an older kernel. The Ubuntu test was with kernel 5.8. Below is the log information (thank you @petsam). I can tell the obvious from the log: there is an issue with the kernel/amdgpu. My obvious question is, what is the way out? Is it best to install another kernel and test, or perhaps go with the amdgpu-pro driver in the AUR? I would like to avoid the latter. Perhaps someone will have a simpler solution.


-- Journal begins at Sat 2021-03-13 11:18:44 EST, ends at Sat 2021-03-13 15:53:03 EST. --
Mar 13 12:42:09 altair lightdm[1442]: gkr-pam: unable to locate daemon control file
Mar 13 12:42:11 altair nmbd[1539]: [2021/03/13 12:42:11.374536,  0] ../../lib/util/become_daemon.c:135(daemon_ready)
Mar 13 12:42:11 altair nmbd[1539]:   daemon_ready: daemon 'nmbd' finished starting up and ready to serve connections
Mar 13 12:42:11 altair smbd[1541]: [2021/03/13 12:42:11.423338,  0] ../../lib/util/become_daemon.c:135(daemon_ready)
Mar 13 12:42:11 altair smbd[1541]:   daemon_ready: daemon 'smbd' finished starting up and ready to serve connections
Mar 13 12:42:40 altair pulseaudio[1721]: GetManagedObjects() failed: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Mar 13 13:00:47 altair kernel: amdgpu: SMU load firmware failed
Mar 13 13:00:47 altair kernel: amdgpu: fw load failed
Mar 13 13:00:47 altair kernel: amdgpu: smu firmware loading failed
Mar 13 13:00:47 altair kernel: amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_resume failed (-22).
Mar 13 13:00:47 altair kernel: amdgpu: Move buffer fallback to memcpy unavailable
Mar 13 13:00:47 altair kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -19!
Mar 13 13:00:47 altair kernel: amdgpu: Move buffer fallback to memcpy unavailable
Mar 13 13:00:47 altair kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -19!
Mar 13 13:00:47 altair kernel: snd_hda_intel 0000:01:00.1: CORB reset timeout#1, CORBRP = 0
Mar 13 13:00:57 altair kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=241, emitted seq=243
Mar 13 13:00:57 altair kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=601, emitted seq=603
Mar 13 13:00:57 altair kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Mar 13 13:00:57 altair kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Mar 13 13:00:57 altair kernel: BUG: kernel NULL pointer dereference, address: 0000000000000028
Mar 13 13:00:57 altair kernel: #PF: supervisor read access in kernel mode
Mar 13 13:00:57 altair kernel: #PF: error_code(0x0000) - not-present page
Mar 13 13:00:57 altair kernel: Oops: 0000 [#1] PREEMPT SMP PTI
Mar 13 13:00:57 altair kernel: CPU: 1 PID: 14043 Comm: kworker/1:0 Not tainted 5.11.5-zen1-1-zen #1
Mar 13 13:00:57 altair kernel: Hardware name: Supermicro C7Z170-M/C7Z170-M, BIOS 2.2 01/07/2019
Mar 13 13:00:57 altair kernel: Workqueue: events drm_sched_job_timedout [gpu_sched]
Mar 13 13:00:57 altair kernel: RIP: 0010:kernel_queue_uninit+0xd/0xf0 [amdgpu]
Mar 13 13:00:57 altair kernel: Code: 28 48 88 c0 e8 a4 83 01 d9 e9 78 fe ff ff e8 0a 36 66 d9 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10 48 89 fd <8b> 50 28 83 fa 02 74 78 83 fa 03 0f 84 b1 00 00 00 48 8b 7f 08 4c
Mar 13 13:00:57 altair kernel: RSP: 0018:ffffad174fe17d50 EFLAGS: 00010246
Mar 13 13:00:57 altair kernel: RAX: 0000000000000000 RBX: ffff93ac8ad75000 RCX: 0000000080800079
Mar 13 13:00:57 altair kernel: RDX: 000000008080007a RSI: 0000000000000001 RDI: ffff93ac8c27ad80
Mar 13 13:00:57 altair kernel: RBP: ffff93ac8c27ad80 R08: 0000000000000001 R09: 0000000000000001
Mar 13 13:00:57 altair kernel: R10: ffff93ac8c278040 R11: 0000000000000000 R12: ffff93ac8ad750d0
Mar 13 13:00:57 altair kernel: R13: ffff93ac8ad20000 R14: ffff93ac8142c000 R15: ffff93ac8142c0c8
Mar 13 13:00:57 altair kernel: FS:  0000000000000000(0000) GS:ffff93b3cec80000(0000) knlGS:0000000000000000
Mar 13 13:00:57 altair kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 13 13:00:57 altair kernel: CR2: 0000000000000028 CR3: 0000000248c10001 CR4: 00000000003706e0
Mar 13 13:00:57 altair kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 13 13:00:57 altair kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 13 13:00:57 altair kernel: Call Trace:
Mar 13 13:00:57 altair kernel:  stop_cpsch+0xa0/0xc0 [amdgpu]
Mar 13 13:00:57 altair kernel:  kgd2kfd_pre_reset+0x56/0x80 [amdgpu]
Mar 13 13:00:57 altair kernel:  amdgpu_device_gpu_recover.cold+0x36e/0x98a [amdgpu]
Mar 13 13:00:57 altair kernel:  amdgpu_job_timedout+0x121/0x140 [amdgpu]
Mar 13 13:00:57 altair kernel:  drm_sched_job_timedout+0x64/0xe0 [gpu_sched]
Mar 13 13:00:57 altair kernel:  process_one_work+0x214/0x3e0
Mar 13 13:00:57 altair kernel:  worker_thread+0x4d/0x470
Mar 13 13:00:57 altair kernel:  ? flush_delayed_work+0x40/0x40
Mar 13 13:00:57 altair kernel:  kthread+0x181/0x1b0
Mar 13 13:00:57 altair kernel:  ? __kthread_init_worker+0x50/0x50
Mar 13 13:00:57 altair kernel:  ret_from_fork+0x22/0x30
Mar 13 13:00:57 altair kernel: Modules linked in: ufs hfsplus hfs minix vfat msdos fat jfs xfs ext4 crc16 mbcache jbd2 dm_mod zram rfkill intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg irqbypass soundwire_intel crct10dif_pclmul crc32_pclmul soundwire_generic_allocation ghash_clmulni_intel soundwire_cadence snd_hda_codec snd_hda_core snd_hwdep soundwire_bus iTCO_wdt aesni_intel intel_pmc_bxt ee1004 iTCO_vendor_support crypto_simd cryptd glue_helper rapl snd_soc_core intel_cstate snd_compress e1000e intel_uncore psmouse ac97_bus snd_pcm_dmaengine i2c_i801 i2c_smbus joydev snd_pcm mousedev snd_timer snd soundcore intel_pch_thermal acpi_pad mac_hid uinput crypto_user fuse bpf_preload ip_tables x_tables usbhid btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq crc32c_intel serio_raw sr_mod cdrom xhci_pci xhci_pci_renesas nouveau mxm_wmi wmi radeon
Mar 13 13:00:57 altair kernel:  i915 video intel_agp intel_gtt amdgpu gpu_sched i2c_algo_bit drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm agpgart
Mar 13 13:00:57 altair kernel: CR2: 0000000000000028
Mar 13 13:00:57 altair kernel: RIP: 0010:kernel_queue_uninit+0xd/0xf0 [amdgpu]
Mar 13 13:00:57 altair kernel: Code: 28 48 88 c0 e8 a4 83 01 d9 e9 78 fe ff ff e8 0a 36 66 d9 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 10 48 89 fd <8b> 50 28 83 fa 02 74 78 83 fa 03 0f 84 b1 00 00 00 48 8b 7f 08 4c
Mar 13 13:00:57 altair kernel: RSP: 0018:ffffad174fe17d50 EFLAGS: 00010246
Mar 13 13:00:57 altair kernel: RAX: 0000000000000000 RBX: ffff93ac8ad75000 RCX: 0000000080800079
Mar 13 13:00:57 altair kernel: RDX: 000000008080007a RSI: 0000000000000001 RDI: ffff93ac8c27ad80
Mar 13 13:00:57 altair kernel: RBP: ffff93ac8c27ad80 R08: 0000000000000001 R09: 0000000000000001
Mar 13 13:00:57 altair kernel: R10: ffff93ac8c278040 R11: 0000000000000000 R12: ffff93ac8ad750d0
Mar 13 13:00:57 altair kernel: R13: ffff93ac8ad20000 R14: ffff93ac8142c000 R15: ffff93ac8142c0c8
Mar 13 13:00:57 altair kernel: FS:  0000000000000000(0000) GS:ffff93b3cec80000(0000) knlGS:0000000000000000
Mar 13 13:00:57 altair kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 13 13:00:57 altair kernel: CR2: 0000000000000028 CR3: 0000000248c10001 CR4: 00000000003706e0
Mar 13 13:00:57 altair kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 13 13:00:57 altair kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 13 13:01:08 altair kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=601, emitted seq=603
Mar 13 13:01:08 altair kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0

Yes. Try the "linux" kernel.
With very recent CPUs, such error messages are to be expected. The typical response is to report the issue upstream to the Linux kernel developers. They want/need this feedback.
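
As a rough sketch of what trying another kernel involves, the helper below only prints the commands one would typically run on an Arch-based system; it does not execute them. The function name `kernel_install_cmds` is made up for illustration, and the GRUB config path is an assumption (Garuda may regenerate the boot menu on its own after a kernel install).

```shell
# Print, without running them, the commands for installing a kernel package
# and its headers, then regenerating the GRUB menu.
# kernel_install_cmds is a hypothetical helper; the GRUB path is an assumption.
kernel_install_cmds() {
    printf 'sudo pacman -S %s %s-headers\n' "$1" "$1"
    printf 'sudo grub-mkconfig -o /boot/grub/grub.cfg\n'
}

# Show the commands for the stock kernel; pass linux-lts for the LTS one.
kernel_install_cmds linux
```

After a reboot, the new kernel should appear as a separate entry in the boot menu, so the existing kernel stays available as a fallback.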

What matters (IMHO) is whether there are any noticeable symptoms apart from the journal/dmesg messages.

Trying the PRO driver is your decision. It is not commonly used, so finding help with it is a matter of luck.

Please post your system info for reference, and to give users with the same hardware a chance to share their experience.

inxi -Fxxxza

It seems the logs are from -b -p3, but it is better to know for sure :wink:
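
If the log has to be collected again, a minimal sketch of narrowing journal text down to the GPU-related kernel lines could look like this. It assumes the default journalctl line format shown in the log above; `filter_gpu_errors` is a made-up helper name, and the journalctl call is left as a comment so the snippet has no side effects.

```shell
# Keep only kernel lines that mention amdgpu or drm from journal text on stdin.
# filter_gpu_errors is a hypothetical helper name, not part of any tool.
filter_gpu_errors() {
    grep -E ' kernel: .*(amdgpu|drm)'
}

# Example use on a systemd machine (commented out on purpose):
#   journalctl -b -1 -p 3 --no-pager | filter_gpu_errors
# -b -1 selects the previous boot, which helps after a hard lock-up,
# and -p 3 limits the output to priority "err" and above.
```

Because the helper reads from stdin, it can also be run against a saved copy of the log.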

Below is the output, to make the record complete (taken with the linux-lts kernel). I did a quick test with "linux-lts" and it is running fine; at least, this is the longest that the system has stayed up. I will report back if the problem returns, but for now it appears to be solved with the new kernel and the open source drivers.


**System: Kernel:** 5.10.23-1-lts x86_64 **bits:** 64 **compiler:** gcc **v:** 10.2.1

**parameters:** BOOT_IMAGE=/@/boot/vmlinuz-linux-lts root=UUID=09e05778-9bb6-4e2c-82e2-22aa12c039a9 rw

[email protected] quiet splash rd.udev.log_priority=3 vt.global_cursor_default=0

systemd.unified_cgroup_hierarchy=1 resume=UUID=9a0f260e-c59e-4b32-905b-3d182ace31e0 loglevel=3

**Console:** tty 1 **DM:** LightDM 1.30.0 **Distro:** Garuda Linux

**Machine: Type:** Server **System:** Supermicro **product:** C7Z170-M **v:** 0123456789 **serial:** <filter> **Chassis:** **type:** 17

**v:** 0123456789 **serial:** <filter>

**Mobo:** Supermicro **model:** C7Z170-M **v:** 1.01 **serial:** <filter> **UEFI [Legacy]:** American Megatrends **v:** 2.2

**date:** 01/07/2019

**CPU: Info:** Quad Core **model:** Intel Core i5-6600K **bits:** 64 **type:** MCP **arch:** Skylake-S **family:** 6

**model-id:** 5E (94) **stepping:** 3 **microcode:** E2 **L2 cache:** 6 MiB

**flags:** avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx **bogomips:** 27999

**Speed:** 3866 MHz **min/max:** 800/3900 MHz **Core speeds (MHz):** **1:** 3866 **2:** 3821 **3:** 3807 **4:** 3822

**Vulnerabilities:** **Type:** itlb_multihit **status:** KVM: VMX disabled

**Type:** l1tf **mitigation:** PTE Inversion; VMX: conditional cache flushes, SMT disabled

**Type:** mds **mitigation:** Clear CPU buffers; SMT disabled

**Type:** meltdown **mitigation:** PTI

**Type:** spec_store_bypass **mitigation:** Speculative Store Bypass disabled via prctl and seccomp

**Type:** spectre_v1 **mitigation:** usercopy/swapgs barriers and __user pointer sanitization

**Type:** spectre_v2

**mitigation:** Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: disabled, RSB filling

**Type:** srbds **mitigation:** Microcode

**Type:** tsx_async_abort **mitigation:** Clear CPU buffers; SMT disabled

**Graphics: Device-1:** Advanced Micro Devices [AMD/ATI] Ellesmere [Radeon Pro WX 5100] **driver:** amdgpu **v:** kernel

**bus ID:** 01:00.0 **chip ID:** 1002:67c7 **class ID:** 0300

**Display:** **server:** X.org 1.20.10 **compositor:** picom **v:** git-60eb0 **driver:** **loaded:** amdgpu,ati

**unloaded:** modesetting **alternate:** fbdev,vesa **tty:** 116x26

**Message:** Advanced graphics data unavailable in console. Try -G --display

**Audio: Device-1:** Intel 100 Series/C230 Series Family HD Audio **vendor:** Super Micro **driver:** snd_hda_intel

**v:** kernel **bus ID:** 00:1f.3 **chip ID:** 8086:a170 **class ID:** 0403

**Device-2:** AMD Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590] **driver:** snd_hda_intel **v:** kernel

**bus ID:** 01:00.1 **chip ID:** 1002:aaf0 **class ID:** 0403

**Sound Server:** ALSA **v:** k5.10.23-1-lts

**Network: Device-1:** Intel Ethernet I219-V **vendor:** Super Micro **driver:** e1000e **v:** kernel **port:** f000

**bus ID:** 00:1f.6 **chip ID:** 8086:15b8 **class ID:** 0200

**IF:** eno1 **state:** up **speed:** 1000 Mbps **duplex:** full **mac:** <filter>

**Drives: Local Storage:** **total:** 1.82 TiB **used:** 12.14 GiB (0.7%)

**SMART Message:** Required tool smartctl not installed. Check --recommends

**ID-1:** /dev/sda **maj-min:** 8:0 **vendor:** Western Digital **model:** WDS200T2B0A-00SM50 **size:** 1.82 TiB

**block size:** **physical:** 512 B **logical:** 512 B **speed:** 6.0 Gb/s **rotation:** SSD **serial:** <filter> **rev:** 30WD

**scheme:** MBR

**Partition: ID-1:** / **raw size:** 1.79 TiB **size:** 1.79 TiB (100.00%) **used:** 12.14 GiB (0.7%) **fs:** btrfs **dev:** /dev/sda1

**maj-min:** 8:1

**ID-2:** /home **raw size:** 1.79 TiB **size:** 1.79 TiB (100.00%) **used:** 12.14 GiB (0.7%) **fs:** btrfs

**dev:** /dev/sda1 **maj-min:** 8:1

**ID-3:** /var/log **raw size:** 1.79 TiB **size:** 1.79 TiB (100.00%) **used:** 12.14 GiB (0.7%) **fs:** btrfs

**dev:** /dev/sda1 **maj-min:** 8:1

**ID-4:** /var/tmp **raw size:** 1.79 TiB **size:** 1.79 TiB (100.00%) **used:** 12.14 GiB (0.7%) **fs:** btrfs

**dev:** /dev/sda1 **maj-min:** 8:1

**Swap: Kernel:** **swappiness:** 10 (default 60) **cache pressure:** 75 (default 100)

**ID-1:** swap-1 **type:** partition **size:** 34.45 GiB **used:** 0 KiB (0.0%) **priority:** -2 **dev:** /dev/sda2

**maj-min:** 8:2

**ID-2:** swap-2 **type:** zram **size:** 7.83 GiB **used:** 0 KiB (0.0%) **priority:** 32767 **dev:** /dev/zram0

**ID-3:** swap-3 **type:** zram **size:** 7.83 GiB **used:** 0 KiB (0.0%) **priority:** 32767 **dev:** /dev/zram1

**ID-4:** swap-4 **type:** zram **size:** 7.83 GiB **used:** 0 KiB (0.0%) **priority:** 32767 **dev:** /dev/zram2

**ID-5:** swap-5 **type:** zram **size:** 7.83 GiB **used:** 0 KiB (0.0%) **priority:** 32767 **dev:** /dev/zram3

**Sensors: System Temperatures:** **cpu:** 36.0 C **mobo:** 29.8 C **gpu:** amdgpu **temp:** 34.0 C

**Fan Speeds (RPM):** N/A **gpu:** amdgpu **fan:** 995

**Info: Processes:** 201 **Uptime:** 38m **wakeups:** 0 **Memory:** 31.32 GiB **used:** 1.28 GiB (4.1%) **Init:** systemd **v:** 247

**Compilers:** **gcc:** 10.2.0 **clang:** 11.1.0 **Packages:** **pacman:** 1452 **lib:** 340 **Shell:** Zsh **v:** 5.8

**running in:** tty 1 (SSH) **inxi:** 3.3.01

It seems to be GCN 4th gen, which should work fine with amdgpu.
https://wiki.archlinux.org/index.php/AMDGPU#Selecting_the_right_driver
https://wiki.archlinux.org/index.php/Xorg#AMD
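
To confirm which kernel driver is actually bound to the card, something along these lines may help. It is only a sketch: `bound_driver` is a made-up helper name that picks the "Kernel driver in use" line out of `lspci -k` text, and the lspci call is commented out so nothing runs against the hardware here.

```shell
# Print the kernel driver reported by `lspci -k` text read from stdin.
# bound_driver is a hypothetical helper name used only for this sketch.
bound_driver() {
    sed -n 's/^[[:space:]]*Kernel driver in use: //p'
}

# Example use (commented out; 1002:67c7 is the chip ID from the inxi output):
#   lspci -k -d 1002:67c7 | bound_driver
```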

It seems so, but no luck for me. I read on Phoronix that there were many changes to amdgpu in kernel 5.11. I am guessing that something in those changes caused my trouble. The link is below.

Thanks for all the help,

AMD Radeon Graphics Updates For Linux 5.11
