Random NVIDIA driver crashes (not on Windows)

OS: Currently Garuda. I had the same problem on Manjaro, but not on Windows.
It's not my RAM, CPU, or motherboard: I recently upgraded from DDR3 8+4 GB to DDR4 3800 8 GB (with a new motherboard and CPU), and the crashes were already happening before that.
This happens when doing anything heavy, but the timing is extremely random.
It's probably a kernel panic. I found a PSA on r/archlinux about the drivers having a bug in version 460, but I don't know whether it has been fixed since.

Here's a list of when it happened:

  • Steam started making Vulkan shaders
  • Blender rendered the ~108th frame
  • Blender (same scene) rendered the ~508th frame
  • I opened the settings in Dirt: Showdown
  • I launched Titanfall 2

System:
Kernel: 5.16.0-zen1-1-zen x86_64 bits: 64 compiler: gcc v: 11.1.0
parameters: BOOT_IMAGE=/@/boot/vmlinuz-linux-zen
root=UUID=fb8f85bb-3d42-41a7-8914-5df74c2745f6 rw rootflags=subvol=@
quiet splash rd.udev.log_priority=3 vt.global_cursor_default=0 loglevel=3
Desktop: KDE Plasma 5.23.5 tk: Qt 5.15.2 info: latte-dock wm: kwin_x11
vt: 1 dm: SDDM Distro: Garuda Linux base: Arch Linux
Machine:
Type: Desktop Mobo: ASUSTeK model: TUF B450-PLUS GAMING v: Rev X.0x
serial: <superuser required> UEFI: American Megatrends v: 3002
date: 03/11/2021
CPU:
Info: model: AMD Ryzen 7 2700 bits: 64 type: MT MCP arch: Zen+
family: 0x17 (23) model-id: 8 stepping: 2 microcode: 0x800820D
Topology: cpus: 1x cores: 8 tpc: 2 threads: 16 smt: enabled cache:
L1: 768 KiB desc: d-8x32 KiB; i-8x64 KiB L2: 4 MiB desc: 8x512 KiB
L3: 16 MiB desc: 2x8 MiB
Speed (MHz): avg: 3088 high: 3399 min/max: 1550/3400 boost: disabled
scaling: driver: acpi-cpufreq governor: performance cores: 1: 3191 2: 3171
3: 3214 4: 3264 5: 3064 6: 2888 7: 2749 8: 3112 9: 3399 10: 3280 11: 3372
12: 3094 13: 3180 14: 2976 15: 2734 16: 2734 bogomips: 108795
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Vulnerabilities:
Type: itlb_multihit status: Not affected
Type: l1tf status: Not affected
Type: mds status: Not affected
Type: meltdown status: Not affected
Type: spec_store_bypass
mitigation: Speculative Store Bypass disabled via prctl
Type: spectre_v1
mitigation: usercopy/swapgs barriers and __user pointer sanitization
Type: spectre_v2 mitigation: Full AMD retpoline, IBPB: conditional,
STIBP: disabled, RSB filling
Type: srbds status: Not affected
Type: tsx_async_abort status: Not affected
Graphics:
Device-1: NVIDIA GP107 [GeForce GTX 1050 Ti] vendor: Micro-Star MSI
driver: nvidia v: 495.46 alternate: nouveau,nvidia_drm bus-ID: 08:00.0
chip-ID: 10de:1c82 class-ID: 0300
Display: x11 server: X.Org 1.21.1.3 compositor: kwin_x11 driver:
loaded: nvidia unloaded: modesetting alternate: fbdev,nouveau,nv,vesa
display-ID: :0 screens: 1
Screen-1: 0 s-res: 1366x768 s-dpi: 68 s-size: 510x291mm (20.1x11.5")
s-diag: 587mm (23.1")
Monitor-1: HDMI-0 res: 1366x768 hz: 60 dpi: 68
size: 509x286mm (20.0x11.3") diag: 584mm (23")
OpenGL: renderer: NVIDIA GeForce GTX 1050 Ti/PCIe/SSE2
v: 4.6.0 NVIDIA 495.46 direct render: Yes
Audio:
Device-1: NVIDIA GP107GL High Definition Audio vendor: Micro-Star MSI
driver: snd_hda_intel v: kernel bus-ID: 08:00.1 chip-ID: 10de:0fb9
class-ID: 0403
Device-2: AMD Family 17h HD Audio vendor: ASUSTeK driver: snd_hda_intel
v: kernel bus-ID: 0a:00.3 chip-ID: 1022:1457 class-ID: 0403
Device-3: C-Media CM108 Audio Controller type: USB
driver: hid-generic,snd-usb-audio,usbhid bus-ID: 1-2:2 chip-ID: 0d8c:013c
class-ID: 0300
Sound Server-1: ALSA v: k5.16.0-zen1-1-zen running: yes
Sound Server-2: JACK v: 1.9.20 running: no
Sound Server-3: PulseAudio v: 15.0 running: no
Sound Server-4: PipeWire v: 0.3.43 running: yes
Network:
Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet
vendor: ASUSTeK PRIME B450M-A driver: r8169 v: kernel port: f000
bus-ID: 03:00.0 chip-ID: 10ec:8168 class-ID: 0200
IF: enp3s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
Drives:
Local Storage: total: 2.82 TiB used: 921.19 GiB (31.9%)
SMART Message: Unable to run smartctl. Root privileges required.
ID-1: /dev/sda maj-min: 8:0 vendor: Patriot model: Burst size: 447.13 GiB
block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s type: SSD
serial: <filter> rev: BA.3 scheme: GPT
ID-2: /dev/sdb maj-min: 8:16 vendor: Seagate model: ST1000DM010-2EP102
size: 931.51 GiB block-size: physical: 4096 B logical: 512 B
speed: 6.0 Gb/s type: HDD rpm: 7200 serial: <filter> rev: CC43
scheme: MBR
ID-3: /dev/sdc maj-min: 8:32 vendor: Crucial model: CT120BX500SSD1
size: 111.79 GiB block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s
type: SSD serial: <filter> rev: R013 scheme: GPT
ID-4: /dev/sdd maj-min: 8:48 vendor: Western Digital
model: WD5000AVVS-63M8B0 size: 465.76 GiB block-size: physical: 512 B
logical: 512 B speed: 3.0 Gb/s type: N/A serial: <filter> rev: 0A01
ID-5: /dev/sde maj-min: 8:64 vendor: Seagate model: ST1000LM035-1RK172
size: 931.51 GiB block-size: physical: 4096 B logical: 512 B
speed: 6.0 Gb/s type: HDD rpm: 5400 serial: <filter> rev: SDM2
scheme: GPT
Partition:
ID-1: / raw-size: 446.83 GiB size: 446.83 GiB (100.00%)
used: 36.74 GiB (8.2%) fs: btrfs dev: /dev/sda2 maj-min: 8:2
ID-2: /boot/efi raw-size: 300 MiB size: 299.4 MiB (99.80%)
used: 576 KiB (0.2%) fs: vfat dev: /dev/sda1 maj-min: 8:1
ID-3: /home raw-size: 446.83 GiB size: 446.83 GiB (100.00%)
used: 36.74 GiB (8.2%) fs: btrfs dev: /dev/sda2 maj-min: 8:2
ID-4: /var/log raw-size: 446.83 GiB size: 446.83 GiB (100.00%)
used: 36.74 GiB (8.2%) fs: btrfs dev: /dev/sda2 maj-min: 8:2
ID-5: /var/tmp raw-size: 446.83 GiB size: 446.83 GiB (100.00%)
used: 36.74 GiB (8.2%) fs: btrfs dev: /dev/sda2 maj-min: 8:2
Swap:
Kernel: swappiness: 133 (default 60) cache-pressure: 100 (default)
ID-1: swap-1 type: zram size: 7.73 GiB used: 0 KiB (0.0%) priority: 100
dev: /dev/zram0
Sensors:
System Temperatures: cpu: N/A mobo: N/A gpu: nvidia temp: 47 C
Fan Speeds (RPM): N/A gpu: nvidia fan: 0%
Info:
Processes: 369 Uptime: 7m wakeups: 0 Memory: 7.73 GiB
used: 3.17 GiB (41.0%) Init: systemd v: 250 tool: systemctl Compilers:
gcc: 11.1.0 clang: 13.0.0 Packages: pacman: 1936 lib: 543 Shell: fish
v: 3.3.1 default: Bash v: 5.1.16 running-in: konsole inxi: 3.3.11

Journal log since install

There is a new BIOS for your motherboard that "improves system performance" (and a version in between that you also missed, which "improves system stability"), so grab the latest BIOS for sure! TUF B450-PLUS GAMING | Motherboards | ASUS Global

Other than that, I would start monitoring the GPU heat and the 12 V rail under heavy loads. If testing those doesn't produce any abnormal readings, I would look into testing the video RAM. A memtest pass on the regular RAM won't hurt either, to rule that out. Even if parts are new, they aren't immune to imperfections or errors right out of the gate.
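
For the monitoring part, here is a minimal sketch of a logger that polls `nvidia-smi` once a second and appends the readings to a CSV file, so the last values before a crash survive the reboot. The function names, the `gpu_log.csv` file name, and the one-second interval are my own assumptions, not anything from the driver. Note that `nvidia-smi` reports the GPU's own power draw, not the 12 V rail itself (that would have to come from the motherboard sensors, if they are exposed at all), and some cards report `[N/A]` for power.

```python
import csv
import io
import os
import subprocess
import time

# Query names accepted by `nvidia-smi --query-gpu=...`
# (the full list is printed by `nvidia-smi --help-query-gpu`).
FIELDS = ["timestamp", "temperature.gpu", "power.draw", "fan.speed", "utilization.gpu"]


def parse_sample(line):
    """Turn one CSV line printed by nvidia-smi into a dict keyed by field name."""
    values = next(csv.reader(io.StringIO(line), skipinitialspace=True))
    return dict(zip(FIELDS, values))


def read_sample():
    """Ask nvidia-smi for a single reading of GPU 0 (assumes one NVIDIA card)."""
    out = subprocess.run(
        ["nvidia-smi", "--id=0",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.strip().splitlines()[0])


def monitor(path="gpu_log.csv", interval=1.0):
    """Append one reading to `path` every `interval` seconds until interrupted."""
    with open(path, "a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        while True:
            writer.writerow(read_sample())
            # Push the row to disk right away, so it is not lost in a
            # buffer if the machine freezes on the next frame.
            fh.flush()
            os.fsync(fh.fileno())
            time.sleep(interval)
```

Calling `monitor()` in a terminal before starting a render or a game, and reading the tail of the file after the next crash, should show whether temperature or power draw was climbing right before it happened.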

Things work differently under vastly different loads (programs and OSes), so even if the hardware seems fine under certain loads, you may be surprised to find that it is still the culprit.

BTW that journal log URL is giving me a 404 web error.

There is a new BIOS for your motherboard that “improves system performance” (and a version in between that you also missed, which “improves system stability”), so grab the latest BIOS for sure! TUF B450-PLUS GAMING | Motherboards | ASUS Global

I’ll for sure do that as soon as I get home. However, I was already getting this behavior when I had my old motherboard with an Athlon rather than a Ryzen.

Other than that, I would start monitoring the GPU heat and the 12 V rail under heavy loads. If testing those doesn’t produce any abnormal readings, I would look into testing the video RAM.

I doubt that it’s the GPU alone, since all of the things that triggered the crash (the renders, games, etc.) worked on Windows.

A memtest pass on the regular RAM won’t hurt either, to rule that out. Even if parts are new, they aren’t immune to imperfections or errors right out of the gate.

I was getting the same behavior before I upgraded my RAM from DDR3 to DDR4 (along with a new motherboard and CPU).

Things work differently under vastly different loads (programs and OSes), so even if the hardware seems fine under certain loads, you may be surprised to find that it is still the culprit.

Considering that it has only happened on Linux so far, I doubt it’s the OS itself. I suspect some part of the drivers, or something close to them.

BTW that journal log URL is giving me a 404 web error.

The file is 20 MiB and contains all the logs since my install (Sunday the 16th), so I’m going to have to find a better hosting service (this forum only accepts images as attachments).

I updated the BIOS, but I didn't render anything today, so I don't know if it has been fixed.

I also uploaded the first 5000 lines of the log to paste.ee.
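
Since the first 5000 lines of a 20 MiB journal may not include the crash itself, a keyword filter could be more useful than a head cut. Below is a small sketch that keeps only the lines that look related to the GPU driver or a kernel fault. The pattern list and the function names are my own guesses at what is relevant, not an official list; `NVRM` is the prefix the proprietary NVIDIA kernel module uses in the kernel log.

```python
import re

# Keywords that tend to show up around a GPU driver crash. This list is an
# assumption for illustration; extend it with whatever the log actually shows.
PATTERN = re.compile(r"NVRM|nvidia|Xid|kernel panic|Oops|segfault", re.IGNORECASE)


def relevant_lines(lines):
    """Return only the log lines that match one of the crash-related keywords."""
    return [line for line in lines if PATTERN.search(line)]


def filter_log(src, dst):
    """Copy the matching lines from file `src` to file `dst`; return how many."""
    with open(src, errors="replace") as fin:
        kept = relevant_lines(fin)
    with open(dst, "w") as fout:
        fout.writelines(kept)
    return len(kept)
```

Assuming the journal is persistent, the kernel messages of the boot that crashed can be exported with `journalctl -k -b -1 > kernel.log`, and `filter_log("kernel.log", "gpu.log")` would then give a file small enough for any paste service.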

Just a quick update: I've updated my BIOS, and the problem is gone.
I don't know why or how, but the update fixed it.

Psst~. Here we mark the post containing the suggestion that worked as the solution, not the post reporting that it was applied :wink:

Done, correct?
:slight_smile:
