Hi! I’ve posted about this issue here before, but it has returned after a hiatus, so I figured I’d share what I’ve got and see if we can work out an answer together.
garuda-inxi:
System:
Kernel: 6.8.5-zen1-1-zen arch: x86_64 bits: 64 compiler: gcc v: 13.2.1
clocksource: tsc avail: acpi_pm
parameters: BOOT_IMAGE=/@/boot/vmlinuz-linux-zen
root=UUID=ff218736-6729-4d49-a2d2-0dc7487c0fc3 rw rootflags=subvol=@
rd.udev.log_priority=3 vt.global_cursor_default=0 loglevel=3
mem_sleep_default=s2idle ibt=off
Desktop: KDE Plasma v: 6.0.3 tk: Qt v: N/A info: frameworks v: 6.1.0
wm: kwin_x11 vt: 2 dm: SDDM Distro: Garuda base: Arch Linux
Machine:
Type: Desktop System: ASUS product: N/A v: N/A serial: <superuser required>
Mobo: ASUSTeK model: ROG STRIX B760-G GAMING WIFI v: Rev 1.xx
serial: <superuser required> part-nu: SKU uuid: <superuser required>
UEFI: American Megatrends v: 1210 date: 07/14/2023
CPU:
Info: model: 13th Gen Intel Core i9-13900K bits: 64 type: MST AMCP
arch: Raptor Lake gen: core 13 level: v3 note: check built: 2022+
process: Intel 7 (10nm) family: 6 model-id: 0xB7 (183) stepping: 1
microcode: 0x122
Topology: cpus: 1x cores: 24 mt: 8 tpc: 2 st: 16 threads: 32 smt: enabled
cache: L1: 2.1 MiB desc: d-16x32 KiB, 8x48 KiB; i-8x32 KiB, 16x64 KiB
L2: 32 MiB desc: 8x2 MiB, 4x4 MiB L3: 36 MiB desc: 1x36 MiB
Speed (MHz): avg: 882 high: 1100 min/max: 800/5500:5800:4300 scaling:
driver: intel_pstate governor: powersave cores: 1: 1100 2: 800 3: 1100
4: 800 5: 1100 6: 800 7: 1100 8: 800 9: 1100 10: 1100 11: 1100 12: 800
13: 1028 14: 800 15: 1100 16: 800 17: 800 18: 800 19: 800 20: 800 21: 800
22: 800 23: 800 24: 800 25: 800 26: 800 27: 800 28: 800 29: 800 30: 800
31: 800 32: 800 bogomips: 191692
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Vulnerabilities: <filter>
Graphics:
Device-1: NVIDIA AD104 [GeForce RTX 4070 Ti] vendor: Gigabyte driver: nvidia
v: 550.67 alternate: nouveau,nvidia_drm non-free: 550.xx+
status: current (as of 2024-04) arch: Lovelace code: AD1xx
process: TSMC n4 (5nm) built: 2022+ pcie: gen: 4 speed: 16 GT/s lanes: 16
ports: active: none off: DP-1 empty: DP-2,DP-3,HDMI-A-1 bus-ID: 01:00.0
chip-ID: 10de:2782 class-ID: 0300
Display: x11 server: X.Org v: 21.1.13 with: Xwayland v: 23.2.6
compositor: kwin_x11 driver: X: loaded: nvidia unloaded: modesetting,nouveau
alternate: fbdev,nv,vesa gpu: nvidia,nvidia-nvswitch display-ID: :0
screens: 1
Screen-1: 0 s-res: 2560x1080 s-dpi: 81 s-size: 803x343mm (31.61x13.50")
s-diag: 873mm (34.38")
Monitor-1: DP-1 mapped: DP-0 note: disabled model: LG (GoldStar) HDR WFHD
serial: <filter> built: 2020 res: 2560x1080 hz: 60 dpi: 81 gamma: 1.2
size: 798x334mm (31.42x13.15") diag: 869mm (34.2") modes: max: 2560x1080
min: 640x480
API: EGL v: 1.5 hw: drv: nvidia platforms: device: 0 drv: nvidia device: 2
drv: swrast gbm: drv: nvidia surfaceless: drv: nvidia x11: drv: nvidia
inactive: wayland,device-1
API: OpenGL v: 4.6.0 compat-v: 4.5 vendor: nvidia mesa v: 550.67
glx-v: 1.4 direct-render: yes renderer: NVIDIA GeForce RTX 4070 Ti/PCIe/SSE2
memory: 11.71 GiB
API: Vulkan v: 1.3.279 layers: 10 device: 0 type: discrete-gpu name: NVIDIA
GeForce RTX 4070 Ti driver: nvidia v: 550.67 device-ID: 10de:2782
surfaces: xcb,xlib device: 1 type: cpu name: llvmpipe (LLVM 17.0.6 256
bits) driver: mesa llvmpipe v: 24.0.5-arch1.1 (LLVM 17.0.6)
device-ID: 10005:0000 surfaces: xcb,xlib
Audio:
Device-1: Intel Raptor Lake High Definition Audio vendor: ASUSTeK
driver: snd_hda_intel v: kernel alternate: snd_sof_pci_intel_tgl
bus-ID: 00:1f.3 chip-ID: 8086:7a50 class-ID: 0403
Device-2: NVIDIA vendor: Gigabyte driver: snd_hda_intel v: kernel pcie:
gen: 4 speed: 16 GT/s lanes: 16 bus-ID: 01:00.1 chip-ID: 10de:22bc
class-ID: 0403
Device-3: Blue Microphones Yeti Stereo Microphone
driver: hid-generic,snd-usb-audio,usbhid type: USB rev: 1.1 speed: 12 Mb/s
lanes: 1 mode: 1.1 bus-ID: 1-7:5 chip-ID: b58e:9e84 class-ID: 0300
serial: <filter>
API: ALSA v: k6.8.5-zen1-1-zen status: kernel-api with: aoss
type: oss-emulator tools: N/A
Server-1: PipeWire v: 1.0.5 status: active with: 1: pipewire-pulse
status: active 2: wireplumber status: active 3: pipewire-alsa type: plugin
4: pw-jack type: plugin tools: pactl,pw-cat,pw-cli,wpctl
Network:
Device-1: Intel Raptor Lake-S PCH CNVi WiFi driver: iwlwifi v: kernel
bus-ID: 00:14.3 chip-ID: 8086:7a70 class-ID: 0280
IF: wlp0s20f3 state: up mac: <filter>
Device-2: Intel Ethernet I226-V vendor: ASUSTeK driver: igc v: kernel
pcie: gen: 2 speed: 5 GT/s lanes: 1 port: N/A bus-ID: 05:00.0
chip-ID: 8086:125c class-ID: 0200
IF: eno1 state: down mac: <filter>
IF-ID-1: wg0-mullvad state: unknown speed: N/A duplex: N/A mac: N/A
Info: services: NetworkManager, smbd, systemd-timesyncd, wpa_supplicant
Bluetooth:
Device-1: Intel AX211 Bluetooth driver: btusb v: 0.8 type: USB rev: 2.0
speed: 12 Mb/s lanes: 1 mode: 1.1 bus-ID: 1-14:8 chip-ID: 8087:0033
class-ID: e001
Report: btmgmt ID: hci0 rfk-id: 0 state: up address: <filter> bt-v: 5.3
lmp-v: 12 status: discoverable: no pairing: no class-ID: 6c0104
RAID:
Hardware-1: Intel Volume Management Device NVMe RAID Controller Intel
driver: vmd v: 0.6 port: N/A bus-ID: 00:0e.0 chip-ID: 8086:a77f rev:
class-ID: 0104
Drives:
Local Storage: total: 1.82 TiB used: 455.32 GiB (24.4%)
SMART Message: Unable to run smartctl. Root privileges required.
ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Samsung model: SSD 980 PRO with
Heatsink 2TB size: 1.82 TiB block-size: physical: 512 B logical: 512 B
speed: 63.2 Gb/s lanes: 4 tech: SSD serial: <filter> fw-rev: 5B2QGXA7
temp: 35.9 C scheme: GPT
Partition:
ID-1: / raw-size: 1.82 TiB size: 1.82 TiB (100.00%) used: 455.32 GiB (24.4%)
fs: btrfs dev: /dev/nvme0n1p2 maj-min: 259:2
ID-2: /boot/efi raw-size: 300 MiB size: 299.4 MiB (99.80%)
used: 584 KiB (0.2%) fs: vfat dev: /dev/nvme0n1p1 maj-min: 259:1
ID-3: /home raw-size: 1.82 TiB size: 1.82 TiB (100.00%)
used: 455.32 GiB (24.4%) fs: btrfs dev: /dev/nvme0n1p2 maj-min: 259:2
ID-4: /var/log raw-size: 1.82 TiB size: 1.82 TiB (100.00%)
used: 455.32 GiB (24.4%) fs: btrfs dev: /dev/nvme0n1p2 maj-min: 259:2
ID-5: /var/tmp raw-size: 1.82 TiB size: 1.82 TiB (100.00%)
used: 455.32 GiB (24.4%) fs: btrfs dev: /dev/nvme0n1p2 maj-min: 259:2
Swap:
Kernel: swappiness: 133 (default 60) cache-pressure: 100 (default) zswap: no
ID-1: swap-1 type: zram size: 62.54 GiB used: 11.8 MiB (0.0%)
priority: 100 comp: zstd avail: lzo,lzo-rle,lz4,lz4hc,842 max-streams: 32
dev: /dev/zram0
Sensors:
System Temperatures: cpu: 27.0 C mobo: N/A gpu: nvidia temp: 33 C
Fan Speeds (rpm): N/A gpu: nvidia fan: 0%
Info:
Memory: total: 64 GiB available: 62.54 GiB used: 6.8 GiB (10.9%)
Processes: 596 Power: uptime: 5m states: freeze,mem,disk suspend: s2idle
avail: deep wakeups: 0 hibernate: platform avail: shutdown, reboot,
suspend, test_resume image: 24.96 GiB services: org_kde_powerdevil,
power-profiles-daemon, upowerd Init: systemd v: 255 default: graphical
tool: systemctl
Packages: pm: pacman pkgs: 2169 libs: 597 tools: octopi,paru,yay
Compilers: clang: 17.0.6 gcc: 13.2.1 Shell: garuda-inxi default: Bash
v: 5.2.26 running-in: konsole inxi: 3.3.34
Garuda (2.6.25-1):
System install date: 2023-08-22
Last full system update: 2024-04-16 ↻
Is partially upgraded: No
Relevant software: snapper NetworkManager dracut nvidia-dkms
Windows dual boot: No/Undetected
Failed units:
Basically, after a recent update, the spinny-fan GPU thing started happening again. More specifically, the GPU fans start spinning really hard and, at the same time, the screen goes black. Unlike what I described in my original post, I lose control of the session: keyboard input is not read.
The GPU itself is not hot, and its power draw stays low. I’ve checked the cables and everything seems to be plugged in properly.
Earlier today I saw the old “GPU has fallen off the bus” error around crash time in journalctl, but I’m not seeing it for the last two crashes. I also sometimes see this:
pcieport 0000:00:01.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
I’ve searched for these errors to the best of my ability, and the general answers seem to point to either a physical hardware issue or power management. I’ve disabled ASPM and set PowerMizer to maximum performance. I also found the ReBar thing (an ASUS UEFI option) switched on once again, so I turned it off. I’m still getting crashes. Because of the crashes, every change mentioned here has been followed by at least one reboot, and probably many more.
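To double-check that the ASPM change actually took effect on the kernel side, something like the following could read the active policy. This is only a minimal sketch: it assumes the kernel exposes `/sys/module/pcie_aspm/parameters/policy` with the active entry in square brackets, and `active_aspm_policy` is a helper name made up for illustration.

```python
import re
from pathlib import Path

# Assumption: the kernel lists the ASPM policies in this file and wraps the
# active one in square brackets, e.g. "default [performance] powersave".
POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(text):
    """Return the bracketed (active) policy name, or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    if POLICY_PATH.exists():
        print("active ASPM policy:", active_aspm_policy(POLICY_PATH.read_text()))
    else:
        print("no ASPM policy file found (ASPM support may not be built in)")
```

This only reports the kernel's policy. A setting changed in the UEFI may not show up here, so the per-device link state would still need checking separately.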
I feel like there is probably a way to zero in on better info in the logs about what’s causing the crashes, but idk what to look for. I can’t dump an NVIDIA bug report when an event happens, since I lose the session every time. I’m hoping someone here has some ideas, or can at least point me to a better place to ask. Thanks!
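Since I can't capture anything live, one option is to filter the previous boot's kernel log after rebooting. Below is a minimal sketch of that idea, assuming the journal is persistent so that `journalctl -k -b -1` still holds the crashed boot. The search patterns come from the errors quoted above, plus `NVRM` and `Xid`, which I'm assuming the NVIDIA driver uses to tag its kernel messages. The function names are made up for illustration.

```python
import re
import subprocess

# Patterns from the errors quoted in this post, plus "NVRM"/"Xid" as assumed
# prefixes of NVIDIA driver messages and "AER" for PCIe error reporting.
PATTERNS = re.compile(r"fallen off the bus|PCIe Bus Error|\bAER\b|\bNVRM\b|\bXid\b")


def suspicious_lines(lines):
    """Keep only the lines that mention a GPU or PCIe error."""
    return [line for line in lines if PATTERNS.search(line)]


def previous_boot_kernel_log():
    """Read kernel messages from the boot before the current one."""
    try:
        result = subprocess.run(
            ["journalctl", "-k", "-b", "-1", "--no-pager", "-o", "short-iso"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # journalctl is not installed on this system.
        return []
    return result.stdout.splitlines()


if __name__ == "__main__":
    for line in suspicious_lines(previous_boot_kernel_log()):
        print(line)
```

Run right after a crash and reboot, this should print whatever matching lines the kernel managed to write before the session was lost. If nothing is printed, the crash may have happened before the journal was flushed to disk.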