GPU fan suddenly starts going wild and screen goes black

Sometimes my screen will randomly go black, coinciding with the GPU spinning up its fans very loudly. I can see that the power stays relatively stable and low. If audio was playing when the screen goes black, it’ll keep on playing. I did manage to send a message with the black screen once too, so I suspect I still have use of everything other than the monitor.

Moving the monitor from the dedicated GPU to the integrated one doesn’t help, but if I’ve booted with the monitor plugged into the integrated GPU, then I don’t lose the screen.

This was happening a bit before. I thought it might just be an installation thing, and after I did a system update a few times, it went away. Now it’s come back, and it currently happens every boot after ~10 min (I think; I haven’t clocked it yet).

garuda-inxi:

System:
  Kernel: 6.5.4-zen2-1-zen arch: x86_64 bits: 64 compiler: gcc v: 13.2.1
    clocksource: tsc available: acpi_pm
    parameters: BOOT_IMAGE=/@/boot/vmlinuz-linux-zen
    root=UUID=ff218736-6729-4d49-a2d2-0dc7487c0fc3 rw rootflags=subvol=@
    quiet quiet rd.udev.log_priority=3 vt.global_cursor_default=0 loglevel=3
    ibt=off
  Desktop: KDE Plasma v: 5.27.8 tk: Qt v: 5.15.10 wm: kwin_x11 vt: 2
    dm: SDDM Distro: Garuda Linux base: Arch Linux
Machine:
  Type: Desktop System: ASUS product: N/A v: N/A serial: <superuser required>
  Mobo: ASUSTeK model: ROG STRIX B760-G GAMING WIFI v: Rev 1.xx
    serial: <superuser required> UEFI: American Megatrends v: 1210
    date: 07/14/2023
CPU:
  Info: model: 13th Gen Intel Core i9-13900K bits: 64 type: MST AMCP
    arch: Raptor Lake gen: core 13 level: v3 note: check built: 2022+
    process: Intel 7 (10nm) family: 6 model-id: 0xB7 (183) stepping: 1
    microcode: 0x119
  Topology: cpus: 1x cores: 24 mt: 8 tpc: 2 st: 16 threads: 32 smt: enabled
    cache: L1: 2.1 MiB desc: d-16x32 KiB, 8x48 KiB; i-8x32 KiB, 16x64 KiB
    L2: 32 MiB desc: 8x2 MiB, 4x4 MiB L3: 36 MiB desc: 1x36 MiB
  Speed (MHz): avg: 2189 high: 5528 min/max: 800/5500:5800:4300 scaling:
    driver: intel_pstate governor: powersave cores: 1: 5505 2: 800 3: 800 4: 800
    5: 5500 6: 5500 7: 5500 8: 800 9: 5500 10: 800 11: 5500 12: 800 13: 5500
    14: 800 15: 5499 16: 5528 17: 800 18: 800 19: 2936 20: 800 21: 800 22: 800
    23: 800 24: 800 25: 800 26: 800 27: 800 28: 800 29: 800 30: 800 31: 800
    32: 800 bogomips: 191692
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
  Vulnerabilities: <filter>
Graphics:
  Device-1: Intel Raptor Lake-S GT1 [UHD Graphics 770] vendor: ASUSTeK
    driver: i915 v: kernel arch: Gen-13 process: Intel 7 (10nm) built: 2022+
    ports: active: DP-1 empty: HDMI-A-1,HDMI-A-2 bus-ID: 00:02.0
    chip-ID: 8086:a780 class-ID: 0300
  Device-2: NVIDIA AD104 [GeForce RTX 4070 Ti] vendor: Gigabyte
    driver: nvidia v: 535.113.01 alternate: nouveau,nvidia_drm non-free: 535.xx+
    status: current (as of 2023-08) arch: Lovelace code: AD1xx
    process: TSMC n4 (5nm) built: 2022-23+ pcie: speed: Unknown lanes: 63
    link-max: gen: 6 speed: 64 GT/s bus-ID: 01:00.0 chip-ID: 10de:2782
    class-ID: 0300
  Display: x11 server: X.Org v: 21.1.8 with: Xwayland v: 23.2.1
    compositor: kwin_x11 driver: X: loaded: modesetting,nvidia unloaded: nouveau
    alternate: fbdev,intel,nv,vesa dri: iris gpu: i915 display-ID: :0
    screens: 1
  Screen-1: 0 s-res: 2560x1080 s-dpi: 96 s-size: 677x285mm (26.65x11.22")
    s-diag: 735mm (28.92")
  Monitor-1: DP-1 model: LG (GoldStar) HDR WFHD serial: <filter> built: 2020
    res: 2560x1080 dpi: 81 gamma: 1.2 size: 798x334mm (31.42x13.15")
    diag: 869mm (34.2") modes: max: 2560x1080 min: 640x480
  API: OpenGL v: 4.6 Mesa 23.1.8-arch1.1 renderer: Mesa Intel Graphics
    (RPL-S) direct-render: Yes
Audio:
  Device-1: Intel vendor: ASUSTeK driver: snd_hda_intel v: kernel
    alternate: snd_sof_pci_intel_tgl bus-ID: 00:1f.3 chip-ID: 8086:7a50
    class-ID: 0403
  Device-2: NVIDIA vendor: Gigabyte driver: snd_hda_intel v: kernel pcie:
    speed: Unknown lanes: 63 link-max: gen: 6 speed: 64 GT/s bus-ID: 01:00.1
    chip-ID: 10de:22bc class-ID: 0403
  Device-3: Blue Microphones Yeti Stereo Microphone
    driver: hid-generic,snd-usb-audio,usbhid type: USB rev: 1.1 speed: 12 Mb/s
    lanes: 1 mode: 1.1 bus-ID: 1-5:4 chip-ID: b58e:9e84 class-ID: 0300
    serial: <filter>
  API: ALSA v: k6.5.4-zen2-1-zen status: kernel-api with: aoss
    type: oss-emulator tools: N/A
  Server-1: PipeWire v: 0.3.80 status: active with: 1: pipewire-pulse
    status: active 2: wireplumber status: active 3: pipewire-alsa type: plugin
    4: pw-jack type: plugin tools: pactl,pw-cat,pw-cli,wpctl
Network:
  Device-1: Intel driver: iwlwifi v: kernel port: N/A bus-ID: 00:14.3
    chip-ID: 8086:7a70 class-ID: 0280
  IF: wlp0s20f3 state: up mac: <filter>
  Device-2: Intel Ethernet I226-V vendor: ASUSTeK driver: igc v: kernel
    pcie: gen: 2 speed: 5 GT/s lanes: 1 port: N/A bus-ID: 05:00.0
    chip-ID: 8086:125c class-ID: 0200
  IF: eno1 state: down mac: <filter>
  IF-ID-1: wg-mullvad state: unknown speed: N/A duplex: N/A mac: N/A
Bluetooth:
  Device-1: Intel driver: btusb v: 0.8 type: USB rev: 2.0 speed: 12 Mb/s
    lanes: 1 mode: 1.1 bus-ID: 1-14:8 chip-ID: 8087:0033 class-ID: e001
  Report: btmgmt ID: hci0 rfk-id: 1 state: up address: <filter> bt-v: 5.3
    lmp-v: 12 status: discoverable: no pairing: no class-ID: 7c0104
RAID:
  Hardware-1: Intel Volume Management Device NVMe RAID Controller Intel
    driver: vmd v: 0.6 port: N/A bus-ID: 00:0e.0 chip-ID: 8086:a77f rev:
    class-ID: 0104
Drives:
  Local Storage: total: 1.82 TiB used: 151.23 GiB (8.1%)
  SMART Message: Unable to run smartctl. Root privileges required.
  ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Samsung model: SSD 980 PRO with
    Heatsink 2TB size: 1.82 TiB block-size: physical: 512 B logical: 512 B
    speed: 63.2 Gb/s lanes: 4 tech: SSD serial: <filter> fw-rev: 5B2QGXA7
    temp: 30.9 C scheme: GPT
Partition:
  ID-1: / raw-size: 1.82 TiB size: 1.82 TiB (100.00%) used: 151.23 GiB (8.1%)
    fs: btrfs dev: /dev/nvme0n1p2 maj-min: 259:2
  ID-2: /boot/efi raw-size: 300 MiB size: 299.4 MiB (99.80%)
    used: 576 KiB (0.2%) fs: vfat dev: /dev/nvme0n1p1 maj-min: 259:1
  ID-3: /home raw-size: 1.82 TiB size: 1.82 TiB (100.00%)
    used: 151.23 GiB (8.1%) fs: btrfs dev: /dev/nvme0n1p2 maj-min: 259:2
  ID-4: /var/log raw-size: 1.82 TiB size: 1.82 TiB (100.00%)
    used: 151.23 GiB (8.1%) fs: btrfs dev: /dev/nvme0n1p2 maj-min: 259:2
  ID-5: /var/tmp raw-size: 1.82 TiB size: 1.82 TiB (100.00%)
    used: 151.23 GiB (8.1%) fs: btrfs dev: /dev/nvme0n1p2 maj-min: 259:2
Swap:
  Kernel: swappiness: 133 (default 60) cache-pressure: 100 (default) zswap: no
  ID-1: swap-1 type: zram size: 62.54 GiB used: 0 KiB (0.0%) priority: 100
    comp: zstd avail: lzo,lzo-rle,lz4,lz4hc,842 max-streams: 32 dev: /dev/zram0
Sensors:
  System Temperatures: cpu: 26.0 C mobo: N/A
  Fan Speeds (rpm): N/A
Info:
  Processes: 506 Uptime: 23m wakeups: 0 Memory: total: 64 GiB
  available: 62.54 GiB used: 6.57 GiB (10.5%) Init: systemd v: 254
  default: graphical tool: systemctl Compilers: gcc: 13.2.1 Packages:
  pm: pacman pkgs: 1962 libs: 560 tools: octopi,paru,yay Shell: Bash v: 5.1.16
  running-in: alacritty inxi: 3.3.29
Garuda (2.6.16-1):
  System install date:     2023-08-22
  Last full system update: 2023-09-26 ↻
  Is partially upgraded:   No
  Relevant software:       snapper NetworkManager dracut nvidia-dkms
  Windows dual boot:       No/Undetected
  Failed units:

I’ve put this in Newbie since I’m a relative noob with Garuda, and don’t really have a clue how to start looking at this. I’ve looked at other threads about screens going black and fans, and haven’t found anything that I think is relevant yet. Apologies if I’ve missed something (like I did the last time I posted a topic).

Are you using a Wayland session?
Maybe you need to use only X11 with the NVIDIA GPU?
Just a thought from me, not a judgment. :slight_smile:

If I am using Wayland, it’s completely by accident - I’ll check the next time I’m rebooting, which should be momentarily.
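
For anyone wanting to check this without rebooting: the inxi output above already reports `Display: x11` and `wm: kwin_x11`, which points to an X11 session. A minimal sketch of a second check is below. It assumes the desktop session exports the `XDG_SESSION_TYPE` environment variable, which is common but not guaranteed; the function name `session_type` is just an illustration.

```python
import os


def session_type(env=None):
    """Return the session type from XDG_SESSION_TYPE, lowercased, or 'unknown'.

    `env` is any mapping of environment variables; it defaults to the
    environment of the current process. The variable is an assumption:
    it is normally set by the login session, but it may be missing.
    """
    env = os.environ if env is None else env
    value = env.get("XDG_SESSION_TYPE", "").strip().lower()
    return value or "unknown"


# Run from a terminal inside the desktop session you want to inspect.
print("Session type:", session_type())
```

A result of `x11` would agree with what inxi reports; `wayland` would mean the session was switched at the SDDM login screen.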

I currently have my monitor plugged into the integrated GPU so I don’t suddenly lose the ability to do anything useful. I’ve noticed that if I don’t open a browser, I don’t have the issue. I was wondering if maybe the issue is in FireDragon since it’s my main browser, so I eventually opened regular Firefox, but the issue started again within 5 min.

I still have use of the screen, so I’m able to get error logs:

Sep 26 14:50:07 beefy nmbd[4102]: [2023/09/26 14:50:07.562646,  0] ../../source3/libsmb/nmblib.c:923(send_udp)
Sep 26 14:50:07 beefy nmbd[4102]:   Packet send failed to 192.168.1.255(138) ERRNO=Operation not permitted
Sep 26 14:52:16 beefy nmbd[4102]: [2023/09/26 14:52:16.838318,  0] ../../source3/libsmb/nmblib.c:923(send_udp)
Sep 26 14:52:16 beefy nmbd[4102]:   Packet send failed to 192.168.1.255(137) ERRNO=Operation not permitted
Sep 26 14:52:16 beefy nmbd[4102]: [2023/09/26 14:52:16.838427,  0] ../../source3/nmbd/nmbd_packets.c:180(send_netbios_packet)
Sep 26 14:52:16 beefy nmbd[4102]:   send_netbios_packet: send_packet() to IP 192.168.1.255 port 137 failed
Sep 26 14:52:16 beefy nmbd[4102]: [2023/09/26 14:52:16.838450,  0] ../../source3/nmbd/nmbd_namequery.c:245(query_name)
Sep 26 14:52:16 beefy nmbd[4102]:   query_name: Failed to send packet trying to query name WORKGROUP<1d>
Sep 26 14:57:17 beefy nmbd[4102]: [2023/09/26 14:57:17.138278,  0] ../../source3/libsmb/nmblib.c:923(send_udp)
Sep 26 14:57:17 beefy nmbd[4102]:   Packet send failed to 192.168.1.255(137) ERRNO=Operation not permitted
Sep 26 14:57:17 beefy nmbd[4102]: [2023/09/26 14:57:17.138367,  0] ../../source3/nmbd/nmbd_packets.c:180(send_netbios_packet)
Sep 26 14:57:17 beefy nmbd[4102]:   send_netbios_packet: send_packet() to IP 192.168.1.255 port 137 failed
Sep 26 14:57:17 beefy nmbd[4102]: [2023/09/26 14:57:17.138389,  0] ../../source3/nmbd/nmbd_namequery.c:245(query_name)
Sep 26 14:57:17 beefy nmbd[4102]:   query_name: Failed to send packet trying to query name WORKGROUP<1d>
Sep 26 14:58:03 beefy kernel: pcieport 0000:00:01.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
Sep 26 14:58:03 beefy kernel: pcieport 0000:00:01.0:   device [8086:a70d] error status/mask=00040000/00010000
Sep 26 14:58:03 beefy kernel: pcieport 0000:00:01.0:    [18] MalfTLP                (First)
Sep 26 14:58:03 beefy kernel: pcieport 0000:00:01.0: AER:   TLP Header: 60701010 010000ff 8b8b0f40 00000001
Sep 26 14:58:03 beefy kernel: nvidia 0000:01:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
Sep 26 14:58:03 beefy kernel: nvidia 0000:01:00.0: AER:   Error of this Agent is reported first
Sep 26 14:58:04 beefy kernel: NVRM: GPU at PCI:0000:01:00: GPU-f8b4909a-3ab9-4a52-2559-91139f30846c
Sep 26 14:58:04 beefy kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Sep 26 14:58:04 beefy kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Sep 26 14:58:04 beefy kernel: NVRM: A GPU crash dump has been created. If possible, please run
                              NVRM: nvidia-bug-report.sh as root to collect this data before
                              NVRM: the NVIDIA kernel module is unloaded.
Sep 26 14:58:07 beefy nmbd[4102]: [2023/09/26 14:58:07.188264,  0] ../../source3/libsmb/nmblib.c:923(send_udp)
Sep 26 14:58:07 beefy nmbd[4102]:   Packet send failed to 192.168.1.255(138) ERRNO=Operation not permitted

I’ve gone back to 14:50 local, though the issue started at ~14:58.
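
As a reading aid for the `error status/mask=00040000/00010000` line in the log above, here is a minimal sketch that decodes the two hex values into named bits. The log itself confirms only that bit 18 is `MalfTLP`; the other names in the table are my understanding of the PCIe AER Uncorrectable Error Status register layout and should be treated as an assumption. The helper `decode_aer` is hypothetical, not part of any tool.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register.
# Bit 18 is confirmed by the log ("[18] MalfTLP"); the remaining names
# are assumed from the PCIe specification.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def bit_names(value):
    """List the names of all bits set in a 32-bit register value."""
    return [
        UNCORRECTABLE_BITS.get(bit, f"bit {bit}")
        for bit in range(32)
        if (value >> bit) & 1
    ]


def decode_aer(status_mask):
    """Decode a 'status/mask' hex pair into (reported, masked) name lists."""
    status_hex, mask_hex = status_mask.split("/")
    return bit_names(int(status_hex, 16)), bit_names(int(mask_hex, 16))


reported, masked = decode_aer("00040000/00010000")
print("reported:", reported)  # ['Malformed TLP']
print("masked:", masked)      # ['Unexpected Completion']
```

So the root port at `00:01.0` reported a malformed transaction-layer packet just before the driver logged Xid 79, which suggests the fault is on the PCIe link to the GPU rather than in the browser.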

I saw the message about nvidia-bug-report.sh and ran it, but the output is >200k lines. Let me know if I should upload the gz.
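
Since the report is too long to read by eye, one option is to filter it down to the lines that mention the GPU or PCIe errors before sharing. The sketch below assumes the report is a gzip-compressed text file and that it was saved as `nvidia-bug-report.log.gz` (I believe that is the script's default name, but check your own file). The search patterns are taken from the journal excerpt above, and `interesting_lines` is a hypothetical helper.

```python
import gzip
import re

# Phrases copied from the journal excerpt; extend as needed.
PATTERN = re.compile(r"Xid|fallen off the bus|PCIe Bus Error|MalfTLP")


def interesting_lines(path):
    """Yield lines from a gzip-compressed text log that match PATTERN."""
    # errors="replace" keeps going if the report contains stray bytes.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if PATTERN.search(line):
                yield line.rstrip("\n")


# Example use (file name is an assumption, see above):
# for line in interesting_lines("nvidia-bug-report.log.gz"):
#     print(line)
```

This only narrows things down for a first look; the full archive is still what people helping will usually ask for.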

Yes, you can use https://bin.garudalinux.org/

I did just get the issue without opening any browser, too - I guess I’ve falsified that theory

My machine has been on for a while without black screening me. Not sure what’s different now. I gave it more of a rest before booting, and also messed around in the UEFI, there was something called ReSizeBAR which I disabled. (Because we all know messing with features in UEFI without understanding them is a great idea. My only defense is that my computer was getting unusable.)

I did just get the issue coming out of sleep, so it doesn’t seem to be gone. There was a significant update which included NVIDIA drivers, so I wonder if that helps. Or makes things worse.

After searching around on this error, I’m beginning to think it might be an ASPM issue. Here are two references:

Arch Linux forum thread
AskUbuntu Q&A

I tried turning ASPM off both in GRUB and in the UEFI. It does seem to have reduced the severity of the issue, but I do still get it once or twice a day.
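
To confirm that the change actually took effect on the running system, a minimal sketch follows. It assumes the GRUB change was the kernel parameter `pcie_aspm=off` (the post does not say which setting was used) and that the kernel exposes the ASPM policy at `/sys/module/pcie_aspm/parameters/policy`, where the active entry is shown in square brackets. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed location of the kernel's ASPM policy; absent on some kernels.
POLICY_FILE = Path("/sys/module/pcie_aspm/parameters/policy")


def active_policy(text):
    """Return the bracketed entry, e.g. 'default [powersave]' -> 'powersave'."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def aspm_disabled_on_cmdline(cmdline):
    """True if the kernel command line contains pcie_aspm=off."""
    return "pcie_aspm=off" in cmdline.split()


def report():
    """Collect both checks from the running system (Linux only)."""
    cmdline = Path("/proc/cmdline").read_text()
    policy = None
    if POLICY_FILE.exists():
        policy = active_policy(POLICY_FILE.read_text())
    return {
        "pcie_aspm=off on cmdline": aspm_disabled_on_cmdline(cmdline),
        "active policy": policy,
    }


# Example use on the affected machine:
# print(report())
```

If the command-line check comes back `False`, the GRUB edit probably was not regenerated into the boot configuration, which would be worth ruling out before concluding that ASPM is not the cause.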
