Help diagnosing possible GPU issues

I have an NVIDIA GeForce RTX 2060 that is a couple of years old now. EDIT: I would like to clarify that these issues started on EndeavourOS and have persisted into a clean install of Garuda.

I am currently experiencing some graphical issues which I have not encountered before, particularly in Steam:

  • Textures not loading in certain video games (slightly older games like Horizon Zero Dawn)
  • Steam overlay not working, which I hear is related to graphics drivers
  • Window flickering that is not isolated to Steam or games
  • Game crashing
  • Particle effect bugs, where the whole screen is covered by a glow
  • Problems persist after basic troubleshooting: setting all graphics to low, checking launch settings, and fiddling with v-sync and other settings. I also tried switching between the proprietary and open drivers.
  • Drop-down menus in Steam are sometimes completely broken: the screen flickers and the drop-down disappears. I am unsure whether this is a graphics issue on my end or a bug in the Steam client. Both the Runtime and Native versions exhibit this.

When I open nvidia-settings and check the GPU info, I often see GPU utilization at ~20% during non-graphics-intensive times (e.g. writing this forum post), and it is almost always at or close to 100% during gameplay. I don't see it go past yellow on the thermal sensor. My CPU will also often reach 4.0 GHz and 70% usage. I don't think it's a power supply issue, because wouldn't the GPU be unable to max out in that case? I feel like I shouldn't have performance drops like these on my setup.

The maxing out started when Steam was processing Vulkan shaders, and it got better when I disabled shader processing. Unfortunately, the bugs and other behaviors are still getting worse, albeit more slowly than before.
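For anyone who wants numbers rather than a glance at nvidia-settings, here is a minimal sketch, assuming nvidia-smi (it ships with the proprietary driver) is available. It summarizes a log of utilization, power draw, and temperature samples. The helper name summarize_gpu_log and the file name gpu.log are placeholders of mine, not part of any tool.

```shell
#!/usr/bin/env bash
# Sketch: summarize GPU samples given as CSV lines of
#   utilization [%], power draw [W], temperature [C]
# which is the shape produced by:
#   nvidia-smi --query-gpu=utilization.gpu,power.draw,temperature.gpu \
#              --format=csv,noheader,nounits -l 1
# summarize_gpu_log is a hypothetical helper name (an assumption).
summarize_gpu_log() {
    awk -F', *' '
        NF >= 3 {
            n++                          # count valid samples
            util += $1                   # accumulate utilization for the average
            if ($2 > maxp) maxp = $2     # track peak power draw
            if ($3 > maxt) maxt = $3     # track peak temperature
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "samples=%d avg_util=%.1f max_power=%.1f max_temp=%d\n",
                   n, util / n, maxp, maxt
        }
    '
}
```

To collect samples, start the nvidia-smi command from the comment above with its output redirected to gpu.log, play for a while, stop it with Ctrl+C, and then run:

summarize_gpu_log < gpu.log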

I would like help determining whether this is a hardware issue, a driver issue, or something else. Do I need a new graphics card?

System:
  Kernel: 6.4.1-zen2-1-zen arch: x86_64 bits: 64 compiler: gcc v: 13.1.1
    parameters: BOOT_IMAGE=/@/boot/vmlinuz-linux-zen
    root=UUID=07673fcd-7381-4a95-904f-ae4a01aa632b rw rootflags=subvol=@
    rd.udev.log_priority=3 vt.global_cursor_default=0 loglevel=3 ibt=off
  Desktop: Qtile v: 0.22.1 wm: LG3D vt: 2 dm: SDDM Distro: Garuda Linux
    base: Arch Linux
Machine:
  Type: Desktop Mobo: Micro-Star model: B450 TOMAHAWK MAX (MS-7C02) v: 1.0
    serial: <superuser required> UEFI: American Megatrends LLC. v: 3.F1
    date: 07/05/2022
CPU:
  Info: model: AMD Ryzen 9 5900X bits: 64 type: MT MCP arch: Zen 3+ gen: 4
    level: v3 note: check built: 2022 process: TSMC n6 (7nm) family: 0x19 (25)
    model-id: 0x21 (33) stepping: 2 microcode: 0xA20120A
  Topology: cpus: 1x cores: 12 tpc: 2 threads: 24 smt: enabled cache:
    L1: 768 KiB desc: d-12x32 KiB; i-12x32 KiB L2: 6 MiB desc: 12x512 KiB
    L3: 64 MiB desc: 2x32 MiB
  Speed (MHz): avg: 2575 high: 3700 min/max: 2200/4950 boost: enabled
    scaling: driver: acpi-cpufreq governor: schedutil cores: 1: 2200 2: 2873
    3: 3187 4: 2200 5: 3700 6: 2200 7: 2200 8: 2200 9: 2200 10: 2200 11: 3700
    12: 3700 13: 2200 14: 2879 15: 2200 16: 2200 17: 2200 18: 2876 19: 2200
    20: 2200 21: 2200 22: 3700 23: 2200 24: 2200 bogomips: 177592
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
  Vulnerabilities: <filter>
Graphics:
  Device-1: NVIDIA TU104 [GeForce RTX 2060] vendor: Micro-Star MSI
    driver: nvidia v: 535.54.03 alternate: nouveau,nvidia_drm non-free: 530.xx+
    status: current (as of 2023-05) arch: Turing code: TUxxx
    process: TSMC 12nm FF built: 2018-22 pcie: gen: 2 speed: 5 GT/s lanes: 4
    link-max: lanes: 16 bus-ID: 25:00.0 chip-ID: 10de:1e89 class-ID: 0300
  Display: x11 server: X.Org v: 21.1.8 compositor: Picom v: git-c4107
    driver: X: loaded: nvidia gpu: nvidia display-ID: :0 screens: 1
  Screen-1: 0 s-res: 4480x1440 s-dpi: 95 s-size: 1198x389mm (47.17x15.31")
    s-diag: 1260mm (49.59")
  Monitor-1: DP-0 pos: bottom-l res: 1920x1080 hz: 60 dpi: 96
    size: 510x287mm (20.08x11.3") diag: 585mm (23.04") modes: N/A
  Monitor-2: DP-4 pos: primary,top-right res: 2560x1440 hz: 60 dpi: 109
    size: 597x336mm (23.5x13.23") diag: 685mm (26.97") modes: N/A
  API: OpenGL v: 4.6.0 NVIDIA 535.54.03 renderer: NVIDIA GeForce RTX
    2060/PCIe/SSE2 direct-render: Yes
Audio:
  Device-1: NVIDIA TU104 HD Audio vendor: Micro-Star MSI driver: snd_hda_intel
    v: kernel pcie: gen: 2 speed: 5 GT/s lanes: 4 link-max: lanes: 16
    bus-ID: 25:00.1 chip-ID: 10de:10f8 class-ID: 0403
  Device-2: AMD Starship/Matisse HD Audio vendor: Micro-Star MSI
    driver: snd_hda_intel v: kernel pcie: gen: 4 speed: 16 GT/s lanes: 16
    bus-ID: 27:00.4 chip-ID: 1022:1487 class-ID: 0403
  Device-3: Logitech G733 Gaming Headset
    driver: hid-generic,snd-usb-audio,usbhid type: USB rev: 1.1 speed: 12 Mb/s
    lanes: 1 mode: 1.1 bus-ID: 1-9:4 chip-ID: 046d:0ab5 class-ID: 0300
  API: ALSA v: k6.4.1-zen2-1-zen status: kernel-api
    tools: alsactl,alsamixer,amixer
  Server-1: sndiod v: N/A status: off tools: aucat,midicat,sndioctl
  Server-2: PipeWire v: 0.3.72 status: active with: 1: pipewire-pulse
    status: active 2: wireplumber status: active 3: pipewire-alsa type: plugin
    4: pw-jack type: plugin tools: pactl,pw-cat,pw-cli,wpctl
Network:
  Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet
    vendor: Micro-Star MSI driver: r8169 v: kernel pcie: gen: 1 speed: 2.5 GT/s
    lanes: 1 port: f000 bus-ID: 22:00.0 chip-ID: 10ec:8168 class-ID: 0200
  IF: enp34s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
Drives:
  Local Storage: total: 2.73 TiB used: 802.49 GiB (28.7%)
  SMART Message: Required tool smartctl not installed. Check --recommends
  ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Western Digital
    model: WDS100T3X0C-00SJG0 size: 931.51 GiB block-size: physical: 512 B
    logical: 512 B speed: 31.6 Gb/s lanes: 4 tech: SSD serial: <filter>
    fw-rev: 111110WD temp: 37.9 C scheme: GPT
  ID-2: /dev/sda maj-min: 8:0 vendor: Seagate model: ST2000DM008-2FR102
    size: 1.82 TiB block-size: physical: 4096 B logical: 512 B speed: 6.0 Gb/s
    tech: HDD rpm: 7200 serial: <filter> fw-rev: 0001 scheme: GPT
Partition:
  ID-1: / raw-size: 931.22 GiB size: 931.22 GiB (100.00%)
    used: 31.2 GiB (3.4%) fs: btrfs dev: /dev/nvme0n1p2 maj-min: 259:2
  ID-2: /boot/efi raw-size: 300 MiB size: 299.4 MiB (99.80%)
    used: 576 KiB (0.2%) fs: vfat dev: /dev/nvme0n1p1 maj-min: 259:1
  ID-3: /home raw-size: 931.22 GiB size: 931.22 GiB (100.00%)
    used: 31.2 GiB (3.4%) fs: btrfs dev: /dev/nvme0n1p2 maj-min: 259:2
  ID-4: /var/log raw-size: 931.22 GiB size: 931.22 GiB (100.00%)
    used: 31.2 GiB (3.4%) fs: btrfs dev: /dev/nvme0n1p2 maj-min: 259:2
  ID-5: /var/tmp raw-size: 931.22 GiB size: 931.22 GiB (100.00%)
    used: 31.2 GiB (3.4%) fs: btrfs dev: /dev/nvme0n1p2 maj-min: 259:2
Swap:
  Kernel: swappiness: 133 (default 60) cache-pressure: 100 (default)
  ID-1: swap-1 type: zram size: 31.27 GiB used: 0 KiB (0.0%) priority: 100
    dev: /dev/zram0
Sensors:
  System Temperatures: cpu: 38.0 C mobo: N/A gpu: nvidia temp: 35 C
  Fan Speeds (RPM): N/A gpu: nvidia fan: 30%
Info:
  Processes: 401 Uptime: 2m wakeups: 0 Memory: available: 31.27 GiB
  used: 2.55 GiB (8.1%) Init: systemd v: 253 default: graphical
  tool: systemctl Compilers: gcc: 13.1.1 Packages: pm: pacman pkgs: 1289
  libs: 416 tools: pamac,paru Shell: fish v: 3.6.1 default: Bash v: 5.1.16
  running-in: alacritty inxi: 3.3.27
Garuda (2.6.16-1):
  System install date:     2023-05-24
  Last full system update: 2023-07-05
  Is partially upgraded:   No
  Relevant software:       snapper NetworkManager dracut nvidia-dkms
  Windows dual boot:       No/Undetected
  Failed units:

Have you tried a different kernel, i.e. the LTS kernel?

For the record, I also have an RTX 2060 laptop with no issues.

It sounds like what happens when I overclock the memory too much.

I'd say, faulty GPU memory?

Would I just boot into a different kernel via GRUB to try this? EDIT: Sorry, my confusion. I thought I had an LTS option in GRUB but I do not. Is there an easy way to test this without downloading and installing a different ISO?

Yeah, the weird thing is, I don't have overclocking enabled in the BIOS or in any of my settings. Do you think it somehow overclocked anyway? And is there a way to fix a memory issue, or is it a death sentence for the GPU?

Yes, you can just install another kernel with this command in the terminal:

sudo pacman -S linux-lts linux-lts-headers

Or install a kernel with Garuda Settings Manager, then choose it in GRUB when booting.
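After rebooting, it is worth confirming that the LTS kernel is really the one running. A minimal sketch, assuming Arch-style release strings that end in -lts or -zen (the zen string below is taken from the inxi output in this thread; the LTS version number in the comment is only an illustrative guess). The helper name kernel_flavor is hypothetical.

```shell
#!/usr/bin/env bash
# kernel_flavor is a hypothetical helper: it classifies a kernel release
# string such as the one printed by `uname -r`.
#   6.4.1-zen2-1-zen -> zen    (release string from the inxi output above)
#   6.1.38-1-lts     -> lts    (version number is an assumed example)
kernel_flavor() {
    case "$1" in
        *-lts) echo "lts" ;;
        *-zen) echo "zen" ;;
        *)     echo "other" ;;
    esac
}

# Report the release and flavor of the currently running kernel.
release=$(uname -r)
echo "running: $release ($(kernel_flavor "$release"))"
```

Since this system uses nvidia-dkms, the headers package in the pacman command matters: the driver module has to be built for the new kernel, and running dkms status should list it if that worked.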

Done. LTS still has nvidia-settings saying 50% usage when I’m just scrolling through Steam library. It’s bouncing between 20% and 40% usage while I write this post.

Is there an easy-to-set-up GPU memory test anyone can recommend? I think I can memtest it, and then maybe pull the card and put it in a different computer to rule out other hardware or software issues.

Update: I ran this memtest, https://github.com/GpuZelenograd/memtest_vulkan, and got the following output:

./memtest_vulkan
https://github.com/GpuZelenograd/memtest_vulkan v0.5.0 by GpuZelenograd
To finish testing use Ctrl+C

1: Bus=0x25:00 DevId=0x1E89 6GB NVIDIA GeForce RTX 2060
Standard 5-minute test of 1: Bus=0x25:00 DevId=0x1E89 6GB NVIDIA GeForce RTX 2060
1 iteration. Passed 0.0318 seconds written: 2.5GB 248.9GB/sec checked: 5.0GB 230.0GB/sec
35 iteration. Passed 1.0058 seconds written: 85.0GB 254.8GB/sec checked: 170.0GB 252.9GB/sec
203 iteration. Passed 5.0038 seconds written: 420.0GB 251.0GB/sec checked: 840.0GB 252.2GB/sec
1209 iteration. Passed 30.0104 seconds written: 2515.0GB 249.9GB/sec checked: 5030.0GB 252.2GB/sec
2215 iteration. Passed 30.0063 seconds written: 2515.0GB 251.5GB/sec checked: 5030.0GB 251.4GB/sec
3221 iteration. Passed 30.0172 seconds written: 2515.0GB 251.6GB/sec checked: 5030.0GB 251.2GB/sec
4226 iteration. Passed 30.0148 seconds written: 2512.5GB 251.4GB/sec checked: 5025.0GB 251.0GB/sec
5233 iteration. Passed 30.0181 seconds written: 2517.5GB 252.1GB/sec checked: 5035.0GB 251.4GB/sec
6238 iteration. Passed 30.0200 seconds written: 2512.5GB 251.5GB/sec checked: 5025.0GB 250.9GB/sec
7244 iteration. Passed 30.0178 seconds written: 2515.0GB 251.8GB/sec checked: 5030.0GB 251.1GB/sec
8248 iteration. Passed 30.0240 seconds written: 2510.0GB 251.4GB/sec checked: 5020.0GB 250.5GB/sec
9251 iteration. Passed 30.0228 seconds written: 2507.5GB 251.0GB/sec checked: 5015.0GB 250.3GB/sec
Standard 5-minute test PASSed! Just press Ctrl+C unless you plan long test run.
Extended endless test started; testing more than 2 hours is usually unneeded
use Ctrl+C to stop it when you decide it’s enough
^C
memtest_vulkan: no any errors, testing PASSed.
press any key to continue…

I think I’m going to try getting a new power supply and go from there. I have a smaller one, so I’m starting to suspect it is related to my new CPU and increased power requirements.

Update:
After going through nvidia-settings again, I noticed that my GPU power draw never reaches 100W (even while experiencing performance issues like frame drops), while the Default TGP is listed as 160.0 W.
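To put a number on that gap, here is a minimal sketch that expresses power draw as a percentage of the limit. The helper name power_pct is a placeholder of mine, and the 100 W figure is the rough ceiling from the observation above, not a precise measurement.

```shell
#!/usr/bin/env bash
# power_pct is a hypothetical helper: given power draw and power limit
# in watts, print the draw as a percentage of the limit.
power_pct() {
    awk -v draw="$1" -v limit="$2" 'BEGIN {
        if (limit <= 0) { print "invalid limit"; exit 1 }
        printf "%.1f\n", 100 * draw / limit
    }'
}

# Figures from this post: draw stays under about 100 W, default TGP is 160 W.
power_pct 100 160   # prints 62.5
```

Live values for the two arguments can be read with nvidia-smi --query-gpu=power.draw,power.limit --format=csv,noheader,nounits. This only quantifies the observation; on its own it does not show what is holding the draw down.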

This makes me think that it is indeed a power supply issue. Unfortunately, I likely won't have money to replace it until after this thread autolocks, and I don't want to spam the thread, so I'll mark it as solved, with the assumption that a 100 W or more upgrade to the PSU will solve the issue.
