Nvidia 460.39-4 hangs on upgrade

FGD · 2 February 2021 02:05

Has anyone tried installing today’s update of the nvidia drivers, 460.39-4, from Chaotic?

Package (22)                                Old Version  New Version  Net Change  Download Size

chaotic-aur/chaotic-nvidia-dkms-tkg         460.32.03-2  460.39-4       0.10 MiB      25.50 MiB
chaotic-aur/chaotic-nvidia-egl-wayland-tkg  460.32.03-2  460.39-4       0.00 MiB       0.03 MiB
chaotic-aur/chaotic-nvidia-settings-tkg     460.32.03-2  460.39-4       0.00 MiB       0.86 MiB
chaotic-aur/chaotic-nvidia-utils-tkg        460.32.03-2  460.39-4       0.14 MiB     108.76 MiB

For me it hangs during the post-transaction hooks.

:: Running post-transaction hooks…
( 1/12) Creating system user accounts…
( 2/12) Reloading system manager configuration…
( 3/12) Reloading device manager configuration…
( 4/12) Arming ConditionNeedsUpdate…
( 5/12) Install DKMS modules
==> dkms install --no-depmod -m nvidia -v 460.39 -k 5.10.12-116-tkg-bmq

It makes the computer veeeeerry slow: still responsive, but so slow that I have to kill the process.
I also tried reinstalling the driver from Pamac with the “reinstall” button, with the same result.

Just wondering if I’m the only one?

I suspect something changed yesterday or early today, before I tried installing that version tonight. I tried compiling kernels 5-6 times during the day and all of them failed: some broke for other reasons, but 2-3 of them slowed the machine down exactly as much as this nvidia driver update does… coincidence? Hmm, maybe not…
All my kernel compiles are done in snapshots, so I created and deleted 5-6 of them because of that. lol
I won’t update my main subvolumes until I get through this nvidia upgrade, of course.

tbg · 2 February 2021 04:20

Have you masked auto-cpufreq.service? It has a bug that could be causing this behaviour. Simply search "mask a service" if you are unfamiliar with the procedure.
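For anyone unfamiliar with the procedure, masking could look roughly like the sketch below. The `run` and `mask_unit` helper names are made up for illustration and are not part of systemd; with `DRY_RUN` left at its default, the commands are only printed, not executed.

```shell
# Hypothetical helper (not part of systemd): print the command by default,
# and only execute it when DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Stop the unit, mask it so nothing can start it again, then show its state.
mask_unit() {
  run sudo systemctl stop "$1"
  run sudo systemctl mask "$1"
  run systemctl is-enabled "$1"
}

# Usage:
#   mask_unit auto-cpufreq.service             # dry run, prints the commands
#   DRY_RUN=0 mask_unit auto-cpufreq.service   # actually runs them
```

A masked unit can be restored later with `sudo systemctl unmask auto-cpufreq.service`.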

FGD · 2 February 2021 11:19

Good point. No, I didn't mask it, although auto-cpufreq is not running on my machine; it's disabled. But who knows, it could still have an effect. I just read up on masking a service, so I'll try that. It's true that the massive slowdown happens only on CPU-intensive tasks. It started Monday morning for me. All was fine on Sunday when I compiled a kernel.

Is that information (specifically the suggestion to mask auto-cpufreq) on the main page of the Arch wiki news or something?
I gotta find a way to subscribe to that news feed. I have read somewhere that there is a page that warns Arch users of potential major issues found once in a while, like "do not update to nvidia 455 drivers, there is a bug" or "do not install XYZ package, there is a hang". I'll search for that as well.
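A rough sketch of following such a feed from a terminal, assuming the page in question is the Arch Linux news on archlinux.org, whose RSS feed lives at https://archlinux.org/feeds/news/. That feed covers official Arch packages, so a bug in an AUR or Chaotic-AUR package may well not show up there. The `rss_titles` helper name is made up for illustration.

```shell
# Hypothetical helper: pull the <title> elements out of an RSS document on stdin.
# This is a crude grep/sed extraction for a quick look, not a real XML parser.
rss_titles() {
  grep -o '<title>[^<]*</title>' | sed -e 's/^<title>//' -e 's/<\/title>$//'
}

# Usage (needs network access):
#   curl -s https://archlinux.org/feeds/news/ | rss_titles | head -n 10
```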

FGD · 2 February 2021 12:23

Ok SGS, I see the thread "Auto-cpufreq is spamming me, does not seem to work", which was posted after my original post.

But still, I went to Issues · AdnanHodzic/auto-cpufreq · GitHub and couldn't find any warning saying not to use the service until it is fixed. So I would never have thought of masking the service.

Still I need to try it to see if that's the issue.

FGD · 2 February 2021 12:35

Maybe this one? fix problem in desktops where there is nothing in /sys/class/power_su… by librewish · Pull Request #158 · AdnanHodzic/auto-cpufreq · GitHub

Librewish opened it.

dr460nf1r3 · 2 February 2021 12:48

Had a talk with @pedrohlc and he said that this is to be expected, since NVIDIA dkms takes a long time to compile (I don't own one myself, so no experience with this).
Did you already try to just let it finish?

FGD · 2 February 2021 12:52

After more than 5 minutes, there is a problem. It usually takes less than 1 minute for me, and it has never bogged the computer down so much that it became unusable.
Same thing for the kernel compile. It ran in 17 minutes on Sunday, but now, as soon as the CPU kicks in to reach 100%, everything slows down so much that you barely see the *.o files being compiled, like one per 20 seconds or something. Conky even freezes. It was more like 20 *.o per second on Sunday.

I gotta hop out from work and mask auto-cpufreq to test this, at least try it!

FGD · 2 February 2021 12:52

Hey nice refreshed avatar, Dragon!

FGD · 2 February 2021 13:49

Snap, masking auto-cpufreq didn't help. DKMS has been running for over 20 minutes now...

I do have old snapshots from last week, when all was good. I'd like to try compiling a kernel in one of those snapshots, but when I start, it downloads packages, probably some build packages, and at the same time it forces a full upgrade.

Is there a way I can bypass the full upgrade, let makepkg download what it needs and start a compile?
This is to rule out the last few days' package upgrades and any changes I might have made. If the kernel compile works fine there, it means either I did something on my machine or there are bugs/conflicts with packages updated in the last few days.

Otherwise I'll reinstall from USB, but that won't rule out package updates.

FGD · 2 February 2021 13:57

Here are some interesting errors I got:

Feb 02 08:33:55 Garuda kernel: NVRM: API mismatch: the client has the version 460.39, but
NVRM: this kernel module has the version 460.32.03. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
Feb 02 08:33:50 Garuda ananicy[33481]: ionice: ioprio_get failed: No such process
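As a side note on the NVRM message above: it usually just means the userspace libraries were upgraded while the older kernel module is still loaded, which is what one would expect while the new module has not been built yet. A small sketch to pull both versions out of the kernel log; the `nvrm_mismatch` helper name is made up for illustration.

```shell
# Hypothetical helper: read kernel log text on stdin and print the two
# versions named in an "NVRM: API mismatch" message.
nvrm_mismatch() {
  nvrm_log=$(cat)
  nvrm_client=$(printf '%s\n' "$nvrm_log" \
    | grep -o 'client has the version [0-9][0-9.]*[0-9]' | awk '{print $NF}' | head -n 1)
  nvrm_module=$(printf '%s\n' "$nvrm_log" \
    | grep -o 'kernel module has the version [0-9][0-9.]*[0-9]' | awk '{print $NF}' | head -n 1)
  printf 'client=%s module=%s\n' "$nvrm_client" "$nvrm_module"
}

# Usage:
#   journalctl -k -b | nvrm_mismatch
#   cat /sys/module/nvidia/version     # version of the module currently loaded
#   modinfo -F version nvidia          # version of the module installed on disk
```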

tnt · 2 February 2021 13:57

@FGD, I just did pacman -Suyy and installed nvidia 460.39-4 from chaotic without issue:

( 5/20) Updating module dependencies...
( 6/20) Install DKMS modules
==> dkms install --no-depmod -m nvidia -v 460.39 -k 5.10.12-116-tkg-bmq
==> depmod 5.10.12-116-tkg-bmq
( 7/20) Cleaning up...
( 8/20) Updating linux initcpios...

Could you find the tmp folder where dkms is compiling the module? There should be a script or a makefile which you can run manually and see what the exact error is.... Then report here.

FGD · 2 February 2021 14:05

Temp folder? OK wait, now you are making me curious, and this matches another error I have.

While I am searching on how to find that nvidia temp folder, here is the temp folder error I get.

Any time I run grubup, I get a umount error on a tmp folder on the last line:

Generating grub configuration file …
Found theme: /usr/share/grub/themes/garuda/theme.txt
Found linux image: /boot/vmlinuz-linux-tkg-bmq
Found initrd image: /boot/amd-ucode.img /boot/initramfs-linux-tkg-bmq.img
Found fallback initrd image(s) in /boot: initramfs-linux-tkg-bmq-fallback.img
Detecting snapshots …
Info: Separate boot partition not detected
Found snapshot: 2021-02-02 08:29:38 | @snapshots/autosnap/root2021-02-02_08H29
Found snapshot: 2021-02-02 08:25:57 | @snapshots/@.autocpufreq_2021-02-02_08H25
Found snapshot: 2021-02-01 20:51:52 | @snapshots/autosnap/root2021-02-01_20H51
Found snapshot: 2021-02-01 20:41:43 | @snapshots/autosnap/root2021-02-01_20H41
Found snapshot: 2021-02-01 19:05:01 | @snapshots/daily/root2021-02-01_19H05
Found snapshot: 2021-02-01 15:49:21 | @snapshots/autosnap/root2021-02-01_15H49
Found snapshot: 2021-02-01 15:48:13 | @snapshots/autosnap/root2021-02-01_15H48
Found snapshot: 2021-01-31 19:05:01 | @snapshots/daily/root2021-01-31_19H05
Found snapshot: 2021-01-31 18:12:50 | @snapshots/ad-hoc/@.b4_custom_kernel_2021-01-31_18H12
Found snapshot: 2021-01-31 18:11:13 | @snapshots/ad-hoc/root_upd_2021-01-31_18H11
Found snapshot: 2021-01-30 19:05:01 | @snapshots/daily/root2021-01-30_19H05
Found snapshot: 2021-01-29 19:14:49 | @snapshots/daily/root2021-01-29_19H14
Found snapshot: 2021-01-29 19:13:20 | @snapshots/daily/root2021-01-29_19H13
Found snapshot: 2021-01-29 19:12:33 | @snapshots/daily/root2021-01-29_19H12
Found snapshot: 2021-01-29 19:11:49 | @snapshots/daily/root2021-01-29_19H11
Found snapshot: 2021-01-27 20:29:55 | @snapshots/sync/SSD1/@.old
umount: /tmp/tmp.2dVcWLNo23: target is busy.

findmnt -l

shows me one tmp folder:

/tmp/tmp.2dVcWLNo23 /dev/nvme0n1p1 btrfs rw,relatime,compress=zstd:3,ssd,space_cache,autodefrag,subvolid=5,subvol=/

However, last night I had 5 or 6 tmp folders like this one in findmnt. I thought this was related to ZRAM swap, but now I think there is a problem there, and maybe it is related to the nvidia update as well… I'm still looking for how to find the folder. I’ll report back.

FGD · 2 February 2021 14:16

I can't find the temp folder used for building nvidia, maybe it gets deleted if I kill the process. All I see are bin files in my user .cache folder, that's the closest to "temp" I could find.

I'll try deleting ALL files in my @tmp subvolume and start the upgrade from fresh.
Maybe I should do the same with @cache...

FGD · 2 February 2021 14:43

Hmm, no, zRAM has nothing to do with it; I swapped off all 16.
The folder mounted under /tmp is my / folder. Listing it shows all my subvolumes in there.

I can umount manually. Then every single time I run grubup, it mounts a tmp folder with my subvolumes in it.

Here I ran grubup 3 times after unmounting /tmp/tmp.* and I get this:

/tmp/tmp.jlm9hBBPGN /dev/nvme0n1p1 btrfs rw,relatime,compress=zstd:3,ssd,space_cache,autodefrag,subvolid=5,subvol=/
/tmp/tmp.IQ3dCRH0vh /dev/nvme0n1p1 btrfs rw,relatime,compress=zstd:3,ssd,space_cache,autodefrag,subvolid=5,subvol=/
/tmp/tmp.KDbDfLJc0h /dev/nvme0n1p1 btrfs rw,relatime,compress=zstd:3,ssd,space_cache,autodefrag,subvolid=5,subvol=/

There is seriously something wrong with my machine. I could not find what causes grub-update to mount / in a tmp folder. I’ll continue investigating, even though I don’t know if it’s related to the nvidia update/kernel compile issues, but all those need to be fixed.
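To keep an eye on those leftover mounts between grubup runs, something like the sketch below might help. The `tmp_mounts` helper name is made up for illustration, and `/tmp/tmp.XXXXXXXXXX` is only a placeholder for whichever directory shows up.

```shell
# Hypothetical helper: from `findmnt -l` style output on stdin, print only
# the mount targets that look like mktemp directories under /tmp.
tmp_mounts() {
  awk '$1 ~ /^\/tmp\/tmp\./ { print $1 }'
}

# Usage:
#   findmnt -l -n | tmp_mounts             # list the leftover mounts
#   sudo fuser -vm /tmp/tmp.XXXXXXXXXX     # see which process keeps one busy
#   findmnt -l -n | tmp_mounts | while read -r m; do sudo umount "$m"; done
```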

FGD · 2 February 2021 14:59

The /tmp grubup issue does not exist on my other disk, which I haven't updated in a while.
I will try nvidia from there, now back at work for a while. I wanna work on my Garuda install, but hey... loll

tnt · 2 February 2021 15:04

@FGD

I think that these temp folders are related to initrd image creation. DKMS stuffs its own crap under /var/lib/dkms, where you should be able to find out what happened and why it failed.

In any case, please verify that you have the nvidia sources under /usr/src/nvidia-460.39. Then check the dkms build log file under /var/lib/dkms/nvidia/460.39/5.10.12-116-tkg-bmq/x86_64/log/make.log for the actual error (the actual path may vary for you, depending on your kernel). If you cannot understand the error, post the make log file to PrivateBin and a link in this thread, so we may look.

Lastly, try to build the module manually, as sudo:

sudo dkms --no-depmod -m nvidia -v 460.39 build

If this gives you any error like it couldn't find the kernel headers, that means you upgraded the kernel but have not rebooted yet. In that case add -k <new-kernel-ver> to the dkms command. Replace <new-kernel-ver> with whatever is the newest folder under /usr/lib/modules (in my case it's 5.10.12-116-tkg-bmq). If that fails, then again look into the aforementioned make.log for clues.
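The log location above can be put together with a small helper, sketched here. The `dkms_make_log` name is made up for illustration, and defaulting the architecture to x86_64 is an assumption taken from the path quoted above.

```shell
# Hypothetical helper: build the make.log path from module, version and kernel,
# following the directory layout quoted above. Architecture defaults to x86_64.
dkms_make_log() {
  printf '/var/lib/dkms/%s/%s/%s/%s/log/make.log\n' "$1" "$2" "$3" "${4:-x86_64}"
}

# Usage:
#   dkms status              # what DKMS currently reports as added/built/installed
#   ls /usr/lib/modules      # kernel versions that have a module tree
#   less "$(dkms_make_log nvidia 460.39 5.10.12-116-tkg-bmq)"
```

Note that `uname -r` reports the running kernel, which may not be the newest one under /usr/lib/modules if the machine has not been rebooted since a kernel upgrade.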

FGD · 2 February 2021 15:08

Will do!
Maybe something went wrong with dkms in a previous update this weekend, cuz those /tmp folders mounted every time I run grubup were there before the 460.39 install.

I'll go through your precise procedure. I would never have found that so quickly: about everything DKMS-related I know 0 followed by ten more 0s, as of yet.

tnt · 2 February 2021 15:11

I forgot to say, if the build step succeeds, then the install step is next.

As far as DKMS goes, you might want to look at this short document: DKMS info on wiki.archlinux.org

FGD · 2 February 2021 17:24

Well this is complicated now.

From my other drive which hasn’t been updated in about 1 week all updated fine with 460.39. Took 32secs exactly to install the 460.39 driver.

Where it gets complicated: after a reboot I ran grubup and the /tmp folder was mounted. It was not mounted prior to the 460.39 update; I ran grubup then and all was fine.

So now I know there is some package causing this issue, and I don’t know if I’ll be able to update to the next nvidia driver until I fix this. Then again, it’s not mounted if I reboot and try an nvidia upgrade, so is it related? No clue.

Still complicated: there is almost nothing unusual in the make.log file, and the log files from both drives show the same thing. What could be a clue, from both logs, is:

/var/lib/dkms/nvidia/460.39/build/nvidia-uvm/uvm_test.c: In function ‘uvm_test_ioctl’:
/var/lib/dkms/nvidia/460.39/build/nvidia-uvm/uvm_test.c:303:1: warning: the frame size of 2240 bytes is larger than 2048 bytes [-Wframe-larger-than=]

/var/lib/dkms/nvidia/460.39/build/nvidia-drm/nvidia-drm-modeset.c: In function ‘__will_generate_flip_event’:
/var/lib/dkms/nvidia/460.39/build/nvidia-drm/nvidia-drm-modeset.c:96:23: warning: unused variable ‘primary_plane’ [-Wunused-variable]
96 | struct drm_plane *primary_plane = crtc->primary;

The sources are present in /usr/src. I might try a manual update on my 1st drive, but then again that won’t fix the /tmp folder being mounted… I am unable to find what mounted it. I checked journalctl and found nothing…
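Both make.log excerpts above are compiler warnings rather than errors, so on their own they would not normally stop a build. A quick way to check whether a log contains any real errors is sketched below; the `count_diagnostics` helper name is made up for illustration, and it only recognises the gcc-style "warning:" and "error:" markers.

```shell
# Hypothetical helper: count compiler diagnostics in a build log on stdin.
# Matches the ": warning: " and ": error: " / ": fatal error: " markers gcc prints.
count_diagnostics() {
  awk '
    /: warning: /        { w++ }
    /: (fatal )?error: / { e++ }
    END { printf "warnings=%d errors=%d\n", w, e }
  '
}

# Usage (path as given earlier in the thread, adjust to your kernel):
#   count_diagnostics < /var/lib/dkms/nvidia/460.39/5.10.12-116-tkg-bmq/x86_64/log/make.log
```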

tbg · 2 February 2021 20:24

Just on a hunch, try this:

Reboot.

Do not log into your graphical session on your computer.

Either:

Boot to run level 3 from the grub menu, then login as root.

Or:

After rebooting, let the computer reach your GUI display manager's password login prompt, but do not log in to your graphical session. Then switch to a TTY and log in as root.

From the terminal (as root, using either method) perform your job list and see if things have improved this way.