VM GPU Passthrough -- NVidia Blacklisting Not Working

I'm following this guide to set up Windows as a VM with GPU passthrough.

I get stuck at "Creating Domain" when starting the VM, when passing the GPU, because it's still in use by Linux. The blacklisting method isn't working.

For systems with only one Nvidia/Radeon GPU, you'll want to blacklist it in /etc/modprobe.d/blacklist.conf, as follows:

blacklist nouveau
blacklist nvidia

I asked on the Arch forum and, other than saying not to post Garuda support there, they said

Don't use optimus-manager - it'll load the module explicitly which is why that blacklist approach won't work.

So that I do it the right way... how do I properly release the GPU so that KVM can use it for passthrough? And how do I switch between Linux and Windows using it?

System:
Kernel: 5.15.13-zen1-1-zen x86_64 bits: 64 compiler: gcc v: 11.1.0
parameters: BOOT_IMAGE=/@/boot/vmlinuz-linux-zen
root=UUID=58203bda-e2e2-4c32-a006-c91d933cad4e rw [email protected]
quiet splash rd.udev.log_priority=3 vt.global_cursor_default=0
systemd.unified_cgroup_hierarchy=1 loglevel=3 intel_iommu=on
vfio-pci.ids=10de:1f15,10de:10f9
Desktop: KDE Plasma 5.23.5 tk: Qt 5.15.2 info: latte-dock wm: kwin_x11
vt: 1 dm: SDDM Distro: Garuda Linux base: Arch Linux
Machine:
Type: Laptop System: Acer product: Predator PH315-53 v: V1.01
serial: <superuser required>
Mobo: CML model: QX50_CMS v: V1.01 serial: <superuser required>
UEFI: Insyde v: 1.01 date: 04/27/2020
Battery:
ID-1: BAT1 charge: 43.8 Wh (100.0%) condition: 43.8/58.8 Wh (74.5%)
volts: 16.4 min: 15.4 model: SMP AP18E7M type: Li-ion serial: <filter>
status: Full
CPU:
Info: model: Intel Core i7-10750H bits: 64 type: MT MCP arch: Comet Lake
family: 6 model-id: 0xA5 (165) stepping: 2 microcode: 0xEA
Topology: cpus: 1x cores: 6 tpc: 2 threads: 12 smt: enabled cache:
L1: 384 KiB desc: d-6x32 KiB; i-6x32 KiB L2: 1.5 MiB desc: 6x256 KiB
L3: 12 MiB desc: 1x12 MiB
Speed (MHz): avg: 4803 high: 4900 min/max: 800/5000 scaling:
driver: intel_pstate governor: performance cores: 1: 4853 2: 4873 3: 4800
4: 4751 5: 4751 6: 4801 7: 4900 8: 4851 9: 4800 10: 4778 11: 4771
12: 4713 bogomips: 62399
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Vulnerabilities:
Type: itlb_multihit status: KVM: VMX disabled
Type: l1tf status: Not affected
Type: mds status: Not affected
Type: meltdown status: Not affected
Type: spec_store_bypass
mitigation: Speculative Store Bypass disabled via prctl
Type: spectre_v1
mitigation: usercopy/swapgs barriers and __user pointer sanitization
Type: spectre_v2
mitigation: Enhanced IBRS, IBPB: conditional, RSB filling
Type: srbds status: Not affected
Type: tsx_async_abort status: Not affected
Graphics:
Device-1: Intel CometLake-H GT2 [UHD Graphics]
vendor: Acer Incorporated ALI driver: i915 v: kernel bus-ID: 00:02.0
chip-ID: 8086:9bc4 class-ID: 0300
Device-2: NVIDIA TU106M [GeForce RTX 2060 Mobile]
vendor: Acer Incorporated ALI driver: nvidia v: 495.46
alternate: nouveau,nvidia_drm bus-ID: 01:00.0 chip-ID: 10de:1f15
class-ID: 0300
Device-3: Quanta HD User Facing type: USB driver: uvcvideo bus-ID: 1-5:4
chip-ID: 0408:a061 class-ID: 0e02
Display: x11 server: X.Org 1.21.1.3 compositor: kwin_x11 driver:
loaded: modesetting,nvidia display-ID: :0 screens: 1
Screen-1: 0 s-res: 1680x1050 s-dpi: 96 s-size: 443x277mm (17.4x10.9")
s-diag: 522mm (20.6")
Monitor-1: HDMI-1-0 res: 1680x1050 hz: 60 dpi: 42
size: 1020x570mm (40.2x22.4") diag: 1168mm (46")
OpenGL: renderer: Mesa Intel UHD Graphics (CML GT2) v: 4.6 Mesa 21.3.3
direct render: Yes
Audio:
Device-1: Intel Comet Lake PCH cAVS vendor: Acer Incorporated ALI
driver: snd_hda_intel v: kernel
alternate: snd_soc_skl,snd_sof_pci_intel_cnl bus-ID: 00:1f.3
chip-ID: 8086:06c8 class-ID: 0403
Device-2: NVIDIA TU106 High Definition Audio
vendor: Acer Incorporated ALI driver: snd_hda_intel v: kernel
bus-ID: 01:00.1 chip-ID: 10de:10f9 class-ID: 0403
Device-3: Texas Instruments PCM2900B Audio CODEC type: USB
driver: hid-generic,snd-usb-audio,usbhid bus-ID: 1-3:3 chip-ID: 08bb:29b0
class-ID: 0300
Sound Server-1: ALSA v: k5.15.13-zen1-1-zen running: yes
Sound Server-2: JACK v: 1.9.19 running: no
Sound Server-3: PulseAudio v: 15.0 running: no
Sound Server-4: PipeWire v: 0.3.43 running: yes
Network:
Device-1: Intel Comet Lake PCH CNVi WiFi vendor: Rivet Networks
driver: iwlwifi v: kernel bus-ID: 00:14.3 chip-ID: 8086:06f0 class-ID: 0280
IF: wlp0s20f3 state: up mac: <filter>
Device-2: Realtek vendor: Acer Incorporated ALI driver: r8169 v: kernel
port: 3000 bus-ID: 07:00.0 chip-ID: 10ec:2600 class-ID: 0200
IF: enp7s0 state: down mac: <filter>
IF-ID-1: anbox0 state: down mac: <filter>
IF-ID-2: virbr1 state: down mac: <filter>
Bluetooth:
Device-1: Intel AX201 Bluetooth type: USB driver: btusb v: 0.8
bus-ID: 1-14:5 chip-ID: 8087:0026 class-ID: e001
Report: bt-adapter ID: hci0 rfk-id: 1 state: down
bt-service: enabled,running rfk-block: hardware: no software: yes
address: <filter>
Drives:
Local Storage: total: 1.63 TiB used: 56.84 GiB (3.4%)
SMART Message: Unable to run smartctl. Root privileges required.
ID-1: /dev/nvme0n1 maj-min: 259:0 vendor: Western Digital
model: PC SN730 SDBQNTY-512G-1014 size: 476.94 GiB block-size:
physical: 512 B logical: 512 B speed: 31.6 Gb/s lanes: 4 type: SSD
serial: <filter> rev: 11101100 temp: 24.9 C scheme: GPT
ID-2: /dev/sda maj-min: 8:0 vendor: HGST (Hitachi) model: HTS721010A9E630
size: 931.51 GiB block-size: physical: 4096 B logical: 512 B
speed: 6.0 Gb/s type: HDD rpm: 7200 serial: <filter> rev: A3J0
scheme: GPT
ID-3: /dev/sdb maj-min: 8:16 vendor: Crucial model: CT275MX300SSD4
size: 256.17 GiB block-size: physical: 512 B logical: 512 B speed: 6.0 Gb/s
type: SSD serial: <filter> rev: R060 scheme: GPT
Partition:
ID-1: / raw-size: 476.84 GiB size: 476.84 GiB (100.00%)
used: 56.79 GiB (11.9%) fs: btrfs dev: /dev/nvme0n1p5 maj-min: 259:2
ID-2: /boot/efi raw-size: 100 MiB size: 96 MiB (96.00%)
used: 51.2 MiB (53.4%) fs: vfat dev: /dev/nvme0n1p1 maj-min: 259:1
ID-3: /home raw-size: 476.84 GiB size: 476.84 GiB (100.00%)
used: 56.79 GiB (11.9%) fs: btrfs dev: /dev/nvme0n1p5 maj-min: 259:2
ID-4: /var/log raw-size: 476.84 GiB size: 476.84 GiB (100.00%)
used: 56.79 GiB (11.9%) fs: btrfs dev: /dev/nvme0n1p5 maj-min: 259:2
ID-5: /var/tmp raw-size: 476.84 GiB size: 476.84 GiB (100.00%)
used: 56.79 GiB (11.9%) fs: btrfs dev: /dev/nvme0n1p5 maj-min: 259:2
Swap:
Kernel: swappiness: 133 (default 60) cache-pressure: 100 (default)
ID-1: swap-1 type: zram size: 15.46 GiB used: 2.5 MiB (0.0%)
priority: 100 dev: /dev/zram0
Sensors:
System Temperatures: cpu: 66.0 C pch: 62.0 C mobo: N/A
Fan Speeds (RPM): N/A
Info:
Processes: 315 Uptime: 8m wakeups: 1 Memory: 15.46 GiB
used: 3.1 GiB (20.0%) Init: systemd v: 250 tool: systemctl Compilers:
gcc: 11.1.0 clang: 13.0.0 Packages: pacman: 1693 lib: 486 Shell: fish
v: 3.3.1 default: Bash v: 5.1.12 running-in: konsole inxi: 3.3.11

Have you tried "sudo nvidia-smi -r" before starting the VM?

Optimus Manager, clicked "use internal graphics", re-logged in, closed Optimus Manager.

sudo nvidia-smi -r

GPU 00000000:01:00.0 is currently in use by another process.

1 device is currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using this device and all compute applications running in the system.
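
For anyone hitting the same message: a rough way to see which processes are holding the card is to look for open file descriptors on the /dev/nvidia* device nodes. The sketch below is only an illustration; the function name gpu_holders and the PROC_ROOT override are made up here (the override exists only so the function can be tried against a fake directory tree). It needs root to see other users' processes, and `fuser -v /dev/nvidia*` should give similar information.

```shell
#!/bin/sh
# Print "PID device" for every process holding an open file descriptor
# on an NVIDIA device node. PROC_ROOT is a hypothetical override used
# only for trying the function on a fake tree; it defaults to /proc.
gpu_holders() {
    proc_root="${PROC_ROOT:-/proc}"
    for fd in "$proc_root"/[0-9]*/fd/*; do
        # Skip descriptors we cannot read (other users, vanished PIDs).
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            /dev/nvidia*)
                pid=${fd#"$proc_root"/}   # strip the leading "/proc/"
                pid=${pid%%/*}            # keep only the PID component
                echo "$pid $target"
                ;;
        esac
    done | sort -u
}
```

Source the file and run gpu_holders as root. On a setup like the one above, the X server or the display manager would be a likely entry in the list, which would also explain why nvidia-smi -r refuses to reset the card.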

Btw, if Garuda Assistant could help set up a VM with KVM and GPU passthrough with a few clicks, that would be a super useful feature. Because it's a real pain in the *ss to set up, but it puts VirtualBox to shame. Once you can run Windows at near-native performance in Linux even for games, it really takes away any argument to keep Windows in dual-boot.

I'm having several other problems, some of which I solved, some of which I can't.

Running Virtual Machine Manager only shows LXC in the list, whereas running "sudo virt-manager" shows only QEMU/KVM, so I have to use sudo on every single run. This should be the fix, but it's not solving the problem. There's also the issue that if I check "Enable XML editing", then close and re-open, the setting isn't saved.

I've wasted many hours already; still can't get GPU passthrough to work, and still can't run virt-manager without sudo. But got Windows running with VirtIO, and got samba file sharing working.

Then there's the mouse/keyboard passthrough... if I set USB passthrough on the mouse, any way to switch back to Linux? Other than unplugging and replugging the mouse.

EDIT: LOL! Solved the 'sudo' problem. Just had to start without sudo and File | Add Connection and just click Connect on QEMU/KVM. Now it works, and I can enable XML editing. Remains the GPU passthrough thing to solve.

Why would you possibly think it's acceptable to post issues from our distro on the Arch forum? No Linux forum I've ever registered for is keen on dealing with issues from users of another distro. This is nothing new and is pretty commonplace knowledge.

The Arch forum expressly states this policy during the registration process (unless things have changed since I registered there). They even have this stickied at the top of their Newbie forum:

These boards are for the support of Arch Linux, and Arch ONLY

If you have installed Archbang, Antergos, Chakra, Evo/Lution, Manjaro, Whatever, you are NOT running Arch Linux. Similarly, if you followed some random video on YouTube or used an automated script you found on a blog, you are NOT running Arch Linux, so do not expect any support, sympathy or anything but your thread being closed and told to move along.

Arch is a DIY distro: if someone else has done it for you, then showing up here asking to have your hand held for more help is just help vampirism and is not welcome.

I don't see how this could possibly be made any clearer there. I should have thought that there were more than enough notifications beforehand that doing so was uncool. This is a very poor reflection on Garuda when our users register on the Arch forum and immediately begin to flout their rules.

Our users are expected to obey the rules of other forums as well as ours here. You put a black mark on Garuda's reputation when you expect assistance for a Garuda issue on the Arch forum (in violation of their rules). We do not appreciate having Garuda's reputation sullied by Garuda users on other forums. Please never post Garuda issues on the Arch forum as this is expressly forbidden there.

Understood.

I spent way too much time on this issue. Finally made some progress with "Loading vfio-pci early".

Basically, the NVIDIA driver was loaded before it had the chance to be replaced with a stub. Following these instructions, I could get vfio-pci to take the device first.

While this works, it requires changing configuration and rebooting to switch who is using the graphic card -- essentially taking away any benefits over dual-booting.

Wondering whether it's worth spending more time on this. Is there a way to switch GPU host without being more trouble than dual-booting?

Another peculiar behavior I've seen with Garuda is when setting up Evdev passthrough. It gave me permission errors, leading me into the troubleshooting section to run libvirtd as root (or do more complex config).

The solution would have to use these scripts; unbinding and re-binding NVIDIA drivers on-the-fly.
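
For reference, the general mechanism such scripts rely on can be sketched with the kernel's sysfs interface (driver/unbind, driver_override and drivers_probe). This is not the linked scripts themselves, only a minimal illustration; the function name rebind_pci and the SYSFS_ROOT override are hypothetical, and the override is there only so the function can be tried against a fake tree. On real hardware it needs root, and the unbind step can hang if something still has the NVIDIA device open.

```shell
#!/bin/sh
# Hand a PCI device over to a given driver through sysfs.
# Usage: rebind_pci 0000:01:00.0 vfio-pci
# SYSFS_ROOT is a hypothetical override for dry runs; defaults to /sys.
rebind_pci() {
    dev="$1"
    driver="$2"
    sys="${SYSFS_ROOT:-/sys}"
    node="$sys/bus/pci/devices/$dev"

    # Detach the device from whatever driver currently owns it, if any.
    if [ -e "$node/driver" ]; then
        echo "$dev" > "$node/driver/unbind"
    fi

    # Restrict which driver may claim the device, then ask the PCI core
    # to probe it again so that driver picks it up.
    echo "$driver" > "$node/driver_override"
    echo "$dev" > "$sys/bus/pci/drivers_probe"
}
```

Every function of the card (01:00.0 and its siblings) would need the same treatment before the VM starts, and the reverse call with the host driver name afterwards, assuming that driver's module is loaded.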

To close this thread; the blacklisting problem is solved.

There is no need to blacklist; rather, the devices must be claimed by vfio-pci before the NVIDIA driver loads. That said, this was just one issue in a long list of issues.
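
As a rough illustration of what "vfio-pci first" can look like in configuration: the helper below prints a modprobe.d fragment that gives vfio-pci the device IDs and adds softdep lines so vfio-pci is loaded before the GPU drivers. The function name vfio_modprobe_conf is made up for this sketch, and the target file name is an assumption; the vfio modules also have to be present in the initramfs for this to take effect early, which depends on the initramfs tool in use.

```shell
#!/bin/sh
# Print a modprobe.d fragment that lets vfio-pci claim the given
# vendor:device IDs and asks modprobe to load vfio-pci before the
# GPU drivers. Usage: vfio_modprobe_conf 10de:1f15,10de:10f9
vfio_modprobe_conf() {
    ids="$1"
    printf 'options vfio-pci ids=%s\n' "$ids"
    for drv in nvidia nouveau; do
        printf 'softdep %s pre: vfio-pci\n' "$drv"
    done
}
```

The output could be redirected into a file such as /etc/modprobe.d/vfio.conf (the name is an assumption), followed by an initramfs rebuild and a reboot. The IDs in the usage line are the two from the kernel parameters in the first post; cards that expose more functions would need all of their IDs listed.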

For others looking into the topic, I recommend the VFIO reddit group for support, as this is a very complicated and specialized topic that really hasn't much to do with Garuda. Starting with the pinned post and the Arch page.

And be ready to spend days on this.

Bryan Steiner has a great pass through guide. I usually use that one. But you figured it out. Well done. This is usually what you have to do, instead of asking help on these forums.

I was just reading his guide, GREAT resource indeed! It would have saved me a lot of time to get to that first.

Still -- detaching hangs, so the initial problem remains. I think there's something about Garuda's configuration that keeps the GPU locked.

I saw another interesting approach: start with vfio-pci binding using the solution I posted above, and then bind when the system starts.

I now successfully have the HDMI output disabled on system start; but running unbind_vfio.sh doesn't restore my display. I'm also unable to start Optimus Manager.

Getting an entirely new issue. Will create new thread for it.

The way it works for me is I use optimus-manager to switch between integrated and nvidia. When I switch to integrated I can bind and unbind the nvidia GPU. Sometimes it doesn't work; usually that's because the GPU is being held by some application or service, as nvidia-smi reports. Re-switching to integrated usually fixes this.

Switching to Integrated with Optimus Manager wasn't solving it for me for some reason.

Posted about the new issue here; didn't want to copy all the details here.

I tried again, Optimus Manager, Integrated. Redid it 3 times.

sudo nvidia-smi -r 

GPU 00000000:01:00.0 is currently in use by another process.

1 device is currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using this device and all compute applications running in the system.

Running bind_vfio.sh freezes again.

Since I can't get GPU passthrough to work; trying a different approach.

Display Spice: Spice server, OpenGL on NVIDIA
Video: Virtio, 3D acceleration

Error starting domain: internal error: qemu unexpectedly closed the monitor: 2022-01-12T16:47:36.607503Z qemu-system-x86_64: egl: eglInitialize failed
2022-01-12T16:47:36.607548Z qemu-system-x86_64: Failed to initialize EGL render node for SPICE GL

Googling and not finding anything meaningful on that error, other than some others having it. Someone fixed it by reinstalling the graphic drivers. What's the safe way to re-install graphic drivers in Garuda?

If I set OpenGL on Intel, then it boots, but I'm still locked in 800x600, so it doesn't seem to be doing anything.

Now we're dealing with a bunch of different problems, but perhaps they're related somehow.

Under display spice for me:

<graphics type="spice" autoport="yes">
  <listen type="address"/>
  <image compression="off"/>
  <gl enable="no"/>
</graphics>

Under video for me is this:

<video>
  <model type="none"/>
</video>

Edit: I just checked my blacklists and I do have nouveau blacklisted. Though I'm not sure that has anything to do with anything really.

OK I'm trying to do it the standard way, add vfio_pci to kernel params and then not try any dynamic bind or unbind. That's the most documented approach. Recreated a fresh VM setup over the existing virtual drive.

If I do it with your params, Video=None, and Display Spice, I get this displayed in the window: "Connecting to graphical console for guest". Nothing else happens. I guess it should be taking the HDMI port to display on the TV?

If I can get the VM to work in the most standard way, I might be able to figure things out from there. So... no error right now, but why no video output?

Shutdown is working so that's a sign that the VM is in a "valid" state.

It is pretty typical that users new to Arch based distros will have some teething pains. The average new user will likely melt down their system a time or two and end up doing reinstalls until they have learned the ropes.

It is near impossible for us to speculate why you are experiencing problems as we would need to own your exact model of computer and know every single change you have made to make any real determination of the causes.

Things usually become easier as you become more familiar with how Arch works. As the old saying goes "Rome wasn't built in a day".

Does Windows have the graphics drivers installed? And have you followed Bryan Steiner's guide?

Found this. I may have to pass a custom VBIOS.

It does look like my model 2060 requires the same VBIOS as his 2070.

I've set up the VBIOS according to his guide and verified that vfio-pci is loaded:

lspci -kn | grep -A 2 01:00.

01:00.0 0300: 10de:1f15 (rev a1)
Subsystem: 10de:0000
Kernel driver in use: vfio-pci
--
01:00.1 0403: 10de:10f9 (rev a1)
Subsystem: 10de:0000
Kernel driver in use: vfio-pci
--
01:00.2 0c03: 10de:1ada (rev a1)
Subsystem: 10de:0000
Kernel driver in use: vfio-pci
--
01:00.3 0c80: 10de:1adb (rev a1)
Subsystem: 10de:0000
Kernel driver in use: vfio-pci

So far so good.
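
The same check can be scripted so it is easy to repeat after each reboot. A minimal sketch, reading the driver symlinks under sysfs instead of parsing lspci; the function name check_vfio_bound and the SYSFS_ROOT override are hypothetical, and the override is only there for dry runs on a fake tree.

```shell
#!/bin/sh
# Print the bound driver for every function of a PCI slot and return
# non-zero unless all of them are bound to vfio-pci.
# Usage: check_vfio_bound 0000:01:00
# SYSFS_ROOT is a hypothetical override for dry runs; defaults to /sys.
check_vfio_bound() {
    slot="$1"
    sys="${SYSFS_ROOT:-/sys}"
    rc=0
    for node in "$sys"/bus/pci/devices/"$slot".*; do
        [ -e "$node" ] || continue
        if [ -L "$node/driver" ]; then
            # The driver symlink points at the owning driver's directory.
            drv=$(basename "$(readlink "$node/driver")")
        else
            drv="(none)"
        fi
        echo "${node##*/} $drv"
        [ "$drv" = "vfio-pci" ] || rc=1
    done
    return "$rc"
}
```

Running `check_vfio_bound 0000:01:00` should list all four functions shown in the lspci output above; a non-zero exit status means at least one of them is still held by another driver or by none.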

If I set Video=QXL, I get a black screen.

If I set Video=none, I get "Connecting to graphical console for guest" and nothing more.

If I delete Display Spice, I get the same console with loading Windows Boot Manager.

I'm still at the same place.