
[Xen-users] CUDA Nvidia GPU computing on Xen DomU



Hi All,

I'm in a crunch trying to deploy two GeForce RTX 2080 SUPER cards on one
of my Xen DomU computing nodes. I was under the impression that GPU
passthrough for CUDA computing was supported and well documented, until
I tried to complete this exercise.

I went up and down the official documentation 

https://xenbits.xenproject.org/docs/4.13-testing/

as well as 

https://wiki.xenproject.org/wiki/Xen_PCI_Passthrough


My Xen Dom0 runs on Alpine Linux 

xen1:/etc# more alpine-release 
3.11.3
xen1:/etc# uname -a
Linux xen1.int.autonsys.com 5.4.12-1-lts #2-Alpine SMP Thu, 16 Jan 2020
12:53:54 UTC x86_64 Linux

xen1:/boot# more /boot/extlinux.conf
# Generated by update-extlinux 6.04_pre1-r6
DEFAULT menu.c32
PROMPT 0
MENU TITLE Alpine/Linux Boot Menu
MENU HIDDEN
MENU AUTOBOOT Alpine will be booted automatically in # seconds.
TIMEOUT 30
LABEL xen-lts
  MENU LABEL Xen + Linux lts
  COM32 mboot.c32
  APPEND xen.gz dom0_mem=16384M --- vmlinuz-lts root=UUID=f1d049ca-b639-4f14-8f31-162c471373b7 modules=sd-mod,usb-storage,ext4 nomodeset quiet rootfstype=ext4 --- initramfs-lts

LABEL lts
  MENU LABEL Linux lts
  LINUX vmlinuz-lts
  INITRD initramfs-lts
  APPEND root=UUID=f1d049ca-b639-4f14-8f31-162c471373b7 modules=sd-mod,usb-storage,ext4 nomodeset quiet rootfstype=ext4

MENU SEPARATOR


xen1:/boot#  lspci | grep -i nvidia
02:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX
2080 SUPER] (rev a1)
02:00.1 Audio device: NVIDIA Corporation Device 10f8 (rev a1)
02:00.2 USB controller: NVIDIA Corporation Device 1ad8 (rev a1)
02:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad9
(rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX
2080 SUPER] (rev a1)
03:00.1 Audio device: NVIDIA Corporation Device 10f8 (rev a1)
03:00.2 USB controller: NVIDIA Corporation Device 1ad8 (rev a1)
03:00.3 Serial bus controller [0c80]: NVIDIA Corporation Device 1ad9
(rev a1)


My Xen DomU runs 

root@springdale1$ more /etc/redhat-release 
Springdale Linux release 7.7 (Verona)

root@springdale1$ uname -a
Linux springdale1.int.autonsys.com 3.10.0-1062.12.1.el7.x86_64 #1 SMP
Wed Feb 5 07:15:42 EST 2020 x86_64 x86_64 x86_64 GNU/Linux

I tried to set up GPU passthrough following 

https://wiki.xenproject.org/wiki/Xen_PCI_Passthrough 

modprobe xen-pciback
xl pci-assignable-add 02:00.0
xl pci-assignable-add 02:00.1
xl pci-assignable-add 02:00.2
xl pci-assignable-add 03:00.0
xl pci-assignable-add 03:00.1
xl pci-assignable-add 03:00.2

xen1:~# xl pci-assignable-list
0000:02:00.2
0000:03:00.3
0000:03:00.1
0000:02:00.3
0000:02:00.1
0000:03:00.2
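
Each of the two cards exposes four PCI functions (.0 through .3), so the
pci-assignable-add calls above can be generated in a loop instead of typed
one by one. This is only a sketch: the helper names gpu_functions and run and
the DRY_RUN variable are made up for illustration, and by default it prints
the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical helper: print every function (.0-.3) of each PCI slot given.
gpu_functions() {
  for slot in "$@"; do
    for fn in 0 1 2 3; do
      printf '%s.%s\n' "$slot" "$fn"
    done
  done
}

# Hypothetical wrapper: only echo the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

run modprobe xen-pciback
for bdf in $(gpu_functions 02:00 03:00); do
  run xl pci-assignable-add "$bdf"
done
```

Running it with DRY_RUN=0 in Dom0 would execute the same modprobe and xl
commands shown above, for all eight functions.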

# Add this to config file. Nothing else
pci=['02:00.0','03:00.0']


xen1:/boot# more /etc/xen/my-guests/auto/springdale1.cfg 
type = "hvm"
name="springdale-1"
vcpus=16
# memory=65536
memory=262144
# gfx_passthru=1
#
pci=['02:00.0','02:00.1','02:00.2','02:00.3','03:00.0','03:00.1','03:00.2','03:00.3']
pci=['02:00.0','03:00.0']
# disk = [ '/dev/disk/by-uuid/63f65160-e3ee-4458-b5f1-8b5b9d934563,raw,xvda,rw'
disk=['/dev/sda,raw,xvda,rw']
vif=['mac=00:16:3e:10:5f:95, bridge=br0']
# on_poweroff="destroy"
on_reboot="restart"
on_crash="restart"

root@springdale1$  lspci | grep -i nvidia
00:05.0 VGA compatible controller: NVIDIA Corporation Device 1e81 (rev
a1)
00:06.0 VGA compatible controller: NVIDIA Corporation Device 1e81 (rev
a1)


This is already problematic. A normal non-virtualized host would report
the GPU cards differently, like this:

root@gpu19$ lspci | grep -i nvidia
18:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX
2080 Ti] (rev a1)
18:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio
Controller (rev a1)
18:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Controller (rev
a1)
18:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 UCSI
Controller (rev a1)
3b:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX
2080 Ti] (rev a1)
3b:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio
Controller (rev a1)
3b:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Controller (rev
a1)
3b:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 UCSI
Controller (rev a1)
86:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX
2080 Ti] (rev a1)
86:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio
Controller (rev a1)
86:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Controller (rev
a1)
86:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 UCSI
Controller (rev a1)
af:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX
2080 Ti] (rev a1)
af:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio
Controller (rev a1)
af:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Controller (rev
a1)
af:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 UCSI
Controller (rev a1)
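
One caveat on the comparison: lspci resolves device names from the local
pci.ids database, so a bare "Device 1e81" may only mean that the guest has
an older database (Dom0 itself prints "Device 10f8" for the audio function).
Comparing numeric vendor:device IDs avoids that ambiguity. A small sketch,
where nvidia_ids is a hypothetical helper that pulls the NVIDIA (vendor
10de) IDs out of "lspci -nn" output:

```shell
#!/bin/sh
# Hypothetical helper: read "lspci -nn" output on stdin and print the
# vendor:device ID of every NVIDIA (vendor 10de) function, one per line.
nvidia_ids() {
  sed -n 's/.*\[\(10de:[0-9a-f]\{4\}\)\].*/\1/p'
}

# Run the same command in Dom0 and in the DomU and compare the two lists.
if command -v lspci >/dev/null 2>&1; then
  lspci -nn | nvidia_ids
fi
```

If the IDs match on both sides, the naming difference is cosmetic and the
RmInitAdapter failure has to be explained by something else.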


Driver compilation and CUDA installation on the virtual host go
through, but when I try to probe the card I get the following error:

root@springdale1$ nvidia-smi
Unable to determine the device handle for GPU 0000:00:05.0: Unknown
Error 

I do see the following in the Xen DomU log files:
messages:Mar  2 00:31:11 springdale1 kernel: NVRM: GPU 0000:00:05.0:
rm_init_adapter failed, device minor number 0
messages:Mar  2 00:39:32 springdale1 kernel: NVRM: loading NVIDIA UNIX
x86_64 Kernel Module  440.44  Sun Dec  8 03:38:56 UTC 2019
messages:Mar  2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:05.0:
RmInitAdapter failed! (0x23:0x56:515)
messages:Mar  2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:05.0:
rm_init_adapter failed, device minor number 0
messages:Mar  2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:06.0:
RmInitAdapter failed! (0x23:0x56:515)
messages:Mar  2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:06.0:
rm_init_adapter failed, device minor number 1
messages:Mar  2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:05.0:
RmInitAdapter failed! (0x23:0x56:515)
messages:Mar  2 00:40:11 springdale1 kernel: NVRM: GPU 0000:00:05.0:
rm_init_adapter failed, device minor number 0
messages:Mar  2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:05.0:
RmInitAdapter failed! (0x23:0x56:515)
messages:Mar  2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:05.0:
rm_init_adapter failed, device minor number 0
messages:Mar  2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:06.0:
RmInitAdapter failed! (0x23:0x56:515)
messages:Mar  2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:06.0:
rm_init_adapter failed, device minor number 1
messages:Mar  2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:05.0:
RmInitAdapter failed! (0x23:0x56:515)
messages:Mar  2 00:50:00 springdale1 kernel: NVRM: GPU 0000:00:05.0:
rm_init_adapter failed, device minor number 0
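To see at a glance which GPUs are failing and how often, the NVRM lines can
be tallied per device. A rough sketch, assuming the messages live in
/var/log/messages as on this guest; nvrm_failures is a hypothetical helper
name:

```shell
#!/bin/sh
# Hypothetical helper: read syslog lines on stdin and print
# "<PCI address> <count>" for every GPU that logged "RmInitAdapter failed".
nvrm_failures() {
  awk '/RmInitAdapter failed/ {
         for (i = 1; i < NF; i++)
           if ($i == "GPU") {
             gpu = $(i + 1)
             sub(/:$/, "", gpu)   # drop the trailing colon after the address
             count[gpu]++
             break
           }
       }
       END { for (gpu in count) print gpu, count[gpu] }' | sort
}

if [ -r /var/log/messages ]; then
  nvrm_failures < /var/log/messages
fi
```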



It seems that I need to figure out whether it is possible to pass a
parameter to Xen that hides the hypervisor from the guest. This is
definitely possible on ESXi with a flag like hypervisor.cpuid.v0 = "FALSE".

https://devtalk.nvidia.com/default/topic/982322/linux/nvidia-smi-reports-unable-to-determine-the-device-handle-for-gpu/
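
As far as I understand, that ESXi flag hides the CPUID "hypervisor present"
bit from the guest. On a Linux guest the same bit shows up as the
"hypervisor" flag in /proc/cpuinfo, so it is easy to check what the DomU
currently sees. A minimal sketch; guest_sees_hypervisor is a hypothetical
helper name, and the check says nothing about whether hiding the bit would
actually satisfy the NVIDIA driver:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given cpuinfo file (default
# /proc/cpuinfo) lists the "hypervisor" CPU flag (CPUID leaf 1, ECX bit 31).
guest_sees_hypervisor() {
  grep -E '^flags[[:space:]]*:' "${1:-/proc/cpuinfo}" 2>/dev/null \
    | grep -qw hypervisor
}

if guest_sees_hypervisor; then
  echo "hypervisor flag is visible to this guest"
else
  echo "hypervisor flag is hidden (or /proc/cpuinfo is unavailable)"
fi
```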

Ideally Xen Dom0 should completely pass the cards through to DomU.


Does anyone on this mailing list use CUDA in a Xen DomU? Could you please
give me some hints? I am finding a few bits here and there on the Internet,
but nothing coherent enough for an enterprise deployment.


Most Kind Regards,
Predrag Punosevac


 

