
Re: [Xen-users] Quadrified GTX 480 VT-d passthrough. CUDA 5.5 in Linux partial success



Here is something interesting however! If I do rmmod nvidia in the domU and then remove the PCI devices from dom0 with xl pci-detach, then add them back with xl pci-attach and run modprobe nvidia in the domU, the problem doesn't appear anymore! I can run multiple CUDA apps, nvidia-smi and everything "just works" after that.

Something is fishy around the domU boot?
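
For reference, the detach/re-attach sequence described above could be scripted roughly as follows. This is only a sketch under assumptions: the domain name and the two device addresses are taken from the domU config quoted further down, `DOMU_SSH` is a hypothetical way of reaching the guest from dom0, and the `run` helper and `DRY_RUN` switch exist purely for illustration. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the workaround: unload the driver in the guest, hot-unplug and
# re-plug both PCI functions from dom0, then load the driver again.
DOMAIN="${DOMAIN:-debian-miner}"              # "name" from the domU config
DEVICES="${DEVICES:-01:00.0 01:00.1}"         # GPU and its HDMI audio function
DOMU_SSH="${DOMU_SSH:-ssh root@debian-testing}"  # hypothetical access to the guest
DRY_RUN="${DRY_RUN:-1}"                       # 1 = print only, 0 = execute

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reattach_gpu() {
    # In the domU: release the card before it is unplugged.
    run $DOMU_SSH rmmod nvidia
    # In dom0: detach every function, then attach them again.
    for bdf in $DEVICES; do
        run xl pci-detach "$DOMAIN" "$bdf"
    done
    for bdf in $DEVICES; do
        run xl pci-attach "$DOMAIN" "$bdf"
    done
    # In the domU: load the driver against the freshly attached device.
    run $DOMU_SSH modprobe nvidia
}
```

Calling `reattach_gpu` from dom0 prints the six steps; setting `DRY_RUN=0` first would actually execute them.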


On Tue, Nov 19, 2013 at 3:25 PM, Tamas Lengyel <tamas.lengyel@xxxxxxxxxxxx> wrote:
Alright, I loaded the nouveau module and it doesn't expose reset either. Loading nvidia again afterwards had no effect; still the same problem.

root@debian-testing:~# lsmod | grep nouveau
nouveau               731557  0 
mxm_wmi                12515  1 nouveau
wmi                    13243  2 mxm_wmi,nouveau
video                  17792  1 nouveau
ttm                    58566  1 nouveau
drm_kms_helper         31837  1 nouveau
i2c_algo_bit           12841  1 nouveau
drm                   211856  4 ttm,drm_kms_helper,nvidia,nouveau
button                 12944  1 nouveau
i2c_core               24353  6 drm,i2c_piix4,drm_kms_helper,i2c_algo_bit,nvidia,nouveau
root@debian-testing:~# ls /sys/devices/pci0000\:00/0000\:00\:04.0/
boot_vga  d3cold_allowed  enable modalias   rescan  resource3 subsystem_device
broken_parity_status  device  firmware_node  msi_bus    resource  resource3_wc subsystem_vendor
class  dma_mask_bits   irq numa_node  resource0  resource5 uevent
config  driver  local_cpulist  power    resource1  rom vendor
consistent_dma_mask_bits  drm  local_cpus remove     resource1_wc  subsystem



On Tue, Nov 19, 2013 at 2:48 PM, Gordan Bobic <gordan@xxxxxxxxxx> wrote:
Actually - try something simpler first - just unload and reload the
nvidia.ko driver, see if that resets the card back into a CUDA-ble
state.


On Tue, 19 Nov 2013 13:47:18 +0000, Gordan Bobic <gordan@xxxxxxxxxx> wrote:
I can't remember how it's all symlinked, but I normally
find it under somewhere like:

/sys/devices/pci0000:00/0000:00:03.0/0000:0b:00.0/0000:0c:02.0/0000:0d:00.0/reset

(the path reflects PCI bridges along the way - yes, I have a card behind 3 PCIe
bridges on my motherboard (5520->NF200->NF200->GPU) - and that's not even the
GTX690 - that would add at least one more bridge to the path - madness)

If the nvidia driver isn't exposing it, you could try unloading the nvidia
driver, loading the nouveau driver (make sure mode switching is disabled so
it doesn't get bound into a non-loadable state by the console), issuing a
reset (if that exposes a reset node, which IIRC it does on Fermi+ GPUs),
unloading nouveau, and reloading nvidia.ko. Then see if it works after that.

Gordan
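
The driver-swap sequence above, run as root inside the domU, might look roughly like this. It is a sketch under assumptions: the address 0000:00:04.0 is the guest-side address from the lspci output quoted later in this thread, `modeset=0` is the nouveau module parameter that disables mode switching, and `RESET_NODE`, `DRY_RUN` and the `run` helper are illustration only. If no reset node shows up, the reset step is skipped rather than guessed at.

```shell
#!/bin/sh
# Sketch: swap nvidia for nouveau, reset the card if sysfs allows it,
# then swap back. Prints the steps unless DRY_RUN=0.
BDF="${BDF:-0000:00:04.0}"                                   # guest-side PCI address
RESET_NODE="${RESET_NODE:-/sys/bus/pci/devices/$BDF/reset}"  # assumed location
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

swap_and_reset() {
    run rmmod nvidia
    # Keep kernel mode setting off so the console does not grab the device.
    run modprobe nouveau modeset=0
    if [ -e "$RESET_NODE" ]; then
        run sh -c "echo 1 > '$RESET_NODE'"
    else
        echo "no reset node at $RESET_NODE, skipping reset" >&2
    fi
    run rmmod nouveau
    run modprobe nvidia
}
```

With the defaults, `swap_and_reset` only lists the commands, which makes it easy to check whether the reset step would be reached before running anything for real.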

On Tue, 19 Nov 2013 14:22:48 +0100, Tamas Lengyel
<tamas.lengyel@xxxxxxxxxxxx> wrote:
I don't see reset unfortunately:

ls /sys/module/nvidia/drivers/pci:nvidia/0000:00:04.0
boot_vga   d3cold_allowed  enable  i2c-3 msi_bus    rescan
resource3     subsystem_device
broken_parity_status   device   firmware_node  irq msi_irqs  
resource  resource3_wc  subsystem_vendor
class   dma_mask_bits   i2c-0  local_cpulist numa_node  resource0
resource5     uevent
config   driver   i2c-1  local_cpus power    resource1  rom     
  vendor
consistent_dma_mask_bits  drm   i2c-2  modalias remove   
resource1_wc  subsystem

On Tue, Nov 19, 2013 at 11:32 AM, Gordan Bobic  wrote:
 Does the nvidia binary driver provide a reset handle for the device via sysfs?
 If you echo 1 into it, does it help or does it crash things?
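
 A quick way to answer the first question is to look for a `reset` attribute under the device's sysfs directory. The sketch below assumes the flat `/sys/bus/pci/devices/<address>` view, whose entries are symlinks into the full bridge hierarchy; the `SYSFS` override and the `reset_node` name are hypothetical and only there so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Print the path of the sysfs reset attribute for a PCI device, if the
# kernel exposes one; otherwise report on stderr and return non-zero.
SYSFS="${SYSFS:-/sys/bus/pci/devices}"

reset_node() {
    # $1: full PCI address as seen in the guest, e.g. 0000:00:04.0
    node="$SYSFS/$1/reset"
    if [ -e "$node" ]; then
        printf '%s\n' "$node"
    else
        echo "no reset attribute for $1" >&2
        return 1
    fi
}
```

 As root, `node="$(reset_node 0000:00:04.0)" && echo 1 > "$node"` would then trigger the reset only when the attribute actually exists.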

 On Tue, 19 Nov 2013 10:32:31 +0100, Tamas Lengyel  wrote:

 Hi everyone,
 after following in the footsteps of this discussion
 (http://lists.xenproject.org/archives/html/xen-users/2013-09/msg00106.html)
 I was able to turn my GTX 480 into a Quadro 6000. When I pass it through
 with VT-d to a Debian jessie VM it shows up fine, and CUDA 5.5 seems to
 function properly up to a point:

 lspci -v:

 00:04.0 VGA compatible controller: NVIDIA Corporation GF100GL [Quadro 6000] (rev a3) (prog-if 00 [VGA controller])
     Subsystem: ASUSTeK Computer Inc. Device 075f
     Physical Slot: 4
     Flags: bus master, fast devsel, latency 0, IRQ 32
     Memory at ee000000 (32-bit, non-prefetchable) [size=32M]
     Memory at e0000000 (64-bit, prefetchable) [size=128M]
     Memory at e8000000 (64-bit, prefetchable) [size=64M]
     I/O ports at c100 [size=128]
     Expansion ROM at f1000000 [disabled] [size=512K]
     Capabilities: [60] Power Management version 3
     Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
     Capabilities: [78] Express Endpoint, MSI 00
     Capabilities: [b4] Vendor Specific Information: Len=14
     Kernel driver in use: nvidia

 00:05.0 Audio device: NVIDIA Corporation GF100 High Definition Audio Controller (rev a1)
     Subsystem: ASUSTeK Computer Inc. Device 075f
     Physical Slot: 5
     Flags: bus master, fast devsel, latency 0, IRQ 37
     Memory at f1080000 (32-bit, non-prefetchable) [size=16K]
     Capabilities: [60] Power Management version 3
     Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
     Capabilities: [78] Express Endpoint, MSI 00
     Kernel driver in use: snd_hda_intel

 NVIDIA_CUDA-5.5_Samples/1_Utilities/deviceQuery# ./deviceQuery
 ./deviceQuery Starting...

  CUDA Device Query (Runtime API) version (CUDART static linking)

 Detected 1 CUDA Capable device(s)

 Device 0: "Quadro 6000"
   CUDA Driver Version / Runtime Version          6.0 / 5.5
   CUDA Capability Major/Minor version number:    2.0
   Total amount of global memory:                 1536 MBytes (1610285056 bytes)
   (15) Multiprocessors, ( 32) CUDA Cores/MP:     480 CUDA Cores
   GPU Clock rate:                                1401 MHz (1.40 GHz)
   Memory Clock rate:                             1848 Mhz
   Memory Bus Width:                              384-bit
   L2 Cache Size:                                 786432 bytes
   Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
   Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
   Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
   Total amount of constant memory:               65536 bytes
   Total amount of shared memory per block:       49152 bytes
   Total number of registers available per block: 32768
   Warp size:                                     32
   Maximum number of threads per multiprocessor:  1536
   Maximum number of threads per block:           1024
   Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
   Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
   Maximum memory pitch:                          2147483647 bytes
   Texture alignment:                             512 bytes
   Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
   Run time limit on kernels:                     No
   Integrated GPU sharing Host Memory:            No
   Support host page-locked memory mapping:       Yes
   Alignment requirement for Surfaces:            Yes
   Device has ECC support:                        Disabled
   Device supports Unified Addressing (UVA):      Yes
   Device PCI Bus ID / PCI location ID:           0 / 4
   Compute Mode:
      < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

 deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = Quadro 6000
 Result = PASS

 Unfortunately if I try to run any CUDA app or even nvidia-smi
  afterwards, I get the following errors:

 NVIDIA_CUDA-5.5_Samples/1_Utilities/deviceQuery# ./deviceQuery
  ./deviceQuery Starting...

  CUDA Device Query (Runtime API) version (CUDART static linking)

 cudaGetDeviceCount returned 10
 -> invalid device ordinal
  Result = FAIL

 # nvidia-smi
 Unable to determine the device handle for GPU 0000:00:04.0: The NVIDIA
 kernel module detected an issue with GPU interrupts.Consult the
 "Common Problems" Chapter of the NVIDIA Driver README for
 details and steps that can be taken to resolve this issue.

 If I restart the VM I can run a single CUDA app again, once. It's still
 pretty impressive to be able to do that without having to patch Xen or
 reboot the entire machine =) It doesn't seem to matter what CUDA app I'm
 running, here is matrixMul for example:

 matrixMul# ./matrixMul
  [Matrix Multiply Using CUDA] - Starting...
 GPU Device 0: "Quadro 6000" with compute capability 2.0

 MatrixA(320,320), MatrixB(640,320)
 Computing result using CUDA Kernel...
  done
 Performance= 227.22 GFlop/s, Time= 0.577 msec, Size= 131072000 Ops,
  WorkgroupSize= 1024 threads/block
 Checking computed result for correctness: Result = PASS

 Note: For peak performance, please refer to the matrixMulCUBLAS
 example.

 Anyhoo, does anyone have any idea what I might be able to tweak to avoid
 this issue? The setup clearly seems to work for the most part.

 My domU config:

 arch = 'x86_64'
 name = "debian-miner"
 builder = "hvm"
 maxmem = 512
 memory = 512
 vcpus = 1
 maxcpus = 1
 boot = "cd"
 pae=1
 acpi = 1
 apic = 1
 hap=1
 hpet=1
 shadow_memory = 32
 vnc=1
 vncunused=1
 vnclisten="0.0.0.0"
 vif = [ 'type=netfront,bridge=xenbr0,mac=00:16:3e:12:c3:fa']
 device_model_version="qemu-xen-traditional"
 gfx_passthru=0
 xen_platform_pci=1
 pci  = [ '01:00.0', '01:00.1' ]
 pci_msitranslate = 1
 pci_power_mgmt = 1
 pci_permissive = 1
 xen_extended_power_mgmt = 1
 acpi_s3 = 1
 acpi_s4 = 1
 disk = [ 'phy:/dev/t0vg/debian-testing,xvda,w' ];

 And I'm running on Xen 4.3.1 with NVIDIA driver 331.20 x86_64 in the domU.

 Thanks and cheers!








 

