Re: [Xen-users] ATI VGA Passthrough / Xen 4.2 / Linux 3.8.10
On Fri, 10 May 2013 10:12:07 -0400, Andrew Bobulsky <rulerof@xxxxxxxxx> wrote:

Hello Gordan, Casey,

Thanks for posting the results Gordan, unfortunate that it isn't working as well as we hoped.

I haven't given up _quite_ yet. I discovered yesterday that it _looks like_ one of my PCIe slots is actually duff (two different GPUs both fail to detect properly in it but work fine in other slots). If it turns out to be a duff slot, there's no telling what else might be duff on the motherboard and how it might affect various things, even though several days of full load stability testing passed.

So some more bare-metal testing seems to be called for - right now I am not prepared to disregard the possibility that maybe I have a hardware issue somewhere that despite EDAC and ECC on everything, remains undetected and unreported in the logs.

I hope you manage to resolve it, though I feel the NF200 will be the larger challenge.

I hope I'll resolve it, too, but right now I am not convinced that the NF200 is actually the cause of my problems. My gut feeling says that if I can get it working for 5 minutes at a time, something less fundamental than the NF200 PCIe routers is the cause of the problems.

I don't know if I'd be so quick to jump to that conclusion.... I'll explain :) So the reason I asked about ACS enforcement is because I'm currently trying to pass my Radeon 6990 into a VM. I tried this a while back, but only with ESXi. After futzing with it for a day or two, I had to quit because while I had VT-d, and the ESXi install said Passthrough was supported, I ended up in a "this host requires a reboot before this device can be assigned to a vm" loop of some sort. Hours of investigation revealed that the PEX 8647 (or whatever it is, Google knows :P) which is the PCIe switch built in to the board of the 6990 is *supposed* to support ACS... but it's seemingly switched off.
Two points here:
1) Unlike ESXi 4.1+ (from what I can find), Xen (at least with the xm/xend stack) does allow the ACS requirement to be disabled.
2) I actually have it working - for 5 minutes or so at a time. If the problem was the lack of ACS, it wouldn't work at all.

So what might intrigue you the most here is that while I'm stuck with a VGA device sitting behind this non-ACS compliant switch... My results are almost identical to yours. Passing one of the VGA devices to the DomU, with or without the corresponding HDMI audio doesn't seem to matter, I get this: "it is so intermittent. It works well enough to boot up and work with a gaming type load for a few minutes. Then something happens that causes the VGA card to require a reset, and it all falls apart." Seriously :P

And you are convinced this is to do with the availability of ACS?

It eventually likes to BSOD, usually on atikmpag.sys I think. Plenty of "an attempt was made to reset the display adapter and failed" blah blah blah.

Yes, all too familiar.

This happens 100% of the time if I try to boot with both devices attached.

Both devices? Just out of interest:
1) Are you using a multi-socket motherboard?
2) Have you tried disabling IRQ balancing (noirqbalance kernel parameter + disable irqbalance service)?
3) Are you assigning > 4GB of RAM to the guest? I found a post in the archive last night mentioning that there's an outstanding qemu issue with > 4GB of RAM given to the guest. I didn't get around to re-trying the VM with 3.5GB yet.

The first time I boot it up, the driver isn't installed so it'll work until just before auto-login reaches the desktop, but after that I can't boot at all with both VGA devices attached. I'd love to explore more, but I'm running out of places to look for solutions to my problem that don't involve my credit card and some new hardware.
In a fit of delicious irony, my problem is almost identical to yours---if only I'd bought some cheaper stuff it'd probably all work just great :D

Life on the bleeding edge is hard. :( The thing that really bugs me is that after a fresh reboot with irqbalancing disabled, I can get it working for a few minutes _every time_. After a few minutes, it'll start corrupting the screen output and eventually try to reset itself (sometimes even claim to succeed a few times), eventually fail and BSOD.

The only single GPU cards I have are the Radeon 5850s in the AMD box I have. I'm just a little reticent to tear the thing apart though 'cause it gets used a lot. I think my next step is to look for a video card that properly supports FLR,

As far as I can tell, for all the talk of it - there is NO SUCH THING. Somebody on the list posted lspci -vvv from their ATI FirePro card which shows it has no FLR, and I have just got a Quadro 2000, which also lacks FLR. The only vague mention I have seen of FLR on GPUs is on the Intel GPU on the very latest generation of Core i CPUs (the built in one). And even if that is true it's not all that useful for gaming.

though I'm considering a hard-hack: think of a 12v relay and a PCIe extender cable---if a D3D0 reset actually powers off the slot momentarily but the PSU plugs on the card prevent it from working, then I could rig up a switch that ties those plugs' power state into the slot itself---it's radical, yes, but possibly the most inventive solution I can think of so far. I'm super curious to see if anyone more knowledgeable than myself thinks it would work, because it'd be super cheap to build! As the saying goes though, I'll "cross that bridge when I come to it." :)

Interesting. In theory, I think this _should_ work provided your PCIe bridges support hot-plugging. To be certain, you'd have to switch both the PCIe slot and (if your card uses it) the external power inputs.
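
For anyone wanting to repeat the FLR check mentioned above on their own cards, here is a rough Python sketch. It is not something posted in this thread. It assumes the usual `lspci -vvv` text layout, in which each device starts with an unindented bus address, a PCIe function-level reset capability shows up as `FLReset+` (or `FLReset-` when absent) in the DevCap flags, and ACS appears as a capability named "Access Control Services". The function name `summarize_reset_caps` is made up for illustration.

```python
import re

# Unindented device header, e.g. "01:00.0 VGA compatible controller: ..."
# (an optional 4-digit PCI domain prefix such as "0000:" is also accepted).
HEADER = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")


def summarize_reset_caps(lspci_text):
    """Map each PCI address found in `lspci -vvv` text to FLR/ACS flags.

    Assumption: "FLReset+" marks function-level reset support and a
    capability line containing "Access Control Services" marks ACS.
    """
    summary = {}
    current = None
    for line in lspci_text.splitlines():
        header = HEADER.match(line)
        if header:
            # A new device block begins; default to "not seen".
            current = header.group(1)
            summary[current] = {"flr": False, "acs": False}
            continue
        if current is None:
            continue  # text before the first device header
        if "FLReset+" in line:
            summary[current]["flr"] = True
        if "Access Control Services" in line:
            summary[current]["acs"] = True
    return summary
```

To use it, capture the output of `lspci -vvv` (run as root, otherwise the capability list is usually hidden) and pass the text to `summarize_reset_caps`. A device reported with `flr` false would match what was seen on the FirePro and Quadro 2000 above.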
2) My motherboard's PCIe slots are behind NF200 PCIe bridges (yes, EVGA have decided in their infinite wisdom to put all 7 PCIe slots behind NF200s, none are directly attached to the Intel NB).

I'm so sorry :P. NF200 has probably caused a lot of xen tinkerers to utter a few dozen cuss words a piece.

I can believe that. What is the solution, though? The thing that drives me really nuts about the issues I'm seeing (which may or may not be specifically related to the NF200) is that it is so intermittent. It works well enough to boot up and work with a gaming type load for a few minutes. Then something happens that causes the VGA card to require a reset, and it all falls apart.

My solution was to buy another motherboard, I had no luck at all passing the devices behind the NF200, and similar to your situation all but one PCIe slot on that board was behind that bridge.

Did you not manage to get it working at all? Or was it just intermittent like in my case? I can typically get about 5 minutes of gaming out of my ATI card before it all goes wrong. Ironically, I was thinking about an Asus Sabertooth with an 8-core AMD, but opted to go for broke and get a couple of 6-core Xeons and an EVGA SR-2. It turns out, a solution that is 4x more expensive isn't actually better... :(

I was unable to get it working at all. The NF200 simply threw errors that 100% prevented me from passing the device. I think it was missing a number of specific features required for passthrough, and I vaguely remember running lspci -vvv to verify what was missing. Perhaps not all NF200's are created equal?

The only logged issue I had with the NF200s was the lack of ACS, which can be disabled as I mentioned on this thread (at least if you are using the xm stack). After I disabled that PCI passthrough has been working OK.
It's just VGA passthrough BSOD-ing after some minutes that is causing me problems.

In reading up on the wiki, there does indeed seem to be a lot more info regarding the use of xl and PCI Passthrough today than the last time I looked. It seems that these types of configuration options are set on a domain-by-domain basis, or even by device; docs say that things like VPCI vs direct PASS mapping of slot layout(?) are actually configured at the device level either in your DomU config file (like: pci = ['0:d:0.0,pci-just-forking-work-damn-you']) or via xl (like: xl pci-attach 1 0:d:0.0 pci-just-forking-work-damn-you).

Hmm... I honestly don't think the xl way will succeed where xm is unstable, but I might give it a shot.

With that in mind, even though I've taken your advice and added the config info to my xend files, it's entirely possible---especially in light of what Casey said---that I'm just Doing It Wrong(TM). It'd likely be beneficial for us both to compare notes on that regard. If either of you would be willing to help, I could probably use some pointers... I've kinda run out of logs to look at with my current knowledge on the subject :P

Certainly - what notes do you propose we compare?

What about with PCIe devices behind NF200 bridges? I know the NF200s don't support PCI ACS, but that is a security feature (which I have disabled enforcement of to get this far), and AFAIK shouldn't actually affect the basic PCI passthrough capability.

Question: how'd you disable ACS? I think it may be causing me some issues.

Put:
(pci-passthrough-strict-check no)
(pci-dev-assign-strict-check no)
in /etc/xen/xend-config.sxp

If it was causing you issues, however, I'd expect you to find errors in logs pointing at it.

As I understand the xend-config.sxp [1] is for the xm toolstack and deprecated Xend service.

xm toolstack and xend are what I am using. I have read reports of issues with VGA passthrough using the xl stack so I didn't even attempt to use it.

The xm toolstack was deprecated in version 4.1.
I read that it had not been updated in months due to a lack of maintainers.

I heard that xl is still feature-incomplete and experimental, and problematic with VGA passthrough.

I did try xm back when I started, the passthrough worked but had the same problems I had when I began testing xl. I have been using xl since then. My logic was simply "why become dependent on a tool that is no-longer maintained and may be removed from the next release?"

I'm not wedded to any particular tool stack, I'm happy to use whatever works. But since libvirt and virt-manager are still using xm, and since I have seen recent reports of xl being problematic for VGA passthrough as well as there being no apparent way to disable ACS requirements with the xl stack, that rules it out for me completely at the moment.

The xm stack was rather trying for me. It's like it only wanted to throw errors at me when I did PCI stuff. Whereas xl has seemingly been more than happy to do whatever I tell it. Though I admit chances are pretty good I was just running around, haphazardly using the wrong version of python or something.

Given our nearly identical results thus far, I'd wager that the toolstack itself isn't really the source of our problems. If that's true, though, the easy solution is likely out the window :(

What distro do you use?

Gordan

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users