
Re: [Xen-users] ATI VGA Passthrough / Xen 4.2 / Linux 3.8.10



Okay, here we go!

On Fri, May 10, 2013 at 2:58 PM, Gordan Bobic <gordan@xxxxxxxxxx> wrote:
On 05/10/2013 06:54 PM, Andrew Bobulsky wrote:

    Two points here:

    1) Unlike ESXi 4.1+ (from what I can find), Xen (at least with the
    xm/xend stack) does allow the ACS requirement to be disabled.


Hehe.  It's nice to have the option to screw things up, eh? :)

Personally, I really dislike too much cleverness from software. While I understand auto-detection is handy for Ubuntu users, I want there to be a way to override things if I need to, without extensively modifying source code. I like there to be a way to tell whatever I am using to quit holding my hand and just do as it's damn well told.

And if I wanna crash this kernel I'll be damned if it doesn't come flying to the concrete at nearly four gigahertz!  Why I oughtta... ;)
 
    2) I actually have it working - for 5 minutes or so at a time. If
    the problem was the lack of ACS, it wouldn't work at all.


I just can't help but wonder if it /is/ the problem, though.  It's the
only thing I can pin down that our situations have in common as far as
its being the only "non-compatible" portion of the implementation, aside
from the nearly identical behavior, of course. Maybe the AMD driver does
some stupid stuff that ACS can mitigate?  I just wish I knew more :(

Now you got me thinking... I noticed that when the GPU starts to head toward the crash, this appears in the syslog:

May  6 16:35:51 normandy kernel: pcieport 0000:00:03.0: AER: Multiple Uncorrected (Non-Fatal) error received: id=0000

It certainly makes me wonder.

Has anyone else seen this error?

The device ID in question is:

00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 22)

which does not bode well...

Duff hardware?

Hmmm... I'll poke through my syslog at the next crash.  I tried:

cat /var/log/syslog | grep pcieport
cat /var/log/syslog.1 | grep pcieport
dmesg | grep pcieport

Nothing came back from any of those.  I'll see if I can identify any unique errors myself though!
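
For anyone repeating that search, here is a minimal sketch that also covers rotated, gzip-compressed logs and matches the AER message case-insensitively. The find_aer name is made up for illustration, and the default /var/log/syslog* path assumes a Debian-style layout; other distros may keep kernel messages elsewhere (for example /var/log/messages).

```shell
#!/bin/sh
# find_aer [FILE...]: print PCIe port / AER lines from the given logs.
# Hypothetical helper; defaults to Debian-style syslog files.
find_aer() {
    [ "$#" -gt 0 ] || set -- /var/log/syslog*
    # gzip -cdf decompresses .gz files and passes plain files through unchanged
    gzip -cdf "$@" | grep -i -E 'pcieport|AER:'
}

# Example: find_aer /var/log/syslog /var/log/syslog.1 /var/log/syslog.2.gz
```

No output only means nothing was logged in the files that were searched, not that the root port is healthy.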
  
        So what might intrigue you the most here is that while I'm stuck
        with
        a VGA device sitting behind this non-ACS compliant switch... My
        results are almost identical to yours.  Passing one of the VGA
        devices
        to the DomU, with or without the corresponding HDMI audio
        doesn't seem
        to matter, I get this:

        " it is so intermittent. It works well enough to boot up and
        work with
        a gaming type load for a few minutes. Then something happens that
        causes the VGA card to require a reset, and it all falls apart."

        Seriously :P


    And you are convinced this is to do with the availability of ACS?


Like I said, it's the only thing that I can pinpoint as being a
hindrance to compatibility.  I guess my request here is if anyone can
help me determine whether or not that's true?

What motherboard are you using? Has anyone successfully used it for VGA passthrough? I don't think the possibility of both of us having similarly duff hardware has been systematically excluded yet.

I think I said it, but I'll link here anyway: http://www.gigabyte.us/products/product-page.aspx?pid=2957#ov

As to whether or not anyone's used it for passthrough before... I've got no clue.  Probably not too many people, seeing as how I'm essentially running a custom BIOS :P
 
        It eventually likes to BSOD, usually on atikmpag.sys I think.
          Plenty
        of "an attempt was made to reset the display adapter and failed"
        blah
        blah blah.


    Yes, all too familiar.

        This happens 100% of the time if I try to boot with both
        devices attached.


    Both devices?


Yes---that is to say both of the VGA controllers from the 6990. The
relevant portion of my lspci looks like this:
http://pastebin.com/raw.php?i=GwekPNAW

OK, I get it. I seem to remember reading in the archives that dual VGA passthrough is problematic (my experience over the years shows that multiple GPUs are a false economy of highly questionable benefit).

That's actually pretty much completely accurate.  It drives me particularly up the wall because I hate running things in full screen, and crossfire basically doesn't work at all without that :P

Nonetheless, they once mined bitcoins like a pair of world-class champions ;D

Note: devices 09 and 0a are my "primary" 6990's vga controllers.  Also,
my crossfire bridge is disconnected.  I'm working with the other card,
devices 0d and 0e.  I've included the USB card as well in the list
because I'm using it, but it causes me no problems whatsoever.  For what
it's worth, that USB card works great in ESXi as well... Highpoint
enabled ACS on their PEX chips :D

    Just out of interest:

    1) Are you using a multi-socket motherboard?


Nope!  It's a Gigabyte GA-EX58-EXTREME.  It's LGA1366 with an i7 920 in
it.  VT-d support is provided through a hacked BIOS image that I found
on the web a couple years or so ago.

Having to use a hacked BIOS for VT-d support is not a good sign or a good starting point...

Technically, you're right.  AFAIK though, this particular generation of i7 chips allows for VT-d to be managed entirely by the chipset/bios.  There's no particular req (however artificial) coming out of the CPUs for this generation that stipulates VT-d can't be patched in... so I figured, "why not?"  I was modding my BIOS anyway and decided to use this one as a base because it had both VT-d and fully updated option ROMs for all my onboard stuff.  The world of BIOS modding is a very neat one; I highly suggest every nerd spend a few days there at some point in his life ;)

To the point though, it seems very well behaved on everything that isn't my 6990 :-(
 
    2) Have you tried disabling IRQ balancing
    (noirqbalance kernel parameter + disable irqbalance service)?


No clue what that is.  Can you provide any direction?  I'd be happy to
test.

In your boot loader, find the kernel and xen lines and add:

On the xen line:
noirqbalance

On the dom0 kernel line:
noirqbalance

I've gone with this in my /etc/default/grub:
 
GRUB_CMDLINE_XEN_DEFAULT="xen-pciback.permissive xen-pciback.passthrough=1 noirqbalance"

Just ran update-grub and I'll reboot and see what happens!
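
A quick way to confirm after the reboot that the option actually reached a given command line. This is a sketch only: has_param is a hypothetical helper, not part of any Xen or GRUB tooling. One assumption worth checking: as far as I know, on Debian's GRUB the GRUB_CMDLINE_XEN_DEFAULT variable feeds only the hypervisor line, while GRUB_CMDLINE_LINUX_DEFAULT feeds the dom0 kernel line, so the dom0 side may need noirqbalance added there as well.

```shell
#!/bin/sh
# has_param CMDLINE PARAM: succeed if PARAM appears as a whole word in CMDLINE.
# Hypothetical helper for illustration.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example (dom0 kernel side):
#   has_param "$(cat /proc/cmdline)" noirqbalance && echo "dom0: noirqbalance set"
```

For the hypervisor side, the Xen boot command line should appear near the top of the output of "xl dmesg" (or "xm dmesg" with the xend stack), so the same helper can be pointed at that line.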
 
 
    3) Are you assigning > 4GB of RAM to the guest? I found a post
    in the archive last night mentioning that there's an outstanding qemu
    issue with > 4GB of RAM given to the guest. I didn't get around to
    re-trying the VM with 3.5GB yet.


Yes sir.  It's got 8 GB + 1 GB for the standard video adapter.  Not sure
if that's improper, but it boots just fine with a single card, and the
5850 I plugged in for a short while seemed well behaved.  Here's a copy
of my vm config file: http://pastebin.com/bX0ayA0u

I think reducing the guest RAM to 3.5GB is worth a shot, along with only passing a single GPU device.

Done.  Will report.
 
        The first time I boot it up, the driver isn't
        installed so it'll work until just before auto-login reaches the
        desktop, but after that I can't boot at all with both VGA devices
        attached. I'd love to explore more, but I'm running out of places to
        look for solutions to my problem that don't involve my credit
        card and
        some new hardware.  In a fit of delicious irony, my problem is
        almost
        identical to yours---if only I'd bought some cheaper stuff it'd
        probably all work just great :D


    Life on the bleeding edge is hard. :(
    The thing that really bugs me is that after a fresh reboot with irq
    balancing disabled, I can get it working for a few minutes _every time_.

    After a few minutes, it'll start corrupting the screen output and
    eventually try to reset itself (sometimes even claim to succeed a few
    times), eventually fail and BSOD.


The only corrupted output I've seen is during a BSOD itself---which was
once on Server 2012---and again I saw some black lines when I zoomed in
with Chrome on a Win7 guest.  I'm not entirely convinced that the black
lines were a symptom of Xen/Radeon/Whatever versus just being a goofy
Chrome bug.

I'm seeing white lines, both with the Radeon 6450 and the Quadro 2000.

Yeah.  I'm convinced now.  They might be a different color, but they're in Chrome (which uses a GPU-accelerated 2D canvas) and they seem to precede the crash pretty reliably.
 
        The only single GPU cards I have are the Radeon 5850s in the AMD
        box I
        have.  I'm just a little reticent to tear the thing apart though
        cause
        it gets used a lot.  I think my next step is to look for a video
        card
        that properly supports FLR,


    As far as I can tell, for all the talk of it - there is NO SUCH THING.
    Somebody on the list posted lspci -vvv from their ATI FirePro card
    which shows it has no FLR, and I have just got a Quadro 2000, which also
    lacks FLR.

    The only vague mention I have seen of FLR on GPUs is on the Intel GPU on
    the very latest generation of Core i CPUs (the built in one). And even
    if that is true it's not all that useful for gaming.


Heh.  The crappiest GPU that would ever be in my system is the most
compatible?  Good grief. :P

I'm not sure about compatible, but it seems to have a feature that the others don't - then again, take that with a pinch of salt - I don't have one, and I tend not to believe such things until somebody shows me the lspci dump that proves it.
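
For reference, the lspci dump in question can be reduced to a one-word answer. A minimal sketch, assuming the usual "lspci -vvv" layout where the PCIe DevCap section ends in "FLReset+" or "FLReset-"; flr_status is a hypothetical name, and the device address in the usage comment is only an example.

```shell
#!/bin/sh
# flr_status: read `lspci -vvv` output for one device on stdin and print
# "supported", "unsupported", or "unknown" (no FLReset flag seen at all).
flr_status() {
    awk '
        /FLReset\+/ { found = "supported" }
        /FLReset-/  { if (found == "") found = "unsupported" }
        END         { print (found == "" ? "unknown" : found) }
    '
}

# Example (run as root, otherwise lspci hides the capability list):
#   lspci -vvv -s 0d:00.0 | flr_status
```

"unknown" usually means the capability list was not readable (not root) or the device is not a PCIe function.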

        though I'm considering a hard-hack: think
        of a 12v relay and a PCIe extender cable---if a D3D0 reset actually
        powers off the slot momentarily but the PSU plugs on the card
        prevent
        it from working, then I could rig up a switch that ties those plugs'
        power state into the slot itself---it's radical, yes, but
        possibly the
        most inventive solution I can think of so far.  I'm super curious to
        see if anyone more knowledgeable than myself thinks it would work,
        because it'd be super cheap to build!  As the saying goes
        though, I'll
        "cross that bridge when I come to it." :)


    Interesting. In theory, I think this _should_ work provided your PCIe
    bridges support hot-plugging.

    To be certain, you'd have to switch both the PCIe slot and (if your card
    uses it) the external power inputs.


That'd be the idea.  Assuming it works the way I think it does, I could
tap a 12v (I'm pretty sure it's 12v in there) relay into the Vcc and GND
pins of the PCIe slot and use the relay's output to switch the Vcc from
the plug-in cables off of the PSU.  Bears testing with a slightly less
expensive card, but I wouldn't be surprised to see it work!  It'd
require some case modding for sure though, as the extension cable will
get in the way of properly seating the card.  It could be possible to
build a tap that could be "slipped in" to a card's PCIe slot...  Short
of proper FLR support, this could actually very cheaply be built into
the expansion card itself.  I'd suspect that simply adding FLR would be
cheaper on the card manufacturers though. :)

Just get a case with more slot cutouts on the back than your motherboard has slots. Then feed the ribbon to the bottom so the card sits in the slot on the case that is below your motherboard - no modding required. :)

But... but!  I guess that'd require a mini(?) or MicroATX board.  I'm a full size to XL ATX (or whatever the monster-sized boards are) kind of guy.  Guess I just want more slots to pass GPUs to VMs, eh? :)

There are supposed to be some cases out there that allow for mounting of expansion cards on the end of flexible extenders.  Haven't heard about them in a couple years, but either way chances are pretty good that such cases aren't exactly affordable... they likely target enterprise customers or simply have limited runs... economy of scale and all that.  Probably the "slip-in" type of adapter/approach would be best, but I don't wanna get ahead of myself on a simple idea that may not even work :P
 
                          2) My motherboard's PCIe slots are behind
                NF200 PCIe bridges
                       (yes,
                       EVGA have decided in their infinite wisdom to put
                all 7 PCIe slots
                       behind NF200s, none are directly attached to the
                Intel NB).

                         I'm so sorry :P. NF200 has probably caused a
                lot of xen
                       tinkerers to
                         utter a few dozen cuss words apiece.

                         I can believe that. What is the solution, though?

                         The thing that drives me really nuts about the
                issues I'm seeing
                       (which may or may not be specifically related to
                the NF200) is
                       that it
                       is so intermittent. It works well enough to boot
                up and work with a
                       gaming type load for a few minutes. Then
                something happens that
                       causes
                       the VGA card to require a reset, and it all falls
                apart.

                       My solution was to buy another motherboard, I had
                no luck at all
                       passing the devices behind the NF200, and similar
                to your situation
                       all but one PCIe slot on that board was behind
                that bridge.


                   Did you not manage to get it working at all? Or was
                it just
                   intermittent like in my case? I can typically get
                about 5 minutes of
                   gaming out of my ATI card before it all goes wrong.

                   Ironically, I was thinking about an Asus Sabertooth
                with an 8-core AMD,
                   but opted to go for broke and get a couple of 6-core
                Xeons and an
                   EVGA SR-2. It turns out, a solution that is 4x more
                expensive isn't
                   actually better... :(


                I was unable to get it working at all.  The NF200 simply
                threw errors
                that 100% prevented me from passing the device.  I think
                it was missing
                a number of specific features required for passthrough,
                and I vaguely
                remember running lspci -vvv to verify what was missing.
                  Perhaps not all
                NF200's are created equal?


            The only logged issue I had with the NF200s was the lack of
            ACS, which
            can be disabled as I mentioned on this thread (at least if
            you are using
            the xm stack). After I disabled that PCI passthrough has
            been working OK.
            It's just VGA passthrough BSOD-ing after some minutes that
            is causing me
            problems.


        In reading up on the wiki, there does indeed seem to be a lot more
        info regarding the use of xl and PCI Passthrough today than the last
        time I looked.  It seems that these types of configuration
        options are
        set on a domain-by-domain basis, or even by device; docs say that
        things like VPCI vs direct PASS mapping of slot layout(?) is
        actually
        configured at the device level either in your DomU config file
        (like:
        pci = ['0:d:0.0, pci-just-forking-work-damn-you']) or via xl
        (like: xl
        pci-attach 1 0:d:0.0 pci-just-forking-work-damn-you).



    Hmm... I honestly don't think the xl way will succeed where xm is
    unstable,
    but I might give it a shot.


You'd still likely require all the "hacks" you're currently using, but
they'll all move to different places I'm guessing... if the toolstack
itself doesn't have any bearing on this (which is my suspicion) then you
don't want to go doing all the extra work for nothing, of course!

Exactly. And from what I have read so far (somebody point me to something that says otherwise), more people seem to have reported success with the xm stack than with xl (but that could just be due to the xl stack being much more recent).


        With that in mind, even though I've taken your advice and added the
        config info to my xend files, it's entirely possible---especially in
        light of what Casey said---that I'm just Doing It Wrong(TM).  It'd
        likely be beneficial for us both to compare notes on that
        regard.  If
        either of you would be willing to help, I could probably use some
        pointers... I've kinda run out of logs to look at with my current
        knowledge on the subject :P


    Certainly - what notes do you propose we compare?


I'm not completely sure.  If you can point me to the proper files to
verify that my device has the same PCIe-level compatibility issues as
yours (verify that ACS isn't available to the device and so on) then I'd
call that a step in the right direction.
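
One way to see which devices advertise ACS at all is to scan the full "lspci -vvv" dump for the "Access Control Services" capability. The sketch below assumes lspci's normal layout (unindented device header, indented capability lines); acs_devices is a hypothetical name.

```shell
#!/bin/sh
# acs_devices: read full `lspci -vvv` output on stdin and print the address
# of every device that lists an "Access Control Services" capability.
acs_devices() {
    awk '
        /^[0-9a-fA-F]/            { dev = $1 }   # unindented line starts a new device
        /Access Control Services/ { print dev }
    '
}

# Example (run as root so capabilities are visible):
#   lspci -vvv | acs_devices
```

Any bridge or switch port between the GPU and the root port that does not show up in the output is a candidate for the "no ACS" case discussed here.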

Another thing - run "lspci -vt". Can you put the card in a slot where it doesn't share a bridge with any other PCIe devices?

I don't think so.  You should see the built-in bridge... it's implied slightly up the hierarchy from the two side-by-side 6990 devices, which itself attaches to the root port at the top: http://pastebin.com/raw.php?i=4dGmneYi

 


                          What about with PCIe devices behind NF200
                bridges? I know the
                       NF200s
                       don't support PCI ACS, but that is a security
                feature (which I have
                       disabled enforcement of to get this far), and
                AFAIK shouldn't
                       actually
                       affect the basic PCI passthrough capability.

                         Question: how'd you disable ACS?  I think it
                may be causing me
                       some
                       issues.

                         Put:

                         (pci-passthrough-strict-check no)
                         (pci-dev-assign-strict-check no)

                         in /etc/xen/xend-config.sxp

                         If it was causing you issues, however, I'd
                expect you to find
                       errors
                       in logs pointing at it.
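
A small sketch for double-checking that both settings are really active (present and not commented out) in the xend config. strict_checks_disabled is a hypothetical helper; the two option names and the file path are the ones quoted above.

```shell
#!/bin/sh
# strict_checks_disabled [FILE]: succeed only if both strict-check options
# are set to "no" on uncommented lines of the xend config file.
strict_checks_disabled() {
    cfg=${1:-/etc/xen/xend-config.sxp}
    for opt in pci-passthrough-strict-check pci-dev-assign-strict-check; do
        grep -Eq "^[[:space:]]*\(${opt}[[:space:]]+no\)" "$cfg" || return 1
    done
}

# Example:
#   strict_checks_disabled && echo "strict PCI checks are off"
```

Since xend reads this file at startup, the service presumably needs a restart before a changed value takes effect.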

                       As I understand the xend-config.sxp [1] is for
                the xm toolstack and
                       deprecated Xend service.


                   xm toolstack and xend are what I am using. I have
                read reports of issues
                   with VGA passthrough using the xl stack so I didn't
                even attempt to
                   use it.


                The xm toolstack was deprecated in version 4.1.  I read
                that it had not
                been updated in months due to a lack of maintainers.


            I heard that xl is still feature-incomplete and
            experimental, and problematic with VGA passthrough.

                I did try xm back
                when I started, the passthrough worked but had the same
                problems I had
                when I began testing xl.  I have been using xl since
                then.  My logic was
                simply "why become dependent on a tool that is no longer
                maintained and
                may be removed from the next release?"


            I'm not wedded to any particular tool stack, I'm happy to
            use whatever
            works. But since libvirt and virt-manager are still using
            xm, and since
            I have seen recent reports of xl being problematic for VGA
            passthrough
            as well as there being no apparent way to disable ACS
            requirements with
            the xl stack, that rules it out for me completely at the moment.


        The xm stack was rather trying for me.  It's like it only wanted to
        throw errors at me when I did PCI stuff.  Whereas xl has seemingly
        been more than happy to do whatever I tell it.  Though I admit
        chances
        are pretty good I was just running around, haphazardly using the
        wrong
        version of python or something.  Given our nearly identical results
        thus far, I'd wager that the toolstack itself isn't really the
        source
        of our problems.  If that's true, though, the easy solution is
        likely
        out the window :(


    What distro do you use?

    <snip>


Currently running Debian Squeeze 6.0.7 x86_64, with Linux kernel 3.4.44.

OK, that's a useful reference point. I'm on EL6 using 3.8.10 (will be upgrading to 3.8.12 tonight).


Gordan


_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users

Wish me luck!

-Andrew