
Re: [Xen-devel] Bug: Limitation of <=2GB RAM in domU persists with 4.3.0



On Fri, 2013-07-26 at 10:23 +0100, Gordan Bobic wrote:
>  On Fri, 26 Jul 2013 01:21:24 +0100, Ian Campbell 
>  <ian.campbell@xxxxxxxxxx> wrote:
> > On Thu, 2013-07-25 at 23:23 +0100, Gordan Bobic wrote:
> >> Now, if I am understanding the basic nature of the problem 
> >> correctly,
> >> this _could_ be worked around by ensuring that vBAR = pBAR since in 
> >> that
> >> case there is no room for the mis-mapped memory overwrites to occur. 
> >> Is
> >> that correct?
> >
> > AIUI (which is not very well...) it's not so much vBAR=pBAR but 
> > making
> > the guest e820 (memory map) have the same MMIO holes as the host so 
> > that
> > there can't be any clash between v- or p-BAR and RAM in the guest.
> 
>  Sure, I understand that - but unless I am overlooking something,
>  vBAR=pBAR implicitly ensures that.

Not quite, because you need to ensure that guest RAM and guest MMIO
space do not overlap. So setting vBAR=pBAR is not sufficient; you also
need to ensure that there is no RAM at those addresses.

Depending on your PCI bus topology/hardware functionality it may be
sufficient to only ensure that the memory map is the same as the host's,
so long as the vBARs all fall within the MMIO regions. On other systems
you may require vBAR=pBAR in addition to that. Obviously doing both is
most likely to work.
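
To make that concrete, here is a minimal sketch (all addresses are made
up for the example) of a guest e820-style layout whose MMIO hole matches
the host's; the only point it illustrates is that no RAM entry overlaps
the region where the (v=p)BARs would live:

/* Illustrative sketch only: a guest e820-style layout whose MMIO hole
 * matches the host's, so that any vBAR (== pBAR, in the vBAR=pBAR case)
 * falls inside the hole rather than on top of guest RAM.  Addresses
 * are invented for the example. */
#include <stdint.h>

#define E820_RAM       1
#define E820_RESERVED  2   /* MMIO hole: no RAM placed here */

struct e820entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

static const struct e820entry guest_e820[] = {
    { 0x000000000ULL, 0x0C0000000ULL, E820_RAM      }, /* 0 - 3GiB: RAM            */
    { 0x0C0000000ULL, 0x040000000ULL, E820_RESERVED }, /* 3 - 4GiB: MMIO hole      */
    { 0x100000000ULL, 0x080000000ULL, E820_RAM      }, /* remaining RAM above 4GiB */
};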

>  The question, then, is what happens in the null translation instance.
>  Specifically, if the PCIe bridge/router is broken (and NF200 is, it
>  seems), it would imply that when the driver talks to the device, the
>  operation will get sent to the vBAR (=pBAR, i.e. straight to the
>  hardware). This then gets translated to the pBAR. But - with a
>  broken bridge, and vBAR=pBAR, the MMIO request hits the pBAR
>  directly from the guest. Does it then still get intercepted by
>  the hypervisor, translated (null operation), and re-transmitted?
>  If so, this would lead to the card receiving everything twice,
>  resulting either in things outright breaking or going half as
>  fast at best.

AIUI the issue is not so much a device seeing an IO access twice, but
two devices seeing the same IO access (one translated, the other
untranslated), each thinking it is for them. Which one "wins" when such
shadowing occurs will differ depending on which device (or the host CPU)
is doing the IO.

It is not the hypervisor which is intercepting and translating, but the
hardware. A single bit of hardware should never see things twice.

Perhaps a diagram (intended to be more illustrative than "real"):
           CPU
            |
        MMU & IOMMU
            |                   | RAM
BUS 1:      `---+---------------'
                |
              BRIDGE 
                |
BUS 2:          `--- BUS 2  -------------
                             |           |
                          DEVICE A    DEVICE B

vBAR->pBAR translation happens at the IOMMU.

So if the CPU accesses a RAM address it will be translated by the MMU
and go to the correct address in RAM.

Let's assume that the bridge knows that accesses it forwards on need to
be translated. So if DEVICE A tries to access RAM then the BRIDGE
will translate things (by talking to the IOMMU) and the access will
again go to the right place.

Likewise if the CPU tries to talk to DEVICE A then the MMIO accesses
will be translated and go to the right place.

However, let's imagine DEVICE B happens to have a pBAR which is the same
as the memory which DEVICE A is trying to access. Let's also assume that
the BRIDGE has a bug which allows DEVICE B to see DEVICE A's accesses
directly instead of laundering them via the IOMMU (perhaps it is really
a shared bus like I've drawn it rather than a PCIe thing with lanes
etc).

So now DEVICE A's memory access could be seen and acted on by both the
RAM (translated, probably) and DEVICE B. Weirdness will ensue: perhaps
the DMA read done via DEVICE A gets serviced by DEVICE B and not RAM, or
maybe the DMA write causes a side effect in DEVICE B. Furthermore the
"winner" might even be different for an access from DEVICE A vs an
access from the CPU etc.

This is something vaguely like the real bug, but only vaguely, because
my understanding of the real bug is a bit vague. I hope it is
illustrative of the sort of issue we are talking about.
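
If it helps, here is a toy C model of that failure mode (the addresses
and the "translation" are entirely invented); it only shows how the same
access can end up decoded by RAM when laundered through the IOMMU but
claimed by DEVICE B when a buggy bridge leaks it untranslated:

/* Toy model of the shadowing problem above -- purely illustrative,
 * not real Xen or chipset code. */
#include <stdint.h>
#include <stdio.h>

#define DEV_B_PBAR   0x0D0000000ULL      /* invented pBAR of DEVICE B */
#define DEV_B_SIZE   0x001000000ULL
#define IOMMU_OFFSET 0x080000000ULL      /* invented guest->machine fixup */

static uint64_t iommu_translate(uint64_t guest_addr)
{
    /* Stand-in for the real IOMMU page tables. */
    return guest_addr + IOMMU_OFFSET;
}

static const char *decode(uint64_t bus_addr)
{
    return (bus_addr >= DEV_B_PBAR && bus_addr < DEV_B_PBAR + DEV_B_SIZE)
           ? "DEVICE B" : "RAM";
}

int main(void)
{
    uint64_t dma_target = 0x0D0000000ULL;  /* DEVICE A's DMA, guest view */

    /* Correct bridge: access is translated before it hits the bus. */
    printf("laundered: %s\n", decode(iommu_translate(dma_target)));

    /* Buggy bridge: DEVICE B sees the untranslated address and claims it. */
    printf("leaked:    %s\n", decode(dma_target));
    return 0;
}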

> 
>  Now, all this could be a good thing or a bad thing, depending on
>  how exactly you spin it. If the bridge is broken and doesn't
>  route all the way back to the root bridge, this could actually be
>  a performance optimizing feature. If we set vBAR=pBAR and disable
>  any translation thereafter, this avoids the overhead of passing
>  everything to/from the root PCIe bridge, and we can just directly
>  DMA everything.

I'm not sure how much perf overhead there is in practice since ISTR that
the translations can be cached in the bridge and need explicit flushing
etc when they are modified. Obviously there will be some overhead but I
don't think it will be anything like doubling the traffic.

>  I'm sure there are security implications here, but since NF200
>  doesn't do PCIe ACS either, any concept of security goes out
>  the window pre-emptively.
> 
>  So, my question is:
>  1) If vBAR = pBAR, does the hypervisor still do any translation?

I would assume so.

>  I presume it does because it expects the traffic to pass up
>  from the root bridge, to the hypervisor and then back, to
>  ensure security.

NB: Not to the hypervisor (software) but to some bit of hardware which
interprets a table provided by the hypervisor.

>  If indeed it does do this, where could I
>  optionally disable it, and is there an easy to follow bit of
>  example code for how to plumb in a boot parameter option for
>  this?

I'm afraid I've no clue...

Perhaps if you started from the hypercall which the toolstacks use to
plumb stuff through, you would be able to trace it down?
XEN_DOMCTL_memory_mapping, perhaps? (I'm wary of saying too much because
there is every chance I am sending you on a wild goose chase.)
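
For what it's worth, the libxc wrapper for that domctl is roughly the
shape below -- written from memory, so double-check it against
xenctrl.h rather than taking it as gospel. For vBAR=pBAR the gfn and
mfn would simply be equal:

/* Rough sketch of the libxc entry point that ends up as a
 * XEN_DOMCTL_memory_mapping hypercall.  Parameter names and the
 * meaning of the last argument are from memory. */
#include <stdint.h>
#include <xenctrl.h>

int map_bar_identity(xc_interface *xch, uint32_t domid,
                     unsigned long bar_mfn, unsigned long nr_mfns)
{
    /* gfn == mfn gives the identity (vBAR=pBAR) mapping. */
    return xc_domain_memory_mapping(xch, domid,
                                    /* first_gfn */ bar_mfn,
                                    /* first_mfn */ bar_mfn,
                                    nr_mfns,
                                    /* add_mapping, IIRC 1 == add */ 1);
}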

>  2) Further, I'm finding myself motivated to write that
>  auto-set (as opposed to hard coded) vBAR=pBAR patch discussed
>  briefly a week or so ago (have an init script read the BAR
>  info from dom0 and put it in xenstore, plus a patch to
>  make pBAR=vBAR reservations built dynamically rather than
>  statically, based on this data. Now, I'm quite fluent in C,
>  but my familiarity with Xen soruce code is nearly non-existant
>  (limited to studying an old unsupported patch every now and then
>  in order to make it apply to a more recent code release).
>  Can anyone help me out with a high level view WRT where
>  this would be best plumbed in (which files and the flow of
>  control between the affected files)?

I'm not sure, but the places I would start are the bits of libxc which
call things like XEN_DOMCTL_memory_mapping and the bits of libxl which
call into them. It would also be worth looking at the PCI setup code in
hvmloader (tools/firmware/hvmloader/); I have a feeling that is where
the code responsible for PCI BAR allocation/layout within the guest's
memory map lives.

Perhaps you might want to implement a mode where libxl/libxc end up
writing the desired vBAR (== pBAR, in your case) values into xenstore
for hvmloader to pick up and implement. Not being a maintainer for that
area I'm not sure whether that would be acceptable or not.
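
As a sketch of the toolstack half of that idea: something like the
below would push a per-device vBAR hint into xenstore for hvmloader to
consume. The path layout and key name are entirely hypothetical --
whatever hvmloader is taught to read would define the real convention:

/* Hypothetical dom0/toolstack side: write the desired vBAR (== pBAR
 * here) for a passed-through device into xenstore.  The xenstore path
 * is invented for illustration. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <xenstore.h>

static int write_vbar_hint(uint32_t domid, const char *bdf, uint64_t pbar)
{
    struct xs_handle *xsh = xs_open(0);
    char path[128], val[32];
    bool ok;

    if (!xsh)
        return -1;

    /* e.g. /local/domain/<domid>/hvmloader/pci/<bdf>/vbar -- made up */
    snprintf(path, sizeof(path),
             "/local/domain/%u/hvmloader/pci/%s/vbar", domid, bdf);
    snprintf(val, sizeof(val), "0x%llx", (unsigned long long)pbar);

    ok = xs_write(xsh, XBT_NULL, path, val, strlen(val));
    xs_close(xsh);
    return ok ? 0 : -1;
}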

> 
>  The added bonus of this (if it can be made to work) is that
>  it might just make unmodified GeForce cards work, too,
>  which probably makes it worthwhile on its own.
> 
> >> I guess I could test this easily enough by applying the vBAR = pBAR 
> >> hack.
> >
> > Does the e820_host=1 option help? That might be PV only though, I 
> > can't
> > remember...
> 
>  Thanks for pointing this one out, I just found this post in the 
>  archives:
>  http://lists.xen.org/archives/html/xen-users/2012-08/msg00150.html
> 
>  With a broken PCIe router, would I also need iommu=soft?

I'm not sure that isn't also a PV only thing. Sorry :-/

> 
>  Gordan



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

