
Re: [Xen-devel] HVM support for e820_host (Was: Bug: Limitation of <=2GB RAM in domU persists with 4.3.0)



On Fri, 6 Sep 2013 09:20:50 -0400, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> wrote:
On Fri, Sep 06, 2013 at 01:23:19PM +0100, Gordan Bobic wrote:
On Thu, 05 Sep 2013 19:01:03 -0400, Konrad Rzeszutek Wilk
<konrad.wilk@xxxxxxxxxx> wrote:
>Gordan Bobic <gordan@xxxxxxxxxx> wrote:
>>On 09/05/2013 11:23 PM, Konrad Rzeszutek Wilk wrote:
>>>Gordan Bobic <gordan@xxxxxxxxxx> wrote:
>>>>Right, finally got around to trying this with the latest patch.
>>>>
>>>>With e820_host=0 things work as before:
>>>>
>>>>(XEN) HVM3: BIOS map:
>>>>(XEN) HVM3:  f0000-fffff: Main BIOS
>>>>(XEN) HVM3: E820 table:
>>>>(XEN) HVM3:  [00]: 00000000:00000000 - 00000000:0009e000: RAM
>>>>(XEN) HVM3:  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
>>>>(XEN) HVM3:  HOLE: 00000000:000a0000 - 00000000:000e0000
>>>>(XEN) HVM3:  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
>>>>(XEN) HVM3:  [03]: 00000000:00100000 - 00000000:e0000000: RAM
>>>>(XEN) HVM3:  HOLE: 00000000:e0000000 - 00000000:fc000000
>>>>(XEN) HVM3:  [04]: 00000000:fc000000 - 00000001:00000000: RESERVED
>>>>(XEN) HVM3:  [05]: 00000001:00000000 - 00000002:1f800000: RAM
>>>>
>>>>
>>>>I seem to be getting two different E820 table dumps with e820_host=1:
>>>>
>>>>(XEN) HVM1: BIOS map:
>>>>(XEN) HVM1:  f0000-fffff: Main BIOS
>>>>(XEN) HVM1: build_e820_table:91 got 8 op.nr_entries
>>>>(XEN) HVM1: E820 table:
>>>>(XEN) HVM1:  [00]: 00000000:00000000 - 00000000:3f790000: RAM
>>>>(XEN) HVM1:  [01]: 00000000:3f790000 - 00000000:3f79e000: ACPI
>>>>(XEN) HVM1:  [02]: 00000000:3f79e000 - 00000000:3f7d0000: NVS
>>>>(XEN) HVM1:  [03]: 00000000:3f7d0000 - 00000000:3f7e0000: RESERVED
>>>>(XEN) HVM1:  HOLE: 00000000:3f7e0000 - 00000000:3f7e7000
>>>>(XEN) HVM1:  [04]: 00000000:3f7e7000 - 00000000:40000000: RESERVED
>>>>(XEN) HVM1:  HOLE: 00000000:40000000 - 00000000:fee00000
>>>>(XEN) HVM1:  [05]: 00000000:fee00000 - 00000000:fee01000: RESERVED
>>>>(XEN) HVM1:  HOLE: 00000000:fee01000 - 00000000:ffc00000
>>>>(XEN) HVM1:  [06]: 00000000:ffc00000 - 00000001:00000000: RESERVED
>>>>(XEN) HVM1:  [07]: 00000001:00000000 - 00000001:68870000: RAM
>>>>(XEN) HVM1: E820 table:
>>>>(XEN) HVM1:  [00]: 00000000:00000000 - 00000000:0009e000: RAM
>>>>(XEN) HVM1:  [01]: 00000000:0009e000 - 00000000:000a0000: RESERVED
>>>>(XEN) HVM1:  HOLE: 00000000:000a0000 - 00000000:000e0000
>>>>(XEN) HVM1:  [02]: 00000000:000e0000 - 00000000:00100000: RESERVED
>>>>(XEN) HVM1:  [03]: 00000000:00100000 - 00000000:a7800000: RAM
>>>>(XEN) HVM1:  HOLE: 00000000:a7800000 - 00000000:fc000000
>>>>(XEN) HVM1:  [04]: 00000000:fc000000 - 00000001:00000000: RESERVED
>>>>(XEN) HVM1: Invoking ROMBIOS ...
>>>>
>>>>I cannot quite figure out what is going on here - these tables
>>>>can't both be true.
>>>>
>>>
>>>Right. The code just prints the E820 that was constructed because
>>>of the e820_host=1 parameter as the first output. Then the second
>>>one is what was constructed originally.
>>>
>>>The code that would tie in the E820 from the hypercall and then
>>>alter how hvmloader sets it up is not yet done.
>>>
>>>
>>>>Looking at the IOMEM on the host, the IOMEM begins at
>>>>0xa8000000 and
>>>>goes more or less contiguously up to 0xfec8b000.
>>>>
>>>>Looking at dmesg on domU, the e820 map more or less matches the
>>>>second dump above.
>>>
>>>Right. That is correct since the patch I sent just outputs stuff.
>>>No real changes to the E820 yet.
>>
>>I thought this did that in hvmloader/e820.c:
>>hypercall_memory_op(XENMEM_memory_map, &op);
>>
>>Gordan
>
>No. That just gets the E820 that is stashed in the hypervisor for
>the guest. The PV guest would use it, but hvmloader does not. This is
>what would need to be implemented to allow hvmloader to construct the
>E820 on its own.
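>
>For reference, the fetch itself is just one hypercall. Roughly, from
>memory (the fallback comment is my guess at what we would do there):
>
>    struct xen_memory_map op;
>    struct e820entry map[E820MAX];
>    int rc;
>
>    /* Ask Xen for the E820 the toolstack stashed for this guest
>     * (the one e820_host=1 put there). */
>    set_xen_guest_handle(op.buffer, map);
>    op.nr_entries = E820MAX;
>    rc = hypercall_memory_op(XENMEM_memory_map, &op);
>    /* rc != 0 or nr_entries == 0 means nothing was stashed, in
>     * which case the standard HVM layout gets built as before. */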

Right. So in hvmloader/e820.c we now have the host-based map in
struct e820entry map[E820MAX];

The rest of the function then goes and constructs the standard HVM
e820 map in the passed in
struct e820entry *e820

So all that needs to happen here is: if e820_host is set, fill e820[]
by copying map[] up to the hvm_info->low_mem_pgend
(or hvm_info->high_mem_pgend if it is set). I am guessing that

Right. And then the overflow would be put past 4GB. Or fill in the
E820_RAM regions with it.
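
Shape-wise, I would expect it to look something like this (untested
sketch, continuing from the fetch above; hvm_info, PAGE_SHIFT, E820_RAM
and struct e820entry are existing hvmloader bits, the loop itself is a
guess):

    /* Reuse the host layout, but clamp its RAM entries to the
     * guest's actual allocation. */
    uint64_t low_end  = (uint64_t)hvm_info->low_mem_pgend  << PAGE_SHIFT;
    uint64_t high_end = (uint64_t)hvm_info->high_mem_pgend << PAGE_SHIFT;
    unsigned int i, nr = 0;

    for ( i = 0; i < op.nr_entries; i++ )
    {
        struct e820entry e = map[i];

        if ( e.type == E820_RAM && e.addr < (1ull << 32) )
        {
            if ( e.addr >= low_end )
                continue;               /* guest has less RAM here */
            if ( e.addr + e.size > low_end )
                e.size = low_end - e.addr;  /* trim to guest RAM end */
        }
        e820[nr++] = e;
    }

    /* Whatever no longer fits below 4GB goes above it. */
    if ( high_end > (1ull << 32) )
    {
        e820[nr].addr = 1ull << 32;
        e820[nr].size = high_end - (1ull << 32);
        e820[nr].type = E820_RAM;
        nr++;
    }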

SeaBIOS and other existing stuff might break if the host map is
just copied in verbatim, so presumably I need to add/dedupe the
non-RAM parts of the maps.
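
For the dedupe, something as simple as an overlap test might be
enough (hypothetical helper):

    /* Hypothetical helper: do two E820 entries overlap at all? */
    static int e820_overlaps(const struct e820entry *a,
                             const struct e820entry *b)
    {
        return (a->addr < b->addr + b->size) &&
               (b->addr < a->addr + a->size);
    }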

Probably. Or tweak SeaBIOS to use your E820.

I don't think tweaking SeaBIOS to use a different specific map
is the way forward. As I said in the other email, my motivation
is to make something that will work in the general case, not
for the memory map in my dodgy hardware (I'm sure there are
many other poorly designed bits of hardware out there this might
be useful on ;) ).

Also you need to figure out where hvmloader constructs the ACPI and
SMBIOS tables and make sure they are within the E820_RESERVED regions.

This doesn't appear to have caused any problems - the only
problematic part is trampling over the host's _mapped_ parts
of the PCI MMIO hole. Having domU RAM everywhere else doesn't
_appear_ to cause any problems, hence why I would like to
focus my effort on making sure that the holes are mapped
while breaking nothing else if at all possible.

Is that right? Nothing else needs to happen?

HA! You are going to hit some bugs probably :-)

Hey, some degree of optimism is required for perseverance. ;)

The following questions arise:

1) What to do in case of overlaps? On my specific hardware,
the key difference in the end map will be that the hole at:
(XEN) HVM1:  HOLE: 00000000:40000000 - 00000000:fee00000
will end up being created in domU.

The hole is also known as PCI gap or MMIO region. With the
e820_host in effect you should use the host's layout and
use its hole placement. That will replicate it and make
domU's E820 hole look like the host.
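
Finding it could be as simple as taking the biggest gap below 4GB in
the host map (a sketch - it assumes map[] is sorted by address, and
pci_mem_start is, if I remember right, the variable hvmloader already
uses for where the hole starts):

    /* The host's PCI hole is the biggest gap between successive
     * host E820 entries below 4GB. */
    uint64_t prev_end = 0, best_size = 0, hole_start = 0;
    unsigned int i;

    for ( i = 0; i < op.nr_entries && map[i].addr < (1ull << 32); i++ )
    {
        if ( map[i].addr - prev_end > best_size )
        {
            best_size = map[i].addr - prev_end;
            hole_start = prev_end;
        }
        prev_end = map[i].addr + map[i].size;
    }

    if ( best_size )
        pci_mem_start = hole_start;    /* 0x40000000 on your box */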

Hmm... Now there's an idea. I _could_ just hard-code the
memory hole to match that just to see if it fixes the
problem. I rather expect, however, that this will just
move the problem.

Specifically, it is liable to make domU MMIO overlap
(without matching) the dom0 MMIO and crash the host
quite spectacularly. Unless domU decides to map MMIO
from the bottom up, in which case there's 1664MB of
MMIO space between 0x40000000 and 0xa8000000 where
MMIO will end up in domU, never overlapping the host's
map and everything will, by pure chance, work just
fine from there on.

2) Do only the holes need to be pulled from the host or
the entire map? Would hvmloader/seabios/whatever know
what to do if passed a map that is different from what
they might expect (i.e. different from what the current
hvmloader provides)? Or would this be likely to cause
extensive further breakages?

I think there are some assumptions made where the hole
starts. Those would have to be made more dynamic to deal
with a different E820 layout.

Assumptions made by what?

3) At the moment I am leaning toward just pulling in the
holes from the host e820, mirroring them in domU.

<nods>

3.1) Marking them as "reserved" would likely fix the
problem that was my primary motivation for doing this
in the first place. Having said that - with all of

That unfortunately will make them neither gaps nor MMIO regions,
meaning the kernel will scream: "You have a BAR in an E820_RESERVED
region! That is bad!", and won't set up the card.

What makes the decision in domU about where to map the PCI
devices' MMIO? SeaBIOS?

The hole needs to be replicated in the guest.
the 1GB-3GB space marked as reserved, I'm not sure where
the IOMEM would end up mapped in domU - things might just
break. If marking the dom0 hole as a hole in domU without
ensuring pBAR=vBAR, the PCI device in domU might get
mapped where another device is in dom0, which might
cause the same problem.

Right. hvmloader could (I hadn't checked the code) scan the
E820, determine that the PCI BARs are within an E820_RESERVED
region, and try to move them to a hole. Since no hole would be
found below 4GB it would remap the PCI BAR above 4GB. That -
depending on the device - could be disastrous for the device:
if it is only capable of 32-bit DMAs it will never do anything.
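
Speculatively (again, I hadn't checked the code), that scan would be
something like:

    /* Hypothetical check: does a BAR at [base, base+size) intersect
     * an E820_RESERVED range of the table we built? */
    static int bar_in_reserved(const struct e820entry *e820,
                               unsigned int nr,
                               uint64_t base, uint64_t size)
    {
        unsigned int i;

        for ( i = 0; i < nr; i++ )
            if ( e820[i].type == E820_RESERVED &&
                 base < e820[i].addr + e820[i].size &&
                 e820[i].addr < base + size )
                return 1;
        return 0;
    }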

Nvidia cards have a 32-bit 32MB BAR by default, and two 64-bit
BARs.

Looking at the different maps, I think I see what is actually
happening. In domU, the hole defaults to starting at e0000000,
and this is also where the BARs get mapped for the GPU in domU.

That implies that mirroring the host's hole at 1GB-4GB would
actually likely work (by a fluke), since the BARs would
(hopefully) get mapped at bottom (plenty of hole before the
host's mapping, 1664MB to be exact), and the rest of the hole
would never get touched, stealthily (or obliviously, depending
on how you want to look at it) avoiding trampling over the
host's BARs.

OK, I'm convinced - I'll give this a try and see how I get
on. :)

At the moment, I think the expedient thing to do is make
domU map holes as per dom0 and ignore other non-RAM

<nods>
areas. This may (by luck) or may not fix my immediate problem
(RAM in domU clobbering host's mapped IOMEM), but at
least it would cover the pre-requisite hole mapping for
the next step which is vBAR=pBAR.
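
Concretely, for the "map holes as per dom0" part I am imagining
something along these lines (rough and untested; the helper is
hypothetical and it assumes the host map[] is sorted):

    /* Is a given guest-physical address inside one of the host's
     * holes (gaps between successive host E820 entries below 4GB)?
     * Guest RAM that would land where this returns 1 would get
     * pushed above 4GB instead. */
    static int in_host_hole(const struct e820entry *map,
                            unsigned int nr, uint64_t addr)
    {
        uint64_t prev_end = 0;
        unsigned int i;

        for ( i = 0; i < nr && map[i].addr < (1ull << 32); i++ )
        {
            if ( addr >= prev_end && addr < map[i].addr )
                return 1;
            prev_end = map[i].addr + map[i].size;
        }
        return 0;
    }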

<nods>

In light of this, however, depending on the answer to 2)
above, it may not be practical for the e820_host option to do

I think it will mean you need to look in the hvmloader directory
a bit more and find all of the assumptions it makes about memory
locations. One excellent tool is 'git log -p tools/firmware/hvmloader',
as it will tell you what changes have been done to address
the memory layout construction.

I'll have a dig.

what it actually means for HVMs, at least not to the same
extent as happens for PV. It would only do a part of it
(initial vHOLE=pHOLE, to later be extended to the more
specific case of vBAR=pBAR).

Does this sound reasonable?

Yes. I think the plan you outlined is sound. The difficulty is
going to be cramming the E820 constructed by e820_host into
hvmloader and making sure that all the other parts of it (SMBIOS,
ACPI, BIOS) will be more dynamic and use dynamic locations instead
of hard-coded values.

Loads of printks can help with that :-)

This is my main concern - that other things are making assumptions
about where the holes are. At the moment it doesn't look too bad
since the only areas of conflict between (_my_) host and current
hvmloader maps are the RAM and HOLE areas, so coming up with
a generic solution that will work for my use (and hopefully
for most other people) ought to be fairly simple. Making it
actually work in the edge cases will be harder - but then again
for those cases it doesn't work at the moment anyway so erring
on the side of pragmatism may be the correct thing to do here.

The awesome thing is that it will make hvmloader a lot more
flexible. And one can extend the e820_host to construct an
E820 that is bizarre, for testing even more absurd memory
layouts (say, no RAM below 4GB).

Keep on digging! Thanks for the great analysis.

Thanks, I appreciate it. :)

Gordan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

