
Re: [Xen-devel] [PATCH RFC v2] Add SUPPORT.md



On 23/10/17 17:22, George Dunlap wrote:
> On 09/11/2017 06:53 PM, Andrew Cooper wrote:
>> On 11/09/17 18:01, George Dunlap wrote:
>>> +### x86/RAM
>>> +
>>> +    Limit, x86: 16TiB
>>> +    Limit, ARM32: 16GiB
>>> +    Limit, ARM64: 5TiB
>>> +
>>> +[XXX: Andy to suggest what this should say for x86]
>> The limit for x86 is either 16TiB or 123TiB, depending on
>> CONFIG_BIGMEM.  CONFIG_BIGMEM is exposed via menuconfig without
>> XEN_CONFIG_EXPERT, so falls into at least some kind of support statement.
>>
>> As for practical limits, I don't think its reasonable to claim anything
>> which we can't test.  What are the specs in the MA colo?
> At the moment the "Limit" tag specifically says that it's theoretical
> and may not work.
>
> We could add another tag, "Limit-tested", or something like that.
>
> Or, we could simply have the Limit-security be equal to the highest
> amount which has been tested (either by osstest or downstreams).
>
> For simplicity's sake I'd go with the second one.

I think it would be very helpful to distinguish the upper limits from
the supported limits.  There will be a large difference between the two.

Limit-Theoretical and Limit-Supported ?

In all cases, we should identify why the limit is where it is, even if
that is only "the maximum anyone has tested to".
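
For example, the RAM section might then read something like this (numbers
purely illustrative, not a statement of what is actually supported; the
1.5TiB figure is just the XenServer test point mentioned further down):

    ### x86/RAM

        Limit-Theoretical, x86: 16TiB (123TiB with CONFIG_BIGMEM)
        Limit-Supported, x86: 1.5TiB (largest configuration regularly tested)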

>
> Shall I write an e-mail with a more direct query for the maximum amounts
> of various numbers tested by the XenProject (via osstest), Citrix, SuSE,
> and Oracle?

For XenServer,
http://docs.citrix.com/content/dam/docs/en-us/xenserver/current-release/downloads/xenserver-config-limits.pdf

>> [root@fusebot ~]# python
>> Python 2.7.5 (default, Nov 20 2015, 02:00:19)
>> [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>>>>> from xen.lowlevel.xc import xc as XC
>>>>> xc = XC()
>>>>> xc.domain_create()
>> 1
>>>>> xc.domain_max_vcpus(1, 8192)
>> 0
>>>>> xc.domain_create()
>> 2
>>>>> xc.domain_max_vcpus(2, 8193)
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> xen.lowlevel.xc.Error: (22, 'Invalid argument')
>>
>> Trying to shut such a domain down however does tickle a host watchdog
>> timeout as the for_each_vcpu() loops in domain_kill() are very long.
> For now I'll set 'Limit' to 8192, and 'Limit-security' to 512.
> Depending on what I get for the "test limit" survey I may adjust it
> afterwards.

The largest production x86 server I am aware of is a Skylake-S system
with 496 threads.  512 is not a plausibly-tested number.
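
As a side note, anyone contributing numbers to that survey can pull the host
thread count straight out of the same Python bindings as in the session
below.  Rough sketch (I'm going from memory on the physinfo() key names, so
check against your tree):

    from xen.lowlevel.xc import xc as XC

    xc = XC()
    info = xc.physinfo()
    # 'nr_cpus' should be the number of online threads on the host.
    print "Host threads: %d" % info['nr_cpus']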

>
>>> +    Limit, x86 HVM: 128
>>> +    Limit, ARM32: 8
>>> +    Limit, ARM64: 128
>>> +
>>> +[XXX Andrew Cooper: Do want to add "Limit-Security" here for some of 
>>> these?]
>> 32 for each.  64 vcpu HVM guests can exert enough p2m lock pressure to
>> trigger a 5 second host watchdog timeout.
> Is that "32 for x86 PV and x86 HVM", or "32 for x86 HVM and ARM64"?  Or
> something else?

The former.  I'm not qualified to comment on any of the ARM limits.

There are several non-trivial for_each_vcpu() loops in the domain_kill
path which aren't handled by continuations.  ISTR 128 vcpus is enough to
trip a watchdog timeout when freeing pagetables.
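
Extending the probing from the session above into a rough timing test is
easy enough, although an empty domain has no pagetables to free, so this
only exercises the plain vcpu loops rather than the interesting paths
(sketch only):

    import time
    from xen.lowlevel.xc import xc as XC

    xc = XC()
    domid = xc.domain_create()
    xc.domain_max_vcpus(domid, 8192)    # current ABI ceiling, per above
    start = time.time()
    xc.domain_destroy(domid)
    print "teardown of domain %d took %.3fs" % (domid, time.time() - start)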

>
>>> +### Virtual RAM
>>> +
>>> +    Limit, x86 PV: >1TB
>>> +    Limit, x86 HVM: 1TB
>>> +    Limit, ARM32: 16GiB
>>> +    Limit, ARM64: 1TB
>> There is no specific upper bound on the size of PV or HVM guests that I
>> am aware of.  1.5TB HVM domains definitely work, because that's what we
>> test and support in XenServer.
> Are there limits for 32-bit guests?  There's some complicated limit
> having to do with the m2p, right?

32bit PV guests need to live in MFNs under the 128G boundary, despite
the fact their p2m handling supports 4TB of RAM.

The PVinPVH plan will lift this limitation, at which point it will be
possible to have many 128G 32bit PV(inPVH) VMs on a large system. 
(OTOH, I'm not aware of any 32bit PV guest which itself supports more
than 64G of RAM, other than perhaps SLES 11.)
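
(Relatedly, if we do document why each limit is what it is: the 128G
boundary comes from the compat m2p.  128G of RAM is 32M frames, and at 4
bytes per compat m2p entry that is a 128MB table, which IIRC is all the
space the compat layout reserves for it.)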

>
>>> +
>>> +### x86 PV/Event Channels
>>> +
>>> +    Limit: 131072
>> Why do we call out event channel limits but not grant table limits? 
>> Also, why is this x86?  The 2l and fifo ABIs are arch agnostic, as far
>> as I am aware.
> Sure, but I'm pretty sure that ARM guests don't (perhaps cannot?) use PV
> event channels.

This is mixing the hypervisor API/ABI capabilities with the actual
abilities of guests (which is also different to what Linux would use in
the guests).

ARM guests, as well as x86 HVM guests with APICv (configured properly),
will actively want to avoid the event channel interface, because it's
slower.

This solitary evtchn limit serves no useful purpose IMO.
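
For what it's worth, 131072 is simply the FIFO ABI constant, not anything
x86-specific:

    2-level ABI: 64 * 64 = 4096 channels for a 64-bit guest (32 * 32 = 1024 for 32-bit)
    FIFO ABI:    2^17    = 131072 channels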

>
>>> +## High Availability and Fault Tolerance
>>> +
>>> +### Live Migration, Save & Restore
>>> +
>>> +    Status, x86: Supported
>> With caveats.  From docs/features/migration.pandoc
> This would extend the meaning of "caveats" from "when it's not security
> supported" to "when it doesn't work"; which is probably the best thing
> at the moment.

I wasn't specifically taking your meaning of caveats.

>
>> * x86 HVM with nested-virt (no relevant information included in the stream)
> [snip]
>> Also, features such as vNUMA and nested virt (which are two I know for
>> certain) have all state discarded on the source side, because they were
>> never suitably plumbed in.
> OK, I'll list these, as well as PCI pass-through.
>
> (Actually, vNUMA doesn't seem to be on the list!)
>
> And we should probably add a safety-catch to prevent a VM started with
> any of these from being live-migrated.
>
> In fact, if possible, that should be a whitelist: Any configuration that
> isn't specifically known to work with migration should cause a migration
> command to be refused.

Absolutely everything should be in whitelist form, but Xen has 14 years
of history to clean up after.
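
To illustrate the shape of what I mean (entirely hypothetical names; the
real check would belong in libxl, driven from the domain configuration):

    # Hypothetical whitelist check a toolstack could apply before migrating.
    MIGRATION_WHITELIST = set(['pv-basic', 'hvm-basic'])

    def check_migration(enabled_features):
        unsafe = sorted(set(enabled_features) - MIGRATION_WHITELIST)
        if unsafe:
            raise RuntimeError("refusing migration: %s not known to be"
                               " migration-safe" % ", ".join(unsafe))

    # e.g. check_migration(['hvm-basic', 'altp2m']) refuses, because
    # alternative p2m state is currently lost on migrate.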

> What about the following features?

What do you mean "what about"?  Do you mean "are they migrate safe?"?

Assuming that that is what you mean,

>  * Guest serial console

Which consoles?  A qemu emulated serial will be qemu's problem to deal
with.  Anything xenconsoled-based will be the guest's problem to deal
with, so pass.

>  * Crash kernels

These are internal to the guest until the point of crash, at which point
you may need SHUTDOWN_soft_reset support to crash successfully.  I don't
think there is any migration interaction.

>  * Transcendent Memory

Excluded from security support by XSA-17.

Legacy migration claimed to have TMEM migration support, but the code
was sufficiently broken that I persuaded Konrad to not block Migration
v2 on getting TMEM working again.  Its current state is "will be lost on
migrate if you try to use it", because it also turns out it is
nontrivial to work out if there are TMEM pages needing moving.

>  * Alternative p2m

Lost on migrate.

>  * vMCE

There appears to be code to move state in the migrate stream.  Whether
it works or not is an entirely different matter.

>  * vPMU

Lost on migrate.  Furthermore, levelling vPMU is far harder than
levelling CPUID.  Anything using vPMU and migrated to non-identical
hardware is likely to blow up at the destination when a previously
established PMU setting now takes a #GP fault.

>  * Intel Platform QoS

Not exposed to guests at all, so it has no migration interaction atm.

>  * Remus
>  * COLO

These are both migration protocols themselves, so don't really fit into
this category.  Anything which works in normal migration should work when
using these.

>  * PV protocols: Keyboard, PVUSB, PVSCSI, PVTPM, 9pfs, pvcalls?

Pass.  These will have far more to do with what is arranged in the
receiving dom0 by the toolstack.

PVTPM is the only one I'm aware of with state held outside of the rings,
and I'm not aware of any support for moving that state.

>  * FlASK?

I don't know what you mean by this.  Flask is a setting in the
hypervisor, and isn't exposed to the guest.

>  * CPU / memory hotplug?

We don't have memory hotplug, and CPU hotplug is complicated.  PV guests
don't have hotplug (they have "give the guest $MAX and ask it politely
to give some back"), while for HVM guests it is currently performed by
Qemu.  PVH is going to complicate things further with various bits being
performed by Xen.
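
To make the "ask it politely" part concrete, the ask is just a xenstore
write which the guest's balloon driver may or may not honour.  Something
like this, going from memory on the xs binding signatures and on
memory/target being in KiB:

    from xen.lowlevel.xs import xs

    def ask_guest_to_balloon(domid, target_kib):
        xsh = xs()
        t = xsh.transaction_start()
        # The balloon driver watches memory/target and (hopefully) adjusts.
        xsh.write(t, "/local/domain/%d/memory/target" % domid, str(target_kib))
        xsh.transaction_end(t)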

>
>> * x86 HVM guest physmap operations (not reflected in logdirty bitmap)
>> * x86 PV P2M structure changes (not noticed, stale mappings used) for
>>   guests not using the linear p2m layout
> I'm afraid this isn't really appropriate for a user-facing document.
> Users don't directly do physmap operations, nor p2m structure changes.
> We need to tell them specifically which features they can or cannot use.

I didn't intend this to be a straight copy/paste into the user-facing
document, but rather to highlight the already-known issues.

In practice, this means "no ballooning", except you've got no way of
stopping the guest using add_to/remove_from physmap on itself, so there
is nothing the toolstack can do to prevent a guest from accidentally
falling into these traps.

>
>> * x86 HVM with PoD pages (attempts to map cause PoD allocations)
> This shouldn't be any more dangerous than a guest-side sweep, should it?

Except that for XSA-150, the sweep isn't guest wide.  It is only of the
last 32 allocated frames.

>  You may waste a lot of time reclaiming zero pages, but it seems like it
> should only be a relatively minor performance issue, not a correctness
> issue.

The overwhelmingly common case is that when the migration stream tries
to map a gfn, the demand population causes a crash on the source side,
because xenforeignmemory_map() does a P2M_ALLOC lookup and can't find a
frame.

>
> The main "problem" (in terms of "surprising behavior") would be that on
> the remote side any PoD pages will actually be allocated zero pages.  So
> if your guest was booted with memmax=4096 and memory=2048, but your
> balloon driver had only ballooned down to 3000 for some reason (and then
> stopped), the remote side would want 3000 MiB (not 2048, as one might
> expect).

If there are too many frames in the migration stream, the destination
side will fail because of going over allocation.

>
>> * x86 PV ballooning (P2M marked dirty, target frame not marked)
> Er, this should probably be fixed.  What exactly is the problem here?

P2M structure changes don't cause all frames under the change to be
resent.  This is mainly a problem when ballooning out a frame (which has
already been sent in the stream), at which point we get too much memory
on the destination side, and go over allocation.

~Andrew
