Re: [Xen-devel] Linux 4.1 reports wrong number of pages to toolstack
On 04/09/15 12:35, Wei Liu wrote:
> On Fri, Sep 04, 2015 at 10:35:52AM +0100, Andrew Cooper wrote:
>> On 04/09/15 09:28, Jan Beulich wrote:
>>> On 04.09.15 at 05:38, <JGross@xxxxxxxx> wrote:
>>>> On 09/04/2015 02:40 AM, Wei Liu wrote:
>>>>> This issue is exposed by the introduction of migration v2. The
>>>>> symptom is that a guest with a 32-bit 4.1 kernel can't be restored
>>>>> because it's asking for too many pages.
>>>>>
>>>>> Note that all guests have 512MB memory, which means they have
>>>>> 131072 pages. Both 3.14 tests [2] [3] get the correct number of
>>>>> pages. Like:
>>>>>
>>>>>   xc: detail: max_pfn 0x1ffff, p2m_frames 256
>>>>>   ...
>>>>>   xc: detail: Memory: 2048/131072 1%
>>>>>   ...
>>>>>
>>>>> However, in both 4.1 tests [0] [1] the number of pages is quite
>>>>> wrong.
>>>>>
>>>>> 4.1 32 bit:
>>>>>
>>>>>   xc: detail: max_pfn 0xfffff, p2m_frames 1024
>>>>>   ...
>>>>>   xc: detail: Memory: 11264/1048576 1%
>>>>>   ...
>>>>>
>>>>> It thinks it has 4096MB memory.
>>>>>
>>>>> 4.1 64 bit:
>>>>>
>>>>>   xc: detail: max_pfn 0x3ffff, p2m_frames 512
>>>>>   ...
>>>>>   xc: detail: Memory: 3072/262144 1%
>>>>>   ...
>>>>>
>>>>> It thinks it has 1024MB memory.
>>>>>
>>>>> The total number of pages is determined in libxc by calling
>>>>> xc_domain_nr_gpfns, which yanks shared_info->arch.max_pfn from the
>>>>> hypervisor. And that value is clearly touched by Linux in some way.
>>>>
>>>> Sure. shared_info->arch.max_pfn holds the number of pfns the p2m
>>>> list can handle. This is not the memory size of the domain.
>>>>
>>>>> I now think this is a bug in the Linux kernel. The biggest suspect
>>>>> is the introduction of the linear P2M. If you think this is a bug
>>>>> in the toolstack, please let me know.
>>>>
>>>> I absolutely think it is a toolstack bug. Even without the linear
>>>> p2m, things would go wrong in case a ballooned-down guest were
>>>> migrated, as shared_info->arch.max_pfn would hold the upper limit
>>>> of the guest in this case and not the current size.
>>>
>>> I don't think this necessarily is a tool stack bug, at least not in
>>> the sense implied above - since (afaik) migrating ballooned guests
>>> (at least PV ones) has been working before, there ought to be logic
>>> to skip ballooned pages (and I certainly recall having seen migration
>>> slowly move up to e.g. 50% and then skip the other half due to being
>>> ballooned, albeit that recollection certainly is from before v2). And
>>> pages above the highest populated one ought to be considered
>>> ballooned just as much. With the information provided by Wei I don't
>>> think we can judge about this, since it only shows the values the
>>> migration process starts from, not when, why, or how it fails.
>>
>> Max pfn reported by migration v2 is max pfn, not the number of pages
>> of RAM in the guest.
>
> I understand that by looking at the code. Just the log itself is very
> confusing. I propose we rename the log a bit. Maybe change "Memory" to
> "P2M" or something else?

P2M would be wrong for HVM guests. Memory was the same term used by the
legacy code iirc. "Frames" is probably the best term. It is used for the
size of the bitmaps used by migration v2, including the logdirty op
calls. All frames between 0 and max pfn will have their type queried,
and acted upon appropriately, including doing nothing if the frame was
ballooned out.

> In short, do you think this is a bug in migration v2?

There is insufficient information in this thread to say either way.
Maybe. Maybe a Linux kernel bug.

> When I looked at write_batch() I found some snippets that I thought to
> be wrong. But I didn't want to make the judgement when I didn't have a
> clear head.

write_batch() is a complicated function, but it can't usefully be split
any further. I would be happy to explain bits or expand the existing
comments, but it is also possible that it is buggy.
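For reference, the figures quoted at the top of the thread are consistent
with the usual p2m sizing of PAGE_SIZE / guest_width entries per frame. A
minimal sketch of that arithmetic (illustrative C, not the libxc
implementation; it assumes 4k pages and that the 3.14 log shown came from
the 64-bit test, the only guest width that yields 256 frames):

/* p2m_frames.c -- illustrative only, not the actual libxc code. */
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Frames needed for a p2m list covering pfns 0..max_pfn, rounded up. */
static unsigned long p2m_frames(unsigned long max_pfn,
                                unsigned long guest_width)
{
    unsigned long fpp = PAGE_SIZE / guest_width; /* p2m entries per frame */

    return (max_pfn + fpp) / fpp;
}

int main(void)
{
    static const struct {
        const char *tag;
        unsigned long max_pfn;
        unsigned long width;
    } t[] = {
        { "3.14 (64-bit)", 0x1ffffUL, 8 }, /* 256 frames,  131072 pfns  */
        { "4.1 32-bit",    0xfffffUL, 4 }, /* 1024 frames, 1048576 pfns */
        { "4.1 64-bit",    0x3ffffUL, 8 }, /* 512 frames,  262144 pfns  */
    };
    unsigned int i;

    for ( i = 0; i < sizeof(t) / sizeof(t[0]); i++ )
        printf("%s: max_pfn %#lx -> p2m_frames %lu, total frames %lu\n",
               t[i].tag, t[i].max_pfn,
               p2m_frames(t[i].max_pfn, t[i].width), t[i].max_pfn + 1);

    /* "total frames" (max_pfn + 1) is what the Memory: x/y denominator
     * reports -- the frame space migration iterates over, not the
     * domain's populated RAM. */
    return 0;
}

The "total frames" column reproduces the Memory: x/y denominators in the
logs, which is why those numbers track max_pfn rather than the domain's
512MB of RAM.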
~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel