Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL
On 06/06/18 11:40, Juergen Gross wrote:
> On 06/06/18 11:35, Jan Beulich wrote:
>>>>> On 05.06.18 at 18:19, <ian.jackson@xxxxxxxxxx> wrote:
>>>>> test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 14 guest-saverestore.2
>>>
>>> I thought I would reply again with the key point from my earlier mail
>>> highlighted, and go a bit further. The first thing to go wrong in
>>> this was:
>>>
>>> 2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch (14 = Bad address): Internal error
>>> 2018-05-30 22:12:49.483+0000: xc: Save failed (14 = Bad address): Internal error
>>> 2018-05-30 22:12:49.648+0000: libxl-save-helper: complete r=-1: Bad address
>>>
>>> You can see similar messages in the other logfile:
>>>
>>> 2018-05-30 22:12:49.650+0000: libxl: libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 3:saving domain: domain responded to suspend request: Bad address
>>>
>>> All of these are reports of the same thing: xc_get_pfn_type_batch at
>>> xc_sr_save.c:133 failed with EFAULT. I'm afraid I don't know why.
>>>
>>> There is no corresponding message in the host's serial log nor the
>>> dom0 kernel log.
>>
>> I vaguely recall from the time when I had looked at the similar Windows
>> migration issues that the guest is already in the process of being cleaned
>> up when these occur. Commit 2dbe9c3cd2 ("x86/mm: silence a pointless
>> warning") intentionally suppressed a log message here, and the
>> immediately following debugging code (933f966bcd x86/mm: add
>> temporary debugging code to get_page_from_gfn_p2m()) was reverted
>> a little over a month later. This wasn't as a follow-up to another patch
>> (fix), but following the discussion rooted at
>> https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg00324.html
>
> That was -ESRCH, not -EFAULT.

I've looked a little bit more into this.
As we are seeing EFAULT returned by the hypervisor, either the tools are
specifying an invalid address (quite unlikely) or the buffers are not as
MAP_LOCKED as we would like them to be.

Is there a way to see whether the host was experiencing a memory shortage,
so that the buffers might have been swapped out?

man mmap tells me:

  "This implementation will try to populate (prefault) the whole range but
  the mmap call doesn't fail with ENOMEM if this fails. Therefore major
  faults might happen later on."

And:

  "One should use mmap(2) plus mlock(2) when major faults are not
  acceptable after the initialization of the mapping."

With osdep_alloc_pages() in tools/libs/call/linux.c touching all the
hypercall buffer pages before doing the hypercall, I'm not sure this could
be an issue.

Any thoughts on that?


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
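To illustrate the mmap(2) advice quoted in the message, here is a minimal C sketch, assuming Linux and glibc. It is not the libxencall osdep_alloc_pages() code, and the helper names alloc_locked_pages(), count_resident() and free_locked_pages() are hypothetical. It maps anonymous memory with MAP_LOCKED, touches every page, and then calls mlock(). Unlike a failed MAP_LOCKED prefault, an mlock() failure is reported. mincore(2) offers one way to check whether a buffer's pages are currently resident.

```c
/*
 * Hypothetical sketch, not libxencall code: allocate pages the way the
 * mmap(2) man page recommends when later major faults are unacceptable,
 * i.e. mmap() plus an explicit mlock(), and use mincore(2) to check that
 * the pages of the buffer are resident.
 */
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static size_t page_size(void)
{
    return (size_t)sysconf(_SC_PAGESIZE);
}

/* Allocate npages of locked memory; returns NULL (errno set) on failure. */
void *alloc_locked_pages(size_t npages)
{
    size_t size = npages * page_size();

    /* MAP_LOCKED only tries to prefault; a failed prefault is not reported. */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* Touch every page so it is populated before first use. */
    memset(p, 0, size);

    /* Unlike MAP_LOCKED, mlock() reports failure (e.g. ENOMEM, EAGAIN). */
    if (mlock(p, size) != 0) {
        int saved = errno;
        munmap(p, size);
        errno = saved;
        return NULL;
    }
    return p;
}

/* Count how many of the npages starting at p are resident in RAM. */
size_t count_resident(void *p, size_t npages)
{
    unsigned char *vec;
    size_t resident = 0;

    assert(p != NULL);
    vec = malloc(npages);
    if (vec == NULL)
        return 0;
    if (mincore(p, npages * page_size(), vec) == 0) {
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;   /* bit 0 set means "resident" */
    }
    free(vec);
    return resident;
}

/* Undo alloc_locked_pages(). */
void free_locked_pages(void *p, size_t npages)
{
    size_t size = npages * page_size();

    munlock(p, size);
    munmap(p, size);
}
```

For example, after alloc_locked_pages(4) succeeds, count_resident(buf, 4) should report all 4 pages resident, because mlock() guarantees residency until munlock(). The design point is that an explicit mlock() makes a locking failure, such as one caused by a small RLIMIT_MEMLOCK, visible when the buffer is allocated rather than later as a fault during the hypercall.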