Re: [Xen-devel] [xen-unstable test] 123379: regressions - FAIL
On 07/06/18 13:30, Juergen Gross wrote:
> On 06/06/18 11:40, Juergen Gross wrote:
>> On 06/06/18 11:35, Jan Beulich wrote:
>>>>>> On 05.06.18 at 18:19, <ian.jackson@xxxxxxxxxx> wrote:
>>>>>> test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 14
>>>> guest-saverestore.2
>>>>
>>>> I thought I would reply again with the key point from my earlier mail
>>>> highlighted, and go a bit further. The first thing to go wrong in
>>>> this was:
>>>>
>>>> 2018-05-30 22:12:49.320+0000: xc: Failed to get types for pfn batch (14 =
>>>> Bad address): Internal error
>>>> 2018-05-30 22:12:49.483+0000: xc: Save failed (14 = Bad address): Internal
>>>> error
>>>> 2018-05-30 22:12:49.648+0000: libxl-save-helper: complete r=-1: Bad address
>>>>
>>>> You can see similar messages in the other logfile:
>>>>
>>>> 2018-05-30 22:12:49.650+0000: libxl:
>>>> libxl_stream_write.c:350:libxl__xc_domain_save_done: Domain 3:saving
>>>> domain: domain responded to suspend request: Bad address
>>>>
>>>> All of these are reports of the same thing: xc_get_pfn_type_batch at
>>>> xc_sr_save.c:133 failed with EFAULT. I'm afraid I don't know why.
>>>>
>>>> There is no corresponding message in the host's serial log nor the
>>>> dom0 kernel log.
>>>
>>> I vaguely recall from the time when I had looked at the similar Windows
>>> migration issues that the guest is already in the process of being cleaned
>>> up when these occur. Commit 2dbe9c3cd2 ("x86/mm: silence a pointless
>>> warning") intentionally suppressed a log message here, and the
>>> immediately following debugging code (933f966bcd "x86/mm: add
>>> temporary debugging code to get_page_from_gfn_p2m()") was reverted
>>> a little over a month later. This wasn't a follow-up to another patch
>>> (fix), but followed the discussion rooted at
>>> https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg00324.html
>>
>> That was -ESRCH, not -EFAULT.
>
> I've looked a little bit more into this.
>
> As we are seeing EFAULT being returned by the hypervisor, this either
> means the tools are specifying an invalid address (quite unlikely)
> or the buffers are not as MAP_LOCKED as we wish them to be.
>
> Is there a way to see whether the host was experiencing some memory
> shortage, so the buffers might have been swapped out?
>
> man mmap tells me: "This implementation will try to populate (prefault)
> the whole range but the mmap call doesn't fail with ENOMEM if this
> fails. Therefore major faults might happen later on."
>
> And: "One should use mmap(2) plus mlock(2) when major faults are not
> acceptable after the initialization of the mapping."
>
> With osdep_alloc_pages() in tools/libs/call/linux.c touching all the
> hypercall buffer pages before doing the hypercall, I'm not sure this
> could be an issue.
>
> Any thoughts on that?

Ian, is there a chance to dedicate a machine to a specific test trying
to reproduce the problem?

In case we manage to get this failure in a reasonable time frame, I
guess the most promising approach would be to use a test hypervisor
producing more debug data.

If you think this is worth doing, I can write a patch.


Juergen
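
The mmap(2)/mlock(2) point above is easiest to see in code. The sketch
below is not the actual osdep_alloc_pages() from tools/libs/call/linux.c;
it is a minimal illustration, assuming a Linux/glibc environment, of the
pattern the man page recommends: MAP_LOCKED only *tries* to prefault and
lock the range, whereas an explicit mlock() either locks every page in the
range or fails, and touching the pages afterwards populates them before
they are handed to a hypercall.

#define _GNU_SOURCE        /* for MAP_LOCKED / MAP_ANONYMOUS on glibc */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper, for illustration only. */
static void *alloc_locked_pages(size_t nr_pages)
{
    size_t len = nr_pages * (size_t)sysconf(_SC_PAGESIZE);

    /* MAP_LOCKED attempts to prefault and lock the range, but the mmap
     * call does not fail if the locking part does not fully succeed, so
     * later accesses could still take major faults. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* mlock() either locks every page in the range or returns an error,
     * which is the guarantee one would want for hypercall buffers. */
    if (mlock(p, len)) {
        munmap(p, len);
        return NULL;
    }

    /* Touch every page so the buffer is populated before use. */
    memset(p, 0, len);

    return p;
}

int main(void)
{
    void *buf = alloc_locked_pages(4);
    if (!buf) {
        perror("alloc_locked_pages");
        return 1;
    }
    /* ... buf would now be used as a bounce buffer for a hypercall ... */
    return 0;
}

If the real buffers were allocated along these lines, a later major fault
(and hence an EFAULT from the hypervisor copying from/to them) should only
be possible if the locking step failed or was skipped.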