
Re: [xen-unstable test] 164996: regressions - FAIL


  • To: Ian Jackson <iwj@xxxxxxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Wed, 22 Sep 2021 14:24:46 +0200
  • Cc: Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx, dpsmith@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Wed, 22 Sep 2021 12:24:55 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 22.09.2021 13:20, Ian Jackson wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 164996: regressions - FAIL"):
>> On 22.09.2021 01:38, Stefano Stabellini wrote:
>>> On Mon, 20 Sep 2021, Ian Jackson wrote:
>>>>> Sep 15 14:44:55.598538 [ 1613.419061] DMA32: 2788*4kB (UMEC) 890*8kB 
>>>>> (UMEC) 497*16kB (UMEC) 36*32kB (UMC) 1*64kB (C) 1*128kB (C) 9*256kB (C) 
>>>>> 7*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 33456kB
>>>>>
>>>>> there even look to be a number of higher order pages available (albeit
>>>>> without digging I can't tell what "(C)" means). Nevertheless order-4
>>>>> allocations aren't really nice.
>>>>
>>>> The host history suggests this may possibly be related to a qemu update.
>>>>
>>>> http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.html
>>
>> Stefano - as per some of your investigation detailed further down I
>> wonder whether you had seen this part of Ian's reply. (Question of
>> course then is how that qemu update had managed to get pushed.)
> 
> I looked for bisection results for this failure; per
> 
> http://logs.test-lab.xenproject.org/osstest/results/bisect/xen-unstable/test-arm64-arm64-libvirt-xsm.guest-start--debian.repeat.html
> 
> it's a heisenbug.  Also, the tests got reorganised slightly as a
> side-effect of dropping some i386 tests, so some of these tests are
> "new" from osstest's pov, although their content isn't really new.
> 
> Unfortunately, with it being a heisenbug, we won't get any useful
> bisection results, which would otherwise conclusively tell us which
> tree the problem was in.

Quite unfortunate.

>>>> The grub cfg has this:
>>>>
>>>>  multiboot /xen placeholder conswitch=x watchdog noreboot async-show-all 
>>>> console=dtuart dom0_mem=512M,max:512M ucode=scan  ${xen_rm_opts}
>>>>
>>>> It's not clear to me whether xen_rm_opts is "" or "no-real-mode edd=off".
>>>
>>> I definitely recommend increasing dom0 memory, especially as I guess
>>> the box is going to have a significant amount, far more than 4GB. I
>>> would set it to 2GB. Also the syntax on ARM is simpler, so it should be
>>> just: dom0_mem=2G
>>
>> Ian - I guess that's a relatively easy adjustment to make? I wonder
>> though whether we wouldn't want to address the underlying issue first.
>> Presumably not, because the fix would likely take quite some time to
>> propagate suitably. Yet if not, we will want to have some way of
>> verifying that an eventual fix there would have helped here.
> 
> It could propagate fairly quickly.

Is the Dom0 kernel used here a distro one or our own build of one of
the upstream trees? In the latter case I'd expect propagation to be
quite a bit faster than in the former case.

>  But I'm loath to make this change
> because it seems to me that it would be simply masking the bug.
> 
> Notably, when this goes wrong, it seems to happen after the guest has
> been started once successfully already.  So there *is* enough
> memory...

Well, there is enough memory, sure, but (transiently, it seems) not
enough contiguous chunks. The likelihood of higher-order allocations
failing increases with smaller overall memory amounts (Dom0's in this
case), afaict, unless there's (aggressive) de-fragmentation.
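
Purely as a back-of-the-envelope check (not part of osstest or the
kernel; the counts below are simply copied from the DMA32 line quoted
further up), a small stand-alone program re-doing the arithmetic: order
N corresponds to a contiguous block of 4<<N kB, the dump sums to the
reported 33456kB, and only 18 free blocks are of order 4 or higher.

/* Hypothetical stand-alone check: recompute the totals from the DMA32
 * free-list dump quoted above.  Order N means a contiguous block of
 * (4 << N) kB; counts are copied from the Sep 15 log line. */
#include <stdio.h>

int main(void)
{
    /* free blocks per order 0..10, from the quoted dump */
    const unsigned long counts[] = { 2788, 890, 497, 36, 1, 1, 9, 7, 0, 0, 0 };
    unsigned long total_kb = 0, order4_blocks = 0;
    unsigned int order;

    for (order = 0; order < sizeof(counts) / sizeof(counts[0]); order++) {
        unsigned long block_kb = 4UL << order;  /* 4 kB base page size */

        total_kb += counts[order] * block_kb;
        if (order >= 4)                         /* >= 64 kB, large enough for an order-4 request */
            order4_blocks += counts[order];
    }

    printf("total free: %lu kB\n", total_kb);               /* prints 33456, matching the log */
    printf("order-4 capable blocks: %lu\n", order4_blocks); /* prints 18 */

    return 0;
}

Note that all 18 of those blocks carry only the "(C)" annotation, which
iirc denotes the CMA migrate type in that dump, and CMA free lists are
eligible only as a fallback for movable allocations. If I'm reading the
dump right, that would fit the picture of an unmovable order-4
allocation failing despite ~33MB being nominally free.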

Jan