Re: [xen-unstable test] 164996: regressions - FAIL

On Mon, 20 Sep 2021, Ian Jackson wrote:
> Jan Beulich writes ("Re: [xen-unstable test] 164996: regressions - FAIL"):
> > As per
> > 
> > Sep 15 14:44:55.502598 [ 1613.322585] Mem-Info:
> > Sep 15 14:44:55.502643 [ 1613.324918] active_anon:5639 inactive_anon:15857 isolated_anon:0
> > Sep 15 14:44:55.514480 [ 1613.324918]  active_file:13286 inactive_file:11182 isolated_file:0
> > Sep 15 14:44:55.514545 [ 1613.324918]  unevictable:0 dirty:30 writeback:0 unstable:0
> > Sep 15 14:44:55.526477 [ 1613.324918]  slab_reclaimable:10922 slab_unreclaimable:30234
> > Sep 15 14:44:55.526540 [ 1613.324918]  mapped:11277 shmem:10975 pagetables:401 bounce:0
> > Sep 15 14:44:55.538474 [ 1613.324918]  free:8364 free_pcp:100 free_cma:1650
> > 
> > the system doesn't look to really be out of memory; as per
> > 
> > Sep 15 14:44:55.598538 [ 1613.419061] DMA32: 2788*4kB (UMEC) 890*8kB (UMEC) 497*16kB (UMEC) 36*32kB (UMC) 1*64kB (C) 1*128kB (C) 9*256kB (C) 7*512kB (C) 0*1024kB 0*2048kB 0*4096kB = 33456kB
> > 
> > there even look to be a number of higher order pages available (albeit
> > without digging I can't tell what "(C)" means). Nevertheless order-4
> > allocations aren't really nice.
>
> The host history suggests this may possibly be related to a qemu update.
> http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.html
>
> > What I can't see is why this may have started triggering recently. Was
> > the kernel updated in osstest? Is 512Mb of memory perhaps a bit too
> > small for a Dom0 on this system (with 96 CPUs)? Going through the log
> > I haven't been able to find crucial information like how much memory
> > the host has or what the hypervisor command line was.
>
> Logs from last host examination, including a dmesg:
> http://logs.test-lab.xenproject.org/osstest/results/host/rochester0.examine/
> Re the command line, does Xen not print it ?
> The bootloader output seems garbled in the serial log.
> Anyway, I think Xen is being booted EFI judging by the grub cfg:
> http://logs.test-lab.xenproject.org/osstest/logs/165002/test-arm64-arm64-libvirt-raw/rochester0--grub.cfg.1
> which means that it is probably reading this:
> http://logs.test-lab.xenproject.org/osstest/logs/165002/test-arm64-arm64-libvirt-raw/rochester0--xen.cfg
> which gives this specification of the command line:
>   options=placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=512M,max:512M ucode=scan
> The grub cfg has this:
>  multiboot /xen placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=512M,max:512M ucode=scan  ${xen_rm_opts}
> It's not clear to me whether xen_rm_opts is "" or "no-real-mode edd=off".

I definitely recommend increasing dom0 memory, especially as I guess
the box has a significant amount of RAM, far more than 4GB. I would
set it to 2GB. Also, the syntax on ARM is simpler, so it should be
just: dom0_mem=2G
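
For reference, a sketch of what the options line from the xen.cfg quoted
above would look like with that change, assuming nothing else on the
line is touched (untested on rochester0):

```
options=placeholder conswitch=x watchdog noreboot async-show-all console=dtuart dom0_mem=2G ucode=scan
```

The multiboot line in the grub cfg would presumably need the same
substitution.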

In addition, I also did some investigation just in case there is
actually a bug in the code and it is not a simple OOM problem.

Looking at the recent OSSTests results, the first failure is:

Indeed, the failure is the same test-arm64-arm64-libvirt-raw which is
still failing in more recent tests:

But if we look at the commit id of flight 164951, it is
6d45368a0a89e01a3a01d156af61fea565db96cc "xsm: drop dubious xsm_op_t
type" by Daniel P. Smith (CCed).

It is interesting because:
- it is *before* all the recent ARM patch series
- it is only 4 commits after master

The 4 commits are:

2021-09-10 16:12 Daniel P. Smith   o xsm: drop dubious xsm_op_t type
2021-09-10 16:12 Daniel P. Smith   o xsm: remove remnants of xsm_memtype hook
2021-09-10 16:12 Daniel P. Smith   o xsm: remove the ability to disable flask
2021-09-10 16:12 Andrew Cooper     o xen: Implement xen/alternative-call.h for use in common code

Looking at them in detail:

- "xen: Implement xen/alternative-call.h for use in common code"
It shouldn't affect ARM at all

- "xsm: remove the ability to disable flask"
It would only affect the test case if libvirt, directly or via libxl,
tried to disable flask

- "xsm: remove remnants of xsm_memtype hook"
Shouldn't have any effects

- "xsm: drop dubious xsm_op_t type"
It doesn't look like it should have any runtime effect, only build time

So among these four, only "xsm: remove the ability to disable flask"
seems to have the potential to break a libvirt guest start test. Even
then, it is far-fetched, and the lack of an explicit XSM-related error
message in the logs points in the direction of an OOM.
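
On the OOM theory, the DMA32 free-list line quoted earlier can be
checked mechanically. Below is a minimal Python sketch, not part of
osstest; the helper names are made up for illustration. It assumes 4kB
pages and assumes the letters in parentheses are the kernel's migrate
types (U unmovable, M movable, E reclaimable, C CMA). It re-adds the
per-order counts and counts the free blocks large enough for an order-4
(64kB) allocation.

```python
import re

# DMA32 line from the dom0 log quoted above, with the mail wrapping undone.
LINE = ("DMA32: 2788*4kB (UMEC) 890*8kB (UMEC) 497*16kB (UMEC) 36*32kB (UMC) "
        "1*64kB (C) 1*128kB (C) 9*256kB (C) 7*512kB (C) "
        "0*1024kB 0*2048kB 0*4096kB = 33456kB")

PAGE_KB = 4  # assumption: 4kB pages, as the smallest entry suggests


def parse_free_list(line):
    """Return a list of (count, block_kb, migrate_types), one per order."""
    pattern = r"(\d+)\*(\d+)kB(?: \((\w+)\))?"
    return [(int(count), int(size), types)
            for count, size, types in re.findall(pattern, line)]


def total_kb(entries):
    """Sum of count * block size over all orders."""
    return sum(count * size for count, size, _ in entries)


def blocks_at_or_above(entries, order):
    """Non-empty entries whose block size can satisfy the given order."""
    min_kb = PAGE_KB << order
    return [e for e in entries if e[1] >= min_kb and e[0] > 0]


entries = parse_free_list(LINE)
big = blocks_at_or_above(entries, 4)

print("total free kB:", total_kb(entries))                   # 33456
print("free blocks of order >= 4:", sum(e[0] for e in big))  # 18
print("of which not CMA-only:",
      sum(e[0] for e in big if e[2] != "C"))                 # 0
```

The total matches the 33456kB the kernel printed. All 18 free blocks of
order 4 or above are tagged (C) only. If the migrate-type reading is
right, those are CMA blocks, which as far as I know can only back
movable allocations, so an unmovable order-4 kernel allocation could
still fail even though the zone is not empty.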



