
Re: xen | Failed pipeline for staging | 6a47ba2f



On Sat, 29 Apr 2023, andrew.cooper3@xxxxxxxxxx wrote:
> On 29/04/2023 4:05 am, Stefano Stabellini wrote:
> > On Fri, 28 Apr 2023, GitLab wrote:
> >> Pipeline #852233694 triggered by
> >> [568538936b4ac45a343cb3a4ab0c6cda?s=48&d=identicon]
> >> Ganis
> >> had 3 failed jobs
> >> Failed jobs
> >> ✖
> >> test
> >> qemu-smoke-dom0less-arm64-gcc
> > This is a real failure on staging. Unfortunately it is intermittent. It
> > usually happens once every 3-8 tests for me.
> >
> > The test script is:
> > automation/scripts/qemu-smoke-dom0less-arm64.sh
> >
> > and for this test it is invoked without arguments. It is starting 2
> > dom0less VMs in parallel, then dom0 does an xl network-attach and the
> > domU is supposed to set up eth0 and ping.
> >
> > The failure is that nothing happens after "xl network-attach". The domU
> > never hotplugs any interfaces. I have logs that show that eth0 never
> > shows up and the only interface is lo no matter how long we wait.
> >
> >
> > On a hunch, I removed Alejandro's patches. Without them, I ran 20 tests
> > without any failures. I have not investigated further but it looks like
> > one of these 4 commits is the problem:
> >
> > 2023-04-28 11:41 Alejandro Vallejo    tools: Make init-xenstore-domain use xc_domain_getinfolist()
> > 2023-04-28 11:41 Alejandro Vallejo    tools: Refactor console/io.c to avoid using xc_domain_getinfo()
> > 2023-04-28 11:41 Alejandro Vallejo    tools: Create xc_domain_getinfo_single()
> > 2023-04-28 11:41 Alejandro Vallejo    tools: Make some callers of xc_domain_getinfo() use xc_domain_getinfol
> 
> In commit order (reverse of above), these patches are:
> 
> 1) Modify the python bindings and xenbaked
> 2) Introduce a new library function with a better API/ABI
> 3) Modify xenconsoled
> 4) Modify init-xenstore-domain
> 
> The test isn't using anything from 4 or 1, and 2 definitely isn't
> breaking anything on its own.
> 
> That just leaves 3.  This test does activate xenconsoled by virtue
> of invoking xencommons, but that doesn't help explain why a change in
> xenconsoled interferes (and only intermittently on this one single test)
> with `xl network-attach`.
> 
> The xenconsoled change does have a correctness fix in it, requiring
> xenconsoled to ask for all domains info in one go.  This does mean it's
> hypercall-buffering (i.e. bouncing) a 4M array now where previously it
> was racy figuring out which VMs had come and gone.

Your guess was correct. I have done more bisecting today. The culprit is
the following commit (I reverted only this commit and ran 25 tests
successfully; it usually fails in fewer than 5):

e522c98c3    tools: Refactor console/io.c to avoid using xc_domain_getinfo()

I don't know why. Traditionally, if this were OSSTest, we would revert the
commit until we understand what is going on, to unblock master/staging. I
suggest we do the same here to be consistent, and also because otherwise
future failures in this test due to other bugs might be masked by this
unsolved issue.

I have nothing against this commit and I'd be happy for it to go in
again as soon as things are, if not necessarily resolved, at least
better understood.
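
As a rough sanity check on how much weight the clean runs carry (20 without
the series, 25 with only this commit reverted), here is a small sketch. It
assumes each run fails independently with a fixed probability somewhere
between "once every 3" and "once every 8" tests, which is the range quoted
earlier in the thread. The independence assumption and the function name
prob_all_pass are mine, not something measured on the CI runners.

```python
def prob_all_pass(failure_rate: float, runs: int) -> float:
    """Chance that `runs` independent runs all pass by luck.

    Assumption (not from the thread): every run fails independently
    with the same probability `failure_rate`.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return (1.0 - failure_rate) ** runs


if __name__ == "__main__":
    # Run counts and the "once every 3-8 tests" range come from the thread.
    for runs in (20, 25):
        for one_in in (3, 8):
            p = prob_all_pass(1.0 / one_in, runs)
            print(f"{runs} runs, failing once every {one_in}: "
                  f"P(all pass by chance) = {p:.4f}")
```

Under those assumptions, even at the low end of the quoted failure rate
(once every 8 tests) the chance of 25 clean runs happening by luck is only
a few percent, so the bisection result looks solid, though not conclusive.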

 

