[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [qemu-upstream-4.11-testing test] 136184: regressions - FAIL



On Tue, 4 Jun 2019, Julien Grall wrote:
> On 6/4/19 6:39 PM, Stefano Stabellini wrote:
> > On Tue, 4 Jun 2019, Julien Grall wrote:
> > > Hi Stefano,
> > > 
> > > On 6/4/19 6:09 PM, Stefano Stabellini wrote:
> > > > On Tue, 4 Jun 2019, Julien Grall wrote:
> > > > > Hi Jan,
> > > > > 
> > > > > On 6/4/19 8:06 AM, Jan Beulich wrote:
> > > > > > > > > On 03.06.19 at 19:15, <anthony.perard@xxxxxxxxxx> wrote:
> > > > > > > On Tue, May 21, 2019 at 05:52:12PM +0100, Julien Grall wrote:
> > > > > > > > The same error cannot be reproduced on laxton*. Looking at the
> > > > > > > > test
> > > > > > > > history,
> > > > > > > > it looks like qemu-upstream-4.12-testing flight has run
> > > > > > > > successfully
> > > > > > > > a
> > > > > > > > few
> > > > > > > > times on rochester*. So we may have fixed the error in Xen 4.12.
> > > > > > > > 
> > > > > > > > Potential candidates would be:
> > > > > > > >       - 00c96d7742 "xen/arm: mm: Set-up page permission for Xen
> > > > > > > > mappings
> > > > > > > > earlier on"
> > > > > > > >       - f60658c6ae "xen/arm: Stop relocating Xen"
> > > > > > > > 
> > > > > > > > Ian, is it something the bisector could automatically look at?
> > > > > > > > If not, I will need to find some time and borrow the board to
> > > > > > > > bisect
> > > > > > > > the
> > > > > > > > issues.
> > > > > > > 
> > > > > > > I attempted to do that bisection myself, and the first commit that
> > > > > > > git
> > > > > > > wanted to try, a common commit to both branches, boots just fine.
> > > > > > 
> > > > > > Thanks for doing this!
> > > > > > 
> > > > > > One thing that, for now, completely escapes me: How come the
> > > > > > main 4.11 branch has progressed fine, but the qemuu one has
> > > > > > got stalled like this?
> > > > > 
> > > > > Because Xen on Arm today does not fully respect the Arm Arm when
> > > > > modifying
> > > > > the
> > > > > page-tables. This may result to TLB conflict and break of coherency.
> > > > 
> > > > Yes, I follow your reasoning, but it is still quite strange that it only
> > > > happens with the qemu testing branch. Maybe it is because laxton was
> > > > picked instead of rochester to run the tests for this branch? Otherwise,
> > > > there must be a difference in the Xen configuration between the normal
> > > > branch and the qemu testing branch, aside from QEMU of course, that
> > > > shouldn't make any differences.
> > > 
> > > Per the discussion before, the .config is different between the 2 flights.
> > > QEMU testing is not selecting CONFIG_LIVEPATCH while staging-4.11 is.
> > 
> > Has anybody tried to start selecting CONFIG_LIVEPATCH in the QEMU testing
> > branch? Is it possible to give it a try?
> 
> I don't know and I am not sure how this would help here it is pretty clear
> that backporting 00c96d7742 "xen/arm: mm: Set-up page permission for Xen
> mappings earlier on" is actually going to help booting.
> 
> So it is very unlikely that CONFIG_LIVEPATCH is the problem.

I am not blaming CONFIG_LIVEPATCH at all. If we decide that we don't
want to backport 00c96d7742 for one reason or the other, and basically
we cannot fix this bug, enabling CONFIG_LIVEPATCH would probably unblock
the CI-loop (it would be nice to be sure about it).  Let's keep in mind
that we always had this bug -- the next 4.11 release is not going to be
any more broken than the previous 4.11 release if we don't fix this
issue, unless you think we backported something that affected the
underlying problem, making it worse.

Note that I am not advocating for leaving this bug unfixed. I am only
suggesting that if we decide it is too risky to backport 00c96d7742 and
we don't know what else to do, it would be good to have a way to unblock
4.11 without having to force-push it. Let's settle the discussion below
first.

 
> > > > > > > It turns out that the first commit that fails to boot on rochester
> > > > > > > is
> > > > > > >      e202feb713 xen/cmdline: Fix buggy strncmp(s, LITERAL, ss - s)
> > > > > > > construct
> > > > > > > (even with the "eb8acba82a xen: Fix backport of .." applied)
> > > > > > 
> > > > > > Now that's particularly odd a regression candidate. It doesn't
> > > > > > touch any Arm code at all (nor does the fixup commit). And the
> > > > > > common code changes don't look "risky" either; the one thing that
> > > > > > jumps out as the most likely of all the unlikely candidates would
> > > > > > seem to be the xen/common/efi/boot.c change, but if there was
> > > > > > a problem there then the EFI boot on Arm would be latently
> > > > > > broken in other ways as well. Plus, of course, you say that the
> > > > > > same change is no problem on 4.12.
> > > > > > 
> > > > > > Of course the commit itself could be further "bisected" - all
> > > > > > changes other than the introduction of cmdline_strcmp() are
> > > > > > completely independent of one another.
> > > > > 
> > > > > I think this is just a red-herring. The commit is probably modifying
> > > > > enough
> > > > > the layout of Xen that TLB conflict will appear.
> > > > > 
> > > > > Anthony said backporting 00c96d7742 "xen/arm: mm: Set-up page
> > > > > permission
> > > > > for
> > > > > Xen mappings earlier on" makes staging-4.11 boots. This patch removes
> > > > > some
> > > > > of
> > > > > the potential cause of TLB conflict.
> > > > > 
> > > > > I haven't suggested a backport of this patch so far, because there are
> > > > > still
> > > > > TLB conflict possible within the function modified. It might also be
> > > > > possible
> > > > > that it exposes more of TLB conflict as more work in Xen is needed
> > > > > (see my
> > > > > MM-PARTn series).
> > > > > 
> > > > > I don't know whether backporting this patch is worth it compare to the
> > > > > risk it
> > > > > introduces.
> > > > 
> > > > I think we should backport 00c96d7742. We don't need to fix all issues,
> > > > we only need to make improvements without introducing more bugs.
> > > >  From that standpoints, I think 00c96d7742 is doable. I'll backport it
> > > > now to
> > > > 4.11.
> > > 
> > > You don't seem to assess/acknowledge any risk I mention in this thread.
> > > 
> > > Note that I am not suggesting to not backport it. I am trying to
> > > understand
> > > how you came to your conclusion here.
> > 
> > Based on the fact that by code inspection the patch should be risk
> > decremental in terms of Arm Arm violations, which is consistent with the
> > fact that Anthony found it "fixing" the regression. Do you foresee cases
> > where the patch increments the risk of failure?
> 
> Well yes and no. I guess you haven't read what I wrote on the separate thread.

I missed it


> Yes, two potential source of TLB conflict is removed by avoiding replacing 4KB
> entries with 2MB block entry (and vice versa) without respecting the
> Break-Before-Make.

This is clear


> No, this patch introducing another source of TLB conflict if the processor is
> caching intermediate translation (this is implementation defined).

By "another source of TLB conflict" are you referring to something new
that wasn't there before? Or are you referring to the fact that still we
are not following the proper sequence to update the Xen pagetable? If
you are referring to the latter, wouldn't it be reasonable to say that
such a problem could have happened also before 00c96d7742?


> The bug reported by osstest actually taught me that even if Xen may boot today
> on a given platform, this may not be the case tomorrow because of the slight
> change in the code ordering (and therefore memory access).
> 
> /!\ Below is my interpretation and does not imply I am correct ;)
> 
> However, such Arm Arm violations are mostly gathered around boot and shouldn't
> affect runtime. IOW, Xen would stop booting on those platforms rather than
> making unrealiable. So it would not be too bad.
> 
> /!\ End
> 
> We just have to be aware of the risk we are taking with backporting the patch.

What you wrote here seems to make sense but I would like to understand
the problem mentioned earlier a bit better


> > > > What about the other older stanging branches?
> > > 
> > > The only one we could consider is 4.10, but AFAICT Jan already did cut the
> > > last release for it.
> > > 
> > > So I wouldn't consider any backport unless we begin to see the branch
> > > failing.
> > 
> > If Jan already made the last release for 4.10, then little point in
> > backporting it to it. However, it is not ideal to have something like
> > 00c96d7742 in some still-maintained staging branches but not all.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.