[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Xen 4.3 + tmem = Xen BUG at domain_page.c:143

On 12/06/13 13:12, Jan Beulich wrote:
On 12.06.13 at 13:00, George Dunlap <George.Dunlap@xxxxxxxxxxxxx> wrote:
create ^
title it map_domain_page second-stage emergency fallback path never taken

On Tue, Jun 11, 2013 at 7:52 PM, konrad wilk <konrad.wilk@xxxxxxxxxx> wrote:
The BUG_ON() here is definitely valid - a few lines down, after the
enclosing if(), we use it in ways that requires this to not have
triggered. It basically tells you whether an in range idx was found,
which apparently isn't the case here.

As I think George already pointed out - printing accum here would
be quite useful: It should have at least one of the low 32 bits set,
given that dcache->entries must be at most 32 according to the
data you already got logged.

With extra debugging (see attached patch)

(XEN) domain_page.c:125:d1 mfn: 1eb483, [0]: bffff1ff, ~ffffffff40000e00,
idx: 9 garbage: 40000e00, inuse: ffffffff
(XEN) domain_page.c:125:d1 mfn: 1eb480, [0]: fdbfffff, ~ffffffff02400000,
idx: 22 garbage: 2400000, inuse: ffffffff
(XEN) domain_page.c:125:d1 mfn: 2067ca, [0]: fffff7ff, ~ffffffff00000800,
idx: 11 garbage: 800, inuse: ffffffff
(XEN) domain_page.c:125:d1 mfn: 183642, [0]: ffffffff, ~ffffffff00000000,
idx: 32 garbage: 0, inuse: ffffffff
So regardless of the fact that tmem is obviously holding what are
supposed to be short-term references for so long, there is something
that seems not quite right about this failure path.

It looks like the algorithm is:
1. Clean the garbage map and update the inuse list
2. If anything has been cleaned up, use the first not-inuse entry
3. Otherwise, do something else ("replace a hash entry" -- not sure
exactly what that means).

What we see above is that this failure path succeeds three times, but
fails the fourth time: there are, in fact, no zero entries after the
garbage clean-up; however, because "inuse" is 32-bit (effectively) and
"accum" is 64-bit, ~inuse always has bits 32-63 set, and so will
always return true and never fall back to the "something else"
Right, that's what occurred to me too yesterday, but the again
I knew I had seen this code path executed. Now that I look again,
I think I understand why: All of my Dom0-s and typical DomU-s
have a vCPU count divisible by 4, and with MAPCACHE_VCPU_ENTRIES
being 16, the full unsigned long would always be used.

This is probably not something we need to fix for 4.3, but we should
put it on our to-do list.
Actually I think we should fix this right away.

How often is the second path taken in practice?

And, you said this doesn't happen with debug=n builds -- why not exactly?

I'm trying to assess the actual risk of not fixing it, vs the risk of fixing it.


Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.