
Re: [Xen-devel] Xen 4.3 + tmem = Xen BUG at domain_page.c:143




On 6/11/2013 2:52 PM, Konrad Wilk wrote:

The BUG_ON() here is definitely valid - a few lines down, after the
enclosing if(), we use it in ways that require this to not have
triggered. It basically tells you whether an in-range idx was found,
which apparently isn't the case here.

As I think George already pointed out - printing accum here would
be quite useful: It should have at least one of the low 32 bits set,
given that dcache->entries must be at most 32 according to the
data you already got logged.
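
For reference, here is a minimal standalone sketch of what that slot
scan boils down to (my paraphrase, not the actual domain_page.c code):
garbage bits get folded back into inuse, accum collects the complement
of each inuse word, and an in-range idx can only be found if one of the
low bits is clear.

#include <stdio.h>

#define MAPCACHE_ENTRIES 32   /* dcache->entries, per the log below */
#define BITS_PER_LONG    64

/*
 * Illustrative stand-in for the slot search in map_domain_page():
 * flush garbage back into inuse, then only look for a free slot if
 * accum suggests one exists.  Returns 'entries' when the cache is
 * exhausted, which is what BUG_ON(idx >= dcache->entries) catches.
 */
static unsigned int find_free_idx(unsigned long *inuse,
                                  unsigned long *garbage,
                                  unsigned int entries)
{
    unsigned long accum = 0;
    unsigned int i, idx = entries;

    for ( i = 0; i < (entries + BITS_PER_LONG - 1) / BITS_PER_LONG; i++ )
    {
        inuse[i] &= ~garbage[i];   /* reclaim slots queued for unmap */
        garbage[i] = 0;
        accum |= ~inuse[i];        /* set bit => possibly free slot */
    }

    if ( accum )
        for ( idx = 0; idx < entries; idx++ )
            if ( !(inuse[idx / BITS_PER_LONG] &
                   (1UL << (idx % BITS_PER_LONG))) )
                break;

    return idx;
}

int main(void)
{
    /* The failing case from the log below: inuse = ffffffff, so
     * accum = 0xffffffff00000000 -- none of the low 32 bits set,
     * the bounded search finds nothing, and idx comes back as 32. */
    unsigned long inuse[1]   = { 0xffffffffUL };
    unsigned long garbage[1] = { 0 };

    printf("idx = %u (BUG if >= %u)\n",
           find_free_idx(inuse, garbage, MAPCACHE_ENTRIES),
           MAPCACHE_ENTRIES);
    return 0;
}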

With extra debugging (see attached patch)

(XEN) domain_page.c:125:d1 mfn: 1eb483, [0]: bffff1ff, ~ffffffff40000e00, idx: 9 garbage: 40000e00, inuse: ffffffff
(XEN) domain_page.c:125:d1 mfn: 1eb480, [0]: fdbfffff, ~ffffffff02400000, idx: 22 garbage: 2400000, inuse: ffffffff
(XEN) domain_page.c:125:d1 mfn: 2067ca, [0]: fffff7ff, ~ffffffff00000800, idx: 11 garbage: 800, inuse: ffffffff
(XEN) domain_page.c:125:d1 mfn: 183642, [0]: ffffffff, ~ffffffff00000000, idx: 32 garbage: 0, inuse: ffffffff
(XEN) domain_page.c:170:d1 mfn (183642) -> 2 idx: 32(i:1,j:0), branch:9 0xffffffff00000000
(XEN) domain_page.c:176:d1 [0] idx=13, mfn=0x203b00, refcnt: 0
(XEN) domain_page.c:176:d1 [1] idx=25, mfn=0x1839e1, refcnt: 0
(XEN) domain_page.c:176:d1 [2] idx=3, mfn=0x1824d2, refcnt: 0
(XEN) domain_page.c:176:d1 [3] idx=5, mfn=0x1eb48b, refcnt: 0
(XEN) domain_page.c:176:d1 [4] idx=28, mfn=0x203b04, refcnt: 0
(XEN) domain_page.c:176:d1 [5] idx=0, mfn=0x1eb485, refcnt: 0
(XEN) domain_page.c:176:d1 [6] idx=30, mfn=0x203afe, refcnt: 0
(XEN) domain_page.c:176:d1 [7] idx=20, mfn=0x203aff, refcnt: 0

And that does paint the picture: we have exhausted the full 32 entries of the mapcache.

Now off to find out who is holding them and why. Aren't these operations (map/unmap domain_page) supposed to be short-lived?
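
(For reference, the intended pattern is a narrow map/use/unmap window,
along the lines of this sketch - the helper is hypothetical, though
map_domain_page()/unmap_domain_page() are the real API:)

#include <xen/mm.h>
#include <xen/string.h>
#include <xen/domain_page.h>

/* Hypothetical helper showing the expected short-lived usage: the
 * mapcache slot is held only while the page contents are touched. */
static void copy_frame(unsigned long mfn, void *dst)
{
    void *va = map_domain_page(mfn);  /* claims one of the 32 slots */

    memcpy(dst, va, PAGE_SIZE);
    unmap_domain_page(va);            /* slot can be reclaimed again */
}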

And found the culprit. With some EIP logging:

(XEN) domain_page.c:214:d1 [0] mfn=0x1ff67a idx=0, mfn=0x1ff67a, refcnt: 0 [EIP=0]
(XEN) domain_page.c:216:d1 [1] mfn=18fef2, [EIP=0]
(XEN) domain_page.c:216:d1 [2] mfn=1eb518, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [3] mfn=170a08, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [4] mfn=18feef, [EIP=0]
(XEN) domain_page.c:216:d1 [5] mfn=1eb4c8, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [6] mfn=202699, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [7] mfn=18fef0, [EIP=0]
(XEN) domain_page.c:216:d1 [8] mfn=0, [EIP=0]
(XEN) domain_page.c:214:d1 [9] mfn=0x18e7ed idx=9, mfn=0x18e7ed, refcnt: 0 [EIP=0]
(XEN) domain_page.c:214:d1 [10] mfn=0x18f629 idx=10, mfn=0x18f629, refcnt: 0 [EIP=0]
(XEN) domain_page.c:216:d1 [11] mfn=1eb47e, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:214:d1 [12] mfn=0x18209c idx=12, mfn=0x18209c, refcnt: 0 [EIP=0]
(XEN) domain_page.c:216:d1 [13] mfn=18fef5, [EIP=0]
(XEN) domain_page.c:214:d1 [14] mfn=0x18f62b idx=14, mfn=0x18f62b, refcnt: 0 [EIP=0]
(XEN) domain_page.c:216:d1 [15] mfn=1eb459, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [16] mfn=1eb512, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [17] mfn=170d2b, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [18] mfn=20272b, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [19] mfn=16c22c, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [20] mfn=18fef4, [EIP=0]
(XEN) domain_page.c:216:d1 [21] mfn=18e7e9, [EIP=0]
(XEN) domain_page.c:216:d1 [22] mfn=18feee, [EIP=0]
(XEN) domain_page.c:216:d1 [23] mfn=1eb4a3, [tmh_persistent_pool_page_get+0x26d/0x2d8]
(XEN) domain_page.c:216:d1 [24] mfn=18fef3, [EIP=0]
(XEN) domain_page.c:214:d1 [25] mfn=0x18f62f idx=25, mfn=0x18f62f, refcnt: 0 [EIP=0]
(XEN) domain_page.c:216:d1 [26] mfn=18ff02, [__get_page_type+0x1001/0x146a]
(XEN) domain_page.c:214:d1 [27] mfn=0x18fefe idx=27, mfn=0x18fefe, refcnt: 0 [EIP=0]
(XEN) domain_page.c:216:d1 [28] mfn=18ff00, [__get_page_type+0xcc3/0x146a]
(XEN) domain_page.c:216:d1 [29] mfn=0, [EIP=0]
(XEN) domain_page.c:214:d1 [30] mfn=0x18f628 idx=30, mfn=0x18f628, refcnt: 0 [EIP=0]
(XEN) domain_page.c:216:d1 [31] mfn=1eb4ed, [tmh_persistent_pool_page_get+0x26d/0x2d8]
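
(The EIP logging boils down to stashing the caller's return address per
slot at map time and symbolizing it in the dump - a rough sketch, with
the array name being my own rather than whatever the attached patch
uses:)

#define MAPCACHE_ENTRIES 32

static unsigned long mapcache_eip[MAPCACHE_ENTRIES];

/* In map_domain_page(), once idx has been picked: */
mapcache_eip[idx] = (unsigned long)__builtin_return_address(0);

/* In the dump path, %pS asks Xen's printk to symbolize the address,
 * which gives the [tmh_persistent_pool_page_get+0x26d/0x2d8] form: */
printk("[%u] mfn=%lx, [%pS]\n", idx, mfn, (void *)mapcache_eip[idx]);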

A brief look at the code shows that any call into the xmalloc pool code ends up calling map_domain_page. Since most of the tmem code uses the pool to store guest pages (looking briefly at tmem_malloc), this would explain why we ran out of the 32 slots, especially as we don't free them until the guest puts the persistent pages back.
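
Roughly, the leak pattern looks like this (paraphrased, not the literal
tmem source - the real callback in the trace above is
tmh_persistent_pool_page_get):

#include <xen/mm.h>           /* alloc_domheap_page() */
#include <xen/sched.h>        /* current */
#include <xen/domain_page.h>  /* __map_domain_page() */

/* The pool's page-get callback hands back a mapped VA, and the pool
 * keeps that VA for the life of the persistent page, i.e. until the
 * guest takes the page back, so the slot is never given up. */
static void *persistent_pool_page_get(unsigned long size)
{
    struct page_info *pi = alloc_domheap_page(current->domain, 0);

    if ( pi == NULL )
        return NULL;
    return __map_domain_page(pi);  /* holds one of the 32 slots */
}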

The fix... well, it's not here yet, but I think it will mostly concentrate on the tmem code.

Thanks for the suggestion to look at the accum value.

Attachment: xen-domain_page-v4.patch
Description: Text Data

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

